8. Recognition tasks in speech processing

Typical tasks in speech processing where machine learning is often applied include:

  • Speech recognition, which refers to converting an acoustic waveform of spoken speech to the corresponding text (speech-to-text). See also Speech and Language Processing by Dan Jurafsky and James H. Martin (accessed 15.6.2022) [Jurafsky and Martin, 2021].

  • Speaker recognition and speaker verification, which refer to, respectively, identifying the speaker (who is speaking?) and verifying whether the speaker is who they claim to be (is it really you?).

  • Speech synthesis, which entails the creation of a natural-sounding speech signal from text input (text-to-speech).

  • Speech enhancement, which refers to improving a recorded speech signal, for example with the objective of removing background noise (noise attenuation) or the effect of room acoustics.

  • Wake-word and keyword detection, which refer to the task of finding single characterizing words in continuous speech. The idea is that a lightweight algorithm can extract useful information without a computationally complex speech recognizer. Specifically, wake-word detection refers to waiting for an activation command: the device sleeps until the wake-word is heard. Keyword detection can refer to a similar task or, for example, to recognizing the topic of a conversation.

  • Voice activity detection (VAD), which refers to the task of determining whether a signal contains speech or not (is someone speaking?). Many of the above tasks are resource-intensive, so we would like to, for example, run speech recognition only when speech is present. We can therefore first use a simple VAD to determine whether a signal contains speech, and start the speech recognizer only when it does.

  • Speaker diarisation, which is the process of segmenting a multi-speaker conversation into contiguous single-speaker segments (who spoke when?).

  • Paralinguistic analysis tasks, which refer generally to the extraction of information from speech signals that is related neither to the linguistic content nor to speaker identity, such as speaker emotion, health, attitude, or sleepiness.
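
The gating idea behind VAD described above can be sketched with a minimal energy-based detector. This is only an illustration of the principle, not a production method: the frame length, the fixed threshold, and the synthetic test signal are all assumptions made for the example, and real systems use more robust features and trained models.

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold=0.01):
    """Label each frame as speech (True) or non-speech (False)
    by comparing its mean energy to a fixed threshold.
    frame_len=400 corresponds to 25 ms at a 16 kHz sampling rate."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.mean(frames ** 2, axis=1)
    return energies > threshold

# Synthetic example: low-level noise ("silence") followed by a
# louder sinusoidal burst standing in for speech.
rng = np.random.default_rng(0)
silence = 0.001 * rng.standard_normal(4000)
speech = 0.5 * np.sin(2 * np.pi * 200 * np.arange(4000) / 16000)
decisions = energy_vad(np.concatenate([silence, speech]))
```

A downstream recognizer would then be invoked only on the frames (or regions) where `decisions` is True, which is the resource-saving arrangement the text describes.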