3.2. Short-time processing of speech signals

In the short-time analysis section we already discussed speech signals and analysis methods, and found that we need to split the signal into shorter segments and apply windowing functions. Here we discuss what extra steps we need to consider when we want to process signals, that is, when we want to modify the signal.

To process a windowed signal, we thus need the following steps:

1. windowing
2. time-frequency transform such as the DFT (optional)
3. apply the desired processing
4. inverse time-frequency transform (when the DFT was applied)
5. reverse of windowing (?).

Here the analysis steps 1-2 were already discussed, and the desired processing in step 3 is whatever modification you want to apply to the signal. The question is thus about the inverse operations in steps 4 and 5. Time-frequency transforms such as the discrete Fourier transform (DFT) and the discrete cosine transform (DCT) are orthonormal (with suitable scaling) and have well-known fast algorithms for their inverses. The main challenge is thus the “reverse of windowing”, whatever that might be.

The direct approach of just multiplying with the inverse of the windowing function has a problem.

../_images/2923f212fe0eb1b988853f7f0c33c78ff2700f2124be6ccd3ac8c0a5c3dae252.png

Namely, near the ends of the window, the windowing function goes smoothly to zero and thus the inverse of the windowing function approaches infinity. That clearly leads to numerical issues; very small changes in the windowed signal can lead to arbitrarily large samples after inverse windowing.

We thus need a method which does not rely on inverting the windowing function.
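The numerical problem can be illustrated with a few lines of code. The following is a minimal sketch, assuming a periodic Hann window of length 32 (both the window type and the length are arbitrary choices for illustration, not taken from the figure above). It computes the gain that inverse windowing would apply to any change made to the windowed signal.

```python
import math

N = 32  # window length (an arbitrary choice for illustration)

# Periodic Hann window: exactly zero at n = 0 and very small near both ends.
w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

# Gain that "inverse windowing" applies to any change in the windowed signal.
# The sample n = 0 is skipped because w[0] == 0 cannot be inverted at all.
gain = {n: 1.0 / w[n] for n in range(1, N)}

for n in (1, 2, N // 2):
    print(f"n={n:2d}  w={w[n]:.4f}  1/w={gain[n]:.1f}")
```

In this sketch the gain at the centre of the window is one, whereas one sample away from the border it is already above one hundred, so a tiny modification there would be amplified by two orders of magnitude.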

3.2.1. Overlap-add

A majority of modern speech processing is based on the overlap-add method, where the input signal is windowed in overlapping segments, such that when the overlapping parts are added together, we can perfectly reconstruct the original signal. It is like a cross-fade between subsequent windows.

The most common scenario where overlap-add cannot be used is applications that require a very low algorithmic delay. For example, stage microphones at concerts and theaters require a low delay, such that interaction between people on the stage is not disrupted. In almost all other cases, overlap-add is the best option.

../_images/69c7bbb507989f0455ecfc2ec15f58fbe1ceb10f3b64fd8243face802637d133.png

In the above example, we see two overlapping windows extracted from the original signal. When we add the two together, we obtain the “Overlap-add” signal. Subtracting it from the original gives zero over the whole region of the overlap. At the left and right ends, where we do not have an overlapping window, we do not get perfect reconstruction.

3.2.1.1. Windowing and reconstruction

Overlapping segments are added together (fade-in/fade-out). Windowing should be chosen such that reconstruction is exactly equal to the original.

Let the original signal be \(x_n\).
The left and right parts of the overlapping windows are then \(w_{L,n}\) and \(w_{R,n}\), and the windowed signals are \(w_{L,n}x_n\) and \(w_{R,n}x_n\).
The reconstruction is \(\hat x_n = w_{L,n}x_n + w_{R,n}x_n = (w_{L,n} + w_{R,n})x_n\).
With \(w_{L,n} + w_{R,n} = 1\) we obtain perfect reconstruction, \(x_n = \hat x_n\).

The window in the above example already adheres to this requirement.
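The requirement \(w_{L,n} + w_{R,n} = 1\) can be checked numerically. The sketch below assumes a periodic Hann window with 50% overlap, which is one common choice that satisfies the requirement. The window length and the test signal are arbitrary and only serve as illustration.

```python
import math

N = 16         # window length (illustrative)
hop = N // 2   # 50% overlap

# Periodic Hann window.
w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]

# In the overlap, the tail of the left window meets the head of the right one.
w_left = w[hop:]    # fade-out part
w_right = w[:hop]   # fade-in part
overlap_sum = [a + b for a, b in zip(w_left, w_right)]

# Reconstruct an arbitrary test signal in the overlap region.
x = [math.sin(0.3 * n) for n in range(hop)]
x_hat = [wl * xn + wr * xn for wl, wr, xn in zip(w_left, w_right, x)]
max_error = max(abs(a - b) for a, b in zip(x, x_hat))

print("largest deviation of w_L + w_R from 1:", max(abs(s - 1.0) for s in overlap_sum))
print("largest reconstruction error:", max_error)
```

Both printed values should be zero up to floating-point rounding.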

../_images/81ecbb47738ae54912d48a06494471572484da069a6bc29cafb048b52ac33cbc.png

3.2.1.2. Windowing and processing

The whole point of windowing for processing was that we could modify the windowed signal and then synthesise the modified signal. Suppose \(x':=wx\) is the windowed signal and the modified signal is \(\hat x'\), such that the modification part is \(e=\hat x'-x'\). In overlap-add, we would then take multiple modified windows \(\hat x_k'\) and add them together. We have already seen that the part which corresponds to the original signal \(x'\) will perfectly reconstruct to the original signal. The question, however, is what happens to the modification part.

With a direct implementation, we would just add the modifications together. There is then no guarantee that consecutive modifications play nicely together and we can, for example, have discontinuities between windows.

../_images/ab88962b5e137a5b55edc01cc27b1c040f78ec4086e167ffdccc6e57bd225fb9.png

We therefore also need to multiply the modified windows by a windowing function.

../_images/388311e41b2c09d2981b556aebbfae8e3753d20a85195f53210c59ac89ac3488.png
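The effect of the output window on the modification part can be sketched in code. The example below is purely hypothetical: each frame is "modified" by adding a different constant offset, and we compare the largest sample-to-sample jump of the overlap-added modification with and without an output window. The half-sine window and the frame sizes are assumptions made for illustration.

```python
import math

N, hop = 16, 8
w = [math.sin(math.pi * n / N) for n in range(N)]  # goes to zero at n = 0

# Hypothetical modification: each frame adds a different constant offset.
offsets = [1.0, -1.0, 0.5]
length = hop * (len(offsets) + 1)

def overlap_add_modification(use_output_window):
    """Overlap-add only the modification part e_k of each frame."""
    out = [0.0] * length
    for k, c in enumerate(offsets):
        for n in range(N):
            gain = w[n] if use_output_window else 1.0
            out[k * hop + n] += gain * c
    return out

def largest_jump(signal):
    """Largest absolute difference between consecutive samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

jump_plain = largest_jump(overlap_add_modification(False))
jump_windowed = largest_jump(overlap_add_modification(True))
print("without output window:", jump_plain)
print("with output window:   ", jump_windowed)
```

Without the output window, the modification jumps abruptly wherever a new frame starts. With the output window, each frame's contribution fades in from zero, so consecutive samples differ only by small amounts.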

3.2.1.2.1. Algorithm “Overlap-add”

Let \(w_{in,n}\) and \(w_{out,n}\) be the input and output windowing functions.

The input signal \(x_n\) is windowed at the input to obtain \(w_{in,n}x_n\).
The windowed signal is modified with \(e_n\) to obtain \(w_{in,n}x_n+e_n\).
We apply an output window \(w_{out,n}\) to the modified signal to obtain \(w_{out,n}(w_{in,n}x_n+e_n)\).
Add subsequent, overlapping windows together to obtain the output signal.

If the output windows go to zero at the borders, the output signal will be continuous. With no modification, \(e_n=0\), each output window is \(w_{out,n}w_{in,n}x_n\). In other words, if the left and right parts add up to one,

\[ w_{L,out,n}w_{L,in,n} + w_{R,out,n}w_{R,in,n} = 1 \]

then we have perfect reconstruction.

If the modification is white noise with uniform energy, uncorrelated between windows, then the modification part in the overlap is \(w_{L,out,n}e_{L,n} + w_{R,out,n}e_{R,n}\). The energy expectation of the modification is then

\[ E\left[\left(w_{L,out,n}e_{L,n} + w_{R,out,n}e_{R,n}\right)^2\right] = \left(w^2_{L,out,n} + w^2_{R,out,n}\right) E[e_n^2]. \]
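The step from the left-hand side to the right-hand side can be spelled out by expanding the square. This sketch assumes, beyond what the text states explicitly, that the noise is zero-mean, that \(e_{L,n}\) and \(e_{R,n}\) are uncorrelated, and that both have the same energy \(E[e_n^2]\):

\[ \begin{aligned} E\left[\left(w_{L,out,n}e_{L,n} + w_{R,out,n}e_{R,n}\right)^2\right] &= w^2_{L,out,n}E[e_{L,n}^2] + 2w_{L,out,n}w_{R,out,n}E[e_{L,n}e_{R,n}] + w^2_{R,out,n}E[e_{R,n}^2] \\ &= \left(w^2_{L,out,n} + w^2_{R,out,n}\right) E[e_n^2], \end{aligned} \]

since the cross term \(E[e_{L,n}e_{R,n}]\) vanishes under these assumptions.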

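If \(w^2_{L,out,n} + w^2_{R,out,n}=1\), then the output energy is uniform. To fulfil both criteria at once, we can set the input and output windows to be the same, \(w_n := w_{in,n} = w_{out,n}\), since the perfect reconstruction condition then reduces to the same sum of squares.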

We can then require the Princen-Bradley condition

\[ \boxed{w^2_{L,n} + w^2_{R,n} = 1} \]

and that \(w_n\) goes to zero at the borders. Overlap-add then obtains the following properties:

Perfect reconstruction – If there is no modification, we can reconstruct the original signal.
Continuous output – There are no discontinuities.
Uniform noise energy – Output noise does not have temporal structure (noise has a smooth energy envelope).

One such windowing function is the half-sine

\[ w_n = \sin\left(\frac{\pi n}{N}\right).\qquad\text{(It is the square root of a Hann-window.)} \]

We can readily show that it fulfils the Princen-Bradley condition.
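One way to show this is the following sketch, which assumes a window of length \(N\) with 50% overlap, that is, a hop of \(N/2\) samples (an assumption not fixed by the formula above). In the overlap region \(0\le n<N/2\), the fade-in part is \(w_{R,n}=w_n\) and the fade-out part is \(w_{L,n}=w_{n+N/2}\), so that

\[ w^2_{L,n} + w^2_{R,n} = \sin^2\left(\frac{\pi n}{N}+\frac{\pi}{2}\right) + \sin^2\left(\frac{\pi n}{N}\right) = \cos^2\left(\frac{\pi n}{N}\right) + \sin^2\left(\frac{\pi n}{N}\right) = 1. \]

Similarly, the identity \(\sin^2\theta = \tfrac12(1-\cos 2\theta)\) gives \(w_n^2 = \tfrac12\left(1-\cos\left(\frac{2\pi n}{N}\right)\right)\), which is a Hann window.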

../_images/81a64ae4e22e6d91f7e902e4bc51cd3fa5d3d8f8ef77b9eea23b4a3258abe917.png

3.2.1.3. Overlap-add summary

Overlap-add is a method for windowing a signal such that we can modify the segments and reconstruct the modified signal.

Algorithm

Apply the windowing function \(w_n\).
Modify/process the window with your algorithm of choice.
Apply the windowing function \(w_n\) again.
Add overlapping segments together to obtain the output signal.

Usually we would perform a time-frequency transform on the windowed signal \(w_nx_n\) and perform modifications in the frequency-domain. (Almost) all frequency-domain processing algorithms are based on overlap-add.
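The four steps of the algorithm can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the half-sine window, 50% overlap, and an even window length, and the function name `overlap_add` and its `process` argument are hypothetical names introduced here.

```python
import math

def overlap_add(x, N, process):
    """Window, process, window again and overlap-add with 50% overlap."""
    hop = N // 2
    w = [math.sin(math.pi * n / N) for n in range(N)]  # half-sine window
    y = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, hop):
        frame = [w[n] * x[start + n] for n in range(N)]  # step 1: input window
        frame = process(frame)                           # step 2: modification
        for n in range(N):
            y[start + n] += w[n] * frame[n]              # steps 3 and 4
    return y

# Arbitrary test signal and window length.
x = [math.sin(0.2 * n) + 0.5 * math.cos(0.05 * n) for n in range(128)]
N = 32
y = overlap_add(x, N, process=lambda frame: frame)  # identity: no modification

# Compare only where two windows overlap, i.e. away from both ends.
hop = N // 2
interior_error = max(abs(a - b) for a, b in zip(x[hop:-hop], y[hop:-hop]))
print("largest interior reconstruction error:", interior_error)
```

With the identity as the processing step, the interior of the signal is reconstructed up to rounding error, while the first and last half-windows are not, because they are covered by only one window.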

3.2.2. The short-time Fourier transform (STFT)

Overlap-add is typically combined with taking the discrete Fourier transform of each windowed segment, as well as the inverse transform after processing. This algorithm is known as the short-time Fourier transform (STFT), and it is the most commonly used domain for speech and audio processing. It is so common that when we talk about a time-frequency transform in conjunction with processing algorithms, we often implicitly mean the STFT.
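The combination of overlap-add and the DFT can be sketched as below. To stay self-contained, this illustration uses a naive DFT written with the standard library. In practice one would use a fast Fourier transform from a numerical library instead. The function names `stft` and `istft`, the half-sine window, and the 50% overlap are assumptions made for this sketch.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(spectrum):
    """Naive inverse DFT; keeps the real part, assuming a real-valued signal."""
    N = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]

def half_sine(N):
    return [math.sin(math.pi * n / N) for n in range(N)]

def stft(x, N):
    """Input windowing followed by a DFT of every frame (50% overlap)."""
    hop, w = N // 2, half_sine(N)
    return [dft([w[n] * x[s + n] for n in range(N)])
            for s in range(0, len(x) - N + 1, hop)]

def istft(frames, N, length):
    """Inverse DFT, output windowing and overlap-add."""
    hop, w = N // 2, half_sine(N)
    y = [0.0] * length
    for i, spectrum in enumerate(frames):
        frame = idft(spectrum)
        for n in range(N):
            y[i * hop + n] += w[n] * frame[n]
    return y

x = [math.sin(0.3 * n) for n in range(64)]  # arbitrary test signal
N = 16
frames = stft(x, N)             # any frequency-domain processing would go here
y = istft(frames, N, len(x))

hop = N // 2
interior_error = max(abs(a - b) for a, b in zip(x[hop:-hop], y[hop:-hop]))
print("largest interior reconstruction error:", interior_error)
```

Without any modification of `frames`, the interior of the signal is recovered up to rounding error, which is the perfect reconstruction property of overlap-add carried over to the frequency domain.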