
11.1. Noise attenuation#
When using speech technology in realistic environments, such as at home, in an office, or in a car, there will invariably be other sounds present besides the speech of the desired speaker: the background hum of computers and air conditioning, cars honking, other speakers, and so on. Such sounds reduce the quality of the desired signal, making it more strenuous to listen to, more difficult to understand, or in the worst case, unintelligible. A common feature of these sounds is, however, that they are independent of and uncorrelated with the desired signal. [Benesty et al., 2008]
That is, we can usually assume that such noises are additive, such that the observed signal \(y\) is the sum of the desired signal \(x\) and the interfering noise \(v\), that is, \(y=x+v\). To improve the quality of the observed signal, we would like to make an estimate \( \hat x =f(y)\) of the desired signal \(x\). The estimate should approximate the desired signal, \( \hat x \approx x \), or equivalently, we would like to minimize the distance \( d\left(x,\hat x\right) \) for some distance measure \(d(\cdot,\cdot)\).
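As a minimal sketch of this additive model, we can construct \(y=x+v\) with NumPy and evaluate the squared-error distance of the trivial estimate \(\hat x = y\). The sinusoid standing in for speech and all numeric values here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "desired" signal: a 200 Hz sinusoid sampled at 8 kHz (illustrative values).
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)

# Additive noise v, uncorrelated with x; the observation is their sum.
v = 0.3 * rng.standard_normal(len(x))
y = x + v

# The quality of an estimate xhat can be quantified with a squared-error
# distance d(x, xhat); here we evaluate the trivial estimate xhat = y.
xhat = y
d = np.mean((x - xhat) ** 2)

# Equivalently, the signal-to-noise ratio (SNR) of the estimate in dB.
snr_dB = 10 * np.log10(np.mean(x ** 2) / np.mean((x - xhat) ** 2))
```

For \(\hat x = y\), the error is exactly the noise \(v\), so any useful enhancement method \(f(y)\) should achieve a smaller distance (higher SNR) than this baseline.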
The sound sample below serves as an example of a noisy speech signal. In the following sections, we discuss incrementally more advanced approaches for attenuating such additive noise.
# Initialization for all
from scipy.io import wavfile
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import scipy
import scipy.fft
#from helper_functions import stft, istft, halfsinewindow
def stft(data,fs,window_length_ms=30,window_step_ms=20,windowing_function=None):
    window_length = int(window_length_ms*fs/2000)*2
    window_step = int(window_step_ms*fs/1000)
    if windowing_function is None:
        windowing_function = np.sin(np.pi*np.arange(0.5,window_length,1)/window_length)**2
    total_length = len(data)
    window_count = int((total_length-window_length)/window_step) + 1
    spectrum_length = int(window_length/2) + 1
    spectrogram = np.zeros((window_count,spectrum_length),dtype=complex)
    for k in range(window_count):
        starting_position = k*window_step
        data_vector = data[starting_position:(starting_position+window_length)]
        window_spectrum = scipy.fft.rfft(data_vector*windowing_function,n=window_length)
        spectrogram[k,:] = window_spectrum
    return spectrogram

def istft(spectrogram,fs,window_length_ms=30,window_step_ms=20,windowing_function=None):
    window_length = int(window_length_ms*fs/2000)*2
    window_step = int(window_step_ms*fs/1000)
    if windowing_function is None:
        windowing_function = np.ones(window_length)
    window_count = spectrogram.shape[0]
    total_length = (window_count-1)*window_step + window_length
    data = np.zeros(total_length)
    for k in range(window_count):
        starting_position = k*window_step
        ix = np.arange(starting_position,starting_position+window_length)
        thiswin = scipy.fft.irfft(spectrogram[k,:],n=window_length)
        data[ix] = data[ix] + thiswin*windowing_function
    return data

def halfsinewindow(window_length):
    return np.sin(np.pi*np.arange(0.5,window_length,1)/window_length)
fs = 44100 # Sample rate
seconds = 5 # Duration of recording
window_length_ms=30
window_step_ms=15
window_length = int(window_length_ms*fs/2000)*2
window_step_samples = int(window_step_ms*fs/1000)
windowing_function = halfsinewindow(window_length)
filename = 'sounds/enhancement_test.wav'
# read from storage
fs, data = wavfile.read(filename)
data = data[:]
ipd.display(ipd.Audio(data,rate=fs))
plt.figure(figsize=[12,6])
plt.subplot(211)
t = np.arange(0,len(data),1)/fs
plt.plot(t,data)
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Waveform of noisy audio')
plt.axis([0, len(data)/fs, 1.05*np.min(data), 1.05*np.max(data)])
spectrogram_matrix = stft(data,
fs,
window_length_ms=window_length_ms,
window_step_ms=window_step_ms,
windowing_function=windowing_function)
fft_length = spectrogram_matrix.shape[1]
window_count = spectrogram_matrix.shape[0]
length_in_s = window_count*window_step_ms/1000
plt.subplot(212)
plt.imshow(20*np.log10(np.abs(spectrogram_matrix[:,range(fft_length)].T)),
origin='lower',aspect='auto',
extent=[0, length_in_s, 0, fs/2000])
plt.axis([0, length_in_s, 0, 8])
plt.xlabel('Time (s)')
plt.ylabel('Frequency (kHz)');
plt.title('Spectrogram of noisy audio')
plt.tight_layout()
plt.show()
11.1.1. Noise gate#
Suppose you are talking in a reasonably quiet environment. For example, typically when you speak on a phone, you would go to a quiet room. Similarly, when attending an online lecture, you would most often want to be in a room without background noise.
What we perceive as quiet is, however, never entirely silent. When we play back a sound recorded in a “quiet” room, the reproduction contains both the local and the recorded background noises. Assuming the two noises have similar energies, their sum has twice the energy, that is, 3 dB more than either noise alone. In a teleconference with multiple participants, the background noises add up, such that each doubling of the number of participants raises the background noise level by another 3 dB. You do not need many participants before the total noise level becomes so high that communication is impossible.
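This dB arithmetic can be checked directly: summing \(N\) uncorrelated noises of equal power multiplies the total power by \(N\), so the level rises by \(10\log_{10}N\) dB. A small sketch (the function name here is just illustrative):

```python
import numpy as np

def noise_level_increase_dB(n_sources):
    """Level increase when n uncorrelated, equal-power noise sources add up."""
    return 10 * np.log10(n_sources)

# Two equal-power noises: twice the power, i.e. about +3 dB.
print(noise_level_increase_dB(2))   # ~3.01 dB
# Ten participants: 10 dB above a single participant's noise floor.
print(noise_level_increase_dB(10))  # 10.0 dB
```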
The mute-button in teleconferences is therefore essential. Participants can silence their microphones whenever they are not speaking, such that only the background noise of the active speaker(s) is transmitted to all listeners.
While the mute button is a good user interface in the sense that it gives control to the user, it is also an annoying one, in that users tend to forget to mute and unmute themselves. An automatic mute would be better.
Noise gating is a simple automatic mute: it thresholds the signal energy and turns reproduction/transmission off when the energy is too low. Typically it also features a hysteresis functionality, such that reproduction/transmission is kept on for a while after the last speech segment. Moreover, to avoid discontinuities, there is a fade-in/fade-out functionality at the start and end.
Note that noise gating with an energy threshold is a simple implementation of a voice activity detector (VAD). With more advanced features than mere energy, voice activity detection can be refined considerably, making it more robust especially in noisy and reverberant environments. In addition to enhancing signal quality, such methods are often also used to preserve resources, such as transmission bandwidth (telecommunication) and computation costs (recognition applications such as speech recognition).
For the noise gate, we first need to choose a threshold. Typically the threshold is chosen relative to the mean (log) energy \(\sigma^2\), such that the gate closes when \(x^2 < \sigma^2\gamma\), where \(\gamma\) is a tunable parameter. We can then implement the gate as a gain value which is set to 0 below the threshold and to 1 otherwise. If we want fade-in/fade-out, we ramp that gain smoothly from 0 to 1 at the attack and from 1 to 0 at the release.
frame_energy = np.sum(np.abs(spectrogram_matrix)**2,axis=1)
frame_energy_dB = 10*np.log10(frame_energy)
mean_energy_dB = np.mean(frame_energy_dB) # mean of energy in dB
threshold_dB = mean_energy_dB + 3. # threshold relative to mean
speech_active = frame_energy_dB > threshold_dB
# Reconstruct and play thresholded signal
spectrogram_thresholded = spectrogram_matrix * np.expand_dims(speech_active,axis=1)
data_thresholded = istft(spectrogram_thresholded,fs,window_length_ms=window_length_ms,window_step_ms=window_step_ms,windowing_function=windowing_function)
# Illustrate thresholding (without hysteresis)
plt.figure(figsize=[12,6])
plt.subplot(211)
t = np.arange(0,window_count,1)*window_step_samples/fs
normalized_frame_energy = frame_energy_dB - np.mean(frame_energy_dB)
plt.plot(t,normalized_frame_energy,label='Signal energy')
plt.plot(t,speech_active*10,label='Noise gate')
plt.legend()
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Noise gate')
plt.axis([0, len(data)/fs, 1.05*np.min(normalized_frame_energy), 1.05*np.max(normalized_frame_energy)])
plt.subplot(212)
plt.imshow(20*np.log10(1e-6+np.abs(spectrogram_thresholded[:,range(fft_length)].T)),
origin='lower',aspect='auto',
extent=[0, length_in_s, 0, fs/2000])
plt.axis([0, length_in_s, 0, 8])
plt.xlabel('Time (s)')
plt.ylabel('Frequency (kHz)');
plt.title('Spectrogram of gated audio')
plt.tight_layout()
plt.show()
ipd.display(ipd.Audio(data_thresholded,rate=fs))
This is quite awful, isn’t it? Though we removed many stationary noise segments, we also distorted the speech signal significantly. In particular, we typically lose plosives at the beginnings of words. Overall, the sound also becomes odd when it keeps turning on and off.
To improve, we can add hysteresis, where the activity indicator is kept on for a while after the last true speech frame.
hysteresis_time_ms = 300
hysteresis_time = int(hysteresis_time_ms/window_step_ms)
speech_active_hysteresis = np.zeros([window_count])
for window_ix in range(window_count):
    range_start = max(0,window_ix-hysteresis_time)
    speech_active_hysteresis[window_ix] = np.max(speech_active[range_start:window_ix+1])
# Reconstruct and play thresholded signal
spectrogram_hysteresis = spectrogram_matrix * np.expand_dims(speech_active_hysteresis,axis=1)
data_hysteresis = istft(spectrogram_hysteresis,fs,window_length_ms=window_length_ms,window_step_ms=window_step_ms,windowing_function=windowing_function)
# Illustrate thresholding (with hysteresis)
plt.figure(figsize=[12,6])
plt.subplot(211)
t = np.arange(0,window_count,1)*window_step_samples/fs
normalized_frame_energy = frame_energy_dB - np.mean(frame_energy_dB)
plt.plot(t,normalized_frame_energy,label='Signal energy')
plt.plot(t,speech_active_hysteresis*10,label='Noise gate')
plt.legend()
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Noise gate with hysteresis')
plt.axis([0, len(data)/fs, 1.05*np.min(normalized_frame_energy), 1.05*np.max(normalized_frame_energy)])
plt.subplot(212)
plt.imshow(20*np.log10(1e-6+np.abs(spectrogram_hysteresis[:,range(fft_length)].T)),
origin='lower',aspect='auto',
extent=[0, length_in_s, 0, fs/2000])
plt.axis([0, length_in_s, 0, 8])
plt.xlabel('Time (s)')
plt.ylabel('Frequency (kHz)');
plt.title('Spectrogram of gating with hysteresis')
plt.tight_layout()
plt.show()
ipd.display(ipd.Audio(data_hysteresis,rate=fs))
This sounds quite a bit better already. There are only some sudden muted areas (depending on your sound sample), but overall the sound is clearly better.
# Fade-in and fade-out
fade_in_time_ms = 50
fade_out_time_ms = 300
fade_in_time = int(fade_in_time_ms/window_step_ms)
fade_out_time = int(fade_out_time_ms/window_step_ms)
speech_active_sloped = np.zeros([window_count])
for frame_ix in range(window_count):
    if speech_active_hysteresis[frame_ix]:
        range_start = max(0,frame_ix-fade_in_time)
    else:
        range_start = max(0,frame_ix-fade_out_time)
    speech_active_sloped[frame_ix] = np.mean(speech_active_hysteresis[range_start:frame_ix+1])
# Reconstruct and play sloped-thresholded signal
spectrogram_sloped = spectrogram_matrix * np.expand_dims(speech_active_sloped,axis=1)
data_sloped = istft(spectrogram_sloped,fs,window_length_ms=window_length_ms,window_step_ms=window_step_ms,windowing_function=windowing_function)
# Illustrate thresholding
plt.figure(figsize=[12,6])
plt.subplot(211)
t = np.arange(0,window_count,1)*window_step_samples/fs
normalized_frame_energy = frame_energy_dB - np.mean(frame_energy_dB)
plt.plot(t,normalized_frame_energy,label='Signal energy')
plt.plot(t,speech_active_sloped*10,label='Noise gate')
plt.legend()
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.title('Noise gate with sloped hysteresis')
plt.axis([0, len(data)/fs, 1.05*np.min(normalized_frame_energy), 1.05*np.max(normalized_frame_energy)])
plt.subplot(212)
plt.imshow(20*np.log10(1e-6+np.abs(spectrogram_sloped[:,range(fft_length)].T)),
origin='lower',aspect='auto',
extent=[0, length_in_s, 0, fs/2000])
plt.axis([0, length_in_s, 0, 8])
plt.xlabel('Time (s)')
plt.ylabel('Frequency (kHz)');
plt.title('Spectrogram of gating with sloped hysteresis')
plt.tight_layout()
plt.show()
ipd.display(ipd.Audio(data_sloped,rate=fs))
This doesn’t sound all too bad! The sudden on- and off-sets are gone and the transitions to muted areas sound reasonably natural.
Now we have implemented gating for the full-band signal. Gating can easily be improved with band-wise processing. Depending on the amount of processing you can afford, you could go all the way and apply gating to individual frequency bins of the STFT.
hysteresis_time_ms = 100
hysteresis_time = int(hysteresis_time_ms/window_step_ms)
fade_in_time_ms = 30
fade_out_time_ms = 60
fade_in_time = int(fade_in_time_ms/window_step_ms)
fade_out_time = int(fade_out_time_ms/window_step_ms)

# NB: This is a pedagogic, but very slow implementation since it involves multiple for-loops.
spectrogram_binwise = np.zeros(spectrogram_matrix.shape,dtype=complex)
for bin_ix in range(fft_length):
    bin_energy_dB = 10.*np.log10(np.abs(spectrogram_matrix[:,bin_ix])**2)
    mean_energy_dB = np.mean(bin_energy_dB) # mean of energy in dB
    threshold_dB = mean_energy_dB + 16. # threshold relative to mean
    speech_active = bin_energy_dB > threshold_dB
    speech_active_hysteresis = np.zeros([window_count])
    for window_ix in range(window_count):
        range_start = max(0,window_ix-hysteresis_time)
        speech_active_hysteresis[window_ix] = np.max(speech_active[range_start:window_ix+1])
    speech_active_sloped = np.zeros([window_count])
    for frame_ix in range(window_count):
        if speech_active_hysteresis[frame_ix]:
            range_start = max(0,frame_ix-fade_in_time)
        else:
            range_start = max(0,frame_ix-fade_out_time)
        speech_active_sloped[frame_ix] = np.mean(speech_active_hysteresis[range_start:frame_ix+1])
    spectrogram_binwise[:,bin_ix] = spectrogram_matrix[:,bin_ix]*speech_active_sloped
# Reconstruct and play sloped-thresholded signal
data_binwise = istft(spectrogram_binwise,fs,window_length_ms=window_length_ms,window_step_ms=window_step_ms,windowing_function=windowing_function)
# Illustrate thresholding
plt.figure(figsize=[12,6])
plt.subplot(211)
plt.imshow(20*np.log10(np.abs(spectrogram_matrix[:,range(fft_length)].T)),
origin='lower',aspect='auto',
extent=[0, length_in_s, 0, fs/2000])
plt.axis([0, length_in_s, 0, 8])
plt.xlabel('Time (s)')
plt.ylabel('Frequency (kHz)');
plt.title('Original spectrogram of noisy audio')
ipd.display(ipd.Audio(data,rate=fs))
plt.subplot(212)
plt.imshow(20*np.log10(1e-6+np.abs(spectrogram_binwise[:,range(fft_length)].T)),
origin='lower',aspect='auto',
extent=[0, length_in_s, 0, fs/2000])
plt.axis([0, length_in_s, 0, 8])
plt.xlabel('Time (s)')
plt.ylabel('Frequency (kHz)');
plt.title('Spectrogram of bin-wise gating with sloped hysteresis')
plt.tight_layout()
plt.show()
ipd.display(ipd.Audio(data_binwise,rate=fs))