14.2. Speech data and experiment design#
In development of speech processing systems, we need speech data for several purposes, including:
Speech analysis, to better understand speech signals. A better understanding of how humans communicate is valuable in itself, but here we are interested in the engineering challenges. Through analysis we can identify properties of speech signals which help improve system performance, and gain a better understanding of which features of speech are important for utility and quality. For example, we can learn which features differ between populations of speakers, say between children, adults and seniors. By understanding such differences, we can focus development efforts on those features of speech which are most likely to improve performance.
Training machine learning methods. Machine learning has permeated all areas of speech technology, and since such methods require large amounts of training data, practically all speech processing development requires large amounts of data.
Evaluating performance of systems. Evaluation is a large topic on its own. It is however clear that speech data is essential in evaluation of the performance of speech processing systems.
The choice and design of data sources is thus closely tied to the design of the overall experiment.
14.2.1. Experiment design#
14.2.1.1. Describe the use-case in detail#
How, specifically, is the system intended to be used in real life? Think through the scenario in detail. How does the user experience performance and quality in that scenario? Which different aspects of quality can you think of in this scenario? Importantly, to which of these aspects of quality does your novel system contribute?
14.2.1.1.1. Example#
Consider an open-office scenario where two (or more) people, say Alice and Bob, would like to hold independent teleconferences at the same time. When Alice speaks to her voice service 1, her speech is also picked up by Bob’s voice service 2. Alice’s voice thus leaks into Bob’s conversation and vice versa. [Rech, 2022]
Such leaks and cross-talk are problematic in at least two aspects of quality, namely,
The sound quality at the remote end (of the voice services) is reduced, degrading perceived quality, increasing listening effort and potentially reducing intelligibility.
In addition, this is potentially also a threat to privacy, if the conversations contain private information.
We therefore need to design our experiments such that they allow testing for quality and privacy.
14.2.1.2. Choice of experiments#
Next we need to choose experiments which measure the desired aspects of quality. It is important to choose the experiments such that they reflect the performance and utility in the final use-case. That is, if the system outputs speech for humans to listen to, then the best measure of quality is a subjective listening test with human listeners (see Subjective quality evaluation). If the output is fed to a subsequent module, such as a speech recognizer, then the word-error-rate (or a similar measure) of that module is a good candidate for the utility measure.
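As an illustration of such a utility measure, the sketch below computes the word-error-rate between a reference transcription and a recognizer output using the standard Levenshtein (edit-distance) alignment. This is a minimal, self-contained example; in practice one would typically use the scoring tool of the recognition toolkit in use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-error-rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                                # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

# One deletion ("the") and one substitution ("on" -> "off") over 4 reference words -> WER = 0.5
print(word_error_rate("turn the lights on", "turn lights off"))
```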
Often, however, the true measure of quality is impractical, costly or even impossible. For example, subjective listening tests are time-consuming, non-repeatable and expensive. It is therefore useful to consider proxy measures, which measure similar things in easier ways. For example, subjective listening tests can often be replaced with Objective quality evaluation. Since the objective measure is only an approximation of the subjective one, it is good practice to also run a subjective listening test, though on a smaller scale.
14.2.1.2.1. Example (continued)#
In the above scenario, we thus need speech data which features a large variety of examples of cross-talk. The examples should cover the whole range of possibilities with respect to, for example, distances between speakers and their microphones, different room sizes and reverberation characteristics, speakers with different genders, ages and speaking styles, and so on. Enhancement experiments will typically also benefit from a range of different background noises.
In this particular case, it is probably difficult to find an existing dataset with the desired characteristics. We then have two options: record our own dataset or create a synthetic one.
Recording one’s own dataset is in principle straightforward, but the required amount of work and effort is typically very large. In this case, you could for example choose 3 different rooms in which to record. In each room you would further choose, say, 3 to 5 different combinations of microphone locations, giving already 9 to 15 different configurations. To make the recordings realistic, speakers would need to hold conversations over a teleconferencing platform, with two speakers in the same room. For maximum realism, each speaker would also need someone to speak with, so four speakers would be involved in every experiment. Finally, we would need to bring in, say, 60 speakers as subjects, giving 15 groups of 4 speakers. Each group of 4 could record 2 different room configurations of 10 minutes each, with both rooms recorded simultaneously. Managing all this complexity would require at least a week of full-time work for 2 lab technicians. Yet the outcome would be no more than \(15\times 2 \times 2 \times 10\) minutes of audio, that is, only \(600\) minutes \(=10\) hours in total.
Synthetic datasets, also known as data augmentation, are a way to generate large datasets using small datasets as components. In the above scenario, we can for example take the Fisher corpus [Cieri et al., 2004] to obtain 2000 hours of recordings of spontaneous dialogues. By simulating different room and microphone setups with ‘pyroomacoustics’ [Scheibler et al., 2018], we can then combine pairs of conversations in random room configurations to obtain an effectively unlimited amount of data. That is, for each synthetic example we draw two conversations, place the speakers in a randomly generated room, and simulate the cross-talk mixture picked up by each microphone.
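A minimal sketch of such a simulation is shown below, assuming two (hypothetical) conversation recordings on disk; the file names, room dimensions, absorption values and microphone offsets are arbitrary placeholder choices for illustration, not the settings of any particular study.

```python
import numpy as np
import soundfile as sf
import pyroomacoustics as pra

rng = np.random.default_rng()

# Two utterances from two separate conversations; the file names are placeholders.
alice, fs = sf.read("conversation_A_speaker1.wav")
bob, _ = sf.read("conversation_B_speaker1.wav")

# Draw a random shoe-box room (dimensions in metres) and a random absorption coefficient.
dims = rng.uniform([4.0, 3.0, 2.5], [10.0, 8.0, 3.5])
room = pra.ShoeBox(dims, fs=fs, materials=pra.Material(rng.uniform(0.2, 0.6)), max_order=12)

# Place the two talkers at random positions, with one microphone 30 cm from each talker.
pos_a = rng.uniform([0.5, 0.5, 1.0], dims - 0.5)
pos_b = rng.uniform([0.5, 0.5, 1.0], dims - 0.5)
room.add_source(pos_a, signal=alice)
room.add_source(pos_b, signal=bob)
mics = np.c_[pos_a + [0.3, 0.0, 0.0], pos_b + [0.3, 0.0, 0.0]]   # shape (3, 2): two microphones
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

# Convolve both sources with the simulated room impulse responses.
room.simulate()
mixture = room.mic_array.signals   # shape (2, n_samples); each microphone picks up both talkers
```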
A significant advantage of such synthetic datasets is that we can use the original audio (without room acoustics) as a target for the speech enhancement process. We can therefore do straightforward objective evaluation of the enhanced audio signals, because we can compare the enhanced and original signals.
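For instance, one simple objective measure is the scale-invariant signal-to-distortion ratio (SI-SDR) between the clean original and the enhanced signal. The sketch below is a minimal implementation for illustration; in practice it would typically be complemented by standard objective quality measures such as those discussed in Objective quality evaluation.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference: the projection is the "target" part,
    # everything else counts as distortion.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(distortion**2))

# Example: compare the enhanced signal against the clean original used to build the mixture.
# si_sdr(original_alice, enhanced_alice)
```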
Real recordings, on the other hand, have the advantage that the speakers can actually hear the other conversation. Consequently, their behaviour may change due to the cross-talk, which is exactly the effect we want to capture and remedy. However, in real recordings it is difficult to obtain a clean reference in which cross-talk is not present.
In summary, ideally we should perform our experiments with both real and synthetic datasets. This lets us train models and evaluate objectively on an effectively “infinite” synthetic dataset, while using subjective evaluation on the real recordings.
14.2.2. Quality of data#
On the surface, collecting data is simple: just collect or create data which is as similar as possible to the target scenario. If you are unsure of the specifics of the scenario, collect more data in the hope that all special cases are covered.
There are however some issues with this approach. One perspective is that most data is “easy” data, in the sense that each sample is similar to many other samples. Such a dataset is inefficient, because we do not learn much new from most samples. We could therefore attempt to design datasets which maximize the informativeness of every sample.
Another perspective is that many datasets have been collected where collection is easy, for example among university students in the US. This biases the data towards the US, the educated, young adults, Caucasian people, and so on. Such biases lead to poorer performance of processing methods for the under-represented populations (ethnic groups, children and the elderly, the less educated, etc.). This is an ethical problem because it penalizes minorities.
To mend this problem, we would need to collect data such that it represents all sub-populations in a fair manner. A difficult (and perhaps unsolvable) question, however, is how to define “fair” in this context. For example, if we collect data from different gender groups, heterosexual men and women would obviously be represented in roughly similar amounts. The difficulty lies in choosing the amount of data from non-heterosexual subjects and their subcategories. For a multitude of reasons, it would be difficult, impractical and potentially ethics-violating to collect data from all subcategories of non-heterosexuals in the same amounts as from heterosexual men and women. To make things worse, we might not even know which subcategories exist in the population. By intersecting different categorisations we can also readily multiply the number of subcategories; how would you make sure that your dataset contains a sufficient number of people who speak Dutch, are of Bulgarian ethnic origin, belong to the age group 60-80, identify as non-heterosexual and have a lisp?
Collecting such labels is also a privacy problem (see Security and privacy in speech technology). To better serve smaller minorities, we need to identify and label those minorities, but more extensive labelling also increases the subjects’ exposure to privacy and ethics violations. Fortunately, such labels are needed mainly during evaluation, to verify that minorities are not discriminated against. We thus need to publish labels only for evaluation sets, not for training sets.
14.2.3. Some noteworthy speech corpora#
LibriSpeech is a fairly large open collection of audiobooks with corresponding text transcriptions. [Panayotov et al., 2015]
Speech Commands is a limited-vocabulary collection for keyword detection. [Warden, 2018]
VoxCeleb2 is a corpus for speaker recognition. [Chung et al., 2018]
VoxPopuli is one of the largest public corpora, with 400k hours of speech collected from European Parliament event recordings from 2009-2020. It covers 23 languages, and a portion of the data is transcribed. [Wang et al., 2021]
CSTR VCTK is a corpus intended for voice conversion, but since it is one of the largest open databases with a high sampling rate, it is also used for many other purposes. [Yamagishi et al., 2019]
The Fisher corpus is a collection of 5850 spontaneous telephone conversations between two speakers, each about 10 minutes long and each with different speakers. It is useful for analysis and modelling of informal (not pre-scripted) language and of the dynamics of conversations. [Cieri et al., 2004]
14.2.4. References#
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: deep speaker recognition. In Proc. Interspeech 2018, 1086–1090. 2018. URL: http://dx.doi.org/10.21437/Interspeech.2018-1929, doi:10.21437/Interspeech.2018-1929.
Christopher Cieri, David Miller, and Kevin Walker. The Fisher corpus: a resource for the next generations of speech-to-text. In LREC, volume 4, 69–71. 2004. URL: http://www.lrec-conf.org/proceedings/lrec2004/pdf/767.pdf.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. IEEE, 2015. URL: https://doi.org/10.1109/ICASSP.2015.7178964.
Silas Rech. Multi-device speech enhancement for privacy and quality. Master's thesis, Aalto University, 2022.
Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 351–355. IEEE, 2018. URL: https://github.com/LCAV/pyroomacoustics.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 993–1003. August 2021. URL: https://aclanthology.org/2021.acl-long.80.
Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. CoRR, 2018. URL: http://arxiv.org/abs/1804.03209, arXiv:1804.03209.
Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019. URL: https://doi.org/10.7488/ds/2645.