17. References

[1]

Dan Jurafsky and James H. Martin. Speech and Language Processing. Stanford University, 3rd edition, 2021. URL: https://web.stanford.edu/~jurafsky/slp3/.

[2]

Jessica Gasiorek. Message processing: The science of creating understanding. UH Mānoa Outreach College, 2018. URL: http://pressbooks-dev.oer.hawaii.edu/messageprocessing/.

[3]

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org.

[4]

Julius Orion Smith. Spectral Audio Signal Processing. W3K, 2011. URL: https://ccrma.stanford.edu/~jos/sasp/.

[5]

S. Greenberg, H. Carvey, L. Hitchcock, and S. Chang. Temporal properties of spontaneous speech – a syllable-centric perspective. Journal of Phonetics, 31:465–485, 2003. URL: https://doi.org/10.1016/j.wocn.2003.09.005.

[6]

P. Hallé and A. Cristia. Global and detailed speech representations in early language acquisition. In S. Fuchs, M. Weirich, D. Pape, and P. Perrier, editors, Speech Planning and Dynamics. Peter Lang, Frankfurt am Main, 2012.

[7]

I. Y. Liberman, D. Shankweiler, F. W. Fischer, and B. Carter. Explicit syllable and phoneme segmentation in the young child. Journal of Experimental Child Psychology, 18:201–212, 1974.

[8]

J. Morais, A. Content, L. Cary, J. Mehler, and J. Segui. Syllabic segmentation and literacy. Language and Cognitive Processes, 4(1):56–67, 1989.

[9]

M. Nespor, M. Shukla, and J. Mehler. Stress-timed vs. syllable-timed languages. In M. van Oostendorp, C. J. Ewen, E. Hume, and K. Rice, editors, The Blackwell Companion to Phonology, pages 1147–1159. Blackwell, Malden, MA, 2011.

[10]

H. C. Nusbaum and J. DeGroot. The role of syllables in speech perception. In M. S. Ziolkowski, M. Noske, and K. Deaton, editors, Papers from the parasession on the syllable in phonetics and phonology. Chicago Linguistic Society, Chicago, 1991.

[11]

Janet B. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, Massachusetts Institute of Technology, 1980.

[12]

Kenneth L. Pike. The Intonation of American English. University of Michigan Press, Ann Arbor, MI, 1945.

[13]

J. F. Werker and R. C. Tees. Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant Behavior & Development, 7(1):49–63, 1984. URL: https://doi.org/10.1016/S0163-6383(84)80022-3.

[14]

Peter Noll. A comparative study of various quantization schemes for speech encoding. Bell System Technical Journal, 54(9):1597–1614, 1975. URL: https://doi.org/10.1002/j.1538-7305.1975.tb02053.x.

[15]

John Makhoul. Linear prediction: a tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975. URL: https://doi.org/10.1109/PROC.1975.9792.

[16]

Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002. URL: https://doi.org/10.1121/1.1458024.

[17]

Matthias Mauch and Simon Dixon. pYIN: a fundamental frequency estimator using probabilistic threshold distributions. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 659–663. IEEE, 2014. URL: https://doi.org/10.1109/ICASSP.2014.6853678.

[18]

Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: a convolutional representation for pitch estimation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 161–165. IEEE, 2018. URL: https://doi.org/10.1109/ICASSP.2018.8461329.

[19]

Satwinder Singh, Ruili Wang, and Yuanhang Qiu. DeepF0: end-to-end fundamental frequency estimation for music and speech signals. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. IEEE, 2021. URL: https://doi.org/10.1109/ICASSP39728.2021.9414050.

[20]

Tom Bäckström, Jérémie Lecomte, Guillaume Fuchs, Sascha Disch, and Christian Uhle. Speech coding: with code-excited linear prediction. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-50204-5.

[21]

Homer Dudley. Remaking speech. The Journal of the Acoustical Society of America, 11(2):169–177, 1939. URL: https://doi.org/10.1121/1.1916020.

[22]

Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984. URL: https://doi.org/10.1109/TASSP.1984.1164317.

[23]

Nathanaël Perraudin, Peter Balazs, and Peter L Søndergaard. A fast Griffin-Lim algorithm. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. IEEE, 2013. URL: https://doi.org/10.1109/WASPAA.2013.6701851.

[24]

Yi Hu and Philipos C Loizou. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):229–238, 2007. URL: https://ecs.utdallas.edu/loizou/speech/noizeus/, doi:10.1109/TASL.2007.911054.

[25]

Pranay Manocha, Zeyu Jin, and Adam Finkelstein. Audio similarity is unreliable as a proxy for audio quality. arXiv preprint arXiv:2206.13411, 2022. URL: https://doi.org/10.48550/arXiv.2206.13411.

[26]

Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, 749–752. IEEE, 2001. URL: https://doi.org/10.1109/ICASSP.2001.941023.

[27]

John G Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement, Part I: temporal alignment. Journal of the Audio Engineering Society, 61(6):366–384, 2013. URL: http://www.aes.org/e-lib/browse.cfm?elib=16829.

[28]

Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Catherine Colomes. PEAQ – the ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3–29, 2000. URL: http://www.aes.org/e-lib/browse.cfm?elib=12078.

[29]

Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125–2136, 2011. URL: https://doi.org/10.1109/TASL.2011.2114881.

[30]

Augustine Gray and John Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380–391, 1976. URL: https://doi.org/10.1109/TASSP.1976.1162849.

[31]

James L. Flanagan. Speech Analysis Synthesis and Perception. Springer-Verlag, 2nd edition, 1972. URL: https://doi.org/10.1007/978-3-662-00849-2.

[32]

John Cunnison Catford. Fundamental Problems in Phonetics, chapter one, pages 1–278. Indiana University Press, Bloomington, USA, 1977.

[33]

Sudarsana Reddy Kadiri, Paavo Alku, and B. Yegnanarayana. Extraction and utilization of excitation information of speech: a review. Proceedings of the IEEE, 109(12):1920–1941, 2021. URL: https://doi.org/10.1109/JPROC.2021.3126493.

[34]

Christian T Herbst. Electroglottography – an update. Journal of Voice, 34(4):503–526, 2020. URL: https://doi.org/10.1016/j.jvoice.2018.12.014.

[35]

Paavo Alku. Glottal inverse filtering analysis of human voice production – a review of estimation and parameterization methods of the glottal excitation and their applications. Sadhana, 36(5):623–650, 2011. URL: https://doi.org/10.1007/s12046-011-0041-5.

[36]

David Y. Wong, John D. Markel, and Augustine H. Gray, Jr. Least squares glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Acoustics Speech Signal Process., 27(4):350–355, August 1979. URL: https://doi.org/10.1109/TASSP.1979.1163260.

[37]

Paavo Alku. Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2):109–118, 1992. URL: https://doi.org/10.1016/0167-6393(92)90005-R.

[38]

Manu Airaksinen, Tuomo Raitio, Brad Story, and Paavo Alku. Quasi closed phase glottal inverse filtering analysis with weighted linear prediction. IEEE/ACM Trans. Audio Speech Lang. Process., 22(3):596–607, 2014. URL: https://doi.org/10.1109/TASLP.2013.2294585.

[39]

Manu Airaksinen, Tom Bäckström, and Paavo Alku. Quadratic programming approach to glottal inverse filtering by joint norm-1 and norm-2 optimization. IEEE/ACM Trans. Audio Speech Lang. Process., 25(5):929–939, 2016. URL: https://doi.org/10.1109/TASLP.2016.2620718.

[40]

Qiang Fu and Peter Murphy. Robust glottal source estimation based on joint source-filter model optimization. IEEE Transactions on Audio, Speech, and Language Processing, 14:492–501, 2006. URL: https://doi.org/10.1109/TSA.2005.857807.

[41]

Olaf Schleusing, Tomi Kinnunen, Brad H. Story, and Jean-Marc Vesin. Joint source-filter optimization for accurate vocal tract estimation using differential evolution. IEEE Trans. Audio Speech Lang. Process., 21(8):1560–1572, 2013. URL: https://doi.org/10.1109/TASL.2013.2255275.

[42]

Harri Auvinen, Tuomo Raitio, Manu Airaksinen, Samuli Siltanen, Brad H. Story, and Paavo Alku. Automatic glottal inverse filtering with the Markov chain Monte Carlo method. Comput. Speech Lang., 28(5):1139–1155, 2014. URL: https://doi.org/10.1016/j.csl.2013.09.004.

[43]

Gabriel A Alzamendi and Gastón Schlotthauer. Modeling and joint estimation of glottal source and vocal tract filter by state-space methods. Biomed. Signal Process. Control, 37:5–15, 2017. URL: https://doi.org/10.1016/j.bspc.2016.12.022.

[44]

Subhasmita Sahoo and Aurobinda Routray. A novel method of glottal inverse filtering. IEEE/ACM Trans. Audio Speech Lang. Process., 24(7):1230–1241, 2016. URL: https://doi.org/10.1109/TASLP.2016.2551864.

[45]

Baris Bozkurt, Boris Doval, Christophe d'Alessandro, and Thierry Dutoit. Zeros of z-transform representation with application to source-filter separation in speech. IEEE Signal Processing Letters, 12:344–347, 2005. URL: https://doi.org/10.1109/LSP.2005.843770.

[46]

Thomas Drugman, Baris Bozkurt, and Thierry Dutoit. Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation. Speech Communication, 53:855–866, 2011. URL: https://doi.org/10.1016/j.specom.2011.02.004.

[47]

Thomas Drugman, Baris Bozkurt, and Thierry Dutoit. A comparative study of glottal source estimation techniques. Computer Speech and Language, 26:20–34, 2012. URL: https://doi.org/10.1016/j.csl.2011.03.003.

[48]

Kwok Tai Chui, Miltiadis D Lytras, and Pandian Vasant. Combined generative adversarial network and fuzzy c-means clustering for multi-class voice disorder detection with an imbalanced dataset. Applied Sciences, 10(13):4571, 2020. URL: https://doi.org/10.3390/app10134571.

[49]

Alireza Bayestehtashk, Meysam Asgari, Izhak Shafran, and James McNames. Fully automated assessment of the severity of Parkinson's disease from speech. Computer Speech & Language, 29(1):172–185, 2015. URL: https://doi.org/10.1016/j.csl.2013.12.001.

[50]

Prabhakera Narendra and Paavo Alku. Automatic assessment of intelligibility in speakers with dysarthria from coded telephone speech using glottal features. Comput. Speech Lang., 65:101117, 2021. URL: https://doi.org/10.1016/j.csl.2020.101117.

[51]

Tomas Arias-Vergara, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, and Elmar Nöth. Speaker models for monitoring Parkinson's disease progression considering different communication channels and acoustic conditions. Speech Communication, 101:11–25, 2018. URL: https://doi.org/10.1016/j.specom.2018.05.007.

[52]

Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, August 1980. URL: https://doi.org/10.1109/TASSP.1980.1163420.

[53]

Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462. 2010. URL: https://doi.org/10.1145/1873951.1874246.

[54]

Sudarsana R. Kadiri and Paavo Alku. Analysis and detection of pathological voice using glottal source features. IEEE J. Sel. Top. Signal Process., 14(2):367–379, 2020. URL: https://doi.org/10.1109/JSTSP.2019.2957988.

[55]

Juliette Millet and Neil Zeghidour. Learning to detect dysarthria from raw speech. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5831–5835. 2019. URL: https://doi.org/10.1109/ICASSP.2019.8682324.

[56]

J. C. Vásquez-Correa, J. R. Orozco-Arroyave, and E. Nöth. Convolutional neural network to model articulation impairments in patients with Parkinson's disease. In Proc. Interspeech, 314–318. 2017. URL: https://doi.org/10.21437/Interspeech.2017-1078.

[57]

Lang He and Cui Cao. Automated depression analysis using convolutional neural networks from speech. Journal of Biomedical Informatics, 83:103–111, 2018. URL: https://doi.org/10.1016/j.jbi.2018.05.007.

[58]

Tifani Warnita, Nakamasa Inoue, and Koichi Shinoda. Detecting Alzheimer's disease using gated convolutional neural network from audio data. Proc. Interspeech, pages 1706–1710, 2018. URL: https://doi.org/10.21437/Interspeech.2018-1713.

[59]

Raquel Norel, Mary Pietrowicz, Carla Agurto, Shay Rishoni, and Guillermo Cecchi. Detection of Amyotrophic Lateral Sclerosis (ALS) via acoustic analysis. Proc. Interspeech, pages 377–381, 2018. URL: https://doi.org/10.21437/Interspeech.2018-2389.

[60]

Haihua Jiang, Bin Hu, Zhenyu Liu, Lihua Yan, Tianyang Wang, Fei Liu, Huanyu Kang, and Xiaoyu Li. Investigation of different speech types and emotions for detecting depression using different classifiers. Speech Communication, 90:39–46, 2017. URL: https://doi.org/10.1016/j.specom.2017.04.001.

[61]

Jorge Andrés Gómez García, Laureano Moro-Velázquez, and Juan Ignacio Godino-Llorente. On the design of automatic voice condition analysis systems. Part II: review of speaker recognition techniques and study on the effects of different variability factors. Biomed. Signal Process. Control, 48:128–143, 2019. URL: https://doi.org/10.1016/j.bspc.2018.09.003.

[62]

M Catarina Botelho, Isabel Trancoso, Alberto Abad, and Teresa Paiva. Speech as a biomarker for obstructive sleep apnea detection. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5851–5855. IEEE, 2019. URL: https://doi.org/10.1109/ICASSP.2019.8682431.

[63]

Björn W Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, and others. The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates. Proc. INTERSPEECH, pages 431–435, 2021. URL: https://doi.org/10.21437/Interspeech.2021-19.

[64]

Neeraj Kumar Sharma, Ananya Muguli, Prashant Krishnan, Rohit Kumar, Srikanth Raj Chetupalli, and Sriram Ganapathy. Towards sound based testing of COVID-19 – summary of the first diagnostics of COVID-19 using acoustics (DiCOVA) challenge. Computer Speech & Language, 73:101320, 2022. URL: https://doi.org/10.1016/j.csl.2021.101320.

[65]

Frank Rudzicz, Aravind Kumar Namasivayam, and Talya Wolff. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation, 46(4):523–541, 2012. URL: https://doi.org/10.1007/s10579-011-9145-0.

[66]

Heejin Kim, Mark Hasegawa-Johnson, Adrienne Perlman, Jon Gunderson, Thomas S Huang, Kenneth Watkin, and Simone Frame. Dysarthric speech database for universal access research. In Proc. INTERSPEECH, 1741–1744. 2008.

[67]

Manfred Pützer and William J. Barry. Saarbrücken Voice Database, Institute of Phonetics, University of Saarland, 2010. URL: http://www.stimmdatenbank.coli.uni-saarland.de/.

[68]

Pavel Grill and Jana Tučková. Speech databases of typical children and children with SLI. PLoS ONE, 11(3):e0150365, 2016. URL: https://doi.org/10.1371/journal.pone.0150365.

[69]

Jan Rusz, Roman Cmejla, Hana Ruzickova, and Evzen Ruzicka. Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson's disease. The Journal of the Acoustical Society of America, 129(1):350–367, 2011. URL: https://doi.org/10.1121/1.3514381.

[70]

Geoffrey Stewart Morrison, Ewald Enzinger, Daniel Ramos, Joaquín González-Rodríguez, and Alicia Lozano-Díez. Statistical models in forensic voice comparison. In Handbook of Forensic Statistics, pages 451–497. Chapman and Hall/CRC, 2020. URL: https://doi.org/10.1201/9780367527709.

[71]

Didier Meuwly and Raymond Veldhuis. Forensic biometrics: from two communities to one discipline. In 2012 BIOSIG - Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG), 1–12. IEEE, 2012. URL: https://ieeexplore.ieee.org/abstract/document/6313550.

[72]

Christophe Champod and Didier Meuwly. The inference of identity in forensic speaker recognition. Speech Communication, 31(2-3):193–203, 2000. URL: https://doi.org/10.1016/S0167-6393(99)00078-3.

[73]

Joaquin Gonzalez-Rodriguez, Phil Rose, Daniel Ramos, Doroteo T Toledano, and Javier Ortega-Garcia. Emulating DNA: rigorous quantification of evidential weight in transparent and testable forensic speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2104–2115, 2007. URL: https://doi.org/10.1109/TASL.2007.902747.

[74]

Didier Meuwly, Daniel Ramos, and Rudolf Haraksim. A guideline for the validation of likelihood ratio methods used for forensic evidence evaluation. Forensic Science International, 276:142–153, 2017. URL: https://doi.org/10.1016/j.forsciint.2016.03.048.

[75]

Daniel Ramos, Didier Meuwly, Rudolf Haraksim, and Charles EH Berger. Validation of forensic automatic likelihood ratio methods. Handbook of Forensic Statistics, pages 143–162, 2020. URL: https://doi.org/10.1201/9780367527709.

[76]

Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017. URL: https://doi.org/10.48550/arXiv.1711.07128.

[77]

Xiaohui Zhang. Strategies for Handling Out-of-Vocabulary Words in Automatic Speech Recognition. PhD thesis, Johns Hopkins University, 2019. URL: http://jhir.library.jhu.edu/handle/1774.2/62275.

[78]

Björn Schuller and Anton Batliner. Computational paralinguistics: emotion, affect and personality in speech and language processing. John Wiley & Sons, 2013. URL: https://www.wiley.com/en-us/9781118706626.

[79]

Jouni Pohjalainen, Okko Räsänen, and Serdar Kadioglu. Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits. Computer Speech & Language, 29(1):145–171, 2015. URL: https://doi.org/10.1016/j.csl.2013.11.004.

[80]

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020. URL: https://doi.org/10.48550/arXiv.2006.11477.

[81]

Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240, 2019. URL: https://doi.org/10.48550/arXiv.1904.03240.

[82]

Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–152. 1992. URL: https://doi.org/10.1145/130385.130401.

[83]

Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4460–4464. IEEE, 2015. URL: https://doi.org/10.1109/ICASSP.2015.7178814.

[84]

Lawrence R Rabiner and Ronald W Schafer. Introduction to digital speech processing. Foundations and Trends in Signal Processing, 1(1):1–194, 2007. URL: https://doi.org/10.1561/2000000001.

[85]

Andrew J Hunt and Alan W Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, 373–376. IEEE, 1996. URL: https://doi.org/10.1109/ICASSP.1996.541110.

[86]

Douglas O’Shaughnessy. Review of methods for coding of speech signals. EURASIP Journal on Audio, Speech, and Music Processing, 2023. URL: https://doi.org/10.1186/s13636-023-00274-x.

[87]

Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162, 1979. URL: https://doi.org/10.1147/rd.232.0149.

[88]

Jacob Benesty, M Mohan Sondhi, Yiteng Huang, and others. Springer Handbook of Speech Processing. Volume 1. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-49127-9.

[89]

Rainer Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9(5):504–512, 2001. URL: https://doi.org/10.1109/89.928915.

[90]

Steven Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2):113–120, 1979. URL: https://doi.org/10.1109/TASSP.1979.1163209.

[91]

Jean-Marc Valin. A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), 1–5. IEEE, 2018. URL: https://doi.org/10.1109/MMSP.2018.8547084.

[92]

Chengshi Zheng, Huiyong Zhang, Wenzhe Liu, Xiaoxue Luo, Andong Li, Xiaodong Li, and Brian CJ Moore. Sixty years of frequency-domain monaural speech enhancement: from traditional to deep learning methods. Trends in Hearing, 2023. URL: https://doi.org/10.1177/23312165231209913.

[93]

Volodymyr Kuleshov, S. Zayd Enam, and Stefano Ermon. Audio super resolution using neural networks. CoRR, 2017. URL: http://arxiv.org/abs/1708.00853.

[94]

Konstantin Schmidt and Bernd Edler. Blind bandwidth extension based on convolutional and recurrent deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5444–5448. 2018. doi:10.1109/ICASSP.2018.8462691.

[95]

Sebastian Braun and Ivan Tashev. A consolidated view of loss functions for supervised deep learning-based speech enhancement. 2020. doi:10.48550/ARXIV.2009.12286.

[96]

Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A survey on neural speech synthesis. 2021. doi:10.48550/ARXIV.2106.15561.

[97]

Tzu-hsien Huang, Jheng-hao Lin, Chien-yu Huang, and Hung-yi Lee. How far are we from robust voice conversion: a survey. 2020. doi:10.48550/ARXIV.2011.12063.

[98]

Xiang Hao, Chenglin Xu, Nana Hou, Lei Xie, Eng Siong Chng, and Haizhou Li. Time-domain neural network approach for speech bandwidth extension. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 866–870. 2020. doi:10.1109/ICASSP40776.2020.9054551.

[99]

Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai. Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(5):883–894, May 2018. doi:10.1109/TASLP.2018.2798811.

[100]

Mu Wang, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, and Helen Meng. Speech super-resolution using parallel WaveNet. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 260–264. 2018. doi:10.1109/ISCSLP.2018.8706637.

[101]

Bin Liu, Jianhua Tao, Zhengqi Wen, Ya Li, and Danish Bukhari. A novel method of artificial bandwidth extension using deep architecture. In INTERSPEECH. 2015. URL: https://www.isca-speech.org/archive/pdfs/interspeech_2015/liu15g_interspeech.pdf.

[102]

Archit Gupta, Brendan Shillingford, Yannis Assael, and Thomas C. Walters. Speech bandwidth extension with WaveNet. 2019. doi:10.48550/ARXIV.1907.04927.

[103]

Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen. A two-stage approach to speech bandwidth extension. In INTERSPEECH, 5. 2021. URL: https://maigoakisame.github.io/papers/interspeech21b.pdf.

[104]

Teck Yian Lim, Raymond A. Yeh, Yijia Xu, Minh N. Do, and Mark Hasegawa-Johnson. Time-frequency networks for audio super-resolution. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 646–650. 2018. doi:10.1109/ICASSP.2018.8462049.

[105]

Gang Liu, Ke Gong, Xiaodan Liang, and Zhiguang Chen. CP-GAN: context pyramid generative adversarial network for speech enhancement. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6624–6628. 2020. doi:10.1109/ICASSP40776.2020.9054060.

[106]

Rafael Ferro, Nicolas Obin, and Axel Roebel. CycleGAN voice conversion of spectral envelopes using adversarial weights. 2019. doi:10.48550/ARXIV.1910.12614.

[107]

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. 2020. doi:10.48550/ARXIV.2010.05646.

[108]

Jiaqi Su, Yunyun Wang, Adam Finkelstein, and Zeyu Jin. Bandwidth extension is all you need. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 696–700. 2021. doi:10.1109/ICASSP39728.2021.9413575.

[109]

Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and DeLiang Wang. Neural vocoder is all you need for speech super-resolution. 2022. doi:10.48550/ARXIV.2203.14941.

[110]

Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, and Dominik Roblek. Real-time speech frequency bandwidth extension. 2020. doi:10.48550/ARXIV.2010.10677.

[111]

Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey. SDR – half-baked or well done? 2018. doi:10.48550/ARXIV.1811.02508.

[112]

C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320–327, 1976. doi:10.1109/TASSP.1976.1162830.

[113]

Mordechai Azaria and David Hertz. Time delay estimation by generalized cross correlation methods. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):280–285, 1984. URL: https://doi.org/10.1109/TASSP.1984.1164314.

[114]

Byoungho Kwon, Youngjin Park, and Youn-sik Park. Analysis of the GCC-PHAT technique for multiple sources. In ICCAS 2010, 2070–2073. 2010. doi:10.1109/ICCAS.2010.5670137.

[115]

William Havard, Laurent Besacier, and Olivier Rosec. SPEECH-COCO: 600k visually grounded spoken captions aligned to MSCOCO data set. arXiv preprint arXiv:1707.08435, 2017. URL: https://doi.org/10.21437/GLU.2017-9.

[116]

James L McClelland and Jeffrey L Elman. The TRACE model of speech perception. Cognitive Psychology, 18(1):1–86, 1986. URL: https://doi.org/10.1016/0010-0285(86)90015-0.

[117]

Okko Räsänen. Computational modeling of phonetic and lexical learning in early language acquisition: existing models and future directions. Speech Communication, 54(9):975–997, 2012. URL: https://doi.org/10.1016/j.specom.2012.05.001.

[118]

Emmanuel Dupoux. Cognitive science in the era of artificial intelligence: a roadmap for reverse-engineering the infant language-learner. Cognition, 173:43–59, 2018. URL: https://doi.org/10.1016/j.cognition.2017.11.008.

[119]

Luc Steels. The synthetic modeling of language origins. Evolution of Communication, 1(1):1–34, 1997. URL: https://doi.org/10.1075/eoc.1.1.02ste.

[120]

Simon Kirby. Natural language from artificial life. Artificial Life, 8(2):185–215, 2002. URL: https://doi.org/10.1162/106454602320184248.

[121]

David Marr. Vision: A computational investigation into the human representation and processing of visual information. W.H. Freeman and Company, 1982.

[122]

Andrea Weber and Odette Scharenborg. Models of spoken-word recognition. Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):387–401, 2012. doi:10.1002/wcs.1178.

[123]

James S Magnuson, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabi, Kevin Brown, Paul D Allopenna, Rachel M Theodore, Nicholas Monto, and others. EARSHOT: a minimal neural network model of incremental human speech recognition. Cognitive science, 44(4):e12823, 2020. URL: https://doi.org/10.1111/cogs.12823.

[124]

Janet F Werker and Richard C Tees. Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1):49–63, 1984. URL: https://doi.org/10.1016/S0163-6383(84)80022-3.

[125]

Jenny R Saffran, Richard N Aslin, and Elissa L Newport. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928, 1996. URL: https://doi.org/10.1126/science.274.5294.1926.

[126]

Jessica Maye, Janet F Werker, and LouAnn Gerken. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3):B101–B111, 2002. URL: https://doi.org/10.1016/S0010-0277(01)00157-3.

[127]

Jenny R Saffran and Natasha Z Kirkham. Infant statistical learning. Annual Review of Psychology, 69:181–203, 2018. URL: https://doi.org/10.1146/annurev-psych-122216-011805.

[128]

Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association. 2015. URL: https://www.isca-speech.org/archive_v0/interspeech_2015/papers/i15_1912.pdf.

[129]

Okko Räsänen and Heikki Rasilo. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological review, 122(4):792, 2015. URL: https://psycnet.apa.org/doi/10.1037/a0039702.

[130]

Sofoklis Kakouros and Okko Räsänen. 3PRO – an unsupervised method for the automatic detection of sentence prominence in speech. Speech Communication, 82:67–84, 2016. URL: https://doi.org/10.1016/j.specom.2016.06.004.

[131]

Herman Kamper, Aren Jansen, and Sharon Goldwater. A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46:154–174, 2017. URL: https://doi.org/10.1016/j.csl.2017.04.008.

[132]

Okko Räsänen, Gabriel Doyle, and Michael C Frank. Pre-linguistic segmentation of speech into syllable-like units. Cognition, 171:130–150, 2018. URL: https://doi.org/10.1016/j.cognition.2017.11.003.

[133]

Dennis Norris. Shortlist: a connectionist model of continuous speech recognition. Cognition, 52(3):189–234, 1994. URL: https://doi.org/10.1016/0010-0277(94)90043-4.

[134]

Shinji Maeda. Improved articulatory models. The Journal of the Acoustical Society of America, 84(S1):S146, 1988. URL: https://doi.org/10.1121/1.2025845.

[135]

Peter Birkholz. 3D-Artikulatorische Sprachsynthese [3D articulatory speech synthesis]. PhD thesis, Universität Rostock, 2005. URL: https://www.vocaltractlab.de/publications/birkholz-2005-dissertation.pdf.

[136]

Peter Birkholz, Lucia Martin, Klaus Willmes, Bernd J Kröger, and Christiane Neuschaefer-Rube. The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. The Journal of the Acoustical Society of America, 137(3):1503–1512, 2015. URL: https://doi.org/10.1121/1.4906836.

[137]

Jason A Tourville and Frank H Guenther. The DIVA model: a neural theory of speech acquisition and production. Language and Cognitive Processes, 26(7):952–981, 2011. URL: https://doi.org/10.1080/01690960903498424.

[138]

Ian S Howard and Piers Messum. Learning to pronounce first words in three languages: an investigation of caregiver and infant behavior using a computational model of an infant. PLoS One, 9(10):e110334, 2014. URL: https://doi.org/10.1371/journal.pone.0110334.

[139]

Heikki Rasilo and Okko Räsänen. An online model for vowel imitation learning. Speech Communication, 86:1–23, 2017. URL: https://doi.org/10.1016/j.specom.2016.10.010.

[140]

Pierre-Yves Oudeyer, George Kachergis, and William Schueller. Computational and robotic models of early language development: a review. In J.S. Horst and J. von Koss Torkildsen, editors, International handbook of language acquisition. Routledge/Taylor & Francis Group, 2019. URL: https://psycnet.apa.org/doi/10.4324/9781315110622-5.

[141]

Silas Rech. Multi-device speech enhancement for privacy and quality. Master's thesis, Aalto University, 2022.

[142]

Christopher Cieri, David Miller, and Kevin Walker. The Fisher corpus: a resource for the next generations of speech-to-text. In LREC, volume 4, 69–71. 2004. URL: http://www.lrec-conf.org/proceedings/lrec2004/pdf/767.pdf.

[143]

Robin Scheibler, Eric Bezzam, and Ivan Dokmanić. Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 351–355. IEEE, 2018. URL: https://github.com/LCAV/pyroomacoustics.

[144]

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. IEEE, 2015. URL: https://doi.org/10.1109/ICASSP.2015.7178964.

[145]

Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. CoRR, 2018. URL: http://arxiv.org/abs/1804.03209, arXiv:1804.03209.

[146]

Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: deep speaker recognition. In Proc. Interspeech 2018, 1086–1090. 2018. URL: http://dx.doi.org/10.21437/Interspeech.2018-1929, doi:10.21437/Interspeech.2018-1929.

[147]

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 993–1003. August 2021. URL: https://aclanthology.org/2021.acl-long.80.

[148]

Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92). 2019. URL: https://doi.org/10.7488/ds/2645.

[149]

Xabier Lareo. Smart speakers and virtual assistants. TechDispatch #1, 2019. URL: https://data.europa.eu/doi/10.2804/004275.

[150]

Andreas Nautsch, Catherine Jasserand, Els Kindt, Massimiliano Todisco, Isabel Trancoso, and Nicholas Evans. The GDPR & speech data: reflections of legal and technology communities, first steps towards a common understanding. arXiv preprint arXiv:1907.03458, 2019. URL: https://doi.org/10.21437/Interspeech.2019-2647.

[151]

Sandra Petronio. Boundaries of Privacy: Dialectics of Disclosure. SUNY Press, 2002.

[152]

Alexandra König, Aharon Satt, Alexander Sorin, Ron Hoory, Orith Toledo-Ronen, Alexandre Derreumaux, Valeria Manera, Frans Verhey, Pauline Aalten, Phillipe H Robert, and others. Automatic speech analysis for the assessment of patients with predementia and Alzheimer's disease. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, 1(1):112–124, 2015. URL: https://doi.org/10.1016/j.dadm.2014.11.012.

[153]

Rachel L Finn, David Wright, and Michael Friedewald. Seven types of privacy. In European Data Protection: Coming of Age, pages 3–32. Springer, 2013. URL: https://doi.org/10.1007/978-94-007-5170-5_1.

[154]

Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge computing: vision and challenges. IEEE Internet of Things Journal, 3(5):637–646, 2016. URL: https://doi.org/10.1109/JIOT.2016.2579198.

[155]

Frederik Armknecht, Colin Boyd, Christopher Carr, Kristian Gjøsteen, Angela Jäschke, Christian A Reuter, and Martin Strand. A guide to fully homomorphic encryption. IACR Cryptology ePrint Archive, 2015:1192, 2015. URL: https://ia.cr/2015/1192.

[156]

Antti Poikola, Kai Kuikkaniemi, and Harri Honko. MyData – a Nordic model for human-centered personal data management and processing. 2015. URL: http://urn.fi/URN:ISBN:978-952-243-455-5.

[157]

S.C. Chen and G.S. Dhillon. Interpreting dimensions of consumer trust in e-commerce. Information Technology and Management, 4:303–318, 2003. doi:10.1023/A:102296263124.

[158]

Yi Xie and Siqing Peng. How to repair customer trust after negative publicity: the roles of competence, integrity, benevolence, and forgiveness. Psychology & Marketing, 26(7):572–589, 2009. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/mar.20289, doi:10.1002/mar.20289.