15. References


Dan Jurafsky and James H. Martin. Speech and Language Processing. Stanford University, 3 edition, 2021. URL: https://web.stanford.edu/~jurafsky/slp3/.


Jessica Gasiorek. Message processing: The science of creating understanding. UH Mānoa Outreach College, 2018. URL: http://pressbooks-dev.oer.hawaii.edu/messageprocessing/.


Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org.


Julius Orion Smith. Spectral audio signal processing. W3K, 2011. URL: https://ccrma.stanford.edu/~jos/sasp/.


S. Greenberg, H. Carvey, L. Hitchcock, and S. Chang. Temporal properties of spontaneous speech – a syllable-centric perspective. Journal of Phonetics, 31:465–485, 2003. URL: https://doi.org/10.1016/j.wocn.2003.09.005.


P. Hallé and A. Cristia. Global and detailed speech representations in early language acquisition. In S. Fuchs, M. Weirich, D. Pape, and P. Perrier, editors, Speech planning and dynamics. Peter Lang, Frankfurt am Main, 2012.


I. Y. Liberman, D. Shankweiler, W. F. Fischer, and B. Carter. Explicit syllable and phoneme segmentation in the young child. Journal of Experimental Child Psychology, 18:201–212, 1974.


J. Morais, A. Content, L. Cary, J. Mehler, and J. Segui. Syllabic segmentation and literacy. Language and Cognitive Processes, 4(1):56–67, 1989.


M. Nespor, M. Shukla, and J. Mehler. Stress‐timed vs. syllable‐timed languages. In van Oostendorp et al., editors, The Blackwell Companion to Phonology, pages 1147–1159. Blackwell, Malden, MA, 2011.


H. C. Nusbaum and J. DeGroot. The role of syllables in speech perception. In M. S. Ziolkowski, M. Noske, and K. Deaton, editors, Papers from the parasession on the syllable in phonetics and phonology. Chicago Linguistic Society, Chicago, 1991.


Janet B. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, Massachusetts Institute of Technology, 1980.


Kenneth L Pike. The Intonation of American English. University of Michigan Press, Ann Arbor, Mich., 1945.


J. F. Werker and R. C. Tees. Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant Behavior & Development, 7(1):49–63, 1984. URL: https://doi.org/10.1016/S0163-6383(84)80022-3.


Peter Noll. A comparative study of various quantization schemes for speech encoding. Bell System Technical Journal, 54(9):1597–1614, 1975. URL: https://doi.org/10.1002/j.1538-7305.1975.tb02053.x.


John Makhoul. Linear prediction: a tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975. URL: https://doi.org/10.1109/PROC.1975.9792.


Tom Bäckström, Jérémie Lecomte, Guillaume Fuchs, Sascha Disch, and Christian Uhle. Speech coding: with code-excited linear prediction. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-50204-5.


Homer Dudley. Remaking speech. The Journal of the Acoustical Society of America, 11(2):169–177, 1939. URL: https://doi.org/10.1121/1.1916020.


Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), volume 2, 749–752. IEEE, 2001. URL: https://doi.org/10.1109/ICASSP.2001.941023.


John G Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part i—temporal alignment. Journal of the Audio Engineering Society, 61(6):366–384, 2013. URL: http://www.aes.org/e-lib/browse.cfm?elib=16829.


Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Catherine Colomes. PEAQ - The ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3–29, 2000. URL: http://www.aes.org/e-lib/browse.cfm?elib=12078.


Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125–2136, 2011. URL: https://doi.org/10.1109/TASL.2011.2114881.


Augustine Gray and John Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380–391, 1976. URL: https://doi.org/10.1109/TASSP.1976.1162849.


Geoffrey Stewart Morrison, Ewald Enzinger, Daniel Ramos, Joaquín González-Rodríguez, and Alicia Lozano-Díez. Statistical models in forensic voice comparison. In Handbook of forensic statistics, pages 451–497. Chapman and Hall/CRC, 2020. URL: https://doi.org/10.1201/9780367527709.


Didier Meuwly and Raymond Veldhuis. Forensic biometrics: from two communities to one discipline. In 2012 BIOSIG-Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG), 1–12. IEEE, 2012. URL: https://ieeexplore.ieee.org/abstract/document/6313550.


Christophe Champod and Didier Meuwly. The inference of identity in forensic speaker recognition. Speech communication, 31(2-3):193–203, 2000. URL: https://doi.org/10.1016/S0167-6393(99)00078-3.


Joaquin Gonzalez-Rodriguez, Phil Rose, Daniel Ramos, Doroteo T Toledano, and Javier Ortega-Garcia. Emulating dna: rigorous quantification of evidential weight in transparent and testable forensic speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2104–2115, 2007. URL: https://doi.org/10.1109/TASL.2007.902747.


Didier Meuwly, Daniel Ramos, and Rudolf Haraksim. A guideline for the validation of likelihood ratio methods used for forensic evidence evaluation. Forensic science international, 276:142–153, 2017. URL: https://doi.org/10.1016/j.forsciint.2016.03.048.


Daniel Ramos, Didier Meuwly, Rudolf Haraksim, and Charles EH Berger. Validation of forensic automatic likelihood ratio methods. Handbook of Forensic Statistics, pages 143–162, 2020. URL: https://doi.org/10.1201/9780367527709.


Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017. URL: https://doi.org/10.48550/arXiv.1711.07128.


Xiaohui Zhang. Strategies for Handling Out-of-Vocabulary Words in Automatic Speech Recognition. PhD thesis, Johns Hopkins University, 2019. URL: http://jhir.library.jhu.edu/handle/1774.2/62275.


Björn Schuller and Anton Batliner. Computational paralinguistics: emotion, affect and personality in speech and language processing. John Wiley & Sons, 2013. URL: https://www.wiley.com/en-us/9781118706626.


Jouni Pohjalainen, Okko Räsänen, and Serdar Kadioglu. Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits. Computer Speech & Language, 29(1):145–171, 2015. URL: https://doi.org/10.1016/j.csl.2013.11.004.


Florian Eyben, Martin Wöllmer, and Björn Schuller. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, 1459–1462. 2010. URL: https://doi.org/10.1145/1873951.1874246.


Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020. URL: https://doi.org/10.48550/arXiv.2006.11477.


Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass. An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240, 2019. URL: https://doi.org/10.48550/arXiv.1904.03240.


Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, 144–152. 1992. URL: https://doi.org/10.1145/130385.130401.


Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), 4460–4464. IEEE, 2015. URL: https://doi.org/10.1109/ICASSP.2015.7178814.


Lawrence R Rabiner and Ronald W Schafer. Introduction to digital speech processing. Foundations and Trends in Signal Processing, 1(1):1–194, 2007. URL: https://doi.org/10.1561/2000000001.


Andrew J Hunt and Alan W Black. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, 373–376. IEEE, 1996. URL: https://doi.org/10.1109/ICASSP.1996.541110.


Jorma Rissanen and Glen G Langdon. Arithmetic coding. IBM Journal of research and development, 23(2):149–162, 1979. URL: https://doi.org/10.1147/rd.232.0149.


Jacob Benesty, M Mohan Sondhi, Yiteng Huang, and others. Springer handbook of speech processing. Volume 1. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-49127-9.


Rainer Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on speech and audio processing, 9(5):504–512, 2001. URL: https://doi.org/10.1109/89.928915.


Steven Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on acoustics, speech, and signal processing, 27(2):113–120, 1979. URL: https://doi.org/10.1109/TASSP.1979.1163209.


Volodymyr Kuleshov, S. Zayd Enam, and Stefano Ermon. Audio super resolution using neural networks. CoRR, 2017. URL: http://arxiv.org/abs/1708.00853.


Konstantin Schmidt and Bernd Edler. Blind bandwidth extension based on convolutional and recurrent deep neural networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5444–5448. 2018. doi:10.1109/ICASSP.2018.8462691.


Sebastian Braun and Ivan Tashev. A consolidated view of loss functions for supervised deep learning-based speech enhancement. 2020. doi:10.48550/ARXIV.2009.12286.


Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A survey on neural speech synthesis. 2021. doi:10.48550/ARXIV.2106.15561.


Tzu-hsien Huang, Jheng-hao Lin, Chien-yu Huang, and Hung-yi Lee. How far are we from robust voice conversion: a survey. 2020. doi:10.48550/ARXIV.2011.12063.


Xiang Hao, Chenglin Xu, Nana Hou, Lei Xie, Eng Siong Chng, and Haizhou Li. Time-domain neural network approach for speech bandwidth extension. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 866–870. 2020. doi:10.1109/ICASSP40776.2020.9054551.


Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai. Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(5):883–894, May 2018. doi:10.1109/taslp.2018.2798811.


Mu Wang, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, and Helen Meng. Speech super-resolution using parallel wavenet. In 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), 260–264. 2018. doi:10.1109/ISCSLP.2018.8706637.


Bin Liu, Jianhua Tao, Zhengqi Wen, Ya Li, and Danish Bukhari. A novel method of artificial bandwidth extension using deep architecture. In INTERSPEECH. 2015. URL: https://www.isca-speech.org/archive/pdfs/interspeech_2015/liu15g_interspeech.pdf.


Archit Gupta, Brendan Shillingford, Yannis Assael, and Thomas C. Walters. Speech bandwidth extension with wavenet. 2019. doi:10.48550/ARXIV.1907.04927.


Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, and Christian Fuegen. A Two-Stage Approach to Speech Bandwidth Extension. In INTERSPEECH, 5. 2021. URL: https://maigoakisame.github.io/papers/interspeech21b.pdf.


Teck Yian Lim, Raymond A. Yeh, Yijia Xu, Minh N. Do, and Mark Hasegawa-Johnson. Time-frequency networks for audio super-resolution. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 646–650. 2018. doi:10.1109/ICASSP.2018.8462049.


Gang Liu, Ke Gong, Xiaodan Liang, and Zhiguang Chen. Cp-gan: context pyramid generative adversarial network for speech enhancement. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6624–6628. 2020. doi:10.1109/ICASSP40776.2020.9054060.


Rafael Ferro, Nicolas Obin, and Axel Roebel. Cyclegan voice conversion of spectral envelopes using adversarial weights. 2019. doi:10.48550/ARXIV.1910.12614.


Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: generative adversarial networks for efficient and high fidelity speech synthesis. 2020. doi:10.48550/ARXIV.2010.05646.


Jiaqi Su, Yunyun Wang, Adam Finkelstein, and Zeyu Jin. Bandwidth extension is all you need. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 696–700. 2021. doi:10.1109/ICASSP39728.2021.9413575.


Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and DeLiang Wang. Neural vocoder is all you need for speech super-resolution. 2022. doi:10.48550/ARXIV.2203.14941.


Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, and Dominik Roblek. Real-time speech frequency bandwidth extension. 2020. doi:10.48550/ARXIV.2010.10677.


Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R. Hershey. Sdr - half-baked or well done? 2018. doi:10.48550/ARXIV.1811.02508.


C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4):320–327, 1976. doi:10.1109/TASSP.1976.1162830.


Mordechai Azaria and David Hertz. Time delay estimation by generalized cross correlation methods. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):280–285, 1984. URL: https://doi.org/10.1109/TASSP.1984.1164314.


Byoungho Kwon, Youngjin Park, and Youn-sik Park. Analysis of the gcc-phat technique for multiple sources. In ICCAS 2010, 2070–2073. 2010. doi:10.1109/ICCAS.2010.5670137.


William Havard, Laurent Besacier, and Olivier Rosec. Speech-coco: 600k visually grounded spoken captions aligned to mscoco data set. arXiv preprint arXiv:1707.08435, 2017. URL: https://doi.org/10.21437/GLU.2017-9.


James L McClelland and Jeffrey L Elman. The trace model of speech perception. Cognitive psychology, 18(1):1–86, 1986. URL: https://doi.org/10.1016/0010-0285(86)90015-0.


Okko Räsänen. Computational modeling of phonetic and lexical learning in early language acquisition: existing models and future directions. Speech Communication, 54(9):975–997, 2012. URL: https://doi.org/10.1016/j.specom.2012.05.001.


Emmanuel Dupoux. Cognitive science in the era of artificial intelligence: a roadmap for reverse-engineering the infant language-learner. Cognition, 173:43–59, 2018. URL: https://doi.org/10.1016/j.cognition.2017.11.008.


Luc Steels. The synthetic modeling of language origins. Evolution of communication, 1(1):1–34, 1997. URL: https://doi.org/10.1075/eoc.1.1.02ste.


Simon Kirby. Natural language from artificial life. Artificial life, 8(2):185–215, 2002. URL: https://doi.org/10.1162/106454602320184248.


David Marr. Vision: A computational investigation into the human representation and processing of visual information. W.H. Freeman and Company, 1982.


Andrea Weber and Odette Scharenborg. Models of spoken-word recognition. Wiley Interdisciplinary Reviews: Cognitive Science, 3(3):387–401, 2012. doi:10.1002/wcs.1178.


James S Magnuson, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabi, Kevin Brown, Paul D Allopenna, Rachel M Theodore, Nicholas Monto, and others. Earshot: a minimal neural network model of incremental human speech recognition. Cognitive science, 44(4):e12823, 2020. URL: https://doi.org/10.1111/cogs.12823.


Janet F Werker and Richard C Tees. Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant behavior and development, 7(1):49–63, 1984. URL: https://doi.org/10.1016/S0163-6383(84)80022-3.


Jenny R Saffran, Richard N Aslin, and Elissa L Newport. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928, 1996. URL: https://doi.org/10.1126/science.274.5294.1926.


Jessica Maye, Janet F Werker, and LouAnn Gerken. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3):B101–B111, 2002. URL: https://doi.org/10.1016/S0010-0277(01)00157-3.


Jenny R Saffran and Natasha Z Kirkham. Infant statistical learning. Annual review of psychology, 69:181–203, 2018. URL: https://doi.org/10.1146/annurev-psych-122216-011805.


Tasha Nagamine, Michael L Seltzer, and Nima Mesgarani. Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association. 2015. URL: https://www.isca-speech.org/archive_v0/interspeech_2015/papers/i15_1912.pdf.


Okko Räsänen and Heikki Rasilo. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological review, 122(4):792, 2015. URL: https://psycnet.apa.org/doi/10.1037/a0039702.


Sofoklis Kakouros and Okko Räsänen. 3pro–an unsupervised method for the automatic detection of sentence prominence in speech. Speech Communication, 82:67–84, 2016. URL: https://doi.org/10.1016/j.specom.2016.06.004.


Herman Kamper, Aren Jansen, and Sharon Goldwater. A segmental framework for fully-unsupervised large-vocabulary speech recognition. Computer Speech & Language, 46:154–174, 2017. URL: https://doi.org/10.1016/j.csl.2017.04.008.


Okko Räsänen, Gabriel Doyle, and Michael C Frank. Pre-linguistic segmentation of speech into syllable-like units. Cognition, 171:130–150, 2018. URL: https://doi.org/10.1016/j.cognition.2017.11.003.


Dennis Norris. Shortlist: a connectionist model of continuous speech recognition. Cognition, 52(3):189–234, 1994. URL: https://doi.org/10.1016/0010-0277(94)90043-4.


Shinji Maeda. Improved articulatory models. The Journal of the Acoustical Society of America, 84(S1):S146–S146, 1988. URL: https://doi.org/10.1121/1.2025845.


Peter Birkholz. 3D-Artikulatorische Sprachsynthese. PhD thesis, Universität Rostock, 2005. URL: https://www.vocaltractlab.de/publications/birkholz-2005-dissertation.pdf.


Peter Birkholz, Lucia Martin, Klaus Willmes, Bernd J Kröger, and Christiane Neuschaefer-Rube. The contribution of phonation type to the perception of vocal emotions in german: an articulatory synthesis study. The Journal of the Acoustical Society of America, 137(3):1503–1512, 2015. URL: https://doi.org/10.1121/1.4906836.


Jason A Tourville and Frank H Guenther. The diva model: a neural theory of speech acquisition and production. Language and cognitive processes, 26(7):952–981, 2011. URL: https://doi.org/10.1080/01690960903498424.


Ian S Howard and Piers Messum. Learning to pronounce first words in three languages: an investigation of caregiver and infant behavior using a computational model of an infant. PLoS One, 9(10):e110334, 2014. URL: https://doi.org/10.1371/journal.pone.0110334.


Heikki Rasilo and Okko Räsänen. An online model for vowel imitation learning. Speech Communication, 86:1–23, 2017. URL: https://doi.org/10.1016/j.specom.2016.10.010.


Pierre-Yves Oudeyer, George Kachergis, and William Schueller. Computational and robotic models of early language development: a review. In J.S. Horst and J. von Koss Torkildsen, editors, International handbook of language acquisition. Routledge/Taylor & Francis Group, 2019. URL: https://psycnet.apa.org/doi/10.4324/9781315110622-5.


Xabier Lareo. Smart speakers and virtual assistants. TechDispatch #1, 2019. URL: https://data.europa.eu/doi/10.2804/004275.


Andreas Nautsch, Catherine Jasserand, Els Kindt, Massimiliano Todisco, Isabel Trancoso, and Nicholas Evans. The gdpr & speech data: reflections of legal and technology communities, first steps towards a common understanding. arXiv preprint arXiv:1907.03458, 2019. URL: https://doi.org/10.21437/Interspeech.2019-2647.


Sandra Petronio. Boundaries of privacy: Dialectics of disclosure. Suny Press, 2002.


Alexandra König, Aharon Satt, Alexander Sorin, Ron Hoory, Orith Toledo-Ronen, Alexandre Derreumaux, Valeria Manera, Frans Verhey, Pauline Aalten, Phillipe H Robert, and others. Automatic speech analysis for the assessment of patients with predementia and Alzheimer's disease. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring, 1(1):112–124, 2015. URL: https://doi.org/10.1016/j.dadm.2014.11.012.


Rachel L Finn, David Wright, and Michael Friedewald. Seven types of privacy. In European data protection: coming of age, pages 3–32. Springer, 2013. URL: https://doi.org/10.1007/978-94-007-5170-5_1.


Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge computing: vision and challenges. IEEE internet of things journal, 3(5):637–646, 2016. URL: https://doi.org/10.1109/JIOT.2016.2579198.


Frederik Armknecht, Colin Boyd, Christopher Carr, Kristian Gjøsteen, Angela Jäschke, Christian A Reuter, and Martin Strand. A guide to fully homomorphic encryption. IACR Cryptology ePrint Archive, 2015:1192, 2015. URL: https://ia.cr/2015/1192.


Antti Poikola, Kai Kuikkaniemi, and Harri Honko. Mydata – a nordic model for human-centered personal data management and processing. 2015. URL: http://urn.fi/URN:ISBN:978-952-243-455-5.


S.C. Chen and G.S. Dhillon. Interpreting dimensions of consumer trust in e-commerce. Information Technology and Management, 4:303–318, 2003. doi:10.1023/A:102296263124.


Yi Xie and Siqing Peng. How to repair customer trust after negative publicity: the roles of competence, integrity, benevolence, and forgiveness. Psychology & Marketing, 26(7):572–589, 2009. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/mar.20289, doi:10.1002/mar.20289.