A scalable architecture for multilingual speech recognition on embedded devices

Raab M, Gruhn R, Nöth E (2011)


Publication Status: Published

Publication Type: Journal article

Publication year: 2011

Journal: Speech Communication

Publisher: Elsevier

Volume: 53

Pages Range: 62-74

Journal Issue: 1

DOI: 10.1016/j.specom.2010.07.007

Abstract

In-car infotainment and navigation devices are typical examples of successful speech-based interfaces. While classical applications such as voice commands or destination input are monolingual, the trend is towards multilingual applications, for example music player control or multilingual destination input. As soon as more languages are considered, the training and decoding complexity of the speech recognizer increases. For large multilingual systems, some form of parameter tying is needed to keep decoding feasible on embedded systems with limited resources. A traditional technique for this is to use a semi-continuous Hidden Markov Model (HMM) as the acoustic model. However, the monolingual codebook on which such a system relies is not appropriate for multilingual recognition. We introduce Multilingual Weighted Codebooks, which give good results with low decoding complexity. Because these codebooks depend on the actual language combination, they increase the training complexity, so an algorithm is needed to reduce it. Our first proposal consists of mathematically motivated projections between Hidden Markov Models defined in Gaussian spaces. Although theoretically optimal, these projections were difficult to employ directly in speech decoders. We found approximated projections to be most effective in practice: they give good performance without requiring major modifications to the common speech recognizer architecture. By combining Multilingual Weighted Codebooks with Gaussian Mixture Model (GMM) projections, we create an efficient and scalable architecture for non-native speech recognition. This architecture addresses the combinatorial problems of training and decoding for multiple languages. It builds new multilingual systems in only 0.002% of the time of a traditional HMM training and achieves comparable performance on foreign languages. © 2010 Elsevier B.V. All rights reserved.
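The abstract does not give the model equations. As rough orientation only, the sketch below illustrates the general semi-continuous HMM idea it builds on: every state scores a frame with its own mixture weights over one shared codebook of diagonal Gaussians, so the Gaussian densities can be shared across all states. The `project_gmm` function, which moves the weight of each component of a separately trained GMM onto the closest codebook Gaussian by KL divergence, is a hypothetical stand-in for the paper's approximated projections, not the authors' algorithm. All class and function names are illustrative assumptions, and the construction of the Multilingual Weighted Codebook itself is not reproduced.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagGaussian:
    mean: List[float]
    var: List[float]  # diagonal covariance entries, all assumed > 0

    def log_pdf(self, x: Sequence[float]) -> float:
        """Log density of a diagonal-covariance Gaussian at x."""
        total = 0.0
        for xi, m, v in zip(x, self.mean, self.var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        return total


def kl_diag(p: DiagGaussian, q: DiagGaussian) -> float:
    """KL(p || q) between two diagonal Gaussians."""
    kl = 0.0
    for mp, vp, mq, vq in zip(p.mean, p.var, q.mean, q.var):
        kl += 0.5 * (vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp))
    return kl


def sc_state_loglik(x: Sequence[float], codebook: List[DiagGaussian],
                    weights: List[float]) -> float:
    """Semi-continuous emission score: log sum_k w_k N(x; mu_k, var_k).

    The codebook is shared by all states; a real decoder would evaluate the
    codebook densities once per frame and reuse them for every state.
    """
    terms = [math.log(w) + g.log_pdf(x)
             for g, w in zip(codebook, weights) if w > 0.0]
    if not terms:
        raise ValueError("state has no positive codebook weights")
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))


def project_gmm(gmm_weights: List[float], gmm_components: List[DiagGaussian],
                codebook: List[DiagGaussian]) -> List[float]:
    """Hypothetical approximate projection: assign each GMM component's
    weight to the codebook Gaussian with the smallest KL divergence."""
    projected = [0.0] * len(codebook)
    for w, comp in zip(gmm_weights, gmm_components):
        best = min(range(len(codebook)),
                   key=lambda k: kl_diag(comp, codebook[k]))
        projected[best] += w
    return projected


# Usage: a toy two-Gaussian codebook and a two-component state GMM.
codebook = [DiagGaussian([0.0, 0.0], [1.0, 1.0]),
            DiagGaussian([3.0, 3.0], [1.0, 1.0])]
gmm_w = [0.7, 0.3]
gmm_c = [DiagGaussian([0.2, -0.1], [1.2, 0.9]),
         DiagGaussian([2.8, 3.1], [1.0, 1.1])]
state_w = project_gmm(gmm_w, gmm_c, codebook)
print(state_w)  # [0.7, 0.3]
print(sc_state_loglik([0.1, 0.0], codebook, state_w))
```

The nearest-Gaussian assignment is chosen here only because it keeps the decoder unchanged: the projected state is again a weight vector over the shared codebook, which matches the abstract's point that approximated projections avoid major changes to the recognizer architecture.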

How to cite

APA:

Raab, M., Gruhn, R., & Nöth, E. (2011). A scalable architecture for multilingual speech recognition on embedded devices. Speech Communication, 53(1), 62-74. https://doi.org/10.1016/j.specom.2010.07.007

MLA:

Raab, Martin, Rainer Gruhn, and Elmar Nöth. "A scalable architecture for multilingual speech recognition on embedded devices." Speech Communication 53.1 (2011): 62-74.