tom Dieck T, Perez Toro PA, Arias Vergara T, Nöth E, Klumpp P (2022)
Publication Type: Conference contribution
Publication year: 2022
Publisher: International Speech Communication Association
Book Volume: 2022-September
Page Range: 5130-5134
Conference Proceedings Title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOI: 10.21437/Interspeech.2022-10865
End2end models have become extremely popular in recent years. Whilst they excel at tasks like acoustic modelling or full-fledged speech recognition, their decision-making process can be difficult to retrace due to their black-box character. As end2end models learn high-level feature extraction on the fly, outputs from hidden layers within the network have been used as feature vectors in various studies to perform transfer learning. It is therefore crucial to understand how extracted hidden activations transport information collected from the signal. Furthermore, is the traditional categorization into feature extractor and temporal analysis still applicable to the sub-parts of end2end models? Using Wav2vec 2.0 as an example, we show how an acoustic model learns to perform a frequency analysis on a speech waveform. Our experiments also show that phonetic information about speech production is preserved in extracted feature vectors. Ultimately, our findings highlight how different parts of an end2end model encode information on entirely different levels. Whilst the influence of gender on early feature vectors is quite large, it vanishes after temporal contextualization. At the same time, hidden activations that include context information are superimposed by language-related patterns.
APA:
tom Dieck, T., Perez Toro, P.A., Arias Vergara, T., Nöth, E., & Klumpp, P. (2022). Wav2vec behind the Scenes: How end2end Models learn Phonetics. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 5130-5134). Incheon, KR: International Speech Communication Association.
MLA:
tom Dieck, Teena, et al. "Wav2vec behind the Scenes: How end2end Models learn Phonetics." Proceedings of the 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, Incheon: International Speech Communication Association, 2022. 5130-5134.