Wav2vec behind the Scenes: How end2end Models learn Phonetics

tom Dieck T, Pérez Toro PA, Arias Vergara T, Nöth E, Klumpp P (2022)


Publication Type: Conference contribution

Publication year: 2022

Publisher: International Speech Communication Association

Book Volume: 2022-September

Pages Range: 5130-5134

Conference Proceedings Title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Event location: Incheon, KR

DOI: 10.21437/Interspeech.2022-10865

Abstract

End2end models have become extremely popular in recent years. Whilst they excel at tasks like acoustic modelling or full-fledged speech recognition, their decision-making process can be quite complex to retrace due to their black-box character. As end2end models learn high-level feature extraction on the fly, outputs from hidden layers within the network have been used as feature vectors in various studies to perform transfer learning. It is therefore crucial to understand how extracted hidden activations transport information collected from the signal. Furthermore, is the traditional categorization into feature extractor and temporal analysis still applicable to the sub-parts of end2end models? Using Wav2vec 2.0 as an example, we show how an acoustic model learns to perform a frequency analysis on a speech waveform. Our experiments also show that phonetic information about speech production is preserved in extracted feature vectors. Ultimately, our findings highlight how different parts of an end2end model encode information on entirely different levels. Whilst the influence of gender is quite large on early feature vectors, it vanishes after temporal contextualization. At the same time, hidden activations which include context information are superimposed by language-related patterns.
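The abstract states that the acoustic model's convolutional front end learns to perform a frequency analysis on the raw waveform. A generic way to probe this claim (not the authors' exact procedure) is to take each learned 1-D convolution filter and locate the peak of its magnitude spectrum, yielding an estimated centre frequency per filter. The sketch below uses NumPy with a synthetic sinusoidal "filter bank" in place of real Wav2vec 2.0 weights; the kernel size of 64 and the 16 kHz sample rate are illustrative assumptions.

```python
import numpy as np

def filter_center_frequencies(filters, sample_rate=16000):
    """Estimate the dominant frequency of each 1-D convolution filter
    by locating the peak of its zero-padded magnitude spectrum.

    filters: array of shape (n_filters, kernel_size), e.g. the weights
    of the first CNN layer of an end2end acoustic model.
    """
    n_fft = 512  # zero-pad short kernels for finer frequency resolution
    spectra = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))
    peak_bins = spectra.argmax(axis=1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return freqs[peak_bins]

# Illustration with synthetic "filters": pure sinusoids at known
# frequencies, so the estimated centres should recover 1000 and 4000 Hz.
t = np.arange(64) / 16000.0
bank = np.stack([np.sin(2 * np.pi * f * t) for f in (1000.0, 4000.0)])
print(filter_center_frequencies(bank))
```

Applied to real front-end weights, sorting filters by their estimated centre frequency makes any learned filter-bank structure (e.g. a mel-like frequency tiling) visible at a glance.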

How to cite

APA:

tom Dieck, T., Pérez Toro, P.A., Arias Vergara, T., Nöth, E., & Klumpp, P. (2022). Wav2vec behind the Scenes: How end2end Models learn Phonetics. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 5130-5134). Incheon, KR: International Speech Communication Association.

MLA:

tom Dieck, Teena, et al. "Wav2vec behind the Scenes: How end2end Models learn Phonetics." Proceedings of the 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, Incheon: International Speech Communication Association, 2022. 5130-5134.
