Towards End-to-End Speech Articulation and Spoken Language Analysis Using Deep Learning

Weise T, Demir K, Perez Toro PA, Arias Vergara T, Maier A, Nöth E, Schuster M, Heismann B, Yang SH (2025)


Publication Type: Journal article

Publication year: 2025

Journal: Human-Centric Intelligent Systems

Book Volume: 5

Page Range: 103-122

Journal Issue: 1

DOI: 10.1007/s44230-025-00094-6

Abstract

This study presents a speech and spoken language analysis framework, leveraging a robust, end-to-end deep learning model developed in our prior work. The framework represents a foundational step towards a comprehensive solution for analyzing speech articulation and spoken language. Unlike traditional approaches that rely on separate specialized models, our architecture integrates multiple prediction tasks into a single multi-task learning setup: nine articulatory trajectories, a phoneme sequence, and phoneme alignment. While conceptually distinct, these outputs share a strong underlying relation: phonemes, as the fundamental building blocks of language, emerge from specific articulatory configurations, and phoneme alignment provides crucial temporal structure. We bridge the gap between abstract linguistic representations and their physical realizations by integrating phoneme recognition, articulatory trajectory prediction, and phoneme alignment within a single deep learning framework. Phonemes, as abstract speech units, manifest as concrete articulatory gestures, which can be precisely captured through electromagnetic articulography (EMA) and analyzed using deep learning methods. This integration lays the foundation for diverse applications, including intelligibility assessment and therapeutic feedback. Extensive experiments validate the model’s capabilities and demonstrate its potential in real-world contexts. These include evaluations of articulatory and phoneme-related metrics, intelligibility estimation using phoneme error rates, and open vocabulary keyword spotting. A case study on stroke-related datasets highlights how the framework provides detailed articulatory feedback and supports therapy progress tracking. While not a complete solution, this work shows that an integrated, end-to-end deep learning approach can effectively address multiple facets of speech analysis. Ultimately, it serves as a foundation for developing scalable and robust frameworks to tackle challenges in speech and language processing.

How to cite

APA:

Weise, T., Demir, K., Perez Toro, P.A., Arias Vergara, T., Maier, A., Nöth, E., ... Yang, S.H. (2025). Towards End-to-End Speech Articulation and Spoken Language Analysis Using Deep Learning. Human-Centric Intelligent Systems, 5(1), 103-122. https://doi.org/10.1007/s44230-025-00094-6

MLA:

Weise, Tobias, et al. "Towards End-to-End Speech Articulation and Spoken Language Analysis Using Deep Learning." Human-Centric Intelligent Systems 5.1 (2025): 103-122.
