Wagner D, Baumann I, Engert N, Lee S, Nöth E, Riedhammer K, Bocklet T (2025)
Publication Type: Conference contribution
Publication year: 2025
Publisher: International Speech Communication Association
Pages Range: 3294-3298
Conference Proceedings Title: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Event location: Rotterdam, NLD
DOI: 10.21437/Interspeech.2025-2155
In this work, we present our submission to the Speech Accessibility Project challenge for dysarthric speech recognition. We integrate parameter-efficient fine-tuning with latent audio representations to improve an encoder-decoder ASR system. Synthetic training data is generated by fine-tuning Parler-TTS to mimic dysarthric speech, using LLM-generated prompts for corpus-consistent target transcripts. Personalization with x-vectors consistently reduces word error rates (WERs) over non-personalized fine-tuning. AdaLoRA adapters outperform full fine-tuning and standard low-rank adaptation, achieving relative WER reductions of ∼23% and ∼22%, respectively. Further improvements (∼5% WER reduction) come from incorporating wav2vec 2.0-based audio representations. Training with synthetic dysarthric speech yields up to ∼7% relative WER improvement over personalized fine-tuning alone.
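The improvements above are stated as relative WER reductions, that is, the drop in WER divided by the baseline WER. The following is a minimal sketch of how such figures are commonly computed, not the authors' evaluation code. The function names and the example values are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER removed by the improved system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical illustration values, not results from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_wer_reduction(0.20, 0.15):.1%}")
```

Under this definition, a relative reduction of about 23% means the improved system removes roughly a quarter of the baseline's word errors; it does not mean the absolute WER drops by 23 percentage points.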
APA:
Wagner, D., Baumann, I., Engert, N., Lee, S., Nöth, E., Riedhammer, K., & Bocklet, T. (2025). Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 3294-3298). Rotterdam, NLD: International Speech Communication Association.
MLA:
Wagner, Dominik, et al. "Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition." Proceedings of the 26th Interspeech Conference 2025, Rotterdam, NLD: International Speech Communication Association, 2025. 3294-3298.