Machine learning based assessment of hoarseness severity: a multi-sensor approach centered on high-speed videoendoscopy

Schraut T, Schützenberger A, Arias Vergara T, Kunduk M, Echternach M, Dürr S, Werz J, Döllinger M (2025)


Publication Language: English

Publication Type: Journal article

Publication year: 2025

Journal: Frontiers in Artificial Intelligence

Book Volume: 8

Article Number: 1601716

DOI: 10.3389/frai.2025.1601716

Abstract

Introduction: Functional voice disorders are characterized by impaired voice production without primary organic changes, posing challenges for standardized assessment. Current diagnostic methods rely heavily on subjective evaluation and suffer from inter-rater variability. High-speed videoendoscopy (HSV) offers an objective alternative by capturing true intra-cycle vocal fold behavior. Integrating time-synchronized acoustic and HSV recordings could allow for an objective visual and acoustic assessment of vocal function based on a single HSV examination. This study investigates a machine learning-based approach for hoarseness severity assessment using synchronous HSV and acoustic recordings, alongside conventional voice examinations.

Methods: Three databases comprising 457 HSV recordings of the sustained vowel /i/, 634 HSV-synchronized acoustic recordings, and clinical parameters from 923 visits were analyzed. Subjects were classified into two hoarseness groups based on auditory-perceptual ratings, with predicted scores serving as continuous hoarseness severity ratings. A videoendoscopic model was developed by selecting a suitable classification algorithm and a minimal-optimal subset of glottal parameters. This model was compared against an acoustic model based on HSV-synchronized recordings and a clinical model based on parameters from other examinations. Two ensemble models were constructed, one combining the two HSV-based models (videoendoscopic and acoustic) and one combining all three models. Model performance was evaluated on a shared test set based on classification accuracy, correlation with subjective ratings, and correlation between predicted and observed changes in hoarseness severity.

Results: The videoendoscopic, acoustic, and clinical models achieved correlations of 0.464, 0.512, and 0.638 with subjective hoarseness ratings, respectively. Integrating glottal and acoustic parameters into the HSV-based ensemble model improved the correlation to 0.603, confirming the complementary nature of time-synchronized HSV and acoustic recordings. The ensemble model incorporating all modalities achieved the highest correlation of 0.752, underscoring the diagnostic value of multimodal objective assessments.

Discussion: This study highlights the potential of synchronous HSV and acoustic recordings for objective hoarseness severity assessment, offering a more comprehensive evaluation of vocal function. While practical challenges remain, the integration of these modalities led to notable improvements, supporting their complementary value in enhancing diagnostic accuracy. Future advancements could include flexible nasal endoscopy to enable more natural phonation and refinement of glottal parameter extraction to improve model robustness under variable recording conditions.
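As an illustration of the ensemble and evaluation steps described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that each model outputs one score per subject in the range 0 to 1, that the ensemble is an unweighted average of those scores (the abstract does not state how the models were combined), and that agreement with subjective ratings is measured by Pearson correlation. All function names, the 0.5 threshold, and the toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ensemble_scores(*model_scores):
    """Unweighted average of per-subject scores across models (assumed combination rule)."""
    return [sum(scores) / len(scores) for scores in zip(*model_scores)]


def classify(scores, threshold=0.5):
    """Map continuous scores to two hoarseness groups (threshold is an assumption)."""
    return [int(s >= threshold) for s in scores]


# Made-up toy values for six subjects; these are NOT data from the study.
ratings = [0, 1, 1, 2, 3, 3]                    # hypothetical perceptual ratings
video = [0.30, 0.20, 0.55, 0.50, 0.60, 0.90]    # hypothetical videoendoscopic scores
acoustic = [0.10, 0.45, 0.30, 0.70, 0.65, 0.80] # hypothetical acoustic scores
clinical = [0.15, 0.30, 0.40, 0.60, 0.85, 0.75] # hypothetical clinical scores

hsv_ensemble = ensemble_scores(video, acoustic)
full_ensemble = ensemble_scores(video, acoustic, clinical)

for name, scores in [("video", video), ("acoustic", acoustic),
                     ("clinical", clinical), ("HSV ensemble", hsv_ensemble),
                     ("full ensemble", full_ensemble)]:
    print(name, round(pearson(scores, ratings), 3), classify(scores))
```

Averaging is only one possible combination rule; a weighted average or a stacked meta-classifier would fit the same description, and the choice would need to be checked against the full article.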

How to cite

APA:

Schraut, T., Schützenberger, A., Arias Vergara, T., Kunduk, M., Echternach, M., Dürr, S., ... Döllinger, M. (2025). Machine learning based assessment of hoarseness severity: a multi-sensor approach centered on high-speed videoendoscopy. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1601716

MLA:

Schraut, Tobias, et al. "Machine learning based assessment of hoarseness severity: a multi-sensor approach centered on high-speed videoendoscopy." Frontiers in Artificial Intelligence 8 (2025).