Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans

Riedel EO, Schinz D, Keicher M, Rühling S, El Husseini M, Pellegrini C, Baum T, Dieckmeyer M, Malagutti L, Seeger I, Walburga AS, Wiestler B, Sollmann N, Löffler MT, Wagner A, Kirschke JS (2026)

Publication Type: Journal article

Publication year: 2026

Journal

European Radiology Springer Verlag (Germany)

DOI: 10.1007/s00330-026-12393-y

Abstract

Objectives: Vertebral fractures are severe complications of osteoporosis but are frequently missed on computed tomography (CT). Differentiating true fractures from non-osteoporotic vertebral height loss remains challenging; deep learning (DL) models may improve detection and grading.

Materials and methods: In this retrospective study, we evaluated eight human raters with different levels of expertise (three students, three residents, and two attendings), four DL models, and one DL-based commercial software using the public Vertebral Segmentation (VerSe) 19 & 20 datasets. Vertebral fractures were graded using the semiquantitative Genant scale (0–3). Diagnostic performance was evaluated using interrater agreement and classification metrics, with significance tested via generalized linear mixed models across patient and vertebral levels for the thoracolumbar spine and its regional subsets. Consensus readings by a senior neuroradiologist and an experienced resident, informed by clinical data, served as the reference standard.

Results: A total of 3548 thoracic and lumbar vertebrae from 331 patients were analyzed. Of these vertebrae, 190 (5.4%) were fractured, and 139 (3.9%) had “clinically most relevant” moderate or severe fractures (Genant 2 or 3). DL models showed accuracy comparable to that of residents in detecting moderate/severe fractures at the vertebral level (0.988 vs. 0.991, p > 0.05). SpineQ v1.1 consistently showed a comparable area under the receiver operating characteristic curve (AUROC) and higher diagnostic accuracy compared to experts in detecting any fracture and moderate/severe fractures across vertebral, regional, and patient-level analyses. Students consistently exhibited the lowest AUROC and diagnostic accuracy across all levels and comparisons.

Conclusion: When specifically trained to detect a distinct condition like vertebral fractures, advanced algorithms can perform comparably to experts.
Key Points:

Question: Can dedicated deep learning-based algorithms improve diagnostic performance for accurate detection and grading of osteoporotic vertebral fractures on CT scans?

Findings: Deep learning models specifically trained for vertebral fracture detection and grading can reach performance comparable to that of experts for the identification and grading of osteoporotic fractures.

Clinical relevance: Specifically trained deep learning models represent a valuable advancement for improving the identification and grading of osteoporotic fractures in clinical practice, bringing these tools a significant step closer to routine clinical implementation.
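The abstract reports results for two binary tasks derived from the Genant scale: any fracture (grade 1 or higher) and moderate/severe fracture (grade 2 or 3). The following is a minimal Python sketch of how such a binarization and a vertebra-level accuracy could be computed, together with a check of the prevalence figures quoted above. The function names and the six-vertebra toy grades are hypothetical illustrations and are not taken from the study; the study's own analysis used generalized linear mixed models, which this sketch does not reproduce.

```python
def binarize(grades, threshold):
    """Map Genant grades (0-3) to 1 if grade >= threshold, else 0."""
    return [1 if g >= threshold else 0 for g in grades]


def accuracy(reference, predicted):
    """Fraction of vertebrae on which the rater agrees with the reference."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


# Counts reported in the abstract.
n_vertebrae = 3548
n_fractured = 190          # Genant 1-3
n_moderate_severe = 139    # Genant 2-3

prevalence_any = round(100 * n_fractured / n_vertebrae, 1)
prevalence_mod_sev = round(100 * n_moderate_severe / n_vertebrae, 1)

# Hypothetical toy data (NOT from the study): Genant grades for six vertebrae.
reference_grades = [0, 0, 1, 2, 3, 0]
rater_grades = [0, 1, 1, 2, 2, 0]

# Threshold 1 = "any fracture"; threshold 2 = "moderate/severe fracture".
acc_any = accuracy(binarize(reference_grades, 1), binarize(rater_grades, 1))
acc_mod_sev = accuracy(binarize(reference_grades, 2), binarize(rater_grades, 2))

print(prevalence_any, prevalence_mod_sev)
print(acc_any, acc_mod_sev)
```

In the toy data the rater overcalls one vertebra as grade 1 and undergrades a grade 3 fracture as grade 2, so the two thresholds give different accuracies: only the overcall counts as an error for "any fracture", and neither deviation counts for "moderate/severe".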

Involved external institutions

Technische Universität München (TUM)

Germany (DE)

Klinikum der Universität München (LMU Klinikum)

Germany (DE)

How to cite

APA:

Riedel, E.O., Schinz, D., Keicher, M., Rühling, S., El Husseini, M., Pellegrini, C.,... Kirschke, J.S. (2026). Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans. European Radiology. https://doi.org/10.1007/s00330-026-12393-y

MLA:

Riedel, Evamaria O., et al. "Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans." European Radiology (2026).
