Güler I, Grieb G, Kraus A, Lautenbach M, Stelling H (2026). Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs.
Publication Type: Journal article
Publication year: 2026
Journal Volume: 16
Article Number: 424
Journal Issue: 3
DOI: 10.3390/diagnostics16030424
Background/Objectives: Multimodal large language models (MLLMs) offer potential for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability, and intra-model consistency of four MLLMs in detecting hand fractures on plain radiographs. Methods: In total, hand radiographs of 65 adult patients with confirmed hand fractures (30 phalangeal, 30 metacarpal, 5 scaphoid) were evaluated by four models: GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1. Each image was independently analyzed five times per model using identical zero-shot prompts (1300 total inferences). Diagnostic accuracy, inter-run reliability (Fleiss’ κ), case-level agreement profiles, subgroup performance, and exploratory demographic inference (age, sex) were assessed. Results: GPT-5 Pro achieved the highest accuracy (64.3%) and consistency (κ = 0.71), followed by Gemini 2.5 Pro (56.9%, κ = 0.57). Mistral Medium 3.1 exhibited high agreement (κ = 0.88) despite low accuracy (38.5%), indicating systematic error (“confident hallucination”). Claude Sonnet 4.5 showed both low accuracy (33.8%) and low consistency (κ = 0.33), reflecting instability. While phalangeal fractures were reliably detected by the top models, scaphoid fractures remained challenging. Demographic analysis revealed poor performance, with age estimation errors exceeding 12 years and sex prediction accuracy near random chance. Conclusions: Diagnostic accuracy and consistency are distinct performance dimensions; high intra-model agreement does not imply correctness. While GPT-5 Pro demonstrated the most favorable balance of accuracy and stability, other models exhibited critical failure modes ranging from systematic bias to random instability. At present, MLLMs should be regarded as experimental diagnostic reasoning systems rather than reliable standalone tools for clinical fracture detection.
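The inter-run reliability statistic reported in the abstract, Fleiss’ κ, measures agreement across a model’s five repeated runs per radiograph beyond what chance would produce. A minimal Python sketch of the standard calculation is shown below; the example labels are invented for illustration and are not the study’s data or code:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for N subjects, each labeled n times.
    `ratings`: list of N inner lists, each holding the n categorical
    labels a subject received (here: one model's 5 runs per image)."""
    n = len(ratings[0])                      # labels per subject (5 runs)
    categories = sorted({label for row in ratings for label in row})
    # n_ij: how often subject i received category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    N = len(counts)
    # per-subject observed agreement P_i
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N
    # expected chance agreement from marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical output patterns for 4 cases, 5 runs each
runs = [
    ["fracture"] * 5,                              # perfectly stable
    ["fracture"] * 4 + ["no fracture"],            # one dissenting run
    ["no fracture"] * 5,                           # stable negative
    ["fracture", "no fracture"] * 2 + ["fracture"],  # unstable case
]
print(round(fleiss_kappa(runs), 3))  # → 0.479
```

Note that κ rewards consistency, not correctness: a model that repeats the same wrong label five times scores a high κ, which is exactly the “confident hallucination” pattern the abstract attributes to Mistral Medium 3.1.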
APA:
Güler, I., Grieb, G., Kraus, A., Lautenbach, M., & Stelling, H. (2026). Diagnostic accuracy and stability of multimodal large language models for hand fracture detection: A multi-run evaluation on plain radiographs. Diagnostics, 16(3), Article 424. https://doi.org/10.3390/diagnostics16030424
MLA:
Güler, Ibrahim, et al. "Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs." Diagnostics, vol. 16, no. 3, 2026, https://doi.org/10.3390/diagnostics16030424.