Reliability and Performance Stability of Large Language Models in Medical Knowledge Assessment: Evidence from the European Board of Nuclear Medicine Examination

Stelling H, Brink I, Grieb G, Kraus A, Güler I (2026)


Publication Type: Journal article

Publication year: 2026

Journal: AI

Volume: 7

Issue: 2

Article Number: 77

DOI: 10.3390/ai7020077

Abstract

Background: Large language models (LLMs) have demonstrated strong performance on general medical examinations. Whether this performance translates to highly specialized, subspecialty-level board examinations remains unclear. This study evaluates the accuracy and inter-run stability of contemporary LLMs using authentic European Board of Nuclear Medicine (EBNM) Fellowship Examination material.

Methods: Ten LLMs (five proprietary, five open-source) completed 50 EBNM multiple-choice questions across five independent zero-shot runs, resulting in 2500 total inferences. Accuracy was calculated per model across runs. Inter-run reliability was assessed using pairwise Cohen’s kappa coefficients. Pairwise model differences were analyzed using McNemar’s test with Bonferroni correction (α = 0.0011).

Results: Mean accuracy ranged from 53.6% to 100.0%, with all models exceeding an illustrative 50% pass threshold. Inter-run reliability varied substantially (κ = 0.370–1.000; mean κ = 0.716). High accuracy did not consistently correspond to high reproducibility. Gemini 2.5 Pro achieved high accuracy (93.6%) but showed the lowest reliability (κ = 0.370), whereas DeepSeek V3.2 demonstrated perfect accuracy and agreement across all runs. No significant correlation between accuracy and reliability was observed (Spearman ρ = 0.394, p = 0.26).

Conclusions: LLMs demonstrate strong but heterogeneous performance on high-stakes medical knowledge assessments. Differences in reproducibility highlight the need for multi-run evaluation when considering LLMs for educational or clinical knowledge-support applications and for continued validation using non-disclosed examination material.
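The inter-run reliability analysis described in the Methods (pairwise Cohen's kappa over five zero-shot runs) can be sketched as follows. This is an illustrative implementation, not the authors' code; the example run vectors below are hypothetical binary correct/incorrect indicators, not data from the study.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa between two binary (correct/incorrect) answer vectors."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items answered the same way in both runs.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from the marginal label frequencies.
    p_e = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in set(a) | set(b))
    if p_e == 1.0:
        return 1.0  # degenerate case: both runs constant and identical
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(runs):
    """Mean kappa over all pairs of runs (five runs -> ten pairs)."""
    pairs = list(combinations(runs, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: three runs over four questions (1 = correct).
runs = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]]
print(round(mean_pairwise_kappa(runs), 3))
```

In the study's setup, each model would contribute ten such pairwise kappas (five runs taken two at a time), and the reported κ summarizes how consistently a model reproduces its own answers, independently of whether those answers are correct.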


How to cite

APA:

Stelling, H., Brink, I., Grieb, G., Kraus, A., & Güler, I. (2026). Reliability and Performance Stability of Large Language Models in Medical Knowledge Assessment: Evidence from the European Board of Nuclear Medicine Examination. AI, 7(2), Article 77. https://doi.org/10.3390/ai7020077

MLA:

Stelling, Henrik, et al. "Reliability and Performance Stability of Large Language Models in Medical Knowledge Assessment: Evidence from the European Board of Nuclear Medicine Examination." AI 7.2 (2026).
