Multi-step retrieval and reasoning improves radiology question answering with large language models

Wind S, Sopa J, Truhn D, Lotfinia M, Nguyen TT, Bressem K, Adams L, Rusu M, Köstler H, Wellein G, Maier A, Tayebi Arasteh S (2025)


Publication Type: Journal article

Publication year: 2025

Journal: npj Digital Medicine

Book Volume: 8

Article Number: 790

Journal Issue: 1

URI: https://www.nature.com/articles/s41746-025-02250-5

DOI: 10.1038/s41746-025-02250-5

Open Access Link: https://www.nature.com/articles/s41746-025-02250-5

Abstract

Large language models (LLMs) show promise for radiology decision support, yet conventional retrieval-augmented generation (RAG) relies on single-step retrieval and struggles with complex reasoning. We introduce radiology Retrieval and Reasoning (RaR), a multi-step retrieval framework that iteratively summarizes clinical questions, retrieves evidence, and synthesizes answers. We evaluated 25 LLMs spanning general-purpose, reasoning-optimized, and clinically fine-tuned models (0.5B to 670B parameters) on 104 expert-curated radiology questions and an independent set of 65 real radiology board-exam questions. RaR significantly improved mean diagnostic accuracy versus zero-shot prompting (75% vs. 67%; P = 1.1 × 10⁻⁷) and conventional online RAG (75% vs. 69%; P = 1.9 × 10⁻⁶). Gains were largest in mid-sized and small models (e.g., Mistral Large: 72% → 81%), while very large models showed minimal change. RaR reduced hallucinations and provided clinically relevant evidence in 46% of cases, improving factual grounding. These results show that multi-step retrieval enhances diagnostic reliability, especially in deployable mid-sized LLMs. Code, datasets, and RaR are publicly available.
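The iterative loop the abstract describes (summarize the question, retrieve evidence, synthesize an answer) can be illustrated with a minimal sketch. This is not the authors' RaR implementation: the function `multi_step_answer`, its parameters, the stopping rule, and the toy keyword "retriever" are all hypothetical stand-ins chosen so the example runs without any model or search backend.

```python
from typing import Callable, List

def multi_step_answer(
    question: str,
    summarize: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str]], str],
    max_steps: int = 3,
) -> str:
    """Hypothetical multi-step retrieval loop (an assumption, not RaR itself)."""
    evidence: List[str] = []
    for _ in range(max_steps):
        # Condense the question plus the evidence so far into a search query.
        query = summarize(question, evidence)
        # Keep only passages that have not been collected already.
        new = [p for p in retrieve(query) if p not in evidence]
        if not new:
            break  # Assumed stopping rule: stop when nothing new is found.
        evidence.extend(new)
    # Produce the final answer from everything gathered.
    return synthesize(question, evidence)

# Toy stand-ins for the LLM and the retriever (illustrative only).
CORPUS = {
    "pneumothorax": "Absent lung markings beyond a visible pleural line suggest pneumothorax.",
    "pleural line": "A thin pleural line on chest radiograph separates lung from air.",
}

def toy_summarize(question: str, evidence: List[str]) -> str:
    # First step: query with the question; later steps: follow the latest passage.
    return question if not evidence else evidence[-1]

def toy_retrieve(query: str) -> List[str]:
    # Return every passage whose key appears in the query text.
    return [text for key, text in CORPUS.items() if key in query.lower()]

def toy_synthesize(question: str, evidence: List[str]) -> str:
    return f"{len(evidence)} passage(s) used"

result = multi_step_answer(
    "What finding indicates pneumothorax?",
    toy_summarize, toy_retrieve, toy_synthesize,
)
print(result)
```

With `max_steps=1` the loop reduces to single-step retrieval, which makes the contrast with conventional RAG easy to see in the toy setting: the second passage is only reached by following the first.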

How to cite

APA:

Wind, S., Sopa, J., Truhn, D., Lotfinia, M., Nguyen, T.-T., Bressem, K., ... Tayebi Arasteh, S. (2025). Multi-step retrieval and reasoning improves radiology question answering with large language models. npj Digital Medicine, 8(1), Article 790. https://doi.org/10.1038/s41746-025-02250-5

MLA:

Wind, Sebastian, et al. "Multi-step retrieval and reasoning improves radiology question answering with large language models." npj Digital Medicine 8.1 (2025).
