A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria

Psilopatis I, Monod C, Filippi V, Tschudin R, Lapaire O, Emons J, Mosimann B, Zwimpfer TA (2025)

Publication Type: Journal article

Publication year: 2025

Journal

Archives of Gynecology and Obstetrics (Springer)

Abstract

Background: Cardiotocography (CTG) remains a cornerstone of fetal monitoring, but its interpretation is subject to considerable inter- and intra-observer variability. Artificial intelligence (AI) tools, particularly large language models (LLMs), have the potential to improve diagnostic consistency and reduce clinician workload. Objectives: This study aims to evaluate and compare the accuracy of various LLMs in CTG interpretation based on the International Federation of Gynecology and Obstetrics (FIGO) 2015 criteria. Study design: Sixty CTG traces previously classified by clinicians at the University Hospital Basel according to FIGO guidelines were analyzed. In a two-run protocol, 30 normal CTG traces were first presented as screenshots to Chat-GPT-4.0, Google Gemini, Bing Copilot, and DeepSeek. The LLMs that demonstrated adequate interpretation of normal CTGs were then tasked with classifying another 30 suspicious or pathological CTG traces. Each LLM was asked to classify each CTG trace as normal or abnormal. Results: DeepSeek was unable to interpret CTGs and was excluded. Google Gemini performed poorly on normal CTGs (6.7%). Chat-GPT-4.0 was only partially successful in classifying the provided CTG traces as normal (46.7%) or abnormal (50%). Bing Copilot accurately interpreted normal CTGs (96.6%) but failed on abnormal ones (0%). Conclusions: LLMs show major limitations in the interpretation of CTG traces according to the FIGO criteria.

Authors with CRIS profile

Julius Emons Department of Obstetrics and Gynaecology

Involved external institutions

Universitätsspital Basel

Switzerland (CH)

How to cite

APA:

Psilopatis, I., Monod, C., Filippi, V., Tschudin, R., Lapaire, O., Emons, J., ... Zwimpfer, T. A. (2025). A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-025-08145-w

MLA:

Psilopatis, Iason, et al. "A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria." Archives of Gynecology and Obstetrics (2025).
