Bhat S, Bayer S, Grbic S, Maier A (2026)
Publication Language: English
Publication Type: Conference contribution, Abstract of a poster
Publication year: 2026
Conference Proceedings Title: European Congress of Radiology
Purpose or Learning Objective: Recent approaches demonstrate that adapting Contrastive Language–Image Pretraining (CLIP) to medical imaging can enhance localization and classification performance [1]. However, CLIP relies primarily on maximizing similarity between image and text embeddings, without explicitly modeling deeper cross-modal interactions. This limitation may restrict its ability to capture clinically meaningful semantic relationships between imaging findings and report language.
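For context, CLIP is trained with a symmetric image–text contrastive (InfoNCE) loss; a standard formulation (not reproduced from this abstract) over a batch of N pairs with image embeddings v_i, text embeddings t_i, and temperature τ is

    \mathcal{L}_{\text{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\langle v_i, t_i \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle v_i, t_j \rangle / \tau)} + \log \frac{\exp(\langle v_i, t_i \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle v_j, t_i \rangle / \tau)} \right]

Because this objective scores only the global similarity ⟨v_i, t_j⟩, it provides no mechanism for token-level interaction between imaging findings and report phrases.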
In this work, we investigate the ALBEF (Align Before Fuse) framework [2], which extends beyond similarity-based learning by explicitly fusing aligned image and text representations. As shown in Figure 1, ALBEF incorporates a masked language modeling objective, where masked report tokens are predicted using both visual and textual context. This encourages richer cross-modal reasoning and improved semantic understanding. Applying this framework to chest X-rays (CXRs) and associated clinical reports, we demonstrate that ALBEF-CXR improves both zero-shot and fine-tuned classification performance compared with prior CLIP-based approaches.
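The cross-modal masked language modeling step can be sketched in a few lines of PyTorch; this is a minimal illustration under assumed module names and sizes, not the authors' implementation.

    # Minimal sketch of ALBEF-style cross-modal masked language modeling
    # (illustrative; vocabulary size, dimensions, and fusion depth are assumptions).
    import torch
    import torch.nn as nn

    class CrossModalMLM(nn.Module):
        def __init__(self, vocab_size=30522, dim=768, n_heads=12):
            super().__init__()
            self.text_embed = nn.Embedding(vocab_size, dim)
            # Fusion: masked report tokens cross-attend to image patch features.
            self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.mlm_head = nn.Linear(dim, vocab_size)

        def forward(self, masked_ids, image_feats):
            txt = self.text_embed(masked_ids)                      # (B, L, dim)
            fused, _ = self.cross_attn(txt, image_feats, image_feats)
            return self.mlm_head(fused)                            # (B, L, vocab)

    model = CrossModalMLM()
    logits = model(torch.randint(0, 30522, (2, 32)), torch.randn(2, 196, 768))
    targets = torch.full((2, 32), -100)   # -100 = ignore unmasked positions
    targets[:, 5] = 17                    # pretend position 5 was masked; true id 17
    loss = nn.CrossEntropyLoss(ignore_index=-100)(
        logits.reshape(-1, 30522), targets.reshape(-1))
    loss.backward()

Because the masked-token loss can only be reduced by consulting the image when the surrounding text is insufficient, the fusion layers are pushed to encode finding–phrase correspondences rather than global similarity alone.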
Methods or Background: We use a ViT-G/14 model pretrained with the ALBEF framework [2] on natural image–text pairs, followed by domain-specific pretraining on the large-scale MIMIC-CXR dataset [4]. The model is then fine-tuned on the public VinDr-CXR dataset [5]. CXR reports are processed following [1], and classification is performed using multiple positive and negative prompts per finding. Zero-shot and fine-tuned AUC scores are compared against state-of-the-art (SOTA) vision–language (VL) methods and classification-only baselines.
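The multi-prompt zero-shot scoring can be illustrated as follows; the softmax-over-prompts aggregation mirrors CheXZero-style pipelines [6], and the embeddings here are random stand-ins rather than model outputs.

    # Hedged sketch: score one finding from positive/negative prompt embeddings.
    import numpy as np

    def zero_shot_score(image_emb, pos_prompts, neg_prompts, tau=0.07):
        """Probability that a finding is present, from cosine similarities.
        image_emb: (d,) L2-normalized; *_prompts: (k, d) L2-normalized."""
        pos = float(np.mean(pos_prompts @ image_emb))  # average over k variants
        neg = float(np.mean(neg_prompts @ image_emb))
        e_pos, e_neg = np.exp(pos / tau), np.exp(neg / tau)
        return e_pos / (e_pos + e_neg)  # softmax over {positive, negative}

    rng = np.random.default_rng(0)
    img = rng.normal(size=128); img /= np.linalg.norm(img)
    pos = rng.normal(size=(3, 128)); pos /= np.linalg.norm(pos, axis=1, keepdims=True)
    neg = rng.normal(size=(3, 128)); neg /= np.linalg.norm(neg, axis=1, keepdims=True)
    print(zero_shot_score(img, pos, neg))

Averaging over several prompt phrasings per finding reduces sensitivity to any single template.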
Results or Findings: Table 1 presents zero-shot AUC scores across 13 thoracic findings on VinDr-CXR. ALBEF-CXR demonstrates consistently strong zero-shot performance compared to other VL methods, including CheXZero [6] and MedKLIP [7]. Performance gains are particularly evident for findings requiring higher-level semantic context, such as pleural thickening, pulmonary fibrosis, mass/nodule, and atelectasis. Overall, ALBEF-CXR achieves a +2 percentage point improvement in macro AUC over CheXZero, indicating improved generalization without task-specific fine-tuning.
Table 2 reports fine-tuned results compared against multiple VL models and classification-only baselines. For common abnormalities encountered in routine clinical practice—such as lung opacity, infiltration, and atelectasis—ALBEF-CXR matches or exceeds strong classifiers (ViT multiclass, Dual Encoder [8]), achieving AUCs of 0.90–0.95. Improvements are particularly notable for diffuse or complex conditions, including pulmonary fibrosis (0.93) and pleural thickening (0.90). For clinically urgent findings such as pleural effusion and pneumothorax, ALBEF-CXR achieves very high AUCs (0.99), indicating reliable detection of acute abnormalities.
Compared to other VL approaches (MedKLIP [7], SLIP [10], CheXZero [6] variants), ALBEF-CXR demonstrates more balanced performance across both acute and chronic findings, including subtle conditions such as fibrosis and calcification. The highest average AUC of 0.92 highlights the benefit of incorporating textual supervision during fine-tuning. Improvements in macro AUC were statistically significant in both the zero-shot and fine-tuned settings (p < 0.05).
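For reference, the reported macro AUC averages the per-finding ROC AUC over all 13 labels; a generic evaluation sketch with placeholder arrays (not the study's code) is

    # Macro AUC: mean of per-label ROC AUCs across the 13 findings.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=(500, 13))   # placeholder binary labels
    y_score = rng.random(size=(500, 13))          # placeholder model scores
    print(f"macro AUC: {roc_auc_score(y_true, y_score, average='macro'):.3f}")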
Conclusion: Building on prior work (Patch-CLIP [1], CXR-CML [3]), this study demonstrates that the ALBEF-CXR framework delivers robust and consistent performance across a wide range of thoracic findings in both zero-shot and fine-tuned settings. ALBEF-CXR shows improved generalization to unseen tasks and maintains strong performance after fine-tuning, particularly for clinically important acute and chronic abnormalities.
These findings underscore the value of multi-modal learning that jointly leverages chest radiographs and clinical text. By integrating visual and linguistic context, ALBEF-CXR offers improved robustness across diverse pathologies, supporting its potential role in clinical decision support and radiologist assistance, especially in heterogeneous or data-limited clinical settings.
APA:
Bhat, S., Bayer, S., Grbic, S., & Maier, A. (2026, March). Robust AI-based chest radiograph classification without prior task training via improved vision-language alignment [Poster presentation]. European Congress of Radiology, Vienna, Austria.
MLA:
Bhat, Sheethal, et al. "Robust AI-Based Chest Radiograph Classification Without Prior Task Training via Improved Vision-Language Alignment." European Congress of Radiology, Mar. 2026, Vienna, Austria. Poster presentation.