LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation

Islam Bhuiyan MR, Bhat S, Qahqaie M, Nguyen TT, Perez-Toro PA, Arias-Vergara T, Maier A (2026)

Publication Language: English

Publication Status: Submitted

Publication Type: Unpublished / Preprint

Future Publication Type: Conference contribution

Publication year: 2026

URI: https://arxiv.org/abs/2603.17576

DOI: https://arxiv.org/abs/2603.17576

Open Access Link: https://arxiv.org/abs/2603.17576

Abstract

Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art Dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.
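
For readers unfamiliar with the metric reported in the abstract, the Dice score between a predicted mask and a reference mask is twice their overlap divided by the sum of their sizes. The following is a minimal illustrative sketch in plain Python, not the authors' evaluation code: the function name `dice_score`, the nested-list mask representation, and the convention of returning 1.0 for two empty masks are all assumptions made here for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not taken from the LoGSAM paper.
    """
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as tumor in both masks.
    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A score reported as a percentage, as in the abstract, is this value multiplied by 100. How scores are aggregated across cases is not stated in this record.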

Authors with CRIS profile

Sheethal Bhat Lehrstuhl für Informatik 14 (Bild- und Sprachverarbeitung) (LME)
Melika Qahqaie Lehrstuhl für Informatik 5 (Mustererkennung)
Tri-Thien Nguyen Lehrstuhl für Informatik 5 (Mustererkennung)
Tomás Arias Vergara
Andreas Maier Lehrstuhl für Informatik 5 (Mustererkennung)

Involved external institutions

Universidad de Antioquía (UDEA), Colombia (CO)

King’s College London, United Kingdom (GB)

How to cite

APA:

Islam Bhuiyan, M.R., Bhat, S., Qahqaie, M., Nguyen, T.-T., Perez-Toro, P.A., Arias-Vergara, T., & Maier, A. (2026). LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation. (Unpublished, Submitted).

MLA:

Islam Bhuiyan, Mohammad Robaitul, et al. LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation. Unpublished, Submitted. 2026.
