Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

Sun Z, Zhao M, Liu G, Kaup A (2024)


Publication Language: English

Publication Type: Journal article

Publication year: 2024

Journal: IEEE Transactions on Geoscience and Remote Sensing

Volume: 62

Article Number: 4709118

DOI: 10.1109/TGRS.2024.3489224

Abstract

In recent years, remote sensing cross-modal text-image retrieval (RSCTIR) has attracted considerable attention owing to its convenience and information mining capabilities. However, two significant challenges persist: the difficulty of effectively integrating global and local information during feature extraction, due to substantial variations in remote sensing imagery, and the failure of existing methods to adequately consider feature pre-alignment prior to modal fusion, resulting in complex modal interactions that adversely impact retrieval accuracy and efficiency. To address these challenges, we propose a cross-modal pre-aligned method with global and local information (CMPAGL) for remote sensing imagery. Specifically, we design the Gswin transformer block, which introduces a global information window on top of the local window attention mechanism, synergistically combining local window self-attention and global-local window cross-attention to effectively capture multi-scale features of remote sensing images. Additionally, our approach incorporates a pre-alignment mechanism to mitigate the training difficulty of modal fusion, thereby enhancing retrieval accuracy. Moreover, we propose a similarity matrix reweighting (SMR) reranking algorithm to deeply exploit information from the similarity matrix during the retrieval process. This algorithm combines forward and backward ranking, the extreme difference ratio, and other factors to reweight the similarity matrix, further enhancing retrieval accuracy. Finally, we optimize the triplet loss function by introducing an intra-class distance term for matched image-text pairs, not only focusing on the relative distance between matched and unmatched pairs but also minimizing the distance within matched pairs.
Experiments on four public remote sensing text-image datasets, including RSICD, RSITMD, UCM-Captions, and Sydney-Captions, demonstrate the effectiveness of our proposed method, achieving improvements over state-of-the-art methods, such as a 2.28% increase in mean Recall (mR) on the RSITMD dataset and a significant 4.65% improvement in R@1. The code is available from https://github.com/ZbaoSun/CMPAGL.
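The abstract's final point, a triplet loss augmented with an intra-class distance term for matched pairs, can be illustrated with a minimal sketch. This is not the paper's implementation: the weighting factor `lam`, the `margin` value, and the use of Euclidean distance are all assumptions made here for illustration only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss_with_intra(anchor, positive, negative, margin=0.2, lam=0.1):
    """Triplet loss plus an intra-class distance term (hedged sketch).

    Standard triplet loss only enforces the relative margin between the
    matched (anchor, positive) pair and the unmatched (anchor, negative)
    pair. The added lam * d_pos term also penalizes the absolute distance
    within the matched pair, pulling matched image-text embeddings closer.
    """
    d_pos = euclidean(anchor, positive)   # distance within the matched pair
    d_neg = euclidean(anchor, negative)   # distance to the unmatched sample
    ranking = max(0.0, d_pos - d_neg + margin)  # standard triplet term
    return ranking + lam * d_pos                # plus intra-class term
```

With identical anchor and positive embeddings both terms vanish and the loss is zero; as the matched pair drifts apart, the intra-class term contributes even while the margin constraint is still satisfied.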


How to cite

APA:

Sun, Z., Zhao, M., Liu, G., & Kaup, A. (2024). Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing, 62. https://doi.org/10.1109/TGRS.2024.3489224

MLA:

Sun, Zengbao, et al. "Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval." IEEE Transactions on Geoscience and Remote Sensing 62 (2024).
