Sun Z, Zhao M, Liu G, Kaup A (2024)
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval
Publication Language: English
Publication Type: Journal article
Publication year: 2024
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 62
Article Number: 4709118
DOI: 10.1109/TGRS.2024.3489224
Abstract: In recent years, remote sensing cross-modal text-image retrieval (RSCTIR) has attracted considerable attention owing to its convenience and information-mining capability. However, two significant challenges persist: integrating global and local information during feature extraction, which is difficult given the substantial scale variations in remote sensing imagery, and the failure of existing methods to adequately pre-align features before modal fusion, which leads to complex modal interactions that degrade retrieval accuracy and efficiency. To address these challenges, we propose a cross-modal pre-aligned method with global and local information (CMPAGL) for remote sensing imagery. Specifically, we design the Gswin transformer block, which introduces a global information window on top of the local window attention mechanism, combining local window self-attention with global-local window cross-attention to capture multi-scale features of remote sensing images. Additionally, our approach incorporates a pre-alignment mechanism that eases the training of modal fusion, thereby improving retrieval accuracy. Moreover, we propose a similarity matrix reweighting (SMR) reranking algorithm that more fully exploits the information in the similarity matrix during retrieval: it combines forward and backward ranking, the extreme difference ratio, and other factors to reweight the similarity matrix and further improve retrieval accuracy. Finally, we optimize the triplet loss function by introducing an intra-class distance term for matched image-text pairs, so that the loss not only enforces a relative margin between matched and unmatched pairs but also minimizes the distance within matched pairs. Experiments on four public remote sensing text-image datasets (RSICD, RSITMD, UCM-Captions, and Sydney-Captions) demonstrate the effectiveness of the proposed method, which improves on state-of-the-art results, for example by 2.28% in mean Recall (mR) and 4.65% in R@1 on the RSITMD dataset. The code is available at https://github.com/ZbaoSun/CMPAGL.
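The abstract describes the Gswin block only at a high level. As one possible reading, the PyTorch sketch below pairs local window self-attention with cross-attention from each local window to a pooled global window; the pooling operator, normalization placement, and residual layout are illustrative assumptions, not the paper's exact architecture (the authors' repository contains the real implementation).

```python
import torch
import torch.nn as nn

class GswinBlockSketch(nn.Module):
    """Illustrative sketch of the Gswin idea: local window self-attention
    followed by cross-attention from local windows to a pooled global
    window. Sizes and layout are assumptions, not the paper's design."""

    def __init__(self, dim, num_heads=4, window=7, global_size=7):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool2d(global_size)  # builds the global window
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, H, W, C) with H and W divisible by `window`
        B, H, W, C = x.shape
        w = self.window
        n_win = (H // w) * (W // w)
        # Partition the feature map into non-overlapping local windows.
        win = (x.view(B, H // w, w, W // w, w, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(B * n_win, w * w, C))
        # Local window self-attention (pre-norm, residual).
        q = self.norm1(win)
        win = win + self.local_attn(q, q, q)[0]
        # Global window: a pooled summary of the whole feature map,
        # replicated so each local window can attend to it.
        g = self.pool(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)  # (B, g*g, C)
        g = g.repeat_interleave(n_win, dim=0)                            # (B*n_win, g*g, C)
        # Global-local cross-attention: local queries, global keys/values.
        win = win + self.cross_attn(self.norm2(win), g, g)[0]
        # Undo the window partition back to (B, H, W, C).
        return (win.view(B, H // w, W // w, w, w, C)
                   .permute(0, 1, 3, 2, 4, 5)
                   .reshape(B, H, W, C))
```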
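The SMR reranking step is likewise only outlined. The NumPy snippet below shows one hedged interpretation: forward and backward ranks are turned into a penalty on scores with poor mutual rank, and a row-normalized "extreme difference ratio" boosts scores near the row maximum. The combination rule and the weights alpha and beta are guesses for illustration, not the paper's formula.

```python
import numpy as np

def smr_rerank(sim, alpha=0.5, beta=0.5):
    """Sketch of similarity-matrix reweighting (SMR) reranking.

    sim[i, j] is the similarity between image i and text j. The paper
    combines forward/backward ranking and an extreme difference ratio;
    the exact combination below is an illustrative assumption.
    """
    # Forward ranks: position of each text in image i's ranking (0 = best).
    fwd_rank = np.argsort(np.argsort(-sim, axis=1), axis=1)
    # Backward ranks: position of each image in text j's ranking.
    bwd_rank = np.argsort(np.argsort(-sim, axis=0), axis=0)
    # Extreme difference ratio: where each score sits between the
    # row-wise minimum and maximum (one plausible reading of the term).
    row_min = sim.min(axis=1, keepdims=True)
    row_max = sim.max(axis=1, keepdims=True)
    ext_ratio = (sim - row_min) / (row_max - row_min + 1e-8)
    # Reweight: damp scores with poor mutual ranks, boost extreme ones.
    rank_penalty = 1.0 / ((1.0 + alpha * fwd_rank) * (1.0 + alpha * bwd_rank))
    return sim * rank_penalty * (1.0 + beta * ext_ratio)
```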
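The modified triplet loss is the most concrete of the contributions: alongside the usual margin between matched and unmatched pairs, it adds a term that directly shrinks the matched-pair distance. A minimal PyTorch sketch, assuming cosine distance and a hypothetical weight intra_weight for the added term:

```python
import torch
import torch.nn.functional as F

def triplet_loss_with_intra_class(anchor, positive, negative,
                                  margin=0.2, intra_weight=0.1):
    """Triplet loss with an added intra-class distance term.

    Beyond the standard hinge on the relative distance between the
    matched (anchor, positive) and unmatched (anchor, negative) pairs,
    a weighted term pulls the matched pair together. intra_weight is a
    hypothetical hyperparameter; the paper's exact formulation and
    weighting may differ.
    """
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)  # matched-pair distance
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)  # unmatched-pair distance
    hinge = torch.clamp(margin + d_pos - d_neg, min=0.0)  # standard triplet term
    return (hinge + intra_weight * d_pos).mean()          # intra-class pull
```

With intra_weight = 0 this reduces to the standard triplet loss, which makes the effect of the extra term straightforward to ablate.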
APA:
Sun, Z., Zhao, M., Liu, G., & Kaup, A. (2024). Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval. IEEE Transactions on Geoscience and Remote Sensing, 62, Article 4709118. https://doi.org/10.1109/TGRS.2024.3489224
MLA:
Sun, Zengbao, et al. "Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval." IEEE Transactions on Geoscience and Remote Sensing 62 (2024).