VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning

In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces two main challenges: overfitting and the difficulty of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a vision-language aligning model with cross-modal attention (VLCA) is presented to generate accurate and rich bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments against various baselines validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
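The paper's full architecture is not reproduced in this record. Purely as an illustrative sketch of the cross-modal attention idea the abstract describes, caption-token queries can attend over image-region keys and values so that each word gathers the visual evidence it should describe. All module names, dimensions, and the residual design below are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention: caption tokens (queries)
    attend over visual region features (keys/values). A hypothetical
    sketch, not the published VLCA implementation."""

    def __init__(self, text_dim=512, vision_dim=768, hidden_dim=512, num_heads=8):
        super().__init__()
        # Project both modalities into a shared hidden space.
        self.q_proj = nn.Linear(text_dim, hidden_dim)
        self.kv_proj = nn.Linear(vision_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats, vision_feats):
        # text_feats:   (batch, num_tokens,  text_dim), e.g. word embeddings
        # vision_feats: (batch, num_regions, vision_dim), e.g. CNN/ViT patch features
        q = self.q_proj(text_feats)
        kv = self.kv_proj(vision_feats)
        # Each caption token attends over all image regions.
        aligned, attn_weights = self.attn(q, kv, kv)
        # Residual connection preserves the original language signal.
        return self.norm(q + aligned), attn_weights

# Toy usage: 4 images with 49 region features each, captions of 20 tokens.
fused, weights = CrossModalAttention()(torch.randn(4, 20, 512),
                                       torch.randn(4, 49, 768))
print(fused.shape, weights.shape)  # (4, 20, 512) and (4, 20, 49)
```

The attention weights (one distribution over the 49 regions per token) are what makes the alignment inspectable: a well-trained model should place high weight on the region a word refers to.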

Keywords: remote sensing image captioning (RSIC); vision-language representation; remote sensing image caption dataset; attention mechanism

WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng, LU Lina


College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China

Funding: National Natural Science Foundation of China (61702528, 61806212)

2023

Journal of Systems Engineering and Electronics
Sponsors: China Aerospace Science and Industry Academy of Defense Technology, Chinese Society of Astronautics, Systems Engineering Society of China, China Simulation Federation

Indexed in: CSTPCD; CSCD; Peking University Core Journals
Impact factor: 0.64
ISSN: 1004-4132
Year, volume (issue): 2023, 34(1)