VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning
In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments are carried out against various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces captions that are more descriptive and informative than those of existing algorithms.
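To illustrate the general idea of cross-modal attention for visual-lingual alignment, the following is a minimal PyTorch sketch in which caption tokens attend over image region features. It is not the paper's actual VLCA architecture; the class name, feature dimensions, and single-block design are illustrative assumptions only.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        """Illustrative cross-modal attention block: text tokens attend to image regions.
        Dimensions and structure are assumptions, not the VLCA configuration."""
        def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_heads=8):
            super().__init__()
            # Project both modalities into a shared hidden space before attention.
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            self.image_proj = nn.Linear(image_dim, hidden_dim)
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, text_feats, image_feats):
            # text_feats: (B, L, text_dim) token embeddings from a language model
            # image_feats: (B, R, image_dim) region/grid features from a visual encoder
            q = self.text_proj(text_feats)      # queries come from language
            kv = self.image_proj(image_feats)   # keys/values come from vision
            attended, _ = self.attn(q, kv, kv)  # each token attends over image regions
            return self.norm(q + attended)      # residual + norm: vision-aligned text features

    # Usage with random tensors standing in for encoder outputs.
    text = torch.randn(2, 20, 768)    # batch of 2 captions, 20 tokens each
    image = torch.randn(2, 49, 2048)  # 7x7 grid of visual features per image
    aligned = CrossModalAttention()(text, image)
    print(aligned.shape)  # torch.Size([2, 20, 512])

In such a design, the attended text features can be fed to a caption decoder (English or Chinese), which is one common way to couple a pre-trained language model with visual features.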