VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning
In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments are carried out against various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces captions that are more descriptive and informative than those of existing algorithms.
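To illustrate the general idea of cross-modal attention for visual-lingual alignment, the following is a minimal PyTorch sketch in which caption tokens attend over image region features. It is not the paper's actual VLCA architecture; the class name, feature dimensions, and single-block design are illustrative assumptions only.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        """Illustrative cross-modal attention block: text tokens attend to image regions.
        Dimensions and structure are assumptions, not the VLCA configuration."""
        def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_heads=8):
            super().__init__()
            # Project both modalities into a shared hidden space before attention.
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            self.image_proj = nn.Linear(image_dim, hidden_dim)
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, text_feats, image_feats):
            # text_feats: (B, L, text_dim) token embeddings from a language model
            # image_feats: (B, R, image_dim) region/grid features from a visual encoder
            q = self.text_proj(text_feats)      # queries come from language
            kv = self.image_proj(image_feats)   # keys/values come from vision
            attended, _ = self.attn(q, kv, kv)  # each token attends over image regions
            return self.norm(q + attended)      # residual + norm: vision-aligned text features

    # Usage with random tensors standing in for encoder outputs.
    text = torch.randn(2, 20, 768)    # batch of 2 captions, 20 tokens each
    image = torch.randn(2, 49, 2048)  # 7x7 grid of visual features per image
    aligned = CrossModalAttention()(text, image)
    print(aligned.shape)  # torch.Size([2, 20, 512])

In such a design, the attended text features can be fed to a caption decoder (English or Chinese), which is one common way to couple a pre-trained language model with visual features.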