Neural Networks, 2022, Vol. 148, 13. DOI: 10.1016/j.neunet.2022.01.011

Dual Global Enhanced Transformer for image captioning

Xian, Tiantao (1); Li, Zhixin (1); Zhang, Canlong (1); Ma, Huifang (2)

Author information

  • 1. Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University
  • 2. College of Computer Science & Engineering, Northwest Normal University

Abstract

Transformer-based architectures have shown great success in image captioning, where the self-attention module can model interactions between source and target (e.g., object-to-object, object-to-word, word-to-word). However, global information, which is essential for understanding scene content, is not explicitly considered in the attention weight calculation. In this paper, we propose the Dual Global Enhanced Transformer (DGET) to incorporate global information in both the encoding and decoding stages. Concretely, in DGET we treat the grid feature as visual global information and adaptively fuse it into the region features in each layer through a novel Global Enhanced Encoder (GEE). During decoding, we propose a Global Enhanced Decoder (GED) to explicitly exploit textual global information: first, we devise a context encoder that encodes an existing caption, generated by a classic captioner, into a context vector; then, we use this context vector to guide the decoder toward generating accurate words at each time step. To validate our model, we conduct extensive experiments on the MS COCO image captioning dataset and achieve superior performance over many state-of-the-art methods.

(c) 2022 Elsevier Ltd. All rights reserved.
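
The abstract gives only a high-level description of the Global Enhanced Encoder, so the sketch below is an illustrative assumption rather than the authors' implementation: it shows one plausible way to "adaptively fuse" a global grid summary into region features, using a learned sigmoid gate over the concatenated features. The class name `GlobalEnhancedFusion`, the mean-pooled global vector, and all dimensions are hypothetical.

```python
# Minimal sketch (not the paper's code) of gated global-feature fusion:
# a pooled grid ("global") vector is adaptively mixed into region features.
import torch
import torch.nn as nn


class GlobalEnhancedFusion(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Gate decides, per region and per channel, how much global
        # context to absorb.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, d_model)  region-level features
        # grid:    (batch, n_cells, d_model)    grid-level features
        g = grid.mean(dim=1, keepdim=True)          # global summary vector
        g = g.expand(-1, regions.size(1), -1)       # broadcast to each region
        alpha = torch.sigmoid(self.gate(torch.cat([regions, g], dim=-1)))
        return self.norm(regions + alpha * g)       # adaptive residual fusion


if __name__ == "__main__":
    fuse = GlobalEnhancedFusion(d_model=512)
    out = fuse(torch.randn(2, 36, 512), torch.randn(2, 49, 512))
    print(out.shape)  # torch.Size([2, 36, 512])
```

A gate of this kind lets each region decide how much scene-level context to take in, which matches the abstract's claim that the fusion into region features is adaptive at each layer.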

Keywords

Image captioning; Transformer; Global information; Visual attention; Reinforcement learning


Publication year: 2022
Journal: Neural Networks (ISSN: 0893-6080; indexed in EI and SCI)
Cited by: 28
References: 57