Dual Global Enhanced Transformer for image captioning
Original text links: NSTL | Elsevier
Transformer-based architectures have shown great success in image captioning, where the self-attention module models source and target interactions (e.g., object-to-object, object-to-word, word-to-word). However, global information, which is essential to understanding the scene content, is not explicitly considered in the attention weight calculation. In this paper, we propose the Dual Global Enhanced Transformer (DGET) to incorporate global information in both the encoding and decoding stages. Concretely, in DGET, we regard the grid feature as the visual global information and adaptively fuse it into the region features in each layer via a novel Global Enhanced Encoder (GEE). During decoding, we propose a Global Enhanced Decoder (GED) to explicitly utilize textual global information. First, we devise a context encoder that encodes an existing caption, generated by a classic captioner, as a context vector. Then, we use the context vector to guide the decoder to generate accurate words at each time step. To validate our model, we conduct extensive experiments on the MS COCO image captioning dataset and achieve superior performance over many state-of-the-art methods. (c) 2022 Elsevier Ltd. All rights reserved.
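The encoder-side idea described above, adaptively fusing a global grid feature into region features at each layer, can be illustrated with a minimal sketch. The snippet below assumes a PyTorch setting; the module name GlobalEnhancedFusion, the sigmoid gating mechanism, and the tensor shapes are illustrative assumptions, not the paper's exact GEE formulation.

```python
import torch
import torch.nn as nn

class GlobalEnhancedFusion(nn.Module):
    """Hypothetical sketch: gate-based fusion of a global grid feature
    into per-region features, in the spirit of the abstract's GEE.
    The exact fusion used in the paper may differ."""

    def __init__(self, d_model: int):
        super().__init__()
        # Gate computed from the concatenation of region and global features (assumption).
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, regions: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, d_model); global_feat: (batch, d_model)
        g = global_feat.unsqueeze(1).expand_as(regions)                       # broadcast global info
        alpha = torch.sigmoid(self.gate(torch.cat([regions, g], dim=-1)))     # per-region, per-channel gate
        return alpha * regions + (1.0 - alpha) * g                            # adaptive fusion


if __name__ == "__main__":
    fusion = GlobalEnhancedFusion(d_model=512)
    regions = torch.randn(2, 36, 512)   # e.g. 36 detected region features per image
    grid_global = torch.randn(2, 512)   # pooled grid feature acting as visual global information
    out = fusion(regions, grid_global)
    print(out.shape)  # torch.Size([2, 36, 512])
```

Analogously on the decoder side, the context vector produced by the context encoder would be injected at each time step to guide word generation, e.g., by conditioning the decoder's attention or output projection on it; the abstract does not specify the exact mechanism.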