LLAFN-Generator: Learnable linear-attention with fast-normalization for large-scale image captioning

Although the Transformer is widely used in computer vision, the quadratic complexity of its self-attention hinders processing in large-scale image captioning tasks. In this paper, we therefore propose a Learnable Linear-Attention with Fast-Normalization generator for large-scale image captioning (dubbed LLAFN-Generator). First, it introduces a Learnable Linear-Attention (LLA) module to learn the weight scores of large-scale images; the module is implemented simply with two linear layers and greatly reduces computational complexity. Meanwhile, a Fast-Normalization (FN) method replaces the original Softmax function in the Learnable Linear-Attention to improve computational speed. Additionally, a feature enhancement module is used to compensate for shallow, fine-grained information and thereby enhance the model's feature representation. Finally, extensive experiments on the MS COCO dataset show that, compared with models of the same size, computational complexity is reduced by 30% and the parameter count by 20%, while the BLEU_1 and CIDEr metrics increase by 1.2% and 3.6%, respectively.
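
The abstract gives no formulas, so the following is only a rough sketch of one possible reading of "attention through two linear layers with a fast normalization in place of Softmax". Everything in it is an assumption made for illustration: the names `fast_normalize`, `linear`, and `learnable_linear_attention`, the use of `m` learned slots between the two layers, and the ReLU-then-divide-by-sum normalization are not taken from the paper.

```python
def fast_normalize(scores, eps=1e-6):
    """Assumed normalization: clip negatives to zero, then divide by the sum.

    Unlike softmax, this needs no exponentials.
    """
    positive = [max(s, 0.0) for s in scores]
    total = sum(positive) + eps
    return [p / total for p in positive]


def linear(x, weight):
    """Apply a bias-free linear layer: x is n x d_in, weight is d_in x d_out."""
    columns = list(zip(*weight))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in x]


def learnable_linear_attention(x, w_score, w_value):
    """Hypothetical two-layer attention.

    x:       n x d input features (n image tokens)
    w_score: d x m learned weights producing attention scores
    w_value: m x d learned weights mapping scores back to features
    Cost grows as O(n * d * m), i.e. linearly in n for fixed d and m,
    instead of the O(n^2 * d) of pairwise self-attention.
    """
    scores = linear(x, w_score)                       # n x m
    weights = [fast_normalize(row) for row in scores]  # each row sums to ~1
    return linear(weights, w_value)                   # n x d


# Toy usage: 3 tokens, feature size 2, 3 learned slots.
x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
w_score = [[0.1, 0.2, 0.3], [0.4, -0.5, 0.6]]
w_value = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = learnable_linear_attention(x, w_score, w_value)
```

Because no token-to-token score matrix is formed in this sketch, doubling the number of tokens doubles the work rather than quadrupling it, which is the kind of saving the abstract attributes to the LLA module.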

Image captioning; Transformer; Fast-normalization; Learnable linear-attention

Xiaobao Yang, Xi Tian, Junsheng Wu, Xiaochun Yang, Sugang Ma, Xinman Qi, Zhiqiang Hou

School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an, 710061, China; School of Computer Science, Northwestern Polytechnical University, Xi'an, 710021, China

School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an, 710061, China

School of Software, Northwestern Polytechnical University, Xi'an, 710021, China

School of Aerospace Academy, Northwestern Polytechnical University, Xi'an, 710021, China

School of Electronic Engineering, Xi'an University of Posts and Telecommunications, Xi'an, 710061, China

2024

Computer vision and image understanding

EI, SCI
ISSN:1077-3142
Year, volume (issue): 2024, 248 (Nov.)
  • 59