Image Captioning Method Based on Transformer Visual Features Fusion
Existing image captioning methods rely only on regional visual features to generate descriptions and ignore the complementary information carried by grid visual features. Moreover, because these methods are two-stage approaches, captioning quality suffers. To address these issues, this study proposes an end-to-end image captioning method based on Transformer visual feature fusion. First, in the feature extraction stage, a visual feature extractor extracts both regional and grid visual features. Second, in the feature fusion stage, the regional and grid visual features are concatenated by a visual feature fusion module. Finally, the fused visual features are fed to a language generator to produce the caption. All components of the method are built on the Transformer model, making it a one-stage approach. Experimental results on the MS-COCO dataset show that the proposed method fully exploits the respective advantages of regional and grid visual features, reaching 83.1%, 41.5%, 30.2%, 60.1%, 140.3%, and 23.9% on the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics, respectively. These results indicate that the proposed method outperforms mainstream image captioning methods and generates more accurate and richer descriptions.
Keywords: image captioning; regional visual features; grid visual features; Transformer model; end-to-end training
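To make the fusion step concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: regional and grid visual features are projected to a shared dimension, concatenated along the token axis, and passed through a Transformer encoder so both feature types can attend to each other. The class and parameter names (VisualFeatureFusion, d_model, layer counts, feature dimensions) are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch only; module/parameter names are assumptions,
# not the paper's published architecture.
import torch
import torch.nn as nn

class VisualFeatureFusion(nn.Module):
    """Concatenate regional and grid visual features as one token sequence,
    then fuse them with a Transformer encoder (assumed design)."""
    def __init__(self, region_dim=2048, grid_dim=2048,
                 d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # project detector features
        self.grid_proj = nn.Linear(grid_dim, d_model)      # project grid features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, region_feats, grid_feats):
        # region_feats: (B, N_regions, region_dim); grid_feats: (B, N_grid, grid_dim)
        tokens = torch.cat([self.region_proj(region_feats),
                            self.grid_proj(grid_feats)], dim=1)
        return self.encoder(tokens)  # fused visual tokens for the language generator

# Example: the fused tokens would serve as cross-attention memory
# for a Transformer caption decoder.
fusion = VisualFeatureFusion()
regions = torch.randn(2, 36, 2048)  # e.g., 36 detected regions per image
grid = torch.randn(2, 49, 2048)     # e.g., a 7x7 grid of CNN/ViT features
memory = fusion(regions, grid)      # shape: (2, 85, 512)
```

Concatenating along the token axis (rather than averaging or gating) lets self-attention decide per caption word which feature type to draw on, which is one plausible reading of the "visual feature fusion module" the abstract describes.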