Multi-Scale Image Caption Generation Based on an Improved Transformer
The Transformer model is widely used in image caption generation, but it suffers from the following problems: ① it relies on complex neural networks for image preprocessing; ② self-attention has quadratic computational complexity; ③ masked self-attention lacks image guidance information. To address these issues, a multi-scale image caption generation model based on an improved Transformer is proposed. First, the image is divided into multi-scale patches to obtain multi-level image features, which are then linearly mapped as input to the Transformer; this avoids complex neural-network preprocessing and speeds up model training and inference. Second, linear-complexity memory attention is used in the encoder: learnable shared memory units learn prior knowledge of the entire dataset and explore potential correlations between samples. Finally, visual-guided attention is introduced into the decoder, using visual features as auxiliary information to guide the decoder toward semantic descriptions that better match the image content. Test results on the COCO 2014 dataset show that, compared with the base model, the improved model raises the CIDEr, METEOR, ROUGE, and SPICE scores by 2.6, 0.7, 0.4, and 0.7, respectively. The multi-scale image caption generation model based on the improved Transformer can therefore generate more accurate language descriptions.
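The abstract does not give the exact formulation of the multi-scale patch embedding, but a minimal PyTorch sketch of the idea might look as follows, assuming a ViT-style linear projection implemented as a strided convolution per scale. The class name `MultiScalePatchEmbed`, the patch sizes `(16, 32)`, and `d_model=512` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class MultiScalePatchEmbed(nn.Module):
    """Split an image into patches at several scales and linearly project
    each patch to the model dimension, replacing CNN feature extraction."""
    def __init__(self, patch_sizes=(16, 32), in_chans=3, d_model=512):
        super().__init__()
        # A convolution with kernel == stride is exactly a linear patch projection.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, d_model, kernel_size=p, stride=p)
            for p in patch_sizes
        )

    def forward(self, x):                                 # x: (B, 3, H, W)
        tokens = []
        for proj in self.projs:
            t = proj(x)                                   # (B, d_model, H/p, W/p)
            tokens.append(t.flatten(2).transpose(1, 2))   # (B, N_p, d_model)
        return torch.cat(tokens, dim=1)                   # tokens from all scales

# For a 224x224 input this yields 14*14 + 7*7 = 245 tokens:
feats = MultiScalePatchEmbed()(torch.randn(1, 3, 224, 224))  # (1, 245, 512)
```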
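The "linear-complexity memory attention" in the encoder is likewise only named in the abstract. One common way such a mechanism achieves linear cost is to let the n input tokens attend to a fixed bank of m learnable memory slots, so the cost is O(n·m) instead of O(n²); the sketch below assumes that design, and the names `MemoryAttention`, `num_mem`, `mem_k`, and `mem_v` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Each token attends to m learnable memory slots shared across the
    whole dataset (the 'shared memory units'), giving O(n*m) complexity."""
    def __init__(self, d_model=512, num_mem=64):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(num_mem, d_model) * d_model ** -0.5)
        self.mem_v = nn.Parameter(torch.randn(num_mem, d_model) * d_model ** -0.5)
        self.q_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                                  # x: (B, N, d_model)
        q = self.q_proj(x)
        attn = torch.einsum('bnd,md->bnm', q, self.mem_k) * self.scale
        attn = F.softmax(attn, dim=-1)                     # (B, N, m)
        out = torch.einsum('bnm,md->bnd', attn, self.mem_v)
        return self.out(out)
```

Because the memory slots are parameters rather than per-sample activations, they accumulate dataset-level priors during training, which matches the abstract's claim of exploring correlations between samples.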
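Finally, one plausible reading of "visual-guided attention" in the decoder is masked self-attention whose keys and values are augmented with the encoder's visual features, so each generated word can look at the image as well as at previous words. The sketch below assumes that interpretation; `VisualGuidedAttention` and its signature are illustrative, not the paper's definition.

```python
import torch
import torch.nn as nn

class VisualGuidedAttention(nn.Module):
    """Causal self-attention over words with visual tokens prepended to
    the keys/values, so image features guide every decoding step."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_proj = nn.Linear(d_model, d_model)

    def forward(self, words, visual):        # words: (B, T, d); visual: (B, N, d)
        T = words.size(1)
        v = self.vis_proj(visual)
        kv = torch.cat([v, words], dim=1)    # (B, N + T, d)
        N = v.size(1)
        # Causal mask over word positions; visual tokens stay fully visible.
        mask = torch.zeros(T, N + T, dtype=torch.bool, device=words.device)
        mask[:, N:] = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=words.device), diagonal=1
        )
        out, _ = self.attn(words, kv, kv, attn_mask=mask)
        return out
```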