Multi-Scale Image Caption Generation Based on an Improved Transformer
The Transformer model is widely used in image caption generation, but it suffers from the following problems: ① it relies on complex neural networks for image preprocessing; ② self-attention has quadratic computational complexity; ③ masked self-attention lacks image guidance information. To address these issues, a multi-scale image caption generation model based on an improved Transformer is proposed. First, the image is divided into multi-scale patches to obtain multi-level image features, which are then linearly mapped as input to the Transformer; this avoids complex neural-network preprocessing and speeds up model training and inference. Second, linear-complexity memory attention is used in the encoder: learnable shared memory units learn prior knowledge of the entire dataset and explore potential correlations between samples. Finally, visual-guided attention is introduced into the decoder, using visual features as auxiliary information to guide the decoder toward semantic descriptions that better match the image content. Test results on the COCO 2014 dataset show that, compared with the base model, the improved model raises the CIDEr, METEOR, ROUGE, and SPICE scores by 2.6, 0.7, 0.4, and 0.7, respectively. The multi-scale image caption generation model based on the improved Transformer can therefore generate more accurate language descriptions.
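The abstract does not give the exact formulation of the multi-scale patch embedding, but a minimal PyTorch sketch of the idea might look as follows, assuming a ViT-style linear projection implemented as a strided convolution per scale. The class name `MultiScalePatchEmbed`, the patch sizes `(16, 32)`, and `d_model=512` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class MultiScalePatchEmbed(nn.Module):
    """Split an image into patches at several scales and linearly project
    each patch to the model dimension, replacing CNN feature extraction."""
    def __init__(self, patch_sizes=(16, 32), in_chans=3, d_model=512):
        super().__init__()
        # A convolution with kernel == stride is exactly a linear patch projection.
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, d_model, kernel_size=p, stride=p)
            for p in patch_sizes
        )

    def forward(self, x):                                 # x: (B, 3, H, W)
        tokens = []
        for proj in self.projs:
            t = proj(x)                                   # (B, d_model, H/p, W/p)
            tokens.append(t.flatten(2).transpose(1, 2))   # (B, N_p, d_model)
        return torch.cat(tokens, dim=1)                   # tokens from all scales

# For a 224x224 input this yields 14*14 + 7*7 = 245 tokens:
feats = MultiScalePatchEmbed()(torch.randn(1, 3, 224, 224))  # (1, 245, 512)
```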
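The "linear-complexity memory attention" in the encoder is likewise only named in the abstract. One common way such a mechanism achieves linear cost is to let the n input tokens attend to a fixed bank of m learnable memory slots, so the cost is O(n·m) instead of O(n²); the sketch below assumes that design, and the names `MemoryAttention`, `num_mem`, `mem_k`, and `mem_v` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAttention(nn.Module):
    """Each token attends to m learnable memory slots shared across the
    whole dataset (the 'shared memory units'), giving O(n*m) complexity."""
    def __init__(self, d_model=512, num_mem=64):
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(num_mem, d_model) * d_model ** -0.5)
        self.mem_v = nn.Parameter(torch.randn(num_mem, d_model) * d_model ** -0.5)
        self.q_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                                  # x: (B, N, d_model)
        q = self.q_proj(x)
        attn = torch.einsum('bnd,md->bnm', q, self.mem_k) * self.scale
        attn = F.softmax(attn, dim=-1)                     # (B, N, m)
        out = torch.einsum('bnm,md->bnd', attn, self.mem_v)
        return self.out(out)
```

Because the memory slots are parameters rather than per-sample activations, they accumulate dataset-level priors during training, which matches the abstract's claim of exploring correlations between samples.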
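Finally, one plausible reading of "visual-guided attention" in the decoder is masked self-attention whose keys and values are augmented with the encoder's visual features, so each generated word can look at the image as well as at previous words. The sketch below assumes that interpretation; `VisualGuidedAttention` and its signature are illustrative, not the paper's definition.

```python
import torch
import torch.nn as nn

class VisualGuidedAttention(nn.Module):
    """Causal self-attention over words with visual tokens prepended to
    the keys/values, so image features guide every decoding step."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_proj = nn.Linear(d_model, d_model)

    def forward(self, words, visual):        # words: (B, T, d); visual: (B, N, d)
        T = words.size(1)
        v = self.vis_proj(visual)
        kv = torch.cat([v, words], dim=1)    # (B, N + T, d)
        N = v.size(1)
        # Causal mask over word positions; visual tokens stay fully visible.
        mask = torch.zeros(T, N + T, dtype=torch.bool, device=words.device)
        mask[:, N:] = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=words.device), diagonal=1
        )
        out, _ = self.attn(words, kv, kv, attn_mask=mask)
        return out
```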