Image Caption Generation Based on Location and Multi-Layer Encoding

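To address the insufficient use of location-information correlations and of the information in each encoder layer in image captioning, a Transformer-based image caption generation model with location encoding and multi-layer aggregation encoding is proposed. The model introduces a location encoding mechanism for visual objects: by extracting the relative spatial information hidden in the location information of independent regions, it helps the model attend to the differences and relationships between visual objects. The model also introduces a multi-layer aggregation attention encoding, which combines gated recurrent units with self-attention to pass multi-layer image encoding information to the output layer, enriching the semantics of the resulting image features. Experimental results show that the proposed model clearly outperforms image captioning models with a traditional encoder-decoder structure and generates more accurate and richer descriptions.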
Enhanced Transformer based on encoding of location and multi-layer aggregation
An image captioning model, an enhanced Transformer based on location encoding and multi-layer aggregation, is proposed to address two problems in image captioning: insufficient use of location-information correlations and insufficient use of the information at different encoder levels. The proposed model introduces a location encoding mechanism for visual objects, which helps the model attend to the differences and relationships between visual objects by extracting the relative spatial information hidden in the location information of independent regions. In addition, a multi-layer aggregation attention encoding is designed, which transmits the multi-layer image encoding information to the output layer by combining a gated recurrent unit with self-attention, so that the semantics of the acquired image features are richer. The experimental results show that the proposed model clearly outperforms models with the traditional encoder-decoder structure, and that its output sentences are more accurate and richer in detail.

image captioning; Transformer; multi-layer aggregation encoding; location encoding; gated recurrent unit

Jiang Weiwei (姜维维), Yang You (杨有), Wang Xingjian (汪兴建)

College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China

National Center for Applied Mathematics in Chongqing, Chongqing 401331, China

Chongqing Education Management School, Chongqing 400066, China

image caption generation; Transformer; multi-layer aggregation encoding; location encoding; gated recurrent unit

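Science and Technology Research Program of Chongqing Municipal Education Commission; Science and Technology Research Program of Chongqing Municipal Education Commission; Chongqing Education Science "14th Five-Year Plan" Project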

KJZD-K202200504; KJQN-202200564; 2022-576

2024

Information Technology (信息技术)
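Heilongjiang Information Technology Society; China Center for Information Industry Development; Electronic Information Center of the Ministry of Information Industry of China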

CSTPCD
Impact factor: 0.413
ISSN:1009-2552
Year, volume (issue): 2024, (9)