Diverse Image Captioning Based on Hybrid Global and Sequential Variational Transformer
Diverse image captioning has become a research hotspot in the field of image description. Existing methods generally ignore the dependency between global and sequential latent vectors, which severely limits further performance gains. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, we construct a hybrid conditional variational autoencoder to effectively model the dependency between global and sequential latent vectors. Second, the evidence lower bound is derived by maximizing the conditional likelihood of the hybrid autoencoder, and it serves as the objective function for diverse image captioning. Finally, we seamlessly combine the Transformer model with the hybrid conditional variational autoencoder, so that the two can be jointly optimized to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with state-of-the-art methods, when randomly generating 20 and 100 captions, the diversity metric m-BLEU (mutual-overlap Bilingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
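The evidence lower bound mentioned above can be illustrated with a generic sketch. Assuming the hybrid model pairs one global latent $z_g$ with per-step sequential latents $z_{1:T}$, conditions on the image $I$, and factorizes the sequential prior autoregressively given $z_g$ (the exact notation and factorization are assumptions, not taken from the paper), a plausible form of the bound is:

```latex
\log p(y \mid I) \;\geq\;
\mathbb{E}_{q(z_g, z_{1:T} \mid y, I)}
  \big[\log p(y \mid z_g, z_{1:T}, I)\big]
\;-\; D_{\mathrm{KL}}\!\big(q(z_g \mid y, I) \,\|\, p(z_g \mid I)\big)
\;-\; \sum_{t=1}^{T}
  \mathbb{E}_{q}\Big[
    D_{\mathrm{KL}}\!\big(q(z_t \mid z_g, y, I) \,\|\,
                          p(z_t \mid z_g, y_{<t}, I)\big)
  \Big]
```

Here the second KL term regularizes the global latent against an image-conditioned prior, while the summed per-step KL terms tie each sequential latent $z_t$ to a prior that depends on $z_g$, which is one way to encode the global-to-sequential dependency the abstract describes.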
image understanding; image captioning; variational autoencoding; latent embedding; multi-modal learning; generative model