Acta Electronica Sinica (电子学报), 2024, Vol. 52, Issue 7: 2219-2227. DOI: 10.12263/DZXB.20231156


Diverse Image Captioning via Conditional Variational Inference and Introspective Adversarial Learning


刘兵¹, 李穗¹, 刘明明², 刘浩¹

Author Information

  • 1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China; Engineering Research Center of Mine Digitization, Ministry of Education, Xuzhou, Jiangsu 221116, China
  • 2. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China


Abstract

Limited by the latent space modeling ability and pre-defined diversity metrics, most diverse image captioning models fail to achieve a balance between diversity and accuracy. To this end, we propose a novel diverse image captioning framework, which consists of a transformer-based variational inference encoder and a generator. Specifically, the variational inference network aims to learn a latent space for each word to enhance the ability of caption diversity modeling, while the generator network produces diverse captions conditioned on each image and a sequence of latent variables. To overcome the limitation of pre-defined metrics, we introduce introspective adversarial learning into the proposed model, where the variational inference network also serves as a discriminator to distinguish between the ground-truth captions and those produced by the generator, without any extra discriminator. The proposed method is thus endowed with the ability to self-evaluate the quality of the generated captions. Experimental results on the MSCOCO dataset show that, compared with conventional methods, the proposed method with 100 samples improves the mBLEU (mutual overlap-BiLingual Evaluation Understudy) score by 1.9% and the CIDEr (Consensus-based Image Description Evaluation) score by 7.5%. Compared with typical multimodal large models, the proposed method is more suitable for generating diverse declarative captions while using far fewer parameters.
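The architecture described above — a per-word latent space inferred from the caption and a global image feature, with the same encoder doubling as a discriminator — can be illustrated with a toy NumPy sketch. Everything here is an illustrative assumption, not the paper's implementation: the linear "encoder", the dimensions, and the IntroVAE-style hinge that treats the encoder's KL term as an energy (low for real captions, high for generated ones).

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gaussian(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims per word
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def reparameterize(mu, logvar, rng):
    # Sample z = mu + sigma * eps so gradients could flow through mu, logvar
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

T, D, Z = 5, 8, 4                       # caption length, feature dim, latent dim
W_mu = rng.standard_normal((D, Z)) * 0.1
W_lv = rng.standard_normal((D, Z)) * 0.1
W_dec = rng.standard_normal((Z, D)) * 0.1

def encode(words, image):
    # Toy stand-in for the variational encoder: fuse each word embedding
    # with the global image feature, then map to per-word mean/log-variance
    h = words + image                   # (T, D) + (1, D) broadcasts over words
    return h @ W_mu, h @ W_lv           # each (T, Z): one Gaussian per word

words = rng.standard_normal((T, D))     # stand-in caption embeddings
image = rng.standard_normal((1, D))     # stand-in global image feature

mu, logvar = encode(words, image)
z = reparameterize(mu, logvar, rng)     # sequence of per-word latents
decoded = z @ W_dec                     # generator stand-in: decode latents
kl = kl_gaussian(mu, logvar).sum()      # ELBO regularizer over the sequence

# Introspective term: the SAME encoder scores real vs. generated captions
# by their KL "energy" -- no separate discriminator network is introduced.
fake_words = rng.standard_normal((T, D))
mu_f, lv_f = encode(fake_words, image)
energy_real = kl_gaussian(mu, logvar).mean()
energy_fake = kl_gaussian(mu_f, lv_f).mean()
margin = 1.0
d_loss = energy_real + max(0.0, margin - energy_fake)  # hinge on fake energy
```

In a real model the encoder would be a transformer and the decoded latents would feed an autoregressive caption generator; the sketch only shows how one network can supply both the variational posterior and the adversarial score.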


Key words

image captioning; variational inference; adversarial learning; latent embedding; multi-modal learning; generative model


Funding

National Natural Science Foundation of China (62276266)

National Natural Science Foundation of China (61801198)

Publication year: 2024
Journal: Acta Electronica Sinica (电子学报), Chinese Institute of Electronics
ISSN: 0372-2112