Diverse Image Captioning via Conditional Variational Inference and Introspective Adversarial Learning

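Existing diverse image captioning methods are constrained by the representational capacity of their latent spaces and by the evaluation metrics they rely on, which makes it difficult to achieve both diversity and accuracy in the generated captions. To address this, this paper proposes a new diverse image captioning model consisting of a conditional variational inference encoder and a generator. The encoder uses global attention to learn a latent vector space for each word, improving the model's ability to capture caption diversity. The generator produces diverse captions from a given image and a sequence of latent vectors. In addition, the idea of introspective adversarial learning is introduced: the conditional variational inference encoder also acts as a discriminator that distinguishes real captions from generated ones, which gives the model the ability to evaluate its own generated captions and overcomes the limitations of predefined evaluation metrics. Experiments on the MSCOCO dataset show that, compared with conventional methods, when 100 captions are randomly generated, the diversity metric mBLEU (mutual overlap-BiLingual Evaluation Understudy) improves by 1.9% and the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves markedly, by 7.5%. Compared with typical multimodal large models, the proposed method, with a relatively small number of parameters, is better suited to generating diverse declarative captions.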
Diverse Image Captioning via Conditional Variational Inference and Introspective Adversarial Learning
Limited by the modeling ability of the latent space and by pre-defined diversity metrics, most diverse image captioning models fail to achieve a balance between diversity and accuracy. To this end, we propose a novel diverse image captioning framework, which consists of a transformer-based variational inference encoder and a generator. Specifically, the variational inference network learns a latent space for each word to strengthen the modeling of caption diversity, while the generator network produces diverse captions conditioned on each image and a sequence of latent variables. To overcome the limitation of pre-defined metrics, we introduce introspective adversarial learning into the proposed model: the variational inference network also serves as a discriminator that distinguishes ground-truth captions from those produced by the generator, so no extra discriminator is needed. The proposed method is thereby endowed with the ability to self-evaluate the quality of its generated captions. Experimental results on the MSCOCO dataset show that, compared with conventional methods, the proposed method with 100 samples improves the mBLEU (mutual overlap-BiLingual Evaluation Understudy) score by 1.9% and the CIDEr (Consensus-based Image Description Evaluation) score by 7.5%. Compared with typical multimodal large models, the proposed method is more suitable for generating diverse declarative captions with fewer parameters.
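The diversity metric named in the abstract, mBLEU, is commonly computed by scoring each sampled caption with BLEU against the remaining samples for the same image and averaging the scores, so a lower value indicates less mutual overlap. The following is a minimal sketch of that idea in plain Python, not the evaluation script used in the paper. The function names `simple_bleu` and `mutual_bleu`, the whitespace tokenization, and the default of bigrams (`max_n=2`) are assumptions made for illustration; standard evaluations typically use up to 4-grams and a dedicated tokenizer.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=2):
    """Simplified BLEU (illustrative): clipped n-gram precision,
    geometric mean over n = 1..max_n, and a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate's.
    closest = min(
        (len(ref) for ref in references),
        key=lambda length: (abs(length - len(candidate)), length),
    )
    bp = 1.0 if len(candidate) >= closest else math.exp(1 - closest / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


def mutual_bleu(captions, max_n=2):
    """Average BLEU of each caption against all the others.
    Requires at least two captions; lower means more diverse."""
    tokenized = [caption.lower().split() for caption in captions]
    scores = [
        simple_bleu(cand, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, cand in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)
```

Under this sketch, a set of identical captions scores 1.0 and a set of captions that share no words scores 0.0; a set of 100 sampled captions for one image would be passed to `mutual_bleu` as a list of strings.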

image captioning; variational inference; adversarial learning; latent embedding; multi-modal learning; generative model

Liu Bing, Li Sui, Liu Mingming, Liu Hao

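School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China

Engineering Research Center of Mine Digitization, Ministry of Education, Xuzhou, Jiangsu 221116, China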

image captioning; variational inference; adversarial learning; latent embedding; multimodal learning; generative model

National Natural Science Foundation of China (Nos. 62276266, 61801198)

2024

Acta Electronica Sinica
Chinese Institute of Electronics

CSTPCD, Peking University Core Journals
Impact factor: 1.237
ISSN: 0372-2112
Year, volume (issue): 2024, 52(7)