Diverse Image Captioning via Conditional Variational Inference and Introspective Adversarial Learning
Limited by latent-space modeling ability and pre-defined diversity metrics, most diverse image captioning models fail to strike a balance between diversity and accuracy. To this end, we propose a novel diverse image captioning framework, which consists of a transformer-based variational inference encoder and a generator. Specifically, the variational inference network learns a latent space for each word to enhance the model's ability to capture caption diversity, while the generator network produces diverse captions conditioned on each image and a sequence of latent variables. To overcome the limitation of pre-defined metrics, we introduce introspective adversarial learning into the proposed model, where the variational inference network also serves as a discriminator that distinguishes ground-truth captions from those produced by the generator, without any extra discriminator. The proposed method is thus endowed with the ability to self-evaluate the quality of the generated captions. Experimental results on the MSCOCO dataset show that, compared with conventional methods, the proposed method with 100 samples improves the mBLEU (mutual overlap BiLingual Evaluation Understudy) score by 1.9% and the CIDEr (Consensus-based Image Description Evaluation) score by 7.5%, respectively. Compared with typical multimodal large models, the proposed method is better suited to generating diverse declarative descriptive captions with fewer parameters.
image captioning; variational inference; adversarial learning; latent embedding; multi-modal learning; generative model
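The following is a minimal PyTorch sketch of the architecture summarized in the abstract, intended only to make the component roles concrete: a transformer-based variational inference encoder that produces a per-word latent distribution and doubles as the introspective discriminator, and a transformer decoder that generates captions conditioned on the image and the sampled latent sequence. Module names, feature dimensions (e.g. 2048-d image region features), and the single linear "realness" head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VariationalInferenceEncoder(nn.Module):
    """Infers a per-word posterior q(z_t | image, caption) and, reusing the same
    backbone, scores image-caption pairs as the introspective discriminator."""

    def __init__(self, vocab_size, d_model=512, latent_dim=64, n_layers=3, n_heads=8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(2048, d_model)          # assumes 2048-d region features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)       # per-word posterior mean
        self.to_logvar = nn.Linear(d_model, latent_dim)   # per-word posterior log-variance
        self.score = nn.Linear(d_model, 1)                # introspective real/fake score

    def forward(self, img_feats, tokens):
        # img_feats: (B, R, 2048), tokens: (B, T)
        x = torch.cat([self.img_proj(img_feats), self.word_emb(tokens)], dim=1)
        h = self.encoder(x)
        h_words = h[:, img_feats.size(1):]                # hidden states aligned with words
        mu, logvar = self.to_mu(h_words), self.to_logvar(h_words)
        realness = self.score(h.mean(dim=1))              # one scalar per image-caption pair
        return mu, logvar, realness


class CaptionGenerator(nn.Module):
    """Transformer decoder producing captions conditioned on the image features
    and a sequence of per-word latent variables."""

    def __init__(self, vocab_size, d_model=512, latent_dim=64, n_layers=3, n_heads=8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.z_proj = nn.Linear(latent_dim, d_model)
        self.img_proj = nn.Linear(2048, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, tokens, z):
        # Inject the latent variable at every word position before decoding.
        tgt = self.word_emb(tokens) + self.z_proj(z)
        T = tokens.size(1)
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.decoder(tgt, self.img_proj(img_feats), tgt_mask=causal_mask)
        return self.out(h)                                # (B, T, vocab_size) logits


def reparameterize(mu, logvar):
    # Sample z ~ N(mu, sigma^2) via the reparameterization trick so gradients
    # flow through the inference network.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```

At inference time, diverse captions would be obtained by drawing different latent sequences z for the same image and decoding each with the generator; during training, the encoder's realness score would supply the adversarial signal in place of a separate discriminator.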