Diverse Image Captioning Based on Hybrid Global and Sequential Variational Transformer
Diverse image captioning has become a research hotspot in the field of image description. Existing methods generally ignore the dependency between global and sequential latent vectors, which severely limits further performance gains. To address this problem, this paper proposes a diverse image captioning framework based on a hybrid variational Transformer. First, we construct a hybrid conditional variational autoencoder to effectively model the dependency between global and sequential latent vectors. Second, the evidence lower bound is derived by maximizing the conditional likelihood of the hybrid autoencoder, and it serves as the objective function for diverse image captioning. Finally, we seamlessly combine the Transformer model with the hybrid conditional variational autoencoder, so that the two can be jointly optimized to improve the generalization performance of diverse image captioning. Experimental results on the MSCOCO dataset show that, compared with state-of-the-art methods, when randomly generating 20 and 100 captions, the diversity metric m-BLEU (mutual-overlap Bilingual Evaluation Understudy) improves by 4.2% and 4.7%, respectively, while the accuracy metric CIDEr (Consensus-based Image Description Evaluation) improves by 4.4% and 15.2%, respectively.
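The evidence lower bound mentioned above can be illustrated with a generic sketch. Assuming the hybrid model pairs one global latent $z_g$ with per-step sequential latents $z_{1:T}$, conditions on the image $I$, and factorizes the sequential prior autoregressively given $z_g$ (the exact notation and factorization are assumptions, not taken from the paper), a plausible form of the bound is:

```latex
\log p(y \mid I) \;\geq\;
\mathbb{E}_{q(z_g, z_{1:T} \mid y, I)}
  \big[\log p(y \mid z_g, z_{1:T}, I)\big]
\;-\; D_{\mathrm{KL}}\!\big(q(z_g \mid y, I) \,\|\, p(z_g \mid I)\big)
\;-\; \sum_{t=1}^{T}
  \mathbb{E}_{q}\Big[
    D_{\mathrm{KL}}\!\big(q(z_t \mid z_g, y, I) \,\|\,
                          p(z_t \mid z_g, y_{<t}, I)\big)
  \Big]
```

Here the second KL term regularizes the global latent against an image-conditioned prior, while the summed per-step KL terms tie each sequential latent $z_t$ to a prior that depends on $z_g$, which is one way to encode the global-to-sequential dependency the abstract describes.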
image understanding; image captioning; variational autoencoding; latent embedding; multi-modal learning; generative model