Chinese image captioning with vision-union grouping
To address the problem that the encoders used in image captioning cannot extract sufficiently fine-grained semantic features from the given images, which leads to coarse descriptions lacking textual detail, a Chinese image captioning model with vision-union grouping is proposed. The model follows the encoder-decoder framework. In the encoding stage, two types of features, global semantics and local details, are extracted through two different network channels. First, the latent semantic information of the image is extracted with the Contrastive Language-Image Pre-training (CLIP) image encoder. Second, following the idea of visual grouping, each object category in the image is divided into visual segments, which correspond to image details at different regular scales. The global and local features are fused and then converted into prefix embeddings through a mapping network. In the decoding stage, the language model GPT-2 is employed to generate the image descriptions. Experiments conducted on the AIC-ICC dataset show that, compared with existing Chinese image captioning models, the proposed model achieves the best performance, with BLEU-1 to BLEU-4 scores of 0.815, 0.711, 0.616, and 0.532, and generates more accurate and fluent description texts.
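The fusion-and-mapping step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the feature dimensions, the mean-pooling of segment features, and the single linear mapping network are all assumptions, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical dimensions (the abstract does not give exact sizes).
D_CLIP = 512      # CLIP image-encoder feature size (global semantics)
D_LOCAL = 512     # per-segment feature size from visual grouping (local details)
PREFIX_LEN = 10   # number of prefix embeddings fed to the GPT-2 decoder
D_GPT2 = 768      # GPT-2 hidden size

def fuse_and_map(global_feat, segment_feats, W, b):
    """Fuse global and local features, then map them to prefix embeddings.

    global_feat:   (D_CLIP,)     CLIP image embedding (global semantics)
    segment_feats: (k, D_LOCAL)  features of k visual segments (local details)
    W, b:          parameters of a (hypothetical) linear mapping network
    Returns:       (PREFIX_LEN, D_GPT2) prefix embeddings for the decoder
    """
    local = segment_feats.mean(axis=0)            # pool the k segment features
    fused = np.concatenate([global_feat, local])  # (D_CLIP + D_LOCAL,)
    prefix = fused @ W + b                        # (PREFIX_LEN * D_GPT2,)
    return prefix.reshape(PREFIX_LEN, D_GPT2)

# Toy example with random features and weights.
rng = np.random.default_rng(0)
g = rng.standard_normal(D_CLIP)
segs = rng.standard_normal((5, D_LOCAL))          # 5 visual segments
W = rng.standard_normal((D_CLIP + D_LOCAL, PREFIX_LEN * D_GPT2)) * 0.01
b = np.zeros(PREFIX_LEN * D_GPT2)

prefix = fuse_and_map(g, segs, W, b)
print(prefix.shape)  # (10, 768)
```

At generation time, such prefix embeddings would be prepended to the decoder's token embeddings so that GPT-2 conditions its text output on the fused visual features.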
Chinese image captioning; visual grouping; feature integration; image semantics; encoding and decoding