CIC-CGT: COMIC IMAGE CAPTIONING AND DESCRIPTION WITH MULTIMODAL LARGE SCALE MODEL
Unlike traditional image description tasks, comic image description involves not only image recognition and natural language processing but also requires the model to understand the humor, culture, and emotional attributes unique to comics. To address these challenges, this work proposes the task of comic image captioning and description and develops a novel framework, CIC-CGT, based on multimodal large models for generating comic captions and descriptions. Comic image features are first extracted by the CLIP model and fed into a prefix embedding mapping module. The mapped prefix embeddings are then fed into the GPT-2 model, which generates a rough language description conditioned on the CLIP visual features. Finally, the rough description is sent to the T5 model, which encodes its language features and decodes them into the final comic title and description. Results on the comic image description dataset NYCCB show that the proposed model can generate comic titles and descriptions in different styles and can accurately capture and express the unique humor and emotional depth of comics.
large scale model; comic images; image captioning; multimodal learning; natural language processing
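
The prefix embedding mapping step named in the abstract can be illustrated with a minimal sketch. It assumes the mapping is a single linear projection that turns one image feature vector into a short sequence of prefix embeddings for the language model; the abstract does not specify the module's internals, so the linear form, the function names `make_linear` and `map_to_prefix`, and the toy dimensions are all hypothetical and chosen only to show the shape flow.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Return (weights, bias) for a randomly initialised linear layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def map_to_prefix(image_feature, weights, bias, prefix_len, embed_dim):
    """Project one image feature vector to prefix_len embeddings of size embed_dim."""
    assert len(weights) == prefix_len * embed_dim
    # Linear projection: one output value per row of the weight matrix.
    flat = [sum(w * x for w, x in zip(row, image_feature)) + b
            for row, b in zip(weights, bias)]
    # Split the flat output into prefix_len consecutive embedding vectors.
    return [flat[i * embed_dim:(i + 1) * embed_dim]
            for i in range(prefix_len)]


# Toy sizes (assumptions): real image features and language model
# embeddings are far larger than this.
CLIP_DIM, PREFIX_LEN, EMBED_DIM = 8, 4, 6

feature = [0.1 * i for i in range(CLIP_DIM)]  # stand-in for an image feature
weights, bias = make_linear(CLIP_DIM, PREFIX_LEN * EMBED_DIM)
prefix = map_to_prefix(feature, weights, bias, PREFIX_LEN, EMBED_DIM)
print(len(prefix), len(prefix[0]))  # prints "4 6"
```

In a pipeline of the kind described, the resulting prefix vectors would be placed in front of the token embeddings so that the language model generates text conditioned on the image; the weights here are random and untrained, so the sketch demonstrates only the reshaping, not caption quality.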