CIC-CGT: Comic Image Captioning and Description with a Multimodal Large-Scale Model

Li Jiaxin (李嘉鑫)¹, Tang Pengjie (汤鹏杰)², Tan Yunlan (谭云兰)², Zhang Li (张丽)²

Author information

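1. School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009
2. School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009; Jiangxi Provincial Key Laboratory of Electronic Data Control and Forensics, Ji'an, Jiangxi 343009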

Abstract

Unlike conventional image captioning, comic image description involves not only image recognition and natural language processing but also requires the model to understand the humor, cultural context, and emotional attributes specific to comics. To address these challenges, this work proposes the task of comic image title and description generation and designs a new framework for it, CIC-CGT, built on multimodal large models. First, a CLIP model extracts features from the comic image, and these features are fed into a prefix embedding mapping module to obtain a semantic representation aligned between vision and language. This representation is then passed to a GPT2 model, which combines it with the CLIP visual features to generate a rough language description. Finally, the rough description is sent to a T5 model, which encodes its language features and decodes them into the final comic title and description. Results on the comic image description dataset NYCCB show that the proposed model can generate comic titles and descriptions in different styles and can accurately capture and express the humor and emotional depth characteristic of comics.
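The data flow described in the abstract (image features, prefix mapping, rough draft, refinement) can be sketched roughly as follows. This is only an illustrative sketch and not the authors' implementation: the names `make_prefix_mapper` and `run_pipeline`, the single linear mapping, the random weights, and all dimensions are assumptions, and the CLIP, GPT2, and T5 stages are represented by caller-supplied stand-in functions rather than real models.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_prefix_mapper(clip_dim: int, prefix_len: int, embed_dim: int,
                       seed: int = 0) -> Callable[[Vector], List[Vector]]:
    """Build a toy linear map from one image feature vector to
    `prefix_len` embeddings of size `embed_dim` (hypothetical helper)."""
    rng = random.Random(seed)
    # Random stand-in for learned weights, shape (prefix_len * embed_dim, clip_dim).
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(clip_dim)]
               for _ in range(prefix_len * embed_dim)]

    def mapper(feature: Vector) -> List[Vector]:
        if len(feature) != clip_dim:
            raise ValueError(f"expected {clip_dim} features, got {len(feature)}")
        # Matrix-vector product, then split the flat result into prefix vectors.
        flat = [sum(w * x for w, x in zip(row, feature)) for row in weights]
        return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_len)]

    return mapper


def run_pipeline(image: object,
                 encode_image: Callable[[object], Vector],
                 map_prefix: Callable[[Vector], List[Vector]],
                 draft_decoder: Callable[[List[Vector], Vector], str],
                 refiner: Callable[[str], str]) -> str:
    """Chain the four stages named in the abstract (hypothetical wiring)."""
    feature = encode_image(image)            # stand-in for CLIP feature extraction
    prefix = map_prefix(feature)             # prefix embedding mapping module
    rough = draft_decoder(prefix, feature)   # stand-in for GPT2 (prefix + visual features)
    return refiner(rough)                    # stand-in for T5 encode/decode refinement


if __name__ == "__main__":
    mapper = make_prefix_mapper(clip_dim=8, prefix_len=4, embed_dim=6)
    caption = run_pipeline(
        image="comic.png",
        encode_image=lambda img: [0.5] * 8,
        map_prefix=mapper,
        draft_decoder=lambda p, f: f"rough caption from {len(p)} prefix tokens",
        refiner=lambda text: text.replace("rough", "refined"),
    )
    print(caption)
```

In a real system each stand-in would be replaced by a pretrained model, and the mapping weights would be learned so that the prefix vectors lie in the language model's embedding space.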

Key words

large-scale model / comic images / image captioning / multimodal learning / natural language processing

Year of publication

2024

Journal of Jinggangshan University (Natural Science Edition) (井冈山大学学报(自然科学版))

Jinggangshan University (井冈山大学)

Impact factor: 0.298

ISSN: 1674-8085
