
Multimodal Pre-training Method for Multi-view Contrastive Learning and Semantic Enhancement

The visual language pre-training (VLP) model has shown impressive performance on multimodal tasks through contrastive learning and other methods. However, existing research has overlooked the benefits of multi-view descriptions and the importance of semantics and grammar. To address this issue, this paper proposes multi-view learning and semantic enhancement for multimodal pre-training (MulSE), which consists of three main components: 1) introducing multi-view contrastive learning with a generator into the fused-encoder model; 2) proposing multimodal text reordering as a novel self-supervised visual language pre-training task; 3) increasing and exploring the optimal MLM masking ratio to maximize the use of visual information. By improving the pre-training tasks and adopting several optimal strategies, experiments demonstrate that MulSE enhances both intra-modal and inter-modal understanding and improves the comprehension of syntax and semantics within text. With only 4M pre-training data, it matches the results previously obtained with large-scale datasets on image-text retrieval, and it outperforms previous understanding-oriented VLP models on visual question answering and visual entailment tasks.
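The first component above can be illustrated with a minimal, hedged sketch of a multi-view image-text contrastive loss in the InfoNCE style: each image embedding is pulled toward several text "views" of the same image (for example, the original caption plus paraphrases produced by a caption generator) and pushed away from the other texts in the batch. All function names, tensor shapes, the temperature value, and the use of generated paraphrases as extra views are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of multi-view image-text contrastive learning (InfoNCE style).
# Illustrative only; not the paper's actual implementation.

import torch
import torch.nn.functional as F


def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text InfoNCE loss for one text view.

    img_emb, txt_emb: (batch, dim) embeddings; matching rows are positive pairs.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


def multi_view_contrastive_loss(img_emb, txt_views):
    """Average the contrastive loss over several text views of the same image,
    e.g. the original caption plus paraphrases from a caption generator."""
    losses = [info_nce(img_emb, view) for view in txt_views]
    return torch.stack(losses).mean()


if __name__ == "__main__":
    batch, dim = 8, 256
    image_features = torch.randn(batch, dim)
    caption_features = torch.randn(batch, dim)      # original caption view
    paraphrase_features = torch.randn(batch, dim)   # generated paraphrase view
    loss = multi_view_contrastive_loss(image_features, [caption_features, paraphrase_features])
    print(f"multi-view contrastive loss: {loss.item():.4f}")
```

The third component adjusts the masked language modeling (MLM) masking ratio; a minimal sketch of ratio-controlled masking follows. The ratio of 0.4 is a placeholder rather than the optimal value reported by the authors, and `mask_token_id` is assumed to come from whatever tokenizer is in use.

```python
# Minimal sketch of MLM input corruption with a configurable masking ratio.

import torch


def mask_tokens(token_ids: torch.Tensor, mask_token_id: int, mask_ratio: float = 0.4):
    """Randomly replace `mask_ratio` of the tokens with the [MASK] token.

    Returns corrupted inputs and labels (original ids at masked positions,
    -100 elsewhere so the loss ignores unmasked tokens).
    """
    probs = torch.full(token_ids.shape, mask_ratio)
    masked = torch.bernoulli(probs).bool()
    labels = token_ids.clone()
    labels[~masked] = -100
    inputs = token_ids.clone()
    inputs[masked] = mask_token_id
    return inputs, labels
```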

Computer Vision; Multimodal; Pre-training; Multi-view; Comprehension Enhancement

TANG Jia, GUO Yan, YE Mingwei, WU Guixing


School of Software Engineering, University of Science and Technology of China, Hefei 230026, China

Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China


2024

Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2024, 51(1)