
Recent Advances in Vision-Language Multimodal Pre-training Models

A Survey on Vision-Language Multimodal Pre-training

In recent years, multimodal pre-training has flourished on vision-language tasks. A large body of work has shown that pre-training representations over multiple modalities improves performance on vision-language downstream tasks. Multimodal representation pre-training adopts self-supervised learning paradigms, including contrastive learning and masked self-supervision, and trains on large-scale image-text pair data; by learning prior knowledge within and across modalities, the model acquires general-purpose, highly transferable visual representations. This paper reviews Transformer-based work in the vision-language multimodal field in the post-BERT era. It first explains the application background and significance of multimodal pre-training; then traces the development of mainstream multimodal learning methods and analyzes the advantages and limitations of each; summarizes the various supervision signals used in multimodal pre-training and their roles; surveys the mainstream large-scale image-text datasets cited in recent studies; and finally briefly introduces several related cross-modal downstream tasks, covering their objectives, datasets, and training methods.
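To make the contrastive pre-training objective mentioned in the abstract concrete, the sketch below implements a CLIP-style symmetric image-text InfoNCE loss in PyTorch. It is a minimal illustration only: the function name, embedding size, and temperature value are assumptions made for the example, not the exact formulation of any particular model covered by the survey.

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # Hypothetical helper: symmetric InfoNCE over a batch of matched
    # image-text pairs; row i of each tensor comes from the same pair.
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast each image against all texts, and each text against all images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy encoder outputs (stand-ins for real image/text encoders):
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(image_text_contrastive_loss(img, txt).item())

The symmetric form, averaging the image-to-text and text-to-image cross-entropies, pulls matched pairs together while pushing every mismatched pair in the batch apart.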

multimodal pre-training; Vision-Language (VL) pre-training; representation learning

ZHU Ruolin, LAN Shanzhen, ZHU Zixing (朱若琳、蓝善祯、朱紫星)


School of Information and Communication Engineering, Communication University of China, Beijing 100024, China


Funding: National Key R&D Program of China (Grant No. 2018YFB1404103)


Journal of Communication University of China (Natural Science Edition)
Communication University of China


CHSSCD
Impact factor: 0.514
ISSN: 1673-4793
Year, Volume (Issue): 2023, 30(1)