With the rapid development of deep learning models and the increase in parameter sizes, fine-tuning an entire model for various downstream applications with different objectives is prohibitive. To address this issue, prompt learning was first proposed in the field of natural language processing (NLP) and has been widely studied in recent years. By reformulating various downstream tasks into the same form as the pre-training task, prompt learning successfully leverages large-scale pre-trained language models in downstream applications with great efficiency in terms of both parameters and data. Among them, models pre-trained with masked language modeling (MLM), represented by BERT, have achieved great success via the "cloze prompt" in tasks requiring word-level output, such as text classification and named entity recognition; models pre-trained via autoregressive/causal language modeling (A/CLM), such as GPT, have been widely applied via the "prefix prompt" to tasks requiring text-level output, including dialogue generation, question answering, and summarization.

Following the success of prompt learning in NLP, language models have also been applied to multimodal vision-language understanding problems through prompt learning. However, these methods still cannot handle dense vision tasks, and the expensive and complex process of fine-tuning an entire vision model in practical applications remains. Inspired by its success in NLP, prompt learning has therefore gradually been applied to various vision-related tasks, including image classification, object detection, image segmentation, domain adaptation, and continual learning. Given the lack of a comprehensive survey of prompt learning in the vision area, this paper aims to provide a comprehensive introduction to and analysis of prompt learning methods in the unimodal vision and multimodal vision-language areas.

First, we briefly introduce pre-trained models, the basic concepts of prompt learning, the forms of downstream applications, and the types of prompts in NLP as preliminaries. Second, we present the pre-trained models adopted in unimodal vision and multimodal vision-language prompt learning methods, respectively. Then, we give a comprehensive introduction to prompt learning methods in vision-related areas. It is worth noting that prompt learning methods in NLP are designed to inherit the pre-training task across all downstream applications, whereas current prompt learning methods in the unimodal vision and multimodal vision-language fields are designed for specific downstream applications. Therefore, we first give a brief introduction from the perspective of method design, and then detail unimodal visual prompt learning and multimodal vision-language prompt learning methods from the perspective of application tasks. On the one hand, unimodal visual prompt learning methods are mainly designed by concatenating learnable prompt tokens, adding optimizable pixel-wise perturbations, learning prompt networks, combining multiple prompt modules, constructing label mappings, neural architecture search, etc. On the other hand, popular designs of multimodal vision-language prompt learning methods include textual prompt learning, vision-guided textual prompt learning, text- or knowledge-guided textual prompt learning, vision-language joint prompt learning, distribution-based prompt learning, multitask-shared prompt learning, gradient-guided prompt learning, etc. Finally, we provide an in-depth analysis and comparison of prompt learning methods in NLP and vision-related fields, and conclude with a summary and an outlook on future research.
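To make the two prompt forms mentioned above concrete, the following minimal sketch contrasts a "cloze prompt" for an MLM-style model with a "prefix prompt" for an autoregressive/causal model. The sentiment-classification task, the template wording, and the verbalizer mapping are illustrative assumptions, not taken from any specific method covered in this survey.

```python
# Illustrative templates only; the task (sentiment classification), wording,
# and verbalizer below are assumptions made for this sketch.
review = "The plot was predictable and the acting fell flat."

# Cloze prompt (MLM-style model such as BERT): the task becomes predicting
# the masked token, and a verbalizer maps predicted words back to labels.
cloze_prompt = f"{review} Overall, the movie was [MASK]."
verbalizer = {"great": "positive", "terrible": "negative"}

# Prefix prompt (autoregressive/causal model such as GPT): the task becomes
# continuing the text that follows the prefix.
prefix_prompt = f"Review: {review}\nSentiment:"

print(cloze_prompt)
print(prefix_prompt)
```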
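As a rough illustration of the "optimizable pixel-wise perturbation" family of unimodal visual prompt learning methods, the PyTorch sketch below adds a learnable border-shaped perturbation to the inputs of a frozen backbone and trains only that perturbation. The border design, the ResNet-18 backbone, and the hyperparameters are assumptions chosen for demonstration, not a specific published implementation.

```python
# A rough sketch, not a specific paper's implementation: only the pixel-level
# prompt is trained while the (assumed pre-trained) backbone stays frozen.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PixelPrompt(nn.Module):
    def __init__(self, image_size=224, pad=16):
        super().__init__()
        # Learnable perturbation restricted to a border of width `pad` (assumed design choice).
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))
        mask = torch.zeros(1, 1, image_size, image_size)
        mask[..., :pad, :] = 1
        mask[..., -pad:, :] = 1
        mask[..., :, :pad] = 1
        mask[..., :, -pad:] = 1
        self.register_buffer("mask", mask)

    def forward(self, x):
        return x + self.delta * self.mask

backbone = resnet18(weights=None)   # stands in for a pre-trained model; weights omitted here
for p in backbone.parameters():
    p.requires_grad_(False)

prompt = PixelPrompt()
optimizer = torch.optim.SGD(prompt.parameters(), lr=0.1)

images = torch.randn(4, 3, 224, 224)            # dummy batch
labels = torch.randint(0, 1000, (4,))
loss = nn.functional.cross_entropy(backbone(prompt(images)), labels)
loss.backward()                                 # gradients flow only into the prompt
optimizer.step()
```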
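For the multimodal side, the sketch below illustrates the general idea behind "textual prompt learning" for a CLIP-like vision-language model: learnable context vectors are prepended to frozen class-name embeddings before they reach the frozen text encoder. The module name, tensor shapes, and context length are assumptions made for illustration and do not follow any particular library's API.

```python
# A minimal, assumption-laden sketch of learnable textual context vectors;
# names and shapes are placeholders, not a specific library's API.
import torch
import torch.nn as nn

class TextualPromptLearner(nn.Module):
    def __init__(self, class_name_embeddings, n_ctx=4, dim=512):
        super().__init__()
        # Shared learnable context tokens, conceptually replacing a hand-written
        # template such as "a photo of a".
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen word embeddings of each class name: (num_classes, n_name_tokens, dim).
        self.register_buffer("names", class_name_embeddings)

    def forward(self):
        n_cls = self.names.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # The concatenated prompts would then be fed to a frozen text encoder,
        # and classification would use image-text similarity.
        return torch.cat([ctx, self.names], dim=1)

dummy_names = torch.randn(10, 3, 512)   # 10 classes, 3 name tokens each (dummy data)
prompts = TextualPromptLearner(dummy_names)()
print(prompts.shape)                    # torch.Size([10, 7, 512])
```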