The Vision Transformer (ViT), which adapts the Transformer encoder architecture to image data, has achieved remarkable success in the field of computer vision. Over the past few years, research centered on ViT has surged and consistently delivered strong performance, making work built on this model a pivotal and prominent research direction for computer vision tasks. This paper therefore provides a comprehensive survey of recent advances and developments in ViT. It first briefly revisits the fundamental principles of the Transformer and its adaptation into ViT, analyzing the structural characteristics and advantages of the ViT model. It then categorizes and synthesizes the main directions of improvement for ViT backbone networks and their representative models, based on the distinguishing features of each ViT variant: enhancements in locality, structural modifications, self-supervised improvements, and lightweight and efficient designs, which are thoroughly examined and compared. Finally, the paper discusses the remaining shortcomings of current ViT models and their improved variants, and offers a prospective view on future research directions for ViT. This comprehensive analysis serves as a valuable reference for researchers deliberating on the choice of deep learning methodologies for their investigations into ViT backbone networks.