基于Transformer的视觉分割技术进展

Overview of Transformer-Based Visual Segmentation Techniques

李文生¹ 张菁² 卓力² 吴鑫嘉¹ 闫伊¹

作者信息

1. 北京工业大学信息科学技术学院,北京 100124
2. 北京工业大学信息科学技术学院,北京 100124;北京工业大学计算智能与智能系统北京市重点实验室,北京 100124

摘要

视觉分割是计算机视觉领域的核心任务,旨在将图像或视频帧中的像素分类以划分成不同区域.得益于视觉分割技术的快速发展,该技术在自动驾驶、航空遥感和视频场景理解等多种应用领域中发挥着关键作用.近年来,基于Transformer的视觉分割技术因具备长程依赖建模能力而备受关注.随着Transformer模型架构的持续优化与迭代,亟须更全面地理解和认识Transformer在视觉分割领域的已有进展和发展趋势,通过发现现有研究中的不足和挑战,以更深入地探索Transformer的核心理论.为此,本文从图像/视频两个视觉脉络出发,整理、回顾、分析和探讨了近年来基于Transformer的视觉分割相关技术进展,不仅归纳了Transformer的理论框架,还给出了一些应用实例和研究热点,从而做出总结和展望.具体来说,首先梳理了Transformer的背景,包括问题定义、数据集和评估指标、基本结构,其中,问题定义描述了视觉分割在图像/视频任务中的预期目标和结果;数据集和评估指标反映了模型的具体应用场景,以及性能的衡量标准;基本结构则描述了算法的核心模块、实现流程以及各个模块之间的关系.然后,着重阐述了Transformer在图像语义分割、图像实例分割、视频语义分割和视频实例分割中的四个方法体系,并探讨了当前的研究热点.对于图像语义分割任务,分析了Transformer的代表性结构,包括纯Transformer和双分支结构,并以无人机影像非铺装道路分割和遥感图像语义分割为实际应用案例,探讨了Transformer的改进动机与应用效果,并展示了主观结果;图像实例分割总结了常见的非端对端Transformer和端对端Transformer典型结构.视频语义分割主要分为面向精度的和面向效率的Transformer结构,视频实例分割则包括逐帧和逐片段Transformer分割,并以网络直播视频实例分割为应用实例,一方面讨论了可用的数据集、实验参数和评估指标,另一方面,对网络直播视频实例分割主流方法性能进行了评价和分析,展示了一些主观可视化结果.之后,鉴于视觉分割领域的SAM大模型、开放词汇分割、指代分割受到了广泛关注,本文对这些热点问题的方法进行了追溯和评述,以期碰撞出视觉分割的新思路和新灵感.最后,尽管基于Transformer的视觉分割技术受到了广泛的关注,但存在的科学问题也逐渐凸显,限制了模型性能与效率的进一步提升,对此本文总结了利用Transformer开展图像/视频语义/实例分割仍需关注的难点问题,并对未来可能的发展方向进行了展望,提供了一些启示供参考.

Abstract

Visual segmentation is a fundamental task in computer vision that classifies the pixels of an image or video frame into distinct regions. Thanks to its rapid development, visual segmentation plays a key role in applications such as autonomous driving, aerial remote sensing, and video scene understanding. In recent years, Transformer-based visual segmentation has attracted much attention because of its ability to model long-range dependencies. As Transformer architectures continue to be optimized and iterated, there is an urgent need for a more comprehensive understanding of the existing progress and development trends of Transformers in visual segmentation, and for identifying the deficiencies and challenges of existing research, so that the core theory of the Transformer can be explored in greater depth. To this end, this paper organizes, reviews, analyzes, and discusses recent advances in Transformer-based visual segmentation along two lines, image and video. It not only summarizes the theoretical framework of the Transformer but also presents application examples and research hotspots, and concludes with a summary and outlook. Specifically, the background of the Transformer is first reviewed, including the problem definition, datasets and evaluation metrics, and the basic structure. The problem definition describes the expected goals and results of visual segmentation in image/video tasks; the datasets and evaluation metrics reflect the specific application scenarios of the models and the criteria by which performance is measured; and the basic structure describes the core modules of the algorithm, the implementation process, and the relationships among the modules. Then, four families of Transformer methods are described in detail, covering image semantic segmentation, image instance segmentation, video semantic segmentation, and video instance segmentation, and current research hotspots are discussed. For image semantic segmentation, representative Transformer structures are analyzed, including pure Transformer and dual-branch structures; taking unpaved road segmentation in UAV images and semantic segmentation of remote sensing images as practical application cases, the motivation for improving the Transformer and the resulting application performance are discussed, and qualitative results are shown. For image instance segmentation, typical non-end-to-end and end-to-end Transformer structures are summarized. Video semantic segmentation is mainly divided into accuracy-oriented and efficiency-oriented Transformer structures, while video instance segmentation includes frame-by-frame and clip-by-clip Transformer segmentation. Taking livestreaming video instance segmentation as an application example, the paper discusses the available datasets, experimental parameters, and evaluation metrics, evaluates and analyzes the performance of mainstream methods on this task, and shows some qualitative visual results. Subsequently, because the Segment Anything Model (SAM), open-vocabulary segmentation, and referring segmentation have received wide attention in visual segmentation, this paper traces and reviews methods for these hot topics, with a view to sparking new ideas and inspiration. Finally, although Transformer-based visual segmentation has received widespread attention, open scientific problems have gradually emerged that limit further improvement of model performance and efficiency. This paper therefore summarizes the challenging issues that still need to be addressed in image/video semantic/instance segmentation with Transformers, and discusses potential future directions to provide some insights for reference.

关键词

视觉分割/Transformer/语义分割/实例分割/自注意力机制

Key words

visual segmentation/Transformer/semantic segmentation/instance segmentation/self-attention mechanism

出版年

2024

计算机学报

中国计算机学会;中国科学院计算技术研究所

CSTPCD/CSCD/北大核心

影响因子：3.18

ISSN：0254-4164
