DETR with Multi-granularity Spatial Attention and Spatial Prior Supervision
The Transformer has shown remarkable performance in computer vision in recent years and has attracted widespread attention for its strong global modeling capability and competitive performance compared with convolutional neural networks (CNNs). Detection Transformer (DETR) is the first end-to-end network to adopt the Transformer architecture for object detection, but it suffers from slow training convergence and suboptimal performance because it models all positions in the global scope equally and its object query keys are difficult to distinguish. To address these issues, we propose replacing the self-attention in the encoder and the cross-attention in the decoder of DETR with a multi-granularity attention mechanism that applies fine-grained attention to spatially close tokens and coarse-grained attention to distant tokens, thereby enhancing the model's representational capability. We also introduce spatial prior constraints into the decoder's cross-attention to supervise network training, which accelerates convergence. Experimental results show that the improved model, after incorporating the multi-granularity attention mechanism and spatial prior supervision, achieves a 16% improvement in recognition accuracy on the PASCAL VOC2012 dataset over the unmodified DETR while converging twice as fast.
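To make the idea of multi-granularity attention concrete, the following is a minimal sketch, not the authors' implementation: each query attends to individual key tokens within a distance threshold (fine-grained) and, as a simplification, to region-pooled keys covering the whole feature map (coarse-grained). The distance threshold, pooling window, and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of multi-granularity attention (illustrative, single head).
# Nearby tokens are attended at token level; the coarse branch attends to
# region-pooled keys/values. Hyperparameters here are assumptions.
import torch
import torch.nn.functional as F


def multi_granularity_attention(q, k, v, coords, dist_thresh=4.0, pool=4):
    """
    q, k, v: (N, d) token features from a flattened H*W feature map
    coords:  (N, 2) spatial (x, y) positions of the tokens
    """
    d = q.size(-1)

    # Pairwise spatial distances; nearby key tokens get fine-grained attention.
    dist = torch.cdist(coords, coords)                       # (N, N)
    fine_mask = dist <= dist_thresh

    # Fine-grained branch: standard token-level attention, masked to nearby keys.
    fine_logits = (q @ k.t()) / d ** 0.5                      # (N, N)
    fine_logits = fine_logits.masked_fill(~fine_mask, float("-inf"))

    # Coarse-grained branch: pool groups of keys/values into regions, then attend.
    n_regions = max(1, k.size(0) // pool)
    k_coarse = k[: n_regions * pool].view(n_regions, pool, d).mean(dim=1)   # (R, d)
    v_coarse = v[: n_regions * pool].view(n_regions, pool, d).mean(dim=1)   # (R, d)
    coarse_logits = (q @ k_coarse.t()) / d ** 0.5              # (N, R)

    # Joint softmax over fine + coarse keys, then combine the two value sets.
    logits = torch.cat([fine_logits, coarse_logits], dim=-1)   # (N, N + R)
    attn = F.softmax(logits, dim=-1)
    return attn[:, : k.size(0)] @ v + attn[:, k.size(0):] @ v_coarse


# Example: 64 tokens from an 8x8 feature map with 32-dim features.
if __name__ == "__main__":
    H, W, d = 8, 8, 32
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()
    q = k = v = torch.randn(H * W, d)
    print(multi_granularity_attention(q, k, v, coords).shape)  # torch.Size([64, 32])
```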