DETR with Multi-granularity Spatial Attention and Spatial Prior Supervision
The Transformer has shown remarkable performance in computer vision in recent years and has attracted widespread attention for its strong global modeling capability and competitive performance compared with convolutional neural networks (CNNs). Detection Transformer (DETR) is the first end-to-end network to adopt the Transformer architecture for object detection, but it suffers from slow training convergence and suboptimal performance because it models all positions in the global scope equally and its object query keys are difficult to distinguish. To address these issues, we propose replacing the self-attention in the encoder and the cross-attention in the decoder of DETR with a multi-granularity attention mechanism that applies fine-grained attention to spatially close tokens and coarse-grained attention to distant tokens, thereby enhancing the model's representational capability. We also introduce spatial prior constraints into the decoder's cross-attention to supervise network training, which accelerates convergence. Experimental results show that the improved model, after incorporating the multi-granularity attention mechanism and spatial prior supervision, achieves a 16% improvement in recognition accuracy on the PASCAL VOC2012 dataset over the unmodified DETR while converging twice as fast.
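To make the idea of multi-granularity attention concrete, the following is a minimal sketch, not the authors' implementation: each query attends to individual key tokens within a distance threshold (fine-grained) and, as a simplification, to region-pooled keys covering the whole feature map (coarse-grained). The distance threshold, pooling window, and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of multi-granularity attention (illustrative, single head).
# Nearby tokens are attended at token level; the coarse branch attends to
# region-pooled keys/values. Hyperparameters here are assumptions.
import torch
import torch.nn.functional as F


def multi_granularity_attention(q, k, v, coords, dist_thresh=4.0, pool=4):
    """
    q, k, v: (N, d) token features from a flattened H*W feature map
    coords:  (N, 2) spatial (x, y) positions of the tokens
    """
    d = q.size(-1)

    # Pairwise spatial distances; nearby key tokens get fine-grained attention.
    dist = torch.cdist(coords, coords)                       # (N, N)
    fine_mask = dist <= dist_thresh

    # Fine-grained branch: standard token-level attention, masked to nearby keys.
    fine_logits = (q @ k.t()) / d ** 0.5                      # (N, N)
    fine_logits = fine_logits.masked_fill(~fine_mask, float("-inf"))

    # Coarse-grained branch: pool groups of keys/values into regions, then attend.
    n_regions = max(1, k.size(0) // pool)
    k_coarse = k[: n_regions * pool].view(n_regions, pool, d).mean(dim=1)   # (R, d)
    v_coarse = v[: n_regions * pool].view(n_regions, pool, d).mean(dim=1)   # (R, d)
    coarse_logits = (q @ k_coarse.t()) / d ** 0.5              # (N, R)

    # Joint softmax over fine + coarse keys, then combine the two value sets.
    logits = torch.cat([fine_logits, coarse_logits], dim=-1)   # (N, N + R)
    attn = F.softmax(logits, dim=-1)
    return attn[:, : k.size(0)] @ v + attn[:, k.size(0):] @ v_coarse


# Example: 64 tokens from an 8x8 feature map with 32-dim features.
if __name__ == "__main__":
    H, W, d = 8, 8, 32
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()
    q = k = v = torch.randn(H * W, d)
    print(multi_granularity_attention(q, k, v, coords).shape)  # torch.Size([64, 32])
```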