Transformer visual object tracking algorithm based on mixed attention
Transformer-based visual object tracking algorithms capture the global information of the target well, but the representation of object features leaves room for improvement. To strengthen the expressive power of object features, a Transformer visual object tracking algorithm based on mixed attention is proposed. First, a mixed attention module is introduced to capture object features along both the spatial and channel dimensions, thereby modeling the contextual dependencies of the target features. Second, the feature maps are sampled by multiple parallel dilated convolutions with different dilation rates to obtain multi-scale image features and enhance the local feature representation. Finally, a convolutional position encoding is added to the Transformer encoder to provide accurate, length-adaptive position encoding for the tracker, improving the accuracy of tracking and localization. Experimental results on OTB100, VOT2018, and LaSOT show that by learning the relationships among features through the mixed-attention Transformer network, object features are represented more effectively. Compared with other mainstream object tracking algorithms, the proposed algorithm achieves better tracking performance and a real-time tracking speed of 26 frames per second.
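The two feature-enhancement ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the real mixed attention module and dilated convolution branches would use learnable weights, whereas the gates and kernels here are parameter-free stand-ins chosen only to show the data flow (channel gating, then spatial gating, then parallel dilated 3x3 convolutions at several rates).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixed_attention(x):
    """Gate a (C, H, W) feature map along the channel, then spatial, dimension.

    Channel branch: squeeze the spatial dims (avg + max pooling) into one
    weight per channel; spatial branch: squeeze the channels into one weight
    per location. Both gates are sigmoid-bounded in (0, 1). A learnable
    module would insert projections before each sigmoid.
    """
    c_gate = sigmoid(x.mean(axis=(1, 2)) + x.max(axis=(1, 2)))   # shape (C,)
    x = x * c_gate[:, None, None]
    s_gate = sigmoid(x.mean(axis=0) + x.max(axis=0))             # shape (H, W)
    return x * s_gate[None, :, :]

def dilated_conv3x3(x, kernel, rate):
    """'Same'-padded 3x3 convolution of an (H, W) map with a dilation rate."""
    h, w = x.shape
    xp = np.pad(x, rate)                  # zero padding of `rate` keeps the shape
    out = np.zeros_like(x, dtype=float)
    for ki in range(3):                   # slide the dilated 3x3 kernel
        for kj in range(3):
            out += kernel[ki, kj] * xp[ki * rate:ki * rate + h,
                                       kj * rate:kj * rate + w]
    return out

def multi_scale(x, kernel, rates=(1, 2, 3)):
    """Stack parallel dilated convolutions with different rates (one per branch)."""
    return np.stack([dilated_conv3x3(x, kernel, r) for r in rates])
```

Because each dilated branch keeps the spatial size while enlarging the receptive field, the stacked outputs can be fused (e.g., concatenated along the channel axis) to combine fine and coarse context, which is the multi-scale effect the abstract describes.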