A Deep Learning Framework for Multi-Object Tracking of Team Sports Videos Based on the Improved Vision Transformer Model
The application of multi-object tracking (MOT) technology opens up new possibilities for team sports video monitoring and analysis, enabling real-time tracking of multiple athletes and supporting multi-dimensional analysis and understanding of game dynamics. However, in complex team sports scenarios, mutual occlusion between athletes, rapid movements, and frequent changes in target identities can degrade tracking performance. To address these challenges, an end-to-end deep learning MOT framework based on Vision Transformer was proposed, consisting of two main parts: a detection network and a memory network. The detection network comprised a convolutional neural network (CNN) backbone and a Vision Transformer encoder and decoder. ResNet50 was adopted as the feature extractor, and the traditional feed-forward neural network (FFN) layer was replaced by a local attention (LA) module to obtain more comprehensive feature representations by combining global attention with local convolution. The memory network consisted of a memory encoding module and a spatio-temporal memory decoder. The memory encoding module was responsible for aggregating target embedding information: its short-term cross attention (CA) module focused on immediate states, while its long-term CA module mined the salient features held in memory across longer time spans, capturing dependencies and associations over long intervals to preserve the temporal context of tracked objects. The spatio-temporal memory decoder integrated encoded frame embeddings, candidate embeddings, and trajectory embeddings to address multi-object detection and identity association in MOT. The spatio-temporal memory mechanism efficiently retained the observed historical states of targets and, combined with an attention mechanism, accurately predicted target states. Experimental results demonstrate that the proposed framework achieves 75.7% HOTA and 98.5% MOTA on the public team sports video dataset SportsMOT, outperforming other state-of-the-art MOT methods. Additionally, the proposed framework achieves optimal or near-optimal performance on multiple metrics on the general-purpose public datasets MOT17 and MOT20, further validating its effectiveness and robustness.
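
The short-term and long-term cross attention idea in the memory encoding module can be illustrated with a minimal sketch. This is not the proposed implementation: the names `cross_attention` and `TrackMemory`, the window sizes, the omission of learned projection weights, and the averaging of the two attention outputs are all assumptions made only for illustration. It shows one track query attending over a short window of recent states and over the full stored history.

```python
import math
from collections import deque


def cross_attention(query, memory):
    """Scaled dot-product attention of one query vector over memory vectors.

    Hypothetical helper: no learned query/key/value projections are used.
    """
    d = len(query)
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(d) for mem in memory]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the memory vectors, one output value per dimension.
    return [sum(w * mem[i] for w, mem in zip(weights, memory)) for i in range(d)]


class TrackMemory:
    """Hypothetical per-track buffer of past embeddings (sizes are assumptions)."""

    def __init__(self, long_len=8, short_len=2):
        self.states = deque(maxlen=long_len)  # oldest states are dropped first
        self.short_len = short_len

    def update(self, embedding):
        self.states.append(list(embedding))

    def encode(self, query):
        history = list(self.states)
        if not history:
            return list(query)  # nothing stored yet: fall back to the query
        short = cross_attention(query, history[-self.short_len:])  # immediate states
        long = cross_attention(query, history)  # full stored time span
        # Assumption: the two views are fused by a plain average.
        return [(s + l) / 2 for s, l in zip(short, long)]


memory = TrackMemory(long_len=8, short_len=2)
for state in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    memory.update(state)
track_embedding = memory.encode([1.0, 0.0])
```

In this sketch the bounded `deque` stands in for the retained historical states of a target; a trained model would instead learn how to project, weight, and fuse the short-term and long-term outputs.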