
A Deep Learning Framework for Multi-Object Tracking in Team Sports Videos Based on an Improved Vision Transformer Model

Multi-object tracking (MOT) technology opens up new possibilities for team sports video monitoring and analysis, enabling real-time tracking of multiple athletes and supporting multi-dimensional analysis and understanding of game dynamics. However, in complex team sports scenarios, issues such as mutual occlusion between athletes, rapid movements, and frequent changes of target identity can degrade tracking performance. To address these challenges, an end-to-end deep learning MOT framework based on the Vision Transformer is proposed, consisting of two main parts: a detection network and a memory network. The detection network comprises a convolutional neural network (CNN) backbone, a Vision Transformer encoder, and a decoder. ResNet50 is adopted as the feature extractor, and the traditional feed-forward network (FFN) layer is replaced by a local attention (LA) module, so that global attention and local convolution are combined to obtain more comprehensive feature representations. The memory network consists of a memory encoding module and a spatio-temporal memory decoder. The memory encoding module aggregates target embedding information: a short-term cross-attention (CA) module focuses on immediate states, while a long-term CA module mines salient features over the time span covered by the memory, capturing dependencies and associations across long intervals and thereby preserving the temporal context of tracked objects. The spatio-temporal memory decoder integrates encoded frame embeddings, candidate embeddings, and trajectory embeddings during embedding fusion to handle multi-object detection and identity association in MOT. The spatio-temporal memory mechanism efficiently retains observations of targets' historical states and, combined with the attention mechanism, predicts target states accurately. Experimental results show that the proposed framework achieves 75.7% HOTA and 98.5% MOTA on the public team sports video dataset SportsMOT, outperforming other state-of-the-art MOT methods. In addition, the framework attains optimal or near-optimal performance on multiple metrics of the general-purpose public datasets MOT17 and MOT20, further validating its effectiveness and robustness.
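
To make the detection-network design more concrete, below is a minimal PyTorch-style sketch of a Transformer encoder layer in which the usual FFN is replaced by a local attention (LA) module, combining global self-attention with local convolution. The sketch is illustrative only: the class names, the depthwise-convolution form of the LA module, and all dimensions are assumptions for demonstration, not details taken from the paper.

import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Hypothetical LA block: a depthwise + pointwise convolution that mixes each
    token with its spatial neighbours, standing in for the FFN sublayer."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pw_conv = nn.Conv2d(dim, dim, 1)
        self.act = nn.GELU()

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence from the encoder; fold back into a feature map
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.pw_conv(self.act(self.dw_conv(feat)))
        return feat.reshape(b, c, n).transpose(1, 2)

class EncoderLayerWithLA(nn.Module):
    """Encoder layer where the FFN is replaced by the LA block, so that
    global self-attention and local convolution are combined."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.local_attn = LocalAttention(dim)

    def forward(self, x, h, w):
        # Global attention over all tokens, then local mixing, each with a residual connection
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.local_attn(self.norm2(x), h, w)
        return x

# Example: tokens from a 16x16 ResNet50-style feature map with 256 channels
tokens = torch.randn(2, 16 * 16, 256)
layer = EncoderLayerWithLA(dim=256, num_heads=8)
out = layer(tokens, h=16, w=16)
print(out.shape)  # torch.Size([2, 256, 256])

The example pushes 16x16 feature tokens through a single layer; the framework described in the paper stacks such layers inside the detection network and builds the memory network on top of the resulting embeddings.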

multi-object tracking; deep learning; team sports videos; Vision Transformer; spatio-temporal memory; attention mechanism

Cao Wei, Wang Xiaoyong, Liu Xianxiang


School of Public Education, Huainan Union University, Huainan 232038, Anhui, China

School of Information Engineering, Huainan Union University, Huainan 232038, Anhui, China


2024

Journal of North University of China (Natural Science Edition)
North University of China

Impact factor: 0.258
ISSN: 1673-3193
Year, Volume (Issue): 2024, 45(6)