Mainstream video action recognition algorithms often exploit temporal information insufficiently, while the Transformer excels at modeling long sequences and global dependencies. In this paper, 3D Convolutional Neural Networks (3D CNNs) and the Transformer are combined to propose a sparse-Transformer-based long-short temporal association action recognition algorithm that models the global temporal information of a video. The algorithm uses a pre-trained model to extract clip features, embeds a video feature clustering module to reduce potential noise in the input features, and applies a Transformer long-short temporal association module based on sparse self-attention. This module introduces a sparse mask matrix that masks the similarity matrix to suppress small attention weights, selectively retains important long-short temporal information, and improves the model's concentration of attention on global contextual information. Experimental results verify the effectiveness of the proposed algorithm: on the UCF101 and HMDB51 datasets, the model achieves higher accuracy than state-of-the-art approaches with a small number of parameters and low computational complexity.
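The sparse self-attention masking described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function name `sparse_attention` and the `top_k` threshold rule (keep only the k largest scores per row of the similarity matrix before the softmax) are assumptions standing in for the paper's sparse mask matrix.

```python
# Hypothetical sketch of sparse self-attention: scores below the per-row
# top-k are masked out before the softmax, so only the strongest
# long-/short-range temporal links contribute to the output.
import numpy as np

def sparse_attention(Q, K, V, top_k=2):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity matrix
    # sparse mask: keep only the top_k largest scores in each row
    thresh = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    # softmax over the unmasked entries; masked entries get zero weight
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = sparse_attention(Q, K, V, top_k=2)
# each row of w now holds at most top_k nonzero attention weights
```

In a full model, this masking would replace the dense softmax attention inside each Transformer layer of the long-short temporal association module, with clip-level features from the pre-trained 3D CNN serving as the input sequence.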
Keywords: deep learning; action recognition; sparse Transformer; R3D-18