Action Recognition Model Based on an Improved Two-Stream Vision Transformer
To address the poor resistance to background interference and low accuracy of existing action recognition methods, an improved two-stream Vision Transformer action recognition model is proposed. The model adopts a segmented sampling strategy to improve its handling of long temporal sequences; a parameter-free attention module embedded at the head of the network strengthens the model's feature representation while suppressing background interference; and a temporal attention module embedded at the tail of the network fully extracts temporal features by integrating high-level semantic information in the time domain. A new joint loss function is proposed to enlarge inter-class differences and reduce intra-class differences, and a decision fusion layer is adopted to fully exploit the features of the optical-flow and RGB streams. Comparative and ablation experiments are conducted on the benchmark datasets UCF101 and HMDB51; the ablation results verify the effectiveness of the proposed components. The comparison results show that the accuracy of the proposed method exceeds that of the temporal segment network (TSN) by 3.48% and 7.76% on the two datasets, respectively, outperforming current mainstream algorithms and demonstrating good recognition performance.
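The parameter-free attention referenced in the keywords is SimAM, which derives a per-neuron weight from an energy function rather than from learned parameters. A minimal NumPy sketch of that weighting, assuming a single (C, H, W) feature map and the published closed-form (the `lam` regularizer value is an illustrative default, not taken from this paper):

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention over a (C, H, W) feature map.

    Each neuron's inverse energy is computed from its squared deviation
    from the channel mean; a sigmoid of that energy gates the input.
    """
    C, H, W = x.shape
    n = H * W - 1                                   # neurons per channel, minus the target
    mu = x.mean(axis=(1, 2), keepdims=True)         # per-channel mean
    d = (x - mu) ** 2                               # squared deviation
    v = d.sum(axis=(1, 2), keepdims=True) / n       # per-channel variance estimate
    e_inv = d / (4 * (v + lam)) + 0.5               # closed-form inverse energy
    return x * (1.0 / (1.0 + np.exp(-e_inv)))       # sigmoid gating, no learned weights
```

Because the gate is computed entirely from the feature statistics, the module adds attention without adding parameters, which is why it can be inserted at the network head at negligible cost.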
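The abstract does not give the exact form of the joint loss; a common instantiation of "enlarge inter-class differences, reduce intra-class differences" is softmax cross-entropy combined with a center-loss term. The sketch below is that assumed combination, with the weight `alpha` and all argument names chosen for illustration only:

```python
import numpy as np

def joint_loss(features, logits, labels, centers, alpha=0.5):
    """Illustrative joint loss: cross-entropy + center loss.

    features : (N, D) embedding vectors
    logits   : (N, K) classifier outputs
    labels   : (N,)   integer class labels
    centers  : (K, D) running class centers
    """
    # Cross-entropy term: pushes classes apart (inter-class separability).
    z = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -np.log(p[np.arange(len(labels)), labels]).mean()

    # Center-loss term: pulls samples toward their class center
    # (intra-class compactness).
    intra = ((features - centers[labels]) ** 2).sum(axis=1).mean()

    return ce + alpha * intra
```

In training, the class centers would themselves be updated (e.g. by a moving average of each class's features) so the compactness term stays meaningful as the embedding drifts.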
Keywords: action recognition; Vision Transformer; SimAM parameter-free attention; temporal attention; joint loss