Fine-grained 2D convolutional network model for action recognition based on spatio-temporal multi-scale correlation feature fusion
To address two problems of traditional 2-dimensional (2D) convolutional networks on fine-grained action datasets, namely that spatio-temporal features are extracted at only a single scale and that long-range temporal correlations between frames are underused, this paper proposes a fine-grained 2D convolutional action recognition model based on spatio-temporal multi-scale correlation feature fusion. First, to model the multi-scale spatial correlations in video and strengthen the spatial representation of fine-grained video data, the model applies a multi-scale "feature squeeze and feature excitation" method that makes the spatial features extracted by the network richer and more effective. Second, to fully exploit the motion information along the temporal dimension of fine-grained video data, a temporal window self-attention mechanism is introduced; it draws on the Transformer's strong long-range dependency modeling ability but performs self-attention only along the time dimension, so long-range temporal dependencies are modeled at a lower computational cost. Finally, because the extracted spatio-temporal features contribute unevenly to different action classes, an adaptive feature fusion module is introduced that dynamically assigns different weights to the features to achieve adaptive fusion. The model reaches Top-1 accuracies of 86.0% and 46.9% on the two fine-grained action recognition datasets Diving48 and Something-Something V1, improving on the Top-1 accuracy of the original backbone network by 3.8% and 1.3%, respectively. Experimental results show that, using only video frame information as input, the model achieves recognition accuracy comparable to existing methods based on Transformers and 3-dimensional convolutional neural networks (3D CNNs).
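The abstract does not give implementation details for the multi-scale "feature squeeze and feature excitation" step. As one illustrative reading, it can be viewed as a squeeze-and-excitation-style block whose squeeze is performed at several spatial pooling scales; the PyTorch sketch below follows that reading, and the pool sizes, reduction ratio, and module name are assumptions for illustration rather than the paper's specification.

```python
import torch
import torch.nn as nn

class MultiScaleSqueezeExcite(nn.Module):
    """Sketch of a multi-scale 'feature squeeze and feature excitation' block.

    Each branch squeezes the spatial map at a different pooling scale and
    produces channel-wise excitation weights; the branch outputs are averaged
    and used to reweight the input features. Pool sizes and the reduction
    ratio are illustrative assumptions, not values from the paper.
    """

    def __init__(self, channels, pool_sizes=(1, 2, 4), reduction=16):
        super().__init__()
        self.branches = nn.ModuleList()
        for p in pool_sizes:
            self.branches.append(nn.Sequential(
                nn.AdaptiveAvgPool2d(p),             # squeeze at scale p x p
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.AdaptiveAvgPool2d(1),             # one excitation value per channel
            ))
        self.gate = nn.Sigmoid()

    def forward(self, x):                            # x: (N*T, C, H, W) frame features
        excitation = sum(branch(x) for branch in self.branches) / len(self.branches)
        return x * self.gate(excitation)             # channel-wise reweighting
```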
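The key idea of the temporal window self-attention described above is that attention is computed only along the time axis, which keeps the quadratic cost in the number of frames T rather than in T×H×W. The following sketch, with an assumed head count and an assumed residual-plus-normalization layout, shows that idea; it does not reproduce the paper's exact window partitioning.

```python
import torch
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    """Sketch of temporal-only self-attention over frame features.

    Attention is computed independently at every spatial location, but only
    across the T frames at that location, so the cost scales with T^2 rather
    than (T*H*W)^2. Head count and layer layout are illustrative assumptions.
    """

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                            # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        # Treat every spatial position as an independent temporal sequence.
        seq = x.permute(0, 3, 4, 1, 2).reshape(n * h * w, t, c)
        attn_out, _ = self.attn(seq, seq, seq)       # attention only over time
        seq = self.norm(seq + attn_out)              # residual connection + layer norm
        return seq.reshape(n, h, w, t, c).permute(0, 3, 4, 1, 2)
```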
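For the adaptive feature fusion module, the abstract only states that features receive dynamically assigned weights. A minimal sketch of that behavior, assuming two input streams (e.g. a spatial branch and a temporal branch) and a small gating network whose architecture is a hypothetical choice, is shown below.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Sketch of adaptive fusion of two feature streams.

    A small gating network predicts per-sample branch weights from the
    concatenated streams, so each sample decides how much the spatial and
    temporal branches contribute. The gating architecture is an assumption
    for illustration, not the paper's design.
    """

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2),
            nn.Softmax(dim=-1),                       # weights sum to 1 per sample
        )

    def forward(self, spatial_feat, temporal_feat):   # each: (N, C)
        w = self.gate(torch.cat([spatial_feat, temporal_feat], dim=-1))  # (N, 2)
        return w[:, :1] * spatial_feat + w[:, 1:] * temporal_feat
```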