Real-Time Action Detection Based on Spatio-Temporal Interaction Perception
Spatiotemporal action detection requires incorporating both the spatial and temporal information of a video. Current state-of-the-art approaches usually adopt a 2D CNN (Convolutional Neural Network) or a 3D CNN architecture. However, due to the complexity of the network structure and of spatiotemporal feature extraction, these methods are usually non-real-time and offline. To address this problem, this paper proposes a real-time action detection method based on spatiotemporal interaction perception. First, the input video frames are rearranged out of order to enhance the temporal information. Since 2D or 3D backbone networks alone cannot model spatiotemporal features effectively, a multi-branch feature extraction network is proposed to extract features from different sources, and a multi-scale attention network is proposed to capture long-term temporal dependencies and spatial context information. Then, to fuse the temporal and spatial features from the two different sources, a new motion saliency enhancement fusion strategy is proposed, which guides the fusion by encoding the temporal and spatial features so as to highlight more discriminative spatiotemporal features. Finally, action tubes are linked online from the frame-level detection results. The proposed method achieves accuracies of 84.71% and 78.4% on the two spatiotemporal action detection datasets UCF101-24 and JHMDB-21, respectively, and runs at 73 frames per second, which is superior to state-of-the-art methods. In addition, to address the high inter-class similarity and easily confused hard samples in the JHMDB-21 dataset, this paper proposes a key-frame optical flow action detection method based on action representation, which avoids generating redundant optical flow and further improves the accuracy of action detection.
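The abstract does not specify how the motion saliency enhancement fusion encodes the two streams. A minimal sketch of one plausible realization, in which each stream is pooled into a channel descriptor whose sigmoid gates the summed features, might look like the following (the function names, the sum-then-gate structure, and the use of global average pooling are all assumptions, not the paper's actual design):

```python
import numpy as np

def channel_descriptor(feat):
    """Global-average-pool a (C, H, W) feature map into a (C,) descriptor."""
    return feat.mean(axis=(1, 2))

def saliency_fusion(spatial, temporal):
    """Fuse spatial and temporal feature maps of shape (C, H, W).

    Each stream is encoded into a channel descriptor; a sigmoid of the
    combined descriptor gates the element-wise sum of the two streams,
    emphasizing channels with strong motion-related responses.
    """
    desc = channel_descriptor(spatial) + channel_descriptor(temporal)
    gate = 1.0 / (1.0 + np.exp(-desc))      # (C,) channel weights in (0, 1)
    fused = spatial + temporal              # element-wise sum of the streams
    return fused * gate[:, None, None]      # broadcast the gate over H and W

rng = np.random.default_rng(0)
s = rng.standard_normal((8, 4, 4))   # hypothetical spatial-stream features
t = rng.standard_normal((8, 4, 4))   # hypothetical temporal-stream features
out = saliency_fusion(s, t)
print(out.shape)  # (8, 4, 4)
```

In a trained network the descriptor would typically pass through learned layers before the sigmoid; the fixed pooling here only illustrates the gating idea.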
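Online linking of frame-level detections into action tubes is commonly done greedily by spatial overlap. The following is a rough sketch of such a linker, not the paper's actual algorithm: each incoming frame's boxes either extend the best-overlapping live tube or start a new one (the `link_tubes` name, the IoU threshold, and the greedy matching order are illustrative assumptions):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_tubes(frame_dets, iou_thresh=0.3):
    """Greedily link per-frame detections into action tubes, online.

    frame_dets: list over frames; each frame is a list of (box, score).
    Returns a list of tubes, each a list of (frame_idx, box, score).
    """
    tubes = []
    for t, dets in enumerate(frame_dets):
        unmatched = list(dets)
        for tube in tubes:
            last_t, last_box, _ = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue  # tube already ended, or nothing left to match
            # extend with the detection overlapping the tube's last box most
            best = max(unmatched, key=lambda d: iou(last_box, d[0]))
            if iou(last_box, best[0]) >= iou_thresh:
                tube.append((t, best[0], best[1]))
                unmatched.remove(best)
        for box, score in unmatched:  # leftovers start new tubes
            tubes.append([(t, box, score)])
    return tubes

# Two frames: the first box continues into frame 1, the second appears anew.
dets = [
    [((0, 0, 10, 10), 0.9)],
    [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)],
]
tubes = link_tubes(dets)
print(len(tubes))  # 2: one two-frame tube plus one new single-frame tube
```

Because each frame is processed as it arrives, this style of linking is compatible with the online, real-time setting the paper targets.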