Egocentric action recognition method based on multi-scale temporal interaction
For the egocentric action recognition task, most existing methods use non-action-category labels such as object bounding boxes and eye-gaze data to help supervise deep neural networks so that they focus on the regions of the video where hands and their interacting objects are located. This requires additional manually annotated data and complicates the video feature extraction process. To address this issue, a multi-scale temporal interaction module is proposed that enables 2D neural networks to perform temporal interaction on extracted video frame features through 3D temporal convolutions at different scales, so that the features of a single video frame are fused with those of its neighboring frames. With supervision from action category labels alone, multi-scale temporal interaction encourages the network to attend to the regions of egocentric videos where hands and their interacting objects are located. Experimental results show that the proposed method achieves better recognition accuracy than existing egocentric action recognition methods.
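The abstract does not give implementation details, but the core idea of fusing each frame's features with those of neighboring frames at several temporal scales can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's method: the function names are hypothetical, uniform averaging weights stand in for the learned 3D temporal convolution kernels, and the fusion here is a simple average over the assumed scales (1, 3, 5).

```python
import numpy as np

def temporal_conv(feats, k):
    """Depthwise temporal convolution with kernel size k and 'same' padding.
    Uniform averaging weights stand in for learned convolution parameters."""
    T, C = feats.shape
    pad = k // 2
    # replicate-pad along the temporal axis so output length equals T
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for t in range(T):
        # each output frame mixes features of its k temporal neighbors
        out[t] = padded[t:t + k].mean(axis=0)
    return out

def multi_scale_temporal_interaction(feats, scales=(1, 3, 5)):
    """Fuse per-frame features with neighboring frames at several
    temporal scales (hypothetical scales; the paper does not list them)."""
    return sum(temporal_conv(feats, k) for k in scales) / len(scales)

# toy example: 8 frames, each with a 4-dimensional feature vector
feats = np.random.rand(8, 4)
fused = multi_scale_temporal_interaction(feats)
print(fused.shape)  # (8, 4): per-frame features, now temporally mixed
```

In the described module, such temporal mixing is inserted into a 2D network so that frame-level features carry short-range motion context without requiring extra annotations such as bounding boxes or gaze data.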