Temporal dynamic frame selection and spatio-temporal graph convolution for interpretable skeleton-based action recognition
Skeleton-based action recognition is a prominent research topic in computer vision and machine learning. Existing data-driven neural networks often overlook temporal dynamic frame selection in skeleton sequences and lack an understandable decision logic, which limits their interpretability. To this end, we propose an interpretable skeleton-based action recognition method based on temporal dynamic frame selection and spatio-temporal graph convolution, with the aim of improving both interpretability and recognition performance. Firstly, the quality of each skeleton frame is estimated from the joint confidence, and low-quality frames are removed to address the skeleton noise problem. Secondly, based on domain knowledge of human activity, an adaptive temporal dynamic frame selection module computes motion-salient regions to capture the dynamic patterns of key skeleton frames in human motion. Thirdly, to represent the intrinsic topology of human joints, an improved spatio-temporal graph convolutional network is used for interpretable skeleton-based action recognition. Experiments were conducted on three large public datasets, NTU RGB+D, NTU RGB+D 120, and FineGym, and the results showed that the proposed method achieved higher recognition accuracy than the compared methods while remaining interpretable.
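The first two steps described above (confidence-based frame filtering, then selection of motion-salient frames) can be illustrated with a minimal sketch. The abstract does not specify the quality score, the threshold, or the saliency measure, so everything below is an assumption made for illustration: frame quality is taken as the mean joint confidence, `min_quality=0.5` is an arbitrary threshold, saliency is approximated by the summed joint displacement from the previous frame, and the function names are hypothetical. It is not the authors' adaptive module.

```python
import math


def filter_low_quality_frames(frames, confidences, min_quality=0.5):
    """Drop frames whose mean joint confidence is below min_quality.

    frames:      list of frames, each a list of (x, y) joint coordinates
    confidences: list of per-frame lists of joint confidence scores
    Returns the kept frames and their indices in the original sequence.
    Assumption: quality = mean joint confidence; threshold is arbitrary.
    """
    kept = [
        t for t, conf in enumerate(confidences)
        if sum(conf) / len(conf) >= min_quality
    ]
    return [frames[t] for t in kept], kept


def motion_energy(prev_frame, cur_frame):
    """Summed Euclidean displacement of all joints between two frames."""
    return sum(math.dist(p, q) for p, q in zip(prev_frame, cur_frame))


def select_salient_frames(frames, num_frames):
    """Return indices of the num_frames frames with the largest motion
    energy, in temporal order.

    Assumption: saliency is approximated by inter-frame joint displacement;
    the first frame has no predecessor and gets energy 0.
    """
    n = len(frames)
    if n <= num_frames:
        return list(range(n))
    energy = [0.0] + [
        motion_energy(frames[t - 1], frames[t]) for t in range(1, n)
    ]
    # Rank frames by energy (highest first), keep the top ones,
    # then restore temporal order for the downstream network.
    ranked = sorted(range(n), key=lambda t: energy[t], reverse=True)
    return sorted(ranked[:num_frames])
```

Filtering before selection matters in this sketch: a noisy frame with badly misplaced joints would otherwise produce a large spurious displacement and be picked as "salient". The selected frames would then be passed to the graph convolutional network, which is not sketched here.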