Space-Time-Correlated Transformer for Skeleton-Based Action Recognition
At present, the most common skeleton-based action recognition methods adopt the joint stream, bone stream, and their corresponding motion streams as a multi-stream network trained separately, which results in high training costs. In addition, the modeling of complex spatio-temporal dependencies is neglected during feature extraction, and large-kernel convolutions are adopted to exchange information in the temporal domain, leading to the aggregation of a large amount of redundant information. A space-time-correlated transformer method for skeleton-based action recognition was investigated to address these problems. First, a motion fusion module was constructed to reduce the cost of training motion streams separately by taking the joint and bone streams as inputs and fusing their respective motion information at the feature level. Second, a shift transformer module was proposed, which exploits the temporal shift operation to mix spatio-temporal information within the transformer, capturing short-term spatio-temporal dependencies at low cost. Then, a multiscale temporal convolution was designed to model long-term information in the temporal domain. Finally, the final classification prediction was obtained by fusing the scores of the two streams. Experiments on the large-scale NTU RGB+D and NTU RGB+D 120 datasets showed that the model achieved recognition accuracies of 91.5% and 96.3% under the X-Sub and X-View evaluation protocols of NTU RGB+D, respectively, and 87.2% and 89.3% under the X-Sub and X-Set protocols of NTU RGB+D 120, respectively. The recognition accuracy of the proposed method was significantly better than those of the most commonly used skeleton action recognition methods, verifying the effectiveness and generality of the model.
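The temporal shift operation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a TSM-style shift in which a small fraction of channels is displaced one frame forward or backward in time, so that a subsequent per-frame transformer layer sees features from adjacent frames and thus mixes short-term spatio-temporal information at negligible cost. The tensor layout `(T, V, C)` (frames, joints, channels) and the `shift_div` fraction are illustrative assumptions.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """TSM-style temporal shift (illustrative sketch, not the paper's code).

    x: array of shape (T, V, C) -- frames, joints, channels (assumed layout).
    1/shift_div of the channels are shifted toward the past, another
    1/shift_div toward the future; the rest are left unchanged.
    Out-of-range positions are zero-padded.
    """
    T, V, C = x.shape
    fold = C // shift_div
    out = np.zeros_like(x)
    # First fold: pull features from the NEXT frame (shift backward in time).
    out[:-1, :, :fold] = x[1:, :, :fold]
    # Second fold: pull features from the PREVIOUS frame (shift forward).
    out[1:, :, fold:2 * fold] = x[:-1, :, fold:2 * fold]
    # Remaining channels stay in place.
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out
```

Because the shift is a pure memory movement with no learned parameters, feeding the shifted features into an otherwise spatial (per-frame) attention block gives it a short temporal receptive field essentially for free, which matches the abstract's claim of capturing short-term dependencies at low cost.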