Few-shot Classification via Spatial-Temporal Joint Alignment in the Video Action Skeleton Description Space
Compared with big-data human action recognition, few-shot action recognition can quickly adapt to novel action categories and classify unlabeled actions by learning discriminative features from only a few labeled action samples. Action recognition algorithms based on few-shot learning have therefore gained increasing interest and are widely studied. However, the action information in RGB video feature descriptions is easily confused with irrelevant background, brightness, and color changes. To address the difficulty of labeling action samples, the poor environmental adaptability of RGB video data, and the high data dimensionality, this study combines efficient, interpretable skeleton description data with few-shot learning and proposes a few-shot classification algorithm based on spatial-temporal joint alignment in the video action skeleton description space. The model builds on the Prototype Network (ProtoNet), which maps an original input into an embedding space to compute prototype representations and predicts query samples with a distance measure. For feature extraction from the skeleton sequence, a Spatial-Temporal Joint Attention Graph Convolutional Network (STJA-GCN) is designed as the feature encoding backbone. First, a spatial-temporal graph is constructed for the input skeleton sequence; then multilevel spatial-temporal graph convolution and spatial-temporal joint attention activation are applied to obtain the corresponding high-level embedded feature representation. The Spatial-Temporal Joint Attention (ST-JointAtt) module weighs the importance of bones and joints at different action stages across the spatial-temporal dimensions and adaptively focuses on key action information, enhancing the model's ability to extract discriminative features. For distance measurement, the Euclidean distance between the query and support skeleton graphs is obtained through graph
matching. Subsequently, the Dynamic Time Warping (DTW) algorithm is used to simulate stretching and shrinking of the time series, the optimal matching between the two action sequences is computed dynamically, and the accumulated distance over the skeleton graph pairs is calculated, enhancing the alignment of spatial-temporal features. Experiments were performed on three skeleton benchmarks: NTU-T, NTU-S, and Kinetics. The experimental results show that the proposed algorithm fully utilizes human skeleton information and further improves the matching accuracy of few-shot action recognition.
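As a rough illustration of the distance measurement step described above (a minimal sketch, not the authors' implementation), the following code accumulates frame-wise Euclidean distances between two embedded skeleton sequences under the optimal monotonic alignment found by DTW, then assigns a query to the nearest support class. The embedding dimension, sequence lengths, and the `classify` helper are illustrative assumptions:

```python
import numpy as np

def dtw_distance(query, support):
    """Accumulated frame-wise Euclidean distance between two embedded
    skeleton sequences under the optimal DTW alignment.
    query: (T1, D) array of per-frame embeddings; support: (T2, D) array.
    """
    T1, T2 = len(query), len(support)
    # Pairwise Euclidean distances between every query/support frame pair.
    cost = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
    # Dynamic-programming table of accumulated alignment cost.
    acc = np.full((T1 + 1, T2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            # Extend the cheapest of the three admissible alignment moves:
            # match both frames, stretch the query, or stretch the support.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[T1, T2]

def classify(query, support_sequences):
    """Hypothetical few-shot prediction: pick the class whose support
    sequence has the smallest DTW distance to the query."""
    dists = {label: dtw_distance(query, seq)
             for label, seq in support_sequences.items()}
    return min(dists, key=dists.get)
```

Because DTW stretches and shrinks the time axis, two performances of the same action at different speeds accumulate a small distance, which is the alignment property the abstract attributes to this step.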