Gesture recognition by combining spatio-temporal mask and spatial 2D position encoding
Objective: Gesture recognition methods often neglect the correlation between fingers and pay excessive attention to individual node features, which is a major cause of low gesture recognition rates. For example, the index finger and thumb are physically disconnected, but their interaction is important for recognizing the "pinch" action. The low recognition rate also stems from the failure to properly encode the spatial positions of the hand joints. To capture the correlation between fingers, dividing the hand joints into blocks is proposed. The position-encoding problem is addressed by encoding each joint's two-dimensional position through its projection coordinates. To the best of the authors' knowledge, this study is the first to encode the two-dimensional spatial position of hand nodes.

Method: A spatio-temporal graph is generated from the gesture sequence. This graph contains the physical connections of the nodes and their temporal information, so spatial and temporal characteristics are learned using mask operations. From the three-dimensional space coordinates of the joint nodes, two-dimensional projection coordinates are obtained and fed into a two-dimensional spatial position encoder composed of sine and cosine functions with different frequencies. The plane containing the projection coordinates is divided into several grid cells, the sine-cosine encoding is computed within each grid cell, and the encodings of all cells are combined to form the final spatial two-dimensional position code. Embedding the encoded information into the spatial features of the nodes not only strengthens the spatial structure between them but also prevents the nodes from becoming disordered during movement. A graph convolutional network then aggregates the spatially encoded node and neighbor features, and the resulting spatio-temporal graph features are fed into the spatial self-attention module to extract the inter-finger correlation. Taking each finger as the research object, the nodes of the spatio-temporal graph are divided into blocks according to the biological structure of the human hand. Each finger is passed through a learnable linear transformation to generate its query (Q), key (K), and value (V) feature vectors. The self-attention mechanism then calculates the correlation between fingers in each frame of the spatio-temporal graph, the correlation weights between fingers are obtained by combining the spatial mask matrix, and each finger feature is updated. While updating the finger features, the spatial mask matrix disconnects the temporal relationships between fingers in the spatio-temporal graph, preventing the time dimension from influencing the spatial correlation weight matrix. The temporal self-attention module is similarly used to learn the temporal features of fingers in the spatio-temporal graph. First, temporal sequence embedding is performed for each frame through one-dimensional temporal position coding, so that the model receives the temporal order of the frames during learning. A time dimension expansion strategy fuses the features of adjacent frames to capture long-range inter-frame correlations. A learnable linear transformation then generates the query (Q), key (K), and value (V) feature vectors for each frame. Finally, the self-attention mechanism calculates the correlation between frames in the spatio-temporal graph; the inter-frame correlation weight matrix is obtained by combining the temporal mask matrix, and the features of each frame are updated. The temporal mask matrix likewise prevents the spatial dimension from influencing the temporal correlation weight matrix. A fully connected network, ReLU activation function, and layer normalization are appended to the end of each attention module to improve training efficiency, and the model finally outputs the learned feature vector for gesture recognition.
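As a concrete illustration of the grid-based sine-cosine encoder described above, the following sketch shows one plausible implementation. The grid size, frequency base, normalization of coordinates to [0, 1], and the choice to encode the cell index and the intra-cell offset in separate channels are assumptions made for illustration, not the authors' exact formulation.

```python
import torch

def axis_encoding(pos, n_channels, base=10000.0):
    """Standard sinusoidal encoding of a scalar position into n_channels
    (n_channels must be even)."""
    freqs = torch.exp(-torch.log(torch.tensor(base))
                      * torch.arange(0, n_channels, 2).float() / n_channels)
    enc = torch.zeros(pos.shape[0], n_channels)
    enc[:, 0::2] = torch.sin(pos[:, None] * freqs)
    enc[:, 1::2] = torch.cos(pos[:, None] * freqs)
    return enc

def spatial_2d_position_encoding(xy, d_model=64, grid_size=4):
    """Encode 2-D projected joint coordinates over a grid of cells.

    xy        : (N, 2) joint projection coordinates, assumed normalized to [0, 1].
    d_model   : encoding dimension, must be divisible by 4.
    grid_size : the projection plane is split into grid_size x grid_size cells
                (the cell count is an assumption for illustration).
    """
    # Locate each joint's grid cell and its offset inside that cell.
    cell = torch.clamp((xy * grid_size).floor(), max=grid_size - 1)
    offset = xy * grid_size - cell                    # intra-cell position in [0, 1)

    q = d_model // 4                                  # channels per (axis, component) pair
    parts = [axis_encoding(cell[:, 0], q),            # x grid-cell index
             axis_encoding(offset[:, 0], q),          # x position inside the cell
             axis_encoding(cell[:, 1], q),            # y grid-cell index
             axis_encoding(offset[:, 1], q)]          # y position inside the cell
    return torch.cat(parts, dim=1)                    # (N, d_model), added to node features
```

For instance, `spatial_2d_position_encoding(torch.rand(22, 2))` would encode one frame of 22 hand joints into 64-dimensional position codes.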
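The finger-block spatial self-attention could be realized along the following lines. The mean-pooling of joints into a single finger feature, the grouping, and the head count are illustrative assumptions (the paper derives each finger's feature vector via a learnable linear map); the spatial mask simply forbids attention across different frames, as the abstract describes.

```python
import torch
import torch.nn as nn

def make_spatial_mask(T, F):
    """Additive mask allowing attention only between fingers of the same
    frame, so the time dimension cannot affect the spatial weights."""
    frame_id = torch.arange(T).repeat_interleave(F)   # frame index of each (frame, finger) slot
    block = frame_id[:, None] != frame_id[None, :]    # True across different frames
    return torch.zeros(T * F, T * F).masked_fill(block, float("-inf"))

class SpatialFingerAttention(nn.Module):
    """Masked self-attention between per-finger blocks (illustrative sketch)."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)       # learnable map to finger features
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, F, J, C) node features grouped into F finger blocks of J joints
        B, T, F, J, C = x.shape
        fingers = self.proj(x.mean(dim=3))            # (B, T, F, C): one vector per finger
        seq = fingers.reshape(B, T * F, C)            # flatten (frame, finger) into a sequence
        mask = make_spatial_mask(T, F).to(seq.device)
        out, _ = self.attn(seq, seq, seq, attn_mask=mask)
        return out.reshape(B, T, F, C)                # updated finger features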
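The temporal module could be sketched similarly. Here, averaging each frame with its successor stands in for the time dimension expansion strategy, and the mask granularity and position-code details are assumptions for illustration.

```python
import torch
import torch.nn as nn

def temporal_1d_position_encoding(T, d_model, base=10000.0):
    """Standard 1-D sinusoidal position code over frame indices."""
    pos = torch.arange(T).float()[:, None]
    freqs = torch.exp(-torch.log(torch.tensor(base))
                      * torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(pos * freqs)
    pe[:, 1::2] = torch.cos(pos * freqs)
    return pe

def make_temporal_mask(T, F):
    """Additive mask allowing attention only between the same finger across
    frames, so the spatial dimension cannot affect the temporal weights."""
    finger_id = torch.arange(F).repeat(T)
    block = finger_id[:, None] != finger_id[None, :]
    return torch.zeros(T * F, T * F).masked_fill(block, float("-inf"))

class TemporalAttention(nn.Module):
    """Masked temporal self-attention with adjacent-frame fusion (sketch)."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, F, C) finger features per frame
        B, T, F, C = x.shape
        fused = 0.5 * (x + torch.roll(x, shifts=-1, dims=1))   # fuse adjacent frames
        h = fused + temporal_1d_position_encoding(T, C)[None, :, None, :]
        seq = h.reshape(B, T * F, C)
        mask = make_temporal_mask(T, F).to(seq.device)
        out, _ = self.attn(seq, seq, seq, attn_mask=mask)
        return out.reshape(B, T, F, C)                         # updated frame features
```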
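The fully connected network, ReLU, and layer normalization appended to each attention module correspond to the standard transformer feed-forward tail; a minimal version (the residual connection and hidden width are assumptions) might be:

```python
import torch.nn as nn

class PostAttentionFFN(nn.Module):
    """Fully connected network + ReLU + LayerNorm appended to each attention
    module; the residual connection and hidden width are assumptions."""
    def __init__(self, d_model=64, d_hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden),
                                 nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.ffn(x))
```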
Result: The model is tested on two challenging datasets: DHG-14/28 and SHREC'17 track. The experimental results show that the model achieves the best recognition rate on DHG-14/28, exceeding the HPEV and MS-ISTGCN algorithms by 4.47% and 2.71% on average, respectively. On the SHREC'17 track dataset, the algorithm is 0.47% higher than the HPEV algorithm on average. The ablation study confirms the necessity of the spatial two-dimensional position coding. Experiments also show that the model achieves its best recognition rate when the node features have 64 dimensions and the number of self-attention heads is 8.

Conclusion: Extensive experimental evaluations verify that the network model built with the block strategy and spatial two-dimensional position coding not only strengthens the spatial structure of the nodes but also improves the gesture recognition rate by using the self-attention mechanism to learn the correlation between fingers that are not physically connected.
Keywords: gesture recognition; self-attention; spatial two-dimensional position coding; spatio-temporal mask; hand segmentation