Point cloud human behavior recognition based on coordinate transformation and spatiotemporal information injection
Objective Human motion recognition based on deep learning has become a research hotspot in computer vision because of its extensive applications in video surveillance, virtual reality, and intelligent human-computer interaction. Deep learning has achieved excellent results in feature extraction from static images and has gradually been extended to behavior recognition. Traditional research on human behavior recognition focuses on depth image sequences under 2D information. A depth image can not only capture 3D information successfully but can also provide depth information. Depth information represents the distance between the target and the depth camera within the visual range, disregarding the influence of external factors such as lighting and background. Although depth images can capture 3D information, most depth image algorithms use the multi-view method to extract behavior features. The extraction of spatiotemporal features is affected by the angle and number of views, which considerably lowers the utilization rate of 3D structural information, and much of the spatiotemporal structure of the 3D data is lost. With the rapid development of 3D acquisition technology, 3D sensors, including various types of 3D scanners and LiDAR, are becoming increasingly accessible and affordable. The 3D data collected by these sensors provide rich geometry, shape, and scale information and have many applications in different fields, including autonomous driving, robotics, remote sensing, and healthcare. The point cloud is a commonly used 3D representation; it retains the original geometric information in 3D space without any discretization. It is therefore the preferred representation for scene-understanding applications such as autonomous driving and robotics. However, deep learning on 3D point clouds still faces major challenges, such as small dataset size.
Method In this study, the depth map sequence is first converted into a 3D point cloud sequence to represent human behavior information, and large, authoritative depth datasets are converted into point cloud datasets to compensate for the small size of existing point cloud datasets. Given the huge amount of point cloud data, traditional point cloud deep learning networks sample the point cloud before feature extraction, most commonly by random subsampling, which inevitably destroys some of the structural information of the point cloud. To improve the utilization rate of temporal and spatial structure information and compensate for the information lost during random subsampling, a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection is proposed for motion recognition. The network consists of two modules: a feature extraction module and a spatiotemporal information injection module. The feature extraction module extracts deep appearance contour features of the point cloud through operations such as an abstraction manipulation layer, multilayer perceptrons, and max pooling, where the abstraction manipulation layer includes sampling, grouping, convolutional block attention module (CBAM), and PointNet layers. The spatiotemporal information injection module injects temporal and spatial structure information into the abstract features. When temporal information is injected, sine and cosine functions of different frequencies are used as time position codes, because they encode the position of each vector in the unordered sequence uniquely and robustly. During spatial structure information injection, the position-encoded abstract features are multiplied by a group of learnable random tensors drawn from a normal distribution and projected onto the corresponding dimensional space. The coefficients of the random tensors are then learned through the network to find the optimal projection space, which better captures the structural relations within the point cloud. Subsequently, the features enter an interpoint attention module, which further learns the structural relationships between point cloud data points. Finally, the multilevel features from feature extraction and information injection are aggregated and input into the classifier for classification.
Result Extensive experiments are performed on three common datasets, on which the proposed network exhibits good performance. Accuracy on the NTU RGB+D 60 dataset is 1.3% and 0.2% higher than those of PSTNet and SequentialPointNet, respectively, considerably exceeding the recognition accuracy of other networks. Although accuracy on the NTU RGB+D 120 dataset is 0.1% lower than that of SequentialPointNet, it remains in a leading position compared with other networks and is 1.9% higher than that of PSTNet. The NTU dataset is one of the largest human action datasets. To verify the robustness of the network model on small datasets, comparative experiments were also performed on the small MSR Action3D dataset. The recognition accuracy of the proposed network was 1.07% higher than that of SequentialPointNet and considerably higher than those of other networks.
Conclusion We propose a point cloud human behavior recognition network that combines coordinate transformation and spatiotemporal information injection for behavior recognition. Through coordinate transformation, the depth map sequence is converted into a 3D point cloud sequence to characterize human behavior information, compensating for insufficient depth information, spatial information, and geometric features and improving the utilization rate of spatiotemporal structure information. The proposed network not only obtains static point cloud contour features but also integrates dynamic temporal and spatial information to compensate for the temporal and spatial losses caused by sampling during feature extraction.
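The coordinate transformation described above, converting each depth pixel into a 3D point, can be sketched with the standard pinhole back-projection. This is a minimal illustration, not the paper's exact implementation; the intrinsic parameters `fx, fy, cx, cy` are hypothetical placeholders for the calibration of the actual depth camera:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, depth in metres) into an N x 3
    point cloud with the pinhole model:
        x = (u - cx) * d / fx,  y = (v - cy) * d / fy,  z = d
    Pixels with zero (invalid) depth are discarded."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only valid-depth pixels

# Applying this frame by frame turns a depth map sequence into the
# 3D point cloud sequence used to represent a behavior sample.
```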
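The sine and cosine time position codes mentioned in the Method section can be sketched with the familiar Transformer-style formulation. The base frequency of 10000 is the conventional choice and is an assumption here, since the abstract does not state the exact constants used:

```python
import numpy as np

def sinusoidal_encoding(num_frames, dim):
    """Time position codes for a sequence of num_frames feature vectors:
        PE[t, 2i]   = sin(t / 10000^(2i / dim))
        PE[t, 2i+1] = cos(t / 10000^(2i / dim))
    Each frame index t receives a unique, smoothly varying code."""
    positions = np.arange(num_frames)[:, None]          # (T, 1)
    div = np.power(10000.0, np.arange(0, dim, 2) / dim)  # (dim/2,)
    pe = np.zeros((num_frames, dim))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

# Injection is then a simple addition to the abstract features,
# e.g. features + sinusoidal_encoding(T, C) for a (T, C) feature tensor.
```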
Keywords: human behavior recognition; coordinate transformation; point cloud sequence; feature extraction; spatiotemporal information