Industrial box-packing action recognition based on multi-view adaptive 3D skeleton network
Objective  Action recognition has become increasingly important in industrial manufacturing. Production efficiency and quality can be improved by recognizing worker actions and postures in complex production environments. In recent years, action recognition based on skeletal data has received widespread attention and research, with methods based mainly on graph convolutional networks (GCNs) or long short-term memory (LSTM) networks exhibiting excellent recognition performance in experiments. However, these methods do not consider occlusion, viewpoint changes, or similar subtle actions in the factory environment, all of which can significantly affect subsequent action recognition. Therefore, this study proposes a packing-behavior recognition method built on a dual-view skeleton multi-stream network.

Method  The network model consists of a main network and a sub-network. The main network takes as input two RGB videos from different viewpoints that record the same worker performing the same action simultaneously. The image difference method is first used to convert the input video into difference images. The 3D skeleton of the person is then extracted from the depth map with a 3D pose estimation algorithm and passed to the view-transformation module, which rotates the skeleton data to find the best observation viewpoint. The transformed skeleton sequence is fed into a three-layer stacked LSTM network, and the classification scores of the two streams are fused with weights to obtain the recognition result of the main network. In addition, for similar behaviors and non-compliant "fake actions", we crop a local positioning image around the hands, combine it with an attention mechanism, and pass it into a ResNeXt network for recognition. Moreover, we introduce a spatio-temporal attention mechanism for analyzing the action sequence so that the model focuses on the key frames of the skeleton sequence. The recognition scores of the main network and the sub-network are fused in proportion to obtain the final recognition result and predict the worker's behavior. Minimal sketches of the main pipeline steps are given below.
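The differencing step can be sketched as follows. This is a minimal illustration assuming an OpenCV-based implementation; the paper does not publish its code, so the function name and structure here are hypothetical.

```python
# Minimal sketch of the image difference method (assumed OpenCV-based).
import cv2

def difference_images(video_path):
    """Yield absolute-difference images between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        # |frame_t - frame_{t-1}| suppresses the static background and
        # keeps the moving worker, easing later pose estimation.
        yield cv2.absdiff(frame, prev)
        prev = frame
    cap.release()
```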
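The view-transformation module can be sketched as a small recurrent branch that regresses per-frame rotation angles and rotates the 3D joints accordingly. This PyTorch sketch assumes 25 joints (as in Kinect/NTU-style skeletons) and an illustrative hidden size; the paper's exact parameterization may differ.

```python
# Sketch of the adaptive view-transformation idea (assumptions noted above).
import torch
import torch.nn as nn

def rotation_matrix(angles):
    """Build a batch of 3x3 rotation matrices from Euler angles (N, 3)."""
    a, b, c = angles[:, 0], angles[:, 1], angles[:, 2]
    zeros, ones = torch.zeros_like(a), torch.ones_like(a)
    rx = torch.stack([ones, zeros, zeros,
                      zeros, a.cos(), -a.sin(),
                      zeros, a.sin(), a.cos()], dim=1).reshape(-1, 3, 3)
    ry = torch.stack([b.cos(), zeros, b.sin(),
                      zeros, ones, zeros,
                      -b.sin(), zeros, b.cos()], dim=1).reshape(-1, 3, 3)
    rz = torch.stack([c.cos(), -c.sin(), zeros,
                      c.sin(), c.cos(), zeros,
                      zeros, zeros, ones], dim=1).reshape(-1, 3, 3)
    return rz @ ry @ rx

class ViewAdapt(nn.Module):
    def __init__(self, num_joints=25, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(num_joints * 3, hidden, batch_first=True)
        self.to_angles = nn.Linear(hidden, 3)

    def forward(self, skel):                      # skel: (B, T, J, 3)
        B, T, J, _ = skel.shape
        h, _ = self.rnn(skel.reshape(B, T, J * 3))
        # Per-frame rotation toward a learned best observation viewpoint.
        R = rotation_matrix(self.to_angles(h).reshape(B * T, 3))
        out = torch.bmm(skel.reshape(B * T, J, 3), R.transpose(1, 2))
        return out.reshape(B, T, J, 3)            # rotated skeleton
```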
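The per-stream classifier is a three-layer stacked LSTM over the transformed skeleton sequence. The sketch below mean-pools over time for simplicity; the paper additionally weights key frames with its spatio-temporal attention mechanism. Layer sizes and class count are placeholders.

```python
# Minimal sketch of the three-layer stacked LSTM classifier head.
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    def __init__(self, num_joints=25, hidden=128, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(num_joints * 3, hidden,
                            num_layers=3, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, skel):                  # skel: (B, T, J, 3)
        B, T, J, _ = skel.shape
        h, _ = self.lstm(skel.reshape(B, T, J * 3))
        # Average over time; the paper instead emphasizes key frames
        # via spatio-temporal attention.
        return self.fc(h.mean(dim=1))         # (B, num_classes)
```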
Result  First, convolutional neural network (CNN)-based methods usually perform better than recurrent neural network (RNN)-based ones, whereas GCN-based methods perform in between. Combining CNN and RNN structures exploits the spatiotemporal information of skeletons more fully and improves accuracy and recall. The method proposed in this study nevertheless achieves a packing-behavior recognition accuracy of 92.31% and a recall of 89.72%, which are still 3.96% and 3.81% higher, respectively, than those of the combined methods; the proposed method is thus significantly ahead of other existing mainstream behavior recognition methods. Second, the difference-image input combined with the skeleton extraction algorithm achieves 87.6% accuracy, better than using original RGB images as input, although the frame rate drops to 55.3 frames per second, which is still within an acceptable range. Third, considering the influence of the adaptive transformation module and the multi-view module, we find that the recognition rate of a single-stream network with the adaptive transformation module improves greatly, while the frame rate decreases slightly. The experiments show that the module learns to observe the action preferentially from the front, because a frontal view spreads the skeleton joints as widely as possible, whereas a side view, in which the mutual occlusion among bones is greatest, gives the worst observation. For the dual view, simply fusing the outputs of the two single streams already improves performance, and the weighted-average fusion works best, exceeding the accuracy of single streams S1 and S2 by 3.83% and 3.03%, respectively. Some actions suffer from object occlusion and human self-occlusion at a given shooting angle; two complementary views solve this problem, because an action occluded in one view can still be recognized well in the other. In addition, evaluations on the public NTU RGB+D dataset show that the proposed method outperforms other networks, further validating its effectiveness and accuracy.

Conclusion  This method uses a two-stream network model. The main network is an adaptive multi-view RNN. Two depth cameras with complementary viewpoints collect data from the same workstation, and the incoming RGB images are converted into difference images for skeleton extraction. The skeleton data are then passed through the adaptive view-transformation module to obtain the best observation viewpoint, and a three-layer stacked LSTM produces the recognition results. The features of the two views are finally fused with weights, so the main network overcomes the influence of occlusion and background clutter. The sub-network adds recognition of hand images located from the skeleton: the cropped local positioning image is sent to a ResNeXt network, compensating for the insufficient accuracy on "fake actions" and similar actions. Finally, the recognition results of the main network and the sub-network are fused. The human behavior recognition method proposed in this study effectively exploits behavior information from multiple views and combines a skeleton network with CNN models to significantly improve the accuracy of behavior recognition.
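The sub-network can be sketched as a standard ResNeXt classifier over hand crops located from skeleton joints. The torchvision backbone, input resolution, and class count below are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of the sub-network: hand patches cropped via skeleton joints
# are classified with a ResNeXt backbone (torchvision >= 0.13 API).
import torch.nn as nn
from torchvision.models import resnext50_32x4d

class HandPatchNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = resnext50_32x4d(weights=None)
        # Replace the ImageNet head with a task-specific classifier.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features,
                                     num_classes)

    def forward(self, patches):      # patches: (B, 3, 224, 224) hand crops
        return self.backbone(patches)
```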
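Both fusion steps, dual-view fusion inside the main network and the final main/sub-network combination, reduce to a weighted average of class probabilities. The weights below are placeholders; the paper determines the proportions empirically.

```python
# Weighted score fusion used for both fusion stages (weights illustrative).
import torch

def fuse(p, q, w=0.6):
    """Weighted average of two class-probability tensors of shape (B, C)."""
    return w * p + (1 - w) * q

# Usage sketch:
# main  = fuse(view1_logits.softmax(1), view2_logits.softmax(1))  # dual view
# final = fuse(main, sub_logits.softmax(1), w=0.7)   # main + sub-network
# prediction = final.argmax(dim=1)
```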