Research on Human Action Recognition Algorithm by Fusing CNN and Spatio-Temporal Separation ViT
Video action recognition has made progress within computer vision, but recognition accuracy still leaves room for improvement. To address this, a network model that fuses a CNN with a spatio-temporal separated ViT is proposed to improve the accuracy of action classification and recognition. The encoder of the conventional ViT model is split into a temporal encoder and spatial encoders. The temporal and spatial encoders extract video features in series, and these features are fused with the features extracted by the CNN to improve recognition performance. Experimental results show that the network model fusing a CNN with a spatio-temporal separated ViT offers some advantage in recognition performance, which provides a new idea for the design of human action recognition algorithms.
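The serial encoder arrangement and the fusion step can be illustrated with a minimal, dependency-free sketch. It is not the proposed model: the attention here has no learned weights, the spatial-before-temporal order, the mean pooling, and fusion by concatenation are all assumptions made for illustration, and every name (`self_attention`, `spatio_temporal_features`, `fuse`) is hypothetical. The CNN branch is replaced by a random vector standing in for backbone features.

```python
import math
import random


def self_attention(tokens):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) / total
                    for i in range(dim)])
    return out


def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def spatio_temporal_features(video_tokens):
    """video_tokens[t][p] is the embedding of patch p in frame t."""
    # Spatial stage (assumed first): attention among patches of one frame,
    # pooled to a single token per frame.
    frame_tokens = [mean_pool(self_attention(frame)) for frame in video_tokens]
    # Temporal stage: attention among the per-frame tokens, pooled to one vector.
    return mean_pool(self_attention(frame_tokens))


def fuse(vit_features, cnn_features):
    """Fuse the two branches by concatenation (an assumption, not the paper's rule)."""
    return vit_features + cnn_features


random.seed(0)
T, P, D, C = 4, 6, 8, 5  # frames, patches per frame, token dim, CNN feature dim
video = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(P)] for _ in range(T)]
cnn_features = [random.gauss(0, 1) for _ in range(C)]  # stand-in for CNN output

fused = fuse(spatio_temporal_features(video), cnn_features)
print(len(fused))  # 13, i.e. D + C
```

Separating the two stages means each attention call sees only P patch tokens or T frame tokens, rather than all T × P tokens at once; the fused vector would then feed a classification head.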