3D human body pose estimation based on an improved Transformer
An improved Transformer multi-level feature encoding network is designed for 3D human pose estimation. A spatial pooling operator replaces the attention module, which reduces the number of model parameters and the computational complexity. These pooling structures are connected in series to obtain the initial feature representation. A cross-attention (CA) mechanism is then used for interactive learning of the feature information, and strided convolution is used to reduce the temporal dimension, so that similar poses are merged into a single representation of the pose sequence. Verification experiments on the Human3.6M dataset show that combining the pooling structure with the attention mechanism yields effective 3D human pose estimation. Compared with the original Transformer method, the number of model parameters is reduced by 30% and the positional precision is improved by 8.6%.
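The two parameter-saving steps named above, pooling in place of attention and strided convolution along time, can be illustrated with a minimal sketch. It is not the network described here: the function names `pool_mixer` and `strided_conv`, the window size, the kernel weights, and the toy input are all assumptions made for illustration, and the convolution shares one scalar weight per tap across all feature channels for brevity.

```python
def pool_mixer(frames, window=3):
    """Mix information across neighbouring frames by average pooling.

    Hypothetical stand-in for the pooling operator: it has no learned
    weights, whereas attention needs query, key and value projections.
    Frames at the sequence edges average over fewer neighbours.
    """
    half = window // 2
    mixed = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        neighbours = frames[lo:hi]
        # zip(*neighbours) iterates over feature channels.
        mixed.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return mixed


def strided_conv(frames, kernel, stride):
    """1D convolution along time with one shared scalar weight per tap.

    Assumed simplification: a real layer would learn separate weights
    per channel pair. Output length is (T - K) // stride + 1.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(frames) - k + 1, stride):
        window = frames[start:start + k]
        out.append([sum(w * x for w, x in zip(kernel, col)) for col in zip(*window)])
    return out


# Nine frames, each with a 2-dimensional toy feature vector (made-up data).
frames = [[float(t), float(2 * t)] for t in range(9)]
mixed = pool_mixer(frames, window=3)
reduced = strided_conv(mixed, kernel=[1 / 3, 1 / 3, 1 / 3], stride=3)
print(len(frames), len(mixed), len(reduced))  # 9 9 3
```

In this sketch the pooling step keeps the sequence length unchanged, while the stride of 3 shortens the sequence from nine frames to three, each summarising a group of adjacent and therefore similar poses.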