Fusing Global-Local Contextual Information for Small Object Multi-Person Pose Estimation
Despite recent advancements, existing multi-person 2D pose estimation methods cannot effectively identify the poses of small objects. A multi-person pose estimation method that integrates global and local contextual information is proposed to address this problem. The method uses the multi-scale features output by a High-Resolution Network (HRNet) to coarsely locate multiple anatomical centers of the human body, thereby providing more supervisory signal for small objects through multiple center points and improving their localization. The coordinates of the human center points serve as cues for extracting local contextual information at different scales near each center point via deformable sampling, and a contrastive loss between the local contextual information of different objects is computed to improve inter-object discrimination. Using the low-resolution features of HRNet as global contextual information and the local contextual information as cross-attention queries, a multilayer Transformer model is constructed that combines global and local contextual information to enhance the context of small objects. This enhanced information is then used as clustering centers, and the multi-scale fused features are decoupled to obtain keypoint heatmaps corresponding to different objects, achieving multi-person pose estimation for small objects. Experimental results show that the proposed method effectively improves the recognition of small-object poses, achieving an Average Precision (AP) of 69.0% on the COCO test-dev2017 dataset and an AP^M improvement of 1.4 percentage points over Dual Anatomical Centers (DAC).
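The core fusion step described above, using local (per-center) features as queries attending over global low-resolution features, can be illustrated with a minimal single-head scaled dot-product cross-attention sketch. This is an assumption-laden simplification of the paper's multilayer Transformer: all names, shapes, and the single-head form are illustrative, not the authors' implementation.

```python
import numpy as np

def cross_attention(local_q, global_kv):
    """Single-head cross-attention sketch: local contextual features
    (one query per human center) attend over global contextual tokens
    flattened from a low-resolution feature map.

    local_q:   (n_centers, d) -- local context extracted near each center
    global_kv: (n_tokens, d)  -- global context, used as keys and values
    Returns enhanced per-center features of shape (n_centers, d).
    """
    d_k = local_q.shape[-1]
    scores = local_q @ global_kv.T / np.sqrt(d_k)   # (n_centers, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over global tokens
    return weights @ global_kv                      # weighted sum of values

# Toy example: 3 person centers, a 4x4 global map flattened to 16 tokens, dim 8
rng = np.random.default_rng(0)
local_q = rng.standard_normal((3, 8))
global_kv = rng.standard_normal((16, 8))
enhanced = cross_attention(local_q, global_kv)
print(enhanced.shape)  # (3, 8)
```

In the actual method each enhanced vector would then act as a clustering center for decoupling the fused feature map into per-person keypoint heatmaps; the sketch only shows the query/key/value roles of the local and global context.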
pose estimation; small object; multiple center points; attention; contextual information