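Human pose estimation (HPE) is a fundamental task in computer vision that aims to obtain the spatial coordinates of human body joints from a given image. It is widely used in action recognition, semantic segmentation, human-computer interaction, and person re-identification. With the rise of deep convolutional neural networks (DCNNs), HPE has made significant progress. Nevertheless, HPE remains a challenging task, particularly in the face of complex poses, variations in keypoint scale, and occlusion. To summarize the development of HPE techniques that address occlusion, this paper systematically reviews representative methods since 2018 and, according to the training data, model structure, and output of the neural network, divides them into three categories: preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors. Data-augmentation-based methods increase the number of training samples by generating occluded data; feature-discrimination-based methods reduce interfering features by means such as attention mechanisms; and methods based on human structure priors use those priors to refine occluded poses. In addition, to better evaluate how methods perform under occlusion, the MSCOCO (Microsoft Common Objects in Context) val2017 dataset was re-annotated. Finally, the methods are compared and summarized to clarify their strengths and weaknesses under occlusion. On this basis, the reasons why HPE is difficult under occlusion and the future trends of the field are summarized and discussed.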
A comprehensive review of progress in deep-learning-based occluded human pose estimation
Human pose estimation (HPE) is a prominent area of research in computer vision whose primary goal is to accurately localize annotated keypoints of the human body, such as wrists and eyes. This fundamental task serves as the basis for numerous downstream applications, including human action recognition, human-computer interaction, pedestrian re-identification, video surveillance, and animation generation. Thanks to the powerful nonlinear mapping capabilities of convolutional neural networks, HPE has advanced notably in recent years. Despite this progress, HPE remains a challenging task, particularly when facing complex postures, variations in keypoint scale, and occlusion. Notably, current heatmap-based methods suffer from severe performance degradation under occlusion, which remains a critical challenge in HPE given that diverse human postures, complex backgrounds, and various occluding objects can all degrade performance. To survey recent advances in occlusion-aware HPE, this paper first examines why prediction under occlusion is difficult and identifies the following challenges.

The first challenge is the lack of annotated occluded data. Annotating occluded data is inherently complex and demanding. Most prevalent HPE datasets focus predominantly on visible keypoints, and only a few address and annotate occlusion scenarios. This shortage of annotated occluded data during training significantly compromises the robustness of models when body keypoints are partially or completely obstructed. The second challenge is feature confusion, which mainly affects top-down HPE methods: these methods crop the target person's region from the image using detected bounding boxes and then predict keypoints within that crop. Under occlusion, however, a detection box may include individuals other than the target person, which interferes with keypoint prediction. The issue is particularly problematic because the high feature similarity between the target person and the interfering individuals prevents the model from distinguishing their features, which underscores the need for strategies that address feature confusion in occluded scenes. The third challenge is inference under substantial occlusion. Large occluded areas remove the contextual and structural information that is essential for predicting occluded keypoints. Without this information, the model cannot draw on adjacent keypoints to make informed predictions about the occluded ones, which can result in missing keypoints or anomalous pose estimates.

This paper also systematically reviews representative methods since 2018. Based on the training data, model structure, and output results of the neural network, it categorizes the methods into three types, namely, preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors. Data-augmentation-based preprocessing generates occluded data to augment the training samples, compensating for the lack of annotated occluded data and alleviating the performance degradation of the model under occlusion. These techniques use synthetic methods to introduce occluding elements that simulate the occlusion observed in real-world settings. The model is thus exposed to diverse occluded samples during training, which enhances its robustness in complex environments and improves its generalization ability in practical applications. Feature-discrimination-based methods use attention mechanisms and similar techniques to reduce interfering features. By strengthening features associated with the target person and suppressing those related to non-target individuals, these methods mitigate the interference caused by feature confusion and allow the model to distinguish the keypoint features of the target person from those of interfering individuals. Methods based on human body structure priors optimize occluded poses by leveraging prior knowledge of the human body structure. These priors serve as constraints that improve the robustness of the model during inference: they inform the model about the expected configuration of body parts even in the presence of occlusion, guide its predictions, and help ensure that the estimated poses adhere closely to anatomically plausible configurations.

A comparative analysis is also conducted to highlight the strengths and limitations of each method in handling occlusion. Finally, this paper discusses the challenges inherent to occluded pose estimation and offers some directions for future research in this area.
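
As an illustration of the data-augmentation category described above, the following minimal Python sketch pastes a random-noise rectangle onto a person crop and marks the covered keypoints as occluded. It is not taken from any of the reviewed methods: the function name `paste_random_occluder`, the `max_frac` parameter, and the use of a noise rectangle as the occluder are assumptions made for illustration. Keypoints are assumed to follow the COCO (x, y, v) convention, where v = 2 means visible, v = 1 means labeled but occluded, and v = 0 means not labeled.

```python
import numpy as np


def paste_random_occluder(image, keypoints, rng, max_frac=0.4):
    """Paste a random-noise rectangle onto `image` and update keypoint visibility.

    image:     (H, W, 3) uint8 array.
    keypoints: (K, 3) array of (x, y, v) rows with a COCO-style flag v
               (0 = not labeled, 1 = labeled but occluded, 2 = visible).
    rng:       a numpy.random.Generator.
    max_frac:  hypothetical knob; upper bound on occluder size per image side.
    Returns new arrays; the inputs are not modified.
    """
    h, w = image.shape[:2]
    # Sample the occluder size: at least 1 pixel, at most max_frac of each side.
    occ_h = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    occ_w = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    # Sample the top-left corner so the rectangle stays inside the image.
    top = int(rng.integers(0, h - occ_h + 1))
    left = int(rng.integers(0, w - occ_w + 1))

    out_img = image.copy()
    out_img[top:top + occ_h, left:left + occ_w] = rng.integers(
        0, 256, size=(occ_h, occ_w, 3), dtype=np.uint8
    )

    out_kps = keypoints.copy()
    x, y, v = out_kps[:, 0], out_kps[:, 1], out_kps[:, 2]
    covered = (x >= left) & (x < left + occ_w) & (y >= top) & (y < top + occ_h)
    # Visible keypoints under the occluder become "labeled but occluded".
    out_kps[covered & (v == 2), 2] = 1
    return out_img, out_kps
```

In practice the occluder would more plausibly be a segmented object or another person's body part than a noise patch; the rectangle is only the simplest stand-in for the idea of generating occluded training samples together with matching occlusion labels.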
human pose estimation (HPE); occlusion; data augmentation; human structure prior; insufficient occlusion labeling data