Review of 2D human pose encoding and decoding methods: from the perspective of ambiguity mitigation
Within the various subfields of computer vision, human pose estimation stands out as an interesting area of research. It aims to precisely localize the body parts or keypoints of human instances in a given image or video and to reconstruct the skeleton structure of the human body. Human pose estimation provides technical support for various applications, such as human pose tracking, human action recognition, person re-identification, human-object interaction, and person image generation. Its uses span entertainment (such as virtual reality, augmented reality, and animation), health (such as healthcare and sports), and security (such as surveillance). Consequently, high-performance and real-time human pose estimation has emerged as a prominent focus of current computer vision research. Extensive research on human pose estimation methods has been conducted in recent years. Part of this research focuses on developing and refining high-performance or lightweight network architectures. Notable examples include Hourglass, SimpleBaseline, high-resolution net (HRNet), and Lite-HRNet. These architectures have found broad utility in various visual tasks, including object detection and instance segmentation. Another line of research is dedicated to introducing new pose encoding and decoding schemes, which are intended to yield accurate and robust human pose estimation models. The encoding and decoding processes are pivotal stages in extracting features from the input data and translating this information into interpretable human poses. The encoding process primarily involves extracting features from the input data and shaping them into an intermediate representation. This intermediate form, which could be feature maps or latent vectors, simplifies processing and comprehension; the subsequent decoding process recovers the final human pose from this encoded structure. Despite the
considerable progress made in current research on human pose estimation, the issue of ambiguity remains a major obstacle in real-world scenarios. Diverse poses might be mapped to similar or overlapping low-dimensional representations, primarily due to factors such as illumination, motion blur, occlusion, complex poses, perspective, and resolution. This mapping leads to ambiguous and uncertain pose estimates, which constitutes the ambiguity challenge in human pose estimation. This challenge encompasses distributive, scale, and associative ambiguity. First, when a hand is occluded, the precise location of the wrist becomes uncertain, which yields distributive ambiguity. Second, the scale of the body in the image diminishes when the camera is positioned farther from the human instance, and the accurate scale is often difficult to determine without ample contextual detail, which leads to scale ambiguity. Third, assigning the detected keypoints to the correct human instances becomes difficult when two human instances occlude each other, which introduces associative ambiguity. Well-designed methods for encoding and decoding human poses enable human pose estimation to be modeled and solved appropriately. These methods provide effective optimization objectives and feature representations for the model, allowing for the construction of reasonable and robust human pose estimation models. Therefore, investigating encoding and decoding for human pose estimation carries substantial importance for research. The majority of past review papers on human pose estimation have primarily focused on the design of network structures, even though the ambiguity problem can markedly influence the performance of human pose estimation. The objective of this review is to provide a summarized analysis of the current research on pose encoding and decoding methods. This analysis encompasses a thorough investigation of the inherent ambiguity challenge associated with human pose estimation. In this
paper, human pose modeling techniques, which directly affect how expressively a human pose can be represented, are introduced first. Second, the pose encoding and decoding methods are categorized according to the type of ambiguity they address: distributive, scale, or associative. Three strategies are explored to address distributive ambiguity: distributive, structural, and iterative constraints. Scale ambiguity is further divided into keypoint-wise and pixel-wise scale ambiguity. The former is mainly addressed through representative-based methods, and the latter can be addressed using unbiased and integral-based methods. Approaches to associative ambiguity are categorized into four groups: graph-, limb-, center-, and embedding-based methods. These diverse methods provide multiple potential solutions for dealing with associative ambiguity. A summary and performance comparison of the methods used for encoding and decoding human poses are provided to help readers understand the strengths and limitations of each approach. Finally, potential directions for future development are discussed. This paper aims to establish a novel research trajectory for researchers: addressing the ambiguity problem in human pose estimation through encoding and decoding. Resolving the ambiguity challenges in human pose estimation is expected to broaden its potential applications.
Keywords: deep learning; human pose estimation; ambiguity problem; human pose encoding and decoding; human pose modeling
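
To make the encoding and decoding steps concrete, the following minimal sketch assumes one common choice of intermediate representation: a keypoint is encoded as a 2D Gaussian heatmap and then decoded in two ways. Taking the arg-max snaps the result to the pixel grid, which is one way pixel-wise scale ambiguity can appear, whereas taking the expectation over the normalized heatmap illustrates the idea behind integral-based decoding. The function names, the grid size, and the value of sigma are illustrative assumptions rather than details taken from any particular method in the review. In a trained model the heatmap would be predicted by a network and typically normalized with a softmax; here a synthetic heatmap is normalized by its sum.

```python
import math


def encode_heatmap(x, y, width, height, sigma=1.5):
    """Encode a keypoint at continuous (x, y) as a Gaussian heatmap (rows = y, columns = x)."""
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
         for c in range(width)]
        for r in range(height)
    ]


def decode_argmax(heatmap):
    """Decode by taking the location of the maximum; the result lies on the integer grid."""
    _, col, row = max(
        ((value, c, r) for r, line in enumerate(heatmap) for c, value in enumerate(line)),
        key=lambda item: item[0],
    )
    return float(col), float(row)


def decode_integral(heatmap):
    """Decode by taking the expected coordinate under the sum-normalized heatmap."""
    total = sum(sum(line) for line in heatmap)
    x = sum(c * value for line in heatmap for c, value in enumerate(line)) / total
    y = sum(r * value for r, line in enumerate(heatmap) for value in line) / total
    return x, y


if __name__ == "__main__":
    # Hypothetical keypoint with a sub-pixel location on a 24 x 16 heatmap.
    heatmap = encode_heatmap(10.3, 7.6, width=24, height=16)
    print("arg-max decoding: ", decode_argmax(heatmap))   # snaps to the nearest grid cell
    print("integral decoding:", decode_integral(heatmap))  # close to the sub-pixel location
```

On this synthetic input the arg-max result differs from the true location by up to half a pixel per axis, while the expectation stays close to the sub-pixel location; on real predicted heatmaps the behavior depends on how well the network output matches the assumed shape.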