A survey on multimodal information-guided 3D human motion generation
Three-dimensional (3D) digital human motion generation guided by multimodal information produces human motion under specific input conditions from data such as text, audio, images, and video. This technology has a wide spectrum of applications and extensive economic and social benefits in fields such as film, animation, game production, and the metaverse, and it is one of the research hotspots in computer graphics and computer vision. However, the task faces grand challenges, including the difficult representation and fusion of multimodal information, the lack of high-quality datasets, the poor quality of generated motion (such as jitter, penetration, and foot sliding), and low generation efficiency. Although various solutions have been proposed to address these challenges, a mechanism for achieving efficient and high-quality 3D digital human motion generation based on the characteristics of distinct modal data remains an open problem.

This paper comprehensively reviews 3D digital human motion generation and elaborates on recent advances from the perspectives of parametrized 3D human models, human motion representation, motion generation techniques, motion analysis and editing, existing human motion datasets, and evaluation metrics. Parametrized human models facilitate digital human modeling and motion generation by providing parameters associated with body shape and posture, and they serve as key pillars of current digital human research and applications. This survey begins with an introduction to widely used parametrized 3D human body models, including shape completion and animation of people (SCAPE), the skinned multi-person linear model (SMPL), SMPL-X, and SMPL-H, and compares them in detail in terms of model representations and the parameters used to control body shape, pose, and facial expression. Human motion representation is a core issue in digital human motion generation. This work highlights the musculoskeletal model and classic skinning algorithms, including linear blend skinning and dual quaternion skinning, and their application in physics-based and data-driven methods to control human movements.

We also extensively study existing approaches to multimodal information-guided human motion generation and categorize them into four major branches, i.e., generative adversarial network-, autoencoder-, variational autoencoder-, and diffusion model-based methods. Other works, such as generative motion matching, are also mentioned and compared with data-driven methods. The survey summarizes existing schemes of human motion generation from the perspectives of methods and model architectures and presents a unified framework for digital human motion generation: a motion encoder extracts motion features from an original motion sequence and fuses them with the conditional features extracted by a condition encoder into latent variables or maps them to the latent space, and this conditioning enables generative adversarial networks, autoencoders, variational autoencoders, or diffusion models to generate qualified human movements through a motion decoder.
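To make the unified framework described above concrete, the following minimal PyTorch-style sketch shows one possible instantiation as a conditional variational autoencoder. All module names, layer sizes, and tensor shapes are hypothetical illustration choices under stated assumptions, not the architecture of any specific surveyed method.

```python
# Minimal, illustrative sketch of the unified conditional framework (a VAE-style
# instance). Module names, dimensions, and shapes are hypothetical, chosen only
# to show how motion and condition features are fused and decoded.
import torch
import torch.nn as nn


class ConditionalMotionVAE(nn.Module):
    def __init__(self, pose_dim=66, cond_dim=512, latent_dim=256, hidden_dim=512):
        super().__init__()
        # Motion encoder: summarizes a sequence of per-frame pose vectors.
        self.motion_encoder = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        # Condition encoder: embeds the guidance signal (e.g., a text or audio feature).
        self.cond_encoder = nn.Sequential(nn.Linear(cond_dim, hidden_dim), nn.ReLU())
        # Fuse motion and condition features into the parameters of a latent Gaussian.
        self.to_mu = nn.Linear(2 * hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(2 * hidden_dim, latent_dim)
        # Motion decoder: maps (latent, condition) back to a pose sequence.
        self.decoder_rnn = nn.GRU(latent_dim + hidden_dim, hidden_dim, batch_first=True)
        self.to_pose = nn.Linear(hidden_dim, pose_dim)

    def forward(self, motion, cond):
        # motion: (batch, frames, pose_dim); cond: (batch, cond_dim)
        _, h = self.motion_encoder(motion)               # (1, batch, hidden_dim)
        c = self.cond_encoder(cond)                      # (batch, hidden_dim)
        fused = torch.cat([h[-1], c], dim=-1)            # fuse motion + condition features
        mu, logvar = self.to_mu(fused), self.to_logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # Broadcast (latent, condition) over time and decode a pose sequence.
        frames = motion.shape[1]
        dec_in = torch.cat([z, c], dim=-1).unsqueeze(1).repeat(1, frames, 1)
        out, _ = self.decoder_rnn(dec_in)
        return self.to_pose(out), mu, logvar             # reconstruction + latent stats


# Usage sketch: 4 sequences of 60 frames, 22 joints x 3 rotation parameters each.
model = ConditionalMotionVAE()
motion = torch.randn(4, 60, 66)
cond = torch.randn(4, 512)    # e.g., a pretrained text/audio embedding
recon, mu, logvar = model(motion, cond)
print(recon.shape)            # torch.Size([4, 60, 66])
```

At inference time, the same decoder can be driven by a latent vector sampled from the prior together with the condition embedding, which is the common pattern across the GAN-, AE-, VAE-, and diffusion-based branches surveyed here.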
In addition, this paper surveys current work on digital human motion analysis and editing, including motion clustering, motion prediction, motion in-betweening, and motion in-filling.

Data-driven human motion generation and evaluation require high-quality datasets. We collected publicly available human motion databases and classified them into various types according to two criteria. From the perspective of data type, existing databases can be classified into motion capture and video reconstruction datasets. Motion capture datasets rely on devices such as motion capture systems, cameras, and inertial measurement units to obtain real human movement data (i.e., ground truth), whereas video reconstruction datasets reconstruct a 3D human body by estimating body joints from motion videos and fitting them to a parametric human body model. From the perspective of task type, commonly used databases can be classified into text-, action-, and audio-motion datasets; new datasets are usually obtained by processing motion capture and video reconstruction datasets for specific tasks. A comprehensive briefing on the evaluation metrics of 3D human motion generation, including motion quality, motion diversity and multimodality, consistency between inputs and outputs, and inference efficiency, is also provided (see the illustrative sketch below). Apart from objective evaluation metrics, user studies employed to assess the quality of generated human motion are also discussed. To compare the performance of various generation methods on public datasets, we selected a collection of the most representative works and carried out extensive experiments for comprehensive evaluation.

Finally, the well-addressed and underexplored issues in this field are summarized, and several potential research directions regarding datasets, the quality and diversity of generated motions, cross-modal information fusion, and generation efficiency are discussed. Specifically, existing datasets generally fail to meet expectations concerning motion diversity, the descriptions associated with motions, data distribution, and the length of motion sequences; future work should consider developing a large-scale 3D human motion database to boost the efficacy and robustness of motion generation models. In addition, the quality of generated human motions, especially those with complex movement patterns, remains unsatisfactory; physical constraints and postprocessing show promise for integration into human motion generation frameworks to tackle these issues. Moreover, although human motion generation methods can produce various motion sequences from multimodal information, such as text, audio, music, actions, and keyframes, work on cross-modal human motion generation (e.g., generating a motion from a text description and a piece of background music) is scarcely reported; investigating such a task is worthwhile, especially for unlocking new opportunities in this area. In terms of the diversity of generated content, some researchers have explored harvesting rich, diverse, and stylized motions using variational autoencoders, diffusion models, and contrastive language-image pretraining networks. However, current studies mainly focus on the motion generation of a single human represented by an SMPL-like naked parameterized 3D model, whereas the generation and interaction of multiple dressed humans have huge untapped application potential and have not received sufficient attention. Finally, another nonnegligible issue is a mechanism for boosting motion generation efficiency and achieving a good balance between quality and inference overhead. Possible solutions include lightweight parameterized human models, information-intensive training datasets, and improved or more advanced generative frameworks.
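As a concrete reference for the quality and diversity metrics summarized above, the sketch below computes the Fréchet inception distance (FID) between real and generated motion-feature distributions and a pairwise diversity score. The feature extractor is abstracted away: the arrays are hypothetical placeholders standing in for features produced by a pretrained motion encoder, and the function names are illustrative.

```python
# Illustrative computation of two common metrics: FID between real and generated
# motion-feature distributions, and diversity as the mean pairwise distance
# between generated motion features. Feature arrays are placeholders; in practice
# they come from a pretrained motion feature extractor.
import numpy as np
from scipy import linalg


def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


def diversity(gen_feats: np.ndarray, num_pairs: int = 100, seed: int = 0) -> float:
    """Mean Euclidean distance between randomly paired generated motion features."""
    rng = np.random.default_rng(seed)
    a = gen_feats[rng.integers(len(gen_feats), size=num_pairs)]
    b = gen_feats[rng.integers(len(gen_feats), size=num_pairs)]
    return float(np.linalg.norm(a - b, axis=1).mean())


# Usage sketch with random placeholder features (1000 samples, 512 dimensions each).
real = np.random.randn(1000, 512)
fake = np.random.randn(1000, 512)
print(f"FID: {fid(real, fake):.3f}, diversity: {diversity(fake):.3f}")
```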
Keywords: 3D avatar; motion generation; multimodal information; parametric human model; generative adversarial network (GAN); autoencoder (AE); variational autoencoder (VAE); diffusion model