Incorporating variational auto-encoder networks for text-driven generation of dynamic 3D human bodies
Objective Artificial intelligence generated content (AIGC) technology can reduce the workload of three-dimensional (3D) modeling when it is used to build virtual 3D scene models from natural language. For static 3D objects, methods already exist that generate high-precision 3D models matching a given textual description. By contrast, for dynamic digital human models, which are also in high demand in numerous scenarios, only two-dimensional (2D) human images or sequences of human poses can currently be generated from a given textual description; dynamic 3D human models cannot be generated from natural language in the same way. Moreover, existing methods suffer from problems such as fixed body shape and motion when generating dynamic digital human models. To address these problems, this paper proposes a method that fuses a variational auto-encoder (VAE), contrastive language-image pretraining (CLIP), and a gated recurrent unit (GRU) to generate dynamic 3D human models whose body shapes and motions match a given textual description.

Method A VAE-based method is proposed to generate dynamic 3D human models that correspond to the body shape and action information described in the text. Notably, the method can generate a variety of pose sequences with variable time durations. First, the body shape information is obtained through a body shape generation module based on the VAE network and the CLIP model, and a skinned multi-person linear (SMPL) parametric human model matching the textual description is generated in a zero-shot manner. Specifically, the VAE network encodes the body shape of the SMPL model, the CLIP model matches textual descriptions against body shapes, and the 3D human model with the highest matching score is selected. Second, variable-length 3D human pose sequences matching the textual description are generated through a body action generation module based on the VAE and GRU networks. In particular, the VAE auto-encoder encodes the dynamic human poses, an action length sampling network predicts a duration that matches the textual description of the action, and the GRU and VAE networks encode the input text and generate diverse dynamic 3D human pose sequences through the decoder. Finally, a dynamic 3D human model corresponding to the described body shape and action is generated by fusing the body shape and action information obtained above. The performance of the method is evaluated on the HumanML3D dataset, which comprises 14,616 motions and 44,970 textual annotations. Some motions in the dataset are mirrored before training, and some words in the motion descriptions are replaced (e.g., "left" is changed to "right") to expand the dataset. In the experiments, the HumanML3D dataset is divided into training, testing, and validation sets in the ratios of 80%, 15%, and 5%, respectively. The experiments are conducted in an Ubuntu 18.04 environment with a Tesla V100 GPU and 16 GB of video memory. The motion auto-encoder is trained for 300 epochs with the adaptive moment estimation (Adam) optimizer, a learning rate of 0.0001, and a batch size of 128; the motion generator is trained for 320 epochs with a learning rate of 0.0002 and a batch size of 32; and the motion length network is trained for 200 epochs with a learning rate of 0.0001 and a batch size of 64.
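To make the zero-shot body-shape matching step concrete, the following is a minimal sketch assuming the OpenAI clip package and the smplx implementation of SMPL. The render_front_view helper is a hypothetical placeholder for a mesh rasterizer (e.g., pyrender), and the candidate shape vectors stand in for samples decoded from the VAE shape space; the sketch only illustrates the CLIP scoring idea described above, not the paper's exact implementation.

```python
import torch
import clip    # OpenAI CLIP package
import smplx   # SMPL parametric body model

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
smpl = smplx.create("models", model_type="smpl", gender="neutral").to(device)

def best_matching_shape(description, candidate_betas):
    """Return the SMPL shape vector whose rendering best matches the text."""
    tokens = clip.tokenize([description]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = []
        for betas in candidate_betas:  # candidate shapes, e.g., decoded from the VAE shape space
            verts = smpl(betas=betas.unsqueeze(0).to(device)).vertices[0]
            image = render_front_view(verts.cpu().numpy(), smpl.faces)  # hypothetical renderer returning a PIL image
            img_feat = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            scores.append((text_feat @ img_feat.T).item())  # cosine similarity between text and rendering
    return candidate_betas[int(torch.tensor(scores).argmax())]
```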
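The action generation module can likewise be pictured as a conditional GRU decoder of the kind sketched below. This is a minimal illustration assuming the 263-dimensional HumanML3D pose representation and illustrative layer sizes; injecting the latent code and text embedding through the initial hidden state is an assumption for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    def __init__(self, pose_dim=263, latent_dim=256, text_dim=512, hidden_dim=512):
        super().__init__()
        self.init_hidden = nn.Linear(latent_dim + text_dim, hidden_dim)
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, text_emb, length):
        """Autoregressively decode `length` pose frames from latent z and the text embedding."""
        batch = z.size(0)
        h = torch.tanh(self.init_hidden(torch.cat([z, text_emb], dim=-1))).unsqueeze(0)
        frame = torch.zeros(batch, 1, self.out.out_features, device=z.device)
        frames = []
        for _ in range(length):          # length comes from the action-length sampling network
            out, h = self.gru(frame, h)
            frame = self.out(out)
            frames.append(frame)
        return torch.cat(frames, dim=1)  # (batch, length, pose_dim)

# Sampling: draw z from the VAE prior, pick a duration, then decode.
decoder = MotionDecoder()
z = torch.randn(1, 256)
text_emb = torch.randn(1, 512)             # e.g., a sentence embedding of the action description
motion = decoder(z, text_emb, length=120)  # 120 frames of 3D poses
```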
Result Dynamic 3D human model generation experiments were conducted on the HumanML3D dataset. Compared with three other state-of-the-art methods and taking the best available results as the baseline, the proposed method improves R-precision by 0.031, 0.034, and 0.028 in the Top1, Top2, and Top3 dimensions, Fréchet inception distance (FID) by 0.094, and diversity by 0.065. The qualitative evaluation was divided into three parts: body shape feature generation, action feature generation, and dynamic 3D human model generation with body shape features. The body shape generation part was tested with different textual descriptions (e.g., tall, short, fat, thin). For the action generation part, the same textual descriptions were used to compare the proposed method with other methods. By combining the body shape features and the action features, the generation of dynamic 3D human models with body shape features is demonstrated. In addition, ablation experiments, including comparisons of different methods with different loss functions, are performed to further demonstrate the effectiveness of the method. The final experimental results show that the proposed method improves the effectiveness of the model.

Conclusion This paper presents a method for generating dynamic 3D human models that conform to textual descriptions by fusing body shape and action information. The body shape generation module can generate SMPL parameterized human models whose body shape conforms to the textual description, while the action generation module can generate variable-length 3D human pose sequences that match the textual description. Experimental results show that the proposed method can effectively generate dynamic 3D human models that conform to textual descriptions, and the generated human models have diverse body shapes and motions. On the HumanML3D dataset, the method outperforms other state-of-the-art algorithms of the same kind.
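For reference, the R-precision metric reported above can be computed as sketched below. The pool size of 32 candidates and the use of Euclidean distance follow the common HumanML3D evaluation protocol and are assumptions here, as are the pre-computed paired text/motion embeddings in a shared space.

```python
import torch

def r_precision(text_emb, motion_emb, top_k=(1, 2, 3), pool_size=32):
    """text_emb, motion_emb: (N, D) paired embeddings; returns Top-k retrieval accuracies."""
    n = text_emb.size(0)
    hits = {k: 0 for k in top_k}
    for i in range(0, n - pool_size + 1, pool_size):
        t = text_emb[i:i + pool_size]    # pool of 32 text queries
        m = motion_emb[i:i + pool_size]  # their paired motions
        dist = torch.cdist(t, m)         # Euclidean distances between all pairs in the pool
        ranks = dist.argsort(dim=1)
        for k in top_k:
            # count queries whose ground-truth motion (the diagonal) is among the k nearest
            hits[k] += (ranks[:, :k] == torch.arange(pool_size).unsqueeze(1)).any(dim=1).sum().item()
    total = (n // pool_size) * pool_size
    return {k: hits[k] / total for k in top_k}
```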
Keywords: human motion synthesis; natural language processing (NLP); deep learning; skinned multi-person linear (SMPL) model; variational auto-encoder network