Incorporating variational auto-encoder networks for text-driven generation of dynamic 3D human bodies
Objective Artificial intelligence generated content (AIGC) technology can reduce the workload of three-dimensional (3D) modeling when it is used to build virtual 3D scene models from natural language. For static 3D objects, methods already exist that generate high-precision 3D models matching a given textual description. By contrast, for dynamic digital human models, which are also in high demand in numerous scenarios, only two-dimensional (2D) human images or sequences of human poses can currently be generated from a given textual description; dynamic 3D human models cannot be generated from natural language in the same way. Moreover, existing methods suffer from problems such as fixed body shape and motion when generating dynamic digital human models. To address these problems, this paper proposes a method that fuses a variational auto-encoder (VAE), contrastive language-image pretraining (CLIP), and a gated recurrent unit (GRU) to generate dynamic 3D human models whose body shapes and motions match a given textual description.

Method A VAE-based method is proposed to generate dynamic 3D human models that correspond to the body shape and action information described in the text. Notably, the method can generate a variety of pose sequences with variable time durations. First, the body shape information is obtained through a body shape generation module based on the VAE network and the CLIP model, and a skinned multi-person linear (SMPL) parametric human model matching the textual description is generated in a zero-shot manner. Specifically, the VAE network encodes the body shape of the SMPL model, the CLIP model matches textual descriptions against body shapes, and the 3D human model with the highest matching score is selected. Second, variable-length 3D human pose sequences matching the textual description are generated through a body action generation module based on the VAE and GRU networks. In particular, the VAE auto-encoder encodes the dynamic human poses, an action length sampling network predicts a duration that matches the textual description of the action, and the GRU and VAE networks encode the input text and generate diverse dynamic 3D human pose sequences through the decoder. Finally, a dynamic 3D human model corresponding to the described body shape and action is generated by fusing the body shape and action information obtained above. The performance of the method is evaluated on the HumanML3D dataset, which comprises 14,616 motions and 44,970 textual annotations. Some motions in the dataset are mirrored before training, and some words in the motion descriptions are replaced (e.g., "left" is changed to "right") to expand the dataset. In the experiments, the HumanML3D dataset is divided into training, testing, and validation sets in the ratios of 80%, 15%, and 5%, respectively. The experiments are conducted in an Ubuntu 18.04 environment with a Tesla V100 GPU and 16 GB of video memory. The motion auto-encoder is trained for 300 epochs with the adaptive moment estimation (Adam) optimizer, a learning rate of 0.0001, and a batch size of 128; the motion generator is trained for 320 epochs with a learning rate of 0.0002 and a batch size of 32; and the motion length network is trained for 200 epochs with a learning rate of 0.0001 and a batch size of 64.
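To make the zero-shot body-shape matching step concrete, the following is a minimal sketch assuming the OpenAI clip package and the smplx implementation of SMPL. The render_front_view helper is a hypothetical placeholder for a mesh rasterizer (e.g., pyrender), and the candidate shape vectors stand in for samples decoded from the VAE shape space; the sketch only illustrates the CLIP scoring idea described above, not the paper's exact implementation.

```python
import torch
import clip    # OpenAI CLIP package
import smplx   # SMPL parametric body model

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
smpl = smplx.create("models", model_type="smpl", gender="neutral").to(device)

def best_matching_shape(description, candidate_betas):
    """Return the SMPL shape vector whose rendering best matches the text."""
    tokens = clip.tokenize([description]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        scores = []
        for betas in candidate_betas:  # candidate shapes, e.g., decoded from the VAE shape space
            verts = smpl(betas=betas.unsqueeze(0).to(device)).vertices[0]
            image = render_front_view(verts.cpu().numpy(), smpl.faces)  # hypothetical renderer returning a PIL image
            img_feat = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            scores.append((text_feat @ img_feat.T).item())  # cosine similarity between text and rendering
    return candidate_betas[int(torch.tensor(scores).argmax())]
```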
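The action generation module can likewise be pictured as a conditional GRU decoder of the kind sketched below. This is a minimal illustration assuming the 263-dimensional HumanML3D pose representation and illustrative layer sizes; injecting the latent code and text embedding through the initial hidden state is an assumption for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MotionDecoder(nn.Module):
    def __init__(self, pose_dim=263, latent_dim=256, text_dim=512, hidden_dim=512):
        super().__init__()
        self.init_hidden = nn.Linear(latent_dim + text_dim, hidden_dim)
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, text_emb, length):
        """Autoregressively decode `length` pose frames from latent z and the text embedding."""
        batch = z.size(0)
        h = torch.tanh(self.init_hidden(torch.cat([z, text_emb], dim=-1))).unsqueeze(0)
        frame = torch.zeros(batch, 1, self.out.out_features, device=z.device)
        frames = []
        for _ in range(length):          # length comes from the action-length sampling network
            out, h = self.gru(frame, h)
            frame = self.out(out)
            frames.append(frame)
        return torch.cat(frames, dim=1)  # (batch, length, pose_dim)

# Sampling: draw z from the VAE prior, pick a duration, then decode.
decoder = MotionDecoder()
z = torch.randn(1, 256)
text_emb = torch.randn(1, 512)             # e.g., a sentence embedding of the action description
motion = decoder(z, text_emb, length=120)  # 120 frames of 3D poses
```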
Result Dynamic 3D human model generation experiments were conducted on the HumanML3D dataset. Compared with three other state-of-the-art methods and taking the best available results as the baseline, the proposed method improves R-precision by 0.031, 0.034, and 0.028 in the Top1, Top2, and Top3 dimensions, Fréchet inception distance (FID) by 0.094, and diversity by 0.065. The qualitative evaluation was divided into three parts: body shape feature generation, action feature generation, and dynamic 3D human model generation with body shape features. The body shape generation part was tested with different textual descriptions (e.g., tall, short, fat, thin). For the action generation part, the same textual descriptions were used to compare the proposed method with other methods. By combining the body shape features and the action features, the generation of dynamic 3D human models with body shape features is demonstrated. In addition, ablation experiments, including comparisons of different methods with different loss functions, are performed to further demonstrate the effectiveness of the method. The final experimental results show that the proposed method improves the effectiveness of the model.

Conclusion This paper presents a method for generating dynamic 3D human models that conform to textual descriptions by fusing body shape and action information. The body shape generation module can generate SMPL parameterized human models whose body shape conforms to the textual description, while the action generation module can generate variable-length 3D human pose sequences that match the textual description. Experimental results show that the proposed method can effectively generate dynamic 3D human models that conform to textual descriptions, and the generated human models have diverse body shapes and motions. On the HumanML3D dataset, the method outperforms other state-of-the-art algorithms of the same kind.
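For reference, the R-precision metric reported above can be computed as sketched below. The pool size of 32 candidates and the use of Euclidean distance follow the common HumanML3D evaluation protocol and are assumptions here, as are the pre-computed paired text/motion embeddings in a shared space.

```python
import torch

def r_precision(text_emb, motion_emb, top_k=(1, 2, 3), pool_size=32):
    """text_emb, motion_emb: (N, D) paired embeddings; returns Top-k retrieval accuracies."""
    n = text_emb.size(0)
    hits = {k: 0 for k in top_k}
    for i in range(0, n - pool_size + 1, pool_size):
        t = text_emb[i:i + pool_size]    # pool of 32 text queries
        m = motion_emb[i:i + pool_size]  # their paired motions
        dist = torch.cdist(t, m)         # Euclidean distances between all pairs in the pool
        ranks = dist.argsort(dim=1)
        for k in top_k:
            # count queries whose ground-truth motion (the diagonal) is among the k nearest
            hits[k] += (ranks[:, :k] == torch.arange(pool_size).unsqueeze(1)).any(dim=1).sum().item()
    total = (n // pool_size) * pool_size
    return {k: hits[k] / total for k in top_k}
```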
Keywords: human motion synthesis; natural language processing (NLP); deep learning; skinned multi-person linear (SMPL) model; variational auto-encoder network