Single dress image video synthesis based on pose embedding and multi-scale attention

Video generation from a single dressed-person image has important applications in fields such as virtual try-on and 3-D reconstruction, but existing methods suffer from incoherent motion between generated frames, poor video quality, and missing clothing details. To address these problems, a generative adversarial network model based on a pose embedding mechanism and multi-scale attention links is proposed. First, a pose embedding method is adopted to model the motion between adjacent frames; then attention links are added to the features at each resolution scale, and human parsing images are fed in as additional input during training; finally, the results are validated on the test split of a clothing video synthesis dataset. The results show that the proposed model improves on current mainstream models for video generation from a single dressed image in both qualitative and quantitative terms, with a peak signal-to-noise ratio of 20.89 and a motion vector metric of 0.1084, indicating that the model effectively improves the quality of the generated videos and the stability of inter-frame motion and provides a new model for dressed-person video synthesis.
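The abstract describes the pose embedding step only at a high level. As a rough illustration of modeling motion between adjacent frames, the following minimal PyTorch sketch jointly embeds the keypoints of frames t-1 and t; the joint count, layer sizes, and plain-MLP form are illustrative assumptions, not the paper's published EBD architecture.

import torch
import torch.nn as nn

class PoseEmbedding(nn.Module):
    """Illustrative sketch of an adjacent-frame pose embedding (hypothetical)."""

    def __init__(self, num_joints: int = 18, dim: int = 256):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(num_joints * 2 * 2, dim),  # (x, y) for frames t-1 and t
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, pose_prev: torch.Tensor, pose_curr: torch.Tensor) -> torch.Tensor:
        # pose_prev, pose_curr: (batch, num_joints, 2) keypoint coordinates
        x = torch.cat([pose_prev, pose_curr], dim=1).flatten(1)
        return self.embed(x)  # (batch, dim) motion-aware pose code

Because this code is conditioned on two consecutive poses rather than one, it carries inter-frame motion information that a generator can use to keep adjacent output frames coherent.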
Objective Video generation from a single dressed-person image has important applications in virtual try-on and 3-D reconstruction. However, existing methods suffer from incoherent motion between generated frames, poor video quality, and missing clothing details. To address these issues, a generative adversarial network model based on a pose embedding mechanism and multi-scale attention links is proposed.

Method A generative adversarial network model (EBDGAN) based on a pose embedding mechanism and multi-scale attention was constructed. A pose embedding method was adopted to model the motion between adjacent frames and improve the coherence of the generated actions, and attention links were added to the features at each resolution scale to improve feature decoding efficiency and the fidelity of the generated frames. Human parsing images were used during training to improve the clothing accuracy of the synthesized images.

Results The learned perceptual image patch similarity (LPIPS) and peak signal-to-noise ratio (PSNR) values indicated that the results generated by EBDGAN were closer to the original videos in color and structure. The motion vector (MV) metric showed that the videos EBDGAN generated from a single image exhibited less motion between adjacent frames and higher inter-frame similarity, making the videos more stable overall. Although the structural similarity index metric (SSIM) score was slightly lower than that of CASD, the proposed method is more efficient because it requires only image and pose information as input. In frames where the subject was far from the camera, EBDGAN retained details of the hair and shoes; in frames where the subject was closer to the camera, the front clothing image retained details such as the collar and hem, for example the collar in the left image of the second row and the hem of the garment on the right. When the subject in the video turned around, EBDGAN did not produce strange poses or missing body parts, but instead generated a more reasonable body shape. The ablation experiments showed that the complete model can efficiently exploit the pose information and the features of the input image to guide video generation, and that removing any network component degrades model performance. The results of EBDGAN-1 indicated that the multi-scale attention links help the network generate images with a more reasonable distribution. The MV of EBDGAN-2 suggested that adding the pose embedding module reduces the relative motion between adjacent frames, yielding higher video stability.

Conclusion A method for generating videos from a single image based on a pose embedding mechanism and multi-scale attention links is proposed. The method uses the pose embedding module (EBD) to model the poses of adjacent frames in the time series, reducing the number of parameters while ensuring coherent motion between adjacent frames. The multi-scale attention links improve the efficiency of feature extraction and thus the quality of video generation, and using human parsing images as auxiliary input enhances the representation of the subject's clothing. The proposed method was experimentally validated on a public dataset, achieving SSIM 0.855, LPIPS 0.162, PSNR 20.89, and MV 0.1084. The ablation experiments prove that the proposed modules help the network achieve better performance on the video generation task, and comparative experiments show that the proposed method generates more stable videos with more realistic details of the subject.
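The Method section names multi-scale attention links between features at each resolution scale but does not spell out their form here. A minimal sketch of what one such link could look like at a single scale is given below; the channel-gating (squeeze-and-excitation style) attention and the fusion convolution are assumptions for illustration, and one such link would be instantiated per resolution.

import torch
import torch.nn as nn

class AttentionLink(nn.Module):
    """Illustrative sketch of one attention link at a single scale (hypothetical)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                # global context per channel
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),                           # channel weights in (0, 1)
        )
        self.fuse = nn.Conv2d(channels * 2, channels, 3, padding=1)

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # enc_feat, dec_feat: (batch, channels, H, W) features at the same scale
        attended = enc_feat * self.gate(enc_feat)   # reweight the encoder feature
        return self.fuse(torch.cat([attended, dec_feat], dim=1))

Gating the encoder feature before fusion lets each decoder scale emphasize the channels most useful for reconstruction, which is one plausible way such links could improve decoding efficiency and frame fidelity.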

generative adversarial network; video synthesis; deep learning; pose embedding; attention mechanism; dressing image; virtual try-on

陆寅雯、侯珏、杨阳、顾冰菲、张宏伟、刘正

School of Fashion Design & Engineering, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China

Zhejiang Provincial Research Center of Clothing Engineering Technology, Hangzhou, Zhejiang 310018, China

Key Laboratory of Silk Culture Inheriting and Products Design Digital Technology, Ministry of Culture and Tourism, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China

School of Electronics and Information, Xi'an Polytechnic University, Xi'an, Shaanxi 710043, China

School of International Education, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China

National Natural Science Foundation of China Youth Fund (61803292); Zhejiang Provincial Science and Technology Plan Project (2023C03181); Zhejiang Sci-Tech University Research Start-up Fund (21072325-Y)

Journal of Textile Research (纺织学报)
Publisher: China Textile Engineering Society
Indexed in: CSTPCD; PKU Core Journals (北大核心)
Impact factor: 0.699
ISSN: 0253-9721
Year, volume (issue): 2024, 45(7)