Single dress image video synthesis based on pose embedding and multi-scale attention
Objective Video generation from a single dress image has important applications in virtual try-on and 3-D reconstruction. However, existing methods suffer from incoherent motion between generated frames, poor quality of the generated videos, and missing clothing details. To address these issues, a generative adversarial network model based on a pose embedding mechanism and multi-scale attention links is proposed.

Method The proposed generative adversarial network, EBDGAN, adopts a pose embedding method to model the motion of adjacent frames, improving the coherence of the generated actions, and adds an attention link at each resolution scale of the features to improve feature decoding efficiency and the fidelity of the generated frames. Human parsing images were used during training to improve the clothing accuracy of the synthesized images.

Results The learned perceptual image patch similarity (LPIPS) and peak signal-to-noise ratio (PSNR) values indicated that the results generated by EBDGAN were closer to the original video in color and structure. The motion vector (MV) metric showed that the video EBDGAN generates from a single image moves less between adjacent frames and has higher similarity between consecutive frames, resulting in a more stable overall video. Although its structural similarity index measure (SSIM) score was slightly lower than that of CASD, the proposed method is more efficient because it requires only image and pose information as input. In frames where the characters were far from the camera, EBDGAN retained the details of hair and shoes; in frames where the characters were close to the camera, it retained clothing details such as the collar in the left image of the second row and the hem of the garment on the right. When the characters in the video turned around, EBDGAN did not produce strange poses or missing body parts but instead generated a more plausible body shape. The ablation experiments showed that the complete model can efficiently exploit the pose information and the features of the input image to guide video generation, and that removing any network component degrades performance. The results of EBDGAN-1 indicated that the multi-scale attention links help the network generate images with a more reasonable distribution, and the MV of EBDGAN-2 showed that adding the pose embedding module reduces the relative movement between adjacent frames, yielding higher video stability.

Conclusion This paper proposes a method for generating videos from a single image based on a pose embedding mechanism and multi-scale attention links. The pose embedding module EBD models the poses of adjacent frames in the time series, reducing the number of parameters while keeping the motion between adjacent frames coherent. The multi-scale attention links improve the efficiency of feature extraction, further improving the quality of the generated video, and using human parsing images as auxiliary input enhances the expressiveness of the characters' clothing. The proposed method was validated on a public dataset, achieving an SSIM of 0.855, an LPIPS of 0.162, a PSNR of 20.89, and an MV of 0.1084. The ablation experiments prove that the proposed modules help the network achieve better performance on the video generation task, and comparative experiments show that the proposed method generates more stable videos with more realistic character details.
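The abstract does not detail the internal structure of the pose embedding module EBD. As a minimal sketch of the general idea, assuming the poses of two adjacent frames are available as 2-D keypoint coordinates, a small MLP can embed the pair into a single conditioning vector for the generator; the class name, joint count, and embedding width below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PoseEmbedding(nn.Module):
    # Embeds the 2-D keypoints of two adjacent frames into one
    # conditioning vector. Hypothetical sketch: the joint count and
    # embedding width are assumptions, not the paper's values.
    def __init__(self, n_joints: int = 18, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_joints * 2, dim),  # two frames x (x, y) per joint
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, pose_prev: torch.Tensor, pose_next: torch.Tensor) -> torch.Tensor:
        # pose_prev, pose_next: (batch, n_joints, 2) keypoint coordinates
        x = torch.cat([pose_prev, pose_next], dim=1).flatten(1)
        return self.mlp(x)

if __name__ == "__main__":
    emb = PoseEmbedding()
    p0, p1 = torch.randn(4, 18, 2), torch.randn(4, 18, 2)
    print(emb(p0, p1).shape)  # torch.Size([4, 128])
```

Conditioning on a compact embedding of the adjacent-frame pose pair, rather than on full pose heatmaps per frame, is one way such a module could reduce parameters while still tying consecutive frames together.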
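Likewise, "attention links for each resolution scale" suggests attention-gated skip connections between encoder and decoder features. The sketch below shows one plausible form, a per-pixel gate computed from both streams; the exact gating layout used by EBDGAN is not specified in the abstract, so all names and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class AttentionLink(nn.Module):
    # Attention-gated skip connection between one encoder scale and the
    # matching decoder scale (hypothetical sketch, not the paper's design).
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution + sigmoid yields a per-pixel gate in [0, 1]
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # The gate is conditioned on both streams, then applied to the
        # encoder features before they are merged into the decoder path.
        a = self.gate(torch.cat([enc_feat, dec_feat], dim=1))
        return dec_feat + a * enc_feat

if __name__ == "__main__":
    # One link per resolution scale, e.g. 64x64, 32x32, and 16x16 feature maps.
    for c, s in ((64, 64), (128, 32), (256, 16)):
        link = AttentionLink(c)
        out = link(torch.randn(1, c, s, s), torch.randn(1, c, s, s))
        print(out.shape)  # same shape as the decoder features
```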
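The MV metric reported in the Results measures how much the generated video moves between adjacent frames, with lower values indicating a steadier video. The abstract does not state the exact estimator, so the sketch below assumes dense Farneback optical flow and averages the per-pixel flow magnitude over all adjacent frame pairs.

```python
import cv2
import numpy as np

def mean_motion(frames: list) -> float:
    # Average optical-flow magnitude between adjacent frames: an
    # illustrative stand-in for the MV metric, assuming dense Farneback
    # flow (the paper's exact estimator is not given in the abstract).
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    mags = []
    for prev, nxt in zip(grays, grays[1:]):
        # Positional args: pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # flow[..., 0] / flow[..., 1] are per-pixel x / y displacements
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))
```

Under this reading, the reported MV of 0.1084 would correspond to a small average displacement between consecutive frames, consistent with the claim of higher inter-frame similarity and a more stable video.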