Speaker video generation requires precise joint modeling of facial texture and the driving audio. To this end, semantic-guided deformation of texture features is studied, and a sketch-guided few-shot speaker video generation framework is proposed that uses a two-stage generation technique for modality alignment. In the first stage, prior information from real facial landmarks is used to generate the target facial landmarks from the audio. In the second stage, the facial landmarks are converted into sketches, which serve as intermediate representations for semantic alignment with the reference images. Introducing sketches effectively addresses the modality mismatch between audio and images. In experiments, the algorithm achieves FID scores of 15.676 and 8.618 on the public HDTF and MEAD datasets, respectively. Through these intermediate representations, the proposed algorithm effectively models facial texture driven by the target audio and achieves generation performance comparable to state-of-the-art algorithms.
Keywords
high-fidelity generation/talking face generation/landmark generation/multimodal learning/lip synchronization
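The second stage described in the abstract turns facial landmarks into a sketch image. The abstract does not say how the sketch is rendered, so the following is only a minimal sketch under stated assumptions: landmarks are given as (x, y) pixel coordinates, each facial part is drawn as a polyline through an ordered list of landmark indices, and the output is a single-channel binary image. The names `draw_segment`, `landmarks_to_sketch`, and `groups` are hypothetical and are not taken from the paper.

```python
import numpy as np


def draw_segment(canvas, p0, p1):
    """Rasterize one line segment by sampling points along it (illustrative only)."""
    h, w = canvas.shape
    # One sample per pixel along the longer axis, endpoints included.
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    xs = np.rint(np.linspace(p0[0], p1[0], n)).astype(int)
    ys = np.rint(np.linspace(p0[1], p1[1], n)).astype(int)
    # Drop samples that fall outside the canvas.
    keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    canvas[ys[keep], xs[keep]] = 255


def landmarks_to_sketch(landmarks, groups, size=(256, 256)):
    """Connect landmarks within each group into polylines on a blank canvas.

    landmarks: array of shape (N, 2) holding (x, y) pixel coordinates.
    groups:    list of index lists, one per facial part (an assumed layout).
    size:      (height, width) of the output sketch.
    """
    canvas = np.zeros(size, dtype=np.uint8)
    for idx in groups:
        for a, b in zip(idx[:-1], idx[1:]):
            draw_segment(canvas, landmarks[a], landmarks[b])
    return canvas


# Toy usage: three landmarks joined into an L-shaped polyline.
toy_landmarks = np.array([[10, 10], [20, 10], [20, 20]])
toy_sketch = landmarks_to_sketch(toy_landmarks, groups=[[0, 1, 2]])
```

In a pipeline like the one outlined in the abstract, an image of this kind would be the intermediate representation that is aligned with the reference images. The grouping of landmarks into facial parts depends on the landmark detector used, which the abstract does not specify.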