Research on a Sketch-Guided Few-Shot Speaker Video Generation Algorithm
Speaker video generation requires precise joint modeling of facial texture and the driving audio. To this end, semantic-guided deformation of texture features is studied, and a sketch-guided few-shot speaker video generation framework is proposed that uses a two-stage generation technique for modality alignment. In the first stage, prior information from real facial landmarks is used to generate the target facial landmarks from the audio. In the second stage, the facial landmarks are converted into sketches, which serve as intermediate representations for semantic alignment with the reference images. Introducing sketches effectively addresses the modality mismatch between audio and images. In experiments, the algorithm achieves FID scores of 15.676 and 8.618 on the public HDTF and MEAD datasets, respectively. The proposed algorithm effectively models facial texture driven by the target audio through intermediate representations, achieving generation performance comparable to state-of-the-art algorithms.
high-fidelity generation; talking face generation; landmark generation; multimodal learning; lip synchronization
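
The step in which facial landmarks are converted into sketches can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `landmarks_to_sketch`, the grouping of landmarks into contours, the normalized coordinate convention, and the 64x64 image size are all assumptions made for illustration. The idea shown is only that landmarks predicted in the first stage can be rasterized into a line drawing that lives in the image domain, so that it can be aligned with a reference image.

```python
import numpy as np

def landmarks_to_sketch(landmarks, groups, size=64):
    """Rasterize normalized (x, y) landmarks into a binary line sketch.

    landmarks: array of shape (N, 2) with coordinates in [0, 1] (assumed convention).
    groups:    lists of landmark indices; consecutive indices are joined by a line
               (hypothetical contour grouping, e.g. one list per facial part).
    """
    sketch = np.zeros((size, size), dtype=np.uint8)
    pts = np.clip(np.asarray(landmarks, dtype=float), 0.0, 1.0) * (size - 1)
    for group in groups:
        for a, b in zip(group[:-1], group[1:]):
            # Sample enough points along the segment to leave no pixel gaps.
            n = int(np.max(np.abs(pts[b] - pts[a]))) + 2
            xs = np.linspace(pts[a, 0], pts[b, 0], n).round().astype(int)
            ys = np.linspace(pts[a, 1], pts[b, 1], n).round().astype(int)
            sketch[ys, xs] = 255
    return sketch

# Toy example: four landmarks forming a closed mouth-like contour.
mouth = np.array([[0.3, 0.6], [0.5, 0.55], [0.7, 0.6], [0.5, 0.7]])
sketch = landmarks_to_sketch(mouth, groups=[[0, 1, 2, 3, 0]])
```

In a full pipeline, the landmark array would come from the audio-driven first stage rather than being written by hand, and the resulting sketch would be passed to the second-stage generator together with the reference images.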