Research on Few-Shot Speaker Video Generation Algorithm Guided by Sketches

Wei Qingyang¹, Xu Shugong¹

Author information

1. School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Abstract

Speaker (talking face) video generation requires precise joint modeling of facial texture and the driving audio. To this end, this work studies semantically guided deformation of texture features and proposes a sketch-guided few-shot speaker video generation framework that uses a two-stage generation scheme for modality alignment. In the first stage, real prior facial landmark information is used to generate the target facial landmarks from the audio. In the second stage, the landmarks are converted into sketches, which serve as an intermediate representation that is semantically aligned with the reference images. Introducing sketches effectively addresses the modality mismatch between audio and images. In experiments, the algorithm achieves FID scores of 15.676 and 8.618 on the public HDTF and MEAD datasets, respectively. These results indicate that the proposed algorithm can effectively model facial texture driven by the target audio through the intermediate representation, achieving generation quality comparable to state-of-the-art algorithms.
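
As a rough illustration of the second-stage input described in the abstract, the following minimal sketch rasterizes a set of facial landmarks into a binary sketch image by connecting consecutive points with line segments. It is not the authors' implementation: the function names `draw_polyline` and `landmarks_to_sketch`, the `groups` format, and the canvas size are all assumptions made for this example, and the abstract does not specify how the sketch is actually drawn.

```python
import numpy as np

def draw_polyline(canvas, points, closed=False):
    """Rasterize straight segments between consecutive (x, y) points.

    Hypothetical helper: the paper's actual drawing routine is not given.
    """
    pts = np.asarray(points, dtype=float)
    if closed:
        # Append the first point so the contour is closed.
        pts = np.vstack([pts, pts[:1]])
    h, w = canvas.shape
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        # One sample per pixel step along the longer axis avoids gaps.
        n = int(np.ceil(max(abs(x1 - x0), abs(y1 - y0)))) + 1
        xs = np.rint(np.linspace(x0, x1, n)).astype(int)
        ys = np.rint(np.linspace(y0, y1, n)).astype(int)
        # Drop samples that fall outside the canvas.
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        canvas[ys[inside], xs[inside]] = 255
    return canvas

def landmarks_to_sketch(landmarks, groups, size=(32, 32)):
    """Turn landmarks of shape (N, 2) into a single-channel sketch.

    `groups` is an assumed format: a list of (indices, closed) pairs,
    one per facial part (for example a closed lip contour).
    """
    landmarks = np.asarray(landmarks, dtype=float)
    canvas = np.zeros(size, dtype=np.uint8)
    for indices, closed in groups:
        draw_polyline(canvas, landmarks[indices], closed)
    return canvas

# Toy example: four points forming one closed contour.
toy_landmarks = [(10, 10), (20, 10), (20, 20), (10, 20)]
toy_groups = [([0, 1, 2, 3], True)]
sketch = landmarks_to_sketch(toy_landmarks, toy_groups)
```

In a pipeline of the kind the abstract outlines, an image like `sketch` would be passed to the second stage together with the reference images. The landmark grouping and line style used in the paper may differ from this toy version.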

Keywords

high-fidelity generation / talking face generation / landmark generation / multimodal learning / lip synchronization

Funding

National Natural Science Foundation of China (61871262)

Publication year

2024

Computer Measurement & Control (计算机测量与控制)

China Computer Automatic Measurement and Control Technology Association

CSTPCD

Impact factor: 0.546

ISSN: 1671-4598