计算机辅助设计与图形学学报 (Journal of Computer-Aided Design & Computer Graphics) 2024, Vol. 36, Issue 10: 1616-1624. DOI: 10.3724/SP.J.1089.2024.20026

Speech-Driven Facial Reenactment Based on Implicit Neural Representations with Structured Latent Codes


谢志峰 1, 郑迦恒 2, 王吉 2, 梁佳佳 2, 马利庄 3

Author information

  • 1. Shanghai Film Academy, Shanghai University, Shanghai 200072; Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072
  • 2. Shanghai Film Academy, Shanghai University, Shanghai 200072
  • 3. Shanghai Film Special Effects Engineering Technology Research Center, Shanghai 200072; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240

Abstract

Speech-driven facial reenactment aims to generate high-fidelity facial animation that matches the content of the input speech. However, existing methods struggle to achieve high-quality reenactment because of the gap between the audio and video modalities. To address the low fidelity and poor lip synchronization of existing methods, we propose a speech-driven facial reenactment method based on implicit neural representations with structured latent codes. It takes a facial point cloud sequence as the intermediate representation, decomposing speech-driven facial reenactment into two tasks solved separately: cross-modal mapping and neural radiance field rendering. First, facial expression coefficients are predicted from audio through cross-modal mapping, and facial identity coefficients are obtained by 3D face reconstruction. Then, a facial point cloud animation sequence is synthesized based on the 3DMM model. Next, the vertex positions are used to construct a structured implicit neural representation that regresses the density and color of each sampled point in the scene. Finally, RGB frames of the face are rendered via volume rendering and composited back into the original images. Experimental results on several 3-5 min single-speaker videos, including visual comparison, quantitative evaluation, and subjective assessment, demonstrate that the proposed method outperforms state-of-the-art methods such as AD-NeRF in lip-sync accuracy and image generation quality, achieving high-fidelity speech-driven facial reenactment.
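Two steps of the pipeline described in the abstract have well-known standard forms that can be sketched in code: linear 3DMM point-cloud synthesis (vertices = mean shape + identity basis × identity coefficients + expression basis × expression coefficients) and NeRF-style volume rendering along a camera ray. The sketch below uses these textbook formulations with hypothetical basis sizes and random stand-in coefficients; it is not the paper's actual model, cross-modal mapping network, or structured latent code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (the paper's actual 3DMM basis sizes are not given).
N_VERTS, N_ID, N_EXP = 500, 80, 64

mean_shape = rng.standard_normal((N_VERTS, 3))        # mean face geometry
id_basis = rng.standard_normal((N_VERTS, 3, N_ID))    # identity blendshape basis
exp_basis = rng.standard_normal((N_VERTS, 3, N_EXP))  # expression blendshape basis

def synthesize_point_cloud(id_coeff, exp_coeff):
    """Linear 3DMM: S = S_mean + B_id @ alpha_id + B_exp @ alpha_exp."""
    return mean_shape + id_basis @ id_coeff + exp_basis @ exp_coeff

# Identity coefficients come from 3D face reconstruction (fixed per subject);
# expression coefficients are predicted from audio per frame.
# Random stand-ins here, since neither network is part of this sketch.
id_coeff = 0.1 * rng.standard_normal(N_ID)
exp_coeff = 0.1 * rng.standard_normal(N_EXP)
verts = synthesize_point_cloud(id_coeff, exp_coeff)   # (N_VERTS, 3), one frame

def composite_ray(sigmas, colors, deltas):
    """Standard NeRF quadrature: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)            # per-sample opacity
    # T_i: transmittance accumulated before sample i (T_1 = 1).
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    return weights @ colors                            # (3,) RGB for this pixel

# Densities and colors would be regressed by the conditioned radiance field;
# random stand-ins on 64 uniformly spaced samples along one ray.
n_samples = 64
sigmas = np.abs(rng.standard_normal(n_samples))
colors = rng.random((n_samples, 3))
deltas = np.full(n_samples, 1.0 / n_samples)
rgb = composite_ray(sigmas, colors, deltas)
```

In the full method, `composite_ray` would run for every pixel of the face region, and the resulting RGB frame would then be composited back into the original video frame.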

Key words

audio-driven facial reenactment / implicit neural representations / neural radiance field (NeRF) / cross-modal


Publication year: 2024
Journal: 计算机辅助设计与图形学学报 (Journal of Computer-Aided Design & Computer Graphics)
Sponsor: China Computer Federation (中国计算机学会)
Indexed in: CSTPCD, CSCD, Peking University Core (北大核心)
Impact factor: 0.892
ISSN: 1003-9775