Speech-Driven Facial Reenactment Based on Implicit Neural Representations with Structured Latent Codes
Speech-driven facial reenactment aims to generate high-fidelity facial animation that matches the content of the input speech. However, existing methods can hardly achieve high-quality facial reenactment because of the gap between the audio and video modalities. To address the shortcomings of existing methods, such as low fidelity and poor lip synchronization, we propose a speech-driven facial reenactment method based on implicit neural representations with structured latent codes. Our method takes the facial point cloud sequence as an intermediate representation, decomposing speech-driven facial reenactment into two tasks: cross-modal mapping and neural radiance field rendering. First, we predict facial expression coefficients through cross-modal mapping and obtain facial identity coefficients by 3D face reconstruction; then, we synthesize the face point cloud sequence based on a 3DMM; next, we use the vertex positions to construct the structured implicit neural representation and regress density and color for each sampling point; finally, we render RGB frames of the face through volume rendering and composite them into the original image. Experimental results on multiple 3-5 minute single-speaker videos, including visual comparison, quantitative evaluation, and subjective assessment, demonstrate that our method outperforms state-of-the-art methods such as AD-NeRF in terms of lip-sync accuracy and image generation quality, achieving high-fidelity speech-driven facial reenactment.
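The final rendering step regresses a density and a color for each sampling point and composites them along each camera ray. As a rough illustration of that step only, the following is a minimal NumPy sketch of standard NeRF-style volume rendering for a single ray; the function name, the toy sample values, and the uniform spacing are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Composite per-sample (density, color) pairs along one ray into an RGB value.

    Standard NeRF-style volume rendering (a sketch, not the paper's code):
      alpha_i = 1 - exp(-sigma_i * delta_i)
      T_i     = prod_{j < i} (1 - alpha_j)      (transmittance)
      rgb     = sum_i T_i * alpha_i * c_i
    """
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans
    return (weights[:, None] * colors).sum(axis=0)

# Toy ray: 4 samples at uniform 0.1 spacing (values are made up).
densities = np.array([0.0, 5.0, 50.0, 0.1])
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
rgb = volume_render(densities, colors, deltas=np.full(4, 0.1))
```

Because the third sample is far denser than the others, it absorbs most of the remaining transmittance, so the composited color is dominated by the green and blue samples rather than the nearly transparent first and last ones.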