Journal of Chinese Computer Systems (小型微型计算机系统), 2025, Vol. 46, Issue 6: 1400-1408. DOI: 10.20009/j.cnki.21-1106/TP.2024-0202

Speech-driven 3D Face Animation Using Linear Attention Mechanism

Tong Chengkai (童程凯) 1, Ye Yang (叶阳) 1

Author information

  • 1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Abstract

Speech-driven 3D face animation aims to drive a 3D face model with input speech so that it generates facial expression animation that visually matches the audio. Current methods commonly build on the Transformer architecture and generate the animation autoregressively. However, the quadratic computational complexity of these methods becomes a performance bottleneck on long speech inputs, and their tendency to overfit on sparse datasets limits the accuracy and generalization of the generated animation. To address these problems, this paper proposes a speech-driven 3D face animation method based on linear attention. The method adopts a new end-to-end network model: an encoder built on self-supervised speech representation learning extracts speech features, and a facial expression mapping decoder built on RWKV, a linear attention variant, generates the face animation. Experimental results show that the proposed method outperforms current related methods in both the accuracy and the speed of facial expression generation: under standardized conditions, the mean vertex error of the 3D face mesh is 0.15 mm lower than that of the state-of-the-art (SOTA) method, and single-frame face prediction is about 4 times faster than with conventional Transformer-based methods.
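
To illustrate why a linear attention decoder avoids the quadratic cost mentioned in the abstract, the following is a minimal sketch of a per-channel WKV recurrence in the style of RWKV-4. It is not the authors' model: the abstract does not state which RWKV variant is used, and the function names, the scalar single-channel setup, the decay `w`, and the bonus `u` are all assumptions chosen for illustration. Numerical stabilisation of the exponentials is omitted.

```python
import math

def wkv_quadratic(k, v, w, u):
    """Direct evaluation: frame t re-sums every earlier frame (O(T^2) overall)."""
    out = []
    for t in range(len(k)):
        # The current frame gets the bonus u instead of a decay.
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            # Earlier frames are down-weighted by an exponential decay in time.
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out

def wkv_recurrent(k, v, w, u):
    """Same values from a running state (a, b): constant work per frame (O(T) overall)."""
    a = 0.0  # decayed sum of exp(k_i) * v_i over past frames
    b = 0.0  # decayed sum of exp(k_i) over past frames
    decay = math.exp(-w)
    out = []
    for k_t, v_t in zip(k, v):
        bonus = math.exp(u + k_t)
        out.append((a + bonus * v_t) / (b + bonus))
        # Fold the current frame into the state for the next step.
        a = decay * a + math.exp(k_t) * v_t
        b = decay * b + math.exp(k_t)
    return out

# Hypothetical toy inputs: one channel, four frames.
keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5]
slow = wkv_quadratic(keys, values, w=0.5, u=0.2)
fast = wkv_recurrent(keys, values, w=0.5, u=0.2)
```

Both functions compute the same weighted average, but the recurrent form only updates two running sums per frame, so the cost of predicting one more frame does not grow with the length of the speech input. A standard autoregressive Transformer decoder, in contrast, attends over all previous frames at every step.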

Key words

speech-driven/self-supervised/linear attention/face animation

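Publication year

2025
Journal of Chinese Computer Systems (小型微型计算机系统)
Shenyang Institute of Computing Technology, Chinese Academy of Sciences

Impact factor: 0.564
ISSN: 1000-1220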