An End-to-End Singing Voice Synthesis Method with Excitation and Vibrato Modeling

ZHOU Xiao¹, HU Yajun², PAN Jia², HU Guoping², LING Zhenhua³

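Author information

1. iFLYTEK Co., Ltd., Hefei 230088; School of Information Science and Technology, University of Science and Technology of China, Hefei 230026
2. iFLYTEK Co., Ltd., Hefei 230088
3. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026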

Abstract

In recent years, singing voice synthesis has developed rapidly, and VISinger, an end-to-end singing voice synthesis model based on variational inference and normalizing flows, has become a mainstream approach. However, its output still falls short of a real singer, mainly in three respects: perceptually discontinuous pitch, poor vibrato synthesis, and unstable articulation. We propose three targeted improvements. First, to address fundamental frequency (F0) stability, we add an excitation module to the decoder that explicitly supplies F0 information to the decoder in the form of an excitation signal. Second, to address unnatural vibrato, we add a vibrato prediction module that explicitly models the vibrato in the singing voice using a flow-based model with variational data augmentation. Third, we add a ReZero strategy to the frame prior network. Experimental results show that adding the excitation signal improves the stability of the synthesized F0, that vibrato modeling significantly improves the recovery of vibrato, and that the ReZero strategy yields some improvement in training speed and articulation stability. In subjective listening tests, the proposed model is significantly more natural than VISinger, reaching a mean opinion score (MOS) of 3.95, and it also shows a clear advantage over the two-stage method DiffSinger+HiFiGAN, which demonstrates the effectiveness of the proposed method.
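The abstract does not specify how the excitation signal is built or how vibrato is parameterized. As a rough illustration only, the sketch below assumes a sine-based excitation of the kind used in neural source-filter vocoders, and a sinusoidal vibrato applied to a frame-level F0 contour. The function names `add_vibrato` and `sine_excitation`, and all parameter values (vibrato rate, depth in cents, amplitude, noise level), are hypothetical and not taken from the paper.

```python
import numpy as np


def add_vibrato(f0_hz, frame_rate, rate_hz=5.5, depth_cents=50.0):
    """Apply a sinusoidal vibrato to a frame-level F0 contour.

    Assumption: vibrato is a sine-shaped pitch deviation measured in cents.
    Unvoiced frames (F0 == 0) are left at 0.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    t = np.arange(len(f0_hz)) / frame_rate            # frame times in seconds
    cents = depth_cents * np.sin(2.0 * np.pi * rate_hz * t)
    modulated = f0_hz * 2.0 ** (cents / 1200.0)       # cents -> frequency ratio
    return np.where(f0_hz > 0, modulated, 0.0)


def sine_excitation(f0_hz, sample_rate, hop_size, amp=0.1, noise_std=0.003, seed=0):
    """Turn a frame-level F0 contour into a sample-level excitation signal.

    Voiced samples: a sine wave whose phase is the running integral of F0,
    plus a little Gaussian noise. Unvoiced samples: Gaussian noise only.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    f0_up = np.repeat(f0_hz, hop_size)                # nearest-neighbour upsampling
    phase = 2.0 * np.pi * np.cumsum(f0_up / sample_rate)
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std, size=f0_up.shape)
    return np.where(f0_up > 0, amp * np.sin(phase) + noise, noise)


# Usage: one second of a 220 Hz note with 0.1 s of leading silence.
sample_rate, hop_size = 24000, 240                    # 100 frames per second
f0 = np.full(100, 220.0)
f0[:10] = 0.0                                         # unvoiced frames
f0_vib = add_vibrato(f0, frame_rate=sample_rate / hop_size)
excitation = sine_excitation(f0_vib, sample_rate, hop_size)
```

In a model of the kind described, a signal like `excitation` would be fed to the decoder alongside the latent representation, so the decoder does not have to infer the pitch period on its own. How the paper actually conditions the decoder is not stated in the abstract.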

Key words

end-to-end singing voice synthesis/neural networks/vibrato modeling/normalizing flow/variational data augmentation

Funding

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0103600)

Publication year

2024

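Journal of Data Acquisition and Processing

Sponsors: Signal Processing Society of the Chinese Institute of Electronics and the China Instrument and Control Society; Weak Signal Detection Society of the China Instrument and Control Society and the Chinese Physical Society; Nanjing University of Aeronautics and Astronautics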

Indexing: CSTPCD, Peking University Core Journals

Impact factor: 0.679

ISSN: 1004-9037

Number of references: 21
