An End-to-End Singing Voice Synthesis Method with Excitation and Vibrato Modeling
In recent years, singing voice synthesis has developed rapidly, and end-to-end singing voice synthesis based on variational inference and normalizing flows (VISinger) has become mainstream. However, a gap remains between its output and real human singing, mainly reflected in audibly discontinuous pitch, poorly synthesized vibrato, and unstable articulation. We propose three main improvements. First, to address fundamental frequency (F0) stability, we add an excitation module to the decoder that explicitly supplies F0 information in the form of an excitation signal. Second, to address unnatural vibrato, we add a vibrato prediction module that explicitly models the vibrato in the song using a flow with variational data augmentation. Third, we apply the ReZero strategy to the frame prior network. Experimental results show that adding the excitation signal improves the stability of the synthesized F0, that vibrato modeling markedly improves the reproduction of vibrato, and that the ReZero strategy brings some improvement in training speed and articulation stability. Subjective evaluation shows that the proposed model is significantly more natural than VISinger, reaching a mean opinion score (MOS) of 3.95, and that it also significantly outperforms the two-stage method DiffSinger+HiFiGAN, demonstrating the effectiveness of the proposed method.
Keywords: end-to-end singing voice synthesis; neural networks; vibrato modeling; normalizing flow; variational data augmentation
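The three ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not specify how the excitation signal is built or how vibrato is parameterized, so the sketch assumes a sine-based excitation obtained by integrating F0 into phase, and a purely sinusoidal vibrato standing in for the learned flow-based predictor. All function names, parameters, and default values (`rate_hz`, `depth_cents`, `amp`, `noise_std`) are hypothetical. The last function shows the ReZero residual form, in which a gate initialized to zero scales the sublayer output.

```python
import numpy as np


def add_vibrato(f0_hz, frame_rate, rate_hz=5.5, depth_cents=50.0):
    """Modulate a frame-level F0 contour with a sinusoidal vibrato.

    Assumption: a fixed-rate, fixed-depth sinusoid stands in for the
    learned vibrato predictor. Unvoiced frames (f0 == 0) stay at 0.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    t = np.arange(len(f0_hz)) / frame_rate
    cents = depth_cents * np.sin(2.0 * np.pi * rate_hz * t)
    return np.where(f0_hz > 0, f0_hz * 2.0 ** (cents / 1200.0), 0.0)


def sine_excitation(f0_hz, frame_rate, sample_rate,
                    amp=0.1, noise_std=0.003, seed=0):
    """Build a sample-level excitation signal from a frame-level F0 contour.

    Assumption: voiced samples are a sine whose phase is the running
    integral of F0, plus small Gaussian noise; unvoiced samples are
    noise only.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    hop = sample_rate // frame_rate            # samples per frame
    f0_up = np.repeat(f0_hz, hop)              # upsample F0 to sample level
    phase = 2.0 * np.pi * np.cumsum(f0_up) / sample_rate
    noise = np.random.default_rng(seed).normal(0.0, noise_std, f0_up.shape)
    return np.where(f0_up > 0, amp * np.sin(phase) + noise, noise)


def rezero_residual(x, sublayer, alpha=0.0):
    """ReZero residual connection: x + alpha * sublayer(x).

    alpha is a learnable scalar in practice; initializing it to zero
    makes the block start out as the identity mapping.
    """
    return x + alpha * sublayer(x)
```

For example, `sine_excitation(add_vibrato(f0, 100), 100, 16000)` would turn a 100-frames-per-second F0 contour into a 16 kHz excitation waveform that a decoder could take as an additional input.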