This paper introduces Zhuang-TTS, a non-autoregressive Zhuang text-to-speech synthesis model based on FastSpeech2. To improve the rhythmic quality of synthesized Zhuang speech, a new set of Zhuang phonetic features is proposed, based on the characteristics of the Zhuang language and on field investigations. These features include tone, initials (or consonants), and finals (or vowels). Three improvements address the acoustic characteristics of Zhuang: (i) Zhuang phoneme sequences are used to represent pronunciation information; (ii) a phoneme-level acoustic regulator, similar to that of FastPitch, is employed to make synthesis results more stable; (iii) the Transformer in FastSpeech2 is replaced with the Conformer. In addition, a Zhuang speech synthesis corpus is constructed. Experimental results show that Zhuang-TTS achieves a Mean Opinion Score (MOS) of 3.90 for rhythm and a synthesis real-time rate of 8.65×10⁻². The model substantially improves both the quality and the speed of Zhuang speech synthesis, outperforming the baseline models Tacotron2 and FastSpeech2, and contributes to the advancement of Zhuang speech synthesis.
Keywords: Zhuang language speech synthesis; non-autoregressive acoustic model; non-autoregressive vocoder; Conformer
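
To put the reported synthesis real-time rate in context, the following minimal sketch assumes the common definition of the real-time factor: synthesis time divided by the duration of the generated audio. The abstract does not state the exact definition used, and the function name and the timing values below are hypothetical, chosen only to reproduce a rate of 8.65×10⁻².

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by generated audio duration.

    Assumed definition (not confirmed by the abstract): values below 1.0
    mean the system synthesizes faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings: 0.865 s of compute for 10 s of generated audio.
rtf = real_time_factor(0.865, 10.0)

# The reciprocal tells how many seconds of audio are produced
# per second of compute.
speedup = 1.0 / rtf
```

Under this assumed definition, a rate of 8.65×10⁻² corresponds to generating roughly 11.6 seconds of speech per second of computation.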