Mixture Density Network-Based Hmong Language Text-to-Speech Method
Research on Hmong text-to-speech is of great significance for the inheritance, protection, and development of ethnic culture. To address the scarcity of Hmong text, the lack of electronic resources, and the difficulty of obtaining data, a Hmong speech synthesis method based on a mixture density network is proposed. The method learns the alignment between text and speech from durations, which avoids the skipped and repeated words that can occur when alignment is learned with an attention mechanism. The mixture density network extracts the actual duration of each text unit and is trained jointly with the duration predictor, so no additional external aligner or autoregressive model is needed to guide alignment learning, which simplifies model training. Using the self-built Hmong text-to-speech corpus, Hmong_data, as the benchmark data, comparative experiments are conducted against advanced existing methods. The experimental results show that the proposed method achieves a mean opinion score (MOS) of 3.89, which is 0.41 higher than that of the Tacotron2 method. The generated alignment plots are clearer and smoother, and the synthesized speech is judged to be intelligible and correct.
Hmong language; text-to-speech; mixture density network; corpus
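
The duration-extraction step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how durations are decoded from the mixture density network, so everything here is an assumption: each text token is taken to have one diagonal Gaussian over acoustic frames (standing in for the network's output), and a Viterbi-style monotonic search picks the most likely frame-to-token alignment, from which per-token durations are counted. The names `diag_gaussian_log_prob` and `extract_durations` are hypothetical, and the toy data is invented for illustration.

```python
import math

NEG_INF = float("-inf")


def diag_gaussian_log_prob(frame, mean, var):
    """Log-density of one acoustic frame under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def extract_durations(log_prob):
    """Return the number of frames assigned to each token.

    log_prob[i][t] is the log-likelihood of frame t under token i.
    Assumption: the alignment is monotonic, starts at the first token,
    ends at the last token, and gives every token at least one frame.
    """
    n_tokens, n_frames = len(log_prob), len(log_prob[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # best[i][t]: score of the best monotonic path ending at (token i, frame t)
    best = [[NEG_INF] * n_frames for _ in range(n_tokens)]
    # advanced[i][t]: True if that path arrived from token i - 1
    advanced = [[False] * n_frames for _ in range(n_tokens)]
    best[0][0] = log_prob[0][0]

    for t in range(1, n_frames):
        for i in range(min(t + 1, n_tokens)):
            stay = best[i][t - 1]
            move = best[i - 1][t - 1] if i > 0 else NEG_INF
            advanced[i][t] = move > stay
            best[i][t] = max(stay, move) + log_prob[i][t]

    # Backtrack from the last token at the last frame, counting frames.
    durations = [0] * n_tokens
    i = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if advanced[i][t]:
            i -= 1
    return durations


# Toy example (invented data): three tokens, six one-dimensional frames.
token_means = [[0.0], [5.0], [10.0]]
token_vars = [[1.0], [1.0], [1.0]]
frames = [[0.1], [-0.2], [4.9], [5.2], [5.0], [9.8]]

log_prob = [
    [diag_gaussian_log_prob(f, m, v) for f in frames]
    for m, v in zip(token_means, token_vars)
]
durations = extract_durations(log_prob)
print(durations)  # [2, 3, 1]
```

In a setup like this, the extracted durations would serve as training targets for the duration predictor, which is how an external aligner could be avoided; whether the proposed method decodes durations in exactly this way is not stated in the abstract.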