Thai Speech Synthesis Based on Cross-language Transfer Learning and Joint Training


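Abstract (translated from the Chinese): With the rapid development of deep learning and neural networks, end-to-end speech synthesis systems based on deep neural networks have become mainstream thanks to their strong performance. Research on Thai speech synthesis nevertheless remains limited, mainly because large-scale Thai datasets are scarce and Thai spelling has its own peculiarities. This work therefore studies Thai speech synthesis under low-resource conditions, using the FastSpeech2 acoustic model and the StyleMelGAN vocoder. To address the problems observed in the baseline system, three improvements are proposed to further raise the quality of synthesized Thai speech. (1) Under the guidance of Thai language experts and drawing on Thai linguistics, a Thai G2P model is designed to handle the special spellings found in Thai text. (2) Based on the International Phonetic Alphabet phonemes produced by the designed Thai G2P model, a language with similar phoneme input units and abundant data is selected for cross-language transfer learning, which addresses the shortage of Thai training data. (3) FastSpeech2 and the StyleMelGAN vocoder are trained jointly to resolve the acoustic feature mismatch. The proposed methods are evaluated in three ways: attention alignment maps, the objective MCD metric, and subjective MOS ratings. The experiments show that the proposed Thai G2P model yields better alignment and therefore more accurate phoneme durations, and that the system combining "proposed Thai G2P model + joint training + transfer learning" achieves the best synthesis quality, with an MCD of 7.43±0.82 and a MOS of 4.53, clearly better than the baseline system's 9.47±0.54 and 1.14.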

English title: Thai Speech Synthesis Based on Cross-language Transfer Learning and Joint Training

English abstract: With the rapid development of deep learning and neural networks, end-to-end speech synthesis systems based on deep neural networks have become the mainstream because of their excellent performance. However, research on Thai speech synthesis is still insufficient, mainly due to the scarcity of large-scale Thai datasets and the special spelling of the language. This paper studies Thai speech synthesis based on the FastSpeech2 acoustic model and the StyleMelGAN vocoder under low-resource conditions. To address the problems in the baseline system, three improvements are proposed to further raise the quality of synthesized Thai speech. (1) Under the guidance of Thai language experts and drawing on relevant knowledge of Thai linguistics, a Thai G2P model is designed to deal with the special spellings in Thai text. (2) Based on the International Phonetic Alphabet phonemes produced by the designed Thai G2P model, a language with similar phoneme input units and rich datasets is selected for cross-language transfer learning to solve the problem of insufficient Thai training data. (3) FastSpeech2 and the StyleMelGAN vocoder are trained jointly to solve the problem of acoustic feature mismatch. To verify the effectiveness of the proposed methods, this paper evaluates attention alignment maps, the objective MCD metric, and subjective MOS scores. Experimental results show that the Thai G2P model designed in this paper obtains better alignment and thus more accurate phoneme durations, and that the system using the "Thai G2P model designed in this paper + joint training + transfer learning" method has the best speech synthesis quality: the MCD and MOS scores of the synthesized speech are 7.43±0.82 and 4.53, significantly better than the 9.47±0.54 and 1.14 of the baseline system.
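The abstract reports MCD (mel-cepstral distortion) as its objective metric. As a rough illustration only, one commonly used definition can be sketched in Python as follows. The function name, the list-of-frames input, the exclusion of the 0th (energy) coefficient, and the assumption that the two utterances are already time-aligned frame by frame are all choices made for this sketch; the abstract does not say how the authors computed their scores.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over frames (hypothetical helper for illustration).

    Each argument is a list of frames; each frame is a list of mel-cepstral
    coefficients. Frames are assumed to be already time-aligned, and the
    0th coefficient (energy) is skipped, as is common practice.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")

    scale = 10.0 / math.log(10.0)  # converts the distance to a dB-like scale
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D (skip index 0).
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)
```

Lower values mean the synthesized cepstra are closer to the reference, which is consistent with the abstract treating the drop from 9.47 to 7.43 as an improvement.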

English keywords:

Speech synthesis; Low resource; Thai G2P model; Transfer learning; Joint training

Authors:

Zhang Xinrui (张欣瑞), Yang Jian (杨鉴), Wang Zhan (王展)


Affiliation:

School of Information, Yunnan University, Kunming 650504

Keywords (translated from the Chinese):

Speech synthesis; low resource; Thai G2P model; transfer learning; joint training

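Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China

Project numbers:

2020AAA0107901; 61961043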

Publication year:

2024

DOI:

10.11896/jsjkx.230500174

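Journal: Computer Science (计算机科学)

Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)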


Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(z1)

Number of references: 23