Grapheme-to-phoneme (G2P) conversion is an important component of Chinese text-to-speech (TTS) systems. Its core problem is polyphone disambiguation, i.e., selecting the correct pronunciation for a polyphonic character from several candidate readings. Existing methods usually fail to fully capture the semantics of the word containing the polyphonic character, and polyphone datasets suffer from imbalanced distributions. To address these problems, this paper proposes CLTRoBERTa (Cross-lingual Translation RoBERTa), a polyphone disambiguation method based on the pre-trained model RoBERTa. First, a cross-lingual translation module produces a translation, in another language, of the word containing the polyphonic character, and this translation is fed to the model as an additional feature to improve semantic understanding of the word; second, the layer-wise learning rate strategy from discriminative fine-tuning is adopted to match the learning characteristics of different layers of the neural network; finally, a sample weight module is incorporated to handle the imbalanced distribution of the polyphone dataset. CLTRoBERTa balances the performance differences caused by the imbalanced dataset distribution and achieves 99.08% accuracy on the CPP (Chinese Polyphone with Pinyin) benchmark dataset, outperforming other baseline models.
Polyphone Disambiguation Based on a Pre-trained Model
Grapheme-to-phoneme (G2P) conversion is an important part of Chinese text-to-speech (TTS) systems. The key issue of G2P is to select the correct pronunciation for a polyphonic character from several alternatives. Existing methods usually struggle to fully grasp the semantics of the words that contain polyphonic characters, and fail to handle the imbalanced distribution of polyphone datasets effectively. To solve these problems, this paper proposes a polyphone disambiguation method based on the pre-trained model RoBERTa, called cross-lingual translation RoBERTa (CLTRoBERTa). Firstly, a cross-lingual translation module generates a translation, in another language, of the word containing the polyphonic character, which serves as an additional input feature to improve the model's semantic comprehension. Secondly, a hierarchical (layer-wise) learning rate optimization strategy from discriminative fine-tuning is employed to suit the learning characteristics of different layers of the neural network. Finally, the model is enhanced with a sample weight module to address the imbalanced distribution of the dataset. Experimental results show that CLTRoBERTa mitigates the performance differences caused by the uneven dataset distribution and achieves 99.08% accuracy on the public Chinese Polyphone with Pinyin (CPP) dataset, outperforming other baseline models.
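To make the two training-side techniques in the abstract concrete, the following is a minimal sketch (not the authors' released code) of layer-wise learning rates from discriminative fine-tuning and of frequency-based sample weighting for the imbalanced pronunciation classes. It assumes PyTorch with HuggingFace transformers; the checkpoint name, `base_lr`, `decay`, and the class counts are illustrative assumptions, and the classifier head is omitted for brevity.

```python
import torch
from torch import nn
from transformers import RobertaModel

# Assumed Chinese RoBERTa checkpoint; the paper does not specify one here.
model = RobertaModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

# Discriminative fine-tuning: layers closer to the input encode more
# task-general features, so they receive smaller learning rates than
# layers near the output.
base_lr, decay = 2e-5, 0.95  # illustrative hyperparameters
num_layers = model.config.num_hidden_layers
param_groups = []
for i, layer in enumerate(model.encoder.layer):
    # i = 0 is the bottom encoder layer, so it gets the smallest rate.
    lr = base_lr * (decay ** (num_layers - 1 - i))
    param_groups.append({"params": layer.parameters(), "lr": lr})
# Embeddings sit below all encoder layers and get the smallest rate of all.
param_groups.append({"params": model.embeddings.parameters(),
                     "lr": base_lr * (decay ** num_layers)})
optimizer = torch.optim.AdamW(param_groups, lr=base_lr)

# Sample weighting: weight each pronunciation class inversely to its
# frequency so that rare readings contribute more to the loss.
class_counts = torch.tensor([9500.0, 400.0, 100.0])  # hypothetical counts
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)
```

With this setup, rare readings of a polyphonic character are not drowned out by the dominant one during training, which is the imbalance effect the abstract says the sample weight module is meant to correct.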