Research on Self-Training Neural Machine Translation Based on Monolingual Priority Sampling
张笑燕 1, 逄磊 1, 杜晓峰 1, 陆天波 1, 夏亚梅 1
Author information
- 1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract
To enhance the performance of neural machine translation (NMT) and mitigate the damage that monolingual data with excessively high uncertainty inflicts on the model during self-training, a self-training NMT model based on priority sampling was proposed. First, syntactic dependency trees were constructed via dependency parsing and the importance of each monolingual word was computed. Next, a monolingual lexicon was built, and a priority was defined based on the importance and uncertainty of each monolingual word. Finally, priorities were computed for the monolingual data and sampling was carried out according to these priorities, yielding a synthetic parallel dataset used as training input for the student NMT model. Experimental results on a large-scale subset of the WMT English-German dataset demonstrate that the proposed model effectively improves NMT translation quality and alleviates the harm caused by excessively high uncertainty.
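The pipeline described above can be sketched in miniature. The following Python fragment is a hypothetical illustration only: the abstract does not give the paper's actual formulas, so `sentence_priority`, the mixing weight `lam`, and the proportional-sampling step are all assumptions standing in for the method's real definitions.

```python
import random

def sentence_priority(word_importances, word_uncertainties, lam=0.5):
    # Hypothetical priority: average word importance (e.g. derived from a
    # word's position in the dependency tree) blended with a reward for low
    # model uncertainty. `lam` is an assumed mixing weight, not the paper's.
    imp = sum(word_importances) / len(word_importances)
    unc = sum(word_uncertainties) / len(word_uncertainties)
    return lam * imp + (1.0 - lam) * (1.0 - unc)

def priority_sample(sentences, priorities, k, seed=0):
    # Draw k monolingual sentences with probability proportional to their
    # priority, so high-uncertainty sentences are rarely selected for
    # synthesizing pseudo-parallel training pairs.
    rng = random.Random(seed)
    return rng.choices(sentences, weights=priorities, k=k)
```

In a full self-training loop, the sampled sentences would then be translated by the teacher model and paired with their translations to form the synthetic parallel dataset for the student NMT model.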
Keywords
machine translation / data augmentation / self-training / uncertainty / syntactic dependency
Year of publication
2024