Research on Self-Training Neural Machine Translation Based on Monolingual Priority Sampling
张笑燕 1, 逄磊 1, 杜晓峰 1, 陆天波 1, 夏亚梅 1
Author information
- 1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract
To enhance the performance of neural machine translation (NMT) and mitigate the damage that monolingual data with excessively high uncertainty inflicts on the model during self-training, a self-training NMT model based on priority sampling was proposed. First, syntactic dependency trees were constructed via dependency parsing and the importance of each monolingual word was computed. Next, a monolingual lexicon was built, and a priority was defined based on the importance and uncertainty of each monolingual word. Finally, priorities were computed for the monolingual data and sampling was carried out according to these priorities, yielding a synthetic parallel dataset used as training input for the student NMT model. Experimental results on a large-scale subset of the WMT English-German dataset demonstrate that the proposed model effectively improves NMT translation quality and alleviates the harm caused by excessively high uncertainty.
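The pipeline described above can be sketched in miniature. The following Python fragment is a hypothetical illustration only: the abstract does not give the paper's actual formulas, so `sentence_priority`, the mixing weight `lam`, and the proportional-sampling step are all assumptions standing in for the method's real definitions.

```python
import random

def sentence_priority(word_importances, word_uncertainties, lam=0.5):
    # Hypothetical priority: average word importance (e.g. derived from a
    # word's position in the dependency tree) blended with a reward for low
    # model uncertainty. `lam` is an assumed mixing weight, not the paper's.
    imp = sum(word_importances) / len(word_importances)
    unc = sum(word_uncertainties) / len(word_uncertainties)
    return lam * imp + (1.0 - lam) * (1.0 - unc)

def priority_sample(sentences, priorities, k, seed=0):
    # Draw k monolingual sentences with probability proportional to their
    # priority, so high-uncertainty sentences are rarely selected for
    # synthesizing pseudo-parallel training pairs.
    rng = random.Random(seed)
    return rng.choices(sentences, weights=priorities, k=k)
```

In a full self-training loop, the sampled sentences would then be translated by the teacher model and paired with their translations to form the synthetic parallel dataset for the student NMT model.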
Keywords
machine translation / data augmentation / self-training / uncertainty / syntactic dependency
Year of publication
2024