Cognate-Corpus-Enhanced Low-Resource Neural Machine Translation

Wang Lin (王琳)¹, Liu Wuying (刘伍颖)²

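Author information

1. Xianda College of Economics and Humanities, Shanghai International Studies University, Shanghai 200083, China
2. Shandong Key Laboratory of Language Resources Development and Application, Ludong University, Yantai, Shandong 264025, China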

Abstract

Low-resource machine translation, which lacks parallel sentence pairs, faces the scientific problem of cross-lingual semantic paraphrasing. This paper addresses the specific low-resource problem of Indonesian-Chinese machine translation, explores a data augmentation method based on a cognate corpus, and trains a better neural machine translation (NMT) model by mixing in the cognate corpus. In Indonesian-Chinese machine translation experiments, the mixed-corpus model improves the BLEU4 score by more than 3 points. The results show that a cognate corpus can effectively enhance low-resource NMT performance, and that this effectiveness stems mainly from the morphological similarity and semantic equivalence between cognate languages.
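The abstract does not spell out how the corpora are mixed, so the following is only a minimal sketch under stated assumptions: the cognate corpus is a set of sentence pairs that share the target language (for example Malay-Chinese, going by the keywords), and mixing means plain concatenation with exact-duplicate removal and shuffling. The function name `mix_parallel_corpora` and the toy sentence pairs are hypothetical and are not taken from the paper.

```python
import random


def mix_parallel_corpora(low_resource_pairs, cognate_pairs, seed=0):
    """Merge two lists of (source, target) sentence pairs into one training set.

    Assumption: mixing is simple concatenation; the paper may weight,
    filter, or tag the corpora differently.
    """
    seen = set()
    mixed = []
    for pair in list(low_resource_pairs) + list(cognate_pairs):
        # Drop exact duplicates, which are likely between closely related languages.
        if pair not in seen:
            seen.add(pair)
            mixed.append(pair)
    # Shuffle with a fixed seed so the two corpora are interleaved reproducibly.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy placeholder data, for illustration only.
indonesian_chinese = [("saya makan nasi", "我吃米饭")]
malay_chinese = [("saya makan nasi", "我吃米饭"), ("terima kasih", "谢谢")]

mixed = mix_parallel_corpora(indonesian_chinese, malay_chinese)
print(len(mixed))
```

The merged list could then be passed to whatever NMT training pipeline is in use; the sketch is deliberately independent of any particular toolkit.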

Key words

cognate corpus / data augmentation / low-resource machine translation / Indonesian / Malay

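Funding

Ministry of Education Humanities and Social Sciences Research Youth Fund (20YJC740062)

Shanghai Philosophy and Social Sciences "13th Five-Year Plan" Project (2019BYY028)

Ministry of Education Humanities and Social Sciences Research Planning Fund (20YJAZH069)

Ministry of Education New Liberal Arts Research and Reform Practice Project (2021060049)

Shandong Provincial Graduate Education Reform Research Project (SDYJG21185)

Shandong Provincial Undergraduate Teaching Reform Research Key Project (Z2021323)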

Publication year

2024

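Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD / CSCD / CHSSCD / Peking University Core Journals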

Impact factor: 0.8

ISSN: 1003-0077

Number of references: 29