
Research on Low-Resource Language Machine Translation for the "Belt and Road"

With the advancement of the "Belt and Road" initiative, the demand for cross-language communication among countries and regions along its routes has grown steadily, and Machine Translation (MT) technology has gradually become an important means of in-depth exchange between countries. However, these countries have many low-resource languages, and the scarcity of corpora has slowed progress in machine translation research for them. To address this problem, this paper proposes a low-resource language machine translation training method based on an improved NLLB model. First, an improved training strategy built on a multilingual pre-trained model is introduced: under data augmentation, it optimizes the loss function, thereby effectively improving translation performance for low-resource languages in machine translation tasks. Then, the ChatGPT and ChatGLM models are used to evaluate Laotian-Chinese and Vietnamese-Chinese translation, respectively. Large Language Models (LLMs) are already capable of translating low-resource languages to some extent, and the ChatGPT model significantly outperforms traditional Neural Machine Translation (NMT) models on Vietnamese-Chinese translation, although its performance on Laotian still requires further improvement. Experimental results show that, on translation tasks from four low-resource languages into Chinese, the proposed method achieves average improvements of 1.33 Bilingual Evaluation Understudy (BLEU) points and 0.82 chrF++ points over the NLLB-600M baseline, fully demonstrating its effectiveness for low-resource language machine translation. In addition, preliminary studies with the ChatGPT and ChatGLM models on Laotian-Chinese and Vietnamese-Chinese show that, on Vietnamese-Chinese translation, ChatGPT far exceeds traditional NMT models, with improvements of 9.28 BLEU points and 3.12 chrF++ points.
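The reported gains are measured in BLEU and chrF++, which in practice are computed with a standard tool such as sacrebleu. Purely as an illustration of the character n-gram idea behind chrF (a simplified sketch with a hypothetical `chrf` helper, not the paper's evaluation code; real chrF++ also mixes in word n-grams and corpus-level aggregation):

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with spaces removed, as chrF does.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram F-beta over orders 1..max_n.

    beta=2 weights recall twice as much as precision, as in the chrF metric.
    Illustrative only; use sacrebleu for reportable chrF++ scores.
    """
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not h or not r:
            continue  # sentence too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        f = (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)
        scores.append(f)
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 100, disjoint strings score 0, and partial overlaps fall in between; a one-point average difference on such a 0-100 scale is the unit in which the paper's 0.82 chrF++ improvement is expressed.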

low-resource languages; Machine Translation (MT); data augmentation; multilingual pre-training models; Large Language Model (LLM)

侯钰涛、阿布都克力木·阿布力孜、史亚庆、马依拉木·木斯得克、哈里旦木·阿布都克里木


School of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China


National Natural Science Foundation of China (61966033, 62366050); High-level Talent Special Project (2022XGC060o)

2024

Computer Engineering (计算机工程)
East China Institute of Computing Technology; Shanghai Computer Society

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.581
ISSN:1000-3428
Year, Volume (Issue): 2024, 50(4)