Computer Technology and Development (计算机技术与发展), 2023, Vol. 33, Issue 12: 200-206. DOI: 10.3969/j.issn.1673-629X.2023.12.028

基于mRASP的藏汉双向神经机器翻译研究

Research on Tibetan-Chinese Bidirectional Neural Machine Translation Based on mRASP

杨丹¹ 拥措¹ 仁青卓玛¹ 唐超超¹

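Author information

  • 1. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000; Tibet Autonomous Region Key Laboratory of Tibetan Information Technology and Artificial Intelligence, Lhasa, Tibet 850000; Engineering Research Center of Tibetan Information Technology, Ministry of Education, Lhasa, Tibet 850000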

Abstract

Research on Tibetan-Chinese machine translation is of real practical importance for promoting and preserving ethnic culture and for advancing economic, educational, and cultural development in Tibetan regions. Tibetan-Chinese neural machine translation performs poorly because Tibetan-Chinese parallel corpora are scarce; to address this problem, we study cross-lingual pre-training models. Using the Tibetan-Chinese dataset from the 18th National Conference on Machine Translation (CCMT 2022), we build a Tibetan-Chinese bilingual cross-lingual pre-training model (mRASP), with Google's Transformer neural machine translation architecture as the baseline model. We mainly use data augmentation to expand the Tibetan-Chinese parallel corpus, optimize the vocabulary used for Tibetan-Chinese machine translation, and explore how the joint vocabulary in the cross-lingual pre-training model affects translation performance. Finally, we propose a Tibetan-Chinese bidirectional neural machine translation method that combines the cross-lingual pre-training model (mRASP) with an improved green joint vocabulary. With these strategies, the BLEU score reaches 55.69 on the Tibetan-to-Chinese translation task and 29.57 on the Chinese-to-Tibetan translation task. Compared with conventional Tibetan-Chinese bidirectional neural machine translation based on pre-trained models, the method effectively improves bidirectional translation performance under low-resource conditions.

Key words

cross-lingual pre-training model/Tibetan-Chinese bidirectional neural machine translation/mRASP/data augmentation/vocabulary


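Funding

National Key Research and Development Program of China (2017YFB1402202)

Tibet Autonomous Region Science and Technology Innovation Base Independent Research Project (XZ2021HR002G)

Tibet University Everest Discipline Construction Program (zf22002001)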

Publication year

2023
Computer Technology and Development (计算机技术与发展)
Shaanxi Computer Society (陕西省计算机学会)

CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
References: 10