Chinese-Vietnamese neural machine translation based on denoising prototype sequence
[Objective] An insufficient Chinese-Vietnamese parallel corpus limits the performance of Chinese-Vietnamese neural machine translation models. In this setting, Vietnamese sentences can serve as prototype sequences that supplement the original encoder-decoder architecture: the extra sequence imparts knowledge of the target language to the translation model and guides the translation process accordingly. Continuing our previous research, we propose a denoising strategy that uses a Vietnamese entity dictionary to mask the entities present in the prototype sequence. Furthermore, because the large linguistic differences between Chinese and Vietnamese introduce noise into the prototype sequence, and this noise harms translation quality, we assess the reference value of the sequence to mitigate this shortcoming. [Methods] We build our model from Transformer blocks with the following configuration: 8 attention heads, a 512-dimensional hidden state, a 256-dimensional feed-forward state, and gradient accumulation over 5 steps. The number of hidden layers is 3 for the retrieval model, 4 for the prototype encoder in the translation model, and 6 for the encoder-decoder architecture in the translation model. We retrieve 3 Vietnamese sentences as prototype sequences. Experiments are carried out on a single NVIDIA RTX A5000 GPU. First, we retrieve target-language prototype sequences using the cross-lingual similarity between Chinese and Vietnamese. Then we mask out entities in the prototype sequences with a Vietnamese entity dictionary and evaluate the reference value of the prototype sequences based on their similarity to the source sentence and the frequency of rare words. Finally, the Chinese sentence and its corresponding prototype sequence are fed into a dual encoder-single decoder model to generate the Vietnamese translation. [Results] Using a corpus of 120 000 Chinese-Vietnamese parallel sentence pairs, we conduct a comparative analysis of our translation model against other models in which 
similar methods are used. On the Chinese-Vietnamese validation set, our model improves on these four baseline models by 1.69, 3.30, 3.36, and 0.46 percentage points, respectively. In the ablation experiments, the variants with similarity-score evaluation, rare-word-frequency evaluation, entity-dictionary denoising, and the entire prototype-sequence processing module improve on NMT (none) by 0.24, 0.12, 0.29, and 0.69 percentage points, respectively. To investigate the impact of candidate monolingual corpus size on translation performance and time overhead in the low-resource Chinese-Vietnamese setting, we acquired an additional 150 000 Vietnamese monolingual sentences from the QED (QCRI Educational Domain) corpus. We used these sentences in conjunction with the 120 000-pair Chinese-Vietnamese parallel corpus and gradually increased the size of the Vietnamese monolingual corpus to evaluate its effect. As the size of the monolingual corpus increases, the model's performance improves, and it performs best when all available monolingual data are used as the candidate sentence base. However, we observe that the performance gain does not grow linearly and may even diminish. Compared with the traditional single encoder-single decoder Transformer, the multi-module structure presented in this paper introduces a training-time overhead of 3.5 to 3.9 hours. Interestingly, increasing the size of the monolingual corpus has no significant impact on the model's time overhead. [Conclusions] These results indicate that our method not only leverages Vietnamese monolingual data as prototype sequences to address the shortage of bilingual resources, but also enhances the knowledge features beneficial to machine translation through denoising of the prototype 
sequences. Furthermore, enlarging the pool of candidate prototype sequences with monolingual data from relevant domains can improve model performance. However, the entity dictionaries constructed with existing tools contain recognition errors, and the multi-module structure introduces additional training-time overhead. Further research is required to resolve these issues.
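The denoising and reference-value scoring steps described in the Methods can be illustrated with a minimal sketch. All function names, the `[ENT]` mask token, the rare-word threshold, and the linear interpolation weight `alpha` are assumptions for illustration, not the authors' actual implementation; the cross-lingual retrieval similarity is taken as a given input.

```python
# Sketch of prototype denoising: entity masking plus reference-value scoring.
# Names, mask token, threshold, and weighting are illustrative assumptions.
from collections import Counter

MASK_TOKEN = "[ENT]"  # assumed placeholder for masked entities

def mask_entities(tokens, entity_dict):
    """Replace tokens found in the Vietnamese entity dictionary with a mask."""
    return [MASK_TOKEN if tok in entity_dict else tok for tok in tokens]

def rare_word_score(tokens, freq, rare_threshold=5):
    """Fraction of tokens whose corpus frequency falls below a threshold."""
    if not tokens:
        return 0.0
    rare = sum(1 for tok in tokens if freq.get(tok, 0) < rare_threshold)
    return rare / len(tokens)

def reference_value(similarity, tokens, freq, alpha=0.5):
    """Combine retrieval similarity with a rare-word penalty.

    `similarity` is the cross-lingual score between the Chinese source and
    the Vietnamese prototype; a higher share of rare words lowers the value.
    The linear interpolation with weight `alpha` is an assumption.
    """
    return alpha * similarity + (1 - alpha) * (1 - rare_word_score(tokens, freq))

# Toy example with made-up frequencies and an assumed entity dictionary.
freq = Counter({"xin": 100, "chào": 80, "Hà": 2, "Nội": 2})
entity_dict = {"Hà", "Nội"}
proto = ["xin", "chào", "Hà", "Nội"]

masked = mask_entities(proto, entity_dict)            # entities replaced by [ENT]
value = reference_value(similarity=0.8, tokens=proto, freq=freq)
```

In the full model, the masked prototype would be fed to the prototype encoder, and the reference value would weight how strongly the decoder attends to it.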