Chinese-Vietnamese neural machine translation based on denoising prototype sequence
[Objective] An insufficient Chinese-Vietnamese parallel corpus limits the performance of Chinese-Vietnamese neural machine translation models. In this setting, Vietnamese sentences can serve as prototype sequences that supplement the original encoder-decoder architecture: the extra sequence imparts knowledge of the target language to the translation model and guides the translation process accordingly. Continuing our previous research, we propose a denoising strategy that uses a Vietnamese entity dictionary to mask the entities present in the prototype sequence. Furthermore, because the large linguistic differences between Chinese and Vietnamese introduce noise into the prototype sequence, and this noise harms translation quality, we assess the reference value of the sequence to mitigate this shortcoming. [Methods] We build our model from Transformer blocks with the following configuration: 8 attention heads, a 512-dimensional hidden state, a 256-dimensional feed-forward state, and gradient accumulation over 5 steps. The number of hidden layers is 3 for the retrieval model, 4 for the prototype encoder in the translation model, and 6 for the encoder-decoder architecture in the translation model. We retrieve 3 Vietnamese sentences as prototype sequences. Experiments are carried out on a single NVIDIA RTX A5000 GPU. First, we retrieve target-language prototype sequences using the cross-lingual similarity between Chinese and Vietnamese. Then we mask out entities in the prototype sequences with a Vietnamese entity dictionary and evaluate the reference value of the prototype sequences based on their similarity to the source sentence and the frequency of rare words. Finally, the Chinese sentence and its corresponding prototype sequence are fed into a dual encoder-single decoder model to generate the Vietnamese translation. [Results] Using a corpus of 120 000 Chinese-Vietnamese parallel sentence pairs, we conduct a comparative analysis of our translation model against other models in which 
similar methods are used. On the Chinese-Vietnamese validation set, our model improves on these four baseline models by 1.69, 3.30, 3.36, and 0.46 percentage points, respectively. In the ablation experiments, the variants with similarity-score evaluation, rare-word-frequency evaluation, entity-dictionary denoising, and the entire prototype-sequence processing module improve on NMT (none) by 0.24, 0.12, 0.29, and 0.69 percentage points, respectively. To investigate the impact of candidate monolingual corpus size on translation performance and time overhead in the low-resource Chinese-Vietnamese setting, we acquired an additional 150 000 Vietnamese monolingual sentences from the QED (QCRI Educational Domain) corpus. We used these sentences in conjunction with the 120 000-pair Chinese-Vietnamese parallel corpus and gradually increased the size of the Vietnamese monolingual corpus to evaluate its effect. As the size of the monolingual corpus increases, the model's performance improves, and it performs best when all available monolingual data are used as the candidate sentence base. However, we observe that the performance gain does not grow linearly and may even diminish. Compared with the traditional single encoder-single decoder Transformer, the multi-module structure presented in this paper introduces a training-time overhead of 3.5 to 3.9 hours. Interestingly, increasing the size of the monolingual corpus has no significant impact on the model's time overhead. [Conclusions] These results indicate that our method not only leverages Vietnamese monolingual data as prototype sequences to address the shortage of bilingual resources, but also enhances the knowledge features beneficial to machine translation through denoising of the prototype 
sequences. Furthermore, enlarging the pool of candidate prototype sequences with monolingual data from relevant domains can improve model performance. However, the entity dictionaries constructed with existing tools contain recognition errors, and the multi-module structure introduces additional training-time overhead. Further research is required to resolve these issues.
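The denoising and reference-value scoring steps described in the Methods can be illustrated with a minimal sketch. All function names, the `[ENT]` mask token, the rare-word threshold, and the linear interpolation weight `alpha` are assumptions for illustration, not the authors' actual implementation; the cross-lingual retrieval similarity is taken as a given input.

```python
# Sketch of prototype denoising: entity masking plus reference-value scoring.
# Names, mask token, threshold, and weighting are illustrative assumptions.
from collections import Counter

MASK_TOKEN = "[ENT]"  # assumed placeholder for masked entities

def mask_entities(tokens, entity_dict):
    """Replace tokens found in the Vietnamese entity dictionary with a mask."""
    return [MASK_TOKEN if tok in entity_dict else tok for tok in tokens]

def rare_word_score(tokens, freq, rare_threshold=5):
    """Fraction of tokens whose corpus frequency falls below a threshold."""
    if not tokens:
        return 0.0
    rare = sum(1 for tok in tokens if freq.get(tok, 0) < rare_threshold)
    return rare / len(tokens)

def reference_value(similarity, tokens, freq, alpha=0.5):
    """Combine retrieval similarity with a rare-word penalty.

    `similarity` is the cross-lingual score between the Chinese source and
    the Vietnamese prototype; a higher share of rare words lowers the value.
    The linear interpolation with weight `alpha` is an assumption.
    """
    return alpha * similarity + (1 - alpha) * (1 - rare_word_score(tokens, freq))

# Toy example with made-up frequencies and an assumed entity dictionary.
freq = Counter({"xin": 100, "chào": 80, "Hà": 2, "Nội": 2})
entity_dict = {"Hà", "Nội"}
proto = ["xin", "chào", "Hà", "Nội"]

masked = mask_entities(proto, entity_dict)            # entities replaced by [ENT]
value = reference_value(similarity=0.8, tokens=proto, freq=freq)
```

In the full model, the masked prototype would be fed to the prototype encoder, and the reference value would weight how strongly the decoder attends to it.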