Research on Vocabulary Sharing for Machine Translation from Ancient Chinese to Modern Chinese Based on the UniLM Model
[Purpose/significance] In machine translation from ancient Chinese to modern Chinese, the two languages differ significantly in vocabulary composition, syntax and the flexible use of parts of speech, and open word segmentation data for ancient Chinese is lacking. As a result, the understanding and processing abilities of machine translation systems are biased, which degrades translation quality to some extent. [Method/process] This paper first proposes an unsupervised thesaurus construction method. Then, based on the UniLM model, it combines UniLM with each of the BERT, RoBERTa, RoFormer and RoFormerV2 pre-trained models and fine-tunes the resulting model. With the UniLM model, the relationship between the source language and the target language is encoded as an intermediate language representation, and the pre-trained model is used to learn context-dependent language representations. This strengthens the relevance between semantics and thereby improves machine translation from ancient to modern Chinese. [Result/conclusion] The experimental results show that integrating the knowledge features of ancient Chinese prose raises the BLEU score of ancient Chinese prose translation by 0.27 to 1.12 across the BERT, RoBERTa, RoFormer and RoFormerV2 pre-trained models, which demonstrates the effectiveness of the proposed method.
UniLM model; ancient Chinese word segmentation; vocabulary sharing; ancient Chinese translation; machine translation
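
The abstract does not spell out how UniLM lets a single pre-trained encoder act as a translation model. As a rough illustration only, the sketch below builds the sequence-to-sequence self-attention mask commonly associated with UniLM: source (ancient Chinese) tokens attend to each other bidirectionally, while target (modern Chinese) tokens attend to the whole source and only to target tokens up to and including themselves. The function name `seq2seq_attention_mask` and the list-of-lists layout are assumptions made for this example, not details taken from the paper.

```python
def seq2seq_attention_mask(src_len, tgt_len):
    """Return a (src_len + tgt_len) square mask as nested lists.

    mask[i][j] == 1 means position i may attend to position j.
    Illustrative sketch only; names and layout are hypothetical.
    """
    total = src_len + tgt_len
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < src_len:
                # Every position may attend to the full source segment.
                mask[i][j] = 1
            elif i >= src_len and j <= i:
                # Target positions attend to themselves and earlier target tokens.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Two source tokens followed by two target tokens.
    for row in seq2seq_attention_mask(2, 2):
        print(row)
```

In practice such a mask would be passed to the Transformer's self-attention layers during fine-tuning, so that the same pre-trained weights (BERT, RoBERTa, RoFormer or RoFormerV2) can be used for generation.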