Research on Vocabulary Sharing for Machine Translation from Ancient Chinese to Modern Chinese Based on the UniLM Model

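Chinese abstract (translated): [Purpose/significance] In machine translation from ancient Chinese to modern Chinese, the two differ markedly in vocabulary composition, syntax, and the flexible use of word classes, and no public word segmentation data for ancient Chinese is available. As a result, machine translation systems understand and process ancient Chinese imperfectly, which affects translation quality to some extent. [Method/process] This paper proposes an unsupervised lexicon construction method. Building on the UniLM model, it combines UniLM with the BERT, RoBERTa, RoFormer, and RoFormerV2 pre-trained models respectively and fine-tunes each model. UniLM, fused with ancient Chinese domain knowledge features, is used to turn the relationship between the source and target languages into an intermediate language representation, while the pre-trained models learn context-dependent language representations and strengthen the associations between meanings, thereby improving ancient-to-modern machine translation performance. [Result/conclusion] The experiments show that ancient Chinese machine translation incorporating ancient Chinese domain knowledge features raises the BLEU score by 0.27 to 1.12 on the BERT, RoBERTa, RoFormer, and RoFormerV2 pre-trained models respectively, which demonstrates the effectiveness of the proposed method.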

English title: Research on Vocabulary Sharing for Machine Translation from Ancient Chinese to Modern Chinese Based on UniLM Model

English abstract: [Purpose/significance] In machine translation from ancient Chinese to modern Chinese, the significant differences in vocabulary composition, syntax, and flexible use of parts of speech between ancient and modern Chinese, together with the lack of open word segmentation data for ancient Chinese, bias the understanding and processing ability of machine translation systems, which affects translation quality to some extent. [Method/process] This paper puts forward an unsupervised lexicon construction method. Based on the UniLM model, it combines UniLM with the BERT, RoBERTa, RoFormer, and RoFormerV2 pre-trained models respectively and fine-tunes each model. With the help of the UniLM model, the language relationship between the source language and the target language is turned into an intermediate language representation, and the pre-trained models are used to learn context-related language representations, so as to increase the relevance between semantics and thus improve ancient-to-modern machine translation. [Result/conclusion] The experimental results show that the BLEU value of ancient Chinese machine translation that integrates ancient Chinese domain knowledge features increases by 0.27 to 1.12 on the BERT, RoBERTa, RoFormer, and RoFormerV2 pre-trained models respectively, which proves the effectiveness of the proposed method.
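As background for the UniLM-based setup mentioned in the abstract, the sketch below builds the sequence-to-sequence self-attention mask that UniLM is generally known for: source tokens attend to each other in both directions, while target tokens attend to the whole source and only to target tokens up to their own position. The abstract does not describe the authors' masking or implementation, so this is an assumption about the standard UniLM design rather than the paper's code; the function name `unilm_seq2seq_mask` and its parameters are hypothetical.

```python
def unilm_seq2seq_mask(src_len: int, tgt_len: int) -> list[list[int]]:
    """Build a 0/1 attention mask for a concatenated [source; target] sequence.

    mask[i][j] == 1 means position i may attend to position j.
    Assumption: this follows the standard UniLM seq2seq masking scheme,
    not anything stated in the abstract itself.
    """
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                # Source (ancient Chinese) tokens are visible to every position.
                mask[i][j] = 1
            elif i >= src_len and j <= i:
                # Target (modern Chinese) tokens see earlier target tokens and themselves.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    # Print the mask for a 3-token source and a 2-token target.
    for row in unilm_seq2seq_mask(src_len=3, tgt_len=2):
        print(row)
```

In a fine-tuning setup of the kind the abstract describes, a mask like this would let a single encoder initialized from BERT, RoBERTa, RoFormer, or RoFormerV2 weights act as both encoder and left-to-right decoder; whether the authors did exactly this is not stated here.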

English keywords:

UniLM model; ancient Chinese word segmentation; vocabulary sharing; ancient Chinese translation; machine translation

Authors:

Xu Qiankun (许乾坤), Wang Dongbo (王东波), Liu Yutong (刘禹彤), Wu Mengcheng (吴梦成), Huang Shuiqing (黄水清)

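Author affiliations:

College of Information Management, Nanjing Agricultural University, Jiangsu 210095

Research Center for Humanities and Social Computing, Nanjing Agricultural University, Jiangsu 210095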

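Keywords (Chinese):

UniLM model (UniLM模型); ancient Chinese word segmentation (古文分词); vocabulary sharing (词汇共享); ancient Chinese translation (古文翻译); machine translation (机器翻译)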

Funding:

Major Project of the National Social Science Fund of China

Project number:

21&ZD331

Publication year:

2024

DOI:

10.12154/j.qbzlgz.2024.01.008

Information and Documentation Services (情报资料工作)

Renmin University of China

Indexed in: CSTPCD; CSSCI; CHSSCD; Peking University Core Journals (北大核心)

Impact factor: 2.201

ISSN: 1002-0314

Year, volume (issue): 2024, 45(1)

Number of references: 37