
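Study on Domain Data Augmentation Methods for Tibetan-Chinese Machine Translation based on Domain Terminology Dictionary and Sentence Structure Framework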

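Tibetan-Chinese machine translation systems have achieved notable translation quality in domains such as news and current affairs, mainly because relatively ample bilingual sentence pairs have been built for them. However, existing Tibetan-Chinese bilingual corpora are strongly skewed across domains: data for domains such as Tibetan medicine and Buddhist studies are extremely scarce, so Tibetan-Chinese translation models face sparse domain vocabulary and translation difficulties when handling sentence pairs from these low-resource domains. To address this problem, the paper makes full use of existing bilingual dictionaries of domain terminology and proposes a method for improving translation quality that combines the dictionary with domain-specific contextual and semantic relations, and applies it to the traditional Tibetan medicine domain. First, a bilingual dictionary of Tibetan medicine terminology containing 9,166 entry pairs is collected and built, and it is used to augment data in the low-resource domain so as to raise the translation system's coverage of domain-specific terms. Second, the data are augmented in two ways, by adding dictionary word pairs directly to the existing sentence pairs and by replacing words in the original sentence pairs with words from the domain dictionary, in order to verify the domain translation performance of dictionary-based augmentation. Finally, considering the importance of domain-specific sentence-pattern information for translation, the paper analyzes the context and semantic relations of the specific domain and proposes introducing domain-specific contextual sentence-pattern frameworks to optimize translation performance in special domains, tested on the traditional Tibetan medicine domain. Experimental results show that after dictionary-based data augmentation the BLEU score in the traditional Tibetan medicine domain rises from 0 to 4.59, and that the proposed domain sentence-pattern framework method, with only 5 constructed sentence frames, raises the BLEU score to as high as 6.32, offering new ideas and methods for low-resource domain translation.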
Study on Domain Data Augmentation Methods for Tibetan-Chinese Machine Translation based on Domain Terminology Dictionary and Sentence Structure Framework
The Tibetan-Chinese machine translation system has achieved significant translation effectiveness in domains such as news and politics, primarily due to the establishment of ample bilingual sentence pairs. However, existing Tibetan-Chinese bilingual corpora exhibit significant domain bias, with severe scarcity of data in domains such as Tibetan medicine and Buddhist studies. This scarcity poses challenges for Tibetan-Chinese translation models when handling sentence pairs from these low-resource domains, which often suffer from sparse domain vocabulary and translation difficulties. To address this issue, we leveraged an existing bilingual dictionary of domain-specific terms and proposed a method to enhance translation quality by integrating these terms with domain-specific context and semantic relationships, applied to the traditional Tibetan medicine domain. First, we collected and established a Tibetan medicine domain-specific bilingual dictionary containing 9,166 term pairs and used it to expand data in the low-resource domain, thereby improving the translation system's coverage of domain-specific terms. Second, we augmented existing sentence pairs both by adding dictionary term pairs and by replacing words with dictionary terms, in order to validate the model's domain translation performance. Finally, considering the importance of domain-specific syntactic information for translation, we proposed introducing domain-specific contextual sentence-pattern frameworks to optimize translation performance in special domains, tested specifically in the traditional Tibetan medicine domain. Experimental results showed that after using the dictionary for data augmentation, BLEU scores in the traditional Tibetan medicine domain improved from 0 to a maximum of 4.59. Building upon this, our proposed domain sentence-pattern framework method raised the BLEU score to a maximum of 6.32 with just 10 generated patterns, offering new insights and methods for addressing translation challenges in low-resource domains.
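
The three augmentation strategies named in the abstract (adding dictionary pairs, substituting dictionary terms, and filling sentence-pattern frames) can be illustrated with a minimal Python sketch. The abstract does not specify how substitution sites are chosen or how frames are written, so everything here is an assumption for illustration only: the function names, the `<TERM>` slot marker, the substring-matching rule, and the toy data, in which `bo_*` tokens are placeholders standing in for Tibetan text rather than real vocabulary.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source-side text, target-side text)


def add_entries_as_pairs(corpus: List[Pair], lexicon: Dict[str, str]) -> List[Pair]:
    """Strategy 1 (assumed reading): treat every dictionary entry as an extra training pair."""
    return corpus + list(lexicon.items())


def substitute_terms(corpus: List[Pair], lexicon: Dict[str, str]) -> List[Pair]:
    """Strategy 2 (assumed reading): when a pair already contains a dictionary
    entry on both sides, swap that entry for each other entry in the dictionary."""
    augmented: List[Pair] = []
    for src, tgt in corpus:
        for old_src, old_tgt in lexicon.items():
            if old_src in src and old_tgt in tgt:
                for new_src, new_tgt in lexicon.items():
                    if new_src != old_src:
                        augmented.append(
                            (src.replace(old_src, new_src), tgt.replace(old_tgt, new_tgt))
                        )
    return augmented


def fill_frames(frames: List[Pair], lexicon: Dict[str, str], slot: str = "<TERM>") -> List[Pair]:
    """Strategy 3 (assumed reading): instantiate each bilingual frame with every dictionary entry."""
    return [
        (f_src.replace(slot, t_src), f_tgt.replace(slot, t_tgt))
        for f_src, f_tgt in frames
        for t_src, t_tgt in lexicon.items()
    ]


# Toy placeholder data, not taken from the paper.
lexicon = {"bo_term_a": "term A", "bo_term_b": "term B"}
corpus = [("bo_w1 bo_term_a bo_w2", "w1 term A w2")]
frames = [("<TERM> bo_w3", "<TERM> w3")]

added = add_entries_as_pairs(corpus, lexicon)
swapped = substitute_terms(corpus, lexicon)
framed = fill_frames(frames, lexicon)
```

In this sketch a single frame yields one synthetic pair per dictionary entry, which suggests how a handful of frames combined with a dictionary of several thousand entries could produce a large amount of in-domain training data. Plain substring matching is a simplification; a real pipeline would need proper Tibetan and Chinese segmentation before matching terms.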

Tibetan-Chinese machine translation; domain data imbalance; domain sentence structure framework; terminology bilingual dictionary

格桑加措、尼玛扎西、嘎玛扎西、次仁白玛、步寅硕


School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000

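Tibet Autonomous Region Key Laboratory of Tibetan Information Technology and Artificial Intelligence, Tibet University, Lhasa, Tibet 850000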

Engineering Research Center of Tibetan Information Technology, Ministry of Education, Tibet University, Lhasa, Tibet 850000

Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization, Tibet University, Lhasa, Tibet 850000


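Tibetan-Chinese machine translation; domain data imbalance; domain sentence structure framework; terminology bilingual dictionary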

2024

Plateau Science Research (高原科学研究)


ISSN:
Year, volume (issue): 2024, 8(3)