Study on Domain Data Augmentation Methods for Tibetan-Chinese Machine Translation based on Domain Terminology Dictionary and Sentence Structure Framework
Tibetan-Chinese machine translation systems have achieved strong results in domains such as news and politics, primarily because ample bilingual sentence pairs are available for those domains. However, existing Tibetan-Chinese bilingual corpora are heavily skewed, with severe data scarcity in domains such as Tibetan medicine and Buddhist studies. As a result, Tibetan-Chinese translation models struggle with sentence pairs from these low-resource domains, where domain vocabulary is sparse and translation is difficult. To address this issue, we leveraged an existing bilingual dictionary of domain-specific terms and proposed a method that improves translation quality by integrating these terms with domain-specific context and semantic relationships, focusing on the traditional Tibetan medicine domain. First, we collected and built a Tibetan medicine bilingual dictionary containing 9,166 term pairs and used it to expand the low-resource data, thereby improving the translation system's coverage of domain-specific terms. Second, we augmented existing sentence pairs by adding dictionary terms and replacing words, and validated the model's domain translation performance. Finally, considering the importance of domain-specific syntactic information for translation, we proposed introducing domain-specific contextual syntactic frameworks to optimize translation performance in specialized domains, and tested this approach on the traditional Tibetan medicine domain. Experimental results showed that dictionary-based data augmentation raised BLEU scores in the traditional Tibetan medicine domain from 0 to a maximum of 4.59. Building on this, the proposed domain sentence pattern framework method raised the maximum BLEU score to 6.32 with only 10 generated patterns, offering new insights and methods for addressing translation challenges in low-resource domains.
Tibetan-Chinese machine translation; domain data imbalance; domain sentence structure framework; terminology bilingual dictionary
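
The two augmentation ideas summarized in the abstract (replacing terms in existing sentence pairs, and filling dictionary terms into a small set of sentence patterns) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names `augment_by_replacement` and `fill_templates`, the `<TERM>` slot marker, and the use of plain substring matching in place of proper Tibetan and Chinese word segmentation are all hypothetical.

```python
import random


def augment_by_replacement(pairs, term_dict, n_per_pair=2, seed=0):
    """Create new sentence pairs by swapping an aligned dictionary term.

    pairs:     list of (source_sentence, target_sentence) tuples
    term_dict: mapping from source-side term to target-side term
    Assumption: a term is "aligned" if the source term occurs in the source
    sentence and its dictionary translation occurs in the target sentence.
    """
    rng = random.Random(seed)
    terms = list(term_dict.items())
    augmented = []
    for src, tgt in pairs:
        for src_term, tgt_term in terms:
            if src_term in src and tgt_term in tgt:
                # Candidate replacements: every other dictionary entry.
                candidates = [(s, t) for s, t in terms if s != src_term]
                k = min(n_per_pair, len(candidates))
                for new_src, new_tgt in rng.sample(candidates, k):
                    augmented.append(
                        (src.replace(src_term, new_src),
                         tgt.replace(tgt_term, new_tgt))
                    )
    return augmented


def fill_templates(templates, term_dict, slot="<TERM>"):
    """Fill every dictionary term pair into every bilingual sentence pattern.

    templates: list of (source_pattern, target_pattern) tuples, each
               containing the slot marker where the term should go.
    """
    return [
        (src_tpl.replace(slot, src_term), tgt_tpl.replace(slot, tgt_term))
        for src_tpl, tgt_tpl in templates
        for src_term, tgt_term in term_dict.items()
    ]
```

With a dictionary of N term pairs and P patterns, `fill_templates` yields N × P synthetic pairs, which suggests how a handful of patterns can add a large amount of in-domain data. A real pipeline would likely need segmentation-aware matching and filtering of implausible term and pattern combinations.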