Data Filtering Strategies for Tibetan-Chinese Neural Machine Translation

Renqing Zhuoma (仁青卓玛)¹, Yongcuo (拥措)¹, Tang Chaochao (唐超超)¹

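Author information

1. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000; Tibet Autonomous Region Key Laboratory of Tibetan Information Technology and Artificial Intelligence, Lhasa, Tibet 850000; Engineering Research Center of Tibetan Information Technology, Ministry of Education, Lhasa, Tibet 850000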

Abstract

Traditional data augmentation methods used in Tibetan-Chinese machine translation can introduce grammatical and semantic losses into the generated data. To address this problem, this paper builds on traditional data augmentation and proposes a pseudo-data filtering method that combines sentence perplexity with semantic similarity. Perplexity is used to reduce the grammatical error rate of the pseudo-data, and semantic similarity is used to reduce its semantic deviation, which better alleviates the poor quality and scarcity of parallel data in low-resource settings. Experiments were run in both directions on two language pairs, Tibetan-Chinese and English-Chinese. Compared with the traditional data augmentation method, the proposed filtering method improves BLEU by 0.11, 0.53, 1.18, and 1.08, respectively. These results indicate that the proposed pseudo-data filtering method mitigates the grammatical and semantic defects of the translation model, improving both the performance of the translation system and the generalization ability of the model.
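The filtering idea described in the abstract can be illustrated with a minimal sketch. It assumes that each back-translated pair already carries per-token log-probabilities from some language model and sentence embeddings from some cross-lingual encoder. The `Candidate` class, the threshold parameters, and the rule of keeping a pair only when both checks pass are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One back-translated sentence pair with precomputed scores (hypothetical layout)."""
    synthetic: str                    # machine-generated side of the pair
    reference: str                    # human-written side of the pair
    token_log_probs: Sequence[float]  # natural-log token probabilities of `synthetic`
    synthetic_vec: Sequence[float]    # sentence embedding of `synthetic`
    reference_vec: Sequence[float]    # sentence embedding of `reference`


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def filter_pseudo_pairs(
    candidates: Sequence[Candidate], max_ppl: float, min_sim: float
) -> List[Candidate]:
    """Keep pairs that are fluent (low perplexity) AND faithful (high similarity).

    Both thresholds are assumed tuning knobs; the source does not state values.
    """
    return [
        c
        for c in candidates
        if perplexity(c.token_log_probs) <= max_ppl
        and cosine_similarity(c.synthetic_vec, c.reference_vec) >= min_sim
    ]
```

In this sketch the perplexity check targets grammatical errors and the similarity check targets semantic drift, mirroring the two roles the abstract assigns to them. How the two scores are actually combined in the paper is not specified here.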

Key words

back translation/data filtering/Tibetan-Chinese neural machine translation/perplexity/semantic similarity

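Funding

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0116100)

Tibet Autonomous Region Science and Technology Innovation Base Independent Research Project (XZ2021JR0002G)

Tibet University Discipline Construction Capacity Improvement Program (藏财预指[2023]1号)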

Publication year

2024

Computer and Modernization (计算机与现代化)

Jiangxi Computer Society; Jiangxi Institute of Computing Technology

CSTPCD

Impact factor: 0.472

ISSN: 1006-2475
