Data Filtering Strategies for Tibetan-Chinese Neural Machine Translation
Syntactic and semantic information is often lost when traditional data augmentation methods are applied to Tibetan-Chinese machine translation. To address this issue, this paper proposes a pseudo-data filtering method that combines sentence perplexity with semantic similarity, building on traditional data augmentation methods. This strategy addresses the inadequate quality and scarcity of parallel data, particularly in low-resource settings. The results show that the pseudo-data filtering approach significantly improves bidirectional translation on both Tibetan-Chinese and English-Chinese tasks. The proposed method reduces grammatical and semantic defects in the translation model's output, improving both the performance of the translation system and the generalization ability of the model.
back translation; data selection; Tibetan-Chinese neural machine translation; perplexity; semantic similarity
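The filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the two scores are computed or combined, so everything below is an assumption made for illustration: the function names (`perplexity`, `cosine_similarity`, `filter_pseudo_pairs`), the thresholds `max_ppl` and `min_sim`, and the rule that a pair is kept only when it passes both checks. The scoring functions are passed in as callables, standing in for whatever language model and sentence encoder one would actually use.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (synthetic sentence, authentic sentence), e.g. produced by back translation
Pair = Tuple[str, str]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        return float("inf")  # treat an empty sentence as maximally implausible
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two sentence vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pseudo_pairs(
    pairs: Sequence[Pair],
    ppl_fn: Callable[[str], float],       # hypothetical: scores fluency of the synthetic side
    sim_fn: Callable[[str, str], float],  # hypothetical: cross-lingual similarity of the pair
    max_ppl: float,                       # assumed threshold, not taken from the paper
    min_sim: float,                       # assumed threshold, not taken from the paper
) -> List[Pair]:
    """Keep pairs whose synthetic side is fluent and whose meaning matches."""
    kept = []
    for synthetic, authentic in pairs:
        fluent = ppl_fn(synthetic) <= max_ppl
        faithful = sim_fn(synthetic, authentic) >= min_sim
        if fluent and faithful:  # assumption: both conditions must hold
            kept.append((synthetic, authentic))
    return kept
```

In practice `ppl_fn` would wrap a language model for the synthetic side's language and `sim_fn` a multilingual sentence encoder; requiring both conditions is one plausible combination, and a weighted score would be another.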