Performance of morphological segmentation in Uyghur machine translation
[Objective] The advancement of neural machine translation (NMT) has dramatically changed the landscape of computational linguistics, resulting in unprecedented improvements in translation quality for numerous languages. These technological strides have enabled more precise and fluent translations, thus significantly enhancing cross-linguistic communication. Despite these advances, translating low-resource languages, especially those with complex morphological structures such as Uyghur, remains considerably challenging. This article rigorously assesses the impact of morphological segmentation on the quality of Uyghur NMT, focusing on translation between Uyghur and two high-resource languages: Chinese and English. Ultimately, the study aims to identify effective ways to improve the accuracy and fluency of Uyghur translations. [Methods] In Uyghur-Chinese and Uyghur-English NMT tasks, seven morphological segmentation methods are comprehensively evaluated, including a method proposed herein that incorporates self-supervised learning, as well as the widely used byte pair encoding (BPE) technique. State-of-the-art NMT models such as Transformer and DeltaLM are employed to ensure the relevance of the findings to current translation technologies. The evaluation relies on the BLEU, chrF2++, and TER metrics to provide a multifaceted understanding of translation quality. For each NMT task, five randomized experiments are conducted, and statistical tests are used to examine whether the morphological segmentation methods differ significantly from BPE in translation effectiveness. Furthermore, the effects of segmentation granularity and method-model compatibility on overall translation effectiveness are explored. [Results] This study assessed various morphological segmentation methods for Uyghur and compared their performance across different translation models. In terms of supervised versus unsupervised methods, supervised approaches demonstrated superior accuracy. The
CNN-BiLSTM-CRF method notably stood out, recording F-scores of 96.90% and 97.65% on the validation and test datasets, respectively, along with precision (P) values of 96.87% and 97.40% and recall (R) values of 97.02% and 98.00%. In the Uyghur-Chinese translation task, the BPE method, when used with the Transformer model, achieved average BLEU and chrF2++ scores of 33.25% and 31.81%, respectively, with a TER of 65.12%. With the DeltaLM model, it recorded higher average BLEU and chrF2++ scores of 38.83% and 36.87%, respectively, and a lower TER of 57.86%. For Uyghur-English translation, the Morfessor method excelled when used with the Transformer model, attaining the highest BLEU and chrF2++ scores, with averages of 28.35% and 47.49%, respectively, and a TER of 62.07%. Furthermore, the LMVR method performed notably well with the DeltaLM model, achieving the highest BLEU and chrF2++ scores and the lowest TER, with average values of 29.29%, 48.25%, and 59.91%, respectively. Additionally, the study found that the effectiveness of morphological segmentation varies significantly across model architectures and language pairs, indicating that no single method consistently outperforms the others in all scenarios. This variability underscores the intricate relationship between morphological segmentation accuracy and the resulting quality of neural machine translation. [Conclusions] The investigation underscores the pivotal role of morphological segmentation in enhancing NMT for low-resource languages such as Uyghur. The complex relationship between segmentation accuracy and translation quality suggests that optimal segmentation strategies may differ by NMT model and language pair. Notably, the application of self-supervised learning to morphological segmentation yielded promising results comparable to those of the BPE method, indicating potential for future advancements in NMT. We identify the need to develop tailored segmentation approaches aligned with specific NMT models to fully exploit their
capabilities in handling complex morphological structures. Future research should explore more adaptable and sophisticated segmentation techniques to further improve NMT performance for Uyghur and other morphologically rich low-resource languages.
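
The comparison of each segmentation method against BPE over five randomized runs can be illustrated with a small significance test. The abstract does not name the statistical test used, so the sketch below assumes an exact paired sign-flip (permutation) test on the mean score difference. The function name `paired_sign_flip_test` and the score lists are hypothetical placeholders for illustration, not values from the study.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired permutation (sign-flip) test.

    Assumption: run i of method A is paired with run i of method B
    (for example, same random seed or same data split).
    Returns the p-value for the observed mean difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical BLEU scores from five paired runs (placeholders only).
method_bleu = [28.1, 28.4, 28.3, 28.6, 28.2]
bpe_bleu = [27.9, 28.0, 28.1, 28.1, 28.0]
p_value = paired_sign_flip_test(method_bleu, bpe_bleu)
print(f"p = {p_value:.4f}")
```

With only five paired runs, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, so a parametric alternative such as a paired t-test may be preferable when a 0.05 threshold is required.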