Research on Tibetan Syllable Error Detection and Correction Model
To address the under-explored problem of automatic proofreading of Tibetan texts, this paper studies Tibetan error detection and correction models, focusing on equal-length texts at the level of Tibetan syllables. For corpus construction, this paper proposes a linguistically driven algorithm that builds Tibetan confusion sets of syllables with similar sounds and similar shapes. It also designs a noise-adding algorithm based on four kinds of confusion sets: phonetic similarity, morphological similarity, verb tense, and error-prone function words. For error detection, a BiGRU-Attention Tibetan syllable error detection model based on the pre-trained models Word2Vec and ELMo is applied. For error correction, a soft-masked+BERT Tibetan syllable correction network is deployed. Experiments demonstrate that the best F1 values for error detection and error correction reach 95.51% and 90.69%, respectively, at the syllable level, and 86.34% and 79.77%, respectively, at the sentence level.
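The noise-adding step can be illustrated with a minimal Python sketch. It is not the paper's algorithm, only one plausible reading of it: each syllable that has an entry in a confusion set is replaced, with some probability, by a confusable syllable, so the corrupted sentence keeps the same number of syllables as the original. The function name `add_noise`, the `error_rate` parameter, and the toy confusion entries (written in Wylie transliteration) are assumptions made for illustration.

```python
import random

TSHEG = "\u0f0b"  # Tibetan tsheg mark, which delimits syllables in running text


def add_noise(syllables, confusion_sets, error_rate=0.15, rng=None):
    """Return (noisy_syllables, labels) with the same length as the input.

    labels[i] is 1 if syllable i was replaced by a confusable one, else 0.
    `confusion_sets` maps a syllable to a list of confusable syllables
    (hypothetical structure; the paper does not specify its format).
    """
    rng = rng or random.Random()
    noisy, labels = [], []
    for syl in syllables:
        # Only consider candidates that actually differ from the source syllable.
        candidates = [c for c in confusion_sets.get(syl, ()) if c != syl]
        if candidates and rng.random() < error_rate:
            noisy.append(rng.choice(candidates))
            labels.append(1)
        else:
            noisy.append(syl)
            labels.append(0)
    return noisy, labels


# Toy confusion sets (illustrative only): genitive particles and verb tense forms.
confusion = {
    "gi": ["kyi", "gyi"],
    "sgrub": ["bsgrubs", "bsgrub"],
}
source = ["sgrub", "pa", "gi"]
noisy, labels = add_noise(source, confusion, error_rate=0.5, rng=random.Random(0))
print(TSHEG.join(noisy), labels)
```

Because substitutions are one-to-one at the syllable level, the source and corrupted sequences stay aligned, and the `labels` list can serve directly as supervision for a syllable-level detection model.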