
Research on Tibetan Syllable Error Detection and Correction Model

To address the lack of high-quality annotated corpora and the scarcity of work on error-correction tasks in research on automatic proofreading of Tibetan text, this paper takes equal-length texts whose basic unit is the Tibetan syllable as its object of study, analyzes the error types found in Tibetan text, and investigates models for Tibetan error detection and error correction. The main contributions are as follows. ① To address the lack of annotated corpora, the paper first proposes a Tibetan confusion set construction algorithm that incorporates linguistic knowledge and automatically builds confusion sets of phonetically similar, visually similar, and misspelled syllables; it then proposes a noise-adding algorithm that, drawing on the separate confusion sets for phonetic similarity, visual similarity, verb tense, and error-prone function words, replaces correct syllables with erroneous ones in equal-length texts. ② For error detection, a BiGRU-Attention Tibetan syllable error detection model based on the pre-trained models Word2Vec and ELMo is proposed. Experiments show that pre-training effectively improves Tibetan syllable error detection, with the ELMo-BiGRU-Attention model performing best: its syllable-level detection F1 is 90.91% and its sentence-level detection F1 is 83.24%. ③ For error correction, a soft-masked+BERT Tibetan syllable error correction network is proposed; the best model achieves syllable-level detection and correction F1 scores of 95.51% and 90.69%, and sentence-level detection and correction F1 scores of 86.34% and 79.77%, respectively.
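The noise-adding algorithm in contribution ① boils down to sampling confusable substitutes from per-category confusion sets while keeping the text length unchanged. The following is a minimal Python sketch of that idea only; the category names, placeholder syllables, and the corruption_rate parameter are illustrative assumptions, not the paper's actual confusion sets or code.

```python
import random

# Hypothetical confusion sets keyed by error category. In the paper these are
# built automatically (phonetically similar, visually similar, and misspelled
# syllables) plus sets for verb tense and error-prone function words; the
# entries below are placeholders, not real Tibetan syllables.
CONFUSION_SETS = {
    "phonetic":      {"syl_a": ["syl_a1", "syl_a2"]},
    "visual":        {"syl_b": ["syl_b1"]},
    "spelling":      {"syl_c": ["syl_c1", "syl_c2"]},
    "verb_tense":    {"syl_d": ["syl_d1"]},
    "function_word": {"syl_e": ["syl_e1"]},
}

def add_noise(syllables, corruption_rate=0.15, seed=None):
    """Replace some correct syllables with confusable ones, keeping length fixed.

    Returns the corrupted sequence and 0/1 labels (1 = corrupted position),
    i.e. the supervision an equal-length error detection model needs.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for syl in syllables:
        # Gather every confusable candidate for this syllable across categories.
        candidates = [c for cset in CONFUSION_SETS.values() for c in cset.get(syl, [])]
        if candidates and rng.random() < corruption_rate:
            corrupted.append(rng.choice(candidates))
            labels.append(1)
        else:
            corrupted.append(syl)
            labels.append(0)
    return corrupted, labels

# Usage: a clean, syllable-segmented sentence in; an equal-length noisy
# sentence and its error labels out.
noisy, labels = add_noise(["syl_a", "syl_x", "syl_c"], corruption_rate=0.5, seed=0)
```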
Research on Tibetan Syllable Error Detection and Correction Model
To address the less-touched issue of automatic proofreading of Tibetan texts, this paper studies Tibetan error detection and correction models, focusing on equal-length texts based on Tibetan syllables. For corpus construction, it proposes a linguistically driven Tibetan confusion set construction algorithm that establishes confusion sets of similar sounds and shapes, and designs a noise-adding algorithm based on the different confusion sets for phonetic similarity, morphological similarity, verb tense, and error-prone function words. For error detection, a BiGRU-Attention Tibetan syllable error detection model based on the pre-trained models Word2Vec and ELMo is applied. For error correction, a soft-masked+BERT Tibetan syllable correction network is deployed. Experiments demonstrate that the best F1 values for error detection and error correction reach 95.51% and 90.69% at the syllable level, and 86.34% and 79.77% at the sentence level, respectively.
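The soft-masked+BERT correction network follows the Soft-Masked BERT design: a light detection network predicts an error probability p_i for every syllable, and the correction network reads a soft-masked embedding e'_i = p_i · e_[MASK] + (1 − p_i) · e_i, so that likely errors are softly replaced by the mask embedding before correction. The sketch below illustrates only this coupling; the layer sizes and the plain Transformer encoder standing in for BERT are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SoftMaskedCorrector(nn.Module):
    """Minimal detection + soft-masking + correction coupling (illustrative sizes)."""

    def __init__(self, vocab_size, d_model=256, mask_id=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mask_id = mask_id
        # Detection network: BiGRU + linear -> per-syllable error probability.
        self.det_gru = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)
        self.det_out = nn.Linear(d_model, 1)
        # Correction network: a plain Transformer encoder stands in for BERT here.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.corr_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.corr_out = nn.Linear(d_model, vocab_size)

    def forward(self, syllable_ids):
        emb = self.embed(syllable_ids)                   # (B, T, d)
        det_hidden, _ = self.det_gru(emb)
        p_err = torch.sigmoid(self.det_out(det_hidden))  # (B, T, 1), error probability
        mask_emb = self.embed.weight[self.mask_id]       # embedding of the [MASK] id
        # Soft masking: e'_i = p_i * e_mask + (1 - p_i) * e_i
        soft_emb = p_err * mask_emb + (1.0 - p_err) * emb
        corr_logits = self.corr_out(self.corr_encoder(soft_emb))
        return p_err.squeeze(-1), corr_logits            # detection probs + correction logits

# Usage on a toy batch of syllable ids (vocabulary size and ids are placeholders).
model = SoftMaskedCorrector(vocab_size=1000)
probs, logits = model(torch.randint(0, 1000, (2, 8)))
```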

Tibetan syllable; error detection model; error correction model; pre-training; soft-masked

珠杰、郑任公、拉巴顿珠、德庆卓玛、顿珠次仁


School of Information Science and Technology, Tibet University, Lhasa, Tibet 540000, China

Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization, Lhasa, Tibet 540000, China


2024

Journal of Chinese Information Processing
Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences


Indexed in: CSTPCD; CHSSCD; Peking University Core Journals
Impact factor: 0.8
ISSN:1003-0077
Year, Volume (Issue): 2024, 38(12)