Research on Tibetan Syllable Error Detection and Correction Model

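Chinese abstract (translated): Research on automatic proofreading of Tibetan text lacks high-quality annotated corpora, and the correction task has rarely been studied. To address these problems, this paper takes equal-length text segmented into Tibetan syllables as its object of study, analyzes the error types found in Tibetan text, and develops models for Tibetan error detection and error correction. Its main contributions are as follows. (1) To address the shortage of annotated corpora, it first proposes a Tibetan confusion-set construction algorithm that incorporates linguistic knowledge and automatically builds confusion sets of phonetically similar, visually similar, and misspelled syllables; it then proposes a noising algorithm that draws on separate confusion sets for phonetic similarity, visual similarity, verb tense, and error-prone function words to replace correct syllables with erroneous ones in equal-length text. (2) For error detection, it proposes a BiGRU-Attention Tibetan syllable error detection model built on the pre-trained models Word2Vec and ELMo. Experiments show that pre-trained models effectively improve Tibetan syllable error detection; the ELMo-BiGRU-Attention model performs best, with a detection F1 of 90.91% at the syllable level and 83.24% at the sentence level. (3) For error correction, it proposes a soft-masked+BERT Tibetan syllable correction network; the best model achieves detection and correction F1 scores of 95.51% and 90.69% at the syllable level, and 86.34% and 79.77% at the sentence level.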
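The noising step described in the abstract (replacing correct syllables with confusable ones while keeping the text the same length) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the `rate` parameter, and the placeholder confusion sets are assumptions made for illustration, and the paper's actual sampling strategy is not given in this record. The only Tibetan-specific fact used is that syllables are delimited by the tsheg character (U+0F0B).

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(text):
    """Split a Tibetan string into syllables on the tsheg (simplified: ignores other punctuation)."""
    return [s for s in text.split(TSHEG) if s]


def add_noise(syllables, confusion_sets, rate=0.15, rng=None):
    """Replace some syllables with a confusable one, one-for-one.

    Hypothetical helper: `confusion_sets` maps a correct syllable to a list of
    confusable syllables (sound-alike, look-alike, other verb tense, ...).
    Returns (noisy_syllables, labels), where labels[i] is 1 if position i was
    replaced. The output always has the same length as the input.
    """
    rng = rng or random.Random()
    noisy, labels = [], []
    for syl in syllables:
        candidates = [c for c in confusion_sets.get(syl, ()) if c != syl]
        if candidates and rng.random() < rate:
            noisy.append(rng.choice(candidates))
            labels.append(1)
        else:
            noisy.append(syl)
            labels.append(0)
    return noisy, labels


# Placeholder tokens stand in for real Tibetan syllables.
confusion = {
    "s1": ["s1_sound_alike"],
    "s3": ["s3_look_alike", "s3_other_tense"],
}
noisy, labels = add_noise(["s1", "s2", "s3"], confusion, rate=1.0, rng=random.Random(0))
```

Because each replacement is one syllable for one syllable, the noisy sequence stays aligned with the clean one, which is what lets the per-position labels serve as detection targets and the clean syllables as correction targets.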

English abstract: To address the under-explored problem of automatic proofreading of Tibetan texts, this paper studies Tibetan error detection and correction models, focusing on equal-length texts segmented into Tibetan syllables. For corpus construction, it proposes a linguistically driven Tibetan confusion set construction algorithm that establishes confusion sets of syllables with similar sounds and shapes. It also designs a noise-adding algorithm based on separate confusion sets for phonetic similarity, shape similarity, verb tense, and error-prone function words. For error detection, a BiGRU-Attention Tibetan syllable error detection model based on the pre-trained models Word2Vec and ELMo is applied. For error correction, a soft-masked+BERT Tibetan syllable correction network is deployed. Experiments demonstrate that the best F1 values for error detection and error correction reach 95.51% and 90.69% at the syllable level and 86.34% and 79.77% at the sentence level, respectively.
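The "soft-masked" idea named in the abstract is commonly formulated, in the original Soft-Masked BERT design, as blending each input embedding with the mask embedding in proportion to the detector's error probability: e'_i = p_i * e_mask + (1 - p_i) * e_i. The sketch below shows only that blending step with plain Python lists. It assumes the paper follows the standard formulation; this record does not say how the authors adapt it to Tibetan syllables, and the function and argument names are hypothetical.

```python
def soft_mask(embeddings, error_probs, mask_embedding):
    """Blend each embedding toward the mask embedding by its error probability.

    embeddings:     list of per-syllable embedding vectors (lists of floats)
    error_probs:    detector output p_i in [0, 1] for each syllable
    mask_embedding: embedding vector of the [MASK] token
    Returns the soft-masked embeddings fed to the correction network.
    """
    blended = []
    for vec, p in zip(embeddings, error_probs):
        blended.append([p * m + (1.0 - p) * x for x, m in zip(vec, mask_embedding)])
    return blended


# p = 0 keeps the syllable embedding, p = 1 replaces it with the mask embedding.
result = soft_mask([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]], [0.0, 1.0, 0.5], [9.0, 9.0])
```

A syllable the detector considers likely wrong is therefore mostly hidden from the correction network, while a syllable it considers correct passes through nearly unchanged.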

English keywords:

Tibetan syllable; error detection model; error correction model; pre-training; soft-masked

Authors:

珠杰, 郑任公, 拉巴顿珠, 德庆卓玛, 顿珠次仁

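Author affiliations:

School of Information Science and Technology, Tibet University (西藏大学信息科学技术学院), Lhasa, Tibet 540000

Provincial-Ministerial Collaborative Innovation Center for Tibet Informatization (西藏信息化省部共建协同创新中心), Lhasa, Tibet 540000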

Chinese keywords (translated):

Tibetan syllable; error detection model; error correction model; pre-training; soft masking

Publication year:

2024

Journal of Chinese Information Processing (中文信息学报)

Chinese Information Processing Society of China; Institute of Software, Chinese Academy of Sciences

Indexed in: CSTPCD, CHSSCD, Peking University Core Journals (北大核心)

Impact factor: 0.8

ISSN: 1003-0077

Year, volume (issue): 2024, 38(12)