Sanskrit-Tibetan Text Recognition Based on Deep Learning

Abstract: [Objective] Sanskrit-Tibetan text recognition is an important preliminary step for research on automatic sorting, lexical analysis, and automatic proofreading. Existing rule-based Sanskrit-Tibetan text recognition methods suffer from a number of problems, such as the inability to reliably recognize short Sanskrit words. [Methods] On a self-built Sanskrit-Tibetan text recognition dataset, three approaches are compared experimentally: a method based on a bidirectional long short-term memory network (Bi-LSTM) and self-attention, a method based on the pre-trained language model CINO, and rule-based methods. Their recognition results are then analyzed and the best-performing method is selected. [Results] The model based on Bi-LSTM and the self-attention mechanism achieves macro-averaged precision, recall, and F1 of 98.09%, 99.22%, and 98.65%, respectively, outperforming the multilingual pre-trained model CINO and the three rule-based methods. [Conclusions] When the skip-gram, CBOW, and GloVe Tibetan character representation models are trained on the same small-scale, duplicate-free dataset, CBOW yields better character representations than the other two. Given the same training data, the model based on Bi-LSTM and self-attention outperforms both the multilingual pre-trained model CINO and the rule-based Sanskrit-Tibetan text recognition models.
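The three figures reported in the Results can be checked against one another. The sketch below assumes that the first figure (98.09%) is a precision value and that F1 is the harmonic mean of the macro precision and macro recall. The abstract does not state its averaging convention, so this is a plausibility check rather than the authors' evaluation procedure. The function name `f1_from` is illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the abstract (percent).
reported_p, reported_r, reported_f1 = 98.09, 99.22, 98.65

# Assumption: F1 is derived from the macro precision and macro recall,
# not averaged over per-class F1 scores.
computed_f1 = f1_from(reported_p, reported_r)
print(f"computed F1 = {computed_f1:.2f}%, reported F1 = {reported_f1:.2f}%")
```

Under that assumption the computed value rounds to the reported 98.65%, which supports reading the first figure as precision rather than overall accuracy.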

Keywords:

Tibetan information processing; Sanskrit-Tibetan text recognition; character representation; STTRM_BS model

Authors:

才让叁智, 仁增多杰, 多拉, 索南尖措

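Author affiliations:

School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000

Faculty of Chinese Language and Literature, Northwest Minzu University, Lanzhou, Gansu 750000

National and Local Joint Engineering Research Center for Tibetan Information Technology, Tibet University, Lhasa, Tibet 850000

Provincial-Ministerial Co-constructed State Key Laboratory of Tibetan Intelligent Information Processing and Application, Lhasa, Tibet 850000

State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining, Qinghai 810008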

Publication year:

2024

DOI:

10.6043/j.issn.0438-0479.202311026

Journal of Xiamen University (Natural Science)

Xiamen University

CSTPCD, Peking University Core Journals

Impact factor: 0.449

ISSN: 0438-0479

Year, volume (issue): 2024, 63(6)