Sanskrit-Tibetan text recognition based on deep learning
[Objective] Sanskrit-Tibetan text recognition is an essential preliminary step for research on automatic sorting, lexical analysis, and automatic correction of Tibetan text. However, rule-based Sanskrit-Tibetan text recognition methods have numerous problems, such as the inability to identify short Sanskrit words effectively. [Methods] On a self-built Sanskrit-Tibetan text recognition dataset, three approaches are compared experimentally: a method based on Bi-LSTM and Self-Attention, a method based on the pre-trained language model CINO, and rule-based methods. Their recognition results are then analyzed, and the best-performing method is selected. [Results] The macro precision, recall, and F1 value of the Sanskrit-Tibetan text recognition model based on Bi-LSTM and the Self-Attention mechanism reach 98.09%, 99.22%, and 98.65%, respectively, outperforming both the multilingual pre-trained model CINO and the three rule-based methods. [Conclusions] When Tibetan character representation models based on skip-gram, CBOW, and GloVe are trained on the same small-scale, duplicate-free dataset, CBOW yields better character representations than the other two. With the same training data, the Sanskrit-Tibetan text recognition model based on Bi-LSTM and the Self-Attention mechanism performs better than both the multilingual pre-trained model CINO and the rule-based Sanskrit-Tibetan text recognition models.
Tibetan information processing; Sanskrit-Tibetan text recognition; character representation; STTRM_BS model
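
The macro-averaged scores reported in the results can be illustrated with a minimal sketch. It assumes a two-class labeling task with the hypothetical labels "tibetan" and "sanskrit"; the function name `macro_prf` and the toy data are illustrative only and are not taken from the paper. Macro F1 is computed here as the mean of per-class F1 values; some evaluations instead take the harmonic mean of macro precision and macro recall, and the abstract does not state which convention was used.

```python
from collections import Counter


def macro_prf(gold, pred):
    """Return macro-averaged (precision, recall, F1) over all labels seen."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted class p wrongly
            fn[g] += 1   # missed an instance of class g

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    # Macro averaging weights every class equally, regardless of its size.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (hypothetical labels, not from the paper's dataset).
gold = ["tibetan", "tibetan", "tibetan", "sanskrit", "sanskrit"]
pred = ["tibetan", "tibetan", "sanskrit", "sanskrit", "sanskrit"]
macro_p, macro_r, macro_f1 = macro_prf(gold, pred)
print(f"P={macro_p:.4f} R={macro_r:.4f} F1={macro_f1:.4f}")
```

Because macro averaging gives each class equal weight, a minority class such as short Sanskrit words influences the score as much as the majority class, which is one reason it suits an imbalanced recognition task.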