Pinyin Uyghur Language Recognition Method Based on TI-FastText

Liu Xuan¹, Ji Duo¹, Teng Chaoyue²

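Author information

1. School of Public Security Information Technology and Intelligence, Criminal Investigation Police University of China, Shenyang, Liaoning 110000
2. Cyber Police Detachment, Shenzhen Public Security Bureau, Shenzhen, Guangdong 518000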

Abstract

Uyghur is one of the most important languages of the Xinjiang Uyghur Autonomous Region of China. Pinyin Uyghur emerged because of difficulties in computer processing and information retrieval for Uyghur. It makes the digital processing of Uyghur more convenient, but it lacks a fully unified standard, tends to be colloquial, appears mostly on online social media, and is hard to collect, so computers have difficulty recognizing it. To address this, a method that fuses TF-IDF with the FastText model is introduced to recognize Pinyin Uyghur. Compared with traditional methods, the novelty of this method is that TF-IDF extracts the distinctive linguistic features of Pinyin Uyghur in greater depth, while fusion with the FastText model reduces the limitations of a single model and exploits its high sensitivity to word order and low-frequency vocabulary to recognize Uyghur more accurately. In addition, to improve the robustness of the model, a data forgery technique is introduced to obtain a large multilingual dataset. Experimental results show that the method identifies Pinyin Uyghur with an accuracy above 95%. Pinyin Uyghur recognition technology can help process and manage Uyghur information in the digital era, and can promote research on and application of natural language processing and artificial intelligence in Uyghur recognition.
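
As a rough illustration of the TF-IDF side of the approach described in the abstract, the following Python sketch weights character trigrams with TF-IDF and assigns a text to the language whose centroid is most similar. It is not the authors' implementation: the FastText component and the fusion step are not reproduced, and the class name `TfidfCentroidClassifier`, the trigram size, the nearest-centroid rule, and the toy strings are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split lowercased text into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TfidfCentroidClassifier:
    """Hypothetical helper: TF-IDF over character trigrams + nearest centroid."""

    def fit(self, texts, labels):
        docs = [Counter(char_ngrams(t)) for t in texts]
        n_docs = len(docs)
        # Document frequency: in how many training texts each n-gram occurs.
        df = Counter(g for d in docs for g in d)
        # Smoothed IDF, always positive.
        self.idf = {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}
        # Sum the unit-length TF-IDF vectors of each class, then renormalize.
        sums = defaultdict(lambda: defaultdict(float))
        for d, y in zip(docs, labels):
            for g, w in self._weigh(d).items():
                sums[y][g] += w
        self.centroids = {y: self._normalize(v) for y, v in sums.items()}
        return self

    def _weigh(self, counts):
        # n-grams never seen in training carry no weight.
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        return self._normalize(vec)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def predict(self, text):
        q = self._weigh(Counter(char_ngrams(text)))
        # Cosine similarity between the query and each class centroid.
        scores = {y: sum(q.get(g, 0.0) * w for g, w in c.items())
                  for y, c in self.centroids.items()}
        return max(scores, key=scores.get)


# Toy strings for illustration only; they are not from the paper's dataset.
TOY_SAMPLES = [
    ("yaxshimusiz", "uyghur"),
    ("rehmet sizge", "uyghur"),
    ("men yaxshi", "uyghur"),
    ("thank you very much", "english"),
    ("how are you today", "english"),
    ("ni hao ma", "mandarin_pinyin"),
    ("xie xie ni", "mandarin_pinyin"),
]

clf = TfidfCentroidClassifier().fit(
    [text for text, _ in TOY_SAMPLES],
    [label for _, label in TOY_SAMPLES],
)
print(clf.predict("yaxshimusiz rehmet"))
```

Character n-grams are used here because spelling in Pinyin Uyghur is not standardized, so subword overlap may be more stable than whole-word matching. A real system would need far more data and, following the abstract, a FastText model fused with these features.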

Key words

Pinyin Uyghur/FastText model/recognition/Uyghur

Publication year

2024

Journal of People's Public Security University of China (Science and Technology)

People's Public Security University of China

Impact factor: 0.33

ISSN: 1007-1784

Number of references: 23