Research on the text classification performance of different word segmenters in HanLP

Text classification plays an important role in search engine technology. Its first step is word segmentation: the more accurate the segmentation, the more precise the subsequent text feature extraction. This paper therefore investigates how strongly the choice of segmenter in HanLP affects text classification results. Two segmenters are used, the content word segmenter and the bigram (binary grammar) segmenter. Each is used to segment the corpus, and the resulting feature vectors are fed into naive Bayes and support vector machine classifiers for training and testing. In the evaluation, the combination with the highest precision P, recall R, and F1 score is bigram segmentation with the support vector machine. The experimental data show that the bigram segmenter considerably improves text classification accuracy, but the larger number of segmentation features affects the speed of the classification model.
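The evaluation metrics named in the abstract (precision P, recall R, and F1) can be illustrated with a minimal sketch. The labels and predictions below are hypothetical and are not the paper's data, and the macro average is only one possible averaging choice, since the abstract does not state which one was used.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of P, R, and F1 over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical gold labels and classifier output (illustration only).
y_true = ["sports", "sports", "finance", "finance", "tech", "tech"]
y_pred = ["sports", "finance", "finance", "finance", "tech", "sports"]

scores = per_class_prf(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(scores)
for label, (p, r, f1) in scores.items():
    print(f"{label}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
print(f"macro: P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
```

Running the same computation once per segmenter and classifier pairing would give the kind of side-by-side comparison the abstract describes.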

Wang Lanlan (汪兰兰)

School of Computer and Artificial Intelligence, Wuhan University of Engineering Science, Wuhan 430200

Keywords: content word segmenter; bigram segmenter; naive Bayes; support vector machine

2024

Modern Computer (现代计算机)
Zhongda Holdings (中大控股)

Impact factor: 0.292
ISSN: 1007-1423
Year, volume (issue): 2024, 30(14)