Research on the text classification performance of different word segmenters in HanLP
Text classification plays an important role in search engine technology. Its first step is word segmentation: the more accurate the segmentation, the more accurate the subsequent text feature extraction. This paper therefore examines how much the choice of HanLP word segmenter affects text classification results. Two segmenters are compared, the content-word segmenter and the bigram-grammar segmenter. Each is used to segment the corpus, and the resulting feature vectors are fed into naive Bayes and support vector machine (SVM) classifiers for training and testing. In the evaluation, the combination of bigram-grammar segmentation and the support vector machine achieves the highest precision P, recall R, and F1 score. The experimental data show that bigram-grammar segmentation can greatly improve the accuracy of text classification, but the larger number of features it produces reduces the classification speed of the model.
content-word segmenter; bigram-grammar segmenter; naive Bayes; support vector machine
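The pipeline summarized in the abstract (segment, build features, train a classifier, report P, R and F1) can be illustrated with a minimal sketch. It is not the paper's implementation. The function `char_bigrams` is a hypothetical stand-in that emits overlapping character bigrams and does not call HanLP. The classifier is a small hand-written multinomial naive Bayes, the support vector machine is omitted, and the four training sentences are invented toy data.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Stand-in tokenizer: overlapping character bigrams (NOT a HanLP segmenter)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, tokenize):
        self.tokenize = tokenize
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in self.tokenize(doc):
                if tok in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision_recall_f1(y_true, y_pred, positive):
    """Precision P, recall R and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy corpus (invented for illustration only).
train_docs = ["足球比赛进球", "篮球比赛得分", "股票市场上涨", "基金市场下跌"]
train_labels = ["sports", "sports", "finance", "finance"]
model = NaiveBayes().fit(train_docs, train_labels, char_bigrams)

test_docs = ["足球得分", "股票下跌"]
test_labels = ["sports", "finance"]
predicted = [model.predict(d) for d in test_docs]
print(precision_recall_f1(test_labels, predicted, positive="sports"))
```

Swapping `char_bigrams` for a different tokenizer changes the size of `model.vocab`. A larger vocabulary means more features for the classifier to score, which is the mechanism behind the trade-off between accuracy and classification speed noted in the abstract.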