Semi-supervised text classification with information retrieval techniques
Jia Zhiyang (贾志洋) 1, Gao Wei (高炜) 2, Wang Yonggang (王勇刚) 1
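Author information
- 1. Tourism and Culture College, Yunnan University, Lijiang 674100, Yunnan
- 2. School of Mathematical Sciences, Soochow University, Suzhou 215006, Jiangsu; School of Information, Yunnan Normal University, Kunming 650092, Yunnan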
Abstract
We assume that the results a search engine returns, together with the query keywords that produced them, are related to a particular text category. Under this assumption, for the text classification problem, feature words are extracted from a small initial sample set and used to construct queries; the queries are sent to a search engine, and web pages are downloaded from the results. The downloaded pages are deduplicated, denoised, and reduced to their main text; their categories are then predicted, and they are added to the initial sample set. Finally, a Naive Bayes text classifier is trained on the expanded sample set, and its classification performance is tested. Experimental results show that the precision of the semi-supervised classifier combined with information retrieval techniques is considerably higher than that of a classifier built from the small sample set alone.
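The pipeline described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `fake_search` and `CORPUS` are hypothetical stand-ins for a real search engine, feature words are chosen by raw frequency, new pages are labeled by a classifier trained on the seed set, only exact duplicates are removed, and the noise reduction and text extraction steps are omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, samples):
        # samples: list of (text, label) pairs
        self.class_docs = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in samples:
            self.class_docs[label] += 1
            for w in tokenize(text):
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        self.total_docs = sum(self.class_docs.values())
        return self

    def predict(self, text):
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_docs / self.total_docs)
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are ignored
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def feature_words(samples, label, k=2):
    """Top-k most frequent words of one class (assumed feature selection)."""
    counts = Counter()
    for text, lab in samples:
        if lab == label:
            counts.update(tokenize(text))
    return [w for w, _ in counts.most_common(k)]


def expand_and_retrain(seed, search, k=2):
    """Query per class, label the retrieved pages, retrain on the larger set."""
    seed_clf = NaiveBayes().fit(seed)
    seen = {text for text, _ in seed}
    expanded = list(seed)
    for label in sorted(seed_clf.class_docs):
        query = " ".join(feature_words(seed, label, k))
        for page in search(query):
            if page in seen:  # exact-duplicate removal only
                continue
            seen.add(page)
            expanded.append((page, seed_clf.predict(page)))
    return NaiveBayes().fit(expanded), expanded


# Hypothetical in-memory "web" used in place of a real search engine.
CORPUS = [
    "football match goal team win",
    "stock market price bank trade",
    "team coach football season",
    "bank interest market loan",
]


def fake_search(query):
    words = set(query.split())
    return [doc for doc in CORPUS if words & set(doc.split())]


seed = [("football team goal", "sports"), ("market bank stock", "finance")]
model, expanded = expand_and_retrain(seed, fake_search)
```

In this toy run the seed set of two samples grows to six, and the retrained classifier can use words such as "loan" or "interest" that never appeared in the seed samples. A real system would replace `fake_search` with calls to a search engine and add the page-cleaning steps the abstract mentions.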
Key words
text classification/semi-supervised learning/information retrieval/search engine
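Funding
National Natural Science Foundation of China (60903131)
Scientific Research Fund of the Yunnan Provincial Department of Education (2010Y108)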
Publication year
2012