
A Classification Method Based on Centroid and Ontology

The traditional TFIDF model is ill-suited to computing the feature weights of documents in a root set, so this paper proposes a new weighting model, TFIDF-2. In addition, three heuristic rules are given for obtaining the centroid vector of a root set. Classifying a document by computing its similarity to the centroids is only a preliminary application of the centroid; for this step, a new method for calculating the similarity between a document and a centroid is proposed. A series of comparative experiments verifies that this classification method is more accurate and more efficient than traditional classification algorithms. Finally, the paper validates the effectiveness of combining ontology with centroids for extracting relevant documents from an unlabeled dataset.
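The abstract does not give the TFIDF-2 formula, the three heuristic rules, or the new similarity measure, so none of them can be reproduced here. As a rough point of reference only, the baseline that the paper sets out to improve can be sketched as follows: standard TF-IDF weights, a centroid taken as the plain mean of each class's document vectors, and cosine similarity for assignment. All function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def build_idf(docs):
    """Smoothed inverse document frequency over tokenized docs (standard TF-IDF, not TFIDF-2)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def vectorize(doc, idf):
    """Sparse TF-IDF vector for one tokenized doc; terms unseen in training are dropped."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}


def mean_centroid(vectors):
    """Assumption: the centroid is the plain average of the class vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(labeled_docs):
    """labeled_docs maps a class label to a list of tokenized documents."""
    all_docs = [doc for docs in labeled_docs.values() for doc in docs]
    idf = build_idf(all_docs)
    centroids = {
        label: mean_centroid([vectorize(doc, idf) for doc in docs])
        for label, docs in labeled_docs.items()
    }
    return idf, centroids


def classify(doc, idf, centroids):
    """Assign the label whose centroid is most similar to the document."""
    vec = vectorize(doc, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy data, for illustration only.
TRAIN = {
    "sports": [["football", "goal", "team"], ["team", "match", "goal"]],
    "finance": [["stock", "market", "price"], ["market", "bank", "price"]],
}
IDF, CENTROIDS = train_centroids(TRAIN)
print(classify(["goal", "team", "win"], IDF, CENTROIDS))
```

In this baseline the weighting scheme, the centroid construction, and the similarity function are independent pieces, which are the three places where the abstract says the paper substitutes its own methods.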

centroid; text classification; TFIDF; focused crawling; ontology

Wang Hui (王辉), Zuo Wanli (左万利), Yuan Hua (袁华)


College of Computer Science and Technology, Jilin University, Changchun 130012, China


National Natural Science Foundation of China (60373099); Key Laboratory Fund of the Ministry of Education (93K-17)

2007

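Journal of Computer Research and Development (计算机研究与发展)
Institute of Computing Technology, Chinese Academy of Sciences; China Computer Federation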


Indexed in: CSTPCD; CSCD; PKU Core Journals (北大核心)
Impact factor: 2.649
ISSN:1000-1239
Year, volume (issue): 2007, 44(z2)