A Text Classification Method Based on Centroid and Ontology

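Chinese abstract (translated): Because the traditional TFIDF model is poorly suited to computing feature weights for documents in a root set, this paper proposes a new weighting method, the TFIDF-2 model. It also gives three heuristic rules for obtaining the centroid vector of the documents in a root set. Classifying text by computing the similarity between a document and the centroids is only a preliminary application of the centroid; in this process, a new method for computing the similarity between a document and a centroid is proposed. A series of comparative experiments verifies that this classification method is more accurate and more efficient than traditional classification algorithms. Finally, the paper validates the effectiveness of combining an ontology with centroids to extract relevant documents from an unlabeled dataset.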

English title: A Classification Method Based on Centroid and Ontology

English abstract: To address the unsuitability of the traditional TFIDF model for calculating the feature weights of documents in a root set, this paper proposes a new model named TFIDF-2. In addition, three heuristic rules are given for obtaining the centroid vector corresponding to a root set. A document can be classified by calculating its similarity to the centroids; this is only a preliminary application of the centroid. During this process, a new method for calculating the similarity between a document and a centroid is proposed. A series of experiments verifies that this classification method is more accurate and more efficient than traditional classification methods. Finally, the paper validates the effectiveness of combining ontology with centroids for extracting relevant documents from an unlabeled dataset.
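The general idea of centroid-based classification mentioned in the abstract can be illustrated with a minimal sketch. This record does not give the TFIDF-2 weighting, the three heuristic rules, or the paper's similarity measure, so the sketch substitutes standard smoothed TF-IDF, a plain mean vector as the centroid, and cosine similarity. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Standard smoothed TF-IDF for tokenized docs (NOT the paper's TFIDF-2)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (f / len(doc)) * idf[t] for t, f in tf.items()})
    return vectors, idf

def centroid(vectors):
    """Plain mean of sparse vectors (an assumption; the paper uses heuristic rules)."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors (stand-in for the paper's measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def train_centroids(labeled_docs):
    """Build one centroid per class from {label: [tokenized docs]}."""
    labels = [label for label, docs in labeled_docs.items() for _ in docs]
    docs = [d for ds in labeled_docs.values() for d in ds]
    vectors, idf = tfidf_vectors(docs)
    groups = {}
    for label, v in zip(labels, vectors):
        groups.setdefault(label, []).append(v)
    return {label: centroid(vs) for label, vs in groups.items()}, idf

def classify(doc, centroids, idf):
    """Assign the label whose centroid is most similar to the document."""
    tf = Counter(doc)
    vec = {t: (f / len(doc)) * idf[t] for t, f in tf.items() if t in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy data, purely illustrative.
train = {
    "sports": [["football", "match", "goal"], ["team", "goal", "league"]],
    "tech": [["computer", "software", "code"], ["software", "network", "code"]],
}
centroids, idf = train_centroids(train)
print(classify(["goal", "team"], centroids, idf))  # sports
```

Each class is summarized by a single vector, so classifying a document costs one similarity computation per class rather than one per training document, which is the usual efficiency argument for centroid classifiers.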

English keywords:

centroid; text classification; TFIDF; focused crawling; ontology

Authors:

Wang Hui (王辉), Zuo Wanli (左万利), Yuan Hua (袁华)

Affiliation:

College of Computer Science and Technology, Jilin University, Changchun 130012

Keywords:

centroid; text classification; TFIDF; focused crawling; ontology

Funding:

National Natural Science Foundation of China; Key Laboratory Fund of the Ministry of Education

Grant numbers:

60373099; 93K-17

Publication year:

2007

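Journal of Computer Research and Development (计算机研究与发展)

Institute of Computing Technology, Chinese Academy of Sciences; China Computer Federation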

CSTPCD; CSCD; Peking University Core Journals (北大核心)

Impact factor: 2.649

ISSN: 1000-1239

Year, volume (issue): 2007, 44(z2)

Cited by: 3
References: 1