A Method for Clustering E-business Contents

Abstract: With the rapid development of the deep web, high-quality pre-processing and extraction of data from these web sources are essential. Clustering is a crucial step in this data processing. This paper presents a unified solution for clustering e-business web contents. First, the obtained web contents are segmented into words, and the segmentation results are analyzed statistically to tune the document frequency (DF) so that the dimensionality of the feature vectors representing the web contents stays under control. Next, term frequency (TF) and inverse document frequency (IDF) are used to form a weighted vector matrix, which is then used to cluster the obtained web contents. Experiments show that this approach is capable of clustering e-business web contents with reasonable recall and precision.
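
The pipeline outlined in the abstract (segmentation, DF-based dimensionality control, TF-IDF weighting, clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' method: whitespace splitting stands in for word segmentation, the `min_df` and `max_df_ratio` thresholds are hypothetical, the weight is the common `tf * log(N / df)` variant, and a small cosine-based k-means is used because the abstract does not name a clustering algorithm. The sample documents are invented for illustration.

```python
import math
from collections import Counter


def segment(text):
    # Assumption: whitespace tokenization stands in for word segmentation.
    return text.lower().split()


def select_vocabulary(token_lists, min_df=2, max_df_ratio=0.9):
    # Keep terms whose document frequency lies within the (hypothetical)
    # thresholds, which bounds the dimensionality of the feature vectors.
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df_ratio * n_docs)
    return vocab, df


def normalize(vec):
    # Scale to unit length so dot products equal cosine similarities.
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else list(vec)


def tfidf_matrix(token_lists, vocab, df):
    # One unit-length row per document, weight = tf * log(N / df).
    n_docs = len(token_lists)
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        rows.append(normalize(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        ))
    return rows


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cluster(rows, k, iterations=10):
    # Assumed clustering step: k-means on cosine similarity with a
    # deterministic farthest-first initialization.
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = min(rows, key=lambda r: max(dot(r, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: dot(r, centroids[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = normalize(
                    [sum(col) / len(members) for col in zip(*members)]
                )
    return labels


# Invented sample pages, not data from the paper.
docs = [
    "buy laptop computer online cheap",
    "buy computer laptop online warranty",
    "buy gaming laptop computer online",
    "buy dress fashion online summer",
    "buy fashion dress online evening",
    "buy silk dress fashion online",
]
token_lists = [segment(d) for d in docs]
vocab, df = select_vocabulary(token_lists)
matrix = tfidf_matrix(token_lists, vocab, df)
labels = cluster(matrix, k=2)
print(vocab)   # ['computer', 'dress', 'fashion', 'laptop']
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this toy run the DF filter drops both the terms that occur in every document ("buy", "online") and the terms that occur only once, leaving a four-dimensional feature space in which the two product groups separate cleanly.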

Keywords:

Clustering; Data extraction; Deep Web; TF-IDF; Words segmentation

Authors:

Liu Ronghui, Zheng Jianguo, Wang Xiang

Affiliation:

Sch. of Manage., Donghua Univ., Shanghai, China

Conference location:

Beidaihe (CN)

Conference proceedings:

2010 WASE International Conference on Information Engineering

Pages:

43-46

Publication year:

2010

DOI:

10.1109/ICIE.2010.106