A Method for Clustering E-business Contents

With the rapid development of the deep web, high-quality pre-processing and extraction of data from these web sources have become essential, and clustering is a crucial step in that processing. This paper presents a unified solution for clustering e-business web content. First, the retrieved web content is segmented into words, and the segmentation results are analyzed statistically to tune the document frequency (DF) so that the dimensionality of the feature vectors representing the web content stays under control. Next, term frequency (TF) and inverse document frequency (IDF) are used to form a weighted vector matrix, which is then used to cluster the retrieved content. Experiments show that this approach can cluster e-business web content with reasonable recall and precision.
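
The pipeline described in the abstract (segmentation, DF-based pruning, TF-IDF weighting, clustering) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the documents are taken to be already segmented into word lists, the DF bounds `min_df` and `max_df_ratio` are hypothetical tuning knobs, and the clustering step is a simple single-pass "leader" scheme over cosine similarity, since the abstract does not name the clustering algorithm used.

```python
import math
from collections import Counter


def select_vocabulary(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms whose document frequency lies within the (assumed) bounds."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df_ratio * n_docs)
    return vocab, df


def tfidf_rows(docs, vocab, df):
    """Build one L2-normalised TF-IDF row per document over the pruned vocabulary."""
    n_docs = len(docs)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        row = [tf[t] / length * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm for w in row] if norm > 0 else row)
    return rows


def leader_cluster(rows, threshold=0.5):
    """Single-pass clustering: join the first cluster whose leader is similar enough."""
    leaders, labels = [], []
    for row in rows:
        for label, leader in enumerate(leaders):
            # Rows are unit length, so the dot product is the cosine similarity.
            if sum(a * b for a, b in zip(row, leader)) >= threshold:
                labels.append(label)
                break
        else:
            leaders.append(row)
            labels.append(len(leaders) - 1)
    return labels


# Toy, already-segmented documents (made up for this example).
docs = [
    ["laptop", "price", "discount", "laptop"],
    ["laptop", "discount", "shipping"],
    ["hotel", "booking", "price"],
    ["hotel", "booking", "flight"],
]
vocab, df = select_vocabulary(docs)
labels = leader_cluster(tfidf_rows(docs, vocab, df))
print(vocab)   # ['booking', 'discount', 'hotel', 'laptop', 'price']
print(labels)  # [0, 0, 1, 1]
```

Raising `min_df` or lowering `max_df_ratio` shrinks the vocabulary and therefore the width of every row, which is one plausible reading of how tuning DF keeps the feature dimensionality under control.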

Keywords: Clustering; Data extraction; Deep Web; TF-IDF; Words segmentation

Liu Ronghui, Zheng Jianguo, Wang Xiang

Sch. of Manage., Donghua Univ., Shanghai, China

Beidaihe, China

2010 WASE International Conference on Information Engineering

43-46

2010