
Data dimensionality reduction and clustering quality evaluation of K-means clustering

Cluster analysis is widely used in the era of big data, but effective methods for intuitively evaluating clustering quality are lacking. To address this, a processing scheme is proposed that combines data dimensionality reduction with a search for the number of clusters inherent in the data. An augmented matrix is constructed from the data scatter matrices, and linear discriminant analysis (LDA) is used to transform high-dimensional data into the most discriminative low-dimensional feature subspace, thereby achieving dimensionality reduction. To solve the random-initialization problem of partitional clustering algorithms, a min-max rule is proposed that avoids empty clusters and ensures the separability of the data. For the clustering results, the silhouette coefficient of each cluster is computed, and the clustering quality under different numbers of clusters is evaluated by comparing silhouette sizes. Simulation results for the K-means algorithm show that this scheme not only visually determines the number of clusters inherent in unknown data, but also provides an effective analysis method for high-dimensional data.
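As a rough sketch of this processing pipeline, the example below uses off-the-shelf scikit-learn components (standardization, K-means, average silhouette) in place of the authors' implementation; the synthetic data X and the candidate range of k are assumptions for illustration, and the LDA-based reduction step is omitted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))               # placeholder data, an assumption

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

# Cluster for a range of candidate k and compare average silhouette widths;
# the k with the largest silhouette is taken as the inherent cluster number.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    print(k, round(silhouette_score(X_std, labels), 3))
```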
In the age of big data, data analysis is becoming more and more important, and one of its most important tasks is data classification. In pattern recognition and machine learning, classification is divided into supervised and unsupervised classification. In supervised classification, the data include both features and class labels. In practical applications, however, data are usually acquired from sensor devices and no class labels are available, so unsupervised classification, especially clustering, plays a crucial role in data analysis. Clustering, as an exploratory data analysis method, can discover the inherent structure of raw data by grouping samples into different clusters.

In the era of the mobile internet, the dimensionality and structure of data are becoming more complex, so cluster analysis of high-dimensional data is inevitable. Given the huge amount of data to be processed, compression has also become a very important topic for organizing, summarizing, and extracting the useful information contained in the data. Data compression (dimensionality reduction) transforms the data into a new feature subspace of lower dimensionality and mainly includes feature selection and feature extraction: feature selection chooses a subset of the original features, while feature extraction derives the relevant information from the feature set to construct a new feature subspace. Dimensionality reduction is therefore not only a basic preprocessing step but also conducive to data visualization.

Based on the properties of the generated clusters, clustering can be divided into partitional and hierarchical clustering, and in both academia and industry partitional clustering is the most widely used. Among partitional methods, the K-means algorithm has become the most classic and popular, largely because of its low computational complexity, and it has achieved very good results in many practical applications, especially when the clusters are compact and hyper-spherical. Nevertheless, K-means has several major disadvantages. First, the iterative optimization process cannot guarantee convergence to the global optimum, i.e., K-means may converge to a local optimum. Because the centroid positions are assigned randomly, centroids may lie too close together and be merged in subsequent iterations, so the algorithm can produce one or more empty clusters; in other words, K-means has an initialization problem, and in theory there is no effective, general scheme for choosing a reasonable initialization. Second, K-means assumes that the user knows the number of clusters in advance, but choosing the right number can be very difficult because the inherent structure of real data is unknown; as with random initialization, there is no efficient scheme for selecting it correctly. Third, K-means is sensitive to outliers and noise in the data: even if a point is far from the cluster center, the algorithm still forces it into a cluster and uses it to compute the centroid, which distorts the shape of the cluster.
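To make the initialization problem concrete, the following bare-bones NumPy sketch of Lloyd's algorithm (a minimal stand-in, not the paper's implementation; the data and seeds are made up) shows how purely random seeding can leave the result dependent on the seed and can let clusters go empty.

```python
import numpy as np

def lloyd_kmeans(X, k, seed, n_iter=50):
    """Plain Lloyd iterations with purely random initial centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update centroids; an empty cluster keeps its stale center here,
        # which is exactly the degeneracy the min-max rule is meant to avoid.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
for seed in range(3):
    # SSE can differ across seeds: each run may settle in a different local optimum.
    print(seed, lloyd_kmeans(X, k=3, seed=seed)[1])
```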
To address the above three problems of the K-means algorithm, this paper proposes corresponding improvements. First, the locations of the initial centroids are determined by the farthest-initial-center selection principle and the min-max distance rule, which avoids empty clusters in the results and removes the uncertainty caused by the random initialization of the classical K-means algorithm. Second, to determine the optimal number of clusters, a method for estimating the range of the cluster number based on a statistical empirical rule is proposed; within this range, by observing the curve of the within-cluster sum of squared errors (SSE) against the number of clusters, the elbow method is used to intuitively determine the inherent cluster structure of the data. Third, to optimize the performance of K-means, an important feature-scaling method, standardization, is used for data preprocessing; standardized data have zero mean and unit variance, which alleviates the sensitivity of K-means to outliers and noise. Fourth, in comparison with other feature-extraction techniques such as principal component analysis (PCA) and kernel PCA, supervised linear discriminant analysis (LDA) is proposed to compress high-dimensional data into a low-dimensional feature subspace for data visualization; more importantly, LDA is a dimensionality-reduction method that maximizes cluster separability. Finally, to evaluate the clustering quality, silhouette analysis is used to verify the validity of the clustering: the silhouette coefficient of each cluster is computed, and the final number of clusters is determined by comparing silhouette sizes.
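As an illustration of the min-max idea, the sketch below implements farthest-first seeding in NumPy. Both the helper name min_max_init and the choice of the first seed (the sample farthest from the overall mean) are assumptions made for the example; the paper's exact procedure may differ.

```python
import numpy as np

def min_max_init(X, k):
    """Pick k seeds: each new seed maximizes its minimum distance to the others."""
    # One plausible reading of the farthest-initial-center principle:
    # start from the sample farthest from the overall mean.
    centers = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    for _ in range(k - 1):
        # Minimum distance from every sample to the seeds chosen so far ...
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                           axis=2).min(axis=1)
        # ... and take the sample where that minimum is largest (min-max).
        centers.append(X[np.argmax(d)])
    return np.array(centers)
```

The returned seeds can replace random initialization in scikit-learn, e.g. KMeans(n_clusters=k, init=min_max_init(X, k), n_init=1).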

clustering quality; scattering matrix; linear discriminant analysis; min-max rule; silhouette analysis

He Fan, He Xuansen, Liu Runzong, Fan Yueping, Xiong Maohua


School of Management and Economics, Beijing Institute of Technology, Beijing 100081

School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363

College of Information Science and Engineering, Hunan University, Changsha 410082


Special Project in Key Fields of Ordinary Universities of Guangdong Province (2021ZDZX1035); Characteristic Innovation Project of the Department of Education of Guangdong Province (2022KTSCX64)

2024

Journal of Chongqing University of Technology
Chongqing University of Technology

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.567
ISSN: 1674-8425
Year, Volume (Issue): 2024, 38(1)