Data dimensionality reduction and clustering quality evaluation of K-means clustering
In the age of big data, data analysis is becoming ever more important, and one of its central tasks is data classification. In pattern recognition and machine learning, classification can be divided into supervised and unsupervised classification. In supervised classification, the data include both features and class labels. In practical applications, however, data are usually collected from sensor devices, and no class labels are available. As a result, unsupervised classification, and clustering techniques in particular, plays a crucial role in data analysis. Clustering, as an exploratory data analysis method, can discover the inherent structure of raw data by grouping data samples into different clusters.

In the era of the mobile internet, the dimensionality and structure of data are becoming more complex, so cluster analysis of high-dimensional data is unavoidable. To organize, summarize, and extract the useful information contained in such huge volumes of data more easily, data compression has also become a very important topic. Data compression, i.e., dimensionality reduction, transforms the data into a new feature subspace of lower dimensionality. Dimensionality reduction mainly comprises feature selection and feature extraction: feature selection chooses a subset of the original features, while feature extraction derives the relevant information from the full feature set in order to construct a new feature subspace. Dimensionality reduction is thus not only a basic preprocessing step but also conducive to data visualization.

Based on the properties of the generated clusters, clustering methods can be divided into partitional and hierarchical clustering; in both academia and industry, partitional clustering is the most widely used. Among the various partitional clustering methods, the K-means algorithm has become the most classical and popular, chiefly because its low computational complexity
makes it easy to apply at scale. The K-means algorithm has achieved very good clustering results in many practical applications, especially when the clusters are compact and hyper-spherical in shape. Although K-means is by far the most widely used clustering method, it has several major disadvantages.

The first problem is that the iterative optimization process cannot guarantee convergence to the global optimum; that is, K-means may converge to a local optimum. Because K-means starts from randomly assigned centroid positions, centroids may lie too close together and be merged in subsequent iterations, so the algorithm can produce one or more empty clusters. In other words, K-means has an initialization problem, and in theory there is no effective, general scheme for choosing a reasonable initialization. The second flaw is that K-means assumes the user knows the number of clusters in advance. Choosing the right number of clusters can be very difficult, because the inherent structure of real data is unknown; as with random initialization, there is no efficient scheme for selecting the number of clusters correctly. The third drawback is that K-means is sensitive to outliers and noise in the data: even when a data point lies far from a cluster center, K-means still forces that point into the cluster and uses it to compute the centroid, which distorts the shape of the cluster.

To address these three problems, this paper proposes corresponding improvements. Firstly, the locations of the initial centroids are determined by the farthest-initial-center selection principle together with the min-max distance rule, which prevents empty clusters from appearing in the results and removes the uncertainty in the clustering results caused by the random initialization of the classical K-means
algorithm. Secondly, to determine the optimal number of clusters, a method for estimating the range of the number of clusters based on a statistical empirical rule is proposed; within this range, the curve of the within-cluster sum of squared errors (SSE) against the number of clusters is examined, and the elbow method is used to identify the inherent cluster structure of the data intuitively. Thirdly, to optimize the performance of K-means, an important feature-scaling method, standardization, is applied as a preprocessing step. Standardized data have zero mean and unit variance, which alleviates the sensitivity of K-means to outliers and noise. Fourthly, in comparison with other feature-extraction techniques such as principal component analysis (PCA) and kernel PCA, supervised linear discriminant analysis (LDA) is proposed to compress high-dimensional data into a low-dimensional feature subspace for data visualization; more importantly, LDA is a dimensionality-reduction method that maximizes cluster separability. Finally, to evaluate clustering quality, silhouette analysis is used to verify the validity of the clustering: the silhouette coefficient of each cluster is computed, and by comparing silhouette values the final number of clusters is determined.
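The core of the pipeline described above (standardization, a farthest-first centroid initialization with a min-max distance rule, Lloyd iterations, and the SSE curve used by the elbow method) can be sketched in plain NumPy. This is a minimal illustration rather than the paper's implementation: the `farthest_first_centroids` helper, the toy three-blob data set, and all parameter choices are assumptions made for this example, and the paper's exact initialization rule may differ in detail.

```python
import numpy as np

def farthest_first_centroids(X, k):
    """Deterministic initialization (a sketch of the min-max distance rule):
    the first centroid is the point farthest from the data mean, and each
    subsequent centroid is the point maximizing its minimum distance to the
    centroids already chosen."""
    centroids = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    for _ in range(1, k):
        dmin = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(dmin)])
    return np.array(centroids)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with farthest-first initialization.
    Returns labels, centroids, and the within-cluster SSE."""
    C = farthest_first_centroids(X, k)
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute centroids; keep the old one if a cluster empties
        newC = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
    sse = ((X - C[labels]) ** 2).sum()
    return labels, C, sse

# Toy data (hypothetical): three well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in [(0, 0), (5, 5), (0, 5)]])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardization: zero mean, unit variance

# Elbow method: SSE drops sharply until k reaches the true cluster count,
# then flattens, so the "elbow" here should appear at k = 3.
for k in (2, 3, 4):
    _, _, sse = kmeans(X, k)
    print(f"k={k}: SSE={sse:.2f}")
```

On data like these, the deterministic initialization makes repeated runs reproducible, which is the practical benefit of replacing random centroid seeding; silhouette analysis (e.g. a standard silhouette-coefficient computation) could then be applied to the candidate values of k to confirm the choice suggested by the elbow.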