Robust Spectral Clustering Based on Density Distribution

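Abstract: Spectral clustering is a graph-theoretic clustering method that eigendecomposes a similarity matrix, or projects the data into a low-dimensional space, to obtain a better partition of the data. Because it handles complex data and non-convex clusters, spectral clustering has attracted wide attention and has been applied successfully in many fields. However, high computational complexity and sensitivity to noise limit further improvement of its clustering quality. To address these problems, this paper proposes a robust spectral clustering algorithm based on density distribution. First, a noise coefficient is set to filter out a small number of low-density noise points. Second, the algorithm exploits a property of density peaks clustering, namely that dividing the data into as many subclusters as possible guarantees label consistency within each subcluster; it strikes a balance between a smaller number of subclusters and higher within-cluster label consistency, which yields a better partition of the data. Finally, a similarity measure based on the density distribution between clusters improves the performance of spectral clustering on data sets with uneven density. Experiments on synthetic and real data demonstrate the effectiveness of the new algorithm in a comparison involving nine state-of-the-art improved algorithms. While maintaining clustering efficiency, the new algorithm improves the average accuracy, adjusted Rand index, and adjusted mutual information on real data by at least 10.02%, 22.11%, and 15.76%, respectively.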
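The noise-filtering idea can be illustrated with a small sketch. The abstract does not give the paper's density definition or the exact role of the noise coefficient, so everything here is an assumption made for illustration: density is taken as the inverse mean distance to the k nearest neighbors, and the hypothetical parameter `noise_coef` is read as the fraction of lowest-density points to discard. The function names `knn_density` and `filter_low_density` are likewise hypothetical.

```python
import math


def knn_density(points, k):
    """Local density per point: inverse mean distance to its k nearest neighbors.

    Assumed definition for illustration; the paper's own estimator may differ.
    """
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Small constant guards against division by zero for duplicate points.
        densities.append(k / (sum(dists[:k]) + 1e-12))
    return densities


def filter_low_density(points, k=3, noise_coef=0.1):
    """Return (kept_indices, noise_indices).

    Hypothetical rule: the `noise_coef` fraction of points with the lowest
    density is treated as noise and removed before clustering.
    """
    densities = knn_density(points, k)
    n_noise = int(noise_coef * len(points))
    order = sorted(range(len(points)), key=lambda i: densities[i])
    noise = sorted(order[:n_noise])
    kept = sorted(order[n_noise:])
    return kept, noise


# Usage: a 3x3 grid of points plus one distant outlier (index 9).
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
data = grid + [(50.0, 50.0)]
kept, noise = filter_low_density(data, k=3, noise_coef=0.1)
```

With ten points and `noise_coef=0.1`, exactly one point is dropped, and the distant outlier is the one with the lowest density under this assumed estimator.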

English title: Robust Spectral Clustering Based on Density Distribution

English abstract: Spectral clustering, a classic clustering method based on graph theory, uses a similarity matrix to decompose the data or project them into a low-dimensional space in order to achieve a better partition. A similarity matrix of the data is constructed first, with the similarity between data points usually computed by a Gaussian kernel function or a k-nearest neighbors method. The similarity matrix is then transformed into a Laplacian matrix, the Laplacian is eigendecomposed, and the resulting eigenvectors are clustered by the k-means algorithm. Finally, each data point is assigned to a cluster according to the clustering result. Spectral clustering is of great significance in data mining and pattern recognition: beyond clustering, it can be applied to graph segmentation, dimensionality reduction, feature selection, and other tasks. However, its computational complexity is high, which can limit its use on large-scale data sets. It is also sensitive to noise, because noisy data points can distort both the construction of the similarity matrix and the computation of the eigenvectors, making the clustering results unstable and less accurate. Without noise preprocessing or denoising in particular, spectral clustering may wrongly assign noisy points to a cluster and degrade the final result, so data containing noise should be cleaned or denoised before spectral clustering is applied. To address these problems, this paper proposes a robust spectral clustering algorithm based on density distribution. First, because noise points between subclusters have lower local density, a noise coefficient is set to filter out a small number of low-density noise points from the perspective of different density levels. Second, the algorithm exploits a property of density peaks clustering, namely that dividing the data into as many subclusters as possible ensures label consistency within each subcluster; it achieves a balance between a smaller number of subclusters and higher within-cluster label consistency, and thus a better division of the data. Finally, a new similarity measure based on the density distribution information between all clusters, including the density mean and the density standard deviation, is proposed to improve the performance of spectral clustering on data sets with uneven density. In the proposed algorithm, the Gaussian kernel bandwidth, a non-negative parameter of spectral clustering, is replaced by the more easily tuned number of nearest neighbors k, so that the optimal parameter can be selected within a more limited range. Experiments on synthetic and real data demonstrate the effectiveness of the new algorithm in a comparison involving nine state-of-the-art improved algorithms. While maintaining clustering efficiency, the average accuracy, adjusted Rand index, and adjusted mutual information on real data are increased by at least 10.02%, 22.11%, and 15.76%, respectively.
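The baseline pipeline described in the abstract (similarity matrix, Laplacian, eigenvectors, cluster assignment) can be sketched for the two-cluster case as follows. This is a minimal illustration of standard spectral clustering, not the authors' algorithm. Several simplifications are assumptions made to keep the sketch dependency-free: a 0/1 k-nearest-neighbor affinity is used, power iteration with deflation stands in for a library eigensolver, and the sign of the second eigenvector replaces the k-means step. The function names are hypothetical.

```python
import math


def knn_affinity(points, k):
    """Symmetric 0/1 k-nearest-neighbor affinity matrix without self-loops."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            w[i][j] = w[j][i] = 1.0  # connect if either point lists the other
    return w


def spectral_bipartition(points, k=2, iters=500):
    """Split points into two clusters by the sign of the second eigenvector."""
    n = len(points)
    w = knn_affinity(points, k)
    deg = [sum(row) for row in w]
    # M = (I + D^-1/2 W D^-1/2) / 2 has the same eigenvectors as the normalized
    # Laplacian L = I - D^-1/2 W D^-1/2; small eigenvalues of L are large in M.
    m = [[(w[i][j] / math.sqrt(deg[i] * deg[j]) + (1.0 if i == j else 0.0)) / 2.0
          for j in range(n)] for i in range(n)]
    # The top eigenvector of M (eigenvalue 1) is proportional to sqrt(degree).
    top = [math.sqrt(d) for d in deg]
    norm = math.sqrt(sum(t * t for t in top))
    top = [t / norm for t in top]
    # Power iteration, deflated against `top`, converges to the next eigenvector.
    v = [math.sin(i + 1.0) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return [1 if a > 0 else 0 for a in v]


# Usage: two well-separated squares of four points each.
square_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
square_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
labels = spectral_bipartition(square_a + square_b, k=2)
```

Using the number of neighbors k rather than a kernel bandwidth mirrors the abstract's remark that k is easier to tune, since it is a small integer. For more than two clusters, the usual approach is to take several eigenvectors and run k-means on their rows, as the abstract describes.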

English keywords:

spectral clustering; density distribution; sub-cluster similarity; local peak; noise detection

Authors:

Li Chao (李超), Liao Hongmei (廖红梅), Xu Xiao (徐晓), Guo Lili (郭丽丽), Ding Shifei (丁世飞)

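Affiliations:

School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116

Mine Digitization Engineering Research Center of the Ministry of Education (China University of Mining and Technology), Xuzhou, Jiangsu 221116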

Keywords:

spectral clustering; density distribution; subcluster similarity; local peak; noise detection

Funding:

National Natural Science Foundation of China (two grants)

Grant numbers:

62276265; 61976216

Year of publication:

2024

DOI:

10.11897/SP.J.1016.2024.02645

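Chinese Journal of Computers (计算机学报)

China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences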

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 3.18

ISSN: 0254-4164

Year, volume (issue): 2024, 47(11)