首页|基于Zipf分布的网格密度峰值聚类算法

基于Zipf分布的网格密度峰值聚类算法

扫码查看
网格密度峰值聚类在兼顾密度峰值聚类算法可识别任意形状类簇的基础上,通过数据集的网格化简化整体计算量,成为当前备受关注的聚类方法。针对大规模数据,如何进一步区分稠密与稀疏网格,减少网格密度峰值聚类中参与计算的非空网格代表点的数量是解决"网格灾难"的关键。结合以网格密度为变量的概率密度分布呈现出类Zipf分布的特点,提出一种基于Zipf分布的网格密度峰值聚类算法。首先计算所有非空网格的密度并映射为Zipf分布,根据对应的Zipf分布筛选出稠密中心网格和稀疏边缘网格;然后仅对稠密中心网格进行密度峰值聚类,在自适应确定潜在聚类中心的同时减少欧氏距离的计算量,降低算法复杂度;最后通过对稀疏边缘网格的处理,进一步优化类簇边界并提高聚类精度。人工数据集和UCI数据集下的实验结果表明,所提出算法对大规模、类簇交叉数据的聚类具有明显优势,能够在保证聚类精度的同时降低时间复杂度。
Grid density peak clustering algorithm based on Zipf distribution
Grid density peak clustering,taking advantage of the strong ability that the density peak clustering algorithm can identify arbitrary shape clusters,greatly reduces the computational cost by gridding the data set.It has become a popular clustering method nowadays.However,for large-scale data,how to further distinguish the dense and sparse grids and reduce the number of non-empty grids involved in the calculation of the grid density peak clustering is the key to solve the"grid disaster".In view of the probability density distribution with grid density as variable shows a Zipf-like distribution,a grid density peak clustering algorithm based on Zipf distribution is proposed.Firstly,the density of all non-empty grids is calculated and mapped to Zipf distribution,so the dense center grids and sparse edge grids are filtered according to the Zipf distribution.And then,only the dense center grids are clustered with peak density,and the calculation of Euclidean distance is reduced while the cluster centers are determined heuristically,which reduces the complexity of the algorithm.Finally,the sparse edge grids are processed to further optimize the cluster boundary and improve the clustering accuracy.Experimental results on the synthetic datasets and UCI datasets show that the proposed algorithm has obvious advantages for the clustering of large-scale and cross-cluster data,and can reduce the time complexity while ensuring the clustering accuracy.

clusteringdensity peaksgridZipf distributiondensity threshold

马福民、宫婷、杨帆、张腾飞

展开 >

南京财经大学信息工程学院,南京 210023

南京邮电大学自动化学院、人工智能学院,南京 210023

聚类 密度峰值 网格 Zipf分布 密度阈值

国家自然科学基金项目国家自然科学基金项目江苏省自然科学基金项目江苏省重点研发计划项目江苏省研究生科研与实践创新计划项目江苏省高校青蓝工程项目

6197315162073173BK20191406BE2021001-4G-TXW21001

2024

控制与决策
东北大学

控制与决策

CSTPCD北大核心
影响因子:1.227
ISSN:1001-0920
年,卷(期):2024.39(2)
  • 21