Study on Improvement of Heuristic k-means Clustering Algorithm

Yin Lifeng¹, Li Qingjie¹

Author information

1. School of Software, Dalian Jiaotong University, Dalian 116028, Liaoning, China

Abstract

The heuristic k-means algorithm speeds up clustering by examining nearby clusters after the first k-means iteration to predict, for each data point, the subset of clusters to which it may be assigned. However, the heuristic algorithm selects its initial cluster centers at random and cannot effectively identify outliers in the data set, so its clustering results have a large sum of squared errors and a small silhouette coefficient. To address this problem, the CHk-means algorithm is proposed. It introduces the careful seeding method to overcome the local-optimum problem caused by the heuristic k-means algorithm's random selection of initial cluster centers, and it introduces the local outlier factor (LOF) algorithm to detect outliers, reducing the influence of outlier data on the clustering results. Comparative experiments with three algorithms on multiple data sets show that the CHk-means algorithm effectively reduces the sum of squared errors of the clustering results, increases the silhouette coefficient, and markedly improves clustering quality.
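The pipeline outlined in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and it rests on several assumptions: "careful seeding" is taken to mean k-means++-style sampling proportional to squared distance; plain Lloyd iterations stand in for the heuristic k-means acceleration, whose details the abstract does not give; and the function names, the toy data, `n_neighbors=3`, and the `LOF_THRESHOLD` of 1.5 are all hypothetical choices made for illustration. The silhouette coefficient is not computed here.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lof_scores(points, n_neighbors):
    """Brute-force local outlier factor (LOF) of every point, O(n^2).

    Simplification: exactly n_neighbors neighbours are used per point, so
    ties at the k-distance are cut off; duplicate points are assumed absent.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:n_neighbors]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]
    # Local reachability density: inverse mean reachability distance to the neighbours.
    lrd = [len(knn[i]) / sum(max(k_dist[j], dist[i][j]) for j in knn[i]) for i in range(n)]
    # LOF: mean density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i]) for i in range(n)]


def careful_seeding(points, k, rng):
    """Assumed k-means++-style seeding: each new center is drawn with
    probability proportional to its squared distance to the nearest center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point already coincides with a center
            break
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def lloyd(points, centers, n_iter=20):
    """Plain Lloyd iterations (a stand-in for the heuristic k-means step)."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


def sse(points, centers):
    """Sum of squared errors: squared distance of each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data (hypothetical): two tight groups plus one deliberate outlier at the end.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
        (5.0, 30.0)]
LOF_THRESHOLD = 1.5  # hypothetical cut-off; points scoring above it are dropped

scores = lof_scores(data, n_neighbors=3)
inliers = [p for p, s in zip(data, scores) if s <= LOF_THRESHOLD]
outliers = [p for p, s in zip(data, scores) if s > LOF_THRESHOLD]

seeds = careful_seeding(inliers, k=2, rng=random.Random(0))
centers = lloyd(inliers, seeds)

print("removed as outliers:", outliers)
print("SSE with seeds:", sse(inliers, seeds))
print("SSE after k-means:", sse(inliers, centers))
```

Removing the outlier before seeding matters in this sketch because distance-weighted sampling would otherwise be strongly drawn toward the far-away point, which is one plausible reading of why the abstract pairs the two techniques.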

Key words

clustering algorithm/k-means/heuristic algorithm/careful seeding/local outlier factor/outliers

Funding

National Natural Science Foundation of China (61771087)

Publication year

2024

Journal of Dalian Jiaotong University

Dalian Jiaotong University

CSTPCD

Impact factor: 0.258

ISSN: 1673-9590