Label iteration-based clustering ensemble algorithm

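Abstract: Existing clustering ensemble training strategies of the "same data, different algorithms" kind generally suffer from limited performance on large-scale data and from consensus functions that adapt poorly. To address this, the "different data, same algorithm" training strategy was studied and a label iteration-based clustering ensemble (LICE) algorithm was constructed. First, the algorithm trains several base clusterers on random sample partition (RSP) data blocks of the original dataset. Next, it fuses the base clustering results that have the same number of clusters using the maximum mean discrepancy criterion, and trains a heuristic classifier on the RSP data blocks whose labels are determined. It then iteratively uses the heuristic classifier to predict labels for the sample points in the RSP data blocks whose labels are undetermined, and uses the sample points whose classification and clustering labels agree to strengthen the heuristic classifier. Finally, a series of experiments verified the feasibility and effectiveness of LICE. On representative datasets, the normalized mutual information, adjusted Rand index, Fowlkes-Mallows index, and purity of LICE at the 5th iteration improved on average by 17.23%, 16.75%, 31.29%, and 12.37%, respectively, relative to the start of the iteration. Compared with seven classical clustering ensemble algorithms on the selected datasets, these four metrics improved on average by 11.76%, 16.50%, 9.36%, and 14.20%, respectively. The experiments confirm that LICE is an efficient and reasonable clustering ensemble algorithm capable of handling big-data clustering problems.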
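The fusion step relies on the maximum mean discrepancy (MMD) to decide which clusters from different data blocks correspond to each other. The abstract does not give the kernel, the estimator, or the matching rule, so the following is only a minimal sketch under assumptions: a Gaussian kernel with a hypothetical bandwidth parameter `gamma`, the biased estimate of squared MMD, and a greedy "closest cluster" rule in the illustrative helper `match_clusters`.

```python
import math
import random


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def match_clusters(clusters_a, clusters_b, gamma=1.0):
    """For each cluster of block A, return the index of the block-B cluster
    with the smallest MMD (assumed matching rule, not taken from the paper)."""
    return [
        min(range(len(clusters_b)),
            key=lambda j: mmd_squared(ca, clusters_b[j], gamma))
        for ca in clusters_a
    ]


# Toy data: two well-separated groups, clustered independently on two blocks,
# so the cluster indices of block B come out in the opposite order.
rng = random.Random(0)


def blob(cx, cy, n=30):
    return [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for _ in range(n)]


block_a = [blob(0, 0), blob(5, 5)]
block_b = [blob(5, 5), blob(0, 0)]
mapping = match_clusters(block_a, block_b)
print(mapping)
```

Here `mapping[i]` is the block-B cluster assigned to cluster `i` of block A, so the two partitions can be relabelled consistently before fusion.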

English title: Label iteration-based clustering ensemble algorithm

English abstract: Existing training strategies for clustering ensemble algorithms generally use the same data with different base clustering algorithms, and commonly suffer from low performance on large-scale data and weak adaptability of the consensus function. To address these problems, this paper proposes a label iteration-based clustering ensemble (LICE) algorithm, developed on the training strategy of different data with the same base clustering algorithm. Firstly, multiple base clusterings are trained on random sample partition (RSP) data blocks. Secondly, the base clustering results with the same number of clusters are fused using the maximum mean discrepancy criterion, and a heuristic classifier is then trained on the RSP data blocks with labels. Thirdly, the sample points without labels are labeled by the heuristic classifier, which is iteratively enhanced with the labeled sample points whose clustering and classification labels are consistent. Finally, a series of persuasive experiments were conducted to validate the feasibility and effectiveness of the LICE algorithm. The experimental results show that the normalized mutual information, adjusted Rand index, Fowlkes-Mallows index, and purity of the LICE algorithm increased by 17.23%, 16.75%, 31.29%, and 12.37% on average at the 5th iteration compared with the initial iteration. On the representative datasets, these four indexes increased by 11.76%, 16.50%, 9.36%, and 14.20% on average in comparison with seven state-of-the-art clustering ensemble algorithms. The results demonstrate that the LICE algorithm is an efficient and reasonable clustering ensemble algorithm with the potential to handle large-scale data clustering problems.
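The label iteration step can be illustrated with a small sketch. The abstract does not say which model serves as the heuristic classifier, so a nearest-centroid classifier is used here purely as a stand-in. The function names, the stopping rule, and the assumption that the cluster labels of the uncertain block are already aligned with the classifier's labels are all illustrative choices, not details taken from the paper.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def fit_centroids(points, labels):
    """Stand-in 'heuristic classifier': one centroid per label."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: centroid(ps) for y, ps in groups.items()}


def predict(model, p):
    """Label of the centroid closest to p (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))


def label_iteration(train_x, train_y, pool_x, pool_cluster_y, n_iter=5):
    """Iteratively move pool points whose predicted label agrees with their
    cluster label into the training set, then refit the classifier."""
    train_x, train_y = list(train_x), list(train_y)
    pool = list(zip(pool_x, pool_cluster_y))
    for _ in range(n_iter):
        model = fit_centroids(train_x, train_y)
        agreed = [(p, c) for p, c in pool if predict(model, p) == c]
        if not agreed:
            break  # nothing new to learn from; assumed stopping rule
        pool = [(p, c) for p, c in pool if predict(model, p) != c]
        for p, c in agreed:
            train_x.append(p)
            train_y.append(c)
    return fit_centroids(train_x, train_y), pool


# Toy example: four labelled points, three points with cluster labels only.
train_x = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
train_y = [0, 0, 1, 1]
pool_x = [(2.0, 0.0), (9.0, 0.0), (4.0, 0.0)]
pool_cluster_y = [0, 1, 1]  # the last cluster label disagrees with the classifier

model, remaining = label_iteration(train_x, train_y, pool_x, pool_cluster_y)
print(model, remaining)
```

In this toy run the two pool points whose labels agree are absorbed in the first iteration, while the point at `(4.0, 0.0)` stays in `remaining` because its cluster label never matches the classifier's prediction.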

English keywords:

clustering ensemble algorithm; ensemble learning; random sample partition; maximum mean discrepancy; label iteration

Authors:

何玉林, 杨锦, 黄哲学, 尹剑飞

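Affiliations:

Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, Guangdong 518107

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060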

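Keywords:

clustering ensemble algorithm (聚类集成算法); ensemble learning (集成学习); random sample partition (随机样本划分); maximum mean discrepancy (最大平均差异); label iteration (标签迭代)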

Publication year:

2024

DOI:

10.11959/j.issn.2096-6652.202443

Chinese Journal of Intelligent Science and Technology (智能科学与技术学报)

CSTPCD

ISSN:

Year, volume (issue): 2024, 6(4)