Fair Data Summarization Based on k-Center Clustering and Nearest Neighbor Center

Abstract: Fair data summarization refers to selecting a representative subset from each data category while satisfying a fairness requirement. In the era of big data, every category may contain a massive volume of data, so research on fair data summarization is of great practical importance. To make the selected data points more representative, we propose a fair data summarization algorithm based on k-center clustering and nearest neighbor centers. The algorithm consists of two basic steps: (1) run k-center clustering and take the k cluster centers as the current summary; (2) select, as new cluster centers, the nearest neighbors of the original cluster centers that satisfy the fairness constraint. Because each new center is the nearest neighbor of an original center, it is more representative than a randomly selected data point and is the next-best representative after the original center itself. Choosing the nearest neighbor as the new center also keeps the fairness cost as small as possible. Comparative experiments on 2 simulated datasets and 6 real UCI datasets show that the proposed algorithm outperforms the compared algorithm in both approximation factor and fairness cost, indicating that the summaries it produces are more representative.
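The two steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract does not give the details. Several parts are assumptions made for illustration: the k-center step uses the greedy farthest-first heuristic, the fairness constraint is modeled as a fixed quota of centers per group, distances are Euclidean, and the names `greedy_k_center`, `fair_summary`, `groups`, and `quotas` are hypothetical.

```python
import math


def greedy_k_center(points, k):
    """Step 1 (assumed heuristic): farthest-first traversal for k-center."""
    centers = [0]  # start from the first point so the result is deterministic
    # dist[i] = distance from point i to its closest chosen center so far
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers


def fair_summary(points, groups, quotas):
    """Step 2 (assumed fairness form): quotas maps group -> number of centers.

    Centers whose group is already at its quota are replaced by the nearest
    unused point from a group that still needs centers. Assumes every group
    has at least as many points as its quota.
    """
    k = sum(quotas.values())
    counts = {g: 0 for g in quotas}
    kept, pending = [], []
    for c in greedy_k_center(points, k):
        g = groups[c]
        if counts.get(g, 0) < quotas.get(g, 0):
            counts[g] += 1
            kept.append(c)
        else:
            pending.append(c)  # violates the quota, must be replaced

    used = set(kept)
    for c in pending:
        candidates = [
            i for i in range(len(points))
            if i not in used and counts.get(groups[i], 0) < quotas.get(groups[i], 0)
        ]
        # nearest neighbor of the original center among the fair candidates
        best = min(candidates, key=lambda i: math.dist(points[i], points[c]))
        counts[groups[best]] += 1
        used.add(best)
        kept.append(best)
    return kept


points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 5)]
groups = ["a", "a", "b", "a", "b"]
print(fair_summary(points, groups, {"a": 1, "b": 1}))  # [0, 2]
```

In this toy input the greedy step picks points 0 and 3, both from group "a". Point 3 exceeds the quota for "a", so it is replaced by point 2, its nearest neighbor in group "b", instead of the more distant point 4.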

Keywords:

nearest neighbor point; k-center clustering; data summarization; fairness constraint

Authors:

He Yan (何艳), Huang Qiaoling (黄巧玲), Zheng Bochuan (郑伯川)

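Affiliations:

School of Mathematics and Information, China West Normal University, Nanchong, Sichuan 637009

School of Computer Science, China West Normal University, Nanchong, Sichuan 637009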

Publication year:

2025

DOI:

10.16246/j.issn.1673-5072.2025.01.013

Journal: Journal of China West Normal University (Natural Sciences)

Impact factor: 0.212

ISSN: 1673-5072

Year, volume (issue): 2025, 46(1)