Fair Data Summarization Based on k-Center Clustering and Nearest Neighbor Center
Fair data summarization refers to selecting a representative subset from each data category while satisfying a fairness requirement. In the era of big data, each category may contain a large volume of data, so research into fair data summarization is of great practical importance. To enhance the representativeness of the data points in a summary, we propose a fair data summarization algorithm based on k-center clustering and nearest neighbor centers. The algorithm consists of two basic steps: (1) k centers are obtained via k-center clustering and taken as the current summarization result; (2) the nearest neighbors of the original cluster centers that satisfy the fairness constraints are selected as the new cluster centers. Because the new cluster centers are nearest neighbors of the original ones, they are more representative than randomly selected data points, and they are the next-best representative points after the original cluster centers. Moreover, selecting nearest neighbor points as the new cluster centers can minimize the fairness cost. Comparison results on two simulated datasets and six real UCI datasets show that the proposed algorithm outperforms the compared algorithm in terms of approximation factor and fairness cost, indicating that the data summarization obtained by the proposed algorithm is more representative.
nearest neighbor point; k-center clustering; data summarization; fairness constraint
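
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: step (1) uses the standard greedy farthest-first heuristic for k-center, the fairness constraint is modeled as a per-category quota of centers (the hypothetical `quota` dictionary), and step (2) handles the original centers one at a time in the order they were found. The function names `greedy_k_center` and `fair_nearest_neighbor_centers` are likewise hypothetical.

```python
import math


def greedy_k_center(points, k):
    """Step (1): pick k centers by farthest-first traversal.

    Assumption: the abstract does not say which k-center method is used;
    this greedy heuristic is a common choice. Returns point indices.
    """
    centers = [0]  # start from the first point so the result is deterministic
    # d[i] = distance from point i to its closest center chosen so far
    d = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: d[i])
        centers.append(nxt)
        d = [min(d[i], math.dist(points[i], points[nxt]))
             for i in range(len(points))]
    return centers


def fair_nearest_neighbor_centers(points, groups, centers, quota):
    """Step (2): replace each center by its nearest fairness-feasible point.

    Assumption: fairness means category g contributes exactly quota[g]
    centers. A center is kept if its own category still has quota left;
    otherwise its nearest unused neighbor from a category with quota left
    takes its place.
    """
    remaining = dict(quota)
    used = set()
    chosen = []
    for c in centers:
        # Candidates ordered by distance to the original center c
        # (c itself comes first, at distance 0).
        by_distance = sorted(range(len(points)),
                             key=lambda i: math.dist(points[i], points[c]))
        for i in by_distance:
            if i not in used and remaining.get(groups[i], 0) > 0:
                chosen.append(i)
                used.add(i)
                remaining[groups[i]] -= 1
                break
    return chosen
```

For example, with points `[(0, 0), (1, 0), (10, 0), (11, 0)]`, categories `["a", "a", "b", "a"]`, `k = 2`, and a quota of one center per category, step (1) returns the two extreme points (both in category `a`), and step (2) replaces the second one with its immediate neighbor from category `b`.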