Hierarchical clustering is an important research area in unsupervised learning.Due to its good interpretability,it is widely used in data mining.Most hierarchical clustering algorithms merge the clusters by calculating pairwise distances.Unfortunately,this step has high complexity(in both time and space),making it inapplicable in large-scale datasets.This paper proposes a density-based scalable hierarchical clustering algorithm(DBSC).Firstly,the algorithm constructs the nearest neighbor graph based on the nearest neighbor relationship of the data.Then,it selects the representative roots on each nearest neighbor component.The representative roots are selected based on the local density of the reciprocal nearest neighbor nodes.Besides,to reduce the interference of the isolated nearest neighbor component with selecting representative roots,the algorithm reconnects to the appropriate nearest neighbor component by second nearest neighbors.Through the above measures,the algorithm iteratively selects the representative roots and constructs the clustering tree in a bottom-up manner.Experiments on massive real datasets show that the algorithm increases the ability to process data to hundreds of thousands of items while ensuring the accuracy of clustering and fast response.
hierarchical clusteringlocal density peaknearest neighbor graphreciprocal nearest neighbor