Approximate Nearest Neighbor Retrieval Method Based on a Balanced Clustering Index


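Abstract: In the era of big data, deep learning represents complex objects as high-dimensional feature vectors and uses distance metrics between vectors to measure the similarity of samples, and has been widely applied in scenarios such as recommendation systems, user profiling, and data middle-platform management. However, as data scale keeps growing, similar-vector retrieval over massive feature data faces serious challenges: retrieval models occupy a large amount of memory, and feature retrieval algorithms have low recall. Designing a compact graph index structure that reduces the memory consumption of feature retrieval while preserving retrieval accuracy plays an important role in improving the efficiency of nearest neighbor retrieval in big data systems. This paper therefore proposes a feature-data bucketing scheme based on balance-aware fast K-means nearest neighbor clustering, together with a compact graph-structured index, for nearest neighbor retrieval over massive data. First, a balance-aware fast K-means clustering algorithm is designed; by bucketing massive feature data evenly during graph index construction, it compresses high-dimensional vectors into a lightweight, compact graph index structure. A quantization operation then further compresses the high-dimensional vector samples, speeding up nearest neighbor retrieval over the candidate set. Experiments on benchmark datasets show that the proposed method speeds up index construction while maintaining high retrieval recall, and can support efficient nearest neighbor retrieval over high-dimensional feature data.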

English title: Balanced Clustering-based Index for Approximate Nearest Neighbor Retrieval

English abstract: In the era of big data, deep learning has been widely applied in recommendation systems, user profiling, and data management by representing complex objects as high-dimensional feature vectors and evaluating their similarity with vector distance measures. However, as data scale continues to grow, retrieving similar feature vectors from massive data faces significant challenges, such as the large memory consumption of retrieval models and the low recall of feature retrieval algorithms. Designing compact graph index structures that reduce the memory consumption of feature retrieval while preserving retrieval accuracy is therefore crucial for improving the efficiency of nearest neighbor search in large-scale data systems. To this end, a balance-aware distributed K-means clustering-based user feature binning approach and a compact graph index design algorithm are proposed. First, a fast balance-aware K-means clustering algorithm is designed to achieve balanced binning of massive feature data during graph index construction, compressing high-dimensional vectors into a lightweight and compact graph index structure. Subsequently, a quantization operation further compresses the high-dimensional vector samples and speeds up nearest neighbor search on the dataset. Experimental results on benchmark datasets demonstrate that the proposed method can effectively accelerate index construction while maintaining high accuracy, thus enabling efficient indexing and retrieval of massive data.
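The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea of balanced bucketing, not the authors' algorithm. It assumes a simple greedy, capacity-capped variant of K-means in which no bucket may hold more than ceil(n / k) points, followed by a search that scans only the buckets closest to the query. The function names `balanced_kmeans` and `search`, and the parameters `iters`, `seed`, `nprobe`, and `topk`, are hypothetical; the graph index and the quantization step described in the abstract are not modeled.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def balanced_kmeans(points, k, iters=10, seed=0):
    """Capacity-capped k-means: no bucket holds more than ceil(n / k) points."""
    rng = random.Random(seed)
    n = len(points)
    cap = math.ceil(n / k)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [-1] * n
    for _ in range(iters):
        # Rank every (point, centroid) pair by distance, closest first.
        pairs = sorted(
            (sq_dist(p, c), i, j)
            for i, p in enumerate(points)
            for j, c in enumerate(centroids)
        )
        new_assign = [-1] * n
        sizes = [0] * k
        for _, i, j in pairs:
            # Greedy: accept the closest pair whose bucket still has room.
            if new_assign[i] == -1 and sizes[j] < cap:
                new_assign[i] = j
                sizes[j] += 1
        if new_assign == assign:
            break  # assignments are stable, centroids already match them
        assign = new_assign
        # Move each centroid to the mean of its bucket.
        for j in range(k):
            members = [points[i] for i in range(n) if assign[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def search(query, points, centroids, assign, nprobe=2, topk=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda j: sq_dist(query, centroids[j]))
    probe = set(order[:nprobe])
    candidates = [i for i, j in enumerate(assign) if j in probe]
    return sorted(candidates, key=lambda i: sq_dist(query, points[i]))[:topk]
```

Capping bucket sizes bounds the number of vectors scanned per probed bucket, so query cost stays predictable even when the data is skewed. The greedy assignment sorts all n × k pairs and is meant for illustration only; it would not scale to the massive datasets the abstract targets.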

English keywords:

big data retrieval and analysis; nearest neighbor search; balanced perception

Authors:

吕宏伟, 李博, 刘普凡, 刘识, 李继伟, 刘俊健


Affiliation:

State Grid Big Data Center (国家电网大数据中心), Nanjing, Jiangsu 210023

Keywords:

big data retrieval and analysis; nearest neighbor search; balance awareness

Funding:

Self-initiated Science and Technology Project of the State Grid Big Data Center

Project number:

SGSJ0000SJJS2310021

Year of publication:

2024

DOI:

10.3969/j.issn.1001-4616.2024.02.012

Journal of Nanjing Normal University (Natural Science Edition)

Nanjing Normal University


Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.427

ISSN: 1001-4616

Year, volume (issue): 2024, 47(2)

Number of references: 30