A Large-Scale Vector Space Similarity Retrieval Framework Based on Prefix Pruning

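Abstract: To address weight-based similarity queries over large-scale text collections, this paper proposes an efficient retrieval framework that supports prefix pruning. First, similarity and its weighted prefix are defined under the vector space model, and the correctness of weighted prefix pruning is proved theoretically. Second, for large-scale text queries, a new inverted index structure is proposed whose leaf nodes maintain the prefix weights of the records, and an efficient similarity retrieval algorithm is built on this index. Finally, the method is shown to effectively support large-scale weighted similarity retrieval under the TF/IDF weighting scheme. The results show that it improves query efficiency by more than a factor of 5 over Lucene's merge-and-verify strategy.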

English title: A Large-Scale Vector Space Similarity Retrieval Framework Based on Prefix Pruning

English abstract: Aiming at the problem of weight-based similarity queries over large-scale text collections, an efficient retrieval framework supporting prefix pruning is proposed. Firstly, we give the definition of similarity and its weighted prefix under the vector space model, and theoretically prove the correctness of weighted prefix pruning; then, for large-scale text queries, we propose a new inverted index structure, use the index leaf nodes to maintain the prefix weights of the records, and construct an efficient similarity retrieval algorithm based on the index; finally, we show that, under the TF/IDF weighting scheme, the method can effectively support large-scale weighted similarity retrieval, and the results show that its query efficiency is more than 5 times higher than that of Lucene's merge-and-verify strategy.
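
The prefix-pruning idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm or index structure (which keeps prefix weights in index leaf nodes); it assumes unit-normalized, non-negative weight vectors, cosine similarity as the similarity measure, and a query-side prefix only. The pruning rule relies on the Cauchy-Schwarz bound: if the low-weight suffix of the query has L2 norm below the threshold, a record that shares terms only with that suffix cannot reach the threshold, so only the postings of the prefix terms need to be scanned. All function names, terms, and weights below are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse {term: weight} vector to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def build_index(docs):
    """Build a plain inverted index: term -> list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index


def query_prefix(query, threshold):
    """Return the query terms left after cutting the longest low-weight
    suffix whose L2 norm is still strictly below `threshold`."""
    terms = sorted(query, key=lambda t: (-query[t], t))  # descending weight
    suffix_sq = 0.0
    cut = len(terms)
    for i in range(len(terms) - 1, -1, -1):
        grown = suffix_sq + query[terms[i]] ** 2
        if grown >= threshold ** 2:
            break  # adding this term would let the suffix reach the threshold
        suffix_sq = grown
        cut = i
    return terms[:cut]


def search(index, docs, query, threshold):
    """Collect candidates from prefix postings only, then verify each one."""
    candidates = set()
    for term in query_prefix(query, threshold):
        for doc_id, _ in index.get(term, ()):
            candidates.add(doc_id)
    results = []
    for doc_id in candidates:
        vec = docs[doc_id]
        sim = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if sim >= threshold:
            results.append((doc_id, sim))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Made-up weights standing in for TF/IDF scores.
docs = {
    "d1": normalize({"prefix": 3.0, "pruning": 2.0, "index": 1.0}),
    "d2": normalize({"index": 1.0, "database": 2.0}),
    "d3": normalize({"prefix": 1.0, "retrieval": 3.0}),
    "d4": normalize({"database": 1.0}),
}
index = build_index(docs)
query = normalize({"prefix": 3.0, "pruning": 2.0, "database": 0.5})
hits = search(index, docs, query, threshold=0.8)
print(hits)
```

In this toy run only the postings list of the highest-weight query term is scanned, and the verification step still returns the same records as an exhaustive comparison against every document.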

English keywords:

prefix-based pruning; TF/IDF; vector space model; inverted index; information retrieval; database

Authors:

Liu Jianbo (刘健博), Deng Lingfeng (邓凌风), Li Wenhai (李文海), Tian Ye (田野)

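Author affiliations:

武汉数博科技有限责任公司 (Wuhan Shubo Technology Co., Ltd.), Wuhan, Hubei 430205

School of Computer Science, Wuhan University, Wuhan, Hubei 430072

School of Software Engineering, Hubei Open University, Wuhan, Hubei 430074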

Keywords:

prefix pruning; TF/IDF; vector space; inverted index; information retrieval; database

Year of publication:

2024

DOI:

10.11907/rjdk.241015

Software Guide (软件导刊)

Hubei Information Society (湖北省信息学会)

Impact factor: 0.524

ISSN: 1672-7800

Year, volume (issue): 2024, 23(6)