A Large-Scale Vector Space Similarity Retrieval Framework Based on Prefix Pruning
To address weight-based similarity queries over large-scale text collections, an efficient retrieval framework supporting prefix pruning is proposed. First, we define similarity and its weighted prefix under the vector space model, and theoretically prove the correctness of weighted prefix pruning. Then, for large-scale text queries, we propose a new inverted index structure that uses the index leaf nodes to maintain the prefix weights of the records, and we construct efficient similarity retrieval algorithms on top of this index. Finally, we verify that the method effectively supports large-scale weighted similarity retrieval; the results show that its query efficiency is more than five times that of Lucene's subsumption verification strategy.
prefix-based pruning; TF/IDF; vector space model; inverted index; information retrieval; database
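
The idea of weighted prefix pruning summarized in the abstract can be illustrated with a minimal sketch. It is not the index structure or algorithm proposed in the paper. It assumes cosine similarity over unit-normalized weight vectors (for example TF/IDF weights), a similarity threshold fixed when the index is built, and a flat in-memory inverted index. All function names (`normalize`, `prefix_tokens`, `build_index`, `search`) are hypothetical. The pruning rule follows from the Cauchy-Schwarz inequality: if the norm of a record's suffix is below the threshold, a query that shares no token with the record's prefix cannot reach the threshold, so only prefix tokens need to be indexed.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse {token: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}


def prefix_tokens(unit_vec, threshold):
    """Return the shortest weight-ordered prefix whose remaining suffix
    has norm below the threshold (assumed pruning rule, see lead-in)."""
    items = sorted(unit_vec.items(), key=lambda kv: (-kv[1], kv[0]))
    suffix_sq = sum(w * w for _, w in items)  # squared norm not yet covered
    prefix = []
    for token, w in items:
        # Small slack guards against floating-point rounding at the boundary.
        if math.sqrt(max(suffix_sq, 0.0)) < threshold - 1e-12:
            break
        prefix.append(token)
        suffix_sq -= w * w
    return prefix


def build_index(records, threshold):
    """Index each record under its prefix tokens only."""
    index = defaultdict(list)
    normalized = {}
    for rid, vec in records.items():
        unit = normalize(vec)
        normalized[rid] = unit
        for token in prefix_tokens(unit, threshold):
            index[token].append(rid)
    return index, normalized


def search(query, index, normalized, threshold):
    """Collect candidates from the postings, then verify each exactly."""
    q = normalize(query)
    candidates = set()
    for token in q:
        candidates.update(index.get(token, ()))
    results = []
    for rid in candidates:
        x = normalized[rid]
        sim = sum(w * x.get(t, 0.0) for t, w in q.items())
        if sim >= threshold:
            results.append((rid, sim))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Toy collection with made-up weights, for illustration only.
records = {
    "d1": {"vector": 2.0, "space": 2.0, "index": 1.0},
    "d2": {"prefix": 3.0, "pruning": 1.0},
    "d3": {"vector": 1.0, "database": 3.0},
}
index, normalized = build_index(records, threshold=0.8)
hits = search({"vector": 1.0, "space": 1.0}, index, normalized, threshold=0.8)
```

In this sketch the threshold used at query time must not be lower than the one used to build the index, since the prefixes were cut for that value. Verification still computes the exact similarity, so pruning only reduces the candidate set and does not change the result.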