首页|基于CRF的分区倒排索引压缩算法

基于CRF的分区倒排索引压缩算法

扫码查看
倒排索引是大型搜索引擎的核心数据结构,本质是倒排列表中整数序列的集合.倒排索引压缩可以有效减少倒排索引所占空间,提高对关键词的检索效率.本文提出的基于条件随机场(CRF)的分区倒排索引压缩算法主要关注域值分区的分区方式.该算法对序列进行预分区,并且使用条件随机场对预分区进行标注并重组,有效减少了压缩时间.根据分区类型,该算法使用相应的编码方式,进一步减少了压缩后的空间占用.与其他倒排索引压缩算法进行对比实验分析,结果表明本文算法在压缩率上超过目前一些域值分区的算法,并且在解压时间上与其他域值分区算法相当.该算法在时间和空间上取得了较好的平衡.
A Partition Inverted Index Compression Algorithm Based on CRF
The inverted index is the core data structure of a large search engine,which is essentially a collection of integer se-quences in an inverted table.Inverted index compression can effectively reduce the space occupied by inverted indexes and im-prove retrieval efficiency of keywords.The partition inverted index compression algorithm based on conditional random field(CRF)mainly focuses on the partition mode of universe partition.The algorithm pre-partitions the sequence and uses conditional random fields to label and reorganize the pre-partitions,which effectively reduces the compression time.According to the parti-tion type,the algorithm uses the corresponding encoding method to further reduce the space occupation after compression.Com-pared with other inverted index compression algorithms,this algorithm outperforms some current universe partition algorithms in compression ratio,and is equivalent to other universes partition algorithms in decompression time.The algorithm achieves a good balance between time and space.

inverted indexdata compressionuniverse partitionconditional random fieldsearch engines

王子琛、瞿有利

展开 >

北京交通大学计算机与信息技术学院,北京 100044

倒排索引 数据压缩 域值分区 条件随机场 搜索引擎

国家自然科学基金资助项目

61976015

2024

计算机与现代化
江西省计算机学会 江西省计算技术研究所

计算机与现代化

CSTPCD
影响因子:0.472
ISSN:1006-2475
年,卷(期):2024.(2)
  • 27