Variable Granularity-based Chunk-context Aware Similar Data Deduplication Technique for Cloud Storage

Yang Zhihuan (阳智欢)¹, Tian Wenlong (田纹龙)², He Tingting (何婷婷)³, Ye Xuming (叶旭明)¹, Tang Jia (唐佳)¹

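Author information

1. School of Computer Science, University of South China, Hengyang, Hunan 421001
2. School of Computer Science, University of South China, Hengyang, Hunan 421001; School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
3. School of Education Science, Hengyang Normal University, Hengyang, Hunan 421010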

Abstract

To address the limited effectiveness and high metadata overhead of existing similar data deduplication techniques in cloud storage environments, a variable granularity-based, chunk-context aware similar data deduplication technique for cloud storage is proposed. The technique uses a feature extraction algorithm based on sub-block reorganization to extract initial features from the internal structure of each data block's content, and then uses a BP (Back Propagation) neural network context-aware model to embed the contextual feature information of the data block into these initial features, yielding variable-granularity data blocks with contextual semantic embedding. By controlling the data block size, the technique dynamically merges adjacent similar data blocks or non-redundant data blocks to reduce metadata overhead, and splits the transition regions that lie between similar and non-redundant data blocks, which gives a better representation of similar data blocks. Finally, to evaluate its performance, a prototype of the variable-granularity similar data detection algorithm, rCARD, was implemented and tested on real-world datasets. The experimental results show that, compared with Finesse, the latest similarity-detection deduplication technique, rCARD achieves a higher deduplication ratio while significantly reducing metadata size, and speeds up similarity detection by up to 11.07 times.
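The merging step described in the abstract can be illustrated with a minimal sketch. This is not the rCARD implementation: the `Chunk` record, the `similar` label, the `merge_adjacent` function, and the `max_size` cap are all hypothetical names introduced here, and the sketch assumes each chunk has already been labeled as similar or non-redundant by some earlier detection step. It only shows how merging contiguous chunks with the same label, subject to a size limit, reduces the number of metadata entries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    offset: int    # byte offset of the chunk in the data stream
    length: int    # chunk length in bytes
    similar: bool  # hypothetical label: True if judged similar to stored data


def merge_adjacent(chunks: List[Chunk], max_size: int) -> List[Chunk]:
    """Merge runs of contiguous chunks that share a label, capped at max_size bytes."""
    merged: List[Chunk] = []
    for c in chunks:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.similar == c.similar
            and last.offset + last.length == c.offset
            and last.length + c.length <= max_size
        ):
            # Same label, contiguous, and within the size cap: grow the last chunk.
            last.length += c.length
        else:
            # Start a new merged chunk; copy so the input list is not mutated.
            merged.append(Chunk(c.offset, c.length, c.similar))
    return merged


# Illustrative input only: five 4 KiB chunks with made-up labels.
chunks = [
    Chunk(0, 4096, True),
    Chunk(4096, 4096, True),
    Chunk(8192, 4096, False),
    Chunk(12288, 4096, False),
    Chunk(16384, 4096, True),
]
merged = merge_adjacent(chunks, max_size=16384)
print([(m.offset, m.length, m.similar) for m in merged])
# prints [(0, 8192, True), (8192, 8192, False), (16384, 4096, True)]
```

In this toy input, five chunk records collapse to three, so fewer metadata entries need to be stored. The size cap plays the role of the data block size control mentioned in the abstract; how the actual technique chooses that limit and splits transition regions is not specified here.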

Key words

similar data deduplication/data block semantics/variable granularity/cloud storage/metadata

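Funding

Natural Science Foundation of Hunan Province (2021JJ40468)

Outstanding Youth Project of the Hunan Provincial Department of Education (22B0437)

Hunan Provincial Teacher Education Research Base Project (XJK23AJD014)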

Publication year

2024

Computer Technology and Development (计算机技术与发展)

Shaanxi Computer Society

CSTPCD

Impact factor: 0.621

ISSN: 1673-629X

Number of references: 22