An Efficient Approach to Compression Ratio Estimation for Data Delta Compression
Delta compression not only eliminates identical data chunks but also removes redundant fragments among similar chunks, achieving higher compression ratios than deduplication, and it has been integrated into many commercial products. However, exploiting this additional compressibility introduces significant overhead, including reading similar chunks from storage devices to identify their duplicate content; consequently, delta compression typically operates at only one-seventh the speed of deduplication. Moreover, this substantial overhead does not guarantee better compression ratios, because not all data possess sufficient compressibility to exploit. Therefore, when considering delta compression for a storage system, it is essential to quickly determine whether it is applicable to the current data. This study proposes EDCR, a delta compression estimation framework that promptly assesses the compressibility of data chunks based on their similarity features and thereby accurately evaluates the applicability of delta compression. The framework also incorporates sampling and correction schemes to improve the efficiency and accuracy of compression ratio estimation. Evaluations on multiple real-world datasets demonstrate that EDCR achieves an estimation error rate of less than 1.5%. Moreover, compared with existing delta compression frameworks, EDCR runs 18-24 times faster on a solid-state drive (SSD) and 16-146 times faster on a hard disk drive (HDD).
delta compression; compression ratio estimation; similarity feature; sampling; estimation correction
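To make the abstract's idea concrete, the following is a minimal Python sketch of sampling-based, similarity-feature-driven compression ratio estimation. It is not the EDCR implementation: the feature scheme (min-hash over windowed shingles via CRC32), the sampling rate, the overlap heuristic, and all names (min_hash_features, estimate_ratio, SAMPLE_RATE) are illustrative assumptions.

```python
# Hypothetical sketch: estimate the delta-compression ratio of a chunk set
# by sampling chunks and comparing their similarity features.
import random
import zlib

WINDOW = 48          # sliding-window size for shingling (assumed)
NUM_FEATURES = 4     # number of min-hash features per chunk (assumed)
SAMPLE_RATE = 0.1    # fraction of chunks inspected during estimation (assumed)


def min_hash_features(chunk: bytes, num_features: int = NUM_FEATURES) -> tuple:
    """Derive similarity features: the minimum hash of windowed shingles
    under several independently seeded hashes (a common super-feature scheme)."""
    features = []
    for seed in range(num_features):
        best = None
        for i in range(0, max(1, len(chunk) - WINDOW), 8):  # stride 8 to cut cost
            h = zlib.crc32(chunk[i:i + WINDOW] + bytes([seed]))
            best = h if best is None or h < best else best
        features.append(best)
    return tuple(features)


def estimate_ratio(chunks: list, sample_rate: float = SAMPLE_RATE, seed: int = 0) -> float:
    """Estimate the delta-compression ratio from a random sample of chunks.

    A sampled chunk that shares similarity features with an earlier sampled
    chunk is treated as delta-compressible; its estimated post-delta size
    shrinks with the number of shared features.  The ratio measured on the
    sample serves as the estimate for the whole dataset.
    """
    rng = random.Random(seed)
    sample = [c for c in chunks if rng.random() < sample_rate] or chunks[:1]

    seen = set()               # similarity features observed so far
    original = compressed = 0
    for chunk in sample:
        original += len(chunk)
        feats = min_hash_features(chunk)
        matches = sum(1 for f in feats if f in seen)
        if matches:
            # Heuristic: more shared features -> smaller delta remains.
            compressed += int(len(chunk) * (1 - matches / len(feats)))
        else:
            compressed += len(chunk)
        seen.update(feats)
    return original / max(1, compressed)


if __name__ == "__main__":
    base = random.Random(1).randbytes(4096)
    # Similar chunks: small edits on a shared base, plus a few unrelated chunks.
    data = [base[:2000] + bytes([i]) + base[2000:] for i in range(20)]
    data += [random.Random(100 + i).randbytes(4096) for i in range(5)]
    print(f"estimated delta-compression ratio: {estimate_ratio(data):.2f}")
```

In this sketch, sampling bounds the number of chunks whose features must be computed, and the feature-overlap heuristic stands in for actually reading base chunks and encoding deltas, which is the overhead the abstract identifies as the cost of full delta compression.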