Locality sensitive blocking (LSB): A robust blocking technique for data deduplication
Sage
Data deduplication is the process of discovering multiple representations of the same entity in an information system. Blocking is a benchmark technique for avoiding exhaustive pairwise record comparisons in data deduplication. Standard blocking (SB) aims to place potential duplicate records in the same block on the basis of a blocking key. Detailed comparisons are then made only among the records residing in the same block. Selecting the blocking key is a tedious process that involves exponentially many alternatives, and the outcome of SB varies considerably when the blocking key changes. To this end, we propose a robust blocking technique called Locality Sensitive Blocking (LSB) that does not require the selection of a blocking key. The experimental results show an increase of up to 0.448 in F-score compared with SB. Furthermore, LSB is found to be more robust to blocking parameters and data noise.
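To illustrate why SB depends so strongly on the blocking key, the following is a minimal sketch on hypothetical toy records. The record layout, the function names, and the parameters `bands` and `rows` are assumptions made for illustration. The abstract does not describe the internals of LSB, so the second half uses a generic MinHash banding scheme from locality sensitive hashing as a stand-in for key-free blocking; it should not be read as the authors' algorithm.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records (id, name, city); not data from the paper.
RECORDS = [
    (1, "john smith", "lahore"),
    (2, "jon smith", "lahore"),
    (3, "john smith", "karachi"),
    (4, "mary jones", "lahore"),
]

def candidate_pairs(blocks):
    """Return the set of record-id pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def standard_blocking(records, key_fn):
    """Group records by one blocking key; only same-block records are paired."""
    blocks = defaultdict(set)
    for rec in records:
        blocks[key_fn(rec)].add(rec[0])
    return candidate_pairs(blocks)

def shingles(text, q=2):
    """Character q-grams of a string (assumes len(text) >= q)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def minhash_signature(tokens, num_hashes):
    """One minimum hash value per seeded hash function."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    )

def lsh_blocking(records, bands=4, rows=2):
    """Generic MinHash banding over the whole record; no blocking key is chosen."""
    blocks = defaultdict(set)
    for rec in records:
        text = " ".join(rec[1:])
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            # Records whose signatures agree on an entire band share a block.
            blocks[(b, sig[b * rows:(b + 1) * rows])].add(rec[0])
    return candidate_pairs(blocks)

# The same data yields different candidate pairs under different keys.
by_city = standard_blocking(RECORDS, key_fn=lambda r: r[2])
by_name_prefix = standard_blocking(RECORDS, key_fn=lambda r: r[1][:4])
by_lsh = lsh_blocking(RECORDS)
```

In this toy data, blocking on city pairs records 1 and 2 but misses record 3, while blocking on the first four characters of the name pairs records 1 and 3 but misses record 2. The banding variant hashes the whole record, so no single attribute has to be picked in advance; which pairs it returns depends on the assumed `bands` and `rows` values.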
Blocking; candidate record pairs; data integration; data matching; locality sensitive hashing
Asif Sohail, Waqar ul Qounain
Department of Information Technology, Faculty of Computing and Information Technology, University of the Punjab, Lahore, Pakistan
National Center of Artificial Intelligence, University of the Punjab, Lahore, Pakistan; Department of Information Technology, Faculty of Computing and Information Technology, University of the Punjab, Lahore, Pakistan