Locality sensitive blocking (LSB): A robust blocking technique for data deduplication

Data deduplication is the process of discovering multiple representations of the same entity in an information system. Blocking is a benchmark technique for avoiding exhaustive pair-wise record comparisons in data deduplication. Standard blocking (SB) aims to place potential duplicate records in the same block on the basis of a blocking key; detailed comparisons are then made only among the records residing in the same block. Selecting a blocking key is a tedious process with exponentially many alternatives, and the outcome of SB varies considerably with the choice of key. To this end, we propose a robust blocking technique called Locality Sensitive Blocking (LSB) that does not require the selection of a blocking key. Experimental results show an increase of up to 0.448 in F-score compared with SB. Furthermore, LSB is found to be more robust to blocking parameters and data noise.
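
The contrast drawn in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say how LSB builds its blocks, so the key-free variant below assumes generic MinHash signatures with banding, a common locality sensitive hashing scheme. All function names, the sample records, and the `bands`, `rows`, and `q` parameters are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def pairs_within(blocks):
    """Candidate pairs are all record-id pairs that share a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def standard_blocking(records, key_fn):
    """Standard blocking: one block per blocking-key value."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key_fn(rec)].append(rid)
    return pairs_within(blocks)


def shingles(text, q=2):
    """Character q-grams of the lowercased text (the whole text if too short)."""
    s = text.lower()
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def minhash(tokens, num_hashes):
    """MinHash signature: the minimum of each seeded hash over the tokens."""
    def h(seed, tok):
        # Assumption: seeded SHA-256 stands in for a family of hash functions.
        digest = hashlib.sha256(f"{seed}:{tok}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_blocking(records, bands=8, rows=2):
    """Key-free blocking: records sharing any signature band share a block."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        sig = minhash(shingles(" ".join(rec)), bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            blocks[(b, band)].append(rid)
    return pairs_within(blocks)


# Hypothetical sample data: record id -> (name, city)
records = {
    1: ("John Smith", "Lahore"),
    2: ("Jon Smith", "Lahore"),   # likely duplicate of record 1
    3: ("Maria Khan", "Karachi"),
}

by_name_prefix = standard_blocking(records, lambda r: r[0][:3])  # set()
by_city = standard_blocking(records, lambda r: r[1])             # {(1, 2)}
keyless = lsh_blocking(records)  # candidate pairs found without a blocking key
```

With the name-prefix key the duplicate pair is missed, while the city key recovers it, which illustrates how strongly SB depends on the chosen key. The banded MinHash variant needs no key, though its candidate set still depends on the assumed `bands` and `rows` settings.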

Blocking; candidate record pairs; data integration; data matching; locality sensitive hashing

Asif Sohail, Waqar ul Qounain


Department of Information Technology, Faculty of Computing and Information Technology, University of the Punjab, Lahore, Pakistan

National Center of Artificial Intelligence, University of the Punjab, Lahore, Pakistan; Department of Information Technology, Faculty of Computing and Information Technology, University of the Punjab, Lahore, Pakistan

2024

Journal of information science: Principles & practice


EI
ISSN:0165-5515
Year, volume (issue): 2024, 50(6)