Detection and Elimination of Duplicate Data Using Token-Based Method for a Data Warehouse: A Clustering Based Approach
The process of detecting and removing database defects and duplicates is referred to as data cleaning. The fundamental issue in duplicate detection is that inexact duplicates in a database may refer to the same real-world object because of errors and missing data. Duplicate elimination is hard because duplicates arise from different types of errors, such as typographical errors, missing values, abbreviations and different representations of the same logical value. In existing approaches, duplicate detection and elimination is domain dependent. These domain-dependent methods rely on similarity functions and thresholds for duplicate elimination and produce a high number of false positives.

This research work presents a general sequential framework for duplicate detection and elimination. The proposed framework uses six steps to improve the process of duplicate detection and elimination. First, an attribute selection algorithm is used to select the attributes best suited for duplicate identification and elimination. In the next step, a token is formed from the selected attribute field values. After token formation, a clustering algorithm or blocking method is used to group the records based on their similarity values. The best blocking key is selected by comparing the duplicate detection performance of the candidate keys. In the next step, the threshold value is calculated based on the similarities between records and fields. Then, a rule-based approach is used to detect duplicates and to eliminate poor-quality duplicates by retaining only one copy of the best duplicate record. Finally, all the cleaned records are grouped or merged and made available for the next process.

This framework reduces the number of false positives without missing true duplicates. In contrast to previous approaches, the token concept is included to speed up the data cleaning process and reduce its complexity. Several blocking keys are analysed through extensive experiments to select the key that best brings similar records together, so that comparing all pairs of records is avoided. A rule-based approach is used to identify exact and inexact duplicates and to eliminate them.
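The following is a minimal sketch of the token-based pipeline described in the abstract: tokens are formed from selected attributes, records are grouped by a blocking key, and near-duplicates within each block are removed using a similarity threshold. The attribute names, the prefix blocking key, the SequenceMatcher similarity measure, and the 0.85 threshold are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of token-based duplicate detection with blocking.
# Field names, blocking key length and threshold are hypothetical.
from difflib import SequenceMatcher


def tokenize(record, attributes):
    """Form a token from the selected attribute values (attribute selection + token formation)."""
    parts = []
    for attr in attributes:
        value = str(record.get(attr, "")).lower().strip()
        parts.extend(value.split())
    return " ".join(sorted(parts))


def block_records(records, attributes, key_length=4):
    """Group records by a blocking key (here, a prefix of the token)
    so that only records within the same block are compared."""
    blocks = {}
    for rec in records:
        key = tokenize(rec, attributes)[:key_length]
        blocks.setdefault(key, []).append(rec)
    return blocks


def deduplicate(records, attributes, threshold=0.85):
    """Retain one representative per group of similar records."""
    cleaned = []
    for block in block_records(records, attributes).values():
        kept = []
        for rec in block:
            token = tokenize(rec, attributes)
            is_dup = any(
                SequenceMatcher(None, token, tokenize(k, attributes)).ratio() >= threshold
                for k in kept
            )
            if not is_dup:  # keep only one copy of each duplicate group
                kept.append(rec)
        cleaned.extend(kept)
    return cleaned


if __name__ == "__main__":
    data = [
        {"name": "John Smith", "city": "Coimbatore"},
        {"name": "Jon Smith",  "city": "Coimbatore"},  # inexact duplicate
        {"name": "Mary Jones", "city": "Chennai"},
    ]
    print(deduplicate(data, ["name", "city"]))
```

In this sketch the inexact duplicate "Jon Smith" falls into the same block as "John Smith" and is dropped because its token similarity exceeds the threshold, while "Mary Jones" sits in a different block and is never compared against the others.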
J. Jebamalar Tamilselvi, V. Saravanan
Department of Computer Application, Karunya University, Coimbatore - 641 114, Tamilnadu, India
2009
International Journal of Computational Intelligence Research