Intelligent Evidence Set Selection Method for Diverse Data Cleaning Tasks
Due to the limitations of data cleaning algorithms designed specifically for individual data quality issues and their inability to effectively address multiple coexisting data quality enhancement requirements, a collaborative approach employing multiple data cleaning methods can be adopted to fulfill various data cleaning needs. This paper formulates the data cleaning problem as a task of evidence set generation and selection. By utilizing an incremental quality assessment scheme based on aggregate queries and an operator result selection scheme based on intermediate operator evidence sets, efficient data cleaning involving a combination of diverse cleaning methods is achieved across various cleaning tasks. In the proposed cleaning model, the operator repository yields data cleaning results and transforms them into intermediate operators. The sampler in the midstream module distributes and prunes the set of intermediate operators to provide the searcher with a high-quality candidate evidence set. The downstream searcher, guided by the quality evaluator, selects evidence sets. Upon completion of the search process, the upstream operator repository updates data and necessary parameters, facilitating the reiteration of intermediate operator generation. Finally, extensive experiments are conducted on three real-world datasets of varying scales. Performance verification across different data cleaning tasks demonstrates the feasibility of operator orchestration for any type of data cleaning requirement, underpinning the proposed method's stable precision and recall in scenarios involving diverse data quality constraints, dynamics, and large-scale data cleaning. Furthermore, a performance comparison with existing intelligent data cleaning systems reveals that the proposed method outperforms these systems by over 15% in tasks related to outlier detection, rule violations, and mixed errors, all within the same cleaning time.
Data cleaning; Data quality assessment; Pipeline system design; Operator selection; Evidence set
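The generate, sample, search, and update loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the names `Evidence`, `fd_operator`, `outlier_operator`, `violations`, `sample`, `search`, and `clean` are hypothetical, the two quality rules (zip determines city, age lies in [0, 120]) are invented, the searcher is a simple greedy stand-in, and the quality score is recomputed in full on each step rather than maintained incrementally as the paper proposes.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]


@dataclass(frozen=True)
class Evidence:
    """One proposed repair (hypothetical structure): set table[row][column] = value."""
    row: int
    column: str
    value: object
    operator: str
    confidence: float


def fd_operator(table: Table) -> List[Evidence]:
    """Operator 1: propose the majority city per zip (assumed rule: zip -> city)."""
    by_zip: Dict[object, Counter] = {}
    for r in table:
        by_zip.setdefault(r["zip"], Counter())[r["city"]] += 1
    out = []
    for i, r in enumerate(table):
        counts = by_zip[r["zip"]]
        city, n = counts.most_common(1)[0]
        if r["city"] != city:
            out.append(Evidence(i, "city", city, "fd", n / sum(counts.values())))
    return out


def outlier_operator(table: Table) -> List[Evidence]:
    """Operator 2: propose the median valid age for ages outside [0, 120]."""
    valid = [r["age"] for r in table if 0 <= r["age"] <= 120]
    if not valid:
        return []
    m = median(valid)
    return [Evidence(i, "age", m, "outlier", 0.5)
            for i, r in enumerate(table) if not 0 <= r["age"] <= 120]


def violations(table: Table) -> int:
    """Quality evaluator: an aggregate count of rule violations (lower is better)."""
    bad_age = sum(1 for r in table if not 0 <= r["age"] <= 120)
    cities: Dict[object, set] = {}
    for r in table:
        cities.setdefault(r["zip"], set()).add(r["city"])
    return bad_age + sum(len(c) - 1 for c in cities.values())


def sample(evidence: List[Evidence]) -> List[Evidence]:
    """Sampler: prune to the highest-confidence candidate per cell."""
    best: Dict[tuple, Evidence] = {}
    for e in evidence:
        key = (e.row, e.column)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    return list(best.values())


def search(table: Table, candidates: List[Evidence]) -> List[Evidence]:
    """Searcher: greedily keep each repair that lowers the violation count."""
    selected, score = [], violations(table)
    for e in sorted(candidates, key=lambda c: -c.confidence):
        old = table[e.row][e.column]
        table[e.row][e.column] = e.value
        new_score = violations(table)
        if new_score < score:
            score = new_score
            selected.append(e)
        else:
            table[e.row][e.column] = old  # roll back a repair that does not help
    return selected


def clean(table: Table, operators: List[Callable[[Table], List[Evidence]]],
          max_rounds: int = 5) -> Table:
    """Iterate: operators generate evidence, sampler prunes, searcher selects."""
    table = [dict(r) for r in table]  # work on a copy
    for _ in range(max_rounds):
        candidates = sample([e for op in operators for e in op(table)])
        if not search(table, candidates):
            break  # no repair improved quality, so stop iterating
    return table
```

Calling `clean(rows, [fd_operator, outlier_operator])` on a small list of dictionaries with `zip`, `city`, and `age` keys returns a repaired copy and leaves the input untouched. Scoring each candidate through a single aggregate count is what lets repairs proposed by unrelated operators be compared on a common scale.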