Algorithm for detecting approximate duplicate records in massive data
To address the low precision and poor time efficiency of existing algorithms for detecting approximately duplicate records in massive data sets, an integrated weighting method and a filtration method based on string length were adopted. The integrated weighting method combines user experience with mathematical statistics to calculate the weight of each attribute, making the weight calculation more scientific. The string-length-based filtration method exploits the length difference between strings to terminate the edit-distance algorithm early, which reduces the number of record pairs that must be matched during detection. Experimental results show that the weight vector produced by the integrated weighting method reflects the importance of each field more comprehensively and accurately, and that the string-length-based filtration method reduces the comparison time between records, effectively addressing the detection of approximately duplicate records in massive data.
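The abstract does not give implementation details, but the two ideas can be sketched. The length filter rests on a standard property of edit distance: it is bounded below by the length difference of the two strings, so pairs whose lengths differ by more than the allowed threshold can be rejected without running the dynamic program. The weight combination formula is not specified in the abstract; a linear blend of user-assigned and statistically derived weights (with hypothetical parameter `alpha`) is one plausible instantiation.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via O(m*n) dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

def filtered_distance(a: str, b: str, max_dist: int):
    """Length-based pre-filter: edit distance >= |len(a) - len(b)|,
    so a pair whose lengths differ by more than max_dist is pruned
    without computing the full dynamic program."""
    if abs(len(a) - len(b)) > max_dist:
        return None  # pruned: cannot be an approximate duplicate
    d = edit_distance(a, b)
    return d if d <= max_dist else None

def integrated_weights(user_w, stat_w, alpha=0.5):
    """One possible 'integrated weighting': a linear blend of
    user-experience weights and statistically derived weights,
    normalized so the field weights sum to 1. (alpha is an
    assumed mixing parameter, not from the paper.)"""
    combined = [alpha * u + (1 - alpha) * s for u, s in zip(user_w, stat_w)]
    total = sum(combined)
    return [w / total for w in combined]
```

In practice the pruned pairs dominate on large data sets, which is where the reported time savings would come from: the O(m*n) dynamic program is replaced by a constant-time length check for most record pairs.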