Algorithm for detecting approximate duplicate records in massive data
To address the low precision and poor time efficiency of existing algorithms for detecting approximately duplicate records in massive data sets, an integrated weighting method and a filtration method based on string length were adopted. The integrated weighting method combines user experience with mathematical statistics to calculate the weight of each attribute, making the weight calculation more scientific. The string-length-based filtration method exploits the length difference between strings to terminate the edit-distance algorithm early, which reduces the number of record pairs that must be matched during detection. Experimental results show that the weight vector produced by the integrated weighting method reflects the importance of each field more comprehensively and accurately, and that the string-length-based filtration method reduces the comparison time between records, effectively addressing the detection of approximately duplicate records in massive data.
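The abstract does not give implementation details, but the two ideas can be sketched. The length filter rests on a standard property of edit distance: it is bounded below by the length difference of the two strings, so pairs whose lengths differ by more than the allowed threshold can be rejected without running the dynamic program. The weight combination formula is not specified in the abstract; a linear blend of user-assigned and statistically derived weights (with hypothetical parameter `alpha`) is one plausible instantiation.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via O(m*n) dynamic programming."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

def filtered_distance(a: str, b: str, max_dist: int):
    """Length-based pre-filter: edit distance >= |len(a) - len(b)|,
    so a pair whose lengths differ by more than max_dist is pruned
    without computing the full dynamic program."""
    if abs(len(a) - len(b)) > max_dist:
        return None  # pruned: cannot be an approximate duplicate
    d = edit_distance(a, b)
    return d if d <= max_dist else None

def integrated_weights(user_w, stat_w, alpha=0.5):
    """One possible 'integrated weighting': a linear blend of
    user-experience weights and statistically derived weights,
    normalized so the field weights sum to 1. (alpha is an
    assumed mixing parameter, not from the paper.)"""
    combined = [alpha * u + (1 - alpha) * s for u, s in zip(user_w, stat_w)]
    total = sum(combined)
    return [w / total for w in combined]
```

In practice the pruned pairs dominate on large data sets, which is where the reported time savings would come from: the O(m*n) dynamic program is replaced by a constant-time length check for most record pairs.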