Identification and processing of outliers for national real estate registration data based on statistical approaches
This study analyzes existing issues in the national real estate registration database, constructs a method to improve the quality of this big data, and assesses the effectiveness of the method. Using statistical methods such as kernel density estimation, we examined the residential registration data of S city within the national real estate registration data to identify extreme values and duplicate values in residential prices and to classify the cleaned data. ① The data were categorized according to their distribution: extreme values were eliminated first, and duplicate values and special values were then removed from the valid data to obtain the subject data, which comprise low-probability data, market behavior data, and non-market behavior data. ② For the market behavior data, the average transaction price was calculated and compared with the housing prices published by intermediary institutions. The difference was less than 15% in most regions, and this quantitative comparison supports the conclusion that the data in the real estate registration database are more authoritative and effective. The study thus establishes a data quality improvement method based on kernel density estimation for identifying extreme values and duplicate values in real estate registration data, and the results indicate that the method is robust and effective. Improving the quality of registration data provides a methodological basis for supplying more accurate data resources to applications of national real estate registration data.
data quality improvement; kernel density estimation; registration of immovable property; urban residential housing price
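The cleaning order summarized in the abstract (extreme values first, then duplicates, then a comparison against published prices) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the record layout `(property_id, unit_price)`, the function names, the Gaussian kernel, the Silverman rule-of-thumb bandwidth, the leave-one-out density, and the `density_ratio` cut-off of 0.05.

```python
import math
from statistics import quantiles, stdev


def silverman_bandwidth(prices):
    """Rule-of-thumb bandwidth (an assumed choice, not the study's)."""
    q1, _, q3 = quantiles(prices, n=4)
    sd = stdev(prices)
    iqr = q3 - q1
    # Use the more robust spread measure so one huge price cannot inflate it.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * len(prices) ** (-0.2)


def leave_one_out_density(prices, bandwidth):
    """Gaussian kernel density at each price, estimated from the other prices."""
    n = len(prices)
    norm = 1.0 / ((n - 1) * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for i, x in enumerate(prices):
        total = sum(
            math.exp(-0.5 * ((x - y) / bandwidth) ** 2)
            for j, y in enumerate(prices)
            if j != i
        )
        densities.append(norm * total)
    return densities


def clean_prices(records, density_ratio=0.05):
    """Drop low-density (extreme) prices first, then exact duplicate records.

    `records` is a list of hypothetical (property_id, unit_price) tuples.
    """
    prices = [price for _, price in records]
    densities = leave_one_out_density(prices, silverman_bandwidth(prices))
    threshold = density_ratio * max(densities)
    kept = [rec for rec, d in zip(records, densities) if d >= threshold]

    seen = set()
    unique = []
    for rec in kept:
        if rec not in seen:  # keep the first copy of each duplicated record
            seen.add(rec)
            unique.append(rec)
    return unique


def relative_gap(registry_mean, agency_mean):
    """Relative difference between registry and agency average prices."""
    return abs(registry_mean - agency_mean) / agency_mean
```

With a handful of made-up records, `clean_prices` would be expected to discard a price that lies far from the bulk of the distribution and to collapse repeated records, after which `relative_gap` can be checked against the 15% figure mentioned in the abstract. The classification of the remaining data into low-probability, market behavior, and non-market behavior data is not sketched here, because the abstract does not state the criteria used.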