Two-stage Leverage Importance Sampling Method in the Context of Big Data

He Jianfeng (贺建风)¹, He Hanji (何韩吉)¹

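Author information

1. Department of Quantitative Economics, School of Economics and Finance, South China University of Technology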

Abstract

In the context of big data, traditional survey sampling techniques need to be improved to cope with changes in data structure. Leverage importance sampling, which uses leverage scores as inclusion probabilities, increases the probability that sample points with high leverage values are selected, but it also increases the risk that outliers enter the sampled subset, causing the sampling estimate to deviate from the true value. To reduce the influence of outliers in big data and improve the robustness of estimates based on the sampled subset, this paper proposes a two-stage Leverage importance sampling method based on threshold self-selection. In the first stage, the method identifies a robust subset through ordered clustering of sample distances, so that the samples used for second-stage sampling are more representative. In the second stage, a robust sampling estimate is obtained from the robust subset. Simulation results show that the proposed method improves the accuracy of linear regression coefficient estimation and is applicable to drift-type, fluctuation-type, and mixed-type outliers. In the empirical analysis, the proposed method attains a relatively small mean squared error of predicted values on three case datasets, effectively reducing the influence of outliers.
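
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of ordinary (single-stage) leverage importance sampling for linear regression. It is not the authors' two-stage method: the first-stage ordered clustering and the threshold self-selection rule are not specified in the abstract and are not reproduced here. The function names `leverage_scores` and `leverage_subsample_ols`, the subsample size `r`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def leverage_scores(X):
    # Thin QR factorization X = QR; the hat matrix is H = Q Q^T,
    # so the i-th leverage score h_ii is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def leverage_subsample_ols(X, y, r, rng):
    # Sampling probabilities proportional to leverage scores.
    h = leverage_scores(X)
    pi = h / h.sum()
    # Draw r rows with replacement according to pi.
    idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
    # Importance weights 1 / (r * pi_i) correct for unequal inclusion probabilities.
    w = 1.0 / (r * pi[idx])
    sw = np.sqrt(w)
    # Weighted least squares on the subsample via row rescaling.
    beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta

# Simulated data (hypothetical, no outliers) to exercise the sketch.
rng = np.random.default_rng(0)
n, p = 20000, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_hat = leverage_subsample_ols(X, y, r=500, rng=rng)
print(beta_hat)
```

Because high-leverage points are drawn more often, an outlier with a large leverage score is also more likely to enter the subsample. That is the risk the abstract describes and the reason the proposed method screens for a robust subset before sampling.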

Key words

Large-scale Data/Linear Model/Ordered Clustering/Leverage Importance Sampling

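Funding

National Social Science Fund of China, General Program (19BTJ022)

National Statistical Science Research Project, Major Program (2020LD02)

National Statistical Science Research Project, Selected Program (2023LY010)

South China University of Technology, Central Universities Philosophy and Social Sciences Innovation Team Project (CXTD202405)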

Publication year

2024

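Statistical Research (统计研究)

Chinese Statistical Society; Institute of Statistical Science, National Bureau of Statistics of China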

Indexed in: CSTPCD / CSSCI / CHSSCD / Peking University Core Journals (北大核心)

Impact factor: 2.019

ISSN: 1002-4565
