Self-Training Ensemble Classification Model for Credit Evaluation Based on Data Editing
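Liu Wenjie 1, Wang Guoqiang 1
Author information
- 1. School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China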
Abstract
To address class imbalance in credit data and the difficulty of obtaining labeled data, a self-training ensemble classification model for credit evaluation based on data editing is proposed. First, the synthetic minority over-sampling technique (SMOTE) is applied to the labeled samples to alleviate the imbalance. Second, a Stacking ensemble model is built on the small labeled dataset and used to assign pseudo-labels to the unlabeled samples, yielding additional labeled data. Finally, an improved double-weighted semi-supervised K-nearest-neighbor algorithm is proposed and used to edit the pseudo-labeled data and expand the training set, repeating until the model converges. Simulation experiments on UCI and Kaggle credit evaluation datasets show that the model achieves better predictive performance and identifies minority-class samples more effectively.
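The three stages named in the abstract (oversampling, pseudo-labeling, editing with a weighted nearest-neighbor check) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for the Stacking ensemble, plain inverse-distance weighting stands in for the improved double-weighted K-nearest-neighbor rule (whose details the abstract does not give), and all function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling: interpolate between a minority point and
    one of its k nearest minority neighbours. Needs >= 2 minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic


def centroid_predict(X, y, x):
    """Stand-in base learner (assumption, NOT the Stacking ensemble):
    return the label of the nearest class centroid."""
    best_label, best_dist = None, math.inf
    for label in set(y):
        pts = [p for p, l in zip(X, y) if l == label]
        centroid = tuple(sum(col) / len(pts) for col in zip(*pts))
        d = math.dist(centroid, x)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def knn_agrees(X, y, x, label, k=3):
    """Editing step (assumption: simple inverse-distance weighting).
    Keep a pseudo-label only if the weighted k-NN vote matches it."""
    nearest = sorted(zip(X, y), key=lambda py: math.dist(py[0], x))[:k]
    votes = {}
    for p, l in nearest:
        votes[l] = votes.get(l, 0.0) + 1.0 / (math.dist(p, x) + 1e-9)
    return max(votes, key=votes.get) == label


def self_train(X, y, unlabeled, max_rounds=10):
    X, y, pool = list(X), list(y), list(unlabeled)
    # Step 1: balance the labeled set with synthetic minority samples.
    counts = {l: y.count(l) for l in set(y)}
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    new = smote([p for p, l in zip(X, y) if l == minority], deficit)
    X += new
    y += [minority] * len(new)
    for _ in range(max_rounds):
        accepted, rejected = [], []
        for x in pool:
            label = centroid_predict(X, y, x)  # Step 2: pseudo-label
            ok = knn_agrees(X, y, x, label)    # Step 3: edit
            (accepted if ok else rejected).append((x, label))
        if not accepted:
            break  # no pseudo-label survived editing: treat as converged
        X += [x for x, _ in accepted]
        y += [l for _, l in accepted]
        pool = [x for x, _ in rejected]
        if not pool:
            break
    return X, y


# Toy data: class 0 is the majority, class 1 the minority.
X_lab = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (6, 5)]
y_lab = [0, 0, 0, 0, 1, 1]
X_unl = [(0.5, 0.5), (5.5, 5.2), (6, 6)]
X_out, y_out = self_train(X_lab, y_lab, X_unl)
```

On this toy data the labeled set is first balanced to four samples per class, and the unlabeled points are then added only if the neighbor vote confirms their pseudo-labels. In practice the base learner and the editing rule would be replaced by the ensemble and the weighted K-nearest-neighbor variant described in the paper.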
Key words
credit evaluation/semi-supervised learning/Stacking ensemble strategy/data editing/self-training
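Funding
National Natural Science Foundation of China, General Program (11971302)
Pudong New Area Science and Technology Development Fund, Industry-University-Research Special Fund (Artificial Intelligence) (PKX2020-R02)
National Statistical Science Research Project, General Project (2020LY067)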
Publication year
2024