Journal of Shanghai University of Engineering Science, 2024, Vol. 38, Issue (1): 83-89.

A self-training ensemble classification model for credit evaluation based on data editing

LIU Wenjie¹, WANG Guoqiang¹

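Author information

  • 1. School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China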

Abstract

To address class imbalance in credit data and the difficulty of obtaining class-labeled data, a self-training ensemble classification model for credit evaluation based on data editing is proposed. First, the synthetic minority over-sampling technique (SMOTE) is applied to the labeled samples to alleviate the class imbalance. Second, a Stacking ensemble model is built on the small labeled dataset and used to assign pseudo-labels to the unlabeled samples, yielding additional class-labeled data. Finally, an improved double-weighted semi-supervised K-nearest neighbor algorithm is proposed and used to edit the pseudo-labeled data and expand the training set; these steps are repeated until the model converges. Simulation experiments on UCI and Kaggle credit evaluation datasets show that the model achieves better predictive performance and identifies minority-class samples more effectively.
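The pipeline described in the abstract (oversample, pseudo-label, edit, expand, repeat) can be sketched roughly as follows. This is only an illustrative sketch using the Python standard library, not the authors' implementation: the function names `smote`, `centroid_predict`, `weighted_knn_vote` and `self_train` are hypothetical, a nearest-centroid rule stands in for the Stacking ensemble, and a plain distance-weighted k-NN vote stands in for the improved double-weighted K-nearest neighbor algorithm, whose weighting scheme the abstract does not specify. The toy data are invented for the demonstration.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """SMOTE idea: interpolate between a minority sample and one of its
    k nearest minority neighbours to create n_new synthetic samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


def centroid_predict(X, y, x):
    """Assumed stand-in for the Stacking ensemble: return the label of
    the class centroid closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        pts = [p for p, l in zip(X, y) if l == label]
        centroid = [sum(col) / len(pts) for col in zip(*pts)]
        d = math.dist(centroid, x)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def weighted_knn_vote(X, y, x, k=3):
    """Assumed stand-in for the double-weighted KNN: each of the k nearest
    training samples votes with weight 1 / distance."""
    nearest = sorted(zip(X, y), key=lambda t: math.dist(t[0], x))[:k]
    votes = {}
    for p, label in nearest:
        votes[label] = votes.get(label, 0.0) + 1.0 / (math.dist(p, x) + 1e-9)
    return max(votes, key=votes.get)


def self_train(X_lab, y_lab, X_unlab, k=3, max_rounds=10):
    """Self-training loop: pseudo-label the pool, keep only pseudo-labels
    confirmed by the k-NN vote (data editing), and grow the training set."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        accepted, rejected = [], []
        for x in pool:
            pseudo = centroid_predict(X, y, x)
            if weighted_knn_vote(X, y, x, k) == pseudo:
                accepted.append((x, pseudo))
            else:
                rejected.append(x)
        if not accepted:  # nothing new was added: treat as converged
            break
        for x, label in accepted:
            X.append(x)
            y.append(label)
        pool = rejected
    return X, y, pool


# Toy data: class 0 is the majority, class 1 the minority.
X_lab = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (0.4, 0.4), (5.0, 5.0), (5.5, 4.8)]
y_lab = [0, 0, 0, 0, 1, 1]
X_unlab = [(0.2, 0.3), (5.2, 4.9), (0.3, 0.1), (4.9, 5.1)]

new_minority = smote([x for x, l in zip(X_lab, y_lab) if l == 1], n_new=2)
X_bal = X_lab + new_minority
y_bal = y_lab + [1] * len(new_minority)
X_final, y_final, leftover = self_train(X_bal, y_bal, X_unlab)
print(y_final[len(y_bal):], leftover)
```

In this sketch a pseudo-label is kept only when the two rules agree, which is one simple way to read "data editing"; the stopping rule (no newly accepted samples) is likewise an assumption, since the abstract only says the procedure runs until the model converges.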

Key words

credit evaluation/semi-supervised learning/Stacking ensemble strategy/data editing/self-training

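Funding

National Natural Science Foundation of China, General Program (11971302)

Pudong New Area Science and Technology Development Fund, Industry-University-Research Special Fund (Artificial Intelligence) (PKX2020-R02)

National Statistical Science Research Project, General Project (2020LY067)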

Year of publication

2024
Journal of Shanghai University of Engineering Science
Shanghai University of Engineering Science

Impact factor: 0.264
ISSN: 1009-444X
Number of references: 23