Imbalanced Data Oversampling Method Based on Variance Transfer

Resampling is an important approach to imbalanced data classification. When the data set is very small, however, undersampling discards important information, so oversampling has become the research focus for this problem. Existing oversampling methods partially resolve the imbalance between classes, but they essentially introduce no additional information into the minority class and still carry a risk of overfitting. To address these problems, this paper proposes Variance Transfer Oversampling (VTO), a minority class synthesis method based on transferring variance from the majority class. VTO extracts sample shift vectors from the sufficiently diverse majority class and adjusts them using the feature weight matrices of the minority and majority classes. The shift vectors that pass a confidence condition are then superimposed on the center of the minority class samples, so that majority class variance is introduced into the generation of new minority class samples and the minority class feature space is enriched. To verify the effectiveness of the proposed algorithm, a decision tree is trained as the classification model on 6 KEEL data sets, and VTO is compared with SMOTEENN and other oversampling methods using F-score and PR-AUC as evaluation metrics. The results show that VTO has an advantage in handling imbalanced data classification.
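
The abstract describes the pipeline only in outline, so the following is a minimal sketch of the general idea rather than the authors' implementation. The function name `vto_sketch`, the per-feature weighting (the ratio of minority to majority standard deviation), and the confidence condition (a candidate must lie closer to the minority center than to the majority center) are all assumptions made for illustration; the paper's actual feature weight matrix and filtering rule are not given here.

```python
import random
from statistics import mean, pstdev


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def vto_sketch(minority, majority, n_new, seed=0):
    """Illustrative variance-transfer oversampling (hypothetical helper).

    Returns up to n_new synthetic minority samples built by adding
    rescaled majority-class offsets to the minority-class center.
    """
    rng = random.Random(seed)
    dims = range(len(minority[0]))

    # Per-feature centers of the two classes.
    c_min = [mean([x[j] for x in minority]) for j in dims]
    c_maj = [mean([x[j] for x in majority]) for j in dims]

    # ASSUMED weighting: shrink each feature of a majority offset by the
    # ratio of minority spread to majority spread for that feature.
    weights = []
    for j in dims:
        s_min = pstdev([x[j] for x in minority])
        s_maj = pstdev([x[j] for x in majority])
        weights.append(s_min / s_maj if s_maj > 0 else 0.0)

    synthetic = []
    attempts = 0
    # Cap the number of draws so the loop ends even if few candidates pass.
    while len(synthetic) < n_new and attempts < 100 * n_new:
        attempts += 1
        donor = rng.choice(majority)
        # Shift vector: the donor's offset from the majority center, reweighted.
        shift = [w * (v - c) for w, v, c in zip(weights, donor, c_maj)]
        candidate = [c + s for c, s in zip(c_min, shift)]
        # ASSUMED confidence condition: keep the candidate only if it is
        # closer to the minority center than to the majority center.
        if sq_dist(candidate, c_min) < sq_dist(candidate, c_maj):
            synthetic.append(candidate)
    return synthetic


# Toy data: a tight minority cluster and a wider majority cluster.
minority = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
majority = [[5.0, 5.0], [6.0, 4.0], [4.0, 6.0], [5.5, 5.5], [4.5, 4.5]]
new_samples = vto_sketch(minority, majority, n_new=4)
```

Because the offsets come from majority samples, the synthetic points inherit the majority class's directions of variation, which is the intuition behind introducing majority variance into the minority class. With a single minority sample the assumed weights collapse to zero and every synthetic point equals the minority center, so a real implementation would need a different weighting in that case.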

Keywords: Imbalanced data; Classification; Oversampling; Variance transfer; Covariance

Authors: Zheng Yifan (郑一凡), Wang Maoning (王卯宁)

School of Information, Central University of Finance and Economics, Beijing 102206, China


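Funding: National Natural Science Foundation of China (61907042, 61702570); Beijing Natural Science Foundation (4194090); Project of the Research Center for Science-Technology Finance and Entrepreneurial Finance, a Key Research Base of Humanities and Social Sciences of the Sichuan Provincial Department of Education (JR2018-2)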

2024

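Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)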

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(z1)