河北工业科技 (Hebei Journal of Industrial Science and Technology), 2024, Vol. 41, Issue 4: 291-298. DOI: 10.7535/hbgykj.2024yx04007

Hybrid imbalanced data processing based on ADASYN and WGAN

Zhou Wanzhen (周万珍) [1], Sheng Yuanyuan (盛媛媛) [2], Zhang Yongqiang (张永强) [1], Ma Jinlong (马金龙) [1]

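Author information

  • 1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China; Hebei Technology Innovation Center of Intelligent IoT, Shijiazhuang, Hebei 050018, China
  • 2. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China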

Abstract

To address the low classification accuracy of minority-class samples in imbalanced datasets, an ADASYN-WGAN method for processing imbalanced datasets is proposed. First, the ADASYN (adaptive synthetic sampling) algorithm generates minority-class samples, and these generated samples replace the random noise input of the WGAN (Wasserstein generative adversarial network). Second, the WGAN generates minority-class samples that follow the distribution of the original dataset, yielding a balanced dataset. Then, on six public datasets, a random forest classifier is used to compare the results of the proposed method and of four oversampling algorithms with those obtained on the original datasets. Finally, the effectiveness of the proposed method is verified through classification evaluation metrics including F1-Score, G-mean, and AUC. The results show that, under ten-fold cross-validation with the random forest classifier, the balanced datasets produced by ADASYN-WGAN achieve the best value of every evaluation metric on four of the six public datasets; on the other two datasets the AUC is slightly lower, but the F1-Score and G-mean are still the highest. The proposed ADASYN-WGAN method can generate high-quality data samples and provides a reference for addressing prediction bias against minority-class samples in imbalanced datasets.
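The abstract reports results in terms of F1-Score, G-mean, and AUC without defining them. As a rough illustration only, the sketch below computes the three metrics from their standard textbook definitions using plain Python. It is not the authors' evaluation code: the function names, the `positive` argument (used here to mark the minority class), and the pairwise AUC formulation are all assumptions made for this example.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This O(P*N) pairwise form is for clarity only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (made up for illustration): 2 minority vs 6 majority samples.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(f1_score(y_true, y_pred), g_mean(y_true, y_pred), auc(y_true, scores))
```

G-mean drops to zero whenever either class is entirely misclassified, which is one reason it is commonly preferred over plain accuracy on imbalanced data.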

Key words

data processing / imbalanced data / WGAN / ADASYN / oversampling method / random forest

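Funding

Natural Science Foundation of Hebei Province (F2022208002)

Key Project of Science and Technology Research in Higher Education Institutions of Hebei Province (ZD2021048)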

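Publication year

2024

河北工业科技 (Hebei Journal of Industrial Science and Technology)
Hebei University of Science and Technology
CSTPCD
Impact factor: 0.694
ISSN: 1008-1534
Number of references: 5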