Hybrid imbalanced data processing method based on ADASYN and WGAN

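Abstract: To address the low classification accuracy of minority-class samples in imbalanced datasets, an ADASYN-WGAN method for processing imbalanced datasets is proposed. First, the ADASYN (adaptive synthetic sampling) algorithm generates minority-class samples, and these generated samples replace the random noise in the WGAN (Wasserstein generative adversarial network). Second, the WGAN generates minority-class samples that follow the distribution of the original dataset, yielding a balanced dataset. Then, on six public datasets, a random forest classifier is used to compare the results of the proposed method and of four oversampling algorithms against the original dataset. Finally, the effectiveness of the proposed method is verified through classification metrics such as F1-Score, G-mean, and AUC. The results show that, under ten-fold cross-validation with the random forest classifier, the balanced datasets produced by ADASYN-WGAN achieve the best value on every classification metric for four of the public datasets; for the other two, the AUC is slightly lower, but the F1-Score and G-mean are the highest. The proposed ADASYN-WGAN method can generate high-quality samples and offers a reference for addressing prediction bias against minority-class samples in imbalanced datasets.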
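The ADASYN stage described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `adasyn_sketch` and `k_nearest`, the parameters `k`, `beta`, and `seed`, the toy points, and the uniform fallback when no minority point has majority neighbours are all hypothetical. The WGAN stage, which the abstract says receives these samples in place of random noise, is not shown.

```python
import math
import random


def k_nearest(x, pool, k):
    """Return the k (point, label) pairs in pool closest to x, skipping x itself."""
    others = [item for item in pool if item[0] is not x]
    return sorted(others, key=lambda item: math.dist(x, item[0]))[:k]


def adasyn_sketch(minority, majority, k=3, beta=1.0, seed=0):
    """Create synthetic minority points, weighted toward hard-to-learn regions."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    minority_only = [(p, 1) for p in minority]
    # Number of synthetic samples needed to reach the desired balance level.
    target = int((len(majority) - len(minority)) * beta)

    # r_i: fraction of majority points among the k nearest neighbours of x_i.
    ratios = [
        sum(1 for _, label in k_nearest(x, labelled, k) if label == 0) / k
        for x in minority
    ]
    total = sum(ratios)
    if total == 0:
        # Assumption for this sketch: fall back to uniform weights.
        weights = [1 / len(minority)] * len(minority)
    else:
        weights = [r / total for r in ratios]

    synthetic = []
    for x, w in zip(minority, weights):
        neighbours = k_nearest(x, minority_only, k)
        if not neighbours:
            continue
        # Rounding means the total is only approximately equal to target.
        for _ in range(round(w * target)):
            z, _label = rng.choice(neighbours)
            lam = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic


# Toy data (hypothetical): 4 minority points, 12 majority points on a grid.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (2.1, 2.0)]
majority = [(float(i), float(j)) for i in range(2, 6) for j in range(2, 5)]
synthetic = adasyn_sketch(minority, majority)
print(len(synthetic), "synthetic minority points generated")
```

Minority points surrounded by majority neighbours receive larger weights, so more synthetic points are interpolated around them; each synthetic point lies on a segment between two original minority points.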

English title: Hybrid imbalanced data processing based on ADASYN and WGAN

English abstract: To solve the problem of low classification accuracy of minority-class samples in imbalanced datasets, an ADASYN-WGAN method was proposed to deal with imbalanced datasets. Firstly, minority-class samples were generated using the ADASYN algorithm, and these generated samples were used to replace the random noise in the WGAN. Secondly, minority-class samples conforming to the distribution of the original dataset were generated using the WGAN algorithm to construct a balanced dataset. Then, on six public datasets, the results of the proposed method and of four oversampling algorithms were each compared with the original dataset using a random forest classifier. Finally, the effectiveness of the proposed method was verified by classification metrics such as F1-Score, G-mean, and AUC. The results show that, in the ten-fold cross-validation of the random forest classifier, the balanced datasets obtained by the ADASYN-WGAN method achieve the best values on all classification metrics for four of the public datasets; for the other two public datasets, the F1-Score and G-mean are the highest, although the AUC values are slightly lower. The proposed ADASYN-WGAN method can generate high-quality data samples and provides a reference for solving the problem of prediction bias against minority-class samples in imbalanced datasets.
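The three metrics named in the abstract can be computed from labels and scores as in the sketch below. It uses the standard definitions (F1 as the harmonic mean of precision and recall, G-mean as the square root of sensitivity times specificity, AUC as the probability that a positive outranks a negative) and does not reproduce the paper's random forest or ten-fold cross-validation. The function names, the 0.5 threshold, and the toy labels and scores are assumptions made for illustration.

```python
import math


def f1_and_gmean(y_true, y_pred):
    """Return (F1, G-mean) with the minority class encoded as 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, math.sqrt(recall * specificity)


def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example (hypothetical): 3 minority vs 7 majority samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1, gmean = f1_and_gmean(y_true, y_pred)
auc_value = auc(y_true, scores)
print(f"F1={f1:.3f}  G-mean={gmean:.3f}  AUC={auc_value:.3f}")
```

F1 and G-mean depend on the chosen decision threshold, while AUC depends only on the ranking of the scores, which is one reason a method can lead on the first two and trail slightly on the third.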

English keywords:

data processing; imbalanced data; WGAN; ADASYN; oversampling method; random forest

Authors:

周万珍、盛媛媛、张永强、马金龙

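Affiliations:

School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018

Hebei Technology Innovation Center of Intelligent IoT, Shijiazhuang, Hebei 050018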

Keywords:

data processing; imbalanced data; WGAN; ADASYN; oversampling method; random forest

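Funding:

Natural Science Foundation of Hebei Province; Key Project of Science and Technology Research in Higher Education Institutions of Hebei Province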

Project numbers:

F2022208002; ZD2021048

Publication year:

2024

DOI:

10.7535/hbgykj.2024yx04007

Journal: Hebei Journal of Industrial Science and Technology (河北工业科技)

Hebei University of Science and Technology

CSTPCD

Impact factor: 0.694

ISSN: 1008-1534

Year, volume (issue): 2024, 41(4)

Number of references: 5