Modern Computer (现代计算机), 2024, Vol. 30, Issue 14: 36-40. DOI: 10.3969/j.issn.1007-1423.2024.14.006

An Investigation into the Impact of Resampling Methods for Class-Imbalanced Datasets

Ding Haojie

Author information

  • 1. School of Big Data and Computer Science, Shanxi Institute of Science and Technology, Jincheng 048000

Abstract

To evaluate the impact of resampling methods on class-imbalanced datasets, this study examines the widely used Wisconsin breast cancer diagnosis dataset from the United States. Experiments were carried out with three machine learning algorithms: logistic regression, support vector machine, and random forest. Four resampling methods (random over-sampling, random under-sampling, SMOTE, and ADASYN) were analyzed using F1 scores and AUC values. The experimental results indicate that all four resampling methods can improve model performance, with random under-sampling proving more effective in handling class-imbalanced datasets.
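The two random resampling strategies and the F1 measure named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `random_under_sample`, `random_over_sample`, and `f1_score`, the fixed seed, and the toy data are all hypothetical, and the abstract does not say which library or settings were used.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Group feature rows by their class label (helper for both samplers)."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_under_sample(X, y, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, target):  # sampling without replacement
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def random_over_sample(X, y, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Keep all original rows, then add random duplicates (with replacement).
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy imbalanced data (hypothetical): 8 majority rows (label 0), 2 minority rows (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_under_sample(X, y)
X_over, y_over = random_over_sample(X, y)

print(Counter(y_under))  # both classes reduced to 2 rows
print(Counter(y_over))   # both classes raised to 8 rows
```

In an actual experiment, resampling would normally be applied to the training split only, so that F1 and AUC are measured on data with the original class distribution. Library implementations of all four methods exist in imbalanced-learn (`RandomOverSampler`, `RandomUnderSampler`, `SMOTE`, `ADASYN`); whether the study used them is not stated in the abstract.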

Key words

resampling methods/random under-sampling/support vector machine/logistic regression/random forest

Publication year

2024
Modern Computer (现代计算机)
中大控股 (Zhongda Holdings)

Impact factor: 0.292
ISSN: 1007-1423