
Improved GBDT Algorithm for Imbalanced Data Classification

Many traditional classification algorithms, when trained on imbalanced data, produce classifiers that predict majority-class samples accurately but minority-class samples poorly. To address this problem, an improved gradient boosting decision tree (GBDT) algorithm is proposed for binary classification of imbalanced data. At the data level, Adaptive Synthetic Sampling (ADASYN) is used to increase the number of minority-class samples. At the algorithm level, the Focal Loss function is introduced into the GBDT binary classification algorithm to increase the model's attention to minority-class samples. In addition, each random subsample drawn during GBDT's internal iterations is balanced, which makes the performance of the base classifiers more stable. Comparative experiments on 10 KEEL imbalanced data sets verify the feasibility of these improvements. The proposed algorithm is also compared with three popular imbalanced-data classification algorithms: SMOTEBoost, RUSBoost, and CUSBoost. It achieves the highest F1-measure on 7 of the data sets and the highest G-mean on 6 of them, which indicates that it performs well on binary classification of imbalanced data.
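The abstract names its ingredients but gives no formulas, so the following is only a minimal sketch of how the pieces could look, not the authors' implementation. It assumes the standard binary Focal Loss with illustrative defaults `gamma=2.0` and `alpha=0.25` (not taken from the paper), takes the gradient with respect to the raw score as the quantity a GBDT round would fit, and interprets "balanced subsampling" as drawing the same number of samples from each class. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss(z, y, gamma=2.0, alpha=0.25):
    """Per-sample binary focal loss for raw scores z and labels y in {0, 1}."""
    p = sigmoid(z)
    p_t = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1.0 - alpha)  # class weight
    return -a_t * (1.0 - p_t) ** gamma * np.log(p_t)

def focal_loss_grad(z, y, gamma=2.0, alpha=0.25):
    """Derivative of focal_loss with respect to the raw score z."""
    p = sigmoid(z)
    p_t = np.where(y == 1, p, 1.0 - p)
    a_t = np.where(y == 1, alpha, 1.0 - alpha)
    sign = np.where(y == 1, 1.0, -1.0)          # d p_t / d z = sign * p_t * (1 - p_t)
    return sign * a_t * (1.0 - p_t) ** gamma * (
        gamma * p_t * np.log(p_t) + p_t - 1.0
    )

def balanced_subsample(y, fraction, rng):
    """Assumed scheme: draw the same number of indices from each class."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = max(1, int(fraction * min(len(pos), len(neg))))
    return np.concatenate([
        rng.choice(pos, n, replace=False),
        rng.choice(neg, n, replace=False),
    ])

def f1_and_gmean(y_true, y_pred):
    """F1-measure and G-mean (sqrt of sensitivity times specificity)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    return float(f1), float(np.sqrt(tpr * tnr))
```

With `gamma=0` and `alpha=0.5` the gradient reduces to half the usual log-loss gradient `p - y`, which is a convenient sanity check. In a boosting loop, each round would fit a tree to the negative gradient evaluated on the indices returned by `balanced_subsample`.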

Keywords: unbalanced data; gradient boosting decision tree; adaptive synthetic sampling; focal loss; random subsampling

Authors: Li Changhong (李长洪), Zheng Kai (郑凯), Lin Boyu (林博宇)


School of Computer Science, South China Normal University, Guangzhou 510631

Network Center, South China Normal University, Guangzhou 510631


Funding: China University Industry-University-Research Innovation Fund (2020ITA05033)

2024

Computer & Digital Engineering (计算机与数字工程)
709th Research Institute, China Shipbuilding Industry Corporation


CSTPCD
Impact factor: 0.355
ISSN: 1672-9722
Year, volume (issue): 2024, 52(7)