Gini index and decision tree method with mitigating random consistency

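Abstract (translated from the Chinese): Decision tree models are highly interpretable and underlie machine learning methods such as random forest and deep forest. How to choose the splitting attribute and split value at each node is the key problem of decision tree algorithms, and it affects important properties of the tree such as generalization ability, depth, and balance. Most traditional attribute selection criteria are defined on concave functions, which gives decision tree algorithms a multivalue bias: they tend to choose attributes with many distinct values as the node splitting attribute. Existing studies have shown that evaluation criteria that mitigate random consistency can reduce classification bias and bias toward the number of clusters. This paper mitigates the random consistency of the Gini index within a standardization framework, and thereby mitigates its multivalue bias. Experiments on artificial datasets verify that the standard Gini index mitigates the multivalue bias of the Gini index and selects attributes that carry decision information. Experiments on 12 benchmark datasets and two image datasets verify that the decision tree algorithm based on the standard Gini index achieves higher generalization performance than existing decision tree algorithms that mitigate multivalue bias.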

English title: Gini index and decision tree method with mitigating random consistency

English abstract: The decision tree model is highly interpretable and is the basis of machine learning methods such as random forest and deep forest. Selecting the splitting attribute and split value of each node is the core problem of the decision tree method, and it affects the generalization ability, depth, balance, and other important properties of the tree. Most traditional attribute selection criteria are defined as sums of concave functions, which gives decision tree algorithms a multivalue bias; that is, they tend to select attributes with many values as the node splitting attribute. In classification tasks, performance evaluation from the perspective of random consistency has been verified to have low classification bias, and evaluation criteria that alleviate random consistency can reduce classification bias and cluster-number bias. In this paper, the random consistency of the Gini index is alleviated within a standardization framework in order to offset its multivalue bias. Experiments on artificial datasets verify that the standard Gini index alleviates the multivalue bias of the Gini index and selects attributes that carry decision information. Experimental results on twelve benchmark datasets and two image datasets show that the decision tree based on the standard Gini index achieves higher generalization performance than existing decision tree algorithms that mitigate multivalue bias.
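The multivalue bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's standardized criterion; it only computes the ordinary Gini gain of attributes drawn independently of the labels, so any gain they show is due to chance alone. The function names, sample size, and number of trials are assumptions chosen for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(attribute, labels):
    """Impurity decrease from a multiway split on a categorical attribute."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted


def mean_gain_of_random_attribute(num_values, n=200, trials=300, seed=0):
    """Average Gini gain of an attribute that is independent of the labels.

    Hypothetical helper: both the binary labels and the attribute values
    are drawn uniformly at random, so the attribute carries no information.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        labels = [rng.randint(0, 1) for _ in range(n)]
        attribute = [rng.randrange(num_values) for _ in range(n)]
        total += gini_gain(attribute, labels)
    return total / trials


if __name__ == "__main__":
    # Compare uninformative attributes with increasing numbers of values.
    for v in (2, 5, 20, 50):
        print(v, round(mean_gain_of_random_attribute(v), 4))
```

In this setup the average gain grows with the number of attribute values even though no attribute is informative, which is the kind of chance-driven preference that a criterion corrected for random consistency is meant to reduce.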

English keywords:

Gini index; bias to multi-value; decision tree; random consistency

Authors:

Jieting Wang (王婕婷), Feijiang Li (李飞江), Jue Li (李珏), Yuhua Qian (钱宇华), Jiye Liang (梁吉业)

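Author affiliations:

Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006

Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006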

Keywords (translated from the Chinese):

Gini index; multivalue bias; decision tree; random consistency

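Funding:

Science and Technology Innovation 2030 Major Project; National Natural Science Foundation of China, Key Program; National Natural Science Foundation of China, Youth Fund; National Natural Science Foundation of China, Youth Fund; Shanxi Province Science and Technology Major Project; Shanxi Province Basic Research Program; Shanxi Province Basic Research Program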

Project numbers:

2021ZD0112400; 62136005; 62106132; 62306170; 202201020101006; 20210302124271; 202103021223026

Publication year:

2024

DOI:

10.1360/SSI-2022-0337

中国科学F辑 (Science in China, Series F)

Chinese Academy of Sciences; National Natural Science Foundation of China

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 1.438

ISSN: 1674-5973

Year, volume (issue): 2024, 54(1)

Number of references: 3