Feature Ranking from Random Forest Through Complex Network's Centrality Measures: A Robust Ranking Method Without Using Out-of-Bag Examples

The volume of available data has increased rapidly in recent years. As a consequence, datasets commonly end up with many irrelevant features, which may hinder human understanding and even lead to poor machine learning models. This research proposes a novel feature ranking method that uses the trees of a Random Forest to transform a dataset into a complex network, to which centrality measures are then applied to rank the features. Each tree is represented as a graph in which every feature used in the tree is a vertex, and each link between tree nodes (parent → child) is represented by a weighted edge between the two corresponding vertices. The union of the graphs from all individual trees forms the complex network. Three centrality measures are then applied to rank the features in the complex network. To evaluate the method, experiments were performed on eighty-five supervised classification datasets with varying levels of feature noise. Results show that centrality measures in undirected complex networks are comparable to, and may be correlated with, the Random Forest's variable importance ranking algorithm. Vertex strength and eigenvector centrality outperformed the Random Forest on datasets with 40% noise, although the difference was not statistically significant at the 95% confidence level.
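
The tree-to-network construction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy trees, the helper names (`tree_edges`, `build_network`, `vertex_strength`, `eigenvector_centrality`, `rank`), the choice of edge weight (a count of parent → child links), and the skipping of self-loops are all assumptions made for illustration. Only the two centrality measures named in the abstract are shown, and the network is treated as undirected.

```python
from collections import defaultdict

# Hypothetical toy forest: an internal node is (feature, left, right); a leaf is None.
# In practice these structures would be extracted from a trained Random Forest.
TREES = [
    ("f0", ("f1", None, None), ("f2", ("f1", None, None), None)),
    ("f0", ("f2", None, None), ("f1", None, None)),
    ("f1", ("f0", None, ("f3", None, None)), None),
]

def tree_edges(node):
    """Yield (parent_feature, child_feature) for every parent -> child link."""
    if node is None:
        return
    feature, left, right = node
    for child in (left, right):
        if child is not None:
            yield feature, child[0]
            yield from tree_edges(child)

def build_network(trees):
    """Union of the per-tree graphs as an undirected weighted edge map.

    Assumption: an edge's weight is the number of parent-child links
    joining the two features across all trees; self-loops are skipped.
    """
    weights = defaultdict(float)
    for tree in trees:
        for u, v in tree_edges(tree):
            if u != v:
                weights[frozenset((u, v))] += 1.0
    return dict(weights)

def vertex_strength(weights):
    """Strength of a vertex = sum of the weights of its incident edges."""
    strength = defaultdict(float)
    for edge, w in weights.items():
        for v in edge:
            strength[v] += w
    return dict(strength)

def eigenvector_centrality(weights, iterations=200):
    """Power iteration on (A + I), which has the same leading eigenvector as A."""
    nodes = sorted({v for edge in weights for v in edge})
    adj = {v: {} for v in nodes}
    for edge, w in weights.items():
        u, v = tuple(edge)
        adj[u][v] = w
        adj[v][u] = w
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new = {v: score[v] + sum(w * score[u] for u, w in adj[v].items())
               for v in nodes}
        norm = sum(x * x for x in new.values()) ** 0.5
        score = {v: x / norm for v, x in new.items()}
    return score

def rank(scores):
    """Features ordered from most to least central (ties broken by name)."""
    return sorted(scores, key=lambda f: (-scores[f], f))

network = build_network(TREES)
strength = vertex_strength(network)
eigen = eigenvector_centrality(network)
print(rank(strength))
print(rank(eigen))
```

In this toy forest, `f0` appears near the root of every tree and is linked to every other feature, so both measures place it first; a feature that occurs only once deep in a single tree, such as `f3`, lands at the bottom.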

Feature ranking; Random Forest; Complex networks; Centrality measures

Adriano Henrique Cantao, Alessandra Alaniz Macedo, Liang Zhao, Jose Augusto Baranauskas

Department of Computer Science and Mathematics, Faculty of Philosophy, Sciences and Letters at Ribeirao Preto, University of Sao Paulo, Bandeirantes Avenue, 3900, Ribeirao Preto, SP 14040-901, Brazil

European Conference on Advances in Databases and Information Systems

Turin (IT)

Advances in Databases and Information Systems

330-343

2022