查看更多>>摘要:Researchers rely on species distribution models (SDMs) to establish a correlation between species occurrence records and environmental data. These models offer insights into the ecological and evolutionary aspects of the subject. Feature selection (FS) aims to choose useful interlinked features or remove unnecessary and redundant ones and make the induced model easier to understand. Although feature selection plays a crucial role in SDMs, only a limited number of studies in the literature have addressed it with several key shortcomings such as lack of the use of multivariate techniques, lack of comparison between the univariate and the multivariate filters, and absence of a comparison between the ensemble univariate and multivariate filters. Therefore, this study presents a rigorous empirical evaluation consisting of assessing and comparing six filter-based univariate feature selection methods using two thresholds with two multivariate techniques, as well as four classifiers: Extreme Gradient boosting (XGB), Random Forest (RF), Decision Tree (DT), and Light gradient-boosting machine (LGBM). Furthermore, the current study proposes a novel approach for ensemble construction consisting of evaluating the applications of ensemble learning using 40% of features ranked by means of Borda Count and Reciprocal Rank (univariate filter ensembles) as well as the fusion-based and the intersection-based ensembles (multivariate filter ensembles). Moreover, we evaluated and compared the performances of univariate and multivariate techniques with their ensembles. Similarly, we evaluated and compared the performances of the best ensemble techniques across datasets. The empirical evaluations involve several techniques, such as the 5-fold cross-validation method, the Scott Knott (SK) test, and Borda Count. In addition, we used three performance metrics (accuracy, Kappa, and F1-score). Experiments showed that Consistency-based subset selection in conjunction with RF outperformed all other univariate and multivariate FS techniques with an accuracy value of 91.63% across all datasets. However, Fisher score trained with RF was the best choice when considering the number of features. Moreover, the univariate or multivariate based ensembles, in general, outperformed their singles. In addition, when comparing the univariate and multivariate ensembles, the fusion-based ensemble outperformed all other ensembles achieving an accuracy of 91.77% when using RF across datasets. Nevertheless, in terms of performance and number of features, the ensemble constructed using Reciprocal Rank performed better than all other FS techniques regardless of the classifier used. It achieved an accuracy of 91.61% across datasets when using RF.