Default prediction model based on optimal combination of big data variables: A case study of Chinese small enterprises
To enhance the identification and management of credit risk for commercial bank corporate customers, we present a systematic approach for predicting default. First, in constructing high-dimensional big data variable sets, we determine the optimal cutoff point for dividing each indicator interval by minimizing the Gini index. This ensures that each path of the decision tree maximizes the distinction between defaulting and non-defaulting customers. We treat each path as a dummy variable whose value is 1 if the customer belongs to the path and 0 otherwise. Second, to reduce the dimensionality of the dummy variables, we use Lasso regression to minimize prediction error and infer the optimal set of variables. Third, we calculate the optimal default prediction threshold of the logistic regression model by maximizing the sum of customer judgment ratios, which improves the accuracy of predicting defaulting firms. Our results show that decision tree path variables have stronger default discriminatory power than raw credit indicators and contain richer information. Additionally, three indicators have a significant impact on the default prediction of Chinese small enterprises: net profit cash content, per capita disposable income of urban residents, and the legal dispute situation of enterprises. Although these three indicators represent only 3.704% of the total number of indicators, their contribution to accuracy is 41.639%. Our proposed methodology outperforms the comparison model in terms of accuracy and robustness. It can reveal the key factors and thresholds that affect the credit risk of enterprises, thus providing a basis for commercial banks' credit approval and pre-loan review work. The methodology's effectiveness has been demonstrated across multiple credit datasets, and it can be extended to constructing default prediction models for individuals as well as large and medium-sized enterprises.
default prediction; big data; decision tree path variables; Lasso; default prediction threshold
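
The first and third steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single indicator split once (a depth-one path), takes the "sum of customer judgment ratios" to mean the share of defaulters correctly flagged plus the share of non-defaulters correctly passed, and uses hypothetical function names and toy data throughout. The Lasso selection step is omitted.

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of 0/1 labels (1 = default)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_cutoff(values: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (cutoff, weighted Gini) for the split `value <= cutoff`
    that minimizes the size-weighted Gini impurity of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutoff can separate identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cut, score)
    return best


def path_dummy(values: Sequence[float], cutoff: float) -> List[int]:
    """Dummy variable: 1 if the customer falls on the `value <= cutoff` path, else 0."""
    return [1 if v <= cutoff else 0 for v in values]


def best_threshold(probs: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, score) maximizing the assumed criterion:
    share of defaulters with prob >= threshold plus
    share of non-defaulters with prob < threshold.
    Both classes must be present in `labels`."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (float("nan"), -1.0)
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        score = tp / n_pos + tn / n_neg
        if score > best[1]:
            best = (t, score)
    return best


# Toy data (hypothetical): one indicator, 1 = default.
indicator = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
default = [1, 1, 1, 0, 0, 0]
cutoff, impurity = best_cutoff(indicator, default)
dummy = path_dummy(indicator, cutoff)

# Toy predicted default probabilities (hypothetical model output).
probs = [0.9, 0.3, 0.6, 0.2, 0.1, 0.05]
truth = [1, 1, 0, 0, 0, 0]
threshold, score = best_threshold(probs, truth)
```

In a fuller pipeline, one dummy variable would be built per decision tree path across all indicators, and an L1-penalized model would then select among them before the threshold search.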