Machine learning prediction models of diabetic kidney disease in different regions of Gansu Province
Objective To construct machine learning (ML) prediction models for diabetic kidney disease (DKD) in patients with type 2 diabetes mellitus (T2DM) in the plain-sand and loess hilly areas of Gansu Province, and to analyze the interpretability of the models. Methods A multi-stage stratified random sampling method was used to collect data on T2DM patients in the two areas. After key feature screening, eight ML prediction models were constructed for the risk of DKD in each area. The receiver operating characteristic (ROC) curve, accuracy and F1 score were used to evaluate the models, and the Shapley additive explanation (SHAP) algorithm was used for model interpretation. Results A total of 1599 patients with T2DM were enrolled in this study. After feature screening, ten variables were selected for model construction in the plain-sand area. Among the eight models, the gradient boosting decision tree (GBDT) model had the best predictive performance: on the test set, the area under the curve (AUC) was 0.972, the accuracy was 0.949, and the F1 score was 0.884. In the loess hilly area, 12 variables were included in the model, and the best model was the random forest (RF): on the test set, the AUC was 0.966, the accuracy was 0.951, and the F1 score was 0.861. SHAP analysis showed that, in addition to serum creatinine, age, LDL-C, HbA1c, DM duration, serum uric acid and urinary microalbumin were also closely related to a high risk of DKD. Conclusions The GBDT and RF models show good predictive performance for the occurrence of DKD in the two areas, and can be used to screen populations at high risk of DKD and to explore potential risk factors in depth in the two areas.
Diabetic kidney disease; Diabetes mellitus, type 2; Machine learning; Prediction model
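
The three evaluation measures named in the abstract (AUC, accuracy and F1 score) can be illustrated with a minimal sketch. This is not the study's code: the labels and predicted risks below are invented toy values, the 0.5 decision threshold is an assumption, and the function names are hypothetical. The AUC is computed in its rank-based form, as the probability that a randomly chosen DKD case receives a higher predicted risk than a randomly chosen non-case.

```python
def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (DKD) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): 1 = DKD, 0 = no DKD.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.05]  # predicted DKD risk

# Assumed decision threshold of 0.5 to turn risks into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("AUC:", roc_auc(y_true, scores))
print("Accuracy:", accuracy(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```

Accuracy and F1 depend on the chosen threshold, whereas the AUC is computed from the ranking of predicted risks alone, which is one reason the three measures are usually reported together.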