Machine learning prediction models for diabetic kidney disease in different regions of Gansu Province

Yang Jianning¹, Hong Doudou¹, Li Yang², Yu Jing³, Yang Fan¹, Wen Ziying¹, Qiao Wenjun⁴, Liu Jing², Zhang Qi³

Author information

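1. First Clinical Medical College, Gansu University of Chinese Medicine, Lanzhou 730000
2. Endocrinology and Metabolism Diagnosis and Treatment Center, Gansu Provincial Hospital
3. Department of Geriatrics, Gansu Provincial Hospital
4. First Clinical Medical College, Ningxia Medical University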

Abstract

Objective: To construct machine learning (ML) prediction models for diabetic kidney disease (DKD) in patients with type 2 diabetes mellitus (T2DM) in the windblown-sand plain and loess hilly areas of Gansu Province, and to analyze the interpretability of the models.

Methods: A multi-stage stratified random sampling method was used to collect data on T2DM patients in the two areas. After key feature screening, eight ML models were constructed to predict the risk of DKD. The models were evaluated by the area under the receiver operating characteristic (ROC) curve (AUC), accuracy, and F1 score, and were interpreted with the Shapley additive explanations (SHAP) algorithm.

Results: A total of 1599 patients with T2DM were included. After feature screening, 10 variables were used for modeling in the windblown-sand plain area. Among the eight models, the gradient boosting decision tree (GBDT) model performed best, with a test-set AUC of 0.972, accuracy of 0.949, and F1 score of 0.884. In the loess hilly area, 12 variables were used for modeling, and the best model was the random forest (RF), with a test-set AUC of 0.966, accuracy of 0.951, and F1 score of 0.861. SHAP analysis showed that, in addition to serum creatinine, age, LDL-C, HbA1c, and diabetes duration, serum uric acid and urinary microalbumin were also associated with a high risk of DKD.

Conclusions: The GBDT and RF models predict the occurrence of DKD well in the two areas and can be used to screen populations at high risk of DKD and to explore potential risk factors in greater depth.
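The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state how the metrics were computed, so the function names, the 0.5 decision threshold, and the toy labels and scores below are assumptions for illustration only and are not the study's data or code.

```python
from typing import Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive case is scored
    above a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_and_f1(
    y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Accuracy and F1 score after thresholding predicted risk scores.
    The 0.5 threshold is an assumption, not taken from the study."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical toy data: 1 = DKD, 0 = no DKD; scores are predicted risks.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]

auc = roc_auc(y_true, scores)
acc, f1 = accuracy_and_f1(y_true, scores)
print(f"AUC={auc:.3f} accuracy={acc:.3f} F1={f1:.3f}")
```

AUC depends only on how cases are ranked, while accuracy and F1 depend on the chosen threshold, which is one reason the three are commonly reported together for imbalanced outcomes.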

Key words

Diabetic kidney disease / Diabetes mellitus, type 2 / Machine learning / Prediction model

Publication year

2025

Chinese Journal of Diabetes (中国糖尿病杂志)

Peking University

Peking University Core Journals (北大核心)

Impact factor: 1.946

ISSN: 1006-6187
