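Objective: To explore the factors influencing colorectal polyps and to construct an early risk prediction model. Methods: Data were collected from 4 997 examinees who underwent both colonoscopy and routine blood and biochemical testing at the Health Management Center of the First Affiliated Hospital of Zhengzhou University between November 2016 and October 2021, covering 22 indicators. The least absolute shrinkage and selection operator (LASSO) was used to select feature variables. The examinees were randomly split 7:3; the selected optimal variables were used to build gradient boosting (Catboost), support vector machine (SVM), and Logistic regression (LR) prediction models in the training set, which were then validated in the test set. The accuracy of the 3 models was compared using the χ² test; predictive performance was further evaluated using the net reclassification index (NRI), the integrated discrimination improvement (IDI), and the area under the ROC curve (AUC), and the importance of the included factors was assessed. Results: LASSO regression yielded 7 feature variables: sex, age, waist circumference (WC), urea (BU), total protein (TP), glomerular filtration rate (GFR), and triglyceride-glucose index (TyG). The SVM and Catboost models built on these 7 variables were more accurate than the LR model (P<0.05). The test-set AUCs (95% CI) of the SVM, Catboost, and LR models were 0.760 (0.736-0.784), 0.766 (0.742-0.790), and 0.676 (0.649-0.703), respectively. Further evaluation showed that the SVM model had the best predictive performance, followed by Catboost, with LR the worst (SVM vs Catboost/LR: NRI>0, IDI>0, P<0.05; Catboost vs LR: NRI>0, IDI>0, P<0.001). Feature importance assessment showed that age was the most important factor, followed by WC. Conclusion: The SVM model built on sex, age, WC, BU, TP, GFR, and TyG has good predictive value. Establishing this prediction model allows risk stratification of the health check-up population and helps detect early colorectal cancer lesions as soon as possible.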
Construction and evaluation of a risk prediction model for colorectal polyps based on physical examination data
Aim: To explore the risk factors influencing the development of colorectal polyps and to construct an early risk prediction model. Methods: Data were collected from 4 997 participants who underwent colonoscopy together with routine blood and biochemical index examinations at the Health Management Center of the First Affiliated Hospital of Zhengzhou University between November 2016 and October 2021, covering 22 indicators. The least absolute shrinkage and selection operator (LASSO) was used for feature variable selection to determine the optimal variables. Participants were randomly split 7:3 into a training set and a test set; three predictive models, namely categorical boosting (Catboost), support vector machine (SVM), and Logistic regression (LR), were constructed in the training set using the optimal variables and validated in the test set. The accuracy of the 3 models was compared using the χ² test. The predictive performance of the 3 models was further assessed using the net reclassification index (NRI), the integrated discrimination improvement (IDI), and the area under the ROC curve (AUC). Finally, the importance of the included factors was assessed. Results: LASSO regression identified 7 feature variables: gender, age, waist circumference (WC), blood urea (BU), total protein (TP), glomerular filtration rate (GFR), and triglyceride glucose index (TyG). The accuracy of the SVM and Catboost models was better than that of the LR model (P<0.05). The AUCs (95% CI) in the test set for the SVM, Catboost, and LR models were 0.760 (0.736-0.784), 0.766 (0.742-0.790), and 0.676 (0.649-0.703), respectively. Further evaluation revealed that SVM had the best predictive performance, followed by Catboost, with LR last (SVM vs Catboost/LR: NRI>0, IDI>0, P<0.05; Catboost vs LR: NRI>0, IDI>0, P<0.001). Feature importance assessment indicated that age was the most important factor, followed by WC. Conclusion: The SVM model constructed from gender, age, WC, BU, TP, GFR, and TyG demonstrates good predictive value; it can provide risk stratification in the health check-up population and aid in the early detection of early-stage colorectal cancer lesions.
Keywords: colorectal polyp; physical examination data; risk prediction model
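The abstract compares models by AUC, NRI, and IDI without stating how these were computed. The following is a minimal sketch of the standard definitions, under assumptions: it uses the category-free (continuous) NRI, since the abstract does not say which NRI variant was applied, and the function names and the six-subject toy data are hypothetical, not study data. It produces point estimates only; the P values reported in the abstract would need a separate significance test.

```python
from statistics import mean


def split_by_outcome(y, p):
    """Return (predicted risks of events, predicted risks of non-events)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    return events, nonevents


def auc(y, p):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    events, nonevents = split_by_outcome(y, p)
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in nonevents
    )
    return wins / (len(events) * len(nonevents))


def idi(y, p_old, p_new):
    """IDI: gain in mean risk among events minus gain in mean risk among non-events."""
    ev_old, non_old = split_by_outcome(y, p_old)
    ev_new, non_new = split_by_outcome(y, p_new)
    return (mean(ev_new) - mean(ev_old)) - (mean(non_new) - mean(non_old))


def continuous_nri(y, p_old, p_new):
    """Category-free NRI: net upward moves among events plus net downward moves among non-events."""
    def net_up(outcome):
        diffs = [n - o for yi, o, n in zip(y, p_old, p_new) if yi == outcome]
        up = sum(d > 0 for d in diffs)
        down = sum(d < 0 for d in diffs)
        return (up - down) / len(diffs)

    return net_up(1) - net_up(0)


# Hypothetical toy data: 1 = polyp found, 0 = no polyp.
y_toy = [1, 1, 1, 0, 0, 0]
p_old_toy = [0.6, 0.5, 0.4, 0.5, 0.4, 0.3]  # risks from a reference model
p_new_toy = [0.8, 0.7, 0.3, 0.4, 0.2, 0.4]  # risks from a candidate model

print("AUC (new):", auc(y_toy, p_new_toy))
print("IDI:", idi(y_toy, p_old_toy, p_new_toy))
print("NRI:", continuous_nri(y_toy, p_old_toy, p_new_toy))
```

A positive NRI or IDI means the candidate model moves predicted risks in the right direction more often (NRI) or by a larger average margin (IDI) than the reference model, which is how a pattern like "NRI>0, IDI>0" can favour one model even when two AUCs are close.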