Abstract
Objective To predict postoperative recurrence in patients with hepatocellular carcinoma (HCC) using ensemble learning algorithms, and to provide guidance for the postoperative treatment of these patients. Methods Clinical data of 471 patients with liver cancer who were admitted to the First Hospital of Shanxi Medical University for surgical treatment between 2017-01-01 and 2022-12-31 were retrospectively analyzed. Three algorithms, eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and the least absolute shrinkage and selection operator (LASSO), were used to screen influencing factors. The Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the data. An XGBoost classification model was constructed and compared with RF, support vector machine (SVM), and logistic regression models. Model performance was evaluated on four metrics: accuracy, sensitivity, F1 score, and area under the receiver operating characteristic (ROC) curve (AUC). SHapley Additive exPlanations (SHAP) and a nomogram were applied to explain and visualize the model, yielding a comparatively better recurrence prediction model. Results The ten factors screened by RF as having the greatest influence on postoperative recurrence of HCC were age, prothrombin time, liver lobe location, aspartate aminotransferase, vascular invasion, platelet count, CD10, ascites, degree of differentiation, and absolute lymphocyte count. A nomogram predicting the risk of postoperative recurrence was constructed from these risk factors. The XGBoost model (accuracy 0.905, sensitivity 0.852, F1 0.900, AUC 0.905) achieved the best classification performance among the compared models. Conclusion The XGBoost model constructed in this study has good classification performance, and combining it with SHAP and a nomogram makes the model more interpretable. The model can identify patients at high risk of recurrence and guide clinicians in developing personalized diagnosis and treatment plans.
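Two steps named in the Methods, SMOTE balancing and the four evaluation metrics, can be illustrated with a minimal pure-Python sketch. This is not the study's code: the function names `smote` and `classification_metrics`, the neighbour count `k=5`, and the 0.5 decision threshold are assumptions made for illustration, and the abstract does not state which software or settings were used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Sketch of SMOTE: build n_new synthetic minority rows.

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two numeric feature rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, F1 and AUC for binary labels (1 = recurrence).

    The 0.5 threshold is an illustrative assumption.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0

    # AUC as the probability that a random positive case scores higher
    # than a random negative case, with ties counted as one half.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "f1": f1, "auc": auc}
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic rows do not leak into the data used for evaluation; the abstract does not say how the split was handled here.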
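Funding
Shanxi Province Merit-Based Funding Program for Science and Technology Activities of Returned Overseas Scholars (20210004)
Central Government Guided Local Science and Technology Development Fund Project (YDZJSX2021A041)
China Postdoctoral Science Foundation (2021M702051)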