临床专病机器学习算法的特征优选及参数优化

Feature selection and parameter optimization of machine learning algorithms for clinical specialty diseases

金伯儒¹, 马玉虎¹, 王一¹, 何旺平¹, 刘振¹, 靳克成¹, 钟汝阳¹, 林延延², 岳平², 李书艳³, 孟文勃²

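Author information

1. The First Clinical Medical College, Lanzhou University, Lanzhou 730000, Gansu
2. Department of General Surgery, The First Hospital of Lanzhou University, Lanzhou 730000, Gansu
3. School of Medical Information and Engineering, Xuzhou Medical University, Xuzhou 221004, Jiangsu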

Abstract

Objective: To explore the advantages of machine learning algorithms in handling multimodal data and in hyperparameter selection, and thereby accelerate the use of data-driven approaches in clinical research. Methods: Patients who underwent their first endoscopic retrograde cholangiopancreatography (ERCP) for choledocholithiasis at The First Hospital of Lanzhou University from 2022 to 2023 were included. Several variable selection methods were used to rank feature importance, and the ranked features were fed into five machine learning algorithms (k-nearest neighbors, extreme gradient boosting, support vector machine, naive Bayes, and random forest) and Logistic regression to predict postoperative complications. The optimal feature set was chosen at the maximum area under the curve (AUC), hyperparameters were tuned under ten-fold cross-validation, and the AUC on the test set served as the final evaluation metric for establishing the best binary classification model of post-ERCP complications in choledocholithiasis. Results: A total of 465 patients were included. In a parallel comparison of the algorithms, random forest was the best model; the selected features, in descending order of contribution, were mechanical lithotripsy, number of guidewire entries into the pancreatic duct, intraoperative bleeding, difficult cannulation, and operation time. The optimal hyperparameters were: number of trees = 500, minimum node size = 2, number of features selected per tree = 1, and Gini impurity as the splitting rule. The mean specificity, sensitivity, and AUC of the random forest model were 0.972, 0.710, and 0.942 across the ten-fold cross-validation sets and 0.950, 0.625, and 0.886 on the test set. The random forest model significantly outperformed the other machine learning algorithms and Logistic regression, and it was superior to Logistic regression in clinical decision-making effectiveness, predictive accuracy, and risk-benefit assessment. Conclusion: Among the machine learning models constructed for a clinical specialty disease, the random forest model performed best when combined with refined variable preprocessing and optimized parameters.
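
The three metrics reported above (sensitivity, specificity, and AUC) can be illustrated with a minimal sketch in plain Python. The labels, predicted scores, and the 0.5 decision threshold below are invented toy values chosen for illustration; they are not data or settings from the study, and the helper names `sensitivity_specificity` and `roc_auc` are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = complication."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # complications correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # complications missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-events correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented values, not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
threshold = 0.5  # assumed cut-off for turning scores into class predictions
y_pred = [1 if s >= threshold else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is one plausible reason a study would use it as the final comparison metric across algorithms.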

Key words

endoscopic retrograde cholangiopancreatography / machine learning / random forest / Logistic regression / complications / risk factors

Funding

National Natural Science Foundation of China (32160255)

Gansu Provincial Science and Technology Major Project (1602FKDA001)

Publication year

2024

兰州大学学报(医学版)

兰州大学

CSTPCD

Impact factor: 0.641

ISSN: 1000-2812
