Establishment of a Traditional Chinese Medicine Syndrome Diagnostic Model Based on Stacking Ensemble Learning: Taking Lung Cancer as an Example

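Abstract (translated from the Chinese): Objective: To explore how the Stacking ensemble algorithm can be used to improve the performance of traditional Chinese medicine (TCM) syndrome diagnostic models. Methods: Taking the construction of a TCM syndrome diagnostic model for lung cancer as an example, clinical symptom and sign information from 2598 case-visits of lung cancer patients at 9 hospitals was used as the independent (feature) variables, and TCM syndrome information as the dependent variable. Using Python 3.7, the clinical data were divided into a training set and a test set at a ratio of 8:2 by the random number table method. Chi-square tests, Spearman correlation tests, and least absolute shrinkage and selection operator (LASSO) logistic regression were used to screen stable features of the TCM syndromes of lung cancer. Nine machine learning algorithms were trained to obtain nine base models: support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), extremely randomized trees (ExtraTrees), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), adaptive boosting (AdaBoost), gradient boosting (GB), and multi-layer perceptron (MLP). The four best-performing base models were selected and fused with the Stacking ensemble algorithm, and the fused model was given a second round of training with each of the same nine algorithms. Accuracy, micro-average receiver operating characteristic (micro-average ROC) curves, area under the curve (AUC), and confusion matrices were used for evaluation to select the optimal diagnostic model. Results: Data processing yielded 79 stable features and 13 TCM syndromes. In base-model training, the RF, ExtraTrees, MLP, and SVM models showed the best overall performance, so the predicted syndrome distributions of these four models were used as the second-round training data, and nine fusion models were obtained with the Stacking ensemble algorithm (SVM, KNN, RF, ExtraTrees, XGBoost, LightGBM, GB, AdaBoost, MLP). Of these, the XGBoost fusion model performed best: its accuracy was 0.850 on the training set and 0.838 on the test set, an overfitting difference of 0.012, and its area under the micro-average ROC curve (micro-average AUC) was 0.996. All fusion models improved on the base models in test-set accuracy and micro-average AUC. Conclusion: Using TCM syndrome data for lung cancer as an example, the XGBoost fusion model obtained through the Stacking ensemble algorithm showed a significant advantage in improving the diagnostic performance for TCM syndromes of lung cancer. The Stacking ensemble algorithm can integrate the strengths of multiple model algorithms and effectively improve the recognition performance of TCM syndrome diagnostic models, providing a methodological reference for similar studies.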
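The two data-handling steps in the Methods, the 8:2 split and the reuse of base-model predictions as second-round training data, can be sketched in plain Python. This is only an illustration under stated assumptions: the abstract does not say how the four predicted distributions were combined, so the column-wise concatenation below is assumed, the seeded shuffle stands in for the random number table method, and the names `split_indices` and `build_meta_features` are hypothetical.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle record indices and split them into train/test index lists.

    Assumption: a seeded shuffle stands in for the random number table
    method named in the abstract; the train size is rounded down.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def build_meta_features(base_probas):
    """Concatenate each base model's per-class probabilities, row by row.

    base_probas: one matrix per base model, each shaped
    (n_samples, n_classes). Returns a matrix shaped
    (n_samples, n_models * n_classes).
    Assumption: concatenation is one common way to form stacking inputs;
    the abstract does not specify the exact construction.
    """
    n_samples = len(base_probas[0])
    if any(len(p) != n_samples for p in base_probas):
        raise ValueError("all base models must score the same samples")
    return [
        [prob for model in base_probas for prob in model[i]]
        for i in range(n_samples)
    ]


# 2598 case-visits split 8:2, as reported in the abstract.
train_idx, test_idx = split_indices(2598)

# Toy stand-in for four base models (RF, ExtraTrees, MLP, SVM), each
# giving a uniform distribution over 13 syndromes for 3 samples.
uniform = [[1 / 13] * 13 for _ in range(3)]
meta = build_meta_features([uniform, uniform, uniform, uniform])
print(len(train_idx), len(test_idx), len(meta), len(meta[0]))
```

Under these assumptions each sample would reach the second-round learner as 4 × 13 = 52 probability columns. In practice such columns are usually built from out-of-fold predictions so that the second-round learner does not see predictions made on the base models' own training rows; whether the study did this is not stated.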

English abstract: Objective: To explore a method of optimizing the performance of traditional Chinese medicine (TCM) syndrome diagnostic models using Stacking ensemble learning. Methods: Taking the construction of a TCM syndrome diagnostic model for lung cancer as an example, 2598 cases of clinical symptoms and signs from lung cancer patients in 9 hospitals were used as independent variables (i.e., feature variables) and TCM syndrome information as the dependent variable, and the clinical data were divided into a training set and a test set in an 8:2 ratio according to the random number table method using Python 3.7. The stable features of the TCM syndromes of lung cancer were screened using the chi-square test, Spearman's correlation test, and Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression analysis. Nine machine learning algorithms were trained to obtain 9 base models: support vector machine (SVM), k-nearest neighbors (KNN), Random Forest (RF), Extremely Randomized Trees (ExtraTrees), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gradient Boosting (GB), and multi-layer perceptron (MLP). The four best-performing base models were selected and fused into a fusion model using Stacking ensemble learning; the fusion model was then trained a second time with each of the nine machine learning algorithms and evaluated by accuracy, micro-average ROC curves, area under the curve (AUC), and confusion matrices to select the optimal diagnostic model. Results: After data processing, 79 stable features and 13 TCM syndromes were obtained. In base-model training, the overall performance of the RF, ExtraTrees, MLP, and SVM base models was better, so the predicted syndrome distributions of these four models were used as the secondary training data, and nine fusion models were obtained based on Stacking ensemble learning (SVM, KNN, RF, ExtraTrees, XGBoost, LightGBM, GB, AdaBoost, MLP). Among them, the XGBoost fusion model performed best, with an accuracy of 0.850 and 0.838 in the training set and test set, respectively, an overfitting difference of 0.012, and an area under the micro-average ROC curve of 0.996. All fusion models showed an improvement in accuracy and area under the micro-average ROC curve compared with the base models in the test set. Conclusion: Taking the TCM syndrome information of lung cancer as an example, the XGBoost fusion model obtained through Stacking ensemble learning has significant advantages in improving the diagnostic performance for TCM syndromes of lung cancer. Stacking ensemble learning can integrate the advantages of multiple models and effectively improve the diagnostic performance of TCM syndrome diagnostic models, which provides a methodological reference for similar studies.
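The evaluation quantities named in the abstract can be made concrete with a small plain-Python sketch. It assumes the usual definitions: micro-average AUC pools every (sample, syndrome) pair into one binary problem, and the overfitting difference is training accuracy minus test accuracy. The function names and the three-class toy data are illustrative and are not taken from the study.

```python
def accuracy(y_true, y_score):
    """Fraction of samples whose highest-scoring class is the true class."""
    hits = 0
    for label, scores in zip(y_true, y_score):
        predicted = max(range(len(scores)), key=lambda k: scores[k])
        hits += predicted == label
    return hits / len(y_true)


def micro_average_auc(y_true, y_score):
    """Micro-average one-vs-rest AUC.

    Every (sample, class) pair becomes one binary decision. The AUC is the
    probability that a randomly chosen positive pair scores higher than a
    randomly chosen negative pair, with ties counted as one half.
    """
    positives, negatives = [], []
    for label, scores in zip(y_true, y_score):
        for k, score in enumerate(scores):
            (positives if k == label else negatives).append(score)
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data: 3 samples, 3 classes (the study has 13 syndromes).
y_true = [0, 1, 2]
y_score = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],
]
toy_accuracy = accuracy(y_true, y_score)
toy_auc = micro_average_auc(y_true, y_score)

# Overfitting difference for the XGBoost fusion model, using the
# accuracies reported in the abstract.
overfitting_difference = round(0.850 - 0.838, 3)
print(toy_accuracy, toy_auc, overfitting_difference)
```

The pairwise loop is quadratic in the number of pooled pairs, which is fine for a toy example; a rank-based computation would be preferable on the full test set.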

English keywords:

diagnostic model of traditional Chinese medicine syndrome; lung cancer; syndrome; machine learning; Stacking ensemble learning

Authors:

Guo Xiaochuan (郭小川), Feng Zhenzhen (冯贞贞), Liu Wenrui (刘文瑞), Li Jiansheng (李建生)

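Author affiliations:

Henan University of Chinese Medicine / Province-Ministry Co-constructed Collaborative Innovation Center for the Prevention and Treatment of Respiratory Diseases with Chinese Medicine, 156 Jinshui East Road, Zhengdong New District, Zhengzhou, Henan Province, 450046

The First Affiliated Hospital of Henan University of Chinese Medicine

The First Clinical Medical College of Henan University of Chinese Medicine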

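Keywords (translated from the Chinese):

TCM syndrome diagnostic model; lung cancer; syndrome; machine learning; Stacking ensemble algorithm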

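Funding:

National Administration of Traditional Chinese Medicine "Hundred-Thousand-Ten Thousand" Talent Project for TCM Inheritance and Innovation (Qihuang Project), Chief Scientist; National Natural Science Foundation of China; Henan Province Special Research Project on Traditional Chinese Medicine Science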

Project numbers:

国中医药人教函[2020]219号; 82205313; 2022JDZX102

Year of publication:

2024

DOI:

10.13288/j.11-2166/r.2024.17.007

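Journal: Journal of Traditional Chinese Medicine (中医杂志)

Sponsors: China Association of Chinese Medicine; China Academy of Chinese Medical Sciences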

Indexing: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 1.464

ISSN: 1001-1668

Year, volume (issue): 2024, 65(17)

Number of references: 17