Establishment of a Traditional Chinese Medicine Syndrome Diagnostic Model Based on Stacking Ensemble Learning: Taking Lung Cancer as an Example
Objective: To explore a method for optimizing the performance of traditional Chinese medicine (TCM) syndrome diagnostic models using Stacking ensemble learning.
Methods: Taking the construction of a TCM syndrome diagnostic model for lung cancer as an example, the clinical symptoms and signs of 2598 lung cancer cases from 9 hospitals were used as independent variables (i.e., feature variables) and TCM syndrome information as the dependent variable. The clinical data were divided into a training set and a test set at a ratio of 8:2 by the random number table method using Python 3.7. Stable features of the TCM syndromes of lung cancer were screened using the chi-square test, Spearman's correlation test, and Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression analysis. Nine machine learning algorithms were trained to obtain nine base models: support vector machine (SVM), k-nearest neighbors (KNN), Random Forest (RF), Extremely Randomized Trees (ExtraTrees), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gradient Boosting (GB), and multi-layer perceptron (MLP). The four better-performing base models were selected and fused by Stacking ensemble learning; the fused output was then trained a second time with the same nine algorithms, and the resulting models were evaluated by accuracy, micro-average ROC curves, area under the curve (AUC), and confusion matrices to identify the optimal diagnostic model.
Results: After data processing, 79 stable features and 13 TCM syndromes were obtained. In base model training, the RF, ExtraTrees, MLP, and SVM base models showed the best overall performance, so the predicted syndrome distributions of these four models were used as the secondary training data, and nine fusion models were obtained through Stacking ensemble learning (SVM, KNN, RF, ExtraTrees, XGBoost, LightGBM, GB, AdaBoost, and MLP). Among them, the XGBoost fusion model performed best, with an accuracy of 0.850 on the training set and 0.838 on the test set, an overfitting difference (training accuracy minus test accuracy) of 0.012, and an area under the micro-average ROC curve of 0.996. On the test set, all fusion models improved on the base models in both accuracy and area under the micro-average ROC curve.
Conclusion: Taking the TCM syndrome information of lung cancer as an example, the XGBoost fusion model built through Stacking ensemble learning showed significant advantages in diagnostic performance. Stacking ensemble learning can integrate multiple models and effectively improve the diagnostic performance of TCM syndrome diagnostic models, which provides a methodological reference for similar studies.
Keywords: diagnostic model of traditional Chinese medicine syndrome; lung cancer; syndrome; machine learning; Stacking ensemble learning
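The two-level Stacking procedure described in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn is available, uses synthetic binary features in place of the clinical data, uses only 4 classes rather than 13 syndromes, and substitutes scikit-learn's `GradientBoostingClassifier` for XGBoost as the second-level learner so the example needs no extra package. The stratified split, all hyperparameters, and all variable names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the study's data): 600 cases, 30 features, 4 classes.
X, y = make_classification(n_samples=600, n_features=30, n_informative=12,
                           n_classes=4, random_state=0)
X = (X > 0).astype(int)  # mimic presence/absence of symptoms and signs

# 8:2 split into training and test sets (stratification is an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# First level: the four base model types the abstract reports as best performing.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Second level: trained on the base models' cross-validated class probabilities.
# GradientBoostingClassifier is a stand-in for the XGBoost fusion model.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=GradientBoostingClassifier(random_state=0),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)

# Evaluation in the spirit of the abstract: accuracy, train/test gap, micro-average AUC.
train_acc = stack.score(X_train, y_train)
test_acc = stack.score(X_test, y_test)
overfit_gap = train_acc - test_acc

y_test_bin = label_binarize(y_test, classes=stack.classes_)
micro_auc = roc_auc_score(y_test_bin, stack.predict_proba(X_test), average="micro")

# Secondary training data seen by the second-level learner: 4 models x 4 classes.
meta_features = stack.transform(X_test)

print(f"train acc={train_acc:.3f} test acc={test_acc:.3f} "
      f"gap={overfit_gap:.3f} micro-AUC={micro_auc:.3f}")
```

Setting `stack_method="predict_proba"` is what makes the second-level learner train on predicted syndrome distributions rather than hard labels, and `cv=5` means those probabilities come from out-of-fold predictions, which limits leakage from the first level into the second.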