Auxiliary diagnosis of active tuberculosis based on machine learning under feature selection
[Objective] Tuberculosis (TB) is a widespread infectious disease. In clinical data, a marked difference in cytokine secretion levels in peripheral blood T lymphocytes following stimulation with tuberculosis-specific antigens distinguishes patients with active tuberculosis from those with latent infection. These datasets contain cytokine levels measured before and after antigen stimulation, which makes them suitable for processing with machine learning techniques. This study therefore combines a cytokine assay with machine learning algorithms that incorporate various feature selection strategies to analyze cytokine levels in suspected tuberculosis patients, thereby supporting the auxiliary diagnosis of active tuberculosis.

[Methods] A total of 42 patients with active tuberculosis and 38 patients with inactive tuberculosis were tested for serum cytokine levels. To address the limitation that the traditional multi-population genetic algorithm (MPGA) relies on a single-criterion fitness function, an improved MPGA (IMPGA) is proposed. Four feature selection methods, namely IMPGA, MPGA, particle swarm optimization (PSO), and Pearson correlation coefficient (PCC) selection, are combined with three classifiers, namely logistic regression (LR), support vector machine (SVM), and extreme gradient boosting (XGBoost), to explore the classification performance for active tuberculosis and to select the key features.

[Results] Regarding the feature selection results, the number of features selected by the MPGA-SVM, MPGA-XGBoost, IMPGA-SVM, and IMPGA-XGBoost methods is significantly lower than that of the other methods. When the results are grouped by classifier, the number of selected features increases in the order SVM, XGBoost, LR. However, no obvious pattern is observed when the results are grouped by feature selection method. IFN-g-T and MIG-T appear with the highest frequency in the selection results across methods. Grouped by classifier, the most frequently selected features are IFN-g-T, GBP5-N, and IL-15-N for the XGBoost group; MIG-T for the SVM group; and IFN-g-T and Eotaxin-T for the LR group. Again, no clear pattern is observed when the selected features are grouped by feature selection method. Regarding the performance of feature selection combined with the classification models, the area under the curve (AUC) in the LR group ranged from 0.630 to 0.784, with PCC-LR performing best, an improvement of 0.037 over LR without feature selection. The algorithms in the SVM group generally outperformed those in the LR group, with AUC values between 0.776 and 0.880. The best-performing algorithm in this group was IMPGA-SVM, with an AUC of 0.880, an increase of 0.052 over SVM without feature selection. In the XGBoost group, the AUC ranged from 0.722 to 0.832, with the best performance achieved by IMPGA-XGBoost at an AUC of 0.832, an increase of 0.078 over XGBoost without feature selection. Among all 15 methods evaluated, IMPGA-SVM achieved the best AUC, 0.880.

[Conclusion] In the selection results of the IMPGA-SVM method, which showed the best classification performance, the frequency of monokine induced by γ-interferon T (MIG-T) markedly exceeds that of the other features. This underscores the pivotal role of MIG in predicting active tuberculosis, in agreement with findings in the related literature. This study also addresses certain deficiencies of the conventional MPGA approach and substantially improves on it, yielding the optimal model for this research. Across the different feature selections, IMPGA showed an average fitness improvement over the traditional MPGA of 0.018, 0.008, and 0.010 for the LR, SVM, and XGBoost groups, respectively, thereby enhancing the predictive capability for active tuberculosis in a relatively stable manner. In summary, by employing machine learning methods to assist in the diagnosis of active tuberculosis, together with feature selection techniques to reduce feature dimensionality, this study achieves two objectives: improving classification accuracy and identifying key features, which in turn increases the interpretability of the machine learning outcomes.
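
As an illustration of the simplest of the four feature selection strategies named above, the following minimal sketch ranks features by the absolute Pearson correlation coefficient with the class label and scores a single feature by AUC. It is not the study's implementation: the helper names (`pearson`, `select_by_pcc`, `auc`), the cut-off `k`, and all cytokine values are hypothetical, and only the feature names are taken from the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_by_pcc(columns, labels, k):
    """Return the k feature names with the largest |PCC| against the labels."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], labels)),
        reverse=True,
    )
    return ranked[:k]


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy values; 1 = active TB, 0 = inactive TB (assumed encoding).
toy_cytokines = {
    "MIG-T": [9.0, 8.0, 7.5, 2.0, 3.0, 1.0],
    "IFN-g-T": [5.0, 6.0, 2.0, 3.0, 1.0, 2.5],
    "Eotaxin-T": [1.0, 2.0, 1.5, 2.5, 1.0, 2.0],
}
toy_labels = [1, 1, 1, 0, 0, 0]

selected = select_by_pcc(toy_cytokines, toy_labels, k=2)
print(selected)
print(auc(toy_cytokines["MIG-T"], toy_labels))
```

A filter of this kind scores each feature independently of the classifier, whereas wrapper methods such as MPGA, IMPGA, and PSO search over feature subsets using the classifier's own performance as the fitness signal.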