Feature selection and parameter optimization of machine learning algorithms for clinical specialty diseases
Objective To explore the advantages of machine learning algorithms in handling multimodal data and hyperparameter selection, thereby accelerating the application of data-driven approaches in clinical research.
Methods This study included patients who underwent their first endoscopic retrograde cholangiopancreatography (ERCP) for choledocholithiasis at The First Hospital of Lanzhou University from 2022 to 2023. Different variable selection methods were used to rank feature importance, and five machine learning algorithms (k-nearest neighbor, extreme gradient boosting, support vector machine, naive Bayes, and random forest), together with logistic regression, were applied to predict postoperative complications. The optimal feature set was the one that maximized the area under the curve (AUC). Hyperparameters were tuned under ten-fold cross-validation, and the AUC on the test set served as the final evaluation metric for establishing the best binary classification model for postoperative complications of ERCP in choledocholithiasis.
Results A total of 465 patients were included. In a parallel comparison of the algorithms, random forest was identified as the best model, with the contributions of the selected features ranked as follows: mechanical fragmentation, number of guidewire entries into the pancreatic duct, intraoperative bleeding, difficult intubation, and surgical duration. The optimal hyperparameters were: number of trees = 500, minimum node size = 2, features per tree = 1, with Gini impurity as the splitting criterion. The average specificity, sensitivity, and AUC of the random forest model across the ten-fold cross-validation sets were 0.972, 0.710, and 0.942, respectively, while the corresponding test set values were 0.950, 0.625, and 0.886. The random forest model significantly outperformed the other machine learning algorithms and logistic regression, and it was superior to logistic regression in terms of clinical decision-making effectiveness, predictive accuracy, and risk-benefit assessment.
Conclusion Among the various machine learning models constructed for clinical specialties, the random forest model showed the best performance under the conditions of refined variable preprocessing and optimized parameters.
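
To make the evaluation metrics reported above concrete, the following minimal sketch shows how sensitivity, specificity, and AUC can be computed for a binary classifier, and how a cohort of 465 patients might be partitioned into ten folds. It is an illustration under stated assumptions, not the study's code: the function names, the 0.5 decision threshold, the random seed, and the toy labels and scores are all hypothetical, and none of the numbers are study data.

```python
import random


def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity at a fixed threshold.
    The 0.5 default is an assumption; both classes must be present."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy example (hypothetical values): 1 = complication, 0 = no complication.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

auc = auc_score(labels, scores)
sens, spec = sensitivity_specificity(labels, scores)
folds = k_fold_indices(465, k=10)

print(f"AUC={auc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
print("fold sizes:", [len(f) for f in folds])
```

In a ten-fold scheme, each fold serves once as the validation set while the other nine are used for training, and the per-fold metrics are averaged. That averaging is how cross-validated values differ from values measured once on a held-out test set.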