Prediction of nonsynonymous variant pathogenicity and feature importance analysis based on machine learning
Objective: This study aims to assess the integrated performance of several machine learning models in predicting the pathogenicity of nonsynonymous variants, and to validate the contribution and effect of each prediction tool through feature importance analysis and validation on multiple datasets.
Methods: Twenty-seven pathogenicity prediction tools were used to score nonsynonymous variants in the ClinVar dataset and in three external validation sets. Missing values were handled with mean, median, and random forest imputation. Four classical machine learning models (random forest, neural network, naive Bayes, and extreme gradient boosting tree) were used to integrate the prediction tools; combined with the three imputation methods, this gave twelve models. The best imputation method was selected based on accuracy and kappa values on the internal validation set, and the four models using this imputation method were then assessed on multiple metrics. The importance of each prediction tool in the ensemble model was evaluated using feature importance scores and validated on the internal and external validation sets.
Results: Random forest imputation performed best in handling missing values, with an average accuracy of 0.9080 and an average kappa value of 0.8087. Among the four machine learning algorithms, the extreme gradient boosting tree model showed the best overall performance across the metrics. The neural network and random forest models performed similarly to the extreme gradient boosting tree model, while the naive Bayes model had the highest specificity and the shortest runtime but a lower kappa value. Feature importance scores indicated that AlphaMissense, VEST4, and MVP were the core features of the extreme gradient boosting tree model. In both the internal validation set and the three external validation sets, the AUC values of AlphaMissense, VEST4, and DEOGEN2 ranked in the top five. The ensemble extreme gradient boosting tree model constructed in this study had an AUC of 0.9763 on the internal validation set, higher than that of any single prediction score, and AUC values above 0.96 on the external validation sets.
Conclusions: The extreme gradient boosting tree model, with random forest imputation of missing values, performed best in predicting the pathogenicity of nonsynonymous variants and can be considered when integrating multiple prediction tools. Prediction tools such as AlphaMissense and VEST4 contributed substantially to the ensemble model and showed high predictive reliability and accuracy, so they can provide reliable predictions of the pathogenicity of nonsynonymous variants.
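The evaluation quantities named above (mean and median imputation of missing tool scores, accuracy, Cohen's kappa, and AUC) can be illustrated with a minimal pure-Python sketch. It is not the study's pipeline: the scores and labels are invented toy values, the 0.5 decision threshold and all function names are assumptions made for illustration, and random forest imputation and the four classifiers are not reproduced here.

```python
from statistics import mean, median


def impute(column, strategy="mean"):
    """Fill None entries of one tool's score column with the column mean or median."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]


def accuracy(y_true, y_pred):
    """Fraction of variants whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    # Chance agreement from the marginal label frequencies of both raters.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)


def auc(y_true, scores):
    """Probability that a pathogenic variant (label 1) outscores a benign one
    (label 0); tied pairs count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = pathogenic, 0 = benign; None = missing score.
labels = [1, 1, 1, 0, 0, 0]
tool_scores = [0.9, None, 0.8, 0.4, 0.1, None]

for strategy in ("mean", "median"):
    filled = impute(tool_scores, strategy)
    predicted = [1 if s >= 0.5 else 0 for s in filled]  # assumed threshold
    print(strategy, accuracy(labels, predicted),
          cohen_kappa(labels, predicted), auc(labels, filled))
```

Kappa is reported alongside accuracy because it discounts the agreement expected by chance, which matters when pathogenic and benign variants are unevenly represented.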