A convolutional neural network-based method for differentiating between Escherichia coli and Shigella genomes
Objective: To differentiate between highly genetically similar bacteria, such as Escherichia coli and Shigella spp., using deep learning techniques, in order to contribute to clinical diagnosis and epidemic prevention.
Methods: A convolutional neural network (CNN) was proposed based on transfer learning from a large-scale pre-trained protein language model, enabling rapid and accurate identification of bacterial strains at the genus level. To validate the reliability of this model, whole-genome data on related bacteria were retrieved from the National Center for Biotechnology Information (NCBI) in the United States, and the whole-genome protein sequences of highly genetically similar strains of Escherichia coli and Shigella spp. were selected as experimental samples.
Results: In classification experiments on 2960 strains with high assembly quality and 4945 strains with low assembly quality, this method achieved genus-level classification accuracies of 97.13% and 95.56%, respectively, outperforming the other currently available methods.
Conclusion: This study demonstrates the reliability and potential of deep learning-based methods for differentiating bacterial types. By integrating self-supervised pre-training with transfer learning, this approach can capture high-dimensional feature differences that are not easily discerned or statistically analyzed by humans. Furthermore, the method is broadly applicable, as it tolerates lower assembly completeness in the bacterial genome sequences used.
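The general idea described in Methods (a convolutional classifier placed on top of features from a pre-trained protein language model) can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything below is an assumption made for illustration: each genome is represented as a sequence of fixed-length protein embeddings (random placeholders here, standing in for the output of a frozen language model), and the embedding size `D`, kernel width `K`, channel count `C`, and the untrained random weights are all hypothetical. It is not the authors' implementation.

```python
import math
import random


def conv1d_relu(x, kernels, biases):
    """Valid 1D convolution along the protein axis, followed by ReLU.

    x:       list of n embedding vectors, each of length D
    kernels: list of C filters, each a list of K weight vectors of length D
    Returns a list of n - K + 1 feature vectors, each of length C.
    """
    k = len(kernels[0])
    out = []
    for start in range(len(x) - k + 1):
        window = x[start:start + k]
        feats = []
        for filt, b in zip(kernels, biases):
            s = b + sum(
                w * v
                for wvec, vvec in zip(filt, window)
                for w, v in zip(wvec, vvec)
            )
            feats.append(max(0.0, s))
        out.append(feats)
    return out


def global_max_pool(features):
    """Keep the maximum of each channel across all positions."""
    return [max(channel) for channel in zip(*features)]


def predict_proba(x, kernels, biases, w_out, b_out):
    """Return a probability for one of the two genera (label choice is arbitrary)."""
    pooled = global_max_pool(conv1d_relu(x, kernels, biases))
    logit = b_out + sum(w * p for w, p in zip(w_out, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical sizes: D-dimensional embeddings for n_proteins proteins.
rng = random.Random(0)
D, n_proteins, K, C = 8, 20, 3, 4

# Placeholder embeddings; in practice these would come from the frozen
# pre-trained protein language model (assumption, not shown here).
genome = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n_proteins)]

# Untrained random weights, for shape illustration only.
kernels = [[[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(K)] for _ in range(C)]
biases = [0.0] * C
w_out = [rng.gauss(0, 0.1) for _ in range(C)]

p = predict_proba(genome, kernels, biases, w_out, 0.0)
print(f"predicted probability: {p:.3f}")
```

In a transfer-learning setup of this kind, only the convolutional and output weights would be trained on labeled genomes, while the language model that produces the embeddings stays fixed. Global max pooling makes the classifier independent of the number of proteins per genome, which is one plausible way such a model could cope with incomplete assemblies.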