A convolutional neural network-based method for differentiating between Escherichia coli and Shigella genomes

Objective: To use deep learning to differentiate Escherichia coli from Shigella spp., whose genomes are highly similar, and thereby provide a reference for clinical diagnosis and for epidemic prevention and control. Methods: A convolutional neural network (CNN) that applies transfer learning from a large-scale pre-trained protein language model is proposed for bacterial identification; the method identifies bacteria rapidly and accurately at the genus level. To validate the model's reliability, whole-genome data for the relevant bacteria were downloaded from the National Center for Biotechnology Information (NCBI) in the United States, and the whole-genome protein sequences of the highly similar Escherichia coli and Shigella spp. were selected as experimental samples. Results: In classification experiments on 2960 strains with high assembly quality and on 4945 strains that include low-assembly-quality genomes, the method achieved genus-level classification accuracies of 97.13% and 95.56%, respectively, outperforming other existing methods. Conclusion: By combining self-supervised pre-training with transfer learning, this deep learning-based identification method can learn high-dimensional feature differences that humans cannot readily observe or tabulate, and it shows considerable potential. Because it places low demands on the assembly completeness of the genome sequences used, it is also widely applicable and of practical value.
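The abstract does not specify the network architecture, so the following is only a minimal sketch of the general idea under stated assumptions: each genome is represented as a sequence of per-protein embedding vectors (here random numbers standing in for the output of a pre-trained protein language model), a 1D convolution with ReLU and global max pooling summarizes them, and a dense softmax layer yields two genus probabilities. All names, dimensions, and the untrained random weights are hypothetical and are not taken from the paper.

```python
import math
import random


def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution along the protein axis, followed by ReLU.

    x: list of L embedding vectors, each of length D
    kernels: list of F filters, each a list of K weight vectors of length D
    Returns F feature maps, each of length L - K + 1.
    """
    feature_maps = []
    for f, kernel in enumerate(kernels):
        width = len(kernel)
        fmap = []
        for i in range(len(x) - width + 1):
            s = bias[f]
            for k in range(width):
                s += sum(w * v for w, v in zip(kernel[k], x[i + k]))
            fmap.append(max(0.0, s))
        feature_maps.append(fmap)
    return feature_maps


def global_max_pool(feature_maps):
    """Keep the strongest response of each filter across the genome."""
    return [max(fmap) for fmap in feature_maps]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def classify(x, params):
    """Forward pass: conv + ReLU -> global max pool -> dense -> softmax."""
    feats = global_max_pool(conv1d_relu(x, params["kernels"], params["conv_bias"]))
    logits = [sum(w * f for w, f in zip(row, feats)) + b
              for row, b in zip(params["dense"], params["dense_bias"])]
    return softmax(logits)


def random_params(rng, dim, n_filters, width, n_classes=2):
    """Untrained placeholder weights (assumption: a real model would be trained)."""
    def u():
        return rng.uniform(-0.1, 0.1)
    return {
        "kernels": [[[u() for _ in range(dim)] for _ in range(width)]
                    for _ in range(n_filters)],
        "conv_bias": [0.0] * n_filters,
        "dense": [[u() for _ in range(n_filters)] for _ in range(n_classes)],
        "dense_bias": [0.0] * n_classes,
    }


rng = random.Random(0)
DIM, N_PROTEINS = 8, 20  # hypothetical sizes, far smaller than real embeddings
# Stand-in for language-model embeddings of one genome's proteins.
genome = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PROTEINS)]
params = random_params(rng, DIM, n_filters=4, width=3)
probs = classify(genome, params)  # [P(class 0), P(class 1)], labels are arbitrary here
```

In a transfer-learning setup of this kind, the language model would typically be kept frozen or lightly fine-tuned while the small convolutional head is trained on labeled genomes; whether the paper does so is not stated in the abstract.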

Keywords: Escherichia coli; Shigella; bacterial identification; whole genome protein; convolutional neural network

Authors: Meng Renjie (孟人杰), Luo Nan (罗楠), Jin Yuan (靳远), Yue Junjie (岳俊杰), Wang Boqian (王博千), Gao Yuanming (高沅铭)

College of Computer Science, National University of Defense Technology, Changsha 410073

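State Key Laboratory of Pathogen and Biosecurity, Institute of Biotechnology, Academy of Military Medical Sciences, Academy of Military Sciences, Beijing 100071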

Funding: National Natural Science Foundation of China (82003519, 32070025, 62102439); State Key Laboratory of Pathogen and Biosecurity research projects (SKLPBS1807, SKLPBS2214)

Military Medical Sciences (军事医学), Academy of Military Medical Sciences

CSTPCD
Impact factor: 0.586
ISSN: 1674-9960
Year, volume (issue): 2024, 48(3)