Research on DNA Sequence Classification Based on Machine Learning

DNA carries all of an organism's genetic information and determines the structure and function of its genes. Predicting the class to which a DNA sequence belongs makes it possible to judge whether an unknown sample is a new species, an alien species, or a known species. As biotechnology advances, an important problem in bioinformatics is how to extract complete information from an acquired DNA sequence, predict its composition, find the rules governing that composition, and accurately reflect the characteristics of the species. In this study, two classes of DNA sequence files with accession numbers CP021707 and CP085300 were downloaded from the NCBI website. Using feature extraction based on base frequency and base count, single-base, double-base, and triple-base features were extracted to construct 84-dimensional, 168-dimensional, and 35-dimensional feature vectors. Classification was then performed with three models: K-nearest neighbor (KNN), support vector machine (SVM), and a fusion of the two (KNN-SVM). The experimental results show that, with the 168-dimensional feature vector, the KNN-SVM model achieves higher classification accuracy than either KNN or SVM alone, which is useful for judging the characteristics of an unknown class.
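The abstract does not spell out how the feature vectors are assembled, so the following is only a minimal sketch under stated assumptions. It assumes that the 84 dimensions correspond to the 4 single-base, 16 double-base, and 64 triple-base patterns (4 + 16 + 64 = 84), and that the 168-dimensional vector concatenates frequencies with raw counts (2 × 84). The construction of the 35-dimensional vector is not described and is not attempted here. The function name `kmer_features` is hypothetical and is not taken from the paper.

```python
from itertools import product

BASES = "ACGT"

def kmer_features(seq, k_values=(1, 2, 3)):
    """Return (counts, freqs) over all k-mers for each k in k_values.

    Assumption: features are ordered by k, then lexicographically over ACGT.
    """
    seq = seq.upper()
    counts, freqs = [], []
    for k in k_values:
        kmers = ["".join(p) for p in product(BASES, repeat=k)]
        table = dict.fromkeys(kmers, 0)
        # Slide a window of width k; windows containing ambiguous
        # bases (e.g. N) are not in the table and are skipped.
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in table:
                table[window] += 1
        total = sum(table.values())
        counts.extend(table[m] for m in kmers)
        freqs.extend(table[m] / total if total else 0.0 for m in kmers)
    return counts, freqs

# Toy sequence for illustration only (not data from the study).
counts, freqs = kmer_features("ACGTACGTNN")
features_84 = freqs            # assumed: frequency-only vector
features_168 = freqs + counts  # assumed: frequencies followed by counts
print(len(features_84), len(features_168))  # prints "84 168"
```

Vectors of this shape could then be passed to any standard KNN or SVM implementation. How the paper fuses the two classifiers is not stated in the abstract.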

Keywords: support vector machine; DNA sequence; feature extraction; K-nearest neighbor; classification accuracy

保志康、陈继璇、刘印晓、张茂源、章洪博、刘振安、魏晓娟

School of Electrical Engineering, Northwest Minzu University, Lanzhou 730000, Gansu, China

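Funding: National Natural Science Foundation of China (12205241); Natural Science Foundation of Gansu Province (20JR10RA115); Innovation Fund for Higher Education Institutions of Gansu Province (2022B-074); Fundamental Research Funds for the Central Universities (31920220049, 31920230138)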

生物化工 (Biological Chemical Engineering), 2024, 10(3)