Research on DNA Sequence Classification Based on Machine Learning

保志康¹, 陈继璇¹, 刘印晓¹, 张茂源¹, 章洪博¹, 刘振安¹, 魏晓娟¹

Author information

1. School of Electrical Engineering, Northwest Minzu University, Lanzhou, Gansu 730000, China

Abstract

DNA carries all of an organism's genetic information and determines the structure and function of its genes. Predicting the class to which a DNA sequence belongs makes it possible to judge whether an unknown sample is a new species, an alien species, or a known species. With the development of biotechnology, how to extract complete information from acquired DNA sequences, predict their composition, find the underlying compositional patterns, and accurately reflect species characteristics has become an important problem in bioinformatics. In this study, two classes of DNA sequence files, with accession numbers CP021707 and CP085300, were downloaded from the NCBI website. Using feature extraction methods based on base frequency and base count, single-base, double-base, and triple-base features were extracted to construct 84-dimensional, 168-dimensional, and 35-dimensional feature vectors. Classification was then performed with K-nearest neighbor (KNN), support vector machine (SVM), and fused KNN and SVM (KNN-SVM) models. The experimental results show that, with the 168-dimensional feature vector, the KNN-SVM model achieves higher classification accuracy than either the KNN or the SVM model alone, which is of positive significance for judging the characteristics of an unknown class.
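
The abstract does not spell out how the feature vectors are assembled, so the following is only a minimal sketch under stated assumptions. It counts overlapping single-base, double-base, and triple-base windows over the alphabet A, C, G, T, which gives 4 + 16 + 64 = 84 counts, and pairs them with the matching 84 frequencies to form a 168-value vector. Whether the authors build their 84- and 168-dimensional vectors this way is an assumption, and the 35-dimensional vector is not reproduced here. The function names `kmer_features` and `knn_predict` are hypothetical, and the small KNN routine only illustrates the nearest-neighbor step; the SVM and the KNN-SVM fusion rule are not shown because the abstract does not describe them.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"  # assumed alphabet; windows containing other symbols are ignored


def kmer_features(seq, ks=(1, 2, 3)):
    """Return (counts, freqs) over all k-mers for each k in ks, in a fixed order."""
    seq = seq.upper()
    counts, freqs = [], []
    for k in ks:
        # Count every overlapping window of length k.
        windows = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Fixed vocabulary order: AA, AC, AG, AT, CA, ... for k = 2, and so on.
        vocab = ["".join(p) for p in product(BASES, repeat=k)]
        total = sum(windows[m] for m in vocab)
        for m in vocab:
            counts.append(windows[m])
            freqs.append(windows[m] / total if total else 0.0)
    return counts, freqs


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Usage on a toy sequence (illustrative only, not data from the study).
counts, freqs = kmer_features("ACGTAC")
features = counts + freqs  # 84 counts followed by 84 frequencies
```

In practice the count features and the frequency features live on very different scales, so some form of normalization before a distance-based classifier such as KNN would likely matter; the abstract does not say whether the authors apply one.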

Key words

support vector machine/DNA sequence/feature extraction/K-nearest neighbor/classification accuracy

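Funding

National Natural Science Foundation of China (12205241)

Natural Science Foundation of Gansu Province (20JR10RA115)

Innovation Fund for Higher Education Institutions of Gansu Province (2022B-074)

Fundamental Research Funds for the Central Universities (31920220049)

Fundamental Research Funds for the Central Universities (31920230138)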

Publication year

2024

生物化工 (Biological Chemical Engineering)

ISSN：

Number of references: 10
