Research on DNA Sequence Classification Based on Machine Learning
DNA carries all of an organism's genetic information, which determines the structure and function of its genes. Predicting the category of a DNA sequence can determine whether an unknown sample belongs to a new species, an alien species, or a well-known species. With the development of biotechnology, how to extract complete information from an obtained DNA sequence, predict its sequence composition, find its composition rules, and accurately reflect the characteristics of the species has become an important issue in bioinformatics. In this study, DNA sequence files for two classes, with accession numbers CP021707 and CP085300, are downloaded from the NCBI website. Using feature extraction methods based on base frequency and base quantity, single-base, double-base, and triple-base features are extracted to construct 84-dimensional, 168-dimensional, and 35-dimensional feature vectors. Classification is then performed with K-nearest neighbor (KNN), support vector machine (SVM), and fused K-nearest neighbor and support vector machine (KNN-SVM) models. The experimental results show that, with the 168-dimensional feature vector, the KNN-SVM model achieves higher classification accuracy than either the KNN or the SVM model alone, which is of positive significance for judging the relevant characteristics of an unknown class.
Keywords: support vector machine; DNA sequence; feature extraction; K-nearest neighbor; classification accuracy
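As a rough illustration of the single-, double-, and triple-base feature extraction mentioned in the abstract, the following minimal Python sketch counts overlapping k-mers for k = 1, 2, 3 and normalizes them to frequencies. Since there are 4 single bases, 16 base pairs, and 64 base triplets, concatenating the three frequency vectors gives 4 + 16 + 64 = 84 dimensions. The function names (`kmer_counts`, `kmer_frequencies`, `feature_vector`) are hypothetical, and the idea that the 168-dimensional vector is formed by appending raw counts to the frequencies is an assumption made here for illustration only; the abstract does not state how the 168- or 35-dimensional vectors are built, and the 35-dimensional case is not reproduced.

```python
from itertools import product

BASES = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers over A/C/G/T in fixed lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Windows containing N or other ambiguity codes are skipped.
        if window in counts:
            counts[window] += 1
    return [counts[m] for m in kmers]


def kmer_frequencies(seq, k):
    """Normalize k-mer counts so that they sum to 1 (all zeros if no valid k-mer)."""
    counts = kmer_counts(seq, k)
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def feature_vector(seq, with_counts=False):
    """Concatenate 1-, 2- and 3-mer frequencies into an 84-dimensional vector.

    With with_counts=True, the raw counts are appended as well, giving
    168 dimensions. This pairing is an assumption for illustration, not
    the construction confirmed by the paper.
    """
    freqs = [f for k in (1, 2, 3) for f in kmer_frequencies(seq, k)]
    if not with_counts:
        return freqs
    counts = [float(c) for k in (1, 2, 3) for c in kmer_counts(seq, k)]
    return freqs + counts


if __name__ == "__main__":
    example = "ACGTACGTNACG"  # toy sequence, not taken from CP021707 or CP085300
    print(len(feature_vector(example)))
    print(len(feature_vector(example, with_counts=True)))
```

Vectors of this kind could then be passed to any standard KNN or SVM implementation. How the two classifiers are fused in the KNN-SVM model is not described in the abstract, so it is not sketched here.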