Machine-learning-based clustering algorithm for tea tree DNA
To study the clustering of tea tree gene sequences, this paper designs a dimensionality-reduction clustering algorithm that combines an improved kernel principal component analysis (KPCA), based on the cumulative variance contribution rate, with k-means++. First, the gene pool dataset was filtered and grouped, and features were extracted from the gene data using the k-mers algorithm. KPCA was then improved by retaining the principal components whose cumulative variance contribution rate exceeded 85%, after which clustering was performed with the k-means++ method. Finally, the clustering results were evaluated using the Calinski-Harabasz index and response time. The experimental results showed that, across different sample sizes, the combined method achieved the highest Calinski-Harabasz index among the four treatments compared: clustering alone, KPCA-clustering, improved PCA-clustering and improved KPCA-clustering. Compared with the improved PCA-k-means++ under the same conditions, its response time was effectively reduced, that is, clustering was faster. The improved KPCA-k-means++ ensured both clustering accuracy and clustering speed for tea tree gene sequences and showed excellent clustering stability.
kernel principal component analysis; cumulative variance contribution rate; k-means++ algorithm; gene clustering
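
The pipeline summarised in the abstract (k-mer features, KPCA truncated at a cumulative variance contribution of 85%, k-means++ clustering, Calinski-Harabasz scoring) can be illustrated with a minimal Python sketch using scikit-learn. It is not the authors' implementation. The abstract does not specify the k-mer length, the kernel, or the number of clusters, so `k=3`, the RBF kernel, and `n_clusters` are assumptions, and the helper names `kmer_features`, `kpca_by_cumulative_variance` and `cluster_and_score` are hypothetical.

```python
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA
from sklearn.metrics import calinski_harabasz_score


def kmer_features(sequences, k=3):
    """Return a (n_sequences, 4**k) matrix of overlapping k-mer frequencies.

    k=3 is an assumed value; k-mers containing symbols other than
    A, C, G, T are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    X = np.zeros((len(sequences), len(kmers)))
    for row, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            col = index.get(seq[i:i + k])
            if col is not None:
                X[row, col] += 1
        total = X[row].sum()
        if total > 0:
            X[row] /= total  # convert counts to frequencies
    return X


def kpca_by_cumulative_variance(X, threshold=0.85, kernel="rbf"):
    """Project X with kernel PCA and keep the leading components whose
    cumulative variance contribution first reaches `threshold`.

    The RBF kernel is an assumption. The variance contribution of each
    component is estimated from the variance of its projected coordinates.
    """
    # n_components=None keeps every component with a non-zero eigenvalue,
    # ordered from largest to smallest eigenvalue.
    Z = KernelPCA(n_components=None, kernel=kernel).fit_transform(X)
    variances = np.var(Z, axis=0)
    ratio = variances / variances.sum()
    cumulative = np.cumsum(ratio)
    n_keep = min(int(np.searchsorted(cumulative, threshold)) + 1, Z.shape[1])
    return Z[:, :n_keep], ratio


def cluster_and_score(Z, n_clusters, random_state=0):
    """Cluster with k-means++ initialisation and report the
    Calinski-Harabasz index (higher means better-separated clusters)."""
    model = KMeans(n_clusters=n_clusters, init="k-means++",
                   n_init=10, random_state=random_state)
    labels = model.fit_predict(Z)
    return labels, calinski_harabasz_score(Z, labels)
```

Given a list of sequence strings `seqs`, the three helpers would be chained as `X = kmer_features(seqs)`, then `Z, ratio = kpca_by_cumulative_variance(X)`, then `labels, score = cluster_and_score(Z, n_clusters=2)`. Timing each stage separately (for example with `time.perf_counter`) would give a rough analogue of the response-time comparison.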