Machine learning-based clustering algorithm for tea tree DNA

杨小平¹, 倪萍², 诸葛天秋³, 罗跃新³, 郭春雨³, 庞月兰³, 吴雨婷³

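Author information

1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004
2. Guangxi Key Laboratory of Embedded Technology and Intelligent Systems, Guilin University of Technology, Guilin, Guangxi 541004
3. Guangxi Guilin Tea Science Research Institute, Guilin, Guangxi 541004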

Abstract

To study the clustering of tea tree gene sequences, this paper designs a dimensionality-reduction clustering algorithm (KPCA-k-means++) that combines kernel principal component analysis (KPCA), improved on the basis of the cumulative variance contribution rate, with the k-means++ clustering algorithm. The gene bank dataset is first filtered and grouped, and features are extracted from the gene data with the k-mers algorithm. KPCA is then improved by fixing the number of retained principal components as the number needed for the cumulative variance contribution rate to exceed 85%, and the reduced data are clustered with k-means++. The clustering results are evaluated by the Calinski-Harabasz (CH) index and response time. Across four treatments (clustering alone, KPCA followed by clustering, improved PCA followed by clustering, and improved KPCA followed by clustering), the improved KPCA-k-means++ algorithm achieved the highest CH index for every sample size tested, on average 33% higher than without the improvement. Its response time was also shorter than that of the similarly improved PCA-k-means++ algorithm across the cluster counts and sample sizes compared. The improved KPCA-k-means++ algorithm thus maintains both clustering accuracy and clustering speed for tea tree gene sequences and shows excellent clustering stability.
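Two steps named in the abstract, k-mer feature extraction and choosing the number of principal components from the cumulative variance contribution rate, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmer_features` and `n_components_for`, the default `k=3`, and the `ACGT` alphabet are hypothetical choices made here. Only the 85% threshold comes from the abstract. The kernel, the eigendecomposition that would supply the eigenvalues, and the k-means++ step itself are not shown.

```python
from collections import Counter
from itertools import product


def kmer_features(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies for one DNA sequence.

    The vector has len(alphabet) ** k entries, in lexicographic k-mer order.
    Windows containing symbols outside the alphabet (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def n_components_for(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative variance
    contribution rate exceeds `threshold` (0.85 as stated in the abstract)."""
    # Keep positive eigenvalues only, largest first.
    values = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(values)
    if total == 0:
        raise ValueError("at least one positive eigenvalue is required")
    cumulative = 0.0
    for count, value in enumerate(values, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return count
    return len(values)


if __name__ == "__main__":
    features = kmer_features("ACGTAC", k=2)
    print(len(features))
    print(n_components_for([5.0, 3.0, 1.0, 0.5, 0.5]))
```

In a full pipeline, the eigenvalues passed to `n_components_for` would come from the centered kernel matrix of the k-mer feature vectors, and the retained components would then be clustered with k-means++ and scored with the Calinski-Harabasz index.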

Key words

kernel principal component analysis / cumulative variance contribution rate / k-means++ algorithm / gene clustering

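Funding

Guangxi Zhuang Autonomous Region Science and Technology Program (桂科AD18281068)

Guangxi Zhuang Autonomous Region Science and Technology Major Project (桂科AA20302018-4)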

Publication year

2024

Journal of Guangxi University (Natural Science Edition)

Guangxi University

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.767

ISSN: 1001-7445

Number of references: 22