
Machine learning based clustering algorithm for tea tree DNA

To study the clustering of tea tree gene sequences, this paper designs a dimensionality-reduction clustering algorithm (KPCA-k-means++) that combines kernel principal component analysis (KPCA), improved on the basis of the cumulative variance contribution rate, with k-means++ clustering. The gene database dataset is first filtered and grouped, and features are extracted from the gene data with the k-mers algorithm. KPCA is then improved by choosing the number of retained principal components so that their cumulative variance contribution rate exceeds 85%, and the reduced data are clustered with k-means++. The clustering results are evaluated with the Calinski-Harabasz (CH) index and response time. Among four treatments (clustering alone, KPCA-clustering, improved PCA-clustering, and improved KPCA-clustering), the improved KPCA-k-means++ algorithm achieved the highest CH index across treatments and sample sizes, on average 33% higher than without the improvement. In terms of response time, the improved KPCA-k-means++ algorithm was faster than the likewise improved PCA-k-means++ algorithm across different numbers of clusters and sample sizes. The improved KPCA-k-means++ algorithm maintains both clustering accuracy and clustering speed for tea tree gene sequences and shows excellent clustering stability.
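The two preprocessing ideas named in the abstract, k-mer feature extraction and choosing the number of principal components by cumulative variance contribution rate, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the default `k=3`, the `ACGT` alphabet, and the handling of ambiguous bases are all assumptions made for illustration. The eigenvalues passed to the second function are assumed to come from the centered kernel matrix of a KPCA step that is not shown here.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Hypothetical helper: k and the alphabet are illustrative defaults,
    not values taken from the paper.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        # Windows containing symbols outside the alphabet (e.g. N) are skipped.
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def n_components_by_cumulative_variance(eigenvalues, threshold=0.85):
    """Smallest number of leading components whose cumulative variance
    contribution rate exceeds `threshold` (85% in the abstract)."""
    values = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(values)
    if total == 0:
        raise ValueError("no positive eigenvalues")
    cumulative = 0.0
    for n, v in enumerate(values, start=1):
        cumulative += v
        if cumulative / total > threshold:
            return n
    # Threshold never exceeded (e.g. threshold >= 1): keep all components.
    return len(values)
```

In a full pipeline under these assumptions, each sequence would be mapped to its k-mer vector, the vectors passed through KPCA with the component count chosen as above, and the reduced data clustered with k-means++ and scored with the Calinski-Harabasz index.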

kernel principal component analysis; cumulative variance contribution rate; k-means++ algorithm; gene clustering

Yang Xiaoping, Ni Ping, Zhuge Tianqiu, Luo Yuexin, Guo Chunyu, Pang Yuelan, Wu Yuting


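College of Information Science and Engineering, Guilin University of Technology, Guilin 541004, Guangxi, China

Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541004, Guangxi, China

Guangxi Guilin Tea Science Research Institute, Guilin 541004, Guangxi, China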


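Guangxi Zhuang Autonomous Region Science and Technology Program; Guangxi Zhuang Autonomous Region Science and Technology Major Project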

Guike AD18281068; Guike AA20302018-4

2024

Journal of Guangxi University (Natural Science Edition)
Guangxi University

CSTPCD; Peking University Core Journals
Impact factor: 0.767
ISSN: 1001-7445
Year, volume (issue): 2024, 49(2)