Comparison of Cell Type Identification Methods for Single-cell RNA-sequencing Data

Abstract: Single-cell RNA sequencing (scRNA-seq) provides gene expression profiles at single-cell resolution, which helps reveal cellular heterogeneity more accurately. Clustering is the main method for identifying cell types in biological tissues, and choosing a suitable clustering algorithm can improve the performance of scRNA-seq data analysis. This paper describes eight typical single-cell clustering algorithms, namely k-means, hierarchical clustering (HC), Leiden, SC3, SCENA, LAK, SIMLR, and dropClust, and compares them on 12 scRNA-seq datasets with ground-truth labels. Eight evaluation metrics are used to assess the eight algorithms: the silhouette coefficient, Calinski-Harabasz index, adjusted Rand index, adjusted mutual information, Fowlkes-Mallows index (FMI), V-measure, Jaccard coefficient, and coefficient of variation. The experimental results show that HC, SC3, k-means, and SCENA have the best generalizability and robustness; SIMLR performs best on large-scale datasets; Leiden performs best on small-scale datasets, but depends on the neighbor-node parameter and has low stability; and dropClust is the worst in generalizability and robustness. In addition, the performance of all eight clustering methods is related to data quality: when the coefficient of variation of the data is low, the evaluation scores of the clustering algorithms are generally higher, and vice versa.
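Several of the external metrics named in the abstract (adjusted Rand index, FMI, Jaccard coefficient) compare a clustering against ground-truth labels by counting pairs of cells. The sketch below shows the standard textbook definitions of these pair-counting indices in plain Python. It is an illustration only, not the authors' implementation: the function names are hypothetical, and the coefficient of variation is assumed here to be the population standard deviation divided by the mean, which the abstract does not specify.

```python
from collections import Counter
from math import comb, sqrt
from statistics import mean, pstdev


def pair_counts(true_labels, pred_labels):
    """Classify every pair of cells by whether each labeling groups it together.

    Returns (a, b, c, d):
      a = pairs grouped together in both labelings
      b = pairs grouped together only in true_labels
      c = pairs grouped together only in pred_labels
      d = pairs separated in both labelings
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two cells")
    joint = Counter(zip(true_labels, pred_labels))
    same_both = sum(comb(v, 2) for v in joint.values())
    same_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    same_pred = sum(comb(v, 2) for v in Counter(pred_labels).values())
    a = same_both
    b = same_true - same_both
    c = same_pred - same_both
    d = comb(n, 2) - a - b - c
    return a, b, c, d


def adjusted_rand_index(true_labels, pred_labels):
    """Rand-type agreement corrected for chance (1.0 = identical partitions)."""
    a, b, c, d = pair_counts(true_labels, pred_labels)
    total = a + b + c + d
    expected = (a + b) * (a + c) / total
    max_index = ((a + b) + (a + c)) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put all cells in one cluster.
        return 1.0
    return (a - expected) / (max_index - expected)


def fowlkes_mallows_index(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall."""
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    denom = sqrt((a + b) * (a + c))
    # Assumption: return 0.0 when the index is undefined (no co-clustered pairs).
    return a / denom if denom else 0.0


def jaccard_index(true_labels, pred_labels):
    """Co-clustered pairs shared by both labelings over pairs in either."""
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    denom = a + b + c
    # Assumption: return 0.0 when the index is undefined (no co-clustered pairs).
    return a / denom if denom else 0.0


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return pstdev(values) / mu


if __name__ == "__main__":
    # Toy labels: six cells, two true types, three predicted clusters.
    truth = [0, 0, 0, 1, 1, 1]
    predicted = [0, 0, 1, 1, 2, 2]
    print("ARI:", adjusted_rand_index(truth, predicted))
    print("FMI:", fowlkes_mallows_index(truth, predicted))
    print("Jaccard:", jaccard_index(truth, predicted))
```

All three indices depend only on which cells are grouped together, so renaming the cluster labels does not change the scores. In practice a library such as scikit-learn would typically be used for these and for the internal metrics (silhouette coefficient, Calinski-Harabasz index), which need the expression matrix rather than labels alone.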

Keywords: Single-cell RNA-sequencing; Clustering; Cell type identification; Data quality; Performance evaluation

Authors: 朱晓姝, 滕飞, 廖燕莹, 谢妙, 杨朝义

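Author affiliations:

School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004

School of Computer Science and Engineering, Yulin Normal University, Yulin 537000

School of Information Engineering, Guangxi City Vocational University, Nanning 532100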

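Funding: National Natural Science Foundation of China; Key Research and Development Program of Guangxi Zhuang Autonomous Region; Guangxi Industrial Technology Research Institute Industry-Research Program (2023)

Project numbers: 62141207; 桂科AB23026031; CYY-HT2023-JSJJ-0021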

Publication year: 2024

DOI: 10.13417/j.gab.043.000195

Journal: Genomics and Applied Biology (基因组学与应用生物学)

Guangxi University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 1.108

ISSN: 1674-568X

Year, volume (issue): 2024, 43(2)

Number of references: 44