Cell-of-origin subtype classification and prognosis of diffuse large B-cell lymphoma based on variable importance analysis
Objective Gene expression profiling (GEP) is the gold standard for cell-of-origin (COO) classification of diffuse large B-cell lymphoma (DLBCL). This study aimed to establish a parsimonious GEP-based model that accurately predicts the COO subtype of DLBCL and to provide a reference for its clinical application. Methods Gene expression and clinical data were collected from six DLBCL datasets in the GEO database; one dataset served as the training set and the remaining five as validation sets. A variable importance analysis strategy based on penalized regression was constructed to identify the optimal subset of variables, and logistic regression was then used to determine the final six-gene model for COO classification. Survival analysis was used to assess the relationship between the two predicted COO subtypes and clinical prognosis in the training and validation sets. Results The six-gene model performed well in the training set [AUC (95% CI): 0.999 (0.997~1.000); discrimination slope (95% CI): 0.944 (0.920~0.966)] and also performed well in the validation sets [AUC (95% CI) ranged from 0.910 (0.820~0.999) to 1.000; discrimination slope (95% CI) ranged from 0.506 (0.350~0.966) to 0.927 (0.841~0.987)]. The prognostic models showed that the subtype predicted by the six genes was a predictor of risk in both the training and validation sets (all P<0.05). Conclusions The six genes in the six-gene model have important clinical value for the subtype classification and prognosis of DLBCL. The ranking of genes by variable importance provides a reference for further research on gene function and targeted drugs.
Diffuse large B-cell lymphoma; Penalized regression; Prognosis; Subtype classification; Variable importance analysis
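The two performance measures reported in the Results, AUC and discrimination slope, can be illustrated with a minimal sketch. It uses the standard definitions: AUC is the probability that a randomly chosen case of one class receives a higher predicted probability than a randomly chosen case of the other class (ties count one half), and the discrimination slope is the difference between the mean predicted probabilities of the two classes. The function names, the 0/1 coding of the two COO subtypes, and the example values are assumptions made for illustration; they are not taken from the study, and this is not the authors' code.

```python
from statistics import mean


def split_by_class(y_true, p_hat):
    """Split predicted probabilities by observed class (1 vs 0)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    return pos, neg


def auc(y_true, p_hat):
    """Fraction of (class 1, class 0) pairs ranked correctly; ties count 0.5."""
    pos, neg = split_by_class(y_true, p_hat)
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def discrimination_slope(y_true, p_hat):
    """Mean predicted probability in class 1 minus that in class 0."""
    pos, neg = split_by_class(y_true, p_hat)
    return mean(pos) - mean(neg)


# Hypothetical data: 1 and 0 stand for the two COO subtypes (assumed coding),
# p holds made-up predicted probabilities of subtype 1.
y = [1, 1, 1, 0, 0, 0]
p = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

print(f"AUC = {auc(y, p):.3f}")
print(f"Discrimination slope = {discrimination_slope(y, p):.3f}")
```

A model can have an AUC close to 1 and still show a modest discrimination slope, because AUC depends only on how cases are ranked whereas the slope depends on how far apart the predicted probabilities of the two classes lie. This may help in reading validation results where the two measures differ noticeably.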