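Summary
To explore the value of the gut microbiota in predicting disease type, a non-invasive disease assessment model was built with machine learning on the basis of Ruminococcus abundance. Data were taken from the ExperimentHub repository in R: human fecal Ruminococcus abundance from different studies was downloaded together with study design, disease status, age, sex, antibiotic use, region, smoking status, and other sample information. Disease-screening models were built with random forest, decision tree, and AdaBoost classifiers, parameters were tuned with GridSearchCV (grid search), and the external validation results were evaluated with a confusion matrix. During data processing, 12 Ruminococcus species and 7 diseases were extracted and given standardized names, and 25 variables were converted to dummy variables. Three assessment models were built from the abundances of multiple Ruminococcus taxa and general sample information such as sex and age. The random forest model had the highest accuracy (0.884); with n_estimators set to 220 its score reached 0.892, making it the best model. External validation also showed that the classifier made relatively few prediction errors, indicating good model performance. Based on metagenomic data from fecal samples, the random forest algorithm can effectively predict disease type from Ruminococcus abundance.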
Abstract
This study used machine learning to build a non-invasive disease assessment model based on the abundance of Ruminococcus, in order to explore the value of the intestinal flora in predicting disease type. Data from different studies were downloaded from the ExperimentHub repository in R. The abundance of Ruminococcus in human fecal samples was selected together with study condition, disease state, age, sex, antibiotic use, region, smoking status, and other sample information, and disease-screening models were established with machine learning classifiers: random forest, decision tree, and AdaBoost. Parameters were tuned with GridSearchCV, and the external validation results were evaluated with a confusion matrix. Three assessment models were established from the abundance of Ruminococcus and general sample information such as sex and age. The random forest model had the highest accuracy (0.884). With n_estimators set to 220, its score reached 0.892, making it the best model. External validation also showed that the classifier made relatively few prediction errors and that the model performed well. Based on metagenomic data from fecal samples, the random forest algorithm can effectively predict disease type from the abundance of Ruminococcus.
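The workflow summarized in the abstract (dummy coding, three classifiers, a grid search over n_estimators, and a confusion matrix) can be sketched roughly as follows in Python with scikit-learn. This is a minimal illustration, not the authors' code. The data are synthetic, the column names and class labels are hypothetical, and a simple hold-out split stands in for the external validation set, so the numbers it prints will not match the reported 0.884 or 0.892.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (NOT the study's data): three hypothetical
# Ruminococcus abundance columns plus age, sex, and smoking status.
df = pd.DataFrame({
    "ruminococcus_a": rng.random(n),
    "ruminococcus_b": rng.random(n),
    "ruminococcus_c": rng.random(n),
    "age": rng.integers(20, 80, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "smoker": rng.choice(["yes", "no", "unknown"], size=n),
})

# Hypothetical three-class label loosely tied to two abundance columns.
score = df["ruminococcus_a"] - df["ruminococcus_b"] + rng.normal(0, 0.1, n)
y = pd.cut(
    score,
    bins=[-np.inf, -0.2, 0.2, np.inf],
    labels=["disease_1", "control", "disease_2"],
).astype(str)

# Dummy (one-hot) coding of the categorical variables.
X = pd.get_dummies(df, columns=["sex", "smoker"], dtype=float)

# Hold-out split used here as a stand-in for external validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the three model families named in the abstract with default settings.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Grid search over n_estimators; the candidate values are illustrative,
# with 220 included only because the abstract reports that setting.
param_grid = {"n_estimators": [50, 100, 220]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Confusion matrix of the tuned model on the held-out samples
# (rows: true class, columns: predicted class).
labels = sorted(y.unique())
cm = confusion_matrix(y_test, search.predict(X_test), labels=labels)

print(accuracies)
print(search.best_params_, round(search.best_score_, 3))
print(cm)
```

Off-diagonal entries of the printed matrix count misclassified samples, which is the sense in which the abstract describes prediction errors as relatively few. Which n_estimators value wins on this toy data is arbitrary and says nothing about the study's result.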
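Funding
National Natural Science Foundation of China (82073407)
Priority Academic Program Development of Jiangsu Higher Education Institutions (Su Zheng Ban Fa [2018] No. 87)
Jiangsu Provincial Key Discipline Construction Project (13th Five-Year Plan) (Su Jiao Yan [2016] No. 9)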