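Impacts of reference database selection, indicator threshold selection, and target data preparation on the analysis and annotation of eDNA monitoring sequencing data: fish in the middle Yangtze River as the monitoring target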

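Abstract (translated from Chinese): In eDNA monitoring based on meta-barcoding, the analysis and annotation of eDNA sequencing data underpin the accuracy with which monitoring results are interpreted and assessed. Reference database selection, indicator threshold selection, and target data preparation are the three most critical technical steps in that analysis and annotation. To clarify how the handling of these three steps affects the results, this study took two sets of COI gene sequencing data from eDNA monitoring in the middle Yangtze River as the analysis objects and ran three groups of experiments on fish detection to test: 1) the effect of different reference databases and species annotation algorithms on annotation results; 2) the effect of different OTU clustering sequence similarity thresholds and species annotation classification confidence thresholds (sequence consistency and sequence coverage) on annotation results; and 3) the effect of different per-species sequence richness in the target data on annotation results. The results showed: 1) Under the Blast algorithm, the species annotated against three versions of the nt database were largely consistent (72%-78%), the species annotated against two local sequence reference databases were also largely consistent (91%-96%), and 52%-68% of the species annotated against these five reference databases were consistent. With the nt database, the species annotated by the RDP Classifier algorithm covered more than 95% of the species annotated by Blast and numbered 151%-443% more than those annotated by Blast, and most of the additional species were misannotations. With the local reference databases, the species annotated by RDP Classifier covered 66%-85% of the species annotated by Blast, and several results were annotated only to family or genus level. 2) An OTU clustering sequence similarity threshold of 0.999 yielded 154%-209% more OTUs than a threshold of 0.99, and 240%-490% more OTUs annotated as fish. For the annotation classification confidence threshold (Blast algorithm, sequence consistency and sequence coverage), values from 0.8 to 0.99 gave largely consistent species composition (over 94%) and OTU composition (over 83%), whereas a value of 0.7 gave species and OTU compositions that differed considerably from those obtained at 0.8 and above. 3) With an OTU clustering sequence similarity threshold of 0.999 and an annotation classification confidence threshold of 0.9, multi-sequence data yielded the most fish species and OTUs and the highest species annotation accuracy (81.49%): 7% more species, 215% more OTUs, and 5% higher accuracy than single-sequence data. When analyzing and annotating a given eDNA sequencing dataset, annotation accuracy can be improved by building and refining local reference databases, optimizing the OTU clustering sequence similarity and species annotation classification confidence (sequence consistency and sequence coverage) thresholds, and increasing the richness of the target data. Owing to the limitations of species annotation algorithms, however, annotation errors and omissions are likely to persist for a long time, and species annotation accuracy is usually below 85% (for COI-based eDNA monitoring).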

English title: The impacts of reference database selection, indicator threshold determination and target data preparation on the sequence data analysis of eDNA monitoring: taking fish as the target in the middle Yangtze River

English abstract: In meta-barcoding-based eDNA monitoring, the analysis and annotation of eDNA sequence data serve as the foundation for obtaining accurate and reliable monitoring results. The selection of reference databases, the determination of analysis and annotation indicator thresholds, and the preparation of target data are the most critical technical steps in eDNA sequence data analysis and annotation. To clarify the impacts of these three technical aspects and provide scientific support for the standardization of eDNA monitoring technology, the current study used two sets of COI gene sequence data from eDNA monitoring in the middle reach of the Yangtze River as the analysis objects and designed three sets of experiments to test 1) the impacts of different reference databases and species annotation algorithms on the annotation results, 2) the impacts of different OTU clustering sequence similarity and species annotation classification confidence (sequence consistency and sequence coverage) thresholds on the annotation results, and 3) the impacts of different target sequence data richness of each species on the annotation results. The results showed that: 1) under the Blast algorithm, the species annotated against three versions of the nt library from NCBI were generally consistent (72%-78%); those annotated against two local sequence reference libraries were also generally consistent (91%-96%); and 52%-68% of the species annotated against these five sequence reference libraries were consistent. The species annotated by the RDP Classifier algorithm against the nt libraries covered over 95% of the species annotated by the Blast algorithm and numbered 151%-443% more, but most of the additional species were misannotated. The species annotated by the RDP Classifier algorithm against the local sequence reference libraries covered 66%-85% of the species annotated by the Blast algorithm, and several results were annotated only to family or genus level. 2) When the OTU clustering sequence similarity threshold was set to 0.999, it yielded 154%-209% more OTUs than when set to 0.99, and 240%-490% more OTUs annotated as fish. The classification confidence threshold (Blast algorithm) had little effect on species composition when changed from 0.8 to 0.99, with over 94% consistency, but there was a significant difference when it was set to 0.7. 3) When the OTU clustering sequence similarity threshold was 0.999 and the classification confidence threshold was 0.9, multiple-sequence data annotation yielded the largest number of fish species and OTUs. It also had the highest species annotation accuracy (81.49%), with 7% more fish species, 215% more OTUs, and 5% higher accuracy than single-sequence data annotation. In eDNA sequence data analysis and annotation, accuracy can be improved by establishing and improving local reference databases, optimizing the OTU clustering sequence similarity and species annotation classification confidence thresholds (sequence consistency and sequence coverage), and increasing target sequence data richness. However, owing to the limitations of species annotation algorithms, problems such as species annotation errors and omissions may persist in eDNA sequence data analysis and annotation. Consequently, the species annotation accuracy of eDNA monitoring (based on the COI gene) would typically remain below 85%.
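
The threshold comparison in experiment 2 can be sketched as follows. This is only an illustration, not the authors' pipeline: the `Hit` record layout, the choice to apply one threshold to both sequence consistency (identity) and sequence coverage, the use of the Jaccard index as the "consistency" measure, and the toy data are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Best Blast hit for one OTU (hypothetical record layout)."""
    otu_id: str
    species: str
    identity: float  # fraction of identical aligned positions, 0-1
    coverage: float  # fraction of the query covered by the alignment, 0-1


def annotate(hits, threshold):
    """Keep hits whose identity AND coverage both reach the threshold.

    Assumption: the same cut-off is applied to both indicators.
    """
    return {
        h.otu_id: h.species
        for h in hits
        if h.identity >= threshold and h.coverage >= threshold
    }


def species_consistency(a, b):
    """Jaccard index of the species sets of two annotation results.

    Assumption: the abstract does not define its consistency measure;
    Jaccard is used here as one plausible choice.
    """
    sa, sb = set(a.values()), set(b.values())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


# Toy data, invented for illustration only.
hits = [
    Hit("OTU1", "Cyprinus carpio", 0.995, 0.99),
    Hit("OTU2", "Hypophthalmichthys molitrix", 0.93, 0.97),
    Hit("OTU3", "Coilia nasus", 0.85, 0.82),
    Hit("OTU4", "Silurus asotus", 0.74, 0.90),
]

# One annotation result per confidence threshold tested in the abstract.
results = {t: annotate(hits, t) for t in (0.7, 0.8, 0.9, 0.99)}

if __name__ == "__main__":
    for t, annotated in results.items():
        print(t, len(annotated), "OTUs annotated")
    print(species_consistency(results[0.7], results[0.8]))
```

Comparing `results` pairwise with `species_consistency` mirrors the kind of statement made above, for example how closely the species list at 0.8 matches the list at 0.9. The numbers from this toy data set carry no biological meaning.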

English keywords:

Environmental DNA; fish; meta-barcoding; reference database; OTU clustering sequence similarity; species annotation classification confidence; middle Yangtze River

Authors:

许兰馨、杨海乐、刘志刚、杜浩

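Author affiliations:

Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences; Key Laboratory of Freshwater Biodiversity Conservation, Ministry of Agriculture and Rural Affairs, Wuhan 430223

Wuxi Fisheries College, Nanjing Agricultural University, Wuxi 214000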

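Keywords (translated from Chinese):

environmental DNA; fish; meta-barcoding; reference database; OTU clustering sequence similarity; species annotation classification confidence; middle Yangtze River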

Funding:

Central Public-interest Scientific Institution Basal Research Fund; Special Agricultural Finance Project

Project numbers:

YFI202201; CJJC-2023-01

Publication year:

2024

DOI:

10.18307/2024.0631

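Journal of Lake Sciences (湖泊科学)

Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences; Chinese Society for Oceanology and Limnology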

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 1.439

ISSN: 1003-5427

Year, volume (issue): 2024, 36(6)