Advances in High-throughput Protein Structural Bioinformatics

Yunchi Zhu¹, Zuhong Lu¹

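Author information

1. School of Biological Science and Medical Engineering, State Key Laboratory of Digital Medical Engineering, Southeast University, Nanjing 211189, China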

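Abstract (translated from the Chinese)

This article summarizes recent advances in high-throughput protein structural bioinformatics in three main areas: structure data management, tool development, and structure data mining. In structure data management, the development of AlphaFold-like systems has led to explosive growth in the volume of protein structure data, which has directly driven upgrades in compression technology and drawn researchers' attention to structure data management. In tool development, new algorithms represented by Foldseek have achieved high-speed structure alignment and broken the throughput bottleneck of structure analysis; in addition, the widespread application of deep learning models has improved structure-based protein function annotation in several respects. In structure data mining, researchers approach structural big data with an omics mindset, distilling analyzable elements and optimizing methods through continued exploration, and advancing structure data mining with the help of new tools. As high-throughput methods develop, structural bioinformatics is expected to play a more important role in the life sciences.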

Abstract

This review provides a comprehensive summary of the latest advances in high-throughput protein structural bioinformatics, a field that has been transformed by the advent of deep learning-based protein structure prediction systems such as AlphaFold2. These systems have significantly increased the accuracy, speed, and scale of protein structure prediction, resulting in exponential growth in the number of protein structures available for analysis. Notably, the AlphaFold Protein Structure Database (AFDB) has amassed over 214 million protein structures, surpassing the PDB's 50-year cumulative data by over 1,000-fold within several months. Big data is driving a comprehensive upgrade of protein structural bioinformatics. This review focuses on three main areas: structure data management, tool development, and structure data mining. In structure data management, the review highlights optimization strategies for AlphaFold-like systems, which significantly reduce the resource requirements of protein folding, enabling more researchers to make custom structure predictions and further enlarging the data scale. The resulting "data explosion" has increased the pressure on storage and bandwidth, prompting the development of tools such as Foldcomp, PDC, and ProteStAr for compressing PDB files. The review also underscores the critical role of public repositories such as ModelArchive and PDB-Dev in archiving and sharing third-party AlphaFold models, and the use of independent services such as MineProt and 3D-Beacons to create more interactive and accessible data portals. In tool development, the review highlights recent breakthroughs in structure alignment algorithms, represented by Foldseek, which enable ultra-fast searching of large protein structure databases. It also covers tools for structure-based functional annotation of proteins, including AlphaFill for ligand annotation, DeepFRI for Gene Ontology (GO) annotation, and TT3D for protein-protein interaction (PPI) prediction, among others. It is proposed that 3Di sequences, introduced together with Foldseek, can enhance many sequence-based deep learning models developed in the pre-AlphaFold era, enabling them to be applied to structure-based function prediction. The challenges facing traditional molecular docking methods in the high-throughput era are also noted, to draw researchers' attention to them. Finally, the review explores the growing field of structure data mining. Predicting structures for whole proteomes has become feasible in recent years, and scientists are processing large structure datasets from an omics viewpoint, continuously identifying analyzable elements, optimizing methodologies, and using newly developed tools to push the boundaries. Notable examples include the identification of new protein families, the development of protein structure clustering, and the integration of AlphaFold with conventional experimental techniques to solve large structures. These advances are paving the way for a deeper understanding of protein structure and function and have the potential to unlock new discoveries in the life sciences. However, the review also acknowledges the challenges and limitations that persist in the field, including the lack of diversity in high-throughput software for protein structural bioinformatics and the existing bottleneck in rapidly predicting protein complex structures. Overall, structural bioinformatics is expected to play an even more crucial role in the life sciences as high-throughput methodology develops.

Key words

protein structural bioinformatics/high-throughput/AlphaFold-like system/structural proteomics

Funding

National Key Research and Development Program of China (2016YFA0501600)

Publication year

2024

Progress in Biochemistry and Biophysics

Institute of Biophysics, Chinese Academy of Sciences; Biophysical Society of China

Indexed in: CSTPCD, Peking University Core Journals

Impact factor: 0.476

ISSN: 1000-3282