Advances in High-throughput Protein Structural Bioinformatics
This review summarizes recent advances in high-throughput protein structural bioinformatics, a field that has been transformed by deep learning-based protein structure prediction systems such as AlphaFold2. These systems have significantly increased the accuracy, speed, and scale of protein structure prediction, resulting in exponential growth in the number of protein structures available for analysis. Notably, the AlphaFold Protein Structure Database (AFDB) has amassed over 214 million protein structures, surpassing the 50-year cumulative data of the Protein Data Bank (PDB) by over 1,000-fold within several months. Big data is driving a comprehensive upgrade of protein structural bioinformatics. This review focuses on three main areas: structure data management, tool development, and structure data mining. In structure data management, the review highlights optimization strategies for AlphaFold-like systems, which significantly reduce the resource requirements of protein folding, enabling more researchers to make custom structure predictions and further enlarging the data scale. The resulting "data explosion" has put increasing pressure on storage and bandwidth, prompting the development of tools such as Foldcomp, PDC, and ProteStAr for compressing PDB files. The review also underscores the critical role of public repositories such as ModelArchive and PDB-Dev in archiving and sharing third-party AlphaFold models, and it highlights the use of independent services such as MineProt and 3D-Beacons to create more interactive and accessible data portals. In tool development, the review highlights recent breakthroughs in structure alignment algorithms, represented by Foldseek, which enable ultra-fast searching of large protein structure databases. It also covers tools for the functional annotation of proteins based on their structures, including AlphaFill for ligand annotation, DeepFRI for Gene Ontology (GO) annotation, and TT3D for protein-protein interaction (PPI) prediction, among others. It is proposed that 3Di sequences, introduced together with Foldseek, can enhance many sequence-based deep learning models developed in the pre-AlphaFold era, enabling them to be applied to structure-based function prediction. Lastly, the challenges facing traditional molecular docking methods in the high-throughput era are noted to draw researchers' attention to them. Finally, the review explores the burgeoning field of structure data mining. Predicting structures for whole proteomes has become feasible in recent years, and scientists are processing large structure datasets from an omics viewpoint, continuously identifying analyzable elements, optimizing methodologies, and using newly developed tools to push the boundaries. Notable examples include the identification of new protein families, the development of protein structure clustering, and the integration of AlphaFold with conventional experimental techniques to solve large structures. These advances are paving the way for a deeper understanding of protein structure and function and have the potential to unlock new discoveries in the life sciences. However, the review also acknowledges the challenges and limitations that persist in the field, including the lack of diversity in high-throughput software for protein structural bioinformatics and the existing bottleneck in rapidly predicting protein complex structures. Overall, structural bioinformatics is expected to play an even more crucial role in the life sciences as high-throughput methodology develops.
protein structural bioinformatics; high-throughput; AlphaFold-like system; structural proteomics