A Review of the Application of Large Language Models in Protein Design

Zhang Jinxiong (张锦雄)¹, Meng Xueli (孟雪莉)¹, Chen Yan (陈燕)², Wei Songjian (韦松键)¹, Lü Lilan (吕丽兰)³, Hu Xiaochun (胡小春)⁴

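Author information

1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004
2. School of Computer, Electronics and Information, Guangxi University, Nanning 530004; School of Business Administration, Guangxi University, Nanning 530004
3. Guangxi Subtropical Crops Research Institute, Nanning 530001
4. Guangxi Key Laboratory of Big Data in Finance and Economics, Nanning 530003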

Abstract

Artificial intelligence has given rise to a number of large models for protein design. Computational protein design is the process of using computers to help determine an amino acid sequence that realizes a preset structure and function; it can proceed either by redesigning existing proteins or by de novo design. Rapidly generating proteins with specific functions is of great significance to biomedical research, drug development, and bioengineering. This article first surveys computational protein design across traditional computational methods, machine learning methods, and deep learning methods. It then introduces the Transformer, the core architecture of large language models, and reviews research on and applications of protein large language models by category. Finally, it offers an outlook on future research priorities.

Keywords

Protein sequence / Protein structure / Large language model / Transformer architecture

Funding

National Natural Science Foundation of China (62362004)

Guangxi Key Research and Development Program (桂科AB24010031)

Publication year

2024

Genomics and Applied Biology (基因组学与应用生物学)

Guangxi University

Indexed in: CSTPCD, CSCD, Peking University Core Journals

Impact factor: 1.108

ISSN: 1674-568X

Number of references: 85
