Application of Transformer-based pretrained language models in the biomedical domain
[Background] The rapid development of artificial intelligence (AI) has had a profound impact across scientific disciplines, with natural language processing (NLP) emerging as a cornerstone technology in biomedical research. NLP's ability to analyze vast amounts of biomedical sequence data, including not only medical texts but also complex sequences such as proteins and DNA, has become indispensable for tasks such as clinical decision support and genomic interpretation. The introduction of Transformer-based pretrained language models (T-PLMs) represents a major breakthrough in this field, fundamentally transforming how biomedical sequences are processed and understood. These models, particularly BERT and its variants, have substantially outperformed traditional rule-based and feature-engineering approaches. By leveraging deep learning, T-PLMs capture intricate patterns and relationships within biomedical data that were previously difficult to discern, enabling the extraction of complex, meaningful medical knowledge from large-scale sequence data, fueling innovation in personalized medicine, and improving healthcare outcomes. The shift from conventional methods to T-PLMs marks a pivotal milestone, equipping researchers with powerful tools to uncover new insights, advance biomedical research, and ultimately transform patient care on a global scale.

[Progress] The training paradigm for T-PLMs in biomedicine typically involves two major phases: pre-training and fine-tuning. In the pre-training phase, models are first trained on large amounts of general-language text and then further specialized through domain-specific pre-training on biomedical corpora. This process gives T-PLMs a strong foundational understanding of language, which is then fine-tuned for specific biomedical tasks. Fine-tuning is tailored to particular applications, such as clinical decision support, where models are adjusted to improve their accuracy and relevance in medical contexts. Advanced techniques such as continual pre-training and prompt engineering have further enhanced the adaptability and performance of T-PLMs on specialized tasks. T-PLMs have found diverse applications in biomedicine, ranging from text representation and knowledge extraction to complex tasks such as protein structure prediction and molecular representation. They have markedly improved the accuracy of named entity recognition and relation extraction, and have even accelerated drug discovery. Moreover, their ability to integrate and process multimodal data, for example by combining text with medical images, has opened new frontiers in medical diagnostics and research. These advances have not only improved data analysis but also paved the way for personalized medicine and more informed clinical practice, demonstrating the critical role of T-PLMs in modern biomedical research.
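To make the pre-train/fine-tune paradigm concrete, the minimal sketch below adapts a biomedical BERT checkpoint to a named entity recognition task with the Hugging Face transformers library. The checkpoint name, the disease tag set, and the train_ds/eval_ds datasets are illustrative assumptions rather than choices prescribed by this survey.

```python
# Minimal sketch: fine-tuning a biomedical BERT for token-level NER.
# Assumptions: the "dmis-lab/biobert-base-cased-v1.1" checkpoint and a
# dataset with "tokens"/"ner_tags" columns are stand-ins, not fixed choices.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

model_name = "dmis-lab/biobert-base-cased-v1.1"  # any biomedical T-PLM works
labels = ["O", "B-Disease", "I-Disease"]         # illustrative tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels))

def tokenize(batch):
    # Align word-level NER tags with sub-word tokens: label only the
    # first sub-token of each word and mask the rest with -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, aligned = None, []
        for wid in enc.word_ids(batch_index=i):
            aligned.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(aligned)
    return enc

# train_ds / eval_ds: any tokenized biomedical NER dataset (hypothetical),
# paired with DataCollatorForTokenClassification for padding.
# trainer = Trainer(model=model,
#                   args=TrainingArguments(output_dir="biobert-ner",
#                                          num_train_epochs=3),
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```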
[Perspective] While T-PLMs have achieved considerable success in biomedicine, several challenges must be addressed to fully realize their potential. One of the most pressing is interpretability, particularly in clinical settings where understanding the rationale behind AI-generated decisions is crucial for ensuring patient safety and fostering trust in AI-driven healthcare; enhancing the transparency and explainability of T-PLMs is therefore essential, with explainable-AI techniques and advanced model visualization playing a critical role. Another key challenge is the integration and fusion of multimodal data, which involves combining diverse data types such as text, medical images, and genetic sequences. This is particularly complex because these sources are heterogeneous, making alignment and fusion difficult; research is consequently turning to multimodal learning frameworks that can integrate different types of biomedical data, potentially enabling breakthroughs in diagnostics and personalized medicine. Data privacy concerns also remain paramount, especially when sensitive patient information is used to train and deploy T-PLMs; robust privacy-preserving techniques, such as federated learning and blockchain-based approaches, are essential to maintaining data security while still leveraging AI's potential in healthcare. Finally, scalability poses a significant challenge: training these models on large, complex biomedical datasets demands substantial computational resources, so future research must develop more efficient training paradigms and optimize resource utilization to make T-PLMs accessible for a broader range of research and clinical applications. Addressing these challenges will be key to unlocking the full potential of T-PLMs and to driving continued innovation in the biomedical field.
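As one hedged illustration of the multimodal fusion challenge above, the sketch below implements simple late fusion in PyTorch: precomputed text and image embeddings are concatenated and passed to a joint classifier. The embedding dimensions and the two-class output are assumptions for illustration; a real system would obtain the embeddings from a T-PLM and a vision encoder.

```python
# Minimal sketch of late fusion for text + image inputs (PyTorch).
# Encoder choices and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate unimodal embeddings, then classify jointly."""
    def __init__(self, text_dim=768, image_dim=512, num_classes=2):
        super().__init__()
        # In practice text_emb would be a T-PLM [CLS] vector and
        # image_emb a vision-encoder feature; both are given here.
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_emb, image_emb):
        return self.fuse(torch.cat([text_emb, image_emb], dim=-1))

# Usage with dummy embeddings standing in for encoder outputs:
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))  # shape (4, 2)
```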
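And as a sketch of the privacy-preserving direction mentioned above, the following implements federated averaging (FedAvg) in its simplest form: each site trains a local copy of the model on its own patient data, and only the parameters are aggregated. The round structure and the train_locally helper are hypothetical.

```python
# Minimal sketch of federated averaging (FedAvg): hospitals train
# locally and share only model weights, never raw patient records.
import copy
import torch

def federated_average(client_models):
    """Average the parameters of locally trained client models."""
    global_state = copy.deepcopy(client_models[0].state_dict())
    for key in global_state:
        # .float() so integer buffers can be averaged; cast back on load.
        global_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in client_models]).mean(dim=0)
    return global_state

# Each round: broadcast global weights, let each site fine-tune its
# copy on private local data, then aggregate without moving the data.
# for rnd in range(num_rounds):
#     for site_model in client_models:
#         site_model.load_state_dict(global_state)
#         train_locally(site_model)          # hypothetical local step
#     global_state = federated_average(client_models)
```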
[Keywords] natural language processing; biomedical application; pretrained language model; multimodal learning; medical text mining