医学新知 2024, Vol. 34, Issue (11): 1276-1283. DOI: 10.12173/j.issn.1004-5511.202409004

Research on real-world knowledge mining and knowledge graph completion (IV): construction of a real-world data annotation platform and exploration of automatic extraction methods based on pre-trained language models

阎思宇1 谭杰骏2 朱海锋2 黄桥1 王诗淳3 马文昊3 石涵予4 王永博1 任相颖1 胡文斌2 靳英辉1

Author information

  • 1. Center for Evidence-Based and Translational Medicine, Zhongnan Hospital of Wuhan University (Wuhan 430071)
  • 2. School of Computer Science, Wuhan University (Wuhan 430072)
  • 3. Center for Evidence-Based and Translational Medicine, Zhongnan Hospital of Wuhan University (Wuhan 430071); Second Clinical College, Wuhan University (Wuhan 430071)
  • 4. Hongyi Honor College, Wuhan University (Wuhan 430072)

Abstract

Objective To explore the construction of a real-world data annotation platform, and to compare the real-world data extraction performance of retrieval-augmented generation (RAG) combined with large language models against the pre-train/fine-tune approach for pre-trained language models. Methods Taking bladder cancer pathology records from real-world electronic medical record data as an example, a real-world data annotation platform was built. Based on the data annotated on the platform, RAG combined with GPT-3.5 was compared with pre-train/fine-tune methods based on the BERT and RoBERTa models for automatically extracting bladder cancer typing and staging. Results The pre-trained models fine-tuned on the full training set outperformed both the RAG-plus-large-model method and the pre-trained models fine-tuned on few-shot samples, and the RoBERTa model generally outperformed the BERT model; however, the extraction performance of all of these methods still needs improvement. On the test set, the F1 scores of the RoBERTa model fine-tuned on the entire training set for extracting bladder cancer typing, T staging, and N staging were 71.06%, 50.18%, and 73.65%, respectively. Conclusion Pre-trained language models have application potential for processing unstructured clinical data, but there is still room for improvement in the information extraction performance of existing methods. Future work should further optimize models or training strategies to accelerate data empowerment.
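The per-field F1 scores reported above can be computed from exact-match comparisons between gold annotations and model outputs. The following is a minimal, self-contained Python sketch of such an evaluation, not the authors' code: the field values and the exact-match criterion are illustrative assumptions, and a real evaluation may normalize labels before matching.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for one extracted field (e.g. T staging).

    Each record contributes one gold value and at most one predicted value
    (None means the model extracted nothing). A prediction is a true
    positive when it exactly matches the gold value for that record.
    """
    assert len(gold) == len(pred)
    tp = sum(1 for g, p in zip(gold, pred) if p is not None and p == g)
    n_pred = sum(1 for p in pred if p is not None)  # predictions made
    n_gold = sum(1 for g in gold if g is not None)  # gold values present
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical T-stage labels for five pathology records
gold = ["T2", "T1", "Ta", "T2", "T3"]
pred = ["T2", "T1", "T1", None, "T3"]
print(round(micro_f1(gold, pred), 4))  # → 0.6667
```

With three exact matches out of four predictions and five gold values, precision is 0.75 and recall is 0.60, giving F1 ≈ 0.667; averaging such scores per field (typing, T staging, N staging) yields field-level results like those in the abstract.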

Key words

Real-world data; Electronic medical records; Annotation platform; Pre-trained language model; Retrieval-augmented generation; Large language model; Pathology records; Bladder cancer


Publication year: 2024
Journal: 医学新知
Publisher: Zhongnan Hospital of Wuhan University; Medical and Health Work Committee of the Hubei Provincial Committee of the Chinese Peasants' and Workers' Democratic Party
Indexed in: CSTPCD
Impact factor: 0.243
ISSN: 1004-5511