Unstructured medical text knowledge extraction based on deep learning
Geng Biao (耿飙) 1, Liang Chengquan (梁成全) 2, Wei Wei (魏炜) 3, Zhu Changyuan (朱长元) 4
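Author information
- 1. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116; School of Health Management, Suzhou Vocational Health College, Suzhou, Jiangsu 215009
- 2. Information Department, Huadong Sanatorium, Wuxi, Jiangsu 214065
- 3. School of Health Management, Suzhou Vocational Health College, Suzhou, Jiangsu 215009; School of Computer Science, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018
- 4. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116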
Abstract
To address polysemy and overlapping relations, a lightweight end-to-end neural model is proposed for text data in the diabetes domain, built on a new tagging scheme based on sequence labeling. The model adopts a head-entity-first strategy. BERT produces the input character vectors, and a BiLSTM captures sequential features and contextual dependencies. A multi-head attention mechanism is introduced, and a CRF layer uses the dependencies between adjacent tags to produce the optimal predicted tag sequence. The aim is to convert unstructured medical text into structured data. Comprehensive experiments on the Alibaba Cloud Tianchi Chinese diabetes annotation dataset indicate that the proposed model achieves superior performance in medical text knowledge extraction.
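The CRF step mentioned in the abstract, choosing the best tag sequence from per-token scores plus scores for adjacent tag pairs, is commonly decoded with the Viterbi algorithm. The following is a minimal sketch of that general technique, not the authors' implementation. The BIO tag set, the function name `viterbi_decode`, and all scores are illustrative assumptions; in a full model the per-token scores would come from the BERT, BiLSTM, and attention layers.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] is the score of `tag` at position t (hypothetical values).
    transitions[(prev, cur)] is the score of moving from `prev` to `cur`;
    missing pairs default to 0.0.
    """
    if not emissions:
        return []

    # best[t][tag]: score of the best path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    # back[t - 1][tag]: previous tag on that best path.
    back: List[Dict[str, str]] = []

    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative input: three tokens, assumed BIO tags, made-up scores.
tags = ["O", "B", "I"]
# Penalize an "I" tag that directly follows "O" (an entity must start with "B").
transitions = {("O", "I"): -10.0}
emissions = [
    {"O": 1.0, "B": 0.8, "I": 0.0},
    {"O": 0.2, "B": 0.1, "I": 1.5},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
print(viterbi_decode(emissions, transitions, tags))
```

Picking the top-scoring tag at each position independently would give `O, I, O` here, which is not a valid BIO sequence. With the transition penalty, decoding returns `B, I, O`, which illustrates why modeling adjacent-tag dependencies matters for sequence labeling.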
Key words
deep learning/unstructured text/medical text/knowledge extraction/entity recognition/relation extraction/sequence labeling
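Funding
China Postdoctoral Science Foundation project (2021T140707)
State Key Laboratory of NBC Protection for Civilian fund project (SKLNBC2020-23)
Suzhou Vocational Health College university-level "Lingyan" cultivation key fund project (szwzy202004)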
Publication year
2024