Research on Chinese Medical Named Entity Recognition Integrating Word-level Segment Information
Wang Haipeng1, Du Fang1, Song Lijuan1, Li Ting2
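Author information
1. School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021
2. School of Mathematics and Statistics, Ningxia University, Yinchuan, Ningxia 750021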
Abstract
Chinese medical named entity recognition (NER) is a fundamental task in the medical domain and plays an important role in many downstream tasks such as knowledge graphs. Common NER methods fall into two categories: those based on word-level information and those based on segment-level information. Prior studies have shown that fusing the two types of information yields better performance. However, the fusion of word-level and segment-level information has not been thoroughly studied for the Chinese medical NER task, and existing fusion methods assign the same weight to every word in a segment, without considering each word's different contribution. In medical entities, each word correlates differently with the entity (segment), and ignoring these differences degrades medical NER performance. Based on this, by analyzing the characteristics of Chinese medical entities, we propose a word-level segment information extraction method (Word-Level Segment Information Extraction, WL-SIE). The method assigns a set of weight matrices to each word in an entity to learn the association between that word and the entity, and outputs distinct word-level segment information after interacting with the entity phrase. Experimental results on the CCKS2017 and CMeEE Chinese clinical NER datasets show that WL-SIE improves the F1 score by 3% to 5% over the comparison methods, and that it performs particularly well in scenarios with imbalanced entity samples and on long-entity recognition.
Key words
named entity recognition/deep neural network/word-level information/segment-level information/Chinese medical information processing