Chinese Medical Entity Recognition Based on Attention Enhancement and Feature Fusion

Abstract: Chinese medical named entity recognition models based on character representations suffer from a single embedding form, difficult boundary recognition, and insufficient use of semantic information. An effective remedy is to inject lexical features into the lower layers of BERT, which exploits word-level semantic information while reducing the impact of word segmentation errors. However, injecting lexical information also introduces weakly related words and noise, which distracts the attention of the attention-based BERT model. In addition, character and word granularity alone cannot fully capture the deep semantic information of Chinese characters. This study therefore proposes a Chinese medical entity recognition model based on attention enhancement and feature fusion. The character-word attention score matrix is sparsified so that the model's attention concentrates on highly relevant words, which reduces interference from noisy words in the context. In parallel, convolutional neural networks (CNNs) extract features from the pronunciation and strokes of Chinese characters; these features are fused by an iterative attention feature fusion module, concatenated with the output features of the BERT model, and fed into a BiLSTM model to further mine the deep semantic information carried by the characters. A large medical corpus is collected with web crawlers and other methods and used to train a medical-domain word vector library. On the CCKS2017 and CCKS2019 datasets, the model reaches F1 scores of 94.90% and 89.37%, respectively, outperforming current mainstream entity recognition models.
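The abstract does not say which rule is used to sparsify the character-word attention score matrix. As a rough illustration only, the sketch below assumes a top-k rule: for each character, keep the k highest-scoring candidate words and give the rest zero weight. The function name `topk_sparse_attention`, the NumPy implementation, and the toy scores are all hypothetical and are not taken from the paper.

```python
import numpy as np

def topk_sparse_attention(scores, values, k):
    """Top-k sparse attention (an assumed sparsification rule, not the paper's).

    scores: (n_chars, n_words) raw character-to-word attention scores.
    values: (n_words, d) word feature vectors.
    Returns the attended features (n_chars, d) and the sparse weights.
    """
    n_words = scores.shape[-1]
    k = min(k, n_words)
    # After partitioning, column n_words - k holds the k-th largest score per row.
    kth = np.partition(scores, n_words - k, axis=-1)[:, n_words - k][:, None]
    # Scores below the threshold are masked out; ties at the threshold are kept.
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax; masked entries become exactly zero.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Toy example: 2 characters, 4 candidate words, identity word features.
scores = np.array([[2.0, 0.1, 1.0, -1.0],
                   [0.0, 3.0, 0.5, 2.5]])
values = np.eye(4)
out, w = topk_sparse_attention(scores, values, k=2)
```

With k=2, each character attends to only its two highest-scoring words, so low-relevance candidates contribute nothing to the fused representation. Other sparsification schemes (for example thresholding) would fit the abstract's description equally well.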

Keywords:

entity recognition; Chinese word segmentation; sparse attention; feature fusion; medical word vector library

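Authors:

Wang Jintao (王晋涛), Qin Ang (秦昂), Zhang Yuan (张元), Chen Yifei (陈一飞), Wang Tingfeng (王廷凤), Xie Chenglin (谢承霖), Zou Gang (邹刚)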

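Affiliations:

School of Computer Science and Technology, North University of China (中北大学), Taiyuan, Shanxi 030051

Hunan Cancer Hospital (湖南省肿瘤医院), Changsha, Hunan 410031

Affiliated Hospital of Hunan Academy of Chinese Medicine (湖南省中医药研究院附属医院), Changsha, Hunan 410006

湖南中科助英智能科技研究院, Changsha, Hunan 410076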

Funding:

Natural Science Foundation of Hunan Province

Grant number:

2022JJ70022

Publication year:

2024

DOI:

10.19678/j.issn.1000-3428.0067645

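Journal:

Computer Engineering (计算机工程)

East China Institute of Computing Technology; Shanghai Computer Society

Indexing: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.581

ISSN: 1000-3428

Year, volume (issue): 2024, 50(7)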