Computer Engineering (计算机工程), 2024, Vol. 50, Issue (7): 324-332. DOI: 10.19678/j.issn.1000-3428.0067645

Chinese Medical Entity Recognition Based on Attention Enhancement and Feature Fusion

Wang Jintao (王晋涛)¹, Qin Ang (秦昂)², Zhang Yuan (张元)¹, Chen Yifei (陈一飞)², Wang Tingfeng (王廷凤)¹, Xie Chenglin (谢承霖)³, Zou Gang (邹刚)⁴

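Author information

  • 1. School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China
  • 2. Hunan Cancer Hospital, Changsha 410031, Hunan, China
  • 3. Affiliated Hospital of Hunan Academy of Chinese Medicine (湖南省中医药研究院附属医院), Changsha 410006, Hunan, China
  • 4. School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China; Hunan Zhongke Zhuying Intelligent Technology Research Institute (湖南中科助英智能科技研究院), Changsha 410076, Hunan, China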

Abstract

Chinese medical named entity recognition models built on character representations suffer from a single embedding form, difficult boundary recognition, and insufficient use of semantic information. An effective remedy is to inject lexical features into the bottom layers of BERT, which exploits word-granularity semantic information while reducing the impact of word segmentation errors. However, injecting lexical information also introduces weakly related words and noise, which scatters the attention of the attention-based BERT model. In addition, character and word granularity alone cannot fully capture the deep semantic information of Chinese characters. This study therefore proposes a Chinese medical entity recognition model based on attention enhancement and feature fusion. The character-word attention score matrix is sparsified so that the model's attention concentrates on highly relevant words, which effectively reduces interference from noisy words in the context. In parallel, Convolutional Neural Networks (CNNs) extract features from the pronunciation and strokes of Chinese characters; these features are fused by an iterative attention feature fusion module, concatenated with the output features of the BERT model, and fed into a BiLSTM model to further mine the deep semantic information carried by the characters. A large medical corpus is collected using a crawler and other methods to train a medical-domain word vector library, and the model is validated on the CCKS2017 and CCKS2019 datasets. The experimental results show F1 values of 94.90% and 89.37%, respectively, outperforming current mainstream entity recognition models.
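The abstract does not say how the attention score matrix is sparsified. As a rough illustration only, the sketch below assumes a top-k scheme: for each character, the k highest-scoring candidate words are kept, the remaining scores are masked out, and a softmax is applied over the survivors. The function name `topk_sparse_attention`, the row-per-character layout, and the hyperparameter `k` are all hypothetical and may differ from the authors' actual method.

```python
import math


def topk_sparse_attention(scores, k):
    """Keep the k largest scores in each row, mask the rest, then softmax.

    scores: list of rows; row i holds raw attention scores between
            character i and each candidate word (assumed layout).
    k:      number of words each character may attend to (assumed
            hyperparameter, not taken from the paper).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    weights = []
    for row in scores:
        if not row:
            # A character with no candidate words gets no weights.
            weights.append([])
            continue
        # Indices of the k highest-scoring words for this character.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        kept = set(order[:k])
        # Subtract the largest kept score for numerical stability.
        row_max = max(row[j] for j in kept)
        exp = [math.exp(row[j] - row_max) if j in kept else 0.0
               for j in range(len(row))]
        total = sum(exp)
        weights.append([e / total for e in exp])
    return weights


# One character scored against four candidate words; only two survive.
example = topk_sparse_attention([[2.0, 0.1, 1.0, -3.0]], k=2)
```

Under this assumption, weakly related words receive exactly zero weight rather than a small positive one, so they cannot dilute the representation built from the relevant words.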

Key words

entity recognition/Chinese word segmentation/sparse attention/feature fusion/medical word vector library

Funding

Natural Science Foundation of Hunan Province (2022JJ70022)

Publication year

2024
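Computer Engineering (计算机工程)
East China Institute of Computing Technology; Shanghai Computer Society

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.581
ISSN: 1000-3428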