A Chinese Medical Named Entity Recognition Method Based on Glyph Features (基于字形特征的中文医学命名实体识别方法)

孟伟伦¹, 郭景峰¹, 邢珂萱², 魏宁¹, 王巧梭³, 刘滨⁴


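Author information

1. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004; Hebei Key Laboratory of Virtual Technology and System Integration, Qinhuangdao, Hebei 066004
2. School of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004
3. Hebei Vocational and Technical College of Building Materials, Qinhuangdao, Hebei 066000
4. Research Center for Big Data and Social Computing, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018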

Abstract

As the first key step in medical information extraction, medical named entity recognition aims to extract medical entities from unstructured text such as electronic medical records and Chinese drug instruction leaflets. Most current work on Chinese medical named entity recognition obtains text representation vectors by fine-tuning pre-trained models and then uses feature engineering to improve performance in the medical domain. Most of these models are derived from models that perform well on general-purpose datasets and do not consider the linguistic characteristics of Chinese medical datasets. A statistical analysis of multiple medical datasets shows that some types of medical entities share glyph-level traits: for example, most Chinese characters that denote diseases contain "疒", and most characters that denote body organs contain "月". To address these problems, this paper proposes a Chinese medical named entity recognition method based on glyph features. The method improves the accuracy and generalization ability of the model by fusing glyph vectors into the text representation vectors and by further exploiting the negative samples in the dataset. Experimental results on multiple public Chinese medical datasets show that the method achieves better results than other models, and ablation experiments show that fusing glyph features and learning from negative samples are both effective for this task.
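The glyph observation and the fusion idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the hand-written component lookup table, the function names, and the use of plain concatenation as the fusion step. The abstract does not say how the glyph vectors are built or how they are fused with the text representation.

```python
# Toy lookup table, hand-written for illustration only (not the paper's data):
# characters that contain each tracked component.
COMPONENT_CHARS = {
    "疒": set("病疾痛症癌疗"),
    "月": set("肝肺肾脑胃肠"),
}


def glyph_features(char):
    """Return one binary flag per tracked component (hypothetical glyph vector)."""
    return [1.0 if char in chars else 0.0 for chars in COMPONENT_CHARS.values()]


def fuse(text_vector, char):
    """Concatenate a character's text vector with its glyph flags.

    Concatenation is an assumed stand-in for the fusion step.
    """
    return list(text_vector) + glyph_features(char)


def component_ratio(entities, component):
    """Fraction of entities with at least one character containing `component`."""
    chars = COMPONENT_CHARS[component]
    if not entities:
        return 0.0
    hits = sum(1 for entity in entities if any(c in chars for c in entity))
    return hits / len(entities)


if __name__ == "__main__":
    diseases = ["肺癌", "胃病", "头痛", "感冒"]
    print(component_ratio(diseases, "疒"))
    print(fuse([0.1, 0.2], "肝"))
```

A real system would replace the lookup table with a full character decomposition resource or rendered glyph images, and the placeholder text vector with the output of a fine-tuned pre-trained model.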

Key words

glyph feature/negative sample/two stages/medical information/named entity recognition/deep learning


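Funding

Hebei Provincial Science and Technology Program (21310101D)

Central Government Guided Local Science and Technology Development Fund (226Z0102G)

National Culture and Tourism Science and Technology Innovation Project (2020)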

Publication year

2024

Acta Electronica Sinica (电子学报)

Chinese Institute of Electronics

Indexed in: CSTPCD, Peking University Core Journals

Impact factor: 1.237

ISSN: 0372-2112

Number of references: 6
