Chinese Named Entity Recognition Based on Lexicon Fusion and Dependency Relation

Tang Zhuoran (唐卓然)¹, Liu Yi (柳毅)¹

Author information

1. School of Computers, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
Abstract

Named entity recognition is a foundational task in natural language processing that provides data for downstream tasks such as relation extraction and knowledge graph construction. Chinese named entity recognition is made difficult by word segmentation errors, ambiguous entity boundaries, and contextual dependencies, and existing methods neither fully exploit lexical information nor effectively extract features internal to the text. To address these problems, this paper proposes a Chinese named entity recognition model based on lexicon fusion and dependency relations. First, the self-matching words of each character in the input text are retrieved to generate lexical feature vectors, and word boundary information is derived from the position of the character within each of its self-matching words. A biaffine attention mechanism fuses the character vectors with the lexical feature vectors, so that both lexical and word boundary information enter the encoding process of the model and improve its ability to recognize entities. Next, a dependency graph of the input text is built from its dependency parse, and a Graph Attention Network (GAT) captures the dependency features within the text, which strengthens semantic dependency information and helps to distinguish entity boundaries. Finally, a Conditional Random Field (CRF) computes the label sequence of the text. In experiments, the model obtains F1 values of 92.10%, 80.76%, and 95.66% on the CCKS2017, OntoNotes 4.0, and MSRA datasets, respectively, outperforming the comparison models.
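The first step described in the abstract, retrieving each character's self-matching words and its position inside them, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that a self-matching word is any lexicon word that covers the character, and that position is encoded with B/M/E/S tags (begin, middle, end, single); the abstract specifies neither. The function name `self_matching_words`, the `max_len` parameter, and the toy lexicon are hypothetical.

```python
from typing import List, Set, Tuple


def self_matching_words(
    text: str, lexicon: Set[str], max_len: int = 4
) -> List[List[Tuple[str, str]]]:
    """For each character, list the lexicon words covering it with a position tag.

    Assumption (not from the paper): positions are encoded as B/M/E/S.
    """
    matches: List[List[Tuple[str, str]]] = [[] for _ in text]
    n = len(text)
    for start in range(n):
        # Enumerate every substring of length 1..max_len starting at `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = text[start:end]
            if word not in lexicon:
                continue
            # Record the word for every character it covers.
            for i in range(start, end):
                if end - start == 1:
                    tag = "S"  # single-character word
                elif i == start:
                    tag = "B"  # character begins the word
                elif i == end - 1:
                    tag = "E"  # character ends the word
                else:
                    tag = "M"  # character is inside the word
                matches[i].append((word, tag))
    return matches


# Toy lexicon, invented for illustration only.
toy_lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
sentence = "南京市长江大桥"
matches = self_matching_words(sentence, toy_lexicon)

for ch, words in zip(sentence, matches):
    print(ch, words)
```

In this toy sentence the character 市 ends one candidate word (南京市) and begins another (市长). This is the kind of boundary ambiguity that the fused lexical and word boundary features are meant to help the encoder resolve.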

Keywords

attention mechanism/dependency relation/lexicon fusion/Graph Attention Network (GAT)/Chinese named entity recognition

Funding

Key-Area Research and Development Program of Guangdong Province (2021B0101200002)

Publication year

2024

Computer Engineering (计算机工程)

East China Institute of Computing Technology; Shanghai Computer Society

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.581

ISSN: 1000-3428

Number of references: 5