Explicitly modeling the context for Chinese named-entity recognition

Chen Dian¹, Cao Yixuan¹, Luo Ping²

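Author information

1. Key Laboratory of Intelligent Information Processing (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049
2. Key Laboratory of Intelligent Information Processing (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; Peng Cheng Laboratory, Shenzhen 518066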

Abstract

Current Chinese named-entity recognition (NER) models achieve strong results on public datasets. However, some studies suggest that they rely too heavily on the literal (surface) features of entity text, while the influence of context on entity recognition has received little attention. As a result, existing models perform poorly even in simple generalization tests. To address this problem, this paper proposes modeling the context explicitly and independently, so that the model distinguishes contextual information from the literal information of entities. A corresponding data augmentation method is also proposed to train the model's context module, surface-name module, and combination module. Experimental results show that the proposed method markedly improves performance in invariance tests without sacrificing recognition performance on the test set, reducing the failure rate by 2.3% relative to the baseline model.

Key words

natural language processing / Chinese named-entity recognition (NER) / independent context modeling / data augmentation

Publication year

2024

高技术通讯 (Chinese High Technology Letters)

Institute of Scientific and Technical Information of China

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.19

ISSN: 1002-0470
