Chinese named entity recognition based on multiple word segmentation cases

Tian Di¹, Shao Yubin¹, Du Qingzhi¹, Long Hua¹, Ma Dinan²

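Author information

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500
2. Yunnan Key Laboratory of Media Convergence, Kunming 650032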

Abstract

To address the unclear word boundaries in Chinese and the neglect of contextual relationships between words and sentences, an ambiguous-segmentation information suppression algorithm based on multiple word segmentation cases is designed. In preprocessing, the weights of the different segmentations of a sentence are computed from a pre-trained word frequency table; the most likely segmentation is distinguished from the other segmentations, and all are merged into the sentence. When the self-attention mechanism extracts the contextual information of the sentence, the segmentation weight information is added, which supplies the valid boundary information of correct segmentations and suppresses the erroneous contextual relationships introduced by ambiguous segmentations. In comparisons with the MarkBert and W2NER algorithms on the public datasets Resume, MSRA, Weibo and OntoNotes, the ambiguous-segmentation information suppression algorithm performs better in prediction accuracy, in robustness as sentence length increases, and in prediction accuracy as the dataset grows.
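The weighting idea in the abstract can be illustrated with a rough sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy frequency table `WORD_FREQ`, the fallback count `UNKNOWN_FREQ`, the log-frequency scoring, the softmax normalisation, and the choice to inject the weights as an additive bias on the attention logits. The paper's actual formulas are not given in the abstract.

```python
import math

# Hypothetical toy frequency table; the paper's pre-trained table is not given.
WORD_FREQ = {"南京": 50, "市长": 30, "南京市": 40, "长江": 45, "大桥": 35,
             "长江大桥": 20, "江": 5, "市": 8, "长": 6}
UNKNOWN_FREQ = 1  # assumed count for single characters absent from the table


def segmentations(sentence, max_len=4):
    """Enumerate every split of `sentence` into table words or single characters."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if end == 1 or word in WORD_FREQ:
            for rest in segmentations(sentence[end:], max_len):
                results.append([word] + rest)
    return results


def segmentation_weights(sentence):
    """Score each segmentation by summed log relative frequency, then normalise."""
    total = sum(WORD_FREQ.values())
    segs = segmentations(sentence)
    scores = [sum(math.log(WORD_FREQ.get(w, UNKNOWN_FREQ) / total) for w in seg)
              for seg in segs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [(seg, e / norm) for seg, e in zip(segs, exps)]


def boundary_bias(sentence):
    """bias[i][j]: total weight of segmentations placing characters i and j in one word."""
    n = len(sentence)
    bias = [[0.0] * n for _ in range(n)]
    for seg, weight in segmentation_weights(sentence):
        start = 0
        for word in seg:
            for i in range(start, start + len(word)):
                for j in range(start, start + len(word)):
                    bias[i][j] += weight
            start += len(word)
    return bias


def biased_attention(logits, bias):
    """Row-wise softmax over attention logits after adding the segmentation bias."""
    out = []
    for row, brow in zip(logits, bias):
        shifted = [x + b for x, b in zip(row, brow)]
        top = max(shifted)
        exps = [math.exp(x - top) for x in shifted]
        norm = sum(exps)
        out.append([e / norm for e in exps])
    return out
```

For the ambiguous sentence "南京市长江大桥", `segmentation_weights` gives the split "南京市 / 长江大桥" the largest weight under this toy table, so `boundary_bias` favours character pairs inside those words over pairs inside the competing word "市长". Passing that matrix to `biased_attention` along with raw character-level logits then shifts attention toward the likelier word boundaries.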

Key words

named entity recognition/pre-trained model/self-attention/word boundary information

Publication year

2024

Journal of Lanzhou University (Natural Sciences)

Lanzhou University

Indexed in: CSTPCD, CSCD, Peking University Core Journals

Impact factor: 0.855

ISSN: 0455-2059
