
Chinese named entity recognition based on multiple word segmentation cases

To address two problems in Chinese text, unclear word boundaries and neglected contextual relationships between words and sentences, an ambiguous-segmentation information suppression algorithm based on multiple word segmentation cases was designed. In preprocessing, the weights of the different candidate segmentations of a sentence were calculated from a pre-trained word frequency table; the most likely segmentation was distinguished from the other segmentations, and all of them were merged into the sentence. When the self-attention mechanism extracted the contextual information of the sentence, the segmentation weight information was added, so that valid boundary information from correct segmentations was introduced and the erroneous contextual relationships produced by ambiguous segmentations were suppressed. The algorithm was compared with MarkBert and W2NER on the public datasets Resume, MSRA, Weibo and OntoNotes. The experimental results showed that it performed better in prediction accuracy, in robustness as sentence length increased, and in prediction accuracy as the dataset size increased.
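The abstract does not give the weighting formula or say how the weights enter self-attention, so the following is only a minimal sketch under stated assumptions. It enumerates the candidate segmentations of a sentence, weights each one by a softmax over summed log word frequencies, and turns the weights into a character-by-character matrix that could serve as an additive bias on attention scores. The frequency table `WORD_FREQ`, the function names, the smoothing constant, and the bias construction are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy frequency table. The paper uses a pre-trained word
# frequency table whose contents are not given in the abstract.
WORD_FREQ = {
    "南京": 50, "南京市": 30, "市长": 40, "长江": 60,
    "大桥": 45, "长江大桥": 20, "江": 5, "市": 10,
}


def segmentations(sentence, vocab, max_len=4):
    """Enumerate every split of `sentence` into vocabulary words,
    falling back to single characters so a split always exists."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if end == 1 or word in vocab:
            for rest in segmentations(sentence[end:], vocab, max_len):
                results.append([word] + rest)
    return results


def segmentation_weights(sentence, freq, smoothing=1.0):
    """Assumed scoring: sum of smoothed log relative frequencies per
    segmentation, normalised across candidates with a softmax."""
    total = sum(freq.values())
    candidates = segmentations(sentence, freq)
    scores = [
        sum(math.log((freq.get(w, 0) + smoothing) / (total + smoothing))
            for w in seg)
        for seg in candidates
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [(seg, e / norm) for seg, e in zip(candidates, exps)]


def attention_bias(sentence, weighted_segs):
    """Assumed bias: bias[i][j] is the total weight of the segmentations
    that place characters i and j inside the same word."""
    n = len(sentence)
    bias = [[0.0] * n for _ in range(n)]
    for seg, weight in weighted_segs:
        start = 0
        for word in seg:
            end = start + len(word)
            for i in range(start, end):
                for j in range(start, end):
                    bias[i][j] += weight
            start = end
    return bias


if __name__ == "__main__":
    sentence = "南京市长江大桥"
    weighted = segmentation_weights(sentence, WORD_FREQ)
    for seg, weight in sorted(weighted, key=lambda p: -p[1])[:3]:
        print("/".join(seg), round(weight, 4))
    bias = attention_bias(sentence, weighted)
    print([round(v, 3) for v in bias[2]])  # row for the character 市
```

In this sketch, character pairs that only an unlikely segmentation would join (such as 市 and 长 in 市长) receive a small bias, while pairs joined by the likely segmentations receive a value close to 1. That is one possible way to favour correct boundaries and damp ambiguous ones when the matrix is added to attention scores before the softmax.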

named entity recognition; pre-trained model; self-attention; word boundary information

Tian Di (田地), Shao Yubin (邵玉斌), Du Qingzhi (杜庆治), Long Hua (龙华), Ma Dinan (马迪南)


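Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500

Yunnan Key Laboratory of Media Convergence, Kunming 650032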


2024

Journal of Lanzhou University (Natural Sciences)
Lanzhou University


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.855
ISSN: 0455-2059
Year, volume (issue): 2024, 60(3)