A Chinese Named Entity Recognition Method with Low Lexical Information Loss Based on Word-Character Fusion

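Chinese abstract: Chinese named entity recognition (CNER) is a natural language processing task that aims to identify entities of specific categories in text, such as person, location, and organization names. It is a foundational task for natural language applications such as question answering, machine translation, and information extraction. Because Chinese lacks the natural word boundaries that English has, word-based NER models degrade significantly on Chinese when segmentation errors occur, while character-based NER models ignore lexical information. In recent years, many studies have therefore tried to incorporate lexical information into character-based models. WC-LSTM achieves a significant performance gain by injecting lexical information into the start and end characters of each word. However, that model still does not make full use of lexical information. Building on it, this work proposes LLL-WCM, a word-character fusion NER model with low lexical information loss, which injects lexical information into all intermediate characters of a word and thereby avoids lexical information loss. Two encoding strategies, averaging (avg) and self-attention, are also introduced to extract all lexical information. Experiments on four Chinese datasets show that, compared with WC-LSTM, the method improves F1 by 1.89%, 0.29%, 1.10%, and 1.54%, respectively.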

English title: Word-Character Model with Low Lexical Information Loss for Chinese NER

English abstract: Chinese named entity recognition (CNER) is a natural language processing task that aims to recognize entities of specific categories in text, such as the names of people, places, and organizations. It is a fundamental task underlying natural language applications such as question answering, machine translation, and information extraction. Because Chinese has no natural word boundaries as English does, the effectiveness of word-based NER models on Chinese is significantly reduced by word segmentation errors, while character-based NER models ignore lexical information. In recent years, many studies have attempted to incorporate lexical information into character-based models, and WC-LSTM has achieved significant performance improvements by injecting lexical information into the start and end characters of a word. However, that model still does not fully utilize lexical information. Building on it, LLL-WCM (word-character model with low lexical information loss) is proposed, which incorporates lexical information into all intermediate characters of a word to avoid lexical information loss. In addition, two encoding strategies, averaging and self-attention, are introduced to extract all lexical information. Experiments on four Chinese datasets show that the method improves F1 by 1.89%, 0.29%, 1.10%, and 1.54%, respectively, compared with WC-LSTM.
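The contrast the abstract draws, between injecting lexical information only at a word's start and end characters and injecting it into every character the word covers, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy sentence, lexicon, and embeddings, the function names, and the unparameterized dot-product self-attention followed by a mean. The abstract does not specify the model's actual layers, so this is not the authors' implementation.

```python
import math

def match_words(sentence, lexicon, max_len=4):
    """Return (start, end, word) for each lexicon word in the sentence (end exclusive)."""
    matches = []
    for i in range(len(sentence)):
        for j in range(i + 2, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in lexicon:
                matches.append((i, j, sentence[i:j]))
    return matches

def words_per_char(sentence, lexicon, boundary_only=False):
    """List the matched words assigned to each character.

    boundary_only=True  -> only the first and last character of each word
                           (the scheme the abstract attributes to WC-LSTM).
    boundary_only=False -> every character the word covers, including
                           intermediate ones (the idea behind LLL-WCM).
    """
    assigned = [[] for _ in sentence]
    for start, end, word in match_words(sentence, lexicon):
        positions = (start, end - 1) if boundary_only else range(start, end)
        for p in positions:
            assigned[p].append(word)
    return assigned

def avg_pool(vectors):
    """'avg' strategy: element-wise mean of the word vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def self_attention_pool(vectors):
    """Hypothetical 'self-attention' strategy: scaled dot-product attention
    with no learned projections, followed by a mean over the outputs."""
    dim = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)])
    return avg_pool(outputs)

# Toy data (hypothetical): a classic ambiguous sentence and a tiny lexicon.
sentence = "南京市长江大桥"
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}

full = words_per_char(sentence, lexicon)                          # all covered characters
boundary = words_per_char(sentence, lexicon, boundary_only=True)  # start/end only

# Character 4 ("江") is an intermediate character of "长江大桥":
# full[4]     -> ['长江', '长江大桥']
# boundary[4] -> ['长江']

# Toy 2-d embeddings (hypothetical) for the words attached to character 4.
toy_emb = {"长江": [1.0, 0.0], "长江大桥": [0.0, 1.0]}
vecs = [toy_emb[w] for w in full[4]]
avg_vec = avg_pool(vecs)
att_vec = self_attention_pool(vecs)
```

In the boundary-only assignment, the character "江" never sees the word "长江大桥" even though it lies inside it, which is the kind of lexical information loss the abstract refers to. The two pooling functions show one plausible way to reduce a variable-length set of word vectors to a single vector per character.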

English keywords: Named entity recognition; Natural language processing; Lexical information loss; Intermediate characters; Encoding strategy

Authors: Guo Zhiqiang (郭志强), Guan Donghai (关东海), Yuan Weiwei (袁伟伟)

Affiliation: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106

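Keywords: named entity recognition; natural language processing; lexical information loss; intermediate characters; encoding strategy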

Funding: Aeronautical Science Foundation of China (航空基金)

Project number: ASFC-20200055052005

Publication year: 2024

DOI: 10.11896/jsjkx.230500047

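Journal: Computer Science (计算机科学)

Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)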

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(8)