Chinese Named Entity Recognition Method with Low Lexical Information Loss Based on Word-Character Fusion

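Chinese named entity recognition (CNER) is a natural language processing task that aims to identify entities of specific categories in text, such as person names, place names, and organization names. It is a basic underlying task for natural language applications such as question answering, machine translation, and information extraction. Because Chinese lacks the natural word boundaries found in English, the performance of word-based NER models on Chinese drops markedly when segmentation errors occur, while character-based NER models ignore lexical information. For this reason, many recent studies have tried to integrate lexical information into character-based models. WC-LSTM obtained a significant performance gain by injecting lexical information into the start and end characters of each word. However, that model still does not make full use of lexical information. Building on it, LLL-WCM, a word-character fusion NER model with low lexical information loss, is proposed; it incorporates lexical information into all intermediate characters of a word and so avoids the loss of lexical information. Two encoding strategies, averaging (avg) and self-attention, are also introduced to extract all of the lexical information. Experiments on four Chinese datasets show that, compared with WC-LSTM, the method improves the F1 score by 1.89%, 0.29%, 1.10%, and 1.54%, respectively.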
Word-Character Model with Low Lexical Information Loss for Chinese NER
Chinese named entity recognition (CNER) is a natural language processing task that aims to recognize entities of specific categories in text, such as the names of people, places, and organizations. It is a fundamental underlying task for natural language applications such as question answering systems, machine translation, and information extraction. Since Chinese does not have a natural word segmentation structure like English, the effectiveness of word-based NER models for Chinese named entity recognition is significantly reduced by word segmentation errors, while character-based NER models ignore the role of lexical information. In recent years, many studies have therefore attempted to incorporate lexical information into character-based models. WC-LSTM achieved significant improvements in model performance by injecting lexical information into the start and end characters of a word. However, this model still does not fully utilize lexical information. Based on it, LLL-WCM (word-character model with low lexical information loss) is proposed, which incorporates lexical information into all intermediate characters of a word to avoid lexical information loss. In addition, two encoding strategies, average (avg) and self-attention, are introduced to extract all lexical information. Experiments on four Chinese datasets show that the F1 values of this method are improved by 1.89%, 0.29%, 1.10%, and 1.54%, respectively, compared with WC-LSTM.
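The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: attach to every character (start, intermediate, or end) the lexicon words that cover it, then pool those word vectors by averaging or by self-attention. It is not the LLL-WCM implementation. The function names, the toy lexicon, and its two-dimensional vectors are all hypothetical, and the self-attention here has no learned parameters, which a trained model would presumably have.

```python
import math

# Hypothetical toy lexicon: word -> made-up 2-d embedding (illustration only).
LEXICON = {
    "南京": [1.0, 0.0], "南京市": [0.0, 1.0], "市长": [1.0, 1.0],
    "长江": [2.0, 0.0], "长江大桥": [0.0, 2.0], "大桥": [2.0, 2.0],
}
SENTENCE = "南京市长江大桥"


def match_words(sentence, lexicon):
    """For each character index, list every lexicon word that covers it,
    whether the character is the word's start, an intermediate, or its end."""
    covering = [[] for _ in sentence]
    for start in range(len(sentence)):
        for end in range(start + 2, len(sentence) + 1):  # words of length >= 2
            word = sentence[start:end]
            if word in lexicon:
                for i in range(start, end):
                    covering[i].append(word)
    return covering


def average(vectors):
    """'avg' strategy: element-wise mean of the word vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_pool(vectors):
    """Parameter-free scaled dot-product self-attention over the word
    vectors, followed by mean pooling (an assumption of this sketch)."""
    dim = len(vectors[0])
    attended = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        attended.append([sum(w * v[d] for w, v in zip(weights, vectors))
                         for d in range(dim)])
    return average(attended)


def word_features(sentence, lexicon, strategy=average):
    """One pooled lexical vector per character; zeros if no word covers it."""
    dim = len(next(iter(lexicon.values())))
    features = []
    for words in match_words(sentence, lexicon):
        if words:
            features.append(strategy([lexicon[w] for w in words]))
        else:
            features.append([0.0] * dim)
    return features


features = word_features(SENTENCE, LEXICON, average)
```

In this toy sentence the character 京 is the end of 南京 but an intermediate character of 南京市. A scheme that injects word information only at start and end characters would not give 京 anything from 南京市, whereas `match_words` attaches both words to it. That is the kind of loss the abstract says LLL-WCM avoids.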

Named entity recognition; Natural language processing; Lexical information loss; Intermediate characters; Encoding strategy

Guo Zhiqiang (郭志强), Guan Donghai (关东海), Yuan Weiwei (袁伟伟)

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106

Aeronautical Science Foundation of China (航空基金)

ASFC-20200055052005

2024

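Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)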

CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN:1002-137X
Year, volume (issue): 2024, 51(8)