HTLR: a named entity recognition framework with hierarchical fusion of multi-source knowledge

Chinese Named Entity Recognition (NER) aims to extract the entities contained in unstructured text and assign them predefined entity categories. To address the insufficient semantic learning of most Chinese NER methods when contextual information is scarce, an NER framework with hierarchical fusion of multi-source knowledge, named HTLR (Chinese NER method based on Hierarchical Transformer fusing Lexicon and Radical), was proposed, so that hierarchically fused knowledge of multiple kinds could help the model learn richer and more comprehensive contextual and semantic information. Firstly, the potential words contained in the corpus were identified and vectorized using a publicly released Chinese lexicon and word-embedding table, and the semantic relations between words and their related characters were modeled through optimized position encoding, so as to learn Chinese lexical knowledge. Secondly, the corpus was converted into the corresponding code sequences representing glyph information by using the radical-based encoding of Chinese characters published by the Handian website, and an RFE-CNN (Radical Feature Extraction-Convolutional Neural Network) model was proposed to extract glyph knowledge. Finally, a Hierarchical Transformer model was proposed, in which the lower-level modules separately learn the semantic relations between characters and words and between characters and glyphs, and the higher-level module further fuses the character, lexical, and glyph knowledge, thereby helping the model learn semantically richer character representations. Experiments were conducted on the public Weibo, Resume, MSRA, and OntoNotes 4.0 datasets. Compared with the mainstream method NFLAT (Non-Flat-LAttice Transformer for Chinese named entity recognition), the F1 score of the proposed method improved by 9.43, 0.75, 1.76, and 6.45 percentage points on the four datasets, respectively, reaching the state-of-the-art level. These results show that multi-source semantic knowledge, hierarchical fusion, the RFE-CNN structure, and the Hierarchical Transformer structure are effective for learning rich semantic knowledge and improving model performance.
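The lexicon-identification step described above (finding all potential words in a character sequence against a word list) can be sketched with plain forward matching; the toy lexicon and sentence below are illustrative assumptions, not the released resources used in the paper.

```python
# Sketch of potential-word identification: for each character position,
# collect every lexicon word that starts there, recording its character
# span so the word can later be aligned with its characters via position
# encoding. The toy lexicon is illustrative only.

def match_words(sentence, lexicon, max_len=4):
    """Return (start, end, word) spans for every lexicon word in sentence."""
    spans = []
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            candidate = sentence[i:j]
            if candidate in lexicon:
                spans.append((i, j - 1, candidate))
    return spans

lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
spans = match_words("南京市长江大桥", lexicon)
# Each span ties a word to the characters it covers, e.g. (0, 2, "南京市").
```

Real systems typically replace the nested loop with a trie over the lexicon for efficiency, but the produced character-word spans are the same.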
Keywords: Named Entity Recognition (NER); Natural Language Processing (NLP); knowledge graph construction; lexicon enhancement; radical enhancement
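The two-level fusion idea — lower-level modules let characters attend separately to word and glyph features, and a higher-level module fuses the two views — can be illustrated with a minimal NumPy sketch. The dimensions, random features, and the concatenate-then-attend fusion below are assumptions for illustration, not the paper's exact Hierarchical Transformer.

```python
import numpy as np

def attend(query, keys, values):
    """Scaled dot-product attention: each query row gathers from values."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
d = 8
chars = rng.standard_normal((7, d))     # character representations
words = rng.standard_normal((4, d))     # matched-word representations
radicals = rng.standard_normal((7, d))  # radical (glyph) representations

# Lower-level modules: characters attend to words and glyphs separately.
char_word = attend(chars, words, words)
char_rad = attend(chars, radicals, radicals)

# Higher-level module: fuse the two views back into the character stream
# (here, a second attention over the stacked lower-level outputs).
memory = np.vstack([char_word, char_rad])
fused = attend(chars, memory, memory)
```

Keeping the two knowledge sources in separate lower-level modules before fusion is what distinguishes this hierarchical scheme from concatenating all features in a single attention layer.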