Named Entity Recognition Method Combined with a Self-Training Model

Abstract: Named entity recognition datasets often contain too few samples for some entity categories, which weakens the model's ability to learn the features of those categories and lowers overall performance. To address this problem, a named entity recognition method combined with a self-training model is proposed. A teacher model is trained on the existing named entity recognition dataset. An improved text similarity function is used to search for the unlabeled text most similar to the original dataset, and the teacher model generates pseudo-labels for that text. The pseudo-labeled data are then mixed with the labeled dataset to retrain a student model for the downstream named entity recognition task. Experimental results show that, compared with the baseline model, the method achieves better performance on the public datasets MSRA and CONLL03 and on a legal entity recognition dataset.
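The teacher/student pipeline described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the abstract does not specify the improved text similarity function or the model architecture, so `token_jaccard` (plain Jaccard overlap of token sets) and `LookupTagger` (a toy most-frequent-tag model) are hypothetical stand-ins, as are all function names and the `top_k` parameter.

```python
from collections import Counter, defaultdict


def token_jaccard(a, b):
    """Stand-in similarity (assumption): Jaccard overlap of two token lists.
    The paper's improved similarity function is not given in the abstract."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_similar(labeled_sents, unlabeled_sents, top_k):
    """Keep the top_k unlabeled sentences closest to any labeled sentence."""
    def best_score(sent):
        return max((token_jaccard(sent, ref) for ref in labeled_sents), default=0.0)
    return sorted(unlabeled_sents, key=best_score, reverse=True)[:top_k]


class LookupTagger:
    """Toy tagger (assumption): predicts the most frequent tag seen per token,
    and "O" for unseen tokens. A real system would use a neural NER model."""

    def fit(self, sentences):
        counts = defaultdict(Counter)
        for tokens, tags in sentences:
            for tok, tag in zip(tokens, tags):
                counts[tok][tag] += 1
        self.table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
        return self

    def predict(self, tokens):
        return [self.table.get(tok, "O") for tok in tokens]


def self_train(labeled, unlabeled, top_k):
    """Teacher -> similarity search -> pseudo-labels -> student."""
    teacher = LookupTagger().fit(labeled)                      # 1. train teacher
    chosen = select_similar([toks for toks, _ in labeled],     # 2. find similar
                            unlabeled, top_k)                  #    unlabeled text
    pseudo = [(toks, teacher.predict(toks)) for toks in chosen]  # 3. pseudo-label
    student = LookupTagger().fit(labeled + pseudo)             # 4. retrain student
    return student, pseudo


# Tiny made-up example data, for illustration only.
labeled = [(["Alice", "visited", "Wuhan"], ["B-PER", "O", "B-LOC"])]
unlabeled = [["unrelated", "words", "here"], ["Bob", "visited", "Wuhan"]]
student, pseudo = self_train(labeled, unlabeled, top_k=1)
```

With the toy tagger the student gains little from the pseudo-labels; the point of the sketch is the data flow, in which only unlabeled text close to the original dataset is pseudo-labeled and mixed back in before the student is trained.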

Keywords:

named entity recognition; self-training; text similarity; natural language processing; few-shot

Authors:

Xiao Wei (肖伟), Zheng Gengsheng (郑更生), Chen Yujia (陈钰佳)

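Affiliations:

School of Computer Science and Engineering & School of Artificial Intelligence, Wuhan Institute of Technology, Wuhan, Hubei 430205

Hubei Key Laboratory of Intelligent Robot, Wuhan, Hubei 430205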

Funding:

Young Scientists Fund of the National Natural Science Foundation of China

Project number:

62106179

Publication year:

2024

DOI:

10.6040/j.issn.1672-3961.0.2022.353

Journal of Shandong University (Engineering Science)

Shandong University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.634

ISSN: 1672-3961

Year, volume (issue): 2024, 54(2)

Number of references: 24