Noise Detection for Distantly Supervised Named Entity Recognition

Wang Jiacheng (王嘉诚)¹, Wang Kai (王凯)¹, Wang Haofen (王昊奋)², Du Wen (杜渂)³, He Zhidong (何之栋)³, Ruan Tong (阮彤)¹, Liu Jingping (刘井平)¹


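Author information

1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237
2. College of Design and Innovation, Tongji University, Shanghai 200092
3. DS Information Technology Co., Ltd. (迪爱斯信息技术股份有限公司), Shanghai 200032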

Abstract

Many approaches to distantly supervised named entity recognition (NER) rely on reinforcement learning, exploiting its strong decision-making ability to filter noise out of the automatically labeled data that distant supervision generates. However, the policy networks used by these approaches typically have simple architectures and therefore identify noisy instances poorly. They also make their decisions on whole sentences, so correct information inside a partly noisy sentence is discarded. To address these problems, we propose a new reinforcement learning based method for distantly supervised NER, named RLTL-DSNER, which identifies correct instances at the token level in noisy data generated by distant supervision and thereby reduces the negative impact of noisy instances on the distantly supervised NER model. Specifically, we introduce a tag confidence function into the policy network to identify correct instances accurately. In addition, we propose a novel pre-training strategy for the NER model, which lets it provide accurate state representations and effective reward values for the initial training of the reinforcement learning model and so guides that model to update in the right direction. Experiments on four datasets verify the superiority of RLTL-DSNER; on the NEWS dataset it achieves a 4.28% F1 improvement over the existing state-of-the-art methods.
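The difference between sentence-level and token-level filtering described in the abstract can be illustrated with a small sketch. This is not the RLTL-DSNER method: the abstract does not define the tag confidence function or the policy network, so a plain threshold stands in for the learned policy here, and the per-token confidence scores, the `THRESHOLD` value, and all function names are assumptions made for illustration only.

```python
from typing import List

# Hypothetical cut-off; the abstract gives no such value.
THRESHOLD = 0.5


def sentence_level_filter(tokens: List[str], confidences: List[float],
                          threshold: float = THRESHOLD) -> List[str]:
    """Keep the whole sentence only if every token's label looks reliable."""
    if all(c >= threshold for c in confidences):
        return list(tokens)
    return []


def token_level_mask(confidences: List[float],
                     threshold: float = THRESHOLD) -> List[int]:
    """Decide per token whether to keep its label (1 = keep, 0 = drop)."""
    return [1 if c >= threshold else 0 for c in confidences]


def masked_loss(token_losses: List[float], mask: List[int]) -> float:
    """Average the per-token losses over the tokens whose labels were kept."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


# Toy sentence with made-up confidence scores for its distant labels.
tokens = ["Alice", "visited", "Paris", "Hilton"]
confidences = [0.9, 0.95, 0.8, 0.2]   # the last distant label looks noisy
token_losses = [0.5, 0.25, 0.75, 4.0]

kept_sentence = sentence_level_filter(tokens, confidences)
mask = token_level_mask(confidences)
loss = masked_loss(token_losses, mask)
```

In this toy case the sentence-level filter discards all four tokens because one label is doubtful, while the token-level mask keeps the three reliable labels and excludes only the doubtful one from the training loss.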

Key words

named entity recognition / distant supervision / deep reinforcement learning / noise detection / pre-training strategy


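Funding

Shanghai Special Fund for Promoting High-Quality Industrial Development (2021-GZL-RGZN-01018)

National Key Research and Development Program of China (2021YFC2701800)

National Key Research and Development Program of China (2021YFC2701801)

Zhejiang Lab Open Fund (2019ND0AB01)

Shanghai Sailing Program for Young Science and Technology Talents (23YF1409400)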

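Year of publication

2024

Journal of Computer Research and Development (计算机研究与发展)

Institute of Computing Technology, Chinese Academy of Sciences; China Computer Federation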


Indexed in: CSTPCD, Peking University Core Journals

Impact factor: 2.649

ISSN: 1000-1239

Number of references: 38
