Domain-adaptive Entity Resolution Algorithm Based on Semi-supervised Learning

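Abstract (translated from Chinese): Entity resolution aims to determine whether two data entities refer to the same entity, and it is a fundamental task in many natural language processing tasks. Existing deep learning-based entity resolution solutions usually require large amounts of labeled data; even when trained with pre-trained language models, they still need thousands of labels to reach satisfactory accuracy. In real-world scenarios, such labeled data is not easy to obtain. To address this problem, a domain-adaptive entity resolution model based on semi-supervised learning is proposed. First, a classifier is trained on the source domain. Domain adaptation is then used to reduce the distribution discrepancy between the source and target domains, while soft pseudo-labels of the augmented target-domain data are added to the source domain for iterative training, thereby transferring knowledge from the source domain to the target domain. Comparison and ablation experiments were conducted on 13 datasets from the same or different domains. The results show that, compared with unsupervised baseline models, the proposed model improves the F1 score on multiple datasets by 2.84%, 9.16%, and 7.1% on average; compared with supervised baseline models, the proposed model needs only 20% to 40% of the labels to achieve performance comparable to supervised learning. The ablation experiments further confirm the effectiveness of the proposed model, which overall obtains better entity resolution results (the related code has been open-sourced; see footnote 1 of the paper).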

English title: Domain-adaptive Entity Resolution Algorithm Based on Semi-supervised Learning

English abstract: Entity resolution is a fundamental task in many natural language processing tasks, which aims to find out whether two data entities refer to the same entity. Existing deep learning-based solutions for entity resolution typically require a large amount of annotated data, even when pre-trained language models are used for training. Obtaining such annotated data is challenging in real-world scenarios. To address this issue, a domain-adaptive entity resolution model based on semi-supervised learning is proposed. First, a classifier is trained on the source domain, and then domain adaptation is used to reduce the distributional difference between the source and target domains. Soft pseudo-labels from the augmented target domain are then added to the source domain for iterative training, enabling knowledge transfer from the source to the target domain. Comparison and ablation experiments are performed on 13 datasets from various domains. The results show that, compared to unsupervised baseline models, the proposed model achieves average F1 score improvements of 2.84%, 9.16%, and 7.1% across multiple datasets. Compared to supervised baseline models, it achieves comparable performance with only 20% to 40% of the labels. Ablation experiments further demonstrate the effectiveness of the proposed model, and better entity resolution results can be obtained in general (the relevant code is available; see footnote 1 of the paper).
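The abstract does not say how the soft pseudo-labels are computed. As a rough illustration only, the sketch below assumes one common semi-supervised recipe: average a classifier's match/non-match probabilities over several augmented views of a target-domain pair, sharpen the average with a temperature, and keep only confident pairs for the next training round. The function names, the temperature, and the confidence threshold are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def soft_pseudo_label(view_probs: Sequence[Sequence[float]],
                      temperature: float = 0.5) -> List[float]:
    """Average class probabilities over augmented views, then sharpen.

    view_probs[i][c] is the (assumed) classifier probability of class c
    for the i-th augmented view of one target-domain record pair.
    """
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    # Mean prediction across the augmented views.
    mean = [sum(v[c] for v in view_probs) / n_views for c in range(n_classes)]
    # Temperature sharpening: a lower temperature gives a more peaked label.
    sharpened = [p ** (1.0 / temperature) for p in mean]
    total = sum(sharpened)
    return [p / total for p in sharpened]


def select_confident(pairs: Sequence[str],
                     labels: Sequence[Sequence[float]],
                     threshold: float = 0.9) -> List[Tuple[str, List[float]]]:
    """Keep only the pairs whose soft label is confident enough (assumed rule)."""
    return [(pair, list(label))
            for pair, label in zip(pairs, labels)
            if max(label) >= threshold]


# Two hypothetical target-domain pairs, each with two augmented views.
# Class order is [non-match, match].
confident_views = [[0.9, 0.1], [0.7, 0.3]]
uncertain_views = [[0.5, 0.5], [0.6, 0.4]]
labels = [soft_pseudo_label(confident_views), soft_pseudo_label(uncertain_views)]
kept = select_confident(["pair_a", "pair_b"], labels)
```

In a full pipeline, the kept pairs and their soft labels would be merged into the source-domain training set before the next iteration; the domain-adaptation loss mentioned in the abstract is not modeled here.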

English keywords:

Entity resolution; Domain adaptation; Pseudo-labels; Pre-trained language model; Data augmentation

Authors:

DAI Chaofan (戴超凡), DING Huahua (丁华华)

Affiliation:

National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073

Keywords:

Entity resolution; Domain adaptation; Pseudo-labels; Pre-trained language model; Data augmentation

Publication year:

2024

DOI:

10.11896/jsjkx.230800102

Computer Science (计算机科学)

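Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)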

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(9)