Instance-Based Error Detection for Part-of-Speech Tagging Dataset

Original Chinese title: 基于实例的词性标注数据错误检测

Abstract: Deep learning frameworks lack interpretability. To make use of the instance-to-instance similarity information that a model learns, this paper applies instance-based methods to the task of detecting errors in part-of-speech tagging data for the first time. First, we implement an instance-based part-of-speech tagging model on top of a pre-trained language model; on the CTB7 dataset it reaches a prediction accuracy of 96.76%, comparable to models based on standard classifiers. We then propose an instance-based method for detecting annotation errors. To obtain a realistic error detection dataset, we apply several methods to detect errors automatically in the CTB7 test set and manually annotate the candidate errors, which yields 2,016 true annotation errors, about 2.5% of the more than 80,000 words. Experiments on this error detection dataset show that the instance-based method reaches an error detection accuracy of 41.48%.
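The abstract does not give the similarity function, the number of neighbours, or the decision rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each token already has a vector (in the paper these would come from a pre-trained language model; here they are toy 2-d vectors), predicts a tag by a similarity-weighted vote over the `k` most similar stored instances, and flags a token when its annotated tag disagrees with that prediction. The names `cosine`, `predict_tag`, `detect_errors`, and the choice `k=3` are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_tag(query_vec, memory, k=3):
    """Similarity-weighted vote over the k most similar stored instances.

    memory is a list of (vector, tag) pairs. Returns (tag, neighbours),
    where neighbours is the list of (similarity, tag) pairs that voted.
    """
    neighbours = sorted(
        ((cosine(query_vec, vec), tag) for vec, tag in memory),
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, tag in neighbours:
        votes[tag] += max(sim, 0.0)  # ignore negative similarities
    return max(votes, key=votes.get), neighbours


def detect_errors(annotated, memory, k=3):
    """Flag tokens whose annotated tag differs from the instance-based prediction.

    annotated is a list of (vector, annotated_tag) pairs.
    Returns a list of (index, annotated_tag, predicted_tag).
    """
    flagged = []
    for idx, (vec, gold) in enumerate(annotated):
        pred, _ = predict_tag(vec, memory, k)
        if pred != gold:
            flagged.append((idx, gold, pred))
    return flagged


# Toy data (assumption): NN = noun, VV = verb.
memory = [
    ([1.0, 0.1], "NN"),
    ([0.9, 0.2], "NN"),
    ([0.1, 1.0], "VV"),
    ([0.2, 0.9], "VV"),
]
annotated = [
    ([0.95, 0.15], "NN"),  # agrees with its nearest instances
    ([0.15, 0.95], "NN"),  # nearest instances are VV, so it is flagged
]
flagged = detect_errors(annotated, memory, k=3)
print(flagged)
```

One appeal of this style of detector is that every flagged token comes with the stored instances that voted against its label, which is the kind of interpretability the abstract refers to. Flagged tokens are candidates only; as in the abstract, they would still need manual review.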

English keywords:

part-of-speech tagging; error detection dataset; semantic similarity; CTB7 dataset

Authors:

崔秀莲 (Cui Xiulian), 严福康 (Yan Fukang), 李正华 (Li Zhenghua)

Affiliation:

School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215000

Chinese keywords (translated):

part-of-speech classification; annotation error dataset; semantic similarity; CTB7 dataset

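Funding:

National Natural Science Foundation of China; Priority Academic Program Development of Jiangsu Higher Education Institutions

Project number:

62176173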

Publication year:

2024

DOI:

10.13451/j.sxu.ns.2023166

Journal of Shanxi University (Natural Science Edition) [山西大学学报(自然科学版)]

Shanxi University

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.287

ISSN: 0253-2395

Year, volume (issue): 2024, 47(2)

Number of references: 29