
A self-training framework for multi-label text classification based on partial labeling

Multi-label text classification (MLTC) aims to select one or more text-relevant categories from a predefined set of candidate labels and is a fundamental task in natural language processing (NLP). Most previous work is based on standardized, comprehensively annotated datasets, which require strict quality control and are generally difficult to obtain. In real annotation processes, some relevant labels are inevitably missed, leading to the incomplete labeling problem. This paper proposes the partial labeling self-training for multi-label text classification (PST) framework, in which a teacher model automatically assigns labels to large-scale unlabeled data and supplements the missing labels of incompletely labeled data, and these data are then used in turn to update the teacher model. Experiments on synthetic and real datasets show that the PST framework is compatible with various existing MLTC models and can mitigate the impact of incompletely labeled data on the model.
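The framework described above amounts to a self-training loop: train a teacher, use it to pseudo-label unlabeled texts and to patch missing labels, then retrain. The Python sketch below is purely illustrative; train_fn, score_fn, and the number of rounds are hypothetical placeholders for the paper's (unspecified) training and scoring procedures.

# Hypothetical outline of the PST self-training loop described above.
# train_fn and score_fn stand in for the paper's training and
# teacher-scoring procedures; they are not the authors' actual API.
def pst_self_training(train_fn, score_fn, labeled_data, unlabeled_data,
                      num_rounds=3):
    teacher = train_fn(labeled_data)                # initial teacher model
    for _ in range(num_rounds):
        pseudo = score_fn(teacher, unlabeled_data)  # label unlabeled texts
        patched = score_fn(teacher, labeled_data)   # supplement missing labels
        teacher = train_fn(pseudo + patched)        # update the teacher
    return teacher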
Self-training with partial labeling for multi-label text classification
[Objective] Multi-label text classification (MLTC), a fundamental task in natural language processing, selects the most relevant labels from a predefined label set to annotate texts. Most previous studies have been conducted on standardized, comprehensively annotated datasets, which require strict quality control and are difficult to obtain. In real annotation processes, some relevant labels are inevitably missed, resulting in incomplete annotation. Missing labels affect a model in two main ways: 1) degradation: when many labels are missing, the number of positive labels associated with each text decreases, and the model cannot learn comprehensive and complete information from the few remaining relevant labels; 2) misleading: during training, missing labels are treated as negative labels unrelated to the text, misleading the model into learning the opposite information. MLTC under incomplete annotation aims to learn text classifiers for the relevant labels from incompletely annotated datasets while minimizing the impact of missing labels and improving classification effectiveness. Existing MLTC methods rely on supervised training over manually annotated data and therefore cannot handle missing labels.

[Methods] This paper proposes the partial labeling self-training for multi-label text classification (PST) framework, which alleviates the negative impact of missing labels by recovering and exploiting them. Specifically, PST first trains a base MLTC model on the incompletely labeled dataset to obtain a teacher model. The teacher model then automatically scores large-scale unlabeled data and the incompletely labeled data. A dual-threshold mechanism divides the labels into states according to their scores, yielding positive, negative, and other labels. Finally, the teacher model is updated through joint training on the label information of these three states. To comprehensively evaluate the PST framework, we randomly deleted labels from the training set of the English dataset AAPD at different missing ratios, constructing synthetic datasets with varying degrees of incomplete annotation. We also manually corrected the incompletely annotated CCKS2022 Task 8 dataset and used it as the real dataset in our experiments.

[Results] Experiments on the synthetic datasets show that as the incomplete-annotation problem intensifies, the performance of MLTC models drops sharply; the PST framework slows this decline to some extent, and the more labels are missing, the more pronounced the relief. Results obtained with different teacher models on the real dataset show that the PST framework improves each teacher model on incompletely annotated data to varying degrees, which demonstrates the generality of the framework.

[Conclusions] The PST framework is a model-agnostic plug-in framework that is compatible with various teacher models. It fully exploits external unlabeled data to optimize the teacher model while recovering and exploiting the missing labels of incompletely labeled data, thereby weakening the impact of missing labels on the model. The experimental results indicate that the proposed framework is general and can alleviate the impact of incomplete annotation to some extent.
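To make the dual-threshold mechanism and the joint training objective in [Methods] concrete, here is a minimal PyTorch sketch. The threshold values TAU_POS and TAU_NEG, the function names, and the positive/negative/other state encoding are our assumptions; the abstract gives neither the actual thresholds nor code.

import torch
import torch.nn.functional as F

TAU_POS = 0.9  # assumed upper threshold: scores above it become positive labels
TAU_NEG = 0.1  # assumed lower threshold: scores below it become negative labels

def assign_label_states(scores, observed):
    """Map teacher scores (batch, num_labels) to per-label states.

    observed marks labels annotated as positive (1) in the incomplete data.
    Returned states: 1 = positive, 0 = negative, -1 = other (uncertain).
    """
    states = torch.full_like(scores, -1.0)  # default: "other"
    states[scores >= TAU_POS] = 1.0         # confident positives (recovered)
    states[scores <= TAU_NEG] = 0.0         # confident negatives
    states[observed == 1] = 1.0             # annotated labels stay positive
    return states

def joint_loss(logits, states):
    """BCE over positive/negative labels; 'other' labels are ignored."""
    mask = states >= 0                      # drop the uncertain "other" state
    if mask.sum() == 0:
        return logits.sum() * 0.0           # keep the graph; no signal
    return F.binary_cross_entropy_with_logits(logits[mask], states[mask])

Masking out the "other" state is what would prevent uncertain missing labels from acting as misleading negatives during joint training, matching the degradation and misleading effects described in [Objective].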

multi-label text classification; incomplete labeling; self-training

任俊飞 (Ren Junfei), 朱桐 (Zhu Tong), 陈文亮 (Chen Wenliang)


School of Computer Science and Technology, Soochow University, Suzhou 215006, China


Funding: Key Joint Project of the National Natural Science Foundation of China (61936010)

2024

Journal of Tsinghua University (Science and Technology), Tsinghua University

Indexed in: CSTPCD; PKU Core Journals (北大核心)
Impact factor: 0.586
ISSN: 1000-0054
Year, volume (issue): 2024, 64(4)