Robust pseudo-label selection for holistic semi-supervised learning

Lan-Zhe Guo¹, Yu-Feng Li¹

Author information

1. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210023

Abstract

Semi-supervised learning (SSL) is an important machine learning paradigm that leverages unlabeled data to improve learning performance when labeled data are scarce. Although many studies report strong SSL performance on benchmark datasets, SSL methods still face critical limitations in real-world tasks: the quality of pseudo-labels is difficult to judge, results are sensitive to hyper-parameter choices, and theoretical guarantees are lacking. To address these issues, we propose a new holistic SSL approach with robust pseudo-label selection. Specifically, our approach judges pseudo-label quality adaptively from the disagreement among model predictions, without pre-defined hyper-parameters, which markedly improves the robustness of SSL. Theoretically, we prove that the classification error of the new method decreases as the number of training iterations increases. Experimentally, our method clearly outperforms mainstream techniques across various datasets. For example, compared with the state-of-the-art SSL algorithm FixMatch, it reduces the classification error by 11.8% on the CIFAR-10 dataset and by 18.8% on the more challenging STL-10 dataset.
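
The abstract does not spell out the selection rule, so the following is only a minimal sketch of the general idea of selecting pseudo-labels by prediction disagreement rather than by a fixed confidence threshold. It assumes two sets of class-probability predictions for the same unlabeled samples (for example, from two prediction heads or two augmented views); the function names `argmax` and `select_by_agreement` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability (the first one wins on ties)."""
    return max(range(len(probs)), key=lambda k: probs[k])


def select_by_agreement(
    preds_a: Sequence[Sequence[float]],
    preds_b: Sequence[Sequence[float]],
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for the samples on which
    the two sets of predictions agree on the top class.

    Illustrative assumption, not the paper's exact rule: no confidence
    threshold is used; a sample is kept only when both predictors vote
    for the same class.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both predictors must score the same samples")
    selected = []
    for i, (pa, pb) in enumerate(zip(preds_a, preds_b)):
        label_a, label_b = argmax(pa), argmax(pb)
        if label_a == label_b:  # no disagreement: accept the pseudo-label
            selected.append((i, label_a))
    return selected


if __name__ == "__main__":
    # Class probabilities for three unlabeled samples from two predictors.
    head_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.1, 0.1, 0.8]]
    head_b = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
    print(select_by_agreement(head_a, head_b))  # [(0, 0), (2, 2)]
```

In this toy run the second sample is discarded because the two predictors pick different classes. Nothing in the rule has to be tuned, which is the property the abstract contrasts with threshold-based selection.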

Key words

machine learning/deep learning/semi-supervised learning/pseudo-label/robustness

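Funding

National Natural Science Foundation of China (62176118)

National Natural Science Foundation of China (61921006)

Chinese Association for Artificial Intelligence (CAAI)-Huawei MindSpore Academic Award Fund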

Publication year

2024

Science in China, Series F (中国科学F辑)

Chinese Academy of Sciences, National Natural Science Foundation of China

CSTPCD / Peking University Core Journal List (北大核心)

Impact factor: 1.438

ISSN: 1674-5973

Number of references: 39
