Label-noise learning via mixture proportion estimation

Qinghua Zheng (郑庆华)¹, Shuzhi Cao (曹书植)¹, Jianfei Ruan (阮建飞)¹, Rui Zhao (赵锐)¹, Bo Dong (董博)²

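Author information

1. School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049; Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi'an 710049
2. School of Continuing Education, Xi'an Jiaotong University, Xi'an 710049; Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology, Xi'an 710049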

Abstract

Artificial intelligence has advanced rapidly in recent years and, with the growth of hardware computing power, deep learning has become the new paradigm for artificial intelligence algorithms. Deep learning, however, depends on large amounts of accurately labeled data, and in realistic multi-class classification scenarios labeling costs and privacy protection often make such data difficult to obtain. Inexpensive collection methods such as mobile crowdsourcing and web crawling have therefore been widely adopted, but they inevitably introduce incorrect labels, i.e., label noise. Because deep neural networks have a strong capacity to fit data, including noisy labels, label noise causes overfitting and severely limits the generalization ability of deep learning methods. To cope with label noise, most existing studies rely explicitly or implicitly on anchor points, i.e., instances that almost certainly belong to a particular class. Anchor points are difficult to obtain in practice, which makes these solutions inapplicable. To address this problem, this paper transforms the multi-class label-noise learning problem into a mixture proportion estimation (MPE) problem and builds an anchor-free, statistically consistent learning algorithm. The main contributions are as follows: (1) we generalize, for the first time, the existing regrouping-MPE (R-MPE) method, which is only applicable to two-component MPE scenarios, and propose MR-MPE (multi-component oriented R-MPE), an MPE method for multi-component scenarios that does not rely on the irreducible assumption; (2) we prove theoretically that, in multi-class classification, the anchor point assumption for label-noise learning is equivalent to the irreducible assumption for MPE problems, and on this basis we build an anchor-free, statistically consistent algorithm using the proposed MR-MPE method. Comparative experiments against existing algorithms on both synthetic and real-world noisy datasets show that the proposed algorithm achieves the best performance on multiple datasets. In addition, robustness tests with anchor points removed verify that the proposed algorithm does not depend on anchor points.
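To make the anchor-point dependence described in the abstract concrete, the following is a minimal sketch of a common anchor-based transition matrix estimator. It is not the paper's MR-MPE method. It assumes a class-conditional noise model in which the noisy posterior equals the clean posterior multiplied by a transition matrix `T`; the matrix values, the function names, and the 0.9 confidence cap used to simulate missing anchors are all hypothetical choices made for illustration.

```python
import numpy as np

def noisy_posterior(clean_posterior, T):
    """P(noisy = j | x) = sum_i P(clean = i | x) * T[i, j] (assumed noise model)."""
    return clean_posterior @ T

def estimate_T_from_anchors(noisy_post):
    """Anchor-based estimate: row i is the noisy posterior of the instance
    that has the largest noisy probability for class i."""
    idx = noisy_post.argmax(axis=0)
    return noisy_post[idx]

# Hypothetical 3-class transition matrix (each row sums to 1).
T = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

rng = np.random.default_rng(0)
# Clean posteriors of ordinary instances; none is more than 90% sure of a class.
ordinary = rng.dirichlet(np.ones(3), size=200)
ordinary = ordinary[ordinary.max(axis=1) < 0.9]
# Anchor points: instances whose clean posterior is one-hot.
anchors = np.eye(3)

T_hat_anchor = estimate_T_from_anchors(
    noisy_posterior(np.vstack([ordinary, anchors]), T))
T_hat_no_anchor = estimate_T_from_anchors(noisy_posterior(ordinary, T))

print("with anchors:\n", T_hat_anchor)
print("without anchors:\n", T_hat_no_anchor)
```

With anchor points in the data the estimator recovers `T`; once they are removed, the diagonal of the estimate falls below the true values. This is the kind of bias that motivates an anchor-free approach.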

Key words

mixture proportion estimation / multi-class classification / label-noise learning / anchor point / irreducible assumption / statistical consistency

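Funding

Science and Technology Innovation 2030 Major Project on New Generation Artificial Intelligence (2020AAA0108800)

National Natural Science Foundation of China (62037001)

National Natural Science Foundation of China (61721002)

National Natural Science Foundation of China (62002282)

Innovative Research Team Program of the Ministry of Education (IRT-17R86)

Xi'an Jiaotong University Undergraduate Teaching Reform Research Project (20JX04Y)

Xi'an Jiaotong University and Servyou Group Tax Big Data Collaborative Innovation Project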

Publication year

2024

Science in China, Series F (中国科学F辑)

Chinese Academy of Sciences; National Natural Science Foundation of China

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 1.438

ISSN: 1674-5973

Number of references: 52
