Abstract
By a News Reporter-Staff News Editor at Robotics & Machine Learning Daily News - Fresh data on Machine Learning are presented in a new report. According to news reporting from Beijing, People's Republic of China, by NewsRx journalists, research stated, "Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered mislabeled if its label is inconsistent with its neighbors."
Financial supporters for this research include National Natural Science Foundation of China (NSFC), National Key R&D Program of China, DITDP, National Science Foundation (NSF), Beijing Natural Science Foundation, Research Funds of Renmin University of China.
The news correspondents obtained a quote from the research from the Beijing Institute of Technology, "However, it often performs poorly, because an instance does not always share the same label with its neighbors. ML-based methods instead utilize trained models to differentiate between mislabeled and clean instances. However, these methods struggle to achieve high accuracy, since the models may have already overfitted mislabeled instances. In this paper, we propose a novel framework, MisDetect, that detects mislabeled instances during model training. MisDetect leverages the early loss observation to iteratively identify and remove mislabeled instances. In this process, influence-based verification is applied to enhance the detection accuracy. Moreover, MisDetect automatically determines when the early loss is no longer effective in detecting mislabels such that the iterative detection process should terminate. Finally, for the training instances that MisDetect is still not certain about whether they are mislabeled or not, MisDetect automatically produces some pseudo labels to learn a binary classification model and leverages the generalization ability of the machine learning model to determine their status."
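The early-loss idea described in the quote can be illustrated with a minimal sketch. This is not the authors' MisDetect implementation: it is a toy, pure-Python version that trains a 1-D logistic model for only a few epochs, flags the highest-loss instances, removes them, and repeats. The function names, the toy dataset, and the fixed `rounds` and `per_round` values are all assumptions made for illustration. Influence-based verification, automatic termination, and the pseudo-label classifier are deliberately left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_early(xs, ys, epochs=5, lr=0.1):
    """Run only a few epochs of full-batch gradient descent (the 'early' phase)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def instance_loss(w, b, x, y):
    """Binary cross-entropy of a single labeled instance."""
    p = sigmoid(w * x + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def detect_mislabeled(xs, ys, rounds=2, per_round=2):
    """Iteratively flag the highest early-loss instances and drop them.

    `rounds` and `per_round` are hypothetical fixed settings; the quoted
    framework decides automatically when to stop.
    """
    remaining = list(range(len(xs)))
    flagged = set()
    for _ in range(rounds):
        w, b = train_early([xs[i] for i in remaining],
                           [ys[i] for i in remaining])
        ranked = sorted(remaining,
                        key=lambda i: instance_loss(w, b, xs[i], ys[i]),
                        reverse=True)
        flagged.update(ranked[:per_round])
        remaining = [i for i in remaining if i not in flagged]
    return flagged


def make_toy_data():
    """Two well-separated 1-D clusters with four deliberately flipped labels."""
    xs = [-(1.5 + 0.05 * i) for i in range(20)] + \
         [1.5 + 0.05 * i for i in range(20)]
    ys = [0] * 20 + [1] * 20
    flipped = {3, 17, 25, 31}
    for i in flipped:
        ys[i] = 1 - ys[i]
    return xs, ys, flipped
```

Calling `detect_mislabeled(*make_toy_data()[:2])` on this toy data flags the four flipped indices. A briefly trained model still fits the majority pattern, so instances whose labels contradict that pattern have the largest loss. On real data the separation is far less clean, which is presumably why the quoted framework adds verification and a learned classifier for uncertain instances.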