A Label Noise Filtering Method Based on Relative Outlier Factor
Categorical label noise in classification tasks is a common problem in traditional data mining, yet methods designed specifically to detect it are still lacking. Outlier detection techniques can be used to identify and filter noise, but outliers and categorical label noise do not coincide, so outlier detection algorithms alone cannot accurately detect label noise in classification data sets. To address these issues, a method built on outlier detection techniques and suited to filtering categorical label noise is proposed: the label noise ensemble filtering method based on relative outlier factor (EROF), where ROF denotes the relative outlier factor. The EROF method first estimates the noise probability of each sample using the relative outlier factor, and then iteratively combines multiple outlier detection algorithms to perform ensemble filtering. Experimental results show that the method maintains excellent noise identification capability on most data sets containing label noise and significantly improves the generalization ability of various classification models.
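The abstract does not give the definition of ROF or the exact ensemble rule, so the following is only a minimal sketch of the two-stage idea under stated assumptions. It assumes a hypothetical relative score, namely the ratio of a sample's outlierness with respect to its own labelled class to its outlierness with respect to the other classes, and it assumes two simple k-nearest-neighbour distance detectors that must both agree before a sample is filtered. The names `relative_score`, `ensemble_filter`, `knn_mean_dist`, `kth_dist`, and the parameters `k`, `threshold`, and `max_rounds` are illustrative and are not taken from the paper.

```python
import math

def knn_mean_dist(x, pool, k):
    """Outlierness as the mean distance from x to its k nearest points in pool."""
    dists = sorted(math.dist(x, p) for p in pool)
    return sum(dists[:k]) / min(k, len(dists))

def kth_dist(x, pool, k):
    """Outlierness as the distance from x to its k-th nearest point in pool."""
    dists = sorted(math.dist(x, p) for p in pool)
    return dists[min(k, len(dists)) - 1]

def relative_score(i, X, y, keep, detector, k):
    """Hypothetical stand-in for a relative outlier factor (an assumption).

    Ratio of outlierness w.r.t. the sample's own labelled class to
    outlierness w.r.t. all other classes. A value above 1 means the sample
    sits closer to another class than to the class its label claims.
    """
    same = [X[j] for j in keep if j != i and y[j] == y[i]]
    other = [X[j] for j in keep if y[j] != y[i]]
    if not same or not other:
        return 0.0  # nothing to compare against, so do not flag
    return detector(X[i], same, k) / (detector(X[i], other, k) + 1e-12)

def ensemble_filter(X, y, k=3, threshold=1.0, max_rounds=5):
    """Iteratively drop samples that every detector scores above threshold."""
    detectors = [knn_mean_dist, kth_dist]
    keep = list(range(len(X)))
    removed = []
    for _ in range(max_rounds):
        flagged = [
            i for i in keep
            if all(relative_score(i, X, y, keep, d, k) > threshold
                   for d in detectors)
        ]
        if not flagged:
            break  # no detector consensus left, stop iterating
        removed.extend(flagged)
        keep = [i for i in keep if i not in flagged]
    return keep, sorted(removed)

# Toy data: two well-separated clusters, with sample 8 lying in the
# class-0 cluster but carrying the label 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
keep, removed = ensemble_filter(X, y)
print(removed)  # [8]
```

Requiring all detectors to agree is a conservative choice that favours keeping clean samples; a majority vote or a weighted noise probability would be an equally plausible reading of "ensemble filtering" in the abstract.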