
Self-training algorithm based on dynamic threshold and difference test

When a self-training algorithm iteratively trains a classifier, it is difficult to select high-confidence samples effectively, and errors from mislabeled samples accumulate. To address these problems, this paper proposes a self-training algorithm based on a dynamic threshold and a difference test. First, the local outlier factor of each sample is introduced; it is used to remove outliers from the labeled samples and to assign the unlabeled samples to categories. The unlabeled samples are then processed in batches according to the assigned category, which makes it easier for the model to select high-confidence unlabeled samples. Second, a dynamic membership threshold function is designed based on the changes in the number of newly added pseudo-labeled samples and in the contrast membership, which improves the quality of the selected high-confidence samples. Finally, a dense distance is defined to measure the difference between samples. For each pseudo-labeled sample, the sums of its dense distances to samples of the same class and to samples of different classes are computed separately to identify pseudo-labeled samples with high uncertainty. These samples are returned to the unlabeled sample set for the next round of training, which alleviates the accumulation of errors from mislabeled samples. Experimental results show that the algorithm performs well on all 12 UCI benchmark datasets.

Keywords: self-training algorithm; mislabeled samples; high-confidence samples; dynamic threshold; difference test; local outlier factor; contrast membership; dense distance
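The overall loop described in the abstract (select confident unlabeled samples, adapt the threshold, and send doubtful pseudo-labels back to the unlabeled pool) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the inverse-distance membership classifier, the rule that lowers the threshold by a fixed `step` when no sample is accepted, and the mean-distance check are simple stand-ins, not the paper's dynamic membership threshold function or dense distance, which the abstract does not define. The local outlier factor step is omitted. The names `self_train`, `memberships`, `mean_dist`, `floor`, and `step` are hypothetical.

```python
import math

def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def memberships(x, centroids):
    """Inverse-distance class memberships summing to 1 (stand-in classifier)."""
    inv = {c: 1.0 / (math.dist(x, mu) + 1e-9) for c, mu in centroids.items()}
    total = sum(inv.values())
    return {c: v / total for c, v in inv.items()}

def mean_dist(x, others):
    """Average Euclidean distance from x to the points in `others`."""
    return sum(math.dist(x, o) for o in others) / len(others)

def self_train(labeled, unlabeled, threshold=0.9, floor=0.6, step=0.05, max_rounds=20):
    """labeled: list of (point, label) pairs; unlabeled: list of points.

    Returns (training set including pseudo-labels, points left unlabeled).
    """
    train, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Group the current training points by class and refit the centroids.
        by_class = {}
        for x, y in train:
            by_class.setdefault(y, []).append(x)
        cents = {c: centroid(pts) for c, pts in by_class.items()}

        accepted, remaining = [], []
        for x in pool:
            m = memberships(x, cents)
            label = max(m, key=m.get)
            if m[label] < threshold:
                remaining.append(x)  # not confident enough this round
                continue
            same = by_class[label]
            other = [p for c, pts in by_class.items() if c != label for p in pts]
            # Assumed difference check: keep the pseudo-label only if the sample
            # is, on average, closer to its own class than to the other classes.
            if other and mean_dist(x, same) >= mean_dist(x, other):
                remaining.append(x)  # doubtful: back to the unlabeled pool
            else:
                accepted.append((x, label))

        if accepted:
            train.extend(accepted)
        elif threshold > floor:
            # Assumed placeholder rule: relax the threshold when nothing is accepted.
            threshold = max(floor, threshold - step)
        else:
            break  # threshold already at its floor and nothing was accepted
        pool = remaining
    return train, pool
```

For example, with `labeled = [((0.0, 0.0), "a"), ((10.0, 0.0), "b")]` and `unlabeled = [(0.5, 0.0), (9.5, 0.0), (5.0, 0.0)]`, the two points near the labeled samples receive pseudo-labels in the first round, while the midpoint never reaches the membership floor and stays unlabeled.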

Lyu Jia (吕佳), Qiu Hongbo (邱鸿波), Xiao Feng (肖锋)


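College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China

Chongqing Digital Agriculture Service Engineering Technology Research Center, Chongqing 401331, China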


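Funding: Major Program of the National Natural Science Foundation of China (11991024); Chongqing Municipal Education Commission Science and Technology Innovation Project for the "Construction of the Chengdu-Chongqing Economic Circle" (KJCX2020024); Chongqing University Innovation Research Group Funding Project (CXQT20015)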

2024

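CAAI Transactions on Intelligent Systems (智能系统学报)
Chinese Association for Artificial Intelligence; Harbin Engineering University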


Indexed in: CSTPCD; PKU Core (北大核心)
Impact factor: 0.672
ISSN: 1673-4785
Year, volume (issue): 2024, 19(4)