Self-training method combining semi-supervised clustering and data editing

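Abstract (translated from the Chinese): Self-training has two problems: the high-confidence unlabeled samples it selects in each iteration carry little information, and it easily mislabels unlabeled samples. To address both, a Naive Bayes self-training method that combines semi-supervised clustering with data editing is proposed. In each iteration, the method first performs semi-supervised clustering on a small number of labeled samples and a large number of unlabeled samples, and selects the unlabeled samples with high cluster membership for Naive Bayes classification. It then applies a data editing technique to filter out the unlabeled samples that have high cluster membership but were misclassified by Naive Bayes. This data editing technique uses information from both labeled and unlabeled samples to filter noise, which resolves the problem that the performance of traditional data editing may degrade when labeled samples are scarce. Comparative experiments on UCI datasets demonstrate the effectiveness of the proposed algorithm.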

English title: Self-training method based on semi-supervised clustering and data editing

English abstract: To address the problems that the high-confidence unlabeled samples selected by self-training in each iteration contain little information and that self-training easily mislabels unlabeled samples, a Naive Bayes self-training method based on semi-supervised clustering and data editing was proposed. Firstly, semi-supervised clustering was applied to a small number of labeled samples and a large number of unlabeled samples, and the unlabeled samples with high cluster membership were selected and classified by Naive Bayes. Secondly, a data editing technique was used to filter out the unlabeled samples with high cluster membership that were misclassified by Naive Bayes. The data editing technique filters noise by using information from both the labeled and the unlabeled samples, which solves the problem that the performance of traditional data editing techniques may decrease when labeled samples are scarce. The effectiveness of the proposed algorithm was verified by comparative experiments on UCI datasets.
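The iteration described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not name the clustering algorithm, the membership function, or the editing rule, so the seeded centroid clustering, the inverse-distance membership, the k-nearest-neighbor editing rule, and all function names and parameters (`threshold`, `k`, `smooth`, `max_iter`) are assumptions chosen for brevity.

```python
import math
from collections import Counter

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroids(X, y):
    # Mean vector of the points assigned to each class/cluster.
    out = {}
    for c in set(y):
        pts = [x for x, lab in zip(X, y) if lab == c]
        out[c] = [sum(col) / len(pts) for col in zip(*pts)]
    return out

def seeded_clustering(XL, yL, XU, iters=5):
    # Assumed stand-in for semi-supervised clustering: labeled points seed the
    # centroids and keep their labels; unlabeled points are reassigned each pass.
    cents = centroids(XL, yL)
    for _ in range(iters):
        assign = [min(cents, key=lambda c: dist(x, cents[c])) for x in XU]
        cents = centroids(XL + XU, yL + assign)
    return cents

def max_membership(x, cents):
    # Assumed membership: normalized inverse distance; returns the largest value.
    inv = [1.0 / (dist(x, m) + 1e-9) for m in cents.values()]
    return max(inv) / sum(inv)

def nb_fit(X, y, smooth=1e-3):
    # Gaussian Naive Bayes: per-class log prior, feature means and variances.
    model = {}
    for c in set(y):
        pts = [x for x, lab in zip(X, y) if lab == c]
        cols = list(zip(*pts))
        means = [sum(col) / len(pts) for col in cols]
        var = [sum((v - m) ** 2 for v in col) / len(pts) + smooth
               for col, m in zip(cols, means)]
        model[c] = (math.log(len(pts) / len(X)), means, var)
    return model

def nb_predict(model, x):
    def log_post(c):
        prior, means, var = model[c]
        return prior + sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                           for xi, m, v in zip(x, means, var))
    return max(model, key=log_post)

def edit_candidates(cands, labels, XL, yL, k=3):
    # Assumed editing rule: keep candidate i only if the majority label of its k
    # nearest neighbors (labeled samples plus the other candidates) matches.
    kept = []
    for i, (x, lab) in enumerate(zip(cands, labels)):
        pool = list(zip(XL, yL)) + [(c, l) for j, (c, l) in
                                    enumerate(zip(cands, labels)) if j != i]
        nearest = sorted(pool, key=lambda p: dist(x, p[0]))[:k]
        vote = Counter(l for _, l in nearest).most_common(1)[0][0]
        if vote == lab:
            kept.append(i)
    return kept

def self_train(XL, yL, XU, threshold=0.8, max_iter=10):
    XL, yL, XU = list(XL), list(yL), list(XU)
    for _ in range(max_iter):
        if not XU:
            break
        cents = seeded_clustering(XL, yL, XU)
        # Step 1: select unlabeled samples with high cluster membership.
        idx = [i for i, x in enumerate(XU) if max_membership(x, cents) >= threshold]
        if not idx:
            break
        # Step 2: label the selected samples with Naive Bayes.
        model = nb_fit(XL, yL)
        cands = [XU[i] for i in idx]
        labels = [nb_predict(model, x) for x in cands]
        # Step 3: data editing removes candidates that look mislabeled.
        kept = edit_candidates(cands, labels, XL, yL)
        if not kept:
            break
        XL += [cands[j] for j in kept]
        yL += [labels[j] for j in kept]
        moved = {idx[j] for j in kept}
        XU = [x for i, x in enumerate(XU) if i not in moved]
    return nb_fit(XL, yL), XL, yL

# Toy data: two well-separated groups plus one ambiguous point in between.
XL = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
yL = [0, 0, 1, 1]
XU = [[0.1, 0.2], [0.3, 0.0], [4.8, 5.2], [5.2, 5.1], [2.5, 2.5]]
model, X_out, y_out = self_train(XL, yL, XU)
print(len(X_out), nb_predict(model, [0.1, 0.1]), nb_predict(model, [5.0, 5.0]))
```

On this toy data the four unlabeled points close to a cluster are absorbed into the labeled set, while the midpoint `[2.5, 2.5]` never reaches the assumed membership threshold and stays unlabeled. Selecting by cluster membership rather than by classifier confidence is the point the abstract emphasizes; the editing step here only illustrates the idea of using both labeled and newly labeled samples as neighbors.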

English keywords:

self-training; semi-supervised learning; semi-supervised clustering; data editing; nearest neighbor

Authors:

Lü Jia (吕佳), Li Junnan (黎隽男)

Affiliation:

College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China

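Keywords (original Chinese):

自训练 (self-training); 半监督学习 (semi-supervised learning); 半监督聚类 (semi-supervised clustering); 数据剪辑 (data editing); 最近邻 (nearest neighbor)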

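Funding:

Natural Science Foundation of Chongqing; Science and Technology Project of Chongqing Municipal Education Commission; Chongqing Research Project; Chongqing Normal University Research Project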

Grant numbers:

cstc2014jcyjA40011; KJ1400513; CYS17176; YKC17001

Year of publication:

2018

DOI:

10.11772/j.issn.1001-9081.2017071721

Journal of Computer Applications (计算机应用)

Chengdu Institute of Computer Applications, Chinese Academy of Sciences

CSTPCD; CSCD; Peking University Core Journals (北大核心)

Impact factor: 0.892

ISSN: 1001-9081

Year, volume (issue): 2018, 38(1)

Cited by: 5
References: 2