A Class-Based Cosine Distance Clustering Method for Missing Value Imputation
夏婷婷 1, 林康 2, 张潇予 3, 刘海忠 1
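Author information
- 1. Lanzhou Jiaotong University, Lanzhou, Gansu 730070
- 2. Beijing Normal University, Zhuhai, Guangdong 519087
- 3. School of Social and Behavioral Sciences, City University of Hong Kong, Lanzhou, Gansu 730070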
Abstract
[Purposes] To address the high-dimensionality problem that arises when similarity is computed with Euclidean distance, a class-based clustering missing value imputation method using cosine distance (CBC-IM-COS) is proposed. [Methods] First, the incomplete data set is divided into two groups (G1 and GIM). Second, the missing data in the GIM group are pre-filled using the cluster centers. Third, cosine distance is used to measure the similarity between records. Finally, the G1 record with the smallest distance is selected to fill the missing values. [Findings] Experimental results show that the proposed method outperforms other imputation methods on both categorical and mixed datasets. [Conclusions] The CBC-IM-COS method significantly improves accuracy, recall, F1-score, and imputation performance.
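The four steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: G1 is taken to be the set of complete records and GIM the set of records with at least one missing value; the "cluster center" is approximated by the per-class mean of the complete records; all features are numeric (the paper also covers categorical and mixed data, which this sketch does not handle); and the function names `cosine_distance` and `impute_class_based_cosine` are hypothetical.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cos(u, v); a zero vector is treated as maximally distant."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0.0:
        return 1.0
    return 1.0 - float(np.dot(u, v) / norm)


def impute_class_based_cosine(X, y):
    """Fill NaNs in X using the nearest complete record of the same class.

    Assumptions (not taken from the paper): numeric features only, and the
    class mean of the complete records stands in for the cluster center.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    filled = X.copy()
    missing = np.isnan(X)
    incomplete = missing.any(axis=1)  # True for rows assumed to be in GIM

    for label in np.unique(y):
        in_class = y == label
        g1 = X[in_class & ~incomplete]  # complete records of this class
        if len(g1) == 0:
            continue  # no donor available; leave this class's NaNs as-is
        center = g1.mean(axis=0)

        for i in np.flatnonzero(in_class & incomplete):
            # Step 2: pre-fill the missing entries with the class center.
            prefilled = np.where(missing[i], center, X[i])
            # Step 3: cosine distance to every complete record in the class.
            distances = [cosine_distance(prefilled, donor) for donor in g1]
            # Step 4: copy the missing entries from the nearest record.
            donor = g1[int(np.argmin(distances))]
            filled[i, missing[i]] = donor[missing[i]]
    return filled


# Toy data: rows 3 and 5 each have one missing value.
X_demo = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 1.0, 0.0],
    [0.0, 5.0, 1.0],
    [1.0, np.nan, 3.1],
    [10.0, 10.0, 10.0],
    [9.0, np.nan, 9.0],
])
y_demo = np.array([0, 0, 0, 0, 1, 1])
filled_demo = impute_class_based_cosine(X_demo, y_demo)
print(filled_demo)
```

Because the final value is copied from an observed record instead of being kept as the class mean, the pre-filled value only serves to make the distance computable; whether the paper restricts the donor search to the same class or cluster is also an assumption here.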
Keywords
incomplete data / missing value imputation / clustering / cosine distance
Publication year
2024