Analysis of a Multi-Dimensional Data De-Duplication Clustering Algorithm Based on Sparse Autoencoders

Xue Lixiang (薛丽香)¹, Gao Lijie (高丽杰)¹, Li Zhanbo (李占波)²

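Author information

1. School of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou, Henan 450064, China
2. Network Management Center, Zhengzhou University, Zhengzhou, Henan 450001, China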

Abstract

As the volume and variety of data continue to grow, high dimensionality and heavy duplication make it difficult to extract useful information from data sets. To address this problem, this paper proposes a multi-dimensional data clustering algorithm based on an improved sparse autoencoder. The algorithm has two main parts: data processing and clustering analysis. In the data processing stage, the greedy layer-wise principle of the S-SAE is first used to reduce the high-dimensional data set to groups of 6 dimensions each; a mapped-value matching mechanism then cleans duplicate data from the reduced data set, replacing each cleaned value with 0. The processed data are then fed into the K-Means++ clustering algorithm for clustering analysis. Finally, the TS-SAE-K-Means++ multi-dimensional data clustering model is constructed, and its optimal parameter settings are obtained through optimization analysis. Simulation comparisons against different baseline algorithm combinations show that TS-SAE-K-Means++ outperforms the other combinations on both the clustering silhouette coefficient S and the model F1 metric. This indicates that the proposed algorithm has certain advantages in extracting useful information from high-dimensional data.
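The two stages after dimensionality reduction (duplicate cleaning, then K-Means++ clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the mapped-value matching mechanism, so the rounding-based key in `zero_out_duplicates` is hypothetical, as are all function names and parameters. The S-SAE reduction step is omitted; the inputs are assumed to be already-reduced vectors, and zeroed rows are passed to the clustering step unchanged.

```python
import random


def zero_out_duplicates(rows, decimals=3):
    """Keep the first occurrence of each row; replace repeats with zeros.

    Assumption: the "mapped value" is the row rounded to `decimals` places.
    The abstract does not specify the actual mapping.
    """
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(round(v, decimals) for v in row)
        if key in seen:
            cleaned.append([0.0] * len(row))  # cleaned values become 0
        else:
            seen.add(key)
            cleaned.append(list(row))
    return cleaned


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: draw each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, iters=50, seed=0):
    """Lloyd iterations started from K-Means++ seeds."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


rows = [[0.0, 0.1], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 5.0], [5.0, 5.1]]
cleaned = zero_out_duplicates(rows)
labels, centers = kmeans(cleaned, k=2)
```

In this toy input, the second and sixth rows repeat earlier rows and are replaced by zero vectors before clustering. How zeroed entries should be treated downstream (kept, masked, or dropped) is not stated in the abstract, so keeping them is only one possible reading.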

Keywords

Improved sparse autoencoder / Clustering algorithm / Evaluation metrics

Year of publication

2024

Computer Simulation (计算机仿真)

The 17th Research Institute of China Aerospace Science and Industry Corporation

CSTPCD

Impact factor: 0.518

ISSN: 1006-9348

Number of references: 8