Simulation of Big Data Random Walk Sampling Considering Attribute Correlation

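Chinese abstract: Big data is typically non-random, and at large data volumes sampling bias can arise: some groups may be oversampled or undersampled, which lowers the accuracy of results. To address this, a random walk sampling algorithm for big data based on attribute correlation is proposed. The neighborhood relation matrix of the data is obtained, and a single-attribute neighborhood relation matrix is derived using a sorting-based approach. The neighborhood relation matrices of the different attributes are then computed and used to calculate the correlation between attributes, which yields the attribute reduction result. Interval density similarity is used to adjust the intervals and build a variable grid space, and the grid space is combined with a density-biased sampling algorithm to complete the random walk sampling. Simulation experiments show that the proposed algorithm substantially improves sample quality, consumes noticeably less energy (at most 280 Wh), and yields more accurate random walk sampling results.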
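The attribute reduction step can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the neighborhood relation is taken to be `|x_i - x_j| <= delta`, the correlation between two attributes is taken to be the Jaccard overlap of their relation matrices, and the reduction rule is a simple greedy filter. The names `neighborhood_matrix`, `relation_similarity`, `reduce_attributes`, `delta`, and `threshold` are hypothetical and do not come from the paper.

```python
import numpy as np

def neighborhood_matrix(values, delta):
    """Boolean matrix N with N[i, j] True when |values[i] - values[j]| <= delta."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]
    # After sorting, the neighbours of each value form a contiguous range [lo, hi).
    lo = np.searchsorted(sorted_vals, sorted_vals - delta, side="left")
    hi = np.searchsorted(sorted_vals, sorted_vals + delta, side="right")
    n = len(values)
    matrix = np.zeros((n, n), dtype=bool)
    for rank, i in enumerate(order):
        matrix[i, order[lo[rank]:hi[rank]]] = True
    return matrix

def relation_similarity(m1, m2):
    """Jaccard overlap of two relation matrices (assumed correlation measure)."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union

def reduce_attributes(data, delta, threshold):
    """Greedy filter (assumed rule): keep an attribute only if its similarity
    to every attribute already kept is below the threshold."""
    data = np.asarray(data, dtype=float)
    matrices = [neighborhood_matrix(data[:, k], delta) for k in range(data.shape[1])]
    kept = []
    for k, m in enumerate(matrices):
        if all(relation_similarity(m, matrices[j]) < threshold for j in kept):
            kept.append(k)
    return kept

# Column 1 is column 0 shifted by a constant, so it induces the same relation.
data = np.array([
    [0, 5, 0],
    [1, 6, 10],
    [2, 7, 1],
    [10, 15, 11],
    [11, 16, 2],
    [12, 17, 12],
])
kept_columns = reduce_attributes(data, delta=2.5, threshold=0.9)
```

Sorting each attribute once lets the neighbours of every value be located with binary search rather than by comparing all pairs, which is one plausible reading of the "sorting-based approach" mentioned above.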

English title: Simulation of Big Data Random Walkthrough Sampling Considering Attribute Correlation

English abstract: Big data is typically non-random, and sampling bias may arise in massive data sets: some groups may be oversampled or undersampled, leading to low accuracy in results. To address this, a random walk sampling algorithm for big data based on attribute correlation was proposed. First, after the neighborhood relation matrix of the data was obtained, the single-attribute neighborhood relation matrix was derived based on a sorting approach. Second, the neighborhood relation matrices of the different attributes and the correlation between attributes were calculated to obtain the attribute reduction results. Third, interval density similarity was used to adjust the intervals and construct a variable grid space. Finally, the grid space and a density-biased sampling algorithm were combined to complete the random walk sampling. The simulation results show that the algorithm significantly improves sample quality, that its energy consumption is noticeably lower, with a maximum of only 280 Wh, and that it obtains more accurate random walk sampling results for big data.
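The variable grid and density-biased sampling steps can be sketched in Python as follows. This is a one-dimensional illustration under stated assumptions, not the authors' algorithm: the merge rule (adjacent equal-width bins are merged when the ratio of their densities is at least `sim_threshold`), the weighting `n_cell ** (exponent - 1)`, and the names `variable_grid_edges`, `density_biased_sample`, `sim_threshold`, and `exponent` are all hypothetical. The sketch covers only the grid and sampling steps and does not model a walk over the data.

```python
import numpy as np

def variable_grid_edges(x, n_bins=10, sim_threshold=0.8):
    """Merge adjacent equal-width bins whose densities are similar (assumed rule)."""
    counts, edges = np.histogram(x, bins=n_bins)
    merged_edges = [edges[0]]
    run_count, run_bins = counts[0], 1
    for k in range(1, n_bins):
        run_density = run_count / run_bins  # mean count per bin in the current run
        hi = max(run_density, counts[k])
        lo = min(run_density, counts[k])
        if hi == 0 or lo / hi >= sim_threshold:
            run_count += counts[k]
            run_bins += 1
        else:
            merged_edges.append(edges[k])
            run_count, run_bins = counts[k], 1
    merged_edges.append(edges[-1])
    return np.array(merged_edges)

def density_biased_sample(x, edges, sample_size, exponent=0.5, rng=None):
    """Draw indices without replacement, weighting each point by n_cell**(exponent - 1)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    # Cell index of every point; the last edge is inclusive, as in np.histogram.
    cell = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    counts = np.bincount(cell, minlength=len(edges) - 1)
    # exponent = 1 gives uniform sampling; exponent < 1 favours points in sparse cells.
    weights = counts[cell].astype(float) ** (exponent - 1.0)
    probs = weights / weights.sum()
    return rng.choice(len(x), size=sample_size, replace=False, p=probs)

# A dense group of 900 points and a sparse group of 100 points.
x = np.concatenate([np.linspace(0.1, 0.9, 900), np.linspace(9.1, 10.0, 100)])
edges = variable_grid_edges(x, n_bins=10, sim_threshold=0.8)
idx = density_biased_sample(x, edges, sample_size=200, exponent=0.5, rng=0)
sparse_share = float((x[idx] > 5).mean())
```

With `exponent` below 1, the sparse group makes up a larger share of the sample than its 10% share of the data, which is the kind of correction for undersampled groups that the abstract motivates.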

English keywords:

Attribute correlation; Big data; Random walk; Sampling

Authors:

Xie Chaoqun (谢超群), You Wenhui (游文辉)

Affiliation:

Fujian University of Traditional Chinese Medicine, Fuzhou, Fujian 350122, China

Keywords:

Attribute correlation; Big data; Random walk; Sampling

Publication year:

2024

Computer Simulation (计算机仿真)

The 17th Research Institute of China Aerospace Science and Industry Corporation

CSTPCD

Impact factor: 0.518

ISSN: 1006-9348

Year, volume (issue): 2024, 41(9)