考虑属性相关度的大数据随机游走抽样仿真
Simulation of Big Data Random Walkthrough Sampling Considering Attribute Correlation
Xie Chaoqun (谢超群) 1, You Wenhui (游文辉) 1
Author information
- 1. Fujian University of Traditional Chinese Medicine, Fuzhou, Fujian 350122
Abstract
Big data is typically non-random, and sampling bias can arise at large data volumes: some groups may be oversampled or undersampled, which lowers the accuracy of the results. To address this, a random-walk sampling algorithm for big data based on attribute correlation is proposed. First, the neighborhood relation matrix of the data is obtained, and the single-attribute neighborhood relation matrices are derived using a sorting-based approach. The neighborhood relation matrices of the different attributes are then computed, and the correlation between attributes is calculated to obtain the attribute reduction result. Next, interval density similarity is used to adjust the intervals and build a variable grid space. Finally, the grid space is combined with a density-biased sampling algorithm to complete the big data random-walk sampling. Simulation experiments show that the proposed algorithm substantially improves sample quality and consumes noticeably less energy, with a maximum of only 280 Wh, yielding more accurate random-walk sampling results for big data.
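The final step of the abstract, combining a grid space with density-biased sampling, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a fixed-width grid instead of the variable grid built from interval density similarity, it omits the attribute reduction and random-walk stages, and the function name `density_biased_sample` and its parameters `cell_width` and `exponent` are hypothetical. It only shows the generic idea that a point in a cell holding n points is kept with probability proportional to n^(-exponent), so dense regions are not oversampled.

```python
import random
from collections import defaultdict


def density_biased_sample(points, cell_width, sample_size, exponent=0.5, seed=0):
    """Generic density-biased sampling over a fixed-width grid (illustrative only).

    exponent = 0 reduces to uniform sampling; exponent = 1 gives every
    occupied cell the same expected share of the sample.
    """
    rng = random.Random(seed)

    # Assign each point to a grid cell (assumption: equal-width cells).
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(p)

    # Normaliser chosen so the expected sample size equals sample_size
    # (before clipping probabilities at 1).
    total = sum(len(members) ** (1.0 - exponent) for members in cells.values())

    sample = []
    for members in cells.values():
        keep_prob = min(1.0, sample_size * len(members) ** (-exponent) / total)
        sample.extend(p for p in members if rng.random() < keep_prob)
    return sample
```

With `exponent=1.0`, a cell of 1000 points and a cell of 10 points receive the same expected number of samples, which is the kind of correction for over- and under-sampled groups that the abstract describes.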
Key words
Attribute correlation / Big data / Random walk / Sampling
Publication year
2024