

First Filling Strategy-Based Partitioning Method to Balance Data in Spark
Spark is a distributed big data processing framework based on in-memory computing, with the advantages of fast execution and strong versatility. When performing computation tasks, Spark's default partitioner, HashPartitioner, tends to produce unbalanced amounts of data across partitions when the data is skewed, which results in low resource utilization and poor operating efficiency. Most existing Spark balanced partitioning methods, such as multi-stage partitioning, migration partitioning, and sampling partitioning, suffer from defects including difficult scale control, high communication overhead, and excessive dependence on sampling. To address these problems, we propose a partitioning method based on a first filling strategy, which considers the allocation of both sample data and non-sample data so as to achieve balanced partitioning of all the data. After sampling the data and estimating the weight of each key from the sample information, the method sorts the keys in descending order of weight and assigns each key in turn to the earliest partition that still satisfies the partition tolerance, reserving the space of the last partition for keys that were not sampled; this yields the partitioning plan for the sample data. Spark then partitions the data whose keys appear in the sample according to this plan, while data whose keys do not appear is mapped directly to the last available partition. Experimental results show that the new method effectively achieves balanced partitioning of Spark data. On real airline datasets released by the U.S. Bureau of Transportation Statistics, the total running time of the first filling partitioner (FFP) designed on the basis of the proposed method is on average 15.3% shorter than that of HashPartitioner, and on average 38.7% and 30.2% shorter than those of the balanced Spark data partitioner and the hash-based key reassigning partitioner, respectively.
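
To make the first-fill procedure described in the abstract concrete, the following Scala sketch shows one possible way to express such a plan as a custom Spark Partitioner. It is only an illustrative reading of the abstract, not the authors' FFP implementation: the class name FirstFillPartitioner, the sampleWeights map of estimated key weights, and the tolerance factor standing in for the paper's partition tolerance are assumptions introduced here, and the sampling and weight-estimation step is omitted.

import org.apache.spark.Partitioner

// Illustrative sketch only (assumed names, not the paper's FFP implementation).
// Keys seen in the sample are assigned, heaviest first, to the earliest partition
// whose load stays within the tolerance; the last partition is reserved for keys
// that never appeared in the sample.
class FirstFillPartitioner(
    override val numPartitions: Int,
    sampleWeights: Map[Any, Double],  // estimated weight of each sampled key
    tolerance: Double = 1.2           // assumed stand-in for the partition tolerance
) extends Partitioner {
  require(numPartitions >= 2, "need at least one fillable partition plus the reserved one")

  // Ideal load per fillable partition (the last partition is kept in reserve).
  private val ideal = sampleWeights.values.sum / (numPartitions - 1)

  // Greedy first-fill plan: heaviest key first, first partition that still fits.
  private val plan: Map[Any, Int] = {
    val load = Array.fill(numPartitions - 1)(0.0)
    sampleWeights.toSeq.sortBy { case (_, w) => -w }.map { case (key, w) =>
      val idx = load.indices
        .find(i => load(i) + w <= ideal * tolerance)
        .getOrElse(load.indices.minBy(i => load(i))) // fall back to the lightest partition
      load(idx) += w
      key -> idx
    }.toMap
  }

  // Sampled keys follow the plan; unsampled keys map to the reserved last partition.
  override def getPartition(key: Any): Int =
    plan.getOrElse(key, numPartitions - 1)
}

A pair RDD could then be repartitioned with rdd.partitionBy(new FirstFillPartitioner(...)), with the weight map supplied by whatever sampling scheme is used in practice.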

balanced partitioning; first fill strategy; data skew; Spark operator; big data

何玉林 (HE Yulin)、吴东彤 (WU Dongtong)、Philippe Fournier-Viger、黄哲学 (HUANG Zhexue)


Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, Guangdong, China

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China


Shenzhen Science and Technology Major Project; Guangdong Provincial Natural Science Foundation (General Program); Shenzhen Basic Research Program (General Project); Guangdong Basic and Applied Basic Research Foundation, Guangdong-Shenzhen Joint Fund (Key Project)

202302D074; 2023A1515011667; JCYJ20210324093609026; 2023B1515120020

2024

电子学报 (Acta Electronica Sinica)
Chinese Institute of Electronics


Indexed by: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.237
ISSN: 0372-2112
Year, Volume (Issue): 2024, 52(10)