First Filling Strategy-Based Partitioning Method to Balance Data in Spark
Spark is a distributed big data processing framework based on in-memory computing, with the advantages of fast execution and strong versatility. When running a computation task, Spark's default partitioner, HashPartitioner, tends to produce data skew among partitions, which results in low resource utilization and poor operating efficiency. Most existing balanced partitioning methods for Spark, such as multi-stage partitioning, migration partitioning, and sampling partitioning, suffer from difficulty in controlling scale, high communication overhead, and excessive dependence on sampling. To solve these problems, we propose a partitioning method based on a first filling strategy, which considers the allocation of sample data and non-sample data at the same time in order to achieve balanced data partitioning. After sampling the data and estimating the weight of each key from the sample information, the keys are sorted in descending order of weight. Each key in turn is assigned to one of the preceding partitions if adding it satisfies the partition tolerance, while the space of the last partition is reserved for keys that were not sampled; this yields the partitioning plan for the sample data. Spark then partitions the data of keys that appear in the sample according to the partitioning plan, and the data of keys that do not appear in the sample is allocated directly to the last available data partition. The experimental results show that the new method can effectively achieve balanced partitioning of Spark data. On real datasets from the Bureau of Transportation Statistics, the total running time of the first filling partitioner (FFP), designed based on the proposed method, is 15.3% shorter on average than that of HashPartitioner. In addition, FFP's total running time is on average 38.7% shorter than that of the balanced Spark data partitioner and 30.2% shorter than that of the hash-based key reassigning partitioner.
balanced partitioning; first fill strategy; data skew; Spark operator; big data
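The first filling strategy summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' FFP implementation: the function names `first_fill_plan` and `partition_for`, the use of sample counts as key weights, the capacity formula (an even share of the sampled weight scaled by a `tolerance` factor), and the fallback for keys that fit in none of the preceding partitions are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def first_fill_plan(sample_keys, num_partitions, tolerance=1.1):
    """Build a key -> partition plan from sampled keys (hypothetical helper)."""
    if num_partitions < 2:
        raise ValueError("need at least two partitions")

    # Assumption: a key's weight is its frequency in the sample.
    weights = Counter(sample_keys)

    # Assumption: capacity is an even share of the sampled weight,
    # relaxed by the tolerance factor.
    capacity = tolerance * sum(weights.values()) / num_partitions

    loads = [0.0] * num_partitions
    last = num_partitions - 1
    plan = {}

    # Visit keys in descending order of weight (ties broken by key name).
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, weight in ordered:
        # First of the preceding partitions (all but the last) that can
        # take the key without exceeding the capacity.
        target = next(
            (p for p in range(last) if loads[p] + weight <= capacity),
            None,
        )
        if target is None:
            # Assumed fallback: use the least loaded partition overall.
            target = min(range(num_partitions), key=loads.__getitem__)
        plan[key] = target
        loads[target] += weight
    return plan, loads


def partition_for(key, plan, num_partitions):
    """Sampled keys follow the plan; unsampled keys go to the last partition."""
    return plan.get(key, num_partitions - 1)
```

In PySpark, a function of this shape could plausibly be supplied as the `partitionFunc` argument of `RDD.partitionBy`, for example `rdd.partitionBy(n, lambda k: partition_for(k, plan, n))`, with the plan built beforehand from a sample of the keys.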