Relational Query Optimization Method Based on Data Skew
Guo Kaiwei (郭开卫) 1, Wang Yingzhuo (王颖卓) 1, Wang Yaxiong (王亚雄) 1
Abstract
In distributed batch processing of big data, data skew is a problem that big data developers frequently encounter. Data skew refers to an uneven distribution of data that causes the volume of data processed by each data node to differ greatly, producing a "long tail" phenomenon. It is common in distributed data processing systems, and its main cause is the uneven distribution of key values in the data. During parallel computation, a large number of records with the same key are assigned to a single host for processing, leading to a "one machine busy, cluster idle" situation. This violates the original intention and design principles of parallel computing, lowers its overall efficiency, and can even cause memory overflow.
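The mechanism described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `partition_counts`, the modulo partitioning rule, the four partitions, and the synthetic workload are all assumptions chosen only to show how one hot key overloads a single partition.

```python
from collections import Counter


def partition_counts(keys, num_partitions):
    """Count how many records each partition receives when a record
    is routed to partition (key mod num_partitions)."""
    counts = Counter()
    for key in keys:
        counts[key % num_partitions] += 1
    # Return counts in partition order, including empty partitions.
    return [counts[p] for p in range(num_partitions)]


# Hypothetical workload: 1,000 records, 900 of which share the hot key 7.
skewed_keys = [7] * 900 + list(range(100))
# Baseline: 1,000 records with distinct, evenly spread keys.
uniform_keys = list(range(1000))

skewed = partition_counts(skewed_keys, 4)    # [25, 25, 25, 925]
uniform = partition_counts(uniform_keys, 4)  # [250, 250, 250, 250]

print("skewed :", skewed)
print("uniform:", uniform)
```

Under these assumptions, the partition that owns key 7 receives 925 of the 1,000 records while the other three receive 25 each, which corresponds to the "one machine busy, cluster idle" situation.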
Key words
big data/distributed/batch data/data skew
Publication year
2024