Relational Query Optimization Method Based on Data Skew

In distributed batch processing of big data, data skew is a problem that big data developers frequently encounter. Data skew means that data is unevenly distributed, so the volume of data handled by each node differs widely and a "long tail" appears. The phenomenon is common in distributed data processing systems, and its main cause is an uneven distribution of key values in the data. During parallel computation, a large number of records sharing the same key are assigned to a single host, producing a "one machine busy, cluster idle" situation. This defeats the purpose and design principles of parallel computing, lowers its overall efficiency, and can even cause memory overflow.
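The effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the function names `partition_load` and `skew_ratio`, the use of integer keys, and the modulo rule standing in for a real hash partitioner are all assumptions made for illustration.

```python
from collections import Counter


def partition_load(keys, num_partitions):
    """Return the number of records each partition receives.

    Assumption: `key % num_partitions` stands in for the hash partitioner
    of a real distributed engine; keys are non-negative integers.
    """
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(load):
    """Ratio of the busiest partition to the average partition size."""
    mean = sum(load) / len(load)
    return max(load) / mean


# Hypothetical data set: one hot key (0) repeated 9000 times,
# plus 1000 distinct keys that each appear once.
keys = [0] * 9000 + list(range(1, 1001))

load = partition_load(keys, 4)
print(load)               # records per partition
print(skew_ratio(load))   # values far above 1.0 indicate skew
```

In this hypothetical data set every record with the hot key lands on the same partition, so one worker carries most of the load while the others finish early, which is the "one machine busy, cluster idle" pattern.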

Keywords: big data; distributed; batch data; data skew

Guo Kaiwei (郭开卫), Wang Yingzhuo (王颖卓), Wang Yaxiong (王亚雄)


China UnionPay, Shanghai 201201


2024

数码设计


ISSN:1672-9129
Year, Volume (Issue): 2024.(3)