Relational Query Optimization Method Based on Data Skew (基于数据倾斜的关联查询优化方法)

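Abstract (translated from Chinese): In distributed batch processing of big data, data skew is a problem that big data developers frequently encounter. Data skew refers to an uneven distribution of data that makes the volume handled by each data node differ widely, producing a "long tail" effect. It is common in distributed data processing systems, and its main cause is an uneven distribution of key values in the data. During parallel computation, a large number of records sharing the same key are assigned to a single host, which leads to a "one machine busy, cluster idle" situation. This runs against the original intent and design principles of parallel computing, lowers its overall efficiency, and can even cause out-of-memory failures.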

English title: Relational Query Optimization Method Based on Data Skew

English abstract: In distributed batch processing of big data, data skew is often encountered by big data developers. Data skew is a "long tail" phenomenon caused by an uneven distribution of data, which results in large differences in the amount of data processed by each data node. It is common in distributed data processing systems, and its main cause is the uneven distribution of key values in the data. In parallel computing, a large number of identical keys are allocated to a single host for processing, resulting in a "busy single machine, idle cluster" situation. This violates the original intention and design principles of parallel computing, lowers the overall efficiency of the computation, and can even cause memory overflow.
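The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' method; it only shows how hash partitioning sends every record with the same key to one partition, so a single hot key overloads one node. The CRC32 partitioner, the function names, and the workload (one hot key with 900 records, 100 other keys with one record each, 4 partitions) are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions):
    """Return the number of records routed to each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(loads):
    """Largest partition load divided by the mean load (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical workload: one hot key with 900 records plus 100 keys with one record each.
keys = ["hot_key"] * 900 + [f"key_{i}" for i in range(100)]
loads = partition_loads(keys, num_partitions=4)
print("records per partition:", loads)
print("skew ratio:", round(skew_ratio(loads), 2))
```

Because all 900 records of the hot key hash to the same partition, that partition holds at least 90% of the data while the other three share the remainder, which is the "busy single machine, idle cluster" pattern in miniature.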

English keywords:

big data; distributed; batch data; data skew

Authors:

郭开卫 (Guo Kaiwei), 王颖卓 (Wang Yingzhuo), 王亚雄 (Wang Yaxiong)

Author affiliation:

China UnionPay, Shanghai 201201

Keywords:

big data; distributed; batch data; data skew

Publication year:

2024

Journal: 数码设计

ISSN: 1672-9129

Year, volume (issue): 2024.(3)