Relational Query Optimization Method Based on Data Skew
In distributed batch processing of big data, developers frequently encounter data skew. Data skew is a "long tail" phenomenon caused by an uneven distribution of data, which leads to large differences in the amount of data processed by each node. It is common in distributed data processing systems, and its main cause is the uneven distribution of key values in the data. In parallel computing, a large number of records sharing the same key are assigned to a single host, producing a situation in which one machine is busy while the rest of the cluster sits idle. This violates the original intention and design principles of parallel computing, reduces its overall efficiency, and can even cause memory overflow.
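
The effect described above can be illustrated with a minimal sketch, which is not taken from the paper. It assumes records are routed to nodes by hashing their key, a common scheme in distributed engines. The workload, the key names, the choice of CRC32 as the hash, and the helper functions `partition_of`, `partition_loads`, and `skew_ratio` are all hypothetical and serve only to show how one dominant key overloads a single partition.

```python
import zlib


def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition by hashing (assumed routing scheme)."""
    # CRC32 is used instead of the built-in hash() so results are
    # reproducible across Python processes.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(keys, num_partitions: int):
    """Count how many records land on each partition."""
    loads = [0] * num_partitions
    for key in keys:
        loads[partition_of(key, num_partitions)] += 1
    return loads


def skew_ratio(loads) -> float:
    """Ratio of the busiest partition's load to the mean load."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Hypothetical workload: one "hot" key makes up 90% of 10,000 records.
skewed_keys = ["hot_key"] * 9000 + [f"key_{i}" for i in range(1000)]
# Baseline workload: 10,000 records, every key distinct.
uniform_keys = [f"key_{i}" for i in range(10000)]

skewed_loads = partition_loads(skewed_keys, 8)
uniform_loads = partition_loads(uniform_keys, 8)

print("skewed loads :", skewed_loads, "ratio:", round(skew_ratio(skewed_loads), 2))
print("uniform loads:", uniform_loads, "ratio:", round(skew_ratio(uniform_loads), 2))
```

In the skewed workload every record with the hot key hashes to the same partition, so that partition holds at least 9,000 of the 10,000 records no matter how many nodes are added. This is the "busy single machine and idle cluster" situation in miniature.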