排序算法是计算机科学领域的一个基础算法,是大量应用的算法核心.在大数据时代,随着数据量的极速增长,并行排序算法受到广泛关注.现有的并行排序算法普遍存在通信开销过大、负载不均衡等问题,导致算法难以大规模扩展.针对以上问题,提出一种大规模可扩展的正则采样并行排序(scalable parallel sorting by regular sampling,ScaPSRS)算法,摒弃传统正则采样并行排序(parallel sorting by regular sampling,PSRS)算法中由一个进程负责采样的做法,转而让所有进程参与正则采样,选出p-1个分隔元素,将整个数据集划分成p个不相交的子集,然后实施并行排序,避免了单一进程的采样瓶颈.此外,ScaPSRS采用一种新的迭代更新策略选择p-1个分隔元素,保证划分的p个子集尽可能大小相同,从而确保p个进程对各自的子集进行本地排序时的负载均衡.在天河二号超级计算机上进行的大量实验表明,ScaPSRS算法能够成功地扩展到32000个内核,性能比PSRS算法和Hofmann等人提出的分区算法分别提升了3.7倍和11.7倍.
A scalable parallel sorting algorithm by regular sampling for big data
Sorting is one of the basic algorithms in computer science, and has been extensively used in a variety of applications. In the big data era, as the volumes of data increase rapidly, parallel sorting has attracted much attention. Existing parallel sorting algorithms suffered from excessive communication overhead and load imbalance, making it difficult to scale massively. To solve above problems, a scalable parallel algorithm sorting by regular sampling (ScaPSRS) was proposed, which sampled the p-1 pivot elements to divide the entire data set into p disjoint subsets by all parallel processes, rather than by only one given process as PSRS did. Furthermore, ScaPSRS adopted a novel iterative update strategy of selecting pivots to guarantee that the workloads and data were evenly scheduled among the parallel processes, thus ensuring superior overall performance. A variety of experiments conducted on the Tianhe-Ⅱ supercomputer demonstrated that ScaPSRS succeeded in scaling to 32000 cores and outperformed state-of-the-art works significantly.
parallel sortingregular samplingload balancebig data