Journal information
IEEE transactions on parallel and distributed systems
Institute of Electrical and Electronics Engineers
ISSN: 1045-9219
Indexed in: SCI, ISTP, EI
Officially published

    Guest Editorial: Special Section on SC22 Student Cluster Competition

    Omer Rana, Josef Spillner, Stephen Leak, Gerald F. Lofstead II, et al.
    pp. 803-803

    Productivity, Portability, Performance, and Reproducibility: Data-Centric Python

    Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, et al.
    pp. 804-820
    Abstract: Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High-Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. This work presents a workflow that retains Python’s high productivity while achieving portable performance across different architectures. The workflow’s key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes. Our benchmarks were reproduced in the Student Cluster Competition (SCC) during the Supercomputing Conference (SC) 2022. We present and discuss the student teams’ results.
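    The scaling-efficiency figure quoted above follows the standard strong-scaling formula; the sketch below (plain Python, with purely illustrative timings, not numbers from the paper) shows how such an efficiency is computed.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Strong-scaling efficiency: measured speedup over ideal linear speedup.

    t_base: runtime on n_base nodes; t_n: runtime on n nodes,
    for the same total problem size.
    """
    speedup = t_base / t_n
    ideal = n / n_base
    return speedup / ideal

# Illustrative timings only (not from the paper):
# 100 s on 1 node, 0.25 s on 512 nodes -> 400x speedup vs. 512x ideal.
eff = strong_scaling_efficiency(100.0, 1, 0.25, 512)
print(f"{eff:.3%}")  # 78.125%
```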

    Reproducing Performance of Data-Centric Python by SCC Team From National Tsing Hua University

    Fu-Chiang Chang, En-Ming Huang, Pin-Yi Kuo, Chan-Yu Mou, et al.
    pp. 821-825
    Abstract: As part of the Student Cluster Competition at the SC22 conference, this work aims to reproduce the performance evaluations of the Data-Centric (DaCe) Python framework by leveraging the Intel MKL and NVIDIA CUDA interfaces. The evaluations are conducted on a single CPU-based node, NVIDIA A100 GPUs, and an eight-node cloud supercomputer. Our experimental results successfully reproduce the performance evaluations on our cluster. Additionally, we provide insightful analysis and propose effective methods for achieving higher performance when utilizing DaCe as an acceleration library.

    Critique of “Productivity, Portability, Performance: Data-Centric Python” by SCC Team From Zhejiang University

    Zihan Yang, Yi Chen, Kaiqi Chen, Xingjian Qian, et al.
    pp. 826-829
    Abstract: In SC’21, Alexandros Nikolaos Ziogas et al. proposed a Data-Centric Python workflow in their DaCe paper. DaCe provides high productivity, performance, and portability with language extensions and automatic optimizations. We reproduce the performance evaluation results from the paper on both CPU and GPU on the Azure CycleCloud cluster. We also reproduce the scaling results with up to 32 nodes and 64 processes. Our results show that the proposed workflow has outstanding performance and scalability on the provided cluster, in accordance with the SC paper.

    Critique of “Productivity, Portability, Performance: Data-Centric Python” by SCC Team From Sun Yat-sen University

    Han Huang, Tengyang Zheng, Tianxing Yang, Yang Ye, et al.
    pp. 830-834
    Abstract: In SC21, Ziogas et al. proposed Data-Centric (DaCe) Python. It attains high performance and portability, and further extends the original productivity of Python. This paper analyzes the reproducibility of the DaCe paper as part of the SC22 Student Cluster Competition (SCC). The reproduction experiments are conducted on Azure CycleCloud. Unlike the DaCe paper, we use AMD EPYC 7V73X processors for the CPU-based experiments. We successfully reproduce most of the results of the DaCe paper, and the remaining results are also explainable.

    Analysis and Reproducibility of “Productivity, Portability, Performance: Data-Centric Python”

    Christopher Lompa, Piotr Luczynski
    pp. 835-840
    Abstract: This report analyses the reproducibility of the results obtained in the NPBench (Ziogas et al. 2021) paper. We begin by providing the reader with some background information and a demonstration of the simplicity of DaCe. We then reproduce a subset of the results presented in the original paper, specifically the comparison of DaCe on CPU and GPU against NumPy, and its parallel efficiency in a distributed environment. For most benchmarks we show that we can obtain similar results on our machine; for some benchmarks, however, we cannot draw the same conclusion beyond reasonable doubt. The experimental runs were performed during the SC22 Student Cluster Competition in Dallas, TX.

    Reproducibility of the DaCe Framework on NPBench Benchmarks

    Anish Govind, Yuchen Jing, Stefanie Dao, Michael Granado, et al.
    pp. 841-846
    Abstract: DaCe is a framework for Python that claims to provide massive speedups, with C-like speeds, compared to existing high-performance Python frameworks (e.g., Numba or Pythran). In this work, we take a closer look at reproducing the NPBench work. We use performance results to confirm that NPBench achieves higher performance than NumPy in a variety of benchmarks, and we provide reasons why DaCe is not truly as portable as it claims to be, although with a small adjustment it can run anywhere.

    IoT-Dedup: Device Relationship-Based IoT Data Deduplication Scheme

    Yuan Gao, Liquan Chen, Jianchang Lai, Tianyi Wang, et al.
    pp. 847-860
    Abstract: The cyclical and continuous working characteristics of Internet of Things (IoT) devices generate large amounts of identical or similar data, which can significantly consume storage space. To solve this problem, various secure data deduplication schemes have been proposed. However, existing deduplication schemes perform deduplication based only on data similarity, ignoring the internal connections among devices, which makes them not directly applicable to parallel and distributed scenarios such as IoT. Furthermore, secure data deduplication leads to multiple users sharing the same encryption key, which may cause security issues. To this end, we propose a device relationship-based IoT data deduplication scheme that fully considers the characteristics of IoT data and the internal connections among devices. Specifically, we propose a device relationship prediction approach that obtains device collaborative relationships by clustering the topology of their communication graph, and classifies data types based on device relationships to achieve data deduplication at different security levels. Then, we design a similarity-preserving encryption algorithm so that the security level of the encryption key is determined by the data type, ensuring the security of the deduplicated data. In addition, two data deduplication methods, identical deduplication and similar deduplication, are designed to meet the privacy requirements of different data types, improving deduplication efficiency while preserving data privacy as much as possible. We evaluate the performance of our scheme on five real datasets, and the results show that it achieves favorable results in terms of both deduplication performance and computational cost.
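    The two deduplication modes named above (identical vs. similar) can be sketched with content hashing plus a toy similarity fingerprint. This is an illustrative stand-in, not the paper's similarity-preserving encryption scheme; all names below are hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    # Identical deduplication: exact-match fingerprint of the payload.
    return hashlib.sha256(data).hexdigest()

def coarse_fingerprint(readings, bucket=5):
    # Similar deduplication (toy): quantize sensor readings so that
    # near-identical records collapse to the same key. This merely
    # illustrates the idea; the paper uses similarity-preserving encryption.
    return tuple(round(r / bucket) for r in readings)

store = {}

def dedup_insert(key, payload):
    # Keep only the first copy seen for a given fingerprint.
    if key in store:
        return False          # duplicate: suppressed
    store[key] = payload
    return True               # new data: stored

# Two identical blobs dedup exactly; two similar readings dedup coarsely.
assert dedup_insert(content_hash(b"temp=21.0"), b"temp=21.0")
assert not dedup_insert(content_hash(b"temp=21.0"), b"temp=21.0")
assert dedup_insert(coarse_fingerprint([21.2, 48.9]), "sensor-A")
assert not dedup_insert(coarse_fingerprint([21.9, 49.3]), "sensor-B")
```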

    Courier: A Unified Communication Agent to Support Concurrent Flow Scheduling in Cluster Computing

    Zhaochen Zhang, Xu Zhang, Zhaoxiang Bao, Liang Wei, et al.
    pp. 861-876
    Abstract: As one of the pillars of cluster computing frameworks, coflow scheduling algorithms can effectively shorten the network transmission time of cluster computing jobs, thus reducing job completion times and improving execution performance. However, most existing coflow scheduling algorithms fail to consider the influence of concurrent flows, which can degrade their performance under a massive number of concurrent flows. To fill this gap, we propose a unified communication agent named Courier that minimizes the number of concurrent flows in cluster computing applications and is compatible with mainstream coflow scheduling approaches. To maintain the scheduling order given by the scheduling algorithms, Courier merges multiple flows between each pair of hosts into a unified flow and determines its order based on that of the original flows. In addition, to adapt to various types of topologies, Courier introduces a control mechanism that adjusts the number of flows while maintaining the scheduling order. Extensive large-scale trace-driven simulations show that Courier is compatible with existing scheduling algorithms and outperforms state-of-the-art approaches by about 30% under a variety of workloads and topologies.
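    The merging step described above can be sketched in a few lines: flows sharing a (source, destination) host pair are combined into one unified flow that inherits the earliest scheduling order among its members, so the scheduler's ordering is preserved. This is an illustrative reconstruction, not the authors' code; all names are hypothetical.

```python
from collections import defaultdict

# Each flow: (src_host, dst_host, size_bytes, sched_order).
# A lower sched_order means the coflow scheduler sends it earlier.
flows = [
    ("h1", "h2", 10, 3),
    ("h1", "h2", 20, 1),
    ("h1", "h3", 5, 2),
]

def merge_flows(flows):
    # Merge all flows sharing a (src, dst) pair into one unified flow;
    # the merged flow carries the summed size and the minimum (earliest)
    # order of its members.
    merged = defaultdict(lambda: [0, float("inf")])
    for src, dst, size, order in flows:
        entry = merged[(src, dst)]
        entry[0] += size
        entry[1] = min(entry[1], order)
    return {pair: (size, order) for pair, (size, order) in merged.items()}

print(merge_flows(flows))
# {('h1', 'h2'): (30, 1), ('h1', 'h3'): (5, 2)}
```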

    Libfork: Portable Continuation-Stealing With Stackless Coroutines

    Conor J. Williams, James Elliott
    pp. 877-888
    Abstract: Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation-stealing in traditional High Performance Computing (HPC) languages, where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present libfork, a wait-free fine-grained parallelism library combining coroutines with user-space, geometric segmented stacks. We show our approach achieves optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.
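    The fully-strict fork-join structure named above can be illustrated conceptually. The sketch below is plain Python using OS threads; Python cannot express the stackless-coroutine continuation-stealing that libfork implements in C++20, so this shows only the fork/join discipline (a parent forks children and must join them all before returning), and every name is ours.

```python
import threading

def fib_serial(n):
    # Plain recursive Fibonacci, used below the parallel cutoff.
    return n if n < 2 else fib_serial(n - 1) + fib_serial(n - 2)

def fib(n, out, cutoff=12):
    # Fully-strict fork-join: fork a child task for fib(n-1), compute
    # fib(n-2) in the current task, then join the child before
    # returning. The parent never outlives its children.
    if n <= cutoff:                       # serial cutoff limits thread count
        out[0] = fib_serial(n)
        return
    left = [0]
    child = threading.Thread(target=fib, args=(n - 1, left, cutoff))
    child.start()                         # fork
    right = [0]
    fib(n - 2, right, cutoff)             # continue in the parent
    child.join()                          # join: strict synchronization
    out[0] = left[0] + right[0]

res = [0]
fib(20, res)
print(res[0])  # 6765
```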