Preparing MPICH for exascale

The advent of exascale supercomputers heralds a new era of scientific discovery, yet it introduces significant architectural challenges that must be overcome for MPI applications to fully exploit their potential. Among these challenges is the adoption of heterogeneous architectures, particularly the integration of GPUs to accelerate computation. Additionally, the complexity of multithreaded programming models has become a critical factor in achieving performance at scale. The efficient utilization of hardware acceleration for communication, provided by modern NICs, is likewise essential for achieving low-latency, high-throughput communication in such complex systems. In response to these challenges, the MPICH library, a high-performance and widely used Message Passing Interface (MPI) implementation, has undergone significant enhancements. This paper presents four major contributions that prepare MPICH for the exascale transition. First, we describe a lightweight communication stack that leverages the advanced features of modern NICs to maximize hardware acceleration. Second, our work showcases a highly scalable multithreaded communication model that addresses the complexities of concurrent environments. Third, we introduce GPU-aware communication capabilities that optimize data movement in GPU-integrated systems. Finally, we present a new datatype engine aimed at accelerating the use of MPI derived datatypes on GPUs. These improvements in the MPICH library not only address the immediate needs of exascale computing architectures but also set a foundation for exploiting future innovations in high-performance computing. By embracing these new designs and approaches, MPICH-derived libraries from HPE Cray and Intel were able to achieve real exascale performance on OLCF Frontier and ALCF Aurora, respectively.

Message passing interface; MPI; HPC communication; HPC network; exascale MPI

Yanfei Guo, Ken Raffenetti, Hui Zhou, Pavan Balaji, Min Si, Abdelhalim Amer, Shintaro Iwasaki, Sangmin Seo, Giuseppe Congiu, Robert Latham, Lena Oden, Thomas Gillis, Rohit Zambre, Kaiming Ouyang, Charles Archer, Wesley Bland, Jithin Jose, Sayantan Sur, Hajime Fujita, Dmitry Durnov, Michael Chuvelev, Gengbin Zheng, Alex Brooks, Sagar Thapaliya, Taru Doodi, Maria Garazan, Steve Oyanagi, Marc Snir, Rajeev Thakur


Argonne National Laboratory, Lemont, IL, USA

Argonne National Laboratory, Lemont, IL, USA; Meta, Palo Alto, CA, USA

Argonne National Laboratory, Lemont, IL, USA; Cerebras Systems, Sunnyvale, CA, USA

Argonne National Laboratory, Lemont, IL, USA; Klaytn Foundation, Singapore

Argonne National Laboratory, Lemont, IL, USA; NVIDIA, Santa Clara, CA, USA

Argonne National Laboratory, Lemont, IL, USA; FernUniversitaet in Hagen, Hagen, Germany

NVIDIA, Santa Clara, CA, USA; University of California, Irvine, CA, USA

NVIDIA, Santa Clara, CA, USA; University of California, Riverside, CA, USA

Cornelis Networks, Chesterbrook, PA, USA; Intel Corporation, Santa Clara, CA, USA

Meta, Palo Alto, CA, USA; Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA; Microsoft, Redmond, WA, USA

NVIDIA, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA; Fastly, San Francisco, CA, USA

Intel Corporation, Santa Clara, CA, USA

Hewlett Packard Enterprise, Palo Alto, CA, USA

University of Illinois Urbana-Champaign, Urbana, IL, USA


2025

International journal of high performance computing applications

ISSN:1094-3420
Year, Volume (Issue): 2025, 39(2)