Journal information
International journal of high performance computing applications
Sage Publications Inc.
Quarterly

ISSN: 1094-3420

Indexed in: SCI, AHCI, ISTP
Officially published

    Exploiting mesh structure to improve multigrid performance for saddle-point problems

    Lukas Spies, Luke Olson, Scott MacLachlan
    pp. 211-229
    Abstract: In recent years, solvers for finite-element discretizations of linear or linearized saddle-point problems, like the Stokes and Oseen equations, have become well established. There are two main classes of preconditioners for such systems: those based on a block-factorization approach and those based on monolithic multigrid. Both classes of preconditioners have several critical choices to be made in their composition, such as the selection of a suitable relaxation scheme for monolithic multigrid. From existing studies, some insight can be gained as to what options are preferable in low-performance computing settings, but there are very few fair comparisons of these approaches in the literature, particularly for modern architectures, such as GPUs. In this paper, we perform a comparison between a Block-Triangular preconditioner and monolithic multigrid methods with the three most common choices of relaxation scheme - Braess-Sarazin, Vanka, and Schur-Uzawa. We develop a performant Vanka relaxation algorithm for structured-grid discretizations, which takes advantage of memory efficiencies in this setting. We detail the behavior of the various CUDA kernels for the multigrid relaxation schemes and evaluate their individual arithmetic intensity, performance, and runtime. Running a preconditioned FGMRES solver for the Stokes equations with these preconditioners allows us to compare their efficiency in a practical setting. We show that monolithic multigrid can outperform Block-Triangular preconditioning, and that using Vanka or Braess-Sarazin relaxation is most efficient. Even though multigrid with Vanka relaxation exhibits reduced performance on the CPU (up to 100% slower than Braess-Sarazin), it is able to outperform Braess-Sarazin by more than 20% on the GPU, making it a competitive algorithm, especially given the high amount of algorithmic tuning needed for effective Braess-Sarazin relaxation.
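    The block-triangular approach compared in this abstract can be sketched in a few lines of SciPy. The sketch below builds a toy saddle-point system K = [[A, B^T], [B, 0]] and applies a block-triangular preconditioner inside GMRES; the matrix sizes, the diag(A)-based Schur-complement approximation, and all variable names are illustrative choices, not the paper's actual configuration.

    ```python
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Toy saddle-point system K = [[A, B^T], [B, 0]]: a 1-D Laplacian stands
    # in for the velocity operator A, a random full-rank B for the divergence.
    rng = np.random.default_rng(0)
    n, m = 40, 10
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    B = sp.csr_matrix(rng.standard_normal((m, n)) / n)
    K = sp.bmat([[A, B.T], [B, None]], format="csc")

    # Block-triangular preconditioner P = [[A, B^T], [0, -S]], applied by
    # back-substitution; S approximates the Schur complement B A^{-1} B^T
    # with diag(A) in place of A.
    A_lu = spla.splu(A)
    S_lu = spla.splu((B @ sp.diags(1.0 / A.diagonal()) @ B.T).tocsc())

    def apply_prec(r):
        p = -S_lu.solve(r[n:])            # pressure block first
        u = A_lu.solve(r[:n] - B.T @ p)   # then velocity block
        return np.concatenate([u, p])

    M = spla.LinearOperator(K.shape, matvec=apply_prec)
    b = rng.standard_normal(n + m)
    x, info = spla.gmres(K, b, M=M, restart=n + m, maxiter=100)
    ```

    Monolithic multigrid, by contrast, replaces M with a full multigrid cycle on the coupled system, which is where the choice of relaxation scheme (Braess-Sarazin, Vanka, Schur-Uzawa) enters.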

    Algebraic temporal blocking for sparse iterative solvers on multi-core CPUs

    Christie Alappat, Jonas Thies, Georg Hager, Holger Fehske, ...
    pp. 230-250
    Abstract: Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches have been proposed to make these polynomial evaluations explicit (i.e., to fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms make it possible to evaluate the polynomial using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, i.e., A^p x, or a related operation. Recently, we showed that, using the level-based formulation of sparse matrix-vector multiplication in the Recursive Algebraic Coloring Engine (RACE) framework, we can perform temporal cache blocking of the MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3× on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain is reduced significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers.
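    The matrix power kernel itself is simple to state. Below is a minimal SciPy sketch of the naive formulation (without RACE's temporal cache blocking); the function name and test matrix are illustrative.

    ```python
    import numpy as np
    import scipy.sparse as sp

    def matrix_power_kernel(A, x, p):
        """Return [A x, A^2 x, ..., A^p x] by repeated SpMV (the naive MPK).

        Each power streams the entire matrix through memory once; RACE's
        temporal blocking instead reuses cached matrix rows across several
        powers, producing the same result with less memory traffic.
        """
        out, y = [], x
        for _ in range(p):
            y = A @ y
            out.append(y)
        return out

    A = sp.random(200, 200, density=0.05, format="csr", random_state=42)
    x = np.ones(200)
    powers = matrix_power_kernel(A, x, 3)
    ```

    s-step Krylov methods and polynomial preconditioners consume exactly such a sequence of vectors, which is why a faster MPK translates directly into solver speedups.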

    Advancements of PAPI for the exascale generation

    Heike Jagode, Anthony Danalis, Giuseppe Congiu, Daniel Barry, ...
    pp. 251-268
    Abstract: The Performance Application Programming Interface (PAPI) serves as a coherent, operating-system-independent interface for accessing performance counter data across a wide range of hardware and software components. PAPI can operate autonomously as a performance monitoring library and tool for application analysis. However, its true value emerges when it functions as a middleware for numerous third-party profiling, tracing, and sampling toolkits, establishing itself as a universal interface for hardware counter analysis. In this role, PAPI manages the intricacies of each hardware component, presenting a streamlined API to higher-level toolkits. Within the Exascale Computing Project (ECP), PAPI has expanded its capabilities in performance counter monitoring and incorporated support for power management across cutting-edge hardware and software technologies. This includes performance and power monitoring for AMD GPUs through integration with AMD ROCm and ROCm-SMI, Intel Ponte Vecchio GPUs via Intel's oneAPI Level Zero, and NVIDIA GPUs through the CUPTI Profiling API. Additionally, PAPI is compatible with interconnects, the latest CPUs, and ARM chips. These enhancements have been implemented while preserving the standard PAPI interface and methodology for utilizing low-level performance counters in CPUs, GPUs, on/off-chip memory, interconnects, and the I/O system, encompassing energy and power management. To strengthen PAPI's sustainability, ECP has facilitated its integration into Spack and E4S, ensuring software robustness through continuous integration and continuous deployment. In addition to hardware counter-based data, PAPI now supports the registration and monitoring of Software-Defined Events. This feature exposes the internal behavior of runtime systems and libraries such as PaRSEC, SLATE, and Magma to the applications that use them, broadening the scope of performance events to include software-based information. Additionally, PAPI has been expanded with the Counter Analysis Toolkit, aiding in native performance counter disambiguation through micro-benchmarks. These micro-benchmarks probe various essential aspects of modern chips, contributing to the classification of raw performance events. In summary, ECP has enabled PAPI to include comprehensive counter analysis capabilities, advanced performance and power monitoring support for exascale hardware components, and broadened the scope of performance events to encompass not only hardware-related metrics but also software-based information.
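    PAPI's core usage pattern is to create an event set, start it, run the code of interest, and read or stop the counters (in C, via PAPI_library_init, PAPI_create_eventset, PAPI_add_event, PAPI_start, PAPI_read, and PAPI_stop). The Python class below is only a toy stand-in for that pattern, with wall-clock nanoseconds substituting for hardware counter values; it is not a PAPI binding.

    ```python
    import time

    class EventSet:
        """Toy stand-in for a PAPI event set: start/read/stop around a kernel.

        Real PAPI code would add hardware events such as PAPI_TOT_INS to the
        set; here time.perf_counter_ns() substitutes for a counter value.
        """
        def __init__(self):
            self._t0 = None

        def start(self):
            self._t0 = time.perf_counter_ns()

        def read(self):
            return time.perf_counter_ns() - self._t0

        def stop(self):
            total = self.read()
            self._t0 = None
            return total

    ev = EventSet()
    ev.start()
    acc = sum(i * i for i in range(100_000))   # the "kernel" being measured
    elapsed_ns = ev.stop()
    ```

    Higher-level profilers layer on top of exactly this start/read/stop interface, which is what lets PAPI act as middleware.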

    UMap: An application-oriented user level memory mapping library

    Ivy Peng, Jacob Wahlgren, Karim Youssef, Keita Iwabuchi, ...
    pp. 269-282
    Abstract: Exploiting the prominent role of complex memories in exascale node architectures, the UMap page fault handler offers new capabilities to access large memory-mapped data sets directly. UMap provides flexible configuration options to customize page handling to each application, including analysis of massive observational and simulation data sets. The high-performance design features I/O decoupling, dynamic load balancing, and application-level controls. Page faults triggered by application threads and processes accessing data mapped to a UMap'ed region are handled via the Linux userfaultfd protocol, an asynchronous message-oriented kernel-user communication mechanism that avoids the context switch penalty of traditional signal fault handlers. UMap is fully open source. In this paper, we give an overview of the UMap library architecture, its extensible plugin architecture, and the use and performance of UMap in emerging heterogeneous memory hierarchies such as near-node Non-volatile Memory (NVM) and network attached memories. We highlight new capabilities in two page-fault management plugins, the NetworkStore and SparseStore. We demonstrate the integration between UMap and multiple ECP products including Caliper, Metall, ZFP, Mochi, and Ripples.
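    As a rough analogy to the memory-mapped access UMap provides, the sketch below maps a file with the standard mmap facility and touches a single page; UMap differs in that the page faults are serviced by a user-level handler through userfaultfd (with pluggable backing stores) rather than by the kernel's default file-backed paging. File name and sizes are arbitrary.

    ```python
    import mmap
    import os
    import tempfile

    # A sparse 1 MiB "dataset" file: no data is read until a page is touched.
    path = os.path.join(tempfile.mkdtemp(), "dataset.bin")
    with open(path, "wb") as f:
        f.truncate(1 << 20)

    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)   # map the whole file
        mm[4096:4101] = b"MAGIC"        # first touch faults in one page
        tag = bytes(mm[4096:4101])
        mm.close()
    ```

    In UMap's design, the equivalent of that fault would be delivered as a userfaultfd message to an application-level thread pool, which can fetch the page from NVM, the network, or a sparse store.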

    Preparing MPICH for exascale

    Yanfei Guo, Ken Raffenetti, Hui Zhou, Pavan Balaji, ...
    pp. 283-305
    Abstract: The advent of exascale supercomputers heralds a new era of scientific discovery, yet it introduces significant architectural challenges that must be overcome for MPI applications to fully exploit their potential. Among these challenges is the adoption of heterogeneous architectures, particularly the integration of GPUs to accelerate computation. Additionally, the complexity of multithreaded programming models has become a critical factor in achieving performance at scale. The efficient utilization of hardware acceleration for communication, provided by modern NICs, is also essential for achieving low latency and high throughput communication in such complex systems. In response to these challenges, the MPICH library, a high-performance and widely used Message Passing Interface (MPI) implementation, has undergone significant enhancements. This paper presents four major contributions that prepare MPICH for the exascale transition. First, we describe a lightweight communication stack that leverages the advanced features of modern NICs to maximize hardware acceleration. Second, our work showcases a highly scalable multithreaded communication model that addresses the complexities of concurrent environments. Third, we introduce GPU-aware communication capabilities that optimize data movement in GPU-integrated systems. Finally, we present a new datatype engine aimed at accelerating the use of MPI derived datatypes on GPUs. These improvements in the MPICH library not only address the immediate needs of exascale computing architectures but also set a foundation for exploiting future innovations in high-performance computing. By embracing these new designs and approaches, MPICH-derived libraries from HPE Cray and Intel were able to achieve real exascale performance on OLCF Frontier and ALCF Aurora respectively.
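    The job of a datatype engine is to pack a non-contiguous derived datatype (for example, an MPI_Type_vector layout) into a contiguous buffer for the wire, and to unpack it on the receiver. The NumPy sketch below illustrates that pack/unpack step on the host; the function names are hypothetical, and on a GPU the same gather/scatter would run as a single kernel over all blocks, which is what MPICH's new engine accelerates.

    ```python
    import numpy as np

    def pack_vector(buf, count, blocklength, stride):
        """Pack an MPI_Type_vector-style layout (count blocks of blocklength
        elements, stride elements apart) into a contiguous send buffer."""
        idx = (np.arange(count)[:, None] * stride
               + np.arange(blocklength)[None, :]).ravel()
        return buf[idx]

    def unpack_vector(packed, out, count, blocklength, stride):
        """Scatter a contiguous receive buffer back into the strided layout."""
        idx = (np.arange(count)[:, None] * stride
               + np.arange(blocklength)[None, :]).ravel()
        out[idx] = packed
        return out

    src = np.arange(20.0)
    # 4 blocks of 2 elements, stride 5: picks elements 0,1, 5,6, 10,11, 15,16
    packed = pack_vector(src, count=4, blocklength=2, stride=5)
    ```

    Doing this with one gather per message, instead of a loop over blocks, is what makes the operation amenable to GPU execution.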

    PETSc/TAO developments for GPU-based early exascale systems

    Richard Tran Mills, Mark F. Adams, Satish Balay, Jed Brown, ...
    pp. 306-325
    Abstract: The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy's Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well as the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.
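    The "efficient on-device matrix assembly" mentioned here revolves around COO-style assembly: element contributions are emitted as (row, col, value) triplets, and duplicate entries at shared degrees of freedom are summed in one pass, a pattern that maps well onto a single GPU kernel. The SciPy sketch below shows the same idea on the host, where conversion from COO likewise sums duplicates; the 1-D element stiffness example is illustrative, not PETSc code.

    ```python
    import numpy as np
    import scipy.sparse as sp

    # Element-by-element assembly in COO form: each 1-D linear element on
    # nodes (e, e+1) contributes a 2x2 stiffness block; duplicate triplets
    # at shared nodes are summed when converting COO to CSR.
    n_elems = 5
    ke = np.array([[1.0, -1.0], [-1.0, 1.0]])   # element stiffness matrix
    rows, cols, vals = [], [], []
    for e in range(n_elems):
        dofs = (e, e + 1)
        for a in range(2):
            for b in range(2):
                rows.append(dofs[a])
                cols.append(dofs[b])
                vals.append(ke[a, b])
    K = sp.coo_matrix((vals, (rows, cols)),
                      shape=(n_elems + 1, n_elems + 1)).tocsr()
    ```

    Separating triplet generation from the duplicate-summing insertion is what lets the insertion step run as one batched device operation instead of many fine-grained updates.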