Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System

Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job's network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we use a parallel discrete event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly, and make three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop a module that serves as the bridge between SST and the HPC job scheduler, automating simulation configuration and launching. Finally, we extensively examine workload interference under various job placement and routing configurations.

High-performance computing; interconnect networking; parallel discrete event simulation

XIN WANG, YAO KANG, ZHILING LAN

Computer Science, University of Illinois Chicago, Chicago, United States

NVIDIA Corp, Santa Clara, United States

2025

ACM Transactions on Modeling and Computer Simulation

ISSN: 1049-3301
Year, Volume (Issue): 2025, 35(2)
  • 47