首页|Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System
Preventing Workload Interference with Intelligent Routing and Flexible Job Placement Strategy on Dragonfly System
扫码查看
点击上方二维码区域,可以放大扫码查看
原文链接
NETL
NSTL
Assoc Computing Machinery
Dragonfly is an indispensable interconnect topology for exascale high-performance computing (HPC) systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources with the entire system such that network bandwidth is not exclusive to any single application. Since HPC systems are usually shared among multiple co-running applications at the same time, network competition between co-existing workloads is inevitable. This network contention manifests as workload interference, in which a job's network communication can be severely delayed by other jobs. This study presents a comprehensive examination of leveraging intelligent routing and flexible job placement to mitigate workload interference on Dragonfly systems. Specifically, we leverage the parallel discrete event simulation toolkit, the Structural Simulation Toolkit (SST), to investigate workload interference on Dragonfly with three contributions. We first present Q-adaptive routing, a multi-agent reinforcement learning routing scheme, and a flexible job placement strategy that, together, can mitigate workload interference based on workload communication characteristics. Next, we enhance SST with Q-adaptive routing and develop an automatic module that serves as the bridge between the SST and HPC job scheduler for automatic simulation configuration and automated simulation launching. Finally, we extensively examine workload interference under various job placement and routing configurations.