SCIENCE CHINA Information Sciences, 2024, Vol. 67, Issue 12: 119-135. DOI: 10.1007/s11432-024-4227-2

MuxFlow: efficient GPU sharing in production-level clusters with more than 10000 GPUs

Xuanzhe LIU¹, Yihao ZHAO¹, Shufan LIU², Xiang LI², Yibo ZHU³, Xin LIU², Xin JIN¹

Author information

  • 1. School of Computer Science, Peking University, Beijing 100871, China; Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China
  • 2. ByteDance, Beijing 100006, China
  • 3. Step Fun, Shanghai 200232, China

Abstract

Large-scale GPU clusters are widely used to speed up both latency-critical (online) and best-effort (offline) deep learning (DL) workloads. However, in line with common practice, the DL clusters at ByteDance dedicate each GPU to a single workload or share workloads only in the time dimension, leading to very low GPU resource utilization. Existing techniques like NVIDIA MPS provide an opportunity to share multiple workloads in space on widely deployed NVIDIA GPUs, but MPS cannot guarantee the performance of online workloads. We present MuxFlow, the first production system that scales to massive GPU clusters and supports highly efficient space-sharing for DL workloads. MuxFlow introduces a two-level protection mechanism for both memory and computation to guarantee the performance of online workloads, and leverages dynamic streaming multiprocessor (SM) allocation to improve the efficiency of offline workloads. Based on our practical error analysis, we design a mixed error-handling mechanism to improve system reliability. MuxFlow has been deployed at ByteDance on more than 18000 GPUs. The deployment results show that MuxFlow substantially improves GPU utilization from 26% to 76%, SM activity from 16% to 33%, and GPU memory usage from 42% to 48%.
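As background on the SM allocation the abstract refers to: NVIDIA MPS exposes a per-client cap on SM usage through the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable, which a dynamic allocator can tune per workload. The following is a minimal sketch, not MuxFlow's actual implementation; the environment variable is real MPS configuration, while the offline job command is a hypothetical placeholder.

```shell
# MPS reads this variable when a client process starts: it caps the fraction
# of SMs (active thread percentage) that the client may occupy. A dynamic SM
# allocator, as described for MuxFlow, would adjust such a cap for offline
# workloads so that online workloads retain enough compute.
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30
echo "offline workload SM cap: ${CUDA_MPS_ACTIVE_THREAD_PERCENTAGE}%"
# python offline_train.py   # hypothetical offline DL job launched under the cap
```

Because the cap is fixed at client start, a scheduler that raises or lowers the share for a running offline job has to go beyond plain MPS, which is one motivation for a system-level mechanism like MuxFlow's.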

Key words

GPU cluster / deep learning workload / cluster management / GPU sharing / deployed system


Publication year: 2024
Journal: SCIENCE CHINA Information Sciences
Publisher: Chinese Academy of Sciences
Indexed in: CSTPCD, EI
Impact factor: 0.715
ISSN: 1674-733X