Science China Information Sciences, 2024, Vol. 67, Issue 9: 101-121. DOI: 10.1007/s11432-023-3894-4

SDCC: software-defined collective communication for distributed training

Xin JIN 1, Zhen ZHANG 2, Yunshan JIA 1, Yun MA 1, Xuanzhe LIU 1

Author information

  • 1. School of Computer Science, Peking University, Beijing 100871, China
  • 2. Department of Computer Science and Technology, Johns Hopkins University, Baltimore 21218, USA

Abstract

Communication is crucial to the performance of distributed training. Today's solutions tightly couple the control and data planes and lack flexibility, generality, and performance. In this study, we present SDCC, a software-defined collective communication framework for distributed training. SDCC is based on the principle of modern systems design to effectively decouple the control plane from the data plane. SDCC abstracts the operations for collective communication in distributed training with dataflow operations and unifies computing and communication with a single dataflow graph. The abstraction, together with the unification, is powerful: it enables users to easily express new and existing collective communication algorithms and optimizations, simplifies the integration with different computing engines (e.g., PyTorch and TensorFlow) and network transports (e.g., Linux TCP and kernel bypass), and allows the system to improve performance by exploiting parallelism exposed by the dataflow graph. We further demonstrate the benefits of SDCC in four use cases.
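
To make the idea of a single dataflow graph for computing and communication more concrete, the following is a minimal sketch and not SDCC's actual API, which this page does not describe. The function `parallel_stages`, the op names, and the two-layer backward pass are all hypothetical. The sketch assumes an acyclic graph and only shows how dependencies expose which compute and communication ops could run concurrently.

```python
from collections import defaultdict


def parallel_stages(ops):
    """Group ops into stages so that ops in one stage do not depend on each other.

    `ops` maps an op name to (kind, [names of ops it depends on]).
    Assumes the dependency graph is acyclic (hypothetical helper, not SDCC code).
    """
    stage_of = {}

    def stage(name):
        # An op's stage is one more than the latest stage among its dependencies.
        if name not in stage_of:
            _, deps = ops[name]
            stage_of[name] = 1 + max((stage(d) for d in deps), default=-1)
        return stage_of[name]

    grouped = defaultdict(list)
    for name in ops:
        grouped[stage(name)].append(name)
    return [sorted(grouped[i]) for i in range(len(grouped))]


# Hypothetical backward pass of a two-layer model: compute and communication
# ops live in the same graph, linked only by data dependencies.
ops = {
    "bwd_layer2": ("compute", []),
    "bwd_layer1": ("compute", ["bwd_layer2"]),
    "allreduce_grad2": ("comm", ["bwd_layer2"]),
    "allreduce_grad1": ("comm", ["bwd_layer1"]),
    "update_layer2": ("compute", ["allreduce_grad2"]),
    "update_layer1": ("compute", ["allreduce_grad1"]),
}

stages = parallel_stages(ops)
for i, names in enumerate(stages):
    print(i, [(n, ops[n][0]) for n in names])
```

In this toy graph, the all-reduce of layer 2's gradients lands in the same stage as the backward computation of layer 1, which is the kind of overlap between communication and computation that a unified graph makes visible to a scheduler.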

Key words

machine learning systems; distributed training; deep learning; collective communication; software-defined networking

Funding

National Natural Science Foundation of China (62172008)

National Natural Science Fund for the Excellent Young Scientists Fund Program (Overseas)

Publication year

2024
Science China Information Sciences
Chinese Academy of Sciences

Indexed in: CSTPCD, EI
Impact factor: 0.715
ISSN: 1674-733X