Gloo+: Accelerating distributed training of deep learning using in-network computing
Collective communication is the main communication pattern in distributed deep learning training, and research on optimizing it falls into software-level and hardware-level approaches. SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a network offload protocol for collective communication proposed by Mellanox. It optimizes collective communication in hardware by offloading collective operations to switches in the network, thereby shortening collective communication time. We integrated SHARP into Gloo and designed and implemented Gloo+, a collective communication library that accelerates distributed deep learning training through in-network computing. Our experimental evaluation shows that in benchmark tests with small message sizes, the speedup of Gloo+ over Gloo can reach 100x or more; over MPI in Ethernet mode it can reach 50x or more, and over MPI in InfiniBand (IB) mode it stays within 10x. In practical distributed deep learning training, Gloo+ reaches a speedup of up to 1.1x over Gloo and 1.3x over MPI in Ethernet mode, but only 0.5x relative to MPI in IB mode, i.e., it remains slower than MPI over InfiniBand in that setting.
Keywords: distributed deep learning; collective communication; in-network computing; Gloo; SHARP
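As an illustration of the software baseline that Gloo+ extends, the sketch below shows a standard ring allreduce issued through Gloo's public C++ API; this is the class of collective operation that SHARP-capable switches can execute in the network instead of on the hosts. The network interface name, rendezvous path, and hard-coded rank/size values are illustrative assumptions, and the sketch uses only stock Gloo, since the abstract does not expose Gloo+'s own API.

#include <memory>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/transport/tcp/device.h"

int main() {
  // TCP transport device bound to a network interface (name is illustrative).
  gloo::transport::tcp::attr attr;
  attr.iface = "eth0";
  auto device = gloo::transport::tcp::CreateDevice(attr);

  // Rendezvous store on a shared filesystem, used by all participants to
  // exchange connection details (path is illustrative).
  gloo::rendezvous::FileStore store("/tmp/gloo");

  // Each process knows its rank and the total number of participants;
  // hard-coded here for a two-process example.
  const int rank = 0;
  const int size = 2;
  auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
  context->connectFullMesh(store, device);

  // Buffer to be reduced in place across all ranks (e.g., gradients).
  std::vector<float> data(1024, static_cast<float>(rank));
  std::vector<float*> ptrs = {data.data()};

  // Ring-based software allreduce: data moves host-to-host around a ring.
  // This host-side reduction is what in-network computing offloads.
  gloo::AllreduceRing<float> allreduce(context, ptrs,
                                       static_cast<int>(data.size()));
  allreduce.run();

  return 0;
}

Because Gloo exposes collectives behind a uniform algorithm interface, a SHARP-backed implementation can in principle be swapped in without changing the caller's code, which is consistent with the abstract's description of Gloo+ as a drop-in accelerated library built on Gloo.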