Communication Optimization Technology of Distributed Machine Learning Network

The Ring all-reduce algorithm is widely used in distributed machine learning, but its synchronization process is affected by slow nodes, which lowers the efficiency of the whole system. The two stages of Ring all-reduce, Reduce_Scatter and Allgather, are analyzed, and an optimization strategy is proposed for the data aggregation process of Reduce_Scatter. The main idea is to overlap the extra computation time of slow nodes with communication time. OMNeT++ is used to simulate and compare Ring all-reduce with the optimization strategy. The simulation results are consistent with the theoretical analysis, and the strategy shortens training time by up to 25.3% compared with the Ring all-reduce algorithm.
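The two stages named in the abstract can be illustrated with a minimal single-process sketch of the textbook Ring all-reduce. This is not the authors' optimization strategy, whose details the abstract does not give. The function name `ring_all_reduce`, the list-based "nodes", and the chunk indexing scheme are assumptions made for illustration only.

```python
def ring_all_reduce(vectors):
    """Simulate textbook Ring all-reduce (sum) over n nodes arranged in a ring.

    vectors: list of n equal-length lists, one per simulated node.
    Returns a list of n lists; every node ends with the element-wise sum.
    Illustrative sketch only: real systems exchange chunks over a network.
    """
    n = len(vectors)
    length = len(vectors[0])
    assert all(len(v) == length for v in vectors), "vectors must match in length"
    assert length % n == 0, "this sketch assumes length is divisible by n"
    size = length // n

    # chunks[i][c] is node i's local copy of chunk c.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)]
              for v in vectors]

    # Stage 1: Reduce_Scatter, n - 1 steps.
    # In step s, node i sends chunk (i - s) % n to its right neighbour,
    # which adds the payload to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all outgoing payloads first so sends are simultaneous.
        sends = [(i, (i - s) % n, list(chunks[i][(i - s) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]
    # Now node i holds the fully reduced chunk (i + 1) % n.

    # Stage 2: Allgather, n - 1 steps.
    # In step s, node i forwards chunk (i + 1 - s) % n, and the receiver
    # overwrites its own copy with the fully reduced values.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, list(chunks[i][(i + 1 - s) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            chunks[dst][c] = payload

    # Flatten each node's chunks back into one vector.
    return [[x for chunk in node for x in chunk] for node in chunks]


if __name__ == "__main__":
    gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for node_id, result in enumerate(ring_all_reduce(gradients)):
        print(node_id, result)
```

In this sketch every step is a barrier: no node can start step s + 1 until all sends of step s have completed. That is why a single slow node delays the whole ring, and it is the synchronization cost the abstract's strategy targets in the Reduce_Scatter stage.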

Keywords: Ring all-reduce algorithm; distributed machine learning; Ring all-reduce optimization strategy

Zhang Hangang (张汉钢), Deng Xinyuan (邓鑫源), Song Ye (宋晔), Xue Xuwei (薛旭伟), Guo Bingli (郭秉礼), Huang Shanguo (黄善国)

Beijing University of Posts and Telecommunications, Beijing 100876, China

2024

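Designing Techniques of Posts and Telecommunications (邮电设计技术)
China Information Technology Designing & Consulting Institute Co., Ltd.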

Impact factor: 0.647
ISSN: 1007-3043
Year, volume (issue): 2024.(2)