

Study on Distributed Training Optimization Based on Hybrid Parallelism
Large-scale neural network training is a hot topic in deep learning, and distributed training is one of the most effective ways to train large neural networks across multiple nodes. Distributed training typically involves three parallel methods: data parallelism, inter-layer parallelism, and intra-layer parallelism. However, existing frameworks require manual model partitioning for inter-layer parallelism, which increases the abstract complexity of model design. To address this issue, we propose a node-constrained relationship search algorithm that automates the model partitioning process. Moreover, in traditional data parallelism and inter-layer parallelism, complex model constraints and the need for communication operations force computation and communication into strict serialization. To overcome this, we introduce a synchronous optimization algorithm that overlaps computation with communication and effectively improves overall training efficiency. The experiments train GPT-2 models of different sizes as well as AlexNet, VGG16, and ResNet50. With the synchronous optimization algorithm, the training performance of the GPT2-XL, GPT2-LARGE, and GPT2-MEDIUM models under a 6-node configuration improves by factors of 1.14, 1.18, and 1.23, respectively, and that of AlexNet, VGG16, and ResNet50 under a 1-node configuration improves by factors of 1.31, 1.14, and 1.03, respectively. The experimental results indicate that the synchronous optimization algorithm effectively enhances training efficiency in hybrid parallelism.
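The abstract only summarizes the automatic partitioning idea; the node-constrained relationship search algorithm itself is not detailed here. As a point of reference, the sketch below shows a much simpler greedy heuristic for splitting a layer sequence into contiguous inter-layer (pipeline) stages by balancing parameter counts. The function name and the balancing rule are our own illustrative assumptions, not the paper's algorithm.

```python
from typing import List

def partition_layers(param_counts: List[int], num_stages: int) -> List[List[int]]:
    """Split a sequence of layers into contiguous stages with roughly
    balanced parameter counts (a simple greedy heuristic)."""
    total = sum(param_counts)
    target = total / num_stages          # ideal load per stage
    stages, current, load = [], [], 0
    for idx, count in enumerate(param_counts):
        current.append(idx)
        load += count
        # Close the stage once it reaches the target load, while keeping
        # enough remaining layers for the remaining stages.
        remaining_layers = len(param_counts) - idx - 1
        remaining_stages = num_stages - len(stages) - 1
        if load >= target and remaining_stages > 0 and remaining_layers >= remaining_stages:
            stages.append(current)
            current, load = [], 0
    stages.append(current)               # the last stage takes what is left
    return stages

# Example: 8 layers with uneven sizes split across 3 pipeline stages.
print(partition_layers([10, 20, 30, 40, 40, 30, 20, 10], 3))
# -> [[0, 1, 2, 3], [4, 5], [6, 7]]
```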
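Likewise, the synchronous optimization algorithm is only characterized above as overlapping computation with communication. A minimal PyTorch-style sketch of that general idea, using per-parameter gradient hooks and asynchronous all-reduce, might look as follows; the function names and structure are ours, and the snippet assumes `torch.distributed` has already been initialized with a process group.

```python
import torch
import torch.distributed as dist

def overlap_gradient_allreduce(model: torch.nn.Module):
    """Launch each parameter's gradient all-reduce as soon as that gradient
    is produced during backward, instead of waiting for backward to finish."""
    pending = []  # (async work handle, gradient tensor) pairs

    def hook(grad: torch.Tensor) -> torch.Tensor:
        # Non-blocking all-reduce: backward keeps computing gradients of
        # earlier layers while this communication is in flight.
        work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, grad))
        return grad

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

    def finalize():
        # Wait for outstanding communication, then average the summed grads.
        world_size = dist.get_world_size()
        for work, grad in pending:
            work.wait()
            grad.div_(world_size)
        pending.clear()

    return finalize
```

In a training loop, `finalize()` would be called after `loss.backward()` and before `optimizer.step()`, so that communication launched for later layers completes while earlier layers are still computing their gradients.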

Distributed learning; Hybrid parallel; Automatic segmentation; Communication optimization; Gradient synchronization

XU Jinlong, LI Pengfei, LI Jianan, CHEN Biaoyuan, GAO Wei, HAN Lin


National Supercomputing Center in Zhengzhou (Zhengzhou University), Zhengzhou 450000

PLA Strategic Support Force Information Engineering University, Zhengzhou 450000

School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000


2024

Computer Science
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)


Indexed in: CSTPCD, Peking University Core Journals
Impact factor: 0.944
ISSN: 1002-137X
Year, Volume (Issue): 2024, 51(12)