Advances of Pipeline Model Parallelism for Deep Learning Training:An Overview

Deep learning has become the cornerstone of artificial intelligence, playing an increasingly important role in human production and lifestyle. However, as the complexity of problem-solving increases, deep learning models become increasingly intricate, resulting in a proliferation of large language models with an astonishing number of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to addressing the significant challenge of training "big models". This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP. It also comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches, and discusses the main techniques to achieve load balance for both intra-node and inter-node training. Furthermore, the main techniques to optimize computation, storage, and communication are presented, with potential research directions being discussed.
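To make the synchronous-schedule idea concrete, the following is a minimal illustrative sketch (not taken from the paper): a GPipe-style synchronous pipeline splits a mini-batch into micro-batches that flow through the stages in lockstep, and the start-up/drain steps where stages sit idle form the "pipeline bubble". The function names and the forward-pass-only simplification here are assumptions for illustration.

```python
# Sketch of a GPipe-style synchronous pipeline schedule (forward pass only).
# With p stages and m micro-batches, the forward pass spans p + m - 1 time
# steps; the idle stage-steps at start-up and drain are the pipeline bubble
# that synchronous schedules amortize by increasing m.

def forward_schedule(num_stages: int, num_microbatches: int):
    """Return schedule[t][s] = micro-batch id running on stage s at time
    step t, or None if the stage is idle (bubble)."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb enters stage s at step mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule

def bubble_ratio(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage-time-steps left idle during the forward pass."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)
```

For example, with 4 stages and 8 micro-batches the forward pass takes 11 steps and the bubble ratio is 3/11; raising the micro-batch count drives the ratio toward zero, which is why synchronous schedules favor many small micro-batches.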

deep learning; pipeline schedule; load balance; multi-GPU system; pipeline model parallelism (PMP)

Lei Guan, Dong-Sheng Li, Ji-Ye Liang, Wen-Jian Wang, Ke-Shi Ge, Xi-Cheng Lu


College of Science,National University of Defense Technology,Changsha 410073,China

College of Computer,National University of Defense Technology,Changsha 410073,China

School of Computer and Information Technology,Shanxi University,Taiyuan 030006,China

National Natural Science Foundation of China under Grant Nos. 62025208, U21A20473, U21A20513, 62076154, and 62302512; State Administration of Science, Technology, and Industry for National Defense of China under Grant No. WDZC20235250118

2024

Journal of Computer Science and Technology
China Computer Federation

CSTPCD
Impact factor: 0.432
ISSN:1000-9000
Year, Volume (Issue): 2024, 39(3)