A Survey of Distributed Training Systems and Their Optimization Algorithms
Artificial intelligence employs a variety of optimization techniques to learn key features or knowledge from massive samples and improve the quality of solutions, which places greater demands on training methods. However, traditional single-machine training cannot meet the requirements of storage and computing performance, especially as the sizes of datasets and models have continued to grow in recent years. Therefore, distributed training systems, in which multiple computing nodes cooperate, have become a research hotspot for computation-intensive and storage-intensive applications such as deep learning. Firstly, this survey introduces the main challenges of single-machine training (e.g., dataset/model size, computing performance, storage capacity, system stability, and privacy protection). Secondly, three key problems are identified: partition, communication, and aggregation. To address these problems, a general framework of a distributed training system is summarized, consisting of four components (i.e., the partition, communication, optimization, and aggregation components). This survey examines the core technologies in each component and reviews representative research progress. It then focuses on the parallel stochastic gradient descent (SGD) algorithm and its variants, categorizing them into centralized and decentralized architectures; within each branch, a line of synchronous and asynchronous optimization algorithms is revisited. In addition, three representative applications of distributed systems are introduced: training in heterogeneous environments, federated learning, and large-model training. Finally, two future research directions are proposed: designing efficient distributed second-order optimization algorithms and developing theoretical analysis methods for federated learning.
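To make the synchronous, centralized branch concrete, the sketch below simulates data-parallel SGD on a least-squares problem: the dataset is partitioned across workers, each worker computes a gradient on its shard, and the gradients are averaged before a single shared update (the role a parameter server or all-reduce collective plays in a real system). This is a minimal illustration under assumptions made for the example, not an algorithm taken from the survey; the variable names, learning rate, and problem setup are all hypothetical.

```python
# Minimal simulation of synchronous data-parallel SGD (parameter-server style).
# Illustrative only: each "worker" holds a data shard and computes a local
# gradient; a central step averages the gradients before updating the shared
# model, mirroring the partition / communication / aggregation components.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: find w minimizing the mean of (X w - y)^2.
n_samples, n_features, n_workers = 1024, 8, 4
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

# Partition component: split the dataset evenly across the workers.
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

def local_gradient(w, X_i, y_i):
    """Gradient of the mean squared error on one worker's shard."""
    return 2.0 * X_i.T @ (X_i @ w - y_i) / len(y_i)

w = np.zeros(n_features)
lr = 0.1
for step in range(200):
    # Communication + aggregation: gather one gradient per worker, then
    # average them (an all-reduce in a real system) before the update.
    grads = [local_gradient(w, X_i, y_i) for X_i, y_i in shards]
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```

Because every worker contributes a gradient at every step, the averaged update equals the full-batch gradient here; asynchronous variants relax exactly this barrier, letting fast workers update stale parameters without waiting for stragglers.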
Keywords: distributed training system; decentralized algorithms; centralized algorithms; (a)synchronous algorithms; parallel stochastic gradient descent; convergence rate