计算机学报 (Chinese Journal of Computers), 2024, Vol. 47, Issue 1: 1-28. DOI: 10.11897/SP.J.1016.2024.00001

分布式训练系统及其优化算法综述

A Survey of Distributed Training System and Its Optimization Algorithms

王恩东¹ 闫瑞栋² 郭振华¹ 赵雅倩³ (Wang Endong, Yan Ruidong, Guo Zhenhua, Zhao Yaqian)

作者信息 (Author Information)

  • 1. Inspur Electronic Information Industry Co., Ltd., Jinan 250101
  • 2. Shandong Massive Information Technology Research Institute, Jinan 250101; Inspur (Beijing) Electronic Information Industry Co., Ltd., Beijing 100875; Inspur Electronic Information Industry Co., Ltd., Jinan 250101
  • 3. Inspur (Beijing) Electronic Information Industry Co., Ltd., Beijing 100875; Inspur Electronic Information Industry Co., Ltd., Jinan 250101

摘要 (Abstract)

Artificial intelligence uses a variety of optimization techniques to learn key features or knowledge from massive training samples in order to improve solution quality, which places higher demands on training methods. However, traditional single-machine training cannot meet the resulting storage and computing requirements, so distributed training systems in which multiple computing nodes cooperate have become one of the most active research directions. This paper first describes the main challenges facing single-machine training. It then analyzes three key problems that a distributed training system must solve, and from these problems derives a general framework for distributed training systems together with four core components. For the technologies involved in each component, it reviews representative research results. On this basis, it summarizes the centralized and decentralized architecture branches built on parallel stochastic gradient descent, and surveys the optimization algorithms and applications of each branch. Finally, it proposes possible directions for future research.

Abstract

Artificial intelligence employs a variety of optimization techniques to learn key features or knowledge from massive samples to improve the quality of solutions, which puts forward higher requirements for training methods. However, traditional single-machine training cannot meet the requirements of storage and computing performance, especially since the sizes of datasets and models have continued to increase in recent years. Therefore, a distributed training system built on the cooperation of multiple computing nodes has become one of the hot topics in computation-intensive and storage-intensive applications such as deep learning. Firstly, this survey introduces the main challenges (e.g., dataset/model size, computing performance, storage capacity, system stability, and privacy protection) of single-machine training. Secondly, three key problems, namely partition, communication, and aggregation, are identified. To address these problems, a general framework of a distributed training system comprising four components (partition, communication, optimization, and aggregation components) is summarized. This paper pays attention to the core technologies in each component and reviews the existing representative research progress. Furthermore, this survey focuses on the parallel stochastic gradient descent algorithm and its variants, and categorizes them into the branches of centralized and decentralized architectures, respectively. In each branch, a line of synchronous and asynchronous optimization algorithms is revisited. In addition, it introduces three representative applications, which consist of heterogeneous-environment training, federated learning, and large-model training in distributed systems. Finally, two future research directions are proposed: for one thing, efficient distributed second-order optimization algorithms should be designed; for another, theoretical analysis methods for federated learning should be explored.
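The abstract contrasts centralized and decentralized architectures for synchronous parallel stochastic gradient descent but, being a survey abstract, gives no code. The following toy NumPy sketch is an illustration invented for this summary (the problem, worker count, and ring topology are all assumptions, not from the paper): it compares a conceptual parameter server that averages all workers' gradients against decentralized workers that gossip-average their models with ring neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: each of 4 workers holds a private data shard,
# generated so the true model is the all-ones vector.
n_workers, dim = 4, 5
A = [rng.normal(size=(20, dim)) for _ in range(n_workers)]
b = [a @ np.ones(dim) + 0.1 * rng.normal(size=20) for a in A]

def local_grad(w, k):
    """Gradient of worker k's local least-squares loss."""
    return A[k].T @ (A[k] @ w - b[k]) / len(b[k])

def centralized_sgd(steps=400, lr=0.05):
    """Synchronous centralized training: a (conceptual) parameter server
    averages all workers' gradients and applies one global update."""
    w = np.zeros(dim)
    for _ in range(steps):
        g = np.mean([local_grad(w, k) for k in range(n_workers)], axis=0)
        w -= lr * g
    return w

def decentralized_sgd(steps=400, lr=0.05):
    """Decentralized training: every worker keeps its own model, takes a
    local gradient step, then gossip-averages with its two ring neighbours."""
    W = [np.zeros(dim) for _ in range(n_workers)]
    for _ in range(steps):
        W = [w - lr * local_grad(w, k) for k, w in enumerate(W)]
        # Doubly stochastic ring mixing: weight 1/3 on self and each neighbour.
        W = [(W[k] + W[(k - 1) % n_workers] + W[(k + 1) % n_workers]) / 3
             for k in range(n_workers)]
    return np.mean(W, axis=0)  # consensus estimate across workers

w_c = centralized_sgd()
w_d = decentralized_sgd()
# Both schemes should end up close to the all-ones generating model.
print(np.round(w_c, 2), np.round(w_d, 2))
```

The ring mixing step stands in for the doubly stochastic mixing matrix used in decentralized SGD analyses; with a well-connected topology, both schemes approach the same solution, which is why the survey can treat them as two branches of one algorithm family.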

关键词 (Key Words)

distributed training system / (de)centralized architecture / centralized-architecture algorithms / (a)synchronous algorithms / parallel stochastic gradient descent / convergence rate

Key words

distributed training system / decentralized algorithms / centralized algorithms / (a)synchronous algorithms / parallel stochastic gradient descent / convergence rate


基金项目 (Funding)

Natural Science Foundation of Shandong Province (ZR2021QF073)

出版信息 (Publication Information)

Publication year: 2024
Journal: 计算机学报 (Chinese Journal of Computers)
Sponsors: 中国计算机学会 (China Computer Federation); 中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Indexed in: CSTPCD; CSCD; 北大核心 (Peking University Core Journals)
Impact factor: 3.18
ISSN: 0254-4164