Cloudless-Training: a framework to improve the efficiency of geo-distributed ML training based on serverless computing
Geo-distributed machine learning (ML) training can benefit many emerging ML scenarios (e.g., large model training, federated learning) by exploiting multi-regional cloud resources and the wide area network (WAN). However, its efficiency is limited by two challenges. First, efficient elastic scheduling of multi-regional cloud resources is usually missing, which hurts resource utilization and training performance. Second, communication over the WAN remains the dominant overhead and is easily affected by the WAN's low bandwidth and high fluctuation. In this paper, we propose Cloudless-Training, a framework that realizes efficient geo-distributed ML training in three aspects. First, it uses a two-layer architecture with separate control and physical training planes to support elastic scheduling and communication across multi-regional clouds in a serverless manner. Second, it provides an elastic scheduling strategy that deploys training workflows adaptively according to the heterogeneity of available cloud resources and the distribution of pre-existing training datasets. Third, it provides two new synchronization strategies for training partitions across clouds: asynchronous stochastic gradient descent with gradient accumulation (ASGD-GA) and inter-parameter-server (PS) model averaging (MA). Cloudless-Training is implemented with OpenFaaS and evaluated on Tencent Cloud. Experimental results show that it supports general ML training in a geo-distributed way and greatly improves resource utilization (e.g., 9.2%-24.0% reduction in training cost) and synchronization efficiency (e.g., up to 1.7x training speedup over the baseline) with model correctness guarantees.
Keywords: geo-distributed machine learning (ML) training; cross-cloud ML training; distributed training framework; serverless; cross-cloud model synchronization
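To make the ASGD-GA idea mentioned in the abstract concrete, the following is a minimal sketch of asynchronous SGD with gradient accumulation: each worker accumulates gradients locally for several steps and pushes one averaged gradient to a parameter server per accumulation window, reducing the number of WAN transfers. All names (ParameterServer, worker, accum_steps) and the toy squared-loss model are hypothetical illustrations, not the paper's implementation.

```python
# Minimal ASGD-GA sketch (hypothetical, not the paper's code).
import threading
import numpy as np

class ParameterServer:
    """Holds the global model and applies pushed gradients asynchronously."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Possibly stale gradients are applied as-is (asynchronous update).
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, data, accum_steps=4):
    """Accumulates local gradients for accum_steps steps before one push."""
    w = ps.pull()
    grad_sum = np.zeros_like(w)
    for step, (x, y) in enumerate(data, start=1):
        # Gradient of a toy squared loss (x.w - y)^2 as a stand-in model.
        grad_sum += 2 * (x @ w - y) * x
        if step % accum_steps == 0:
            ps.push(grad_sum / accum_steps)  # one WAN transfer per window
            w = ps.pull()                    # refresh local model copy
            grad_sum[:] = 0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    true_w = rng.normal(size=dim)
    ps = ParameterServer(dim)

    def make_data(n):
        xs = rng.normal(size=(n, dim))
        return [(x, x @ true_w) for x in xs]

    threads = [threading.Thread(target=worker, args=(ps, make_data(400)))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("distance to true model:", np.linalg.norm(ps.pull() - true_w))
```

In this sketch the communication frequency drops by a factor of accum_steps, which is the lever ASGD-GA uses against low and fluctuating WAN bandwidth; the inter-PS model averaging (MA) strategy would instead periodically average whole model replicas held by different parameter servers.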