Cloudless-Training: a framework to improve the efficiency of geo-distributed ML training based on serverless computing
Geo-distributed machine learning (ML) training can benefit many emerging ML scenarios (e.g., large model training, federated learning) by exploiting multi-regional cloud resources and the wide area network (WAN). However, its efficiency is limited by two challenges. First, efficient elastic scheduling of multi-regional cloud resources is usually missing, which hurts resource utilization and training performance. Second, communication over the WAN remains the dominant overhead and is easily affected by the WAN's low bandwidth and high fluctuation. In this paper, we propose Cloudless-Training, a framework that realizes efficient geo-distributed ML training in three aspects. First, it uses a two-layer architecture with separate control and physical training planes to support elastic scheduling and communication across multi-regional clouds in a serverless manner. Second, it provides an elastic scheduling strategy that deploys training workflows adaptively according to the heterogeneity of available cloud resources and the distribution of pre-existing training datasets. Third, it provides two new synchronization strategies for training partitions across clouds: asynchronous stochastic gradient descent with gradient accumulation (ASGD-GA) and inter-parameter-server (PS) model averaging (MA). Cloudless-Training is implemented with OpenFaaS and evaluated on Tencent Cloud. Experimental results show that it supports general ML training in a geo-distributed way and greatly improves resource utilization (e.g., 9.2%-24.0% reduction in training cost) and synchronization efficiency (e.g., up to 1.7x training speedup over the baseline) with model correctness guarantees.
Keywords: geo-distributed machine learning (ML) training; cross-cloud ML training; distributed training framework; serverless; cross-cloud model synchronization
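To make the ASGD-GA idea mentioned in the abstract concrete, the following is a minimal sketch of asynchronous SGD with gradient accumulation: each worker accumulates gradients locally for several steps and pushes one averaged gradient to a parameter server per accumulation window, reducing the number of WAN transfers. All names (ParameterServer, worker, accum_steps) and the toy squared-loss model are hypothetical illustrations, not the paper's implementation.

```python
# Minimal ASGD-GA sketch (hypothetical, not the paper's code).
import threading
import numpy as np

class ParameterServer:
    """Holds the global model and applies pushed gradients asynchronously."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Possibly stale gradients are applied as-is (asynchronous update).
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, data, accum_steps=4):
    """Accumulates local gradients for accum_steps steps before one push."""
    w = ps.pull()
    grad_sum = np.zeros_like(w)
    for step, (x, y) in enumerate(data, start=1):
        # Gradient of a toy squared loss (x.w - y)^2 as a stand-in model.
        grad_sum += 2 * (x @ w - y) * x
        if step % accum_steps == 0:
            ps.push(grad_sum / accum_steps)  # one WAN transfer per window
            w = ps.pull()                    # refresh local model copy
            grad_sum[:] = 0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    true_w = rng.normal(size=dim)
    ps = ParameterServer(dim)

    def make_data(n):
        xs = rng.normal(size=(n, dim))
        return [(x, x @ true_w) for x in xs]

    threads = [threading.Thread(target=worker, args=(ps, make_data(400)))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("distance to true model:", np.linalg.norm(ps.pull() - true_w))
```

In this sketch the communication frequency drops by a factor of accum_steps, which is the lever ASGD-GA uses against low and fluctuating WAN bandwidth; the inter-PS model averaging (MA) strategy would instead periodically average whole model replicas held by different parameter servers.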