Padding Load: Reducing Cluster Resource Waste and Deep Learning Training Costs
In recent years, large-scale models have achieved remarkable success in domains such as bioinformatics, natural language processing, and computer vision. However, these models often require substantial computational resources during training and inference, resulting in considerable cost. At the same time, computing clusters suffer from imbalances between supply and demand, manifesting as low resource utilization and difficulties in task scheduling. To address this problem, the concept of Padding Load is introduced, which places computational tasks on otherwise idle computing resources. Resources allocated to Padding Load run at a lower priority and can be preempted by other tasks at any time, but are correspondingly cheaper. PaddingTorch is a distributed deep learning training framework tailored for Padding Load. Using data from the Alibaba PAI cluster, job scheduling is simulated on four GPUs, specifically during peak task-switching intervals, and PaddingTorch is employed to train a protein complex prediction model with the Padding Load approach. While the training duration is 2.8 times that of exclusive resource usage, training costs are reduced by 84% and GPU resource utilization increases by 25.8% during the periods when Padding Load is employed.
Deep learning; Distributed training; Resource utilization; Computing cluster; Programming framework
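The abstract describes Padding Load as low-priority capacity that can be reclaimed by other tasks at any time. The sketch below is a minimal illustration of how a training job can tolerate such preemption by checkpointing and resuming; it is not PaddingTorch's actual API, and the checkpoint path, the `preempted` hook, and the checkpoint layout are assumptions made for illustration.

```python
# Minimal sketch of a preemption-tolerant training loop (illustrative only, not
# PaddingTorch's API): progress is checkpointed so the GPU can be reclaimed by a
# higher-priority task at any time, and the run resumes from the last checkpoint.
import os
import torch
import torch.nn as nn

CKPT = "padding_load_ckpt.pt"  # hypothetical checkpoint path


def train(model, data, epochs=3, preempted=lambda: False):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = 0
    if os.path.exists(CKPT):                     # resume after a preemption
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start = state["epoch"]
    for epoch in range(start, epochs):
        for x, y in data:
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "epoch": epoch + 1}, CKPT)   # persist progress each epoch
        if preempted():                          # yield the GPU to higher-priority work
            return False                         # caller reschedules when capacity returns
    return True


if __name__ == "__main__":
    model = nn.Linear(8, 1)
    data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]
    train(model, data)
```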