Padding Load: Reducing Cluster Resource Waste and Deep Learning Training Costs
In recent years, large-scale models have achieved remarkable success in domains such as bioinformatics, natural language processing, and computer vision. However, these models often require substantial computational resources during training and inference, resulting in considerable cost. At the same time, computing clusters suffer from imbalances between supply and demand, manifesting as low resource utilization and difficulties in task scheduling. To address this problem, the concept of Padding Load is introduced, which places computational tasks on otherwise idle computing resources. Resources allocated to Padding Load run at a lower priority and can be preempted by other tasks at any time, but are correspondingly cheaper. PaddingTorch is a distributed deep learning training framework tailored for Padding Load. Using data from the Alibaba PAI cluster, job scheduling is simulated on four GPUs, specifically during peak task-switching intervals, and PaddingTorch is employed to train a protein complex prediction model with the Padding Load approach. While the training duration is 2.8 times that of exclusive resource usage, training costs are reduced by 84% and GPU resource utilization increases by 25.8% during the periods when Padding Load is employed.
Deep learning; Distributed training; Resource utilization; Computing cluster; Programming framework
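The abstract describes Padding Load as low-priority capacity that can be reclaimed by other tasks at any time. The sketch below is a minimal illustration of how a training job can tolerate such preemption by checkpointing and resuming; it is not PaddingTorch's actual API, and the checkpoint path, the `preempted` hook, and the checkpoint layout are assumptions made for illustration.

```python
# Minimal sketch of a preemption-tolerant training loop (illustrative only, not
# PaddingTorch's API): progress is checkpointed so the GPU can be reclaimed by a
# higher-priority task at any time, and the run resumes from the last checkpoint.
import os
import torch
import torch.nn as nn

CKPT = "padding_load_ckpt.pt"  # hypothetical checkpoint path


def train(model, data, epochs=3, preempted=lambda: False):
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = 0
    if os.path.exists(CKPT):                     # resume after a preemption
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start = state["epoch"]
    for epoch in range(start, epochs):
        for x, y in data:
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "epoch": epoch + 1}, CKPT)   # persist progress each epoch
        if preempted():                          # yield the GPU to higher-priority work
            return False                         # caller reschedules when capacity returns
    return True


if __name__ == "__main__":
    model = nn.Linear(8, 1)
    data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]
    train(model, data)
```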