
Constructing and analyzing deep learning task dataset for R&D GPU clusters

In recent years, with the growing demand for training deep learning models, research institutions and enterprises have built shared GPU clusters to reduce costs and improve efficiency. Existing research focuses mainly on task scheduling and resource allocation in enterprise production GPU clusters. This paper instead targets Pengcheng Cloud Brain I, an R&D GPU cluster: it monitors and collects key metrics during task runtime and constructs a dataset of deep learning training tasks, the Pengcheng Cloud Brain I Task Dataset, which includes fine-grained time-series resource usage information for tasks. This is the first publicly available dataset for R&D GPU clusters. It reveals low resource utilization in R&D GPU clusters, provides a basis and reference for designing schedulers that achieve high resource utilization on such clusters, and thereby promotes research on task scheduling and resource allocation mechanisms.

GPU cluster; deep learning; cluster workload; workloads dataset; resource utilization

Luo Jing, Ye Zhisheng, Yang Zehua, Fu Tianhao, Wei Xiong, Wang Xiaolin, Luo Yingwei


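School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, Hubei 430200

Peng Cheng Laboratory, Shenzhen, Guangdong 518000

School of Computer Science, Peking University, Beijing 100871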

GPU cluster; deep learning; cluster workload; task dataset; resource utilization

2024

Computer Engineering & Science
College of Computer Science, National University of Defense Technology


CSTPCD; Peking University Core Journals
Impact factor: 0.787
ISSN: 1007-130X
Year, volume (issue): 2024, 46(12)