Constructing and analyzing a deep learning task dataset for R&D GPU clusters
In recent years, with the growing demand for training deep learning models, research institutions and enterprises have established shared GPU clusters to reduce costs and improve efficiency. Existing research focuses mainly on task scheduling and resource allocation in enterprise-level GPU clusters. This paper instead studies Pengcheng Cloud Brain I, a research and development (R&D) GPU cluster. By monitoring and collecting key indicators during task runtime, it constructs a dataset of deep learning training tasks, named the Pengcheng Cloud Brain I Task Dataset, which includes fine-grained time-series resource usage information for tasks. This is the first publicly available dataset tailored to R&D GPU clusters. It reveals low resource utilization in R&D GPU clusters and provides a basis and reference for designing schedulers that achieve high resource utilization on such clusters, thereby promoting research on task scheduling and resource allocation mechanisms.