Exploiting all intra-SM parallelism to maximize the throughput while ensuring QoS
To address the burgeoning demand for computational capacity, GPUs integrate an array of general-purpose and specialized computing units (FP32 Cores, INT32 Cores, FP64 Cores, Tensor Cores, and RT Cores) within their streaming multiprocessors (SMs). Any given GPU may contain only a subset of these units. Although multiple computing units coexist within an SM, the parallelism among them is not documented in the hardware design whitepapers. Meanwhile, the official scheduling interfaces cannot co-run kernels that use different computing units so as to exploit these resources in parallel, nor do they support fine-grained runtime scheduling that maximizes overall system throughput. Faced with these problems, we propose Hato, a hardware-aware and throughput-oriented kernel scheduling method. Hato first designs a hardware-parallelism-aware tool that locates all intra-SM parallelism on any GPU. Second, Hato proposes a kernel co-running modeling method that enables existing scheduling interfaces to exploit intra-SM parallelism through kernel co-running, and accurately predicts the execution time of co-running kernels. Finally, Hato proposes a throughput-oriented scheduling strategy that exploits all available intra-SM parallelism to maximize overall system throughput while guaranteeing the quality of service of latency-sensitive applications. Experimental results show that Hato improves system throughput by 19.2% on average and by up to 54.1% over the state-of-the-art scheduling system Tacker.
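The abstract only sketches how the hardware-parallelism-aware tool works. As a rough illustration of the underlying idea, and not Hato's actual implementation, the CUDA sketch below co-runs an FP32-heavy kernel and an INT32-heavy kernel on two streams and compares the co-run time with the two solo times: a co-run time close to the longer solo time suggests the two unit types overlap inside the SM, while a time close to the sum suggests they serialize. The kernel bodies, the one-block-per-SM launch size, and the iteration count are all assumptions made for this sketch.

```cuda
// Minimal intra-SM parallelism probe (illustrative sketch, not Hato's tool).
// Assumed build: nvcc -O2 probe_intra_sm.cu -o probe_intra_sm
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fp32_kernel(float *out, int iters) {
    // A long chain of dependent FMAs keeps the FP32 pipes busy.
    float v = threadIdx.x * 0.5f;
    for (int i = 0; i < iters; ++i) v = v * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

__global__ void int32_kernel(int *out, int iters) {
    // Dependent integer multiply-adds keep the INT32 pipes busy.
    int v = threadIdx.x;
    for (int i = 0; i < iters; ++i) v = v * 3 + 7;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

// Wall-clock milliseconds since t0, after all queued GPU work finishes.
static double elapsed_ms(std::chrono::steady_clock::time_point t0) {
    cudaDeviceSynchronize();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // One block per SM for each kernel, so during the co-run a block of each
    // kernel is likely to be resident on every SM (assumption of the sketch).
    const int blocks = prop.multiProcessorCount, threads = 256, iters = 1 << 22;

    float *df; int *di;
    cudaMalloc(&df, blocks * threads * sizeof(float));
    cudaMalloc(&di, blocks * threads * sizeof(int));
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);

    // Solo runs: each kernel alone.
    auto t0 = std::chrono::steady_clock::now();
    fp32_kernel<<<blocks, threads, 0, s1>>>(df, iters);
    double solo_fp = elapsed_ms(t0);

    t0 = std::chrono::steady_clock::now();
    int32_kernel<<<blocks, threads, 0, s2>>>(di, iters);
    double solo_int = elapsed_ms(t0);

    // Co-run: two streams so blocks of both kernels can share SMs.
    t0 = std::chrono::steady_clock::now();
    fp32_kernel<<<blocks, threads, 0, s1>>>(df, iters);
    int32_kernel<<<blocks, threads, 0, s2>>>(di, iters);
    double corun = elapsed_ms(t0);

    printf("solo fp32: %.2f ms, solo int32: %.2f ms, co-run: %.2f ms\n",
           solo_fp, solo_int, corun);

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(df); cudaFree(di);
    return 0;
}
```

Conceptually, repeating such pairwise probes across all unit types (FP32, INT32, FP64, Tensor Core, RT Core) is one way to enumerate which intra-SM combinations can actually execute in parallel on a given GPU; the paper's tool presumably does this with much tighter control over block placement and measurement.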

GPU, intra-SM parallelism, throughput improvement, runtime system

Han Zhao, Junxiao Deng, Weihao Cui, Quan Chen, Deze Zeng, Jing Yang, Minyi Guo


School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240

Shanghai Key Laboratory of Financial Information Technology (Shanghai University of Finance and Economics), Shanghai 200433

School of Computer Science, China University of Geosciences, Wuhan 430074

School of Mechanical Engineering, Guizhou University, Guiyang 550025



2024

中国科学F辑 (Science in China, Series F)
Chinese Academy of Sciences, National Natural Science Foundation of China

CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 1.438
ISSN: 1674-5973
Year, volume (issue): 2024, 54(12)