Auto-Generation and Auto-Tuning Framework of Stencil Operation Code


Abstract: Existing stencil code generation methods do not support multiple graphics processing units (GPUs) and tune their output insufficiently. To address these problems, this study proposes a framework that automatically generates and tunes stencil code described in a domain-specific language (DSL). In the code generation stage, the framework parses the high-level description, builds a computational graph, and generates Compute Unified Device Architecture (CUDA) kernel functions for the stencil operation; it also generates different host-side code depending on whether the target is a single-GPU or multi-GPU environment. In the tuning stage, the framework determines the candidate parameter ranges according to the GPU model and dynamically invokes the generated CUDA kernel functions to determine the optimal parameters. In the multi-GPU case, the generated host-side code exchanges boundary data while overlapping computation with communication. On four different GPUs and for 7-, 13-, 19-, and 27-point stencil operations, the framework finds the optimal parameter configuration. Experimental results show that on the Tesla V100-SXM2 with tuned parameters, the framework achieves 1.230, 1.680, 1.120, and 1.480 TFLOPS (trillions of floating-point operations per second) for the four stencil operations in single precision, and 0.690, 1.010, 0.480, and 1.470 TFLOPS in double precision. Average performance reaches 98% of hand-optimized code, while the description is simpler and multi-GPU scaling is supported.
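To make the two central ideas of the abstract concrete (a stencil sweep, and tuning by running the kernel over a range of candidate parameters), here is a minimal CPU-side sketch in Python. It is not the framework's generated CUDA code or its DSL: the functions `stencil7` and `autotune`, the coefficients, and the tile candidates are all hypothetical, and the loop tile merely stands in for a thread-block shape. It assumes an exhaustive search over candidates, which the abstract does not confirm.

```python
import itertools
import time


def stencil7(src, n, c0, c1, tile):
    """Apply one 7-point stencil sweep to an n*n*n grid stored as a flat list.

    Interior points become c0*center + c1*(sum of the six face neighbours);
    boundary points are copied unchanged. `tile` = (tz, ty) only changes the
    traversal order (a stand-in for a CUDA thread-block shape), not the result.
    """
    dst = list(src)
    tz, ty = tile
    nn = n * n
    for z0 in range(1, n - 1, tz):
        for y0 in range(1, n - 1, ty):
            for z in range(z0, min(z0 + tz, n - 1)):
                for y in range(y0, min(y0 + ty, n - 1)):
                    base = z * nn + y * n
                    for x in range(1, n - 1):
                        i = base + x
                        dst[i] = c0 * src[i] + c1 * (
                            src[i - 1] + src[i + 1]
                            + src[i - n] + src[i + n]
                            + src[i - nn] + src[i + nn]
                        )
    return dst


def autotune(kernel, candidates, repeats=3):
    """Time kernel(tile) for every candidate; return (best_tile, timings)."""
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings


# Hypothetical problem size and candidate range, chosen only for illustration.
n = 16
grid = [float((i * 7) % 11) for i in range(n ** 3)]
candidates = list(itertools.product([1, 2, 4, 8], [2, 4, 8, 16]))
best_tile, timings = autotune(
    lambda t: stencil7(grid, n, 0.25, 0.125, t), candidates
)
print("best tile:", best_tile)
```

The winning tile depends on the machine, so no particular output should be expected. In a GPU setting the same loop structure would time real kernel launches, and the candidate range would be derived from the device model as the abstract describes.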

Keywords:

stencil operation; Compute Unified Device Architecture (CUDA); computational graph; Domain Specific Language (DSL); code generation; automatic tuning

Authors:

Liu Jinshuo (刘金硕), Wen Yao (文尧)


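Affiliation:

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei, China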


Funding:

National Key Research and Development Program of China

Grant number:

2020YFA0607900

Year of publication:

2024

DOI:

10.19678/j.issn.1000-3428.0068234

Journal: Computer Engineering (计算机工程)

Sponsors: East China Institute of Computing Technology; Shanghai Computer Society


Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.581

ISSN: 1000-3428

Year, volume (issue): 2024, 50(6)