Existing stencil code generation methods offer limited support for multiple Graphics Processing Units (GPUs) and insufficient optimization. To address these issues, this study proposes a framework for the automatic generation and optimization of stencil code using a Domain Specific Language (DSL). In the code generation stage, the framework automatically parses the provided high-level description, constructs a computational graph, and generates Compute Unified Device Architecture (CUDA) kernel functions for the stencil operations. It also produces different host-side code depending on whether the target is a single-GPU or multi-GPU environment. In the code optimization stage, candidate parameter ranges are determined according to the GPU model, and the generated CUDA kernel functions are invoked dynamically over these candidates to determine the optimal parameters. For multi-GPU execution, the automatically generated host-side code can overlap computation with communication during boundary data exchange. Across four different GPUs and stencil operations with 7-, 13-, 19-, and 27-point configurations, the framework successfully identifies the optimal parameter configuration. Experimental results on the Tesla V100-SXM2 show that, with optimized parameters, the framework achieves 1.230, 1.680, 1.120, and 1.480 tera floating-point operations per second (TFLOPS), respectively, in single precision for the four stencil operations, and 0.690, 1.010, 0.480, and 1.470 TFLOPS, respectively, in double precision, with average performance reaching 98% of hand-optimized code. In addition, the framework offers a simpler description than hand-written code and supports extension to multiple GPUs.
Keywords
stencil operation / Compute Unified Device Architecture (CUDA) / computational graph / Domain Specific Language (DSL) / code generation / automatic tuning
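
To make the two central ideas of the abstract concrete, the following is a minimal CPU-side sketch in Python, not the framework's generated CUDA code. It shows a 7-point stencil sweep over a 3D grid and an exhaustive search over a small candidate list that keeps the fastest setting. The names `stencil7` and `autotune`, the `tile` parameter (standing in for a GPU launch parameter such as a thread-block shape), the coefficients, and the candidate list are all assumptions made for illustration.

```python
import time


def stencil7(src, n, c_center, c_neighbor, tile):
    """Apply one 7-point stencil sweep to the interior of an n*n*n grid.

    `src` is a flat list stored with x varying fastest. `tile` is a
    hypothetical tuning parameter that blocks the y loop; it changes the
    traversal order only, never the result.
    """
    dst = list(src)  # boundary cells are copied through unchanged
    for z in range(1, n - 1):
        for y0 in range(1, n - 1, tile):
            for y in range(y0, min(y0 + tile, n - 1)):
                for x in range(1, n - 1):
                    i = (z * n + y) * n + x
                    # centre point plus its six face neighbours
                    dst[i] = (c_center * src[i]
                              + c_neighbor * (src[i - 1] + src[i + 1]
                                              + src[i - n] + src[i + n]
                                              + src[i - n * n] + src[i + n * n]))
    return dst


def autotune(candidates, run):
    """Time run(candidate) once per candidate and return the fastest one."""
    best, best_time = None, float("inf")
    for cand in candidates:
        start = time.perf_counter()
        run(cand)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = cand, elapsed
    return best


n = 8
grid = [1.0] * (n * n * n)
candidates = [1, 2, 4, 8]  # assumed candidate range, not taken from the paper
best_tile = autotune(candidates, lambda t: stencil7(grid, n, 0.25, 0.125, t))
result = stencil7(grid, n, 0.25, 0.125, best_tile)
```

On a real GPU, the timed call would launch the generated kernel with each candidate configuration, and a single timing per candidate would normally be replaced by several repetitions to reduce noise.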