Auto-Generation and Auto-Tuning Framework for Stencil Operation Code
To address the limitations of existing stencil code generation methods, such as insufficient support for multi-Graphics Processing Unit (GPU) environments and inadequate optimization, this study proposes a framework for the automatic generation and optimization of stencil code based on a Domain Specific Language (DSL). In the code generation stage, the framework parses the provided high-level descriptive language, constructs computational graphs, and generates Compute Unified Device Architecture (CUDA) kernel functions for stencil operations; it also produces different host-side code depending on whether a single-GPU or multi-GPU environment is targeted. In the code optimization stage, candidate parameter ranges are determined according to the GPU model, and the generated CUDA kernels are invoked dynamically to identify the optimal parameters. For multi-GPU execution, the automatically generated host-side code overlaps computation with communication during boundary data exchange. Across four different GPUs and stencil operations with 7-, 13-, 19-, and 27-point configurations, the framework successfully identifies the optimal parameter configurations. Experimental results on the Tesla V100-SXM2 show that, with optimized parameters, the framework achieves 1.230, 1.680, 1.120, and 1.480 Tera Floating-point Operations Per Second (TFLOPS) in single precision for the four stencil operations, and 0.690, 1.010, 0.480, and 1.470 TFLOPS in double precision, reaching on average 98% of the performance of hand-optimized code. In addition, the framework offers a simpler problem description and supports multi-GPU extension.
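The tuning stage described above amounts to a search over candidate launch parameters, with the candidate ranges chosen per GPU model and each candidate evaluated by invoking the generated kernel. A minimal Python sketch of such a search is shown below; the candidate block shapes and the caller-supplied `measure` function are illustrative assumptions, since the paper's actual candidate ranges and timing harness are not given here:

```python
import itertools

def autotune(measure, block_xs=(16, 32, 64, 128), block_ys=(1, 2, 4, 8)):
    """Grid-search candidate thread-block shapes (hypothetical ranges)
    and return the (block_x, block_y) pair with the lowest measured cost.

    `measure(bx, by)` stands in for a timed launch of the generated
    CUDA kernel under that launch configuration.
    """
    best_cfg, best_cost = None, float("inf")
    for bx, by in itertools.product(block_xs, block_ys):
        cost = measure(bx, by)
        if cost < best_cost:
            best_cfg, best_cost = (bx, by), cost
    return best_cfg
```

In the real framework the measurement would wrap an actual kernel launch; here any cost function can be plugged in, e.g. `autotune(lambda bx, by: abs(bx - 32) + abs(by - 4))` selects `(32, 4)`.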
Keywords: stencil operation; Compute Unified Device Architecture (CUDA); computational graph; Domain Specific Language (DSL); code generation; automatic tuning