Experimental method for optimizing GPGPU performance in a multiple-kernel environment based on GPGPU-sim
[Objective] With the rapid development and continuous improvement of the parallel computing architecture of general-purpose graphics processing units (GPGPUs), their computing power has increased significantly, making them essential in high-performance and high-throughput applications. However, as tasks increase in number and complexity, multi-kernel execution environments face serious challenges, so optimizing GPGPU performance in these environments is crucial. Researchers often use GPGPU-sim as the main tool for studying GPGPU performance optimization methods. Nevertheless, there is currently no comprehensive guide to conducting GPGPU performance optimization experiments with GPGPU-sim in multi-kernel environments, which makes experimental verification and analysis in this area difficult for beginners. Furthermore, while the round-robin (RR) scheduling strategy ensures fair resource utilization, it may lead to scheduling delays between multiple kernels in concurrent execution environments. This study aims to provide key experimental methods for beginners to optimize GPGPU performance in multi-kernel concurrent execution environments and to offer valuable case references for teaching computer architecture. [Methods] First, the article gives a detailed introduction to the GPGPU architecture and explores the source code structure of the GPGPU-sim simulator, providing readers with the relevant background knowledge. It then analyzes and discusses the design rationale and algorithm of the proposed adaptive thread block (ATB) scheduling strategy. The article elaborates on the process of modifying the GPGPU-sim source code so that thread blocks from multiple kernels are scheduled for execution under the ATB strategy. In addition, to ensure that beginners can easily replicate the relevant experiments, the article explains in detail the configuration parameters of GPGPU-sim and the modifications made to the test programs. [Results] This article compares the ATB strategy with
the benchmark RR thread block scheduling method, analyzing the experimental results in terms of system performance, shared memory utilization, register utilization, and memory access efficiency. In terms of system performance, the ATB strategy enables concurrent execution of multiple kernels, which improves resource utilization on the GPGPU and thereby significantly improves overall execution performance. Compared to RR, ATB improves execution efficiency by up to 76%, with an average system performance improvement of 45%. In terms of shared memory and register utilization, the ATB strategy allows threads from multiple kernels to access GPGPU resources concurrently, improving the utilization of these resources. Compared to RR, shared memory usage under ATB increased by up to 84%, with an average increase of 54%, and register usage increased by up to 49%, with an average increase of 29%. Regarding memory access efficiency, ATB allows threads from different kernels to access different storage resources, reducing the probability that threads compete for the same resource. Compared to the RR strategy, ATB reduced pipeline stall cycles by an average of 5% and reduced the cycles that warps spend waiting for data by up to 44%, with an average reduction of 29%. Overall, compared to the benchmark method, the ATB strategy proposed in this paper improves both the efficiency of concurrent multi-kernel execution and GPGPU performance. [Conclusions] This article provides an in-depth analysis and discussion of GPGPU performance optimization methods using GPGPU-sim in multi-kernel environments. It designs and implements an ATB scheduling strategy. By adopting the ATB scheduling strategy in the GPGPU-sim simulator, the study achieves concurrent execution of multiple kernels and verifies, through experimental data, the effectiveness of this strategy in improving GPGPU performance. This work not only provides detailed and
feasible experimental methods for beginners but also offers important reference cases for teaching computer architecture.
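
For readers unfamiliar with the baseline, the following is a minimal sketch of generic round-robin dispatch of thread blocks across several kernels. It is an illustration only, not code from GPGPU-sim and not the ATB algorithm of this work; the function name `round_robin_dispatch`, the dictionary input format, and the `free_slots` parameter are all hypothetical choices made for this example.

```python
from collections import deque


def round_robin_dispatch(pending_blocks, free_slots):
    """Issue thread blocks one kernel at a time, in rotating order.

    pending_blocks: dict mapping a kernel name to the number of thread
        blocks it still has to issue (hypothetical input format).
    free_slots: number of thread-block slots available in this round
        (hypothetical stand-in for free hardware resources).
    Returns the kernel names in the order their blocks were issued.
    """
    # Only kernels that still have blocks to issue take part in the rotation.
    queue = deque(k for k, n in pending_blocks.items() if n > 0)
    remaining = dict(pending_blocks)
    issued = []
    while queue and len(issued) < free_slots:
        kernel = queue.popleft()
        issued.append(kernel)
        remaining[kernel] -= 1
        if remaining[kernel] > 0:
            # The kernel rejoins the back of the rotation.
            queue.append(kernel)
    return issued


# Three hypothetical kernels with 3, 1, and 2 pending thread blocks.
order = round_robin_dispatch({"A": 3, "B": 1, "C": 2}, free_slots=5)
print(order)
```

In this sketch every kernel gets the same turn regardless of how many resources its thread blocks need, which is the fairness property attributed to RR above; it does not model per-block register or shared memory demand.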