
Time Cost Model and Optimal Configuration Method for GPU Parallel Computation of Matrix Multiplication

Horizontal-matrix by vertical-matrix multiplication (HVM) is one of the fundamental operations in scientific computing and engineering, and it largely determines the efficiency of the higher-level algorithms built on it. GPU parallel computing has become one of the mainstream parallel computing approaches, and the GPU's underlying design makes it highly suitable for large-scale matrix computation. Many studies have designed and optimized matrix multiplication on GPU parallel computing frameworks for particular matrix structures, but no GPU parallel algorithm or optimization has targeted HVM specifically. Moreover, the kernel configuration directly affects computational efficiency, yet research on optimal kernel configuration remains very limited, and researchers typically set it heuristically according to the computational characteristics of each algorithm. Based on the GPU thread and memory model, this paper designs a parallel HVM algorithm, PHVM. Numerical experiments show that when the horizontal dimension of the left matrix is far larger than its vertical dimension, PHVM significantly outperforms the general matrix multiplication in the NVIDIA cuBLAS library. Furthermore, based on GPU hardware parameters, a theoretical model of PHVM running time is established for optimizing the kernel configuration. Numerical experiments indicate that the model accurately captures how the running time of PHVM varies with the kernel configuration (grid size and thread block size), and that the theoretically optimal configuration derived from the model matches the optimal configuration observed in practice.
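The PHVM kernel itself is not given in this abstract. As a rough illustration of the HVM computation pattern it describes (a wide left matrix A of size m×K multiplied by a tall right matrix B of size K×n, with K far larger than m and n), the following minimal CUDA sketch parallelizes the reduction over the long K dimension; the kernel name hvm_naive, the row-major data layout, and the launch parameters are illustrative assumptions, not the authors' implementation.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative HVM-style kernel (not the paper's PHVM): blockIdx.y selects one
// output element C[i][j]; the gridDim.x blocks along x cooperate on the reduction
// over the long K dimension with a grid-stride loop, a shared-memory tree
// reduction, and a final atomicAdd into C. blockDim.x must be a power of two.
__global__ void hvm_naive(const float* A, const float* B, float* C,
                          int m, int n, long long K)
{
    extern __shared__ float sdata[];
    int i = blockIdx.y / n;                                   // row of C
    int j = blockIdx.y % n;                                   // column of C

    float partial = 0.0f;
    for (long long k = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         k < K;
         k += (long long)gridDim.x * blockDim.x) {
        partial += A[(long long)i * K + k] * B[k * n + j];    // row-major A (m x K), B (K x n)
    }

    sdata[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {            // in-block tree reduction
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(&C[blockIdx.y], sdata[0]);
}

int main()
{
    const int m = 2, n = 2;
    const long long K = 1 << 20;                              // K >> m, n: the HVM regime
    float *A, *B, *C;
    cudaMallocManaged(&A, m * K * sizeof(float));
    cudaMallocManaged(&B, K * n * sizeof(float));
    cudaMallocManaged(&C, m * n * sizeof(float));
    for (long long p = 0; p < m * K; ++p) A[p] = 1.0f;
    for (long long p = 0; p < K * n; ++p) B[p] = 1.0f;
    for (int p = 0; p < m * n; ++p) C[p] = 0.0f;              // kernel accumulates into C

    dim3 block(256);                                          // thread block size (tunable)
    dim3 grid(128, m * n);                                    // blocks per output element (tunable)
    hvm_naive<<<grid, block, block.x * sizeof(float)>>>(A, B, C, m, n, K);
    cudaDeviceSynchronize();

    printf("C[0][0] = %.0f (expected %lld)\n", C[0], K);      // all-ones sanity check
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

The grid.x and block.x values in this launch are exactly the kind of kernel configuration (grid size, thread block size) whose optimum the paper's time-cost model predicts from hardware parameters; a cuBLAS GEMM call such as cublasSgemm would serve as the general-purpose baseline the abstract compares against.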

Keywords: Matrix multiplication; GPU; CUDA; Kernel function configuration
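The paper derives the optimal configuration analytically from GPU hardware parameters; a purely empirical counterpart is sketched below under the assumption that the hvm_naive kernel from the previous listing is in scope. It simply times each candidate (grid size, thread block size) pair with CUDA events and keeps the fastest; the candidate lists are arbitrary illustrative values, not the paper's model.

#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Empirical configuration sweep (an illustration, not the paper's analytic model):
// time hvm_naive over candidate grid/block sizes and report the fastest pair.
void sweep_configs(const float* A, const float* B, float* C,
                   int m, int n, long long K)
{
    const int blocks_per_elem[] = {32, 64, 128, 256, 512};    // candidate grid.x values
    const int block_sizes[]     = {64, 128, 256, 512, 1024};  // powers of two only

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float best_ms = FLT_MAX;
    dim3 best_grid, best_block;

    for (int gx : blocks_per_elem) {
        for (int bs : block_sizes) {
            dim3 grid(gx, m * n);
            dim3 block(bs);
            cudaMemset(C, 0, m * n * sizeof(float));          // kernel accumulates into C

            cudaEventRecord(start);
            hvm_naive<<<grid, block, bs * sizeof(float)>>>(A, B, C, m, n, K);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < best_ms) { best_ms = ms; best_grid = grid; best_block = block; }
        }
    }
    printf("best configuration: grid.x=%u, block.x=%u (%.3f ms)\n",
           best_grid.x, best_block.x, best_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

In practice each configuration would be timed over several repetitions after a warm-up launch; a single measurement is kept here only to keep the sketch short.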

雷超、刘江、宋佳文


Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714

Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714

Research Institute of Aerospace Technology, Central South University, Changsha 410017


Funding: National Key R&D Program of China (2018YFC0116704); Chinese Academy of Sciences Science and Technology Service Network Initiative (STS) Regional Key Project (KFJ-STS-QYZD-2021-01-001); Central South University research project (E190600801)

Journal: Computer Science (计算机科学)
Publisher: Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)
Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(Z1)