Multi-Level Parallel Sparse Tensor-Vector Multiplication Algorithm Based on Heterogeneous Systems

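Abstract (translated from Chinese): Tensors are used in many practical applications to represent large-scale, multi-source, high-dimensional, multi-modal data. Sparse tensor decomposition, one of the effective methods for mining hidden information in data, has been widely applied in machine learning, text analysis, biomedicine, and other research fields. Sparse tensor-vector multiplication (SpTV) is one of the most fundamental and time-consuming operations in tensor decomposition. To accelerate big-data and artificial-intelligence applications, this paper proposes a multi-level parallel SpTV acceleration algorithm based on CPU-GPU heterogeneous architectures. First, to map SpTV onto hybrid, multi-level parallel, distributed CPU-GPU heterogeneous multi-/many-core architectures, this paper designs a multidimensional parallel SpTV partitioning method that uses (N-1)-dimensional tensor partitioning for node-level parallelism and matrix partitioning for GPU thread-level parallelism, fully exploiting the multi-level parallel computing capability across and within compute nodes. Second, it designs a compressed storage format based on sparse tensor fibers, which reduces the memory footprint of sparse tensors and optimizes the computation and memory-access patterns of SpTV. Finally, it proposes an efficient heterogeneous SpTV algorithm based on multi-stream parallelism, and further designs a fine-grained partitioning method for sparse tensors, a multi-stream parallel execution mechanism, and a multi-stream optimization technique based on tensor-block sorting, so that the communication and computation overheads of SpTV overlap and hide each other. Experimental results show that, compared with the related work aeSpTV, the proposed SpTV algorithm achieves a speedup of up to 3.28x across all test datasets.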
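
For orientation, the operation the abstract calls SpTV is the standard mode-n tensor-vector product. The notation below is an assumption for illustration and is not taken from the paper: X is an N-order tensor of size I_1 × ... × I_N, v is a vector of length I_n, and Y is the resulting (N-1)-order tensor.

$$
\mathcal{Y} = \mathcal{X} \times_n \mathbf{v}, \qquad
y_{i_1 \cdots i_{n-1}\, i_{n+1} \cdots i_N} = \sum_{i_n = 1}^{I_n} x_{i_1 \cdots i_{n-1}\, i_n\, i_{n+1} \cdots i_N} \, v_{i_n}
$$

Each output entry is the dot product of v with one mode-n fiber of X (the entries obtained by fixing every index except i_n), so for a sparse tensor only fibers that contain at least one nonzero contribute to the result.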

English title: Exploiting Hierarchical Parallelism for Sparse Tensor-Vector Multiplication on Heterogeneous Parallel Systems

English abstract: Many application domains give rise to multidimensional data that can be naturally represented as tensors. The tensors used in most real-world applications are extremely large and very sparse. Sparse tensor decomposition is an effective approach to predicting unobserved data and is commonly used in machine learning, text analysis, healthcare analytics, and numerous other applications. Sparse tensor-vector multiplication (SpTV) is one of the most fundamental and time-intensive operations in computing tensor decompositions. To improve the efficiency of related applications, this paper exploits hierarchical parallelism for SpTV on CPU-GPU heterogeneous parallel computing systems. First, we propose a multidimensional partitioning method to map parallel SpTV onto the underlying CPU-GPU heterogeneous parallel computing architectures. It uses (N-1)-dimensional tensor partitioning to exploit inter-node parallelism and matricized tensor partitioning to exploit intra-node parallelism. Second, based on the multidimensional data partitioning, we design a fiber-wise compressed storage format for sparse tensors to reduce the memory footprint and optimize the computation and memory-access patterns of parallel SpTV. Third, we design a parallel streaming SpTV algorithm that adopts a fine-grained data partitioning method, a parallel streaming execution scheme, and a tensor block sorting technique to overlap the data swapping cost with the computation overhead and further leverage the computing power of GPUs. The experimental results show that the proposed SpTV algorithm achieves a speedup of up to 3.28x over the state-of-the-art aeSpTV on a CPU-GPU system.
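
To make the idea of fiber-wise storage concrete, here is a minimal CPU-only sketch in Python with NumPy. It is not the authors' format or algorithm: the layout (a CSR-like pointer array over nonempty fibers) and the names `build_fiber_format`, `sptv`, `fiber_ptr`, `fiber_ids`, and `mode_idx` are assumptions chosen for illustration, and the sketch does not cover node-level partitioning, GPU kernels, or multi-stream overlap.

```python
import numpy as np

def build_fiber_format(indices, values, shape, mode):
    """Group COO nonzeros by mode-`mode` fiber (hypothetical layout).

    indices: (nnz, N) integer coordinates, values: (nnz,) nonzero values.
    Assumes nnz >= 1 and no duplicate coordinates.
    """
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=np.float64)
    # A fiber is identified by every index except the one along `mode`.
    keep = [d for d in range(len(shape)) if d != mode]
    out_shape = tuple(shape[d] for d in keep)
    flat = np.ravel_multi_index(tuple(indices[:, d] for d in keep), out_shape)
    # Sort nonzeros so that each fiber occupies one contiguous segment.
    order = np.argsort(flat, kind="stable")
    fiber_ids, counts = np.unique(flat[order], return_counts=True)
    fiber_ptr = np.concatenate(([0], np.cumsum(counts)))
    return {
        "out_shape": out_shape,            # shape of the (N-1)-order result
        "fiber_ids": fiber_ids,            # flattened output position per fiber
        "fiber_ptr": fiber_ptr,            # segment boundaries, length n_fibers + 1
        "mode_idx": indices[order, mode],  # index along `mode` per nonzero
        "values": values[order],           # nonzero values in fiber order
    }

def sptv(fmt, vec):
    """Multiply the stored tensor by `vec` along the mode used to build `fmt`."""
    vec = np.asarray(vec, dtype=np.float64)
    # Scale each nonzero by the matching vector entry.
    products = fmt["values"] * vec[fmt["mode_idx"]]
    # Sum each contiguous fiber segment; every segment is nonempty by construction.
    sums = np.add.reduceat(products, fmt["fiber_ptr"][:-1])
    return fmt["fiber_ids"], sums
```

Calling `build_fiber_format` once and then `sptv` for each new vector returns the flattened output positions of the nonempty fibers together with their sums. Storing only the index along the contracted mode per nonzero, plus one pointer per fiber, is what makes this kind of layout smaller than plain COO when fibers hold several nonzeros.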

English keywords:

CPU-GPU; heterogeneous and parallel computing; hierarchical parallelism; sparse tensors; tensor operations

Authors:

Chen Yuedan (陈玥丹), Xiao Guoqing (肖国庆), Yang Wangdong (阳王东), Jin Jiyong (金纪勇), Long Jun (龙军), Li Kenli (李肯立)

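Author affiliations:

Big Data Institute, Central South University, Changsha 410083

Shenzhen Research Institute, Hunan University, Shenzhen, Guangdong 518000

College of Information Science and Engineering, Hunan University, Changsha 410082

National Supercomputing Center in Changsha, Changsha 410082

Research Center for Applied Mathematics and Machine Intelligence, Research Institute of Basic Theories, Zhejiang Lab, Hangzhou 311100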

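Keywords (translated from Chinese):

CPU-GPU; heterogeneous parallel computing; multi-level parallelism; sparse tensors; tensor operations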

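Funding:

Key-Area Research and Development Program of Guangdong Province; National Natural Science Foundation of China; National Natural Science Foundation of China; Science and Technology Project of Hunan Province; Science and Technology Project of Hunan Province; Science and Technology Project of Hunan Province; Natural Science Foundation of Guangdong Province; Shenzhen Basic Research General Program; Zhejiang Lab Open Research Project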

Project numbers:

2021B0101190004; 62172157; 62202149; 2023GK2002; 2021RC3062; 2023JJ60002; 2023A1515012915; JCYJ20210324135409026; 2022RC0AB03

Publication year:

2024

DOI:

10.11897/SP.J.1016.2024.00441

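Chinese Journal of Computers (计算机学报)

China Computer Federation; Institute of Computing Technology, Chinese Academy of Sciences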

CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 3.18

ISSN: 0254-4164

Year, volume (issue): 2024, 47(2)

Number of references: 34