
Automatic Tensorization for TPU Coarse-grained Instructions

Tensorization is the process of accelerating tensor computations by invoking hardware-specific instructions. TPUs support a variety of coarse-grained instructions that can express operators at the neural-network level and impose no explicit limits on operand size. For coarse-grained instructions, existing tensorization methods require large amounts of handwritten IR matching fragments and struggle to support flexible instruction-level parallelism in the form of double (ping-pong) buffering, which makes them hard to extend to the TPU scenario. To address this, an automatic tensorization method for TPU coarse-grained instructions, Tir2TPU, is proposed. First, it performs instruction replacement on the computation program based on analysis of the TensorIR abstract syntax tree. Second, a parallel model that simulates hardware behavior is designed to realize instruction-parallel optimization. Finally, a program schedule space based on TPU hardware characteristics is constructed to enable fast auto-tuning. Experiments evaluate the performance of five operators commonly used in machine-learning models, including matrix multiplication. The results show that operators automatically optimized by Tir2TPU achieve up to 3.1× and on average 1.78× speedup over TPU's own compiler, and reach on average 90% of hand-optimized performance.
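The instruction-replacement step described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the `Node` type and the `intrin.MMU` intrinsic name are invented for illustration, standing in for TensorIR blocks and a TPU matrix-unit instruction.

```python
# Toy sketch of AST-based tensorization (hypothetical types and names):
# a recognized operator pattern in an expression tree is rewritten into
# a single coarse-grained intrinsic call while the tree is traversed.
from dataclasses import dataclass


@dataclass
class Node:
    op: str                 # e.g. "matmul", "add", or "intrin.MMU"
    children: tuple = ()


def tensorize(node: Node) -> Node:
    """Recursively rewrite matmul nodes into a hypothetical MMU intrinsic."""
    children = tuple(tensorize(c) for c in node.children)
    if node.op == "matmul":
        # Coarse-grained instructions have no fixed tile size, so the whole
        # operator maps to one intrinsic call instead of a tiled loop nest.
        return Node("intrin.MMU", children)
    return Node(node.op, children)


tree = Node("add", (Node("matmul", (Node("A"), Node("B"))), Node("C")))
print(tensorize(tree).children[0].op)  # intrin.MMU
```

In the actual method, the matching is driven by the iterator-binding information of TensorIR blocks rather than by a bare op-name check as above.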
Automatic Tensorization for TPU Coarse-grained Instructions
Tensorization refers to the process of calling specific hardware instructions to accelerate tensor programs. TPU supports various coarse-grained instructions for computation and memory transactions without clear constraints on the input scale. How to use these instructions to automatically generate tensorized programs has become an important topic. However, existing tensorization methods require a large number of handwritten matching fragments for coarse-grained instructions and do not support flexible instruction-parallelism optimizations such as ping-pong buffering, which makes them inefficient to scale to TPU scenarios. To this end, this paper proposes Tir2TPU, an automatic tensorization method for TPU coarse-grained instructions. Firstly, Tir2TPU extracts the iterator-binding information of the Block structure and automatically performs instruction replacement while traversing TensorIR's abstract syntax tree. Secondly, it utilizes a parallel model that simulates hardware behavior to generate a parallel instruction flow. Finally, Tir2TPU adopts a hardware-centric schedule space based on TPU features, which greatly accelerates the auto-tuning process. The performance of Tir2TPU is evaluated on 5 operators commonly used in machine-learning models. Experimental results show that Tir2TPU achieves up to 3.1× and an average of 1.78× speedup compared to TPU's compiler, and consistently delivers on average 90% of the performance of manually optimized operators.
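As a rough illustration of why the ping-pong-buffer parallelism mentioned above pays off, here is a minimal cost model (my own sketch, not from the paper) contrasting serial execution with double-buffered overlap of memory loads and compute:

```python
# Minimal cost model for ping-pong (double) buffering, in abstract cycles.
# Assumption (not from the paper): one DMA load and one compute per tile,
# and the hardware can overlap a load with a compute on separate buffers.

def serial_cycles(n_tiles: int, load: int, compute: int) -> int:
    # Each tile is loaded and then computed, with no overlap.
    return n_tiles * (load + compute)


def ping_pong_cycles(n_tiles: int, load: int, compute: int) -> int:
    # While tile i is computed from one buffer, tile i+1 is loaded into
    # the other; steady-state cost is max(load, compute) per tile.
    if n_tiles == 0:
        return 0
    return load + (n_tiles - 1) * max(load, compute) + compute


print(serial_cycles(8, 10, 10))     # 160
print(ping_pong_cycles(8, 10, 10))  # 90
```

When load and compute times are balanced, overlap hides nearly half the total cost, which is the kind of headroom an instruction-parallel schedule can recover.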

Machine-learning compiler; Tensor accelerator; Tensorization; Instruction parallelism; Operator optimization

Liu Lei, Zhou Zhide, Liu Xingxiang, Che Haoyang, Yao Lei, Jiang He


School of Software, Dalian University of Technology, Dalian, Liaoning 116620, China

Sangfor Technologies Inc., Shenzhen, Guangdong 518000, China

Zhejiang Zeekr Intelligent Technology Co., Ltd., Ningbo, Zhejiang 315800, China

Machine-learning compiler; Tensor accelerator; Tensorization; Instruction parallelism; Operator optimization

Key Program of the National Natural Science Foundation of China (62032004); CCF-Sangfor Fuxi Fund (2022003); China Postdoctoral Science Foundation (2023M730472); National Natural Science Foundation of China (62302077)

2024

Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)


Indexed in: CSTPCD; Peking University Core Journal List (北大核心)
Impact factor: 0.944
ISSN:1002-137X
Year, Volume (Issue): 2024, 51(6)