Automatic Tensorization for TPU Coarse-grained Instructions
Tensorization refers to the process of calling specific hardware instructions to accelerate tensor programs. The TPU supports various coarse-grained instructions for computation and memory transactions without clear constraints on the input scale, so how to use these instructions to automatically generate tensorized programs has become an important topic. However, existing tensorization methods require a large number of handwritten matching fragments for coarse-grained instructions and do not support flexible instruction-parallelism optimizations such as ping-pong buffering, which makes them inefficient to scale to TPU scenarios. To this end, this paper proposes Tir2TPU, an automatic tensorization method for TPU coarse-grained instructions. First, Tir2TPU extracts the iterator binding information of the Block structure and automatically performs instruction replacement while traversing TensorIR's abstract syntax tree. Second, it utilizes a parallel model that simulates hardware behavior to generate a parallel instruction flow. Finally, Tir2TPU incorporates a hardware-centric schedule space based on TPU features, which greatly accelerates the auto-tuning process. The performance of Tir2TPU is evaluated on 5 commonly used operators in machine learning models. Experimental results show that Tir2TPU achieves up to 3× and an average of 1.78× speedup over the TPU's compiler, and consistently delivers 90% of the performance of manually optimized operators.
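To make the first step concrete, the sketch below shows how per-Block iterator binding information can be collected while traversing a TensorIR abstract syntax tree with TVM's Python API. This is a minimal illustration of the kind of analysis described above, not Tir2TPU's implementation; the function name collect_block_iter_bindings and the example matmul kernel are hypothetical, and the buffer-annotation syntax assumes a recent TVM release.

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def matmul(A: T.Buffer((128, 128), "float32"),
           B: T.Buffer((128, 128), "float32"),
           C: T.Buffer((128, 128), "float32")):
    # A simple matmul written in TensorIR script, used only as a traversal target.
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


def collect_block_iter_bindings(func):
    """Record each Block's name and the kinds of its bound iterators
    (spatial vs. reduction) by post-order traversal of the TensorIR AST."""
    bindings = []

    def visit(node):
        if isinstance(node, tvm.tir.Block):
            kinds = [
                "spatial" if iv.iter_type == tvm.tir.IterVar.DataPar else "reduction"
                for iv in node.iter_vars
            ]
            bindings.append((node.name_hint, kinds))

    tvm.tir.stmt_functor.post_order_visit(func.body, visit)
    return bindings


# Example: [('C', ['spatial', 'spatial', 'reduction'])]
print(collect_block_iter_bindings(matmul))
```

In a system like the one described, such iterator-kind and extent information would drive the decision of which coarse-grained TPU instruction can replace a given Block; the sketch stops at extraction, since the replacement rules are specific to the target instruction set.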