GEMM Algorithm Design and Optimization Implementation Technology Based on Feiteng D2000

In deep learning inference frameworks, GEMM (general matrix multiplication) is a typical compute-intensive operator. Models such as Bert, Transformer, and Yolo contain large numbers of GEMM operations in their modules, so the quality of the framework's underlying GEMM implementation directly affects model inference latency. Because edge embedded platforms have limited computing power, optimizing this operator is crucial. This work optimizes GEMM for an embedded target using loop unrolling, OpenMP, and the NEON instruction set, among other methods. Experiments were run on the domestic Feiteng (Phytium) D2000 embedded board under a domestic operating system. The results show that the optimized operator achieves a 43.89x speedup over the unoptimized version, which indicates that the optimization methods are effective and can substantially reduce the inference latency of artificial intelligence models at the edge.
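
The abstract names three techniques (loop unrolling, OpenMP, NEON) without showing how they combine. The following is a minimal sketch of one way they can be applied to a single-precision GEMM, not the authors' implementation. It assumes row-major `float` matrices with `C = A * B`; the function names `gemm_naive` and `gemm_opt` are hypothetical. The NEON path is compiled only when `__ARM_NEON` is defined, and a 4-way unrolled scalar path is used otherwise, so the sketch also builds on non-Arm hosts.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Reference GEMM: C = A * B, row-major; A is MxK, B is KxN, C is MxN. */
void gemm_naive(int M, int N, int K,
                const float *A, const float *B, float *C)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[(size_t)i * K + k] * B[(size_t)k * N + j];
            C[(size_t)i * N + j] = acc;
        }
    }
}

/* Sketch of an optimized GEMM (hypothetical, for illustration only):
 *  - OpenMP distributes rows of C across threads (the pragma is ignored
 *    when the compiler is not run with OpenMP enabled);
 *  - each row is computed four output columns at a time, either with
 *    NEON 128-bit vectors or with a 4-way unrolled scalar loop;
 *  - leftover columns (N not a multiple of 4) use a scalar tail loop. */
void gemm_opt(int M, int N, int K,
              const float *A, const float *B, float *C)
{
    assert(M >= 0 && N >= 0 && K >= 0);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < M; ++i) {
        const float *a_row = A + (size_t)i * K;
        float *c_row = C + (size_t)i * N;
        int j = 0;

        for (; j + 4 <= N; j += 4) {
#if defined(__ARM_NEON)
            /* acc holds C[i][j..j+3]; each step adds A[i][k] * B[k][j..j+3]. */
            float32x4_t acc = vdupq_n_f32(0.0f);
            for (int k = 0; k < K; ++k) {
                float32x4_t b = vld1q_f32(B + (size_t)k * N + j);
                acc = vmlaq_n_f32(acc, b, a_row[k]);
            }
            vst1q_f32(c_row + j, acc);
#else
            /* Portable fallback: the same four columns, unrolled by hand. */
            float c0 = 0.0f, c1 = 0.0f, c2 = 0.0f, c3 = 0.0f;
            for (int k = 0; k < K; ++k) {
                const float a = a_row[k];
                const float *b = B + (size_t)k * N + j;
                c0 += a * b[0];
                c1 += a * b[1];
                c2 += a * b[2];
                c3 += a * b[3];
            }
            c_row[j + 0] = c0;
            c_row[j + 1] = c1;
            c_row[j + 2] = c2;
            c_row[j + 3] = c3;
#endif
        }

        /* Scalar tail for the remaining columns. */
        for (; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += a_row[k] * B[(size_t)k * N + j];
            c_row[j] = acc;
        }
    }
}
```

With GCC or Clang, OpenMP is typically enabled with `-fopenmp`; on AArch64 targets NEON is available by default. Any speedup from a sketch like this depends on matrix sizes, cache behavior, and thread count, and is not comparable to the figure reported in the abstract.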

Keywords: inference framework; GEMM; OpenMP; NEON; Feiteng D2000

Zheng En, Bai Linting, Wen Pengcheng

AVIC Xi'an Aeronautics Computing Technique Research Institute, Xi'an, Shaanxi 710000

Aviation Key Laboratory of Science and Technology on Airborne and Missile-borne Computer, Xi'an, Shaanxi 710000

Funding: Aeronautical Science Foundation of China (grant no. 2022Z071031001)

2024

Aeronautical Computing Technique
AVIC Xi'an Aeronautics Computing Technique Research Institute

CSTPCD
Impact factor: 0.316
ISSN:1671-654X
Year, volume (issue): 2024, 54(3)