GEMM Algorithm Design and Optimization on the Feiteng D2000
In deep learning inference frameworks, GEMM is a typical compute-intensive operator: models such as BERT, Transformer, and YOLO contain large numbers of GEMM operations in their core modules. The quality of the underlying GEMM implementation therefore directly affects a model's inference latency, and on edge embedded platforms with limited computing power, optimizing this operator is crucial. The main work of this article is an embedded optimization of GEMM using loop unrolling, OpenMP, the NEON instruction set, and other methods. Experiments were conducted on the domestic embedded board Feiteng D2000 running a domestic operating system. The results show that the optimized operator achieves a 43.89x speedup over the unoptimized baseline; the optimization approach is effective and can greatly reduce the inference latency of artificial intelligence models at the edge.
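To make the three named techniques concrete, the following is a minimal sketch (not the paper's actual kernel) of a single-precision GEMM that combines an OpenMP parallel loop over output rows, 4-way loop unrolling along the reduction dimension, and a NEON path on ARM platforms such as the D2000. All function names, blocking factors, and the row-major layout are illustrative assumptions; a production kernel would also pack and transpose B for contiguous vector loads.

```c
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Illustrative sketch: C = A * B, row-major float matrices.
 * M x K times K x N -> M x N. */
static void gemm_opt(int M, int N, int K,
                     const float *A, const float *B, float *C) {
    /* OpenMP: distribute output rows across cores (ignored if
     * compiled without -fopenmp). */
    #pragma omp parallel for
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            int k = 0;
#ifdef __ARM_NEON
            /* NEON path: multiply-accumulate 4 elements per step. */
            float32x4_t vacc = vdupq_n_f32(0.0f);
            for (; k + 4 <= K; k += 4) {
                float tmpb[4] = { B[k*N + j],     B[(k+1)*N + j],
                                  B[(k+2)*N + j], B[(k+3)*N + j] };
                float32x4_t va = vld1q_f32(&A[i*K + k]);
                float32x4_t vb = vld1q_f32(tmpb);
                vacc = vmlaq_f32(vacc, va, vb);
            }
            acc += vaddvq_f32(vacc);  /* horizontal sum (AArch64) */
#else
            /* Scalar fallback with 4-way loop unrolling. */
            for (; k + 4 <= K; k += 4) {
                acc += A[i*K + k]     * B[k*N + j]
                     + A[i*K + k + 1] * B[(k+1)*N + j]
                     + A[i*K + k + 2] * B[(k+2)*N + j]
                     + A[i*K + k + 3] * B[(k+3)*N + j];
            }
#endif
            /* Remainder loop for K not divisible by 4. */
            for (; k < K; k++)
                acc += A[i*K + k] * B[k*N + j];
            C[i*N + j] = acc;
        }
    }
}
```

Unrolling reduces loop overhead and exposes independent multiply-adds, OpenMP exploits the D2000's multiple cores, and NEON processes four floats per instruction; the reported speedup comes from combining such techniques, not from any single one.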