Fast convolution algorithm optimization for the Phytium processor
To address the difficulty of deploying convolutional neural networks on resource-constrained devices, a high-performance fast convolution algorithm (FastInfer) was proposed for the FT-2000/4 multi-core processor. The algorithm optimized general matrix multiplication (GEMM) with a cache-blocking strategy, keeping frequently accessed data close to the processor's caches to improve memory access efficiency during computation. In addition, a high-performance matrix multiplication micro-kernel was designed and implemented that used vector outer-product operations to update its accumulators, raising the compute-to-memory ratio and maximizing the masking of memory instruction latency. Experimental results demonstrated that FastInfer achieved a peak performance of 99.56 GFLOPS on the FT-2000/4 processor. On general matrix multiplications of various input scales, FastInfer outperformed OpenBLAS by factors of 1.07 to 1.52; in convolution tests, FastInfer ran 1.32 times faster than the ARM Compute Library, achieving high-performance convolution computation on the FT-2000/4 multi-core processor.
Keywords: deep learning; fast convolution algorithm; parallel computing; general matrix multiplication
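The cache-blocking strategy described in the abstract can be illustrated with a minimal sketch. The function name (sgemm_blocked), the block sizes MC/KC/NC, and the row-major layout below are illustrative assumptions rather than details taken from the paper; the idea is that operand blocks are copied into contiguous scratch buffers sized to stay cache-resident while the inner loops run.

```c
/* Cache-blocked SGEMM sketch: C += A * B, all matrices row-major.
 * A KC x NC panel of B and an MC x KC block of A are packed into
 * contiguous buffers so the innermost loops stream from fast memory.
 * Block sizes are placeholders; in practice they are tuned so the A
 * block fits in L2 and the B panel in the next cache level. */
#include <stdlib.h>

enum { MC = 128, KC = 256, NC = 512 };

static int imin(int a, int b) { return a < b ? a : b; }

void sgemm_blocked(int M, int N, int K,
                   const float *A, const float *B, float *C)
{
    float *Ap = malloc((size_t)MC * KC * sizeof *Ap); /* packed A block */
    float *Bp = malloc((size_t)KC * NC * sizeof *Bp); /* packed B panel */

    for (int jc = 0; jc < N; jc += NC) {
        int nc = imin(NC, N - jc);
        for (int pc = 0; pc < K; pc += KC) {
            int kc = imin(KC, K - pc);
            /* pack the kc x nc panel of B contiguously */
            for (int p = 0; p < kc; ++p)
                for (int j = 0; j < nc; ++j)
                    Bp[p * nc + j] = B[(pc + p) * N + (jc + j)];
            for (int ic = 0; ic < M; ic += MC) {
                int mc = imin(MC, M - ic);
                /* pack the mc x kc block of A contiguously */
                for (int i = 0; i < mc; ++i)
                    for (int p = 0; p < kc; ++p)
                        Ap[i * kc + p] = A[(ic + i) * K + (pc + p)];
                /* macro-kernel: every operand now lives in a packed,
                 * cache-resident buffer */
                for (int i = 0; i < mc; ++i)
                    for (int p = 0; p < kc; ++p) {
                        float a = Ap[i * kc + p];
                        for (int j = 0; j < nc; ++j)
                            C[(ic + i) * N + (jc + j)] += a * Bp[p * nc + j];
                    }
            }
        }
    }
    free(Ap);
    free(Bp);
}
```

A production implementation would replace the scalar macro-kernel above with a register-blocked micro-kernel, sketched next.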
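The vector outer-product micro-kernel can likewise be sketched with ARM NEON intrinsics, which the FT-2000/4's ARMv8 cores support. The 4x4 tile size and the packed-sliver layout are assumptions for illustration; the abstract does not specify FastInfer's actual register blocking.

```c
#include <arm_neon.h>

/* 4x4 outer-product micro-kernel (row-major C with row stride ldc).
 * Ap holds a packed 4 x K sliver of A stored column-by-column (4 floats
 * per k); Bp holds a packed K x 4 sliver of B stored row-by-row.
 * Each k iteration is one rank-1 update C += a_k * b_k^T, and all 16
 * accumulators stay in registers for the entire K loop. */
static void ukernel_4x4(int K, const float *Ap, const float *Bp,
                        float *C, int ldc)
{
    float32x4_t c0 = vld1q_f32(C + 0 * ldc);
    float32x4_t c1 = vld1q_f32(C + 1 * ldc);
    float32x4_t c2 = vld1q_f32(C + 2 * ldc);
    float32x4_t c3 = vld1q_f32(C + 3 * ldc);

    for (int k = 0; k < K; ++k) {
        float32x4_t a = vld1q_f32(Ap + 4 * k); /* A[0..3][k] */
        float32x4_t b = vld1q_f32(Bp + 4 * k); /* B[k][0..3] */
        c0 = vfmaq_laneq_f32(c0, b, a, 0);     /* row 0 of C += a0 * b */
        c1 = vfmaq_laneq_f32(c1, b, a, 1);
        c2 = vfmaq_laneq_f32(c2, b, a, 2);
        c3 = vfmaq_laneq_f32(c3, b, a, 3);
    }

    vst1q_f32(C + 0 * ldc, c0);
    vst1q_f32(C + 1 * ldc, c1);
    vst1q_f32(C + 2 * ldc, c2);
    vst1q_f32(C + 3 * ldc, c3);
}
```

This structure shows why the outer-product formulation raises the compute-to-memory ratio: each k step issues only two 16-byte vector loads but performs four fused multiply-add vector instructions (32 floating-point operations), so arithmetic work can hide the latency of the memory instructions.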