In the Deep Computing Unit (DCU), a domestic general-purpose accelerator, Local Data Shared (LDS) is a key storage component with lower latency and higher bandwidth than global memory. As heterogeneous programs use LDS more frequently, the low memory access efficiency of LDS has become an important limiting factor in their performance. In addition, owing to bank conflicts in the LDS access process, LDS access must follow certain principles to be used efficiently. When data accesses between threads exhibit overlapping memory access characteristics, vectorized access instructions introduce delays. To address this problem, an optimization method for LDS memory access vectorization on the DCU is proposed. The method vectorizes contiguous data accesses, which reduces the number of LDS accesses and the time spent on memory access, thereby improving the memory access efficiency of programs. On this basis, by determining memory access characteristics, an LDS access vectorization method that can effectively handle data overlap is proposed, and an efficient LDS memory access technique for domestic general-purpose accelerators is realized so that vectorization effectively improves memory access efficiency. The experimental results demonstrate that, for heterogeneous programs using LDS, program performance improves by an average of 22.6% after LDS access vectorization is implemented, which verifies the effectiveness of this study. Moreover, the vectorization method can handle memory access data that overlaps between threads in LDS, and it improves the performance of heterogeneous programs by an average of 30%.
Deep Computing Unit (DCU); Local Data Shared (LDS); memory access vectorization; memory access characteristic; bank conflict
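The effect of vectorizing contiguous LDS accesses, and the complication that arises when the data read by neighbouring threads overlaps, can be illustrated with a small counting model. This is only a sketch under stated assumptions and not the method proposed in the paper: it assumes that one vector load moves four 32-bit elements, that a vector load must start at a four-element-aligned address, and that each thread reads a contiguous window of elements. The names `VECTOR_WIDTH`, `scalar_loads`, `vector_loads`, and `total_loads` are hypothetical and introduced only for illustration.

```python
# Hypothetical counting model (not the paper's implementation): estimate how
# many LDS load instructions are issued when each thread reads a contiguous
# window of 32-bit elements, with and without vectorized loads.

VECTOR_WIDTH = 4  # assumption: one vector load moves 4 x 32-bit elements


def scalar_loads(start: int, count: int) -> int:
    """Loads issued by one thread using one scalar load per element."""
    return max(count, 0)


def vector_loads(start: int, count: int, width: int = VECTOR_WIDTH) -> int:
    """Loads issued by one thread when aligned runs use vector loads.

    Assumption: a vector load must begin at an index that is a multiple of
    `width`, so unaligned head and leftover tail elements use scalar loads.
    """
    if count <= 0:
        return 0
    head = min((-start) % width, count)  # scalar loads up to the aligned index
    body = (count - head) // width       # full-width vector loads
    tail = (count - head) % width        # scalar loads for the remainder
    return head + body + tail


def total_loads(num_threads: int, window: int, stride: int, loader) -> int:
    """Sum of loads over all threads; thread t reads [t*stride, t*stride + window)."""
    return sum(loader(t * stride, window) for t in range(num_threads))


if __name__ == "__main__":
    threads, window = 64, 8
    # stride == window: windows of neighbouring threads do not overlap.
    print("scalar, no overlap:", total_loads(threads, window, 8, scalar_loads))
    print("vector, no overlap:", total_loads(threads, window, 8, vector_loads))
    # stride < window: windows overlap and half the threads start unaligned.
    print("vector, overlap   :", total_loads(threads, window, 2, vector_loads))
```

In this model, 64 threads reading non-overlapping 8-element windows issue 512 scalar loads but only 128 vector loads. When the windows overlap with a stride of 2, every second thread starts at an unaligned index and the vector count rises to 224, which gives one plausible reading of why overlapping access patterns need separate treatment. Bank conflicts and actual DCU instruction latencies are not modelled.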