While vector processing units are widely employed in processors for neural networks, signal processing, and high-performance computing, they suffer from expensive shuffle operations dedicated to data alignment. Traditionally, a processor handles shuffle operations with a dedicated data shuffle unit. However, the data shuffle unit introduces expensive data-movement overhead and can only shuffle data serially. In fact, shuffle operations only change the layout of data and ideally should be performed entirely within memory. Nowadays, SRAM is no longer just a storage component but can also serve as a computing unit. To this end, we propose Shuffle-SRAM, which can shuffle multiple data elements simultaneously, bit by bit, within an SRAM bank. The key idea is to exploit the bit-line-wise data movement ability of SRAM to shuffle multiple data elements in parallel: all the bits of different data elements on the same bit-line of the SRAM can be shuffled simultaneously, achieving a high level of parallelism. Through suitable data layout preparation and vector shuffle extension instructions, Shuffle-SRAM efficiently supports a wide range of commonly used shuffle operations. Our evaluation results show that Shuffle-SRAM achieves a performance gain of 28 times for commonly used shuffle operations and 3.18 times for real-world applications including FFT, AlexNet, and VggNet, while the SRAM area overhead increases by only 4.4%.
Keywords
vector SIMD architecture / SRAM / shuffle operations / vector memory / processing in memory