
Design of Transformer Accelerator with Regular Compression Model and Flexible Architecture
The Transformer model based on the attention mechanism demonstrates superior performance, and a dedicated Transformer accelerator can substantially improve inference performance and reduce inference power consumption. The complexity of the Transformer model is both quantitative and structural, where the structural complexity causes a mismatch between irregular models and regular hardware and reduces the efficiency of mapping the model onto hardware. Current accelerator research mainly focuses on the quantitative complexity, while little work addresses the structural complexity. This paper first proposes a regular compression model that reduces the structural complexity of the model, improves the match between model and hardware, and increases the efficiency of mapping the model onto hardware. A hardware-friendly model compression method is then introduced, which adopts a regular offset-diagonal weight pruning scheme and simplifies the hardware logic for quantized inference. In addition, an efficient and flexible hardware architecture is presented, comprising a block-level weight-stationary systolic array and a quasi-distributed storage architecture. This architecture enables efficient mapping of the algorithm onto the computing array while achieving high data storage efficiency and reduced data movement. Experimental results show that the proposed approach achieves a compression rate of 93.75% with negligible performance loss, and the accelerator implemented on an FPGA can efficiently process the compressed Transformer model, improving energy efficiency by 12.45x over a Central Processing Unit (CPU) and 4.17x over a Graphics Processing Unit (GPU).
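To make the offset-diagonal pruning idea concrete, below is a minimal Python/NumPy sketch, assuming the scheme keeps only weights lying on a small set of wrapped (offset) diagonals within each weight block; the function name, block size, and offsets are illustrative assumptions, not the paper's actual parameters. Note that keeping one diagonal of a 16x16 block retains 16 of 256 weights, i.e. a compression rate of 1 - 1/16 = 93.75%, matching the rate reported in the abstract.

import numpy as np

def offset_diagonal_mask(rows, cols, offsets):
    # Build a 0/1 mask that keeps entries on the given wrapped diagonals.
    # (Hypothetical helper for illustration; not from the paper.)
    mask = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):
        for off in offsets:
            # Row r keeps the weight at column (r + off) mod cols.
            mask[r, (r + off) % cols] = 1.0
    return mask

# Example: one kept diagonal in a 16x16 block -> 16/256 weights survive,
# a compression rate of 1 - 1/16 = 93.75% (the paper's exact offsets may differ).
block = np.random.randn(16, 16).astype(np.float32)
mask = offset_diagonal_mask(16, 16, offsets=[0])
pruned = block * mask
print(f"kept {int(mask.sum())}/256 weights, compression rate {1 - mask.sum() / 256:.2%}")

Because every row keeps the same number of weights at predictable positions, such a pruned matrix maps onto a regular computing array without per-row index bookkeeping, which is consistent with the abstract's claim that regularizing the model improves the match between model and hardware.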

Keywords: Natural Language Processing (NLP); Transformer; Model compression; Hardware accelerator; Machine translation

JIANG Xiaobo, DENG Hanke, MO Zhijie, LI Hongyuan


School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510000, China

School of Robotics, Guangdong Polytechnic of Science and Technology, Zhuhai 519090, China


Funding: National Natural Science Foundation of China (U1801262); Science and Technology Program of Guangdong Province (2019B010154003); Science and Technology Program of Guangzhou (202102080579)

2024

Journal of Electronics & Information Technology
Sponsored by: Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 1.302
ISSN: 1009-5896
Year, Volume (Issue): 2024, 46(3)