
Quantization and pruning optimization method for attention mechanism

To address the enormous computation and memory-access overhead of models based on the attention mechanism, model compression techniques combining quantization and pruning were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices in the attention mechanism: query, key, value, and probability. In addition, a probability-matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for the typical attention-based model BERT, this optimization method achieves 4-bit or 8-bit fixed-point quantization and a sparsity of 0.93 to 0.98 with little or no accuracy loss, greatly reducing model computation and laying a solid foundation for accelerating the inference of quantized sparse models.
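The abstract describes two techniques at a high level: per-tensor symmetric linear fixed-point quantization of the query, key, value, and probability activation matrices, and pruning of small entries in the probability matrix under a progressive sparsity schedule. The NumPy sketch below only illustrates this general idea; it is not the paper's implementation, and the function names, bit widths, threshold rule, and sparsity schedule are illustrative assumptions.

```python
import numpy as np

def symmetric_quantize(x, num_bits=8):
    """Per-tensor symmetric linear fixed-point quantization.

    A single scale maps x to signed integers in [-(2**(b-1)-1), 2**(b-1)-1],
    so zero is represented exactly (no zero-point is needed).
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def prune_probability(p, sparsity):
    """Zero out the smallest entries of the attention probability matrix
    until roughly the requested fraction of entries is zero."""
    k = int(sparsity * p.size)
    if k == 0:
        return p
    threshold = np.partition(p.ravel(), k - 1)[k - 1]
    return np.where(p <= threshold, 0.0, p)

# Toy attention head: quantize Q, K, V (here to 4 bits), form the
# probability matrix, then raise its sparsity progressively.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
qQ, sQ = symmetric_quantize(Q, num_bits=4)
qK, sK = symmetric_quantize(K, num_bits=4)
scores = (qQ @ qK.T).astype(np.float64) * sQ * sK / np.sqrt(K.shape[1])
P = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
for s in (0.5, 0.8, 0.95):               # progressive increase of sparsity
    P = prune_probability(P, s)
qP, sP = symmetric_quantize(P, num_bits=8)  # probability matrix is also quantized
qV, sV = symmetric_quantize(V, num_bits=4)
out = (qP @ qV).astype(np.float64) * sP * sV
```

In the actual model, the quantization scales and the pruning schedule would be calibrated or learned during fine-tuning rather than chosen ad hoc as in this toy example.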

natural language processing; attention mechanism; quantization; pruning

何源宏、姜晶菲、许金伟


College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China


Key Project of the Key Laboratory Stable Support Program

WDZC20215250103

2024

Journal of National University of Defense Technology
National University of Defense Technology


Indexed by CSTPCD; Peking University Core Journals
Impact factor: 0.517
ISSN: 1001-2486
Year, Volume (Issue): 2024, 46(1)