国防科技大学学报 (Journal of National University of Defense Technology), 2024, Vol. 46, Issue 1: 113-120. DOI: 10.11887/j.cn.202401012

注意力机制量化剪枝优化方法

Quantization and pruning optimization method for attention mechanism

何源宏 姜晶菲 许金伟

Author information

  • 1. College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China; National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China


Abstract

To address the large computation and memory overhead of attention-based models, model compression techniques that jointly optimize quantization and pruning were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices in the attention mechanism: query, key, value, and probability. Meanwhile, a probability-matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for BERT, a typical attention-based model, the optimization method achieves 4-bit or 8-bit fixed-point quantization and a sparsity of 0.93 to 0.98 with little or no accuracy loss, greatly reducing the model's computation and laying a solid foundation for accelerating the inference of quantized sparse models.
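The abstract names three ingredients without detailing them: symmetric linear fixed-point quantization of the attention activations, magnitude-based pruning of the probability matrix, and a progressive sparsity schedule. A minimal NumPy sketch of what such operations could look like is given below; the function names, the per-tensor scale, the magnitude-threshold pruning rule, and the cubic ramp schedule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def symmetric_quantize(x, bits=8):
    """Symmetric linear fixed-point quantization (zero-point = 0).

    One scale per tensor, chosen so the largest magnitude maps to
    the top of the signed integer range (127 for 8-bit, 7 for 4-bit).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map fixed-point values back to the real-valued approximation."""
    return q * scale

def prune_probs(p, sparsity=0.95):
    """Zero the smallest entries of an attention-probability matrix.

    Keeps only the largest (1 - sparsity) fraction of entries;
    small softmax probabilities contribute little to the output.
    """
    k = int(sparsity * p.size)
    if k == 0:
        return p
    thresh = np.partition(p.ravel(), k - 1)[k - 1]
    return np.where(p <= thresh, 0.0, p)

def progressive_sparsity(step, total_steps, target=0.95):
    """Ramp sparsity from 0 to the target over fine-tuning.

    A cubic schedule (an assumption here) prunes gently at first,
    which is one common way to limit pruning accuracy loss.
    """
    t = min(step / total_steps, 1.0)
    return target * (1.0 - (1.0 - t) ** 3)
```

Quantizing query, key, value, and probability matrices lets the two attention matrix products run in integer arithmetic, while pruning the probability matrix makes the probability-value product sparse; the progressive schedule is what the abstract's "渐进式剪枝策略" suggests, applied over fine-tuning steps.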


Keywords

natural language processing; attention mechanism; quantization; pruning


Funding

Stable Support Key Project of the Key Laboratory (WDZC20215250103)

Publication year

2024

Journal of National University of Defense Technology (国防科技大学学报)
Publisher: National University of Defense Technology
Indexed in: CSTPCD, CSCD, Peking University Core Journals (北大核心)
Impact factor: 0.517
ISSN: 1001-2486
References: 29