国防科技大学学报 (Journal of National University of Defense Technology), 2024, Vol. 46, Issue 1: 113-120. DOI: 10.11887/j.cn.202401012

注意力机制量化剪枝优化方法

Quantization and pruning optimization method for attention mechanism

何源宏 姜晶菲 许金伟

Author information

  • 1. College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China; National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China


Abstract

To address the large computation and memory overhead of attention-based models, model compression techniques that jointly optimize quantization and pruning were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices in the attention mechanism: query, key, value, and probability. Meanwhile, a probability-matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for BERT, a typical attention-based model, the optimization method achieves 4-bit or 8-bit fixed-point quantization and a sparsity of 0.93 to 0.98 with little or no accuracy loss, greatly reducing the model's computation and laying a solid foundation for accelerating the inference of quantized sparse models.
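The abstract names three ingredients without detailing them: symmetric linear fixed-point quantization of the attention activations, magnitude-based pruning of the probability matrix, and a progressive sparsity schedule. A minimal NumPy sketch of what such operations could look like is given below; the function names, the per-tensor scale, the magnitude-threshold pruning rule, and the cubic ramp schedule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def symmetric_quantize(x, bits=8):
    """Symmetric linear fixed-point quantization (zero-point = 0).

    One scale per tensor, chosen so the largest magnitude maps to
    the top of the signed integer range (127 for 8-bit, 7 for 4-bit).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map fixed-point values back to the real-valued approximation."""
    return q * scale

def prune_probs(p, sparsity=0.95):
    """Zero the smallest entries of an attention-probability matrix.

    Keeps only the largest (1 - sparsity) fraction of entries;
    small softmax probabilities contribute little to the output.
    """
    k = int(sparsity * p.size)
    if k == 0:
        return p
    thresh = np.partition(p.ravel(), k - 1)[k - 1]
    return np.where(p <= thresh, 0.0, p)

def progressive_sparsity(step, total_steps, target=0.95):
    """Ramp sparsity from 0 to the target over fine-tuning.

    A cubic schedule (an assumption here) prunes gently at first,
    which is one common way to limit pruning accuracy loss.
    """
    t = min(step / total_steps, 1.0)
    return target * (1.0 - (1.0 - t) ** 3)
```

Quantizing query, key, value, and probability matrices lets the two attention matrix products run in integer arithmetic, while pruning the probability matrix makes the probability-value product sparse; the progressive schedule is what the abstract's "渐进式剪枝策略" suggests, applied over fine-tuning steps.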


Keywords

natural language processing; attention mechanism; quantization; pruning


Funding

Stable Support Key Project of the Key Laboratory (WDZC20215250103)

Publication year

2024

Journal of National University of Defense Technology (国防科技大学学报)
Publisher: National University of Defense Technology
Indexed in: CSTPCD, CSCD, Peking University Core Journals (北大核心)
Impact factor: 0.517
ISSN: 1001-2486
References: 29