To address the significant computation and memory overhead of attention-based models, model compression techniques combining quantization and pruning were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices of the attention mechanism: query, key, value, and probability. In addition, a probability-matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for the typical attention-based model BERT, the proposed optimization achieves 4-bit or 8-bit fixed-point quantization and 0.93–0.98 sparsity with little or no accuracy loss, greatly reducing the model's computation and laying a solid foundation for accelerating the inference of quantized sparse models.
Keywords: natural language processing; attention mechanism; quantization; pruning
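The following is a minimal NumPy sketch, under stated assumptions, of the two ideas described above: symmetric linear fixed-point quantization of the query, key, value, and probability matrices, and magnitude-based pruning of the probability matrix to a target sparsity. The function names (symmetric_quantize, prune_probability), the toy single-head shapes, and the 0.95 sparsity setting are illustrative choices, not the paper's implementation; the progressive pruning strategy (gradually raising the sparsity target during fine-tuning) is only noted in a comment.

    # Sketch only: symmetric fixed-point quantization + probability-matrix pruning.
    import numpy as np

    def symmetric_quantize(x: np.ndarray, bits: int = 8):
        """Symmetric linear fixed-point quantization with one per-tensor scale,
        mapping x to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
        qmax = 2 ** (bits - 1) - 1
        scale = float(np.max(np.abs(x))) / qmax
        if scale == 0.0:                      # all-zero tensor edge case
            return np.zeros_like(x, dtype=np.int32), 1.0
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    def prune_probability(p: np.ndarray, sparsity: float) -> np.ndarray:
        """Zero out the smallest attention probabilities until the target
        sparsity is reached (simple magnitude-threshold pruning).
        A progressive strategy would raise `sparsity` step by step during training."""
        k = int(sparsity * p.size)
        if k == 0:
            return p
        threshold = np.partition(p.flatten(), k - 1)[k - 1]
        return np.where(p <= threshold, 0.0, p)

    # Toy single-head attention with quantized Q, K, V and a pruned,
    # quantized probability matrix (hypothetical shapes and seed).
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((8, 64)).astype(np.float32)
    K = rng.standard_normal((8, 64)).astype(np.float32)
    V = rng.standard_normal((8, 64)).astype(np.float32)

    qQ, sQ = symmetric_quantize(Q, bits=8)
    qK, sK = symmetric_quantize(K, bits=8)
    qV, sV = symmetric_quantize(V, bits=8)

    scores = dequantize(qQ, sQ) @ dequantize(qK, sK).T / np.sqrt(64)
    prob = np.exp(scores - scores.max(axis=-1, keepdims=True))
    prob /= prob.sum(axis=-1, keepdims=True)

    prob = prune_probability(prob, sparsity=0.95)   # sparsity in the range the abstract reports
    qP, sP = symmetric_quantize(prob, bits=8)       # probability matrix is also quantized
    out = dequantize(qP, sP) @ dequantize(qV, sV)
    print("output shape:", out.shape, "probability sparsity:", np.mean(prob == 0))

Because the quantization is symmetric with a single scale per tensor, zero maps exactly to zero, so the pruned entries of the probability matrix remain exact zeros after quantization; this is what allows sparse, low-bit inference kernels to skip them.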