High-Precision Lightweight Quantization Inference Method for Prevalent Activation Functions in Transformer Models in Edge Device Deployment
Yang Yunhui¹, Cheng Hu¹, Wei Jinghe¹, Liu Guozhu¹, Sang Xianzhen¹
Author Information
1. The 58th Research Institute of China Electronics Technology Group Corporation, Wuxi 214072, Jiangsu, China
Abstract
Transformer-based models, such as large language models (LLMs) and vision Transformers (ViTs), have achieved state-of-the-art performance in natural language processing and machine vision tasks. However, the activation functions prevalent in ViTs and LLMs, such as GELU (Gaussian Error Linear Unit) and Swish, suffer from insufficient precision and low computational efficiency during fully quantized inference, which constrains their deployment and application on resource-limited edge devices. This paper introduces a high-precision segmented quadratic polynomial fitting method (SQPF) and its corresponding quantized inference process to enable high-performance deployment of nonlinear activation functions on the edge. SQPF combines the least squares method with particle swarm optimization to find the optimal coefficients and interval divisions for the quadratic polynomial fitting of activation functions. The resulting piecewise quadratic polynomials are subjected to dynamic fixed-point symmetric quantization, enabling pure integer inference that requires only shift operations and multiply-accumulate computations. Using SQPF, this paper derives quadratic polynomial approximations of GELU and Swish, denoted Si-GELU and Si-Swish, and evaluates their quantized inference accuracy. Experimental results show that on ImageNet, Si-GELU causes an accuracy reduction of only 0.09% in classification tasks for ViTs (ViT, DeiT, and Swin), which is 27.3% of that caused by comparable methods. On the large language model benchmark MMLU, Si-Swish causes negligible degradation: subcategory precision drops by no more than 0.77% and major-category precision by no more than 0.23%. The minimal precision loss indicates that the optimal piecewise quadratic polynomials derived by SQPF can directly replace the full-precision floating-point activation functions in Transformer models without parameter fine-tuning or retraining.
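To make the two core ideas concrete, the sketch below fits per-segment quadratics to GELU by least squares and then evaluates them with integer-only multiply-accumulate and shift operations under power-of-two (dynamic fixed-point) symmetric scales. This is only an illustration of the general technique, not the paper's method: the segment boundaries here are fixed by hand (the paper searches for optimal breakpoints with particle swarm optimization), the fractional bit width `F = 8` is an assumed value, and segment selection uses a float comparison for readability where a real kernel would compare against integer breakpoints.

```python
import numpy as np
from math import erf, sqrt

def gelu(x: float) -> float:
    """Reference full-precision GELU."""
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

# Illustrative segment boundaries; the paper finds optimal ones via PSO.
BREAKS = [-4.0, -1.5, 0.0, 1.5, 4.0]
F = 8  # fractional bits of the fixed-point format (assumed, not from the paper)

def fit_segments(breaks, n=2000):
    """Least-squares quadratic fit of GELU on each segment."""
    coeffs = []
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        xs = np.linspace(lo, hi, n)
        ys = np.array([gelu(x) for x in xs])
        coeffs.append(np.polyfit(xs, ys, 2))  # [a, b, c], highest degree first
    return coeffs

def quantize_segments(coeffs, f=F):
    """Symmetric power-of-two (dynamic fixed-point) quantization of a, b, c."""
    return [tuple(int(round(v * (1 << f))) for v in abc) for abc in coeffs]

def int_gelu(q_x, breaks, q_coeffs, f=F):
    """Integer-only piecewise quadratic: multiply-accumulate plus shifts.
    q_x is the input at scale 2**f; the result is returned at the same scale."""
    x = q_x / (1 << f)  # float only for segment lookup; see the lead-in note
    if x <= breaks[0]:
        return 0         # GELU(x) -> 0 for large negative x
    if x >= breaks[-1]:
        return q_x       # GELU(x) -> x for large positive x
    i = min(np.searchsorted(breaks, x, side="right") - 1, len(q_coeffs) - 1)
    A, B, C = q_coeffs[i]
    acc = A * q_x * q_x + ((B * q_x) << f) + (C << (2 * f))  # scale 2**(3f)
    return acc >> (2 * f)  # arithmetic shift back to scale 2**f

coeffs = fit_segments(BREAKS)
q_coeffs = quantize_segments(coeffs)
q_y = int_gelu(int(round(1.0 * (1 << F))), BREAKS, q_coeffs)
approx = q_y / (1 << F)  # dequantize for comparison against gelu(1.0)
```

Keeping every scale a power of two is what reduces requantization to shifts: the accumulator `acc` sits at scale 2^(3f) after the multiply-accumulate, and a single arithmetic right shift by 2f brings it back to the input scale without any integer division.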