FP8 Quantization and Inference Memory Optimization Based on MLIR
With the development of object detection models and language models, network models are becoming increasingly large. To better deploy models on edge hardware, model quantization is commonly used to compress them. Existing quantization strategies are mainly based on FP16, BF16, INT8, and other data types. Among these, 8-bit types are the most effective at reducing inference memory usage and deployment cost, but INT8 relies on specific calibration algorithms and handles models with large dynamic ranges and many outliers poorly. The FP8 type fits the data distribution of neural networks better and offers multiple formats whose expression range and precision can be flexibly traded off. However, MLIR currently lacks support for FP8 quantization. To this end, an MLIR-based FP8 quantization simulation strategy is proposed, covering the FP8E4M3 and FP8E5M2 formats. By quantizing and simulating the operators in the network, the impact of the two formats on model inference accuracy is evaluated. A memory reuse strategy based on def-use chains is also proposed to address redundant memory allocation in inference engines, further reducing peak memory usage during model inference. The typical Yolov5s and Resnet50 models are selected for testing and verification. The results show that, compared with the existing INT8 quantization strategy, the FP8 quantization strategy maintains better model accuracy and does not rely on specific calibration algorithms, making deployment more convenient. In terms of model accuracy, the two test cases achieve 55.5% and 77.8%, respectively. After memory reuse optimization, peak memory usage is reduced by about 15% to 20%.
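To illustrate the idea behind FP8 quantization simulation, the following is a minimal sketch (not the paper's actual implementation) of rounding a float to the nearest FP8E4M3 value. It follows the common E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7, largest finite value 448) in a finite-only, saturating variant; the function name is hypothetical.

```python
import math

def quantize_fp8_e4m3(x: float) -> float:
    """Simulate FP8E4M3 quantization: round x to the nearest representable
    E4M3 value (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7).

    Finite-only, saturating variant: values beyond the largest finite
    magnitude (448) are clamped rather than mapped to inf/NaN.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    MAX_E4M3 = 448.0                 # largest finite E4M3 magnitude
    if a >= MAX_E4M3:
        return sign * MAX_E4M3       # saturate on overflow
    # Exponent of the binade containing a, clamped to the subnormal range
    e = max(math.floor(math.log2(a)), -6)
    # With 3 mantissa bits, representable values in binade 2^e are
    # spaced 2^(e-3) apart (subnormal spacing is 2^-9 when e == -6).
    step = 2.0 ** (e - 3)
    q = round(a / step) * step       # round to nearest grid point
    return sign * min(q, MAX_E4M3)
```

A per-tensor scale factor would normally be applied before this rounding so that the tensor's dynamic range fits within [-448, 448]; the sketch shows only the format-level rounding that the simulation inserts around each operator.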
Model compression; Deep learning compiler; FP8 quantization; MLIR; Yolov5s model