To improve the resource utilization and inference speed of lightweight convolutional neural networks on hardware platforms, a lightweight convolutional neural network accelerator based on FPGA was proposed, following the idea of software and hardware co-optimization, and a dedicated hardware architecture was designed according to the characteristics of the network structure. Combined with a multi-level parallelism strategy, a unified convolutional computing unit was designed. Moreover, a differentiable threshold-based selective shift quantization method was proposed to reduce the storage cost and improve the throughput of the accelerator, enabling the computing unit to perform computations in a hardware-friendly form. Experimental results show that the MobileNetV2 accelerator deployed on the Arria 10 FPGA platform achieves 311 fps, which is about 9.3 times faster than the CPU version and about 3 times faster than the GPU version, with a throughput of 98.62 GOPS.
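The abstract does not spell out the quantization details, but the general idea behind shift quantization is to round each weight to a signed power of two, so that a multiplication on the FPGA reduces to a bit shift. The sketch below illustrates that idea in NumPy under stated assumptions: the function name, the fixed pruning threshold, and the log-domain rounding rule are all hypothetical, and the paper's actual method would learn the threshold through a differentiable relaxation rather than fix it.

```python
import numpy as np

def shift_quantize(weights, threshold=0.05):
    """Hypothetical sketch of selective shift quantization.

    Each weight is rounded to the nearest signed power of two
    (nearest in log2 space), so multiply becomes shift-and-add in
    hardware. Weights below `threshold` in magnitude are zeroed,
    standing in for the 'selective' part of the paper's method;
    the real method makes this threshold differentiable/learned.
    """
    w = np.asarray(weights, dtype=np.float64)
    sign = np.sign(w)
    mag = np.abs(w)
    # Nearest power-of-two exponent; the epsilon guards log2(0).
    exp = np.round(np.log2(np.maximum(mag, 1e-12)))
    q = sign * np.power(2.0, exp)
    q[mag < threshold] = 0.0  # prune sub-threshold weights
    return q, exp.astype(int)

q, exp = shift_quantize([0.8, -0.3, 0.02, 1.5])
# 0.8 -> 2^0, -0.3 -> -2^-2, 0.02 pruned to 0, 1.5 -> 2^1
```

In hardware, only the sign, the integer exponent, and a zero flag need to be stored per weight, which is where the storage savings and throughput gains claimed in the abstract would come from.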