电子与信息学报2024,Vol.46Issue(4) :1499-1512.DOI:10.11999/JEIT230509

面向边缘计算的可重构CNN协处理器研究与设计

A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing

李伟 陈億 陈韬 南龙梅 杜怡然
电子与信息学报2024,Vol.46Issue(4) :1499-1512.DOI:10.11999/JEIT230509

面向边缘计算的可重构CNN协处理器研究与设计

A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing

李伟 1陈億 1陈韬 1南龙梅 1杜怡然1
扫码查看

作者信息

  • 1. 战略支援部队信息工程大学密码工程学院 郑州 450001
  • 折叠

摘要

随着深度学习技术的发展,卷积神经网络模型的参数量和计算量急剧增加,极大提高了卷积神经网络算法在边缘侧设备的部署成本.因此,为了降低卷积神经网络算法在边缘侧设备上的部署难度,减小推理时延和能耗开销,该文提出一种面向边缘计算的可重构CNN协处理器结构.基于按通道处理的数据流模式,提出的两级分布式存储方案解决了片上大规模的数据搬移和重构运算时PE单元间的大量数据移动导致的功耗开销和性能下降的问题;为了避免加速阵列中复杂的数据互联网络传播机制,降低控制的复杂度,该文提出一种灵活的本地访存机制和基于地址转换的填充机制,使得协处理器能够灵活实现任意规格的常规卷积、深度可分离卷积、池化和全连接运算,提升了硬件架构的灵活性.本文提出的协处理器包含256个PE运算单元和176 kB的片上私有存储器,在55 nm TT Corner(25°C,1.2 V)的CMOS工艺下进行逻辑综合和布局布线,最高时钟频率能够达到328 MHz,实现面积为4.41 mm2.在320 MHz的工作频率下,该协处理器峰值运算性能为163.8 GOPs,面积效率为37.14 GOPs/mm2,完成LeNet-5和MobileNet网络的能效分别为210.7 GOPs/W和340.08 GOPs/W,能够满足边缘智能计算场景下的能效和性能需求.

Abstract

With the development of Deep Learning, the number of parameters and computation of Convolutional Neural Network (CNN) increases dramatically, which greatly raises the cost of deploying CNN algorithms on edge devices. To reduce the difficulty of the deployment and decrease the inference latency and energy consumption of CNN on the edge side, a Reconfigurable CNN Co-Processor for edge computing is proposed. Based on the data flow pattern of channel-wise processing, the proposed two-level distributed storage scheme solves the problem of power consumption overhead and performance degradation caused by large data movement between PE units and large-scale migration of intermediate data on chip. To avoid the complex data interconnection network propagation mechanism in PE arrays and reduce the complexity of control, a flexible local access mechanism and a padding mechanism based on address translation are proposed, which can perform conventional convolution, deep separable convolution, pooling and fully connected operations with great flexibility. The proposed co-processor contains 256 Processing Elements (PEs) and 176 kB on-chip SRAM. Synthesized and post-layout with 55-nm TT Corner CMOS process (25 °C,1.2 V), the CNN co-processor achieves a maximum clock frequency of 328 MHz and an area of 4.41 mm2. The peak performance of the co-processor is 163.8 GOPs at 320 MHz and the area efficiency is 37.14 GOPs/mm2, the energy efficiency of LeNet-5 and MobileNet are 210.7 GOPs/W and 340.08 GOPs/W, respectively, which is able to meet the energy-efficiency and performance requirements of edge intelligent computing scenarios.

关键词

硬件加速/卷积神经网络/可重构/ASIC

Key words

Hardware acceleration/Convolutional Neural Network (CNN)/Reconfigurable/ASIC

引用本文复制引用

基金项目

基础加强计划重点基础研究项目(2019-JCJQ-ZD-187-00-02)

出版年

2024
电子与信息学报
中国科学院电子学研究所 国家自然科学基金委员会信息科学部

电子与信息学报

CSTPCDCSCD北大核心
影响因子:1.302
ISSN:1009-5896
参考文献量23
段落导航相关论文