Research and Design of a Reconfigurable CNN Co-Processor for Edge Computing

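Abstract (translated from the Chinese): With the development of deep learning, the parameter counts and computational loads of convolutional neural network (CNN) models have grown sharply, which greatly raises the cost of deploying CNN algorithms on edge devices. To ease that deployment and to reduce inference latency and energy overhead, this paper proposes a reconfigurable CNN co-processor architecture for edge computing. Built on a channel-wise dataflow, the proposed two-level distributed storage scheme addresses the power overhead and performance loss caused by large-scale on-chip data transfers and by heavy data movement between PE units during reconfigurable operation. To avoid a complex data-interconnect propagation mechanism inside the accelerator array and to simplify control, the paper proposes a flexible local memory-access mechanism and a padding mechanism based on address translation. Together these let the co-processor flexibly perform standard convolution of arbitrary size, depthwise separable convolution, pooling, and fully connected operations, improving the flexibility of the hardware architecture. The proposed co-processor contains 256 PE units and 176 kB of private on-chip memory. Logic synthesis and place-and-route in a 55 nm CMOS process at the TT corner (25 °C, 1.2 V) give a maximum clock frequency of 328 MHz and an area of 4.41 mm². At an operating frequency of 320 MHz, the co-processor reaches a peak performance of 163.8 GOPs and an area efficiency of 37.14 GOPs/mm², and its energy efficiency on LeNet-5 and MobileNet is 210.7 GOPs/W and 340.08 GOPs/W respectively, meeting the energy-efficiency and performance requirements of edge intelligent computing.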

English title (as published): A Research and Design of Reconfigurable CNN Co-Processor for Edge Computing

English abstract: With the development of deep learning, the number of parameters and the amount of computation in Convolutional Neural Networks (CNNs) have increased dramatically, which greatly raises the cost of deploying CNN algorithms on edge devices. To ease deployment and to reduce the inference latency and energy consumption of CNNs on the edge side, a reconfigurable CNN co-processor for edge computing is proposed. Based on a channel-wise dataflow, the proposed two-level distributed storage scheme addresses the power overhead and performance degradation caused by heavy data movement between PE units and by large-scale migration of intermediate data on chip. To avoid a complex data-interconnect propagation mechanism in the PE array and to reduce control complexity, a flexible local access mechanism and a padding mechanism based on address translation are proposed, allowing the co-processor to perform conventional convolution, depthwise separable convolution, pooling, and fully connected operations flexibly. The proposed co-processor contains 256 Processing Elements (PEs) and 176 kB of on-chip SRAM. Synthesized and placed and routed in a 55 nm CMOS process at the TT corner (25 °C, 1.2 V), the co-processor achieves a maximum clock frequency of 328 MHz and an area of 4.41 mm². At 320 MHz, its peak performance is 163.8 GOPs and its area efficiency is 37.14 GOPs/mm². Its energy efficiency on LeNet-5 and MobileNet is 210.7 GOPs/W and 340.08 GOPs/W, respectively, which meets the energy-efficiency and performance requirements of edge intelligent computing scenarios.
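The peak-performance and area-efficiency figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below assumes, which the abstract does not state, that each PE completes one multiply-accumulate per cycle and that this is counted as two operations. The constant and function names are illustrative only.

```python
# Figures quoted in the abstract.
NUM_PES = 256          # processing elements
CLOCK_HZ = 320e6       # operating frequency used for the reported results
AREA_MM2 = 4.41        # post-layout area
REPORTED_PEAK_GOPS = 163.8

# Assumption (not stated in the abstract): one multiply-accumulate
# per PE per cycle, counted as 2 operations.
OPS_PER_PE_PER_CYCLE = 2


def peak_gops(num_pes: int, clock_hz: float, ops_per_pe_per_cycle: int) -> float:
    """Peak throughput in giga-operations per second."""
    return num_pes * ops_per_pe_per_cycle * clock_hz / 1e9


def area_efficiency(gops: float, area_mm2: float) -> float:
    """Throughput per unit area, in GOPs/mm^2."""
    return gops / area_mm2


peak = peak_gops(NUM_PES, CLOCK_HZ, OPS_PER_PE_PER_CYCLE)
efficiency = area_efficiency(REPORTED_PEAK_GOPS, AREA_MM2)

print(f"peak performance: {peak:.2f} GOPs")
print(f"area efficiency:  {efficiency:.2f} GOPs/mm^2")
```

Under that assumption the computed peak is 163.84 GOPs, consistent with the reported 163.8 GOPs, and dividing the reported peak by the 4.41 mm² area reproduces the reported 37.14 GOPs/mm².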

English keywords:

Hardware acceleration; Convolutional Neural Network (CNN); Reconfigurable; ASIC

Authors:

Li Wei (李伟), Chen Yi (陈億), Chen Tao (陈韬), Nan Longmei (南龙梅), Du Yiran (杜怡然)

Affiliation:

School of Cryptographic Engineering, Strategic Support Force Information Engineering University, Zhengzhou 450001

Keywords:

Hardware acceleration; convolutional neural network; reconfigurable; ASIC

Funding:

Key Basic Research Project of the Basic Strengthening Program

Project number:

2019-JCJQ-ZD-187-00-02

Publication year:

2024

DOI:

10.11999/JEIT230509

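Journal of Electronics & Information Technology (电子与信息学报)

Sponsors: Institute of Electronics, Chinese Academy of Sciences; Department of Information Sciences, National Natural Science Foundation of China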

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 1.302

ISSN: 1009-5896

Year, volume (issue): 2024, 46(4)