Journal of National University of Defense Technology, 2024, Vol. 46, Issue 1: 103-112. DOI: 10.11887/j.cn.202401011

Parallel optimization of convolution algorithm on multi-core DSP

Xu Jinwei (许金伟), Wang Qinglin (王庆林), Li Yalin (李娅琳), Jiang Jingfei (姜晶菲), Gao Lei (高蕾), Li Rongchun (李荣春), Li Dongsheng (李东升)

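Author information

  • 1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China; National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China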

Abstract

Based on the characteristics of the heterogeneous multi-core DSP (digital signal processing) chip independently developed by the National University of Defense Technology and on the properties of the convolution algorithm itself, a high-performance multi-core parallel convolution implementation for the multi-core DSP architecture is proposed. For 1×1 convolution, a feature-map-level multi-core parallel scheme is proposed. For convolutions with kernels larger than 1, a window-level multi-core parallel optimization is designed, together with an intra-core parallel implementation based on element-wise vectorized computation. Experimental results show that the proposed method reaches a single-core computing efficiency of up to 64.95%. Under limited bandwidth, the multi-core parallel scaling efficiency reaches 48.36%~88.52%. On the typical network ResNet50, the implementation achieves a 5.39× speedup over an E5-2640 CPU.
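The abstract names two ways of dividing convolution work among cores but does not spell out the partitioning. The sketch below is one plausible reading, not the authors' implementation. It assumes that "feature-map-level" means each core computes a slice of the output feature maps of a 1×1 convolution, and that "window-level" means each core computes a band of sliding-window positions of a K×K convolution. All function names are hypothetical. The per-core loops run sequentially here, purely to show which work each core would own.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous, near-equal (start, stop) chunks."""
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def conv1x1_feature_map_parallel(x, w, num_cores):
    """1x1 convolution, partitioned by output feature map (assumed reading).

    x: input [C_in][H][W]; w: weights [C_out][C_in]; returns [C_out][H][W].
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out = len(w)
    y = [[[0.0] * wd for _ in range(h)] for _ in range(c_out)]
    # Each (lo, hi) chunk stands for the output feature maps owned by one core.
    for lo, hi in split_range(c_out, num_cores):
        for co in range(lo, hi):
            for ci in range(c_in):
                weight = w[co][ci]
                for i in range(h):
                    for j in range(wd):
                        y[co][i][j] += weight * x[ci][i][j]
    return y


def conv_kxk_window_parallel(x, w, num_cores):
    """KxK convolution (stride 1, no padding), partitioned by output rows,
    i.e. by groups of sliding windows (assumed reading).

    x: input [C_in][H][W]; w: weights [C_out][C_in][K][K];
    returns [C_out][H-K+1][W-K+1].
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    oh, ow = h - k + 1, wd - k + 1
    y = [[[0.0] * ow for _ in range(oh)] for _ in range(c_out)]
    # Each (lo, hi) chunk stands for the band of window rows owned by one core.
    for lo, hi in split_range(oh, num_cores):
        for i in range(lo, hi):
            for j in range(ow):
                for co in range(c_out):
                    acc = 0.0
                    for ci in range(c_in):
                        for di in range(k):
                            for dj in range(k):
                                acc += w[co][ci][di][dj] * x[ci][i + di][j + dj]
                    y[co][i][j] = acc
    return y
```

Because every chunk writes a disjoint part of the output, the result does not depend on the number of cores in either scheme; only the balance of work and the data each core must fetch change.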

Key words

multi-core DSP/convolutional neural networks (CNNs)/convolution algorithm/parallel optimization


Funding

National Natural Science Foundation of China (61732018)

Publication year

2024
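Journal of National University of Defense Technology
National University of Defense Technology

Indexed in CSTPCD and the Peking University Core Journals list (北大核心)
Impact factor: 0.517
ISSN: 1001-2486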