A neural network pruning and quantization algorithm for hardware deployment
Due to their superior performance, deep neural networks have been widely applied in fields such as image recognition and object detection. However, they contain a large number of parameters and require immense computational power, posing challenges for deployment on mobile edge devices that require low latency and low power consumption. To address this issue, a compression algorithm is proposed that replaces multiplication operations with bit-shifting and addition, compressing neural network parameters to low bit-widths through pruning and quantization. The algorithm reduces the difficulty of hardware deployment under limited multiplication resources, meets the low-latency and low-power requirements of mobile edge devices, and improves operational efficiency. Experiments conducted on classical neural networks with the ImageNet dataset showed that when the network parameters were compressed to 4 bits, accuracy remained essentially unchanged compared with the full-precision networks. Furthermore, for ResNet18, ResNet50, and GoogLeNet, the Top-1/Top-5 accuracies even improved by 0.38%/0.22%, 0.35%/0.21%, and 1.14%/0.57%, respectively. When the eighth convolutional layer of VGG16 was deployed on a Zynq7035 for testing, the compressed network reduced inference time by 51.1% and power consumption by 46.7%, while using 43% fewer DSP resources.
deep neural networks; hardware; pruning; quantization; FPGA
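The core idea of replacing multiplications with bit shifts and additions corresponds to quantizing weights to signed powers of two, so that multiplying an activation by a weight becomes a shift on integer hardware. Below is a minimal sketch of such a power-of-two quantizer; the function names, symmetric exponent range, and nearest-exponent rounding are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def quantize_power_of_two(w, bits=4):
    """Map each weight to sign * 2^e with e restricted to a bits-wide range.

    Zero weights (e.g. removed by pruning) stay zero. This is an assumed
    scheme: exponents are rounded to the nearest integer and clipped to a
    symmetric range below the largest exponent present in the tensor.
    """
    sign = np.sign(w)
    mag = np.abs(w)
    nonzero = mag > 0

    e = np.zeros(w.shape, dtype=np.int32)
    e[nonzero] = np.round(np.log2(mag[nonzero])).astype(np.int32)

    if nonzero.any():
        e_max = int(e[nonzero].max())
        e_min = e_max - (2 ** (bits - 1) - 1)  # assumed representable range
        e = np.clip(e, e_min, e_max)

    q = sign * (2.0 ** e)
    q[~nonzero] = 0.0
    return q, e

def shift_multiply(x_int, exponent):
    """On integer hardware, multiplying by 2^e reduces to a left/right shift."""
    return x_int << exponent if exponent >= 0 else x_int >> (-exponent)
```

With weights stored only as a sign bit and a small exponent, each multiply-accumulate in a convolution can be realized as a shift followed by an addition, which is what allows the FPGA deployment to use fewer DSP blocks.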