Cite this article: XU Xin, LIU Qiang, WANG Shaojun. A highly parallel design method for convolutional neural networks accelerator[J]. Journal of Harbin Institute of Technology, 2020, 52(4): 31. DOI: 10.11918/201812159
DOI: 10.11918/201812159
CLC number: TP391.4
Document code: A
Funding: National Natural Science Foundation of China (61574099); Science and Technology Development Fund of the Tianjin Transportation Commission (2017b-40)
|
A highly parallel design method for convolutional neural networks accelerator |
XU Xin1,LIU Qiang1,WANG Shaojun2
|
(1.Key Laboratory of Imaging and Sensing Microelectronic Technology (Tianjin University), Tianjin 300072, China; 2.School of Electronic and Information Engineering, Harbin Institute of Technology, Harbin 150001, China)
|
Abstract: |
To achieve highly parallel data transmission and computation for convolutional neural network acceleration and to generate efficient hardware accelerator designs, a hardware design and exploration method based on data alignment and multi-filter parallel computing was proposed. To improve data transmission and computation speed and to accommodate various input image sizes, the method first aligns the data according to the input image size, achieving highly parallel transmission and computation at the data level. The method also employs multi-filter parallel computing, so that different filters convolve the input image simultaneously, achieving parallelism at the filter level. On this basis, mathematical models of hardware resources and performance were formulated and solved numerically to obtain a hardware architecture that co-optimizes performance and resource usage. The proposed design method was applied to the single shot multibox detector (SSD) network. Results show that the accelerator, implemented on a Xilinx Zynq XC7Z045 with 16-bit fixed-point arithmetic at a 175 MHz clock frequency, achieved a throughput of 44.59 frames per second, a power consumption of 9.72 W, and a power efficiency of 31.54 GOP/(s·W). The accelerator consumed 85.1% and 93.9% less power than the central processing unit (CPU) and graphics processing unit (GPU) implementations of the same network, respectively. Compared with existing designs, the power efficiency of the proposed design is 20% to 60% higher. The design method is therefore well suited to embedded applications with low power requirements.
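The resource-performance co-optimization described in the abstract can be illustrated with a small design-space exploration loop. The sketch below is hypothetical and not the authors' actual model: it enumerates candidate data-level and filter-level parallelism factors, discards configurations whose estimated DSP usage exceeds the device budget, and keeps the configuration with the highest estimated throughput. All constants (DSP budget, MACs per frame, DSPs per multiply-accumulate) are illustrative assumptions, except the 175 MHz clock, which is taken from the paper.

```python
# Hypothetical design-space exploration sketch for a CNN accelerator:
# pick data-level (pd) and filter-level (pf) parallelism factors that
# maximize estimated throughput under a DSP resource constraint.
# Constants are illustrative assumptions, not values from the paper.

DSP_BUDGET = 900      # DSP48 slices available (Zynq XC7Z045 has 900)
DSP_PER_MAC = 1       # assumed: one DSP per 16-bit fixed-point MAC
CLOCK_HZ = 175e6      # clock frequency reported in the paper
MACS_PER_FRAME = 1e9  # assumed MAC operations needed per input frame

def explore(dsp_budget=DSP_BUDGET, max_factor=64):
    """Return (pd, pf, fps) with the best estimated frames per second."""
    best = None
    for pd in range(1, max_factor + 1):       # data-level parallelism
        for pf in range(1, max_factor + 1):   # filter-level parallelism
            if pd * pf * DSP_PER_MAC > dsp_budget:
                continue                      # violates resource model
            cycles = MACS_PER_FRAME / (pd * pf)  # idealized cycle count
            fps = CLOCK_HZ / cycles              # performance model
            if best is None or fps > best[2]:
                best = (pd, pf, fps)
    return best

pd, pf, fps = explore()
print(pd, pf, round(fps, 2))
```

Under these assumptions the loop simply saturates the DSP budget (pd * pf = 900), but the same skeleton extends to multi-constraint models (BRAM, memory bandwidth) where the optimum is no longer trivial.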
Key words: field programmable gate array (FPGA); convolutional neural network; parallelism; structure optimization; single shot multibox detector (SSD) network