Cite this article: XU Xin, LIU Qiang, WANG Shaojun. A highly parallel design method for convolutional neural networks accelerator[J]. Journal of Harbin Institute of Technology, 2020, 52(4): 31. DOI: 10.11918/201812159
DOI: 10.11918/201812159
CLC number: TP391.4
Document code: A
Funding: National Natural Science Foundation of China (61574099); Science and Technology Development Fund of the Tianjin Transportation Commission (2017b-40)
|
A highly parallel design method for convolutional neural networks accelerator |
XU Xin1,LIU Qiang1,WANG Shaojun2
|
(1.Key Laboratory of Imaging and Sensing Microelectronic Technology (Tianjin University), Tianjin 300072, China; 2.School of Electronic and Information Engineering, Harbin Institute of Technology, Harbin 150001, China)
|
Abstract: |
To achieve highly parallel data transmission and computation for convolutional neural network acceleration and to generate efficient hardware accelerator designs, a hardware design and exploration method based on data alignment and multi-filter parallel computing was proposed. To improve data transmission and computation speed and to accommodate various input image sizes, the method first aligns the data according to the input image size, achieving highly parallel transmission and computation at the data level. The method also employs multi-filter parallel computing, so that different filters convolve the input image simultaneously, achieving parallelism at the filter level. On this basis, mathematical models of hardware resources and performance were formulated and solved numerically to obtain a hardware architecture that co-optimizes performance and resource usage. The proposed design method was applied to the single shot multibox detector (SSD) network. Results show that the accelerator, implemented on a Xilinx Zynq XC7Z045 with 16-bit fixed-point arithmetic at a 175 MHz clock frequency, achieved a throughput of 44.59 frames per second, a power consumption of 9.72 W, and a power efficiency of 31.54 GOP/(s·W). The accelerator consumed 85.1% and 93.9% less power than the central processing unit (CPU) and graphics processing unit (GPU) implementations of the same network, respectively. Compared with existing designs, the power efficiency of the proposed design is 20% to 60% higher. The design method is therefore well suited to embedded applications with low power requirements.
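The resource-performance co-optimization described in the abstract can be illustrated with a small design-space exploration loop. The sketch below is hypothetical and not the authors' actual model: it enumerates candidate data-level and filter-level parallelism factors, discards configurations whose estimated DSP usage exceeds the device budget, and keeps the configuration with the highest estimated throughput. All constants (DSP budget, MACs per frame, DSPs per multiply-accumulate) are illustrative assumptions, except the 175 MHz clock, which is taken from the paper.

```python
# Hypothetical design-space exploration sketch for a CNN accelerator:
# pick data-level (pd) and filter-level (pf) parallelism factors that
# maximize estimated throughput under a DSP resource constraint.
# Constants are illustrative assumptions, not values from the paper.

DSP_BUDGET = 900      # DSP48 slices available (Zynq XC7Z045 has 900)
DSP_PER_MAC = 1       # assumed: one DSP per 16-bit fixed-point MAC
CLOCK_HZ = 175e6      # clock frequency reported in the paper
MACS_PER_FRAME = 1e9  # assumed MAC operations needed per input frame

def explore(dsp_budget=DSP_BUDGET, max_factor=64):
    """Return (pd, pf, fps) with the best estimated frames per second."""
    best = None
    for pd in range(1, max_factor + 1):       # data-level parallelism
        for pf in range(1, max_factor + 1):   # filter-level parallelism
            if pd * pf * DSP_PER_MAC > dsp_budget:
                continue                      # violates resource model
            cycles = MACS_PER_FRAME / (pd * pf)  # idealized cycle count
            fps = CLOCK_HZ / cycles              # performance model
            if best is None or fps > best[2]:
                best = (pd, pf, fps)
    return best

pd, pf, fps = explore()
print(pd, pf, round(fps, 2))
```

Under these assumptions the loop simply saturates the DSP budget (pd * pf = 900), but the same skeleton extends to multi-constraint models (BRAM, memory bandwidth) where the optimum is no longer trivial.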
Key words: field programmable gate array (FPGA); convolutional neural network; parallelism; structure optimization; single shot multibox detector (SSD) network