
Software-Hardware Co-Optimization for CNNs Based on Reconfigurable Devices

1st Fang Liu
School of Computer Science, Wuhan University; Wuhan City College
Wuhan, China
liufangfang@whu.edu.cn

2nd Zimeng Fan
College of Computer Science and Technology, Wuhan University of Science and Technology
Wuhan, China
fanzimeng@wust.edu.cn

3rd Yanxiang He∗
School of Computer Science, Wuhan University
Wuhan, China
yxhe@whu.edu.cn

4th Min Peng∗
School of Computer Science, Wuhan University
Wuhan, China
pengm@whu.edu.cn
Abstract—The Convolutional Neural Network (CNN) is an efficient recognition algorithm used in pattern recognition, image processing, and related fields. How to build an efficient computational system for CNNs has become a pressing problem, which is traditionally addressed by software optimization or by hardware accelerator design. In contrast, this paper argues that in neural network applications software algorithms and hardware architectures affect each other and are highly coupled. We therefore propose a co-optimization design that improves the processing efficiency of CNNs through the joint optimization of software and hardware. First, a modular analysis and collaborative design are carried out for the ShuffleNetV2 model by implementing quantization and improving the computational units. Second, the model is optimized for the characteristics of reconfigurable computing devices. Third, 8-bit quantization is implemented, while the depthwise-separable convolution operations and the channel selection module are redesigned so that they perform their operations in a hardware-friendly form. The experimental work was carried out on the Xilinx Zynq XC7Z045 FPGA platform using High-Level Synthesis (HLS). The experimental results show that the optimized model achieves a significant improvement in resource utilization and latency.
Index Terms—Software-Hardware Co-Optimization, Convolutional Neural Networks, ShuffleNetV2, Reconfigurable Computing
I. INTRODUCTION
With the development of computing power [1] [2] [3],
big data [5] [6] [7], and machine learning technology [12]
[13] [14], the research field of convolutional neural networks
(CNNs) has been extended from image processing to text,
video, and speech due to its high inference accuracy and strong
self-adaptability. This has a great impact on our society, for example in transportation [9] and healthcare [10].
This work was supported by the National Key R&D Program of China under Grant No. 2018YFC1604003, the General Program of the Natural Science Foundation of China (NSFC) under Grants No. 61772382 and No. 62072346, the Key R&D Project of Hubei Province under Grant No. 2020BAA021, and the Natural Science Foundation of Hubei Province under Grant No. 2020CFB795.
CNNs are constantly evolving and optimized, with the
emergence of SqueezeNet, MobileNet, ShuffleNet, Xception,
and other lightweight models that use special structures or
units to reduce the amount of computation. The core of the convolution kernel and pooling operations in CNNs is matrix computation, which demands high arithmetic throughput, and both convolution and pooling are streaming operations.
In the development of convolutional neural networks, a large number of convolutional variants have emerged, requiring many convolutional computations, and the increasing depth of networks also lowers computational efficiency [17] [18] [19]. Therefore, how to build an efficient computational system oriented to CNNs has become an urgent problem [4].
Existing approaches to this problem mainly rely on network-level optimization of CNNs or on designing hardware accelerators based on the characteristics of CNNs [15]. In current research, new deep neural network models need to be matched to the hardware architecture, so model features are not fully exploited on generic hardware architectures, and the design of neural network accelerators does not sufficiently consider the characteristics of the models themselves to achieve more efficient computation [16].
This paper argues that an important guideline for building
efficient AI systems is software and hardware co-design, where
neural network algorithms and hardware architectures interact
with each other in the computation process and have a high
degree of coupling. Therefore, software optimization needs
to be combined with hardware optimization in the form of
hardware-software synergy. While the neural network architecture and the number of parameters are designed reasonably on the software side, a hardware architecture supporting that model is designed from the hardware perspective, ensuring the accuracy and efficiency of the whole computing system.
In the AI system design process, deep learning algorithms
are first designed based on the target application scenario,
followed by optimization of the neural network model to
enable hardware acceleration. This optimization step typically
includes model compression and fixed-point quantization to
reduce the workload and improve the peak performance of the
AI accelerator, and then adjusting the hardware architecture
design based on the algorithm optimization strategy used. The
three steps are iterative to ensure that the metrics of the target
application scenario are met.
After the hardware architecture is designed, a customized
software compiler is used to convert the neural network
model into a sequence of instructions for run-time execution.
The software compiler automatically performs scheduling and
operator optimization to further improve the computational
efficiency of the hardware [50]. The hardware-software co-design is based on reconfigurable computing technology, and the capabilities of the programmable devices are fully utilized to optimize the performance of the neural network model [34].
Based on the idea of collaborative hardware and software optimization, this paper presents a modular analysis and collaborative design of ShuffleNetv2 [30], a lightweight CNN model. By implementing quantization, improving the computational units, and optimizing the shuffle module, the network optimization is carried out with hardware and software in collaboration. The experimental data in this paper show that the optimized ShuffleNetV2 network [30] has significantly better performance in terms of latency and resource utilization.
The main contributions of this paper are as follows:
• A collaborative design that starts from the variant convolutions of the ShuffleNetV2 model, quantizing and improving the convolutional computation units and optimizing the data layout.
• A layout of the ShuffleNetV2 implementation on an FPGA development board, together with a module-level analysis.
• A software optimization approach designed to match the hardware, realizing the combination of software and hardware [32] [33]. The method is verified on an FPGA, and the experimental data show that it improves resource utilization and reduces latency, achieving a clear optimization of ShuffleNetv2 through software-hardware cooperation.
II. BACKGROUND
A. Convolutional Neural Network
A Convolutional Neural Network (CNN) is a class of feedforward neural networks that incorporates convolutional computation, has a deep structure with representation-learning capability, and is commonly used for visual image processing. A CNN can effectively reduce a large volume of data to a small one while preserving the image features [16].
A typical CNN consists of three parts: a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer is responsible for feature extraction and is the core component of a CNN. By using convolution kernels to perform convolution operations, features are effectively extracted as key factors for recognition. Many convolutional variants have emerged from research based on the convolution operation, such as depthwise convolution and pointwise convolution.
The pooling layer is used to dramatically reduce the dimensionality of the data: down-sampling reduces the dimensionality, while maximum pooling or average pooling retains the important feature information. The fully connected layer acts as a "classifier": convolution, pooling, and activation functions are iterated several times before the results are classified by the fully connected layer.
B. ShuffleNetV2
ShuffleNetV2 is an upgraded version of ShuffleNet. ShuffleNetV2 is optimized to minimize memory accesses for the same channel size, gives full consideration to group convolution in terms of the grouping method and the number of groups, and reduces element-level operations [30].
ShuffleNetV2 introduces a new operation, Channel Split, which divides the input channels of a block into two parts: one part is passed directly downward and the other part goes through the convolutional branch. At the end of the block, the output channels of the two parts are concatenated, making the information between the channels interoperable.
ShuffleNetv2 retains the modules in ShuffleNet that require down-sampling, removes the channel split operation from those modules, and processes the information in the two branches separately before concatenation, doubling the number of output channels. ShuffleNetv2 uses a scaling factor to control the balance between accuracy and efficiency.
Fig. 1. ShuffleNetV2 Model
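To make the data movement of this unit concrete, the C++ sketch below walks through channel split, the branch computation, concatenation, and a 2-group channel shuffle. The flat channel-major buffer, the branch_conv() placeholder, and the shuffle index formula are our own illustrative assumptions rather than the paper's implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;   // assumed flat C*H*W layout, channel-major

// Placeholder for the 1x1 -> 3x3 depthwise -> 1x1 branch of the unit;
// the identity body is for illustration only.
Tensor branch_conv(const Tensor& x) { return x; }

// Channel split, branch computation, concatenation, and a 2-group channel
// shuffle, following the description of the ShuffleNetV2 basic unit above.
Tensor shufflenetv2_unit(const Tensor& in, std::size_t C, std::size_t HW) {
    const std::size_t half = C / 2;
    // Channel split: the first half bypasses the branch, the second half is computed.
    Tensor left(in.begin(), in.begin() + half * HW);
    Tensor right(in.begin() + half * HW, in.end());
    right = branch_conv(right);

    // Concatenate the two halves along the channel dimension.
    Tensor cat(left);
    cat.insert(cat.end(), right.begin(), right.end());

    // Channel shuffle with 2 groups: channel ch moves to (ch % half) * 2 + ch / half,
    // interleaving the bypassed and the computed channels.
    Tensor out(cat.size());
    for (std::size_t ch = 0; ch < C; ++ch) {
        const std::size_t dst = (ch % half) * 2 + ch / half;
        std::copy(cat.begin() + ch * HW, cat.begin() + (ch + 1) * HW,
                  out.begin() + dst * HW);
    }
    return out;
}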
C. Reconfigurable Computing
The theory of reconfigurable computing was proposed by
Gerald Estrin in the 1960s. The core idea is that hardware
resources can be configured to build dedicated circuits for
computing based on the characteristics of the data flow and the
computation [37] [38]. If the algorithm changes, the hardware resources can be reconfigured based on the new data flow and the characteristics of the computation, so that the hardware can be adapted to the new computational requirements. The basis
of reconfigurable computing is a computing device that can
reconfigure hardware resources. Currently, a commonly used
hardware platform for reconfigurable computing is the FPGA
[32].
Research shows that the overall computing power of reconfigurable computing can exceed that of a CPU by several times while consuming far less power, even though the computing unit used in reconfigurable computing, the FPGA [31], is clocked at a much lower frequency than contemporary CPUs [15].
The convolution operations in a CNN model are inherently parallel and are well suited to multi-core architectures for parallel processing [22] [23] [24]; using reconfigurable computing therefore allows effective optimization of the CNN model and improves overall performance.
III. RELATED WORK
A. CNN Optimization Model
CNN-based optimized models such as SqueezeNet [44], MobileNets [11], ShuffleNet [35], and Xception [36] have been widely used in mobile applications [8]. SqueezeNet was published at ICLR 2017 by researchers from Berkeley and Stanford; the Squeeze layer in the model uses a 1x1 convolution kernel to convolve the feature map of the previous layer, aiming to reduce the dimensionality of the feature map. ShuffleNet was proposed by the Face++ team in 2017; the shuffle in the model performs channel shuffling, which reduces the number of channels in each part. Xception was proposed by Google, based on Inception-V3 and combined with depthwise convolution.
B. Hardware Accelerators for CNNs
Current hardware accelerators for CNNs mainly refer to
the use of FPGAs as a hardware platform to accelerate a
part of the CNN computation in hardware form. FPGA-based
model optimization can be designed by compressing the model
to greatly reduce the model size with a reasonable loss of
accuracy, for example by performing data quantization [48]
and making the weights smaller by pruning or sparse matrices
[31].
The second approach is to optimize the structure of the
model, mainly by optimizing the computational units, e.g. by
the fast Fourier transform [20], by optimizing the loop structure [45] [46], e.g. by expanding the input and output channels
to increase the parallelism of the convolution operations [44],
and by optimizing the access structure, e.g. by optimizing the
arrangement of the CNN data [21].
From the existing research, it can be found that current work focuses mainly on the optimization of CNN models or on the design of CNN-oriented hardware accelerators. However, software and hardware should work together for CNN-oriented optimization [49]. Therefore, this paper proposes the idea of collaborative optimization of software and hardware for CNNs and carries out the related work based on it.
C. Exploring the Design Space
In this section, a collaborative hardware and software optimization exploration [41] [42] [43] based on the ShuffleNetV2 model is presented. The idea of the optimization method is to analyze the convolutional modules based on the computational characteristics of ShuffleNetV2, after confirming the quantization method, using a blockwise hardware and software model optimization method.
D. The Problem of Quantizing the Model Parameters
Deep neural network models can perform high-precision computation with limited data precision [40]. In a hardware-software collaborative approach [28] [29], it is necessary to balance classification accuracy and latency when designing an FPGA-based hardware-software co-design, and to design a reasonable quantization scheme towards this goal.
Fig. 2. Accuracy loss in Top-1 classification versus bit-width Q of input feature maps
Research [35] has shown that the weights have a small impact on classification accuracy as long as they retain a reasonable precision. As shown in Fig. 2, experiments on the CIFAR-10 dataset show that the weights can be quantized to 8-bit (or lower) precision with an inference accuracy reduction of less than 1%. It follows that, once the weights are quantized to 8-bit precision, the loss in classification accuracy can be studied in software by reducing the quantized bit width of the feature images, where the quantized bit width is denoted by Q.
Since the input images all lie between -1 and 1 after pre-processing, the default bit width for the integer part in the experiment is 1. If the quantized bit width is Q, the integer part takes 1 bit and the fractional part takes Q-1 bits.
The experimental data show that Q=8 is the optimal feature-image quantization scheme when both accuracy and resource utilization are considered for applications where accuracy is not extremely demanding, while any value of Q>=8 results in only a small loss in classification accuracy while improving resource utilization and performance. Therefore, in this paper, an 8-bit quantization scheme is used for both the weights and the feature images in the hardware-software collaborative optimization scheme.
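As a concrete illustration of this Q-bit fixed-point scheme (1 integer bit, Q-1 fractional bits), the following C++ sketch quantizes a value in [-1, 1) to an 8-bit code and back; the rounding and saturation choices are our own assumptions, not necessarily those used in the paper's experiments.

#include <algorithm>
#include <cmath>
#include <cstdint>

// Q-bit signed fixed-point with 1 integer bit and Q-1 fractional bits, for
// inputs normalized to [-1, 1). For Q = 8 the scale is 2^7 = 128.
constexpr int Q = 8;
constexpr float SCALE = float(1 << (Q - 1));   // 128 for Q = 8

int8_t quantize(float x) {
    float code = std::round(x * SCALE);
    code = std::min(std::max(code, -SCALE), SCALE - 1.0f);  // saturate to [-128, 127]
    return static_cast<int8_t>(code);
}

float dequantize(int8_t q) {
    return static_cast<float>(q) / SCALE;
}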
E. Analysis of ShuffleNetV2 Convolutional Modules
This section analyzes the various parts of ShuffleNetv2 to find its computational bottlenecks and optimization directions, following the idea of accelerating high-probability events. ShuffleNetv2 is a lightweight CNN-based model whose computation is dominated by the multiply-accumulate (MAC) operations of convolution; therefore this paper performs acceleration around the MAC-heavy, high-probability operations.
As shown in Fig. 3, there are four types of convolution in the ShuffleNetV2 model, namely depthwise convolution, pointwise convolution, normal convolution, and the fully connected layer, where the width and height of the convolution kernel for both depthwise and normal convolution are 3.
In the ShuffleNetV2 model, pointwise convolution accounts for 70.240% of the computation, ordinary convolution for 23.159%, depthwise convolution for 6.442%, and the fully connected layer for 0.159%; the last is negligible because the fully connected layer only exists in the output layer to obtain the classification results. Based on the idea of accelerating high-probability events, the hardware and software co-design in this paper focuses on the pointwise convolution, the normal convolution, and the depthwise-separable convolution of the ShuffleNetV2 model.
Fig. 3. Analysis of ShuffleNetV2 convolution module
IV. SOFTWARE-HARDWARE CO-DESIGN
Based on the analysis in the previous section, this paper redesigns the ShuffleNetv2 model around its three computationally intensive convolution operations, while reducing the impact of the channel split operation and optimizing the structure of the model through software and hardware collaboration [25] [26].
The system is optimized in terms of both hardware and software [42] [43] [46]. In the software design part, different computational processes and the corresponding data flows are designed for the three different convolutional forms. In the hardware design part, the overall architecture of the system and the top-level architecture of the convolution operations are presented.
A. Software Optimization Design
In the ShuffleNetv2 model [30], there are three different types of convolutional units: pointwise convolution with a kernel size of 1, depthwise convolution with a kernel size of 3, and normal convolution with a kernel size of 3.
In this paper, we design a highly parallelized convolutional computation scheme for each type of convolution, and achieve fine-grained parallelism by replicating the data before transferring it to the PE units, so that the convolution-kernel data stream and the feature-image data stream are aligned. In this section the feature image is set to C × H × W and the convolution kernel to c × d × d.
a) Pointwise Convolution: In contrast to normal convolution, pointwise convolution has a kernel size d of only 1, so the convolution kernel does not move over the feature image; instead, the feature map itself is traversed pixel by pixel.
This allows all of the convolution kernels to operate in parallel. As shown in Fig. 4, we take the values at the same position in each dimension of the input feature image one at a time, and then copy them c times, completing one round of operations before the next pixel's values are taken from each dimension. The convolution-kernel side, in turn, takes out the data of each convolution kernel one after another.
Fig. 4. Pointwise convolution data layout
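The following C++ sketch illustrates this pointwise data layout: for each pixel position, the C input values are read once, replicated across the c output kernels, and multiply-accumulated with the corresponding 1x1 kernels. The loop structure and buffer layout are illustrative assumptions rather than the paper's HLS source.

#include <cstddef>
#include <vector>

// Pointwise (1x1) convolution over a C x H x W feature map with c kernels.
// in: C*H*W values (channel-major), w: c*C weights, out: c*H*W values.
void pointwise_conv(const std::vector<float>& in, const std::vector<float>& w,
                    std::vector<float>& out, int C, int c, int H, int W) {
    out.assign(static_cast<std::size_t>(c) * H * W, 0.0f);
    for (int p = 0; p < H * W; ++p) {          // traverse the feature map pixel by pixel
        for (int k = 0; k < c; ++k) {          // every kernel sees a copy of the same pixel vector
            float acc = 0.0f;
            for (int ch = 0; ch < C; ++ch)     // accumulate over the input channels
                acc += in[ch * H * W + p] * w[k * C + ch];
            out[k * H * W + p] = acc;
        }
    }
}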
b) Depthwise Convolution: Compared with the other two convolution operations, the most important feature of depthwise convolution is that the number of convolution-kernel channels is the same as the number of feature-image channels, i.e. C = c, and there is no data dependency between channels.
The feature image is read in the same way as in the pointwise method, but instead of taking out the value of a single pixel, a window of data is extracted, as shown in Fig. 5. The convolution kernel then takes out its data sequentially along the channel dimension.
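A C++ sketch of this per-channel windowed computation is given below; stride 1 and no padding are simplifying assumptions made for illustration.

#include <cstddef>
#include <vector>

// Depthwise convolution: each of the C channels is convolved with its own d x d
// kernel (C = c). in: C*H*W, w: C*d*d, out: C*(H-d+1)*(W-d+1).
void depthwise_conv(const std::vector<float>& in, const std::vector<float>& w,
                    std::vector<float>& out, int C, int H, int W, int d) {
    const int Ho = H - d + 1, Wo = W - d + 1;
    out.assign(static_cast<std::size_t>(C) * Ho * Wo, 0.0f);
    for (int ch = 0; ch < C; ++ch)                 // channels are independent
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float acc = 0.0f;
                for (int ky = 0; ky < d; ++ky)     // d x d window extracted per output pixel
                    for (int kx = 0; kx < d; ++kx)
                        acc += in[ch * H * W + (y + ky) * W + (x + kx)]
                             * w[ch * d * d + ky * d + kx];
                out[ch * Ho * Wo + y * Wo + x] = acc;
            }
}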
c) Normal Convolution: Since pointwise and depthwise convolution can be regarded as the two computational steps of one ordinary convolution, the scheme for ordinary convolution is a fusion of the two previous methods. For a single convolution kernel, the data is read in the same order as for depthwise convolution, while across different convolution kernels the feature image is copied c times, as in the pointwise case.
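The fused scheme for ordinary convolution can be sketched as follows: the window extraction follows the depthwise pattern, and the replication across the c kernels follows the pointwise pattern. As before, stride 1 and no padding are simplifying assumptions.

#include <cstddef>
#include <vector>

// Ordinary convolution as a fusion of the two schemes above: per-kernel window
// reads (depthwise pattern) and c-fold reuse of the feature data across kernels
// (pointwise pattern). in: C*H*W, w: c*C*d*d, out: c*(H-d+1)*(W-d+1).
void normal_conv(const std::vector<float>& in, const std::vector<float>& w,
                 std::vector<float>& out, int C, int c, int H, int W, int d) {
    const int Ho = H - d + 1, Wo = W - d + 1;
    out.assign(static_cast<std::size_t>(c) * Ho * Wo, 0.0f);
    for (int k = 0; k < c; ++k)                        // feature data reused by every kernel
        for (int y = 0; y < Ho; ++y)
            for (int x = 0; x < Wo; ++x) {
                float acc = 0.0f;
                for (int ch = 0; ch < C; ++ch)         // window read, as in depthwise
                    for (int ky = 0; ky < d; ++ky)
                        for (int kx = 0; kx < d; ++kx)
                            acc += in[ch * H * W + (y + ky) * W + (x + kx)]
                                 * w[((k * C + ch) * d + ky) * d + kx];
                out[k * Ho * Wo + y * Wo + x] = acc;
            }
}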
Fig. 5. Depthwise Convolution data layout
Fig. 6. Normal Convolution data layout
B. Hardware design
a) Overall Architecture: The hardware design of this scheme is based on a reconfigurable computing platform [37] [38] [39]. The overall structure of the optimized solution is shown in Fig. 7. The system is realized by quantizing the data, uploading and storing the weights in BRAM, deploying the entire design on the FPGA, and controlling the system flow through the ARM core.
Fig. 7. Overall System Architecture
The data is compressed by the quantization operation described in the previous section so that it can be stored in BRAM, thus speeding up data access during inference. The data is read over the AXI4 bus, and the commands from the ARM core are also transmitted via the AXI4 bus. When the ShuffleNetv2 module has finished its computation, an interrupt signal is generated and returned to the ARM core through interrupt control, marking the completion of the inference computation.
Finally, the ARM core completes the inference process by retrieving the inference results from BRAM. We use the ARM core on the FPGA for the overall control of the system, calling the ShuffleNetv2 module designed in this paper (packaged as an IP core) to perform the inference, record information, and configure parameters.
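To summarize this control flow on the ARM side, the sketch below shows one plausible host sequence: load the quantized weights, start the accelerator IP, wait for the completion interrupt, and read back the results. The class, its member names, and the software model of the IP are purely illustrative assumptions; the actual driver code for the HLS-generated IP is not shown in the paper.

#include <cstdint>
#include <vector>

// Hypothetical software model of the ShuffleNetv2 IP core's interface; the
// member names mirror the control flow described above and are illustrative
// only (the real IP is driven through AXI4 registers generated by HLS).
class ShuffleNetV2Ip {
public:
    void load_weights(const std::vector<int8_t>& w) { weights_ = w; }   // weights into BRAM
    void write_input(const std::vector<int8_t>& img) { input_ = img; }  // quantized input
    void start() { done_ = true; }                   // stand-in for the AXI4 start command
    bool interrupt_raised() const { return done_; }  // completion interrupt
    std::vector<int8_t> read_results() const { return results_; }       // results from BRAM
private:
    std::vector<int8_t> weights_, input_, results_;
    bool done_ = false;
};

// Host-side control flow on the ARM core: upload weights, start the IP,
// wait for the interrupt, read back the inference result.
std::vector<int8_t> run_inference(ShuffleNetV2Ip& ip,
                                  const std::vector<int8_t>& weights,
                                  const std::vector<int8_t>& image) {
    ip.load_weights(weights);
    ip.write_input(image);
    ip.start();
    while (!ip.interrupt_raised()) { /* poll; a real driver would block on the interrupt */ }
    return ip.read_results();
}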
b) Convolution Module Design: The design of the convolution module is shown in Fig. 8. In addition to the feature images transferred to the module, an identifier Flag is set to distinguish which type of convolution operation is being performed: when Flag = PW the operation is pointwise convolution, when Flag = DW it is depthwise convolution, and when Flag = CON it is normal convolution. The PE unit consists of a number of PEs (processing elements) and an adder tree.
The feature images, prepared as described in the previous section, are multiplied with the corresponding weights in the PEs in parallel; the convolution is then completed by a set of adder trees, and finally the bias is added to each output result. If Flag is PW, the output feature images are divided into two groups.
With this design, the feature image is already divided at the pointwise convolution of the preceding convolution unit, so the following channel split unit can be skipped. It is worth noting that the feature image is still stored contiguously in hardware after the split, so subsequent operations are not affected, while the memory-access pressure caused by channel split operations is reduced.
Fig. 8. Convolution module top-level design
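An HLS-style C++ sketch of this top-level module is shown below. The Flag values, the PE count, the requantization shift, and the way the adder tree is expressed are illustrative assumptions based on the description above, not the paper's actual HLS code.

#include <cstdint>

enum ConvFlag { PW, DW, CON };   // selects pointwise, depthwise, or normal convolution

constexpr int NUM_PE = 16;       // optimal PE count found in the experiments

// One processing step: NUM_PE multiplications in parallel followed by an adder
// tree, then the bias is added. Fixed-point widths follow the 8-bit scheme.
int32_t pe_array_step(const int8_t feat[NUM_PE], const int8_t wgt[NUM_PE], int32_t bias) {
    int32_t prod[NUM_PE];
#pragma HLS ARRAY_PARTITION variable=prod complete
    for (int i = 0; i < NUM_PE; ++i) {
#pragma HLS UNROLL
        prod[i] = int32_t(feat[i]) * int32_t(wgt[i]);    // PEs multiply in parallel
    }
    // Adder tree: pairwise reduction in log2(NUM_PE) stages.
    for (int step = 1; step < NUM_PE; step *= 2)
        for (int i = 0; i + step < NUM_PE; i += 2 * step)
            prod[i] += prod[i + step];
    return prod[0] + bias;
}

// Top-level convolution module sketch.
void conv_module(ConvFlag flag, const int8_t* feat, const int8_t* wgt,
                 const int32_t* bias, int8_t* out, int n_outputs) {
    // Flag selects which data layout (pointwise, depthwise, or normal) the input
    // streams follow; the MAC/adder-tree datapath below is shared by all three.
    // For Flag = PW, the first and second halves of 'out' form the two channel
    // groups, so the separate channel split unit can be skipped.
    (void)flag;
    for (int o = 0; o < n_outputs; ++o) {
        int32_t acc = pe_array_step(&feat[o * NUM_PE], &wgt[o * NUM_PE], bias[o]);
        out[o] = int8_t(acc >> 7);   // requantize to 8 bits (illustrative scaling)
    }
}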
V. EXPERIMENT
A. Experiment Setup
The dataset used in this paper is the classic image-classification dataset CIFAR-10. We implemented ShuffleNetv2 on different hardware platforms: the CPU was an Intel(R) Core(TM) i5-7500 @ 3.4 GHz (single core) running a C++ version of the ShuffleNetv2 model, and the GPU was a Tesla P100 on Google Colab running the PyTorch implementation of ShuffleNetv2 [50].
The optimization was implemented on a ZYNQ-7045 board, which uses the Xilinx FPGA chip XC7Z045 and operates at 100 MHz. Five configurations of the system were set up to explore the optimal number of PEs, with 4, 8, 16, 32, and 64 PEs respectively.
B. Experiment Results
In this subsection we first show the comparisons made in exploring the optimal number of PEs, for which five sets of comparison experiments were set up. The experiments are tested and
compared by setting different numbers of PEs on the FPGA
through the 3 convolution modules that have been implemented. The choice of PE number is based on the alignment of
the data stream and the nature of the 3 convolution operations.
Fig. 9. Comparison of resource usage with different number of PEs
Fig. 9 shows the consumption of the hardware resources DSP, BRAM, FF, and LUT as the number of PEs increases. Except for BRAM, which does not change significantly, the consumption of DSP, FF, and LUT all grows with the number of PEs, with LUT increasing the most, from 39k to 434k as the number of PEs goes from 4 to 64.
Fig. 10. Comparison of latency and LUT usage for different numbers of PEs
Fig. 10 shows the relationship between LUT usage, latency, and the maximum number of LUTs as the number of PEs increases. Latency decreases as the number of PEs increases, but at the same time the consumption of LUTs also increases. The maximum number of LUTs on the FPGA used in this paper is 210k. As can be seen from Fig. 10, when the number of PEs is 32 the LUT usage essentially reaches this 210k maximum, and the LUTs would not be sufficient if more PEs were added, so the number of PEs should be no more than 32.
In practice, the LUT usage should preferably not be saturated. When the number of PEs is 16, the latency and LUT curves intersect and the efficiency of the LUTs is at its highest. Therefore, according to the experimental results, the optimal setting for the number of PEs in this paper is 16.
TABLE I
COMPARISON WITH CPU AND GPU

Device          Execution time (ms)   Top-1 acc
FPGA XC7Z045    28.78                 75.027%
GPU             43.4                  75.84%
CPU             283.8                 75.1%
In Table I, the design of this paper is compared with ShuffleNetv2 running on a CPU and a GPU with the same number of parameters. In terms of accuracy, it is only 0.813% lower than the original ShuffleNetV2, which shows that the design does not excessively harm the accuracy of the model.
At the same time, the number of parameters remains essentially the same as in the original model, except for the data cache used to remove the data dependencies. On top of this, our design is still 9.8x faster than the single-core CPU and 1.5x faster than the GPU.
Table II shows the comparison between the baseline experiment and the accelerated results of other methods, where the baseline model is the data before optimization and the other columns are the experimental data from other papers [20] [21] [47]. DSP, BRAM, FF, and LUT denote the hardware resource usage. The BRAM usage of the baseline model is as high as 1276, while its DSP usage is only 95, which indicates low resource utilization and high memory consumption [48]. The optimized model substantially increases the DSP usage and reduces the BRAM usage.
TABLE II
COMPARISON WITH OTHER WORK

Work                                     Device    DSP   FF     LUT    FPS      GOPS    Latency     Clock
baseline                                 XC7Z045   95    18k    28k    -        15.79   109.42ms    100MHz
Partial Reconfiguration on FPGA [47]     XC7Z020   220   106k   154k   -        -       40-44ms     100MHz
Synetgy [20]                             ZU3EG     37    30k    24k    58.7     47.09   104.3ms     250MHz
Data Optimization CNN Accelerator [21]   XC7Z045   340   22k    42k    -        57.5    100ms       100MHz
Our                                      XC7Z045   220   87k    115k   291.45   76.27   28.78ms     100MHz
At the same time, the optimized latency is only 26.3% of the original [4]. Compared with the partial-reconfiguration design [47], which uses the same dataset as this paper, the latency of this paper is only 65% to 72% of theirs, with approximately the same hardware resource consumption.
The model used in [20] is the same as that used in this paper, although it is worth noting that [20] is of limited reference value for the FPS figures because it was tested on the ImageNet dataset. In addition, the device used in [20] is a ZU3EG with a clock frequency of 250 MHz, while the device used in this paper is clocked at only 100 MHz. The final results show that the GOPS of this paper is improved by 1.59x and the latency is reduced from 104.3 ms to 28.78 ms.
Compared to [21], where the data is re-arranged by changing the data layout, a method similar to the one proposed in this paper, our latency is still only 32% of that of [21]. These comparative results demonstrate the improved resource utilization and reduced latency of this paper's approach.
VI. CONCLUSION
Improving the computational efficiency of CNNs and related models has been a hot research topic. Traditional optimization methods mainly optimize the model or design hardware accelerators. In this paper, we propose an optimization scheme for ShuffleNetv2 on a reconfigurable platform based on the idea of hardware-software co-optimization, and validate it on FPGA hardware.
The work starts with data quantization and then optimizes each of the three convolution types of the model: pointwise, depthwise, and normal convolution. The redesign of the convolution module and the high degree of pipelining result in a significant reduction in latency.
Experimental results from the optimization experiments
conducted on a Xilinx Zynq 7045 development board show
that the optimization solution significantly speeds up the
inference of ShuffleNetV2, improves the resource utilization
and reduces the latency of the system compared to the existing
model.
In future work, we will continue to focus on software and
hardware cooperative deep learning model optimization, and
the next work will concentrate on optimizing the computational processing of the first and last ordinary convolution
in ShuffleNetV2. We will further reduce the latency through operations such as loop unrolling, and conduct research on
hardware-software collaborative-based model optimization by
adding more layers to the data flow architecture and improving
the ratio of computation to communication.
VII. ACKNOWLEDGMENT
This work was supported by the National Key R&D Program of China under Grant No. 2018YFC1604003, the General Program of the Natural Science Foundation of China (NSFC) under Grants No. 61772382 and No. 62072346, the Key R&D Project of Hubei Province under Grant No. 2020BAA021, and the Natural Science Foundation of Hubei Province under Grant No. 2020CFB795. We thank MindSpore, a new deep learning computing framework, for its partial support of this work.
REFERENCES
[1] M. Qiu and J. Li, “Real-time embedded systems: optimization, synthesis,
and networking,” CRC Press
[2] J. Wang, M. Qiu, B. Guo, “High reliable real-time bandwidth scheduling for virtual machines with hidden Markov predicting in telehealth platform,” Future Generation Computer Systems, 49, 68-76, 2015
[3] F. Hu, S. Lakdawala, Q. Hao, M. Qiu, “Low-power, intelligent sensor hardware interface for medical data preprocessing,” IEEE Transactions on Information Technology in Biomedicine, 13(4), 656-663, 2009
[4] S. H. Shi, X. M. Chu, “Speeding up Convolutional Neural Networks”,
arXiv preprint arXiv:1704.07724 (2017).
[5] P. Wu, Z. Lu, Q. Zhou, Z. Lei, X. Li, M. Qiu, P.C.K. Hung, “Bigdata logs
analysis based on seq2seq networks for cognitive Internet of Things,”
Future Generation Computer Systems, 90, 477-488, 2019
[6] R. Lu, X. Jin, S. Zhang, M. Qiu, X. Wu, “A study on big knowledge
and its engineering issues,” IEEE Transactions on Knowledge and Data
Engineering, 31 (9), 1630-1644, 2018
[7] W. Dai, L. Qiu, A. Wu, M. Qiu, “Cloud infrastructure resource allocation for big data applications,” IEEE Transactions on Big Data, 4(3), pp. 313-324, 2016
[8] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.C. Chen, “Inverted
residuals and linear bottlenecks: Mobile networks for classification,
detection and segmentation”. arXiv preprint arXiv:1801.04381 (2018).
[9] M. Zhu, X. Y. Liu, F. Tang, M. Qiu, R. Shen, W. Shu, M. Y. Wu, “Public vehicles for future urban transportation,” IEEE Transactions on Intelligent Transportation Systems, 17(12), pp. 3344-3353, 2016
[10] Q. Zhang, T. Huang, Y. Zhu, M. Qiu, “A case study of sensor data collection and analysis in smart city: provenance in smart food supply chain,” International Journal of Distributed Sensor Networks, 9(11), 382132, 2013
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, et al., “Mobilenets:
Efficient convolutional neural networks for mobile vision applications”.
arXiv preprint arXiv:1704.04861 (2017).
[12] H. Qiu, M. Qiu, Z. Lu, “Selective encryption on ECG data in body sensor network based on supervised machine learning,” Information Fusion, 55, 59-67, 2020
[13] K. Gai and M. Qiu, “Reinforcement learning-based content-centric
services in mobile sensing,” IEEE Network, 32(4), pp.34-39, 2018
[14] K. Gai and M. Qiu, “Optimal resource allocation using reinforcement
learning for IoT content-centric services,” Applied Soft Computing,
Vol.70, pp.12-21, 2018
[15] Z. W. Cai, X. D. He, J. Sun, and V. Nuno, “Deep learning with low
precision by half-wave Gaussian quantization”. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR2017). 5406–5414.
2017.
[16] K. Simonyan, A. Zisserman. ”Very deep convolutional networks for
large-scale image recognition”. arXiv preprint arXiv:1409.1556, 2014.
[17] J. Niu, Y. Gao, M. Qiu, Z. Ming, “Selecting proper wireless network
interfaces for user experience enhancement with guaranteed probability,”
Journal of Parallel and Distributed Computing, 72(12), pp. 1565-1575,
2012
[18] M. Qiu, K. Zhang, M. Huang, “Usability in mobile interface browsing,”
Web Intelligence and Agent Systems: An International Journal, 4(1), pp.
43-59, 2006
[19] M. Qiu, K. Zhang, M. Huang, “An empirical study of web interface design on small display devices,” IEEE/WIC/ACM International Conference on Web Intelligence (WI’04), pp. 29-35, 2004
[20] Y. F. Yang, Q. J. Huang, “Synetgy: Algorithm-hardware Co-design for
ConvNet Accelerators on Embedded FPGAs”. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2019.
[21] W. Hu, S. Chen, et al., “Data Optimization CNN Accelerator Design on
FPGA”, IEEE ISPA, 294-299, 2019, Xiamen, China, doi: 10.1109/ISPABDCloud-SustainCom-SocialCom48970.2019.00051.
[22] M. Qiu, H. Li, E.H.-M. Sha, “Heterogeneous real-time embedded
software optimization considering hardware platform,” ACM symposium
on Applied Computing, pp. 1637-1641, 2009
[23] M. Qiu, C. Xue, Z. Shao, E.H.-M. Sha, “Energy minimization with
soft real-time and DVS for uniprocessor and multiprocessor embedded
systems,” ACM/IEEE Design, Automation Test in Europe Conference
(DATE), pp. 1-6, 2007
[24] M. Qiu, Z. Chen, M. Liu, “Low-power low-latency data allocation for
hybrid scratch-pad memory,” IEEE Embedded Systems Letters, 6(4),
69-72, 2014
[25] Y. Guo, Q. Zhuge, J. Hu, M. Qiu, E.H.-M. Sha, “Optimal data allocation
for scratch-pad memory on embedded multi-core systems,” IEEE International Conference on Parallel Processing (ICPP), pp.464-471, 2011
[26] L. Zhang, M. Qiu, W.C. Tseng, E.H.-M. Sha, “Variable partitioning
and scheduling for MPSoC with virtually shared scratch pad memory,”
Journal of Signal Processing Systems, Vol.58(2), pp. 247-265, 2010
[27] K. Kwon, A. Amid, et al., “Co-design of deep neural nets and neural net accelerators for embedded vision applications”. arXiv preprint
arXiv:1804.10642, (2018).
[28] M. Liu, S. Zhang, Z. Fan, M. Qiu, “H Infinite State Estimation for
Discrete-Time Chaotic Systems Based on a Unified Model,” IEEE
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics),
Vol.42, 2012
[29] M. Qiu, D. Cao, H. Su, K. Gai, “Data transfer minimization for financial derivative pricing using Monte Carlo simulation with GPU in 5G,” International Journal of Communication Systems, 29(16), pp. 2364-2374, 2016
[30] N. N. Ma, X. Y. Zhang, et al., “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” arXiv:1807.11164, 2018.
[31] J. S. Wang, Q. W. Lou, et al., “Design flow of accelerating hybrid
extremely low bit-width neural network in embedded FPGA”. arXiv
preprint arXiv:1808.04311 (2018).
[32] J. T. Qiu, J. Wang, et al., “Going deeper with embedded FPGA platform
for convolutional neural network”. In Proceedings of the ACM SIGDA,
2016.
[33] R. Z. Zhao, X. Y. Niu, et al., “Optimizing CNN-based object detection
algorithms on embedded FPGA platforms”. In Proceedings of the
Annual ARC Processor Summit (ARC). Springer, 255–267, 2017
[34] D. Gschwend, “ZynqNet: An FPGA-accelerated embedded convolutional neural network,” ETH Zurich, 2016.
[35] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices”.
arXiv:1707.01083. (2017).
[36] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” Computer Vision and Pattern Recognition, 1800-1807, 2017.
[37] H. Zhao, M. Chen, M. Qiu, K. Gai, M. Liu, “A novel pre-cache schema
for high performance Android system,” Future Generation Computer
Systems, Vol. 56, pp. 766-772, 2016
[38] Y. Gao, S. Iqbal, P. Zhang, M. Qiu, “Performance and power analysis
of high-density multi-GPGPU architectures: A preliminary case study,”
IEEE 17th International Conference on High Performance Computing
(HPCC), 2015
[39] M. Qiu, L. Chen, Y. Zhu, J. Hu, X. Qin, “Online data allocation for
hybrid memories on embedded tele-health systems,” IEEE Intl Conf on
High Performance Computing and Communications (HPCC), 2014
[40] P. Gysel, et al., “Ristretto: A Framework for Empirical Study of
Resource-Efficient Inference in Convolutional Neural Networks”. IEEE
Trans. Neural Networks and Learning Systems, 29(11),5784–5789,
2018.
[41] K. Gai, M. Qiu, M. Liu, Z. Xiong, “In-memory big data analytics under space constraints using dynamic programming,” Future Generation Computer Systems, 83, 219-227, 2018
[42] J. Niu, C. Liu, Y. Gao, M. Qiu, “Energy efficient task assignment
with guaranteed probability satisfying timing constraints for embedded
systems,” IEEE Transactions on Parallel and Distributed Systems, 25(8),
2043-2052, 2013
[43] Y. Guo, Q. Zhuge, J. Hu, J. Yi, M. Qiu, E. H.-M. Sha, “Data placement
and duplication for embedded multicore systems with scratch pad
memory,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits, 2013
[44] F. N. Iandola, H. Song, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” ICLR, 2017.
[45] M. Qiu, M. Guo, M. Liu, C. J. Xue, L. T. Yang, E. H.-M. Sha, “Loop scheduling and bank type assignment for heterogeneous multi-bank memory,” Journal of Parallel and Distributed Computing, 69(6), 546-558, 2009
[46] Z. Shao, M. Wang, Y. Chen, C. Xue, M. Qiu, L. T. Yang, E. H.-M. Sha, “Real-time dynamic voltage loop scheduling for multi-core embedded systems,” IEEE Transactions on Circuits and Systems II, 54(5), 445-449, 2007
[47] M. Farhadi, M. Ghasemi, Y. Z. Yang, “A Novel Design of Adaptive and
Hierarchical Convolutional Neural Networks using Partial Reconfiguration on FPGA”. arXiv:1909.05653. (2019).
[48] J. S. Wang, Q. W. Lou, et al., “Design flow of accelerating hybrid
extremely low bit-width neural network in embedded FPGA”. arXiv
preprint arXiv:1808.04311 (2018).
[49] J. T. Qiu, J. Wang, et al., “Going deeper with embedded FPGA
platform for convolutional neural network”. ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. 26–35, 2016
[50] Mindspore. https://www.mindspore.cn/, 2020