A review of CNN accelerators for embedded systems
based on RISC-V
Alejandra Sanchez-Flores
Industrial and Construction Eng. dept.
Universitat de les Illes Balears
Palma, Spain
alejandra.sanchez1@estudiant.uib.cat
Lluc Alvarez
Barcelona Supercomputing Center,
Universitat Politècnica de Catalunya
Barcelona, Spain
lluc.alvarez@bsc.es
Bartomeu Alorda-Ladaria
Inst. de Inv. Sanitaria IB - IdisBa
Universitat de les Illes Balears
Palma, Spain
tomeu.alorda@uib.es
Abstract—One of the great challenges of computing today is sustainable energy consumption. In the deployment of edge computing this challenge is particularly important considering the use of embedded equipment with limited energy and computation resources. In those systems, the energy consumption must be carefully managed for them to operate over long periods. Specifically, for embedded systems with machine learning capabilities in the Internet of Things (EMLIoT) era, the execution of convolutional neural network (CNN) models is energy-challenging and requires massive amounts of data. Nowadays, high-workload processing is split between a host processor in charge of generic functions and an accelerator dedicated to executing the specific task. Open-hardware-based designs are pushing for new levels of energy efficiency. To achieve energy efficiency, open-source tools, such as the RISC-V ISA, have been introduced to optimize every internal stage of the system. This document compares EMLIoT accelerator designs based on RISC-V and highlights open research topics.
Keywords—RISC-V, accelerator, energy efficiency
I. INTRODUCTION
Since the introduction of accelerators to execute CNNs, their design priorities have been, first, to provide enough resources to run the CNN model and, second, to execute the most operations per second (throughput). However, considering the high demand for embedded systems with artificial intelligence capabilities, designers now face the challenge of reducing the power consumption of complex and long-running applications. It seems contradictory that small devices such as embedded systems, whose low power consumption typically translates into a lean HW system, are designed to run a CNN model, characterized by a high demand for resources. The priority has therefore evolved to maximizing the energy efficiency (EE) value, i.e., the ratio of the device's throughput per unit of power, reaching giga-operations per second per watt (GOPS/W) or tera-operations per second per watt (TOPS/W). In consequence, accelerators focus on including only circuitry and functions that operate optimally to achieve this goal.
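As a worked example of this ratio (the numbers are illustrative, not taken from any cited project), a device sustaining 200 GOPS while drawing 100 mW reaches

    \mathrm{EE} = \frac{\text{throughput}}{\text{power}}
                = \frac{200\ \text{GOPS}}{0.1\ \text{W}}
                = 2000\ \text{GOPS/W} = 2\ \text{TOPS/W},

i.e., well into the TOPS/W regime that the best EMLIoT designs target.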
In recent times, accelerator implementations exploring RISC-V features have emerged with favorable results, leading to a breakthrough in accelerator design. Indeed, being open source is not the only advantage of RISC-V; it offers further benefits for EE maximization. To expose these benefits and compile the key design elements, a selection of projects with the best EE results is examined under the following sequence: determine the challenges of EMLIoT devices for optimizing EE, relate these challenges to the RISC-V design objectives, and examine the EMLIoT RISC-V implementations to expose the elements that address the objectives.
II. SYSTEM SPECIFICATIONS
To start this analysis, the operating factors of EMLIoT devices are discussed: first the CNN model execution requirements, then the electronic restrictions related to embedded devices, and finally a brief overview of the RISC-V setup used in most projects.
A. CNN execution constraints
Executing CNN inference is the purpose of EMLIoT accelerator devices. This operational condition establishes the functional features of the system, organized into memory, power, throughput, frequency, and accuracy.
Any CNN model involves millions of parameters named weights. For example, ResNet [1] uses from 0.27M to 19.4M weights depending on the model version. Weights are stored in memory while the execution is performed, so the system requires an equivalent memory volume, usually DRAM.
From a workload point of view, the most important operation executed is the convolutional matrix multiplication, which in turn divides into dot-products implemented with MAC (multiply-accumulate) operations. Depending on the number of convolution layers in the model and the number of weights in every layer, the total number of operations executed can reach GOP (giga-operations, 1x10^9). If the efficiency expectation is to reach TOPS/W, then the CNN model should be executed at high throughput (GOPS) but at ultra-low power consumption (mW or less).
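To make the workload concrete, the following C sketch (our illustration; the layout and names are hypothetical) reduces one output value of a convolution to a chain of MAC operations. A single layer with 64 input channels, 3x3 kernels, 64 output channels, and a 56x56 output map already performs 64*3*3*64*56*56 ≈ 115.6 million MACs, i.e., about 0.23 GOP when multiplications and additions are counted separately.

    #include <stdint.h>

    /* One output value of a convolution layer computed as a dot product
       between an input patch and a kernel, both flattened to n = C*K*K
       elements; each loop iteration is one MAC operation. */
    int32_t conv_output(const int8_t *patch, const int8_t *kernel, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)patch[i] * (int32_t)kernel[i]; /* MAC */
        return acc;
    }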
Some compression algorithms have been proposed for CNN models to reduce the number of weights, the number of instructions, or the execution precision, all with beneficial energy results. However, these reduction criteria focus on HPC systems or imply a degradation of the final classification, whereas a high classification accuracy (>90%) should be the indicator of an effective acceleration. So, an optimal solution is to propose compression techniques that work in conjunction with hardware modifications, linked through the ISA.
B. RV32 ISA features
The 32-bit RISC-V ISA integrates several advisable features for
embedded devices, the most relevant being its low power cost. In addition, computations with 32-bit single-precision data obtain enough accuracy for CNN inference [2]. RV32I and RV32E are the base instruction sets dedicated to 32-bit embedded processors; both include integer data, basic arithmetic operations, and ld/st instructions for 16- and 8-bit data handling [3]. Once the base I or E is selected, standard extensions can be attached. For EMLIoT, the most recurrent extensions are:
• M, the multiplication/division extension, omnipresent in CNN accelerators since it is required for the MAC operation.
• C, the compression extension, which offers a 16-bit version of the original 32-bit instructions.
• A, the atomic extension, which supports atomic memory exchange instructions.
• B, the bit-manipulation extension, dedicated to more complex bit manipulation (draft version).
• V, the vector extension, for handling data in vector format and enabling parallel processing (draft version).
• X, the custom extension space, reserved for new extensions defined by customers [3].
In addition, each standard extension provides some custom instruction codes available for custom definition. Both features (custom extensions and instructions) give a significant advantage in aiding optimization, as non-standard operations can be implemented in HW and bound to ad-hoc instructions.
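As an illustration of how an ad-hoc operation can be bound to a custom instruction, the following sketch (a hypothetical encoding; it only executes on hardware that actually implements such an instruction) uses the GNU assembler's .insn directive to emit an R-type instruction in the RISC-V custom-0 opcode space (0x0B), with made-up funct3/funct7 values:

    #include <stdint.h>

    /* rd <- custom_op(rs1, rs2): the semantics are whatever the custom HW
       implements (e.g., a fused MAC step). funct3=0 and funct7=0 are
       placeholder values chosen for this illustration. */
    static inline int32_t custom_op(int32_t a, int32_t b)
    {
        int32_t rd;
        asm volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                     : "=r"(rd)
                     : "r"(a), "r"(b));
        return rd;
    }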
C. EE challenges
Once the CNN model execution specifications have been mentioned, as well as some RISC-V capabilities, all the challenges identified in the EMLIoT projects are condensed and related to the design objectives to reach EE. Fig. 1 synthesizes the result.
• The DRAM consumes a lot of power, either for storing or transferring data. The reason is the resistive-capacitive (RC) array on which this memory is based [4], which remains current due to its low cost. This challenge is the reason to optimize memory storage (obj. A) and to search for new technologies to store data (obj. B).
• The execution of the convolution is the process that consumes the most energy within the processor [5]. Within the MAC (multiply-accumulate) operation, the multiplication is the part with the highest energy consumption [6], [7]. This justifies exploring more efficient processing options (obj. B) and reducing the execution of the functions (obj. D).
• The energy consumption increases when data are transferred. According to the memory hierarchy [8], the cost is lower when data are moved between registers and the L1 level than when they are transferred from/to DRAM. Mobility synchronization is required to avoid unnecessary energy expenses (obj. C), and new interactions between the CPU and the accelerator need to be explored (obj. E).
• Traditionally, the loosely coupled accelerator (LCA) has been the most widely used host-accelerator configuration. Using an open-source platform, the possibilities of interaction grow and should be explored (obj. E).
• The working frequency of an embedded device is typically limited to between 100 and 500 MHz [9], conditioned by the RC array. When the bandwidth of the DRAM is increased (the number of bits transmitted per unit of time), the power consumption is increased [10]. Because the processor works at a higher speed than the memory frequency, the processor demands data faster than memory can deliver it. Conversely, data from the processor are available for transfer faster than the DRAM can store them, creating bottlenecks [11]. This challenge can be addressed by optimizing data mobility (obj. C), exploring new processing options (obj. B), or exploring new CPU-accelerator configurations (obj. E).
Fig. 1. Design objectives defined during this research
III. DESIGN OBJECTIVES REVIEW
In this section, the design objectives defined in the previous section are used to characterize the selected projects and identify the key implementations that aid EE improvement.
A. Optimize DRAM usage
An obvious solution to the large volume of DRAM is to reduce data use, which leads to two options: reducing the number of parameters (committed to algorithm compression, whose analysis is omitted in this paper) or reducing the data size. Computing CNN inference with low-precision data has both mathematical and computational implications. Binary neural networks (BNN) were the first data reduction approach, using weights normalized to a signed unit value (-1, +1). From a computing point of view, one bit can represent one weight; consequently, a standard 32-bit word can store 32 weights. BNNs can be executed with both standard programming and computing resources, since the binary MAC operation is performed using common Boolean operations such as XNOR-pop-count. The energy consumption and the throughput were improved at that time, but the inaccuracy of the results [12] encouraged the exploration of other options.
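As a concrete illustration of the XNOR-pop-count substitution (a minimal C sketch with names of our own, not code from the cited works), a 32-element binary dot product reduces to one XNOR, one pop count, and a rescaling:

    #include <stdint.h>

    /* Binary dot product of 32 {-1,+1} values packed as bits (1 -> +1,
       0 -> -1). XNOR marks the positions where activation and weight
       agree, so the dot product is (#agree - #disagree) = 2*pop - 32. */
    static inline int32_t bnn_dot32(uint32_t act, uint32_t wgt)
    {
        uint32_t agree = ~(act ^ wgt);              /* XNOR */
        int32_t  pop   = __builtin_popcount(agree); /* pop count */
        return 2 * pop - 32;
    }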
The alternative to binary values is quantized weights, where 2 to 8 bits represent a weight. The quantized neural network (QNN) approach [13] was implemented with more accurate results than BNN. Concerning energy reduction goals, QNN implementations address the issues related to low-precision data storage, retrieval, and computation in a 32-bit architecture. In DRAM memory, data are stored with volume optimization: two 16-bit, four 8-bit, eight 4-bit, or sixteen 2-bit data per word, depending on the chosen precision.
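The following C sketch illustrates the storage scheme for the 8-bit case (helper names are ours; the cited projects realize the extraction in HW or through custom instructions): four quantized weights share one 32-bit word, and a shift plus a mask recovers any of them.

    #include <stdint.h>

    /* Pack four 8-bit quantized weights into one 32-bit word. */
    static inline uint32_t pack4x8(const int8_t w[4])
    {
        return  (uint32_t)(uint8_t)w[0]
              | ((uint32_t)(uint8_t)w[1] << 8)
              | ((uint32_t)(uint8_t)w[2] << 16)
              | ((uint32_t)(uint8_t)w[3] << 24);
    }

    /* Extract weight idx (0..3) by shift and mask, restoring the sign. */
    static inline int8_t unpack8(uint32_t word, unsigned idx)
    {
        return (int8_t)((word >> (8 * idx)) & 0xFF);
    }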
For 8-bit data retrieval, a custom extension based on the RV32IMC ISA and named LVE is defined in [14] and applied in [15]. It contains instructions for allocating 16- and 8-bit data as a vector array in scratchpad (SP) memory. Another option to deal with data reduction is proposed in [16], based on an RV32IMA configuration. They reduce data to 8 bits using shift operations and store them directly in general-purpose registers (GPRs) to speed up the MAC operations.
The RISC-V ISA offers the capacity to integrate new functionalities into the standard. New extensions like XpulpNN [17] use an RV32IMCX ISA, including instructions to manage data stored in memory in eight 4-bit or sixteen 2-bit formats. The low-precision data are extracted and accommodated in 8-bit locations using hardware loops bound to one single instruction. Xpulpv2 matches the parallel processing configurations used by the PULP ecosystem; thus, every extracted datum is accommodated in a vector/array to perform the dot-product.
The implementations in [18] and [19] use an ultra-low-power RV32IM processor [20]. This low-cost benefit is maintained in the projects by the authors, who propose a shift operation to extract each low-precision datum from its storage format, so quantized operations are executed at a low cost.
Quantization, and especially the BNN, is a strategy with great potential to reduce energy consumption and remains one of the most implemented strategies to improve performance results. Low-precision data manipulation is not a standard operation, and custom instructions for shift or mask extraction must be implemented, as the RISC-V processors in this section have proposed.
B. Storage alternatives
Given the advantages of DRAM technology, it will hardly be replaced in the short term [21]. However, the possibility of using memories with lower power consumption is pushing emerging options. The memory alternatives most referenced in EMLIoT are the non-volatile memories (NVM) [22], the 8T SRAM [23], and the eDRAM [24].
In addition to their storage functionality, these devices have been chosen for their ability to compute one or more operations, in what is called in-memory computing (IMC, shown in Fig. 2). This approach states that one or more operations can be computed inside a storage cell, without transferring data to the processor, saving the cost of mobility. Furthermore, since these devices are based on analog mechanisms, the storage/processing output signal is a continuous function, providing a multi-precision capacity. To process larger volumes of data, IMC is massively performed in a 2D arrangement of storage cells that is controlled synchronously [25].
Even though these memories represent a breakthrough for EMLIoT devices, some issues remain to be solved: i) the massive data mobility to/from the IMC device must be handled; ii) the stability of the output signal is a challenge for some of these memories; iii) the analog output values must be fed to analog-to-digital converters, generating an energy cost. Data mobility is discussed in sub-section D, and the implications of analog converters are not a subject of this analysis.
C. Functions optimization
A complementary measure to reducing memory is the optimization of the CNN model execution. A clear example of operation simplification is the substitution of the MAC by the XNOR-pop-count function in BNNs, which led to a 25% reduction in energy consumption in the first implementations [26]. However, this analysis is bounded to innovative proposals for function/operation execution.
Fig. 2. IMC module representation
Some projects make hardware innovations to deal with low-precision data operations related to the MAC. In [27] a shift operation is performed to extract low-precision data and, immediately after each extraction, the data are accumulated, postponing 32-bit data manipulation until the accumulation is finished. The purpose of this is to avoid data transfers between registers with different data sizes.
Since the efficiency of the standard convolutional multiplication can be improved, kernel re-arrangement is proposed as an optimization in the PULP-NN library [28], which prioritizes kernel weight reuse and diminishes MAC processing. Another kernel re-arrangement optimization is presented in [28], where a 3x3 kernel in hardware is proposed, using individual FIFO buffers for the different operands and individual accumulators at each buffer output.
Among the compression techniques, sparsity appears recurrently in hardware optimizations. The purpose is the detection of values to avoid unnecessary computing. For example, zero detection is used in [19] and [29], applying bitwise AND operations to identify null results.
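A minimal software analogue of this zero-detection idea (our sketch, not the circuit of [19] or [29]) skips the multiplication whenever either operand is zero:

    #include <stdint.h>

    /* Zero-skipping dot product: operand pairs where either factor is
       zero contribute nothing, so their MAC is skipped entirely. */
    static int32_t sparse_dot(const int8_t *a, const int8_t *w, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] == 0 || w[i] == 0) /* zero detection */
                continue;
            acc += (int32_t)a[i] * w[i];
        }
        return acc;
    }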
Wu et al. [30] focus on the activation function execution performed by a RISC-V co-processor. Depending on the CNN operation to be executed (convolution, max pooling, rectified linear unit ReLU, or Add functions), one circuit in the accelerator is activated. The reported energy reduction is 47% for the entire SoC.
An emerging concept is mixed-precision data, i.e., the use of the optimal data precision in each of the layers forming the CNN model. The MPIC library [31], which includes the functions defined for RISC-V in [28], helps the CNN designer to normalize data and to adapt and combine different data sizes along the CNN design, balancing the energy consumption and accuracy required by each layer.
All these implementations have in common the definition of
a RISC-V instruction to execute the new/modified circuit.
D. Optimization of data mobility
For both DRAM and NVM memories, data mobility optimization remains a critical objective to reduce energy. The recurrent topics in these projects are improving data transfers by using specialized circuits between the CPU and memory, and locating memory close to the processing elements.
Most of the circuits implemented to improve data transfers emerged for commercial processors under the ARM architecture. So, the innovation is not the appearance of the circuits but the possibility of connecting them with the open-ISA processor in different configurations. Next is a brief explanation of some of these circuits and their RISC-V project implementations.
In IMC projects, direct memory access (DMA) is usually employed for data transfers between the RISC-V host and the NVM [32], [19]. GAP-8 uses a DMA to speed up instruction transfers from the host L2 memory to the L1 memory in the accelerator; both host and accelerator are RISC-V. In [33], the accelerator is a multicore RISC-V and an array of TCDM-DMA connects the L1 memory (shared by the cores) and the host L2 memory.
A recurrent circuit to speed up memory allocation is the tightly coupled memory (TCDM for data, ITCM for instructions), proposed to substitute random cache access with addressed and controlled access to instructions and data. The benefit of TCM in accelerators is to create a direct connection between the processor and DRAM, avoiding the long memory hierarchy. As with DMA, the data transfers are faster and addressed exactly to the target module. Examples of implementations are [34], [35].
The co-processors, known as tightly coupled accelerators (TCA), use a buffer in FIFO format that stores data waiting to be processed. Examples of this connection are [35], [36]. IMC implementations also use buffers, to reshape data and keep them waiting while entering the processing module.
In the mentioned projects, a combination of the open ISA and custom ld/st instructions to transfer data between the CPU registers and the circuits is implemented to ease data mobility and avoid bottlenecks.
E. CPU-accelerator configuration
This section provides a brief explanation of the processing factors related to the CPU-accelerator configuration and the RISC-V influence up to now. The two most representative CPU-accelerator configurations are the LCA and the TCA; the delineation between both definitions is given by the memory hierarchy level at which the accelerator accesses the CPU. However, an open-source platform such as RISC-V enables an extensive combination of these two options.
The LCA interaction is usually limited to sending control signals to the accelerator. Under this principle, a RISC-V interface [37] is available to facilitate this interaction by connecting any accelerator to a RISC-V processor, e.g., [16]. This tool can be configured to exchange control signals, to exchange data between the CPU cache memory and external elements, or to communicate between the accelerator and the interface. Other projects use the TCM tools discussed in the last section to transfer data between the CPU cache memory and the accelerator [18], [19], [38], [33], [34]. Although these projects are LCA, data are provided faster using tightly coupled circuits.
Regarding less conventional interactions between the CPU and the accelerator, TCA implementations are few and do not report energy consumption or accuracy. This is a drawback, as they cannot be considered among the EE key elements.
The processing mode is another component of energy consumption. Parallel processing is a robust implementation to distribute tasks between different processing elements, either cores or PEs. Its execution response is faster than single-instruction processing, which improves throughput. Almost all the implementations apply parallel processing in one way or another. The RISC-V factor is evident in the custom extensions for SIMD to manage parallel instructions, e.g., [15], [34], [29]. [35] introduces parallel processing with two RISC-V processors: one in a simplified setting for general-purpose activity, and the second a more capable processor for the CNN execution. Finally, the IMC projects implicitly involve parallel processing, since all the NVM cells act at the same time and provide results in a few cycles.
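As an illustration of the SIMD style these extensions enable (a portable C emulation of the pattern, not an actual Xpulp or vector instruction), one "instruction" can consume four packed 8-bit operand pairs per 32-bit word:

    #include <stdint.h>

    /* Emulates a 4-way 8-bit SIMD dot-product step on packed 32-bit
       words: what a SIMD extension executes as a single instruction. */
    static inline int32_t sdot4x8(uint32_t a_pk, uint32_t w_pk, int32_t acc)
    {
        for (int lane = 0; lane < 4; lane++) {
            int8_t a = (int8_t)(a_pk >> (8 * lane));
            int8_t w = (int8_t)(w_pk >> (8 * lane));
            acc += (int32_t)a * w; /* one MAC per lane */
        }
        return acc;
    }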
IV. DISCUSSION
In this section, the discussion focuses on the overall results reported by the sample of projects and links them to the RISC-V solutions. This exercise helps identify the effectiveness of the current solutions and a possible direction for upcoming proposals.
Before starting, it is worth mentioning that, from a sample of several projects, the implementations with the most information about EE, throughput, and power are selected. Furthermore, the chosen group of projects exhibits a diversity of conditions that makes them difficult to analyze with the same criteria; examples of these conditions are the CNN model and compression techniques, the manufacturing technology, the power and throughput measurement tools and conditions, and the accuracy evaluation metrics. Although all these variables indicate a particular design, this analysis considers global characteristics and highlights the RISC-V solutions. An interesting remark from this study is the identification of a gap in EE standard metrics for reporting results and facilitating the analysis.
The relationship between throughput and EE is the first item of discussion, as shown in Fig. 3. The best results in throughput correspond to the best results in EE, and a high EE level is related to low power consumption, but not conversely (low power does not imply high EE). In the same way, power metrics lack a uniform definition, and the current ones depend on the technology, power supply, model execution, and frequency.
Additionally, Fig. 3 shows that the best EE levels are achieved by solutions implemented at 65-55 nm, while the rest use nodes between 180 and 22 nm. So neither the EE nor the throughput is determined by the electronic technology node used in the implementations.
Fig. 3. EE. Different marks in the figure indicate different types of manufacturing technology
Regarding the throughput in Fig. 4, the IMC-SRAM implementations obtain the best results (4.7 and 2.18 TOPS), using precisions of 1 bit (probably to speed up some layers where high accuracy is not required) and 4 bits (to improve the classification accuracy). In addition, both in-memory accelerators include near-memory computing capabilities. Other initiatives, with values around 100 GOPS, are not identified with any specific characteristic, except perhaps the use of mixed-precision data.
Fig. 4. Maximum peak throughput
Power information for all the projects in the sample is reported in Fig. 5. Chewbacca, the best very-low-energy project, uses a Gapuino device, i.e., a microcontroller with a GAP8 as microprocessor; it works under multicore processing and uses the Xpulp extensions [29]. CIMU [19] and CIMA [18] are IMC accelerators and integrate a zero-riscy processor [20] using a minimum of extensions (RV32IM) and a dynamic power density of 0.68 µW/MHz. These three projects use data mobility circuits (DMA or TCDM) to speed up the data transfers.
Fig. 5. Power and implementation similarities between the projects
Table I shows some additional details about the projects. The accuracy reported by CIMU, CIMA, and Chewbacca is more than 90%. The authors of [29] propose BNN optimizations to still benefit from low-energy processing while improving accuracy. Among these optimizations are binary additions stored in full-precision registers, binary data reuse, XNOR-pop-count, and max-pooling operations.
TABLE I. SUMMARY OF EMLIOT PROJECTS
[Table I lists, for each project (GAP-8 [38], PULP-NN [28], XpulpNN [17], MultiPULPly [32], DORY [40], CompAcc [27], IMA-E2E [34], Chewbacca [29], CIMU [19], CIMA [18], and PTLL-BNN [39]), the precision in bits, the data mobility tool, the function optimization, the CPU-accelerator configuration, the RISC-V ISA extensions, and the accuracy.]
V. CONCLUSIONS
In the EMLIoT ecosystem, energy efficiency remains an open topic for long-life devices, as small devices must perform big tasks with limited resources. The future expectation is that more and more of these devices will be around to help human activity. This implies that each device must be specifically designed to make optimal use of the resources. Thus, domain-specific accelerators are emerging to speed up processing and lower the power consumption of machine learning devices.
Facing such a challenge, using the right tools could make a difference. In particular, the RISC-V key contributions found in this study are developed to optimize known computing approaches and execute them efficiently. These contributions are: an open architecture that enables tight coupling between memory and CPU for speeding up data mobility; the implementation of custom instructions to integrate optimized functions; and ISA modularity, which allows different levels of instruction complexity and, therefore, the possibility of varying the energy consumption.
In general, most of the implementations reviewed follow the traditional and safe trends explored until now: LCA-parallel and data mobility circuits, including IMC implementations. Developments related to tight CPU-accelerator coupling, serial processing, and further QNN exploration remain pending.
ACKNOWLEDGMENTS
This work has been partially supported by the Mexican Government (F-PROMEP-01/Rev-04 SEP-23-002-A), the Spanish Ministry of Science and Innovation (contract PID2019-107255GB-C21/AEI/10.13039/501100011033), the Generalitat de Catalunya (contract 2017-SGR-1328), and project DTS21/00089 of the Instituto Carlos III.
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778, 2016.
[2] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, pp. 1–30, 2018.
[3] A. Waterman and K. Asanovic, "The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA," version 20191213, 2019.
[4] M. Radulović, "Memory bandwidth and latency in HPC: system requirements and performance impact," Ph.D. dissertation, TDX (Tesis Doctorals en Xarxa), 2019.
[5] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria, "Instruction-level power estimation for embedded VLIW cores," Proc. Int. Workshop Hardware/Software Codesign, pp. 34–38, 2000.
[6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," Proc. 43rd Int. Symp. Comput. Archit. (ISCA), pp. 243–254, 2016.
[7] A. Vasilyev, "CNN optimizations for embedded systems and FFT," Project Report, 2015.
[8] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures," Proc. Annu. Int. Symp. Microarchitecture, pp. 245–257, 2000.
[9] R. Xu, D. Zhu, C. Rusu, R. Melhem, and D. Mossé, "Energy-efficient policies for embedded clusters," ACM SIGPLAN Not., vol. 40, no. 7, pp. 1–10, 2005.
[10] C. James, "Beyond Moore's Law: Exploring the Future of Computation," The Royal Society, Los Alamos, 2019.
[11] M. Jung, C. Weis, and N. Wehn, "The dynamic random access memory challenge in embedded computing systems," A Journey of Embedded and Cyber-Physical Systems, pp. 19–36, 2021.
[12] T. Simons and D. J. Lee, "A review of binarized neural networks," Electronics, vol. 8, no. 6, 2019.
[13] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4820–4828, 2016.
[14] G. Lemieux and J. Vandergriendt, "FPGA-optimized lightweight vector extensions for VectorBlox ORCA RISC-V," 4th RISC-V Workshop, 2016.
[15] G. Lemieux, J. Edwards, J. Vandergriendt, A. Severance, R. De Iaco, A. Raouf, and S. Singh, "TinBiNN: Tiny binarized neural network overlay in about 5,000 4-LUTs and 5 mW," arXiv preprint arXiv:1903.06630, 2019.
[16] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun, and F. Liang, "A RISC-V based hardware accelerator designed for Yolo object detection system," Proc. 2019 IEEE Int. Conf. Intell. Appl. Syst. Eng. (ICIASE), pp. 9–11, 2019.
[17] A. Garofalo, G. Tagliavini, F. Conti, D. Rossi, and L. Benini, "XpulpNN: Accelerating quantized neural networks on RISC-V processors through ISA extensions," Proc. 2020 Des. Autom. Test Eur. Conf. Exhib. (DATE), pp. 186–191, 2020.
[18] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable heterogeneous microprocessor based on bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, 2020.
[19] H. Jia, Y. Tang, H. Valavi, J. Zhang, and N. Verma, "A microprocessor implemented in 65 nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, 2020.
[20] P. D. Schiavone, F. Conti, D. Rossi, M. Gautschi, A. Pullini, E. Flamand, and L. Benini, "Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-Things applications," Proc. 27th Int. Symp. Power Timing Model. Optim. Simulation (PATMOS), pp. 1–8, 2017.
[21] S. Shiratake, "Scaling and performance challenges of future DRAM," May 2020.
[22] A. Chen, "A review of emerging non-volatile memory (NVM) technologies and applications," Solid-State Electron., vol. 125, pp. 25–38, 2016.
[23] A. S. S. Trinadh Kumar, "Low voltage high speed 8T SRAM cell for ultra-low power applications," Int. J. Eng. Technol., vol. 7, pp. 70–74, 2018.
[24] S. S. Iyer and H. L. Kalter, "Embedded DRAM technology: opportunities and challenges," IEEE Spectrum, vol. 36, no. 4, pp. 56–64, 1999.
[25] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, "Memory devices and applications for in-memory computing," Nat. Nanotechnol., vol. 15, pp. 529–544, 2020.
[26] A. Al Bahou, G. Karunaratne, R. Andri, L. Cavigelli, and L. Benini, "XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional neural networks," Proc. 21st IEEE Symp. Low-Power and High-Speed Chips (COOL Chips), pp. 1–3, 2018.
[27] Z. Ji, W. Jung, J. Woo, K. Sethi, S.-L. Lu, and A. P. Chandrakasan, "CompAcc: Efficient hardware realization for processing compressed neural networks using accumulator arrays," Proc. 2020 IEEE Asian Solid-State Circuits Conf. (A-SSCC), 2020.
[28] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Philos. Trans. R. Soc. A, vol. 378, no. 2164, 2020.
[29] R. Andri, G. Karunaratne, L. Cavigelli, and L. Benini, "ChewBaccaNN: A flexible 223 TOPS/W BNN accelerator," pp. 1–5, 2021.
[30] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, p. 1005, 2020.
[31] M. Rusci, A. Capotondi, and L. Benini, "Memory-driven mixed low precision quantization for enabling deep network inference on microcontrollers," 2019.
[32] A. Eliahu, R. Ronen, P. E. Gaillardon, and S. Kvatinsky, "MultiPULPly: A multiplication engine for accelerating neural networks on ultra-low-power architectures," ACM J. Emerg. Technol. Comput. Syst., vol. 17, no. 2, 2021.
[33] F. Glaser, G. Tagliavini, D. Rossi, G. Haugou, Q. Huang, and L. Benini, "Energy-efficient hardware-accelerated synchronization for shared-L1-memory multiprocessor clusters," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 3, pp. 633–648, 2021.
[34] G. Ottavi, G. Karunaratne, F. Conti, I. Boybat, L. Benini, and D. Rossi, "End-to-end 100-TOPS/W inference with analog in-memory computing: Are we there yet?," pp. 10–13, 2021.
[35] L. Zhang, Z. Xian, and G. Chuliang, "A CNN accelerator with embedded RISC-V controllers," Proc. 2021 China Semiconductor Technology Int. Conf. (CSTIC), 2021.
[36] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, p. 1005, 2020.
[37] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, et al., "The Rocket Chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, 2016.
[38] E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, "GAP-8: A RISC-V SoC for AI at the edge of the IoT," Proc. Int. Conf. Appl. Syst. Archit. Process. (ASAP), pp. 1–4, 2018.
[39] Y. Lo, Y. Kuo, Y. Chang, and J. Huang, "PTLL-BNN: Physically tightly coupled, logically loosely coupled, near-memory BNN accelerator," Proc. 2019 Eur. Solid-State Circuits Conf. (ESSCIRC), pp. 2–3, 2019.
[40] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, "DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs," vol. 70, no. 8, pp. 1253–1268, 2021.