CNN Accelerators for Embedded Systems based on RISC-V

A review of CNN accelerators for embedded systems
based on RISC-V
Alejandra Sanchez-Flores
Industrial and Construction Eng. dept.
Universitat de les Illes Balears
Palma, Spain
Lluc Alvarez
Barcelona Supercomputing Center,
Universitat Politècnica de Catalunya
Barcelona, Spain
Abstract— One of the great challenges of computing
today is sustainable energy consumption. In the deployment
of edge computing this challenge is particularly important
considering the use of embedded equipment with limited
energy and computation resources. In those systems, the
energy consumption must be carefully managed to operate
for long periods. Specifically, for embedded systems with
machine learning capabilities in the Internet of Things
(EMLIoT) era, executing convolutional neural network (CNN)
models is energy-demanding and requires massive amounts of
data. Nowadays, high-workload processing is split
between a host processor in charge of generic
functions and an accelerator dedicated to executing the
specific task. Open-hardware-based designs are pushing for
new levels of energy efficiency. To achieve energy
efficiency, open-source tools, such as the RISC-V ISA, have
been introduced to optimize every internal stage of the
system. This document compares the EMLIoT
accelerator designs based on RISC-V and highlights open
topics for research.
Keywords—RISC-V, accelerator, energy efficiency
Since the introduction of accelerators to execute CNNs, their
design priorities were first to provide enough resources to run the
CNN model and then to execute the most instructions per second
(throughput). However, given the high demand for
embedded systems with artificial intelligence capacity,
designers now face the challenge of reducing the power
consumption for complex and long-term applications. It seems
contradictory that small devices such as embedded systems, whose
low power consumption typically translates into a lean
HW system, are designed to run a CNN model, characterized by a
high demand for resources. The priority has therefore evolved to
maximizing the energy efficiency (EE) value, i.e. the ratio of the
device's throughput per unit of power, reaching giga-operations
per second per watt (GOPS/W) or tera-operations per
second per watt (TOPS/W). In consequence, accelerators
focus on including only circuitry and functions that operate
optimally to achieve this goal.
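As a minimal sketch of the EE metric described above, the following computes throughput per unit of power. The function name and the sample device figures are illustrative assumptions, not values reported by any of the reviewed projects.

```python
def energy_efficiency_gops_per_watt(throughput_gops: float, power_watts: float) -> float:
    """Return energy efficiency in GOPS/W: throughput divided by power."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return throughput_gops / power_watts


# Hypothetical device (assumption): 100 GOPS sustained at 50 mW.
ee = energy_efficiency_gops_per_watt(100.0, 0.050)
print(f"{ee:.0f} GOPS/W = {ee / 1000:.1f} TOPS/W")
```

The sketch makes the trade-off explicit: the same EE can be reached by raising throughput or by lowering power, which is why both quantities need to be reported together.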
In recent times, accelerator implementations exploring the
RISC-V features have emerged with favorable results, leading
to a breakthrough in accelerator design. Being open source is
not the only advantage of RISC-V; it offers other benefits
for EE maximization. To expose these benefits and compile
the key design elements, a selection of projects
with the best EE results is examined in the following
sequence: determine the challenges of EMLIoT devices for
optimizing EE, relate these challenges to the RISC-V design
objectives, and examine the EMLIoT RISC-V implementations
to expose the elements that address the objectives.
Bartomeu Alorda-Ladaria
Inst. de Inv. Sanitaria IB - IdisBa.
Universitat de les Illes Balears
Palma, Spain
To start this analysis, the operating factors of EMLIoT are
discussed. First the CNN model execution requirements, then
the electronic restrictions related to embedded devices, and
finally, a brief overview of RISC-V setup for most projects.
A. CNN execution constraints
CNN inference execution is the purpose of EMLIoT accelerator
devices. This operational condition establishes the
functional features of the system, organized into memory, power,
throughput, frequency, and accuracy.
Any CNN model involves millions of parameters named
weights. For example, ResNet [1] uses a total number of weights
from 0.27M to 19.4M depending on the model version. Weights
are stored in memory while the execution is performed, thus the
system requires an equivalent memory volume, usually DRAM.
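To make the memory requirement concrete, the sketch below converts the ResNet weight counts quoted above into storage volume at several precisions. The helper name and the choice of 32-bit, 8-bit, and 1-bit precisions are illustrative assumptions; activations and other buffers are ignored.

```python
def weight_memory_bytes(num_weights: int, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in bytes."""
    return num_weights * bits_per_weight / 8


# Weight counts quoted in the text for the smallest and largest ResNet versions.
for num_weights in (270_000, 19_400_000):
    for bits in (32, 8, 1):
        megabytes = weight_memory_bytes(num_weights, bits) / 1e6
        print(f"{num_weights:>10} weights at {bits:>2} bits: {megabytes:.2f} MB")
```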
From a workload point of view, the most important operation
executed is the convolutional matrix multiplication, which in
turn divides into dot-products implemented with the MAC
(multiplication-accumulation) operation. Depending on the
number of convolution layers in the model and the number of
weights in every layer, the total operations executed can reach
the GOP range (giga-operations, 1×10^9). If the efficiency expectation is
to reach TOPS/W, then the CNN model should be executed
at high throughput (GOPS) but at ultra-low power consumption
(mW or less).
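The operation count of a convolution layer can be estimated with a short calculation. This is a sketch under stated assumptions: a standard convolution without grouping or sparsity, one MAC counted as two operations (a common convention, not one stated in this paper), and a hypothetical layer shape.

```python
def conv_layer_macs(out_h: int, out_w: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """MACs for one standard convolution layer: one k x k x in_ch dot-product per output value."""
    return out_h * out_w * out_ch * in_ch * kernel * kernel


def required_gops(total_ops: int, inferences_per_second: float) -> float:
    """Sustained throughput (GOPS) needed to run total_ops at the given inference rate."""
    return total_ops * inferences_per_second / 1e9


# Hypothetical layer (assumption): 32x32 output, 16 -> 16 channels, 3x3 kernel.
macs = conv_layer_macs(32, 32, 16, 16, 3)
ops = 2 * macs  # assumption: one MAC = one multiply + one add
print(macs, "MACs,", ops, "operations")
print(required_gops(ops, 30.0), "GOPS for this layer at 30 inferences/s")
```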
Some compression algorithms have been proposed for the
CNN model to reduce the number of weights, the number of
instructions, or the execution precision, all with beneficial energy
results. However, these reduction criteria focus on HPC systems
or imply a degradation of the final classification, whereas a high
accuracy (>90%) should be the indicator of an
effective acceleration. So, an optimal solution is to propose
compression techniques that work in conjunction with hardware
modifications, linked through the ISA.
B. RV32 ISA features
The 32-bit RISC-V ISA (RV32) integrates several features suited to
embedded devices, the most relevant being its low power cost.
In addition, computations with 32-bit single-precision data
obtain enough accuracy for CNN inference [2]. RV32I
and RV32E are the base instruction sets for 32-bit
embedded processors; both include integer data, basic arithmetic
operations, and ld/st instructions for handling 16-bit and 8-bit data
[3]. Once the base I or E is selected, standard extensions
can be attached. For EMLIoT, the most recurrent extensions are:
• M: multiplication/division extension, omnipresent in the
CNN accelerators and required for the MAC operation.
• C: compressed extension, which offers 16-bit encodings of
common 32-bit instructions.
• A: atomic extension, which supports atomic read-modify-write
memory instructions.
• B: bit manipulation extension, dedicated to more complex bit
manipulation (draft version).
• V: vector extension for handling data in vector format, enabling
parallel processing (draft version).
• X: custom extension, reserved for new extensions defined by
customers [3].
In addition, each standard extension leaves some instruction
encodings available for custom definition. Both features
(custom extensions and instructions) give a significant
advantage in aiding optimization, as non-standard operations
can be implemented in HW and bound to an ad-hoc instruction.
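As an illustration of how an ad-hoc instruction can be bound to custom hardware, the sketch below assembles a 32-bit R-type instruction word. The R-type field layout and the custom-0 major opcode follow the RISC-V base encoding; the register choices and the funct3/funct7 values of the custom instruction are hypothetical and do not correspond to any extension reviewed here.

```python
CUSTOM_0_OPCODE = 0b0001011  # "custom-0" major opcode reserved for non-standard use
OP_OPCODE = 0b0110011        # major opcode of standard register-register ALU instructions


def encode_r_type(opcode: int, rd: int, funct3: int, rs1: int, rs2: int, funct7: int) -> int:
    """Assemble an R-type word: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    fields = (("opcode", opcode, 7), ("rd", rd, 5), ("funct3", funct3, 3),
              ("rs1", rs1, 5), ("rs2", rs2, 5), ("funct7", funct7, 7))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


# Standard instruction for reference: add x1, x2, x3
print(hex(encode_r_type(OP_OPCODE, rd=1, funct3=0, rs1=2, rs2=3, funct7=0)))

# Hypothetical custom MAC-like instruction (assumption): rd=x10, rs1=x11, rs2=x12
print(hex(encode_r_type(CUSTOM_0_OPCODE, rd=10, funct3=0, rs1=11, rs2=12, funct7=0)))
```

A toolchain would normally emit such words through assembler support for the custom mnemonic; encoding by hand is shown only to expose the fields involved.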
C. EE challenges
With the CNN model execution specifications and some RISC-V
capabilities presented, the challenges
identified in the EMLIoT projects are now condensed and related to
the design objectives for reaching EE. Fig. 1 summarizes the result.
• DRAM consumes a lot of power, whether storing
or transferring data. The reason is the resistive-capacitive
(RC) array on which this memory is based [4]; it remains
in use because of its low cost. This challenge is the reason
to optimize memory storage (obj. A) and to search for
new technologies to store data (obj. B).
• The execution of convolution is the process that
consumes the most energy within the processor [5]. Within the
MAC (multiplication-accumulation) operation, the
multiplication is the part with the highest energy consumption
[6], [7]. This justifies exploring more efficient
processing options (obj. B) and reducing the execution
of the functions (obj. D).
• Energy consumption increases when data are
transferred. Moving data between registers and the L1 level
costs less than transferring them
from/to DRAM, according to the memory hierarchy [8].
Mobility synchronization is required to avoid
unnecessary energy expenses (obj. C), and new
interactions between the CPU and the accelerator need to
be explored (obj. E).
• Traditionally, the loosely coupled accelerator (LCA) has
been the most widely used host-accelerator configuration.
With an open-source platform, the possibilities of
interaction grow and should be explored (obj. E).
• The working frequency of an embedded device is
typically limited to between 100 and 500 MHz [9],
conditioned by the RC array. When the bandwidth of the
DRAM (the number of bits transmitted per
unit of time) is increased, the power consumption increases [10].
Because the processor works at a higher frequency than the
memory, the processor demands data faster
than memory can deliver it. Conversely, data from the
processor are available for transfer faster than the DRAM
can store them, creating bottlenecks [11]. This challenge
could be addressed by optimizing mobility (obj. C),
exploring new processing options (obj. B), or exploring new
CPU-accelerator configurations (obj. E).
Fig. 1. Design objectives defined during this research
In this section, the design objectives defined in the last
section are used to classify the selected projects and identify the
key implementations that aid the EE improvement.
A. Optimize DRAM usage
An obvious solution to the large DRAM volume is to reduce the
data stored, which leads to two options: reducing the number
of parameters (a matter of algorithm compression, whose analysis
is omitted in this paper) or reducing the data size.
CNN inference with low-precision data involves both
mathematics and computing. Binary neural networks (BNN)
were the first data reduction approach, using weights normalized
to a signed unit value (-1, +1). From a computing point of view,
one bit can represent one weight; in consequence, a 32-bit
word can store 32 weights. BNN can be executed with
standard programming and computing resources, since the binary
MAC operation is performed using common Boolean operations
such as XNOR-popcount. The energy consumption and the
throughput were improved at that time, but the inaccuracy of
the results [12] encouraged the exploration of other options.
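The binary MAC described above can be sketched as follows. The bit encoding (1 for +1, 0 for -1) and the function names are assumptions made for the example; the identity used is that a dot-product of n signed unit values equals twice the number of agreeing positions minus n.

```python
def pack_signs(values):
    """Pack +1/-1 values into an integer: bit i is 1 for +1 and 0 for -1."""
    word = 0
    for i, v in enumerate(values):
        if v not in (-1, 1):
            raise ValueError("binary networks use only -1 and +1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot-product of two packed +1/-1 vectors of length n using XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where both signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # each match adds +1, each mismatch adds -1


weights = [1, -1, 1, 1]
activations = [1, 1, -1, 1]
print(binary_dot(pack_signs(weights), pack_signs(activations), len(weights)))
```

With 32 weights per word, a single XNOR followed by one popcount replaces 32 multiply-accumulate steps, which is the source of the energy and throughput gains mentioned in the text.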
The alternative to binary values is quantized
weights, where 2 to 8 bits represent a weight. The quantized
neural networks (QNN) approach [13] was implemented with
more accurate results than BNN. Concerning energy reduction
goals, QNN implementation addresses the issues related to
low-precision data storage, retrieval, and computation in a 32-bit
architecture. In DRAM, data are packed to optimize volume:
each 32-bit word holds two 16-bit, four 8-bit, eight 4-bit, or
sixteen 2-bit values, depending on the chosen precision.
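The packing scheme above can be sketched with shift and mask operations. The lane order (first value in the least significant bits) and the function names are assumptions for illustration; the reviewed designs may order lanes differently.

```python
def pack_words(values, bits: int):
    """Pack unsigned `bits`-wide values into 32-bit words, first value in the lowest lane."""
    if bits not in (2, 4, 8, 16):
        raise ValueError("supported precisions: 2, 4, 8, or 16 bits")
    per_word = 32 // bits
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for lane, value in enumerate(values[start:start + per_word]):
            if not 0 <= value <= mask:
                raise ValueError(f"{value} does not fit in {bits} bits")
            word |= value << (lane * bits)
        words.append(word)
    return words


def unpack_words(words, bits: int, count: int):
    """Recover `count` values from packed 32-bit words with shift and mask."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    values = [(word >> (lane * bits)) & mask for word in words for lane in range(per_word)]
    return values[:count]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
packed = pack_words(data, 4)
print([hex(w) for w in packed])
print(unpack_words(packed, 4, len(data)))
```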
For 8-bit data retrieval, a custom extension named LVE, based on
the RV32IMC ISA, is defined in [14] and applied
in [15]. It contains instructions for allocating 16-bit and 8-bit data
as vector arrays in the scratchpad (SP) memory. Another option
for data reduction is proposed in [16], based on an
RV32IMA configuration. The authors reduce data to 8 bits using
shift operations and store them directly in general-purpose registers
(GPR) to speed up the MAC operations.
The RISC-V ISA offers the capacity to integrate new
functionalities into the standard. New extensions like XpulpNN
[17] use an RV32IMCX ISA, including instructions to manage
data stored in memory in eight 4-bit or sixteen 2-bit formats. The
low-precision data are extracted and placed in 8-bit
locations using hardware loops tied to a single instruction.
Xpulpv2 matches the parallel processing configurations used
by the PULP ecosystem; thus, every extracted datum is
placed in a vector/array to perform the dot-product.
The implementations in [18] and [19] use an ultra-low-power
RV32IM processor [20]. The authors preserve this low-cost
benefit by proposing a shift
operation to extract each low-precision datum from its storage
format, so quantized operations are executed at a low cost.
Quantization, and especially the BNN, is a strategy with
great potential to reduce energy consumption and remains one
of the most implemented strategies to improve performance
results. Low-precision data manipulation is not a standard
operation, so custom instructions for shift or mask extraction
must be implemented, as the RISC-V processors in this
section propose.
B. Storage alternatives
Given the advantages of DRAM technology, it is unlikely to
be replaced in the short term [21]. However, the possibility of
using memories with lower power consumption is driving
emerging options. The memory alternatives most referenced in
EMLIoT are the non-volatile memories (NVM) [22], the 8T
SRAM [23], and the eDRAM [24].
In addition to their storage functionality, these devices have
been chosen for their ability to compute one or more operations,
in what is called in-memory computing (IMC, shown in Fig. 2).
In this approach, one or more operations are
computed inside a storage cell, without transferring data to the
processor, saving the cost of mobility. Furthermore, since these
devices are based on analog mechanisms, the storage/processing
output signal is a continuous function, providing a
multi-precision capacity. To process a larger volume
of data, IMC is performed massively in a 2D arrangement of
storage cells that is controlled synchronously [25].
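The behavior of a 2D IMC array can be approximated with an idealized digital model. This is only a sketch under strong assumptions: cells hold numeric weights, all rows are driven at once, each column accumulates its products without noise, and an ideal converter quantizes the column sum. Real devices differ, and none of the names below come from the reviewed projects.

```python
def imc_column_sums(cells, inputs):
    """Column j accumulates sum_i inputs[i] * cells[i][j], as if all rows fired together."""
    rows, cols = len(cells), len(cells[0])
    if len(inputs) != rows:
        raise ValueError("one input per row is required")
    return [sum(inputs[i] * cells[i][j] for i in range(rows)) for j in range(cols)]


def ideal_adc(value: float, full_scale: float, bits: int) -> int:
    """Clamp a column sum to [0, full_scale] and quantize it to an unsigned code."""
    levels = (1 << bits) - 1
    clamped = min(max(value, 0.0), full_scale)
    return round(clamped / full_scale * levels)


# Hypothetical 3x2 array of binary weights driven by three binary inputs.
cells = [[1, 0],
         [1, 1],
         [0, 1]]
sums = imc_column_sums(cells, [1, 0, 1])
print(sums, [ideal_adc(s, full_scale=3.0, bits=2) for s in sums])
```

The model shows why the converter matters: every column produces one result per activation of the array, and each result must pass through a conversion step before digital logic can use it.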
Even though these memories represent a breakthrough for
the EMLIoT devices, some issues remain to be solved: i) the
massive data mobility to/from the IMC device must be handled,
ii) the stability of the output signal is a challenge for some of
these memories, and iii) the analog output values must be fed
to analog-digital converters, which adds energy cost. Data
mobility is discussed in sub-section D, and the implications of
analog converters are outside the scope of this analysis.
C. Functions optimization
A complementary measure to reducing memory is the
optimization of CNN model execution. A clear example of
operation simplification is the substitution of the MAC by the
XNOR-popcount function in the BNNs, which led to a 25% reduction
in energy consumption in the first implementations [26].
However, this analysis is limited to innovative
proposals for function/operation execution.
Fig. 2. IMC module representation
Some projects make hardware innovations to deal with
low-precision data operations related to the MAC. In [27], a shift
operation is performed to extract low-precision data, and
immediately after each extraction the data are accumulated,
postponing 32-bit data manipulation until the accumulation is
finished. The purpose is to avoid data transfers between
registers with different data sizes.
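The extract-then-accumulate idea can be sketched in software as follows. This is not the circuit of [27]; it is an illustrative model in which signed low-precision weights are pulled from a packed 32-bit word by shift and mask, and each product is added to a single wide accumulator right after extraction. Lane order and function names are assumptions.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def packed_dot(weight_words, activations, bits: int) -> int:
    """Dot-product over packed signed weights, accumulating right after each extraction."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    acc = 0  # single wide accumulator; no intermediate unpacked array is built
    for idx, act in enumerate(activations):
        word = weight_words[idx // per_word]
        lane = idx % per_word
        weight = sign_extend((word >> (lane * bits)) & mask, bits)
        acc += weight * act
    return acc


# 4-bit weights [1, -2, 3, -4] packed into one word, lowest lane first.
print(packed_dot([0xC3E1], [2, 3, 4, 5], bits=4))
```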
Since the efficiency of the standard convolutional multiplication
can be improved, kernel re-arrangement is proposed as an
optimization in the PULP-NN [28] library. It
prioritizes the reuse of kernel weights
and diminishes MAC processing. Another kernel re-arrangement
optimization is presented in [28], where a 3x3 kernel in
hardware is proposed, using individual FIFO buffers for the
different operands and individual accumulators at each buffer.
Among the compression techniques, sparsity appears
recurrently in hardware optimization. The purpose is to
detect values whose computation is unnecessary. For
example, zero detection is used in [19] and [29], applying a bitwise
AND operation to identify null results.
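Zero skipping can be illustrated with a short software model. This is a sketch, not the logic of [19] or [29]: a plain comparison stands in for the bitwise detection circuit, and the counter only shows how many multiplications are avoided.

```python
def sparse_dot(weights, activations):
    """Dot-product that skips any MAC whose product is known to be zero."""
    acc = 0
    performed = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # null result: neither the multiplier nor the adder is exercised
        acc += w * a
        performed += 1
    return acc, performed


result, macs_done = sparse_dot([0, 2, 3, 0], [5, 0, 4, 7])
print(result, macs_done)
```

Activations after a ReLU layer often contain many zeros, so the fraction of skipped MACs can be significant; the actual saving depends on the model and data.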
Wu et al. [30] focus on the activation function execution
performed by a RISC-V co-processor. Depending on the
CNN operation to be executed (convolution, max pool,
rectified linear unit (ReLU), or Add), the corresponding circuitry in
the accelerator is activated. The reported energy reduction is
47% across the entire SoC.
An emerging concept is mixed-precision data, i.e. the
optimal use of precision data in different layers forming the
CNN model. The MPIC library [31], which includes the
functions defined for RISC-V [28], helps the CNN designer to
normalize data, and adapt and combine different data sizes along
the CNN design to balance energy consumption and accuracy
required by the layer.
All these implementations have in common the definition of
a RISC-V instruction to execute the new/modified circuit.
D. Optimization of data mobility
For both DRAM and NVM memories, data mobility
optimization remains a critical objective to reduce energy. The
recurrent topics in these projects are improving data transfers
by using specialized circuits between CPU and memory, and
locating memory close to processing elements.
Most of the circuits implemented to improve data transfers
emerged for commercial processors under the ARM architecture.
So, the innovation is not the circuits themselves but the
ability to connect them with the open ISA processor in
different configurations. A brief explanation of some of
these circuits and their RISC-V project implementations follows.
In IMC projects, direct memory access (DMA) is usually
employed for data transfers between the RISC-V host and the
NVM [32], [19]. GAP-8 uses a DMA to speed up the instruction
transfers from the host L2 memory to the L1 memory in the
accelerator; both host and accelerator are RISC-V. In [33], the
accelerator is a multicore RISC-V, and an array of TCDM-DMA
connects the L1 memory (shared by the cores) and the host L2.
A recurrent circuit to speed up memory allocation is the
tightly coupled memory (TCM; TCDM for data or ITCM for instructions),
proposed to substitute the random cache access with an
addressed and controlled access to instructions and data.
The benefit of TCM in accelerators is a direct
connection between processor and DRAM, avoiding the long
memory hierarchy. As with DMA, the data transfers are faster and
addressed exactly to the target module. Examples of
implementations are [34], [35].
The co-processors, known as tightly coupled accelerators
(TCA), use a buffer in FIFO format that stores data waiting to
be processed. Examples of this connection are [35], [36]. IMC
implementations use buffers too for reshaping data and
maintaining them waiting while entering the processing module.
In the mentioned projects, a combination of the open ISA and
custom ld/st instructions to transfer data between the CPU
registers and circuits is implemented to ease data mobility
and avoid bottlenecks.
E. CPU-accelerator configuration
This section provides a brief explanation of the processing
factors related to the CPU-accelerator configuration and the
RISC-V influence up to now. The two most representative
CPU-accelerator configurations are the LCA and TCA; the
delineation between both definitions is given by the accelerator
access to the CPU memory hierarchy level (). However, an
open-source platform such as RISC-V enables an extensive
combination of these two options.
LCA is usually limited to sending control signals to the
accelerator. Under this principle, a RISC-V interface [37] is
available to facilitate this interaction by connecting any
accelerator to a RISC-V processor, e. g. [16]. This tool can be
configured to exchange control signals, to exchange data
between the CPU cache memory and external elements, or to
communicate between the accelerator and the interface. Other
projects use the TCM tools discussed in the last section to
transfer data between CPU cache memory and the accelerator
[18], [19], [38], [33], [34]. Although these projects are LCA, data
are provided faster using tightly coupled circuits.
Regarding less conventional interactions between
CPU and accelerator, the TCA implementations are few and do
not report energy consumption or accuracy. This is a drawback,
as they cannot be considered among the EE key elements.
The processing mode is another component of energy
consumption. Parallel processing is a robust way to
distribute tasks between different processing elements, either
cores or PEs. Its execution response is faster than
single-instruction processing, which improves throughput. Almost all
the implementations apply parallel processing in one way or
another. The RISC-V factor is evident in the custom extensions
for SIMD to manage parallel instructions, e.g. [15], [34], [29].
[35] introduces parallel processing for two RISC-V processors,
one in a simplified setting for general purpose activity, and the
second a more capable processor for the CNN execution.
Finally, the IMC projects involve implicitly parallel processing
since all the NVM cells act at the same time and provide results
in a few cycles.
In this section, the discussion focuses on the overall results
reported by the sample of projects and links the RISC-V
solutions. This exercise helps identify the effectiveness of the
current solutions and a possible direction for upcoming
Before starting, it is worth mentioning that, from a sample of
several projects, the implementations with the most information
about EE, throughput, and power were selected. Furthermore,
the chosen projects differ in conditions
that make them difficult to analyze with the same criteria, for
example, CNN model and
compression techniques, manufacturing technology, power and
throughput measurement tools and conditions, and accuracy
evaluation metrics. Although all these variables indicate a
particular design, this analysis considers global characteristics
and highlights the RISC-V solutions. An interesting remark
from this study is the identification of a gap: there are no standard
EE metrics for reporting results and facilitating the analysis.
The relationship between throughput and EE is the first item
of discussion, as shown in Fig. 3. The best results in throughput
get the best results in EE, and a high level at EE is related to low
power consumption, but not conversely (low power does not
imply high EE). In the same way, power metrics lack a uniform
definition, and the current ones depend on technology, power
supply, model execution, and frequency.
Additionally, in Fig. 3, the best EE levels are achieved by
solutions implemented at 65-55 nm, while the rest use nodes
between 180 and 22 nm. So neither the EE nor the throughput
appears to be determined by the technology node used in the
implementations.
Regarding the throughput in Fig. 4, the IMC-SRAM
implementations obtain the best results (4.7 and 2.18 TOPS)
using precision of 1 bit (probably to speed some layers with no
high accuracy required) and 4 bits (to improve the classification
accuracy). In addition, both in-memory accelerators include
Near Memory Computing capabilities. Other initiatives with
values around 100 GOPS are not identified with any specific
characteristic, except maybe for the use of mixed-precision data.
Power information for all projects in the sample is reported
in Fig. 5. Chewbacca, the best very-low-energy project, uses a
Gapuino device, i.e. a microcontroller with a GAP8 as
microprocessor; it works under multicore processing and uses
the Xpulp extensions [29]. CIMU [19] and CIMA [18] are IMC
accelerators and integrate a zero-riscy processor [20] using a
minimum of extensions (RV32IM) and a dynamic power density
of 0.68 µW/MHz. These three projects use data mobility circuits
(DMA or TCDM) to speed up the data transfers.
Table 1 shows some additional details about the
projects. The accuracy reported by CIMU, CIMA, and
Chewbacca is more than 90%. The authors of [29] propose BNN
optimizations to still benefit from low-energy processing but
improve accuracy. Among these optimizations are binary
addition stored in full-precision registers, binary data reuse,
XNOR-popcount, and max-pooling operations.
Fig. 3. EE. Different markers in the figure indicate different types of
manufacturing technology
Fig. 4. Maximum peak throughput
In the EMLIoT ecosystem, energy efficiency remains an
open topic for long-life devices, as small devices must perform
big tasks with limited resources. The future expectation is that
more and more of these devices will be around to help human
activity. This implies that each device must be specifically
designed to make optimal use of the resources. Thus,
domain-specific accelerators are emerging to speed up processing and
lower power consumption for machine learning devices.
Facing such a challenge, using the right tools could make a
difference. In particular, the RISC-V key contributions found in
this study are developed to optimize known computing
approaches and execute them efficiently. These contributions are:
an open architecture that enables tight coupling between memory
and CPU for speeding up data mobility; the implementation of
custom instructions to integrate optimized functions; and
instruction-set modularity that allows different levels of instruction
complexity and, therefore, the possibility of varying the energy
consumption.
In general, most of the implementations reviewed follow the
traditional and safe trends explored until now: LCA-parallel and
data mobility circuits, including IMC implementations.
Developments related to tightly coupled CPU-accelerator
configurations, serial processing, and more QNNs remain pending.
This work has been partially supported by the Mexican
Government F-PROMEP-01/Rev-04 SEP-23-002-A; the
Spanish Ministry of Science and Innovation (contract
PID2019-107255GB-C21/AEI/10.13039/501100011033); the
Generalitat de Catalunya (contract 2017-SGR-1328); and
DTS21/00089 from the Instituto Carlos III.
Fig. 5. Power and implementation similarities between the projects
