IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 58, NO. 1, JANUARY 2023
DIANA: An End-to-End Hybrid DIgital and
ANAlog Neural Network SoC for the Edge
Pouya Houshmand, Graduate Student Member, IEEE, Giuseppe M. Sarda, Vikram Jain, Member, IEEE, Kodai Ueyoshi, Ioannis A. Papistas, Man Shi, Qilin Zheng, Debjyoti Bhattacharjee, Arindam Mallik, Peter Debacker, Diederik Verkest, and Marian Verhelst, Senior Member, IEEE
Abstract— DIgital-ANAlog (DIANA), a heterogeneous multi-core accelerator, combines a reduced instruction set computer - five (RISC-V) host processor with an analog in-memory computing (AIMC) artificial intelligence (AI) accelerator and a digital reconfigurable deep neural network (DNN) accelerator in a single system-on-chip (SoC) to support a wide variety of neural network (NN) workloads. AIMC cores can bring extreme computational parallelism and efficiency at the expense of accuracy and dataflow flexibility. Digital AI co-processors, on the other hand, guarantee accuracy through deterministic compute, but cannot achieve the same computational density and efficiency. DIANA exploits this fundamental tradeoff by integrating both types of cores in a shared and optimized memory system, to enable seamless execution of the workloads on the parallel cores. The system's performance benefits further from pipelined parallel execution across both accelerator cores and enhanced AIMC spatial unrolling techniques, leading to drastically reduced execution latency and reduced memory footprints. The design has been implemented in a 22-nm technology and achieves peak efficiencies of 600 TOP/s/W for the AIMC core (I/W/O: 7/1.5/6 bit) and 14 TOP/s/W (I/W/O: 8/8/8 bit) for the digital accelerator. End-to-end performance evaluation of CIFAR-10 and ImageNet classification workloads is carried out on the chip, reporting 7.02 and 5.56 TOP/s/W, respectively, at the system level.
Index Terms— Algorithm-to-HW mapping, analog in-memory
computing (AIMC), deep neural network (DNN) acceleration,
machine learning processing, mixed-signal computing, reduced
instruction set computer - five (RISC-V), system-on-chip (SoC).
I. INTRODUCTION
DEEP learning algorithms have brought state-of-the-art (SotA) accuracy in pattern recognition and object classification tasks and have become an integral tool in solving
Manuscript received 24 May 2022; revised 26 August 2022;
accepted 30 September 2022. Date of publication 31 October 2022;
date of current version 28 December 2022. This article was approved by
Associate Editor Sophia Shao. This work was supported in part by KU
Leuven, in part by the European Union (EU) European Research Council
(ERC) for Resource-efficient sensing through dynamic attention-scalability
(Re-SENSE) Project under Grant ERC-2016-STG-715037, and in part by
the Flemish Government (Artificial intelligence (AI) Research Program).
(Pouya Houshmand and Giuseppe M. Sarda contributed equally to this work.)
(Corresponding authors: Pouya Houshmand; Giuseppe M. Sarda.)
Pouya Houshmand, Vikram Jain, Kodai Ueyoshi, Man Shi, and
Qilin Zheng are with ESAT-MICAS, KU Leuven, 3000 Leuven, Belgium
(e-mail: pouya.houshmand@kuleuven.be).
Giuseppe M. Sarda and Marian Verhelst are with KU Leuven,
3000 Leuven, Belgium, and also with imec, 3001 Leuven, Belgium (e-mail:
giuseppe.sarda@imec.be).
Ioannis A. Papistas, Debjyoti Bhattacharjee, Arindam Mallik, Peter
Debacker, and Diederik Verkest are with imec, 3001 Leuven, Belgium.
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/JSSC.2022.3214064.
Digital Object Identifier 10.1109/JSSC.2022.3214064
many real-world problems [1], [2], [3], [4]. However, deep neural networks (DNNs) are computationally intensive, which significantly hinders their efficient execution on edge devices, characterized by strict resource, energy, and latency constraints.
Fortunately, the pre-defined sequential nature of these computations, which can be expressed as a sequence of matrix-vector multiplications (MVMs), enables research on specialized efficient DNN hardware accelerators, which exploit the high degree of parallelization possibilities [5], [6].
The co-design space of the hardware architecture combined with the scheduling possibilities offers countless
optimization opportunities to maximize the spatial parallelism
and data locality of the operations. In recent years, many
digital designs have first been proposed to accelerate the operations [7], [8], [9], [10], with a hardware template consisting
of an array of processing elements (PEs), which allows for
2-D spatial parallelization of the multiply-accumulate (MAC)
operations. The computing units are then surrounded by a
memory hierarchy, to exploit temporal and spatial localities
of the operands via multiple levels of caching [11]. These
hardware templates are also replicated in the homogeneous
chiplet-based multi-core architectures [12], [13] to enable
higher parallelization degrees and pipelining of a sequence
of workloads.
The efficiency of these designs is often dominated by the
communication cost of sending data between the memories
and the compute array. This overhead of data movement can be
minimized through optimized memory hierarchies and appropriate tiling of the nested loops [14], thus optimizing re-use
of the operands. Even so, it still remains an important limiting
factor in accelerator performance. This stimulated research in
alternatives to digital accelerators, such as analog in-memory
computing (AIMC), which removes part of data movement by
means of computing the MVM operations within the memory
in the analog domain. This allows massive parallelization
and density of the DNN operations, promising up to 100×
improved throughput and energy efficiency compared with the
SotA digital designs [15], [16], [17].
However, these benefits are only achieved under high spatial and temporal utilization of the array. The spatial utilization is the fraction of the array's compute cells performing useful computations over all the computing cells. Achieving good spatial utilization across each targeted MVM is challenged by the limited scheduling flexibility of AIMC arrays, as well as the wide diversity of DNN layer
0018-9200 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Fig. 1. Computer vision networks exhibit different workloads: the number of parameters per layer grows quadratically, while the feature maps' size decreases at the same rate. The result is shallow layers with huge feature maps and few weights, and deep layers with a large number of weights but small features. A heterogeneous, reconfigurable architecture is thus required to support the different workload characteristics.
topologies. For example, the limited channel count of the
first layers of networks designed for ImageNet, CIFAR-10,
and COCO [18], [19] usually leads to poor spatial utilization
of such arrays. The temporal utilization, on the other hand,
is determined by the amount of clock cycles the AIMC
array can be effectively used. Maximizing temporal utilization
implies minimizing the amount of stalling cycles that might
occur due to limited data reuse and bandwidth limitations
to source/sink the input–output data. For example, fully connected (FC) layers exhibit poor weight reuse and depthwise
layers exhibit poor input reuse, making it challenging to
achieve good temporal utilization for their execution on AIMC
arrays [20]. Finally, the noise sensitivity of a DNN layer
should also be taken into account when assessing whether a
layer benefits more from mapping on a traditional digital DNN
accelerator, or on an AIMC array.
It is clear that specific neural network (NN) layers can
benefit enormously from the AIMC technology toward higher
throughput and energy efficiency, yet not all the layers. For
several workload and layer types, mapping on the traditional
digital MVM accelerators remains beneficial, e.g., the case for
depthwise layers [21].
To this purpose, we present DIANA [22], a hybrid DIgital-ANAlog DNN system-on-chip (SoC), which combines the
energy efficiency and high throughput of the AIMC technology
with the dataflow reconfigurability and higher precision of
digital architectures, as summarized in Fig. 1. The design
enables end-to-end acceleration of a diverse set of NN models,
under the control of a reduced instruction set computer - five
(RISC-V) host processor to exploit parallel execution across
the three cores and efficiently share data between them. The
contributions of this work are hence as follows.
1) A heterogeneous DNN accelerator, combining an AIMC
core with a flexible digital accelerator in a single SoC
with an optimized shared memory hierarchy.
2) Enhanced AIMC mapping methods to achieve higher
spatial and temporal utilization of the AIMC array.
3) Optimized scheduling strategies across the multi-core
heterogeneous SoC to maximize temporal utilization of
the computing fabrics.
Fig. 2. (a) Seven-nested-loop representation of a 2-D convolutional layer and of an MMM, and how different dataflows map the workload on the hardware. (b) Weight-stationary FX, FY, and C|K dataflow for AIMC. (c) Output-stationary OX|K dataflow and C|K dataflow for the digital core.
Section II will start with the basic dataflow concepts and motivate the design choices through hardware-scheduling co-optimization, justifying the choice of a heterogeneous architecture. Section III describes the architectural components in more detail, followed by Section VI, where measurements are carried out to evaluate the performance and efficiency of the taped-out design.
II. DESIGN CHOICES
This section details the rationale behind the algorithm and
hardware co-design choices, first introducing the workload
characteristics followed by the motivation why a hybrid design
would suit them.
A. Dataflow Concepts
The workloads of modern networks can be expressed as
a sequence of nested loops that operate on two tensors to
generate an output tensor with a series of MAC operations. The
nested loops' representation can be used for convolutional layers in convolutional NNs (CNNs) and for matrix–matrix multiplications (MMMs) in transformers [23]; Fig. 2(a) shows the loop representation for different workloads.
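The nested-loop view of Fig. 2(a) can be made concrete with a short sketch. The loop names (B, K, C, OY, OX, FY, FX) follow the paper's notation; the direct convolution below is an illustrative reference implementation of the seven nested loops, not DIANA's hardware mapping.

```python
import numpy as np

def conv2d_7loops(I, W):
    """Seven-nested-loop 2-D convolution (stride 1, no padding).
    I: input activations (B, C, IY, IX); W: weights (K, C, FY, FX)."""
    B, C, IY, IX = I.shape
    K, _, FY, FX = W.shape
    OY, OX = IY - FY + 1, IX - FX + 1
    O = np.zeros((B, K, OY, OX))
    for b in range(B):                      # batch
        for k in range(K):                  # output channels
            for oy in range(OY):            # output rows
                for ox in range(OX):        # output columns
                    for c in range(C):      # input channels
                        for fy in range(FY):        # kernel rows
                            for fx in range(FX):    # kernel columns
                                O[b, k, oy, ox] += (
                                    I[b, c, oy + fy, ox + fx] * W[k, c, fy, fx])
    return O
```

A dataflow chooses which of these loops are spatially unrolled on the MAC array and how the rest are tiled in time.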
This offers hardware designers the possibility to parallelize
a subset of the loops in the spatial dimension across an
array of MAC units. The loops that are not flattened in the
MAC array can in turn be temporally unrolled and tiled [14]
to exploit re-use at various levels in the memory hierarchy.
The combination of the temporal tiling of the nested loops
and their spatial unrolling provides the dataflow of a given
workload [24], [25].
AIMC typically relies on a weight stationary dataflow
[see Fig. 2(b)], since its physical implementation constrains
the way the operands can be spatially re-used in the MAC
array: inputs are multicasted along the cells of each row,
while accumulations happen along the columns [16], [17].
This commonly results in spatially unrolling weights relative
to different input channels (C) and different kernel dimensions
(FX, FY) on the rows, and the different kernels (K dimension)
on the columns (denoted as C, FX, FY|K). Layers whose C,
FX, FY, and K dimensions fit well with the dimensions of the
array can achieve a high spatial utilization and thus exploit
better the computational density and energy efficiency of the
macro, maximizing the potential performance improvements
of AIMC.
For those layers that underutilize the array with the C, FX, FY|K dataflow, mapping efficiency can be improved with
a concept called output pixel unrolling [see Fig. 4(a)]; this
consists of duplicating the weights in the same AIMC array
and computing multiple (OXu) output pixels in parallel on
different columns (C, FX, FY|K, OX unrolling). Excluding
the negligible overheads related to additional weight writing,
OXu× utilization, performance, and efficiency benefits can be
obtained, as can be seen in Fig. 4(c).
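The utilization gain from output pixel unrolling can be estimated with a simple first-order model. The helper below is an illustrative sketch (array dimensions default to DIANA's 1152 × 512), not the paper's exact cost model.

```python
def aimc_spatial_utilization(C, FY, FX, K, rows=1152, cols=512, oxu=1):
    """Spatial utilization of a weight-stationary C,FX,FY|K AIMC mapping.
    Each kernel occupies C*FY*FX rows of one column; output pixel
    unrolling duplicates the weights oxu times on extra columns."""
    used_rows = min(C * FY * FX, rows)
    used_cols = min(K * oxu, cols)
    return used_rows * used_cols / (rows * cols)
```

For a ResNet-style first layer (C = 3, 7 × 7 kernels, K = 64), duplicating the weights with OXu = 8 fills all 512 columns and raises spatial utilization by exactly 8× in this model.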
While other works have proposed weight duplication, they
have not done so in the same AIMC array, but across multiple
arrays [26], [27], [28]. Doing it in place guarantees higher spatial and temporal re-use of the activations, with consequent energy benefits, besides preventing the communication overheads of writing the same weights to separate cores.
Yet, even with these dataflow enhancements, AIMC mapping efficiency strongly depends on layer dimensions and the
AIMC hardware parameters. This leads to the study carried
out in Section II-B.
B. Design Space Exploration
As introduced in Sections I and II, some workloads are
better suited for AIMC while others cause underutilization
and do not fully exploit the massive parallelism that the
technology brings. When considered at the system level, the
factors that contribute to this match consist essentially of:
1) the layer topology; 2) the dimensions of the AIMC array
itself (for spatial utilization); and 3) the activation buffer which
sources the AIMC array with data (for temporal utilization).
We have thus carried out an exploration, with the intent of
finding: 1) which array sizes and what activation buffers bring
best AIMC performance and 2) under such optimal hardware
constellation, which layer and network topologies maximize
energy efficiency and computational density. For modeling
the different design points, we have used an extension of
ZigZag [24], extended with the output pixel unrolling in the
dataflow space. The considered model is depicted in Fig. 3(b),
taking into account the energy and latency contributions of
the AIMC core, of the L1 scratchpad activation buffer data
movements, of loading/storing the weights in the AIMC core
and of L2 scratchpad accesses for weights and activations.
The values of each contribution are extracted from simulations and are summarized in the table of Fig. 3(c), together with the swept hardware parameters (number of L1 banks and number of AIMC rows/columns).
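A minimal version of such a sweep can be sketched as follows. The cost coefficients are placeholders standing in for the extracted values of Fig. 3(c); only the sweep-and-Pareto-filter structure mirrors the exploration.

```python
import itertools

def pareto(points):
    # Keep (efficiency, density, cfg) points not dominated by any other
    # point on both metrics.
    return [p for p in points
            if not any(q is not p and q[0] >= p[0] and q[1] >= p[1]
                       for q in points)]

# Toy sweep over AIMC array and L1 buffer sizes; the energy/area models
# are illustrative placeholders, not the simulated values of Fig. 3(c).
designs = []
for rows, cols, l1_kb in itertools.product([576, 1152], [256, 512], [128, 256]):
    macs = rows * cols                      # peak MACs per MVM
    energy = macs + 50.0 * l1_kb            # toy energy per MVM
    area = 0.001 * macs + 0.01 * l1_kb      # toy area
    designs.append((macs / energy, macs / area, (rows, cols, l1_kb)))

front = pareto(designs)                     # Pareto-optimal constellations
```

Each network would then be evaluated on the surviving constellations to pick the stars of Fig. 3.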
Under these assumptions, a diverse set of models from the literature has been assessed on achievable TOP/s/W and TOP/s/mm2 with AIMC. Fig. 3(a) outlines the exploration results, highlighting the Pareto-optimal hardware constellations for each studied network (row × column, buffer size).
It is again clear from the results that AIMC works best
with networks that are characterized by a large number of
non-pointwise kernels throughout the network: ResNet [2]
or YOLO-like (DarkNet19 [4]) structures optimize energy
efficiency and computational density. Networks dominated by
Fig. 3. Design space exploration of the AIMC core for different workloads, carried out (a) based on the AIMC core hardware template of (b), exploring different design points by sweeping over the listed AIMC array and activation buffer sizes. Pareto-frontier points in (c) report the optimal configurations for each network. The stars indicate the configuration selected for the DIANA design.
Fig. 4. By adopting the output pixel unrolling mapping of the array, weights are duplicated OXu times in the array, while a corresponding number of output pixels are computed in parallel (b) instead of a single one (a). As an example, (c) describes the significant savings in latency, linearly proportional to OXu, that can be achieved for the layers in the first ResBlock of ResNet18 with an 1152 × 512 array, with a negligible increase in extra weight loading.
layers with 1 × 1 kernels, such as SqueezeNet [3], perform
∼2× worse: they do not offer the same energy efficiency
since they cannot exploit massive MVM operations in a large
AIMC core and find their optimum hardware constellation at
smaller analog arrays. As they still need rather large activation
buffers, similar to ResNet18 and DarkNet19, this leads to
degraded TOP/s/mm2 . Finally, the ResNet20 network was
assessed working on CIFAR-10 [2] input images. This network
also performs poorly on AIMC, due to the small activation
sizes, leading to too little weight reuse.
The final design choice converged on an AIMC array of size 1152 × 512 with a 256-kB L1 activation buffer, which sits on the front of the optimal design points for the ImageNet networks; larger array and L1 memory sizes were not considered due to area limitations.
C. Reconfigurable Heterogeneous Architecture
Section II-B showed that while some layers can achieve
impressive system efficiency, some others cannot benefit from
parallelization. For those layers, a digital array with more
dataflow flexibility can bring better spatial and temporal
utilization.
In addition, from the accuracy point of view, not all the
layers tolerate execution on a low-precision AIMC array,
which suffers from inherent noise and non-linearities of the
analog domain [29], [30], [31]. Running noise-sensitive layers
in the digital core can overcome accuracy degradation.
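A first-order way to see this sensitivity is to model AIMC non-idealities as additive Gaussian noise on the MVM outputs. The snippet below is a toy illustration (the noise level is an arbitrary assumption, not a measured figure for DIANA's macro).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 1152))        # one AIMC-sized weight tile
x = rng.standard_normal(1152)               # one input activation vector
y = W @ x                                   # ideal (digital) MVM result

# First-order abstraction of analog noise and non-linearity: additive
# Gaussian perturbation on each output; sigma is purely illustrative.
sigma = 0.05 * np.std(y)
y_noisy = y + rng.normal(0.0, sigma, y.shape)

rel_err = np.linalg.norm(y_noisy - y) / np.linalg.norm(y)
```

Layers whose accuracy collapses under such a perturbation are candidates for the digital core; robust layers can be mapped on AIMC.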
The desire to maximize both the performance and accuracy
prompted the development of DIANA’s heterogeneous system,
relying on two separate cores: 1) a high-performance AIMC
core for massive MVM operations without strict accuracy
requirements and 2) a reconfigurable digital core for accelerating those workloads that require higher precision or better
mapping efficiency.
The digital core must work efficiently on the various workloads, including those that do not suit AIMC: pointwise layers,
depthwise layers, and FC layers. This can be achieved by
enabling the digital core to support multiple dataflows. The
DIANA digital core supports both C|K (weight stationary)
and OX|K (output stationary) spatial unrolling [see Fig. 2(c)].
In DIANA, this is realized with a scaled-up version of the
SOMA digital accelerator from [32]. Finally, to allow for
flexible and concurrent assignment of the workloads to the two
cores, an RISC-V processor is integrated in the architecture
to trigger (TRG) and synchronize the operations on the two
cores and perform the necessary pre- and post-processing on
the data. The possibility to parallelize different tasks across
the different cores present in DIANA prompted the evaluation
of multi-core scheduling strategies, discussed next.
D. Optimization Strategies for Multi-Core
A widely heterogeneous architecture such as DIANA offers many optimization strategies for scheduling the workloads, to maximize the temporal and spatial localities of the operands and to minimize the activation memory footprint. Both the
accelerator cores can operate in parallel, e.g., each serving a
different NN layer at the same time, to achieve latency savings.
Parallelization can be coupled with layer fusion [33]: instead
of waiting until one layer has completely finished execution
to start the next layer (on the same core, or the other core),
it is possible to start execution of a layer as soon as a part of a
layer has been processed. Such layer fusion exploits immediate
reuse of the intermediate data, leading to smaller activation
buffer requirements.
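The buffer-saving effect of layer fusion can be sketched with a toy two-layer pipeline; `layer` below is a hypothetical elementwise stand-in for any producer/consumer layer pair, chosen so tiled execution is trivially exact.

```python
import numpy as np

def layer(x):
    # Stand-in for any elementwise layer (here: ReLU followed by scaling).
    return np.maximum(x, 0) * 2.0

x = np.arange(-8, 8, dtype=float)

# Layer-by-layer: the full intermediate map must live in the buffer.
full_intermediate = layer(x)                 # peak buffer: len(x) elements
out_layerwise = layer(full_intermediate)

# Fused (depth-first): each tile is pushed through both layers before the
# next tile starts, so the buffer only ever holds `tile` elements.
tile = 4
out_fused = np.concatenate(
    [layer(layer(x[i:i + tile])) for i in range(0, len(x), tile)])
```

Both schedules produce identical outputs, but the fused one shrinks the peak intermediate footprint from 16 to 4 elements in this toy case.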
The dataflow concepts discussed in this section led to
the DIANA SoC architecture, discussed in more detail in
Sections III–V.
III. SYSTEM ARCHITECTURE
This section gives a high-level overview of DIANA’s
hardware architecture. The heterogeneous multi-core system,
shown in Fig. 5, is composed of an RISC-V CPU and two
DNN accelerators: a fully digital 16 × 16 PE array, which
we refer to as the digital core, and a second co-processor
Fig. 5. Architectural block diagram of the heterogeneous system.
based on an AIMC macro, denoted by the analog core.
A hierarchical distributed memory system and network-on-chip (NoC) complete the system. In the remainder of this
section, we will first describe the RISC-V CPU system, and
then address the memory system and NoC, explaining the
high-level interaction between the various components. The
analog and digital cores will be detailed in Sections IV and V.
A. RISC-V CPU and Network Control
The central RISC-V unit is based on the PULPissimo
template [34]. The system uses two communication networks
to connect the various SoC components: a 16 B tightly coupled
data memory (TCDM) bus [35] for data communication and
a 4 B advanced peripheral bus (APB) dedicated to control and
instructions.
To transfer data, the RISC-V can itself directly read and
write from L2, while it has to explicitly program the various
direct memory accesses (DMAs) to let other components
access the global memory through the TCDM bus.
Specifically, the RISC-V initiates the migrations between
L2 and L1 for activation data, and between L2 and dedicated
memories in the cores for the weights. Note that the local
storage for weights in the analog core is the AIMC macro
itself.
To transfer control information, the RISC-V writes to
dedicated register files inside the accelerator over the APB
bus. In this way, it programs instructions and configuration
registers, starts jobs on the cores, and checks when the
processing has ended. The timing diagram in Fig. 6 showcases
the resulting task parallelism possibilities of the architecture:
1) the RISC-V core first initializes the read pointers in L2; 2) it then loads the activations and weights through the DMA from L2 to L1; 3) the system can then run single workloads on single cores; 4) or independently run multiple workloads on separate cores in parallel; 5) DMA operations can occur in parallel with core computations; and 6) eventually, data are sent off-chip via I/Os.
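The control sequence above could be sketched in host-side pseudocode as follows; the `Core` class and DMA call list are hypothetical stand-ins for the register-level APB/TCDM interfaces, not DIANA's actual driver API.

```python
# Hypothetical host-side control flow mirroring steps 1)-6).
class Core:
    def __init__(self, name):
        self.name, self.busy = name, False
    def start(self, job):
        # Stand-in for writing config registers + trigger over APB.
        self.busy, self.job = True, job
    def poll_done(self):
        # Stand-in for reading the core's status register.
        self.busy = False
        return True

def run_two_layers(dma, analog, digital, layer_a, layer_b):
    dma.append(("L2->L1", layer_a, "acts"))       # stage activations in L1
    dma.append(("L2->core", layer_a, "weights"))  # load weights into the core
    analog.start(layer_a)                         # kick off the AIMC job
    digital.start(layer_b)                        # second core runs in parallel
    dma.append(("L2->L1", layer_b, "acts"))       # DMA overlaps compute
    while not (analog.poll_done() and digital.poll_done()):
        pass                                      # RISC-V polls completion
```

The key point is structural: DMA transfers and both accelerator jobs are issued before any blocking wait, so they overlap as in Fig. 6.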
B. Memory System
As data management is as critical as the computation in
NN processing, the memory system design is fundamental
Fig. 6. Timing diagram of the system under different workloads; each workload reports indicative task durations, in clock cycles at 270 MHz.
to avoiding performance degradation due to inefficient data
traffic. DIANA includes three hierarchical levels of memory: a global L2 static random-access memory (SRAM), the
accelerator-dedicated L1 SRAM scratchpad, and local distributed register-based memories (L0). The SRAM L2 scratchpad
memory has 512-kB capacity with 16-byte read/write (R/W)
bandwidth; it can be accessed through the TCDM bus, which
is connected, in priority order, to the CPU, the IO DMA, and
to one additional DMA for each accelerator (see Fig. 5). This
global L2 memory stores the RISC-V binary code and system
configuration data, like accelerator instructions and operating
settings, as well as the weights and the intermediate activation
values of the NN under execution. In addition, L1 is an SRAM
scratchpad memory, but it is dedicated to storing (tiles of)
feature maps, as a communication data buffer between the
accelerators. In line with the ResNet18/Darknet19 study of
Fig. 3, it has a total capacity of 256 kB, divided into 16-byte-wide banks. As shown in Fig. 7, the module has separate
R/W interfaces optimized for each of the two NN cores: on the
analog accelerator side, two read and one write port can access
up to four contiguous banks (maximum 64-byte bandwidth);
on the digital side, there is a single read and write port with
a fixed 16-byte bandwidth. The DMAs can override the L1
ports on both sides for L2–L1 inter-memory transfer.
This memory design offers extensive mapping and scheduling flexibility and enough bandwidth to avoid latency penalties
without compromising the efficiency during execution. On the
other hand, L1 does not include any handler for address
conflicts; the sequence of instructions that each core executes,
and the addressing of L1 and synchronization tasks that the
RISC-V runs are pre-compiled with an in-house compiler and
are generated before runtime. This can be done given the
sequential and deterministic nature of the operations, which
rules out control or data hazards.
Local data reuse was improved further through the inclusion
of various L0 registers within the analog and digital cores.
Fig. 7. L1 memory design with detail of the analog core interfaces.
Fig. 8. (a) AIMC core block diagram. (b) Description of a single bitcell in the analog domain (from [36]). (c) Timing diagram of the processing stages and their synchronization signals.
For clarity, we will discuss the register-level details of the
subsystem in the dedicated accelerator sections.
IV. AIMC COMPUTING CORE
A. AIMC Core Microarchitecture
The AIMC core is optimized for energy-efficient execution
of massive MVM operations. To maximally exploit the array’s
efficiency, the contribution of the glue logic around it must be
minimized. The main challenge in designing these modules
lies in: 1) providing the required massively parallel input
vectors required by the array in each processing cycle and
2) maximizing the reuse of the fetched operands for energy
efficiency reasons. In the design of the analog core, depicted in
Fig. 8, we have aimed at achieving this across all the memory
levels involved: in a first stage, the vector of input feature maps
is fetched from L1 and fed to the activation buffer via the local
memory control unit (MCU), after which the AIMC core is
triggered for computation. When the computation is finished,
Fig. 10. Microarchitecture of the output buffer and the SIMD unit.
Fig. 9. Microarchitecture of the MCU (left) and example of operation for
data reuse of the activation buffer.
the output buffer collects the MVM result and sequentially
sends batches of 64 results to the single-instruction multiple-data (SIMD) unit for post-processing, to then finally be written
back to L1. The AIMC macro executes an MVM in 40 ns,
while the rest of the system can run at up to 270 MHz.
This provides a window of ∼10 clock cycles for executing
the fetching, post-processing, and storing tasks in a pipelined
fashion; the different stages are overlapped in time so as to
minimize latency overheads, as depicted in Fig. 8.
B. Memory Control Unit
The MCU, depicted in Fig. 9, handles the communication
between the L1 activation memory and the input buffer.
The unit generates the memory address according to the
patterns specific to NN kernels: it supports different window
sizes, stride values, padding, and a dedicated sliding pattern
for convolution before pooling layers. For padding, specific
pointers generate the relative position of the kernel window in
the feature map, such that when computing the border pixels,
zeros are directly inserted into the activation buffer. To reuse
overlapping input data between processing of consecutive
pixels under a convolution workload, the activation buffer has
the flexibility to shift its internal data.
To support 3 × 3 kernels, six pixels with up to 128 input
channels can be reused, saving both memory access
and latency for every computing cycle, as described in
Fig. 9 (right).
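The access savings from this in-buffer shifting can be quantified with a small helper. The model below is an illustrative approximation for a stride-1 horizontal slide of the kernel window, not the MCU's exact bookkeeping.

```python
def fetches_per_step(fx=3, fy=3, stride=1, reuse=True):
    """New input columns fetched from L1 per horizontal window step.
    With in-buffer shifting, only `stride` new columns of fy pixels are
    fetched; without it, the full fx*fy window is re-read every step."""
    return (stride * fy) if reuse else (fx * fy)

# Fraction of L1 accesses saved per step for a 3x3, stride-1 window.
saved = 1 - fetches_per_step() / fetches_per_step(reuse=False)
```

For the 3 × 3 stride-1 case, two of the three window columns are reused, saving two thirds of the per-step activation fetches in this model.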
C. AIMC Macro
Based on the exploration results of Fig. 3(a), an 1152 × 512 analog array of computing cells is instantiated. This
AIMC macro is a scaled-up version of the design from [36]
to guarantee better support for 3 × 3 kernels. Each cell in
the computing array comprises two SRAM cells that store
ternary values for the weights (−1, 0, +1), connected to
two summation lines representing the positive and negative
products’ accumulation values [see Fig. 8(b)]. The activations
are encoded with an active low pulsewidth, which generates a
current that discharges one of the summation lines, depending
on the weight activation product sign. The resulting voltages
from the two lines are then subtracted in the analog domain.
The 512 successive-approximation register analog-to-digital converters (SAR-ADCs) convert the accumulation results back to a 6-bit digital representation.
The vector of input activations is converted from the
digital 7-bit representation into a pulse width modulation
(PWM)-based analog representation via a vector of digital
to analog converters (DACs). Through the DAC unit time
duration, the minimum amount of charge subtracted from the
sum lines can be tuned. This, together with the cell source
bias voltage, which determines the magnitude of the cell
discharging current, gives flexibility to adjust the sensitivity
of the output result to the activation values, at the expense
of additional energy consumption. Both these parameters can
be programmed by the user externally, as will be explored
in Section VI. The computation is triggered by the edge on
the TRG signal (see Fig. 8); for each MVM operation, all
1152 wordlines are activated; there is, however, the possibility
of disabling DAC units in groups of 64 blocks to save energy
consumption.
D. Output Buffer and SIMD Unit
The AIMC’s 512-wide 6-bit analog to digital converter
(ADC) output vector is stored in an output buffer upon
completion of the macro computation, determined by a rising
edge of the DONE signal of the macro, as in Fig. 8. The
output vector is then transferred toward a 64-way SIMD unit
in batches of 64 adjacent elements per cycle. The starting batch
can be selected by setting an offset in the instructions.
The SIMD unit, as described in Fig. 10, handles the elementwise operations on the AIMC outputs with a six-stage programmable pipeline consisting of: 1) partial sum accumulation in the digital domain, in case the input channels and the kernel dimensions cannot be fully unrolled along the rows; 2) a batch-norm operation, equal to (αX + β); 3) residual branch addition; 4) an activation function (rectified linear unit (ReLU) and LeakyReLU supported); 5) re-quantization via pre-trained scaling and clipping parameters; and 6) pooling, for immediate computation of max/average pooling of the pooling window with output
of max/average pooling of the pooling window with output
channel parallelism. When needed, partial accumulation data
and/or residuals are fetched from L1 via a dedicated read
port. The pre-trained parameters required for these operations
are initially loaded from L2 and stored in local registers to
avoid recurrent fetching from memory. The stages of the SIMD
pipeline can be programmed to either be skipped or executed
through the AIMC instruction set.
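The elementwise stages can be mirrored in a short reference function. Parameter names and defaults are illustrative, not DIANA's instruction fields, and pooling (stage 6) is left out for brevity.

```python
import numpy as np

def simd_postprocess(psum, alpha, beta, residual=None, relu=True,
                     scale=1.0, clip=(-128, 127)):
    """Elementwise post-processing in the order of the six-stage pipeline.
    Stage 1 (digital partial-sum accumulation) is assumed already folded
    into `psum`; any stage can be skipped, as in the AIMC instruction set."""
    y = alpha * psum + beta                       # 2) batch norm
    if residual is not None:
        y = y + residual                          # 3) residual branch add
    if relu:
        y = np.maximum(y, 0)                      # 4) activation function
    y = np.clip(np.round(y * scale), *clip)      # 5) re-quantization
    return y                                      # 6) pooling would follow
```

Skipping a stage simply amounts to a pass-through, matching the programmable skip/execute behavior described above.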
V. DIGITAL DNN ACCELERATOR

The digital DNN accelerator core, depicted in Fig. 11, is a scaled-up version of the SOMA core of [32] and consists of
Fig. 11. Digital core architecture diagram.
a 16 × 16 reconfigurable PE array, allowing for acceleration of
a diverse set of workloads. Each PE can run precision scalable
MAC operations, with configurable resolutions of 2-, 4-, and
8-bit precision. Based on the selected precision, the parallelism
increases to a 32 × 16 (at 4 bit) and 64 × 16 (at 2 bit)
sized array. Beside supporting precision reconfiguration, the
accelerator can seamlessly switch between dataflows across
successive workloads. The ones supported are the: 1) output
stationary OX|K configuration for CNN workloads, which
turns into OX,C|K when the MAC units are used at sub-8-bit
precision and 2) C|K spatial mapping to support the canonical
MMM, adopted for FC layers or dense tensor operations (see
Fig. 2). Furthermore, elementwise operations (residual branch
addition, ReLU activation function, shifting) are supported
by the PE units. A flexible max pooling unit, integrated in
the accelerator fabric, handles the pooling operations when
required.
To support streamlined transfer of operands from one accelerator core to the other across the shared L1 memory, the
data must be re-organized for correct activation fetching: the
digital core requires OX parallelism across the input vectors
fetched in a read cycle from the memory, while the AIMC
core requires C parallelism when fetching from L1. To achieve
this, the digital core is able to either offload the final outputs
in two different fashions, as visualized in Fig. 11: 1) along
the columns to achieve OX parallelism if the next layer is
done in the digital core or 2) along the rows for C parallelism
if the next layer is done on AIMC. A reshuffling buffer
finally reorders data whose previous layer has been executed
on the analog core (stored K-parallel), when the next layers
are executed in the OX|K mode of the digital core (requiring
OX parallel input data).
V. DIGITAL DNN ACCELERATOR
The digital DNN accelerator core, depicted in Fig. 11, is a scaled-up version of the SOMA core of [32] and consists of a 16 × 16 reconfigurable PE array, allowing for acceleration of a diverse set of workloads. Each PE can run precision-scalable MAC operations at configurable resolutions of 2-, 4-, and 8-bit precision. Based on the selected precision, the parallelism increases to a 32 × 16 (at 4 bit) or 64 × 16 (at 2 bit) array.
Fig. 12. Die photograph.
Besides supporting precision reconfiguration, the accelerator can seamlessly switch between dataflows across successive workloads. The supported dataflows are: 1) the output-stationary OX|K configuration for CNN workloads, which turns into OX,C|K when the MAC units are used at sub-8-bit precision, and 2) the C|K spatial mapping to support the canonical MMM, adopted for FC layers or dense tensor operations (see Fig. 2). Furthermore, elementwise operations (residual branch addition, ReLU activation function, shifting) are supported by the PE units. A flexible max pooling unit, integrated in the accelerator fabric, handles the pooling operations when required.
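The precision-dependent parallelism above follows a simple rule: halving the MAC precision doubles the row parallelism. This can be written down as a small helper; the base dimensions are from the text, while the function name and structure are illustrative.

```python
def array_parallelism(precision_bits, base_rows=16, base_cols=16):
    """Effective PE-array shape of the digital core for a given MAC precision.

    At 8 bit the array is 16 x 16; each halving of the precision doubles
    the row parallelism (32 x 16 at 4 bit, 64 x 16 at 2 bit), as each PE
    splits its multiplier into independent lower-precision MACs.
    """
    assert precision_bits in (2, 4, 8)
    rows = base_rows * (8 // precision_bits)
    return rows, base_cols

rows, cols = array_parallelism(2)   # 64 x 16 at 2-bit precision
macs_per_cycle = rows * cols
```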
To support streamlined transfer of operands from one accelerator core to the other across the shared L1 memory, the
data must be re-organized for correct activation fetching: the
digital core requires OX parallelism across the input vectors
fetched in a read cycle from the memory, while the AIMC
core requires C parallelism when fetching from L1. To achieve
this, the digital core can offload the final outputs in two different fashions, as visualized in Fig. 11: 1) along the columns, achieving OX parallelism if the next layer runs on the digital core, or 2) along the rows, achieving C parallelism if the next layer runs on the AIMC core. A reshuffling buffer finally reorders data whose producing layer was executed on the analog core (stored K-parallel) when the consuming layer is executed in the OX|K mode of the digital core (requiring OX-parallel input data).
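The two fetch patterns can be illustrated with a toy NumPy example; the tile shape is arbitrary, and the reshuffling is modeled simply as a transpose of the 2-D activation tile.

```python
import numpy as np

# Toy activation tile: OX = 8 output pixels, C = 4 channels.
OX, C = 8, 4
tile = np.arange(OX * C).reshape(OX, C)   # tile[x, c]

# Digital-core layout: each memory word holds one channel across OX pixels,
# so a read returns an OX-parallel vector.
ox_parallel_words = tile.T                # shape (C, OX): word c = channel c over all pixels

# AIMC layout: each memory word holds one pixel across C channels,
# so a read returns a C-parallel vector.
c_parallel_words = tile                   # shape (OX, C): word x = pixel x over all channels

# The reshuffling buffer converts one layout into the other: a transpose.
assert np.array_equal(ox_parallel_words.T, c_parallel_words)
```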
VI. MEASUREMENTS
DIANA is implemented in 22FDX1, covering 10.24 mm2, or 8.91 mm2 excluding the outer pad ring. Fig. 12 shows the
die picture. The full-custom analog macro is integrated with
a digital back-end flow and accounts for 2.29 mm2 including
the SRAM cells, DACs, ADCs, control, and timing circuits.
The other conventional memories were implemented using
standard memory macros. The chip has four different supply
voltages for the IO pads, logic cells, the AIMC macro, and the memories. The system includes an internal frequency-locked
1 Registered trademark.
loop intellectual property (FLL IP) macro for on-chip clock
generation. For the measurement tasks, a custom printed circuit board (PCB) was made for the chip, with headers and dedicated pins for the supplies, connected to a field-programmable gate array (FPGA) board. The FPGA drives a set of control pins (e.g., reset) and emulates the off-chip memory interface.
The remainder of this section goes into more depth on the characterization results of the chip. Going from fine-grained to system-level measurements, and from peak performance to application-specific workloads, we subsequently cover the following.
1) The characterization of the AIMC macro and its inherent
tradeoff between accuracy and efficiency.
2) The analog and digital core peak performance.
3) The performance under actual NN workloads, exploiting
heterogeneous multi-core scheduling.
All the reported measurements were taken at room temperature, with our fabricated samples working correctly up
to 270 MHz at a nominal 0.8-V supply level.
Fig. 13. (a) Error and power measurements as a function of the AIMC macro operating point for a uniform distribution of accumulation values between [0; 4096]; each grid element corresponds to one operating point, and its label gives the mean percent error on the ADC output. (b) Most accurate operating point of the macro for different accumulation-value ranges.
A. Efficiency Versus Accuracy Trade-off in the Analog Macro
As discussed in Section IV-C, the AIMC core efficiency
and accuracy are highly dependent on the DAC unit time and
the source bias voltage of the charge-subtracting transistors in
the cells. A larger unit time and/or cell source bias voltage
increases the gain of the analog MAC operation, bringing
improved accuracy at the expense of a smaller MVM dynamic
range and increased power consumption. Both parameters can be set at compile time for every instruction of the analog core, as a function of the dynamic range expected from the NN model training. More inputs and/or a wider range require a smaller gain, while fewer activations are more accurately accumulated with a higher gain.
To evaluate this impact, the array was sourced with input–weight combinations spanning different MVM accumulation ranges. This was repeated for each tuning-knob combination, and the mean relative error between the measured and exact values was evaluated.
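The characterization loop can be sketched as follows. Since the physical macro is not available here, `measure_mvm` is a purely hypothetical gain/noise stand-in with illustrative constants; only the sweep structure mirrors the measurement procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure_mvm(exact, unit_time, v_bias):
    """Hypothetical stand-in for the chip measurement: higher gain
    (larger unit_time * v_bias) lowers the relative noise but saturates
    at a smaller dynamic range. All constants are illustrative."""
    gain = unit_time * v_bias
    full_scale = 4096.0 / gain
    noisy = exact + rng.normal(0.0, 20.0 / gain, size=exact.shape)
    return np.clip(noisy, 0.0, full_scale)

def sweep(exact_values, unit_times, biases):
    """Mean relative error for every tuning-knob combination."""
    err = {}
    for ut in unit_times:
        for vb in biases:
            meas = measure_mvm(exact_values, ut, vb)
            err[(ut, vb)] = np.mean(np.abs(meas - exact_values)
                                    / np.maximum(exact_values, 1.0))
    return err

exact = rng.uniform(0, 4096, size=512)            # uniform accumulation values
grid = sweep(exact, unit_times=(1, 2, 4), biases=(0.5, 1.0))
best = min(grid, key=grid.get)                    # most accurate operating point
```

The same loop, with the stand-in replaced by actual chip readouts, yields the error grid of Fig. 13(a) and the best-operating-point selection of Fig. 13(b).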
Fig. 13(a) shows how the AIMC macro operating point affects the output accuracy of a single MVM operation. A uniform distribution of MVM output values between [0; 4096] is computed at the output of the 512-ADC bank for different biasing voltages and unit-time values, and the percent error on the output vector is reported in each grid element. Fig. 13(b) summarizes the most accurate operating point for a set of output-accumulation ranges, between [0; 1024] and [0; 10 240]. As depicted in the figure, the best operating point varies based on the distribution of the final accumulation.
All the measurements in the remainder of this article are
performed under the most efficient settings that provide the
desired gain (dynamic range), with an error within 2% of the
most accurate one.
B. Peak Performance and Efficiency Characterization
Fig. 14 shows plots of the peak efficiency and performance of the two cores at their maximum functional operating frequency across different supply voltages.
Fig. 14. Peak performance and efficiencies at the system level of (a) the complete analog core and (b) the complete digital core, both including peripheries.
As can be seen from these graphs, at nominal conditions (0.8 V at 250 MHz), the analog accelerator, with I7/W1.5/O6, outperforms the I8/W8/O8 digital core by ∼95×, both in terms of TOP/s/W efficiency and TOP/s performance. This shows the strength of the analog accelerator when used at maximum spatial and temporal utilization. Unfortunately, NN layers have different shapes that prevent full utilization of the array and require bandwidth-limited weight loading, resulting in temporal stalls. Neither effect is taken into account in the peak-performance figure of merit.
C. Workload Performance Characterization
This section details energy and latency breakdowns when executing different sets of DNN layers, highlighting potential advantages and drawbacks of different workloads when run on hardware. While the digital core's small size and flexibility allow it to operate at maximum utilization across workloads, this is not true for the analog core. To investigate this, we analyzed layers with different shapes from the ResNet20 network for CIFAR-10 and the ResNet18 [2] network for ImageNet, excluding the FC layers for their negligible contribution. The first layers were mapped onto the digital core, while the deeper layers were executed by the analog core. The first layer has a relevant impact on accuracy when executed at low precision and, given its low input-channel count, would heavily underutilize the analog macro. From an
Fig. 15. ResNet20 (a) performance measurement and (b) energy efficiency. The network achieves low spatial utilization of the AIMC macro and thus does not reach peak efficiencies.
Fig. 18. ResNet18 mapping of the ResBlock layers on the AIMC core,
with their relative spatial utilization and required number of weight-write
operations. OX unrolling of 4 is used for ResBlock64.
Fig. 16. ResNet18 (a) performance measurement and (b) efficiency. TOP/s
and TOP/s/W are highly dependent on the temporal and spatial utilization of
the AIMC macro.
Fig. 17. By immediate reuse of intermediate data, memory requirements for the activations can be reduced by 7.2×, allowing only a subsection of the feature map to be kept in memory, as depicted in (a); furthermore, by scheduling different layers on different cores, computations can be overlapped in time (b), with subsequent latency savings (c).
energy perspective, it would still make sense to use the analog core, but this would require a dedicated input memory manager to handle its dataflow, for only minor utilization improvements. The following layers benefit marginally from a more accurate execution, while being almost two orders of magnitude less energy-efficient on our hardware if run on the digital core.
Figs. 15 and 16 show the energy and latency breakdown of
the system; we also report the mapping adopted for different
layers throughout ResNet18 in Fig. 18.
TABLE I
NETWORK END-TO-END ACCURACY AND PERFORMANCE SUMMARY
As expected, the utilization of the macro and the system efficiency are positively correlated across the workloads. The peak performance at the core level is reached at full utilization: this occurs in the ResBlock512 layers (see Fig. 18), where 16.5 TOP/s, close to the maximum achievable 18.1 TOP/s, is reached at the core level. These layers are also the most demanding ones in terms of bandwidth requirements: given the high throughput that can be achieved, new weights have to be continuously loaded on-chip to keep up with the computation. When the energy and latency for writing the weights into the array are taken into account, the characteristics significantly degrade, as shown in Fig. 16. For deeper layers, where weight traffic dominates due to the small activation sizes and the many channels, a >10× degradation is seen in the TOP/s/W and TOP/s figures. In summary, the array's temporal utilization increases with activation size, due to increased weight reuse. In addition, layers with many input–output channels can extract the most spatial utilization from the computing hardware, yet they also shift the bottleneck to weight movement.
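A back-of-envelope model makes the weight-loading bottleneck concrete. The 18.1 TOP/s peak is taken from the text; the utilization figures and cycle counts below are purely illustrative.

```python
def effective_tops(peak_tops, spatial_util, mac_cycles, weight_write_cycles):
    """Back-of-envelope effective throughput: the peak is scaled by the
    spatial utilization of the array and by the fraction of time spent
    computing rather than (re)writing weights into the AIMC array."""
    temporal_util = mac_cycles / (mac_cycles + weight_write_cycles)
    return peak_tops * spatial_util * temporal_util

# Early layer: large activations, so weights are written once and reused a lot.
early = effective_tops(18.1, spatial_util=0.4,
                       mac_cycles=10_000, weight_write_cycles=500)

# Deep layer: small activations and many channels, so weight traffic dominates.
deep = effective_tops(18.1, spatial_util=0.9,
                      mac_cycles=800, weight_write_cycles=8_000)
```

Even though the deep layer has better spatial utilization, its effective throughput is lower: weight movement has become the bottleneck, as observed in Fig. 16.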
TABLE II
SOTA DNN ACCELERATORS' COMPARISON
To realize additional latency savings, pipelining and layer fusion can be applied on the heterogeneous architecture, as discussed in Section II-D. We apply this to ResNet18, fusing a stack of six layers consisting of the first convolutional and pooling layers (run on the digital core) and ResBlock64 (run on the AIMC core). By computing an activation tile of 56 × 4 pixels at a time, we: 1) avoid multiple activation data transfers to L2 by immediately reusing the intermediate data from the L1 memory and 2) overlap the operations of the digital core and the AIMC core, as illustrated in Fig. 17, leading to 25% latency savings on the layer stack. Table I compares the accuracies and performance of the two workloads when run at different operand precisions.
The drop in accuracy when switching from a floating-point model to an 8-bit quantized one running on the digital core is negligible, but this mode is still not optimized for efficiency and performance. When switching to the mixed-mode computation, the drop in Top-1 accuracy is around 2% for ResNet20 and 5% for ResNet18; considering the 30× and 17× improvements in energy efficiency and latency, the loss in accuracy may be worth the tradeoff.
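The latency benefit of the tile-level pipelining across the two cores can be sketched with a toy two-stage pipeline model; the per-tile latencies and tile count below are illustrative, not measured values.

```python
def sequential_latency(t_dig, t_aimc, n_tiles):
    """Run the fused stack tile by tile, the two cores never overlapping."""
    return n_tiles * (t_dig + t_aimc)

def pipelined_latency(t_dig, t_aimc, n_tiles):
    """Two-stage pipeline: while the AIMC core consumes tile i, the
    digital core already produces tile i + 1; total time is the pipeline
    fill plus one slower-stage period per remaining tile."""
    return t_dig + t_aimc + (n_tiles - 1) * max(t_dig, t_aimc)

# Illustrative per-tile latencies (cycles) and tile count; in the paper the
# 56 x 4-pixel tiling would set n_tiles from the feature-map dimensions.
n = 14
seq = sequential_latency(300, 400, n)
pipe = pipelined_latency(300, 400, n)
savings = 1 - pipe / seq
```

With these numbers the pipelined schedule is bounded by the slower (AIMC) stage; the saving grows with the number of tiles and with the balance between the two stages.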
D. SotA Comparison
Table II compares DIANA with recently taped-out DNN accelerators; the selected designs include digital DNN accelerators, AIMC designs, and a heterogeneous design. The DIANA peak performances reported correspond to the ones described in Section VI-B. The integrated AIMC core achieves the highest TOP/s/W among the AIMC implementations. Previous work [37] includes programmable heterogeneous architectures, but without an integrated reconfigurable digital accelerator core. A homogeneous AIMC multi-core approach is implemented in [26], with pipelined computations across cores. Yue et al. [39] propose an in-memory computing (IMC) design for handling activation and block-level sparsity, but with lower overall efficiency. Overall, DIANA is unique in its ability to flexibly combine analog and digital acceleration, offering a favorable tradeoff between efficiency, accuracy, and broad workload flexibility.
VII. CONCLUSION
We implemented DIANA, a heterogeneous AIMC–digital DNN accelerator, to investigate the tradeoff between the extreme energy efficiency and performance of AIMC cores and the reconfigurability and higher precision of digital accelerators, in order to optimally run end-to-end DNN models. To enlarge the mapping space of the non-reconfigurable AIMC array, we proposed output pixel unrolling, a novel spatial mapping method that enables OXu× latency savings and proportional energy savings with negligible overhead; furthermore, we explored layer fusion and pipelining strategies to further optimize the mappings on the heterogeneous multi-core design. The main bottlenecks that affect the current design are the loading of the weights into the AIMC core and the limited buffer size of the L1; these two factors degrade the temporal utilization of the accelerator and can be further optimized in future designs.
Finally, this work opens an exciting area of optimization possibilities: the DNN models can be trained by a hardware-aware neural architecture search (NAS) that is knowledgeable of each core's strengths and features. We would thus optimize for the desired metric and execute each layer on the best-suited core, achieving higher accuracies, energy efficiencies, and throughput based on the task requirements.
ACKNOWLEDGMENT
The authors would like to thank Eidgenössische Technische Hochschule (ETH) for their support on the PULPissimo platform and GlobalFoundries, Heverlee, Belgium, for the tape-out support on their 22FDX FD-SOI platform.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, F. Pereira, C. Burges, L. Bottou,
and K. Weinberger, Eds. Red Hook, NY, USA: Curran Associates,
2012.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” CoRR, vol. abs/1512.03385, pp. 1–12, Dec. 2015.
[3] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and
K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <1 MB model size,” CoRR, vol. abs/1602.07360, pp. 1–13,
Feb. 2016.
[4] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal
speed and accuracy of object detection,” 2020, arXiv:2004.10934.
[5] M. Verhelst and B. Moons, “Embedded deep neural network processing:
Algorithmic and processor techniques bring deep learning to IoT and
edge devices,” IEEE Solid-State Circuits Mag., vol. 9, no. 4, pp. 55–65,
Fall 2017.
[6] V. Jain, L. Mei, and M. Verhelst, “Analyzing the energy-latency-area-accuracy trade-off across contemporary neural networks,” in Proc.
IEEE 3rd Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Jun. 2021,
pp. 1–4.
[7] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Jan. 2016,
pp. 262–263.
[8] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in
28 nm FDSOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, Feb. 2017, pp. 246–247.
[9] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann,
“An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS,” IEEE J. Solid-State
Circuits, vol. 54, no. 1, pp. 158–172, Jan. 2019.
[10] J. Yue et al., “A 65 nm computing-in-memory-based CNN processor with
2.9-to-35.8 TOPS/W system energy efficiency using dynamic-sparsity
performance-scaling architecture and energy-efficient inter/intra-macro
data reuse,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech.
Papers, Feb. 2020, pp. 234–236.
[11] B. Moons, D. Bankman, and M. Verhelst, Embedded Deep Learning: Algorithms, Architectures and Circuits for Always-On Neural Network Processing. Cham, Switzerland: Springer, 2019.
[12] K. Prabhu et al., “CHIMERA: A 0.92-TOPS, 2.2-TOPS/W edge AI
accelerator with 2-MByte on-chip foundry resistive RAM for efficient
training and inference,” IEEE J. Solid-State Circuits, vol. 57, no. 4,
pp. 1013–1026, Apr. 2022.
[13] Y. S. Shao et al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proc. 52nd Annu. IEEE/ACM Int.
Symp. Microarchitecture (MICRO). New York, NY, USA: Association
for Computing Machinery, Oct. 2019, pp. 14–27.
[14] X. Yang et al., “A systematic approach to blocking convolutional neural networks,” CoRR, vol. abs/1606.04209, pp. 1–12,
Jun. 2016.
[15] B. Murmann, “Mixed-signal computing for deep neural network inference,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 1,
pp. 3–13, Jan. 2021.
[16] N. Verma et al., “In-memory computing: Advances and prospects,” IEEE
Solid-State Circuits Mag., vol. 11, no. 3, pp. 43–55, Summer 2019.
[17] S. Cosemans et al., “Towards 10000 TOPS/W DNN inference with
analog in-memory computing—A circuit blueprint, device options and
requirements,” in IEDM Tech. Dig., Dec. 2019, pp. 22.2.1–22.2.4.
[18] A. Krizhevsky, V. Nair, and G. Hinton, “Learning multiple layers of
features from tiny images,” Can. Inst. Adv. Res., Toronto, ON, Canada,
2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[19] T. Lin et al., “Microsoft COCO: Common objects in context,” CoRR,
vol. abs/1405.0312, pp. 1–15, May 2014.
[20] P. Houshmand et al., “Opportunities and limitations of emerging analog in-memory compute DNN architectures,” in IEDM Tech. Dig.,
Dec. 2020, pp. 29.1.1–29.1.4.
[21] A. Garofalo et al., “A heterogeneous in-memory computing cluster for
flexible end-to-end inference of real-world deep neural networks,” CoRR,
vol. abs/2201.01089, pp. 1–15, Jan. 2022.
[22] K. Ueyoshi et al., “DIANA: An end-to-end energy-efficient digital and analog hybrid neural network SoC,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 65, Feb. 2022,
pp. 1–3.
[23] A. Vaswani et al., “Attention is all you need,” CoRR,
vol. abs/1706.03762, pp. 1–15, Jun. 2017.
[24] L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst, “ZigZag:
Enlarging joint architecture-mapping design space exploration for DNN
accelerators,” IEEE Trans. Comput., vol. 70, no. 8, pp. 1160–1174,
Aug. 2021.
[25] X. Yang et al., “Interstellar: Using Halide’s scheduling language to analyze DNN accelerators,” in Proc. 25th Int. Conf. Architectural Support
Program. Lang. Operating Syst. New York, NY, USA: Association for
Computing Machinery, 2020, pp. 369–383.
[26] H. Jia et al., “Scalable and programmable neural network inference
accelerator based on in-memory computing,” IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 198–211, Jan. 2022.
[27] A. Shafiee et al., “ISAAC: A convolutional neural network
accelerator with in-situ analog arithmetic in crossbars,” in Proc.
ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016,
pp. 14–26.
[28] P.-Y. Chen, X. Peng, and S. Yu, “NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array
architectures,” in IEDM Tech. Dig., Dec. 2017, pp. 6.1.1–6.1.4.
[29] A. S. Rekhi et al., “Analog/mixed-signal hardware error modeling for
deep learning inference,” in Proc. 56th Annu. Design Autom. Conf.
(DAC). New York, NY, USA: Association for Computing Machinery,
2019, pp. 1–6.
[30] M. L. Gallo et al., “Mixed-precision in-memory computing,” Nature
Electron., vol. 1, pp. 246–253, Apr. 2018.
[31] S. K. Gonugondla, C. Sakr, H. Dbouk, and N. R. Shanbhag,
“Fundamental limits on the precision of in-memory architectures,”
in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2020,
pp. 1–9.
[32] J. S. P. Giraldo, V. Jain, and M. Verhelst, “Efficient execution of temporal
convolutional networks for embedded keyword spotting,” IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 12, pp. 2220–2228,
Dec. 2021.
[33] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN
accelerators,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[34] P. D. Schiavone, D. Rossi, A. Pullini, A. Di Mauro, F. Conti, and
L. Benini, “Quentin: An ultra-low-power PULPissimo SoC in 22 nm
FDX,” in Proc. IEEE SOI-3D-Subthreshold Microelectron. Technol.
Unified Conf. (S3S), Oct. 2018, pp. 1–3.
[35] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini, “A fully-synthesizable
single-cycle interconnection network for shared-L1 processor clusters,”
in Proc. Design, Autom. Test Eur., Mar. 2011, pp. 1–6.
[36] I. A. Papistas et al., “A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm2
in-memory analog matrix-vector-multiplier for DNN acceleration,”
in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2021,
pp. 1–2.
[37] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, “A programmable
heterogeneous microprocessor based on bit-scalable in-memory computing,” IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621,
Sep. 2020.
[38] H. Mo et al., “A 28 nm 12.1 TOPS/W dual-mode CNN processor
using effective-weight-based convolution and error-compensation-based
prediction,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech.
Papers, vol. 64, Feb. 2021, pp. 146–148.
[39] J. Yue et al., “A 2.75-to-75.9 TOPS/W computing-in-memory NN
processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating,” in
IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64,
Feb. 2021, pp. 238–240.
Pouya Houshmand (Graduate Student Member,
IEEE) received the B.Sc. and M.Sc. degrees in
electrical engineering from the Polytechnic of Turin,
Turin, Italy, in 2017 and 2019, respectively. He is
currently pursuing the Ph.D. degree in architectures
for deep neural network (DNN) accelerators with
ESAT-MICAS Laboratories, KU Leuven, Leuven,
Belgium.
His current research interests include algorithm–hardware co-design, in-memory computing (IMC),
and emerging technologies.
Giuseppe M. Sarda received the B.Sc. and M.Sc.
degrees in electrical engineering from the Politecnico di Torino, Turin, Italy, in 2018 and 2020,
respectively. He carried out his master’s thesis at
Technische Universität Wien (TU Wien), Vienna,
Austria.
In September 2020, he joined imec, Leuven,
Belgium, and the MICAS Group, Katholieke Universiteit Leuven, Leuven, as a Ph.D. Researcher. His
current research focuses on in-memory design for
efficient embedded machine learning.
Vikram Jain (Member, IEEE) received the M.Sc.
degree in embedded electronics systems design
(EESD) from the Chalmers University of Technology, Gothenburg, Sweden, in 2018. He is currently pursuing the Ph.D. degree in energy-efficient
digital acceleration and reduced instruction set
computer-five (RISC-V) processors for machine
learning applications at the edge, under the
supervision of Prof. Marian Verhelst.
In 2018, he joined ESAT-MICAS Laboratories, KU Leuven, Leuven, Belgium, as a
Research Assistant. He was a Visiting Researcher with the IIS
Laboratory, ETH Zürich, Zürich, Switzerland, under the supervision
of Prof. Luca Benini. His current research interests include ML accelerators,
RISC-V architecture, heterogeneous systems, design space exploration, low-power digital design, and hardware design automation.
Dr. Jain was a recipient of prestigious research fellowship from the Swedish
Institute (SI) for his master’s degree from 2016 to 2018.
Kodai Ueyoshi received the B.E., M.E., and Ph.D.
degrees from Hokkaido University, Sapporo, Japan,
in 2015, 2017, and 2020, respectively.
From 2017 to 2020, he was a JSPS Research
Fellow. From 2018 to 2020, he was also a Researcher
of JST ACT-I. He was an Assistant Researcher with
KU Leuven, Leuven, Belgium, from 2020 to 2022.
He is currently a Research Engineer with ZEKU
Technologies, Shanghai, China. His research interests include energy-efficient hardware architecture
for machine learning systems and software and
hardware co-optimization.
Dr. Ueyoshi received the ISSCC Silkroad Award in 2018, the IEEE SSCS
Pre-Doctoral Award, and the Ninth JSPS Ikushi Prize in 2019.
Ioannis A. Papistas received the B.Sc. and
M.Eng. degrees in electrical and computer engineering from the Aristotle University of Thessaloniki,
Thessaloniki, Greece, in 2014, and the Ph.D. degree
from the University of Manchester, Manchester,
U.K., in 2018. His Ph.D. dissertation was on heterogeneous 3-D IC integration.
He spent a year as a Post-Doctoral Research
Fellow with the University of Manchester. In 2019,
he joined the Machine Learning Team, imec,
Leuven, Belgium, as a Research and Development
Engineer on analog in-memory computing (IMC) accelerators. In 2021, he
co-founded axelera.ai, where he is currently a Senior Research and Development Engineer working on energy- and area-efficient machine learning accelerator designs. His current research interests include IMC, efficient machine
learning accelerator designs, heterogeneous integration, and mixed-signal IC
design.
Dr. Papistas was a recipient of the prestigious EPSRC Doctoral Prize Award.
Man Shi received the B.Sc. degree from the School of Information Science and Engineering, Shandong University (SDU), Jinan, China, in 2017, and the M.Sc. degree from the
Institute of Microelectronics, Tsinghua University, Beijing, China, in 2020. She is currently
pursuing the Ph.D. degree in accelerator architectures for deep neural networks (DNNs) with MICAS Laboratories, KU Leuven, Leuven,
Belgium.
Her current research interests include low-power
DNN hardware accelerator design, algorithm–hardware co-design, and
reconfigured computation.
Qilin Zheng received the B.S. degree from Peking
University, Beijing, China, in 2019, and the M.S.
degree from KU Leuven, Leuven, Belgium, in 2022.
He is currently pursuing the Ph.D. degree in ECE
with Duke University, Durham, NC, USA.
His current research interests include computer
architecture, non-volatile memory, and compute-in-memory design.
Debjyoti Bhattacharjee received the B.Tech. degree
in computer science and engineering from the
West Bengal University of Technology (WBUT),
Kolkata, West Bengal, India, in 2013, the M.Tech.
degree in computer science from the Indian Statistical Institute, Kolkata, in 2015, and the Ph.D. degree
in computer science and engineering from Nanyang
Technological University, Singapore, in 2019.
He worked as a Research Fellow with Nanyang
Technological University, for a year. During his doctoral studies, he worked on design of architectures
using emerging technologies for in-memory computing (IMC). He developed
novel technology mapping algorithms, technology-aware synthesis techniques,
and proposed novel methods for multi-valued logic realization. He is currently
a Research and Development Engineer with the Compute System Architecture
Unit, imec, Leuven, Belgium. His current research interests include machine
learning accelerator using analog hardware, hardware design automation
tools, and application-specific accelerator design, with emphasis on emerging
technologies.
Arindam Mallik received the M.S. and Ph.D.
degrees in electrical engineering and computer science from Northwestern University, Evanston, IL,
USA, in 2004 and 2008, respectively.
He is currently a technologist with 20 years of
experience in semiconductor research. He leads the
Future System Exploration (FuSE) Group with the
Compute System Architecture (CSA) Research and
Development Unit. He has authored or coauthored
more than 100 articles in international journals and
conference proceedings. He holds a number of international patents. His research interests include novel computing systems,
design–technology co-optimization, and economics of semiconductor scaling.
Peter Debacker received the M.Sc. degree (Hons.)
in electrical engineering from Katholieke Universiteit Leuven, Leuven, Belgium, in 2004.
Before joining imec, Leuven, in 2011, he worked
with Philips as a System Engineer and Essensium as
a System Architect. At imec, he is currently a Principal Member of Technical Staff, architecting solutions
for high-performance and energy-efficient artificial
intelligence (AI) systems with hardware ranging
from large, scaled-out high-performance systems to
tiny deep neural network (DNN) accelerators and
in-memory compute hardware. Before that, he was a Program Manager for
imec’s Machine Learning Program, and he worked on imec’s low-power
digital chip and processor architectures and implementation in advanced
technology nodes to optimize the power–performance–area (PPA)
of scaled CMOS technologies (for 3 nm and beyond). His current research
interests include processor and computer architectures, AI and machine
learning, design methodologies, and digital chip design and verification.
Diederik Verkest received the Ph.D. degree in
applied sciences from the University of Leuven,
Leuven, Belgium, in 1994.
He started working in the VLSI Design Methodology Group, imec, Leuven, on hardware/software
co-design, reconfigurable systems, and multiprocessor systems-on-chip in the domain of wireless
and multimedia. From 2010 to 2020, he was responsible for imec’s Logic Insite Research Program,
in which leading design and process companies
jointly worked on the co-optimization of CMOS design
and process technology for N+2 nodes. In recent years, he has focused on
design and process technology optimization for ML accelerators. He has
published and presented over 150 articles in international journals and at
international conferences.
Marian Verhelst (Senior Member, IEEE) received
the Ph.D. degree from KU Leuven, Leuven,
Belgium, in 2008.
She worked as a Research Scientist with Intel
Labs, Hillsboro, OR, USA, from 2008 to 2010.
She is currently a Full Professor with MICAS
Laboratories, KU Leuven, and the Research Director with imec, Leuven. Her research focuses on
embedded machine learning, hardware accelerators, HW–algorithm co-design, and low-power edge
processing.
Dr. Verhelst was a member of the Young Academy of Belgium and the
STEM Advisory Committee to the Flemish Government. She is a member
of the Board of Directors of tinyML and is active in the TPCs of DATE,
ISSCC, VLSI, and ESSCIRC. She received the Laureate Prize of the Royal
Academy of Belgium in 2016, the 2021 Intel Outstanding Researcher Award,
and the André Mischke YAE Prize for Science and Policy in 2021. She
was the Chair of tinyML2021 and the TPC Co-Chair of AICAS2020. She
is an IEEE SSCS Distinguished Lecturer. She was an Associate Editor of
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS (TVLSI), IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II:
EXPRESS BRIEFS (TCAS-II), and the IEEE JOURNAL OF SOLID-STATE
CIRCUITS (JSSC).