IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 58, NO. 1, JANUARY 2023

DIANA: An End-to-End Hybrid DIgital and ANAlog Neural Network SoC for the Edge

Pouya Houshmand, Graduate Student Member, IEEE, Giuseppe M. Sarda, Vikram Jain, Member, IEEE, Kodai Ueyoshi, Ioannis A. Papistas, Man Shi, Qilin Zheng, Debjyoti Bhattacharjee, Arindam Mallik, Peter Debacker, Diederik Verkest, and Marian Verhelst, Senior Member, IEEE

Abstract— DIgital-ANAlog (DIANA), a heterogeneous multi-core accelerator, combines a reduced instruction set computer - five (RISC-V) host processor with an analog in-memory computing (AIMC) artificial intelligence (AI) accelerator and a digital reconfigurable deep neural network (DNN) accelerator in a single system-on-chip (SoC) to support a wide variety of neural network (NN) workloads. AIMC cores can bring extreme computational parallelism and efficiency at the expense of accuracy and dataflow flexibility. Digital AI co-processors, on the other hand, guarantee accuracy through deterministic compute, but cannot achieve the same computational density and efficiency. DIANA exploits this fundamental tradeoff by integrating both types of cores in a shared and optimized memory system, to enable seamless execution of the workloads on the parallel cores. The system's performance benefits further from pipelined parallel execution across both accelerator cores and enhanced AIMC spatial unrolling techniques, leading to drastically reduced execution latency and reduced memory footprints. The design has been implemented in a 22-nm technology and achieves peak efficiencies of 600 TOP/s/W for the AIMC core (I/W/O: 7/1.5/6 bit) and 14 TOP/s/W (I/W/O: 8/8/8 bit) for the digital accelerator. End-to-end performance evaluation of CIFAR-10 and ImageNet classification workloads is carried out on the chip, reporting 7.02 and 5.56 TOP/s/W, respectively, at the system level.
Index Terms— Algorithm-to-HW mapping, analog in-memory computing (AIMC), deep neural network (DNN) acceleration, machine learning processing, mixed-signal computing, reduced instruction set computer - five (RISC-V), system-on-chip (SoC).

Manuscript received 24 May 2022; revised 26 August 2022; accepted 30 September 2022. Date of publication 31 October 2022; date of current version 28 December 2022. This article was approved by Associate Editor Sophia Shao. This work was supported in part by KU Leuven, in part by the European Union (EU) European Research Council (ERC) for Resource-efficient sensing through dynamic attention-scalability (Re-SENSE) Project under Grant ERC-2016-STG-715037, and in part by the Flemish Government (Artificial Intelligence (AI) Research Program). (Pouya Houshmand and Giuseppe M. Sarda contributed equally to this work.) (Corresponding authors: Pouya Houshmand; Giuseppe M. Sarda.) Pouya Houshmand, Vikram Jain, Kodai Ueyoshi, Man Shi, and Qilin Zheng are with ESAT-MICAS, KU Leuven, 3000 Leuven, Belgium (e-mail: pouya.houshmand@kuleuven.be). Giuseppe M. Sarda and Marian Verhelst are with KU Leuven, 3000 Leuven, Belgium, and also with imec, 3001 Leuven, Belgium (e-mail: giuseppe.sarda@imec.be). Ioannis A. Papistas, Debjyoti Bhattacharjee, Arindam Mallik, Peter Debacker, and Diederik Verkest are with imec, 3001 Leuven, Belgium. Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2022.3214064. Digital Object Identifier 10.1109/JSSC.2022.3214064

I. INTRODUCTION

Deep learning algorithms have brought state-of-the-art (SotA) accuracy in pattern recognition and object classification tasks and have become an integral tool in solving many real-world problems [1], [2], [3], [4]. However, deep neural networks (DNNs) are computationally intensive algorithms, which significantly hinders efficient execution on edge devices, characterized by strict resource, energy, and latency constraints.
However, the pre-defined sequential nature of these computations, which can be expressed as a sequence of matrix–vector multiplications (MVMs), enables research on specialized, efficient DNN hardware accelerators, which exploit the high degree of parallelization possibilities [5], [6]. The co-design space of the hardware architecture combined with the scheduling possibilities offers countless optimization opportunities to maximize the spatial parallelism and data locality of the operations. In recent years, many digital designs have been proposed to accelerate these operations [7], [8], [9], [10], with a hardware template consisting of an array of processing elements (PEs), which allows for 2-D spatial parallelization of the multiply-accumulate (MAC) operations. The computing units are then surrounded by a memory hierarchy to exploit the temporal and spatial locality of the operands via multiple levels of caching [11]. These hardware templates are also replicated in homogeneous chiplet-based multi-core architectures [12], [13] to enable higher parallelization degrees and pipelining of a sequence of workloads. The efficiency of these designs is often dominated by the communication cost of sending data between the memories and the compute array. This overhead of data movement can be minimized through optimized memory hierarchies and appropriate tiling of the nested loops [14], thus optimizing the re-use of the operands. Even so, it remains an important limiting factor in accelerator performance. This stimulated research into alternatives to digital accelerators, such as analog in-memory computing (AIMC), which removes part of the data movement by computing the MVM operations within the memory in the analog domain. This allows massive parallelization and density of the DNN operations, promising up to 100× improved throughput and energy efficiency compared with the SotA digital designs [15], [16], [17].
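The nested MAC loops underlying these workloads can be made concrete; a minimal sketch of a convolutional layer written out as explicit loops (stride 1, no padding, batch loop omitted; the tensor layout is our own convention, not tied to any particular accelerator):

```python
import numpy as np

def conv2d_loops(I, W):
    """2-D convolution as explicit nested MAC loops.
    I: input  [C, IY, IX]; W: weights [K, C, FY, FX] (assumed layout)."""
    K, C, FY, FX = W.shape
    _, IY, IX = I.shape
    OY, OX = IY - FY + 1, IX - FX + 1          # stride 1, no padding
    O = np.zeros((K, OY, OX))
    for k in range(K):                         # output channels
        for oy in range(OY):                   # output rows
            for ox in range(OX):               # output columns
                for c in range(C):             # input channels
                    for fy in range(FY):       # kernel rows
                        for fx in range(FX):   # kernel columns
                            O[k, oy, ox] += W[k, c, fy, fx] * I[c, oy + fy, ox + fx]
    return O
```

An accelerator dataflow corresponds to choosing which of these loops are flattened spatially over the PE or AIMC array and which are tiled in time.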
However, these benefits are only achieved under high spatial and temporal utilization of the array. The spatial utilization is the fraction of the array's compute cells performing useful computations over all computing cells. Achieving good spatial utilization for each targeted MVM is challenged by the limited scheduling flexibility of AIMC arrays, as well as by the wide diversity of DNN layer topologies. For example, the limited channel count of the first layers of networks designed for ImageNet, CIFAR-10, and COCO [18], [19] usually leads to poor spatial utilization of such arrays. The temporal utilization, on the other hand, is determined by the number of clock cycles during which the AIMC array can be effectively used. Maximizing temporal utilization implies minimizing the stalling cycles that occur due to limited data reuse and bandwidth limitations to source/sink the input–output data. For example, fully connected (FC) layers exhibit poor weight reuse and depthwise layers exhibit poor input reuse, making it challenging to achieve good temporal utilization for their execution on AIMC arrays [20].

Fig. 1. Computer vision networks exhibit different workloads: the number of parameters per layer grows quadratically, while the feature maps' size decreases at the same rate. The result is shallow layers with huge feature maps and few weights, and deep layers with a large number of weights but small features. A heterogeneous, reconfigurable architecture is thus required to support different workload characteristics.

0018-9200 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
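The spatial-utilization notion above can be quantified with a short worked example (a sketch; the array dimensions used are the ones DIANA eventually adopts in Section II-B):

```python
def spatial_utilization(C, FX, FY, K, rows, cols):
    """Fraction of AIMC cells doing useful work when one layer is mapped
    with C, FX, FY on the rows and K on the columns (single mapping,
    no weight duplication)."""
    used_rows = min(C * FX * FY, rows)
    used_cols = min(K, cols)
    return (used_rows * used_cols) / (rows * cols)

# a 3-channel 7x7 first layer barely uses a 1152x512 array...
first = spatial_utilization(C=3, FX=7, FY=7, K=64, rows=1152, cols=512)
# ...while a 128-channel 3x3 layer fills the rows exactly
mid = spatial_utilization(C=128, FX=3, FY=3, K=128, rows=1152, cols=512)
```

The first-layer case lands below 2% utilization, while the mid-network layer reaches 25%, illustrating why early low-channel-count layers map poorly on large AIMC arrays.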
Finally, the noise sensitivity of a DNN layer should also be taken into account when assessing whether a layer benefits more from mapping on a traditional digital DNN accelerator or on an AIMC array. It is clear that specific neural network (NN) layers can benefit enormously from AIMC technology toward higher throughput and energy efficiency, yet not all layers do. For several workload and layer types, mapping on traditional digital MVM accelerators remains beneficial, e.g., for depthwise layers [21]. To this end, we present DIANA [22], a hybrid DIgital-ANAlog DNN system-on-chip (SoC), which combines the energy efficiency and high throughput of AIMC technology with the dataflow reconfigurability and higher precision of digital architectures, as summarized in Fig. 1. The design enables end-to-end acceleration of a diverse set of NN models, under the control of a reduced instruction set computer - five (RISC-V) host processor, to exploit parallel execution across the three cores and efficiently share data between them. The contributions of this work are hence as follows.
1) A heterogeneous DNN accelerator, combining an AIMC core with a flexible digital accelerator in a single SoC with an optimized shared memory hierarchy.
2) Enhanced AIMC mapping methods to achieve higher spatial and temporal utilization of the AIMC array.
3) Optimized scheduling strategies across the multi-core heterogeneous SoC to maximize temporal utilization of the computing fabrics.

Fig. 2. (a) Seven-nested loop representation of a 2-D convolutional layer and of an MMM and how different dataflows map the workload on the hardware. (b) Weight stationary FX, FY, and C|K dataflow for AIMC. (c) Output stationary OX|K dataflow and C|K dataflow for digital.

Section II starts with the basic dataflow concepts and motivates the design choices through hardware-scheduling co-optimization, justifying the choice of a heterogeneous architecture.
Section III describes the architectural components in more detail, followed by Section VI, where measurements are carried out to evaluate the performance and efficiency of the taped-out design.

II. DESIGN CHOICES

This section details the rationale behind the algorithm and hardware co-design choices, first introducing the workload characteristics, followed by the motivation of why a hybrid design suits them.

A. Dataflow Concepts

The workloads of modern networks can be expressed as a sequence of nested loops that operate on two tensors to generate an output tensor through a series of MAC operations. The nested-loop representation can be used for convolutional layers in convolutional NNs (CNNs) and for matrix–matrix multiplications (MMMs) in transformers [23]; Fig. 2(a) describes the loop description for different workloads. This offers hardware designers the possibility to parallelize a subset of the loops in the spatial dimension across an array of MAC units. The loops that are not flattened in the MAC array can in turn be temporally unrolled and tiled [14] to exploit re-use at various levels in the memory hierarchy. The combination of the temporal tiling of the nested loops and their spatial unrolling defines the dataflow of a given workload [24], [25]. AIMC typically relies on a weight stationary dataflow [see Fig. 2(b)], since its physical implementation constrains the way the operands can be spatially re-used in the MAC array: inputs are multicast along the cells of each row, while accumulations happen along the columns [16], [17]. This commonly results in spatially unrolling the weights relative to different input channels (C) and different kernel dimensions (FX, FY) on the rows, and the different kernels (K dimension) on the columns (denoted as C, FX, FY|K).
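This weight-stationary mapping can be sketched behaviorally: each kernel becomes one column of the array, and one convolution output pixel is one analog MVM (the tensor layouts below are our own convention, not the macro's physical cell order):

```python
import numpy as np

def map_weights_cfxfy_k(W):
    """C, FX, FY|K mapping: kernel k occupies one column of the AIMC
    array; rows index the flattened (c, fy, fx) tuples."""
    K, C, FY, FX = W.shape
    return W.reshape(K, C * FY * FX).T          # shape: (rows, columns)

def aimc_mvm(A, x):
    """One analog MVM: x is broadcast along the rows, columns accumulate."""
    return A.T @ x

# one output pixel of a convolution is exactly one MVM on the mapped array
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4, 3, 3))           # K=8, C=4, 3x3 kernels
I = rng.standard_normal((4, 5, 5))
patch = I[:, 0:3, 0:3].reshape(-1)              # (c, fy, fx) flattened window
out = aimc_mvm(map_weights_cfxfy_k(W), patch)   # 8 output channels at once
```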
Layers whose C, FX, FY, and K dimensions fit well with the dimensions of the array can achieve a high spatial utilization and thus better exploit the computational density and energy efficiency of the macro, maximizing the potential performance improvements of AIMC. For those layers that underutilize the array with the C, FX, FY|K dataflow, mapping efficiency can be improved with a concept called output pixel unrolling [see Fig. 4(a)]; this consists of duplicating the weights in the same AIMC array and computing multiple (OXu) output pixels in parallel on different columns (C, FX, FY|K, OX unrolling). Excluding the negligible overheads related to additional weight writing, OXu× utilization, performance, and efficiency benefits can be obtained, as can be seen in Fig. 4(c). While other works have proposed weight duplication, they have not done so in the same AIMC array, but across multiple arrays [26], [27], [28]. Duplicating in place guarantees a higher spatial and temporal re-use of the activations, with consequent energy benefits, besides preventing the communication overhead of writing the same weights to separate cores. Yet, even with these dataflow enhancements, AIMC mapping efficiency strongly depends on the layer dimensions and the AIMC hardware parameters. This leads to the study carried out in Section II-B.

B. Design Space Exploration

As introduced in Sections I and II, some workloads are better suited for AIMC, while others cause underutilization and do not fully exploit the massive parallelism that the technology brings.
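The benefit of the output pixel unrolling just described can be captured with simple cycle arithmetic (a sketch that ignores SIMD and memory overheads):

```python
import math

def mvm_cycles(OX, OY, OXu=1):
    """Analog MVM invocations for one output plane when OXu output
    pixels are produced per invocation via in-array weight duplication."""
    return math.ceil(OX / OXu) * OY

def extra_weight_writes(used_rows, K, OXu):
    """One-off duplication cost: (OXu - 1) extra copies of the kernel cells."""
    return used_rows * K * (OXu - 1)
```

For a 56 × 56 output plane, OXu = 4 cuts the MVM count from 3136 to 784, while the weights are simply written four times instead of once when the layer is loaded.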
When considered at the system level, the factors that contribute to this match are essentially: 1) the layer topology; 2) the dimensions of the AIMC array itself (for spatial utilization); and 3) the activation buffer that sources the AIMC array with data (for temporal utilization). We have thus carried out an exploration with the intent of finding: 1) which array sizes and which activation buffers bring the best AIMC performance and 2) under such an optimal hardware constellation, which layer and network topologies maximize energy efficiency and computational density. For modeling the different design points, we have used ZigZag [24], extended with the output pixel unrolling in the dataflow space. The considered model is depicted in Fig. 3(b), taking into account the energy and latency contributions of the AIMC core, of the L1 scratchpad activation buffer data movements, of loading/storing the weights in the AIMC core, and of L2 scratchpad accesses for weights and activations. The values of each contribution are extracted from simulations and are summarized in the table of Fig. 3(c), together with the swept hardware parameters (number of L1 banks and number of AIMC rows/columns). Under these assumptions, a diverse set of models from the literature has been assessed on achievable TOP/s/W and TOP/s/mm2 with AIMC. Fig. 3(a) outlines the exploration results, highlighting the Pareto-optimal hardware constellations for each studied network (row × column, buffer size). It is again clear from the results that AIMC works best with networks that are characterized by a large number of non-pointwise kernels throughout the network: ResNet [2] or YOLO-like (DarkNet19 [4]) structures optimize energy efficiency and computational density. Networks dominated by
layers with 1 × 1 kernels, such as SqueezeNet [3], perform ∼2× worse: they do not offer the same energy efficiency, since they cannot exploit massive MVM operations in a large AIMC core, and they find their optimum hardware constellation at smaller analog arrays. As they still need rather large activation buffers, similar to ResNet18 and DarkNet19, this leads to degraded TOP/s/mm2. Finally, the ResNet20 network was assessed on CIFAR-10 [2] input images. This network also performs poorly on AIMC, due to the small activation sizes, which lead to too little weight reuse. The final design choice converged on an AIMC array of size 1152 × 512 with a 256-kB L1 activation buffer, which sits on the front of the optimal design points for the ImageNet networks; larger array and L1 memory sizes were not considered due to area limitations.

Fig. 3. Design space exploration of the AIMC core for different workloads is carried out (a) based on the AIMC core hardware template of (b), exploring different design points by sweeping over the listed AIMC array and activation buffer sizes. Pareto frontier points in (c) report the optimal configurations for each network. The stars indicate the configuration selected for the DIANA design.

Fig. 4. By adopting the output pixel unrolling mapping of the array, weights are duplicated OXu times in the array, while a corresponding number of pixels are computed in parallel (b) instead of a single one (a). As an example, (c) describes the significant latency savings that can be achieved for the layers in the first ResBlock of ResNet18 with an 1152 × 512 array, linearly proportional to OXu, with a negligible increase in extra weight loading.
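The flavor of this exploration can be reproduced with a toy cost model; the energy numbers below are illustrative placeholders, not the extracted values of Fig. 3(c):

```python
def energy_per_mac(C, FX, FY, K, rows, cols,
                   e_row=0.1, e_adc=0.5, e_fetch=10.0):
    """Toy per-MVM cost in pJ (assumed numbers): row/DAC drive for the
    used rows (DACs are gateable), one ADC conversion per used column,
    and one L1 fetch, amortized over the useful MACs of that MVM."""
    used_rows = min(C * FX * FY, rows)
    used_cols = min(K, cols)
    cost = e_row * used_rows + e_adc * used_cols + e_fetch
    return cost / (used_rows * used_cols)

candidates = [(256, 256), (512, 512), (1152, 512)]
best = min(candidates, key=lambda rc: energy_per_mac(128, 3, 3, 128, *rc))
```

Under this toy model, a 128-channel 3 × 3 layer amortizes the fixed per-MVM costs best on the largest array, echoing the trend of Fig. 3(a), whereas small pointwise layers gain nothing from the extra rows.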
C. Reconfigurable Heterogeneous Architecture

Section II-B showed that while some layers can achieve impressive system efficiency, others cannot benefit from the massive parallelization. For those layers, a digital array with more dataflow flexibility can bring better spatial and temporal utilization. In addition, from the accuracy point of view, not all layers tolerate execution on a low-precision AIMC array, which suffers from the inherent noise and non-linearities of the analog domain [29], [30], [31]. Running noise-sensitive layers on the digital core can overcome this accuracy degradation. The desire to maximize both performance and accuracy prompted the development of DIANA's heterogeneous system, relying on two separate cores: 1) a high-performance AIMC core for massive MVM operations without strict accuracy requirements and 2) a reconfigurable digital core for accelerating those workloads that require higher precision or better mapping efficiency. The digital core must work efficiently on the various workloads, including those that do not suit AIMC: pointwise layers, depthwise layers, and FC layers. This can be achieved by enabling the digital core to support multiple dataflows. The DIANA digital core supports both C|K (weight stationary) and OX|K (output stationary) spatial unrolling [see Fig. 2(c)]. In DIANA, this is realized with a scaled-up version of the SOMA digital accelerator from [32]. Finally, to allow for flexible and concurrent assignment of the workloads to the two cores, an RISC-V processor is integrated in the architecture to trigger (TRG) and synchronize the operations on the two cores and perform the necessary pre- and post-processing on the data. The possibility to parallelize different tasks across the different cores present in DIANA prompted the evaluation of multi-core scheduling strategies, discussed next.
D. Optimization Strategies for Multi-Core

A widely heterogeneous architecture such as DIANA offers many optimization strategies for scheduling the workloads to maximize the temporal and spatial locality of the operands and to minimize the activation memory footprint. Both accelerator cores can operate in parallel, e.g., each serving a different NN layer at the same time, to achieve latency savings. Parallelization can be coupled with layer fusion [33]: instead of waiting until one layer has completely finished execution to start the next layer (on the same core, or on the other core), it is possible to start execution of a layer as soon as a part of the previous layer has been processed. Such layer fusion exploits immediate reuse of the intermediate data, leading to smaller activation buffer requirements. The dataflow concepts discussed in this section led to the DIANA SoC architecture, discussed in more detail in Sections III–V.

III. SYSTEM ARCHITECTURE

This section gives a high-level overview of DIANA's hardware architecture. The heterogeneous multi-core system, shown in Fig. 5, is composed of an RISC-V CPU and two DNN accelerators: a fully digital 16 × 16 PE array, which we refer to as the digital core, and a second co-processor based on an AIMC macro, denoted as the analog core. A hierarchical distributed memory system and network-on-chip (NoC) complete the system.

Fig. 5. Architectural block diagram of the heterogeneous system.

In the remainder of this section, we will first describe the RISC-V CPU system and then address the memory system and NoC, explaining the high-level interaction between the various components. The analog and digital cores will be detailed in Sections IV and V.

A. RISC-V CPU and Network Control

The central RISC-V unit is based on the PULPissimo template [34].
The system uses two communication networks to connect the various SoC components: a 16-B tightly coupled data memory (TCDM) bus [35] for data communication and a 4-B advanced peripheral bus (APB) dedicated to control and instructions. To transfer data, the RISC-V can itself directly read and write from L2, while it has to explicitly program the various direct memory accesses (DMAs) to let the other components access the global memory through the TCDM bus. Specifically, the RISC-V initiates the migrations between L2 and L1 for activation data, and between L2 and the dedicated memories in the cores for the weights. Note that the local storage for weights in the analog core is the AIMC macro itself. To transfer control information, the RISC-V writes to dedicated register files inside the accelerators over the APB bus. In this way, it programs instructions and configuration registers, starts jobs on the cores, and checks when the processing has ended. The timing diagram in Fig. 6 showcases the resulting task parallelism possibilities of the architecture: 1) the RISC-V core first initializes the read pointers in L2; 2) then loads the activations and weights through the DMA from L2 to L1; 3) the system can then run single workloads on single cores; 4) or independently run multiple workloads on separate cores in parallel; 5) DMA operations can occur in parallel with core computations; and 6) eventually, data are sent off-chip via the I/Os.

Fig. 6. Timing diagram of the system under different workloads; each workload reports an indicative duration of each task, in clock cycles at 270 MHz.

B. Memory System

As data management is as critical as the computation in NN processing, the memory system design is fundamental
to avoiding performance degradation due to inefficient data traffic. DIANA includes three hierarchical levels of memory: a global L2 static random-access memory (SRAM), the accelerator-dedicated L1 SRAM scratchpad, and local distributed register-based memories (L0). The L2 scratchpad memory has a 512-kB capacity with 16-byte read/write (R/W) bandwidth; it can be accessed through the TCDM bus, which is connected, in priority order, to the CPU, the I/O DMA, and one additional DMA for each accelerator (see Fig. 5). This global L2 memory stores the RISC-V binary code and system configuration data, like accelerator instructions and operating settings, as well as the weights and the intermediate activation values of the NN under execution. In addition, L1 is an SRAM scratchpad memory, but it is dedicated to storing (tiles of) feature maps, serving as a communication data buffer between the accelerators. In line with the ResNet18/DarkNet19 study of Fig. 3, it has a total capacity of 256 kB, divided into 16-byte-wide banks. As shown in Fig. 7, the module has separate R/W interfaces optimized for each of the two NN cores: on the analog accelerator side, two read ports and one write port can access up to four contiguous banks (maximum 64-byte bandwidth); on the digital side, there is a single read and write port with a fixed 16-byte bandwidth. The DMAs can override the L1 ports on both sides for L2–L1 inter-memory transfers. This memory design offers extensive mapping and scheduling flexibility and enough bandwidth to avoid latency penalties without compromising efficiency during execution. On the other hand, L1 does not include any handler for address conflicts; the sequence of instructions that each core executes, as well as the L1 addressing and synchronization tasks that the RISC-V runs, is pre-compiled with an in-house compiler and generated before runtime.
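The contiguous-bank access of the analog-side ports can be illustrated as follows (a sketch; the exact L1 address interleaving is not detailed here):

```python
BANK_BYTES = 16  # each L1 bank is 16 bytes wide

def banks_touched(addr, nbytes):
    """Contiguous L1 banks covered by a transfer of nbytes at byte addr,
    assuming consecutive addresses fill consecutive banks."""
    first = addr // BANK_BYTES
    last = (addr + nbytes - 1) // BANK_BYTES
    return list(range(first, last + 1))
```

A bank-aligned 64-byte analog-side fetch spans four contiguous banks, matching the two-read/one-write port design, while the digital side's fixed 16-byte port maps to a single bank per access.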
This can be done given the sequential and deterministic nature of the operations, which rules out control and data hazards. Local data reuse is improved further through the inclusion of various L0 registers within the analog and digital cores.

Fig. 7. L1 memory design with detail of the analog core interfaces.

Fig. 8. (a) AIMC core block diagram. (b) Description of a single bitcell in the analog domain (from [36]). (c) Timing diagram of the processing stages and their synchronization signals.

For clarity, we will discuss the register-level details of these subsystems in the dedicated accelerator sections.

IV. AIMC COMPUTING CORE

A. AIMC Core Microarchitecture

The AIMC core is optimized for energy-efficient execution of massive MVM operations. To maximally exploit the array's efficiency, the contribution of the glue logic around it must be minimized. The main challenge in designing these modules lies in: 1) providing the massively parallel input vectors required by the array in each processing cycle and 2) maximizing the reuse of the fetched operands for energy efficiency reasons. In the design of the analog core, depicted in Fig. 8, we have aimed at achieving this across all the memory levels involved: in a first stage, the vector of input feature maps is fetched from L1 and fed to the activation buffer via the local memory control unit (MCU), after which the AIMC core is triggered for computation. When the computation is finished, the output buffer collects the MVM result and sequentially sends batches of 64 results to the single-instruction multiple-data (SIMD) unit for post-processing, to then finally be written back to L1.

Fig. 9. Microarchitecture of the MCU (left) and example of operation for data reuse of the activation buffer (right).
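The fetch stage relies on kernel-window address generation with zero-insertion for padding, detailed in Section IV-B; a behavioral sketch, assuming a row-major single-channel feature map (a simplification of the real MCU, which also handles channel vectors and pooling-aware sliding):

```python
def window_addresses(IX, IY, FX, FY, stride=1, pad=0):
    """Linear addresses (None = padded zero) fed to the activation
    buffer for each kernel-window position, in row-major order."""
    windows = []
    for oy in range(0, IY + 2 * pad - FY + 1, stride):
        for ox in range(0, IX + 2 * pad - FX + 1, stride):
            win = []
            for ky in range(FY):
                for kx in range(FX):
                    y, x = oy + ky - pad, ox + kx - pad
                    # inside the map -> real address; on the border -> zero
                    win.append(y * IX + x if 0 <= y < IY and 0 <= x < IX else None)
            windows.append(win)
    return windows
```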
The AIMC macro executes an MVM in 40 ns, while the rest of the system can run at up to 270 MHz. This provides a window of ∼10 clock cycles for executing the fetching, post-processing, and storing tasks in a pipelined fashion; the different stages are overlapped in time so as to minimize latency overheads, as depicted in Fig. 8.

B. Memory Control Unit

The MCU, depicted in Fig. 9, handles the communication between the L1 activation memory and the input buffer. The unit generates the memory addresses according to the patterns specific to NN kernels: it supports different window sizes, stride values, padding, and a dedicated sliding pattern for convolution-before-pooling layers. For padding, specific pointers generate the relative position of the kernel window in the feature map, such that when computing the border pixels, zeros are directly inserted into the activation buffer. To reuse overlapping input data between the processing of consecutive pixels under a convolution workload, the activation buffer has the flexibility to shift its internal data. To support 3 × 3 kernels, six pixels with up to 128 input channels can be reused, saving both memory accesses and latency in every computing cycle, as described in Fig. 9 (right).

C. AIMC Macro

Based on the exploration results of Fig. 3(a), an 1152 × 512 analog array of computing cells is instantiated. This AIMC macro is a scaled-up version of the design from [36], to guarantee better support for 3 × 3 kernels. Each cell in the computing array comprises two SRAM cells that store ternary weight values (−1, 0, +1), connected to two summation lines representing the positive and negative products' accumulation values [see Fig. 8(b)]. The activations are encoded with an active-low pulsewidth, which generates a current that discharges one of the summation lines, depending on the sign of the weight-activation product. The resulting voltages of the two lines are then subtracted in the analog domain.
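This differential ternary computation can be captured in a small behavioral model; it is noise-free, and the ADC ranging below is an illustrative assumption, since on the real macro the scaling is tunable via the DAC unit time and the cell bias voltage:

```python
import numpy as np

def aimc_mvm_behavioral(x, W, in_bits=7, adc_bits=6):
    """Noise-free model of the differential ternary macro: +1 and -1
    cells discharge separate summation lines, which are subtracted in
    the analog domain; each column ADC quantizes the difference."""
    assert set(np.unique(W)) <= {-1, 0, 1}          # ternary weights
    pos = np.maximum(W, 0).T @ x                    # +1 cells, positive line
    neg = np.maximum(-W, 0).T @ x                   # -1 cells, negative line
    analog = pos - neg                              # analog-domain subtraction
    full_scale = W.shape[0] * (2 ** in_bits - 1)    # assumed ADC range
    lsb = full_scale / (2 ** (adc_bits - 1))
    q = np.round(analog / lsb)                      # quantize to adc_bits
    return np.clip(q, -2 ** (adc_bits - 1), 2 ** (adc_bits - 1) - 1).astype(int)
```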
The 512 successive-approximation register analog-to-digital converters (SAR-ADCs) convert the accumulation results back to a 6-bit digital representation. The vector of input activations is converted from the digital 7-bit representation into a pulsewidth modulation (PWM)-based analog representation via a vector of digital-to-analog converters (DACs). Through the DAC unit time duration, the minimum amount of charge subtracted from the summation lines can be tuned. This, together with the cell source bias voltage, which determines the magnitude of the cell discharging current, gives flexibility to adjust the sensitivity of the output result to the activation values, at the expense of additional energy consumption. Both these parameters can be programmed externally by the user, as will be explored in Section VI. The computation is triggered by the edge on the TRG signal (see Fig. 8); for each MVM operation, all 1152 wordlines are activated; there is, however, the possibility of disabling the DAC units in groups of 64 to save energy.

Fig. 10. Microarchitecture of the output buffer and the SIMD unit.

D. Output Buffer and SIMD Unit

The AIMC's 512-wide 6-bit analog-to-digital converter (ADC) output vector is stored in an output buffer upon completion of the macro computation, determined by a rising edge of the DONE signal of the macro, as in Fig. 8. The output vector is then transferred toward a 64-way SIMD unit in batches of 64 adjacent elements per cycle. The starting batch can be selected by setting an offset in the instructions. The SIMD unit, as described in Fig.
10, handles the elementwise operations on the AIMC outputs with a six-stage programmable pipeline consisting of: 1) partial sum accumulation in the digital domain, in case the input channels and the kernel dimensions cannot be fully unrolled along the rows; 2) batch-norm operation, equal to an (αX + β) operation; 3) residual branch addition; 4) activation function (rectified linear unit (ReLU) and LeakyReLU supported); 5) re-quantization via pre-trained scaling and clipping parameters; and 6) pooling, for immediate computation of max/average pooling of the pooling window with output channel parallelism. When needed, partial accumulation data and/or residuals are fetched from L1 via a dedicated read port. The pre-trained parameters required for these operations are initially loaded from L2 and stored in local registers to avoid recurrent fetching from memory. The stages of the SIMD pipeline can be programmed to either be skipped or executed through the AIMC instruction set.
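The six skippable stages can be summarized in a behavioral sketch (data layouts and parameter handling are simplified; stage numbering follows the list above):

```python
import numpy as np

def simd_postprocess(v, psum=None, alpha=1.0, beta=0.0, residual=None,
                     act="relu", scale=1.0, out_bits=8, pool=None):
    """Sketch of the six-stage SIMD pipeline; every stage is optional,
    mirroring the programmable skip/execute behavior."""
    x = v.astype(np.float64)
    if psum is not None:                      # 1) digital partial-sum accum.
        x = x + psum
    x = alpha * x + beta                      # 2) batch-norm (alpha*X + beta)
    if residual is not None:                  # 3) residual branch addition
        x = x + residual
    if act == "relu":                         # 4) activation function
        x = np.maximum(x, 0)
    lo, hi = -(2 ** (out_bits - 1)), 2 ** (out_bits - 1) - 1
    x = np.clip(np.round(x * scale), lo, hi)  # 5) re-quantization
    if pool is not None:                      # 6) max pooling over a window
        x = x.reshape(pool, -1).max(axis=0)
    return x.astype(int)
```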
The ones supported are the: 1) output stationary OX|K configuration for CNN workloads, which turns into OX,C|K when the MAC units are used at sub-8-bit precision and 2) C|K spatial mapping to support the canonical MMM, adopted for FC layers or dense tensor operations (see Fig. 2). Furthermore, elementwise operations (residual branch addition, ReLU activation function, shifting) are supported by the PE units. A flexible max pooling unit, integrated in the accelerator fabric, handles the pooling operations when required. To support streamlined transfer of operands from one accelerator core to the other across the shared L1 memory, the data must be re-organized for correct activation fetching: the digital core requires OX parallelism across the input vectors fetched in a read cycle from the memory, while the AIMC core requires C parallelism when fetching from L1. To achieve this, the digital core is able to either offload the final outputs in two different fashions, as visualized in Fig. 11: 1) along the columns to achieve OX parallelism if the next layer is done in the digital core or 2) along the rows for C parallelism if the next layer is done on AIMC. A reshuffling buffer finally reorders data whose previous layer has been executed on the analog core (stored K-parallel), when the next layers are executed in the OX|K mode of the digital core (requiring OX parallel input data). V. D IGITAL DNN ACCELERATOR The digital DNN accelerator core, depicted in Fig. 11, is a scaled-up version of the SOMA core of [32] and consists of a 16 × 16 reconfigurable PE array, allowing for acceleration of a diverse set of workloads. Each PE can run precision scalable MAC operations, with configurable resolutions of 2-, 4-, and 8-bit precision. Based on the selected precision, the parallelism Fig. 12. 209 Die photograph. increases to a 32 × 16 (at 4 bit) and 64 × 16 (at 2 bit) sized array. 
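The precision-dependent parallelism just described can be captured in a small helper. This is an illustrative sketch (function and parameter names are our own), assuming each 8-bit PE row splits into two 4-bit or four 2-bit lanes, as stated above.

```python
def pe_array_shape(precision_bits, base=(16, 16)):
    """Effective PE-array parallelism of the digital core for a given
    operand precision: the 16 x 16 array of 8-bit PEs acts as a 32 x 16
    array at 4 bit and a 64 x 16 array at 2 bit (sub-word parallelism
    along the rows)."""
    if precision_bits not in (2, 4, 8):
        raise ValueError("supported precisions are 2, 4, and 8 bit")
    rows, cols = base
    return (rows * (8 // precision_bits), cols)
```

For example, `pe_array_shape(2)` returns the 64 × 16 configuration quoted in the text.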
Besides supporting precision reconfiguration, the accelerator can seamlessly switch between dataflows across successive workloads. The supported dataflows are: 1) the output-stationary OX|K configuration for CNN workloads, which turns into OX,C|K when the MAC units are used at sub-8-bit precision, and 2) the C|K spatial mapping to support the canonical MMM, adopted for FC layers or dense tensor operations (see Fig. 2). Furthermore, elementwise operations (residual branch addition, the ReLU activation function, and shifting) are supported by the PE units. A flexible max-pooling unit, integrated in the accelerator fabric, handles the pooling operations when required.

To support streamlined transfer of operands from one accelerator core to the other across the shared L1 memory, the data must be re-organized for correct activation fetching: the digital core requires OX parallelism across the input vectors fetched in a read cycle from the memory, while the AIMC core requires C parallelism when fetching from L1. To achieve this, the digital core can offload the final outputs in two different fashions, as visualized in Fig. 11: 1) along the columns, to achieve OX parallelism if the next layer is executed on the digital core, or 2) along the rows, for C parallelism if the next layer is executed on the AIMC core. A reshuffling buffer finally reorders data whose previous layer has been executed on the analog core (stored K-parallel) when the next layer is executed in the OX|K mode of the digital core (requiring OX-parallel input data).

VI. MEASUREMENTS

DIANA is implemented in 22FDX (a registered trademark of GlobalFoundries), covering 10.24 mm², or 8.91 mm² excluding the outer pad ring. Fig. 12 shows the die photograph. The full-custom analog macro is integrated with a digital back-end flow and accounts for 2.29 mm², including the SRAM cells, DACs, ADCs, control, and timing circuits. The other conventional memories were implemented using standard memory macros.
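The activation-layout re-organization between the two cores, described in Section V, boils down to a transpose between OX-parallel and C-parallel storage. A minimal NumPy sketch (array contents and sizes are illustrative):

```python
import numpy as np

# A tile of activations stored with one row per output pixel (OX-parallel),
# the layout the digital core expects when fetching from L1.
tile_ox_parallel = np.arange(12).reshape(4, 3)  # 4 output pixels x 3 channels

# The AIMC core instead needs C parallelism across each fetched vector,
# so the digital core offloads along the other dimension, i.e., transposed.
tile_c_parallel = tile_ox_parallel.T            # 3 channels x 4 output pixels
```

In hardware this transpose is realized by the column-wise versus row-wise output offloading and the reshuffling buffer, rather than by an explicit copy.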
The chip has four different supply voltages, for the IO pads, the logic cells, the AIMC macro, and the memories. The system includes an internal frequency-locked loop intellectual property (FLL IP) macro for on-chip clock generation. For the measurements, a custom printed circuit board (PCB) has been made for the chip, with headers and dedicated pins for the supplies, connected to a field-programmable gate array (FPGA) board. The FPGA drives a set of control pins (e.g., reset) and emulates the off-chip memory interface. The remainder of this section goes into more depth on the characterization results of the chip. Going from fine-grain to system-level measurements, and from peak performance to application-specific workloads, we subsequently cover the following. 1) The characterization of the AIMC macro and its inherent tradeoff between accuracy and efficiency. 2) The analog and digital core peak performance. 3) The performance under actual NN workloads, exploiting heterogeneous multi-core scheduling. All the reported measurements were taken at room temperature, with our fabricated samples working correctly up to 270 MHz at a nominal 0.8-V supply level.

Fig. 13. (a) Error and power measurements as a function of the AIMC macro operating point for a uniform distribution of accumulation values between [0; 4096]; each grid element corresponds to one operating point and its label to the mean percent error on the ADC output. (b) Most accurate operating point of the macro for different accumulation value ranges.

A.
Efficiency Versus Accuracy Trade-off in the Analog Macro

As discussed in Section IV-C, the AIMC core efficiency and accuracy are highly dependent on the DAC unit time and the source bias voltage of the charge-subtracting transistors in the cells. A larger unit time and/or cell source bias voltage increases the gain of the analog MAC operation, bringing improved accuracy at the expense of a smaller MVM dynamic range and increased power consumption. Both these parameters can be set at compile time for every instruction of the analog core, as a function of the expected dynamic range from the NN model training: more inputs and/or a wider range require a smaller gain, while fewer activations are more accurately accumulated with a higher gain. To evaluate this impact, the array was sourced with input-weight combinations spanning different MVM accumulation ranges. This was repeated for each tuning-knob combination, and the mean relative error between the measured and exact values was evaluated. Fig. 13(a) shows how the AIMC macro operating point affects the output accuracy of a single MVM operation. A uniform distribution of MVM output values between [0; 4096] at the 512-wide ADC bank output is computed for different biasing voltages and unit time values, and the percent error on the output vector is reported in each grid element. Fig. 13(b) summarizes the most accurate operating point for a set of output accumulation ranges, between [0; 1024] and [0; 10 240]. As depicted in the figure, the best operating point varies based on the distribution of the final accumulation. All the measurements in the remainder of this article are performed under the most efficient settings that provide the desired gain (dynamic range), with an error within 2% of the most accurate one.

B. Peak Performance and Efficiency Characterization

Fig. 14 shows the peak efficiency and performance of the two cores at their maximum functional operating frequency across different supply voltages.
As can be seen from these graphs, at nominal conditions (0.8 V at 250 MHz), the analog accelerator, with I7/W1.5/O6, outperforms the I8/W8/O8 digital core by ∼95×, both in terms of TOP/s/W efficiency and TOP/s performance. This shows the strength of the analog accelerator when used at maximum spatial and temporal utilization. Unfortunately, NN layers have different shapes, preventing full utilization of the array, and require bandwidth-limited weight loading, thus resulting in temporal stalls. Neither effect is taken into account in the peak-performance figure-of-merit.

Fig. 14. Peak performance and efficiencies at the system level of (a) the complete analog core and (b) the complete digital core, both including peripheries.

C. Workload Performance Characterization

This section details the energy and latency breakdowns when executing different sets of DNN layers, highlighting potential advantages and drawbacks of different workloads when run on hardware. While the digital core's small size and flexibility allow it to operate at maximum utilization across workloads, this is not true for the analog core. To investigate this, we analyzed layers with different shapes from the ResNet20 network for CIFAR-10 and the ResNet18 [2] network for ImageNet, excluding the FC layers for their negligible contribution. The first layers were mapped onto the digital core, while the deeper layers were executed by the analog core. The first layer has a relevant impact on accuracy when executed at low precision and, due to its low input channel count, would heavily underutilize the analog macro. From an energy perspective, it would still make sense to use the analog core, but this would require a dedicated input memory manager to handle its dataflow, with only minor utilization improvements. The following layers benefit marginally from a more accurate execution, while being almost two orders of magnitude less energy-efficient on our hardware if run on the digital core. Figs. 15 and 16 show the energy and latency breakdowns of the system; we also report the mapping adopted for the different layers throughout ResNet18 in Fig. 18.

Fig. 15. ResNet20 (a) performance measurement and (b) energy efficiency. The network achieves low spatial utilization of the AIMC macro, thus not reaching peak efficiencies.

Fig. 16. ResNet18 (a) performance measurement and (b) efficiency. TOP/s and TOP/s/W are highly dependent on the temporal and spatial utilization of the AIMC macro.

Fig. 17. By immediate reuse of intermediate data, memory requirements for the activations can be reduced by 7.2×, allowing to keep only a subsection of the feature map in memory, as depicted in (a); furthermore, by scheduling different layers on different cores, computations can be overlapped in time (b), with subsequent latency savings (c).

Fig. 18. ResNet18 mapping of the ResBlock layers on the AIMC core, with their relative spatial utilization and required number of weight-write operations. OX unrolling of 4 is used for ResBlock64.

TABLE I. Network End-to-End Accuracy and Performance Summary

As expected, the utilization of the macro and the system efficiency are positively correlated across the workloads. The peak performance at the core level is reached at full utilization: this occurs in the ResBlock512 layers (see Fig. 18), where 16.5 TOP/s, close to the maximum achievable 18.1 TOP/s, is reached at the core level. These layers are also the most demanding ones in terms of bandwidth requirements: given the high throughput that can be achieved, new weights have to be continuously loaded on-chip to keep up with the computation.
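The spatial-utilization argument above can be made concrete with a small model. This is a simplified sketch (the helper is our own, and it ignores OX unrolling and the details of temporal folding), assuming the 1152-wordline × 512-column array dimensions given earlier, with a conv layer unrolling C·FX·FY along the rows and K along the columns:

```python
ROWS, COLS = 1152, 512  # AIMC wordlines x ADC columns

def spatial_utilization(C, FX, FY, K):
    """Fraction of the AIMC array active during one MVM of a conv layer
    with C input channels, an FX x FY kernel, and K output channels.
    Dimensions exceeding the array are assumed folded over multiple MVMs,
    so a single MVM can still use the full row/column count."""
    rows_used = min(C * FX * FY, ROWS)
    cols_used = min(K, COLS)
    return (rows_used * cols_used) / (ROWS * COLS)
```

For instance, a 3 × 3 layer with 512 input and output channels saturates both dimensions (utilization 1.0), while a 64-channel 3 × 3 layer occupies only 576 of 1152 rows and 64 of 512 columns, which is why the shallow ResBlocks fall far short of peak efficiency.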
When taking into account the energy and latency for writing the weights into the array, the characteristics degrade significantly, as shown in Fig. 16. For the deeper layers, where weight traffic dominates due to the small activation sizes and the many channels, a >10× degradation is seen in the TOP/s/W and TOP/s figures. In summary, the array's temporal utilization increases with activation size, due to increased weight reuse. In addition, layers with many input–output channels can get the most spatial utilization out of the computing hardware; yet, they also shift the bottleneck to the weight movement.

TABLE II. SotA DNN Accelerators' Comparison

To realize additional latency savings, pipelining and layer fusion can be applied on the heterogeneous architecture, as discussed in Section II-D. We apply this to ResNet18, fusing a stack of six layers consisting of the first convolutional and pooling layers (run on the digital core) and ResBlock64 (run on the AIMC core). By computing an activation tile of 56 × 4 pixels at a time, we: 1) avoid multiple activation data transfers to L2 by immediately reusing the intermediate data from the L1 memory and 2) overlap the operations of the digital core and the AIMC core, as illustrated in Fig. 17, leading to 25% latency savings on the layer stack. Table I compares the accuracies and performance of the two workloads when run at different operand precisions. The drop in accuracy when switching from a floating-point model to an 8-bit quantized one running on the digital core is negligible, but this configuration is still not optimized for efficiency and performance.
When switching to the mixed-mode computation, the drop in Top-1 accuracy is around 2% for ResNet20 and 5% for ResNet18; considering the 30× and 17× improvements in energy efficiency and latency, respectively, the loss in accuracy may be worth the tradeoff.

D. SotA Comparison

Table II compares DIANA to recently taped-out DNN accelerators; the selected designs include digital DNN accelerators, AIMC designs, and a heterogeneous design. The DIANA peak performances reported correspond to the ones described in Section VI-B. The integrated AIMC core achieves the highest TOP/s/W among the AIMC implementations. Previous work [37] includes programmable heterogeneous architectures, but without an integrated reconfigurable digital accelerator core. A homogeneous AIMC multi-core approach is implemented in [26], with pipelined computations across cores. Yue et al. [39] propose an in-memory computing (IMC) design for handling activation and block-level sparsity, but with lower overall efficiency. Overall, DIANA is unique in its ability to flexibly combine analog and digital acceleration, offering a favorable tradeoff between efficiency, accuracy, and broad workload flexibility.

VII. CONCLUSION

We implemented DIANA, a heterogeneous AIMC-digital DNN accelerator, to research the tradeoffs between the extreme energy efficiency and performance of AIMC cores and the reconfigurability and higher precision of digital accelerators, in order to optimally run end-to-end DNN models. To enlarge the mapping space of the non-reconfigurable AIMC array, we proposed output pixel unrolling, a novel spatial mapping method that enables OXu× latency savings and proportional energy savings with negligible overhead; furthermore, we explored layer fusion and pipelining strategies to further optimize the mappings on the heterogeneous multi-core design.
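The latency benefit of the layer-fusion and pipelining strategy can be illustrated with a toy two-stage pipeline model (the numbers below are illustrative, not chip measurements): overlapping tile i+1's digital stage with tile i's AIMC stage replaces the sum of per-tile latencies with a pipelined schedule.

```python
def sequential_latency(tile_latencies):
    """Total latency when each tile runs its digital stage and its AIMC
    stage back to back; tile_latencies is a list of (t_digital, t_aimc)."""
    return sum(td + ta for td, ta in tile_latencies)

def pipelined_latency(tile_latencies):
    """Total latency when the next tile's digital stage overlaps the
    current tile's AIMC stage (two-stage pipeline, dependency stalls only)."""
    f_dig = f_aimc = 0.0
    for td, ta in tile_latencies:
        f_dig = f_dig + td             # digital core starts when it is free
        f_aimc = max(f_aimc, f_dig) + ta  # AIMC waits for its input tile
    return f_aimc
```

With four identical tiles of (2, 3) time units, the sequential schedule takes 20 units while the pipelined one takes 14, qualitatively matching the latency savings reported for the fused ResNet18 layer stack.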
The main bottlenecks that affect the current design are the loading of the weights into the AIMC core and the limited buffer size of the L1 memory; these two factors degrade the temporal utilization of the accelerator and can be further optimized in future designs. Finally, this work opens an exciting area of optimization possibilities: the DNN models can be trained by a hardware-aware neural architecture search (NAS), knowledgeable of each core's strengths and features. We could thus optimize for the desired metric and execute each layer on the best-suited core, achieving higher accuracies, energy efficiencies, and throughput based on the task requirements.

ACKNOWLEDGMENT

The authors would like to thank Eidgenössische Technische Hochschule (ETH) for their support on the PULPissimo platform and GlobalFoundries, Heverlee, Belgium, for the tapeout support on their 22FDX FD-SOI platform.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2012.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, pp. 1–12, Dec. 2015.
[3] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <1 MB model size,” CoRR, vol. abs/1602.07360, pp. 1–13, Feb. 2016.
[4] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” 2020, arXiv:2004.10934.
[5] M. Verhelst and B.
Moons, “Embedded deep neural network processing: Algorithmic and processor techniques bring deep learning to IoT and edge devices,” IEEE Solid-State Circuits Mag., vol. 9, no. 4, pp. 55–65, Fall 2017.
[6] V. Jain, L. Mei, and M. Verhelst, “Analyzing the energy-latency-area-accuracy trade-off across contemporary neural networks,” in Proc. IEEE 3rd Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Jun. 2021, pp. 1–4.
[7] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Jan. 2016, pp. 262–263.
[8] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 246–247.
[9] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, “An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 158–172, Jan. 2019.
[10] J. Yue et al., “A 65 nm computing-in-memory-based CNN processor with 2.9-to-35.8 TOPS/W system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 234–236.
[11] B. Moons, D. Bankman, and M. Verhelst, Embedded Deep Learning: Algorithms, Architectures and Circuits for Always-On Neural Network Processing. Cham, Switzerland: Springer, 2019.
[12] K. Prabhu et al., “CHIMERA: A 0.92-TOPS, 2.2-TOPS/W edge AI accelerator with 2-MByte on-chip foundry resistive RAM for efficient training and inference,” IEEE J. Solid-State Circuits, vol. 57, no. 4, pp. 1013–1026, Apr. 2022.
[13] Y. S.
Shao et al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), New York, NY, USA: Association for Computing Machinery, Oct. 2019, pp. 14–27.
[14] X. Yang et al., “A systematic approach to blocking convolutional neural networks,” CoRR, vol. abs/1606.04209, pp. 1–12, Jun. 2016.
[15] B. Murmann, “Mixed-signal computing for deep neural network inference,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 1, pp. 3–13, Jan. 2021.
[16] N. Verma et al., “In-memory computing: Advances and prospects,” IEEE Solid-State Circuits Mag., vol. 11, no. 3, pp. 43–55, Summer 2019.
[17] S. Cosemans et al., “Towards 10000 TOPS/W DNN inference with analog in-memory computing—A circuit blueprint, device options and requirements,” in IEDM Tech. Dig., Dec. 2019, pp. 22.2.1–22.2.4.
[18] A. Krizhevsky, V. Nair, and G. Hinton, “Learning multiple layers of features from tiny images,” Can. Inst. Adv. Res., Toronto, ON, Canada, 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[19] T. Lin et al., “Microsoft COCO: Common objects in context,” CoRR, vol. abs/1405.0312, pp. 1–15, May 2014.
[20] P. Houshmand et al., “Opportunities and limitations of emerging analog in-memory compute DNN architectures,” in IEDM Tech. Dig., Dec. 2020, pp. 29.1.1–29.1.4.
[21] A. Garofalo et al., “A heterogeneous in-memory computing cluster for flexible end-to-end inference of real-world deep neural networks,” CoRR, vol. abs/2201.01089, pp. 1–15, Jan. 2022.
[22] K. Ueyoshi et al., “DIANA: An end-to-end energy-efficient digital and analog hybrid neural network SoC,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 65, Feb. 2022, pp. 1–3.
[23] A. Vaswani et al., “Attention is all you need,” CoRR, vol. abs/1706.03762, pp. 1–15, Jun. 2017.
[24] L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M.
Verhelst, “ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators,” IEEE Trans. Comput., vol. 70, no. 8, pp. 1160–1174, Aug. 2021.
[25] X. Yang et al., “Interstellar: Using Halide’s scheduling language to analyze DNN accelerators,” in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., New York, NY, USA: Association for Computing Machinery, 2020, pp. 369–383.
[26] H. Jia et al., “Scalable and programmable neural network inference accelerator based on in-memory computing,” IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 198–211, Jan. 2022.
[27] A. Shafiee et al., “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 14–26.
[28] P.-Y. Chen, X. Peng, and S. Yu, “NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures,” in IEDM Tech. Dig., Dec. 2017, pp. 6.1.1–6.1.4.
[29] A. S. Rekhi et al., “Analog/mixed-signal hardware error modeling for deep learning inference,” in Proc. 56th Annu. Design Autom. Conf. (DAC), New York, NY, USA: Association for Computing Machinery, 2019, pp. 1–6.
[30] M. L. Gallo et al., “Mixed-precision in-memory computing,” Nature Electron., vol. 1, pp. 246–253, Apr. 2018.
[31] S. K. Gonugondla, C. Sakr, H. Dbouk, and N. R. Shanbhag, “Fundamental limits on the precision of in-memory architectures,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2020, pp. 1–9.
[32] J. S. P. Giraldo, V. Jain, and M. Verhelst, “Efficient execution of temporal convolutional networks for embedded keyword spotting,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 12, pp. 2220–2228, Dec. 2021.
[33] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[34] P. D. Schiavone, D. Rossi, A. Pullini, A.
Di Mauro, F. Conti, and L. Benini, “Quentin: An ultra-low-power PULPissimo SoC in 22 nm FDX,” in Proc. IEEE SOI-3D-Subthreshold Microelectron. Technol. Unified Conf. (S3S), Oct. 2018, pp. 1–3.
[35] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini, “A fully-synthesizable single-cycle interconnection network for shared-L1 processor clusters,” in Proc. Design, Autom. Test Eur., Mar. 2011, pp. 1–6.
[36] I. A. Papistas et al., “A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm² in-memory analog matrix-vector-multiplier for DNN acceleration,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2021, pp. 1–2.
[37] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, “A programmable heterogeneous microprocessor based on bit-scalable in-memory computing,” IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, Sep. 2020.
[38] H. Mo et al., “A 28 nm 12.1 TOPS/W dual-mode CNN processor using effective-weight-based convolution and error-compensation-based prediction,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64, Feb. 2021, pp. 146–148.
[39] J. Yue et al., “A 2.75-to-75.9 TOPS/W computing-in-memory NN processor supporting set-associate block-wise zero skipping and ping-pong CIM with simultaneous computation and weight updating,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, vol. 64, Feb. 2021, pp. 238–240.

Pouya Houshmand (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering from the Polytechnic of Turin, Turin, Italy, in 2017 and 2019, respectively. He is currently pursuing the Ph.D. degree in architectures for deep neural network (DNN) accelerators with ESAT-MICAS Laboratories, KU Leuven, Leuven, Belgium.
His current research interests include algorithm-hardware co-design, in-memory computing (IMC), and emerging technologies.

Giuseppe M. Sarda received the B.Sc. and M.Sc. degrees in electrical engineering from the Politecnico di Torino, Turin, Italy, in 2018 and 2020, respectively. He carried out his master’s thesis at the Technische Universität Wien (TU Wien), Vienna, Austria. In September 2020, he joined imec, Leuven, Belgium, and the MICAS Group, Katholieke Universiteit Leuven, Leuven, as a Ph.D. Researcher. His current research focuses on in-memory design for efficient embedded machine learning.

Vikram Jain (Member, IEEE) received the M.Sc. degree in embedded electronics systems design (EESD) from the Chalmers University of Technology, Gothenburg, Sweden, in 2018. He is currently pursuing the Ph.D. degree in energy-efficient digital acceleration and reduced instruction set computer - five (RISC-V) processors for machine learning applications at the edge, under the supervision of Prof. Marian Verhelst. In 2018, he joined the ESAT-MICAS Laboratories, KU Leuven, Leuven, Belgium, as a Research Assistant. He was a Visiting Researcher with the IIS Laboratory, ETH Zürich, Zürich, Switzerland, under the supervision of Prof. Luca Benini. His current research interests include ML accelerators, RISC-V architecture, heterogeneous systems, design space exploration, low-power digital design, and hardware design automation. Dr. Jain was a recipient of a prestigious research fellowship from the Swedish Institute (SI) for his master’s degree from 2016 to 2018.

Kodai Ueyoshi received the B.E., M.E., and Ph.D. degrees from Hokkaido University, Sapporo, Japan, in 2015, 2017, and 2020, respectively. From 2017 to 2020, he was a JSPS Research Fellow. From 2018 to 2020, he was also a Researcher of JST ACT-I. He was an Assistant Researcher with KU Leuven, Leuven, Belgium, from 2020 to 2022. He is currently a Research Engineer with ZEKU Technologies, Shanghai, China.
His research interests include energy-efficient hardware architectures for machine learning systems and software-hardware co-optimization. Dr. Ueyoshi received the ISSCC Silkroad Award in 2018, the IEEE SSCS Pre-Doctoral Award, and the Ninth JSPS Ikushi Prize in 2019.

Ioannis A. Papistas received the B.Sc. and M.Eng. degrees in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2014, and the Ph.D. degree from the University of Manchester, Manchester, U.K., in 2018. His Ph.D. dissertation was on heterogeneous 3-D IC integration. He spent a year as a Post-Doctoral Research Fellow with the University of Manchester. In 2019, he joined the Machine Learning Team, imec, Leuven, Belgium, as a Research and Development Engineer on analog in-memory computing (IMC) accelerators. In 2021, he co-founded axelera.ai, where he is currently a Senior Research and Development Engineer working on energy- and area-efficient machine learning accelerator designs. His current research interests include IMC, efficient machine learning accelerator designs, heterogeneous integration, and mixed-signal IC design. Dr. Papistas was a recipient of the prestigious EPSRC Doctoral Prize Award.

Man Shi received the B.Sc. degree from the School of Information Science and Engineering, Shandong University (SDU), Jinan, China, in 2017, and the M.Sc. degree from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 2020. She is currently pursuing the Ph.D. degree in accelerator architectures for deep neural networks (DNNs) with the MICAS Laboratories, KU Leuven, Leuven, Belgium. Her current research interests include low-power DNN hardware accelerator design, algorithm-hardware co-design, and reconfigurable computation.

Qilin Zheng received the B.S. degree from Peking University, Beijing, China, in 2019, and the M.S. degree from KU Leuven, Leuven, Belgium, in 2022. He is currently pursuing the Ph.D.
degree in ECE with Duke University, Durham, NC, USA. His current research interests include computer architecture, non-volatile memory, and compute-in-memory design.

Debjyoti Bhattacharjee received the B.Tech. degree in computer science and engineering from the West Bengal University of Technology (WBUT), Kolkata, West Bengal, India, in 2013, the M.Tech. degree in computer science from the Indian Statistical Institute, Kolkata, in 2015, and the Ph.D. degree in computer science and engineering from Nanyang Technological University, Singapore, in 2019. He worked as a Research Fellow with Nanyang Technological University for a year. During his doctoral studies, he worked on the design of architectures using emerging technologies for in-memory computing (IMC). He developed novel technology mapping algorithms and technology-aware synthesis techniques, and proposed novel methods for multi-valued logic realization. He is currently a Research and Development Engineer with the Compute System Architecture Unit, imec, Leuven, Belgium. His current research interests include machine learning acceleration using analog hardware, hardware design automation tools, and application-specific accelerator design, with emphasis on emerging technologies.

Arindam Mallik received the M.S. and Ph.D. degrees in electrical engineering and computer science from Northwestern University, Evanston, IL, USA, in 2004 and 2008, respectively. He is currently a technologist with 20 years of experience in semiconductor research. He leads the Future System Exploration (FuSE) Group within the Compute System Architecture (CSA) Research and Development Unit. He has authored or coauthored more than 100 articles in international journals and conference proceedings.
He holds a number of international patents. His research interests include novel computing systems, design–technology co-optimization, and the economics of semiconductor scaling.

Peter Debacker received the M.Sc. degree (Hons.) in electrical engineering from Katholieke Universiteit Leuven, Leuven, Belgium, in 2004. Before joining imec, Leuven, in 2011, he worked with Philips as a System Engineer and with Essensium as a System Architect. At imec, he is currently a Principal Member of Technical Staff, architecting solutions for high-performance and energy-efficient artificial intelligence (AI) systems, with hardware ranging from large, scaled-out high-performance systems to tiny deep neural network (DNN) accelerators and in-memory compute hardware. Before that, he was a Program Manager for imec’s Machine Learning Program, and he worked on imec’s low-power digital chip and processor architectures and their implementation in advanced technology nodes, focusing on power–performance–area (PPA) optimization of scaled CMOS technologies (for 3 nm and beyond). His current research interests include processor and computer architectures, AI and machine learning, design methodologies, and digital chip design and verification.

Diederik Verkest received the Ph.D. degree in applied sciences from the University of Leuven, Leuven, Belgium, in 1994. He started working in the VLSI Design Methodology Group, imec, Leuven, on hardware/software co-design, reconfigurable systems, and multiprocessor systems-on-chip in the domain of wireless and multimedia. From 2010 to 2020, he was responsible for imec’s Logic Insite Research Program, in which the leading design and process companies jointly work on the co-optimization of CMOS design and process technology for N+2 nodes. In recent years, he has focused on design and process technology optimization for ML accelerators. He has published and presented over 150 articles in international journals and at international conferences.
Marian Verhelst (Senior Member, IEEE) received the Ph.D. degree from KU Leuven, Leuven, Belgium, in 2008. She worked as a Research Scientist with Intel Labs, Hillsboro, OR, USA, from 2008 to 2010. She is currently a Full Professor with the MICAS Laboratories, KU Leuven, and a Research Director with imec, Leuven. Her research focuses on embedded machine learning, hardware accelerators, HW–algorithm co-design, and low-power edge processing. Dr. Verhelst was a member of the Young Academy of Belgium and the STEM Advisory Committee to the Flemish Government. She is a member of the Board of Directors of tinyML and is active in the TPCs of DATE, ISSCC, VLSI, and ESSCIRC. She received the Laureate Prize of the Royal Academy of Belgium in 2016, the 2021 Intel Outstanding Researcher Award, and the André Mischke YAE Prize for Science and Policy in 2021. She was the Chair of tinyML2021 and the TPC Co-Chair of AICAS2020. She is an IEEE SSCS Distinguished Lecturer. She was an Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS (TVLSI), the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS (TCAS-II), and the IEEE JOURNAL OF SOLID-STATE CIRCUITS (JSSC).