A review of CNN accelerators for embedded systems based on RISC-V

Alejandra Sanchez-Flores, Industrial and Construction Eng. dept., Universitat de les Illes Balears, Palma, Spain, alejandra.sanchez1@estudiant.uib.cat
Lluc Alvarez, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya, Barcelona, Spain, lluc.alvarez@bsc.es
Bartomeu Alorda-Ladaria, Inst. de Inv. Sanitaria IB - IdisBa, Universitat de les Illes Balears, Palma, Spain, tomeu.alorda@uib.es

Abstract— One of the great challenges of computing today is sustainable energy consumption. In the deployment of edge computing this challenge is particularly important, considering the use of embedded equipment with limited energy and computation resources. In those systems, energy consumption must be carefully managed so the devices can operate for long periods. Specifically, for embedded systems with machine learning capabilities in the Internet of Things (EMLIoT) era, the execution of convolutional neural network (CNN) models is energy-demanding and requires massive amounts of data. Nowadays, high-workload processing is split between a host processor in charge of generic functions and an accelerator dedicated to executing the specific task. Open-hardware-based designs are pushing for new levels of energy efficiency. To achieve energy efficiency, open-source tools, such as the RISC-V ISA, have been introduced to optimize every internal stage of the system. This document compares EMLIoT accelerator designs based on RISC-V and highlights open topics for research.

Keywords—RISC-V, accelerator, energy efficiency

I. INTRODUCTION

Since the introduction of accelerators to execute CNNs, their design priorities have been first to provide enough resources to run the CNN model and then to execute the most instructions per second (throughput). However, considering the high demand for embedded systems with artificial intelligence capabilities, designers now face the challenge of reducing power consumption for complex and long-term applications. It seems contradictory that small devices such as embedded systems, whose low power consumption typically translates into a lean HW design, are expected to run CNN models, which are characterized by a high demand for resources. Hence, the priority has evolved to maximizing the energy efficiency (EE) value, i.e., the ratio of the device's throughput per unit of power, reaching giga-operations per second per watt (GOPS/W) or tera-operations per second per watt (TOPS/W). In consequence, accelerators focus on including only circuitry and functions that operate optimally to achieve this goal.

In recent times, accelerator implementations exploiting the RISC-V features have emerged with favorable results, leading to a breakthrough in accelerator design. Being open source is not the only advantage of RISC-V; it offers further benefits for EE maximization. To expose these benefits and compile the key design elements, a selection of projects with the best EE results is examined in the following sequence: determine the challenges of EMLIoT devices for optimizing EE, relate these challenges to the RISC-V design objectives, and examine the EMLIoT RISC-V implementations to expose the elements that address the objectives.

II. SYSTEM SPECIFICATIONS

To start this analysis, the operating factors of EMLIoT devices are discussed: first the CNN model execution requirements, then the electronic restrictions related to embedded devices, and finally a brief overview of the RISC-V setup used in most projects.
A. CNN execution constraints

CNN inference execution is the purpose of EMLIoT accelerator devices. This operational condition establishes the functional features of the system, organized into memory, power, throughput, frequency, and accuracy. Any CNN model involves millions of parameters named weights. For example, ResNet [1] uses a total number of weights from 0.27M to 19.4M, depending on the model version. Weights are stored in memory while the execution is performed, so the system requires an equivalent memory volume, usually DRAM. From a workload point of view, the most important operation executed is the convolutional matrix multiplication, which in turn divides into dot-products implemented with MAC (multiplication-accumulation) operations. Depending on the number of convolution layers in the model and the number of weights in every layer, the total operations executed can reach GOP (giga-operations, 1×10^9). If the efficiency expectation is to reach TOPS/W, then the CNN model should be executed at high throughput (GOPS) but at ultra-low power consumption (mW or less).

Some compression algorithms have been proposed for the CNN model to reduce the number of weights, the number of instructions, or the execution precision, all with beneficial energy results. However, these reduction criteria focus on HPC systems or imply a degradation of the final classification, where a high accuracy percentage (>90%) should be the indicator of an effective acceleration. So, an optimal solution is to propose compression techniques that work in conjunction with hardware modifications, linked by the ISA architecture.

B. RV32 ISA features

The RISC-V 32-bit ISA integrates several advisable features for embedded devices, the most relevant being its low power cost. In addition, computations with 32-bit single-precision data obtain enough accuracy for CNN inference [2]. RV32I and RV32E are the basic instruction sets dedicated to 32-bit embedded processors; both include integer data, basic arithmetic operations, and ld/st instructions for 16- and 8-bit data handling [3]. Once the base I or E is selected, standard extensions can be attached. For EMLIoT, the most recurrent extensions are:
- M, the multiplication/division extension, omnipresent in CNN accelerators because it is required for the MAC operation.
- C, the compression extension, which offers a 16-bit version of the original 32-bit instructions.
- A, the atomic extension, which supports atomic memory exchange instructions.
- B, the bit-manipulation extension, dedicated to more complex bit manipulation (draft version).
- V, the vector extension, for handling data in vector format, enabling parallel processing (draft version).
- X, the custom extension space, reserved for new extensions defined by customers [3].

In addition, each standard extension provides some instruction codes available for custom definition. Both features (custom extensions and custom instructions) give a significant advantage in aiding optimization, as non-standard operations can be implemented in HW and bound to an ad-hoc instruction.
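As an illustration of this last point, consider how a hypothetical MAC unit could be exposed to software. The sketch below assumes a GCC/Clang RISC-V toolchain and uses the GNU assembler's .insn directive to emit an R-type instruction in the custom-0 major opcode (0x0b); the funct3/funct7 values and the acc += a * b semantics are illustrative assumptions, not taken from any of the reviewed projects.

    #include <stdint.h>

    /* Hypothetical custom MAC: an R-type instruction in the RISC-V
     * custom-0 major opcode (0x0b). The hardware is assumed to compute
     * rd = rd + rs1 * rs2; funct3/funct7 are placeholder values. */
    static inline int32_t mac_custom(int32_t acc, int32_t a, int32_t b)
    {
        __asm__ volatile (".insn r 0x0b, 0x0, 0x00, %0, %1, %2"
                          : "+r"(acc)
                          : "r"(a), "r"(b));
        return acc;
    }

    /* Plain-C reference with the same semantics, for cores without
     * the custom unit. */
    static inline int32_t mac_sw(int32_t acc, int32_t a, int32_t b)
    {
        return acc + a * b;
    }

The point is only that the ISA reserves encoding space so such a unit can be driven from software without modifying the compiler.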
C. EE challenges

With the CNN model execution specifications and some RISC-V capacities presented, the challenges identified in the EMLIoT projects are now condensed and related to the design objectives for reaching EE. Fig. 1 synthesizes the result.

Fig. 1. Design objectives defined during this research.

The DRAM consumes a lot of power, both for storing and for transferring data. The reason is the resistive-capacitive (RC) array on which this memory is based [4], which remains current due to its low cost. This challenge is the reason to optimize memory storage (obj. A) and to search for new data storage technologies (obj. B).

The execution of convolution is the process that consumes the most energy within the processor [5], and the multiplication is the most energy-consuming part of the MAC (multiplication-accumulation) operation [6], [7]. This justifies exploring more efficient processing options (obj. B) and reducing the execution of the functions (obj. D).

Energy consumption increases when data are transferred. According to the memory hierarchy, the cost is lower when data are moved between registers and the L1 level than when they are transferred from/to DRAM [8]. Mobility synchronization is required to avoid unnecessary energy expenses (obj. C), and new interactions between the CPU and the accelerator need to be explored (obj. E). Traditionally, the loosely coupled accelerator (LCA) has been the most widely used host-accelerator configuration. Using an open-source platform, the possibilities of interaction grow and should be explored (obj. E).

The working frequency of an embedded device is typically limited to between 100 and 500 MHz [9], conditioned by the RC array. When the bandwidth of the DRAM (the number of bits transmitted per unit of time) is increased, the power consumption increases [10]. Because the processor works at a higher speed than the memory, it demands data faster than the memory can deliver them; conversely, data from the processor become available faster than the DRAM can store them, creating bottlenecks [11]. Mobility optimization would mitigate this challenge (obj. C), as would exploring new processing options (obj. B) or new CPU-accelerator configurations (obj. E).

III. DESIGN OBJECTIVES REVIEW

In this section, the design objectives defined in the last section are used to typify the selected projects and identify the key implementations that aid the EE improvement.

A. Optimize DRAM usage

An obvious solution to the large volume of DRAM is to reduce data usage, which leads to two options: reducing the number of parameters (a matter of algorithm compression, whose analysis is omitted in this paper) or reducing the data size. Computing CNN inference with low-precision data involves both mathematics and computing. Binary neural networks (BNN) were the first data reduction approach, using weights normalized to a signed unit value (-1, +1). From a computing point of view, one bit can represent one weight; consequently, a standard 32-bit word can store 32 weights. BNNs can be executed with both standard programming and computing resources, since the binary MAC operation is performed using common Boolean operations such as XNOR-pop-count. Energy consumption and throughput were improved at that time, but the inaccuracy of the results [12] encouraged the exploration of other options. The alternative to binary values is quantized weights, where 2 to 8 bits represent a weight.
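To make the XNOR-pop-count trick concrete, the sketch below shows the standard reformulation of a binary dot product: with ±1 values encoded as single bits (1 for +1, 0 for -1), matching bits contribute +1 and mismatching bits -1, so a 32-element dot product reduces to one XNOR plus one population count. It is a minimal illustration in portable C (using the GCC/Clang __builtin_popcount intrinsic), not code from the reviewed accelerators.

    #include <stdint.h>

    /* Binary dot product of 32 {-1,+1} elements packed one per bit:
     * dot = (#matching bits) - (#mismatching bits)
     *     = 2 * popcount(XNOR(a, w)) - 32.                        */
    static inline int32_t bnn_dot32(uint32_t a_bits, uint32_t w_bits)
    {
        int32_t matches = __builtin_popcount(~(a_bits ^ w_bits));
        return 2 * matches - 32;
    }

    /* Accumulation over a longer binary vector of n 32-bit words. */
    static int32_t bnn_dot(const uint32_t *a, const uint32_t *w, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += bnn_dot32(a[i], w[i]);
        return acc;
    }

One multiply-accumulate per weight thus collapses into a fraction of an XNOR and a popcount, which is where the energy advantage of BNNs originates.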
The quantized neural networks (QNN) approach [13] was implemented with more accurate results than BNN. Concerning energy reduction goals, QNN implementations address the issues of low-precision data storage, retrieval, and computation in a 32-bit architecture. In DRAM, data are stored with volume optimization: two 16-bit, four 8-bit, eight 4-bit, or sixteen 2-bit data per word, depending on the chosen precision. For 8-bit data retrieval, a custom extension based on the RV32IMC ISA and named LVE is defined in [14] and applied in [15]. It contains instructions for allocating 16- and 8-bit data as vector arrays in the scratchpad (SP) memory. Another option to deal with data reduction is proposed in [16], based on an RV32IMA configuration. The authors reduce data to 8 bits using shift operations and store them directly in general-purpose registers (GPR) to speed up the MAC operations.

The RISC-V ISA offers the capacity to integrate new functionalities into the standard. New extensions like XpulpNN [17] use an RV32IMCX ISA, including instructions to manage data stored in memory in eight 4-bit or sixteen 2-bit formats. The low-precision data are extracted and accommodated in 8-bit locations using hardware loops bound to one single instruction. Xpulpv2 matches the parallel processing configurations used by the PULP ecosystem; thus, every extracted datum is accommodated in a vector/array to perform the dot-product. The implementations in [18] and [19] use an ultra-low-power RV32IM processor [20]. This low-cost benefit is maintained in the projects by the authors, who propose a shift operation to extract each low-precision datum from its storage format, so quantized operations are executed at low cost.

Quantization, and especially the BNN, is one strategy with great potential to reduce energy consumption and remains one of the most implemented strategies to improve performance results. Low-precision data manipulation is not a standard operation, and custom instructions for shift or mask extraction must be implemented, just as the RISC-V processors in this section have proposed.

B. Storage alternatives

Given the advantages of DRAM technology, it will hardly be replaced in a short time [21]. However, the possibility of using memories with lower power consumption is pushing emerging options. The memory alternatives most referenced in EMLIoT are the non-volatile memories (NVM) [22], the 8T SRAM [23], and the eDRAM [24]. In addition to their storage functionality, these devices have been chosen for their ability to compute one or more operations, in what is called in-memory computing (IMC, shown in Fig. 2). This approach states that one or more operations can be computed inside a storage cell, without transferring data to the processor, saving the cost of mobility. Furthermore, since these devices are based on analog mechanisms, the storage/processing output signal is a continuous function, providing a multiprecision capacity. To process larger volumes of data, IMC is performed massively in a 2D arrangement of storage cells controlled synchronously [25].

Fig. 2. IMC module representation.

Even though these memories represent a breakthrough for EMLIoT devices, some issues remain to be solved: i) the massive data mobility to/from the IMC device must be handled; ii) the stability of the output signal is a challenge for some of these memories; iii) the analog output values must be fed into analog-to-digital converters, generating an energy cost. Data mobility is discussed in sub-section D below, and the implications of the analog converters are not a subject of this analysis.
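As a purely functional picture of what such a 2D array computes, the sketch below models the synchronous operation digitally: every column (bitline) accumulates, in parallel, the products of its stored one-bit weights with the inputs driven on the rows (wordlines). In a real NVM or SRAM IMC macro this accumulation is an analog charge or current sum that an ADC then digitizes; the dimensions and data types here are illustrative assumptions only.

    #include <stdint.h>

    #define ROWS 256   /* wordlines: one input per row (illustrative) */
    #define COLS 64    /* bitlines: one accumulated output per column */

    /* Digital stand-in for an IMC matrix-vector multiply: y[c] is the
     * sum over all rows of cell[r][c] * x[r], with 1-bit cells. All
     * columns operate at the same time in hardware; the outer loop
     * only models that parallelism sequentially. */
    void imc_mvm(const uint8_t cell[ROWS][COLS], const int32_t x[ROWS],
                 int32_t y[COLS])
    {
        for (int c = 0; c < COLS; c++) {
            int32_t bitline_sum = 0;
            for (int r = 0; r < ROWS; r++)
                if (cell[r][c])
                    bitline_sum += x[r];
            y[c] = bitline_sum;   /* stands in for the ADC readout */
        }
    }

Multi-bit operands are commonly handled by repeating such bit-level passes and shift-adding the partial sums, which is one reason the converter cost in issue iii) grows with precision.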
C. Functions optimization

A complementary measure to reducing memory is the optimization of the CNN model execution. A clear example of operation simplification is the substitution of the MAC by the XNOR-pop-count function in BNNs, which led to a 25% reduction in energy consumption in the first implementations [26]. However, this analysis is bounded to innovative proposals for function/operation execution.

Some projects make hardware innovations to deal with low-precision data operations related to the MAC. In [27], a shift operation is performed to extract low-precision data and, immediately after each extraction, the data are accumulated, postponing 32-bit data manipulation until the accumulation is finished. The purpose is to avoid data transfers between registers of different data sizes. Since the efficiency of the standard convolutional multiplication is improvable, kernel re-arrangement is proposed as a guaranteed optimization of the PULP-NN library [28]. This includes kernel re-arrangement to prioritize the reuse of kernel weights and diminish MAC processing. Another kernel re-arrangement optimization is presented in [28], where a 3x3 kernel in hardware is proposed, using individual FIFO buffers for the different operands and individual accumulators at each buffer output.

Among the compression techniques, sparsity appears recurrently in hardware optimization. The purpose is to detect values that make computations unnecessary. For example, zero detection is used in [19] and [29], applying a bitwise AND operation to identify null results. Wu et al. [30] focus on the activation function execution performed by a RISC-V co-processor. Depending on the CNN operation to be executed (convolution, max pooling, rectified linear unit ReLU, and add functions), one circuit in the accelerator is activated. The reported energy reduction is 47% for the entire SoC.

An emerging concept is mixed-precision data, i.e., the optimal use of data precision in the different layers forming the CNN model. The MPIC library [31], which includes the functions defined for RISC-V [28], helps the CNN designer normalize data, and adapt and combine different data sizes along the CNN design to balance the energy consumption and accuracy required by each layer. All these implementations have in common the definition of a RISC-V instruction to execute the new/modified circuit.

D. Optimization of data mobility

For both DRAM and NVM memories, data mobility optimization remains a critical objective to reduce energy. The recurrent topics in these projects are improving data transfers by using specialized circuits between the CPU and memory, and locating memory close to the processing elements. Most of the circuits implemented to improve data transfers emerged for commercial processors under the ARM architecture, so the innovation is not the appearance of the circuits but the possibility of connecting them to the open-ISA processor in different configurations. Next is a brief explanation of some of these circuits and their RISC-V project implementations.

In IMC projects, direct memory access (DMA) is usually employed for data transfers between the RISC-V host and the NVM [32], [19]. GAP-8 uses a DMA to speed up instruction transfers from the host L2 memory to the L1 memory in the accelerator; both host and accelerator are RISC-V. The typical software pattern behind such transfers is sketched below.
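The sketch shows the double-buffering idiom that makes an L2-to-L1 DMA engine pay off: while the core processes one L1 tile, the DMA fetches the next one, so transfer time hides behind compute time. dma_start/dma_wait and the tile layout are hypothetical placeholders for a platform's DMA driver, not an API from the reviewed projects.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical DMA primitives standing in for a platform driver. */
    typedef int dma_req_t;
    extern dma_req_t dma_start(void *l1_dst, const void *l2_src, size_t bytes);
    extern void dma_wait(dma_req_t req);
    extern void process_tile(const int8_t *tile, size_t bytes);

    /* Double buffering: compute on one L1 buffer while the DMA fills
     * the other, alternating between the two across tiles. */
    void run_tiles(const int8_t *l2_data, size_t n_tiles, size_t tile_bytes,
                   int8_t *l1_buf0, int8_t *l1_buf1)
    {
        int8_t *bufs[2] = { l1_buf0, l1_buf1 };
        dma_req_t req = dma_start(bufs[0], l2_data, tile_bytes);

        for (size_t t = 0; t < n_tiles; t++) {
            dma_wait(req);                      /* tile t is now in L1 */
            if (t + 1 < n_tiles)                /* prefetch tile t+1   */
                req = dma_start(bufs[(t + 1) & 1],
                                l2_data + (t + 1) * tile_bytes, tile_bytes);
            process_tile(bufs[t & 1], tile_bytes);  /* overlaps the DMA */
        }
    }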
In [33], the accelerator is a multicore RISC-V, and an array of TCDM-DMA connects the L1 memory (shared by the cores) and the host L2 memory. A recurrent circuit to speed up memory allocation is the tightly coupled memory (TCDM for data or ITCM for instructions), proposed to substitute the random cache access with addressed and controlled access for instructions and data. The benefit of TCM in accelerators is to create a direct connection between the processor and the DRAM, avoiding the long memory hierarchy. As with DMA, the data transfers are faster and addressed exactly to the target module. Examples of implementations are [34], [35]. The co-processors, known as tightly coupled accelerators (TCA), use a FIFO buffer that stores data waiting to be processed. Examples of this connection are [35], [36]. IMC implementations also use buffers, for reshaping data and keeping them waiting while entering the processing module. In the mentioned projects, a combination of the open ISA and custom ld/st instructions to transfer data between the CPU registers and these circuits is implemented to ease data mobility and avoid bottlenecks.

E. CPU-accelerator configuration

This section provides a brief explanation of the processing factors related to the CPU-accelerator configuration and the RISC-V influence up to now. The two most representative CPU-accelerator configurations are the LCA and the TCA; the delineation between the two is given by the accelerator's access to the CPU memory hierarchy level. However, an open-source platform such as RISC-V enables extensive combinations of these two options.

In an LCA configuration, the CPU is usually limited to sending control signals to the accelerator. Under this principle, a RISC-V interface [37] is available to facilitate this interaction by connecting any accelerator to a RISC-V processor, e.g., [16]. This tool can be configured to exchange control signals, to exchange data between the CPU cache memory and external elements, or to communicate between the accelerator and the interface. Other projects use the TCM tools discussed in the last section to transfer data between the CPU cache memory and the accelerator [18], [19], [38], [33], [34]. Although these projects are LCA, data are provided faster using tightly coupled circuits. Regarding less conventional interactions between CPU and accelerator, the TCA implementations are few and do not report energy consumption or accuracy. This is a drawback, as they cannot be considered among the EE key elements.

The processing mode is another component of energy consumption. Parallel processing is a robust implementation to distribute tasks between different processing elements, either cores or PEs. Its execution response is faster than single-instruction processing, which improves throughput. Almost all the implementations apply parallel processing in one way or another. The RISC-V factor is evident in the custom extensions for SIMD that manage parallel instructions, e.g., [15], [34], [29]; the idea behind such packed-SIMD operations is sketched below. [35] introduces parallel processing with two RISC-V processors, one in a simplified setting for general-purpose activity and a second, more capable processor for the CNN execution. Finally, the IMC projects implicitly involve parallel processing, since all the NVM cells act at the same time and provide results in a few cycles.
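A portable C model of the packed-SIMD "sum-of-dot-products" idea behind such extensions follows: four signed 8-bit elements travel packed in each 32-bit register and are multiplied and accumulated pairwise. A hardware instruction performs the four MACs at once; the loop below only emulates the semantics, and the function name is an illustrative assumption rather than an instruction from the cited ISAs.

    #include <stdint.h>

    /* Emulated packed-SIMD sum of dot products: four signed 8-bit
     * lanes per 32-bit word, multiplied pairwise and added to the
     * accumulator. In hardware this is a single instruction; here it
     * is modeled lane by lane. */
    static inline int32_t sdotp4_emul(uint32_t a_packed, uint32_t w_packed,
                                      int32_t acc)
    {
        for (int lane = 0; lane < 4; lane++) {
            int8_t a = (int8_t)(a_packed >> (8 * lane));  /* extract lane */
            int8_t w = (int8_t)(w_packed >> (8 * lane));
            acc += (int32_t)a * (int32_t)w;
        }
        return acc;
    }

Replacing four scalar load-multiply-add sequences with one such instruction cuts both the instruction count and the register traffic, which is where the throughput and energy gains of these extensions originate.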
IV. DISCUSSION

In this section, the discussion focuses on the overall results reported by the sample of projects and links them to the RISC-V solutions. This exercise helps identify the effectiveness of the current solutions and possible directions for upcoming proposals. Before starting, it is worth mentioning that, from a sample of several projects, the implementations with the most information about EE, throughput, and power were selected. Furthermore, across the chosen group of projects there exists a diversity of conditions that makes it difficult to analyze them with the same criteria; examples are the CNN model and compression techniques, the manufacturing technology, the power and throughput measurement tools and conditions, and the accuracy evaluation metrics. Although all these variables indicate a particular design, this analysis considers global characteristics and highlights the RISC-V solutions. An interesting remark from this study is the identification of a gap in standard EE metrics for reporting results and facilitating the analysis.

The relationship between throughput and EE is the first item of discussion, as shown in Fig. 3. The best results in throughput correspond to the best results in EE, and a high EE level is related to low power consumption, but not conversely (low power does not imply high EE). In the same way, power metrics lack a uniform definition, and the current ones depend on technology, power supply, model execution, and frequency. Additionally, Fig. 3 shows that the best EE levels are achieved by solutions implemented at 65-55 nm, while the rest use nodes between 180 and 22 nm; thus, neither the EE nor the throughput is determined by the technology node used in the implementations.

Fig. 3. EE. Different markers in the figure indicate different manufacturing technologies.

Regarding the throughput in Fig. 4, the IMC-SRAM implementations obtain the best results (4.7 and 2.18 TOPS), using precisions of 1 bit (probably to speed up some layers where high accuracy is not required) and 4 bits (to improve the classification accuracy). In addition, both in-memory accelerators include near-memory computing (NMC) capabilities. Other initiatives, with values around 100 GOPS, are not identified with any specific characteristic, except perhaps the use of mixed-precision data.

Fig. 4. Maximum peak throughput.

Power information for all the projects in the sample is reported in Fig. 5. Chewbacca, the project with the best very-low-energy results, uses a Gapuino device, i.e., a microcontroller with a GAP8 microprocessor; it works under multicore processing and uses the Xpulp extensions [29]. CIMU [19] and CIMA [18] are IMC accelerators and integrate a zero-riscy processor [20] using a minimum of extensions (RV32IM) and a dynamic power density of 0.68 µW/MHz. These three projects use data mobility circuits (DMA or TCDM) to speed up the data transfers.

Fig. 5. Power and implementation similarities between the projects.

Table I shows some additional details about the projects. The accuracy reported by CIMU, CIMA, and Chewbacca is above 90%. The authors of [29] propose BNN optimizations to still benefit from low-energy processing while improving accuracy. Among these optimizations are binary additions stored in full-precision registers, binary data reuse, XNOR-pop-count, and max-pooling operations.

TABLE I. SUMMARY OF EMLIOT PROJECTS
(Columns: Project reference; Precision (bits); Mobility data tool; Function optimization; CPU-accelerator configuration; RISC-V ISA extensions; Accuracy. Rows: GAP-8 [38], PULP-NN [28], XpulpNN [17], MultiPULPly [32], DORY [40], CompAcc [27], IMA-E2E [34], Chewbacca [29], CIMU [19], CIMA [18], PTLL-BNN [39].)

V. CONCLUSIONS

In the EMLIoT ecosystem, energy efficiency remains an open topic for long-life devices, as small devices must perform big tasks with limited resources. The future expectation is that more and more of these devices will be around to assist human activity. This implies that each device must be specifically designed to make optimal use of its resources. Thus, domain-specific accelerators are emerging to speed up processing and lower power consumption for machine learning devices.

Facing such a challenge, using the right tools could make a difference. In particular, the RISC-V key contributions found in this study are developed to optimize known computing approaches and execute them efficiently. These contributions are: an open architecture that enables tight coupling between memory and CPU to speed up data mobility; the implementation of custom instructions to integrate optimized functions; and instruction-set modularity that allows different levels of instruction complexity and, therefore, the possibility of varying the energy. In general, most of the implementations reviewed follow the traditional and safe trends explored until now: LCA-parallel configurations and data mobility circuits, including IMC implementations. Developments related to CPU-accelerator tight coupling, serial processing, and further QNN exploration remain pending.

ACKNOWLEDGMENTS

This work has been partially supported by the Mexican Government F-PROMEP-01/Rev-04 SEP-23-002-A; the Spanish Ministry of Science and Innovation (contract PID2019-107255GB-C21/AEI/10.13039/501100011033); the Generalitat de Catalunya (contract 2017-SGR-1328); and DTS21/00089 of the Instituto Carlos III.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 770–778, 2016.
[2] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, pp. 1–30, 2018.
[3] A. Waterman and K. Asanovic, "The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA," version 20191213, 2019.
[4] M. Radulović, "Memory bandwidth and latency in HPC: system requirements and performance impact," TDX (Tesis Dr. en Xarxa), March 2019.
[5] M. Sami, D. Sciuto, C. Silvano, and V. Zaccaria, "Instruction-level power estimation for embedded VLIW cores," Hardware/Software Codesign Proc. Int. Work., pp. 34–38, 2000.
[6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, "EIE: Efficient inference engine on compressed deep neural network," Proc. 2016 43rd Int. Symp. Comput. Archit. (ISCA), pp. 243–254, 2016.
[7] A. Vasilyev, "CNN optimizations for embedded systems and FFT," Project Report, 2015.
[8] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures," Proc. Annu. Int. Symp. Microarchitecture, pp. 245–257, 2000.
[9] R. Xu, D. Zhu, C. Rusu, R. Melhem, and D. Mossé, "Energy-efficient policies for embedded clusters," ACM SIGPLAN Not., vol. 40, no. 7, pp. 1–10, 2005.
[10] C. James, "Beyond Moore's Law: Exploring the Future of Computation," The Royal Society, Los Alamos, 2019.
[11] M. Jung, C. Weis, and N. Wehn, "The dynamic random access memory challenge in embedded computing systems," A Journey of Embedded and Cyber-Physical Systems, pp. 19–36, 2021.
[12] T. Simons and D. J. Lee, "A review of binarized neural networks," Electronics, vol. 8, no. 6, 2019.
[13] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4820–4828, 2016.
[14] G. Lemieux and J. Vandergriendt, "FPGA-optimized lightweight vector extensions for VectorBlox ORCA RISC-V," 4th RISC-V Workshop, 2016.
[15] G. Lemieux, J. Edwards, J. Vandergriendt, A. Severance, R. De Iaco, A. Raouf, and S. Singh, "TinBiNN: Tiny binarized neural network overlay in about 5,000 4-LUTs and 5 mW," arXiv preprint arXiv:1903.06630, 2019.
[16] G. Zhang, K. Zhao, B. Wu, Y. Sun, L. Sun, and F. Liang, "A RISC-V based hardware accelerator designed for Yolo object detection system," Proc. 2019 IEEE Int. Conf. Intell. Appl. Syst. Eng. (ICIASE), pp. 9–11, 2019.
[17] A. Garofalo, G. Tagliavini, F. Conti, D. Rossi, and L. Benini, "XpulpNN: Accelerating quantized neural networks on RISC-V processors through ISA extensions," Proc. 2020 Des. Autom. Test Eur. Conf. Exhib. (DATE), pp. 186–191, 2020.
[18] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, "A programmable heterogeneous microprocessor based on bit-scalable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, 2020.
[19] H. Jia, Y. Tang, H. Valavi, J. Zhang, and N. Verma, "A microprocessor implemented in 65nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing," IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, 2020.
[20] P. D. Schiavone, F. Conti, D. Rossi, M. Gautschi, A. Pullini, E. Flamand, and L. Benini, "Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for internet-of-things applications," 2017 27th Int. Symp. Power Timing Model. Optim. Simulation (PATMOS), pp. 1–8, 2017.
[21] S. Shiratake, "Scaling and performance challenges of future DRAM," May 2020.
[22] A. Chen, "A review of emerging non-volatile memory (NVM) technologies and applications," Solid-State Electron., vol. 125, pp. 25–38, 2016.
[23] A. S. S. Trinadh Kumar, "Low voltage high speed 8T SRAM cell for ultra-low power applications," Int. J. Eng. Technol., vol. 7, pp. 70–74, 2018.
[24] S. S. Iyer and H. L. Kalter, "Embedded DRAM technology: opportunities and challenges," IEEE Spectrum, vol. 36, no. 4, pp. 56–64, 1999.
[25] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, et al., "Memory devices and applications for in-memory computing," Nat. Nanotechnol., vol. 15, pp. 529–544, 2020.
[26] A. Al Bahou, G. Karunaratne, R. Andri, L. Cavigelli, and L. Benini, "XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional neural networks," 21st IEEE Symp. Low-Power High-Speed Chips Syst. (COOL CHIPS), pp. 1–3, 2018.
[27] Z. Ji, W. Jung, J. Woo, K. Sethi, S. L. Lu, and A. P. Chandrakasan, "CompAcc: Efficient hardware realization for processing compressed neural networks using accumulator arrays," 2020 IEEE Asian Solid-State Circuits Conf. (A-SSCC), 2020.
[28] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Philos. Trans. R. Soc. A, vol. 378, no. 2164, 2020.
[29] R. Andri, G. Karunaratne, L. Cavigelli, and L. Benini, "ChewBaccaNN: A flexible 223 TOPS/W BNN accelerator," pp. 1–5, 2021.
[30] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, p. 1005, 2020.
[31] M. Rusci, A. Capotondi, and L. Benini, "Memory-driven mixed low precision quantization for enabling deep network inference on microcontrollers," 2019.
[32] A. Eliahu, R. Ronen, P. E. Gaillardon, and S. Kvatinsky, "MultiPULPly: A multiplication engine for accelerating neural networks on ultra-low-power architectures," ACM J. Emerg. Technol. Comput. Syst., vol. 17, no. 2, 2021.
[33] F. Glaser, G. Tagliavini, D. Rossi, G. Haugou, Q. Huang, and L. Benini, "Energy-efficient hardware-accelerated synchronization for shared-L1-memory multiprocessor clusters," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 3, pp. 633–648, 2021.
[34] G. Ottavi, G. Karunaratne, F. Conti, I. Boybat, L. Benini, and D. Rossi, "End-to-end 100-TOPS/W inference with analog in-memory computing: Are we there yet?," pp. 10–13, 2021.
[35] L. Zhang, Z. Xian, and G. Chuliang, "A CNN accelerator with embedded RISC-V controllers," 2021 China Semiconductor Technology Int. Conf. (CSTIC), 2021.
[36] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, p. 1005, 2020.
[37] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, ... and A. Waterman, "The Rocket chip generator," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, 2016.
[38] E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, "GAP-8: A RISC-V SoC for AI at the edge of the IoT," Proc. Int. Conf. Appl. Syst. Archit. Process. (ASAP), pp. 1–4, 2018.
[39] Y. Lo, Y. Kuo, Y. Chang, and J. Huang, "PTLL-BNN: Physically tightly coupled, logically loosely coupled, near-memory BNN accelerator," 2019 Eur. Solid-State Circuits Conf. (ESSCIRC), pp. 2–3, 2019.
[40] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, "DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs," vol. 70, no. 8, pp. 1253–1268, 2021.