Toward General-Purpose Code Acceleration with Analog Computation

Amir Yazdanbakhsh∗ Bradley Thwaites∗ Jongse Park∗ Hadi Esmaeilzadeh∗ Renee St. Amant§ Arjang Hassibi§ Luis Ceze† Doug Burger‡
∗ Georgia Institute of Technology § The University of Texas at Austin † University of Washington ‡ Microsoft Research

Abstract

We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of emerging applications.

1. Introduction

A growing body of work [11, 25, 6, 22, 12] has focused on approximation as a strategy for improving performance and efficiency. Large classes of applications can tolerate small errors in their outputs with no discernible loss in Quality of Result (QoR). Many conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension—that of error—and many interesting design alternatives become possible once precision is relaxed.

An obvious candidate is the use of analog circuits for computation. However, computation in the analog domain has several major challenges, even when small errors are permissible. First, analog circuits tend to be special purpose, good for only specific operations. Second, the bit widths they can accommodate are much smaller than current floating-point standards (i.e., 32 or 64 bits), since the value ranges must be represented by physical voltage or current levels. Another consideration is determining where the boundaries between digital and analog computation lie: using individual analog operations is not effective due to the overhead of A/D and D/A conversions. Finally, effective storage of temporary analog results is challenging in current CMOS technologies. Due to these limitations, it has not been effective to design analog von Neumann processors that can be programmed with conventional languages.

NPUs can be a potential solution for general-purpose analog computing. Prior research has shown that analog neural networks can effectively solve classes of domain-specific problems, such as pattern recognition [4, 27, 28, 18]. The process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion. Furthermore, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary. Figure 1 illustrates an overview of our framework.

This paper reports on our study to design an NPU with mixed-signal components and to develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate challenges and potential solutions of implementing NPUs with analog components, while both bounding application error to sufficiently low levels and achieving worthwhile performance or efficiency gains for general-purpose approximable code. We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. We trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, and RPROP plus CDLM [8] for training. Using these techniques together, we were able to bound error on all applications to a 10% limit, which is comparable to prior studies using entirely digital accelerators. The average time required to compute a neural result was 3.3× better than a previous digital implementation, with an additional energy savings of 12.1×. These gains result in average full-application-level improvements of 3.7× in performance and 23.3× in energy-delay product. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous.

Figure 1: Framework for using limited-precision analog computation to accelerate code written in conventional languages.

2. Analog Circuits for Neural Computation

As Figure 2a illustrates, each neuron in a multi-layer perceptron takes in a set of inputs (xi) and performs a weighted sum of those input values (∑i xi wi). The weights (wi) are the result of training the neural network on the collected training data. After the summation stage, which produces a linear combination of the weighted inputs, the neuron applies a nonlinearity function, sigmoid, to the result of the summation. Figure 2b depicts a conceptual analog circuit that performs the three necessary operations of a neuron: (1) scaling the inputs by the weights (xi wi), (2) summing the scaled inputs (∑i xi wi), and (3) applying the nonlinearity function (sigmoid). This conceptual design first encodes the digital inputs (xi) as analog current levels (I(xi)). Then, these current levels pass through a set of variable resistances whose values (R(wi)) are set proportional to the corresponding weights (wi). The voltage level at the output of each resistance, I(xi)R(wi), is proportional to xi wi. These voltages are then converted to currents that can be summed quickly according to Kirchhoff’s current law (KCL). Analog circuits only operate linearly within a small range of voltage and current levels, outside of which the transistors enter saturation mode with I-V characteristics similar in shape to a non-linear sigmoid function. Thus, at a high level, the non-linearity is naturally applied to the result of the summation when the final voltage reaches the analog-to-digital converter (ADC). Compared to a digital implementation of a neuron, which requires multipliers, adder trees, and sigmoid lookup tables, the analog implementation leverages the physical properties of the circuit elements and is orders of magnitude more efficient. However, it operates in limited ranges and therefore offers limited precision.

Figure 2: One neuron and its conceptual analog circuit.
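As a concrete illustration, the following Python sketch models the computation y ≈ sigmoid(∑i I(xi)R(wi)) described above, with operands clamped to a bounded range and quantized to 8 bits. The function names, the use of tanh as the saturating non-linearity, and the saturation constant are illustrative assumptions, not measurements of the actual circuit.

```python
import math

BITS = 8                    # magnitude bits for inputs and weights
MAX_MAG = 2**BITS - 1       # largest representable magnitude level

def quantize(v):
    """Clamp v to [-1, 1] and snap it to an 8-bit sign-magnitude level,
    mimicking the bounded voltage/current range of the analog circuit."""
    v = max(-1.0, min(1.0, v))
    return (1 if v >= 0 else -1) * round(abs(v) * MAX_MAG) / MAX_MAG

def anu(inputs, weights):
    """One conceptual analog neuron: weighted sum of quantized operands,
    a saturating non-linearity, and re-quantization at the ADC."""
    assert len(inputs) <= 8, "ANU fan-in is limited to eight inputs"
    s = sum(quantize(x) * quantize(w) for x, w in zip(inputs, weights))
    y = math.tanh(2.0 * s)  # stand-in for the circuit's sigmoid-like saturation
    return quantize(y)

print(anu([0.5, -0.25, 0.125], [0.9, 0.4, -0.7]))
```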
Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit width of inputs and weights. Increasing the number of bits results in an exponential increase in ADC and DAC energy dissipation and can significantly limit the benefits of analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities finer than practical ADCs can handle. In addition, the fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when larger bit widths are used for weights and inputs. We design our ANUs such that the granularity of the voltage and current levels used for information encoding is, to a large degree, robust to variations and noise.

Analog-digital boundaries. The conceptual design in Figure 2b draws the analog-digital boundary at the level of an algorithmic neuron. As we will discuss, the analog neural accelerator is a composition of these analog neural units (ANUs). An alternative design, primarily optimizing for efficiency, may lay out the entirety of a neural network with only analog components, limiting the D-to-A and A-to-D conversions to the inputs and outputs of the neural network rather than the individual neurons; the overhead of conversions in the ANUs significantly limits the potential efficiency gains of an analog approach to neural computation. However, there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, limiting potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff to achieve acceptable accuracy for code acceleration while still delivering higher performance and efficiency when analog neural circuits are used for general-purpose code acceleration.

While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, since the optimal network topology varies with the application. Additionally, routing analog signals among neurons, together with the limited capability of analog circuits for buffering signals, negatively impacts accuracy and makes the circuit susceptible to noise. In order to provide additional flexibility, we set the digital-analog boundary in conjunction with an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights is converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. ANUs can be composed, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.
3. Mixed-Signal Neural Accelerator (A-NPU)

Figure 3: A single analog neuron (ANU).

Circuit design for ANUs. Figure 3 illustrates the design of a single analog neuron (ANU). As mentioned, the ANU performs the computation of a single neuron, y ≈ sigmoid(∑i wi xi). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form. In the figure, swi and sxi represent the sign bits, and wi and xi represent the magnitudes. Digital input values are converted to the analog domain through current-steering DACs, which are used for their speed and simplicity; in Figure 3, I(|xi|) is the analog current that represents the magnitude of the input value xi. Digital weight values control resistor string ladders that create a variable resistance depending on the magnitude of each weight (R(|wi|)). We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches. The digital weight bits (the wi's) control the switches, adjusting the effective resistance, R(|wi|), seen by the input current (I(|xi|)). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage: V(|wi xi|) = I(|xi|) × R(|wi|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits. This resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.

V(|wi xi|), as well as the XOR of the weight and input sign bits, feeds a differential pair that converts the voltage into two differential currents (I+(wi xi), I−(wi xi)) that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|wi xi|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I+(wi xi) = Ibias/2 + ΔI and I−(wi xi) = Ibias/2 − ΔI. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs, and consequently a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value to a single wire to be efficiently summed according to Kirchhoff's current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages. Resistors convert the resulting pair of differential currents to voltages, V+(∑i wi xi) and V−(∑i wi xi), that represent the weighted sum of the inputs to the ANU. These voltages are input to an additional amplification stage, implemented as a current-mode differential amplifier with a diode-connected load. The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function.
Our experiments show that neural networks are sensitive to the steepness of this non-linear function, losing accuracy with shallower activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain, as a small range of interest must be mapped to a much larger output range in accordance with ADC input-range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain while also allowing for saturation with inputs outside of this range.

The amplified voltage is input to an analog-to-digital converter that converts it to a digital value. We chose a flash ADC design (named for its speed), which consists of a set of reference voltages and comparators [1, 17]. The ADC requires 2^n comparators, where n is the number of digital output bits. Flash ADC designs are capable of converting 8 bits at frequencies on the order of one GHz. We require 2–3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages to capture the behavior of a sigmoid function, which also improves the accuracy of the analog sigmoid.
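The effect of the non-linearly spaced references can be sketched in a few lines of Python. This is a schematic model only: the paper states that the references follow a sigmoid shape, but the exact spacing and voltage ranges below are assumptions.

```python
import math

N_BITS = 8
LEVELS = 2**N_BITS

def sigmoid_references(v_min=-1.0, v_max=1.0):
    """Reference voltages spaced so that equal output codes correspond to
    equal steps of sigmoid output, i.e., thresholds placed at the inverse
    sigmoid (logit) of each code boundary."""
    refs = []
    for k in range(1, LEVELS):
        p = k / LEVELS                    # target sigmoid output level
        x = math.log(p / (1.0 - p))       # inverse of 1 / (1 + e^-x)
        # map the logit range onto the ADC's assumed input voltage range
        refs.append(v_min + (x + 6.0) / 12.0 * (v_max - v_min))
    return refs

def flash_adc(v, refs):
    """A flash ADC outputs the count of comparators whose reference lies
    below the input; with sigmoid-spaced references, the output code is
    (approximately) sigmoid(v) already in digital form."""
    return sum(1 for r in refs if v >= r)

refs = sigmoid_references()
print([flash_adc(v, refs) for v in (-0.9, -0.3, 0.0, 0.3, 0.9)])
```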
Topology restrictions. Another important design choice is the number of inputs to each ANU. Similar to bit width, increasing the number of ANU inputs translates to encoding a larger value range within a bounded voltage and current range, which, as discussed, becomes impractical. The larger the number of inputs, the larger the number of multiply and add operations that can be done in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. However, this ANU limitation restricts the topology of the neural networks that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.

Non-ideal sigmoid. The saturation behavior of the analog circuit that leads to sigmoid-like behavior after the summation stage represents an approximation of the ideal sigmoid. We measure this behavior at the circuit level and expose it to the compiler and the training algorithm.

Reconfigurable mixed-signal A-NPU. We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural topologies, since each application requires a different topology. Figure 4 illustrates the A-NPU design with some details omitted for clarity; the figure shows four ANUs, while the actual design has eight. The A-NPU is a time-multiplexed architecture in which the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm whose schedule is loaded onto the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer. The Input Buffer always contains the inputs to the neurons, either coming from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs; when all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the Input Buffer will be fed to the ANUs, and the Column Selector determines which column of the Output Buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures. Each ANU has eight inputs. As depicted in Figure 4, each ANU is augmented with a dedicated weight buffer that stores the 8-bit weights, and the weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights to the ANUs based on a pre-loaded order.

Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that constitutes the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons; this determines the order in which the neurons will be computed on the ANUs. Then, the scheduler takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs, (2) assign an order to the neurons, (3) assign an order to the weights, (4) generate the order in which inputs are fed to the ANUs, and (5) generate the order in which the outputs are written to the Output Buffer. The scheduler also assigns a unique order to the inputs and outputs of the neural network in which the processor communicates data with the A-NPU.
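A minimal Python sketch of this layer-by-layer scheduling follows. The data layout and the round-robin ANU assignment are assumptions made for illustration; the paper specifies the ordering steps but not a concrete algorithm or configuration format.

```python
# A sketch of the static A-NPU scheduler described above. The structures
# (per-layer dicts, (input_index, weight) pairs) are hypothetical.
NUM_ANUS = 8
ANU_FAN_IN = 8

def schedule_network(layers):
    """layers: list of layers; each layer is a list of neurons, and each
    neuron is its list of (input_index, weight) pairs from the previous
    layer. Returns per-layer ANU assignments plus input/weight orders."""
    schedule = []
    for layer in layers:
        entry = {"anu_of_neuron": {}, "weight_order": [], "input_order": [],
                 "output_order": list(range(len(layer)))}   # step (5)
        for i, neuron in enumerate(layer):                  # steps (1)-(2)
            assert len(neuron) <= ANU_FAN_IN, "fan-in exceeds ANU limit"
            entry["anu_of_neuron"][i] = i % NUM_ANUS        # round-robin
            # steps (3)-(4): weights and inputs are streamed in lockstep
            entry["weight_order"].append([w for (_, w) in neuron])
            entry["input_order"].append([src for (src, _) in neuron])
        schedule.append(entry)
    return schedule

# two inputs, one hidden layer of two neurons, one output neuron
net = [[[(0, 0.5), (1, -0.25)], [(0, 0.125), (1, 0.75)]],
       [[(0, 1.0), (1, -0.5)]]]
print(schedule_network(net)[0]["anu_of_neuron"])
```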
4. Compilation for Analog Acceleration

As Figure 1 illustrates, compilation for A-NPU execution consists of three stages: (1) profile-driven data collection, (2) training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven data collection stage, the compiler instruments the application to collect the inputs and outputs of approximable functions. The compiler then runs the application with representative inputs and collects the inputs and their corresponding outputs; these input-output pairs constitute the training data. Section 3 briefly discussed ISA extensions and code generation. While compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [12], the training phase is unique to an analog approach, accounting for analog-imposed topology restrictions and adjusting weight selection to account for limited-precision computation.

Hardware/software interface for exposing analog circuits to the compiler. As we discussed in Section 2, we expose analog circuit restrictions to the compiler through a hardware/software interface that captures the following circuit characteristics: (1) input bit-width limitations, (2) weight bit-width limitations, (3) the limited number of inputs to each analog neuron (topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to the analog implementation.
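The interface can be thought of as a small descriptor that the compiler consumes. The following sketch is hypothetical (the paper does not define a concrete format): it records the four exposed characteristics and interpolates the measured sigmoid for use in the compiler's high-level circuit model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AnalogSpec:
    """Hypothetical descriptor for the four exposed circuit characteristics;
    the real interface encoding is not specified in the paper."""
    input_bits: int = 8                  # (1) input bit width
    weight_bits: int = 8                 # (2) weight bit width
    max_fan_in: int = 8                  # (3) inputs per analog neuron
    # (4) measured (input, output) samples of the non-ideal analog sigmoid
    sigmoid_samples: List[Tuple[float, float]] = field(default_factory=list)

    def sigmoid(self, x: float) -> float:
        """Piecewise-linear interpolation of the measured sigmoid, as a
        compiler-side model might use during topology search and training."""
        pts = sorted(self.sigmoid_samples)
        if x <= pts[0][0]:
            return pts[0][1]
        if x >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

spec = AnalogSpec(sigmoid_samples=[(-2, 0.02), (-1, 0.2), (0, 0.5), (1, 0.8), (2, 0.98)])
print(spec.sigmoid(0.5))
```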
Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptrons use a gradient descent approach to minimize the average network error over a set of training input-output pairs, backpropagating the output error through the network and iteratively adjusting the weight values to minimize that error. Such techniques, however, do not consider limited-precision inputs, weights, and outputs, and they perform poorly when these values are saturated to adhere to the bit-width requirements that are feasible for an analog implementation. Simply limiting weight values during training is also detrimental to achieving quality outputs, because the algorithm does not have sufficient precision to converge to a quality solution. To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [8]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM trains a fully precise network using a standard training algorithm, such as backpropagation [24]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification, calculates the new error, backpropagates that error through the fully precise network using full-precision computation, and updates the weight values according to the same algorithm used in the first phase. This process repeats, backpropagating the "discrete" errors through a precise network.

The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights. We customize this algorithm by incorporating the input and output bit-width limitations in addition to the limited weight values. This training scheme is also advantageous for an analog implementation because it is general enough to compensate for errors that arise from the analog implementation itself, such as the non-ideal sigmoid function and any other analog non-ideality that behaves consistently.
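The two-phase idea can be sketched for a single linear neuron: evaluate the forward pass with discretized values, but apply the resulting error update to the full-precision weights. This is a minimal illustration of the CDLM principle, not the paper's full multi-layer procedure; the learning rate and quantization helper are assumptions.

```python
import random

def quantize(v, bits=8):
    # snap a value in [-1, 1] to the nearest representable discrete level
    levels = 2**bits - 1
    return round(max(-1.0, min(1.0, v)) * levels) / levels

def cdlm_step(w, x, t, lr=0.1):
    """One continuous-discrete step: the error comes from the *discretized*
    forward pass, but the gradient update lands on the *full-precision*
    weights, so training retains enough precision to converge."""
    y_discrete = quantize(sum(quantize(wi) * quantize(xi)
                              for wi, xi in zip(w, x)))
    err = y_discrete - t
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

# toy target function: y = 0.5*x0 - 0.25*x1
random.seed(0)
w = [random.uniform(-0.1, 0.1) for _ in range(2)]
for _ in range(2000):
    x = [random.uniform(-1, 1) for _ in range(2)]
    w = cdlm_step(w, x, 0.5 * x[0] - 0.25 * x[1])
print([round(wi, 3) for wi in w])   # approaches [0.5, -0.25]
```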
Training with topology restrictions. Conventional multi-layer perceptron networks are fully connected, i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases the network cannot be fully connected. We impose this circuit restriction on the connectivity between neurons during the topology search, and we use a simple algorithm, guided by the mean-squared error of the network, to determine the best topology given the exposed restrictions. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set (70% of the data) and a test set (the remaining 30%). The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers, and we limit the number of neurons per hidden layer to powers of two, up to 32.

To further improve accuracy, and to compensate for topology-restricted networks, we use the Resilient Back Propagation (RPROP) [16] algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values in proportion to the backpropagated error (as in conventional backpropagation [24]), the RPROP algorithm increases or decreases each weight by a predefined step based on the sign of the error gradient. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half the number of training epochs to converge to a quality solution. The main advantage of applying RPROP training to an analog approach to neural computation is its robustness to the sigmoid and topology restrictions imposed by the analog design. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm.
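For reference, a minimal sketch of the sign-based RPROP update follows, in the iRprop- variant of Igel and Hüsken [16]. The step-size constants are the commonly used RPROP defaults, not values from the paper.

```python
def rprop_update(w, grad, state, step_init=0.1,
                 eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One RPROP weight update: the per-weight step size adapts over time,
    and the update uses only the *sign* of the gradient, not its magnitude."""
    new_w, new_state = [], []
    for i, (wi, gi) in enumerate(zip(w, grad)):
        prev_g, step = state[i] if state else (0.0, step_init)
        if prev_g * gi > 0:                  # same sign: accelerate
            step = min(step * eta_plus, step_max)
        elif prev_g * gi < 0:                # sign flip: overshoot, back off
            step = max(step * eta_minus, step_min)
            gi = 0.0                         # skip the update after a flip
        if gi > 0:
            wi -= step
        elif gi < 0:
            wi += step
        new_w.append(wi)
        new_state.append((gi, step))
    return new_w, new_state

w, state = [0.0, 0.0], None
for _ in range(60):                          # minimize (w0-0.5)^2 + (w1+0.25)^2
    grad = [2 * (w[0] - 0.5), 2 * (w[1] + 0.25)]
    w, state = rprop_update(w, grad, state)
print([round(x, 3) for x in w])              # approaches [0.5, -0.25]
```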
5. Evaluations

Cycle-accurate simulation and energy modeling. We use the MARSSx86 x86-64 cycle-accurate simulator [23] to model the performance of the processor. The processor is modeled after a single-core Intel Nehalem to evaluate the benefits of A-NPU acceleration over an aggressive out-of-order architecture (fetch/issue width: 4/5; INT ALUs/FPUs: 6/6; load/store FUs: 1/1; ROB entries: 128; issue queue entries: 36; INT/FP physical registers: 256/256; branch predictor: tournament, 48 KB; BTB sets/ways: 1024/4; RAS entries: 64; load/store queue entries: 48/48; dependence predictor: 4096-entry Bloom filter; ITLB/DTLB entries: 128/256; L1: 32 KB instruction and 32 KB data, 64-byte lines, 8-way, 3-cycle latency; L2: 256 KB, 64-byte lines, 8-way, 6-cycle latency; L3: 2 MB, 64-byte lines, 16-way, 27-cycle latency; memory latency: 50 ns). We extended the simulator to include ISA-level support for A-NPU queue and dequeue instructions. We also augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and for an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [12]. We compile with GCC v4.7.3 with -O3 to enable compiler optimizations. The baseline in our experiments is each benchmark run solely on the processor without neural transformation. We use McPAT [19] for processor energy estimations, and we model the energy of the 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [21], and [13]. Both the D-NPU and the processor operate at 3.4 GHz, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.

Circuit design for ANU. We implemented the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [5]. We ran detailed Spectre SPICE simulations to understand circuit behavior and measure ANU energy consumption. We used CACTI to estimate the energy of the A-NPU buffers.
Benchmarks. We use the benchmarks in [12] and add one more, blackscholes. Table 1 characterizes each benchmark, its offloaded function, the training data, and the trained neural network.

Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network.

| Benchmark | Description | Type | Calls/Loops/Ifs | x86-64 Insts | Training Input Set | Evaluation Input Set | NN Topology | Fully Digital NN MSE | Analog NN MSE (8-bit) | Error Metric | Fully Digital Error | Analog Error |
| blackscholes | Mathematical model of a financial market | Financial Analysis | 5/0/5 | 309 | 16384 data points from PARSEC | 4096 data points from PARSEC | 6 -> 8 -> 8 -> 1 | 0.000011 | 0.00228 | Avg. relative error | 6.02% | 10.2% |
| fft | Radix-2 Cooley-Tukey fast Fourier transform | Signal Processing | 2/0/0 | 34 | 32768 random floating-point numbers | 2048 random floating-point numbers | 1 -> 4 -> 4 -> 2 | 0.00002 | 0.00194 | Avg. relative error | 2.75% | 4.1% |
| inversek2j | Inverse kinematics for 2-joint arm | Robotics | 4/0/0 | 100 | 10000 random (x, y) coordinates | 10000 random (x, y) coordinates | 2 -> 8 -> 2 | 0.000341 | 0.00467 | Avg. relative error | 6.2% | 9.4% |
| jmeint | Triangle intersection detection | 3D Gaming | 32/0/23 | 1,079 | 10000 random pairs of 3D triangle coordinates | 10000 random pairs of 3D triangle coordinates | 18 -> 32 -> 8 -> 2 | 0.05235 | 0.06729 | Miss rate | 17.68% | 19.7% |
| jpeg | JPEG encoding | Compression | 3/4/0 | 1,257 | Three 512x512-pixel color images | 220x200-pixel color image | 64 -> 16 -> 8 -> 64 | 0.0000156 | 0.0000325 | Image diff | 5.48% | 8.4% |
| kmeans | K-means clustering | Machine Learning | 1/0/0 | 26 | 50000 pairs of random (r, g, b) values | 220x200-pixel color image | 6 -> 8 -> 4 -> 1 | 0.00752 | 0.009589 | Image diff | 3.21% | 7.3% |
| sobel | Sobel edge detector | Image Processing | 3/2/1 | 88 | One 512x512-pixel color image | 220x200-pixel color image | 9 -> 8 -> 1 | 0.000782 | 0.00405 | Image diff | 3.89% | 5.2% |
Application error. Table 2 shows the application-level errors with an A-NPU using an ideal sigmoid and with our A-NPU, which incorporates the non-idealities of the analog sigmoid. Except for jmeint, which shows error above 10%, all of the applications show error less than or around 10%. Application average error rates with the A-NPU range from 4.1% to 10.2%.

Table 2: Error with an A-NPU with ideal sigmoid and an A-NPU with non-ideal (analog) sigmoid.

| Application | Topology | Fully Precise Digital Sigmoid | Analog Sigmoid |
| blackscholes | 6 -> 8 -> 8 -> 1 | 8.39% | 10.21% |
| fft | 1 -> 4 -> 4 -> 2 | 3.03% | 4.13% |
| inversek2j | 2 -> 8 -> 2 | 8.13% | 9.42% |
| jmeint | 18 -> 32 -> 8 -> 2 | 18.41% | 19.67% |
| jpeg | 64 -> 16 -> 8 -> 64 | 6.62% | 8.35% |
| kmeans | 6 -> 8 -> 4 -> 1 | 6.10% | 7.28% |
| sobel | 9 -> 8 -> 1 | 4.28% | 5.21% |
| swaptions | 1 -> 16 -> 8 -> 1 | 2.61% | 3.34% |
| geomean | | 6.02% | 7.32% |
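The error metrics named in Tables 1 and 2 can be computed as follows. The exact definitions are assumptions (the paper does not give formulas), but these are the conventional forms of the three metrics.

```python
def avg_relative_error(ref, out, eps=1e-6):
    # mean of |out - ref| / |ref| (blackscholes, fft, inversek2j)
    return sum(abs(o - r) / max(abs(r), eps) for r, o in zip(ref, out)) / len(ref)

def miss_rate(ref, out):
    # fraction of mismatched binary decisions (jmeint's intersection tests)
    return sum(1 for r, o in zip(ref, out) if r != o) / len(ref)

def image_diff(ref, out, max_val=255):
    # mean absolute pixel difference, normalized to the pixel range
    # (jpeg, kmeans, sobel)
    return sum(abs(o - r) for r, o in zip(ref, out)) / (len(ref) * max_val)

print(avg_relative_error([1.0, 2.0], [1.1, 1.9]),
      miss_rate([True, False, True], [True, True, True]),
      image_diff([100, 200], [110, 190]))
```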
This quality-of-result loss is commensurate with other work on quality trade-offs that uses digital techniques. Truffle [11] and EnerJ [26] show similar error (3–10%) for some applications and much greater error (above 80%) for others in a moderate configuration. Green [2] has error rates below 1% for some applications but greater than 20% for others. A case study [20] explores manual optimizations of the x264 video encoder that trade off 0.5–10% quality loss.

Whole-application speedup and energy savings. Figure 5 shows the whole-application speedup and energy savings when the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 5c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefits from neural acceleration. The geometric-mean speedup and energy savings with the A-NPU are 3.7× and 6.3×, respectively, which is 48% and 24% better than the 8-bit, 8-PE D-NPU. Among the benchmarks, kmeans sees a slowdown with both D-NPU- and A-NPU-based acceleration, although all benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×, and the energy savings range from 1.1× to 51.2×.

Figure 5: Whole application speedup and energy saving with D-NPU, A-NPU, and an Ideal NPU that consumes zero energy and takes zero cycles for neural computation. (a) Whole application speedup. (b) Whole application energy saving. (c) % dynamic instructions subsumed.

Table 3: 8-bit D-NPU vs. A-NPU: per-invocation cycle count and energy, and the A-NPU's improvements when clocked at 1.1336 GHz (one third of the digital frequency) and 1.7 GHz (one half of the digital frequency).

| Application | Topology | D-NPU Cycles | D-NPU Energy (nJ) | A-NPU Cycles | A-NPU Energy (nJ) | Cycle Impr. (1.1336 GHz) | Cycle Impr. (1.7 GHz) | Energy Impr. (1.1336 GHz) | Energy Impr. (1.7 GHz) |
| blackscholes | 6 -> 8 -> 8 -> 1 | 45 | 8.15 | 6 | 0.86 | 2.50 | 3.75 | 9.53 | 8.48 |
| fft | 1 -> 4 -> 4 -> 2 | 34 | 2.04 | 6 | 0.54 | 1.89 | 2.83 | 3.79 | 3.33 |
| inversek2j | 2 -> 8 -> 2 | 29 | 2.33 | 4 | 0.51 | 2.42 | 3.63 | 4.57 | 4.05 |
| jmeint | 18 -> 32 -> 8 -> 2 | 365 | 111.54 | 8 | 1.36 | 15.21 | 22.81 | 82.21 | 73.76 |
| jpeg | 64 -> 16 -> 8 -> 64 | 141 | 56.26 | 12 | 1.99 | 3.92 | 5.88 | 28.28 | 25.50 |
| kmeans | 6 -> 8 -> 4 -> 1 | 44 | 5.77 | 6 | 0.67 | 2.44 | 3.67 | 8.56 | 7.57 |
| sobel | 9 -> 8 -> 1 | 33 | 5.49 | 4 | 0.46 | 2.75 | 4.13 | 11.81 | 10.45 |
| swaptions | 1 -> 16 -> 8 -> 1 | 50 | 10.30 | 8 | 1.22 | 2.08 | 3.13 | 8.45 | 7.58 |
| geomean | | 59.90 | 9.71 | 6.35 | 0.84 | 3.33 | 4.71 | 12.14 | 10.32 |
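The Amdahl's Law relationship invoked above can be written explicitly. In the sketch below, f is the fraction of dynamic instructions subsumed by the neural transformation and s is the speedup of the offloaded region on the A-NPU; the example numbers are illustrative, not measured results.

```latex
\[
  \text{whole-application speedup} \;=\; \frac{1}{(1 - f) + \dfrac{f}{s}}
  \;\xrightarrow[\,s \to \infty\,]{}\; \frac{1}{1 - f}
\]
% Example: with f = 0.95 subsumed and s = 10x on the neural region,
% the bound gives 1/(0.05 + 0.095) ~= 6.9x; the ideal-NPU limit is
% 1/0.05 = 20x, which is why benchmarks that subsume more dynamic
% instructions see larger whole-application gains.
```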
6. Related Work

General-purpose approximate computing. A growing body of work has explored relaxing the abstraction of full accuracy at the circuit and architecture level for gains in performance, energy, and resource utilization [9, 11, 25, 6, 22, 12]. These circuit and architecture studies, although proven successful, are limited to purely digital techniques. We explore how a mixed-signal, analog-digital approach can go beyond what digital approximate techniques offer.
Figure 5 10% shows the whole application speedup and energy savings when 8% the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero 6% cycles and consumes zero energy. Figure 5c shows the percentage of Table 1 4% instructions subsumed by the neural transformation mber of Incoming Synapses todynamic each of the candidate code. The results show, following the 2% Am4 8 16 2% 6.62% 5.76% Law, that the larger the number of dynamic instructions dahl’s 0% blackscholes subsumed, the larger the benefits from neural acceleration. Geometric mean speedup and energy savings with an A-NPU is 3.7× and 6.3× respectively, which is 48% and 24% better than an 8-bit, 8-PE NPU. Among the benchmarks, kmeans sees slow down with D-NPU and A-NPU-based acceleration. All benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×. The energy savings range from 1.1× to 51.2×. geomean 50 Core + D-NPU 6 Energy 3.75 6% Core + Ideal NPU 8 Improvement Improvement (1.7 of Incoming Synapses to each Neuron Maximum number 0.011 (1.1336 GHz) GHz) Cycle Improvement (1.7 GHz) 2.50 Speedup Energy 1.89 Saving 0.86 2.7 34 1 -> 16 -> 8 -> 1 0.06 swaptions 7.6 7.6 6 0 10.79 12.14 10.3 8.45 8 0.50 73.76 6 8.39 4.71 8% 42.5954312379609 fft jmeint jmeint 10.45 8.45 1.00 4 Energy (nJ) 3.13 3.33 Core + A-NPU geomean Benchmarks. We use the benchmarks in [12] and add one more, blackscholes. 12% 45 4 9 -> 8 -> 1 3.75 95.1% 56.3%2.08 29.7% 57.1% 9% 0.84 Energy Improvement (1.1336 GHz) 2.83 15.21 Analog1.36Sigmoid 8 8-bit D-NPU 64 -> 16 -> 8 -> 64 Cycle 0.563 Improvement (1.7 GHz) 0.297 6 0.41 6.15 Cycle 2 -> 8 -> 2 Improvement Cycle 0.005 Improvement (1.1336 GHz) 0.048 6 4 9.63 Cycle Improvement 0.951 0.018 1.0 12 6 -> 8 -> 4 -> 1 0.0662 3 sobel 0.959 Cycle 1.0 2.33 0.1841 18 -> 32 -> 8 -> 2 jpeg 0.972 0.674 0.008 8-bit A-NPU 0.03 8.15 1.0 0.0813 1 -> 4 -> 4 -> 2 inversek2j Less Insts 0.003 0.012 0.03 3.7 fft NPU Queue Instructions 0.02 0.31 67.4%1.22 95.9% 97.2% 6.35 Normalized # of Dynamic Instructions Percentage 8 Instructions Subsumed 8-bit D-NPU vs. 
A-NPU-1 2.04 6 -> 8 -> 8 -> 1 5.21% 3.67 56.26 Topology 4.28% 2.44 29 0.0303 sobel 0.67 111.54 Fully Precise 44 5.77 Digital Sigmoid 33 5.49 5 blackscholes 7.28% 6 365 0.0839 Application 8.35% 6.1% 5.77 141 61.47 6.62% kmeans 3.34% Energy Improvement 45 9 -> 8 -> 1 1 -> 16 -> 8 ->341 18 -> 32 -> 8 -> 2 jpeg 107.32% 3.3 6 -> 8 -> 4 -> 1 -> 16 -> 8 -> 64 Fully64 Precise 6 -> 8 -> 4 -> 1 Digital Sigmoid 9 -> 8 -> 1 kmeans swaptions 3.796.1% 8.4% 6.02% 0.8% 12.1 Cycle 64 -> 16 -> 8 -> 64Energy (nJ) jmeint sobel 6.6% 19.7% 2.61% 8.5 1.0 64 2.3% jpeg 18.4% 9.4% inversek2j 2 -> 8 -> 2 geomean geomean 11.8 18 -> 32 ->8-bit 8 -> D-NPU 2 2 4.8% jmeint 8.1% 4.1% 19.67% swaptions 82.2 1.0 geomean 0.5% inversek2j 3.0% 10.2% 3.63 9.42% sobel 18.41% 73.76 3.9 kmeans sobel 1.8% fft 2.83 8.4% kmeans 8.13% jmeint fft 1 -> 4 -> 4 -> 2 blackscholes 4.13% jpeg 64 -> 16 -> 8 -> 64 3.03% inversek2j 25.50 Other Instructions 1.0 2 ->geomean 8 -> 2 40.5% kmeans fft 8.48 82.21 2.4 peg 65.5% Analog Sigmoid 5.76% 28.28 CPU 1.0 6 ->sobel 8 -> 8 -> 1 meint Topology 2 -> 8 -> 2 1 ->swaptions 4 -> 4 -> 2 nversek2j Percentage instructions subsumed geomean 2.6% Improvement (1.7 16 3.3% GHz) 5.2% 5.88 9.71 6 -> 8 -> 8 -> 1 blackscholes blackscholes 1.89 swaptions 4.3% 9.53 inversek2j inversek2j Application Energy Reduction Core + D-NPU 6 -> 8 -> 8 -> 1 010.21% jmeint 22.81 5.49 1 -> 4 -> 4 -> 2 fft 43.3% 6.62% Analog blackscholes Sigmoid 3.92 10.30 blackscholes kmeans 3.1% 18.4% 6.6% 6.1% Improvement 8 19.7% 8.4% (1.1336 GHz)7.3% 9.4% Topology fft Circuit design for ANU. We implemented the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [5]. We ran detailed Spectre spice simulations to understand circuit behavior and Analog Sigmoid measure ANU energy consumption. We used CACTI to estimate energy of the A-NPU buffers. 
fft Topology jpeg sobel 8.39% 15.21 50 9.5 Application kmeans 8.1% 3.75 Fully Precise Digital Sigmoid blackscholes 1.36 59.90 Application jmeint 14.32% 2.42 jpeg 15% 1.99 3.3% 8.4% 3.0% Improvement (1.7 4 GHz) 10.2% 4.1% A-NPU 2.50 Benchmarks jpeg Application 8 NPU Queue Instructions Dynamic Insts inversek2j jmeint Energy Improvement kmeans sobel swaptions jmeint fft blackscholes 111.54 33 12 0.51 inversek2j inversek2j A-NPU + Ideal Sigmoid Application Level Error (Image Diff) 0.86 fft Maximum number of Incoming Synapses each2.3% 6.0% 2.7% 6.2% 17.6% 5.4% 3.2% to 3.8% Cycle Energy Energy Neuron Floating Point D-NPU Cycle Improvement (1.1336 GHz) 6 2.04 16 29 12% 4 2 0.47 0.17 0.25 10% blackscholes 0.00 Application Speedup 34 4 Cycle Improvement blackscholes Energy 8.15 mum number of Incoming Synapses to each Neuron 0.29 0.2 0 Application Level Error 28.2 45 Cycle Application Level Error (Image Diff) 8-bit D-NPU 6 0.49 0.44 Core + D-NPU Core + A-NPU Core + Ideal NPU 0.4 0.50 14.40 Table 1 Cycle 8 2% 165.46 8.35% 11.43 kmeans 33.85 19.67% 5.49 jpeg Application 10% 0.0073 7.32 A-NPU - 1/2 Digital Frequency (Manual) geomean 0.018 0.0118 5.21 0.41 4.28 Energy 2.61 A-NPU - 1/36.02 Digital Frequency (Manual) inversek2j Table 1 2 -> 8 -> 2 Application Level Error Image Diff) 8.35 4.71 0.66 6.10 4 0.011 0.0129 jmeint number of Incoming 18 -> 32 ->Synapses 8 -> 2 Maximum to each Neuron jpeg 64 -> 16 -> 8 -> 64 kmeans 0.0182 4.13 3.67 0.43 9.42 4.13 6.62 3.14 0.06 1.00.0428 sobel 3.63 Analog Sigmoid 5.88 0.03 10.2122.81 3.03 2.44 8.13 2.75 10 Analog Sigmoid 6.3 6.5 meint 2.42 Analog Sigmoid 3.92 15.21 8.39 1.0 0.959 geomean 3.1 3.6 9 -> 8 -> 0.0813 1 0.674 0.008 sobel 0.53 Energy Improvement (1.1336 GHz) 45 8.15 6 0.86 2.50 3.75 34 2.04 6 0.54 1.89 blackscholes 2.83 0 Energy Improvement (1.7 GHz) 9.53 fft 3.79 inversek2j 8.48 jmeint 3.33 jpeg kmeans sobel swaptions geomean Energy Improvement 6 -> 8 -> 0.0303 4 -> 1 2.83 kmeans 7.3 1.0 64 -> 16 -> 8 -> 64 0.0839 1 ->kmeans 4 -> 4 -> 2 0.972 0.012 3.75 0.03 0.51 2.3 2 -> 8 -> 2 Fully Precise Fully Precise 18 -> 32 -> 8 -> 2 Digital Sigmoid Digital Sigmoid 6 ->jpeg 8 -> 8 -> 1 swaptions 0.31 1.89 jpeg 1.6 1.7 1.7 inversek2j Topology jmeint 0.003 Speedup (1.7 GHz) 2.50 1.0 2 ->sobel 8 -> 2 fft 64 nversek2j 0.02 Speedup (1.1336 GHz) 1.0 jmeint 0.75 0.6 12.1 1 -> 4 ->Analog 4 -> 2 Sigmoid 2.0 6 -> 8 -> 8 -> 1 fft inversek2j 14.1 24.5 48.0 Topology blackscholes fft 8.5 Application 2blackscholes Application blackscholes 8.4 59.90 Less3.67Insts 2.44 Application Level Error geomean NPU Queue 0.67 0.46 Instructions 14.0 8 33 10.32 6.2 12 111.54 Other Instructions 44 5.77 1 -> 16 -> 8 -> 1 7.58 12.14 25.8 30.0 31.4 56.26 365 9 -> 8 -> 1 NPU Queue Instructions Other Instructions 10.45 8.45 10.9 14.9 141 64 -> 16 -> 8 -> 64 CPU6 -> 8 -> 4 -> 1 7.57 11.81 7.9 18 -> 32 -> 8 -> 2 swaptions kmeans Application Energy Reduction peg sobel 82.2 meint 29 8.56 1.00 2.5 blackscholes fft Dynamic Insts 2 6 -> 8 -> 4 -> 1 sobel Energy Improvement (1.7 GHz) 2.7 2.8 2.8 Energy Improvement (1.1336 GHz) 2.4 Energy Improvement Cycle Improvement (1.7 GHz) 73.76 1.1 1.3 1.3 Cycle Improvement (1.1336 GHz) 4.05 25.50 0.5 0.8 1.2 Cycle Improvement Energy 3.3 Cycle 82.21 12.1 8-bit A-NPU Energy (nJ) 28.28 22.81 11.8 Cycle 5.88 15.21 2.7 Topology 3.92 1.36 1.1 1.3 1.6 8-bit D-NPU Application 1.99 Energy Improvement 8 5.4 12 111.54 3.7 56.26 365 5.4 141 64 -> 16 -> 8 -> 64 2.5 18 -> 32 -> 8 -> 2 jpeg 3.7 jmeint 3.33 5.1 swaptions geomean 4.57 3.1 3.6 sobel 3.63 2.4 kmeans 2.42 2.7 2.8 2.8 
Analog and digital neural hardware. There is an extensive body of work on hardware implementations of neural networks, both in the digital [10, 30] and the analog [4, 18] domain. Recent work has proposed higher-level abstractions for the implementation of neural networks [15]. Other work has examined fault-tolerant hardware neural networks [14, 29]. In particular, Temam [29] uses datasets from the UCI machine learning repository to explore the fault tolerance of a hardware neural network design. In contrast, our compilation, neural-network selection/training framework, and architecture design aim at applying neural networks to general-purpose code written in familiar programming models and languages, not code explicitly written to utilize neural networks directly.
Neural-based code acceleration. A recent study [7] shows that a number of applications can be manually reimplemented with explicit use of various kinds of neural networks. That study did not prescribe a programming workflow or a preferred hardware architecture. More recent work exposes analog spiking neurons as primitive operators [3]. This work devises a new programming model that allows programmers to express digital signal-processing applications as a graph of analog neurons, and it automatically maps the expressed graph to a tiled analog, spiking-neural hardware. The work in [3] is restricted to the domain of applications whose inputs are real-world signals that must be encoded as pulses.
Our approach intends to address the long-standing challenges of utilizing analog computation (programmability and generality) by not imposing domain-specific limitations, and by providing analog circuitry that is integrated with a conventional digital processor without requiring a new programming paradigm.

7. Conclusions

Analog circuits inherently trade accuracy for very attractive energy-efficiency gains. However, it is challenging to utilize them in a way that is both programmable and generally useful. The transformation of general-purpose, approximable code to a neural model, as well as the use of neural accelerators, provides an avenue for realizing the benefits of analog computation by taking advantage of the fixed-function qualities of a neural network while targeting traditionally written, generally approximable code. We presented a complete solution for using analog neural networks to accelerate approximate applications, from circuit to compiler design. A very important insight from this work is that it is crucial to expose analog circuit characteristics to the compilation and neural-network training phases. Our analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1× with quality loss less than 10% for all except one benchmark.

References

[1] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, 2nd ed. Oxford University Press, 2002.
[2] W. Baek and T. M. Chilimbi, “Green: A framework for supporting energy-conscious programming using controlled approximation,” in PLDI, 2010.
[3] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, “Continuous real-world inputs can open up alternative accelerator designs,” in ISCA, 2013.
[4] B. E. Boser, E. Säckinger, J. Bromley, Y. Le Cun, and L. D. Jackel, “An analog neural network processor with programmable topology,” IEEE J. Solid-State Circuits, vol. 26, no. 12, December 1991.
[5] Y. Cao, “Predictive technology models,” 2013. Available: http://ptm.asu.edu
[6] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) SOC architectures based on probabilistic CMOS (PCMOS) technology,” in DATE, 2006.
[7] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, “BenchNN: On the broad potential application scope of hardware neural network accelerators,” in IISWC, 2012.
[8] E. Fiesler, A. Choudry, and H. J. Caulfield, “A weight discretization paradigm for optical neural networks,” in ICOE, 1990.
[9] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An architectural framework for software recovery of hardware faults,” in ISCA, 2010.
[10] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie, “Neural network stream processing core (NnSP) for embedded systems,” in ISCAS, 2006.
[11] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture support for disciplined approximate programming,” in ASPLOS, 2012.
[12] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[13] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE Trans. Comput., vol. 60, no. 7, 2011.
[14] A. Hashmi, H. Berry, O. Temam, and M. Lipasti, “Automatic abstraction and fault tolerance in cortical microarchitectures,” in ISCA, 2011.
[15] A. Hashmi, A. Nere, J. J. Thomas, and M. Lipasti, “A case for neuromorphic ISAs,” in ASPLOS, 2011.
[16] C. Igel and M. Hüsken, “Improving the RPROP learning algorithm,” in NC, 2000.
[17] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John Wiley and Sons, Inc., 1997.
[18] A. Joubert, B. Belhadj, O. Temam, and R. Héliot, “Hardware spiking neurons design: Analog or digital?” in IJCNN, 2012.
[19] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009.
[20] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, “Quality of service profiling,” in ICSE, 2010.
[21] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in MICRO, 2007.
[22] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, “Scalable stochastic processors,” in DATE, 2010.
[23] A. Patel, F. Afram, S. Chen, and K. Ghose, “MARSSx86: A full system simulator for x86 CPUs,” in DAC, 2011.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986, vol. 1.
[25] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “SAGE: Self-tuning approximation for graphics engines,” in MICRO, 2013.
[26] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in PLDI, 2011.
[27] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of analog neural networks,” in IJCNN, 2008.
[28] S. M. Tam, B. Gupta, H. A. Castro, and M. Holler, “Learning on an analog VLSI neural network chip,” in Systems, Man, and Cybernetics (SMC), 1990.
[29] O. Temam, “A defect-tolerant accelerator for emerging high-performance applications,” in ISCA, 2012.
[30] J. Zhu and P. Sutton, “FPGA implementations of neural networks: A survey of a decade of progress,” in FPL, 2003.