Toward General-Purpose Code Acceleration with Analog Computation

Amir Yazdanbakhsh∗ Hadi Esmaeilzadeh∗ Bradley Thwaites∗ Jongse Park∗ Renee St. Amant§ Arjang Hassibi§ Luis Ceze† Doug Burger‡

∗Georgia Institute of Technology §The University of Texas at Austin †University of Washington ‡Microsoft Research

Abstract

We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of emerging applications.
1. Introduction

A growing body of work [11, 25, 6, 22, 12] has focused on approximation as a strategy for improving performance and efficiency. Large classes of applications can tolerate small errors in their outputs with no discernible loss in Quality of Result (QoR). Many conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension—that of error. Many interesting design alternatives become possible once precision is relaxed. An obvious candidate is the use of analog circuits for computation. However, computation in the analog domain poses several major challenges, even when small errors are permissible. First, analog circuits tend to be special purpose, good for only specific operations. Second, the bit-widths they can accommodate are much smaller than current floating-point standards (i.e., 32 or 64 bits), since value ranges must be represented by physical voltage or current levels. Another consideration is determining where the boundaries between digital and analog computation lie: using individual analog operations is not effective due to the overhead of A/D and D/A conversions. Finally, effective storage of temporary analog results is challenging in current CMOS technologies. Due to these limitations, it has not been effective to design analog von Neumann processors that can be programmed with conventional languages.

Neural processing units (NPUs) offer a potential solution for general-purpose analog computing. Prior research has shown that analog neural networks can effectively solve classes of domain-specific problems, such as pattern recognition [4, 27, 28, 18]. The process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion. Furthermore, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary. Figure 1 illustrates an overview of our framework.

This paper reports on our study to design an NPU with mixed-signal components and to develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate the challenges and potential solutions of implementing NPUs with analog components, while both bounding application error to sufficiently low levels and achieving worthwhile performance or efficiency gains for general-purpose approximable code. We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. We trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, and the RPROP and CDLM [8] training algorithms. Using these techniques together, we were able to bound error to the 10% limit on all but one application, which is comparable to prior studies using entirely digital accelerators. The average time required to compute a neural result was 3.3× better than a previous digital implementation, with an additional energy savings of 12.1×. The performance gains result in an average full-application-level improvement of 3.7× and 23.3× in performance and energy-delay product, respectively. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous.

Figure 1: Framework for using limited-precision analog computation to accelerate code written in conventional languages.
2. Analog Circuits for Neural Computation

As Figure 2a illustrates, each neuron in a multi-layer perceptron takes in a set of inputs (x_i) and performs a weighted sum of those input values (∑_i x_i w_i). The weights (w_i) are the result of training the neural network. After the summation stage, which produces a linear combination of the weighted inputs, the neuron applies a nonlinearity function, the sigmoid, to the result of the summation. Figure 2b depicts a conceptual analog circuit that performs the three necessary operations of a neuron: (1) scaling inputs by weights (x_i w_i), (2) summing the scaled inputs (∑_i x_i w_i), and (3) applying the nonlinearity function (sigmoid).

Figure 2: One neuron and its conceptual analog circuit: (a) y = sigmoid(∑_i x_i w_i); (b) y ≈ sigmoid(∑_i I(x_i) R(w_i)).

This conceptual design first encodes the digital inputs (x_i) as analog current levels (I(x_i)). These current levels then pass through a set of variable resistances whose values (R(w_i)) are set proportional to the corresponding weights (w_i). The voltage level at the output of each resistance, I(x_i) R(w_i), is proportional to x_i w_i. These voltages are then converted to currents, which can be summed quickly according to Kirchhoff's current law (KCL). Analog circuits only operate linearly within a small range of voltage and current levels, outside of which the transistors enter saturation mode, with I-V characteristics similar in shape to a non-linear sigmoid function. Thus, at a high level, the non-linearity is naturally applied to the result of the summation when the final voltage reaches the analog-to-digital converter (ADC). Compared to a digital implementation of a neuron, which requires multipliers, adder trees, and sigmoid lookup tables, the analog implementation leverages the physical properties of the circuit elements and is orders of magnitude more efficient. However, it operates in limited ranges and therefore offers limited precision.
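To make the computation concrete, the following minimal sketch evaluates the ideal, full-precision neuron of Figure 2a; the analog circuit of Figure 2b approximates exactly this expression with currents and resistances. The sketch is illustrative and not part of the paper's toolchain:

```python
import math

def ideal_neuron(inputs, weights):
    """Ideal algorithmic neuron: y = sigmoid(sum_i x_i * w_i)."""
    s = sum(x * w for x, w in zip(inputs, weights))  # weighted-sum stage
    return 1.0 / (1.0 + math.exp(-s))                # sigmoid nonlinearity

# A two-input neuron with trained weights, evaluated on one input vector.
print(ideal_neuron([0.5, -0.25], [0.8, 0.4]))  # ~0.574
```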
Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit-width of inputs and weights. Increasing the number of bits results in an exponential increase in ADC and DAC energy dissipation and can significantly limit the benefits of analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities finer than practical ADCs can handle. In addition, fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when more bits are used for weights and inputs. We design our ANUs such that the granularity of the voltage and current levels used for information encoding is, to a large degree, robust to variations and noise.
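As an illustration of what this limited value encoding means numerically, the sketch below quantizes a real value to 8-bit sign-magnitude form with saturation. The fixed [-1, 1] encoding range is our assumption for illustration, not a value reported by the paper:

```python
def quantize_sign_magnitude(value, bits=8, max_mag=1.0):
    """Quantize to sign-magnitude with (bits - 1) magnitude bits,
    saturating at +/- max_mag: a simple model of the ANU's bounded,
    coarsely quantized value encoding."""
    levels = 2 ** (bits - 1) - 1              # 127 magnitude levels for 8 bits
    mag = min(abs(value), max_mag)            # saturate to representable range
    code = round(mag / max_mag * levels)      # integer magnitude code
    sign = -1.0 if value < 0 else 1.0
    return sign * code * max_mag / levels     # reconstructed quantized value

print(quantize_sign_magnitude(0.3333))  # 0.3307... (nearest of 127 levels)
print(quantize_sign_magnitude(-2.0))    # -1.0 (saturated)
```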
Analog-digital boundaries. The conceptual design in Figure 2b draws the analog-digital boundary at the level of an algorithmic neuron. An alternative design, primarily optimizing for efficiency, may instead lay out an entire neural network with only analog components, limiting the D-to-A and A-to-D conversions to the inputs and outputs of the neural network rather than the individual neurons; per-neuron conversions significantly limit the potential efficiency gains of an analog approach to neural computation. However, there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff, achieving acceptable accuracy for code acceleration while still delivering gains in performance and efficiency.

While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, since the optimal network topology varies with the application. Additionally, routing analog signals among neurons and the limited capability of analog circuits for buffering signals negatively impact accuracy and make the circuit susceptible to noise. To provide additional flexibility, we therefore set the digital-analog boundary at an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights is converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. As we discuss below, the analog neural accelerator is a composition of these ANUs: they can be combined, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.

Topology restrictions. Another important design choice is the number of inputs to the ANU. Similar to bit-width, increasing the number of ANU inputs translates to encoding a larger value range in a bounded voltage and current range, which, as discussed, becomes impractical. The larger the number of inputs, the larger the number of multiply and add operations that can be performed in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits used for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. This ANU limitation, however, restricts the topology of the neural networks that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.

Non-ideal sigmoid. The saturation behavior of the analog circuit, which leads to sigmoid-like behavior after the summation stage, is only an approximation of the ideal sigmoid. We measure this behavior at the circuit level and expose it to the compiler and the training algorithm.
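The non-ideal sigmoid can be handed to the compiler simply as a callable transfer function. The following stand-in mimics the measured behavior, a steep logistic that hard-saturates outside a small input range; the gain and saturation parameters are hypothetical, not measured values from the paper:

```python
import math

def analog_sigmoid(v, gain=4.0, v_sat=0.5):
    """Stand-in for the measured, non-ideal circuit sigmoid: approximately
    linear amplification over a small input range, with transistor-like
    saturation outside it (gain and v_sat are illustrative parameters)."""
    v = max(-v_sat, min(v_sat, v))                    # saturation clipping
    return 1.0 / (1.0 + math.exp(-gain * v / v_sat))  # steep logistic in range
```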
3. Mixed-Signal Neural Accelerator (A-NPU)

Circuit design for ANUs. Figure 3 illustrates the design of a single analog neuron (ANU). As mentioned, the ANU performs the computation of a single neuron, y ≈ sigmoid(∑_i w_i x_i). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form; in the figure, s_wi and s_xi represent the sign bits, and w_i and x_i the magnitudes. Digital input values are converted to the analog domain through current-steering DACs, which are used for their speed and simplicity; I(|x_i|) is the analog current that represents the magnitude of the input value x_i. Digital weight values control resistor string ladders that create a variable resistance depending on the magnitude of each weight, R(|w_i|).

Figure 3: A single analog neuron (ANU).

We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches. The digital weight bits control the switches, adjusting the effective resistance, R(|w_i|), seen by the input current I(|x_i|). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage, V(|w_i x_i|) = I(|x_i|) × R(|w_i|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits; this resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.

V(|w_i x_i|), along with the XOR of the weight and input sign bits, feeds a differential pair that converts the voltage into two differential currents, I+(w_i x_i) and I−(w_i x_i), that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|w_i x_i|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I+(w_i x_i) = I_bias/2 + ΔI and I−(w_i x_i) = I_bias/2 − ΔI. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs and, consequently, a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value onto a single wire, where the contributions of all inputs are efficiently summed according to Kirchhoff's current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages. Resistors convert the resulting pair of differential currents to voltages, V+(∑_i w_i x_i) and V−(∑_i w_i x_i), that represent the weighted sum of the inputs to the ANU. These voltages feed an additional amplification stage, implemented as a current-mode differential amplifier with a diode-connected load.

The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function. Our experiments show that neural networks are sensitive to the steepness of this non-linear function, losing accuracy with shallower activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain: a small range of interest must be mapped to a much larger output range in accordance with ADC input-range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain while allowing saturation for inputs outside the range. The amplified voltage is then converted to a digital value by an analog-to-digital converter. We chose a flash ADC design (named for its speed), which consists of a set of reference voltages and comparators [1, 17]. The ADC requires 2^n comparators, where n is the number of digital output bits; flash ADC designs are capable of converting 8 bits at frequencies on the order of one GHz. We require 2–3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages that capture the behavior of a sigmoid function, which also improves the accuracy of the analog sigmoid.
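The sigmoid-shaped quantization can be realized by placing the flash ADC's comparator thresholds at the inverse sigmoid of uniformly spaced output levels, so that the output codes step uniformly through the sigmoid's output rather than its input. A sketch of this reference-level computation, with illustrative circuit constants:

```python
import math

def sigmoid_spaced_references(n_bits, gain=4.0, v_range=1.0):
    """Thresholds for a flash ADC whose codes step uniformly through the
    *output* of a logistic sigmoid: each threshold is the inverse sigmoid
    of a uniform level. Thresholds crowd near zero, where the sigmoid is
    steep, and spread out toward saturation."""
    n_levels = 2 ** n_bits
    refs = []
    for k in range(1, n_levels):
        y = k / n_levels                      # uniform step in sigmoid output
        v = math.log(y / (1.0 - y)) / gain    # inverse logistic -> input level
        refs.append(v * v_range)
    return refs

print(sigmoid_spaced_references(3))  # 7 comparator thresholds for 3 bits
```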
Reconfigurable mixed-signal A-NPU. We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural topologies, since each application requires a different topology. Figure 4 illustrates the A-NPU design with some details omitted for clarity. The figure shows four ANUs, while the actual design has eight. The A-NPU is a time-multiplexed architecture in which the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm; the schedule is loaded to the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer. The Input Buffer always contains the inputs to the neurons, coming either from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs; when all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the Input Buffer will be fed to the ANUs, and the Column Selector determines which column of the Output Buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures. Each ANU has eight inputs. As depicted in Figure 4, each ANU is augmented with a dedicated weight buffer that stores the 8-bit weights. The weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights to the ANUs based on a pre-loaded order.

Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that constitutes the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons; this determines the order in which the neurons will be computed on the ANUs. Then, the scheduler takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs; (2) assign an order to the neurons; (3) assign an order to the weights; (4) generate the order in which inputs are fed to the ANUs; and (5) generate the order in which the outputs are written to the Output Buffer. The scheduler also assigns a unique order for the inputs and outputs of the neural network in which the processor communicates data with the A-NPU.
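A minimal sketch of such a static schedule, assuming a round-robin assignment of each layer's neurons to the eight ANUs (the paper does not spell out the assignment policy at this level of detail):

```python
def schedule(topology, n_anus=8):
    """Statically map each layer's neurons onto the ANUs. One 'round' is a
    batch of up to n_anus neurons evaluated in parallel; the per-round
    assignment fixes the order of weights, inputs, and output writes."""
    config = []
    for layer, neuron_count in enumerate(topology[1:], start=1):
        rounds = []
        for first in range(0, neuron_count, n_anus):
            batch = range(first, min(first + n_anus, neuron_count))
            rounds.append({anu: neuron for anu, neuron in enumerate(batch)})
        config.append({"layer": layer, "rounds": rounds})
    return config

# Topology 6 -> 8 -> 8 -> 1 (blackscholes): each hidden layer fills all
# eight ANUs in one round; the output layer occupies a single ANU.
for entry in schedule([6, 8, 8, 1]):
    print(entry)
```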
4. Compilation for Analog Acceleration

As Figure 1 illustrates, compilation for A-NPU execution consists of three stages: (1) profile-driven data collection, (2) training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven data collection stage, the compiler instruments the application to collect the inputs and outputs of approximable functions. The compiler then runs the application with representative inputs and collects the inputs and their corresponding outputs; these input-output pairs constitute the training data. Section 3 briefly discussed ISA extensions and code generation. While compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [12], the training phase is unique to an analog approach: it accounts for analog-imposed topology restrictions and adjusts weight selection to account for limited-precision computation.
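As an illustration of stage (1), the sketch below instruments a marked function so that profiling runs record its input-output pairs as training data. The approximable marker is hypothetical syntax for exposition; the paper only requires that the programmer identify the approximable region:

```python
training_data = []

def approximable(fn):
    """Hypothetical region marker: during profiling runs, record each
    (inputs, output) pair observed for the wrapped function. The recorded
    pairs become the training data for the neural transformation."""
    def instrumented(*args):
        result = fn(*args)
        training_data.append((args, result))
        return result
    return instrumented

@approximable
def distance(x, y):              # candidate region: tolerates small error
    return (x * x + y * y) ** 0.5

distance(3.0, 4.0)               # run with a representative input
print(training_data)             # [((3.0, 4.0), 5.0)]
```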
Hardware/software interface for exposing analog circuits to the compiler. As discussed in Section 2, we expose the analog circuit restrictions to the compiler through a hardware/software interface that captures the following circuit characteristics: (1) input bit-width limitations, (2) weight bit-width limitations, (3) the limited number of inputs to each analog neuron (topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to the analog implementation.
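One plausible encoding of this interface is a small record handed from the hardware description to the compiler; the field names here are ours, not the paper's, and the sigmoid field reuses the stand-in model sketched earlier:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AnalogCircuitSpec:
    """The four circuit characteristics the A-NPU exposes to the compiler."""
    input_bits: int                      # (1) input bit-width limit
    weight_bits: int                     # (2) weight bit-width limit
    max_fan_in: int                      # (3) max inputs per analog neuron
    sigmoid: Callable[[float], float]    # (4) measured non-ideal sigmoid

# The configuration studied in this paper, with the sigmoid model above.
spec = AnalogCircuitSpec(input_bits=8, weight_bits=8, max_fan_in=8,
                         sigmoid=analog_sigmoid)
```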
Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptron neural networks use a gradient descent approach to minimize the average network error over a set of training input-output pairs, backpropagating the output error through the network and iteratively adjusting the weight values to minimize that error. Training techniques that do not consider limited-precision inputs, weights, and outputs, however, perform poorly when these values are saturated to adhere to the bit-widths that are feasible for an implementation in the analog domain. Simply limiting weight values during training is also detrimental to output quality, because the algorithm does not have sufficient precision to converge to a quality solution.

To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [8]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM trains a fully precise network with a standard training algorithm, such as backpropagation [24]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification. The algorithm calculates the new error, backpropagates that error through the fully precise network using full-precision computation, and updates the weight values with the same algorithm used in the first phase. This process repeats, backpropagating the discrete errors through a precise network. The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights; we customize it by incorporating the input and output bit-width limitations in addition to the limited weight values. This training scheme is also advantageous for an analog implementation because it is general enough to compensate for errors that arise from the analog implementation, such as a non-ideal sigmoid function and any other analog non-ideality that behaves consistently.

Training with topology restrictions. Conventional multi-layer perceptron networks are fully connected, i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases the network cannot be fully connected. We impose the circuit restriction on the connectivity between neurons during the topology search, and we use a simple algorithm guided by the mean-squared error of the network to determine the best topology given the exposed restriction. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set (70% of the data) and a test set (the remaining 30%). The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers. We also limit the number of neurons per hidden layer to powers of two, up to 32.

To further improve accuracy and compensate for topology-restricted networks, we utilize the Resilient Back Propagation (RPROP) [16] algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values based on the magnitude of the backpropagated error (as in conventional backpropagation [24]), RPROP increases or decreases each weight by a predefined step based on the sign of the error. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half as many training epochs as backpropagation to converge to a quality solution. The main advantage of applying RPROP training in an analog approach to neural computation is its robustness to the sigmoid and topology restrictions imposed by the analog design. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm. A sketch of the combined procedure follows.
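Below is a toy, self-contained sketch of the procedure for a single neuron (the real system trains full multi-layer networks; the constants and RPROP step bounds are illustrative). It keeps full-precision weights, measures error on the quantized forward pass through the non-ideal sigmoid, and applies sign-only RPROP updates, reusing quantize_sign_magnitude and analog_sigmoid from the earlier sketches:

```python
def cdlm_rprop_neuron(data, w, bits=8, epochs=200,
                      step0=0.05, up=1.2, down=0.5,
                      step_max=0.5, step_min=1e-6):
    """CDLM phase 2 for one neuron: forward pass on discretized values,
    error applied to full-precision weights with RPROP updates."""
    q = lambda v: quantize_sign_magnitude(v, bits)
    steps = [step0] * len(w)                 # per-weight RPROP step sizes
    prev = [0] * len(w)                      # previous gradient signs
    for _ in range(epochs):
        for xs, target in data:
            # Discretized forward pass through the non-ideal sigmoid.
            y = analog_sigmoid(sum(q(x) * q(wi) for x, wi in zip(xs, w)))
            err = y - target
            for i, x in enumerate(xs):
                g = err * x                  # gradient sign (monotone sigmoid)
                s = (g > 0) - (g < 0)
                if s == prev[i]:
                    steps[i] = min(steps[i] * up, step_max)    # accelerate
                else:
                    steps[i] = max(steps[i] * down, step_min)  # back off
                w[i] -= s * steps[i]         # RPROP: sign only, not magnitude
                prev[i] = s
    return w

# Train a response the 8-bit encoding can actually represent; each weight
# settles near 0.1, since analog_sigmoid(0.1) ~= 0.69.
data = [([0.0, 0.0], 0.5), ([1.0, 0.0], 0.7), ([0.0, 1.0], 0.7)]
print(cdlm_rprop_neuron(data, [0.0, 0.0]))
```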
5. Evaluations

Cycle-accurate simulation and energy modeling. We use the MARSSx86 x86-64 cycle-accurate simulator [23] to model the performance of the processor. The processor is modeled after a single-core Intel Nehalem to evaluate the performance benefits of A-NPU acceleration over an aggressive out-of-order architecture. The modeled microarchitectural parameters are: Fetch/Issue Width: 4/5; INT ALUs/FPUs: 6/6; Load/Store FUs: 1/1; ROB Entries: 128; Issue Queue Entries: 36; INT/FP Physical Registers: 256/256; Branch Predictor: Tournament, 48 KB; BTB Sets/Ways: 1024/4; RAS Entries: 64; Load/Store Queue Entries: 48/48; Dependence Predictor: 4096-entry Bloom filter; ITLB/DTLB Entries: 128/256; L1: 32 KB instruction, 32 KB data, 64-byte line width, 8-way, 3-cycle latency; L2: 256 KB, 64-byte line width, 8-way, 6-cycle latency; L3: 2 MB, 64-byte line width, 16-way, 27-cycle latency; memory latency: 50 ns. We extended the simulator to include ISA-level support for A-NPU queue and dequeue instructions, and we augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and for an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [12]. We use GCC v4.7.3 with -O3 to enable compiler optimizations. The baseline in our experiments is each benchmark run solely on the processor, without neural transformation. We use McPAT [19] for processor energy estimations and model the energy of the 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [21], and [13]. Both the D-NPU and the processor operate at 3.4 GHz, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.

Circuit design for the ANU. We implemented the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [5]. We ran detailed Spectre SPICE simulations to understand circuit behavior and measure ANU energy consumption, and we used CACTI to estimate the energy of the A-NPU buffers.

Benchmarks. We use the benchmarks in [12] and add one more, blackscholes. Table 1 characterizes each benchmark, its offloaded function, the training and evaluation inputs, and the trained neural network.

Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network.

Benchmark | Description | Type | Calls / Loops / Ifs-elses | x86-64 Insts | Training Input Set | Evaluation Input Set
blackscholes | Mathematical model of a financial market | Financial Analysis | 5 / 0 / 5 | 309 | 4096 data points from PARSEC | 16384 data points from PARSEC
fft | Radix-2 Cooley-Tukey fast Fourier transform | Signal Processing | 2 / 0 / 0 | 34 | 2048 random floating-point numbers | 32768 random floating-point numbers
inversek2j | Inverse kinematics for 2-joint arm | Robotics | 4 / 0 / 0 | 100 | 10000 (x, y) random coordinates | 10000 (x, y) random coordinates
jmeint | Triangle intersection detection | 3D Gaming | 32 / 0 / 23 | 1,079 | 10000 random pairs of 3D triangle coordinates | 10000 random pairs of 3D triangle coordinates
jpeg | JPEG encoding | Compression | 3 / 4 / 0 | 1,257 | Three 512x512-pixel color images | 220x200-pixel color image
kmeans | K-means clustering | Machine Learning | 1 / 0 / 0 | 26 | 50000 pairs of random (r, g, b) values | 220x200-pixel color image
sobel | Sobel edge detector | Image Processing | 3 / 2 / 1 | 88 | One 512x512-pixel color image | 220x200-pixel color image

Benchmark | Neural Network Topology | Fully Digital NN MSE | Analog NN MSE (8-bit) | Application Error Metric | Fully Digital Error | Analog Error
blackscholes | 6 -> 8 -> 8 -> 1 | 0.000011 | 0.00228 | Avg. relative error | 6.02% | 10.2%
fft | 1 -> 4 -> 4 -> 2 | 0.00002 | 0.00194 | Avg. relative error | 2.75% | 4.1%
inversek2j | 2 -> 8 -> 2 | 0.000341 | 0.00467 | Avg. relative error | 6.2% | 9.4%
jmeint | 18 -> 32 -> 8 -> 2 | 0.05235 | 0.06729 | Miss rate | 17.68% | 19.7%
jpeg | 64 -> 16 -> 8 -> 64 | 0.0000156 | 0.0000325 | Image diff | 5.48% | 8.4%
kmeans | 6 -> 8 -> 4 -> 1 | 0.00752 | 0.009589 | Image diff | 3.21% | 7.3%
sobel | 9 -> 8 -> 1 | 0.000782 | 0.00405 | Image diff | 3.89% | 5.2%
Whole application speedup and energy savings. Figure 5 shows the whole-application speedup and energy savings when the processor is augmented with the 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 5c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefits from neural acceleration. Geometric mean speedup and energy savings with the A-NPU are 3.7× and 6.3×, respectively, which is 48% and 24% better than the 8-bit, 8-PE D-NPU. Among the benchmarks, kmeans sees a slowdown with both D-NPU- and A-NPU-based acceleration, although all benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×, and the energy savings range from 1.1× to 51.2×.

Figure 5: Whole application speedup and energy saving with D-NPU, A-NPU, and an ideal NPU that consumes zero energy and takes zero cycles for neural computation: (a) whole application speedup; (b) whole application energy saving; (c) percentage of dynamic instructions subsumed.
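The Amdahl's Law relationship invoked here is worth making explicit: only the subsumed fraction of dynamic instructions is accelerated, so coverage bounds whole-application gains regardless of how fast the A-NPU itself is. A worked example (the 90%/50% coverage figures are illustrative, not from the paper):

```python
def whole_app_speedup(subsumed_fraction, region_speedup):
    """Amdahl's Law: only the fraction of dynamic instructions subsumed by
    the neural transformation runs at the accelerated rate."""
    return 1.0 / ((1.0 - subsumed_fraction) + subsumed_fraction / region_speedup)

print(whole_app_speedup(0.9, 10.0))  # ~5.26x when 90% of instructions subsumed
print(whole_app_speedup(0.5, 10.0))  # ~1.82x when only 50% are subsumed
```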
Application error. Table 2 shows the application-level errors with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and our A-NPU, which incorporates the non-idealities of the analog sigmoid. Except for jmeint, which shows error above 10%, all of the applications show error less than or around 10%. Application average error rates with the A-NPU range from 4.1% to 10.2%. This quality-of-result loss is commensurate with other work on quality trade-offs using digital techniques. Truffle [11] and EnerJ [26] show similar error (3–10%) for some applications and much greater error (above 80%) for others in a moderate configuration. Green [2] has error rates below 1% for some applications but greater than 20% for others. A case study [20] explores manual optimizations of the x264 video encoder that trade off 0.5–10% quality loss.

Table 2: Application-level error with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and an A-NPU with the non-ideal analog sigmoid.

Application | Floating-Point D-NPU | A-NPU + Ideal Sigmoid | A-NPU (Analog Sigmoid)
blackscholes | 6.0% | 8.39% | 10.21%
fft | 2.7% | 3.03% | 4.13%
inversek2j | 6.2% | 8.13% | 9.42%
jmeint | 17.6% | 18.41% | 19.67%
jpeg | 5.4% | 6.62% | 8.35%
kmeans | 3.2% | 6.10% | 7.28%
sobel | 3.8% | 4.28% | 5.21%
swaptions | 2.3% | 2.61% | 3.34%
geomean | - | 6.02% | 7.32%
6. Related Work

General-purpose approximate computing. A growing body of work has explored relaxing the abstraction of full accuracy at the circuit and architecture level for gains in performance, energy, and resource utilization [9, 11, 25, 6, 22, 12]. These circuit and architecture studies, although proven successful, are limited to purely digital techniques. We explore how a mixed-signal, analog-digital approach can go beyond what digital approximate techniques offer.
Analog and digital neural hardware. There is an extensive body of work on hardware implementations of neural networks, both in the digital [10, 30] and the analog [4, 18] domain. Recent work has proposed higher-level abstractions for the implementation of neural networks [15]. Other work has examined fault-tolerant hardware neural networks [14, 29]; in particular, Temam [29] uses datasets from the UCI machine learning repository to explore the fault tolerance of a hardware neural network design. In contrast, our compilation, neural-network selection/training framework, and architecture design aim at applying neural networks to general-purpose code written in familiar programming models and languages, not explicitly written to utilize neural networks directly.
Neural-based code acceleration. A recent study [7] shows that a number of applications can be manually reimplemented with explicit use of various kinds of neural networks. That study did not prescribe a programming workflow, nor a preferred hardware architecture. More recent work exposes analog spiking neurons as primitive operators [3]; this work devises a new programming model that allows programmers to express digital signal-processing applications as a graph of analog neurons and automatically maps the expressed graph to tiled analog, spiking-neural hardware. The work in [3] is restricted to the domain of applications whose inputs are real-world signals that must be encoded as pulses. Our approach aims to address the long-standing challenges of utilizing analog computation (programmability and generality) by not imposing domain-specific limitations, and by providing analog circuitry that is integrated with a conventional, digital processor without requiring a new programming paradigm.
7. Conclusions

Analog circuits inherently trade accuracy for very attractive energy-efficiency gains. However, it is challenging to utilize them in a way that is both programmable and generally useful. The transformation of general-purpose, approximable code to a neural model, together with the use of neural accelerators, provides an avenue for realizing the benefits of analog computation by taking advantage of the fixed-function qualities of a neural network while targeting traditionally written, generally approximable code. We presented a complete solution for using analog neural networks to accelerate approximate applications, from circuit design to compiler. A very important insight from this work is that it is crucial to expose the analog circuit characteristics to the compilation and neural-network training phases. Our analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3×, with quality loss less than 10% for all except one benchmark.
References

[1] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, 2nd ed. Oxford University Press, 2002.
[2] W. Baek and T. M. Chilimbi, "Green: A framework for supporting energy-conscious programming using controlled approximation," in PLDI, 2010.
[3] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, "Continuous real-world inputs can open up alternative accelerator designs," in ISCA, 2013.
[4] B. E. Boser, E. Säckinger, J. Bromley, Y. Le Cun, and L. D. Jackel, "An analog neural network processor with programmable topology," J. Solid-State Circuits, vol. 26, no. 12, December 1991.
[5] Y. Cao, "Predictive technology models," 2013. Available: http://ptm.asu.edu
[6] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, "Ultra-efficient (embedded) SOC architectures based on probabilistic CMOS (PCMOS) technology," in DATE, 2006.
[7] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, "BenchNN: On the broad potential application scope of hardware neural network accelerators," in IISWC, 2012.
[8] E. Fiesler, A. Choudry, and H. J. Caulfield, "A weight discretization paradigm for optical neural networks," in ICOE, 1990.
[9] M. de Kruijf, S. Nomura, and K. Sankaralingam, "Relax: An architectural framework for software recovery of hardware faults," in ISCA, 2010.
[10] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie, "Neural network stream processing core (NnSP) for embedded systems," in ISCAS, 2006.
[11] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Architecture support for disciplined approximate programming," in ASPLOS, 2012.
[12] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in MICRO, 2012.
[13] S. Galal and M. Horowitz, "Energy-efficient floating-point unit design," IEEE Trans. Comput., vol. 60, no. 7, 2011.
[14] A. Hashmi, H. Berry, O. Temam, and M. Lipasti, "Automatic abstraction and fault tolerance in cortical microarchitectures," in ISCA, 2011.
[15] A. Hashmi, A. Nere, J. J. Thomas, and M. Lipasti, "A case for neuromorphic ISAs," in ASPLOS, 2011.
[16] C. Igel and M. Hüsken, "Improving the RPROP learning algorithm," in NC, 2000.
[17] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John Wiley and Sons, Inc., 1997.
[18] A. Joubert, B. Belhadj, O. Temam, and R. Héliot, "Hardware spiking neurons design: Analog or digital?" in IJCNN, 2012.
[19] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in MICRO, 2009.
[20] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard, "Quality of service profiling," in ICSE, 2010.
[21] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in MICRO, 2007.
[22] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, "Scalable stochastic processors," in DATE, 2010.
[23] A. Patel, F. Afram, S. Chen, and K. Ghose, "MARSSx86: A full system simulator for x86 CPUs," in DAC, 2011.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986, vol. 1.
[25] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, "SAGE: Self-tuning approximation for graphics engines," in MICRO, 2013.
[26] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, "EnerJ: Approximate data types for safe and general low-power computation," in PLDI, 2011.
[27] J. Schemmel, J. Fieres, and K. Meier, "Wafer-scale integration of analog neural networks," in IJCNN, 2008.
[28] S. M. Tam, B. Gupta, H. A. Castro, and M. Holler, "Learning on an analog VLSI neural network chip," in SMC, 1990.
[29] O. Temam, "A defect-tolerant accelerator for emerging high-performance applications," in ISCA, 2012.
[30] J. Zhu and P. Sutton, "FPGA implementations of neural networks: A survey of a decade of progress," in FPL, 2003.