Appears in the Proceedings of the 41st International Symposium on Computer Architecture, 2014

General-Purpose Code Acceleration with Limited-Precision Analog Computation

Renée St. Amant⋆  Amir Yazdanbakhsh§  Jongse Park§  Bradley Thwaites§  Hadi Esmaeilzadeh§  Arjang Hassibi⋆  Luis Ceze†  Doug Burger‡
⋆ University of Texas at Austin   § Georgia Institute of Technology   † University of Washington   ‡ Microsoft Research
stamant@cs.utexas.edu  a.yazdanbakhsh@gatech.edu  jspark@gatech.edu  bthwaites@gatech.edu  hadi@cc.gatech.edu  arjang@mail.utexas.edu  luisceze@cs.washington.edu  dburger@microsoft.com

Abstract
As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate "approximable" code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an "analog" neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of approximation-tolerant, emerging applications including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.

1. Introduction
Energy efficiency now fundamentally limits microprocessor performance gains. CMOS scaling no longer provides gains in efficiency commensurate with transistor density increases [16, 25]. As a result, both the semiconductor industry and the research community are increasingly focused on specialized accelerators, which provide large gains in efficiency and performance by restricting the workloads that benefit. The community is facing an "iron triangle": we can choose any two of performance, efficiency, and generality at the expense of the third. Before the effective end of Dennard scaling, we improved all three consistently for decades. Solutions that improve performance and efficiency while retaining as much generality as possible are highly desirable, hence the growing interest in GPGPUs and FPGAs. A growing body of recent work [13, 17, 44, 2, 35, 8, 28, 38, 18, 51, 46] has focused on approximation as a strategy for navigating the iron triangle. Many classes of applications can tolerate small errors in their outputs with no discernible loss in QoR (Quality of Result). Many conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension—that of error. Many design alternatives become possible once precision is relaxed. An obvious candidate is the use of analog circuits for computation.
However, computation in the analog domain has several major challenges, even when small errors are permissible. First, analog circuits tend to be special purpose, good for only specific operations. Second, the bit widths they can accommodate are smaller than current floating-point standards (i.e., 32/64 bits), since the ranges must be represented by physical voltage or current levels. Another consideration is determining where the boundaries between digital and analog computation lie. Using individual analog operations will not be effective due to the overhead of A/D and D/A conversions. Finally, effective storage of temporary analog results is challenging in current CMOS technologies. These limitations have made it impractical to design analog von Neumann processors that can be programmed with conventional languages. Despite these challenges, the potential performance and energy gains from analog execution are highly attractive. An important challenge is thus to architect designs where a significant portion of the computation can be run in the analog domain, while also addressing the issues of value range, domain conversions, and relative error.

Recent work on Neural Processing Units (NPUs) may provide a possible approach [18]. NPU-enabled systems rely on an algorithmic transformation that converts regions of approximable general-purpose code into a neural representation (specifically, multi-layer perceptrons) at compile time. At run-time, the processor invokes the NPU instead of running the original code. NPUs have shown large performance and efficiency gains, since they subsume an entire code region (including all of the instruction fetch, decode, etc., overheads). They have an added advantage in that they convert many distinct code patterns into a common representation that can be run on a single physical accelerator, improving generality.

NPUs may be a good match for mixed-signal implementations for a number of reasons. First, prior research has shown that neural networks can be implemented in the analog domain to solve classes of domain-specific problems, such as pattern recognition [5, 47, 49, 32]. Second, the process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion. Third, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary.

In this paper we evaluate an NPU design with mixed-signal components and develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate challenges and define potential solutions to enable effective mixed-signal NPU execution. The objective is to both bound application error to sufficiently low levels and achieve worthwhile performance or efficiency gains for general-purpose approximable code. This study makes the following four findings:
1. Due to range limitations, it is necessary to limit the scope of the analog execution to a single neuron; inter-neuron communication should be in the digital domain.
2. Again due to range issues, there is an interplay between the bit widths (inputs and weights) that neurons can use and the number of inputs that they can process.
We found that the best design limited weights and inputs to eight bits, while also restricting the number of inputs to each neuron to eight. The input count limitation restricts the topological space of feasible neural networks.
3. We found that using a customized continuous-discrete learning method (CDLM) [10], which accounts for limited-precision computation at training time, is necessary to reduce error due to analog range limitations.
4. Given the analog-imposed topology restrictions, we found that using a Resilient Back Propagation (RPROP) [30] training algorithm can further reduce error over a conventional backpropagation algorithm.
We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. The latter three findings were all used at training time; we trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, plus RPROP and CDLM for training. Using these techniques together, we were able to bound error on all applications but one to a 10% limit, which is commensurate with entirely digital approximation techniques. Computing a neural result was on average 3.3× faster than with a previous digital implementation, with an additional energy savings of 12.1×. The performance gains result in an average full-application-level improvement of 3.7× and 23.3× in performance and energy-delay product, respectively. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous. While it may be possible to move more of the accelerator architecture design into the analog domain, the current mixed-signal design performs well enough that only 3% and 46% additional improvements in application-level energy consumption and performance are possible with improved accelerator designs. However, improving the performance of the analog NPU may lead to higher overall performance gains.

2. Overview and Background
Programming. We use a programming model similar to the one described in [18], which enables programmers to mark error-tolerant regions of code as candidates for transformation using a simple keyword, approximable. Explicit annotation of code for approximation is a common practice in approximate programming languages [45, 7]. A candidate region is an error-tolerant function of any size, containing function calls, loops, and complex control flow. Frequently executed functions provide a greater opportunity for gains. In addition to error tolerance, the candidate function must have well-defined inputs and outputs. That is, the number of inputs and outputs must be known at compile time. Additionally, the code region must not read any data other than its inputs, nor affect any data other than its outputs. No major changes are necessary to the programming language beyond adding the approximable keyword.

Exposing analog circuits to the compiler. Although an analog accelerator presents the opportunity for gains in efficiency over a digital NPU, it suffers from reduced accuracy and flexibility: it restricts the possible network topologies and limits the precision of computation, which can shrink the range of applications that can utilize the acceleration. These shortcomings at the hardware level, however, can be exposed as a high-level model and considered in the training phase.
Four characteristics need to be exposed: (1) limited precision for input and output encoding, (2) limited precision for encoding weights, (3) the behavior of the activation function (sigmoid), and (4) the limited set of feasible neural topologies. Other low-level circuit behavior, such as the response to noise, can also be exposed to the compiler. Section 5 describes this necessary hardware/software interface in more detail.

Analog neural accelerator circuit design. To extract the high-level model for the compiler and to be able to accelerate execution, we design mixed-signal neural hardware for multi-layer perceptrons. The accelerator must support a large enough variety of neural-network topologies to be useful over a wide range of applications. As we will show, each application requires a different topology for the neural network that replaces its approximable regions of code. Section 4 describes a candidate A-NPU circuit design and outlines the challenges and tradeoffs present with an analog implementation.

Compiling for analog neural hardware. The compiler aims to mimic approximable regions of code with neural networks that can be executed on the mixed-signal accelerator. While considering the limitations of the analog hardware, the compiler searches the topology space of neural networks and selects and trains a neural network to produce outputs comparable to those produced by the original code segment.

[Figure 1: Framework for using limited-precision analog computation to accelerate code written in conventional languages.]

1) Profile-driven training data collection. During a profiling stage, the compiler runs the application with representative profiling inputs and collects the inputs and outputs to the approximable code region. This step provides the training data for the rest of the compilation workflow.

2) Training for a limited-precision A-NPU. This stage is where our compilation workflow significantly deviates from the framework presented in [18] that targets digital NPUs. The compiler uses the collected training data to train a multi-layer perceptron neural network, choosing a network topology, i.e., the number of neurons and their connectivity, and taking a gradient descent approach to find the synaptic weights of the network while minimizing the error with respect to the training data. This compilation stage performs a neural topology search to find the smallest neural network that (a) adheres to the organization of the analog circuit and (b) delivers acceptable accuracy at the application level. The network training algorithm, which finds optimal synaptic weights, uses a combination of a resilient backpropagation algorithm, RPROP [30], that we found to outperform traditional backpropagation for restricted network topologies, and a continuous-discrete learning method, CDLM [10], that attempts to correct for error due to limited-precision computation. Section 5 describes these techniques that address analog limitations.

3) Code generation for hybrid analog-digital execution.
Similar to prior work [18], in the code generation phase, the compiler replaces each invocation of the original code with code that initiates a computation on the analog neural accelerator. Similar ISA extensions are used to specify the neural network topology, send input and weight values to the A-NPU, and retrieve computed outputs from the A-NPU.

3. Analog Circuits for Neural Computation
This section describes how analog circuits can perform the computation of neurons in multi-layer perceptrons, which are widely used neural networks. We also discuss, at a high level, how limitations of the analog circuits manifest in the computation. We explain how these restrictions are exposed to the compilation framework. The next section presents a concrete design for the analog neural accelerator.

As Figure 2a illustrates, each neuron in a multi-layer perceptron takes in a set of inputs (xi) and performs a weighted sum of those input values (∑i xi wi). The weights (wi) are determined by training the neural network. After the summation stage, which produces a linear combination of the weighted inputs, the neuron applies a nonlinearity function, the sigmoid, to the result of the summation.

[Figure 2: One neuron and its conceptual analog circuit.]

Figure 2b depicts a conceptual analog circuit that performs the three necessary operations of a neuron: (1) scaling inputs by weights (xi wi), (2) summing the scaled inputs (∑i xi wi), and (3) applying the nonlinearity function (sigmoid). This conceptual design first encodes the digital inputs (xi) as analog current levels (I(xi)). Then, these current levels pass through a set of variable resistances whose values (R(wi)) are set proportional to the corresponding weights (wi). The voltage level at the output of each resistance, I(xi)R(wi), is proportional to xi wi. These voltages are then converted to currents that can be summed quickly according to Kirchhoff's current law (KCL). Analog circuits only operate linearly within a small range of voltage and current levels, outside of which the transistors enter saturation mode with I-V characteristics similar in shape to a non-linear sigmoid function. Thus, at a high level, the non-linearity is naturally applied to the result of the summation when the final voltage reaches the analog-to-digital converter (ADC). Compared to a digital implementation of a neuron, which requires multipliers, adder trees, and sigmoid lookup tables, the analog implementation leverages the physical properties of the circuit elements and is orders of magnitude more efficient. However, it operates in limited ranges and therefore offers limited precision.

Analog-digital boundaries. The conceptual design in Figure 2b draws the analog-digital boundary at the level of an algorithmic neuron. As we will discuss, the analog neural accelerator will be a composition of these analog neural units (ANUs). However, an alternative design, primarily optimizing for efficiency, may lay out the entirety of a neural network with only analog components, limiting the D-to-A and A-to-D conversions to the inputs and outputs of the neural network and not the individual neurons. The overhead of conversions in the ANUs significantly limits the potential efficiency gains of an analog approach toward neural computation.
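To make the conceptual circuit concrete, the following sketch is a behavioral model of the Figure 2b neuron, written as our own illustration rather than the paper's circuit model: full-scale constants are arbitrary units and a logistic curve stands in for the measured saturation nonlinearity.

    # Behavioral sketch of the conceptual analog neuron in Figure 2b (illustration only).
    # Inputs become currents, weights set resistances, Ohm's law performs each multiply,
    # Kirchhoff's current law performs the sum, and saturation plays the role of the sigmoid.
    import math

    def analog_neuron(xs, ws, i_full_scale=1.0, r_full_scale=1.0, gain=4.0):
        """Approximate y = sigmoid(sum_i xi*wi) via the I(x)*R(w) formulation."""
        currents = [x * i_full_scale for x in xs]            # D/A: encode inputs as currents I(xi)
        resistances = [w * r_full_scale for w in ws]         # weights set resistances R(wi)
        v_sum = sum(i * r for i, r in zip(currents, resistances))  # Ohm's law + KCL summation
        # Transistor saturation applies the nonlinearity "for free"; a logistic curve models it here.
        return 1.0 / (1.0 + math.exp(-gain * v_sum))         # A/D: digitize the saturated output

Note that each such unit pays a D/A conversion per input and an A/D conversion at its output, which is exactly the per-neuron boundary whose cost and benefits are weighed next.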
However, there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, limiting potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff: achieving acceptable accuracy for code approximation while still delivering higher performance and efficiency when analog neural circuits are used for general-purpose code acceleration. As prior work [18] has shown and we corroborate, regions of code from different applications require different neural-network topologies. While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, as the optimal network topology varies with the application. Additionally, routing analog signals among neurons and the limited capability of analog circuits for buffering signals negatively impact accuracy and make the circuit susceptible to noise. In order to provide additional flexibility, we set the digital-analog boundary in conjunction with an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights is converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. ANUs can be composed, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.

One of the most prevalent limitations in analog design is the bounded range of currents and voltages within which the circuits can operate effectively. These range limitations restrict the bit widths of input and weight values and the network topologies that can be computed accurately and efficiently. We expose these limitations to the compiler, and our custom training algorithm and compilation workflow consider these restrictions when searching for optimal network topologies and training neural networks. As we will show, one of the insights from this work is that even with limited bit widths (≤ 8) and a restricted neural topology, many general-purpose approximate applications achieve acceptable accuracy and benefit significantly from mixed-signal neural acceleration.

Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit width of inputs and weights. Increasing the number of bits results in an exponential increase in ADC and DAC energy dissipation and can significantly limit the benefits of analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities finer than practical ADCs can handle. In addition, the fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when larger bit widths are used for weights and inputs.
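The back-of-the-envelope arithmetic below illustrates this range pressure; the 1 V full-scale swing is an assumed number for illustration, not a measured circuit value.

    # Assumed numbers: a fixed analog swing divided into 2**bits levels, with n_inputs
    # weighted inputs sharing the same swing after summation.
    def step_mv(full_scale_v=1.0, bits=8, n_inputs=1):
        return 1e3 * full_scale_v / (2 ** bits * n_inputs)

    print(step_mv(bits=8))                # ~3.9 mV per level for a single 8-bit value
    print(step_mv(bits=8, n_inputs=8))    # ~0.49 mV once eight weighted inputs share the range
    print(step_mv(bits=12, n_inputs=8))   # ~0.03 mV: finer encodings quickly become impractical

Once the step shrinks to a small fraction of a millivolt, noise and device variation dominate the encoding, which is consistent with the eight-bit, eight-input design point described below.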
We design our ANUs such that the granularity of the voltage and current levels used for information encoding is, to a large degree, robust to variations and noise.

Topology restrictions. Another important design choice is the number of inputs to the ANU. Similar to bit width, increasing the number of ANU inputs translates to encoding a larger value range in a bounded voltage and current range, which, as discussed, becomes impractical. There is a tradeoff between accuracy and efficiency in choosing the number of ANU inputs. The larger the number of inputs, the larger the number of multiply and add operations that can be done in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. A digital implementation does not impose such restrictions on the number of inputs to the hardware neuron, and it can potentially compute arbitrary topologies of neural networks. However, this unique ANU limitation restricts the topology of the neural networks that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.

Non-ideal sigmoid. The saturation behavior of the analog circuit that leads to sigmoid-like behavior after the summation stage represents an approximation of the ideal sigmoid. We measure this behavior at the circuit level and expose it to the compiler and the training algorithm.

4. Mixed-Signal Neural Accelerator (A-NPU)
This section describes a concrete ANU design and the mixed-signal neural accelerator, the A-NPU.

4.1. ANU Circuit Design
Figure 3 illustrates the design of a single analog neuron (ANU). The ANU performs the computation of one neuron, or y ≈ sigmoid(∑i wi xi). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form. In the figure, swi and sxi represent the sign bits, and wi and xi represent the magnitudes.

[Figure 3: A single analog neuron (ANU).]

Digital input values are converted to the analog domain through current-steering DACs that translate digital values to analog currents. Current-steering DACs are used for their speed and simplicity. In Figure 3, I(|xi|) is the analog current that represents the magnitude of the input value xi. Digital weight values control resistor-string ladders that create a variable resistance depending on the magnitude of each weight, R(|wi|). We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches.
The digital weight bits control the switches, adjusting the effective resistance, R(|wi|), seen by the input current, I(|xi|). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage: V(|wi xi|) = I(|xi|) × R(|wi|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits. This resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.

V(|wi xi|), along with the XOR of the weight and input sign bits, feeds a differential pair that converts the voltage to two differential currents, I+(wi xi) and I−(wi xi), that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|wi xi|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I+(wi xi) = Ibias/2 + ∆I and I−(wi xi) = Ibias/2 − ∆I. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs, and consequently a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value to a single wire to be efficiently summed according to Kirchhoff's current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages.

Resistors convert the resulting pair of differential currents to voltages, V+(∑i wi xi) and V−(∑i wi xi), that represent the weighted sum of the inputs to the ANU. These voltages are used as input to an additional amplification stage (implemented as a current-mode differential amplifier with a diode-connected load). The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function. Our experiments show that neural networks are sensitive to the steepness of this non-linear function, losing accuracy with shallower non-linear activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain, as a small range of interest must be mapped to a much larger output range in accordance with ADC input range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain, but also allowing for saturation with inputs outside of this range. The amplified voltage is used as input to an ADC that converts the analog voltage to a digital value. We chose a flash ADC design (named for its speed), which consists of a set of reference voltages and comparators [1, 31]. The ADC requires 2^n comparators, where n is the number of digital output bits. Flash ADC designs are capable of converting 8 bits at a frequency on the order of one GHz.
We require 2–3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages to capture the behavior of a sigmoid function, which also improves the accuracy of the analog sigmoid.

4.2. Reconfigurable Mixed-Signal A-NPU
We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural-network topologies, since each application requires a different topology. Figure 4 illustrates the A-NPU design with some details omitted for clarity. The figure shows four ANUs, while the actual design has eight.

[Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs.]

The A-NPU is a time-multiplexed architecture in which the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm whose schedule is loaded onto the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer. The Input Buffer always contains the inputs to the neurons, either coming from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs. When all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the Input Buffer will be fed to the ANUs, and the Column Selector determines which column of the Output Buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures. Each ANU has eight inputs. As shown in Figure 4, each ANU is augmented with a dedicated weight buffer storing the 8-bit weights. The weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights for the ANUs based on the preloaded order.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that comprises the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons of the neural network, in which the neurons will be computed on the ANUs. The scheduler then takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs; (2) assign an order to the neurons; (3) assign an order to the weights; (4) generate the order in which inputs feed the ANUs; and (5) generate the order in which the outputs will be written to the Output Buffer. The scheduler also assigns a unique order in which the core communicates the inputs and outputs of the neural network with the A-NPU.

4.3. Architectural interface for A-NPU
We adopt the same FIFO-based architectural interface through which a digital NPU communicates with the processor [18]. The A-NPU is tightly integrated into the processor pipeline. The processor communicates with the A-NPU only through the Input, Output, and Config FIFOs. The processor ISA is extended with special instructions that can enqueue and dequeue data to and from these FIFOs, as shown in Figure 4.
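As a functional illustration of this interface, the sketch below models the enqueue-inputs, invoke, dequeue-outputs pattern that the compiler emits at each call site. The class and method names are ours; the paper only specifies that the ISA gains enqueue/dequeue instructions for these queues.

    # Functional stand-in for the A-NPU as seen from the core (illustration, not the ISA/RTL).
    from collections import deque

    class ANPUModel:
        def __init__(self, network, n_inputs):
            self.network = network                 # trained, topology-restricted network (preloaded config)
            self.n_inputs = n_inputs
            self.input_fifo, self.output_fifo = deque(), deque()

        def invoke(self):
            # Once the expected inputs are queued, run the configured network layer by layer
            # and deposit the outputs in the output FIFO.
            xs = [self.input_fifo.popleft() for _ in range(self.n_inputs)]
            self.output_fifo.extend(self.network(xs))

    # Compiler-generated call site, conceptually: enqueue inputs, invoke, dequeue outputs.
    anpu = ANPUModel(network=lambda xs: [sum(xs) / len(xs)], n_inputs=3)  # stand-in network
    for x in (0.25, 0.5, 0.75):
        anpu.input_fifo.append(x)      # "enqueue" instruction
    anpu.invoke()
    y = anpu.output_fifo.popleft()     # "dequeue" instruction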
When a data value is enqueued to or dequeued from the Input/Output FIFOs, the A-NPU converts the value to the appropriate representation for the A-NPU or the processor.

5. Compilation for Analog Acceleration
As Figure 1 illustrates, the compilation for A-NPU execution consists of three stages: (1) profile-driven data collection, (2) training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven data collection stage, the compiler instruments the application to collect the inputs and outputs of the approximable functions. The compiler then runs the application with representative inputs and collects the inputs and their corresponding outputs. These input-output pairs constitute the training data. Section 4 briefly discussed ISA extensions and code generation. While compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [18], the training phase is unique to an analog approach, accounting for analog-imposed topology restrictions and adjusting weight selection to compensate for limited-precision computation.

Hardware/software interface for exposing analog circuits to the compiler. As discussed in Section 3, we expose the analog circuit restrictions to the compiler through a hardware/software interface that captures the following circuit characteristics: (1) input bit-width limitations, (2) weight bit-width limitations, (3) the limited number of inputs to each analog neuron (the topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to an analog implementation.

Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptron neural networks use a gradient descent approach to minimize the average network error over a set of training input-output pairs, backpropagating the output error through the network and iteratively adjusting the weight values to minimize that error. Training techniques that do not consider limited-precision inputs, weights, and outputs, however, perform poorly when these values are saturated to adhere to the bit widths that are feasible for an implementation in the analog domain. Simply limiting weight values during training is also detrimental to achieving quality outputs because the algorithm does not have sufficient precision to converge to a quality solution. To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [10]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM first trains a fully precise network according to a standard training algorithm, such as backpropagation [43]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification. The algorithm calculates the new error, backpropagates that error through the fully precise network using full-precision computation, and updates the weight values using the same algorithm as in the first phase. This process repeats, backpropagating the 'discrete' errors through a precise network.
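The sketch below condenses this two-phase loop for a single sigmoid neuron, used here as a stand-in for the full multi-layer network. The 8-bit quantizer, the value range, and the plain gradient update are assumptions for illustration rather than the paper's implementation.

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def quantize(v, bits=8, lo=-1.0, hi=1.0):
        # Snap a value to one of 2**bits levels over the exposed analog range.
        step = (hi - lo) / (2 ** bits - 1)
        return lo + round((min(max(v, lo), hi) - lo) / step) * step

    def cdlm_train(data, n_inputs, lr=0.1, epochs=200):
        w = [random.uniform(-0.1, 0.1) for _ in range(n_inputs)]
        for epoch in range(epochs):
            discrete_phase = epoch >= epochs // 2           # phase 2: model the analog target
            for x, t in data:
                z_precise = sum(wi * xi for wi, xi in zip(w, x))
                y_precise = sigmoid(z_precise)
                if discrete_phase:
                    zq = sum(quantize(wi) * quantize(xi) for wi, xi in zip(w, x))
                    y_out = quantize(sigmoid(zq))           # limited-precision forward pass
                else:
                    y_out = y_precise
                # Error is measured on the (possibly discretized) output, but propagated
                # through the full-precision model and applied to full-precision weights.
                delta = (y_out - t) * y_precise * (1.0 - y_precise)
                w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
        return w

The key property is that errors are always measured against the discretized output yet applied to the full-precision weights, so training retains enough precision to converge.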
The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights. We customize this algorithm by incorporating the input bit-width limitation and the output bit-width limitation in addition to the limited weight values. Additionally, this training scheme is advantageous for an analog implementation because it is general enough to also make up for errors that arise from an analog implementation, such as a non-ideal sigmoid function and any other analog non-ideality that behaves consistently. In essence, after one round of full-precision training, the compiler models an analog-like version of the network. A second, CDLM-based training pass adjusts for these analog-imposed errors, maintaining acceptable accuracy and generality and thereby making the inaccurate and limited A-NPU a beneficial NPU implementation option.

Training with topology restrictions. In addition to determining weight values for a given network topology, the compiler searches the space of possible topologies to find an optimal network for a given approximable code region. Conventional multi-layer perceptron networks are fully connected, i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases the network cannot be fully connected. We impose the circuit restriction on the connectivity between neurons during the topology search, and we use a simple algorithm guided by the mean-squared error of the network to determine the best topology given the exposed restriction. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set (70% of the data) and a test set (the remaining 30%). The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers. We also limit the number of neurons per hidden layer to powers of two, up to 32. The numbers of neurons in the input and output layers are predetermined based on the number of inputs and outputs of the candidate function.

To further improve accuracy and compensate for topology-restricted networks, we utilize a Resilient Back Propagation (RPROP) [30] training algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values in proportion to the backpropagated error (as in conventional backpropagation [43]), the RPROP algorithm increases or decreases each weight value by a per-weight step size based only on the sign of the backpropagated error gradient. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half as many training epochs as backpropagation to converge to a quality solution. The main advantage of applying RPROP training to an analog approach to neural computing is its robustness to the sigmoid function and topology restrictions imposed by the analog design.
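For reference, the standard RPROP update [30] can be sketched as follows; this is our own rendering with the commonly used default constants, not values reported in the paper.

    # Per-weight RPROP update: only the sign of each partial derivative is used, and each
    # weight keeps its own step size that grows while the gradient sign is stable and
    # shrinks when it flips.
    ETA_PLUS, ETA_MINUS = 1.2, 0.5
    STEP_MIN, STEP_MAX = 1e-6, 50.0

    def sign(v):
        return (v > 0) - (v < 0)

    def rprop_update(weight, grad, prev_grad, step):
        """Return (new_weight, grad_to_remember, new_step) for one weight."""
        if grad * prev_grad > 0:                    # same direction as last time: accelerate
            step = min(step * ETA_PLUS, STEP_MAX)
            weight -= sign(grad) * step
        elif grad * prev_grad < 0:                  # overshot a minimum: back off
            step = max(step * ETA_MINUS, STEP_MIN)
            grad = 0.0                              # skip adaptation on the next iteration
        else:                                       # sign unknown (first step or after a flip)
            weight -= sign(grad) * step
        return weight, grad, step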
Backpropagation, for example, is extremely sensitive to the steepness of the sigmoid function, and allowing for a variety of steepness levels in a fixed analog implementation is challenging. Additionally, backpropagation performs poorly with a shallow sigmoid function. The requirement of a steep sigmoid function exacerbates analog range challenges, possibly making the implementation infeasible. RPROP tolerates a shallower sigmoid activation and performs consistently with a constant activation steepness across all applications. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm.

6. Evaluations
Cycle-accurate simulation and energy modeling. We use the MARSSx86 x86-64 cycle-accurate simulator [39] to model the performance of the processor. The processor is modeled after a single-core Intel Nehalem to evaluate the performance benefits of A-NPU acceleration over an aggressive out-of-order architecture¹. We extended the simulator to include ISA-level support for the A-NPU enqueue and dequeue instructions. We also augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [18]. We use GCC v4.7.3 with -O3 to enable compiler optimization. The baseline in our experiments is each benchmark run solely on the processor without the neural transformation. We use McPAT [33] for processor energy estimation. We model the energy of the 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [37], and [22]. Both the D-NPU and the processor operate at 3.4 GHz at 0.9 V, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.

Circuit design for the ANU. We built a detailed transistor-level SPICE model of the analog neuron, the ANU. We designed and simulated the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [6]. We ran detailed Spectre SPICE simulations to understand circuit behavior and measure ANU energy consumption. We used CACTI to estimate the energy of the A-NPU buffers. The evaluation considers all A-NPU components, both digital and analog. For the analog parts, we used direct measurements from the transistor-level SPICE simulations. For SRAM accesses, we used CACTI. We built an A-NPU cycle-accurate simulator to evaluate the performance improvements. Similar to McPAT, we combined simulation statistics with measurements from SPICE and CACTI to calculate A-NPU energy. To avoid biasing our study toward analog designs, all energy and performance comparisons are to an 8-bit, fixed-point D-NPU (8-bit inputs/weights/multiply-adders).
For consistency with the available McPAT model for the baseline processor, we used McPAT and CACTI to estimate D-NPU energy.

¹ Processor: Fetch/Issue Width: 4/5, INT ALUs/FPUs: 6/6, Load/Store FUs: 1/1, ROB Entries: 128, Issue Queue Entries: 36, INT/FP Physical Registers: 256/256, Branch Predictor: Tournament 48 KB, BTB Sets/Ways: 1024/4, RAS Entries: 64, Load/Store Queue Entries: 48/48, Dependence Predictor: 4096-entry Bloom Filter, ITLB/DTLB Entries: 128/256; L1: 32 KB Instruction, 32 KB Data, Line Width: 64 bytes, 8-Way, Latency: 3 cycles; L2: 256 KB, Line Width: 64 bytes, 8-Way, Latency: 6 cycles; L3: 2 MB, Line Width: 64 bytes, 16-Way, Latency: 27 cycles; Memory Latency: 50 ns.

Even though we do not have a fabrication-ready layout for the design, in Table 2 we provide an estimate of the ANU area in terms of the number of transistors, where T denotes a transistor with width/length = 1. As shown, each ANU (which performs eight 8-bit analog multiply-adds in parallel followed by a sigmoid) requires about 10,964 T. An equivalent digital neuron that performs eight 8-bit multiply-adds and a sigmoid would require about 72,456 T, of which 56,000 T are for the eight 8-bit multiply-adds and 16,456 T for the sigmoid lookup. With the same compute capability, the analog neuron requires 6.6× fewer transistors than its equivalent digital implementation.

Table 2: Area estimates for the analog neuron (ANU).
  8×8-bit DAC: 3,096 T∗
  8×Resistor Ladder (8-bit weights): 4,096 T + 1 KΩ (≈ 450 T)
  8×Differential Pair: 48 T
  I-to-V Resistors: 20 KΩ (≈ 30 T)
  Differential Amplifier: 244 T
  8-bit ADC: 2,550 T + 1 KΩ (≈ 450 T)
  Total: ≈ 10,964 T
  ∗ Transistor with width/length = 1.

Benchmarks. We use the benchmarks in [18] and add one more, blackscholes. These benchmarks represent a diverse set of application domains, including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing. Table 1 summarizes information about each benchmark: application domain, target code, neural-network topology, training/test data, and final application error levels for fully digital neural networks and for analog neural networks using our customized RPROP-based CDLM training algorithm. The neural networks were trained using either typical program inputs, such as sample images, or a limited number of random inputs. Accuracy results are reported using an independent data set, e.g., an input image that is different from the image used during training. Each benchmark requires an application-specific error metric, which is used in our evaluations. As shown in Table 1, each application benefits from a different neural-network topology, so the ability to reconfigure the A-NPU is critical.

[Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network (benchmark name, description, domain, number of function calls, loops, and if/else statements, number of x86-64 instructions, training and evaluation input sets, neural-network topology, digital and analog network MSE, application error metric, and fully digital vs. analog application error).]

A-NPU vs. 8-bit D-NPU. Figure 5 shows the average energy improvement and speedup for one invocation of an A-NPU over one invocation of an 8-bit D-NPU, where the A-NPU is clocked at 1/3 the D-NPU frequency. On average, the A-NPU is 12.1× more energy efficient and 3.3× faster than the D-NPU. While consuming significantly less energy, the A-NPU can perform 64 multiply-adds in parallel, while the D-NPU can only perform eight. This energy-efficient, parallel computation explains why jpeg, with the largest neural network (64→16→8→64), achieves the highest energy and performance improvements, 82.2× and 15.2×, respectively. The larger the network, the higher the benefits from the A-NPU. Compared to a D-NPU, an A-NPU offers a higher level of parallelism at low energy cost, which can potentially enable using larger neural networks to replace more complicated code.

[Figure 5: A-NPU with 8 ANUs vs. D-NPU with 8 PEs.]

Whole-application speedup and energy savings. Figure 6 shows the whole-application speedup and energy savings when the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 6c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefits from neural acceleration.
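As a quick sanity check on that relationship, whole-application speedup can be estimated from the fraction of dynamic instructions subsumed and the speedup of the offloaded region; the numbers below are illustrative, not measurements from the paper.

    # Amdahl-style bound on whole-application speedup from neural acceleration.
    def whole_app_speedup(subsumed_fraction, region_speedup):
        return 1.0 / ((1.0 - subsumed_fraction) + subsumed_fraction / region_speedup)

    print(whole_app_speedup(0.60, 10.0))   # 60% subsumed  -> ~2.2x at best
    print(whole_app_speedup(0.95, 10.0))   # 95% subsumed  -> ~6.9x
    print(whole_app_speedup(0.95, 1e9))    # even an ideal accelerator tops out near 20x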
Geometric mean speedup and energy savings with an A-NPU are 3.7× and 6.3×, respectively, which is 48% and 24% better than an 8-bit, 8-PE D-NPU. Among the benchmarks, kmeans sees a slowdown with both D-NPU- and A-NPU-based acceleration. All benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×, and the energy savings range from 1.3× to 51.2×. As the results show, the savings with an A-NPU closely follow the ideal case; in terms of energy, there is little value in designing a more sophisticated A-NPU. This result is due to the fact that the energy cost of executing instructions in the von Neumann, out-of-order pipeline is much higher than that of performing simple multiply-adds in the analog domain. Using physical laws (Ohm's law for multiplication and Kirchhoff's current law for summation) and the analog properties of devices to perform computation can lead to significant energy and performance benefits.

[Figure 6: Whole-application speedup and energy saving with a D-NPU, an A-NPU, and an ideal NPU that consumes zero energy and takes zero cycles for neural computation. (a) Whole-application speedup. (b) Whole-application energy saving. (c) Percentage of dynamic instructions subsumed.]

Application error. Table 3 shows the application-level errors with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and our A-NPU, which incorporates the non-idealities of the analog sigmoid. Except for jmeint, which shows error above 10%, all of the applications show error less than or around 10%. Application average error rates with the A-NPU range from 4.1% to 10.2%.
[Table 3: Error with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and an A-NPU with the non-ideal analog sigmoid.]

This quality-of-result loss is commensurate with other work on quality trade-offs. Among digital hardware approximation techniques, Truffle [17] and EnerJ [45] show similar error (3–10%) for some applications and much greater error (above 80%) for others in a moderate configuration. Green [3] has error rates below 1% for some applications but greater than 20% for others. A case study [36] explores manual optimizations of the x264 video encoder that trade off 0.5–10% quality loss. As expected, the quality-of-results degradation with an A-NPU is greater than with a floating-point D-NPU. However, the quality losses are commensurate with digital approximate computing techniques.

To study the application-level quality loss in more detail, Figure 7 illustrates the CDF (cumulative distribution function) plot of the final error for each element of each application's output. Each benchmark's output consists of a collection of elements: an image consists of pixels, a vector consists of scalars, etc. This CDF reveals the distribution of error among an application's output elements and shows that only a small fraction of the output elements see large quality loss with analog acceleration. The majority (80% to 100%) of each application's output elements have error less than 10%, except for jmeint.

[Figure 7: CDF plot of application output error. A point (x, y) indicates that y% of the output elements see error ≤ x%.]

Exposing circuit limitations to the compiler. Figure 8 shows the effect of bit-width restrictions on application-level error, assuming eight inputs per neuron. As the results suggest, exposing the bit-width limitations and the topology restrictions to the compiler enables our RPROP-based, customized CDLM training algorithm to find and train neural networks that achieve accuracy levels commensurate with digital approximation techniques, using only eight bits of precision for inputs, outputs, and weights and eight inputs to the analog neurons. Several applications show less than 10% error even with fewer than eight bits. The results show that there are many applications that can significantly benefit from analog acceleration without significant output quality loss.
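The per-element error measurement behind these results and the CDF of Figure 7 can be summarized with a short sketch. The relative-error metric used here is an assumption for illustration; each benchmark actually defines its own application-specific metric (image difference, miss rate, average relative error, and so on).

    # Per-element error and the fraction of elements at or below an error threshold.
    def element_errors(reference, approximate):
        return [abs(a - r) / (abs(r) + 1e-12) for r, a in zip(reference, approximate)]

    def fraction_below(errors, threshold):
        return sum(e <= threshold for e in errors) / len(errors)

    ref = [1.0, 2.0, 4.0, 8.0]          # precise outputs (illustrative)
    approx = [1.05, 1.9, 4.4, 8.1]      # A-NPU outputs (illustrative)
    errs = element_errors(ref, approx)
    print(fraction_below(errs, 0.10))   # fraction of output elements with error <= 10%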
The results show that there are many applications that can significantly benefit from analog acceleration without significant output quality loss.

[Table 3: Error with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and an A-NPU with a non-ideal sigmoid.]

[Figure 8: Application error with limited bit-width analog neural computation.]

Limited analog acceleration. We examine the effects on the benefits when, due to noise or pathological inputs, only a fraction of the invocations are offloaded to the A-NPU. In this case, the application falls back to the original code for the remaining invocations. Figure 9 depicts the application speedup and energy improvement when only 80%, 85%, 90%, 95%, and 100% of the invocations are offloaded to the A-NPU. The results suggest that even limited analog accelerators can provide significant energy and performance improvements.

[Figure 9: Speedup and energy saving with limited A-NPU invocations. (a) Total application speedup with limited A-NPU invocations. (b) Total application energy saving with limited A-NPU invocations.]
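These trends can be reasoned about with a simple Amdahl-style model; the formulation below is an illustrative approximation of our own, not the model used by the simulation infrastructure. If a fraction c of the original execution time lies in the approximable region, the A-NPU accelerates that region by a factor s, and a fraction f of the invocations are offloaded, then

\[ \text{Speedup} \approx \frac{1}{(1 - f\,c) + \dfrac{f\,c}{s}}, \]

so the benefit degrades gracefully as f falls from 100% toward 80%; an analogous expression with the energy ratio in place of s approximates the energy savings.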
7. Limitations and Considerations

Applicability. Not all applications can benefit from analog acceleration; however, our work shows that there are many that can. More rigorous optimization at the circuit level, as well as broadening the scope of application coverage through continued advancements in the neural transformation step, may provide significant improvements in accuracy and generality.

Other design points. This study evaluates the performance and energy improvements of an A-NPU assuming integration with a modern, high-performance processor. If low-power cores are used instead, we expect to see, and preliminary results confirm, that the performance benefits of an A-NPU increase and that the energy benefits decrease.

Variability and noise. We designed the circuit with variability and noise as first-order concerns, and we made several design decisions to mitigate them. We limit the input and weight bit widths, as well as the analog neuron input count, to eight to provide quantization margins for variation and noise. We designed the sigmoid circuit to be one order of magnitude more shallow than the digital implementation to provide additional margins for variation and noise. We used a differential design, which provides resilience to noise by representing a value as the difference between two signals; because noise affects the pair of nearby signals similarly, the difference between them remains intact and the computation stays correct. Conversion to the digital domain after each analog neuron computation enforces computation integrity and reduces susceptibility to variation and noise, while incurring energy and speed overheads. As mentioned in Section 6, to further improve the quality of the final result, we can refrain from A-NPU invocations and fall back to the original code as needed. An online noise-monitoring system could potentially limit the invocation of the A-NPU to low-noise situations. Incorporating a quantitative noise model into the training algorithm may improve robustness to analog noise.

Training for variability. A neural approach to approximate computing presents the opportunity to correct for certain types of analog-imposed inaccuracy, such as process variation, non-linearity, and other forms of non-ideality that are consistent for executions on a particular A-NPU hardware instance over some period of time. After an initial training phase that accounts for the predictable, compiler-exposed analog limitations, a second (shorter) training phase can adjust for hardware-specific non-idealities, sending training inputs and outputs to the A-NPU and adjusting network weights to minimize error. This correction technique is able to address inter- and intra-chip process variation and hardware-dependent, non-ideal analog behavior.
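As an illustration of this second, hardware-specific phase, the minimal Python sketch below fine-tunes weights against outputs measured on a particular A-NPU instance. The a_npu_evaluate hook and the weight-perturbation update are hypothetical stand-ins (a measurement-based rule avoids needing an analytic model of the non-ideal hardware); they are not the training rule evaluated in this paper.

import numpy as np

def fine_tune(w, inputs, targets, a_npu_evaluate, steps=1000, delta=0.01, lr=0.05):
    # Adjust weights using outputs measured on one physical A-NPU instance.
    # a_npu_evaluate(w, x) is a hypothetical hook that programs the weights,
    # runs the analog hardware on input x, and returns the digitized output.
    for _ in range(steps):
        i = np.random.randint(len(inputs))
        x, t = inputs[i], targets[i]
        base_err = (a_npu_evaluate(w, x) - t) ** 2
        # Weight perturbation: estimate the error gradient empirically so that
        # device-specific non-idealities are captured by the measurement itself.
        probe = np.random.choice([-delta, delta], size=w.shape)
        pert_err = (a_npu_evaluate(w + probe, x) - t) ** 2
        w = w - lr * (pert_err - base_err) / probe
    return w

Any adjusted weights would presumably still need to be re-quantized to the supported 8-bit levels before being programmed into the accelerator.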
Smaller technology nodes. This work is the start of using analog circuits for code acceleration. Providing benefits at smaller technology nodes may require using larger transistors for the analog parts, trading off area for resilience. Energy-efficient performance is growing in importance relative to area efficiency, especially as CMOS scaling benefits continue to diminish.

8. Related Work

This research lies at the intersection of (a) general-purpose approximate computing, (b) accelerators, (c) analog and digital neural hardware, (d) neural-based code acceleration, and (e) limited-precision learning. This work combines techniques from all of these areas to provide a compilation workflow and an architecture/circuit design that enable code acceleration with limited-precision, mixed-signal neural hardware. In each area, we discuss the key related work that inspired our work.

General-purpose approximate computing. Several studies have shown that diverse classes of applications are tolerant to imprecise execution [20, 54, 34, 12, 45]. A growing body of work has explored relaxing the abstraction of full accuracy at the circuit and architecture level for gains in performance, energy, and resource utilization [13, 17, 44, 2, 35, 8, 28, 38, 18, 51, 46]. These circuit and architecture studies, although successful, are limited to purely digital techniques. We explore how a mixed-signal, analog-digital approach can go beyond what digital approximation techniques offer.

Accelerators. Research on accelerators seeks to synthesize efficient circuits or FPGA configurations to accelerate general-purpose code [41, 42, 11, 19, 29]. Similarly, static specialization has shown significant efficiency gains for irregular and legacy code [52, 53]. More recently, configurable accelerators have been proposed that allow the main CPU to offload certain code to a small, efficient structure [23, 24]. This paper extends the prior work on digital accelerators with a new class of mixed-signal, analog-digital accelerators.

Analog and digital neural hardware. There is an extensive body of work on hardware implementations of neural networks in both the digital [40, 15, 55] and analog [5, 47, 49, 32, 48] domains. Recent work has proposed higher-level abstractions for the implementation of neural networks [27]. Other work has examined fault-tolerant hardware neural networks [26, 50]. In particular, Temam [50] uses datasets from the UCI machine learning repository [21] to explore the fault tolerance of a hardware neural network design. In contrast, our compilation and neural-network selection/training framework and our architecture design aim at applying neural networks to general-purpose code written in familiar programming models and languages, not code explicitly written to utilize neural networks directly.

Neural-based code acceleration. A recent study [9] shows that a number of applications can be manually reimplemented with explicit use of various kinds of neural networks. That study did not prescribe a programming workflow or a preferred hardware architecture. More recent work exposes analog spiking neurons as primitive operators [4]. That work devises a new programming model that allows programmers to express digital signal-processing applications as a graph of analog neurons and automatically maps the expressed graph onto tiled analog, spiking-neural hardware. The work in [4] is restricted to the domain of applications whose inputs are real-world signals that should be encoded as pulses.
Our approach addresses the long-standing challenges of using analog computation (programmability and generality) by not imposing domain-specific limitations, and by providing analog circuitry that is integrated with a conventional digital processor in a way that does not require a new programming paradigm.

Limited-precision learning. The work in [14] provides a complete survey of learning algorithms that consider limited-precision neural hardware implementations. We tried various algorithms, but we found that CDLM [10] was the most effective. More sophisticated limited-precision learning techniques could improve the quality results reported in this paper and further confirm the feasibility and effectiveness of the mixed-signal approach to neural-based code acceleration.

9. Conclusions

For decades, before the effective end of Dennard scaling, we consistently improved performance and efficiency while maintaining generality in general-purpose computing. As the benefits from scaling diminish, the community faces an iron triangle: we can choose any two of performance, efficiency, and generality at the expense of the third. Solutions that improve performance and efficiency while retaining as much generality as possible are therefore growing in importance.

Analog circuits inherently trade accuracy for significant gains in energy efficiency. However, it is challenging to utilize them in a way that is both programmable and generally useful. As this paper showed, the neural transformation of general-purpose approximable code provides an avenue for realizing the benefits of analog computation while targeting code written in conventional languages. This work provided an end-to-end solution, from circuits to compiler design, for utilizing analog circuits to accelerate approximate applications. The insights from this work show that it is crucial to expose analog circuit characteristics to the compilation and neural-network training phases. The NPU model offers a way to exploit analog efficiencies, despite their challenges, for a wider range of applications than is typically possible. Further, mixed-signal execution delivers much larger savings for NPUs than purely digital implementations.

However, this study is not conclusive. The full range of applications that can exploit mixed-signal NPUs is still unknown, as is whether it will be sufficiently large to drive adoption in high-volume microprocessors. It is still an open question how developers might reason about the acceptable level of error when an application undergoes approximate execution that includes analog acceleration. Finally, in a noisy, high-performance microprocessor environment, it remains to be shown that an analog NPU would not be adversely affected. Nevertheless, the significant gains from A-NPU acceleration and the diversity of the studied applications suggest a potentially promising path forward.

Acknowledgments

This work was supported in part by gifts from Microsoft Research and Google, NSF award CCF-1216611, and by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

References

[1] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design. Oxford University Press, 2002. [2] C. Alvarez, J. Corbal, and M. Valero, “Fuzzy memoization for floating-point multimedia applications,” IEEE TC, 2005. [3] W. Baek and T. M. Chilimbi, “Green: A framework for supporting energy-conscious programming using controlled approximation,” in PLDI, 2010. [4] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O.
Temam, “Continuous real-world inputs can open up alternative accelerator designs,” in ISCA, 2013. [5] B. E. Boser, E. Säckinger, J. Bromley, Y. L. Cun, L. D. Jackel, and S. Member, “An analog neural network processor with programmable topology,” JSSC, 1991. [6] Y. Cao, “Predictive technology models,” 2013. Available: http: //ptm.asu.edu [7] M. Carbin, S. Misailovic, and M. C. Rinard, “Verifying quantitative reliability for programs that execute on unreliable hardware,” in OOPSLA, 2013. [8] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) SOC architectures based on probabilistic CMOS (PCMOS) technology,” in DATE, 2006. [9] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, “BenchNN: On the broad potential application scope of hardware neural network accelerators,” in IISWC, 2012. [10] F. Choudry, E. Fiesler, A. Choudry, and H. J. Caulfield, “A weight discretization paradigm for optical neural networks,” in ICOE, 1990. [11] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner, “Applicationspecific processing on a general-purpose core via transparent instruction set customization,” in MICRO, 2004. [12] M. de Kruijf and K. Sankaralingam, “Exploring the synergy of emerging workloads and silicon reliability trends,” in SELSE, 2009. [13] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An architectural framework for software recovery of hardware faults,” in ISCA, 2010. [14] S. Draghici, “On the capabilities of neural networks using limited precision weights,” Elsevier NN, 2002. [15] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie, “Neural network stream processing core (NnSP) for embedded systems,” in ISCAS, 2006. [16] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in ISCA, 2011. [17] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture support for disciplined approximate programming,” in ASPLOS, 2012. [18] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012. [19] K. Fan, M. Kudlur, G. Dasika, and S. Mahlke, “Bridging the computation gap between programmable processors and hardwired accelerators,” in HPCA, 2009. [20] Y. Fang, H. Li, and X. Li, “A fault criticality evaluation framework of digital systems for error tolerant video applications,” in ATS, 2011. [21] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. Available: http://archive.ics.uci.edu/ml [22] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE TC, 2011. [23] V. Govindaraju, C. Ho, and K. Sankaralingam, “Dynamically specialized datapaths for energy efficient computing,” in HPCA, 2011. [24] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August, “Bundled execution of recurring traces for energy-efficient general purpose processing,” in MICRO, 2011. [25] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Toward dark silicon in servers,” IEEE Micro, 2011. [26] A. Hashmi, H. Berry, O. Temam, and M. Lipasti, “Automatic abstraction and fault tolerance in cortical microarchitectures,” in ISCA, 2011. [27] A. Hashmi, A. Nere, J. J. Thomas, and M. Lipasti, “A case for neuromorphic ISAs,” in ASPLOS, 2011. [28] R. Hegde and N. R. Shanbhag, “Energy-efficient signal processing via algorithmic noise-tolerance,” in ISLPED, 1999. [29] Y. Huang, P. Ienne, O. Temam, Y. Chen, and C. 
Wu, “Elastic cgras,” in FPGA, 2013. [30] C. Igel and M. Hüsken, “Improving the RPROP learning algorithm,” in NC, 2000. [31] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John Wiley and Sons, Inc., 1997. [32] A. Joubert, B. Belhadj, O. Temam, and R. Héliot, “Hardware spiking neurons design: Analog or digital?” in IJCNN, 2012. [33] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009. [34] X. Li and D. Yeung, “Exploiting soft computing for increased fault tolerance,” in ASGI, 2006. [35] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, “Flikker: Saving dram refresh-power through critical data partitioning,” in ASPLOS, 2011. [36] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, “Quality of service profiling,” in ICSE, 2010. [37] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in MICRO, 2007. [38] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, “Scalable stochastic processors,” in DATE, 2010. [39] A. Patel, F. Afram, S. Chen, and K. Ghose, “MARSSx86: A full system simulator for x86 CPUs,” in DAC, 2011. [40] K. W. Przytula and V. K. P. Kumar, Eds., Parallel Digital Implementations of Neural Networks. Prentice Hall, 1993. [41] A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers, “CHiMPS: A high-level compilation flow for hybrid CPUFPGA architectures,” in FPGA, 2008. [42] R. Razdan and M. D. Smith, “A high-performance microarchitecture with hardware-programmable functional units,” in MICRO, 1994. [43] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986. [44] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “Sage: Self-tuning approximation for graphics engines,” in MICRO, 2013. [45] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “EnerJ: Approximate data types for safe and general low-power computation,” in PLDI, 2011. [46] A. Sampson, J. Nelson, K. Strauss, and L. Ceze, “Approximate storage in solid-state memories,” in MICRO, 2013. [47] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of analog neural networks,” in IJCNN, 2008. [48] R. St. Amant, D. A. Jiménez, and D. Burger, “Mixed-signal approximate computation: A neural predictor case study,” IEEE MICRO Top Picks, vol. 29, no. 1, January/February 2009. [49] S. M. Tam, B. Gupta, H. A. Castro, and M. Holler, “Learning on an analog VLSI neural network chip,” in SMC, 1990. [50] O. Temam, “A defect-tolerant accelerator for emerging highperformance applications,” in ISCA, 2012. [51] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Quality-programmable vector processors for approximate computing,” in MICRO, 2013. [52] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. LugoMartinez, S. Swanson, and M. B. Taylor, “Conservation cores: Reducing the energy of mature computations,” in ASPLOS, 2010. [53] G. Venkatesh, J. Sampson, N. Goulding, S. K. Venkata, M. Taylor, and S. Swanson, “QsCores: Trading dark silicon for scalable energy efficiency with quasi-specific cores,” in MICRO, 2011. [54] V. Wong and M. Horowitz, “Soft error resilience of probabilistic inference applications,” in SELSE, 2006. [55] J. 
Zhu and P. Sutton, “FPGA implementations of neural networks: A survey of a decade of progress,” in FPL, 2003.