Toward General-Purpose Code Acceleration with Analog Computation

Amir Yazdanbakhsh∗  Bradley Thwaites∗  Renee St. Amant§  Jongse Park∗
Hadi Esmaeilzadeh∗  Arjang Hassibi§  Luis Ceze†  Doug Burger‡
∗ Georgia Institute of Technology  § The University of Texas at Austin
† University of Washington  ‡ Microsoft Research

Abstract
We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3×, with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of emerging applications.
1. Introduction
A growing body of work [11, 25, 6, 22, 12] has focused on approximation as a strategy for improving performance and efficiency. Large classes of applications can tolerate small errors in their outputs with no discernible loss in Quality of Result (QoR). Many conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension—that of error.

Many interesting design alternatives become possible once precision is relaxed. An obvious candidate is the use of analog circuits for computation. However, computation in the analog domain has several major challenges, even when small errors are permissible. First, analog circuits tend to be special purpose, good for only specific operations. Second, the bit widths they can accommodate are much smaller than current floating-point standards (i.e., 32 or 64 bits), since the ranges must be represented by physical voltage or current levels. Another consideration is determining where the boundaries between digital and analog computation lie. Using individual analog operations will not be effective due to the overhead of A/D and D/A conversions. Finally, effective storage of temporary analog results is challenging in current CMOS technologies. Due to these limitations, it has not been effective to design analog von Neumann processors that can be programmed with conventional languages. NPUs can be a potential solution for general-purpose analog computing. Prior research has shown that analog neural networks can effectively solve classes of domain-specific problems, such as pattern recognition [4, 27, 28, 18]. The process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion. Furthermore, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary. Figure 1 illustrates an overview of our framework.

This paper reports on our study to design an NPU with mixed-signal components and develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate the challenges and potential solutions of implementing NPUs with analog components, while both bounding application error to sufficiently low levels and achieving worthwhile performance or efficiency gains for general-purpose approximable code. We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. We trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, and RPROP and CDLM [8] for training. Using these techniques together, we were able to bound error on all applications to a 10% limit, which is comparable to prior studies using entirely digital accelerators. The average time required to compute a neural result was 3.3× better than a previous digital implementation, with an additional energy savings of 12.1×. The performance gains result in an average full-application-level improvement of 3.7× and 23.3× in performance and energy-delay product, respectively. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous.
2. Analog Circuits for Neural Computation
As Figure 2a illustrates, each neuron in a multi-layer perceptron takes in a set of inputs (xi) and performs a weighted sum of those input values (∑i xi wi). The weights (wi) are the result of training the neural network on the training data. After the summation stage, which produces a linear combination of the weighted inputs, the neuron applies a nonlinearity function, sigmoid, to the result of the summation. Figure 2b depicts a conceptual analog circuit that performs the three necessary operations of a neuron: (1) scaling inputs by weights (xi wi), (2) summing the scaled inputs (∑i xi wi), and (3) applying the nonlinearity function (sigmoid).
Figure 1: Framework for using limited-precision analog computation to accelerate code written in conventional languages.
Figure 2: One neuron and its conceptual analog circuit.
This conceptual design first encodes the digital inputs (xi) as analog current levels (I(xi)). Then, these current levels pass through a set of variable resistances whose values (R(wi)) are set proportional to the corresponding weights (wi). The voltage level at the output of each resistance, I(xi)R(wi), is proportional to xi wi. These voltages are then converted to currents that can be summed quickly according to Kirchhoff’s current law (KCL). Analog circuits only operate linearly within a small range of voltage and current levels, outside of which the transistors enter saturation mode, with I-V characteristics similar in shape to a nonlinear sigmoid function. Thus, at a high level, the nonlinearity is naturally applied to the result of the summation by the time the final voltage reaches the analog-to-digital converter (ADC). Compared to a digital implementation of a neuron, which requires multipliers, adder trees, and sigmoid lookup tables, the analog implementation leverages the physical properties of the circuit elements and is orders of magnitude more efficient. However, it operates in limited ranges and therefore offers limited precision.
Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit width of its inputs and weights. Increasing the number of bits results in an exponential increase in ADC and DAC energy dissipation and can significantly limit the benefits of analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities finer than practical ADCs can handle. In addition, the fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when larger bit widths are used for weights and inputs. We design our ANUs such that the granularity of the voltage and current levels used for information encoding is, to a large degree, robust to variations and noise.
Analog-digital boundaries. The conceptual design in Figure 2b draws the analog-digital boundary at the level of an algorithmic neuron. As we will discuss, the analog neural accelerator will be a composition of these analog neural units (ANUs). However, an alternative design, primarily optimizing for efficiency, may lay out the entirety of a neural network with only analog components, limiting the D-to-A and A-to-D conversions to the inputs and outputs of the neural network rather than the individual neurons. The overhead of conversions in the ANUs significantly limits the potential efficiency gains of an analog approach to neural computation. However, there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, limiting potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff: achieving acceptable accuracy for code acceleration while delivering higher performance and efficiency when analog neural circuits are used for general-purpose code acceleration.

While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, as the optimal network topology varies with application. Additionally, routing analog signals among neurons and the limited capability of analog circuits for buffering signals negatively impact accuracy and make the circuit susceptible to noise. In order to provide additional flexibility, we set the digital-analog boundary in conjunction with an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights is converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. ANUs can be composed, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.
Topology restrictions. Another important design choice is the number of inputs to the ANU. Similar to bit width, increasing the number of ANU inputs translates to encoding a larger value range in a bounded voltage and current range, which, as discussed, becomes impractical. The larger the number of inputs, the larger the number of multiply and add operations that can be done in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. However, this unique ANU limitation restricts the topology of the neural network that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.
Non-ideal sigmoid. The saturation behavior of the analog circuit after the summation stage produces a sigmoid-like response that only approximates the ideal sigmoid. We measure this behavior at the circuit level and expose it to the compiler and the training algorithm.
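As a sketch of how a measured, non-ideal sigmoid could be exposed to the training algorithm, the following piecewise-linear model interpolates circuit-level measurements; the sample points below are hypothetical, not measured values from the paper.

```python
import bisect

class MeasuredSigmoid:
    """Piecewise-linear activation built from circuit-level measurements."""
    def __init__(self, measured_points):
        self.xs, self.ys = zip(*sorted(measured_points))
    def __call__(self, v):
        if v <= self.xs[0]:  return self.ys[0]        # saturated low
        if v >= self.xs[-1]: return self.ys[-1]       # saturated high
        i = bisect.bisect_left(self.xs, v)            # first point >= v
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (v - x0) / (x1 - x0)  # linear interpolation

# Hypothetical measurements: a slightly asymmetric, saturating curve.
act = MeasuredSigmoid([(-2.0, 0.02), (-1.0, 0.10), (0.0, 0.48),
                       (1.0, 0.88), (2.0, 0.97)])
print(act(0.5))
```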
3. Mixed-Signal Neural Accelerator (A-NPU)
Circuit design for ANUs. Figure 3 illustrates the design of a single analog neuron (ANU). As mentioned, the ANU performs the computation of a single neuron, y ≈ sigmoid(∑i wi xi). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form. In Figure 3, swi and sxi represent the sign bits, and wi and xi represent the magnitudes. Digital input values are converted to the analog domain through current-steering DACs, which translate digital values to analog currents and are used for their speed and simplicity. In Figure 3, I(|xi|) is the analog current that represents the magnitude of the input value xi. Digital weight values control resistor string ladders that create a variable resistance depending on the magnitude of each weight, R(|wi|). We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches. The digital weight bits (wi) control the switches, adjusting the effective resistance, R(|wi|), seen by the input current (I(|xi|)). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage: V(|wi xi|) = I(|xi|) × R(|wi|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits. This resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.

V(|wi xi|), as well as the XOR of the weight and input sign bits, feeds a differential pair that converts the voltage into two differential currents, I+(wi xi) and I−(wi xi), that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|wi xi|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I+(wi xi) = Ibias/2 + ∆I and I−(wi xi) = Ibias/2 − ∆I. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs, and consequently a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value to a single wire to be efficiently summed according to Kirchhoff’s current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages.

Resistors convert the resulting pair of differential currents to voltages, V+(∑i wi xi) and V−(∑i wi xi), that represent the weighted sum of the inputs to the ANU. These voltages are used as input to an additional amplification stage (implemented as a current-mode differential amplifier with a diode-connected load). The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function. Our experiments show that neural networks are sensitive to the steepness of this nonlinear function, losing accuracy with shallower nonlinear activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain, as a small range of interest must be mapped to a much larger output range in accordance with ADC input range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain while also allowing for saturation with inputs outside of this range.

The amplified voltage is used as input to an analog-to-digital converter that converts it to a digital value. We chose a flash ADC design (named for its speed), which consists of a set of reference voltages and comparators [1, 17]. The ADC requires 2^n comparators, where n is the number of digital output bits. Flash ADC designs are capable of converting 8 bits at frequencies on the order of one GHz. We require 2–3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages to capture the behavior of the sigmoid function, which also improves the accuracy of the analog sigmoid.

Figure 3: A single analog neuron (ANU).
Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs.

Reconfigurable mixed-signal A-NPU. We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural topologies, since the optimal topology varies across applications. Figure 4 illustrates the A-NPU design with some details omitted for clarity. The figure shows four ANUs, while the actual design has eight. The A-NPU is a time-multiplexed architecture in which the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm, whose schedule is loaded onto the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer. The Input Buffer always contains the inputs to the neurons, coming either from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs. When all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the Input Buffer will be fed to the ANUs. The output of the ANUs is written to the single-entry Output Buffer, and the Column Selector determines which column of the Output Buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures.

Each ANU has eight inputs. As depicted in Figure 4, each ANU is augmented with a dedicated weight buffer that stores the 8-bit weights. The weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights to the ANUs based on a preloaded order.
4. Compilation for Analog Acceleration
As Figure 1 illustrates, the compilation for A-NPU execution
consists of three stages: (1) profile-driven data collection, (2)
training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven
data collection stage, the compiler instruments the application
to collect the inputs and outputs of approximable functions.
The compiler then runs the application with representative
inputs and collects the inputs and their corresponding outputs.
These input-output pairs constitute the training data. Section 3
briefly discussed ISA extensions and code generation. While compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [12], the training phase is unique to the analog approach, accounting for analog-imposed topology restrictions and adjusting weight selection to account for limited-precision computation.
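As an illustration of stage (1), the sketch below records the input-output pairs of an annotated function; it is a simplified stand-in for the paper's compiler instrumentation, which operates on the compiled binary rather than on Python source.

```python
import functools, json

training_data = []  # input/output pairs collected during profiling runs

def approximable(fn):
    """Record the inputs and outputs of a candidate region while profiling."""
    @functools.wraps(fn)
    def wrapper(*args):
        out = fn(*args)
        training_data.append({"in": list(args), "out": out})
        return out
    return wrapper

@approximable
def distance(x, y):          # stand-in for an annotated approximable function
    return (x * x + y * y) ** 0.5

for x, y in [(3.0, 4.0), (1.0, 1.0)]:   # representative profiling inputs
    distance(x, y)
with open("train.json", "w") as f:
    json.dump(training_data, f)         # becomes the network's training set
```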
Hardware/software interface for exposing analog circuits to the compiler. As discussed in Section 2, we expose the analog circuit restrictions to the compiler through a hardware/software interface that captures four circuit characteristics: (1) the input bit-width limitation, (2) the weight bit-width limitation, (3) the limited number of inputs to each analog neuron (the topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to the analog implementation.
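A minimal sketch of the information such an interface could carry, with illustrative field names (the paper does not specify a concrete encoding):

```python
from dataclasses import dataclass, field

@dataclass
class AnalogSpec:
    """Circuit characteristics exposed to the compiler."""
    input_bits: int = 8                  # (1) input bit-width limitation
    weight_bits: int = 8                 # (2) weight bit-width limitation
    max_inputs_per_neuron: int = 8       # (3) topology restriction
    sigmoid_points: list = field(default_factory=list)  # (4) measured sigmoid

spec = AnalogSpec(sigmoid_points=[(-2.0, 0.02), (0.0, 0.48), (2.0, 0.97)])
```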
Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptron neural networks use a gradient descent approach to minimize the average network error over a set of training input-output pairs by backpropagating the output error through the network and iteratively adjusting the weight values. Traditional training techniques that do not consider limited-precision inputs, weights, and outputs, however, perform poorly when these values are saturated to adhere to the bit-width requirements that are feasible for an implementation in the analog domain. Simply limiting weight values during training is also detrimental to output quality because the algorithm does not have sufficient precision to converge to a quality solution.
Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network.

blackscholes — Mathematical model of a financial market (Financial Analysis). 5 function calls, 0 loops, 5 if/elses, 309 x86-64 instructions. Training input: 4096 data points from PARSEC; evaluation input: 16384 data points from PARSEC. Neural network topology: 6 -> 8 -> 8 -> 1. Fully digital NN MSE: 0.000011; analog NN (8-bit) MSE: 0.00228. Error metric: avg. relative error; fully digital error: 6.02%; analog error: 10.2%.

fft — Radix-2 Cooley-Tukey fast Fourier transform (Signal Processing). 2 function calls, 0 loops, 0 if/elses, 34 x86-64 instructions. Training input: 2048 random floating-point numbers; evaluation input: 32768 random floating-point numbers. Topology: 1 -> 4 -> 4 -> 2. Digital MSE: 0.00002; analog MSE: 0.00194. Error metric: avg. relative error; digital error: 2.75%; analog error: 4.1%.

inversek2j — Inverse kinematics for a 2-joint arm (Robotics). 4 function calls, 0 loops, 0 if/elses, 100 x86-64 instructions. Training and evaluation inputs: 10000 (x, y) random coordinates each. Topology: 2 -> 8 -> 2. Digital MSE: 0.000341; analog MSE: 0.00467. Error metric: avg. relative error; digital error: 6.2%; analog error: 9.4%.

jmeint — Triangle intersection detection (3D Gaming). 32 function calls, 0 loops, 23 if/elses, 1,079 x86-64 instructions. Training and evaluation inputs: 10000 random pairs of 3D triangle coordinates each. Topology: 18 -> 32 -> 8 -> 2. Digital MSE: 0.05235; analog MSE: 0.06729. Error metric: miss rate; digital error: 17.68%; analog error: 19.7%.

jpeg — JPEG encoding (Compression). 3 function calls, 4 loops, 0 if/elses, 1,257 x86-64 instructions. Training input: three 512x512-pixel color images; evaluation input: 220x200-pixel color image. Topology: 64 -> 16 -> 8 -> 64. Digital MSE: 0.0000156; analog MSE: 0.0000325. Error metric: image diff; digital error: 5.48%; analog error: 8.4%.

kmeans — K-means clustering (Machine Learning). 1 function call, 0 loops, 0 if/elses, 26 x86-64 instructions. Training input: 50000 pairs of random (r, g, b) values; evaluation input: 220x200-pixel color image. Topology: 6 -> 8 -> 4 -> 1. Digital MSE: 0.00752; analog MSE: 0.009589. Error metric: image diff; digital error: 3.21%; analog error: 7.3%.

sobel — Sobel edge detector (Image Processing). 3 function calls, 2 loops, 1 if/else, 88 x86-64 instructions. Training input: one 512x512-pixel color image; evaluation input: 220x200-pixel color image. Topology: 9 -> 8 -> 1. Digital MSE: 0.000782; analog MSE: 0.00405. Error metric: image diff; digital error: 3.89%; analog error: 5.2%.
To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [8]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM trains a fully precise network using a standard training algorithm, such as backpropagation [24]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification. The algorithm calculates the new error, backpropagates it through the fully precise network using full-precision computation, and updates the weight values according to the algorithm used in the first phase. This process repeats, backpropagating the 'discrete' errors through a precise network. The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights. We customize this algorithm by incorporating the input and output bit-width limitations in addition to the limited weight values. This training scheme is also advantageous for an analog implementation because it is general enough to compensate for errors that arise from the analog implementation itself, such as a non-ideal sigmoid function and any other analog non-ideality that behaves consistently.
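The sketch below illustrates the continuous-discrete idea on a single sigmoid neuron: the error is computed with quantized weights and inputs, but the update is applied to the continuous weights. Plain gradient descent stands in for the RPROP base algorithm the paper uses, and the learning rate, target function, and data are illustrative.

```python
import math, random

def quant(v, bits=8, rng=1.0):
    """Clip to the representable range, then round to the nearest level."""
    levels = 2 ** (bits - 1) - 1
    v = max(-rng, min(rng, v))
    return round(v / rng * levels) / levels * rng

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
data = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
data = [(x, sigmoid(x[0] - x[1])) for x in data]   # toy target function

w = [0.0, 0.0, 0.0]                  # continuous (full-precision) weights
for epoch in range(300):
    for x, t in data:
        # Discrete phase: the error is measured with quantized values...
        y = sigmoid(sum(quant(wi) * quant(xi) for wi, xi in zip(w, x)))
        g = (y - t) * y * (1.0 - y)  # output-layer gradient
        # ...continuous phase: the update lands on full-precision weights.
        for i in range(3):
            w[i] -= 0.5 * g * x[i]
print([quant(wi) for wi in w])       # discretized weights shipped to the A-NPU
```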
Training with topology restrictions. Conventional multi-layer perceptron networks are fully connected; i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases the network cannot be fully connected.

We impose the circuit restriction on the connectivity between the neurons during the topology search, and we use a simple algorithm guided by the mean-squared error of the network to determine the best topology given the exposed restriction. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set, 70% of the data, and a test set, the remaining 30%. The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers. We also limit the number of neurons per hidden layer to powers of two, up to 32.

To further improve accuracy and compensate for topology-restricted networks, we utilize the Resilient Back Propagation (RPROP) [16] training algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values based on the magnitude of the backpropagated error (as in conventional backpropagation [24]), the RPROP algorithm increases or decreases each weight by a step value based on the sign of the error gradient. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half the number of training epochs to converge on a quality solution. The main advantage of applying RPROP training to an analog approach to neural computing is its robustness to the sigmoid-function and topology restrictions imposed by the analog design. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that constitutes the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons. This determines the order in which the neurons will be computed on the ANUs. Then, the scheduler takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs; (2) assign an order to the neurons; (3) assign an order to the weights; (4) generate the order in which inputs are fed to the ANUs; and (5) generate the order in which the outputs are written to the Output Buffer. The scheduler also assigns a unique order in which the processor communicates the inputs and outputs of the neural network with the A-NPU.
5. Evaluations
Cycle-accurate simulation and energy modeling. We use the MARSSx86 x86-64 cycle-accurate simulator [23] to model the performance of the processor. The processor is modeled after a single-core Intel Nehalem to evaluate the performance benefits of A-NPU acceleration over an aggressive out-of-order architecture. We extended the simulator to include ISA-level support for A-NPU queue and dequeue instructions. We also augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [12]. We use GCC v4.7.3 with -O3 to enable compiler optimization. The baseline in our experiments is each benchmark run solely on the processor without neural transformation. We use McPAT [19] for processor energy estimations, and we model the energy of the 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [21], and [13]. Both the D-NPU and the processor operate at 3.4 GHz, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.¹

¹ Processor: Fetch/Issue Width: 4/5, INT ALUs/FPUs: 6/6, Load/Store FUs: 1/1, ROB Entries: 128, Issue Queue Entries: 36, INT/FP Physical Registers: 256/256, Branch Predictor: Tournament 48 KB, BTB Sets/Ways: 1024/4, RAS Entries: 64, Load/Store Queue Entries: 48/48, Dependence Predictor: 4096-entry Bloom Filter, ITLB/DTLB Entries: 128/256, L1: 32 KB Instruction, 32 KB Data, Line Width: 64 bytes, 8-Way, Latency: 3 cycles, L2: 256 KB, Line Width: 64 bytes, 8-Way, Latency: 6 cycles, L3: 2 MB, Line Width: 64 bytes, 16-Way, Latency: 27 cycles, Memory Latency: 50 ns.
Circuit design for the ANU. We implemented the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [5]. We ran detailed Spectre spice simulations to understand circuit behavior and measure ANU energy consumption. We used CACTI to estimate the energy of the A-NPU buffers.

Benchmarks. We use the benchmarks in [12] and add one more, blackscholes. Table 1 characterizes each benchmark, its offloaded function, its training data, and its trained neural network.

Whole application speedup and energy savings. Figure 5 shows the whole-application speedup and energy savings when the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 5c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefits from neural acceleration. Geometric mean speedup and energy savings with the A-NPU are 3.7× and 6.3×, respectively, which is 48% and 24% better than the 8-bit, 8-PE D-NPU. Among the benchmarks, kmeans sees a slowdown with both D-NPU- and A-NPU-based acceleration. All benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×, and the energy savings range from 1.1× to 51.2×.

Figure 5: Whole-application speedup and energy saving with D-NPU, A-NPU, and an ideal NPU that consumes zero energy and takes zero cycles for neural computation: (a) whole-application speedup; (b) whole-application energy saving; (c) percentage of dynamic instructions subsumed.

Application error. Table 2 shows the application-level errors with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and our A-NPU, which incorporates the non-idealities of the analog sigmoid. Except for jmeint, which shows error above 10%, all of the applications show error less than or around 10%. Application average error rates with the A-NPU range from 4.1% to 10.2%. This quality-of-result loss is commensurate with other work on quality trade-offs that uses digital techniques. Truffle [11] and EnerJ [26] show similar error (3–10%) for some applications and much greater error (above 80%) for others in a moderate configuration. Green [2] has error rates below 1% for some applications but greater than 20% for others. A case study [20] explores manual optimizations of the x264 video encoder that trade off 0.5–10% quality loss.

Table 2: Error with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and an A-NPU with the non-ideal analog sigmoid. Application-level error with the fully precise digital sigmoid versus the analog sigmoid: blackscholes 8.39% / 10.21%; fft 3.03% / 4.13%; inversek2j 8.13% / 9.42%; jmeint 18.41% / 19.67%; jpeg 6.62% / 8.35%; kmeans 6.10% / 7.28%; sobel 4.28% / 5.21%; swaptions 2.61% / 3.34%; geomean 6.02% / 7.32%.
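The relationship noted above between subsumed dynamic instructions and whole-application benefit follows directly from Amdahl's Law; the numbers below are hypothetical, not measured results.

```python
def whole_app_speedup(f_subsumed, region_speedup):
    """Amdahl's Law: only the subsumed fraction of execution is accelerated."""
    return 1.0 / ((1.0 - f_subsumed) + f_subsumed / region_speedup)

# Hypothetical: 90% of dynamic instructions subsumed, region sped up 15x.
print(whole_app_speedup(0.90, 15.0))   # -> 6.25x whole-application speedup
```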
6. Related Work
General-purpose approximate computing. A growing body of work has explored relaxing the abstraction of full accuracy at the circuit and architecture level for gains in performance, energy, and resource utilization [9, 11, 25, 6, 22, 12]. These circuit and architecture studies, although proven successful, are limited to purely digital techniques. We explore how a mixed-signal, analog-digital approach can go beyond what digital approximate techniques offer.

Analog and digital neural hardware. There is an extensive body of work on hardware implementations of neural networks, both in the digital [10, 30] and the analog [4, 18] domain. Recent work has proposed higher-level abstractions for the implementation of neural networks [15]. Other work has examined fault-tolerant hardware neural networks [14, 29]. In particular, Temam [29] uses datasets from the UCI machine learning repository to explore the fault tolerance of a hardware neural network design. In contrast, our compilation, neural-network selection/training framework, and architecture design aim at applying neural networks to general-purpose code written in familiar programming models and languages, not code explicitly written to utilize neural networks directly.

Neural-based code acceleration. A recent study [7] shows that a number of applications can be manually reimplemented with explicit use of various kinds of neural networks. That study did not prescribe a programming workflow or a preferred hardware architecture. More recent work exposes analog spiking neurons as primitive operators [3]. This work devises a new programming model that allows programmers to express digital signal-processing applications as a graph of analog neurons and automatically maps the expressed graph to tiled analog, spiking-neural hardware. The work in [3] is restricted to the domain of applications whose inputs are real-world signals that must be encoded as pulses. Our approach intends to address the long-standing challenges of utilizing analog computation (programmability and generality) by not imposing domain-specific limitations, and by providing analog circuitry that is integrated with a conventional, digital processor without requiring a new programming paradigm.

7. Conclusions

Analog circuits inherently trade accuracy for very attractive energy-efficiency gains. However, it is challenging to utilize them in a way that is both programmable and generally useful. The transformation of general-purpose, approximable code to a neural model, together with the use of neural accelerators, provides an avenue for realizing the benefits of analog computation: it takes advantage of the fixed-function qualities of a neural network while targeting traditionally written, generally approximable code. We presented a complete solution for using analog neural networks to accelerate approximate applications, from circuit to compiler design. A very important insight from this work is that it is crucial to expose the analog circuit characteristics to the compilation and neural-network training phases. Our analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3×, with quality loss less than 10% for all except one benchmark.
References
[1] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, 2nd ed.
Oxford University Press, 2002.
[2] W. Baek and T. M. Chilimbi, “Green: A framework for supporting
energy-conscious programming using controlled approximation,” in
PLDI, 2010.
[3] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, “Continuous
real-world inputs can open up alternative accelerator designs,” in ISCA,
2013.
[4] B. E. Boser, E. Säckinger, J. Bromley, Y. L. Cun, L. D. Jackel, and
S. Member, “An analog neural network processor with programmable
topology,” J. Solid-State Circuits, vol. 26, no. 12, December 1991.
[5] Y. Cao, “Predictive technology models,” 2013. Available: http://ptm.asu.edu
[6] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz,
K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) SOC
architectures based on probabilistic CMOS (PCMOS) technology,” in
DATE, 2006.
[7] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti,
A. Nere, S. Qiu, M. Sebag, and O. Temam, “Benchnn: On the broad
potential application scope of hardware neural network accelerators,”
in IISWC, 2012.
[8] F. Choudry, E. Fiesler, A. Choudry, and H. J. Caulfield, “A weight
discretization paradigm for optical neural networks,” in ICOE, 1990.
[9] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An architectural framework for software recovery of hardware faults,” in ISCA, 2010.
[10] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie,
“Neural network stream processing core (NnSP) for embedded systems,”
in ISCAS, 2006.
[11] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture
support for disciplined approximate programming,” in ASPLOS, 2012.
[12] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[13] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,”
IEEE Trans. Comput., vol. 60, no. 7, 2011.
[14] A. Hashmi, H. Berry, O. Temam, and M. Lipasti, “Automatic abstraction and fault tolerance in cortical microarchitectures,” in ISCA, 2011.
[15] A. Hashmi, A. Nere, J. J. Thomas, and M. Lipasti, “A case for neuromorphic ISAs,” in ASPLOS, 2011.
[16] C. Igel and M. Hüsken, “Improving the RPROP learning algorithm,”
in NC, 2000.
[17] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John
Wiley and Sons, Inc., 1997.
[18] A. Joubert, B. Belhadj, O. Temam, and R. Héliot, “Hardware spiking
neurons design: Analog or digital?” in IJCNN, 2012.
[19] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in MICRO,
2009.
[20] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, “Quality of
service profiling,” in ICSE, 2010.
[21] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing
NUCA organizations and wiring alternatives for large caches with
CACTI 6.0,” in MICRO, 2007.
[22] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, “Scalable stochastic processors,” in DATE, 2010.
[23] A. Patel, F. Afram, S. Chen, and K. Ghose, “MARSSx86: A full system
simulator for x86 CPUs,” in DAC, 2011.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal
representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
1986, vol. 1.
[25] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “Sage:
Self-tuning approximation for graphics engines,” in MICRO, 2013.
[26] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and
D. Grossman, “EnerJ: Approximate data types for safe and general
low-power computation,” in PLDI, 2011.
[27] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of
analog neural networks,” in IJCNN, 2008.
[28] S. M. Tam, B. Gupta, H. A. Castro, and M. Holler, “Learning on an
analog VLSI neural network chip,” in Systems, Man, and Cybernetics
(SMC), 1990.
[29] O. Temam, “A defect-tolerant accelerator for emerging high-performance applications,” in ISCA, 2012.
[30] J. Zhu and P. Sutton, “FPGA implementations of neural networks: A
survey of a decade of progress,” in FPL, 2003.