Toward General-Purpose Code Acceleration with Analog Computation

Amir Yazdanbakhsh∗  Bradley Thwaites∗  Jongse Park∗  Hadi Esmaeilzadeh∗
Renee St. Amant§  Arjang Hassibi§  Luis Ceze†  Doug Burger‡
∗ Georgia Institute of Technology  § The University of Texas at Austin
† University of Washington  ‡ Microsoft Research

Abstract
We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1×, with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of emerging applications.
1. Introduction
A growing body of work [11, 25, 6, 22, 12] has focused on approximation as a strategy for improving performance and efficiency. Large classes of applications
can tolerate small errors in their outputs with no discernible
loss in Quality of Result (QoR). Many conventional techniques
in energy-efficient computing navigate a design space defined
by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate
computing explores a third dimension—that of error.
Many interesting design alternatives become possible once
precision is relaxed. An obvious candidate is the use of analog
circuits for computation. However, computation in the analog
domain has several major challenges, even when small errors
are permissible. First, analog circuits tend to be special-purpose, good for only specific operations. Second, the bit widths they can accommodate are much smaller than current floating-point standards (i.e., 32 or 64 bits), since the ranges must be
represented by physical voltage or current levels. Another
consideration is determining where the boundaries between
digital and analog computation lie. Using individual analog
operations will not be effective due to the overhead of A/D
and D/A conversions. Finally, effective storage of temporary
analog results is challenging in current CMOS technologies.
Due to these limitations, it has not been effective to design analog von Neumann processors that can be programmed with conventional languages. NPUs are a potential solution for general-purpose analog computing. Prior research has shown that analog neural networks can effectively solve classes of domain-specific problems, such as pattern recognition [4, 27, 28, 18]. The process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion. Furthermore, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary. Figure 1 illustrates an overview of our framework.

This paper reports on our study to design an NPU with mixed-signal components and to develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate the challenges and potential solutions of implementing NPUs with analog components, while both bounding application error to sufficiently low levels and achieving worthwhile performance or efficiency gains for general-purpose approximable code. We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. We trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, and the RPROP and CDLM [8] training algorithms. Using these techniques together, we were able to bound error on all applications to a 10% limit, which is comparable to prior studies using entirely digital accelerators. The average time required to compute a neural result was 3.3× better than a previous digital implementation, with an additional energy savings of 12.1×. The performance gains result in average full-application-level improvements of 3.7× and 23.3× in performance and energy-delay product, respectively. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous.
2. Analog Circuits for Neural Computation
As Figure 2a illustrates, each neuron in a multi-layer perceptron takes in a set of inputs (x_i) and performs a weighted sum of those input values (∑_i x_i w_i). The weights (w_i) are the result of training the neural network on the collected input-output data. After the summation stage, which produces a linear combination of the weighted inputs, the neuron applies a nonlinearity function, the sigmoid, to the result of the summation.
Figure 1: Framework for using limited-precision analog computation to accelerate code written in conventional languages.
(a) y = sigmoid(∑_i x_i w_i)    (b) y ≈ sigmoid(∑_i I(x_i) R(w_i))
Figure 2: One neuron and its conceptual analog circuit.
Figure 2b depicts a conceptual analog circuit that performs the three necessary operations of a neuron: (1) scaling inputs by weights (x_i w_i), (2) summing the scaled inputs (∑_i x_i w_i), and (3) applying the nonlinearity function (sigmoid). This conceptual design first encodes the digital inputs (x_i) as analog current levels (I(x_i)). Then, these current levels pass through a set of variable resistances whose values (R(w_i)) are set proportional to the corresponding weights (w_i). The voltage level at the output of each resistance, I(x_i) R(w_i), is proportional to x_i w_i. These voltages are then converted to currents that can be summed quickly according to Kirchhoff's current law (KCL). Analog circuits only operate linearly within a small range of voltage and current levels, outside of which the transistors enter saturation mode, with I-V characteristics similar in shape to a nonlinear sigmoid function. Thus, at a high level, the nonlinearity is naturally applied to the result of the summation when the final voltage reaches the analog-to-digital converter (ADC). Compared to a digital implementation of a neuron, which requires multipliers, adder trees, and sigmoid lookup tables, the analog implementation leverages the physical properties of the circuit elements and is orders of magnitude more efficient. However, it operates in limited ranges and therefore offers limited precision.

Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit-width of inputs and weights. Increasing the number of bits results in an exponential increase in ADC and DAC energy dissipation and can significantly limit the benefits of analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities finer than practical ADCs can handle. In addition, the fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when more bit-width is used for weights and inputs. We design our ANUs such that the granularity of the voltage and current levels used for information encoding is, to a large degree, robust to variations and noise.
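To make the ANU's computational model concrete, the following sketch (ours, not taken from the paper) models a single analog neuron: sign-magnitude quantized inputs and weights, a weighted sum, and a saturating nonlinearity, with the output re-quantized as the ADC would. The 8-bit widths and eight-input limit follow the design described above; the tanh stand-in for the circuit's sigmoid-like saturation and the gain constant are illustrative assumptions.

import numpy as np

def quantize(x, bits=8, max_mag=1.0):
    # Sign-magnitude quantization to the given bit-width, mirroring
    # the ANU's limited-precision value encoding.
    levels = 2 ** (bits - 1) - 1
    q = np.round(np.clip(x, -max_mag, max_mag) / max_mag * levels)
    return q / levels * max_mag

def anu(inputs, weights, bits=8, gain=4.0):
    # At most eight inputs per neuron (the topology restriction).
    assert len(inputs) <= 8 and len(inputs) == len(weights)
    x = quantize(np.asarray(inputs), bits)   # DAC: digital -> analog levels
    w = quantize(np.asarray(weights), bits)  # resistor-ladder weight settings
    s = np.dot(x, w)                         # KCL current summation
    y = np.tanh(gain * s)                    # saturating, sigmoid-like behavior (assumed shape)
    return quantize(y, bits)                 # ADC: analog -> 8-bit digital output

print(anu([0.5, -0.25, 0.125], [0.9, 0.4, -0.7]))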
Analog-digital boundaries. The conceptual design in Figure 2b draws the analog-digital boundary at the level of an
algorithmic neuron. As we will discuss, the analog neural
accelerator will be a composition of these analog neural units
(ANUs). However, an alternative design, primarily optimizing
for efficiency, may lay out the entirety of a neural network
with only analog components, limiting the D-to-A and A-to-D
conversions to the inputs and outputs of the neural network
and not the individual neurons. The overhead of conversions
in the ANUs significantly limits the potential efficiency gains
of an analog approach toward neural computation. However,
there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, limiting potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff: achieving acceptable accuracy for code acceleration while delivering higher performance and efficiency when analog neural circuits are used for general-purpose code acceleration.

While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, as the optimal network topology varies with the application. Additionally, routing analog signals among neurons and the limited capability of analog circuits for buffering signals negatively impact accuracy and make the circuit susceptible to noise. In order to provide additional flexibility, we set the digital-analog boundary in conjunction with an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights are converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. ANUs can be composed, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.
Figure 3: A single analog neuron (ANU).
Topology restrictions. Another important design choice is the number of inputs to the ANU. Similar to bit-width, increasing the number of ANU inputs translates to encoding a larger value range in a bounded voltage and current range, which, as discussed, becomes impractical. The larger the number of inputs, the larger the number of multiply and add operations that can be done in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. However, this ANU-specific limitation restricts the topology of the neural networks that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.
Non-ideal sigmoid. The saturation behavior of the analog
circuit that leads to sigmoid-like behavior after the summation
stage represents an approximation of the ideal sigmoid. We
measure this behavior at the circuit level and expose it to the
compiler and the training algorithm.
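One way to expose the measured behavior is as an interpolated lookup over circuit-characterization points. The sketch below is our illustration of such a compiler-side model; the sample points are invented placeholders, not measured data.

import numpy as np

# Hypothetical circuit-characterization samples: (summed input, measured output).
# Real values would come from circuit-level measurement of the ANU.
MEASURED_IN = np.linspace(-1.0, 1.0, 9)
MEASURED_OUT = np.tanh(3.0 * MEASURED_IN) * 0.98  # placeholder for bench data

def nonideal_sigmoid(s):
    # Piecewise-linear interpolation between measured points; saturates
    # at the endpoints, just as the real circuit saturates.
    return np.interp(s, MEASURED_IN, MEASURED_OUT)

print(nonideal_sigmoid(0.25))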
3. Mixed-Signal Neural Accelerator (A-NPU)
Circuit design for ANUs. Figure 3 illustrates the design of a single analog neuron (ANU). As mentioned, the ANU performs the computation of a single neuron, which is y ≈ sigmoid(∑_i w_i x_i). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form. In the figure, s_wi and s_xi represent the sign bits, and w_i and x_i represent the magnitudes. Digital input values are converted to the analog domain through current-steering DACs that translate digital values to analog currents. Current-steering DACs are used for their speed and simplicity. In Figure 3, I(|x_i|) is the analog current that represents the magnitude of the input value x_i. Digital weight values control resistor string ladders that create a variable resistance depending on the magnitude of each weight (R(|w_i|)). We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches. The digital weight bits control the switches, adjusting the effective resistance, R(|w_i|), seen by the input current (I(|x_i|)). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage: V(|w_i x_i|) = I(|x_i|) × R(|w_i|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits. This resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.

V(|w_i x_i|), as well as the XOR of the weight and input sign bits, feeds a differential pair that converts the voltage into two differential currents, I+(w_i x_i) and I−(w_i x_i), that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|w_i x_i|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I+(w_i x_i) = I_bias/2 + ΔI and I−(w_i x_i) = I_bias/2 − ΔI. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs, and consequently a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value to a single wire to be efficiently summed according to Kirchhoff's current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages.

Resistors convert the resulting pair of differential currents to voltages, V+(∑_i w_i x_i) and V−(∑_i w_i x_i), that represent the weighted sum of the inputs to the ANU. These voltages are used as input to an additional amplification stage (implemented as a current-mode differential amplifier with a diode-connected load). The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function. Our experiments show that neural networks are sensitive to the steepness of this nonlinear function, losing accuracy with shallower activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain, as a small range of interest must be mapped to a much larger output range in accordance with ADC input range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain, but also allowing for saturation with inputs outside of this range.

The amplified voltage is used as input to an analog-to-digital converter that converts the voltage to a digital value. We chose a flash ADC design (named for its speed), which consists of a
set of reference voltages and comparators [1, 17]. The ADC requires 2^n comparators, where n is the number of digital output bits. Flash ADC designs are capable of converting 8 bits at frequencies on the order of one GHz. We require 2–3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages to capture the behavior of a sigmoid function, which also improves the accuracy of the analog sigmoid.
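As an illustration of this choice (our sketch, not the paper's circuit parameters), sigmoid-shaped quantization can be obtained by spacing the reference voltages at the inverse sigmoid of uniformly spaced output codes, so each code covers an equal slice of the sigmoid's output range:

import numpy as np

def sigmoid_reference_voltages(n_bits=8, v_min=-1.0, v_max=1.0, gain=4.0):
    # Uniformly spaced output codes in (0, 1), excluding the endpoints.
    codes = np.arange(1, 2 ** n_bits) / (2 ** n_bits)
    # The inverse sigmoid (logit) maps equal output steps to non-uniformly
    # spaced input thresholds: dense near zero, sparse in saturation.
    logits = np.log(codes / (1.0 - codes)) / gain
    return np.clip(logits, v_min, v_max)

refs = sigmoid_reference_voltages()
print(refs[:4], refs[-4:])  # dense center thresholds vs. sparse tails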
Reconfigurable mixed-signal A-NPU. We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural topologies, since each application requires a different topology. Figure 4 illustrates the A-NPU design with some details omitted for clarity. The figure shows four ANUs, while the actual design has eight. The A-NPU is a time-multiplexed architecture in which the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm; the schedule is loaded to the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer. The Input Buffer always contains the inputs to the neurons, either coming from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs. When all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the Input Buffer will be fed to the ANUs, and the Column Selector determines which column of the Output Buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures.

Each ANU has eight inputs. As depicted in Figure 4, each ANU is augmented with a dedicated weight buffer that stores the 8-bit weights. The weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights for the ANUs based on a pre-loaded order.

Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that constitutes the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons; this determines the order in which the neurons will be computed on the ANUs. Then, the scheduler takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs; (2) assign an order to the neurons; (3) assign an order to the weights; (4) generate the order in which inputs are fed to the ANUs; and (5) generate the order in which the outputs are written to the Output Buffer. The scheduler also assigns a unique order in which the processor communicates the inputs and outputs of the neural network with the A-NPU.
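The following sketch (ours; the round-robin policy and helper names are illustrative assumptions, not the paper's exact algorithm) shows the shape of such a static schedule: neurons in each layer are assigned to the eight ANUs, and the per-ANU weight and output orders fall out of that assignment.

def schedule_layer(neurons, num_anus=8):
    # neurons: list of (neuron_id, weight_vector) for one layer.
    # Returns, per ANU, the ordered neurons it must evaluate, the weight
    # order to preload into its weight buffer, and the output order.
    assignment = {anu: [] for anu in range(num_anus)}
    for i, (nid, weights) in enumerate(neurons):
        assignment[i % num_anus].append((nid, weights))  # round-robin mapping
    weight_order = {anu: [w for _, w in slots] for anu, slots in assignment.items()}
    output_order = [nid for anu in range(num_anus) for nid, _ in assignment[anu]]
    return assignment, weight_order, output_order

layer = [(n, [0.1] * 8) for n in range(10)]  # ten neurons, eight weights each
_, _, out_order = schedule_layer(layer)
print(out_order)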
4. Compilation for Analog Acceleration

As Figure 1 illustrates, the compilation for A-NPU execution consists of three stages: (1) profile-driven data collection, (2) training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven data collection stage, the compiler instruments the application to collect the inputs and outputs of approximable functions. The compiler then runs the application with representative inputs and collects the inputs and their corresponding outputs. These input-output pairs constitute the training data. Section 3 briefly discussed ISA extensions and code generation. While compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [12], the training phase is unique to an analog approach, accounting for analog-imposed topology restrictions and adjusting weight selection to account for limited-precision computation.

Hardware/software interface for exposing analog circuits to the compiler. As we discussed in Section 2, we expose analog circuit restrictions to the compiler through a hardware/software interface that captures the following circuit characteristics: (1) input bit-width limitations, (2) weight bit-width limitations, (3) the limited number of inputs to each analog neuron (topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to the analog implementation.
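A minimal sketch of what such an exposed specification could look like on the compiler side (our illustration; the field names and defaults are assumptions matching the design described above):

from dataclasses import dataclass, field

@dataclass
class ANPUSpec:
    # Circuit characteristics the hardware exposes to the compiler.
    input_bits: int = 8          # (1) input bit-width limitation
    weight_bits: int = 8         # (2) weight bit-width limitation
    max_fan_in: int = 8          # (3) inputs per analog neuron
    sigmoid_points: list = field(default_factory=list)  # (4) measured sigmoid shape

spec = ANPUSpec()
print(spec.max_fan_in, spec.weight_bits)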
Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptron neural networks use a gradient-descent approach to minimize the average network error over a set of training input-output pairs, backpropagating the output error through the network and iteratively adjusting the weight values to minimize that error. Traditional training techniques that do not consider limited-precision inputs, weights, and outputs, however, perform poorly when these values are saturated to adhere to the bit-width requirements that are feasible for an implementation in the analog domain. Simply limiting weight values during training is also detrimental to achieving quality outputs because the algorithm does not have sufficient precision to converge to a quality solution.
Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network.

| Benchmark | Description | Type | # Function Calls | # Loops | # Ifs/elses | # x86-64 Instructions | Evaluation Input Set | Training Input Set | Neural Network Topology | Digital NN MSE (8-bit) | Analog NN MSE | Error Metric | Fully Digital Error | Analog Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| blackscholes | Mathematical model of a financial market | Financial Analysis | 5 | 0 | 5 | 309 | 4096 data points from PARSEC | 16384 data points from PARSEC | 6 -> 8 -> 8 -> 1 | 0.000011 | 0.00228 | Avg. Relative Error | 6.02% | 10.2% |
| fft | Radix-2 Cooley-Tukey fast Fourier transform | Signal Processing | 2 | 0 | 0 | 34 | 2048 random floating-point numbers | 32768 random floating-point numbers | 1 -> 4 -> 4 -> 2 | 0.00002 | 0.00194 | Avg. Relative Error | 2.75% | 4.1% |
| inversek2j | Inverse kinematics for 2-joint arm | Robotics | 4 | 0 | 0 | 100 | 10000 (x, y) random coordinates | 10000 (x, y) random coordinates | 2 -> 8 -> 2 | 0.000341 | 0.00467 | Avg. Relative Error | 6.2% | 9.4% |
| jmeint | Triangle intersection detection | 3D Gaming | 32 | 0 | 23 | 1,079 | 10000 random pairs of 3D triangle coordinates | 10000 random pairs of 3D triangle coordinates | 18 -> 32 -> 8 -> 2 | 0.05235 | 0.06729 | Miss Rate | 17.68% | 19.7% |
| jpeg | JPEG encoding | Compression | 3 | 4 | 0 | 1,257 | 220x200-pixel color image | Three 512x512-pixel color images | 64 -> 16 -> 8 -> 64 | 0.0000156 | 0.0000325 | Image Diff | 5.48% | 8.4% |
| kmeans | K-means clustering | Machine Learning | 1 | 0 | 0 | 26 | 220x200-pixel color image | 50000 pairs of random (r, g, b) values | 6 -> 8 -> 4 -> 1 | 0.00752 | 0.009589 | Image Diff | 3.21% | 7.3% |
| sobel | Sobel edge detector | Image Processing | 3 | 2 | 1 | 88 | 220x200-pixel color image | One 512x512-pixel color image | 9 -> 8 -> 1 | 0.000782 | 0.00405 | Image Diff | 3.89% | 5.2% |

To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [8]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM first trains a fully precise network according to a standard training algorithm, such as backpropagation [24]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification. The algorithm calculates the new error, backpropagates that error through the fully precise network using full-precision computation, and updates the weight values according to the algorithm used in the first phase. This process repeats, backpropagating the discrete errors through a precise network. The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights. We customize this algorithm by incorporating the input and output bit-width limitations in addition to the limited weight values. Additionally, this training scheme is advantageous for an analog implementation because it is general enough to also make up for errors that arise due to the analog implementation, such as the non-ideal sigmoid function and any other analog non-ideality that behaves consistently.
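The two-phase structure is easy to see in code. The sketch below is our paraphrase of the customized CDLM step for a single sigmoid neuron (the learning rate and the quantization granularity are illustrative assumptions): the error is computed with discretized inputs, weights, and outputs, but gradients flow through, and updates are applied to, the full-precision weights.

import numpy as np

def quantize(x, bits=8):
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(x, -1, 1) * levels) / levels

def cdlm_step(w, x, target, lr=0.01, bits=8):
    # Phase-2 CDLM update for one sigmoid neuron.
    wq, xq = quantize(w, bits), quantize(x, bits)      # discretized copies
    y = 1.0 / (1.0 + np.exp(-np.dot(wq, xq)))          # limited-precision forward pass
    err = quantize(y, bits) - target                   # error of the discrete output
    grad = err * y * (1.0 - y) * xq                    # backprop through precise math
    return w - lr * grad                               # update full-precision weights

w = np.zeros(4)
for _ in range(1000):
    w = cdlm_step(w, np.array([0.5, -0.25, 0.75, 0.1]), target=0.8)
print(quantize(w))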
Training with topology restrictions. Conventional multi-layer perceptron networks are fully connected, i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases the network cannot be fully connected. We impose this circuit restriction on the connectivity between neurons during the topology search, and we use a simple algorithm, guided by the mean-squared error of the network, to determine the best topology given the exposed restriction. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set (70% of the data) and a test set (the remaining 30%). The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers. We also limit the number of neurons per hidden layer to powers of two, up to 32. A sketch of this search appears below.
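A compact rendering of that search loop (ours; `train` and `accuracy` stand in for the RPROP/CDLM training and test-set evaluation described in this section, and the latency model is a placeholder):

HIDDEN_SIZES = [2, 4, 8, 16, 32]  # powers of two, up to 32

def candidate_topologies(n_in, n_out):
    # At most two hidden layers; fan-in limits are handled during training.
    for h1 in HIDDEN_SIZES:
        yield (n_in, h1, n_out)
        for h2 in HIDDEN_SIZES:
            yield (n_in, h1, h2, n_out)

def search(train_set, test_set, n_in, n_out, train, accuracy, latency):
    best = None
    for topo in candidate_topologies(n_in, n_out):
        net = train(topo, train_set)                     # RPROP + CDLM under the A-NPU spec
        key = (accuracy(net, test_set), -latency(topo))  # accuracy first, then low latency
        if best is None or key > best[0]:
            best = (key, topo, net)
    return best[1], best[2]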
To further improve accuracy and compensate for topology-restricted networks, we utilize the Resilient Back Propagation (RPROP) [16] training algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values based on the magnitude of the backpropagated error (as in conventional backpropagation [24]), the RPROP algorithm increases or decreases the weight values by a predefined amount based on the sign of the error. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half the number of training epochs that backpropagation needs to converge to a quality solution. The main advantage of applying RPROP training to an analog approach to neural computation is its robustness to the sigmoid and topology restrictions imposed by the analog design. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm.
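The sign-based update rule is small enough to show directly. This sketch is a standard RPROP-style step (the growth/shrink factors 1.2 and 0.5 are the usual defaults from the literature, assumed here rather than taken from the paper):

import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5,
               step_min=1e-6, step_max=1.0):
    # Grow the per-weight step where the gradient sign is stable,
    # shrink it where the sign flips; move by sign only, not magnitude.
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, np.minimum(step * up, step_max), step)
    step = np.where(same_sign < 0, np.maximum(step * down, step_min), step)
    return w - np.sign(grad) * step, step

w, step = np.zeros(4), np.full(4, 0.1)
g_prev = np.zeros(4)
g = np.array([0.3, -0.2, 0.0, 0.8])
w, step = rprop_step(w, g, g_prev, step)
print(w, step)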
Training with topology restrictions. Conventional multilayered perceptron networks are fully connected, i.e. the output of each neuron in one layer is routed to the input of each
neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a
neuron (eight in our design). Consequently, network connections must be limited, and in many cases, the network can not
be fully connected.
We impose the circuit restriction on the connectivity between the neurons during the topology search and we use a
5. Evaluations
Cycle-accurate simulation and energy modeling. We use
the MARSSx86 x86-64 cycle-accurate simulator [23] to model
the performance of the processor. The processor is modeled
after a single-core Intel Nehalem to evaluate the performance
benefits of A-NPU acceleration over an aggressive out-of-order architecture. The simulated core uses the following microarchitectural parameters: Fetch/Issue Width: 4/5; INT ALUs/FPUs: 6/6; Load/Store FUs: 1/1; ROB Entries: 128; Issue Queue Entries: 36; INT/FP Physical Registers: 256/256; Branch Predictor: Tournament, 48 KB; BTB Sets/Ways: 1024/4; RAS Entries: 64; Load/Store Queue Entries: 48/48; Dependence Predictor: 4096-entry Bloom filter; ITLB/DTLB Entries: 128/256; L1: 32 KB instruction, 32 KB data, 64-byte lines, 8-way, 3-cycle latency; L2: 256 KB, 64-byte lines, 8-way, 6-cycle latency; L3: 2 MB, 64-byte lines, 16-way, 27-cycle latency; memory latency: 50 ns. We extended the simulator to include ISA-level support for A-NPU queue and dequeue instructions. We also augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [12]. We use GCC v4.7.3 with -O3 to enable compiler optimizations. The baseline in our experiments is each benchmark run solely on the processor without neural transformation. We use McPAT [19] for processor energy estimations. We model the energy of an 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [21], and [13]. Both the D-NPU and the processor operate at 3.4 GHz, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.
Benchmarks. We use the benchmarks in [12] and add one more, blackscholes.

Circuit design for ANU. We implemented the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [5]. We ran detailed Spectre spice simulations to understand circuit behavior and measure ANU energy consumption. We used CACTI to estimate the energy of the A-NPU buffers.

Application error. Table 2 shows the application-level errors with a floating-point D-NPU, an A-NPU with an ideal sigmoid, and our A-NPU, which incorporates the non-idealities of the analog sigmoid. Except for jmeint, which shows error above 10%, all of the applications show error less than or around 10%. Application average error rates with the A-NPU range from 4.1% to 10.2%. This quality-of-result loss is commensurate with other work on quality trade-offs that uses digital techniques. Truffle [11] and EnerJ [26] show similar error (3–10%) for some applications and much greater error (above 80%) for others in a moderate configuration. Green [2] has error rates below 1% for some applications but greater than 20% for others. A case study [20] explores manual optimizations of the x264 video encoder that trade off 0.5–10% quality loss.

Table 2: Error with a floating-point D-NPU, A-NPU with ideal sigmoid, and A-NPU with non-ideal sigmoid.

| Application | Topology | Fully Precise Digital Sigmoid | Analog Sigmoid |
|---|---|---|---|
| blackscholes | 6 -> 8 -> 8 -> 1 | 8.39% | 10.21% |
| fft | 1 -> 4 -> 4 -> 2 | 3.03% | 4.13% |
| inversek2j | 2 -> 8 -> 2 | 8.13% | 9.42% |
| jmeint | 18 -> 32 -> 8 -> 2 | 18.41% | 19.67% |
| jpeg | 64 -> 16 -> 8 -> 64 | 6.62% | 8.35% |
| kmeans | 6 -> 8 -> 4 -> 1 | 6.10% | 7.28% |
| sobel | 9 -> 8 -> 1 | 4.28% | 5.21% |
| swaptions | 1 -> 16 -> 8 -> 1 | 2.61% | 3.34% |
| geomean | | 6.02% | 7.32% |

Whole application speedup and energy savings. Figure 5 shows the whole-application speedup and energy savings when the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 5c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefits from neural acceleration. Geometric mean speedup and energy savings with the A-NPU are 3.7× and 6.3× respectively, which is 48% and 24% better than the 8-bit, 8-PE D-NPU. Among the benchmarks, kmeans sees a slowdown with D-NPU- and A-NPU-based acceleration. All benchmarks benefit in terms of energy. The speedup with A-NPU acceleration ranges from 0.8× to 24.5×, and the energy savings range from 1.1× to 51.2×.
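To illustrate the Amdahl's Law relationship, the whole-application speedup follows from the fraction of dynamic instructions subsumed and the speedup of the transformed region. This worked sketch is ours, and the numbers in it are made-up placeholders rather than measured results:

def app_speedup(frac_subsumed, region_speedup):
    # Amdahl's Law: only the subsumed fraction of execution accelerates.
    return 1.0 / ((1.0 - frac_subsumed) + frac_subsumed / region_speedup)

# Hypothetical example: 90% of dynamic instructions subsumed, and the
# neural version of the region runs 15x faster than the original code.
print(app_speedup(0.90, 15.0))  # ~6.2x whole-application speedup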
Figure 5: Whole application speedup and energy saving with D-NPU, A-NPU, and an Ideal NPU that consumes zero energy and takes zero cycles for neural computation. (a) Whole application speedup. (b) Whole application energy saving. (c) Percentage of dynamic instructions subsumed.

6. Related Work

General-purpose approximate computing. A growing body of work has explored relaxing the abstraction of full accuracy at the circuit and architecture level for gains in performance, energy, and resource utilization [9, 11, 25, 6, 22, 12].
These circuit and architecture studies, although proven successful, are limited to purely digital techniques. We explore how a mixed-signal, analog-digital approach can go beyond what digital approximate techniques offer.

Analog and digital neural hardware. There is an extensive body of work on hardware implementations of neural networks, in both the digital [10, 30] and the analog [4, 18] domains. Recent work has proposed higher-level abstractions for the implementation of neural networks [15]. Other work has examined fault-tolerant hardware neural networks [14, 29]. In particular, Temam [29] uses datasets from the UCI machine learning repository to explore the fault tolerance of a hardware neural network design. In contrast, our compilation, neural-network selection/training framework, and architecture design aim at applying neural networks to general-purpose code written in familiar programming models and languages, not explicitly written to utilize neural networks directly.

Neural-based code acceleration. A recent study [7] shows that a number of applications can be manually reimplemented with explicit use of various kinds of neural networks. That study did not prescribe a programming workflow or a preferred hardware architecture. More recent work exposes analog spiking neurons as primitive operators [3]. This work devises a new programming model that allows programmers to express digital signal-processing applications as a graph of analog neurons and automatically maps the expressed graph to tiled analog, spiking-neural hardware. The work in [3] is restricted to the domain of applications whose inputs are real-world signals that must be encoded as pulses. Our approach intends to address the long-standing challenges of utilizing analog computation (programmability and generality) by not imposing domain-specific limitations, and by providing analog circuitry that is integrated with a conventional, digital processor without requiring a new programming paradigm.

7. Conclusions

Analog circuits inherently trade accuracy for very attractive energy-efficiency gains. However, it is challenging to utilize them in a way that is both programmable and generally useful. The transformation of general-purpose, approximable code to a neural model, as well as the use of neural accelerators, provides an avenue for realizing the benefits of analog computation by taking advantage of the fixed-function qualities of a neural network while targeting traditionally written, generally approximable code. We presented a complete solution, from circuit to compiler, for using analog neural networks to accelerate approximate applications. A very important insight from this work is that it is crucial to expose analog circuit characteristics to the compilation and neural-network training phases. Our analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1× with quality loss less than 10% for all except one benchmark.

References
[1] P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, 2nd ed.
Oxford University Press, 2002.
[2] W. Baek and T. M. Chilimbi, “Green: A framework for supporting
energy-conscious programming using controlled approximation,” in
PLDI, 2010.
[3] B. Belhadj, A. Joubert, Z. Li, R. Héliot, and O. Temam, “Continuous
real-world inputs can open up alternative accelerator designs,” in ISCA,
2013.
[4] B. E. Boser, E. Säckinger, J. Bromley, Y. L. Cun, and L. D. Jackel, “An analog neural network processor with programmable topology,” J. Solid-State Circuits, vol. 26, no. 12, December 1991.
[5] Y. Cao, “Predictive technology models,” 2013. Available: http://ptm.asu.edu
[6] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz,
K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) SOC
architectures based on probabilistic CMOS (PCMOS) technology,” in
DATE, 2006.
[7] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti,
A. Nere, S. Qiu, M. Sebag, and O. Temam, “Benchnn: On the broad
potential application scope of hardware neural network accelerators,”
in IISWC, 2012.
[8] F. Choudry, E. Fiesler, A. Choudry, and H. J. Caulfield, “A weight
discretization paradigm for optical neural networks,” in ICOE, 1990.
[9] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An architectural framework for software recovery of hardware faults,” in ISCA,
2010.
[10] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and S. M. Fakhraie,
“Neural network stream processing core (NnSP) for embedded systems,”
in ISCAS, 2006.
[11] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture
support for disciplined approximate programming,” in ASPLOS, 2012.
[12] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[13] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,”
IEEE Trans. Comput., vol. 60, no. 7, 2011.
[14] A. Hashmi, H. Berry, O. Temam, and M. Lipasti, “Automatic abstraction and fault tolerance in cortical microarchitectures,” in ISCA, 2011.
[15] A. Hashmi, A. Nere, J. J. Thomas, and M. Lipasti, “A case for neuromorphic ISAs,” in ASPLOS, 2011.
[16] C. Igel and M. Hüsken, “Improving the RPROP learning algorithm,”
in NC, 2000.
[17] D. A. Johns and K. Martin, Analog Integrated Circuit Design. John
Wiley and Sons, Inc., 1997.
[18] A. Joubert, B. Belhadj, O. Temam, and R. Héliot, “Hardware spiking
neurons design: Analog or digital?” in IJCNN, 2012.
[19] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in MICRO,
2009.
[20] S. Misailovic, S. Sidiroglou, H. Hoffman, and M. Rinard, “Quality of
service profiling,” in ICSE, 2010.
[21] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing
NUCA organizations and wiring alternatives for large caches with
CACTI 6.0,” in MICRO, 2007.
[22] S. Narayanan, J. Sartori, R. Kumar, and D. L. Jones, “Scalable stochastic processors,” in DATE, 2010.
[23] A. Patel, F. Afram, S. Chen, and K. Ghose, “MARSSx86: A full system
simulator for x86 CPUs,” in DAC, 2011.
[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal
representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
1986, vol. 1.
[25] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “Sage:
Self-tuning approximation for graphics engines,” in MICRO, 2013.
[26] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and
D. Grossman, “EnerJ: Approximate data types for safe and general
low-power computation,” in PLDI, 2011.
[27] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of
analog neural networks,” in IJCNN, 2008.
[28] S. M. Tam, B. Gupta, H. A. Castro, and M. Holler, “Learning on an
analog VLSI neural network chip,” in Systems, Man, and Cybernetics
(SMC), 1990.
[29] O. Temam, “A defect-tolerant accelerator for emerging high-performance applications,” in ISCA, 2012.
[30] J. Zhu and P. Sutton, “FPGA implementations of neural networks: A
survey of a decade of progress,” in FPL, 2003.