ISSCC 2014 / SESSION 30 / TECHNOLOGIES FOR NEXT-GENERATION SYSTEMS / 30.10
30.10
A 1TOPS/W Analog Deep Machine-Learning Engine
with Floating-Gate Storage in 0.13µm CMOS
Junjie Lu, Steven Young, Itamar Arel, Jeremy Holleman
University of Tennessee, Knoxville, TN
Direct processing of raw high-dimensional data such as images and video by
machine-learning systems is impractical, due both to prohibitive power
consumption and to the “curse of dimensionality,” which makes learning tasks
exponentially more difficult as dimension increases. Deep machine learning
(DML) mimics the hierarchical representation of information in the human brain to
achieve robust automated feature extraction, reducing the dimension of such
data. However, the computational complexity of DML systems limits large-scale
implementations in standard digital computers. Custom analog or mixed-mode
signal processors have been reported to yield much higher energy efficiency
than DSPs [1-4], offering a means of overcoming these limitations. However,
the use of volatile digital memory in [1-3] precludes their use in intermittently powered devices, and the required interfacing and internal A/D/A conversions
add power and area overhead. Nonvolatile storage is employed in [4], but the
lack of learning capability requires task-specific programming before operation,
and precludes online adaptation.
The feasibility of analog clustering, a key component of DML, has been demonstrated in [5]. In this paper, we present an analog DML engine (ADE)
implementing DeSTIN [6], a state-of-the-art DML framework, and featuring online
unsupervised trainability. Floating-gate nonvolatile memory facilitates operation
with intermittent harvested energy. An energy efficiency of 1TOPS/W is achieved
through massive parallelism, deep weak-inversion biasing, current-mode analog
arithmetic, distributed memory, and power gating applied to per-operation
partitions. Additionally, algorithm-level feedback desensitizes the system to
errors such as offset and noise, allowing reduced device sizes and bias currents.
Figure 30.10.1 shows the architecture of the ADE, in which seven identical
cortical circuits (nodes) form a 4-2-1 hierarchy. Each node captures regularities
in its inputs through an unsupervised learning process. The lowest layer receives
raw data (e.g. the pixels of an image), and continuously constructs belief states
that characterize the sequence observed. The inputs to nodes in the 2nd and 3rd
layers are the belief states of the nodes in the layer below. The beliefs
of the top layer are then used as rich features for a classifier.
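To make the data flow concrete, a minimal Python sketch of the 4-2-1 hierarchy is given below; node_step, the initial centroid values, and the random inputs are illustrative stand-ins, not the chip's implementation.

import numpy as np

def node_step(node, x):
    """Stand-in for one node: return a 4-D belief (one likelihood per
    centroid) for the 8-D input x, via inverse-normalized distances."""
    d = ((x[None, :] - node["mu"]) ** 2 / node["var"]).sum(axis=1)  # D_MAH
    sim = 1.0 / (d + 1e-9)
    return sim / sim.sum()                    # inverse normalization (IN)

def make_node(rng):
    return {"mu": rng.random((4, 8)), "var": np.full((4, 8), 0.1)}

# 4-2-1 hierarchy: 4 nodes in layer 1, 2 in layer 2, 1 at the top.
rng = np.random.default_rng(0)
layers = [[make_node(rng) for _ in range(n)] for n in (4, 2, 1)]

raw = rng.random(32)                          # e.g. 32 pixels, 8 per bottom node
beliefs = [node_step(n, raw[i*8:(i+1)*8]) for i, n in enumerate(layers[0])]
# Each upper-layer node consumes the concatenated beliefs of two children.
beliefs = [node_step(n, np.concatenate(beliefs[i*2:(i+1)*2]))
           for i, n in enumerate(layers[1])]
rich_features = node_step(layers[2][0], np.concatenate(beliefs))  # 4-D output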
The node (Fig. 30.10.2) incorporates an 8×4 array of reconfigurable analog
computation cells (RAC), grouped into 4 centroids, each with 8-dimensional
input. The centroids are characterized by their mean μ and variance σ² in each
dimension, stored in their respective floating-gate memories (FGM). In a training
cycle, the analog arithmetic elements (AAE) calculate a simplified Mahalanobis
distance (assuming a diagonal covariance matrix), D_MAH = Σ[(o−μ)²/σ²], between
the input observation o and each centroid. The 8-D distances are formed by
summing the AAE output currents. A distance processing unit (DPU) performs an
inverse-normalization (IN) operation on the 4 distances to construct the belief
states, which are the likelihoods that the input belongs to each centroid. Then
the centroid parameters μ and σ² are adapted by the online clustering algorithm.
The centroid with the smallest Euclidean distance D_EUC = Σ[(o−μ)²/C] (C a
constant) to the input is selected (classification). The errors Err_μ = o−μ and
Err_σ² = (o−μ)²/C − σ² between the selected centroid and the input are loaded
into the training control (TC), and its memories are then updated proportionally.
In recognition mode, only the belief states are constructed and the memories are
not adapted. Intra-cycle power gating is applied to reduce the power
consumption by up to 37%.
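In software form, one training cycle corresponds to the following sketch, using the update rules annotated in Fig. 30.10.2 (μ(n+1) = μ(n) + α·Err_μ, σ²(n+1) = σ²(n) + β·Err_σ²); the learning rates alpha and beta and the constant C are assumed placeholders, not the chip's values.

import numpy as np

def training_cycle(o, mu, var, alpha=0.05, beta=0.05, C=1.0):
    """One node training cycle: belief construction, classification,
    and proportional memory update of the winning centroid.
    o: (8,) observation; mu, var: (4, 8) centroid parameters (mutated)."""
    diff = o[None, :] - mu
    d_mah = (diff**2 / var).sum(axis=1)       # simplified Mahalanobis distance
    sim = 1.0 / (d_mah + 1e-12)
    beliefs = sim / sim.sum()                 # inverse normalization (IN)

    d_euc = (diff**2 / C).sum(axis=1)         # scaled Euclidean distance
    w = int(np.argmin(d_euc))                 # winner-take-all (classification)

    err_mu = o - mu[w]                        # Err_mu = o - mu
    err_var = (o - mu[w])**2 / C - var[w]     # Err_var = (o-mu)^2/C - var
    mu[w] += alpha * err_mu                   # mu converges to MEAN(o)
    var[w] += beta * err_var                  # var converges to VAR(o)/C
    return beliefs, w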
Figure 30.10.3 shows the schematic of the RAC, which performs three different
operations through reconfigurable current routing. Two embedded FGMs
provide nonvolatile storage for centroid parameters. Capacitive feedback
stabilizes the floating-gate voltage (VFG) to yield pulse-width-controlled updates.
Tunneling is enabled by lowering the memory's supply, which pulls down VFG and
increases the voltage across the tunneling junction. Injection is enabled by raising the source
of the injection transistor. This allows randomly accessible bidirectional updates
without the need for on-chip high-voltage switches or a charge pump. A 2-T V-I
converter then provides a current output and sigmoid update rule. The FGM
consumes 0.5nA of bias current, and shows an 8b programming accuracy and
a 46dB SNR at full scale. The absolute value circuit (ABS) in the AAE rectifies the
bidirectional difference current between o and μ. Class-B operation and the
virtual ground provided by amplifier A allow high-speed resolution of small
current differences. The rectified currents are then fed into a translinear X²/Y
circuit, which simulations indicate operates with more than an order of
magnitude higher energy efficiency than its digital equivalent.
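As context (not reproduced from the paper), the standard translinear-principle analysis for matched MOSFETs in weak inversion shows why such a loop computes X²/Y: each device obeys I_D = I_0·e^{V_GS/(nV_T)}, so summing gate-source drops around a loop with two X-devices opposing a Y-device and a Z-device gives

\[ 2\,nV_T\ln\frac{I_X}{I_0} = nV_T\ln\frac{I_Y}{I_0} + nV_T\ln\frac{I_Z}{I_0} \quad\Rightarrow\quad I_Z = \frac{I_X^2}{I_Y}. \]

With X = |o−μ| and Y = σ² (the labels in Fig. 30.10.3), each cell contributes one (o−μ)²/σ² term of D_MAH.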
In the belief construction phase, the DPU (Fig. 30.10.4) inverts the distance
outputs from the 4 centroids to calculate similarities, and normalizes them to
yield a valid probability distribution. The output belief states are sampled and
then held for the rest of the cycle to allow parallel operation of all layers. The
sampling switch reduces current-sampling error due to charge injection: a
diode-connected PMOS provides a reduced VGS to the switch NMOS to turn it on with
minimal channel charge. The S/H achieves less than 0.7mV of charge-injection
error (2% current error), and less than 14μV of droop using parasitic wiring
capacitance as the holding capacitor. In the classification phase, the IN circuits are reused together
with the winner-take-all network (WTA) to classify the observation to the nearest
centroid. A starvation trace (ST) circuit is implemented to address unfavorable
initial conditions wherein some centroids are starved of nearby inputs and never
updated. The ST provides starved centroids with a small but increasing
additional current to force their occasional selection and pull them into more
populated areas of the input space. The lower right of Fig. 30.10.4 shows the TC
circuit, which performs current-to-pulse-width conversion using a VDD-referenced
comparison. Proportional updates cause the mean and variance memories to
converge to the respective sample statistics.
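A behavioral sketch of the ST mechanism follows; the bonus growth rate and the reset-on-win behavior are our assumptions (on chip, the trace is realized as a small analog current added at the WTA input).

import numpy as np

def select_with_starvation(d_euc, st, st_step=0.01):
    """Winner-take-all with a starvation trace (ST).
    d_euc: (4,) distances; st: (4,) starvation bonuses (mutated in place).
    A centroid that never wins accumulates a growing bonus until it is
    finally selected, pulling it toward populated input regions."""
    # The ST is modeled as a bonus subtracted from the distance, which is
    # equivalent to injecting a small extra current favoring starved centroids.
    w = int(np.argmin(d_euc - st))
    st += st_step          # every centroid's trace keeps growing...
    st[w] = 0.0            # ...but winning resets it
    return w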
The ADE is evaluated on a custom test board with data acquisition hardware
connecting to a host PC. The waveforms in Fig. 30.10.5 show the measured
beliefs, one from each layer. The sampling of beliefs proceeds from the top layer
to the bottom to avoid delays due to output settling. The performance of the
node is demonstrated by solving a clustering problem. The input data consists
of 4 underlying clusters, each drawn from a Gaussian distribution with different
mean and variance. The node achieves accurate extraction of the cluster
parameters (μ and σ²), and the ST ensures robust operation against
unfavorable initial conditions.
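For reference, a comparable 2-D test set can be synthesized from the cluster parameters annotated in Fig. 30.10.5; the sample count and interleaving are illustrative choices.

import numpy as np

# 2-D cluster parameters from Fig. 30.10.5: (mean, variance) per axis.
clusters = [((3, 3), (1.0, 1.0)), ((7, 3), (1.0, 1.0)),
            ((3, 7), (0.6, 0.6)), ((7, 7), (1.5, 1.5))]

rng = np.random.default_rng(1)
samples = np.vstack([rng.normal(mu, np.sqrt(var), size=(250, 2))
                     for mu, var in clusters])   # 1000 points, 4 clusters
rng.shuffle(samples)                             # interleave for online training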
We demonstrate feature extraction for pattern recognition with the setup shown
in Fig. 30.10.6. The input patterns are 16×16 symbol bitmaps corrupted by
random pixel errors. An 8×4 moving window defines the pixels applied to the
ADE’s 32-D input. First, the ADE is trained without supervision on example
patterns. After training converges, the 4 belief states from the top layer are
used as rich features and classified with a neural network implemented in
software, achieving a dimension reduction from 32 to 4. Recognition accuracies
of 100% with corruption lower than 10%, and 95.4% with 20% corruption are
obtained, comparable to a software baseline, demonstrating robustness to the
non-idealities of analog computation.
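The front end of this experiment can be sketched as follows; the window stride and the pixel-corruption model are our assumptions, and the software neural-network classifier (whose details the paper does not give) is omitted.

import numpy as np

rng = np.random.default_rng(2)

def corrupt(bitmap, p):
    """Flip each pixel of a 16x16 binary bitmap with probability p."""
    return np.logical_xor(bitmap, rng.random(bitmap.shape) < p).astype(float)

def window_inputs(bitmap, stride=4):
    """Slide an 8x4 window over the bitmap; each position yields one 32-D
    input vector for the ADE (the stride is a guess, not stated in the paper)."""
    return [bitmap[r:r+8, c:c+4].reshape(32)
            for r in range(0, 16 - 8 + 1, stride)
            for c in range(0, 16 - 4 + 1, stride)]

# Per the paper: run the ADE over the windows of each (corrupted) pattern
# until unsupervised training converges, then classify the 4 top-layer
# beliefs with a small software network (32-D pixels -> 4-D features).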
The ADE was fabricated in a 0.13μm CMOS process with thick-oxide IO FETs.
The die micrograph is shown in Fig. 30.10.7, together with a performance
summary and a comparison with state-of-the-art bio-inspired parallel processors
utilizing analog computation. We achieve 1TOPS/W peak energy efficiency in
recognition mode. Compared to the state of the art, this work achieves very high
energy efficiency in both modes. This, combined with the advantages of nonvolatile
memory and unsupervised online trainability, makes the ADE a general-purpose
feature-extraction engine, ideal for autonomous sensory applications or as a
building block for large-scale learning systems.
References:
[1] J. Park, et al., “A 646GOPS/W Multi-Classifier Many-Core Processor with
Cortex-Like Architecture for Super-Resolution Recognition,” ISSCC Dig. Tech.
Papers, pp. 168-169, Feb. 2013.
[2] J. Oh, et al., “A 57mW Embedded Mixed-Mode Neuro-Fuzzy Accelerator for
Intelligent Multi-Core Processor,” ISSCC Dig. Tech. Papers, pp. 130-132, Feb.
2011.
[3] J.-Y. Kim, et al., “A 201.4GOPS 496mW Real-Time Multi-Object Recognition
Processor With Bio-Inspired Neural Perception Engine,” ISSCC Dig. Tech.
Papers, pp. 150-151, Feb. 2009.
[4] S. Chakrabartty and G. Cauwenberghs, “Sub-Microwatt Analog VLSI
Trainable Pattern Classifier,” IEEE J. Solid-State Circuits, vol. 42, no. 5, pp.
1169-1179, May 2007.
[5] J. Lu, et al., “An Analog Online Clustering Circuit in 130nm CMOS,” IEEE
Asian Solid-State Circuits Conference, Nov. 2013.
[6] S. Young, et al., “Hierarchical Spatiotemporal Feature Extraction using
Recurrent Online Clustering,” Pattern Recognition Letters, Oct. 2013.
Figure 30.10.1: The analog deep learning engine (ADE) architecture.

Figure 30.10.2: The node architecture and its timing diagram showing power gating.

Figure 30.10.3: The reconfigurable current routing of the RAC, the schematics of the FGM and AAE, and measurement results.

Figure 30.10.4: The schematic of the DPU and its sub-blocks (IN, S/H, WTA, starvation trace). The training control is shown on the lower right.

Figure 30.10.5: Measured output waveforms, clustering and ST test results. For clustering, 2-D results are shown for better visualization.

Figure 30.10.6: Pattern recognition test setup and results, demonstrating accuracy comparable to baseline software simulation.
Figure 30.10.7: Die micrograph (7 nodes, RAC, DPU and TC visible; 0.9mm × 0.4mm active area), performance summary and comparison table.

Performance summary:
  Technology: 1P8M 0.13μm CMOS
  Power supply: 3V
  Active area: 0.9mm × 0.4mm
  Memory: non-volatile floating gate (46dB SNR)
  Training: unsupervised online clustering algorithm
  Output feature: inverse-normalized Mahalanobis distances
  Input-referred noise: 56.23pArms
  System SNR: 45dB
  I/O type: analog current
  Operating frequency: 4.5kHz (training mode), 8.3kHz (recognition mode)
  Power consumption: 27μW (training mode), 11.4μW (recognition mode)
  Energy efficiency: 480GOPS/W (training mode), 1.04TOPS/W (recognition mode)

Comparison:
                          This work               ISSCC'13 [1]        ISSCC'11 [2]             ISSCC'09 [3]
  Process                 0.13μm                  0.13μm              0.13μm                   0.13μm
  Purpose                 DML feature extraction  Object recognition  Neuro-fuzzy accelerator  Object recognition
  Non-volatile memory     Floating gate           N/A                 N/A                      N/A
  Power                   11.4μW                  260mW               57mW                     496mW
  Peak energy efficiency  1.04TOPS/W              646GOPS/W           655GOPS/W                290GOPS/W