Low-Power Analog Deep Learning Architecture for Image Processing
Jeremy Holleman, Itamar Arel
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
Knoxville, TN, USA 37996
Abstract We present an analog implementation of the DeSTIN deep learning architecture. It uses floating-gate analog memories for parameter storage, performs current-mode computation, and features online unsupervised trainability. The circuit occupies 0.36 mm² in a 0.13 μm CMOS process. At 8.3 kHz, it consumes 11.4 μW from a 3 V supply, achieving 1×10¹² ops/s/W. Measurements demonstrate accuracy comparable to software.
Keywords: Analog signal processing; deep machine learning;
neuromorphic engineering
Introduction
Automated processing of large data sets has a wide range of applications [1], but the high dimensionality of real-world data necessitates unsupervised automated feature extraction. Deep machine learning (DML) architectures have recently emerged as a promising bio-inspired framework that mimics the hierarchical representation of information in the neocortex to achieve robust automated feature extraction [2]. However, their computational requirements result in prohibitive power consumption when implemented on conventional digital processors or GPUs. Custom analog circuitry provides a means to dramatically reduce the energy cost of the moderate-resolution computation needed for learning applications [3-8].
In this paper we describe an analog implementation of a
deep machine learning system first presented in [9]. It
features unsupervised online trainability driven by the
input data only. The proposed Analog Deep Learning
Engine (ADE) utilizes floating-gate memory to provide
non-volatile storage, facilitating operation with harvested
energy. The memory has analog current output,
interfacing naturally with the other components in the
system, and is compatible with standard digital CMOS
technology.
Architecture and Algorithm
The analog deep machine learning engine (ADE) implements the Deep Spatiotemporal Inference Network (DeSTIN) [10], a state-of-the-art compositional DML framework, illustrated in Figure 1. Seven identical nodes form a 4-2-1 hierarchy. Each node captures the regularities in its inputs through an unsupervised learning process and constructs belief states, which it passes up to the layer above. The bottom layer acts on raw data (e.g., pixels of an image), and the information becomes increasingly sparse and abstract as it moves up through the hierarchy. The beliefs formed at the top layer are then used as rich features with reduced dimensionality for post-processing.
Figure 1. Analog deep machine learning engine architecture and possible application scenarios.
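To make the dataflow concrete, the following is a minimal behavioral sketch of the 4-2-1 belief propagation (our illustration, not the authors' code; the stub node uses Euclidean rather than Mahalanobis distance, and the centroids are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def node_beliefs(x, centroids):
    """Stub for one DeSTIN node: belief = normalized inverse distance
    to each centroid (the chip uses Mahalanobis distance; plain
    Euclidean distance is used here for brevity)."""
    d = ((x - centroids) ** 2).sum(axis=1)   # distance to each centroid
    inv = 1.0 / (d + 1e-9)
    return inv / inv.sum()                   # beliefs sum to 1

# 7 nodes in a 4-2-1 hierarchy, each with 4 centroids. Bottom nodes see
# 8-D raw patches; upper nodes see the concatenated 4-D beliefs of their
# two children, so every node has an 8-D input.
C = [rng.random((4, 8)) for _ in range(7)]

def ade(patches):
    bottom = [node_beliefs(p, C[i]) for i, p in enumerate(patches)]
    middle = [node_beliefs(np.concatenate(bottom[i:i + 2]), C[4 + i // 2])
              for i in (0, 2)]
    return node_beliefs(np.concatenate(middle), C[6])  # 4-D rich features

features = ade([rng.random(8) for _ in range(4)])
```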
The node learns through an online k-means clustering algorithm [11], which extracts the salient features of the inputs by recognizing patterns of similarity in the input space. Each recognized cluster is represented with a centroid, characterized by an estimated mean μ̂ᵢ and variance σ̂ᵢ² for each dimension.
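In moving-average form, the per-dimension updates for a winning centroid can be written as follows (α denotes an adaptation rate set by the memory update pulse; the exact form is our reading of the circuit description below):

```latex
\hat{\mu}_i \leftarrow \hat{\mu}_i + \alpha\,(o_i - \hat{\mu}_i),
\qquad
\hat{\sigma}_i^2 \leftarrow \hat{\sigma}_i^2 + \alpha\left[(o_i - \hat{\mu}_i)^2 - \hat{\sigma}_i^2\right]
```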
The node, shown in Figure 2, incorporates an 8×4 array of reconfigurable analog computation cells (RACs), grouped into 4 centroids, each with an 8-dimensional input.
A training cycle begins with the classification phase, in which an input vector is assigned to the nearest centroid. The RACs calculate the squared 1-D Euclidean distances (oᵢ − μ̂ᵢ)² between the centroid means μ̂ᵢ and the input elements oᵢ, which are then wire-summed in current form to yield the total Euclidean distances D_EUC between the observation and each centroid. The nearest centroid is then selected, and its mean and variance estimates are updated such that the FGMs calculate an exponential moving-average estimate of the cluster mean and variance. The RAC then incorporates the centroids' variances to compute the Mahalanobis distance D_MAH = Σᵢ (oᵢ − μ̂ᵢ)²/σ̂ᵢ², which is used to probabilistically assign the observation to each centroid.
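A behavioral sketch of one training cycle, as we read the description above (our model, not the authors' code; alpha and C are illustrative stand-ins for the adaptation rate and the Euclidean-mode constant):

```python
import numpy as np

def train_step(o, mu, var, alpha=0.05, C=1.0):
    """One ADE training cycle. o: 8-D observation; mu, var: (4, 8)
    arrays of centroid means and variances, updated in place."""
    # Classification phase: per-dimension squared distances, wire-summed
    # per centroid into a total (scaled) Euclidean distance.
    d_euc = ((o - mu) ** 2 / C).sum(axis=1)
    k = d_euc.argmin()                        # nearest centroid wins

    # Update phase: the winner's memories track an exponential moving
    # average of the cluster mean and variance.
    err = o - mu[k]
    mu[k] += alpha * err
    var[k] += alpha * (err ** 2 - var[k])

    # Mahalanobis distances are inverse-normalized into a belief state.
    d_mah = ((o - mu) ** 2 / var).sum(axis=1)
    inv = 1.0 / (d_mah + 1e-12)
    return inv / inv.sum()                    # beliefs sum to 1
```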
To take full advantage of the algorithm's robustness and avoid performance degradation due to analog error sources, extensive system-level simulations were performed [12]. The results informed the noise and matching specifications for the computational circuits.
Figure 2. The node architecture. Each node includes four 8-D centroids as well as control and processing circuitry.
Figure 3. Schematic of the analog arithmetic element.
Circuit Implementation
The Reconfigurable Analog Computation cell (RAC) performs three different operations using subthreshold current-mode computation and reconfigurable current routing. The input current oᵢ and the centroid mean μ̂ᵢ stored in the FGM_μ are added with opposite polarity, and the difference current oᵢ − μ̂ᵢ is rectified by the absolute-value circuit Abs. The unidirectional output current is then fed into the X²/Y operator circuit, where the Y component can be either the centroid variance memory output σ̂ᵢ², or a constant C, to compute D_MAH or D_EUC, respectively. The absolute error |oᵢ − μ̂ᵢ| is used to generate the update pulse for the mean memory, while the squared error |oᵢ − μ̂ᵢ|² is computed to determine the update for the variance memory.
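In behavioral terms, one RAC cell could be modeled as below (our sketch under this reading of the text; signal names follow the figure labels):

```python
def rac_cell(o_i, mu_i, var_i, mode="MAH", C=1.0):
    """Behavioral model of one RAC (a sketch, not a circuit simulation).
    Returns the 1-D distance plus the two error terms used by the
    training control: |o - mu| for the mean memory and |o - mu|^2 for
    the variance memory."""
    x = abs(o_i - mu_i)                 # subtraction + Abs rectifier
    y = var_i if mode == "MAH" else C   # Y input of the X^2/Y operator
    return x * x / y, x, x * x
```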
The schematic of the analog arithmetic element (AAE) is shown in Figure 3. Amplifier A and the class-B structure M1/M2 provide a virtual ground at the input, allowing high-speed resolution of small current differences. The X²/Y operator circuit employs the translinear principle [13] to implement efficient current-mode arithmetic. Parameters are stored in an analog floating-gate memory (FGM) [14].
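For reference, the standard translinear derivation of the X²/Y function (cf. [13]): with subthreshold devices obeying I_D ∝ exp(V_GS/nU_T), a loop that enforces 2V_GS,X = V_GS,Y + V_GS,Z gives

```latex
2\ln\frac{I_X}{I_0} = \ln\frac{I_Y}{I_0} + \ln\frac{I_Z}{I_0}
\quad\Longrightarrow\quad
I_Z = \frac{I_X^2}{I_Y}
```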
Figure 4. Schematic of one channel of the distance processing unit. The translinear loop formed by M1 and M2 yields an inverse relationship between their two drain currents, while the current source I_Norm constrains all output currents I_Out to be normalized to a constant sum.
The Distance Processing Unit (DPU) performs inverse
normalization and a winner-take-all operation [16] on the
combined distance outputs from the four centroids. The
simplified schematic of one channel is shown in Figure 4.
The DPU also implements the starvation trace [15], which
allows poorly initialized centroids to be slowly drawn
towards populated areas of the data space.
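Functionally, the inverse normalization, winner selection, and starvation trace could be described as follows (our functional reading of [15], not a circuit simulation; the decay rate gamma is an illustrative parameter, not a value from the paper):

```python
import numpy as np

def dpu(d, trace, gamma=0.01):
    """One DPU cycle. d: distances from the 4 centroids; trace:
    per-centroid starvation trace in (0, 1]."""
    d_eff = d * trace                  # starved centroids appear closer
    inv = 1.0 / d_eff                  # translinear inversion, I_out ~ 1/I_in
    beliefs = inv / inv.sum()          # normalized to a constant sum
    k = int(beliefs.argmax())          # winner-take-all
    trace *= (1.0 - gamma)             # losers' traces decay ...
    trace[k] = 1.0                     # ... the winner's is replenished
    return beliefs, k
```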
The training control circuit converts the memory error
current to a pulse width to control the memory adaptation.
For each dimension, two TC cells are implemented, one
for mean and one for variance, shared across centroids.
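Stated compactly, our reading of this conversion (the proportionalities are inferred from the text, not specified; the update direction is set by the sign of the error):

```latex
T_{\mathrm{pulse}} \propto I_{\mathrm{err}} = |o_i - \hat{\mu}_i|,
\qquad
|\Delta\hat{\mu}_i| \propto T_{\mathrm{pulse}}
\;\Rightarrow\;
\Delta\hat{\mu}_i \propto (o_i - \hat{\mu}_i)
```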
Table 1. Performance Summary
Technology: 1P8M 0.13 μm CMOS
Power Supply: 3 V
Active Area: 0.9 mm × 0.4 mm
Memory: Non-volatile floating gate
Memory SNR: 46 dB
Training Algorithm: Unsupervised online clustering
Input-Referred Noise: 56.23 pA_rms
System SNR: 45 dB
I/O Type: Analog current
Operating Frequency: 4.5 kHz (training) / 8.3 kHz (recognition)
Power Consumption: 27 μW (training) / 11.4 μW (recognition)
Energy Efficiency: 480 GOPS/W (training) / 1.04 TOPS/W (recognition)
Measurement Results
The ADE occupies an active area of 0.36 mm² in a 0.13 μm CMOS process. With a 3 V power supply, it consumes 27 μW in training mode and 11.4 μW in recognition mode.
Figure 5. Clustering test results with a bad initial condition, without (left) and with (right) the starvation trace enabled.
Clustering Test To test the clustering performance of the node, an input dataset of 40,000 8-D vectors was drawn from 4 underlying Gaussian distributions with different means and variances. During the test, the centroid means were read out every 0.5 s; their locations are overlaid on the data scatter in Figure 5 (2-D results are shown for easier visual interpretation). The performance of the starvation trace is verified by presenting the node with an ill-posed clustering problem in which one centroid is initialized far from the input data and is therefore never updated when the ST is disabled (Figure 5, left). With the starvation trace enabled, the starved centroid is slowly pulled toward the area populated by the data, achieving a correct clustering result (Figure 5, right).
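A minimal recreation of this test's input data (the cluster means and standard deviations below are illustrative; the paper does not list them):

```python
import numpy as np

rng = np.random.default_rng(0)
# 4 clusters x 10,000 points = 40,000 8-D vectors.
means = rng.uniform(-1.0, 1.0, size=(4, 8))
sigmas = rng.uniform(0.05, 0.2, size=4)
data = np.concatenate([m + s * rng.standard_normal((10_000, 8))
                       for m, s in zip(means, sigmas)])
rng.shuffle(data)  # randomize presentation order
```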
Feature Extraction Test We demonstrate the full functionality of the chip by performing feature extraction for pattern recognition with the setup shown in Figure 6(a). The input patterns are 16×16 bitmaps corrupted by random pixel errors. An 8×4 moving window selects the ADE's 32 inputs. The ADE is first trained on unlabeled patterns. After training, adaptation can be disabled, and the circuit operates in recognition mode. The 4 belief states from the top layer are used as rich features, achieving a dimensionality reduction from 32 to 4. A software neural network then classifies the reduced-dimension patterns. Three chips were tested; average recognition accuracies of 100% at pixel corruption levels below 10% and 94% at 20% corruption were obtained, comparable to the floating-point software baseline, as shown in Figure 6(d), demonstrating robustness to the non-idealities of analog computation.
Figure 6. (a) Feature extraction test setup. (b) The convergence of centroids during training. (c) Rich features output from the top layer, showing the effectiveness of normalization. (d) Measured classification accuracy using the features extracted by the chip. The plot on the right shows the mean accuracy and 95% confidence interval (2σ) from the three chips tested, compared to the software baseline.
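In outline, the test harness behaves like this (our sketch: the scan order, the corruption model, and the ade.recognize() interface are assumptions, not the authors' code):

```python
import numpy as np

def extract_features(bitmap, ade, rng, p):
    """Corrupt a 16x16 bitmap with i.i.d. pixel errors of rate p, scan
    it with an 8x4 window, and return the top-layer belief state after
    the full scan as the 4-D feature vector."""
    flips = rng.random(bitmap.shape) < p
    noisy = np.where(flips, 1 - bitmap, bitmap)
    beliefs = None
    for r in range(0, 9, 8):                      # window rows 0 and 8
        for c in range(0, 13, 4):                 # window cols 0, 4, 8, 12
            patch = noisy[r:r + 8, c:c + 4].ravel()  # 32 ADE inputs
            beliefs = ade.recognize(patch)        # 4-D belief state
    return beliefs                                # rich features -> classifier
```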
Noise Performance To measure input-referred noise, we construct a histogram of classifications near a decision boundary with adaptation disabled. Assuming additive Gaussian noise, the relative frequency of one of the two classification results approximates the cumulative distribution function (CDF) of a normal distribution whose standard deviation equals the RMS current noise. The measured input-referred current noise is 56.23 pA_rms, which, with an input full-scale range of 10 nA, corresponds to an SNR of 45 dB, or 7.5-bit resolution.
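These figures are mutually consistent:

```latex
\mathrm{SNR} = 20\log_{10}\!\frac{10\,\mathrm{nA}}{56.23\,\mathrm{pA_{rms}}}
\approx 45\,\mathrm{dB},
\qquad
\frac{45\,\mathrm{dB}}{6.02\,\mathrm{dB/bit}} \approx 7.5\ \mathrm{bits}
```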
Efficiency Comparison The measured performance of the
ADE is summarized in Table 1. It achieves an energy
efficiency of 480 GOPS/W in training mode and 1.04
TOPS/W in recognition mode. Table 2 shows that,
compared to other mixed-signal computational circuits,
this work achieves very high energy efficiency in both
modes.
Table 2. Performance Comparison

                       | This Work              | JSSC '13 [5]           | ISSCC '13 [6]      | JSSC '10 [7]
Process                | 0.13 μm                | 0.13 μm                | 0.13 μm            | 0.13 μm
Purpose                | DML Feature Extraction | Neural-Fuzzy Processor | Object Recognition | Object Recognition
Non-volatile Memory    | Floating-Gate          | N/A                    | N/A                | N/A
Power                  | 11.4 μW                | 57 mW                  | 260 mW             | 496 mW
Peak Energy Efficiency | 1.04 TOPS/W            | 655 GOPS/W             | 646 GOPS/W         | 290 GOPS/W
Additionally, a digital equivalent of the ADE was implemented in the same process using logic synthesized from standard cells, with 8-bit arithmetic resolution and 12-bit memory width. According to post-layout power estimation, this digital equivalent running at 2 MHz in training mode consumes 3.46 mW, yielding an energy efficiency of 1.66 GOPS/W, 288 times lower than that of the analog implementation.
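The quoted ratio is simply the quotient of the training-mode efficiencies; the small discrepancy from 288 reflects rounding of the quoted figures:

```latex
\frac{480\ \mathrm{GOPS/W}\ (\text{analog})}{1.66\ \mathrm{GOPS/W}\ (\text{digital})} \approx 289
```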
Conclusions
In this paper, we describe an analog deep machine-learning system. It overcomes the limitations of conventional digital implementations through the efficiency of analog signal processing while exploiting algorithmic robustness to mitigate the effects of the non-idealities of analog computation. It demonstrates
efficiency more than two orders of magnitude greater than
custom digital VLSI with accuracy comparable to a
floating-point software implementation. The low power
consumption and non-volatile memory make it attractive
for autonomous sensory applications and as a building
block for large-scale learning systems.
Acknowledgements
This work was partially supported by the NSF through
grant #CCF-1218492 and by DARPA through agreement
#HR0011-13-2-0016. The views and conclusions
contained herein are those of the authors and should not
be interpreted as representing the official policies or
endorsements, either expressed or implied, of DARPA,
the NSF, or the U.S. Government.
REFERENCES
[1] "Machine Learning Surveys," [Online]. Available:
http://www.mlsurveys.com/.
[2] I. Arel, D. Rose and T. Karnowski, "Deep machine learning - a
new frontier in artificial intelligence research," Computational
Intelligence Magazine, IEEE, vol. 5, no. 4, pp. 13-18, 2010.
[3] R. Sarpeshkar, "Analog versus digital: extrapolating from electronics to neurobiology," Neural Comput., vol. 10, pp. 1601-1638, Oct. 1998.
[4] J. Holleman, S. Bridges, B. Otis and C. Diorio, "A 3 μW CMOS true random number generator with adaptive floating-gate offset cancellation," IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1324-1336, May 2008.
[5] J. Oh, G. Kim, B.-G. Nam and H.-J. Yoo, "A 57 mW 12.5
µJ/Epoch embedded mixed-mode neuro-fuzzy processor for mobile
real-time object recognition," IEEE J. Solid-State Circuits, vol. 48,
no. 11, pp. 2894-2907, Nov. 2013.
[6] J. Park, I. Hong, G. Kim, Y. Kim, K. Lee, S. Park, K. Bong and H.-J. Yoo, "A 646 GOPS/W multi-classifier many-core processor with
cortex-like architecture for super-resolution recognition," in IEEE
Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb.
2013, pp. 17-21.
[7] J.-Y. Kim, M. Kim, S. Lee, J. Oh, K. Kim and H.-J. Yoo, "A 201.4
GOPS 496 mW real-time multi-object recognition processor with
bio-inspired neural perception engine," IEEE J. Solid-State
Circuits, vol. 45, no. 1, pp. 32-45, Jan. 2010.
[8] S. Chakrabartty and G. Cauwenberghs, "Sub-microwatt analog
VLSI trainable pattern classifier," IEEE J. Solid-State Circuits, vol.
42, no. 5, pp. 1169-1179, May 2007.
[9] J. Lu, S. Young, I. Arel and J. Holleman, "A 1TOPS/W analog
deep machine-learning engine with floating-gate storage in 0.13μm
CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
Tech. Papers, Feb. 2014, pp. 504-505.
[10] S. Young, A. Davis, A. Mishtal and I. Arel, "Hierarchical
spatiotemporal feature extraction using recurrent online clustering,"
Pattern Recognition Letters, vol. 37, pp. 115-123, Feb. 2014.
[11] J. Lu, S. Young, I. Arel and J. Holleman, "An analog online clustering circuit in 130 nm CMOS," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2013, pp. 177-180.
[12] S. Young, J. Lu, J. Holleman and I. Arel, "On the impact of
approximate computation in an analog DeSTIN architecture," IEEE
Trans. Neural Netw. Learn. Syst., vol. PP, no. 99, p. 1, Oct. 2013.
[13] B. Gilbert, "Translinear circuits: a proposed classification,"
Electron. Lett., vol. 11, no. 1, pp. 14-16, 1975.
[14] J. Lu and J. Holleman, "A floating-gate analog memory with
bidirectional sigmoid updates in a standard digital process," in
Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2013, vol. 2,
pp. 1600-1603.
[15] S. Young, I. Arel, T. Karnowski and D. Rose, "A fast and stable incremental clustering algorithm," in Proc. 7th International Conference on Information Technology, Apr. 2010.
[16] J. Lazzaro, S. Ryckebusch, M. A. Mahowald and C. Mead,
"Winner-take-all networks of O(n) complexity," Advances in
Neural Information Processing Systems 1, pp. 703-711, Morgan
Kaufmann Publishers, San Francisco, CA, 1989.