Restricted Boltzmann Machine
and Deep Belief Net
Wanli Ouyang
wlouyang@ee.cuhk.edu.hk
Animation is available for illustration
1
Outline
• Short introduction on deep learning
• Short introduction on statistical models
and Graphical model
• Restricted Boltzmann Machine (RBM)
and Contrastive divergence
• Deep belief net (DBN)
RBM and DBN are statistical models
A deep belief net is trained using RBMs and CD
A deep belief net provides unsupervised training for a deep neural network
2
Good learning resources
• Webpages:
– Geoffrey E. Hinton’s readings (with source code available for DBN)
http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
– Notes on Deep Belief Networks http://www.quantumg.net/dbns.php
– MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean
http://videolectures.net/mlss2010au_frean_deepbeliefnets/
– Deep Learning Tutorials http://deeplearning.net/tutorial/
– Hinton’s Tutorial, http://videolectures.net/mlss09uk_hinton_dbn/
– Fergus’s Tutorial, http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
– CUHK MMlab project :
http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
• People:
– Geoffrey E. Hinton http://www.cs.toronto.edu/~hinton
– Andrew Ng http://www.cs.stanford.edu/people/ang/index.html
– Ruslan Salakhutdinov http://www.utstat.toronto.edu/~rsalakhu/
– Yee-Whye Teh http://www.gatsby.ucl.ac.uk/~ywteh/
– Yoshua Bengio www.iro.umontreal.ca/~bengioy
– Yann LeCun http://yann.lecun.com/
– Marcus Frean http://ecs.victoria.ac.nz/Main/MarcusFrean
– Rob Fergus http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
• Acknowledgement
– Many materials in this presentation are from these papers, tutorials, etc. (especially
Hinton's and Frean's). Apologies for not listing them in full detail.
3
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.
A brief timeline of neural networks and deep learning
• 1986: Neural network with back propagation
– Solves general learning problems
– Tied with biological systems
– But it was given up: hard to train, insufficient computational resources, small training sets, does not work well
• Shallow models and methods: SVM, Boosting, Decision tree, KNN, …
– Specific methods for specific tasks
– Hand-crafted features (GMM-HMM, SIFT, LBP, HOG)
– Loose tie with biological systems
• 2006: Deep belief net (Science)
– Unsupervised & layer-wise pre-training
– Better designs for modeling and training (normalization, nonlinearity, dropout)
– Feature learning
– New development of computer architectures: GPU, multi-core computer systems
– Large scale databases: object recognition over 1,000,000 images and 1,000 categories
• 2011, 2012: deep learning results in speech recognition and in the ImageNet challenge:

Rank | Name        | Error rate | Description
1    | U. Toronto  | 0.15315    | Deep learning
2    | U. Tokyo    | 0.26172    | Hand-crafted features and learning models. Bottleneck. (ranks 2–4)
3    | U. Oxford   | 0.26979    |
4    | Xerox/INRIA | 0.27058    |

How many computers to identify a cat? 16,000 cores (vs. 2 GPUs).
Kruger et al. TPAMI'13
Outline
• Short introduction on deep learning
• Short introduction on statistical models
and Graphical model
• Restricted Boltzmann Machine (RBM)
and Contrastive divergence
• Deep belief net (DBN)
5
Graphical model for Statistics
• Conditional independence between random variables
• Given C, A and B are independent:
– P(A, B|C) = P(A|C)P(B|C)
• P(A,B,C) = P(A, B|C) P(C)
– = P(A|C)P(B|C)P(C)
• Each node is conditionally independent of its non-descendants given the values of its parents.
[Figure: Bayesian network with C = "Smoker?", A = "Has lung cancer", B = "Has bronchitis".]
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
6
Directed and undirected graphical models
• Directed graphical model
– P(A,B,C) = P(A|C)P(B|C)P(C)
– With a fourth node D whose parents are A and B: P(A,B,C,D) = P(D|A,B)P(B|C)P(A|C)P(C)
– Each node is conditionally independent of its non-descendants given the values of its parents.
• Undirected graphical model
– P(A,B,C) = f(A,C) f(B,C) / Z, a normalized product of potential functions over the cliques
– Also called Markov Random Field (MRF)
[Figures: directed graphs over A, B, C (and D); an undirected graph over A, B, C.]
7
Modeling an undirected model
• Probability:
P(x; θ) = f(x; θ) / Σ_x f(x; θ) = f(x; θ) / Z(θ),   where Z(θ) = Σ_x f(x; θ) is the partition function
• Example: P(A,B,C) = f(B,C) f(A,C) / Z for binary A, B, C:
P(A,B,C; θ) = exp(w1·BC + w2·AC) / Σ_{A,B,C} exp(w1·BC + w2·AC)
            = exp(w1·BC) exp(w2·AC) / Z(w1, w2)
[Figure: undirected graph with C = "Is smoker", B = "Is healthy", A = "Has lung cancer"; weight w1 on edge B–C and w2 on edge A–C.]
8
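To make the partition function concrete, here is a minimal numerical sketch (not from the slides) that enumerates all 2^3 configurations of the binary example above; the weight values w1, w2 are arbitrary choices for illustration.

```python
import itertools
import numpy as np

# Hypothetical weights for the example P(A,B,C) = exp(w1*B*C + w2*A*C) / Z
w1, w2 = 1.5, -0.8

def f(a, b, c):
    """Unnormalized potential f(A,B,C; w1, w2)."""
    return np.exp(w1 * b * c + w2 * a * c)

# Partition function: sum the potential over all 2^3 joint configurations.
Z = sum(f(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))

def P(a, b, c):
    """Normalized probability P(A=a, B=b, C=c)."""
    return f(a, b, c) / Z

print(Z, P(1, 1, 1))
print(sum(P(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3)))  # sums to ~1.0
```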
More directed and undirected models
[Figure: a hidden Markov model with hidden states h1, h2, h3 and observations y1, y2, y3 (directed); an MRF on a 2D grid with nodes A–I (undirected).]
9
More directed and undirected models
• Directed graph over A, B, C, D:
P(A,B,C,D) = P(A) P(B) P(C|B) P(D|A,B,C)
• Hidden Markov model over y1, y2, y3 and h1, h2, h3:
P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2|h1) P(h3|h2) P(y1|h1) P(y2|h2) P(y3|h3)
10
More directed and undirected models
[Figure: (a) an HMM and an RBM over visible units v and hidden units h with weights W; (b) a DBN with layers x, h1, h2, h3 and weights W0, W1, W2; (c) our deep model.]
11
Extended reading on graphical model
• Zoubin Ghahramani's video lecture on graphical models:
• http://videolectures.net/mlss07_ghahramani_grafm/
12
Outline
• Short introduction on deep learning
• Short introduction on statistical models and Graphical model
• Restricted Boltzmann machine and Contrastive divergence
– Product of experts
– Contrastive divergence (a training algorithm for RBM)
– Restricted Boltzmann Machine (a specific, useful case of product of experts)
• Deep belief net
13
Product of Experts
P(x; θ) = Π_m f_m(x_m; θ_m) / Σ_x Π_m f_m(x_m; θ_m) = exp(−E(x; θ)) / Σ_x exp(−E(x; θ)) = f(x; θ) / Z(θ),
where Z(θ) is the partition function and the energy function is
E(x; θ) = −Σ_m log f_m(x_m; θ_m)
Example (MRF in 2D over nodes A–I):
E(x; w) = w1·AB + w2·BC + w3·AD + w4·BE + w5·CF + …
[Figure: an MRF on a 2D grid with nodes A–I.]
16
Product of Experts
Example: a product of 15 experts, each a mixture of a Gaussian and a uniform distribution:
P(x) ∝ Π_{i=1..15} [ α_i exp(−(x − u_i)^T Σ_i (x − u_i)) + c(1 − α_i) ]
Products of experts versus mixture models
P(x; θ) = Π_m f_m(x_m; θ_m) / Σ_x Π_m f_m(x_m; θ_m)
• Product of experts:
– "and" operation
– Sharper than a mixture
– Each expert can constrain a different subset of dimensions.
• Mixture model, e.g. Gaussian mixture model:
– "or" operation
– A weighted sum of many density functions
18
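A minimal numerical sketch (not from the slides) of the "and" vs. "or" contrast: a renormalized product of two 1-D Gaussian densities is sharper than their mixture. The means, variances, and mixture weights are arbitrary.

```python
import numpy as np

x = np.linspace(-5, 5, 1001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p1, p2 = gauss(x, -1.0, 2.0), gauss(x, 1.0, 2.0)

mixture = 0.5 * p1 + 0.5 * p2        # "or": weighted sum of densities
product = p1 * p2                    # "and": both experts must assign high density
product /= product.sum() * dx        # renormalize the product

def std(p):
    """Standard deviation of x under the (discretized) density p."""
    mean = (x * p).sum() * dx
    return np.sqrt(((x - mean) ** 2 * p).sum() * dx)

print("std of mixture:", std(mixture))   # wider
print("std of product:", std(product))   # sharper
```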
Outline
• Basic background on statistical learning and
Graphical model
• Contrastive divergence and Restricted
Boltzmann machine
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net
19
Contrastive Divergence (CD)
• Probability: P(x; θ) = f(x; θ) / Z(θ),  with Z(θ) = Σ_x f(x; θ)
• Maximum likelihood via gradient ascent on the log-likelihood (equivalently, gradient descent on its negative):
max_θ Π_{k=1..K} P(x^(k); θ)  ⇔  max_θ L(X; θ) = max_θ (1/K) Σ_{k=1..K} log P(x^(k); θ)
θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ,   or solve ∂L(X; θ)/∂θ = 0
L(X; θ) = −log Z(θ) + (1/K) Σ_{k=1..K} log f(x^(k); θ)
∂L(X; θ)/∂θ = −∫ p(x; θ) ∂log f(x; θ)/∂θ dx + (1/K) Σ_{k=1..K} ∂log f(x^(k); θ)/∂θ
            = −⟨∂log f(x; θ)/∂θ⟩_{p(x;θ)} + ⟨∂log f(x; θ)/∂θ⟩_X
(the first term is an expectation under the model distribution p(x; θ); the second is an expectation under the data distribution X)
20
Contrastive Divergence (CD)
• Gradient of the likelihood:
∂L(X; θ)/∂θ = −∫ p(x; θ) ∂log f(x; θ)/∂θ dx + (1/K) Σ_{k=1..K} ∂log f(x^(k); θ)/∂θ
– First term: intractable; approximated by Gibbs sampling from p(z1, z2, …, zM).
– Second term: easy to compute.
• Fast contrastive divergence: run the Gibbs chain for only T = 1 step instead of T → ∞.
θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ |_CD
– ML gives an accurate but slow gradient; CD gives an approximate but fast gradient.
[Figure: the accurate (ML) and approximate (CD) gradient directions toward the minimum.]
21
Gibbs sampling for graphical models
[Figure: an RBM with hidden nodes h1–h5 and visible nodes x1–x3.]
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML)
22
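As a concrete illustration (a sketch not taken from the slides), here is single-site Gibbs sampling for the earlier binary model P(A,B,C) ∝ exp(w1·BC + w2·AC); each variable is resampled from its conditional given the current values of the others. Weights, seed, and iteration counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = 1.5, -0.8   # same hypothetical weights as in the enumeration sketch

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(a, b, c):
    # P(A=1 | B,C) ∝ exp(w2*C), P(B=1 | A,C) ∝ exp(w1*C), P(C=1 | A,B) ∝ exp(w1*B + w2*A)
    a = int(rng.random() < sigmoid(w2 * c))
    b = int(rng.random() < sigmoid(w1 * c))
    c = int(rng.random() < sigmoid(w1 * b + w2 * a))
    return a, b, c

state, samples = (0, 0, 0), []
for t in range(20000):
    state = gibbs_sweep(*state)
    if t >= 1000:                      # discard burn-in
        samples.append(state)

# Empirical P(A=1,B=1,C=1); should approach the exact value from the enumeration sketch.
print(np.mean([s == (1, 1, 1) for s in samples]))
```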
Convergence of Contrastive divergence
(CD)
• The fixed points of ML are not fixed points of CD
and vice versa.
– CD is a biased learning algorithm.
– But the bias is typically very small.
– CD can be used for getting close to ML solution and then ML
learning can be used for fine-tuning.
• It is not clear whether CD learning converges (to a stable
fixed point). As of 2005, a proof was not available.
• Further theoretical results? Please inform us.
M. A. Carreira-Perpiñán and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005
23
Outline
• Basic background on statistical learning and
Graphical model
• Contrastive divergence and Restricted
Boltzmann machine
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net
24
Boltzmann Machine
• Undirected graphical model with hidden nodes.
P(x; θ) = Π_m f_m(x_m; θ_m) / Σ_x Π_m f_m(x_m; θ_m) = e^(−E(x;θ)) / Σ_x e^(−E(x;θ)) = f(x; θ) / Z(θ)
E(x; θ) = Σ_{i,j} w_ij x_i x_j + Σ_i θ_i x_i,   θ = {w_ij, θ_i}
• With visible units x and hidden units h:
Boltzmann machine: E(x,h) = b' x + c' h + h' Wx + x' Ux + h' Vh
25
Restricted Boltzmann Machine (RBM)
• Undirected, loopy, layered model
• E(x,h) = b' x + c' h + h' Wx
P(x, h) = e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))
P(x) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h))        (the denominator is the partition function)
P(h | x) = Π_i P(h_i | x)
P(x | h) = Π_j P(x_j | h)
P(x_j = 1 | h) = σ(b_j + W'_•j · h)
P(h_i = 1 | x) = σ(c_i + W_i• · x)
[Figure: bipartite graph with hidden layer h = (h1, …, h5), visible layer x = (x1, x2, x3), and weights W.]
Read the manuscript for details.
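A minimal sketch (with assumed shapes, not the manuscript's code) of the two factorial conditionals above: W has shape (num_hidden, num_visible), b is the visible bias, c the hidden bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, c):
    """P(h_i = 1 | x) = sigmoid(c_i + W[i, :] . x), computed for all i at once."""
    return sigmoid(c + W @ x)

def p_x_given_h(h, W, b):
    """P(x_j = 1 | h) = sigmoid(b_j + W[:, j] . h), computed for all j at once."""
    return sigmoid(b + W.T @ h)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((5, 3))   # 5 hidden and 3 visible units, as in the figure
b, c = np.zeros(3), np.zeros(5)

x = np.array([1.0, 0.0, 1.0])
ph = p_h_given_x(x, W, c)                  # factorial posterior over the hidden units
h = (rng.random(5) < ph).astype(float)     # sample h ~ P(h | x)
px = p_x_given_h(h, W, b)                  # factorial conditional over the visible units
```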
Restricted Boltzmann Machine (RBM)
P(x; θ) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h)) = f(x; θ) / Z(θ)
• E(x,h) = b' x + c' h + h' Wx
• x = [x1 x2 …]^T, h = [h1 h2 …]^T
• Parameter learning
– Maximum log-likelihood:
max_θ Π_{k=1..K} P(x^(k); θ)  ⇔  max_θ L(X; θ) = Σ_{k=1..K} log P(x^(k); θ)   (equivalently, minimize the negative log-likelihood)
Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence." Neural Computation 14, 1771–1800 (2002)
27
CD for RBM
• CD for RBM is very fast!
P(x; θ) = Σ_h e^(−E(x,h)) / Σ_{x,h} e^(−E(x,h)) = f(x; θ) / Z(θ),   E(x,h) = b' x + c' h + h' Wx
θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ
∂L(X; θ)/∂w_ij = −∫ p(x; θ) ∂log f(x; θ)/∂w_ij dx + (1/K) Σ_{k=1..K} ∂log f(x^(k); θ)/∂w_ij
               = −⟨x_i h_j⟩_{p(x;θ)} + ⟨x_i h_j⟩_X
               = ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_∞  ≈  ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_1    (CD)
P(x_j = 1 | h) = σ(b_j + W'_•j · h),   P(h_i = 1 | x) = σ(c_i + W_i• · x)
28
CD for RBM
∂L(X; θ)/∂w_ij ≈ ⟨x_i h_j⟩_0 − ⟨x_i h_j⟩_1
• One Gibbs step: starting from data x^0, sample h^0 using P(h_i = 1 | x) = σ(c_i + W_i• · x); sample x^1 using P(x_j = 1 | h) = σ(b_j + W'_•j · h); then compute P(h_i = 1 | x^1).
[Figure: the alternating Gibbs chain between visible units x1, x2 and hidden units h1, h2.]
29
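Putting the two slides together, here is a minimal CD-1 sketch for a binary RBM (not the manuscript's code): the data give the ⟨x_i h_j⟩_0 statistics and one Gibbs step gives ⟨x_i h_j⟩_1. The learning rate and weight shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(X, W, b, c, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 update. X: (K, num_visible) binary batch; W: (num_hidden, num_visible)."""
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(c + X @ W.T)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step x1 ~ P(x | h0), then P(h | x1).
    px1 = sigmoid(b + h0 @ W)
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigmoid(c + x1 @ W.T)
    K = X.shape[0]
    # <x_i h_j>_0 - <x_i h_j>_1, using probabilities for the statistics.
    W += lr * (ph0.T @ X - ph1.T @ x1) / K
    b += lr * (X - x1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```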
RBM for classification
• y: classification label
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
30
RBM itself has many applications
• Multiclass classification
• Collaborative filtering
• Motion capture modeling
• Information retrieval
• Modeling natural images
• Segmentation
Y Li, D Tarlow, R Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013
V. Mnih, H Larochelle, GE Hinton , Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted boltzmann machines. ICML, 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model., NIPS 2009.
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field., NIPS 2008
31
Outline
• Basic background on statistical learning and
Graphical model
• Contrastive divergence and Restricted
Boltzmann machine
• Deep belief net (DBN)
– Why deep learning?
– Learning and inference
– Applications
32
Belief Nets
• A belief net is a directed acyclic graph composed of random variables.
[Figure: random hidden causes at the top generating visible effects at the bottom.]
33
Deep Belief Net
• Belief net that is deep
• A generative model
– P(x, h1, …, hl) = p(x|h1) p(h1|h2) … p(h(l−2)|h(l−1)) p(h(l−1), hl)
• Used for unsupervised training of multi-layer deep models.
• Pixels => edges => local shapes => object parts
[Figure: layers x, h1, h2, h3 with P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3).]
34
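To illustrate the factorization P(x,h1,h2,h3) = p(x|h1) p(h1|h2) p(h2,h3), here is a minimal generation sketch (not the manuscript's code): Gibbs sampling in the top-level RBM p(h2,h3), then top-down sampling through the directed sigmoid layers. Weight shapes, biases, and the number of Gibbs steps are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_dbn(W1, bx, W2, b1, W3, b2, c3, gibbs_steps=200, rng=np.random.default_rng(0)):
    """W1: (|h1|, |x|), W2: (|h2|, |h1|), W3: (|h3|, |h2|); bx, b1, b2, c3 are biases."""
    # Equilibrium sample from the top-level RBM p(h2, h3) by alternating Gibbs sampling.
    h2 = (rng.random(W3.shape[1]) < 0.5).astype(float)
    for _ in range(gibbs_steps):
        h3 = (rng.random(W3.shape[0]) < sigmoid(c3 + W3 @ h2)).astype(float)
        h2 = (rng.random(W3.shape[1]) < sigmoid(b2 + W3.T @ h3)).astype(float)
    # Top-down directed passes: h1 ~ p(h1 | h2), then x ~ p(x | h1).
    h1 = (rng.random(W2.shape[1]) < sigmoid(b1 + W2.T @ h2)).astype(float)
    x = (rng.random(W1.shape[1]) < sigmoid(bx + W1.T @ h1)).astype(float)
    return x
```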
Why Deep learning?
Pixels => edges => local shapes => object parts
• The mammal brain is organized in a deep architecture, with a
given input percept represented at multiple levels of
abstraction, each level corresponding to a different area of
cortex (the outer layer of the brain or other organs).
• An architecture with insufficient depth can require many
more computational elements, potentially exponentially
more (with respect to input size), than architectures whose
depth is matched to the task.
• Since the number of computational elements one can afford
depends on the number of training examples available to
tune or select them, the consequences are not just
computational but also statistical: poor generalization may
be expected when using an insufficiently deep architecture
for representing some functions.
T. Serre, etc., “A quantitative theory of immediate visual recognition,” Progress in Brain Research, Computational
Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
35
Why Deep learning?
• Linear regression, logistic regression: depth 1
• Kernel SVM: depth 2
• Decision tree: depth 2
• Boosting: depth 2
• The basic conclusion these results suggest is that when
a function can be compactly represented by a deep
architecture, it might need a very large architecture to be
represented by an insufficiently deep one. (Example: logic
gates, multi-layer NN with linear threshold units and positive
weights.)
Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
36
Example: sum-product network (SPN)
[Figure: a deep sum-product network over X1, …, X5 with alternating sum and product nodes; the deep network uses O(N) parameters, while an equivalent shallow (two-layer) representation needs N·2^(N−1) parameters.]
37
Depth of existing approaches
• Boosting (2 layers)
– Layer 1: base learners
– Layer 2: vote or linear combination of layer 1
• Decision tree, LLE, KNN, Kernel SVM (2 layers)
– Layer 1: matching degree to a set of local templates.
– Layer 2: combine these degrees, e.g. b + Σ_i α_i K(x, x_i)
• Brain: 5–10 layers
38
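As a small illustration of the two-layer view (not from the slides), a kernel SVM prediction b + Σ_i α_i K(x, x_i) can be written as a template-matching layer followed by a linear combination; the support vectors, coefficients, and kernel width below are made up.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    """Matching degree of x to the local template xi."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # hypothetical templates x_i
alpha = np.array([0.7, -0.3, 0.5])                                 # signed coefficients alpha_i
b = -0.1

def predict(x):
    layer1 = np.array([rbf_kernel(x, xi) for xi in support_vectors])  # layer 1: matching degrees
    return b + alpha @ layer1                                         # layer 2: linear combination

print(predict(np.array([0.5, 0.5])))
```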
Why does a decision tree have depth 2?
• It relies on a partition of the input space.
• It is a local estimator: it partitions the input space and uses
separate parameters for each region. Each region is
associated with a leaf.
• It needs as many training samples as there are variations of
interest in the target function, so it is not good for highly
varying functions.
• The number of training samples must grow exponentially with
the number of input dimensions to achieve a fixed error rate.
39
Outline
• Basic background on statistical learning and
Graphical model
• Contrastive divergence and Restricted
Boltzmann machine
• Deep belief net (DBN)
– Why DBN?
– Learning and inference
– Applications
40
Deep Belief Net
• Inference problem: Infer the states of the
unobserved variables.
• Learning problem: Adjust the interactions
between variables to make the network more
likely to generate the observed data
[Figure: a deep belief net with layers x, h1, h2, h3; P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3).]
41
Deep Belief Net
• Inference problem (the problem of explaining away):
– In the directed model C → A, B: P(A, B | C) = P(A|C)P(B|C)
– But in a layer of hidden causes generating x1: P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)
– An example is given in the manuscript.
– Solution: complementary prior
[Figure: Bayesian network C → {A, B}; a belief-net layer h1 = (h11, h12) generating x1.]
42
Deep Belief Net
• Inference problem (the problem of explaining away)
– Solution: complementary prior
P(h_i = 1 | x) = σ(c_i + W_i• · x)
[Figure: a deep belief net with layers x, h1, h2, h3, h4 and layer sizes 2000, 1000, 500, 30.]
43
Deep Belief Net
• Explaining away problem of inference (see the manuscript)
– Solution: complementary prior, see the manuscript
• Learning problem
– Greedy layer-by-layer RBM training (optimizes a lower bound) and fine-tuning
– Contrastive divergence for RBM training
[Figure: the stack x–h1, h1–h2, h2–h3 is trained layer by layer as a sequence of RBMs.]
44
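A minimal sketch of the greedy layer-by-layer procedure (not the manuscript's code), reusing the cd1_update sketch from the CD-for-RBM slide above; the layer sizes and number of epochs are arbitrary, and feeding the hidden probabilities upward is one common choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dbn(X, layer_sizes=(500, 500, 2000), epochs=10, rng=np.random.default_rng(0)):
    """Greedy layer-wise pre-training: train each layer as an RBM on the output of the previous one."""
    layers, data = [], X
    for n_hidden in layer_sizes:
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_hidden, n_visible))
        b, c = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            W, b, c = cd1_update(data, W, b, c)     # CD training of the current RBM
        layers.append((W, b, c))
        data = sigmoid(c + data @ W.T)              # hidden activations become the next layer's data
    return layers
```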
Code reading
• For understanding DBN, it is much easier to read the
code in the DeepLearningToolbox.
45
46
Deep Belief Net
• Why does greedy layer-wise learning work?
• It optimizes a lower bound:
log P(x) = log Σ_{h1} P(x, h1)
         ≥ Σ_{h1} Q(h1|x) [log P(h1) + log P(x|h1)] − Σ_{h1} Q(h1|x) log Q(h1|x)     (1)
• When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1).
[Figure: the stack x–h1, h1–h2, h2–h3 trained as a sequence of RBMs.]
47
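For completeness, here is the standard variational (Jensen's inequality) derivation of bound (1); this step is not spelled out on the slide.

```latex
\begin{align*}
\log P(x) &= \log \sum_{h^1} P(x, h^1)
           = \log \sum_{h^1} Q(h^1 \mid x)\,\frac{P(x, h^1)}{Q(h^1 \mid x)} \\
          &\ge \sum_{h^1} Q(h^1 \mid x)\,\log\frac{P(x, h^1)}{Q(h^1 \mid x)} && \text{(Jensen's inequality)} \\
          &= \sum_{h^1} Q(h^1 \mid x)\,\bigl[\log P(h^1) + \log P(x \mid h^1)\bigr]
             - \sum_{h^1} Q(h^1 \mid x)\,\log Q(h^1 \mid x).
\end{align*}
```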
Deep Belief Net and RBM
• An RBM can be viewed as a DBN that has infinitely many layers.
[Figure: an RBM between x0 and h0 with weights W is equivalent to an infinite directed belief net with tied weights, alternating W and W^T across layers x0, h0, x1, h1, x2, ….]
48
Pretrain, fine-tune and inference – (autoencoder)
[Figure: RBM pre-training followed by fine-tuning with back-propagation (BP).]
49
Pretrain, fine-tune and inference - 2
[Figure: pre-training followed by fine-tuning; y: identity or rotation degree.]
50
How many layers should we use?
• There might be no universally right depth
– Bengio suggests that several layers is better than one
– Results are robust against changes in the size of a
layer, but top layer should be big
– A parameter. Depends on your task.
– With enough narrow layers, we can model any
distribution over binary vectors [1]
[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural
Computation, 2007
Copied from http://videolectures.net/mlss09uk_hinton_dbn/
51
Effect of Unsupervised Pre-training
[Figure: results from Erhan et al., AISTATS 2009.]
52
Effect of Depth
[Figure: results without pre-training vs. with pre-training as the number of layers grows.]
53
Why unsupervised pre-training makes sense
[Figure: two generative scenarios drawn with the nodes "stuff", "image", and "label", one involving a high-bandwidth pathway and one a low-bandwidth pathway.]
• If image–label pairs were generated this way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
• If image–label pairs are generated this way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
54
Beyond layer-wise pretraining
• Layer-wise pretraining is efficient but not optimal.
• It is possible to train parameters for all layers using a
wake-sleep algorithm.
– Bottom-up in a layer-wise manner
– Top-down, refitting the earlier models
55
Fine-tuning with a contrastive version
of the “wake-sleep” algorithm
After learning many layers of features, we can fine-tune the
features to improve generation.
1. Do a stochastic bottom-up pass
– Adjust the top-down weights to be good at reconstructing
the feature activities in the layer below.
2. Do a few iterations of sampling in the top level RBM
-- Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass
– Adjust the bottom-up weights to be good at reconstructing
the feature activities in the layer above.
56
Include lateral connections
• RBM has no connections within a layer (no lateral connections)
• This can be generalized.
• Lateral connections for the first layer [1].
– Sampling from P(h|x) is still easy. But sampling from
p(x|h) is more difficult.
• Lateral connections at multiple layers [2].
– Generate more realistic images.
– CD is still applicable, with small modification.
[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311–3325, December 1997.
[2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random field," in NIPS, 2007.
57
Without lateral connection
58
With lateral connection
59
My data is real valued …
• Rescale it to [0, 1] linearly: x ← a·x + b
• Use another distribution for the visible units
60
My data has temporal dependency …
• Static:
• Temporal
61
I consider DBN as…
• A statistical model that is used for unsupervised training
of fully connected deep model
• A directed graphical model that is approximated by fast
learning and inference algorithms
• A directed graphical model that is fine tuned using
mature neural network learning approach -- BP.
63
Outline
• Basic background on statistical learning and
Graphical model
• Contrastive divergence and Restricted
Boltzmann machine
• Deep belief net (DBN)
– Why DBN?
– Learning and inference
– Applications
64
Applications of deep learning
• Hand-written digit recognition
• Dimensionality reduction
• Information retrieval
• Segmentation
• Denoising
• Phone recognition
• Object recognition
• Object detection
• …
Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks, Science 2006.
Welling, M. etc., Exponential Family Harmoniums with an Application to Information Retrieval, NIPS 2004
A. R. Mohamed, etc., Deep Belief Networks for phone recognition, NIPS 09 workshop on deep learning for speech
recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS09
………………………….
65
Object recognition
• NORB
– logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian
kernel SVM 11.6%, convolutional neural net 6.0%,
convolutional net + SVM hybrid 5.9%. DBN 6.5%.
– With the extra unlabeled data (and the same amount of
labeled data as before), DBN achieves 5.2%.
66
Object recognition
• ImageNet

Rank | Name        | Error rate | Description
1    | U. Toronto  | 0.15315    | Deep learning
2    | U. Tokyo    | 0.26172    | Hand-crafted features and learning models. Bottleneck. (ranks 2–4)
3    | U. Oxford   | 0.26979    |
4    | Xerox/INRIA | 0.27058    |

67
Learning to extract the orientation of a face
patch (Salakhutdinov & Hinton, NIPS 2007)
68
The training and test sets
100, 500, or 1000 labeled cases
11,000 unlabeled cases
face patches from new people
69
The root mean squared error in the orientation
when combining GPs with deep belief nets

            | GP on the pixels | GP on top-level features | GP on top-level features with fine-tuning
100 labels  | 22.2             | 17.9                     | 15.2
500 labels  | 17.2             | 12.7                     | 7.2
1000 labels | 16.3             | 11.2                     | 6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.
70
Deep Autoencoders
(Hinton & Salakhutdinov, 2006)
• They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
– First train a stack of 4 RBMs
– Then "unroll" them.
– Then fine-tune with backprop.
[Figure: the unrolled autoencoder 28x28 → 1000 → 500 → 250 → 30 (linear code units) → 250 → 500 → 1000 → 28x28, with encoder weights W1–W4 and decoder weights W4^T–W1^T.]
71
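A minimal sketch (not the paper's code) of the "unroll" step: the stack of pre-trained RBM weights is used bottom-up as the encoder and, transposed, top-down as the decoder; fine-tuning with backprop is omitted, and the linear 30-D code layer follows the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, rbm_layers):
    """rbm_layers: list of (W, visible_bias, hidden_bias), e.g. from the pretrain_dbn sketch."""
    # Encoder: bottom-up through W1..W4; keep the top code layer linear.
    h = x
    for i, (W, b, c) in enumerate(rbm_layers):
        pre = c + W @ h
        h = pre if i == len(rbm_layers) - 1 else sigmoid(pre)
    code = h
    # Decoder: top-down through the transposed weights W4^T..W1^T.
    v = code
    for W, b, c in reversed(rbm_layers):
        v = sigmoid(b + W.T @ v)
    return code, v   # the 30-D code and the reconstruction
```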
Deep Autoencoders
(Hinton & Salakhutdinov, 2006)
[Figure: real data vs. 30-D deep autoencoder reconstructions vs. 30-D PCA.]
72
A comparison of methods for compressing digit images to 30 real numbers.
[Figure: real data vs. 30-D deep autoencoder vs. 30-D logistic PCA vs. 30-D PCA.]
73
Representation of DBN
74
Our works
http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
75
Pedestrian detection: CVPR'12, CVPR'13, ICCV'13, ICCV'13
Facial keypoint detection, CVPR’13
(2% average error on LFPW)
Face parsing, CVPR’12
Pedestrian parsing, ICCV’13
Face Recognition and Face Attribute Recognition
(LFW: 96.45%)
Face verification, ICCV’13
Recovering Canonical-View Face Images, ICCV’13
Face attribute recognition, ICCV’13
Summary
• Deep belief net (DBN)
– is a network with deep layers, which provides strong
representation power;
– is a generative model;
– can be learned by layerwise RBM using Contrastive
Divergence;
– has many applications, and more applications are yet to be
found.
Generative models explicitly or implicitly model the distribution of inputs and outputs.
Discriminative models model the posterior probabilities directly.
79
DBN VS SVM
• A very controversial topic
• Model
– DBN is generative, SVM is discriminative. But fine-tuning
of DBN is discriminative
• Application
– SVM is widely applied.
– Researchers are expanding the application area of DBN.
• Learning
– DBN is non-convex and slow
– SVM is convex and fast (in linear case).
• Which one is better?
– Time will tell.
– You can contribute
Hinton: The superior classification performance of discriminative learning
methods holds only for domains in which it is not possible to learn a good
generative model. This set of domains is being eroded by Moore’s law.
80
81
Download