Training Products of Experts by Minimizing Contrastive Divergence

Geoffrey E. Hinton
presented by Frank Wood
Goal
• Learn parameters for probability distribution models of high-dimensional data
  – (Images, Population Firing Rates, Securities Data, NLP data, etc.)
Mixture Model

$$p(d \mid \theta_1, \ldots, \theta_n) = \sum_m \alpha_m f_m(d \mid \theta_m)$$

Use EM to learn the parameters.

Product of Experts

$$p(d \mid \theta_1, \ldots, \theta_n) = \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$

Use Contrastive Divergence to learn the parameters.
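For a discrete domain, the PoE normalizer $\sum_c \prod_m f_m(c \mid \theta_m)$ can be computed exactly; here is a minimal sketch with two illustrative experts over $c \in \{0, 1, 2\}$ (all numbers made up). In high dimensions this sum is intractable, which is what motivates the sampling machinery below.

    import numpy as np

    # Two illustrative experts over the tiny domain c in {0, 1, 2}.
    f1 = np.array([0.5, 0.3, 0.2])   # f_1(c | theta_1); unnormalized is fine
    f2 = np.array([0.1, 0.6, 0.3])   # f_2(c | theta_2)

    numerator = f1 * f2              # prod_m f_m(c | theta_m) for every c
    Z = numerator.sum()              # sum_c prod_m f_m(c | theta_m)
    p = numerator / Z                # the normalized PoE distribution
    print(p)                         # [0.17241379 0.62068966 0.20689655]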

Take Home
• Contrastive divergence is a general MCMC gradient ascent learning algorithm particularly well suited to learning Product of Experts (PoE) and energy-based (Gibbs distributions, etc.) model parameters.
• The general algorithm (a minimal code sketch follows below):
  – Repeat until "convergence":
    • Draw samples from the current model, starting from the training data.
    • Compute the expected gradient of the log probability w.r.t. all model parameters, over both the samples and the training data.
    • Update the model parameters according to the gradient.
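A minimal sketch of that loop, assuming hypothetical callables `sample_from_model` (one or more MCMC steps started at the data) and `grad_log_f` (the gradient of $\log f_m$ at one data point); neither name comes from the paper:

    import numpy as np

    def contrastive_divergence(params, data, grad_log_f, sample_from_model,
                               lr=0.01, n_iters=100):
        for _ in range(n_iters):
            # Draw samples from the current model, starting from the training data.
            samples = sample_from_model(params, data)
            # Expected gradient under the data minus under the model samples.
            g_data = np.mean([grad_log_f(params, d) for d in data], axis=0)
            g_model = np.mean([grad_log_f(params, c) for c in samples], axis=0)
            # Gradient ascent on the (approximate) log likelihood.
            params = params + lr * (g_data - g_model)
        return params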
Sampling – Critical to Understanding
• Uniform
  – rand()
  – Linear congruential generator: x(n) = (a * x(n-1) + b) mod M
• Normal
  – randn()
  – Box-Muller (see the sketch below): x1, x2 ~ U(0,1) -> y1, y2 ~ N(0,1)
    • y1 = sqrt(-2 ln(x1)) cos(2 pi x2)
    • y2 = sqrt(-2 ln(x1)) sin(2 pi x2)
• Binomial(p)
  – if (rand() < p)
• More complicated distributions
  – Mixture model
    • Sample from a Gaussian
    • Sample from a multinomial (CDF + uniform)
  – Product of Experts
    • Metropolis and/or Gibbs
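As a concrete illustration, a minimal sketch chaining the two primitives above; the LCG constants are illustrative (Numerical Recipes values), not from the slides:

    import math

    def lcg(seed, a=1664525, b=1013904223, M=2**32):
        """Linear congruential generator yielding uniforms in [0, 1)."""
        x = seed
        while True:
            x = (a * x + b) % M
            yield x / M

    def box_muller(x1, x2):
        """Map two U(0,1) draws to two independent N(0,1) draws."""
        y1 = math.sqrt(-2.0 * math.log(x1)) * math.cos(2.0 * math.pi * x2)
        y2 = math.sqrt(-2.0 * math.log(x1)) * math.sin(2.0 * math.pi * x2)
        return y1, y2

    uniforms = lcg(seed=42)
    y1, y2 = box_muller(next(uniforms), next(uniforms))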
The Flavor of Metropolis Sampling
• Given some distribution $p(d \mid \Theta)$, a random starting point $d_{t-1}$, and a symmetric proposal distribution $J(d_t \mid d_{t-1})$.
• Calculate the ratio of densities

$$r = \frac{p(d_t \mid \Theta)}{p(d_{t-1} \mid \Theta)}$$

where $d_t$ is sampled from the proposal distribution.
• With probability $\min(r, 1)$, accept $d_t$.
• Given sufficiently many iterations,

$$d_n, d_{n+1}, d_{n+2}, \ldots \sim p(d \mid \Theta)$$

• Note: you only need to know the distribution up to a proportionality constant!
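A minimal sketch of this accept/reject loop, assuming an unnormalized `log_p` (the ratio $r$ only needs $p$ up to proportionality) and a symmetric Gaussian random-walk proposal:

    import numpy as np

    def metropolis(log_p, d0, n_steps, step_size=0.5, rng=None):
        """Random-walk Metropolis; log_p need only be correct up to an
        additive constant (i.e., p up to proportionality)."""
        rng = rng or np.random.default_rng()
        d = np.asarray(d0, dtype=float)
        chain = []
        for _ in range(n_steps):
            proposal = d + step_size * rng.standard_normal(d.shape)  # symmetric J
            log_r = log_p(proposal) - log_p(d)   # log of the density ratio r
            if np.log(rng.uniform()) < log_r:    # accept with prob min(r, 1)
                d = proposal
            chain.append(d.copy())
        return np.array(chain)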
Contrastive Divergence (Final Result!)

$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^1}$$

Here the $\theta_m$ are the model parameters, $P^0$ is the training data (the empirical distribution), and $P^1$ denotes samples from the model. By the Law of Large Numbers, compute the expectations using samples:

$$\Delta\theta_m \approx \frac{1}{N}\sum_{d \in D} \frac{\partial \log f_m(d)}{\partial \theta_m} - \frac{1}{N}\sum_{c \sim P^1} \frac{\partial \log f_m(c)}{\partial \theta_m}$$

Now you know how to do it; let's see why this works!
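In code the update is just a difference of two sample averages. A minimal sketch, assuming a hypothetical `dlogf` that returns per-example gradient rows $\partial \log f_m(x)/\partial \theta_m$ for a batch:

    import numpy as np

    def cd_update(theta, data, model_samples, dlogf, lr=0.01):
        """One contrastive divergence parameter update.
        data:          (N, D) array drawn from P^0 (the training set)
        model_samples: (N, D) array drawn from P^1 (e.g., one MCMC step from data)
        dlogf:         dlogf(theta, batch) -> per-example gradient rows"""
        g0 = dlogf(theta, data).mean(axis=0)           # <dlogf/dtheta>_{P^0}
        g1 = dlogf(theta, model_samples).mean(axis=0)  # <dlogf/dtheta>_{P^1}
        return theta + lr * (g0 - g1)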
But First: The last vestige of concreteness.
• Looking towards the future:
  – Take f to be a Student-t:

$$f_m(\vec d\,; \vec j_m, \alpha_m) = \frac{1}{\left(1 + \frac{1}{2}\left(\vec j_m^{\,T} \vec d\,\right)^2\right)^{\alpha_m}}$$

  – Then (for instance)

$$\frac{\partial \log f_m(\vec d\,; \vec j_m, \alpha_m)}{\partial \alpha_m} = \frac{\partial}{\partial \alpha_m}\left(-\alpha_m \log\left(1 + \tfrac{1}{2}\left(\vec j_m^{\,T} \vec d\,\right)^2\right)\right) = -\log\left(1 + \tfrac{1}{2}\left(\vec j_m^{\,T} \vec d\,\right)^2\right)$$

[Figure: a 1-D Student-t expert, $1/(1 + 0.5\,(\vec j^{\,T}\vec x)^2)^{\alpha}$ with $\alpha = 0.6$, $j = 5$, plotted against the dot-product projection (1-D marginal) over $[-10, 10]$.]
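A small sketch of this expert's log density and its $\alpha$-gradient, with the analytic form from the slide checked against finite differences (all names and numbers illustrative):

    import numpy as np

    def log_f(d, j, alpha):
        """Log Student-t expert: -alpha * log(1 + 0.5 * (j . d)^2)."""
        return -alpha * np.log(1.0 + 0.5 * np.dot(j, d) ** 2)

    def dlogf_dalpha(d, j, alpha):
        """Analytic gradient from the slide: -log(1 + 0.5 * (j . d)^2)."""
        return -np.log(1.0 + 0.5 * np.dot(j, d) ** 2)

    d, j, alpha, eps = np.array([0.3]), np.array([5.0]), 0.6, 1e-6
    numeric = (log_f(d, j, alpha + eps) - log_f(d, j, alpha - eps)) / (2 * eps)
    assert np.isclose(numeric, dlogf_dalpha(d, j, alpha))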
Maximizing the training data log likelihood
• We want the maximizing parameters:

$$\arg\max_{\theta_1,\ldots,\theta_n} \log p(D \mid \theta_1,\ldots,\theta_n) = \arg\max_{\theta_1,\ldots,\theta_n} \log \prod_{d \in D} \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$

(the standard PoE form, taken over all the training data, assuming the d's are drawn independently from p()).
• Differentiate w.r.t. all parameters and perform gradient ascent to find the optimal parameters.
• The derivation is somewhat nasty.
Maximizing the training data log likelihood

$$\frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} = \frac{\partial}{\partial \theta_m} \log \prod_{d \in D} p(d \mid \theta_1,\ldots,\theta_n) = \sum_{d \in D} \frac{\partial \log p(d \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} = N \left\langle \frac{\partial \log p(d \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} \right\rangle_{P^0}$$

Remember this equivalence!
Maximizing the training data log likelihood

$$\frac{1}{N}\frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} = \frac{1}{N}\frac{\partial}{\partial \theta_m} \sum_{d \in D} \log \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$

$$= \frac{1}{N}\sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{1}{N}\sum_{d \in D} \frac{\partial}{\partial \theta_m}\log \sum_c \prod_m f_m(c \mid \theta_m)$$
Maximizing the training data log likelihood

$$\frac{1}{N}\sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{1}{N}\sum_{d \in D} \frac{\partial}{\partial \theta_m}\log \sum_c \prod_m f_m(c \mid \theta_m) = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial}{\partial \theta_m}\log \sum_c \prod_m f_m(c \mid \theta_m) \right\rangle_{P^0}$$

The second term does not depend on $d$, so its $P^0$ expectation is just the term itself. Using $(\log x)' = x'/x$:

$$= \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \frac{\partial}{\partial \theta_m}\prod_m f_m(c \mid \theta_m)$$
Maximizing the training data log likelihood

$$\left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \frac{\partial}{\partial \theta_m}\prod_m f_m(c \mid \theta_m)$$

$$= \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \left(\prod_{j \neq m} f_j(c \mid \theta_j)\right) \frac{\partial f_m(c \mid \theta_m)}{\partial \theta_m}$$

Using $(\log x)' = x'/x$, i.e. $\partial f_m / \partial \theta_m = f_m \, \partial \log f_m / \partial \theta_m$:

$$= \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \prod_m f_m(c \mid \theta_m) \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$
Maximizing the training data log likelihood

$$\left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \prod_m f_m(c \mid \theta_m) \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$

$$= \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \sum_c \left(\frac{\prod_m f_m(c \mid \theta_m)}{\sum_{c'} \prod_m f_m(c' \mid \theta_m)}\right) \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$

$$= \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \sum_c p(c \mid \theta_1,\ldots,\theta_n) \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$
Maximizing the training data log likelihood

$$\left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \sum_c p(c \mid \theta_1,\ldots,\theta_n) \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$

Phew! We're done! So:

$$\frac{1}{N}\frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} = \left\langle \frac{\partial \log p(d \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} \right\rangle_{P^0} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$
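To see the final identity in action, a minimal numeric check on a toy discrete PoE (experts $f_m(c \mid \theta_m) = e^{\theta_m c}$ over $c \in \{0, 1\}$, so $\partial \log f_m / \partial \theta_m = c$); everything here is illustrative:

    import numpy as np

    domain = np.array([0, 1])
    data = np.array([0, 1, 1])
    t = np.array([0.3, -0.7])              # theta_1, theta_2

    def avg_ll(t):
        """(1/N) * sum_d log p(d | t) for the toy PoE."""
        log_Z = np.log(np.exp(t.sum() * domain).sum())
        return np.mean(t.sum() * data) - log_Z

    # Left side: (1/N) d/d theta_1 of the log likelihood (finite difference).
    eps = 1e-6
    lhs = (avg_ll(t + [eps, 0]) - avg_ll(t - [eps, 0])) / (2 * eps)

    # Right side: <dlogf_1>_{P^0} - <dlogf_1>_{P^inf}, with dlogf_1(c) = c.
    p_inf = np.exp(t.sum() * domain)
    p_inf /= p_inf.sum()
    rhs = data.mean() - (p_inf * domain).sum()

    assert np.isclose(lhs, rhs)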
Equilibrium Is Hard to Achieve
• With

$$\frac{1}{N}\frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$

we can now train our PoE model.
• But… there's a problem:
  – $P^\infty$ is computationally infeasible to obtain (especially in an inner gradient ascent loop).
  – The sampling Markov chain must converge to the target distribution, and often this takes a very long time!
Solution: Contrastive Divergence!

$$\frac{1}{N}\frac{\partial \log p(D \mid \theta_1,\ldots,\theta_n)}{\partial \theta_m} \approx \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$

• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after one iteration (or, more typically, perhaps a few iterations).
• Why does this work?
  – It attempts to minimize the ways that the model distorts the data.
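One way to draw the $P^1$ samples: start the chain at the training data and take a single MCMC step per case. A minimal sketch using one random-walk Metropolis step (the same accept/reject rule as above; `log_p` is the unnormalized model log density and is an assumed helper):

    import numpy as np

    def cd1_samples(data, log_p, step_size=0.5, rng=None):
        """One Metropolis step per training case: P^0 -> P^1.
        data is an (N, D) array; log_p scores one row."""
        rng = rng or np.random.default_rng()
        proposals = data + step_size * rng.standard_normal(data.shape)
        log_r = np.array([log_p(p) - log_p(d) for p, d in zip(proposals, data)])
        accept = np.log(rng.uniform(size=len(data))) < log_r
        return np.where(accept[:, None], proposals, data)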
Equivalence of argmax log P() and argmax KL()

$$P^0 \,\|\, P^\infty = \sum_d P^0(d) \log \frac{P^0(d)}{P^\infty(d)} = \sum_d P^0(d) \log P^0(d) - \sum_d P^0(d) \log P^\infty(d) = -H\!\left(P^0\right) - \left\langle \log P^\infty(d) \right\rangle_{P^0}$$

The entropy $H(P^0)$ of the training data does not depend on the model parameters, so

$$-\frac{\partial\, P^0 \| P^\infty}{\partial \theta_m} = \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^0}$$

This is what we got out of the nasty derivation! Maximizing the log likelihood is therefore equivalent to minimizing $P^0 \| P^\infty$ (here $\|$ denotes the KL divergence).
Contrastive Divergence
• We want to "update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step":

$$\Delta\theta_m \propto -\frac{\partial}{\partial \theta_m}\left(P^0 \,\|\, P^\infty - P^1 \,\|\, P^\infty\right)$$

$$= \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^1}$$

(ignoring the dependence of $P^1$ on the parameters)

$$= \left(\left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}\right) - \left(\left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}\right)$$

$$= \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$

The intractable $P^\infty$ terms cancel.
Contrastive Divergence (Final Result!)

$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m}{\partial \theta_m} \right\rangle_{P^1}$$

Here the $\theta_m$ are the model parameters, $P^0$ is the training data (the empirical distribution), and $P^1$ denotes samples from the model. By the Law of Large Numbers, compute the expectations using samples:

$$\Delta\theta_m \approx \frac{1}{N}\sum_{d \in D} \frac{\partial \log f_m(d)}{\partial \theta_m} - \frac{1}{N}\sum_{c \sim P^1} \frac{\partial \log f_m(c)}{\partial \theta_m}$$

Now you know how to do it and why it works!
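Putting the pieces together, a minimal end-to-end sketch: a 1-D product of two Student-t experts whose $\alpha_m$ are trained with CD-1, using one Metropolis step per update. All settings (data, projections, learning rate) are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal((200, 1))   # toy training set, shape (N, 1)
    J = np.array([1.0, 2.0])               # fixed projections j_m
    alpha = np.array([0.5, 0.5])           # learnable exponents alpha_m

    def log_p_unnorm(x, alpha):
        """Unnormalized PoE log density: sum_m -alpha_m log(1 + 0.5 (j_m x)^2)."""
        return np.sum(-alpha * np.log(1.0 + 0.5 * (J * x) ** 2))

    def dlogf_dalpha(x):
        """Gradient of log f_m w.r.t. alpha_m, one entry per expert."""
        return -np.log(1.0 + 0.5 * (J * x) ** 2)

    for _ in range(100):
        # P^1: one Metropolis step per training case, started at the data.
        proposals = data + 0.5 * rng.standard_normal(data.shape)
        log_r = np.array([log_p_unnorm(p, alpha) - log_p_unnorm(d, alpha)
                          for p, d in zip(proposals, data)])
        accept = np.log(rng.uniform(size=len(data))) < log_r
        samples = np.where(accept[:, None], proposals, data)
        # CD-1 update: <dlogf>_{P^0} - <dlogf>_{P^1}, then gradient ascent.
        g0 = np.mean([dlogf_dalpha(d) for d in data], axis=0)
        g1 = np.mean([dlogf_dalpha(c) for c in samples], axis=0)
        alpha = np.clip(alpha + 0.05 * (g0 - g1), 0.05, None)  # keep alphas > 0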