Training Products of Experts by Minimizing Contrastive Divergence
Geoffrey E. Hinton
Presented by Frank Wood — fwood@cs.brown.edu

Goal
• Learn parameters for probability distribution models of high-dimensional data (images, population firing rates, securities data, NLP data, etc.)

Mixture Models vs. Products of Experts
• Mixture model: $p(d \mid \theta_1, \ldots, \theta_n) = \sum_m f_m(d \mid \theta_m)$ (mixing proportions absorbed into each $f_m$). Use EM to learn the parameters $\theta_m$.
• Product of experts: $p(d \mid \theta_1, \ldots, \theta_n) = \dfrac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$. Use contrastive divergence to learn the parameters.

Take Home
• Contrastive divergence is a general MCMC gradient ascent learning algorithm particularly well suited to learning the parameters of Product of Experts (PoE) and other energy-based models (Gibbs distributions, etc.).
• The general algorithm — repeat until "convergence":
  – Draw samples from the current model, starting from the training data.
  – Compute the expected gradient of the log probability w.r.t. all model parameters, over both the samples and the training data.
  – Update the model parameters according to the gradient.

Sampling — Critical to Understanding
• Uniform — rand(): linear congruential generator, x(n) = a * x(n-1) + b mod M.
• Normal — randn(): Box-Muller transform. Given x1, x2 ~ U(0,1),
  – y1 = sqrt(-2 ln(x1)) cos(2 pi x2)
  – y2 = sqrt(-2 ln(x1)) sin(2 pi x2)
  are independent N(0,1) draws.
• Binomial(p): if (rand() < p).
• More complicated distributions:
  – Mixture model: sample a component from a multinomial (CDF + uniform), then sample from the chosen Gaussian.
  – Product of experts: Metropolis and/or Gibbs sampling.

The Flavor of Metropolis Sampling
• Given a distribution $p(d \mid \theta)$, a starting point $d^{t-1}$, and a symmetric proposal distribution $J(d^t \mid d^{t-1})$:
  – Sample $d^t$ from the proposal distribution and calculate the ratio of densities $r = \dfrac{p(d^t \mid \theta)}{p(d^{t-1} \mid \theta)}$.
  – With probability $\min(r, 1)$, accept $d^t$.
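The Metropolis recipe above can be sketched in a few lines of Python. This is an illustrative toy, not from the slides: the Gaussian proposal width, burn-in length, and the unnormalized standard-normal target are my own choices.

```python
import math
import random

random.seed(0)

def metropolis(log_p, d0, n_samples, step=1.0, burn_in=1000):
    """Minimal Metropolis sampler with a symmetric Gaussian proposal.

    log_p only needs to be the target log-density up to an additive
    constant, i.e. the distribution is known up to proportionality.
    """
    d = d0
    samples = []
    for t in range(burn_in + n_samples):
        d_new = d + random.gauss(0.0, step)      # symmetric J(d_t | d_{t-1})
        log_r = log_p(d_new) - log_p(d)          # log of the density ratio r
        if random.random() < math.exp(min(0.0, log_r)):  # accept w.p. min(r, 1)
            d = d_new
        if t >= burn_in:
            samples.append(d)
    return samples

# Target: standard normal, supplied unnormalized as exp(-d^2 / 2).
samples = metropolis(lambda d: -0.5 * d * d, d0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note that the acceptance test uses only the ratio of densities, which is why the normalizing constant is never needed.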
• Given sufficiently many iterations, $d^n, d^{n+1}, d^{n+2}, \ldots \sim p(d \mid \theta)$.
• Key point: we only need to know the distribution up to a constant of proportionality!

Contrastive Divergence (Final Result!)
$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$
where $P^0$ is the training data (the empirical distribution), $P^1$ is the distribution of samples from the model after one sampling step, and the $\theta_m$ are the model parameters. By the law of large numbers, the expectations can be computed with sample averages:
$$\Delta\theta_m \propto \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{c \sim P^1} \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$
Now you know how to do it; let's see why this works!

But First: The Last Vestige of Concreteness
• Looking towards the future, take each expert $f$ to be a 1-D Student-t:
$$f(d;\, \Theta_m = \{j_m, \alpha_m\}) = \left( \frac{1}{1 + \tfrac{1}{2}(j_m^T d)^2} \right)^{\alpha_m}$$
where the dot product $j_m^T d$ is a projection of the data, giving a 1-D marginal.
• Then (for instance)
$$\frac{\partial}{\partial \alpha_m} \log f(d;\, \Theta_m) = \frac{\partial}{\partial \alpha_m} \left[ -\alpha_m \log\!\left(1 + \tfrac{1}{2}(j_m^T d)^2\right) \right] = -\log\!\left(1 + \tfrac{1}{2}(j_m^T d)^2\right)$$
[Figure: plot of the 1-D Student-t expert density $1/(1 + \tfrac{1}{2}(j^T x)^2)^{\alpha}$ with $\alpha = 0.6$, $j = 5$.]

Maximizing the Training Data Log Likelihood
• We want the maximizing parameters
$$\theta_1^*, \ldots, \theta_n^* = \arg\max_{\theta_1, \ldots, \theta_n} \log p(D \mid \theta_1, \ldots, \theta_n) = \arg\max_{\theta_1, \ldots, \theta_n} \sum_{d \in D} \log \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$
where the sum is over all training data (assuming the $d$'s are drawn independently from $p(\cdot)$) and the term inside the log is the standard PoE form.
• Differentiate w.r.t. all parameters and perform gradient ascent to find the optimal parameters.
• The derivation is somewhat nasty.

Maximizing the Training Data Log Likelihood
$$\frac{\partial}{\partial \theta_m} \log p(D \mid \theta_1, \ldots, \theta_n) = \sum_{d \in D} \frac{\partial}{\partial \theta_m} \log p(d \mid \theta_1, \ldots, \theta_n) = N \left\langle \frac{\partial}{\partial \theta_m} \log p(d \mid \theta_1, \ldots, \theta_n) \right\rangle_{P^0}$$
Remember this equivalence!
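The Student-t derivative and the sample-average CD update can be combined into a small sketch. This is an illustrative toy with scalar data: the example values and the learning rate are my own choices, and the model samples are assumed to be given (e.g. from one MCMC step).

```python
import math

def dlogf_dalpha(d, j):
    """d/d(alpha) of log f for the Student-t expert
    f(d) = (1 + 0.5*(j^T d)^2)^(-alpha). Here d and j are scalars,
    so j^T d is just j*d. The derivative is -log(1 + 0.5*(j*d)^2)
    and does not depend on alpha itself."""
    return -math.log(1.0 + 0.5 * (j * d) ** 2)

def cd_update_alpha(data, model_samples, j, lr=0.1):
    """Delta(alpha) ~ <dlogf/dalpha>_{P0} - <dlogf/dalpha>_{P1},
    with both expectations replaced by sample averages
    (law of large numbers)."""
    pos = sum(dlogf_dalpha(d, j) for d in data) / len(data)
    neg = sum(dlogf_dalpha(c, j) for c in model_samples) / len(model_samples)
    return lr * (pos - neg)

# Data tightly clustered near 0 (high density under the expert) versus
# one-step model samples that have wandered further out: the update
# comes out positive, sharpening the expert around the data.
delta = cd_update_alpha([0.1, -0.2, 0.0], [2.0, -3.0, 1.5], j=1.0)
```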
Maximizing the Training Data Log Likelihood
$$\frac{1}{N} \frac{\partial}{\partial \theta_m} \log p(D \mid \theta_1, \ldots, \theta_n) = \frac{1}{N} \sum_{d \in D} \frac{\partial}{\partial \theta_m} \log \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$
$$= \frac{1}{N} \sum_{d \in D} \left[ \frac{\partial \log \prod_m f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} \right]$$

Maximizing the Training Data Log Likelihood
The first term is an expectation over the data. For the second, use $(\log x)' = x'/x$:
$$\frac{1}{N} \sum_{d \in D} \frac{\partial \log \prod_m f_m(d \mid \theta_m)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}$$
$$\frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} = \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \frac{\partial \prod_m f_m(c \mid \theta_m)}{\partial \theta_m}$$

Maximizing the Training Data Log Likelihood
Differentiating the product (again via $(\log x)' = x'/x$):
$$\frac{\partial \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} = \left[ \prod_{j \neq m} f_j(c \mid \theta_j) \right] \frac{\partial f_m(c \mid \theta_m)}{\partial \theta_m} = \prod_m f_m(c \mid \theta_m)\, \frac{1}{f_m(c \mid \theta_m)} \frac{\partial f_m(c \mid \theta_m)}{\partial \theta_m} = \prod_m f_m(c \mid \theta_m)\, \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$

Maximizing the Training Data Log Likelihood
Substituting back, the second term is an expectation under the model's equilibrium distribution:
$$\frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} = \sum_c \frac{\prod_m f_m(c \mid \theta_m)}{\sum_{c'} \prod_m f_m(c' \mid \theta_m)}\, \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} = \sum_c p(c \mid \theta_1, \ldots, \theta_n)\, \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$

Maximizing the Training Data Log Likelihood
Phew! We're done! So:
$$\frac{\partial}{\partial \theta_m} \log p(D \mid \theta_1, \ldots, \theta_n) = N \left[ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} \right]$$

Equilibrium Is Hard to Achieve
• With the gradient above we can now train our PoE model.
• But there's a problem:
  – $\langle \cdot \rangle_{P^\infty}$ is computationally infeasible to obtain (especially inside an inner gradient ascent loop).
  – The sampling Markov chain must converge to the target distribution, and often this takes a very long time!

Solution: Contrastive Divergence!
$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$
• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after one iteration (or perhaps a few iterations, more typically).
• Why does this work? It attempts to minimize the ways that the model distorts the data.
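The derived gradient can be sanity-checked numerically: on a tiny discrete domain the equilibrium average $\langle \cdot \rangle_{P^\infty}$ (the sum over $c$) is exactly computable, so the formula can be compared against finite differences of the log likelihood. This toy is my own construction: the domain, data, and $\alpha$ value are arbitrary choices, using a single Student-t expert with $j = 1$.

```python
import math

# Tiny discrete domain so the normalizer (the sum over c) is exact.
DOMAIN = [-2.0, -1.0, 0.0, 1.0, 2.0]

def logf(c, alpha):
    # Single Student-t expert with j = 1: log f(c) = -alpha*log(1 + c^2/2).
    return -alpha * math.log(1.0 + 0.5 * c * c)

def dlogf(c):
    # d/d(alpha) log f(c); independent of alpha for this expert.
    return -math.log(1.0 + 0.5 * c * c)

def avg_loglik(data, alpha):
    # (1/N) * log p(D | alpha), with the exact normalizer.
    log_Z = math.log(sum(math.exp(logf(c, alpha)) for c in DOMAIN))
    return sum(logf(d, alpha) - log_Z for d in data) / len(data)

def grad_formula(data, alpha):
    # <dlogf/dalpha>_{P0} - <dlogf/dalpha>_{P_inf}, P_inf computed exactly.
    Z = sum(math.exp(logf(c, alpha)) for c in DOMAIN)
    pos = sum(dlogf(d) for d in data) / len(data)
    neg = sum(math.exp(logf(c, alpha)) / Z * dlogf(c) for c in DOMAIN)
    return pos - neg

data = [0.0, 1.0, 0.0, -1.0]
alpha, eps = 0.6, 1e-6
numeric = (avg_loglik(data, alpha + eps) - avg_loglik(data, alpha - eps)) / (2 * eps)
analytic = grad_formula(data, alpha)
```

The two quantities agree to numerical precision, which is exactly the identity the derivation establishes; the point of contrastive divergence is that for real, continuous, high-dimensional domains this exact sum over $c$ is unavailable.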
Equivalence of argmax log P() and argmin KL()
$$\mathrm{KL}\!\left(P^0 \,\|\, P^\infty\right) = \sum_d P^0(d) \log \frac{P^0(d)}{P^\infty(d)} = \sum_d P^0(d) \log P^0(d) - \sum_d P^0(d) \log P^\infty(d) = -H\!\left(P^0\right) - \left\langle \log P^\infty(d) \right\rangle_{P^0}$$
Since the entropy term does not depend on the parameters,
$$-\frac{\partial}{\partial \theta_m} \mathrm{KL}\!\left(P^0 \,\|\, P^\infty\right) = \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^0}$$
This is what we got out of the nasty derivation!

Contrastive Divergence
• We want to "update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step", i.e. follow the gradient of the contrastive divergence $\mathrm{KL}(P^0 \,\|\, P^\infty) - \mathrm{KL}(P^1 \,\|\, P^\infty)$:
$$\Delta\theta_m \propto -\frac{\partial}{\partial \theta_m} \left[ \mathrm{KL}\!\left(P^0 \,\|\, P^\infty\right) - \mathrm{KL}\!\left(P^1 \,\|\, P^\infty\right) \right] = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$
• The $\langle \cdot \rangle_{P^\infty}$ terms contributed by the two KL divergences cancel; a remaining term involving $\partial P^1 / \partial \theta_m$ is small in practice and is safely ignored.

Contrastive Divergence (Final Result!)
$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1} \approx \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{c \sim P^1} \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$
where $P^0$ is the training data (empirical distribution), $P^1$ the one-step model samples, and the expectations are computed as sample averages by the law of large numbers. Now you know how to do it and why it works!
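Putting the pieces together: a toy CD-1 training loop for the $\alpha$ of a single 1-D Student-t expert, using one Metropolis step started from each data point as the $P^1$ sample. Everything here (the data, proposal width, learning rate, epoch count) is an illustrative choice of mine, not from the slides.

```python
import math
import random

random.seed(1)

def logf(x, alpha):
    # 1-D Student-t expert with j = 1: log f(x) = -alpha*log(1 + x^2/2).
    return -alpha * math.log(1.0 + 0.5 * x * x)

def dlogf_dalpha(x):
    # d/d(alpha) log f(x) = -log(1 + x^2/2).
    return -math.log(1.0 + 0.5 * x * x)

def one_step_sample(x, alpha, step=1.0):
    """One Metropolis step started at a data point: a draw from P^1."""
    prop = x + random.gauss(0.0, step)
    log_r = logf(prop, alpha) - logf(x, alpha)
    if random.random() < math.exp(min(0.0, log_r)):
        return prop
    return x

def cd1_train(data, alpha=1.0, lr=0.05, epochs=200):
    """Repeat until 'convergence': draw P^1 samples from the data, then
    Delta(alpha) ~ <dlogf/dalpha>_{P0} - <dlogf/dalpha>_{P1}."""
    for _ in range(epochs):
        recon = [one_step_sample(d, alpha) for d in data]
        pos = sum(dlogf_dalpha(d) for d in data) / len(data)
        neg = sum(dlogf_dalpha(c) for c in recon) / len(recon)
        alpha += lr * (pos - neg)
    return alpha

# Tightly clustered data: the one-step samples wander outward, so the
# CD-1 update grows alpha, sharpening the expert around the data.
data = [0.1, -0.3, 0.2, 0.0, -0.1, 0.4, -0.2, 0.3]
trained_alpha = cd1_train(data)
```

Note that only one Metropolis step is taken per data point per update, exactly the point of contrastive divergence: the chain is never run to equilibrium inside the learning loop.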