Training Products of Experts by Minimizing Contrastive Divergence
Geoffrey E. Hinton
Presented by Frank Wood — fwood@cs.brown.edu

Goal
• Learn parameters for probability distribution models of high-dimensional data (images, population firing rates, securities data, NLP data, etc.)

Mixture Models vs. Products of Experts
• Mixture model: $p(d \mid \theta_1, \ldots, \theta_n) = \sum_m f_m(d \mid \theta_m)$ (mixing proportions absorbed into each $f_m$). Use EM to learn the parameters $\theta_m$.
• Product of experts: $p(d \mid \theta_1, \ldots, \theta_n) = \dfrac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$. Use contrastive divergence to learn the parameters.

Take Home
• Contrastive divergence is a general MCMC gradient ascent learning algorithm particularly well suited to learning the parameters of Product of Experts (PoE) and other energy-based models (Gibbs distributions, etc.).
• The general algorithm — repeat until "convergence":
  – Draw samples from the current model, starting from the training data.
  – Compute the expected gradient of the log probability w.r.t. all model parameters, over both the samples and the training data.
  – Update the model parameters according to the gradient.

Sampling — Critical to Understanding
• Uniform — rand(): linear congruential generator, x(n) = a * x(n-1) + b mod M.
• Normal — randn(): Box-Muller transform. Given x1, x2 ~ U(0,1),
  – y1 = sqrt(-2 ln(x1)) cos(2 pi x2)
  – y2 = sqrt(-2 ln(x1)) sin(2 pi x2)
  are independent N(0,1) draws.
• Binomial(p): if (rand() < p).
• More complicated distributions:
  – Mixture model: sample a component from a multinomial (CDF + uniform), then sample from the chosen Gaussian.
  – Product of experts: Metropolis and/or Gibbs sampling.

The Flavor of Metropolis Sampling
• Given a distribution $p(d \mid \theta)$, a starting point $d^{t-1}$, and a symmetric proposal distribution $J(d^t \mid d^{t-1})$:
  – Sample $d^t$ from the proposal distribution and calculate the ratio of densities $r = \dfrac{p(d^t \mid \theta)}{p(d^{t-1} \mid \theta)}$.
  – With probability $\min(r, 1)$, accept $d^t$.
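The Metropolis recipe above can be sketched in a few lines of Python. This is an illustrative toy, not from the slides: the Gaussian proposal width, burn-in length, and the unnormalized standard-normal target are my own choices.

```python
import math
import random

random.seed(0)

def metropolis(log_p, d0, n_samples, step=1.0, burn_in=1000):
    """Minimal Metropolis sampler with a symmetric Gaussian proposal.

    log_p only needs to be the target log-density up to an additive
    constant, i.e. the distribution is known up to proportionality.
    """
    d = d0
    samples = []
    for t in range(burn_in + n_samples):
        d_new = d + random.gauss(0.0, step)      # symmetric J(d_t | d_{t-1})
        log_r = log_p(d_new) - log_p(d)          # log of the density ratio r
        if random.random() < math.exp(min(0.0, log_r)):  # accept w.p. min(r, 1)
            d = d_new
        if t >= burn_in:
            samples.append(d)
    return samples

# Target: standard normal, supplied unnormalized as exp(-d^2 / 2).
samples = metropolis(lambda d: -0.5 * d * d, d0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note that the acceptance test uses only the ratio of densities, which is why the normalizing constant is never needed.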
• Given sufficiently many iterations, $d^n, d^{n+1}, d^{n+2}, \ldots \sim p(d \mid \theta)$.
• Key point: we only need to know the distribution up to a constant of proportionality!

Contrastive Divergence (Final Result!)
$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$
where $P^0$ is the training data (the empirical distribution), $P^1$ is the distribution of samples from the model after one sampling step, and the $\theta_m$ are the model parameters. By the law of large numbers, the expectations can be computed with sample averages:
$$\Delta\theta_m \propto \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{c \sim P^1} \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$
Now you know how to do it; let's see why this works!

But First: The Last Vestige of Concreteness
• Looking towards the future, take each expert $f$ to be a 1-D Student-t:
$$f(d;\, \Theta_m = \{j_m, \alpha_m\}) = \left( \frac{1}{1 + \tfrac{1}{2}(j_m^T d)^2} \right)^{\alpha_m}$$
where the dot product $j_m^T d$ is a projection of the data, giving a 1-D marginal.
• Then (for instance)
$$\frac{\partial}{\partial \alpha_m} \log f(d;\, \Theta_m) = \frac{\partial}{\partial \alpha_m} \left[ -\alpha_m \log\!\left(1 + \tfrac{1}{2}(j_m^T d)^2\right) \right] = -\log\!\left(1 + \tfrac{1}{2}(j_m^T d)^2\right)$$
[Figure: plot of the 1-D Student-t expert density $1/(1 + \tfrac{1}{2}(j^T x)^2)^{\alpha}$ with $\alpha = 0.6$, $j = 5$.]

Maximizing the Training Data Log Likelihood
• We want the maximizing parameters
$$\theta_1^*, \ldots, \theta_n^* = \arg\max_{\theta_1, \ldots, \theta_n} \log p(D \mid \theta_1, \ldots, \theta_n) = \arg\max_{\theta_1, \ldots, \theta_n} \sum_{d \in D} \log \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$
where the sum is over all training data (assuming the $d$'s are drawn independently from $p(\cdot)$) and the term inside the log is the standard PoE form.
• Differentiate w.r.t. all parameters and perform gradient ascent to find the optimal parameters.
• The derivation is somewhat nasty.

Maximizing the Training Data Log Likelihood
$$\frac{\partial}{\partial \theta_m} \log p(D \mid \theta_1, \ldots, \theta_n) = \sum_{d \in D} \frac{\partial}{\partial \theta_m} \log p(d \mid \theta_1, \ldots, \theta_n) = N \left\langle \frac{\partial}{\partial \theta_m} \log p(d \mid \theta_1, \ldots, \theta_n) \right\rangle_{P^0}$$
Remember this equivalence!
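The Student-t derivative and the sample-average CD update can be combined into a small sketch. This is an illustrative toy with scalar data: the example values and the learning rate are my own choices, and the model samples are assumed to be given (e.g. from one MCMC step).

```python
import math

def dlogf_dalpha(d, j):
    """d/d(alpha) of log f for the Student-t expert
    f(d) = (1 + 0.5*(j^T d)^2)^(-alpha). Here d and j are scalars,
    so j^T d is just j*d. The derivative is -log(1 + 0.5*(j*d)^2)
    and does not depend on alpha itself."""
    return -math.log(1.0 + 0.5 * (j * d) ** 2)

def cd_update_alpha(data, model_samples, j, lr=0.1):
    """Delta(alpha) ~ <dlogf/dalpha>_{P0} - <dlogf/dalpha>_{P1},
    with both expectations replaced by sample averages
    (law of large numbers)."""
    pos = sum(dlogf_dalpha(d, j) for d in data) / len(data)
    neg = sum(dlogf_dalpha(c, j) for c in model_samples) / len(model_samples)
    return lr * (pos - neg)

# Data tightly clustered near 0 (high density under the expert) versus
# one-step model samples that have wandered further out: the update
# comes out positive, sharpening the expert around the data.
delta = cd_update_alpha([0.1, -0.2, 0.0], [2.0, -3.0, 1.5], j=1.0)
```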
Maximizing the Training Data Log Likelihood
$$\frac{1}{N} \frac{\partial}{\partial \theta_m} \log p(D \mid \theta_1, \ldots, \theta_n) = \frac{1}{N} \sum_{d \in D} \frac{\partial}{\partial \theta_m} \log \frac{\prod_m f_m(d \mid \theta_m)}{\sum_c \prod_m f_m(c \mid \theta_m)}$$
$$= \frac{1}{N} \sum_{d \in D} \left[ \frac{\partial \log \prod_m f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} \right]$$

Maximizing the Training Data Log Likelihood
The first term is an expectation over the data. For the second, use $(\log x)' = x'/x$:
$$\frac{1}{N} \sum_{d \in D} \frac{\partial \log \prod_m f_m(d \mid \theta_m)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0}$$
$$\frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} = \frac{1}{\sum_c \prod_m f_m(c \mid \theta_m)} \sum_c \frac{\partial \prod_m f_m(c \mid \theta_m)}{\partial \theta_m}$$

Maximizing the Training Data Log Likelihood
Differentiating the product (again via $(\log x)' = x'/x$):
$$\frac{\partial \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} = \left[ \prod_{j \neq m} f_j(c \mid \theta_j) \right] \frac{\partial f_m(c \mid \theta_m)}{\partial \theta_m} = \prod_m f_m(c \mid \theta_m)\, \frac{1}{f_m(c \mid \theta_m)} \frac{\partial f_m(c \mid \theta_m)}{\partial \theta_m} = \prod_m f_m(c \mid \theta_m)\, \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$

Maximizing the Training Data Log Likelihood
Substituting back, the second term is an expectation under the model's equilibrium distribution:
$$\frac{\partial \log \sum_c \prod_m f_m(c \mid \theta_m)}{\partial \theta_m} = \sum_c \frac{\prod_m f_m(c \mid \theta_m)}{\sum_{c'} \prod_m f_m(c' \mid \theta_m)}\, \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} = \sum_c p(c \mid \theta_1, \ldots, \theta_n)\, \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} = \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty}$$

Maximizing the Training Data Log Likelihood
Phew! We're done! So:
$$\frac{\partial}{\partial \theta_m} \log p(D \mid \theta_1, \ldots, \theta_n) = N \left[ \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^\infty} \right]$$

Equilibrium Is Hard to Achieve
• With the gradient above we can now train our PoE model.
• But there's a problem:
  – $\langle \cdot \rangle_{P^\infty}$ is computationally infeasible to obtain (especially inside an inner gradient ascent loop).
  – The sampling Markov chain must converge to the target distribution, and often this takes a very long time!

Solution: Contrastive Divergence!
$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$
• Now we don't have to run the sampling Markov chain to convergence; instead we can stop after one iteration (or perhaps a few iterations, more typically).
• Why does this work? It attempts to minimize the ways that the model distorts the data.
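The derived gradient can be sanity-checked numerically: on a tiny discrete domain the equilibrium average $\langle \cdot \rangle_{P^\infty}$ (the sum over $c$) is exactly computable, so the formula can be compared against finite differences of the log likelihood. This toy is my own construction: the domain, data, and $\alpha$ value are arbitrary choices, using a single Student-t expert with $j = 1$.

```python
import math

# Tiny discrete domain so the normalizer (the sum over c) is exact.
DOMAIN = [-2.0, -1.0, 0.0, 1.0, 2.0]

def logf(c, alpha):
    # Single Student-t expert with j = 1: log f(c) = -alpha*log(1 + c^2/2).
    return -alpha * math.log(1.0 + 0.5 * c * c)

def dlogf(c):
    # d/d(alpha) log f(c); independent of alpha for this expert.
    return -math.log(1.0 + 0.5 * c * c)

def avg_loglik(data, alpha):
    # (1/N) * log p(D | alpha), with the exact normalizer.
    log_Z = math.log(sum(math.exp(logf(c, alpha)) for c in DOMAIN))
    return sum(logf(d, alpha) - log_Z for d in data) / len(data)

def grad_formula(data, alpha):
    # <dlogf/dalpha>_{P0} - <dlogf/dalpha>_{P_inf}, P_inf computed exactly.
    Z = sum(math.exp(logf(c, alpha)) for c in DOMAIN)
    pos = sum(dlogf(d) for d in data) / len(data)
    neg = sum(math.exp(logf(c, alpha)) / Z * dlogf(c) for c in DOMAIN)
    return pos - neg

data = [0.0, 1.0, 0.0, -1.0]
alpha, eps = 0.6, 1e-6
numeric = (avg_loglik(data, alpha + eps) - avg_loglik(data, alpha - eps)) / (2 * eps)
analytic = grad_formula(data, alpha)
```

The two quantities agree to numerical precision, which is exactly the identity the derivation establishes; the point of contrastive divergence is that for real, continuous, high-dimensional domains this exact sum over $c$ is unavailable.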
Equivalence of argmax log P() and argmin KL()
$$\mathrm{KL}\!\left(P^0 \,\|\, P^\infty\right) = \sum_d P^0(d) \log \frac{P^0(d)}{P^\infty(d)} = \sum_d P^0(d) \log P^0(d) - \sum_d P^0(d) \log P^\infty(d) = -H\!\left(P^0\right) - \left\langle \log P^\infty(d) \right\rangle_{P^0}$$
Since the entropy term does not depend on the parameters,
$$-\frac{\partial}{\partial \theta_m} \mathrm{KL}\!\left(P^0 \,\|\, P^\infty\right) = \left\langle \frac{\partial \log P^\infty(d)}{\partial \theta_m} \right\rangle_{P^0}$$
This is what we got out of the nasty derivation!

Contrastive Divergence
• We want to "update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step", i.e. follow the gradient of the contrastive divergence $\mathrm{KL}(P^0 \,\|\, P^\infty) - \mathrm{KL}(P^1 \,\|\, P^\infty)$:
$$\Delta\theta_m \propto -\frac{\partial}{\partial \theta_m} \left[ \mathrm{KL}\!\left(P^0 \,\|\, P^\infty\right) - \mathrm{KL}\!\left(P^1 \,\|\, P^\infty\right) \right] = \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1}$$
• The $\langle \cdot \rangle_{P^\infty}$ terms contributed by the two KL divergences cancel; a remaining term involving $\partial P^1 / \partial \theta_m$ is small in practice and is safely ignored.

Contrastive Divergence (Final Result!)
$$\Delta\theta_m \propto \left\langle \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^0} - \left\langle \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{P^1} \approx \frac{1}{N} \sum_{d \in D} \frac{\partial \log f_m(d \mid \theta_m)}{\partial \theta_m} - \frac{1}{N} \sum_{c \sim P^1} \frac{\partial \log f_m(c \mid \theta_m)}{\partial \theta_m}$$
where $P^0$ is the training data (empirical distribution), $P^1$ the one-step model samples, and the expectations are computed as sample averages by the law of large numbers. Now you know how to do it and why it works!
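Putting the pieces together: a toy CD-1 training loop for the $\alpha$ of a single 1-D Student-t expert, using one Metropolis step started from each data point as the $P^1$ sample. Everything here (the data, proposal width, learning rate, epoch count) is an illustrative choice of mine, not from the slides.

```python
import math
import random

random.seed(1)

def logf(x, alpha):
    # 1-D Student-t expert with j = 1: log f(x) = -alpha*log(1 + x^2/2).
    return -alpha * math.log(1.0 + 0.5 * x * x)

def dlogf_dalpha(x):
    # d/d(alpha) log f(x) = -log(1 + x^2/2).
    return -math.log(1.0 + 0.5 * x * x)

def one_step_sample(x, alpha, step=1.0):
    """One Metropolis step started at a data point: a draw from P^1."""
    prop = x + random.gauss(0.0, step)
    log_r = logf(prop, alpha) - logf(x, alpha)
    if random.random() < math.exp(min(0.0, log_r)):
        return prop
    return x

def cd1_train(data, alpha=1.0, lr=0.05, epochs=200):
    """Repeat until 'convergence': draw P^1 samples from the data, then
    Delta(alpha) ~ <dlogf/dalpha>_{P0} - <dlogf/dalpha>_{P1}."""
    for _ in range(epochs):
        recon = [one_step_sample(d, alpha) for d in data]
        pos = sum(dlogf_dalpha(d) for d in data) / len(data)
        neg = sum(dlogf_dalpha(c) for c in recon) / len(recon)
        alpha += lr * (pos - neg)
    return alpha

# Tightly clustered data: the one-step samples wander outward, so the
# CD-1 update grows alpha, sharpening the expert around the data.
data = [0.1, -0.3, 0.2, 0.0, -0.1, 0.4, -0.2, 0.3]
trained_alpha = cd1_train(data)
```

Note that only one Metropolis step is taken per data point per update, exactly the point of contrastive divergence: the chain is never run to equilibrium inside the learning loop.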