Many paths to computing normalizing constants
Yuri Burda (yburda@gmail.com)
CIFAR NCAP, 08/15/2014

Annealed importance sampling: normalizing constants

Need for normalizing constants

We often have models where $p(x) \propto f(x)$ with simple $f(x)$.
For instance, if $H(x)$ is a function, then $p(x) = \frac{1}{Z} e^{-H(x)}$ is a probability distribution.
Examples: exponential family, Boltzmann machines, etc.
Want to compute $p(x)$, hence also $Z$.

What is Z?
$$p(x) = \frac{1}{Z} f(x), \qquad Z = \int_X f(x)\,dx$$
Such an integral is often intractable.

Approximation: Importance Sampling

$$Z = \int_X f(x)\,dx = \int_X \frac{f(x)}{q(x)}\, q(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}, \qquad x_i \sim q$$
Observation: the variance of the estimate is $\frac{1}{N} \mathrm{Var}_q(f/q)$.
If $p$ and $q$ are not close, $N$ has to be impractically large to get useful estimates.
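The importance sampling estimate of $Z$, and the way its variance depends on how close $q$ is to $p$, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not from the talk: the one-dimensional target $f(x) = e^{-x^2/2}$ (so the true $Z = \sqrt{2\pi}$), the two Gaussian proposals, and the helper names `estimate_Z` and `gaussian_pdf`.

```python
import math
import random


def estimate_Z(f, q_pdf, q_sample, n, rng):
    """Return (mean, sample variance) of the weights f(x)/q(x) with x ~ q."""
    weights = []
    for _ in range(n):
        x = q_sample(rng)
        weights.append(f(x) / q_pdf(x))
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / (n - 1)
    return mean, var


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def f(x):
    # Hypothetical unnormalized target; its true Z is sqrt(2*pi).
    return math.exp(-0.5 * x * x)


rng = random.Random(0)
n = 20000
true_Z = math.sqrt(2 * math.pi)

# Proposal close to p: N(0, 1.5^2), which covers the target well.
z_close, var_close = estimate_Z(
    f, lambda x: gaussian_pdf(x, 0.0, 1.5), lambda r: r.gauss(0.0, 1.5), n, rng)

# Proposal far from p: N(4, 1), which rarely samples where f is large.
z_far, var_far = estimate_Z(
    f, lambda x: gaussian_pdf(x, 4.0, 1.0), lambda r: r.gauss(4.0, 1.0), n, rng)

print(f"true Z = {true_Z:.4f}")
print(f"close proposal: Z estimate = {z_close:.4f}, weight variance = {var_close:.3g}")
print(f"far proposal:   Z estimate = {z_far:.4f}, weight variance = {var_far:.3g}")
```

With the poorly matched proposal the weights are dominated by a handful of rare samples, so the weight variance is much larger and the estimate is unreliable at the same $N$.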
Annealed Importance Sampling

Take a sequence of distributions $q_0, q_1, \dots, q_n = p$ with simple tractable $q_0$, and $q_i(x) = \frac{1}{Z_i} f_i(x)$.
Sample $x_1^{(0)}, \dots, x_N^{(0)}$ from $q_0$;
the average of $f_1(x_i^{(0)}) / q_0(x_i^{(0)})$ approximates $Z_1$;
get samples $x_1^{(1)}, \dots, x_N^{(1)}$ from $q_1$;
the average of $f_2(x_i^{(1)}) / q_1(x_i^{(1)})$ approximates $Z_2$ (evaluating $q_1 = f_1 / Z_1$ uses the estimate of $Z_1$ from the previous step);
etc., until we get an estimate of $Z_n = Z$.
A smarter scheme along these lines works even when $x^{(i+1)}$ is sampled from $T_{i+1}(x^{(i)})$, with $T_{i+1}$ a Markov chain operator with stationary distribution $q_{i+1}$.
It's called AIS.

Choice of $q_i$

Common choice: $q_i \propto p_0^{1-\beta_i}\, p^{\beta_i}$ with $0 = \beta_0 < \dots < \beta_n = 1$.

RBM example

We will try it on a binary RBM with 500 hidden units trained on MNIST with persistent contrastive divergence (PCD).
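Before the MNIST-scale experiment, AIS with the common geometric path can be sketched on a tiny binary RBM whose $Z$ can be enumerated exactly. This is a minimal sketch under stated assumptions, not the code used in the talk: the RBM size (6 visible, 4 hidden units), the random weights, the uniform $p_0$, the geometric path applied to the joint over visible and hidden units, the single Gibbs sweep per temperature, and all function names are hypothetical choices made for illustration.

```python
import itertools
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def softplus(t):
    # log(1 + exp(t)), computed stably.
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def log_f(v, beta, W, b, c):
    """log f_beta(v), with hidden units summed out:
    f_beta(v) = exp(beta * b.v) * prod_j (1 + exp(beta * (c_j + sum_i v_i W[i][j])))."""
    total = beta * sum(bi * vi for bi, vi in zip(b, v))
    for j in range(len(c)):
        act = c[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        total += softplus(beta * act)
    return total


def gibbs_step(v, beta, W, b, c, rng):
    """One Gibbs sweep (h given v, then v given h) that leaves q_beta invariant."""
    nv, nh = len(b), len(c)
    h = [1 if rng.random() < sigmoid(beta * (c[j] + sum(v[i] * W[i][j] for i in range(nv)))) else 0
         for j in range(nh)]
    return [1 if rng.random() < sigmoid(beta * (b[i] + sum(W[i][j] * h[j] for j in range(nh)))) else 0
            for i in range(nv)]


def ais_log_Z(W, b, c, n_chains, n_temps, rng):
    nv, nh = len(b), len(c)
    betas = [k / n_temps for k in range(n_temps + 1)]  # linearly spaced in [0, 1]
    log_Z0 = (nv + nh) * math.log(2.0)  # beta = 0: uniform over all (v, h) states
    log_weights = []
    for _ in range(n_chains):
        v = [rng.randint(0, 1) for _ in range(nv)]  # exact sample from q_0
        log_w = 0.0
        for k in range(1, n_temps + 1):
            # Accumulate the ratio f_k / f_{k-1} at the current state, then move with T_k.
            log_w += log_f(v, betas[k], W, b, c) - log_f(v, betas[k - 1], W, b, c)
            v = gibbs_step(v, betas[k], W, b, c, rng)
        log_weights.append(log_w)
    m = max(log_weights)
    log_mean_w = m + math.log(sum(math.exp(lw - m) for lw in log_weights) / n_chains)
    return log_Z0 + log_mean_w


def exact_log_Z(W, b, c):
    # Brute-force sum over all visible configurations; only feasible for tiny models.
    logs = [log_f(list(v), 1.0, W, b, c) for v in itertools.product([0, 1], repeat=len(b))]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))


rng = random.Random(0)
nv, nh = 6, 4
W = [[rng.gauss(0.0, 0.5) for _ in range(nh)] for _ in range(nv)]
b = [rng.gauss(0.0, 0.5) for _ in range(nv)]
c = [rng.gauss(0.0, 0.5) for _ in range(nh)]

log_Z_ais = ais_log_Z(W, b, c, n_chains=200, n_temps=100, rng=rng)
log_Z_exact = exact_log_Z(W, b, c)
print(f"AIS log Z = {log_Z_ais:.3f}, exact log Z = {log_Z_exact:.3f}")
```

The weights are accumulated in the log domain and averaged with a log-sum-exp, since the raw products of ratios overflow or underflow quickly for larger models. For a model this small the geometric path is easy; the difficulty discussed in the talk arises at the scale of hundreds of hidden units.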
Samples from the trained RBM: [figure not reproduced]

Chain of samples when running AIS

AIS run with 5000 intermediate distributions, $\beta_i$ linearly spaced between 0 and 1.
[Figure frames not reproduced: samples at n = 0, 1000, 2000, 3000, 4000, 5000, followed by another 1000 Gibbs samples.]

Idea: replace hidden units by their average activations.
Samples (figure frames not reproduced):
n = 0, 20 units
n = 1000, 100 units
n = 2000, 200 units
n = 3000, 300 units
n = 4000, 400 units
n = 5000, 500 units

Numbers

With 5000 intermediate distributions, the geometric averages path underestimates $\log Z$ by about 3 nats, while the path with a varying number of hidden units underestimates it by only 1 nat.
With 500 intermediate distributions, the geometric averages path underestimates $\log Z$ by about 5 nats, while the path with a varying number of hidden units underestimates it by 8 nats.
Not clear which is better.
Conclusion

It would be nice to have a way to go from a large RBM to a smaller, "fuzzier" one without going through meaningless distributions in between.
See also "Annealing Between Distributions by Averaging Moments" by R. Grosse et al.

Questions?