Many paths to computing normalizing constants

Yuri Burda
yburda@gmail.com
08/15/2014
CIFAR NCAP
Annealed importance sampling
Normalizing constants
Need for normalizing constants
We often have models where $p(x) \propto f(x)$ with simple $f(x)$.
For instance, if $H(x)$ is a function, then $p(x) = \frac{1}{Z} e^{-H(x)}$ is a probability distribution.
Examples: exponential families, Boltzmann machines, etc.
We want to compute $p(x)$, hence also $Z$.
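As a small illustration of an unnormalized model, here is a sketch of a three-unit Boltzmann-style model where Z can still be found by listing every state. The couplings, biases, and function names are made up for the example and are not from the talk.

```python
import itertools
import math

# Hypothetical parameters for three binary units (illustration only).
COUPLINGS = {(0, 1): 1.0, (1, 2): -0.5, (0, 2): 0.25}
BIASES = [0.1, -0.2, 0.3]


def energy(x):
    """H(x): lower energy means higher probability."""
    pair_term = sum(w * x[i] * x[j] for (i, j), w in COUPLINGS.items())
    bias_term = sum(b * xi for b, xi in zip(BIASES, x))
    return -(pair_term + bias_term)


def unnormalized(x):
    """f(x) = exp(-H(x)); cheap to evaluate for any single state."""
    return math.exp(-energy(x))


# With only 2**3 states, Z can be computed by brute-force enumeration.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)
probs = {x: unnormalized(x) / Z for x in states}  # p(x) = f(x) / Z
```

Evaluating f(x) needs one state, while Z needs all of them, which is why Z becomes the bottleneck as the number of units grows.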
What is Z?
$p(x) = \frac{1}{Z} f(x)$, where $Z = \int_X f(x)\,dx$.
Such an integral is often intractable.
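In one dimension the integral is easy to approximate on a grid. The sketch below does this for an assumed example, the unnormalized standard Gaussian $f(x) = e^{-x^2/2}$, and then counts how many grid cells the same resolution would need in higher dimensions. The function names are illustrative.

```python
import math


def f(x):
    """Unnormalized standard Gaussian density (assumed example)."""
    return math.exp(-0.5 * x * x)


def midpoint_integral(func, lo, hi, n_cells):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / n_cells
    return width * sum(func(lo + (k + 0.5) * width) for k in range(n_cells))


def grid_cells(cells_per_axis, dim):
    """Number of cells in a regular grid with the same resolution on every axis."""
    return cells_per_axis ** dim


Z_grid = midpoint_integral(f, -10.0, 10.0, 2000)
Z_exact = math.sqrt(2.0 * math.pi)  # known closed form for this f
```

The cell count grows exponentially with dimension, so grid integration is not an option for models with hundreds of variables.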
Approximation - Importance Sampling
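$Z = \int_X f(x)\,dx = \int_X \frac{f(x)}{q(x)}\, q(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}$, with $x_i$ sampled from $q$.
Observation: the variance of the estimate is $\frac{1}{N} \operatorname{Var}_q(f/q)$.
If p and q are not close, N has to be impractically large to get useful estimates.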
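A minimal sketch of the importance sampling estimate, assuming the target $f(x) = e^{-x^2/2}$ (so $Z = \sqrt{2\pi}$) and a zero-mean Gaussian proposal q whose width is the only tunable parameter. All names and the choice of example are assumptions made for illustration.

```python
import math
import random


def f(x):
    """Unnormalized target density (assumed example)."""
    return math.exp(-0.5 * x * x)


def normal_pdf(x, sigma):
    """Normalized density of the proposal q = N(0, sigma^2)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def importance_estimate(sigma_q, n_samples, seed=0):
    """Return (estimate of Z, sample variance of the weights f/q)."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma_q)                 # x_i sampled from q
        weights.append(f(x) / normal_pdf(x, sigma_q))
    mean = sum(weights) / n_samples
    var = sum((w - mean) ** 2 for w in weights) / (n_samples - 1)
    return mean, var


Z_exact = math.sqrt(2.0 * math.pi)
# Proposal wider than the target: weights are bounded, variance is small.
Z_wide, var_wide = importance_estimate(sigma_q=1.5, n_samples=20000)
# Proposal narrower than the target: Var_q(f/q) is infinite for this pair,
# so the sample variance is unreliable and the estimate is not trustworthy.
Z_narrow, var_narrow = importance_estimate(sigma_q=0.5, n_samples=20000)
```

The narrow proposal rarely visits the tails where f/q is large, which is the one-dimensional version of p and q not being close.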
Annealed Importance Sampling
Take a sequence of distributions $q_0, q_1, \ldots, q_n = p$ with a simple, tractable $q_0$, and $q_i(x) = \frac{1}{Z_i} f_i(x)$.
Sample $x_1^{(0)}, \ldots, x_N^{(0)}$ from $q_0$;
the average of $f_1(x_i^{(0)}) / q_0(x_i^{(0)})$ approximates $Z_1$,
get samples $x_1^{(1)}, \ldots, x_N^{(1)}$ from $q_1$;
the average of $f_2(x_i^{(1)}) / q_1(x_i^{(1)})$ approximates $Z_2$,
etc.,
until we get an estimate of $Z_n = Z$.
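The chained scheme can be sketched on a toy problem where every intermediate distribution can be sampled exactly. Here the assumed family is $f_i(x) = e^{-x^2/(2\sigma_i^2)}$ with $\sigma_i$ shrinking from 1 to 0.1. Since $q_i = f_i / Z_i$, averaging $f_{i+1}/q_i$ is the same as multiplying the current estimate of $Z_i$ by the average of $f_{i+1}/f_i$, which is the form used in the code. All names are illustrative.

```python
import math
import random


def log_f(x, sigma):
    """log of the unnormalized Gaussian f(x) = exp(-x^2 / (2 sigma^2))."""
    return -0.5 * (x / sigma) ** 2


def chained_estimate(sigmas, n_samples, seed=0):
    """Chain Z_{i+1} ~= Z_i * mean(f_{i+1}(x) / f_i(x)) with x drawn from q_i."""
    rng = random.Random(seed)
    z_hat = sigmas[0] * math.sqrt(2.0 * math.pi)  # Z_0 is known: q_0 is tractable
    for s_cur, s_next in zip(sigmas, sigmas[1:]):
        ratios = []
        for _ in range(n_samples):
            x = rng.gauss(0.0, s_cur)             # exact sample from q_i
            ratios.append(math.exp(log_f(x, s_next) - log_f(x, s_cur)))
        z_hat *= sum(ratios) / n_samples
    return z_hat


n_steps = 10
sigmas = [10.0 ** (-k / n_steps) for k in range(n_steps + 1)]  # 1.0 down to 0.1
Z_hat = chained_estimate(sigmas, n_samples=5000)
Z_exact = sigmas[-1] * math.sqrt(2.0 * math.pi)
```

Each step only bridges two neighbouring distributions, so every individual ratio has low variance even though $q_0$ and $q_n$ are far apart.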
A smarter scheme along these lines works even when $x^{(i+1)}$ is sampled from $T_{i+1}(x^{(i)})$, with $T_{i+1}$ a Markov chain operator with stationary distribution $q_{i+1}$.
It's called annealed importance sampling (AIS).
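A minimal AIS sketch on a one-dimensional toy problem, assuming a standard normal start, a Gaussian-shaped target with known $Z = \sigma\sqrt{2\pi}$, geometric interpolation between the two, and a random-walk Metropolis move as the Markov operator. None of these choices come from the talk, and the function names are illustrative.

```python
import math
import random

MU, SIGMA = 2.0, 0.5  # hypothetical target: f(x) = exp(-(x - MU)^2 / (2 SIGMA^2))


def log_q0(x):
    """Normalized log density of the tractable start, a standard normal."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_f(x):
    """Unnormalized log density of the target."""
    return -0.5 * ((x - MU) / SIGMA) ** 2


def log_f_beta(x, beta):
    """Unnormalized log density of the intermediate distribution at beta."""
    return (1.0 - beta) * log_q0(x) + beta * log_f(x)


def metropolis_step(x, beta, rng, step=0.5):
    """One random-walk Metropolis move that leaves the beta-distribution invariant."""
    proposal = x + rng.gauss(0.0, step)
    log_accept = log_f_beta(proposal, beta) - log_f_beta(x, beta)
    if math.log(1.0 - rng.random()) < log_accept:
        return proposal
    return x


def ais_log_weight(betas, rng):
    """Run one annealing chain and return its log importance weight."""
    x = rng.gauss(0.0, 1.0)  # exact sample from q0
    log_w = 0.0
    for b_prev, b_next in zip(betas, betas[1:]):
        log_w += log_f_beta(x, b_next) - log_f_beta(x, b_prev)
        x = metropolis_step(x, b_next, rng)  # move towards the next distribution
    return log_w


def ais_estimate(n_steps=200, n_chains=1000, seed=0):
    """Average the chain weights; q0 is normalized, so the mean estimates Z."""
    rng = random.Random(seed)
    betas = [k / n_steps for k in range(n_steps + 1)]
    log_ws = [ais_log_weight(betas, rng) for _ in range(n_chains)]
    top = max(log_ws)  # log-mean-exp for numerical stability
    return math.exp(top + math.log(sum(math.exp(lw - top) for lw in log_ws) / n_chains))


Z_hat = ais_estimate()
Z_exact = SIGMA * math.sqrt(2.0 * math.pi)
```

The chains never need exact samples from the intermediate distributions: a single Metropolis move per step is enough for the weights to remain valid, at the cost of higher variance when the chain lags behind the schedule.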
Choice of $q_i$
Common choice: $q_i \propto p_0^{1-\beta_i}\, p^{\beta_i}$ with $0 = \beta_0 < \beta_1 < \cdots < \beta_n = 1$.
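In log space the geometric path is just a weighted sum of the two endpoint log densities. The sketch below builds a linearly spaced schedule and the corresponding intermediate log densities. The two endpoint functions are hypothetical stand-ins for $\log p_0$ and $\log p$.

```python
import math


def linear_betas(n):
    """0 = beta_0 < beta_1 < ... < beta_n = 1, evenly spaced."""
    return [k / n for k in range(n + 1)]


def geometric_log_density(log_p0, log_p, beta):
    """Return x -> log of the unnormalized q_beta = p0^(1 - beta) * p^beta."""
    def log_q(x):
        return (1.0 - beta) * log_p0(x) + beta * log_p(x)
    return log_q


def log_p0(x):
    """Hypothetical simple endpoint (unnormalized standard normal)."""
    return -0.5 * x * x


def log_p(x):
    """Hypothetical target endpoint (unnormalized, narrower and shifted)."""
    return -0.5 * ((x - 2.0) / 0.5) ** 2


betas = linear_betas(5)
path = [geometric_log_density(log_p0, log_p, b) for b in betas]
```

Working with log densities avoids underflow when $p_0$ and $p$ are tiny, which is typical for high-dimensional models.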
RBM example
Samples
RBM
We will try it on a binary RBM with 500 hidden units trained on MNIST
with PCD.
Samples from RBM: [figure not reproduced]
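For a binary RBM the hidden units can be summed out analytically, which gives an unnormalized marginal $f(v)$ over the visible units. The sketch below checks this on a tiny RBM with made-up weights (4 visible, 3 hidden units), small enough to enumerate. For the 500-hidden-unit model on MNIST the same enumeration is impossible.

```python
import itertools
import math

# Hypothetical tiny binary RBM (illustration only). W[i][j] couples visible i and hidden j.
W = [[0.5, -0.3, 0.2],
     [-0.4, 0.1, 0.6],
     [0.3, 0.7, -0.2],
     [-0.1, 0.2, 0.4]]
B_VIS = [0.1, -0.2, 0.0, 0.3]
C_HID = [-0.1, 0.2, 0.05]


def softplus(a):
    """log(1 + exp(a)), the log of the sum over one binary hidden unit."""
    return math.log1p(math.exp(a))


def log_f_visible(v):
    """log f(v) = b.v + sum_j softplus(c_j + sum_i v_i W_ij)."""
    total = sum(b * vi for b, vi in zip(B_VIS, v))
    for j, c in enumerate(C_HID):
        total += softplus(c + sum(v[i] * W[i][j] for i in range(len(v))))
    return total


def neg_energy(v, h):
    """-E(v, h) = b.v + c.h + v^T W h."""
    total = sum(b * vi for b, vi in zip(B_VIS, v))
    total += sum(c * hj for c, hj in zip(C_HID, h))
    total += sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return total


visibles = list(itertools.product([0, 1], repeat=len(B_VIS)))
hiddens = list(itertools.product([0, 1], repeat=len(C_HID)))

# Z computed two ways: over visible states only, and over all joint states.
log_Z_marginal = math.log(sum(math.exp(log_f_visible(v)) for v in visibles))
log_Z_joint = math.log(sum(math.exp(neg_energy(v, h)) for v in visibles for h in hiddens))
```

Both sums agree, but even the cheaper one has $2^{784}$ terms for MNIST-sized inputs, so an estimator such as AIS is needed.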
Chain of samples when running AIS
AIS run with 5000 intermediate distributions, $\beta_i$ linearly spaced between 0 and 1.
[Figure panels: chain samples at n = 0, 1000, 2000, 3000, 4000, 5000, and after another 1000 Gibbs samples]
Idea: replace hidden units by their average activations.
[Figure panels: samples at n = 0 (20 units), n = 1000 (100 units), n = 2000 (200 units), n = 3000 (300 units), n = 4000 (400 units), n = 5000 (500 units)]
Numbers
With 5000 intermediate distributions, the geometric averages path underestimates $\log Z$ by about 3 nats, while the path with a varying number of hidden units underestimates it by only 1 nat.
With 500 intermediate distributions, the geometric averages path underestimates $\log Z$ by about 5 nats, while the path with a varying number of hidden units underestimates it by 8 nats.
Not clear which is better.
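To read the nat figures as multiplicative errors: a gap of k nats in $\log Z$ means the estimate of Z is off by a factor of $e^{-k}$. The helper below is only a unit conversion with an assumed name. One standard explanation for the downward direction, not stated on the slide, is that AIS weights are unbiased for Z, so by Jensen's inequality the log of the estimate tends to fall below $\log Z$.

```python
import math


def ratio_from_nat_gap(gap_nats):
    """If log Z_hat = log Z - gap_nats, then Z_hat / Z = exp(-gap_nats)."""
    return math.exp(-gap_nats)


# Gaps quoted on the slide, converted to the fraction of Z that is recovered.
recovered = {gap: ratio_from_nat_gap(gap) for gap in (1, 3, 5, 8)}
```

Under this reading, a 1 nat gap recovers roughly a third of Z, while an 8 nat gap recovers well under a thousandth of it.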
Conclusion
It would be nice to have a way to go from a large RBM to a smaller “fuzzier” one without going through meaningless distributions in between.
See also “Annealing Between Distributions by Averaging Moments” by R. Grosse et al.
Questions?