Probabilistic Graphical Models
Chapter 12:
Particle-Based Approximate Inference
Uri Meir
Dafna Sadeh
Particle-Based Approximate Inference
Methods that approximate the joint distribution as a set of instantiations to all or some of the variables in the network.
These instantiations, often called particles, are designed to provide a good representation of the overall probability distribution.
The general framework for most of this lecture is:
Consider some distribution P(𝒳), and assume we want to estimate the probability of some event Y = y relative to P, for some Y ⊆ 𝒳 and y ∈ Val(Y). More generally, we might want to estimate the expectation of some function relative to P.
We approximate this expectation by generating a set of M particles, estimating the value of the function
relative to each of the particles, and then aggregating the results.
Particle-Based Approximate Inference
For example:
Let x[1], …, x[M] be values of a binary variable X, sampled IID from P.
If P(x = 1) = p, the estimator for p is:
p̂ = (1/M) · Σ_{m=1}^{M} x[m]
More generally, for any distribution P and function f:
Ê_𝒟(f) = (1/M) · Σ_{m=1}^{M} f(x[m])
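A minimal sketch of these estimators in Python (the Bernoulli parameter and the function f are illustrative, not from the lecture):

```python
import random

# M particles x[1], ..., x[M], sampled IID from a Bernoulli P with (unknown) P(x = 1) = 0.3
samples = [1 if random.random() < 0.3 else 0 for _ in range(10_000)]

# Estimator for p = P(x = 1): the fraction of particles with value 1
p_hat = sum(samples) / len(samples)

# More generally, estimate E_P[f] by averaging f over the particles
f = lambda x: x ** 2
f_hat = sum(f(x) for x in samples) / len(samples)

print(p_hat, f_hat)
```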
Forward Sampling
Input: ℬ, a Bayesian network over 𝒳.
Output: ξ = (x₁, …, xₙ), one full sample (an assignment to 𝒳) generated from ℬ.
Each variable Xᵢ is sampled in topological order from P(Xᵢ | Pa(Xᵢ)), using the already-sampled values of its parents.
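A minimal forward-sampling sketch for a small student-style network D → G ← I (the CPT values below are illustrative placeholders, not the ones from the lecture's figures):

```python
import random

def sample_discrete(dist):
    """Sample a value from a dict {value: probability}."""
    r, acc = random.random(), 0.0
    for value, prob in dist.items():
        acc += prob
        if r < acc:
            return value
    return value  # guard against floating-point rounding

# Illustrative CPTs (placeholder numbers)
P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {  # P(G | I, D), indexed by the parents' values
    ("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
    ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
    ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
    ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20},
}

def forward_sample():
    # Sample each variable in topological order, conditioning on the already-sampled parents
    d = sample_discrete(P_D)
    i = sample_discrete(P_I)
    g = sample_discrete(P_G[(i, d)])
    return {"D": d, "I": i, "G": g}

print(forward_sample())
```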
Forward Sampling - Example
Forward Sampling - Example
i = 1:
Sampling D from P(D).
Assume D = d¹.
Forward Sampling - Example
D = d¹
i = 2:
Sampling I from P(I).
Assume I = i⁰.
Forward Sampling - Example
D = d¹, I = i⁰
i = 3:
Sampling G from P(G | i⁰, d¹).
And continue…
Forward Sampling – The estimates
𝒟 = {x[1], …, x[M]} – a set of particles (samples)
f – a function over 𝒳
Ê_𝒟(f) = (1/M) · Σ_{m=1}^{M} f(x[m]) – an estimate for E_P[f]
P̂_𝒟(y) = (1/M) · Σ_{m=1}^{M} 1{x[m]⟨Y⟩ = y} – an estimate for P(y)
Forward Sampling – Complexity
M – the total number of particles generated
n = |𝒳|
The overall cost is O(M · n) variable-sampling steps: each particle requires sampling each of the n variables once, in topological order.
Forward Sampling – Absolute Error
From the ๐ป๐‘œ๐‘’๐‘“๐‘“๐‘‘๐‘–๐‘›๐‘” ๐‘๐‘œ๐‘ข๐‘›๐‘‘:
Thus, to achieve an estimate whose error is bounded by ε, with probability at
least 1 − ๐›ฟ, we required:
Equivalently:
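For example (a rough calculation under the bound above): with ε = 0.01 and δ = 0.05, we need M ≥ ln(40) / (2 · 0.0001) ≈ 18,445 samples.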
Forward Sampling – Relative Error
From the ๐ถโ„Ž๐‘’๐‘Ÿ๐‘›๐‘œ๐‘“๐‘“ ๐‘๐‘œ๐‘ข๐‘›๐‘‘:
Thus, to achieve an estimate whose error is bounded by ε, with probability at
least 1 − ๐›ฟ, we required:
Forward Sampling – Relative Error
From the ๐ถโ„Ž๐‘’๐‘Ÿ๐‘›๐‘œ๐‘“๐‘“ ๐‘๐‘œ๐‘ข๐‘›๐‘‘:
Thus, to achieve an estimate whose error is bounded by ε, with probability at
least 1 − ๐›ฟ, we required:
If ๐‘ƒ(๐‘ฆ) is very small, it’s likely that
we will not generated any samples
where this event holds. Our
estimate of 0 is not going to be
within any relative error
Forward Sampling – Relative Error
From the ๐ถโ„Ž๐‘’๐‘Ÿ๐‘›๐‘œ๐‘“๐‘“ ๐‘๐‘œ๐‘ข๐‘›๐‘‘:
Thus, to achieve an estimate whose error is bounded by ε, with probability at
least 1 − ๐›ฟ, we required:
We do not know ๐‘ƒ ๐‘ฆ …
Conditional Probability Queries
We are interested in conditional probabilities of the form P(y | E = e).
Unfortunately, it turns out that this estimation task is significantly harder.
Rejection Sampling:
Generate samples x from P(X) with forward sampling, and reject any sample that is not compatible with e. The resulting samples are distributed according to P(X | e).
The problem is that the expected number of particles that are not rejected, from an original sample set of size M, is M · P(e).
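A rejection-sampling sketch on top of the forward_sample sketch above (the evidence and query below are illustrative):

```python
def rejection_sample(evidence, num_samples):
    """Forward-sample repeatedly, keeping only particles consistent with the evidence e."""
    accepted = []
    while len(accepted) < num_samples:
        particle = forward_sample()
        if all(particle[var] == val for var, val in evidence.items()):
            accepted.append(particle)
    return accepted

# Estimate P(I = i1 | G = g1) from the accepted particles
kept = rejection_sample({"G": "g1"}, 1000)
print(sum(p["I"] == "i1" for p in kept) / len(kept))
```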
Conditional Probability Queries
Estimate P(y, e) and P(e) separately, and compute the ratio.
If P̂(e) ∈ [(1 − ε)·P(e), (1 + ε)·P(e)] and P̂(y, e) ∈ [(1 − ε)·P(y, e), (1 + ε)·P(y, e)], then:
(1 − 2ε/(1 + ε)) · P(y, e)/P(e) ≤ P̂(y, e)/P̂(e) ≤ (1 + 2ε/(1 − ε)) · P(y, e)/P(e)
But the number of samples required to get a low relative error also grows linearly with 1/P(e).
This is not a problem if we only need an absolute error for P(e), but an absolute error on P(e) does not suffice to obtain any kind of bound on P(y, e)/P(e).
Likelihood weighting
The rejection sampling process seems very wasteful in the way it handles
evidence.
It seems much more sensible to simply force the samples to take on the
appropriate values at observed nodes.
This simple approach can generate incorrect results…
Likelihood weighting
Assume our evidence is S = s¹. Using the naive process, the expected fraction of samples that have I = i¹ is 30 percent.
Thus, this approach fails to conclude that the posterior probability of i¹ is higher when we observe s¹ (0.41).
We should conclude that a sample where we have I = i¹ and force S = s¹ should be worth 80 percent of a sample, whereas one where we have I = i⁰ and force S = s¹ should be worth 5 percent of a sample.
Likelihood weighting
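A minimal sketch of generating one likelihood-weighted particle, reusing the toy network above and adding an illustrative CPT for S (the 0.80 / 0.05 entries match the example values on the previous slide):

```python
# Illustrative CPT for S given I
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.20, "s1": 0.80}}

def likelihood_weighted_sample(evidence):
    """Sample non-evidence variables forward; force evidence variables and fold their CPT entry into w."""
    w = 1.0
    d = sample_discrete(P_D)
    i = sample_discrete(P_I)
    if "G" in evidence:
        g = evidence["G"]
        w *= P_G[(i, d)][g]          # weight by P(G = g | sampled parents)
    else:
        g = sample_discrete(P_G[(i, d)])
    if "S" in evidence:
        s = evidence["S"]
        w *= P_S[i][s]               # weight by P(S = s | sampled I)
    else:
        s = sample_discrete(P_S[i])
    return {"D": d, "I": i, "G": g, "S": s}, w

print(likelihood_weighted_sample({"S": "s1"}))
```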
Likelihood Weighting - Example
Z = {L = l⁰, S = s¹}
w = 1
Sample D = d¹.
Likelihood Weighting - Example
Z = {L = l⁰, S = s¹}
w = 1
D = d¹
Sample I = i⁰.
Likelihood Weighting - Example
Z = {L = l⁰, S = s¹}
w = 1, D = d¹, I = i⁰
Set S = s¹
w ← w · 0.05
Likelihood Weighting - Example
Z = {L = l⁰, S = s¹}
w = 0.05
D = d¹, I = i⁰, S = s¹
Sample G = g².
Likelihood Weighting - Example
Z = {L = l⁰, S = s¹}
w = 0.05
D = d¹, I = i⁰, S = s¹, G = g²
Set L = l⁰
w ← w · 0.4
Likelihood Weighting - Example
Z = {L = l⁰, S = s¹}
w = 0.02
ξ = ⟨D = d¹, I = i⁰, S = s¹, G = g², L = l⁰⟩
Likelihood Weighting – The estimates
𝒟 = {(ξ[1], w[1]), …, (ξ[M], w[M])} – a set of weighted particles (samples)
P̂(y | e) = Σ_m w[m] · 1{ξ[m]⟨Y⟩ = y} / Σ_m w[m] – an estimate for P(y | e)
The same set of particles can be used to estimate the probability of any event y.
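Computing the weighted estimate from (particle, weight) pairs, reusing the likelihood_weighted_sample sketch above (the queried event is illustrative):

```python
def lw_estimate(weighted_particles, event):
    """P_hat(y | e): total weight of particles satisfying the event, divided by the total weight."""
    total = sum(w for _, w in weighted_particles)
    hit = sum(w for x, w in weighted_particles
              if all(x[var] == val for var, val in event.items()))
    return hit / total

data = [likelihood_weighted_sample({"S": "s1"}) for _ in range(5000)]
print(lw_estimate(data, {"I": "i1"}))   # estimate of P(i1 | s1)
```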
Importance sampling
๐ผ๐‘š๐‘๐‘œ๐‘Ÿ๐‘ก๐‘Ž๐‘›๐‘๐‘’ ๐‘ ๐‘Ž๐‘š๐‘๐‘™๐‘–๐‘›๐‘” is a general approach for estimating the expectation of a function
๐‘“ ๐‘ฅ relative to some distribution ๐ญ๐š๐ซ๐ ๐ž๐ญ ๐๐ข๐ฌ๐ญ๐ซ๐ข๐›๐ฎ๐ญ๐ข๐จ๐ง ๐‘ท ๐‘ฟ . As we seen, we can estimate
this expectation by generating samples ๐‘ฅ 1 , … ๐‘ฅ[๐‘€] from ๐‘ƒ, and then estimating:
Sometimes, it might be impossible or computationally very expensive to generate samples
from ๐‘ƒ. For example, ๐‘ƒ might be a posterior distribution for a Bayesian network or a prior
distribution for a Markov network.
Thus, we might prefer to generate samples from a different distribution, the
๐’‘๐’“๐’๐’‘๐’๐’”๐’‚๐’ ๐’…๐’Š๐’”๐’•๐’“๐’Š๐’ƒ๐’–๐’•๐’Š๐’๐’ ๐‘ธ.
Unnormalized Importance sampling
If we generate samples from ๐‘„ instead of ๐‘ƒ, we need to adjust our estimator to compensate
for the incorrect sampling distribution.
We define the ๐‘ข๐‘›๐‘›๐‘œ๐‘Ÿ๐‘š๐‘Ž๐‘™๐‘–๐‘ง๐‘’๐‘‘ ๐‘–๐‘š๐‘๐‘œ๐‘Ÿ๐‘ก๐‘Ž๐‘›๐‘๐‘’ ๐‘ ๐‘Ž๐‘š๐‘๐‘™๐‘–๐‘›๐‘” estimator:
When the set of samples ๐’Ÿ = {๐‘ฅ 1 , … ๐‘ฅ ๐‘€ } generate from ๐‘„.
The new estimator is based on the observation that:
๐ธ๐‘„
๐‘ƒ ๐‘‹
๐‘“(๐‘‹)
๐‘„ ๐‘‹
๐‘‹
๐‘ƒ ๐‘ฅ[๐‘š]
๐‘ฅ[๐‘š]
The factor ๐‘„
We define:
=
๐‘ฅ
๐‘ƒ ๐‘ฅ
๐‘„ ๐‘ฅ ๐‘“ ๐‘ฅ
=
๐‘„ ๐‘ฅ
๐‘“ ๐‘ฅ ๐‘ƒ(๐‘ฅ) = ๐ธ๐‘ƒ
๐‘‹
๐‘ฅ
can be viewed as a correction weight to the term ๐‘“(๐‘ฅ[๐‘š]).
๐‘ƒ ๐‘ฅ
๐‘ฅ
๐‘ค ๐‘ฅ =๐‘„
.
๐‘“ ๐‘‹
Unnormalized Importance sampling
Our analysis immediately implies that the estimator is unbiased; that is, its mean for any data set is precisely the desired value:
E_𝒟[ Ê_𝒟(f) ] = E_Q[ f(X) · P(X)/Q(X) ] = E_P[ f(X) ]
Unnormalized Importance sampling
From the Central Limit Theorem, we have that as M → ∞:
Ê_𝒟(f) ~ 𝒩( E_P[f], σ_Q²/M )
where σ_Q² = Var_Q[ f(X) · P(X)/Q(X) ].
The variance decreases linearly with the number of samples.
Normalized Importance sampling
One problem with the unnormalized importance sampling estimator is that it assumes that P is known. A frequent situation is that P is known only up to a normalizing constant Z. Specifically, what we have access to is a function P̃, such that
P̃(X) = Z · P(X).
For example, in a Bayesian network ℬ, we might have P(X) = P_ℬ(X | e), P̃(X) = P_ℬ(X, e), and Z = P_ℬ(e).
In this context, we define:
w(X) = P̃(X) / Q(X).
Normalized Importance sampling
We define the ๐‘›๐‘œ๐‘Ÿ๐‘š๐‘Ž๐‘™๐‘–๐‘ง๐‘’๐‘‘ ๐‘–๐‘š๐‘๐‘œ๐‘Ÿ๐‘ก๐‘Ž๐‘›๐‘๐‘’ ๐‘ ๐‘Ž๐‘š๐‘๐‘™๐‘–๐‘›๐‘” ๐‘’๐‘ ๐‘ก๐‘–๐‘š๐‘Ž๐‘ก๐‘œ๐‘Ÿ:
The estimator is based on the observation that:
And:
Normalized Importance sampling
The normalized estimator involves a quotient, and it is therefore much more difficult to analyze theoretically.
Unlike the unnormalized estimator, the normalized estimator is not unbiased. This is immediate in the case M = 1: here, the estimator reduces to f(x[1]) (the single weight cancels out), and its mean is therefore E_Q[f(X)] rather than E_P[f(X)].
Conversely, when M goes to infinity, the numerator and denominator converge to their expected values.
In general, the bias goes down as 1/M.
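A normalized importance-sampling sketch, where only an unnormalized P̃ is available (illustrative numbers; here P̃ = Z·P with Z = 2, but the estimator never uses Z):

```python
import random

P_tilde = {0: 0.2, 1: 1.2, 2: 0.6}     # unnormalized target
Q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}     # proposal distribution
f = lambda x: x

def sample_from(dist):
    r, acc = random.random(), 0.0
    for value, prob in dist.items():
        acc += prob
        if r < acc:
            return value
    return value

M = 100_000
xs = [sample_from(Q) for _ in range(M)]
weights = [P_tilde[x] / Q[x] for x in xs]

# Normalized estimator: the unknown constant Z cancels between numerator and denominator
estimate = sum(w * f(x) for x, w in zip(xs, weights)) / sum(weights)
print(estimate)   # exact value is 0.1*0 + 0.6*1 + 0.3*2 = 1.2
```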
Normalized Importance sampling
One can show an approximation for the variance of the importance sampling estimator with M data instances.
This can be used to provide an estimate of the quality of a set of samples generated using normalized importance sampling.
Assume that we were to estimate E_P[f] using the standard sampling method, where we generate M IID samples from P(X); this approach would result in a variance of Var_P[f(X)]/M. The ratio between these two variances defines an effective sample size M_eff.
Thus, we would expect the M weighted samples generated by importance sampling to be “equivalent” to M_eff IID samples from P.
The Mutilated Network Proposal Distribution
Assume that we are interested in a particular event 𝒵 = z (say, G = g²), either because we wish to estimate its probability, or because we have observed it as evidence. We wish to focus our sampling process on the parts of the joint that are consistent with this event.
It is easy to take this event into consideration when sampling L, but it is more difficult to account for G's influence on D, I and S.
We define a simple proposal distribution that “sets” the value of each Z ∈ 𝒵 to the prespecified value.
The Mutilated Network Proposal Distribution
Importance sampling with the proposal distribution Q induced by the mutilated network ℬ_{𝒵=z}, and target P̃(𝒳) = P_ℬ(𝒳, z), is precisely equivalent to the Likelihood Weighting algorithm with 𝒵 = z:
Proposition:
Let ξ be a sample generated by the LW(ℬ, 𝒵 = z) algorithm and w be its weight. Then the distribution over ξ is as defined by the network ℬ_{𝒵=z}, and
w(ξ) = P̃_ℬ(ξ) / P_{ℬ_{𝒵=z}}(ξ) = P̃(ξ) / Q(ξ).
The Mutilated Network Proposal Distribution
๐‘ƒ๐‘Ÿ๐‘œ๐‘๐‘œ๐‘ ๐‘–๐‘ก๐‘–๐‘œ๐‘›:
๐ฟ๐‘’๐‘ก ๐œ‰ ๐‘๐‘’ ๐‘Ž ๐‘ ๐‘Ž๐‘š๐‘๐‘™๐‘’ ๐‘”๐‘’๐‘›๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘’๐‘‘ ๐‘๐‘ฆ ๐ฟ. ๐‘Š โ„ฌ, ๐’ต = ๐‘ง ๐‘Ž๐‘™๐‘”๐‘œ๐‘Ÿ๐‘–๐‘กโ„Ž๐‘š ๐‘Ž๐‘›๐‘‘ ๐‘ค ๐‘๐‘’ ๐‘–๐‘ก๐‘  ๐‘ค๐‘’๐‘–๐‘”โ„Ž๐‘ก. ๐‘‡โ„Ž๐‘’๐‘› ๐‘กโ„Ž๐‘’ ๐‘‘๐‘–๐‘ ๐‘ก๐‘Ÿ๐‘–๐‘๐‘ข๐‘ก๐‘–๐‘œ๐‘›
๐‘ƒโ„ฌ ๐œ‰
โ„ฌ๐‘=๐‘ง (๐œ‰)
๐‘œ๐‘ฃ๐‘’๐‘Ÿ ๐œ‰ ๐‘–๐‘  ๐‘Ž๐‘  ๐‘‘๐‘’๐‘“๐‘–๐‘›๐‘’๐‘‘ ๐‘๐‘ฆ ๐‘กโ„Ž๐‘’ ๐‘›๐‘’๐‘ก๐‘ค๐‘œ๐‘Ÿ๐‘˜ โ„ฌ๐’ต=๐‘ง , ๐‘Ž๐‘›๐‘‘, ๐‘ค ๐œ‰ = ๐‘ƒ
Proof:
Let ๐œ‰′ be some assignment to X.
0
, ๐œ‰′ ๐’ต ≠ ๐‘ง
๐‘ƒ ๐œ‰ = ๐œ‰′ =
๐‘’๐‘™๐‘ ๐‘’
๐‘ฅ∉๐’ต ๐‘ƒโ„ฌ ๐‘ฅ = ๐œ‰′ ๐‘ฅ |๐œ‰′ ๐‘ƒ๐‘Ž๐‘‹ ,
Let ๐œ‰’’ be a sample generated by Forward Sampling with โ„ฌ๐’ต=๐‘ง , , then:
0
, ๐œ‰′ ๐’ต ≠ ๐‘ง
๐‘ƒ ๐œ‰′′ = ๐œ‰ ′ = ๐‘ƒ โ„ฌ๐’ต=๐‘ง ๐œ‰ ′
๐‘ƒโ„ฌ ๐‘ฅ = ๐œ‰′ ๐‘ฅ |๐œ‰′ ๐‘ƒ๐‘Ž๐‘‹ ,
๐‘ฅ∉๐’ต
๐‘’๐‘™๐‘ ๐‘’
๐‘ƒ(๐œ‰)
= ๐‘„(๐œ‰).
The Mutilated Network Proposal Distribution
๐‘ƒ๐‘Ÿ๐‘œ๐‘๐‘œ๐‘ ๐‘–๐‘ก๐‘–๐‘œ๐‘›:
๐ฟ๐‘’๐‘ก ๐œ‰ ๐‘๐‘’ ๐‘Ž ๐‘ ๐‘Ž๐‘š๐‘๐‘™๐‘’ ๐‘”๐‘’๐‘›๐‘’๐‘Ÿ๐‘Ž๐‘ก๐‘’๐‘‘ ๐‘๐‘ฆ ๐ฟ. ๐‘Š โ„ฌ, ๐’ต = ๐‘ง ๐‘Ž๐‘™๐‘”๐‘œ๐‘Ÿ๐‘–๐‘กโ„Ž๐‘š ๐‘Ž๐‘›๐‘‘ ๐‘ค ๐‘๐‘’ ๐‘–๐‘ก๐‘  ๐‘ค๐‘’๐‘–๐‘”โ„Ž๐‘ก. ๐‘‡โ„Ž๐‘’๐‘› ๐‘กโ„Ž๐‘’ ๐‘‘๐‘–๐‘ ๐‘ก๐‘Ÿ๐‘–๐‘๐‘ข๐‘ก๐‘–๐‘œ๐‘›
๐‘ƒโ„ฌ ๐œ‰
โ„ฌ๐‘=๐‘ง (๐œ‰)
๐‘œ๐‘ฃ๐‘’๐‘Ÿ ๐œ‰ ๐‘–๐‘  ๐‘Ž๐‘  ๐‘‘๐‘’๐‘“๐‘–๐‘›๐‘’๐‘‘ ๐‘๐‘ฆ ๐‘กโ„Ž๐‘’ ๐‘›๐‘’๐‘ก๐‘ค๐‘œ๐‘Ÿ๐‘˜ โ„ฌ๐’ต=๐‘ง , ๐‘Ž๐‘›๐‘‘, ๐‘ค ๐œ‰ = ๐‘ƒ
๐‘ƒ(๐œ‰)
= ๐‘„(๐œ‰).
Proof:
๐‘ค ๐œ‰ =
๐‘ƒ ๐œ‰ ๐‘ฅ ๐œ‰ ๐‘ƒ๐‘Ž๐‘ฅ
๐‘ฅ∈๐‘
=
๐‘ฅ๐‘ƒ
๐œ‰ ๐‘ฅ ๐œ‰ ๐‘ƒ๐‘Ž๐‘ฅ
๐‘ฅ∉๐‘ ๐‘ƒ ๐œ‰ ๐‘ฅ ๐œ‰ ๐‘ƒ๐‘Ž๐‘ฅ
=
๐‘ƒโ„ฌ ๐œ‰
๐‘ƒโ„ฌ๐‘=๐‘ง (๐œ‰)
Markov Chain Monte Carlo Methods
We now present an alternative sampling approach that generates sequences of samples.
This sequence is constructed so that, although the first sample may be generated from the prior, successive samples are generated from distributions that provably get closer and closer to the desired posterior. We define P_Φ(X) = P(X | e).
Unlike forward sampling methods (including likelihood weighting), Markov chain methods apply equally well to directed and to undirected models. Indeed, the algorithm is easier to present in the context of a distribution P_Φ defined in terms of a general set of factors Φ.
General Outline
We will:
• See the algorithm for Gibbs sampling (+ example).
• Define and explain what Markov chains are.
• Connect the two, and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains, and the Metropolis-Hastings algorithm.
• Touch on the definition of mixing time, and methods to check whether our chain has mixed.
Gibbs Sampling
Input:
X – set of variables
Φ – set of factors
P⁽⁰⁾(X) – initial distribution
T – number of steps (and full samples)
Output:
X⁽⁰⁾, …, X⁽ᵀ⁾ – a set of samples (each one is a full assignment, containing values for all Xᵢ ∈ X!)
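A Gibbs-sampling sketch over a generic factor product (the factors below are a toy P_Φ over three binary variables, not the lecture's student network):

```python
import random

# Illustrative factors defining an unnormalized P_Phi over binary variables A, B, C
factors = [
    {"vars": ("A", "B"), "table": {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}},
    {"vars": ("B", "C"), "table": {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.0}},
]
domain = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}

def unnorm_prob(assignment):
    """Product of all factors evaluated at a full assignment (proportional to P_Phi)."""
    p = 1.0
    for phi in factors:
        p *= phi["table"][tuple(assignment[v] for v in phi["vars"])]
    return p

def gibbs(T, init):
    x = dict(init)
    samples = [dict(x)]
    for _ in range(T):
        for var in domain:                       # resample each variable in turn
            weights = []
            for val in domain[var]:
                x[var] = val
                weights.append(unnorm_prob(x))   # proportional to P_Phi(var = val | all other variables)
            r, acc = random.random() * sum(weights), 0.0
            for val, w in zip(domain[var], weights):
                acc += w
                if r < acc:
                    x[var] = val
                    break
        samples.append(dict(x))                  # one full sample X^(t)
    return samples

print(gibbs(5, {"A": 0, "B": 0, "C": 0})[-1])
```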
Gibbs Sampling – Example (with s⁰ and l¹)
We look for samples of D, I, G, given s⁰ and l¹.
First, we sample once (say, by forward sampling). Let us assume we got:
i⁽⁰⁾ = i⁰, d⁽⁰⁾ = d¹, g⁽⁰⁾ = g²
Now, we start creating new samples over G, I, D, in some order over them.
Gibbs Sampling – Example (with s⁰ and l¹)
[s = ๐‘  0 , l = ๐‘™1 , ๐‘– (0) = ๐‘– 0 , ๐‘‘ (0) = ๐‘‘1 , ๐‘”(0) = ๐‘”2 ]
Sample ๐‘”(1) : Sampling ๐บ from ๐‘ƒ∅ ๐บ ๐‘‘1 , ๐‘– 0 .
Gibbs Sampling – Example (with s⁰ and l¹)
[s = ๐‘  0 , l = ๐‘™1 , ๐‘”(1) = ๐‘”3 , ๐‘– (0) = ๐‘– 0 , ๐‘‘(0) = ๐‘‘1 ]
Sample ๐‘– (1) : Sampling ๐ผ from ๐‘ƒ∅ ๐ผ ๐‘‘1 , ๐‘”3 :
๐‘ƒ∅ ๐ผ ๐‘‘1 , ๐‘”3
๐‘ƒ ๐ผ ๐‘ƒ ๐‘  0 ๐ผ ๐‘ƒ(๐‘”3 |๐ผ, ๐‘‘1 )
=
0
๐‘ƒ
๐‘–
๐‘ƒ
๐‘ 
๐‘– ๐‘ƒ(๐‘”3 |๐‘–, ๐‘‘1 )
๐‘–
assume we get ๐‘– 1 .
Gibbs Sampling – Example (with s⁰ and l¹)
[s = ๐‘  0 , l = ๐‘™1 , ๐‘”(1) = ๐‘”3 , ๐‘– (1) = ๐‘–1 , ๐‘‘ (0) = ๐‘‘1 ]
Sample ๐‘‘(1) : Sampling ๐ผ from ๐‘ƒ∅ ๐ท ๐‘”3 , ๐‘–1 :
๐‘ƒ∅ ๐ท ๐‘”3 , ๐‘–1 =
๐‘ƒ ๐ท ๐‘ƒ(๐‘”3 |๐‘–1 , ๐ท)
3 |๐‘– 1 , ๐‘‘)
๐‘ƒ
๐‘‘
๐‘ƒ(๐‘”
๐‘‘
And we end up getting a new sample:
[s = ๐‘  0 , l = ๐‘™1 , ๐‘”(1) = ๐‘”3 , ๐‘– (1) = ๐‘–1 , ๐‘‘(1) = ๐‘‘1 ]
General Outline
We will:
• See the algorithm for Gibbs sampling (+ example).
• Define and explain what Markov chains are.
• Connect the two, and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains, and the Metropolis-Hastings algorithm.
• Touch on the definition of mixing time, and methods to check whether our chain has mixed.
Markov Chains
The formal definition: a Markov chain is defined by a state space Val(X) and a transition model τ(x → x′) that specifies, for every state x, the probability of moving next to each state x′.
We also require that for every x: Σ_{x′} τ(x → x′) = 1.
Simply put:
A Markov chain is made of:
• A set of states (in our case, each state will represent an instantiation of our probability space).
• A transition model that holds, for each state, the distribution over “which state we can visit next”.
*Note: the transition model could in principle depend on the number of steps we have already taken (i.e., on the time our chain has been running).
We only speak about homogeneous Markov chains, where the transition model does not change over time.
Example: drunken grasshopper
• We define our states to be the integers from (−4) to (4).
Our drunken grasshopper defines a transition model with a 25% chance of hopping one spot in each direction, and a 50% chance of staying in the same place.
• Formally, for i between −3 and 3 we define: τ(i → i − 1) = τ(i → i + 1) = 0.25 and τ(i → i) = 0.5.
• And for the edges, where we cannot go farther, we increase the probability of the self-loop: τ(−4 → −4) = τ(4 → 4) = 0.75.
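A quick simulation of this chain, showing the empirical distribution of the location flattening out over time (starting state 0 and the run lengths are illustrative):

```python
import random
from collections import Counter

def step(state):
    """One grasshopper move on the states -4..4."""
    r = random.random()
    if r < 0.25:
        return state - 1 if state > -4 else state   # at the left edge this 0.25 goes to the self-loop
    if r < 0.50:
        return state + 1 if state < 4 else state    # likewise at the right edge
    return state

def distribution_at(T, runs=20_000):
    """Empirical distribution of the location after T steps, starting from 0."""
    counts = Counter()
    for _ in range(runs):
        s = 0
        for _ in range(T):
            s = step(s)
        counts[s] += 1
    return {state: counts[state] / runs for state in sorted(counts)}

print(distribution_at(2))
print(distribution_at(50))   # roughly uniform: ~0.111 per state
```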
Markov chain as samples of a distribution
• Namely, we look at each step as a distribution over “where can we be at that step”.
• Each such distribution is defined by the previous one: we sum, over all states, the probability of being at that state at the previous step, multiplied by the chance of transitioning from it to the current state:
P⁽ᵗ⁺¹⁾(X⁽ᵗ⁺¹⁾ = x′) = Σ_x P⁽ᵗ⁾(X⁽ᵗ⁾ = x) · τ(x → x′)
• Each step now represents a distribution over the states of our chain; thus, a distribution over the probability space X.
Asymptotic Behavior
• For our purposes, the most important aspect of a Markov chain is its long-term behavior.
• Drunken grasshopper revisited:
– We observe that the location at time T is a random variable, and there is a hunch about its asymptotic behavior:
• For the first two steps, the distribution is still concentrated around the starting state 0.
• For T = 10, we already have probabilities of ~0.05 for +4/−4, and only ~0.17 left for the value 0.
• At T = 50, all 9 states have probabilities between 0.1107 and 0.1116. UNIFORM!
Markov chain Monte Carlo (MCMC) Sampling
Input:
๐‘ƒ(0) (X) – initial distribution
τ − The transition model
T – number of steps (and full samples)
Output:
๐‘‹ (0) , . . . , ๐‘‹ (๐‘‡) − A set of samples from
strolling over the chain. Each taken
from the t’s step distribution: ๐‘ƒ(๐‘ก) (X)
(in our case: each sample is a vector,
containing values for all ๐‘‹๐‘– ∈ ๐‘‹)
Stationary Distributions
Just like in numerical analysis, we expect the following to hold at convergence:
P⁽ᵗ⁾(x′) ≈ P⁽ᵗ⁺¹⁾(x′) = Σ_x P⁽ᵗ⁾(x) · τ(x → x′)
Thus, given the transition model, we get a set of |Val(X)| variables and |Val(X)| equalities.
We do not forget to finally normalize our probability, adding the equation:
Σ_{x ∈ Val(X)} P⁽ᵗ⁾(x) = 1
And finally we get a distribution out of our chain.
Key questions: Does the process really converge? If so, is there only one such target distribution?
Stationary Distributions – cont.
We formally define a stationary distribution π as one satisfying:
π(x′) = Σ_x π(x) · τ(x → x′) for every x′
That is, π(x) satisfies exactly the equation we had desired for P⁽ᵗ⁾(x).
Meaning that if, for some t, P⁽ᵗ⁾(x) gets close enough to π(x), our process will converge.
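For a small chain, the stationarity equations can be solved directly as a sanity check; a sketch using numpy for the grasshopper chain above:

```python
import numpy as np

# Transition matrix of the grasshopper chain over the states -4..4
states = list(range(-4, 5))
n = len(states)
tau = np.zeros((n, n))
for i, s in enumerate(states):
    tau[i, states.index(max(s - 1, -4))] += 0.25
    tau[i, states.index(min(s + 1, 4))] += 0.25
    tau[i, i] += 0.5

# Stationarity: pi @ tau = pi, together with the normalization sum(pi) = 1
A = np.vstack([tau.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(states, np.round(pi, 4))))   # ~1/9 for every state
```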
Stationary Distributions – example
Stationary Distributions – bad examples
We note that, in general, there can be more than one stationary distribution for a given Markov chain — for example, in the class of reducible Markov chains.
In such chains, there are several regions (sets of states) that are unreachable from one another. Thus, the starting state determines our region, and therefore the stationary distribution to which the process converges.
The chain may also fail to converge to a stationary distribution at all, as shown in the example (equivalent to X_{n+1} = 1 − X_n). This kind of Markov chain is called periodic.
Regular Markov chain
The formal definition: a Markov chain is regular if there exists an integer k such that, for every pair of states x, x′, the probability of reaching x′ from x in exactly k steps is positive.
Simply put: we can find an integer k such that steps of length k in the chain can be used to reach from each state to any other state (with probability > 0: meaning that no transition in the path was 0, so each step is legitimate).
Simpler demands:
We will only demand the following, which together ensure that our chain is regular:
1) between every two states there is a positive-probability path of transitions;
2) every state has a positive-probability self-loop (namely: τ(X → X) > 0).
These two demands often hold in practice, and ensure regularity.
Regular Markov chain – cont.
What is regularity good for? Why would we even demand regularity (or anything that ensures regularity) for our Markov chain?
Theorem 12.3 (informally): a regular finite-state Markov chain has a unique stationary distribution, to which the process converges.
(So, that's a good enough reason…)
MCMC sampling - revisited
What does that mean?
• When we have a transition model for switching between states in the chain, we simply use it to travel through the chain.
• After T steps, our chain will presumably have (approximately) the stationary distribution.
• Having traveled T steps, we simply keep on traveling through the chain to collect samples from the distribution!
• We later discuss briefly the issue of choosing T. (In practice we simply keep running the algorithm until we have enough evidence that we have converged to the stationary distribution.)
General Outline
We will:
• See the algorithm for Gibbs sampling (+ example).
• Define and explain what Markov chains are.
• Connect the two, and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains, and the Metropolis-Hastings algorithm.
• Touch on the definition of mixing time, and methods to check whether our chain has mixed.
Gibbs Sampling revisited
• We would like to define a Markov chain whose samples represent samples from our distribution.
• But the distribution is given by a graphical model. So… we'll have to use that.
• Great idea! We will use the Gibbs sampling process as a way to travel through a Markov chain in which the state space is simply the whole probability space (each state is an instantiation of all the Xᵢ's that appear in the graphical model).
• Problem: Gibbs sampling takes only one variable at a time, and samples it depending on the others. How can we handle that?
Multiple Transition Models
• We could define one cumbersome transition model for our chain, but it will be easier to consider one transition model τᵢ for each variable Xᵢ.
• We now have a set of transition models τᵢ. We call each such τᵢ a kernel of our chain. How do we combine several kernels? How do we define the sole transition model that defines our travel through the chain?
2 options:
• Randomly choosing which kernel to use at each step (for k variables, we get τ := (1/k) · Σ_{i=1}^{k} τᵢ);
we note that the new transition model is legal (the sum of outgoing transition probabilities is still 1).
• Sequentially switching between the kernels.
This creates a bit of a pickle, since we demanded many things from our transition model that now do not necessarily hold (homogeneity, for example).
But: we regard the whole round of k steps (one for each kernel) as a single step of the chain, and thus keep our chain homogeneous. We can show that the rest of our demands hold whenever each of the kernels satisfies some weaker demands.
Gibbs chain
• We will use the following kernels, taken from the Gibbs algorithm: the kernel τᵢ resamples Xᵢ from P_Φ(Xᵢ | x₋ᵢ), the conditional distribution given the current values of all the other variables.
• Just like in the algorithm, when sampling a new value for the variable Xᵢ, we disregard its current value. We use the most up-to-date values of the other variables at that point.
• Note: to obtain the conditional distribution, we simply reduce the evidence in our graphical model to get an appropriate set of factors, and sample from their (normalized) product.
• We notice that, like before, the evaluation needed for each sampling step is cheap: it involves only the factors that mention Xᵢ.
Gibbs chain – cont.
Evaluating each value in a kernel:
We have: P_Φ(xᵢ′ | x₋ᵢ) = P̃_Φ(xᵢ′, x₋ᵢ) / Σ_{xᵢ″} P̃_Φ(xᵢ″, x₋ᵢ)
All factors that do not involve Xᵢ cancel between the numerator and the denominator, and therefore we derive:
P_Φ(xᵢ′ | x₋ᵢ) = ∏_{φ : Xᵢ ∈ Scope[φ]} φ(xᵢ′, x₋ᵢ) / Σ_{xᵢ″} ∏_{φ : Xᵢ ∈ Scope[φ]} φ(xᵢ″, x₋ᵢ)
(The variables appearing in the relevant factors are also called the Markov blanket of Xᵢ.)
Gibbs chain – bad example
We note that deriving a Gibbs chain from a graphical model is not enough to ensure that the derived Markov chain is regular (and therefore we do not know that it will converge!).
Note the following example, where we observe Y = 1 and query P(X₁ | Y = 1):
our probability space holds only two viable options (the assignments (0, 1, 1) and (1, 0, 1)).
– If we start off in (0, 1, 1) and sample X₂ (given X₁ = 0 and Y = 1 as evidence), we can only get 1. Next, we sample X₁ given X₂ = 1, and get 0.
Thus, we will stay in the case (0, 1, 1) forever.
– If we start off in (1, 0, 1), we get the mirror case.
Thus, we got a reducible Gibbs chain (with two stationary distributions)!
Gibbs chain – cont.
• So, we must be careful when constructing a Gibbs chain.
• Luckily, we have another important theorem to ensure a unique stationary distribution: if all the factors in Φ are strictly positive, the Gibbs chain is regular.
• We notice that the last example arose because we had a transition with probability exactly zero [e.g., the transition (1, 0, 1) → (1, 1, 1)].
• But we should notice that if we “correct” such models by inserting an ε probability instead of 0, the theorem holds.
• However! That ε will cause us other problems, in the form of a bad mixing time (discussed ahead).
Gibbs chains - summary
• Gibbs chains convert the hard problem of inference into a sequence of “easy” sampling steps, using much of what we generally know about Markov chains.
• Pros:
– The simplest way one can think of to generate a Markov chain for a probability distribution.
– Computationally efficient.
• Cons:
– Often slow to mix (converge), especially when we have transition values close to 0 or 1.
– Only applies if we can sample from a product of factors. That may be fine in our case, but generally it is very limiting (for example, the approach will not work for general continuous distributions).
General Outline
We will:
• See the algorithm for Gibbs sampling (+ example).
• Define and explain what Markov chains are.
• Connect the two, and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains, and the Metropolis-Hastings algorithm.
• Touch on the definition of mixing time, and methods to check whether our chain has mixed.
Broader class of Markov chains
• We have seen Gibbs chains, and some conditions to make sure they converge.
• We now show a more general way to create a Markov chain given some distribution.
• It focuses on ensuring convergence.
First, we’ll need some definitions…
Reversible Markov chains
Definition: a Markov chain is reversible with respect to a distribution π if it satisfies the detailed balance equation: π(x) · τ(x → x′) = π(x′) · τ(x′ → x) for every pair of states x, x′.
Using that definition, we can easily obtain the following: if detailed balance holds, then π is a stationary distribution of the chain.
Reversible Markov chains – cont.
We prove that last proposition. We have that:
π(x) · τ(x → x′) = π(x′) · τ(x′ → x)
and therefore we can sum these equations over all x, and we get:
Σ_x π(x) · τ(x → x′) = Σ_x π(x′) · τ(x′ → x) = π(x′) · Σ_x τ(x′ → x) = π(x′) · 1 = π(x′)
which is exactly the definition of a stationary distribution π(x) over the transition model τ.
Metropolis-Hastings Algorithm
• We use the idea of a proposal distribution (already shown by Dafna, in the case of importance sampling).
• We define a transition model τ_Q based on that distribution.
• We use another “approver” A to decide whether each proposed move should occur or not. The choice of this approver gives us freedom. The real transition model τ is then defined as:
τ(x → x′) = τ_Q(x → x′) · A(x → x′) for x′ ≠ x, and the remaining probability mass goes to the self-loop τ(x → x).
(This follows from our description…)
Metropolis-Hastings Algorithm – cont.
• How do we choose our approver? We want the detailed balance equation (12.24) to hold, so we want the following to hold:
π(x) · τ_Q(x → x′) · A(x → x′) = π(x′) · τ_Q(x′ → x) · A(x′ → x)
Thus, we need:
A(x → x′) / A(x′ → x) = π(x′) · τ_Q(x′ → x) / ( π(x) · τ_Q(x → x′) ) =: μ
If μ > 1 in one direction, it is 1/μ < 1 in the other direction.
To maximize the acceptance rate (so our chain will move more freely, and mix faster), we simply take the approver of one side to be 1 and the other to be the fraction. Finally we get a generic formula:
A(x → x′) = min[ 1, π(x′) · τ_Q(x′ → x) / ( π(x) · τ_Q(x → x′) ) ]
Finally, we conclude from our demands that the chain we described does indeed converge to π(x).
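A Metropolis-Hastings sketch on a small discrete space, with a uniform symmetric proposal (the unnormalized target π̃, the burn-in length and the run length are illustrative):

```python
import random
from collections import Counter

pi_tilde = {0: 1.0, 1: 4.0, 2: 2.0, 3: 1.0}   # unnormalized target pi
states = list(pi_tilde)

def mh_step(x):
    x_new = random.choice(states)              # symmetric proposal tau_Q: uniform over all states
    # A(x -> x') = min(1, pi(x') * tau_Q(x' -> x) / (pi(x) * tau_Q(x -> x')));
    # with a symmetric proposal the tau_Q terms cancel.
    accept = min(1.0, pi_tilde[x_new] / pi_tilde[x])
    return x_new if random.random() < accept else x

x, counts = 0, Counter()
for t in range(200_000):
    x = mh_step(x)
    if t > 1_000:                              # crude burn-in before collecting samples
        counts[x] += 1

total = sum(counts.values())
print({s: round(counts[s] / total, 3) for s in states})   # ~ {0: 0.125, 1: 0.5, 2: 0.25, 3: 0.125}
```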
General Outline
We will:
• See the algorithm for Gibbs sampling (+ example).
• Define and explain what Markov chains are.
• Connect the two, and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains, and the Metropolis-Hastings algorithm.
• Touch on the definition of mixing time, and methods to check whether our chain has mixed.
Mixing time
• So, we have constructed a Markov chain and guaranteed that it converges to our target distribution.
• Using Gibbs, we even gained the ability to sample from a distribution that would be hard to sample from otherwise.
• But one important question is left unanswered:
– How long will it take our chain to reach the desired distribution?
• General answer:
We cannot know!!!
There is very little math behind this issue, and few bounds on the required number of samples.
• We show a few definitions, and then jump to practice.
Mixing time - definitions
We define the mixing time as the smallest number of steps t after which the chain's distribution is within ε of the stationary distribution:
T_ε = min{ t : D_var( P⁽ᵗ⁾ ; π ) ≤ ε }
where D_var is the variation distance, D_var(P ; π) = max_S | P(S) − π(S) |,
or, alternatively: D_var(P ; π) = (1/2) · Σ_x | P(x) − π(x) |.
Conductance
We define the conductance of our Markov chain as follows:
Φ = min over sets S with π(S) ≤ 1/2 of [ P(S → Sᶜ) / π(S) ]
where P(S → Sᶜ) = Σ_{x ∈ S, x′ ∉ S} π(x) · τ(x → x′) is the stationary probability of leaving S in one step, and π(S) = Σ_{x ∈ S} π(x).
This characteristic of the chain gives us a clue about our chances of visiting all around the chain, instead of getting “stuck” in a specific area.
Example
Here is an example where low conductance shows we can expect a long mixing time:
Assume we have two regions, where the only way to transition between them is through two particular states, and we also assume that this transition has a very low probability. Then the conductance will be very low (taking S = {x₁, x₂, x₃}).
We can also expect the mixing time of that chain to be quite high.
In practice
• In practice, we have only self-examination as a tool to check whether our chain has mixed.
• We do it in a few different ways:
– For example: we run a number of chains in parallel and compare the results.
– We know that both converge (at some point) to the same distribution, so we can say for sure when we have NOT reached the mixing time: when the two estimates are too different.
– Repeating this test (or running a larger number of chains in parallel), we can collect many observations that are “not bad”, concluding that overall our chains have mixed enough.
Some statistics of self-observation
(Example plots of chain statistics over time, with verdicts on whether the chains have mixed: “No”, “Maybe”.)
Another little issue
Another issue that can be discussed: having taken T steps, we can now start sampling from our travel over the chain.
BUT! We notice that our samples are definitely dependent on one another.
Solution:
We can consider taking some number of steps (d) between each two collected samples.
– It is easy to see that by doing this we “lie to ourselves”, and only lose viable information we had in the samples we threw away.
– Nonetheless, in a setting where processing each sample takes a while, we can use this method to process only slightly-more-independent samples.
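A tiny sketch of the thinning idea (the burn-in length and stride d are illustrative):

```python
def thin(chain_samples, d):
    """Keep only every d-th sample to reduce (but not eliminate) the dependence between kept samples."""
    return chain_samples[::d]

# e.g. discard a burn-in prefix, then keep every 10th Gibbs / Metropolis-Hastings sample:
# thinned = thin(all_samples[1000:], 10)
```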
Summary
To sum up:
• We've defined and learned how to use Markov chains.
• We've seen a general algorithm to construct a chain for a target distribution.
• We've seen a more particular way to construct these chains for a graphical model (based on factors, but this can easily be replaced by CPDs).
• We've discussed the difficulty of bounding convergence time in practice, and some statistical tools to overcome that issue by running for a long time and by self-observation.