Variable elimination and junction tree
Exact inference
Exponential in tree-width
Mean field, loopy belief propagation
Approximate inference for marginals/conditionals
Fast, but can get inaccurate estimates
Sample-based inference
Approximate inference for marginals/conditionals
With “enough” samples, will converge to the right answer
What is the approximating structure?
How to measure the goodness of the approximation of $Q(X_1), \dots, Q(X_n)$ to the original $P(X_1, \dots, X_n)$?
Reverse KL-divergence $KL(Q \| P)$ → mean field
How to compute the new parameters?
Optimization $Q^* = \operatorname{argmin}_Q KL(Q \| P)$ → new parameters
Approximate $P(X) \propto \prod_j \Psi(D_j)$ with $Q(X) = \prod_i Q_i(X_i)$, s.t. $\forall i$: $Q_i(X_i) > 0$ and $\int Q_i(X_i)\, dX_i = 1$

$$KL(Q \| P) = -\int Q(X) \ln P(X)\, dX + \int Q(X) \ln Q(X)\, dX + \text{const.}$$
$$= -\sum_j \int \Big(\prod_{i \in D_j} Q_i(X_i)\Big) \ln \Psi(D_j)\, dD_j + \sum_i \int Q_i(X_i) \ln Q_i(X_i)\, dX_i + \text{const.}$$

E.g., $\int\!\!\int Q_1(X_1)\, Q_5(X_5)\, \ln \Psi(X_1, X_5)\, dX_1\, dX_5$

[Figure: Markov random field over $X_1, \dots, X_7$ with pairwise potentials $\Psi(X_1, X_2)$, $\Psi(X_1, X_3)$, $\Psi(X_1, X_4)$, $\Psi(X_1, X_5)$.]
Let $L$ be the Lagrangian function of the $Q_i(X_i)$:
$$L = -\sum_j \int \Big(\prod_{i \in D_j} Q_i(X_i)\Big) \ln \Psi(D_j)\, dD_j + \sum_i \int Q_i(X_i) \ln Q_i(X_i)\, dX_i + \sum_i \beta_i \Big(1 - \int Q_i(X_i)\, dX_i\Big)$$
(What about the constraint $Q_i(X_i) > 0$?)
For KL divergence, the nonnegativity constraint comes for free
If $Q$ is an optimum of the original constrained problem, then there exist multipliers $\beta_i$ such that $(Q, \{\beta_i\})$ is a stationary point of the Lagrangian (derivative equal to zero).
$$L = -\sum_j \int \Big(\prod_{i \in D_j} Q_i(X_i)\Big) \ln \Psi(D_j)\, dD_j + \sum_i \int Q_i(X_i) \ln Q_i(X_i)\, dX_i + \sum_i \beta_i \Big(1 - \int Q_i(X_i)\, dX_i\Big)$$

Set the derivative with respect to $Q_i(X_i)$ to zero:
$$\frac{\partial L}{\partial Q_i(X_i)} = -\sum_{j: i \in D_j} \int \Big(\prod_{k \in D_j \setminus i} Q_k(X_k)\Big) \ln \Psi(D_j)\, dD_{j \setminus i} + \ln Q_i(X_i) + 1 - \beta_i = 0$$

Solving for $Q_i(X_i)$:
$$Q_i(X_i) = \exp\Big(\sum_{j: i \in D_j} \int \Big(\prod_{k \in D_j \setminus i} Q_k(X_k)\Big) \ln \Psi(D_j)\, dD_{j \setminus i} - 1 + \beta_i\Big)$$
$$\Rightarrow\quad Q_i(X_i) = \frac{1}{Z_i} \exp\Big(\sum_{D_j: X_i \in D_j} \mathbb{E}_Q\big[\ln \Psi(D_j)\big]\Big)$$
where the expectation is over the remaining variables $D_j \setminus X_i$ under $Q$, and the constant $\exp(\beta_i - 1)$ is absorbed into the normalizer $Z_i$.
Initialize $Q(X_1), \dots, Q(X_n)$ (e.g., randomly or smartly)
Set all variables to unprocessed
Pick an unprocessed variable $X_i$
Update $Q_i$:
$$Q_i(X_i) = \frac{1}{Z_i} \exp\Big(\sum_{D_j: X_i \in D_j} \mathbb{E}_Q\big[\ln \Psi(D_j)\big]\Big)$$
Set variable $X_i$ as processed
If $Q_i$ changed, set the neighbors of $X_i$ to unprocessed
Guaranteed to converge
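A minimal Python/NumPy sketch of this update schedule for discrete variables. The inputs `factors_of`, `neighbors`, and `expected_log_potential` are assumptions describing the model; they are not defined in the slides.

```python
import numpy as np

def mean_field(num_states, factors_of, neighbors, expected_log_potential, tol=1e-6):
    """Coordinate-ascent mean field with an unprocessed-variable queue.

    num_states[i]: number of states of X_i.
    factors_of[i]: indices j of the factors D_j that contain X_i.
    neighbors[i]: set of variables sharing a factor with X_i (excluding i).
    expected_log_potential(j, i, Q): E_Q[ln Psi(D_j)] as a vector over the states of X_i.
    """
    n = len(num_states)
    Q = [np.full(k, 1.0 / k) for k in num_states]   # initialize (here: uniformly)
    unprocessed = set(range(n))                     # all variables start unprocessed
    while unprocessed:
        i = unprocessed.pop()                       # pick an unprocessed variable
        log_q = np.zeros(num_states[i])
        for j in factors_of[i]:                     # sum E_Q[ln Psi(D_j)] over factors containing X_i
            log_q += expected_log_potential(j, i, Q)
        new_q = np.exp(log_q - log_q.max())
        new_q /= new_q.sum()                        # the 1/Z_i normalization
        if np.abs(new_q - Q[i]).max() > tol:        # Q_i changed:
            unprocessed |= neighbors[i]             #   mark its neighbors as unprocessed again
        Q[i] = new_q
    return Q
```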
[Figure: the MRF over $X_1, \dots, X_7$ again; each edge incident to $X_1$ (to $X_2, X_3, X_4, X_5$) and to $X_2$ (to $X_6, X_7$) is annotated with its expected log-potential, e.g. $\int Q(X_5) \ln \Psi(X_1, X_5)\, dX_5$.]

$$Q_1(X_1) = \frac{1}{Z_1} \exp\Big(\mathbb{E}_{Q(X_2)}\big[\ln \Psi(X_1, X_2)\big] + \mathbb{E}_{Q(X_3)}\big[\ln \Psi(X_1, X_3)\big] + \mathbb{E}_{Q(X_4)}\big[\ln \Psi(X_1, X_4)\big] + \mathbb{E}_{Q(X_5)}\big[\ln \Psi(X_1, X_5)\big]\Big)$$

$$Q_2(X_2) = \frac{1}{Z_2} \exp\Big(\mathbb{E}_{Q(X_1)}\big[\ln \Psi(X_1, X_2)\big] + \mathbb{E}_{Q(X_6)}\big[\ln \Psi(X_2, X_6)\big] + \mathbb{E}_{Q(X_7)}\big[\ln \Psi(X_2, X_7)\big]\Big)$$
Mean field equation
$$Q_i(X_i) = \frac{1}{Z_i} \exp\Big(\sum_{D_j: X_i \in D_j} \mathbb{E}_Q\big[\ln \Psi(D_j)\big]\Big)$$

Pairwise Markov random field
$$P(X_1, \dots, X_n) \propto \exp\Big(\sum_{ij} \theta_{ij} X_i X_j + \sum_i \theta_i X_i\Big)$$
$$\Psi(X_i, X_j) = \exp(\theta_{ij} X_i X_j), \quad \ln \Psi(X_i, X_j) = \theta_{ij} X_i X_j$$
$$\Psi(X_i) = \exp(\theta_i X_i), \quad \ln \Psi(X_i) = \theta_i X_i$$

The mean field equation has a simple form:
$$Q_i(X_i) = \frac{1}{Z_i} \exp\Big(\sum_{j \in N(i)} \theta_{ij} X_i \langle X_j \rangle + \theta_i X_i\Big), \quad \text{where } \langle X_j \rangle = \mathbb{E}_{Q_j}[X_j]$$
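For binary states $X_i \in \{-1, +1\}$, this simple form reduces to a fixed-point update on the means $\langle X_i \rangle$. Below is a sketch under that assumption; the graph, the parameters `theta_pair`/`theta_node`, and the sweep count are illustrative and not from the slides.

```python
import numpy as np

def mean_field_ising(theta_pair, theta_node, num_iters=50):
    """Mean field for P(X) ∝ exp(sum_ij theta_ij X_i X_j + sum_i theta_i X_i) with X_i in {-1,+1}.

    theta_pair: symmetric (n, n) array of pairwise parameters (0 where there is no edge).
    theta_node: length-n array of node parameters.
    Returns mu, where mu[i] approximates <X_i> under the factored Q.
    """
    n = len(theta_node)
    mu = np.zeros(n)                                  # current means <X_j>
    for _ in range(num_iters):
        for i in range(n):
            # Q_i(x_i) ∝ exp(x_i (sum_j theta_ij <X_j> + theta_i)) for x_i = ±1,
            # hence <X_i> = tanh(sum_j theta_ij <X_j> + theta_i).
            mu[i] = np.tanh(theta_pair[i] @ mu + theta_node[i])
    return mu

# Example: a 3-node chain X_1 - X_2 - X_3 with weak attractive couplings.
theta_pair = np.array([[0.0, 0.5, 0.0],
                       [0.5, 0.0, 0.5],
                       [0.0, 0.5, 0.0]])
theta_node = np.array([0.2, 0.0, -0.1])
print(mean_field_ising(theta_pair, theta_node))
```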
[Figure: the MRF partitioned into clusters $C_1, C_2, C_3, C_4$.]

$$Q(C_1) \propto \exp\Big(\sum_{i \in C_1} \theta_i X_i + \sum_{ij \in E,\, i \in C_1,\, j \in C_1} \theta_{ij} X_i X_j + \sum_{ij \in E,\, i \in C_1,\, j \in MB(C_1)} \theta_{ij} X_i \langle X_j \rangle_{Q(C_j)}\Big)$$

The first term collects the node potentials within $C_1$, the second the edge potentials within $C_1$, and the third the edge potentials across $C_1$ and each neighboring cluster $C_j$, where $\langle X_j \rangle_{Q(C_j)}$ is the mean of the variables in the Markov blanket of $C_1$.
Previous inference tasks focus on obtaining the entire posterior distribution $p(X_i \mid e)$
Often we want to take expectations
Mean: $\mu_{X_i \mid e} = \mathbb{E}[X_i \mid e] = \int x_i\, p(x_i \mid e)\, dx_i$
Variance: $\sigma^2_{X_i \mid e} = \mathbb{E}\big[(X_i - \mu_{X_i \mid e})^2 \mid e\big] = \int (x_i - \mu_{X_i \mid e})^2\, p(x_i \mid e)\, dx_i$
More generally, $\mathbb{E}[f(X)] = \int f(x)\, p(x \mid e)\, dx$, which can be difficult to compute analytically
Key idea: approximate the expectation by a sample average
$$\mathbb{E}[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i), \quad \text{where } x^1, \dots, x^N \sim p(X \mid e) \text{ independently and identically}$$
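A tiny Python/NumPy illustration of the sample-average idea; the standard normal below is only a stand-in for a distribution we can already draw from, not a particular $p(X \mid e)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)          # stand-in for samples x^1, ..., x^N ~ p(X | e)

# Sample-average approximations of E[f(X)] for a few choices of f
print(x.mean())                     # ≈ E[X]   = 0
print(np.mean((x - x.mean())**2))   # ≈ Var[X] = 1
print(np.mean(x**4))                # ≈ E[X^4] = 3 for a standard normal
```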
Samples: points from the domain of a distribution $P(X)$
The higher $P(x)$, the more likely we are to see $x$ in the sample
[Figure: density $P(X)$ with sample points $x^1, \dots, x^6$ concentrated where the density is high.]
Mean: $\mu = \frac{1}{N} \sum_i x^i$
Variance: $\sigma^2 = \frac{1}{N} \sum_i (x^i - \mu)^2$
For $n$ independent Gaussian variables with zero mean and unit variance, use x = randn(n, 1) in Matlab
For a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$: draw y = randn(n, 1), then transform the sample: $x = \mu + \Sigma^{1/2} y$
If $X$ takes a finite number of values, e.g., 1, 2, 3, with $P(X=1) = 0.7$, $P(X=2) = 0.2$, $P(X=3) = 0.1$:
Draw a sample from the uniform distribution $U[0,1]$, u = rand
If $u$ falls into the interval $[0, 0.7)$, output sample $x = 1$
If $u$ falls into the interval $[0.7, 0.9)$, output sample $x = 2$
If $u$ falls into the interval $[0.9, 1]$, output sample $x = 3$
[Figure: the unit interval $[0, 1]$ split into segments of length 0.7, 0.2, and 0.1 for values 1, 2, and 3.]
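A short Python/NumPy sketch of both recipes; the probability table and the covariance matrix below are illustrative values, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete sampling via the CDF: u ~ U[0,1) falls into one of the segments.
values = np.array([1, 2, 3])
probs  = np.array([0.7, 0.2, 0.1])
cdf = np.cumsum(probs)
u = rng.uniform()
idx = min(np.searchsorted(cdf, u, side='right'), len(values) - 1)  # guard against rounding at the top end
x = values[idx]   # 1 for u in [0, 0.7), 2 for u in [0.7, 0.9), 3 for u in [0.9, 1)

# Multivariate Gaussian N(mu, Sigma): transform standard normal draws.
mu    = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
y = rng.standard_normal(2)
x_gauss = mu + np.linalg.cholesky(Sigma) @ y        # the Cholesky factor plays the role of Sigma^(1/2)
```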
A BN describes a generative process for observations
First, sort the nodes in topological order
Then, generate samples in this order according to the CPTs
[Figure: BN with nodes Allergy (1), Flu (2), Sinus (3), Nose (4), Headache (5); Allergy and Flu are parents of Sinus, which is the parent of Nose and Headache.]
Generate a set of samples for (A, F, S, N, H):
Sample $a \sim P(A)$
Sample $f \sim P(F)$
Sample $s \sim P(S \mid a, f)$
Sample $n \sim P(N \mid s)$
Sample $h \sim P(H \mid s)$
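A minimal sketch of this ancestral sampling procedure for a binary version of the network; every CPT value below is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bern(p):
    """Draw a single Bernoulli(p) sample as 0/1."""
    return int(rng.uniform() < p)

def sample_one():
    # Topological order: A, F, S, N, H.  All CPT numbers are illustrative.
    a = bern(0.1)                              # P(A = 1)
    f = bern(0.2)                              # P(F = 1)
    s = bern([[0.05, 0.6], [0.7, 0.9]][a][f])  # P(S = 1 | a, f)
    n = bern([0.1, 0.8][s])                    # P(N = 1 | s)
    h = bern([0.05, 0.6][s])                   # P(H = 1 | s)
    return a, f, s, n, h

samples = [sample_one() for _ in range(1000)]
```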
How to draw samples from a given distribution?
Not all distributions can be trivially sampled
How to make better use of samples (not all samples are equally useful)?
How do we know we’ve sampled enough?
Direct Sampling
Simple
Works only for easy distributions
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Uses a variable and its Markov blanket for sampling
Want to sample from the distribution $P(X) = \frac{1}{Z} f(X)$
$P(X)$ is difficult to sample from, but $f(X)$ is easy to evaluate
Key idea: sample from a simpler distribution $Q(X)$, and then select samples
Rejection sampling (choose $M$ s.t. $f(X) \le M\, Q(X)$):
Set $i = 1$; repeat until $i = N$: draw $x \sim Q(X)$ and $u \sim U[0,1]$; if $u \le \frac{f(x)}{M\, Q(x)}$, accept $x$ and set $i = i + 1$; otherwise, reject $x$
A sample is rejected with probability $1 - \frac{f(x)}{M\, Q(x)}$
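A small Python/NumPy sketch of this loop. The unnormalized target f (two Gaussian-shaped bumps), the proposal Q, and the bound M are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Unnormalized target: two Gaussian-shaped bumps."""
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

sigma_q = 3.0                         # proposal Q = N(0, 3^2)
def q(x):
    return np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

M = 15.0                              # assumed to satisfy f(x) <= M * q(x); check this for your own f and Q

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(0.0, sigma_q)  # x ~ Q
        u = rng.uniform()             # u ~ U[0,1]
        if u <= f(x) / (M * q(x)):    # accept with probability f(x) / (M Q(x))
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(1000)
```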
Sample $x \sim Q(X)$ and reject with probability $1 - \frac{f(x)}{M\, Q(x)}$
[Figure: curves $M\, Q(X)$ and $f(X)$, with a proposal $x_1 \sim Q(X)$ and $u_1 \sim U[0,1]$; the region between the two curves is the rejection region.]
The likelihood of seeing a particular $x$:
$$p(x) = \frac{\frac{f(x)}{M\, Q(x)}\, Q(x)}{\int \frac{f(x)}{M\, Q(x)}\, Q(x)\, dx} = \frac{f(x)}{\int f(x)\, dx} = \frac{1}{Z} f(x)$$
Can have a very small acceptance rate
[Figure: curves $M\, Q(X)$ and $f(X)$ with a big rejection region between them.]
Adaptive rejection sampling: use an envelope function for $Q$
Suppose sampling from $P(X)$ is hard
We can sample from a simpler proposal distribution $Q(X)$
If $Q$ dominates $P$ (i.e., $Q(X) > 0$ whenever $P(X) > 0$), we can sample from $Q$ and reweight:
$$\mathbb{E}_P[f(X)] = \int f(X)\, P(X)\, dX = \int f(X)\, \frac{P(X)}{Q(X)}\, Q(X)\, dX$$
Sample $x^i \sim Q(X)$
Reweight sample $x^i$ with $w^i = \frac{P(x^i)}{Q(x^i)}$
Approximate the expectation with $\frac{1}{N} \sum_i f(x^i)\, w^i$
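A short Python/NumPy sketch of this (unnormalized) importance sampler; the target P (a standard normal), the proposal Q (a wider normal), and the test function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

# Target P = N(0, 1); proposal Q = N(0, 2^2), so Q(x) > 0 wherever P(x) > 0.
x = rng.normal(0.0, 2.0, size=N)                        # x^i ~ Q
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)     # w^i = P(x^i) / Q(x^i)

print(np.mean(x**2 * w))                                # ≈ E_P[X^2] = 1
```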
Instead of rejecting samples, reweight them
[Figure: target $P(X)$ and proposal $Q(X)$; samples $x_1, x_2 \sim Q(X)$ carry weights $w_1 = P(x_1)/Q(x_1)$ and $w_2 = P(x_2)/Q(x_2)$.]
The efficiency of importance sampling depends on how close the proposal $Q$ is to the target $P$
Suppose we can only evaluate $f(X)$, where $P(X) = \frac{1}{Z} f(X)$ (e.g., a grid model or MRF where $Z$ is difficult to compute), and we don't know $Z$
We can get around the normalization $Z$ by normalizing the importance weights:
$$w(X) = \frac{f(X)}{Q(X)} \;\Rightarrow\; \mathbb{E}_Q[w(X)] = \int \frac{f(X)}{Q(X)}\, Q(X)\, dX = \int f(X)\, dX = Z$$
$$\mathbb{E}_P[g(X)] = \int g(X)\, P(X)\, dX = \frac{1}{Z} \int g(X)\, f(X)\, dX = \frac{\int g(X)\, \frac{f(X)}{Q(X)}\, Q(X)\, dX}{\int \frac{f(X)}{Q(X)}\, Q(X)\, dX} \approx \frac{\sum_i g(x^i)\, w^i}{\sum_i w^i} = \sum_i g(x^i)\, \tilde{w}^i$$
where $\tilde{w}^i = \frac{w^i}{\sum_j w^j}$ is the new normalized importance weight
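Continuing the sketch above with only an unnormalized target: here $f$ is an unnormalized Gaussian, so the true $Z$ is known in principle, but the code never uses it. The proposal and test function are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def f_unnorm(x):
    """Unnormalized target f(X) = Z * P(X); here P = N(1, 0.5^2), but Z is never needed."""
    return 3.0 * np.exp(-0.5 * ((x - 1.0) / 0.5)**2)

def q_pdf(x):
    """Proposal Q = N(0, 2^2)."""
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=N)        # x^i ~ Q
w = f_unnorm(x) / q_pdf(x)              # w^i = f(x^i) / Q(x^i)
w_norm = w / w.sum()                    # normalized weights

print(np.sum(x * w_norm))               # ≈ E_P[X] = 1, without ever computing Z
```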
Difference between normalized and unnormalized importance sampling:
Unnormalized importance sampling is unbiased
Normalized importance sampling is biased, e.g., for $N = 1$:
$$\mathbb{E}_Q\Big[\frac{g(x^1)\, w^1}{w^1}\Big] = \mathbb{E}_Q\big[g(x^1)\big] \neq \mathbb{E}_P\big[g(x^1)\big]$$
However, the normalized importance sampler usually has lower variance
But most importantly, the normalized importance sampler works for unnormalized distributions
One can also resample the $x^i$ according to the normalized weights $\tilde{w}^i$ (a multinomial distribution that picks $x^i$ with probability $\tilde{w}^i$)
Both rejection sampling and importance sampling do not scale well to high dimensions
Markov Chain Monte Carlo (MCMC) is an alternative
Key idea: Construct a Markov chain whose stationary distribution is the target distribution $P(X)$
Sampling process: random walk in the Markov chain
Gibbs sampling is a very special and simple MCMC method.
Want to sample from $P(X)$; start with a random initial vector $X$
$X^t$: $X$ at time step $t$
$X^t$ transitions to $X^{t+1}$ with probability
$$T(X^{t+1} \mid X^t, \dots, X^1) = T(X^{t+1} \mid X^t)$$
The stationary distribution of $T(X^{t+1} \mid X^t)$ is our $P(X)$
Run for an initial $M$ samples (burn-in time) until the chain converges/mixes/reaches the stationary distribution
Then collect $N$ (correlated) samples as $x^i$
Key issues: designing the transition kernel, and diagnosing convergence
A very special transition kernel, which works nicely with Markov blankets in GMs.
The procedure
We have a set of variables $X = \{X_1, \dots, X_K\}$ in a GM.
At each step, one variable $X_i$ is selected (at random or in some fixed sequence); denote the remaining variables as $X_{-i}$ and their current values as $x^t_{-i}$
Compute the conditional distribution $P(X_i \mid x^t_{-i})$
A value $x^t_i$ is sampled from this distribution
This sample $x^t_i$ replaces the previously sampled value of $X_i$ in $X$
Gibbs sampling
$X = x^0$
For $t = 1$ to $N$:
$x^t_1 \sim P(X_1 \mid x^{t-1}_2, \dots, x^{t-1}_K)$
$x^t_2 \sim P(X_2 \mid x^t_1, x^{t-1}_3, \dots, x^{t-1}_K)$
$\dots$
$x^t_K \sim P(X_K \mid x^t_1, \dots, x^t_{K-1})$
Variants: randomly pick the variable to sample, or sample block by block
For graphical models: only need to condition on the variables in the Markov blanket
[Figure: graphical model over $X_1, \dots, X_5$; the update for a variable conditions only on its Markov blanket.]
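A minimal sketch of the systematic-sweep variant above. `sample_conditional(i, x, rng)` is a hypothetical user-supplied function that draws $X_i$ from $P(X_i \mid x_{-i})$; for a graphical model it only needs to look at the Markov blanket of $X_i$.

```python
import numpy as np

def gibbs(x0, sample_conditional, num_sweeps, rng):
    """Systematic-sweep Gibbs sampling.

    x0: initial assignment (length-K array).
    sample_conditional(i, x, rng): draws a value for X_i from P(X_i | x_{-i}).
    Returns the chain of assignments, one per sweep (the samples are correlated).
    """
    x = np.array(x0, copy=True)
    chain = []
    for _ in range(num_sweeps):
        for i in range(len(x)):
            x[i] = sample_conditional(i, x, rng)   # resample X_i given the rest
        chain.append(x.copy())
    return np.array(chain)
```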
Pairwise Markov random field
$$P(X_1, \dots, X_n) \propto \exp\Big(\sum_{ij} \theta_{ij} X_i X_j + \sum_i \theta_i X_i\Big)$$
In the Gibbs sampling step, we need the conditional $P(X_1 \mid X_2, \dots, X_n)$:
$$P(X_1 \mid X_2, \dots, X_n) \propto \exp\big(\theta_{12} X_1 X_2 + \theta_{13} X_1 X_3 + \theta_{14} X_1 X_4 + \theta_{15} X_1 X_5 + \theta_1 X_1\big)$$
[Figure: MRF over $X_1, \dots, X_7$ in which $X_1$ is connected to $X_2, X_3, X_4, X_5$, so only those neighbors appear in the conditional.]
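Continuing the Gibbs sketch above, here is a possible conditional sampler for this pairwise MRF, assuming binary states $X_i \in \{-1, +1\}$; the parameters and the 5-node example are illustrative.

```python
import numpy as np

def make_conditional(theta_pair, theta_node):
    """Conditional sampler for P(X) ∝ exp(sum_ij theta_ij X_i X_j + sum_i theta_i X_i), X_i in {-1,+1}."""
    def sample_conditional(i, x, rng):
        # P(X_i = +1 | x_{-i}) / P(X_i = -1 | x_{-i}) = exp(2 * field), where
        # field = sum_{j != i} theta_ij x_j + theta_i, so P(X_i = +1 | x_{-i}) is a logistic function.
        field = theta_pair[i] @ x - theta_pair[i, i] * x[i] + theta_node[i]
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
        return 1 if rng.uniform() < p_plus else -1
    return sample_conditional

# Illustrative 5-node model plugged into the gibbs() sketch above.
rng = np.random.default_rng(0)
theta_pair = 0.3 * (np.ones((5, 5)) - np.eye(5))   # fully connected, weak couplings
theta_node = np.zeros(5)
chain = gibbs(np.ones(5), make_conditional(theta_pair, theta_node), num_sweeps=100, rng=rng)
```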
Good chain
[Figure: trace plot of sampled value vs. iteration number.]
Bad chain
[Figure: trace plot of sampled value vs. iteration number.]
Direct Sampling
Simple
Works only for easy distributions
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Uses a variable and its Markov blanket for sampling