Approximate Inference and Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Why Sampling
Exact and variational inference tasks focus on obtaining the
entire posterior distribution p(X | e)
Often we want to take expectations
  Mean: μ(X | e) = E[X | e] = ∫ x p(x | e) dx
  More generally: E[f] = ∫ f(x) p(x | e) dx, which can be difficult to do
analytically
Sometime we also want to see typical data points from a
distribution
2
Sampling
Samples: points from the domain of a distribution p(X)
The higher p(x), the more likely we see x in the sample
[Figure: density p(X) over X with sample points x1, …, x6; more samples fall where p is high]
Approximate expectation by sample average:
  E[f] ≈ (1/N) Σ_{i=1}^N f(x_i)
where x1, …, xN ∼ p(X | e) are independently and identically
distributed
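The sample-average approximation above can be sketched in a few lines (a minimal illustration; the standard-normal target and E[X²] are chosen only as an example, not from the slides):

```python
import random

def monte_carlo_expectation(f, sampler, n=100_000):
    """Approximate E[f(X)] by the average of f over samples from X."""
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
# Example: E[X^2] for X ~ N(0, 1) is the variance, i.e. 1.
est = monte_carlo_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
```

The estimate concentrates around the true expectation as n grows, at the usual 1/√n Monte Carlo rate.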
3
Generate Samples from Bayesian Networks
A BN describes a generative process for observations
First, sort the nodes in topological order
[Figure: BN with nodes Allergy (1), Flu (2), Sinus (3), Nose (4), Headache (5)]
Then, generate samples in this order according to the CPTs
Generate a sample for (A, F, S, N, H):
  Sample a_s ∼ P(A)
  Sample f_s ∼ P(F)
  Sample s_s ∼ P(S | a_s, f_s)
  Sample n_s ∼ P(N | s_s)
  Sample h_s ∼ P(H | s_s)
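A sketch of this ancestral-sampling procedure; the CPT numbers below are made up for illustration, only the sampling order comes from the slide:

```python
import random

def bernoulli(p):
    return random.random() < p

def sample_bn():
    """Ancestral sampling: sample each node given its already-sampled parents."""
    a = bernoulli(0.1)                      # hypothetical P(A)
    f = bernoulli(0.2)                      # hypothetical P(F)
    # hypothetical CPT for P(S | A, F)
    p_s = {(True, True): 0.9, (True, False): 0.6,
           (False, True): 0.7, (False, False): 0.05}[(a, f)]
    s = bernoulli(p_s)
    n = bernoulli(0.8 if s else 0.1)        # hypothetical P(N | S)
    h = bernoulli(0.7 if s else 0.05)       # hypothetical P(H | S)
    return a, f, s, n, h

random.seed(0)
samples = [sample_bn() for _ in range(10_000)]
```

Because each node is sampled only after its parents, every draw is an exact sample from the joint distribution.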
4
Challenge in sampling
Not all distributions can be trivially sampled, e.g.,
Loopy graphical models with many variables
Distributions with complicated shapes
[Figure: a multimodal distribution p(X) with a complicated shape]
5
Sampling Methods
Direct Sampling
Simple
Works only for easy distributions
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Use a variable and its Markov blanket for sampling
6
Rejection sampling
Sample x ∼ q(X) and reject with probability 1 − p(x) / (M q(x)),
where the proposal q satisfies M q(x) ≥ p(x) for all x
[Figure: target p(X) (red) under the envelope M q(X) (blue); the region between the red and blue curves is the rejection region. Draw x1 ∼ q(X) and u1 ∼ U[0,1]; accept x1 if u1 ≤ p(x1) / (M q(x1)).]
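A minimal sketch of rejection sampling; the triangular target p(x) = 2x on [0, 1] and the uniform proposal are assumptions chosen only so the envelope constant M is easy to see:

```python
import random

def rejection_sample(p, q_sample, q_pdf, M):
    """Draw one sample from p using proposal q with envelope M * q(x) >= p(x)."""
    while True:
        x = q_sample()
        u = random.random()                 # u ~ U[0, 1]
        if u <= p(x) / (M * q_pdf(x)):      # accept with prob p(x) / (M q(x))
            return x

# Target: triangular density p(x) = 2x on [0, 1]; proposal: Uniform[0, 1], M = 2.
p = lambda x: 2.0 * x
random.seed(0)
samples = [rejection_sample(p, random.random, lambda x: 1.0, M=2.0)
           for _ in range(20_000)]
mean = sum(samples) / len(samples)          # true mean of p is 2/3
```

The acceptance rate is 1/M, which is why rejection sampling degrades quickly when the envelope is loose.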
7
Importance Sampling
Instead of rejecting samples, reweight them
[Figure: target p(X) and proposal q(X), with samples x1, x2 drawn from q]
  x1 ∼ q(X),  w1 = p(x1) / q(x1)
  x2 ∼ q(X),  w2 = p(x2) / q(x2)
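The reweighting idea as code (a sketch; the triangular target and uniform proposal are again assumed purely for illustration):

```python
import random

def importance_estimate(f, p, q_sample, q_pdf, n=50_000):
    """Estimate E_p[f(X)] using weighted samples from proposal q."""
    total = 0.0
    for _ in range(n):
        x = q_sample()
        w = p(x) / q_pdf(x)     # importance weight
        total += w * f(x)
    return total / n

# Target p(x) = 2x on [0, 1]; proposal Uniform[0, 1]. True E_p[X] = 2/3.
random.seed(0)
est = importance_estimate(lambda x: x, lambda x: 2.0 * x,
                          random.random, lambda x: 1.0)
```

Unlike rejection sampling, every draw is used, but a poor proposal shows up as high weight variance rather than a low acceptance rate.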
8
Example: sample from MRF on grid
Use a tree distribution q as the proposal distribution (cut some edges
of the grid to make a tree)
  p(X1, …, Xn) ∝ exp( Σ_{(i,j)∈E} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i )
  q(X1, …, Xn) ∝ exp( Σ_{(i,j)∈T} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i )
q has fewer terms, since the tree edge set T is a subset of E
Then use rejection sampling or importance sampling
9
Gibbs Sampling
Neither rejection sampling nor importance sampling scales well
to high dimensions
Markov Chain Monte Carlo (MCMC) is an alternative
Key idea: Construct a Markov chain whose stationary
distribution is the target distribution π π
Sampling process: random walk in the Markov chain
Gibbs sampling is a very special and simple MCMC method.
10
Markov Chain Monte Carlo
Want to sample from p(X); start with a random initial vector X^0
X^t: the state X at time step t
X^t transitions to X^{t+1} with probability
  T(X^{t+1} | X^t, …, X^1) = T(X^{t+1} | X^t)
The stationary distribution of T(X^{t+1} | X^t) is our p(X)
Run for an initial M steps (burn-in time) until the chain
converges/mixes/reaches the stationary distribution
Then collect N (correlated) samples x_i
Key issues: designing the transition kernel, and diagnosing
convergence
11
Gibbs Sampling
A very special transition kernel that works nicely with Markov
blankets in GMs.
The procedure
We have a set of variables X = {X1, …, XK} in a GM.
At each step, one variable X_i is selected (at random or in some
fixed sequence); denote the remaining variables X_{−i} and their
current values x^t_{−i}
Compute the conditional distribution p(X_i | x^t_{−i})
A value x^t_i is sampled from this distribution
This sample x^t_i replaces the previous sampled value of X_i in X
12
Gibbs Sampling in formula
Gibbs sampling
X = x^0
For t = 1 to N:
  x1^t ∼ p(X1 | x2^{t−1}, …, xK^{t−1})
  x2^t ∼ p(X2 | x1^t, x3^{t−1}, …, xK^{t−1})
  …
  xK^t ∼ p(XK | x1^t, …, x_{K−1}^t)
Only need to condition on the variables in the Markov blanket
[Figure: node X1 with neighbors X2, X3, X4, X5 forming its Markov blanket]
Variants:
  Randomly pick a variable to sample
  Sample block by block
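A sketch of this sweep for a case where the conditionals are easy to write down: a zero-mean bivariate Gaussian with correlation ρ, where p(X1 | x2) = N(ρ·x2, 1 − ρ²). The target distribution is an assumption chosen for illustration, not from the slides:

```python
import random

def gibbs_bivariate_gaussian(rho, n_iters=50_000, burn_in=1_000):
    """Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho."""
    x1, x2 = 0.0, 0.0                      # X = x^0
    sd = (1.0 - rho * rho) ** 0.5          # std of each conditional
    samples = []
    for t in range(n_iters):
        x1 = random.gauss(rho * x2, sd)    # x1^t ~ p(X1 | x2^{t-1})
        x2 = random.gauss(rho * x1, sd)    # x2^t ~ p(X2 | x1^t)
        if t >= burn_in:                   # discard burn-in samples
            samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs_bivariate_gaussian(rho=0.8)
# Since both marginals have unit variance, E[X1 X2] estimates rho.
corr = sum(a * b for a, b in samples) / len(samples)
```

Note the collected samples are correlated across iterations, as the slides warn; the average still converges to the right answer, just more slowly than with independent draws.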
13
Gibbs Sampling: Image Segmentation
Noisy grayscale image
Label each pixel as on/off
Model using a pairwise MRF
  p(X) = (1/Z) ∏_i Ψ(x_i) ∏_{(i,j)} Ψ(x_i, x_j)
  Ψ(x_i) = exp( −(y_i − μ_{x_i})² / (2σ²_{x_i}) )
  Ψ(x_i, x_j) = exp( −β (x_i − x_j)² )
[Figure: 3×3 grid MRF over pixel labels X1, …, X9; each X_i is also tied to its observed noisy pixel value y_i]
14
Gibbs Sampling: Image Segmentation
Need the conditional p(x_i | x_1, …, x_{i−1}, x_{i+1}, …, x_n)
  = p(x_1, …, x_n) / p(x_1, …, x_{i−1}, x_{i+1}, …, x_n)
  = [ (1/Z) ∏_j Ψ(x_j) ∏_{(j,k)} Ψ(x_j, x_k) ] / [ (1/Z) Σ_{x_i} ∏_j Ψ(x_j) ∏_{(j,k)} Ψ(x_j, x_k) ]
Terms without x_i will cancel out; x_i is summed out in the denominator
  ∝ Ψ(x_i) ∏_{j∈N(i)} Ψ(x_i, x_j)
[Figure: 3×3 grid MRF; the conditional for X_i depends only on its neighbors N(i)]
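This conditional can be computed directly for a binary grid MRF. The potentials follow the slide's definitions; the values of μ, σ, and β below are made up for illustration:

```python
import math

MU = {0: 0.2, 1: 0.8}       # hypothetical class means for off/on pixels
SIGMA = {0: 0.3, 1: 0.3}    # hypothetical class standard deviations
BETA = 2.0                  # hypothetical smoothness strength

def node_potential(label, y):
    """Psi(x_i) = exp(-(y_i - mu_{x_i})^2 / (2 sigma_{x_i}^2))."""
    return math.exp(-(y - MU[label]) ** 2 / (2 * SIGMA[label] ** 2))

def edge_potential(label_i, label_j):
    """Psi(x_i, x_j) = exp(-beta (x_i - x_j)^2)."""
    return math.exp(-BETA * (label_i - label_j) ** 2)

def gibbs_conditional(y_i, neighbor_labels):
    """p(x_i = 1 | neighbors), proportional to Psi(x_i) * prod of edge terms."""
    scores = []
    for label in (0, 1):
        s = node_potential(label, y_i)
        for nb in neighbor_labels:
            s *= edge_potential(label, nb)
        scores.append(s)
    return scores[1] / (scores[0] + scores[1])

# A bright pixel surrounded by "on" neighbors is almost surely "on".
p_on = gibbs_conditional(0.9, [1, 1, 1, 1])
```

A full Gibbs sweep would visit every pixel, sample its label from this conditional, and repeat.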
15
Gibbs Sampling: Image Segmentation
16
MAP by Sampling
Generate a few samples from the posterior
For each X_i the MAP is the majority assignment
Majority vote
17
Convergence of Gibbs Sampling
Not all samples x^0, …, x^N are independent
Consider a particular marginal p(x_i | u_i)
[Figure: estimated p(x_i | u_i) versus iteration t, converging to the true p(x_i | u_i); take samples only after the burn-in period]
18
Diagnose convergence
Good chain
[Figure: trace plot of sampled value vs. iteration number]
19
Diagnose convergence
Bad chain
[Figure: trace plot of sampled value vs. iteration number]
20
Sampling Methods
Direct Sampling
Works only for easy distributions (multinomial, Gaussian etc.)
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Use a variable and its Markov blanket for sampling
21
Learning Graphical Models
The goal: given a set of independent samples (assignments of
random variables), find the best (the most likely) graphical
model (both the structure and the parameters)
(A, F, S, N, H) = (T, F, F, T, F)
(A, F, S, N, H) = (T, F, T, T, F)
…
(A, F, S, N, H) = (F, T, T, T, T)
[Figure: data → structure learning → BN structure over A, F, S, N, H → parameter learning → CPTs]
Example CPT P(S | F, A):
  F A | S = t | S = f
  T T |  0.9  |  0.1
  T F |  0.7  |  0.3
  F T |  0.8  |  0.2
  F F |  0.2  |  0.8
22
Learning for GMs
                      | Known Structure  | Unknown Structure
Fully observable data | Relatively easy  | Hard
Missing data          | Hard (EM)        | Very hard
Estimation principle:
Maximal likelihood estimation
Bayesian estimation
Common Feature
Make use of distribution factorization
Make use of inference algorithm
Make use of regularization/prior
23
Example problem
Estimate the probability θ of landing in heads
using a biased coin
Given a sequence of N independently and
identically distributed (iid) flips
E.g., D = {x1, x2, …, xN} = {1, 0, 1, …, 0}, x_i ∈ {0, 1}
Model: p(x | θ) = θ^x (1 − θ)^{1−x}, i.e.
  p(x | θ) = 1 − θ for x = 0,  θ for x = 1
Likelihood of a single observation x_i?
  p(x_i | θ) = θ^{x_i} (1 − θ)^{1−x_i}
24
Bayesian Parameter Estimation
Bayesian treat the unknown parameters as a random variable,
whose distribution can be inferred using Bayes rule:
  p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
The crucial equation can be written in words:
  posterior = likelihood × prior / marginal likelihood
For iid data, the likelihood is
  p(D | θ) = ∏_{i=1}^N p(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i}
           = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1−x_i)} = θ^{#heads} (1 − θ)^{#tails}
The prior p(θ) encodes our prior knowledge of the domain
Different priors p(θ) will end up with different estimates p(θ | D)!
25
Frequentist Parameter Estimation
Bayesian estimation has been criticized for being “subjective”
Frequentists think of a parameter as a fixed, unknown
constant, not a random variable
Hence different “objective” estimators, instead of Bayes’ rule
These estimators have different properties, such as being
“unbiased”, “minimum variance”, etc.
A very popular estimator is the maximum likelihood estimator
(MLE), which is simple and has good statistical properties
  θ̂ = argmax_θ p(D | θ) = argmax_θ ∏_{i=1}^N p(x_i | θ)
26
MLE for Biased Coin
Objective function, log likelihood
  ℓ(θ; D) = log p(D | θ) = log [ θ^{n_h} (1 − θ)^{n_t} ]
          = n_h log θ + (N − n_h) log(1 − θ)
We need to maximize this w.r.t. θ
Take derivatives w.r.t. θ:
  ∂ℓ/∂θ = n_h/θ − (N − n_h)/(1 − θ) = 0  ⇒  θ_MLE = n_h / N
or θ_MLE = (1/N) Σ_i x_i
27
Maximum Likelihood Estimation for Bernoulli
What if we toss too few times, so that we see zero heads in the
data?
In this case θ_MLE = n_h / N = 0, and we will predict that the
probability of seeing a head next is zero.
The rescue: add regularization to smooth the counts. Do
maximum a posteriori (MAP) estimation:
  θ_MAP = argmax_θ p(θ | D) = argmax_θ [ ℓ(θ; D) + log p(θ) ]
For instance, with log p(θ) = n′_h log θ + n′_t log(1 − θ):
  θ_MAP = (n_h + n′_h) / (N + n′_h + n′_t)
n′ is known as a pseudo-count. But can we still be objective?
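The zero-count problem and its pseudo-count fix, in a few lines (choosing one pseudo-head and one pseudo-tail is an assumed choice, i.e. Laplace smoothing):

```python
def theta_mle(n_heads, n_total):
    """Maximum likelihood estimate of a Bernoulli parameter: n_h / N."""
    return n_heads / n_total

def theta_map(n_heads, n_total, pseudo_h=1, pseudo_t=1):
    """MAP estimate with pseudo-counts smoothing the observed counts."""
    return (n_heads + pseudo_h) / (n_total + pseudo_h + pseudo_t)

# Three flips, zero heads: MLE says heads are impossible; MAP does not.
mle = theta_mle(0, 3)      # 0.0
map_est = theta_map(0, 3)  # 1 / 5 = 0.2
```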
28
Bayesian estimation for biased coin
Prior over θ: the Beta distribution
  p(θ; α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^{α−1} (1 − θ)^{β−1}
For integer x, Γ(x + 1) = x Γ(x) = x!
Posterior distribution of θ:
  p(θ | x1, …, xN) = p(x1, …, xN | θ) p(θ) / p(x1, …, xN)
    ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1}
    = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}
The posterior is the same type of function as the prior
Such a prior is called a conjugate prior
α and β are hyperparameters and correspond to the number of
"virtual" heads and tails (pseudo-counts)
29
Bayesian Estimation for Bernoulli
Posterior distribution of θ:
  p(θ | x1, …, xN) = p(x1, …, xN | θ) p(θ) / p(x1, …, xN)
    ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1}
    = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}
Maximum a posteriori (MAP) estimation:
  θ_MAP = argmax_θ log p(θ | x1, …, xN)
Posterior mean estimation:
  θ_Bayes = ∫ θ p(θ | D) dθ = C ∫ θ · θ^{n_h+α−1} (1 − θ)^{n_t+β−1} dθ
          = (n_h + α) / (N + α + β)
Prior strength: A = α + β
A can be interpreted as the size of an imaginary dataset
30
Effect of Prior Strength
Suppose we have a uniform prior (α = β), and we observed
n_h = 2 and n_t = 8
Weak prior A = α + β = 2. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α = 1, β = 1) = (2 + 1) / (10 + 2) = 0.25
Strong prior A = α + β = 20. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α = 10, β = 10) = (2 + 10) / (10 + 20) = 0.4
However, if we have enough data, it washes away the prior.
E.g. n_h = 200 and n_t = 800. Then the estimates under the weak and
strong priors are (200 + 1) / (1000 + 2) and (200 + 10) / (1000 + 10)
respectively, both close to 0.2
31
How estimators should be used?
θ_MAP is not Bayesian (even though it uses a prior) since it is a
point estimate
Consider predicting the future. A sensible way is to combine
predictions based on all possible values of θ, weighted by their
posterior probability; this is called Bayesian prediction:
  p(x_new | D) = ∫ p(x_new, θ | D) dθ
               = ∫ p(x_new | θ, D) p(θ | D) dθ
               = ∫ p(x_new | θ) p(θ | D) dθ
A frequentist prediction will typically use a "plug-in" estimator
such as ML/MAP:
  p(x_new | D) = p(x_new | θ_ML)  or  p(x_new | D) = p(x_new | θ_MAP)
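For the Beta-Bernoulli model the integral above has a closed form, so the Bayesian prediction and the plug-in prediction can be compared directly (a sketch; the Beta(1, 1) prior is an assumed choice):

```python
def bayesian_predict_heads(n_h, n_t, alpha=1.0, beta=1.0):
    """Full Bayesian prediction: integral of p(x_new = h | theta) p(theta | D)."""
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

def plugin_predict_heads(n_h, n_t):
    """Plug-in prediction using the ML point estimate theta_ML = n_h / N."""
    return n_h / (n_h + n_t)

# With zero observed heads, the plug-in prediction is 0 (overconfident);
# the Bayesian prediction averages over all plausible theta instead.
bayes = bayesian_predict_heads(0, 10)   # 1/12
plugin = plugin_predict_heads(0, 10)    # 0.0
```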
32
Frequentist vs. Bayesian
Advantages of Bayesian approach:
Mathematically elegant
Works well when amount of data is much less than the number
of parameters
Easy to do incremental (sequential) learning
Can be used for model selection (max likelihood will always pick
the most complex model)
Advantage of frequentist approach:
Mathematically/computationally simpler
“objective”, unbiased, invariant to reparametrization
As |D| → ∞, the two approaches become the same:
  p(θ | D) → δ(θ − θ_ML)
33
MLE for General Bayesian Networks
If we assume that the parameters for each CPT are globally
independent, and all nodes are fully observed, then the
log-likelihood function decomposes into a sum of local terms, one
per node:
  ℓ(θ; D) = log p(D | θ) = log ∏_n ∏_i p(x_{n,i} | x_{n,pa(i)}, θ_i)
          = Σ_i Σ_n log p(x_{n,i} | x_{n,pa(i)}, θ_i)
For each variable X_i:
  θ_MLE(X_i = x | X_{pa(i)} = u) = #(X_i = x, X_{pa(i)} = u) / #(X_{pa(i)} = u)
Why?
[Figure: BN over Allergy, Flu, Sinus, Nose, Headache]
34
MLE for General Bayesian Networks
  ℓ(θ; D) = log p(D | θ) = Σ_n log p(a_n | θ_A) + Σ_n log p(f_n | θ_F)
          + Σ_n log p(s_n | a_n, f_n, θ_S) + Σ_n log p(n_n | s_n, θ_N)
          + Σ_n log p(h_n | s_n, θ_H)
One term for each CPT; break up the MLE problem into independent subproblems
Earlier we already learned how to estimate a single CPT
Here we just need to estimate each CPT separately
[Figure: BN over Allergy, Flu, Sinus, Nose, Headache]
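The counting formula θ_MLE = #(X_i = x, X_pa = u) / #(X_pa = u) as code, shown for the S node with parents F and A (the toy dataset below is made up for illustration):

```python
from collections import Counter

def estimate_cpt(data, child, parents):
    """MLE for P(child | parents) by normalized counts over observed samples."""
    joint = Counter()   # counts of (parent values, child value)
    parent = Counter()  # counts of parent values alone
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(u, row[child])] += 1
        parent[u] += 1
    return {(u, x): c / parent[u] for (u, x), c in joint.items()}

# Toy fully observed samples over (A, F, S).
data = [
    {"A": 1, "F": 0, "S": 1},
    {"A": 1, "F": 0, "S": 1},
    {"A": 1, "F": 0, "S": 0},
    {"A": 0, "F": 1, "S": 1},
]
cpt = estimate_cpt(data, child="S", parents=("F", "A"))
# cpt[((0, 1), 1)] = #(S=1, F=0, A=1) / #(F=0, A=1) = 2/3
```

Because the log-likelihood decomposes, running this once per node with its own parent set gives the full MLE for the network.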
35