Introduction to Machine Learning
CSE474/574: Latent Variable Models
Varun Chandola <chandola@buffalo.edu>
Outline
1. Latent Variable Models
   Latent Variable Models - Introduction
2. Mixture Models
   Using Mixture Models
3. Parameter Estimation
   Issues with Direct Optimization of the Likelihood or Posterior
4. Expectation Maximization
   EM Operation
   EM for Mixture Models
   K-Means as EM
Latent Variable Models
Consider a probability distribution parameterized by θ
Generates samples (x) with probability p(x|θ)
2-step generative process:
1. Distribution generates the hidden variable
2. Distribution generates the observation, given the hidden variable
Magazine Example - Sampling an Article
Assume that the editor has access to p(x)
x - a random variable that denotes an article
Latent Variable Model:
1. First sample a topic z from a topic distribution p(z)
2. Pick an article from the topic-wise distribution p(x|z)
Direct Model:
Sample from p(x) for an article
Latent Variable Models - Introduction
The observed random variable x depends on a hidden random variable z
z is generated using a prior distribution - p(z)
x is generated using p(x|z)
Different combinations of p(z) and p(x|z) give different latent variable models:
1. Mixture Models
2. Factor analysis
3. Probabilistic Principal Component Analysis (PCA)
4. Latent Dirichlet Allocation (LDA)
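
To make the two-step generative process concrete, here is a minimal sampling sketch in Python; the topic names, prior probabilities, and per-topic Gaussians below are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior p(z) over the hidden variable (e.g., magazine topics)
topics = ["sports", "politics", "science"]
p_z = np.array([0.5, 0.3, 0.2])

# Hypothetical conditional p(x|z): a 1-D Gaussian per topic
mu = {"sports": 2.0, "politics": 5.0, "science": 9.0}
sigma = {"sports": 1.0, "politics": 1.5, "science": 0.5}

def sample_x():
    # Step 1: sample the hidden variable z from the prior p(z)
    z = rng.choice(topics, p=p_z)
    # Step 2: sample the observation x from the conditional p(x|z)
    x = rng.normal(mu[z], sigma[z])
    return z, x

samples = [sample_x() for _ in range(5)]
print(samples)  # only x would be observed; z stays hidden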
Mixture Models
A latent discrete state z ∈ {1, 2, . . . , K}
p(z) ∼ Multinomial(π)
For every state k, we have a probability distribution for x: p(x|z = k) = p_k(x)
Overall probability for x:

    p(x|θ) = Σ_{k=1}^{K} π_k p_k(x|θ)

A convex combination of the p_k's
π_k is the probability that the kth mixture component is selected
    Or, the contribution of the kth component
    Or, the mixing weight
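
A minimal sketch of evaluating this mixture density for a two-component 1-D Gaussian mixture; the weights, means, and standard deviations below are illustrative values, not taken from the slides.

import numpy as np
from scipy.stats import norm

# Illustrative parameters: mixing weights and per-component Gaussians
pi = np.array([0.4, 0.6])          # pi_k, must sum to 1
mus = np.array([3.0, 10.0])        # mu_k
sigmas = np.array([1.0, 2.0])      # sigma_k

def mixture_pdf(x):
    # Convex combination of the component densities p_k(x)
    return sum(pi_k * norm.pdf(x, mu_k, sd_k)
               for pi_k, mu_k, sd_k in zip(pi, mus, sigmas))

print(mixture_pdf(4.0))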
Using Mixture Models
1. Black-box Density Model
Use p(x|θ) for many things
Example: class conditional density
2. Clustering
Soft clustering:
1. First learn the parameters of the mixture model
2. Compute p(z = k|x, θ) for every input point x (Bayes Rule)
Each mixture component corresponds to a cluster k:

    p(z = k|x, θ) = p(z = k|θ) p(x|z = k, θ) / Σ_{k'=1}^{K} p(z = k'|θ) p(x|z = k', θ)
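
Continuing the illustrative two-component mixture from the previous sketch, the Bayes-rule computation above can be sketched for a single point as follows; the parameter values are again assumptions made for illustration.

import numpy as np
from scipy.stats import norm

# Same illustrative parameters as before
pi = np.array([0.4, 0.6])
mus = np.array([3.0, 10.0])
sigmas = np.array([1.0, 2.0])

def cluster_posterior(x):
    # Numerator of Bayes rule: p(z = k) * p(x | z = k) for each k
    joint = pi * norm.pdf(x, mus, sigmas)
    # Normalize by the evidence: sum over k' of p(z = k') p(x | z = k')
    return joint / joint.sum()

print(cluster_posterior(4.0))   # soft assignment of x = 4.0 over the two clusters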
Simple Parameter Estimation
Given: A set of scalar observations x_1, x_2, . . . , x_n
Task: Find the generative model (form and parameters)
1. Observe the empirical distribution of x
2. Choose the form of the probability distribution (Gaussian)
3. Estimate the parameters from the data using MLE or MAP (µ and σ)
[Figure: histogram of the observations over the bins [2, 4), [4, 6), [6, 8), [8, 10), [10, 12), [12, 14)]
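
For the single-Gaussian choice in step 2, the MLE in step 3 is just the sample mean and the (biased) sample standard deviation; a minimal sketch, with an illustrative data array standing in for x_1, . . . , x_n.

import numpy as np

# Illustrative observations
x = np.array([4.2, 5.1, 6.8, 7.0, 8.3, 9.9, 10.4])

mu_hat = x.mean()             # MLE of mu
sigma_hat = x.std(ddof=0)     # MLE of sigma (divides by n, not n - 1)
print(mu_hat, sigma_hat)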
When Data has Multiple Modes
A single mode is not sufficient
In reality, the data is generated from two Gaussians
How do we estimate µ_1, σ_1, µ_2, σ_2?
What if we knew z_i ∈ {1, 2}?
    z_i = 1 means that x_i comes from the first mixture component
    z_i = 2 means that x_i comes from the second mixture component
Issue: the z_i's are not known beforehand
Need to explore 2^N possibilities

[Figure: histogram showing a bimodal empirical distribution over the bins [0, 5), [5, 10), [10, 15)]
Optimizing Likelihood or Posterior is Not Possible
For direct optimization, we find parameters that maximize the (log-)likelihood (or (log-)posterior)
Easy to optimize if the z_i were all known
What happens when the z_i's are not known?
    The likelihood and posterior will have multiple modes
    Non-convex function - harder to optimize
Estimating Parameters of a Mixture Model
Recall that we want to maximize the log-likelihood of a data set with respect to θ:

    θ̂ = arg max_θ ℓ(θ)

The log-likelihood for a mixture model can be written as:

    ℓ(θ) = Σ_{i=1}^{N} log p(x_i|θ) = Σ_{i=1}^{N} log [ Σ_{k=1}^{K} p(z = k) p_k(x_i|θ) ]
Hard to optimize (a summation inside the log term)
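
A minimal sketch of evaluating ℓ(θ) for a 1-D Gaussian mixture, using a log-sum-exp over components for numerical stability; the data, the parameter values, and the helper name mixture_log_likelihood are my own illustrative choices.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mixture_log_likelihood(X, pi, mus, sigmas):
    # log p(x_i | theta_k) for every point i and component k: shape (N, K)
    log_pk = norm.logpdf(X[:, None], mus[None, :], sigmas[None, :])
    # log [ sum_k pi_k p_k(x_i | theta) ] via log-sum-exp, then sum over i
    return logsumexp(np.log(pi)[None, :] + log_pk, axis=1).sum()

# Illustrative data and parameters
X = np.array([2.5, 3.1, 9.4, 10.2, 11.0])
print(mixture_log_likelihood(X, np.array([0.4, 0.6]),
                             np.array([3.0, 10.0]), np.array([1.0, 2.0])))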
A 2-Step Approach
Repeat until converged:
1. Start with some guess for θ and compute the most likely value for z_i, ∀i
2. Given z_i, ∀i, update θ

Does not explicitly maximize the log-likelihood of the mixture model
Can we come up with a better algorithm?

Repeat until converged:
1. Start with some guess for θ and compute the probability of z_i = k, ∀i, k
2. Combine these probabilities to update θ
Steps in EM
EM is an iterative procedure
Start with some value for θ
At every iteration t, update θ such that the log-likelihood of the data goes up
Move from θ^{t−1} to θ such that ℓ(θ) − ℓ(θ^{t−1}) is maximized
EM - Continued
Complete log-likelihood for any LVM:

    ℓ(θ) = Σ_{i=1}^{N} log p(x_i, z_i|θ)

Cannot be computed, as we do not know the z_i
Expected complete log-likelihood:

    Q(θ, θ^{t−1}) = E[ℓ(θ) | D, θ^{t−1}]

Expected value of ℓ(θ) over all possibilities of the z_i, i.e., under p(z|D, θ^{t−1})
EM Operation
1. Initialize θ
2. At iteration t, compute Q(θ, θ^{t−1})
3. Maximize Q() with respect to θ to get θ^t
4. Go to step 2
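
A minimal sketch of this loop in Python; e_step, m_step, and log_lik are placeholder callables to be supplied for a specific latent variable model (GMM versions are sketched in the E-step and M-step slides below), and the stopping rule on the log-likelihood improvement is one common choice, not prescribed by the slides.

def em(X, theta, e_step, m_step, log_lik, tol=1e-6, max_iter=100):
    """Generic EM: alternate computing the E-step quantities that define
    Q(theta, theta_prev) and maximizing Q, until the log-likelihood stops improving."""
    prev = log_lik(X, theta)
    for _ in range(max_iter):
        stats = e_step(X, theta)    # expectations needed to form Q(theta, theta_prev)
        theta = m_step(X, stats)    # theta^t = argmax_theta Q(theta, theta_prev)
        cur = log_lik(X, theta)
        if cur - prev < tol:        # EM never decreases the log-likelihood
            break
        prev = cur
    return theta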
Using EM for MM Parameter Estimation
EM formulation is generic
Calculating (E) and maximizing (M) Q() needs to be done for specific instances
Q for MM:

    Q(θ, θ^{t−1}) = E[ Σ_{i=1}^{N} log p(x_i, z_i|θ) ]
                  = Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik log π_k + Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik log p(x_i|θ_k)

    where r_ik = p(z_i = k|x_i, θ^{t−1})
E-Step
Compute r_ik, ∀i, k:

    r_ik = p(z_i = k|x_i, θ^{t−1}) = π_k p(x_i|θ_k^{t−1}) / Σ_{k'} π_{k'} p(x_i|θ_{k'}^{t−1})
Compute Q()
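
A minimal sketch of this E-step for a 1-D GMM, returning the N × K responsibility matrix with entries r_ik; the function name and argument layout are my own choices.

import numpy as np
from scipy.stats import norm

def e_step(X, pi, mus, sigmas):
    # Unnormalized responsibilities pi_k * p(x_i | theta_k^{t-1}), shape (N, K)
    r = pi[None, :] * norm.pdf(X[:, None], mus[None, :], sigmas[None, :])
    # Normalize each row over k' so that the r_ik sum to 1 for every i
    return r / r.sum(axis=1, keepdims=True)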
M-Step
Maximize Q() w.r.t. θ
θ consists of π = {π_1, π_2, . . . , π_K} and θ = {θ_1, θ_2, . . . , θ_K}
For a Gaussian Mixture Model (GMM) (θ_k ≡ (µ_k, Σ_k)):

    π_k = (1/N) Σ_i r_ik                                        (1)
    µ_k = ( Σ_i r_ik x_i ) / ( Σ_i r_ik )                       (2)
    Σ_k = ( Σ_i r_ik x_i x_i^T ) / ( Σ_i r_ik ) − µ_k µ_k^T     (3)
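
A minimal sketch of updates (1)-(3) for the 1-D case (so Σ_k reduces to a scalar variance σ_k²), consuming the responsibility matrix r produced by the E-step sketch above.

import numpy as np

def m_step(X, r):
    Nk = r.sum(axis=0)                               # N_k = sum_i r_ik
    pi = Nk / len(X)                                 # eq. (1)
    mus = (r * X[:, None]).sum(axis=0) / Nk          # eq. (2)
    # eq. (3) in 1-D: weighted second moment minus mu_k^2
    variances = (r * X[:, None] ** 2).sum(axis=0) / Nk - mus ** 2
    return pi, mus, np.sqrt(variances)               # sigmas feed back into the E-step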
Is K-Means an EM Algorithm?
Similar to a GMM with:
1. Σ_k = σ² I_D
2. π_k = 1/K
3. The most probable cluster for x_i is computed as the prototype closest to it (hard clustering)
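
A minimal sketch of K-means viewed this way: a hard "E-step" that assigns each point to its closest prototype, and an "M-step" that recomputes each prototype as the mean of its assigned points; the 1-D data and the initialization scheme are illustrative choices.

import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=K, replace=False)]  # initial prototypes
    for _ in range(n_iter):
        # Hard E-step: assign each x_i to its closest prototype
        z = np.abs(X[:, None] - mus[None, :]).argmin(axis=1)
        # M-step: each prototype becomes the mean of its assigned points
        mus = np.array([X[z == k].mean() if np.any(z == k) else mus[k]
                        for k in range(K)])
    return z, mus

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(3.0, 1.0, 100), rng.normal(10.0, 2.0, 100)])
print(kmeans(X, K=2)[1])  # prototypes should land near 3 and 10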