Expectation-Maximization
Markoviana Reading Group
Fatih Gelgi, ASU, 2005
Outline

- What is EM?
- Intuitive Explanation
- Example: Gaussian Mixture
- Algorithm
- Generalized EM
- Discussion
- Applications
  - HMM – Baum-Welch
  - K-means
What is EM?

Two main applications:

- Data has missing values, due to problems with or limitations of the observation process.
- Optimizing the likelihood function is extremely hard, but the likelihood function can be simplified by assuming the existence of (and values for) additional missing or hidden parameters.
*  arg max L ( | U )  arg max p (U | )

 arg max

5/29/2016

N
 pu
i 1
i
|    arg max

Fatih Gelgi, ASU’05
M






   j p j ui |  j 
i 1  j 1

N
3
Key Idea…

- The observed data U is generated by some distribution and is called the incomplete data.
- Assume that a complete data set exists, Z = (U, J), where J is the missing or hidden data.
- Maximize the posterior probability of the parameters $\theta$ given the data U, marginalizing over J:

$$\theta^* = \arg\max_\theta \sum_{J} P(\theta, J \mid U)$$

Intuitive Explanation of EM

- Alternate between estimating the unknowns $\theta$ and the hidden variables J.
- In each iteration, instead of finding the best $J \in \mathcal{J}$, compute a distribution over the space $\mathcal{J}$.
- EM is a lower-bound maximization process (Minka, 1998).
  - E-step: construct a local lower bound to the posterior distribution.
  - M-step: optimize the bound.
Intuitive Explanation of EM

- Lower-bound approximation method.
- Sometimes provides faster convergence than gradient descent and Newton's method.
Example: Mixture Components
Example (cont'd): True Likelihood of Parameters
Example (cont'd): Iterations of EM
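As a concrete companion to this example, here is a minimal Python sketch of EM for a 1-D, two-component Gaussian mixture; the toy data, the initialization, and the fixed number of iterations are illustrative assumptions, not taken from the slides.

import numpy as np
from scipy.stats import norm

def em_gmm_1d(u, n_iters=50):
    """EM for a two-component 1-D Gaussian mixture:
    alternate E-step (responsibilities) and M-step (weights, means, std devs)."""
    # Initial guess theta^0 = (alpha, mu, sigma) for the two components
    alpha = np.array([0.5, 0.5])
    mu = np.array([u.min(), u.max()])
    sigma = np.array([u.std(), u.std()])
    for _ in range(n_iters):
        # E-step: responsibility of component j for point u_i,
        # proportional to alpha_j * p_j(u_i | theta_j)
        dens = alpha * norm.pdf(u[:, None], loc=mu, scale=sigma)   # shape (N, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibility-weighted data
        Nk = resp.sum(axis=0)
        alpha = Nk / len(u)
        mu = (resp * u[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (u[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return alpha, mu, sigma

# Toy data: samples from two Gaussian clusters
u = np.concatenate([np.random.normal(-2.0, 1.0, 200), np.random.normal(3.0, 0.5, 300)])
print(em_gmm_1d(u))

Each iteration computes responsibilities (E-step) and then re-estimates the mixture weights, means, and standard deviations (M-step), mirroring the $\alpha_j\,p_j(u_i \mid \theta_j)$ terms in the likelihood above.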
Lower-bound Maximization
Posterior probability $\rightarrow$ logarithm of the joint distribution:

$$\theta^* = \arg\max_\theta \sum_{J} P(\theta, J \mid U)
           = \arg\max_\theta \log P(U, \theta)
           = \arg\max_\theta \log \sum_{J \in \mathcal{J}^n} P(U, J, \theta) \quad \text{(difficult!)}$$

Idea: start with a guess $\theta^t$, compute an easily computed lower bound $B(\theta; \theta^t)$ to the function $\log P(\theta \mid U)$, and maximize that bound instead.
Lower-bound Maximization (cont.)


- Construct a tractable lower bound $B(\theta; \theta^t)$ that contains a sum of logarithms.
- $f^t(J)$ is an arbitrary probability distribution.
- By Jensen's inequality (see the bound below):
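In the slides' notation, the bound obtained from Jensen's inequality can be sketched as follows (the standard form used in Minka, 1998 and Dellaert, 2002):

$$\log P(U, \theta) = \log \sum_{J} f^t(J)\,\frac{P(U, J, \theta)}{f^t(J)}
\;\ge\; \sum_{J} f^t(J)\,\log \frac{P(U, J, \theta)}{f^t(J)} \;=\; B(\theta;\theta^t)$$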
Optimal Bound



B(; t) touches the objective function log
P(U,) at t.
Maximize B(t; t) with respect to ft(J):
Introduce a Lagrange multiplier  to enforce
the constraint
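A sketch of the resulting constrained objective, following the standard lower-bound derivation (Dellaert, 2002):

$$\Lambda = \lambda\Big(1 - \sum_J f^t(J)\Big) + \sum_J f^t(J)\,\log P(U, J, \theta^t) - \sum_J f^t(J)\,\log f^t(J)$$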
Optimal Bound (cont.)

- Take the derivative with respect to $f^t(J)$ and set it to zero; the bound is maximized at the posterior over the hidden data, $f^t(J) = P(J \mid U, \theta^t)$, as sketched below.
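A sketch of the standard steps:

$$\frac{\partial \Lambda}{\partial f^t(J)} = \log P(U, J, \theta^t) - \log f^t(J) - 1 - \lambda = 0
\quad\Longrightarrow\quad
f^t(J) = \frac{P(U, J, \theta^t)}{\sum_{J'} P(U, J', \theta^t)} = P(J \mid U, \theta^t)$$

The multiplier $\lambda$ is fixed by the normalization constraint.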
Maximizing the Bound

- Re-write $B(\theta; \theta^t)$ with respect to expectations under $f^t(J) = P(J \mid U, \theta^t)$; the resulting $Q$ function and update are sketched below.
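A sketch of the rewrite; the second (entropy) term does not depend on $\theta$, so only the expected complete-data log-joint matters:

$$B(\theta;\theta^t) = \sum_J f^t(J)\,\log P(U, J, \theta) - \sum_J f^t(J)\,\log f^t(J)$$

where

$$Q(\theta;\theta^t) \;=\; \sum_J P(J \mid U, \theta^t)\,\log P(U, J, \theta) \;=\; E_{J \mid U, \theta^t}\big[\log P(U, J, \theta)\big].$$

Finally,

$$\theta^{t+1} = \arg\max_\theta B(\theta;\theta^t) = \arg\max_\theta Q(\theta;\theta^t).$$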
EM Algorithm

- EM converges to a local maximum of $\log P(U, \theta)$, and hence to a local maximum of $\log P(\theta \mid U)$.
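One iteration of the algorithm in the notation above (a sketch consistent with the preceding derivation):

E-step: compute $f^t(J) = P(J \mid U, \theta^t)$ and form $Q(\theta;\theta^t) = E_{J \mid U, \theta^t}\big[\log P(U, J, \theta)\big]$.
M-step: set $\theta^{t+1} = \arg\max_\theta Q(\theta;\theta^t)$.
Repeat until $\theta^t$ (or the bound) stops changing.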
A Relation to the Log-Posterior

- An alternative is to compute the expected log-posterior, which leads to the same maximization with respect to $\theta$; a sketch of the equivalence follows.
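Since $P(\theta \mid U, J) = P(U, J, \theta) / P(U, J)$ and the denominator does not depend on $\theta$,

$$\arg\max_\theta E_{J \mid U, \theta^t}\big[\log P(\theta \mid U, J)\big]
 = \arg\max_\theta E_{J \mid U, \theta^t}\big[\log P(U, J, \theta)\big]
 = \arg\max_\theta Q(\theta;\theta^t).$$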
Generalized EM

- Assume $\ln p(X \mid \theta)$ and the bound $B$ are differentiable in $\theta$. The EM likelihood then converges to a point where $\nabla_\theta \ln p(X \mid \theta) = 0$.
- GEM: instead of setting $\theta^{t+1} = \arg\max_\theta B(\theta;\theta^t)$, just find a $\theta^{t+1}$ such that $B(\theta^{t+1};\theta^t) > B(\theta^t;\theta^t)$.
- GEM is also guaranteed to converge.
HMM – Baum-Welch Revisited
- Estimate the parameters $(a, b, \pi)$ such that the expected number of correct individual states is maximized.
- $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$.
- $\xi_t(i,j)$ is the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$.
Baum-Welch: E-step
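A sketch of the E-step quantities in standard forward-backward notation, where $\alpha_t(i)$ and $\beta_t(i)$ are the forward and backward variables and $b_j(O_{t+1})$ is the emission probability of the next observation:

$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{k} \alpha_t(k)\,\beta_t(k)},
\qquad
\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(O_{t+1})\,\beta_{t+1}(j)}
                  {\sum_{k}\sum_{l} \alpha_t(k)\,a_{kl}\,b_l(O_{t+1})\,\beta_{t+1}(l)}$$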
Baum-Welch: M-step
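A sketch of the standard re-estimation (M-step) formulas, for observations $O_1,\dots,O_T$ and output symbols $v_k$:

$$\bar{\pi}_i = \gamma_1(i),
\qquad
\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
\qquad
\bar{b}_j(k) = \frac{\sum_{t \,:\, O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$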
K-Means


- Problem: given data X and the number of clusters K, find the clusters.
- Clustering is based on centroids:

$$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$$

- A point belongs to the cluster with the closest centroid.
- Hidden variables: the centroids of the clusters!
K-Means (cont.)
Starting with initial centroids $\theta^0$:

- E-step: split the data into K clusters according to distances to the centroids (calculate the distribution $f^t(J)$).
- M-step: update the centroids (calculate $\theta^{t+1}$).

A minimal sketch of this loop is given below.
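A minimal Python sketch of the loop (the data, K, the random initialization, and the handling of empty clusters are illustrative assumptions):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """EM-style K-means: alternate cluster assignment (E-step) and centroid update (M-step)."""
    rng = np.random.default_rng(seed)
    # theta^0: pick K distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest centroid (a hard f^t(J))
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points (theta^{t+1})
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels

# Example usage on toy 2-D data: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(X, K=2)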
K-Means Example (K = 2)

Pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!
Discussion

- Is EM a Primal-Dual algorithm?
References

- A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38.
- F. Dellaert, "The Expectation Maximization Algorithm", Tech. Rep. GIT-GVU-02-20, 2002.
- T. Minka, "Expectation-Maximization as Lower Bound Maximization", 1998.
- Y. Chang and M. Kölsch, presentation: "Expectation Maximization", UCSB, 2002.
- K. Andersson, presentation: "Model Optimization Using the EM Algorithm", COSC 7373, 2001.
Thanks!