EM Algorithm - Computer Science

EM Algorithm
Presented By: Haiguang Li
Computer Science Department
University of Vermont
Fall 2011
Copyright Note:

This presentation is based on the paper:
Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR0501537.

Sections 1 and 4 come from Professor Taiwen Yu’s “EM Algorithm”.

Sections 2, 3, and 6 come from Professor Andrew W. Moore’s “Clustering with Gaussian Mixtures”.

Section 5 was edited by me.
Contents
1. Introduction
2. Example: Silly Example
3. Example: Same Problem with Hidden Info
4. Example: Normal Sample
5. EM-Main Body
6. EM-Algorithm Running on GMM
Introduction




The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.
They pointed out that the method had been "proposed many times in special circumstances" by earlier authors.
EM is typically used to compute maximum likelihood estimates given incomplete samples.
The EM algorithm estimates the parameters of a model iteratively.
– Starting from some initial guess, each iteration consists of
  – an E step (Expectation step)
  – an M step (Maximization step)
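The iteration described above can be sketched on a small case. The following is a minimal illustration, not the presenter's code: it assumes a two-component one-dimensional Gaussian mixture, and the names `normal_pdf` and `em_two_gaussians`, the synthetic data, and the initial guess are all hypothetical choices made for the example.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(xs, mu, var, weight, n_iter=50):
    """Run EM for a two-component 1-D Gaussian mixture (illustrative sketch).

    mu, var and weight are length-2 lists holding the initial guess.
    """
    mu, var, weight = list(mu), list(var), list(weight)
    n = len(xs)
    for _ in range(n_iter):
        # E step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M step: re-estimate the parameters from responsibility-weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(var[k], 1e-6)  # guard against a collapsing component
            weight[k] = nk / n
    return mu, var, weight


# Synthetic data (an assumption for the demo): two well-separated clusters.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(300)]
        + [random.gauss(5.0, 1.0) for _ in range(300)])
mu, var, weight = em_two_gaussians(
    data, mu=[1.0, 4.0], var=[1.0, 1.0], weight=[0.5, 0.5])
```

With well-separated clusters the estimated means should settle near the true centers. With poor starting values EM may converge to a different local maximum.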
Applications






Filling in missing data in samples
Discovering the value of latent variables
Estimating the parameters of HMMs
Estimating parameters of finite mixtures
Unsupervised learning of clusters
…
EM Algorithm
Silly Example
EM Algorithm
Same Problem
with Hidden Info
EM Algorithm
Normal Sample
$X \sim N(\mu, \sigma^2)$

Normal Sample

Sampling: $x = (x_1, x_2, \ldots, x_n)^T$

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$\hat{\mu} = ?$

$\hat{\sigma}^2 = ?$
Maximum Likelihood

Sampling: $x = (x_1, x_2, \ldots, x_n)^T$

$$L(\mu, \sigma^2 \mid x) = f(x \mid \mu, \sigma^2) = f(x_1 \mid \mu, \sigma^2) \cdots f(x_n \mid \mu, \sigma^2)$$

$$L(\mu, \sigma^2 \mid x) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

Given $x$, it is a function of $\mu$ and $\sigma^2$. We want to maximize it.
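Log-Likelihood Function

Maximize this instead:

$$l(\mu, \sigma^2 \mid x) = \log L(\mu, \sigma^2 \mid x) = \frac{n}{2} \log \frac{1}{2\pi\sigma^2} - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}$$

$$= -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}$$

By setting

$$\frac{\partial}{\partial\mu}\, l(\mu, \sigma^2 \mid x) = 0 \quad \text{and} \quad \frac{\partial}{\partial\sigma^2}\, l(\mu, \sigma^2 \mid x) = 0$$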
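Max. the Log-Likelihood Function

$$l(\mu, \sigma^2 \mid x) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}$$

$$\frac{\partial}{\partial\mu}\, l(\mu, \sigma^2 \mid x) = \frac{1}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu}{\sigma^2} = 0$$

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \hat{\mu}^2$$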
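Max. the Log-Likelihood Function

$$l(\mu, \sigma^2 \mid x) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}$$

$$\frac{\partial}{\partial\sigma^2}\, l(\mu, \sigma^2 \mid x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} x_i^2 - \frac{\mu}{\sigma^4}\sum_{i=1}^{n} x_i + \frac{n\mu^2}{2\sigma^4} = 0$$

$$n\sigma^2 = \sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2$$

$$= \sum_{i=1}^{n} x_i^2 - \frac{2}{n}\left(\sum_{i=1}^{n} x_i\right)^2 + \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$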
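As a numerical check of the derivation above, the sketch below computes the closed-form estimates and confirms that nearby parameter values give a lower log-likelihood. The function names `normal_mle` and `log_likelihood` and the synthetic sample are assumptions made for illustration.

```python
import math
import random


def normal_mle(xs):
    """Closed-form maximum likelihood estimates for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # sigma^2 estimate: (1/n) * sum(x_i^2) - mu_hat^2.
    # This form can lose precision when the mean is large relative to the spread.
    var_hat = sum(x * x for x in xs) / n - mu_hat ** 2
    return mu_hat, var_hat


def log_likelihood(xs, mu, var):
    """l(mu, sigma^2 | x) for an i.i.d. normal sample."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * var)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * var))


# Synthetic sample (an assumption for the demo): mean 2, standard deviation 3.
random.seed(1)
sample = [random.gauss(2.0, 3.0) for _ in range(1000)]
mu_hat, var_hat = normal_mle(sample)
best = log_likelihood(sample, mu_hat, var_hat)

# Moving either parameter away from the estimate should not increase l.
worse_mu = log_likelihood(sample, mu_hat + 0.1, var_hat)
worse_var = log_likelihood(sample, mu_hat, var_hat * 1.1)
```

Note that the variance estimate divides by n rather than n - 1, so it is the maximum likelihood estimate and not the unbiased sample variance.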
EM Algorithm
Main Body
Begin with Classification
Solve the problem using another method: the parametric method
Use our model for classification
EM Clustering Algorithm
E-M
What’s K-means?
EM Algorithm
EM Running
Example
References
1. Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR0501537.
2. Sundberg, Rolf (1974). "Maximum likelihood theory for incomplete data from an exponential family". Scandinavian Journal of Statistics 1 (2): 49–58. JSTOR 4615553. MR381110.
3. Rolf Sundberg. 1971. Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.
4. Sundberg, Rolf (1976). "An iterative method for solution of the likelihood equations for incomplete data from exponential families". Communications in Statistics – Simulation and Computation 5 (1): 55–64. doi:10.1080/03610917608812007. MR443190.
5. See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.
6. G. Kulldorff. 1961. Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.
7. Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")
8. Per Martin-Löf. 1966. Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula" credited to Anders Martin-Löf).
9. Per Martin-Löf. 1970. Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969-1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")
10. Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.
11. Martin-Löf, Per. The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data. Scand. J. Statist. 1 (1974), no. 1, 3–18.
12. Wu, C. F. Jeff (Mar. 1983). "On the Convergence Properties of the EM Algorithm". Annals of Statistics 11 (1): 95–103. doi:10.1214/aos/1176346060. JSTOR 2240463. MR684867.
13. Neal, Radford; Hinton, Geoffrey (1999). Michael I. Jordan. ed. "A view of the EM algorithm that justifies incremental, sparse, and other variants". Learning in Graphical Models (Cambridge, MA: MIT Press): 355–368. ISBN 0262600323. Retrieved 2009-03-22.
14. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). "8.5 The EM algorithm". The Elements of Statistical Learning. New York: Springer. pp. 236–243. ISBN 0-387-95284-5.
15. Jamshidian, Mortaza; Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by using Quasi-Newton Methods". Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR1452026.
16. Meng, Xiao-Li; Rubin, Donald B. (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework". Biometrika 80 (2): 267–278. doi:10.1093/biomet/80.2.267. MR1243503.
17. Hunter DR and Lange K (2004), A Tutorial on MM Algorithms, The American Statistician, 58: 30–37.
The End

Thanks very much!
Question #1

What are the main advantages of parametric methods?
– You can easily change the model to adapt to data sets with different distributions.
– Knowledge representation is very compact. Once the model is selected, it is represented by a fixed number of parameters. The number of parameters does not grow as the amount of training data increases.
Question #2

What are the EM algorithm initialization methods?
– Random guess.
– Initialization by k-means: run a few iterations of k-means, then use the resulting parameters to initialize EM.
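The k-means initialization can be sketched as follows for one-dimensional data. This is an illustrative assumption about how such an initializer might look, not code from the presentation. `kmeans_init` is a hypothetical helper that starts from a random guess, runs a few k-means iterations, and turns the resulting clusters into starting means, variances, and mixing weights for EM.

```python
import random


def kmeans_init(xs, k, n_iter=5, seed=0):
    """Return (means, variances, weights) to start EM on a k-component mixture.

    Hypothetical helper for 1-D data; n_iter is the "few iterations" of k-means.
    """
    rng = random.Random(seed)
    centers = rng.sample(xs, k)  # random guess: k data points as centers
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Hard assignment: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Update: each center becomes the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    variances, weights = [], []
    for j, c in enumerate(clusters):
        if len(c) > 1:
            variances.append(sum((x - centers[j]) ** 2 for x in c) / len(c))
        else:
            variances.append(overall_var)  # fallback for tiny clusters
        weights.append(len(c) / n)
    return centers, variances, weights


# Synthetic data (an assumption for the demo): two clusters near 0 and 6.
rng = random.Random(42)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(6.0, 1.0) for _ in range(200)])
means, variances, weights = kmeans_init(data, k=2)
```

The returned triple can be passed to an EM routine as its initial guess. A purely random guess would skip the loop and use the sampled centers directly.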
Question #3

What are the differences between EM and K-means?
– K-means is a simplified EM.
– K-means makes a hard decision, while EM makes a soft decision, when updating the parameters of the model.
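The hard versus soft distinction can be seen on a single point. The sketch below is an illustration under simplifying assumptions: the soft assignment uses Gaussian components with equal weights and a shared variance, and the names `soft_assign` and `hard_assign` are hypothetical.

```python
import math


def soft_assign(x, mus, var=1.0):
    """EM-style responsibilities, assuming equal weights and a shared variance."""
    scores = [math.exp(-(x - m) ** 2 / (2.0 * var)) for m in mus]
    total = sum(scores)
    return [s / total for s in scores]


def hard_assign(x, mus):
    """K-means-style assignment: all of the point goes to the nearest center."""
    nearest = min(range(len(mus)), key=lambda k: (x - mus[k]) ** 2)
    return [1.0 if k == nearest else 0.0 for k in range(len(mus))]


centers = [0.0, 4.0]
soft = soft_assign(1.5, centers)   # fractional membership in both clusters
hard = hard_assign(1.5, centers)   # full membership in the nearest cluster
# Shrinking the shared variance pushes the soft assignment toward the hard one.
nearly_hard = soft_assign(1.5, centers, var=0.01)
```

Under these assumptions, the soft assignment approaches the hard one as the variance shrinks, which is one way to read the statement that K-means is a simplified EM.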