EM Algorithm
Presented by: Christopher Morse
Department of Computer Science
University of Vermont
Spring 2013
Copyright Note:

This presentation is based on the paper:
– Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR 0501537.

Sections 1 and 4 come from Professor Taiwen Yu's "EM Algorithm".
Sections 2, 3, and 6 come from Professor Andrew W. Moore's "Clustering with Gaussian Mixtures".
Section 5 was edited by Haiguang Li.
Section 7 was edited by Christopher Morse.
Contents
1. Introduction
2. Example: Silly Example
3. Example: Same Problem with Hidden Info
4. Example: Normal Sample
5. EM-Algorithm Explained
6. EM-Algorithm Running on GMM
7. EM-Algorithm Application: Semi-Supervised Text Classification
8. Questions
Introduction
– The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.
– They pointed out that the method had been "proposed many times in special circumstances" by earlier authors.
– EM is typically used to compute maximum likelihood estimates given incomplete samples.
– The EM algorithm estimates the parameters of a model iteratively.
  – Starting from some initial guess, each iteration consists of:
    – an E step (Expectation step)
    – an M step (Maximization step)
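To make the E step and M step concrete, here is a minimal illustrative sketch (my own, not taken from the slides) of EM for a mixture of two biased coins: each trial of 10 flips comes from one of two coins with unknown heads probabilities, and the identity of the coin is the hidden information. The trial counts and starting guesses below are made up for illustration.

```python
import numpy as np

# Heads counts for 5 trials of 10 flips each; which coin produced each
# trial is the hidden (missing) information.  Data are made up.
heads = np.array([5, 9, 8, 4, 7])
flips = 10

theta = np.array([0.6, 0.5])   # initial guess for each coin's P(heads)

for _ in range(20):
    # E-step: posterior probability that each trial came from each coin
    # (uniform prior over the two coins), using the binomial likelihood.
    like = np.array([
        theta_c ** heads * (1 - theta_c) ** (flips - heads)
        for theta_c in theta
    ])                                # shape (2, n_trials)
    resp = like / like.sum(axis=0)    # responsibilities; columns sum to 1

    # M-step: re-estimate each coin's bias from expected heads/flips counts.
    theta = (resp @ heads) / (resp.sum(axis=1) * flips)

print("estimated P(heads):", theta)
```

Each pass fills in the missing coin assignments probabilistically (E step) and then re-maximizes the likelihood as if those soft assignments were observed (M step), which is exactly the pattern the rest of the presentation follows.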
Applications
– Filling in missing data in samples
– Discovering the value of latent variables
– Estimating the parameters of HMMs
– Estimating the parameters of finite mixtures
– Unsupervised learning of clusters
– Semi-supervised classification and clustering
Silly Example
Same Problem with Hidden Info
Normal Sample

Sampling: draw an i.i.d. sample $\mathbf{x} = (x_1, \ldots, x_N)$ from a normal distribution $N(\mu, \sigma^2)$.

Maximum Likelihood: the likelihood of the sample is
$$L(\mu, \sigma^2 \mid \mathbf{x}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Given x, it is a function of $\mu$ and $\sigma^2$. We want to maximize it.

Log-Likelihood Function: maximize this instead:
$$\ell(\mu, \sigma^2) = \log L = -\frac{N}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$

Maximizing the Log-Likelihood Function: by setting $\partial \ell / \partial \mu = 0$ and $\partial \ell / \partial \sigma^2 = 0$, we obtain the maximum likelihood estimates
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$$
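As a quick check of the closed-form result above, here is a small numpy sketch (my own, not from the slides) that draws a normal sample and evaluates the log-likelihood at the maximizers; the sample size and true parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # arbitrary true mu=2, sigma=1.5
N = x.size

# Closed-form maximizers of the log-likelihood derived above.
mu_hat = x.sum() / N
sigma2_hat = ((x - mu_hat) ** 2).sum() / N        # note: divides by N, not N-1

def log_likelihood(mu, sigma2):
    """Log-likelihood of the sample under N(mu, sigma2)."""
    return -0.5 * N * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

print("mu_hat =", mu_hat, "sigma2_hat =", sigma2_hat)
print("log-likelihood at the MLE:", log_likelihood(mu_hat, sigma2_hat))
# Nudging the parameters away from the MLE can only lower the log-likelihood:
print("slightly off:", log_likelihood(mu_hat + 0.1, sigma2_hat))
```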
EM-Algorithm Explained
– Begin with classification
– Solve the problem using another method: the parametric method
– Use our model for classification
– EM clustering algorithm
– E-M
– Comparison to k-means
EM Running on GMM
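The GMM slides in this section are mostly figures, so here is a minimal one-dimensional sketch (my own illustration, not the slides' code) of EM for a two-component Gaussian mixture: the E-step computes responsibilities with Bayes' rule, and the M-step re-estimates the mixing weights, means, and variances from those responsibilities. The synthetic data and starting values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two Gaussians (parameters are arbitrary).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

K = 2
pi = np.full(K, 1.0 / K)          # mixing weights
mu = np.array([-1.0, 1.0])        # initial means
var = np.array([1.0, 1.0])        # initial variances

def gauss(x, mu, var):
    """Normal density N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibility of each component for each point (Bayes' rule).
    dens = np.stack([pi[k] * gauss(x, mu[k], var[k]) for k in range(K)])  # (K, N)
    resp = dens / dens.sum(axis=0)

    # M-step: weighted maximum likelihood updates using the responsibilities.
    Nk = resp.sum(axis=1)                       # effective count per component
    pi = Nk / x.size
    mu = (resp @ x) / Nk
    var = (resp * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk

print("weights:", pi, "means:", mu, "variances:", var)
```

This is the same E/M pattern as before; the only change is that each M-step update is the weighted version of the normal-sample MLE derived in the previous section.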
EM Application: Semi-Supervised Text Classification
"Learning to Classify Text from Labeled and Unlabeled Documents"
K. Nigam, A. McCallum, and T. Mitchell (1998)
Learning to Classify Text from Labeled and Unlabeled Documents
– K. Nigam et al. present a method for building a more accurate text classifier by augmenting a limited amount of labeled training documents with a large number of unlabeled documents.
– Why the interest in unlabeled data?
  – There is an abundance of unlabeled data available for training, but only very limited amounts of labeled data.
  – Labeling data is costly, and too much data is being produced to make a dent in it.
Learning to Classify Text from Labeled and Unlabeled Documents
– The authors present evidence of the efficacy of their method in three main domains: newsgroup postings, web pages, and newswire articles.
– The exponential expansion of textual data demands efficient and accurate methods of classification.
– The high dimensionality of the feature set and the relatively small number of labeled training samples make effective classification difficult.
Building a Semi-Supervised Classifier
– Classify textual documents using a combination of labeled and unlabeled data.
– First, build an initial classifier by calculating model parameters from the labeled documents only.
– Loop until convergence (or a stopping criterion is met), as sketched in code after the E-step/M-step description below:
  – Probabilistically label the unlabeled documents using the classifier.
  – Recalculate the classifier parameters given the probabilistically assigned labels.
The Probabilistic Framework
– Assumptions:
  – Every document d_i is generated according to a probability distribution defined by a mixture model parameterized by $\theta$.
  – The mixture model is composed of components $c_j \in C$, with a one-to-one correspondence between mixture components and classes (i.e., $c_j$ refers to both the j-th component and the j-th class).
The Probabilistic Framework
– Document generation:
  1. A mixture component is selected according to its prior probability $P(c_j \mid \theta)$.
  2. That component generates the document using its own parameters, with distribution $P(d_i \mid c_j; \theta)$.
– Thus the likelihood of a document $d_i$ is
$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$$
Initial Classifier: Naive Bayes
– Documents are considered an "ordered list of word events" [17].
– Probability of a document given its class:
$$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta)$$
where $w_{d_i,k}$ represents the word in position $k$ of document $d_i$.
Naive Bayes
– Naive Bayes in the context of this experiment:
  – "The learning task for the naive Bayes classifier is to use a set of training documents to estimate the mixture model parameters, then use the estimated model to classify new documents." [17]
Step 1: Train a Naive Bayes Classifier Using Labeled Data
– Estimate the model parameters $\theta$ from the labeled documents only, then apply Bayes' rule to classify a document:
$$P(c_j \mid d_i; \theta) = \frac{P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)}{\sum_{r=1}^{|C|} P(c_r \mid \theta)\, P(d_i \mid c_r; \theta)}$$
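The following is a hedged sketch of Step 1, roughly in the spirit of the multinomial naive Bayes setup used by Nigam et al.: estimate Laplace-smoothed class priors and word probabilities from labeled word-count vectors, then apply Bayes' rule to score a new document. The tiny vocabulary, the counts, and the exact smoothing are illustrative assumptions of mine, not the paper's estimator.

```python
import numpy as np

# Toy labeled data: rows are documents, columns are word counts over a
# small vocabulary; y holds the class labels.  Entirely made-up numbers.
X = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 4]], dtype=float)
y = np.array([0, 0, 1, 1])
n_classes = 2

def train_nb(X, resp):
    """Estimate naive Bayes parameters from (possibly soft) class memberships.

    resp[i, j] = probability that document i belongs to class j.
    Uses Laplace (add-one) smoothing for the word probabilities.
    """
    priors = (1.0 + resp.sum(axis=0)) / (n_classes + X.shape[0])
    word_counts = resp.T @ X                               # (classes, vocab)
    word_probs = (1.0 + word_counts) / (X.shape[1] + word_counts.sum(axis=1, keepdims=True))
    return priors, word_probs

def predict_proba(X, priors, word_probs):
    """Apply Bayes' rule: P(c_j | d_i) ∝ P(c_j) * prod_k P(w_k | c_j)^count."""
    log_post = np.log(priors) + X @ np.log(word_probs).T
    log_post -= log_post.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Hard labels as one-hot "responsibilities" for the labeled-only classifier.
resp_labeled = np.eye(n_classes)[y]
priors, word_probs = train_nb(X, resp_labeled)
print(predict_proba(np.array([[1.0, 0, 0, 0]]), priors, word_probs))
```

Writing the trainer in terms of soft class memberships is a deliberate choice: the same function can be reused unchanged in the EM loop of Step 2.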
Step 2: Combine the Labeled Data with Unlabeled Data Using EM
– EM computes maximum likelihood estimates given incomplete data.
– Wait, where's the incomplete data?
– Aha! The unlabeled data is incomplete: it's missing class labels!
Step 2: Combine the Labeled Data with Unlabeled Data Using EM
– E-step:
  – The E-step corresponds to calculating probabilistic labels $P(c_j \mid d_i; \theta)$ for every document using the current estimate of $\theta$, via the Bayes' rule calculation demonstrated previously.
– M-step:
  – The M-step corresponds to calculating a new maximum likelihood estimate for $\theta$ given the current estimates of the document labels, $P(c_j \mid d_i; \theta)$.
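Putting the two steps together, here is a hedged sketch of the outer loop described on the "Building a Semi-Supervised Classifier" slide. It assumes the illustrative train_nb and predict_proba helpers (and n_classes) from the naive Bayes sketch above are in scope, and that X_labeled, y_labeled, and X_unlabeled are word-count matrices you would supply; none of this is the authors' code.

```python
import numpy as np

# Assumes train_nb, predict_proba, and n_classes from the naive Bayes sketch
# above, plus word-count matrices X_labeled (with labels y_labeled) and
# X_unlabeled supplied by the caller.

def semi_supervised_em(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    resp_labeled = np.eye(n_classes)[y_labeled]     # labeled docs keep hard labels

    # Initial classifier from the labeled documents only.
    priors, word_probs = train_nb(X_labeled, resp_labeled)

    X_all = np.vstack([X_labeled, X_unlabeled])
    for _ in range(n_iter):
        # E-step: probabilistic labels P(c_j | d_i; theta) for the unlabeled docs.
        resp_unlabeled = predict_proba(X_unlabeled, priors, word_probs)

        # M-step: new maximum likelihood estimate of theta from all documents,
        # with hard labels for labeled docs and soft labels for unlabeled docs.
        resp_all = np.vstack([resp_labeled, resp_unlabeled])
        priors, word_probs = train_nb(X_all, resp_all)

    return priors, word_probs
```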
Classifier Performance
Results
– The experiments showed significant improvements from using unlabeled documents for training classifiers in three real-world text classification tasks.
– Using unlabeled data requires a closer match between the data and the model than using labeled data alone.
  – This warrants exploring more complex mixture models.
References
17. K. Nigam, A. McCallum, and T. Mitchell, "Learning to Classify Text from Labeled and Unlabeled Documents," pp. 792–799, AAAI Press, 1998.
18. G. Cong, W. S. Lee, H. Wu, and B. Liu, "Semi-Supervised Text Classification Using Partitioned EM," in Database Systems for Advanced Applications, pp. 482–493, 2004.
19. K. Nigam, A. McCallum, and T. M. Mitchell, Semi-Supervised Text Classification Using EM, ch. 3. Boston: MIT Press, 2006.
The End
Thanks very much!
Question #1
– Describe a data mining application for EM.
  – Using EM to improve a classifier by augmenting labeled training data with unlabeled data.
  – The example given illustrated a method for text classification that starts from an initial naive Bayes classifier and assumes documents are generated probabilistically according to an underlying mixture model.
Question #2
– What are the EM algorithm initialization methods?
  – A random guess.
  – Any general classifier that builds a parameterized probability distribution model (e.g., naive Bayes).
  – Initialization by k-means: after a few iterations of k-means, use the resulting parameters to initialize EM.
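As an illustration of the k-means initialization mentioned above, here is a short sketch of my own (assuming scikit-learn is available) that runs a few k-means iterations and converts the result into starting weights, means, and variances for the 1-D GMM EM loop shown earlier; the data are the same kind of arbitrary synthetic sample.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# A few k-means iterations give rough cluster assignments.
km = KMeans(n_clusters=2, max_iter=5, n_init=1, random_state=0).fit(x.reshape(-1, 1))
labels = km.labels_

# Convert the k-means result into initial EM parameters.
pi0 = np.bincount(labels) / labels.size                     # mixing weights
mu0 = km.cluster_centers_.ravel()                           # component means
var0 = np.array([x[labels == k].var() for k in range(2)])   # within-cluster variances

print("initial weights:", pi0, "means:", mu0, "variances:", var0)
# These (pi0, mu0, var0) would replace the arbitrary starting values
# used in the GMM EM loop sketched earlier.
```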
Question #3
– What are the main advantages of parametric methods?
  – You can easily change the model to adapt to different distributions of data sets.
  – Knowledge representation is very compact: once the model is selected, it is represented by a specific number of parameters.
  – The number of parameters does not increase with the amount of training data.