The Expectation Maximization Algorithm
XXXXXXXXXXX
yyyyyyyy@zzzzzzzz
Abstract
In this paper we take a look at the Expectation Maximization algorithm and an example of its
use in a real-world application.
1 Introduction
The Expectation Maximization (EM) algorithm was first introduced almost 30 years ago in
(Dempster et al., 1977) [1], and since then it has found wide use in many fields. You don't need
to take my word for this; you can prove it to yourself by doing a Google search on “Expectation
Maximization”. At the time of this writing such a search returns 66,100 links to pages that
contain tutorials on EM and applications that use EM. Looking over the links returned by the
search shows that EM is used in Bioinformatics, reinforcement learning in neural computation,
and cognitive mapping; and that is only in the first 10 links.
So what is it that makes the EM algorithm so appealing for use in so many applications? One
reason is that it allows us to solve problems with hidden or unknown variables. In many
machine learning algorithms, such as decision trees and neural networks, you have to have an
observed value for an attribute to use that attribute in classification. The Expectation
Maximization algorithm, however, not only allows you to use data that is only occasionally
observed, it also allows you to use data that is never directly observed, as long as the probability
distribution of the unseen data is known [2]. But that's not the only reason it's so popular. [5]
tells us that, in addition to its usefulness in problems with unseen data, the EM algorithm can be
used to reduce the complexity of problems by introducing hidden variables. So how does the EM
algorithm allow us to do this? To answer that question we need to understand how it works.
2 The Expectation Maximization Algorithm
[2] begins its explanation of the EM algorithm with an introductory example of how it
works. Since the author has been doing this sort of thing a lot longer than I have, I'll mimic his
approach to the point of even using the same example, explained as I understand it.
2.1 An Introductory Example: K-Means
The k-means problem is as follows. Imagine you have a number of data points generated by k
different Gaussian distributions. By looking at just the data points and by knowing only the
number of Gaussian distributions and their variance, can you figure out what the means of the k
distributions are? With the EM algorithm you can determine a hypothesis h that has the maximum
likelihood of creating the observed data. The example that [2] starts with is an instance of the
k-means problem where we're given that k=2, the variances of the 2 distributions are identical
(that is, σ² is the same for both) and known, and the observed data points are marked on
the x axis. This setup is shown in figure 1. All figures for this example are from [3].
Figure 1. k-means with k=2
So here's what we have in our problem:
• A hypothesis: it takes the form <µ1, µ2>, the proposed means for distributions 1 and 2.
• The observed variables: the x values, each of which was generated by one of the two
distributions.
• The hidden variables: each of our x values can be thought of as having 2 associated z
values that indicate which Gaussian distribution generated the x value: z1 = 1 and
z2 = 0 if the x value was generated by the 1st distribution, or z1 = 0 and z2 = 1 if it
was generated by the 2nd distribution. Since these variables are hidden, their
true values are not known and must be estimated via the EM algorithm.
With this problem formulation, [2] shows us how we can use the Expectation Maximization
algorithm to calculate a maximum likelihood hypothesis for the means of the Gaussian
distributions that generated the observed data in the following manner:
• First, we generate an initial hypothesis by picking some arbitrary values for the means.
Once we have our initial hypothesis h = <µ1, µ2>, we can alternate between the next
two steps until the maximum likelihood hypothesis is found.
• These two steps are the expectation step and the maximization step, from whence the
EM algorithm derives its name.
• The ‘Expectation’ step: In the expectation step, we generate the expected values E[zij]
given our current hypothesis. That is, we calculate the probability that xi was
generated by the jth distribution. The formula for E[zij] appears in figure 2:
Figure 2. Formula for Calculating E[zij]
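Since the figure itself is an image, here is my reconstruction of the formula it contains, based
on the presentation in [2] (written in LaTeX; it assumes the two equal-variance Gaussians
described above):

    E[z_{ij}] \;=\; \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)}
             \;=\; \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}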
We can see from this formula that the expected value for each zij is the probability of
xi given µj divided by the sum, over both distributions, of the probability of xi given each µ.
• The ‘Maximization’ step: Once we have calculated all the E[zij] values, we can
calculate new µ values via the formula in figure 3:
Figure 3. Formula for Calculating µj
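Again reconstructing the figure's contents from [2] (here m denotes the number of observed
data points, a symbol not used elsewhere in this paper), the update is the E[zij]-weighted mean
of the data:

    \mu_j \;\leftarrow\; \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}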
This formula generates the maximum likelihood hypothesis: each new µj is simply the mean of
the data points, with each point weighted by the probability that distribution j generated it. [2]
shows how to obtain this formula via a derivation from the general statement of the EM algorithm.
By repeating the Expectation step (E-step) and the Maximization step (M-step), [1] shows that
the algorithm will converge to a local maximum of the likelihood and give us a maximum
likelihood estimate for our hypothesis.
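To make the two steps concrete, here is a minimal sketch of this two-Gaussian EM procedure in
Python. It is my own illustration, not code from [2] or [3]; the names (em_two_gaussians,
n_iters, and so on) are invented, and it assumes, as in the example above, that the shared
variance sigma2 is known and the mixture weights are equal.

import numpy as np

def em_two_gaussians(x, sigma2, n_iters=50, seed=0):
    """EM for a mixture of two equal-variance Gaussians whose shared variance is known.

    x      : 1-D array of observed points
    sigma2 : the known variance sigma^2 shared by both distributions
    Returns the estimated means <mu1, mu2>.
    """
    rng = np.random.default_rng(seed)
    # Initial hypothesis h = <mu1, mu2>: two arbitrary (here, randomly chosen) data points
    mu = rng.choice(x, size=2, replace=False).astype(float)

    for _ in range(n_iters):
        # E-step: E[z_ij], the probability that x_i was generated by distribution j.
        # The shared normalizing constant of the Gaussians cancels in the ratio.
        dens = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2.0 * sigma2))
        e_z = dens / dens.sum(axis=1, keepdims=True)

        # M-step: each new mu_j is the E[z_ij]-weighted mean of the data
        mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

    return mu

# Usage: points drawn from two Gaussians with means 0 and 5 and shared variance 1
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
print(em_two_gaussians(data, sigma2=1.0))   # roughly [0., 5.] (order may vary)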
2.2 A More General Look at the EM Algorithm
Now that the k-means example has given us an idea of how the EM algorithm works, we can
look at a more general description of the EM algorithm. [2, 3, 4] use the following symbols to
define the problem:
Table 1. Symbols used to define the EM algorithm

Symbol     Meaning                                                  Example
θ          The parameters we want to learn                          <µ1, µ2>
X          The set of observed values                               {x1, x2, ..., xn}
Z          The set of unknown or hidden variables                   <zi1, zi2>
Y          The complete data, the set of all X ∪ Z                  <xi, zi1, zi2>
h          The current hypothesis for θ
h'         The revised hypothesis for θ
P(Y|h')    The likelihood of the full data Y given hypothesis h'
[2] tells us that the goal of using the EM algorithm is to find the h' with the maximum
likelihood, which it does by maximizing E[ln P(Y|h')]. It claims that by maximizing ln P(Y|h') we
are also maximizing P(Y|h'), which makes sense intuitively since the logarithm is a monotonically
increasing function. The expectation E[] appears because [2] claims that we can treat Z (and
therefore Y, since it depends on Z) as a random variable. I interpret this to mean that because Z
is random it can only have expected values and not exact values. After all, if we knew the exact
values in Z we wouldn't need the EM algorithm, because the variables wouldn't be hidden. So
we're told that we "take the expected value E[ln P(Y|h')] over the probability distribution
governing the random variable Y" [2], which we get from the distributions for X and Z and which
is determined by the current hypothesis h. Now you may recall from our example that the
expectation step of the EM algorithm used an expected value, and that is exactly where this
formula is heading. [2] transforms the function one more time to give us the function shown in
figure 4:
Figure 4. Formula for the General Expectation Step
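As best I can reconstruct it from [2] (the figure is an image), the function being defined, usually
written Q, is:

    Q(h' \mid h) \;=\; E\big[\ln P(Y \mid h') \,\big|\, h, X\big]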
This formula adds in the "| h, X" to make explicit the relation of the expected value to
the observed data X and the current hypothesis h. Once we have this formula we can state the 2
steps of the Expectation Maximization algorithm in their general form:
• Step 1, the Expectation step: Calculate Q(h'|h) via the formula in figure 4, using the
current hypothesis h and the observed data X to estimate the probability distribution
over Y.
• Step 2, the Maximization step: Get our next hypothesis via the following formula:
Figure 5. Formula for the General Maximization Step
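Reconstructing this figure's contents from [2] as well, the maximization step replaces the
hypothesis h by the h' that maximizes Q:

    h \;\leftarrow\; \underset{h'}{\arg\max}\; Q(h' \mid h)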
[2] tells us that as long as Q is continuous, EM will converge to a local maximum. But is this
good enough? If we want a global maximum, what do we do?
[5] says that since the EM algorithm is a hill-climbing algorithm, it can only be guaranteed to
find a local maximum and not necessarily the global maximum. Many applications want to get as
close to the global maximum as possible, so [4] suggests two methods to handle this. The first
is rather intuitive and easy to implement, but is more computationally intensive: run the
EM algorithm repeatedly, starting with different random values for the hidden variables, and
take the highest-likelihood result from all the runs (a sketch of this approach appears below).
The second suggestion takes more thought and may not be possible in all situations, but it
requires less computation. This approach is to try simplifying the model being considered so that
it only contains a global maximum, and then to use the maximum of that simplified model as a
'best guess' for the starting point in the more complex model.
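As a minimal sketch of the first method (random restarts), reusing the em_two_gaussians
function from the earlier sketch and scoring each run by its log-likelihood (the helper names
here are again invented):

import numpy as np

def log_likelihood(x, mu, sigma2):
    """Log-likelihood of the data under an equal-weight, equal-variance two-Gaussian mixture."""
    dens = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2.0 * sigma2))
    dens /= np.sqrt(2.0 * np.pi * sigma2)
    return float(np.log(0.5 * dens.sum(axis=1)).sum())

def em_with_restarts(x, sigma2, n_restarts=10):
    """Run EM from several random starting points and keep the highest-likelihood result."""
    best_mu, best_ll = None, -np.inf
    for seed in range(n_restarts):
        mu = em_two_gaussians(x, sigma2, seed=seed)   # a different random initialization per run
        ll = log_likelihood(x, mu, sigma2)
        if ll > best_ll:
            best_mu, best_ll = mu, ll
    return best_mu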
3 An Application Using the EM Algorithm
As I stated in my introduction, the EM algorithm is used in a wide variety of areas. Here is one
such example in detail:
3.1 Robotics: Mapping Indoor Spaces
In [6] the authors outline how current robotic mapping approaches are deficient: 2D mapping
procedures don't scale well to 3D, and 3D mapping via sets of fine-grained polygons results in
complex maps that yield inaccurate results when off-the-shelf simplification algorithms are
applied.
In response to these problems, the authors propose that by making 2 assumptions, the
first involving robot positioning and the second involving a bias towards flat surfaces, they can
use a variation of the EM algorithm to compute maps that are low-complexity and have reduced
noise.
In the expectation step of their application of the EM algorithm, they use the
following symbols:
• θ – This represents the set of components in the model. Each component of
the model is a 3D map of a rectangular surface (such as a door, wall, or
window), plus polygons for non-planar objects.
• zi – This represents the ith actual measurement.
• cij – This represents the correspondence between the flat components in the
model and the actual measurements. This variable is 1 iff the ith
measurement zi corresponds to the jth surface in θ.
• ci* – A special correspondence variable for random measurement noise and/or
non-planar objects in the world.
These symbols are used in the following formulas:
Figure 6. Formulas for the Expectation step
So to relate this to the general idea of the EM algorithm, the zi measurements are the
observed data, θ is the current hypothesis, and the cij values are the 'hidden' data.
In their maximization step, they want to maximize the 'log likelihood' of the map. Since
the terms of the log likelihood that do not depend on θ can be dropped, they tell us they can
accomplish this maximization via the following minimization formula:
Figure 7. Formula for the Maximization step
The authors of [6] give these formulas for the basic EM algorithm. However, they go on
to modify the EM algorithm to deal with real-time inputs. They note that since EM requires
multiple passes over the data it is an “inherently offline” algorithm. To make it real time, they
make use of their insight that “during each fixed time interval, only a constant number of new
range measurements arrive”. By taking advantage of this fact they incorporate the new inputs
into the model as time progresses.
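The formulas in the figures above are the authors' own; purely as an illustration of what an
E-step and M-step of this general shape can look like (this is not the method of [6], and the
names e_step, m_step, and noise_density are invented), a planar-mixture EM might be sketched in
Python as follows, assuming each flat component of θ is a plane n·z = d with unit normal n, and
Gaussian measurement noise with variance sigma2:

import numpy as np

def e_step(points, planes, sigma2, noise_density=1e-3):
    """Expected correspondences between measurements and planar surfaces.

    points : (N, 3) array of 3D measurements z_i
    planes : list of (normal, offset) pairs, each describing a surface n·z = d
    Returns an (N, J+1) array of responsibilities; the last column plays the
    role of c_i* (noise / non-planar objects).
    """
    dists = np.stack([np.abs(points @ n - d) for (n, d) in planes], axis=1)
    dens = np.exp(-dists ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    dens = np.concatenate([dens, np.full((len(points), 1), noise_density)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(points, resp):
    """Refit each plane by weighted least squares, weighting points by their responsibilities."""
    planes = []
    for j in range(resp.shape[1] - 1):          # skip the noise column
        w = resp[:, j]
        centroid = (w[:, None] * points).sum(axis=0) / w.sum()
        centered = points - centroid
        cov = (w[:, None] * centered).T @ centered
        # Plane normal = eigenvector with the smallest eigenvalue of the weighted covariance
        eigvals, eigvecs = np.linalg.eigh(cov)
        n = eigvecs[:, 0]
        planes.append((n, float(n @ centroid)))
    return planes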
So the authors of [6] offer up the formulas and ideas for their 3D mapping, but the
question is: how well does it all work? To get an idea of the efficacy of their approach, here is
one sample of their output:
Figure 8. Results of real-time EM for robotic mapping
In this figure the right column is the EM-generated mapping, while the left column is the
fine-grained polygonal mapping. The EM algorithm generates noticeably clearer mappings. So
we can see that, by using the EM algorithm, the authors of this paper were able to reduce the
complexity of their work and improve their outputs.
4 Conclusions
So what can we say about the EM algorithm? First we do have to note that it has some
limitations. The first of these is that when it's dealing with unseen data you still need to know
the probability distribution of that data in order to use it. But for many applications this
constraint does not seem to be a problem. Another problem we already mentioned is that, since
it is a hill-climbing algorithm, it is susceptible to finding a local maximum instead of the global
one. However, at the end of section 2.2 we already noted how [4] suggested handling this
problem. And despite these limitations, as I noted in the introduction, the EM algorithm is widely
used in numerous areas. So we can conclude that the EM algorithm offers a powerful, adaptable
method for dealing with unseen data and/or reducing computational complexity.
References
[1] Dempster, A., Laird, N., and Rubin, D. (1977). "Maximum likelihood from incomplete data via
the EM algorithm." Journal of the Royal Statistical Society, Series B, 39(1):1–38.
[2] Mitchell, T. (1997). Machine Learning. The McGraw-Hill Companies, Inc., New York, NY.
[3] http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch6.ps
[4] http://sifaka.cs.uiuc.edu/course/397cxz03f/em-note.pdf
[5] Bilmes, J. (1998). "A gentle tutorial of the EM algorithm and its application to parameter
estimation for Gaussian mixture and hidden Markov models." Technical Report ICSI-TR-97-021,
International Computer Science Institute, Berkeley, CA.
[6] Thrun, S., Martin, C., Liu, Y., Hahnel, D., Emery-Montemerlo, R., Chakrabarti, D., and
Burgard, W. (2003). "A Real-Time Expectation Maximization Algorithm for Acquiring Multi-Planar
Maps of Indoor Environments with Mobile Robots." IEEE Transactions on Robotics and Automation.