The Expectation Maximization Algorithm

XXXXXXXXXXX
yyyyyyyy@zzzzzzzz

Abstract

In this paper we take a look at the Expectation Maximization algorithm and an example of its use in a real-world application.

1 Introduction

The Expectation Maximization (EM) algorithm was first introduced almost 30 years ago in (Dempster et al., 1977) [1], and since then it has found wide use in many fields. You don't need to take my word for this; you can prove it to yourself by doing a Google search on "Expectation Maximization". At the time of this writing such a search returns 66,100 links to pages that contain tutorials on EM and applications that use EM. Looking over the links returned by the search shows that EM is used in bioinformatics, reinforcement learning in neural computation, and cognitive mapping, and that is only in the first 10 links.

So what is it that makes the EM algorithm so appealing for use in so many applications? One reason is that it allows the computation of problems with hidden or unknown variables. In many machine learning algorithms, such as decision trees and neural networks, you must have an observed value for an attribute in order to use that attribute in classification. The Expectation Maximization algorithm, however, not only allows you to use data that is only occasionally observed, it also allows you to use data that is never directly observed, as long as the probability distribution of the unseen data is known [2]. But that's not the only reason it's so popular. [5] tells us that, in addition to its usefulness in problems with unseen data, the EM algorithm can be used to reduce the complexity of problems by introducing hidden variables. So how does the EM algorithm allow us to do this? To answer that question we need to understand how it works.

2 The Expectation Maximization Algorithm

[2] begins its explanation of the EM algorithm by giving an introductory example of how it works. Since the author has been doing this sort of thing a lot longer than I have, I'll mimic his approach to the point of even using the same example, explained as I understand it.

2.1 An Introductory Example: K-Means

The k-means problem is as follows. Imagine you have a number of data points generated by k different Gaussian distributions. By looking at just the data points, and knowing only the number of Gaussian distributions and their variance, can you figure out what the means of the k distributions are? With the EM algorithm you can determine a hypothesis h that has the maximum likelihood of creating the observed data.

The example that [2] starts with is an instance of the k-means problem where we are given that k = 2, that the variances of the two distributions are identical (that is, σ² is the same for both) and known, and that the observed data points are marked on the x axis. This setup is shown in Figure 1. All figures for this example are from [3].

Figure 1. k-means with k = 2

So here's what we have in our problem:

The hypothesis: it takes the form <µ1, µ2>, the proposed means for distributions 1 and 2.

The observed variables: the x values, each of which was generated by one of the two distributions.

The hidden variables: each of our x values can be thought of as having two associated z values that indicate which Gaussian distribution generated that x value: z1 = 1 and z2 = 0 if the x value was generated by the 1st distribution, or z1 = 0 and z2 = 1 if the x value was generated by the 2nd distribution. Since these variables are hidden, their true values are not known and must be estimated via the EM algorithm.
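To make this setup concrete, here is a minimal sketch of how such data could be generated. This is my own illustration, not something from [2] or [3]; it assumes Python with NumPy, and the particular means and counts are arbitrary. The z values exist inside the generator, but a learner would only ever see the x values:

    import numpy as np

    rng = np.random.default_rng(0)

    # Known quantities: k = 2 distributions with a shared, known variance.
    sigma = 1.0
    true_means = [0.0, 4.0]   # these are what EM would have to recover
    n_points = 200

    # Hidden variables: which distribution generated each point.
    # z[i] = 0 corresponds to <z1, z2> = <1, 0>; z[i] = 1 to <z1, z2> = <0, 1>.
    z = rng.integers(0, 2, size=n_points)

    # Observed variables: the x values only.
    x = rng.normal(loc=np.take(true_means, z), scale=sigma)

An EM implementation would only ever consume x (plus the known σ and k); the array z is shown here only to emphasize what "hidden" means.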
With this problem formulation, [2] shows us how we can use the Expectation Maximization algorithm to calculate a maximum likelihood hypothesis for the means of the Gaussian distributions that generated the observed data. First, we generate an initial hypothesis by picking some arbitrary values for the means. Once we have our initial hypothesis h = <µ1, µ2>, we alternate between the expectation step and the maximization step, whence the EM algorithm derives its name, until the maximum likelihood hypothesis is found.

The 'Expectation' step: In the expectation step, we generate the expected values E[zij] given our current hypothesis. That is, we calculate the probability that xi was generated by the jth distribution. The formula for E[zij] appears in Figure 2.

Figure 2. Formula for Calculating E[zij]

We can see from this formula that the expected value for each zij is the probability of xi given µj divided by the sum of the probabilities that xi belonged to each µ.

The 'Maximization' step: Once we have calculated all the E[zij] values, we can calculate new µ values via the formula in Figure 3.

Figure 3. Formula for Calculating µj

This formula generates the maximum likelihood hypothesis. [2] shows how it arrives at this formula via a derivation from the general statement of the EM algorithm. By repeating the Expectation step (E-step) and the Maximization step (M-step), [1] shows that the algorithm will converge to a local maximum and give us a maximum likelihood estimate for our hypothesis.
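Figures 2 and 3 are not reproduced here. For reference, the two updates they describe, as I understand the standard presentation in [2] (for k = 2 components with shared known variance σ² and m observed points), are:

    E[z_{ij}] = \frac{\exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)}{\sum_{n=1}^{2} \exp\left(-\frac{(x_i - \mu_n)^2}{2\sigma^2}\right)}
    \qquad\qquad
    \mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}

The first expression matches the verbal description above (the Gaussian normalizing constants cancel in the ratio), and the second is simply the E[zij]-weighted average of the observed points.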
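To make the whole loop concrete, here is a minimal sketch of the procedure. It is my own illustration rather than code from [2] or [3]; it assumes Python with NumPy, and the function name em_two_means is mine:

    import numpy as np

    def em_two_means(x, sigma, n_iters=50, seed=0):
        """Estimate the means of a 2-component Gaussian mixture with a known,
        shared standard deviation sigma, by alternating E- and M-steps."""
        rng = np.random.default_rng(seed)
        # Initial hypothesis h = <mu1, mu2>: arbitrary starting values for the means.
        mu = rng.choice(x, size=2, replace=False).astype(float)

        for _ in range(n_iters):
            # E-step: E[z_ij], the probability that x_i came from distribution j.
            sq_dist = (x[:, None] - mu[None, :]) ** 2            # shape (m, 2)
            weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
            e_z = weights / weights.sum(axis=1, keepdims=True)   # each row sums to 1

            # M-step: each new mean is the E[z_ij]-weighted average of the x_i.
            mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

        return mu

Run on data like that produced by the generation sketch earlier in this section (or any 1-D NumPy array x with known sigma), em_two_means(x, sigma=1.0) should converge to estimates close to the true means, subject to the local-maximum caveat discussed at the end of Section 2.2.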
2.2 A More General Look at the EM Algorithm

Now that the k-means example has given us an idea of how the EM algorithm works, we can look at a more general description of it. [2, 3, 4] use the following symbols to define the problem:

Table 1. Symbols used to define the EM algorithm

    Symbol     Meaning                                                   Example
    θ          The parameters we want to learn                           <µ1, µ2>
    X          The set of observed values                                {x1, x2, ..., xn}
    Z          The set of unknown or hidden variables                    <zi1, zi2>
    Y          The complete data, X ∪ Z                                  <xi, zi1, zi2>
    h          The current hypothesis for θ
    h'         The revised hypothesis for θ
    P(Y|h')    The likelihood of the full data Y given the hypothesis h'

[2] tells us that the goal of using the EM algorithm is to find the h' with the maximum likelihood, which it does by maximizing E[ln P(Y|h')]. It claims that by maximizing ln P(Y|h') it is also maximizing P(Y|h'), which intuitively makes sense. It adds the E[] because [2] claims that we can treat Z (and therefore Y, since it depends on Z) as a random variable. I interpret this to mean that because it is random it can only have expected values and not exact values. After all, if we knew the exact values in Z we wouldn't need the EM algorithm, because the variables wouldn't be hidden. So we're told that we "take the expected value E[ln P(Y|h')] over the probability distribution governing the random variable Y" [2], a distribution which we get from the distributions for X and Z and which is determined by the current hypothesis h. Now you may recall from our example that the expectation step of the EM algorithm used an expected value, and that is exactly where this formula is heading. [2] transforms the function one more time to give us the function Q shown in Figure 4.

Figure 4. Formula for the General Expectation Step
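The figure itself is not reproduced here; as I understand the formulation in [2], the function being defined is

    Q(h' | h) = E[ ln P(Y | h') | h, X ]

that is, the expected log likelihood of the complete data Y under the candidate hypothesis h'.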
This formula adds in the "| h, X" to make explicit the relation of the expected value to the observed data X and the current hypothesis h. Once we have this formula, we can state the two steps of the Expectation Maximization algorithm in their general form:

Step 1: Expectation step: Calculate the formula in Figure 4, using the current hypothesis h and the observed data X to estimate the probability distribution over Y.

Step 2: Maximization step: Get our next hypothesis via the following formula:

Figure 5. Formula for the General Maximization Step

[2] tells us that as long as Q is continuous, EM will converge to a local maximum. But is this good enough? If we want a global maximum, what do we do? [5] says that since the EM algorithm is a hill-climbing algorithm, it can only be guaranteed to find a local maximum, not necessarily the global maximum. Many applications want to get as close to the global maximum as possible, so [4] suggests two methods for handling this. The first is rather intuitive and easy to implement, but more computationally intensive: run the EM algorithm repeatedly, starting with different random values for the hidden variables, and take the highest-valued result from all the runs. The second suggestion takes more thought and may not be possible in all situations, but it requires less computation: simplify the model being considered so that it contains only a global maximum, and then use the maximum that the simplified model gives as a 'best guess' for the starting point in the more complex model.

3 An Application Using the EM Algorithm

As I stated in my introduction, the EM algorithm is used in a wide variety of areas. Here is one such example:

3.1 Robotics: Mapping Indoor Spaces

In [6] the authors outline how current robotic mapping approaches are deficient: 2D mapping procedures don't scale well to 3D, and 3D mapping via sets of fine-grained polygons results in complex maps that yield inaccurate results when off-the-shelf simplification algorithms are applied. In response to these problems, the authors propose that by making two assumptions (the first involving robot positioning and the second involving a bias towards flat surfaces) they can use a variation of the EM algorithm to compute maps that have low complexity and reduced noise.

In the expectation step of their application of the EM algorithm, they use the following symbols:

θ: the set of components in the model. Each component is a 3D map of a rectangular surface (such as a door, wall, or window), plus polygons for non-planar objects.

zi: the actual measurement.

cij: the correspondence between the flat components in the model and the actual measurements. This variable is 1 iff the ith measurement zi corresponds to the jth surface in θ.

ci*: a special correspondence variable for random measurement noise and/or non-planar objects in the world.

These symbols are used in the following formulas:

Figure 6. Formulas for the Expectation step

To relate this to the general idea of the EM algorithm: the zi measurements are the observed data, θ is the current hypothesis, and the cij are the 'hidden' data. In their maximization step, they want to maximize the 'log likelihood' of the map, which they tell us can be accomplished via the following minimization formula:

Figure 7. Formula for the Maximization step

The authors of [6] give these formulas for the basic EM algorithm. However, they go on to modify the algorithm to deal with real-time inputs. They note that since EM requires multiple passes over the data, it is an "inherently offline" algorithm. To make it real-time, they make use of their insight that "during each fixed time interval, only a constant number of new range measurements arrive". By taking advantage of this fact, they incorporate the new inputs into the model as time progresses.

So the authors of [6] offer up the formulas and ideas for their 3D mapping, but the question is how well does it all work? To get an idea of the efficacy of their approach, here is one sample of their output:

Figure 8. Results of real-time EM for robotic mapping

In this figure the right column is the EM-generated mapping, while the left column is the fine-grained polygonal mapping. The EM algorithm generates noticeably clearer mappings. So we can see that, by using the EM algorithm, the authors of this paper were able to reduce the complexity of their work and improve their outputs.

4 Conclusions

So what can we say about the EM algorithm? First, we do have to note that it has some limitations. The first of these is that, when dealing with unseen data, you still need to know the probability distribution of that data in order to use it. But for many applications this constraint does not seem to be a problem. Another problem, already mentioned, is that since EM is a hill-climbing algorithm it is susceptible to finding a local maximum instead of the global one. However, at the end of Section 2.2 we noted how [4] suggests handling this problem. Despite these limitations, as I noted in the introduction, the EM algorithm is widely used in numerous areas. So we can conclude that the EM algorithm offers a powerful, adaptable method for dealing with unseen data and/or reducing computational complexity.

References

[1] Dempster, A., Laird, N., and Rubin, D. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society, Series B, 39(1):1–38.

[2] Mitchell, T. (1997). Machine Learning. The McGraw-Hill Companies, Inc., New York, NY.

[3] http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/mlbook/ch6.ps

[4] http://sifaka.cs.uiuc.edu/course/397cxz03f/em-note.pdf

[5] Bilmes, J. (1998). "A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models." Technical Report ICSI-TR-97-021, International Computer Science Institute, Berkeley.

[6] Thrun, S., Martin, C., Liu, Y., Hahnel, D., Emery-Montemerlo, R., Chakrabarti, D., and Burgard, W. (2003). "A Real-Time Expectation Maximization Algorithm for Acquiring Multi-Planar Maps of Indoor Environments with Mobile Robots." IEEE Transactions on Robotics and Automation.