Expectation-Maximization
Markoviana Reading Group
Fatih Gelgi, ASU, 2005
5/29/2016

Outline
- What is EM?
- Intuitive Explanation
- Example: Gaussian Mixture
- Algorithm
- Generalized EM
- Discussion
- Applications
  - HMM – Baum-Welch
  - K-means

What is EM?
Two main applications:
- The data has missing values, due to problems with or limitations of the observation process.
- Optimizing the likelihood function is extremely hard, but the likelihood function can be simplified by assuming the existence of, and values for, additional missing or hidden parameters.

$$\theta^* = \arg\max_\theta L(\theta \mid U) = \arg\max_\theta p(U \mid \theta) = \arg\max_\theta \prod_{i=1}^{N} p(u_i \mid \theta) = \arg\max_\theta \prod_{i=1}^{N} \sum_{j=1}^{M} \alpha_j \, p_j(u_i \mid \theta_j)$$

Key Idea…
- The observed data $U$ is generated by some distribution and is called the incomplete data.
- Assume that a complete data set $Z = (U, J)$ exists, where $J$ is the missing or hidden data.
- Maximize the posterior probability of the parameters $\theta$ given the data $U$, marginalizing over $J$:

$$\theta^* = \arg\max_\theta \sum_{J} P(\theta, J \mid U)$$

Intuitive Explanation of EM
- Alternate between estimating the unknowns $\theta$ and the hidden variables $J$.
- In each iteration, instead of finding the best $J \in \mathcal{J}$, compute a distribution over the space $\mathcal{J}$.
- EM is a lower-bound maximization process (Minka, 1998).
  - E-step: construct a local lower bound to the posterior distribution.
  - M-step: optimize the bound.

Intuitive Explanation of EM (cont.)
- Lower-bound approximation method.
- Sometimes provides faster convergence than gradient descent and Newton's method.

Example: Mixture Components
[figure]

Example (cont'd): True Likelihood of Parameters
[figure]

Example (cont'd): Iterations of EM
[figure]

Lower-bound Maximization
- Posterior probability → logarithm of the joint distribution:

$$\theta^* = \arg\max_\theta \sum_{J \in \mathcal{J}^n} P(\theta, J \mid U) = \arg\max_\theta \log P(U, \theta) = \arg\max_\theta \log \sum_{J \in \mathcal{J}^n} P(U, J, \theta)$$

- Difficult! The logarithm of a sum does not decompose into a sum of simpler terms.
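To make the log-of-a-sum objective concrete for the Gaussian mixture example, here is a minimal sketch of EM for a 1-D mixture. It assumes a flat prior on the parameters, so maximizing the posterior reduces to maximizing the likelihood. The function names, the synthetic data, the starting values, and the variance floor are illustrative assumptions and are not taken from the slides.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """log p(U | theta) = sum_i log sum_j alpha_j * p_j(u_i | theta_j)."""
    total = 0.0
    for u in data:
        total += math.log(sum(w * normal_pdf(u, m, v)
                              for w, m, v in zip(weights, means, variances)))
    return total


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture; returns updated parameters."""
    n, k = len(data), len(weights)
    # E-step: responsibilities resp[i][j] = P(component j | u_i, current theta).
    resp = []
    for u in data:
        joint = [w * normal_pdf(u, m, v)
                 for w, m, v in zip(weights, means, variances)]
        norm = sum(joint)
        resp.append([p / norm for p in joint])
    # M-step: responsibility-weighted maximum-likelihood updates.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * u for r, u in zip(resp, data)) / nj
        var = sum(r[j] * (u - mean) ** 2 for r, u in zip(resp, data)) / nj
        new_w.append(nj / n)
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # variance floor: a safeguard added here
    return new_w, new_m, new_v


def run_em(data, weights, means, variances, iterations=30):
    """Run EM, recording the log-likelihood at the start and after each step."""
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        weights, means, variances = em_step(data, weights, means, variances)
        history.append(log_likelihood(data, weights, means, variances))
    return (weights, means, variances), history


# Synthetic data: two well-separated components (an assumption for illustration).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(3.0, 1.0) for _ in range(200)])
params, history = run_em(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Inspecting `history` shows the log-likelihood never decreasing from one iteration to the next, which is the behaviour a lower-bound view of EM predicts.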
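- Idea: start with a guess $\theta^t$, compute an easily computed lower bound $B(\theta; \theta^t)$ to the function $\log P(\theta \mid U)$, and maximize the bound instead.

Lower-bound Maximization (cont.)
- Construct a tractable lower bound $B(\theta; \theta^t)$ that contains a sum of logarithms.
- $f^t(J)$ is an arbitrary probability distribution over $J$.
- By Jensen's inequality,

$$\log P(U, \theta) = \log \sum_{J} f^t(J) \frac{P(U, J, \theta)}{f^t(J)} \;\ge\; \sum_{J} f^t(J) \log \frac{P(U, J, \theta)}{f^t(J)} = B(\theta; \theta^t)$$

Optimal Bound
- $B(\theta; \theta^t)$ touches the objective function $\log P(U, \theta)$ at $\theta^t$.
- Maximize $B(\theta^t; \theta^t)$ with respect to $f^t(J)$:

$$B(\theta^t; \theta^t) = \sum_{J} f^t(J) \log P(U, J, \theta^t) - \sum_{J} f^t(J) \log f^t(J)$$

- Introduce a Lagrange multiplier $\lambda$ to enforce the constraint $\sum_J f^t(J) = 1$:

$$G(f^t) = \lambda \Big[ 1 - \sum_{J} f^t(J) \Big] + \sum_{J} f^t(J) \log P(U, J, \theta^t) - \sum_{J} f^t(J) \log f^t(J)$$

Optimal Bound (cont.)
- Derivative with respect to $f^t(J)$:

$$\frac{\partial G}{\partial f^t(J)} = -\lambda + \log P(U, J, \theta^t) - \log f^t(J) - 1$$

- Maximizes at:

$$f^t(J) = \frac{P(U, J, \theta^t)}{\sum_{J'} P(U, J', \theta^t)} = P(J \mid U, \theta^t)$$

Maximizing the Bound
- Re-write $B(\theta; \theta^t)$ with respect to the expectations:

$$B(\theta; \theta^t) = \langle \log P(U, J, \theta) \rangle - \langle \log f^t(J) \rangle = Q^t(\theta) + \log P(\theta) + H^t$$

- where $\langle \cdot \rangle$ is the expectation under $f^t(J) = P(J \mid U, \theta^t)$, $Q^t(\theta) = \langle \log P(U, J \mid \theta) \rangle$ is the expected complete-data log-likelihood, and $H^t = -\langle \log f^t(J) \rangle$ is the entropy of $f^t$, which does not depend on $\theta$.
- Finally,

$$\theta^{t+1} = \arg\max_\theta B(\theta; \theta^t) = \arg\max_\theta \big[ Q^t(\theta) + \log P(\theta) \big]$$

EM Algorithm
- E-step: compute $f^t(J) = P(J \mid U, \theta^t)$.
- M-step: set $\theta^{t+1} = \arg\max_\theta \big[ Q^t(\theta) + \log P(\theta) \big]$.
- EM converges to a local maximum of $\log P(U, \theta)$, and therefore to a local maximum of $\log P(\theta \mid U)$.

A Relation to the Log-Posterior
- An alternative way to compute the expected log-posterior:

$$\langle \log P(\theta \mid U, J) \rangle = \langle \log P(U, J \mid \theta) \rangle + \log P(\theta) - \langle \log P(U, J) \rangle$$

- which is the same as $Q^t(\theta) + \log P(\theta)$ as far as maximization with respect to $\theta$ is concerned, since the last term does not depend on $\theta$.

Generalized EM
- Assume $\ln p(X \mid \theta)$ and the $B$ function are differentiable in $\theta$. The EM likelihood converges to a point where

$$\frac{\partial}{\partial \theta} \ln p(X \mid \theta) = 0$$

- GEM: instead of setting $\theta^{t+1} = \arg\max_\theta B(\theta; \theta^t)$, just find $\theta^{t+1}$ such that $B(\theta^{t+1}; \theta^t) > B(\theta^t; \theta^t)$.
- GEM is also guaranteed to converge.

HMM – Baum-Welch Revisited
- Estimate the parameters $(a, b, \pi)$ such that the expected number of correct individual states is maximized.
- $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$.
- $\xi_t(i, j)$ is the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$.

Baum-Welch: E-step
- Compute $\gamma_t(i)$ and $\xi_t(i, j)$ from the current parameters.

Baum-Welch: M-step
- Re-estimate $(a, b, \pi)$ from $\gamma_t(i)$ and $\xi_t(i, j)$.

K-Means
- Problem: given data $X$ and the number of clusters $K$, find the clusters.
- Clustering is based on centroids:

$$\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$$

- A point belongs to the cluster with the closest centroid.
- Hidden variables: centroids of the clusters!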
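The centroid-based clustering described above can be sketched as a loop that alternates between assigning each point to its closest centroid and recomputing each centroid as the mean of its cluster. This is a minimal sketch, not a reference implementation: the function names, the toy data, the seed choice, and the handling of empty clusters are all assumptions made for illustration.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))


def centroid(points):
    """mu(c) = (1/|c|) * sum of the points x in c, taken coordinate-wise."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(data, seeds, max_iter=100):
    """Alternate assignments and centroid updates until assignments stop changing."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid.
        new_labels = [closest(x, centroids) for x in data]
        if new_labels == labels:
            break  # assignments unchanged: converged
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points.
        for k in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid (a choice made here)
                centroids[k] = centroid(members)
    return centroids, labels


# Toy data with two obvious groups; both seeds start inside the first group.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = k_means(data, seeds=[(0.0, 0.0), (1.0, 0.0)])
```

Each point is assigned to exactly one cluster here, so the assignment is a hard decision rather than a distribution over clusters.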

K-Means (cont.)
- Start with an initial $\theta^0$ (the centroids).
- E-step: split the data into $K$ clusters according to distances to the centroids (calculate the distribution $f^t(J)$).
- M-step: update the centroids (calculate $\theta^{t+1}$).

K-Means Example (K=2)
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!

Discussion
- Is EM a primal-dual algorithm?

Reference:
- A. P. Dempster et al., "Maximum-likelihood from incomplete data", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38.
- F. Dellaert, "The Expectation Maximization Algorithm", Tech. Rep. GIT-GVU-02-20, 2002.
- T. Minka, "Expectation-Maximization as lower bound maximization", 1998.
- Y. Chang, M. Kölsch, Presentation: Expectation Maximization, UCSB, 2002.
- K. Andersson, Presentation: Model Optimization using the EM algorithm, COSC 7373, 2001.

Thanks!