Unsupervised Methods: EM
Lecture 7b
Jason Corso, SUNY at Buffalo
26 March 2009

Gaussian Mixture Models

Recall the Gaussian distribution:

    \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)    (1)

It forms the basis for the important Mixture of Gaussians density.

The Gaussian mixture is a linear superposition of Gaussians of the form

    p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) .    (2)

The π_k are non-negative scalars called mixing coefficients; they govern the relative importance of the various Gaussians in the mixture density and satisfy Σ_k π_k = 1.

[Figure: a one-dimensional mixture density p(x) built from three Gaussians, and (a) contours of three mixture components with mixing coefficients 0.5, 0.3, 0.2, with (b) contours of the resulting mixture density p(x).]

Introducing Latent Variables

Define a K-dimensional binary random variable z.

z has a 1-of-K representation such that a particular element z_k is 1 and all of the others are zero. Hence

    z_k \in \{0, 1\}    (3)
    \sum_k z_k = 1    (4)

The marginal distribution over z is specified in terms of the mixing coefficients:

    p(z_k = 1) = \pi_k    (5)

and, recall, 0 ≤ π_k ≤ 1 and Σ_k π_k = 1.

Since z has a 1-of-K representation, we can also write this distribution as

    p(z) = \prod_{k=1}^{K} \pi_k^{z_k}    (6)

The conditional distribution of x given z is a Gaussian:

    p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)    (7)

or

    p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}    (8)

We are interested in the marginal distribution of x:

    p(x) = \sum_z p(x, z)    (9)
         = \sum_z p(z)\, p(x \mid z)    (10)
         = \sum_z \prod_{k=1}^{K} \pi_k^{z_k} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}    (11)
         = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)    (12)

So, given our latent variable z, the marginal distribution of x is a Gaussian mixture.
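To make the mixture density of Eqs. (2) and (12) concrete, here is a minimal numpy/scipy sketch that evaluates p(x) for a small two-component mixture; the parameter values and the query point are made-up illustrative choices, not anything specified in the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component mixture in d = 2 dimensions (illustrative values only).
pis    = np.array([0.6, 0.4])                          # mixing coefficients, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k), i.e. Eq. (2)."""
    return sum(pi_k * multivariate_normal(mean=mu_k, cov=Sigma_k).pdf(x)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))   # mixture density at the point (1, 1)
```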
If we have N observations x_1, ..., x_N, then because of our chosen representation, it follows that we have a latent variable z_n for each observed data point x_n.

Component Responsibility Term

We also need to express the conditional probability of z given x. Denote this conditional p(z_k = 1 | x) as γ(z_k).

We can derive this value with Bayes' theorem:

    \gamma(z_k) = p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)}    (13)

                = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}    (14)

View π_k as the prior probability of z_k = 1 and the quantity γ(z_k) as the corresponding posterior probability once we have observed x.

γ(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x.

Sampling from the GMM

To sample from the GMM, we can first generate a value for z from the marginal distribution p(z). Denote this sample ẑ. Then, sample from the conditional distribution p(x | ẑ). A short sketch of this two-stage (ancestral) sampling procedure follows.
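A minimal numpy sketch of this ancestral sampling procedure, reusing made-up two-component parameters like those in the earlier density sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (same made-up values as in the density sketch above).
pis    = np.array([0.6, 0.4])
mus    = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas = np.array([np.eye(2), [[1.0, 0.5], [0.5, 1.5]]])

def sample_gmm(n, pis, mus, Sigmas):
    """Ancestral sampling: draw z-hat ~ p(z) (a component index), then x ~ p(x | z-hat)."""
    zs = rng.choice(len(pis), size=n, p=pis)          # which component generated each point
    xs = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in zs])
    return xs, zs

xs, zs = sample_gmm(500, pis, mus, Sigmas)
```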
The figure below-left shows samples from a mixture of three Gaussians and colors the samples according to their z. The middle panel shows samples from the marginal p(x), ignoring z. On the right, each sampled point is colored according to its responsibilities γ(z_k).

[Figure: (a) complete-data samples colored by the latent component z, (b) the corresponding samples from the marginal p(x), (c) the same samples colored by their responsibilities γ(z_k).]

Maximum-Likelihood

Suppose we have a set of N observations {x_1, ..., x_N} that we wish to model with a GMM.

Consider this data set as an N × d matrix X in which the nth row is given by x_n^T. Similarly, the corresponding latent variables define an N × K matrix Z with rows z_n^T.

The log-likelihood of the corresponding GMM is given by

    \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right] .    (15)

Ultimately, we want to find the values of the parameters π, µ, Σ that maximize this function. (A numerically stable way to evaluate the quantity itself is sketched below.)
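Here is one way to evaluate Eq. (15) in practice; a minimal sketch in which scipy's logsumexp is used for numerical stability (my choice, not something the lecture prescribes):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)  -- Eq. (15)."""
    N, K = X.shape[0], len(pis)
    log_terms = np.empty((N, K))
    for k in range(K):
        # log(pi_k) + log N(x_n | mu_k, Sigma_k) for every data point x_n at once
        log_terms[:, k] = np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
    return logsumexp(log_terms, axis=1).sum()
```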
However, maximizing the log-likelihood for a GMM is much more complicated than for the case of a single Gaussian. Why? The difficulty arises from the sum over k inside the log term: the log no longer acts directly on the Gaussian, and no closed-form solution is available.

Singularities

There is a significant problem when we apply MLE to estimate GMM parameters.

Consider, for simplicity, covariances defined by Σ_k = σ_k² I.

Suppose that one of the components of the mixture model, j, has its mean µ_j exactly equal to one of the data points, so that µ_j = x_n for some n. This term contributes

    \mathcal{N}(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2} \sigma_j} .    (16)

Consider the limit σ_j → 0 to see that this term goes to infinity, and hence the log-likelihood will also go to infinity.

Thus, maximization of the log-likelihood function is not a well-posed problem, because such a singularity will occur whenever one of the components collapses onto a single, specific data point.

[Figure: a one-dimensional mixture density p(x) over x in which one component has collapsed onto a single data point, illustrating the singularity.]
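A tiny numeric illustration of the collapse (assuming, as in Eq. (16), a one-dimensional component whose mean sits exactly on a data point): as σ_j shrinks, that component's contribution, and with it the log-likelihood, grows without bound.

```python
import numpy as np

# N(x_n | x_n, sigma_j^2) = 1 / ((2*pi)^(1/2) * sigma_j), which diverges as sigma_j -> 0 (Eq. 16).
for sigma_j in [1.0, 1e-1, 1e-3, 1e-6]:
    contribution = 1.0 / (np.sqrt(2.0 * np.pi) * sigma_j)
    print(f"sigma_j = {sigma_j:g}:  N(x_n | x_n, sigma_j^2) = {contribution:.3g},"
          f"  log-contribution = {np.log(contribution):.2f}")
```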
Expectation-Maximization for GMMs

Expectation-Maximization, or EM, is an elegant and powerful method for finding MLE solutions in the case of missing data, such as the latent variables z indicating the mixture component.

Recall the conditions that must be satisfied at a maximum of the likelihood function.

For the mean µ_k, setting the derivative of ln p(X | π, µ, Σ) with respect to µ_k to zero yields

    0 = -\sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k)    (17)

      = -\sum_{n=1}^{N} \gamma(z_{nk}) \, \Sigma_k^{-1} (x_n - \mu_k)    (18)

Note the natural appearance of the responsibility terms on the right-hand side.

Multiplying by Σ_k (which we assume is non-singular) and rearranging gives

    \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n    (19)

where

    N_k = \sum_{n=1}^{N} \gamma(z_{nk})    (20)

We see that the kth mean is the weighted mean over all of the points in the dataset. Interpret N_k as the effective number of points assigned to component k.

We find a similar result for the covariance matrix:

    \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T .    (21)
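A compact numpy sketch of these closed-form updates, under the assumption that an N×K responsibility matrix gamma is already in hand (the variable names are mine):

```python
import numpy as np

def update_means_covariances(X, gamma):
    """Weighted estimates of Eqs. (19)-(21) given responsibilities gamma (N x K)."""
    N, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                              # Eq. (20): effective counts per component
    mus = (gamma.T @ X) / Nk[:, None]                   # Eq. (19): responsibility-weighted means
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # Eq. (21)
    return mus, Sigmas, Nk
```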
We also need to maximize ln p(X | π, µ, Σ) with respect to the mixing coefficients π_k. Introduce a Lagrange multiplier to enforce the constraint Σ_k π_k = 1:

    \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)    (22)

Setting the derivative with respect to π_k to zero yields

    0 = \frac{1}{\pi_k} \sum_{n=1}^{N} \gamma(z_{nk}) + \lambda    (23)

After multiplying both sides by π_k and summing over k, we get

    \lambda = -N    (24)

Eliminate λ and rearrange to obtain

    \pi_k = \frac{N_k}{N}    (25)
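Writing the "multiply by π_k and sum over k" step out in full, here is a short worked version of Eqs. (23)-(25) in the lecture's notation:

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial \pi_k}\left[\ln p(X \mid \pi, \mu, \Sigma)
      + \lambda\Big(\sum_{j=1}^{K}\pi_j - 1\Big)\right]
   = \sum_{n=1}^{N}\frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
        {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda
   = \frac{N_k}{\pi_k} + \lambda \\[4pt]
&\Rightarrow\; N_k + \lambda\,\pi_k = 0
  \;\Rightarrow\; \sum_{k=1}^{K} N_k + \lambda \sum_{k=1}^{K} \pi_k = N + \lambda = 0
  \;\Rightarrow\; \lambda = -N
  \;\Rightarrow\; \pi_k = \frac{N_k}{N}.
\end{aligned}
```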
Solved... right?

So, we're done, right? We've computed the maximum-likelihood solutions for each of the unknown parameters. Wrong! The responsibility terms depend on these parameters in an intricate way:

    \gamma(z_k) = p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)} .

But these results do suggest an iterative scheme for finding a solution to the maximum-likelihood problem:

1. Choose some initial values for the parameters π, µ, Σ.
2. Use the current parameter estimates to compute the posteriors on the latent terms, i.e., the responsibilities.
3. Use the responsibilities to update the estimates of the parameters.
4. Repeat 2 and 3 until convergence.

[Figure: progress of EM on a two-dimensional data set; panels (a)-(f) show the initial configuration and the evolving fit after L = 1, 2, 5, and 20 iterations.]

Some Quick, Early Notes on EM

EM generally tends to take more steps than the K-Means clustering algorithm, and each step is more computationally intense than with K-Means, too. So, one commonly runs K-Means first and then initializes EM from the resulting clusters; one way to do this is sketched after these notes.

Care must be taken to avoid singularities in the MLE solution.

There will generally be multiple local maxima of the likelihood function, and EM is not guaranteed to find the largest of these.
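One concrete way to realize the K-Means initialization mentioned above; a hedged sketch using scikit-learn's KMeans, where seeding Σ_k and π_k from the hard cluster assignments is a common choice of mine rather than something the lecture fixes:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_gmm_from_kmeans(X, K, seed=0):
    """Seed GMM parameters (mu_k, Sigma_k, pi_k) from a K-Means clustering of X."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    d = X.shape[1]
    mus = km.cluster_centers_
    pis = np.empty(K)
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        members = X[km.labels_ == k]
        pis[k] = len(members) / len(X)
        # Empirical covariance of the cluster, plus a small ridge to guard against singularities.
        Sigmas[k] = np.cov(members, rowvar=False) + 1e-6 * np.eye(d)
    return mus, Sigmas, pis
```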
Putting this together (a code sketch of the complete loop follows the listing): given a GMM, the goal is to maximize the likelihood function with respect to the parameters (the means, the covariances, and the mixing coefficients).

1. Initialize the means µ_k, the covariances Σ_k, and the mixing coefficients π_k, and evaluate the initial value of the log-likelihood.

2. E-Step: Evaluate the responsibilities using the current parameter values:

    \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

3. M-Step: Update the parameters using the current responsibilities:

    \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n    (26)

    \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T    (27)

    \pi_k^{new} = \frac{N_k}{N}    (28)

   where

    N_k = \sum_{n=1}^{N} \gamma(z_{nk})    (29)

4. Evaluate the log-likelihood

    \ln p(X \mid \mu^{new}, \Sigma^{new}, \pi^{new}) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k^{new} \, \mathcal{N}(x_n \mid \mu_k^{new}, \Sigma_k^{new}) \right]    (30)

5. Check for convergence of either the parameters or the log-likelihood. If convergence is not satisfied, set the parameters

    \mu = \mu^{new}    (31)
    \Sigma = \Sigma^{new}    (32)
    \pi = \pi^{new}    (33)

   and go to step 2.
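Steps 1-5 translate almost line-for-line into code; a minimal, unoptimized numpy/scipy sketch of the whole loop, with convergence checked on the log-likelihood (variable and function names are mine, and the initial parameters could come from the K-Means sketch above):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, mus, Sigmas, pis, max_iter=200, tol=1e-6):
    """EM for a GMM: E-step responsibilities, M-step updates (26)-(29), log-likelihood (30)."""
    N, d = X.shape
    K = len(pis)
    mus, Sigmas, pis = np.array(mus, float), np.array(Sigmas, float), np.array(pis, float)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: log of pi_k N(x_n | mu_k, Sigma_k) for all n, k; normalize to get gamma.
        log_w = np.column_stack([np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
                                 for k in range(K)])
        log_norm = logsumexp(log_w, axis=1, keepdims=True)
        gamma = np.exp(log_w - log_norm)

        # M-step: Eqs. (26)-(29).
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N

        # Steps 4-5: monitor the log-likelihood (computed during the E-step) until it stabilizes.
        ll = log_norm.sum()
        if np.abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mus, Sigmas, pis, gamma, ll
```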
A More General View of EM

The goal of EM is to find maximum-likelihood solutions for models having latent variables.

Denote the set of all model parameters as θ, so the log-likelihood function is

    \ln p(X \mid \theta) = \ln \left[ \sum_Z p(X, Z \mid \theta) \right]    (34)

Note how the summation over the latent variables appears inside of the log. Even if the joint distribution p(X, Z | θ) belongs to the exponential family, the marginal p(X | θ) typically does not.

If, for each sample x_n, we were given the value of the latent variable z_n, then we would have a complete data set, {X, Z}, with which maximizing this likelihood term would be straightforward.

However, in practice, we are not given the latent variable values. So, instead, we focus on the expectation of the complete-data log-likelihood under the posterior distribution of the latent variables.

In the E-Step, we use the current parameter values θ^old to find the posterior distribution of the latent variables, given by p(Z | X, θ^old).

This posterior is used to define the expectation of the complete-data log-likelihood, denoted Q(θ, θ^old), which is given by

    Q(\theta, \theta^{old}) = \sum_Z p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)    (35)

Then, in the M-Step, we revise the parameters to θ^new by maximizing this function:

    \theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})    (36)

Note that the log acts directly on the joint distribution p(X, Z | θ), and so the M-step maximization will likely be tractable.
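The same E/M alternation can be phrased generically; a sketch under the assumption that the user supplies problem-specific `e_step`, `m_step`, and `log_likelihood` callables (hypothetical names: the E-step returns p(Z | X, θ_old) or sufficient statistics of it, and the M-step maximizes Q(θ, θ_old)):

```python
def em(X, theta0, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM: alternately compute p(Z | X, theta_old) and maximize Q(theta, theta_old)."""
    theta, prev_ll = theta0, float("-inf")
    for _ in range(max_iter):
        posterior = e_step(X, theta)        # E-step: posterior over latent variables, p(Z | X, theta_old)
        theta = m_step(X, posterior)        # M-step: theta_new = argmax_theta Q(theta, theta_old), Eq. (36)
        ll = log_likelihood(X, theta)       # ln p(X | theta); EM never decreases this quantity
        if abs(ll - prev_ll) < tol:         # stop once the log-likelihood stabilizes
            break
        prev_ll = ll
    return theta
```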