Introduction to Machine Learning (CSE474/574): Latent Variable Models
Varun Chandola <chandola@buffalo.edu>

Outline
1 Latent Variable Models
  Latent Variable Models - Introduction
2 Mixture Models
  Using Mixture Models
3 Parameter Estimation
  Issues with Direct Optimization of the Likelihood or Posterior
4 Expectation Maximization
  EM Operation
  EM for Mixture Models
  K-Means as EM

Latent Variable Models - Introduction

Latent Variable Models
- Consider a probability distribution parameterized by θ that generates samples x with probability p(x|θ).
- Two-step generative process:
  1 A distribution generates the hidden variable.
  2 A distribution generates the observation, given the hidden variable.

Magazine Example - Sampling an Article
- Assume that the editor has access to p(x), where x is a random variable denoting an article.
- Latent variable model:
  1 First sample a topic z from a topic distribution p(z).
  2 Pick an article from the topic-wise distribution p(x|z).
- Direct model: sample an article directly from p(x).
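The following is a minimal sketch of this two-step generative process, mirroring the magazine example. The topic names, the topic distribution p(z), and the per-topic article distributions p(x|z) are made-up placeholders, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

topics = ["politics", "sports", "science"]   # values of the hidden variable z
p_z = np.array([0.5, 0.3, 0.2])              # prior p(z): a hypothetical topic distribution

# p(x|z): for each topic, a categorical distribution over a small article pool
articles = ["a1", "a2", "a3", "a4"]
p_x_given_z = np.array([
    [0.7, 0.1, 0.1, 0.1],   # politics
    [0.1, 0.7, 0.1, 0.1],   # sports
    [0.1, 0.1, 0.4, 0.4],   # science
])

def sample_article():
    z = rng.choice(len(topics), p=p_z)               # step 1: sample the hidden topic z ~ p(z)
    x = rng.choice(len(articles), p=p_x_given_z[z])  # step 2: sample the article x ~ p(x|z)
    return topics[z], articles[x]

# The "direct model" would instead sample x from the marginal p(x) = sum_z p(z) p(x|z):
p_x = p_z @ p_x_given_z
print(sample_article(), p_x)
```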
Latent Variable Models - Introduction
- The observed random variable x depends on a hidden random variable z.
- z is generated using a prior distribution p(z); x is generated using p(x|z).
- Different combinations of p(z) and p(x|z) give different latent variable models:
  1 Mixture models
  2 Factor analysis
  3 Probabilistic Principal Component Analysis (PCA)
  4 Latent Dirichlet Allocation (LDA)

Mixture Models
- A latent discrete state z ∈ {1, 2, . . . , K}, with p(z) ∼ Multinomial(π).
- For every state k, we have a probability distribution for x: p(x|z = k) = p_k(x).
- Overall probability for x:

    p(x|\theta) = \sum_{k=1}^{K} \pi_k \, p_k(x|\theta)

- This is a convex combination of the p_k's.
- π_k is the probability that the kth mixture component is active; equivalently, the contribution of the kth component, or its mixing weight.

Using Mixture Models
1. Black-box density model
   - Use p(x|θ) for many purposes, e.g., as a class-conditional density.
2. Clustering (soft clustering)
   1 First learn the parameters of the mixture model.
   2 Compute p(z = k|x, θ) for every input point x using Bayes' rule.
   - Each mixture component corresponds to a cluster:

    p(z = k|x, \theta) = \frac{p(z = k|\theta)\, p(x|z = k, \theta)}{\sum_{k'=1}^{K} p(z = k'|\theta)\, p(x|z = k', \theta)}
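A minimal sketch of soft clustering with a mixture model: given parameters θ that are assumed to be already learned, compute p(z = k|x, θ) by Bayes' rule for each point. The mixing weights, means, and standard deviations below are illustrative assumptions, not values from the lecture.

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.4, 0.6])        # mixing weights pi_k (assumed learned)
mu = np.array([3.0, 10.0])       # component means (assumed learned)
sigma = np.array([1.0, 2.0])     # component standard deviations (assumed learned)

x = np.array([2.5, 6.0, 11.0])   # input points to softly cluster

# numerator: pi_k * p(x | z = k, theta), one column per component
num = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
# responsibilities p(z = k | x, theta): normalize each row over k
resp = num / num.sum(axis=1, keepdims=True)
print(resp)   # row i sums to 1 and gives the soft cluster membership of x[i]
```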
Parameter Estimation: Issues with Direct Optimization of the Likelihood or Posterior

Simple Parameter Estimation
- Given: a set of scalar observations x_1, x_2, . . . , x_n.
- Task: find the generative model (form and parameters).
  1 Observe the empirical distribution of x.
  2 Choose the form of the probability distribution (e.g., Gaussian).
  3 Estimate its parameters from the data using MLE or MAP (µ and σ).
[Figure: histogram of the observations over bins [2, 4) through [12, 14)]

When Data has Multiple Modes
- A single mode is not sufficient: in reality the data was generated from two Gaussians.
- How do we estimate µ_1, σ_1, µ_2, σ_2?
- What if we knew z_i ∈ {1, 2} for every observation?
  - z_i = 1 means that x_i comes from the first mixture component.
  - z_i = 2 means that x_i comes from the second mixture component.
- Issue: the z_i's are not known beforehand; enumerating them would require exploring 2^N possibilities.
[Figure: bimodal histogram of the observations over bins [0, 5), [5, 10), [10, 15)]
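A minimal sketch of the "simple parameter estimation" recipe: choose a single Gaussian and fit µ and σ by maximum likelihood. The synthetic data, drawn from two Gaussians, is an assumption used only to show why a single mode is insufficient when the data is actually bimodal.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical bimodal data: half from N(3, 1), half from N(10, 2)
x = np.concatenate([rng.normal(3.0, 1.0, 500), rng.normal(10.0, 2.0, 500)])

# MLE for a single Gaussian
mu_hat = x.mean()
sigma_hat = x.std()          # MLE uses the biased (1/N) variance estimate
print(mu_hat, sigma_hat)     # roughly mu ~ 6.5, sigma ~ 3.8: a single Gaussian
                             # centered between the two modes, fitting neither well
```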
Optimizing the Likelihood or Posterior Directly is Not Possible
- For direct optimization, we find the parameters that maximize the (log-)likelihood or the (log-)posterior.
- This is easy to optimize if the z_i were all known.
- When the z_i's are not known, the likelihood and posterior have multiple modes: a non-convex function that is harder to optimize.

Expectation Maximization

Estimating Parameters of a Mixture Model
- Recall that we want to maximize the log-likelihood of a data set with respect to θ:

    \hat{\theta} = \operatorname*{argmax}_{\theta} \; \ell(\theta)

- The log-likelihood for a mixture model can be written as:

    \ell(\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) = \sum_{i=1}^{N} \log \left[ \sum_{k=1}^{K} \pi_k \, p_k(x_i|\theta) \right]

- This is hard to optimize: there is a summation inside the log term (a short numerical sketch follows the next slide).

A 2 Step Approach
- Repeat until converged:
  1 Start with some guess for θ and compute the most likely value of z_i for every i.
  2 Given the z_i's, update θ.
- This does not explicitly maximize the log-likelihood of the mixture model. Can we come up with a better algorithm?
- Repeat until converged:
  1 Start with some guess for θ and compute the probability that z_i = k, for every i and k.
  2 Combine these probabilities to update θ.
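A minimal sketch of evaluating the mixture log-likelihood ℓ(θ) = Σ_i log Σ_k π_k N(x_i; µ_k, σ_k): the inner sum sits inside the log, which is what makes direct optimization awkward. The parameter values and synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_log_likelihood(x, pi, mu, sigma):
    # log pi_k + log N(x_i; mu_k, sigma_k), shape (N, K)
    log_terms = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    # log of the inner sum over k (done stably), then summed over the data
    return logsumexp(log_terms, axis=1).sum()

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3.0, 1.0, 500), rng.normal(10.0, 2.0, 500)])
print(mixture_log_likelihood(x, np.array([0.5, 0.5]),
                             np.array([3.0, 10.0]), np.array([1.0, 2.0])))
```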
Steps in EM
- EM is an iterative procedure.
- Start with some value for θ.
- At every iteration t, update θ so that the log-likelihood of the data goes up: move from θ^{t−1} to θ such that ℓ(θ) − ℓ(θ^{t−1}) is maximized.

EM - Continued
- The complete log-likelihood for any latent variable model is

    \ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i|\theta)

- It cannot be computed directly because we do not know the z_i.
- Instead, use the expected complete log-likelihood

    Q(\theta, \theta^{t-1}) = \mathbb{E}\left[\ell_c(\theta) \mid \mathcal{D}, \theta^{t-1}\right]

  i.e., the expected value of the complete log-likelihood over all possibilities of z_i.

EM Operation
1 Initialize θ.
2 At iteration t, compute Q(θ, θ^{t−1}).
3 Maximize Q with respect to θ to get θ^t.
4 Go to step 2.

Using EM for Mixture Model Parameter Estimation
- The EM formulation is generic; computing (E) and maximizing (M) Q must be worked out for each specific model.
- Q for a mixture model:

    Q(\theta, \theta^{t-1}) = \mathbb{E}\left[\sum_{i=1}^{N} \log p(x_i, z_i|\theta)\right]
                            = \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik} \log \pi_k + \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik} \log p(x_i|\theta_k)

  where r_{ik} ≜ p(z_i = k|x_i, θ^{t−1}).

E-Step
- Compute r_{ik} for all i, k:

    r_{ik} = p(z_i = k|x_i, \theta^{t-1}) = \frac{\pi_k \, p(x_i|\theta_k^{t-1})}{\sum_{k'} \pi_{k'} \, p(x_i|\theta_{k'}^{t-1})}

- Then compute Q.

M-Step
- Maximize Q with respect to θ, which consists of the mixing weights π = {π_1, π_2, . . . , π_K} and the component parameters {θ_1, θ_2, . . . , θ_K}.
- For a Gaussian Mixture Model (GMM), θ_k ≡ (µ_k, Σ_k):

    \pi_k = \frac{1}{N}\sum_i r_{ik}                                              (1)
    \mu_k = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}                              (2)
    \Sigma_k = \frac{\sum_i r_{ik}\, x_i x_i^\top}{\sum_i r_{ik}} - \mu_k \mu_k^\top   (3)

Is K-Means an EM Algorithm?
- K-means behaves like EM for a GMM in which:
  1 Σ_k = σ² I_D for every component,
  2 π_k = 1/K,
  3 the most probable cluster for x_i is taken to be the prototype closest to it (hard clustering).
- A numpy sketch of EM for a GMM, and its K-means special case, follows below.
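A minimal numpy sketch of EM for a one-dimensional Gaussian mixture, following the E-step and the M-step equations (1)-(3) above (with scalar variances in place of covariance matrices). The initialization scheme, number of iterations, and synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, size=K, replace=False)     # initialize means at random data points
    sigma = np.full(K, x.std())                   # initialize all std devs to the data std

    for _ in range(n_iter):
        # E-step: responsibilities r_ik = pi_k N(x_i; mu_k, sigma_k) / normalizer
        num = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
        r = num / num.sum(axis=1, keepdims=True)

        # M-step: update parameters using equations (1)-(3)
        Nk = r.sum(axis=0)
        pi = Nk / N                                               # (1)
        mu = (r * x[:, None]).sum(axis=0) / Nk                    # (2)
        var = (r * x[:, None] ** 2).sum(axis=0) / Nk - mu ** 2    # (3), scalar case
        sigma = np.sqrt(var)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(3.0, 1.0, 500), rng.normal(10.0, 2.0, 500)])
print(em_gmm_1d(x, K=2))

# K-means corresponds to the special case where every sigma_k is a fixed shared
# value, pi_k = 1/K, and the E-step is replaced by a hard assignment of each
# x_i to its single most probable (closest) component.
```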