The EM Algorithm

Preview
• The EM algorithm
• Mixture models
• Why EM works
• EM variants

Learning with Missing Data
• Goal: Learn the parameters of a Bayes net with known structure
• For now: Maximum likelihood
• Suppose the values of some variables in some samples are missing
• If we knew all the values, computing the parameters would be easy
• If we knew the parameters, we could infer the missing values
• “Chicken and egg” problem

The EM Algorithm
• Initialize the parameters, ignoring the missing information
• Repeat until convergence:
  E step: Compute expected values of the unobserved variables, assuming the current parameter values
  M step: Compute new parameter values to maximize the probability of the data (observed & estimated)
• (Alternatively, initialize the expected values, ignoring the missing information)

Example
Structure: A → B → C, with parameters P(A), P(B|A), P(B|¬A), P(C|B), P(C|¬B).
Examples (one value of B is missing):
  A  B  C
  0  1  1
  1  0  0
  1  1  1
  1  ?  0
Initialization: estimate P(A), P(B|A), P(B|¬A), P(C|B), P(C|¬B) from the observed values, ignoring the missing entry.
E-step: P(? = 1) = P(B | A, ¬C) = P(A, B, ¬C) / P(A, ¬C) = … = 0
M-step: re-estimate P(A), P(B|A), P(B|¬A), P(C|B), P(C|¬B) using the observed values together with the estimated value of ?.
E-step: P(? = 1) = 0 (converged)

Hidden Variables
• What if some variables are always missing?
• In general, a difficult problem
• Consider the Naive Bayes structure, with the class missing:
  P(x) = Σ_{i=1}^{n_c} P(c_i) ∏_{j=1}^{d} P(x_j | c_i)

Naive Bayes Model
[Figure: (a) a naive Bayes model with hidden class Bag (CPTs P(Bag=1) and P(F=cherry|B) for bags 1 and 2) and observed children Flavor, Wrapper, Holes; (b) a naive Bayes model with class C and observed features F1, F2, …, X]

Clustering
• Goal: Group similar objects
• Example: Group Web pages with similar topics
• Clustering can be hard or soft
• What’s the objective function?

Mixture Models
  P(x) = Σ_{i=1}^{n_c} P(c_i) P(x | c_i)
• Objective function: log-likelihood of the data
• Naive Bayes: P(x | c_i) = ∏_{j=1}^{d} P(x_j | c_i)
• AutoClass: Naive Bayes with various models for the x_j
• Mixture of Gaussians: P(x | c_i) is a multivariate Gaussian
• In general: P(x | c_i) can be any distribution

Mixtures of Gaussians
[Figure: a one-dimensional mixture density p(x) plotted against x]
In one dimension, with variance σ²:
  P(x | µ_i) = (1 / √(2πσ²)) exp(−½ ((x − µ_i) / σ)²)

EM for Mixtures of Gaussians
Simplest case: assume known priors and covariances.
Initialization: choose the means at random.
E step: for all samples x_k,
  P(µ_i | x_k) = P(µ_i) P(x_k | µ_i) / P(x_k) = P(µ_i) P(x_k | µ_i) / Σ_{i′} P(µ_{i′}) P(x_k | µ_{i′})
M step: for all means µ_i,
  µ_i = Σ_k x_k P(µ_i | x_k) / Σ_k P(µ_i | x_k)

Mixtures of Gaussians (cont.)
• K-means clustering is a special case of EM for mixtures of Gaussians
• Mixtures of Gaussians are a special case of Bayes nets
• Also good for estimating the joint distribution of continuous variables

Why EM Works
[Figure: the log-likelihood LL(θ_new) and the lower bound LL_old + Q(θ_new) plotted against θ, with θ_old and θ_new marked; maximizing the bound at θ_new cannot decrease the log-likelihood below LL_old]
  θ_new = argmax_θ E_{θ_old}[ log P(X | θ) ]
(The expectation over the unobserved variables is taken with respect to the current parameters θ_old.)

EM Variants
• MAP: Compute MAP estimates instead of ML estimates in the M step
• GEM: Just increase the likelihood in the M step
• MCMC: Approximate the E step
• Simulated annealing: Avoid local maxima
• Early stopping: Faster, may reduce overfitting
• Structural EM: Missing data and unknown structure

Summary
• The EM algorithm
• Mixture models
• Why EM works
• EM variants
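
Code sketch: EM for a mixture of Gaussians
A minimal sketch of the “EM for Mixtures of Gaussians” recipe above, in the simplest case the slides describe (equal, known priors and a known shared variance, so only the means are updated). This is illustrative code, not from the lecture: the function name em_gaussian_means, the one-dimensional setting, and the toy data are assumptions of this sketch.

```python
import numpy as np

def em_gaussian_means(x, n_components=2, sigma=1.0, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture with known equal priors and known
    shared standard deviation sigma; only the means are re-estimated."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    prior = np.full(n_components, 1.0 / n_components)     # known P(mu_i)
    mu = rng.choice(x, size=n_components, replace=False)  # init: random means

    for _ in range(n_iter):
        # E step: P(mu_i | x_k) ∝ P(mu_i) P(x_k | mu_i), normalized over i.
        lik = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        lik /= np.sqrt(2.0 * np.pi) * sigma
        resp = prior * lik
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: mu_i = sum_k x_k P(mu_i | x_k) / sum_k P(mu_i | x_k).
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

if __name__ == "__main__":
    # Toy data: two clusters around -2 and +3.
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
    print(em_gaussian_means(data, n_components=2, sigma=1.0))
```

Setting sigma very small pushes the soft responsibilities toward hard 0/1 assignments, which is the sense in which K-means clustering is a special case of this procedure.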