Maximum-Likelihood and Bayesian Parameter Estimation: Expectation Maximization (EM)
CSE 555: Srihari

Estimating Missing Feature Value
• Estimating a missing variable when the parameters are known.
• In the absence of x1, the most likely class is ω2.
• One approach: choose the value of x1 that maximizes the likelihood.
• Choosing the mean of the missing feature (over all classes) will result in worse performance!
• This is a case of estimating the hidden variable given the parameters. In EM the unknowns are both: the parameters and the hidden variables.
(Figure: class-conditional densities over the two features; x2 is the known value, x1 the missing variable.)

EM Task
• Estimate the unknown parameters θ given measurement data U.
• However, some variables J are missing and need to be integrated out.
• We want to maximize the posterior probability of θ given the data U, marginalizing over J:
$$\hat{\theta} = \arg\max_{\theta} \sum_{J \in \mathcal{J}^n} P(\theta, J \mid U)$$
where θ is the parameter to be estimated, J are the missing variables, and U is the data.

EM Principle
• Estimate the unknown parameters θ given measurement data U, but not the nuisance variables J, which need to be integrated out:
$$\hat{\theta} = \arg\max_{\theta} \sum_{J \in \mathcal{J}^n} P(\theta, J \mid U)$$
• Alternate between estimating the unknowns θ and the hidden variables J.
• At each iteration, instead of finding the single best J given an estimate of θ, EM computes a distribution over the space of possible J.

k-means Algorithm as EM
• Estimate the means of k classes when the class labels are unknown.
• Parameters: the means to be estimated.
• Hidden variables: the class labels.

    begin initialize m1, m2, ..., mk
        do  (E-step) classify the n samples according to the nearest mi
            (M-step) recompute each mi
        until no change in mi
        return m1, m2, ..., mk
    end

• An iterative algorithm derivable from EM.

EM Importance
The EM algorithm
• is widely used for learning in the presence of unobserved variables, e.g., missing features or class labels;
• is used even for variables whose value is never directly observed, provided the general form of the pdf governing these variables is known;
• has been used to train Radial Basis Function networks and Bayesian belief networks;
• is the basis for many unsupervised clustering algorithms;
• is the basis for the widely used Baum-Welch forward-backward algorithm for HMMs.

EM Algorithm
• Learning in the presence of unobserved variables.
• Only a subset of the relevant instance features might be observable.
• Case of unsupervised learning or clustering.
• How many classes are there?

EM Principle
• The EM algorithm iteratively estimates the likelihood given the data that is present.

Likelihood Formulation
• Sample points come from a single distribution: D = {x1, ..., xn}.
• Any sample point may have good and missing (bad) features: xk = {xkg, xkb}.
• The data are accordingly divided into two sets: D = Dg ∪ Db.

Likelihood Formulation
• Central equation in EM:
$$Q(\theta;\, \theta^i) = E_{D_b}\!\left[\ln p(D_g, D_b;\, \theta) \mid D_g;\, \theta^i\right]$$
The expected value is taken over the missing features Db; θ^i is the current best estimate for the full distribution, and θ is a candidate vector for an improved estimate.
• The algorithm will select the best candidate θ and call it θ^{i+1}.

Algorithm EM

    begin initialize θ^0, T, i ← 0
        do  i ← i + 1
            E step: compute Q(θ; θ^i)
            M step: θ^{i+1} ← arg max_θ Q(θ; θ^i)
        until Q(θ^{i+1}; θ^i) − Q(θ^i; θ^{i−1}) < T
        return θ̂ ← θ^{i+1}
    end
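The loop above translates directly into code. Below is a minimal sketch (not from the slides) of that skeleton in Python; `run_em`, `maximize_q`, and `compute_q` are hypothetical names, and a concrete model — such as the 2-D normal example that follows — would have to supply the last two.

```python
# Minimal sketch of the generic EM loop (Algorithm EM above).
# Hypothetical callables supplied by a concrete model:
#   compute_q(theta, theta_i)  -> value of Q(theta; theta_i)
#   maximize_q(theta_i)        -> argmax over theta of Q(theta; theta_i)

def run_em(theta0, maximize_q, compute_q, T=1e-6, max_iter=100):
    """Iterate E/M steps until the improvement in Q drops below the threshold T."""
    theta_prev, theta = None, theta0
    for _ in range(max_iter):
        # M step: pick the candidate that maximizes Q(.; theta).
        # (The E step -- forming Q(.; theta) -- happens inside maximize_q / compute_q.)
        theta_next = maximize_q(theta)
        # Stopping rule from the slides: Q(theta^{i+1}; theta^i) - Q(theta^i; theta^{i-1}) < T
        if theta_prev is not None and \
           compute_q(theta_next, theta) - compute_q(theta, theta_prev) < T:
            return theta_next
        theta_prev, theta = theta, theta_next
    return theta
```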
EM for 2D Normal Model
Suppose the data consist of 4 points in 2 dimensions, one point of which is missing a feature:
$$D = \{x_1, x_2, x_3, x_4\} = \left\{\binom{0}{2}, \binom{1}{0}, \binom{2}{2}, \binom{*}{4}\right\}$$
where * represents the unknown value of the first feature of point x4. Thus our bad data Db consists of the single feature x41, and the good data Dg consists of the rest.
(Figure: the four data points plotted in the (x1, x2) plane.)

EM for 2D Normal Model
Assuming that the model is a Gaussian with diagonal covariance and arbitrary mean, it can be described by the parameter vector
$$\theta = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \sigma_1^2 \\ \sigma_2^2 \end{pmatrix}$$

EM for 2D Normal Model
We take our initial guess to be a Gaussian centered on the origin with Σ = I, that is,
$$\theta^0 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}$$

EM for a 2D Normal Model
To find an improved estimate we must calculate
$$Q(\theta;\, \theta^0) = E_{x_{41}}\!\left[\ln p(D_g, x_{41};\, \theta) \mid \theta^0, D_g\right]
= \int_{-\infty}^{\infty} \left[\sum_{k=1}^{3} \ln p(x_k \mid \theta) + \ln p\!\left(\binom{x_{41}}{4} \,\Big|\, \theta\right)\right] p(x_{41} \mid \theta^0;\, x_{42}=4)\, dx_{41}$$

EM for a 2D Normal Model
Simplifying: because the covariance in θ^0 is diagonal, $p(x_{41} \mid \theta^0; x_{42}=4)$ is just the marginal N(0, 1), so
$$Q(\theta;\, \theta^0) = \sum_{k=1}^{3} \ln p(x_k \mid \theta) + E_{x_{41}\sim N(0,1)}\!\left[\ln p\!\left(\binom{x_{41}}{4} \,\Big|\, \theta\right)\right]$$
This completes the E step. Maximizing over θ (the M step) gives the next estimate
$$\theta^1 = \begin{pmatrix} 0.75 \\ 2.0 \\ 0.938 \\ 2.0 \end{pmatrix}$$
since, for example, $\mu_1 = \tfrac{1}{4}(0 + 1 + 2 + E[x_{41}]) = 0.75$ with $E[x_{41}] = 0$ under θ^0. At convergence µ1 satisfies $\mu_1 = \tfrac{1}{4}(3 + \mu_1) = 1$ and σ1² satisfies $\sigma_1^2 = \tfrac{1}{4}(2 + \sigma_1^2) = 2/3 \approx 0.667$, giving the final solution
$$\theta = \begin{pmatrix} 1.0 \\ 2.0 \\ 0.667 \\ 2.0 \end{pmatrix}$$

EM for 2D Normal Model
(Figure: the four data points, one missing the value of x1, are in red. The initial estimate is a circularly symmetric Gaussian centered on the origin, in gray. A better initial estimate could have been derived from the 3 known points. Each iteration leads to an improved estimate, labeled by iteration number i; after 3 iterations, the algorithm has converged.)

EM to Estimate Means of k Gaussians
• Data D are drawn from a mixture of k distinct normal distributions.
• A two-step process generates each sample:
  1. One of the k distributions is selected at random.
  2. A single random instance xi is generated according to the selected distribution.

Instances Generated by a Mixture of Two Normal Distributions
(Figure: instances drawn from the two normal distributions, plotted along the x axis.)

Example of EM to Estimate Means of k Gaussians
• Each instance is generated by
  1. choosing one of the k Gaussians with uniform probability,
  2. generating an instance at random according to that Gaussian.
• Each of the k normal distributions has the same known variance.
• Learning task: output a hypothesis h = ⟨µ1, ..., µk⟩ that describes the means of each of the k distributions.

Estimating Means of k Gaussians
• We would like to find a maximum likelihood hypothesis for these means: a hypothesis h that maximizes p(D | h).

Maximum Likelihood Estimate of the Mean of a Single Gaussian
• Given observed data instances x1, x2, ..., xm,
• drawn from a single distribution that is normally distributed,
• the problem is to find the mean of that distribution.

Maximum Likelihood Estimate of the Mean of a Single Gaussian
• The maximum likelihood estimate of the mean of a normal distribution can be shown to be the one that minimizes the sum of squared errors:
$$\mu_{ML} = \arg\min_{\mu} \sum_{i=1}^{m} (x_i - \mu)^2$$
• The right-hand side attains its minimum (and the likelihood its maximum) at the sample mean:
$$\mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

Mixture of Two Normal Distributions
• We cannot observe which instances were generated by which distribution.
• Full description of an instance: ⟨xi, zi1, zi2⟩.
• xi = observed value of the ith instance.
• zi1 and zi2 indicate which of the two normal distributions was used to generate xi: zij = 1 if distribution j was used to generate xi, and 0 otherwise.
• zi1 and zi2 are hidden variables, which have probability distributions associated with them.

Hidden Variables Specify the Distribution
(Figure: the two Gaussians; an instance drawn from the first is written ⟨xi, 1, 0⟩, i.e., zi1 = 1, zi2 = 0, while one drawn from the second has zi1 = 0, zi2 = 1.)

2-Means Problem
• Full description of an instance: ⟨xi, zi1, zi2⟩.
• xi = observed variable; zi1 and zi2 are hidden variables.
• If zi1 and zi2 were observed, we could use maximum likelihood estimates for the means (see the sketch below):
$$\mu_1 = \frac{\sum_{i=1}^{m} z_{i1}\, x_i}{\sum_{i=1}^{m} z_{i1}}, \qquad \mu_2 = \frac{\sum_{i=1}^{m} z_{i2}\, x_i}{\sum_{i=1}^{m} z_{i2}}$$
i.e., each mean is simply the sample mean of the instances generated by that distribution.
• Since we do not know zi1 and zi2, we will use EM instead.
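To make this point concrete, the sketch below (not part of the slides) generates data by the two-step process and recovers the means using the formulas just given, assuming the zij are observable. The particular means, variance, and sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground truth (assumed values, not from the slides):
mu_true = np.array([-2.0, 3.0])   # means of the two Gaussians
sigma = 1.0                       # common, known standard deviation
m = 500                           # number of instances

# Two-step generative process: pick a component uniformly, then sample from it.
z = rng.integers(0, 2, size=m)        # hidden labels z_i in {0, 1}
x = rng.normal(mu_true[z], sigma)     # observed values x_i

# If z were observed, the ML estimate of each mean is the sample mean of the
# points generated by that component: mu_j = sum_{i: z_i=j} x_i / #{i: z_i=j}
for j in (0, 1):
    print(f"mu_{j + 1} estimate with z observed: {x[z == j].mean():.3f}")
```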
EM Algorithm Applied to the k-Means Problem
• Search for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden binary variables zij given the current hypothesis ⟨µ1, ..., µk⟩.
• Then recalculate the maximum likelihood hypothesis using these expected values for the hidden variables.

EM Algorithm for Two Means
(Figure: the hard assignments zi1 = 1, zi2 = 0 and zi1 = 0, zi2 = 1 are now regarded as probabilities.)
1. Hypothesize the means, then determine the expected value of the hidden variables for all samples.
2. Use these hidden variable values to recalculate the means.

EM Applied to the Two-Means Problem
• Initialize the hypothesis to h = ⟨µ1, µ2⟩.
• Estimate the expected values of the hidden variables zij given the current hypothesis.
• Recalculate the maximum likelihood hypothesis using these expected values for the hidden variables.
• Re-estimate h repeatedly until the procedure converges to a stationary value of h.

EM Algorithm for 2-Means
Step 1: Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = ⟨µ1, µ2⟩ holds.
Step 2: Calculate a new maximum likelihood hypothesis h′ = ⟨µ1′, µ2′⟩, assuming that the value taken on by each hidden variable zij is its expected value E[zij] calculated in Step 1. Then replace the hypothesis h = ⟨µ1, µ2⟩ by the new hypothesis h′ = ⟨µ1′, µ2′⟩ and iterate.

EM First Step
Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨µ1, µ2⟩ holds:
$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)}
           = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$
This is the probability that instance xi was generated by the jth Gaussian.

EM Second Step
Calculate a new maximum likelihood hypothesis h′ = ⟨µ1′, µ2′⟩:
$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
Observation: this is similar to the earlier sample-mean calculation for a single Gaussian, $\mu_{ML} = \frac{1}{m}\sum_{i=1}^{m} x_i$, except that each instance is weighted by the probability that it was generated by Gaussian j.
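Putting the two steps together, here is a minimal sketch (not from the slides) of the resulting two-means EM iteration on synthetic 1-D data. The common variance is taken as σ² = 1, matching the slides' assumption of a known, shared variance; the true means, sample size, and initial hypothesis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a mixture of two Gaussians with known sigma^2 = 1
# (illustrative means and sample size, not from the slides).
sigma2 = 1.0
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])

mu = np.array([0.0, 1.0])          # initial hypothesis h = <mu_1, mu_2>
for iteration in range(50):
    # E step: E[z_ij] = exp(-(x_i - mu_j)^2 / 2 sigma^2) / sum_n exp(-(x_i - mu_n)^2 / 2 sigma^2)
    log_p = -((x[:, None] - mu[None, :]) ** 2) / (2.0 * sigma2)
    p = np.exp(log_p)
    E_z = p / p.sum(axis=1, keepdims=True)

    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    new_mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)

    if np.allclose(new_mu, mu, atol=1e-8):   # stationary value of h reached
        break
    mu = new_mu

print("estimated means:", np.round(mu, 3))
```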
Clustering Using EM

Feature Extraction
• Each image I_1 is mapped by feature extraction to a feature vector of 74 values, each normalized to [0, 1].
• 2 global features: aspect ratio and stroke ratio.
• 72 local features F(i, j) = N(i)·S(j), i = 1, ..., 9, j = 0, ..., 7, with S(j) = max s(i, j), where s(i, j) is the number of components with slope j in subimage i and N(i) is the number of components in subimage i.
(Figure: an example character image and its 74-value feature vector.)

Clustering
• For one image I_1, the values P(I_1 | C_1), ..., P(I_1 | C_5) across the EM cycles:
    Cycle 1:  1.000000  0.000000  0.000000  0.000000  0.000000
    Cycle 2:  1.000000  0.000000  0.000000  0.000000  0.000000
    …
    Cycle N:  0.990915  0.009083  0.000001  0.000000  0.000000
• Final cluster centers (shown as the images closest to the means), with P(C_1) = 0.206088, P(C_2) = 0.228980, P(C_3) = 0.203198, and the two remaining clusters at 0.218480 and 0.143254.

Clustering
• Cluster 1 is initialized with the feature vector of Image 1 as its mean; after 10 EM cycles the final mean for Cluster 1 is obtained.
(Figure: the 74-dimensional initial and final mean vectors for Cluster 1.)

Essence of the EM Algorithm
• The current hypothesis is used to estimate the unobserved variables.
• The expected values of these variables are then used to calculate an improved hypothesis.
• It can be shown that on each iteration through the loop, EM increases the likelihood P(D | h) unless it is at a local maximum (see the numerical check below).
• The algorithm converges to a local maximum likelihood hypothesis for ⟨µ1, µ2⟩.
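The claim that each EM iteration increases P(D | h) can be checked numerically. The sketch below (not from the slides) runs the two-means updates on illustrative synthetic data with known σ² = 1 and uniform mixing, and asserts that the observed-data log-likelihood never decreases.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative two-component data with known sigma^2 = 1 and uniform mixing
# (assumed values, not from the slides).
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
mu = np.array([1.0, 2.0])                      # deliberately poor initial hypothesis

def log_likelihood(x, mu):
    # ln P(D | h) for the mixture: ln prod_i sum_j 0.5 * N(x_i; mu_j, 1)
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return np.log(0.5 * dens.sum(axis=1)).sum()

prev = -np.inf
for it in range(20):
    ll = log_likelihood(x, mu)
    assert ll >= prev - 1e-9, "likelihood decreased"   # monotonicity check
    prev = ll
    # One E/M iteration (same updates as in the two-means sketch earlier).
    E_z = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    E_z /= E_z.sum(axis=1, keepdims=True)
    mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    print(f"iteration {it:2d}: ln P(D|h) = {ll:.3f}, mu = {np.round(mu, 3)}")
```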
General Statement of the EM Algorithm
• In the example above, the parameters of interest were θ = ⟨µ1, µ2⟩.
• The full data were the triples ⟨xi, zi1, zi2⟩, of which only xi is observed.

General Statement of the EM Algorithm (cont'd.)
• Let X = {x1, x2, ..., xm} denote the observed data in a set of m independently drawn instances.
• Let Z = {z1, z2, ..., zm} denote the unobserved data in these same instances.
• Let Y = X ∪ Z denote the full data.

General Statement of the EM Algorithm (cont'd.)
• The unobserved Z can be treated as a random variable whose probability density depends on the unknown parameters θ and on the observed data X.
• Similarly, Y is a random variable because it is defined in terms of the random variable Z.
• h denotes the current hypothesized values of the parameters θ.
• h′ denotes the revised hypothesis estimated on each iteration of the EM algorithm.

General Statement of the EM Algorithm (cont'd.)
• The EM algorithm searches for the maximum likelihood hypothesis by seeking the h′ that maximizes E[ln P(Y | h′)].

General Statement of the EM Algorithm (cont'd.)
• Define the function Q(h′ | h) that gives E[ln P(Y | h′)] as a function of h′, under the assumption that θ = h and given the observed portion X of the full data Y:
$$Q(h' \mid h) \leftarrow E[\ln P(Y \mid h') \mid h, X]$$

General Statement of the EM Algorithm
Repeat until convergence:
• Step 1 — Estimation (E) step: Calculate Q(h′ | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
$$Q(h' \mid h) \leftarrow E[\ln P(Y \mid h') \mid h, X]$$
• Step 2 — Maximization (M) step: Replace the hypothesis h by the hypothesis h′ that maximizes this Q function:
$$h \leftarrow \arg\max_{h'} Q(h' \mid h)$$

Derivation of the k-Means Algorithm from the General EM Algorithm
• Derive the previously seen algorithm for estimating the means of a mixture of k normal distributions, i.e., estimate the parameters θ = ⟨µ1, ..., µk⟩.
• We are given the observed data X = {⟨xi⟩}.
• The hidden variables Z = {⟨zi1, ..., zik⟩} indicate which of the k normal distributions was used to generate xi.

Derivation of the k-Means Algorithm from the General EM Algorithm (cont'd.)
• We need to derive an expression for Q(h′ | h).
• First we derive an expression for ln P(Y | h′).

Derivation of the k-Means Algorithm
• The probability p(yi | h′) of a single instance yi = ⟨xi, zi1, ..., zik⟩ of the full data can be written
$$p(y_i \mid h') = p(x_i, z_{i1}, \ldots, z_{ik} \mid h') = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu'_j)^2}$$

Derivation of the k-Means Algorithm (cont'd.)
• Given this probability p(yi | h′) for a single instance, the log probability ln P(Y | h′) for all m instances of the data is
$$\ln P(Y \mid h') = \ln \prod_{i=1}^{m} p(y_i \mid h') = \sum_{i=1}^{m} \ln p(y_i \mid h')
 = \sum_{i=1}^{m}\left(\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu'_j)^2\right)$$

Derivation of the k-Means Algorithm (cont'd.)
• In general, for any function f(z) that is a linear function of z, the following equality holds: E[f(z)] = f(E[z]). Therefore,
$$E[\ln P(Y \mid h')] = E\!\left[\sum_{i=1}^{m}\left(\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} z_{ij}(x_i - \mu'_j)^2\right)\right]
 = \sum_{i=1}^{m}\left(\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2\right)$$

Derivation of the k-Means Algorithm
• To summarize: where h′ = ⟨µ1′, ..., µk′⟩ and E[zij] is calculated based on the current hypothesis h and the observed data X, the Q function for the k-means problem is
$$Q(h' \mid h) = \sum_{i=1}^{m}\left(\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2\right)$$
where
$$E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{k} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

Derivation of the k-Means Algorithm: Second (Maximization) Step
• To find the values h′ = ⟨µ1′, ..., µk′⟩ that maximize Q:
$$\arg\max_{h'} Q(h' \mid h) = \arg\max_{h'} \sum_{i=1}^{m}\left(\ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2\right)
 = \arg\min_{h'} \sum_{i=1}^{m}\sum_{j=1}^{k} E[z_{ij}](x_i - \mu'_j)^2$$
• This weighted sum of squared errors is minimized by setting each mean to the weighted sample mean:
$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$

Summary
• In many parameter estimation tasks, some of the relevant instance variables may be unobservable. In this case, the EM algorithm is useful.