Parameter Learning and EM
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Parameter Estimation and Prediction

The Bayesian approach treats the unknown parameter $\theta$ as a random variable:
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$
Posterior mean estimation: $\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta$
Maximum likelihood approach: $\theta_{ML} = \arg\max_\theta p(D \mid \theta)$, $\theta_{MAP} = \arg\max_\theta p(D \mid \theta)\, p(\theta)$
Bayesian prediction takes into account all possible values of $\theta$:
$p(x_{new} \mid D) = \int p(x_{new}, \theta \mid D)\, d\theta = \int p(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$
A frequentist prediction uses a "plug-in" estimator:
$p(x_{new} \mid D) = p(x_{new} \mid \theta_{ML})$ or $p(x_{new} \mid D) = p(x_{new} \mid \theta_{MAP})$

MLE for directed models

For the Allergy, Flu $\to$ Sinus $\to$ Headache network:
$\ell(\theta; D) = \log p(D \mid \theta) = \sum_n \log p(a_n \mid \theta_a) + \sum_n \log p(f_n \mid \theta_f) + \sum_n \log p(s_n \mid a_n, f_n, \theta_s) + \sum_n \log p(h_n \mid s_n, \theta_h)$
There is one term for each CPT, so the MLE problem breaks up into independent subproblems.
Because of the factorization of the distribution, we can estimate each CPT separately.
[Figure: the Allergy, Flu $\to$ Sinus $\to$ Headache network decomposed into one sub-network per CPT, each learned separately]

MLE for BNs with tabular CPTs

Assume each CPT is represented as a table (multinomial): $\theta_{ijk} = p(X_i = j \mid X_{\pi_i} = k)$
Note that in the case of multiple parents, $X_{\pi_i}$ has a composite state, and the CPT is a high-dimensional table.
The sufficient statistics are counts of family configurations: $n_{ijk} = \#(X_i = j \wedge X_{\pi_i} = k)$
The log-likelihood is
$\ell(\theta; D) = \log \prod_{ijk} \theta_{ijk}^{n_{ijk}} = \sum_{ijk} n_{ijk} \log \theta_{ijk}$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get
$\hat{\theta}_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$

Bayesian estimator for directed models

Factorization: $p(X = x) = \prod_i p(x_i \mid x_{\pi_i}, \theta_i)$
Local CPT: multinomial distribution $p(X_i = j \mid X_{\pi_i} = k) = \theta_{ijk}$
Factorized prior over parameters: $p(\theta_a)\, p(\theta_f)\, p(\theta_s)\, p(\theta_h)$
[Figure: the Allergy, Flu $\to$ Sinus $\to$ Headache network with parameter nodes $\theta_a, \theta_f, \theta_s, \theta_h$ attached to their CPTs]

Parameter independence

Provided all variables are observed, we can perform the Bayesian update for each parameter independently: global parameter independence and local parameter independence.
Discrete DAG models: $X_i \mid X_{\pi_i} \sim \mathrm{Multi}(\theta)$
Dirichlet prior: $p(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
[Figure: a small DAG over $X_1, \ldots, X_4$ with independent parameter nodes such as $\theta_1, \theta_{21}, \theta_{22}$]

MLE for undirected models

For a pairwise Markov random field,
$p(x_1, \ldots, x_n \mid \theta) = \frac{1}{Z(\theta)} \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j)$
($\theta_i x_i$ and $\theta_{ij} x_i x_j$ may be replaced by other feature functions of $x_i$ and $x_i, x_j$.)
The log-likelihood is
$\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^{N} \log p(x^n \mid \theta) = \frac{1}{N} \sum_n \left( \sum_i \theta_i x_i^n + \sum_{ij} \theta_{ij} x_i^n x_j^n \right) - \log Z(\theta)$
where $Z(\theta) = \sum_x \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j)$.
The last term, $\log Z(\theta)$, does not decompose!
[Figure: a $3 \times 3$ grid MRF over $X_1, \ldots, X_9$]

Derivatives of the log-likelihood

$\ell(\theta; D) = \frac{1}{N} \sum_n \left( \sum_i \theta_i x_i^n + \sum_{ij} \theta_{ij} x_i^n x_j^n \right) - \log Z(\theta)$
$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_n x_i^n x_j^n - \frac{\partial \log Z(\theta)}{\partial \theta_{ij}}$
$\frac{\partial \log Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \sum_x x_i x_j \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j) = E_{p(x \mid \theta)}[x_i x_j]$
This is a convex problem, so we can find the global optimum.
The model expectation $E_{p(x \mid \theta)}[x_i x_j]$ needs to be computed by inference.

Moment matching condition

$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_n x_i^n x_j^n - E_{p(x \mid \theta)}[x_i x_j]$
The first term is the empirical covariance matrix; the second is the covariance matrix from the model.
Define the empirical distribution $\tilde{P}(x_i, x_j) = \frac{1}{N} \sum_{n=1}^{N} \delta(x_i, x_i^n)\, \delta(x_j, x_j^n)$.
Moment matching: $\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = E_{\tilde{P}(x_i, x_j)}[x_i x_j] - E_{p(x \mid \theta)}[x_i x_j]$

Optimize MLE for undirected models

$\max_\theta \ell(\theta; D)$ is a convex optimization problem.
It can be solved by many methods, such as gradient descent/ascent or conjugate gradient.
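As a concrete illustration of the gradient evaluated in the procedure below, here is a minimal sketch for a tiny pairwise binary MRF. The model moments are obtained by brute-force enumeration of $Z(\theta)$, which is feasible only for a handful of variables; in a realistic model an inference routine would go there instead. The function and variable names are illustrative, not from the slides.

```python
import itertools
import numpy as np

def mrf_loglik_grad(X, theta_node, theta_edge, edges):
    """Moment-matching gradient for a tiny pairwise binary MRF.

    X          : (N, d) array of {0, 1} observations
    theta_node : (d,) node parameters theta_i
    theta_edge : dict {(i, j): value} edge parameters theta_ij
    edges      : list of (i, j) index pairs

    Returns (grad_node, grad_edge) = empirical moments minus model moments.
    """
    N, d = X.shape

    # Empirical moments E_ptilde[x_i] and E_ptilde[x_i x_j]
    emp_node = X.mean(axis=0)
    emp_edge = {(i, j): np.mean(X[:, i] * X[:, j]) for (i, j) in edges}

    # Brute-force "inference": score every one of the 2^d configurations
    configs = np.array(list(itertools.product([0, 1], repeat=d)))
    scores = configs @ theta_node
    for (i, j) in edges:
        scores = scores + theta_edge[(i, j)] * configs[:, i] * configs[:, j]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # p(x | theta)

    # Model moments E_{p(x|theta)}[x_i] and E_{p(x|theta)}[x_i x_j]
    model_node = probs @ configs
    model_edge = {(i, j): np.sum(probs * configs[:, i] * configs[:, j])
                  for (i, j) in edges}

    grad_node = emp_node - model_node
    grad_edge = {e: emp_edge[e] - model_edge[e] for e in edges}
    return grad_node, grad_edge
```

A single gradient-ascent update would then be `theta_node += eta * grad_node` and `theta_edge[e] += eta * grad_edge[e]` for each edge, which matches the moment-matching update in the loop below.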
A generic gradient procedure:
- Initialize model parameters $\theta$.
- Loop until convergence:
  - Compute $\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = E_{\tilde{P}(x_i, x_j)}[x_i x_j] - E_{p(x \mid \theta)}[x_i x_j]$
  - Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta; D)}{\partial \theta_{ij}}$ (gradient ascent on $\ell$)
Or use the gradient equation for a fixed-point iteration: iterative proportional fitting.

Exponential family

For a random variable $X$:
$p(x \mid \eta) = \frac{1}{Z(\eta)} h(x) \exp(\eta^\top T(x))$, where $Z(\eta) = \int h(x) \exp(\eta^\top T(x))\, dx$
Equivalently, $p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$
$\eta$: natural (canonical) parameter
$T(x)$: sufficient statistic
$A(\eta) = \log Z(\eta)$: log-partition function
Examples: Bernoulli, multinomial, Gaussian, Poisson, Gamma, ...

Multivariate Gaussian

$p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$
Random variable $X \in \mathbb{R}^k$:
$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
$= \frac{1}{(2\pi)^{k/2}} \exp\left(-\frac{1}{2} \mathrm{tr}(\Sigma^{-1} x x^\top) + \mu^\top \Sigma^{-1} x - \frac{1}{2} \mu^\top \Sigma^{-1} \mu - \frac{1}{2} \log|\Sigma|\right)$
Exponential family representation:
$\eta = \left(\Sigma^{-1}\mu;\; -\frac{1}{2}\mathrm{vec}(\Sigma^{-1})\right)$, $\quad T(x) = \left(x;\; \mathrm{vec}(x x^\top)\right)$
$A(\eta) = \frac{1}{2} \mu^\top \Sigma^{-1} \mu + \frac{1}{2} \log|\Sigma|$, $\quad h(x) = (2\pi)^{-k/2}$

Multinomial distribution

$p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$
A multinomial distribution with $K$ values: binary vector $x \in \{0, 1\}^K$ with one and only one entry non-zero, $x \sim \mathrm{Multi}(x \mid \pi)$, $\sum_k x_k = 1$, $\sum_k \pi_k = 1$.
$p(x \mid \pi) = \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_K^{x_K} = \exp\left(\sum_{k=1}^{K} x_k \log \pi_k\right)$
$= \exp\left(\sum_{k=1}^{K-1} x_k \log \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \log\left(1 - \sum_{k=1}^{K-1} \pi_k\right)\right)$
$= \exp\left(\sum_{k=1}^{K-1} x_k \log \frac{\pi_k}{1 - \sum_{j=1}^{K-1} \pi_j} + \log\left(1 - \sum_{k=1}^{K-1} \pi_k\right)\right)$
$\eta = \left(\log \frac{\pi_k}{\pi_K};\; 0\right)$, $\quad T(x) = x$, $\quad A(\eta) = -\log\left(1 - \sum_{k=1}^{K-1} \pi_k\right)$, $\quad h(x) = 1$

Why exponential family?

Moment generating property: we can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer.
Mean:
$\frac{dA}{d\eta} = \frac{d}{d\eta} \log Z(\eta) = \frac{1}{Z(\eta)} \frac{d}{d\eta} \int h(x) \exp(\eta^\top T(x))\, dx = \int T(x)\, \frac{h(x) \exp(\eta^\top T(x))}{Z(\eta)}\, dx = E_{p(x \mid \eta)}[T(x)]$
Variance:
$\frac{d^2 A}{d\eta^2} = E_{p(x \mid \eta)}[T^2(x)] - E_{p(x \mid \eta)}[T(x)]^2 = \mathrm{Var}[T(x)]$

MLE for the exponential family

For iid data, the log-likelihood is
$\ell(\eta; D) = \log \prod_n h(x_n) \exp(\eta^\top T(x_n) - A(\eta)) = \sum_n \log h(x_n) + \eta^\top \sum_n T(x_n) - N A(\eta)$
Take derivatives and set to zero:
$\frac{\partial \ell(\eta; D)}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0 \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N} \sum_n T(x_n)$
This is the moment matching condition for the exponential family.

Partially observed graphical models

Examples (figures): speech recognition, biological evolution, and mixture models (a latent indicator $Z_i$ selecting between components $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ for each $X_i$).

Unobserved variables

A variable can be unobserved (latent, hidden, missing) because:
- It is an imaginary quantity meant to provide a simplified and abstract view of the data generation process (e.g. mixture models, topic modeling, image context).
- It is a real-world object and/or phenomenon, but difficult or impossible to measure (e.g. causes of a disease, evolutionary ancestors).
- It is a real-world object and/or phenomenon, but sometimes wasn't measured, because of faulty sensors, etc.
Discrete latent variables can be used to partition/cluster data into subgroups.
Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).

Gaussian mixture model

A density model $p(x)$ may be multi-modal: model it as a mixture of uni-modal distributions (e.g. Gaussians).
Consider a mixture of $K$ Gaussians:
$p(x) = \sum_k \pi_k\, N(x \mid \mu_k, \Sigma_k)$
$\pi_k$: mixture proportion; $N(x \mid \mu_k, \Sigma_k)$: mixture component.
Learn $\pi_k, \mu_k, \Sigma_k$.
This corresponds to $p(x) = \sum_z N(x \mid z; \mu_z, \Sigma_z)\, p(z; \pi)$.
Can be used for unsupervised clustering.
[Figure: a two-component mixture density with components $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$]
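A minimal sketch of how the mixture density above can be evaluated and sampled by ancestral sampling, assuming SciPy's multivariate normal pdf; `gmm_density` and `gmm_sample` are illustrative names, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

def gmm_sample(n, pis, mus, Sigmas, seed=0):
    """Ancestral sampling: draw z ~ Multi(pi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    zs = rng.choice(len(pis), size=n, p=pis)           # latent component indicators
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs
```

For example, `gmm_sample(500, [0.3, 0.7], mus, Sigmas)` produces the kind of multi-modal data that the EM algorithm on the following slides is meant to fit.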
Why is learning hard?

In fully observed iid settings, the log-likelihood decomposes into a sum of local terms:
$\ell(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_1) + \log p(x \mid z, \theta_2)$
With latent variables, all the parameters become coupled together via marginalization:
$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(x \mid z, \theta_2)\, p(z \mid \theta_1)$

Key questions in the EM algorithm

EM: Expectation-Maximization for finding $\theta$ that maximizes
$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(x \mid z, \theta_2)\, p(z \mid \theta_1)$
Expectation step (E-step):
- What distribution do we take the expectation with? $q(z) = p(z \mid x, \theta)$
- What do we take the expectation over? $Q(\theta) = E_{q(z)}[\log p(x, z \mid \theta)]$
Maximization step (M-step):
- What do we maximize? $Q(\theta)$
- What do we maximize with respect to? $\theta$

Example: Gaussian mixture model

A mixture of $K$ Gaussians.
$Z$ is a latent class indicator vector: $p(z_n) = \pi_1^{z_n^1} \pi_2^{z_n^2} \cdots \pi_K^{z_n^K}$
$X$ is a conditional Gaussian variable with class-specific mean and covariance:
$p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)\right)$
The likelihood of a sample:
$p(x_n \mid \pi, \mu, \Sigma) = \sum_k p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)$
The expected complete log-likelihood:
$\langle \ell_c(\theta; \{x, z\}) \rangle_{q(z \mid \{x\})} = \sum_n \langle \log p(z_n \mid \pi) \rangle_q + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_q$
$= \sum_n \sum_k \langle z_n^k \rangle_q \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle_q \left( (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) + \log|\Sigma_k| + C \right)$

E-step

We maximize $\langle \ell_c(\theta; \{x, z\}) \rangle_{q(z \mid \{x\})}$ iteratively using the following procedure.
Expectation step: compute the expected values of the sufficient statistics of the hidden variables ($z$) given the current estimate of the parameters ($\pi, \mu, \Sigma$):
$\tau_n^k = \langle z_n^k \rangle_{q(z \mid \{x\})} = p(z_n^k = 1 \mid x, \mu, \Sigma) = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_i \pi_i\, N(x_n \mid \mu_i, \Sigma_i)}$
We are essentially doing inference.

M-step

We maximize $\langle \ell_c(\theta; \{x, z\}) \rangle_{q(z \mid \{x\})}$ iteratively using the following procedure.
Maximization step: compute the parameters under the current expected complete log-likelihood
$\langle \ell_c(\theta; \{x, z\}) \rangle_{q(z \mid \{x\})} = \sum_n \sum_k \tau_n^k \log \pi_k - \frac{1}{2} \sum_n \sum_k \tau_n^k \left( (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k) + \log|\Sigma_k| + C \right)$
$\pi_k = \arg\max_{\pi_k} \langle \ell_c \rangle$ s.t. $\sum_k \pi_k = 1 \;\Rightarrow\; \pi_k = \frac{\sum_n \tau_n^k}{N}$
$\mu_k = \arg\max_{\mu_k} \langle \ell_c \rangle \;\Rightarrow\; \mu_k = \frac{\sum_n \tau_n^k\, x_n}{\sum_n \tau_n^k}$
$\Sigma_k = \arg\max_{\Sigma_k} \langle \ell_c \rangle \;\Rightarrow\; \Sigma_k = \frac{\sum_n \tau_n^k\, (x_n - \mu_k)(x_n - \mu_k)^\top}{\sum_n \tau_n^k}$
Useful facts: $\frac{\partial \log|A^{-1}|}{\partial A^{-1}} = A^\top$ and $\frac{\partial\, x^\top A x}{\partial A} = x x^\top$

Expectation-Maximization iterations

K-means vs. EM for Gaussian mixtures

The EM algorithm for the mixture of Gaussians is like a soft clustering algorithm.
K-means:
- "E-step": hard assignment: $z_n = \arg\min_k (x_n - \mu_k)^\top \Sigma_k^{-1} (x_n - \mu_k)$
- "M-step": update the means and covariances of the clusters using maximum likelihood estimates:
$\mu_k = \frac{\sum_n \delta(z_n, k)\, x_n}{\sum_n \delta(z_n, k)}$, $\quad \Sigma_k = \frac{\sum_n \delta(z_n, k)\, (x_n - \mu_k)(x_n - \mu_k)^\top}{\sum_n \delta(z_n, k)}$

Theory underlying EM

What are we doing?
Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data.
But we are iterating these:
Expectation step (E-step): $Q(\theta) = E_{q(z)}[\log p(x, z \mid \theta)]$, where $q(z) = p(z \mid x, \theta^t)$
Maximization step (M-step): $\theta^{t+1} = \arg\max_\theta Q(\theta)$
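To tie the E-step and M-step formulas together, here is a sketch of the full EM loop for a Gaussian mixture, following the $\tau, \pi, \mu, \Sigma$ updates above. The initialization choices, the small ridge added to the covariances, and all names are illustrative assumptions rather than part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a mixture of K Gaussians (sketch of the E-step / M-step slides)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: uniform mixing weights, K random data points as means
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: responsibilities tau[n, k] = pi_k N(x_n|mu_k,Sigma_k) / sum_i pi_i N(x_n|mu_i,Sigma_i)
        tau = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)])
        tau /= tau.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the expected sufficient statistics
        Nk = tau.sum(axis=0)                      # effective counts per component
        pis = Nk / N                              # pi_k = sum_n tau[n,k] / N
        mus = (tau.T @ X) / Nk[:, None]           # mu_k = sum_n tau[n,k] x_n / sum_n tau[n,k]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

    return pis, mus, Sigmas, tau
```

For example, `pis, mus, Sigmas, tau = em_gmm(X, K=2)` fits a two-component mixture; `tau` contains the soft assignments from the final E-step, which is what distinguishes EM from the hard assignments of K-means.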