Parameter Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Sampling Methods
- Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.).
- Rejection sampling: create samples as in direct sampling, but only count the samples consistent with the given evidence.
- Importance sampling: create samples as in direct sampling and assign weights to the samples.
- Gibbs sampling: often used for high-dimensional problems; sample each variable conditioned on its Markov blanket.

Gibbs Sampling in Formulas
Gibbs sampling:
- Initialize $x^0$.
- For $t = 1$ to $N$:
  $x_1^t \sim P(X_1 \mid x_2^{t-1}, \ldots, x_K^{t-1})$
  $x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \ldots, x_K^{t-1})$
  $\ldots$
  $x_K^t \sim P(X_K \mid x_1^t, \ldots, x_{K-1}^t)$
For graphical models, we only need to condition on the variables in the Markov blanket.
Variants:
- Randomly pick the variable to sample.
- Sample block by block, e.g. $(x_1^t, x_2^t) \sim P(X_1, X_2 \mid x_3^{t-1}, \ldots, x_K^{t-1})$.
(Figure: a small graphical model over $X_1, \ldots, X_5$.)

Gibbs Sampling: Image Segmentation
Pairwise Markov random field: $P(X) = \frac{1}{Z} \prod_i \Psi(x_i) \prod_{ij} \Psi(x_i, x_j)$.
We need the conditional $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$:
$\frac{P(x_1, \ldots, x_n)}{P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)} = \frac{\frac{1}{Z} \prod_i \Psi(x_i) \prod_{ij} \Psi(x_i, x_j)}{\frac{1}{Z} \sum_{x_i} \prod_i \Psi(x_i) \prod_{ij} \Psi(x_i, x_j)} \propto \Psi(x_i) \prod_{j \in N(i)} \Psi(x_i, x_j)$
- Terms without $x_i$ cancel out.
- $x_i$ is summed out in the denominator.
(Figure: a grid-structured MRF over pixels $X_1, \ldots, X_9$.)

Convergence of Gibbs Sampling
Not all samples $x^0, \ldots, x^N$ are independent.
Consider a particular marginal $P(x_i \mid u_i)$: the empirical estimate approaches the true $P(x_i \mid u_i)$ as $t$ grows. Discard the initial burn-in period and take samples from there on.
(Figure: empirical estimate of $P(x_i \mid u_i)$ versus $t$, converging to the true value after the burn-in.)
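To make the Gibbs sweep above concrete, here is a minimal sketch for a binary pairwise MRF on an image-like grid, assuming Ising-style potentials $\Psi(x_i) = \exp(b\, x_i)$ and $\Psi(x_i, x_j) = \exp(w\, x_i x_j)$ with $x_i \in \{-1, +1\}$; the function name and parameter values are illustrative, not taken from the slides.

```python
import numpy as np

def gibbs_pairwise_mrf(H, W, w=1.0, b=0.0, n_iters=200, seed=None):
    """Gibbs sampling for a binary (+1/-1) pairwise MRF on an H x W grid.

    Node potential  Psi(x_i)      = exp(b * x_i)
    Edge potential  Psi(x_i, x_j) = exp(w * x_i * x_j)
    so P(x_i | rest) depends only on the Markov blanket (the grid neighbours).
    """
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(H, W))           # x^0: random initial state
    samples = []
    for t in range(n_iters):
        for i in range(H):                          # one full sweep = one Gibbs iteration
            for j in range(W):
                # Sum over the Markov blanket of pixel (i, j): its grid neighbours.
                nb = 0.0
                if i > 0:     nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                # P(x_ij = +1 | blanket) is proportional to exp(b + w * nb);
                # normalising over the two states gives a logistic form.
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * (b + w * nb)))
                x[i, j] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return samples

# Discard the burn-in portion before using the samples (cf. the convergence slide).
samples = gibbs_pairwise_mrf(8, 8, w=0.5, n_iters=500, seed=0)
post_burn_in = samples[100:]
```

Because the conditional of each pixel depends only on its grid neighbours, every update touches a constant number of potentials regardless of the image size.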
Parameter Learning Example
Estimate the probability $\theta$ of landing heads using a biased coin.
Given a sequence of $N$ independently and identically distributed (iid) flips, e.g. $D = \{x_1, x_2, \ldots, x_N\} = \{1, 0, 1, \ldots, 0\}$, $x_n \in \{0, 1\}$.
Model: $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, i.e. $P(x \mid \theta) = 1-\theta$ for $x = 0$ and $\theta$ for $x = 1$.
Likelihood of a single observation $x_n$: $P(x_n \mid \theta) = \theta^{x_n} (1-\theta)^{1-x_n}$.
(Figure: plate model with $\theta$ as the parent of $x_n$, $n = 1, \ldots, N$.)

Bayesian Parameter Estimation
The Bayesian approach treats the unknown parameter as a random variable whose distribution can be inferred using Bayes' rule:
$p(\theta \mid D) = \frac{P(D \mid \theta)\, p(\theta)}{P(D)} = \frac{P(D \mid \theta)\, p(\theta)}{\int P(D \mid \theta)\, p(\theta)\, d\theta}$
Prior over $\theta$: Beta distribution, $P(\theta; \alpha, \beta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$.
Posterior distribution:
$P(\theta \mid x_1, \ldots, x_N) = \frac{P(x_1, \ldots, x_N \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_N)} \propto \theta^{n_h} (1-\theta)^{n_t} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}$
Posterior mean estimation:
$\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \cdot \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}\, d\theta = \frac{n_h+\alpha}{N+\alpha+\beta}$

MLE for Biased Coin
$\theta_{ML} = \arg\max_\theta P(D \mid \theta)$
Objective function, the log-likelihood:
$\ell(\theta; D) = \log P(D \mid \theta) = \log \theta^{n_h} (1-\theta)^{N-n_h} = n_h \log\theta + (N - n_h) \log(1-\theta)$
We need to maximize this w.r.t. $\theta$. Taking the derivative w.r.t. $\theta$:
$\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{N-n_h}{1-\theta} = 0 \;\Rightarrow\; \theta_{MLE} = \frac{n_h}{N}$, or equivalently $\theta_{MLE} = \frac{1}{N}\sum_n x_n$.

How Should Estimators Be Used?
$\theta_{MAP} = \arg\max_\theta p(\theta \mid D)$ is not Bayesian (even though it uses a prior), since it is a point estimate.
Consider predicting the future. A sensible way is to combine the predictions based on all possible values of $\theta$, weighted by their posterior probability; this is called Bayesian prediction:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$
A frequentist prediction will typically use a "plug-in" estimator such as ML or MAP:
$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML})$ or $P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$.

Learning Graphical Models
The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model, i.e. both the structure and the parameters.
Data, e.g.:
(A,F,S,N,H) = (T,F,F,T,F)
(A,F,S,N,H) = (T,F,T,T,F)
…
(A,F,S,N,H) = (F,T,T,T,T)
Structure learning finds the graph over A, F, S, N, H; parameter learning then fills in CPTs such as $P(S \mid F, A)$:

  F,A    T,T   T,F   F,T   F,F
  S=t    0.9   0.7   0.8   0.2
  S=f    0.1   0.3   0.2   0.8

MLE for General Bayesian Networks
If we assume that the parameters of each CPT are globally independent, and that all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$\ell(\theta; D) = \log P(D \mid \theta) = \log \prod_n \prod_i P(x_{n,i} \mid x_{n,\pi_i}, \theta_i) = \sum_i \sum_n \log P(x_{n,i} \mid x_{n,\pi_i}, \theta_i)$
For each variable $X_i$:
$\theta_{MLE}(X_i = x_i \mid X_{\pi_i} = u) = \frac{\#(X_i = x_i,\, X_{\pi_i} = u)}{\#(X_{\pi_i} = u)}$
Why? The derivation follows on the next slides.
(Figure: the Allergy/Flu/Sinus/Nose/Headache network.)

Decomposable Likelihood of a Directed Model
For the example network:
$\ell(\theta; D) = \log P(D \mid \theta) = \sum_n \log P(f_n \mid \theta_f) + \sum_n \log P(a_n \mid \theta_a) + \sum_n \log P(s_n \mid f_n, a_n, \theta_s) + \sum_n \log P(h_n \mid s_n, \theta_h)$
There is one term for each CPT, so the MLE problem breaks up into independent subproblems.
Because of the factorization of the distribution, we can estimate each CPT separately.
(Figure: the network split into its families (Allergy; Flu; Flu, Allergy to Sinus; Sinus to Headache), each learned separately.)

MLE for BNs with Tabular CPTs
Assume each CPT is represented as a table (multinomial): $\theta_{ijk} = P(X_i = j \mid X_{\pi_i} = k)$.
Note that in the case of multiple parents, $X_{\pi_i}$ has a composite state, and the CPT becomes a high-dimensional table.
The sufficient statistics are the counts of family configurations: $n_{ijk} = \#(X_i = j,\, X_{\pi_i} = k)$.
The log-likelihood is
$\ell(\theta; D) = \log \prod_{ijk} \theta_{ijk}^{n_{ijk}} = \sum_{ijk} n_{ijk} \log \theta_{ijk}$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get
$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$

MLE and Kullback-Leibler Divergence
KL divergence: $D(Q(x) \,\|\, P(x)) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$.
Empirical distribution: $\tilde{P}(x) = \frac{1}{N} \sum_{n=1}^N \delta(x, x_n)$, where $\delta(x, x_n)$ is a Kronecker delta function.
$\theta_{MLE} = \arg\min_\theta D(\tilde{P}(x) \,\|\, P(x \mid \theta)) = \arg\min_\theta \left( \sum_x \tilde{P}(x) \log \tilde{P}(x) - \sum_x \tilde{P}(x) \log P(x \mid \theta) \right) = \arg\min_\theta \left( \text{const.} - \frac{1}{N} \sum_{n=1}^N \log P(x_n \mid \theta) \right) = \arg\min_\theta \left( \text{const.} - \frac{1}{N} \ell(\theta; D) \right)$
So minimizing the KL divergence from the empirical distribution is equivalent to maximizing the likelihood.

Bayesian Estimator for Tabular CPTs
Factorization: $P(X = x) = \prod_i P(x_i \mid x_{\pi_i}, \theta_i)$.
Local CPT: multinomial distribution, $P(X_i = j \mid X_{\pi_i} = k) = \theta_{ijk}$.
Put a prior distribution over the parameters: $P(\theta_f, \theta_a, \theta_s, \theta_h)$.
(Figure: the example network with parameter nodes $\theta_f, \theta_a, \theta_s, \theta_h$ attached to Flu, Allergy, Sinus, Headache.)

Global and Local Parameter Independence
Global parameter independence (parameters of different nodes in the GM):
$P(\theta_f, \theta_a, \theta_s, \theta_h) = P(\theta_f)\, P(\theta_a)\, P(\theta_s)\, P(\theta_h)$
Local parameter independence (parameters within each node):
$P(X_i = j \mid X_{\pi_i} = k) = \theta_{ijk}$, with $P(\theta_i) = \prod_k P(\theta_{i \cdot k})$, i.e. the parameters for different parent configurations are independent.

Parameter Independence Example
Provided all variables are observed, we can perform the Bayesian update for each parameter independently.
(Figure: global parameter independence, with separate parameter nodes $\theta_1$ and $\theta_2$ for $X_1$ and $X_2$; local parameter independence, with the parameter of $X_2$ split into separate nodes, one per parent configuration.)
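To make the tabular-CPT estimators concrete, here is a minimal sketch for fully observed discrete data. With `alpha = 0` it returns the counting MLE $\theta^{ML}_{ijk} = n_{ijk} / \sum_{j'} n_{ij'k}$; with `alpha > 0` it adds Dirichlet pseudo-counts and returns the posterior-mean (smoothed) estimate, the per-family Bayesian update that the parameter independence assumptions above justify. The function name, variable names, and toy samples are illustrative only, not taken from the slides.

```python
import numpy as np
from collections import Counter

def fit_cpt(data, child, parents, arity, alpha=0.0):
    """Estimate P(child | parents) for a fully observed tabular CPT.

    data  : list of dicts mapping variable name -> value in {0, ..., arity-1}
    alpha : Dirichlet pseudo-count; alpha = 0 gives the MLE by counting,
            alpha > 0 gives the Bayesian posterior-mean (smoothed) estimate.
    Returns a dict: parent configuration (tuple) -> probability vector over the child.
    Parent configurations never seen in the data are omitted in this sketch.
    """
    counts = Counter()
    for row in data:
        u = tuple(row[p] for p in parents)          # composite parent state
        counts[(u, row[child])] += 1

    cpt = {}
    for u in {u for (u, _) in counts}:
        n = np.array([counts[(u, j)] + alpha for j in range(arity)], dtype=float)
        cpt[u] = n / n.sum()                        # normalise over the child's values
    return cpt

# Toy fully observed samples over Flu (F), Allergy (A), Sinus (S); values are made up.
data = [
    {"F": 1, "A": 0, "S": 1},
    {"F": 1, "A": 0, "S": 1},
    {"F": 0, "A": 1, "S": 1},
    {"F": 0, "A": 0, "S": 0},
    {"F": 1, "A": 1, "S": 1},
]
print(fit_cpt(data, child="S", parents=["F", "A"], arity=2))           # MLE
print(fit_cpt(data, child="S", parents=["F", "A"], arity=2, alpha=1))  # smoothed
```

Each CPT is fitted from its own family counts, mirroring the decomposition of the log-likelihood into one term per node.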
Which Distributions Satisfy the Independence Assumptions?
Discrete DAG models: $X_i \mid X_{\pi_i} \sim \text{Multinomial}(\theta)$, with a Dirichlet prior
$P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
Gaussian DAG models: $X_i \mid X_{\pi_i} \sim \text{Normal}(\mu, \Sigma)$, with a Normal prior
$P(\mu \mid \nu, \Psi) \propto \exp\left(-\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu)\right)$

Parameter Sharing
Consider a time-invariant (stationary) first-order Markov model.
Initial state probability vector: $\pi_k = P(X_1 = k)$.
State transition probability matrix: $A_{ij} = P(X_t = j \mid X_{t-1} = i)$.
The likelihood of one sequence: $P(x_{1:T} \mid \theta) = P(x_1 \mid \pi) \prod_{t=2}^T P(x_t \mid x_{t-1})$.
The log-likelihood of $N$ sequences:
$\ell(\theta; D) = \sum_{n=1}^N \log P(x_1^n \mid \pi) + \sum_{n=1}^N \sum_{t=2}^T \log P(x_t^n \mid x_{t-1}^n, A)$
Again, we can optimize each parameter separately: $\pi$ can simply be estimated by frequencies. How about $A$?
(Figure: a chain $X_1 \to X_2 \to X_3 \to \cdots \to X_T$ sharing the parameters $\pi$ and $A$.)

Learning a Markov Chain Transition Matrix
$A$ is a stochastic matrix, $\sum_j A_{ij} = 1$, so each row of $A$ is a multinomial distribution.
The MLE of $A_{ij}$ is the fraction of transitions from $i$ to $j$:
$A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)}$
Application: if the states $X_t$ represent words, this is called a bigram language model.
Small sample size problem: if $i \to j$ did not occur in the data, we will have $A_{ij}^{ML} = 0$.
A standard hack is backoff smoothing: $A_{i\cdot}^{\lambda} = \lambda\, \eta + (1-\lambda) A_{i\cdot}^{ML}$, where $\eta$ is a background distribution and $0 \le \lambda \le 1$.

Bayesian Model for a Markov Chain
With global and local parameter independence, the posterior over the rows $A_{i \to \cdot}$ and $A_{i' \to \cdot}$ factorizes despite the v-structure.
Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix. The posterior mean is
$A_{ij}^{Bayes} = \frac{\#(i \to j) + \beta_{ij}}{\#(i \to \cdot) + \sum_{j'} \beta_{ij'}} = \lambda\, \beta'_{ij} + (1-\lambda) A_{ij}^{ML}$,
where $\lambda = \frac{\sum_j \beta_{ij}}{\#(i \to \cdot) + \sum_j \beta_{ij}}$ and $\beta'_{ij} = \beta_{ij} / \sum_{j'} \beta_{ij'}$ is the normalized prior.
So the Bayesian estimate has exactly the form of backoff smoothing.

MLE for the Exponential Family
Consider a pairwise model in exponential family form:
$P(X_1, \ldots, X_K \mid \theta) = \frac{1}{Z(\theta)} \prod_i \exp(\theta_i X_i) \prod_{ij} \exp(\theta_{ij} X_i X_j)$
The (scaled) log-likelihood is
$\ell(\theta; D) = \frac{1}{N} \sum_{n=1}^N \log P(x^n \mid \theta) = \frac{1}{N} \sum_{n=1}^N \left( \sum_i \theta_i x_i^n + \sum_{ij} \theta_{ij} x_i^n x_j^n - \log Z(\theta) \right)$
(The products $X_i X_j$ can be replaced by other feature functions $f(x)$.)
Taking the derivative with respect to $\theta_{ij}$:
$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{n=1}^N x_i^n x_j^n - \frac{\partial \log Z(\theta)}{\partial \theta_{ij}}$,
where
$\frac{\partial \log Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \frac{\partial Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \sum_x x_i x_j \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j)$.
Computing this term needs care, since it involves a sum over all configurations.

Moment Matching Condition
$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{n=1}^N x_i^n x_j^n - \frac{1}{Z(\theta)} \sum_x x_i x_j \prod_i \exp(\theta_i x_i) \prod_{ij} \exp(\theta_{ij} x_i x_j) = E_{\tilde{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$
The first term is the empirical covariance (moment) from the data, with $\tilde{P}(X_i, X_j) = \frac{1}{N} \sum_{n=1}^N \delta(X_i, x_i^n)\, \delta(X_j, x_j^n)$; the second term is the corresponding moment under the model.

MLE Learning Algorithm for Exponential Models
$\max_\theta \ell(\theta; D)$ is a convex optimization problem (the log-likelihood is concave in $\theta$). It can be solved by many methods, such as gradient ascent or conjugate gradient:
- Initialize the model parameters $\theta$.
- Loop until convergence:
  - Compute $\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$.
  - Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta; D)}{\partial \theta_{ij}}$.
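Returning to the Markov chain transition matrix, here is a minimal sketch of the count-and-normalize estimator; the function name and toy sequences are illustrative. Setting `beta > 0` adds Dirichlet pseudo-counts, which gives the posterior-mean estimate and has the same effect as the backoff smoothing above (a $\lambda$-mixture of the prior mean and the MLE).

```python
import numpy as np

def fit_transition_matrix(sequences, K, beta=0.0):
    """MLE / Dirichlet-smoothed estimate of a K x K transition matrix A.

    sequences : list of state sequences, each a list of ints in {0, ..., K-1}
    beta      : pseudo-count added to every transition count;
                beta = 0 gives A_ij^ML = #(i -> j) / #(i -> .),
                beta > 0 avoids zero probabilities for unseen transitions
                (and is required if some state never appears as a source).
    """
    counts = np.zeros((K, K))
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    counts += beta
    return counts / counts.sum(axis=1, keepdims=True)   # normalise each row

# Two toy sequences over K = 3 states (made-up data).
seqs = [[0, 1, 1, 2, 0, 1], [2, 2, 0, 1, 2]]
A_ml = fit_transition_matrix(seqs, K=3)             # may contain exact zeros
A_bayes = fit_transition_matrix(seqs, K=3, beta=1)  # every entry strictly positive
```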
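Finally, a sketch of the gradient-ascent MLE loop for a small, fully observed binary pairwise model. It assumes 0/1 variables and computes $Z(\theta)$ by brute-force enumeration of all $2^K$ states, so it is only feasible for small $K$; the function name and data are illustrative. In larger models the model moments $E_{P(x \mid \theta)}[X_i X_j]$ cannot be enumerated and would themselves be approximated, e.g. with the Gibbs sampler from the beginning of this lecture.

```python
import numpy as np
from itertools import product

def fit_pairwise_exp_family(X, n_steps=500, eta=0.1):
    """Gradient-ascent MLE for a fully observed binary pairwise model
        P(x | theta) = (1/Z) exp( sum_i theta_i x_i + sum_{i<j} theta_ij x_i x_j ),
    using the moment-matching gradient: empirical moments minus model moments.
    Z(theta) is computed exactly by enumerating all 2^K states (small K only).
    """
    N, K = X.shape
    states = np.array(list(product([0, 1], repeat=K)), dtype=float)  # all 2^K configs

    def suff_stats(S):
        # Sufficient statistics: singleton terms x_i and pairwise terms x_i * x_j (i < j).
        iu = np.triu_indices(K, k=1)
        pair = (S[:, :, None] * S[:, None, :])[:, iu[0], iu[1]]
        return np.hstack([S, pair])

    T_data = suff_stats(X).mean(axis=0)        # empirical moments E_{P~}[f(x)]
    T_all = suff_stats(states)
    theta = np.zeros(T_all.shape[1])

    for _ in range(n_steps):
        logits = T_all @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                            # model distribution P(x | theta)
        T_model = p @ T_all                     # model moments E_{P(x|theta)}[f(x)]
        grad = T_data - T_model                 # moment-matching gradient of the log-likelihood
        theta += eta * grad                     # gradient ascent on the concave objective
    return theta

# Made-up binary data over K = 4 variables.
rng = np.random.default_rng(0)
X = (rng.random((100, 4)) < 0.5).astype(float)
theta_hat = fit_pairwise_exp_family(X)
```

At convergence the gradient vanishes, so the fitted model's moments match the empirical moments, which is exactly the moment matching condition on the slides.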