Approximate Inference and Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Why Sampling
- Exact and variational inference tasks focus on obtaining the entire posterior distribution p(X|e).
- Often we want to take expectations. Mean: E[X_i|e] = ∫ x_i p(x_i|e) dx_i. More generally, E[f] = ∫ f(x) p(x|e) dx, which can be difficult to compute analytically.
- Sometimes we also want to see typical data points from a distribution.

Sampling
- Samples: points from the domain of a distribution p(x). The higher p(x), the more likely we are to see x in the sample.
  (Figure: a density p(x) with sample points x_1, ..., x_6.)
- Approximate the expectation by a sample average: E[f] ≈ (1/N) Σ_{i=1}^N f(x_i), where x_1, ..., x_N ∼ p(x|e) independently and identically distributed.

Generate Samples from Bayesian Networks
- A BN describes a generative process for observations.
- First, sort the nodes in topological order. (Figure: the Flu/Allergy/Sinus/Nose/Headache network with nodes numbered in topological order.) Then generate samples in this order according to the CPTs.
- Generate a set of samples for (A, F, S, N, H):
  Sample a_i ∼ P(A)
  Sample f_i ∼ P(F)
  Sample s_i ∼ P(S | a_i, f_i)
  Sample n_i ∼ P(N | s_i)
  Sample h_i ∼ P(H | s_i)

Challenge in Sampling
- Not all distributions can be trivially sampled, e.g., loopy graphical models with lots of variables, or distributions with complicated shapes.
  (Figure: a multimodal density p(x).)

Sampling Methods
- Direct sampling: simple; works only for easy distributions.
- Rejection sampling: creates samples like direct sampling; only counts samples consistent with the given evidence.
- Importance sampling: creates samples like direct sampling; assigns weights to samples.
- Gibbs sampling: often used for high-dimensional problems; samples each variable using its Markov blanket.

Rejection Sampling
- Sample x ∼ Q(x) and reject it with probability 1 − P(x)/(M Q(x)): draw x_1 ∼ Q(x) and u_1 ∼ U[0,1], and keep x_1 only if u_1 ≤ P(x_1)/(M Q(x_1)).
  (Figure: the target P(x) under the envelope M Q(x); the region between the two curves is the rejection region.)

Importance Sampling
- Instead of rejecting samples, reweight them: draw x_1, x_2 ∼ Q(x) and assign weights w_1 = P(x_1)/Q(x_1), w_2 = P(x_2)/Q(x_2).
  (Figure: the target P(x) and the proposal Q(x), with samples weighted by the ratio of the two curves.)

Example: Sampling from an MRF on a Grid
- Use a tree distribution Q as the proposal distribution: cut some edges of the grid to make a tree T.
  P(X_1, ..., X_n) ∝ exp( Σ_{(ij)∈E} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i )
  Q(X_1, ..., X_n) ∝ exp( Σ_{(ij)∈T} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i ), which has fewer terms.
- Then use rejection sampling or importance sampling with Q as the proposal.

Gibbs Sampling
- Both rejection sampling and importance sampling do not scale well to high dimensions.
- Markov Chain Monte Carlo (MCMC) is an alternative. Key idea: construct a Markov chain whose stationary distribution is the target distribution p(X). The sampling process is a random walk in the Markov chain.
- Gibbs sampling is a very special and simple MCMC method.

Markov Chain Monte Carlo
- We want to sample from p(X); start with a random initial vector X. Let X^t denote X at time step t.
- X^t transitions to X^{t+1} with probability T(X^{t+1} | X^t, ..., X^1) = T(X^{t+1} | X^t).
- The stationary distribution of T(X^{t+1} | X^t) is our p(X).
- Run for an initial number of steps (the burn-in time) until the chain converges/mixes/reaches the stationary distribution; then collect M (correlated) samples x^t.
- Key issues: designing the transition kernel, and diagnosing convergence.
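Before moving on to the Gibbs sampler, here is a minimal sketch of the rejection and importance sampling recipes from the slides above, applied to a one-dimensional example. The bimodal target p_tilde, the Gaussian proposal Q, and the envelope constant M = 10 are all made up for illustration; the importance-sampling estimate is self-normalized, a small extension of the weights on the slide that also handles an unnormalized target.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized, bimodal target P(x); purely illustrative."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def q_pdf(x):
    """Proposal Q(x): a zero-mean Gaussian with standard deviation 3."""
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

M = 10.0   # envelope constant, picked so that p_tilde(x) <= M * q_pdf(x) for all x

N = 100_000
x = rng.normal(0.0, 3.0, size=N)      # x_i ~ Q(x)
u = rng.uniform(0.0, 1.0, size=N)     # u_i ~ U[0, 1]

# Rejection sampling: keep x_i only if u_i <= P(x_i) / (M Q(x_i)),
# i.e. reject it with probability 1 - P(x_i) / (M Q(x_i)).
accepted = x[u <= p_tilde(x) / (M * q_pdf(x))]
print("rejection sampling,  E[X] ~", accepted.mean())

# Importance sampling: keep every x_i but give it weight w_i = P(x_i) / Q(x_i).
# Self-normalizing the weights cancels the unknown normalizer of p_tilde.
w = p_tilde(x) / q_pdf(x)
print("importance sampling, E[X] ~", np.sum(w * x) / np.sum(w))
```

Both estimators work fine for this easy one-dimensional target; the point of the slides above is that neither approach scales well to high dimensions, which motivates the Gibbs sampler introduced next.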
Gibbs Sampling
- A very special transition kernel that works nicely with the Markov blanket in GMs.
- The procedure: we have a set of variables X = {X_1, ..., X_K} in a GM.
- At each step, one variable X_i is selected (at random or in some fixed sequence); denote the remaining variables by X_{-i} and their current value by x_{-i}^t.
- Compute the conditional distribution p(X_i | x_{-i}^t).
- A value x_i^t is sampled from this distribution.
- This sample x_i^t replaces the previously sampled value of X_i in x.

Gibbs Sampling in Formulas
- Gibbs sampling: initialize X = x^0. For t = 1 to N:
  x_1^t ∼ p(X_1 | x_2^{t-1}, ..., x_K^{t-1})
  x_2^t ∼ p(X_2 | x_1^t, x_3^{t-1}, ..., x_K^{t-1})
  ...
  x_K^t ∼ p(X_K | x_1^t, ..., x_{K-1}^t)
- For graphical models, we only need to condition on the variables in the Markov blanket.
  (Figure: a small graphical model over X_1, ..., X_5.)
- Variants: randomly pick the variable to sample; sample block by block.

Gibbs Sampling: Image Segmentation
- Noisy grayscale image; label each pixel as on/off.
- Model using a pairwise MRF:
  p(x) = (1/Z) Π_i Ψ(x_i) Π_{(ij)} Ψ(x_i, x_j)   (the pairwise product is over neighboring pixels)
  Ψ(x_i) = exp( −(y_i − μ_{x_i})² / (2σ_x²) )
  Ψ(x_i, x_j) = exp( −β (x_i − x_j)² )
  (Figure: a grid MRF over pixels X_1, ..., X_9.)

Gibbs Sampling: Image Segmentation
- Need the conditional
  p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
  = p(x_1, ..., x_n) / p(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
  = [ (1/Z) Π_j Ψ(x_j) Π_{(jk)} Ψ(x_j, x_k) ] / [ (1/Z) Σ_{x_i} Π_j Ψ(x_j) Π_{(jk)} Ψ(x_j, x_k) ]
- Terms without x_i cancel out; x_i is summed out in the denominator:
  p(x_i | x_{-i}) ∝ Ψ(x_i) Π_{j∈N(i)} Ψ(x_i, x_j)
  (Figure: the grid MRF again; N(i) denotes the neighboring pixels of pixel i.)

Gibbs Sampling: Image Segmentation
  (Figure: segmentation results on the noisy image.)

MAP by Sampling
- Generate a few samples from the posterior.
- For each X_i, the MAP is the majority assignment across the samples (majority vote).

Convergence of Gibbs Sampling
- Not all samples x^0, ..., x^t are independent.
- Consider a particular marginal p(x_i | u_i).
  (Figure: the empirical estimate of p(x_i | u_i), computed from counts across multiple chains, plotted against iteration t; it settles around the true p(x_i | u_i) after the burn-in period. Take samples from there on.)

Diagnosing Convergence
- Good chain. (Figure: sampled value vs. iteration number.)
- Bad chain. (Figure: sampled value vs. iteration number.)

Sampling Methods
- Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.).
- Rejection sampling: creates samples like direct sampling; only counts samples consistent with the given evidence.
- Importance sampling: creates samples like direct sampling; assigns weights to samples.
- Gibbs sampling: often used for high-dimensional problems; samples each variable using its Markov blanket.

Learning Graphical Models
- The goal: given a set of independent samples (assignments of the random variables), find the best (the most likely) graphical model, both the structure and the parameters.
- Structure learning: learn the graph over (A, F, S, N, H) from data such as
  (A,F,S,N,H) = (T,F,F,T,F)
  (A,F,S,N,H) = (T,F,T,T,F)
  ...
  (A,F,S,N,H) = (F,T,T,T,T)
- Parameter learning: learn the CPTs, e.g. P(S | F, A):
  F,A:   TT   TF   FT   FF
  S=t:   0.9  0.7  0.8  0.2
  S=f:   0.1  0.3  0.2  0.8

Learning for GMs
                          Known structure    Unknown structure
  Fully observable data   Relatively easy    Hard
  Missing data            Hard (EM)          Very hard
- Estimation principles: maximum likelihood estimation; Bayesian estimation.
- Common features: make use of the distribution's factorization; make use of an inference algorithm; make use of regularization/priors.

Example Problem
- Estimate the probability θ of landing on heads using a biased coin.
- Given a sequence of N independently and identically distributed (iid) flips, e.g., D = {x_1, x_2, ..., x_N} = {1, 0, 1, ..., 0}, x_i ∈ {0, 1}.
- Model: P(x|θ) = θ^x (1 − θ)^{1−x}, i.e., P(x|θ) = θ for x = 1 and 1 − θ for x = 0.
- Likelihood of a single observation x_i? P(x_i|θ) = θ^{x_i} (1 − θ)^{1−x_i}
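The image-segmentation example maps directly onto code. The following is a minimal sketch, not the course's implementation: the image size, the class means μ_0 and μ_1, the noise level σ, the coupling β, and the synthetic image itself are all made up, and only the update rule p(x_i | x_{-i}) ∝ Ψ(x_i) Π_{j∈N(i)} Ψ(x_i, x_j) and the majority-vote MAP come from the slides above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up problem setup: a toy two-region labeling observed through Gaussian noise.
H, W = 32, 32
mu = np.array([0.0, 1.0])        # class means mu_{x_i} for labels x_i in {0, 1}
sigma, beta = 0.5, 1.0           # noise level and pairwise coupling (both invented)

truth = np.zeros((H, W), dtype=int)
truth[:, W // 2:] = 1
y = mu[truth] + sigma * rng.normal(size=(H, W))   # noisy grayscale image

x = rng.integers(0, 2, size=(H, W))               # random initial labeling x^0

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < H and 0 <= j + dj < W:
            yield i + di, j + dj

def gibbs_sweep(x):
    """One sweep: resample every pixel from p(x_i | x_{-i}) ∝ Ψ(x_i) Π_{j∈N(i)} Ψ(x_i, x_j)."""
    for i in range(H):
        for j in range(W):
            logp = np.empty(2)
            for label in (0, 1):
                node = -(y[i, j] - mu[label]) ** 2 / (2 * sigma ** 2)
                edge = -beta * sum((label - x[a, b]) ** 2 for a, b in neighbors(i, j))
                logp[label] = node + edge
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))   # p(x_i = 1 | Markov blanket)
            x[i, j] = rng.random() < p1
    return x

burn_in, n_samples = 20, 30
for _ in range(burn_in):
    gibbs_sweep(x)
samples = np.array([gibbs_sweep(x).copy() for _ in range(n_samples)])

# MAP by sampling: majority vote over the collected samples for each pixel.
x_map = (samples.mean(axis=0) > 0.5).astype(int)
print("pixel agreement with the clean labeling:", (x_map == truth).mean())
```

Each sweep resamples every pixel from its conditional given the current Markov blanket; the samples collected after burn-in are correlated, exactly as the convergence slides warn, but their per-pixel majority vote still gives a reasonable MAP labeling.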
Bayesian Parameter Estimation
- Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes' rule:
  p(θ|D) = p(D|θ) p(θ) / p(D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ
- The crucial equation can be written in words:
  posterior = likelihood × prior / marginal likelihood
- For iid data, the likelihood is
  p(D|θ) = Π_{i=1}^N p(x_i|θ) = Π_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1−x_i)} = θ^{#heads} (1 − θ)^{#tails}
- The prior p(θ) encodes our prior knowledge of the domain. Different priors p(θ) end up with different estimates p(θ|D)!

Frequentist Parameter Estimation
- Bayesian estimation has been criticized for being "subjective".
- Frequentists think of a parameter as a fixed, unknown constant, not a random variable; hence they use different "objective" estimators instead of Bayes' rule.
- These estimators have different properties, such as being "unbiased", "minimum variance", etc.
- A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties:
  θ̂ = argmax_θ P(D|θ) = argmax_θ Π_{i=1}^N p(x_i|θ)

MLE for the Biased Coin
- Objective function: the log likelihood
  l(θ; D) = log P(D|θ) = log θ^{n_h} (1 − θ)^{n_t} = n_h log θ + n_t log(1 − θ)
- We need to maximize this w.r.t. θ. Take the derivative w.r.t. θ:
  ∂l/∂θ = n_h/θ − (N − n_h)/(1 − θ) = 0  ⇒  θ_MLE = n_h/N, or equivalently θ_MLE = (1/N) Σ_i x_i

Maximum Likelihood Estimation for Bernoulli
- What if we toss too few times, so that we see zero heads in the data? In this case θ_MLE = n_h/N = 0, and we will predict that the probability of seeing a head next is zero.
- The rescue: add regularization to smooth the counts. Do maximum a posteriori (MAP) estimation:
  θ_MAP = argmax_θ p(θ|D) = argmax_θ [ l(θ; D) + log p(θ) ]
- For instance, with log p(θ) = n_h' log θ + n_t' log(1 − θ) (up to a constant),
  θ_MAP = (n_h + n_h') / (N + n_h' + n_t'), where n_h' and n_t' are known as pseudo-counts.
- But are we still objective?

Bayesian Estimation for the Biased Coin
- Prior over θ: the Beta distribution
  P(θ; α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^{α−1} (1 − θ)^{β−1}
  where Γ(x + 1) = x Γ(x), and Γ(x + 1) = x! when x is an integer.
- Posterior distribution of θ:
  p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}
- The posterior is the same type of function as the prior; such a prior is called a conjugate prior.
- α and β are hyperparameters and correspond to the number of "virtual" heads and tails (pseudo-counts).

Bayesian Estimation for Bernoulli
- Posterior distribution:
  p(θ | x_1, ..., x_N) ∝ θ^{n_h+α−1} (1 − θ)^{n_t+β−1}
- Maximum a posteriori (MAP) estimation: θ_MAP = argmax_θ log p(θ | x_1, ..., x_N)
- Posterior mean estimation:
  θ_Bayes = ∫ θ p(θ|D) dθ = C ∫ θ · θ^{n_h+α−1} (1 − θ)^{n_t+β−1} dθ = (n_h + α) / (N + α + β)
- Prior strength: A = α + β. A can be interpreted as the size of an imaginary dataset.

Effect of Prior Strength
- Suppose we have a uniform prior (α = β), and we observe n_h = 2 and n_t = 8.
- Weak prior, A = α + β = 2. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α = 1, β = 1) = (2 + 1)/(10 + 2) = 0.25
- Strong prior, A = α + β = 20. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α = 10, β = 10) = (2 + 10)/(10 + 20) = 0.4
- However, if we have enough data, it washes away the prior. E.g., with n_h = 200 and n_t = 800, the estimates under the weak and strong priors are (200 + 1)/(1000 + 2) and (200 + 10)/(1000 + 20) respectively, both close to 0.2.
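The coin-flip estimates above are easy to reproduce. Here is a short sketch (the helper names are mine) computing the MLE, the posterior mean under a Beta(α, β) prior, and the MAP estimate, written here as the mode of the Beta(n_h + α, n_t + β) posterior.

```python
n_h, n_t = 2, 8            # observed heads and tails from the slide
N = n_h + n_t

theta_mle = n_h / N        # maximum likelihood estimate

def posterior_mean(n_h, n_t, alpha, beta):
    """Posterior mean (n_h + alpha) / (N + alpha + beta) under a Beta(alpha, beta) prior."""
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

def posterior_mode(n_h, n_t, alpha, beta):
    """MAP estimate: the mode of the Beta(n_h + alpha, n_t + beta) posterior."""
    return (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)

print(theta_mle)                          # 0.2
print(posterior_mean(2, 8, 1, 1))         # 0.25    weak prior,   A = alpha + beta = 2
print(posterior_mean(2, 8, 10, 10))       # 0.4     strong prior, A = alpha + beta = 20
print(posterior_mean(200, 800, 1, 1))     # ~0.2006 enough data washes away the weak prior
print(posterior_mean(200, 800, 10, 10))   # ~0.2059 ... and the strong prior
print(posterior_mode(2, 8, 10, 10))       # ~0.393  MAP differs slightly from the posterior mean
```

With the uniform Beta(1, 1) prior the MAP estimate coincides with the MLE, while the posterior mean performs the add-one smoothing shown above; with α = β = 10 the MAP and posterior-mean estimates differ slightly because the mode and the mean of a Beta distribution are not the same.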
How Should Estimators Be Used?
- θ_MAP is not Bayesian (even though it uses a prior), since it is a point estimate.
- Consider predicting the future. A sensible way is to combine predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction:
  p(x_new | D) = ∫ p(x_new, θ | D) dθ = ∫ p(x_new | θ, D) p(θ | D) dθ = ∫ p(x_new | θ) p(θ | D) dθ   (since x_new ⊥ D | θ)
- A frequentist prediction will typically use a "plug-in" estimator such as the ML or MAP estimate:
  p(x_new | D) = p(x_new | θ_ML)  or  p(x_new | D) = p(x_new | θ_MAP)

Frequentist vs. Bayesian
- Advantages of the Bayesian approach:
  Mathematically elegant.
  Works well when the amount of data is much less than the number of parameters.
  Easy to do incremental (sequential) learning.
  Can be used for model selection (maximum likelihood will always pick the most complex model).
- Advantages of the frequentist approach:
  Mathematically/computationally simpler.
  "Objective", unbiased, invariant to reparametrization.
- As |D| → ∞, the two approaches become the same: p(θ|D) → δ(θ, θ_ML).

MLE for General Bayesian Networks
- If we assume that the parameters for each CPT are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
  l(θ; D) = log p(D|θ) = log Π_n Π_i p(x_{n,i} | x_{n,π_i}, θ_i) = Σ_i Σ_n log p(x_{n,i} | x_{n,π_i}, θ_i)
  (Figure: the Allergy/Flu/Sinus/Nose/Headache network.)
- For each variable X_i:
  θ_MLE(X_i = x_i | X_{π_i} = u) = #(X_i = x_i, X_{π_i} = u) / #(X_{π_i} = u)
  Why?

MLE for General Bayesian Networks
- For the Allergy/Flu/Sinus/Nose/Headache network:
  l(θ; D) = log p(D|θ) = Σ_n log p(a_n | θ_A) + Σ_n log p(f_n | θ_F) + Σ_n log p(s_n | a_n, f_n, θ_S) + Σ_n log p(n_n | s_n, θ_N) + Σ_n log p(h_n | s_n, θ_H)
- One term for each CPT; this breaks the MLE problem up into independent subproblems.
- Earlier we already learned how to estimate a single CPT; here we just need to estimate each CPT separately.
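As a closing sketch of the counting formula θ_MLE(X_i = x_i | X_{π_i} = u) = #(X_i = x_i, X_{π_i} = u) / #(X_{π_i} = u), here is how the CPTs of the Allergy/Flu/Sinus/Nose/Headache network could be estimated from fully observed data. The five observations below are made up; only the parent structure and the counting rule come from the slides.

```python
from collections import Counter

# Fully observed samples over (A, F, S, N, H); the values below are made up.
data = [
    dict(A=1, F=0, S=0, N=1, H=0),
    dict(A=1, F=0, S=1, N=1, H=0),
    dict(A=0, F=1, S=1, N=1, H=1),
    dict(A=0, F=0, S=0, N=0, H=0),
    dict(A=1, F=1, S=1, N=0, H=1),
]

# Parent sets matching the network on the slides: S depends on A and F; N and H depend on S.
parents = {"A": [], "F": [], "S": ["A", "F"], "N": ["S"], "H": ["S"]}

def mle_cpt(var, pa, data):
    """theta_MLE(X = x | parents = u) = #(X = x, parents = u) / #(parents = u)."""
    joint, marg = Counter(), Counter()
    for d in data:
        u = tuple(d[p] for p in pa)
        joint[(d[var], u)] += 1
        marg[u] += 1
    return {(x, u): joint[(x, u)] / marg[u] for u in marg for x in (0, 1)}

cpts = {v: mle_cpt(v, pa, data) for v, pa in parents.items()}
print(cpts["S"][(1, (1, 0))])   # estimated P(S=1 | A=1, F=0) = 0.5 with these counts
```

Because the log-likelihood decomposes over nodes, each CPT is estimated from its own counts, independently of the others; with sparse data the zero-count problem reappears, which is exactly where the pseudo-count smoothing from the coin example comes back in.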