Approximate Inference and Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Why Sampling
- Exact and variational inference tasks focus on obtaining the entire posterior distribution p(X|e).
- Often we want to take expectations. Mean: E[X_i|e] = ∫ x_i p(x_i|e) dx_i. More generally, E[f] = ∫ f(x) p(x|e) dx, which can be difficult to compute analytically.
- Sometimes we also want to see typical data points from a distribution.

Sampling
- Samples: points from the domain of a distribution p(x). The higher p(x), the more likely we are to see x in the sample.
  (Figure: a density p(x) with sample points x_1, ..., x_6.)
- Approximate the expectation by a sample average: E[f] ≈ (1/N) Σ_{i=1}^N f(x_i), where x_1, ..., x_N ∼ p(x|e) independently and identically distributed.

Generate Samples from Bayesian Networks
- A BN describes a generative process for observations.
- First, sort the nodes in topological order. (Figure: the Flu/Allergy/Sinus/Nose/Headache network with nodes numbered in topological order.) Then generate samples in this order according to the CPTs.
- Generate a set of samples for (A, F, S, N, H):
  Sample a_i ∼ P(A)
  Sample f_i ∼ P(F)
  Sample s_i ∼ P(S | a_i, f_i)
  Sample n_i ∼ P(N | s_i)
  Sample h_i ∼ P(H | s_i)

Challenge in Sampling
- Not all distributions can be trivially sampled, e.g., loopy graphical models with lots of variables, or distributions with complicated shapes.
  (Figure: a multimodal density p(x).)

Sampling Methods
- Direct sampling: simple; works only for easy distributions.
- Rejection sampling: creates samples like direct sampling; only counts samples consistent with the given evidence.
- Importance sampling: creates samples like direct sampling; assigns weights to samples.
- Gibbs sampling: often used for high-dimensional problems; samples each variable using its Markov blanket.

Rejection Sampling
- Sample x ∼ Q(x) and reject it with probability 1 − P(x)/(M Q(x)): draw x_1 ∼ Q(x) and u_1 ∼ U[0,1], and keep x_1 only if u_1 ≤ P(x_1)/(M Q(x_1)).
  (Figure: the target P(x) under the envelope M Q(x); the region between the two curves is the rejection region.)

Importance Sampling
- Instead of rejecting samples, reweight them: draw x_1, x_2 ∼ Q(x) and assign weights w_1 = P(x_1)/Q(x_1), w_2 = P(x_2)/Q(x_2).
  (Figure: the target P(x) and the proposal Q(x), with samples weighted by the ratio of the two curves.)

Example: Sampling from an MRF on a Grid
- Use a tree distribution Q as the proposal distribution: cut some edges of the grid to make a tree T.
  P(X_1, ..., X_n) ∝ exp( Σ_{(ij)∈E} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i )
  Q(X_1, ..., X_n) ∝ exp( Σ_{(ij)∈T} θ_ij X_i X_j + Σ_{i∈V} θ_i X_i ), which has fewer terms.
- Then use rejection sampling or importance sampling with Q as the proposal.

Gibbs Sampling
- Both rejection sampling and importance sampling do not scale well to high dimensions.
- Markov Chain Monte Carlo (MCMC) is an alternative. Key idea: construct a Markov chain whose stationary distribution is the target distribution p(X). The sampling process is a random walk in the Markov chain.
- Gibbs sampling is a very special and simple MCMC method.

Markov Chain Monte Carlo
- We want to sample from p(X); start with a random initial vector X. Let X^t denote X at time step t.
- X^t transitions to X^{t+1} with probability T(X^{t+1} | X^t, ..., X^1) = T(X^{t+1} | X^t).
- The stationary distribution of T(X^{t+1} | X^t) is our p(X).
- Run for an initial number of steps (the burn-in time) until the chain converges/mixes/reaches the stationary distribution; then collect M (correlated) samples x^t.
- Key issues: designing the transition kernel, and diagnosing convergence.
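Before moving on to the Gibbs sampler, here is a minimal sketch of the rejection and importance sampling recipes from the slides above, applied to a one-dimensional example. The bimodal target p_tilde, the Gaussian proposal Q, and the envelope constant M = 10 are all made up for illustration; the importance-sampling estimate is self-normalized, a small extension of the weights on the slide that also handles an unnormalized target.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized, bimodal target P(x); purely illustrative."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def q_pdf(x):
    """Proposal Q(x): a zero-mean Gaussian with standard deviation 3."""
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

M = 10.0   # envelope constant, picked so that p_tilde(x) <= M * q_pdf(x) for all x

N = 100_000
x = rng.normal(0.0, 3.0, size=N)      # x_i ~ Q(x)
u = rng.uniform(0.0, 1.0, size=N)     # u_i ~ U[0, 1]

# Rejection sampling: keep x_i only if u_i <= P(x_i) / (M Q(x_i)),
# i.e. reject it with probability 1 - P(x_i) / (M Q(x_i)).
accepted = x[u <= p_tilde(x) / (M * q_pdf(x))]
print("rejection sampling,  E[X] ~", accepted.mean())

# Importance sampling: keep every x_i but give it weight w_i = P(x_i) / Q(x_i).
# Self-normalizing the weights cancels the unknown normalizer of p_tilde.
w = p_tilde(x) / q_pdf(x)
print("importance sampling, E[X] ~", np.sum(w * x) / np.sum(w))
```

Both estimators work fine for this easy one-dimensional target; the point of the slides above is that neither approach scales well to high dimensions, which motivates the Gibbs sampler introduced next.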
Gibbs Sampling
- A very special transition kernel that works nicely with the Markov blanket in GMs.
- The procedure: we have a set of variables X = {X_1, ..., X_K} in a GM.
- At each step, one variable X_i is selected (at random or in some fixed sequence); denote the remaining variables by X_{-i} and their current value by x_{-i}^t.
- Compute the conditional distribution p(X_i | x_{-i}^t).
- A value x_i^t is sampled from this distribution.
- This sample x_i^t replaces the previously sampled value of X_i in x.

Gibbs Sampling in Formulas
- Gibbs sampling: initialize X = x^0. For t = 1 to N:
  x_1^t ∼ p(X_1 | x_2^{t-1}, ..., x_K^{t-1})
  x_2^t ∼ p(X_2 | x_1^t, x_3^{t-1}, ..., x_K^{t-1})
  ...
  x_K^t ∼ p(X_K | x_1^t, ..., x_{K-1}^t)
- For graphical models, we only need to condition on the variables in the Markov blanket.
  (Figure: a small graphical model over X_1, ..., X_5.)
- Variants: randomly pick the variable to sample; sample block by block.

Gibbs Sampling: Image Segmentation
- Noisy grayscale image; label each pixel as on/off.
- Model using a pairwise MRF:
  p(x) = (1/Z) Π_i Ψ(x_i) Π_{(ij)} Ψ(x_i, x_j)   (the pairwise product is over neighboring pixels)
  Ψ(x_i) = exp( −(y_i − μ_{x_i})² / (2σ_x²) )
  Ψ(x_i, x_j) = exp( −β (x_i − x_j)² )
  (Figure: a grid MRF over pixels X_1, ..., X_9.)

Gibbs Sampling: Image Segmentation
- Need the conditional
  p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
  = p(x_1, ..., x_n) / p(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)
  = [ (1/Z) Π_j Ψ(x_j) Π_{(jk)} Ψ(x_j, x_k) ] / [ (1/Z) Σ_{x_i} Π_j Ψ(x_j) Π_{(jk)} Ψ(x_j, x_k) ]
- Terms without x_i cancel out; x_i is summed out in the denominator:
  p(x_i | x_{-i}) ∝ Ψ(x_i) Π_{j∈N(i)} Ψ(x_i, x_j)
  (Figure: the grid MRF again; N(i) denotes the neighboring pixels of pixel i.)

Gibbs Sampling: Image Segmentation
  (Figure: segmentation results on the noisy image.)

MAP by Sampling
- Generate a few samples from the posterior.
- For each X_i, the MAP is the majority assignment across the samples (majority vote).

Convergence of Gibbs Sampling
- Not all samples x^0, ..., x^t are independent.
- Consider a particular marginal p(x_i | u_i).
  (Figure: the empirical estimate of p(x_i | u_i), computed from counts across multiple chains, plotted against iteration t; it settles around the true p(x_i | u_i) after the burn-in period. Take samples from there on.)

Diagnosing Convergence
- Good chain. (Figure: sampled value vs. iteration number.)
- Bad chain. (Figure: sampled value vs. iteration number.)

Sampling Methods
- Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.).
- Rejection sampling: creates samples like direct sampling; only counts samples consistent with the given evidence.
- Importance sampling: creates samples like direct sampling; assigns weights to samples.
- Gibbs sampling: often used for high-dimensional problems; samples each variable using its Markov blanket.

Learning Graphical Models
- The goal: given a set of independent samples (assignments of the random variables), find the best (the most likely) graphical model, both the structure and the parameters.
- Structure learning: learn the graph over (A, F, S, N, H) from data such as
  (A,F,S,N,H) = (T,F,F,T,F)
  (A,F,S,N,H) = (T,F,T,T,F)
  ...
  (A,F,S,N,H) = (F,T,T,T,T)
- Parameter learning: learn the CPTs, e.g. P(S | F, A):
  F,A:   TT   TF   FT   FF
  S=t:   0.9  0.7  0.8  0.2
  S=f:   0.1  0.3  0.2  0.8

Learning for GMs
                          Known structure    Unknown structure
  Fully observable data   Relatively easy    Hard
  Missing data            Hard (EM)          Very hard
- Estimation principles: maximum likelihood estimation; Bayesian estimation.
- Common features: make use of the distribution's factorization; make use of an inference algorithm; make use of regularization/priors.

Example Problem
- Estimate the probability θ of landing on heads using a biased coin.
- Given a sequence of N independently and identically distributed (iid) flips, e.g., D = {x_1, x_2, ..., x_N} = {1, 0, 1, ..., 0}, x_i ∈ {0, 1}.
- Model: P(x|θ) = θ^x (1 − θ)^{1−x}, i.e., P(x|θ) = θ for x = 1 and 1 − θ for x = 0.
- Likelihood of a single observation x_i? P(x_i|θ) = θ^{x_i} (1 − θ)^{1−x_i}
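The image-segmentation example maps directly onto code. The following is a minimal sketch, not the course's implementation: the image size, the class means μ_0 and μ_1, the noise level σ, the coupling β, and the synthetic image itself are all made up, and only the update rule p(x_i | x_{-i}) ∝ Ψ(x_i) Π_{j∈N(i)} Ψ(x_i, x_j) and the majority-vote MAP come from the slides above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up problem setup: a toy two-region labeling observed through Gaussian noise.
H, W = 32, 32
mu = np.array([0.0, 1.0])        # class means mu_{x_i} for labels x_i in {0, 1}
sigma, beta = 0.5, 1.0           # noise level and pairwise coupling (both invented)

truth = np.zeros((H, W), dtype=int)
truth[:, W // 2:] = 1
y = mu[truth] + sigma * rng.normal(size=(H, W))   # noisy grayscale image

x = rng.integers(0, 2, size=(H, W))               # random initial labeling x^0

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= i + di < H and 0 <= j + dj < W:
            yield i + di, j + dj

def gibbs_sweep(x):
    """One sweep: resample every pixel from p(x_i | x_{-i}) ∝ Ψ(x_i) Π_{j∈N(i)} Ψ(x_i, x_j)."""
    for i in range(H):
        for j in range(W):
            logp = np.empty(2)
            for label in (0, 1):
                node = -(y[i, j] - mu[label]) ** 2 / (2 * sigma ** 2)
                edge = -beta * sum((label - x[a, b]) ** 2 for a, b in neighbors(i, j))
                logp[label] = node + edge
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))   # p(x_i = 1 | Markov blanket)
            x[i, j] = rng.random() < p1
    return x

burn_in, n_samples = 20, 30
for _ in range(burn_in):
    gibbs_sweep(x)
samples = np.array([gibbs_sweep(x).copy() for _ in range(n_samples)])

# MAP by sampling: majority vote over the collected samples for each pixel.
x_map = (samples.mean(axis=0) > 0.5).astype(int)
print("pixel agreement with the clean labeling:", (x_map == truth).mean())
```

Each sweep resamples every pixel from its conditional given the current Markov blanket; the samples collected after burn-in are correlated, exactly as the convergence slides warn, but their per-pixel majority vote still gives a reasonable MAP labeling.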
Bayesian Parameter Estimation
- Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes' rule:
  p(θ|D) = p(D|θ) p(θ) / p(D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ
- The crucial equation can be written in words:
  posterior = likelihood × prior / marginal likelihood
- For iid data, the likelihood is
  p(D|θ) = Π_{i=1}^N p(x_i|θ) = Π_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1−x_i)} = θ^{#heads} (1 − θ)^{#tails}
- The prior p(θ) encodes our prior knowledge of the domain. Different priors p(θ) end up with different estimates p(θ|D)!

Frequentist Parameter Estimation
- Bayesian estimation has been criticized for being "subjective".
- Frequentists think of a parameter as a fixed, unknown constant, not a random variable; hence they use different "objective" estimators instead of Bayes' rule.
- These estimators have different properties, such as being "unbiased", "minimum variance", etc.
- A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties:
  θ̂ = argmax_θ P(D|θ) = argmax_θ Π_{i=1}^N p(x_i|θ)

MLE for the Biased Coin
- Objective function: the log likelihood
  l(θ; D) = log P(D|θ) = log θ^{n_h} (1 − θ)^{n_t} = n_h log θ + n_t log(1 − θ)
- We need to maximize this w.r.t. θ. Take the derivative w.r.t. θ:
  ∂l/∂θ = n_h/θ − (N − n_h)/(1 − θ) = 0  ⇒  θ_MLE = n_h/N, or equivalently θ_MLE = (1/N) Σ_i x_i

Maximum Likelihood Estimation for Bernoulli
- What if we toss too few times, so that we see zero heads in the data? In this case θ_MLE = n_h/N = 0, and we will predict that the probability of seeing a head next is zero.
- The rescue: add regularization to smooth the counts. Do maximum a posteriori (MAP) estimation:
  θ_MAP = argmax_θ p(θ|D) = argmax_θ [ l(θ; D) + log p(θ) ]
- For instance, with log p(θ) = n_h' log θ + n_t' log(1 − θ) (up to a constant),
  θ_MAP = (n_h + n_h') / (N + n_h' + n_t'), where n_h' and n_t' are known as pseudo-counts.
- But are we still objective?

Bayesian Estimation for the Biased Coin
- Prior over θ: the Beta distribution
  P(θ; α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] θ^{α−1} (1 − θ)^{β−1}
  where Γ(x + 1) = x Γ(x), and Γ(x + 1) = x! when x is an integer.
- Posterior distribution of θ:
  p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}
- The posterior is the same type of function as the prior; such a prior is called a conjugate prior.
- α and β are hyperparameters and correspond to the number of "virtual" heads and tails (pseudo-counts).

Bayesian Estimation for Bernoulli
- Posterior distribution:
  p(θ | x_1, ..., x_N) ∝ θ^{n_h+α−1} (1 − θ)^{n_t+β−1}
- Maximum a posteriori (MAP) estimation: θ_MAP = argmax_θ log p(θ | x_1, ..., x_N)
- Posterior mean estimation:
  θ_Bayes = ∫ θ p(θ|D) dθ = C ∫ θ · θ^{n_h+α−1} (1 − θ)^{n_t+β−1} dθ = (n_h + α) / (N + α + β)
- Prior strength: A = α + β. A can be interpreted as the size of an imaginary dataset.

Effect of Prior Strength
- Suppose we have a uniform prior (α = β), and we observe n_h = 2 and n_t = 8.
- Weak prior, A = α + β = 2. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α = 1, β = 1) = (2 + 1)/(10 + 2) = 0.25
- Strong prior, A = α + β = 20. Posterior prediction:
  p(x = h | n_h = 2, n_t = 8, α = 10, β = 10) = (2 + 10)/(10 + 20) = 0.4
- However, if we have enough data, it washes away the prior. E.g., with n_h = 200 and n_t = 800, the estimates under the weak and strong priors are (200 + 1)/(1000 + 2) and (200 + 10)/(1000 + 20) respectively, both close to 0.2.
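The coin-flip estimates above are easy to reproduce. Here is a short sketch (the helper names are mine) computing the MLE, the posterior mean under a Beta(α, β) prior, and the MAP estimate, written here as the mode of the Beta(n_h + α, n_t + β) posterior.

```python
n_h, n_t = 2, 8            # observed heads and tails from the slide
N = n_h + n_t

theta_mle = n_h / N        # maximum likelihood estimate

def posterior_mean(n_h, n_t, alpha, beta):
    """Posterior mean (n_h + alpha) / (N + alpha + beta) under a Beta(alpha, beta) prior."""
    return (n_h + alpha) / (n_h + n_t + alpha + beta)

def posterior_mode(n_h, n_t, alpha, beta):
    """MAP estimate: the mode of the Beta(n_h + alpha, n_t + beta) posterior."""
    return (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)

print(theta_mle)                          # 0.2
print(posterior_mean(2, 8, 1, 1))         # 0.25    weak prior,   A = alpha + beta = 2
print(posterior_mean(2, 8, 10, 10))       # 0.4     strong prior, A = alpha + beta = 20
print(posterior_mean(200, 800, 1, 1))     # ~0.2006 enough data washes away the weak prior
print(posterior_mean(200, 800, 10, 10))   # ~0.2059 ... and the strong prior
print(posterior_mode(2, 8, 10, 10))       # ~0.393  MAP differs slightly from the posterior mean
```

With the uniform Beta(1, 1) prior the MAP estimate coincides with the MLE, while the posterior mean performs the add-one smoothing shown above; with α = β = 10 the MAP and posterior-mean estimates differ slightly because the mode and the mean of a Beta distribution are not the same.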
How Should Estimators Be Used?
- θ_MAP is not Bayesian (even though it uses a prior), since it is a point estimate.
- Consider predicting the future. A sensible way is to combine predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction:
  p(x_new | D) = ∫ p(x_new, θ | D) dθ = ∫ p(x_new | θ, D) p(θ | D) dθ = ∫ p(x_new | θ) p(θ | D) dθ   (since x_new ⊥ D | θ)
- A frequentist prediction will typically use a "plug-in" estimator such as the ML or MAP estimate:
  p(x_new | D) = p(x_new | θ_ML)  or  p(x_new | D) = p(x_new | θ_MAP)

Frequentist vs. Bayesian
- Advantages of the Bayesian approach:
  Mathematically elegant.
  Works well when the amount of data is much less than the number of parameters.
  Easy to do incremental (sequential) learning.
  Can be used for model selection (maximum likelihood will always pick the most complex model).
- Advantages of the frequentist approach:
  Mathematically/computationally simpler.
  "Objective", unbiased, invariant to reparametrization.
- As |D| → ∞, the two approaches become the same: p(θ|D) → δ(θ, θ_ML).

MLE for General Bayesian Networks
- If we assume that the parameters for each CPT are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
  l(θ; D) = log p(D|θ) = log Π_n Π_i p(x_{n,i} | x_{n,π_i}, θ_i) = Σ_i Σ_n log p(x_{n,i} | x_{n,π_i}, θ_i)
  (Figure: the Allergy/Flu/Sinus/Nose/Headache network.)
- For each variable X_i:
  θ_MLE(X_i = x_i | X_{π_i} = u) = #(X_i = x_i, X_{π_i} = u) / #(X_{π_i} = u)
  Why?

MLE for General Bayesian Networks
- For the Allergy/Flu/Sinus/Nose/Headache network:
  l(θ; D) = log p(D|θ) = Σ_n log p(a_n | θ_A) + Σ_n log p(f_n | θ_F) + Σ_n log p(s_n | a_n, f_n, θ_S) + Σ_n log p(n_n | s_n, θ_N) + Σ_n log p(h_n | s_n, θ_H)
- One term for each CPT; this breaks the MLE problem up into independent subproblems.
- Earlier we already learned how to estimate a single CPT; here we just need to estimate each CPT separately.
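As a closing sketch of the counting formula θ_MLE(X_i = x_i | X_{π_i} = u) = #(X_i = x_i, X_{π_i} = u) / #(X_{π_i} = u), here is how the CPTs of the Allergy/Flu/Sinus/Nose/Headache network could be estimated from fully observed data. The five observations below are made up; only the parent structure and the counting rule come from the slides.

```python
from collections import Counter

# Fully observed samples over (A, F, S, N, H); the values below are made up.
data = [
    dict(A=1, F=0, S=0, N=1, H=0),
    dict(A=1, F=0, S=1, N=1, H=0),
    dict(A=0, F=1, S=1, N=1, H=1),
    dict(A=0, F=0, S=0, N=0, H=0),
    dict(A=1, F=1, S=1, N=0, H=1),
]

# Parent sets matching the network on the slides: S depends on A and F; N and H depend on S.
parents = {"A": [], "F": [], "S": ["A", "F"], "N": ["S"], "H": ["S"]}

def mle_cpt(var, pa, data):
    """theta_MLE(X = x | parents = u) = #(X = x, parents = u) / #(parents = u)."""
    joint, marg = Counter(), Counter()
    for d in data:
        u = tuple(d[p] for p in pa)
        joint[(d[var], u)] += 1
        marg[u] += 1
    return {(x, u): joint[(x, u)] / marg[u] for u in marg for x in (0, 1)}

cpts = {v: mle_cpt(v, pa, data) for v, pa in parents.items()}
print(cpts["S"][(1, (1, 0))])   # estimated P(S=1 | A=1, F=0) = 0.5 with these counts
```

Because the log-likelihood decomposes over nodes, each CPT is estimated from its own counts, independently of the others; with sparse data the zero-count problem reappears, which is exactly where the pseudo-count smoothing from the coin example comes back in.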