CSE474/574: Introduction to Machine Learning
Generative Models
Varun Chandola <chandola@buffalo.edu>

Outline

1. Generative Models for Discrete Data
   - Bayesian Concept Learning
   - Likelihood
   - Adding a Prior
   - Posterior
   - Posterior Predictive Distribution
2. Steps for Learning a Generative Model
   - Incorporating Prior
   - Beta Distribution
   - Conjugate Priors
   - Estimating Posterior
   - Using Predictive Distribution
   - Need for Prior
   - Need for Bayesian Averaging
3. Learning Gaussian Models
   - Estimating Parameters
   - Estimating Posterior

Generative Models

- Let us go back to our tumor example.
- X represents the data, with multiple discrete attributes. Is X a discrete or a continuous random variable?
- Y represents the class (benign or malignant).
- Most probable class:

  P(Y = c | X, θ) ∝ P(X = x | Y = c, θ) P(Y = c | θ)

- P(X = x | Y = c, θ) = p(x | y = c, θ) is the class-conditional density: how is the data distributed for each class?

Bayesian Concept Learning

- A concept assigns binary labels (member / non-member) to examples.
- We want to find out p(x | x ∈ c): the probability of observing an example x given that it belongs to the concept c.

Concept Learning on the Number Line

- I give you a set of numbers (training set D) belonging to a concept.
- Choose the most likely hypothesis (concept).
- Assume that the numbers are between 1 and 100.
- Hypothesis space (H):
  - All powers of 2
  - All powers of 4
  - All even numbers
  - All prime numbers
  - Numbers close to a fixed number (say 12)
  - ...

Socrative Game: go to http://m.socrative.com/student/ and enter class ID ubmachinelearning2016.
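To make the hypothesis space concrete, here is a small Python sketch (mine, not from the slides) that encodes the ten candidate concepts used in the quiz below as sets of numbers in 1..100. The exact membership conventions (for instance, that 1 is not counted as a power of 2) are my assumptions, chosen to be consistent with the likelihood values quoted later.

```python
# Each hypothesis is represented by the set of numbers in 1..100 it covers.
N_MAX = 100
RANGE = range(1, N_MAX + 1)

hypotheses = {
    "even":            {x for x in RANGE if x % 2 == 0},
    "odd":             {x for x in RANGE if x % 2 == 1},
    "squares":         {x * x for x in range(1, 11)},
    "powers of 2":     {2 ** k for k in range(1, 7)},   # 2, 4, 8, 16, 32, 64
    "powers of 4":     {4 ** k for k in range(1, 4)},   # 4, 16, 64
    "powers of 16":    {16},
    "multiples of 5":  {x for x in RANGE if x % 5 == 0},
    "multiples of 10": {x for x in RANGE if x % 10 == 0},
    "near 20 +/- 5":   {x for x in RANGE if 15 <= x <= 25},
    "all 1..100":      set(RANGE),
}
```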
Ready?

Hypothesis space (H):
1. Even numbers
2. Odd numbers
3. Squares
4. Powers of 2
5. Powers of 4
6. Powers of 16
7. Multiples of 5
8. Multiples of 10
9. Numbers within 20 ± 5
10. All numbers between 1 and 100

Which hypothesis would you pick for each of the following training sets?
- D = {}
- D = {16}
- D = {60}
- D = {16, 19, 15, 20, 18}
- D = {16, 4, 64, 32}

Computing Likelihood

- Why choose the "powers of 2" concept over the "even numbers" concept for D = {16, 4, 64, 32}?
- Avoid suspicious coincidences: choose the concept with the higher likelihood.
- What is the likelihood of the above D being generated by the "powers of 2" concept? What is the likelihood under the "even numbers" concept? (See the sketch below.)
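Continuing the sketch above: assuming examples are sampled uniformly at random from the concept's extension (an assumption on my part, but consistent with the likelihood values in the table that follows), the likelihood of D under a hypothesis h is (1/|h|)^|D| whenever every element of D lies in h, and 0 otherwise.

```python
# Likelihood under uniform sampling from the concept's extension.
def likelihood(D, h):
    """p(D | h) = (1 / |h|) ** |D| if D is a subset of h, else 0."""
    if not all(x in h for x in D):
        return 0.0
    return (1.0 / len(h)) ** len(D)

D = [16, 4, 64, 32]
print(likelihood(D, hypotheses["powers of 2"]))   # (1/6)**4  ~ 7.7e-4
print(likelihood(D, hypotheses["even"]))          # (1/50)**4 ~ 1.6e-7
```

The "powers of 2" concept explains the same data with a much smaller extension, so its likelihood is several orders of magnitude larger than that of "even numbers".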
Likelihood

- Why choose one hypothesis over another?
- Avoid suspicious coincidences: choose the concept with the higher likelihood.

  p(D | h) = ∏_{x ∈ D} p(x | h)

- Log-likelihood:

  log p(D | h) = ∑_{x ∈ D} log p(x | h)

Bayesian Concept Learning

- For D = {16, 4, 64, 32}, we can rank all ten hypotheses above by how well they explain the data.

Adding a Prior

- Inside information about the hypotheses.
- Some hypotheses are more likely a priori.
- The prior may not favor the right hypothesis (the prior can be wrong).

Posterior

- Revised estimate for h after observing the evidence D, combined with the prior.
- Posterior ∝ Likelihood × Prior:

  p(h | D) = p(D | h) p(h) / ∑_{h' ∈ H} p(D | h') p(h')

For D = {16, 4, 64, 32}:

  #   Hypothesis h                     Prior   Likelihood     Posterior
  1   Even numbers                     0.300   0.16 × 10^-6   0.621 × 10^-3
  2   Odd numbers                      0.075   0              0
  3   Squares                          0.075   0              0
  4   Powers of 2                      0.100   0.77 × 10^-3   0.997
  5   Powers of 4                      0.075   0              0
  6   Powers of 16                     0.075   0              0
  7   Multiples of 5                   0.075   0              0
  8   Multiples of 10                  0.075   0              0
  9   Numbers within 20 ± 5            0.075   0              0
 10   All numbers between 1 and 100    0.075   0.01 × 10^-6   0.009 × 10^-3

Finding the Best Hypothesis

- Most probable hypothesis under the prior:

  ĥ_prior = argmax_h p(h)

- Maximum Likelihood Estimate (MLE):

  ĥ_MLE = argmax_h p(D | h) = argmax_h log p(D | h) = argmax_h ∑_{x ∈ D} log p(x | h)

- Maximum a Posteriori (MAP) Estimate:

  ĥ_MAP = argmax_h p(D | h) p(h) = argmax_h (log p(D | h) + log p(h))

MAP and MLE

- ĥ_prior: most likely hypothesis based on the prior alone.
- ĥ_MLE: most likely hypothesis based on the evidence alone.
- ĥ_MAP: most likely hypothesis based on the posterior.
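A sketch of the posterior computation behind the table above, together with the ĥ_MLE and ĥ_MAP selections. It reuses the `hypotheses` dictionary and `likelihood` function from the earlier sketches; the priors are the ones listed in the table.

```python
# Prior x likelihood, normalized over the hypothesis space.
priors = {name: 0.075 for name in hypotheses}
priors["even"] = 0.3
priors["powers of 2"] = 0.1          # priors sum to 1

D = [16, 4, 64, 32]
unnorm = {name: likelihood(D, h) * priors[name] for name, h in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {name: p / Z for name, p in unnorm.items()}

h_mle = max(hypotheses, key=lambda name: likelihood(D, hypotheses[name]))
h_map = max(posterior, key=posterior.get)
print(h_mle, h_map)                                # both point to "powers of 2"
print(posterior["powers of 2"])                    # dominates the posterior (cf. table)
```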
Interesting Properties

- As the amount of data increases, the MAP estimate converges towards the MLE. Why? The log-likelihood term grows with the number of observations while the log-prior term stays fixed.
- MAP/MLE are consistent estimators: if the true concept c is in H, the MAP/ML estimates converge to it.
- If c ∉ H, the MAP/ML estimates converge to the hypothesis h ∈ H that is closest possible to the truth.

Posterior Predictive Distribution

- New input x*: what is the probability that x* is also generated by the same concept as D, i.e., p(x* ∈ c | D)?
- Option 0: treat ĥ_prior as the true concept: p(x* ∈ c | D) = p(x* ∈ ĥ_prior).
- Option 1: treat ĥ_MLE as the true concept: p(x* ∈ c | D) = p(x* ∈ ĥ_MLE).
- Option 2: treat ĥ_MAP as the true concept: p(x* ∈ c | D) = p(x* ∈ ĥ_MAP).
- Option 3: Bayesian averaging (see the sketch below):

  p(x* ∈ c | D) = ∑_h p(x* ∈ h) p(h | D)
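A sketch of the four prediction options on a concrete test point, again reusing the running example above (`hypotheses`, `priors`, `posterior`, `h_mle`, `h_map`); the test number x_star = 10 is my choice for illustration.

```python
def member(x, h):
    """p(x in h): 1 if x belongs to the hypothesis' extension, 0 otherwise."""
    return 1.0 if x in h else 0.0

x_star = 10

p_opt0 = member(x_star, hypotheses[max(priors, key=priors.get)])   # plug in h_prior
p_opt1 = member(x_star, hypotheses[h_mle])                          # plug in h_MLE
p_opt2 = member(x_star, hypotheses[h_map])                          # plug in h_MAP
p_opt3 = sum(member(x_star, hypotheses[name]) * posterior[name]     # Bayesian averaging
             for name in hypotheses)

print(p_opt0, p_opt1, p_opt2, p_opt3)
# For x* = 10: Option 0 gives 1.0 ("even" has the largest prior), Options 1-2 give 0.0,
# and Option 3 keeps a small amount of probability mass on hypotheses other than the MAP.
```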
Steps for Learning a Generative Model

- Example: D is a sequence of N binary values (0s and 1s), e.g., coin tosses.
- What is the best distribution to describe D? What is the probability of observing a head in the future?
- Step 1: Choose the form of the model.
  - Hypothesis space: all possible distributions — too complicated!
  - Revised hypothesis space: all Bernoulli distributions, X ~ Ber(θ) with 0 ≤ θ ≤ 1; the parameter θ is the hypothesis.
  - Still infinite: θ can take infinitely many values.

Compute Likelihood

- Likelihood of D (with N1 heads and N0 tails):

  p(D | θ) = θ^N1 (1 − θ)^N0

- Maximum Likelihood Estimate:

  θ̂_MLE = argmax_θ p(D | θ) = argmax_θ θ^N1 (1 − θ)^N0 = N1 / (N0 + N1)

- We can stop here (the MLE approach). Probability of getting a head next: p(x* = 1 | D) = θ̂_MLE.

Incorporating Prior

- The prior encodes our prior belief about θ. How do we set a Bayesian prior?
  1. A point estimate, e.g., θ_prior = 0.5.
  2. A probability distribution over θ (treat θ as a random variable).
- Which one? For a Bernoulli model, 0 ≤ θ ≤ 1, so we need a distribution on [0, 1]: the Beta distribution.

[Figure: example Beta densities p(θ) on [0, 1].]

Beta Distribution as Prior

- A distribution for a continuous random variable defined between 0 and 1:

  Beta(θ | a, b) ≜ p(θ | a, b) = (1 / B(a, b)) θ^(a−1) (1 − θ)^(b−1)

- a and b are the (hyper-)parameters of the distribution; they "control" the shape of the pdf (see the sketch below).
- B(a, b) is the beta function:

  B(a, b) = Γ(a) Γ(b) / Γ(a + b),   where   Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du

  If x is an integer, Γ(x) = (x − 1)!.
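A minimal sketch of how a and b shape the Beta density, implementing the pdf directly from the formula above using only the Python standard library; the particular (a, b) pairs are illustrative choices.

```python
import math

def beta_pdf(theta, a, b):
    """Beta(theta | a, b) = theta^(a-1) * (1-theta)^(b-1) / B(a, b)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

for a, b in [(1, 1), (2, 2), (2, 5), (0.5, 0.5)]:
    values = [round(beta_pdf(t / 10, a, b), 2) for t in range(1, 10)]
    print(f"a={a}, b={b}: {values}")
# a = b = 1 is the uniform prior; a = b = 2 peaks at 0.5; a = 2, b = 5 favors small theta;
# a, b < 1 pushes mass towards 0 and 1.
```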
- We can stop here as well (the prior approach): p(x* = 1) = θ_prior.

Conjugate Priors

- Another reason to choose the Beta distribution:

  p(D | θ) = θ^N1 (1 − θ)^N0
  p(θ) ∝ θ^(a−1) (1 − θ)^(b−1)

- Posterior ∝ Likelihood × Prior:

  p(θ | D) ∝ θ^N1 (1 − θ)^N0 · θ^(a−1) (1 − θ)^(b−1) ∝ θ^(N1+a−1) (1 − θ)^(N0+b−1)

- The posterior has the same form as the prior: the Beta distribution is a conjugate prior for the Bernoulli/Binomial likelihood.

Estimating Posterior

- Posterior:

  p(θ | D) ∝ θ^(N1+a−1) (1 − θ)^(N0+b−1),   i.e.,   p(θ | D) = Beta(θ | a + N1, b + N0)

- We start with the belief E[θ] = a / (a + b). After observing N trials with N1 heads and N0 tails, we update our belief to:

  E[θ | D] = (a + N1) / (a + b + N)

Using the Posterior

- We know the posterior over θ is a Beta distribution.
- MAP estimate:

  θ̂_MAP = argmax_θ Beta(θ | a + N1, b + N0) = (a + N1 − 1) / (a + b + N − 2)

- What happens if a = b = 1? (The MAP estimate reduces to the MLE.)
- We can stop here as well (the MAP approach). Probability of getting a head next: p(x* = 1 | D) = θ̂_MAP.

True Bayesian Approach

- All values of θ remain possible; the prediction for a new input x* is obtained by Bayesian averaging:

  p(x* = 1 | D) = ∫₀¹ p(x* = 1 | θ) p(θ | D) dθ
                = ∫₀¹ θ Beta(θ | a + N1, b + N0) dθ
                = E[θ | D]
                = (a + N1) / (a + b + N)

- This is the same as using E[θ | D] as a point estimate for θ.

The Black Swan Paradox

- Why use a prior? Consider D = {tails, tails, tails}, so N1 = 0 and N = 3.
- θ̂_MLE = 0, hence p(x* = 1 | D) = 0: we would never predict a head.
- This is the black swan paradox.
- How does the Bayesian approach help?

  p(x* = 1 | D) = a / (a + b + 3)

(A numerical comparison is sketched below.)
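A sketch of the Beta-Bernoulli estimates side by side on the black-swan data D = (tails, tails, tails); the prior a = b = 2 is my choice for illustration.

```python
def bernoulli_estimates(D, a=2, b=2):
    """D is a sequence of 0/1 outcomes (1 = heads). Returns MLE, MAP, posterior mean."""
    N1 = sum(D)
    N0 = len(D) - N1
    N = N1 + N0
    mle = N1 / N if N > 0 else float("nan")
    post_map = (a + N1 - 1) / (a + b + N - 2)    # mode of Beta(a + N1, b + N0)
    post_mean = (a + N1) / (a + b + N)           # E[theta | D]
    return mle, post_map, post_mean

D = [0, 0, 0]                                    # three tails
mle, map_est, mean_est = bernoulli_estimates(D)
print(f"MLE:            {mle:.3f}")              # 0.000 -> p(heads next) = 0, the paradox
print(f"MAP:            {map_est:.3f}")          # 0.200
print(f"Posterior mean: {mean_est:.3f}")         # 0.286 = a / (a + b + 3) with a = b = 2
```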
Why is the MAP Estimate Insufficient?

- The MAP estimate is only one part of the posterior: the value of θ at which the posterior density is maximum.
- But is that enough? What about the posterior variance of θ?

  var[θ | D] = (a + N1)(b + N0) / ( (a + b + N)² (a + b + N + 1) )

- If the variance is high, then θ̂_MAP is not trustworthy. Bayesian averaging helps in this case.

Multivariate Gaussian

- The pdf of a multivariate normal (MVN) in d dimensions:

  N(x | µ, Σ) ≜ (1 / ((2π)^(d/2) |Σ|^(1/2))) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

Estimating Parameters of an MVN

Problem statement: given a set D of N independent and identically distributed (iid) samples, learn the parameters (µ, Σ) of the Gaussian distribution that generated D.

- MLE approach: maximize the log-likelihood. Result:

  µ̂_MLE = (1/N) ∑_{i=1}^{N} x_i ≜ x̄
  Σ̂_MLE = (1/N) ∑_{i=1}^{N} (x_i − x̄)(x_i − x̄)ᵀ

Estimating the Posterior

- We need a posterior for both µ and Σ, so we need priors p(µ) and p(Σ).
- Which distribution should we use as a prior on µ? A Gaussian:

  p(µ) = N(µ | m₀, V₀)

- Which distribution should we use as a prior on Σ? An Inverse-Wishart:

  p(Σ) = IW(Σ | S, ν) = (1 / Z_IW) |Σ|^(−(ν+D+1)/2) exp( −(1/2) tr(S⁻¹ Σ⁻¹) )

  where Z_IW = |S|^(−ν/2) 2^(νD/2) Γ_D(ν/2).

Calculating the Posterior

- The posterior for µ (with Σ fixed) is also an MVN:

  p(µ | D, Σ) = N(µ | m_N, V_N)
  V_N⁻¹ = V₀⁻¹ + N Σ⁻¹
  m_N = V_N ( Σ⁻¹ (N x̄) + V₀⁻¹ m₀ )

- The posterior for Σ (with µ fixed) is also an Inverse-Wishart:

  p(Σ | D, µ) = IW(Σ | S_N, ν_N)
  S_N⁻¹ = S₀ + S_µ,   ν_N = ν₀ + N

  where S_µ = ∑_{i=1}^{N} (x_i − µ)(x_i − µ)ᵀ is the scatter of the data around µ.
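A NumPy sketch of the two estimation steps above: the MLE for (µ, Σ) and the conditional posterior N(m_N, V_N) for µ with Σ held fixed. The synthetic data and the prior hyperparameters m0, V0 are my choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 500
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=N)   # D: N iid samples

# MLE for mu and Sigma
x_bar = X.mean(axis=0)                                      # mu_hat_MLE
centered = X - x_bar
Sigma_hat = centered.T @ centered / N                       # Sigma_hat_MLE (note 1/N, not 1/(N-1))

# Conditional posterior for mu, treating Sigma as known (here we plug in Sigma_hat)
m0 = np.zeros(d)                                            # prior mean
V0 = 10.0 * np.eye(d)                                       # broad prior covariance
Sigma_inv = np.linalg.inv(Sigma_hat)
V_N = np.linalg.inv(np.linalg.inv(V0) + N * Sigma_inv)
m_N = V_N @ (Sigma_inv @ (N * x_bar) + np.linalg.inv(V0) @ m0)

print("MLE mean:", x_bar)
print("MLE covariance:\n", Sigma_hat)
print("Posterior mean of mu:", m_N)                         # close to x_bar for large N
```

With N = 500 samples the data dominate the broad prior, so the posterior mean m_N is essentially the sample mean, illustrating the MAP-approaches-MLE behaviour discussed earlier.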