Flipping A Biased Coin

Suppose you have a coin with an unknown bias, θ ≡ P(head). You flip the coin multiple times and observe the outcomes. From the observations, you can infer the bias of the coin.

Maximum Likelihood Estimate

● Sequence of observations: H T T H T T T H
● Maximum likelihood estimate? θ = 3/8
● What about this sequence? T T T T T H H H
● What assumption makes the order unimportant? Independent, identically distributed (IID) draws.

The Likelihood

● $P(\text{Head} \mid \theta) = \theta$, $P(\text{Tail} \mid \theta) = 1 - \theta$
● Independent events → $P(\text{HHTHTT}\ldots \mid \theta) = \theta^{N_H} (1 - \theta)^{N_T}$
● Related to the binomial distribution:
  $P(N_H, N_T \mid \theta) = \binom{N_H + N_T}{N_H}\, \theta^{N_H} (1 - \theta)^{N_T}$
● $N_H$ and $N_T$ are sufficient statistics.
● How do we compute the maximum likelihood solution?

Bayesian Hypothesis Evaluation: Two Alternatives

● Two hypotheses (h for hypothesis, not head!): h0: θ = 0.5 and h1: θ = 0.9
● The role of the priors diminishes as the number of flips increases.
● Note the weirdness: each hypothesis has an associated probability, and each hypothesis specifies a probability. Probabilities of probabilities!
● Setting a prior to zero → narrowing the hypothesis space.

Bayesian Hypothesis Evaluation: Many Alternatives

● 11 hypotheses: h0: θ = 0.0, h1: θ = 0.1, …, h10: θ = 1.0
● Uniform priors: P(hi) = 1/11
● [Figure: bar plots of the distribution over the 11 models (x-axis: model), shown for the priors and after trial 1: H, trial 2: T, trial 3: T, trial 4: H, trial 5: T, trial 6: T, trial 7: T.]

MATLAB Code

Infinite Hypothesis Spaces

● Consider all values of θ, 0 ≤ θ ≤ 1.
● Inferring θ is just like any other sort of Bayesian inference.
● Likelihood is as before: $P(d \mid \theta) = \theta^{N_H} (1 - \theta)^{N_T}$
● Normalization term: $P(d) = \int_0^1 P(d \mid \theta)\, p(\theta)\, d\theta$
● With uniform priors on θ: $p(\theta \mid d) = \dfrac{\theta^{N_H} (1 - \theta)^{N_T}}{B(N_H + 1, N_T + 1)}$
● This is a beta distribution: Beta(N_H + 1, N_T + 1).
● [Figure: posterior density over θ (x-axis: model), shown for the priors and after trial 1: H, trial 2: T, trial 3: T, trial 4: H, trial 5: T, trial 6: T, trial 7: T.]

Beta Distribution

$\text{Beta}(\alpha, \beta) = \dfrac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$, where $B(\alpha, \beta) = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$

If α and β are integers, $B(\alpha, \beta) = \dfrac{(\alpha - 1)!\,(\beta - 1)!}{(\alpha + \beta - 1)!}$

Incorporating Priors

● Suppose we have a Beta prior:
  $p(\theta) = \text{Beta}(V_H, V_T) = \dfrac{1}{B(V_H, V_T)}\, \theta^{V_H - 1} (1 - \theta)^{V_T - 1}$
● We can compute the posterior analytically:
  $p(\theta \mid d) \propto \theta^{N_H + V_H - 1} (1 - \theta)^{N_T + V_T - 1}$
  $p(\theta \mid d) = \dfrac{(N_H + V_H + N_T + V_T - 1)!}{(N_H + V_H - 1)!\,(N_T + V_T - 1)!}\, \theta^{N_H + V_H - 1} (1 - \theta)^{N_T + V_T - 1}$
  $p(\theta \mid d) = \text{Beta}(N_H + V_H, N_T + V_T)$
● The posterior is also Beta distributed.

Imaginary Counts

● $p(\theta) = \text{Beta}(V_H, V_T) = \dfrac{1}{B(V_H, V_T)}\, \theta^{V_H - 1} (1 - \theta)^{V_T - 1}$
● V_H and V_T can be thought of as the outcomes of coin flipping experiments, either in one's imagination or in past experience.
● Equivalent sample size = V_H + V_T
● The larger the equivalent sample size, the more confident we are about our prior beliefs… and the more evidence we need to overcome the priors.

Regularization

● Suppose we flip the coin once and get a tail, i.e., N_T = 1, N_H = 0. What is the maximum likelihood estimate of θ?
● What if we toss in imaginary counts V_H = V_T = 1? i.e., effective N_T = 2, N_H = 1.
● What if we toss in imaginary counts V_H = V_T = 2? i.e., effective N_T = 3, N_H = 2.
● Imaginary counts smooth estimates so they are not distorted by small data sets.
● This is an issue in text processing: some words don't appear in the training corpus. (See the Dirichlet-multinomial sketch at the end of this section.)
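As a concrete illustration of the regularization idea, here is a minimal sketch (Python with SciPy assumed; not part of the original slides) comparing the maximum likelihood estimate with the posterior mean under the two imaginary-count settings above.

```python
from scipy.stats import beta

# Observed data: a single tail
N_H, N_T = 0, 1

# Maximum likelihood estimate: N_H / (N_H + N_T) = 0
print("ML estimate:", N_H / (N_H + N_T))

# Imaginary counts V_H, V_T act as a Beta(V_H, V_T) prior;
# the posterior is Beta(N_H + V_H, N_T + V_T)
for V_H, V_T in [(1, 1), (2, 2)]:
    posterior = beta(N_H + V_H, N_T + V_T)
    # Posterior mean = (N_H + V_H) / (N_H + V_H + N_T + V_T),
    # i.e., the "effective counts" turned into a smoothed estimate
    print(f"V_H = V_T = {V_H}: posterior mean =", posterior.mean())
```

With V_H = V_T = 1 the smoothed estimate is 1/3 rather than 0, and with V_H = V_T = 2 it is 2/5; the estimate is pulled toward 1/2, more strongly as the equivalent sample size grows.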
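The trial-by-trial figures above (and the MATLAB code slide, whose code is not reproduced in this text) can be mimicked with a short sketch. The following Python/NumPy version is an assumed reconstruction of that demo, not the original code.

```python
import numpy as np

# 11 candidate biases theta = 0.0, 0.1, ..., 1.0 with uniform priors P(h_i) = 1/11
thetas = np.linspace(0.0, 1.0, 11)
posterior = np.full(len(thetas), 1.0 / len(thetas))

# Observation sequence from the figures: trials 1-7 are H T T H T T T
for trial, flip in enumerate("HTTHTTT", start=1):
    likelihood = thetas if flip == "H" else 1.0 - thetas   # P(flip | h_i)
    posterior = likelihood * posterior                      # Bayes rule (unnormalized)
    posterior = posterior / posterior.sum()                 # divide by P(data)
    print(f"trial {trial}: {flip}", np.round(posterior, 3))
```

Note that hypotheses assigned zero probability (θ = 0 after the first head, θ = 1 after the first tail) are eliminated permanently, which is the sense in which setting a prior to zero narrows the hypothesis space.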
Prediction Using Posterior

● Given some sequence of n coin flips (e.g., HTTHH), what is the probability of heads on the next flip?
● It is the expectation of a beta distribution:
  $E_{\text{Beta}(\alpha, \beta)}[\theta] = \dfrac{\alpha}{\alpha + \beta}$

Summary So Far

● Beta prior on θ: $p(\theta) = \text{Beta}(V_H, V_T)$
● Binomial likelihood for the observations
● Beta posterior on θ: $p(\theta \mid d) = \text{Beta}(N_H + V_H, N_T + V_T)$

Conjugate Priors

● The Beta distribution is the conjugate prior of a binomial or Bernoulli likelihood.

Conjugate Mixtures

● If a distribution Q is a conjugate prior for likelihood R, then so is a distribution that is a mixture of Q's.
● E.g., a mixture of Betas:
  $p(\theta) = 0.5\,\text{Beta}(\theta \mid 20, 20) + 0.5\,\text{Beta}(\theta \mid 30, 10)$
● After observing 20 heads and 10 tails:
  $p(\theta \mid D) = 0.346\,\text{Beta}(\theta \mid 40, 30) + 0.654\,\text{Beta}(\theta \mid 50, 20)$
● Example from Murphy (Fig. 5.10)

Dirichlet-Multinomial Model

● We've been talking about the Beta-Binomial model, in which observations are binary: 1-of-2 possibilities.
● What if observations are 1-of-K possibilities? K-sided dice, K English words, K nationalities.

Multinomial RV

● Variable X with values x1, x2, …, xK
● $P(X = x_k) = \theta_k$, with $\sum_{k=1}^{K} \theta_k = 1$
● Likelihood, given N_k observations of x_k: $P(d \mid \theta) \propto \prod_{k=1}^{K} \theta_k^{N_k}$
● Analogous to a binomial draw.
● θ specifies a probability mass function (pmf).

Dirichlet Distribution

● The conjugate prior of a multinomial likelihood:
  $p(\theta) = \text{Dir}(\theta \mid \alpha_1, \ldots, \alpha_K) = \dfrac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$
  for θ in the K-dimensional probability simplex, 0 otherwise.
● The Dirichlet is a distribution over probability mass functions (pmfs).
● Compare {α_k} to V_H and V_T.
● [Figure from Frigyik, Kapila, & Gupta (2010)]

Hierarchical Bayes

● Consider a generative model for the multinomial: one of K alternatives is chosen by drawing alternative k with probability θ_k.
● But when we have uncertainty in the {θ_k}, we must draw a pmf θ from a Dirichlet with parameters {α_k}.
● {α_k}: hyperparameters; {θ_k}: parameters of the multinomial.

Hierarchical Bayes

● Whenever you have a parameter you don't know, instead of arbitrarily picking a value for that parameter, pick a distribution over it.
● This is a weaker assumption than selecting a parameter value.
● It requires hyperparameters (hyperⁿ-parameters), but results are typically less sensitive to the hyperⁿ-parameters than to the hyperⁿ⁻¹-parameters.

Example Of Hierarchical Bayes: Modeling Student Performance

● Collect data from S students on their performance on N test items.
● There is variability from student to student and from item to item.
● [Figure: item distribution and student distribution]

Item-Response Theory

● Parameters for student ability and item difficulty:
  P(correct) = logistic(Ability_s − Difficulty_i)
● We need different ability parameters for each student and different difficulty parameters for each item.
● But can we benefit from the fact that students in the population share some characteristics, and likewise for items?
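To connect the Dirichlet-multinomial model back to the unseen-word problem mentioned under Regularization, here is a minimal sketch. The toy vocabulary and counts are hypothetical, and a symmetric Dirichlet prior is assumed; the predictive formula is the standard Dirichlet-multinomial posterior predictive.

```python
import numpy as np

# Hypothetical toy vocabulary and training counts; "dax" never appears
vocab = ["the", "coin", "flip", "dax"]
counts = np.array([50.0, 30.0, 20.0, 0.0])   # N_k
alpha = np.ones(len(vocab))                   # symmetric Dirichlet pseudo-counts

# Posterior predictive of the Dirichlet-multinomial model:
# P(next observation = x_k | data) = (N_k + alpha_k) / sum_j (N_j + alpha_j)
predictive = (counts + alpha) / (counts + alpha).sum()

for word, p in zip(vocab, predictive):
    print(f"P({word}) = {p:.4f}")
```

The unseen word gets a small but nonzero probability, exactly as the imaginary counts V_H and V_T smoothed the estimate in the two-outcome case.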
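Finally, a minimal sketch of the item-response likelihood. The Gaussian population distributions for abilities and difficulties are assumptions standing in for the student and item distributions in the hierarchical model; they are not specified in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N = 5, 4                                    # S students, N test items

# Hierarchical assumption (hypothetical): abilities and difficulties are
# drawn from shared population (student and item) distributions
ability = rng.normal(0.0, 1.0, size=S)
difficulty = rng.normal(0.0, 1.0, size=N)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# P(student s answers item i correctly) = logistic(Ability_s - Difficulty_i)
p_correct = logistic(ability[:, None] - difficulty[None, :])   # S x N matrix
print(np.round(p_correct, 2))
```

Tying the per-student and per-item parameters to shared population distributions is what would let information be pooled across students and across items, which is the point of the closing question above.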