STATS 551, Winter 2022
Lectures on Bayesian modeling and computation

XuanLong Nguyen
University of Michigan
April 12, 2022

Abstract

This is a set of lecture notes for Stats 551. The materials presented in these notes are self-contained. I will keep updating these notes as we go. A main textbook utilized in preparing these notes is Peter Hoff's "A first course in Bayesian statistical methods" [Hoff, 2009]. I will also draw from several other sources, including Charles Geyer's "Markov Chain Monte Carlo lecture notes" [Geyer, 2005], Michael I. Jordan's "An introduction to probabilistic graphical models" [Jordan, 2003], and Christian Robert's "The Bayesian choice" [Robert, 2007]. Please let me know (xuanlong@umich.edu) of any errors.

Contents

1 Introduction and examples
  1.1 What is Bayesian inference?
  1.2 Bayes' rule
  1.3 Example: estimating the probability of a rare event
  1.4 Example: prediction via a Bayesian regression model
2 Interpretation of probabilities and Bayes' formulas
  2.1 Interpretation of probabilities
  2.2 Bayes' rule
  2.3 Bayesian hypothesis testing
  2.4 Random variables and conditional independence
    2.4.1 Discrete domains
    2.4.2 Continuous domains
    2.4.3 Multivariate domains
  2.5 Bayes' formulas and parameter estimation
3 One-parameter models
  3.1 The binomial model
  3.2 Confidence regions
  3.3 The Poisson model
  3.4 Example: birth rates
4 Monte Carlo approximation
  4.1 Basic ideas
  4.2 Posterior inference for arbitrary functions
  4.3 Sampling from posterior predictive distributions
  4.4 Posterior predictive model checking
5 The normal model
  5.1 The normal / Gaussian distribution
  5.2 Inference of the mean with variance fixed
  5.3 Joint inference for the mean and variance
  5.4 Normal model for non-normal data
6 Posterior approximation with the Gibbs sampler
  6.1 Conjugate vs non-conjugate prior
  6.2 The Gibbs sampler
  6.3 Markov chain Monte Carlo algorithms
    6.3.1 Gibbs sampler
    6.3.2 General Markov chain framework
    6.3.3 Variants of Gibbs samplers
  6.4 MCMC diagnostics
7 Multivariate normal models
  7.1 Mean vector and covariance matrix
  7.2 The multivariate normal distribution
  7.3 Semiconjugate prior for the mean vector
  7.4 Inverse Wishart prior for the covariance matrix
  7.5 Example: reading comprehension study
8 Group comparisons and hierarchical modeling
  8.1 Comparing two groups
  8.2 Comparing multiple groups
  8.3 Exchangeability and hierarchical models
  8.4 Hierarchical normal models
    8.4.1 Posterior inference
    8.4.2 Example: Math scores in U.S. public schools
  8.5 Topic models
    8.5.1 Model formulation
    8.5.2 Posterior inference
    8.5.3 Variational Bayes
9 Linear regression
  9.1 Linear regression model
  9.2 Semi-conjugate priors
  9.3 Objective priors
  9.4 Model selection
    9.4.1 Bayesian model comparison
    9.4.2 Model averaging via MCMC
10 Metropolis-Hastings algorithms
  10.1 Metropolis-Hastings update
    10.1.1 Detailed balance and reversibility
  10.2 Example
11 Unsupervised learning and nonparametric Bayes
  11.1 Finite mixture models
    11.1.1 Auxiliary variables
  11.2 Infinite mixture models
    11.2.1 Dirichlet process prior
  11.3 Posterior computation via slice sampling
  11.4 Chinese restaurant process and another Gibbs sampler
12 Additional topics

1 Introduction and examples

1.1 What is Bayesian inference?

Bayesian inference is a major framework for statistical inference. In general, statistical inference is the (computational) process of turning data into some form of data summarization and understanding, which may also enable prediction. Bayesian inference, or more broadly speaking, Bayesian statistics, is often contrasted with a competing framework known as frequentist (or classical) statistics. In this course, we refer to statistical inference and statistical learning interchangeably.
There are two main players in statistical inference: the data and the quantity of inferential interest. The data are represented by a variable y taking values in some suitable space $\mathcal{Y}$. The quantity of interest is denoted by θ, taking values in another space Θ. Typically θ represents some characteristic of the data population that we wish to understand. For inference to be possible, there must be a "linkage" between θ and the observed data y. This linkage is formalized by a sampling model (statistical model) in which θ is viewed as the model parameter: the true θ responsible for generating the observed data y is unknown. As such, θ encodes our understanding of the data population. It is the quantity of interest.

Example 1.1. Suppose we are interested in the prevalence of an infectious disease in a city. Data y are obtained from a random sample of individuals from the city, namely, the total number of people in the sample who are infected. Of interest is θ, the fraction of infected individuals in the city. Thus, Θ = [0, 1], while $\mathcal{Y}$ = {0, 1, 2, 3, . . .}.

Example 1.2. y represents a collection of heights sampled from a population, and θ the typical height. Here, Θ = $\mathcal{Y}$ = R.

Example 1.3. y represents polling data, and θ is a categorical-valued variable that tells us which candidate wins an election.

Example 1.4. y is a sequence of binary values that record whether a given day is rainy or not. θ may be taken to represent the frequency of rainy days, i.e., the cloudiness of a location. We may also want to predict whether it is going to rain tomorrow (in this case, we may introduce another binary random variable to represent tomorrow's forecast).

Example 1.5. A less obvious example: y is a collection of data pairs of the form (u, v), where v is the binary class label that represents the "class" of the corresponding u. θ is a mathematical quantity related to the classifier, a function mapping u to v that we wish to obtain on the basis of a training data set.
Example 1.6. A clustering problem involves subdividing a collection of data points represented by y into "clusters", which can be represented by θ.

Example 1.7. "Who is who". y represents a collection of photos available on the Internet. θ represents the identities of all individuals that appear in such photos.

In practice, and in our times, the data y have become increasingly complex, and so has the ambition of data modelers and statisticians, who want to infer increasingly complex quantities of interest θ.

In both the Bayesian and the frequentist frameworks, data y are always considered to be realizations of some random variable denoted by Y (see footnote 1). The nature of the unknown θ is a different matter: frequentist methods treat θ as unknown but non-random, whereas Bayesian methods always assume θ to be random. The randomness of the unknown can be viewed as the most distinguishing feature of Bayesian statistics. The ramifications are both deep and strong. This course is an applied Bayesian analysis course, so we will not get into the deeper theoretical foundations of Bayesian statistics. Instead, we focus on Bayesian methods and applications. Nonetheless, the ramifications of the Bayesian choice will be felt strongly.

Footnote 1: In these notes, we will try to adhere to the convention that random variables are written in upper case, unless denoted by Greek letters. The numerical value of a random variable, say Y, is written in lower case, y.

1.2 Bayes' rule

The idealized form of Bayesian inference begins with a numerical formulation of the joint beliefs about y and θ, expressed in terms of probability distributions over $\mathcal{Y}$ and Θ. Here are the key ingredients:

1. For each numerical value θ ∈ Θ, the prior distribution p(θ) describes our belief that θ represents the true population's characteristics.
2. For each θ ∈ Θ and y ∈ $\mathcal{Y}$, the sampling model p(y|θ) describes our belief that y would be the outcome of our study if we knew θ to be true.
Once we obtain the data y, the last step is to update our beliefs about θ:

3. For each numerical value of θ ∈ Θ, the posterior distribution p(θ|y) describes our belief that θ is the true value, having observed the data set y.

The posterior distribution is obtained from the prior distribution and the sampling model via Bayes' rule

\[
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int_{\Theta} p(y \mid \tilde{\theta})\, p(\tilde{\theta})\, d\tilde{\theta}}. \tag{1}
\]

Note that Bayes' rule is a mathematical formula that allows one to "invert" the arguments of conditional probabilities. Here we have applied Bayes' formula for the purpose of statistical inference, which gives the inference method named after its progenitor.

Implicit in the above description is a significant conceptual lift of Bayesian statistics: we quantify the a posteriori "belief" about θ by the conditional probability of θ given y. The higher the probability assigned to a numerical value of θ, the stronger our belief in it. Although "belief" may be a vague notion, probabilities and conditional probabilities are mathematically well-defined. Thus, we may speak of belief in a quantitatively rigorous way. Note also that Bayes' rule does not tell us what the truth θ should be; it tells us how our belief about θ changes after seeing new information.

Figure 1.1: The plot on the left gives binomial(20, θ) distributions for three values of θ. The right side gives prior (gray) and posterior (black) densities of θ. This is Fig. 1.1 of PH.

1.3 Example: estimating the probability of a rare event

We continue with Example 1.1. Of interest is θ, the fraction of infected individuals in the city, so θ ∈ [0, 1]. The data are the number of infected individuals in a sample of 20, so y ∈ {0, . . . , 20}.

We need a sampling model. A reasonable choice is [why?]

\[
Y \mid \theta \sim \mathrm{binomial}(20, \theta).
\]

This means

\[
P(Y = y \mid \theta) = \binom{20}{y} \theta^{y} (1 - \theta)^{20 - y}.
\]

In particular, P(Y = 0|θ) = $(1 - \theta)^{20}$. See Fig. 1.1 for an illustration. To get a sense of this probability, P(Y = 0|θ = 0.05) = $0.95^{20}$ ≈ 0.36.
If θ = 0.1 or θ = 0.2, this number would be 0.12 or 0.01, respectively.

Next, a prior is specified. A common choice is the beta distribution [why?]

\[
\theta \sim \mathrm{beta}(a, b).
\]

There are two parameters a, b > 0 that we need to set. But how? The expectation under the beta prior is a/(a + b). The mode of the beta prior is (a − 1)/(a − 1 + b − 1), for a, b > 1. Previous studies from various parts of the country indicate that the infection rate in comparable cities ranges from about 0.05 to 0.20, with an average prevalence of 0.10. This suggests taking a = 2, b = 20:

\[
\theta \sim \mathrm{beta}(2, 20).
\]

This prior specification implies the following summaries of the prior distribution:

E[θ] = 0.09
mode[θ] = 0.05
Pr(θ < 0.10) = 0.64
Pr(0.05 < θ < 0.20) = 0.66.

You may still find reasons to be uncomfortable with this particular choice of prior parameters (what?), but we will get to them. Let us now apply Bayes' rule, which takes us from the prior to the posterior distribution.

From prior to posterior

By an application of Bayes' rule, we will find that if

\[
Y \mid \theta \sim \mathrm{binomial}(n, \theta), \qquad \theta \sim \mathrm{beta}(a, b),
\]

then the conditional distribution of θ given the data is again a beta:

\[
\theta \mid Y = y \sim \mathrm{beta}(a + y,\; b + n - y).
\]

This is an example of a general structural property called "conjugacy" (the beta prior is conjugate to the binomial likelihood) that is widely exploited in Bayesian computation. We will study this property systematically in later lectures.

Suppose that in our specific study we observed Y = 0, i.e., none of the sampled individuals was infected. [What do we make of this?] The posterior distribution of θ is therefore

\[
\theta \mid \{Y = 0\} \sim \mathrm{beta}(2, 40).
\]

Observe the change in shape from the prior distribution to the posterior distribution under the (new) observation y = 0 in Fig. 1.1: the mass of the posterior is shifted toward zero. This reflects the consequence of the "Bayes update"; by contrast, a simple-minded approach would be to set θ = 0 in the presence of y = 0. The posterior is also more "peaked" than the prior.
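The numbers quoted in this example can be checked with a few lines of code. The following is a minimal sketch in plain Python, not code from the notes: the helper names (`binom_pmf`, `beta_cdf_int`, `conjugate_update`) are illustrative, and the beta CDF is computed with an identity that holds only for positive integer a and b, which happens to cover beta(2, 20) and beta(2, 40).

```python
from math import comb


def binom_pmf(y, n, theta):
    """P(Y = y | theta) under the binomial(n, theta) sampling model."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


def beta_cdf_int(x, a, b):
    """Pr(theta <= x) for theta ~ beta(a, b), positive INTEGER a and b only.

    Assumption: uses the identity Pr(theta <= x) = Pr(Bin(a + b - 1, x) >= a),
    which is not valid for non-integer shape parameters.
    """
    m = a + b - 1
    return sum(binom_pmf(k, m, x) for k in range(a, m + 1))


def beta_mean(a, b):
    return a / (a + b)


def beta_mode(a, b):
    # Formula from the text; requires a, b > 1.
    return (a - 1) / (a - 1 + b - 1)


def conjugate_update(a, b, y, n):
    """beta(a, b) prior + binomial(n, theta) data y -> beta(a + y, b + n - y)."""
    return a + y, b + n - y


n = 20
a, b = 2, 20  # prior chosen in the text

# Sampling model: probability of seeing no infections for three values of theta.
p_zero = {theta: binom_pmf(0, n, theta) for theta in (0.05, 0.10, 0.20)}

# Prior summaries.
prior_mean = beta_mean(a, b)
prior_mode = beta_mode(a, b)
prior_below_10 = beta_cdf_int(0.10, a, b)
prior_between = beta_cdf_int(0.20, a, b) - beta_cdf_int(0.05, a, b)

# Conjugate update after observing y = 0 infected out of n = 20.
a_post, b_post = conjugate_update(a, b, y=0, n=n)

print(p_zero)
print(prior_mean, prior_mode, prior_below_10, prior_between)
print(a_post, b_post)
```

With non-integer shape parameters one would instead evaluate the regularized incomplete beta function from a numerical library; the integer shortcut is used here only to keep the sketch dependency-free.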
The sharper peak reflects a general phenomenon: as more data are observed, our belief about θ becomes more concentrated, even if we start out with a prior belief that is more "diffuse". In other words, the more data are observed, the less influential the role of the prior. This is a desirable property.

More quantitatively, the posterior summaries are:

E[θ|Y = 0] = 0.048
mode[θ|Y = 0] = 0.025
Pr(θ < 0.10|Y = 0) = 0.93

In particular, we may say: given the observation, our posterior belief that θ < 0.1 is quite high (about 0.93). How sensitive is this conclusion to our prior specification?

Sensitivity analysis

The Bayes update enables us to go from a beta(a, b) prior to a beta posterior, namely beta(a + y, b + n − y), whose parameters incorporate the impact of the observed data y. In particular, we go from the prior mean $\theta_0 := a/(a + b)$ to the posterior mean

\[
E[\theta \mid Y = y] = \frac{a + y}{a + b + n}
= \frac{n}{a + b + n}\,\frac{y}{n} + \frac{a + b}{a + b + n}\,\frac{a}{a + b}
= \frac{n}{w + n}\,\bar{y} + \frac{w}{w + n}\,\theta_0 .
\]

Here, we denote w = a + b. The above formula nicely captures the combined impact of the data, via the term ȳ = y/n, and of prior knowledge, via the prior mean $\theta_0$. The posterior mean is a weighted average of the sample mean ȳ and our prior guess $\theta_0$. We may view w as a parameter that represents our confidence in this prior guess. Note that the prior distribution may be expressed as beta($w\theta_0$, $w(1 - \theta_0)$). The posterior distribution is beta($w\theta_0 + y$, $w(1 - \theta_0) + n - y$).

• If we fix w, let us see the impact of the data size. If the sample size n tends to infinity, then the posterior mean tends to the sample mean ȳ; the prior belief plays a vanishing role no matter how confident we are about it. However, when the sample size n is small, the prior belief can be influential, and its influence is captured by w.
• Let us fix n (to be relatively small). As w → 0, i.e., as our confidence in the prior vanishes, the posterior mean converges to the data-driven sample mean ȳ.
If w → ∞, the opposite happens: the posterior mean tends toward the prior mean $\theta_0$; the observed data hardly matter any more.

Figure 1.2: Posterior quantities under different beta prior specifications. The left and right hand panels give contours of E[θ|Y = 0] and Pr[θ < 0.10|Y = 0], respectively. (Fig. 1.2 of PH).

Fig. 1.2 gives a more detailed picture of the sensitivity to the prior specification. The left panel tells a general story: the prior specification can play a big role in our conclusion "after the fact". The sensitivity analysis allows us to be both honest and more confident in drawing our inference.

The confidence in our inference depends on the specific question that we ask about θ. Suppose that the city officials want to recommend a vaccine to the general public unless they are reasonably sure that the current infection rate is less than 10%. Then we may want to look at the right panel, which gives the contours of the posterior probability Pr[θ < 0.10|Data].

• For a chosen $\theta_0 \le 0.1$, where 0.1 is the average prevalence in other comparable cities from prior studies, we can be reasonably certain that the current infection rate is below 10% (with posterior probability above 90% for a large range of w).
• A higher degree of certainty, say 97.5%, is only achieved by people who already thought the infection rate was lower than the average of the other cities, e.g., if $\theta_0 < 0.05$.

1.4 Example: prediction via a Bayesian regression model

The problem is to come up with a predictive model for diabetes progression as a function of 64 baseline explanatory variables such as age, sex and body mass index. A standard tool is a linear regression model; the simplest one is

\[
y = \beta^{\top} x + \sigma \epsilon,
\]

where $\beta \in \mathbb{R}^{64}$ is a quantity of interest, along with σ, the standard deviation of the error term. The parameters can be estimated using a training dataset consisting of measurements from 342 patients.
There is also a test dataset of 100 patients, which will be used to evaluate the predictive performance of the learned model.

Sampling model

Suppose that the standardized error term follows the standard normal distribution, ε ∼ N(0, 1), and that the scale σ is unknown. Then the sampling model takes the following conditional form (mean $\beta^{\top} x$, standard deviation σ):

\[
Y \mid X = x, \beta, \sigma \sim \mathrm{Normal}(\beta^{\top} x, \sigma).
\]

Prior specification

Placing a prior distribution on $\beta \in \mathbb{R}^{64}$ is non-trivial. This is a large parameter space; one needs to impose the kind of distribution that reflects the prior knowledge of this space. The prior belief we have is that most of the 64 explanatory variables have little to no effect on diabetes progression, i.e., most coefficients are zero but we do not know which ones. We can start with a prior under which $\beta_1, \ldots, \beta_{64}$ are a priori independent and $\Pr(\beta_j = 0) = 1/2$. Details are omitted for now. Likewise, a prior on σ is required, but for now we may assume σ fixed.

Posterior distribution

With the sampling model and the prior specification, and given the observed data pairs $(y, X) = (y_i, x_i)_{i=1}^{342}$, by Bayes' rule we obtain the posterior for the parameter β:

\[
P(\beta \mid y, X, \sigma) = \frac{P(\beta)\, P(y \mid X, \beta, \sigma)}{P(y \mid X, \sigma)}.
\]

(Notice that for regression or classification problems, the explanatory variables X remain in the conditioning in Bayes' formula.) Later we will learn how to carry out this kind of posterior computation.

The posterior distribution gives us much information. Of interest, for instance, is the question of variable selection, which can be addressed via $\Pr(\beta_j \neq 0 \mid y, X)$. See Fig. 1.3. Recall that each of the 64 coefficients $\beta_j$ was a priori zero with probability 1/2; the posterior given the data tells us that the number of non-zero coefficients is likely much smaller.

We are also interested in predicting the value of the response variable on the test dataset. A simple way to do this is to take $\hat{\beta}^{\mathrm{Bayes}} = E[\beta \mid y, X]$, the posterior expectation of β, and plug this point estimate into the test dataset.
In particular, let $X^{\mathrm{test}}$ be the 100 × 64 matrix giving the data for the 100 patients in the test dataset. Then we can predict the corresponding diabetes progression level by

\[
\hat{y}^{\mathrm{test}} := X^{\mathrm{test}} \hat{\beta}^{\mathrm{Bayes}}.
\]

By contrast, a standard non-Bayesian approach is to take the ordinary least squares (OLS) estimate:

\[
\hat{\beta}^{\mathrm{ols}} := \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^{\top} x_i)^2,
\]

which gives $\hat{\beta}^{\mathrm{ols}} = (X^{\top} X)^{-1} X^{\top} y$. With this point estimate in place, the predictive estimate for the response is given by $X^{\mathrm{test}} \hat{\beta}^{\mathrm{ols}}$.

Figure 1.3: Posterior probabilities that each coefficient is non-zero (Fig. 1.3 of PH).

Figure 1.4: Observed versus predicted response y (diabetes progression value) using the Bayes estimate (left) and the OLS estimate (right panel) (Fig. 1.4 of PH).

Fig. 1.4 gives a comparison between OLS and the Bayesian approach. The prediction errors are 0.67 for OLS and 0.45 for the Bayesian approach. We make some high-level remarks.

• The poor performance of OLS is due to the fact that the sample size is small relative to the large number of explanatory variables.
• To do well in such situations, one needs to constrain the parameter space (provided the constraint is "correct"). The Bayesian prior has this effect.
• Alternatively, modern regression methods can achieve this by introducing a penalty term. A well-known method is lasso regression, which proposes the following point estimate

\[
\hat{\beta}^{\mathrm{lasso}} := \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^{\top} x_i)^2 + \lambda \sum_{j=1}^{64} |\beta_j|,
\]

where λ > 0 is a tuning parameter that balances the squared error against the penalty term, which helps to push (many) $\beta_j$ to take small or zero values.
• It can be verified (you should!) that the lasso estimate corresponds exactly to the mode of the posterior distribution if we take the prior on each $\beta_j$ to be Laplace, a probability distribution that has a sharp peak at $\beta_j = 0$.
• Note that the above use of $\hat{\beta}^{\mathrm{Bayes}}$ is a convenient way to obtain a predictive estimate for the response variable Y using the Bayesian posterior.
But this is not the "true" Bayesian estimate. Recall that all that is unknown in Bayesian analysis is treated as random. Thus, the true Bayesian estimate for Y has to be obtained by integrating out β according to its posterior distribution. That is, we can compute

\[
P(Y^{\mathrm{test}} \mid X^{\mathrm{test}}, y, X) = \int P(Y^{\mathrm{test}} \mid X^{\mathrm{test}}, \beta)\, P(\beta \mid y, X)\, d\beta.
\]

This computation is a bit more involved, but the resulting estimate is expected to be more robust than the plug-in estimate with $\hat{\beta}^{\mathrm{Bayes}}$.

2 Interpretation of probabilities and Bayes' formulas

2.1 Interpretation of probabilities

When we say "if I toss a coin, the probability that the coin lands heads is 1/2", what we have in mind is the possibility of repeating the coin-tossing experiment, with the coin landing heads approximately half of the time. This is the frequentist interpretation of probabilities. This is also the interpretation we rely on when we think of the sampling model Pr(y = 1|θ = 1/2).

But what do we mean by saying, a day before the votes are cast, that a candidate wins the election with probability 66.67%? We cannot repeat the election multiple times for the same candidates. This probability quantifies the degree of our belief in a given statement. The higher the number, say 80% or 95%, the stronger the belief (which is subjective, since my 80% may be perceived differently from your 80%). We use this interpretation when we specify the prior Pr(θ) and draw inference from the posterior distribution Pr(θ|Data).

Both interpretations are present in Bayesian analysis, in the prior and the sampling model terms, and are linked via Bayes' formula. More remarkably, Bayes' formula enables us to invert the arguments in conditional probabilities, i.e., to relate Pr(A|B) with Pr(B|A), and so on. We can make sense of and quantify both kinds of statements, such as "if a person has a college degree, then their likely income level is..." versus "if a person has this income level, then they are likely to have received a college degree".
In logic, it is simple to distinguish the logical statements A ⇒ B and B ⇒ A. In probabilistic settings and real-life applications, it is not so obvious how to quantify the uncertainty of such statements.

2.2 Bayes' rule

Bayes' formulas are straightforward to grasp in the somewhat abstract language of probability spaces and Venn diagrams of subsets of events. Later, we apply Bayes' rule to random variables, as commonly done in practice. (Paradoxically, the application of Bayes' rule to random variables seems less intuitive in specific applied settings.)

Let H be the set of all possible truths, on which we place unit probability: Pr(H) = 1. Let $\{H_1, \ldots, H_K\}$ be a partition of H. The rule of total probability requires that

\[
\sum_{k=1}^{K} \Pr(H_k) = 1.
\]

Examples

• H is the set of truths about people's religious orientations. Partitions include {Christian, non-Christian}, but also {Protestant, Catholic, Jewish, other, none}, and so on.
• H is the set of truths about people's number of children.
• H is the set of truths about the relationship between smoking and hypertension in a given population. Partitions include {some relationship, no relationship}, or {negative correlation, zero correlation, positive correlation}, and so on.

An event E is a subset of H whose probability Pr(E) we may quantify. By the rule of marginal probability:

\[
\Pr(E) = \sum_{k=1}^{K} \Pr(E \cap H_k) = \sum_{k=1}^{K} \Pr(H_k) \Pr(E \mid H_k),
\]

where we have used the definition of conditional probability in the second equality. It follows that

\[
\Pr(H_j \mid E) = \frac{\Pr(H_j \cap E)}{\Pr(E)} = \frac{\Pr(E \mid H_j) \Pr(H_j)}{\sum_{k=1}^{K} \Pr(E \mid H_k) \Pr(H_k)}.
\]

This is an instance of the celebrated Bayes' formulas, which allow one to compute the "inverse probability" $\Pr(H_j \mid E)$ in terms of $\Pr(E \mid H_j)$ and other quantities. The other quantities here are the seemingly benign unconditional probability terms $\Pr(H_j)$.
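To see how much the unconditional terms matter, the formula above can be sketched in a few lines of Python. This is only an illustration: the function name `posterior_over_partition` is made up here, and the numbers are hypothetical, chosen so that the same evidence leads to opposite conclusions under two different priors.

```python
def posterior_over_partition(prior, likelihood):
    """Return [Pr(H_j | E)] given [Pr(H_j)] and [Pr(E | H_j)] over a partition."""
    if len(prior) != len(likelihood):
        raise ValueError("prior and likelihood must have the same length")
    if abs(sum(prior) - 1.0) > 1e-9:
        raise ValueError("prior probabilities over a partition must sum to 1")
    joint = [p * l for p, l in zip(prior, likelihood)]  # Pr(E and H_k)
    evidence = sum(joint)                               # Pr(E), rule of marginal probability
    return [j / evidence for j in joint]


# Hypothetical numbers, for illustration only (not taken from any survey).
likelihood = [0.9, 0.3]  # Pr(E | H_1), Pr(E | H_2)

post_even = posterior_over_partition([0.5, 0.5], likelihood)
post_skew = posterior_over_partition([0.1, 0.9], likelihood)

print(post_even)  # H_1 is favoured when both hypotheses are equally likely a priori
print(post_skew)  # H_2 is favoured when H_1 is a priori unlikely
```

The likelihoods are identical in both calls; only the unconditional probabilities differ, and that alone is enough to reverse which hypothesis ends up more probable.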
In reality it is often the presence of understated or hidden assumptions about these unconditional (prior) probabilities that leads people to draw drastically contradictory conclusions in the face of the same set of observed evidence. Bayes' formulas explain this phenomenon clearly.

Example 2.1. A subset of the 1996 General Social Survey includes data on the education level and income for a sample of males over 30 years of age. Let $\{H_1, H_2, H_3, H_4\}$ be the events that a randomly selected person in this sample is in the lowest, the second, the third and the top income quartile, respectively. By definition, the unconditional probabilities are

\[
\{\Pr(H_1), \Pr(H_2), \Pr(H_3), \Pr(H_4)\} = \{.25, .25, .25, .25\}.
\]

These probabilities add up to 1. Let E be the event that a randomly sampled person from the survey has a college education. From the survey data, we also have

\[
\{\Pr(E \mid H_1), \Pr(E \mid H_2), \Pr(E \mid H_3), \Pr(E \mid H_4)\} = \{.11, .19, .31, .54\}.
\]

These are also probabilities. They do not add up to one. Rather, they represent the proportions of college degree holders in each of the four subpopulations. Observe how the proportion increases with the income quartile. Now, applying Bayes' rule, we obtain

\[
\{\Pr(H_1 \mid E), \Pr(H_2 \mid E), \Pr(H_3 \mid E), \Pr(H_4 \mid E)\} = \{.09, .17, .27, .47\}.
\]

What we see here are the probabilities that someone is in each of the income brackets, given that the person is a college degree holder. These probabilities add up to one. Note how they share the same monotonicity as the numbers in the previous display. This is by design, because the unconditional probabilities $\Pr(H_i)$ are all equal. The monotonicity will not be preserved in general, and the result may be counterintuitive, if the subpopulations $\{H_i\}$ are partitioned in such a way that their probabilities $\{\Pr(H_1), \Pr(H_2), \Pr(H_3), \Pr(H_4)\}$ are suitably skewed. [Exercise: come up with an example!]

2.3 Bayesian hypothesis testing

In Bayesian inference, $\{H_1, \ldots, H_K\}$ often refer to disjoint hypotheses or states of nature, and E refers to the outcome of the survey, study or experiment. To compare the hypotheses post-experimentally, we may calculate the ratio

\[
\frac{\Pr(H_i \mid E)}{\Pr(H_j \mid E)} = \frac{\Pr(E \mid H_i)}{\Pr(E \mid H_j)} \times \frac{\Pr(H_i)}{\Pr(H_j)} = \text{"Bayes factor"} \times \text{"prior beliefs"}.
\]

This shows that Bayes' rule only tells us how our beliefs should change after seeing the data; the prior beliefs still play a very important role. The following example is apt given the most recent election:

H = all possible rates of support for candidate A
$H_1$ = more than half the voters support candidate A
$H_2$ = less than or equal to half the voters support candidate A
E = 54 out of 100 people surveyed said they support candidate A

In the face of the polling data E, what should we conclude about the chances of candidate A? The modeling of both $\{\Pr(H_i)\}$ and $\Pr(E \mid H_i)$, and the interplay among these quantities, combine to determine the inference.

2.4 Random variables and conditional independence

Bayesian inference is applied to random variables: the observed data y and the quantity of interest θ are both realizations of random variables. The domains of these random variables and their structural properties have to be taken into account in order to construct suitable probability models to which Bayes' formula can be applied.

2.4.1 Discrete domains

We say Y is discrete if its domain $\mathcal{Y}$ is countable, meaning that it can be expressed as $\mathcal{Y} = \{y_1, y_2, \ldots\}$. The event that the outcome Y takes a value y can be quantified by the probability

\[
\Pr(\{Y = y\}) := p(y),
\]

where the function p is called the probability density function (pdf) of Y; for a discrete Y it is also known as the probability mass function. It satisfies the properties that

1. $0 \le p(y) \le 1$ for all $y \in \mathcal{Y}$.
2. $\sum_{y \in \mathcal{Y}} p(y) = 1$.

An event of interest concerning the outcome Y takes the form Y ∈ A, for some subset $A \subset \mathcal{Y}$. We may quantify our belief about such an event via

\[
\Pr(Y \in A) = \sum_{y \in A} p(y).
\]

There are many examples of probability distributions on discrete domains.
They will form crucial building blocks we will need for the probability models we will construct. Here are a few examples; it is important to review them. 1. bernoulli(y|θ), where y ∈ {0, 1}, θ ∈ [0, 1]. The pdf takes the form p(y|θ) = θy (1 − θ)1−y . 2. binomial(y|θ, n), where y ∈ N and θ ∈ [0, 1]. n y p(y|θ) = θ (1 − θ)1−y . y 3. poisson(y|θ), where y ∈ N, θ ≥ 0. p(y|θ) = θy e−θ /y!. 4. categorical(y|θ), where y ∈ {1, . . . , K}, θ ∈ ∆K−1 := {(q1 , . . . , qK ) ∈ RK +, p(y|θ) = θy = K Y I(y=k) θk PK k=1 qk = 1}. . k=1 P K−1 . 5. multinomial(y|θ, n), where y = (y1 , . . . , yK ) ∈ NK such that K k=1 yk = n, θ ∈ ∆ Y K n p(y|θ, n) = θknk . y1 . . . yK k=1 We have used Y to illustrated random variables of discrete domains, but remember that in Bayesian inference, the quantity of interest θ is also random, for which we apply the prior distributions that are drawn from the same tool box as mentioned. 20 2.4.2 Continuous domains By this, we mean the domain of the variable is the real line or a subset of the real line. We have a rich tool box of modeling devices, including distributions by the name of Gauss, Laplace, Cauchy, Gamma, Beta, Dirichlet, and so on, and beyond. Many of these building blocks can be viewed as instance of distributions in the exponential families of distributions. We will return to this in the sequel. 21 2.4.3 Multivariate domains Most interesting and challenging scenarios deal with multiple variables and/or variables of multiple dimensions. How do we specify probability distributions in these cases? Let us start with bivariate distributions in a discrete domain. Consider discrete random variables Y1 and Y2 taking values in countable spaces Y1 , Y2 , respectively. We need to specify the joint probability density function (joint pdf): pY1 Y2 (y1 , y2 ) := Pr({Y1 = y1 } ∩ {Y2 = y2 }). 
(2)

If $Y_1$ and $Y_2$ are mutually independent, the joint pdf simplifies to the product form

$$p_{Y_1 Y_2}(y_1, y_2) = p_{Y_1}(y_1)\,p_{Y_2}(y_2),$$

where the two univariate pdfs for $Y_1$ and $Y_2$ may be specified using the basic building blocks mentioned earlier. In general, $Y_1$ and $Y_2$ are not independent; one needs to specify the joint pdf in Eq. (2), which defines the probability mass for each of the $|\mathcal{Y}_1| \times |\mathcal{Y}_2|$ pairs of numerical values of $(y_1, y_2)$.

Once the joint pdf is specified, the marginal distribution and conditional distribution can be computed from the joint density:

$$p_{Y_1}(y_1) := \sum_{y_2 \in \mathcal{Y}_2} p_{Y_1 Y_2}(y_1, y_2),$$

$$p_{Y_2|Y_1}(y_2|y_1) = \frac{\Pr(Y_1 = y_1, Y_2 = y_2)}{\Pr(Y_1 = y_1)} = \frac{p_{Y_1 Y_2}(y_1, y_2)}{p_{Y_1}(y_1)}.$$

From the above, we can alternatively specify the joint $p_{Y_1 Y_2}$ by first specifying a marginal distribution, say $p_{Y_1}$, and then the conditional pdf $p_{Y_2|Y_1}$, because

$$p_{Y_1 Y_2}(y_1, y_2) = p_{Y_1}(y_1)\,p_{Y_2|Y_1}(y_2|y_1) = p_{Y_2}(y_2)\,p_{Y_1|Y_2}(y_1|y_2).$$

When the context of the random variables is clear, we may drop the subscripts to write the above as

$$p(y_1, y_2) = p(y_1)\,p(y_2|y_1) = p(y_2)\,p(y_1|y_2).$$

Example 2.2. Let's start with the following example from PH (pg. 24) and then expand on it.

In this example, we saw how to derive the conditional probabilities $p_{Y_2|Y_1}$ and $p_{Y_1|Y_2}$ from the joint probabilities $p_{Y_1,Y_2}$. Likewise, we can specify the joint from the marginal $p_{Y_1}$ and the conditional $p_{Y_2|Y_1}$. In any case, we essentially need to specify $5 \times 5$ entries for the joint probability values $\Pr(Y_1 = y_1, Y_2 = y_2)$. Without further assumptions, we need $25 - 1 = 24$ parameters for the joint pdf: one for each probability value, minus one because the values sum to one.

Suppose now that we wish to extend the joint pdf to describe social mobility not for two but for three or more generations. Assume that the list of occupations still has 5 entries. With three generations (grandfathers, fathers, sons) we need to specify $5^3 = 125$ entries for the joint pdf. With four generations, we need $5^4 = 625$ entries. And so on.
This shows a fundamental challenge in working with multivariate domains. Without further assumptions, the number of parameters required is exponential in the number of variables. This would be unworkable. The main tool that statistical modelers exploit to overcome the complexity of multivariate domains is independence or, more appropriately, conditional independence, which encodes our domain knowledge about the variables of interest.

Example 2.3. Continuing from the previous example, let $Y_1, Y_2, Y_3$ denote the grandfather's, father's and son's occupations. By the chain rule, we may always write (see footnote 2)

$$p(y_1, y_2, y_3) = p(y_1)\,p(y_2|y_1)\,p(y_3|y_1, y_2).$$

We may help ourselves by making the following assumption: $Y_3$ is conditionally independent of $Y_1$ given $Y_2$. This means that the joint conditional density of $Y_1$ and $Y_3$ given $Y_2$ equals the product of the corresponding marginal conditional densities:

$$p(y_1, y_3|y_2) = p(y_1|y_2)\,p(y_3|y_2)$$

for any numerical values $(y_1, y_2, y_3)$. The reader should verify that under the above conditional independence:

$$p(y_1|y_2, y_3) = p(y_1|y_2), \qquad p(y_3|y_2, y_1) = p(y_3|y_2).$$

As a consequence, we may specify the joint pdf of $Y_1, Y_2, Y_3$ with a smaller number of parameters, by noting that (why?)

$$p(y_1, y_2, y_3) = p(y_1)\,p(y_2|y_1)\,p(y_3|y_2).$$

Question: how many parameters do we need to specify the joint pdf?

Another question: suppose that the conditional distribution of the occupation of the grandfather's generation given the father's is the same as the conditional distribution of that of the father's generation given the son's. How many parameters do we need now?

Footnote 2: Recall that we have dropped the subscripts to avoid clutter; in full, $p_{Y_1,Y_2,Y_3}(y_1, y_2, y_3) = p_{Y_1}(y_1)\,p_{Y_2|Y_1}(y_2|y_1)\,p_{Y_3|Y_1,Y_2}(y_3|y_1, y_2)$.
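The conditional independence claim of Example 2.3 can be checked numerically. The following is a minimal R sketch, not part of the original example: it builds a joint pdf on $K = 5$ categories from the factorization $p(y_1)p(y_2|y_1)p(y_3|y_2)$ using randomly generated placeholder tables (not the occupation data), verifies that $p(y_1, y_3|y_2) = p(y_1|y_2)p(y_3|y_2)$, and counts free parameters for the first question above. All variable names are illustrative.

```r
# Sketch: joint pdf of (Y1, Y2, Y3) on K = 5 categories built from
# p(y1) p(y2 | y1) p(y3 | y2). The tables are random placeholders, not data.
set.seed(1)
K <- 5
rand_pmf <- function(k) { w <- runif(k); w / sum(w) }

p1  <- rand_pmf(K)                        # p(y1)
p21 <- t(replicate(K, rand_pmf(K)))       # row y1 holds p(y2 | y1)
p32 <- t(replicate(K, rand_pmf(K)))       # row y2 holds p(y3 | y2)

joint <- array(0, dim = c(K, K, K))
for (i in 1:K) {
  for (j in 1:K) {
    for (k in 1:K) {
      joint[i, j, k] <- p1[i] * p21[i, j] * p32[j, k]
    }
  }
}

# For each value of y2, compare p(y1, y3 | y2) with p(y1 | y2) p(y3 | y2).
max_gap <- 0
for (j in 1:K) {
  cond13 <- joint[, j, ] / sum(joint[, j, ])          # p(y1, y3 | y2 = j)
  prod13 <- outer(rowSums(cond13), colSums(cond13))   # product of marginals
  max_gap <- max(max_gap, max(abs(cond13 - prod13)))
}
max_gap   # zero up to floating-point error

# Free parameters: unrestricted joint vs. the factorized model.
n_full   <- K^3 - 1
n_markov <- (K - 1) + 2 * K * (K - 1)
```

Each conditional table contributes $K(K-1)$ free parameters because every row sums to one, which is why the factorized count grows linearly rather than exponentially in the number of generations.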
2.5 Bayes' formulas and parameter estimation

As we described in Section 1, in order to initiate a Bayesian analysis, we need to specify the joint distribution of the quantity of interest $\theta$ and data $y$, by specifying the prior belief about $\theta$ via the prior distribution $p(\theta)$, and the sampling model $p(y|\theta)$.

In practice, $y$ represents the values of a collection of random variables or vectors, and $\theta$ is a random variable in a suitable domain. The principle of these specifications is the same as before, whether $y$ and $\theta$ are discrete or continuous valued, or a combination thereof. A large proportion of a Bayesian modeler's technical effort therefore goes into finding a suitable specification of the joint distribution $p(\theta, y)$ for the problem at hand.

Once this is done, having observed $\{Y = y\}$, we need to compute our updated beliefs about $\theta$ via Bayes' formula, which is now expressed in terms of density functions for random variables:

$$p(\theta|y) = p(\theta, y)/p(y) = p(\theta)\,p(y|\theta)/p(y). \qquad (3)$$

Another significant part of the Bayesian framework is computing the above posterior density function of $\theta$, expressed above as a ratio.

• The numerator is the product of the prior pdf, $p(\theta)$, and the quantity $p(y|\theta)$.
• As a function of $y$, we call $p(y|\theta)$ the pdf of the sampling model, where $\theta$ plays the role of the parameter.
• As a function of $\theta$, we call $p(y|\theta)$ the likelihood function, with the data $y$ being fixed. It is worth repeating that the likelihood function is not a density function in $\theta$. As the focus shifts toward inference about $\theta$, "likelihood function" will be invoked more often.

Although the numerator of the posterior density is often simple to compute because the prior component and the likelihood component are typically explicitly specified, the denominator is typically difficult to compute explicitly. It can be seen that

$$p(y) = \int p(\theta, y)\,d\theta = \int p(\theta)\,p(y|\theta)\,d\theta,$$

which involves integration (or summation) over the space of $\theta \in \Theta$. The integral typically does not admit an explicit form.
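When $\theta$ is one-dimensional, the integral for $p(y)$ can at least be approximated numerically. The R sketch below is an illustration under assumed values (a binomial sampling model with $y = 7$ successes in $n = 20$ trials and a beta(2, 2) prior, none of which come from the notes); this particular pair happens to have a closed-form marginal, which is used only to check the numerical answer.

```r
# Sketch: approximate p(y) = integral of p(theta) p(y | theta) d theta.
# Assumed for illustration only: y = 7, n = 20, and a beta(2, 2) prior.
y <- 7; n <- 20; a <- 2; b <- 2

# Numerator of Bayes' formula as a function of theta: prior times likelihood.
unnorm <- function(theta) dbeta(theta, a, b) * dbinom(y, n, theta)

# One-dimensional numerical integration over Theta = [0, 1].
p_y_numeric <- integrate(unnorm, lower = 0, upper = 1)$value

# Closed form for this conjugate pair (beta-binomial marginal), as a check.
p_y_exact <- choose(n, y) * beta(a + y, b + n - y) / beta(a, b)

# Normalized posterior density evaluated on a grid of theta values.
theta_grid <- seq(0.001, 0.999, length.out = 999)
post_grid  <- unnorm(theta_grid) / p_y_numeric
```

Numerical quadrature of this kind becomes impractical as the dimension of $\theta$ grows, which is one reason the normalizing constant is regarded as the hard part of Bayes' formula.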
One may be interested in the relative posterior density, obtained by comparing its value at different numerical values of interest. Let $\theta_a$ and $\theta_b$ be two such numerical values of $\theta$, and take

$$\frac{p(\theta_a|y)}{p(\theta_b|y)} = \frac{p(\theta_a)\,p(y|\theta_a)/p(y)}{p(\theta_b)\,p(y|\theta_b)/p(y)} = \frac{p(\theta_a)\,p(y|\theta_a)}{p(\theta_b)\,p(y|\theta_b)}.$$

In the above, the computation of the relative posterior density does not require the computation of $p(y)$, because $p(y)$ does not depend on the specific value of $\theta$. Accordingly, we often write

$$p(\theta|y) \propto p(\theta)\,p(y|\theta),$$

where $\propto$ reads "proportional to": the two sides agree up to a normalizing constant, chosen so that the left-hand side is a valid pdf for $\theta$. The normalizing constant is precisely $p(y)$ in this case. In English, we write

posterior ∝ prior × likelihood.

This captures succinctly and beautifully the spirit of Bayesian inference: the posterior belief about the quantity of interest is obtained from two sources of information, the prior belief and the empirical observations (via the likelihood). Moreover, these two sources are combined explicitly via a multiplicative operation. As a function of the quantity of interest, you may take this as an update to the prior belief via a reweighting operation, where the weights are provided by the likelihood function.

Finally, in practice, we are interested in various properties of the posterior density function $p(\theta|y)$, rather than the density function itself. This helps us express more precisely our belief about the true $\theta$, because of the Bayesian "doctrine" that we usually do not know the exact truth; we can only calculate our belief about such truth. We have seen in the example of Section 1.3 various quantities of interest, including the posterior mean and posterior variance, the posterior mode, posterior tail probabilities, and various quantiles and confidence regions with respect to the posterior distribution.

3 One-parameter models

A one-parameter model is a class of sampling distributions that is indexed by a single unknown parameter.
We will study Bayesian inference with several such models. Although simple, they will help to illustrate several key concepts in Bayesian data analysis, including conjugate priors, predictive distributions and confidence regions.

3.1 The binomial model

Example 3.1. (Happiness data) In a General Social Survey conducted in 1998, each woman aged 65 or over was asked whether or not she was generally happy. Let $Y_i = 1$ if respondent $i$ reported being generally happy, and 0 otherwise. The label $i$ is assigned arbitrarily before the data are collected; we do not assume any further information distinguishing these individuals.

As before, we use $p(y_1, \ldots, y_n)$ as shorthand for $\Pr(Y_1 = y_1, \ldots, Y_n = y_n)$ and so on. We shall use a binomial (Bernoulli) sampling model. Associated with this model is a parameter $\theta \in [0, 1]$ such that

$$Y_1, \ldots, Y_n \,|\, \theta \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\theta).$$

Accordingly,

$$p(y_1, \ldots, y_n|\theta) = \theta^{\sum_{i=1}^n y_i}(1-\theta)^{n - \sum_{i=1}^n y_i}.$$

It is reported that out of $n = 129$ respondents, 118 individuals report being generally happy (91%), and 11 individuals do not report being generally happy (9%).

Uniform prior

To continue with the Bayesian analysis, we need to give $\theta$ a prior distribution. Let us take the uniform prior, so that $p(\theta) = 1$ for all $\theta \in [0, 1]$. The uniform prior is considered a "vague" or "non-informative" prior, and is referred to as such in the literature. [Whether it is truly non-informative is a different matter!] Now, we are ready to apply Bayes' rule to obtain

$$p(\theta|y_1, \ldots, y_{129}) \propto p(\theta)\,p(y_1, \ldots, y_{129}|\theta) = \theta^{118}(1-\theta)^{11}.$$

In the above expression, we drop the normalizing constant, which is $p(y_1, \ldots, y_{129})$. To find the mode of the posterior distribution, we need to solve the optimization problem

$$\max_{\theta \in [0,1]} \log\{\theta^{118}(1-\theta)^{11}\}.$$

Taking the derivative with respect to $\theta$ and setting it to zero, we obtain the maximizer $\hat\theta = 118/129 = .91$, the fraction of respondents who report being generally happy.
The reader might think: so much math, only to get such an obvious answer? But what about other quantities relevant to the posterior distribution of $\theta$?

The normalizing constant of the posterior density is $p(y_1, \ldots, y_{129})$. While in general this quantity is difficult to calculate, for this specific example it has a closed form: the expression defining the posterior distribution should remind us of the beta distribution. A beta pdf is defined on $[0, 1]$ and takes the form

$$p(\theta|a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}. \qquad (4)$$

Here $a, b > 0$ are the parameters. Since the density function integrates to one, this implies that

$$\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

Exercise 3.1. Based on the above identity, prove the following: under the beta distribution beta$(a, b)$,

$$\text{mode}[\theta] = (a-1)/[(a-1)+(b-1)] \text{ if } a > 1,\ b > 1,$$
$$E[\theta] = a/(a+b),$$
$$\text{Var}[\theta] = ab/[(a+b+1)(a+b)^2].$$

Back to our example, we then have

$$p(y) = \int_0^1 \theta^{118}(1-\theta)^{11}\,d\theta = \frac{\Gamma(119)\Gamma(12)}{\Gamma(131)}.$$

In fact, the posterior distribution of $\theta$ is beta(119, 12).

Beta prior

The uniform distribution on $[0, 1]$ is an instance of the beta distribution with $a = b = 1$. Employing a general beta$(a, b)$ prior instead and applying Bayes' rule,

$$p(\theta|y_1, \ldots, y_n) \propto p(\theta)\,p(y_1, \ldots, y_n|\theta) \propto \theta^{a-1}(1-\theta)^{b-1} \times \theta^{\sum_{i=1}^n y_i}(1-\theta)^{n-\sum_{i=1}^n y_i} \propto \text{beta}\Big(\theta \,\Big|\, a + \sum_{i=1}^n y_i,\ b + n - \sum_{i=1}^n y_i\Big).$$

This is an instance of conjugacy: a beta prior, when combined with a binomial likelihood, yields a beta posterior distribution. Conjugacy is a property of a prior relative to a given likelihood: a prior is conjugate with respect to a likelihood if the resulting posterior distribution takes the same form as the prior. Conjugacy is a treasured property in Bayesian statistics because it simplifies posterior computation, a considerable bottleneck.
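As a concrete illustration of the conjugate update, here is a short R sketch for the happiness data under the uniform prior. It uses only the numbers given in Example 3.1 and the beta formulas of Exercise 3.1; the variable names are illustrative, and the numerical mode search is included merely as a sanity check on the closed form.

```r
# Happiness data from Example 3.1: n = 129 respondents, 118 report being happy.
n <- 129; y <- 118
a <- 1; b <- 1                      # uniform prior, i.e. beta(1, 1)

# Conjugate update: the posterior is beta(a + y, b + n - y).
a_post <- a + y
b_post <- b + n - y

# Posterior summaries from the beta formulas of Exercise 3.1.
post_mean <- a_post / (a_post + b_post)
post_mode <- (a_post - 1) / (a_post + b_post - 2)
post_var  <- a_post * b_post / ((a_post + b_post + 1) * (a_post + b_post)^2)

# Log of the normalizing constant p(y) = Gamma(119) Gamma(12) / Gamma(131),
# computed on the log scale because the constant itself is extremely small.
log_p_y <- lgamma(a_post) + lgamma(b_post) - lgamma(a_post + b_post)

# Sanity check: maximize the log posterior density numerically.
mode_numeric <- optimize(function(t) dbeta(t, a_post, b_post, log = TRUE),
                         interval = c(0, 1), maximum = TRUE)$maximum
```

Changing `a` and `b` to any other positive values gives the update for a general beta prior without touching the rest of the code, which is the practical appeal of conjugacy.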
Once we know the form of the posterior density, we only need to concern ourselves with the posterior distribution's parameters, which reflect the posterior update that combines both the prior information and the information gleaned from the data.

Example 3.2. In Example 3.1, the posterior distribution of $\theta$ receives the update from the data via the statistic $\sum_{i=1}^n Y_i$. This reflects the fact that $\sum_{i=1}^n Y_i$ is the sufficient statistic for $\theta$ under the Bernoulli sampling model. In our Bayesian framework, we may express this as

$$p(\theta|Y_1, \ldots, Y_n) = p\Big(\theta \,\Big|\, \sum_{i=1}^n Y_i\Big).$$

In other words, the information in the data contained in the observed $\{Y_1 = y_1, \ldots, Y_n = y_n\}$ is the same as the information contained in $Y = y$, where $Y = \sum_{i=1}^n Y_i$ and $y = \sum_{i=1}^n y_i$.

Alternatively, we may consider a sampling model in which the data are the count of people who report being "generally happy", as opposed to "not generally happy". The suitable sampling model is a binomial distribution. Applying the same computation as above, the reader should be able to derive that if we posit

prior: $\theta \sim \text{beta}(a, b)$
sampling: $Y \,|\, \theta \sim \text{binomial}(n, \theta)$,

then by Bayes' rule we obtain

posterior: $\theta \,|\, Y = y \sim \text{beta}(a + y, b + n - y)$.

This is also the calculation that we relied on in Example 1.1.

Prediction

After having obtained the data sample $\{y_1, \ldots, y_n\}$, we are also interested in the distribution of new observations. This is called the predictive distribution. Suppose that $\tilde{Y}$ is an additional outcome from the same population as the observed sample, via the sampling model

$$Y_1, \ldots, Y_n, \tilde{Y} \,|\, \theta \overset{\text{i.i.d.}}{\sim} p(\cdot|\theta).$$

Under the prior distribution $\theta \sim p(\theta)$, the predictive distribution of $\tilde{Y}$ given $\{Y_1 = y_1, \ldots, Y_n = y_n\}$ takes the form

$$\begin{aligned}
p(\tilde{Y} = \tilde{y}|y_1, \ldots, y_n) &= \int p(\tilde{y}, \theta|y_1, \ldots, y_n)\,d\theta \\
&= \int p(\tilde{y}|\theta, y_1, \ldots, y_n)\,p(\theta|y_1, \ldots, y_n)\,d\theta \\
&= \int p(\tilde{y}|\theta)\,p(\theta|y_1, \ldots, y_n)\,d\theta.
\end{aligned}$$

The last identity is due to the i.i.d. assumption in the sampling model.

Some remarks

• The predictive distribution depends on the observed data.
It does not depend on the unknown $\theta$.
• The unknown $\theta$ is integrated out in the formula via the posterior distribution. Thus the predictive distribution takes into account both the observed data and the prior distribution.
• Contrast this with a frequentist approach: one can obtain a point estimate $\hat\theta$ based on the observed data, and then plug it into the sampling model to produce a predictive distribution for a new observation:
$$p_{\text{plug-in}}(\tilde{Y} = \tilde{y}) := p(\tilde{y}|\hat\theta).$$
Because the Bayesian approach relies on a distribution over the unknown $\theta$ rather than a single numerical value of $\theta$, it allows for a broader range of predictive distributions than a plug-in approach.

Example 3.3. Continuing from Example 3.1 (binomial sampling and uniform prior), we use the uniform distribution as the prior for the happiness level $\theta$. The uniform distribution is beta$(a, b)$ with $a = b = 1$. The predictive probability of the answer "I'm generally happy" for the next respondent is

$$\begin{aligned}
\Pr(\tilde{Y} = 1|y_1, \ldots, y_n) &= \int p(\tilde{y} = 1|\theta)\,p(\theta|y_1, \ldots, y_n)\,d\theta \\
&= \int \theta\,p(\theta|y_1, \ldots, y_n)\,d\theta \\
&= \frac{a + \sum_{i=1}^n y_i}{a + b + n}.
\end{aligned}$$

Suppose that out of 20 people none reports being happy; then the probability that the next person reports being happy is $a/(a + b + 20) = 1/22$. Contrast this with the plug-in approach: under the uniform prior, the mode of $p(\theta|y_1, \ldots, y_n)$ is the same as the mode of the likelihood function $p(y_1, \ldots, y_n|\theta)$, which is equal to

$$\frac{a + \sum_{i=1}^n y_i - 1}{a + b + n - 2} = 0.$$

If we plug in $\hat\theta = 0$, then the predictive probability that the next person reports being happy is 0.

3.2 Confidence regions

It is of interest to identify regions of the parameter space that are likely to contain the true value of the unknown parameter. The following definition for a scalar parameter can be extended to multidimensional domains.

Definition 3.1 (Bayesian coverage). An interval $[l(y), u(y)]$, based on the observed data $Y = y$, has 95% Bayesian coverage for $\theta$ if

$$\Pr(l(y) < \theta < u(y)|Y = y) = .95.$$
Note: in the above probability expression, it is $\theta$ that is random, while $Y = y$ is fixed. Interpretation: having observed the data and calculated the conditional probability, the unknown $\theta$ is in the given interval with probability 95%.

The frequentist approach provides point estimates for the unknown $\theta$, not a distribution. To quantify the uncertainty of the estimate, there is a notion of confidence interval defined as follows.

Definition 3.2 (Frequentist coverage). A random interval $[l(Y), u(Y)]$ has 95% frequentist coverage for $\theta$ if, before the data are gathered,

$$\Pr(l(Y) < \theta < u(Y)|\theta) = .95.$$

Note: in the above probability expression, it is $Y$ that is random; $\theta$ is unknown but fixed. Once you observe $Y = y$, you cannot provide any guarantee for $[l(y), u(y)]$ regarding the unknown $\theta$. What frequentist coverage means is this: if we were to run a large number of unrelated (independent) experiments and create the interval $[l(y), u(y)]$ for each one of them, then we could expect 95% of the intervals to contain the correct parameter value.

Some remarks

• Both notions are useful.
• The frequentist coverage describes the pre-experiment coverage, i.e., it promises a guarantee if the experiments are to be repeated many times in the future.
• The Bayesian coverage describes the post-experiment coverage, i.e., it is applicable to the data at hand, under a prior specification.
• When the sample size gets large, the two notions usually lead to nearly the same interval.

Quantile-based interval

This is the easiest way to obtain an interval with Bayesian coverage $1 - \alpha$: take $l(y) := \theta_{\alpha/2}$ and $u(y) := \theta_{1-\alpha/2}$, the left and right thresholds for the $\alpha/2$ probability tails of the posterior distribution:

$$\Pr(\theta < \theta_{\alpha/2}|Y = y) = \Pr(\theta > \theta_{1-\alpha/2}|Y = y) = \alpha/2.$$

In R, these thresholds are given by the quantile function of the posterior distribution (qbeta for a beta posterior).

A potential problem with this interval is that some $\theta$-values outside the quantile-based interval may have higher probability (density) than some points inside the interval.
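Both the interval and the issue just described can be seen in a short R sketch for the beta(119, 12) posterior of Example 3.1. This is a minimal illustration with illustrative variable names, not a reconstruction of the original listing.

```r
# 95% quantile-based interval for the beta(119, 12) posterior of Example 3.1.
alpha  <- 0.05
a_post <- 119; b_post <- 12

# l(y) and u(y): the alpha/2 and 1 - alpha/2 posterior quantiles.
interval <- qbeta(c(alpha / 2, 1 - alpha / 2), a_post, b_post)

# Posterior probability of the interval; equals 1 - alpha by construction.
coverage <- diff(pbeta(interval, a_post, b_post))

# Posterior density at the two endpoints. For a skewed posterior these are
# generally unequal, so points just outside the higher-density endpoint have
# higher density than points just inside the other endpoint.
endpoint_density <- dbeta(interval, a_post, b_post)
```

Equal densities at both endpoints is exactly the condition that the highest posterior density region imposes, which is why the two constructions differ for asymmetric posteriors.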
In addition, for a multi-modal posterior distribution (one with multiple peaks), this choice of interval may not be very useful.

Figure 3.1: Quantile-based interval and highest posterior density regions.

An alternative is the so-called "highest posterior density (HPD)" region: it is the subset $s(y) \subset \Theta$ such that

(i) $\Pr(\theta \in s(y)|Y = y) = 1 - \alpha$.
(ii) If $\theta_a \in s(y)$ and $\theta_b \notin s(y)$, then $p(\theta_a|Y = y) > p(\theta_b|Y = y)$.

See Fig. 3.1 for an illustration. The HPD region is characterized by a threshold $c > 0$ on the posterior density: it consists of the points whose posterior density exceeds $c$. By sliding the threshold up and down we obtain different values of $\alpha$. When the posterior density is a multi-modal function, the HPD region may be composed of multiple disconnected subsets.

3.3 The Poisson model

The Poisson distribution is a probability distribution on the (unbounded) set of natural numbers $\{0, 1, 2, \ldots\}$. It is a useful modeling tool for count data. Consider the Poisson sampling model:

$$Y|\theta \sim \text{Poisson}(\theta).$$

That is, for $y = 0, 1, \ldots$,

$$\Pr(Y = y|\theta) = \theta^{y}e^{-\theta}/y!.$$

Poisson random variables have the interesting feature that both the mean and the variance are determined by the same parameter $\theta$; in fact, $E[Y|\theta] = \text{Var}[Y|\theta] = \theta$.

Given an i.i.d. sample of size $n$, $Y_1, \ldots, Y_n|\theta \overset{\text{iid}}{\sim} \text{Poisson}(\theta)$, we have

$$\begin{aligned}
\Pr(Y_1 = y_1, \ldots, Y_n = y_n|\theta) &= \prod_{i=1}^n p(y_i|\theta) \\
&= \prod_{i=1}^n \theta^{y_i}e^{-\theta}/y_i! \\
&=: c(y_1, \ldots, y_n)\,\theta^{\sum_i y_i}e^{-n\theta}.
\end{aligned}$$

From the above expression we find that $\sum_{i=1}^n Y_i$ is the sufficient statistic of the Poisson sampling model. Moreover, it can be verified that $\sum_{i=1}^n Y_i|\theta \sim \text{Poisson}(n\theta)$.

We proceed to give a prior distribution for $\theta \in \mathbb{R}_+$. By Bayes' rule, we know that a prior pdf $p(\theta)$ yields a posterior pdf of the form

$$p(\theta|y_1, \ldots, y_n) \propto p(\theta)\,p(y_1, \ldots, y_n|\theta) \propto p(\theta)\,\theta^{\sum_i y_i}e^{-n\theta}.$$

If we want a conjugate prior, then $p(\theta)$ must be of the form $\theta^{c_1}e^{-c_2\theta}$, up to a multiplicative constant. The pdf that has this form is given by the Gamma distribution.
Gamma distribution

Endow $\theta$ with the Gamma prior $\theta|a, b \sim \text{Gamma}(a, b)$, for some (hyper)parameters $a, b > 0$:

$$p(\theta) = \frac{b^a}{\Gamma(a)}\theta^{a-1}e^{-b\theta}.$$

Here $a$ is called the shape parameter and $b$ the rate parameter of the Gamma distribution. With this prior in place, the posterior pdf takes the form

$$p(\theta|y_1, \ldots, y_n) \propto \theta^{a + \sum_i y_i - 1}e^{-(b+n)\theta}.$$

The proportionality sign simplifies the expression by allowing us to keep only the terms that vary with $\theta$. This shows that the posterior pdf is that of another Gamma distribution. In other words, we have shown that the Gamma is a conjugate prior with respect to the Poisson sampling/likelihood model:

$$\theta|Y_1, \ldots, Y_n \sim \text{Gamma}\Big(a + \sum_{i=1}^n Y_i,\ b + n\Big).$$

Based on basic properties of the Gamma distribution, we find

$$\begin{aligned}
E[\theta|y_1, \ldots, y_n] &= \frac{a + \sum y_i}{b + n} = \frac{b}{b+n}\,(a/b) + \frac{n}{b+n}\,\frac{\sum y_i}{n}, \\
\text{Var}[\theta|y_1, \ldots, y_n] &= \frac{a + \sum y_i}{(b+n)^2}.
\end{aligned}$$

We find that the posterior mean is, again, a convex combination of the prior expectation $a/b$ and the sample average $\sum y_i/n$. Note the impact of increasing the sample size $n$: the weight on the sample average tends to one.

We proceed to the posterior predictive distribution. For $\tilde{y} = 0, 1, 2, \ldots$,

$$\begin{aligned}
p(\tilde{y}|y_1, \ldots, y_n) &= \int_0^\infty p(\tilde{y}|\theta, y_1, \ldots, y_n)\,p(\theta|y_1, \ldots, y_n)\,d\theta \\
&= \int p(\tilde{y}|\theta)\,p(\theta|y_1, \ldots, y_n)\,d\theta \\
&= \int \text{Poisson}(\tilde{y}|\theta)\,\text{Gamma}\Big(\theta \,\Big|\, a + \sum y_i,\ b + n\Big)\,d\theta \\
&= \int \frac{1}{\tilde{y}!}\theta^{\tilde{y}}e^{-\theta}\,\frac{(b+n)^{a+\sum y_i}}{\Gamma(a + \sum y_i)}\theta^{a+\sum y_i - 1}e^{-(b+n)\theta}\,d\theta \\
&= \frac{(b+n)^{a+\sum y_i}}{\Gamma(\tilde{y}+1)\Gamma(a+\sum y_i)}\int \theta^{a+\sum y_i+\tilde{y}-1}e^{-(b+n+1)\theta}\,d\theta.
\end{aligned}$$

We exploit the identity that follows from the definition of the Gamma density,

$$\int_0^\infty \theta^{a-1}e^{-b\theta}\,d\theta = \Gamma(a)/b^a,$$

to obtain

$$p(\tilde{y}|y_1, \ldots, y_n) = \frac{\Gamma(a + \sum y_i + \tilde{y})}{\Gamma(\tilde{y}+1)\Gamma(a+\sum y_i)}\Big(\frac{b+n}{b+n+1}\Big)^{a+\sum y_i}\Big(\frac{1}{b+n+1}\Big)^{\tilde{y}}.$$

This is a negative binomial distribution with parameters $(a + \sum y_i, b + n)$ (i.e., $\tilde{y}$ is the number of failures until $a + \sum y_i$ successes, with success probability $(b+n)/(b+n+1)$), for which

$$\begin{aligned}
E[\tilde{Y}|y_1, \ldots, y_n] &= \Big(a + \sum y_i\Big)\frac{1/(b+n+1)}{(b+n)/(b+n+1)} = \frac{a + \sum y_i}{b+n} = E[\theta|y_1, \ldots, y_n], \\
\text{Var}[\tilde{Y}|y_1, \ldots, y_n] &= E[\tilde{Y}|y_1, \ldots, y_n]\,\frac{b+n+1}{b+n} = \frac{a + \sum y_i}{b+n}\cdot\frac{b+n+1}{b+n} = \text{Var}[\theta|y_1, \ldots, y_n]\,(b+n+1).
\end{aligned}$$

Note how the posterior predictive mean of $\tilde{Y}$ is the same as the posterior mean of $\theta$. This is due to the Poisson sampling model, under which $E[\tilde{Y}|\theta] = \theta$. Note also that under the Poisson model, $\text{Var}[\tilde{Y}|\theta] = \theta$. The posterior predictive variance of $\tilde{Y}$ is quite a bit larger than the posterior variance of $\theta$: its variability comes from both the Poisson sampling model and the parameter $\theta$ itself. As $n$ gets large, the posterior of $\theta$ contracts considerably, so the variability of $\tilde{Y}$ stems primarily from the Poisson sampling model rather than from the parameter. (See footnote 3.)

Footnote 3: Instead of exploiting properties of the negative binomial distribution, we may appeal to the iterated expectation and iterated variance formulas to arrive at the above formulas for the posterior predictive mean and variance.

Figure 3.2: Birthrate data from the 1990s General Social Survey: number of children for the two groups of women.

3.4 Example: birth rates

We follow the example in PH (2009), Chapter 3. Fig. 3.2 illustrates the data collected on the number of children of 155 women who were 40 years of age at the time of the survey. The women are divided into two groups, those with college degrees and those without.

Let $\{Y_{i,1}\}_{i=1}^{n_1}$ denote the data from the first group (without college degrees), and $\{Y_{i,2}\}_{i=1}^{n_2}$ the data from the second group (with college degrees). To compare these two groups, we shall make use of the Poisson sampling model:

$$Y_{1,1}, \ldots, Y_{n_1,1}|\theta_1 \overset{\text{iid}}{\sim} \text{poisson}(\theta_1),$$
$$Y_{1,2}, \ldots, Y_{n_2,2}|\theta_2 \overset{\text{iid}}{\sim} \text{poisson}(\theta_2).$$

Some basic statistics:

• Less than bachelor's: $n_1 = 111$, $\sum Y_{i,1} = 217$, $\bar{Y}_1 = 1.95$
• Bachelor's or higher: $n_2 = 44$, $\sum Y_{i,2} = 66$, $\bar{Y}_2 = 1.50$.

Let us endow $\theta_1$ and $\theta_2$ with the same prior:

$$\theta_1, \theta_2 \overset{\text{iid}}{\sim} \text{gamma}(a = 2, b = 1).$$

Then we obtain the following posterior distributions:

$$\theta_1 \,|\, \{n_1 = 111, \textstyle\sum Y_{i,1} = 217\} \sim \text{gamma}(2 + 217, 1 + 111) = \text{gamma}(219, 112),$$
$$\theta_2 \,|\, \{n_2 = 44, \textstyle\sum Y_{i,2} = 66\} \sim \text{gamma}(2 + 66, 1 + 44) = \text{gamma}(68, 45).$$

In R, posterior summaries such as the mean, mode and quantiles of these Gamma distributions can be computed directly, for instance with qgamma.

The posterior distributions give substantial evidence that $\theta_1 > \theta_2$. For example, it can be computed that $\Pr(\theta_1 > \theta_2 \,|\, \sum Y_{i,1} = 217, \sum Y_{i,2} = 66) = .97$.
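The posterior summaries for the two groups can be sketched in R as follows. The prior parameters and sufficient statistics are those given above; the probability $\Pr(\theta_1 > \theta_2 \,|\, \text{data})$ is estimated here by simple simulation from the two posteriors (the Monte Carlo idea developed in Section 4), which is one possible way to compute it and not necessarily how the original listing proceeded. Variable names, the seed and the number of draws are arbitrary choices.

```r
set.seed(551)                        # arbitrary seed, for reproducibility only
a <- 2; b <- 1                       # common gamma(2, 1) prior
n1 <- 111; sy1 <- 217                # less than bachelor's
n2 <- 44;  sy2 <- 66                 # bachelor's or higher

# Posterior means from the Gamma(a + sum(y), b + n) posteriors.
post_mean1 <- (a + sy1) / (b + n1)
post_mean2 <- (a + sy2) / (b + n2)

# 95% quantile-based intervals; qgamma takes (p, shape, rate).
ci1 <- qgamma(c(0.025, 0.975), a + sy1, b + n1)
ci2 <- qgamma(c(0.025, 0.975), a + sy2, b + n2)

# Simulation estimate of Pr(theta1 > theta2 | data).
S <- 100000
theta1  <- rgamma(S, a + sy1, b + n1)
theta2  <- rgamma(S, a + sy2, b + n2)
prob_gt <- mean(theta1 > theta2)
```

Because the two groups are modeled independently, drawing `theta1` and `theta2` separately gives draws from their joint posterior.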
Figure 3.3: Posterior distributions of mean birth rates with the common prior given by the dashed line, and the posterior predictive distributions for number of children.

To what extent do we expect a woman without a bachelor's degree to have more children than a woman with one? See the right panel in Fig. 3.3. In R, the posterior predictive probabilities can be evaluated with dnbinom.

There is considerable overlap between the two posterior predictive distributions of $\tilde{Y}_1$ and $\tilde{Y}_2$. We can compute that

$$\Pr\Big(\tilde{Y}_1 > \tilde{Y}_2 \,\Big|\, \sum Y_{i,1} = 217, \sum Y_{i,2} = 66\Big) = .48,$$
$$\Pr\Big(\tilde{Y}_1 = \tilde{Y}_2 \,\Big|\, \sum Y_{i,1} = 217, \sum Y_{i,2} = 66\Big) = .22.$$

It is a reminder that the Poisson sampling model has substantial variance relative to the difference in means, so strong evidence of a difference between the two populations' means does not imply that individual observations are markedly different.

4 Monte Carlo approximation

Suppose that we are interested in quantities of interest for the posterior distribution, such as

(i) $\Pr(\theta \in A|y_1, \ldots, y_n)$ for some subset $A \subset \Theta$.
(ii) Posterior mean, variance, and confidence intervals for $\theta_1 - \theta_2$, $\theta_1/\theta_2$, $\max\{\theta_1, \ldots, \theta_m\}$.

Under conjugacy, some of these quantities may be available in closed form, but this is not always the case. When we deal with complex models where no conjugate prior is available, posterior computation becomes a major issue. This was in fact the main barrier for Bayesian statistics before the age of computers. Thankfully, with computational advances, this barrier can be overcome. One of the primary techniques for Bayesian computation is Markov chain Monte Carlo. In this section, we will explore the "Monte Carlo" part of the technique.

4.1 Basic ideas

Suppose we could draw some number $S$ of i.i.d. samples from the posterior distribution:

$$\theta^{(1)}, \ldots, \theta^{(S)} \overset{\text{iid}}{\sim} p(\theta|y_1, \ldots, y_n).$$

Then the posterior distribution can be approximated by the empirical distribution of the $S$-sample. Notationally:

$$p(\cdot|y_1, \ldots, y_n) \approx \frac{1}{S}\sum_{s=1}^S \delta_{\theta^{(s)}}(\cdot).$$
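To make the empirical approximation concrete, here is a minimal R sketch using the Gamma(68, 45) posterior of Section 3.4, for which exact answers are available for comparison. The threshold 1.75, the seed and the value of $S$ are arbitrary illustrative choices, as are the variable names.

```r
set.seed(1)                               # arbitrary seed
a <- 2; b <- 1; n <- 44; sy <- 66         # college-degree group, Section 3.4
S <- 10000                                # Monte Carlo sample size

# i.i.d. draws from the posterior Gamma(a + sy, b + n) = Gamma(68, 45).
theta_mc <- rgamma(S, a + sy, b + n)

# Sample-based summaries of the posterior.
mc_mean <- mean(theta_mc)                        # posterior mean
mc_prob <- mean(theta_mc < 1.75)                 # Pr(theta < 1.75 | y)
mc_q    <- quantile(theta_mc, c(0.025, 0.975))   # 95% quantile-based interval

# Exact counterparts, available here because the posterior is a known Gamma.
exact_mean <- (a + sy) / (b + n)
exact_prob <- pgamma(1.75, a + sy, b + n)
exact_q    <- qgamma(c(0.025, 0.975), a + sy, b + n)
```

In models without a closed-form posterior only the sample-based quantities would be available; the exact ones serve here purely as a check.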
The Monte Carlo technique is simply this: take any function $g(\theta)$ (that is integrable with respect to the posterior distribution). By the law of large numbers, as $S \to \infty$,

$$\frac{1}{S}\sum_{s=1}^S g(\theta^{(s)}) \to E[g(\theta)|y_1, \ldots, y_n] = \int g(\theta)\,p(\theta|y_1, \ldots, y_n)\,d\theta.$$

Taking different choices for the function $g$, we obtain

• $\bar\theta := \sum_{s=1}^S \theta^{(s)}/S \to E[\theta|y_1, \ldots, y_n]$.
• $\frac{1}{S-1}\sum_{s=1}^S (\theta^{(s)} - \bar\theta)^2 \to \text{Var}[\theta|y_1, \ldots, y_n]$.
• $\#(\theta^{(s)} \le c)/S = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \le c) \to \Pr(\theta \le c|y_1, \ldots, y_n)$.
• $\text{median}\{\theta^{(1)}, \ldots, \theta^{(S)}\} \to \theta_{1/2}$.
• the $\alpha$-percentile of $\{\theta^{(1)}, \ldots, \theta^{(S)}\}$ tends to $\theta_\alpha$.

Numerical evaluation

In the previous section, we used a Poisson sampling model, $Y_1, \ldots, Y_n|\theta \sim \text{Poisson}(\theta)$, and endowed the parameter $\theta$ with a gamma prior: $\theta \sim \text{Gamma}(a, b)$. We know that the posterior of $\theta$ is $\text{Gamma}(a + \sum y_i, b + n)$, which for the college-degree group yields the posterior mean $(a + \sum y_i)/(b + n) = 68/45 = 1.51$.

If we did not have this formula for the mean, we could appeal to Monte Carlo approximation in R. Random Gamma samples are drawn with rgamma; the mean, the probabilities of intervals of interest, and the relevant quantiles are then computed from the samples with mean and quantile.

Figure 4.1: Convergence of Monte Carlo estimates as MC sample size increases.

Fig. 4.1 provides an illustration of the effects of increasing the Monte Carlo sample size $S$. Note that the MC sample size $S$ has nothing to do with the sample size of the observed data set. $S$ represents the computational cost, which becomes cheaper as computers become more powerful. The standard way of choosing $S$ is to take it just large enough that the Monte Carlo standard error is less than the precision to which we want to report the quantity of interest.

Example 4.1. We want to compute the posterior expectation of $\theta$. The Monte Carlo estimate gives us $\bar\theta$. By the central limit theorem, the sample mean $\bar\theta$ is approximately normally distributed with expectation $E[\theta|y_1, \ldots, y_n]$ and standard deviation $\sqrt{\text{Var}[\theta|y_1, \ldots, y_n]/S}$.

So, letting $\hat\sigma^2 = \frac{1}{S-1}\sum_{s=1}^S (\theta^{(s)} - \bar\theta)^2$ be the MC estimate of the posterior variance $\sigma^2$, the MC standard error (of the MC estimate of the posterior mean of $\theta$) is $\sqrt{\hat\sigma^2/S}$. Thus, the approximate 95% MC confidence interval for the posterior mean is $\bar\theta \pm 2\sqrt{\hat\sigma^2/S}$.

For example, suppose one set $S = 100$ and found that the MC estimate of $\text{Var}[\theta|y_1, \ldots, y_n]$ was 0.024. Then the approximate MC standard error for the mean would be $\sqrt{0.024/100} = 0.015$. Suppose that you wanted the difference between the posterior mean $E[\theta|y_1, \ldots, y_n]$ and its MC estimate to be less than 0.01 with high probability (i.e., > 95% confidence). Then you would need to increase your sample size so that $2\sqrt{0.024/S} < 0.01$, i.e., $S > 960$.

4.2 Posterior inference for arbitrary functions

Recall the example of birthrates in Section 3.4. Based on the prior specifications and the birthrate data, the posterior distributions for the two educational groups are

$$\{\theta_1|y_{1,1}, \ldots, y_{n_1,1}\} \sim \text{Gamma}(219, 112) \quad \text{(women without bachelor's degrees)},$$
$$\{\theta_2|y_{1,2}, \ldots, y_{n_2,2}\} \sim \text{Gamma}(68, 45) \quad \text{(women with bachelor's degrees)}.$$

We are interested in $\Pr(\theta_1 > \theta_2|\text{Data from both groups})$, or in the posterior of the ratio $\theta_1/\theta_2$. Obtain Monte Carlo samples independently for the two data groups:

$$\text{sample } \theta_1^{(1)}, \ldots, \theta_1^{(S)} \overset{\text{iid}}{\sim} p(\theta_1|\text{Data from first group}),$$
$$\text{sample } \theta_2^{(1)}, \ldots, \theta_2^{(S)} \overset{\text{iid}}{\sim} p(\theta_2|\text{Data from second group}).$$

Accordingly, the pairs $(\theta_1^{(s)}, \theta_2^{(s)})$ for $s = 1, \ldots, S$ are i.i.d. Monte Carlo samples from the joint posterior. We can approximate

$$\Pr(\theta_1 > \theta_2|\text{Data from both groups}) \approx \frac{1}{S}\sum_{s=1}^S I(\theta_1^{(s)} > \theta_2^{(s)}).$$

In R, this amounts to drawing the two samples with rgamma and averaging the indicator of the event that the first draw exceeds the second.

4.3 Sampling from posterior predictive distributions

The parameter $\theta$ and the prior on $\theta$ represent the modeler's understanding of the data population. Different modelers may come up with different parameterizations and different prior specifications. How do we verify the validity of different models and compare among them? This is usually done through assessment of the predictive distribution.
We saw examples of predictive distributions in Section 3. In general, a predictive distribution is the (marginal) distribution of unobserved data $\tilde{Y}$, which is obtained by

• conditioning on all known quantities;
• integrating out all unknown quantities.

Before we have seen any data, all modeling assumptions result in the prior predictive distribution

$$p(\tilde{y}) = \int p(\tilde{y}|\theta)\,p(\theta)\,d\theta.$$

Having observed the data set $\{y_1, \ldots, y_n\}$, we obtain the posterior predictive distribution

$$\begin{aligned}
p(\tilde{y}|y_1, \ldots, y_n) &= \int p(\tilde{y}|\theta, y_1, \ldots, y_n)\,p(\theta|y_1, \ldots, y_n)\,d\theta \\
&= \int p(\tilde{y}|\theta)\,p(\theta|y_1, \ldots, y_n)\,d\theta.
\end{aligned}$$

Example 4.2. We continue with the birth rates model considered earlier. We assumed a Poisson sampling model, $Y|\theta \sim \text{Poisson}(\theta)$, for a data population (say, the group of women aged 40 with a college degree). We placed a Gamma prior on $\theta$: $\theta \sim \text{Gamma}(a, b)$. We found that the resulting prior predictive distribution of $\tilde{Y}$ is a negative binomial with parameters $(a, b)$.

Having observed a data sample of size $n$, we found that the posterior distribution of $\theta$ is $\text{Gamma}(a + \sum y_i, b + n)$, and the posterior predictive distribution of $\tilde{Y}$ is a negative binomial with parameters $(a + \sum y_i, b + n)$.

In this example, thanks to conjugacy we have a closed form for the predictive distribution, both a priori and a posteriori. In general, we probably won't be so "lucky": most realistic models do not admit a closed form for the posterior distribution. In order to evaluate the posterior predictive distribution, we may proceed by drawing samples from it instead.

The key is to observe that $p(\tilde{y}|y_1, \ldots, y_n)$ can be viewed as a mixture of the sampling distributions $p(\tilde{y}|\theta)$, where $\theta$ is randomly mixed according to the posterior distribution $p(\theta|y_1, \ldots, y_n)$. If we can draw samples from the posterior of $\theta$, we can use these samples to in turn draw samples from the sampling distribution, one for each sampled $\theta$. To be specific, for $s = 1, \ldots, S$ obtain independent Monte Carlo samples as follows:

• sample $\theta^{(s)} \sim p(\theta|y_1, \ldots, y_n)$, and then sample $\tilde{y}^{(s)} \sim p(\tilde{y}|\theta^{(s)})$.

Then we have obtained a valid i.i.d. $S$-sample $\tilde{y}^{(1)}, \ldots, \tilde{y}^{(S)}$ from the posterior predictive distribution.

Example 4.3. We continue with the birth rates example. Suppose we are interested in the predictive probability that an age-40 woman without a college degree would have more children than an age-40 woman with a college degree (using prior Gamma parameters $a = 2$, $b = 1$):

$$\Pr\Big(\tilde{Y}_1 > \tilde{Y}_2 \,\Big|\, \sum Y_{i,1} = 217, \sum Y_{i,2} = 66\Big) = \sum_{\tilde{y}_2=0}^\infty \sum_{\tilde{y}_1=\tilde{y}_2+1}^\infty \text{NegBinomial}(\tilde{y}_1, 219, 112) \times \text{NegBinomial}(\tilde{y}_2, 68, 45).$$

This can be easily evaluated via the MC technique. In R, rgamma followed by rpois produces the required samples.

We can also compute other quantities of interest based on these MC samples. We can also plot an estimate of the posterior predictive distribution for $\tilde{Y}_1 - \tilde{Y}_2$, as illustrated in Fig. ??.

Additional remark

We can use the same technique to draw samples from the prior predictive distribution; such samples can then be used for setting prior parameters. This technique is very useful if the prior distribution is not conjugate, and/or the prior predictive distribution is not easily accessible via closed-form expressions.

Figure 4.2: Evaluation of model fit. Left panel: the empirical and predictive distributions of the number of children of women without a bachelor's degree. Right panel: the posterior predictive distribution of the empirical odds of having two children versus one child in a data set of size $n_1 = 111$. The observed odds are indicated by the short vertical line.

4.4 Posterior predictive model checking

We again use the birthrates data example to illustrate the important issue of model checking via posterior predictive distributions. We used a Poisson sampling model endowed with a Gamma prior to describe the number of children of groups of age-40 women with or without college degrees. Consider the group of women without college degrees, for which we arrived at the posterior predictive distribution for $\tilde{Y}_1$ (which is a negative binomial).
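This predictive distribution can be sampled with the two-step procedure of Section 4.3 and compared against the closed-form negative binomial. The R sketch below is a minimal illustration under the prior and data summaries of Section 3.4; it also estimates the two predictive probabilities of Example 4.3. Variable names, the seed and the number of draws are arbitrary, and the dnbinom call uses R's parameterization (number of failures before `size` successes with success probability `prob`), matching the formula derived in Section 3.3.

```r
set.seed(2)                          # arbitrary seed
a <- 2; b <- 1
n1 <- 111; sy1 <- 217                # women without a bachelor's degree
n2 <- 44;  sy2 <- 66                 # women with a bachelor's degree
S <- 100000

# Step 1: draw theta from its posterior. Step 2: draw y-tilde given theta.
theta1 <- rgamma(S, a + sy1, b + n1)
y1_tilde <- rpois(S, theta1)
theta2 <- rgamma(S, a + sy2, b + n2)
y2_tilde <- rpois(S, theta2)

# Group 1: Monte Carlo vs. closed-form predictive probabilities of 1 and 2.
mc_p12    <- c(mean(y1_tilde == 1), mean(y1_tilde == 2))
exact_p12 <- dnbinom(1:2, size = a + sy1, prob = (b + n1) / (b + n1 + 1))

# Predictive comparison between the two groups (Example 4.3).
prob_more  <- mean(y1_tilde > y2_tilde)
prob_equal <- mean(y1_tilde == y2_tilde)
```

The vector `y1_tilde - y2_tilde` gives draws from the predictive distribution of the difference, so a histogram of it estimates that distribution as well.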
Let us compare that distribution with the empirical distribution. Note that both are computed from the same data sample $\{y_{1,1}, \dots, y_{n_1,1}\}$, where $n_1 = 111$. In the empirical distribution, shown in black, the number of women with exactly two children is 38, which is twice the number of women with one child. By contrast, this group's posterior predictive distribution, shown in gray, suggests that the probability of sampling a woman with two children is slightly less than that of sampling a woman with one (0.27 and 0.28, respectively). How do we make sense of this significant discrepancy?

There are two possible explanations.
• There is sampling variability and the sample size is probably too small, so the empirical distribution of the sampled data does not exactly match the distribution of the population. In fact, empirical distributions (like all histograms) usually look bumpy, so having a predictive distribution that smooths over the bumps may be desirable.
• An alternative explanation is that the Poisson model is quite wrong. This is plausible because there is no Poisson distribution with such a sharp peak at $y = 2$. Having said that, note that the posterior predictive distribution is in fact a mixture of Poissons that equals a negative binomial, so this explanation needs further evaluation.

We can evaluate the validity of the posterior predictive model via Monte Carlo simulation. We need a "marker" (a test statistic), and in this case we use the ratio of the number of $y = 2$'s to the number of $y = 1$'s in the data. For every vector $y$ of length $n_1 = 111$, let $t(y)$ denote this ratio. For our observed data sample $y^{\mathrm{obs}}$, we have $t(y^{\mathrm{obs}}) = 2$.

What sort of values of $t(\tilde Y)$ should one expect if $\tilde Y$ is drawn from the posterior predictive distribution? The Monte Carlo simulation procedure is as follows. For $s = 1, \dots, S$,
• sample $\theta^{(s)} \sim p(\theta \mid Y = y^{\mathrm{obs}})$;
• sample $\tilde Y^{(s)} = (\tilde y_1^{(s)}, \dots, \tilde y_{n_1}^{(s)}) \overset{\mathrm{iid}}{\sim} p(y \mid \theta^{(s)})$;
• compute $t^{(s)} = t(\tilde Y^{(s)})$.
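The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce Fig. 4.2: it assumes the Gamma(219, 112) posterior from Example 4.3 (prior $a = 2$, $b = 1$, $\sum y_i = 217$, $n_1 = 111$), and the helper names `sample_poisson`, `ratio_twos_to_ones` and `posterior_predictive_ratios` are hypothetical. Only the standard library is used, so the Poisson draws come from a simple hand-written sampler.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate by Knuth's multiplication method
    (adequate for the small rates that occur here)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def ratio_twos_to_ones(sample):
    """Test statistic t(y): number of 2's divided by number of 1's."""
    ones = sum(1 for y in sample if y == 1)
    twos = sum(1 for y in sample if y == 2)
    return twos / ones if ones > 0 else math.inf


def posterior_predictive_ratios(a_post, b_post, n, num_draws, seed=0):
    """Simulate t(Y~) for num_draws replicated data sets of size n."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(num_draws):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        theta = rng.gammavariate(a_post, 1.0 / b_post)
        replicate = [sample_poisson(theta, rng) for _ in range(n)]
        ratios.append(ratio_twos_to_ones(replicate))
    return ratios


# Assumed posterior parameters: a + sum(y) = 2 + 217, b + n = 1 + 111.
ratios = posterior_predictive_ratios(a_post=219, b_post=112, n=111, num_draws=2000)
t_obs = 2.0
tail_fraction = sum(1 for t in ratios if t >= t_obs) / len(ratios)
print(f"fraction of replicates with t >= t_obs: {tail_fraction:.4f}")
```

The quantity `tail_fraction` is the Monte Carlo estimate of how often a replicated data set is at least as extreme as the observed one; a very small value points to a poor fit of the sampling model in the aspect measured by $t$.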
The right panel of Fig. 4.2 shows the histogram of $t(\tilde Y)$ obtained from 10000 Monte Carlo samples (note: each MC sample here consists of an $n_1$-sample represented by $\tilde Y^{(s)}$). Observe that out of 10000 such data sets only about 0.5% had values of $t(y)$ that equaled or exceeded $t(y^{\mathrm{obs}})$. This indicates that our Poisson sampling model is flawed. If one is particularly interested in a more accurate model for $Y$, perhaps a more complex sampling model than the Poisson is warranted.

Certain aspects of the Poisson sampling model may still be useful in this example. For instance, if we are only interested in population parameters such as the mean and variance via $\theta$, then the Poisson is quite accurate in capturing the relationship between these quantities, as the empirical mean and empirical variance are found to be 1.95 and 1.90, respectively.

It is known in theory that even if a model is misspecified, some aspects of the population may still be accurately estimated with such a model. In practice, as George Box said, all models are wrong, but some are useful. Thus, while statistical modelers constantly search for better models, and we have a vast arsenal for doing so as you will see in later lectures, we do not readily discard simpler ones just for the sake of bigger and more complex models.

5 The normal model

5.1 The normal / Gaussian distribution

A random variable $Y$ is said to be normally distributed with mean $\theta$ and variance $\sigma^2 > 0$ if the density of $Y$ takes the form
\[ p(y \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y - \theta)^2}, \qquad -\infty < y < \infty. \]
Several important properties:
• the distribution is symmetric about $\theta$; the mode, median and mean are all equal to $\theta$;
• $\sigma^2$ represents the spread of the mass: about 95% of the population lies within $(\theta \pm 2\sigma)$;
• if $X \sim \mathrm{Normal}(\mu, \tau^2)$ and $Y \sim \mathrm{Normal}(\theta, \sigma^2)$, and $X$ and $Y$ are independent, then $aX + bY \sim \mathrm{Normal}(a\mu + b\theta, a^2\tau^2 + b^2\sigma^2)$.

The normal distribution is one of the most useful and widely used models in the statistical sciences.
Its importance stems primarily from the central limit theorem, which says that under very general conditions, the empirical average of a collection of random variables is approximately distributed according to the Gaussian (normal) distribution.

Example 5.1. The following figure shows a normal density function overlaid on the histogram of heights of $n = 1375$ women over age 18, collected in a study of 1100 English families from 1893 to 1898. One explanation for the variability in heights among these women is that the women were heterogeneous in terms of a number of factors controlling human growth, such as genetics, diet, disease, stress and so on. Variability in such factors results in variability in height. Thus, letting $y_i$ be the height in inches of woman $i$, a simple additive model for height might be
\[ y_i = a + b \times \mathrm{gene}_i + c \times \mathrm{diet}_i + d \times \mathrm{disease}_i + \dots \]
where $\mathrm{gene}_i$ might denote the presence of a particular height-promoting gene, $\mathrm{diet}_i$ might measure some aspect of woman $i$'s diet, and so on. Now, there may be a very large number of genes, dietary factors, and so on that contribute to a woman's height. If the effects of these factors are additive, then the height of a random woman may be modeled as a linear combination of a large number of random variables. The central limit theorem says that such a linear combination is approximately distributed according to a normal distribution.

5.2 Inference of the mean with variance fixed

Consider the sampling model $Y_1, \dots, Y_n \mid \theta, \sigma^2 \overset{\mathrm{iid}}{\sim} \mathrm{Normal}(\theta, \sigma^2)$. The joint sampling pdf is
\[
\begin{aligned}
p(y_1, \dots, y_n \mid \theta, \sigma^2) &= \prod_{i=1}^n p(y_i \mid \theta, \sigma^2) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \theta)^2} \\
&= (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \theta)^2\Big\} \\
&= (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{1}{2}\Big(\frac{1}{\sigma^2}\sum y_i^2 - \frac{2\theta}{\sigma^2}\sum y_i + \frac{n\theta^2}{\sigma^2}\Big)\Big\}.
\end{aligned}
\]
This expression shows that $\{\sum y_i^2, \sum y_i\}$ form a (two-dimensional) sufficient statistic for the normal model's parameters $\theta$ and $\sigma^2$. Equivalently, let $\bar y := \sum y_i / n$ and $s^2 := \sum (y_i - \bar y)^2 / (n - 1)$; then $(\bar y, s^2)$ is a sufficient statistic.
Suppose that $\sigma^2$ is fixed and known; the quantity of interest is $\theta$. It is easy to see that the maximum likelihood estimate for $\theta$ is $\hat\theta = \bar y$.

Let us proceed to specifying a conjugate prior for $\theta$. Given a (conditional) prior distribution $p(\theta \mid \sigma^2)$, the posterior pdf takes the form
\[ p(\theta \mid y_1, \dots, y_n, \sigma^2) \propto p(\theta \mid \sigma^2)\, p(y_1, \dots, y_n \mid \theta, \sigma^2) \propto p(\theta \mid \sigma^2)\, e^{-\frac{1}{2\sigma^2}\sum (\theta - y_i)^2}. \]
The simplest possible form for a conjugate prior for $\theta$ is $e^{c_1(\theta - c_2)^2}$. This suggests a normal prior:
\[ \text{Prior:}\quad \theta \sim \mathrm{Normal}(\mu_0, \tau_0^2). \]
Continuing with the Bayesian update:
\[ p(\theta \mid y_1, \dots, y_n, \sigma^2) \propto \exp\Big\{-\frac{1}{2\tau_0^2}(\theta - \mu_0)^2\Big\} \times \exp\Big\{-\frac{1}{2\sigma^2}\sum (\theta - y_i)^2\Big\} \propto \exp\Big\{-\frac{1}{2}(a\theta^2 - 2b\theta + c)\Big\}, \]
where it is easy to verify that
\[ a = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad b = \frac{\mu_0}{\tau_0^2} + \frac{\sum y_i}{\sigma^2}, \]
and $c$ is independent of $\theta$. Since the exponent of the posterior pdf is a quadratic form in $\theta$ with a negative coefficient on the leading (second-order) term, this must be the pdf of a normal distribution. Let us derive the corresponding mean and variance of the posterior:
\[ p(\theta \mid \sigma^2, y_1, \dots, y_n) \propto \exp\Big\{-\frac{1}{2}(a\theta^2 - 2b\theta)\Big\} \propto \exp\Big\{-\frac{1}{2}a(\theta - b/a)^2\Big\}, \]
which is the kernel of $\mathrm{Normal}(b/a, 1/a)$.

Combining information. Thus the posterior distribution of $\theta$ is indeed normal, with mean $\mu_n$ and variance $\tau_n^2$:
\[ \tau_n^2 = \frac{1}{a} = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \qquad \mu_n = \frac{b}{a} = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar y}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}. \tag{5} \]
Not only does the posterior remain Gaussian, but its parameters are obtained by combining information from the prior and the data in an intuitive way.
• Posterior variance: the inverse variance is often referred to as the precision. Let $\tilde\sigma^2 = 1/\sigma^2$ denote the sampling precision, $\tilde\tau_0^2 = 1/\tau_0^2$ the prior precision and $\tilde\tau_n^2 = 1/\tau_n^2$ the posterior precision. Then
\[ \tilde\tau_n^2 = \tilde\tau_0^2 + n\tilde\sigma^2, \]
so the precision (for the parameter of interest) adds up with more data.
• Posterior mean:
\[ \mu_n = \frac{\tilde\tau_0^2}{\tilde\tau_0^2 + n\tilde\sigma^2}\,\mu_0 + \frac{n\tilde\sigma^2}{\tilde\tau_0^2 + n\tilde\sigma^2}\,\bar y. \]
The posterior mean is a convex combination (i.e., weighted average) of the prior mean and the sample mean.
The weights are proportional to the corresponding precisions, of the prior and of the data respectively. The prior precision provides a shrinkage effect, pulling the estimate toward the prior mean. As the sample size $n$ increases, the information from the data takes over.

Predictive distribution. Consider predicting a new observation $\tilde Y$ from the population after having observed $(Y_1 = y_1, \dots, Y_n = y_n)$; that is, we wish to find $p(\tilde y \mid y_1, \dots, y_n)$. In general, to find the predictive distribution we need to integrate over the unknown $\theta$. For the normal model the situation is very easy (no explicit integration is needed), because a linear combination of independent normal random variables is again a normal random variable. In particular, for our model
\[ \tilde Y \mid \theta, \sigma^2 \sim \mathrm{Normal}(\theta, \sigma^2) \iff \tilde Y = \theta + \epsilon, \quad \text{where } \epsilon \mid \theta, \sigma^2 \sim \mathrm{Normal}(0, \sigma^2). \]
Since $\theta \mid y_1, \dots, y_n \sim \mathrm{Normal}(\mu_n, \tau_n^2)$ and $\epsilon$ is also normal and (conditionally) independent of $\theta$, we obtain
\[ \tilde Y \mid \sigma^2, y_1, \dots, y_n \sim \mathrm{Normal}(\mu_n, \tau_n^2 + \sigma^2). \]

Example 5.2. (Midge wing length) We are given a data set on the wing length in millimeters of nine members of a species of midge (small, two-winged flies). From these nine measurements we wish to make inference about the population mean $\theta$. From previous studies, the wing lengths are typically around 1.9 mm, so we set $\mu_0 = 1.9$. We also know that wing lengths are positive-valued, but since we are using a normal prior, we need to choose $\tau_0$ so that most of the prior mass is concentrated on positive values. Conservatively, we require $\mu_0 - 2\tau_0 > 0$, so $\tau_0 < 1.9/2 = 0.95$. The observations are
\[ \{1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08\}, \]
giving $\bar y = 1.804$. Taking $\tau_0 = 0.95$, so that $1/\tau_0^2 \approx 1.11$, the formulas above for posterior computation give
\[ \mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar y}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} = \frac{1.11 \times 1.9 + \frac{9}{\sigma^2} \times 1.804}{1.11 + \frac{9}{\sigma^2}}, \qquad \tau_n^2 = \frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} = \frac{1}{1.11 + \frac{9}{\sigma^2}}. \]
If we set $\sigma^2 := s^2 = 0.017$, then the posterior distribution is $\theta \mid y_1, \dots, y_n, \sigma^2 = 0.017 \sim \mathrm{Normal}(1.805, 0.002)$.
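The arithmetic of this example can be checked with a short Python sketch. It assumes the limiting choice $\tau_0 = 0.95$ and plugs in $\sigma^2 = s^2$ computed from the nine observations; the function name `normal_posterior_known_variance` is hypothetical, and the interval and predictive variance are simply direct applications of the formulas of this section.

```python
import math
from statistics import NormalDist, mean, variance

# Midge wing lengths (mm) from Example 5.2.
wing_lengths = [1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08]


def normal_posterior_known_variance(data, prior_mean, prior_var, sampling_var):
    """Posterior mean and variance of theta under Eq. (5):
    precisions add, and the mean is a precision-weighted average."""
    n = len(data)
    prior_precision = 1.0 / prior_var
    data_precision = n / sampling_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * mean(data))
    return post_mean, post_var


s2 = variance(wing_lengths)  # sample variance with the (n - 1) denominator
mu_n, tau2_n = normal_posterior_known_variance(
    wing_lengths, prior_mean=1.9, prior_var=0.95**2, sampling_var=s2
)

# Quantile-based 95% interval for theta under the Normal(mu_n, tau2_n) posterior.
posterior = NormalDist(mu_n, math.sqrt(tau2_n))
interval = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

# Variance of the predictive distribution Normal(mu_n, tau2_n + sigma^2).
predictive_var = tau2_n + s2

print(f"mu_n = {mu_n:.3f}, tau_n^2 = {tau2_n:.4f}")
print(f"95% interval for theta: ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"predictive variance: {predictive_var:.4f}")
```

Because the data precision $n/\sigma^2$ is several hundred times larger than the prior precision $1/\tau_0^2$, the posterior mean stays very close to $\bar y$.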
A 95% quantile-based confidence interval for $\theta$ according to this posterior distribution is $(1.72, 1.89)$. Of course, this result is based on the point estimate $\sigma^2 := s^2$, which is in fact only a rough estimate based on only nine observations. In the next section we will study techniques for properly handling an unknown variance.

Figure 5.1: Prior and posterior distributions for the population mean wing length.

5.3 Joint inference for the mean and variance

We need to specify a prior distribution on $\theta$ and $\sigma^2$. By Bayes' rule,
\[ p(\theta, \sigma^2 \mid y_1, \dots, y_n) \propto p(\theta, \sigma^2)\, p(y_1, \dots, y_n \mid \theta, \sigma^2) \propto p(\theta, \sigma^2)\, \frac{1}{\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum (\theta - y_i)^2}. \tag{6} \]
It is not immediately obvious how to come up with a conjugate prior jointly for $\theta$ and $\sigma^2$. In the previous section, $\sigma^2$ was assumed to be fixed; from there it is simple to see that a normal prior for $\theta$ yields a normal posterior, conditionally on $\sigma^2$. This suggests that we may wish to set
\[ \theta \mid \sigma^2 \sim \mathrm{Normal}(\mu_0, \tau_0^2), \]
for some suitable choice of $\mu_0, \tau_0$, which may depend on $\sigma^2$. In other words, under such a prior $\theta$ and $\sigma^2$ may be coupled (i.e., dependent). The question is how. Moreover, this still does not tell us how to place a suitable prior on $\sigma^2$, which we need in order to specify the joint prior distribution $p(\theta, \sigma^2) = p(\sigma^2)\, p(\theta \mid \sigma^2)$.

Fixed mean, varying variance. To get a sense of what form a conjugate prior for $\sigma^2$ may take, let us take a step back and assume that $\theta$ is fixed. Simplifying from (6),
\[ p(\sigma^2 \mid \theta, y_1, \dots, y_n) \propto p(\sigma^2)\, p(y_1, \dots, y_n \mid \theta, \sigma^2) \propto p(\sigma^2)\, \frac{1}{\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum (\theta - y_i)^2}. \tag{7} \]
It is more convenient to look at the posterior pdf in terms of the precision $\tilde\sigma^2 = 1/\sigma^2$. We then see that the simplest form for a conjugate prior for $\tilde\sigma^2$ is $(\tilde\sigma^2)^{c_1} e^{-c_2\tilde\sigma^2}$. This gives us a Gamma prior for the precision parameter. In particular, we set
\[ \tilde\sigma^2 \sim \mathrm{Gamma}(a, b). \]
This is equivalent to saying that $\sigma^2 \sim \mathrm{InvGamma}(a, b)$, and can be taken as a definition of the Inverse Gamma distribution.
Recall the Gamma pdf: $p(y \mid a, b) = \frac{b^a}{\Gamma(a)} y^{a-1} e^{-by}$, for $y > 0$. Let $z = 1/y$, so that $y = 1/z$ and $dy/dz = -1/z^2$. By the change of variable formula,
\[ p(z \mid a, b) = p(y(z) \mid a, b)\, |dy/dz| = \frac{b^a}{\Gamma(a)}\, y(z)^{a-1} e^{-b\,y(z)}\, (1/z^2) = \frac{b^a}{\Gamma(a)}\, z^{-a-1} e^{-b/z}, \]
which gives the pdf of $\mathrm{InvGamma}(a, b)$.

Now, combining the inverse-gamma prior for $\sigma^2$ with the normal likelihood, we find that
\[
\begin{aligned}
p(\sigma^2 \mid a, b, \theta, y_1, \dots, y_n) &\propto p(\sigma^2) \times \frac{1}{\sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum (\theta - y_i)^2} \\
&\propto (\sigma^2)^{-a-1} e^{-b/\sigma^2} \times \sigma^{-n} e^{-\frac{1}{2\sigma^2}\sum (\theta - y_i)^2} \\
&\propto (\sigma^2)^{-(a + n/2) - 1}\, e^{-\left(b + \frac{1}{2}\sum (\theta - y_i)^2\right)/\sigma^2} \\
&= \mathrm{InvGamma}\Big(a + \frac{n}{2},\; b + \frac{1}{2}\sum (\theta - y_i)^2\Big) \\
&=: \mathrm{InvGamma}(a_n, b_n).
\end{aligned}
\tag{8}
\]
We proceed to finding the predictive distribution. Note that this can be viewed as a mixture of Gaussians with the location fixed and the precision parameter varying according to the Gamma distribution $\mathrm{Gamma}(a_n, b_n)$. We also note that the representation in terms of the precision parameter is more convenient because it allows us to directly use the identity that arises from the Gamma pdf's normalizing constant. Thus, in what follows we may switch back and forth between the two representations, in terms of $\tilde\sigma^2$ and $\sigma^2$.
\[
\begin{aligned}
p(\tilde y \mid a, b, y_1, \dots, y_n) &= \int p(\tilde y \mid \theta, \tilde\sigma^2) \times p(\tilde\sigma^2 \mid a, b, \theta, y_1, \dots, y_n)\, d\tilde\sigma^2 \\
&= \int \Big(\frac{\tilde\sigma^2}{2\pi}\Big)^{1/2} \exp\Big\{-\frac{\tilde\sigma^2}{2}(\tilde y - \theta)^2\Big\} \times \frac{b_n^{a_n}}{\Gamma(a_n)} (\tilde\sigma^2)^{a_n - 1} e^{-b_n\tilde\sigma^2}\, d\tilde\sigma^2 \\
&= \frac{\Gamma(a_n + 1/2)}{\Gamma(a_n)}\, \frac{1}{(2\pi)^{1/2}}\, \frac{b_n^{a_n}}{\left(b_n + (\tilde y - \theta)^2/2\right)^{a_n + 1/2}} \\
&= \frac{\Gamma(a_n + 1/2)}{\Gamma(a_n)}\, \frac{1}{(2\pi b_n)^{1/2}}\, \frac{1}{\left(1 + (\tilde y - \theta)^2/(2 b_n)\right)^{a_n + 1/2}}.
\end{aligned}
\tag{9}
\]
We arrive at the well-known Student's t distribution, which has three parameters: location parameter $\theta$, squared scale parameter $b_n/a_n$, and $2a_n$ degrees of freedom. Provided $2a_n > 2$, the variance of the predictive distribution is
\[ \frac{2a_n}{2a_n - 2}\,(b_n/a_n) = \frac{b_n}{a_n - 1}. \]
It is interesting to note that the predictive distribution of the data is heavier tailed than the normal sampling model (polynomially decaying tails versus exponentially decaying tails), owing to the uncertainty about the variance/precision parameter that is integrated out.
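The mixture representation behind Eq. (9) lends itself to a quick numerical check: draw the precision from $\mathrm{Gamma}(a_n, b_n)$, draw $\tilde y$ from the normal with that precision, and compare the empirical variance with $b_n/(a_n - 1)$. The Python sketch below does this; the values of $\theta$, $a_n$, $b_n$ are purely illustrative assumptions (they are not taken from any data set in these notes), and `sample_predictive` is a hypothetical helper name.

```python
import random


def sample_predictive(theta, a_n, b_n, num_draws, seed=0):
    """Draw from the scale mixture of normals: precision ~ Gamma(a_n, rate b_n),
    then y ~ Normal(theta, variance = 1 / precision)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(num_draws):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        precision = rng.gammavariate(a_n, 1.0 / b_n)
        draws.append(rng.gauss(theta, (1.0 / precision) ** 0.5))
    return draws


# Illustrative (hypothetical) parameter values; 2 * a_n > 2 so the variance exists.
theta, a_n, b_n = 0.0, 5.0, 4.0
draws = sample_predictive(theta, a_n, b_n, num_draws=200_000)

mean_draw = sum(draws) / len(draws)
mc_variance = sum((d - mean_draw) ** 2 for d in draws) / len(draws)
closed_form = b_n / (a_n - 1.0)

print(f"Monte Carlo variance: {mc_variance:.3f}")
print(f"b_n / (a_n - 1):      {closed_form:.3f}")
```

The two numbers should agree up to Monte Carlo error, which is one way to convince oneself that integrating out the precision inflates the variance from $b_n/a_n$ to $b_n/(a_n - 1)$.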
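Both mean and variance parameter varying. Now we are ready to handle the case where both $\theta$ and $\sigma^2$ vary. As we have seen in the previous pages, it may be more convenient in our derivation to work with the precision parameter $\tilde\sigma^2$ instead. It is tempting to place independent prior distributions on $\theta$ and $\tilde\sigma^2$: say, a normal prior on $\theta$ and, independently, a Gamma prior on $\tilde\sigma^2$. The reader can verify without difficulty that this won't give us a conjugate prior, because the posterior for either $\theta$ or $\tilde\sigma^2$ will not be normal or Gamma, respectively. (What would be the form of the posteriors then?)

The issue is that conditionally given the observations $y_1, \dots, y_n$, the parameters $\theta$ and $\tilde\sigma^2$ are dependent even if they are independent a priori. So we need to construct a prior distribution according to which $\theta$ and $\tilde\sigma^2$ are dependent to begin with. Here is how: use the decomposition $p(\theta, \tilde\sigma^2) = p(\tilde\sigma^2)\, p(\theta \mid \tilde\sigma^2)$ and set the prior as
\[
\begin{aligned}
\tilde\sigma^2 &\sim \mathrm{Gamma}(a, b), \\
\theta \mid \tilde\sigma^2 &\sim \mathrm{Normal}(\mu_0, \kappa_0\tilde\sigma^2).
\end{aligned}
\]
Here, and whenever the second argument of $\mathrm{Normal}(\cdot, \cdot)$ carries a tilde, the normal distribution is parameterized by its precision rather than its variance. The key is in the second line, which couples $\theta$ and $\tilde\sigma^2$ via the conditional prior's precision (equivalently, its variance $\sigma^2/\kappa_0$). For ease of interpretation later, we set $a = \nu_0/2$, $b = \nu_0\sigma_0^2/2$ (which gives the prior expectation of $\tilde\sigma^2$ as $a/b = 1/\sigma_0^2 =: \tilde\sigma_0^2$). The sampling/likelihood model is the same as before:
\[ Y_1, \dots, Y_n \mid \theta, \tilde\sigma^2 \overset{\mathrm{iid}}{\sim} \mathrm{Normal}(\theta, \tilde\sigma^2). \]
Now we verify that the specified prior is indeed conjugate. Decompose the posterior distribution similarly:
\[ p(\theta, \tilde\sigma^2 \mid y_1, \dots, y_n) = p(\tilde\sigma^2 \mid y_1, \dots, y_n)\, p(\theta \mid \tilde\sigma^2, y_1, \dots, y_n). \]
From the previous section, we already have $\theta \mid y_1, \dots, y_n, \tilde\sigma^2 \sim \mathrm{Normal}(\mu_n, \tilde\tau_n^2)$, where
\[
\begin{aligned}
\tilde\tau_n^2 &= \kappa_0\tilde\sigma^2 + n\tilde\sigma^2 =: \kappa_n\tilde\sigma^2, \\
\mu_n &= \frac{\kappa_0\tilde\sigma^2\mu_0 + (n\tilde\sigma^2)\bar y}{\kappa_0\tilde\sigma^2 + n\tilde\sigma^2} = \frac{\kappa_0\mu_0 + n\bar y}{\kappa_n}.
\end{aligned}
\]
In short, the conditional posterior of $\theta$, namely $p(\theta \mid \tilde\sigma^2, y_1, \dots, y_n)$, has the same form as the conditional prior $p(\theta \mid \tilde\sigma^2)$.

Next, we check the marginal posterior of $\tilde\sigma^2$.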
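For this computation, we need to integrate out $\theta$ (unlike the previous detour, where $\theta$ is fixed and $\tilde\sigma^2$ varies).
\[
\begin{aligned}
p(\tilde\sigma^2 \mid y_1, \dots, y_n) &\propto p(\tilde\sigma^2)\, p(y_1, \dots, y_n \mid \tilde\sigma^2) \\
&\propto p(\tilde\sigma^2) \int p(y_1, \dots, y_n \mid \theta, \tilde\sigma^2)\, p(\theta \mid \tilde\sigma^2)\, d\theta \\
&\propto (\tilde\sigma^2)^{a-1} e^{-b\tilde\sigma^2} \int (\tilde\sigma^2)^{n/2} e^{-\frac{1}{2}\tilde\sigma^2 \sum (\theta - y_i)^2}\, (\kappa_0\tilde\sigma^2)^{1/2} e^{-\frac{1}{2}\kappa_0\tilde\sigma^2 (\theta - \mu_0)^2}\, d\theta \\
&\propto (\tilde\sigma^2)^{a + n/2 - 1} e^{-b\tilde\sigma^2} (\kappa_0\tilde\sigma^2)^{1/2} \int e^{-\frac{1}{2}\tilde\sigma^2\left[\sum (\theta - y_i)^2 + \kappa_0(\theta - \mu_0)^2\right]}\, d\theta.
\end{aligned}
\]
We recognize in the integrand the form of a Gaussian pdf in $\theta$, so the integral can be simplified by completing the square in $\theta$ and using the formula for the normalizing constant of the Gaussian pdf: $\int e^{-\frac{1}{2}\tilde\sigma^2 (y - \mu)^2}\, dy = \sqrt{2\pi/\tilde\sigma^2}$. Accordingly, the integral is precisely
\[
\begin{aligned}
&\sqrt{\frac{2\pi}{(\kappa_0 + n)\tilde\sigma^2}}\, \exp\Big\{-\frac{1}{2}\tilde\sigma^2\Big[-\frac{(\mu_0\kappa_0 + n\bar y)^2}{\kappa_0 + n} + \sum y_i^2 + \kappa_0\mu_0^2\Big]\Big\} \\
&\qquad = \sqrt{\frac{2\pi}{(\kappa_0 + n)\tilde\sigma^2}}\, \exp\Big\{-\frac{1}{2}\tilde\sigma^2\Big[\frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + \sum y_i^2 - n\bar y^2\Big]\Big\}.
\end{aligned}
\]
Plugging this back into the posterior of $\tilde\sigma^2$, using $\sum y_i^2 - n\bar y^2 = (n-1)s^2$ and keeping only the relevant terms,
\[
\begin{aligned}
p(\tilde\sigma^2 \mid y_1, \dots, y_n) &\propto (\tilde\sigma^2)^{a + n/2 - 1} e^{-b\tilde\sigma^2} \exp\Big\{-\frac{1}{2}\tilde\sigma^2\Big[\frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big]\Big\} \\
&= \mathrm{Gamma}\Big(a + \frac{1}{2}n,\; b + \frac{1}{2}\Big[\frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big]\Big) \\
&= \mathrm{Gamma}\Big(\nu_0/2 + n/2,\; (1/2)\Big[\nu_0\sigma_0^2 + \frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big]\Big) \\
&=: \mathrm{Gamma}(\nu_n/2, \nu_n\sigma_n^2/2),
\end{aligned}
\]
where the posterior distribution's parameters take the form
\[ \nu_n = \nu_0 + n, \qquad \sigma_n^2 = \frac{1}{\nu_n}\Big[\nu_0\sigma_0^2 + \frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big]. \]
How do we make sense of the contributions of the prior information and of the data in these expressions? The posterior mean of $\tilde\sigma^2$ is $1/\sigma_n^2$, while the posterior variance is of the order $1/(\nu_n\sigma_n^4)$. In the above formula for $\nu_n\sigma_n^2$, it is clear that $\nu_0\sigma_0^2$ represents the information from the prior for $\sigma^2$. The term $(n-1)s^2$ represents the variability of the observed data about the sample mean. Finally, the middle term $\frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n}$ represents the contribution to the variance parameter $\sigma^2$ due to the coupling between the location parameter $\theta$ and the precision $\tilde\sigma^2$ according to the conditional prior $\theta \mid \tilde\sigma^2 \sim \mathrm{Normal}(\mu_0, \kappa_0\tilde\sigma^2)$.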
According to this prior, $\theta$ is drawn from a mixture of normal distributions centered on $\mu_0$, with varying precision proportional to $\kappa_0$. This seems to be a relatively strong opinion for a prior specification: it entails the "biased" contribution of the middle term to the estimate of the variance $\sigma^2$, a contribution that increases with both $\kappa_0$ and the deviation of the sample mean from $\mu_0$.

One may harshly criticize this prior on the grounds that the implication just discussed is too strong. We do not necessarily defend it at all costs: after all, we arrived at this prior construction mainly from a mathematical/computational viewpoint, i.e., to obtain a conjugate prior. So there is a bias, and the incurred bias is the cost one has to pay for the mathematical/computational convenience. Whether it is worth it depends on the modeler and the data at hand. Note that when the sample size is large, the bias incurred by our prior construction will be washed away by the last term $(n-1)s^2$, which is purely driven by the data set.

Example 5.3. (Midge wing length, continued.) Our sampling model for midge wing lengths is $Y \mid \theta, \tilde\sigma^2 \sim \mathrm{Normal}(\theta, \tilde\sigma^2)$, and we place a joint prior on $(\theta, \tilde\sigma^2)$ via
\[
\begin{aligned}
\tilde\sigma^2 &\sim \mathrm{Gamma}(a = \nu_0/2,\; b = \nu_0\sigma_0^2/2), \\
\theta \mid \tilde\sigma^2 &\sim \mathrm{Normal}(\mu_0, \kappa_0\tilde\sigma^2).
\end{aligned}
\]
Previous studies suggest that the true mean and standard deviation should not be too far from 1.9 mm and 0.1 mm, respectively. So we may set $\mu_0 = 1.9$ and $\sigma_0^2 = 0.01$. The Gamma prior implies that the prior mean of the precision is $a/b = 1/\sigma_0^2 = \tilde\sigma_0^2 = 100$, and the prior variance of the precision is $a/b^2 = 2/(\nu_0\sigma_0^4)$. We set $\nu_0 = 1$ to allow for a reasonably large variance. As for $\kappa_0$, we also set $\kappa_0 = 1$. Since $\tilde\sigma^2$ is a priori distributed over a large range of values, this implies that we assume $\theta$ to be only weakly coupled to $\tilde\sigma^2$. From the sample, $\bar y = 1.804$ and $s^2 = 0.0169$.
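The update formulas for $(\mu_n, \kappa_n, \nu_n, \sigma_n^2)$ derived above can be evaluated directly for these data. The Python sketch below is one way to do so; the function names `normal_gamma_update` and `sample_joint_posterior` are hypothetical, the hyperparameters are the ones just chosen ($\mu_0 = 1.9$, $\kappa_0 = 1$, $\nu_0 = 1$, $\sigma_0^2 = 0.01$), and the sampling step simply follows the decomposition $p(\tilde\sigma^2 \mid y)\, p(\theta \mid \tilde\sigma^2, y)$.

```python
import random
from statistics import mean, variance

# Midge wing lengths (mm) from Example 5.2.
wing_lengths = [1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08]


def normal_gamma_update(data, mu_0, kappa_0, nu_0, sigma2_0):
    """Posterior hyperparameters of the conjugate prior
    precision ~ Gamma(nu_0/2, nu_0*sigma2_0/2), theta | precision ~ Normal(mu_0, precision kappa_0*precision)."""
    n = len(data)
    y_bar = mean(data)
    s2 = variance(data)  # (n - 1) denominator
    kappa_n = kappa_0 + n
    nu_n = nu_0 + n
    mu_n = (kappa_0 * mu_0 + n * y_bar) / kappa_n
    coupling = kappa_0 * n * (mu_0 - y_bar) ** 2 / kappa_n  # the "middle term"
    sigma2_n = (nu_0 * sigma2_0 + coupling + (n - 1) * s2) / nu_n
    return mu_n, kappa_n, nu_n, sigma2_n


def sample_joint_posterior(mu_n, kappa_n, nu_n, sigma2_n, num_draws, seed=0):
    """Draw (theta, precision) pairs: precision first, then theta given precision."""
    rng = random.Random(seed)
    draws = []
    for _ in range(num_draws):
        # Gamma(shape nu_n/2, rate nu_n*sigma2_n/2); gammavariate expects a scale.
        precision = rng.gammavariate(nu_n / 2.0, 2.0 / (nu_n * sigma2_n))
        theta = rng.gauss(mu_n, (1.0 / (kappa_n * precision)) ** 0.5)
        draws.append((theta, precision))
    return draws


mu_n, kappa_n, nu_n, sigma2_n = normal_gamma_update(
    wing_lengths, mu_0=1.9, kappa_0=1.0, nu_0=1.0, sigma2_0=0.01
)
draws = sample_joint_posterior(mu_n, kappa_n, nu_n, sigma2_n, num_draws=5000)
theta_mc_mean = mean(theta for theta, _ in draws)

print(f"mu_n = {mu_n:.3f}, kappa_n = {kappa_n:g}, nu_n = {nu_n:g}, sigma_n^2 = {sigma2_n:.4f}")
print(f"Monte Carlo mean of theta: {theta_mc_mean:.3f}")
```

Sampling the precision first and then $\theta$ given the precision yields independent draws from the joint posterior, so no Markov chain is needed in the conjugate case.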
Applying the posterior formulas derived earlier:
\[
\begin{aligned}
\mu_n &= \frac{\kappa_0\mu_0 + n\bar y}{\kappa_n} = \frac{1.9 + 9 \times 1.804}{1 + 9} = 1.814, \\
\sigma_n^2 &= \frac{1}{\nu_n}\Big[\nu_0\sigma_0^2 + \frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big] = \frac{0.010 + 0.008 + 0.135}{10} = 0.015.
\end{aligned}
\]
Compared to the point estimate presented earlier, the posterior mean for $\theta$ is comparable, but the uncertainty captured by $\sigma_n^2$ is considerably larger. But we can say much more. In particular, the joint posterior distribution is given by
\[
\begin{aligned}
\theta \mid y_1, \dots, y_n, \tilde\sigma^2 &\sim \mathrm{Normal}(\mu_n = 1.814,\; \tilde\tau_n^2 = \kappa_n\tilde\sigma^2 = 10\tilde\sigma^2), \\
\tilde\sigma^2 \mid y_1, \dots, y_n &\sim \mathrm{Gamma}(\nu_n/2 = 10/2,\; \nu_n\sigma_n^2/2 = 10 \times 0.015/2).
\end{aligned}
\]

Figure 5.2: Joint posterior distributions for $(\theta, \tilde\sigma^2)$ and $(\theta, \sigma^2)$.

These plots were obtained by computing the joint pdf at pairs of values of $(\theta, \tilde\sigma^2)$ and $(\theta, \sigma^2)$ on a grid. Note also that samples from this posterior can be easily obtained via Monte Carlo sampling. These plots tell us where most of the mass of the posterior for $(\theta, \sigma^2)$ is, and to some extent the relationship between the two parameters. When $\tilde\sigma^2$ is small ($\sigma^2$ is large) there is more uncertainty about $\theta$. Moreover, the contours are more peaked as a function of $\theta$ for low values of $\sigma^2$ than for high values.

Hyperparameters and improper priors. Hyperparameters are the parameters specified for the prior distributions. In our previous example, two of them are $\kappa_0$ and $\nu_0$. They may be regarded as prior sample sizes, because according to the Bayesian update
\[ \kappa_n = \kappa_0 + n, \qquad \nu_n = \nu_0 + n. \]
When $\kappa_0$ and $\nu_0$ are relatively small compared to $n$, the effects of these hyperparameters are negligible. Of interest is the case when $n$ is itself quite small. Is it still possible to have a prior specification whose impact is minimal relative to the impact of the data?

The smaller $\nu_0$ is, the "flatter" the marginal prior distribution for $\tilde\sigma^2$; the smaller $\kappa_0$ and $\nu_0$ are, the flatter the marginal prior distribution for $\theta$. (Recall our earlier computation in Eq. (9): a mixture of fixed-mean normal distributions with Gamma mixing on the precision is a Student's t distribution.) In other words, the priors can be viewed as "less discriminative", and hence "more objective". Let us perform the formal computation, by letting $\kappa_0, \nu_0 \to 0$:
\[
\begin{aligned}
\mu_n &= \frac{\kappa_0\mu_0 + n\bar y}{\kappa_n} \to \bar y, \\
\sigma_n^2 &= \frac{1}{\nu_n}\Big[\nu_0\sigma_0^2 + \frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big] \to \frac{n-1}{n}s^2 = \frac{1}{n}\sum (y_i - \bar y)^2.
\end{aligned}
\]
This leads to the following "posterior distribution", which is free of hyperparameters:
\[
\begin{aligned}
\tilde\sigma^2 \mid y_1, \dots, y_n &\sim \mathrm{Gamma}\Big(n/2,\; (n/2)\,\frac{1}{n}\sum (y_i - \bar y)^2\Big), \\
\theta \mid \tilde\sigma^2, y_1, \dots, y_n &\sim \mathrm{Normal}\Big(\bar y,\; \frac{1}{n}\sigma^2\Big).
\end{aligned}
\tag{10}
\]
There does not exist a valid prior distribution that yields the above "posterior distribution", which appears only as the limit of a sequence of posterior distributions arising from the sequence of prior distributions in which $\kappa_0, \nu_0 \to 0$. If one still wishes to employ such a posterior distribution, one needs the notion of an improper prior distribution.

Consider the function $\tilde p(\theta, \sigma^2) = 1/\sigma^2$. This is not a proper distribution because it is not integrable over $(\theta, \sigma^2)$. We will nevertheless treat it as an improper prior distribution and apply Bayes' rule to obtain
\[ p(\theta, \sigma^2 \mid y) \propto p(y \mid \theta, \sigma^2) \times \tilde p(\theta, \sigma^2). \]
Then we have a valid distribution over $(\theta, \sigma^2)$: in fact, it can be easily verified that the induced conditional distribution of $\theta$ given $\tilde\sigma^2$ is the same as that in (10), while the marginal for $\tilde\sigma^2$ is $\mathrm{Gamma}\big((n-1)/2,\; (1/2)\sum (y_i - \bar y)^2\big)$. In addition, integrating over $\tilde\sigma^2$, following a computation similar to Eq. (9), we find that
\[ \frac{\theta - \bar y}{s/\sqrt{n}} \,\Big|\, y_1, \dots, y_n \sim t_{n-1}. \tag{11} \]

Remark 5.1. Some remarks.
(i) The use of improper priors is not considered to be truly Bayesian, but it can be justified informally by the limiting argument presented above, and formally via a decision-theoretic framework. It is one area where one can find meeting points between the Bayesian and frequentist approaches.
(ii) It is interesting to compare with the sampling distribution of the t statistic, conditional on $\theta$ but unconditional on the data:
\[ \frac{\bar Y - \theta}{s/\sqrt{n}} \,\Big|\, \theta \sim t_{n-1}. \tag{12} \]
Eq. (12) is a statement about the data: it says that before we sample the data, our uncertainty about the scaled deviation of the sample mean $\bar Y$ from the population mean $\theta$ has a $t_{n-1}$ distribution. Eq. (11) says that after we sample the data, our uncertainty is still represented by a $t_{n-1}$ distribution, except that it is now our uncertainty about $\theta$ given the information provided by the data through $\bar y$.

5.4 Normal model for non-normal data

People apply normal models to non-normal data all the time. In this section, we have seen examples of modeling heights for a human population and modeling midge wing lengths. In both cases, the data are positive valued, whereas normal distributions are supported on the entire real line. However, the quantity of interest is the population mean, and the sample mean can be treated as approximately normally distributed according to the central limit theorem.

As another example, consider the number of children for a group of women over age 40, and consider estimating the mean number of children for this population, based on the samples $Y_1, \dots, Y_n$. In the previous section, we considered a Poisson sampling model, which is motivated by the fact that the $Y_i$ are integer-valued. Obviously it makes no sense to assume $Y_i \mid \theta, \sigma^2 \sim \mathrm{Normal}(\theta, \sigma^2)$. However, it is still reasonable to assume that the population mean $\theta$ is normally distributed (a priori). By the CLT, we know that
\[ p(\bar y \mid \theta, \sigma^2) \approx \mathrm{Normal}(\bar y \mid \theta, \sigma^2/n), \]
where $\sigma^2$ denotes the population variance, with the approximation becoming increasingly accurate as $n$ gets larger. If $\sigma^2$ is known, then we may place a normal prior on $\theta$ and obtain the posterior for $\theta$ via
\[ p(\theta \mid \bar y, \sigma^2) \propto p(\theta) \times p(\bar y \mid \theta, \sigma^2). \]
If $\sigma^2$ is unknown, we may bring in the point estimate $s^2$ and condition on it as well:
\[ p(\theta, \sigma^2 \mid \bar y, s^2) \propto p(\theta, \sigma^2) \times p(\bar y, s^2 \mid \theta, \sigma^2). \]
The likelihood term $p(\bar y, s^2 \mid \theta, \sigma^2)$ may be approximated by applying a normal sampling model $p(\bar y \mid \theta, \sigma^2)$ for $\bar y$ and a Gamma sampling model $p(s^2 \mid \bar y, \theta, \sigma^2)$ for $s^2$, conditionally on $\bar y$. Hence, when the sample size is reasonably large, the above approximate treatment is quite reasonable and can lead to good practical results.

When are normal models not appropriate?
• When the quantity of interest is not the population mean and/or variance but depends on the tail behavior of the population, while the population's distribution is clearly not normal (e.g., heavy-tailed or skewed distributions). For instance, we may be interested in the group of people with a large number of children.
• When the population is highly heterogeneous and we are interested in learning about such heterogeneity. For instance, the population's distribution may be multi-modal, and so it makes more sense to represent it as a mixture of sub-populations, each of which has its own parameters of interest. One is not interested in the population mean as much as in the parameters of each sub-population.
• Even when the normal model is not appropriate, normal distributions frequently serve as a useful building block: recall that heavier-tailed distributions such as the t-distribution can be viewed as a mixture of normals with varying variance parameters, while multi-modal distributions can be approximated by a mixture of normal distributions with varying mean parameters, or with both types of parameters varying.

6 Posterior approximation with the Gibbs sampler

6.1 Conjugate vs non-conjugate prior

In the previous section we considered a particular prior for the normal sampling model $\mathrm{Normal}(\theta, \sigma^2)$. This is a conjugate prior for the parameters $(\theta, \sigma^2)$ (or alternatively, $(\theta, \tilde\sigma^2 = 1/\sigma^2)$):
\[
\begin{aligned}
\tilde\sigma^2 &\sim \mathrm{Gamma}(\nu_0/2, \nu_0\sigma_0^2/2), \\
\theta \mid \tilde\sigma^2 &\sim \mathrm{Normal}(\mu_0, \kappa_0\tilde\sigma^2).
\end{aligned}
\]
We found that by applying the Bayes update, the posterior distribution $p(\theta, \tilde\sigma^2 \mid y_1, \dots, y_n)$ carries the same form:
\[
\begin{aligned}
\tilde\sigma^2 \mid y_1, \dots, y_n &\sim \mathrm{Gamma}(\nu_n/2, \nu_n\sigma_n^2/2), \\
\theta \mid y_1, \dots, y_n, \tilde\sigma^2 &\sim \mathrm{Normal}(\mu_n, \tilde\tau_n^2).
\end{aligned}
\]
The posterior distribution's parameters are updated as
\[
\begin{aligned}
\tilde\tau_n^2 &= \kappa_0\tilde\sigma^2 + n\tilde\sigma^2 =: \kappa_n\tilde\sigma^2, \\
\mu_n &= \frac{\kappa_0\tilde\sigma^2\mu_0 + (n\tilde\sigma^2)\bar y}{\kappa_0\tilde\sigma^2 + n\tilde\sigma^2} = \frac{\kappa_0\mu_0 + n\bar y}{\kappa_n},
\end{aligned}
\]
and
\[ \nu_n = \nu_0 + n, \qquad \sigma_n^2 = \frac{1}{\nu_n}\Big[\nu_0\sigma_0^2 + \frac{\kappa_0 n(\mu_0 - \bar y)^2}{\kappa_0 + n} + (n-1)s^2\Big]. \]
The price we have to pay for the computational convenience is the coupling between the two parameters $\theta$ and $\tilde\sigma^2$ imposed in the prior specification. Such coupling results in a prior bias: the higher the precision $\tilde\sigma^2$ (the lower the variance $\sigma^2$), the more certain we are about the parameter $\theta$.

In general, when dealing with multiple parameters, it is difficult to come up with a conjugate prior jointly for all parameters. And even if we can, the discussion from the previous section suggests that it is important to explore non-conjugate priors, because in some situations they may be more appropriate for our understanding of the parameter space.

In the case of the normal model above, we may want to express our uncertainty about $\theta$ as independent of $\tilde\sigma^2$. Such a prior specification is clearly less stringent than the one given above. Intuitively, such a prior would be less subjective. In particular, consider the following independent prior:
\[ \tilde\sigma^2 \sim \mathrm{Gamma}(\nu_0/2, \nu_0\sigma_0^2/2), \tag{13a} \]
\[ \theta \sim \mathrm{Normal}(\mu_0, \tau_0^2). \tag{13b} \]
The particular choices of Gamma and Normal come from our computations in subsection 5.2 and the beginning of subsection 5.3. This prior distribution is not conjugate, in the sense that the joint posterior distribution $p(\theta, \tilde\sigma^2 \mid y_1, \dots, y_n)$ does not carry the same form as the prior distribution $p(\theta, \tilde\sigma^2)$. Nevertheless, the full conditional distributions $p(\theta \mid \tilde\sigma^2, y_1, \dots, y_n)$ and $p(\tilde\sigma^2 \mid \theta, y_1, \dots, y_n)$ can be easily computed and in fact carry the same form as the corresponding marginal prior distributions. A full conditional distribution is the distribution of a parameter given everything else, including the data and all remaining parameters.
We call prior for which the full conditional distributions have the same form as the marginal prior ”semiconjugate”. 71 6.2 The Gibbs sampler Gibbs sampler is a sampling technique for multivariate distributions that exploits the fact that the full conditional distributions can be easily computed or sampled from. This crucial fact allows one to generate a dependent sequence of parameter samples that converge in distribution to the joint posterior distribution of interest. Continuing with our semiconjugate prior specification given in Eq. (13). From the previous section, we have obtained that (cf. Eq. (5)) θ|σ̃ 2 , y1 , . . . , yn ∼ Normal(µn , τn2 ), where τn2 = 1 τ02 1 + b n , µn = a = σ2 1 µ + σn2 ȳ τ02 0 . 1 + σn2 τ02 Note carefully that the updated parameters µn and τn are in fact dependent on the conditioned σ̃ 2 = 1/σ 2 . And from Eq. (8), σ̃ 2 |θ, y1 , . . . , yn ∼ Gamma(νn /2, νn σn2 /2), where νn = ν0 + n, σn2 = 1 (ν0 σ02 + ns2n ), νn P and s2n = (yi − θ)2 /n, the unbiased estimate of σ 2 if θ were known. Note carefully also that the updated parameter σn2 is dependent of the conditioned θ. These full conditionals tell us that • if we know σ̃ 2 , we can draw a sample for θ from p(θ|σ̃ 2 , y1 , . . . , yn ) • if we know θ, we can draw a sample for σ̃ 2 from p(σ̃ 2 |θ, y1 , . . . , yn ). 72 These full conditionals do not give us a direct way of drawing a sample from the joint posterior p(θ, σ̃ 2 |y1 , . . . , yn ), but they suggest an iterative procedure for drawing the joint samples φ := (θ, σ̃ 2 ). In each iteration, we take turn to draw a random sample for one parameter using the relevant full conditional distribution for that parameter given the latest values of all other parameters. This procedure is called the Gibbs sampler. More precisely for our present model, let φ(s) := (θ(s) , σ 2(s) ), where s is the index for the iterations. • Start with an arbitrary initial value φ(1) = (θ(1) , σ 2(1) ). • For s = 1, 2, . . . 
– sample θ(s+1) ∼ p(θ|σ̃ 2(s) , y1 , . . . , yn ); – sample σ̃ (2(s+1) ∼ p(σ̃ 2 |θ(s+1) , y1 , . . . , yn ); – let φ(s+1) = {θ(s+1) , σ̃ 2(s+1) }. What this algorithm does is that it generates a dependent sequence of parameter vector φ(1) , φ(2) , . . . , φ(s) , . . ., where the s + 1-parameter vector φ(s+1) is generated by the conditional distribution given the previous value φ(s) , namely p(φ(s+1) |φ(s) ). This sequence of random vector {φ(s) } is called a Markov chain. Under very weak conditions this Markov chain of random variables converge to a stationary distribution. Moreover, by our construction of the Gibbs sampler, that stationary distribution is the posterior distribution p(φ|y1 , . . . , yn ) — the joint posterior distribution of interest. Note carefully that we do not say that we have obtained a valid sample from the joint posterior p(θ, σ̃|y1 , . . . , yn ). What we said is that if we run the Markov chain (the Gibbs sampler) long enough, i.e., if s is large, then φ(s) can be viewed as a good approximation of the posterior sample. 73 A nice feature of Gibbs samplers is that they tend to be very easy to implement. In R codes: In this code, we have used the identity: ns2n = n X (yi − θ)2 = (n − 1)s̄2 + n(ȳ − θ)2 . i=1 P The RHS is fast to update with each iteration because (n − 1)s̄2 = (yi − ȳ)2 does not change, only (ȳ − θ)2 gets updated. Let us examine the performance of the Gibbs sampler using the midge data from the previous section and the independent semiconjugate prior (13). A Gibbs sanpler consisting of 1000 iterations were constructed. Fig. 6.1 plots the first 5, 15 and 100 simulated values of the sampler. 74 Figure 6.1: The first 5, 15 and 100 iterations of a Gibbs sampler. Once the Gibbs samples are collected we can find some empirical quantiles, which can be verified to be very close to a discrete approximation of the joint posterior distribution. (Hoff’s text book (Chapter 6, Sec. 
6.2) gives further details of this discrete approximation technique.)

Figure 6.2: The first panel shows 1,000 samples from the Gibbs sampler, plotted over the contours of a discrete approximation. The second and third panels give kernel density estimates of the distributions of the Gibbs samples of $\theta$ and $\tilde\sigma^2$. Vertical gray bars on the second plot indicate the 2.5% and 97.5% quantiles of the Gibbs samples of $\theta$, while nearly identical black vertical bars indicate the 95% confidence interval based on the t-test.

6.3 Markov chain Monte Carlo algorithms

6.3.1 Gibbs sampler

Suppose we have a vector of parameters $\phi = (\phi_1, \ldots, \phi_q)$, and our information about $\phi$ is measured with the probability distribution $p(\phi) = p(\phi_1, \ldots, \phi_q)$. In the example from the previous subsection, $\phi = (\theta, \tilde\sigma^2)$ and the probability distribution of interest is $p(\theta, \tilde\sigma^2 \mid y_1, \ldots, y_n)$, a posterior distribution given the observed sample of $n$ data points.

Remark 6.1. In Bayesian statistics, Gibbs sampling is typically applied to posterior distributions, hence the conditioning on the observed data. However, it is important to note that the Gibbs sampler is applicable to any joint probability distribution for a random vector $\phi$ of interest, regardless of whether there is an additional conditioning (as in Bayesian inference) or not.

The general recipe should be clear. Given a starting point $\phi^{(0)} = \{\phi_1^{(0)}, \ldots, \phi_q^{(0)}\}$, the Gibbs sampler generates $\phi^{(s)}$ from $\phi^{(s-1)}$ as follows:

1. sample $\phi_1^{(s)} \sim p(\phi_1 \mid \phi_2 = \phi_2^{(s-1)}, \ldots, \phi_q = \phi_q^{(s-1)})$
2. sample $\phi_2^{(s)} \sim p(\phi_2 \mid \phi_1 = \phi_1^{(s)}, \phi_3 = \phi_3^{(s-1)}, \ldots, \phi_q = \phi_q^{(s-1)})$
...
q. sample $\phi_q^{(s)} \sim p(\phi_q \mid \phi_1 = \phi_1^{(s)}, \phi_2 = \phi_2^{(s)}, \ldots, \phi_{q-1} = \phi_{q-1}^{(s)})$.

After $S$ iterations, this algorithm generates a dependent sequence of random vectors

$$\begin{aligned}
\phi^{(1)} &= \{\phi_1^{(1)}, \ldots, \phi_q^{(1)}\} \\
\phi^{(2)} &= \{\phi_1^{(2)}, \ldots, \phi_q^{(2)}\} \\
&\;\;\vdots \\
\phi^{(S)} &= \{\phi_1^{(S)}, \ldots, \phi_q^{(S)}\}.
\end{aligned}$$
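As a concrete instance of this recipe with $q = 2$, the two-parameter sampler of Section 6.2 can be sketched as follows. This is a minimal illustration in Python (standard library only) rather than the R used in the course; the function name `gibbs_normal`, the data `y` and the hyperparameter values are hypothetical and chosen only for illustration. The sketch works with the precision $\tilde\sigma^2$ and uses the identity $n s_n^2 = (n-1)\bar s^2 + n(\bar y - \theta)^2$ so that only one term changes per iteration.

```python
import math
import random


def gibbs_normal(y, mu0, tau0_sq, nu0, sigma0_sq, n_iter, seed=0):
    """Gibbs sampler for (theta, precision) in the normal model.

    Assumed prior: theta ~ Normal(mu0, tau0_sq) and
    precision ~ Gamma(nu0 / 2, rate = nu0 * sigma0_sq / 2), independently.
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)  # (n - 1) * sample variance, never changes
    prec = (n - 1) / ss                     # initial value: 1 / sample variance
    draws = []
    for _ in range(n_iter):
        # theta | precision, y ~ Normal(mu_n, tau_n^2)
        tau_n_sq = 1.0 / (1.0 / tau0_sq + n * prec)
        mu_n = tau_n_sq * (mu0 / tau0_sq + n * prec * ybar)
        theta = rng.gauss(mu_n, math.sqrt(tau_n_sq))
        # precision | theta, y ~ Gamma(nu_n / 2, rate = nu_n * sigma_n^2 / 2)
        nu_n = nu0 + n
        rate = 0.5 * (nu0 * sigma0_sq + ss + n * (ybar - theta) ** 2)
        # random.gammavariate is parameterized by shape and scale = 1 / rate
        prec = rng.gammavariate(nu_n / 2.0, 1.0 / rate)
        draws.append((theta, prec))
    return draws


# Hypothetical data and hyperparameters, for illustration only.
y = [1.6, 1.7, 1.7, 1.8, 1.8, 1.9, 2.1]
draws = gibbs_normal(y, mu0=1.9, tau0_sq=0.9, nu0=1.0, sigma0_sq=0.01, n_iter=1000)
print("approximate posterior mean of theta:", sum(t for t, _ in draws) / len(draws))
```

Each pass through the loop is one iteration $s \to s+1$: the draw of `theta` conditions on the latest precision, and the draw of the precision conditions on the `theta` just sampled.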
The sequence $\phi^{(1)}, \ldots, \phi^{(S)}$ forms what we call a Markov chain, because the random vector $\phi^{(s)}$ is conditionally independent of all the past instances $\phi^{(1)}, \ldots, \phi^{(s-2)}$, given $\phi^{(s-1)}$. (Markov property: the future is conditionally independent of the past, given the present.) We will define Markov chains shortly.

The main point is that under suitable conditions that are easily met, as $s \to \infty$, $\phi^{(s)}$ converges in distribution to the Markov chain's stationary distribution $p(\phi)$. We also refer to $p(\phi)$ as the target distribution of the Markov chain. In particular, for any measurable event $A$ of interest, we may write

$$\Pr(\phi^{(s)} \in A) \to \Pr(\phi \in A) \quad \text{as } s \to \infty.$$

In other words, if we run the chain long enough then $\phi^{(s)}$ can be used to approximate a sample from the joint distribution $p(\phi)$ of interest. More importantly, take any function $g(\phi)$ whose expectation under $p(\phi)$ is of interest; then the following law of large numbers holds quite generally, as $S \to \infty$:

$$\frac{1}{S}\sum_{s=1}^{S} g(\phi^{(s)}) \to E g(\phi) = \int g(\phi)\,p(\phi)\,d\phi. \tag{14}$$

In other words, we can apply the Monte Carlo approximation technique to the samples generated by the Markov chain to evaluate the expectation of interest. For this reason, we call all such approximations Markov chain Monte Carlo (MCMC) approximations, and the overall procedure an MCMC algorithm.

Remark 6.2.

• The good: While it is generally difficult to draw a sample directly from the joint distribution $p(\phi)$, it is relatively easy to construct a Markov chain that converges in the limit to the target $p(\phi)$.
• The advent of MCMC algorithms is the primary reason that Bayesian statistics moved into a central place in modern statistics, because they provide a generic mechanism for posterior computation in complex models. From a modeling standpoint, we can go beyond conjugate prior specifications; from a scalability standpoint, we can work with a very large number of variables and parameters.
• MCMC approximation techniques are quite remarkable because they exploit the strong law of large numbers for non-i.i.d. random variables: the samples generated by a Markov chain are clearly dependent.
• Hence the bad: there are infinitely many Markov chains for the same target distribution, and they are not all equally good.
  – Some may take a long time to get close to the target stationary distribution, i.e., they have a slow mixing time. In such a case, to produce even an approximately good sample from the target distribution, $S$ needs to be very large (and we generally do not know how large).
  – Moreover, some Markov chains may produce strongly correlated samples, so the Monte Carlo estimate may have very high variance. The empirical average then requires a considerably larger number $S$ of dependent samples than it would with independent Monte Carlo samples.

6.3.2 General Markov chain framework

Gibbs samplers are very easy to implement and can be applied to almost any complex statistical model. For this reason they are very popular. Their popularity is also their curse, as Gibbs sampling can be very inefficient for the reasons we have just mentioned. Therefore, it is important to gain intuition about Gibbs sampling by placing it within a more general framework of Markov chains, so we can get a feel for what a Gibbs sampler tries to achieve, when it "works" and when it may not, and what we can do when it does not work. In fact, there are many variants of the Gibbs sampler (we have introduced only one such variant). More importantly, there are many non-Gibbs Markov chain Monte Carlo techniques, including Metropolis-Hastings, Hamiltonian Monte Carlo, and so on. Bear with us through a bit of formalism in the next couple of pages. The payoff is worth it. [Footnote 4: I largely follow Charles Geyer (2005) for the rest of this subsection.]

Definition 6.1. A Markov chain is a discrete time stochastic process $\phi^{(1)}, \phi^{(2)}, \ldots$ taking values in an arbitrary state space $\mathcal{S}$, having the property that the conditional distribution of $\phi^{(s+1)}$ given the past $\phi^{(1)}, \ldots, \phi^{(s)}$ depends only on the present state $\phi^{(s)}$. $\phi^{(s)}$ is called the state variable at time $s$.

A Markov chain is defined by its transition probabilities. For a discrete state space $\mathcal{S}$, these are specified by a matrix $p$:

$$p(x, y) := \Pr(\phi^{(s+1)} = y \mid \phi^{(s)} = x), \quad x, y \in \mathcal{S},$$

which gives the probability of moving from any element $x \in \mathcal{S}$ at time $s$ to any element $y \in \mathcal{S}$ at time $s + 1$. The transition probability matrix $p(x, y)$ does not depend on the time $s$.

For a continuous state space $\mathcal{S}$, the proper way to think of the transition probabilities is via the notion of a kernel $P$, which can be represented by a regular conditional probability: for any measurable subset $A \subset \mathcal{S}$, the kernel $P$ is given as

$$P(x, A) := \Pr(\phi^{(s+1)} \in A \mid \phi^{(s)} = x).$$

The kernel $P(x, A)$ takes two arguments: $x$ is an element of the state space $\mathcal{S}$ and $A$ is a subset of $\mathcal{S}$. It gives the probability of moving from the element $x \in \mathcal{S}$ at time $s$ into the subset $A$ at time $s + 1$.

Note that the transition probabilities do not by themselves define the probability distribution of the Markov chain. To do so, we need to additionally specify the initial distribution of the chain, namely, the marginal distribution of $\phi^{(1)}$.

A key concept for Markov chains is the following.

Definition 6.2. A probability distribution $\pi$ is a stationary distribution or an invariant distribution for the Markov chain if it is "preserved" by the transition probability. That is, if the initial distribution is $\pi$, then the marginal distribution of $\phi^{(2)}$ is also $\pi$. Hence, so is the marginal distribution of $\phi^{(3)}$ and of all the rest of the chain.

For a discrete state space $\mathcal{S}$, $\pi$ is specified by a vector $\pi(x)$, and the stationary property is

$$\pi(y) = \sum_{x \in \mathcal{S}} \pi(x)\,p(x, y). \tag{15}$$

If we think of the transition probabilities as a matrix $P$ with entries $p(x, y)$, Eq. (15) can be written as $\pi = \pi P$, where the right-hand side is the multiplication of the matrix $P$ on the left by the row vector $\pi$.

For a continuous state space $\mathcal{S}$, the stationary property is

$$\pi(A) = \int_{\mathcal{S}} \pi(dx)\,P(x, A). \tag{16}$$

Eqs. (15) and (16) are the same except that a sum over a discrete state space has been replaced by an integral over a continuous state space.

In MCMC we often construct a Markov chain with a specified stationary distribution $\pi$ in mind, so there is never a question of whether a stationary distribution exists: it does so by construction. Moreover, it is unique under easily met conditions and, more importantly, it admits the law of large numbers described earlier in Eq. (14).

6.3.3 Variants of Gibbs samplers

With the general Markov chain framework in mind, we can see that the Gibbs sampler is a very simple construction of a Markov chain whose state variable is the vector $\phi = (\phi_1, \ldots, \phi_q)$ taking values in $\mathcal{S}$, for some $q \geq 2$. The Gibbs sampler is composed of elementary update steps, which we call Gibbs updates: an elementary Gibbs update changes only one component of the state vector, say $\phi_i$ for some $i = 1, \ldots, q$. This component is given a new value which is a sample from its "full conditional", that is, its conditional distribution given the rest, $\pi(\phi_i \mid \phi_{-i})$, where $\phi_{-i} := (\phi_1, \ldots, \phi_{i-1}, \phi_{i+1}, \ldots, \phi_q)$.

It is easy to verify that the elementary Gibbs update preserves the stationary distribution: if the current state $\phi$ is a realization from $\pi$, then $\phi_{-i}$ is distributed according to its marginal $\pi(\phi_{-i})$ derived from $\pi$, and the state after the update will have the distribution

$$\pi(\phi_i \mid \phi_{-i})\,\pi(\phi_{-i}),$$

which is $\pi(\phi)$ by the definition of conditional probability: joint equals conditional times marginal.

We can represent an elementary Gibbs update for component $i$ by a kernel denoted by $P_i$, for $i = 1, \ldots, q$. Moreover, the composition of an elementary Gibbs update, say $P_1$, followed by another elementary Gibbs update, say $P_2$, can be represented by the composite kernel $P_1 P_2$. It has a concrete meaning:

• For a discrete state space $\mathcal{S}$, $P_1 P_2$ represents the multiplication of two transition probability matrices.
The result is a matrix with entries

$$\sum_{y \in \mathcal{S}} p_1(x, y)\,p_2(y, z).$$

• For a continuous state space $\mathcal{S}$, we need to replace the sum by an integral:

$$(P_1 P_2)(x, A) = \int P_1(x, dy)\,P_2(y, A).$$

Composition of kernels

Now we can write the Gibbs sampler introduced in subsection 6.3.1 as the construction of a Markov chain using the kernel

$$P = P_1 P_2 \cdots P_q.$$

In words: this Markov chain is constructed by first updating $\phi_1$ via its full conditional, then $\phi_2$, and so on until $\phi_q$. The composition of the $q$ elementary Gibbs updates results in the kernel $P$, and an application of $P$ generates the next state $\phi^{(s+1)}$ of the Markov chain starting from $\phi^{(s)}$.

It is easy to verify that composing kernels in this way preserves the stationary distribution:

$$\pi(P_1 P_2 P_3) = ((\pi P_1) P_2) P_3 = (\pi P_2) P_3 = \pi P_3 = \pi,$$

and so on.

Mixing kernels

We can also create new Markov chains from the elementary Gibbs updates by mixing:

$$P = \frac{1}{q}\sum_{i=1}^{q} P_i.$$

In words: pick a coordinate $i$ to update with equal probability $1/q$, then update $\phi_i$ according to the kernel $P_i$.

There is no reason to stay with equal probabilities: take any weights $(\alpha_1, \ldots, \alpha_q)$ in the probability simplex $\Delta^{q-1}$. Pick coordinate $i$ to update with probability $\alpha_i$. If $i$ is chosen, then update $\phi_i$ according to the kernel $P_i$.

Combining composition and mixing

We can combine the composition and mixing tricks. The best known example of this is the so-called random sequence scan, which combines $q$ elementary update mechanisms by choosing a random permutation $(i_1, i_2, \ldots, i_q)$ of the integers $1, 2, \ldots, q$ and then applying the updates $P_{i_j}$, $j = 1, \ldots, q$, in that order. If $\mathcal{P}$ denotes the set of all $q!$ permutations, the kernel of this scan is

$$P = \frac{1}{q!}\sum_{(i_1, \ldots, i_q) \in \mathcal{P}} P_{i_1} \cdots P_{i_q}.$$

6.4 MCMC diagnostics

Now with so many Gibbs variants (and in the future non-Gibbs Markov chains) available to consider, how can we tell which one works, and which one works better?
Remember that all Gibbs samplers, and MCMC algorithms in general, work in theory if we are allowed to run the Markov chain forever. But we can never do that in practice. We may come up with one or several Markov chain constructions, run them for a while and evaluate. This requires techniques for assessing the effectiveness of MCMC algorithms. This section provides a brief introduction to MCMC diagnostics.

The goal of Monte Carlo or Markov chain Monte Carlo approximation is to obtain a sequence of parameter values $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ such that, for some function $g$ of interest and a target distribution $p(\phi)$,

$$\frac{1}{S}\sum_{s=1}^{S} g(\phi^{(s)}) \approx \int g(\phi)\,p(\phi)\,d\phi.$$

In order to obtain a good approximation, there are two main issues that we need to worry about:

(i) the empirical distribution of the simulated sequence $\{\phi^{(1)}, \ldots, \phi^{(S)}\}$ needs to approximate the target distribution $p(\phi)$ well;

(ii) the members of the simulated sequence need to be as weakly correlated as possible (zero correlation is best).

Standard Monte Carlo samples represent the "gold standard", if they can be obtained: by assumption, the Monte Carlo samples are independent and identically distributed according to the target $p(\phi)$. Thus, both criteria (i) and (ii) are perfectly achieved. Let $\bar\phi$ denote the empirical average of the Monte Carlo samples of $\phi$, assumed for the moment to be scalar. Then the variance of this Monte Carlo approximation is

$$\text{Var}_{MC}[\bar\phi] = \frac{1}{S}\text{Var}[\phi]. \tag{17}$$

For samples simulated by a Markov chain, the aforementioned issues are generally non-trivial to address. The Markov chain may take a long time to get close to the target stationary distribution, requiring $S$ to be large for (i) to be achieved. Moreover, there may be strong correlations among the simulated samples $\{\phi^{(s)}\}_{s=1}^{S}$, making (ii) difficult to achieve.

Example 6.1.
Consider the target distribution of the form

$$p(\theta) = \sum_{k=1}^{3} p_k \times \text{Normal}(\theta \mid \mu_k, \sigma_k^2),$$

where $p = (p_1, p_2, p_3) = (.45, .10, .45)$; $(\mu_1, \mu_2, \mu_3) = (-3, 0, 3)$; $(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (1/3, 1/3, 1/3)$. This is a mixture of three normal densities.

A useful technique is not to draw samples for $\theta$ directly, but to add an auxiliary random variable $Z$ such that the joint distribution of $(Z, \theta)$ induces a marginal distribution for $\theta$ equal to the target distribution $p(\theta)$. We then draw samples of the pair $(Z, \theta)$. The joint distribution of $(Z, \theta)$ is given as follows:

$$Z \sim \text{Categorical}(p), \qquad \theta \mid Z = k \sim \text{Normal}(\mu_k, \sigma_k^2). \tag{18}$$

Figure 6.3: A mixture of normal densities and a Monte Carlo approximation.

For a Gibbs sampler of $(Z, \theta)$, the full conditional for $\theta$ is already given by Eq. (18). The full conditional for $Z$ is given, via Bayes' rule, by

$$\Pr(Z = k \mid \theta) = \frac{p_k\,\text{Normal}(\theta \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{3} p_j\,\text{Normal}(\theta \mid \mu_j, \sigma_j^2)}. \tag{19}$$

Fig. 6.4 illustrates the histogram and traceplot of the first 1,000 Gibbs samples.

Figure 6.4: Histogram and traceplot of 1,000 Gibbs samples.

What do we see?

• The Gibbs sampler for the $\theta$-values starts in the region corresponding to the second mode (from the left) of the distribution, then ventures to the region corresponding to the first mode, and gets "stuck" there for quite a long time. It eventually manages to get out of the first mode, passes through the second, and moves to the region corresponding to the third mode. Nonetheless, it does not seem to spend "enough" time there before moving back to the second mode again.
• As shown by the first panel of Fig. 6.4, the empirical distribution of the Markov chain samples is not close to the stationary target distribution $p(\theta)$. The chain has not mixed after 1,000 iterations. If we run it considerably longer, for 10,000 iterations, the mixing is considerably improved. See Fig. 6.5.
• The "stickiness" of the Markov chain in the regions corresponding to the three modes, especially the first and third modes, suggests strong correlation among the simulated samples.
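The Gibbs sampler of Example 6.1 alternates between the two full conditionals in Eqs. (19) and (18). A minimal sketch in Python (standard library only) follows; the function names, the starting value and the seed are hypothetical choices for illustration, and the mixture parameters are those stated in the example.

```python
import math
import random

# Mixture parameters from Example 6.1.
WEIGHTS = [0.45, 0.10, 0.45]
MEANS = [-3.0, 0.0, 3.0]
VARIANCES = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]


def normal_pdf(x, mean, var):
    """Density of Normal(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gibbs_mixture(n_iter, theta_init=0.0, seed=0):
    """Alternate draws from Pr(Z | theta), Eq. (19), and p(theta | Z), Eq. (18)."""
    rng = random.Random(seed)
    theta = theta_init
    thetas, labels = [], []
    for _ in range(n_iter):
        # Full conditional of Z: weights proportional to p_k * Normal(theta | mu_k, sigma_k^2).
        # random.choices normalizes the weights internally.
        w = [p * normal_pdf(theta, m, v) for p, m, v in zip(WEIGHTS, MEANS, VARIANCES)]
        k = rng.choices(range(3), weights=w)[0]
        # Full conditional of theta given Z = k: Normal(mu_k, sigma_k^2).
        theta = rng.gauss(MEANS[k], math.sqrt(VARIANCES[k]))
        thetas.append(theta)
        labels.append(k)
    return thetas, labels


thetas, labels = gibbs_mixture(1000)
print("fraction of iterations spent in each component:",
      [labels.count(k) / len(labels) for k in range(3)])
```

Comparing the printed fractions with the target weights $(.45, .10, .45)$ for different run lengths gives a rough numerical view of the slow mixing described above: once $\theta$ is near one mode, Eq. (19) puts almost all of its mass on the same component, so the label rarely changes.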
Figure 6.5: Histogram and traceplot of 10,000 Gibbs samples.

How do we verify both issues, mixing and strong correlation of the Markov chain samples?

Verifying mixing is difficult in theory. This is an active area of research, where researchers work on upper and lower bounds of the mixing time. Unfortunately, for complex models, tight bounds for the mixing time are rarely available. In practice, a standard method is to run multiple Markov chains (starting at different positions) and compare the distributions of the variables of interest. This works well when the number of variables of interest is not too large. For high-dimensional state spaces, having a robust way to verify the mixing of a Markov chain remains a big challenge.

The reason we want to check the correlation of the Markov chain samples (the technical term is autocorrelation) is that this quantity affects the variance of the Monte Carlo estimate in a crucial way.

Assume that stationarity of the Markov chain has been achieved. Let $\phi_0$ be the expectation of a scalar $\phi$ under the stationary target distribution. The variance of the Monte Carlo estimate $\bar\phi := \frac{1}{S}\sum_s \phi^{(s)}$ can be computed as follows:

$$\begin{aligned}
\text{Var}_{MCMC}(\bar\phi) &:= E(\bar\phi - \phi_0)^2 \\
&= E\left[\left(\frac{1}{S}\sum_{s=1}^{S}(\phi^{(s)} - \phi_0)\right)^2\right] \\
&= \frac{1}{S^2}E\left[\sum_{s}(\phi^{(s)} - \phi_0)^2 + \sum_{s \neq t}(\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0)\right] \\
&= \text{Var}_{MC}(\bar\phi) + \frac{1}{S^2}\sum_{s \neq t}E\left[(\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0)\right].
\end{aligned}$$

Thus, the MCMC variance is equal to the MC variance plus a term that depends on the correlation of samples within the Markov chain. This term is usually positive, so the MCMC variance is usually higher than the MC variance.

To assess how much correlation there is in the chain, we compute the sample autocorrelation function: for a generic sequence of numbers $\{\phi_1, \ldots, \phi_S\}$, the lag-$t$ autocorrelation function estimates the correlation between elements of the sequence that are $t$ steps apart:

$$\text{acf}_t(\phi) = \frac{\frac{1}{S-t}\sum_{s=1}^{S-t}(\phi_s - \bar\phi)(\phi_{s+t} - \bar\phi)}{\frac{1}{S-1}\sum_{s=1}^{S}(\phi_s - \bar\phi)^2}. \tag{20}$$

In R, this quantity is computed by the function acf.
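For readers who want to see Eq. (20) spelled out, here is a minimal sketch in Python (standard library only). The function name `acf` mirrors the R function but is a hypothetical reimplementation that follows the normalization of Eq. (20) literally, so its values may differ slightly from those returned by R. The autoregressive chain used to exercise it is an assumption for illustration, not part of the notes.

```python
import random


def acf(chain, lag):
    """Lag-t sample autocorrelation of a sequence, following Eq. (20)."""
    n = len(chain)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(chain)")
    mean = sum(chain) / n
    # Numerator: average product of deviations that are `lag` steps apart.
    num = sum((chain[s] - mean) * (chain[s + lag] - mean) for s in range(n - lag)) / (n - lag)
    # Denominator: sample variance with the 1 / (S - 1) normalization.
    den = sum((x - mean) ** 2 for x in chain) / (n - 1)
    return num / den


# Hypothetical strongly dependent chain: x_{s+1} = 0.9 * x_s + standard normal noise.
rng = random.Random(0)
chain = [0.0]
for _ in range(4999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))
print("lag-1:", round(acf(chain, 1), 2), "lag-10:", round(acf(chain, 10), 2))
```

Applying the same function to the $\theta$-values of a Gibbs run gives the lag-$t$ autocorrelations used to judge how strongly dependent the simulated samples are.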
If the chain is close to stationarity, the lag-$t$ autocorrelation is almost always in $[-1, 1]$. A value close to 1 means strong positive correlation. A value close to zero means small correlation.

For the example in Fig. 6.5, for the sequence of 10,000 Gibbs samples of $\theta$-values, the lag-10 autocorrelation is 0.93, and the lag-50 autocorrelation is 0.812. This means that the Markov chain has very high correlation. Such a Markov chain explores the parameter space slowly, taking a long time to mix, and the empirical average also has a high variance.

A practically useful summary is the effective sample size of a Markov chain. Motivated by the Monte Carlo variance formula (see Eq. (17)), the MCMC effective sample size $S_{\text{eff}}$ is the value such that

$$\text{Var}_{MCMC}(\bar\phi) = \frac{\text{Var}[\phi]}{S_{\text{eff}}}. \tag{21}$$

In R, this quantity is estimated by the command effectiveSize (from the coda package). In the normal mixture example discussed above, the effective sample size of the 10,000 Gibbs samples of $\theta$ is 18.42, indicating that the precision of the MCMC approximation to $E[\theta]$ is as good as the precision that would have been obtained by using only about 18 i.i.d. samples of $\theta$.

This suggests two possible courses of action: either run the Gibbs sampler considerably longer, or design a better Markov chain.

7 Multivariate normal models

For most non-trivial applications we are interested in models with multi-dimensional parameters and multi-dimensional measurements. Such situations require models based on multivariate distributions for both parameters and data. The multivariate normal distributions represent one of the most useful and powerful tools for such modeling tasks.

7.1 Mean vector and covariance matrix

Let $X$ denote a random vector taking values in $\mathbb{R}^p$. We may write $X$ in terms of its components, $X = (X_1, \ldots, X_p)$. There are two equivalent ways to think of the random vector $X$. The first way is what we have been used to, that is, to think of a joint distribution over the $p$ random variables $X_1, \ldots, X_p \in \mathbb{R}$.
Given such a joint distribution, we can speak of quantities such as the expectation of each of the variables $X_1, \ldots, X_p$. We also consider the covariance between $X_i$ and $X_j$:

\begin{align}
\mu_i &:= E X_i, & i = 1, \ldots, p \tag{22a} \\
\sigma_{ii} &:= \text{Var}\,X_i := \sigma_i^2, & i = 1, \ldots, p \tag{22b} \\
\sigma_{ij} &:= \text{Cov}(X_i, X_j) := E(X_i - \mu_i)(X_j - \mu_j), & i, j = 1, \ldots, p. \tag{22c}
\end{align}

The second way is to view $X$ as a single random variable taking values in the $p$-dimensional space $\mathbb{R}^p$. This is useful in thinking geometrically about the behavior of random vectors in $\mathbb{R}^p$, and in the algebraic manipulation of distributions in spaces of dimension greater than one.

Suppose that $X$ is endowed with a probability density function $p(x)$ on $\mathbb{R}^p$. We can speak of its mean vector $\mu$ and covariance matrix $\Sigma$:

\begin{align}
\mu &:= E X := \int_{\mathbb{R}^p} x\,p(x)\,dx \tag{23a} \\
\Sigma &:= \text{Var}\,X := \text{Cov}\,X := E(X - \mu)(X - \mu)^\top := \int_{\mathbb{R}^p} (x - \mu)(x - \mu)^\top p(x)\,dx. \tag{23b}
\end{align}

In the above equations, $X$ and $x$ in $\mathbb{R}^p$ are treated as $p \times 1$ columns (matrices). The integrals operate component-wise. Sometimes we use Var and sometimes Cov in front of a random vector $X$; they mean the same thing.

By carrying out the basic linear algebra operations on matrices, it is easy to see that the $p$-dimensional mean vector $\mu$ and the $p \times p$ covariance matrix $\Sigma$ of Eqs. (23) are related to the quantities given in Eqs. (22) as follows:

$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_i \\ \vdots \\ \mu_p \end{pmatrix}; \qquad \Sigma = \begin{pmatrix} \sigma_{11} & \ldots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{i1} & \ldots & \sigma_{ip} \\ \vdots & \ddots & \vdots \\ \sigma_{p1} & \ldots & \sigma_{pp} \end{pmatrix}. \tag{24}$$

The diagonal entries of the covariance matrix $\Sigma$ are the variances of the components $X_i$, and the entry in position $(i, j)$ is the covariance between $X_i$ and $X_j$. It is simple to check that $\Sigma$ is a symmetric and positive semidefinite matrix.

7.2 The multivariate normal distribution

The multivariate Gaussian density function takes the following form: for $x \in \mathbb{R}^p$,

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right\}, \tag{25}$$

where there are two parameters: $\mu \in \mathbb{R}^p$, and $\Sigma$, a $p \times p$ symmetric and positive definite matrix ($\Sigma \succ 0$). $|\Sigma|$ denotes the determinant of the matrix $\Sigma$.
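To make Eq. (25) concrete, the following Python sketch (standard library only) evaluates the log-density through a Cholesky factorization $\Sigma = LL^\top$, which gives both the quadratic form and $|\Sigma|$ without forming $\Sigma^{-1}$ explicitly. The helper names `cholesky` and `mvn_logpdf` and the example numbers are hypothetical; this is one common way to organize the computation, not the only one.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma, for a symmetric positive definite sigma."""
    p = len(sigma)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sigma[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("sigma is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def mvn_logpdf(x, mu, sigma):
    """Logarithm of the multivariate normal density in Eq. (25)."""
    p = len(x)
    L = cholesky(sigma)
    # Solve L z = x - mu by forward substitution; then z^T z = (x - mu)^T sigma^{-1} (x - mu).
    d = [xi - mi for xi, mi in zip(x, mu)]
    z = []
    for i in range(p):
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(zi * zi for zi in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(p))  # log |sigma|
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det + quad)


# Hypothetical bivariate example.
mu = [0.0, 0.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
print("density at the mean:", math.exp(mvn_logpdf(mu, mu, sigma)))
```

Working on the log scale is a design choice that avoids underflow when $p$ is large or $x$ is far from $\mu$; exponentiate only when the density itself is needed.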
We write $X \sim N(\mu, \Sigma)$, or $X \sim \text{Normal}(\mu, \Sigma)$, or $X \sim N_p(\mu, \Sigma)$ interchangeably to denote the $p$-variate Gaussian random vector $X$.

Here are several basic facts.

1. The function given in the above display is a valid density function on $\mathbb{R}^p$. That is, it satisfies

$$\int_{\mathbb{R}^p} p(x \mid \mu, \Sigma)\,dx = 1.$$

2. Given the density defined above, it can be verified that $\mu = E(X)$ and $\Sigma = E(X - \mu)(X - \mu)^\top$. So the parameters $\mu$ and $\Sigma$ indeed play the respective roles of the mean and the covariance of the Gaussian distribution.

3. What is the "shape" of the Gaussian density in multi-dimensional spaces? One way to visualize it is to look at its contours. A contour of the Gaussian density is a collection of points of equal density value. So the Gaussian's contours are solutions to the quadratic equation

$$(x - \mu)^\top \Sigma^{-1}(x - \mu) = c$$

for each positive constant $c$. These are ellipsoids whose axes are oriented along the eigenvectors of $\Sigma$.

More Basic Facts

1. If $X \sim N_p(\mu, \Sigma)$ where $\mu \in \mathbb{R}^p$ and $\Sigma \in \mathbb{R}^{p \times p}$, then for any $m \times p$ matrix $A$ and $m \times 1$ column vector $b$, the linear transformation $Y = AX + b$ is also Gaussian, i.e., $Y \sim N_m(A\mu + b, A\Sigma A^\top)$.

2. If $X = (X_1, \ldots, X_p) \sim N(0, I_p)$, where $I_p$ denotes the $p \times p$ identity matrix, then $X_1, \ldots, X_p$ are independent (why?). Moreover, $AX \sim N(0, AA^\top)$.

3. If $X \sim N(\mu, \Sigma)$, let $A = \Sigma^{-1/2}$, the square root of the inverse covariance matrix. Then $AX \sim N(\Sigma^{-1/2}\mu, I_p)$, and so $AX - \Sigma^{-1/2}\mu \sim N(0, I_p)$. This is called the standardization of the Gaussian.

Remark 7.1. In the above, we made use of the concept of the square root of a positive definite matrix. This is a generalization of the square root of a positive number. If $S$ is symmetric and (strictly) positive definite, then a square root of the matrix $S$ is a matrix, denoted by $A = S^{1/2}$, for which $AA^\top = S$.

We can express $A$ more concretely. Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of $S$; they are positive because $S$ is positive definite. Define $\Gamma := [\gamma_1, \ldots, \gamma_p]$, whose columns are orthonormal eigenvectors $\gamma_i$ of $S$. By the spectral theorem for symmetric matrices,

$$S = \sum_{i=1}^{p}\lambda_i\gamma_i\gamma_i^\top.$$

More succinctly, we write $S = \Gamma D\Gamma^\top$, where $D := \text{diag}(\lambda_1, \ldots, \lambda_p)$. Let $D^{1/2} := \text{diag}(\lambda_1^{1/2}, \ldots, \lambda_p^{1/2})$. It is now simple to verify that the square root of $S$ takes the form $A = \Gamma D^{1/2}\Gamma^\top$.

In sum, the symmetric square root of a positive definite matrix $S$ is a matrix that has the same set of eigenvectors as $S$, while its eigenvalues are the square roots of those of $S$.

Marginalization & Conditioning

The multivariate Gaussian distributions enjoy the following key invariance properties, which make them powerful in theory and useful in practice: if a joint distribution is Gaussian, then the induced marginal and conditional distributions are also Gaussian.

Let us express the $p$-dimensional vector $X$ in terms of two blocks of components $X_1 \in \mathbb{R}^{n_1}$ and $X_2 \in \mathbb{R}^{n_2}$, with $n_1 + n_2 = p$, as in

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}.$$

The mean vector $\mu$ and covariance $\Sigma$ can be partitioned according to the $n_1$- and $n_2$-dimensional components:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

We can read off $(\mu_1, \Sigma_{11})$ and $(\mu_2, \Sigma_{22})$ as the mean vector and the covariance matrix of $X_1$ and $X_2$, respectively. In addition, $\Sigma_{12}$ is the cross-covariance matrix between $X_1$ and $X_2$. These are general facts that hold for any distribution on $\mathbb{R}^p$.

Now, if we assume

$$X \sim N_p\left(\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right),$$

then we also have that

$$\begin{aligned}
X_1 &\sim N_{n_1}(\mu_1, \Sigma_{11}) \\
X_2 &\sim N_{n_2}(\mu_2, \Sigma_{22}) \\
X_1 \mid X_2 = x_2 &\sim N_{n_1}\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) \\
X_2 \mid X_1 = x_1 &\sim N_{n_2}\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right).
\end{aligned}$$

These conditional and marginal formulas invite some sort of conjugacy to take place: sweet music to Bayesian ears.

Canonical Parameterization

The representation of the Gaussian density in terms of the parameters $\mu$ and $\Sigma$ (see Eq. (25)) is called the mean parameterization.
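There is an equivalent parameterization, called the dual parameterization or canonical parameterization, obtained by expanding the quadratic term and letting

$$\Lambda = \Sigma^{-1} \quad \text{and} \quad \eta = \Sigma^{-1}\mu,$$

where $\Lambda$ is the concentration matrix.

The canonical parameterization is useful in that it can be easily extended to a broader class of distributions. Moreover, the relationship between the two representations has some fruitful consequences in terms of inference. We will get there eventually.

A very important fact about multivariate Gaussians: a zero entry in the matrix $\Sigma$ implies independence between the corresponding components of the random vector $X$. On the other hand, by exploiting the canonical parameterization it can be seen that a zero entry in $\Lambda$ implies only conditional independence, given the other random components. Let us look at specific examples:

1. Suppose $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, where $X_1$ is of $n_1$ dimensions and $X_2$ is of $n_2$ dimensions, such that $X$ is distributed as $N(\mu, \Sigma)$, where $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. Prove that $X_1 \perp X_2$ if and only if $\Sigma_{12} = \Sigma_{21}^\top = 0$.

2. Let $X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}$, and suppose that $X$ is a zero-mean Gaussian vector with concentration matrix $\Lambda$, i.e., $X \sim N(0, \Lambda^{-1})$, where

$$\Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} & \Lambda_{13} \\ \Lambda_{21} & \Lambda_{22} & \Lambda_{23} \\ \Lambda_{31} & \Lambda_{32} & \Lambda_{33} \end{pmatrix}.$$

Show that if $\Lambda_{12} = \Lambda_{21}^\top = 0$ then $X_1 \perp X_2 \mid X_3$.

7.3 Semiconjugate prior for the mean vector

Consider the normal sampling model $Y_1, \ldots, Y_n \mid \theta, \Sigma \overset{\text{i.i.d.}}{\sim} N_p(\theta, \Sigma)$. Stacking up the samples $Y_i$, the data can be viewed as an $n \times p$ matrix. The likelihood takes the form

$$\begin{aligned}
p(y_1, \ldots, y_n \mid \theta, \Sigma) &= \prod_{i=1}^{n}(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(y_i - \theta)^\top\Sigma^{-1}(y_i - \theta)\right\} \\
&= (2\pi)^{-np/2}|\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(y_i - \theta)^\top\Sigma^{-1}(y_i - \theta)\right\} \\
&\propto \exp\left\{-\frac{1}{2}\theta^\top A_1\theta + \theta^\top b_1\right\},
\end{aligned}$$

where the proportionality is as a function of $\theta$, and the coefficients associated with the quadratic and linear terms are

$$A_1 = n\Sigma^{-1}, \qquad b_1 = \Sigma^{-1}\sum_{i=1}^{n} y_i =: n\Sigma^{-1}\bar y.$$

The likelihood takes an exponential form, where the exponent is quadratic with respect to the mean parameter $\theta$. It is clear that the simplest semiconjugate prior (with $\Sigma$ held fixed) is a multivariate normal distribution. We set the prior: $\theta \sim N(\mu_0, \Sigma_0)$.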
As in the univariate case, it is convenient to write the multivariate normal density in terms of the precision matrix $\Sigma_0^{-1}$:

$$\begin{aligned}
p(\theta) &= (2\pi)^{-p/2}|\Sigma_0|^{-1/2}\exp\left\{-\frac{1}{2}(\theta - \mu_0)^\top\Sigma_0^{-1}(\theta - \mu_0)\right\} \\
&= (2\pi)^{-p/2}|A_0|^{1/2}\exp\left\{-\frac{1}{2}(\theta - \mu_0)^\top A_0(\theta - \mu_0)\right\} \\
&\propto \exp\left\{-\frac{1}{2}\theta^\top A_0\theta + \theta^\top b_0\right\},
\end{aligned}$$

where $A_0 = \Sigma_0^{-1}$ and $b_0 = \Sigma_0^{-1}\mu_0$.

By Bayes' rule,

$$\begin{aligned}
p(\theta \mid y_1, \ldots, y_n, \Sigma) &\propto p(\theta)\,p(y_1, \ldots, y_n \mid \theta, \Sigma) \\
&\propto \exp\left\{-\frac{1}{2}\theta^\top A_n\theta + \theta^\top b_n\right\},
\end{aligned}$$

where

$$\begin{aligned}
A_n &= A_0 + A_1 = \Sigma_0^{-1} + n\Sigma^{-1} \\
b_n &= b_0 + b_1 = \Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar y.
\end{aligned}$$

Thus, the posterior distribution of $\theta$ given $y_1, \ldots, y_n$ and $\Sigma$ is multivariate normal with covariance matrix $\Sigma_n := A_n^{-1}$ and mean vector $\Sigma_n b_n$.

7.4 Inverse Wishart prior for the covariance matrix

We learned in the one-dimensional Gaussian model that when the mean parameter is held fixed, the (semi)conjugate prior for the precision parameter is a Gamma distribution. In this subsection we will find a multivariate version of the Gamma distribution for the precision matrix. It is called the Wishart distribution. Since the covariance matrix is the inverse of the precision matrix, this corresponds to using an inverse-Wishart prior for the covariance matrix.

Recall the likelihood form, keeping only quantities that vary with the covariance matrix parameter:

$$\begin{aligned}
p(y_1, \ldots, y_n \mid \theta, \Sigma) &= \prod_{i=1}^{n}(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(y_i - \theta)^\top\Sigma^{-1}(y_i - \theta)\right\} \\
&\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\text{trace}\left(\sum_{i=1}^{n}(y_i - \theta)^\top\Sigma^{-1}(y_i - \theta)\right)\right\} \\
&\propto |\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\text{trace}\left(\Sigma^{-1}\sum_{i=1}^{n}(y_i - \theta)(y_i - \theta)^\top\right)\right\}.
\end{aligned}$$

In the second line, we used the trivial fact that a scalar is equal to its own trace. Recall that the trace of a square matrix is the sum of its diagonal elements. In the third line, we used the cyclic property of the trace of a product of matrices (assuming the matrix dimensions match up):

$$\text{trace}(AB) = \text{trace}(BA); \quad \text{trace}(ABC) = \text{trace}(BCA) = \text{trace}(CAB); \ldots$$

Let $A = \Sigma^{-1}$ denote the precision matrix and

$$S_n = \sum_{i=1}^{n}(y_i - \theta)(y_i - \theta)^\top. \tag{26}$$

Then the likelihood takes a simple form:

$$p(y_1, \ldots, y_n \mid \theta, A) \propto |A|^{n/2}\exp\left\{-\frac{1}{2}\text{trace}(AS_n)\right\}.$$

The simplest form for a conjugate prior is the Wishart distribution for the precision matrix $A$, or equivalently, the inverse-Wishart distribution for the covariance matrix $\Sigma$.

We say a random matrix $A \sim \text{Wishart}(\nu_0, S_0^{-1})$ if it admits the following density function on the space of symmetric and positive definite matrices:

$$p(A \mid \nu_0, S_0) := \left[2^{\nu_0 p/2}\,\pi^{\binom{p}{2}/2}\,|S_0|^{-\nu_0/2}\prod_{j=1}^{p}\Gamma([\nu_0 + 1 - j]/2)\right]^{-1} \times |A|^{(\nu_0 - p - 1)/2}\exp\left\{-\frac{1}{2}\text{trace}(AS_0)\right\}. \tag{27}$$

We immediately find that the conditional distribution of $A$ given $y_1, \ldots, y_n$ and $\theta$ is again a Wishart distribution:

$$\begin{aligned}
p(A \mid y_1, \ldots, y_n, \theta) &\propto |A|^{(\nu_0 + n - p - 1)/2}\exp\left\{-\frac{1}{2}\text{trace}(A(S_0 + S_n))\right\} \\
&\equiv \text{Wishart}(\nu_0 + n, [S_0 + S_n]^{-1}).
\end{aligned}$$

Equivalently, in terms of the covariance matrix: given the prior $\Sigma = A^{-1} \sim \text{inverse-Wishart}(\nu_0, S_0^{-1})$, which has the density

$$p(\Sigma \mid \nu_0, S_0^{-1}) = \left[2^{\nu_0 p/2}\,\pi^{\binom{p}{2}/2}\,|S_0|^{-\nu_0/2}\prod_{j=1}^{p}\Gamma([\nu_0 + 1 - j]/2)\right]^{-1} \times |\Sigma|^{-(\nu_0 + p + 1)/2}\exp\left\{-\frac{1}{2}\text{trace}(\Sigma^{-1}S_0)\right\}, \tag{28}$$

we find that

\begin{align}
p(\Sigma \mid y_1, \ldots, y_n, \theta) &\propto |\Sigma|^{-(\nu_0 + n + p + 1)/2}\exp\left\{-\frac{1}{2}\text{trace}(\Sigma^{-1}(S_0 + S_n))\right\} \tag{29} \\
&\equiv \text{inverse-Wishart}(\nu_0 + n, [S_0 + S_n]^{-1}). \tag{30}
\end{align}

Useful facts of Wishart distributions

The Wishart is a canonical distribution for symmetric and positive definite matrices. $\text{Wishart}(\nu_0, V)$ has two parameters: $\nu_0$ is called the number of degrees of freedom, and $V \succ 0$ is the scale matrix.

The Wishart is the multivariate analogue of the chi-square distribution (which is a special case of the Gamma distribution). Recall that a chi-square random variable with $n_0$ degrees of freedom can be constructed as a sum of squares of $n_0$ independent standard normal variables. A similar property holds for Wishart random matrices.

Let $z_1, \ldots, z_{\nu_0} \overset{\text{i.i.d.}}{\sim} N_p(0, V)$. Let $Z = [z_1 \ldots z_{\nu_0}]$ be the $p \times \nu_0$ matrix made of the $\nu_0$ column vectors $z_i$. Then,

$$ZZ^\top = \sum_{i=1}^{\nu_0} z_i z_i^\top \sim \text{Wishart}(\nu_0, V). \tag{31}$$

When $\nu_0 \geq p$, the matrix $ZZ^\top$ is positive definite (and hence invertible) almost surely if $V$ is invertible.
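The construction in Eq. (31) translates directly into a sampler for integer $\nu_0$. The Python sketch below (standard library only) draws each $z_i \sim N_p(0, V)$ as $z_i = Le_i$, where $V = LL^\top$ is a Cholesky factorization and $e_i$ is a vector of independent standard normals, and then accumulates the outer products. The names `cholesky` and `sample_wishart` and the example values are hypothetical; the closing Monte Carlo average is only a sanity check against the known mean $\nu_0 V$ of a Wishart matrix.

```python
import math
import random


def cholesky(v):
    """Lower-triangular L with L L^T = v, for a symmetric positive definite v."""
    p = len(v)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = v[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("v is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def sample_wishart(nu, v, rng):
    """One draw from Wishart(nu, v) as a sum of nu outer products z z^T, z ~ N_p(0, v)."""
    p = len(v)
    L = cholesky(v)
    a = [[0.0] * p for _ in range(p)]
    for _ in range(nu):
        e = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # z = L e has covariance L L^T = v.
        z = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(p)]
        for i in range(p):
            for j in range(p):
                a[i][j] += z[i] * z[j]
    return a


# Hypothetical example: nu = 5 degrees of freedom and a 2 x 2 scale matrix.
rng = random.Random(0)
V = [[2.0, 0.5], [0.5, 1.0]]
draws = [sample_wishart(5, V, rng) for _ in range(2000)]
avg = [[sum(d[i][j] for d in draws) / len(draws) for j in range(2)] for i in range(2)]
print("Monte Carlo average of the draws:", avg)
print("nu * V for comparison:", [[5 * V[i][j] for j in range(2)] for i in range(2)])
```

A draw from the corresponding inverse-Wishart distribution is obtained by inverting a Wishart draw, which is what a Gibbs step for $\Sigma$ would need.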
If $p = V = 1$, we are reduced to a chi-squared distribution with $\nu_0$ degrees of freedom.

The above characterization makes it simple to draw samples from a Wishart distribution (or an inverse-Wishart distribution). It also allows us to collect a few useful facts: for $A \sim \text{Wishart}(\nu_0, V) \Leftrightarrow \Sigma = A^{-1} \sim \text{inverse-Wishart}(\nu_0, V)$,

$$\begin{aligned}
E(A) &= \nu_0 V \\
\text{Var}(A_{ij}) &= \nu_0(V_{ij}^2 + V_{ii}V_{jj}) \\
E(\Sigma) &= \frac{1}{\nu_0 - p - 1}V^{-1}.
\end{aligned}$$

The formula for the variance of $\Sigma$ is slightly more complicated and is omitted, but the rule of thumb is that we set $\nu_0$ to be small if we want large variation around the prior expectation of the covariance matrix $\Sigma$.

Plugging the last identity into the posterior distribution for $\Sigma$ given in Eq. (30):

$$\begin{aligned}
E[\Sigma \mid y_1, \ldots, y_n, \theta] &= \frac{1}{\nu_0 + n - p - 1}(S_0 + S_n) \\
&= \frac{\nu_0 - p - 1}{\nu_0 + n - p - 1}\cdot\frac{1}{\nu_0 - p - 1}S_0 + \frac{n}{\nu_0 + n - p - 1}\cdot\frac{1}{n}S_n,
\end{aligned}$$

which can be viewed as a weighted average of the prior expectation $S_0/(\nu_0 - p - 1)$ and the unbiased estimator $S_n/n$ of the covariance matrix $\Sigma$.

7.5 Example: reading comprehension study

Suppose we are given $n$ i.i.d. samples $Y_1, \ldots, Y_n \mid \theta, \Sigma \sim N_p(\theta, \Sigma)$. Using the priors $\theta \sim N(\mu_0, \Sigma_0)$ and $\Sigma \sim \text{inverse-Wishart}(\nu_0, S_0^{-1})$, it is simple to implement a Gibbs sampler to approximate the posterior distribution of $(\theta, \Sigma)$ based on the full conditionals obtained in the previous subsections.

Let us consider an example in Hoff (2009) (Chapter 7):

• 22 children were given two reading comprehension exams, one before a certain type of instruction and one after.

• We model these 22 pairs of scores as i.i.d. samples $y_1, \ldots, y_{22}$ from a bivariate normal ($p = 2$). The data samples are plotted as black dots in the second panel of Fig. 7.1.

• Basic sample statistics from $y_1, \ldots, y_{22}$: we found that $\bar y = (47.18 \;\; 53.86)^\top$. In terms of sample variances and covariance:

$$S_{22} = \begin{pmatrix} 182.16 & 147.44 \\ 147.44 & 243.65 \end{pmatrix}.$$

• The exam was designed for average scores of around 50 out of 100, so $\mu_0 = (50 \;\; 50)^\top$.
• for the hyperparameter $\Sigma_0 := \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}$, we set $\sigma_{11} = \sigma_{22} = (50/2)^2 = 625$ to ensure that most of the prior mass concentrates on $[0, 100]$. Moreover, $\sigma_{12} = \sigma_{21} = 0.5\,\sigma_{11} = 312.5$ to allow some prior correlation.

Now we proceed with the Bayesian approach.

• as for the hyperparameters for $\Sigma$: set $S_0 = \Sigma_0$ and choose a relatively small value for the number of degrees of freedom, $\nu_0 = p + 2 = 4$, to allow sufficient spread around $\Sigma_0$.

• run the Gibbs sampler for 5000 iterations, from which we can approximate the posterior distribution as follows:

$$\Pr(\theta_2 > \theta_1 \mid y_1, \ldots, y_n) \approx 0.99$$

• we also find the quantiles of the posterior distribution of $\theta_2 - \theta_1$:

Figure 7.1: Reading comprehension: posterior distribution of mean scores before and after instruction (left), and posterior predictive distribution of two scores (right).

• The left panel of Fig. 7.1 gives 97.5%, 75%, 50%, 25% and 5% highest posterior density contours for the posterior distribution of $\theta = (\theta_1, \theta_2)^\top$. Thus, the evidence is strong that the mean test score $\theta_2$ after the instruction is greater than the mean score $\theta_1$ before the instruction.

• But this does not tell the full story. It is far more interesting to look at the posterior predictive probability

$$\Pr(Y_2 > Y_1 \mid y_1, \ldots, y_n).$$

This asks: what is the probability that a randomly selected child will score higher on the second exam than on the first?

• The second panel of Fig. 7.1 shows the highest posterior density contours of the predictive distribution, and there is a more substantial overlap with the line $y_2 = y_1$. In fact, we can find that $\Pr(Y_2 > Y_1 \mid y_1, \ldots, y_n) \approx 0.71$. Thus, almost a third of the students will get a lower score on the second exam!

This example highlights the distinction between two different ways of comparing populations in the reporting of results from experiments or surveys: studies with very large sample size n may result in values of Pr(θ2 > θ1 |y 1 , . . .
, y n ) that are very close to 1 (or p-values that are very close to 0), and conclude "significant effect", but such results say nothing about how large an effect we expect to see for a randomly sampled individual.

8 Group comparisons and hierarchical modeling

In this section we will study questions related to comparisons of different populations. While group comparison may conjure up the question of ranking, a thorough treatment will inevitably require thinking of notions such as within-group variability and between-group variability. Such notions are best addressed by employing (Bayesian) hierarchical modeling. In this sense, this section is also a good entry point to hierarchical modeling, which is applicable far beyond basic group comparison problems. In fact, hierarchical modeling is one of the most powerful tools in the arsenal of Bayesian statistics.

8.1 Comparing two groups

Example 8.1. We are given a sample of 10th grade students from two public U.S. high schools. The sample sizes from school 1 and school 2 are $n_1 = 31$ and $n_2 = 28$, respectively. Both schools have a total enrollment of around 600 10th graders and both are in a similar environment (urban neighborhoods).

• Suppose we are interested in comparing the population means $\theta_1$ and $\theta_2$.

• Sample means: $\bar{y}_1 = 50.81$ and $\bar{y}_2 = 46.15$, suggesting that $\theta_1 > \theta_2$.

• Let's take a look at the box plots. There are evidently different levels of variability in the two groups.

A standard approach is to consider the t-statistic:

$$t(y_1, y_2) = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{1/n_1 + 1/n_2}} = \frac{50.81 - 46.15}{10.44 \sqrt{1/31 + 1/28}} = 1.74,$$

where $s_p^2 = [(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2]/(n_1 + n_2 - 2)$ is the pooled estimate of the population variance of the two groups.

Figure 8.1: Left panel: Boxplots of samples of math scores from two schools. Right panel: gray line indicates the observed value of the t-statistic.

A basic frequentist technique (the t-test) proceeds as follows.
• Exploit the fact that if the two populations are normal distributions with the same mean and variance, then the t-statistic $t(Y_1, Y_2)$ has a t-distribution with $n_1 + n_2 - 2 = 57$ degrees of freedom. The density of this distribution is plotted in the second panel of Fig. 8.1. Under this distribution, the probability that $|t(Y_1, Y_2)| > 1.74$ is $p = 0.087$. This is called the (two-sided) p-value of the obtained statistic.

– Although not completely justified in theory, p-values are widely used, and easily misused and abused, in parameter estimation and model selection. A small p-value is taken as evidence for rejecting the null hypothesis/model $\theta_1 = \theta_2$. Thus, a small p-value is construed as strong evidence that the two populations are different ($\theta_1 \neq \theta_2$). Customarily, $p$ is considered small if $p < 0.05$ (or a smaller positive threshold).

– Mathematically, $p = \Pr(|t(Y_1, Y_2)| > |t(y_1, y_2)| \mid \theta_1 = \theta_2)$. This is a (pre-experiment) probability statement on the unseen data represented by $(Y_1, Y_2)$, even though the observed statistic $t(y_1, y_2)$ supplies part of the equation that defines the p-value. This is a source of confusion for many practitioners of frequentist tests. It should not be the case for a student of Bayesian statistics. Clearly, $p$ is not the (post-experiment) probability that $\theta_1 = \theta_2$ is true given the data evidence provided by $t(y_1, y_2)$, namely $\Pr(\theta_1 = \theta_2 \mid t(y_1, y_2))$.

• The t-test commonly taught in statistics classes continues as follows:

– if $p < 0.05$: reject the null hypothesis/model that the two groups have the same distribution; conclude that $\theta_1 \neq \theta_2$. Moreover, use the estimates $\hat{\theta}_1 = \bar{y}_1$ and $\hat{\theta}_2 = \bar{y}_2$.

– if $p \geq 0.05$: accept the null hypothesis/model, and conclude that $\theta_1 = \theta_2$. Moreover, use the estimate

$$\hat{\theta}_1 = \hat{\theta}_2 = \Big( \sum_i y_{i,1} + \sum_i y_{i,2} \Big) / (n_1 + n_2).$$

• In our present example: $p \geq 0.05$, so we accept that $\theta_1 = \theta_2$, even though there seems to be some evidence to the contrary.
• Imagine a scenario where the sample from school 1 might have included a few more high-performing students, and the sample from school 2 a few more low-performing students. Then we could have observed a p-value of 0.04 or so, in which case we would have treated the two populations as different, and resorted to using only data from school 1 for estimating $\theta_1$, and only data from school 2 for estimating $\theta_2$. It seems that such estimates for $\theta_1$ and $\theta_2$ are not robust with respect to changes to the samples. (See footnote 5.)

• Estimating $\theta_1$, $\theta_2$ and the difference $\theta_1 - \theta_2$ is perhaps more important than settling the binary question of whether $\theta_1 \neq \theta_2$, especially when the difference between the two is relatively small. The above frequentist approach results in taking one of two extreme positions for the estimation of $\theta_1$ and $\theta_2$:

$$\hat{\theta}_1 = w_1 \bar{y}_1 + (1 - w_1) \bar{y}_2, \qquad \hat{\theta}_2 = (1 - w_2) \bar{y}_1 + w_2 \bar{y}_2,$$

where $w_1 = w_2 = 1$ if $p < 0.05$, and $w_1 = n_1/(n_1 + n_2)$, $w_2 = n_2/(n_1 + n_2)$ otherwise.

• It might make more sense to allow the weights $w_1, w_2$ to vary continuously and take values that depend on quantities such as the sample sizes $n_1, n_2$ and other quantities that determine population variabilities. In other words, we want to allow the borrowing of information across groups: the data from group 1 may influence the estimate for group 2 and vice versa.

5 In the t-test, as is the case with most frequentist tests, we are on firm mathematical ground when we happen to reject, i.e., the rejection is mathematically justified. However, in such a scenario for the t-test, our estimates may not be robust because of the issue mentioned. When we happen to not reject, i.e., we remain with the null hypothesis/model, the issue becomes whether the null model is too simplistic and heavily misspecified; the estimates would be suspect as a result.

Enabling information sharing across groups

Consider the following sampling model for two groups:

$$Y_{i,1} = \mu + \delta + \epsilon_{i,1}, \qquad Y_{i,2} = \mu - \delta + \epsilon_{i,2}, \qquad \{\epsilon_{i,j}\} \overset{iid}{\sim} \text{normal}(0, \sigma^2).$$
We have utilized a (re)parameterization trick: under this parameterization, $\theta_1 = \mu + \delta$ and $\theta_2 = \mu - \delta$, so $\mu = (\theta_1 + \theta_2)/2$ and $\delta = (\theta_1 - \theta_2)/2$. The intention is to enable the coupling (dependence) of the two groups via the variables $\mu$ and $\delta$, which will be made random by a prior distribution. The fact that these two are random is enough to allow the coupling and subsequent information sharing in posterior inference. The specific prior choice given below is for computational convenience:

$$p(\mu, \delta, \sigma^2) = p(\mu) \times p(\delta) \times p(\sigma^2),$$
$$\mu \sim \text{normal}(\mu_0, \gamma_0^2), \qquad \delta \sim \text{normal}(\delta_0, \tau_0^2), \qquad \sigma^2 \sim \text{inverse-gamma}(\nu_0/2, \nu_0 \sigma_0^2/2).$$

Based on our previous calculations for the (univariate) normal model, it should be an easy exercise to derive the full conditional distributions of these parameters as follows.

$\{\mu \mid y_1, y_2, \delta, \sigma^2\} \sim \text{normal}(\mu_n, \gamma_n^2)$, where

$$\gamma_n^2 = [1/\gamma_0^2 + (n_1 + n_2)/\sigma^2]^{-1}, \qquad \mu_n = \gamma_n^2 \times \Big[ \mu_0/\gamma_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - \delta)/\sigma^2 + \sum_{i=1}^{n_2} (y_{i,2} + \delta)/\sigma^2 \Big];$$

$\{\delta \mid y_1, y_2, \mu, \sigma^2\} \sim \text{normal}(\delta_n, \tau_n^2)$, where

$$\tau_n^2 = [1/\tau_0^2 + (n_1 + n_2)/\sigma^2]^{-1}, \qquad \delta_n = \tau_n^2 \times \Big[ \delta_0/\tau_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - \mu)/\sigma^2 - \sum_{i=1}^{n_2} (y_{i,2} - \mu)/\sigma^2 \Big];$$

$\{\sigma^2 \mid y_1, y_2, \mu, \delta\} \sim \text{inverse-gamma}(\nu_n/2, \nu_n \sigma_n^2/2)$, where

$$\nu_n = \nu_0 + n_1 + n_2, \qquad \nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{n_1} (y_{i,1} - [\mu + \delta])^2 + \sum_{i=1}^{n_2} (y_{i,2} - [\mu - \delta])^2.$$

Let us go back to our example of comparing math test scores of students from two high schools.

Example 8.2. As for the prior distribution $\mu \sim \text{normal}(\mu_0, \gamma_0^2)$, we put $\mu_0 = 50$, $\gamma_0 = 50/2 = 25$ to get a reasonably diffuse prior. For the prior on $\delta$, set $\delta_0 = 0$, $\tau_0 = 25$. For the prior on $\sigma^2$, set $\nu_0 = 1$, $\sigma_0 = 10$ (this latter choice is due to the setup that the math scores were standardized to produce a nationwide mean of 50 and a standard deviation of 10).

• Fig. 8.2 shows the posterior distributions of $\mu$ and $\delta$.
In particular, the 95% quantile-based posterior confidence interval for $2\delta$, the difference of average scores between the two schools, is $(-0.61, 9.98)$, indicating strong evidence that the mean score for school 1 is higher than that of school 2.

• In addition, $\Pr(\theta_1 > \theta_2 \mid y_1, y_2) = \Pr(\delta > 0 \mid y_1, y_2) \approx 0.96$, even though the prior probability is $\Pr(\delta > 0) = 0.50$.

• As for the posterior predictive probability that a randomly selected student from school 1 has a higher score than a randomly selected student from school 2: $\Pr(Y_1 > Y_2 \mid y_1, y_2) \approx 0.62$.

Figure 8.2: Posterior distributions for $\mu$ and $\delta$.

8.2 Comparing multiple groups

It is very common to organize data or data sets in a hierarchy of nested populations. Such data sets are often called hierarchical or multilevel data. For example:

• there are multiple hospitals, and each hospital has many patients

• there are different animals, and each animal carries a set of genes

• different countries, each of which is organized into regions, each of which is organized into counties, with residents in each of them

• the "activity recognition problem": a collection of computer users, where each user is associated with a collection of computer-related activities (organized by days), and each day has a collection of activities (apps run)

• a collection of text corpora, where each text corpus is a collection of documents, and each document is a collection of words

• a database of images divided into groups, where each image is a collection of image patches, and each patch is a collection of pixels or other specific computer vision elements

We are interested in learning about these groups: what are the shared features among them, and what makes different groups different and how. In most applications, it does not make much sense to assume that the groups are independent. It makes sense to assume that they are dependent, and to exploit such dependence to learn about global aspects of all groups, as well as locally distinct aspects of each group.
In other words, we wish to borrow information from one group to inform about the others, as well as the whole. The question is how.

8.3 Exchangeability and hierarchical models

Hierarchical models are a general method for describing dependence for grouped data. They can be motivated by a theorem of Bruno de Finetti. At a high level, de Finetti's theorem says that an exchangeable sequence of random variables must be conditionally i.i.d., and as a consequence, an exchangeable collection of groups of random variables must be distributed according to a hierarchical model. Let us make this statement more precise.

Definition 8.1. (Exchangeable). Let $p(y_1, \ldots, y_n)$ be the joint density of random variables $Y_1, \ldots, Y_n$. If $p(y_1, \ldots, y_n) = p(y_{\pi_1}, \ldots, y_{\pi_n})$ for all permutations $\pi$ of $1, \ldots, n$ (see footnote 6), then we say that $Y_1, \ldots, Y_n$ are exchangeable. Equivalently, the joint distribution of $(Y_{\pi_1}, \ldots, Y_{\pi_n})$ remains invariant under any permutation $\pi$.

Intuitively, when $Y_1, \ldots, Y_n$ are exchangeable, the subscript labels of these $n$ variables convey no additional information about them. It is simple to see that if a collection of random variables $Y_1, \ldots, Y_n$ are conditionally i.i.d. given some random variable $\theta$, i.e.,

$$\theta \sim \pi(\theta), \qquad Y_1, \ldots, Y_n \mid \theta \overset{i.i.d.}{\sim} p(\cdot \mid \theta),$$

then $Y_1, \ldots, Y_n$ are exchangeable. What about the other direction? This is where de Finetti's theorem comes in.

6 At this point, it may be helpful to express the identity explicitly: $p_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = p_{Y_1, \ldots, Y_n}(y_{\pi_1}, \ldots, y_{\pi_n})$.

Theorem 8.1. Let $Y_1, Y_2, \ldots$ be an infinite sequence of random variables all having a common sample space $\mathcal{Y}$. Suppose that $Y_1, \ldots, Y_n$ are exchangeable for any sequence size $n$. Then $Y_1, Y_2, \ldots$ must be conditionally i.i.d. That is, the joint distribution of $Y_1, \ldots, Y_n$ for any $n$ must be of the form (provided that a density function exists): for all n and y1 , . . . , yn p(y1 , . . .
, yn ) = ∫ { ∏_{i=1}^{n} p(yi |θ) } π(θ) dθ   (32)

for some parameter $\theta$, some distribution $\pi$ over $\theta$, and some sampling model $p(y \mid \theta)$.

Remark 8.1.

• The "infinite" part in the statement is necessary, along with the condition of exchangeability for any $n$.

• de Finetti's theorem is one of the great theorems in probability theory. It also gives us probability models that can be written as Eq. (32), as well as hierarchical versions of this, as we will see.

• It has a foundational role in Bayesian statistics, because it provides a mathematical justification for the existence of the notion of a random parameter $\theta$:

– whereas a frequentist statistician may be content with making an i.i.d. assumption about an unknown sampling mechanism such as $Y_1, \ldots, Y_n \overset{i.i.d.}{\sim} p(\cdot \mid \theta)$, de Finetti's theorem says that if the observation sequence is in fact exchangeable, then the unknown $\theta$ must be random. Bayesian statisticians proceed by placing a prior distribution $\pi$ on such $\theta$.

• Exchangeability makes sense in many practical situations:

– the math scores from $n$ randomly selected students from a particular school, in the absence of other information about the students, may be treated as exchangeable.

– the collection of U.S. high schools in similar environments (e.g., large urban areas).

– the computer-related activities by a user collected on Monday mornings in the past year.

– What is not exchangeable? The collection of time-stamped computer-related activities in the past 24 hours is not exchangeable. The words in a document, read from the beginning to the end, are not exchangeable either. But if we print the document on a piece of paper, cut the paper into small pieces, one for each word, place the pieces into a bag and shuffle well, then we have a bag of exchangeable words.

Now, let us consider a model to describe our information about a hierarchical data structure: there are m groups {Y 1 , . . . , Y m }; each group Y j = {Yj1 , . . .
, Yjnj } has nj elements, for some nj ≥ 1. Suppose that the elements within each group $Y_j$ may be treated as exchangeable. Then, by de Finetti's theorem we may model the observations from each group as conditionally i.i.d. given some parameter:

$$Y_{j1}, \ldots, Y_{j n_j} \mid \phi_j \overset{i.i.d.}{\sim} p(y \mid \phi_j). \qquad (33)$$

What about the collection of parameters $\phi_1, \ldots, \phi_m$? If we assume that the $m$ groups are exchangeable, then, applying de Finetti's theorem once more, we have

$$\phi_1, \ldots, \phi_m \mid \psi \overset{i.i.d.}{\sim} p(\phi \mid \psi), \qquad (34)$$

for some random parameter $\psi$. Collecting the above specifications, we arrive at the following hierarchical model:

$$\psi \sim p(\psi) \quad \text{(prior distribution)}$$
$$\phi_1, \ldots, \phi_m \mid \psi \overset{i.i.d.}{\sim} p(\phi \mid \psi) \quad \text{(between-group sampling variability)}$$
$$Y_{j1}, \ldots, Y_{j n_j} \mid \phi_j \overset{i.i.d.}{\sim} p(y \mid \phi_j), \; j = 1, \ldots, m \quad \text{(within-group sampling variability)}.$$

This hierarchical model has three levels that represent different aspects of randomness/random variability: $p(y \mid \phi)$ represents the sampling variability among measurements within a group, and $p(\phi \mid \psi)$ represents the sampling variability across groups. Finally, $p(\psi)$ represents prior information about the unknown parameter $\psi$. Depending on the data structure and the modeler's knowledge, there may be more levels in the hierarchy of sampling distributions and prior distributions that can be constructed.

8.4 Hierarchical normal models

A popular model for describing the heterogeneity of means across several populations is the hierarchical normal model: here, each group is endowed with a normal sampling model; the mean parameters across groups are endowed with another normal sampling model further up in the hierarchy.

$$\phi_j = (\theta_j, \sigma^2), \quad p(y \mid \phi_j) = \text{normal}(\theta_j, \sigma^2) \quad \text{(within-group model)} \qquad (35a)$$
$$\psi = (\mu, \tau^2), \quad p(\theta_j \mid \psi) = \text{normal}(\mu, \tau^2) \quad \text{(between-group model)}. \qquad (35b)$$

Note that in this model, we allow different groups to have different means, but they share the same variance $\sigma^2$ (this assumption may be relaxed). The parameters for the given sampling model are $\mu, \tau^2, \sigma^2$.
For convenience we may give them standard semi-conjugate priors:

$$1/\sigma^2 \sim \text{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2), \qquad 1/\tau^2 \sim \text{gamma}(\eta_0/2, \eta_0 \tau_0^2/2), \qquad \mu \sim \text{normal}(\mu_0, \gamma_0^2).$$

8.4.1 Posterior inference

The unknown quantities in our model include the group-specific means $(\theta_1, \ldots, \theta_m)$, the within-group sampling variability $\sigma^2$, and the mean and variance $\mu, \tau^2$ of the population of group-specific means. Joint posterior inference for these parameters may be made by an MCMC approximation for the posterior distribution

$$p(\theta_1, \ldots, \theta_m, \sigma^2, \mu, \tau^2 \mid y_1, \ldots, y_m)$$
$$\propto p(\mu, \tau^2, \sigma^2) \times p(\theta_1, \ldots, \theta_m \mid \mu, \tau^2, \sigma^2) \times p(y_1, \ldots, y_m \mid \theta_1, \ldots, \theta_m, \mu, \tau^2, \sigma^2)$$
$$= p(\mu)\, p(\tau^2)\, p(\sigma^2) \left\{ \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2) \right\} \left\{ \prod_{j=1}^{m} \prod_{i=1}^{n_j} p(y_{ji} \mid \theta_j, \sigma^2) \right\}.$$

Although this may look daunting, we will see shortly that it is not difficult to derive full conditional distributions for all parameters of interest, which will enable us to run a Gibbs sampler. The key is to observe that the joint distribution of all parameters and observed data is expressed in the factorized form (i.e., product form) given above. This is a reflection of the conditional independence relations inherent in our hierarchical modeling assumption. It is also this conditional independence that we exploit in deriving the full conditional distributions comfortably.

Full conditional distributions of $\mu$ and $\tau^2$: It is useful to note that $\mu$ and $\tau^2$ are conditionally independent of all other variables in the joint model when given $\theta_1, \ldots, \theta_m$. Collecting only relevant terms from the joint distribution, we find that

$$p(\mu \mid \theta_1, \ldots, \theta_m, \tau^2, \sigma^2, y_1, \ldots, y_m) \propto p(\mu) \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2),$$
$$p(\tau^2 \mid \theta_1, \ldots, \theta_m, \mu, \sigma^2, y_1, \ldots, y_m) \propto p(\tau^2) \prod_{j=1}^{m} p(\theta_j \mid \mu, \tau^2).$$

The right hand sides of the two equations in the above display allow us to look at only "submodels" for $\mu$ and $\tau^2$.
For example: in the first equation we can treat $\theta_1, \ldots, \theta_m$ as the $m$-data sample for a normal submodel with mean parameter $\mu$, so we need to compute the posterior distribution of $\mu$ for this submodel. We have seen such submodels before, in Section 5. Thus, with $\bar{\theta}$ denoting the average of $\theta_1, \ldots, \theta_m$,

$$\mu \mid \theta_1, \ldots, \theta_m, \tau^2 \sim \text{normal}\left( \frac{m \bar{\theta}/\tau^2 + \mu_0/\gamma_0^2}{m/\tau^2 + 1/\gamma_0^2},\; (m/\tau^2 + 1/\gamma_0^2)^{-1} \right),$$
$$1/\tau^2 \mid \theta_1, \ldots, \theta_m, \mu \sim \text{gamma}\Big( (\eta_0 + m)/2,\; \eta_0 \tau_0^2/2 + \sum_{j=1}^{m} (\theta_j - \mu)^2/2 \Big).$$

Full conditional distribution of $\theta_j$, $j = 1, \ldots, m$: $\theta_j$ represents the mean for group $j$. It is useful to note that, given $\mu, \tau^2, \sigma^2, y_j$, the parameter $\theta_j$ must be conditionally independent of all other mean parameters $\theta$'s, as well as the data from groups other than $j$. In fact,

$$p(\theta_j \mid \mu, \tau^2, \sigma^2, y_1, \ldots, y_m) \propto p(\theta_j \mid \mu, \tau^2) \prod_{i=1}^{n_j} p(y_{ji} \mid \theta_j, \sigma^2).$$

We can view this as the posterior distribution for the normal sampling model for group $j$, given the $n_j$-data sample from this group only. Let $\bar{y}_j$ denote the sample mean for group $j$; then

$$\theta_j \mid \mu, \tau^2, \sigma^2, y_{j1}, \ldots, y_{j n_j} \sim \text{normal}\left( \frac{n_j \bar{y}_j/\sigma^2 + \mu/\tau^2}{n_j/\sigma^2 + 1/\tau^2},\; (n_j/\sigma^2 + 1/\tau^2)^{-1} \right). \qquad (36)$$

Full conditional distribution of $\sigma^2$: $\sigma^2$ represents the shared within-group variance for all groups. Note that $\sigma^2$ is conditionally independent of $\mu, \tau^2$ given $y_1, \ldots, y_m, \theta_1, \ldots, \theta_m$. We find that

$$p(\sigma^2 \mid \theta_1, \ldots, \theta_m, y_1, \ldots, y_m) \propto p(\sigma^2) \prod_{j=1}^{m} \prod_{i=1}^{n_j} p(y_{ji} \mid \theta_j, \sigma^2)$$
$$\propto (\sigma^2)^{-(\nu_0/2 + 1)}\, e^{-\frac{\nu_0 \sigma_0^2}{2 \sigma^2}}\, (\sigma^2)^{-\sum_j n_j/2} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ji} - \theta_j)^2 \Big\},$$

so

$$1/\sigma^2 \mid \theta, y_1, \ldots, y_m \sim \text{gamma}\Big( \big(\nu_0 + \sum_{j=1}^{m} n_j\big)/2,\; \nu_0 \sigma_0^2/2 + \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ji} - \theta_j)^2/2 \Big).$$

Note that the double sum term is the sum of squared residuals across all groups, conditional on the within-group means, so the (full) conditional distribution of $\sigma^2$ concentrates probability around a pooled-sample estimate of the variance. This makes sense, because $\sigma^2$ is the same variance parameter shared across all groups according to our model.

8.4.2 Example: Math scores in U.S.
public schools

We return to the analysis of math scores examined in Hoff (2009). The setting is as follows:

• there are 100 large urban public high schools, all having a 10th grade enrollment of 400 or larger.

Figure 8.3: ELS data.

• the average score per school ranges from 36.6 to 65.0.

Figure 8.4: Empirical distribution of sample means and relationship with sample size.

• extreme average scores tend to be associated with low sample sizes. This is a common phenomenon for hierarchical data sets (how?)

Prior specification and posterior approximation

• recall our hierarchical model

$$\mu, \tau^2, \sigma^2 \sim p(\mu)\, p(\tau^2)\, p(\sigma^2) \quad \text{(prior distribution)}$$
$$\theta_1, \ldots, \theta_m \mid \mu, \tau^2 \overset{i.i.d.}{\sim} \text{normal}(\mu, \tau^2) \quad \text{(between-group sampling variability)}$$
$$Y_{j1}, \ldots, Y_{j n_j} \mid \theta_j, \sigma^2 \overset{i.i.d.}{\sim} \text{normal}(\theta_j, \sigma^2), \; j = 1, \ldots, m \quad \text{(within-group sampling variability)}.$$

• we need to provide hyperparameters for the semi-conjugate priors

$$1/\sigma^2 \sim \text{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2), \qquad 1/\tau^2 \sim \text{gamma}(\eta_0/2, \eta_0 \tau_0^2/2), \qquad \mu \sim \text{normal}(\mu_0, \gamma_0^2).$$

– the math exam was designed to give a nationwide variance of 100, so we set $\sigma_0^2 = 100$. For a diffuse prior for the variance, we set $\nu_0 = 1$.

– for the between-group variance: we set $\tau_0^2 = 100$ and $\eta_0 = 1$.

– for the global mean: we set $\mu_0 = 50$, $\gamma_0^2 = 25$ (so the prior probability that $\mu$ is in $(\mu_0 - 2\gamma_0, \mu_0 + 2\gamma_0) = (40, 60)$ is about 95%).

• the previous subsection gave the derivations of all full conditional distributions required for the implementation of a Gibbs sampler.

MCMC diagnostics

• run the Gibbs sampler for 5000 iterations. Fig. 8.5 shows the boxplots for batches of 500 consecutive MCMC samples (e.g., $\{1, \ldots, 500\}$, $\{501, \ldots, 1000\}$, and so on). There does not seem to be any evidence that the chain has not achieved stationarity.

Figure 8.5: Stationarity plots of the MCMC samples of $\mu, \sigma^2, \tau^2$.

• the lag-1 autocorrelations for the sequences of $\mu$, $\sigma^2$ and $\tau^2$ are 0.15, 0.053, and 0.312, respectively.

• the effective sample sizes are 3706, 4499, and 2503, respectively.
• the approximate Monte Carlo standard errors (MC std) can be obtained by dividing the approximated posterior standard deviations by the square roots of the effective sample sizes, resulting in 0.009, 0.004, 0.09 for $\mu, \sigma^2, \tau^2$, respectively. These are small compared to the posterior means of these parameters (Fig. 8.6).

Figure 8.6: Marginal posteriors with 2.5%, 50% and 97.5% quantiles.

• for $\theta$: we found that the ESS for the 100 sequences of $\theta$-values ranged between 3,500 and 6,000, with the MC std ranging between 0.02 and 0.05.

Posterior summaries and shrinkage

• The posterior means of $\mu$, $\sigma$ and $\tau$ are 48.12, 9.21 and 4.97, respectively. Recalling the meaning of these parameters, this indicates that roughly 95% of scores within a classroom are within $4 \times 9.21 \approx 37$ points of each other, whereas 95% of the average classroom scores (across schools) are within $4 \times 4.97 \approx 20$ points of each other.

• The shrinkage effect: recall that, conditional on $\mu, \tau^2, \sigma^2$, the expected value of $\theta_j$ is a weighted average of $\bar{y}_j$ and $\mu$ (cf. Eq. (36)):

$$E[\theta_j \mid y_j, \mu, \tau, \sigma] = \frac{n_j \bar{y}_j/\sigma^2 + \mu/\tau^2}{n_j/\sigma^2 + 1/\tau^2}.$$

As a result, the expected value of $\theta_j$ is pulled away from the sample mean $\bar{y}_j$ toward the global mean $\mu$. This is called the shrinkage effect: the parameter estimates are "shrunk" toward the global mean. How strong this effect is depends in part on the sample size $n_j$: the smaller $n_j$ is, the more weight is placed on $\mu$.

Figure 8.7: Shrinkage as a function of sample size.

• Fig. 8.7 illustrates the amount of shrinkage for different groups. The left panel shows that the groups with large sample means are "pulled down" a bit, while the groups with low sample means are "pushed up". The right panel shows that groups with small sample sizes receive the largest amount of shrinkage $|\bar{y}_j - \hat{\theta}_j|$.

– for this reason we say that hierarchical modeling facilitates the "borrowing of strength": in particular, the groups with small sample size borrow information from the groups with large sample size.
In theory, it has been shown that the borrowing of strength (also called sharing of information) results in more robust and efficient inference.

Back to the question of ranking

• We may rank all schools according to the posterior expectations

$$\{E[\theta_1 \mid y_1, \ldots, y_m], \ldots, E[\theta_m \mid y_1, \ldots, y_m]\}.$$

Alternatively, one may simply rank all schools according to the sample means $\bar{y}_1, \ldots, \bar{y}_m$.

• Although these two rankings would be quite similar, there are differences.

• Let's consider two schools: school 46 and school 82; these two schools are in the bottom 10% of the 100 schools in the data set. The sample means are

$$\bar{y}_{46} = 40.18 > \bar{y}_{82} = 38.76.$$

However, in terms of posterior expectation, the ranking would be different:

$$E[\theta_{46} \mid y_1, \ldots, y_m] = 41.31 < E[\theta_{82} \mid y_1, \ldots, y_m] = 42.53.$$

• We observe the effects of shrinkage: $n_{46} = 21$, while $n_{82} = 5$. School 82 receives a larger amount of shrinkage toward the global mean ($E[\mu \mid y_1, \ldots, y_m] = 48.11$) than school 46 does, resulting in a "reversal" in the ranking.

Figure 8.8: Data and posterior distributions for two schools.

• Does this make sense?

– there is more uncertainty about school 82's average score due to its low sample size.

– suppose that on the day of the exam, the student who got the lowest exam score from school 82 had not shown up; then the sample mean would have been 41.99, a change of more than three points from 38.76. In the case of school 46, the sample mean would have been 40.9, a change of only three quarters of a point. So, while we are more certain about the average score of school 46, we are less certain about that of school 82, which results in a larger amount of shrinkage toward the global mean.

– to some, this ranking may seem unfair. However, it reflects an objective fact: there is more evidence that $\theta_{46}$ is exceptionally low than there is for $\theta_{82}$.
– An example in sport: on any basketball team, there are "bench" players who get very little playing time, many of whom have taken only a few free throws in their entire career, resulting in a very high free throw shooting percentage, e.g., 100%. Yet, the coach, when given an opportunity for a free throw (during a technical foul), will likely choose a veteran player, despite that player having a lower shooting percentage, say 87%. This is because coaches recognize that the bench player's true free throw percentage is nowhere near the "sample mean" of 100%.

8.5 Topic models

We will study hierarchical models for discrete data, such as texts, images and biological data. The class of models that we consider is known as topic models (footnote 7) and finite admixtures (footnote 8). The paper by Blei and coauthors was motivated by information retrieval and machine learning for texts and images. It also develops variational inference for this particular class of models. The paper by Pritchard and coauthors was motivated by population genetics applications and makes use of Gibbs sampling for posterior inference. Both are extremely well known (and combine for more than 60,000 citations on Google Scholar).

8.5.1 Model formulation

We begin with some notation.

• The random variable $W \in \{1, \ldots, V\}$ represents a word in a vocabulary, where $V$ is the size of the vocabulary.

• A document is a collection of words denoted by $W = (W_1, \ldots, W_N)$. Although we write $W$ as if it is a sequence, the ordering of the words does not matter in the modeling that we introduce here.

• A corpus is a collection of documents $(W_1, \ldots, W_M)$. For each document $m = 1, \ldots, M$, let $N_m$ be the document length.

A topic model is essentially a hierarchical model for discrete data that can be viewed as a hierarchical mixture model (for discrete random variables). Each mixing component of the model will be referred to as a topic. Thus a topic is a particular distribution over words, and a document can be described as a mixture of topics.

7 D. Blei, A. Ng and M.
I. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993–1022, 2003.

8 J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multi-locus genotype data. Genetics, 155:945–959, 2000.

[Figure: an example document from the AP corpus (Blei, Ng, Jordan, 2003), and the output after feeding such documents to the Latent Dirichlet Allocation (LDA) model.]

[Figure: another example document from the Science corpus (1880–2002) (Blei & Lafferty, 2009).]

Topic models, such as latent Dirichlet allocation and its variants, are a popular tool for modeling and mining patterns from texts in news articles, scientific papers and blogs, but also tweets, query logs, digital books, metadata records, and so on.

Figure 8.9: Graphical representation of the unigram (left) and mixture of unigrams (right).

Before we describe the latent Dirichlet allocation model, let us start with simpler precursors.

Unigram model

For any document $W$, assume

$$W = (W_1, \ldots, W_N) \mid \theta \overset{i.i.d.}{\sim} \text{Cat}(\theta).$$

In other words, $\theta$ is the (same) word frequency vector that characterizes each document in the corpus. Thus, the corpus generated this way is implicitly assumed to have only one topic. See Fig. 8.9.

Mixture of unigrams

Each document $W_d$ is associated with a latent topic variable $Z_d$. Suppose that there are $K$ topics, where $K$ is given. Assume that $Z_d \mid \pi \sim \text{Cat}(\pi)$. Now given $Z_d$, we assume that the words of the document are i.i.d. with

$$W_d \mid Z_d = k \sim \text{Cat}(\beta_k),$$

where the parameter $\beta_k \in \Delta^{V-1}$ is the frequency vector associated with topic $k$. This is nothing but a mixture of discrete distributions. The parameters of interest are $\{\beta_k\}_{k=1}^{K}$ and $\pi$.

In both models, the documents are assumed to be an i.i.d. sample from fairly simple distributions on the vocabulary of words. Both of the above models were utilized in the early days of "natural language processing" (NLP), a field in artificial intelligence that focuses on the analysis of texts.

Latent Dirichlet Allocation (LDA)

LDA is an instance of hierarchical modeling.
It was in fact motivated by de Finetti's theorem. Given the hierarchical view of the text corpus, we assume that the documents are exchangeable. Moreover, within each document the words are assumed to be exchangeable. The exchangeability assumption can be questioned, as we discussed in a previous subsection. However, this is an important step up from the previous i.i.d. assumption. Moreover, exchangeability is not an unreasonable assumption if we do not want to capture aspects of the data that violate exchangeability (such as the ordering of words, or of documents).

From de Finetti's theorem, we expect a hierarchical model specification for the words and then for the documents. Originally, LDA is described as a generative process. To generate a document $W$, one proceeds as follows:

• For a document, generate the length $N$ of $W$ from a Poisson distribution: $N \sim \text{Poisson}(\lambda)$.

• For some parameters $\alpha_1, \ldots, \alpha_K > 0$, let $\theta$ represent the "topic proportion" for document $W$: $\theta \mid \alpha \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$.

• Given $N$ and $\theta$ associated with the document, for each word index $n = 1, \ldots, N$,

$$Z_n \mid \theta \overset{i.i.d.}{\sim} \text{Cat}(\theta), \qquad W_n \mid Z_n = k, \beta \sim \text{Cat}(\beta_k).$$

In the above, we use $\beta_k$ to denote row vector $k$ of the $K \times V$ matrix $\beta$. In particular, $\beta_k$ represents the distribution over the vocabulary for topic $k$. This means $\Pr(W_n = j \mid Z_n = k, \beta) = \beta_{kj}$. A graphical representation of this model is given in Fig. 8.10.

Figure 8.10: Latent Dirichlet Allocation Model.

There is a simpler geometric reformulation of LDA (see footnote 9). It goes like this. Each document $W = (W_1, \ldots, W_N)$ consists of words that are generated i.i.d. according to the following probability:

$$\Pr(W_n = j \mid \theta, \beta) = \sum_{k=1}^{K} \Pr(Z_n = k \mid \theta) \times \Pr(W_n = j \mid Z_n = k, \beta) = \sum_{k=1}^{K} \theta_k \beta_{kj}.$$

That is, the vector of word frequencies for document $W$ is $\sum_{k=1}^{K} \theta_k \beta_k \in \Delta^{V-1}$. This is a point that lies in the convex hull $G = \text{conv}(\beta_1, \ldots, \beta_K)$. Each extreme point $\beta_1, \ldots, \beta_K$ corresponds to the word frequency vector of a topic (e.g., "education", "politics", "sports").
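The generative process and its geometric reformulation can be sketched in NumPy as follows. This is only an illustration under assumed settings: the function name `generate_document`, the values of `K`, `V`, `alpha`, `lam`, and the randomly drawn topic matrix `beta` are hypothetical and not taken from the notes.

```python
import numpy as np

def generate_document(alpha, beta, lam, rng):
    """Generate one document following the LDA generative process."""
    K, V = beta.shape
    N = rng.poisson(lam)                    # document length N ~ Poisson(lam)
    theta = rng.dirichlet(alpha)            # topic proportions theta ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)      # Z_n | theta ~ Cat(theta)
    # W_n | Z_n = k ~ Cat(beta_k), where beta_k is row k of beta
    words = np.array([rng.choice(V, p=beta[k]) for k in z], dtype=int)
    return theta, z, words

# Hypothetical settings chosen only for illustration
rng = np.random.default_rng(1)
K, V = 3, 6
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)    # K x V matrix; each row is a topic

theta, z, words = generate_document(alpha, beta, lam=50, rng=rng)

# Geometric view: the document's word-frequency vector sum_k theta_k beta_k,
# a point in the convex hull of the rows of beta
word_freq = theta @ beta
print(word_freq, word_freq.sum())
```

Marginalizing over the topic labels `z`, each word of the document is a draw from `Cat(word_freq)`, which is the mixture identity stated above.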
Given the convex hull $G$, a document corresponds to a point randomly drawn from $G$. The randomness is due to the random weight vector $\theta \in \Delta^{K-1}$, which follows a Dirichlet distribution.

Footnote 9: J. Tang, Z. Meng, X. Nguyen, Q. Mei and M. Zhang. Understanding the limiting factors of topic modeling via posterior contraction analysis. Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

8.5.2 Posterior inference

We have seen how LDA is composed of familiar building blocks: Poisson for the document length, multinomial/categorical distributions for the topic-specific distributions over words, Dirichlet for the topic proportions, as well as suitable prior distributions for the parameters of interest. Posterior inference is computationally challenging, due to the presence of mixed data types (categorical and continuous-valued). Moreover, the model is typically applied to large collections of documents, and later on to images, genomes and all sorts of large-scale data types.

There are two computational tasks:
1. Compute the posterior distribution $P(\theta, Z \mid W, \alpha, \beta)$.
2. Estimate $\alpha, \beta$ from the data.

The posterior distribution can be rewritten as
\[ P(\theta, Z \mid W, \alpha, \beta) = \frac{P(\theta, Z, W \mid \alpha, \beta)}{P(W \mid \alpha, \beta)}. \tag{37} \]
The numerator in the above display is easy to compute:
\[ P(\theta, Z, W \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{n=1}^N P(Z_n \mid \theta)\, P(W_n \mid Z_n, \beta). \tag{38} \]
However, the denominator
\[ p(W \mid \alpha, \beta) = \int \sum_{Z_1, \dots, Z_N} P(\theta, Z, W \mid \alpha, \beta)\, d\theta = \frac{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \int \left( \prod_{i=1}^K \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^N \sum_{k=1}^K \prod_{j=1}^V (\theta_k \beta_{kj})^{I\{W_n = j\}} \right) d\theta \tag{39} \]
is much harder to compute because we must integrate out all the latent variables of mixed types.

Exercise. Derive a Gibbs sampling algorithm for the LDA model. For this purpose, we need to endow the parameters $\alpha$ and $\beta$ with prior distributions.

Although the Gibbs sampler is easy to derive, the Markov chains it produces may take a long time to mix (due to the large number of latent variables to be sampled).
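To make the building blocks concrete, here is a minimal sketch that simulates one document from the generative process of LDA and then evaluates the log of the numerator in Eq. (38), together with the word probabilities $\sum_k \theta_k \beta_{kj}$ of the geometric reformulation. The values of alpha, beta and lam, and all function names, are illustrative assumptions and do not come from any data set in these notes. The Poisson sampler uses Knuth's multiplication method, which is an implementation choice that is only adequate for small rates.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_poisson(lam, rng):
    """Knuth's multiplication method (assumed helper; fine for small lam)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def generate_document(alpha, beta, lam, rng):
    """One pass through the LDA generative process for a single document."""
    n_words = sample_poisson(lam, rng)           # N ~ Poisson(lam)
    theta = sample_dirichlet(alpha, rng)         # theta | alpha ~ Dir(alpha)
    topics = list(range(len(alpha)))
    vocab = list(range(len(beta[0])))
    z = [rng.choices(topics, weights=theta)[0] for _ in range(n_words)]  # Z_n | theta
    w = [rng.choices(vocab, weights=beta[k])[0] for k in z]              # W_n | Z_n
    return theta, z, w

def log_joint(theta, z, w, alpha, beta):
    """log P(theta, Z, W | alpha, beta), the numerator in Eq. (38)."""
    log_dir = (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
               + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta)))
    log_words = sum(math.log(theta[k]) + math.log(beta[k][j]) for k, j in zip(z, w))
    return log_dir + log_words

def word_probabilities(theta, beta):
    """Pr(W_n = j | theta, beta) = sum_k theta_k beta_kj, a point in conv(beta_1..beta_K)."""
    return [sum(t * row[j] for t, row in zip(theta, beta)) for j in range(len(beta[0]))]

# Hypothetical toy setting: K = 2 topics over a vocabulary of V = 3 words.
rng = random.Random(0)
alpha = [2.0, 2.0]
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
theta, z, w = generate_document(alpha, beta, lam=8.0, rng=rng)
doc_log_joint = log_joint(theta, z, w, alpha, beta)
doc_word_probs = word_probabilities(theta, beta)
```

Evaluating the numerator costs one pass over the words. The denominator in Eq. (39), by contrast, would require summing over all $K^N$ topic assignments and integrating over $\theta$, which is why approximate inference is needed.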
An alternative is variational inference, a general method for approximating posterior distributions based on optimization. We will introduce this method in the context of LDA next. Note that the state-of-the-art method for learning specifically the LDA model and its extensions, both in terms of parameter estimation accuracy and computational efficiency, appears to be a geometric algorithm of Yurochkin et al. (see footnote 10).

Footnote 10: M. Yurochkin, A. Guha, Y. Sun and X. Nguyen. Dirichlet simplex nest and geometric inference. Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

8.5.3 Variational Bayes

Variational inference is a general computational technique for inference with complex models, in which the problems of model fitting and probabilistic inference (tasks 1 and 2 in Section 8.5.2) are reformulated as an optimization problem. When applied to the approximate computation of the posterior distribution, we call this "variational Bayes". The strength of variational Bayes is that it is applicable to a very broad range of (complex) Bayesian models, and it is fast compared to sampling-based techniques such as MCMC. While fast, it may not be as accurate as MCMC if the latter is run for a sufficiently long time.

We shall now illustrate the variational Bayes technique on topic models. The basic idea is as follows:
(1) Consider a family of simplified distributions $Q = \{q(\theta, Z \mid W)\}$.
(2) Choose the member of $Q$ that is closest to the true posterior:
\[ q^* := \operatorname{argmin}_{q \in Q} \mathrm{KL}(q \,\|\, p(\theta, Z \mid W, \alpha, \beta)). \tag{40} \]
(3) Use $q^*$ as the surrogate for the true posterior $p(\theta, Z \mid W, \alpha, \beta)$ for subsequent inferential purposes.

In the above display, KL denotes the Kullback-Leibler divergence: given two distributions with corresponding probability density functions $f$ and $g$ on some common space, the KL divergence is given by
\[ \mathrm{KL}(f \,\|\, g) = E_f \log\frac{f(X)}{g(X)} = \int f(x) \log\frac{f(x)}{g(x)}\, dx. \]
Although the Kullback-Leibler divergence is not symmetric, it is always non-negative.
Moreover, $\mathrm{KL}(f \,\|\, g) = 0$ iff $f(x) = g(x)$ for almost all $x$. The KL divergence is a fundamental quantity that measures how far $g$ is from $f$.

It is somewhat surprising but not difficult to verify that the optimization problem given in Eq. (40) becomes relatively tractable when the class of approximating distributions $Q$ takes a sufficiently simple form. The simplest choice for $Q$ is the family of "factorized" distributions: each $q \in Q$ satisfies
\[ q(\theta, Z \mid W, \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^N q(Z_n \mid \phi_n). \tag{41} \]
Here, the parameters $\gamma$ and $\phi = (\phi_1, \dots, \phi_N)$ are called variational parameters. They are optimized according to the KL objective so as to obtain as tight an approximation to the true posterior as possible:
\[ (\gamma^*, \phi^*) := \operatorname{argmin}_{\gamma, \phi} \mathrm{KL}(q(\theta, Z \mid \gamma, \phi) \,\|\, p(\theta, Z \mid W, \alpha, \beta)). \tag{42} \]
A few words about the roles of the variational parameters $\gamma$ and $\phi$: recall that $\theta \in \Delta^{K-1}$. Here, we shall take $q(\theta \mid \gamma)$ to be Dirichlet with parameters $\gamma \in \mathbb{R}_+^K$. Similarly, for each $n = 1, \dots, N$, $q(Z_n \mid \phi_n)$ is taken to be a categorical distribution, where the parameter $\phi_n = (\phi_{n1}, \dots, \phi_{nK})$ is such that under $q$:
\[ q(Z_n = i \mid \phi_n) = \phi_{ni}, \qquad n = 1, \dots, N, \quad i = 1, \dots, K. \tag{43} \]

Optimization algorithm for variational Bayes

We will show that the optimization in Eq. (42) can be solved by coordinate descent, by iteratively applying the following updating equations: for $n = 1, \dots, N$ and $i = 1, \dots, K$,
\[ \gamma_i = \alpha_i + \sum_{n=1}^N \phi_{ni}, \tag{44} \]
\[ \phi_{ni} \propto \beta_{i W_n} \exp\{E_q[\log \theta_i \mid \gamma]\}. \tag{45} \]
Thus, the algorithm is fairly simple to implement: initialize the variational parameters $\gamma, \phi$ in some fashion, and then keep updating them via the above equations until convergence.

Some remarks
(1) In the updating equation for $\phi_{ni}$, since $\theta \mid \gamma \sim \mathrm{Dirichlet}(\gamma)$, it is a simple fact of the Dirichlet distribution that
\[ E[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^K \gamma_j\Big), \tag{46} \]
where $\Psi$ is the digamma function, $\Psi(x) = \frac{d}{dx} \log \Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$.
(2) Note the roles of the data $W_n$ in the two updating equations: $W_n$ enters the update of $\phi_{ni}$ directly through $\beta_{i W_n}$, and enters the update of $\gamma$ only through $\phi$.
(3) The updating equations are reminiscent of the Gibbs sampler's updates for semi-conjugate priors, except that here the updates are deterministic (subject to initialization). The fact that we are optimizing rather than sampling makes this approximate inference technique computationally more efficient than MCMC.

The remaining pages in this section are devoted to the derivation of the algorithm and can be skipped at first reading.

The first step is to note that the minimization of the KL divergence in Eq. (42) can equivalently be viewed as the maximization of a lower bound on the log likelihood function of the original LDA model. Indeed, by Jensen's inequality,
\begin{align*}
\log p(W \mid \alpha, \beta) &= \log \int \sum_Z p(\theta, Z, W \mid \alpha, \beta)\, d\theta \\
&= \log \int \sum_Z q(\theta, Z)\, \frac{p(\theta, Z, W \mid \alpha, \beta)}{q(\theta, Z)}\, d\theta \\
&\ge \int \sum_Z q(\theta, Z) \log \frac{p(\theta, Z, W \mid \alpha, \beta)}{q(\theta, Z)}\, d\theta \\
&= \int \sum_Z q(\theta, Z) \log p(\theta, Z, W \mid \alpha, \beta)\, d\theta - \int \sum_Z q(\theta, Z) \log q(\theta, Z)\, d\theta \\
&= E_q \log p(\theta, Z, W \mid \alpha, \beta) - E_q \log q(\theta, Z) =: L(\gamma, \phi; \alpha, \beta).
\end{align*}
We immediately see that the difference between the two sides of the above inequality is
\begin{align*}
\log p(W \mid \alpha, \beta) - L(\gamma, \phi; \alpha, \beta) &= E_q \left[ \log q(\theta, Z) - \log \frac{p(\theta, Z, W \mid \alpha, \beta)}{p(W \mid \alpha, \beta)} \right] \\
&= E_q \left[ \log q(\theta, Z) - \log p(\theta, Z \mid W, \alpha, \beta) \right] \\
&= \mathrm{KL}(q(\theta, Z) \,\|\, p(\theta, Z \mid W, \alpha, \beta)),
\end{align*}
so minimizing the KL divergence in Eq. (42) is equivalent to
\[ \max_{\gamma, \phi} L(\gamma, \phi; \alpha, \beta). \]

The second step is to note that the quantities in $L(\gamma, \phi; \alpha, \beta)$ are relatively easy to compute and optimize, because the (full) joint probability distribution $p(\theta, Z, W \mid \alpha, \beta)$ factorizes into marginal and conditional distributions, while $q$ also factorizes by our choice of approximation. Indeed,
\[ \log p(\theta, Z, W \mid \alpha, \beta) = \log p(\theta \mid \alpha) + \sum_{n=1}^N \{\log p(Z_n \mid \theta) + \log p(W_n \mid Z_n, \beta)\}, \]
so taking the expectation with respect to the $q$ distribution we obtain
\[ L(\gamma, \phi; \alpha, \beta) = E_q \log p(\theta \mid \alpha) + \sum_{n=1}^N \{E_q \log p(Z_n \mid \theta) + E_q \log p(W_n \mid Z_n, \beta)\} - E_q \log q(\theta \mid \gamma) - \sum_{n=1}^N E_q \log q(Z_n \mid \phi_n). \]
Now, we proceed to compute each of the quantities in the above display.
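Before working through the individual terms, it may help to see the algorithm that these computations justify. The following is a minimal sketch of the coordinate updates (44) and (45), using the digamma identity (46), for a single document with $\alpha$ and $\beta$ held fixed. The function names, the initialization ($\phi_{ni} = 1/K$ and $\gamma_i = \alpha_i + N/K$), the stopping rule and the toy inputs are assumptions made for illustration. The digamma function is approximated by a recurrence followed by an asymptotic series, since the Python standard library does not provide it.

```python
import math

def digamma(x):
    """Approximate Psi(x) for x > 0: shift x upward with Psi(x) = Psi(x+1) - 1/x,
    then apply the asymptotic series log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def variational_updates(words, alpha, beta, n_iter=200, tol=1e-10):
    """Iterate Eqs. (44)-(45) for one document; words are vocabulary indices."""
    K, N = len(alpha), len(words)
    phi = [[1.0 / K] * K for _ in range(N)]      # assumed initialization
    gamma = [a + N / K for a in alpha]           # assumed initialization
    for _ in range(n_iter):
        # Eq. (46): E_q[log theta_i | gamma] = Psi(gamma_i) - Psi(sum_j gamma_j)
        psi_sum = digamma(sum(gamma))
        weights = [math.exp(digamma(g) - psi_sum) for g in gamma]
        # Eq. (45): phi_ni proportional to beta_{i, W_n} * exp(E_q[log theta_i | gamma])
        for n, w in enumerate(words):
            row = [beta[i][w] * weights[i] for i in range(K)]
            total = sum(row)
            phi[n] = [r / total for r in row]
        # Eq. (44): gamma_i = alpha_i + sum_n phi_ni
        new_gamma = [alpha[i] + sum(phi[n][i] for n in range(N)) for i in range(K)]
        change = max(abs(a - b) for a, b in zip(new_gamma, gamma))
        gamma = new_gamma
        if change < tol:
            break
    return gamma, phi

# Hypothetical toy document: three occurrences of word 0 and one of word 2.
alpha = [1.0, 1.0]
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
words = [0, 0, 0, 2]
gamma_star, phi_star = variational_updates(words, alpha, beta)
```

Because each row of $\phi$ sums to one, Eq. (44) implies $\sum_i \gamma_i = \sum_i \alpha_i + N$ after every sweep, which is a convenient sanity check on an implementation.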
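First,
\[ p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i - 1}, \]
so
\[ \log p(\theta \mid \alpha) = \sum_{i=1}^K (\alpha_i - 1) \log \theta_i + \log \Gamma\Big(\sum_{i=1}^K \alpha_i\Big) - \sum_{i=1}^K \log \Gamma(\alpha_i), \]
\[ E_q \log p(\theta \mid \alpha) = \sum_{i=1}^K (\alpha_i - 1) \Big( \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^K \gamma_j\Big) \Big) + \log \Gamma\Big(\sum_{i=1}^K \alpha_i\Big) - \sum_{i=1}^K \log \Gamma(\alpha_i). \tag{47} \]
Next up, we consider $\log p(Z_n \mid \theta)$: since $p(Z_n \mid \theta) = \prod_{i=1}^K \theta_i^{I(Z_n = i)}$,
\[ \log p(Z_n \mid \theta) = \sum_{i=1}^K I(Z_n = i) \log \theta_i, \qquad E_q \log p(Z_n \mid \theta) = \sum_{i=1}^K \phi_{ni} \Big( \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^K \gamma_j\Big) \Big), \]
where the last equality is due to (46). Continuing along,
\[ \log p(W_n \mid Z_n, \beta) = \log \prod_{i=1}^K \prod_{j=1}^V (\beta_{ij})^{I(W_n = j, Z_n = i)}, \]
so
\[ E_q \log p(W_n \mid Z_n, \beta) = \sum_{i=1}^K \sum_{j=1}^V I(W_n = j)\, \phi_{ni} \log \beta_{ij}. \]
In addition, we take care of $q(\theta \mid \gamma)$ and $q(Z_n \mid \phi_n)$:
\[ q(\theta \mid \gamma) = \frac{\Gamma\left(\sum_{i=1}^K \gamma_i\right)}{\prod_{i=1}^K \Gamma(\gamma_i)} \prod_{i=1}^K \theta_i^{\gamma_i - 1}, \]
so
\[ E_q \log q(\theta \mid \gamma) = \sum_{i=1}^K (\gamma_i - 1) \Big( \Psi(\gamma_i) - \Psi\Big(\sum_{j=1}^K \gamma_j\Big) \Big) + \log \Gamma\Big(\sum_{i=1}^K \gamma_i\Big) - \sum_{i=1}^K \log \Gamma(\gamma_i); \]
as well, $q(Z_n \mid \phi_n) = \prod_{i=1}^K \phi_{ni}^{I(Z_n = i)}$, so
\[ E_q \log q(Z_n \mid \phi_n) = \sum_{i=1}^K \phi_{ni} \log \phi_{ni}. \]

The final step: with all the terms in the expression for $L(\gamma, \phi; \alpha, \beta)$ computed (Eq. (47) and the displays following it), it remains to optimize $L$ with respect to the unknown variational parameters $\gamma$ and $\phi$:
\[ \max_{\gamma, \phi} \; L(\gamma, \phi; \alpha, \beta) \tag{48} \]
\[ \text{subject to } \sum_{i=1}^K \phi_{ni} = 1, \qquad n = 1, \dots, N. \tag{49} \]
Differentiate $L$ with respect to $\gamma$ and set the derivative to zero to obtain the updating equation (44) for $\gamma$. Differentiate the Lagrangian (which accounts for the equality constraints on $\phi_n$) with respect to $\phi_n$ and set the derivative to zero to obtain the updating equation (45) for $\phi_n$. Iterating these updates until convergence yields the estimates $(\gamma^*, \phi^*)$.

Thus, we have accomplished the task of approximating the true posterior $p(\theta, Z \mid W, \alpha, \beta)$ by means of the surrogate $q(\theta, Z \mid \gamma^*, \phi^*)$. The second task, estimating the parameters $\alpha, \beta$, can be done in a similar fashion. See Blei et al. (2003) for details.

9 Linear regression

9.1 Linear regression model

A regression problem is concerned with the relationship between a response variable $Y$ and a collection of explanatory variables $x = (x_1, \dots, x_p)$.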
[Figure 9.1: Change in maximal oxygen uptake as a function of age and exercise program.]

Example 9.1. Twelve healthy men who did not exercise regularly were recruited to take part in a study of the effects of two different exercise regimens on oxygen uptake. The maximum oxygen uptake (liters per minute) of each subject was measured while running on an inclined treadmill, both before and after the program. See Fig. 9.1.

A linear regression model assumes that $E[Y \mid x]$ takes a linear form:
\[ E[Y \mid x] = \int y\, p(y \mid x)\, dy = \beta_1 x_1 + \dots + \beta_p x_p = \beta^\top x. \]
In the above example, the explanatory variables (covariates) $x$ may be taken to be
\begin{align*}
x_1 &= 1, \\
x_2 &= 0 \text{ if the subject is on the running program, } 1 \text{ if on aerobic}, \\
x_3 &= \text{age of subject}, \\
x_4 &= x_2 \times x_3.
\end{align*}
We have not specified the distribution $p(y \mid x)$ beyond its conditional expectation. The normal linear regression model posits that, in addition to $E[Y \mid x]$ being linear, the sampling variability around the mean is i.i.d. from a normal distribution:
\[ \epsilon_1, \dots, \epsilon_n \overset{\text{i.i.d.}}{\sim} \mathrm{normal}(0, \sigma^2), \qquad Y_i = \beta^\top x_i + \epsilon_i. \]
This gives the conditional likelihood given the $n$-sample (note that nothing is said about the marginal distribution of the covariates $x$):
\begin{align*}
p(y_1, \dots, y_n \mid x_1, \dots, x_n, \beta, \sigma^2) &= \prod_{i=1}^n p(y_i \mid x_i, \beta, \sigma^2) \\
&= (2\pi\sigma^2)^{-n/2} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta^\top x_i)^2 \Big\}.
\end{align*}
In customary matrix notation: $y = (y_1, \dots, y_n)^\top$ is an $n \times 1$ column vector, and $X$ is the $n \times p$ design matrix whose $i$th row is $x_i$. Then the above can be written as
\[ y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I), \]
where $I$ is the $n \times n$ identity matrix.

The parameter vector $\beta$ may be estimated by minimizing the sum of squared residuals, $\mathrm{SSR}(\beta)$:
\begin{align*}
\mathrm{SSR}(\beta) &= \sum_{i=1}^n (y_i - \beta^\top x_i)^2 \\
&= (y - X\beta)^\top (y - X\beta) \\
&= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta.
\end{align*}
To minimize the above expression, we take the derivative with respect to $\beta$ and set it to zero:
\[ -2X^\top y + 2X^\top X\beta = 0, \]
resulting in $\beta = (X^\top X)^{-1} X^\top y$. The value $\hat\beta_{ols} = (X^\top X)^{-1} X^\top y$ is called the "ordinary least squares" (OLS) estimate of $\beta$.
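As a small numerical illustration, the normal equations $X^\top X \beta = X^\top y$ can be solved directly. The sketch below uses only the Python standard library, so it includes its own Gaussian elimination routine. The helper names (solve, ols, ssr) and the toy data are hypothetical and are not the oxygen uptake data.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)

def ssr(X, y, b):
    """Sum of squared residuals SSR(b) = sum_i (y_i - b^T x_i)^2."""
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2 for row, yi in zip(X, y))

# Hypothetical toy data lying exactly on the line y = 1 + 2 x (intercept column first).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat = ols(X, y)
```

When two columns of $X$ coincide, solve raises an error because $X^\top X$ is singular and the minimizer of the SSR is not unique.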
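The OLS estimate is unique as long as the $p \times p$ matrix $X^\top X$ is of full rank (and thus invertible). This happens exactly when the columns of the design matrix $X$ are linearly independent, which in particular requires $n \ge p$. The OLS estimate is a frequentist estimate, but it also plays a role in Bayesian estimation.

9.2 Semi-conjugate priors

The (conditional) likelihood function takes the form
\begin{align*}
p(y \mid X, \beta, \sigma^2) &\propto \exp\Big\{ -\frac{1}{2\sigma^2} \mathrm{SSR}(\beta) \Big\} \\
&= \exp\Big\{ -\frac{1}{2\sigma^2} (y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta) \Big\}.
\end{align*}
It is simple to see that a normal distribution can be used as a semi-conjugate prior for $\beta$. Let $\beta \sim N_p(\beta_0, \Sigma_0)$ a priori; then
\begin{align*}
p(\beta \mid y, X, \sigma^2) &\propto p(\beta) \times p(y \mid X, \beta, \sigma^2) \\
&\propto \exp\Big\{ -\frac{1}{2} (-2\beta^\top \Sigma_0^{-1} \beta_0 + \beta^\top \Sigma_0^{-1} \beta) \Big\} \times \exp\Big\{ -\frac{1}{2} (-2\beta^\top X^\top y/\sigma^2 + \beta^\top X^\top X\beta/\sigma^2) \Big\} \\
&= \exp\Big\{ \beta^\top (\Sigma_0^{-1}\beta_0 + X^\top y/\sigma^2) - \frac{1}{2} \beta^\top (\Sigma_0^{-1} + X^\top X/\sigma^2) \beta \Big\}.
\end{align*}
This is a multivariate normal density with
\begin{align*}
\mathrm{Var}[\beta \mid y, X, \sigma^2] &= (\Sigma_0^{-1} + X^\top X/\sigma^2)^{-1}, \tag{50a} \\
E[\beta \mid y, X, \sigma^2] &= (\Sigma_0^{-1} + X^\top X/\sigma^2)^{-1} (\Sigma_0^{-1}\beta_0 + X^\top y/\sigma^2). \tag{50b}
\end{align*}
It is a simple exercise to see that the posterior expectation is a combination of the prior expectation $\beta_0$ and the purely data-driven OLS estimate.

It is also simple to see that the inverse-gamma distribution can be used as a semi-conjugate prior for $\sigma^2$. Let $\gamma = 1/\sigma^2 \sim \mathrm{gamma}(\nu_0/2, \nu_0\sigma_0^2/2)$ a priori; then
\begin{align*}
p(\gamma \mid y, X, \beta) &\propto p(\gamma)\, p(y \mid X, \beta, \gamma) \\
&\propto \gamma^{\nu_0/2 - 1} \exp(-\gamma \nu_0 \sigma_0^2/2) \times \gamma^{n/2} \exp(-\gamma \times \mathrm{SSR}(\beta)/2) \\
&\propto \mathrm{gamma}\big((\nu_0 + n)/2, \; (\nu_0\sigma_0^2 + \mathrm{SSR}(\beta))/2\big).
\end{align*}
A Gibbs sampler is simple to implement. Each Gibbs update consists of the following: given the current values $\{\beta^{(s)}, \sigma^{2(s)}\}$, for $s = 1, 2, \dots$:
1. update $\beta^{(s+1)} \sim N_p(E[\beta \mid y, X, \sigma^{2(s)}], \mathrm{Var}[\beta \mid y, X, \sigma^{2(s)}])$.
2. update $\sigma^{2(s+1)} \sim \text{inverse-gamma}\big((\nu_0 + n)/2, \; (\nu_0\sigma_0^2 + \mathrm{SSR}(\beta^{(s+1)}))/2\big)$.

9.3 Objective priors

In regression analysis it may be difficult to come up with a suitable prior distribution on $\beta$ and $\sigma^2$.

Example 9.2. Continuing on the oxygen uptake example.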
Suppose we know from our prior knowledge (e.g., by consulting with experts on physiology) that males in their 20s have an oxygen uptake of around 150 liters per minute with a standard deviation of 15. We then take $150 \pm 2 \times 15 = (120, 180)$ as the prior expected range of the oxygen uptake distribution, and so the change in oxygen uptake lies within $(-60, 60)$ with high probability. Consider our subjects in the running group. This means the line $\beta_1 + \beta_3 x$ should produce values between $-60$ and $60$ for all values of $x$ between 20 and 30. A little algebra shows that we need a prior distribution on $\beta_1$ and $\beta_3$ so that $\beta_1 \in (-300, 300)$ and $\beta_3 \in (-12, 12)$ with high probability. From here we can find suitable prior hyper-parameters $\beta_0, \Sigma_0$. But this type of calculation becomes difficult when there are more explanatory variables.

When we are in such a scenario, i.e., when it is difficult to come up with an informative prior specification, one may consider a prior specification that contains as little information as possible. This is the spirit of objective Bayes (see footnote 11). For linear regression, there are a number of objective priors that are commonly used in practice.

Footnote 11: We encountered this notion for the first time when we were discussing improper priors in Section 5. The ideas behind the derivation of both the improper prior and the unit information prior are basically the same, but the latter has the advantage of being proper.

Unit information prior

A unit information prior is one that contains the same amount of information as would be contained in a single observation (Kass and Wasserman, 1995).

Recall $\hat\beta_{ols} = (X^\top X)^{-1} X^\top y$. Since $y \mid X, \beta \sim N_n(X\beta, \sigma^2 I)$, the variance (with $\beta$ held fixed) of $\hat\beta_{ols}$ is $\sigma^2 (X^\top X)^{-1}$. The precision of $\hat\beta_{ols}$ is its inverse variance: $(X^\top X)/\sigma^2$. Viewing this as the amount of information contained in $n$ observations, the amount of information in one observation should be $1/n$ as much. Thus, we set $\Sigma_0^{-1} = (X^\top X)/(n\sigma^2)$.
To complete the prior specification $\beta \sim N(\beta_0, \Sigma_0)$, we set $\beta_0 = \hat\beta_{ols}$.

In a similar way, the prior distribution of $\sigma^2$ is given by $\sigma^2 \sim \text{inverse-gamma}(\nu_0/2, \nu_0\sigma_0^2/2)$, where $\nu_0 = 1$ and $\sigma_0^2 := \hat\sigma^2_{ols}$, which is obtained as an unbiased estimate of $\sigma^2$:
\[ \hat\sigma^2_{ols} = \mathrm{SSR}(\hat\beta_{ols})/(n - p). \]

Some remarks
• The unit information prior is not purely Bayesian, since the prior is derived from the data. It provides some sort of protection against a misleading prior specification.
• However, it uses only a very small amount of the information gleaned from the data, due to the scaling of the information by $1/n$. Thus, its influence on the posterior inference is expected to be weak.

g-prior

The g-prior is another popular choice, proposed by Arnold Zellner. It is motivated by another principle of objective Bayesian statistics: the relevant distributions of interest should remain invariant to changes in the parameterization of the model (see footnote 12).

Example 9.3. Continuing on the regression model for oxygen uptake. Suppose that someone were to analyze the data using the explanatory variable $\tilde x_3$ = age in months, instead of $x_3$ = age in years. The role of this variable in the model for the response $Y$ is in the linear term $\tilde\beta_3 \tilde x_3$, as opposed to $\beta_3 x_3$. Since now $\tilde x_3 = 12 \times x_3$, it makes sense that the posterior distribution for $12 \times \tilde\beta_3$ in the model with $\tilde x_3$ should be the same as the posterior distribution for $\beta_3$ based on the model with $x_3$.

For many modelers, due to a lack of domain knowledge, the same form of prior specification may be given to $\tilde\beta_3$ as would be given to $\beta_3$. Thus, it is important to impose a kind of prior under which the posterior inference is robust against such rescaling of the explanatory variables.

Let us proceed to a formulation of the g-prior that arises in the normal linear regression model.
• Suppose $X$ is the given $n \times p$ design matrix. Under this design, $y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I)$.
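• Alternatively, due to a change of explanatory variables, $\tilde X = XH$ is a modified design matrix, for some (invertible) $p \times p$ matrix $H$. Under this design, $y \mid \tilde X, \tilde\beta, \sigma^2 \sim N_n(\tilde X \tilde\beta, \sigma^2 I) = N_n(XH\tilde\beta, \sigma^2 I)$.
• We need the same form of conditional prior on $\beta$ and $\tilde\beta$ (conditionally given $X$ or $\tilde X$) such that, under such a prior specification, the posterior distributions of $\beta$ and $H\tilde\beta$ are equal for all $H$:
\[ [\beta \mid y, X, \sigma^2] \overset{d}{=} [H\tilde\beta \mid y, \tilde X, \sigma^2]. \tag{51} \]

Footnote 12: Jeffreys' prior is another example.

Suppose the prior is of the form $\beta \sim N_p(\beta_0, \Sigma_0)$. Recall from Eq. (50) that the posterior distribution of $\beta$ is multivariate normal with
\begin{align*}
\mathrm{Var}[\beta \mid y, X, \sigma^2] &= (\Sigma_0^{-1} + X^\top X/\sigma^2)^{-1}, \tag{52a} \\
E[\beta \mid y, X, \sigma^2] &= (\Sigma_0^{-1} + X^\top X/\sigma^2)^{-1} (\Sigma_0^{-1}\beta_0 + X^\top y/\sigma^2). \tag{52b}
\end{align*}
It is easy to show that if we put $\beta_0 = 0$ and $\Sigma_0 = g\sigma^2 (X^\top X)^{-1}$, where $g > 0$ is an arbitrary constant, the invariance property expressed in Eq. (51) is satisfied (Exercise: verify this).
• To be clear, the prior for $\beta$ is $\beta \sim N_p(0, g\sigma^2 (X^\top X)^{-1})$. The prior for $\tilde\beta$ would be of the form $\tilde\beta \sim N_p(0, g\sigma^2 (\tilde X^\top \tilde X)^{-1})$.
• In fact,
\begin{align*}
\mathrm{Var}[\beta \mid y, X, \sigma^2] &= (X^\top X/(g\sigma^2) + X^\top X/\sigma^2)^{-1} = \frac{g}{g+1} \sigma^2 (X^\top X)^{-1} =: V; \\
E[\beta \mid y, X, \sigma^2] &= (X^\top X/(g\sigma^2) + X^\top X/\sigma^2)^{-1} (X^\top y/\sigma^2) = \frac{g}{g+1} (X^\top X)^{-1} X^\top y = \frac{g}{g+1} \hat\beta_{ols} =: m.
\end{align*}
In short,
\[ \beta \mid y, X, \sigma^2 \sim N_p(m, V). \tag{53} \]
• For $\sigma^2$, suppose that an inverse-gamma prior is given: $\sigma^2 \sim \text{inverse-gamma}(\nu_0/2, \nu_0\sigma_0^2/2)$. It is a very nice feature of the g-prior that the induced posterior distribution of $\sigma^2$ is again an inverse-gamma distribution (Exercise: verify this):
\[ [\sigma^2 \mid y, X] \sim \text{inverse-gamma}\big((\nu_0 + n)/2, \; (\nu_0\sigma_0^2 + \mathrm{SSR}_g)/2\big), \tag{54} \]
where
\[ \mathrm{SSR}_g := y^\top y - \sigma^2\, m^\top V^{-1} m = y^\top \Big( I - \frac{g}{g+1} X (X^\top X)^{-1} X^\top \Big) y. \tag{55} \]
(The factor $\sigma^2$ cancels the $\sigma^2$ contained in $V$, so $\mathrm{SSR}_g$ does not depend on $\sigma^2$.) When $g \to \infty$, this term tends to the SSR corresponding to the OLS estimate $\hat\beta_{ols}$.
• We observe a form of shrinkage for both parameters $\beta$ and $\sigma^2$.
• MCMC is not needed, as we can obtain Monte Carlo samples of $(\sigma^2, \beta)$ directly from the above computation: first draw $\sigma^2$ from (54), then draw $\beta$ from (53).

Example 9.4.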
Back to our example of regression analysis of the oxygen uptake data. Set the g-prior with $g = n = 12$, $\nu_0 = 1$, $\sigma_0^2 = \hat\sigma^2_{ols} = 8.54$. The posterior mean of $\beta$ does not depend on $\sigma^2$ and can be computed directly. The posterior standard deviations of these parameters can be obtained from Monte Carlo samples of $(\sigma^2, \beta)$.

Some observations:
• The posterior distributions (Fig. 9.2) seem to suggest only weak evidence of a difference between the two groups, as the 95% quantile-based posterior intervals for $\beta_2$ and $\beta_4$ both contain zero.
• However, there seems to be relatively strong evidence of an effect of age. According to our model, the average difference in $y$ between two people of the same age $x$ but in different training programs is $\beta_2 + \beta_4 x$. Box plots of the posterior distribution of this quantity are given for each $x$ (Fig. 9.3). They suggest strong evidence of a difference at young ages, but less so at older ones.

[Figure 9.2: Posterior distributions of β2 and β4, with the marginal prior distributions in gray.]

[Figure 9.3: Ninety-five percent confidence intervals for the difference in expected change scores between aerobic subjects and running subjects.]

For more details, see Hoff (2009).

9.4 Model selection

In regression problems we may encounter a large number of possible explanatory variables (regressors) $x_1, \dots, x_p$, many of which may be irrelevant to the response variable $y$. Although we may fit a regression model with all such potential regressors, doing so will likely produce a poor result in terms of both prediction and parameter estimation, due to overfitting. Thus, selecting only the most relevant subset of the variables $x_i$ for predictive and interpretative purposes is an extremely important task. The broad term for this task is "model selection".

Example 9.5. (Diabetes data) There are ten variables $x_1, \dots, x_{10}$ on a group of $n = 442$ diabetes patients, and a variable $y$ representing the disease progression measured one year after the baseline measurements $x_i$.
It is suspected that the relationship between the $x_i$ and $y$ may be nonlinear, so a common practice is to use a linear regression model with regressors $x_1, \dots, x_{10}$ (a.k.a. main effects), as well as nonlinear terms that represent the interactions between the main effects, namely $x_j x_k$, and the quadratic terms $x_j^2$, for $j, k = 1, \dots, 10$. One of the regressors, $x_2$ = sex, is binary, so $x_2^2$ is unnecessary. This gives a total of $p = 10 + \binom{10}{2} + 9 = 64$ potential regressors among $\{x_j, x_j^2, x_j x_k\}$.

Naive OLS approach

Randomly split the 442 diabetes subjects into 342 training samples and 100 test samples, resulting in a training data set $(y, X)$ and a test set $(y_{test}, X_{test})$. Apply the OLS approach to the training data with all 64 regressors to obtain $\hat\beta_{ols}$ (cf. Section 9.1), and then generate the predicted responses $\hat y_{test} = X_{test} \hat\beta_{ols}$.

The average squared predictive error is $\frac{1}{100} \|y_{test} - \hat y_{test}\|^2 = 0.67$. This is not good, since if we simply set the predicted responses to zero, our predictive error would already be $\frac{1}{100} \|y_{test}\|^2 = 0.97$.

[Figure 9.4: Left and middle panels: Predicted values and regression coefficients for the diabetes data via OLS. Right panel: Results based on a backwards elimination procedure.]

The second panel shows that most of the estimated regression coefficients are quite small; this suggests we should remove them. A simple way to do so is a greedy procedure known as backwards elimination.

Backwards elimination procedure

This is a sequential procedure for assessing the relevance of the regression coefficients based on the current model's fit, eliminating one variable at a time. A standard way of assessing the evidence that the true value of the coefficient $\beta_j$ is non-zero is via a t-statistic, which is obtained by dividing the OLS estimate $\hat\beta_j$ by its standard error. Since $\hat\beta = (X^\top X)^{-1} X^\top y$ and $y \mid X, \beta \sim N_n(X\beta, \sigma^2 I)$, so that $\mathrm{Var}[\hat\beta] = \sigma^2 (X^\top X)^{-1}$, we put
\[ t_j = \frac{\hat\beta_j}{\big( \hat\sigma^2 \, [(X^\top X)^{-1}]_{jj} \big)^{1/2}}. \]
(Note: $\hat\sigma^2$ is the corresponding OLS estimate of the residual variance $\sigma^2$.
Also, the response vector $y$ and all columns of $X$ have been centered to have mean zero.)

Now, if $|t_j|$ is below a certain cutoff threshold, $|t_j| < t_{cutoff}$, then the evidence for $\beta_j \neq 0$ is weak, and the variable $x_j$ is removed from the model. A version of the overall backwards elimination procedure is as follows:
1. Obtain the OLS estimate $\hat\beta$ and its t-statistics.
2. If there are any regressors $j$ such that $|t_j| \le t_{cutoff}$,
   a) find the regressor $j$ that has the smallest value of $|t_j|$ and remove column $j$ from $X$;
   b) return to step 1.
3. If $|t_j| > t_{cutoff}$ for all $j$, then stop.

Example 9.6. Apply this procedure to the diabetes data, using $t_{cutoff} = 1.65$ (corresponding roughly to a p-value of $2 \times 0.05 = 0.10$ according to a t distribution with a very large number of degrees of freedom, or the standard normal distribution). We obtain that 44 of the 64 variables are eliminated, leaving 20 variables in the regression model. The third plot of Fig. 9.4 shows $\hat y_{test}$ according to the reduced-model regression coefficients. The prediction error for the model is 0.53, which is an improvement over the standard OLS error of 0.67.

The backwards elimination procedure described above is a fast heuristic, but it may pick up many spurious associations between the selected $x_j$ and $y$.

Example 9.7. Let us consider the following experiment: we create a new data vector $\tilde y$ by randomly permuting the values of $y$. Thus, the value of $x_i$ has no effect on $\tilde y_i$, and there is no true association between $\tilde y$ and the columns of $X$. The left panel of Fig. 9.5 shows the t-statistics for one randomly generated permutation $\tilde y$ of $y$. Initially, only one regressor has a t-statistic greater than 1.65, but as we sequentially remove columns of $X$, the estimated variances of the remaining regression coefficients decrease and their t-statistics increase in value. With $t_{cutoff} = 1.65$, the procedure arrives at a regression model with 18 regressors. See the illustration in the right panel. All such regressors are spurious, of course.
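The procedure and the permutation experiment can be sketched in a few lines. The code below is an illustration on synthetic data, not on the diabetes data: the data-generating choices (six Gaussian regressors, only the first with a true effect), the helper names and the hand-written Gauss-Jordan inversion are all assumptions made so that the sketch runs with the standard library alone. The columns are only approximately centered.

```python
import random

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * u for v, u in zip(m[r], m[col])]
    return [row[n:] for row in m]

def t_statistics(X, y, cols):
    """OLS on the columns in `cols`; t_j = beta_hat_j / sqrt(sigma2_hat * [(X^T X)^{-1}]_jj)."""
    n, p = len(y), len(cols)
    Xs = [[row[c] for c in cols] for row in X]
    xtx_inv = invert([[sum(r[j] * r[k] for r in Xs) for k in range(p)] for j in range(p)])
    xty = [sum(r[j] * yi for r, yi in zip(Xs, y)) for j in range(p)]
    b = [sum(xtx_inv[j][k] * xty[k] for k in range(p)) for j in range(p)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(Xs, y)]
    sigma2 = sum(e * e for e in resid) / (n - p)
    return [b[j] / (sigma2 * xtx_inv[j][j]) ** 0.5 for j in range(p)]

def backwards_elimination(X, y, t_cutoff=1.65):
    """Repeatedly drop the regressor with the smallest |t_j| until all |t_j| > t_cutoff."""
    cols = list(range(len(X[0])))
    while cols:
        t = t_statistics(X, y, cols)
        j_min = min(range(len(cols)), key=lambda j: abs(t[j]))
        if abs(t[j_min]) > t_cutoff:
            break
        cols.pop(j_min)
    return cols

# Synthetic data (assumption): only column 0 has a true effect on y.
rng = random.Random(1)
n, p = 200, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] + rng.gauss(0.0, 1.0) for row in X]
kept = backwards_elimination(X, y)

# Permutation experiment: shuffling y destroys any association with X.
y_perm = y[:]
rng.shuffle(y_perm)
kept_perm = backwards_elimination(X, y_perm)
```

Any columns left in kept_perm are spurious by construction, which mirrors the point of Example 9.7.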
[Figure 9.5: t-statistics for the regression of ỹ on X, before and after backwards elimination.]

9.4.1 Bayesian model comparison

The Bayesian approach is conceptually straightforward: we do not know which variables are spurious and which are not; such information will be represented by random variables (parameters), which are then endowed with prior distributions. The model selection problem is then essentially no different from inference about unknown parameters.

Let $z_j = 0$ if the explanatory variable $x_j$ is spurious and $z_j = 1$ otherwise (that is, if $x_j$ is active). We may express the regression coefficients as $z_j \beta_j$, so the regression equation becomes
\[ y = z_1\beta_1 x_1 + \dots + z_p\beta_p x_p + \epsilon. \]
As before, the conditional distribution of the response is given by $Y \mid z, \beta, \sigma^2 \sim \mathrm{Normal}\big(\sum_{j=1}^p z_j\beta_j x_j, \sigma^2\big)$.

We need a prior specification for $\{z, \beta, \sigma^2\}$. The prior distribution over $z$ can be viewed as a prior over the space of models, while the conditional prior distribution of $\beta, \sigma^2$ given a model represented by $z$ can be specified as in the previous subsections, e.g., via semi-conjugate priors or objective priors. Then, by Bayes' rule, we can compute a posterior probability for each regression model:
\[ p(z \mid y, X) = \frac{p(z)\, p(y \mid X, z)}{\sum_{\tilde z} p(\tilde z)\, p(y \mid X, \tilde z)}. \tag{56} \]
The posterior computation may be challenging: the normalizing constant involves a summation over the space of potential models. Moreover, the computation of the marginal likelihood term $p(y \mid X, z)$ may be far from straightforward, due to the need to integrate over the remaining parameters $\beta$ and $\sigma^2$. The specific modeling choices play a crucial role in mitigating such computational challenges.

Model comparison via the posterior odds is computationally simpler, because the difficult normalizing constants cancel out:
\begin{align*}
\mathrm{odds}(z_a, z_b \mid y, X) = \frac{p(z_a \mid y, X)}{p(z_b \mid y, X)} &= \frac{p(z_a)}{p(z_b)} \times \frac{p(y \mid X, z_a)}{p(y \mid X, z_b)}, \\
\text{posterior odds} &= \text{prior odds} \times \text{Bayes factor}.
\end{align*}

Computing the marginal likelihood

We have
\begin{align*}
p(y \mid X, z) &= \int\!\!\int p(y, \beta, \sigma^2 \mid X, z)\, d\beta\, d\sigma^2 \\
&= \int\!\!\int p(y \mid \beta, X, z, \sigma^2)\, p(\beta \mid X, z, \sigma^2)\, p(\sigma^2)\, d\beta\, d\sigma^2.
\end{align*}
Some notation: for a given $z$ with $p_z$ non-zero entries, let $X_z$ be the $n \times p_z$ design matrix corresponding to the active explanatory variables $x_j$, and $\beta_z$ the $p_z \times 1$ vector consisting of the entries of $\beta$ for the active variables.

Let us consider a (conditional) g-prior for $\beta$ given $z$:
\[ \beta_z \mid X, z, \sigma^2 \sim N_{p_z}\big(0, g\sigma^2 [X_z^\top X_z]^{-1}\big). \]
In addition, give $\gamma := 1/\sigma^2$ a gamma prior: $\mathrm{gamma}(\nu_0/2, \nu_0\sigma_0^2/2)$. Then we have
\begin{align*}
p(y \mid X, z) &= \int p(y \mid X, z, \sigma^2)\, p(\sigma^2)\, d\sigma^2 \\
&= \int p(y \mid X, z, \gamma)\, p(\gamma)\, d\gamma \\
&= \int (2\pi)^{-n/2} (1 + g)^{-p_z/2} \times \gamma^{n/2} e^{-\gamma\, \mathrm{SSR}_g^z/2} \times \frac{(\nu_0\sigma_0^2/2)^{\nu_0/2}}{\Gamma(\nu_0/2)}\, \gamma^{\nu_0/2 - 1} e^{-\gamma\nu_0\sigma_0^2/2}\, d\gamma,
\end{align*}
where $\mathrm{SSR}_g^z$ is the same as in Eq. (55), with $X$ replaced by $X_z$ (exercise: verify this!):
\[ \mathrm{SSR}_g^z = y^\top \Big( I - \frac{g}{g+1} X_z (X_z^\top X_z)^{-1} X_z^\top \Big) y. \]
Now, using the normalizing constant identity for the gamma density leads to
\[ p(y \mid X, z) = \pi^{-n/2}\, \frac{\Gamma((\nu_0 + n)/2)}{\Gamma(\nu_0/2)}\, (1 + g)^{-p_z/2}\, \frac{(\nu_0\sigma_0^2)^{\nu_0/2}}{(\nu_0\sigma_0^2 + \mathrm{SSR}_g^z)^{(\nu_0 + n)/2}}. \]

With the marginal likelihood calculation completed, we can proceed to model comparison by computing the posterior odds defined earlier. Suppose that we set $g = n$ and $\nu_0 = 1$ for all $z$, while $\sigma_0^2$ is the estimated residual variance under the least squares estimate for a given model $z$. That is, given $z$, $\nu_0\sigma_0^2 := s_z^2$.

To compare the two models represented by $z_a$ and $z_b$, the Bayes factor is given by
\[ \frac{p(y \mid X, z_a)}{p(y \mid X, z_b)} = (1 + n)^{(p_{z_b} - p_{z_a})/2} \left( \frac{s_{z_a}^2}{s_{z_b}^2} \right)^{1/2} \times \left( \frac{s_{z_b}^2 + \mathrm{SSR}_g^{z_b}}{s_{z_a}^2 + \mathrm{SSR}_g^{z_a}} \right)^{(n+1)/2}. \tag{57} \]
The ratio of marginal probabilities associated with the two models reflects the balance between model complexity and goodness of fit. In particular, the ratio improves for $z_a$ (i.e., increases) if
• $\mathrm{SSR}_g^{z_a}$ becomes small relative to $\mathrm{SSR}_g^{z_b}$, i.e., the goodness of fit improves for $z_a$. This happens when the model becomes more complex, i.e., $p_{z_a}$ increases relative to $p_{z_b}$.
• On the other hand, the term $(1 + n)^{(p_{z_b} - p_{z_a})/2}$ penalizes large $p_{z_a}$.

It is important to note that this balancing act in the marginal likelihood (and in ratios of marginal likelihoods) is a very general characteristic: by virtue of integrating over the unknown parameters, the marginal likelihood captures the tension between model complexity and goodness of fit.

Example 9.8. Consider the oxygen uptake example. Recall our regression model
\begin{align*}
E[Y \mid \beta, x] &= \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 \\
&= \beta_1 + \beta_2 \times \text{group} + \beta_3 \times \text{age} + \beta_4 \times \text{group} \times \text{age}.
\end{align*}
The model selection question is whether or not $\beta_2$ and $\beta_4$ are non-zero (i.e., is there an effect of the training program on the change in oxygen uptake?). Recall from our earlier analyses that the answer was somewhat ambiguous: the 95% posterior intervals of both $\beta_2$ and $\beta_4$ contain zero. However, also according to the joint posterior distribution, the two parameters are negatively correlated, so whether or not $\beta_2 = 0$ affects our inference about $\beta_4$.

We consider 5 candidate models, giving them equal prior probabilities of 1/5. The remaining prior specification is as described above. We may then obtain the relevant marginal likelihoods and posterior probabilities as follows:

[Table: marginal likelihoods and posterior probabilities of the five candidate models.]

According to the posterior computation, the best model is $z = (1, 1, 1, 0)$. There is strong evidence for an age effect, as the total posterior probability of the three models that include age is essentially 1. The evidence for a group effect is relatively weaker, as the total posterior probability of the three models that include group information is $0.00 + 0.63 + 0.19 = 0.82$. This is still substantially higher than the prior probability of 0.60 for the three models combined.

9.4.2 Model averaging via MCMC

Given $p$ explanatory variables, each of whose coefficients may be either zero or non-zero, there are $2^p$ candidate models to consider. If $p$ is large, it is challenging to compute the marginal likelihood for each model.
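To make the cost concrete: under the g-prior, scoring a single model is cheap, since the closed form for $p(y \mid X, z)$ from Section 9.4.1 only needs $n$, $p_z$ and $\mathrm{SSR}_g^z$. The difficulty lies in the number of models. The following minimal sketch evaluates the log marginal likelihood and turns a list of such values into posterior model probabilities as in Eq. (56). The function names and the numerical inputs (sample size, SSR values, $s_z^2$ values) are hypothetical and are not taken from the oxygen uptake or diabetes data.

```python
import math

def log_marginal_likelihood(n, p_z, ssr_g, nu0, sigma0_sq, g):
    """log p(y | X, z) under the g-prior for beta_z and the gamma prior on 1/sigma^2."""
    return (-0.5 * n * math.log(math.pi)
            + math.lgamma(0.5 * (nu0 + n)) - math.lgamma(0.5 * nu0)
            - 0.5 * p_z * math.log(1.0 + g)
            + 0.5 * nu0 * math.log(nu0 * sigma0_sq)
            - 0.5 * (nu0 + n) * math.log(nu0 * sigma0_sq + ssr_g))

def posterior_model_probabilities(log_marginals, prior=None):
    """Eq. (56) on the log scale; a uniform prior over the listed models by default."""
    m = len(log_marginals)
    prior = prior or [1.0 / m] * m
    logs = [math.log(pr) + lm for pr, lm in zip(prior, log_marginals)]
    top = max(logs)                       # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical inputs for two models z_a and z_b, with g = n and nu0 = 1.
n = 12
p_a, s2_a, ssr_a = 2, 8.0, 100.0
p_b, s2_b, ssr_b = 3, 7.0, 90.0
log_m_a = log_marginal_likelihood(n, p_a, ssr_a, nu0=1.0, sigma0_sq=s2_a, g=n)
log_m_b = log_marginal_likelihood(n, p_b, ssr_b, nu0=1.0, sigma0_sq=s2_b, g=n)
log_bayes_factor = log_m_a - log_m_b      # log of the left-hand side of Eq. (57)
probs = posterior_model_probabilities([log_m_a, log_m_b])

# Enumerating every model requires 2^p such evaluations.
n_models_p64 = 2 ** 64
```

With $p = 64$ regressors, as in the diabetes example, the number of models is $2^{64}$, so exhaustive enumeration is out of reach even though each individual evaluation is a handful of arithmetic operations once $\mathrm{SSR}_g^z$ is available.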
The posterior distribution of interest is $\Pr(z, \beta, \sigma^2 \mid y, X)$. We can derive a Markov chain that enables us to approximate this distribution. However, $z$ is high-dimensional, so finding an approximation of the joint posterior distribution of $z$ may be impractical. Instead, we want to do the following:
1. find the regions of high posterior probability for any variable $z_j$ of interest;
2. find a good estimate for the parameters $\beta$ and $\sigma^2$ (presumably residing near a low-dimensional subspace) by integrating over $z \in \{0, 1\}^p$.

Deriving a Gibbs sampler for the posterior distribution of this model is simple. The full conditional distribution of each $z_j$ is
\[ \Pr(z_j = 1 \mid y, X, z_{-j}) = \frac{o_j}{1 + o_j}, \tag{58} \]
where the odds $o_j$ is given by
\begin{align*}
o_j &= \frac{\Pr(z_j = 1 \mid y, X, z_{-j})}{\Pr(z_j = 0 \mid y, X, z_{-j})} \\
&= \frac{\Pr(z_j = 1)}{\Pr(z_j = 0)} \times \frac{p(y \mid X, z_{-j}, z_j = 1)}{p(y \mid X, z_{-j}, z_j = 0)} =: A \times B.
\end{align*}
Note that $B$ was already computed via Eq. (57) for a g-prior specification. If we put an (independent) uniform prior probability on each indicator $z_j$, so that $\Pr(z_j = 1) = \Pr(z_j = 0) = 1/2$, then $A = 1$. Sampling from the posterior of $\beta$ and $\sigma^2$ under the g-prior was described in Section 9.3.

To summarize the Gibbs sampling procedure using the g-prior for $\beta$ and $\sigma^2$, and the uniform prior for $z$: given the sample $(z^{(s)}, \beta^{(s)}, \sigma^{2(s)})$, the sample at step $s + 1$ is generated as follows:
1. Set $z = z^{(s)}$;
2. For $j \in \{1, \dots, p\}$ in random order, replace $z_j$ with a sample from $p(z_j \mid z_{-j}, y, X)$ given by Eq. (58);
3. Set $z^{(s+1)} = z$;
4. Sample $\sigma^{2(s+1)} \sim p(\sigma^2 \mid z^{(s+1)}, y, X)$ given by Eq. (54);
5. Sample $\beta^{(s+1)} \sim p(\beta \mid z^{(s+1)}, \sigma^{2(s+1)}, y, X)$ given by Eq. (53).

[Listing: R code for the Gibbs sampling procedure above (only the portion for sampling z is included).]

Example 9.9. We return to the diabetes data example.
• Recall that we have $p = 64$ potential regressors, resulting in a total of $2^{64} \approx 10^{19}$ models.
• It is impossible to explore this space: if we generate 10,000 Gibbs samples, these samples account for only $1/10^{15}$ of the total number of models.

• Our intuition is that there are only a small number of relevant regressors, and so they will be present in many of the most likely models among the $2^{64}$ candidates. Averaging over the most likely candidates will still give us a good estimate of the marginal posterior probabilities of each of the $z_j$'s as well as of the corresponding $\beta$. (Recent theoretical developments on Bayesian asymptotics have confirmed this intuition.)

• The estimate for $\beta$ is given by $\hat{\beta}_{\mathrm{bma}} = \sum_{s=1}^{S} \beta^{(s)} / S$, where $S$ is the MCMC sample size.

– This is called the Bayesian model averaged estimate of $\beta$, because it does not correspond to any particular value of $z$, but is an average of regression parameters over different values of $z$. By averaging the regression coefficients from multiple high-probability models, the resulting estimate often performs better than a point estimate that corresponds to only a single model.

– The test error for the model averaging technique is 0.452, which is better than both OLS and backwards elimination.

• More on Bayesian robustness: recall that the backwards elimination procedure also produced 18 spurious associations in a randomization experiment (cf. Example 9.7). Using the Bayesian model averaging technique, it was found that the (approximated) posterior probabilities $\Pr(z_j = 1 \mid y, X)$ are less than 1/2 for all $j = 1, \ldots, 64$, and all but two of them are less than 1/4. The model averaging technique did not erroneously identify any regressors as having an effect on the distribution of $\tilde{y}$.

10 Metropolis-Hastings algorithms

The Gibbs sampler constructs a Markov chain whose transition probability kernel is defined as a composition of multiple Gibbs updates. A Gibbs update changes one variable at a time. This can be inefficient.
Moreover, the Gibbs update often requires some sort of (semi-)conjugacy in the model, so that the full conditional distributions can be computed in closed form. In this section we study a more general Markov chain based sampling method known as the Metropolis-Hastings (M-H) algorithm. Most MCMC algorithms used in practice, including Gibbs sampling, are special cases of the Metropolis-Hastings algorithm, which is versatile and powerful. M-H is especially useful in nonconjugate situations, and when there is a need for, and the possibility of, updating multiple variables simultaneously.

10.1 Metropolis-Hastings update

Let $\pi$ be the (stationary) distribution of interest. Suppose that $\pi$ is known only up to a normalizing constant. That is, $\pi$ is specified by an unnormalized density function $h(x)$ with respect to a base measure $\mu$ on $S$, where $\mu$ is the counting measure if $S$ is a discrete space, or Lebesgue measure if $S$ is a Euclidean space. Write
\[ \pi(x) = h(x)/c, \]
where the normalizing constant $c = \int h(x)\,\mu(dx) < \infty$ is unknown. In Bayesian computation, $h(x)$ is often the product of the prior density and the likelihood function.

Proposal distribution

The M-H update uses an auxiliary transition probability specified by a conditional density function $q(x, y)$. It is called the "proposal distribution", or "candidate generating distribution". For every point $x \in S$, $q(x, \cdot)$ is a probability density (with respect to $\mu$) having two properties:

• for each $x$ we can sample a random variable $y$ having the density $q(x, \cdot)$;
• we can evaluate $q(x, y)$ for each $x, y \in S$.

Roughly speaking, $q(x, y)$ represents the conditional probability of "proposing" an update value $y$, given that we are presently at $x$. We can choose any density that we know how to sample from and evaluate. For instance, if $S = \mathbb{R}^d$, a random walk proposal corresponds to $q(x, y) = N_d(y \mid x, \sigma^2 I)$, the density, evaluated at $y \in \mathbb{R}^d$, of a $d$-variate normal distribution with mean $x \in \mathbb{R}^d$ and covariance $\sigma^2 I$.

The Metropolis-Hastings algorithm then works by constructing the Markov chain $\{X_t\}_{t \geq 0}$ as follows.
Start at $X_0 = x$, where $x$ is in the support of $h$, i.e., $h(x) > 0$. Given the current position $X_t = x \in S$, the update changes $x$ to its value at the next iteration:

1. Draw a sample $y \sim q(x, \cdot)$.
2. Calculate the Hastings ratio
\[ R = \frac{h(y)\,q(y, x)}{h(x)\,q(x, y)}. \tag{59} \]
3. Accept the proposal by setting $X_{t+1} = y$ with probability $\min(1, R)$. Otherwise, keep the position unchanged by setting $X_{t+1} = x$.

Example 10.1. (Metropolis update) Suppose we use a proposal density $q(x, y)$ that is symmetric, $q(x, y) = q(y, x)$; for instance, the "normal random walk" $q(x, y) = N_d(y \mid x, \sigma^2 I)$. Then the Hastings ratio takes the form $R = h(y)/h(x)$, and there is no need to evaluate $q(x, y)$. The Metropolis algorithm is very popular, because it is easy to implement. It is also very intuitive: with a symmetric proposal, we always accept the proposed move from $x$ to $y$ if it represents an increase in the density of the stationary distribution, i.e., $\pi(y) \geq \pi(x)$. If the move represents a decrease, then the larger the decrease, the less likely we are to accept the move.

Let us write down the transition probability kernel $P(x, A)$ for the general Metropolis-Hastings update, for any $x \in S$ and $A \subset S$. The kernel has two terms, related to accepted proposals and rejected ones. For accepted proposals, we propose $y$ and then accept it, which happens with density $p(x, y) = q(x, y)\,a(x, y)$, where $a(x, y) = \min(R, 1)$. Thus
\[ \int_A p(x, y)\,\mu(dy) \]
represents the part of $P(x, A)$ that results from the accepted proposals. Moreover, $\int_S p(x, y)\,\mu(dy)$ gives the total probability that some proposed move is accepted (including the possibility that $y = x$), while
\[ r(x) := 1 - \int_S p(x, y)\,\mu(dy) \]
is the probability that a proposed move is rejected. If the proposed move is rejected, we stay put at $x$. Thus, the probability of moving from $x$ to a measurable subset $A \subset S$ is
\[ P(x, A) = \int_A p(x, y)\,\mu(dy) + \Big[ 1 - \int_S p(x, y)\,\mu(dy) \Big] I(x, A). \tag{60} \]
In the above, $I(x, A)$ denotes the identity kernel that represents "stay put": $I(x, A) = 1$ if $x \in A$ and 0 otherwise.
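The three-step update above can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions, not code from the course: the function name `metropolis_hastings`, its arguments, the standard normal target $h(x) = e^{-x^2/2}$, and the proposal scale `sigma = 2.0` are all hypothetical choices made for the demonstration. The ratio (59) is computed on the log scale, and the proposal is accepted with probability $\min(1, R)$.

```python
import math
import random


def metropolis_hastings(log_h, propose, log_q, x0, n_steps, rng):
    """Run n_steps Metropolis-Hastings updates.

    log_h(x)        : log of the unnormalized target density h(x)
    propose(x, rng) : draws y ~ q(x, .)
    log_q(x, y)     : log of the proposal density q(x, y)
    Returns the chain (including x0) and the observed acceptance rate.
    """
    x = x0
    chain = [x0]
    n_accept = 0
    for _ in range(n_steps):
        y = propose(x, rng)
        # log of the Hastings ratio R = h(y) q(y, x) / (h(x) q(x, y))
        log_r = log_h(y) + log_q(y, x) - log_h(x) - log_q(x, y)
        # accept with probability min(1, R); otherwise stay put at x
        if rng.random() < math.exp(min(0.0, log_r)):
            x = y
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_steps


# Illustration (assumed setup): target h(x) = exp(-x^2 / 2), i.e. a standard
# normal known only up to its normalizing constant, with a normal random walk.
sigma = 2.0  # proposal standard deviation (an arbitrary choice)


def log_h(x):
    return -0.5 * x * x


def propose(x, rng):
    return rng.gauss(x, sigma)


def log_q(x, y):
    # log density of N(y | x, sigma^2); symmetric in (x, y)
    return -0.5 * ((y - x) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


rng = random.Random(0)
n_steps = 50000
chain, acc_rate = metropolis_hastings(log_h, propose, log_q, 0.0, n_steps, rng)
draws = chain[5000:]  # discard an initial burn-in segment
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

Because the proposal here is symmetric, the two `log_q` terms cancel and the update reduces to the Metropolis update of Example 10.1; they are kept so that the same function also covers asymmetric proposals. The sample mean and variance of `draws` should be close to 0 and 1, respectively.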
10.1.1 Detailed balance and reversibility

Definition 10.1. A Markov chain $\{X_t\}_{t \geq 0}$ with a stationary distribution $\pi$ is said to be reversible if, when $X_t$ has the distribution $\pi$, $X_t$ and $X_{t+1}$ are exchangeable random variables.

Recall that $\pi$ is called a stationary distribution of the Markov chain if the following holds: when $X_t$ has distribution $\pi$, then so does $X_{t+1}$. Exchangeability is a stronger condition, as we have learned earlier in Section 8.3: it requires that the ordered pair $(X_t, X_{t+1})$ has the same joint distribution as the ordered pair $(X_{t+1}, X_t)$. (Exercise: verify that a basic Gibbs update is reversible.)

Although reversibility is not a requirement, many Markov chain constructions have this property. Reversibility has some theoretical benefits for the analysis of a Markov chain, but for us it is enough to note the following: to guarantee that a Markov chain construction admits $\pi$ as a stationary distribution, it suffices to check the stronger condition of reversibility, which tends to be easy to do in practice.

Recall $p(x, y) = q(x, y)\,a(x, y)$. The key to verifying reversibility is to check that the Markov chain satisfies detailed balance. That is:
\[ h(x)\,p(x, y) = h(y)\,p(y, x), \quad \text{for all } x, y \in S. \tag{61} \]
Note that this is equivalent to $\pi(x)\,p(x, y) = \pi(y)\,p(y, x)$. Suppose that detailed balance holds. Then for any $A, B \subset S$, we have
\begin{align*}
\Pr(X_t \in A, X_{t+1} \in B)
&= \iint 1_A(x) 1_B(y)\, \pi(x)\, P(x, dy)\, \mu(dx) \\
&= \iint 1_A(x) 1_B(y)\, \pi(x) \big[ p(x, y) + r(x) 1(y = x) \big]\, \mu(dy)\, \mu(dx) \\
&= \iint 1_A(x) 1_B(y)\, \pi(x)\, p(x, y)\, \mu(dy)\, \mu(dx) + \iint 1_A(x) 1_B(y) 1(y = x)\, r(x)\, \pi(x)\, \mu(dy)\, \mu(dx) \\
&\overset{(61)}{=} \iint 1_A(x) 1_B(y)\, \pi(y)\, p(y, x)\, \mu(dy)\, \mu(dx) + \iint 1_A(x) 1_B(y) 1(x = y)\, r(y)\, \pi(y)\, \mu(dy)\, \mu(dx) \\
&= \iint 1_A(x) 1_B(y)\, \pi(y) \big[ p(y, x) + r(y) 1(x = y) \big]\, \mu(dy)\, \mu(dx) \\
&= \Pr(X_t \in B, X_{t+1} \in A),
\end{align*}
which confirms reversibility.

Reversibility of Metropolis-Hastings update

Now we can verify that the M-H update is reversible by checking the detailed balance condition.
But this is immediate:
\begin{align*}
h(x)\,p(x, y) &= h(x)\,q(x, y)\,a(x, y) \\
&= h(x)\,q(x, y) \min\Big\{ 1, \frac{h(y)\,q(y, x)}{h(x)\,q(x, y)} \Big\} \\
&= \min\big\{ h(x)\,q(x, y),\; h(y)\,q(y, x) \big\}.
\end{align*}
The last expression in the above display is symmetric with respect to $x$ and $y$, so it is also equal to $h(y)\,p(y, x)$. We are done with the verification.

Metropolis-Hastings update for a subset of variables

Although the above description is for the full set of variables $x$ (e.g., $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$), Metropolis-Hastings can be, and quite typically is, applied to a subset of variables (just as the Gibbs sampler can be applied to a subset of variables). Suppose that a subset of variables $x_1, \ldots, x_j$ is to be updated for some $j < d$. Then the proposal density $q((x_1, \ldots, x_j), (y_1, \ldots, y_j))$ should be taken as a density with respect to the base measure on the subspace $\mathbb{R}^j$ spanned by the $j$ variables being updated. The procedure is then applied as described.

Gibbs as a special case of Metropolis-Hastings

The Gibbs sampler updates a variable $x_i$ from its full conditional distribution given all remaining variables $x_{-i}$. We will show that a Gibbs update for variable $x_i$ is nothing but a Metropolis-Hastings update with the proposal distribution $\pi(x_i \mid x_{-i})$. Indeed, for variable $x_i$, take the proposal density to be
\[ q(x, y) \propto h(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_d)/c, \]
where $y_j = x_j$ for $j \neq i$, and $h$ is the unnormalized density function for the target stationary distribution $\pi$. Note that $q(x, y)$ so defined is exactly the full conditional distribution $\pi(x_i = y_i \mid x_{-i})$. Its normalizing constant depends only on $x_{-i} = y_{-i}$, so it is the same for $q(x, y)$ and $q(y, x)$ and cancels in the Hastings ratio:
\begin{align*}
R = \frac{h(y)\,q(y, x)}{h(x)\,q(x, y)}
&= \frac{h(y)\,h(y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_d)}{h(x)\,h(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_d)} \\
&= \frac{h(y)\,h(x)}{h(x)\,h(y)} = 1.
\end{align*}
It follows that the acceptance probability is $\min\{R, 1\} = 1$. Thus, by adopting the full conditional distribution as the proposal distribution, the Metropolis-Hastings proposal is always accepted. This is exactly the Gibbs update!

Remark 10.1.
• The Metropolis-Hastings framework is so general and powerful that its introduction dramatically opened up the landscape of possibilities for MCMC based inference, because one can in principle adopt any reasonable distribution as a proposal and still get a valid Markov chain for a target stationary distribution of interest. Ideally, we would like a proposal that allows one to explore the distribution efficiently, by spending proportionally more time in all high density regions.

• Metropolis and Gibbs samplers can be viewed as two extremes in this landscape of proposals. Metropolis is realized by applying an arbitrary symmetric proposal distribution; this allows the Markov chain to explore virtually any location in the state space as one likes. The price to pay is that the acceptance rate may be very small if the proposal is too "reckless", as it may have nothing to do with the actual concentration of mass of the target distribution. When this is the case, one ends up rejecting the proposals most of the time, which amounts to a frustrating hit-and-miss sampling experience.

• Gibbs sampling, on the other hand, is too cautious in its proposal, which is automatically determined by the induced full conditional distributions. Although all its moves are accepted, the movement through the support can be hopelessly slow: due to the conjugacy needed for the computation of the full conditionals, one may update only one variable or a small subset of variables at a time, and get stuck in local modes as a result.

• Finding a good proposal for a given posterior distribution is an active area of research. It requires a deeper understanding of the geometry of such a posterior distribution. Hamiltonian Monte Carlo represents one such promising approach, but the progress remains rudimentary at this point.

• In practice, one may mix and match between different proposal strategies.
For instance, one may mix Gibbs updates for some subsets of variables with Metropolis-Hastings updates for other subsets.

10.2 Example: Poisson regression model

Given a population of song sparrows, we are interested in learning about the relationship between the number of offspring and age. One approach is to consider a regression model: the response $y$ represents the number of offspring of a song sparrow, while the regressors may be constructed from the age variable $x$. For instance, we assume
\[ \log E[Y \mid x] = \beta_1 + \beta_2 x + \beta_3 x^2. \]
This means $E[Y \mid x] = \exp(\beta_1 + \beta_2 x + \beta_3 x^2)$. Since $Y$ is non-negative integer-valued, we may consider the Poisson distribution as the conditional distribution for $Y$ given $x$. The resulting model is called a Poisson regression model:
\[ Y \mid x \sim \mathrm{Poisson}(\exp(\beta^\top x)). \]
To complete the model specification, we may endow $\beta$ with a normal prior. Note immediately that this is not conjugate to the Poisson-type likelihood. In general, Poisson regression is a specific instance of a broad class of models known as generalized linear models, for which conjugate priors generally do not exist. Thus, Gibbs sampling is difficult to implement.

Let us consider Metropolis sampling. Suppose $\beta$ has the normal prior $\beta \sim N(\beta_0, \Sigma_0)$, and we are given an $n$-sample $(y_i, x_i)_{i=1}^n$. The Hastings acceptance ratio is easy to compute: given the current $\beta^{(s)}$ and a proposed $\beta^*$,
\begin{align*}
R &= \frac{p(\beta^* \mid X, y)}{p(\beta^{(s)} \mid X, y)} \\
&= \frac{\mathrm{normal}(\beta^* \mid \beta_0, \Sigma_0)}{\mathrm{normal}(\beta^{(s)} \mid \beta_0, \Sigma_0)} \times \frac{\prod_{i=1}^n \mathrm{poisson}(y_i \mid \exp(x_i^\top \beta^*))}{\prod_{i=1}^n \mathrm{poisson}(y_i \mid \exp(x_i^\top \beta^{(s)}))}.
\end{align*}
In practice, when $n$ is large, the ratio may be either too large or too small to be represented accurately. To avoid numerical issues, it is advised to compute the logarithm of the ratio $R$ instead of computing $R$ directly. Then, the acceptance probability is
\[ a(\beta^{(s)}, \beta^*) = e^{\min\{0, \log R\}}. \]

It remains to specify the proposal distribution for $\beta^*$.
A natural choice is to take a normal random walk, i.e., a normal distribution centered at $\beta^{(s)}$:
\[ q(\beta^{(s)}, \beta^*) = \mathrm{normal}(\beta^* \mid \beta^{(s)}, \Sigma). \]
How do we choose $\Sigma$? In a normal regression problem, the posterior variance of $\beta$ will be close to $\sigma^2 (X^\top X)^{-1}$, where $\sigma^2$ is the variance of $Y$. This gives us a hint for our Poisson regression problem: since the log of the expectation of $Y$ is modeled as $\beta^\top x$, we can take the proposal variance to be
\[ \Sigma := \hat{\sigma}^2 (X^\top X)^{-1}, \]
where $\hat{\sigma}^2$ is the sample variance of $\{\log(y_1 + 1/2), \ldots, \log(y_n + 1/2)\}$. (The addition of 1/2 ensures that the log is applied to a positive number.)

We can also consider other choices for $\Sigma$, or for $q(\beta^{(s)}, \beta^*)$. For the chosen form of $\Sigma$ above, we may also choose a different $\hat{\sigma}$. All such choices result in a valid Markov chain, but they can differ in their mixing quality and in the autocorrelation of the MC samples. The general rule of thumb is to specify a proposal so that the acceptance rate is neither too large nor too small (say, between 20 and 50%). For more detail on this and other examples, see Chapter 10 of Hoff [2009].

11 Unsupervised learning and nonparametric Bayes

Unsupervised learning is a term that originates from machine learning, but it basically refers to a class of learning problems and techniques that involve latent variable models. The most basic instance of unsupervised learning is the problem of clustering. The problem of clustering is often vaguely formulated as follows: given $n$ data points $X_1, \ldots, X_n$ residing in some space, say $\mathbb{R}^d$, how does one subdivide these data into a number of clusters, in such a way that data points belonging to the same cluster are more similar than those from different clusters?

A popular method is the k-means algorithm, which is a simple and fast procedure for obtaining $k$ clusters for a given $k < \infty$. There is only limited theoretical basis for such an algorithm.
To provide a firm foundation for clustering, a powerful approach is to introduce additional probabilistic structure for the data. Such modeling is important to provide a guarantee that we are doing the right thing under certain assumptions, but more importantly it opens up new avenues for developing more sophisticated clustering algorithms as additional information about the data set, or additional requirements on the inference, become available.

The most common statistical modeling tool is the mixture model. A mixture distribution admits the following density:
\[ p(x \mid p, \phi) = \sum_{j=1}^{k} p_j f(x \mid \phi_j), \]
where $f$ is a known density kernel, $k$ is the number of mixing components, and $p_j$ and $\phi_j$ are the mixing probability and parameter associated with component $j$. When $k$ is finite, this is the pdf of a finite mixture model. Given an i.i.d. $n$-sample $X := (x_1, \ldots, x_n)$ from this mixture density, it is possible to estimate the parameters via maximum likelihood estimation, which can be achieved by the Expectation-Maximization (EM) algorithm. In fact, the EM algorithm can be viewed as a generalization of the popular k-means algorithm mentioned above. [13]

By taking a Bayesian approach to the learning of mixture models, we will see that a Gibbs sampler for posterior inference with a suitable choice of conjugate priors is a probabilistic version of the EM algorithm (and of the k-means algorithm). Thus, the Bayesian approach can produce estimates comparable to those of EM, but with the advantage of uncertainty quantification.

The question of model selection, i.e., how to select the number $k$ of mixture components, requires the development of a new framework known as Bayesian nonparametrics: the number of relevant parameters will be unknown, random, and potentially unbounded. Thus the totality of all potential parameters will be infinite. This requires new ideas for the prior construction and computational methods.
The outcome is an elegant solution to the model selection problem, in that the number of parameters will be shown to increase a posteriori as the data sample size increases. [14] In the Bayesian nonparametric framework, the corresponding models for the clustering problem are called infinite mixture models, which are endowed with suitable nonparametric Bayesian priors.

[13] You may ignore any references to the k-means and the EM algorithm in this set of notes if you have not seen these algorithms before.
[14] Good references for Bayesian nonparametrics include Hjort et al. [2010], Ghosh and Ramamoorthi [2002], Ghosal and van der Vaart [2017].

11.1 Finite mixture models

Consider a finite mixture of normal distributions on the real line:
\[ p(x \mid p, \phi) = \sum_{j=1}^{k} p_j\, N(x \mid \phi_j, \sigma^2), \]
where the parameters are the mixing probabilities $p = (p_1, \ldots, p_k)$ and the mean parameters $\phi_1, \ldots, \phi_k \in \mathbb{R}$; $\sigma^2$ is assumed known. For the prior specification, we take
\[ \phi_j \overset{\text{indep}}{\sim} N(\mu, \tau^2), \quad j = 1, \ldots, k, \]
for some hyperparameters $\mu$ and $\tau$. The mixing probability vector $p = (p_1, \ldots, p_k) \in \Delta^{k-1}$ is endowed with a Dirichlet prior, $p \sim D_k(\alpha)$. Recall that the Dirichlet distribution on $\Delta^{k-1}$ requires positive hyperparameters $\alpha = (\alpha_1, \ldots, \alpha_k)$.

11.1.1 Auxiliary variables

Now we introduce a very common and powerful technique in Bayesian inference: instead of working directly with the original (mixture) model, we introduce additional auxiliary latent variables in a joint model. When the auxiliary variables are integrated out, we get back the original model. The main advantage of this technique is in the posterior computation.
The joint posterior distribution (with the auxiliary variables included) tends to be easier to work with via Gibbs sampling or other MCMC updates, because the full conditional distributions are easy to compute: in the presence of the auxiliary variables, a prior that was not semiconjugate with respect to the original model becomes semiconjugate with respect to the joint model.

For our current mixture model, we need one auxiliary variable for each sample $x_i$: $Z := (Z_1, \ldots, Z_n)$, where each $Z_i \in \{1, \ldots, k\}$. $Z_i$ is interpreted as the (unknown and random) label of the mixture component from which the data point $X_i$ is generated. The joint model $p(X, Z \mid p, \phi)$ with the auxiliary $Z$ included is defined as follows:
\begin{align*}
Z_i \mid p &\overset{\text{iid}}{\sim} \mathrm{Cat}(p), & i &= 1, \ldots, n; \\
X_i \mid \phi, Z_i = j &\sim N(\cdot \mid \phi_j, \sigma^2), & i &= 1, \ldots, n;\ j = 1, \ldots, k.
\end{align*}
The priors for $p$ and $\phi$ are given as before.

Now we proceed to compute the posterior distribution for the quantities of interest, $p(Z, p, \phi \mid X)$, via Gibbs sampling. The full conditional distributions are easy to derive.

• For $Z$: for each $i = 1, \ldots, n$ and $j = 1, \ldots, k$,
\begin{align*}
p(Z_i = j \mid Z_{-i}, X, p, \phi) &= p(Z_i = j \mid X_i = x_i, p, \phi) \\
&\propto p(Z_i = j \mid p)\, p(x_i \mid Z_i = j, p, \phi), \\
&= \frac{p_j\, N(x_i \mid \phi_j, \sigma^2)}{\sum_{l=1}^{k} p_l\, N(x_i \mid \phi_l, \sigma^2)}.
\end{align*}

• For $\phi$: for each $j = 1, \ldots, k$,
\begin{align*}
p(\phi_j \mid \phi_{-j}, Z, X, p) &= p(\phi_j \mid Z, \{X_i = x_i \text{ such that } z_i = j\}) \\
&= N\Big( \phi_j \,\Big|\, \frac{\mu/\tau^2 + \sum_i x_i 1(z_i = j)/\sigma^2}{1/\tau^2 + n_j/\sigma^2},\ \frac{1}{1/\tau^2 + n_j/\sigma^2} \Big).
\end{align*}
The first identity is due to conditional independence. The second identity is a standard posterior computation under a normal likelihood and a normal prior for the mean parameter (cf. Section 5). Here, $n_j = \sum_{i=1}^{n} 1(z_i = j)$, i.e., the number of data points that are currently assigned to mixture component $j$ by means of having the label $z_i = j$.

• For $p$:
\begin{align*}
p(p \mid Z, X, \phi) &= p(p \mid Z) \quad \text{(due to conditional independence)} \\
&\propto p(p)\, p(Z \mid p) \\
&\propto \prod_{j=1}^{k} p_j^{\alpha_j - 1} \times \prod_{j=1}^{k} p_j^{n_j} \\
&\propto D(p \mid \alpha_1 + n_1, \ldots, \alpha_k + n_k) = D(p \mid \alpha + n),
\end{align*}
where in the last line we use $n$ to denote $(n_1, \ldots, n_k)$.
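The three full conditionals above translate directly into a Gibbs sweep. The following Python sketch is only an illustration under our own assumptions: the function names `sample_dirichlet` and `gibbs_finite_mixture`, the synthetic two-cluster data, and the hyperparameter values are hypothetical and not taken from the notes. Labels are 0-based in the code, and the label weights are computed on the log scale for numerical stability.

```python
import math
import random


def sample_dirichlet(conc, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in conc]
    total = sum(g)
    return [v / total for v in g]


def gibbs_finite_mixture(x, k, sigma2, mu, tau2, alpha, n_iter, rng):
    """Gibbs sampler for a k-component normal mixture with known variance sigma2."""
    n = len(x)
    p = sample_dirichlet(alpha, rng)                        # initialize from the prior
    phi = [rng.gauss(mu, math.sqrt(tau2)) for _ in range(k)]
    z = [0] * n
    phi_draws = []
    for _ in range(n_iter):
        # Z_i | rest: probability proportional to p_j N(x_i | phi_j, sigma2)
        for i in range(n):
            logw = [math.log(p[j]) - 0.5 * (x[i] - phi[j]) ** 2 / sigma2
                    for j in range(k)]
            top = max(logw)
            w = [math.exp(v - top) for v in logw]
            z[i] = rng.choices(range(k), weights=w)[0]
        counts = [0] * k
        sums = [0.0] * k
        for xi, zi in zip(x, z):
            counts[zi] += 1
            sums[zi] += xi
        # phi_j | rest: normal-normal conjugate update
        for j in range(k):
            prec = 1.0 / tau2 + counts[j] / sigma2
            mean = (mu / tau2 + sums[j] / sigma2) / prec
            phi[j] = rng.gauss(mean, math.sqrt(1.0 / prec))
        # p | rest: Dirichlet(alpha + n)
        p = sample_dirichlet([alpha[j] + counts[j] for j in range(k)], rng)
        phi_draws.append(list(phi))
    return phi_draws, z, p


# Synthetic data (assumed for illustration): two clusters centered at -3 and 3.
rng = random.Random(1)
data = [rng.gauss(-3.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
phi_draws, z, p = gibbs_finite_mixture(data, k=2, sigma2=1.0, mu=0.0, tau2=25.0,
                                       alpha=[1.0, 1.0], n_iter=300, rng=rng)
# Average the sorted component means over the last 100 sweeps
# (sorting sidesteps the arbitrary labeling of the components).
tail = [sorted(d) for d in phi_draws[-100:]]
est = [sum(d[j] for d in tail) / len(tail) for j in range(2)]
```

On this well-separated data set, `est` should land close to the two cluster centers. Sorting the means before averaging is a simple device for this demonstration only; it is not a general remedy for label switching.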
We make some comments.

• The Gibbs updates for $Z_i$ and $p$ can be viewed as the result of a "soft" (probabilistic) assignment of the cluster label for each data point $x_i$. Recall that in the k-means clustering algorithm, there is a hard assignment of the cluster label associated with each data point. In the EM algorithm, this corresponds to the E-step, which updates the expectation of the latent labels $Z_i$.

• The Gibbs update for $\phi_j$ is a probabilistic update of the cluster means. This is the direct counterpart of the M-step in the EM algorithm and of the mean update step in k-means.

• Gibbs sampling is convenient but not the most efficient posterior computation technique. We may consider other forms of MCMC, such as the Metropolis-Hastings algorithms that we saw in Section 10. The wealth of posterior inference algorithms available is a hidden benefit of working with a rich Bayesian modeling framework. It is considerably harder to invent a deterministic counterpart of Metropolis-Hastings algorithms among frequentist approaches that must extend from the basic k-means and EM algorithms.

11.2 Infinite mixture models

As we said earlier, the salient feature of a nonparametric Bayesian approach is to allow infinitely many parameters to be present in the model. Continuing with our present example of mixture modeling with normal components, an infinite mixture model admits the following density function:
\[ p(x \mid p, \phi) = \sum_{j=1}^{\infty} p_j f(x \mid \phi_j). \]
As before, $f(x \mid \phi_j) = N(x \mid \phi_j, \sigma^2)$ for some known $\sigma^2$, but here there are infinitely many parameters $(p_j, \phi_j)_{j=1}^{\infty}$. An immediate question is: how do we specify a Bayesian prior on infinitely many parameters? Since the $\phi_j$'s are unconstrained on the real line, we may again set the prior for these parameters as
\[ \phi_1, \phi_2, \ldots \overset{\text{iid}}{\sim} G_0. \]
For instance, take $G_0 = N(\mu, \tau^2)$.

The nontrivial issue lies in specifying the prior for $p = (p_1, p_2, \ldots)$, which is now an infinite sequence satisfying the constraints $p_j \geq 0$ and $\sum_j p_j = 1$. Recall that if $p = (p_1, \ldots, p_k)$ is a finite sequence, i.e., $k < \infty$, then we may use the Dirichlet distribution as a prior for $p \in \Delta^{k-1}$, say $p \sim D(\alpha)$ for some $\alpha = (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}_+^k$. We need a generalization of the Dirichlet distribution that works for $\Delta^{\infty}$.

11.2.1 Dirichlet process prior

Recall a simple fact about the finite-dimensional Dirichlet distribution. If $k = 2$, then the Dirichlet distribution $D((p_1, p_2) \mid \alpha_1, \alpha_2)$ reduces to the Beta distribution $\mathrm{Beta}(p_1 \mid \alpha_1, \alpha_2)$ on the unit interval, because $p_2 = 1 - p_1$.

With a moment of thought, it is possible to conceive of the following distribution on the infinite sequence $p = (p_1, p_2, \ldots)$, by constructing a random "stick-breaking" process as follows: take a stick of unit length and break it into two shorter pieces in a random fashion, one of which is assigned to be of length $p_1$; the remaining part, of length $1 - p_1$, is broken again randomly to obtain $p_2$, and so on. Whenever we break a piece of stick into two smaller pieces, we may take the proportions of the smaller pieces to be beta distributed. To be precise, let $\beta = (\beta_1, \beta_2, \ldots)$ be i.i.d. $\mathrm{Beta}(1, \alpha)$. Define
\[ p_1 = \beta_1, \qquad p_k = \prod_{i=1}^{k-1} (1 - \beta_i)\, \beta_k, \quad k = 2, 3, \ldots. \]
It is easy to check that the infinite sequence $p$ constructed this way satisfies the constraint $\sum_{k=1}^{\infty} p_k = 1$ almost surely. We have just described a Dirichlet distribution on the infinite-dimensional probability simplex $\Delta^{\infty}$.

Collecting the above specifications gives us a definition of the famous Dirichlet process. [15]

Definition 11.1. Let $G_0$ be a probability distribution on the real line, and let
\[ \phi_1, \phi_2, \ldots \overset{\text{iid}}{\sim} G_0 \]
be an infinite i.i.d. sequence of random variables. Let $\alpha > 0$, and let
\[ \beta_1, \beta_2, \ldots \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha) \]
be an infinite i.i.d. sequence of random variables. Set
\[ p_1 = \beta_1, \qquad p_k = \prod_{i=1}^{k-1} (1 - \beta_i)\, \beta_k, \quad k = 2, 3, \ldots. \tag{62} \]
Define the discrete distribution on the real line
\[ G := \sum_{j=1}^{\infty} p_j \delta_{\phi_j}. \]
Then we say that $G$ is a Dirichlet process on the real line. We write
\[ G \mid \alpha, G_0 \sim D(\alpha G_0). \tag{63} \]

[15] The Dirichlet process was first introduced by Thomas Ferguson. Definition 11.1, however, was given by Jayaram Sethuraman.

What we have just defined is a random variable $G$ taking values in the space of probability distributions on the real line, namely $\mathcal{P}(\mathbb{R})$. The distribution from which the random $G$ is generated, namely $D(\alpha G_0)$, is called a Dirichlet distribution, which generalizes the standard Dirichlet distribution on a finite-dimensional probability simplex to a distribution on the infinite-dimensional probability simplex $\Delta^{\infty}$. Note that the distribution $D(\alpha G_0)$ has two parameters: a positive scalar $\alpha > 0$, and a distribution $G_0$ on the real line.

Back to our infinite mixture model setting,
\[ p(x \mid p, \phi) = \sum_{j=1}^{\infty} p_j f(x \mid \phi_j). \tag{64} \]
The distribution $G = \sum_{j=1}^{\infty} p_j \delta_{\phi_j}$ encapsulates all parameters of the infinite mixture model that we seek to estimate. We can rewrite the mixture model equivalently as
\[ p(x \mid G) = \int f(x \mid \phi)\, G(d\phi). \tag{65} \]
Eq. (65) gives us the view of the infinite mixture model as a model parameterized by $G \in \mathcal{P}(\mathbb{R})$. $G$ is called the mixing distribution, or mixing measure, of the mixture model. When the mixing distribution $G$ is endowed with the Dirichlet prior given by Eq. (63),
\[ G \mid \alpha, G_0 \sim D(\alpha G_0), \]
we call our model a Dirichlet process mixture model. This is still a standard Bayesian formulation, although a nonparametric one, where the parameter of interest is the infinite-dimensional $G \in \mathcal{P}(\mathbb{R})$.

Given an i.i.d. $n$-sample $X_1, \ldots, X_n \mid G \sim p(x \mid G)$, the immediate question of concern is that of posterior computation. How do we compute $p(G \mid X_1, \ldots, X_n)$?

11.3 Posterior computation via slice sampling

The totality of all variables of interest includes the observed data $X = (X_1, \ldots, X_n)$, the mixing proportions $p = (p_1, \ldots)$, and the atoms $\phi = (\phi_1, \ldots)$.
Moreover, $p$ is constructed via the stick-breaking representation (62), which is based on the variables $\beta = (\beta_1, \ldots)$. We shall make use of the auxiliary variable technique extensively. The first use is similar to the case of the finite mixture that we saw in Section 11.1. According to the joint model,

• each data point $X_i$ is associated with a mixture component label $Z_i \in \{1, 2, \ldots\}$;
• given $p$, $Z_i \mid p \overset{\text{iid}}{\sim} \mathrm{Cat}(p)$ for $i = 1, \ldots, n$;
• given $Z_i$ and all other variables, $X_i$ is distributed according to $f(X_i \mid \phi_{Z_i})$.

Thus, we may write the joint model as
\[ (\beta, \phi, Z, X) \sim \mathrm{Beta}(1, \alpha)^{\infty} \times G_0^{\infty} \times \prod_{i=1}^{n} p_{Z_i} \times \prod_{i=1}^{n} f(X_i \mid \phi_{Z_i}). \tag{66} \]
The superscripts $\infty$ signify the infinitely many variables $\beta = \{\beta_k\}_{k=1}^{\infty}$ and $\phi = \{\phi_k\}_{k=1}^{\infty}$ present in the model. We seek to devise a Markov chain that converges in distribution to the target stationary distribution, which is the posterior of $\beta, \phi, Z$ given the data $X$. The difficulty is apparent: there are infinitely many variables to handle, which cannot possibly be sampled simultaneously. We use a technique known as "slice sampling".

Slice sampling involves the introduction of yet another set of auxiliary random variables, $u := (u_1, \ldots, u_n)$, taking values in the bounded intervals $(0, q_{Z_i})$, where $i = 1, \ldots, n$ and $(q_j)_{j \geq 1}$ is a sequence of values in $(0, 1)$, either deterministically or randomly generated, such that $q_j$ tends to zero (certainly or almost surely). In particular, for each $i$, given $q$ we draw $u_i$ from the uniform distribution on the interval $(0, q_{Z_i})$. Thus, the extended joint model takes the form
\[ (\beta, \phi, u, Z, X \mid q) \sim \mathrm{Beta}(1, \alpha)^{\infty} \times G_0^{\infty} \times \prod_{i=1}^{n} \frac{1}{q_{Z_i}} 1(u_i < q_{Z_i}) \times \prod_{i=1}^{n} p_{Z_i} \times \prod_{i=1}^{n} f(X_i \mid \phi_{Z_i}). \tag{67} \]
It is clear that integrating out all $u_i$ in the joint distribution given by Eq. (67) leads to the joint distribution given by Eq. (66). Thus, it is sufficient to construct a Markov chain for the model given by Eq. (67).
What one gains from the introduction of the auxiliary variables $u$ is that, when $u$ is conditioned on, we only need to choose each label $Z_i$ from the finite set
\[ H(u_i) := \{ j \in \mathbb{N}_+ : q_j > u_i \}. \]
If one thinks of a bar graph in which the height of each bar represents the magnitude of $q_j$, $j = 1, 2, \ldots$, then restricting the label $Z_i$ to $H(u_i)$ corresponds visually to "slicing" off the portion below the height $u_i$, so that only the bars higher than $u_i$ remain. Hence the name "slice sampling".

Gibbs sampler for model (67)

• Sampling $u$ given $\beta, Z, X, q$: for each $i = 1, \ldots, n$, draw $u_i \overset{\text{indep}}{\sim} \mathrm{Uniform}[0, q_{Z_i}]$.

• Sampling $\beta$ given $u, Z, \phi, X, q$: note that the variables $\beta$ are relevant only to the extent that they determine the variables $p_j$. Moreover, the only $p_j$'s of concern are those with indices $j \in \cup_{i=1}^{n} H(u_i)$. Thus,
\begin{align*}
p(\beta_j \mid \text{the rest}) &\propto (1 - \beta_j)^{\alpha - 1} \times \prod_{i:\, q_{Z_i} > u_i} p_{Z_i} \\
&\propto (1 - \beta_j)^{\alpha - 1} \prod_{i:\, q_{Z_i} > u_i} \Big[ \prod_{k=1}^{Z_i - 1} (1 - \beta_k) \Big] \beta_{Z_i} \\
&\propto \beta_j^{\sum_{i=1}^{n} 1(Z_i = j;\, q_j > u_i)}\, (1 - \beta_j)^{\alpha - 1 + \sum_{i=1}^{n} 1(Z_i > j;\, q_{Z_i} > u_i)} \\
&\propto \mathrm{Beta}\Big( 1 + m_j,\ \alpha + \sum_{k > j} m_k \Big),
\end{align*}
where in the last line we set $m_j := \sum_{i=1}^{n} 1(Z_i = j;\, u_i < q_j)$ for $j = 1, 2, \ldots$, and note that $\sum_{i=1}^{n} 1(Z_i > j;\, q_{Z_i} > u_i) = \sum_{k > j} m_k$. Clearly, in the above computation we only need to update $\beta_j$ for $j = 1, \ldots, K$, where $K$ is such that $m_k = 0$ for all $k > K$. $K$ represents the upper bound on the number of "active" indices. $K$ may change from one Gibbs iteration to the next.

• Sampling $\phi$ given $\beta, u, Z, X, q$:
\begin{align*}
p(\phi_j \mid \text{the rest}) &\propto G_0(d\phi_j) \prod_{i=1}^{n} f(X_i \mid \phi_{Z_i}) \\
&\propto \prod_{i:\, Z_i = j} N(X_i \mid \phi_j, \sigma^2)\, G_0(d\phi_j) \\
&\propto N\Big( \phi_j \,\Big|\, \frac{\mu/\tau^2 + \sum_i x_i 1(z_i = j)/\sigma^2}{1/\tau^2 + n_j/\sigma^2},\ \frac{1}{1/\tau^2 + n_j/\sigma^2} \Big),
\end{align*}
where $n_j = \sum_{i=1}^{n} 1(z_i = j)$, i.e., the number of data points that are currently assigned to mixture component $j$ by means of having the label $z_i = j$. (Note that this step is similar to the sampling of the component means $\phi_j$ in a finite mixture.)

• Sampling $Z$ given $\beta, \phi, u, X, q$: for $i = 1, \ldots, n$,
\[ p(Z_i = j \mid \text{the rest}) \propto \frac{1(u_i < q_j)}{q_j}\, p_j f(X_i \mid \phi_j), \]
for $j = 1, 2, \ldots$.
This is where we need to be careful, since the support of $Z_i$ is unbounded. Obviously, the above probability is positive only if $u_i < q_j$. If $q \in \Delta^{\infty}$ (although this is not a strict requirement; more on this below), then it suffices to update for all values $j = 1, 2, \ldots$ up to the minimal index $K$ where $1 - \sum_{k=1}^{K} q_k < \min_{i=1}^{n} \{u_i\}$. If we reach a new index $k$ for which $p_k$ and $\phi_k$ have not been generated, then we proceed by generating $\phi_k \sim G_0$, $\beta_k \sim \mathrm{Beta}(1, \alpha)$, and letting
\[ p_k = \prod_{i=1}^{k-1} (1 - \beta_i)\, \beta_k = \Big( 1 - \sum_{i=1}^{k-1} p_i \Big) \beta_k. \]

• Sampling $q$: if $q$ is deterministically generated, then this step is not necessary (although the choice of this sequence may be critical to the mixing behavior of the underlying Markov chain). If $q$ is randomly generated, there are several options:

– a simple method is to generate $q$ independently of all other variables (e.g., via a fixed stick-breaking process). Then, we may update $q$ after one or several iterations of the Gibbs updates for all other variables.

– another approach is to place an independent prior on $q$: $q_j \sim \mathrm{Uniform}(0, b_j)$ for $j = 1, 2, \ldots$, such that $b_j \downarrow 0$. Then the update for $q$ can be achieved given $u$ via the conditional distribution
\[ p(q_j \mid \text{the rest}) \propto q_j^{-n_j}\, 1\big( q_j > \max_{i:\, Z_i = j} u_i \big), \quad q_j \in (0, b_j). \]

– yet another approach is to take $q := p$, but then $q$ is no longer independent of $\beta$; the update of $\beta$ may not have the conjugate form, or an easily calculable form, as given above.

Observe that the MCMC algorithm gradually and stochastically adds new components $(\beta_j, \phi_j)$ for $j = 1, 2, \ldots$ into the state space of the Markov chain. No upper bound on the number of components is required a priori!

11.4 Chinese restaurant process and another Gibbs sampler

Dirichlet processes have many other remarkable characterizations, which help us to understand them more deeply, while giving us additional ideas for computation. Next, we describe a Pólya urn characterization of the Dirichlet process.
Consider the following specification for a sequence of random variables which are i.i.d. draws from a Dirichlet process:

$$
G \mid \alpha, G_0 \sim \mathcal{D}_{\alpha G_0}, \tag{68}
$$

$$
\theta_1, \ldots, \theta_n \mid G \overset{\text{iid}}{\sim} G. \tag{69}
$$

Note that given $\alpha$ and $G_0$, the random distribution $G$ may be represented as $G = \sum_{j=1}^\infty p_j \delta_{\phi_j}$, where $p$ and $\phi$ are random variables given by Definition 11.1. Since $\theta_1, \theta_2, \ldots$ is a conditionally i.i.d. sequence, it is an exchangeable sequence of random variables. We ask: what is the marginal distribution of the exchangeable sequence $\theta_1, \theta_2, \ldots$, which would be obtained if we integrate out the random $G$ in the above specification?

Based on Definition 11.1 it is not difficult to verify that the joint distribution of the sequence $\theta_1, \theta_2, \ldots$ can be completely specified as follows:

$$
\begin{aligned}
\theta_1 &\sim G_0, \\
\theta_2 \mid \theta_1 &\propto \delta_{\theta_1} + \alpha G_0, \\
&\;\;\vdots \\
\theta_j \mid \theta_1, \ldots, \theta_{j-1} &\propto \sum_{k=1}^{j-1} \delta_{\theta_k} + \alpha G_0, \\
&\;\;\vdots
\end{aligned}
$$

A sequence of random variables defined this way is generally known as a Pólya sequence. It makes explicit the clustering behavior of the collection of random variables $\theta_1, \theta_2, \ldots$ generated from a (random) Dirichlet process $G \sim \mathcal{D}_{\alpha G_0}$: with positive probability, each $\theta_j$ shares the same value as one of the variables generated before it in the sequence.

This Pólya sequence has a tasty name, "the Chinese restaurant process". Consider the following imaginary Chinese restaurant, which has infinitely many tables and receives an infinite sequence of customers labeled $1, 2, \ldots$:

• Customer 1 arrives and sits at an arbitrary table.

• The following customers $2, 3, \ldots$ arrive in sequence and choose their table according to the following rule: a non-empty table is chosen with probability proportional to the number of customers currently sitting at that table; otherwise the customer chooses a new table, with probability proportional to $\alpha$.

• For each table, a random dish is ordered i.i.d. from the menu (distribution) $G_0$ for all at that table to share. Assign each $\theta_i$ to be the dish that customer $i$ is having.
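The seating rule above can be simulated directly. The following is a minimal sketch, not part of the original notes: the function name `sample_crp`, the callable `draw_from_g0`, and the choice of a standard normal base distribution $G_0$ are all illustrative assumptions, and only the Python standard library is used. Customer $i+1$ joins an existing table with weight equal to its current occupancy and opens a new table with weight $\alpha$, so the total weight is $i + \alpha$.

```python
import random


def sample_crp(n, alpha, draw_from_g0, rng):
    """Simulate the first n customers of a Chinese restaurant process.

    n            -- number of customers (variables theta_1, ..., theta_n)
    alpha        -- concentration parameter, assumed > 0
    draw_from_g0 -- hypothetical callable: rng -> one draw from the base G0
    rng          -- a random.Random instance
    Returns (thetas, tables, counts).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    counts = []  # counts[t]: customers currently seated at table t
    dishes = []  # dishes[t]: dish drawn from G0 when table t was opened
    tables = []  # tables[i]: table chosen by customer i (0-based)
    for i in range(n):
        # i customers are already seated, so the total weight is i + alpha:
        # weight counts[t] for existing table t, weight alpha for a new table.
        r = rng.random() * (i + alpha)
        t = 0
        while t < len(counts) and r >= counts[t]:
            r -= counts[t]
            t += 1
        if t == len(counts):  # open a new table and order its dish from G0
            counts.append(0)
            dishes.append(draw_from_g0(rng))
        counts[t] += 1
        tables.append(t)
    thetas = [dishes[t] for t in tables]  # theta_i is the dish customer i eats
    return thetas, tables, counts


def standard_normal(rng):
    """Illustrative base distribution G0 = N(0, 1); an assumption."""
    return rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    thetas, tables, counts = sample_crp(20, 1.0, standard_normal, random.Random(0))
    print("table sizes:", counts)
    print("number of distinct values:", len(set(thetas)))
```

Running the sketch with a larger $\alpha$ tends to open more tables, which is one way to see how the concentration parameter controls the number of distinct values among $\theta_1, \ldots, \theta_n$.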
Gibbs sampler based on the Pólya characterization

The Dirichlet process mixture model can be expressed as follows. Recall the prior:

$$
G \mid \alpha, G_0 \sim \mathcal{D}_{\alpha G_0}, \qquad \theta_1, \ldots, \theta_n \mid G \overset{\text{iid}}{\sim} G,
$$

which is combined with the likelihood specification: for $i = 1, \ldots, n$,

$$
X_i \mid \theta_i \overset{\text{indep}}{\sim} f(X_i \mid \theta_i). \tag{70}
$$

The latent variables $\theta_1, \ldots, \theta_n$ represent the parameters with which $X_1, \ldots, X_n$ are respectively associated. E.g., $\theta_i$ is the mean parameter of the mixture component that $X_i$ is associated with when we use $f(X_i \mid \theta_i) = N(X_i \mid \theta_i, \sigma^2)$.

To implement a Gibbs sampler, we need to construct a Markov chain for $\{\theta_1, \ldots, \theta_n\}$ that converges to the target stationary distribution $P(\theta_1, \ldots, \theta_n \mid X)$. For a Gibbs update, we need to compute the full conditional distribution of each $\theta_i$ given all the other variables.

By the fact that $\theta_1, \ldots, \theta_n$ are a priori exchangeable, we may treat $\theta_i$ as the last element in the Pólya sequence (i.e., the last customer in the Chinese restaurant process). Thus,

$$
\theta_i \mid \theta_{-i} \propto \sum_{j \neq i} \delta_{\theta_j} + \alpha G_0.
$$

By Bayes' rule and conditional independence, we have

$$
p(\theta_i \mid \theta_{-i}, X) \propto p(\theta_i \mid \theta_{-i}) f(X_i \mid \theta_i) \propto \alpha f(X_i \mid \theta_i) G_0(d\theta_i) + \sum_{j \neq i} f(X_i \mid \theta_j) \delta_{\theta_j}(d\theta_i).
$$

The above full conditional distribution is a mixture distribution: with probability proportional to $f(X_i \mid \theta_j)$ we set $\theta_i := \theta_j$, and with probability proportional to $\alpha \int f(X_i \mid \theta) G_0(d\theta)$ we draw a new value of $\theta_i$ from the distribution proportional to $f(X_i \mid \theta) G_0(d\theta)$, i.e., the posterior of $\theta$ under the prior $G_0$ given the single observation $X_i$. Both the integral in question and this posterior are available in closed form due to the normal-normal conjugacy between $G_0$ and $f$.

We see clearly in the Gibbs sampling step two types of moves: one is to select a cluster/table/dish for $\theta_i$ among the existing ones, and the other is to generate a new cluster/table/dish, whose value originates from the base distribution $G_0$. Thus, the number of clusters is also sampled as part of the Markov chain.

Summarizing, the Gibbs sampling algorithm consists of the following single line of code. For each MCMC step, do as follows:

(1) for $i = 1, \ldots, n$, draw $\theta_i$ given the current $\theta_{-i}$ and $X$ from the full conditional distribution derived above.

This is only the simplest example of a Gibbs sampler based on the Pólya characterization of Dirichlet processes. Researchers have developed more sophisticated and efficient techniques based on the Gibbs and Metropolis-Hastings sampling frameworks.

This section offered a glimpse of the Dirichlet process, which is just one of many powerful tools of Bayesian nonparametrics. For an expanded version of this short introduction, see also the lecture notes [Nguyen, 2015].

12 Additional topics

Bayesian statistics has a rich literature, both classical and modern, which has produced an enormous repository of ideas and tools for modeling and computation. Several modeling and computational topics are worth exploring further from here:

• Modeling: probabilistic graphical models [Jordan, 2004, Blei et al., 2003, Pritchard et al., 2000], Gaussian processes for nonlinear regression and classification [Rasmussen and Williams, 2006], hierarchical modeling with Dirichlet processes and extensions [Teh and Jordan, 2010].

• Computation: general variational inference [Wainwright and Jordan, 2008], variational inference applied to Bayes [Blei et al., 2018], geometric methods, e.g., for topic and hierarchical models [Yurochkin et al., 2019], and MCMC with proposal distributions arising from Langevin [Roberts and Tweedie, 1996] and Hamiltonian dynamics [Neal, 2011].

Most of the above references are available from the Canvas folder "Additional Reading" for this course.

References

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians, 2016. ISSN 1537274X, 2018.

C. Geyer. Markov Chain Monte Carlo lecture notes. Unpublished, 2005.

S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, 2017.

J. K.
Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, 2002.

N. Hjort, C. Holmes, P. Mueller, and S. Walker (Eds.). Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.

P. Hoff. A First Course in Bayesian Statistical Methods. Springer, 2009.

M. I. Jordan. An introduction to probabilistic graphical models. Unpublished edition, 2003.

M. I. Jordan. Graphical models. Statistical Science (Special Issue on Bayesian Statistics), 19:140–155, 2004.

R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–163, 2011.

X. Nguyen. VIASM lectures on Bayesian nonparametrics. Unpublished edition, 2015.

J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945–959, 2000.

C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

C. P. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, 2nd edition, 2007.

G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. In N. Hjort, C. Holmes, P. Mueller, and S. Walker, editors, Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, Cambridge, UK, 2010.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008.

M. Yurochkin, A. Guha, Y. Sun, and X. Nguyen. Dirichlet simplex nest and geometric inference. Proceedings of the International Conference on Machine Learning (ICML), 2019. URL arXiv:1905.11009.