STATS 551, Winter 2022
Lectures on Bayesian modeling and computation
XuanLong Nguyen
University of Michigan
April 12, 2022
Abstract
This is a set of lecture notes for Stats 551. The materials presented in these notes are self-contained.
I will keep updating these notes as we go. A main textbook used in preparing these notes is Peter
Hoff's "A first course in Bayesian statistical methods" [Hoff, 2009]. I will also draw from several other
sources, including Charles Geyer's "Markov Chain Monte Carlo lecture notes" [Geyer, 2005], Michael
I. Jordan's "An introduction to probabilistic graphical models" [Jordan, 2003], and Christian Robert's
"The Bayesian choice" [Robert, 2007]. Please let me know (xuanlong@umich.edu) of any errors.
Contents

1  Introduction and examples                                         4
   1.1  What is Bayesian inference?                                  4
   1.2  Bayes' rule                                                  8
   1.3  Example: estimating the probability of a rare event          9
   1.4  Example: prediction via a Bayesian regression model         13

2  Interpretation of probabilities and Bayes' formulas              16
   2.1  Interpretation of probabilities                             16
   2.2  Bayes' rule                                                 17
   2.3  Bayesian hypothesis testing                                 19
   2.4  Random variables and conditional independence               20
        2.4.1  Discrete domains                                     20
        2.4.2  Continuous domains                                   21
        2.4.3  Multivariate domains                                 22
   2.5  Bayes' formulas and parameter estimation                    25

3  One-parameter models                                             27
   3.1  The binomial model                                          27
   3.2  Confidence regions                                          32
   3.3  The Poisson model                                           35
   3.4  Example: birth rates                                        38

4  Monte Carlo approximation                                        41
   4.1  Basic ideas                                                 41
   4.2  Posterior inference for arbitrary functions                 45
   4.3  Sampling from posterior predictive distributions            46
   4.4  Posterior predictive model checking                         49

5  The normal model                                                 52
   5.1  The normal / Gaussian distribution                          52
   5.2  Inference of the mean with variance fixed                   54
   5.3  Joint inference for the mean and variance                   59
   5.4  Normal model for non-normal data                            68

6  Posterior approximation with the Gibbs sampler                   70
   6.1  Conjugate vs non-conjugate prior                            70
   6.2  The Gibbs sampler                                           72
   6.3  Markov chain Monte Carlo algorithms                         76
        6.3.1  Gibbs sampler                                        76
        6.3.2  General Markov chain framework                       79
        6.3.3  Variants of Gibbs samplers                           81
   6.4  MCMC diagnostics                                            83

7  Multivariate normal models                                       89
   7.1  Mean vector and covariance matrix                           89
   7.2  The multivariate normal distribution                        91
   7.3  Semiconjugate prior for the mean vector                     95
   7.4  Inverse Wishart prior for the covariance matrix             97
   7.5  Example: reading comprehension study                       100

8  Group comparisons and hierarchical modeling                     102
   8.1  Comparing two groups                                       102
   8.2  Comparing multiple groups                                  107
   8.3  Exchangeability and hierarchical models                    109
   8.4  Hierarchical normal models                                 113
        8.4.1  Posterior inference                                 114
        8.4.2  Example: Math scores in U.S. public schools         116
   8.5  Topic models                                               122
        8.5.1  Model formulation                                   122
        8.5.2  Posterior inference                                 128
        8.5.3  Variational Bayes                                   129

9  Linear regression                                               135
   9.1  Linear regression model                                    135
   9.2  Semi-conjugate priors                                      138
   9.3  Objective priors                                           140
   9.4  Model selection                                            146
        9.4.1  Bayesian model comparison                           150
        9.4.2  Model averaging via MCMC                            154

10 Metropolis-Hastings algorithms                                  158
   10.1 Metropolis-Hastings update                                 158
        10.1.1 Detailed balance and reversibility                  162
   10.2 Example                                                    166

11 Unsupervised learning and nonparametric Bayes                   168
   11.1 Finite mixture models                                      170
        11.1.1 Auxiliary variables                                 171
   11.2 Infinite mixture models                                    174
        11.2.1 Dirichlet process prior                             176
   11.3 Posterior computation via slice sampling                   179
   11.4 Chinese restaurant process and another Gibbs sampler       182

12 Additional topics                                               186
1 Introduction and examples

1.1 What is Bayesian inference?
Bayesian inference is a major framework for statistical inference. In general, statistical inference is the
(computational) process of turning data into some form of data summarization and understanding, which
may also enable prediction. Bayesian inference, or more broadly speaking, Bayesian statistics, is often
contrasted with a competing framework known as frequentist (or classical) statistics. In this course, we refer
to statistical inference and statistical learning interchangeably.
There are two main players in statistical inference: data and quantity of inferential interest. The data are
represented by a variable y taking values in some suitable space Y. The quantity of interest is denoted by θ
taking values in another space Θ. Typically θ represents some characteristic of the data population that we
wish to understand.
For inference to be possible, there must be a "linkage" between θ and the observed data y. This linkage
is formalized by a sampling model (statistical model) for which θ is viewed as the model parameter: the true
θ that is responsible for generating the observed data y is unknown. As such, θ encodes our understanding
of the data population. It is the quantity of interest.
Example 1.1. Suppose we are interested in the prevalence of an infectious disease in a city. Data y are
obtained from a random sample of individuals from the city, namely, the total number of people in the
sample who are infected. Of interest is θ, the fraction of infected individuals in the city. Thus, Θ = [0, 1],
while Y = {0, 1, 2, 3, . . .}.
Example 1.2. y represents a collection of heights sampled from a population, θ the typical height. Here,
Θ = Y = R.
Example 1.3. y represents polling data, θ is a categorical valued variable that tells us which candidate wins
an election.
Example 1.4. y is a sequence of binary values that record whether a given day is rainy or not. θ may
be taken to represent the frequency of rainy days, i.e., the cloudiness of a location. We may also want to
predict if it is going to rain tomorrow or not (in this case, we may introduce another binary random variable
to represent tomorrow’s forecast).
Example 1.5. A less obvious example: y is a collection of data pairs of the form (u, v), where v is the
binary class label that represents the "class" of the corresponding u. θ is a mathematical quantity related to
the classifier, a function mapping u to v that we wish to obtain on the basis of a training data set.
Example 1.6. A clustering problem involves subdividing a collection of data points represented by y into
”clusters”, which can be represented by θ.
Example 1.7. "Who is who": y represents a collection of photos available on the Internet. θ represents the
identities of all individuals who appear in such photos.
In practice and in our times, y has become increasingly complex, and so has the ambition of the data
modeler and statistician, who wants to infer an increasingly complex quantity of interest θ.
For both the Bayesian and frequentist frameworks, data y are always considered to be realizations of some
random variable denoted by Y.¹ The nature of the unknown θ is a different matter: frequentist methods
treat θ as unknown but non-random; Bayesian methods always assume θ to be random.
The randomness of the unknown can be viewed as the most distinguishing feature of Bayesian statistics.
The ramifications are both deep and strong. This course is an applied Bayesian analysis course, so we will
not get into the deeper theoretical foundations of Bayesian statistics. Instead, we focus on Bayesian methods
and applications. Nonetheless, such ramifications of the Bayesian choice will be felt strongly.
¹ In these notes, we will try to adhere to the convention that random variables are upper case, unless denoted by Greek letters.
The numerical value of a random variable, say Y, is denoted in lower case, y.
1.2 Bayes' rule
The idealized form of Bayesian inference begins with a numerical formulation of the joint beliefs about y
and θ, expressed in terms of probability distributions over Y and Θ. Here are the key ingredients:
1. For each numerical value θ ∈ Θ, prior distribution p(θ) describes our belief that θ represents the true
population’s characteristics.
2. For each θ ∈ Θ and y ∈ Y, sampling model p(y|θ) describes our belief that y would be the outcome
of our study if we knew θ to be true.
Once we obtain the data y, the last step is to update our beliefs about θ:
3. For each numerical value of θ ∈ Θ, posterior distribution p(θ|y) describes our belief that θ is the true
value, having observed data set y.
The posterior distribution is obtained from the prior distribution and the sampling model via Bayes' rule:

p(θ | y) = p(y | θ) p(θ) / p(y) = p(y | θ) p(θ) / ∫_Θ p(y | θ̃) p(θ̃) dθ̃.    (1)
Note that Bayes' rule is a mathematical formula that allows one to "invert" the arguments of conditional
probabilities. Here we have applied Bayes' formula for the purpose of statistical inference, the method
being named after its progenitor.
Implicit in the above description is a significant conceptual lift of Bayesian statistics: we express the a
posteriori "belief" about θ by adopting the conditional probability of θ given y. The higher the value of the
probability for a numerical value of θ, the stronger our belief in it. Although "belief" may be a vague
notion, probabilities and conditional probabilities are mathematically well-defined. Thus, we may speak of
belief in a quantitatively rigorous way. Note also that Bayes' rule does not tell us what the truth θ should be;
it tells us how our belief about θ changes after seeing new information.
Figure 1.1: The plot on the left gives binomial(20, θ) distributions for three values of θ. The right side gives
prior (gray) and posterior (black) densities of θ. This is Fig. 1.1 of PH.
1.3 Example: estimating the probability of a rare event
Continuing Example 1.1, of interest is θ, the fraction of infected individuals in the city, so θ ∈ [0, 1]. The
data y is the number of infected individuals in a sample of 20, so y ∈ {0, . . . , 20}.
We need a sampling model. A reasonable choice is [why?]

Y | θ ∼ binomial(20, θ).

This means P(Y = y | θ) = (20 choose y) θ^y (1 − θ)^{20−y}. In particular, P(Y = 0 | θ) = (1 − θ)^20.
See Fig. 1.1 for an illustration.
To get a sense of this probability: P(Y = 0 | θ = 0.05) = 0.95^20 ≈ 0.36. If θ = 0.1 or θ = 0.2, this
number would be 0.12 or 0.01, respectively.
Next, a prior is specified. A common choice is the beta distribution [why?] θ ∼ beta(a, b). There are
two parameters a, b > 0 that we need to set. But how?
The expectation under the beta prior is a/(a + b). The mode of the beta prior is (a − 1)/(a + b − 2).
Previous studies from various parts of the country indicate that the infection rate in comparable cities ranges
from about 0.05 to 0.20, with an average prevalence of 0.10. This suggests taking a = 2, b = 20:
θ ∼ beta(2, 20).
This prior specification yields the following summaries of the prior distribution:
E[θ] = 0.09
mode[θ] = 0.05
Pr(θ < 0.10) = 0.64
Pr(0.05 < θ < 0.20) = 0.66.
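These prior summaries are easy to check numerically. For integer shape parameters, the beta distribution function reduces to a binomial tail probability, so a short dependency-free sketch suffices (the helper names below are ours, not from the text):

```python
from math import comb

def beta_mean(a, b):
    # E[theta] for theta ~ beta(a, b)
    return a / (a + b)

def beta_mode(a, b):
    # mode of beta(a, b); well-defined for a, b > 1
    return (a - 1) / (a + b - 2)

def beta_cdf(x, a, b):
    # For integer a, b: Pr(theta < x) = Pr(Bin(a + b - 1, x) >= a),
    # a classical identity for the regularized incomplete beta function.
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

a, b = 2, 20
print(round(beta_mean(a, b), 2))                               # 0.09
print(round(beta_mode(a, b), 2))                               # 0.05
print(round(beta_cdf(0.10, a, b), 2))                          # 0.64
print(round(beta_cdf(0.20, a, b) - beta_cdf(0.05, a, b), 2))   # 0.66
```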
You may still find reasons to be uncomfortable with this particular choice of prior parameters (what might
they be?), but we will get to that. Let us now apply Bayes' rule, which enables one to go from the prior to
the posterior distribution.
From prior to posterior By an application of Bayes' rule, we will find that if Y | θ ∼ binomial(n, θ) and
θ ∼ beta(a, b), then the conditional distribution is again a beta: θ | Y = y ∼ beta(a + y, b + n − y).
This is an example of a general structural property called "conjugacy" (the beta prior is conjugate
to the binomial likelihood) that is widely exploited in Bayesian computation. We will study this property
systematically in later lectures.
Suppose that in our specific study, we observed that in fact Y = 0, i.e., none of the sampled individuals
was infected. [What do we make of this?] The posterior distribution of θ is therefore
θ|{Y = 0} ∼ beta(2, 40).
Observe the change in shape from the prior distribution to the posterior distribution under the (new) observation y = 0 in Fig. 1.1: the mass of the posterior is shifted toward zero. This reflects the consequence
of the "Bayes update"; by contrast, a simple-minded approach would be to set θ = 0 in the presence of y = 0.
The posterior is also more "peaked" than the prior. This reflects a general phenomenon: as more data are
observed, our belief about θ becomes more concentrated, even if we start out with a prior belief that is more
"diffuse". In other words, the more data are observed, the less influential the role of the prior. This is a
desirable property.
More quantitatively on this transformation:

E[θ | Y = 0] = 0.048
mode[θ | Y = 0] = 0.025
Pr(θ < 0.10 | Y = 0) = 0.93

In particular, we may say: given the observation, our posterior belief that θ < 0.1 is quite strong (probability
> 0.93). How sensitive is this conclusion to our prior specification?
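These posterior summaries can also be reproduced directly from Bayes' rule (1), approximating the normalizing integral by a sum over a fine grid on Θ = [0, 1]. This is a numerical sketch of the update, not the closed-form conjugate computation:

```python
from math import comb

a, b = 2, 20   # beta(2, 20) prior
n, y = 20, 0   # observed data: 0 infected out of 20

# Evaluate (unnormalized) prior x likelihood on a grid over (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
unnorm = [t**(a - 1) * (1 - t)**(b - 1)            # prior, up to a constant
          * comb(n, y) * t**y * (1 - t)**(n - y)   # binomial likelihood
          for t in grid]

# Normalize: this sum plays the role of the integral in the denominator of (1).
z = sum(unnorm)
post = [u / z for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, post))
p_below = sum(p for t, p in zip(grid, post) if t < 0.10)
print(round(post_mean, 3))   # ~0.048, matching E[theta | Y = 0] under beta(2, 40)
print(round(p_below, 2))     # ~0.93
```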
Sensitivity analysis
The Bayes update enables us to go from a beta(a, b) prior to a beta posterior, namely,
beta(a + y, b + n − y),
whose parameters incorporate the impact of the observed data y. In particular, we go from the prior mean
θ0 := a/(a + b) to the posterior mean

E[θ | Y = y] = (a + y) / (a + b + n)
             = n/(a + b + n) · (y/n) + (a + b)/(a + b + n) · (a/(a + b))
             = n/(w + n) · ȳ + w/(w + n) · θ0.

Here, we denote w = a + b. The above formula nicely captures the combined impacts of the data, via the term
ȳ = y/n, and of prior knowledge, via the prior mean θ0. The posterior mean represents a weighted average of
the sample mean ȳ and our prior guess θ0.
We may view w as a parameter that represents our confidence in this prior guess. Note that the prior
distribution may be expressed as beta(wθ0 , w(1 − θ0 )). The posterior distribution is beta(wθ0 + y, w(1 −
θ0 ) + n − y).
• If we fix w, let us see the impact of the data size. As the sample size n tends to infinity, the posterior mean
tends to the sample mean ȳ; the prior belief plays a vanishing role no matter how confident we are
about it. However, when the sample size n is small, the prior belief can be influential, and its influence
is captured by w.
• Let us fix n (to be relatively small). As w → 0, i.e., as our confidence in the prior vanishes, the posterior
mean converges to the data-driven sample mean ȳ. If w → ∞, the opposite happens: the posterior
mean tends toward the prior mean θ0; the observed data hardly matter any more.
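The two limiting regimes can be seen by plugging numbers into the weighted-average formula for the posterior mean (a small sketch; the function name is ours):

```python
def posterior_mean(ybar, n, theta0, w):
    # E[theta | Y = y] = n/(w + n) * ybar + w/(w + n) * theta0, with w = a + b
    return (n * ybar + w * theta0) / (w + n)

ybar, n, theta0 = 0.0, 20, 2 / 22   # data y = 0 out of n = 20; prior mean of beta(2, 20)
for w in (0.1, 22.0, 1000.0):
    print(w, round(posterior_mean(ybar, n, theta0, w), 4))
# As w -> 0 the result approaches the sample mean 0.0;
# as w grows it approaches the prior mean theta0, about 0.0909.
```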
Figure 1.2: Posterior quantities under different beta prior specifications. The left and right hand panels give
contours of E[θ|Y = 0] and Pr[θ < 0.10|Y = 0], respectively. (Fig. 1.2 of PH).
Fig. 1.2 gives a more detailed picture of the sensitivity of the prior specification. The left panel tells a
general story: the prior specification can play a big role in our conclusion “after the fact”. The sensitivity
analysis allows us to be both honest and more confident in drawing our inference.
The confidence in our inference depends on the specific question that we ask about θ. Suppose that the
city officials want to recommend a vaccine to the general public unless they were reasonably sure that the
current infection rate was less than 10%. Then we may want to look at the right panel, which gives the
contours of the posterior for Pr[θ < 0.10|Data].
• For chosen θ0 ≤ 0.1, which is the average prevalence in other comparable cities from prior studies,
we can be reasonably certain that the current infection rate is below 10% (with posterior probability
above 90% for a large range of w).
• A higher degree of certainty, say 97.5%, is only achieved by people who already thought the infection
rate was lower than the average of the other cities, e.g., if θ0 < 0.05.
1.4 Example: prediction via a Bayesian regression model
The problem is to come up with a predictive model for diabetes progression as a function of 64 baseline
explanatory variables such as age, sex and body mass index. A standard tool is a linear
regression model; the simplest one is

y = β^⊤x + ε,

where β ∈ R^64 is the quantity of interest, along with σ, the standard deviation of the error term ε. The parameters
can be estimated using a training dataset consisting of measurements from 342 patients. There is also a test
dataset of 100 patients, which will be used to evaluate the predictive performance of the learned model.

Sampling model Suppose that the error term follows a normal distribution with unknown variance,
ε ∼ N(0, σ²); then the sampling model takes the following conditional form:

Y | X = x, β, σ ∼ Normal(β^⊤x, σ²).
Prior specification Placing a prior distribution on β ∈ R^64 is non-trivial. This is a large parameter space;
one needs to impose the kind of distribution that reflects our prior knowledge of this space. The prior belief
we have is that most of the 64 explanatory variables have little to no effect on diabetes progression, i.e., most
coefficients are zero, but we do not know which ones. We can start with a prior under which β_1, . . . , β_64 are a priori
independent and Pr(β_j = 0) = 1/2. Details are omitted for now. Likewise, a prior on σ is required,
but for now we may assume σ fixed.
Posterior distribution With the sampling model and the prior specification, and given observed data pairs
(y, X) = (y_i, x_i)_{i=1}^{342}, by Bayes' rule we obtain the posterior for the parameter β:

p(β | y, X, σ) = p(β) p(y | X, β, σ) / p(y | X, σ).
(Notice that for regression or classification problems, the explanatory variables X remain in the conditioning
in Bayes' formula.) Later we will learn how to carry out this kind of posterior computation.
The posterior distribution gives us much information. Of interest, for instance, is the question of variable
selection, which can be addressed via Pr(β_j ≠ 0 | y, X). See Fig. 1.3. Recall that each of the 64 coefficients
β_j was a priori zero with probability 1/2; the posterior given the data tells us that the number of non-zero
coefficients must be much smaller.
We are also interested in predicting the value of the response variable on the test dataset. A simple way to
do this is to take β̂^Bayes = E[β | y, X], the posterior expectation of β, and plug this point estimate into the
test dataset. In particular, let X^test be the 100 × 64 matrix giving the data for the 100 patients in the test
dataset. Then we can predict the corresponding diabetes progression levels by

ŷ^test := X^test β̂^Bayes.
By contrast, a standard non-Bayesian approach is to take the ordinary least squares (OLS) estimate:

β̂^ols := argmin_β Σ_{i=1}^n (y_i − β^⊤x_i)²,
Figure 1.3: Posterior probabilities that each coefficient is non-zero (Fig. 1.3 of PH).
Figure 1.4: Observed versus predicted response y (diabetes progression value) using the Bayes estimate
(left) and the OLS estimate (right panel) (Fig. 1.4 of PH).
which gives β̂^ols = (X^⊤X)^{−1} X^⊤y. With this point estimate in place, the predictive estimate for the
response is given by X^test β̂^ols. Fig. 1.4 gives a comparison between OLS and the Bayesian approach. The
prediction errors for OLS and the Bayesian approach are 0.67 and 0.45, respectively.
We make some high-level remarks.
• the poor performance of OLS is due to the fact that the sample size is small relative to the large
number of explanatory variables.
• to do well in such situations, one needs to constrain the parameter space (provided this constraint is
"correct"). The Bayesian prior has this effect.
• alternatively, modern regression methods can achieve this by introducing a penalty term. A well-known
method is lasso regression, which proposes the following point estimate:

β̂^lasso := argmin_β Σ_{i=1}^n (y_i − β^⊤x_i)² + λ Σ_{j=1}^{64} |β_j|,

where λ > 0 is a tuning parameter that balances the squared error and the penalty term, which
helps to push (many) β_j to take small or zero values.
• it can be verified (you should!) that the lasso estimate corresponds exactly to the mode of the posterior
distribution if we take the prior on β_j to be Laplace, a probability distribution that has a sharp
peak at β_j = 0.
• note that the above use of β̂^Bayes is a convenient way to obtain a predictive estimate for the
response variable Y using the Bayesian posterior. But this is not the "true" Bayesian estimate. Recall
that everything unknown in a Bayesian analysis is treated as random. Thus, the true Bayesian estimate
for Y has to be obtained by integrating out β according to its posterior distribution. That is, we can
compute

P(Y^test | X^test) = ∫ P(Y^test | X^test, β) P(β | y, X) dβ.

This computation is a bit more involved, but the resulting estimate is expected to be more robust than
the plug-in estimate based on β̂^Bayes.
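The lasso-MAP correspondence in the bullets above can be sketched as follows, assuming (for illustration) independent Laplace priors p(β_j) ∝ exp(−c|β_j|) with some rate c > 0 and the Gaussian sampling model. The negative log-posterior is, up to an additive constant,

```latex
-\log p(\beta \mid y, X, \sigma)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta^\top x_i)^2
    + c \sum_{j=1}^{64} |\beta_j| + \mathrm{const}.
```

Multiplying by 2σ² does not change the minimizer, so the posterior mode minimizes Σ_i (y_i − β^⊤x_i)² + λ Σ_j |β_j| with λ = 2σ²c; that is, the MAP estimate coincides with the lasso estimate.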
2 Interpretation of probabilities and Bayes' formulas

2.1 Interpretation of probabilities
When we say, "if I toss a coin, the probability that the coin turns up heads is 1/2", what we understand is the
possibility of repeated coin-tossing experiments, in approximately half of which the coin turns up heads.
This is the frequentist interpretation of probabilities. This is also the interpretation we rely on when we think
of the sampling model Pr(y = 1 | θ = 1/2).
But what do we mean by saying, a day before the votes are cast, that a candidate wins the election with
probability 66.67%? We cannot repeat the election multiple times for the same candidates. This probability
number obviously quantifies the degree of our belief in a given statement. The higher the number, say 80%
or 95%, the stronger the belief (which is subjective, since my 80% may be perceived differently from your
80%). We use this interpretation when we specify the prior Pr(θ) and draw inferences from the posterior
distribution Pr(θ | Data).
Both interpretations are present in Bayesian analysis, in the prior and the sampling model respectively, and
are linked via the Bayes formula. More remarkably, the Bayes formula enables us to revert the arguments in
conditional probabilities, i.e., to relate Pr(A|B) with Pr(B|A), and so on. We can make sense of, and
quantify, statements such as "if a person has a college degree, then their likely income level is ..." versus
"if a person has this income level, then they are likely to have received a college degree".
In logic, it is simple to distinguish the logical statements A ⇒ B and B ⇒ A. In probabilistic settings
and real-life applications, it is not so obvious to quantify the uncertainty of such statements.
2.2 Bayes' rule
Bayes’ formulas are straightforward to grasp in the somewhat abstract language of probability space and
Venn diagrams of subsets of events. Later, we apply Bayes’ rule to random variables, as commonly done in
practice. (Paradoxically, the application of Bayes’ rule to random variables seems less intuitive in specific
applied settings).
Let H be the set of all possible truths, on which we can place the unit probability: Pr(H) = 1. Let
{H_1, . . . , H_K} be a partition of H. The rule of total probability imposes that

Σ_{k=1}^K Pr(H_k) = 1.
Examples
• H is the set of truths about people's religious orientations. Partitions include {Christian, non-Christian},
but also {Protestant, Catholic, Jewish, other, none}, and so on.
• H is the set of truths about people's number of children.
• H is the set of truths about the relationship between smoking and hypertension in a given population.
Partitions include {some relationship, no relationship}, or
{negative correlation, zero correlation, positive correlation}, and so on.
An event E is defined as a subset of H whose uncertainty we may quantify in terms of Pr(E). By the rule of
marginal probability:

Pr(E) = Σ_{k=1}^K Pr(E ∩ H_k) = Σ_{k=1}^K Pr(H_k) Pr(E | H_k),
where we have used the definition of conditional probability in the second equality.
It follows that

Pr(H_j | E) = Pr(H_j ∩ E) / Pr(E) = Pr(E | H_j) Pr(H_j) / Σ_{k=1}^K Pr(E | H_k) Pr(H_k).

This is an instance of the celebrated Bayes' formula, which allows one to compute the "inverse probability" Pr(H_j | E) in terms of Pr(E | H_j) and other quantities. The other quantities here are the seemingly
benign unconditional probability terms Pr(H_j). In reality, it is often the presence of understated or hidden
assumptions about these unconditional probabilities that leads people to draw drastically contradictory conclusions in the face of the same set of observed evidence. Bayes' formulas explain this phenomenon clearly.
Example 2.1. A subset of the 1996 General Social Survey includes data on the education level and income
for a sample of males over 30 years of age. Let {H_1, H_2, H_3, H_4} be the events that a randomly selected
person in this sample is in the lowest, second, third, or upper 25th percentile in terms of income.
By definition, the unconditional probabilities are

{Pr(H_1), Pr(H_2), Pr(H_3), Pr(H_4)} = {.25, .25, .25, .25}.
These probabilities add up to 1.
Let E be the event that a randomly sampled person from the survey has a college education. From the
survey data, we also have
{Pr(E|H1 ), Pr(E|H2 ), Pr(E|H3 ), Pr(E|H4 )} = {.11, .19, .31, .54}.
These are also probabilities. They do not add up to one. Rather, they represent the proportions of college
degree holders in each of the four subpopulations. Observe the increase in the proportion relative to the
income percentile level.
Now, applying Bayes' rule, we obtain

{Pr(H_1|E), Pr(H_2|E), Pr(H_3|E), Pr(H_4|E)} = {.09, .17, .27, .47}.

What we see here are the probabilities that someone is in each of the income brackets, given that that person
is a college degree holder. These probabilities add up to one. Note how they share the same monotonicity as
the numbers in the previous paragraph. This is by design, because the unconditional probabilities Pr(H_i) are the
same. The monotonicity will not be preserved in general, and may be counterintuitive, if the subpopulations
{H_i} are partitioned in such a way that their corresponding probabilities

{Pr(H_1), Pr(H_2), Pr(H_3), Pr(H_4)}

are suitably skewed. [Exercise: come up with an example!]
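The inversion in Example 2.1 is mechanical and easy to reproduce (a small sketch; the function name is ours):

```python
def bayes_invert(priors, likelihoods):
    # Pr(H_j | E) = Pr(E | H_j) Pr(H_j) / sum_k Pr(E | H_k) Pr(H_k)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]

priors = [0.25, 0.25, 0.25, 0.25]        # income quartiles H1..H4
likelihoods = [0.11, 0.19, 0.31, 0.54]   # Pr(college | H_k) from the survey
post = bayes_invert(priors, likelihoods)
print([round(p, 3) for p in post])
# Up to rounding, these are the values {.09, .17, .27, .47} quoted above,
# and they sum to one.
```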
2.3 Bayesian hypothesis testing
In Bayesian inference, {H_1, . . . , H_K} often refer to disjoint hypotheses or states of nature, and E refers to
the outcome of the survey, study or experiment. To compare the hypotheses post-experimentally, we may
calculate the ratio

Pr(H_i | E) / Pr(H_j | E) = [Pr(E | H_i) / Pr(E | H_j)] × [Pr(H_i) / Pr(H_j)]
                          = "Bayes factor" × "prior beliefs".
This tells us that Bayes' rule alone does not determine what our beliefs should be after seeing the data; the
prior beliefs play a very important role. The following example is apt given the most recent election:

H = all possible rates of support for candidate A
H_1 = more than half the voters support candidate A
H_2 = less than or equal to half the voters support candidate A
E = 54 out of 100 people surveyed said they support candidate A

In the face of the polling data E, what should we conclude about candidate A's chances? The modeling of
both {Pr(H_i)} and Pr(E | H_i), and the interplay among these quantities, combine to determine the inference.
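To make the polling example concrete, suppose (purely for illustration) that under each hypothesis θ is uniform on the corresponding region of [0, 1]. Then Pr(E | H_i) is an average of binomial likelihoods, which we can approximate numerically:

```python
from math import comb

def avg_likelihood(y, n, lo, hi, steps=20000):
    # Pr(E | H) when theta | H ~ uniform(lo, hi): average the binomial(n, theta)
    # likelihood of y over a midpoint grid on (lo, hi).
    total = 0.0
    for k in range(steps):
        t = lo + (hi - lo) * (k + 0.5) / steps
        total += comb(n, y) * t**y * (1 - t)**(n - y)
    return total / steps

y, n = 54, 100
pE_H1 = avg_likelihood(y, n, 0.5, 1.0)   # H1: theta > 1/2
pE_H2 = avg_likelihood(y, n, 0.0, 0.5)   # H2: theta <= 1/2
print(round(pE_H1 / pE_H2, 2))           # Bayes factor in favor of H1 (> 1 here)
# With equal prior probabilities Pr(H1) = Pr(H2) = 1/2,
# the posterior odds equal the Bayes factor.
```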
2.4 Random variables and conditional independence
Bayesian inference is applied to random variables: the observed data y and the quantity of interest θ are
both realizations of random variables.
The domains of these random variables, and their structural properties, have to be taken into
account in order to construct suitable probability models to which the Bayes formula can be applied.
2.4.1 Discrete domains
We say Y is discrete if its domain Y is countable, meaning that it can be expressed as Y = {y_1, y_2, . . .}.
The event that the outcome Y takes a value y is quantified by the probability Pr({Y = y}) := p(y),
where the function p is called the probability density function of Y. It satisfies the following properties:

1. 0 ≤ p(y) ≤ 1 for all y ∈ Y.
2. Σ_{y∈Y} p(y) = 1.

An event of interest concerning the outcome Y takes the form Y ∈ A, for some subset A ⊂ Y. We may
quantify our belief about such an event via

Pr(Y ∈ A) = Σ_{y∈A} p(y).
There are many examples of probability distributions on discrete domains. They form crucial building
blocks for the probability models we will construct. Here are a few; it is important to review them.
1. bernoulli(y | θ), where y ∈ {0, 1}, θ ∈ [0, 1]. The pdf takes the form

p(y | θ) = θ^y (1 − θ)^{1−y}.

2. binomial(y | θ, n), where y ∈ {0, 1, . . . , n} and θ ∈ [0, 1]:

p(y | θ) = (n choose y) θ^y (1 − θ)^{n−y}.

3. poisson(y | θ), where y ∈ N, θ ≥ 0:

p(y | θ) = θ^y e^{−θ} / y!.

4. categorical(y | θ), where y ∈ {1, . . . , K}, θ ∈ ∆^{K−1} := {(q_1, . . . , q_K) ∈ R_+^K : Σ_{k=1}^K q_k = 1}:

p(y | θ) = θ_y = Π_{k=1}^K θ_k^{I(y=k)}.

5. multinomial(y | θ, n), where y = (y_1, . . . , y_K) ∈ N^K such that Σ_{k=1}^K y_k = n, and θ ∈ ∆^{K−1}:

p(y | θ, n) = (n choose y_1 . . . y_K) Π_{k=1}^K θ_k^{y_k}.
We have used Y to illustrate random variables with discrete domains, but remember that in Bayesian
inference the quantity of interest θ is also random, and its prior distribution is drawn from the same toolbox
just described.
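These pmfs are easy to evaluate directly. The notes use R for computation elsewhere; as a standard-library Python sketch (the function names are ours, chosen to mirror the list above):

```python
import math

# Direct evaluation of the discrete pmfs listed above.
def bernoulli(y, theta):           # y in {0, 1}
    return theta ** y * (1 - theta) ** (1 - y)

def binomial(y, theta, n):         # y in {0, 1, ..., n}
    return math.comb(n, y) * theta ** y * (1 - theta) ** (n - y)

def poisson(y, theta):             # y in {0, 1, 2, ...}
    return theta ** y * math.exp(-theta) / math.factorial(y)

def categorical(y, theta):         # y in {1, ..., K}; theta sums to one
    return theta[y - 1]

# Sanity check: each pmf sums to one over its domain
# (the Poisson sum is truncated, which is enough for illustration).
assert abs(sum(bernoulli(y, 0.3) for y in (0, 1)) - 1) < 1e-12
assert abs(sum(binomial(y, 0.3, 10) for y in range(11)) - 1) < 1e-12
assert abs(sum(poisson(y, 2.0) for y in range(100)) - 1) < 1e-9
assert abs(sum(categorical(y, [0.2, 0.5, 0.3]) for y in (1, 2, 3)) - 1) < 1e-12
```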
2.4.2 Continuous domains
By this, we mean that the domain of the variable is the real line or a subset of the real line. We have a rich toolbox of modeling devices, including distributions bearing the names of Gauss, Laplace, and Cauchy, as well as the Gamma, Beta, and Dirichlet distributions, among others. Many of these building blocks can be viewed as instances of the exponential family of distributions. We will return to this in the sequel.
2.4.3 Multivariate domains
The most interesting and challenging scenarios deal with multiple variables and/or variables of multiple dimensions. How do we specify probability distributions in these cases?
Let us start with bivariate distributions in a discrete domain. Consider discrete random variables Y1 and
Y2 taking values in countable spaces Y1 , Y2 , respectively. We need to specify the joint probability density
function (joint pdf):
pY1Y2 (y1, y2) := Pr({Y1 = y1} ∩ {Y2 = y2}).    (2)
If Y1 and Y2 are mutually independent, the joint pdf simplifies to the product form
pY1Y2 (y1, y2) = pY1 (y1) pY2 (y2),
where the two univariate pdfs for Y1 and Y2 may be specified using the basic building blocks mentioned earlier.
In general, Y1 and Y2 are not independent; one then needs to specify the joint pdf in Eq. (2), which defines
the probability mass for each of the |Y1| × |Y2| pairs of numerical values of (y1, y2).
Once the joint pdf is specified, the marginal distribution and conditional distribution can be computed
from the joint density:
pY1 (y1) := Σ_{y2∈Y2} pY1Y2 (y1, y2),
pY2|Y1 (y2|y1) := Pr(Y1 = y1, Y2 = y2) / Pr(Y1 = y1) = pY1Y2 (y1, y2) / pY1 (y1).
From the above, we can alternatively specify the joint pY1Y2 by first specifying a marginal distribution,
say pY1, and then the conditional pdf pY2|Y1, because
pY1Y2 (y1, y2) = pY1 (y1) pY2|Y1 (y2|y1) = pY2 (y2) pY1|Y2 (y1|y2).
When the context of the random variables is clear, we may drop the subscripts to write the above as
p(y1 , y2 ) = p(y1 )p(y2 |y1 ) = p(y2 )p(y1 |y2 ).
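For a small discrete case, the marginalization and conditioning formulas above can be sketched in Python (the joint table below is invented for illustration):

```python
# A toy joint pmf for (Y1, Y2); the four probabilities are made up.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}

def p_y1(y1):
    # marginal: sum the joint over y2
    return sum(p for (a, _), p in joint.items() if a == y1)

def p_y2_given_y1(y2, y1):
    # conditional: joint divided by the marginal
    return joint[(y1, y2)] / p_y1(y1)

# The factorization p(y1, y2) = p(y1) p(y2|y1) holds for every pair.
for (y1, y2), p in joint.items():
    assert abs(p - p_y1(y1) * p_y2_given_y1(y2, y1)) < 1e-12
```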
Example 2.2. Let’s start with the following example from PH (pg. 24) and then expand on this.
In this example, we saw how to derive the conditional probabilities pY2 |Y1 and pY1 |Y2 from the joint
probabilities pY1 ,Y2 . Likewise we can also specify the joint from the marginal pY1 and the conditional pY2 |Y1 .
In any case, we essentially need to specify 5 × 5 entries for the joint probability values Pr(Y1 = y1, Y2 = y2).
Without further assumptions, we need 25 − 1 = 24 parameters for the joint pdf, one for each probability
value (the sum-to-one constraint removes one).
Suppose now that we wish to extend the joint pdf to describe social mobility not for two but for three
or more generations. Assume that the list of occupations remains 5 in this example. With three generations
(grandfathers, fathers, sons) we need to specify 5³ = 125 entries for the joint pdf. With four generations,
we need 5⁴ = 625 entries, and so on. This illustrates a fundamental challenge in working with multivariate
domains: without further assumptions, the number of parameters required is exponential in the number of
variables. This would be unworkable.
The main tool that statistical modelers exploit to overcome the complexity in modeling multivariate
domains is to make use of independence, more appropriately, conditional independence, by incorporating
our domain knowledge about the variables of interest.
Example 2.3. Continuing from the previous example, let Y1, Y2, Y3 denote the grandfather's, father's, and
son's occupations.
By the chain rule, we may always write²
p(y1 , y2 , y3 ) = p(y1 )p(y2 |y1 )p(y3 |y1 , y2 ).
We may help ourselves by making the following assumption: assume that Y3 is conditionally independent of Y1 given Y2.
This means that the joint conditional density of Y1 and Y3 given Y2 equals the product of the corresponding
marginal conditional densities:
p(y1 , y3 |y2 ) = p(y1 |y2 )p(y3 |y2 )
for any numerical values (y1 , y2 , y3 ). The reader should verify that under the above conditional independence:
p(y1 |y2 , y3 ) = p(y1 |y2 )
p(y3 |y2 , y1 ) = p(y3 |y2 ).
As a consequence, we may specify the joint pdf of Y1 , Y2 , Y3 by a smaller number of parameters, by noting
that (why?)
p(y1 , y2 , y3 ) = p(y1 )p(y2 |y1 )p(y3 |y2 ).
Question: how many parameters do we need to specify the joint pdf?
Another question: suppose that the conditional distribution of the occupation of the grandfather generation given the father’s is the same as the conditional distribution of that of the father’s generation given the
son’s. How many parameters do we need now?
² Recall that we have removed the subscripts to avoid clutter:
pY1,Y2,Y3 (y1, y2, y3) = pY1 (y1) pY2|Y1 (y2|y1) pY3|Y1,Y2 (y3|y1, y2).
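One way to count the parameters behind the two questions above can be sketched directly; the "shared" count below assumes the two conditional tables p(y2|y1) and p(y3|y2) coincide, which is one reading of the second question:

```python
K = 5  # number of occupation categories

# Unrestricted joint p(y1, y2, y3): one parameter per cell, minus the
# sum-to-one constraint.
full_joint = K ** 3 - 1

# Markov factorization p(y1) p(y2|y1) p(y3|y2): (K-1) parameters for the
# marginal, plus (K-1) per conditioning value for each conditional table.
markov = (K - 1) + K * (K - 1) + K * (K - 1)

# Variant: the two conditional tables are assumed to be identical.
shared = (K - 1) + K * (K - 1)

assert (full_joint, markov, shared) == (124, 44, 24)
```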
2.5 Bayes' formulas and parameter estimation
As we described in Section 1, in order to initiate a Bayesian analysis we need to specify the joint distribution
of the quantity of interest θ and the data y, by specifying the prior belief about θ via the prior distribution p(θ),
and the sampling model p(y|θ). In practice, y represents the values of a collection of random variables/vectors,
and θ is a random variable in a suitable domain. The principle of these specifications is the same as
before, whether y and θ are discrete or continuous valued, or a combination thereof.
A large proportion of a Bayesian modeler’s technical effort therefore is on finding a suitable specification
of the joint distribution p(θ, y) for the problem at hand. Once this is done, having observed {Y = y}, we
need to compute our updated beliefs about θ via the Bayes’ formula, which is now expressed in terms of
density function for random variables:
p(θ|y) = p(θ, y)/p(y) = p(θ)p(y|θ)/p(y).
(3)
Another significant part of the Bayesian workflow is computing the above posterior density
function of θ, expressed above as a ratio.
• The numerator is the product between the prior pdf, p(θ), and the quantity p(y|θ).
• As a function of y, we call p(y|θ) the pdf of the sampling model, where θ plays the role of the
parameter.
• As a function of θ, we call p(y|θ) the likelihood function, with the data y being fixed.
It's worth repeating that the likelihood function is not a density function. As the focus shifts toward
inference about θ, the term "likelihood function" will be invoked more often.
Although the numerator of the posterior density is often simple to compute because the prior component
and the likelihood component are typically explicitly specified, the denominator is typically difficult to
compute explicitly. It can be seen that
p(y) = ∫ p(θ, y) dθ = ∫ p(θ) p(y|θ) dθ,
which involves integration (or summation) over the space θ ∈ Θ. The integration typically does
not admit an explicit form.
One may be interested in the relative posterior density, comparing its value at different numerical
values of interest. Let θa and θb be two such numerical values of θ; then
p(θa|y)/p(θb|y) = [p(θa) p(y|θa)/p(y)] / [p(θb) p(y|θb)/p(y)] = p(θa) p(y|θa) / [p(θb) p(y|θb)].
In the above, the computation of the relative posterior density does not require the computation of p(y),
because p(y) does not depend on the specific value of θ. Accordingly, we often write
p(θ|y) ∝ p(θ) p(y|θ),
where ∝ reads "proportional to", up to a normalizing constant that ensures the left-hand side is a valid pdf
for θ. The normalizing constant is precisely p(y) in this case.
In English, we write
posterior ∝ prior × likelihood.
This captures succinctly and beautifully the spirit of Bayesian inference: the posterior belief about the quantity of interest is obtained from two sources of information, the prior belief and the empirical observations
(via the likelihood). Moreover, these two sources are combined explicitly via a multiplicative operation.
As a function of the quantity of interest, you may view this as an update to the prior belief via a reweighting
operation, where the weights are provided by the likelihood function.
Finally, in practice we are interested in various properties of the posterior density function p(θ|y), rather
than the density function itself. These help us express more precisely our belief about the true θ, because of
the Bayesian "doctrine" that we usually do not know the exact truth; we can only calculate our belief about
such truth. We have seen in the example of Section 1.3 various quantities of interest, including the posterior
mean, posterior variance, posterior mode, posterior tail probabilities, and various quantiles and confidence
regions with respect to the posterior distribution.
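The relation posterior ∝ prior × likelihood can be checked numerically on a grid. A Python sketch with invented data (y = 7 successes in n = 10 trials, uniform prior), for which the exact posterior is beta(8, 4):

```python
# Grid approximation of posterior ∝ prior × likelihood for a binomial
# likelihood with a uniform prior; the data (7 of 10) are invented.
n, y = 10, 7
grid = [i / 1000 for i in range(1, 1000)]
unnorm = [1.0 * t ** y * (1 - t) ** (n - y) for t in grid]  # prior x likelihood
total = sum(unnorm)
posterior = [u / total for u in unnorm]  # normalizing yields a valid pmf on the grid

post_mean = sum(t * p for t, p in zip(grid, posterior))
# The exact posterior is beta(1 + 7, 1 + 3) = beta(8, 4), with mean 8/12.
assert abs(sum(posterior) - 1) < 1e-9
assert abs(post_mean - 8 / 12) < 1e-3
```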
3 One-parameter models
A one-parameter model is a class of sampling distributions indexed by a single unknown parameter. We
will study Bayesian inference with several such models. Although simple, they will help to illustrate several
key concepts in Bayesian data analysis, including conjugate priors, predictive distributions and confidence
regions.
3.1 The binomial model
Example 3.1. (Happiness data) In a General Social Survey conducted in 1998, each female respondent of age 65 or
over was asked whether or not she was generally happy. Let Yi = 1 if respondent i reported being
generally happy, and 0 otherwise. The label i is given arbitrarily before the data are collected; we do not
assume to have any further information distinguishing these individuals. As before, we use p(y1, . . . , yn) as
the shorthand notation for Pr(Y1 = y1, . . . , Yn = yn), and so on.
We shall assume a binomial model to describe our sampling. Associated with this model is
a parameter θ ∈ [0, 1] such that
Y1, . . . , Yn | θ ∼iid Bernoulli(θ).
Accordingly,
p(y1, . . . , yn|θ) = θ^{Σ_{i=1}^n yi} (1 − θ)^{n − Σ_{i=1}^n yi}.
It is reported that out of n = 129 respondents, 118 individuals report being generally happy (91%), and 11
individuals do not report being generally happy (9%).
Uniform prior To continue with the Bayesian analysis, we need to give θ a prior distribution. Let us take the
uniform prior, so that
p(θ) = 1 for all θ ∈ [0, 1].
The uniform prior is considered a "vague" or "non-informative" prior, and referred to as such in the literature
[whether it is truly non-informative is a different matter!]. Now we are ready to apply Bayes' rule to
obtain
p(θ|y1, . . . , y129) ∝ p(θ) p(y1, . . . , y129|θ) = θ^118 (1 − θ)^11.
In the above expression, we drop the normalizing constant, which is p(y1, . . . , y129).
To find the mode of the posterior distribution, we need to solve the optimization problem
max_{θ∈[0,1]} log{θ^118 (1 − θ)^11}.
Taking the derivative with respect to θ and setting it to zero, we obtain the maximizer θ̂ = 118/129 =
.91, the fraction of respondents who report being generally happy. The reader might think: so much for all
the math, only to get such an obvious answer?
But what about other quantities relevant to the posterior distribution of θ? The normalizing constant of
the posterior density is p(y1, . . . , y129). While in general this quantity is difficult to calculate, for this specific
example it has a closed form: the expression defining the posterior distribution should remind us of the beta
distribution. A beta pdf is defined on [0, 1] and takes the form
p(θ|a, b) = [Γ(a + b)/(Γ(a)Γ(b))] θ^{a−1} (1 − θ)^{b−1}.    (4)
Here a, b > 0 are the parameters. Since the density function integrates to one, this implies that
∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ = Γ(a)Γ(b)/Γ(a + b).
Exercise 3.1. Based on the above identity, prove the following: under the beta(a, b) distribution,
mode[θ] = (a − 1)/[(a − 1) + (b − 1)] if a > 1, b > 1,
E[θ] = a/(a + b),
Var[θ] = ab/[(a + b + 1)(a + b)²].
Back to our example: we have
p(y) = ∫_0^1 θ^118 (1 − θ)^11 dθ = Γ(119)Γ(12)/Γ(131).
In fact, the posterior distribution of θ is indeed beta(119, 12).
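The beta identities of Exercise 3.1, applied to the beta(119, 12) posterior, can be sketched numerically (the normalizing constant is computed on the log scale with math.lgamma to avoid overflow):

```python
import math

# Posterior summaries for theta | data ~ beta(119, 12).
a, b = 119, 12
post_mode = (a - 1) / ((a - 1) + (b - 1))    # 118/129 ≈ .915
post_mean = a / (a + b)                      # 119/131 ≈ .908
post_var = a * b / ((a + b + 1) * (a + b) ** 2)
# log of p(y) = Γ(119)Γ(12)/Γ(131), a very small probability
log_p_y = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

assert abs(post_mode - 118 / 129) < 1e-12
assert post_mean < post_mode   # the density is left-skewed
assert log_p_y < 0
```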
Beta prior The uniform distribution on [0, 1] is an instance of the beta distribution with a = b = 1. Employing a beta(a, b) prior instead and applying Bayes' rule,
p(θ|y1, . . . , yn) ∝ p(θ) p(y1, . . . , yn|θ)
∝ θ^{a−1} (1 − θ)^{b−1} × θ^{Σ_{i=1}^n yi} (1 − θ)^{n−Σ_{i=1}^n yi}
∝ beta(θ | a + Σ_{i=1}^n yi, b + n − Σ_{i=1}^n yi).
This is an instance of conjugacy: a beta prior, when combined with a binomial likelihood, yields a beta
posterior distribution. Conjugacy is a property of a prior relative to a given likelihood: a prior is conjugate
with respect to a likelihood if the resulting posterior distribution takes the same form as the prior.
Conjugacy is a treasured property in Bayesian statistics because it simplifies posterior computation, a
considerable bottleneck. Once we know the form of the posterior density, we only need to concern ourselves with the
posterior distribution's parameters, which reflect the posterior update combining both the prior information
and the information gleaned from the data.
Example 3.2. In Example 3.1, the posterior distribution of θ receives its update from the data via the
statistic Σ_{i=1}^n Yi. This reflects the fact that Σ_{i=1}^n Yi is a sufficient statistic for θ under the Bernoulli sampling
model. In our Bayesian framework, we may express this as
p(θ|Y1, . . . , Yn) = p(θ | Σ_{i=1}^n Yi).
In other words, the information contained in the observed data {Y1 = y1, . . . , Yn = yn} is the same
as the information contained in Y = y, where Y = Σ_{i=1}^n Yi and y = Σ_{i=1}^n yi.
Alternatively, we may consider a sampling model in which the datum is the count of people reporting
to be "generally happy", as opposed to "not generally happy". The suitable sampling model is a binomial
distribution. Applying the same computation as above, the reader should be able to derive that if we posit
prior: θ ∼ beta(a, b)
sampling: Y = y ∼ binomial(n, θ),
then by Bayes' rule we obtain
posterior: θ|Y = y ∼ beta(a + y, b + n − y).
This is also the calculation that we relied on in Example 1.1.
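The update rule above is one line of code. A Python sketch (the function name is ours):

```python
def beta_binomial_update(a, b, y, n):
    """Posterior (a', b') for theta ~ beta(a, b) and Y = y ~ binomial(n, theta)."""
    return a + y, b + n - y

# Happiness data: uniform prior beta(1, 1), 118 of 129 report being happy.
assert beta_binomial_update(1, 1, 118, 129) == (119, 12)
```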
Prediction Having obtained the data sample {y1, . . . , yn}, we are also interested in the distribution of
new observations; this is called the predictive distribution. Suppose that Ỹ is an additional outcome from the
same population as the observed sample, via the sampling model
Y1, . . . , Yn, Ỹ | θ ∼iid p(·|θ).
Under the prior distribution
θ ∼ p(θ)
the predictive distribution of Ỹ given {Y1 = y1, . . . , Yn = yn} takes the form
p(Ỹ = ỹ|y1, . . . , yn) = ∫ p(ỹ, θ|y1, . . . , yn) dθ
= ∫ p(ỹ|θ, y1, . . . , yn) p(θ|y1, . . . , yn) dθ
= ∫ p(ỹ|θ) p(θ|y1, . . . , yn) dθ.
The last identity is due to the i.i.d. assumption in the sampling model.
Some remarks
• The predictive distribution depends on the observed data. It does not depend on the unknown θ.
• The unknown θ is integrated out in the formula via the posterior distribution. Thus the predictive
distribution takes into account both the observed data and the prior distribution.
• Contrast this with a frequentist approach: one can obtain a point estimate θ̂ based on the observed
data, and then plug it into the sampling model to produce a predictive distribution for the new observation:
pplug-in(Ỹ = ỹ) := p(ỹ|θ̂).
Because the Bayesian approach relies on a distribution over the unknown θ rather than a single numerical value of θ, it allows for a broader range of predictive distributions than a plug-in approach.
Example 3.3. Continue from Example 3.1 (binomial sampling and uniform prior). We use the uniform
distribution as the prior for the happiness level θ; the uniform distribution is beta(a, b) with a = b = 1. The
predictive probability that the next respondent answers "I'm generally happy" is
Pr(Ỹ = 1|y1, . . . , yn) = ∫ p(ỹ = 1|θ) p(θ|y1, . . . , yn) dθ
= ∫ θ p(θ|y1, . . . , yn) dθ
= (a + Σ_{i=1}^n yi)/(a + b + n).
Suppose that out of 20 people surveyed, none reports being happy; then the probability that the next person is reportedly happy will be a/(a + b + 20) = 1/22.
Contrast this with the plug-in approach: since the prior is flat, the mode of p(θ|y1, . . . , yn) is the same as the mode of the
likelihood function p(y1, . . . , yn|θ), which equals
(a + Σ_{i=1}^n yi − 1)/(a + b + n − 2) = 0.
If we plug in θ̂ = 0, then the predictive probability that the next person is reportedly happy will be 0.
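The contrast between the two predictions in this example can be spelled out numerically:

```python
# Bayesian predictive vs plug-in predictive when 0 of n = 20 people
# report being happy, under a uniform beta(1, 1) prior.
a, b, n, s = 1, 1, 20, 0                    # s = sum of the observed y_i
bayes_pred = (a + s) / (a + b + n)          # Pr(next respondent is happy | data)
theta_hat = (a + s - 1) / (a + b + n - 2)   # posterior mode (here also the MLE)
plugin_pred = theta_hat                     # plug-in predictive probability

assert abs(bayes_pred - 1 / 22) < 1e-12
assert plugin_pred == 0.0   # the plug-in prediction rules out happiness entirely
```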
3.2 Confidence regions
It is of interest to identify regions of the parameter space that are likely to contain the true value of the
unknown parameter. The following definition for a scalar parameter can be extended to multidimensional
domains.
Definition 3.1 (Bayesian coverage). An interval [l(y), u(y)], based on the observed data Y = y, has
95% Bayesian coverage for θ if
Pr(l(y) < θ < u(y)|Y = y) = .95.
Note: in the above probability expression, it is θ that is random and Y = y that is fixed. Interpretation: having
observed the data and calculated the conditional probability, the unknown θ lies in the given interval with
probability 95%.
The frequentist approach provides point estimates for the unknown θ, not a distribution. To quantify the
uncertainty of the estimate, there is the notion of a confidence interval, defined as follows.
Definition 3.2 (Frequentist coverage). A random interval [l(Y ), u(Y )] has 95% frequentist coverage for θ
if, before the data are gathered,
Pr(l(Y ) < θ < u(Y )|θ) = .95.
Note: in the above probability expression, it is Y that is random, while θ is unknown but fixed. Once you observe Y = y, you cannot provide any guarantee for [l(y), u(y)] regarding the unknown θ. What frequentist
coverage means is: if we are to run a large number of unrelated (independent) experiments and create the
interval [l(y), u(y)] for each one of them, then we can expect that 95% of the intervals contain the correct
parameter value.
Some remarks
• Both notions are useful.
• The frequentist coverage describes the pre-experiment coverage, i.e., it promises a guarantee if the
experiments are to be repeated many times in the future.
• The Bayesian coverage describes the post-experiment coverage, i.e., it is applicable to the data at
hand, under a prior specification.
• As the sample size gets large, the two coverages usually tend toward the same interval.
Quantile-based interval This is the easiest way to obtain a Bayesian coverage: take l(y) := θ_{α/2} and
u(y) := θ_{1−α/2}, the left and right thresholds for the α/2 probability tails of the posterior distribution:
Pr(θ < θα/2 |Y = y) = Pr(θ > θ1−α/2 |Y = y) = α/2.
In R programming language:
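The R snippet is not reproduced in these notes; as a sketch, a Monte Carlo version of the quantile-based interval in Python (standard library only), using the beta(119, 12) posterior of the happiness example:

```python
import random

random.seed(0)
S = 100_000
# Sorted posterior draws; empirical quantiles approximate the exact ones.
draws = sorted(random.betavariate(119, 12) for _ in range(S))
lower = draws[int(0.025 * S)]   # approximate 2.5% posterior quantile
upper = draws[int(0.975 * S)]   # approximate 97.5% posterior quantile

# The interval brackets the posterior mean 119/131 ≈ .908.
assert lower < 119 / 131 < upper
assert 0 < lower < upper < 1
```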
A potential problem with this interval is that some θ-values outside the quantile-based interval may
have higher probability (density) than some points inside the interval. In addition, for a multi-modal posterior
distribution (one having multiple peaks), this choice of interval may not be very useful.
Figure 3.1: Quantile-based interval and highest posterior density regions.
An alternative is the so-called ”highest posterior density (HPD)” region: it is the subset s(y) ⊂ Θ such
that
(i) Pr(θ ∈ s(y)|Y = y) = 1 − α.
(ii) If θa ∈ s(y) and θb ∉ s(y), then p(θa|Y = y) > p(θb|Y = y).
See Fig. 3.1 for an illustration. The HPD region is characterized by a threshold c > 0 on the posterior density.
By sliding the threshold up and down we obtain different values of α. When the posterior density is a
multi-modal function, the HPD region may be composed of multiple disconnected subsets.
3.3 The Poisson model
Poisson is a probability distribution whose domain is the unbounded set of natural numbers. It is a useful
modeling tool for count data.
Consider the Poisson sampling model: Y |θ ∼ Poisson(θ). That is, for y = 0, 1, . . .,
Pr(Y = y|θ) = θy e−θ /y!.
Poisson random variables have an interesting feature in that both the mean and the variance are determined
by the same parameter θ and in fact, E[Y |θ] = Var[Y |θ] = θ.
Given an i.i.d. n-sample Y1, . . . , Yn | θ ∼iid Poisson(θ), we have
Pr(Y1 = y1, . . . , Yn = yn|θ) = Π_{i=1}^n p(yi|θ) = Π_{i=1}^n θ^{yi} e^{−θ}/yi! =: c(y1, . . . , yn) θ^{Σ_i yi} e^{−nθ}.
From the above expression we find that Σ_{i=1}^n Yi is a sufficient statistic of the Poisson sampling model.
Moreover, it can be verified that Σ_{i=1}^n Yi | θ ∼ Poisson(nθ).
We proceed to give a prior distribution for θ ∈ R+ . By Bayes’ rule, we know that a prior pdf p(θ) yields
the posterior pdf of the form
p(θ|y1, . . . , yn) ∝ p(θ) p(y1, . . . , yn|θ) ∝ p(θ) θ^{Σ_i yi} e^{−nθ}.
If we want a conjugate prior, then p(θ) must be of the form θ^{c1} e^{−c2 θ}, up to a multiplicative constant. The
pdf that has this form is given by the Gamma distribution.
Gamma distribution Endow θ with the Gamma prior: θ|a, b ∼ Gamma(a, b), for some (hyper)parameters a, b > 0:
p(θ) = [b^a/Γ(a)] θ^{a−1} e^{−bθ}.
Here a is called the shape parameter and b the rate parameter of the Gamma distribution.
With this prior in place, the posterior pdf takes the form
p(θ|y1, . . . , yn) ∝ θ^{a + Σ_i yi − 1} e^{−(b+n)θ}.
The proportionality simplifies the expression by allowing us to keep only terms that vary with θ. This
shows that the posterior pdf is that of another Gamma distribution. In other words, we have shown that the Gamma
is a conjugate prior with respect to the Poisson sampling/likelihood model:
θ | Y1, . . . , Yn ∼ Gamma(a + Σ_{i=1}^n Yi, b + n).
Based on basic properties of the Gamma distribution, we find
E[θ|y1, . . . , yn] = (a + Σ yi)/(b + n) = [b/(b + n)] (a/b) + [n/(b + n)] (Σ yi/n),
Var[θ|y1, . . . , yn] = (a + Σ yi)/(b + n)².
We find that the posterior mean is, again, a convex combination of the prior expectation and the sample
average. Note the impact of increasing the sample size n.
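The convex-combination form of the posterior mean can be verified directly; the numbers below follow the birth-rate example of Section 3.4 (a = 2, b = 1, n = 44, Σ yi = 66):

```python
# Posterior mean (a + sum_y)/(b + n) as a convex combination of the
# prior mean a/b and the sample mean sum_y/n.
a, b, n, sum_y = 2, 1, 44, 66
post_mean = (a + sum_y) / (b + n)
combo = (b / (b + n)) * (a / b) + (n / (b + n)) * (sum_y / n)

assert abs(post_mean - combo) < 1e-12
assert abs(post_mean - 68 / 45) < 1e-12   # ≈ 1.51
```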
We proceed to the posterior predictive distribution. For ỹ = 0, 1, 2, . . .,
p(ỹ|y1, . . . , yn) = ∫_0^∞ p(ỹ|θ, y1, . . . , yn) p(θ|y1, . . . , yn) dθ
= ∫ p(ỹ|θ) p(θ|y1, . . . , yn) dθ
= ∫ Poisson(ỹ|θ) Gamma(θ | a + Σ yi, b + n) dθ
= ∫ [θ^{ỹ} e^{−θ}/ỹ!] · [(b + n)^{a + Σ yi}/Γ(a + Σ yi)] θ^{a + Σ yi − 1} e^{−(b+n)θ} dθ
= [(b + n)^{a + Σ yi}/(Γ(ỹ + 1) Γ(a + Σ yi))] ∫ θ^{a + Σ yi + ỹ − 1} e^{−(b+n+1)θ} dθ.
Exploiting the identity that follows from the definition of the Gamma density,
∫_0^∞ θ^{a−1} e^{−bθ} dθ = Γ(a)/b^a,
we obtain
p(ỹ|y1, . . . , yn) = [Γ(a + Σ yi + ỹ)/(Γ(ỹ + 1) Γ(a + Σ yi))] · [(b + n)/(b + n + 1)]^{a + Σ yi} · [1/(b + n + 1)]^{ỹ}.
This is a negative binomial distribution with parameters (a + Σ yi, b + n) (i.e., the number ỹ of failures
until a + Σ yi successes), for which
E[Ỹ|y1, . . . , yn] = (a + Σ yi) · [1/(b + n + 1)] / [(b + n)/(b + n + 1)] = (a + Σ yi)/(b + n) = E[θ|y1, . . . , yn],
Var[Ỹ|y1, . . . , yn] = E[Ỹ|y1, . . . , yn] · (b + n + 1)/(b + n) = [(a + Σ yi)/(b + n)] · [(b + n + 1)/(b + n)] = Var[θ|y1, . . . , yn] (b + n + 1).
Note how the posterior predictive mean of Ỹ is the same as that of θ. This is due to the fact that under the Poisson
sampling model, E[Ỹ|θ] = θ. Note also that under the Poisson model, Var[Ỹ|θ] = θ. The posterior predictive variance
of Ỹ is quite a bit larger than that of θ: the sources of its variability are both the Poisson sampling model
and the parameter θ itself. As n gets large, the posterior of θ contracts considerably, so the variability of Ỹ
stems primarily from the Poisson sampling model rather than from the parameter.³
³ Instead of exploiting properties of the negative binomial distribution, we may appeal to the iterated expectation and iterated
variance formulas to arrive at the above formulas for the predictive posterior distribution.
Figure 3.2: Birthrate data from the 1990s General Social Survey: number of children for the two groups of
women.
3.4 Example: birth rates
We follow the example in PH (2009), Chapter 3. Fig. 3.2 illustrates the data collected on the number of
children of 155 women who were 40 years of age at the time of the survey. The women are divided into two
groups: those with college degrees and those without.
Let {Yi,1}_{i=1}^{n1} denote the data from the first group, and {Yi,2}_{i=1}^{n2} the data from the second group. To compare
the two groups, we shall make use of the Poisson sampling model:
Y1,1, . . . , Yn1,1 | θ1 ∼iid poisson(θ1),
Y1,2, . . . , Yn2,2 | θ2 ∼iid poisson(θ2).
Some basic statistics:
• Less than bachelor's: n1 = 111, Σ Yi,1 = 217, Ȳ1 = 1.95.
• Bachelor's or higher: n2 = 44, Σ Yi,2 = 66, Ȳ2 = 1.50.
Let us endow θ1 and θ2 with the same prior:
θ1, θ2 ∼iid gamma(a = 2, b = 1).
Then we obtain the following posterior distributions:
θ1 | {n1 = 111, Σ Yi,1 = 217} ∼ gamma(2 + 217, 1 + 111) = gamma(219, 112),
θ2 | {n2 = 44, Σ Yi,2 = 66} ∼ gamma(2 + 66, 1 + 44) = gamma(68, 45).
In R codes:
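The R listing is not preserved in these notes; a Monte Carlo sketch in Python of the probability reported next (note that random.gammavariate takes shape and scale, so the rates 112 and 45 enter as reciprocals):

```python
import random

random.seed(1)
S = 50_000
# Independent draws from the two gamma posteriors.
theta1 = [random.gammavariate(219, 1 / 112) for _ in range(S)]
theta2 = [random.gammavariate(68, 1 / 45) for _ in range(S)]
prob = sum(a > b for a, b in zip(theta1, theta2)) / S

assert 0.95 < prob < 0.99   # the notes report .97
```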
The posterior distributions give substantial evidence that θ1 > θ2. For example, it can be computed that
Pr(θ1 > θ2 | Σ Yi,1 = 217, Σ Yi,2 = 66) = .97.
Figure 3.3: Posterior distributions of mean birth rates with the common prior given by the dashed line, and
the posterior predictive distributions for number of children.
To what extent do we expect a woman without a bachelor's degree to have more children than a woman
with one? See the right panel in Fig. 3.3.
In R codes:
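The R listing is again not preserved; a direct Python sketch that sums the two negative binomial predictive pmfs (truncated at 50 children, which captures essentially all the mass) and reproduces the probabilities quoted below:

```python
import math

def nb_pred_pmf(y, a, b):
    # Posterior predictive pmf from a Gamma(a, b) posterior under Poisson
    # sampling (the negative binomial derived in Section 3.3), evaluated
    # on the log scale for numerical stability.
    log_p = (math.lgamma(a + y) - math.lgamma(y + 1) - math.lgamma(a)
             + a * math.log(b / (b + 1)) - y * math.log(b + 1))
    return math.exp(log_p)

p1 = [nb_pred_pmf(y, 219, 112) for y in range(50)]   # without bachelor's
p2 = [nb_pred_pmf(y, 68, 45) for y in range(50)]     # with bachelor's
assert abs(sum(p1) - 1) < 1e-9 and abs(sum(p2) - 1) < 1e-9

prob_greater = sum(p1[y1] * p2[y2] for y1 in range(50) for y2 in range(y1))
prob_equal = sum(p1[y] * p2[y] for y in range(50))

assert abs(prob_greater - 0.48) < 0.02   # the notes report .48
assert abs(prob_equal - 0.22) < 0.02     # the notes report .22
```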
There is considerable overlap between the two posterior predictive distributions of Ỹ1 and Ỹ2. We can
compute that
Pr(Ỹ1 > Ỹ2 | Σ Yi,1 = 217, Σ Yi,2 = 66) = .48,
Pr(Ỹ1 = Ỹ2 | Σ Yi,1 = 217, Σ Yi,2 = 66) = .22.
This is a reminder that the Poisson sampling model has high variance, so strong evidence of a difference
between the two populations' means does not imply that individual observations are overtly different.
4 Monte Carlo approximation
Suppose that we are interested in quantities derived from the posterior distribution, such as
(i) Pr(θ ∈ A|y1, . . . , yn) for some subset A ⊂ Θ;
(ii) the posterior mean, variance, and confidence intervals for θ1 − θ2, θ1/θ2, or max{θ1, . . . , θm}.
Under conjugacy, some of these quantities may be explicitly available in closed form, but this is not
always the case. When we deal with complex models where no conjugate form of the prior is available,
posterior computation becomes a major issue. This was in fact the main barrier for Bayesian statistics before
the age of computers. Thankfully, with computational advances, this barrier can be overcome. One
of the primary computational techniques for Bayesian computation is Markov chain Monte Carlo. In this
section, we will explore the "Monte Carlo" part of the technique.
4.1 Basic ideas
Suppose we could draw some number S of i.i.d. samples from the posterior distribution:
θ(1), . . . , θ(S) ∼iid p(θ|y1, . . . , yn).
Then the posterior distribution can be approximated by the empirical distribution of the S-sample. Notationally,
p(·|y1, . . . , yn) ≈ (1/S) Σ_{s=1}^S δ_{θ(s)}(·).
The Monte Carlo technique is simply this: take any function g(θ) (integrable with respect to the
posterior distribution); by the law of large numbers, as S → ∞,
(1/S) Σ_{s=1}^S g(θ(s)) → E[g(θ)|y1, . . . , yn] = ∫ g(θ) p(θ|y1, . . . , yn) dθ.
Taking different choices for the function g, we obtain:
• θ̄ := Σ_{s=1}^S θ(s)/S → E[θ|y1, . . . , yn].
• (1/(S−1)) Σ_{s=1}^S (θ(s) − θ̄)² → Var[θ|y1, . . . , yn].
• #(θ(s) ≤ c)/S = (1/S) Σ_{s=1}^S I(θ(s) ≤ c) → Pr(θ ≤ c|y1, . . . , yn).
• median{θ(1), . . . , θ(S)} → θ_{1/2}.
• the α-percentile of {θ(1), . . . , θ(S)} tends to θ_α.
Numerical evaluation In the previous section, we used a Poisson sampling model, Y1, . . . , Yn|θ ∼ Poisson(θ),
and endowed the parameter θ with a gamma prior: θ ∼ Gamma(a, b). We know that the posterior of θ is
Gamma(a + Σ yi, b + n), which yields the posterior mean (a + Σ yi)/(b + n) = 68/45 = 1.51.
If we didn't have this mean formula, we could appeal to Monte Carlo approximation in R: first obtain
random Gamma samples, then use them to approximate the mean, probabilities of intervals of interest, or
relevant quantiles.
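As a Python stand-in for the missing R snippets (random.gammavariate uses shape and scale, so the rate 45 enters as 1/45):

```python
import random

random.seed(2)
S = 100_000
# Step 1: draw posterior samples theta ~ Gamma(68, 45).
draws = sorted(random.gammavariate(68, 1 / 45) for _ in range(S))

# Step 2: Monte Carlo estimates of posterior summaries.
mc_mean = sum(draws) / S                          # should be near 68/45 ≈ 1.51
mc_prob = sum(1.0 < t < 2.0 for t in draws) / S   # Pr(1 < theta < 2 | data)
mc_q025 = draws[int(0.025 * S)]                   # approximate 2.5% quantile

assert abs(mc_mean - 68 / 45) < 0.01
assert 0 < mc_q025 < mc_mean
```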
Figure 4.1: Convergence of Monte Carlo estimates as MC sample size increases.
Fig. 4.1 provides an illustration of the effect of increasing the Monte Carlo sample size S. Note that the
MC sample size S has nothing to do with the sample size of the observed data set; S represents a
computational cost, which becomes cheaper as computers become more powerful.
The standard way of choosing S is to choose it just large enough that the Monte Carlo standard error is
less than the precision to which we want to report the quantity of interest.
Example 4.1. We want to compute the posterior expectation of θ. The Monte Carlo estimate gives us θ̄.
By the central limit theorem, the sample mean θ̄ is approximately normally distributed with
expectation E[θ|y1, . . . , yn] and variance Var[θ|y1, . . . , yn]/S.
So letting σ̂² = (1/(S−1)) Σ (θ(s) − θ̄)² be the MC estimate of the variance σ², the MC standard error (of
the MC estimate of the posterior mean of θ) is √(σ̂²/S). Thus, the approximate 95% MC confidence interval for
the posterior mean is θ̄ ± 2√(σ̂²/S).
For example, suppose one set S = 100 and found that the MC estimate of Var[θ|y1, . . . , yn] was 0.024. Then the
approximate MC standard error for the mean would be √(0.024/100) = 0.015. Suppose that you wanted the
difference between the posterior mean E[θ|y1, . . . , yn] and its MC estimate to be less than 0.01 with high
probability (i.e., > 95% confidence); then you would need to increase your sample size so that 2√(0.024/S) <
0.01, i.e., S > 960.
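The arithmetic in Example 4.1 can be spelled out directly:

```python
import math

# Worked numbers from Example 4.1: MC variance estimate 0.024 at S = 100,
# and the S needed so that 2 * sqrt(0.024 / S) < 0.01.
sigma2_hat = 0.024
mc_se = math.sqrt(sigma2_hat / 100)               # ≈ 0.0155
S_needed = math.ceil(4 * sigma2_hat / 0.01 ** 2)  # solve the inequality for S

assert abs(mc_se - 0.0155) < 1e-3
assert S_needed == 960
```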
4.2 Posterior inference for arbitrary functions
Recall the example of birthrates in Section 3.4. Based on the prior specifications and the data of birthrates,
the posterior distributions for the two educational groups are
{θ1 |y1,1 , . . . , yn1 ,1 } ∼ Gamma(219, 112) (women without bachelor’s degrees)
{θ2 |y1,2 , . . . , yn2 ,2 } ∼ Gamma(68, 45) (women with bachelor’s degrees).
We are interested in Pr(θ1 > θ2 | data from both groups), or the posterior of the ratio θ1/θ2. Obtain
Monte Carlo samples independently for the two data groups:
sample θ1(1), . . . , θ1(S) ∼iid p(θ1 | data from first group),
sample θ2(1), . . . , θ2(S) ∼iid p(θ2 | data from second group).
Accordingly, the pairs (θ1(s), θ2(s)) for s = 1, . . . , S are i.i.d. Monte Carlo samples. We can approximate
Pr(θ1 > θ2 | data from both groups) ≈ (1/S) Σ_{s=1}^S I(θ1(s) > θ2(s)).
In R codes
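The R listing is not preserved; the same Monte Carlo idea applied to the ratio θ1/θ2 mentioned above, sketched in Python:

```python
import random

random.seed(3)
S = 50_000
# Paired draws: each pair (theta1, theta2) yields one MC sample of the ratio.
ratios = sorted(
    random.gammavariate(219, 1 / 112) / random.gammavariate(68, 1 / 45)
    for _ in range(S)
)
ratio_mean = sum(ratios) / S
prob_gt_1 = sum(r > 1 for r in ratios) / S   # same event as theta1 > theta2

assert 0.95 < prob_gt_1 < 0.99   # matches the .97 reported in Section 3.4
assert 1.2 < ratio_mean < 1.4    # E[theta1] E[1/theta2] = (219/112)(45/67) ≈ 1.31
```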
4.3 Sampling from posterior predictive distributions
The parameter θ and the prior on θ represent the modeler's understanding of the data population. Different
modelers may come up with different parameterizations and different prior specifications. How do we verify
the validity of, and compare among, different models? This is usually done through assessment of the predictive
distribution.
We saw examples of predictive distributions in Section 3. In general, a predictive distribution is the
(marginal) distribution of unobserved data Ỹ, obtained by
• conditioning on all known quantities;
• integrating out all unknown quantities.
Before we have seen any data, the modeling assumptions yield the prior predictive distribution
p(ỹ) = ∫ p(ỹ|θ) p(θ) dθ.
Having observed the data set {y1, . . . , yn}, we obtain the posterior predictive distribution
p(ỹ|y1, . . . , yn) = ∫ p(ỹ|θ, y1, . . . , yn) p(θ|y1, . . . , yn) dθ = ∫ p(ỹ|θ) p(θ|y1, . . . , yn) dθ.
Example 4.2. Continue with the birth-rate modeling considered earlier. We assumed a Poisson sampling
model, Y|θ ∼ Poisson(θ), for a data population (say, the group of women aged 40 with a college degree),
and placed a Gamma prior on θ: θ ∼ Gamma(a, b). We found that the resulting prior predictive distribution
of Ỹ is a negative binomial with parameters (a, b).
Having observed an n-data sample, we found that the posterior distribution of θ is Gamma(a + Σ yi, b + n),
and the posterior predictive distribution of Ỹ is a negative binomial with parameters (a + Σ yi, b + n). In this
example, thanks to conjugacy, we have a closed form for the predictive distribution both a priori and a
posteriori.
In general, we probably won't be so "lucky" — most realistic models do not admit a closed form for
the posterior distribution. In order to evaluate the posterior predictive distribution, we may proceed by
drawing samples from it instead.
The key is to observe that p(ỹ|y1, . . . , yn) can be viewed as a mixture of the sampling distributions p(ỹ|θ), where θ varies according to the posterior distribution p(θ|y1, . . . , yn). If we can draw samples from the posterior of θ, we can use each such sample to draw a sample from the corresponding sampling distribution.
To be specific, for s = 1, . . . , S, obtain independent Monte Carlo samples as follows:
• sample θ(s) ∼ p(θ|y1, . . . , yn), and then sample ỹ(s) ∼ p(ỹ|θ(s)).
Then ỹ(1), . . . , ỹ(S) is a valid i.i.d. S-sample from the posterior predictive distribution.
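As a concrete instance, the two steps can be sketched in Python for the Poisson–Gamma model of Example 4.2 (numpy parameterizes the Gamma by shape and scale, so rate b becomes scale 1/b; the posterior parameters 219 and 112 are taken from the birth-rates example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior of theta in the Poisson-Gamma model: Gamma(a + sum(y), b + n).
# Values from the birth-rates example: a = 2, b = 1, sum(y) = 217, n = 111.
a_post, b_post = 2 + 217, 1 + 111
S = 50_000

# Step 1: draw theta^(s) from the posterior p(theta | y_1, ..., y_n).
theta = rng.gamma(shape=a_post, scale=1.0 / b_post, size=S)
# Step 2: draw y~^(s) from the sampling model p(y~ | theta^(s)).
y_tilde = rng.poisson(theta)

# y_tilde is now an i.i.d. S-sample from the posterior predictive;
# its mean should be near the posterior mean a_post / b_post.
print(y_tilde.mean())
```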
Example 4.3. Continuing the birth rates modeling example. Suppose we are interested in the predictive probability that an age-40 woman without a college degree would have more children than an age-40 woman with a college degree (using prior Gamma parameters a = 2, b = 1):
Pr(Ỹ1 > Ỹ2 | ΣYi,1 = 217, ΣYi,2 = 66) = Σ_{ỹ2=0}^{∞} Σ_{ỹ1=ỹ2+1}^{∞} NegBinomial(ỹ1, 219, 112) × NegBinomial(ỹ2, 68, 45).
This can be easily evaluated via the MC technique.
Other quantities of interest can be computed from these MC samples as well. We can also plot an estimate of the posterior predictive distribution for Ỹ1 − Ỹ2, as illustrated in Fig. ??.
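A minimal Monte Carlo sketch of this evaluation (in Python; the posterior Gamma parameters (219, 112) and (68, 45) are those given above):

```python
import numpy as np

rng = np.random.default_rng(1)
S = 100_000

# Posterior draws of the two rates: Gamma(219, rate 112) for women without
# a college degree, Gamma(68, rate 45) for women with one.
theta1 = rng.gamma(219, 1.0 / 112, size=S)
theta2 = rng.gamma(68, 1.0 / 45, size=S)

# Posterior predictive draws of the numbers of children.
y1 = rng.poisson(theta1)
y2 = rng.poisson(theta2)

# Monte Carlo estimate of Pr(Y~1 > Y~2 | data); roughly 0.48.
print(np.mean(y1 > y2))
```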
Additional remark We can use the same technique to draw samples from the prior predictive distribution; such samples are then utilized for setting prior parameters. This technique is very useful if the prior distribution is not conjugate, and/or the prior predictive distribution is not easily accessible in closed form.
Figure 4.2: Evaluation of model fit. Left panel: the empirical and predictive distributions of the number
of children of women without a bachelor’s degree. Right panel: The posterior predictive distribution of the
empirical odds of having two children versus one child in a data set of size n1 = 111. The observed odds
are given in the short vertical line.
4.4 Posterior predictive model checking
We again use the birthrates data example to illustrate the important issue of model checking via posterior
predictive distributions. We used a Poisson sampling model endowed with a Gamma prior to describe the
number of children of groups of age-40 women with or without college degrees.
Consider the group of women without college degrees, for which we arrived at the posterior predictive distribution for Ỹ1 (which is a negative binomial). Let us compare that distribution with the empirical distribution. Note that these two objects are computed from the same data sample {y1,1, . . . , yn1,1}, where n1 = 111.
In the empirical sample, shown in black, the number of women with exactly two children is 38, which is twice the number of women with one child. By contrast, this group's posterior predictive distribution, shown in gray, suggests that the probability of sampling a woman with two children is slightly less than that of sampling a woman with one (0.27 and 0.28, respectively). How do we make sense of this significant discrepancy?
There are two possible explanations.
• There is sampling variability, and the sample size is probably too small, so the empirical distribution of sampled data does not generally match exactly the distribution of the population. In fact, empirical distributions (like all histograms) usually look bumpy, so having a predictive distribution that smooths over the bumps may be desirable.
• An alternative explanation is that the Poisson model is quite wrong. This is plausible because there is
no Poisson distribution with such a sharp peak at y = 2. Having said that, note that the posterior predictive distribution is in fact a mixture of Poissons that equals a negative binomial, so this explanation
needs further evaluation.
We can evaluate the validity of the posterior predictive model via Monte Carlo simulation. We need a "marker", and in this case we use the ratio of the number of y = 2's to the number of y = 1's in our data. For every vector y of length n1 = 111, let t(y) denote this ratio. For our observed data sample, y^obs, we have t(y^obs) = 2.
What sort of values of t(Ỹ ) should one expect, if Ỹ are drawn from the posterior predictive distribution?
The Monte Carlo simulation procedure is as follows. For s = 1, . . . , S,
• sample θ(s) ∼ p(θ|Y = y^obs);
• sample Ỹ(s) = (ỹ1(s), . . . , ỹn1(s)) ∼iid p(y|θ(s));
• compute t(s) = t(Ỹ(s)).
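This procedure can be sketched in Python, assuming the Gamma(219, 112) posterior for θ derived earlier for the without-degree group:

```python
import numpy as np

rng = np.random.default_rng(2)

n1, S = 111, 5_000
t_obs = 2.0  # observed ratio of #(y == 2) to #(y == 1)

t_rep = np.empty(S)
for s in range(S):
    theta = rng.gamma(219, 1.0 / 112)          # theta^(s) ~ p(theta | y_obs)
    y_rep = rng.poisson(theta, size=n1)        # replicated dataset Y~^(s)
    ones = max((y_rep == 1).sum(), 1)          # guard against dividing by zero
    t_rep[s] = (y_rep == 2).sum() / ones       # t^(s) = t(Y~^(s))

# Fraction of replicated datasets whose statistic reaches the observed value;
# a very small fraction signals that the Poisson model fits poorly here.
print(np.mean(t_rep >= t_obs))
```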
The right panel of Fig. 4.2 shows the histogram of t(Ỹ) that one can get out of 10000 Monte Carlo samples (note: each MC sample here consists of an n1-sample represented by Ỹ(s)). Observe that out of 10000 such datasets only about 0.5% had values of t(y) that equaled or exceeded t(y^obs). This indicates that our Poisson sampling model is flawed. If one is particularly interested in a more accurate model for Y, perhaps a more complex sampling model than the Poisson is warranted.
Certain aspects of the Poisson sampling model may still be useful in this example. For instance, if we are only interested in population parameters such as the mean and variance via θ, then the Poisson is quite accurate in capturing the relationship between these quantities, as the empirical mean and empirical variance are found to be 1.95 and 1.90, respectively.
It is known in theory that even if a model is misspecified, some aspects of the population may still be
accurately estimated with such a model. In practice, as George Box said, all models are wrong, but some are
useful. Thus, while statistical modelers constantly search for better models, and we have a vast arsenal for
doing so as you will see in later lectures, we do not readily discard simpler ones just for the sake of bigger
and more complex models.
5 The normal model
5.1 The normal / Gaussian distribution
A random variable Y is said to be normally distributed with mean θ and variance σ² > 0 if the density of Y takes the form
p(y|θ, σ²) = (1/√(2πσ²)) exp(−(y − θ)²/(2σ²)), −∞ < y < ∞.
Several important properties:
• the distribution is symmetric about θ; the mode, median and mean are all equal to θ
• σ 2 represents the spread of the mass: about 95% of the population lies within (θ ± 2σ)
• if X ∼ Normal(µ, τ 2 ) and Y ∼ Normal(θ, σ 2 ), and X and Y are independent, then
aX + bY ∼ Normal(aµ + bθ, a2 τ 2 + b2 σ 2 ).
The normal distribution is one of the most useful and widely utilized models in the statistical sciences. Its importance stems primarily from the central limit theorem, which says that under very general conditions, the empirical average of a collection of random variables is approximately distributed according to the Gaussian (normal) distribution.
Example 5.1. The following figure shows a normal density function overlaid on the histogram of heights of n = 1375 women over age 18 collected in a study of 1100 English families from 1893 to 1898. One explanation for the variability in heights among these women is that the women were heterogeneous in terms of a number of factors controlling human growth, such as genetics, diet, disease, stress and so on. Variability in such factors results in variability in height. Thus, letting yi be the height in inches of woman i, a simple additive model for height might be
yi = a + b × genei + c × dieti + d × diseasei + . . .
where genei might denote the presence of a particular height-promoting gene, dieti might measure some aspect of woman i's diet, and so on. Now, there may be a very large number of genes, dietary factors, and so on that contribute to a woman's height. If the effects of these factors are additive, then the height of a random woman may be modeled as a linear combination of a large number of random variables. The central limit theorem says that such a linear combination is approximately distributed according to a normal distribution.
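A quick simulation illustrating this point; the number of factors, their effect sizes, and the baseline height below are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Each woman's height = baseline + sum of many small independent factor
# effects (genes, diet, ...), as in the additive model above.
n_factors, n_women = 200, 50_000
effects = rng.uniform(0.0, 0.06, size=n_factors)      # per-factor effects (inches)
present = rng.random((n_women, n_factors)) < 0.5      # which factors are active
heights = 60.0 + present @ effects

# By the CLT the simulated heights are approximately normal: symmetric
# and bell-shaped around their mean.
print(heights.mean(), heights.std())
```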
5.2 Inference of the mean with variance fixed
Given a sampling model Y1, . . . , Yn |θ, σ² ∼iid Normal(θ, σ²), the joint sampling pdf is
p(y1, . . . , yn |θ, σ²) = ∏_{i=1}^{n} p(yi|θ, σ²)
= ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(yi − θ)²/(2σ²))
= (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1}^{n} (yi − θ)²)
= (2πσ²)^(−n/2) exp(−(1/2)[Σyi²/σ² − 2θΣyi/σ² + nθ²/σ²]).
This expression shows that {Σyi², Σyi} form a (two-dimensional) sufficient statistic for the normal model's parameters θ and σ². Equivalently, let ȳ := Σyi/n and s² := Σ(yi − ȳ)²/(n − 1); then (ȳ, s²) is a sufficient statistic.
Suppose that σ is fixed and known; the quantity of interest is θ. It is easy to see that the maximum
likelihood estimate for θ is θ̂ = ȳ.
Let us proceed to specifying a conjugate prior for θ. Given a (conditional) prior distribution p(θ|σ²), the posterior pdf takes the form
p(θ|y1, . . . , yn) ∝ p(θ|σ²)p(y1, . . . , yn |θ, σ²) ∝ p(θ|σ²) exp(−(1/(2σ²)) Σ(θ − yi)²).
The simplest possible form for a conjugate prior for θ is of the form e^(c1(θ−c2)²). This suggests a normal distribution prior:
Prior: θ ∼ Normal(µ0, τ0²).
Continuing with the Bayesian update:
p(θ|y1, . . . , yn, σ²) ∝ exp(−(1/(2τ0²))(θ − µ0)²) × exp(−(1/(2σ²)) Σ(θ − yi)²)
∝ exp(−(1/2)(aθ² − 2bθ + c)),
where it is easy to verify that
a = 1/τ0² + n/σ², b = µ0/τ0² + Σyi/σ²,
and c is independent of θ. Since the exponent of the posterior pdf is a quadratic form, with negative coefficient of the leading (second order) term, this must be the pdf of a normal distribution. Let us derive the corresponding mean and variance of the posterior.
p(θ|σ², y1, . . . , yn) ∝ exp(−(1/2)(aθ² − 2bθ)) ∝ exp(−(a/2)(θ − b/a)²) = Normal(b/a, 1/a).
Combining information Thus we have obtained that the posterior distribution of θ is indeed normal, with mean µn and variance τn²:
τn² = 1/a = 1/(1/τ0² + n/σ²), µn = b/a = ((1/τ0²)µ0 + (n/σ²)ȳ)/(1/τ0² + n/σ²). (5)
Not only does the posterior pdf remain Gaussian, its corresponding parameters are obtained by combining information from the prior and the data in an intuitive way.
• Posterior variance: Inverse variance is often referred to as the precision.
Let σ̃ 2 = 1/σ 2 denote the sampling precision, τ̃02 = 1/τ02 the prior precision and τ̃n2 = 1/τn2 . Then
τ̃n2 = τ̃02 + nσ̃ 2 ,
so the precision (for the parameter of interest) adds up with more data.
• Posterior mean:
µn = (τ̃0²/(τ̃0² + nσ̃²)) µ0 + (nσ̃²/(τ̃0² + nσ̃²)) ȳ.
The posterior mean is a convex combination (i.e., weighted average) of the prior mean and the sample mean. The weights are the corresponding precisions from the prior and the data. The prior precision provides a shrinkage effect, pulling the estimate toward the prior mean. As the sample size n increases, the information from the data takes over.
Predictive distribution Consider predicting a new observation Ỹ from the population after having observed (Y1 = y1, . . . , Yn = yn). That is, we wish to find p(ỹ|y1, . . . , yn).
In general, to find the predictive distribution we need to perform an integration over the unknown θ. For
the normal model, the situation is very easy (without having to perform this integration), due to the fact that
a linear combination of normal random variables is another normal random variable.
In particular, for our model
Ỹ|θ, σ² ∼ Normal(θ, σ²) ⇔ Ỹ = θ + ε, where ε|θ, σ² ∼ Normal(0, σ²).
Since θ|y1, . . . , yn ∼ Normal(µn, τn²) and ε is also normal and (conditionally) independent of θ, we obtain
Ỹ|σ², y1, . . . , yn ∼ Normal(µn, τn² + σ²).
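A quick Monte Carlo check of this variance formula, with hypothetical values for µn, τn² and σ²:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical posterior and sampling parameters.
mu_n, tau2_n, sigma2 = 1.8, 0.002, 0.017
S = 200_000

# Y~ = theta + eps with theta | data ~ Normal(mu_n, tau2_n) and
# eps ~ Normal(0, sigma2), independent of theta.
theta = rng.normal(mu_n, np.sqrt(tau2_n), size=S)
y_tilde = rng.normal(theta, np.sqrt(sigma2))

# The sample variance should be close to tau2_n + sigma2 = 0.019.
print(y_tilde.var())
```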
Example 5.2. (Midge wing length) We are given a data set on the wing length in millimeters of nine
members of a species of midge (small, two-winged flies). From these nine measurements we wish to make
inference about the population mean θ.
From previous studies, wing lengths are typically around 1.9 mm, so we set µ0 = 1.9. We also know that wing lengths are positive-valued, but since we are using a normal prior, we need to choose τ0 so that most of the mass is concentrated on positive values. Conservatively, we require µ0 − 2τ0 > 0, so τ0 < 1.9/2 = 0.95.
The observations are: {1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08}, giving ȳ = 1.804. Using the
above formulas for posterior computation,
µn = ((1/τ0²)µ0 + (n/σ²)ȳ)/(1/τ0² + n/σ²) = (1.11 × 1.9 + (9/σ²) × 1.804)/(1.11 + 9/σ²),
τn² = 1/(1/τ0² + n/σ²) = 1/(1.11 + 9/σ²).
If we set σ² := s² = 0.017, then the posterior distribution is θ|y1, . . . , yn, σ² = 0.017 ∼ Normal(1.805, 0.002). A 95% quantile-based confidence interval for θ according to this posterior distribution is (1.72, 1.89). Of course, this result is based on a point estimate of σ² := s², which is in fact only a rough estimate based on only nine observations. In the next section we will study techniques for properly handling unknown variance.
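The numbers above can be reproduced with a few lines of Python (σ² is plugged in as the sample variance s², as in the text):

```python
import numpy as np

# Midge wing-length data of Example 5.2.
y = np.array([1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08])
n, ybar = len(y), y.mean()

mu0, tau2_0 = 1.9, 0.95 ** 2       # prior mean and variance for theta
sigma2 = y.var(ddof=1)             # plug-in point estimate s^2 (about 0.017)

prec = 1.0 / tau2_0 + n / sigma2                   # posterior precision 1/tau_n^2
mu_n = (mu0 / tau2_0 + n * ybar / sigma2) / prec   # posterior mean
tau2_n = 1.0 / prec                                # posterior variance

print(mu_n, tau2_n)   # approximately 1.805 and 0.002
```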
Figure 5.1: Prior and posterior distributions for the population mean wing length.
5.3 Joint inference for the mean and variance
We need to specify a prior distribution on θ and σ 2 . By Bayes’ rule
p(θ, σ²|y1, . . . , yn) ∝ p(θ, σ²)p(y1, . . . , yn |θ, σ²) ∝ p(θ, σ²) (1/σ^n) exp(−(1/(2σ²)) Σ(θ − yi)²). (6)
It is not immediately obvious how to come up with a conjugate prior jointly for θ and σ². In the previous section, σ² is assumed to be fixed; from there it is simple to find that a normal prior for θ yields a normal posterior, conditionally on σ. This suggests that we may wish to set θ|σ² ∼ Normal(µ0, τ0²), for some suitable choice of µ0, τ0 which may depend on σ². In other words, we seek a prior according to which θ and σ² may be coupled (i.e., dependent). The question is how. Moreover, this still does not tell us how to place a suitable prior on σ², since we still need to specify the joint prior distribution
p(θ, σ²) = p(σ²)p(θ|σ²).
Fixed mean, varying variance To get a sense of what the form for a conjugate prior of σ² may be, let us take a step back, by assuming that θ is fixed. Simplifying from (6),
p(σ²|θ, y1, . . . , yn) ∝ p(σ²)p(y1, . . . , yn |θ, σ²) ∝ p(σ²) (1/σ^n) exp(−(1/(2σ²)) Σ(θ − yi)²). (7)
It is more convenient to look at the posterior pdf in terms of the precision σ̃² = 1/σ². We see that the simplest form for a conjugate prior for σ̃² will be one of the form (σ̃²)^(c1) e^(−c2 σ̃²). This gives us a Gamma prior for the precision parameter. In particular, we set
σ̃² ∼ Gamma(a, b).
This is equivalent to saying that σ 2 ∼ InvGamma(a, b), and can be taken as a definition of the Inverse
Gamma distribution.
Recall the Gamma pdf: p(y|a, b) = (b^a/Γ(a)) y^(a−1) e^(−by), for y > 0. Let z = 1/y, so that y = 1/z and dy/dz = −1/z². By the change of variable formula,
p(z|a, b) = p(y(z)|a, b)|dy/dz| = (b^a/Γ(a)) y(z)^(a−1) e^(−by(z)) (1/z²) = (b^a/Γ(a)) z^(−a−1) e^(−b/z),
which gives the pdf for InvGamma(a, b).
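A quick numerical check of this change of variable: if Y ∼ Gamma(a, b) then Z = 1/Y ∼ InvGamma(a, b), whose mean is b/(a − 1) for a > 1 (the values a = 5, b = 2 below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

a, b = 5.0, 2.0
# numpy's gamma takes shape and scale, so rate b corresponds to scale 1/b.
y = rng.gamma(a, 1.0 / b, size=500_000)
z = 1.0 / y   # inverse-gamma draws by the change of variable z = 1/y

print(z.mean())   # close to b / (a - 1) = 0.5
```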
Now, combining the inverse-gamma prior for σ² with the normal likelihood, we find that
p(σ²|a, b, θ, y1, . . . , yn) ∝ p(σ²) × (1/σ^n) exp(−(1/(2σ²)) Σ(θ − yi)²)
∝ (σ²)^(−a−1) e^(−b/σ²) × σ^(−n) exp(−(1/(2σ²)) Σ(θ − yi)²)
∝ (σ²)^(−(a+n/2)−1) exp(−(b + (1/2) Σ(θ − yi)²)/σ²)
= InvGamma(a + n/2, b + (1/2) Σ(θ − yi)²)
=: InvGamma(an, bn). (8)
We proceed to finding the predictive distribution. Note that this can be viewed as a mixture distribution of Gaussians, with the location fixed, and the precision parameter varying according to the Gamma distribution Gamma(an, bn). We also note that the representation in terms of the precision parameter is more convenient because it allows us to directly utilize the relevant identity that arises from the Gamma pdf's normalizing constant. Thus, in what follows we may switch back and forth between the two representations, in terms of σ̃² and σ².
p(ỹ|a, b, y1, . . . , yn) = ∫ p(ỹ|θ, σ̃²) × p(σ̃²|a, b, θ, y1, . . . , yn) dσ̃²
= ∫ (σ̃²/(2π))^(1/2) exp(−(σ̃²/2)(ỹ − θ)²) × (bn^an/Γ(an)) (σ̃²)^(an−1) e^(−bn σ̃²) dσ̃²
= (Γ(an + 1/2)/Γ(an)) × bn^an/((2π)^(1/2) (bn + (ỹ − θ)²/2)^(an+1/2))
= (Γ(an + 1/2)/Γ(an)) × 1/((2πbn)^(1/2) (1 + (ỹ − θ)²/(2bn))^(an+1/2)). (9)
We arrive at the well-known Student's t distribution, which has three parameters: location parameter θ, scale parameter bn/an, and 2an degrees of freedom. The variance of the predictive distribution is, provided 2an > 2,
(2an/(2an − 2)) × (bn/an) = bn/(an − 1).
It is interesting to note that the predictive distribution of the data has heavier tails than the normal sampling model (polynomial versus exponential tail decay), thanks to the uncertainty about the variance/precision parameter that is integrated out.
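The mixture representation can be checked by simulation: draw the precision from Gamma(an, bn), then ỹ from a normal with that precision, and compare the sample variance with bn/(an − 1). The values of θ, an, bn below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)

theta, a_n, b_n = 0.0, 5.0, 2.0
S = 400_000

# Precision mixture of normals: precision ~ Gamma(a_n, rate b_n),
# then y~ ~ Normal(theta, variance 1/precision).
prec = rng.gamma(a_n, 1.0 / b_n, size=S)
y_tilde = rng.normal(theta, 1.0 / np.sqrt(prec))

print(y_tilde.var())   # close to b_n / (a_n - 1) = 0.5
```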
Both mean and variance parameter varying Now we are ready to handle the case where both θ and σ² vary. As we have seen in the previous pages, it may be more convenient in our derivation to work with the precision parameter σ̃² instead.
It is tempting to place independent prior distributions on θ and σ̃²: say a normal prior on θ and, independently, a Gamma prior on σ̃². The reader can verify without difficulty that this won't give us a conjugate prior, because the posterior for θ or σ̃² will not be normal or Gamma, respectively. (What would be the form of the posteriors then?)
The issue is that conditionally given the observations y1, . . . , yn, the parameters θ and σ̃² are dependent even if they are independent a priori. So we need to construct a prior distribution according to which θ and σ̃² are dependent to begin with. Here is how: use the decomposition
p(θ, σ̃ 2 ) = p(σ̃ 2 )p(θ|σ̃ 2 )
and set the prior as
σ̃ 2 ∼ Gamma(a, b)
θ|σ̃ 2 ∼ Normal(µ0 , κ0 σ̃ 2 ).
The key is in the second line, which allows the coupling of θ and σ̃ 2 via the conditional prior’s variance.
For ease of interpretation later, we set a = ν0 /2, b = ν0 σ02 /2 (which gives the prior expectation for σ̃ 2
to equal a/b = 1/σ02 =: σ̃02 ).
The sampling/likelihood model is the same as before:
iid
Y1 , . . . , Yn |θ, σ̃ 2 ∼ Normal(θ, σ̃ 2 ).
Now we verify that the specified prior is indeed conjugate. Decompose the posterior distribution similarly:
p(θ, σ̃²|y1, . . . , yn) = p(σ̃²|y1, . . . , yn)p(θ|σ̃², y1, . . . , yn).
From the previous section, we already have
θ|y1, . . . , yn, σ̃² ∼ Normal(µn, τ̃n²)
where
τ̃n² = κ0σ̃² + nσ̃² =: κnσ̃²
µn = (κ0σ̃²µ0 + (nσ̃²)ȳ)/(κ0σ̃² + nσ̃²) = (κ0µ0 + nȳ)/κn.
In short, the conditional posterior of θ, namely p(θ|σ̃ 2 , y1 , . . . , yn ), has the same form as that of the conditional prior p(θ|σ̃ 2 ).
Next, we check the marginal posterior of σ̃ 2 . For this computation, we need to integrate out θ (unlike
the previous detour where θ is fixed, and σ̃ 2 varies).
p(σ̃²|y1, . . . , yn) ∝ p(σ̃²)p(y1, . . . , yn |σ̃²)
∝ p(σ̃²) ∫ p(y1, . . . , yn |θ, σ̃²)p(θ|σ̃²) dθ
∝ (σ̃²)^(a−1) e^(−bσ̃²) ∫ (σ̃²)^(n/2) exp(−(σ̃²/2) Σ(θ − yi)²) (κ0σ̃²)^(1/2) exp(−(κ0σ̃²/2)(θ − µ0)²) dθ
∝ (σ̃²)^(a+n/2−1) e^(−bσ̃²) (κ0σ̃²)^(1/2) ∫ exp(−(σ̃²/2)[Σ(θ − yi)² + κ0(θ − µ0)²]) dθ.
We quickly see in the integrand the form of a Gaussian pdf, so the integral can be simplified by utilizing the formula for normalizing the Gaussian pdf: ∫ exp(−(σ̃²/2)(y − µ)²) dy = √(2π/σ̃²). Accordingly, the integral is precisely
√(2π/[(κ0 + n)σ̃²]) exp(−(σ̃²/2)[−(µ0κ0 + nȳ)²/(κ0 + n) + Σyi² + κ0µ0²])
= √(2π/[(κ0 + n)σ̃²]) exp(−(σ̃²/2)[κ0n(µ0 − ȳ)²/(κ0 + n) + Σyi² − nȳ²]).
Plugging back into the posterior of σ̃², keeping only relevant terms,
p(σ̃²|y1, . . . , yn) ∝ (σ̃²)^(a+n/2−1) e^(−bσ̃²) exp(−(σ̃²/2)[κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²])
= Gamma(a + n/2, b + (1/2)[κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²])
= Gamma(ν0/2 + n/2, (1/2)[ν0σ0² + κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²])
=: Gamma(νn/2, νnσn²/2),
where the posterior distribution's parameters take the form
νn = ν0 + n
σn² = (1/νn)[ν0σ0² + κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²].
How do we make sense of the contributions of the prior information and the data in these expressions? The posterior mean of σ̃² is 1/σn², while the posterior variance is of the order 1/(νnσn⁴). In the above formula for νnσn², it is clear that ν0σ0² represents the information from the prior for σ². The term (n − 1)s² represents the variability of the observed data about the sample mean.
Finally, the middle term κ0n(µ0 − ȳ)²/(κ0 + n) represents the contribution to the variance parameter σ² due to the coupling between the location parameter θ and precision σ̃² according to the conditional prior
θ|σ̃² ∼ Normal(µ0, κ0σ̃²).
According to this prior, θ is drawn from a mixture of normal distributions centered on µ0 with varying precision proportional to κ0. This is a relatively strong opinion for a prior specification, which entails the "biased" contribution of the middle term, increasing with both κ0 and the variability of the sample mean about µ0, toward the estimate of the variance σ².
One may harshly criticize the prior for being too strong because of this implication. We don't necessarily defend it at all cost: after all, we have arrived at this prior construction mainly from a mathematical/computational viewpoint, i.e., to obtain a conjugate prior. So there is a bias, and the incurred bias is a cost one has to pay for the mathematical/computational convenience. Whether it is worth it depends on the modeler and the data at hand. Note that when the sample size is large, the bias incurred by our prior construction will be washed away by the last term (n − 1)s², which is purely driven by the data set.
Example 5.3. (Midge wing length — continued). Our sampling model for midge wing lengths is Y|θ, σ̃² ∼ Normal(θ, σ̃²) and we will place a joint prior on (θ, σ̃²) via
σ̃ 2 ∼ Gamma(a = ν0 /2, b = ν0 σ02 /2)
θ|σ̃ 2 ∼ Normal(µ0 , κ0 σ̃ 2 ).
Previous studies suggest that the true mean and standard deviation should not be too far from 1.9 mm
and 0.1 mm, respectively. So we may set µ0 = 1.9 and σ02 = 0.01.
The Gamma prior implies the prior mean for the precision is a/b = 1/σ02 = σ̃02 = 100, and prior
variance for the precision is a/b2 = 2/(ν0 σ04 ). We set ν0 = 1 to allow for a reasonably large variance.
As for κ0: we also set κ0 = 1. Since σ̃² is a priori distributed over a large range of values, this implies that we assume θ to be only weakly coupled to σ̃².
From the sample, ȳ = 1.804 and s² = 0.0169. Applying the posterior computation derived earlier:
µn = (κ0µ0 + nȳ)/κn = (1.9 + 9 × 1.804)/(1 + 9) = 1.814
σn² = (1/νn)[ν0σ0² + κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²] = (0.010 + 0.008 + 0.135)/10 = 0.015.
Compared to the point estimate presented earlier, the posterior mean for θ is comparable, but the uncertainty
captured by σn2 is considerably larger. But we can say much more.
In particular, the joint posterior distribution is given by
θ|y1, . . . , yn, σ̃² ∼ Normal(µn = 1.814, τ̃n² = κnσ̃² = 10σ̃²)
σ̃²|y1, . . . , yn ∼ Gamma(νn/2 = 10/2, νnσn²/2 = 10 × 0.015/2).
Figure 5.2: Joint posterior distributions for (θ, σ̃ 2 ) and (θ, σ 2 ).
These plots were obtained by computing the joint pdf at pairs of values of (θ, σ̃ 2 ) and (θ, σ 2 ) on a grid.
Note also that samples from this posterior can be easily obtained via Monte Carlo sampling.
These plots tell us where most of the mass of the posterior for (θ, σ²) is, and to some extent the relationship between the two parameters. When σ̃² is small (σ² is large) there is more uncertainty about θ. Moreover, the contours are more peaked as a function of θ for low values of σ² than for high values.
Hyperparameters and improper priors Hyperparameters are parameters specified for the prior distributions. In our previous example, two of them are κ0 and ν0. They may be regarded as prior sample sizes, because according to the Bayesian update
κn = κ0 + n
νn = ν0 + n.
When κ0 and ν0 are relatively small compared to n, the effects of these hyperparameters are negligible. Of interest is when n is itself quite small. Is it still possible to have a prior specification whose impact relative to the impact from the data is minimal?
The smaller ν0 is, the "flatter" the marginal prior distribution for σ̃²; the smaller κ0 and ν0 are, the flatter the marginal prior distribution for θ. (Recall from our earlier computation in Eq. (9) that a mixture of fixed-mean normal distributions with Gamma mixing on the precision is a Student's t distribution.) In other words, such priors can be viewed as "less discriminative", and hence "more objective".
Let us perform the formal computation, by letting κ0, ν0 → 0:
µn = (κ0µ0 + nȳ)/κn → ȳ
σn² = (1/νn)[ν0σ0² + κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²] → ((n − 1)/n) s² = (1/n) Σ(yi − ȳ)².
This leads to the following "posterior distribution", which is free of hyperparameters:
σ̃²|y1, . . . , yn ∼ Gamma(n/2, (n/2) × (1/n) Σ(yi − ȳ)²)
θ|σ̃², y1, . . . , yn ∼ Normal(ȳ, σ²/n). (10)
There does not exist a valid prior distribution yielding the above "posterior distribution", which appears only as the limit of a sequence of posterior distributions arising from a sequence of prior distributions with κ0, ν0 → 0. If one still wishes to employ such a posterior distribution, one needs to utilize the notion of an improper prior distribution.
Consider the function p̃(θ, σ²) = 1/σ². This is not a proper distribution because it is not integrable over (θ, σ²). Thus we will treat it as an improper prior distribution and apply Bayes' rule to obtain:
p(θ, σ²|y) ∝ p(y|θ, σ²) × p̃(θ, σ²).
Then we have a valid distribution over (θ, σ²): in fact, it can be easily verified that the induced marginal for θ is the same as that of (10), while the marginal for σ̃² is Gamma((n − 1)/2, (1/2) Σ(yi − ȳ)²). In addition, integrating over σ̃², following a computation similar to Eq. (9), we find that
(θ − ȳ)/(s/√n) | y1, . . . , yn ∼ t_{n−1}. (11)
Remark 5.1. Some remarks.
(i) The use of improper priors is not considered to be truly Bayesian, but it can be justified (informally)
by the limiting argument presented above, and formally via a decision-theoretic framework. It is one
area where one can find the meeting points between Bayesian and frequentist approaches.
(ii) It is interesting to compare with the sampling distribution of the t statistic, conditional on θ but unconditional on the data:
(Ȳ − θ)/(s/√n) | θ ∼ t_{n−1}. (12)
Eq. (12) is a statement about the data: it says that before we sample the data, our uncertainty about the
scaled deviation of the sample mean Ȳ from the population mean θ has a tn−1 distribution. Eq. (11)
says that after we sample the data, our uncertainty is still represented with a tn−1 distribution, except
that it is our uncertainty about θ given the information provided by the data ȳ.
5.4 Normal model for non-normal data
People apply normal models to non-normal data all the time. In this section, we have seen examples of
modeling heights for a human population and modeling flies’ wing length. In both cases, the data are
positive valued, whereas normal distributions are supported on the entire real line. However, the quantity of
interest is the population mean, which can be treated as approximately normally distributed according to the
central limit theorem.
As another example, consider the number of children for a group of women over age 40, and consider
estimating the mean number of children for this population, based on the samples Y1 , . . . , Yn . In the previous
section, we considered a Poisson sampling model, which is motivated by the fact that Yi are integer-valued.
Obviously it makes no sense to assume Yi |θ, σ 2 ∼ Normal(θ, σ 2 ). However, it is still reasonable to assume
that the population mean θ is normally distributed (a priori).
By the CLT, we know that
p(ȳ|θ, σ²) ≈ Normal(ȳ|θ, σ²/n),
where σ² denotes the population variance, with the approximation becoming increasingly accurate as n gets larger.
If σ 2 is known, then we may consider placing a normal prior on θ and obtain the posterior for θ via
p(θ|ȳ, σ 2 ) ∝ p(θ) × p(ȳ|θ, σ 2 ).
If σ² is unknown, we may consider bringing in the point estimate s² and conditioning on it:
p(θ, σ 2 |ȳ, s2 ) ∝ p(θ, σ 2 ) × p(ȳ, s2 |θ, σ 2 ).
The likelihood term p(ȳ, s²|θ, σ²) may be approximated by applying a normal sampling model p(ȳ|θ, σ²) for ȳ and a Gamma sampling model p(s²|ȳ, θ, σ²) for s², conditionally on ȳ. When the sample size is reasonably large, this approximate treatment is quite reasonable and can lead to good practical results.
When are normal models not appropriate?
• when the quantity of interest is not the population mean and/or variance but depends on the tail behavior of the population, while the population's distribution is clearly not normal (e.g., heavy-tailed or skewed distributions). For instance, we may be interested in the group of people with a large number of children.
• when the population is highly heterogeneous and we are interested in learning about such heterogeneity. For instance, the population's distribution may be multi-modal, and so it makes more sense to represent it as a mixture of sub-populations, each of which has its own parameters of interest. One is not interested in the population mean as much as the parameters of each sub-population.
• even when the normal model is not appropriate, normal distributions frequently serve as a useful building block: recall that heavier-tailed distributions such as the t-distribution can be viewed as a mixture of normals with the variance parameter varying, while multi-modal distributions can be approximated by a mixture of normal distributions with the mean parameters or both types of parameters varying.
6 Posterior approximation with the Gibbs sampler
6.1 Conjugate vs non-conjugate prior
In the previous section we considered a particular prior for the normal sampling model Normal(θ, σ²). This is a conjugate prior for the parameters θ, σ² (or alternatively, θ, σ̃² = 1/σ²):
σ̃² ∼ Gamma(ν0/2, ν0σ0²/2)
θ|σ̃² ∼ Normal(µ0, κ0σ̃²).
We found that by applying the Bayes update, the posterior distribution p(θ, σ̃ 2 |y1 , . . . , yn ) carries the
same form:
σ̃ 2 |y1 , . . . , yn ∼ Gamma(νn /2, νn σn2 /2)
θ|y1 , . . . , yn , σ̃ 2 ∼ Normal(µn , τ̃n2 ).
The posterior distributions' parameters are updated as
τ̃n² = κ0σ̃² + nσ̃² =: κnσ̃²
µn = (κ0σ̃²µ0 + (nσ̃²)ȳ)/(κ0σ̃² + nσ̃²) = (κ0µ0 + nȳ)/κn,
and
νn = ν0 + n
σn² = (1/νn)[ν0σ0² + κ0n(µ0 − ȳ)²/(κ0 + n) + (n − 1)s²].
The price we have to pay for the computational convenience is the coupling between the two parameters θ and σ̃² imposed in the prior specification. Such coupling results in a prior bias: the higher the precision σ̃² (the lower the variance σ²), the more certain we are about the parameter θ.
In general, when dealing with multiple parameters, it is difficult to come up with a conjugate prior
jointly for all parameters. And even if we can, the discussion from the previous section suggests that it is
important to explore non-conjugate priors, because in some situations they may be more appropriate for our
understanding of the parameter space.
In the case of the normal model above, we may want to express our uncertainty about θ as independent
of σ̃ 2 . Such a prior specification is clearly less stringent than the one given above. Intuitively, such a prior
would be less subjective. In particular, consider the following independent prior:
σ̃² ∼ Gamma(ν0/2, ν0σ0²/2) (13a)
θ ∼ Normal(µ0, τ0²). (13b)
The particular choices of Gamma and Normal come from our computations in subsection 5.2 and the beginning of subsection 5.3. Although this prior distribution is not conjugate, in the sense that the joint posterior distribution p(θ, σ̃²|y1, . . . , yn) does not carry the same form as the prior distribution p(θ, σ̃²), the full conditional distributions p(θ|σ̃², y1, . . . , yn) and p(σ̃²|θ, y1, . . . , yn) can be easily computed and in fact carry the same form as the corresponding marginal prior distributions. A full conditional distribution is the distribution of a parameter given everything else, including the data and all remaining parameters. We call a prior for which the full conditional distributions have the same form as the marginal priors "semiconjugate".
6.2 The Gibbs sampler
The Gibbs sampler is a sampling technique for multivariate distributions that exploits the fact that the full conditional distributions can be easily computed or sampled from. This crucial fact allows one to generate a dependent sequence of parameter samples that converges in distribution to the joint posterior distribution of interest.
Continuing with our semiconjugate prior specification given in Eq. (13). From the previous section, we have obtained that (cf. Eq. (5))
θ|σ̃², y1, . . . , yn ∼ Normal(µn, τn²),
where
τn² = 1/(1/τ0² + n/σ²), µn = b/a = ((1/τ0²)µ0 + (n/σ²)ȳ)/(1/τ0² + n/σ²).
Note carefully that the updated parameters µn and τn2 depend on the conditioning value σ̃ 2 = 1/σ 2 .
And from Eq. (8),
σ̃ 2 |θ, y1 , . . . , yn ∼ Gamma(νn /2, νn σn2 /2),
where
νn = ν0 + n,   σn2 = (ν0 σ02 + ns2n ) / νn ,
and s2n = ∑(yi − θ)2 /n, the unbiased estimate of σ 2 if θ were known. Note carefully also that the updated parameter σn2 depends on the conditioning value θ.
These full conditionals tell us that
• if we know σ̃ 2 , we can draw a sample for θ from p(θ|σ̃ 2 , y1 , . . . , yn );
• if we know θ, we can draw a sample for σ̃ 2 from p(σ̃ 2 |θ, y1 , . . . , yn ).
These full conditionals do not give us a direct way of drawing a sample from the joint posterior p(θ, σ̃ 2 |y1 , . . . , yn ), but they suggest an iterative procedure for drawing the joint samples φ := (θ, σ̃ 2 ). In each iteration, we take turns drawing a random sample for one parameter from its full conditional distribution given the latest values of all other parameters.
This procedure is called the Gibbs sampler. More precisely for our present model, let φ(s) := (θ(s) , σ 2(s) ),
where s is the index for the iterations.
• Start with an arbitrary initial value φ(1) = (θ(1) , σ 2(1) ).
• For s = 1, 2, . . .
– sample θ(s+1) ∼ p(θ|σ̃ 2(s) , y1 , . . . , yn );
– sample σ̃ 2(s+1) ∼ p(σ̃ 2 |θ(s+1) , y1 , . . . , yn );
– let φ(s+1) = {θ(s+1) , σ̃ 2(s+1) }.
What this algorithm does is generate a dependent sequence of parameter vectors φ(1) , φ(2) , . . . , φ(s) , . . ., where the (s + 1)-th parameter vector φ(s+1) is generated from the conditional distribution given the previous value φ(s) , namely p(φ(s+1) |φ(s) ). This sequence of random vectors {φ(s) } is called a Markov chain. Under very weak conditions this Markov chain converges to a stationary distribution. Moreover, by our construction of the Gibbs sampler, that stationary distribution is p(φ|y1 , . . . , yn ) — the joint posterior distribution of interest.
Note carefully that we do not say that we have obtained a valid sample from the joint posterior p(θ, σ̃ 2 |y1 , . . . , yn ). What we can say is that if we run the Markov chain (the Gibbs sampler) long enough, i.e., if s is large, then φ(s) can be viewed as a good approximation of a posterior sample.
A nice feature of Gibbs samplers is that they tend to be very easy to implement. An implementation can make use of the identity:
ns2n = ∑_{i=1}^{n} (yi − θ)2 = (n − 1)s̄2 + n(ȳ − θ)2 .
The RHS is fast to update with each iteration because (n − 1)s̄2 = ∑(yi − ȳ)2 does not change; only the term n(ȳ − θ)2 needs to be recomputed.
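As a concrete sketch of such an implementation, here is a version in Python using only the standard library (rather than R; the function and variable names are ours). It alternates the two full conditionals above and uses the identity to avoid recomputing the fixed sum of squares:

```python
import random
import statistics

def gibbs_normal(y, mu0, tau0_sq, nu0, sigma0_sq, n_iter=5000, seed=1):
    """Gibbs sampler for the semiconjugate normal model of Eq. (13).

    Alternates the two full conditionals; `prec` plays the role of the
    precision sigma-tilde^2 = 1/sigma^2. Returns theta and precision samples."""
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    ss_fixed = sum((yi - ybar) ** 2 for yi in y)  # (n-1) s^2, never changes
    prec = 1.0 / statistics.variance(y)           # initial value for sigma-tilde^2
    thetas, precs = [], []
    for _ in range(n_iter):
        # theta | prec, y ~ Normal(mu_n, tau_n^2)
        tau_n_sq = 1.0 / (1.0 / tau0_sq + n * prec)
        mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar * prec)
        theta = rng.gauss(mu_n, tau_n_sq ** 0.5)
        # prec | theta, y ~ Gamma(nu_n/2, rate = (nu0 sigma0^2 + n s_n^2)/2),
        # with n s_n^2 = (n-1) s^2 + n (ybar - theta)^2 by the identity above
        ns2 = ss_fixed + n * (ybar - theta) ** 2
        shape = (nu0 + n) / 2.0
        rate = (nu0 * sigma0_sq + ns2) / 2.0
        prec = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes a scale
        thetas.append(theta)
        precs.append(prec)
    return thetas, precs
```

With a moderate amount of data the posterior for θ concentrates near ȳ, so the empirical mean of the θ samples should land close to the sample mean.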
Let us examine the performance of the Gibbs sampler using the midge data from the previous section and the independent semiconjugate prior (13). A Gibbs sampler consisting of 1,000 iterations was constructed. Fig. 6.1 plots the first 5, 15 and 100 simulated values of the sampler.
Figure 6.1: The first 5, 15 and 100 iterations of a Gibbs sampler.
Once the Gibbs samples are collected we can compute empirical quantiles, which can be verified to be very close to those of a discrete approximation of the joint posterior distribution. (Hoff’s textbook (Chapter 6, Sec. 6.2) gives further details of this discrete approximation technique.)
Figure 6.2: The first panel shows 1,000 samples from the Gibbs sampler, plotted over the contours of a discrete approximation. The second and third panels give kernel density estimates of the distributions of Gibbs samples of θ and σ̃ 2 . Vertical gray bars on the second plot indicate the 2.5% and 97.5% quantiles of the Gibbs samples of θ, while nearly identical black vertical bars indicate the 95% confidence interval based on the t-test.
6.3 Markov chain Monte Carlo algorithms
6.3.1 Gibbs sampler
Suppose we have a vector of parameters φ = (φ1 , . . . , φq ), and our information about φ is summarized by the probability distribution p(φ) = p(φ1 , . . . , φq ). In the example from the previous subsection, φ = (θ, σ̃ 2 ) and the probability distribution of interest is p(θ, σ̃ 2 |y1 , . . . , yn ), the posterior distribution given the observed sample of size n.
Remark 6.1. In Bayesian statistics, Gibbs sampling is typically applied to posterior distributions, hence the conditioning on the observed data. However, it is important to note that the Gibbs sampler is applicable to any joint probability distribution for a random vector φ of interest, regardless of whether there is additional conditioning (as in Bayesian inference) or not.
The general recipe should be clear. Given a starting point φ(0) = {φ1(0) , . . . , φq(0) }, the Gibbs sampler generates φ(s) from φ(s−1) as follows:
1. sample φ1(s) ∼ p(φ1 |φ2 = φ2(s−1) , . . . , φq = φq(s−1) )
2. sample φ2(s) ∼ p(φ2 |φ1 = φ1(s) , φ3 = φ3(s−1) , . . . , φq = φq(s−1) )
...
q. sample φq(s) ∼ p(φq |φ1(s) , φ2(s) , . . . , φq−1(s) ).
After S iterations, this algorithm generates a dependent sequence of random vectors
φ(1) = {φ1(1) , . . . , φq(1) }
φ(2) = {φ1(2) , . . . , φq(2) }
...
φ(S) = {φ1(S) , . . . , φq(S) }.
This sequence forms what we call a Markov chain, because the random vector φ(s) is conditionally independent of all the past instances φ(1) , . . . , φ(s−2) , given φ(s−1) . (Markov property: the future is conditionally independent of the past, given the present.) We will define Markov chains shortly in the sequel.
The main point is that under suitable conditions that are easily met, as s → ∞, φ(s) converges in distribution to the Markov chain’s stationary distribution p(φ). We also refer to p(φ) as the target distribution of the Markov chain (MC). In particular, for any measurable event A of interest, we may write
write
Pr(φ(s) ∈ A) → Pr(φ ∈ A) as s → ∞.
In other words, if we run the chain long enough then φ(s) can be viewed as an approximate sample from the joint distribution p(φ) of interest.
More importantly, take any function g(φ) for which we may be interested in the expectation under p(φ),
then the following law of large numbers holds quite generally, as S → ∞:
(1/S) ∑_{s=1}^{S} g(φ(s) ) → Eg(φ) = ∫ g(φ)p(φ)dφ.   (14)
In other words, we can apply the Monte Carlo approximation technique to the Markov chain’s generated samples to evaluate the expectation of interest. For this reason, we call all such approximations Markov chain Monte Carlo (MCMC) approximations, and the overall procedure an MCMC algorithm.
Remark 6.2.
• The good: While it is generally difficult to construct a sample for the joint distribution
p(φ), it is relatively easier to construct a Markov chain that converges in the limit to the target p(φ).
• The advent of MCMC algorithms is a primary reason that Bayesian statistics has moved into a central place in modern statistics: they provide a generic mechanism for posterior computation for complex models. From a modeling standpoint, we can go beyond conjugate prior specifications; from a scalability standpoint, we can work with very large numbers of variables and parameters.
• MCMC approximation techniques are quite remarkable because they exploit the strong law of large numbers for non-i.i.d. random variables — the MC’s generated samples are clearly dependent.
• Hence the bad: there are infinitely many Markov chains for the same target distribution, and not all are equally good.
– Some may take a long time to get close to the target stationary distribution, i.e., they have a slow mixing time. In such a case, to produce even an approximately good sample for the target distribution, S needs to be very large (and we don’t generally know how large).
– Moreover, some Markov chains may produce strongly correlated samples, so the Monte Carlo estimate may carry very high variance; the empirical average then requires a considerably larger number S of dependent samples than one would need with independent Monte Carlo samples.
6.3.2 General Markov chain framework
Gibbs samplers are very easy to implement and can be applied to almost any complex statistical model. For this reason they are very popular. Their popularity is also their curse, as Gibbs sampling can be very inefficient for the reasons we’ve just mentioned.
Therefore, it is important to gain intuition about Gibbs sampling by placing it within the more general framework of Markov chains, so we can get a feel for what a Gibbs sampler tries to achieve, when it ”works”, when it may not, and what we can do when it does not. In fact, there are many variants of the Gibbs sampler (we have introduced only one such variant). More importantly, there are many non-Gibbs Markov chain Monte Carlo techniques, including Metropolis-Hastings, Hamiltonian MCMC, and so on.
Bear with us through a bit of formalism in the next couple of pages. The payoff is worth it.4
Definition 6.1. A Markov chain is a discrete time stochastic process φ(1) , φ(2) , . . . taking values in an arbitrary state space S, having the property that the conditional distribution of φ(s+1) given the past φ(1) , . . . , φ(s)
depends only on the present state φ(s) .
φ(s) is called the state variable at time s. A Markov chain is defined by its transition probabilities. For
discrete state space S, these are specified by defining a matrix p:
p(x, y) := Pr(φ(s+1) = y|φ(s) = x), x, y ∈ S
that gives the probability of moving from any element x ∈ S at time s to any element y ∈ S at time s + 1.
The transition probability matrix p(x, y) does not depend on time s.
For continuous state space S, the proper way to think of the transition probabilities is via a notion of
kernel P , which can be represented by a regular conditional probability: for any measurable subset A ⊂ S,
the kernel P is given as
P (x, A) := Pr(φ(s+1) ∈ A|φ(s) = x).
Kernel P (x, A) is defined by two arguments, x is an element in the state space S and A a subset of S. It
gives the probability of moving from an element x ∈ S into a subset A at time s + 1.
Note that the transition probabilities do not by themselves define the probability distribution of the
Markov chain. To do so, we need to additionally specify the initial distribution of the chain, namely, the
marginal distribution for φ(1) .
4 I largely follow Charles Geyer (2005) for the rest of this subsection.
A key concept of a Markov chain is
Definition 6.2. A probability distribution π is a stationary distribution or an invariant distribution for the Markov chain if it is ”preserved” by the transition probability. That is, if the initial distribution is π, then the marginal distribution of φ(2) is also π; hence, so is the marginal distribution of φ(3) and all the rest of the chain.
For discrete state space S, π is specified by a vector π(x), and the stationary property is
π(y) = ∑_{x∈S} π(x)p(x, y).   (15)
If we think of transition probabilities as a matrix P with entries p(x, y), Eq. (15) can be written as π = πP ,
where the RHS is the multiplication of the matrix P on the left by the row vector π.
For continuous state space S, the stationary property is
π(A) = ∫_S π(dx)P (x, A).   (16)
Eqs.(15) and (16) are the same except that a sum over a discrete state space has been replaced by an integral
over a continuous state space.
In MCMC we often construct a Markov chain with a specified stationary distribution π in mind, so there is never a question whether a stationary distribution exists — it does by construction. Moreover, it is unique under easily met conditions and, more importantly, it admits the law of large numbers described earlier in Eq. (14).
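Eq. (15) is easy to illustrate numerically. The Python sketch below (the 3-state transition matrix is made up for illustration) repeatedly applies the kernel to an initial distribution; the limit is a distribution that one further application leaves unchanged:

```python
# a made-up 3-state transition matrix p(x, y); each row sums to 1
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]

def step(pi, P):
    """One application of the kernel: (pi P)(y) = sum_x pi(x) p(x, y)."""
    q = len(P)
    return [sum(pi[x] * P[x][y] for x in range(q)) for y in range(q)]

pi = [1.0, 0.0, 0.0]      # start concentrated at state 0
for _ in range(200):      # marginal distribution of phi^(s) for growing s
    pi = step(pi, P)

# stationarity (Eq. 15): pi P = pi, up to numerical tolerance
assert all(abs(a - b) < 1e-10 for a, b in zip(pi, step(pi, P)))
```

For this chain the stationary distribution is unique, so starting from a different initial distribution gives the same limit.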
6.3.3 Variants of Gibbs samplers
With the general Markov chain framework in mind, we can see that the Gibbs sampler is a very simple construction of a Markov chain whose state variable is the vector φ = (φ1 , . . . , φq ) taking values in S for some q ≥ 2.
The Gibbs sampler is composed of elementary update steps, which we call Gibbs update: an elementary
Gibbs update changes only one component of the state vector, say φi for some i = 1, . . . , q. This component
is given a new value which is a sample from its ”full conditional” — its conditional distribution given the
rest π(φi |φ−i ), where φ−i := (φ1 , . . . , φi−1 , φi+1 , . . . , φq ).
It is easy to verify that the elementary Gibbs update preserves the stationary distribution: if the current
state φ is a realization from π, then φ−i is distributed according to its marginal π(φ−i ) derived from π, and
the state after the update will have the distribution
π(φi |φ−i )π(φ−i )
which is π(φ) by definition of conditional probability: joint equals conditional times marginal.
We can represent an elementary Gibbs update for component i by a kernel denoted by Pi , for i =
1, . . . , q. Moreover, a composition of an elementary Gibbs update, say P1 followed by an elementary Gibbs
update, say P2 can be represented by the composite kernel P1 P2 . It has a concrete meaning:
• For a discrete state space S, P1 P2 represents the multiplication of two transition probability matrices. The result is a matrix with entries ∑_{y∈S} p1 (x, y)p2 (y, z).
• For a continuous state space S, we need to replace the sum by an integral: (P1 P2 )(x, A) = ∫ P1 (x, dy)P2 (y, A).
Composition of kernels Now we can write the first Gibbs sampler introduced in subsection 6.3.1 as the construction of a Markov chain using the kernel
P = P1 P2 . . . Pq .
In words: this Markov chain is constructed by first updating φ1 via its full conditional, then φ2 , . . . , until φq . The composition of the q elementary Gibbs updates results in the kernel P , and application of P generates the Markov chain sample φ(s+1) starting from φ(s) .
It is easy to verify that the composition of kernels this way preserves the stationary distribution: π(P1 P2 P3 ) =
((πP1 )P2 )P3 = (πP2 )P3 = πP3 = π, and so on.
Mixing kernels But we can also create new Markov chains from the elementary Gibbs updates by mixing:
P = (1/q) ∑_{i=1}^{q} Pi .
In words: pick a coordinate i to update with equal probability 1/q, then update φi according to kernel Pi .
There is no reason to stay with equal probabilities: take any weights (α1 , . . . , αq ) ∈ ∆q−1 , pick coordinate i to update with probability αi , and if i is chosen, update φi according to kernel Pi .
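These invariance claims are easy to check numerically. The Python sketch below (with a made-up joint distribution π over two binary coordinates) builds the two elementary Gibbs update kernels as transition matrices and verifies that each of them, their composition P1 P2 , and the equal-weight mixture (P1 + P2 )/2 all preserve π:

```python
# made-up target pi over states (x1, x2) in {0,1}^2
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
pi = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def gibbs_kernel(coord):
    """Elementary Gibbs update P_i: resample coordinate `coord` from its full
    conditional pi(x_coord | x_other), leaving the other coordinate fixed."""
    P = {s: {t: 0.0 for t in states} for s in states}
    for s in states:
        other = s[1 - coord]
        norm = sum(pi[t] for t in states if t[1 - coord] == other)
        for t in states:
            if t[1 - coord] == other:
                P[s][t] = pi[t] / norm
    return P

def apply_kernel(dist, P):
    """(dist P)(t) = sum_s dist(s) P(s, t)."""
    return {t: sum(dist[s] * P[s][t] for s in states) for t in states}

def close(d1, d2):
    return all(abs(d1[s] - d2[s]) < 1e-12 for s in states)

P1, P2 = gibbs_kernel(0), gibbs_kernel(1)
mix = {s: {t: 0.5 * (P1[s][t] + P2[s][t]) for t in states} for s in states}

assert close(apply_kernel(pi, P1), pi)                    # elementary update
assert close(apply_kernel(pi, P2), pi)
assert close(apply_kernel(apply_kernel(pi, P1), P2), pi)  # composition P1 P2
assert close(apply_kernel(pi, mix), pi)                   # mixture (P1+P2)/2
```

The check mirrors the argument in the text: resampling one coordinate from its full conditional leaves the joint target invariant, and compositions and mixtures of invariance-preserving kernels are again invariance-preserving.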
Combining composition and mixing We can combine the composition and mixing tricks. The best known example is the so-called random sequence scan, which combines the q elementary update mechanisms by choosing a random permutation (i1 , i2 , . . . , iq ) of the integers 1, 2, . . . , q and then applying the updates Pij , j = 1, . . . , q in that order. If P denotes the set of all q! permutations, the kernel of this scan is
P = (1/q!) ∑_{(i1 ,...,iq )∈P} Pi1 . . . Piq .
6.4 MCMC diagnostics
Now with so many Gibbs variants (and in the future non-Gibbs Markov chains) available to consider, how
can we tell which one works, and works better? Remember that all Gibbs samplers and MCMC algorithms
in general work in theory, if we were allowed to run the Markov chain until infinity. But we can never do
that in practice. We may come up with one or several Markov chain constructions, run them for a while
and evaluate. This requires techniques for assessing the effectiveness of MCMC algorithms. This section provides a brief introduction to MCMC diagnostics.
The goal of Monte Carlo or Markov chain Monte Carlo approximation is to obtain a sequence of parameter values {φ(1) , . . . , φ(S) } such that, for some function g of interest and a target distribution p(φ),
(1/S) ∑_{s=1}^{S} g(φ(s) ) ≈ ∫ g(φ)p(φ)dφ.
In order to obtain a good approximation, there are two main issues to worry about:
(i) the empirical distribution of the simulated sequence {φ(1) , . . . , φ(S) } needs to approximate well the target distribution p(φ);
(ii) the members of the simulated sequence need to be as weakly correlated as possible (zero correlation is the best).
Standard Monte Carlo samples represent the ”gold standard”, if they can be obtained: by assumption, the MC samples are independently and identically distributed according to the target p(φ), so both criteria (i) and (ii) are perfectly achieved. Let φ̄ denote the empirical average of the Monte Carlo samples of φ, assumed for the moment to be scalar; then the variance of this Monte Carlo approximation is
VarMC [φ̄] = (1/S) Var[φ].   (17)
For samples simulated by a Markov chain, the aforementioned issues are generally non-trivial to address.
The Markov chain may take a long time to get close to the target stationary distribution, requiring S to be
large for (i) to be achieved. Moreover, there may be strong correlations among simulated samples {φ(s) }Ss=1 ,
resulting in difficulty in achieving (ii).
Example 6.1. Consider the target distribution of the form
p(θ) = ∑_{k=1}^{3} pk × Normal(θ|µk , σk2 ),
where
p = (p1 , p2 , p3 ) = (.45, .10, .45); (µ1 , µ2 , µ3 ) = (−3, 0, 3); (σ12 , σ22 , σ32 ) = (1/3, 1/3, 1/3).
This is a mixture of three normal densities. A useful technique is not to draw samples for θ directly, but to add an auxiliary random variable Z such that the joint distribution for (Z, θ) induces a marginal distribution for θ equal to the target distribution p(θ). We then draw samples of the pair (Z, θ). The joint distribution for (Z, θ) is given as follows:
Z ∼ Categorical(p)
θ|Z = k ∼ Normal(µk , σk2 ).   (18)
Figure 6.3: A mixture of normal densities and a Monte Carlo approximation.
For a Gibbs sampler of (Z, θ), the full conditional for θ is already given by Eq. (18). The full conditional for Z is given, via Bayes’ rule, by
Pr(Z = k|θ) = pk Normal(θ|µk , σk2 ) / ∑_{j=1}^{3} pj Normal(θ|µj , σj2 ).   (19)
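The two full conditionals (18) and (19) translate directly into a Gibbs sampler. Here is a self-contained Python sketch (standard library only; the function names are ours):

```python
import math
import random

p = [0.45, 0.10, 0.45]
mu = [-3.0, 0.0, 3.0]
var = [1 / 3, 1 / 3, 1 / 3]

def normal_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def gibbs_mixture(n_iter=10000, seed=7):
    """Alternate the full conditionals: Z | theta via Eq. (19),
    then theta | Z via Eq. (18). Returns the theta samples."""
    rng = random.Random(seed)
    theta, thetas = 0.0, []
    for _ in range(n_iter):
        # Z | theta: categorical with weights p_k Normal(theta | mu_k, var_k)
        w = [p[k] * normal_pdf(theta, mu[k], var[k]) for k in range(3)]
        u, z = rng.random() * sum(w), 0
        while u > w[z]:
            u -= w[z]
            z += 1
        # theta | Z = z: Normal(mu_z, var_z)
        theta = rng.gauss(mu[z], math.sqrt(var[z]))
        thetas.append(theta)
    return thetas
```

A histogram and traceplot of the returned samples reproduce the qualitative behavior discussed in the text: long stretches near one mode with occasional jumps between modes.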
Fig. 6.4 illustrates the histogram and traceplot of the first 1,000 Gibbs samples.
Figure 6.4: Histogram and traceplot of 1,000 Gibbs samples.
What do we see:
• The Gibbs sampler for θ starts in the region corresponding to the second mode (from the left) of the distribution, then ventures into the region corresponding to the first mode, and gets ”stuck” there for quite a long time. It eventually escapes the first mode, passes through the second, and transitions to the region corresponding to the third mode. Nonetheless, it does not seem to spend ”enough” time there before transitioning back to the second mode again.
• As shown by the first panel of Fig. 6.4, the Markov chain is not close to the stationary target distribution p(θ). It has not mixed after 1,000 iterations. If we run considerably longer, for 10,000 iterations,
the mixing is considerably improved. See Fig. 6.5.
• The ”stickiness” of the Markov chain at regions corresponding to the three modes, especially the first
and third mode suggests strong correlation among the simulated samples.
Figure 6.5: Histogram and traceplot of 10,000 Gibbs samples.
How do we assess the two issues of mixing and strong correlation of Markov chain samples?
To verify mixing is difficult in theory. This is an active area of research, where researchers work on upper
and lower bounds of the mixing time. Unfortunately, for complex models, tight bounds for the mixing time
are rarely available. In practice, a standard method is to run multiple Markov chains (starting at different
positions), and compare the distributions for the variables of interest. This works well when the number of
variables of interest is not too large. For high-dimensional state spaces, having a robust way to verify the mixing of a Markov chain remains a big challenge.
The reason we want to check the correlation of Markov chain samples — the technical term is autocorrelation — is that this quantity affects the variance of the Monte Carlo estimate in a crucial way.
Assume that stationarity of the Markov chain has been achieved. Let φ0 be the expectation of a scalar φ under the stationary target distribution. The variance of the Monte Carlo estimate φ̄ := (1/S) ∑s φ(s) can be computed as follows:
VarMCMC (φ̄) := E(φ̄ − φ0 )2
= E( (1/S) ∑_{s=1}^{S} (φ(s) − φ0 ) )2
= (1/S 2 ) E( ∑s (φ(s) − φ0 )2 + ∑_{s≠t} (φ(s) − φ0 )(φ(t) − φ0 ) )
= VarMC (φ̄) + (1/S 2 ) ∑_{s≠t} E(φ(s) − φ0 )(φ(t) − φ0 ).
Thus, the MCMC variance is equal to the MC variance plus a term that depends on the correlation of
samples within the Markov chain. This term is usually positive, so the MCMC variance is usually higher
than the MC variance.
To assess how much correlation there is in the chain, we compute the sample autocorrelation function:
for a generic sequence of numbers {φ1 , . . . , φS }, the lag-t autocorrelation function estimates the correlation
between elements of the sequence that are t steps apart:
acft (φ) = [ (1/(S − t)) ∑_{s=1}^{S−t} (φs − φ̄)(φs+t − φ̄) ] / [ (1/(S − 1)) ∑_{s=1}^{S} (φs − φ̄)2 ].   (20)
In R, this quantity is computed by the function acf. If we are close to stationarity, this quantity is almost always between −1 and 1. Being close to 1 means strong positive correlation; being close to zero means small correlation. For the example in Fig. 6.5, for the sequence of 10K Gibbs samples of θ, the lag-10 autocorrelation is 0.93, and the lag-50 autocorrelation is 0.812. This means that the Markov chain has very high correlation. Such a Markov chain explores the parameter space slowly, taking a long time to mix, and the empirical average also has a high variance.
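Eq. (20) is only a few lines of code. The Python sketch below (standard library only) transcribes it directly and contrasts a deliberately autocorrelated AR(1)-style sequence with i.i.d. draws:

```python
import random

def acf(phi, t):
    """Lag-t sample autocorrelation, a direct transcription of Eq. (20)."""
    S = len(phi)
    mean = sum(phi) / S
    num = sum((phi[s] - mean) * (phi[s + t] - mean)
              for s in range(S - t)) / (S - t)
    den = sum((x - mean) ** 2 for x in phi) / (S - 1)
    return num / den

rng = random.Random(0)
x, chain = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)   # AR(1): high lag-1 correlation by design
    chain.append(x)
iid = [rng.gauss(0.0, 1.0) for _ in range(5000)]
```

For the AR(1)-style sequence, acf(chain, 1) comes out close to 0.9, while acf(iid, 1) is close to 0.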
A practically useful summary is the effective sample size of a Markov chain. Motivated by the Monte Carlo variance formula (see Eq. (17)), the MCMC effective sample size Seff is the value such that
VarMCMC (φ̄) = Var φ / Seff .   (21)
In R, this quantity is estimated by the function effectiveSize (from the coda package).
In the example of normal mixture discussed above, the effective sample size of the 10,000 Gibbs samples
of θ is 18.42, indicating that the precision of the MCMC approximation to E[θ] is as good as the precision
that would have been obtained by utilizing only about 18 i.i.d. samples of θ. This may suggest two possible
courses of action: either run the Gibbs sampler considerably longer, or design a better Markov chain.
7 Multivariate normal models
For most non-trivial applications we are interested in models with multi-dimensional parameters and multidimensional measurements. Such situations require models based on multivariate distributions for both
parameters and data. The multivariate normal distributions represent one of the most useful and powerful
tools for such modeling tasks.
7.1 Mean vector and covariance matrix
Let X denote a random vector taking values in Rp . We may write X in terms of its components X =
(X1 , . . . , Xp ).
There are two equivalent ways to think of the random vector X. The first way is what we have been used to, that is, to think of a joint distribution over the p random variables X1 , . . . , Xp ∈ R. Given such a joint distribution, we can speak of quantities such as the expectation of each of the variables X1 , . . . , Xp .
We also consider covariance between Xi and Xj :
µi := EXi , i = 1, . . . , p   (22a)
σii := Var Xi := σi2 , i = 1, . . . , p   (22b)
σij := Cov(Xi , Xj ) := E(Xi − µi )(Xj − µj ), i, j = 1, . . . , p.   (22c)
The second way is to view X as a random variable taking values in the p-dimensional space Rp . This is useful for thinking geometrically about the behavior of the random vector in Rp , and for algebraic manipulation of distributions in spaces of dimensionality greater than one. Suppose that X is endowed with a probability density function p(x) with domain Rp . We can speak of its mean vector µ and covariance matrix Σ:
µ := EX := ∫_{Rp} x p(x)dx   (23a)
Σ := Var X := Cov X := E(X − µ)(X − µ)> := ∫_{Rp} (x − µ)(x − µ)> p(x)dx.   (23b)
In the above equations, X and x in Rp are treated as p × 1 columns (matrices). The integrals operate in
a component-wise fashion. Sometimes we use Var and sometimes Cov in front of random vector X; they
mean the same thing.
By verifying the basic linear algebra operations on matrices, it is easy to see that the p-dimensional mean vector µ and the p × p covariance matrix Σ of Eqs. (23) are related to the quantities given in Eqs. (22) as follows:
µ = (µ1 , . . . , µp )> ,   Σ = [σij ]i,j=1,...,p , with i-th row (σi1 , . . . , σip ).   (24)
The entries of covariance matrix Σ represent the variance of components Xi in the diagonal, and the
covariance between Xi and Xj in the (i, j) positions. It is simple to check that Σ is a symmetric and
positive semidefinite matrix.
7.2 The multivariate normal distribution
The multivariate Gaussian density function takes the following form: for x ∈ Rp ,
p(x|µ, Σ) = (2π)−p/2 |Σ|−1/2 exp( −(1/2)(x − µ)> Σ−1 (x − µ) ),   (25)
where there are two parameters: µ ∈ Rp and Σ, a p × p symmetric and positive definite matrix (Σ > 0). |Σ| denotes the determinant of the matrix Σ.
We write X ∼ N(µ, Σ), or X ∼ Normal(µ, Σ), or X ∼ Np (µ, Σ) interchangeably to denote the
p-variate Gaussian random vector X. Here are several basic facts.
1. The function given in the above display is a valid density function on Rp . That is, it satisfies
Z
p(x|µ, Σ)dx = 1.
Rp
2. Given the density defined above, it can be verified that µ = E(X) and Σ = E(X − µ)(X − µ)> .
So parameters µ and Σ indeed play the respective roles of being the mean and the covariance for the
Gaussian distribution.
3. What is the ”shape” of the Gaussian density in multi-dimensional spaces? One way to visualize it is to look at its contours. A contour of the Gaussian density is a collection of points of equal density values. So the Gaussian’s contours are solutions to the quadratic equation (x − µ)> Σ−1 (x − µ) = c for each positive constant c. These are ellipses oriented along the eigenvectors of Σ.
More Basic Facts
1. If X ∼ Np (µ, Σ) where µ ∈ Rp and Σ ∈ Rp×p , then for any m × p matrix A and p × 1 column vector
b, then the linear transformation Y = AX + b is also Gaussian, i.e., Y ∼ Nm (Aµ + b, AΣA> ).
2. If X = (X1 , . . . , Xp ) ∼ N (0, Ip ), where Ip denotes the p × p identity matrix, then X1 , . . . , Xp are independent (why?). Moreover, AX ∼ N (0, AA> ).
3. If X ∼ N (µ, Σ), let A = Σ−1/2 , the square root of the inverse covariance matrix; then AX ∼ N (Σ−1/2 µ, Ip ), and so AX − Σ−1/2 µ ∼ N (0, Ip ). This is called the standardization of the Gaussian.
Remark 7.1. In the above, we made use of the concept of the square root of a positive definite matrix. This is a generalization of the square root of a positive number. If S is symmetric and (strictly) positive definite, then the square root of S is a matrix A, denoted A = S 1/2 , such that AA> = S.
We can express A more concretely. Let D := diag(λ1 , . . . , λp ), where the λi are the eigenvalues of S; the λi are positive because S is positive definite. Define Γ := [γ1 , . . . , γp ] whose columns are the eigenvectors γi of S. By the spectral theorem for positive definite matrices, S = ∑_{i=1}^{p} λi γi γi> , or more succinctly, S = ΓDΓ> . Let D1/2 := diag(λ1^{1/2} , . . . , λp^{1/2} ). It is now simple to verify that the square root of S takes the form
A = ΓD1/2 Γ> .
In sum, the symmetric square root of a p.d. matrix S is a matrix with the same set of eigenvectors, whose eigenvalues are the square roots of those of S.
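For a 2 × 2 case the claim can be verified numerically without any linear algebra library: choose an orthonormal Γ (a rotation) and positive eigenvalues directly, form S = ΓDΓ> and A = ΓD1/2 Γ> , and check that AA> = S. The numbers below are made up for illustration:

```python
import math

c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
G = [[c, -s], [s, c]]   # orthonormal: columns are the eigenvectors gamma_i
lam = [4.0, 1.0]        # positive eigenvalues, so S is positive definite

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def spectral(G, d):
    """Gamma diag(d) Gamma^T."""
    D = [[d[0], 0.0], [0.0, d[1]]]
    return mat_mul(mat_mul(G, D), transpose(G))

S = spectral(G, lam)                              # S = Gamma D Gamma^T
A = spectral(G, [math.sqrt(v) for v in lam])      # A = Gamma D^{1/2} Gamma^T
AA = mat_mul(A, transpose(A))                     # should recover S

assert all(abs(AA[i][j] - S[i][j]) < 1e-12 for i in range(2) for j in range(2))
```

Because Γ> Γ = I, the product AA> collapses to ΓD1/2 D1/2 Γ> = ΓDΓ> = S, which is exactly what the assertion checks.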
Marginalization & Conditioning The multivariate Gaussian distributions enjoy the following key invariance properties which make them powerful in theory and useful in practice: if a joint distribution is
Gaussian, then induced marginal and conditional distributions are also Gaussian.
Let us express the p-dimensional vector X in terms of two blocks of components X1 ∈ Rn1 and X2 ∈ Rn2 (with n1 + n2 = p), as in X = (X1 , X2 )> . The mean vector µ and covariance Σ can be partitioned accordingly into n1 - and n2 -dimensional components:
µ = (µ1 , µ2 )> ,   Σ = [Σ11 , Σ12 ; Σ21 , Σ22 ].
We can read off (µ1 , Σ11 ) and (µ2 , Σ22 ) as the mean vector and the covariance matrix for X1 and X2 , respectively. In addition, Σ12 is the cross-covariance matrix between X1 and X2 . These are general facts that hold for any distribution on Rp .
Now, if we assume
X ∼ Np (µ, Σ), with µ = (µ1 , µ2 )> and Σ = [Σ11 , Σ12 ; Σ21 , Σ22 ],
then we also have that
X1 ∼ Nn1 (µ1 , Σ11 )
X2 ∼ Nn2 (µ2 , Σ22 )
X1 |X2 = x2 ∼ Nn1 (µ1 + Σ12 Σ22^{-1} (x2 − µ2 ), Σ11 − Σ12 Σ22^{-1} Σ21 )
X2 |X1 = x1 ∼ Nn2 (µ2 + Σ21 Σ11^{-1} (x1 − µ1 ), Σ22 − Σ21 Σ11^{-1} Σ12 ).
These conditional and marginal formulas invite some sort of conjugacy to take place — sweet music to
the Bayesian ears.
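A quick Monte Carlo sanity check of the marginal and conditional formulas in the bivariate case (Python, standard library; the particular µ and Σ are made up): sample X = µ + LZ with L the Cholesky factor of Σ, then compare empirical summaries against the formulas. In particular, the conditional mean of X2 given X1 = x1 is linear in x1 with slope Σ21 Σ11^{-1}, which is exactly the least-squares regression slope.

```python
import math
import random

mu = (1.0, -1.0)
Sigma = [[1.0, 0.5], [0.5, 2.0]]

# lower-triangular Cholesky factor L with L L^T = Sigma
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21 ** 2)

rng = random.Random(42)
xs = []
for _ in range(40000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append((mu[0] + l11 * z1,                 # X = mu + L z ~ N(mu, Sigma)
               mu[1] + l21 * z1 + l22 * z2))

n = len(xs)
m1 = sum(x for x, _ in xs) / n                   # marginal mean of X1 -> mu1
m2 = sum(y for _, y in xs) / n                   # marginal mean of X2 -> mu2
var1 = sum((x - m1) ** 2 for x, _ in xs) / n     # -> Sigma11
cov12 = sum((x - m1) * (y - m2) for x, y in xs) / n   # -> Sigma12
slope = cov12 / var1                             # -> Sigma21 / Sigma11 = 0.5
```

All of the empirical quantities land within Monte Carlo error of their theoretical counterparts.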
Canonical Parameterization The representation of the Gaussian density in terms of the parameters µ and Σ (see Eq. (25)) is called the mean parameterization. There is an equivalent parameterization, the dual or canonical parameterization, obtained by expanding the quadratic term and letting Λ = Σ−1 and η = Σ−1 µ, where Λ is the concentration (precision) matrix. The canonical parameterization is useful in that it can be easily extended to a broader class of distributions. Moreover, the relationship between the two representations has some fruitful consequences in terms of inference. We will get there eventually.
A very important fact about multivariate Gaussians: a zero entry in the matrix Σ implies independence between the corresponding components of the random vector X. On the other hand, by exploiting the canonical parameterization it can be seen that zeroes in Λ imply only conditional independence, given the other random components. Let us look at specific examples:
1. Suppose X = (X1 , X2 )> , where X1 is of n1 dimensions and X2 is of n2 dimensions, such that X is distributed by N (µ, Σ) with µ = (µ1 , µ2 )> and Σ = [Σ11 , Σ12 ; Σ21 , Σ22 ]. Prove that X1 ⊥ X2 if and only if Σ12 = Σ21 = 0.
2. Let X = (X1 , X2 , X3 )> , and suppose that X ∼ N (0, Λ−1 ), where the concentration matrix Λ is partitioned into blocks Λij , i, j = 1, 2, 3. Show that if Λ12 = Λ21 = 0 then X1 ⊥ X2 |X3 .
7.3 Semiconjugate prior for the mean vector
Given the normal sampling model Y 1 , . . . , Y n |θ, Σ ∼ i.i.d. Np (θ, Σ), and stacking up the samples Y i , the data can be viewed as an n × p matrix. The likelihood takes the form
p(y 1 , . . . , y n |θ, Σ) = ∏_{i=1}^{n} (2π)−p/2 |Σ|−1/2 exp( −(1/2)(y i − θ)> Σ−1 (y i − θ) )
= (2π)−np/2 |Σ|−n/2 exp( −(1/2) ∑_{i=1}^{n} (y i − θ)> Σ−1 (y i − θ) )
∝ exp( −(1/2) θ> A1 θ + θ> b1 ),
where the coefficients associated with the quadratic and linear terms are
A1 = nΣ−1 ,   b1 = Σ−1 ∑_{i=1}^{n} y i =: nΣ−1 ȳ.
The likelihood takes the exponential form, where the exponent is quadratic in the mean parameter θ. It is clear that the simplest semiconjugate prior (with Σ held fixed) is a multivariate normal distribution. We set the prior:
θ ∼ N(µ0 , Σ0 ).
As in the univariate case, it is convenient to write the multivariate normal density in terms of the precision matrix Σ0^{-1} :
p(θ) = (2π)−p/2 |Σ0 |−1/2 exp( −(1/2)(θ − µ0 )> Σ0^{-1} (θ − µ0 ) )
= (2π)−p/2 |A0 |1/2 exp( −(1/2)(θ − µ0 )> A0 (θ − µ0 ) )
∝ exp( −(1/2) θ> A0 θ + θ> b0 ),
where A0 = Σ0^{-1} and b0 = Σ0^{-1} µ0 .
By Bayes’ rule,
p(θ|y 1 , . . . , y n , Σ) ∝ p(θ)p(y 1 , . . . , y n |θ, Σ) ∝ exp( −(1/2) θ> An θ + θ> bn ),
where
An = A0 + A1 = Σ0^{-1} + nΣ−1
bn = b0 + b1 = Σ0^{-1} µ0 + nΣ−1 ȳ.
Thus, the posterior distribution of θ given y 1 , . . . , y n is multivariate normal with covariance matrix Σn := An^{-1} and mean vector µn := Σn bn .
7.4 Inverse Wishart prior for the covariance matrix
We learned in the one-dimensional Gaussian model that when the mean parameter is held fixed, the (semi)conjugate prior for the precision parameter is a Gamma distribution. In this subsection we will find a multivariate version of the Gamma distribution for the precision matrix, called the Wishart distribution. Since the covariance matrix is the inverse of the precision matrix, this corresponds to using an inverse-Wishart prior for the covariance matrix.
Recall the likelihood form, keeping only quantities that vary with the covariance matrix parameter:
p(y 1 , . . . , y n |θ, Σ) = ∏_{i=1}^{n} (2π)−p/2 |Σ|−1/2 exp( −(1/2)(y i − θ)> Σ−1 (y i − θ) )
∝ |Σ|−n/2 exp( −(1/2) trace( ∑_{i=1}^{n} (y i − θ)> Σ−1 (y i − θ) ) )
∝ |Σ|−n/2 exp( −(1/2) trace( Σ−1 ∑_{i=1}^{n} (y i − θ)(y i − θ)> ) ).
In the second line, we used the trivial fact that a scalar is equal to its own trace. Recall that the trace of a square matrix is the sum of its diagonal elements. In the third line, we used the cyclic property of the trace of a product of matrices (assuming the matrix dimensions match up):
trace(AB) = trace(BA); trace(ABC) = trace(BCA) = trace(CAB); . . .
Let A = Σ^{-1} denote the precision matrix and

S_n = ∑_{i=1}^n (y_i - θ)(y_i - θ)^T,        (26)

then the likelihood takes a simple form:

p(y_1, . . . , y_n|θ, A) ∝ |A|^{n/2} exp( -(1/2) trace(A S_n) ).
The simplest form for a conjugate prior is the Wishart distribution for the precision matrix A, or equivalently, the inverse-Wishart distribution for the covariance matrix Σ. We say a random matrix A ∼ Wishart(ν_0, S_0^{-1}) if it admits the density function on the space of symmetric and positive definite matrices:

p(A|ν_0, S_0) := [ 2^{ν_0 p/2} π^{p(p-1)/4} |S_0|^{-ν_0/2} ∏_{j=1}^p Γ([ν_0 + 1 - j]/2) ]^{-1}
                 × |A|^{(ν_0 - p - 1)/2} exp( -(1/2) trace(A S_0) ).        (27)
We immediately find that the conditional distribution of A given y_1, . . . , y_n is again a Wishart distribution:

p(A|y_1, . . . , y_n, θ) ∝ |A|^{(ν_0 + n - p - 1)/2} exp( -(1/2) trace(A (S_0 + S_n)) )
                         ≡ Wishart(ν_0 + n, [S_0 + S_n]^{-1}).
Equivalently, in terms of the covariance matrix: given the prior Σ = A^{-1} ∼ inverse-Wishart(ν_0, S_0^{-1}), which has the density

p(Σ|ν_0, S_0^{-1}) = [ 2^{ν_0 p/2} π^{p(p-1)/4} |S_0|^{-ν_0/2} ∏_{j=1}^p Γ([ν_0 + 1 - j]/2) ]^{-1}
                     × |Σ|^{-(ν_0 + p + 1)/2} exp( -(1/2) trace(Σ^{-1} S_0) ),        (28)

we find that

p(Σ|y_1, . . . , y_n, θ) ∝ |Σ|^{-(ν_0 + n + p + 1)/2} exp( -(1/2) trace(Σ^{-1} (S_0 + S_n)) )        (29)
                         ≡ inverse-Wishart(ν_0 + n, [S_0 + S_n]^{-1}).        (30)
Useful facts about Wishart distributions   The Wishart is a canonical distribution for symmetric and positive definite matrices. Wishart(ν_0, V) has two parameters: ν_0, called the number of degrees of freedom, and V > 0, the scale matrix.
The Wishart is the multivariate analogue of the chi-square distribution (which is a special case of the Gamma distribution). Recall that a chi-square random variable with ν_0 degrees of freedom can be constructed by taking a sum of squares of ν_0 standard normal variables. A similar property holds for Wishart random matrices.
Let z_1, . . . , z_{ν_0} ~iid N_p(0, V). Let Z = [z_1 . . . z_{ν_0}] be the p × ν_0 matrix made of the ν_0 column vectors z_i. Then,

Z Z^T = ∑_{i=1}^{ν_0} z_i z_i^T ∼ Wishart(ν_0, V).        (31)

When ν_0 ≥ p, the matrix Z Z^T is positive definite (and hence invertible) almost surely if V is invertible. If p = 1 and V = 1, we are reduced to a chi-squared distribution with ν_0 degrees of freedom.
The above characterization makes it simple to draw samples from a Wishart distribution (or an inverse-Wishart distribution).
It also allows us to collect a few useful facts. For A ∼ Wishart(ν_0, V), equivalently Σ = A^{-1} ∼ inverse-Wishart(ν_0, V):

E(A) = ν_0 V,
Var(A_{ij}) = ν_0 (V_{ij}^2 + V_{ii} V_{jj}),
E(Σ) = V^{-1} / (ν_0 - p - 1).

The formula for the variance of Σ is slightly more complicated and omitted, but the rule of thumb is that we set ν_0 to be small if we want large variation around the prior expectation for the covariance matrix Σ.
Plugging the last identity into the posterior distribution for Σ given in Eq. (30):

E[Σ|y_1, . . . , y_n, θ] = (S_0 + S_n) / (ν_0 + n - p - 1)
                         = (ν_0 - p - 1)/(ν_0 + n - p - 1) × S_0/(ν_0 - p - 1) + n/(ν_0 + n - p - 1) × S_n/n,

which can be viewed as a weighted average of the prior expectation and the unbiased estimator S_n/n for the covariance matrix Σ.
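The Z Z^T characterization in Eq. (31) gives a direct way to simulate Wishart matrices. A small sketch in Python/NumPy (the function name is my own), with a Monte Carlo check against the fact E(A) = ν_0 V:

```python
import numpy as np

def sample_wishart(rng, nu0, V):
    """Draw one Wishart(nu0, V) matrix via Z Z^T = sum_i z_i z_i^T,
    with z_1, ..., z_nu0 i.i.d. N_p(0, V)."""
    p = V.shape[0]
    Z = rng.multivariate_normal(np.zeros(p), V, size=nu0).T  # p x nu0
    return Z @ Z.T

rng = np.random.default_rng(1)
V = np.array([[2.0, 0.5], [0.5, 1.0]])
draws = np.stack([sample_wishart(rng, nu0=10, V=V) for _ in range(4000)])
mean_A = draws.mean(axis=0)   # Monte Carlo estimate of E(A) = nu0 * V
```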
7.5 Example: reading comprehension study
Given n i.i.d. samples Y_1, . . . , Y_n|θ, Σ ∼ N_p(θ, Σ), and priors θ ∼ N(µ_0, Σ_0) and Σ ∼ inverse-Wishart(ν_0, S_0^{-1}), it is simple to implement a Gibbs sampler to approximate the posterior distribution of (θ, Σ) based on the full conditionals obtained in the previous subsections.
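Such a Gibbs sampler can be sketched as follows in Python/NumPy. The covariance update draws the precision matrix A ∼ Wishart(ν_0 + n, [S_0 + S_n]^{-1}) via the Z Z^T construction of Eq. (31). The function name and the simulated data are my own; this is a sketch, not a tuned implementation:

```python
import numpy as np

def gibbs_mvn(Y, mu0, Sigma0, nu0, S0, n_iter=1000, seed=0):
    """Gibbs sampler for Y_i ~ N_p(theta, Sigma), with theta ~ N(mu0, Sigma0)
    and Sigma ~ inverse-Wishart(nu0, S0^{-1}) in the notation of the notes."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    ybar = Y.mean(axis=0)
    Sigma = np.cov(Y.T)                          # initialize at the sample covariance
    thetas, Sigmas = [], []
    for _ in range(n_iter):
        # theta | Sigma, Y ~ N(Sigma_n b_n, Sigma_n), with Sigma_n = A_n^{-1}
        An = np.linalg.inv(Sigma0) + n * np.linalg.inv(Sigma)
        bn = np.linalg.inv(Sigma0) @ mu0 + n * np.linalg.inv(Sigma) @ ybar
        Sigma_n = np.linalg.inv(An)
        theta = rng.multivariate_normal(Sigma_n @ bn, Sigma_n)
        # Sigma | theta, Y: draw A ~ Wishart(nu0 + n, [S0 + Sn]^{-1}) via Z Z^T
        Sn = (Y - theta).T @ (Y - theta)
        Z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(S0 + Sn), size=nu0 + n)
        Sigma = np.linalg.inv(Z.T @ Z)           # Sigma = A^{-1}
        thetas.append(theta)
        Sigmas.append(Sigma)
    return np.array(thetas), np.array(Sigmas)

# Toy run mimicking the reading comprehension setup (simulated data):
rng = np.random.default_rng(42)
Y = rng.multivariate_normal([47.0, 54.0], [[182.0, 147.0], [147.0, 244.0]], size=22)
S0 = np.array([[625.0, 312.5], [312.5, 625.0]])
thetas, Sigmas = gibbs_mvn(Y, mu0=np.array([50.0, 50.0]), Sigma0=S0, nu0=4, S0=S0)
```

Posterior predictive probabilities such as Pr(Y_2 > Y_1|y_1, . . . , y_n) can then be approximated by drawing one Ỹ ∼ N_p(θ^{(s)}, Σ^{(s)}) per retained Gibbs iteration.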
Let us consider an example in Hoff (2009) (Chapter 7):
• 22 children were given two reading comprehension exams, one before a certain type of instruction and one after.
• model these 22 pairs of scores as i.i.d. samples y_1, . . . , y_22 from a bivariate normal (p = 2). The data samples are plotted as black dots in the second panel of Fig. 7.1.
• basic sample statistics from y_1, . . . , y_22: we found that ȳ = (47.18, 53.86)^T. The sample covariance matrix is

S_22 = [ 182.16  147.44 ]
       [ 147.44  243.65 ].
• the exam was designed for average scores of around 50 out of 100, so µ_0 = (50, 50)^T.
• for the hyperparameter

Σ_0 := [ σ_11  σ_12 ]
       [ σ_21  σ_22 ],

we set σ_11 = σ_22 = (50/2)^2 = 625 to ensure most of the prior mass concentrates on [0, 100]. Moreover, σ_12 = σ_21 = 0.5 σ_11 = 312.5 to allow some prior correlation.
Now we proceed with the Bayesian approach.
• as for the hyperparameters for Σ: set S_0 = Σ_0 and choose a relatively small value for the number of degrees of freedom, ν_0 = p + 2 = 4, to allow sufficient spread around Σ_0.
• run the Gibbs sampler for 5000 iterations, from which we can approximate the posterior distribution as follows:

Pr(θ_2 > θ_1|y_1, . . . , y_n) ≈ 0.99.
• we also find the quantiles of the posterior distribution of θ2 − θ1 :
Figure 7.1: Reading comprehension: posterior distribution of mean scores before and after instruction (left),
and posterior predictive distribution of two scores (right).
• The left panel of Fig. 7.1 gives the 97.5%, 75%, 50%, 25% and 5% highest posterior density contours for the posterior distribution of θ = (θ_1, θ_2)^T. Thus, the evidence is strong that the mean test score θ_2 after the instruction is greater than the mean θ_1 before the instruction.
• But this does not tell the full story. It is far more interesting to look at the posterior predictive probability

Pr(Y_2 > Y_1|y_1, . . . , y_n).

This asks: what is the probability that a randomly selected child will score higher on the second exam than on the first?
• The second panel of Fig. 7.1 shows the highest posterior density contours of the predictive distribution; there is a more substantial overlap with the line y_2 = y_1. In fact, we find that

Pr(Y_2 > Y_1|y_1, . . . , y_n) ≈ 0.71.

Thus, almost a third of the students will get a lower score on the second exam!
This example highlights the distinction between two different ways of comparing populations in the reporting of results from experiments or surveys: studies with a very large sample size n may result in values of Pr(θ_2 > θ_1|y_1, . . . , y_n) that are very close to 1 (or p-values very close to 0), leading to the conclusion of a "significant effect", but such results say nothing about how large an effect we expect to see for a randomly sampled individual.
8 Group comparisons and hierarchical modeling
In this section we will study questions related to comparisons of different populations. While group comparison may conjure up the question of ranking, a thorough treatment will inevitably require thinking about notions such as within-group variability and between-group variability. Such notions are best addressed by employing (Bayesian) hierarchical modeling. In this sense, this section is also a good entry point to hierarchical modeling, which is applicable far beyond basic group comparison problems. In fact, hierarchical modeling is one of the most powerful tools in the arsenal of Bayesian statistics.
8.1 Comparing two groups
Example 8.1. We are given samples of 10th grade students from two public U.S. high schools, with sample sizes n_1 = 31 and n_2 = 28 from school 1 and school 2, respectively. Both schools have a total enrollment of around 600 10th graders, and both are in a similar environment (urban neighborhoods).
• Suppose we are interested in comparing the population means θ1 and θ2 .
• Sample means: ȳ1 = 50.81 and ȳ2 = 46.15 suggesting that θ1 > θ2 .
• Let’s take a look at the box plots. There are evidently different levels of variability in the two groups.
A standard approach is to consider the t-statistic:

t(y_1, y_2) = (ȳ_1 - ȳ_2) / (s_p √(1/n_1 + 1/n_2)) = (50.81 - 46.15) / (10.44 √(1/31 + 1/28)) = 1.74,

where s_p^2 = [(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2]/(n_1 + n_2 - 2) is the pooled estimate of the population variance of the two groups.
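A quick numerical check of this statistic (plain Python; since the individual sample variances s_1^2, s_2^2 are not reported here, we plug in the pooled value s_p = 10.44 from the display, which reproduces the quoted value up to rounding):

```python
import math

def pooled_t(ybar1, ybar2, sp, n1, n2):
    """Two-sample t-statistic with pooled standard deviation sp."""
    return (ybar1 - ybar2) / (sp * math.sqrt(1 / n1 + 1 / n2))

t = pooled_t(50.81, 46.15, sp=10.44, n1=31, n2=28)   # approx 1.71 with these rounded inputs
df = 31 + 28 - 2                                     # degrees of freedom, 57
```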
Figure 8.1: Left panel: Boxplots of samples of math scores from two schools. Right panel: gray line
indicates the observed value of the t-statistic.
A basic frequentist technique (the t-test) proceeds as follows.
• Exploit the fact that if the two populations are normal with the same mean and variance, then the t-statistic t(Y_1, Y_2) has a t-distribution with n_1 + n_2 - 2 = 57 degrees of freedom. The density of this distribution is plotted in the second panel of Fig. 8.2. Under this distribution, the probability that |t(Y_1, Y_2)| > 1.74 is p = 0.087. This is called the (two-sided) p-value of the obtained statistic.
– Although not completely justified in theory, p-values are widely used, and easily misused and abused, in parameter estimation and model selection. A small p-value is considered as evidence for the rejection of the null hypothesis/model θ_1 = θ_2. Thus, a small p-value is construed as strong evidence that the two populations are different (θ_1 ≠ θ_2). Customarily, p is considered small if p < 0.05 (or a smaller positive threshold).
– Mathematically,
p = Pr(|t(Y 1 , Y 2 )| > t(y 1 , y 2 )|θ1 = θ2 ).
This is a (pre-experiment) probability statement about the unseen data represented by (Y_1, Y_2), even though the observed statistic t(y_1, y_2) supplies part of the equation that defines the p-value. This is a source of confusion for many practitioners of frequentist tests. It should not be so for a student of Bayesian statistics. Clearly, p is not the (post-experiment) probability that θ_1 = θ_2 is true given the data evidence provided by t(y_1, y_2):

Pr(θ_1 = θ_2|t(y_1, y_2)).
• The t-test commonly taught in statistics classes continues as follows:
– if p < 0.05: reject the null hypothesis/model that the two groups have the same distributions;
conclude that θ1 6= θ2 . Moreover, use the estimates:
θ̂1 = ȳ1 ;
θ̂2 = ȳ2 .
– if p ≥ 0.05: accept the null hypothesis/model, and conclude that θ_1 = θ_2. Moreover, use the estimate

θ̂_1 = θ̂_2 = ( ∑_i y_{i,1} + ∑_i y_{i,2} ) / (n_1 + n_2).
• In our present example: p ≥ 0.05, so we accept that θ1 = θ2 , even though there seems to be some
evidence to the contrary.
• Imagine a scenario where the sample from school 1 might have included a few more high-performing
students, and the sample from school 2 a few more low-performing students. Then we could have
observed a p-value of 0.04 or so, in which case we would have treated the two populations as different,
and resorted to using only data from school 1 for estimating θ1 , and data from school 2 for estimating
θ2 . It seems such estimates for θ1 and θ2 are not robust with respect to changes to the samples. 5
• Estimating θ_1 and θ_2 and the difference θ_1 - θ_2 is perhaps more important than answering the binary question of whether θ_1 ≠ θ_2, especially when the difference between the two is relatively small. The above frequentist approach results in taking two extreme positions for the estimation of θ_1 and θ_2:

θ̂_1 = w_1 ȳ_1 + (1 - w_1) ȳ_2,
θ̂_2 = (1 - w_2) ȳ_1 + w_2 ȳ_2,

where w_1 = w_2 = 1 if p < 0.05, and w_1 = n_1/(n_1 + n_2), w_2 = n_2/(n_1 + n_2) otherwise.
• It might make more sense to allow the weights w_1, w_2 to vary continuously, with values that depend on quantities such as the sample sizes n_1, n_2 and other quantities that determine population variabilities. In other words, we want to allow the borrowing of information across groups: the data from group 1 may influence the estimate for group 2, and vice versa.
5
In the t-test, as is the case with most frequentist tests, we are on firm mathematical ground when we happen to reject; i.e., the rejection is mathematically justified. However, in such a scenario for the t-test, our estimates may not be robust, for the reason mentioned above. When we happen not to reject, i.e., we remain with the null hypothesis/model, the issue becomes whether the null model is too simplistic and heavily misspecified; the estimates would be suspect as a result.
Enabling information sharing across groups   Consider the following sampling model for two groups:

Y_{i,1} = µ + δ + ε_{i,1},
Y_{i,2} = µ - δ + ε_{i,2},
{ε_{i,j}} ~iid normal(0, σ^2).
We have utilized a (re)parameterization trick: under this parameterization, θ1 = µ + δ and θ2 = µ − δ, so
µ = (θ1 + θ2 )/2 and δ = (θ1 − θ2 )/2. The intention is to enable the coupling (dependence) of the two
groups via variables µ and δ, which will be made random by a prior distribution. The fact that these two
are random is enough to allow the coupling and subsequent information sharing in posterior inference. The
specific prior choice given below is for computational convenience:
p(µ, δ, σ 2 ) = p(µ) × p(δ) × p(σ 2 )
µ ∼ normal(µ0 , γ02 )
δ ∼ normal(δ0 , τ02 )
σ 2 ∼ inverse-gamma(ν0 /2, ν0 σ02 /2).
Based on our previous calculations for the (univariate) normal model, it should be an easy exercise to derive the full conditional distributions of these parameters as follows:

{µ|y_1, y_2, δ, σ^2} ∼ normal(µ_n, γ_n^2), where
    γ_n^2 = [1/γ_0^2 + (n_1 + n_2)/σ^2]^{-1},
    µ_n = γ_n^2 × [µ_0/γ_0^2 + ∑_{i=1}^{n_1}(y_{i,1} - δ)/σ^2 + ∑_{i=1}^{n_2}(y_{i,2} + δ)/σ^2];

{δ|y_1, y_2, µ, σ^2} ∼ normal(δ_n, τ_n^2), where
    τ_n^2 = [1/τ_0^2 + (n_1 + n_2)/σ^2]^{-1},
    δ_n = τ_n^2 × [δ_0/τ_0^2 + ∑(y_{i,1} - µ)/σ^2 - ∑(y_{i,2} - µ)/σ^2];

{σ^2|y_1, y_2, µ, δ} ∼ inverse-gamma(ν_n/2, ν_n σ_n^2/2), where
    ν_n = ν_0 + n_1 + n_2,
    ν_n σ_n^2 = ν_0 σ_0^2 + ∑(y_{i,1} - [µ + δ])^2 + ∑(y_{i,2} - [µ - δ])^2.
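These three full conditionals translate directly into a Gibbs sampler. A sketch in Python/NumPy (the function name, defaults, and simulated data are my own; note that a draw from inverse-gamma(a, b) is obtained as the reciprocal of a gamma draw with shape a and scale 1/b):

```python
import numpy as np

def gibbs_two_group(y1, y2, mu0=50, g02=625, d0=0, t02=625, nu0=1, s02=100,
                    n_iter=5000, seed=0):
    """Gibbs sampler for Y_{i,1} = mu + delta + eps, Y_{i,2} = mu - delta + eps,
    cycling through the full conditionals above. A sketch, not optimized."""
    rng = np.random.default_rng(seed)
    n1, n2 = len(y1), len(y2)
    mu = (y1.mean() + y2.mean()) / 2
    delta = (y1.mean() - y2.mean()) / 2
    out = np.empty((n_iter, 3))
    for t in range(n_iter):
        # sigma^2 | mu, delta  (inverse-gamma)
        nun = nu0 + n1 + n2
        ss = nu0 * s02 + ((y1 - mu - delta) ** 2).sum() + ((y2 - mu + delta) ** 2).sum()
        s2 = 1.0 / rng.gamma(nun / 2, 2.0 / ss)
        # mu | delta, sigma^2
        gn2 = 1.0 / (1.0 / g02 + (n1 + n2) / s2)
        mun = gn2 * (mu0 / g02 + (y1 - delta).sum() / s2 + (y2 + delta).sum() / s2)
        mu = rng.normal(mun, np.sqrt(gn2))
        # delta | mu, sigma^2
        tn2 = 1.0 / (1.0 / t02 + (n1 + n2) / s2)
        dn = tn2 * (d0 / t02 + (y1 - mu).sum() / s2 - (y2 - mu).sum() / s2)
        delta = rng.normal(dn, np.sqrt(tn2))
        out[t] = mu, delta, s2
    return out

# Simulated data resembling the two-school example:
rng = np.random.default_rng(7)
y1 = rng.normal(50.8, 10.4, size=31)
y2 = rng.normal(46.2, 10.4, size=28)
samples = gibbs_two_group(y1, y2)
pr_theta1_gt_theta2 = (samples[:, 1] > 0).mean()   # Pr(delta > 0 | data)
```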
Let us go back to our example of comparing math test scores of students from two high schools.
Example 8.2. As for prior distribution parameter for µ ∼ normal(µ0 , γ02 ), we put µ0 = 50, γ0 = 50/2 =
25 to get a reasonably diffuse prior. For the prior on δ, set δ0 = 0, τ0 = 25. For the prior for σ 2 , set
ν0 = 1, σ0 = 10 (this latter choice is due to the setup that the math scores were standardized to produce a
nationalwide mean of 50 and a standard deviation of 10).
• the following figure shows the posterior distribution for µ and δ. In particular, the 95% quantile-based
posterior confidence interval for 2δ, the difference of average scores between the two schools, is (-.61,
9.98), indicating a strong evidence that the posterior mean for school 1 is higher than that of school 2.
• In addition, Pr(θ1 > θ2 |y 1 , y 2 ) = Pr(δ > 0|y 1 , y 2 ) ≈ 0.96, even though the prior probability is
such that Pr(δ > 0) = .50.
• As for the posterior predictive probability that a randomly selected student from school 1 has a higher score than a randomly selected student from school 2:

Pr(Y_1 > Y_2|y_1, y_2) ≈ 0.62.
Figure 8.2: Posterior distributions for µ and δ.
8.2 Comparing multiple groups
It is very common to organize data or data sets in a hierarchy of nested populations. Such data sets are often called hierarchical or multilevel data. For example:
• there are multiple hospitals, and each hospital has many patients
• there are different animals, and each animal carries a set of genes
• different countries, each of which is organized into regions, each of which is organized into counties, with residents in each of them
• the "activity recognition problem": a collection of computer users, each user is associated with a collection of computer-related activities (organized by days), and each day has a collection of activities (apps run)
• a collection of text corpora, where each corpus is a collection of documents, and each document is a collection of words
• a database of images divided into groups, where each image is a collection of image patches, and each patch is a collection of pixels or other specific computer vision elements
We are interested in learning about these groups: what features are shared among them, what makes different groups different, and how. In most applications, it does not make much sense to assume that the groups are independent. It makes sense to assume that they are dependent, and to exploit such dependence to learn about global aspects of all groups, as well as locally distinct aspects of each group. In other words, we wish to borrow information from one group to inform the others, as well as the whole. The question is how.
8.3 Exchangeability and hierarchical models
Hierarchical models are a general method for describing dependence for grouped data. They can be motivated by a theorem of Bruno de Finetti. At a high level, de Finetti's theorem says that an exchangeable sequence of random variables must be conditionally i.i.d., and, as a consequence, an exchangeable collection of groups of random variables must be distributed according to a hierarchical model. Let us make this statement more precise.
Definition 8.1 (Exchangeable). Let p(y_1, . . . , y_n) be the joint density of random variables Y_1, . . . , Y_n. If p(y_1, . . . , y_n) = p(y_{π_1}, . . . , y_{π_n}) for all permutations π of 1, . . . , n,6 or equivalently, if the joint distribution of (Y_{π_1}, . . . , Y_{π_n}) remains invariant under any permutation π, then we say that Y_1, . . . , Y_n are exchangeable.
Intuitively, when Y_1, . . . , Y_n are exchangeable, the subscript labels of these n variables convey no additional information about them.
It is simple to see that if a collection of random variables Y_1, . . . , Y_n are conditionally i.i.d. given some random variable θ, i.e.,

θ ∼ π(θ),
Y_1, . . . , Y_n|θ ~i.i.d. p(·|θ),

then Y_1, . . . , Y_n are exchangeable.
What about the other direction? This is where de Finetti’s theorem comes in.
6
At this point, it may be helpful to express the identity explicitly: pY1 ,...,Yn (y1 , . . . , yn ) = pY1 ,...,Yn (yπ1 , . . . , yπn ).
Theorem 8.1. Let Y_1, Y_2, . . . be an infinite sequence of random variables all having a common sample space Y. Suppose that Y_1, . . . , Y_n are exchangeable for every sequence size n. Then Y_1, Y_2, . . . must be conditionally i.i.d. That is, the joint distribution of Y_1, . . . , Y_n for any n must be of the form (provided that a density function exists): for all n and y_1, . . . , y_n,

p(y_1, . . . , y_n) = ∫ ∏_{i=1}^n p(y_i|θ) π(θ) dθ        (32)

for some parameter θ, some distribution π over θ, and some sampling model p(y|θ).
Remark 8.1.
• The ”infinite” part in the statement is necessary, along with the condition of exchangeability for any n.
• de Finetti’s theorem is one of the great theorems in probability theory. It also gives us probability
models that can be written as Eq. (32), as well as hierarchical versions of this, as we will see.
• It has a foundational role in Bayesian statistics, because it provides a mathematical justification for
the existence of the notion of random parameter θ:
– whereas a frequentist statistician may be content with making an i.i.d. assumption about an unknown sampling mechanism, such as

Y_1, . . . , Y_n ~i.i.d. p(·|θ),

de Finetti's theorem says that if the observation sequence is in fact exchangeable, then the unknown θ must be random. Bayesian statisticians proceed by placing a prior distribution π on such θ.
• Exchangeability makes sense in many practical situations:
– the math scores of n randomly selected students from a particular school, in the absence of other information about the students, may be treated as exchangeable.
– the collection of U.S. high schools in similar environments (e.g., large urban areas).
– the computer-related activities of a user collected on Monday mornings in the past year.
– What is not exchangeable? The collection of time-stamped computer-related activities in the past 24 hours is not exchangeable. The words in a document, read from beginning to end, are not exchangeable either. But if we print the document on a piece of paper, cut the paper into small pieces, one for each word, and place the pieces into a bag and shuffle well, then we have a bag of exchangeable words.
Now, let us consider a model to describe our information about a hierarchical data structure: there are m groups {Y_1, . . . , Y_m}; each group Y_j = {Y_{j1}, . . . , Y_{jn_j}} has n_j elements, for some n_j ≥ 1.
Suppose that the elements within each group Y_j may be treated as exchangeable. Then, by de Finetti's theorem we may model the observations from each group as conditionally i.i.d. given some parameter:

Y_{j1}, . . . , Y_{jn_j}|φ_j ~i.i.d. p(y|φ_j).        (33)

What about the collection of parameters φ_1, . . . , φ_m? If we assume that the m groups are exchangeable, then, applying de Finetti's theorem once more, we have

φ_1, . . . , φ_m|ψ ~i.i.d. p(φ|ψ),        (34)

for some random parameter ψ. Collecting the above specifications, we arrive at the following hierarchical model:

ψ ∼ p(ψ)   (prior distribution)
φ_1, . . . , φ_m|ψ ~i.i.d. p(φ|ψ)   (between-group sampling variability)
Y_{j1}, . . . , Y_{jn_j}|φ_j ~i.i.d. p(y|φ_j), j = 1, . . . , m   (within-group sampling variability).
This hierarchical model has three levels representing different aspects of randomness/random variability: p(y|φ) represents the sampling variability among measurements within a group, and p(φ|ψ) represents the sampling variability across groups. Finally, p(ψ) represents prior information about the unknown parameter ψ. Depending on the data structure and the modeler's knowledge, more levels of sampling distributions and prior distributions may be constructed in the hierarchy.
8.4 Hierarchical normal models
A popular model for describing the heterogeneity of means across several populations is the hierarchical normal model: each group is endowed with a normal sampling model, and the mean parameters across groups are endowed with another normal sampling model further up in the hierarchy:

φ_j = (θ_j, σ^2),  p(y|φ_j) = normal(θ_j, σ^2)   (within-group model)        (35a)
ψ = (µ, τ^2),  p(θ_j|ψ) = normal(µ, τ^2)   (between-group model).        (35b)
Note that in this model, we allow different groups to have different means, but they share the same variance
σ 2 (this assumption may be relaxed). The parameters for the given sampling model are µ, τ 2 , σ 2 . For
convenience we may give them standard semi-conjugate priors:
1/σ 2 ∼ gamma(ν0 /2, ν0 σ02 /2)
1/τ 2 ∼ gamma(η0 /2, η0 τ02 /2)
µ ∼ normal(µ0 , γ02 ).
8.4.1 Posterior inference
The unknown quantities in our model include the group-specific means (θ_1, . . . , θ_m), the within-group sampling variability σ^2, and the mean and variance µ, τ^2 of the population of group-specific means. Joint posterior inference for these parameters may be made by an MCMC approximation of the posterior distribution

p(θ_1, . . . , θ_m, σ^2, µ, τ^2|y_1, . . . , y_m)
  ∝ p(µ, τ^2, σ^2) × p(θ_1, . . . , θ_m|µ, τ^2, σ^2) × p(y_1, . . . , y_m|θ_1, . . . , θ_m, µ, τ^2, σ^2)
  = p(µ) p(τ^2) p(σ^2) ∏_{j=1}^m p(θ_j|µ, τ^2) ∏_{j=1}^m ∏_{i=1}^{n_j} p(y_{ji}|θ_j, σ^2).
Although this may look daunting, we will see shortly that it is not difficult to derive full conditional distributions for all parameters of interest, which will enable us to run a Gibbs sampler. The key is to observe that the joint distribution of all parameters and observations is expressed in the factorized (i.e., product) form given above. This is a reflection of the conditional independence relations inherent in our hierarchical modeling assumption. It is also this conditional independence that we exploit in deriving the full conditional distributions comfortably.
Full conditional distributions of µ and τ^2:
It is useful to note that µ and τ^2 are conditionally independent of all other variables in the joint model when given θ_1, . . . , θ_m. Collecting only relevant terms from the joint distribution, we find that

p(µ|θ_1, . . . , θ_m, τ^2, σ^2, y_1, . . . , y_m) ∝ p(µ) ∏_j p(θ_j|µ, τ^2),
p(τ^2|θ_1, . . . , θ_m, µ, σ^2, y_1, . . . , y_m) ∝ p(τ^2) ∏_j p(θ_j|µ, τ^2).

The right hand sides of the two equations in the above display allow us to look at only "submodels" for µ and τ^2. For example, in the first equation we can treat θ_1, . . . , θ_m as an m-sample for a normal submodel with mean parameter µ, so we need to compute the posterior distribution of µ for this submodel. We have seen such submodels before, in Section 5. Thus,

µ|θ_1, . . . , θ_m, τ^2 ∼ normal( (m θ̄/τ^2 + µ_0/γ_0^2) / (m/τ^2 + 1/γ_0^2), (m/τ^2 + 1/γ_0^2)^{-1} ),
1/τ^2|θ_1, . . . , θ_m, µ ∼ gamma( (η_0 + m)/2, η_0 τ_0^2/2 + ∑_j (θ_j - µ)^2/2 ).
Full conditional distribution of θ_j, j = 1, . . . , m:
θ_j represents the mean for group j. It is useful to note that, given µ, τ^2, σ^2, y_j, the parameter θ_j must be conditionally independent of all other mean parameters θ, as well as of the data from groups other than j. In fact,

p(θ_j|µ, τ^2, σ^2, y_1, . . . , y_m) ∝ p(θ_j|µ, τ^2) ∏_{i=1}^{n_j} p(y_{ji}|θ_j, σ^2).

We can view this as the posterior distribution for the normal sampling model for group j, given the n_j-sample from this group only. Let ȳ_j denote the sample mean for group j; then

θ_j|µ, τ^2, σ^2, y_{j1}, . . . , y_{jn_j} ∼ normal( (n_j ȳ_j/σ^2 + µ/τ^2) / (n_j/σ^2 + 1/τ^2), (n_j/σ^2 + 1/τ^2)^{-1} ).        (36)
Full conditional distribution of σ^2:
σ^2 represents the shared within-group variance for all groups. Note that σ^2 is conditionally independent of µ, τ^2 given y_1, . . . , y_m, θ_1, . . . , θ_m. We find that

p(σ^2|θ_1, . . . , θ_m, y_1, . . . , y_m) ∝ p(σ^2) ∏_{j=1}^m ∏_{i=1}^{n_j} p(y_{ji}|θ_j, σ^2)
  ∝ (σ^2)^{-(ν_0/2 + 1)} e^{-ν_0 σ_0^2/(2σ^2)} (σ^2)^{-∑_j n_j/2} exp( -(1/(2σ^2)) ∑_j ∑_i (y_{ji} - θ_j)^2 ),

so

1/σ^2|θ, y_1, . . . , y_m ∼ gamma( (ν_0 + ∑_{j=1}^m n_j)/2, ν_0 σ_0^2/2 + ∑_{j=1}^m ∑_{i=1}^{n_j} (y_{ji} - θ_j)^2/2 ).

Note that the double sum is the sum of squared residuals across all groups, conditional on the within-group means, so the (full) conditional distribution of σ^2 concentrates probability around a pooled-sample estimate of the variance. This makes sense, because σ^2 is the same variance parameter shared across all groups according to our model.
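Putting the four full conditionals together gives a Gibbs sampler for the hierarchical normal model. A sketch in Python/NumPy (the function name, hyperparameter defaults, and toy data are my own; an inverse-gamma draw is obtained as the reciprocal of a gamma draw with scale equal to one over the rate):

```python
import numpy as np

def gibbs_hier_normal(groups, mu0=50, g02=25, nu0=1, s02=100, eta0=1, t02=100,
                      n_iter=2000, seed=0):
    """Gibbs sampler for the hierarchical normal model, cycling through the
    full conditionals derived above. `groups` is a list of 1-D arrays."""
    rng = np.random.default_rng(seed)
    m = len(groups)
    n = np.array([len(y) for y in groups])
    ybar = np.array([y.mean() for y in groups])
    theta, s2, mu, t2 = ybar.copy(), 100.0, ybar.mean(), 100.0
    keep = {"mu": [], "s2": [], "t2": [], "theta": []}
    for _ in range(n_iter):
        # theta_j | rest (Eq. 36), updated jointly since they are independent
        var = 1.0 / (n / s2 + 1.0 / t2)
        theta = rng.normal(var * (n * ybar / s2 + mu / t2), np.sqrt(var))
        # 1/sigma^2 | rest
        sse = sum(((y - th) ** 2).sum() for y, th in zip(groups, theta))
        s2 = 1.0 / rng.gamma((nu0 + n.sum()) / 2, 1.0 / (nu0 * s02 / 2 + sse / 2))
        # mu | rest
        vmu = 1.0 / (m / t2 + 1.0 / g02)
        mu = rng.normal(vmu * (m * theta.mean() / t2 + mu0 / g02), np.sqrt(vmu))
        # 1/tau^2 | rest
        t2 = 1.0 / rng.gamma((eta0 + m) / 2,
                             1.0 / (eta0 * t02 / 2 + ((theta - mu) ** 2).sum() / 2))
        keep["mu"].append(mu); keep["s2"].append(s2); keep["t2"].append(t2)
        keep["theta"].append(theta.copy())
    return {k: np.array(v) for k, v in keep.items()}

# Toy data: 20 groups with true means drawn around 50, within-group sd 10
rng = np.random.default_rng(3)
true_means = rng.normal(50, 5, size=20)
data = [rng.normal(mth, 10, size=rng.integers(5, 40)) for mth in true_means]
post = gibbs_hier_normal(data)
```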
8.4.2 Example: Math scores in U.S. public schools
We return to the analysis of math scores examined in Hoff (2009). The setting is as follows
• there are 100 large urban public high schools, all having a 10th grade enrollment of 400 or larger.
Figure 8.3: ELS data.
• average score per school ranges from 36.6 to 65.0.
Figure 8.4: Empirical distribution of sample means and relationship with sample size.
• extreme average scores tend to be associated with low sample sizes. This is a common phenomenon for hierarchical data sets (why might that be?)
Prior specification and posterior approximation
• recall our hierarchical model:

µ, τ^2, σ^2 ∼ p(µ) p(τ^2) p(σ^2)   (prior distribution)
θ_1, . . . , θ_m|µ, τ^2 ~i.i.d. normal(µ, τ^2)   (between-group sampling variability)
Y_{j1}, . . . , Y_{jn_j}|θ_j ~i.i.d. normal(θ_j, σ^2), j = 1, . . . , m   (within-group sampling variability).
• we need to provide hyperparameters for the semi-conjugate priors
1/σ 2 ∼ gamma(ν0 /2, ν0 σ02 /2)
1/τ 2 ∼ gamma(η0 /2, η0 τ02 /2)
µ ∼ normal(µ0 , γ02 ).
– the math exam was designed to give a nationwide variance of 100, so we set σ_0^2 = 100. For a diffuse prior on the variance, we set ν_0 = 1.
– for the between-group variance: we set τ_0^2 = 100, and η_0 = 1.
– for the global mean: we set µ_0 = 50, γ_0^2 = 25 (so the prior probability that µ lies in (µ_0 − 2γ_0, µ_0 + 2γ_0) = (40, 60) is about 95%).
• the previous subsection gave the derivations of all full conditional distributions required for the implementation of a Gibbs sampler.
MCMC diagnostics
• run the Gibbs sampler for 5000 iterations. Fig. 8.5 shows boxplots for batches of 500 consecutive MCMC samples (e.g., {1, . . . , 500}, {501, . . . , 1000}, and so on). There does not seem to be any evidence that the chain has failed to reach stationarity.
Figure 8.5: Stationarity plots of the MCMC samples of µ, σ 2 , τ 2 .
• lag-1 autocorrelations for the sequences of µ, σ 2 and τ 2 are 0.15, 0.053, and 0.312, respectively.
• the effective sample sizes are 3706, 4499, and 2503, respectively.
• the approximate MC standard deviations can be obtained by dividing the approximated posterior standard deviation by the square root of the effective sample size, resulting in 0.009, 0.004, and 0.09 for µ, σ^2, τ^2, respectively. These are small compared to the posterior means of these parameters (Fig. 8.6).
Figure 8.6: Marginal posteriors with 2.5%, 50% and 97.5% quantiles.
• for θ: we found the ESS for the 100 sequences of θ-values ranged between 3,500 and 6,000, with the
MC std ranging between 0.02 and 0.05.
Posterior summaries and shrinkage
• The posterior means of µ, σ and τ are 48.12, 9.21 and 4.97, respectively. Recalling the meaning of these parameters, this indicates that roughly 95% of scores within a classroom are within 4 × 9.21 ≈ 37 points of each other, whereas 95% of the average classroom scores (across schools) are within 4 × 4.97 ≈ 20 points of each other.
• The shrinkage effect: recall that, conditional on µ, τ^2, σ^2, the expected value of θ_j is a weighted average of ȳ_j and µ (cf. Eq. (36)):

E[θ_j|y_j, µ, τ, σ] = (n_j ȳ_j/σ^2 + µ/τ^2) / (n_j/σ^2 + 1/τ^2).

As a result, the expected value of θ_j is pulled from the sample mean ȳ_j toward the global mean µ. This is called the shrinkage effect: the parameter estimates are "shrunk" toward the global mean. How strong this effect is depends partially on the sample size n_j.
Figure 8.7: Shrinkage as a function of sample size.
• Fig. 8.7 illustrates the amount of shrinkage for different groups. The left panel shows that the groups with large sample means are "pulled down" a bit, while the groups with low sample means are "pushed up". The right panel shows that groups with small sample sizes receive the largest amount of shrinkage |ȳ_j − θ̂_j|.
– for this reason we say that hierarchical modeling facilitates the "borrowing of strength": in particular, the groups with small sample sizes borrow information from the groups with large sample sizes. In theory, it has been shown that such borrowing of strength (also, sharing of information) results in more robust and efficient inference.
Back to the question of ranking
• We may rank all schools according to the posterior expectations

{E[θ_1|y_1, . . . , y_m], . . . , E[θ_m|y_1, . . . , y_m]}.

Alternatively, one may simply rank all schools according to the sample means ȳ_1, . . . , ȳ_m.
• Although these two rankings would be quite similar, there are differences.
• Let's consider two schools: school 46 and school 82; these two schools are in the bottom 10% of the 100 schools in the data set. The sample means are

ȳ_46 = 40.18 > ȳ_82 = 38.76.

However, in terms of posterior expectations, the ranking is reversed:

E[θ_46|y_1, . . . , y_m] = 41.31 < E[θ_82|y_1, . . . , y_m] = 42.53.

• We observe the effects of shrinkage: n_46 = 21, while n_82 = 5. School 82 receives a larger amount of shrinkage toward the global mean (E[µ|y_1, . . . , y_m] = 48.11) than school 46, resulting in a "reversal" in the ranking.
Figure 8.8: Data and posterior distributions for two schools
• Does this make sense?
– there is more uncertainty about school 82's average score due to its low sample size.
– suppose that on the day of the exam, the student who got the lowest exam score from school 82 doesn't show up; then the sample mean would have been 41.99, a change of more than three points from 38.76. In the case of school 46, the sample mean would have been 40.90, a change of only three quarters of a point. So, while we are more certain about the average score of school 46, we are less certain about that of school 82, which results in a larger amount of shrinkage toward the global mean.
– to some, this ranking may seem unfair. However, it reflects an objective fact: there is more evidence that θ_46 is exceptionally low than there is for θ_82.
– An example in sports: on any basketball team, there are "bench" players who get very little playing time, many of whom have taken only a few free throws in their entire career, resulting in a very high free throw shooting percentage, e.g., 100%. Yet, when given an opportunity for a free throw (during a technical foul), the coach will likely choose a veteran player, despite the veteran's lower shooting percentage, say 87%. This is because coaches recognize that the bench player's true free throw percentage is nowhere near the "sample mean" of 100%.
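The ranking reversal for schools 46 and 82 can be checked numerically by plugging the reported posterior means of µ, σ, τ into the conditional mean formula of Eq. (36). This ignores posterior uncertainty in (µ, σ^2, τ^2), so it only approximates the reported values (the function name is my own):

```python
def shrunk_mean(ybar, n, mu, sigma2, tau2):
    """Conditional posterior mean of theta_j (Eq. 36): a precision-weighted
    average of the group sample mean and the global mean."""
    return (n * ybar / sigma2 + mu / tau2) / (n / sigma2 + 1 / tau2)

mu, sigma2, tau2 = 48.11, 9.21 ** 2, 4.97 ** 2
t46 = shrunk_mean(40.18, 21, mu, sigma2, tau2)   # approx 41.3
t82 = shrunk_mean(38.76, 5, mu, sigma2, tau2)    # approx 42.6
# Despite ybar_46 > ybar_82, the shrunk estimates reverse the order:
assert t82 > t46
```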
8.5 Topic models
We will study hierarchical models for discrete data, such as texts, images, and biological data. The class of models that we consider is known as topic models7 or finite admixtures.8 The paper by Blei and co-authors was motivated by the information retrieval/machine learning of texts and images; it also develops variational inference for this class of models. The paper by Pritchard and co-authors was motivated by population genetics applications and makes use of Gibbs sampling for posterior inference. Both are extremely well known (and combine for more than 60,000 citations on Google Scholar).
8.5.1
Model formulation
First, some notation.
• A random variable W ∈ {1, ..., V } represents a word in a vocabulary, where V is the size of the
vocabulary.
• A document is a collection of words denoted by W = (W1 , ..., WN ).
Although we write W as if it is a sequence, the ordering of the words does not matter in the modeling
that we introduce here.
• A corpus is a collection of documents (W 1 , ..., W m ). For each document m, let Nm be the document
length.
A topic model is essentially a hierarchical model for discrete data that can be viewed as a hierarchical
mixture model (for discrete random variables). Each mixing component of the model will be referred to as
a topic. Thus a topic is a particular distribution over words, and a document can be described as a mixture
of topics.
7 D. Blei, A. Ng and M. I. Jordan. Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993–1022, 2003.
8 J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multi-locus genotype data. Genetics,
155:945–959, 2000.
An example document from the AP corpus (Blei, Ng, Jordan, 2003)
After feeding such documents to the Latent Dirichlet Allocation (LDA) model:
Another example document from Science corpus (1880–2002) (Blei & Lafferty, 2009)
Topic models – such as Latent Dirichlet allocation and its variants – are a popular tool for modeling
and mining patterns from texts in news articles, scientific papers and blogs, but also tweets, query logs,
digital books, metadata records, and more.
Figure 8.9: Graphical representation of the unigram (left) and mixture of unigrams (right).
Before we describe the latent Dirichlet allocation model, let us start with simpler precursors.
Unigram model For any document W , assume

W = (W1 , ..., WN ), where Wn | θ ∼ Cat(θ) i.i.d. for n = 1, . . . , N.

In other words, θ is the (same) word frequency vector that characterizes each document in the corpus. Thus,
the corpus generated this way is implicitly assumed to have only one topic. See Fig. 8.9.
Mixture of unigrams Each document W d is associated with a latent topic variable Zd . Suppose that there
are K topics, where K is given. Assume that
Zd | π ∼ Cat(π).
Now given Zd , we assume
W d |Zd = k ∼ Cat(βk ),
where parameter βk ∈ ∆V −1 is the word frequency vector associated with topic k.
This is nothing but a mixture of discrete distributions. The parameters of interest are {β1 , . . . , βK } and π.
In both models, the documents are assumed to be an i.i.d. sample from fairly simple distributions on the
vocabulary of words. Both models above were utilized in the early days of ”natural language processing”
(NLP), a field in artificial intelligence that focuses on the analysis of texts.
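To make the precursor models concrete, here is a small Python sketch (the function name and toy parameters are my own, not from the text) generating a corpus from the mixture-of-unigrams model: one topic draw per document, then i.i.d. words from that topic's distribution.

```python
import numpy as np

def sample_mixture_of_unigrams(pi, beta, doc_lengths, seed=0):
    """Generate a corpus from the mixture-of-unigrams model.

    pi   : (K,) corpus-level topic proportions.
    beta : (K, V) rows are per-topic word distributions.
    Each document d draws a single topic Z_d ~ Cat(pi), then all N_d
    words i.i.d. from Cat(beta[Z_d]).
    """
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    docs, topics = [], []
    for N in doc_lengths:
        z = int(rng.choice(K, p=pi))           # one topic per document
        docs.append(rng.choice(V, size=N, p=beta[z]))
        topics.append(z)
    return docs, topics
```

Setting K = 1 recovers the unigram model, where every document shares the same word frequency vector θ.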
Latent Dirichlet Allocation (LDA) LDA is an instance of hierarchical modeling. It was in fact motivated
by de Finetti’s theorem. Given the hierarchical view of the text corpus, we assume that the documents are
exchangeable. Moreover, within each document the words are assumed to be exchangeable.
The exchangeability assumption can be questioned, as we discussed in a previous subsection. However, this
is an important step up from the previous i.i.d. assumption. Moreover, exchangeability is not an unreasonable
assumption if we do not want to capture aspects of the data that violate exchangeability (such as the
ordering of words, or of documents).
From de Finetti’s theorem, we expect a hierarchical model specification for the words and then for
the documents. Originally, LDA was described as a generative process: to generate document W , one
proceeds as follows.
• Generate the document length N from a Poisson distribution: N ∼ Poisson(λ).
• For some parameters α1 , . . . , αK > 0, let θ represent the “topic proportions” for document W :
θ | α ∼ Dir(α1 , ..., αK ).
• Given N and θ associated with the document, for each word index n = 1, . . . , N ,

Zn | θ ∼ Cat(θ)   (i.i.d.),
Wn | Zn = k, β ∼ Cat(βk ).
In the above, we use βk to denote row k of the K × V matrix β. In particular, βk represents the
distribution over the vocabulary for topic k. This means Pr(Wn = j | Zn = k, β) = βkj .
A graphical representation of this model is given in Fig. 8.10.
Figure 8.10: Latent Dirichlet Allocation Model.
There is a simpler geometric reformulation of LDA. It goes like this.9
Each document W = (W1 , . . . , WN ) consists of words that are generated i.i.d. according to the probability
Pr(Wn = j | θ, β) = Σ_{k=1}^K Pr(Zn = k | θ) × Pr(Wn = j | Zn = k, β) = Σ_{k=1}^K θk βkj .

That is, the vector of word frequencies for document W is Σ_{k=1}^K θk βk ∈ ∆V −1 . This is a point that lies in
the convex hull G = conv(β1 , . . . , βK ).
Each extreme point β1 , . . . , βK corresponds to the word frequency vector of a topic (e.g., “education”,
“politics”, “sports”). Given the convex hull G, a document corresponds to a point randomly drawn from G.
The randomness is due to the random weight vector θ ∈ ∆K−1 , which is distributed according to a
Dirichlet distribution.
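A tiny numerical illustration of this geometric view (the topic matrix and Dirichlet parameters below are toy numbers of my own choosing): a document's word-frequency vector is a convex combination of the topic vectors, hence a point in their convex hull.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three topics over a vocabulary of four words: the rows of beta are the
# extreme points beta_1, beta_2, beta_3 of the convex hull G.
beta = np.array([[0.70, 0.20, 0.10, 0.00],
                 [0.10, 0.10, 0.30, 0.50],
                 [0.25, 0.25, 0.25, 0.25]])
alpha = np.array([1.0, 1.0, 1.0])

theta = rng.dirichlet(alpha)   # random weights on the simplex Delta^{K-1}
p_doc = theta @ beta           # sum_k theta_k beta_k: a point in conv(beta_1, ..., beta_K)
```

Since theta has non-negative entries summing to one, p_doc is again a probability vector, and each of its coordinates lies between the componentwise extremes of the topic vectors.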
9 J. Tang, Z. Meng, X. Nguyen, Q. Mei and M. Zhang. Understanding the limiting factors of topic modeling via posterior
contraction analysis. Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
8.5.2
Posterior inference
We have seen how LDA is composed of familiar building blocks: a Poisson for the document length,
multinomial/categorical distributions for the topic-specific distributions over words, a Dirichlet for the topic
proportions, as well as suitable prior distributions for the parameters of interest.
Posterior inference is computationally challenging, due to the presence of mixed data types (categorical
and continuous-valued). Moreover, the model is typically applied to large collections of documents, and later
on images, genomes and all sorts of large-scale data types. There are two computational tasks:
1. Compute the posterior distribution, P (θ, Z|W , α, β).
2. Estimate α, β from the data.
The posterior distribution can be rewritten as

P (θ, Z | W , α, β) = P (θ, Z, W | α, β) / P (W | α, β).   (37)

The numerator in the above display is easy to compute:

P (θ, Z, W | α, β) = P (θ | α) Π_{n=1}^N P (Zn | θ) P (Wn | Zn , β).   (38)
However, the denominator

p(W | α, β) = ∫_θ Σ_{Z1 ,...,ZN } P (θ, Z, W | α, β) dθ
= ∫ [ Γ( Σ_{i=1}^K αi ) / Π_{i=1}^K Γ(αi ) ] Π_{i=1}^K θi^{αi −1} Π_{n=1}^N Σ_{k=1}^K Π_{j=1}^V (θk βkj )^{I{Wn =j}} dθ   (39)

is much harder to compute because we must integrate out all the latent variables of mixed types.
Exercise. Derive a Gibbs sampling algorithm for the LDA model. For this purpose, we need to endow
the parameters α and β with prior distributions.
Although the Gibbs sampler is easy to derive, the Markov chains it produces may take a long time to
mix (due to the large number of latent variables to be sampled). An alternative is variational inference —
a general method for approximating posterior distributions based on optimization. We will introduce this
method in the context of LDA next. Note that the state of the art method for learning specifically the
LDA model and its extensions, both in terms of parameter estimation accuracy and computational efficiency,
appears to be a geometric algorithm of Yurochkin et al.10
10 M. Yurochkin, A. Guha, Y. Sun and X. Nguyen. Dirichlet simplex nest and geometric inference. Proceedings of the 36th
International Conference on Machine Learning (ICML), 2019.
8.5.3
Variational Bayes
Variational inference is a general computational technique for inference with complex models, in which
the problems of model fitting and probabilistic inference (tasks 1 and 2 above) are reformulated as
optimization problems.
When applied to the approximate computation of the posterior distribution, we call this ”variational
Bayes”. The strength of variational Bayes is that it is generally applicable to all (complex) Bayesian models,
and it is fast compared to sampling-based techniques such as MCMC. While fast, it may not be as accurate
as MCMC if the latter is run for a sufficiently long time.
We shall now illustrate the variational Bayes technique on topic models. The basic idea is as follows:
(1) Consider a family of simplified distributions Q = {q(θ, Z|W )}.
(2) Choose the member of Q that is closest to the true posterior:

q ∗ := argmin_{q∈Q} KL(q || p(θ, Z|W , α, β)).   (40)

(3) Use q ∗ as a surrogate for the true posterior p(θ, Z|W , α, β) for subsequent inferential purposes.
In the above display, KL denotes the Kullback–Leibler divergence: given two distributions with corresponding
probability density functions f and g on some common space, the KL divergence is given by

KL(f ||g) = Ef log(f (X)/g(X)) = ∫ f (x) log(f (x)/g(x)) dx.

Although the Kullback–Leibler divergence is not symmetric, it is always non-negative. Moreover, KL(f ||g) =
0 iff f (x) = g(x) for almost all x. The KL divergence is a fundamental quantity that measures how far g is from f .
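For discrete distributions the KL divergence is a finite sum, and its basic properties (non-negativity, asymmetry) are easy to check numerically. A small sketch, with made-up probability vectors:

```python
import numpy as np

def kl(f, g):
    """KL(f||g) for discrete distributions given as probability vectors.

    Uses the convention 0 * log 0 = 0; assumes g > 0 wherever f > 0.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    mask = f > 0
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

f = [0.5, 0.3, 0.2]
g = [0.1, 0.6, 0.3]
```

Here kl(f, g) and kl(g, f) are both positive but unequal, while kl(f, f) is exactly zero.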
It is somewhat surprising but not difficult to verify that the optimization problem given in Eq. (40)
becomes relatively tractable when the class of approximating distributions Q takes a sufficiently simple form.
The simplest choice for Q is the family of ”factorized” distributions: each q ∈ Q satisfies

q(θ, Z|W , γ, φ) = q(θ|γ) Π_{n=1}^N q(Zn |φn ).   (41)
Here, the parameters γ and φ = (φ1 , . . . , φN ) are called variational parameters, to be optimized
according to the KL objective so as to obtain as tight as possible an approximation to the true posterior:

(γ ∗ , φ∗ ) := argmin KL(q(θ, Z|γ, φ) || p(θ, Z|W , α, β)).   (42)

A few words about the roles of the variational parameters γ and φ: recall that θ ∈ ∆K−1 . Here, we shall
take q(θ|γ) to be Dirichlet with parameter γ ∈ R+^K .
Similarly, for each n = 1, . . . , N , q(Zn |φn ) is taken to be a categorical distribution, where the parameter
φn = (φn1 , . . . , φnK ), so that under q:

q(Zn = i|φn ) = φni ,  n = 1, ..., N,  i = 1, . . . , K.   (43)
Optimization algorithm for variational Bayes We will show that the optimization in Eq. (42) can be
solved by coordinate descent, via iteratively applying the following updating equations: for n = 1, . . . , N ,
i = 1, . . . , K,

γi = αi + Σ_{n=1}^N φni ,   (44)
φni ∝ βiWn exp{Eq [log θi |γ]}.   (45)
Thus, the algorithm is fairly simple to implement: initialize the variational parameters γ, φ in some
fashion, and then keep updating them via the above equations until convergence.
Some remarks
(1) In the updating equation for φni , since θ|γ ∼ Dirichlet(γ), it is a simple fact of the Dirichlet
distribution that

Eq [log θi |γ] = Ψ(γi ) − Ψ( Σ_{k=1}^K γk ),   (46)

where Ψ is the digamma function, Ψ(x) = d log Γ(x)/dx = Γ′(x)/Γ(x).
(2) Note the roles of data Wn in the two updating equations.
(3) The updating equations are reminiscent of the Gibbs sampler’s updates for semi-conjugate priors, except
that here the updates are deterministic (subject to initialization). The fact that we are optimizing rather than
sampling makes this approximate inference technique computationally more efficient than MCMC.
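The coordinate updates (44)-(45) for a single document can be sketched as follows, assuming the topic matrix β is known (the function names are my own; the digamma implementation uses a standard recurrence-plus-asymptotic-series approximation rather than a library call):

```python
import math
import numpy as np

def digamma(x):
    """Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def variational_updates(words, alpha, beta, n_iter=100):
    """Iterate eqs. (44)-(45) for one document; alpha is (K,), beta is (K, V)."""
    K, N = len(alpha), len(words)
    gamma = alpha + N / K                      # a common initialization
    phi = np.full((N, K), 1.0 / K)
    for _ in range(n_iter):
        elog = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())
        phi = beta[:, words].T * np.exp(elog)  # eq. (45), unnormalized
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)        # eq. (44)
    return gamma, phi
```

Note that eq. (44) forces Σi γi = Σi αi + N after every pass, which is a quick sanity check on any implementation.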
The remaining pages in this section are devoted to the derivation of the algorithm and can be skipped
at a first reading.
The first step is to note that the minimization of the KL divergence in Eq. (42) is equivalently viewed
as the maximization of a lower bound on the log likelihood function of the original LDA model. Indeed, by
Jensen’s inequality,

log p(W |α, β) = log ∫_θ Σ_Z p(θ, Z, W |α, β) dθ
= log ∫_θ Σ_Z q(θ, Z) [ p(θ, Z, W |α, β) / q(θ, Z) ] dθ
≥ ∫_θ Σ_Z q(θ, Z) log [ p(θ, Z, W |α, β) / q(θ, Z) ] dθ
= ∫_θ Σ_Z q(θ, Z) log p(θ, Z, W |α, β) dθ − ∫_θ Σ_Z q(θ, Z) log q(θ, Z) dθ
= Eq log p(θ, Z, W |α, β) − Eq log q(θ, Z)
=: L(γ, φ; α, β).
We immediately see that the difference between the two sides of the above inequality is

log p(W |α, β) − L(γ, φ; α, β) = Eq [ log q(θ, Z) − log ( p(θ, Z, W |α, β) / p(W |α, β) ) ]
= Eq [ log q(θ, Z) − log p(θ, Z|W , α, β) ]
= KL(q(θ, Z) || p(θ, Z|W , α, β)),

so minimizing the KL divergence in Eq. (42) is equivalent to

max_{γ,φ} L(γ, φ; α, β).
The second step is to note that the quantities in L(γ, φ; α, β) are relatively easy to compute and optimize,
due to the fact that the (full) joint probability distribution p(θ, Z, W |α, β) factorizes into marginal and
conditional distributions, while q also factorizes by our choice of approximation. Indeed,

log p(θ, Z, W |α, β) = log p(θ|α) + Σ_{n=1}^N { log p(Zn |θ) + log p(Wn |Zn , β) },

so taking the expectation with respect to the q distribution we obtain

L(γ, φ; α, β) = Eq log p(θ|α) + Σ_{n=1}^N { Eq log p(Zn |θ) + Eq log p(Wn |Zn , β) }
− Eq log q(θ|γ) − Σ_{n=1}^N Eq log q(Zn |φn ).   (47)

Now, we proceed to compute each of the quantities in the above display. First,

p(θ|α) = [ Γ( Σ_{i=1}^K αi ) / Π_{i=1}^K Γ(αi ) ] Π_{i=1}^K θi^{αi −1} , so

log p(θ|α) = Σ_{i=1}^K (αi − 1) log θi + log Γ( Σ_{i=1}^K αi ) − Σ_{i=1}^K log Γ(αi ),

Eq log p(θ|α) = Σ_{i=1}^K (αi − 1) [ Ψ(γi ) − Ψ( Σ_{k=1}^K γk ) ] + log Γ( Σ_{i=1}^K αi ) − Σ_{i=1}^K log Γ(αi ).

Next up, we consider Σ_{n=1}^N Eq log p(Zn |θ):

p(Zn |θ) = Π_{i=1}^K θi^{I(Zn =i)} , so

log p(Zn |θ) = Σ_{i=1}^K I(Zn = i) log θi ,

Eq log p(Zn |θ) = Σ_{i=1}^K φni [ Ψ(γi ) − Ψ( Σ_{k=1}^K γk ) ],

where the last equality is due to (46).
Continuing along,

log p(Wn |Zn , β) = log Π_{i=1}^K Π_{j=1}^V (βij )^{I(Wn =j, Zn =i)} , so

Eq log p(Wn |Zn , β) = Σ_{i=1}^K Σ_{j=1}^V I(Wn = j) φni log βij .

In addition, we take care of q(θ|γ) and q(Zn |φn ):

q(θ|γ) = [ Γ( Σ_{i=1}^K γi ) / Π_{i=1}^K Γ(γi ) ] Π_{i=1}^K θi^{γi −1} , so

Eq log q(θ|γ) = Σ_{i=1}^K (γi − 1) [ Ψ(γi ) − Ψ( Σ_{k=1}^K γk ) ] + log Γ( Σ_{i=1}^K γi ) − Σ_{i=1}^K log Γ(γi ),

as well as

q(Zn |φn ) = Π_{i=1}^K φni^{I(Zn =i)} , so

Eq log q(Zn |φn ) = Σ_{i=1}^K φni log φni .
The final step: with all components of the expression (47) for L(γ, φ; α, β) computed, it remains to
optimize L with respect to the unknown variational parameters γ and φ:

max_{γ,φ} L(γ, φ; α, β)   (48)
subject to Σ_{i=1}^K φni = 1,  n = 1, . . . , N.   (49)

Differentiating with respect to γ and setting to zero yields the updating equation (44) for γ. Differentiating
the Lagrangian with respect to φn (accounting for the equality constraints on φn ) and setting to zero yields
the updating equation (45) for φn . Iterate these updates until convergence to obtain the estimates (γ ∗ , φ∗ ).
Thus, we have accomplished the task of approximating the true posterior p(θ, Z|W , α, β) by means of
the surrogate q(θ, Z|γ ∗ , φ∗ ). The second task, estimating the parameters α, β, can also be done in a similar
fashion. See Blei et al. (2003) for details.
9
9.1
Linear regression
Linear regression model
The regression problem is concerned with the relationship between a response variable Y and a collection of
explanatory variables x = (x1 , . . . , xp ).
Figure 9.1: Change in maximal oxygen uptake as a function of age and exercise program.
Example 9.1. Twelve healthy men who did not exercise regularly were recruited to take part in a study
of the effects of two different exercise regimens on oxygen uptake. The maximal oxygen uptake (liters
per minute) of each subject was measured while running on an inclined treadmill, both before and after the
program. See Fig. 9.1.
A linear regression model assumes that E[Y |x] takes a linear form:

E[Y |x] = ∫ y p(y|x) dy = β1 x1 + . . . + βp xp = β > x.

In the above example, the explanatory variables (covariates) x may be taken to be

x1 = 1,
x2 = 0 if the subject is on the running program, 1 if on the aerobic program,
x3 = age of subject,
x4 = x2 × x3 .
We have not specified the distribution p(y|x) beyond its conditional expectation. The normal linear
regression model posits that, in addition to E[Y |x] being linear, the sampling variability around the mean is
i.i.d. normal:

ε1 , . . . , εn ∼ i.i.d. normal(0, σ 2 ),
Yi = β > xi + εi .

This gives the conditional likelihood, given the n-sample (noting that nothing is said about the marginal
distribution of the covariates x):

p(y1 , . . . , yn |x1 , . . . , xn , β, σ 2 ) = Π_{i=1}^n p(yi |xi , β, σ 2 ) = (2πσ 2 )−n/2 exp( − (1/2σ 2 ) Σ_{i=1}^n (yi − β > xi )2 ).
In customary matrix notation: y = (y1 , . . . , yn )> is an n × 1 column vector; X is the n × p design
matrix whose ith row is x>i . Then the above can be written as

y|X, β, σ 2 ∼ Nn (Xβ, σ 2 I),

where I is the n × n identity matrix.
The parameter vector β may be estimated by minimizing the sum of squared residuals, SSR(β):

SSR(β) = Σ_{i=1}^n (yi − β > xi )2 = (y − Xβ)> (y − Xβ) = y > y − 2β > X > y + β > X > Xβ.

To minimize the above expression, we take the derivative with respect to β and set it to zero:

−2X > y + 2X > Xβ = 0,

resulting in

β = (X > X)−1 X > y.

The value β̂ols = (X > X)−1 X > y is called the ”ordinary least squares” (OLS) estimate of β. This
value is unique as long as the p × p matrix X > X is of full rank (and thus invertible). This happens when
n ≥ p (and the columns of the design matrix X are linearly independent). The OLS estimate is a frequentist
estimate, but it also plays a role in Bayesian estimation.
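A quick numerical check of the OLS formula on simulated data (the dimensions and coefficients below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))                  # design matrix, n >= p
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X'X beta = X'y rather than inverting X'X.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

With n = 50 observations and noise standard deviation 0.1, beta_ols recovers beta_true closely; np.linalg.lstsq(X, y, rcond=None) gives the same answer more stably when X > X is ill-conditioned.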
9.2
Semi-conjugate priors
The (conditional) likelihood function takes the form

p(y|X, β, σ 2 ) ∝ exp( − (1/2σ 2 ) SSR(β) ) = exp( − (1/2σ 2 ) (y > y − 2β > X > y + β > X > Xβ) ).

It is simple to see that a normal distribution can be used as a semi-conjugate prior for β. Let β ∼
Np (β0 , Σ0 ) a priori; then

p(β|y, X, σ 2 ) ∝ p(β) × p(y|X, β, σ 2 )
∝ exp( − (1/2)(−2β > Σ0−1 β0 + β > Σ0−1 β) ) × exp( − (1/2)(−2β > X > y/σ 2 + β > X > Xβ/σ 2 ) )
= exp{ β > (Σ0−1 β0 + X > y/σ 2 ) − (1/2) β > (Σ0−1 + X > X/σ 2 )β }.
This is a multivariate normal density with

Var[β|y, X, σ 2 ] = (Σ0−1 + X > X/σ 2 )−1 ,   (50a)
E[β|y, X, σ 2 ] = (Σ0−1 + X > X/σ 2 )−1 (Σ0−1 β0 + X > y/σ 2 ).   (50b)

It is a simple exercise to see that the posterior expectation represents a combination of the prior
expectation and the purely data-driven OLS estimate.
It is also simple to see that the inverse-gamma distribution can be used as a semi-conjugate prior for σ 2 .
Let γ = 1/σ 2 ∼ gamma(ν0 /2, ν0 σ02 /2) a priori; then

p(γ|y, X, β) ∝ p(γ) p(y|X, β, γ)
∝ γ ν0 /2−1 exp(−γν0 σ02 /2) × γ n/2 exp(−γ SSR(β)/2)
∝ gamma( (ν0 + n)/2, (ν0 σ02 + SSR(β))/2 ).
A Gibbs sampler is simple to implement. Given the current values {β (s) , σ 2(s) }, each Gibbs update
for s = 1, 2, . . . consists of the following:
1. update β (s+1) ∼ Np (E[β|y, X, σ 2(s) ], Var[β|y, X, σ 2(s) ]);
2. update σ 2(s+1) ∼ inverse-gamma((ν0 + n)/2, (ν0 σ02 + SSR(β (s+1) ))/2).
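The two-step Gibbs update can be sketched as follows (a minimal implementation; the function name and defaults are my own):

```python
import numpy as np

def gibbs_regression(y, X, beta0, Sigma0, nu0, sigma02, n_iter=1000, seed=0):
    """Gibbs sampler for the semi-conjugate normal linear regression model.

    Alternates: beta | y, X, sigma^2 from the multivariate normal with the
    moments in (50a)-(50b), then sigma^2 | y, X, beta from the inverse-gamma
    full conditional.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    S0inv = np.linalg.inv(Sigma0)
    XtX, Xty = X.T @ X, X.T @ y
    sigma2 = sigma02
    betas, sig2s = np.empty((n_iter, p)), np.empty(n_iter)
    for s in range(n_iter):
        V = np.linalg.inv(S0inv + XtX / sigma2)          # eq. (50a)
        m = V @ (S0inv @ beta0 + Xty / sigma2)           # eq. (50b)
        beta = rng.multivariate_normal(m, V)
        ssr = float((y - X @ beta) @ (y - X @ beta))
        # inverse-gamma(a, b) draw as 1 / gamma(shape=a, scale=1/b)
        sigma2 = 1.0 / rng.gamma((nu0 + n) / 2, 2.0 / (nu0 * sigma02 + ssr))
        betas[s], sig2s[s] = beta, sigma2
    return betas, sig2s
```

On simulated data, the posterior means of the β draws concentrate near the data-generating coefficients, as expected from the combination of prior and OLS estimates in (50b).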
9.3
Objective priors
In regression analysis it may be difficult to come up with a suitable prior distribution on β and σ 2 .
Example 9.2. Continuing the oxygen uptake example. Suppose we know from prior knowledge (e.g.,
by consulting with experts on physiology) that males in their 20s have an oxygen uptake of around 150 liters
per minute with a standard deviation of 15. We then take 150 ± 2 × 15 = (120, 180) as the prior expected
range of the oxygen uptake distribution, and so the change in oxygen uptake lies within (−60, 60) with high probability.
Consider our subjects in the running group. This means the line β1 + β3 x should produce values between
−60 and 60 for all values of x between 20 and 30. A little algebra shows that we need a prior distribution on
β1 and β3 such that β1 ∈ (−300, 300) and β3 ∈ (−12, 12) with high probability. From here we can find
suitable prior hyper-parameters β0 , Σ0 . But this type of calculation becomes difficult when there are more
explanatory variables.
In such a scenario, i.e., when it is difficult to come up with an informative prior specification, one may
consider a prior specification that contains as little information as possible. This is the spirit of objective
Bayes.11 For linear regression, there are a number of objective priors that are commonly used in practice.
11 We encountered this notion for the first time when we were discussing improper priors in Section 5. The ideas behind the
derivation of the improper prior and the unit information prior are basically the same, but the latter has the advantage of being proper.
Unit information prior A unit information prior is one that contains the same amount of information as
would be contained in a single observation (Kass and Wasserman, 1995).
Recall β̂ols = (X > X)−1 X > y. Since y|X, β ∼ Nn (Xβ, σ 2 I), the variance (with β held fixed)
of β̂ols is σ 2 (X > X)−1 .
The precision of β̂ols is its inverse variance: (X > X)/σ 2 . Viewing this as the amount of information
contained in n observations, the amount of information in one observation should be 1/n as much. Thus,
we set

Σ0−1 = (X > X)/(nσ 2 ).

To complete the prior specification β ∼ N(β0 , Σ0 ), we set β0 = β̂ols .
In a similar way, the prior distribution of σ 2 is given by σ 2 ∼ inverse-gamma(ν0 /2, ν0 σ02 /2), where
ν0 = 1 and σ02 := σ̂ 2ols , which is obtained as an unbiased estimate of σ 2 :

σ̂ 2ols = SSR(β̂ols )/(n − p).
Some remarks
• the unit information prior is not purely Bayesian, since the prior is derived from the data. It provides
some protection against a misleading prior specification.
• however, it uses only a very small amount of the information gleaned from the data, due to the 1/n
scaling of the information. Thus, its influence on the posterior inference is expected to be weak.
g-prior The g-prior is another popular choice, proposed by Arnold Zellner. It is motivated by another
principle of objective Bayesian statistics: the relevant distributions of interest should remain invariant to
changes in the parameterization of the model.12
Example 9.3. Continuing the regression model for oxygen uptake. Suppose that someone were to analyze
the data using the explanatory variable x̃3 = age in months, instead of x3 = age in years. The role of this variable
in the model for the response Y is in the linear term β̃3 x̃3 , as opposed to β3 x3 . Since x̃3 = 12 × x3 ,
it makes sense that the posterior distribution for 12 × β̃3 in the model with x̃3 should be the same as the
posterior distribution for β3 based on the model with x3 .
For many modelers, due to the lack of domain knowledge, the same form of prior specification may be
given to β̃3 as would be the case for β3 . Thus, it is important to impart a prior such that the posterior
inference is robust against such rescaling of the explanatory variables.
Let us proceed to a formulation of the g-prior as it arises in the normal linear regression model.
• Suppose X is the given n × p design matrix. Under this design,

y|X, β, σ 2 ∼ Nn (Xβ, σ 2 I).

• Alternatively, due to a change of explanatory variables, X̃ = XH is a modified design matrix, for
some p × p matrix H. Under this design,

y|X̃, β̃, σ 2 ∼ Nn (X̃ β̃, σ 2 I) = Nn (XH β̃, σ 2 I).

• We need the same conditional prior on β and β̃ (conditionally given X or X̃) such that, under such a
prior specification, the posterior distributions of β and H β̃ are equal for all H:

[β|y, X, σ 2 ] =d [H β̃|y, X̃, σ 2 ].   (51)

12 Jeffreys’ prior is another example.
Suppose the prior is of the form β ∼ Np (β0 , Σ0 ). Recall from Eq. (50) that the posterior distribution of β
is multivariate normal with

Var[β|y, X, σ 2 ] = (Σ0−1 + X > X/σ 2 )−1 ,   (52a)
E[β|y, X, σ 2 ] = (Σ0−1 + X > X/σ 2 )−1 (Σ0−1 β0 + X > y/σ 2 ).   (52b)

It is easy to show that if we put β0 = 0 and Σ0 = gσ 2 (X > X)−1 , where g > 0 is an arbitrary constant,
the invariance property expressed in Eq. (51) is satisfied (Exercise: verify this).
• to be clear, the prior for β is β ∼ Np (0, gσ 2 (X > X)−1 ). The prior for β̃ would be of the form
β̃ ∼ Np (0, gσ 2 (X̃ > X̃)−1 ).
• in fact,

Var[β|y, X, σ 2 ] = (X > X/(gσ 2 ) + X > X/σ 2 )−1 = [g/(g + 1)] σ 2 (X > X)−1 =: V ;
E[β|y, X, σ 2 ] = (X > X/(gσ 2 ) + X > X/σ 2 )−1 (X > y/σ 2 ) = [g/(g + 1)] (X > X)−1 X > y = [g/(g + 1)] β̂ols =: m.

In short,

β|y, X, σ 2 ∼ Np (m, V ).   (53)
• For σ 2 , suppose that an inverse-gamma prior is given: σ 2 ∼ inverse-gamma(ν0 /2, ν0 σ02 /2). It is a
very nice feature of the g-prior that the induced posterior distribution of σ 2 is again an inverse-gamma
distribution (Exercise: verify this):

[σ 2 |y, X] ∼ inverse-gamma((ν0 + n)/2, (ν0 σ02 + SSRg )/2),   (54)

where

SSRg := y > ( I − [g/(g + 1)] X(X > X)−1 X > ) y.   (55)

When g → ∞, this term tends to the SSR corresponding to the OLS estimate β̂ols .
• We observe a form of shrinkage for both parameters β and σ 2 .
• MCMC is not needed, as we can obtain Monte Carlo samples of (σ 2 , β) directly from the above computation.
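Direct Monte Carlo under the g-prior can be sketched like this (a helper of my own devising: draw σ 2 from (54), then β | σ 2 from (53)):

```python
import numpy as np

def gprior_samples(y, X, g, nu0, sigma02, n_samples=1000, seed=0):
    """Exact posterior samples of (sigma^2, beta) under the g-prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                       # hat matrix
    w = g / (g + 1.0)
    SSRg = float(y @ (np.eye(n) - w * H) @ y)   # eq. (55)
    m = w * XtX_inv @ (X.T @ y)                 # posterior mean of beta
    b = (nu0 * sigma02 + SSRg) / 2.0
    sig2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 1.0 / b, size=n_samples)  # eq. (54)
    betas = np.array([rng.multivariate_normal(m, w * s2 * XtX_inv) for s2 in sig2])
    return betas, sig2
```

No Markov chain is involved: each (σ 2 , β) pair is an independent draw from the exact posterior, which is why the shrinkage factor g/(g + 1) toward zero appears directly in the sample means.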
Example 9.4. Back to our example of regression analysis of the oxygen uptake data.
Set the g-prior with g = n = 12, ν0 = 1, σ02 = σ̂ 2ols = 8.54. The posterior mean for β does not depend
on σ 2 and can be computed directly. The posterior standard deviations of these parameters can be obtained as well.
Some observations:
• the posterior distributions seem to suggest only weak evidence of a difference between the two groups,
as the 95% quantile-based posterior intervals for β2 and β4 both contain zero.
• however, there seems to be relatively strong evidence of an effect of age. According to our model,
the average difference in y between two people of the same age x but in different training programs
is β2 + β4 x. Box plots of the posterior distribution of this quantity are given for each x. They suggest
strong evidence of a difference at young ages, but less so at older ones.
Figure 9.2: Posterior distributions of β2 and β4 , with the marginal prior distributions in gray.
Figure 9.3: Ninety-five percent confidence intervals for the difference in expected change scores between
aerobic subjects and running subjects.
For more details, see Hoff (2009).
9.4
Model selection
In regression problems we may encounter a large number of possible explanatory variables (regressors)
x1 , . . . , xp , many of which may be irrelevant to the response variable y. Although we may fit a regression
model with all such potential regressors, doing so will likely produce a poor result in terms of
both prediction and parameter estimation, due to overfitting. Thus, selecting only the most relevant subset
of the variables xi for predictive and interpretative purposes is an extremely important task. The broad term
for this task is ”model selection”.
Example 9.5. (Diabetes data) There are ten variables x1 , . . . , x10 on a group of n = 442 diabetes patients,
and a variable y representing the disease progression taken one year after the baseline measurements xi .
It is suspected that the relationship between the xi and y may be nonlinear, so a common practice is to
utilize a linear regression model using the regressors x1 , . . . , x10 (a.k.a. main effects), as well as nonlinear
terms that represent the interactions between the main effects, namely xj xk , and the quadratic terms xj^2 , for
j, k = 1, . . . , 10. One of the regressors, x2 = sex, is binary, so x2^2 is unnecessary.
This gives a total of p = 10 + (10 choose 2) + 9 = 10 + 45 + 9 = 64 potential regressors among

{xj , xj^2 , xj xk }.
Naive OLS approach Randomly split the 442 diabetes subjects into 342 training samples and 100 test
samples, resulting in a training data set (y, X) and a test set (ytest , Xtest ).
Apply the OLS approach to the training data with all 64 regressors to obtain β̂ols (cf. Section 9.1), and
then generate the predicted responses ŷtest = Xtest β̂ols .
The average squared predictive error is (1/100) ‖ytest − ŷtest ‖2 = 0.67. This is not good, since if we
simply set the predicted responses to zero, our predictive error would already be (1/100) ‖ytest ‖2 = 0.97.
Figure 9.4: Left and middle panels: Predicted values and regression coefficients for the diabetes data via
OLS. Right panel: Results based on a backwards elimination procedure.
The second panel shows that most of the estimated regression coefficients are quite small — this suggests
that we should remove them. A simple way to do so is a greedy procedure known as backwards elimination.
Backwards elimination procedure This is a sequential procedure for assessing the relevance of the
regression coefficients based on the current model’s fit, eliminating one variable at a time.
A standard way of assessing the evidence that the true value of a coefficient βj is non-zero is via a t-statistic,
which is obtained by dividing the OLS estimate β̂j by its standard error. Since β̂ = (X > X)−1 X > y and
y|X, β ∼ Nn (Xβ, σ 2 I), we put

tj = β̂j / ( σ̂ 2 [(X > X)−1 ]jj )^{1/2} .

(Note: σ̂ 2 is the corresponding OLS estimate of the residual variance σ 2 . Also, the response vector y and
all columns of X have been centered to have mean zero.)
Now, if |tj | is below a certain cutoff threshold, |tj | < tcutoff , then the evidence for βj 6= 0 is weak, and
variable xj is removed from the model.
A version of the overall backwards elimination procedure is as follows.
1. Obtain the OLS estimate β̂ and its t-statistics.
2. If there are any regressors j such that |tj | ≤ tcutoff ,
a) find the regressor j with the smallest value of |tj | and remove column j from X;
b) return to step 1.
3. If |tj | > tcutoff for all j, then stop.
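The procedure above can be sketched in Python as follows (a hypothetical helper; it assumes y and the columns of X are centered, and returns the indices of the surviving columns):

```python
import numpy as np

def backwards_elimination(y, X, t_cutoff=1.65):
    """Greedy backwards elimination on OLS t-statistics.

    At each pass, fit OLS on the active columns, compute
    t_j = beta_j / se(beta_j), and drop the column with the smallest |t_j|
    if it falls below the cutoff; stop when all |t_j| exceed the cutoff.
    """
    active = list(range(X.shape[1]))
    while active:
        Xa = X[:, active]
        n, p = Xa.shape
        XtX_inv = np.linalg.inv(Xa.T @ Xa)
        beta = XtX_inv @ Xa.T @ y
        resid = y - Xa @ beta
        s2 = float(resid @ resid) / (n - p)        # residual variance estimate
        t_abs = np.abs(beta) / np.sqrt(s2 * np.diag(XtX_inv))
        j = int(np.argmin(t_abs))
        if t_abs[j] > t_cutoff:
            break                                  # every regressor survives
        active.pop(j)
    return active
```

On data with a few strong true coefficients, the corresponding columns reliably survive; as Example 9.7 below warns, however, noise columns can survive too.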
Example 9.6. Apply this procedure to the diabetes data, using tcutoff = 1.65 (corresponding roughly to a
p-value of 2 × 0.05 = 0.10 according to a t distribution with a very large number of degrees of freedom, or
the standard normal distribution). We find that 44 of the 64 variables are eliminated, leaving 20 variables
in the regression model. The third plot of Fig. 9.4 shows ŷtest according to the reduced-model regression
coefficients. The prediction error for the model is 0.53, which is an improvement over the standard OLS
error of 0.67.
The backwards elimination procedure described above is a fast heuristic, but it may pick up many spurious
associations between the selected xj and y.
Example 9.7. Consider the following experiment: we create a new data vector ỹ by randomly permuting
the values of y. Thus, the value of xi has no effect on ỹi , and there is no true association between ỹ
and the columns of X. The left panel of Fig. 9.5 shows the t-statistics for one random permutation ỹ of y.
Initially, only one regressor has a t-statistic greater than 1.65, but as we sequentially remove columns of
X, the estimated variances of the remaining regressors decrease and their t-statistics increase in value. With
tcutoff = 1.65, the procedure arrives at a regression model with 18 regressors. See the illustration in the right
panel. All such regressors are spurious, of course.
Figure 9.5: t-statistics for the regression of ỹ on X, before and after backwards elimination.
9.4.1
Bayesian model comparison
The Bayesian approach is conceptually straightforward: we do not know which variables are spurious or
not; such information will be represented by random variables (parameters) which are then endowed with
prior distributions. The model selection problem is then essentially no different from the inference of
unknown parameters.
Let zj = 0 if the explanatory variable xj is spurious and zj = 1 otherwise (that is, if xj is active). We
may express the regression coefficients as zj βj , so the regression equation becomes

y = z1 β1 x1 + . . . + zp βp xp + ε.

As before, the conditional distribution of the response is given by Y |z, β, σ 2 ∼ Normal( Σ_{j=1}^p zj βj xj , σ 2 ).
We need a prior specification for {z, β, σ 2 }.
The prior distribution over z can be viewed as a prior over the space of models, while the conditional
prior distribution of β, σ 2 given a model represented by z can be specified as in the previous subsections,
e.g., via semi-conjugate priors or objective priors, etc.
Then, by Bayes’ rule, we can compute a posterior probability for each regression model:

p(z|y, X) = p(z) p(y|X, z) / Σ_z̃ p(z̃) p(y|X, z̃).   (56)
The posterior computation may be challenging: the normalizing constant involves a summation
over the space of potential models. Moreover, the computation of the marginal likelihood term p(y|X, z)
may be far from straightforward, due to the need to integrate over the remaining parameters β and
σ 2 . The specific modeling choices play a crucial role in mitigating such computational challenges.
Model comparison via the posterior odds is computationally simpler, because the difficult
normalizing constants cancel out:

odds(za , zb |y, X) = p(za |y, X) / p(zb |y, X) = [ p(za ) / p(zb ) ] × [ p(y|X, za ) / p(y|X, zb ) ],
posterior odds = prior odds × Bayes factor.
Computing the marginal likelihood We have

p(y | X, z) = ∫∫ p(y, β, σ² | X, z) dβ dσ²
            = ∫∫ p(y | β, X, σ²) p(β | X, z, σ²) p(σ²) dβ dσ².

Some notation: for a given z with p_z non-zero entries, let X_z be the n × p_z design matrix corresponding to the active explanatory variables x_j, and β_z the p_z × 1 vector consisting of the entries of β for the active variables.
Let's consider a (conditional) g-prior for β given z:

β_z | X, z, σ² ∼ N_{p_z}(0, g σ² [X_z^⊤ X_z]^{−1}).
In addition, give γ := 1/σ² a gamma prior: gamma(ν₀/2, ν₀σ₀²/2). Then we have

p(y | X, z) = ∫ p(y | X, z, σ²) p(σ²) dσ²
            = ∫ p(y | X, z, γ) p(γ) dγ
            = ∫ (2π)^{−n/2} (1+g)^{−p_z/2} γ^{n/2} e^{−γ SSR_g^z / 2} × [(ν₀σ₀²/2)^{ν₀/2} / Γ(ν₀/2)] γ^{ν₀/2 − 1} e^{−γ ν₀ σ₀² / 2} dγ,
where SSR_g^z is the same as in Eq. (55), with X replaced by X_z (exercise: verify this!):

SSR_g^z = y^⊤ (I − [g/(g+1)] X_z (X_z^⊤ X_z)^{−1} X_z^⊤) y.
Now, using the normalizing-constant identity for the gamma density leads to

p(y | X, z) = π^{−n/2} [Γ((ν₀ + n)/2) / Γ(ν₀/2)] (1 + g)^{−p_z/2} (ν₀σ₀²)^{ν₀/2} / (ν₀σ₀² + SSR_g^z)^{(ν₀+n)/2}.
With the marginal likelihood calculation completed, we can proceed to model comparison by computing the posterior odds defined earlier. Suppose that we set g = n, ν₀ = 1 for all z, while σ₀² is the estimated residual variance under the least squares fit for a given model z. That is, given z, ν₀σ₀² := s_z².
To compare the two models represented by z_a and z_b, the Bayes factor is given by

p(y | X, z_a) / p(y | X, z_b) = (1 + n)^{(p_{z_b} − p_{z_a})/2} × (s²_{z_a} / s²_{z_b})^{1/2} × [(s²_{z_b} + SSR_g^{z_b}) / (s²_{z_a} + SSR_g^{z_a})]^{(n+1)/2}.        (57)
The ratio of marginal probabilities associated with the two models reflects the balance between model complexity and goodness of fit. In particular, the ratio improves for z_a (i.e., increases) if
• SSR_g^{z_a} becomes small relative to SSR_g^{z_b}, i.e., the goodness of fit improves for z_a. This typically happens when the model becomes more complex, i.e., p_{z_a} increases relative to p_{z_b};
• on the other hand, the term (1 + n)^{(p_{z_b} − p_{z_a})/2} penalizes large p_{z_a}.
It is important to note that this balancing act in the marginal likelihood (and in ratios thereof) is a very general characteristic: by virtue of integrating over the unknown parameters, the marginal likelihood captures the tension between model complexity and goodness of fit in its expression.
Example 9.8. Consider the oxygen uptake example. Recall our regression model
E[Y |β, x] = β1 x1 + β2 x2 + β3 x3 + β4 x4
= β1 + β2 × group + β3 × age + β4 × group × age.
The model selection question is whether or not β₂ and β₄ are non-zero (i.e., are there effects of grouping according to training programs on oxygen uptake change?). Recall from our earlier analyses that the answer was somewhat ambiguous: the 95% posterior intervals of both β₂ and β₄ contain zero. However, according to the joint posterior distribution, the two parameters are negatively correlated, so whether or not β₂ = 0 affects our inference about β₄.
We consider 5 candidate models, giving them equal prior probabilities 1/5. The remaining prior specification is as described above. Then we may obtain the relevant marginal likelihoods and posterior odds as follows:
According to the posterior computation, the best model is (1, 1, 1, 0). There is strong evidence for an age effect, as the total posterior probability of the three models that include age is essentially 1. The group effect is relatively weaker: the total posterior probability of the three models that include group information is 0.00 + 0.63 + 0.19 = 0.82. This is still substantially higher than the prior probability of 0.60 for the three models combined.
9.4.2 Model averaging via MCMC
Given p explanatory variables, each of which may be either active or not, there are 2^p candidate models to consider. If p is large, it is infeasible to compute the marginal likelihood for each model.
The posterior distribution of interest is then Pr(z, β, σ² | y, X). We can derive a Markov chain that enables us to approximate this distribution. However, since z is high-dimensional, approximating the full joint posterior distribution of z may be impractical.
Instead, we want to do the following:
1. find the high-probability region for any variable z_j of interest;
2. find a good estimate for the parameters β and σ² (presumably residing near a low-dimensional subspace) by integrating over z ∈ {0, 1}^p.
Deriving a Gibbs sampler for the posterior distribution of this model is simple. The full conditional distribution for each z_j is

Pr(z_j = 1 | y, X, z_{−j}) = o_j / (1 + o_j)        (58)
where the odds o_j is given by

o_j = Pr(z_j = 1 | y, X, z_{−j}) / Pr(z_j = 0 | y, X, z_{−j})
    = [Pr(z_j = 1) / Pr(z_j = 0)] × [p(y | X, z_{−j}, z_j = 1) / p(y | X, z_{−j}, z_j = 0)] := A × B.
Note that B was already computed via Eq. (57) for a g-prior specification. If we put an (independent) uniform prior probability on each variable x_j, so that Pr(z_j = 1) = Pr(z_j = 0) = 1/2, then A = 1. The full conditionals for β and σ² were given in Section 9.3 for the g-prior.
To summarize the Gibbs sampling procedure using the g-prior for β and σ 2 , and the uniform prior for
z: Given the sample (z (s) , β (s) , σ 2(s) ), the sample at step s + 1 is generated as follows
1. Set z = z (s) ;
2. For j ∈ {1, . . . , p} in random order, replace zj with a sample from p(zj |z −j , y, X) given by Eq. (58);
3. Set z (s+1) = z;
4. Sample σ 2(s+1) ∼ p(σ 2 |z (s+1) , y, X) given by Eq. (54);
5. Sample β (s+1) ∼ p(β|z (s+1) , σ 2(s+1) , y, X) given by Eq. (53);
Here is the R code for the Gibbs sampling procedure above (only the portion for sampling z is included):
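The original R listing is not reproduced here; as a substitute, the z-update of Eq. (58) can be sketched in Python as follows. The helper `log_marg` re-implements the g-prior marginal likelihood with ν₀ = 1 and ν₀σ₀² set to the OLS residual variance of the active model; all names are illustrative, and the handling of the empty model is an ad hoc assumption of this sketch:

```python
import numpy as np
from math import lgamma, log, pi

def log_marg(y, X, z, g):
    """log p(y | X, z) for the g-prior, nu0 = 1, nu0*s0^2 = s_z^2
    (s_z^2 = var(y) for the empty model is an ad hoc convention)."""
    n = len(y)
    active = np.flatnonzero(z)
    if active.size == 0:
        pz, SSRg, s2 = 0, float(y @ y), float(np.var(y))
    else:
        Xz = X[:, active]
        Hy = Xz @ np.linalg.solve(Xz.T @ Xz, Xz.T @ y)
        pz = active.size
        SSRg = float(y @ y) - (g / (g + 1.0)) * float(y @ Hy)
        s2 = float((y - Hy) @ (y - Hy)) / max(n - pz, 1)
    return (-0.5 * n * log(pi) + lgamma(0.5 * (1 + n)) - lgamma(0.5)
            - 0.5 * pz * log(1.0 + g) + 0.5 * log(s2)
            - 0.5 * (1 + n) * log(s2 + SSRg))

def gibbs_z(y, X, z, g, rng):
    """One sweep of Eq. (58) over the z_j in random order; uniform prior, so A = 1."""
    z = z.copy()
    for j in rng.permutation(X.shape[1]):
        z1, z0 = z.copy(), z.copy()
        z1[j], z0[j] = 1, 0
        log_oj = log_marg(y, X, z1, g) - log_marg(y, X, z0, g)  # log B
        p1 = 1.0 / (1.0 + np.exp(-np.clip(log_oj, -700, 700)))  # o_j / (1 + o_j)
        z[j] = int(rng.random() < p1)
    return z
```

The clipping of `log_oj` is purely for floating-point safety; it leaves the acceptance probabilities essentially unchanged.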
Example 9.9. We return to the diabetes data example.
• Recall that we have p = 64 potential regressors, resulting in 2^64 ≈ 10^19 models in total.
• It is impossible to explore this space exhaustively: if we generate 10,000 Gibbs samples, these samples visit at most a fraction 1/10^15 of all models.
• Our intuition is that if there are only a small number of relevant regressors, they will be present in many of the most likely models among the 2^64 candidates. Averaging over the most likely candidates will still give us a good estimate of the marginal posterior probability of each of the z_j's, as well as of the corresponding β. (Recent theoretical developments in Bayesian asymptotics confirm this intuition.)
• The estimate for β is given by β̂_bma = ∑_{s=1}^S β^(s) / S, where S is the MCMC sample size.
– This is called the Bayesian model averaged estimate of β, because it does not correspond to
any particular value of z, but an average of regression parameters from different values of z.
By averaging the regression coefficients from multiple high-probability models, the resulting
estimate often performs better than a point estimate that corresponds to only a single model.
– The test error for the model averaging technique is 0.452, which is better than both OLS and
backwards elimination.
• More on Bayesian robustness: recall that the backwards elimination procedure also produced 18
spurious associations in a randomization experiment (cf. Example 9.7). Using the Bayesian model
averaging technique, it was found that the (approximated) posterior probabilities Pr(zj = 1|y, X)
are less than 1/2 for all j = 1, . . . , 64, and all but two of which are less than 1/4. The model averaging
technique did not erroneously identify any regressors as having an effect on the distribution of ỹ.
10 Metropolis-Hastings algorithms
The Gibbs sampler constructs a Markov chain whose transition probability kernel is defined as a composition of multiple Gibbs updates. A Gibbs update changes one variable at a time, which can be inefficient. Moreover, a Gibbs update often requires some sort of (semi-)conjugacy in the model, so that the full conditional distributions can be computed in closed form.
In this section we shall study a more general MCMC sampling method known as the Metropolis-Hastings algorithm. Most MCMC algorithms used in practice, including Gibbs sampling, are special cases of the Metropolis-Hastings algorithm, which is versatile and powerful. M-H is especially useful in non-conjugate situations, and when there is a need and a possibility of updating multiple variables simultaneously.
10.1 Metropolis-Hastings update
Let π be the (stationary) distribution of interest. Suppose that π is known only up to an unknown constant. That is, π is specified by an unnormalized density function h(x), with respect to the counting measure on a discrete space S or the Lebesgue measure µ(dx) on a Euclidean space S. Write

π(x) = h(x)/c,

where the normalizing constant c = ∫ h(x) µ(dx) < ∞ is unknown. In Bayesian computation, h(x) is often the product of the prior density and the likelihood function.
Proposal distribution The M-H update uses an auxiliary transition probability specified by a conditional density function q(x, y), called the "proposal distribution" or "candidate generating distribution". For every point x ∈ S, q(x, ·) is a probability density (with respect to µ) with two properties:
• for each x, we can sample a random variable y having the density q(x, ·);
• we can evaluate q(x, y) for each x, y ∈ S.
Roughly speaking, q(x, y) represents the conditional probability of "proposing" an updated value y, given that we are presently at x. We can choose any density we know how to sample from and evaluate. For instance, if S = R^d, a random-walk proposal corresponds to q(x, y) = N_d(y | x, σ²I), the density, evaluated at y ∈ R^d, of a d-variate normal with mean x ∈ R^d and variance σ²I.
The Metropolis-Hastings algorithm then works by constructing the Markov chain {Xt }t≥1 as follows.
Start X0 = x where x is in the support of h, i.e., h(x) > 0. Given the current position Xt = x ∈ S, the
update changes x to its value at the next iteration.
1. Draw a sample y ∼ q(x, ·).
2. Calculate the Hastings ratio:

R = h(y) q(y, x) / [h(x) q(x, y)].        (59)
3. Accept the proposal by setting Xt+1 = y with probability min(1, R). Otherwise, keep the position
unchanged by setting Xt+1 = x.
Example 10.1. (Metropolis update) Suppose we use a proposal density q(x, y) that is symmetric: q(x, y) = q(y, x); for instance, the "normal random walk" q(x, y) = N_d(y | x, σ²I). Then the Hastings ratio takes the form

R = h(y)/h(x).

There is no need to evaluate q(x, y).
The Metropolis algorithm is very popular because it is easy to implement. It is also very intuitive: as long as one takes a symmetric proposal, we always accept the proposed move from x to y if it represents an increase in the density of the stationary distribution, i.e., π(y) ≥ π(x). If the move represents a decrease, then the larger the decrease, the less likely we are to accept the move.
Let us write down the transition probability kernel P(x, A) for the general Metropolis-Hastings update, for any x ∈ S, A ⊂ S. The kernel has two terms, related to accepted proposals and rejected ones. For accepted proposals, we propose y and then accept it, which happens with density

p(x, y) = q(x, y) a(x, y),
where a(x, y) = min(R, 1). Thus

∫_A p(x, y) µ(dy)

represents the part of P(x, A) that results from accepted proposals. Moreover, ∫_S p(x, y) µ(dy) gives the total probability that some proposed move is accepted (including the possibility that y = x), while

r(x) := 1 − ∫_S p(x, y) µ(dy)

is the probability that a proposed move is rejected. If the proposed move is rejected, we stay put at x.
Thus, the probability of moving from x to a measurable subset A ⊂ S is

P(x, A) = ∫_A p(x, y) µ(dy) + [1 − ∫_S p(x, y) µ(dy)] I(x, A).        (60)

In the above, I(x, A) denotes the identity kernel that represents "staying put": I(x, A) = 1 if x ∈ A and 0 otherwise.
10.1.1 Detailed balance and reversibility
Definition 10.1. A Markov chain {X_t}_{t≥0} with stationary distribution π is said to be reversible if, when X_t has the distribution π, X_t and X_{t+1} are exchangeable random variables.
Recall that π is called a stationary distribution of the Markov chain if the following holds: when X_t has distribution π, then so does X_{t+1}. Thus, exchangeability is a stronger condition, as we learned earlier in Section 8.3: it requires that the ordered pair (X_t, X_{t+1}) has the same joint distribution as the ordered pair (X_{t+1}, X_t). (Exercise: verify that a basic Gibbs update is reversible.)
Although reversibility is not a requirement, many Markov chain constructions have this property. Reversibility has some theoretical benefits for the analysis of a Markov chain; for us it is enough to note that it is useful because one automatically obtains the guarantee that a Markov chain construction admits π as its stationary distribution by checking the stronger condition of reversibility, which tends to be easy to do in practice.
Recall p(x, y) = q(x, y) a(x, y). The key to verifying reversibility is to check that the Markov chain satisfies detailed balance, that is:

h(x) p(x, y) = h(y) p(y, x),    for all x, y ∈ S.        (61)

Note that this is also equivalent to π(x) p(x, y) = π(y) p(y, x).
Suppose that detailed balance holds. Then for any A, B ⊂ S, we have

Pr(X_t ∈ A, X_{t+1} ∈ B)
= ∫∫ 1_A(x) 1_B(y) π(x) P(x, dy) µ(dx)
= ∫∫ 1_A(x) 1_B(y) π(x) [p(x, y) + r(x) 1(y = x)] µ(dy) µ(dx)
= ∫∫ 1_A(x) 1_B(y) π(x) p(x, y) µ(dy) µ(dx) + ∫∫ 1_A(x) 1_B(y) 1(y = x) r(x) π(x) µ(dy) µ(dx)
= ∫∫ 1_A(x) 1_B(y) π(y) p(y, x) µ(dy) µ(dx) + ∫∫ 1_A(x) 1_B(y) 1(x = y) r(y) π(y) µ(dy) µ(dx)    [by (61)]
= ∫∫ 1_A(x) 1_B(y) π(y) [p(y, x) + r(y) 1(x = y)] µ(dy) µ(dx)
= Pr(X_t ∈ B, X_{t+1} ∈ A),

which confirms reversibility.
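Detailed balance for an M-H kernel can also be checked numerically on a toy discrete state space; in the Python sketch below, the target h and proposal q are arbitrary illustrative choices:

```python
import numpy as np

# A 3-state toy check of detailed balance for the M-H kernel.
h = np.array([1.0, 2.0, 5.0])             # unnormalized target (illustrative)
q = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])           # arbitrary proposal kernel (rows sum to 1)

p = np.zeros((3, 3))                      # p(x, y) = q(x, y) a(x, y) for x != y
for x in range(3):
    for y in range(3):
        if x != y:
            R = (h[y] * q[y, x]) / (h[x] * q[x, y])
            p[x, y] = q[x, y] * min(1.0, R)

# Detailed balance (61): h(x) p(x, y) = h(y) p(y, x) for all x, y
flow = h[:, None] * p
assert np.allclose(flow, flow.T)

# Consequently pi = h / sum(h) is stationary for the full M-H kernel P,
# obtained by placing the rejection mass r(x) on the diagonal.
P = p + np.diag(1.0 - p.sum(axis=1))
pi = h / h.sum()
assert np.allclose(pi @ P, pi)
```

The second assertion illustrates the point of the section: checking the (stronger, but easier) detailed balance condition automatically certifies that π is stationary.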
Reversibility of the Metropolis-Hastings update Now we can verify that the M-H update is reversible by checking the detailed balance condition. This is immediate:

h(x) p(x, y) = h(x) q(x, y) a(x, y)
             = h(x) q(x, y) min(1, h(y) q(y, x) / [h(x) q(x, y)])
             = min(h(x) q(x, y), h(y) q(y, x)).

The last expression is symmetric in x and y, so it is also equal to h(y) p(y, x). We are done with the verification.
Metropolis-Hastings update for a subset of variables Although the above description is for the full set of variables x (e.g., x = (x₁, ..., x_d) ∈ R^d), Metropolis-Hastings can be (and quite typically is) applied to a subset of variables, just as the Gibbs sampler can. Suppose that the variables x₁, ..., x_j are to be updated for some j < d; then the proposal density q((x₁, ..., x_j), (y₁, ..., y_j)) should be taken as a density with respect to the base measure on the subspace R^j spanned by the j variables being updated. The procedure is then applied as described.
Gibbs as a special case of Metropolis-Hastings The Gibbs sampler updates a variable x_i from its full conditional distribution given all remaining variables x_{−i}. We will show that a Gibbs update for variable x_i is nothing but a Metropolis-Hastings update with the proposal distribution π(x_i | x_{−i}).
Indeed, for variable x_i, take the proposal density to be

q(x, y) ∝ h(x₁, ..., x_{i−1}, y_i, x_{i+1}, ..., x_d),

where y_j = x_j for j ≠ i, and h is the unnormalized density function for the target stationary distribution π. Note that q(x, y) so defined is exactly the full conditional distribution π(x_i = y_i | x_{−i}). Then, the Hastings ratio is

R = h(y) q(y, x) / [h(x) q(x, y)]
  = [h(y) h(y₁, ..., y_{i−1}, x_i, y_{i+1}, ..., y_d)] / [h(x) h(x₁, ..., x_{i−1}, y_i, x_{i+1}, ..., x_d)]
  = [h(y) h(x)] / [h(x) h(y)] = 1.

It follows that the acceptance probability is min{R, 1} = 1. Thus, by adopting the full conditional distribution as the proposal distribution, the Metropolis-Hastings proposal is always accepted. This is exactly the Gibbs update!
Remark 10.1.
• The Metropolis-Hastings framework is so general and powerful that its introduction dramatically opened up the landscape of possibilities for MCMC-based inference: one can in principle adopt any reasonable distribution as a proposal and still obtain a valid Markov chain for the target stationary distribution of interest. Ideally, we would like a proposal that allows one to explore the distribution efficiently, by spending proportionally more time in the high-density regions.
• Metropolis and Gibbs samplers can be viewed as two extremes in this landscape of proposals. Metropolis is realized by applying an arbitrary symmetric proposal distribution, which allows the Markov chain to explore virtually any location in the state space. The price to pay is that the acceptance rate may be very small if the proposal is too "reckless", as it may have nothing to do with the actual concentration of mass of the target distribution. When this is the case, one ends up rejecting the proposals most of the time, which amounts to a frustrating hit-and-miss sampling experience.
• Gibbs sampling, on the other hand, is too cautious in its proposal, which is automatically determined
by the induced full conditional distributions. Although all its moves are accepted, the movements
through the space of support can be hopelessly slow: due to the requirement of conjugacy needed
for the computation of the full conditionals, one may update only one variable or a small subset of
variables at a time and get stuck in local modes as a result.
• Finding a good proposal for a given posterior distribution is an active area of research. It requires a deeper understanding of the geometry of the posterior distribution. Hamiltonian Monte Carlo represents one such promising approach, but progress remains rudimentary at this point.
• In practice, one may mix and match different proposal strategies. For instance, one may interleave Gibbs updates for some subsets of variables with Metropolis-Hastings updates for other subsets.
10.2 Example
Poisson regression model Given a population of song sparrows, we are interested in learning about the relationship between the number of offspring and age. One approach is a regression model: the response y represents the number of offspring of a song sparrow, while the regressors are constructed from the age variable x.
For instance, we assume log E[Y|x] = β₁ + β₂x + β₃x², i.e., E[Y|x] = exp(β₁ + β₂x + β₃x²). Since Y takes nonnegative integer values, we may consider the Poisson distribution as the conditional distribution for Y given x. The resulting model is called a Poisson regression model:

Y | x ∼ Poisson(exp(β^⊤ x)).
To complete the prior specification, we may endow β with a normal prior. Note immediately that this is not conjugate to the Poisson likelihood. In general, Poisson regression is a specific instance of a broad class of models known as generalized linear models, for which conjugate priors generally do not exist. Thus, Gibbs sampling is difficult to implement.
Let's consider Metropolis sampling. Suppose a normal prior for β: β ∼ N(β₀, Σ₀), and an n-sample (y_i, x_i)_{i=1}^n. The acceptance ratio is easy to compute: given the current β^(s) and a proposed β*,

R = p(β* | X, y) / p(β^(s) | X, y)
  = [normal(β* | β₀, Σ₀) / normal(β^(s) | β₀, Σ₀)] × [∏_{i=1}^n poisson(y_i | x_i^⊤ β*) / ∏_{i=1}^n poisson(y_i | x_i^⊤ β^(s))].
This ratio is easy to compute. In practice, when n is large, the ratio may be either too large or too small to represent numerically. To avoid numerical issues, it is advisable to compute the logarithm of R instead of R directly. Then the acceptance probability is

a(β^(s), β*) = e^{min{0, log R}}.
It remains to specify the proposal distribution for β ∗ . A natural choice is to take a normal random walk,
i.e., via a normal distribution centered at β (s) :
q(β (s) , β ∗ ) = normal(β ∗ |β (s) , Σ).
How do we choose Σ? In a normal regression problem, the posterior variance of β will be close to σ²(X^⊤X)^{−1}, where σ² is the variance of Y. This gives us a hint for our Poisson regression problem: since log Y is taken to have expectation β^⊤x, we can take the proposal variance to be

Σ := σ̂²(X^⊤X)^{−1},

where σ̂² is the sample variance of {log(y₁ + 1/2), ..., log(y_n + 1/2)}. (The addition of 1/2 ensures that the log is applied to a positive number.)
We can also consider other choices for Σ, or for q(β^(s), β*). For the chosen form of Σ above, we may also choose a different σ̂². All such choices result in a valid Markov chain, but they can have different mixing qualities and autocorrelations of the MCMC samples. The general rule of thumb is to specify a proposal so that the acceptance rate is neither too large nor too small (say, between 20% and 50%).
For more detail of this and other examples, see chapter 10 of Hoff [2009].
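Putting the pieces of this example together, here is a hedged Python sketch of the random-walk Metropolis sampler for Poisson regression. The prior β ∼ N(0, 100 I) is an illustrative choice of ours (the notes leave β₀, Σ₀ generic), and the proposal covariance follows the heuristic above:

```python
import numpy as np

def metropolis_poisson(y, X, S=5000, seed=0):
    """Random-walk Metropolis for Poisson regression, as described above.

    Assumed prior: beta ~ N(0, 100 I) (illustrative).
    Proposal variance: Sigma = sigma_hat^2 (X'X)^{-1},
    with sigma_hat^2 = var(log(y + 1/2)).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    s2 = float(np.var(np.log(y + 0.5)))
    L = np.linalg.cholesky(s2 * np.linalg.inv(X.T @ X))

    def log_post(b):  # log prior + log likelihood, up to constants
        eta = X @ b
        return float(np.sum(y * eta - np.exp(eta)) - 0.5 * (b @ b) / 100.0)

    beta, lp = np.zeros(p), log_post(np.zeros(p))
    samples, n_accept = [], 0
    for _ in range(S):
        prop = beta + L @ rng.normal(size=p)
        lp_prop = log_post(prop)
        # symmetric proposal: work with log R, accept w.p. e^{min(0, log R)}
        if np.log(rng.random()) < min(0.0, lp_prop - lp):
            beta, lp = prop, lp_prop
            n_accept += 1
        samples.append(beta)
    return np.array(samples), n_accept / S
```

The returned acceptance rate can be monitored against the 20-50% rule of thumb; rescaling Σ adjusts it in practice.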
11 Unsupervised learning and nonparametric Bayes
Unsupervised learning is a term that originates from machine learning, but it basically refers to a class of learning problems and techniques involving latent variable models.
The most basic instance of unsupervised learning is clustering, a problem often vaguely formulated as follows: given n data points X₁, ..., X_n residing in some space, say R^d, how does one subdivide these data into a number of clusters so that points belonging to the same cluster are more similar to one another than to points from different clusters?
A popular method is the k-means algorithm, a simple and fast procedure for obtaining k clusters for a given k < ∞. There is only a limited theoretical basis for such an algorithm.
To provide a firm foundation for clustering, a powerful approach is to introduce additional probabilistic structure for the data. Such modeling is important to guarantee that we are doing the right thing under certain assumptions; more importantly, it opens up new avenues for developing more sophisticated clustering algorithms as additional information about the data set, or requirements about the inference, become available.
The most common statistical modeling tool is the mixture model. A mixture distribution admits the following density:

p(x | p, φ) = ∑_{j=1}^k p_j f(x | φ_j),

where f is a known density kernel and k is the number of mixing components; p_j and φ_j are the mixing probability and the parameter associated with component j. When k is finite, this is the pdf of a finite mixture model.
Given an i.i.d. n-sample X := (x₁, ..., x_n) from this mixture density, it is possible to estimate the parameters φ_j via maximum likelihood, which can be achieved by the Expectation-Maximization (EM) algorithm. In fact, the EM algorithm can be viewed as a generalization of the popular k-means algorithm mentioned above. 13
By taking a Bayesian approach to learning a mixture model, we will see that a Gibbs sampler for posterior inference, with a suitable choice of conjugate priors, is a probabilistic version of the EM algorithm (and of the k-means algorithm). Thus, the Bayesian approach can produce estimates comparable to those of EM, but with the advantage of uncertainty quantification.
The question of model selection, i.e., how to select the number of mixture components k, requires the development of a new framework known as Bayesian nonparametrics: the number of relevant parameters will be unknown, random, and potentially unbounded, so the totality of all potential parameters is infinite. This requires new ideas for prior construction and computational methods. The outcome is an elegant solution to model selection, in which the number of parameters will be shown to increase a posteriori as the data sample size increases. 14
In the Bayesian nonparametric framework, the corresponding models for the clustering problem are called infinite mixture models, endowed with suitable nonparametric Bayesian priors.
13 You may ignore any references to the k-means and EM algorithms in this set of notes if you have not seen these algorithms before.
14 Good references for Bayesian nonparametrics include Hjort et al. [2010], Ghosh and Ramamoorthi [2002], Ghosal and van der Vaart [2017].
11.1 Finite mixture models
Consider a finite mixture of normal distributions on the real line:

p(x | p, φ) = ∑_{j=1}^k p_j N(x | φ_j, σ²),

where the parameters are p = (p₁, ..., p_k) and the mean parameters φ₁, ..., φ_k ∈ R; σ² is assumed known. For the prior specification, we take

φ_j ∼ N(µ, τ²) independently, for j = 1, ..., k,

for some hyperparameters µ and τ. The mixing probability vector p = (p₁, ..., p_k) ∈ ∆^{k−1} will be endowed with a Dirichlet prior,

p ∼ D_k(α).

Recall that the Dirichlet distribution on ∆^{k−1} requires positive-valued hyperparameters α = (α₁, ..., α_k).
11.1.1 Auxiliary variables
Now we introduce a very common and powerful technique in Bayesian inference: instead of working directly with the original (mixture) model, we introduce additional auxiliary latent variables in a joint model. When the auxiliary variables are integrated out, we recover the original model.
The main advantage of this technique is in posterior computation. The joint posterior distribution (with the auxiliary variables included) tends to be easier to work with via Gibbs sampling or other MCMC updates, because the full conditional distributions are easy to compute: in the presence of the auxiliary variables, a prior that was not semiconjugate with respect to the original model becomes semiconjugate with respect to the joint model.
For our current mixture model, we need one auxiliary variable for each sample x_i: Z := (Z₁, ..., Z_n), where each Z_i ∈ {1, ..., k}. Z_i is interpreted as the (unknown and random) label of the mixture component from which the data point X_i is generated.
The joint model p(X, Z|p, φ) with the auxiliary Z included is defined as follows:
Z_i ∼ Cat(p) i.i.d., for i = 1, ..., n;
X_i | φ, Z_i = j ∼ N(· | φ_j, σ²), for i = 1, ..., n; j = 1, ..., k.

The priors for p and φ are given as before.
Now we proceed to compute the posterior distribution for the quantities of interest p(Z, p, φ|X) via
Gibbs sampling. The full conditional distributions are easy to derive.
• For Z: for each i = 1, ..., n and j = 1, ..., k,

p(Z_i = j | Z_{−i}, X, p, φ) = p(Z_i = j | X_i = x_i, p, φ)
∝ p(Z_i = j) p(x_i | Z_i = j, p, φ)
= p_j N(x_i | φ_j, σ²) / ∑_{l=1}^k p_l N(x_i | φ_l, σ²).
• For φ: for each j = 1, ..., k,

p(φ_j | φ_{−j}, Z, X, p) = p(φ_j | Z, {x_i : z_i = j})
= N(φ_j | [µ/τ² + ∑_i x_i 1(z_i = j)/σ²] / [1/τ² + n_j/σ²], 1 / [1/τ² + n_j/σ²]).

The first identity is due to conditional independence. The second identity is a standard posterior computation under a normal likelihood and a normal prior for the mean parameter (cf. Section 5). Here, n_j = ∑_{i=1}^n 1(z_i = j), i.e., the number of data points currently assigned to mixture component j by means of having the label z_i = j.
• For p:

p(p | Z, X, φ) = p(p | Z)    (due to conditional independence)
∝ p(p) p(Z | p)
∝ ∏_{j=1}^k p_j^{α_j − 1} × ∏_{j=1}^k p_j^{n_j}
∝ D(p | α₁ + n₁, ..., α_k + n_k) = D(p | α + n),

where in the last line we use n to denote (n₁, ..., n_k).
We make some comments:
• The Gibbs updates for Z_i and p can be viewed as the result of a "soft" (probabilistic) assignment of the cluster label to each data point x_i. Recall that in the k-means clustering algorithm, there is a hard assignment of the cluster label associated with each data point. In the EM algorithm, this corresponds to the E-step, which updates the expectations of latent quantities such as Z_i.
• The Gibbs update for φ_j is a probabilistic update of the cluster means. This is the direct counterpart of the M-step in the EM algorithm and the mean-update step in k-means.
• Gibbs sampling is convenient but not the most efficient posterior computation technique. We may consider other forms of MCMC, such as Metropolis-Hastings algorithms, as we saw in Section 10. The wealth of available posterior inference algorithms is a hidden benefit of working with a rich Bayesian modeling framework. It is considerably harder to invent a deterministic counterpart of Metropolis-Hastings algorithms among frequentist approaches, which must extend from the basic k-means and EM algorithms.
11.2 Infinite mixture models
As we said earlier, the salient feature of a nonparametric Bayesian approach is to allow infinitely many
parameters to be present in the model.
Continuing with our present example of mixture modeling with normal components, an infinite mixture model admits the following density function:

p(x | p, φ) = ∑_{j=1}^∞ p_j f(x | φ_j).

As before, f(x | φ_j) = N(x | φ_j, σ²) for some known σ², but here there are infinitely many parameters (p_j, φ_j)_{j=1}^∞.
An immediate question is: how do we specify a Bayesian prior on infinitely many parameters? Since the φ_j's are unconstrained on the real line, we may again set the prior for these parameters as

φ₁, φ₂, ... ∼ G₀, i.i.d.

For instance, take G₀ = N(µ, τ²).
The nontrivial issue lies in specifying the prior for p = (p₁, p₂, ...), which is now an infinite sequence satisfying the constraints p_j ≥ 0 and ∑_j p_j = 1.
Recall that if p = (p₁, ..., p_k) is a finite sequence, i.e., k < ∞, then we may use the Dirichlet distribution as a prior for p ∈ ∆^{k−1}, say p ∼ D(α) for some α = (α₁, ..., α_k) ∈ R₊^k. We need a generalization of the Dirichlet distribution that works on ∆^∞.
11.2.1 Dirichlet process prior
Recall a simple fact about the finite-dimensional Dirichlet distribution: if k = 2, the Dirichlet distribution D((p₁, p₂) | α₁, α₂) reduces to the Beta distribution on the unit interval, Beta(p₁ | α₁, α₂), because p₂ = 1 − p₁.
With a moment of thought, one can conceive of the following distribution on the infinite sequence p = (p₁, p₂, ...) by constructing a random "stick-breaking" process: take a stick of unit length and break it into two shorter pieces in a random fashion, one of which is assigned length p₁; the remaining part, of length 1 − p₁, is broken again randomly to obtain p₂, and so on. Whenever we break a piece of the stick into two smaller pieces, we take the proportions of the smaller pieces to be beta distributed.
To be precise, let β = (β₁, β₂, ...) be i.i.d. Beta(1, α). Define

p₁ = β₁,    p_k = β_k ∏_{i=1}^{k−1} (1 − β_i),    k = 2, 3, ....

It is easy to check that the infinite sequence p constructed this way satisfies ∑_{k=1}^∞ p_k = 1 almost surely.
We have just described a Dirichlet distribution on the infinite-dimensional probability simplex ∆^∞.
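The stick-breaking construction can be simulated directly. The Python sketch below truncates the infinite sequence once the leftover stick length falls below a tolerance (the truncation rule is our own device; the construction itself is the one above):

```python
import numpy as np

def stick_breaking(alpha, eps=1e-10, rng=None):
    """Draw stick-breaking weights p_1, p_2, ... as defined above,
    truncated once the leftover stick is shorter than eps."""
    if rng is None:
        rng = np.random.default_rng()
    weights, rest = [], 1.0
    while rest > eps:
        b = rng.beta(1.0, alpha)        # beta_k ~ Beta(1, alpha)
        weights.append(rest * b)        # p_k = beta_k * prod_{i<k} (1 - beta_i)
        rest *= 1.0 - b
    return np.array(weights)
```

Smaller α concentrates the mass on fewer atoms (the first break takes a larger proportion on average), which is the mechanism behind the "rich get richer" clustering behavior of the Dirichlet process.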
Collecting the above specifications gives us a definition of the famous Dirichlet process. 15

Definition 11.1. Let G₀ be a probability distribution on the real line, and let

φ₁, φ₂, ... ∼ G₀, i.i.d.

Let α > 0, and let

β₁, β₂, ... ∼ Beta(1, α), i.i.d.

Set

p₁ = β₁,    p_k = β_k ∏_{i=1}^{k−1} (1 − β_i),    k = 2, 3, ....        (62)

Define the discrete distribution on the real line

G := ∑_{j=1}^∞ p_j δ_{φ_j}.

Then we say that G is a Dirichlet process on the real line. We write

G | α, G₀ ∼ D(αG₀).        (63)
What we have just defined is that G is a random variable taking values in the space of probability distributions on the real line, namely P(R). The distribution from which the random G is generated, namely D(αG₀), is called a Dirichlet distribution; it generalizes the standard Dirichlet distribution on a finite-dimensional probability simplex to a distribution on the infinite-dimensional probability simplex ∆^∞.
Note that the distribution D(αG₀) has two parameters: a positive scalar α > 0 and a distribution G₀ on the real line.
15 The Dirichlet process was first introduced by Thomas Ferguson. Definition 11.1, however, was given by Jayaram Sethuraman.
Back to our infinite mixture model setting
p(x|p, φ) =
∞
X
pj f (x|φj )
(64)
j=1
The distribution G = ∑_{j=1}^∞ pj δφj encapsulates all the parameters of the infinite mixture model that we seek to estimate. We can rewrite the mixture model equivalently as

    p(x | G) = ∫ f(x | φ) G(dφ).    (65)

Eq. (65) gives us the view of the infinite mixture model as a model parameterized by G ∈ P(R). G is called the mixing distribution, or mixing measure, of the mixture model.
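As a quick numerical illustration of Eq. (65): when G is discrete, the integral reduces to the sum in Eq. (64). The sketch below (function names ours) uses the Gaussian kernel f(x | φ) = N(x | φ, σ²) as one concrete choice:

```python
import math

def normal_pdf(x, mean, sd):
    """Density N(x | mean, sd^2)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_density(x, weights, atoms, sigma=1.0):
    """p(x | G) = int f(x | phi) G(d phi) = sum_j p_j f(x | phi_j)
    when G = sum_j p_j delta_{phi_j} and f(x | phi) = N(x | phi, sigma^2)."""
    return sum(p * normal_pdf(x, phi, sigma) for p, phi in zip(weights, atoms))

# A two-atom mixing measure G = 0.3 delta_{-2} + 0.7 delta_{+2}:
print(mixture_density(0.0, weights=[0.3, 0.7], atoms=[-2.0, 2.0]))
```

Because both atoms sit at distance 2 from x = 0, the value equals N(0 | 2, 1) regardless of the weights, a handy sanity check.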
When the mixing distribution G is endowed with the Dirichlet prior given by Eq. (63),

    G | α, G0 ∼ D(αG0),

we call our model a Dirichlet process mixture model.
This is still a standard Bayesian formulation, although a nonparametric one, where the parameter of interest is the infinite-dimensional G ∈ P(R).
Given an i.i.d. n-sample X1, ..., Xn | G ∼ p(x | G), the immediate question of concern is that of posterior computation. How do we compute

    p(G | X1, ..., Xn)?
11.3 Posterior computation via slice sampling
The totality of variables of interest includes the observed data X = (X1, ..., Xn), the mixing proportions p = (p1, ...), and the atoms φ = (φ1, ...). Moreover, p is constructed via the stick-breaking representation (62), which is based on the variables β = (β1, ...).
We shall make use of the auxiliary variable technique extensively. The first use is similar to the case of a finite mixture that we saw in Section 11.1. According to the joint model,
• each data point Xi is associated with a mixture component label Zi ∈ {1, 2, ...};
• given p, Zi | p ∼iid Cat(p) for i = 1, ..., n;
• given Zi and all other variables, Xi is distributed according to f(Xi | φZi).
Thus, we may write the joint model as

    (β, φ, Z, X) ∼ Beta(1, α)^∞ × G0^∞ × ∏_{i=1}^n pZi × ∏_{i=1}^n f(Xi | φZi).    (66)

The superscripts ∞ signify the infinitely many variables β = {βk}_{k=1}^∞ and φ = {φk}_{k=1}^∞ present in the model.
We seek to devise a Markov chain that converges in distribution to the target stationary distribution, namely the posterior of β, φ, Z given the data X. The difficulty is apparent: there are an infinite number of variables to handle, which cannot possibly be sampled simultaneously. We use a technique known as "slice sampling".
Slice sampling involves the introduction of yet another set of auxiliary random variables, u := (u1, ..., un), with ui taking values in the bounded interval (0, qZi) for i = 1, ..., n, where (qj)_{j≥1} is a sequence of values in (0, 1), generated either deterministically or randomly, such that qj tends to zero (surely or almost surely).
In particular, for each i, given q we draw ui from the uniform distribution on the interval (0, qZi).
Thus, the extended joint model takes the form

    (β, φ, u, Z, X | q) ∼ Beta(1, α)^∞ × G0^∞ × ∏_{i=1}^n (1(ui < qZi) / qZi) × ∏_{i=1}^n pZi × ∏_{i=1}^n f(Xi | φZi).    (67)
It is clear that integrating out all the ui in the joint distribution given by Eq. (67) leads to the joint distribution given by Eq. (66). Thus, it is sufficient to construct a Markov chain for the model given by Eq. (67).
What one gains by introducing the auxiliary variables u is that, when u are conditioned on, we only need to choose labels Zi from the finite set

    H(ui) := {j ∈ N+ : qj > ui}.

If one thinks of a bar graph in which the height of each bar represents the magnitude of qj, j = 1, ..., then restricting the label Zi to H(ui) corresponds visually to "slicing" at height ui, keeping only the bars taller than ui. Hence the name "slice sampling".
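Computing the slice set H(ui) is a one-liner. A minimal sketch (names ours), assuming the deterministic choice qj = 2^{−j}:

```python
def slice_set(u_i, q):
    """H(u_i) = {j : q_j > u_i} -- the labels surviving the slice at height u_i.
    This set is finite because q_j tends to zero.  Labels are indexed from 1."""
    return [j for j, qj in enumerate(q, start=1) if qj > u_i]

q = [0.5 ** j for j in range(1, 11)]   # q_j = 2^{-j}, tending to zero
print(slice_set(0.1, q))               # -> [1, 2, 3]
```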
Gibbs sampler for model (67)

• Sampling u given β, Z, X, q: for each i = 1, ..., n, draw ui ∼indep Uniform[0, qZi].

• Sampling β given u, Z, φ, X, q: Note that the variables β are relevant only to the extent that they determine the pj's. Moreover, the only pj's of concern are those with indices j ∈ ∪_{i=1}^n H(ui). Thus,

    p(βj | the rest) ∝ (1 − βj)^{α−1} × ∏_{i: qZi > ui} pZi
                    ∝ (1 − βj)^{α−1} × ∏_{i: qZi > ui} ∏_{k=1}^{Zi−1} (1 − βk) βZi
                    ∝ βj^{∑_{i=1}^n 1(Zi = j; qZi > ui)} × (1 − βj)^{α−1+∑_{i=1}^n 1(Zi > j; qZi > ui)}
                    ∝ Beta(1 + mj, α + ∑_{k>j} mk),

where in the last line we set mj := ∑_{i=1}^n 1(Zi = j; ui < qj) for j = 1, ..., and note that ∑_{i=1}^n 1(Zi > j; qZi > ui) = ∑_{k>j} mk.
Clearly, in the above computation we only need to update βj for j = 1, ..., K, where K is an index such that mk = 0 for all k > K. K represents an upper bound on the number of "active" indices, and may change from one Gibbs iteration to the next.
• Sampling φ given β, u, Z, X, q:

    p(φj | the rest) ∝ G0(dφj) ∏_{i=1}^n f(Xi | φZi)
                     ∝ ∏_{i: Zi = j} N(Xi | φj, σ²) G0(dφj)
                     ∝ N(φj | (μ/τ² + ∑_i Xi 1(Zi = j)/σ²) / (1/τ² + nj/σ²), 1 / (1/τ² + nj/σ²)),

where nj = ∑_{i=1}^n 1(Zi = j), i.e., the number of data points currently assigned to mixture component j by means of having the label Zi = j. (Note that this step is similar to the sampling of the component-specific means in a finite mixture.)
• Sampling Z given β, φ, u, X, q: for i = 1, ..., n,

    p(Zi = j | the rest) ∝ 1(ui < qj) (pj / qj) f(Xi | φj),    for j = 1, ....

This is where we need to be careful, since the support of Zi is unbounded. Obviously, the above probability is positive only if ui < qj. If q ∈ ∆∞ (although this is not a strict requirement; more on this below), then it suffices to update over all values j = 1, ..., K, where K is the minimal index for which 1 − ∑_{k=1}^K qk < min_{i=1,...,n} ui.
If we reach a new index k for which pk and φk have not yet been generated, then we proceed by generating φk ∼ G0 and βk ∼ Beta(1, α), and letting pk = ∏_{i=1}^{k−1} (1 − βi) βk = (1 − ∑_{i=1}^{k−1} pi) βk.
• Sampling q: If q is deterministically generated, then this step is not necessary (although the choice of this sequence may be critical to the mixing behavior of the underlying Markov chain). If q is randomly generated, there are several options:
– a simple method is to generate q independently of all other variables (e.g., via a fixed stick-breaking process). Then we may update q after one or several iterations of the Gibbs updates for all other variables.
– another approach is to place an independent prior on q: qj ∼ Uniform(0, bj) for j = 1, 2, ... such that bj ↓ 0. Then the update for q can be achieved given u via the conditional distribution

    p(qj | the rest) ∝ qj^{−nj} 1(qj > max_{i: Zi = j} ui).

– yet another approach is to take q := p, but then q is no longer independent of β; the update of β may not have the conjugate form or an easily calculable form as given above.
Observe that the MCMC algorithm gradually and stochastically adds new components (βj , φj ) for j =
1, 2, . . . into the state space of the Markov chain. No upper bound on the number of components is required
a priori!
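As an illustration of the β-update above, the Beta parameters (1 + mj, α + ∑_{k>j} mk) can be computed directly from the current labels. This is a sketch with our own names; note that under the construction ui < qZi always holds, so mj is simply the number of observations with label j:

```python
def beta_posterior_params(Z, u, q, alpha):
    """Parameters (a_j, b_j) of p(beta_j | rest) = Beta(1 + m_j, alpha + sum_{k>j} m_k),
    where m_j = #{i : Z_i = j, u_i < q_j}.  Labels Z_i take values 1, 2, ..."""
    K = max(Z)  # largest active label
    m = [sum(1 for zi, ui in zip(Z, u) if zi == j and ui < q[j - 1])
         for j in range(1, K + 1)]
    return [(1 + m[j], alpha + sum(m[j + 1:])) for j in range(K)]

Z = [1, 1, 2, 3]
u = [0.01, 0.02, 0.01, 0.03]
q = [0.5, 0.25, 0.125]
print(beta_posterior_params(Z, u, q, alpha=1.0))  # -> [(3, 3.0), (2, 2.0), (2, 1.0)]
```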
11.4 Chinese restaurant process and another Gibbs sampler
Dirichlet processes have many other remarkable characterizations, which help us understand them more deeply while giving us additional ideas for computation. Next, we describe a Pólya urn characterization of the Dirichlet process.
Consider the following specification for a sequence of random variables which are i.i.d. draws from a Dirichlet process:

    G | α, G0 ∼ D(αG0),    (68)
    θ1, ..., θn | G ∼iid G.    (69)
Note that given α and G0, the random distribution G may be represented as G = ∑_{j=1}^∞ pj δφj, where p and φ are the random variables given by Definition 11.1. Since θ1, θ2, ... form a conditionally i.i.d. sequence, they form an exchangeable sequence of random variables.
We ask: what is the marginal distribution of the exchangeable sequence θ1, θ2, ..., which would be obtained if we integrated out the random G in the above specification?
Based on Definition 11.1, it is not difficult to verify that the joint distribution of the sequence θ1, θ2, ... can be completely specified as follows:

    θ1 ∼ G0,
    θ2 | θ1 ∝ δθ1 + αG0,
    ...
    θj | θ1, ..., θj−1 ∝ ∑_{k=1}^{j−1} δθk + αG0,
    ... .
The sequence of random variables defined this way is generally known as a Pólya sequence. It makes explicit the clustering behavior of the collection of random variables θ1, θ2, ... generated from a (random) Dirichlet process G ∼ D(αG0): with positive probability, each θj shares the same value as some of the other variables generated before it in the sequence.
This Pólya sequence has a tasty name: "the Chinese restaurant process". Consider the following imaginary Chinese restaurant, which has infinitely many tables and receives an infinite sequence of customers labeled 1, 2, ...:
• Customer 1 arrives and sits at an arbitrary table.
• The following customers 2, 3, ... arrive in sequence and choose their tables according to the following rule: a non-empty table is chosen with probability proportional to the current number of customers sitting at that table; otherwise the customer chooses a new table with probability proportional to α.
• For each table, a random dish is ordered i.i.d. from the menu (distribution) G0 for all at that table to share.
Finally, assign each θi to the dish that customer i is having.
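The restaurant metaphor translates directly into a simulator for the Pólya sequence. A minimal sketch (function and parameter names are ours), using G0 = N(0, 1) as the menu:

```python
import random

def polya_sequence(n, alpha, rng):
    """Draw theta_1, ..., theta_n from the Polya (Chinese restaurant) sequence:
    theta_j is a fresh G0 = N(0, 1) draw with probability alpha / (j - 1 + alpha),
    and otherwise equals a uniformly chosen previous theta_k (which gives each
    existing value probability proportional to its multiplicity)."""
    thetas = []
    for j in range(1, n + 1):
        if rng.random() < alpha / (j - 1 + alpha):
            thetas.append(rng.gauss(0.0, 1.0))   # new table, fresh dish from G0
        else:
            thetas.append(rng.choice(thetas))    # join an existing table
    return thetas

rng = random.Random(0)
draws = polya_sequence(30, alpha=1.0, rng=rng)
print(len(set(draws)))  # number of distinct dishes (clusters) among 30 draws
```

Note that customer 1 always opens a new table, since α / (0 + α) = 1.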
Gibbs sampler based on the Pólya characterization  The Dirichlet process mixture model can be expressed as follows. Recall the prior:

    G | α, G0 ∼ D(αG0),
    θ1, ..., θn | G ∼iid G,

which is combined with the likelihood specification: for i = 1, ..., n,

    Xi | θi ∼indep f(Xi | θi).    (70)

The latent variables θ1, ..., θn represent the parameters with which X1, ..., Xn are respectively associated. E.g., θi is the mean parameter of the mixture component that Xi is associated with when we use f(Xi | θi) = N(Xi | θi, σ²).
To implement a Gibbs sampler, we need to construct a Markov chain for {θ1, ..., θn} that converges to the target stationary distribution p(θ1, ..., θn | X). For a Gibbs update, we need to compute the full conditional distribution of each θi given all other variables.
By the fact that θ1, ..., θn are a priori exchangeable, we may treat θi as the last element in the Pólya sequence (i.e., the last customer in the Chinese restaurant process). Thus,

    θi | θ−i ∝ ∑_{j≠i} δθj + αG0.

By Bayes' rule and conditional independence, we have

    p(θi | θ−i, X) ∝ p(θi | θ−i) f(Xi | θi)
                   ∝ α f(Xi | θ) G0(dθ) + ∑_{j≠i} f(Xi | θj) δθj.
The above full conditional distribution is a mixture: with probability proportional to f(Xi | θj) we set θi := θj, and with probability proportional to α ∫ f(Xi | θ) G0(dθ) we draw θi from the distribution proportional to f(Xi | θ) G0(dθ), i.e., the posterior of θ under prior G0 given the single observation Xi. The integral in question is available in closed form due to the normal-normal conjugacy between G0 and f.
We see clearly in the Gibbs sampling step the two types of moves: one type of move is to select a cluster/table/dish for θi among the existing ones, and the other is to generate a new cluster/table/dish from the base distribution G0. Thus, the number of clusters is also sampled as part of the Markov chain generation.
Summarizing, the Gibbs sampling algorithm consists of a single step per MCMC iteration:
(1) for i = 1, ..., n, draw θi given the existing θ−i and X from the full conditional distribution derived above.
This is only the simplest example of a Gibbs sampler based on the Pólya characterization of Dirichlet
processes. Researchers have developed more sophisticated and efficient techniques based on Gibbs and
Metropolis-Hastings sampling frameworks.
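For concreteness, here is a sketch of one sweep of this sampler for the normal kernel f(Xi | θ) = N(Xi | θ, σ²) with base measure G0 = N(μ0, τ²). The hyperparameter names (μ0, τ) and all function names are our own; when a fresh value is needed, we draw from the normal-normal posterior proportional to f(Xi | θ)G0(dθ), which is available in closed form:

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def gibbs_sweep(theta, X, alpha, sigma, mu0, tau, rng):
    """One sweep of the Polya-urn Gibbs sampler: resample each theta_i from
    its full conditional, a mixture of point masses at the other theta_j
    (weight f(X_i | theta_j)) and a fresh draw (weight alpha * marginal)."""
    n = len(X)
    for i in range(n):
        others = theta[:i] + theta[i + 1:]
        weights = [normal_pdf(X[i], t, sigma) for t in others]
        # Marginal likelihood int f(X_i | t) G0(dt) = N(X_i | mu0, sigma^2 + tau^2):
        weights.append(alpha * normal_pdf(X[i], mu0, math.sqrt(sigma**2 + tau**2)))
        # Inverse-CDF draw from the discrete distribution over the n options:
        r = rng.random() * sum(weights)
        k, acc = 0, weights[0]
        while acc < r:
            k += 1
            acc += weights[k]
        if k < len(others):
            theta[i] = others[k]                 # join an existing cluster
        else:
            # New cluster: draw from the posterior prop. to f(X_i | t) G0(dt).
            v = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
            theta[i] = rng.gauss(v * (mu0 / tau**2 + X[i] / sigma**2), math.sqrt(v))
    return theta

rng = random.Random(1)
X = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.9]
theta = list(X)  # initialize each theta_i at its own data point
for _ in range(100):
    theta = gibbs_sweep(theta, X, alpha=1.0, sigma=0.5, mu0=0.0, tau=2.0, rng=rng)
print(len(set(theta)))  # number of distinct cluster values in the current state
```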
This section offered a glimpse of the Dirichlet process, which is just one of many powerful tools of Bayesian nonparametrics. For an expanded version of this short introduction, see also the lecture notes [Nguyen, 2015].
12 Additional topics
Bayesian statistics has a rich literature, both classical and modern, which has produced an enormous repository of ideas and tools for modeling and computation. Several modeling/computational topics are worth exploring further from here:
• Modeling: probabilistic graphical models [Jordan, 2004, Blei et al., 2003, Pritchard et al., 2000], Gaussian processes for nonlinear regression and classification [Rasmussen and Williams, 2006], and hierarchical modeling with Dirichlet processes and extensions [Teh and Jordan, 2010].
• Computation: general variational inference [Wainwright and Jordan, 2008] and variational inference applied to Bayesian models [Blei et al., 2017], geometric methods, e.g., for topic and hierarchical models [Yurochkin et al., 2019], and MCMC with proposal distributions arising from Langevin [Roberts and Tweedie, 1996] and Hamiltonian dynamics [Neal, 2011].
Most of the above references are available from the Canvas folder "Additional Reading" for this course.
References
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res, 3:993–1022, 2003.
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
C. Geyer. Markov Chain Monte Carlo lecture notes. Unpublished, 2005.
S. Ghosal and A. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge University
Press, 2017.
J. K. Ghosh and R. V. Ramamoorthi. Bayesian nonparametrics. Springer, 2002.
N. Hjort, C. Holmes, P. Mueller, and S. Walker (Eds.). Bayesian Nonparametrics: Principles and Practice.
Cambridge University Press, 2010.
P. Hoff. A First Course in Bayesian Statistical Methods. Springer, 2009.
M. I. Jordan. An introduction to probabilistic graphical models. Unpublished edition, 2003.
M. I. Jordan. Graphical models. Statistical Science, Special Issue on Bayesian Statistics (19):140–155,
2004.
R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–163,
2011.
X. Nguyen. VIASM lectures on Bayesian nonparametrics. Unpublished edition, 2015.
J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype
data. Genetics, 155:945–959, 2000.
C. E. Rasmussen and C. Williams. Gaussian processes for machine learning. MIT Press, 2006.
C. P. Robert. The Bayesian Choice: From decision-theoretic foundations to computational implementations.
Springer, 2nd edition, 2007.
G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. In N. Hjort, C. Holmes, P. Mueller, and S. Walker, editors, Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, Cambridge, UK, 2010.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference.
Foundations and Trends in Machine Learning, 1:1–305, 2008.
M. Yurochkin, A. Guha, Y. Sun, and X. Nguyen. Dirichlet simplex nest and geometric inference. In Proceedings of the International Conference on Machine Learning (ICML), 2019. URL arXiv:1905.11009.