Introduction to Bayesian Methods

Introduction
We develop the Bayesian paradigm for parametric inference. To
this end, suppose we conduct (or wish to design) a study, in
which the parameter θ is of inferential interest. Here θ may be
vector valued. For example,
1. θ = difference in treatment means
2. θ = hazard ratio
3. θ = vector of regression coefficients
4. θ = probability a treatment is effective
Introduction
In parametric inference, we specify a parametric model for the
data, indexed by the parameter θ. Letting x denote the data, we
denote this model (density) by p(x|θ). The likelihood function of
θ is any function proportional to p(x|θ), i.e.,
L(θ) ∝ p(x|θ).
Example
Suppose x|θ ∼ Binomial(N, θ). Then

p(x|θ) = (N choose x) θ^x (1 − θ)^(N − x),   x = 0, 1, ..., N.
Introduction
We can take
L(θ) = θ^x (1 − θ)^(N − x).
The parameter θ is unknown. In the Bayesian mind-set, we
express our uncertainty about quantities by specifying
distributions for them. Thus, we express our uncertainty about θ
by specifying a prior distribution for it. We denote the prior
density of θ by π(θ). The word "prior" is used to denote that it is
the density of θ before the data x is observed. By Bayes theorem,
we can construct the distribution of θ|x, which is called the
posterior distribution of θ . We denote the posterior distribution
of θ by p(θ|x).
Introduction
By Bayes theorem,

p(θ|x) = p(x|θ)π(θ) / ∫_Θ p(x|θ)π(θ) dθ,

where Θ denotes the parameter space of θ. The quantity

p(x) = ∫_Θ p(x|θ)π(θ) dθ
is the normalizing constant of the posterior distribution. For most
inference problems, p(x) does not have a closed form. Bayesian
inference about θ is primarily based on the posterior distribution
of θ , p(θ|x).
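Since p(x) rarely has a closed form, it is often approximated numerically. Below is a minimal Python sketch (not part of the original notes) that approximates the posterior on a grid for a binomial model; the data, the grid size, and the Beta(2, 2) prior are illustrative assumptions.

import numpy as np
from scipy.stats import binom, beta

# Hypothetical data: N trials, x successes
N, x = 12, 9

# Grid over the parameter space Theta = (0, 1)
theta = np.linspace(0.001, 0.999, 1000)

# Unnormalized posterior: likelihood times an assumed Beta(2, 2) prior
unnorm = binom.pmf(x, N, theta) * beta.pdf(theta, 2, 2)

# Approximate the normalizing constant p(x) by a Riemann sum
p_x = np.sum(unnorm) * (theta[1] - theta[0])
posterior = unnorm / p_x

# Posterior mean E(theta | x) approximated on the same grid
post_mean = np.sum(theta * posterior) * (theta[1] - theta[0])
print(p_x, post_mean)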
Introduction
For example, one can compute various posterior summaries,
such as the mean, median, mode, variance, and quantiles. In
particular, the posterior mean of θ is given by

E(θ|x) = ∫_Θ θ p(θ|x) dθ.
Example 1
Given θ, suppose x1, x2, ..., xn are i.i.d. Binomial(1, θ), and
θ ∼ Beta(α, λ). The parameters of the prior distribution are often
called the hyperparameters.
Let us derive the posterior distribution of θ.
Let x = (x1, x2, ..., xn). Then
Introduction
p(x|θ) = ∏_{i=1}^n p(xi|θ)
       = ∏_{i=1}^n θ^(xi) (1 − θ)^(1 − xi)
       = θ^(Σxi) (1 − θ)^(n − Σxi),

where Σxi = Σ_{i=1}^n xi. Also,

π(θ) = [Γ(α + λ) / (Γ(α)Γ(λ))] θ^(α − 1) (1 − θ)^(λ − 1).

Now, we can write the kernel of the posterior density as
Introduction
p(θ|x) ∝ θ^(Σxi) (1 − θ)^(n − Σxi) θ^(α − 1) (1 − θ)^(λ − 1)
       = θ^(Σxi + α − 1) (1 − θ)^(n − Σxi + λ − 1).

Thus p(θ|x) ∝ θ^(Σxi + α − 1) (1 − θ)^(n − Σxi + λ − 1). We can recognize
this kernel as a beta kernel with parameters (Σxi + α, n − Σxi + λ). Thus,

θ|x ∼ Beta(Σxi + α, n − Σxi + λ),

and therefore

p(θ|x) = [Γ(α + n + λ) / (Γ(Σxi + α) Γ(n − Σxi + λ))]
         × θ^(Σxi + α − 1) (1 − θ)^(n − Σxi + λ − 1).
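A minimal Python sketch of this conjugate Beta update (the data and the Beta(α, λ) hyperparameters below are made-up values for illustration, not taken from these notes):

import numpy as np
from scipy.stats import beta

# Hypothetical Bernoulli data and Beta(alpha, lambda) hyperparameters
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
alpha, lam = 2.0, 2.0

n, s = len(x), int(x.sum())   # s = sum of the xi

# Conjugate update: theta | x ~ Beta(s + alpha, n - s + lambda)
post = beta(s + alpha, n - s + lam)
print("posterior mean:", post.mean())
print("posterior 95% interval:", post.interval(0.95))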
Introduction
Remark In deriving posterior densities, a frequently used technique
is to try to recognize the kernel of the posterior density of θ.
This avoids direct computation of p(x) and saves considerable
time in derivations. If the kernel cannot be recognized, then
p(x) must be computed directly.
In this example we have

p(x) = p(x1, ..., xn) ∝ ∫_0^1 θ^(Σxi + α − 1) (1 − θ)^(n − Σxi + λ − 1) dθ
     = Γ(Σxi + α) Γ(n − Σxi + λ) / Γ(α + n + λ).
Introduction
Thus

p(x1, ..., xn) = [Γ(α + λ) / (Γ(α)Γ(λ))] × [Γ(Σxi + α) Γ(n − Σxi + λ) / Γ(α + n + λ)]

for xi = 0, 1, and i = 1, ..., n.
Suppose A1, A2, ... are events such that Ai ∩ Aj = ∅ for i ≠ j and
∪_{i=1}^∞ Ai = Ω, where Ω denotes the sample space. Let B denote an
event in Ω. Then Bayes theorem for events can be written as

P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^∞ P(B|Aj)P(Aj).
Introduction
P(Ai) is the prior probability of Ai and P(Ai|B) is the posterior
probability of Ai given that B has occurred.
Example 2 Bayes theorem is often used in diagnostic tests for
cancer. A young person was diagnosed as having a type of cancer
that occurs extremely rarely in young people. Naturally, he was
very upset. A friend told him that it was probably a mistake. His
friend reasoned as follows. No medical test is perfect: there are
always incidences of false positives and false negatives.
Introduction
Let C stand for the event that he has cancer and let + stand for
the event that an individual responds positively to the test.
Assume P(C) = 1/1,000,000 = 10^(−6) and P(+|C^c) = .01. (So
only one per million people his age have the disease, and the test
is extremely good relative to most medical tests, giving only 1%
false positives and 1% false negatives.) Find the probability that
he has cancer given that he has a positive response. (After you
make this calculation you will not be surprised to learn that he
did not have cancer.)
P(C|+) = P(+|C)P(C) / [P(+|C)P(C) + P(+|C^c)P(C^c)]

P(C|+) = (.99)(10^(−6)) / [(.99)(10^(−6)) + (.01)(.999999)]
Introduction
P(C|+) = .00000099 / .01000098 = .00009899
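The arithmetic above is easy to check in a few lines of Python (this simply reproduces the calculation with the probabilities assumed in the example):

# Posterior probability of cancer given a positive test (Example 2)
p_c = 1e-6                   # prior P(C)
p_pos_given_c = 0.99         # P(+|C): 1% false negatives
p_pos_given_not_c = 0.01     # P(+|C^c): 1% false positives

numer = p_pos_given_c * p_c
denom = numer + p_pos_given_not_c * (1 - p_c)
print(numer / denom)         # approximately 0.000099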
Example 3 Suppose x1, ..., xn is a random sample from N(µ, σ²).
i) Suppose σ² is known and µ ∼ N(µo, σo²). The posterior
density of µ is given by:

p(µ|x) ∝ [∏_{i=1}^n p(xi|µ, σ²)] π(µ)
       ∝ exp{ −(1/(2σ²)) Σ(xi − µ)² } × exp{ −(1/(2σo²)) (µ − µo)² }
Introduction
∝ exp{ −(1/2) [(nσo² + σ²)/(σo²σ²)] [µ² − 2µ (σo² Σxi + µo σ²)/(nσo² + σ²)] }

∝ exp{ −(1/2) [(nσo² + σ²)/(σo²σ²)] [µ − (σo² Σxi + µo σ²)/(nσo² + σ²)]² }

We can recognize this as a normal kernel with mean
µ_post = (σo² Σxi + µo σ²)/(nσo² + σ²) and variance
σ²_post = [(nσo² + σ²)/(σo²σ²)]^(−1) = σo²σ²/(nσo² + σ²).
Thus

µ|x ∼ N( (σo² Σxi + µo σ²)/(nσo² + σ²), σo²σ²/(nσo² + σ²) ).
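A small Python sketch of this normal-normal update (the data, µo, σo², and the known σ² below are illustrative assumptions):

import numpy as np

# Hypothetical data, known sampling variance, and prior hyperparameters
x = np.array([4.8, 5.6, 5.1, 4.3, 5.9])
sigma2 = 1.0               # known sigma^2
mu0, sigma02 = 0.0, 10.0   # prior mean and variance of mu

n = len(x)
# Posterior mean and variance from the formulas above
post_mean = (sigma02 * x.sum() + mu0 * sigma2) / (n * sigma02 + sigma2)
post_var = sigma02 * sigma2 / (n * sigma02 + sigma2)
print(post_mean, post_var)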
Introduction
ii) Suppose µ is known and σ² is unknown. Let τ = 1/σ². τ is
often called the precision parameter. Suppose
τ ∼ gamma(δo/2, γo/2). Thus

π(τ) ∝ τ^(δo/2 − 1) exp{ −τγo/2 }.

Let us derive the posterior distribution of τ.

p(τ|x) ∝ τ^(n/2) exp{ −(τ/2) Σ(xi − µ)² } × τ^(δo/2 − 1) exp{ −τγo/2 }

p(τ|x) ∝ τ^((n + δo)/2 − 1) exp{ −(τ/2) (γo + Σ(xi − µ)²) }
Introduction
Thus

τ|x ∼ gamma( (n + δo)/2, (γo + Σ(xi − µ)²)/2 ).
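A brief Python sketch of this gamma update for the precision (again with made-up data and hyperparameters; note that scipy parameterizes the gamma by shape and scale = 1/rate):

import numpy as np
from scipy.stats import gamma

# Hypothetical data with known mean mu, and a gamma(delta_o/2, gamma_o/2) prior on tau
x = np.array([4.8, 5.6, 5.1, 4.3, 5.9])
mu = 5.0
delta_o, gamma_o = 2.0, 2.0

n = len(x)
shape = (n + delta_o) / 2.0
rate = (gamma_o + np.sum((x - mu) ** 2)) / 2.0

post_tau = gamma(a=shape, scale=1.0 / rate)
print("posterior mean of tau:", post_tau.mean())
print("posterior 95% interval for tau:", post_tau.interval(0.95))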
iii) Now suppose µ and σ² are both unknown. Suppose we
specify the joint prior

π(µ, τ) = π(µ|τ)π(τ)

where

µ|τ ∼ N(µo, τ^(−1) σo²)
τ ∼ gamma(δo/2, γo/2).
Introduction
The joint posterior density of (µ, τ) is given by

p(µ, τ|x) ∝ τ^(n/2) exp{ −(τ/2) Σ(xi − µ)² }
          × τ^(1/2) exp{ −(τ/(2σo²)) (µ − µo)² }
          × τ^(δo/2 − 1) exp{ −τγo/2 }
          = τ^((n + δo + 1)/2 − 1) exp{ −(τ/2) [γo + (µ − µo)²/σo² + Σ(xi − µ)²] }.
The joint posterior does not have a clearly recognizable form. Thus,
we need to compute p(x) by brute force.
Introduction
p(x) ∝ ∫_0^∞ ∫_{−∞}^{∞} τ^((n + δo + 1)/2 − 1)
         × exp{ −(τ/2) [γo + (µ − µo)²/σo² + Σ(xi − µ)²] } dµ dτ

     = ∫_0^∞ τ^((n + δo + 1)/2 − 1) exp{ −(τ/2) (γo + µo²/σo² + Σxi²) }
         × ( ∫_{−∞}^{∞} exp{ −(τ/2) [µ²(n + 1/σo²) − 2µ(Σxi + µo/σo²)] } dµ ) dτ
Introduction
The integral with respect to µ can be evaluated by completing
the square:

∫_{−∞}^{∞} exp{ −(τ(n + 1/σo²)/2) [µ − (Σxi + µo/σo²)/(n + 1/σo²)]² }
           × exp{ τ(Σxi + µo/σo²)² / (2(n + 1/σo²)) } dµ

= exp{ τ(Σxi + µo/σo²)² / (2(n + 1/σo²)) } (2π)^(1/2) τ^(−1/2) (n + 1/σo²)^(−1/2).
Introduction
Now we need to evaluate

∫_0^∞ τ^((n + δo)/2 − 1) (2π)^(1/2) (n + 1/σo²)^(−1/2)
    × exp{ −(τ/2) [γo + µo²/σo² + Σxi²] }
    × exp{ τ(Σxi + µo/σo²)² / (2(n + 1/σo²)) } dτ

= (2π)^(1/2) (n + 1/σo²)^(−1/2) ∫_0^∞ τ^((n + δo)/2 − 1)
    × exp{ −(τ/2) [γo + µo²/σo² + Σxi² − (Σxi + µo/σo²)²/(n + 1/σo²)] } dτ
Introduction
= (2π)^(1/2) Γ((n + δo)/2) 2^((n + δo)/2) (n + 1/σo²)^(−1/2)
    / [γo + µo²/σo² + Σxi² − (Σxi + µo/σo²)²/(n + 1/σo²)]^((n + δo)/2)
≡ p*(x).

Thus,

p(x) = [ (γo/2)^(δo/2) (2π)^(−(n + 1)/2) σo^(−1) / Γ(δo/2) ] p*(x).
Introduction
The joint posterior density p(µ, τ|x) can also be obtained in this
case by writing p(µ, τ|x) = p(µ|τ, x)p(τ|x).
Exercise: Find p(µ|τ, x) and p(τ|x).
It is of great interest to find the marginal posterior distributions
of µ and τ.

p(µ|x) = ∫_0^∞ p(µ, τ|x) dτ
       ∝ ∫_0^∞ τ^((n + δo + 1)/2 − 1) exp{ −(τ/2) [γo + µo²/σo² + Σxi²] }
           × exp{ −(τ/2) [µ²(n + 1/σo²) − 2µ(Σxi + µo/σo²)] } dτ
Introduction
= ∫_0^∞ τ^((n + δo + 1)/2 − 1) exp{ −(τ/2) [γo + µo²/σo² + Σxi²] }
    × exp{ −(τ(n + 1/σo²)/2) [µ − (Σxi + µo/σo²)/(n + 1/σo²)]² }
    × exp{ τ(Σxi + µo/σo²)² / (2(n + 1/σo²)) } dτ

Let a = (Σxi + µo/σo²)/(n + 1/σo²). Then, we can write the integral as
Introduction
= ∫_0^∞ τ^((n + δo + 1)/2 − 1)
    × exp{ −(τ/2) [γo + µo²/σo² + Σxi² + (n + 1/σo²)(µ − a)² − (n + 1/σo²)a²] } dτ

= Γ((n + δo + 1)/2) 2^((n + δo + 1)/2)
    / [γo + µo²/σo² + Σxi² + (n + 1/σo²)(µ − a)² − (n + 1/σo²)a²]^((n + δo + 1)/2)

∝ [ 1 + c(µ − a)²/(b − ca²) ]^(−(n + δo + 1)/2),

where c = n + 1/σo² and b = γo + µo²/σo² + Σxi². We recognize this
kernel as that of a t-distribution with location parameter a, dispersion
parameter [(n + δo)c/(b − ca²)]^(−1) = (b − ca²)/((n + δo)c), and n + δo
degrees of freedom.
Introduction
Definition Let y = (y1, ..., yp)′ be a p × 1 random vector. Then y
is said to have a p-dimensional multivariate t distribution with d
degrees of freedom, location parameter m, and dispersion matrix
Σ (p × p) if y has density

p(y) = [ Γ((d + p)/2) (πd)^(−p/2) |Σ|^(−1/2) / Γ(d/2) ]
       × [ 1 + (1/d)(y − m)′ Σ^(−1) (y − m) ]^(−(d + p)/2).

We write this as y ∼ Sp(d, m, Σ). In our problem, p = 1,
d = n + δo, m = a, Σ = (b − ca²)/((n + δo)c), and Σ^(−1) = (n + δo)c/(b − ca²).
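For p = 1 this density is a scaled Student t, which gives a quick numerical check of the formula (the values of d, m, and Σ below are arbitrary illustrative choices):

import numpy as np
from math import lgamma
from scipy import stats

# Arbitrary illustrative values for the p = 1 case
d, m, Sigma = 7.0, 0.5, 2.0
y = np.linspace(-4, 5, 7)

# Density of S1(d, m, Sigma) computed from the definition above
log_const = lgamma((d + 1) / 2) - lgamma(d / 2) - 0.5 * np.log(np.pi * d * Sigma)
p_y = np.exp(log_const) * (1 + (y - m) ** 2 / (d * Sigma)) ** (-(d + 1) / 2)

# Same density via scipy: Student t with d df, location m, scale sqrt(Sigma)
print(np.allclose(p_y, stats.t.pdf(y, df=d, loc=m, scale=np.sqrt(Sigma))))  # True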
Introduction
The marginal posterior distribution of τ is given by

p(τ|x) = ∫_{−∞}^{∞} p(µ, τ|x) dµ
       ∝ τ^((n + δo + 1)/2 − 1) exp{ −(τ/2) [γo + µo²/σo² + Σxi²] }
           × exp{ (τ/2)(n + 1/σo²)a² }
           × ∫_{−∞}^{∞} exp{ −(τ(n + 1/σo²)/2) (µ − a)² } dµ
       ∝ τ^((n + δo + 1)/2 − 1) τ^(−1/2) exp{ −(τ/2) [γo + µo²/σo² + Σxi² − (n + 1/σo²)a²] }
       = τ^((n + δo)/2 − 1) exp{ −(τ/2) [γo + µo²/σo² + Σxi² − (n + 1/σo²)a²] }.

Thus,

τ|x ∼ gamma( (n + δo)/2, (1/2)[γo + µo²/σo² + Σxi² − (n + 1/σo²)a²] ).
Introduction
Remark A t distribution can be obtained as a scale mixture of
normals. That is, if x|τ ∼ Np(m, τ^(−1)Σ) and τ ∼ gamma(δo/2, γo/2),
then

p(x) = ∫_0^∞ p(x|τ)π(τ) dτ

is the Sp(δo, m, (γo/δo)Σ) density. That is, x ∼ Sp(δo, m, (γo/δo)Σ).
Note:

p(x|τ) = (2π)^(−p/2) τ^(p/2) |Σ|^(−1/2) exp{ −(τ/2)(x − m)′ Σ^(−1) (x − m) }.
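A short Monte Carlo sketch of this fact for p = 1 (the δo, γo, m, Σ values are arbitrary illustrative choices): draws of x obtained by first drawing τ and then drawing x|τ should agree with the corresponding t distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Arbitrary illustrative values (p = 1)
delta_o, gamma_o = 5.0, 3.0   # tau ~ gamma(delta_o/2, gamma_o/2): shape-rate
m, Sigma = 1.0, 2.0

# Scale mixture: draw tau, then x | tau ~ N(m, Sigma / tau)
tau = rng.gamma(shape=delta_o / 2.0, scale=2.0 / gamma_o, size=200_000)
x = rng.normal(loc=m, scale=np.sqrt(Sigma / tau))

# Compare with S1(delta_o, m, (gamma_o/delta_o) * Sigma): a Student t with
# delta_o df, location m, and scale sqrt((gamma_o/delta_o) * Sigma)
t_scale = np.sqrt((gamma_o / delta_o) * Sigma)
print(np.quantile(x, [0.05, 0.5, 0.95]))
print(stats.t.ppf([0.05, 0.5, 0.95], df=delta_o, loc=m, scale=t_scale))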
Introduction
Remark Note that in Examples 1 and 3 i), ii), the posterior
distribution is of the same family as the prior distribution. When
the posterior distribution of a parameter is of the same family as the
prior distribution, such prior distributions are called conjugate
prior distributions.
In Example 1, a Beta prior on θ led to a Beta posterior for θ. In
Example 3 i), a normal prior for µ yielded a normal posterior for µ.
In Example 3 ii), a gamma prior for τ yielded a gamma posterior
for τ. More on conjugate priors later.
Advantages of Bayesian Methods
1. Interpretation
Having a distribution for your unknown parameter θ is easier to
understand than a point estimate and a standard error. In addition,
we consider the following example of a confidence interval. A
95% confidence interval for a population mean θ can be written
as

x̄ ± (1.96) s/√n.

Thus, for the realized interval (a, b) computed from the data, P(a < θ < b) ≠ 0.95.
Advantages of Bayesian Methods
1. Interpretation We have to rely on a repeated-sampling
interpretation to make a probability statement like the one above.
Thus, after observing the data, we cannot make a statement like
"the true θ has a 95% chance of falling in

x̄ ± (1.96) s/√n,"

although we are tempted to say this.
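By contrast, a Bayesian posterior interval supports exactly this kind of statement. A small Python sketch using the Beta posterior from Example 1 (the data and the Beta(1, 1) prior are illustrative choices, not taken from these notes):

import numpy as np
from scipy.stats import beta

# Hypothetical Bernoulli data and a Beta(1, 1) prior
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n, s = len(x), int(x.sum())
post = beta(s + 1, n - s + 1)

# Equal-tail 95% posterior interval: P(a < theta < b | x) = 0.95 by construction
a, b = post.interval(0.95)
print(f"P({a:.3f} < theta < {b:.3f} | x) = 0.95")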
Advantages of Bayesian Methods
2. Bayes Inference Obeys the Likelihood Principle
The likelihood principle: If two distinct sampling plans (designs)
yield proportional likelihood functions for θ, then inference about
θ should be identical from these two designs. Frequentist
inference does not obey the likelihood principle, in general.
Example Suppose in 12 independent tosses of a coin, 9 heads
and 3 tails are observed. I wish to test the null hypothesis
Ho: θ = 1/2 vs. H1: θ > 1/2, where θ is the true probability of heads.
Advantages of Bayesian Methods
Consider the following 2 choices for the likelihood function:
a) Binomial: n = 12 (fixed), x = number of heads. x ∼
Binomial(12, θ) and the likelihood is

L1(θ) = (n choose x) θ^x (1 − θ)^(n − x)
      = (12 choose 9) θ^9 (1 − θ)^3

b) Negative Binomial: n is not fixed; flip until the third
tail appears. Here x is the number of heads observed before the
third tail, so x ∼ NegBinomial(r = 3, θ).
Advantages of Bayesian Methods
L2(θ) = (r + x − 1 choose x) θ^x (1 − θ)^r
      = (11 choose 9) θ^9 (1 − θ)^3

Note that L1(θ) ∝ L2(θ). From a Bayesian perspective, the
posterior distribution of θ is the same under either design. That
is,

p(θ|x) = L1(θ)π(θ) / ∫ L1(θ)π(θ)dθ ≡ L2(θ)π(θ) / ∫ L2(θ)π(θ)dθ.
Advantages of Bayesian Methods
However, under the frequentist paradigm, inferences about θ are
quite different under each design. The p-value based on the
binomial likelihood is

P(x ≥ 9 | θ = 1/2) = Σ_{j=9}^{12} (12 choose j) θ^j (1 − θ)^(12 − j) = 0.075,

while for the negative binomial likelihood, the p-value is

P(x ≥ 9 | θ = 1/2) = Σ_{j=9}^{∞} (2 + j choose j) θ^j (1 − θ)^3 = 0.0325.

The two designs lead to different decisions at the 0.05 level, rejecting Ho
under design b) (negative binomial) but not under design a) (binomial).
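The two tail probabilities can be checked in a few lines of Python; the exact values come out near 0.073 and 0.033, close to the rounded figures quoted above.

from math import comb

theta = 0.5

# Design a): binomial with n = 12; P(x >= 9 | theta = 1/2)
p_binom = sum(comb(12, j) * theta**j * (1 - theta)**(12 - j) for j in range(9, 13))

# Design b): negative binomial with r = 3 tails; P(x >= 9 | theta = 1/2),
# computed as 1 - P(x <= 8) by summing the pmf over x = 0, ..., 8
p_negbin = 1 - sum(comb(2 + j, j) * theta**j * (1 - theta)**3 for j in range(0, 9))

print(p_binom, p_negbin)   # roughly 0.073 and 0.033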
Advantages of Bayesian Methods
3. Bayesian Inference Does not Lead to Absurd Results
Absurd results can be obtained when doing UMVUE estimation.
Suppose x ∼ Poisson(λ), and we want to estimate θ = e^(−2λ),
0 < θ < 1. It can be shown that the UMVUE of θ is (−1)^x. Thus,
if x is even the UMVUE of θ is 1, and if x is odd the UMVUE of
θ is −1!!
Advantages of Bayesian Methods
4. Bayes Theorem is a formula for learning
Suppose you conduct an experiment and collect observations
x1 , ..., xn . Then
p(θ|x) = p(x|θ)π(θ) / ∫_Θ p(x|θ)π(θ) dθ,

where x = (x1, ..., xn). Suppose you collect an additional
observation xn+1 in a new study. Then,

p(θ|x, xn+1) = p(xn+1|θ)π(θ|x) / ∫_Θ p(xn+1|θ)π(θ|x) dθ.

So your prior in the new study is the posterior from the previous study.
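A small Python sketch of this updating rule with the conjugate Beta-Bernoulli model of Example 1 (data and hyperparameters below are illustrative): updating one observation at a time gives exactly the same posterior as using all of the data at once.

import numpy as np

# Hypothetical Bernoulli observations and a Beta(alpha, lambda) prior
x = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])
alpha, lam = 2.0, 2.0

# Sequential updating: yesterday's posterior is today's prior
a, b = alpha, lam
for xi in x:
    a, b = a + xi, b + (1 - xi)

# Batch updating with all of the data at once
a_batch, b_batch = alpha + x.sum(), lam + (len(x) - x.sum())

print((a, b), (a_batch, b_batch))   # identical Beta parameters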
Advantages of Bayesian Methods
5. Bayes inference does not require large sample theory
With modern computing advances, “exact” calculations can be
carried out using Markov chain Monte Carlo (MCMC) methods.
Bayes methods do not require asymptotics for valid inference.
Thus small sample Bayesian inference proceeds in the same way
as if one had a large sample.
Advantages of Bayesian Methods
6. Bayes inference often has frequentist inference as a special
case
Often one can obtain frequentist answers by choosing a uniform
prior for the parameters, i.e. π(θ) ∝ 1, so that
p(θ|x) ∝ L(θ).
In such cases, frequentist answers can be obtained from such a
posterior distribution.