Accept-Reject Algorithm

advertisement
Accept-Reject Algorithm
Let π(x) be out Target density, i.e. the density we want to sample from.
Accept-Reject Algorithm
Choose initial value x(0) .
For t = 1, 2, . . . , T
1. Generate Proposal:
y ∼ q(x(t−1) , y).
2. Accept proposal with probability:
a(x(t−1) , y)
otherwise reject it.
3. If accepting: x(t) = y
4. If rejecting:
x(t) = x(t−1)
This algorithm generate a realisation of a time homogeneous Markov
chain.
How do we choose q(x, y) and a(x, y) so that the unique invariant
distribution of the resulting Markov chain is given by π(x)?
1/1
The Metropolis-Hastings algorithm
How to choose q(x, y) and a(x, y)?
One choice leads to the Metropolis-Hasting algorithm. The user specifies
a proposal kernel q(x, y). The algorithm then “automatically” chooses
the correct acceptance probability.
Metropolis-Hastings algorithm
Choose any proposal kernel q(x, y)
Define the Hastings ratio
H(x, y) =
π(y)q(y, x)
,
π(x)q(x, y)
where H(x, y) = ∞ if π(x)a(x, y) = 0.
The acceptance probability is
a(x, y) = min {1, H(x, y)} .
2/1
The Metropolis algorithm
A special case of the MH-algorithm is when the proposal kernel is
symmetric:
q(x, y) = q(y, x)
In this case the Hastings-ratio simplifies to
H(x, y) =
π(y)
π(y)q(y, x)
=
.
π(x)q(x, y)
π(x)
Example: The most common example, is when the proposal is normal
distributed with x as the mean value, and τp as the precision:
r
τp
1
exp − τp (y − x)2 .
q(x, y) =
2π
2
Clearly, q(x, y) = q(y, x).
3/1
Burn-in
Generate X (0) ∼ π0 (x), an initial distribution, typically different
from π(x).
Create irreducible Markov chain X (0) , X (1) , X (2) , . . . with π(x) as
invariant distribution.
For small values of t the distribution of X (t) can be quite different
from π(x).
As a consequence, the sample mean
T
1 X (t)
X
T t=1
is biased, i.e. E
h P
T
1
Instead consider
T
t=1
i
6 µ.
X (t) =
T
1 X (m+t)
X
,
T t=1
where m is the length of the burn-in
4/1
The effect of burn-in
0.0
−15
0.2
−5
0
0.4
Histogram of x[1:200]
100
200
300
400
500
−15
−10
−5
0
Histogram of x
0.0
−15
0.2
−5
0
0.4
0
100
200
300
400
500
−15
−10
−5
0
Histogram of x[s]
0.0
−15
0.2
−5
0
0.4
0
0
100
200
300
400
500
−15
−10
−5
0
5/1
Variance of the sample mean: IID Case
Assume we have independent samples X (1) , X (2) , . . . , X (T ) from π(x).
Assume E[X (t) ] = µ and V ar[X (t) ] = σ 2 .
The sample mean is
T
1 X (t)
X
T t=1
For the sample mean we have the following results.
#
"
T
1 X (t)
X
=µ
E
T t=1
"
#
T
1 X (t)
1
Var
X
= σ2
T t=1
T
"
#
T
1 X (t)
T · Var
X
= σ2 .
T t=1
6/1
Variance of the sample mean: Markov Chain Case
Assume X (1) , X (2) , X (3) , . . . is an irreducible Markov chain with
invariant distribution with density π(x).
Further, assume that X (1) ∼ π(x) which implies that X (t) ∼ π(X) for all
t = 2, 3, 4, . . ., which in turn implies that E[X (t) ] = µ and
Var[X (t) ] = σ 2 .
The expected value of the sample mean is (again)
#
"
T
1 X (t)
X
= µ.
E
T t=1
So the expected value of the sample mean is unaffected by the shift from
IID sample to Markov chain.
7/1
Variance of the sample mean: Markov Chain Case
Regarding the variance we have
#
"
T
1 X (t)
X
→ σ2
T · Var
T t=1
1+2
∞
X
i=1
ρi
!
,
where
ρi = Corr(X
(t)
,X
(t+i)
E (X (t) − µ)(X (t+i) − µ)
)=
σ2
is the lag-i auto-correlation.
P∞
We call σ 2 (1 + 2 i=1 ρi ) the asymptotic variance.
P∞
Trying to get τ = 1 + 2 i=1 ρi to be as small as possible seems like
a good idea.
If we just want to estimate µ this is a brilliant idea.
8/1
Tuning
Assume the proposal kernel is
r
τp
1
2
q(x, y) =
exp − τp (y − x) .
2π
2
Now, τp is an “algorithm parameter” that we need to choose.
What is a good choice of τp ? This is an example of tuning an algorithm.
Example: Assume target density is normal
r
τ
1
π(x) =
exp(− x2 )
2π
2
The optimal choice (in terms of reducing the asymptotic variance) is so
that the acceptance probability is around 40%.
If π(x1 , x2 , . . . , xk ) is multivariate normal, the optimal choice of τp
corresponds to an acceptance probability of 0.234.
9/1
Tuning, Acceptance probability and Auto-correlation
Series x$x
Histogram of x$x
−1.5
1.0
0.6
0.4
0.0
0.0
0.2
0.5
−0.5
1.0
0.8
0.5 1.0
1.5
acceptance prob.: 0.024
0
50
100
150
200
250
−1.5 −1.0 −0.5
0.0
0.5
1.0
1.5
100
150
200
250
20 x$x
Series
30
40
1.0
1
2
0.0 0.2 0.4 0.6 0.8 1.0
0.30
3
0.20
0.10
0.00
200
10
0.6
0
2
1
150
0
0.4
−1
0
−1
100
40
Histogram of x$x
−2
50
30
0.0
−2
acceptance prob.: 0.8
0
20 x$x
Series
0.2
0.2
0.1
0.0
50
10
0.8
0.5
0.4
0.3
1
0
−1
−2
0
0
Histogram of x$x
2
acceptance prob.: 0.22
250
−2
−1
0
1
2
3
0
5
10
15
20
25
30
10/1
A bivariate case example
0.99 1
1
Target: N (0, 0),
0.99
Gibbs sampling
3
0
−2
−3
−2
−2
−1
−1
−1
0
0
1
1
1
2
2
2
Acceptance prob.: 1
1
2
−2
−1
0
1
2
−1
0
1
2
0.8
2
−3
0.0
−2
0.2
−1
0.4
0
0.6
1
2
1
−3
−2
−1
0
−2
1.0
0
3
−1
3
−2
0
500
1000
1500
2000
2500
0
500
1000
1500
2000
2500
0
5
10
15
20
25
30
11/1
A bivariate case example
1
0.99 Target: N (0, 0),
0.99
1
100
0
Proposal covariance
0
100
Acceptance prob.: 0.0016
3
0
−2
−3
−2
−2
−1
−1
−1
0
0
1
1
1
2
2
2
−1
0
1
2
−2
−1
0
1
2
−3
−2
0
5
−1
0
1
2
3
0.8
0.6
0.2
0.0
0.0
0.0
0.5
0.5
0.4
1.0
1.0
1.5
1.5
1.0
−2
0
500
1000
1500
2000
2500
0
500
1000
1500
2000
2500
10
15
20
25
30
12/1
A bivariate case example
1
0.99 Target: N (0, 0),
0.99
1
1 0
Proposal covariance:
0 1
Acceptance prob.: 0.1196
3
0
−2
−3
−2
−2
−1
−1
−1
0
0
1
1
1
2
2
2
0
1
2
−2
−1
0
1
2
−3
−2
0
5
−1
0
1
2
3
0
500
1000
1500
2000
2500
0.4
0.2
0.0
−2
−2
−1
−1
0
0
0.6
1
1
0.8
2
1.0
−1
2
−2
0
500
1000
1500
2000
2500
10
15
20
25
30
13/1
A bivariate case example
1
0.99 Target: N (0, 0),
0.99
1
0.01
0
Proposal covariance:
0
0.01
Acceptance prob.: 0.6892
3
0
−2
−3
−2
−2
−1
−1
−1
0
0
1
1
1
2
2
2
−1
0
1
2
−2
−1
0
1
2
−3
−2
0
5
−1
0
1
2
3
0
500
1000
1500
2000
2500
0.8
0.0
−1.0
−0.5
0.2
0.0
0.4
0.5
1.0
0.6
1.5
2.0
1.0
−2
0
500
1000
1500
2000
2500
10
15
20
25
30
14/1
A bivariate case example
0.99 1
2
1
0.99
Proposal covariance: 2.38
2
0.99
1
1
Target: N (0, 0),
0.99
Acceptance prob.: 0.352
3
0
−2
−3
−2
−2
−1
−1
−1
0
0
1
1
1
2
2
2
1
2
−2
−1
0
1
2
−2
0
5
−1
0
1
2
3
0.8
2
−3
0.0
−2
0.2
−1
0.4
0
0.6
1
2
1
−3
−2
−1
0
−3
1.0
0
3
−1
3
−2
0
500
1000
1500
2000
2500
0
500
1000
1500
2000
2500
10
15
20
25
30
15/1
Optimum proposal
Assume target is a d-dimensional normal:
π(x) = Nd (µ, Σ)
and the proposal is normal:
q(x, y) = Nd (x, Σq )
Then the optimum choice of proposal variance Σq is
Σq =
2.382
Σ
d
Catch: Σ is unknown.
Solutions: Pilot run or adaptive MCMC.
16/1
Reminder: The Gibbs sampler
Aim: We want to sample θ = (θ1 , θ2 , . . . , θk ) from a pdf/pf π(θ).
Assume θi ∈ Ωi ⊆ Rdi . Then, θ ∈ Ω1 × Ω2 × · · · × Ωk ⊆ Rd1 +d2 +···+dk
We can now (under some conditions) generate an approximate sample
from π(θ) as follows:
Gibbs Sampler
(0)
(0)
(0)
Choose initial value θ (0) = (θ1 , θ2 , . . . , θk ).
For t = 1, 2, . . . , T
◮
For i = 1, 2, . . . , k
(t)
1. Generate θi
(t)
(t)
(t−1)
(t−1)
∼ π(θi |θ1 , . . . , θi−1 , θi+1 , . . . , θk
)
Question: What if we cannot generate samples from one or more of the
full conditional distributions?
Solution: Use a Metropolis-Hastings update instead!
17/1
Metropolis within Gibbs (MwG)
Gibbs Sampler
(0)
(0)
(0)
Choose initial value θ (0) = (θ1 , θ2 , . . . , θk ).
For t = 1, 2, . . . , T
◮
For i = 1, 2, . . . , k
(t)
(t)
(t−1)
1. Generate proposal θi′ ∼ q(θi′ |θ1 , . . . , θi−1 , θi
2. Calculate Hastings ratio
(t)
(t−1)
H(θi
, θi′ ) =
(t)
(t−1)
(t−1)
, . . . , θk
(t−1)
π(θi′ |θ1 , . . . , θi−1 , θi+1 , . . . , θk
)
)
(t−1)
(t−1)
(t−1) (t)
(t)
π(θi
|θ1 , . . . , θi−1 , θi+1 , . . . , θk
)
×
(t−1)
(t)
(t)
(t−1)
|θ1 , . . . , θi−1 , θi′ , . . . , θk
)
(t)
(t)
(t−1)
(t−1)
q(θi′ |θ1 , . . . , θi−1 , θi
, . . . , θk
)
q(θi
3. With probability
n
o
(t−1) ′
min 1, H(θi
, θi )
(t)
set θi
(t)
= θi′ (accept) otherwise set θi
(t−1)
= θi
(reject).
18/1
Metropolis within Gibbs: Comments
Notice that each of the k component updates have π(θ) as their
invariant distribution.
Hence the MwG algorithm has π(θ) as its invariant distribution.
Irreducibility is not automatically fulfilled.
Special case: Assume that q(θi |θ−i ) is given by the full conditional:
(t)
(t)
(t−1)
(t−1)
q(θi′ |θ1 , . . . , θi , θi+1 , . . . , θk
=
(t−1)
Then H(θi
)
(t)
(t)
(t−1)
(t−1)
π(θi′ |θ1 , . . . , θi−1 , θi+1 , . . . , θk
).
, θi′ ) = 1, hence all proposals are accepted.
In fact, this is exactly the usual Gibbs sampler!
19/1
Prior predictions
Predicting future observations without data.
Notation: Let x̃ denote a prediction.
Assume:
Data model:
x̃|θ ∼ π(x|θ)
Prior:
π(θ)
The above assumptions implies a joint distribution of data, x, and
parameter, θ:
π(x, θ) = π(x|θ)π(θ).
We are interested in predicting a future observation, i.e. the (marginal)
distribution of x, i.e. when ignoring θ, i.e.
x̃ ∼ π(x),
where
π(x) =
Z
π(x|θ)π(θ)dθ.
20/1
Prior prediction: Normal case, τ known
Assume:
Data model: π(x|µ) ∼ N (µ, τ ).
Prior:
π(µ) = N (µ0 , τ0 )
Prior predictive distribution
Z
π(x) = π(x|µ)π(µ)dµ
r
Z r
1
1
τ
τ0
2
2
exp − τ (x − µ)
exp − τ0 (µ − µ0 ) dµ
=
2π
2
2π
2
r
τ τ0 1
1 τ τ0
=
exp −
(x − µ0 )2
τ + τ0 2π
2 τ + τ0
τ τ0
= N µ0 ,
.
τ + τ0
21/1
dnorm(x.seq, mu0, 1/sqrt(tau0))
Prior predictive distribution
0.10
Prior
Prior predictive
Predictive if µ =170
0.08
0.06
0.04
0.02
0.00
140
150
160
170
180
190
200
22/1
Simulating the prior predictive distribution
If π(x) is difficult to derive or not easily simulated from directly we can
use another strategy.
Simulating the prior predictive distribution can be done as follows:
1. Generate parameter from prior:
2. Conditional on θ generate x:
θ ∼ π(θ)
x̃ ∼ π(x|θ)
Now x̃ is a sample from the prior predictive distribution.
23/1
Posterior prediction
Predicting future observation given data.
Assume:
Data model:
x|θ ∼ π(x|θ)
Prior:
π(θ)
The joint distribution of predicted data x̃, data x and parameter θ is
π(x̃, x, θ) = π(x̃|θ)π(x|θ)π(θ)
∝ π(x̃|θ)π(θ|x).
Notice: Here π(x̃|θ) and π(x|θ) represent the same distribution.
The posterior predictive distribution is the (marginal) distribution of x̃
conditional on data x:
Z
Z
Z
π(x̃, θ, x)
dθ ∝ π(x̃|θ)π(x|θ)π(θ)dθ
π(x̃|x) = π(x̃, θ|x)dθ =
π(x)
Z
∝ π(x̃|θ)π(θ|x)dθ
24/1
Hence, the posterior acts as a prior for the predictions.
Posterior prediction: Normal case, τ known
iid
Data model: X1 , X2 , . . . , Xn ∼ N (µ, τ ).
Prior: π(µ) = N (µ0 , σ0 ).
x̄+τ0 µ0
Posterior: π(µ|x) = N nτnτ
+τ0 , nτ + τ0 .
Recall that the prior prediction (of one observation) was
τ0 τ
x̃ ∼ N µ0 ,
τ + τ0
Since the posterior is the “prior” for the posterior prediction we have
nτ x̄ + τ0 µ0 (nτ + τ0 )τ
,
x̃|x ∼ N
nτ + τ0
τ + nτ + τ0
When n is large we have x̃|x
approx
∼
N (x̄, τ ).
25/1
dnorm(x.seq, mu0, 1/sqrt(tau0))
dnorm(x.seq, mu0, 1/sqrt(tau0))
Prior and posterior predictive distributions
0.10
Prior
Prior predictive
Predictive if µ =170
0.08
0.06
0.04
0.02
0.00
140
0.10
150
160
170
180
190
200
170
180
190
200
Posterior
Prior predictive
Posterior predictive
0.08
0.06
0.04
0.02
0.00
140
150
160
26/1
Posterior prediction using a graph
27/1
Model checking
Idea: If the model is correct posterior predictions of the data should look
similar to the observed data.
Difficulty: Who to choose a good measure of “similarity”.
Example: We have observed a sequence of n = 20 zeros and ones:
11000001111100000000
Model: X1 , X2 , . . . , X20 are IID and P (Xi = 1) = p.
Prior: π(π) = Be(α, β).
Posterior: π(π|x) = Be(#1 + α, #0 + β).
Model checking: We simulate posterior predictive realisations
X̃(1) , X̃(2) , . . . , X̃(N ) , where
(i)
(i)
(n)
X̃(i) = (X1 , X2 , . . . , X2 ).
If these vectors look “similar” to the data above, the model is probably
ok.
28/1
Model checking: First attempt (A failure)
Define summary function
s(x) = Number of ones in the sequence x
Histogram for s(x̃(i) ) for N = 10, 000 independent posterior predictions:
0
400
800
1400
Histogram of proportion of ones
0.0
0.2
0.4
0.6
0.8
Clearly the observed number of ones is in no way unusual compared to
the posterior predictions.
This is really expected — so we need another summary function s(x).
29/1
Model checking: Second attempt (A success)
Define summary function
s(x) = Number of switches between ones and zeros in x
In the data the number of switches is 3:
11000001111100000000
Histogram for s(x̃(i) ) for N = 10, 000 independent posterior predictions:
0
500 1000
Histogram of number of switches
0
5
10
15
20
Only around 1.7% of the posterior prediction have 3 or fewer switches.
This suggests that the model assumption of independence is questionable.
30/1
Download