Lectures 1-2

MSc MT15. Further Statistical Methods: MCMC
Lecturer: Geoff Nicholls
Lectures 1-2: concepts and motivation; inversion; transformation; rejection.
Notes and Practicals will be made available at
www.stats.ox.ac.uk/~nicholls/MScMCMC15
“Monte Carlo Simulation: An analytical technique for solving
a problem by performing a large number of trial runs, called
simulations, and inferring a solution from the collective results
of the trial runs.”, glossary at www.nasdaq.com.
“Anyone who can do solid statistical programming will never miss
a meal.”, Prof David Banks, 2008.
Why Monte Carlo?
We fit complex realistic models and analyze large data sets.
Taking expectations is a fundamental operation in Statistics.
Expectations are integrals and integrals are hard.
Use simulation to do hard integrals.
Monte Carlo theory is applied probability.
Doing Monte Carlo on a computer is statistical computing.
Monte Carlo methods empower you to do a large chunk of
statistical inference.
Simulation: the basic idea
In many settings we wish to estimate expectations.
Suppose X ∈ Ω is a random variable (rv) with density X ∼ p(x),
f : Ω → ℜ is a function and we want to evaluate the expectation
E_p(f) = ∫_Ω f(x) p(x) dx.
If we have Xi, i = 1, 2, ..., n with Xi ∼ p then
f̄_n = n^{-1} ∑_{i=1}^n f(X_i)
is an unbiased estimator for Ep(f ).
If the f(X_i) are iid with 0 < var(f(X)) < ∞ and |E_p(f)| < ∞ then
(f̄_n − E_p(f)) / √(var(f)/n) → N(0, 1),
by the CLT. If z_{α/2} is the 1 − α/2 quantile of a standard normal
and
s² = (n − 1)^{-1} ∑_{i=1}^n (f(X_i) − f̄_n)²
is an unbiased estimator for var(f), we can report the 100(1 − α)% confidence interval
f̄_n ± z_{α/2} s / √n
for E_p(f).
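For concreteness, a minimal R sketch of this estimator and its CLT interval; rsampler and f below are placeholders for any sampler from p and any integrand:

# Monte Carlo estimate of E_p(f) with a CLT-based confidence interval.
# 'rsampler' and 'f' are placeholders: any sampler for p and any integrand f.
mc_ci <- function(n, rsampler, f, alpha = 0.05) {
  fx   <- f(rsampler(n))          # f(X_1), ..., f(X_n)
  fbar <- mean(fx)                # f-bar_n, unbiased for E_p(f)
  s    <- sd(fx)                  # sample standard deviation of f(X)
  z    <- qnorm(1 - alpha / 2)    # normal quantile z_{alpha/2}
  c(estimate = fbar,
    lower = fbar - z * s / sqrt(n),
    upper = fbar + z * s / sqrt(n))
}

# Example: E(X^2) for X ~ N(0, 1); the true value is 1.
mc_ci(1e5, rnorm, function(x) x^2)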
Example: X ∼ N (0, 1), what is a = P (sin(πX) > 0.5)?
Here P(sin(πX) > 0.5) = E(I_{sin(πX)>0.5}), where
I_{sin(πx)>0.5} = 1 if sin(πx) > 0.5 and 0 otherwise.
We fix n and draw X_i ∼ N(0, 1) iid, then form
â = n^{-1} ∑_{i=1}^n I_{sin(πX_i)>0.5}.
Exercise: show there is a CLT for â.
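A short R sketch of this estimate; the interval at the end relies on the CLT the exercise asks about:

n  <- 1e5
x  <- rnorm(n)                       # X_i ~ N(0, 1)
fx <- as.numeric(sin(pi * x) > 0.5)  # indicator of sin(pi X_i) > 0.5
a_hat <- mean(fx)                    # Monte Carlo estimate of a
se    <- sd(fx) / sqrt(n)            # estimated standard error
c(a_hat, a_hat - 1.96 * se, a_hat + 1.96 * se)   # estimate and approximate 95% CI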
Exercise: suppose 0 ≤ f (x) ≤ M for all a ≤ x ≤ b.
The hit-and-miss method estimates
F = ∫_a^b f(x) dx
by dropping points uniformly in the box [a, b] × [0, M] and estimating F using
F̂_hm = M(b − a) p̂,
where p̂ is the proportion of points in the box falling under the
curve of f(x). We could alternatively note that
F = (b − a) E(f(X))
when X ∼ U[a, b], so we can use F̂_av = (b − a) f̄_n with f̄_n as
above. Show that var(F̂_av) ≤ var(F̂_hm). Why is this desirable?
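For intuition, a small R comparison of the two estimators; the choice f(x) = x² on [0, 1] with M = 1 is just an illustration, not from the notes:

# Compare hit-and-miss and sample-average estimators of F = integral of f over [a, b].
f <- function(x) x^2
a <- 0; b <- 1; M <- 1
n <- 1e4

x <- runif(n, a, b)
# Hit-and-miss: drop points uniformly in [a, b] x [0, M], count those under the curve.
y    <- runif(n, 0, M)
F_hm <- M * (b - a) * mean(y < f(x))
# Sample average: F = (b - a) E f(X) with X ~ U[a, b].
F_av <- (b - a) * mean(f(x))
c(hit_and_miss = F_hm, average = F_av, truth = 1/3)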
Example: In Bayesian inference for θ|y with posterior p(θ|y),
likelihood L(θ; y) and prior p(θ),
p(θ|y) = L(θ; y) p(θ) / m(y),
where m(y) = ∫ L(θ; y) p(θ) dθ is the marginal likelihood. The
posterior mean, µ_f = E(f(θ)|y), for a function f(θ),
µ_f = ∫ f(θ) L(θ; y) p(θ) / m(y) dθ,
is usually hopelessly intractable. We can use Monte Carlo to
sample θ^{(i)} ∼ p(·|y), i = 1, 2, ..., N and estimate µ_f using
f̄_N = N^{-1} ∑_{i=1}^N f(θ^{(i)}).
We will be interested in MC-methods that do not require evaluation of m(y) (as that’s hard too).
MCMC module overview. Our first task is to design an algorithm for generating the Xi’s: given a probability density or
mass function p(x) (the target distribution), give an algorithm
for simulating X ∼ p. There are several main classes of Monte
Carlo algorithms. They all have their strengths and weaknesses.
We will use
• the inversion and transformation methods (approximately 1
lecture),
• rejection sampling (1L),
• importance sampling (1L),
• Markov chain Monte Carlo (5L)
You will learn why they work, and what their relative advantages
are in different settings. You will learn to adapt these algorithms
to simulation for different target distributions.
I won't be covering Sequential Monte Carlo. This is a method
related to importance sampling. It is often very effective for
sampling distributions with some kind of time-series structure.
Our second task is to implement these algorithms (in R say). I
won't say much about implementation, just a few comments.
Our third task is to interpret the Monte Carlo output. We will
look at output analysis, especially when we discuss Importance
Sampling and MCMC.
Inversion
Say X ∈ Ω is a scalar rv with cdf F (x) = Pr(X ≤ x) at X = x
and we want to simulate X ∼ F .
Claim: suppose F(x) is continuous and strictly increasing in x.
If U ∼ U(0, 1) and X = F^{-1}(U) then X ∼ F.
Proof: Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x), so applying the strictly
increasing F to both sides of F^{-1}(U) ≤ x we have Pr(X ≤
x) = Pr(U ≤ F(x)), which is F(x) since U is uniform.
Remark: if we define
F^{-1}(u) = min{z : F(z) ≥ u}
then F^{-1}(U) ∼ F without F being continuous or strictly increasing.
Example: if we want X ∼ Exp(r), ie X ∼ p(x) with
p(x) = r exp(−rx), x ≥ 0,
then the CDF is
F(x) = 1 − exp(−rx)
and its inverse is
F^{-1}(u) = −(1/r) log(1 − u).
The algorithm is
U ∼ U (0, 1)
X ← − log(U )/r
and note I replaced 1 − U with U since U ∼ 1 − U .
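In R, this inversion sampler is a one-liner (a sketch; base R's rexp does the same job directly):

# Inversion sampler for Exp(r): X = -log(U)/r with U ~ U(0, 1).
rexp_inv <- function(n, r) -log(runif(n)) / r

x <- rexp_inv(1e4, r = 2)    # compare with rexp(1e4, rate = 2)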
Inversion: discrete random variables Suppose X ∼ F with X a
discrete rv with pmf p(x), x = 0, 1, 2, .... Consider the following
algorithm
U ∼ U(0, 1)
set X equal to the unique x satisfying ∑_{i=0}^{x−1} p(i) < U ≤ ∑_{i=0}^{x} p(i),
with ∑_{i=0}^{x−1} p(i) ≡ 0 if x = 0.
Proof: It is easy to check that X ∼ p. (board presentation)
Remark: the cdf here is F(x) = ∑_{i=0}^{x} p(i) and the above algorithm is just x = F^{-1}(u) again, with the more general
definition of F^{-1} above.
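A minimal R sketch of this discrete inversion, assuming for the illustration that the pmf is supplied as a finite vector of probabilities p(0), ..., p(K):

# Discrete inversion: return the smallest x (0-based) with F(x) >= U.
# 'probs' holds p(0), p(1), ..., p(K) for a pmf truncated at K (an assumption for this sketch).
rdiscrete_inv <- function(n, probs) {
  cdf <- cumsum(probs)
  u   <- runif(n)
  # findInterval counts how many cdf values are <= u, which (almost surely)
  # is the smallest x with cdf[x + 1] >= u; pmin guards against truncation.
  pmin(findInterval(u, cdf), length(probs) - 1)
}

rdiscrete_inv(10, dpois(0:20, lambda = 3))   # approximate Poisson(3) draws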
Exercise: Suppose 0 < p < 1 and q = 1 − p, and we want to
simulate X ∼ Geometric(p) with pmf
p(x) = p q^{x−1}
and cdf F(x) = 1 − q^x for x = 1, 2, 3, ....
Show that the following algorithm simulates X :
Step 1. Simulate U ∼ U [0, 1]
Step 2. Set
x = ⌈ log(1 − u) / log(q) ⌉
where ⌈·⌉ rounds up.
(Hint: you need to show that this formula gives the smallest x
such that 1 − q^x ≥ u).
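A one-line R sketch of this exercise's algorithm (note base R's rgeom uses a different convention, counting failures rather than trials):

# Geometric(p) on {1, 2, 3, ...} by inversion: the smallest x with 1 - q^x >= U.
rgeom_inv <- function(n, p) ceiling(log(1 - runif(n)) / log(1 - p))

mean(rgeom_inv(1e5, p = 0.3))   # should be close to 1/p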
Transformation Methods
Say Y ∼ Q, Y ∈ Ω_Q, is a rv we can simulate, and
X ∼ P, X ∈ Ω_P, is a rv we want to simulate.
If we can find a function f : Ω_Q → Ω_P with the property that
f (Y ) ∼ P
then we can simulate X by simulating
Y ∼Q
and setting
X = f (Y ).
Example: Suppose we want to simulate X ∼ Exp(1) and we
can simulate Y ∼ U (0, 1). Try setting X = − log(Y ). Does
that work?
Ans: recall that if Y has density q(y) and X = f (Y ) then the
density of X is
p(x) = q(y(x))|dy/dx|.
Now q(y) = 1 (the density of U (0, 1)) and y(x) = exp(−x), so
p(x) = exp(−x) and so X has the density of an Exp(1) rv.
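A quick R check of this transformation, comparing the simulated values with Exp(1) quantiles:

y <- runif(1e4)                    # Y ~ U(0, 1)
x <- -log(y)                       # proposed transformation X = -log(Y)
qqplot(qexp(ppoints(1e4)), x,
       xlab = "Exp(1) quantiles", ylab = "sample quantiles")
abline(0, 1)                       # points should lie near this line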
Example: Inversion is a transformation method: Q is U (0, 1); Y
is U ; and X = f (Y ) with f (y) = F −1(y) and F the CDF of
the target distribution P .
Transforming several variables We can generalize the idea. We
can take functions of collections of variables.
Example: Suppose we want to simulate X ∼ Γ(a, β) with a ∈
{1, 2, 3, ...} and we can simulate Y ∼ Exp(1).
Simulate Y_i ∼ Exp(1), i = 1, 2, ..., a and set X = ∑_{i=1}^a Y_i / β.
Then X ∼ Γ(a, β).
Proof: Use moment generating functions. The MGF of the Exp(1) rv Y is
E(e^{tY}) = (1 − t)^{−1}
so the MGF of X is
E(e^{tX}) = ∏_{i=1}^a E(e^{t Y_i/β}) = (1 − t/β)^{−a}
which is the MGF of a Γ(a, β) variate.
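A short R check of this construction; here β is taken as the rate parameter, matching the MGF (1 − t/β)^{−a} above:

# Gamma(a, beta) with integer shape a as a sum of a scaled Exp(1) variables.
rgamma_sum <- function(n, a, beta) {
  y <- matrix(rexp(n * a), nrow = n)   # n x a matrix of Exp(1) draws
  rowSums(y) / beta                    # X = (Y_1 + ... + Y_a) / beta
}

x <- rgamma_sum(1e4, a = 3, beta = 2)
c(mean(x), 3 / 2)                      # sample mean vs the true mean a / beta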
Rejection sampling
Suppose I want to sample X ∼ p(x). If I throw darts uniformly
at random 'under the curve' of p(x) then the x-values are
distributed like p(x). We do this by throwing darts UAR under
a function M q(x) that sits over p(x), and keeping the darts that
fall under p(x).
Example: suppose I want to sample X ∼ p with p_X(x) =
2x, 0 ≤ x ≤ 1. If I sample Y ∼ U(0, 1) and U ∼ U(0, 1)
then (Y, 2U) is uniform in the box [0, 1] × [0, 2]. The points
under the curve are the ones we want, since Y | 2U < 2Y ∼ p_X.
[Figure: two panels of points in the box [0, 1] × [0, 2]; horizontal axis y (0 to 1), vertical axis u (0 to 2).]
The Rejection algorithm
Say Y ∼ q(y), y ∈ Ω is a rv we can simulate.
X ∼ p(x), x ∈ Ω is a rv we want to simulate.
Suppose there exists M such that M q(x) > p(x) for all x ∈ Ω.
Rejection Algorithm simulate X ∼ p(x):
[1] Simulate y ∼ q(y) and u ∼ U (0, 1).
[2] If u < p(y)/M q(y) return X = y (’accept y ’) and stop, and
otherwise goto [1] (reject y ).
This algorithm works because (y, u M q(y)) is simulated UAR
'under' M q(y). We keep trying until we get a point with V =
u M q(y) 'under' p(y). This point is UAR under p(y), so its y-component is distributed according to p.
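A generic R sketch of the algorithm; p, q, rq and M below are placeholders for the target density, the proposal density, a sampler from q and the bound:

# Generic rejection sampler: one draw X ~ p using proposals from q.
# p, q: density functions; rq: function(n) drawing from q; M: constant with M*q(x) >= p(x).
rejection_sample <- function(p, q, rq, M) {
  repeat {
    y <- rq(1)
    u <- runif(1)
    if (u < p(y) / (M * q(y))) return(y)   # accept with probability p(y) / (M q(y))
  }
}

# Example: target p(x) = 2x on [0, 1], proposal q = U(0, 1), bound M = 2.
x <- replicate(1e4, rejection_sample(function(x) 2 * x, function(x) 1, runif, M = 2))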
Proposition: The rejection algorithm simulates X ∼ p(x).
Consider Ω = {1, 2, 3, ..., K}.
For x ∈ Ω let p(x) be the probability mass function for X.
If p = (p_1, p_2, p_3, ..., p_K) then p(x) = p_x.
As before we suppose we have M so that M ≥ p(x)/q(x) for
each x ∈ Ω.
Rejection Algorithm simulating X ∼ p:
[1] Simulate y ∼ q(y) and u ∼ U (0, 1).
[2] If u < p(y)/M q(y) return X = y and stop, and otherwise
goto [1] and try again.
Let Pr(X = x) be the probability this algorithm returns X = x.
We want to show that Pr(X = x) = p(x).
Claim: The number of trials in the rejection algorithm is geometric with mean M
Proof: each trial at [2] succeeds independently with probability
ξ say. In order to succeed we must draw a y and accept it. The
first event (draw y ) happens wp q(y). The second event (accept
y ) happens wp p(y)/q(y)M . We dont care which y it is so...
ξ = ∑_{y∈Ω} q(y) × p(y)/(M q(y))    (1)
  = 1/M    (2)
because ∑_y p(y) = 1. If the number of trials is N then
E(N) = 1/ξ = M.
Comment: Each time we execute steps [1-2] we fail wp 1 − 1/M
Claim: The probability the algorithm returns X = x is p(x).
How do we end up with X = x? When we enter the loop [1-2]
the probability we draw x and then accept it is
q(x) × p(x)/(M q(x)) = p(x)/M.
However we might pass through [1-2] any number of times. We
can draw x and then accept it at the first pass, or at the second
pass, or third and so on.
Pr(X = x) = p(x)/M + (1 − 1/M) p(x)/M + (1 − 1/M)² p(x)/M + ...
          = (p(x)/M) ∑_{n=1}^∞ (1 − 1/M)^{n−1}
          = p(x),
as claimed.
First comment: Where did we use M q(x) ≥ p(x)? If for some
y, p(y)/q(y) > M then the probability u ≤ p(y)/(M q(y)) at step
[2] is one, not p(y)/(M q(y)).
Exercise: show that rejection simulates X with density proportional to
min(p(x), M q(x))
(and hence X ∼ p(x) if p(x) ≤ M q(x) for all x).
Second comment: The mean number of trials is M . If we are
conservative and choose a very large M , in order to ensure the
bound M ≥ p(y)/q(y) holds for all y , then the number of trials
is very large, and the implementation runs slowly.
Example: Use Y ∼ U(0, 1) (ie q(x) = 1) to simulate X ∼ p(x)
with p(x) = 2x for 0 ≤ x ≤ 1. Calculate the mean number of
trials per simulated X -value.
Step 1. Find M so that p(x)/q(x) ≤ M for all 0 ≤ x ≤ 1. Here
p(x)/q(x) = 2x. We can use M = 2 since 2x ≤ 2 there. We
have p(y)/(M q(y)) = y for the second step.
Step 2. Write down the algorithm.
Rejection Algorithm simulating X ∼ 2x:
[1] Simulate y ∼ U [0, 1] and u ∼ U [0, 1].
[2] If u < y return X = y , and otherwise goto [1] and try again.
The mean number of trials is M = 2.
Implement this algorithm in R:
> while (runif(1)>(x<-runif(1))) {}; x
[1] 0.8843435
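A vectorised sketch generating the 10K samples for the histogram comparison below (proposals are made in batches and the accepted ones kept):

# Vectorised rejection sampling for p(x) = 2x on [0, 1]: U(0, 1) proposal, M = 2.
n_target <- 1e4
samples  <- numeric(0)
while (length(samples) < n_target) {
  y <- runif(n_target)                   # proposals from q = U(0, 1)
  u <- runif(n_target)
  samples <- c(samples, y[u < y])        # accept y when u < p(y) / (M q(y)) = y
}
samples <- samples[1:n_target]

hist(samples, freq = FALSE, xlab = "X")  # compare the histogram with the target
curve(2 * x, from = 0, to = 1, add = TRUE)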
Compare the histogram of 10K samples with the target density.
[Figure: histogram of the samples; horizontal axis X (0 to 1), vertical axis Density (0 to 2).]