When to jump off the edge of a cliff
Gareth Roberts
University of Warwick
Equip Launch, October 2013
Including work mainly with Jeffrey Rosenthal, Chris Sherlock, Alex Beskos, Pete Neal and Alex Thiery
Optimal scaling for MCMC and heterogeneity
Statistical inverse problems are typically ill-posed and characterised by non-identifiability.
In a Bayesian context, this manifests itself in posterior distributions with very
different scales.
Scale separation gets worse rather than better with more data.
Identifiable components are typically complex, highly non-linear, and in practice
unknown.
So, often very difficult to define clever algorithms to exploit specific structure
(though see Mark’s talk!).
How does generic MCMC fare when we have scale separation?
Metropolis-Hastings algorithm
Given a target density π(·) that we wish to sample from, and a Markov chain
transition kernel density q(·, ·), we construct a Markov chain as follows. Given Xn,
generate Yn+1 from q(Xn, ·). Now set Xn+1 = Yn+1 with probability
α(Xn, Yn+1) = 1 ∧ π(Yn+1) q(Yn+1, Xn) / [π(Xn) q(Xn, Yn+1)].
Otherwise set Xn+1 = Xn.
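As a minimal sketch (not code from the talk), the recipe above translates directly into a few lines; `metropolis_hastings`, `q_sample` and `log_q` are illustrative names:

```python
import math
import random

def metropolis_hastings(log_pi, q_sample, log_q, x0, n_iters, seed=0):
    """Given X_n, draw Y_{n+1} ~ q(X_n, .), then accept with probability
    1 ^ [pi(Y_{n+1}) q(Y_{n+1}, X_n)] / [pi(X_n) q(X_n, Y_{n+1})]."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_iters):
        y = q_sample(x, rng)
        # log of the Hastings ratio, computed in log space for stability
        log_alpha = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
        if math.log(rng.random() + 1e-300) < min(0.0, log_alpha):
            x = y  # accept: X_{n+1} = Y_{n+1}
        chain.append(x)  # either the accepted Y or the unchanged X
    return chain

# Illustration: standard normal target, Gaussian random-walk proposal
# (symmetric, so the q terms cancel, but they are kept for generality).
sigma = 2.4
chain = metropolis_hastings(
    log_pi=lambda x: -0.5 * x * x,
    q_sample=lambda x, rng: x + sigma * rng.gauss(0.0, 1.0),
    log_q=lambda x, y: -0.5 * ((y - x) / sigma) ** 2,
    x0=0.0,
    n_iters=20000,
)
```

The output `chain` should have empirical mean near 0 and variance near 1 for this target.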
Two first scaling problems
• RWM
q(x, y) = q(|y − x|)
The acceptance probability simplifies to
α(x, y) = 1 ∧ π(y)/π(x).
For example q(x, ·) ∼ MVN_d(x, σ²I_d), but also more generally.
• MALA
Y ∼ MVN( x^(k) + h V ∇log π(x^(k)) / 2, h V ).
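As a sketch (taking the preconditioner V = I_d; `mala_step` and its arguments are illustrative names, not code from the talk), one MALA update looks like:

```python
import math
import random

def mala_step(x, log_pi, grad_log_pi, h, rng):
    """One MALA update with V = I: propose Y ~ N(x + h*grad log pi(x)/2, h I),
    then apply the usual Metropolis-Hastings correction."""
    mean_fwd = [xi + 0.5 * h * gi for xi, gi in zip(x, grad_log_pi(x))]
    y = [m + math.sqrt(h) * rng.gauss(0.0, 1.0) for m in mean_fwd]
    mean_bwd = [yi + 0.5 * h * gi for yi, gi in zip(y, grad_log_pi(y))]
    # Gaussian log-densities of forward and reverse proposals
    # (the common normalising constant cancels in the ratio).
    log_q_fwd = -sum((yi - m) ** 2 for yi, m in zip(y, mean_fwd)) / (2.0 * h)
    log_q_bwd = -sum((xi - m) ** 2 for xi, m in zip(x, mean_bwd)) / (2.0 * h)
    log_alpha = log_pi(y) - log_pi(x) + log_q_bwd - log_q_fwd
    return y if math.log(rng.random() + 1e-300) < min(0.0, log_alpha) else x

# Illustration: 20000 MALA steps targeting a standard normal in 1-d.
rng = random.Random(1)
x = [0.0]
samples = []
for _ in range(20000):
    x = mala_step(x, lambda z: -0.5 * sum(zi * zi for zi in z),
                  lambda z: [-zi for zi in z], 1.0, rng)
    samples.append(x[0])
```

Unlike RWM, the proposal is asymmetric, so the reverse proposal density must appear in the acceptance ratio.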
The Goldilocks dilemma

[Figure: trace plots of sampler output for a range of proposal scales, each panel labelled with its scale and the corresponding acceptance rate (e.g. 2.38 / 0.5542; very small scales such as 0.01–0.1 give acceptance rates near 1).]
Scaling problems and diffusion limits
We wish to choose σ in the above algorithms to optimise efficiency. For ‘appropriate choices’
the d-dimensional algorithm has a limit which is a diffusion. The faster the diffusion,
the better!
• How should σd depend on d for large d?
• What does this tell us about the efficiency of the algorithm?
• Can we optimise σd in some sensible way?
• Can we characterise optimal (or close to optimal) values of σd in terms of observable properties of the Markov chain?
• How is this story affected by heterogeneity of scale?
For RWM and MALA (and some other local algorithms) and for some simple classes
of target distributions, a solution to the above can be obtained by considering a
diffusion limit (for high dimensional problems).
What is “efficiency”?
Let X be a Markov chain. Then for a π-integrable function g, efficiency can be
described by the asymptotic variance

σ²(g, P) = lim_{n→∞} n Var( n⁻¹ Σ_{i=1}^n g(X_i) ).

Under weak(ish) regularity conditions,

σ²(g, P) = Var_π(g) + 2 Σ_{i=1}^∞ Cov_π(g(X_0), g(X_i)).
In general relative efficiency between two possible Markov chains varies depending
on what function of interest g is being considered. As d → ∞ the dependence on
g disappears, at least in cases where we have a diffusion limit as we will see....
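Estimating σ²(g, P) from a single run is typically done by something like the batch-means method; a quick sketch (the function name is illustrative, and the check below uses i.i.d. draws, for which σ² should simply be Var_π(g)):

```python
import random

def batch_means_asvar(g_vals, n_batches=50):
    """Batch-means estimate of sigma^2(g, P): cut the chain into n_batches
    equal blocks; if blocks are long enough to be roughly independent,
    m * Var(block means) estimates the asymptotic variance."""
    m = len(g_vals) // n_batches
    means = [sum(g_vals[i * m:(i + 1) * m]) / m for i in range(n_batches)]
    grand = sum(means) / n_batches
    return m * sum((b - grand) ** 2 for b in means) / (n_batches - 1)

# Sanity check on i.i.d. N(0,1) draws with g the identity: estimate ~ 1.
rng = random.Random(0)
est = batch_means_asvar([rng.gauss(0.0, 1.0) for _ in range(20000)])
```

For a correlated chain the same estimator picks up the autocovariance sum as well, which is exactly why it is preferred to the naive sample variance.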
How do we measure “efficiency” efficiently?
It is well-established that estimating limiting variance is hard.
“It’s easy, just measure ESJD instead!” Andrew Gelman, 1993
ESJD = E((Xt+1 − Xt)2)
Why? “It’s obvious!” Andrew Gelman 2011
Optimising this is just like considering only linear functions g and ignoring all but
the first term in

Σ_{i=1}^∞ Cov_π(g(X_0), g(X_i)).
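By contrast with the asymptotic variance, ESJD is trivial to estimate from the sample path (a one-line sketch):

```python
def esjd(chain):
    """Empirical expected squared jump distance E((X_{t+1} - X_t)^2).
    Rejected moves contribute zero jumps, so ESJD couples the proposal
    scale with the acceptance rate."""
    return sum((b - a) ** 2 for a, b in zip(chain, chain[1:])) / (len(chain) - 1)
```

For example, the path 0 → 1 → 1 (rejection) → 3 has jumps 1, 0, 2 and hence ESJD (1 + 0 + 4)/3.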
[Figure: MCMC sample path w plotted against iteration index, 0–10000.]
MCMC sample paths and diffusions.
Here the ESJD corresponds to the quadratic variation

lim_{ε→0} Σ_{i=1}^{⌊t/ε⌋} (X_{iε} − X_{(i−1)ε})².
Diffusions
A d-dimensional diffusion is a continuous-time strong Markov process with continuous sample paths. We can define a diffusion as the solution of the Stochastic
Differential Equation (SDE):
dXt = µ(Xt)dt + σ(Xt)dBt.
where B denotes d-dimensional Brownian motion, σ is a d × d matrix and µ is a
d-vector.
Often understood intuitively and constructively via its dynamics over small time
intervals. Approximately for small h:
X_{t+h} | X_t = x_t ∼ x_t + h µ(x_t) + h^{1/2} σ(x_t) Z
where Z is a d-dimensional standard normal random variable.
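The approximate small-h dynamics above are exactly the Euler–Maruyama scheme, which gives a direct way to simulate a diffusion. A 1-d sketch (names illustrative), using the Langevin-type drift µ(x) = ∇log π(x) = −x with σ(x) = √2, whose stationary law is N(0, 1):

```python
import math
import random

def euler_maruyama(mu, sigma, x0, h, n_steps, rng):
    """Simulate dX_t = mu(X_t)dt + sigma(X_t)dB_t via the small-h dynamics
    X_{t+h} ~ x_t + h*mu(x_t) + h^{1/2}*sigma(x_t)*Z, with Z ~ N(0, 1)."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        x = x + h * mu(x) + math.sqrt(h) * sigma(x) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Diffusion with stationary distribution N(0, 1).
path = euler_maruyama(lambda x: -x, lambda x: math.sqrt(2.0),
                      0.0, 0.01, 100000, random.Random(0))
```

After a burn-in the path's empirical mean and variance should be near 0 and 1, up to O(h) discretisation bias.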
“Efficiency” for diffusions
Consider two Langevin diffusions, both with stationary distribution π.
dX_t^i = h_i^{1/2} dB_t + h_i ∇log π(X_t^i) dt / 2,   i = 1, 2,

with h_1 < h_2.
[Figure: thinned sample paths of the two diffusions over the same time window.]

X^2 is a “speeded-up” version of X^1.
The first diffusion comparison result (R, Gelman and Gilks, 1997)
Consider the Metropolis case.
Suppose π ∼ ∏_{i=1}^d f(x_i), q(x, ·) ∼ N(x, σ_d² I_d), X_0 ∼ π.
Set σ_d² = ℓ²/d. Consider

Z_t^d = X^{(1)}_{[td]}.
Speed up time by factor d
Z^d is not a Markov chain; however, in the limit as d goes to ∞, it is Markov:

Z^d ⇒ Z

where Z satisfies the SDE

dZ_t = h(ℓ)^{1/2} dB_t + h(ℓ) ∇log f(Z_t) dt / 2,

for some function h(ℓ).
How much diffusion path do we get for our n iterations? [Figure: plot of nh(ℓ)/d.]
h(ℓ) = ℓ² × 2Φ( −√I ℓ / 2 ),

and I = E_f[((log f(X))′)²]. So

h(ℓ) = ℓ² × A(ℓ),

where A(ℓ) is the limiting overall acceptance rate of the algorithm, ie the proportion
of proposed Metropolis moves ultimately accepted. So

h(ℓ) = (4/I) [Φ⁻¹(A(ℓ)/2)]² A(ℓ),
and so the maximisation problem can be written entirely in terms of the algorithm’s
acceptance rate.
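The last identity can be checked numerically: up to the constant 4/I (which does not affect the argmax), efficiency as a function of the acceptance rate A is A·[Φ⁻¹(A/2)]², whose maximiser is the famous 0.234. A quick sketch using only the standard library:

```python
from statistics import NormalDist

Phi_inv = NormalDist().inv_cdf  # standard normal quantile function

def relative_efficiency(A):
    """h(l) expressed through the acceptance rate A, dropping the 4/I constant."""
    return A * Phi_inv(A / 2.0) ** 2

# Grid search over acceptance rates in (0, 1); the maximiser is ~0.234.
grid = [i / 1000.0 for i in range(1, 1000)]
A_opt = max(grid, key=relative_efficiency)
```

Both endpoints are degenerate (tiny moves always accepted, huge moves never accepted), so the maximum sits strictly inside (0, 1).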
Efficiency as a function of scaling and acceptance rate

[Figure: two panels plotting efficiency (0.0–1.2) against the proposal scaling σ (0–5) and against the acceptance rate (0.0–1.0).]
When can we ‘solve’ the scaling problem for Metropolis?
We need a sequence of target densities πd which are sufficiently regular as d → ∞
in order that meaningful (and optimisable) limiting distributions exist. Eg.
1. π ∼ ∏_{i=1}^d f(x_i). (NB for discontinuous f, mixing is O(d²), rate 0.13 (Neal).)
2. π ∼ ∏_{i=1}^d f(c_i x_i), q(x, ·) ∼ N(x, σ_d² I_d), for some inverse scales c_i (Bedard, Rosenthal, Voss).
3. Elliptically symmetric target densities (Sherlock, Bedard).
4. The components form a homogeneous Markov chain.
5. π is a Gibbs random field with finite range interactions (Breyer).
6. Discretisations of an infinite-dimensional system absolutely continuous wrt a Gaussian measure (eg Pillai, Stuart, Thiery).
7. Purely discrete product form distributions.
Picturing RWM in high dimensions
eg consider X ∼ N(0, I_d): XᵀX has mean d and s.d. (2d)^{1/2}.

[Figure: mass concentrated in a shell of thickness O(1) around a hypersphere of radius O(√d).]

Target distribution lies concentrated around the surface of a d-dimensional hypersphere.
Two independent processes: the radial process (1-dimensional), needing to move
O(1), and the angular one (with a need to move distances O(d^{1/2})). Which process
converges quickest?
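The concentration claim can be checked empirically (a quick sketch; the dimension and sample size here are arbitrary choices):

```python
import random

# For X ~ N(0, I_d), ||X||^2 is chi-squared with d degrees of freedom:
# mean d and s.d. (2d)^{1/2}, so the mass sits in a thin spherical shell
# of relative width O(d^{-1/2}).
rng = random.Random(0)
d, n = 400, 2000
norms_sq = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(n)]
mean_r2 = sum(norms_sq) / n
var_r2 = sum((r - mean_r2) ** 2 for r in norms_sq) / n
```

With d = 400 the shell's relative thickness √(2d)/d is only about 7%, which is the geometric picture behind the radial/angular split.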
Spherical symmetry (Sherlock and R, 2009, Bernoulli)
Theorem. Let {X^(d)} be a sequence of d-dimensional spherically symmetric
unimodal target distributions and let {Y^(d)} be a sequence of jump proposal
distributions. Suppose there exist sequences {k_x^(d)} and {k_y^(d)} such that the
marginal radial distribution of X^(d) satisfies |X^(d)|/k_x^(d) → R in distribution,
where R is a non-negative random variable with no point mass at 0, and
|Y^(d)|/k_y^(d) → 1 in mean square, and provided there is a solution to an explicit
integral equation involving the distribution of R. Then the optimal acceptance
probability α_d (in the sense of maximising the expected squared jumping distance)
satisfies

0 < lim_{d→∞} α_d = α_∞ ≤ 0.234,

with α_∞ = 0.234 if and only if R equals some fixed positive constant with probability 1.

If R does have a point mass at 0, OR the integral condition does not hold (essentially,
R has a heavy-tailed distribution), then α_∞ = 0.
Where the radial component does not converge to a point mass, the target distribution has heterogeneous roughness. This can happen in many other situations.
What should we do when we have heterogeneous roughness?
Ideally, choose a state-dependent proposal distribution.
Expressions for the cost of heterogeneity can be obtained in some situations:
• Scales of the different components drawn from some non-trivial distribution (R and Rosenthal, 2001).
• Scale of component i scales like i^{−k} for some k > 0 (Beskos, R, Stuart and Voss).
All these results look at high-dimensional limits.
What can we do in finite-dimensions?
Eccentricity
Theorem. Suppose we can write X^(d) = T_d Z^(d) for matrices {T_d}, each having
collections of eigenvalues {ν_i^(d); 1 ≤ i ≤ d}, where {Z^(d)} is a sequence of
d-dimensional spherically symmetric unimodal target distributions and {Y^(d)} is
a sequence of jump proposal distributions. Suppose the conditions of the previous
theorem hold (on Z^(d) rather than X^(d) this time), and that the {T_d} are not
too eccentric:

lim_{d→∞} [ sup_{1≤i≤d} ν_i^(d) ] / [ Σ_{i=1}^d ν_i^(d) ] = 0.

Then the optimal acceptance probability α_d (in the sense of maximising the expected
squared jumping distance) satisfies

0 < lim_{d→∞} α_d = α_∞ ≤ 0.234,

with α_∞ = 0.234 if and only if R equals some fixed constant with probability 1.
See also work by Mylene Bedard.
Scaling with diverging scales
A caricature of MCMC on models with unidentifiable parameters (eg certain inverse
problems).
Consider the target distribution π_ε : R^{d_x} × R^{d_y} → R:

π_ε(x, y) = π(x) π_ε(y|x) = ε^{−d_y} e^{A(x) + B(x, y/ε)},

with ε > 0 being ‘small’. Propose

(x′, y′) = (x, y) + ℓ h(ε) (Z_x, Z_y),    (1)

for constant ℓ > 0, scaling factor h(ε) and noise (Z_x, Z_y)ᵀ ∼ N(0, I_{d_x+d_y}). Then

α = α(x, Y, Z_x, Z_y) = 1 ∧ e^{A(x′) − A(x) + B(x′, Y′) − B(x, Y)},    (2)

where we have set

Y = y/ε ;    Y′ = Y + ℓ (h(ε)/ε) Z_y.
Theorem
Consider the continuous-time process

x̄_{ε,t} = x_{⌊t/h(ε)²⌋},   t ≥ 0,    (3)

started in stationarity, x̄_0 ∼ π(x). Assume that h(ε) → 0 as ε → 0. Then, as
ε → 0, we have that x̄_{ε,t} ⇒ x_t, with x_t the diffusion process specified as the
solution of the stochastic differential equation

dx_t = (ℓ²/2) [ a_0(x_t, ℓ) ∇A(x_t) + ∇a_0(x_t, ℓ) ] dt + √(a_0(x_t, ℓ)) ℓ dW_t,

where a_0 denotes the acceptance probability of moves around x:

a_0(x, ℓ) = (2π)^{−d_y/2} ∫_{R^{d_y} × R^{d_y}} [ 1 ∧ e^{B(x, Y+ℓZ) − B(x, Y)} ] e^{B(x, Y)} dY dZ.    (4)
Optimal scaling for the diverging scales problem
By analysing the form of the acceptance probability a0, we get a surprise!
• If d_y = 1, it is optimal to propose jumps of size O(1); the limiting optimal
algorithm is a continuous-time pure jump process. Cost of heterogeneity is O(ε^{−1/2});
the optimal acceptance probability is 0!
• If d_y ≥ 3, the diffusion regime is optimal; cost of heterogeneity is O(ε^{−1});
the optimal acceptance probability can be anything.
• If d_y = 2, anything can happen...
How does this fit with Equip?
In many ill-posed inverse problems, separation of scale is typically non-linear and
difficult or impossible to really know about.
The above very simple result needs to be substantially generalised to understand
the penalty intrinsic to the ill-posedness of the problem.
Gradient-based MCMC algorithms are very promising (cue Mark!) though they often
have instabilities which are particularly sensitive to heterogeneity of scale. We need
to understand this further!
Ideally we need theory to underpin a robust MCMC methodology to cope with non-identifiability issues.