When to jump off the edge of a cliff

Gareth Roberts, University of Warwick
Equip Launch, October 2013

Including work mainly with Jeffrey Rosenthal, Chris Sherlock, Alex Beskos, Pete Neal and Alex Thiery.

Optimal scaling for MCMC and heterogeneity

Statistical inverse problems: typically ill-posed, and characterised by non-identifiability. In a Bayesian context, this manifests itself in posterior distributions with very different scales. Scale separation gets worse, not better, with more data. The identifiable components are typically complex, highly non-linear, and in practice unknown. So it is often very difficult to design clever algorithms that exploit specific structure (though see Mark's talk!). How does generic MCMC fare when we have scale separation?

Metropolis-Hastings algorithm

Given a target density π(·) that we wish to sample from, and a Markov chain transition kernel density q(·, ·), we construct a Markov chain as follows. Given X_n, generate Y_{n+1} from q(X_n, ·). Now set X_{n+1} = Y_{n+1} with probability

    α(X_n, Y_{n+1}) = 1 ∧ [π(Y_{n+1}) q(Y_{n+1}, X_n)] / [π(X_n) q(X_n, Y_{n+1})].

Otherwise set X_{n+1} = X_n.

Two first scaling problems

• RWM: q(x, y) = q(|y − x|). The acceptance probability simplifies to

    α(x, y) = 1 ∧ π(y)/π(x).

For example q(x, ·) ∼ MVN_d(x, σ²I_d), but also more generally.

• MALA:

    Y ∼ MVN(x^(k) + h V ∇ log π(x^(k)) / 2, h V).
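The accept/reject mechanics above are easy to get wrong in practice (work in log space to avoid overflow). Below is a minimal sketch of one random-walk Metropolis step in Python, with a user-supplied `log_pi`; this is my own illustration, not code from the talk:

```python
import math
import random

def rwm_step(x, log_pi, sigma, rng=random):
    """One random-walk Metropolis step with proposal y ~ N(x, sigma^2).

    The Gaussian proposal is symmetric, so q(x, y) = q(y, x) and the
    Hastings ratio reduces to pi(y) / pi(x)."""
    y = x + sigma * rng.gauss(0.0, 1.0)
    log_alpha = log_pi(y) - log_pi(x)  # log(pi(y)/pi(x)); alpha = 1 ^ exp(...)
    if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
        return y, True   # accept the proposal
    return x, False      # reject: chain stays where it is

# Usage: sample a standard normal target and track the acceptance rate.
random.seed(1)
log_pi = lambda x: -0.5 * x * x
x, accepted = 0.0, 0
for _ in range(10_000):
    x, acc = rwm_step(x, log_pi, sigma=2.38)
    accepted += acc
```

For a one-dimensional standard normal target with σ = 2.38, the empirical acceptance rate settles in the 0.4–0.5 range.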
[Figure: trace plots of 1,000 RWM iterations, one panel per proposal scale ranging from 0.01 up to 100, each labelled with its scale and a chain summary statistic. Chains with very small or very large scales barely move through the target; intermediate scales (around 2.38) explore it well.]

The Goldilocks dilemma

Scaling problems and diffusion limits

We want to choose σ in the above algorithms to optimise efficiency. For 'appropriate choices' the d-dimensional algorithm has a limit which is a diffusion; the faster the diffusion, the better!

• How should σ_d depend on d for large d?
• What does this tell us about the efficiency of the algorithm?
• Can we optimise σ_d in some sensible way?
• Can we characterise optimal (or close to optimal) values of σ_d in terms of observable properties of the Markov chain?
• How is this story affected by heterogeneity of scale?

For RWM and MALA (and some other local algorithms), and for some simple classes of target distributions, a solution to the above can be obtained by considering a diffusion limit (for high-dimensional problems).
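The Goldilocks dilemma is easy to reproduce. The sketch below (my own illustration, not code from the talk) runs one-dimensional RWM on a standard normal target at a too-small, an intermediate and a too-large proposal scale, and records the expected squared jump distance (ESJD) for each:

```python
import math
import random

def esjd_for_scale(sigma, n=20_000, seed=0):
    """Empirical E[(X_{t+1} - X_t)^2] for RWM on a N(0, 1) target."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(n):
        y = x + sigma * rng.gauss(0.0, 1.0)
        log_alpha = -0.5 * (y * y - x * x)     # log pi(y) - log pi(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            total += (y - x) ** 2              # accepted: a real jump
            x = y                              # rejected moves contribute 0
    return total / n

# Too small, roughly right, and too large a proposal scale.
results = {s: esjd_for_scale(s) for s in (0.05, 2.38, 50.0)}
```

The intermediate scale wins comfortably: the tiny proposal accepts nearly everything but barely moves, while the huge proposal is almost always rejected.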
What is "efficiency"?

Let X be a Markov chain. Then for a π-square-integrable function g, efficiency can be described by

    σ²(g, P) = lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i) ).

Under weak(ish) regularity conditions,

    σ²(g, P) = Var_π(g) + 2 Σ_{i=1}^∞ Cov_π(g(X_0), g(X_i)).

In general the relative efficiency between two possible Markov chains varies depending on which function of interest g is being considered. As d → ∞ the dependence on g disappears, at least in cases where we have a diffusion limit, as we will see....

How do we measure "efficiency" efficiently?

It is well established that estimating the limiting variance is hard.

"It's easy, just measure ESJD instead!" (Andrew Gelman, 1993)

    ESJD = E((X_{t+1} − X_t)²)

Why? "It's obvious!" (Andrew Gelman, 2011)

Optimising ESJD is just like considering only linear functions g and ignoring all but the first term in

    Σ_{i=1}^∞ Cov_π(g(X_0), g(X_i)).

MCMC sample paths and diffusions

[Figure: trace plot of an MCMC chain over 10,000 iterations, whose sample path resembles a diffusion.]

Here the ESJD is the discrete analogue of the quadratic variation

    lim_{δ→0} Σ_{i=1}^{⌊t/δ⌋} (X_{iδ} − X_{(i−1)δ})².

Diffusions

A d-dimensional diffusion is a continuous-time strong Markov process with continuous sample paths. We can define a diffusion as the solution of the stochastic differential equation (SDE)

    dX_t = μ(X_t) dt + σ(X_t) dB_t,

where B denotes d-dimensional Brownian motion, σ is a d × d matrix and μ is a d-vector. A diffusion is often understood intuitively and constructively via its dynamics over small time intervals. Approximately, for small h:

    X_{t+h} | X_t = x_t ∼ x_t + h μ(x_t) + h^{1/2} σ(x_t) Z,

where Z is a d-dimensional standard normal random variable.

"Efficiency" for diffusions

Consider two Langevin diffusions, both with stationary distribution π,

    dX_t^i = h_i^{1/2} dB_t + h_i ∇ log π(X_t^i) dt / 2,    i = 1, 2,

with h_1 < h_2. X² is a "speeded-up" version of X¹.

[Figure: sample paths of the two diffusions over the same time window; the faster one traverses the state space more quickly.]

The first diffusion comparison result (R, Gelman and Gilks, 1997)

Consider the Metropolis case. Suppose π ∼ Π_{i=1}^d f(x_i), q(x, ·) ∼ N(x, σ_d² I_d), X_0 ∼ π. Set σ_d² = ℓ²/d.
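The small-h dynamics above translate directly into the Euler–Maruyama scheme for simulating a diffusion. A sketch (illustrative, not from the talk) for the Langevin diffusion targeting N(0, 1), where ∇ log π(x) = −x so the drift is −x/2:

```python
import math
import random

def euler_maruyama(mu, sigma, x0, dt, n_steps, seed=0):
    """Simulate dX = mu(X) dt + sigma(X) dB via the small-h dynamics
    X <- X + dt * mu(X) + sqrt(dt) * sigma(X) * Z,  Z ~ N(0, 1)."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        x = x + dt * mu(x) + math.sqrt(dt) * sigma(x) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Langevin diffusion for pi = N(0, 1): drift = grad log pi / 2 = -x/2, unit noise.
path = euler_maruyama(lambda x: -0.5 * x, lambda x: 1.0,
                      x0=0.0, dt=0.01, n_steps=200_000)
empirical_var = sum(v * v for v in path) / len(path)   # close to 1 if pi is preserved
```

Over a long run the empirical variance of the path is close to the stationary variance 1, confirming that the discretised dynamics approximately preserve π.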
Consider Z_t^d = X^{(1)}_{[td]}: speed up time by a factor of d. Z^d is not a Markov chain; however, in the limit as d goes to ∞, it is Markov: Z^d ⇒ Z, where Z satisfies the SDE

    dZ_t = h(ℓ)^{1/2} dB_t + h(ℓ) ∇ log f(Z_t) dt / 2,

for some function h(ℓ). How much diffusion path do we get for our n iterations? n h(ℓ)/d.

Here

    h(ℓ) = ℓ² × 2Φ(−√I ℓ / 2),    with I = E_f[((log f(X))′)²].

So h(ℓ) = ℓ² × A(ℓ), where A(ℓ) is the limiting overall acceptance rate of the algorithm, i.e. the proportion of proposed Metropolis moves ultimately accepted. Hence

    h(ℓ) = (4/I) [Φ^{−1}(A(ℓ)/2)]² A(ℓ),

and so the maximisation problem can be written entirely in terms of the algorithm's acceptance rate.

[Figure: efficiency as a function of the scaling σ, and as a function of the acceptance rate; each curve is maximised at an intermediate value.]

When can we 'solve' the scaling problem for Metropolis?

We need a sequence of target densities π_d which are sufficiently regular as d → ∞ in order that meaningful (and optimisable) limiting distributions exist. E.g.:

1. π ∼ Π_{i=1}^d f(x_i). (NB for discontinuous f, mixing is O(d²), rate 0.13 (Neal).)
2. π ∼ Π_{i=1}^d f(c_i x_i), q(x, ·) ∼ N(x, σ_d² I_d), for some inverse scales c_i (Bedard, Rosenthal, Voss).
3. Elliptically symmetric target densities (Sherlock, Bedard).
4. The components form a homogeneous Markov chain.
5. π is a Gibbs random field with finite range interactions (Breyer).
6. Discretisations of an infinite-dimensional system absolutely continuous wrt a Gaussian measure (e.g. Pillai, Stuart, Thiery).
7. Purely discrete product form distributions.

Picturing RWM in high dimensions

E.g. consider X ∼ N(0, I_d): X′X has mean d and s.d. (2d)^{1/2}, so the target distribution lies concentrated around the surface of a d-dimensional hypersphere of radius O(√d) and thickness O(1). There are two roughly independent processes: the radial process (1-dimensional, needing to move distances O(1)) and the angular one (needing to move distances O(d^{1/2})). Which process converges quickest?
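The expression for h(ℓ) can be optimised numerically. The sketch below (my own check, not from the talk, taking I = 1 as for a standard normal f) grid-searches ℓ and recovers the famous optimal acceptance rate:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def h(l, I=1.0):
    """Speed of the limiting diffusion: h(l) = l^2 * 2 * Phi(-sqrt(I) * l / 2)."""
    return l * l * 2.0 * Phi(-math.sqrt(I) * l / 2.0)

# Grid-search the optimal scaling l* and the acceptance rate it implies.
grid = [i / 1000.0 for i in range(1, 6000)]
l_star = max(grid, key=h)                 # approx 2.38 when I = 1
acc_star = 2.0 * Phi(-l_star / 2.0)       # A(l*) approx 0.234
```

The optimum ℓ* ≈ 2.38 and the implied acceptance rate A(ℓ*) ≈ 0.234 hold for any I up to the rescaling ℓ* ∝ 1/√I: the acceptance rate at the optimum does not depend on I at all, which is why tuning to 0.234 is such a robust rule.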
Spherical symmetry (Sherlock and R, 2009, Bernoulli)

Theorem. Let {X^(d)} be a sequence of d-dimensional spherically symmetric unimodal target distributions and let {Y^(d)} be a sequence of jump proposal distributions. Suppose there exist sequences {k_x^(d)} and {k_y^(d)} such that the marginal radial distribution of X^(d) satisfies |X^(d)|/k_x^(d) →_D R, where R is a non-negative random variable with no point mass at 0, that |Y^(d)|/k_y^(d) → 1 in mean square, and that there is a solution to an explicit integral equation involving the distribution of R. Then, if α_d denotes the optimal acceptance probability (in the sense of maximising the expected squared jumping distance),

    0 < lim_{d→∞} α_d = α_∞ ≤ 0.234,

with α_∞ = 0.234 if and only if R equals some fixed positive constant with probability 1. If R does have a point mass at 0, OR the integral condition does not hold (essentially, R has a heavy-tailed distribution), then α_∞ = 0.

Where the radial component does not converge to a point mass, the target distribution has heterogeneous roughness. This can happen in many other situations.

What should we do when we have heterogeneous roughness? Ideally, choose a state-dependent proposal distribution. Expressions for the cost of heterogeneity can be obtained in some situations:

• The scales of the different components are drawn from some non-trivial distribution (R + Rosenthal, 2001).
• The scale of component i behaves like i^{−k} for some k > 0 (Beskos, R, Stuart and Voss).

All these results look at high-dimensional limits. What can we do in finite dimensions?

Eccentricity

Theorem. Suppose we can write X^(d) = T_d Z^(d) for matrices {T_d}, each having collections of eigenvalues {ν_i^(d); 1 ≤ i ≤ d}, where {Z^(d)} is a sequence of d-dimensional spherically symmetric unimodal target distributions, {Y^(d)} is a sequence of jump proposal distributions, and the conditions of the previous theorem hold (on Z^(d) rather than X^(d) this time).
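To see the cost of heterogeneity in a toy finite-dimensional case, the sketch below (my own illustration, not from the talk) runs RWM on a two-component Gaussian target with scales 1 and 0.01. An isotropic proposal must shrink to the smallest scale, crippling movement in the first component; a proposal with component-wise scales does far better:

```python
import math
import random

SMALL = 0.01   # scale of the second (stiff) component

def rwm_2d(sig1, sig2, n=20_000, seed=0):
    """RWM on pi(x1, x2) = N(0, 1) x N(0, SMALL^2) with proposal
    scales (sig1, sig2). Returns (acceptance rate, ESJD of x1)."""
    rng = random.Random(seed)
    x1, x2, acc, jumps = 0.0, 0.0, 0, 0.0
    for _ in range(n):
        y1 = x1 + sig1 * rng.gauss(0.0, 1.0)
        y2 = x2 + sig2 * rng.gauss(0.0, 1.0)
        log_a = (-0.5 * (y1 * y1 - x1 * x1)
                 - 0.5 * (y2 * y2 - x2 * x2) / SMALL ** 2)
        if log_a >= 0 or rng.random() < math.exp(log_a):
            acc += 1
            jumps += (y1 - x1) ** 2
            x1, x2 = y1, y2
    return acc / n, jumps / n

iso = rwm_2d(2 * SMALL, 2 * SMALL)   # isotropic, forced down to the small scale
tuned = rwm_2d(2.0, 2 * SMALL)       # component-wise scales matched to the target
```

The ESJD of the first component is orders of magnitude larger under the component-wise proposal: this is the finite-dimensional face of the heterogeneity cost.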
Suppose also that {T_d} are not too eccentric:

    lim_{d→∞} sup_{1≤i≤d} ν_i^(d) / Σ_{i=1}^d ν_i^(d) = 0.

Then, if α_d denotes the optimal acceptance probability (in the sense of maximising the expected squared jumping distance),

    0 < lim_{d→∞} α_d = α_∞ ≤ 0.234,

with α_∞ = 0.234 if and only if R equals some fixed constant with probability 1. See also work by Mylene Bedard.

Scaling with diverging scales

A caricature of MCMC on models with unidentifiable parameters (e.g. certain inverse problems). Consider the target distribution π_ε : R^{d_x} × R^{d_y} → R,

    π_ε(x, y) = π(x) π_ε(y|x) = (1/ε^{d_y}) e^{A(x) + B(x, y/ε)},

with ε > 0 being 'small'. Propose

    (x′, y′) = (x, y) + ℓ h(ε) (Z_x, Z_y),    (1)

for a constant ℓ > 0, scaling factor h(ε) and noise (Z_x, Z_y)ᵀ ∼ N(0, I_{d_x+d_y}). The acceptance probability is

    α = α(x, Y, Z_x, Z_y) = 1 ∧ e^{A(x′) − A(x) + B(x′, Y′) − B(x, Y)},

where we have set

    Y = y/ε,    Y′ = Y + ℓ (h(ε)/ε) Z_y.    (2)

Theorem. Consider the continuous-time process

    x̄_{ε,t} = x_{⌊t/h(ε)²⌋},    t ≥ 0,    (3)

started in stationarity, x̄_0 ∼ π(x). Assume that h(ε) → 0 as ε → 0. Then, as ε → 0, we have that x̄_{ε,t} ⇒ x_t, with x_t the diffusion process specified as the solution of the stochastic differential equation

    dx_t = (ℓ²/2) [ a_0(x_t, ℓ) ∇A(x_t) + ∇a_0(x_t, ℓ) ] dt + √(a_0(x_t, ℓ) ℓ²) dW_t,

where a_0 denotes the limiting acceptance probability of moves around x:

    a_0(x, ℓ) = (2π)^{−d_y/2} ∫_{R^{d_y} × R^{d_y}} (1 ∧ e^{B(x, Y+ℓZ) − B(x, Y)}) e^{B(x, Y)} e^{−|Z|²/2} dY dZ.    (4)

Optimal scaling for the diverging scales problem

By analysing the form of the acceptance probability a_0, we get a surprise!

• If d_y = 1, it is optimal to propose jumps of size O(1); the limiting optimal algorithm is a continuous-time pure jump process. The cost of heterogeneity is ε^{−1/2}, and the optimal acceptance probability is 0!
• If d_y ≥ 3, the diffusion regime is optimal; the cost of heterogeneity is O(ε^{−1}), and the optimal acceptance probability can be anything.
• If d_y = 2, anything can happen...

How does this fit with Equip?

In many ill-posed inverse problems, separation of scale is typically non-linear and difficult or impossible to really know about.
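The caricature can be simulated in the simplest Gaussian case: take π(x) = N(0, 1), d_y = 1 and B(x, Y) = −Y²/2, so that π_ε(y|x) = N(0, ε²), and use the proposal (1) with h(ε) = ε (all of these choices are my own toy assumptions, not from the talk). The acceptance rate then stabilises as ε → 0, while each move changes x by only O(ε) — which is exactly where the cost of heterogeneity comes from:

```python
import math
import random

def acc_rate(eps, ell=1.0, n=20_000, seed=0):
    """Acceptance rate of RWM on pi(x, y) = N(0, 1) x N(0, eps^2)
    with an isotropic proposal of scale ell * h(eps), h(eps) = eps."""
    rng = random.Random(seed)
    h, x, y, acc = ell * eps, 0.0, 0.0, 0
    for _ in range(n):
        xp = x + h * rng.gauss(0.0, 1.0)
        yp = y + h * rng.gauss(0.0, 1.0)
        log_a = (-0.5 * (xp * xp - x * x)
                 - 0.5 * (yp * yp - y * y) / eps ** 2)
        if log_a >= 0 or rng.random() < math.exp(log_a):
            x, y, acc = xp, yp, acc + 1
    return acc / n

# The acceptance rate barely changes as the scale separation grows.
rates = [acc_rate(e) for e in (0.1, 0.01, 0.001)]
```

The acceptance rate stays bounded away from 0 for every ε, but the chain needs ever more iterations to traverse the x marginal, since each accepted step moves x by only about ε.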
The above very simple result needs to be substantially generalised in order to understand the penalty intrinsic to the ill-posedness of the problem. Gradient-based MCMC algorithms are very promising (cue Mark!), though they often have instabilities which are particularly sensitive to heterogeneity of scale. We need to understand this further! Ideally we need theory to underpin a robust MCMC methodology that copes with non-identifiability issues.