Spectral Analytic Comparisons for Data Augmentation

Vivekananda Roy
Department of Statistics, Iowa State University

September 2011

Abstract

The sandwich algorithm (SA) is an alternative to the data augmentation (DA) algorithm that uses an extra simulation step at each iteration. In this paper, we show that the sandwich algorithm always converges at least as fast as the DA algorithm, in the Markov operator norm sense. We also establish conditions under which the spectrum of SA dominates that of DA. An example illustrates the results.

Key words and phrases. Compact operator, Convergence rate, Data augmentation algorithm, Eigenvalue, Markov chain, Spectrum.

1 Introduction

Let $f_X : \mathsf{X} \to [0, \infty)$ be a probability density function (with respect to a $\sigma$-finite measure $\mu$, say) and assume that direct simulation from $f_X$ is not possible. Suppose $f : \mathsf{X} \times \mathsf{Y} \to [0, \infty)$ is a joint density (with respect to $\mu \times \nu$, say) whose $x$-marginal is $f_X$, i.e., $\int_{\mathsf{Y}} f(x, y)\,\nu(dy) = f_X(x)$. If sampling from the corresponding conditional densities $f_{X|Y}$ and $f_{Y|X}$ is straightforward, then we can use the data augmentation (DA) algorithm (Tanner and Wong (1987)) based on $f(x, y)$ to explore $f_X$. In particular, the Markov transition density (Mtd) of this DA algorithm is given by
\[
k(x'|x) = \int_{\mathsf{Y}} f_{X|Y}(x'|y)\, f_{Y|X}(y|x)\,\nu(dy).
\]
So each iteration of the DA algorithm consists of two simple steps: a draw from $f_{Y|X}$ followed by a draw from $f_{X|Y}$. The DA algorithm, like its deterministic counterpart the EM algorithm, is a useful algorithm that often suffers from slow convergence.

Following Liu and Wu (1999), Meng and van Dyk (1999) and van Dyk and Meng (2001), Hobert and Marchev (2008) recently introduced an alternative to DA to speed up the convergence. In order to describe Hobert and Marchev's (2008) sandwich algorithm (SA), let $R(y, \cdot)$ be the Markov transition function (Mtf) of any Markov chain on $\mathsf{Y}$ with invariant density $f_Y$, where $f_Y$ is the $y$-marginal density of $f(x, y)$, i.e., $\int_{\mathsf{X}} f(x, y)\,\mu(dx) = f_Y(y)$.
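For concreteness, one iteration of the DA algorithm can be sketched in a few lines of Python. The two conditional samplers below are illustrative placeholders only, not part of any specific model: here $X$ and $Y$ are taken to be independent standard normals, so both conditionals are $N(0, 1)$; in a real problem they would be the model-specific draws from $f_{Y|X}$ and $f_{X|Y}$.

```python
import random

# Toy stand-ins for the two conditional samplers (assumed model:
# X and Y independent standard normals, so both conditionals are N(0, 1)).
def sample_y_given_x(x):
    # placeholder draw from f_{Y|X}(. | x)
    return random.gauss(0.0, 1.0)

def sample_x_given_y(y):
    # placeholder draw from f_{X|Y}(. | y)
    return random.gauss(0.0, 1.0)

def da_step(x):
    """One DA iteration: y ~ f_{Y|X}(. | x), then x' ~ f_{X|Y}(. | y)."""
    y = sample_y_given_x(x)
    return sample_x_given_y(y)

x = 0.0
chain = []
for _ in range(1000):
    x = da_step(x)
    chain.append(x)
```

Running the two-step `da_step` repeatedly produces the DA Markov chain with invariant density $f_X$.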
Then the Mtd of the sandwich algorithm is
\[
\tilde{k}(x'|x) = \int_{\mathsf{Y}} \int_{\mathsf{Y}} f_{X|Y}(x'|y')\, R(y, dy')\, f_{Y|X}(y|x)\,\nu(dy).
\]
Simple calculations show that $f_X$ is the invariant density for $\tilde{k}$ and that, if $R$ is reversible with respect to $f_Y$, then $\tilde{k}$ is reversible with respect to $f_X$. While Theorem 1, the first main result of this paper, requires only that $f_X$ is the invariant density for $\tilde{k}$, Theorem 2, like most of the existing theory comparing DA and SA, is based on the stronger assumption that $R$ is reversible with respect to $f_Y$. Notice that each iteration of the sandwich algorithm consists of three steps: a draw from $f_{Y|X}$, followed by a draw from $R$, and finally a draw from $f_{X|Y}$. So SA has an extra step according to $R$ sandwiched between the draws from the two conditional densities. In practice $R$ is often chosen to be a univariate Markov chain, so simulating from $R$ is computationally inexpensive compared to simulating from $f_{Y|X}$ and $f_{X|Y}$, which are often high dimensional densities. In that case the DA and sandwich algorithms are equivalent in terms of computational complexity. On the other hand, there is a great deal of empirical evidence showing that sandwich algorithms converge much faster than DA algorithms. (For examples, see Liu and Wu (1999), Meng and van Dyk (1999), van Dyk and Meng (2001), Roy and Hobert (2007) and Hobert, Roy and Robert (2011).)

In this short paper, we develop theoretical results comparing the convergence rates of the DA and sandwich algorithms. In particular, we show that SA is always at least as good as the original DA algorithm in terms of having smaller operator norm; that is, we have $\|\tilde{K}\| \le \|K\|$, where $K$ and $\tilde{K}$ are the operators defined by $k$ and $\tilde{k}$ respectively and $\|\cdot\|$ denotes operator norm. This result extends results in Hobert and Rosenthal (2007) and Hobert and Román (2011), where the above norm comparison was obtained under additional conditions on SA.
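One sandwich iteration can likewise be sketched in Python. Everything here is an illustrative assumption rather than a model from the paper: the conditional samplers are toy stand-ins (independent standard normals), and the kernel $R$ shown is sign resampling, which is idempotent and leaves any symmetric $f_Y$ invariant.

```python
import random

# Toy stand-ins for the conditional samplers (assumed model: X and Y
# independent standard normals).
def sample_y_given_x(x):
    return random.gauss(0.0, 1.0)   # placeholder draw from f_{Y|X}

def sample_x_given_y(y):
    return random.gauss(0.0, 1.0)   # placeholder draw from f_{X|Y}

def r_move(y):
    # Toy fY-invariant kernel R: resample the sign of y. This leaves any
    # symmetric fY invariant and satisfies R^2 = R (idempotent).
    return abs(y) * random.choice([-1.0, 1.0])

def sandwich_step(x):
    """y ~ f_{Y|X}, then y' ~ R(y, .), then x' ~ f_{X|Y}(. | y')."""
    y = sample_y_given_x(x)
    y_prime = r_move(y)
    return sample_x_given_y(y_prime)

x = 0.0
for _ in range(1000):
    x = sandwich_step(x)
```

The extra `r_move` is cheap (a single univariate draw), matching the point that DA and SA typically have the same computational cost per iteration.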
Hobert and Rosenthal (2007) prove that $\|\tilde{K}\| \le \|K\|$ as long as $\tilde{K}$ is a positive operator. Recently, Hobert and Román (2011) pointed out that Yu and Meng's (2011) Theorem 1 can be used to establish the above result when $R$ is reversible. While the norm of a Markov operator provides a univariate summary of the convergence of the corresponding Markov chain, a detailed picture of its convergence can be obtained by studying the spectrum of the operator (Diaconis, Khare and Saloff-Coste (2008), Hobert et al. (2011)). We prove that if the Markov operator corresponding to the DA chain is compact and the Mtf $R$ is idempotent (i.e., $\int_{\mathsf{Y}} R(y, dy')\,R(y', dy'') = R(y, dy'')$, or in short $R^2 = R$), then the spectrum of SA dominates that of DA in the sense that the ordered eigenvalues of SA are less than or equal to the corresponding eigenvalues of DA (see Section 2 for the definition of a compact operator). This is a generalization of results in Hobert et al. (2011) and Khare and Hobert (2011). While Hobert et al. (2011) proved this eigenvalue domination result under the condition that $\mathsf{Y}$ is finite (in which case the DA chain is of course compact), Khare and Hobert (2011) proved the same result for trace-class DA algorithms (see Section 2 for the definition of a trace-class operator). Note that a trace-class operator is necessarily compact. In this article, we also give weaker conditions than Khare and Hobert (2011) on the class of DA algorithms that allow the sandwich algorithms to be strictly better than the corresponding DA algorithms in the Markov operator norm sense. In particular, we show that if the DA operator is compact and $R$ satisfies certain conditions, then the norm of the sandwich algorithm is strictly less than that of DA. Khare and Hobert (2011) proved this result under the stronger assumption that the DA algorithm is trace-class. The remainder of this paper is organized as follows.
Section 2 contains a brief review of results from operator theory that are used in this article. Our main results comparing DA and SA appear in Section 3. Section 4 contains an example of a compact DA algorithm that is not trace-class, which illustrates our theoretical results.

2 Background on Markov Operators

Suppose $f_X : \mathsf{X} \to [0, \infty)$ is a pdf with respect to a $\sigma$-finite measure $\mu$. Let
\[
L_0^2(f_X) = \Big\{ g : \mathsf{X} \to \mathbb{R} : \int_{\mathsf{X}} g^2(x)\, f_X(x)\,\mu(dx) < \infty \ \text{and} \ \int_{\mathsf{X}} g(x)\, f_X(x)\,\mu(dx) = 0 \Big\}.
\]
The inner product in $L_0^2(f_X)$ is defined as $\langle g, h\rangle = \int_{\mathsf{X}} g(x)\, h(x)\, f_X(x)\,\mu(dx)$, and hence the norm of $g$ is $\|g\| = \sqrt{\langle g, g\rangle}$.

Let $P(x, dx')$ be the Mtf of an irreducible, aperiodic and Harris recurrent Markov chain $\{X_n\}_{n=0}^{\infty}$ on $\mathsf{X}$ with invariant density $f_X$. Let $P : L_0^2(f_X) \to L_0^2(f_X)$ be the corresponding operator that maps $g \in L_0^2(f_X)$ to $(Pg)(x) = \int_{\mathsf{X}} g(x')\,P(x, dx')$. Define $L_{0,1}^2(f_X) = \{g \in L_0^2(f_X) : \int_{\mathsf{X}} g^2(x)\, f_X(x)\,\mu(dx) = 1\}$. The (operator) norm of $P$ is defined as
\[
\|P\| = \sup_{g \in L_{0,1}^2(f_X)} \|Pg\|.
\]
Liu, Wong and Kong (1994) showed that $\|P\| = \sup_{f, g \in L_0^2(f_X)} \mathrm{corr}(f(X_0), g(X_1))$, where $\mathrm{corr}(U, V)$ is the classical (Pearson) correlation between two random variables $U$ and $V$. Hence $\|P\|$ describes the strength of correlation between two consecutive steps of the chain. It easily follows that $\|P\| \le 1$, and if the Mtf $P(x, dx')$ is reversible with respect to $f_X$, i.e., if $f_X(x)\,P(x, dx') = f_X(x')\,P(x', dx)$ for all $x, x'$, then $P$ is a self-adjoint operator. For the rest of this section we assume that $P$ is a self-adjoint operator. It is known that $\|P\| < 1$ if and only if the underlying Markov chain is geometrically ergodic (Roberts and Rosenthal (1997)). Rosenthal (2003) showed that for a geometrically ergodic Markov chain, the quantity $1 - \|P\|$, which is called the spectral gap, is a good measure of its (asymptotic) rate of convergence to the stationary distribution. The spectrum of $P$, $\sigma(P)$, is defined as $\sigma(P) = \{\beta \in \mathbb{R} : P - \beta I \text{ is not invertible}\}$.
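In the finite-state case these quantities are directly computable: for a reversible transition matrix, $\|P\|$ on $L_0^2(f_X)$ equals the largest absolute eigenvalue after excluding the trivial eigenvalue 1 (the constant function, which is excluded from $L_0^2$). A minimal numerical sketch, using a made-up 3-state birth-death chain:

```python
import numpy as np

# A small reversible chain on {0, 1, 2} (made-up example) illustrating the
# operator norm ||P|| on L^2_0(pi): it equals the second-largest absolute
# eigenvalue of P, since the trivial eigenvalue 1 belongs to the constants.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])            # stationary: pi P = pi

assert np.allclose(pi @ P, pi)               # invariance
# detailed balance: pi_i P_ij = pi_j P_ji  (reversibility)
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)

# S = D^{1/2} P D^{-1/2} is symmetric for reversible P, with the same
# (real) eigenvalues as P.
d = np.sqrt(pi)
S = d[:, None] * P / d[None, :]
evals = np.sort(np.linalg.eigvalsh(S))[::-1]  # descending
norm_P = max(abs(evals[1:]))                  # drop the trivial eigenvalue 1
print(evals, norm_P)
```

For this particular chain the eigenvalues are 1, 1/2 and 0, so the operator norm on $L_0^2(\pi)$ is 1/2 and the spectral gap is 1/2.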
It is known that $\sigma(P) \subset [-\|P\|, \|P\|] \subset [-1, 1]$ (Retherford, 1993, chap. 6). The operator $P$ is called positive if $\langle Pg, g\rangle \ge 0$ for all $g \in L_0^2(f_X)$. It can be shown that for a positive $P$, $\sigma(P) \subset [0, 1]$ (Retherford, 1993, p. 153). When the state space is finite, the spectrum is simply the set of eigenvalues of the corresponding Markov transition matrix. But, for a general state space $\mathsf{X}$, the spectrum of the Markov operator $P$ can be quite complex. One exception is when $P$ is compact. The operator $P$ is compact if for any sequence $\{g_n\} \subset L_0^2(f_X)$ with $\|g_n\| \le 1$, there is a subsequence $\{g_{n_k}\}$ such that $\{Pg_{n_k}\}$ converges. For a compact operator $P$, all the points (except 0) in the spectrum are eigenvalues, $\sigma(P)$ is at most countable, and the spectrum has at most one limit point, namely 0 (Conway, 1990, p. 214). If $\beta_n \downarrow 0$ are the ordered eigenvalues of a positive compact operator $P$, then $\|P\| = \beta_1$ (Retherford, 1993, chap. 7), and in this case $\beta_1$ is necessarily less than 1 (because of ergodicity). The operator $P$ is trace-class if $\sum_{n=1}^{\infty} \beta_n < \infty$, i.e., if the sum of the eigenvalues is finite, and $P$ is Hilbert-Schmidt if $\sum_{n=1}^{\infty} \beta_n^2 < \infty$. Note that if a positive Markov operator is trace-class then it is automatically Hilbert-Schmidt. Diaconis et al. (2008) showed that if $P$ is Hilbert-Schmidt then the Markov chain's $\chi^2$ distance to its stationary distribution can be written as
\[
\int_{\mathsf{X}} \frac{\big( p^n(x'|x) - f_X(x') \big)^2}{f_X(x')}\,\mu(dx') = \sum_i \beta_i^{2n}\, \xi_i^2(x),
\]
where $p^n(\cdot|x)$ denotes the density of $X_n$ given $X_0 = x$ and $\{\xi_i\}$ is an orthonormal basis of eigenfunctions corresponding to $\{\beta_i\}$. The above representation shows that, among positive Hilbert-Schmidt operators, the Markov chains with smaller eigenvalues are likely to have faster convergence to stationarity. In the next section we compare the DA and sandwich algorithms.

3 Comparison of the DA and sandwich algorithms

Consider the Mtd of the DA algorithm, given by
\[
k(x'|x) = \int_{\mathsf{Y}} f_{X|Y}(x'|y)\, f_{Y|X}(y|x)\,\nu(dy).
\]
Recall that $f_{X|Y}$ and $f_{Y|X}$ are the two conditional densities associated with $f(x, y)$. Let $K : L_0^2(f_X) \to L_0^2(f_X)$ be the Markov operator defined by $k(x'|x)$, i.e., $K$ takes $g \in L_0^2(f_X)$ to $(Kg)(x) = \int_{\mathsf{X}} g(x')\,k(x'|x)\,\mu(dx')$. Liu et al. (1994) showed that the DA operator $K$ is always self-adjoint and positive. Following Diaconis et al. (2008), we can write $K$ as $K = Q^*Q$ (see also Buja (1990)), where the operator $Q : L_0^2(f_X) \to L_0^2(f_Y)$ and its adjoint $Q^* : L_0^2(f_Y) \to L_0^2(f_X)$ are defined as follows:
\[
(Qg)(y) = \int_{\mathsf{X}} g(x)\, f_{X|Y}(x|y)\,\mu(dx) \quad \text{and} \quad (Q^*h)(x) = \int_{\mathsf{Y}} h(y)\, f_{Y|X}(y|x)\,\nu(dy).
\]
Slightly abusing notation, we use $\|\cdot\|$ to denote the norm of any operator regardless of its domain and range. Similarly, we use $\langle \cdot, \cdot \rangle$ as the inner product on both $L_0^2(f_X)$ and $L_0^2(f_Y)$. The following result is a simple extension of Proposition 2.7 in Conway (1990, p. 32).

Proposition 1. $\sqrt{\|K\|} = \|Q\| = \|Q^*\|$.

Let $\tilde{K} : L_0^2(f_X) \to L_0^2(f_X)$ be the operator of the sandwich algorithm with the Mtd
\[
\tilde{k}(x'|x) = \int_{\mathsf{Y}} \int_{\mathsf{Y}} f_{X|Y}(x'|y')\, R(y, dy')\, f_{Y|X}(y|x)\,\nu(dy),
\]
where $R(y, dy')$ is a Mtf with invariant density $f_Y$. A simple calculation shows that $f_X$ is the invariant density of $\tilde{k}$. Clearly, we can represent $\tilde{K}$ as $\tilde{K} = Q^*RQ$, where $R : L_0^2(f_Y) \to L_0^2(f_Y)$ is the operator corresponding to the Mtf $R(y, dy')$. We now prove that the norm of the DA chain is at least as large as that of SA.

Theorem 1. If $K$ and $\tilde{K}$ are the operators corresponding to the DA and sandwich algorithms respectively, then $\|\tilde{K}\| \le \|K\|$.

Proof. Note that
\[
\|\tilde{K}\| = \|Q^*RQ\| \le \|Q^*\|\,\|R\|\,\|Q\| = \|R\|\,\|K\| \le \|K\|,
\]
where the second inequality is due to the fact that $R$ is a Markov operator and the second equality follows from Proposition 1.

We now consider conditions under which $\|\tilde{K}\|$ is strictly smaller than $\|K\|$. Assume that the Mtf $R$ is reversible with respect to $f_Y$. Of course, then $\tilde{k}$ is reversible with respect to $f_X$. If $R$ is also geometrically ergodic, that is, if $\|R\| < 1$, then we have $\|\tilde{K}\| \le \|R\|\,\|K\| < \|K\|$.
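The norm comparison (and the eigenvalue domination established below for idempotent $R$) can be checked numerically in a small finite-state example. The joint pmf and the partition defining $R$ below are arbitrary made-up choices for illustration: $R$ redraws $y$ from $f_Y$ restricted to the block containing the current $y$, so it is idempotent and reversible with respect to $f_Y$.

```python
import numpy as np

# Made-up joint pmf f(x, y) on a 3 x 4 grid (rows: x, columns: y).
F = np.array([[0.10, 0.05, 0.05, 0.05],
              [0.05, 0.15, 0.10, 0.05],
              [0.05, 0.05, 0.10, 0.20]])   # entries sum to 1
fX, fY = F.sum(axis=1), F.sum(axis=0)

A = F / fX[:, None]          # A[x, y] = f_{Y|X}(y | x)
B = F.T / fY[:, None]        # B[y, x] = f_{X|Y}(x | y)
K = A @ B                    # DA kernel: K[x, x'] = k(x' | x)

# Idempotent, fY-reversible R: redraw y from fY restricted to its block.
blocks = [[0, 1], [2, 3]]
R = np.zeros((4, 4))
for blk in blocks:
    for y in blk:
        R[y, blk] = fY[blk] / fY[blk].sum()
Ktil = A @ R @ B             # sandwich kernel

def spectrum(M, p):
    # eigenvalues of a p-reversible kernel via its symmetrization
    d = np.sqrt(p)
    return np.sort(np.linalg.eigvalsh(d[:, None] * M / d[None, :]))[::-1]

ev_K, ev_Kt = spectrum(K, fX), spectrum(Ktil, fX)
# drop the trivial eigenvalue 1 (constants are excluded from L^2_0)
norm_K, norm_Kt = ev_K[1], ev_Kt[1]
print(norm_Kt, "<=", norm_K, ":", norm_Kt <= norm_K + 1e-12)
```

As the theory predicts, the sandwich eigenvalues are dominated term by term by the DA eigenvalues, and in particular the sandwich norm is no larger.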
But, as mentioned in the Introduction, in practice $R$ is often chosen to be a one-dimensional reducible Markov chain. In fact, $R$ is often an idempotent operator (i.e., $R^2 = R$). Clearly, in this case, if $R \neq 0$ then $\|R\| = 1$. In Theorem 2 below, we establish results comparing DA and SA under the assumption that $R$ is idempotent and the operator $K$ is compact. The following result is a minor extension of results in Retherford (1993, chap. VII).

Proposition 2. The following statements are equivalent.
• $K$ is compact.
• $Q$ is compact.
• $Q^*$ is compact.

We assume that the DA operator $K$ is compact, so $Q$ and $Q^*$ are also compact. Moreover, the spectral theorem for compact, self-adjoint operators and dual pairs (Naylor and Sell (1982, Section 6.14); see also Buja (1990)) guarantees the existence of singular values $\{\lambda_n\}_{n=1}^{\infty}$ and associated functions $\{g_n\}_{n=1}^{\infty}$, $\{h_n\}_{n=1}^{\infty}$ such that
• $\lambda_n \in [0, 1]$ and $\lambda_n \le \lambda_{n-1}$;
• $\{g_n\}$ and $\{h_n\}$ form complete orthonormal bases of $L_0^2(f_X)$ and $L_0^2(f_Y)$ respectively;
• $(Qg_n)(y) = \lambda_n h_n(y)$ and $(Q^*h_n)(x) = \lambda_n g_n(x)$;
• $\int_{\mathsf{X}}\int_{\mathsf{Y}} g_n(x)\, h_{n'}(y)\, f(x, y)\,\mu(dx)\,\nu(dy) = 0$ for $n \neq n'$.

Let $\alpha_1 \ge \alpha_2 \ge \cdots$ be the (ordered) eigenvalues of $K$. Since $Kg_n = Q^*Qg_n = \lambda_n Q^*h_n = \lambda_n^2 g_n$, we have $\alpha_n = \lambda_n^2$ and $\|K\| = \alpha_1 = \lambda_1^2$. Also, since $K$ is compact and Harris ergodic, we have $\|K\| < 1$, which in turn shows that $\lambda_n \in [0, 1)$ for all $n = 1, 2, \ldots$. We now prove the following theorem.

Theorem 2. Assume that $R$ is idempotent with $\|R\| = 1$ and that the DA operator $K$ is compact. Define $m = \max\{n \in \mathbb{N} : \lambda_n = \lambda_1\}$. Then

1. $\tilde{K}$ is positive and compact.
2. If $\tilde{\alpha}_1 \ge \tilde{\alpha}_2 \ge \cdots$ are the (ordered) eigenvalues of $\tilde{K}$, then $\tilde{\alpha}_n \le \alpha_n$ for all $n = 1, 2, \ldots$.
3. A necessary and sufficient condition for $\|\tilde{K}\| < \|K\|$ is that
\[
R\sum_{i=1}^{m} a_i h_i = \sum_{i=1}^{m} a_i h_i
\]
holds if and only if $(a_1, a_2, \ldots, a_m) = 0 \in \mathbb{R}^m$.

Remark 1. Note that if $m = 1$, then $\|\tilde{K}\| < \|K\|$ if and only if $Rh_1 \neq h_1$.

Proof.
Since $R$ is idempotent,
\[
\langle \tilde{K}g, g\rangle = \langle Q^*RQg, g\rangle = \langle RQg, Qg\rangle = \langle RQg, RQg\rangle \ge 0,
\]
which shows that $\tilde{K}$ is positive. Since $R$ is a bounded operator and $Q$ is compact, a minor extension of results in Retherford (1993, chap. VII) shows that $RQ$ is compact. Similarly, $\tilde{K} = Q^*RQ$ is compact since $Q^*$ is bounded. As in Khare and Hobert (2011), note that for any $g \in L_0^2(f_X)$,
\[
\langle Kg, g\rangle - \langle \tilde{K}g, g\rangle = \langle (K - \tilde{K})g, g\rangle = \langle Q^*(I - R)Qg, g\rangle = \langle (I - R)Qg, (I - R)Qg\rangle \ge 0,
\]
i.e., $K - \tilde{K}$ is a positive operator. Then the Courant-Fischer-Weyl minimax characterization of the eigenvalues of positive, compact, self-adjoint operators (see, e.g., Voss, 2003) yields
\[
\tilde{\alpha}_n = \min_{\dim(V) = n-1}\ \max_{g \in V^{\perp},\, g \neq 0} \frac{\langle \tilde{K}g, g\rangle}{\langle g, g\rangle} \;\le\; \min_{\dim(V) = n-1}\ \max_{g \in V^{\perp},\, g \neq 0} \frac{\langle Kg, g\rangle}{\langle g, g\rangle} = \alpha_n.
\]
We know that $\|Q\| = \lambda_1 = \|Q^*\|$. Then, using the properties of compact adjoint operators mentioned above, the proof of part 3 follows directly from the proof of Khare and Hobert's (2011) Theorem 1.

In the next section we present a compact DA algorithm that is not trace-class. We also construct a sandwich algorithm where the operator $R$ is idempotent with $\|R\| = 1$. Using Theorem 2 we then show that $\|\tilde{K}\| < \|K\|$. Notice that since the DA algorithm in this example is not trace-class, Khare and Hobert's (2011) results are not applicable for comparing DA and SA in this case.

4 A toy compact DA algorithm

Let $f_X(x)$ be the hyperbolic secant density given by
\[
f_X(x) = \frac{1}{2\cosh(\pi x/2)}, \qquad -\infty < x < \infty,
\]
with respect to Lebesgue measure on $\mathbb{R}$. While we do not need to use MCMC algorithms to explore $f_X(x)$, it is interesting to construct and compare DA and sandwich algorithms in this context. Consider the joint density $f(x, y)$ given by
\[
f(x, y) = f_X(y - x)\, f_X(x), \qquad (x, y) \in \mathbb{R}^2,
\]
with respect to Lebesgue measure on $\mathbb{R}^2$. Note that $\int_{\mathbb{R}} f(x, y)\,dy = f_X(x)$. Suppose $W_1$ and $W_2$ are two independent standard Cauchy random variables. Then $f_X$ is the density function of $\frac{2}{\pi}\log|W_1|$ (Morris, 1982, p. 73). The marginal density $f_Y$, which is the density of $\frac{2}{\pi}\log|W_1 W_2|$, is given by
\[
f_Y(y) = \frac{y}{2\sinh(\pi y/2)}, \qquad -\infty < y < \infty.
\]
The Mtd of the corresponding DA algorithm is given by
\[
k(x'|x) = \int_{-\infty}^{\infty} f_{X|Y}(x'|y)\, f_X(y - x)\,dy,
\]
where the conditional density $f_{X|Y}(x|y)$ is given by
\[
f_{X|Y}(x|y) = \frac{\sinh(\pi y/2)}{2y\cosh(\pi(y - x)/2)\cosh(\pi x/2)},
\]
which is not a standard density. Diaconis et al. (2008) considered this DA algorithm in their study of Gibbs samplers for location families with conjugate priors. In fact, it follows from Diaconis et al. (2008) that the DA operator $K$ is compact with eigenvalues $\alpha_n = \frac{1}{n+1}$ for $n = 1, 2, \ldots$. Since $\sum_{n=1}^{\infty} \alpha_n = \sum_{n=1}^{\infty} \frac{1}{n+1} = \infty$, $K$ is not trace-class.

In order to construct a sandwich algorithm, we use Hobert and Marchev's (2008) recipe based on group actions. Consider the multiplicative group $\mathbb{R}_+$, where the group composition is multiplication. The Haar measure on this unimodular group is $\omega(dg) = dg/g$, where $dg$ is Lebesgue measure on $\mathbb{R}_+$. Consider the group action $(g, y) \mapsto gy$ from $\mathbb{R}_+ \times \mathbb{R}$ to $\mathbb{R}$, where, as the notation suggests, the action is defined by multiplication. Then Lebesgue measure on $\mathbb{R}$, $dy$, is relatively invariant with multiplier $\chi(g) = g$ (Eaton (1989)), i.e.,
\[
\chi(g)\int_{\mathbb{R}} \phi(gy)\,dy = g\int_{\mathbb{R}} \phi(gy)\,dy = \int_{\mathbb{R}} \phi(y)\,dy
\]
for all $g \in \mathbb{R}_+$ and all integrable $\phi : \mathbb{R} \to \mathbb{R}$.

Let $m(y) = \int_{\mathbb{R}_+} f_Y(gy)\,\chi(g)\,\omega(dg)$. Then
\[
m(y) = \int_{\mathbb{R}_+} f_Y(gy)\,\chi(g)\,\omega(dg) = \int_{\mathbb{R}_+} f_Y(gy)\, g\, \frac{dg}{g} = \frac{1}{|y|}\int_0^{\infty} \frac{z}{2\sinh(\pi z/2)}\,dz = \frac{1}{2|y|}.
\]
Note that $m(y)$ is positive for all $y \in \mathbb{R}$ and is finite for all $y \in \mathbb{R}\setminus\{0\}$. Given a fixed $y \neq 0$, let $g$ have the density (with respect to Lebesgue measure on $\mathbb{R}_+$)
\[
\frac{f_Y(gy)\,\chi(g)}{m(y)}\,\frac{1}{g} = \frac{gy\,|y|}{\sinh(\pi gy/2)}, \qquad g \in \mathbb{R}_+.
\]
Let $y' = gy$. Then, conditional on $y \neq 0$, the density of $y'$ is
\[
r(y'|y) = \frac{y'}{\sinh(\pi y'/2)}\Big[I_{\mathbb{R}_+}(y)\,I_{\mathbb{R}_+}(y') + I_{\mathbb{R}_-}(y)\,I_{\mathbb{R}_-}(y')\Big].
\]
We define the sandwich Mtf $R$ in the SA chain as $R(y, A) = \int_A r(y'|y)\,dy'$ for measurable $A \subset \mathbb{R}\setminus\{0\}$.
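As a quick numeric sanity check of the calculation $m(y) = 1/(2|y|)$ above (equivalently, that $r(\cdot|y)$ integrates to one), the integral $\int_0^{\infty} z/(2\sinh(\pi z/2))\,dz$ can be evaluated with a simple pure-Python trapezoid rule:

```python
import math

# Numeric check: int_0^inf z / (2 sinh(pi z / 2)) dz = 1/2, which gives
# m(y) = 1/(2|y|); doubling it checks that r(.|y) integrates to one.

def integrand(z):
    # z / (2 sinh(pi z / 2)), with the z -> 0 limit 1/pi filled in
    if z == 0.0:
        return 1.0 / math.pi
    return z / (2.0 * math.sinh(math.pi * z / 2.0))

def trapezoid(f, a, b, n):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# the integrand decays like z * exp(-pi z / 2), so the tail beyond 40
# is negligible
half = trapezoid(integrand, 0.0, 40.0, 200_000)
print(half)   # approximately 0.5
```

The computed value agrees with the closed form $\int_0^{\infty} x/\sinh(ax)\,dx = \pi^2/(4a^2)$ evaluated at $a = \pi/2$.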
Then from Hobert and Marchev (2008) it follows that the corresponding Markov operator $R$ is self-adjoint and idempotent with $\|R\| = 1$. From Theorem 2 it follows that the spectrum of the SA chain dominates that of the DA chain, that is, $\tilde{\alpha}_n \le \alpha_n$ for all $n = 1, 2, \ldots$. Note that since $\sum_n \tilde{\alpha}_n^2 \le \sum_n \alpha_n^2 = \sum_n \frac{1}{(n+1)^2} < \infty$, both the DA and sandwich algorithms in this example are Hilbert-Schmidt.

We now show that $\|\tilde{K}\| < \|K\|$. Since the eigenvalues $\alpha_n$, $n = 1, 2, \ldots$, are strictly decreasing, we need only show that $Rh_1 \neq h_1$ (Remark 1), where $\{h_n\}_{n=1}^{\infty}$ is the orthonormal basis of $L_0^2(f_Y)$ mentioned in Section 3. The eigenfunctions $\{h_n\}_{n=1}^{\infty}$ are the Meixner-Pollaczek orthonormal polynomials (Diaconis et al. (2008)) given by
\[
P_n^{\lambda}(y; \varphi) = \frac{(2\lambda)_n}{n!}\, e^{in\varphi}\, {}_2F_1\big({-n},\, \lambda + iy;\, 2\lambda \,\big|\, 1 - e^{-2i\varphi}\big),
\]
with $\varphi = \pi/2$ and $\lambda = 1$. Here $(a)_0 = 1$; for $n \in \mathbb{N}$, $(a)_n = a(a+1)\cdots(a+n-1)$; and
\[
{}_rF_s(b_1, \ldots, b_r;\, c_1, \ldots, c_s \,|\, z) = \sum_{l=0}^{\infty} \frac{(b_1 \cdots b_r)_l}{(c_1 \cdots c_s)_l}\,\frac{z^l}{l!} \quad \text{with} \quad (b_1 \cdots b_r)_l = \prod_{i=1}^{r}(b_i)_l.
\]
In particular, a simple calculation shows that $h_1(y) = 2y$. Since
\[
(Rh_1)(y) = \int_{\mathsf{Y}} h_1(y')\, r(y'|y)\,dy' = 2\,\mathrm{Sign}(y)\int_0^{\infty} \frac{z^2}{\sinh(\pi z/2)}\,dz,
\]
which depends on $y$ only through its sign, we have $Rh_1 \neq h_1$ and hence $\|\tilde{K}\| < \|K\|$.

Acknowledgments

The author thanks two anonymous reviewers for helpful comments and suggestions.

References

Buja, A. (1990). Remarks on functional canonical variates, alternating least squares methods and ACE. The Annals of Statistics, 18, 1032-1069.

Conway, J. B. (1990). A Course in Functional Analysis. 2nd ed. Springer-Verlag, New York.

Diaconis, P., Khare, K. and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials (with discussion). Statistical Science, 23, 151-200.

Eaton, M. L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statistics and the American Statistical Association, Hayward, California and Alexandria, Virginia.

Hobert, J. P. and Marchev, D. (2008). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. The Annals of Statistics, 36, 532-554.

Hobert, J. P. and Román, J. C. (2011). Discussion of "To center or not to center: that is not the question - an ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency" by Y. Yu and X.-L. Meng. Journal of Computational and Graphical Statistics. In press.

Hobert, J. P. and Rosenthal, J. S. (2007). Norm comparisons for data augmentation. Advances and Applications in Statistics, 7, 291-302.

Hobert, J. P., Roy, V. and Robert, C. P. (2011). Improving the convergence properties of the data augmentation algorithm with an application to Bayesian mixture modelling. Statistical Science. To appear.

Khare, K. and Hobert, J. P. (2011). A spectral analytic comparison of trace-class data augmentation algorithms and their sandwich variants. The Annals of Statistics. To appear.

Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika, 81, 27-40.

Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, 94, 1264-1274.

Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika, 86, 301-320.

Morris, C. N. (1982). Natural exponential families with quadratic variance functions. The Annals of Statistics, 10, 65-80.

Naylor, A. W. and Sell, G. R. (1982). Linear Operator Theory in Engineering and Science. Springer, New York.

Retherford, J. R. (1993). Hilbert Space: Compact Operators and the Trace Theorem. Cambridge University Press.

Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2, 13-25.

Rosenthal, J. S. (2003).
Asymptotic variance and convergence rates of nearly-periodic Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 98, 169-177.

Roy, V. and Hobert, J. P. (2007). Convergence rates and asymptotic standard errors for MCMC algorithms for Bayesian probit regression. Journal of the Royal Statistical Society, Series B, 69, 607-623.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics, 10, 1-50.

Voss, H. (2003). Variational characterization of eigenvalues of nonlinear eigenproblems. In Proceedings of the International Conference on Mathematical and Computer Modelling in Science and Engineering (M. Kocandrlova and V. Kelar, eds.). Czech Technical University in Prague, 379-383.

Yu, Y. and Meng, X.-L. (2011). To center or not to center: that is not the question - an ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics. In press.