OPTIMAL LOW-RANK APPROXIMATIONS OF BAYESIAN LINEAR INVERSE PROBLEMS

ALESSIO SPANTINI, ANTTI SOLONEN, TIANGANG CUI, JAMES MARTIN, LUIS TENORIO, AND YOUSSEF MARZOUK

SIAM J. Sci. Comput., Vol. 37, No. 6, pp. A2451–A2487. © 2015 Society for Industrial and Applied Mathematics.

Abstract. In the Bayesian approach to inverse problems, data are often informative, relative to the prior, only on a low-dimensional subspace of the parameter space. Significant computational savings can be achieved by using this subspace to characterize and approximate the posterior distribution of the parameters. We first investigate approximation of the posterior covariance matrix as a low-rank update of the prior covariance matrix. We prove optimality of a particular update, based on the leading eigendirections of the matrix pencil defined by the Hessian of the negative log-likelihood and the prior precision, for a broad class of loss functions. This class includes the Förstner metric for symmetric positive definite matrices, as well as the Kullback–Leibler divergence and the Hellinger distance between the associated distributions. We also propose two fast approximations of the posterior mean and prove their optimality with respect to a weighted Bayes risk under squared-error loss. These approximations are deployed in an offline-online manner, where a more costly but data-independent offline calculation is followed by fast online evaluations. As a result, these approximations are particularly useful when repeated posterior mean evaluations are required for multiple data sets. We demonstrate our theoretical results with several numerical examples, including high-dimensional X-ray tomography and an inverse heat conduction problem. In both of these examples, the intrinsic low-dimensional structure of the inference problem can be exploited while producing results that are essentially indistinguishable from solutions computed in the full space.

Key words. inverse problems, Bayesian inference, low-rank approximation, covariance approximation, Förstner–Moonen metric, posterior mean approximation, Bayes risk, optimality

AMS subject classifications. 15A29, 62F15, 68W25

DOI. 10.1137/140977308

Submitted to the journal's Methods and Algorithms for Scientific Computing section July 14, 2014; accepted for publication (in revised form) August 14, 2015; published electronically November 3, 2015. This work was supported by the U.S. Department of Energy, Office of Advanced Scientific Computing (ASCR), under grants DE-SC0003908 and DE-SC0009297. http://www.siam.org/journals/sisc/37-6/97730.html

Author affiliations: Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA 02139 (spantini@mit.edu, tcui@mit.edu, ymarz@mit.edu); Department of Mathematics and Physics, Lappeenranta University of Technology, 53851 Lappeenranta, Finland, and Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA 02139 (solonen@mit.edu); Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, TX 78712 (jmartin@ices.utexas.edu); Applied Mathematics and Statistics, Colorado School of Mines, Golden, CO 80401 (ltenorio@mines.edu).

1. Introduction. In the Bayesian approach to inverse problems, the parameters of interest are treated as random variables, endowed with a prior probability distribution that encodes information available before any data are observed.
Observations are modeled by their joint probability distribution conditioned on the parameters of interest, which defines the likelihood function and incorporates the forward model and a stochastic description of measurement or model errors. The prior and likelihood then combine to yield a probability distribution for the parameters conditioned on the observations, i.e., the posterior distribution. While this formulation is quite general, essential features of inverse problems bring additional structure to the Bayesian update. The prior distribution often encodes some kind of smoothness or correlation among the inversion parameters; observations typically are finite, few in number, and corrupted by noise; and the observations are indirect, related to the inversion parameters by the action of a forward operator that destroys some information. A key consequence of these features is that the data may be informative, relative to the prior, only on a low-dimensional subspace of the entire parameter space. Identifying and exploiting this subspace, in order to design approximations of the posterior distribution and related Bayes estimators, can lead to substantial computational savings.

In this paper we investigate approximation methods for finite-dimensional Bayesian linear inverse problems with Gaussian measurement and prior distributions. We characterize approximations of the posterior distribution that are structure-exploiting and that are optimal in a sense to be defined below. Since the posterior distribution is Gaussian, it is completely determined by its mean and covariance; we therefore focus on approximations of these posterior characteristics. Optimal approximations will reduce computation and storage requirements for high-dimensional inverse problems and will also enable fast computation of the posterior mean in a many-query setting.

We consider approximations of the posterior covariance matrix in the form of low-rank negative updates of the prior covariance matrix. This class of approximations exploits the structure of the prior-to-posterior update and also arises naturally in Kalman filtering techniques (e.g., [2, 3, 76]); the challenge is to find an optimal update within this class and to define in what sense it is optimal.
We will argue that a suitable loss function with which to define optimality is the Förstner metric [29] for symmetric positive definite (SPD) matrices, and we will show that this metric generalizes to a broader class of loss functions that emphasize relative differences in covariance. We will derive the optimal low-rank update for this entire class of loss functions. In particular, we will show that the prior covariance matrix should be updated along the leading generalized eigenvectors of the pencil (H, Γpr^{-1}) defined by the Hessian of the negative log-likelihood and the prior precision matrix. If we assume exact knowledge of the posterior mean, then our results extend to optimality statements between distributions (e.g., optimality in Kullback–Leibler (K-L) divergence and in Hellinger distance). The form of this low-rank update of the prior is not new [6, 9, 28, 60], but previous work has not shown whether, and if so in exactly what sense, it yields optimal approximations of the posterior. A key contribution of this paper is to establish and explain such optimality.

Properties of the generalized eigenpairs of (H, Γpr^{-1}) and related matrix pencils have been studied previously in the literature, especially in the context of classical regularization techniques for linear inverse problems [81, 68, 38, 37, 25]. (In the framework of Tikhonov regularization [80], the regularized estimate coincides with the posterior mean of the Bayesian linear model we consider here, provided that the prior covariance matrix is chosen appropriately.) The joint action of the log-likelihood Hessian and the prior precision matrix has also been used in related regularization methods [13, 10, 12, 43]. However, these efforts have not been concerned with the posterior covariance matrix or with its optimal approximation, since this matrix is a property of the Bayesian approach to inversion.

One often justifies the assumption that the posterior mean is exactly known by arguing that it can easily be computed as the solution of a regularized least-squares problem [42, 69, 1, 62, 5]; indeed, evaluation of the posterior mean to machine precision is now feasible even for million-dimensional parameter spaces [6]. If, however, one needs multiple evaluations of the posterior mean for different realizations of the data (e.g., in an online inference context), then solving a linear system to determine the posterior mean may not be the most efficient strategy. A second goal of our paper is to address this problem. We will propose two computationally efficient approximations of the posterior mean based on (i) evaluating a low-rank affine function of the data or (ii) using a low-rank update of the prior covariance matrix in the exact formula for the posterior mean. The optimal approximation in each case is defined as the minimizer of the Bayes risk for a squared-error loss weighted by the posterior precision matrix. We provide explicit formulas for these optimal approximations and show that they can be computed by exploiting the optimal posterior covariance approximation described above. Thus, given a new set of data, computing an optimal approximation of the posterior mean becomes a computationally trivial task.
Low-rank approximations of the posterior mean that minimize the Bayes risk for squared-error loss have been proposed in [17, 20, 19, 18, 16] for a general non-Gaussian case. Here, instead, we develop analytical results for squared-error loss weighted by the posterior precision matrix. This choice of norm reflects the idea that approximation errors in directions of low posterior variance should be penalized more strongly than errors in high-variance directions, since we do not want the approximate posterior mean to fall outside the bulk of the posterior probability distribution. Remarkably, in this case the optimal approximation only requires the leading eigenvectors and eigenvalues of a single eigenvalue problem. This is the same eigenvalue problem we solve to obtain an optimal approximation of the posterior covariance matrix, and thus we can efficiently obtain both approximations at the same time.

While the efficient solution of large-scale linear-Gaussian Bayesian inverse problems is of standalone interest [28], optimal approximations of Gaussian posteriors are also a building block for the solution of nonlinear Bayesian inverse problems. For example, the stochastic Newton Markov chain Monte Carlo (MCMC) method [60] uses Gaussian proposals derived from local linearizations of a nonlinear forward model; the parameters of each Gaussian proposal are computed using the optimal approximations analyzed in this paper. To tackle even larger nonlinear inverse problems, [6] uses a Laplace approximation of the posterior distribution wherein the Hessian at the mode of the log-posterior density is itself approximated using the present approach. Similarly, approximations of local Gaussians can facilitate the construction of a nonstationary Gaussian process whose mean directly approximates the posterior density [8]. Alternatively, [22] combines data-informed directions derived from local linearizations of the forward model (a direct extension of the posterior covariance approximations described in the present work) to create a global data-informed subspace. A computationally efficient approximation of the posterior distribution is then obtained by restricting MCMC to this subspace and treating complementary directions analytically. Moving from the finite- to the infinite-dimensional setting, the same global data-informed subspace is used to drive efficient dimension-independent posterior sampling for inverse problems in [21].

Earlier work on dimension reduction for Bayesian inverse problems used the Karhunen–Loève expansion of the prior distribution [61, 54] to describe the parameters of interest. To reduce dimension, this expansion is truncated; this step renders both the prior and posterior distributions singular (i.e., collapsed onto the prior mean) in the neglected directions. Avoiding large truncation errors then requires that the prior distribution impose significant smoothness on the parameters, so that the spectrum of the prior covariance kernel decays quickly. In practice, this requirement restricts the choice of priors. Moreover, this approach relies entirely on properties of the prior distribution and does not incorporate the influence of the forward operator or the observational errors.
Alternatively, [56] constructs a reduced basis for the parameter space via greedy model-constrained sampling, but this approach can also fail to capture posterior variability in directions uninformed by the data. Both of these earlier approaches seek reduction in the overall description of the parameters. This notion differs fundamentally from the dimension reduction technique advocated in this paper, where low-dimensional structure is sought in the change from prior to posterior.

The rest of this paper is organized as follows. In section 2 we introduce the posterior covariance approximation problem and derive the optimal prior-to-posterior update with respect to a broad class of loss functions. The structure of the optimal posterior covariance matrix approximation is examined in section 3. Several interpretations are given in this section, including an equivalent reformulation of the covariance approximation problem as an optimal projection of the likelihood function onto a lower-dimensional subspace. In section 4 we characterize optimal approximations of the posterior mean. In section 5 we provide several numerical examples. Section 6 offers concluding remarks. Appendix A collects proofs of many of the theorems stated throughout the paper, along with additional technical results.

2. Optimal approximation of the posterior covariance matrix. Consider the Bayesian linear model defined by a Gaussian likelihood and a Gaussian prior with a nonsingular covariance matrix Γpr ≻ 0 and, without loss of generality, zero mean:

(2.1)    y | x ∼ N(Gx, Γobs),    x ∼ N(0, Γpr).

Here x represents the parameters to be inferred, G is the linear forward operator, and y are the observations, with Γobs ≻ 0. The statistical model (2.1) also follows from y = Gx + ε, where ε ∼ N(0, Γobs) is independent of x. It is easy to see that the posterior distribution is again Gaussian (see, e.g., [14]): x | y ∼ N(μpos(y), Γpos), with mean and covariance matrix given by

(2.2)    μpos(y) = Γpos G^T Γobs^{-1} y    and    Γpos = (H + Γpr^{-1})^{-1},

where

(2.3)    H = G^T Γobs^{-1} G

is the Hessian of the negative log-likelihood (i.e., the Fisher information matrix). Since the posterior is Gaussian, the posterior mean coincides with the posterior mode: μpos(y) = arg max_x πpos(x; y), where πpos is the posterior density. Note that the posterior covariance matrix does not depend on the data.

2.1. Defining the approximation class. We will seek an approximation, Γ̂pos, of the posterior covariance matrix that is optimal in a class of matrices to be defined shortly. As we can see from (2.2), the posterior precision matrix Γpos^{-1} is a nonnegative update of the prior precision matrix Γpr^{-1}: Γpos^{-1} = Γpr^{-1} + Z Z^T, where Z Z^T = H. Similarly, using Woodbury's identity we can write Γpos as a nonpositive update of Γpr: Γpos = Γpr − K K^T, where K K^T = Γpr G^T Γy^{-1} G Γpr and Γy = Γobs + G Γpr G^T is the covariance matrix of the marginal distribution of y [47]. This update of Γpr is negative semidefinite because the data add information: the posterior variance in any direction is always smaller than the corresponding prior variance.
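For concreteness, the following minimal Python sketch assembles the quantities in (2.1)-(2.3) and checks the Woodbury form of the prior-to-posterior update on an arbitrary small dense problem. It is an illustration only, not the paper's numerical setup; all matrices and dimensions here are made up.

```python
import numpy as np

# Minimal dense sketch of (2.1)-(2.3); arbitrary small matrices for illustration.
rng = np.random.default_rng(0)
n, m = 50, 20                                  # parameter and data dimensions
G = rng.standard_normal((m, n))                # linear forward operator
Gamma_obs = 0.1 * np.eye(m)                    # noise covariance
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n)             # SPD prior covariance

H = G.T @ np.linalg.solve(Gamma_obs, G)        # Hessian of the negative log-likelihood, (2.3)
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))          # posterior covariance, (2.2)

y = rng.standard_normal(m)                     # one realization of the data
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_obs, y)        # posterior mean, (2.2)

# Woodbury form of the prior-to-posterior update: Gamma_pos = Gamma_pr - K K^T.
Gamma_y = Gamma_obs + G @ Gamma_pr @ G.T
KKt = Gamma_pr @ G.T @ np.linalg.solve(Gamma_y, G @ Gamma_pr)
assert np.allclose(Gamma_pos, Gamma_pr - KKt)
```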
Moreover, the update is usually low rank for exactly the reasons described in the introduction: there are directions in the parameter space along which the data are not very informative relative to the prior. For instance, H might have a quickly decaying spectrum [7, 75]. Note, however, that Γpos itself might not be low rank. Low-rank structure, if any, lies in the update of Γpr that yields Γpos. Hence, a natural class of matrices for approximating Γpos is the set of negative semidefinite updates of Γpr, with a fixed maximum rank, that lead to positive definite matrices:

(2.4)    Mr = { Γpr − K K^T ≻ 0 : rank(K) ≤ r }.

This class of approximations of the posterior covariance matrix takes advantage of the structure of the prior-to-posterior update.

2.2. Loss functions. Optimality statements regarding the approximation of a covariance matrix require an appropriate notion of distance between SPD matrices. We shall use the metric introduced by Förstner and Moonen [29], which is derived from a canonical invariant metric on the cone of real SPD matrices and is defined as follows: the Förstner distance, dF(A, B), between a pair of SPD matrices, A and B, is given by

    dF^2(A, B) = tr[ ln^2( A^{-1/2} B A^{-1/2} ) ] = Σ_i ln^2(σi),

where (σi) is the sequence of generalized eigenvalues of the pencil (A, B). The Förstner metric satisfies the following important invariance properties:

(2.5)    dF(A, B) = dF(A^{-1}, B^{-1})    and    dF(A, B) = dF(M A M^T, M B M^T)

for any nonsingular matrix M. Moreover, dF treats under- and overapproximations similarly, in the sense that dF(Γpos, α Γ̂pos) → ∞ both as α → 0 and as α → ∞. (This behavior is shared by Stein's loss function, which has been proposed to assess estimates of a covariance matrix [45]. Stein's loss function is just the K-L distance between two Gaussian distributions with the same mean (see (A.10)), but it is not a metric for SPD matrices.) Note that the metric induced by the Frobenius norm does not satisfy any of the aforementioned invariance properties; in addition, it penalizes under- and overestimation differently.

We will show that our posterior covariance matrix approximation is optimal not only in terms of the Förstner metric but also in terms of the following more general class, L, of loss functions for SPD matrices.

Definition 2.1 (loss functions). The class L is defined as the collection of functions of the form

(2.6)    L(A, B) = Σ_{i=1}^{n} f(σi),

where A and B are SPD matrices, (σi) are the generalized eigenvalues of the pencil (A, B), and

(2.7)    f ∈ U = { g ∈ C^1(R+) : g′(x)(1 − x) < 0 for x ≠ 1, and lim_{x→∞} g(x) = ∞ }.

Elements of U are differentiable real-valued functions defined on the positive axis that decrease on x < 1, increase on x > 1, and tend to infinity as x → ∞. The squared Förstner metric belongs to the class of loss functions defined by (2.6), whereas the distance induced by the Frobenius norm does not.
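A short sketch of the Förstner distance, computed through the generalized eigenvalues of the pencil (A, B), is given below; the random SPD test matrices are arbitrary and serve only to check the invariance properties (2.5) numerically.

```python
import numpy as np
from scipy.linalg import eigh

def foerstner_distance(A, B):
    """Forstner metric between SPD matrices A and B:
    d_F^2(A, B) = sum_i ln^2(sigma_i), with (sigma_i) the generalized
    eigenvalues of the pencil (A, B)."""
    sigma = eigh(A, B, eigvals_only=True)   # solves A v = sigma B v
    return np.sqrt(np.sum(np.log(sigma) ** 2))

# Quick numerical check of the invariance properties (2.5).
rng = np.random.default_rng(1)
n = 30
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, n)); B = B @ B.T + n * np.eye(n)
M = rng.standard_normal((n, n))             # any nonsingular matrix

d0 = foerstner_distance(A, B)
assert np.isclose(d0, foerstner_distance(np.linalg.inv(A), np.linalg.inv(B)))
assert np.isclose(d0, foerstner_distance(M @ A @ M.T, M @ B @ M.T))
```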
Lemma 2.2, whose proof can be found in Appendix A, justifies the importance of the class L. In particular, it shows that optimality of the covariance matrix approximation with respect to any loss function in L leads to an optimal approximation of the posterior distribution by a Gaussian (with the same mean) in terms of other familiar criteria used to compare probability measures, such as the Hellinger distance and the K-L divergence [70]. More precisely, we have the following result.

Lemma 2.2 (equivalence of approximations). If L ∈ L, then a matrix Γ̂pos ∈ Mr minimizes the Hellinger distance and the K-L divergence between N(μpos(y), Γpos) and the approximation N(μpos(y), Γ̂pos) iff it minimizes L(Γpos, Γ̂pos).

Remark 1. We note that neither the Hellinger distance nor the K-L divergence between the distributions N(μpos(y), Γpos) and N(μpos(y), Γ̂pos) depends on the data y. Optimality in distribution does not necessarily hold when the posterior means are different.

2.3. Optimality results. We are now in a position to state one of the main results of the paper. For a proof see Appendix A.

Theorem 2.3 (optimal posterior covariance approximation). Let (δi^2, ŵi) be the generalized eigenvalue-eigenvector pairs of the pencil

(2.8)    (H, Γpr^{-1}),

with the ordering δi^2 ≥ δ_{i+1}^2, and H = G^T Γobs^{-1} G as in (2.3). Let L be a loss function in the class L defined in (2.6). Then the following hold:
(i) A minimizer, Γ̂pos, of the loss, L, between Γpos and an element of Mr is given by

(2.9)    Γ̂pos = Γpr − K K^T,    K K^T = Σ_{i=1}^{r} δi^2 (1 + δi^2)^{-1} ŵi ŵi^T.

The corresponding minimum loss is given by

(2.10)    L(Γ̂pos, Γpos) = f(1) r + Σ_{i>r} f( 1/(1 + δi^2) ).

(ii) The minimizer (2.9) is unique if the first r eigenvalues of (H, Γpr^{-1}) are different.

Theorem 2.3 provides a way to compute the best approximation of Γpos by matrices in Mr: it is just a matter of computing the eigenpairs corresponding to the decreasing sequence of eigenvalues of the pencil (H, Γpr^{-1}) until a stopping criterion is satisfied. This criterion can be based on the minimum loss (2.10). Notice that the minimum loss is a function of the generalized eigenvalues (δi^2)_{i>r} that have not been computed. This is quite common in numerical linear algebra (e.g., the error in the truncated SVD [26, 32]). However, since the eigenvalues (δi^2) are computed in decreasing order, the minimum loss can easily be bounded. The generalized eigenvectors, ŵi, are orthogonal with respect to the inner product induced by the prior precision matrix, and they maximize the Rayleigh ratio,

    R(z) = (z^T H z) / (z^T Γpr^{-1} z),

over subspaces of the form Ŵi = span^⊥(ŵj)_{j<i}. Intuitively, the vectors ŵi associated with generalized eigenvalues greater than one correspond to directions in the parameter space (or subspaces thereof) where the curvature of the log-posterior density is constrained more by the log-likelihood than by the prior.
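A dense sketch of Theorem 2.3 follows: it forms the optimal rank-r update (2.9) from the leading generalized eigenpairs of (H, Γpr^{-1}) and compares the resulting Förstner distance with the minimum loss (2.10) for f(x) = ln^2(x). Dimensions and matrices are arbitrary; the eigensolver normalization in scipy matches the Γpr^{-1}-orthonormality used in the theorem.

```python
import numpy as np
from scipy.linalg import eigh

# Dense sketch of Theorem 2.3: optimal rank-r update of the prior covariance.
rng = np.random.default_rng(2)
n, m, r = 60, 25, 10
G = rng.standard_normal((m, n))
Gamma_obs = 0.05 * np.eye(m)
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n)

H = G.T @ np.linalg.solve(Gamma_obs, G)
Gamma_pr_inv = np.linalg.inv(Gamma_pr)
Gamma_pos = np.linalg.inv(H + Gamma_pr_inv)

# Generalized eigenpairs of the pencil (H, Gamma_pr^{-1}), largest first;
# scipy normalizes so that W^T Gamma_pr^{-1} W = I.
delta2, W = eigh(H, Gamma_pr_inv)
idx = np.argsort(delta2)[::-1]
delta2, W = delta2[idx], W[:, idx]

# Optimal approximation (2.9): Gamma_pr - sum_i delta_i^2/(1+delta_i^2) w_i w_i^T.
coeff = delta2[:r] / (1.0 + delta2[:r])
Gamma_pos_hat = Gamma_pr - (W[:, :r] * coeff) @ W[:, :r].T

def foerstner(Apos, Bpos):
    return np.sqrt(np.sum(np.log(eigh(Apos, Bpos, eigvals_only=True)) ** 2))

print("Forstner distance of rank-%d optimal update: %.3e" % (r, foerstner(Gamma_pos_hat, Gamma_pos)))
# Minimum loss (2.10) for the squared Forstner metric: f(1) = 0, so sum_{i>r} ln^2(1/(1+delta_i^2)).
print("Predicted by (2.10):", np.sqrt(np.sum(np.log(1.0 / (1.0 + delta2[r:])) ** 2)))
```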
2.4. Computing eigenpairs of (H, Γpr^{-1}). If a square root factorization of the prior covariance matrix, Γpr = Spr Spr^T, is available, then the Hermitian generalized eigenvalue problem can be reduced to a standard one: find the eigenpairs, (δi^2, wi), of Spr^T H Spr, and transform the resulting eigenvectors according to wi → Spr wi [4, section 5.2]. An analogous transformation is also possible when a square root factorization of Γpr^{-1} is available. Notice that only the actions of Spr and Spr^T on a vector are required. For instance, evaluating the action of Spr might involve the solution of an elliptic PDE [57]. There are numerous examples of priors for which a decomposition Γpr = Spr Spr^T is readily available, e.g., [82, 24, 57, 84, 79]. Either direct methods or, more often, matrix-free algorithms (e.g., the Lanczos iteration or its block version [50, 66, 23, 33]) can be used to solve the standard Hermitian eigenvalue problem [4, section 4]. Reference implementations of these algorithms are available in ARPACK [52]. We note that the Lanczos iteration comes with a rich literature on error analysis (e.g., [49, 65, 67, 46, 73, 32]). Alternatively, one can use randomized methods [35], which offer the advantages of parallelism (asynchronous computations) and robustness over standard Lanczos methods [6]. If a square root factorization of Γpr is not available, but it is possible to solve linear systems with Γpr^{-1}, we can use a Lanczos method for generalized Hermitian eigenvalue problems [4, section 5.5], where a Krylov basis orthogonal with respect to the inner product induced by Γpr^{-1} is maintained. Again, ARPACK provides an efficient implementation of these solvers. When accurately solving linear systems with Γpr^{-1} is a difficult task, we refer the reader to alternative algorithms proposed in [74] and [34].

Remark 2. If a factorization Γpr = Spr Spr^T is available, then it is straightforward to obtain an expression for a nonsymmetric square root of the optimal approximation of Γpos (2.9), as in [9]:

(2.11)    Ŝpos = Spr [ Σ_{i=1}^{r} ( (1 + δi^2)^{-1/2} − 1 ) wi wi^T + I ],

such that Γ̂pos = Ŝpos Ŝpos^T and wi = Spr^{-1} ŵi. This expression can be used to efficiently sample from the approximate posterior distribution N(μpos(y), Γ̂pos) (e.g., [28, 60]). Alternative techniques for sampling from high-dimensional Gaussian distributions can be found, for instance, in [71, 30].
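The matrix-free strategy of this subsection can be sketched as follows with SciPy's Lanczos-type eigensolver; the dense factors below stand in for operators that would, in practice, be applied through forward/adjoint solves and prior square-root solves. Everything here (dimensions, the particular prior factor, the noise precision) is an arbitrary placeholder, not the setup of any example in the paper.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

# Matrix-free sketch: leading eigenpairs of S_pr^T H S_pr via Lanczos (eigsh),
# mapped back to generalized eigenvectors of (H, Gamma_pr^{-1}), followed by
# the nonsymmetric square root (2.11) for approximate posterior sampling.
rng = np.random.default_rng(3)
n, m, r = 80, 30, 12
G = rng.standard_normal((m, n))
noise_prec = 20.0                                   # Gamma_obs = noise_prec^{-1} I
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
S_pr = L / np.sqrt(n)                               # any factor with Gamma_pr = S_pr S_pr^T

def apply_H(v):                                     # H v = G^T Gamma_obs^{-1} G v
    return noise_prec * (G.T @ (G @ v))

Hop = LinearOperator((n, n), matvec=lambda v: S_pr.T @ apply_H(S_pr @ v))
delta2, W = eigsh(Hop, k=r, which='LM')             # leading eigenpairs of S_pr^T H S_pr
order = np.argsort(delta2)[::-1]
delta2, W = delta2[order], W[:, order]
W_hat = S_pr @ W                                    # generalized eigenvectors of (H, Gamma_pr^{-1})

# Nonsymmetric square root (2.11) and zero-mean approximate posterior draws.
S_pos_hat = S_pr @ ((W * ((1.0 + delta2) ** -0.5 - 1.0)) @ W.T + np.eye(n))
samples = S_pos_hat @ rng.standard_normal((n, 1000))   # add mu_pos(y) to center the draws
print("approximate posterior samples:", samples.shape)
```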
3. Properties of the optimal covariance approximation. Now we discuss several implications of the optimal approximation of Γpos introduced in the previous section. We start by describing the relationship between this approximation and the directions of greatest relative reduction of prior variance. Then we interpret the covariance approximation as the result of projecting the likelihood function onto a "data-informed" subspace. Finally, we contrast the present approach with several other approximation strategies: using the Frobenius norm as a loss function for the covariance matrix approximation, or developing low-rank approximations based on prior or Hessian information alone. We conclude by drawing connections with the BFGS Kalman filter update.

3.1. Interpretation of the eigendirections. Thanks to the particular structure of loss functions in L, the problem of approximating Γpos is equivalent to that of approximating Γpos^{-1}. Yet the form of the optimal approximation of Γpos^{-1} is important, as it explicitly describes the directions that control the ratio of posterior to prior variance. The following corollary to Theorem 2.3 characterizes these directions. The proof is in Appendix A.

Corollary 3.1 (optimal posterior precision approximation). Let (δi^2, ŵi) and L ∈ L be defined as in Theorem 2.3. Then the following hold:
(i) A minimizer of L(B, Γpos^{-1}) for

(3.1)    B ∈ Mr^{-1} := { Γpr^{-1} + J J^T : rank(J) ≤ r }

is given by

(3.2)    Γ̂pos^{-1} = Γpr^{-1} + U U^T,    U U^T = Σ_{i=1}^{r} δi^2 w̃i w̃i^T,    w̃i = Γpr^{-1} ŵi.

The minimizer (3.2) is unique if the first r eigenvalues of (H, Γpr^{-1}) are different.
(ii) The optimal posterior precision matrix (3.2) is precisely the inverse of the optimal posterior covariance matrix (2.9).
(iii) The vectors w̃i are generalized eigenvectors of the pencil (Γpos, Γpr):

(3.3)    Γpos w̃i = (1 + δi^2)^{-1} Γpr w̃i.

Note that the definition of the class Mr^{-1} is analogous to that of Mr. Indeed, Lemma A.2 in Appendix A defines a bijection between these two classes. The vectors w̃i = Γpr^{-1} ŵi are orthogonal with respect to the inner product defined by Γpr. By (3.3), we also know that w̃i minimizes the generalized Rayleigh quotient,

(3.4)    R(z) = (z^T Γpos z) / (z^T Γpr z) = Var(z^T x | y) / Var(z^T x),

over subspaces of the form W̃i = span^⊥(w̃j)_{j<i}. This Rayleigh quotient is precisely the ratio of posterior to prior variance along a particular direction, z, in the parameter space. The smallest values that R can take over the subspaces W̃i are exactly the smallest generalized eigenvalues of (Γpos, Γpr). In particular, the data are most informative along the first r eigenvectors w̃i and, since

(3.5)    R(w̃i) = Var(w̃i^T x | y) / Var(w̃i^T x) = 1 / (1 + δi^2),

the posterior variance is smaller than the prior variance by a factor of (1 + δi^2)^{-1}. In the span of the other eigenvectors, (w̃i)_{i>r}, the data are not as informative. Hence, (w̃i) are the directions along which the ratio of posterior to prior variance is minimized. Furthermore, a simple computation shows that these directions also maximize the relative difference between prior and posterior variance normalized by the prior variance. Indeed, if the directions (w̃i) minimize (3.4), then they must also maximize 1 − R(z), leading to

(3.6)    1 − R(w̃i) = [ Var(w̃i^T x) − Var(w̃i^T x | y) ] / Var(w̃i^T x) = δi^2 / (1 + δi^2).

3.2. Optimal projector. Since the data are most informative on a subspace of the parameter space, it should be possible to reduce the effective dimension of the inference problem in a manner that is consistent with the posterior approximation. This is essentially the content of the following corollary, which follows by a simple computation.

Corollary 3.2 (optimal projector). Let Γ̂pos and the vectors (ŵi, w̃i) be defined as in Theorems 2.3 and 3.1. Consider the reduced forward operator Ĝr = G ∘ Pr, where Pr is the oblique projector (i.e., Pr^2 = Pr but Pr^T ≠ Pr):

(3.7)    Pr = Σ_{i=1}^{r} ŵi w̃i^T.

Then Γ̂pos is precisely the posterior covariance matrix corresponding to the Bayesian linear model:

(3.8)    y | x ∼ N(Ĝr x, Γobs),    x ∼ N(0, Γpr).

The projected Gaussian linear model (3.8) reveals the intrinsic dimensionality of the inference problem.
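A small dense sketch of Corollary 3.2, on arbitrary test matrices, is given below: it builds the oblique projector (3.7), confirms that the projected model reproduces the optimal covariance (2.9), and reports the variance ratios (3.5).

```python
import numpy as np
from scipy.linalg import eigh

# Sketch of Corollary 3.2 with small dense matrices.
rng = np.random.default_rng(4)
n, m, r = 40, 15, 6
G = rng.standard_normal((m, n))
Gamma_obs = 0.1 * np.eye(m)
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n)
Gamma_pr_inv = np.linalg.inv(Gamma_pr)

H = G.T @ np.linalg.solve(Gamma_obs, G)
delta2, W_hat = eigh(H, Gamma_pr_inv)               # normalized so W_hat^T Gamma_pr^{-1} W_hat = I
idx = np.argsort(delta2)[::-1]
delta2, W_hat = delta2[idx], W_hat[:, idx]
W_tilde = Gamma_pr_inv @ W_hat                      # w_tilde_i = Gamma_pr^{-1} w_hat_i

P_r = W_hat[:, :r] @ W_tilde[:, :r].T               # oblique projector (3.7)
assert np.allclose(P_r @ P_r, P_r)                  # P_r^2 = P_r

G_r = G @ P_r                                       # reduced forward operator
Gamma_pos_proj = np.linalg.inv(G_r.T @ np.linalg.solve(Gamma_obs, G_r) + Gamma_pr_inv)
coeff = delta2[:r] / (1.0 + delta2[:r])
Gamma_pos_hat = Gamma_pr - (W_hat[:, :r] * coeff) @ W_hat[:, :r].T   # (2.9)
assert np.allclose(Gamma_pos_proj, Gamma_pos_hat)

# Ratios of posterior to prior variance along the leading directions, cf. (3.5).
print(1.0 / (1.0 + delta2[:r]))
```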
The introduction of the optimal projector (3.7) is also useful in the context of dimensionality reduction for nonlinear inverse problems. In this case a particularly simple and effective approximation of the posterior density, πpos(x|y), is of the form π̃pos(x|y) ∝ π(y; Pr x) πpr(x), where πpr is the prior density and π(y; Pr x) is the density corresponding to the likelihood function with parameters constrained by the projector. The range of the projector can be determined by combining locally optimal data-informed subspaces from high-density regions in the support of the posterior distribution. This approximation is the subject of a related paper [22]. Returning to the linear inverse problem, notice also that the posterior mean of the projected model (3.8) might be used as an efficient approximation of the exact posterior mean. We will show in section 4 that this posterior mean approximation in fact minimizes the Bayes risk for a weighted squared-error loss among all low-rank linear functions of the data.

3.3. Comparison with optimality in Frobenius norm. Thus far our optimality results for the approximation of Γpos have been restricted to the class of loss functions L given in Definition 2.1. However, it is also interesting to investigate optimality in the metric defined by the Frobenius norm. Given any two matrices A and B of the same size, the Frobenius distance between them is defined as ‖A − B‖, where ‖ · ‖ is the Frobenius norm. Note that the Frobenius distance does not exploit the structure of the positive definite cone of symmetric matrices. The matrix Γ̂pos ∈ Mr that minimizes the Frobenius distance from the exact posterior covariance matrix is given by

(3.9)    Γ̂pos = Γpr − K K^T,    K K^T = Σ_{i=1}^{r} λi ui ui^T,

where (ui) are the directions corresponding to the r largest eigenvalues of Γpr − Γpos. This result can be very different from the optimal approximation given in Theorem 2.3. In particular, the directions (ui) are solutions of the eigenvalue problem

(3.10)    Γpr G^T Γy^{-1} G Γpr u = λ u,

which maximize

(3.11)    u^T (Γpr − Γpos) u = Var(u^T x) − Var(u^T x | y).

That is, while optimality in the Förstner metric identifies directions that maximize the relative difference between prior and posterior variance, the Frobenius distance favors directions that maximize only the absolute value of this difference. There are many reasons to prefer the former. For instance, data might be informative along directions of low prior variance (perhaps due to inadequacies in prior modeling); a covariance matrix approximation that is optimal in Frobenius distance may ignore updates in these directions entirely. Also, if parameters of interest (i.e., components of x) have differing units of measurement, relative variance reduction provides a unit-independent way of judging the quality of a posterior approximation; this notion follows naturally from the second invariance property of dF in (2.5). From a computational perspective, solving the eigenvalue problem (3.10) is quite expensive compared to finding the generalized eigenpairs of the pencil (H, Γpr^{-1}). Finally, optimality in the Frobenius distance for an approximation of Γpos does not yield an optimality statement for the corresponding approximation of the posterior distribution, as shown in Lemma 2.2 for loss functions in L.

3.4. Suboptimal posterior covariance approximations.

3.4.1. Hessian-based and prior-based reduction schemes. The posterior approximation described by Theorem 2.3 uses both Hessian and prior information. It is instructive to consider approximations of the linear Bayesian inverse problem that rely only on one or the other. As we will illustrate numerically in section 5.1, these approximations can be viewed as natural limiting cases of our approach.
They are also closely related to previous efforts in dimensionality reduction that propose only Hessian-based [55] or prior-based [61] reductions. In contrast with these previous efforts, here we will consider versions of Hessian- and prior-based reductions that do not discard prior information in the remaining directions. In other words, we will discuss posterior covariance approximations that remain in the form of (2.4), i.e., that update the prior covariance only in r directions.

A Hessian-based reduction scheme updates Γpr in directions where the data have greatest influence in an absolute sense (i.e., not relative to the prior). This involves approximating the negative log-likelihood Hessian (2.3) with a low-rank decomposition as follows: let (si^2, vi) be the eigenvalue-eigenvector pairs of H with the ordering si^2 ≥ s_{i+1}^2. Then a best low-rank approximation of H in the Frobenius norm is given by

    H ≈ Σ_{i=1}^{r} si^2 vi vi^T = Vr Sr Vr^T,

where vi is the ith column of Vr and Sr = diag{s1^2, ..., sr^2}. Using Woodbury's identity we then obtain an approximation of Γpos as a low-rank negative semidefinite update of Γpr:

(3.12)    Γpos ≈ ( Vr Sr Vr^T + Γpr^{-1} )^{-1} = Γpr − Γpr Vr ( Sr^{-1} + Vr^T Γpr Vr )^{-1} Vr^T Γpr.

This approximation of the posterior covariance matrix belongs to the class Mr. Thus, Hessian-based reductions are in general suboptimal when compared to the optimal approximations defined in Theorem 2.3. Note that an equivalent way to obtain (3.12) is to use a reduced forward operator of the form Ĝ = G ∘ Vr Vr^T, which is the composition of the original forward operator with a projector onto the leading eigenspace of H. In general, the projector Pr = Vr Vr^T is different from the optimal projector defined in Corollary 3.2 and is thus suboptimal.

To achieve prior-based reductions, on the other hand, we restrict the Bayesian inference problem to directions in the parameter space that explain most of the prior variance. More precisely, we look for a rank-r orthogonal projector, Pr, that minimizes the mean squared error

(3.13)    E(Pr) = E‖x − Pr x‖^2,

where the expectation is taken over the prior distribution (assumed to have zero mean) and ‖ · ‖ is the standard Euclidean norm [44]. Let (ti^2, ui) be the eigenvalue-eigenvector pairs of Γpr ordered as ti^2 ≥ t_{i+1}^2. Then a minimizer of (3.13) is given by a projector, Pr, onto the leading eigenspace of Γpr, defined as Pr = Σ_{i=1}^{r} ui ui^T = Ur Ur^T, where ui is the ith column of Ur. The actual approximation of the linear inverse problem consists of using the projected forward operator, Ĝ = G ∘ Ur Ur^T. By direct comparison with the optimal projector defined in Corollary 3.2, we see that prior-based reductions are suboptimal in general. Also in this case, the posterior covariance matrix of the projected Gaussian model can be written as a negative semidefinite update of Γpr:

    Γpos ≈ Γpr − Ur Tr [ ( Ur^T H Ur )^{-1} + Tr ]^{-1} Tr Ur^T,

where Tr = diag{t1^2, ..., tr^2}. The double matrix inversion makes this low-rank update computationally challenging to implement. It is also not optimal, as shown in Theorem 2.3.

To summarize, the Hessian- and prior-based dimensionality reduction techniques are both suboptimal. These methods do not take into account the interactions between the dominant directions of H and those of Γpr, nor the relative importance of these quantities. Accounting for such interactions is a key feature of the optimal covariance approximation described in Theorem 2.3. Section 5.1 will illustrate conditions under which these interactions become essential.
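The following sketch contrasts the optimal (generalized) update with the Hessian-based update (3.12) and the prior-based projected model on an arbitrary dense test problem, using the Förstner distance to the exact posterior covariance as the figure of merit. The test matrices are made up; the BFGS-based update discussed later is omitted here for brevity.

```python
import numpy as np
from scipy.linalg import eigh

def foerstner(A, B):
    return np.sqrt(np.sum(np.log(eigh(A, B, eigvals_only=True)) ** 2))

# Compare the optimal update (Theorem 2.3) with Hessian-based (3.12) and
# prior-based reductions on a dense toy problem.
rng = np.random.default_rng(5)
n, m, r = 60, 40, 8
G = rng.standard_normal((m, n))
Gamma_obs = 0.05 * np.eye(m)
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + 0.5 * np.eye(n)
Gamma_pr_inv = np.linalg.inv(Gamma_pr)
H = G.T @ np.linalg.solve(Gamma_obs, G)
Gamma_pos = np.linalg.inv(H + Gamma_pr_inv)

# Optimal (generalized) update, (2.9).
d2, W = eigh(H, Gamma_pr_inv); d2, W = d2[::-1], W[:, ::-1]
opt = Gamma_pr - (W[:, :r] * (d2[:r] / (1 + d2[:r]))) @ W[:, :r].T

# Hessian-based update, (3.12).
s2, V = np.linalg.eigh(H); s2, V = s2[::-1][:r], V[:, ::-1][:, :r]
hess = np.linalg.inv(V @ np.diag(s2) @ V.T + Gamma_pr_inv)

# Prior-based reduction: project the forward operator onto the leading eigenspace of Gamma_pr.
t2, U = np.linalg.eigh(Gamma_pr); U = U[:, ::-1][:, :r]
Gp = G @ U @ U.T
prior = np.linalg.inv(Gp.T @ np.linalg.solve(Gamma_obs, Gp) + Gamma_pr_inv)

for name, approx in [("optimal", opt), ("Hessian-based", hess), ("prior-based", prior)]:
    print(name, foerstner(approx, Gamma_pos))
```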
3.4.2. Connections with the BFGS Kalman filter. The linear Bayesian inverse problem analyzed in this paper can be interpreted as the analysis step of a linear Bayesian filtering problem [27]. If the prior distribution corresponds to the forecast distribution at some time t, the posterior coincides with the so-called analysis distribution. In the linear case, with Gaussian process noise and observational errors, both of these distributions are Gaussian. The Kalman filter is a Bayesian solution to this filtering problem [48]. In [2] the authors propose a computationally feasible way to implement (and approximate) this solution in large-scale systems. The key observation is that when solving an SPD linear system of the form Ax = b by means of BFGS or limited-memory BFGS (L-BFGS [59]), one typically obtains an approximation of A^{-1} for free. This approximation can be written as a low-rank correction of an arbitrary positive definite initial approximation matrix A0^{-1}. The matrix A0^{-1} can be, for instance, the scaled identity. Notice that the approximation of A^{-1} given by L-BFGS is full rank and positive definite. This approximation is in principle convergent as the storage limit of L-BFGS increases [64]. An L-BFGS approximation of A is also possible [83].

There are many ways to exploit this property of the L-BFGS method. For example, in [2] the posterior covariance is written as a low-rank update of the prior covariance matrix, Γpos = Γpr − Γpr G^T Γy^{-1} G Γpr, where Γy = Γobs + G Γpr G^T, and Γy^{-1} itself is approximated using the L-BFGS method. Since this approximation of Γy^{-1} is full rank, however, this approach does not exploit potential low-dimensional structure of the inverse problem. Alternatively, one can obtain an L-BFGS approximation of Γpos when solving the linear system Γpos^{-1} x = G^T Γobs^{-1} y for the posterior mean μpos(y) [3]. If one uses the prior covariance matrix as the initial approximation matrix, A0^{-1}, then the resulting L-BFGS approximation of Γpos can be written as a low-rank update of Γpr. This approximation format is similar to the one discussed in [28] and advocated in this paper. However, the approach of [3] (or its ensemble version [76]) does not correspond to any known optimal approximation of the posterior covariance matrix, nor does it lead to any optimality statement between the corresponding probability distributions. This is an important contrast with the present approach, which we will revisit numerically in section 5.1.

4. Optimal approximation of the posterior mean. In this section we develop and characterize fast approximations of the posterior mean that can be used, for instance, to accelerate repeated inversion with multiple data sets. Note that we are not proposing alternatives to the efficient computation of the posterior mean for a single realization of the data.
This task can already be accomplished with current state-of-the-art iterative solvers for regularized least-squares problems [42, 69, 1, 62, 5]. Instead, we are interested in constructing statistically optimal approximations of the posterior mean as linear functions of the data. (We will define this notion of optimality precisely in section 4.1.) That is, we seek a matrix A, from an approximation class to be defined shortly, such that the posterior mean can be approximated as μpos(y) ≈ A y. We will investigate different approximation classes for A; in particular, we will only consider approximation classes for which applying A to a vector y is relatively inexpensive. Computing such a matrix A is more expensive than solving a single linear system associated with the posterior precision matrix to determine the posterior mean. However, once A is computed, it can be applied inexpensively to any realization of the data; in particular, applying A is much cheaper than solving a linear system. Our approach is therefore justified when the posterior mean must be evaluated for multiple instances of the data. This approach can thus be viewed as an offline-online strategy, where a more costly but data-independent offline calculation is followed by fast online evaluations. Moreover, we will show that these approximations can be obtained from an optimal approximation of the posterior covariance matrix (cf. Theorem 2.3) with minimal additional cost. Hence, if one is interested in both the posterior mean and covariance matrix (as is often the case in the Bayesian approach to inverse problems), then the approximation formulas we propose can be more efficient than standard approaches even for a single realization of the data.

4.1. Optimality results. For the Bayesian linear model defined in (2.1), the posterior mode is equal to the posterior mean, μpos(y) = E(x|y), which is in turn the minimizer of the Bayes risk for squared-error loss [51, 58]. We first review this fact and establish some basic notation. Let S be an SPD matrix and let

    L(δ(y), x) = (x − δ(y))^T S (x − δ(y)) = ‖x − δ(y)‖_S^2

be the loss incurred by the estimator δ(y) of x. The Bayes risk, R(δ(y), x), of δ(y) is defined as the average loss over the joint distribution of x and y [14, 51]: R(δ(y), x) = E( L(δ(y), x) ). Since

(4.1)    R(δ(y), x) = E‖δ(y) − μpos(y)‖_S^2 + E‖μpos(y) − x‖_S^2,

it follows that δ(y) = μpos(y) minimizes the Bayes risk over all estimators of x.

To study approximations of μpos(y), we use the squared-error loss function defined by the Mahalanobis distance [15] induced by Γpos^{-1}: L(δ(y), x) = ‖δ(y) − x‖_{Γpos^{-1}}^2. This loss function accounts for the geometry induced by the posterior measure on the parameter space, penalizing errors in the approximation of μpos(y) more strongly in directions of lower posterior variance. Under the assumption of zero prior mean, μpos(y) is a linear function of the data. Hence we seek approximations of μpos(y) of the form A y, where A is a matrix in a class to be defined. Our goal is to obtain fast posterior mean approximations that can be applied repeatedly to multiple realizations of y. We consider two classes of approximation matrices:

(4.2)    Ar := { A : rank(A) ≤ r }    and    Âr := { A = (Γpr − B) G^T Γobs^{-1} : rank(B) ≤ r }.
The class Ar consists of low-rank matrices; it is standard in the statistics literature [44]. The class Âr, on the other hand, can be understood via comparison with (2.2); it simply replaces Γpos with a low-rank negative semidefinite update of Γpr. We shall henceforth use A to denote either of the two classes above. Let RA(Ay, x) be the Bayes risk of Ay subject to A ∈ A. We may now restate our goal as follows: find a matrix, A* ∈ A, that minimizes the Bayes risk RA(Ay, x). That is, find A* ∈ A such that

(4.3)    RA(A* y, x) = min_{A ∈ A} E( ‖A y − x‖_{Γpos^{-1}}^2 ).

The following two theorems show that for either class of approximation matrices, Ar or Âr, this problem admits a particularly simple analytical solution that exploits the structure of the optimal approximation of Γpos. The proofs of the theorems rely on a result developed independently by Sondermann [77] and Friedland and Torokhti [31] and are given in Appendix A. We also use the fact that E(‖μpos(y) − x‖_{Γpos^{-1}}^2) = ℓ, where ℓ is the dimension of the parameter space.

Theorem 4.1. Let (δi^2, ŵi) be defined as in Theorem 2.3 and let (v̂i) be generalized eigenvectors of the pencil (G Γpr G^T, Γobs) associated with a nonincreasing sequence of eigenvalues, with the normalization v̂i^T Γobs v̂i = 1. Then the following hold:
(i) A solution of (4.3) for A ∈ Ar is given by

(4.4)    A* = Σ_{i=1}^{r} δi (1 + δi^2)^{-1} ŵi v̂i^T.

(ii) The corresponding minimum Bayes risk over Ar is given by

(4.5)    RAr(A* y, x) = E‖A* y − μpos(y)‖_{Γpos^{-1}}^2 + E‖μpos(y) − x‖_{Γpos^{-1}}^2 = Σ_{i>r} δi^2 + ℓ.

Notice that the rank-r posterior mean approximation given by Theorem 4.1 coincides with the posterior mean of the projected linear Gaussian model defined in (3.8). Thus, applying this approximation to a new realization of the data requires only a low-rank matrix-vector product, a computationally trivial task. We define the quality of a posterior mean approximation as the minimum Bayes risk (4.5). Notice, however, that for a given rank r of the approximation, (4.5) depends on the eigenvalues that have not yet been computed. Since (δi^2) are determined in order of decreasing magnitude, (4.5) can easily be bounded (cf. the discussion after Theorem 2.3). The forthcoming minimum Bayes risk (4.8) can be bounded analogously.

Remark 3. Equation (4.4) can be interpreted as the truncated GSVD solution of a Tikhonov-regularized linear inverse problem [38] (with unit regularization parameter). Hence, Theorem 4.1 also describes a Bayesian property of the (frequentist) truncated GSVD estimator.

Remark 4. If factorizations of the form Γpr = Spr Spr^T and Γobs = Sobs Sobs^T are readily available, then we can characterize the triplets (δi, ŵi, v̂i) from an SVD,

    Sobs^{-1} G Spr = Σ_{i≥1} δi vi wi^T,

of the matrix Sobs^{-1} G Spr, with the transformations ŵi = Spr wi, v̂i = Sobs^{-T} vi and the ordering δi ≥ δ_{i+1}. In particular, the approximate posterior mean can be written as

(4.6)    μpos^{(r)}(y) = Spr (Sobs^{-1} G Spr)r^{Tikh} Sobs^{-1} y,

where (Sobs^{-1} G Spr)r^{Tikh} is the best rank-r approximation to a Tikhonov-regularized inverse (with unit regularization parameter and identity regularization operator [39]). That is, for any matrix A, (A)r is the best rank-r approximation of A (e.g., computed via SVD), whereas (A)^{Tikh} := (A^T A + I)^{-1} A^T.
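A minimal sketch of Theorem 4.1 via the whitened SVD of Remark 4 follows. The toy matrices and dimensions are arbitrary; the triplets (δi, ŵi, v̂i) are assembled from a single SVD so that the pairing between the two families of vectors is automatically consistent.

```python
import numpy as np

# Optimal rank-r posterior mean approximation (4.4), built through (4.6).
rng = np.random.default_rng(6)
n, m, r = 60, 25, 8
G = rng.standard_normal((m, n))
S_obs = 0.3 * np.eye(m)                      # Gamma_obs = S_obs S_obs^T
A = rng.standard_normal((n, n))
S_pr = np.linalg.cholesky(A @ A.T / n + np.eye(n))   # Gamma_pr = S_pr S_pr^T

V, delta, Wt = np.linalg.svd(np.linalg.solve(S_obs, G) @ S_pr)   # S_obs^{-1} G S_pr
W_hat = S_pr @ Wt.T                          # hat{w}_i: generalized eigenvectors of (H, Gamma_pr^{-1})
V_hat = np.linalg.solve(S_obs.T, V)          # hat{v}_i: generalized eigenvectors of (G Gamma_pr G^T, Gamma_obs)

# (4.4): A* = sum_{i<=r} delta_i/(1+delta_i^2) w_hat_i v_hat_i^T.
A_star = (W_hat[:, :r] * (delta[:r] / (1 + delta[:r] ** 2))) @ V_hat[:, :r].T

# Compare against the exact posterior mean for one data realization.
Gamma_obs = S_obs @ S_obs.T
Gamma_pr = S_pr @ S_pr.T
H = G.T @ np.linalg.solve(Gamma_obs, G)
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))
y = G @ (S_pr @ rng.standard_normal(n)) + S_obs @ rng.standard_normal(m)
mu_exact = Gamma_pos @ G.T @ np.linalg.solve(Gamma_obs, y)
mu_r = A_star @ y
print("relative error of rank-%d mean approximation: %.3e"
      % (r, np.linalg.norm(mu_r - mu_exact) / np.linalg.norm(mu_exact)))
```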
Theorem 4.2. Let Γ̂pos ∈ Mr be the optimal approximation of Γpos defined in Theorem 2.3. Then the following hold:
(i) A solution of (4.3) for A ∈ Âr is given by

(4.7)    Â* = Γ̂pos G^T Γobs^{-1}.

(ii) The corresponding minimum Bayes risk over Âr is given by

(4.8)    RÂr(Â* y, x) = E‖Â* y − μpos(y)‖_{Γpos^{-1}}^2 + E‖μpos(y) − x‖_{Γpos^{-1}}^2 = Σ_{i>r} δi^6 + ℓ.

Once the optimal approximation of Γpos described in Theorem 2.3 is computed, the cost of approximating μpos(y) for a new realization of y is dominated by the adjoint and prior solves needed to apply G^T and Γpr, respectively. Combining the optimal approximations of μpos(y) and Γpos given by Theorems 4.2 and 2.3, respectively, yields a complete approximation of the Gaussian posterior distribution. This is precisely the approximation adopted by the stochastic Newton MCMC method [60] to describe the Gaussian proposal distribution obtained from a local linearization of the forward operator of a nonlinear Bayesian inverse problem. Our results support the algorithmic choice of [60] with precise optimality statements.

It is worth noting that the two optimal Bayes risks, (4.5) and (4.8), depend on the parameter, r, that defines the dimension of the corresponding approximation classes Ar and Âr. In the former case, r is the rank of the optimal matrix that defines the approximation. In the latter case, r is the rank of the negative update of Γpr that yields the posterior covariance matrix approximation. We shall thus refer to the estimator given by Theorem 4.1 as the low-rank approximation and to the estimator given by Theorem 4.2 as the low-rank update approximation. In both cases, we shall refer to r as the order of the approximation. A posterior mean approximation of order r will be called underresolved if more than r generalized eigenvalues of the pencil (H, Γpr^{-1}) are greater than one. If this is the case, then using the low-rank update approximation is not appropriate, because the associated Bayes risk includes high-order powers of eigenvalues of (H, Γpr^{-1}) that are greater than one. Thus, underresolved approximations tend to be more accurate when using the low-rank approximation. As we will show in section 5, this estimator is also less expensive to compute than its counterpart in Theorem 4.2. If, on the other hand, fewer than r eigenvalues of (H, Γpr^{-1}) are greater than one, then the optimal low-rank update estimator will have better performance than the optimal low-rank estimator in the following sense:

    0 < RAr(A* y, x) − RÂr(Â* y, x) = Σ_{i>r} δi^2 (1 + δi^2)(1 − δi^2).
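The sketch below assembles the low-rank update estimator (4.7) from the optimal covariance approximation and checks the weighted Bayes risk (4.8) by Monte Carlo over the joint distribution of (x, y). Problem sizes, matrices, and the sample count are arbitrary choices for illustration.

```python
import numpy as np
from scipy.linalg import eigh

# Sketch of Theorem 4.2 with a Monte Carlo check of the Bayes risk (4.8).
rng = np.random.default_rng(7)
n, m, r, nsamp = 40, 40, 10, 50000
G = rng.standard_normal((m, n)) / np.sqrt(n)
Gamma_obs = np.eye(m)
A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n)
Gamma_pr_inv = np.linalg.inv(Gamma_pr)
H = G.T @ np.linalg.solve(Gamma_obs, G)
Gamma_pos_inv = H + Gamma_pr_inv

delta2, W = eigh(H, Gamma_pr_inv)
delta2, W = delta2[::-1], W[:, ::-1]
Gamma_pos_hat = Gamma_pr - (W[:, :r] * (delta2[:r] / (1 + delta2[:r]))) @ W[:, :r].T
A_hat = Gamma_pos_hat @ np.linalg.solve(Gamma_obs, G).T          # (4.7): Gamma_pos_hat G^T Gamma_obs^{-1}

# Draw (x, y) from the joint model and estimate the weighted Bayes risk.
X = np.linalg.cholesky(Gamma_pr) @ rng.standard_normal((n, nsamp))
Y = G @ X + np.linalg.cholesky(Gamma_obs) @ rng.standard_normal((m, nsamp))
E = A_hat @ Y - X
risk_mc = np.mean(np.sum(E * (Gamma_pos_inv @ E), axis=0))
risk_theory = np.sum(delta2[r:] ** 3) + n        # (4.8): sum_{i>r} delta_i^6 plus the parameter dimension
print("Monte Carlo risk %.2f vs. theoretical %.2f" % (risk_mc, risk_theory))
```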
4.2. Connection with "priorconditioners". In this subsection we draw connections between the low-rank approximation of the posterior mean given in Theorem 4.1 and the regularized solution of a discrete ill-posed inverse problem, y = Gx + ε (using the notation of this paper), as presented in [13, 10]. In [13, 10], the authors propose an early stopping regularization using iterative solvers preconditioned by prior statistical information on the parameter of interest, say, x ∼ N(0, Γpr), and on the noise, say, ε ∼ N(0, Γobs). (It suffices to consider a Gaussian approximation to the distribution of x and ε.) That is, if factorizations Γpr = Spr Spr^T and Γobs = Sobs Sobs^T are available, then [13] provides a solution, x̂ = Spr q, to the inverse problem, where q comes from an early stopping regularization applied to the preconditioned linear system:

(4.9)    Sobs^{-1} G Spr q = Sobs^{-1} y.

The iterative method of choice in this case is the conjugate gradient least squares (CGLS) algorithm [13, 36] (or GMRES for nonsymmetric square systems [11]) equipped with a proper stopping criterion (e.g., the discrepancy principle [47]). Although the approach of [13] is not exactly Bayesian, we can still use the optimality results of Theorem 4.1 to justify the observed good performance of this particular form of regularization. By a property of the CGLS algorithm, the rth iterate, x̂r = Spr qr, satisfies

(4.10)    qr = argmin_{q ∈ Kr(H̃, ỹ)} ‖ Sobs^{-1} y − Sobs^{-1} G Spr q ‖,

where Kr(H̃, ỹ) is the r-dimensional Krylov subspace associated with the matrix H̃ = Spr^T H Spr and starting vector ỹ = Spr^T G^T Γobs^{-1} y. It was shown in [41] that the CGLS solution, at convergence, can be written as x̂* = Spr (Sobs^{-1} G Spr)† Sobs^{-1} y, where ( · )† denotes the Moore–Penrose pseudoinverse [63, 72]. To highlight the differences between the CGLS solution and (4.6), we assume that Kr(H̃, ỹ) ≈ ran(Wr) for all r, where Wr = [w1 | · · · | wr], ran(A) denotes the range of a matrix A, and H̃ = Σ_i δi^2 wi wi^T is an SVD of H̃. Notice that the condition Kr(H̃, ỹ) ≈ ran(Wr) is usually quite reasonable for moderate values of r. This practical observation is at the heart of the Lanczos iteration for symmetric eigenvalue problems [50]. With simple algebraic manipulations we conclude that

(4.11)    x̂r ≈ Spr (Sobs^{-1} G Spr)r† Sobs^{-1} y.

Recall from (4.6) that the optimal rank-r approximation of the posterior mean defined in Theorem 4.1 can be written as

(4.12)    μpos^{(r)}(y) = Spr (Sobs^{-1} G Spr)r^{Tikh} Sobs^{-1} y.

The only difference between (4.11) and (4.12) is the use of a Tikhonov-regularized inverse in (4.12) as opposed to a Moore–Penrose pseudoinverse. If Sobs^{-1} G Spr = Σ_{i≥1} δi vi wi^T is a reduced SVD of the matrix Sobs^{-1} G Spr, then

(4.13)    (Sobs^{-1} G Spr)r† = Σ_{i≤r} (1/δi) wi vi^T,    (Sobs^{-1} G Spr)r^{Tikh} = Σ_{i≤r} δi (1 + δi^2)^{-1} wi vi^T.

These two matrices are nearly identical for values of r corresponding to δr^2 greater than one (assuming the ordering δi^2 ≥ δ_{i+1}^2). (In section 5 we show that by the time we start including generalized eigenvalues δi^2 ≈ 1 in (4.4), the approximation of the posterior mean is usually already satisfactory. Intuitively, this means that all the directions in parameter space where the data are more informative than the prior have been considered.) Beyond this regime, it might be convenient to stop the CGLS solver to obtain (4.11) (i.e., early stopping regularization). The similarity of these expressions is quite remarkable, since (4.12) was derived as the minimizer of the optimization problem (4.3) with A = Ar. This informal argument may explain why priorconditioners perform so well in applications [12, 43]. Yet we remark that the goals of Theorem 4.1 and [13] are still quite different: [13] is concerned with preconditioning techniques for early stopping regularization of ill-posed inverse problems, whereas Theorem 4.1 is concerned with statistically optimal approximations of the posterior mean in the Bayesian framework.
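The connection can be sketched numerically as follows. Here LSQR (mathematically equivalent to CGLS in exact arithmetic) is run for r iterations on the whitened system (4.9) and compared with the optimal rank-r estimator (4.12); the exponential prior correlation, the dimensions, and the noise scaling are arbitrary placeholders rather than the setup of [13], and the two estimates are only expected to be close while δr^2 remains larger than one.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# Priorconditioned early stopping (4.9)-(4.11) vs. the optimal rank-r mean (4.12).
rng = np.random.default_rng(8)
n, m, r = 80, 50, 10
G = rng.standard_normal((m, n)) / np.sqrt(n)
S_obs = 0.1 * np.eye(m)
S_pr = np.linalg.cholesky(np.exp(-np.abs(np.subtract.outer(np.arange(n), np.arange(n))) / 10.0))

M = np.linalg.solve(S_obs, G) @ S_pr          # whitened operator S_obs^{-1} G S_pr
x_true = S_pr @ rng.standard_normal(n)
y = G @ x_true + S_obs @ rng.standard_normal(m)
b = np.linalg.solve(S_obs, y)

q_r = lsqr(M, b, iter_lim=r)[0]               # r Krylov iterations = early stopping
x_cgls = S_pr @ q_r                           # cf. (4.11)

V, delta, Wt = np.linalg.svd(M, full_matrices=False)
tikh_r = (Wt[:r].T * (delta[:r] / (1 + delta[:r] ** 2))) @ V[:, :r].T
x_opt = S_pr @ tikh_r @ b                     # optimal rank-r mean (4.12)
print("distance between the two estimates:", np.linalg.norm(x_cgls - x_opt))
```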
Algorithm 1. Optimal low-rank approximation of the posterior mean.
INPUT: forward and adjoint models G, G^T; prior and noise precisions Γpr^{-1}, Γobs^{-1}; approximation order r ∈ N
OUTPUT: approximate posterior mean μpos^{(r)}(y)
1: Find the r leading generalized eigenvalue-eigenvector pairs (δi^2, ŵi) of the pencil (G^T Γobs^{-1} G, Γpr^{-1}).
2: Find the r leading generalized eigenvectors (v̂i) of the pencil (G Γpr G^T, Γobs).
3: For each new realization of the data y, compute μpos^{(r)}(y) = Σ_{i=1}^{r} δi (1 + δi^2)^{-1} ŵi v̂i^T y.

Algorithm 2. Optimal low-rank update approximation of the posterior mean.
INPUT: forward and adjoint models G, G^T; prior and noise precisions Γpr^{-1}, Γobs^{-1}; approximation order r ∈ N
OUTPUT: approximate posterior mean μ̂pos^{(r)}(y)
1: Obtain Γ̂pos as described in Theorem 2.3.
2: For each new realization of the data y, compute μ̂pos^{(r)}(y) = Γ̂pos G^T Γobs^{-1} y.
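A dense Python sketch of Algorithms 1 and 2 is given below for illustration. In the first routine, the vectors v̂i are obtained directly from the ŵi (equivalent to step 2 of Algorithm 1 up to normalization) so that the pairing and relative signs of the two eigenvector families are consistent; all test matrices in the usage example are arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

def low_rank_mean(G, Gamma_pr, Gamma_obs, r):
    """Dense sketch of Algorithm 1: returns a function y -> mu_pos^{(r)}(y)."""
    H = G.T @ np.linalg.solve(Gamma_obs, G)
    d2, W = eigh(H, np.linalg.inv(Gamma_pr))              # pencil (H, Gamma_pr^{-1})
    d2, W = d2[::-1][:r], W[:, ::-1][:, :r]               # leading (delta_i^2, w_hat_i)
    delta = np.sqrt(d2)
    # v_hat_i = Gamma_obs^{-1} G w_hat_i / delta_i are leading generalized
    # eigenvectors of (G Gamma_pr G^T, Gamma_obs), consistently paired with w_hat_i.
    V = np.linalg.solve(Gamma_obs, G @ W) / delta
    A_star = (W * (delta / (1 + d2))) @ V.T               # (4.4)
    return lambda y: A_star @ y

def low_rank_update_mean(G, Gamma_pr, Gamma_obs, r):
    """Dense sketch of Algorithm 2: returns a function y -> mu_hat_pos^{(r)}(y)."""
    H = G.T @ np.linalg.solve(Gamma_obs, G)
    d2, W = eigh(H, np.linalg.inv(Gamma_pr))
    d2, W = d2[::-1][:r], W[:, ::-1][:, :r]
    Gamma_pos_hat = Gamma_pr - (W * (d2 / (1 + d2))) @ W.T        # Theorem 2.3
    return lambda y: Gamma_pos_hat @ (G.T @ np.linalg.solve(Gamma_obs, y))

# Usage on an arbitrary toy problem.
rng = np.random.default_rng(9)
n, m, r = 50, 20, 5
G = rng.standard_normal((m, n)); A = rng.standard_normal((n, n))
Gamma_pr = A @ A.T / n + np.eye(n); Gamma_obs = 0.2 * np.eye(m)
y = rng.standard_normal(m)
print(low_rank_mean(G, Gamma_pr, Gamma_obs, r)(y)[:3])
print(low_rank_update_mean(G, Gamma_pr, Gamma_obs, r)(y)[:3])
```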
This time the noise distribution alone determines the most important directions, and Hessianbased reduction is as effective as the generalized one. In the case of more general spectra, the important directions depend on the ratios λ2i /σi2 and thus one has to use the information provided by both the noise and prior distributions. This is done naturally by the generalized reduction. We now generalize this simple case by moving to full matrices H and Γpr with a variety of prescribed spectra. We assume that H and Γpr have SVDs of the form 1 , . . . , λ n } , where Λ = diag{λ1 , . . . , λn } and Λ = diag{λ H = U ΛU and Γpr = V ΛV with λk = λ0 /k α + τ k = λ 0 /k α + τ. and λ To consider many different cases, the orthogonal matrices U and V are randomly and independently generated uniformly over the orthogonal group [78], leading to different realizations of H and Γpr . In particular, U and V are computed with a QR decomposition of a square matrix of independent standard Gaussian entries using a Gram–Schmidt orthogonalization. (In this case, the standard Householder reflections cannot be used.) Before discussing the results of the first experiment, we explain our implementation of BFGS-based reduction. We ran the BFGS optimizer with a dummy quadratic optimization target J (x) = 12 x Γ−1 pos x and used Γpr as the initial approximation matrix for Γpos . Thus, the BFGS approximation of the posterior covariance matrix can be written as Γpos = Γpr − KK for some rank-r matrix K. The rank-r update is constructed by running the BFGS optimizer for r steps from random initial conditions as shown in [3]. Note that in order to obtain results for sufficiently high-rank updates, we use BFGS rather than L-BFGS in our numerical examples. While [2, 3] in principle employ L-BFGS, the results in these papers use a number of optimization Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2468 SPANTINI ET AL. steps roughly equal to the number of vectors stored in L-BFGS; our approach thus is comparable to [2, 3]. Nonetheless, some results for the highest-rank BFGS updates are not plotted in Figures 1 and 2, as the optimizer converged so close to the optimum that taking further steps resulted in numerical instabilities. Figure 1 summarizes the results of the first experiment. The top row shows the prescribed spectra of H −1 (red) and Γpr (blue). The parameters describing the 0 = 1, α eigenvalues of Γpr are fixed to λ = 2, and τ = 10−6 . The corresponding parameters for H are given by λ0 = 500 and τ = 10−6 with α = 0.345 (left), α = 0.690 (middle), and α = 1.724 (right). Thus, moving from the leftmost column to the rightmost column, the data become increasingly less informative. The second row in the figure shows the Förstner distance between Γpos and its approximation, pos = Γpr −KK , as a function of the rank of KK for 100 different realizations of H Γ and Γpr . The third row shows, for each realization of (H, Γpr ) and for each fixed rank of KK , the difference between the Förstner distance obtained with a prior-, Hessian-, or BFGS-based dimensionality reduction technique and the minimum distance obtained with the generalized approximation. All these differences are positive—a confirmation of Theorem 2.3. But Figure 1 also shows interesting patterns consistent with the observations made for the simple example above. 
When the spectrum of H is basically flat (left column), the directions along which the prior variance is reduced the most are likely to be those corresponding to the largest prior variances, and thus a priorbased reduction is almost as effective as the generalized one (as seen in the bottom two rows on the left). As we move to the third column, eigenvalues of H −1 increase more quickly. The data provide significant information only on a lower-dimensional subspace of the parameter space. In this case, it is crucial to combine this information with the directions in the parameter space along which the prior variance is the greatest. The generalized reduction technique successfully accomplishes this task, whereas the prior and Hessian reductions fail as they focus either on Γpr or H alone; the key is to combine the two. The BFGS update performs remarkably well across all three configurations of the Hessian spectrum, although it is clearly suboptimal compared to the generalized reduction. In Figure 2 the situation is reversed and the results are symmetric to those of Figure 1. The spectrum of H (red) is now kept fixed with parameters λ0 = 500, 0 = 1 and α = 1, and τ = 10−9 , while the spectrum of Γpr (blue) has parameters λ τ = 10−9 with decay rates increasing from left to right: α = 0.552 (left), α = 1.103 (middle), and α = 2.759 (right). In the first column, the spectrum of the prior is nearly flat. That is, the prior variance is almost equally spread along every direction in the parameter space. In this case, the eigenstructure of H determines the directions of greatest variance reduction, and the Hessian-based reduction is almost as effective as the generalized one. As we move toward the third column, the spectrum of Γpr decays more quickly. The prior variance is restricted to lower-dimensional subspaces of the parameter space. Mismatch between prior- and Hessian-dominated directions then leads to poor performance of both the prior- and Hessian-based reduction techniques. However, the generalized reduction performs well also in this more challenging case. The BFGS reduction is again empirically quite effective in most of the configurations that we consider. It is not always better than the prior- or Hessian-based techniques when the update rank is low, or when the prior spectrum decays slowly; for example, Hessian-based reduction is more accurate than BFGS across all ranks in the first column of Figure 2. But when either the prior covariance or the Hessian has quickly decaying spectra, the BFGS approach performs almost as well as the generalized Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2469 Fig. 1. Top row: Eigenspectra of Γpr (blue) and H −1 (red) for three values for the decay rate of the eigenvalues of H: α = 0.345 (left), α = 0.690 (middle), and α = 1.724 (right). Second row: Förstner distance between Γpos and its approximation versus the rank of the update for 100 realizations of Γpr and H using prior-based (blue), Hessian-based (green), BFGS-based (magenta), and optimal (red) updates. Bottom row: Differences of posterior covariance approximation error (measured with the Förstner metric) between the prior-based and optimal updates (blue), between the Hessian-based and optimal updates (green), and between the BFGS-based and optimal updates (magenta). reduction. 
Though this approach remains suboptimal, its approximation properties deserve further theoretical study. 5.2. Example 2: X-ray tomography. We consider a classical inverse problem of X-ray computed tomography (CT), where X-rays travel from sources to detectors through an object of interest. The intensities from multiple sources are measured at the detectors, and the goal is to reconstruct the density of the object. In this framework, we investigate the performance of the optimal mean and covariance matrix approximations presented in sections 2 and 4. This synthetic example is motivated by a real application: real-time X-ray imaging of logs that enter a saw mill for the purpose of automatic quality control. For instance, in the system commercialized by Bintec (www.bintec.fi), logs enter the X-ray system on a fast-moving conveyer belt and fast reconstructions are needed. The imaging setting (e.g., X-ray source and detector locations) and the priors are fixed; only the data changes from one log cross section to another. The basis for our posterior mean approximation can therefore be precomputed, and repeated inversions can be carried out quickly with direct matrix formulas. We model the absorption of an X-ray along a line, i , using Beer’s law: (5.1) Id = Is exp − x(s)ds , i where Id and Is are the intensities at the detector and at the source, respectively, Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2470 SPANTINI ET AL. Fig. 2. Analogous to Figure 1 but this time the spectrum of H is fixed, while that of Γpr has varying decay rates: α = 0.552 (left), α = 1.103 (middle), and α = 2.759 (right). and x(s) is the density of the object at position s on the line i . The computational domain is discretized into a grid and the density is assumed to be constant within each grid cell. The line integrals are approximated as (5.2) x(s)ds ≈ i # of cells gij xj , j=1 where gij is the length of the intersection between line i and cell j, and xj is the unknown density in cell j. The vector of absorptions along m lines can then be approximated as (5.3) Id ≈ Is exp (−Gx) , where Id is the vector of m intensities at the detectors and G = (gij ) is the m × n matrix of intersection lengths for each of the m lines. Even though the forward operator (5.3) is nonlinear, the inference problem can be recast in a linear fashion by taking the logarithm of both sides of (5.3). This leads to the following linear model for the inversion: y = Gx + , where the measurement vector is y = − log(Id /Is ) and the measurement errors are assumed to be independent and identically distributed Gaussian, ∼ N (0, σ 2 I). The setup for the inference problem, borrowed from [40], is as follows. The rectangular domain is discretized with an n × n grid. The true object consists of three circular inclusions, each of uniform density, inside an annulus. Ten X-ray sources are positioned on one side of a circle, and each source sends a fan of 100 X-rays that are measured by detectors on the opposite side of the object. Here, the 10 sources are distributed evenly so that they form a total illumination angle of 90 degrees, resulting in a limited-angle CT problem. We use the exponential model Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. A2471 OPTIMAL LOW-RANK APPROXIMATIONS 1.02 intensity Downloaded 03/15/16 to 18.51.1.3. 
Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php 1 0.98 0.96 0.94 0.92 0.9 0 20 40 60 80 100 detector pixel Fig. 3. X-ray tomography problem. Left: Discretized domain, true object, sources (red dots), and detectors corresponding to one source (black dots). The fan transmitted by one source is illustrated in gray. The density of the object is 0.006 in the outer ring and 0.004 in the three inclusions; the background density is zero. Right: The true simulated intensity (black line) and noisy measurements (red dots) for one source. (5.1) to generate synthetic data in a discretization-independent fashion by computing the exact intersections between the rays and the circular inclusions in the domain. Gaussian noise with standard deviation σ = 0.002 is added to the simulated data. The imaging setup and data from one source are illustrated in Figure 3. The unknown density is estimated on a 128×128 grid. Thus the discretized vector, x, has length 16,384, and direct computation of the posterior mean and the posterior covariance matrix, as well as generation of posterior samples, can be computationally nontrivial. To define the prior distribution, x is modeled as a discretized solution of a stochastic PDE of the form (5.4) γ κ2 I − x(s) = W(s), s ∈ Ω, where W is a white noise process, is the Laplacian operator, and I is the identity operator. The solution of (5.4) is a Gaussian random field whose correlation length and variance are controlled by the free parameters κ and γ, respectively. A square root of the prior precision matrix of x (which is positive√definite) can then be easily computed (see [57] for details). We use κ = 10 and γ = 800 in our simulations. Our first task is to compute an optimal approximation of Γpos as a low-rank negative update of Γpr (cf. Theorem 2.3). Figure 4 (top row) shows the convergence of the approximate posterior variance as the rank of the update increases. The zerorank update corresponds to Γpr (first column). For this formally 16,384-dimensional problem, a good approximation of the posterior variance is achieved with a rank 200 update; hence the data are informative only on a low-dimensional subspace. The quality of the covariance matrix approximation is also reflected in the structure of samples drawn from the approximate posterior distributions (bottom row). All five of these samples are drawn using the same random seed and the exact posterior mean, so that all the differences observed are due to the approximation of Γpos . Already with a rank 100 update, the small-scale features of the approximate posterior sample match those of the exact posterior sample. In applications, agreement in this “eyeball norm” is important. Of course, Theorem 2.3 also provides an exact formula for the error in the posterior covariance; this error is shown in the right panel of Figure 7 (blue curve). Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2472 SPANTINI ET AL. Fig. 4. X-ray tomography problem. First column: Prior variance field, in log scale (top), and a sample drawn from the prior distribution (bottom). Second through last columns (left to right): Variance field, in log scale, of the approximate posterior as the rank of the update increases (top); samples from the corresponding approximate posterior distributions (bottom) assuming exact knowledge of the posterior mean. 
Our second task is to assess the performances of the two optimal posterior mean (r) approximations given in section 4. We will use μpos (y) to denote the low-rank ap(r) proximation and μ pos (y) to denote the low-rank update approximation. Recall that (r) both approximations are linear functions of the data y, given by μpos (y) = A∗ y with (r) ∗ y with A ∗ ∈ Ar , where the classes Ar and Ar are defined A∗ ∈ Ar and μ pos (y) = A in (4.2). As in section 4, we shall use A to denote either of the two classes. /μpos (y)Γ−1 for difFigure 5 shows the normalized error μ(y) − μpos (y)Γ−1 pos pos ferent approximations μ(y) of the true posterior mean μpos (y) and a fixed realization y of the data. The error is a function of the order r of the approximation class A. Snapshots of μ(y) are shown along the two error curves. For reference, μpos (y) is also shown at the top. We see that the errors decrease monotonically but that the lowrank approximation outperforms the low-rank update approximation for lower values of r. This is consistent with the discussion at the end of section 4; the crossing point of the error curves is also consistent with that analysis. In particular, we expect the low-rank update approximation to outperform the low-rank approximation only when the approximation starts to include generalized eigenvalues of the pencil (H, Γ−1 pr ) that are less than one—i.e., once the approximations are no longer underresolved. This can be confirmed by comparing Figure 5 with the decay of the generalized eigenvalues of the pencil (H, Γ−1 pr ) in the right panel of Figure 7 (blue curve). On top of each snapshot in Figure 5, we show the relative CPU time required to compute the corresponding posterior mean approximation for each new realization of the data. The relative CPU time is defined as the time required to compute this approximation8 divided by the time required to apply the posterior precision matrix to a vector. This latter operation is essential to computing the posterior mean via an iterative solver, such as a Krylov subspace method. These solvers are a standard choice for computing the posterior mean in large-scale inverse problems. Evaluating the ratio allows us to determine how many solver iterations could be performed with a computational cost roughly equal to that of approximating the posterior mean for a new realization of the data. Based on the reported times, a 8 This timing does not include the computation of (4.4) or (4.7), which should be regarded as offline steps. Here we report the time necessary to apply the optimal linear function to any new realization of the data, i.e., the online cost. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2473 Fig. 5. Limited-angle X-ray tomography: Comparison of the optimal posterior mean approx(r) (r) pos (y) (black) of μpos (y) for a fixed realization of the data y, as a imations, μpos (y) (blue) and μ r , respectively. The normalized error function of the order r of the approximating classes Ar and A for an approximation μ(y) is defined as μ(y) − μpos (y)Γ−1 / μpos (y)Γ−1 . The numbers above or pos pos below the snapshots indicate the relative CPU time of the corresponding mean approximation—i.e., the time required to compute the approximation divided by the time required to apply the posterior precision matrix to a vector. few observations can be made. 
First of all, as anticipated in section 4, computing (r) (r) μpos (y) for any new realization of the data is faster than computing μ pos (y). Second, obtaining an accurate posterior mean approximation requires roughly r = 200, and (r) the relative CPU times for this order of approximation are 7.3 for μpos (y) and 29.0 (r) for μ pos (y). These are roughly the number of iterations of an iterative solver that one could take for equivalent computational cost. That is, the speedup of the posterior mean approximation compared to an iterative solver is not particularly dramatic in this case, because the forward model A is simply a sparse matrix that is cheap to apply. This is different for the heat equation example discussed in section 5.3. Note that the above computational time estimates exclude other costs associated with iterative solvers. For instance, preconditioners are often applied; these significantly decrease the number of iterations needed for the solvers to converge but, on the other hand, increase the cost per iteration. A popular approach for solving the posterior mean efficiently is to use the prior covariance as the preconditioner [6]. In the limited-angle tomography problem, including the application of this preconditioner in the reference CPU time would reduce the relative CPU time of our r = 200 approx(r) (r) pos (y). That is, the cost of computing our imations to 0.48 for μpos (y) and 1.9 for μ approximations is roughly equal to one iteration of a prior-preconditioned iterative solver. The large difference compared to the case without preconditioning is due to the fact that the cost of applying the prior here is computationally much higher than applying the forward model. Figure 6 (left panel) shows unnormalized errors in the approximation of μpos (y), (5.5) 2 e(y)2Γ−1 = μ(r) pos (y) − μpos (y)Γ−1 pos pos 2 and e(y)2Γ−1 = μ(r) pos (y) − μpos (y)Γ−1 , pos pos Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2474 SPANTINI ET AL. Fig. 6. The errors e(y)2 −1 (blue) and e(y)2 −1 (black) defined by (5.5), and their expected Γpos Γpos values in green and red, respectively, for Example 2 (left panel) and Example 3 (right panel). for the same realization of y used in Figure 5. In the same panel we also show the expected values of these errors over the prior predictive distribution of y, which is exactly the r-dependent component of the Bayes risk given in Theorems 4.1 and 4.2. Both sets of errors decay with increasing r and show a similar crossover between the two approximation classes. But the particular error e(y)2Γ−1 departs consistently pos from its expectation; this is not unreasonable in general (the mean estimator has a nonzero variance), but the offset may be accentuated in this case because the data are generated from an image that is not drawn from the prior. (The right panel of Figure 6, which comes from Example 3, represents a contrasting case.) By design, the posterior approximations described in this paper perform well when the data inform a low-dimensional subspace of the parameter space. To better understand this effect, we also consider a full-angle configuration of the tomography problem, wherein the sources and detectors are evenly spread around the entire unknown object. In this case, the data are more informative than in the limited-angle configuration. 
This can be seen in the decay rate of the generalized eigenvalues of the pencil (H, Γ−1 pr ) in the center panel of Figure 7 (blue and red curves); eigenvalues for the full-angle configuration decay more slowly than for the limited-angle configuration. Thus, according to the optimal loss given in (2.10) (Theorem 2.3), the priorto-posterior update in the full-angle case must be of greater rank than the update in the limited-angle case for any given approximation error. Also, good approximation of μpos (y) in the full-angle case requires higher order of the approximation class A, as is shown in Figure 8. But because the data are strongly informative, they allow an almost perfect reconstruction of the underlying truth image. The relative CPU (r) (r) times are similar to the limited angle case: roughly 8 for μpos (y) and 14 for μ pos (y). If preconditioning with the prior covariance is included in the reference CPU time (r) (r) pos (y). calculation, the relative CPU times drop to 1.5 for μpos (y) and to 2.6 for μ We remark that in realistic applications of X-ray tomography, the limited angle setup is extremely common as it is cheaper and more flexible (yielding smaller and lighter devices) than a full-angle configuration. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2475 Fig. 7. Left: Leading eigenvalues of Γpr and H −1 in the limited-angle and full-angle X-ray tomography problems. Center: Leading generalized eigenvalues of the pencil (H, Γ−1 pr ) in the limited pos ) as a function of the rank of the update angle (blue) and full-angle (red) cases. Right: dF (Γpos , Γ pos = Γpr − KK , in the limited-angle (blue) and full-angle (red) cases. KK , with Γ Fig. 8. Same as Figure 5, but for full-angle X-ray tomography (sources and receivers spread uniformly around the entire object). 5.3. Example 3: Heat equation. Our last example is the classic linear inverse problem of solving for the initial conditions of an inhomogeneous heat equation. Let u(s, t) be the time-dependent state of the heat equation on s = (s1 , s2 ) ∈ Ω = [0, 1]2 , t ≥ 0, and let κ(s) be the heat conductivity field. Given initial conditions, u0 (s) = u(s, 0), the state evolves in time according to the linear heat equation: (5.6) ∂u(s, t) = −∇ · (κ(s)∇u(s, t)), ∂t κ(s)∇u(s, t) · n(s) = 0, s ∈ ∂Ω, t > 0, s ∈ Ω, t > 0, where n(s) denotes the outward-pointing unit normal at s ∈ ∂Ω. We place ns = 81 sensors at the locations s1 , . . . , sns , uniformly spaced within the lower left quadrant of the spatial domain, as illustrated by the black dots in Figure 9. We use a finitedimensional discretization of the parameter space based on the finite element method on a regular 100 × 100 grid, {si }. Our goal is to infer the vector x = (u0 (si )) of initial Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2476 SPANTINI ET AL. Fig. 9. Heat equation (Example 3). Initial condition (top left) and several snapshots of the states at different times. Black dots indicate sensor locations. conditions on the grid. Thus, the dimension of the parameter space for the inference problem is n = 104 . We use data measured at 50 discrete times t = t1 , t2 , . . . , t50 , where ti = it, and t = 2 × 10−4 . 
At each time ti , pointwise observations of the state u are taken at these sensors, i.e., (5.7) di = Cu(s, ti ), where C is the observation operator that maps the function u(s, ti ) to d = (u(s1 , ti ), . . . , u(sn , ti )) . The vector of observations is then d = [d1 ; d2 ; . . . ; d50 ]. The noisy data vector is y = d + ε, where ε ∼ N (0, σ 2 I) and σ = 10−2 . Note that the data are a linear function of the initial conditions, perturbed by Gaussian noise. Thus the data can be written as (5.8) y = Gx + ε, ε ∼ N (0, σ 2 I). where G is a linear map defined by the composition of the forward model (5.6) with the observation operator (5.7), both linear. We generate synthetic data by evolving the initial conditions shown in Figure 9. This “true” value of the inversion parameters x is a discretized realization of a Gaussian process satisfying an SPDE of the same form used in the previous tomography example, but now with a nonstationary permeability field. In other words, the truth is a draw from the prior in this example (unlike in the previous example), and the prior Gaussian process satisfies the following SPDE: s ∈ Ω, (5.9) γ κ2 I − ∇ · c(s)∇ x(s) = W(s), where c(s) is the space-dependent permeability tensor. Figure 10 and the right panel in Figure 6 show our numerical results. They have the same interpretations as Figures 5 and 6 in the tomography example. The trends in the figures are consistent with those encountered in the previous example and confirm the good performance of the optimal low-rank approximation. Notice that in Figures 10 and 6 the approximation of the posterior mean appears to be nearly perfect (visually) once the error curves for the two approximations cross. This is somewhat expected from the theory since we know that the crossing point should occur when the approximations start to use eigenvalues of the pencil (H, Γ−1 pr ) that are less than one—that is, once we have exhausted directions in the parameter space where the data are more constraining than the prior. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2477 Fig. 10. Same as Figure 5, but for Example 3 (initial condition inversion for the heat equation). Again, we report the relative CPU time for each posterior mean approximation above/below the corresponding snapshot in Figure 10. The results differ significantly from the tomography example. For instance, at order r = 200, which yields approximations that are visually indistinguishable from the true mean, the relative CPU (r) (r) pos (y). Therefore we can compute an accutimes are 0.001 for μpos (y) and 0.53 for μ rate mean approximation for a new realization of the data much more quickly than taking one iteration of an iterative solver. Recall that, consistent with the setting described at the start of section 4, this is a comparison of online times, after the matrices (4.4) or (4.7) have been precomputed. The difference between this case and the tomography example of section 5.2 is due to the higher CPU cost of applying the forward and adjoint models for the heat equation—solving a time-dependent PDE versus applying a sparse matrix. 
Also, because the cost of applying the prior covariance is negligible compared to that of the forward and adjoint solves in this example, preconditioning the iterative solver with the prior would not strongly affect the reported relative CPU times, unlike the tomography example. Figure 11 illustrates some important directions characterizing the heat equation inverse problem. The first two columns show the four leading eigenvectors of, respectively, Γpr and H. Notice that the support of the eigenvectors of H concentrates around the sensors. The third column shows the four leading directions (w i ) defined in Theorem 2.3. These directions define the optimal prior-to-posterior covariance matrix update (cf. (2.9)). This update of Γpr is necessary to capture directions (w i ) of greatest relative difference between prior and posterior variance (cf. Corollary 3.1). The four leading directions (w i ) are shown in the fourth column. The support of these modes is again concentrated around the sensors, which intuitively makes sense as these are directions of greatest variance reduction. 6. Conclusions. This paper has presented and characterized optimal approximations of the Bayesian solution of linear inverse problems, with Gaussian prior and noise distributions defined on finite-dimensional spaces. In a typical large-scale inverse problem, observations may be informative—relative to the prior—only on a Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2478 SPANTINI ET AL. Fig. 11. Heat equation (Example 3). First column: Four leading eigenvectors of Γpr . Second column: Four leading eigenvectors of H. Third column: Four leading directions (w i ) (cf. (2.9)). Fourth column: Four leading directions (w i ) (cf. Corollary 3.1) low-dimensional subspace of the parameter space. Our approximations therefore identify and exploit low-dimensional structure in the update from prior to posterior. We have developed two types of optimality results. In the first, the posterior covariance matrix is approximated as a low-rank negative semidefinite update of the prior covariance matrix. We describe an update of this form that is optimal with respect to a broad class of loss functions between covariance matrices, exemplified by the Förstner metric [29] for symmetric positive definite matrices. We argue that this is the appropriate class of loss functions with which to evaluate approximations of the posterior covariance matrix, and we show that optimality in such metrics identifies directions in parameter space along which the posterior variance is reduced the most, relative to the prior. Optimal low-rank updates are derived from a generalized eigendecomposition of the pencil defined by the negative log-likelihood Hessian and the prior precision matrix. These updates have been proposed in previous work [28], but our work complements these efforts by characterizing the optimality of the resulting approximations. Under the assumption of exact knowledge of the posterior mean, our results extend to optimality statements between the associated distributions (e.g., optimality in the Hellinger distance and in the K-L divergence). Second, we have developed fast approximations of the posterior mean that are useful when repeated evaluations thereof are required for multiple realizations of the data (e.g., in an online inference setting). 
These approximations are optimal in the sense that they minimize the Bayes risk for squared-error loss induced by the posterior precision matrix. The most computationally efficient of these approximations expresses the posterior mean as the product of a single low-rank matrix with the data. We have demonstrated the covariance and mean approximations numerically on a variety of inverse problems: synthetic problems constructed from random Hessian and prior covariance matrices; an X-ray tomography problem with different observation scenarios; and inversion for the initial condition of a heat equation, with localized observations and a nonstationary prior. This work has several possible extensions of interest, some of which are already part of ongoing research. First, it is natural to generalize the present approach to infinite-dimensional parameter spaces endowed with Gaussian priors. This setting is essential to understanding and formalizing Bayesian inference over function spaces [9, 79]. Here, by analogy with the current results, one would expect the posterior covariance operator to be well approximated by a finite-rank negative perturbation of the Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2479 prior covariance operator. A further extension could allow the data to become infinitedimensional as well. Another important task is to generalize the present methodology to inverse problems with nonlinear forward models. One approach for doing so is presented in [22]; other approaches are certainly possible. Yet another interesting research topic is the study of analogous approximation techniques for sequential inference. We note that the assimilation step in a linear (or linearized) data assimilation scheme can be already tackled within the framework presented here. But the nonstationary setting, where inference is interleaved with evolution of the state, introduces the possibility for even more tailored and structure-exploiting approximations. Appendix A. Technical results. Here we collect the proofs and other technical results necessary to support the statements made in the previous sections. We start with an auxiliary approximation result that plays an important role in our analysis. Given a semipositive definite diagonal matrix D, we seek an approximation of D + I by a rank r perturbation of the identity, U U + I, that minimizes a loss function from the class L defined in (2.6). The following lemma shows that the U is simply the best rank r approximation of the matrix D in the optimal solution U Frobenius norm. Lemma A.1 (approximation lemma). Let D = diag{d21 , . . . , d2n }, with d2i ≥ d2i+1 , n×r → R, as J (U ) = L(U U + I, D + I) = and L ∈ L. Define the functional J : R i f (σi ), where (σi ) are the generalized eigenvalues of the pencil (U U + I, D + I) and f ∈ U. Then the following hold: , of J such that (i) There is a minimizer, U (A.1) U = U r d2i ei e i , i=1 where (ei ) are the columns of the identity matrix. (ii) If the first r eigenvalues of D are distinct, then any minimizer of J satisfies (A.1). Proof. The idea is to apply [53, Theorem 1.1] to the functional J . 
To this end, we notice that J can be equivalently written as J (U ) = F ◦ ρn ◦ g(U ), where F : Rn+ → R n is of the form F (x) = i=1 f (xi ); ρn denotes a function that maps an n × n SPD matrix A to its eigenvalues σ = (σi ) (i.e., ρn (A) = σ and since F is a symmetric function, the order of the eigenvalues is irrelevant); and the mapping g is given by g(U ) = (D + I)−1/2 (U U + I)(D + I)−1/2 for all U ∈ Rn×r . Since the function F ◦ ρn satisfies the hypotheses in [53, Theorem 1.1], F ◦ρn is differentiable at the SPD matrix X iff F is differentiable at ρn (X), in which case (F ◦ ρn ) (X) = ZSσ Z , where Sσ = diag[ F (ρn (X)) ] = diag{f (σ1 ), . . . , f (σn )}, and Z is an orthogonal matrix such that X = Z diag[ ρn (X) ]Z . Using the chain rule, we obtain ∂g(U ) ∂J (U ) = tr ZSσ Z , ∂ Uij ∂ Uij which leads to the following gradient of J at U : J (U ) = 2(D + I)−1/2 ZSσ (D + I)−1/2 Z U = 2 W Sσ W U, Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. A2480 SPANTINI ET AL. where the orthogonal matrix Z is such that the matrix W = (D + I)−1/2 Z satisfies Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php (A.2) (U U + I)W = (D + I)W Υσ with Υσ = diag(σ). Now we show that the functional J is coercive. Let (Uk ) be a sequence of matrices such that Uk F → ∞. Hence, σmax (g(Uk )) → ∞ and so does J since J (Uk ) ≥ f (σmax (g(Uk ))) + (n − 1)f (1) and f (x) → ∞ as x → ∞. Thus, J is a differentiable coercive functional and has a with zero gradient: global minimizer U (A.3) = 0. ) = 2W Sσ W U J (U However, since f ∈ U, f (x) = 0 iff x = 1. It follows that condition (A.3) is equivalent to (A.4) = 0. (I − Υσ )W U U − D = W − Υ−1 (Υ − I)W , and right-multipliEquations (A.2) and (A.4) give U cation by U U then yields (A.5) U ) = ( U U )2 . D (U U with nonzero eigenvalue α, then u is an In particular, if u is an eigenvector of U eigenvector of D, Du = αu, and thus α = d2i > 0 for some i. Thus, any solution of (A.5) is such that (A.6) U = U rk d2ki eki e ki i=1 satisfying for some subsequence (k ) of {1, . . . , n} and rank rk ≤ r. Notice that any U ) is (A.5) is also a critical point according to (A.4). From (A.6) we also find that g(U a diagonal matrix, r k −1 2 d eki e + I . g(U ) = (D + I) ki ki i=1 ), are given by σi = 1 if The diagonal entries σi , which are the eigenvalues of g(U 2 i = k for some ≤ rk , or σi = 1/(1 + di ) otherwise. In either case, we have 0 < ) is minimized by the subsequence σi ≤ 1 and the monotonicity of f implies that J (U k1 = 1, . . . , kr = r and by the choice rk = r. This proves (A.1). It is clear that if the first r eigenvalues of D are distinct, then any minimizer of J satisfies (A.1). Most of the objective functions we consider have the same structure as the loss function J , hence the importance of Lemma A.1. The next lemma shows that searching for a negative update of Γpr is equivalent to looking for a positive update of the prior precision matrix. In particular, the lemma provides a bijection between the two approximation classes, Mr and M−1 r , defined Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2481 by (2.4) and (3.1). In what follows, Spr is any square root of the prior covariance matrix such that Γpr = Spr Spr . pos = Lemma A.2 (prior updates). 
For any negative semidefinite update of Γpr , Γ Γpr − KK with Γpos 0, there is a matrix U (of the same rank as K) such that pos = Γ−1 + U U −1 . The converse is also true. Γ pr −1 − KK Spr , D = diag{d2i }, be a reduced SVD of Proof. Let ZDZ = Spr −1 − pos 0 by assumption, we must have d2 < 1 for all i, and we Spr KK Spr . Since Γ i − may thus define U = Spr ZD1/2 (I − D)−1/2 . By Woodbury’s identity, −1 −1 −1 pos . Γpr + U U = Γpr − Γpr U I + U Γ−1 U Γpr = Γpr − KK = Γ pr U pos as a Conversely, given a matrix U , we use again Woodbury’s identity to write Γ negative semidefinite update of Γpr : Γpos = Γpr − KK 0. Now we prove our main result on approximations of the posterior covariance matrix. Proof of Theorem 2.3. Given a loss function L ∈ L, our goal is to minimize pos ) = f (σi ) (A.7) L(Γpos , Γ i pos = Γpr − KK 0, where (σi ) are over K ∈ Rn×r subject to the constraint Γ pos ) and f belongs to the class U the generalized eigenvalues of the pencil (Γpos , Γ defined by (2.7). We also write σi (Γpos , Γpos ) to specify the pencil corresponding to the eigenvalues. By Lemma A.2, the optimization problem is equivalent to finding a −1 = Γ−1 + U U . Observe that matrix, U ∈ Rn×r , that minimizes (A.7) subject to Γ pos pr −1 −1 , Γ ). (σi ) are also the eigenvalues of the pencil (Γ pos pos Let W DW = Spr H Spr with D = diag{δi2 } be an SVD of Spr H Spr . Then, by the invariance properties of the generalized eigenvalues we have −1 −1 −1 , Γ−1 ) = σi ( W S Γ σi (Γ pos pos pr pos Spr W , W Spr Γpos Spr W ) = σi ( ZZ + I, D + I ), where Z = W Spr U . Therefore, our goal reduces to finding a matrix, Z ∈ Rn×r , that minimizes (A.7) with (σi ) being the generalized eigenvalues of the pencil + r ( ZZ 2 I, D + I ). Applying Lemma A.1 leads to the simple solution ZZ = i=1 δi ei e i , where (ei ) are the columns of the identity matrix. In particular, the solution is unique H Spr are distinct. The corresponding approximation if the first r eigenvalues of Spr U U is then (A.8) − −1 W ZZ W Spr = U U = Spr r δi2 w i w i , i=1 − wi and wi is the ith column of W . Woodbury’s identity gives the where w i = Spr corresponding negative update of Γpr as (A.9) pos = Γpr − KK , Γ KK = r −1 δi2 1 + δi2 w i w i i=1 with w i = Spr wi . Now, it suffices to note that the couples (δi2 , w i ) defined here are ). At optimality, σi = 1 for precisely the generalized eigenpairs of the pencil (H, Γ−1 pr i ≤ r and σi = (1 + δi2 )−1 for i > r, proving (2.10). Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. A2482 SPANTINI ET AL. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php Before proving Lemma 2.2, we recall that the K-L divergence and the Hellinger distance between two multivariate Gaussians, ν1 = N (μ, Σ1 ) and ν2 = N (μ, Σ2 ), with the same mean and full rank covariance matrices are given, respectively, by [70]: (A.10) DKL (ν1 ν2 ) = det(Σ1 ) 1 − rank(Σ Σ ) − ln trace Σ−1 , 1 1 2 2 det(Σ2 ) (A.11) dHell (ν1 , ν2 ) = 1− |Σ1 |1/4 |Σ2 |1/4 . | 12 Σ1 + 12 Σ2 |1/2 Proof of Lemma 2.2. By (A.10), the K-L divergence between the posterior νpos (y) and the Gaussian approximation νpos (y) can be written in terms of the generalized pos ) as eigenvalues of the pencil (Γpos , Γ DKL (νpos (y) νpos (y)) = ( σi − ln σi − 1 ) /2, i and since f (x) = (x − ln x − 1) /2 belongs to U, we see that the K-L divergence is a loss function in the class L defined by (2.6). 
Hence, Theorem 2.3 applies and the equivalence between the two approximations follows trivially. An analogous argument holds for the Hellinger distance. The squared Hellinger distance between νpos (y) and νpos (y) can be written in terms of the generalized eigenvalues, (σi ), of the pencil pos ), as (Γpos , Γ (A.12) 2 dHell (νpos (y), νpos (y)) = 1 − 2/2 1/4 σi −1/2 (1 + σi ) . i where is the dimension of the parameter space. Minimizing (A.12) is equivalent 1/4 to maximizing i σi (1 + σi )−1/2 , which in turn is equivalent to minimizing the functional: 1/4 pos ) = − (A.13) L(Γpos , Γ ln( σi (1 + σi )−1/2 ) = ln( 2 + σi + 1/σi )/4. i i Since f (x) = ln( 2 + x + 1/x )/4 belongs to U, Theorem 2.3 can be applied once again. Proof of Corollary 3.1. The proofs of parts (i) and (ii) were already given in the proof of Theorem 2.3. Part (iii) holds because −1 − (1 + δi2 )Γpos w i = (1 + δi2 )(H + Γ−1 Spr wi pr ) = (1 + δi2 ) Spr (Spr H Spr + I)−1 wi = Spr wi = Γpr w i , because wi is an eigenvector of ( Spr H Spr + I )−1 with eigenvalue (1 + δi2 )−1 as shown in the proof of Theorem 2.3. Now we turn to optimality results for approximations of the posterior mean. In what follows, let Spr , Sobs , Spos , and Sy be the matrix square roots of, respectively, Γpr , Γobs , Γpos , and Γy := Γobs + G Γpr G such that Γ = S S (i.e., possibly nonsymmetric square roots). Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. OPTIMAL LOW-RANK APPROXIMATIONS A2483 2 Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php Equation (4.1) shows that to minimize E( Ay − xΓ−1 ) over A ∈ A, we need only pos to minimize E( Ay − μpos (y) 2Γ−1 ). Furthermore, since μpos (y) = Γpos G Γ−1 obs y, it pos follows that (A.14) −1 2 E( Ay − μpos (y) 2Γ−1 ) = Spos (A − Γpos G Γ−1 obs ) Sy F . pos We are therefore led to the following optimization problem: (A.15) −1 min Spos A Sy − Spos G Γ−1 obs Sy F . A∈A − The following result shows that an SVD of the matrix SH := Spr G Sobs can be used to obtain simple expressions for the square roots of Γpos and Γy . − G Sobs . Then Lemma A.3 (square roots). Let W DV be an SVD of SH = Spr (A.16) Spos = Spr W ( I + DD )−1/2 W , (A.17) Sy = Sobs V ( I + DD )1/2 V are square roots of Γpos and Γy . −1 −1 as Proof. We can rewrite Γpos = ( G Γ−1 obs G + Γpr ) −1 Spr = Spr W ( DD + I )−1 W Spr Γpos = Spr ( SH SH +I) = [ Spr W ( DD + I )−1/2 W ] [ Spr W ( DD + I )−1/2 W ] , −1 which proves (A.16). The proof of (A.17) follows similarly using SH = Sobs G Γpr SH − G Γobs . In the next two proofs we use (C)r to denote a rank r approximation of the matrix C in the Frobenius norm. Proof of Theorem 4.1. By [31, Theorem 2.1], an optimal A ∈ Ar is given by −1 (A.18) A = Spos Spos G Γobs Sy r Sy−1 . Now, we need some computations to show that (A.18) is equivalent to (4.4). Using −1/2 (A.16) and (A.17) we find Spos G Γ−1 D (I + D D)1/2 V , obs Sy = W (I + DD ) r −1 and therefore ( Spos G Γobs Sy )r = i=1 δi wi vi , where wi is the ith column of W , vi is the ith column of V, and δi is the ith diagonal entry of D. Inserting this back −1 into (A.18) yields A = i≤r δi (1 + δi2 )−1 Spr wi vi Sobs . Now it suffices to note that − −1 w i := Spr wi is a generalized eigenvector of (H, Γpr ), that vi := Sobs vi is a generalized 2 The eigenvector of (GΓpr G , Γobs ), and that (δi ) are also eigenvalues of (H, Γ−1 pr ). minimum Bayes risk is a straightforward computation for the optimal estimator (4.4) using (A.14). 
Proof of Theorem 4.2. Given A ∈ Ar , we can restate (A.15) as the problem of finding a matrix B, of rank at most r, that minimizes −1 −1 −1 (A.19) Spos (Γpr − Γpos ) G Γ−1 obs Sy − Spos B G Γobs Sy F such that A = (Γpr − B) G Γ−1 obs . By [31, Theorem 2.1], an optimal B is given by (A.20) −1 −1 † B = Spos ( Spos (Γpr − Γpos ) G Γ−1 obs Sy )r (G Γobs Sy ) , where † denotes the pseudoinverse operator. A closer look at [31, Theorem 2.1] reveals that another minimizer of (A.19), itself not necessarily of minimum Frobenius norm, is given by (A.21) −1 −1 † B = Spos ( Spos (Γpr − Γpos ) G Γ−1 obs Sy )r (Spr G Γobs Sy ) Spr . Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. A2484 SPANTINI ET AL. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php By Lemma A.3, 1/2 −1 Spr G Γobs Sy = W [ D I + D D ]V , −1 1/2 Spos Γpr G Γ−1 D(I + D D)1/2 ]V , obs Sy = W [ (I + DD ) −1 −1/2 Spos Γpos G Γ−1 D(I + D D)1/2 ]V obs Sy = W [ (I + DD ) −1 and therefore (Spr G Γobs Sy )† = whereas q −1 i=1 δi −1/2 1 + δi2 vi wi for q = rank(SH ), −1 (Γpr − Γpos )G Γ−1 ( Spos obs Sy )r = r δi3 wi vi . i=1 Inserting these expressions back into (A.21), we obtain r δ2 i B = Spr , Spr 2 wi wi 1 + δ i i=1 where wi is the ith column of W , vi is the ith column of V , and δi is the ith diagonal entry of D. Notice that (δi2 , w i ), with w i = Spr wi , are the generalized eigenpairs of (H, Γ−1 pr ) (cf. proof of Theorem 2.3). Hence, by Theorem 2.3, we recognize the pos = Γpr − B. Plugging this expression back into optimal approximation of Γpos as Γ (A.21) gives (4.7). The minimum Bayes risk in (ii) follows readily using the optimal estimator given by (4.7) in (A.14). Acknowledgments. We thank Jere Heikkinen of Bintec Ltd. for providing us with the code used in Example 2. We also thank Julianne and Matthias Chung for many insightful discussions. REFERENCES [1] V. Akçelik, G. Biros, O. Ghattas, J. Hill, D. Keyes, and B. van Bloemen Waanders, Parallel algorithms for PDE-constrained optimization, Parallel Process. Sci. Comput., 20 (2006), p. 291. [2] H. Auvinen, J. M. Bardsley, H. Haario, and T. Kauranne, Large-scale Kalman filtering using the limited memory BFGS method, Electron. Trans. Numer. Anal., 35 (2009), pp. 217–233. [3] H. Auvinen, J. M. Bardsley, H. Haario, and T. Kauranne, The variational Kalman filter and an efficient implementation using limited memory BFGS, Internat. J. Numer. Methods Fluids, 64 (2010), pp. 314–335. [4] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, Software Environ. Tools 11, SIAM, Philadelphia, 2000. [5] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994. [6] T. Bui-Than, C. Burstedde, O. Ghattas, J. Martin, G. Stadler, and L. Wilcox, Extremescale UQ for Bayesian inverse problems governed by PDEs, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society Press, 2012, p. 3. [7] T. Bui-Thanh and O. Ghattas, Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves, Inverse Problems, 28 (2012), 055001. [8] T. Bui-Thanh, O. Ghattas, and D. 
Higdon, Adaptive Hessian-based nonstationary Gaussian process response surface method for probability density approximation with application to Bayesian solution of large-scale inverse problems, SIAM J. Sci. Comput., 34 (2012), pp. A2837–A2871. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2485 [9] T. Bui-Thanh, O. Ghattas, J. Martin, and G. Stadler, A computational framework for infinite-dimensional Bayesian inverse problems part I: The linearized case, with application to global seismic inversion, SIAM J. Sci. Comput., 35 (2013), pp. A2494–A2523. [10] D. Calvetti, Preconditioned iterative methods for linear discrete ill-posed problems from a Bayesian inversion perspective, J. Comput. Appl. Math., 198 (2007), pp. 378–395. [11] D. Calvetti, B. Lewis, and L. Reichel, On the regularizing properties of the GMRES method, Numer. Math., 91 (2002), pp. 605–625. [12] D. Calvetti, D. McGivney, and E. Somersalo, Left and right preconditioning for electrical impedance tomography with structural information, Inverse Problems, 28 (2012), 055015. [13] D. Calvetti and E. Somersalo, Priorconditioners for linear systems, Inverse Problems, 21 (2005), p. 1397. [14] B. P. Carlin and T. A. Louis, Bayesian Methods for Data Analysis, 3rd ed., Chapman & Hall/CRC Press, Boca Raton, FL, 2009. [15] R. A. Christensen, Plane Answers to Complex Questions: The Theory of Linear Models, Springer-Verlag, Berlin, 1987. [16] J. Chung and M. Chung, Computing optimal low-rank matrix approximations for image processing, in Proceedings of the Asilomar Conference on Signals, Systems and Computers, IEEE, New York, 2013, pp. 670–674. [17] J. Chung and M. Chung, An efficient approach for computing optimal low-rank regularized inverse matrices, Inverse Problems, 30 (2014), 114009. [18] J. Chung, M. Chung, and D. P. O’Leary, Designing optimal spectral filters for inverse problems, SIAM J. Sci. Comput., 33 (2011), pp. 3132–3152. [19] J. Chung, M. Chung, and D. P. O’Leary, Optimal filters from calibration data for image deconvolution with data acquisition error, J. Math. Imaging Vis., 44 (2012), pp. 366–374. [20] J. Chung, M. Chung, and D. P. O’Leary, Optimal regularized low rank inverse approximation, Linear Algebra Appl., 468 (2015), pp. 260–269. [21] T. Cui, K. J. H. Law, and Y. Marzouk, Dimension-independent likelihood-informed MCMC, arXiv:1411.3688, 2014. [22] T. Cui, J. Martin, Y. Marzouk, A. Solonen, and A. Spantini, Likelihood-informed dimension reduction for nonlinear inverse problems, Inverse Problems, 30 (2014), 114015. [23] J. Cullum and W. Donath, A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices, in Proceedings of the IEEE Conference on Decision and Control Including the 13th Symposium on Adaptive Processes, IEEE, New York, 1974, pp. 505–509. [24] C. R. Dietrich and G. N. Newsam, Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix, SIAM J. Sci. Comput., 18 (1997), pp. 1088–1107. [25] L. Dykes and L. Reichel, Simplified gsvd computations for the solution of linear discrete ill-posed problems, J. Comput. Appl. Math., 255 (2014), pp. 15–27. [26] C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika, 1 (1936), pp. 211–218. [27] G. 
Evensen, Data Assimilation, Springer, Berlin, 2007. [28] H. P. Flath, L. Wilcox, V. Akçelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput., 33 (2011), pp. 407–432. [29] W. Förstner and B. Moonen, A metric for covariance matrices, in Geodesy: The Challenge of the 3rd Millennium, Springer, Berlin, 2003, pp. 299–309. [30] C. Fox and A. Parker, Convergence in variance of Chebyshev accelerated Gibbs samplers, SIAM J. Sci. Comput., 36 (2014), pp. A124–A147. [31] S. Friedland and A. Torokhti, Generalized rank-constrained matrix approximations, SIAM J. Matrix Anal. Appl., 29 (2007), pp. 656–659. [32] G. H. Golub and C. F. Van Loan, Matrix Computations, Vol. 3, Johns Hopkins University Press, Baltimore, 2012. [33] G. H. Golub and R. Underwood, The block Lanczos method for computing eigenvalues, Math. Software, 3 (1977), pp. 361–377. [34] G. H. Golub and Q. Ye, An inverse free preconditioned krylov subspace method for symmetric generalized eigenvalue problems, SIAM J. Sci. Comput., 24 (2002), pp. 312–334. [35] N. Halko, P. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), pp. 217–288. [36] M. Hanke, Conjugate Gradient Type Methods for Ill-Posed Problems, Pitman Res. Notes in Math. Ser. 327, Longman, Harlow, UK, 1995. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php A2486 SPANTINI ET AL. [37] M. Hanke and P. C. Hansen, Regularization methods for large-scale problems, Surv. Math. Ind., 3 (1993), pp. 253–315. [38] P. C. Hansen, Regularization, GSVD and truncated GSVD, BIT, 29 (1989), pp. 491–504. [39] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion, Monogr. Math. Model. Comput. 4, SIAM, Philadelphia, 1998. [40] J. Heikkinen, Statistical Inversion Theory in X-Ray Tomography, Master’s thesis, Lappeenranta University of Technology, Finland, 2008. [41] M. R. Hestenes, Pseudoinversus and conjugate gradients, Comm. ACM, 18 (1975), pp. 40–43. [42] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, Vol. 49, National Bureau of Standards Washington, DC, 1952. [43] L. Homa, D. Calvetti, A. Hoover, and E. Somersalo, Bayesian preconditioned CGLS for source separation in MEG time series, SIAM J. Sci. Comput., 35 (2013), pp. B778–B798. [44] Y. Hua and W. Liu, Generalized Karhunen-Loève transform, IEEE Signal Process. Lett., 5 (1998), pp. 141–142. [45] W. James and C. Stein, Estimation with quadratic loss, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 1961, pp. 361–379. [46] W. Kahan and B. N. Parlett, How Far Should You Go with the Lanczos Process, No UCB/ERL-M78/48, Electronics Research Lab, University of California, Berkeley, 1978. [47] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, Springer, Berlin, 2005. [48] R. E. Kalman, A new approach to linear filtering and prediction problems, J. Fluids Engrg., 82 (1960), pp. 35–45. [49] S. Kaniel, Estimates for some computational techniques in linear algebra, Math. Comp., (1966), pp. 369–378. [50] C. 
Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, U.S. Government Press Office, 1950. [51] E. L. Lehmann and G. Casella, Theory of Point Estimation, Springer-Verlag, Berlin, 1998. [52] R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users’ Guide: Solution of LargeScale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, Software Environ. Tools 6, SIAM, Philadelphia, 1998. [53] A. S. Lewis, Derivatives of spectral functions, Math. Oper. Res., 21 (1996), pp. 576–588. [54] W. Li and O. A. Cirpka, Efficient geostatistical inverse methods for structured and unstructured grids, Water Resources Res., 42 (2006), W06402. [55] C. Lieberman, K. Fidkowski, K. Willcox, and B. van Bloemen Waanders, Hessian-based model reduction: Large-scale inversion and prediction, Internat. J. Numer. Methods Fluids, 71 (2013), pp. 135–150. [56] C. Lieberman, K. Willcox, and O. Ghattas, Parameter and state model reduction for largescale statistical inverse problems, SIAM J. Sci. Comput., 32 (2010), pp. 2523–2542. [57] F. Lindgren, H. Rue, and J. Lindström, An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach, J. R. Stat. Soc. Ser. B, 73 (2011), pp. 423–498. [58] D. V. Lindley and A. F. M. Smith, Bayes estimates for the linear model, J. R. Stat. Soc. Ser. B, 34 (1972), pp. 1–41. [59] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program., 45 (1989), pp. 503–528. [60] J. Martin, L. Wilcox, C. Burstedde, and O. Ghattas, A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion, SIAM J. Sci. Comput., 34 (2012), pp. A1460–A1487. [61] Y. Marzouk and H. N. Najm, Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems, J. Comput. Phys., 228 (2009), pp. 1862–1902. [62] X. Meng, M. A. Saunders, and M. W. Mahoney, LSRN: A parallel iterative solver for strongly over-or underdetermined systems, SIAM J. Sci. Comput., 36 (2014), pp. C95– C118. [63] E. H. Moore, On the reciprocal of the general algebraic matrix, Bull. Amer. Math. Soc., 26, pp. 394–395. [64] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comput., 35 (1980), pp. 773–782. [65] C. C. Paige, The Computation of Eigenvalues and Eigenvectors of Very Large Sparse Matrices, Ph.D. thesis, University of London, 1971. [66] C. C. Paige, Computational variants of the Lanczos method for the eigenproblem, IMA J. Appl. Math., 10 (1972), pp. 373–381. [67] C. C. Paige, Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix, IMA J. Appl. Math., 18 (1976), pp. 341–349. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. Downloaded 03/15/16 to 18.51.1.3. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php OPTIMAL LOW-RANK APPROXIMATIONS A2487 [68] C. C. Paige and M. A. Saunders, Towards a generalized singular value decomposition, SIAM J. Numer. Anal., 18 (1981), pp. 398–405. [69] C. C. Paige and M. A. Saunders, Algorithm 583: LSQR: Sparse linear equations and least squares problems, ACM Trans. Math. Software, 8 (1982), pp. 195–209. [70] L. Pardo, Statistical Inference Based on Divergence Measures, CRC Press, Boca Raton, FL, 2005. [71] A. Parker and C. Fox, Sampling Gaussian distributions in Krylov spaces with conjugate gradients, SIAM J. Sci. Comput., 34 (2012), pp. 
B312–B334. [72] R. Penrose, A generalized inverse for matrices, Math. Proc. Cambridge Philos. Soc., 51 (1955), pp. 406–413. [73] Y. Saad, On the rates of convergence of the Lanczos and the block-Lanczos methods, SIAM J. Numer. Anal., 17 (1980), pp. 687–706. [74] A. H. Sameh and J. A. Wisniewski, A trace minimization algorithm for the generalized eigenvalue problem, SIAM J. Numer. Anal., 19 (1982), pp. 1243–1259. [75] C. Schillings and C. Schwab, Scaling Limits in Computational Bayesian Inversion, Research report 2014-26, ETH-Zürich, 2014. [76] A. Solonen, H. Haario, J. Hakkarainen, H. Auvinen, I. Amour, and T. Kauranne, Variational ensemble Kalman filtering using limited memory BFGS, Electron. Trans. Numer. Anal., 39 (2012), pp. 271–285. [77] D. Sondermann, Best approximate solutions to matrix equations under rank restrictions, Statist. Papers, 27 (1986), pp. 57–66. [78] G. W. Stewart, The efficient generation of random orthogonal matrices with an application to condition estimators, SIAM J. Numer. Anal., 17 (1980), pp. 403–409. [79] A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numer., 19 (2010), pp. 451–559. [80] A. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Math. Dokl., 5 (1963), pp. 1035–1038. [81] C. F. Van Loan, Generalizing the singular value decomposition, SIAM J. Numer. Anal., 13 (1976), pp. 76–83. [82] A. T. A. Wood and G. Chan, Simulation of stationary Gaussian processes in [0, 1]d, J. Comput. Graph. Statist., 3 (1994), pp. 409–432. [83] S. J. Wright and J. Nocedal, Numerical Optimization, Vol. 2, Springer, New York, 1999. [84] Y. Yue and P. L. Speckman, Nonstationary spatial Gaussian Markov random fields, J. Comput. Graphical Statist., 19 (2010), pp. 96–116. Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.