SIAM J. SCI. COMPUT.
Vol. 37, No. 6, pp. A2451–A2487
© 2015 Society for Industrial and Applied Mathematics
OPTIMAL LOW-RANK APPROXIMATIONS OF BAYESIAN
LINEAR INVERSE PROBLEMS∗
ALESSIO SPANTINI† , ANTTI SOLONEN‡ , TIANGANG CUI† , JAMES MARTIN§ ,
LUIS TENORIO¶, AND YOUSSEF MARZOUK†
Abstract. In the Bayesian approach to inverse problems, data are often informative, relative
to the prior, only on a low-dimensional subspace of the parameter space. Significant computational
savings can be achieved by using this subspace to characterize and approximate the posterior distribution of the parameters. We first investigate approximation of the posterior covariance matrix
as a low-rank update of the prior covariance matrix. We prove optimality of a particular update,
based on the leading eigendirections of the matrix pencil defined by the Hessian of the negative log-likelihood and the prior precision, for a broad class of loss functions. This class includes the Förstner
metric for symmetric positive definite matrices, as well as the Kullback–Leibler divergence and the
Hellinger distance between the associated distributions. We also propose two fast approximations of
the posterior mean and prove their optimality with respect to a weighted Bayes risk under squared-error loss. These approximations are deployed in an offline-online manner, where a more costly but
data-independent offline calculation is followed by fast online evaluations. As a result, these approximations are particularly useful when repeated posterior mean evaluations are required for multiple
data sets. We demonstrate our theoretical results with several numerical examples, including high-dimensional X-ray tomography and an inverse heat conduction problem. In both of these examples,
the intrinsic low-dimensional structure of the inference problem can be exploited while producing
results that are essentially indistinguishable from solutions computed in the full space.
Key words. inverse problems, Bayesian inference, low-rank approximation, covariance approximation, Förstner–Moonen metric, posterior mean approximation, Bayes risk, optimality
AMS subject classifications. 15A29, 62F15, 68W25
DOI. 10.1137/140977308
1. Introduction. In the Bayesian approach to inverse problems, the parameters
of interest are treated as random variables, endowed with a prior probability distribution that encodes information available before any data are observed. Observations
are modeled by their joint probability distribution conditioned on the parameters
of interest, which defines the likelihood function and incorporates the forward model
and a stochastic description of measurement or model errors. The prior and likelihood
then combine to yield a probability distribution for the parameters conditioned on the
observations, i.e., the posterior distribution. While this formulation is quite general,
essential features of inverse problems bring additional structure to the Bayesian update. The prior distribution often encodes some kind of smoothness or correlation
∗ Submitted to the journal’s Methods and Algorithms for Scientific Computing section July 14,
2014; accepted for publication (in revised form) August 14, 2015; published electronically November
3, 2015. This work was supported by the U.S. Department of Energy, Office of Advanced Scientific
Computing (ASCR), under grants DE-SC0003908 and DE-SC0009297.
http://www.siam.org/journals/sisc/37-6/97730.html
† Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge,
MA 02139 (spantini@mit.edu, tcui@mit.edu, ymarz@mit.edu).
‡ Department of Mathematics and Physics, Lappeenranta University of Technology, 53851
Lappeenranta, Finland, and Department of Aeronautics and Astronautics, Massachusetts Institute
of Technology, Cambridge, MA 02139 (solonen@mit.edu).
§ Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin,
TX 78712 (jmartin@ices.utexas.edu).
¶ Applied Mathematics and Statistics, Colorado School of Mines, Golden, CO 80401
(ltenorio@mines.edu).
among the inversion parameters; observations typically are finite, few in number, and
corrupted by noise; and the observations are indirect, related to the inversion parameters by the action of a forward operator that destroys some information. A key
consequence of these features is that the data may be informative, relative to the
prior, only on a low-dimensional subspace of the entire parameter space. Identifying
and exploiting this subspace—to design approximations of the posterior distribution
and related Bayes estimators—can lead to substantial computational savings.
In this paper we investigate approximation methods for finite-dimensional
Bayesian linear inverse problems with Gaussian measurement and prior distributions. We characterize approximations of the posterior distribution that are structure-exploiting and that are optimal in a sense to be defined below. Since the posterior
distribution is Gaussian, it is completely determined by its mean and covariance.
We therefore focus on approximations of these posterior characteristics. Optimal approximations will reduce computation and storage requirements for high-dimensional
inverse problems and will also enable fast computation of the posterior mean in a
many-query setting.
We consider approximations of the posterior covariance matrix in the form of low-rank negative updates of the prior covariance matrix. This class of approximations
exploits the structure of the prior-to-posterior update and also arises naturally in
Kalman filtering techniques (e.g., [2, 3, 76]); the challenge is to find an optimal update
within this class and to define in what sense it is optimal. We will argue that a suitable
loss function with which to define optimality is the Förstner metric [29] for symmetric
positive definite (SPD) matrices, and we will show that this metric generalizes to a
broader class of loss functions that emphasize relative differences in covariance. We
will derive the optimal low-rank update for this entire class of loss functions. In
particular, we will show that the prior covariance matrix should be updated along
the leading generalized eigenvectors of the pencil $(H, \Gamma_{\rm pr}^{-1})$ defined by the Hessian
of the negative log-likelihood and the prior precision matrix. If we assume exact
knowledge of the posterior mean, then our results extend to optimality statements
between distributions (e.g., optimality in Kullback–Leibler (K-L) divergence and in
Hellinger distance). The form of this low-rank update of the prior is not new [6, 9, 28,
60], but previous work has not shown whether—and if so, in exactly what sense—it
yields optimal approximations of the posterior. A key contribution of this paper is to
establish and explain such optimality.
Properties of the generalized eigenpairs of $(H, \Gamma_{\rm pr}^{-1})$ and related matrix pencils
have been studied previously in the literature, especially in the context of classical
regularization techniques for linear inverse problems1 [81, 68, 38, 37, 25]. The joint
action of the log-likelihood Hessian and the prior precision matrix has also been used
in related regularization methods [13, 10, 12, 43]. However, these efforts have not been
concerned with the posterior covariance matrix or with its optimal approximation,
since this matrix is a property of the Bayesian approach to inversion.
One often justifies the assumption that the posterior mean is exactly known by
arguing that it can easily be computed as the solution of a regularized least-squares
problem [42, 69, 1, 62, 5]; indeed, evaluation of the posterior mean to machine precision
is now feasible even for million-dimensional parameter spaces [6]. If, however, one
needs multiple evaluations of the posterior mean for different realizations of the data
1 In the framework of Tikhonov regularization [80], the regularized estimate coincides with the
posterior mean of the Bayesian linear model we consider here, provided that the prior covariance
matrix is chosen appropriately.
(e.g., in an online inference context), then solving a linear system to determine the
posterior mean may not be the most efficient strategy. A second goal of our paper is to
address this problem. We will propose two computationally efficient approximations
of the posterior mean based on (i) evaluating a low-rank affine function of the data or
(ii) using a low-rank update of the prior covariance matrix in the exact formula for the
posterior mean. The optimal approximation in each case is defined as the minimizer
of the Bayes risk for a squared-error loss weighted by the posterior precision matrix.
We provide explicit formulas for these optimal approximations and show that they can
be computed by exploiting the optimal posterior covariance approximation described
above. Thus, given a new set of data, computing an optimal approximation of the
posterior mean becomes a computationally trivial task.
Low-rank approximations of the posterior mean that minimize the Bayes risk for
squared-error loss have been proposed in [17, 20, 19, 18, 16] for a general non-Gaussian
case. Here, instead we develop analytical results for squared-error loss weighted by the
posterior precision matrix. This choice of norm reflects the idea that approximation
errors in directions of low posterior variance should be penalized more strongly than
errors in high-variance directions, as we do not want the approximate posterior mean
to fall outside the bulk of the posterior probability distribution. Remarkably, in this
case, the optimal approximation only requires the leading eigenvectors and eigenvalues
of a single eigenvalue problem. This is the same eigenvalue problem we solve to
obtain an optimal approximation of the posterior covariance matrix, and thus we can
efficiently obtain both approximations at the same time.
While the efficient solution of large-scale linear-Gaussian Bayesian inverse problems is of standalone interest [28], optimal approximations of Gaussian posteriors are
also a building block for the solution of nonlinear Bayesian inverse problems. For example, the stochastic Newton Markov chain Monte Carlo (MCMC) method [60] uses
Gaussian proposals derived from local linearizations of a nonlinear forward model;
the parameters of each Gaussian proposal are computed using the optimal approximations analyzed in this paper. To tackle even larger nonlinear inverse problems,
[6] uses a Laplace approximation of the posterior distribution wherein the Hessian
at the mode of the log-posterior density is itself approximated using the present approach. Similarly, approximations of local Gaussians can facilitate the construction
of a nonstationary Gaussian process whose mean directly approximates the posterior
density [8]. Alternatively, [22] combines data-informed directions derived from local
linearizations of the forward model—a direct extension of the posterior covariance
approximations described in the present work—to create a global data-informed subspace. A computationally efficient approximation of the posterior distribution is then
obtained by restricting MCMC to this subspace and treating complementary directions analytically. Moving from the finite- to the infinite-dimensional setting, the
same global data-informed subspace is used to drive efficient dimension-independent
posterior sampling for inverse problems in [21].
Earlier work on dimension reduction for Bayesian inverse problems used the
Karhunen–Loève expansion of the prior distribution [61, 54] to describe the parameters
of interest. To reduce dimension, this expansion is truncated; this step renders both
the prior and posterior distributions singular—i.e., collapsed onto the prior mean—in
the neglected directions. Avoiding large truncation errors then requires that the prior
distribution impose significant smoothness on the parameters, so that the spectrum
of the prior covariance kernel decays quickly. In practice, this requirement restricts
the choice of priors. Moreover, this approach relies entirely on properties of the prior
distribution and does not incorporate the influence of the forward operator or the
observational errors. Alternatively, [56] constructs a reduced basis for the parameter
space via greedy model-constrained sampling, but this approach can also fail to capture posterior variability in directions uninformed by the data. Both of these earlier
approaches seek reduction in the overall description of the parameters. This notion
differs fundamentally from the dimension reduction technique advocated in this paper,
where low-dimensional structure is sought in the change from prior to posterior.
The rest of this paper is organized as follows. In section 2 we introduce the
posterior covariance approximation problem and derive the optimal prior-to-posterior
update with respect to a broad class of loss functions. The structure of the optimal
posterior covariance matrix approximation is examined in section 3. Several interpretations are given in this section, including an equivalent reformulation of the covariance approximation problem as an optimal projection of the likelihood function onto
a lower-dimensional subspace. In section 4 we characterize optimal approximations
of the posterior mean. In section 5 we provide several numerical examples. Section 6
offers concluding remarks. Appendix A collects proofs of many of the theorems stated
throughout the paper, along with additional technical results.
2. Optimal approximation of the posterior covariance matrix. Consider
the Bayesian linear model defined by a Gaussian likelihood and a Gaussian prior with
a nonsingular covariance matrix $\Gamma_{\rm pr} \succ 0$ and, without loss of generality, zero mean:
(2.1)  $y \mid x \sim \mathcal{N}(Gx, \Gamma_{\rm obs}), \qquad x \sim \mathcal{N}(0, \Gamma_{\rm pr}).$
Here x represents the parameters to be inferred, G is the linear forward operator, and
y are the observations, with $\Gamma_{\rm obs} \succ 0$. The statistical model (2.1) also follows from
$$y = Gx + \varepsilon,$$
where $\varepsilon \sim \mathcal{N}(0, \Gamma_{\rm obs})$ is independent of x. It is easy to see that the posterior
distribution is again Gaussian (see, e.g., [14]): $x \mid y \sim \mathcal{N}(\mu_{\rm pos}(y), \Gamma_{\rm pos})$, with mean
and covariance matrix given by
(2.2)  $\mu_{\rm pos}(y) = \Gamma_{\rm pos} G^\top \Gamma_{\rm obs}^{-1} y \quad\text{and}\quad \Gamma_{\rm pos} = \left( H + \Gamma_{\rm pr}^{-1} \right)^{-1},$
where
(2.3)  $H = G^\top \Gamma_{\rm obs}^{-1} G$
is the Hessian of the negative log-likelihood (i.e., the Fisher information matrix).
Since the posterior is Gaussian, the posterior mean coincides with the posterior mode:
$\mu_{\rm pos}(y) = \arg\max_x \pi_{\rm pos}(x; y)$, where $\pi_{\rm pos}$ is the posterior density. Note that the
posterior covariance matrix does not depend on the data.
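To make (2.2)–(2.3) concrete, here is a minimal NumPy sketch (our own toy construction, not part of the paper) that assembles the posterior mean and covariance for a small Gaussian linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                                   # parameter and data dimensions (toy sizes)
G = rng.standard_normal((m, n))               # linear forward operator
Gamma_pr = np.diag(rng.uniform(0.5, 2.0, n))  # prior covariance (SPD)
Gamma_obs = 0.1 * np.eye(m)                   # observational noise covariance (SPD)

# Hessian of the negative log-likelihood, H = G^T Gamma_obs^{-1} G  (eq. 2.3)
H = G.T @ np.linalg.solve(Gamma_obs, G)

# Posterior covariance and mean (eq. 2.2)
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))
x_true = rng.standard_normal(n)
y = G @ x_true + rng.multivariate_normal(np.zeros(m), Gamma_obs)
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_obs, y)
```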
2.1. Defining the approximation class. We will seek an approximation, $\widehat\Gamma_{\rm pos}$,
of the posterior covariance matrix that is optimal in a class of matrices to be defined
shortly. As we can see from (2.2), the posterior precision matrix $\Gamma_{\rm pos}^{-1}$ is a nonnegative
update of the prior precision matrix $\Gamma_{\rm pr}^{-1}$:
$$\Gamma_{\rm pos}^{-1} = \Gamma_{\rm pr}^{-1} + ZZ^\top, \quad \text{where } ZZ^\top = H.$$
Similarly, using Woodbury's identity we can write $\Gamma_{\rm pos}$ as a nonpositive update of
$\Gamma_{\rm pr}$: $\Gamma_{\rm pos} = \Gamma_{\rm pr} - KK^\top$, where $KK^\top = \Gamma_{\rm pr} G^\top \Gamma_y^{-1} G \Gamma_{\rm pr}$ and $\Gamma_y = \Gamma_{\rm obs} + G \Gamma_{\rm pr} G^\top$
is the covariance matrix of the marginal distribution of y [47]. This update of $\Gamma_{\rm pr}$
is negative semidefinite because the data add information: the posterior variance in
any direction is always smaller than the corresponding prior variance. Moreover, the
update is usually low rank for exactly the reasons described in the introduction: there
are directions in the parameter space along which the data are not very informative,
relative to the prior. For instance, H might have a quickly decaying spectrum [7, 75].
Note, however, that Γpos itself might not be low-rank. Low-rank structure, if any,
lies in the update of Γpr that yields Γpos . Hence, a natural class of matrices for
approximating Γpos is the set of negative semidefinite updates of Γpr , with a fixed
maximum rank, that lead to positive definite matrices:
(2.4)  $\mathcal{M}_r = \left\{ \Gamma_{\rm pr} - KK^\top \succ 0 : \operatorname{rank}(K) \le r \right\}.$
This class of approximations of the posterior covariance matrix takes advantage of
the structure of the prior-to-posterior update.
2.2. Loss functions. Optimality statements regarding the approximation of a
covariance matrix require an appropriate notion of distance between SPD matrices.
We shall use the metric introduced by Förstner and Moonen [29], which is derived
from a canonical invariant metric on the cone of real SPD matrices and is defined as
follows: the Förstner distance, dF (A, B), between a pair of SPD matrices, A and B,
is given by
$$d_F^2(A, B) = \operatorname{tr}\!\left[ \ln^2\!\left( A^{-1/2} B A^{-1/2} \right) \right] = \sum_i \ln^2(\sigma_i),$$
where (σi ) is the sequence of generalized eigenvalues of the pencil (A, B). The Förstner
metric satisfies the following important invariance properties:
(2.5)  $d_F(A, B) = d_F(A^{-1}, B^{-1}) \quad\text{and}\quad d_F(A, B) = d_F(M A M^\top, M B M^\top)$
for any nonsingular matrix M. Moreover, $d_F$ treats under- and overapproximations
similarly in the sense that $d_F(\Gamma_{\rm pos}, \alpha \widehat\Gamma_{\rm pos}) \to \infty$ both as $\alpha \to 0$ and as $\alpha \to \infty$.$^2$ Note that
the metric induced by the Frobenius norm does not satisfy any of the aforementioned
invariance properties. In addition, it penalizes under- and overestimation differently.
We will show that our posterior covariance matrix approximation is optimal not
only in terms of the Förstner metric but also in terms of the following more general
class, L, of loss functions for SPD matrices.
Definition 2.1 (loss functions). The class L is defined as the collection of
functions of the form
(2.6)  $L(A, B) = \sum_{i=1}^n f(\sigma_i),$
where A and B are SPD matrices, $(\sigma_i)$ are the generalized eigenvalues of the pencil
(A, B), and
(2.7)  $f \in \mathcal{U} = \left\{ g \in C^1(\mathbb{R}^+) : g'(x)(1 - x) < 0 \text{ for } x \ne 1, \text{ and } \lim_{x \to \infty} g(x) = \infty \right\}.$
Elements of U are differentiable real-valued functions defined on the positive axis
that decrease on x < 1, increase on x > 1, and tend to infinity as x → ∞. The squared
2 This behavior is shared by Stein’s loss function, which has been proposed to assess estimates
of a covariance matrix [45]. Stein's loss function is just the K-L divergence between two Gaussian
distributions with the same mean (see (A.10)), but it is not a metric for SPD matrices.
Förstner metric belongs to the class of loss functions defined by (2.6), whereas the
distance induced by the Frobenius norm does not.
Lemma 2.2, whose proof can be found in Appendix A, justifies the importance of
the class L. In particular, it shows that optimality of the covariance matrix approximation with respect to any loss function in L leads to an optimal approximation of
the posterior distribution using a Gaussian (with the same mean) in terms of other
familiar criteria used to compare probability measures, such as the Hellinger distance
and the K-L divergence [70]. More precisely, we have the following result.
Lemma 2.2 (equivalence of approximations). If $L \in \mathcal{L}$, then a matrix $\widehat\Gamma_{\rm pos} \in \mathcal{M}_r$
minimizes the Hellinger distance and the K-L divergence between $\mathcal{N}(\mu_{\rm pos}(y), \Gamma_{\rm pos})$
and the approximation $\mathcal{N}(\mu_{\rm pos}(y), \widehat\Gamma_{\rm pos})$ iff it minimizes $L(\Gamma_{\rm pos}, \widehat\Gamma_{\rm pos})$.
Remark 1. We note that neither the Hellinger distance nor the K-L divergence
between the distributions $\mathcal{N}(\mu_{\rm pos}(y), \Gamma_{\rm pos})$ and $\mathcal{N}(\mu_{\rm pos}(y), \widehat\Gamma_{\rm pos})$ depends on the data
y. Optimality in distribution does not necessarily hold when the posterior means are
different.
2.3. Optimality results. We are now in a position to state one of the main
results of the paper. For a proof see Appendix A.
Theorem 2.3 (optimal posterior covariance approximation). Let $(\delta_i^2, \widehat w_i)$ be the
generalized eigenvalue-eigenvector pairs of the pencil
(2.8)  $(H, \Gamma_{\rm pr}^{-1}),$
with the ordering $\delta_i^2 \ge \delta_{i+1}^2$, and $H = G^\top \Gamma_{\rm obs}^{-1} G$ as in (2.3). Let L be a loss function
in the class $\mathcal{L}$ defined in (2.6). Then the following hold:
(i) A minimizer, $\widehat\Gamma_{\rm pos}$, of the loss, L, between $\Gamma_{\rm pos}$ and an element of $\mathcal{M}_r$ is
given by
(2.9)  $\widehat\Gamma_{\rm pos} = \Gamma_{\rm pr} - KK^\top, \qquad KK^\top = \sum_{i=1}^r \delta_i^2 \left( 1 + \delta_i^2 \right)^{-1} \widehat w_i \widehat w_i^\top.$
The corresponding minimum loss is given by
(2.10)  $L(\widehat\Gamma_{\rm pos}, \Gamma_{\rm pos}) = f(1)\, r + \sum_{i>r} f\!\left( 1/(1 + \delta_i^2) \right).$
(ii) The minimizer (2.9) is unique if the first r eigenvalues of $(H, \Gamma_{\rm pr}^{-1})$ are different.
Theorem 2.3 provides a way to compute the best approximation of $\Gamma_{\rm pos}$ by matrices in $\mathcal{M}_r$: it is just a matter of computing the eigenpairs corresponding to the
decreasing sequence of eigenvalues of the pencil $(H, \Gamma_{\rm pr}^{-1})$ until a stopping criterion is
satisfied. This criterion can be based on the minimum loss (2.10). Notice that the
minimum loss is a function of the generalized eigenvalues $(\delta_i^2)_{i>r}$ that have not been
computed. This is quite common in numerical linear algebra (e.g., error in the truncated SVD [26, 32]). However, since the eigenvalues $(\delta_i^2)$ are computed in decreasing
order, the minimum loss can be easily bounded.
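A dense-matrix sketch of this computation (the function name and the explicit inversion of $\Gamma_{\rm pr}$ are ours for illustration; large problems would use the matrix-free tools of section 2.4):

```python
import numpy as np
from scipy.linalg import eigh

def optimal_covariance_update(H, Gamma_pr, r):
    """Rank-r update of Theorem 2.3: Gamma_pos_hat = Gamma_pr - K K^T,
    built from the leading eigenpairs of the pencil (H, Gamma_pr^{-1})."""
    # eigh(A, B) solves A w = lambda B w; here A = H, B = Gamma_pr^{-1}.
    delta2, W = eigh(H, np.linalg.inv(Gamma_pr))
    order = np.argsort(delta2)[::-1]            # sort eigenvalues in decreasing order
    delta2, W = delta2[order[:r]], W[:, order[:r]]
    # eigh normalizes W so that W^T Gamma_pr^{-1} W = I, as required.
    KKt = (W * (delta2 / (1.0 + delta2))) @ W.T
    return Gamma_pr - KKt, delta2, W
```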
The generalized eigenvectors, $\widehat w_i$, are orthogonal with respect to the inner product
induced by the prior precision matrix, and they maximize the Rayleigh ratio,
$$\mathcal{R}(z) = \frac{z^\top H z}{z^\top \Gamma_{\rm pr}^{-1} z},$$
over subspaces of the form $\mathcal{W}_i = \operatorname{span}^\perp(\widehat w_j)_{j<i}$. Intuitively, the vectors $\widehat w_i$ associated
with generalized eigenvalues greater than one correspond to directions in the parameter space (or subspaces thereof) where the curvature of the log-posterior density is
constrained more by the log-likelihood than by the prior.
2.4. Computing eigenpairs of $(H, \Gamma_{\rm pr}^{-1})$. If a square root factorization of the
prior covariance matrix, $\Gamma_{\rm pr} = S_{\rm pr} S_{\rm pr}^\top$, is available, then the Hermitian generalized
eigenvalue problem can be reduced to a standard one: find the eigenpairs, $(\delta_i^2, w_i)$,
of $S_{\rm pr}^\top H S_{\rm pr}$, and transform the resulting eigenvectors according to $w_i \mapsto S_{\rm pr} w_i$ [4,
section 5.2]. An analogous transformation is also possible when a square root factorization of $\Gamma_{\rm pr}^{-1}$ is available. Notice that only the actions of $S_{\rm pr}$ and $S_{\rm pr}^\top$ on a vector are
required. For instance, evaluating the action of $S_{\rm pr}$ might involve the solution of an
elliptic PDE [57]. There are numerous examples of priors for which a decomposition
$\Gamma_{\rm pr} = S_{\rm pr} S_{\rm pr}^\top$ is readily available, e.g., [82, 24, 57, 84, 79]. Either direct methods
or, more often, matrix-free algorithms (e.g., Lanczos iteration or its block version
[50, 66, 23, 33]) can be used to solve the standard Hermitian eigenvalue problem [4,
section 4]. Reference implementations of these algorithms are available in ARPACK
[52]. We note that the Lanczos iteration comes with a rich literature on error analysis
(e.g., [49, 65, 67, 46, 73, 32]). Alternatively, one can use randomized methods [35],
which offer the advantage of parallelism (asynchronous computations) and robustness
over standard Lanczos methods [6]. If a square root factorization of $\Gamma_{\rm pr}$ is not available, but it is possible to solve linear systems with $\Gamma_{\rm pr}^{-1}$, we can use a Lanczos method
for generalized Hermitian eigenvalue problems [4, section 5.5], where a Krylov basis
orthogonal with respect to the inner product induced by $\Gamma_{\rm pr}^{-1}$ is maintained. Again,
ARPACK provides an efficient implementation of these solvers. When accurately
solving linear systems with $\Gamma_{\rm pr}^{-1}$ is a difficult task, we refer the reader to alternative
algorithms proposed in [74] and [34].
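A minimal matrix-free sketch of the whitening approach, assuming user-supplied actions of $H$, $S_{\rm pr}$, and $S_{\rm pr}^\top$ (the callables apply_H, apply_Spr, apply_SprT are placeholders, not part of the paper):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def whitened_eigenpairs(apply_H, apply_Spr, apply_SprT, n, r):
    """Leading generalized eigenpairs of (H, Gamma_pr^{-1}) via the standard
    Hermitian problem for S_pr^T H S_pr (section 2.4). apply_H, apply_Spr and
    apply_SprT are matrix-vector products with H, S_pr and S_pr^T."""
    Ht = LinearOperator((n, n), matvec=lambda z: apply_SprT(apply_H(apply_Spr(z))))
    delta2, W = eigsh(Ht, k=r, which='LM')      # Lanczos for the r largest eigenvalues
    order = np.argsort(delta2)[::-1]
    delta2, W = delta2[order], W[:, order]
    W_hat = np.column_stack([apply_Spr(W[:, i]) for i in range(r)])  # w_i -> S_pr w_i
    return delta2, W_hat
```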
Remark 2. If a factorization $\Gamma_{\rm pr} = S_{\rm pr} S_{\rm pr}^\top$ is available, then it is straightforward
to obtain an expression for a nonsymmetric square root of the optimal approximation
of $\Gamma_{\rm pos}$ (2.9) as in [9],
(2.11)  $\widehat S_{\rm pos} = S_{\rm pr} \left( \sum_{i=1}^r \left[ \left( 1 + \delta_i^2 \right)^{-1/2} - 1 \right] w_i w_i^\top + I \right),$
such that $\widehat\Gamma_{\rm pos} = \widehat S_{\rm pos} \widehat S_{\rm pos}^\top$ and $w_i = S_{\rm pr}^{-1} \widehat w_i$. This expression can be used to efficiently
sample from the approximate posterior distribution $\mathcal{N}(\mu_{\rm pos}(y), \widehat\Gamma_{\rm pos})$ (e.g., [28, 60]).
Alternative techniques for sampling from high-dimensional Gaussian distributions can
be found, for instance, in [71, 30].
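For illustration, a sketch of this sampling strategy under the assumptions above (a dense $S_{\rm pr}$ and precomputed whitened eigenpairs $(\delta_i^2, w_i)$; names are ours):

```python
import numpy as np

def sample_approx_posterior(mu_pos, S_pr, delta2, W, n_samples, rng):
    """Draw samples from N(mu_pos, Gamma_pos_hat) using the nonsymmetric
    square root S_pos_hat of (2.11). W holds the whitened eigenvectors w_i
    (orthonormal columns); delta2 holds the eigenvalues delta_i^2."""
    n = S_pr.shape[0]
    eps = rng.standard_normal((n, n_samples))            # standard normal draws
    coeff = (1.0 + delta2) ** -0.5 - 1.0                  # (1+delta_i^2)^{-1/2} - 1
    z = eps + W @ (coeff[:, None] * (W.T @ eps))          # apply sum_i c_i w_i w_i^T + I
    return mu_pos[:, None] + S_pr @ z                     # x = mu_pos + S_pos_hat eps
```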
3. Properties of the optimal covariance approximation. Now we discuss
several implications of the optimal approximation of Γpos introduced in the previous
section. We start by describing the relationship between this approximation and
the directions of greatest relative reduction of prior variance. Then we interpret the
covariance approximation as the result of projecting the likelihood function onto a
“data-informed” subspace. Finally, we contrast the present approach with several
other approximation strategies: using the Frobenius norm as a loss function for the
covariance matrix approximation, or developing low-rank approximations based on
prior or Hessian information alone. We conclude by drawing the connections with the
BFGS Kalman filter update.
3.1. Interpretation of the eigendirections. Thanks to the particular
structure of loss functions in L, the problem of approximating Γpos is equivalent
to that of approximating $\Gamma_{\rm pos}^{-1}$. Yet the form of the optimal approximation of $\Gamma_{\rm pos}^{-1}$ is
important, as it explicitly describes the directions that control the ratio of posterior to
prior variance. The following corollary to Theorem 2.3 characterizes these directions.
The proof is in Appendix A.
Corollary 3.1 (optimal posterior precision approximation). Let $(\delta_i^2, \widehat w_i)$ and
$L \in \mathcal{L}$ be defined as in Theorem 2.3. Then the following hold:
(i) A minimizer of $L(B, \Gamma_{\rm pos}^{-1})$ for
(3.1)  $B \in \mathcal{M}_r^{-1} := \left\{ \Gamma_{\rm pr}^{-1} + JJ^\top : \operatorname{rank}(J) \le r \right\}$
is given by
(3.2)  $\widehat\Gamma_{\rm pos}^{-1} = \Gamma_{\rm pr}^{-1} + UU^\top, \qquad UU^\top = \sum_{i=1}^r \delta_i^2\, \widetilde w_i \widetilde w_i^\top, \qquad \widetilde w_i = \Gamma_{\rm pr}^{-1} \widehat w_i.$
The minimizer (3.2) is unique if the first r eigenvalues of $(H, \Gamma_{\rm pr}^{-1})$ are different.
(ii) The optimal posterior precision matrix (3.2) is precisely the inverse of the
optimal posterior covariance matrix (2.9).
(iii) The vectors $\widehat w_i$ are generalized eigenvectors of the pencil $(\Gamma_{\rm pos}, \Gamma_{\rm pr})$:
(3.3)  $\Gamma_{\rm pos} \widehat w_i = \frac{1}{1 + \delta_i^2}\, \Gamma_{\rm pr} \widehat w_i.$
Note that the definition of the class $\mathcal{M}_r^{-1}$ is analogous to that of $\mathcal{M}_r$. Indeed,
Lemma A.2 in Appendix A defines a bijection between these two classes.
The vectors $\widetilde w_i = \Gamma_{\rm pr}^{-1} \widehat w_i$ are orthogonal with respect to the inner product defined
by $\Gamma_{\rm pr}$. By (3.3), we also know that $\widehat w_i$ minimizes the generalized Rayleigh quotient,
(3.4)  $\mathcal{R}(z) = \frac{z^\top \Gamma_{\rm pos} z}{z^\top \Gamma_{\rm pr} z} = \frac{\operatorname{Var}(z^\top x \mid y)}{\operatorname{Var}(z^\top x)},$
over subspaces of the form $\mathcal{W}_i = \operatorname{span}^\perp(\widehat w_j)_{j<i}$. This Rayleigh quotient is precisely
the ratio of posterior to prior variance along a particular direction, z, in the parameter space. The smallest values that $\mathcal{R}$ can take over the subspaces $\mathcal{W}_i$ are exactly
the smallest generalized eigenvalues of $(\Gamma_{\rm pos}, \Gamma_{\rm pr})$. In particular, the data are most
informative along the first r eigenvectors $\widehat w_i$ and, since
(3.5)  $\mathcal{R}(\widehat w_i) = \frac{\operatorname{Var}(\widehat w_i^\top x \mid y)}{\operatorname{Var}(\widehat w_i^\top x)} = \frac{1}{1 + \delta_i^2},$
the posterior variance is smaller than the prior variance by a factor of $(1 + \delta_i^2)^{-1}$. In
the span of the other eigenvectors, $(\widehat w_i)_{i>r}$, the data are not as informative. Hence,
$(\widehat w_i)$ are the directions along which the ratio of posterior to prior variance is minimized. Furthermore, a simple computation shows that these directions also maximize
the relative difference between prior and posterior variance normalized by the prior
variance. Indeed, if the directions $(\widehat w_i)$ minimize (3.4), then they must also maximize
$1 - \mathcal{R}(z)$, leading to
(3.6)  $1 - \mathcal{R}(\widehat w_i) = \frac{\operatorname{Var}(\widehat w_i^\top x) - \operatorname{Var}(\widehat w_i^\top x \mid y)}{\operatorname{Var}(\widehat w_i^\top x)} = \frac{\delta_i^2}{1 + \delta_i^2}.$
3.2. Optimal projector. Since the data are most informative on a subspace of
the parameter space, it should be possible to reduce the effective dimension of the
inference problem in a manner that is consistent with the posterior approximation.
This is essentially the content of the following corollary, which follows by a simple
computation.
Corollary 3.2 (optimal projector). Let $\widehat\Gamma_{\rm pos}$ and the vectors $(\widehat w_i, \widetilde w_i)$ be defined
as in Theorem 2.3 and Corollary 3.1. Consider the reduced forward operator $\widehat G_r = G \circ \mathcal{P}_r$,
where $\mathcal{P}_r$ is the oblique projector (i.e., $\mathcal{P}_r^2 = \mathcal{P}_r$ but $\mathcal{P}_r^\top \ne \mathcal{P}_r$):
(3.7)  $\mathcal{P}_r = \sum_{i=1}^r \widehat w_i \widetilde w_i^\top.$
Then $\widehat\Gamma_{\rm pos}$ is precisely the posterior covariance matrix corresponding to the Bayesian
linear model:
(3.8)  $y \mid x \sim \mathcal{N}(\widehat G_r x, \Gamma_{\rm obs}), \qquad x \sim \mathcal{N}(0, \Gamma_{\rm pr}).$
The projected Gaussian linear model (3.8) reveals the intrinsic dimensionality of
the inference problem. The introduction of the optimal projector (3.7) is also useful in
the context of dimensionality reduction for nonlinear inverse problems. In this case a
particularly simple and effective approximation of the posterior density, $\pi_{\rm pos}(x \mid y)$, is
of the form $\widehat\pi_{\rm pos}(x \mid y) \propto \pi(y; \mathcal{P}_r x)\, \pi_{\rm pr}(x)$, where $\pi_{\rm pr}$ is the prior density and $\pi(y; \mathcal{P}_r x)$
is the density corresponding to the likelihood function with parameters constrained
by the projector. The range of the projector can be determined by combining locally optimal data-informed subspaces from high-density regions in the support of the
posterior distribution. This approximation is the subject of a related paper [22].
Returning to the linear inverse problem, notice also that the posterior mean of
the projected model (3.8) might be used as an efficient approximation of the exact
posterior mean. We will show in section 4 that this posterior mean approximation
in fact minimizes the Bayes risk for a weighted squared-error loss among all low-rank
linear functions of the data.
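A short sketch of the projector in (3.7) and the reduced forward operator (dense, illustrative only; function name is ours):

```python
import numpy as np

def optimal_projector(W_hat, Gamma_pr):
    """Oblique projector P_r = sum_i w_hat_i w_tilde_i^T of (3.7), with
    w_tilde_i = Gamma_pr^{-1} w_hat_i (dense sketch for illustration)."""
    W_tilde = np.linalg.solve(Gamma_pr, W_hat)    # columns are w_tilde_i
    return W_hat @ W_tilde.T

# Usage sketch (W_hat holds the r leading generalized eigenvectors of (H, Gamma_pr^{-1})):
# P_r = optimal_projector(W_hat, Gamma_pr)
# assert np.allclose(P_r @ P_r, P_r)              # P_r is a projector
# G_r = G @ P_r                                   # reduced forward operator of Corollary 3.2
```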
3.3. Comparison with optimality in Frobenius norm. Thus far our optimality results for the approximation of Γpos have been restricted to the class of loss
functions L given in Definition 2.1. However, it is also interesting to investigate optimality in the metric defined by the Frobenius norm. Given any two matrices A and B
of the same size, the Frobenius distance between them is defined as $\|A - B\|$, where
$\|\cdot\|$ is the Frobenius norm. Note that the Frobenius distance does not exploit the
structure of the positive definite cone of symmetric matrices. The matrix $\widehat\Gamma_{\rm pos} \in \mathcal{M}_r$
that minimizes the Frobenius distance from the exact posterior covariance matrix is
given by
(3.9)  $\widehat\Gamma_{\rm pos} = \Gamma_{\rm pr} - KK^\top, \qquad KK^\top = \sum_{i=1}^r \lambda_i u_i u_i^\top,$
where $(u_i)$ are the directions corresponding to the r largest eigenvalues $(\lambda_i)$ of $\Gamma_{\rm pr} - \Gamma_{\rm pos}$.
This result can be very different from the optimal approximation given in Theorem
2.3. In particular, the directions $(u_i)$ are solutions of the eigenvalue problem
(3.10)  $\Gamma_{\rm pr} G^\top \Gamma_y^{-1} G \Gamma_{\rm pr}\, u = \lambda u,$
which maximize
(3.11)  $u^\top (\Gamma_{\rm pr} - \Gamma_{\rm pos}) u = \operatorname{Var}(u^\top x) - \operatorname{Var}(u^\top x \mid y).$
That is, while optimality in the Förstner metric identifies directions that maximize the
relative difference between prior and posterior variance, the Frobenius distance favors
directions that maximize only the absolute value of this difference. There are many
reasons to prefer the former. For instance, data might be informative along directions
of low prior variance (perhaps due to inadequacies in prior modeling); a covariance matrix approximation that is optimal in Frobenius distance may ignore updates in these
directions entirely. Also, if parameters of interest (i.e., components of x) have differing units of measurement, relative variance reduction provides a unit-independent
way of judging the quality of a posterior approximation; this notion follows naturally
from the second invariance property of dF in (2.5). From a computational perspective, solving the eigenvalue problem (3.10) is quite expensive compared to finding the
generalized eigenpairs of the pencil $(H, \Gamma_{\rm pr}^{-1})$. Finally, optimality in the Frobenius
distance for an approximation of Γpos does not yield an optimality statement for the
corresponding approximation of the posterior distribution, as shown in Lemma 2.2
for loss functions in L.
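For completeness, a sketch of the Frobenius-optimal update (3.9)–(3.10), using the identity $\Gamma_{\rm pr} - \Gamma_{\rm pos} = \Gamma_{\rm pr} G^\top \Gamma_y^{-1} G \Gamma_{\rm pr}$ from section 2.1 (function name is ours):

```python
import numpy as np

def frobenius_optimal_update(G, Gamma_pr, Gamma_obs, r):
    """Rank-r update of (3.9): eigendecomposition of Gamma_pr - Gamma_pos,
    equivalently of Gamma_pr G^T Gamma_y^{-1} G Gamma_pr as in (3.10)."""
    Gamma_y = Gamma_obs + G @ Gamma_pr @ G.T
    M = Gamma_pr @ G.T @ np.linalg.solve(Gamma_y, G @ Gamma_pr)   # = Gamma_pr - Gamma_pos
    lam, U = np.linalg.eigh(M)
    lam, U = lam[::-1][:r], U[:, ::-1][:, :r]          # r largest eigenpairs
    return Gamma_pr - (U * lam) @ U.T
```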
3.4. Suboptimal posterior covariance approximations.
3.4.1. Hessian-based and prior-based reduction schemes. The posterior
approximation described by Theorem 2.3 uses both Hessian and prior information.
It is instructive to consider approximations of the linear Bayesian inverse problem
that rely only on one or the other. As we will illustrate numerically in section 5.1,
these approximations can be viewed as natural limiting cases of our approach. They
are also closely related to previous efforts in dimensionality reduction that propose
only Hessian-based [55] or prior-based [61] reductions. In contrast with these previous
efforts, here we will consider versions of Hessian- and prior-based reductions that do
not discard prior information in the remaining directions. In other words, we will
discuss posterior covariance approximations that remain in the form of (2.4)—i.e.,
updating the prior covariance only in r directions.
A Hessian-based reduction scheme updates Γpr in directions where the data have
greatest influence in an absolute sense (i.e., not relative to the prior). This involves
approximating the negative log-likelihood Hessian (2.3) with a low-rank decomposition
as follows: let $(s_i^2, v_i)$ be the eigenvalue-eigenvector pairs of H with the ordering
$s_i^2 \ge s_{i+1}^2$. Then a best low-rank approximation of H in the Frobenius norm is
given by
$$H \approx \sum_{i=1}^r s_i^2\, v_i v_i^\top = V_r S_r V_r^\top,$$
where $v_i$ is the ith column of $V_r$ and $S_r = \operatorname{diag}\{s_1^2, \ldots, s_r^2\}$. Using Woodbury's identity
we then obtain an approximation of $\Gamma_{\rm pos}$ as a low-rank negative semidefinite update
of $\Gamma_{\rm pr}$:
(3.12)  $\Gamma_{\rm pos} \approx \left( V_r S_r V_r^\top + \Gamma_{\rm pr}^{-1} \right)^{-1} = \Gamma_{\rm pr} - \Gamma_{\rm pr} V_r \left( S_r^{-1} + V_r^\top \Gamma_{\rm pr} V_r \right)^{-1} V_r^\top \Gamma_{\rm pr}.$
This approximation of the posterior covariance matrix belongs to the class Mr . Thus,
Hessian-based reductions are in general suboptimal when compared to the optimal
approximations defined in Theorem 2.3. Note that an equivalent way to obtain (3.12)
is to use a reduced forward operator of the form $\widehat G = G \circ V_r V_r^\top$, which is the composition of the original forward operator with a projector onto the leading eigenspace
of H. In general, the projector $P_r = V_r V_r^\top$ is different from the optimal projector
defined in Corollary 3.2 and is thus suboptimal.
To achieve prior-based reductions, on the other hand, we restrict the Bayesian
inference problem to directions in the parameter space that explain most of the prior
variance. More precisely, we look for a rank-r orthogonal projector, Pr , that minimizes
the mean squared-error defined as
(3.13)  $\mathcal{E}(P_r) = \mathbb{E}\, \| x - P_r x \|^2,$
where the expectation is taken over the prior distribution (assumed to have zero mean)
and $\|\cdot\|$ is the standard Euclidean norm [44]. Let $(t_i^2, u_i)$ be the eigenvalue-eigenvector
pairs of $\Gamma_{\rm pr}$ ordered as $t_i^2 \ge t_{i+1}^2$. Then a minimizer of (3.13) is given by a projector,
$P_r$, onto the leading eigenspace of $\Gamma_{\rm pr}$ defined as $P_r = \sum_{i=1}^r u_i u_i^\top = U_r U_r^\top$, where
$u_i$ is the ith column of $U_r$. The actual approximation of the linear inverse problem
consists of using the projected forward operator, $\widehat G = G \circ U_r U_r^\top$. By direct comparison
with the optimal projector defined in Corollary 3.2, we see that prior-based reductions
are suboptimal in general. Also in this case, the posterior covariance matrix with the
projected Gaussian model can be written as a negative semidefinite update of $\Gamma_{\rm pr}$:
$$\Gamma_{\rm pos} \approx \Gamma_{\rm pr} - U_r T_r \left[ \left( U_r^\top H U_r \right)^{-1} + T_r \right]^{-1} T_r U_r^\top,$$
where $T_r = \operatorname{diag}\{t_1^2, \ldots, t_r^2\}$. The double matrix inversion makes this low-rank update
computationally challenging to implement. It is also not optimal, as shown in Theorem
2.3.
To summarize, the Hessian and prior-based dimensionality reduction techniques
are both suboptimal. These methods do not take into account the interactions between
the dominant directions of H and those of Γpr , nor the relative importance of these
quantities. Accounting for such interaction is a key feature of the optimal covariance
approximation described in Theorem 2.3. Section 5.1 will illustrate conditions under
which these interactions become essential.
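The following small experiment (our own construction, not one of the paper's examples) compares the three updates in the Förstner metric on a random dense problem:

```python
import numpy as np
from scipy.linalg import eigh

def forstner(A, B):
    """Squared Forstner distance: sum of squared logs of gen. eigenvalues of (A, B)."""
    return np.sum(np.log(eigh(A, B, eigvals_only=True)) ** 2)

rng = np.random.default_rng(2)
n, m, r = 30, 15, 5
G = rng.standard_normal((m, n))
L = rng.standard_normal((n, n)); Gamma_pr = L @ L.T + 0.1 * np.eye(n)
Gamma_obs = 0.5 * np.eye(m)
H = G.T @ np.linalg.solve(Gamma_obs, G)
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))

# Optimal (generalized) update of Theorem 2.3
d2, W = eigh(H, np.linalg.inv(Gamma_pr)); d2, W = d2[::-1][:r], W[:, ::-1][:, :r]
opt = Gamma_pr - (W * (d2 / (1 + d2))) @ W.T

# Hessian-based update (3.12)
s2, V = np.linalg.eigh(H); s2, V = s2[::-1][:r], V[:, ::-1][:, :r]
hess = Gamma_pr - Gamma_pr @ V @ np.linalg.solve(
    np.diag(1 / s2) + V.T @ Gamma_pr @ V, V.T @ Gamma_pr)

# Prior-based update (projection onto the leading eigenspace of Gamma_pr)
t2, U = np.linalg.eigh(Gamma_pr); t2, U = t2[::-1][:r], U[:, ::-1][:, :r]
prior = Gamma_pr - (U * t2) @ np.linalg.solve(
    np.linalg.inv(U.T @ H @ U) + np.diag(t2), (U * t2).T)

for name, approx in [("generalized", opt), ("Hessian-based", hess), ("prior-based", prior)]:
    print(name, forstner(approx, Gamma_pos))
```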
3.4.2. Connections with the BFGS Kalman filter. The linear Bayesian inverse problem analyzed in this paper can be interpreted as the analysis step of a linear
Bayesian filtering problem [27]. If the prior distribution corresponds to the forecast
distribution at some time t, the posterior coincides with the so-called analysis distribution. In the linear case, with Gaussian process noise and observational errors, both
of these distributions are Gaussian. The Kalman filter is a Bayesian solution to this
filtering problem [48]. In [2] the authors propose a computationally feasible way to
implement (and approximate) this solution in large-scale systems. The key observation is that when solving an SPD linear system of the form Ax = b by means of BFGS
or limited memory BFGS (L-BFGS [59]), one typically obtains an approximation of
$A^{-1}$ for free. This approximation can be written as a low-rank correction of an arbitrary positive definite initial approximation matrix $A_0^{-1}$. The matrix $A_0^{-1}$ can be, for
instance, the scaled identity. Notice that the approximation of $A^{-1}$ given by L-BFGS
is full rank and positive definite. This approximation is in principle convergent as
the storage limit of L-BFGS increases [64]. An L-BFGS approximation of A is also
possible [83].
There are many ways to exploit this property of the L-BFGS method. For example, in [2] the posterior covariance is written as a low-rank update of the prior
covariance matrix, $\Gamma_{\rm pos} = \Gamma_{\rm pr} - \Gamma_{\rm pr} G^\top \Gamma_y^{-1} G \Gamma_{\rm pr}$, where $\Gamma_y = \Gamma_{\rm obs} + G \Gamma_{\rm pr} G^\top$, and
$\Gamma_y^{-1}$ itself is approximated using the L-BFGS method. Since this approximation of
$\Gamma_y^{-1}$ is full rank, however, this approach does not exploit potential low-dimensional
structure of the inverse problem. Alternatively, one can obtain an L-BFGS approximation of $\Gamma_{\rm pos}$ when solving the linear system $\Gamma_{\rm pos}^{-1} x = G^\top \Gamma_{\rm obs}^{-1} y$ for the posterior
mean $\mu_{\rm pos}(y)$ [3]. If one uses the prior covariance matrix as an initial approximation
matrix, $A_0^{-1}$, then the resulting L-BFGS approximation of $\Gamma_{\rm pos}$ can be written as a
low-rank update of Γpr . This approximation format is similar to the one discussed in
[28] and advocated in this paper. However, the approach of [3] (or its ensemble version [76]) does not correspond to any known optimal approximation of the posterior
covariance matrix, nor does it lead to any optimality statement between the corresponding probability distributions. This is an important contrast with the present
approach, which we will revisit numerically in section 5.1.
4. Optimal approximation of the posterior mean. In this section, we develop and characterize fast approximations of the posterior mean that can be used,
for instance, to accelerate repeated inversion with multiple data sets. Note that we
are not proposing alternatives to the efficient computation of the posterior mean for
a single realization of the data. This task can already be accomplished with current
state-of-the-art iterative solvers for regularized least-squares problems [42, 69, 1, 62, 5].
Instead, we are interested in constructing statistically optimal approximations 3 of the
posterior mean as linear functions of the data. That is, we seek a matrix A, from
an approximation class to be defined shortly, such that the posterior mean can be
approximated as μpos (y) ≈ A y. We will investigate different approximation classes
for A; in particular, we will only consider approximation classes for which applying
A to a vector y is relatively inexpensive. Computing such a matrix A is more expensive than solving a single linear system associated with the posterior precision
matrix to determine the posterior mean. However, once A is computed, it can be applied inexpensively to any realization of the data.4 Our approach is therefore justified
when the posterior mean must be evaluated for multiple instances of the data. This
approach can thus be viewed as an offline-online strategy, where a more costly but
data-independent offline calculation is followed by fast online evaluations. Moreover,
we will show that these approximations can be obtained from an optimal approximation of the posterior covariance matrix (cf. Theorem 2.3) with minimal additional
cost. Hence, if one is interested in both the posterior mean and covariance matrix (as
is often the case in the Bayesian approach to inverse problems), then the approximation formulas we propose can be more efficient than standard approaches even for a
single realization of the data.
4.1. Optimality results. For the Bayesian linear model defined in (2.1), the
posterior mode is equal to the posterior mean, μpos (y) = E(x|y), which is in turn the
minimizer of the Bayes risk for squared-error loss [51, 58]. We first review this fact
and establish some basic notation. Let S be an SPD matrix and let
$$L(\delta(y), x) = (x - \delta(y))^\top S\, (x - \delta(y)) = \| x - \delta(y) \|_S^2$$
be the loss incurred by the estimator $\delta(y)$ of x. The Bayes risk, $\mathcal{R}(\delta(y), x)$, of $\delta(y)$ is
defined as the average loss over the joint distribution of x and y [14, 51]: $\mathcal{R}(\delta(y), x) = \mathbb{E}(L(\delta(y), x))$. Since
(4.1)  $\mathcal{R}(\delta(y), x) = \mathbb{E}\, \| \delta(y) - \mu_{\rm pos}(y) \|_S^2 + \mathbb{E}\, \| \mu_{\rm pos}(y) - x \|_S^2,$
it follows that $\delta(y) = \mu_{\rm pos}(y)$ minimizes the Bayes risk over all estimators of x.
3 We will precisely define this notion of optimality in section 4.1.
4 In particular, applying A is much cheaper than solving a linear system.
To study approximations of $\mu_{\rm pos}(y)$, we use the squared-error loss function defined
by the Mahalanobis distance [15] induced by $\Gamma_{\rm pos}^{-1}$: $L(\delta(y), x) = \| \delta(y) - x \|_{\Gamma_{\rm pos}^{-1}}^2$. This
loss function accounts for the geometry induced by the posterior measure on the
parameter space, penalizing errors in the approximation of μpos (y) more strongly in
directions of lower posterior variance.
Under the assumption of zero prior mean, μpos (y) is a linear function of the data.
Hence we seek approximations of μpos (y) of the form Ay, where A is a matrix in a
class to be defined. Our goal is to obtain fast posterior mean approximations that
can be applied repeatedly to multiple realizations of y. We consider two classes of
approximation matrices:
(4.2)  $\mathcal{A}_r := \{ A : \operatorname{rank}(A) \le r \} \quad\text{and}\quad \widehat{\mathcal{A}}_r := \left\{ A = (\Gamma_{\rm pr} - B)\, G^\top \Gamma_{\rm obs}^{-1} : \operatorname{rank}(B) \le r \right\}.$
The class $\mathcal{A}_r$ consists of low-rank matrices; it is standard in the statistics literature
[44]. The class $\widehat{\mathcal{A}}_r$, on the other hand, can be understood via comparison with (2.2);
it simply replaces $\Gamma_{\rm pos}$ with a low-rank negative semidefinite update of $\Gamma_{\rm pr}$. We shall
henceforth use $\mathcal{A}$ to denote either of the two classes above.
Let $\mathcal{R}_{\mathcal{A}}(Ay, x)$ be the Bayes risk of $Ay$ subject to $A \in \mathcal{A}$. We may now restate
our goal as follows: find a matrix, $A^* \in \mathcal{A}$, that minimizes the Bayes risk $\mathcal{R}_{\mathcal{A}}(Ay, x)$.
That is, find $A^* \in \mathcal{A}$ such that
(4.3)  $\mathcal{R}_{\mathcal{A}}(A^* y, x) = \min_{A \in \mathcal{A}} \mathbb{E}\left( \| Ay - x \|_{\Gamma_{\rm pos}^{-1}}^2 \right).$
The following two theorems show that for either class of approximation matrices, $\mathcal{A}_r$
or $\widehat{\mathcal{A}}_r$, this problem admits a particularly simple analytical solution that exploits the
structure of the optimal approximation of $\Gamma_{\rm pos}$. The proofs of the theorems rely on a
result developed independently by Sondermann [77] and Friedland and Torokhti [31]
and are given in Appendix A. We also use the fact that $\mathbb{E}(\| \mu_{\rm pos}(y) - x \|_{\Gamma_{\rm pos}^{-1}}^2) = n$,
where n is the dimension of the parameter space.
Theorem 4.1. Let $(\delta_i^2, \widehat w_i)$ be defined as in Theorem 2.3 and let $(\widehat v_i)$ be generalized
eigenvectors of the pencil $(G \Gamma_{\rm pr} G^\top, \Gamma_{\rm obs})$ associated with a nonincreasing sequence of
eigenvalues, with the normalization $\widehat v_i^\top \Gamma_{\rm obs}\, \widehat v_i = 1$. Then the following hold:
(i) A solution of (4.3) for $A \in \mathcal{A}_r$ is given by
(4.4)  $A^* = \sum_{i=1}^r \frac{\delta_i}{1 + \delta_i^2}\, \widehat w_i \widehat v_i^\top.$
(ii) The corresponding minimum Bayes risk over $\mathcal{A}_r$ is given by
(4.5)  $\mathcal{R}_{\mathcal{A}_r}(A^* y, x) = \mathbb{E}\, \| A^* y - \mu_{\rm pos}(y) \|_{\Gamma_{\rm pos}^{-1}}^2 + \mathbb{E}\, \| \mu_{\rm pos}(y) - x \|_{\Gamma_{\rm pos}^{-1}}^2 = \sum_{i>r} \delta_i^2 + n.$
Notice that the rank-r posterior mean approximation given by Theorem 4.1 coincides with the posterior mean of the projected linear Gaussian model defined in (3.8).
Thus, applying this approximation to a new realization of the data requires only a
low-rank matrix-vector product, a computationally trivial task. We define the quality
of a posterior mean approximation as the minimum Bayes risk (4.5). Notice, however,
that for a given rank r of the approximation, (4.5) depends on the eigenvalues that
have not yet been computed. Since (δi2 ) are determined in order of decreasing magnitude, (4.5) can be easily bounded (cf. discussion after Theorem 2.3). The forthcoming
minimum Bayes risk (4.8) can be bounded analogously.
Remark 3. Equation (4.4) can be interpreted as the truncated GSVD solution of a
Tikhonov regularized linear inverse problem [38] (with unit regularization parameter).
Hence, Theorem 4.1 also describes a Bayesian property of the (frequentist) truncated
GSVD estimator.
Remark 4. If factorizations of the form $\Gamma_{\rm pr} = S_{\rm pr} S_{\rm pr}^\top$ and $\Gamma_{\rm obs} = S_{\rm obs} S_{\rm obs}^\top$
are readily available, then we can characterize the triplets $(\delta_i, \widehat w_i, \widehat v_i)$ from an SVD,
$S_{\rm obs}^{-1} G S_{\rm pr} = \sum_{i \ge 1} \delta_i\, v_i w_i^\top$, of the matrix $S_{\rm obs}^{-1} G S_{\rm pr}$ with the transformations $\widehat w_i = S_{\rm pr} w_i$, $\widehat v_i = S_{\rm obs}^{-\top} v_i$ and the ordering $\delta_i \ge \delta_{i+1}$. In particular, the approximate posterior mean can be written as
(4.6)  $\mu_{\rm pos}^{(r)}(y) = S_{\rm pr} \left( S_{\rm obs}^{-1} G S_{\rm pr} \right)_r^{\rm Tikh} S_{\rm obs}^{-1}\, y,$
where $(S_{\rm obs}^{-1} G S_{\rm pr})_r^{\rm Tikh}$ is the best rank-r approximation to a Tikhonov regularized
inverse.$^5$ That is, for any matrix A, $(A)_r$ is the best rank-r approximation of A (e.g.,
computed via SVD), whereas $(A)^{\rm Tikh} := (A^\top A + I)^{-1} A^\top$.
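A dense sketch of (4.6), assuming explicit square-root factors (function name is illustrative, not from the paper):

```python
import numpy as np

def low_rank_posterior_mean(G, S_pr, S_obs, y, r):
    """Optimal rank-r posterior mean approximation of (4.6), via an SVD of the
    whitened forward operator S_obs^{-1} G S_pr (dense sketch for illustration)."""
    Gt = np.linalg.solve(S_obs, G) @ S_pr                 # S_obs^{-1} G S_pr
    V, d, Wt = np.linalg.svd(Gt, full_matrices=False)     # Gt = V diag(d) Wt
    filt = d[:r] / (1.0 + d[:r] ** 2)                     # Tikhonov filter factors
    y_w = np.linalg.solve(S_obs, y)                       # whitened data
    q = Wt[:r].T @ (filt * (V[:, :r].T @ y_w))            # rank-r Tikhonov-filtered solution
    return S_pr @ q
```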
Theorem 4.2. Let $\widehat\Gamma_{\rm pos} \in \mathcal{M}_r$ be the optimal approximation of $\Gamma_{\rm pos}$ defined in
Theorem 2.3. Then the following hold:
(i) A solution of (4.3) for $A \in \widehat{\mathcal{A}}_r$ is given by
(4.7)  $\widehat A^* = \widehat\Gamma_{\rm pos}\, G^\top \Gamma_{\rm obs}^{-1}.$
(ii) The corresponding minimum Bayes risk over $\widehat{\mathcal{A}}_r$ is given by
(4.8)  $\mathcal{R}_{\widehat{\mathcal{A}}_r}(\widehat A^* y, x) = \mathbb{E}\, \| \widehat A^* y - \mu_{\rm pos}(y) \|_{\Gamma_{\rm pos}^{-1}}^2 + \mathbb{E}\, \| \mu_{\rm pos}(y) - x \|_{\Gamma_{\rm pos}^{-1}}^2 = \sum_{i>r} \delta_i^6 + n.$
Once the optimal approximation of Γpos described in Theorem 4.2 is computed,
the cost of approximating μpos (y) for a new realization of y is dominated by the adjoint
and prior solves needed to apply G and Γpr , respectively. Combining the optimal
approximations of μpos (y) and Γpos given by Theorems 4.2 and 2.3, respectively, yields
a complete approximation of the Gaussian posterior distribution. This is precisely
the approximation adopted by the stochastic Newton MCMC method [60] to describe
the Gaussian proposal distribution obtained from a local linearization of the forward
operator of a nonlinear Bayesian inverse problem. Our results support the algorithmic
choice of [60] with precise optimality statements.
It is worth noting that the two optimal Bayes risks, (4.5) and (4.8), depend on the
parameter, r, that defines the dimension of the corresponding approximation classes
Ar and Ar . In the former case, r is the rank of the optimal matrix that defines the
approximation. In the latter case, r is the rank of a negative update of Γpr that yields
the posterior covariance matrix approximation. We shall thus refer to the estimator
given by Theorem 4.1 as the low-rank approximation and to the estimator given by
Theorem 4.2 as the low-rank update approximation. In both cases, we shall refer
to r as the order of the approximation. A posterior mean approximation of order
5 With unit regularization parameter and identity regularization operator [39].
r will be called underresolved if more than r generalized eigenvalues of the pencil
$(H, \Gamma_{\rm pr}^{-1})$ are greater than one. If this is the case, then using the low-rank update
approximation is not appropriate because the associated Bayes risk includes high-order powers of eigenvalues of $(H, \Gamma_{\rm pr}^{-1})$ that are greater than one. Thus, underresolved
approximations tend to be more accurate when using the low-rank approximation.
As we will show in section 5, this estimator is also less expensive to compute than
its counterpart in Theorem 4.2. If, on the other hand, fewer than r eigenvalues of
$(H, \Gamma_{\rm pr}^{-1})$ are greater than one, then the optimal low-rank update estimator will have
better performance than the optimal low-rank estimator in the following sense:
$$0 < \mathcal{R}_{\mathcal{A}_r}(A^* y, x) - \mathcal{R}_{\widehat{\mathcal{A}}_r}(\widehat A^* y, x) = \sum_{i>r} \delta_i^2 \left( 1 + \delta_i^2 \right) \left( 1 - \delta_i^2 \right).$$
4.2. Connection with “priorconditioners”. In this subsection we draw connections between the low-rank approximation of the posterior mean given in Theorem
4.1 and the regularized solution of a discrete ill-posed inverse problem, y = Gx + ε
(using the notation of this paper), as presented in [13, 10]. In [13, 10], the authors propose an early stopping regularization using iterative solvers preconditioned by prior
statistical information on the parameter of interest, say, x ∼ N (0, Γpr ), and on the
noise, say, $\varepsilon \sim \mathcal{N}(0, \Gamma_{\rm obs})$.$^6$ That is, if factorizations $\Gamma_{\rm pr} = S_{\rm pr} S_{\rm pr}^\top$ and $\Gamma_{\rm obs} = S_{\rm obs} S_{\rm obs}^\top$
are available, then [13] provides a solution, $x = S_{\rm pr} q$, to the inverse problem, where
q comes from an early stopping regularization applied to the preconditioned linear
system:
(4.9)  $S_{\rm obs}^{-1} G S_{\rm pr}\, q = S_{\rm obs}^{-1} y.$
The iterative method of choice in this case is the conjugate gradient least squares
(CGLS) algorithm [13, 36] (or GMRES for nonsymmetric square systems [11]) equipped with a proper stopping criterion (e.g., the discrepancy principle [47]). Although
the approach of [13] is not exactly Bayesian, we can still use the optimality results
of Theorem 4.1 to justify the observed good performance of this particular form of
regularization. By a property of the CGLS algorithm, the rth iterate, $x^r = S_{\rm pr} q^r$,
satisfies
(4.10)  $q^r = \operatorname*{argmin}_{q \in \mathcal{K}_r(\widehat H, \widehat y)} \left\| S_{\rm obs}^{-1} y - S_{\rm obs}^{-1} G S_{\rm pr}\, q \right\|,$
where $\mathcal{K}_r(\widehat H, \widehat y)$ is the r-dimensional Krylov subspace associated with the matrix
$\widehat H = S_{\rm pr}^\top H S_{\rm pr}$ and starting vector $\widehat y = S_{\rm pr}^\top G^\top \Gamma_{\rm obs}^{-1} y$. It was shown in [41] that
the CGLS solution, at convergence, can be written as $x^* = S_{\rm pr} ( S_{\rm obs}^{-1} G S_{\rm pr} )^\dagger S_{\rm obs}^{-1} y$,
where $(\cdot)^\dagger$ denotes the Moore–Penrose pseudoinverse [63, 72]. To highlight the differences between the CGLS solution and (4.6), we assume that $\mathcal{K}_r(\widehat H, \widehat y) \approx \operatorname{ran}(W_r)$
for all r, where $W_r = [w_1 | \cdots | w_r]$, $\operatorname{ran}(A)$ denotes the range of a matrix A, and
$\widehat H = \sum_i \delta_i^2\, w_i w_i^\top$ is an SVD of $\widehat H$. Notice that the condition $\mathcal{K}_r(\widehat H, \widehat y) \approx \operatorname{ran}(W_r)$ is
usually quite reasonable for moderate values of r. This practical observation is at the
heart of the Lanczos iteration for symmetric eigenvalue problems [50]. With simple
algebraic manipulations we conclude that
(4.11)  $x^r \approx S_{\rm pr} \left( S_{\rm obs}^{-1} G S_{\rm pr} \right)_r^\dagger S_{\rm obs}^{-1} y.$
6 It suffices to consider a Gaussian approximation to the distribution of x and ε.
Recall from (4.6) that the optimal rank-r approximation of the posterior mean defined
in Theorem 4.1 can be written as
(4.12)  $\mu_{\rm pos}^{(r)}(y) = S_{\rm pr} \left( S_{\rm obs}^{-1} G S_{\rm pr} \right)_r^{\rm Tikh} S_{\rm obs}^{-1} y.$
The only difference between (4.11) and (4.12) is the use of a Tikhonov-regularized
inverse in (4.12) as opposed to a Moore–Penrose pseudoinverse. If $S_{\rm obs}^{-1} G S_{\rm pr} = \sum_{i \ge 1} \delta_i\, v_i w_i^\top$ is a reduced SVD of the matrix $S_{\rm obs}^{-1} G S_{\rm pr}$, then
(4.13)  $\left( S_{\rm obs}^{-1} G S_{\rm pr} \right)_r^\dagger = \sum_{i \le r} \frac{1}{\delta_i}\, w_i v_i^\top, \qquad \left( S_{\rm obs}^{-1} G S_{\rm pr} \right)_r^{\rm Tikh} = \sum_{i \le r} \frac{\delta_i}{1 + \delta_i^2}\, w_i v_i^\top.$
These two matrices are nearly identical for values of r corresponding to $\delta_r^2$ greater than
one$^7$ (assuming the ordering $\delta_i^2 \ge \delta_{i+1}^2$). Beyond this regime, it might be convenient
to stop the CGLS solver to obtain (4.11) (i.e., early stopping regularization). The
similarity of these expressions is quite remarkable since (4.12) was derived as the
minimizer of the optimization problem (4.3) with A = Ar . This informal argument
may explain why priorconditioners perform so well in applications [12, 43]. Yet we
remark that the goals of Theorem 4.1 and [13] are still quite different; [13] is concerned
with preconditioning techniques for early stopping regularization of ill-posed inverse
problems, whereas Theorem 4.1 is concerned with statistically optimal approximations
of the posterior mean in the Bayesian framework.
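The contrast between the two spectral filters in (4.13) can be tabulated directly (a tiny illustrative script, not from the paper):

```python
import numpy as np

# Compare the spectral filters of (4.13): the pseudoinverse applies 1/delta_i,
# while the Tikhonov-regularized inverse applies delta_i/(1+delta_i^2).
delta = np.logspace(1, -1, 9)            # singular values sweeping from 10 to 0.1
pinv_filter = 1.0 / delta
tikh_filter = delta / (1.0 + delta ** 2)
for d, a, b in zip(delta, pinv_filter, tikh_filter):
    print(f"delta={d:6.3f}  1/delta={a:7.3f}  delta/(1+delta^2)={b:7.3f}")
# For delta^2 >> 1 the two filters nearly coincide; they diverge once delta^2 <~ 1,
# which is where early stopping (or the Tikhonov filter) starts to matter.
```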
Algorithm 1. Optimal low-rank approximation of the posterior mean.
INPUT: forward and adjoint models $G$, $G^\top$; prior and noise precisions $\Gamma_{\rm pr}^{-1}$, $\Gamma_{\rm obs}^{-1}$;
approximation order $r \in \mathbb{N}$
OUTPUT: approximate posterior mean $\mu_{\rm pos}^{(r)}(y)$
1: Find the r leading generalized eigenvalue-eigenvector pairs $(\delta_i^2, \widehat w_i)$ of the pencil $(G^\top \Gamma_{\rm obs}^{-1} G, \Gamma_{\rm pr}^{-1})$
2: Find the r leading generalized eigenvectors $(\widehat v_i)$ of the pencil $(G \Gamma_{\rm pr} G^\top, \Gamma_{\rm obs})$
3: For each new realization of the data y, compute $\mu_{\rm pos}^{(r)}(y) = \sum_{i=1}^r \delta_i (1 + \delta_i^2)^{-1}\, \widehat w_i \widehat v_i^\top y$.

Algorithm 2. Optimal low-rank update approximation of the posterior mean.
INPUT: forward and adjoint models $G$, $G^\top$; prior and noise precisions $\Gamma_{\rm pr}^{-1}$, $\Gamma_{\rm obs}^{-1}$;
approximation order $r \in \mathbb{N}$
OUTPUT: approximate posterior mean $\widehat\mu_{\rm pos}^{(r)}(y)$
1: Obtain $\widehat\Gamma_{\rm pos}$ as described in Theorem 2.3.
2: For each new realization of the data y, compute $\widehat\mu_{\rm pos}^{(r)}(y) = \widehat\Gamma_{\rm pos} G^\top \Gamma_{\rm obs}^{-1} y$.
5. Numerical examples. Now we provide several numerical examples to illustrate the theory developed in the preceding sections. We start with a synthetic
example to demonstrate various posterior covariance matrix approximations and continue with two more realistic linear inverse problems, where we also study posterior
mean approximations.
7 In section 5 we show that by the time we start including generalized eigenvalues $\delta_i^2 \approx 1$ in (4.4),
the approximation of the posterior mean is usually already satisfactory. Intuitively, this means that
all the directions in parameter space where the data are more informative than the prior have been
considered.
5.1. Example 1: Hessian and prior with controlled spectra. We begin
by investigating the approximation of Γpos as a negative semidefinite update of Γpr .
We compare the optimal approximation obtained in Theorem 2.3 with the Hessian-,
prior-, and BFGS-based reduction schemes discussed in section 3.4. The idea is to
reveal differences between these approximations by exploring regimes where the data
have differing impacts on the prior information. Since the directions defining the
optimal update are the generalized eigenvectors of the pencil (H, Γ_pr^{-1}), we shall also refer to this update as the generalized approximation.
To compare these approximation schemes, we start with a simple example with
diagonal Hessian and prior covariance matrices: G = I, Γ_obs = diag{σ_i²}, and Γ_pr = diag{λ_i²}. Since the forward operator G is the identity, this problem can (loosely) be thought of as denoising a signal x. In this case, H = Γ_obs^{-1} and Γ_pos = diag{λ_i² σ_i²/(σ_i² + λ_i²)}. The ratios of posterior to prior variance in the canonical directions (e_i) are

    Var(e_i^T x | y) / Var(e_i^T x) = 1 / (1 + λ_i²/σ_i²).
We note that if the observation variances σi2 are constant, σi = σ, then the directions
of greatest variance reduction are those corresponding to the largest prior variance.
Hence the prior distribution alone determines the most informed directions, and the
prior-based reduction is as effective as the generalized one. On the other hand, if the
prior variances λ2i are constant, λi = λ, but the σi vary, then the directions of highest
variance reduction are those corresponding to the smallest noise variance. This time
the noise distribution alone determines the most important directions, and Hessian-based reduction is as effective as the generalized one. In the case of more general
spectra, the important directions depend on the ratios λ2i /σi2 and thus one has to
use the information provided by both the noise and prior distributions. This is done
naturally by the generalized reduction.
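The variance-reduction ratio above is easy to verify numerically; the following snippet uses arbitrary illustrative values of λ_i and σ_i.

import numpy as np

lam = np.array([3.0, 2.0, 1.0, 0.5])             # prior standard deviations lambda_i
sig = np.array([0.5, 1.0, 0.1, 2.0])             # noise standard deviations sigma_i

prior_var = lam**2
post_var = lam**2 * sig**2 / (sig**2 + lam**2)   # Gamma_pos = diag{...}, as in the text
ratio = post_var / prior_var                     # equals 1 / (1 + lambda_i^2 / sigma_i^2)

print(ratio)
np.testing.assert_allclose(ratio, 1.0 / (1.0 + lam**2 / sig**2))
# The most informed directions are those with the largest lambda_i^2 / sigma_i^2,
# i.e., the smallest variance ratio (here the third component).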
We now generalize this simple case by moving to full matrices H and Γpr with
a variety of prescribed spectra. We assume that H and Γpr have SVDs of the form
H = U Λ U^T and Γ_pr = V Λ̂ V^T, where Λ = diag{λ_1, . . . , λ_n} and Λ̂ = diag{λ̂_1, . . . , λ̂_n}, with

    λ_k = λ_0/k^α + τ    and    λ̂_k = λ̂_0/k^α̂ + τ̂.
To consider many different cases, the orthogonal matrices U and V are randomly
and independently generated uniformly over the orthogonal group [78], leading to
different realizations of H and Γpr . In particular, U and V are computed with a QR
decomposition of a square matrix of independent standard Gaussian entries using a
Gram–Schmidt orthogonalization. (In this case, the standard Householder reflections
cannot be used.)
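A minimal sketch of this sampling step (a plain modified Gram–Schmidt orthogonalization of a Gaussian matrix, with illustrative dimensions) is given below.

import numpy as np

def random_orthogonal(n, rng):
    A = rng.standard_normal((n, n))
    Q = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):                     # modified Gram-Schmidt sweep
            v -= (Q[:, i] @ v) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(1)
U = random_orthogonal(50, rng)
print(np.allclose(U.T @ U, np.eye(50), atol=1e-10))   # orthogonality check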
Before discussing the results of the first experiment, we explain our implementation of BFGS-based reduction. We ran the BFGS optimizer with a dummy quadratic
optimization target J(x) = (1/2) x^T Γ_pos^{-1} x and used Γ_pr as the initial approximation matrix for Γ_pos. Thus, the BFGS approximation of the posterior covariance matrix can be written as Γ̂_pos = Γ_pr − KK^T for some rank-r matrix K. The rank-r update is
constructed by running the BFGS optimizer for r steps from random initial conditions as shown in [3]. Note that in order to obtain results for sufficiently high-rank
updates, we use BFGS rather than L-BFGS in our numerical examples. While [2, 3]
in principle employ L-BFGS, the results in these papers use a number of optimization
steps roughly equal to the number of vectors stored in L-BFGS; our approach thus is
comparable to [2, 3]. Nonetheless, some results for the highest-rank BFGS updates
are not plotted in Figures 1 and 2, as the optimizer converged so close to the optimum
that taking further steps resulted in numerical instabilities.
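The sketch below illustrates the idea with a plain BFGS inverse-Hessian recursion and an exact line search on the quadratic target; the problem sizes, random starting point, and line-search choice are our own illustrative assumptions, not the authors' exact implementation.

import numpy as np

def bfgs_cov_approx(Gamma_pos_inv, Gamma_pr, r, rng):
    n = Gamma_pr.shape[0]
    B = Gamma_pr.copy()                        # inverse-Hessian approximation (approximates Gamma_pos)
    x = rng.standard_normal(n)
    g = Gamma_pos_inv @ x                      # gradient of J(x) = 0.5 x' Gamma_pos^{-1} x
    for _ in range(r):
        p = -B @ g                             # BFGS search direction
        alpha = -(g @ p) / (p @ Gamma_pos_inv @ p)   # exact line search for a quadratic
        s = alpha * p
        x = x + s
        g_new = Gamma_pos_inv @ x
        yv = g_new - g
        rho = 1.0 / (yv @ s)
        I = np.eye(n)
        B = (I - rho * np.outer(s, yv)) @ B @ (I - rho * np.outer(yv, s)) + rho * np.outer(s, s)
        g = g_new
    return B                                   # a low-rank BFGS modification of Gamma_pr

rng = np.random.default_rng(2)
n = 30
Lp = rng.standard_normal((n, n)); Gamma_pr = Lp @ Lp.T / n + np.eye(n)
M = rng.standard_normal((n, n)); H = M @ M.T / n
Gamma_pos_inv = H + np.linalg.inv(Gamma_pr)
B = bfgs_cov_approx(Gamma_pos_inv, Gamma_pr, r=10, rng=rng)
Gamma_pos = np.linalg.inv(Gamma_pos_inv)
print(np.linalg.norm(B - Gamma_pos) / np.linalg.norm(Gamma_pos))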
Figure 1 summarizes the results of the first experiment. The top row shows
the prescribed spectra of H^{-1} (red) and Γ_pr (blue). The parameters describing the eigenvalues of Γ_pr are fixed to λ̂_0 = 1, α̂ = 2, and τ̂ = 10^{-6}. The corresponding parameters for H are given by λ_0 = 500 and τ = 10^{-6} with α = 0.345 (left), α = 0.690 (middle), and α = 1.724 (right). Thus, moving from the leftmost column to
the rightmost column, the data become increasingly less informative. The second
row in the figure shows the Förstner distance between Γ_pos and its approximation, Γ̂_pos = Γ_pr − KK^T, as a function of the rank of KK^T for 100 different realizations of H and Γ_pr. The third row shows, for each realization of (H, Γ_pr) and for each fixed rank of KK^T, the difference between the Förstner distance obtained with a prior-, Hessian-, or
BFGS-based dimensionality reduction technique and the minimum distance obtained
with the generalized approximation. All these differences are positive—a confirmation
of Theorem 2.3. But Figure 1 also shows interesting patterns consistent with the
observations made for the simple example above. When the spectrum of H is basically
flat (left column), the directions along which the prior variance is reduced the most
are likely to be those corresponding to the largest prior variances, and thus a prior-based reduction is almost as effective as the generalized one (as seen in the bottom
two rows on the left). As we move to the third column, eigenvalues of H −1 increase
more quickly. The data provide significant information only on a lower-dimensional
subspace of the parameter space. In this case, it is crucial to combine this information
with the directions in the parameter space along which the prior variance is the
greatest. The generalized reduction technique successfully accomplishes this task,
whereas the prior and Hessian reductions fail as they focus either on Γpr or H alone;
the key is to combine the two. The BFGS update performs remarkably well across
all three configurations of the Hessian spectrum, although it is clearly suboptimal
compared to the generalized reduction.
In Figure 2 the situation is reversed and the results are symmetric to those of
Figure 1. The spectrum of H (red) is now kept fixed with parameters λ_0 = 500, α = 1, and τ = 10^{-9}, while the spectrum of Γ_pr (blue) has parameters λ̂_0 = 1 and τ̂ = 10^{-9} with decay rates increasing from left to right: α̂ = 0.552 (left), α̂ = 1.103 (middle), and α̂ = 2.759 (right). In the first column, the spectrum of the prior is nearly
flat. That is, the prior variance is almost equally spread along every direction in the
parameter space. In this case, the eigenstructure of H determines the directions of
greatest variance reduction, and the Hessian-based reduction is almost as effective as
the generalized one. As we move toward the third column, the spectrum of Γpr decays
more quickly. The prior variance is restricted to lower-dimensional subspaces of the
parameter space. Mismatch between prior- and Hessian-dominated directions then
leads to poor performance of both the prior- and Hessian-based reduction techniques.
However, the generalized reduction performs well also in this more challenging case.
The BFGS reduction is again empirically quite effective in most of the configurations
that we consider. It is not always better than the prior- or Hessian-based techniques
when the update rank is low, or when the prior spectrum decays slowly; for example,
Hessian-based reduction is more accurate than BFGS across all ranks in the first
column of Figure 2. But when either the prior covariance or the Hessian has quickly
decaying spectra, the BFGS approach performs almost as well as the generalized
Fig. 1. Top row: Eigenspectra of Γ_pr (blue) and H^{-1} (red) for three values of the decay rate
of the eigenvalues of H: α = 0.345 (left), α = 0.690 (middle), and α = 1.724 (right). Second
row: Förstner distance between Γpos and its approximation versus the rank of the update for 100
realizations of Γpr and H using prior-based (blue), Hessian-based (green), BFGS-based (magenta),
and optimal (red) updates. Bottom row: Differences of posterior covariance approximation error
(measured with the Förstner metric) between the prior-based and optimal updates (blue), between
the Hessian-based and optimal updates (green), and between the BFGS-based and optimal updates
(magenta).
reduction. Though this approach remains suboptimal, its approximation properties
deserve further theoretical study.
5.2. Example 2: X-ray tomography. We consider a classical inverse problem
of X-ray computed tomography (CT), where X-rays travel from sources to detectors
through an object of interest. The intensities from multiple sources are measured
at the detectors, and the goal is to reconstruct the density of the object. In this
framework, we investigate the performance of the optimal mean and covariance matrix
approximations presented in sections 2 and 4. This synthetic example is motivated
by a real application: real-time X-ray imaging of logs that enter a saw mill for the
purpose of automatic quality control. For instance, in the system commercialized by
Bintec (www.bintec.fi), logs enter the X-ray system on a fast-moving conveyer belt
and fast reconstructions are needed. The imaging setting (e.g., X-ray source and
detector locations) and the priors are fixed; only the data changes from one log cross
section to another. The basis for our posterior mean approximation can therefore be
precomputed, and repeated inversions can be carried out quickly with direct matrix
formulas.
We model the absorption of an X-ray along a line, ℓ_i, using Beer's law:

(5.1)    I_d = I_s exp( − ∫_{ℓ_i} x(s) ds ),
where Id and Is are the intensities at the detector and at the source, respectively,
Fig. 2. Analogous to Figure 1 but this time the spectrum of H is fixed, while that of Γ_pr has varying decay rates: α̂ = 0.552 (left), α̂ = 1.103 (middle), and α̂ = 2.759 (right).
and x(s) is the density of the object at position s on the line ℓ_i. The computational domain is discretized into a grid and the density is assumed to be constant within each grid cell. The line integrals are approximated as

(5.2)    ∫_{ℓ_i} x(s) ds ≈ Σ_{j=1}^{#cells} g_{ij} x_j,

where g_{ij} is the length of the intersection between line ℓ_i and cell j, and x_j is the
unknown density in cell j. The vector of absorptions along m lines can then be
approximated as
(5.3)
Id ≈ Is exp (−Gx) ,
where Id is the vector of m intensities at the detectors and G = (gij ) is the m × n
matrix of intersection lengths for each of the m lines. Even though the forward
operator (5.3) is nonlinear, the inference problem can be recast in a linear fashion by
taking the logarithm of both sides of (5.3). This leads to the following linear model
for the inversion: y = Gx + ε, where the measurement vector is y = − log(I_d/I_s) and the measurement errors are assumed to be independent and identically distributed Gaussian, ε ∼ N(0, σ²I).
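The following sketch illustrates this linearization on synthetic data; the matrix G, the density values, and the noise level are placeholders rather than the actual tomography setup.

import numpy as np

rng = np.random.default_rng(3)
m, n = 200, 400
G = rng.uniform(0.0, 0.05, size=(m, n))        # placeholder for the intersection-length matrix
x_true = rng.uniform(0.0, 0.006, size=n)       # placeholder discretized density
I_s, sigma = 1.0, 0.002

I_d = I_s * np.exp(-G @ x_true)                # noiseless detector intensities, cf. (5.3)
y = -np.log(I_d / I_s) + sigma * rng.standard_normal(m)   # measurement vector with N(0, sigma^2 I) noise
# From here on the problem is linear-Gaussian: y = G x + eps.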
The setup for the inference problem, borrowed from [40], is as follows. The
rectangular domain is discretized with an n × n grid. The true object consists of three
circular inclusions, each of uniform density, inside an annulus. Ten X-ray sources
are positioned on one side of a circle, and each source sends a fan of 100 X-rays
that are measured by detectors on the opposite side of the object. Here, the 10
sources are distributed evenly so that they form a total illumination angle of 90
degrees, resulting in a limited-angle CT problem. We use the exponential model
Fig. 3. X-ray tomography problem. Left: Discretized domain, true object, sources (red dots),
and detectors corresponding to one source (black dots). The fan transmitted by one source is illustrated in gray. The density of the object is 0.006 in the outer ring and 0.004 in the three inclusions;
the background density is zero. Right: The true simulated intensity (black line) and noisy measurements (red dots) for one source.
(5.1) to generate synthetic data in a discretization-independent fashion by computing
the exact intersections between the rays and the circular inclusions in the domain.
Gaussian noise with standard deviation σ = 0.002 is added to the simulated data.
The imaging setup and data from one source are illustrated in Figure 3.
The unknown density is estimated on a 128×128 grid. Thus the discretized vector,
x, has length 16,384, and direct computation of the posterior mean and the posterior
covariance matrix, as well as generation of posterior samples, can be computationally
nontrivial. To define the prior distribution, x is modeled as a discretized solution of
a stochastic PDE of the form
(5.4)    γ (κ² I − Δ) x(s) = W(s),    s ∈ Ω,

where W is a white noise process, Δ is the Laplacian operator, and I is the identity operator. The solution of (5.4) is a Gaussian random field whose correlation length and variance are controlled by the free parameters κ and γ, respectively. A square root of the prior precision matrix of x (which is positive definite) can then be easily computed (see [57] for details). We use κ = 10 and γ = √800 in our simulations.
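As a rough illustration of such a prior (and only an illustration: the example in the text uses the FEM-based construction of [57]), one can discretize γ(κ²I − Δ) with finite differences on a small grid and use it as a square root of the prior precision, Γ_pr^{-1} = L^T L; the grid size and crude boundary treatment below are our own simplifications.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def prior_precision_sqrt(n, kappa, gamma, h=1.0):
    e = np.ones(n)
    D2 = sp.diags([e[:-1], -2.0 * e, e[:-1]], [-1, 0, 1], format="lil")
    D2[0, 0] = D2[-1, -1] = -1.0                   # crude zero-flux boundary rows
    D2 = D2.tocsr() / h**2
    I1 = sp.identity(n, format="csr")
    lap2d = sp.kron(D2, I1) + sp.kron(I1, D2)      # 2-D Laplacian on the n-by-n grid
    return gamma * (kappa**2 * sp.identity(n * n, format="csr") - lap2d)

L = prior_precision_sqrt(n=32, kappa=10.0, gamma=np.sqrt(800.0))
# Sampling from the prior then amounts to solving L x = w with discrete white noise w:
w = np.random.default_rng(4).standard_normal(L.shape[0])
x_sample = spsolve(L.tocsc(), w)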
Our first task is to compute an optimal approximation of Γpos as a low-rank
negative update of Γpr (cf. Theorem 2.3). Figure 4 (top row) shows the convergence
of the approximate posterior variance as the rank of the update increases. The zero-rank update corresponds to Γpr (first column). For this formally 16,384-dimensional
problem, a good approximation of the posterior variance is achieved with a rank 200
update; hence the data are informative only on a low-dimensional subspace. The
quality of the covariance matrix approximation is also reflected in the structure of
samples drawn from the approximate posterior distributions (bottom row). All five of
these samples are drawn using the same random seed and the exact posterior mean,
so that all the differences observed are due to the approximation of Γpos . Already
with a rank 100 update, the small-scale features of the approximate posterior sample
match those of the exact posterior sample. In applications, agreement in this “eyeball
norm” is important. Of course, Theorem 2.3 also provides an exact formula for the
error in the posterior covariance; this error is shown in the right panel of Figure 7
(blue curve).
Fig. 4. X-ray tomography problem. First column: Prior variance field, in log scale (top),
and a sample drawn from the prior distribution (bottom). Second through last columns (left to
right): Variance field, in log scale, of the approximate posterior as the rank of the update increases
(top); samples from the corresponding approximate posterior distributions (bottom) assuming exact
knowledge of the posterior mean.
Our second task is to assess the performances of the two optimal posterior mean approximations given in section 4. We will use μ_pos^{(r)}(y) to denote the low-rank approximation and μ̂_pos^{(r)}(y) to denote the low-rank update approximation. Recall that both approximations are linear functions of the data y, given by μ_pos^{(r)}(y) = A* y with A* ∈ A_r and μ̂_pos^{(r)}(y) = Â* y with Â* ∈ Â_r, where the classes A_r and Â_r are defined in (4.2). As in section 4, we shall use A to denote either of the two classes.
Figure 5 shows the normalized error ‖μ(y) − μ_pos(y)‖_{Γ_pos^{-1}} / ‖μ_pos(y)‖_{Γ_pos^{-1}} for different approximations μ(y) of the true posterior mean μ_pos(y) and a fixed realization
y of the data. The error is a function of the order r of the approximation class A.
Snapshots of μ(y) are shown along the two error curves. For reference, μpos (y) is also
shown at the top. We see that the errors decrease monotonically but that the low-rank approximation outperforms the low-rank update approximation for lower values
of r. This is consistent with the discussion at the end of section 4; the crossing point
of the error curves is also consistent with that analysis. In particular, we expect the
low-rank update approximation to outperform the low-rank approximation only when
the approximation starts to include generalized eigenvalues of the pencil (H, Γ_pr^{-1}) that
are less than one—i.e., once the approximations are no longer underresolved. This can
be confirmed by comparing Figure 5 with the decay of the generalized eigenvalues of the pencil (H, Γ_pr^{-1}) in the center panel of Figure 7 (blue curve).
On top of each snapshot in Figure 5, we show the relative CPU time required
to compute the corresponding posterior mean approximation for each new realization
of the data. The relative CPU time is defined as the time required to compute
this approximation8 divided by the time required to apply the posterior precision
matrix to a vector. This latter operation is essential to computing the posterior
mean via an iterative solver, such as a Krylov subspace method. These solvers are
a standard choice for computing the posterior mean in large-scale inverse problems.
Evaluating the ratio allows us to determine how many solver iterations could be
performed with a computational cost roughly equal to that of approximating the
posterior mean for a new realization of the data. Based on the reported times, a
8 This timing does not include the computation of (4.4) or (4.7), which should be regarded as
offline steps. Here we report the time necessary to apply the optimal linear function to any new
realization of the data, i.e., the online cost.
Fig. 5. Limited-angle X-ray tomography: Comparison of the optimal posterior mean approximations, μ_pos^{(r)}(y) (blue) and μ̂_pos^{(r)}(y) (black), of μ_pos(y) for a fixed realization of the data y, as a function of the order r of the approximating classes A_r and Â_r, respectively. The normalized error for an approximation μ(y) is defined as ‖μ(y) − μ_pos(y)‖_{Γ_pos^{-1}} / ‖μ_pos(y)‖_{Γ_pos^{-1}}. The numbers above or below the snapshots indicate the relative CPU time of the corresponding mean approximation—i.e., the time required to compute the approximation divided by the time required to apply the posterior precision matrix to a vector.
few observations can be made. First of all, as anticipated in section 4, computing μ_pos^{(r)}(y) for any new realization of the data is faster than computing μ̂_pos^{(r)}(y). Second, obtaining an accurate posterior mean approximation requires roughly r = 200, and the relative CPU times for this order of approximation are 7.3 for μ_pos^{(r)}(y) and 29.0 for μ̂_pos^{(r)}(y). These are roughly the number of iterations of an iterative solver that one
could take for equivalent computational cost. That is, the speedup of the posterior
mean approximation compared to an iterative solver is not particularly dramatic in
this case, because the forward model G is simply a sparse matrix that is cheap to
apply. This is different for the heat equation example discussed in section 5.3.
Note that the above computational time estimates exclude other costs associated
with iterative solvers. For instance, preconditioners are often applied; these significantly decrease the number of iterations needed for the solvers to converge but, on the
other hand, increase the cost per iteration. A popular approach for computing the posterior mean efficiently is to use the prior covariance as the preconditioner [6]. In the
limited-angle tomography problem, including the application of this preconditioner in
the reference CPU time would reduce the relative CPU time of our r = 200 approximations to 0.48 for μ_pos^{(r)}(y) and 1.9 for μ̂_pos^{(r)}(y). That is, the cost of computing our
approximations is roughly equal to one iteration of a prior-preconditioned iterative
solver. The large difference compared to the case without preconditioning is due to
the fact that the cost of applying the prior here is computationally much higher than
applying the forward model.
Figure 6 (left panel) shows unnormalized errors in the approximation of μpos (y),
(5.5)    ‖e(y)‖²_{Γ_pos^{-1}} = ‖μ_pos^{(r)}(y) − μ_pos(y)‖²_{Γ_pos^{-1}}    and    ‖ê(y)‖²_{Γ_pos^{-1}} = ‖μ̂_pos^{(r)}(y) − μ_pos(y)‖²_{Γ_pos^{-1}},
Fig. 6. The errors ‖e(y)‖²_{Γ_pos^{-1}} (blue) and ‖ê(y)‖²_{Γ_pos^{-1}} (black) defined by (5.5), and their expected values in green and red, respectively, for Example 2 (left panel) and Example 3 (right panel).
for the same realization of y used in Figure 5. In the same panel we also show the
expected values of these errors over the prior predictive distribution of y, which is
exactly the r-dependent component of the Bayes risk given in Theorems 4.1 and 4.2.
Both sets of errors decay with increasing r and show a similar crossover between the
two approximation classes. But the particular error ‖e(y)‖²_{Γ_pos^{-1}} departs consistently
from its expectation; this is not unreasonable in general (the mean estimator has a
nonzero variance), but the offset may be accentuated in this case because the data
are generated from an image that is not drawn from the prior. (The right panel of
Figure 6, which comes from Example 3, represents a contrasting case.)
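The weighted norms in (5.5) are straightforward to evaluate from a factor of the posterior covariance, as in the following sketch (the toy covariance and vectors are placeholders): with any square root Γ_pos = S S^T, ‖v‖²_{Γ_pos^{-1}} = ‖S^{-1} v‖²_2.

import numpy as np

def weighted_sq_error(S, mu_approx, mu_exact):
    v = mu_approx - mu_exact
    return float(np.sum(np.linalg.solve(S, v) ** 2))          # ||S^{-1} v||^2 = v' Gamma_pos^{-1} v

S = np.linalg.cholesky(np.array([[2.0, 0.3], [0.3, 1.0]]))    # toy posterior covariance factor
print(weighted_sq_error(S, np.array([1.0, 2.0]), np.array([0.9, 2.1])))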
By design, the posterior approximations described in this paper perform well
when the data inform a low-dimensional subspace of the parameter space. To better
understand this effect, we also consider a full-angle configuration of the tomography
problem, wherein the sources and detectors are evenly spread around the entire unknown object. In this case, the data are more informative than in the limited-angle
configuration. This can be seen in the decay rate of the generalized eigenvalues of
the pencil (H, Γ_pr^{-1}) in the center panel of Figure 7 (blue and red curves); eigenvalues for the full-angle configuration decay more slowly than for the limited-angle configuration. Thus, according to the optimal loss given in (2.10) (Theorem 2.3), the prior-to-posterior update in the full-angle case must be of greater rank than the update in
the limited-angle case for any given approximation error. Also, good approximation
of μpos (y) in the full-angle case requires higher order of the approximation class A,
as is shown in Figure 8. But because the data are strongly informative, they allow
an almost perfect reconstruction of the underlying truth image. The relative CPU
times are similar to the limited-angle case: roughly 8 for μ_pos^{(r)}(y) and 14 for μ̂_pos^{(r)}(y). If preconditioning with the prior covariance is included in the reference CPU time calculation, the relative CPU times drop to 1.5 for μ_pos^{(r)}(y) and to 2.6 for μ̂_pos^{(r)}(y). We remark that in realistic applications of X-ray tomography, the limited-angle setup
is extremely common as it is cheaper and more flexible (yielding smaller and lighter
devices) than a full-angle configuration.
Fig. 7. Left: Leading eigenvalues of Γ_pr and H^{-1} in the limited-angle and full-angle X-ray tomography problems. Center: Leading generalized eigenvalues of the pencil (H, Γ_pr^{-1}) in the limited-angle (blue) and full-angle (red) cases. Right: d_F(Γ_pos, Γ̂_pos) as a function of the rank of the update KK^T, with Γ̂_pos = Γ_pr − KK^T, in the limited-angle (blue) and full-angle (red) cases.
Fig. 8. Same as Figure 5, but for full-angle X-ray tomography (sources and receivers spread
uniformly around the entire object).
5.3. Example 3: Heat equation. Our last example is the classic linear inverse
problem of solving for the initial conditions of an inhomogeneous heat equation. Let
u(s, t) be the time-dependent state of the heat equation on s = (s1 , s2 ) ∈ Ω = [0, 1]2 ,
t ≥ 0, and let κ(s) be the heat conductivity field. Given initial conditions, u0 (s) =
u(s, 0), the state evolves in time according to the linear heat equation:
(5.6)    ∂u(s, t)/∂t = −∇ · (κ(s)∇u(s, t)),    s ∈ Ω, t > 0,
         κ(s)∇u(s, t) · n(s) = 0,    s ∈ ∂Ω, t > 0,
where n(s) denotes the outward-pointing unit normal at s ∈ ∂Ω. We place n_s = 81 sensors at the locations s_1, . . . , s_{n_s}, uniformly spaced within the lower left quadrant of the spatial domain, as illustrated by the black dots in Figure 9. We use a finite-dimensional discretization of the parameter space based on the finite element method
on a regular 100 × 100 grid, {si }. Our goal is to infer the vector x = (u0 (si )) of initial
Fig. 9. Heat equation (Example 3). Initial condition (top left) and several snapshots of the
states at different times. Black dots indicate sensor locations.
conditions on the grid. Thus, the dimension of the parameter space for the inference problem is n = 10^4. We use data measured at 50 discrete times t = t_1, t_2, . . . , t_50, where t_i = iΔt and Δt = 2 × 10^{-4}. At each time t_i, pointwise observations of the state u are taken at these sensors, i.e.,

(5.7)    d_i = Cu(s, t_i),

where C is the observation operator that maps the function u(s, t_i) to d_i = (u(s_1, t_i), . . . , u(s_{n_s}, t_i))^T. The vector of observations is then d = [d_1; d_2; . . . ; d_50]. The noisy data vector is y = d + ε, where ε ∼ N(0, σ²I) and σ = 10^{-2}. Note that the data are a linear function of the initial conditions, perturbed by Gaussian noise. Thus the data can be written as

(5.8)    y = Gx + ε,    ε ∼ N(0, σ²I),
where G is a linear map defined by the composition of the forward model (5.6) with
the observation operator (5.7), both linear.
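A coarse sketch of such a forward map (implicit Euler finite differences with constant conductivity, a crude Neumann closure, and placeholder sensor indices, rather than the FEM solver used in the example) might look as follows.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import factorized

def heat_forward(x0, n=50, kappa=1.0, dt=2e-4, n_steps=50, sensor_idx=None):
    h = 1.0 / (n - 1)
    e = np.ones(n)
    D2 = sp.diags([e[:-1], -2.0 * e, e[:-1]], [-1, 0, 1], format="lil")
    D2[0, 0] = D2[-1, -1] = -1.0                   # crude zero-flux closure
    D2 = D2.tocsr() / h**2
    I1 = sp.identity(n, format="csr")
    A = sp.identity(n * n, format="csr") - dt * kappa * (sp.kron(D2, I1) + sp.kron(I1, D2))
    solve = factorized(A.tocsc())                  # reuse one factorization for all time steps
    u = x0.copy()
    obs = []
    for _ in range(n_steps):
        u = solve(u)                               # one implicit Euler step
        obs.append(u[sensor_idx])                  # pointwise observations, cf. (5.7)
    return np.concatenate(obs)                     # stacked data vector d = [d_1; ...; d_50]

rng = np.random.default_rng(5)
n = 50
sensors = rng.choice(n * n, size=81, replace=False)   # placeholder sensor locations
x0 = rng.standard_normal(n * n)
d = heat_forward(x0, n=n, sensor_idx=sensors)
y = d + 1e-2 * rng.standard_normal(d.size)            # y = G x + eps with sigma = 1e-2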
We generate synthetic data by evolving the initial conditions shown in Figure 9.
This “true” value of the inversion parameters x is a discretized realization of a Gaussian process satisfying an SPDE of the same form used in the previous tomography
example, but now with a nonstationary permeability field. In other words, the truth
is a draw from the prior in this example (unlike in the previous example), and the
prior Gaussian process satisfies the following SPDE:
(5.9)    γ (κ² I − ∇ · c(s)∇) x(s) = W(s),    s ∈ Ω,
where c(s) is the space-dependent permeability tensor.
Figure 10 and the right panel in Figure 6 show our numerical results. They have
the same interpretations as Figures 5 and 6 in the tomography example. The trends
in the figures are consistent with those encountered in the previous example and
confirm the good performance of the optimal low-rank approximation. Notice that in
Figures 10 and 6 the approximation of the posterior mean appears to be nearly perfect
(visually) once the error curves for the two approximations cross. This is somewhat
expected from the theory since we know that the crossing point should occur when
the approximations start to use eigenvalues of the pencil (H, Γ−1
pr ) that are less than
one—that is, once we have exhausted directions in the parameter space where the
data are more constraining than the prior.
Fig. 10. Same as Figure 5, but for Example 3 (initial condition inversion for the heat equation).
Again, we report the relative CPU time for each posterior mean approximation
above/below the corresponding snapshot in Figure 10. The results differ significantly
from the tomography example. For instance, at order r = 200, which yields approximations that are visually indistinguishable from the true mean, the relative CPU
times are 0.001 for μ_pos^{(r)}(y) and 0.53 for μ̂_pos^{(r)}(y). Therefore we can compute an accurate mean approximation for a new realization of the data much more quickly than
taking one iteration of an iterative solver. Recall that, consistent with the setting
described at the start of section 4, this is a comparison of online times, after the matrices (4.4) or (4.7) have been precomputed. The difference between this case and the
tomography example of section 5.2 is due to the higher CPU cost of applying the forward and adjoint models for the heat equation—solving a time-dependent PDE versus
applying a sparse matrix. Also, because the cost of applying the prior covariance is
negligible compared to that of the forward and adjoint solves in this example, preconditioning the iterative solver with the prior would not strongly affect the reported
relative CPU times, unlike the tomography example.
Figure 11 illustrates some important directions characterizing the heat equation
inverse problem. The first two columns show the four leading eigenvectors of, respectively, Γpr and H. Notice that the support of the eigenvectors of H concentrates
around the sensors. The third column shows the four leading directions (ŵ_i) defined in Theorem 2.3. These directions define the optimal prior-to-posterior covariance matrix update (cf. (2.9)). This update of Γ_pr is necessary to capture the directions (w̃_i) of greatest relative difference between prior and posterior variance (cf. Corollary 3.1). The four leading directions (w̃_i) are shown in the fourth column. The support of
these modes is again concentrated around the sensors, which intuitively makes sense
as these are directions of greatest variance reduction.
6. Conclusions. This paper has presented and characterized optimal approximations of the Bayesian solution of linear inverse problems, with Gaussian prior
and noise distributions defined on finite-dimensional spaces. In a typical large-scale
inverse problem, observations may be informative—relative to the prior—only on a
Fig. 11. Heat equation (Example 3). First column: Four leading eigenvectors of Γ_pr. Second column: Four leading eigenvectors of H. Third column: Four leading directions (ŵ_i) (cf. (2.9)). Fourth column: Four leading directions (w̃_i) (cf. Corollary 3.1).
low-dimensional subspace of the parameter space. Our approximations therefore identify and exploit low-dimensional structure in the update from prior to posterior.
We have developed two types of optimality results. In the first, the posterior
covariance matrix is approximated as a low-rank negative semidefinite update of
the prior covariance matrix. We describe an update of this form that is optimal
with respect to a broad class of loss functions between covariance matrices, exemplified by the Förstner metric [29] for symmetric positive definite matrices. We argue
that this is the appropriate class of loss functions with which to evaluate approximations of the posterior covariance matrix, and we show that optimality in such metrics
identifies directions in parameter space along which the posterior variance is reduced
the most, relative to the prior. Optimal low-rank updates are derived from a generalized eigendecomposition of the pencil defined by the negative log-likelihood Hessian
and the prior precision matrix. These updates have been proposed in previous work
[28], but our work complements these efforts by characterizing the optimality of the
resulting approximations. Under the assumption of exact knowledge of the posterior
mean, our results extend to optimality statements between the associated distributions (e.g., optimality in the Hellinger distance and in the K-L divergence). Second,
we have developed fast approximations of the posterior mean that are useful when
repeated evaluations thereof are required for multiple realizations of the data (e.g.,
in an online inference setting). These approximations are optimal in the sense that
they minimize the Bayes risk for squared-error loss induced by the posterior precision
matrix. The most computationally efficient of these approximations expresses the
posterior mean as the product of a single low-rank matrix with the data. We have
demonstrated the covariance and mean approximations numerically on a variety of
inverse problems: synthetic problems constructed from random Hessian and prior covariance matrices; an X-ray tomography problem with different observation scenarios;
and inversion for the initial condition of a heat equation, with localized observations
and a nonstationary prior.
This work has several possible extensions of interest, some of which are already
part of ongoing research. First, it is natural to generalize the present approach to
infinite-dimensional parameter spaces endowed with Gaussian priors. This setting is
essential to understanding and formalizing Bayesian inference over function spaces
[9, 79]. Here, by analogy with the current results, one would expect the posterior covariance operator to be well approximated by a finite-rank negative perturbation of the
prior covariance operator. A further extension could allow the data to become infinite-dimensional as well. Another important task is to generalize the present methodology
to inverse problems with nonlinear forward models. One approach for doing so is
presented in [22]; other approaches are certainly possible. Yet another interesting
research topic is the study of analogous approximation techniques for sequential inference. We note that the assimilation step in a linear (or linearized) data assimilation
scheme can already be tackled within the framework presented here. But the nonstationary setting, where inference is interleaved with evolution of the state, introduces
the possibility for even more tailored and structure-exploiting approximations.
Appendix A. Technical results. Here we collect the proofs and other technical
results necessary to support the statements made in the previous sections.
We start with an auxiliary approximation result that plays an important role in
our analysis. Given a positive semidefinite diagonal matrix D, we seek an approximation of D + I by a rank r perturbation of the identity, UU^T + I, that minimizes a loss function from the class L defined in (2.6). The following lemma shows that the optimal solution Û Û^T is simply the best rank r approximation of the matrix D in the Frobenius norm.
Lemma A.1 (approximation lemma). Let D = diag{d_1², . . . , d_n²}, with d_i² ≥ d_{i+1}², and L ∈ L. Define the functional J : R^{n×r} → R as J(U) = L(UU^T + I, D + I) = Σ_i f(σ_i), where (σ_i) are the generalized eigenvalues of the pencil (UU^T + I, D + I) and f ∈ U. Then the following hold:
(i) There is a minimizer, Û, of J such that

(A.1)    Û Û^T = Σ_{i=1}^r d_i² e_i e_i^T,

where (e_i) are the columns of the identity matrix.
(ii) If the first r eigenvalues of D are distinct, then any minimizer of J satisfies (A.1).
Proof. The idea is to apply [53, Theorem 1.1] to the functional J. To this end, we notice that J can be equivalently written as J(U) = F ∘ ρ_n ∘ g(U), where F : R_+^n → R is of the form F(x) = Σ_{i=1}^n f(x_i); ρ_n denotes a function that maps an n × n SPD matrix A to its eigenvalues σ = (σ_i) (i.e., ρ_n(A) = σ and, since F is a symmetric function, the order of the eigenvalues is irrelevant); and the mapping g is given by g(U) = (D + I)^{-1/2} (UU^T + I)(D + I)^{-1/2} for all U ∈ R^{n×r}. Since the function F ∘ ρ_n satisfies the hypotheses in [53, Theorem 1.1], F ∘ ρ_n is differentiable at the SPD matrix X iff F is differentiable at ρ_n(X), in which case (F ∘ ρ_n)'(X) = Z S_σ Z^T, where

S_σ = diag[ F'(ρ_n(X)) ] = diag{f'(σ_1), . . . , f'(σ_n)},

and Z is an orthogonal matrix such that X = Z diag[ ρ_n(X) ] Z^T. Using the chain rule, we obtain

∂J(U)/∂U_{ij} = tr( Z S_σ Z^T ∂g(U)/∂U_{ij} ),

which leads to the following gradient of J at U:

∇J(U) = 2 (D + I)^{-1/2} Z S_σ Z^T (D + I)^{-1/2} U = 2 W S_σ W^T U,
where the orthogonal matrix Z is such that the matrix W = (D + I)−1/2 Z satisfies
(A.2)    (UU^T + I)W = (D + I)W Υ_σ
with Υ_σ = diag(σ). Now we show that the functional J is coercive. Let (U_k) be a sequence of matrices such that ‖U_k‖_F → ∞. Hence, σ_max(g(U_k)) → ∞ and so does J since

J(U_k) ≥ f(σ_max(g(U_k))) + (n − 1) f(1)

and f(x) → ∞ as x → ∞. Thus, J is a differentiable coercive functional and has a global minimizer Û with zero gradient:

(A.3)    ∇J(Û) = 2 W S_σ W^T Û = 0.

However, since f ∈ U, f'(x) = 0 iff x = 1. It follows that condition (A.3) is equivalent to

(A.4)    (I − Υ_σ) W^T Û = 0.

Equations (A.2) and (A.4) give Û Û^T − D = W^{-T} Υ_σ^{-1}(Υ_σ − I) W^T (Û Û^T + I), and right-multiplication by Û Û^T then yields

(A.5)    D (Û Û^T) = (Û Û^T)².
In particular, if u is an eigenvector of Û Û^T with nonzero eigenvalue α, then u is an eigenvector of D, Du = αu, and thus α = d_i² > 0 for some i. Thus, any solution of (A.5) is such that

(A.6)    Û Û^T = Σ_{i=1}^{r_k} d_{k_i}² e_{k_i} e_{k_i}^T

for some subsequence (k_i) of {1, . . . , n} and rank r_k ≤ r. Notice that any Û satisfying (A.5) is also a critical point according to (A.4). From (A.6) we also find that g(Û) is a diagonal matrix,

g(Û) = (D + I)^{-1} ( Σ_{i=1}^{r_k} d_{k_i}² e_{k_i} e_{k_i}^T + I ).

The diagonal entries σ_i, which are the eigenvalues of g(Û), are given by σ_i = 1 if i = k_ℓ for some ℓ ≤ r_k, or σ_i = 1/(1 + d_i²) otherwise. In either case, we have 0 < σ_i ≤ 1, and the monotonicity of f implies that J(Û) is minimized by the subsequence k_1 = 1, . . . , k_r = r and by the choice r_k = r. This proves (A.1). It is clear that if the first r eigenvalues of D are distinct, then any minimizer of J satisfies (A.1).
Most of the objective functions we consider have the same structure as the loss
function J , hence the importance of Lemma A.1.
The next lemma shows that searching for a negative update of Γpr is equivalent to
looking for a positive update of the prior precision matrix. In particular, the lemma
provides a bijection between the two approximation classes, M_r and M_r^{-1}, defined
by (2.4) and (3.1). In what follows, S_pr is any square root of the prior covariance matrix such that Γ_pr = S_pr S_pr^T.
Lemma A.2 (prior updates). For any negative semidefinite update of Γ_pr, Γ̂_pos = Γ_pr − KK^T with Γ̂_pos ≻ 0, there is a matrix U (of the same rank as K) such that Γ̂_pos = (Γ_pr^{-1} + UU^T)^{-1}. The converse is also true.

Proof. Let ZDZ^T = S_pr^{-1} KK^T S_pr^{-T}, D = diag{d_i²}, be a reduced SVD of S_pr^{-1} KK^T S_pr^{-T}. Since Γ̂_pos ≻ 0 by assumption, we must have d_i² < 1 for all i, and we may thus define U = S_pr^{-T} Z D^{1/2} (I − D)^{-1/2}. By Woodbury's identity,

(Γ_pr^{-1} + UU^T)^{-1} = Γ_pr − Γ_pr U ( I + U^T Γ_pr U )^{-1} U^T Γ_pr = Γ_pr − KK^T = Γ̂_pos.

Conversely, given a matrix U, we use again Woodbury's identity to write Γ̂_pos as a negative semidefinite update of Γ_pr: Γ̂_pos = Γ_pr − KK^T ≻ 0.
Now we prove our main result on approximations of the posterior covariance
matrix.
Proof of Theorem 2.3. Given a loss function L ∈ L, our goal is to minimize

(A.7)    L(Γ_pos, Γ̂_pos) = Σ_i f(σ_i)

over K ∈ R^{n×r} subject to the constraint Γ̂_pos = Γ_pr − KK^T ≻ 0, where (σ_i) are the generalized eigenvalues of the pencil (Γ_pos, Γ̂_pos) and f belongs to the class U defined by (2.7). We also write σ_i(Γ_pos, Γ̂_pos) to specify the pencil corresponding to the eigenvalues. By Lemma A.2, the optimization problem is equivalent to finding a matrix, U ∈ R^{n×r}, that minimizes (A.7) subject to Γ̂_pos^{-1} = Γ_pr^{-1} + UU^T. Observe that (σ_i) are also the eigenvalues of the pencil (Γ̂_pos^{-1}, Γ_pos^{-1}).

Let W D W^T = S_pr^T H S_pr with D = diag{δ_i²} be an SVD of S_pr^T H S_pr. Then, by the invariance properties of the generalized eigenvalues we have

σ_i( Γ̂_pos^{-1}, Γ_pos^{-1} ) = σ_i( W^T S_pr^T Γ̂_pos^{-1} S_pr W, W^T S_pr^T Γ_pos^{-1} S_pr W ) = σ_i( ZZ^T + I, D + I ),

where Z = W^T S_pr^T U. Therefore, our goal reduces to finding a matrix, Z ∈ R^{n×r}, that minimizes (A.7) with (σ_i) being the generalized eigenvalues of the pencil ( ZZ^T + I, D + I ). Applying Lemma A.1 leads to the simple solution ZZ^T = Σ_{i=1}^r δ_i² e_i e_i^T, where (e_i) are the columns of the identity matrix. In particular, the solution is unique if the first r eigenvalues of S_pr^T H S_pr are distinct. The corresponding approximation UU^T is then

(A.8)    UU^T = S_pr^{-T} W ZZ^T W^T S_pr^{-1} = Σ_{i=1}^r δ_i² w̃_i w̃_i^T,

where w̃_i = S_pr^{-T} w_i and w_i is the ith column of W. Woodbury's identity gives the corresponding negative update of Γ_pr as

(A.9)    Γ̂_pos = Γ_pr − KK^T,    KK^T = Σ_{i=1}^r δ_i² (1 + δ_i²)^{-1} ŵ_i ŵ_i^T

with ŵ_i = S_pr w_i. Now, it suffices to note that the couples (δ_i², ŵ_i) defined here are precisely the generalized eigenpairs of the pencil (H, Γ_pr^{-1}). At optimality, σ_i = 1 for i ≤ r and σ_i = (1 + δ_i²)^{-1} for i > r, proving (2.10).
Before proving Lemma 2.2, we recall that the K-L divergence and the Hellinger
distance between two multivariate Gaussians, ν1 = N (μ, Σ1 ) and ν2 = N (μ, Σ2 ), with
the same mean and full rank covariance matrices are given, respectively, by [70]:
(A.10)    D_KL(ν_1 ‖ ν_2) = ( trace(Σ_2^{-1} Σ_1) − rank(Σ_1) − ln[ det(Σ_1)/det(Σ_2) ] ) / 2,

(A.11)    d_Hell(ν_1, ν_2) = sqrt( 1 − |Σ_1|^{1/4} |Σ_2|^{1/4} / | (1/2) Σ_1 + (1/2) Σ_2 |^{1/2} ).
Proof of Lemma 2.2. By (A.10), the K-L divergence between the posterior ν_pos(y) and the Gaussian approximation ν̂_pos(y) can be written in terms of the generalized eigenvalues of the pencil (Γ_pos, Γ̂_pos) as

D_KL( ν_pos(y) ‖ ν̂_pos(y) ) = Σ_i ( σ_i − ln σ_i − 1 ) / 2,

and since f(x) = (x − ln x − 1)/2 belongs to U, we see that the K-L divergence is a loss function in the class L defined by (2.6). Hence, Theorem 2.3 applies and the equivalence between the two approximations follows trivially. An analogous argument holds for the Hellinger distance. The squared Hellinger distance between ν_pos(y) and ν̂_pos(y) can be written in terms of the generalized eigenvalues, (σ_i), of the pencil (Γ_pos, Γ̂_pos), as

(A.12)    d_Hell²( ν_pos(y), ν̂_pos(y) ) = 1 − 2^{ℓ/2} Π_i σ_i^{1/4} (1 + σ_i)^{-1/2},

where ℓ is the dimension of the parameter space. Minimizing (A.12) is equivalent to maximizing Π_i σ_i^{1/4} (1 + σ_i)^{-1/2}, which in turn is equivalent to minimizing the functional

(A.13)    L(Γ_pos, Γ̂_pos) = − Σ_i ln( σ_i^{1/4} (1 + σ_i)^{-1/2} ) = Σ_i ln( 2 + σ_i + 1/σ_i ) / 4.
Since f (x) = ln( 2 + x + 1/x )/4 belongs to U, Theorem 2.3 can be applied once
again.
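The optimality statements above are easy to sanity-check numerically. The snippet below builds the update (A.9) from the leading generalized eigenpairs of (H, Γ_pr^{-1}) for an arbitrary small test problem and verifies that the K-L divergence (A.10) matches the tail sum implied by σ_i = 1 for i ≤ r and σ_i = (1 + δ_i²)^{-1} for i > r; sizes and test matrices are illustrative.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
n, r = 40, 5
Lp = rng.standard_normal((n, n)); Gamma_pr = Lp @ Lp.T / n + 0.1 * np.eye(n)
Mh = rng.standard_normal((n, n)); H = Mh @ Mh.T / n
Gamma_pr_inv = np.linalg.inv(Gamma_pr)
Gamma_pos = np.linalg.inv(H + Gamma_pr_inv)

# Leading generalized eigenpairs of (H, Gamma_pr^{-1}); eigh normalizes W' Gamma_pr^{-1} W = I
d2, What = eigh(H, Gamma_pr_inv)
order = np.argsort(d2)[::-1]
d2, What = d2[order], What[:, order]
Gamma_hat = Gamma_pr - (What[:, :r] * (d2[:r] / (1.0 + d2[:r]))) @ What[:, :r].T   # cf. (A.9)

def kl(S1, S2):   # D_KL(N(mu, S1) || N(mu, S2)), cf. (A.10)
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(np.linalg.solve(S2, S1)) - S1.shape[0] - (ld1 - ld2))

# At the optimum, sigma_i = 1 for i <= r and (1 + delta_i^2)^{-1} for i > r, so the
# K-L divergence should match the tail sum below.
tail = d2[r:]
predicted = 0.5 * np.sum(1.0 / (1.0 + tail) + np.log(1.0 + tail) - 1.0)
print(kl(Gamma_pos, Gamma_hat), predicted)       # these two numbers should agree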
Proof of Corollary 3.1. The proofs of parts (i) and (ii) were already given in the
proof of Theorem 2.3. Part (iii) holds because
(1 + δ_i²) Γ_pos w̃_i = (1 + δ_i²)(H + Γ_pr^{-1})^{-1} S_pr^{-T} w_i = (1 + δ_i²) S_pr (S_pr^T H S_pr + I)^{-1} w_i = S_pr w_i = Γ_pr w̃_i,

because w_i is an eigenvector of ( S_pr^T H S_pr + I )^{-1} with eigenvalue (1 + δ_i²)^{-1} as shown
in the proof of Theorem 2.3.
Now we turn to optimality results for approximations of the posterior mean. In
what follows, let S_pr, S_obs, S_pos, and S_y be the matrix square roots of, respectively, Γ_pr, Γ_obs, Γ_pos, and Γ_y := Γ_obs + G Γ_pr G^T, such that Γ = S S^T (i.e., possibly nonsymmetric square roots).
Equation (4.1) shows that to minimize E( ‖Ay − x‖²_{Γ_pos^{-1}} ) over A ∈ A, we need only to minimize E( ‖Ay − μ_pos(y)‖²_{Γ_pos^{-1}} ). Furthermore, since μ_pos(y) = Γ_pos G^T Γ_obs^{-1} y, it follows that

(A.14)    E( ‖Ay − μ_pos(y)‖²_{Γ_pos^{-1}} ) = ‖ S_pos^{-1} (A − Γ_pos G^T Γ_obs^{-1}) S_y ‖²_F.
We are therefore led to the following optimization problem:
(A.15)    min_{A ∈ A} ‖ S_pos^{-1} A S_y − S_pos^T G^T Γ_obs^{-1} S_y ‖_F.
The following result shows that an SVD of the matrix S_H := S_pr^T G^T S_obs^{-T} can be used to obtain simple expressions for the square roots of Γ_pos and Γ_y.

Lemma A.3 (square roots). Let W D V^T be an SVD of S_H = S_pr^T G^T S_obs^{-T}. Then

(A.16)    S_pos = S_pr W ( I + DD^T )^{-1/2} W^T,
(A.17)    S_y = S_obs V ( I + D^T D )^{1/2} V^T

are square roots of Γ_pos and Γ_y.
Proof. We can rewrite Γ_pos = ( G^T Γ_obs^{-1} G + Γ_pr^{-1} )^{-1} as

Γ_pos = S_pr ( S_H S_H^T + I )^{-1} S_pr^T = S_pr W ( DD^T + I )^{-1} W^T S_pr^T
      = [ S_pr W ( DD^T + I )^{-1/2} W^T ] [ S_pr W ( DD^T + I )^{-1/2} W^T ]^T,

which proves (A.16). The proof of (A.17) follows similarly using S_H^T S_H = S_obs^{-1} G Γ_pr G^T S_obs^{-T}.
In the next two proofs we use (C)_r to denote a best rank-r approximation of the matrix C in the Frobenius norm.
Proof of Theorem 4.1. By [31, Theorem 2.1], an optimal A ∈ A_r is given by

(A.18)    A = S_pos ( S_pos^T G^T Γ_obs^{-1} S_y )_r S_y^{-1}.

Now, we need some computations to show that (A.18) is equivalent to (4.4). Using (A.16) and (A.17) we find S_pos^T G^T Γ_obs^{-1} S_y = W (I + DD^T)^{-1/2} D (I + D^T D)^{1/2} V^T, and therefore ( S_pos^T G^T Γ_obs^{-1} S_y )_r = Σ_{i=1}^r δ_i w_i v_i^T, where w_i is the ith column of W, v_i is the ith column of V, and δ_i is the ith diagonal entry of D. Inserting this back into (A.18) yields A = Σ_{i≤r} δ_i (1 + δ_i²)^{-1} S_pr w_i v_i^T S_obs^{-1}. Now it suffices to note that ŵ_i := S_pr w_i is a generalized eigenvector of (H, Γ_pr^{-1}), that v̂_i := S_obs^{-T} v_i is a generalized eigenvector of (G Γ_pr G^T, Γ_obs), and that (δ_i²) are also eigenvalues of (H, Γ_pr^{-1}). The minimum Bayes risk is a straightforward computation for the optimal estimator (4.4) using (A.14).
Proof of Theorem 4.2. Given A ∈ Â_r, we can restate (A.15) as the problem of finding a matrix B, of rank at most r, that minimizes

(A.19)    ‖ S_pos^{-1} (Γ_pr − Γ_pos) G^T Γ_obs^{-1} S_y − S_pos^{-1} B G^T Γ_obs^{-1} S_y ‖_F

such that A = (Γ_pr − B) G^T Γ_obs^{-1}. By [31, Theorem 2.1], an optimal B is given by

(A.20)    B = S_pos ( S_pos^{-1} (Γ_pr − Γ_pos) G^T Γ_obs^{-1} S_y )_r (G^T Γ_obs^{-1} S_y)^†,

where † denotes the pseudoinverse operator. A closer look at [31, Theorem 2.1] reveals that another minimizer of (A.19), itself not necessarily of minimum Frobenius norm, is given by

(A.21)    B = S_pos ( S_pos^{-1} (Γ_pr − Γ_pos) G^T Γ_obs^{-1} S_y )_r (S_pr^T G^T Γ_obs^{-1} S_y)^† S_pr^T.
By Lemma A.3,

S_pr^T G^T Γ_obs^{-1} S_y = W [ D (I + D^T D)^{1/2} ] V^T,
S_pos^{-1} Γ_pr G^T Γ_obs^{-1} S_y = W [ (I + DD^T)^{1/2} D (I + D^T D)^{1/2} ] V^T,
S_pos^{-1} Γ_pos G^T Γ_obs^{-1} S_y = W [ (I + DD^T)^{-1/2} D (I + D^T D)^{1/2} ] V^T,

and therefore (S_pr^T G^T Γ_obs^{-1} S_y)^† = Σ_{i=1}^q δ_i^{-1} (1 + δ_i²)^{-1/2} v_i w_i^T for q = rank(S_H), whereas

( S_pos^{-1} (Γ_pr − Γ_pos) G^T Γ_obs^{-1} S_y )_r = Σ_{i=1}^r δ_i³ w_i v_i^T.

Inserting these expressions back into (A.21), we obtain

B = S_pr ( Σ_{i=1}^r δ_i² (1 + δ_i²)^{-1} w_i w_i^T ) S_pr^T,

where w_i is the ith column of W, v_i is the ith column of V, and δ_i is the ith diagonal entry of D. Notice that (δ_i², ŵ_i), with ŵ_i = S_pr w_i, are the generalized eigenpairs of (H, Γ_pr^{-1}) (cf. proof of Theorem 2.3). Hence, by Theorem 2.3, we recognize the optimal approximation of Γ_pos as Γ̂_pos = Γ_pr − B. Plugging this expression back into (A.21) gives (4.7). The minimum Bayes risk in (ii) follows readily using the optimal estimator given by (4.7) in (A.14).
Acknowledgments. We thank Jere Heikkinen of Bintec Ltd. for providing us
with the code used in Example 2. We also thank Julianne and Matthias Chung for
many insightful discussions.
REFERENCES
[1] V. Akçelik, G. Biros, O. Ghattas, J. Hill, D. Keyes, and B. van Bloemen Waanders,
Parallel algorithms for PDE-constrained optimization, Parallel Process. Sci. Comput., 20
(2006), p. 291.
[2] H. Auvinen, J. M. Bardsley, H. Haario, and T. Kauranne, Large-scale Kalman filtering using the limited memory BFGS method, Electron. Trans. Numer. Anal., 35 (2009),
pp. 217–233.
[3] H. Auvinen, J. M. Bardsley, H. Haario, and T. Kauranne, The variational Kalman filter
and an efficient implementation using limited memory BFGS, Internat. J. Numer. Methods
Fluids, 64 (2010), pp. 314–335.
[4] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Templates for the
Solution of Algebraic Eigenvalue Problems: A Practical Guide, Software Environ. Tools
11, SIAM, Philadelphia, 2000.
[5] R. Barrett, M. W. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout,
R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear
Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994.
[6] T. Bui-Thanh, C. Burstedde, O. Ghattas, J. Martin, G. Stadler, and L. Wilcox, Extreme-scale UQ for Bayesian inverse problems governed by PDEs, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis,
IEEE Computer Society Press, 2012, p. 3.
[7] T. Bui-Thanh and O. Ghattas, Analysis of the Hessian for inverse scattering problems: I.
Inverse shape scattering of acoustic waves, Inverse Problems, 28 (2012), 055001.
[8] T. Bui-Thanh, O. Ghattas, and D. Higdon, Adaptive Hessian-based nonstationary Gaussian
process response surface method for probability density approximation with application
to Bayesian solution of large-scale inverse problems, SIAM J. Sci. Comput., 34 (2012),
pp. A2837–A2871.
[9] T. Bui-Thanh, O. Ghattas, J. Martin, and G. Stadler, A computational framework for
infinite-dimensional Bayesian inverse problems part I: The linearized case, with application to global seismic inversion, SIAM J. Sci. Comput., 35 (2013), pp. A2494–A2523.
[10] D. Calvetti, Preconditioned iterative methods for linear discrete ill-posed problems from a
Bayesian inversion perspective, J. Comput. Appl. Math., 198 (2007), pp. 378–395.
[11] D. Calvetti, B. Lewis, and L. Reichel, On the regularizing properties of the GMRES method,
Numer. Math., 91 (2002), pp. 605–625.
[12] D. Calvetti, D. McGivney, and E. Somersalo, Left and right preconditioning for electrical
impedance tomography with structural information, Inverse Problems, 28 (2012), 055015.
[13] D. Calvetti and E. Somersalo, Priorconditioners for linear systems, Inverse Problems, 21
(2005), p. 1397.
[14] B. P. Carlin and T. A. Louis, Bayesian Methods for Data Analysis, 3rd ed., Chapman &
Hall/CRC Press, Boca Raton, FL, 2009.
[15] R. A. Christensen, Plane Answers to Complex Questions: The Theory of Linear Models,
Springer-Verlag, Berlin, 1987.
[16] J. Chung and M. Chung, Computing optimal low-rank matrix approximations for image
processing, in Proceedings of the Asilomar Conference on Signals, Systems and Computers,
IEEE, New York, 2013, pp. 670–674.
[17] J. Chung and M. Chung, An efficient approach for computing optimal low-rank regularized
inverse matrices, Inverse Problems, 30 (2014), 114009.
[18] J. Chung, M. Chung, and D. P. O’Leary, Designing optimal spectral filters for inverse
problems, SIAM J. Sci. Comput., 33 (2011), pp. 3132–3152.
[19] J. Chung, M. Chung, and D. P. O’Leary, Optimal filters from calibration data for image
deconvolution with data acquisition error, J. Math. Imaging Vis., 44 (2012), pp. 366–374.
[20] J. Chung, M. Chung, and D. P. O’Leary, Optimal regularized low rank inverse approximation, Linear Algebra Appl., 468 (2015), pp. 260–269.
[21] T. Cui, K. J. H. Law, and Y. Marzouk, Dimension-independent likelihood-informed MCMC,
arXiv:1411.3688, 2014.
[22] T. Cui, J. Martin, Y. Marzouk, A. Solonen, and A. Spantini, Likelihood-informed dimension reduction for nonlinear inverse problems, Inverse Problems, 30 (2014), 114015.
[23] J. Cullum and W. Donath, A block Lanczos algorithm for computing the q algebraically
largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices, in Proceedings of the IEEE Conference on Decision and Control Including the 13th
Symposium on Adaptive Processes, IEEE, New York, 1974, pp. 505–509.
[24] C. R. Dietrich and G. N. Newsam, Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix, SIAM J. Sci. Comput., 18
(1997), pp. 1088–1107.
[25] L. Dykes and L. Reichel, Simplified GSVD computations for the solution of linear discrete
ill-posed problems, J. Comput. Appl. Math., 255 (2014), pp. 15–27.
[26] C. Eckart and G. Young, The approximation of one matrix by another of lower rank, Psychometrika, 1 (1936), pp. 211–218.
[27] G. Evensen, Data Assimilation, Springer, Berlin, 2007.
[28] H. P. Flath, L. Wilcox, V. Akçelik, J. Hill, B. van Bloemen Waanders, and O. Ghattas, Fast algorithms for Bayesian uncertainty quantification in large-scale linear inverse
problems based on low-rank partial Hessian approximations, SIAM J. Sci. Comput., 33
(2011), pp. 407–432.
[29] W. Förstner and B. Moonen, A metric for covariance matrices, in Geodesy: The Challenge
of the 3rd Millennium, Springer, Berlin, 2003, pp. 299–309.
[30] C. Fox and A. Parker, Convergence in variance of Chebyshev accelerated Gibbs samplers,
SIAM J. Sci. Comput., 36 (2014), pp. A124–A147.
[31] S. Friedland and A. Torokhti, Generalized rank-constrained matrix approximations, SIAM
J. Matrix Anal. Appl., 29 (2007), pp. 656–659.
[32] G. H. Golub and C. F. Van Loan, Matrix Computations, Vol. 3, Johns Hopkins University
Press, Baltimore, 2012.
[33] G. H. Golub and R. Underwood, The block Lanczos method for computing eigenvalues,
Math. Software, 3 (1977), pp. 361–377.
[34] G. H. Golub and Q. Ye, An inverse free preconditioned Krylov subspace method for symmetric
generalized eigenvalue problems, SIAM J. Sci. Comput., 24 (2002), pp. 312–334.
[35] N. Halko, P. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011),
pp. 217–288.
[36] M. Hanke, Conjugate Gradient Type Methods for Ill-Posed Problems, Pitman Res. Notes in
Math. Ser. 327, Longman, Harlow, UK, 1995.
[37] M. Hanke and P. C. Hansen, Regularization methods for large-scale problems, Surv. Math.
Ind., 3 (1993), pp. 253–315.
[38] P. C. Hansen, Regularization, GSVD and truncated GSVD, BIT, 29 (1989), pp. 491–504.
[39] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear
Inversion, Monogr. Math. Model. Comput. 4, SIAM, Philadelphia, 1998.
[40] J. Heikkinen, Statistical Inversion Theory in X-Ray Tomography, Master’s thesis, Lappeenranta University of Technology, Finland, 2008.
[41] M. R. Hestenes, Pseudoinverses and conjugate gradients, Comm. ACM, 18 (1975), pp. 40–43.
[42] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems,
Vol. 49, National Bureau of Standards Washington, DC, 1952.
[43] L. Homa, D. Calvetti, A. Hoover, and E. Somersalo, Bayesian preconditioned CGLS for
source separation in MEG time series, SIAM J. Sci. Comput., 35 (2013), pp. B778–B798.
[44] Y. Hua and W. Liu, Generalized Karhunen-Loève transform, IEEE Signal Process. Lett., 5
(1998), pp. 141–142.
[45] W. James and C. Stein, Estimation with quadratic loss, in Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, Vol. 1, 1961, pp. 361–379.
[46] W. Kahan and B. N. Parlett, How Far Should You Go with the Lanczos Process, Report No.
UCB/ERL-M78/48, Electronics Research Lab, University of California, Berkeley, 1978.
[47] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, Springer,
Berlin, 2005.
[48] R. E. Kalman, A new approach to linear filtering and prediction problems, J. Fluids Engrg.,
82 (1960), pp. 35–45.
[49] S. Kaniel, Estimates for some computational techniques in linear algebra, Math. Comp., 20
(1966), pp. 369–378.
[50] C. Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem of Linear Differential and Integral Operators, U.S. Government Press Office, 1950.
[51] E. L. Lehmann and G. Casella, Theory of Point Estimation, Springer-Verlag, Berlin, 1998.
[52] R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users’ Guide: Solution of Large-Scale
Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, Software Environ. Tools 6, SIAM, Philadelphia, 1998.
[53] A. S. Lewis, Derivatives of spectral functions, Math. Oper. Res., 21 (1996), pp. 576–588.
[54] W. Li and O. A. Cirpka, Efficient geostatistical inverse methods for structured and unstructured grids, Water Resources Res., 42 (2006), W06402.
[55] C. Lieberman, K. Fidkowski, K. Willcox, and B. van Bloemen Waanders, Hessian-based
model reduction: Large-scale inversion and prediction, Internat. J. Numer. Methods Fluids,
71 (2013), pp. 135–150.
[56] C. Lieberman, K. Willcox, and O. Ghattas, Parameter and state model reduction for large-scale statistical inverse problems, SIAM J. Sci. Comput., 32 (2010), pp. 2523–2542.
[57] F. Lindgren, H. Rue, and J. Lindström, An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach, J. R.
Stat. Soc. Ser. B, 73 (2011), pp. 423–498.
[58] D. V. Lindley and A. F. M. Smith, Bayes estimates for the linear model, J. R. Stat. Soc.
Ser. B, 34 (1972), pp. 1–41.
[59] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization,
Math. Program., 45 (1989), pp. 503–528.
[60] J. Martin, L. Wilcox, C. Burstedde, and O. Ghattas, A stochastic Newton MCMC method
for large-scale statistical inverse problems with application to seismic inversion, SIAM J.
Sci. Comput., 34 (2012), pp. A1460–A1487.
[61] Y. Marzouk and H. N. Najm, Dimensionality reduction and polynomial chaos acceleration
of Bayesian inference in inverse problems, J. Comput. Phys., 228 (2009), pp. 1862–1902.
[62] X. Meng, M. A. Saunders, and M. W. Mahoney, LSRN: A parallel iterative solver for
strongly over- or underdetermined systems, SIAM J. Sci. Comput., 36 (2014), pp. C95–C118.
[63] E. H. Moore, On the reciprocal of the general algebraic matrix, Bull. Amer. Math. Soc., 26
(1920), pp. 394–395.
[64] J. Nocedal, Updating quasi-Newton matrices with limited storage, Math. Comp., 35 (1980),
pp. 773–782.
[65] C. C. Paige, The Computation of Eigenvalues and Eigenvectors of Very Large Sparse Matrices,
Ph.D. thesis, University of London, 1971.
[66] C. C. Paige, Computational variants of the Lanczos method for the eigenproblem, IMA J.
Appl. Math., 10 (1972), pp. 373–381.
[67] C. C. Paige, Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix,
IMA J. Appl. Math., 18 (1976), pp. 341–349.
[68] C. C. Paige and M. A. Saunders, Towards a generalized singular value decomposition, SIAM
J. Numer. Anal., 18 (1981), pp. 398–405.
[69] C. C. Paige and M. A. Saunders, Algorithm 583: LSQR: Sparse linear equations and least
squares problems, ACM Trans. Math. Software, 8 (1982), pp. 195–209.
[70] L. Pardo, Statistical Inference Based on Divergence Measures, CRC Press, Boca Raton, FL,
2005.
[71] A. Parker and C. Fox, Sampling Gaussian distributions in Krylov spaces with conjugate
gradients, SIAM J. Sci. Comput., 34 (2012), pp. B312–B334.
[72] R. Penrose, A generalized inverse for matrices, Math. Proc. Cambridge Philos. Soc., 51 (1955),
pp. 406–413.
[73] Y. Saad, On the rates of convergence of the Lanczos and the block-Lanczos methods, SIAM J.
Numer. Anal., 17 (1980), pp. 687–706.
[74] A. H. Sameh and J. A. Wisniewski, A trace minimization algorithm for the generalized
eigenvalue problem, SIAM J. Numer. Anal., 19 (1982), pp. 1243–1259.
[75] C. Schillings and C. Schwab, Scaling Limits in Computational Bayesian Inversion, Research
report 2014-26, ETH-Zürich, 2014.
[76] A. Solonen, H. Haario, J. Hakkarainen, H. Auvinen, I. Amour, and T. Kauranne, Variational ensemble Kalman filtering using limited memory BFGS, Electron. Trans. Numer.
Anal., 39 (2012), pp. 271–285.
[77] D. Sondermann, Best approximate solutions to matrix equations under rank restrictions,
Statist. Papers, 27 (1986), pp. 57–66.
[78] G. W. Stewart, The efficient generation of random orthogonal matrices with an application
to condition estimators, SIAM J. Numer. Anal., 17 (1980), pp. 403–409.
[79] A. M. Stuart, Inverse problems: A Bayesian perspective, Acta Numer., 19 (2010),
pp. 451–559.
[80] A. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet
Math. Dokl., 5 (1963), pp. 1035–1038.
[81] C. F. Van Loan, Generalizing the singular value decomposition, SIAM J. Numer. Anal., 13
(1976), pp. 76–83.
[82] A. T. A. Wood and G. Chan, Simulation of stationary Gaussian processes in [0, 1]^d, J.
Comput. Graph. Statist., 3 (1994), pp. 409–432.
[83] S. J. Wright and J. Nocedal, Numerical Optimization, Vol. 2, Springer, New York, 1999.
[84] Y. Yue and P. L. Speckman, Nonstationary spatial Gaussian Markov random fields, J. Comput. Graph. Statist., 19 (2010), pp. 96–116.