Making Models from Data:
A (very) Basic Overview of Parameter Estimation and Inverse Theory
or
Four Centuries of Linear Algebra in 10 Equations
Rick Aster, Professor of Geophysics, NM Tech
A (one might say the) fundamental problem in science is finding a mathematical representation, hopefully
with predictive or insight benefits, that describes a physical phenomenon of interest. In seismology, one
such classic problem is determining the seismic velocity structure of the Earth from travel time or other
measurements made on seismograms (seismic tomography).
In a general mathematical form, we can write, or at least conceptualize, a forward problem, which
mathematically describes how (we hope) nature produces a finite set of data (such as the travel times for
seismic rays traveling through Earth’s interior) as
d = G(m)
(1)
where we have the data described as a column vector of numbers, d = (d_1, d_2, ...)^T, m = (m_1, m_2, ...)^T is
another column vector (not necessarily the same length as d) that describes a model of interest (e.g., the
seismic velocity structure of the Earth as a function of position, x, y, z), and G is some mathematical
function, or operator, that maps m to d. Our task in parameter estimation and inverse theory is to reverse
(or invert) the forward problem to find m given d when we know the physics well enough to usefully
mathematically characterize G.
The ur-problem in this field of mathematics is one that you are probably at least somewhat familiar with,
so-called linear regression. In linear regression the problem is (mathematically) linear. Linear problems
obey the principle of superposition. This means, among other useful things, that we can write the forward
problem as
d = Gm
(2)
where the operator G is now a matrix of numbers (appropriately sized so it has the same number of
columns as the length of m and the same number of rows as the length of d). In linear algebra parlance,
we say that the elements of d are each a linear combination of the elements of m, e.g.,
d_1 = G_{11} m_1 + G_{12} m_2 + ... + G_{1n} m_n
(3)
where n is the number of columns in G (and is also the number of elements in m). The problem at hand
then becomes one of inverting this system of equations to meaningfully find m when G and d are given.
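Before turning to the inverse problem, here is what the forward problem (2) looks like as a MATLAB computation (a minimal sketch; all of the numbers are made up):
G = [1 2; 3 4; 5 6];   % forward operator: 3 rows (data) by 2 columns (model parameters)
m = [0.5; -1];         % a model vector
d = G*m                % predicted data: each element of d is a linear combination of the elements of m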
We can write this inverse problem for (2) as
m = G^{-g} d
(4)
where G^{-g} is some sort of inverse matrix that “undoes” the effect of G. If G is square and nonsingular
(that is, if any conceivable d can be produced by a linear combination of the columns of G), then there
is only one solution, which is G^{-g} = G^{-1}, the standard inverse matrix of linear algebra that arises when we
solve n equations in n unknowns and there is just one unique solution that will satisfy (2) exactly! There
are standard formulas for finding the inverse matrix in this case. Again, this is essentially the problem
you may have been familiar with since middle school of solving “n equations in n unknowns”, although n
can become really large (thousands or even millions).
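For example, here is a (made-up) two-equations-in-two-unknowns system solved in MATLAB; the backslash operator does the work of the inverse matrix:
G = [2 1; 1 3];   % square, nonsingular G (2 equations in 2 unknowns; numbers are hypothetical)
d = [5; 10];
m = G\d           % the unique solution; equivalent to inv(G)*d, but better numerically
G*m               % reproduces d exactly (to rounding error)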
Things get more interesting when G is singular (so that there are an infinite number of equally acceptable
solutions) and/or is a non-square matrix (with n columns and m rows, and n ≠ m). In linear regression, we
have a G matrix that has more rows than columns. This means that we have more constraint equations
in the forward problem (more elements in our data vector d) than elements in our model m. Think for
example of solving for two unknowns using three equations. Typically such systems are inconsistent,
meaning that they do not have exact solutions.
Inconsistent problems require some sort of approximate solution method that “best” satisfies the
constraint equations (best fits the data). How do we decide what is “best”? This is done by specifying
some sort of misfit measure between observed data and data predicted by (2), and seeking a solution m
that minimizes that measure. The most commonly used such measure is “least squares”, where we
minimize the length of the residual between observed and predicted data. Mathematically, we state this
problem as “find an m so that
|| d - Gm || = || r ||
(5)
is as small as it can possibly be”, where the vector r describing the difference between observed and
predicted data is called the residual vector and the bars just indicate the Euclidean, or 2-norm, length of a
vector
|| r || = ( r_1^2 + r_2^2 + ... + r_m^2 )^{1/2}
(6)
(note that the italic integer subscript m above is just the number of elements in d, and is not to be
confused with the model vector m or its elements m_i).
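In MATLAB, the residual (5) and its 2-norm length (6) for any trial model are computed directly (again, a sketch with made-up numbers):
G = [1 0; 0 1; 1 1];     % 3 constraint equations, 2 model parameters (hypothetical)
d = [1.1; 2.0; 2.9];     % "observed" data
m_trial = [1; 2];        % some trial model
r = d - G*m_trial;       % residual vector, as in (5)
misfit = norm(r)         % its Euclidean (2-norm) length, as in (6)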
Here’s a neat way to solve the least squares minimization problem that you can find in Appendix A of
Parameter Estimation and Inverse Problems (you can also solve it a variety of other ways, including
using calculus, of course). Given an inconsistent system of equations in the form of (2), we make the
astute observation that the set of all possible matrix-vector products Gm forms only a subspace of the space R^m containing all
possible m-length vectors in which an arbitrary d lives. This is because there are only n < m columns in G.
All of the possible vectors (predicted data from any model m) that could be produced by Gm are thus
linear combinations of only n basis vectors, and it requires at least m (which is > n here) vectors to
construct every conceivable vector in the m-dimensional space containing all the possible vectors d. Let
Gm = proj_{R(G)} d
(7)
by which we mean that G times m is the vector that we get by projecting d onto the (at most n-dimensional)
subspace that we can get to via linear combinations of the columns of G (see Figure 1). The
fact that we cannot exactly fit an arbitrary data set in an inconsistent set of equations is thus seen
geometrically to be a consequence of the dimension of the subspace spanned by the columns of G, often
called the column space or range of G, or simply R(G) as in (7), i.e., the range of G is the space
composed of all vectors that can be represented by
m_1 G_{.,1} + m_2 G_{.,2} + ... + m_n G_{.,n}
(8)
for all possible coefficients m_i, where G_{.,i} represents the ith column of G. Because any model m can only
produce a predicted data vector Gm in the subspace R(G), the best that we can do in solving the problem
in the least-squares sense is to fit the part of d that is in R(G), i.e., there is no component of the projection
of d in the subspace perpendicular to R(G) that we could ever fit to get a “better” solution with any
product of G and m. This is equivalent to saying that proj_{R(G)} d is as close as we can get to d in the least-squares
sense. This is easy to see if we consider a “standard” basis of vectors for R(G) and R^m, each
consisting of a 1 in a unique single entry position and zeros everywhere else. In this case we can fit the n
elements of d that lie in R(G) but not the remaining m-n elements.
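You can see this projection numerically in MATLAB: the built-in function orth returns an orthonormal basis for R(G), and projecting d onto that basis splits it into the part we can fit and the part we never can (a sketch with made-up numbers):
G = [1 0; 0 1; 1 1];      % hypothetical 3x2 system
d = [1.1; 2.0; 2.9];
Q = orth(G);              % orthonormal basis for the column space R(G)
d_fit  = Q*(Q'*d);        % proj_{R(G)} d: the part of d that some Gm can match
d_left = d - d_fit;       % the component orthogonal to R(G) that no Gm can ever fit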
The misfit between Gm and d (5) for the least squares solution m_ls must therefore be in the subspace that
is at right angles (or orthogonal) to R(G). This means that the dot product between any vector in R(G)
and a misfit vector lying totally outside of (i.e., orthogonal to) R(G) will be zero. This means that
G^T r = G^T (d - G m_ls) = 0
(9)
where G^T means the transpose of G (i.e., (G^T)_{ij} = G_{ji}) and the operation G^T r just generates the dot products
between the columns of G and r that must all be zero by the above argument. The n equations of (9) are
called the normal equations.
Multiplying out (9) and recognizing that G^T G is an n by n square matrix gives, provided (G^T G)^{-1} exists,
the normal equations solution: the solution that best fits the data in the sense that it minimizes the
sum of the squares of the differences between the actual data and the data predicted by the forward
problem,
m_ls = (G^T G)^{-1} G^T d .
(10)
If the columns of G are linearly independent (i.e., if none of them can be constructed as a linear
combination of the others), then it can be shown that (G^T G)^{-1} always exists and (10) is a unique solution.
The normal equations provide a general way to solve a linear regression problem, such as fitting data to a
line in the least squares sense (try it; you need only come up with the suitable form for G and then use
(10)!).
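Here is a minimal MATLAB sketch of exactly that line-fitting exercise, with made-up x and d values:
x = [1; 2; 3; 4; 5];                 % hypothetical abscissa values
d = [2.1; 3.9; 6.2; 7.8; 10.1];      % hypothetical "observations" to fit with d = m1 + m2*x
G = [ones(size(x)) x];               % each row of G is one constraint equation
mls = (G'*G)\(G'*d)                  % the normal-equations solution (10)
% (MATLAB's backslash, mls = G\d, returns the same least-squares answer via QR factorization)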
The other canonical situation for a linear system of equations is if the G matrix has more columns than
rows. In this case we must impose additional constraints on the minimization problem to solve for a
unique solution (this is called biasing the solution). In geophysical inverse theory, these extra constraints
commonly take the form of smoothness conditions (this approach, called regularization, is also
commonly dictated by issues of wild solution instability in the presence of data noise that we don’t have
time to go into here, but can discuss when time permits!).
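As one concrete illustration (a sketch only; the specific damped, or zeroth-order Tikhonov, form below is not developed in this overview, and the numbers are made up), an underdetermined system can be biased toward a small-norm model like this:
G = [1 1 0 0; 0 0 1 1];                  % hypothetical underdetermined system: 2 equations, 4 unknowns
d = [3; 7];
alpha = 0.1;                             % damping (regularization) parameter, chosen by the user
m_reg = (G'*G + alpha^2*eye(4))\(G'*d)   % biased toward the smallest model consistent with the data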
A final key point is that it is not always the best strategy to minimize the 2-norm misfit measure (5).
Least-squares solutions, it turns out, are notoriously susceptible to wild “outlier” data points. There are
other misfit measures, such as the 1-norm (the sum of the absolute values of the elements in r), that are
much more outlier-resistant and are hence charmingly referred to as robust.
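A quick MATLAB comparison of the two measures for a made-up residual vector containing one outlier (note that actually minimizing the 1-norm requires iterative algorithms that are beyond this overview):
r = [0.1; -0.2; 0.05; 3.0];   % hypothetical residual vector with one wild outlier
norm(r, 2)                    % 2-norm: the squared outlier dominates the measure
norm(r, 1)                    % 1-norm: sum of absolute values, much less dominated by the outlier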
Figure 1. The geometric relationship between d and Gm for m = m_ls, the least squares solution (the solution that
has the minimum 2-norm residual (6)) to a problem with more constraint equations, m, than model parameters, n, so
that G has m rows and n columns. A general data vector d lies in the m-dimensional space R^m but any model, m,
can only actually fit the projection of d that lies in the lower-dimensional range of G, R(G), that is spanned by the (n
< m) columns of G.
Reference
Aster, R., Borchers, B., Thurber, C., Parameter Estimation and Inverse Problems, 301 pp, Elsevier
Academic Press, 2004.
Exercise
Figure 2. A simple tomography example.
Consider a four-block model of a square region, 200 m on a side, where we make travel time measurements,
t1, …, t5 in five directions as shown in Figure 2. The slowness (the reciprocal of the seismic velocity) in
each region is parameterized as S_{11}, S_{12}, S_{21}, S_{22}, as also depicted in the figure (we parameterize the model
in terms of slowness instead of velocity because it results in a linear system of equations, as we’ll see).
Each travel time measurement has a forward model associated with it. For example
t_1 = S_{11} · 100 + S_{12} · 100
(11)
where the slownesses are specified in seconds/meter and the time is in seconds. The complete
(overdetermined) system of (m=5) constraint equations (in n=4 unknowns) is thus:
(t_1, t_2, t_3, t_4, t_5)^T = G (S_{11}, S_{12}, S_{21}, S_{22})^T
(12)
where the elements G_{ij} are all specified by the raypath geometry of the experiment. Your assignment is:
1) Find the elements of G.
2) Solve for a least-squares solution using the normal equations (10) and MATLAB (a skeleton script is given after this list) if:
t_1 = 0.1783 s
t_2 = 0.1896 s
t_3 = 0.2008 s
t_4 = 0.1535 s
t_5 = 0.2523 s
Note that there is some random noise in these times, so the system of equations (12) is
inconsistent (it has no exact solution).
3) Convert your slownesses to velocities; where is the region seismically faster or slower?
4) Calculate the residual vector, r (6). How well does your model actually fit the data on average?
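A skeleton of the MATLAB computation for parts 2) through 4) might look like the following; the G matrix is left for you to fill in from the raypath geometry of Figure 2:
t = [0.1783; 0.1896; 0.2008; 0.1535; 0.2523];   % travel time data (s)
G = zeros(5, 4);       % 5 constraint equations, 4 slownesses: fill in the nonzero path lengths (m) yourself!
mls = (G'*G)\(G'*t);   % least-squares slownesses (s/m) via the normal equations (10)
v = 1./mls;            % part 3): convert slownesses to velocities (m/s)
r = t - G*mls;         % part 4): residual vector
rms = norm(r)/sqrt(5)  % root-mean-square travel time misfit (s)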