Making Models from Data: A (very) Basic Overview of Parameter Estimation and Inverse Theory,
or, Four Centuries of Linear Algebra in 10 Equations

Rick Aster, Professor of Geophysics, NM Tech

A (one might say the) fundamental problem in science is finding a mathematical representation, hopefully one with predictive power or physical insight, that describes a physical phenomenon of interest. In seismology, one such classic problem is determining the seismic velocity structure of the Earth from travel time or other measurements made on seismograms (seismic tomography).

In a general mathematical form, we can write, or at least conceptualize, a forward problem, which mathematically describes how (we hope) nature produces a finite set of data (such as the travel times for seismic rays traveling through Earth's interior) as

d = G(m)     (1)

where the data are described as a column vector of numbers, d = (d_1, d_2, ...)^T, m = (m_1, m_2, ...)^T is another column vector (not necessarily the same length as d) that describes a model of interest (e.g., the seismic velocity structure of the Earth as a function of position x, y, z), and G is some mathematical function, or operator, that maps m to d. Our task in parameter estimation and inverse theory is to reverse (or invert) the forward problem to find m given d, when we know the physics well enough to usefully characterize G mathematically.

The ur-problem in this field of mathematics is one that you are probably at least somewhat familiar with: so-called linear regression. In linear regression the problem is (mathematically) linear. Linear problems obey the principle of superposition. This means, among other useful things, that we can write the forward problem as

d = Gm     (2)

where the operator G is now a matrix of numbers (appropriately sized so that it has the same number of columns as the length of m and the same number of rows as the length of d). In linear algebra parlance, we say that the elements of d are each a linear combination of the elements of m, e.g.,

d_1 = G_11 m_1 + G_12 m_2 + ... + G_1n m_n     (3)

where n is the number of columns in G (and is also the number of elements in m).

The problem at hand then becomes one of inverting this system of equations to meaningfully find m when G and d are given. We can write this inverse problem for (2) as

m = G^-g d     (4)

where G^-g is some sort of inverse matrix that "undoes" the effect of G. If G is square and nonsingular (that is, if every conceivable d can be produced by exactly one linear combination of the columns of G), then there is only one solution, G^-g = G^-1, the standard inverse matrix of linear algebra that arises when we solve n equations in n unknowns and there is just one unique solution that satisfies (2) exactly! There are standard formulas for finding the inverse matrix in this case. Again, this is essentially the problem you may have been familiar with since middle school of solving "n equations in n unknowns", although n can become really large (thousands or even millions).

Things get more interesting when G is singular (so that there are an infinite number of equally acceptable solutions) and/or is a non-square matrix (with n columns and m rows, n ≠ m). In linear regression, we have a G matrix that has more rows than columns. This means that we have more constraint equations in the forward problem (more elements in our data vector d) than elements in our model m. Think, for example, of solving for two unknowns using three equations. Typically such systems are inconsistent, meaning that they do not have exact solutions.
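As a concrete illustration of such an inconsistent system, here is a minimal MATLAB sketch of three constraint equations in two unknowns, set up as fitting a line y = m_1 + m_2 x through three points (the numbers are entirely hypothetical and are not from the text):

% Hypothetical example: three equations, two unknowns (fitting y = m1 + m2*x)
x = [1; 2; 3];            % made-up observation locations
d = [1.1; 1.9; 3.2];      % made-up noisy "data"
G = [ones(3,1), x];       % 3 rows (constraint equations), 2 columns (model parameters)
rank(G)                   % = 2: the columns of G span only a 2-dimensional subspace of R^3
rank([G, d])              % = 3: d lies outside that subspace, so Gm = d has no exact solution

Because this d cannot be written as a linear combination of the two columns of G, no choice of m satisfies all three equations at once.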
Inconsistent problems require some sort of approximate solution method that "best" satisfies the constraint equations (best fits the data). How do we decide what is "best"? We do this by specifying some misfit measure between observed data and the data predicted by (2), and seeking a solution m that minimizes that measure. The most commonly used such measure is "least squares", where we minimize the length of the residual between observed and predicted data. Mathematically, we state this problem as "find an m so that

|| d - Gm || = || r ||     (5)

is as small as it can possibly be", where the vector r = d - Gm describing the difference between observed and predicted data is called the residual vector, and the bars just indicate the Euclidean, or 2-norm, length of a vector,

|| r || = (r_1^2 + r_2^2 + ... + r_m^2)^(1/2)     (6)

(note that the italic integer subscript m above is just the number of elements in d, and is not to be confused with the model vector m or its elements m_i).

Here's a neat way to solve the least squares minimization problem that you can find in Appendix A of Parameter Estimation and Inverse Problems (you can also solve it a variety of other ways, including using calculus, of course). Given an inconsistent system of equations in the form of (2), we make the astute observation that the set of all possible matrix-vector products Gm forms only a subspace of the full m-dimensional space R^m in which an arbitrary data vector d lives. This is because there are only n < m columns in G. All of the possible vectors (predicted data from any model m) that could be produced by Gm are thus linear combinations of only n basis vectors, and it requires at least m (which is > n here) vectors to construct every conceivable vector in the m-dimensional space containing all possible vectors d. Let

Gm = proj_R(G) d     (7)

by which we mean that G times m is the vector that we get by projecting d onto the (at most n-dimensional) subspace that we can reach via linear combinations of the columns of G (see Figure 1). The fact that we cannot exactly fit an arbitrary data set in an inconsistent set of equations is thus seen geometrically to be a consequence of the dimension of the subspace spanned by the columns of G, often called the column space or range of G, or simply R(G) as in (7). That is, the range of G is the space composed of all vectors that can be represented by

m_1 G_{.,1} + m_2 G_{.,2} + ... + m_n G_{.,n}     (8)

for all possible coefficients m_i, where G_{.,i} represents the ith column of G.

Because any model m can only produce a predicted data vector Gm in the subspace R(G), the best that we can do in solving the problem in the least-squares sense is to fit the part of d that is in R(G); i.e., there is no component of d lying in the subspace perpendicular to R(G) that we could ever fit to get a "better" solution with any product of G and m. This is equivalent to saying that proj_R(G) d is as close as we can get to d in the least-squares sense. This is easy to see if we consider a "standard" basis of vectors for R(G) and R^m, each consisting of a 1 in a unique single entry position and zeros everywhere else. In this case we can fit the n elements of d that lie in R(G) but not the remaining m - n elements. The misfit between Gm and d (5) for the least squares solution m_ls must therefore lie in the subspace that is at right angles (or orthogonal) to R(G). This means that the dot product between any vector in R(G) and a misfit vector lying entirely outside of (orthogonal to) R(G) will be zero.
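To make this geometry concrete, here is a brief MATLAB check, continuing the hypothetical three-equation example sketched earlier (not an example from the text), that the unfittable part of d is indeed orthogonal to every column of G:

G = [1 1; 1 2; 1 3];  d = [1.1; 1.9; 3.2];   % hypothetical 3-by-2 system from the earlier sketch
[Q, ~] = qr(G, 0);     % thin QR factorization: columns of Q are an orthonormal basis for R(G)
d_proj = Q * (Q' * d); % projection of d onto R(G): the part of d that some model could fit
r = d - d_proj;        % the leftover part of d, perpendicular to R(G)
G' * r                 % dot products of the columns of G with r: zero to rounding error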
In other words, orthogonality of the residual r to every column of G means that

G^T r = G^T (d - G m_ls) = 0     (9)

where G^T means the transpose of G (i.e., (G^T)_ij = G_ji) and the operation G^T r just generates the dot products between the columns of G and r, which must all be zero by the above argument. The n equations of (9) are called the normal equations. Multiplying out (9) and recognizing that G^T G is an n by n square matrix gives, provided (G^T G)^-1 exists, the normal equations solution

m_ls = (G^T G)^-1 G^T d .     (10)

This is the solution that best fits the data in the sense that it minimizes the sum of the squares of the differences between the actual data and the data predicted by the forward problem. If the columns of G are linearly independent (i.e., if none of them can be constructed as a linear combination of the others), then it can be shown that (G^T G)^-1 always exists and (10) is a unique solution. The normal equations provide a general way to solve a linear regression problem, such as fitting data to a line in the least squares sense (try it; you need only come up with the suitable form for G and then use (10)!).

The other canonical situation for a linear system of equations is when the G matrix has more columns than rows. In this case we must impose additional constraints on the minimization problem to obtain a unique solution (this is called biasing the solution). In geophysical inverse theory, these extra constraints commonly take the form of smoothness conditions (this approach, called regularization, is also commonly dictated by issues of wild solution instability in the presence of data noise that we don't have time to go into here, but can discuss when time permits!).

A final key point is that it is not always the best strategy to minimize the 2-norm misfit measure (5). Least-squares solutions, it turns out, are notoriously susceptible to wild "outlier" data points. There are other misfit measures, such as the 1-norm (the sum of the absolute values of the elements in r), that are much more outlier-resistant and are hence charmingly referred to as robust.

Figure 1. The geometric relationship between d and Gm for m = m_ls, the least squares solution (the solution that has the minimum 2-norm residual (6)) to a problem with more constraint equations, m, than model parameters, n, so that G has m rows and n columns. A general data vector d lies in the m-dimensional space R^m, but any model m can only actually fit the projection of d that lies in the lower-dimensional range of G, R(G), which is spanned by the (n < m) columns of G.

Reference

Aster, R., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, 301 pp., Elsevier Academic Press, 2004.

Exercise

Figure 2. A simple tomography example.

Consider a 4-cube model of a square region, 200 m on a side, where we make travel time measurements, t_1, ..., t_5, in five directions as shown in Figure 2. The slowness (the reciprocal of the seismic velocity) in each region is parameterized as S_11, S_12, S_21, S_22, as also depicted in the figure (we parameterize the model in terms of slowness instead of velocity because it results in a linear system of equations, as we'll see). Each travel time measurement has a forward model associated with it. For example,

t_1 = S_11 · 100 + S_12 · 100     (11)

where the slownesses are specified in seconds/meter and the time is in seconds. The complete (overdetermined) system of (m = 5) constraint equations (in n = 4 unknowns) is thus

t = G s ,  where t = (t_1, t_2, t_3, t_4, t_5)^T and s = (S_11, S_12, S_21, S_22)^T ,     (12)

and where the elements G_ij are all specified by the raypath geometry of the experiment.

Your assignment is:

1) Find the elements of G.
2) Solve for a least-squares solution using the normal equations (10) and MATLAB (a sketch of this step follows the assignment) if:

t_1 = 0.1783 s
t_2 = 0.1896 s
t_3 = 0.2008 s
t_4 = 0.1535 s
t_5 = 0.2523 s

Note that there is some random noise in these times, so the system of equations (12) is inconsistent (it has no exact solution).

3) Convert your slownesses to velocities; where is the region seismically faster or slower?

4) Calculate the residual vector r (5) and its 2-norm (6). How well does your model actually fit the data on average?
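For step 2), here is a minimal MATLAB sketch of the normal-equations recipe. It assumes you have already constructed the 5-by-4 matrix G from the raypath geometry of Figure 2 (step 1; its entries are deliberately not given here), with rows ordered t_1 ... t_5 and columns ordered S_11, S_12, S_21, S_22:

% Assumes G (the 5x4 matrix of raypath lengths in meters, from step 1) is already defined.
d = [0.1783; 0.1896; 0.2008; 0.1535; 0.2523];  % observed travel times t1..t5, in seconds
m_ls = (G' * G) \ (G' * d);                    % normal-equations solution (10): slownesses, in s/m
v = 1 ./ m_ls;                                 % step 3): convert slownesses to velocities, in m/s
r = d - G * m_ls;                              % step 4): residual vector
rms_misfit = norm(r) / sqrt(length(d))         % root-mean-square misfit, in seconds

The root-mean-square value gives an average per-measurement misfit in seconds, which is one natural way to answer how well the model fits the data on average.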