[1] CURVE FITTING WITH EXCEL

GLG410: Computers in Earth & Space Exploration Spring 2010, ASU
Instructor: Ed Garnero & Matt Fouch; TA: Jeff Lockridge
Lecture 04
February 9, 2010
Tuesday
Today is our third Excel lecture. Our two central themes are: (1) curve-fitting, and (2) linear algebra
(matrices). We will have a 4th lecture on Excel to further explore linear algebra, since it has become
increasingly common in computational physical sciences.
A note on being computing scientists …and homework
So far, the Excel assignments have come with a fair bit of guidance, even step-by-step recipes.
However, you now have enough tools under your belt to tackle problems without step-by-step
instructions. HW#3 will contain a mix of levels of guidance. As the semester carries on, you will
do more and more of the work without such detailed instructions.
[1] CURVE FITTING WITH EXCEL
Today, we will experiment with curve fitting using Excel. First, we'll explore Excel's built-in
"Trendline" option, which adds trend lines to a data set after a chart has been generated. Later, we
will use an Excel tool called Solver, which lets us fit a data set to more general functional forms,
i.e., with more freedom than a straight line, exponential, logarithmic, etc. After that, we'll
approach it completely from scratch.
Curve fitting using the Add Trendline option when right clicking a data series point in a chart
We have already done this in lecture. All you have to do is to right-click a data point within a chart
then select Add Trendline. There are many options to choose from. Let’s quickly discuss this with the
following spreadsheet: (download this from the web to your disk space)
http://geophysics.asu.edu/cese/files/L04_climate.xls
If you are to be your own computing scientist, you should look into what Excel is assuming regarding
its different trend line options. The equations behind these options can be found from Excel's Help
page:
If the explanations for each option are not clear, you should seek additional information. For example,
do you know what least squares is? It is an integral part of most of the options. If not, just go to
wikipedia.org, type in least squares, then you’ll find:
Least squares
Least squares is a mathematical optimization technique which, when given a
series of measured data, attempts to find a function which closely approximates
the data (a "best fit"). It attempts to minimize the sum of the squares of the ordinate
differences (called residuals) between points generated by the function and
corresponding points in the data. Specifically, it is called least mean squares (LMS)
when the number of measured data is 1 and the gradient descent method is used
to minimize the squared residual. LMS is known to minimize the expectation of the
squared residual, with the smallest operations (per iteration). But it requires a large
number of iterations to converge.
An implicit requirement for the least squares method to work is that errors in each
measurement be randomly distributed. The Gauss-Markov theorem proves that
least square estimators are unbiased and that the sample data do not have to
comply with, for instance, a normal distribution. It is also important that the
collected data be well chosen, so as to allow visibility into the variables to be
solved for (for giving more weight to particular data, refer to weighted least
squares).
The least squares technique is commonly used in curve fitting. Many other
optimization problems can also be expressed in a least squares form, by either
minimizing energy or maximizing entropy.
We will actually do this by hand shortly. Play with the different trendline options. Observe the
results.
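For reference, the least-squares Trendline options are usually written with the functional forms below. This is a summary sketch: the coefficient names (m, b, c, c_1 ... c_n) are generic placeholders, not labels taken from Excel's dialog.

```latex
\begin{aligned}
\text{Linear:}      \quad & y = m x + b \\
\text{Polynomial:}  \quad & y = b + c_1 x + c_2 x^2 + \dots + c_n x^n \\
\text{Logarithmic:} \quad & y = c \ln x + b \\
\text{Exponential:} \quad & y = c\, e^{b x} \\
\text{Power:}       \quad & y = c\, x^{b}
\end{aligned}
```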
Curve fitting using Excel's SOLVER function
If we do not want to fit the data with a line, or with any of Excel's other Trendline options, that’s
not a problem: we can use Solver. To simply fit a line to some data, the Trendline function is
the easiest approach. To fit data to some specific function not contained in Excel’s
Trendline options, we'd try Solver. Solver finds the values that best satisfy the constraints we
enter. Let's look at a simple example. Download the following file from the web to your disk
space:
http://geophysics.asu.edu/cese/files/L04_mileage.xls
Once opened, the spreadsheet should look similar to the image on the next page. Here, we’re hoping
to minimize the cost of a car trip that depends on more than one variable.
While this is perhaps a silly example, it makes a couple points. For one, you can see our functional
form is not of the type
y = mx + b.
Thus, we can't use the Add Trendline function to optimize a fit to our equation. First, you will need to
make sure that the Solver add-in is installed. We will check this together in class. Once Solver is
installed, go to the Data tab and open Solver in the Analysis section. You will see several options
already chosen in the Solver Parameters GUI that opens up. We want to minimize the target cell E4
(the total cost of the car trip), so make sure the Min radio button is selected. You can see that cell E4
is an equation that estimates the cost of the trip. There are two main things that the trip cost
depends upon: driver’s expenses and gas expenses. Driver’s expenses depend on how long it takes to
get where you are going (since the driver charges an hourly rate). The gas expenses depend on the
total time the car is running (obviously), but also on how fast you drive, since that affects your
gas mileage. Since the equation is entered in cell E4, and the Min button is selected, just click
Solve. Look at the cells in row 4 after “solving”.
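The same kind of minimization can be sketched outside Excel. The Python example below is only an illustration: the trip distance, hourly rate, gas price, and the mileage-versus-speed model are all hypothetical values, not the contents of L04_mileage.xls, and a golden-section search stands in for whatever method Solver uses internally.

```python
import math

# Hypothetical trip parameters (NOT taken from L04_mileage.xls).
DISTANCE_MILES = 300.0
DRIVER_RATE = 10.0   # dollars per hour
GAS_PRICE = 3.0      # dollars per gallon


def mileage(speed_mph):
    """Assumed gas mileage (miles per gallon) that drops as speed rises."""
    return 40.0 - 0.25 * speed_mph


def trip_cost(speed_mph):
    """Total cost = driver's expenses + gas expenses."""
    hours = DISTANCE_MILES / speed_mph
    gallons = DISTANCE_MILES / mileage(speed_mph)
    return DRIVER_RATE * hours + GAS_PRICE * gallons


def golden_section_min(f, lo, hi, tol=1e-6):
    """Locate the minimum of a single-valley function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while (b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d   # the minimum lies in [a, d]
        else:
            a = c   # the minimum lies in [c, b]
    return (a + b) / 2.0


best_speed = golden_section_min(trip_cost, 20.0, 120.0)
print(f"cheapest speed: {best_speed:.1f} mph, cost: ${trip_cost(best_speed):.2f}")
```

Driving slowly raises the driver's bill and driving fast raises the gas bill, so the total cost has a single valley in between; that trade-off is what the minimizer exploits.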
Finding the minimum of some function with RMS
Instead of finding a minimum of a functional form, we will find the minimum of the RMS which
stands for root mean square. RMS is similar to the standard deviation:
where p is the number of observations, H_i is the observed value at x_i, and H_model(x_i) is the model
(or calculated) value at x_i. Thus, if we can minimize this measure, i.e., the square root of the mean
of the squared differences between model (prediction) and observation, then we will be getting a
best fit according to this description of "goodness" of fit.
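For reference, the textbook root-mean-square misfit, written with the symbols defined above, has the following form. This is the standard definition, offered as a sketch rather than a reproduction of the equation shown in lecture.

```latex
\mathrm{RMS} = \sqrt{\frac{1}{p}\sum_{i=1}^{p}\left[H_i - H_{\mathrm{model}}(x_i)\right]^{2}}
```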
Today, we will use Solver to obtain a best fit to a data set of ocean floor sedimentation rates. Let's
work with these data of age of sediment versus depth below the seafloor:
Seafloor Depth (cm)    Sedimentation Age (years)
407                    10510
545                    11160
825                    11730
1158                   12410
1454                   12585
2060                   13445
2263                   14685
Can we figure out the sedimentation rate from these data? Yes, if we assume a functional form (e.g.,
linear, or quadratic, or whatever). For sedimentation, let's assume the sedimentation rate stays
constant, thus the equation should be linear, i.e., of the form y=mx + b. To analyze these data, we will
solve this problem two ways:
(1) Use Add trendline on a chart made from these sedimentation data
(2) Use Solver to do the same thing.
Thus, by using Solver, we will show you another approach for determining best fit models to data sets.
Open a new Excel workbook; we will put each approach on a different sheet.
(1) Automatic way: Add Trendline
Copy the above data into Excel (put them in column A and B). Make a chart of the data (XY scatter
plot). Fit the data with a straight line, using Excel's built-in Add Trendline function. Choose the Linear
option. Also, display the equation on the chart. Here is a picture of what that might look like (you'll
notice that I changed the Y-axis range so that the data span is more appropriately represented in this
dimension – always de-junk your chart!):
There, that was easy. If you’re just fitting a line (or one of the other automated functional forms), this
is really the way to do it. Done.
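Excel's Linear trendline is a least-squares fit, so its slope and intercept can be cross-checked outside Excel. The Python sketch below applies the closed-form least-squares formulas for a straight line to the sedimentation table; the function and variable names are illustrative only.

```python
# Sedimentation data from the table above.
depth_cm = [407, 545, 825, 1158, 1454, 2060, 2263]
age_years = [10510, 11160, 11730, 12410, 12585, 13445, 14685]


def linear_least_squares(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared and cross deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


m, b = linear_least_squares(depth_cm, age_years)
print(f"age = {m:.4f} * depth + {b:.1f}")
```

Because depth is on the x-axis and age on the y-axis, the slope m has units of years per cm; the sedimentation rate in cm per year is its reciprocal, 1/m.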
(2) Less automated, but more flexibility: Solver
To illustrate the concept of Solver, we'll solve the same problem by finding the m and b that
minimize the RMS (root mean square - a measure of the misfit of the model to the data). To do so,
follow these instructions:
1) Somewhere on a clean worksheet (e.g., below your freshly copied data table), define two cells, one
for slope "m" and another for the Y-intercept "b". Go ahead and stick some number in each (they'll get
updated later) and define names for those cells: I recommend naming them m and b. Also, put your m
and b cells next to each other (view the screen shot of my spreadsheet as an example).
2) Set up a column of model age estimations that uses our guess of m and b. Call this Y_calc, which
equals m*x + b, where the x values are the depths of the observations, and m and b are our model
parameters that describe our best fitting line (see Column C in image below).
3) Determine the RMS between the observed and modeled values: make a column that calculates the
difference between the observed ages (Y_obs) and those you just estimated from your guessed m and b
(Y_calc) [my column D in image below]. In a column to the right of this (Column E), square that
difference. Here, we are, by hand, calculating RMS, as in the formula presented earlier in this lecture.
4) Calculate the sum of those squares (thus, simply sum the values in column E), then take the square
root of this number, then finally divide by the number of observations. I put all of these steps in my
column F. NOTE: count the number of observations using Excel's function COUNT(cell range). If your
data set had 100's or 1000's of data, you'd surely do it this way. Our final result is the RMS, and that’s
the cell we will minimize using Solver. Make sure you understand what we're doing: we are
computing the misfit of each data point to your line, and then squaring this. Then we sum these up,
take the square root of this sum, and divide by the number of observations. (Strictly speaking, the
RMS divides by the number of observations before taking the square root; the two versions differ only
by a constant factor, so the same m and b minimize both.) Thus, this is essentially an average
measure of the misfits of our model (i.e., the RMS). Hence, minimizing the RMS will
minimize our model's misfit. Understand this.
My sheet looked like:
5) Your last step is to run Solver (as before): set it up to make the target cell the one containing your
estimation of the RMS (F9 in my sheet) a minimum (click the Min button) by varying the cells that
contain m and b (the cell range that is dashed below). In Excel 2003, it would look like:
6) Click Solve. After Solver finishes searching for the best fit, it will ask you
if you want to keep the result, which of course you do. It will also offer a report (usually on a
separate worksheet); select the Answer sheet option.
7) Finally: compare your answers for the best m and b values (i.e., the ones Solver obtained) to those
calculated by the Add trendline function.
If the above exercise was not clear, please go through these notes again. This is an important approach
in scientific computing, and independent of software package (and a ripe topic for a quiz). If you are
having trouble understanding it, please meet with the instructor or TA.
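The Solver steps above can also be sketched in Python. This is a minimal illustration, not a description of Solver's internal algorithm: the misfit function mirrors steps 2 through 4 (including dividing after the square root), and a simple coordinate pattern search, chosen here only because it is short, stands in for Solver. The starting guess and step sizes are arbitrary assumptions.

```python
import math

depth_cm = [407, 545, 825, 1158, 1454, 2060, 2263]
age_years = [10510, 11160, 11730, 12410, 12585, 13445, 14685]


def misfit(m, b):
    """Steps 2-4: Y_calc, squared differences, then sqrt(sum) / count."""
    y_calc = [m * x + b for x in depth_cm]
    squared = [(y_obs - y_c) ** 2 for y_obs, y_c in zip(age_years, y_calc)]
    return math.sqrt(sum(squared)) / len(age_years)


def pattern_search(f, start, steps, halvings=25):
    """Nudge one parameter at a time; halve the step sizes when stuck."""
    params = list(start)
    steps = list(steps)
    best = f(*params)
    for _ in range(halvings):
        improved = True
        while improved:
            improved = False
            for i in range(len(params)):
                for sign in (1.0, -1.0):
                    trial = list(params)
                    trial[i] += sign * steps[i]
                    value = f(*trial)
                    if value < best:
                        params, best, improved = trial, value, True
                        break
        steps = [s / 2.0 for s in steps]
    return params, best


# Arbitrary initial guesses for m and b, like the numbers typed in step 1.
(m_fit, b_fit), best_misfit = pattern_search(misfit, (1.0, 0.0), (1.0, 1000.0))
print(f"m = {m_fit:.4f}, b = {b_fit:.1f}, misfit = {best_misfit:.2f}")
```

As in step 7, the resulting m and b should agree closely with the values from the linear trendline, since both approaches minimize the same sum of squared differences.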
[2] LINEAR ALGEBRA IN EXCEL
Solving systems of equations
[Taken from: http://puccini.che.pitt.edu/~karlj/Classes/CHE2101/solver.html] Someday, somewhere,
sometime, you may have the need to solve a system of equations in your homework or research. You
can do this in a wide variety of ways, but a simple way to find a solution is to use Solver. However, be
warned that Solver is known to choke, especially if solutions do not exist. In general, finding roots to
nonlinear functions is a very difficult problem. Depending on your problem, you might be better off
using a tool like Matlab, but Excel works great for simpler problems. As a very simple example,
suppose you want to solve a system of linear equations using Solver. Consider
3x + y = –2
x + 2y = –5
We can rewrite these equations as:
3x + y + 2 = 0
x + 2y + 5 = 0
We can solve these by hand (remember Gaussian elimination?). Or, we can use linear algebra (i.e.,
matrix algebra) to find the solution. Let’s solve it by hand on the board. We should find that x = 0.2,
y = –2.6.
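The hand calculation can be checked with a few lines of Python. This is a minimal sketch of Gaussian elimination for a 2x2 system; the function name solve_2x2 is made up for this example, and exact fractions are used so the answer is not blurred by rounding.

```python
from fractions import Fraction


def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a11*x + a12*y = r1 and a21*x + a22*y = r2 by elimination."""
    a11, a12, a21, a22, r1, r2 = (
        Fraction(v) for v in (a11, a12, a21, a22, r1, r2)
    )
    if a11 == 0:
        raise ValueError("swap the equations so the first x coefficient is nonzero")
    # Eliminate x from the second equation.
    factor = a21 / a11
    a22_new = a22 - factor * a12
    r2_new = r2 - factor * r1
    if a22_new == 0:
        raise ValueError("the system has no unique solution")
    # Back-substitute: solve for y, then for x.
    y = r2_new / a22_new
    x = (r1 - a12 * y) / a11
    return x, y


# 3x + y = -2 and x + 2y = -5
x, y = solve_2x2(3, 1, 1, 2, -2, -5)
print(float(x), float(y))
```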
We can also use Excel’s Solver to find the solution to the above equations. Go to the above web page,
and find the link that says sample spreadsheet. Open it, where the problem is set up. You should see
that cells A2 and B2 hold initial guess values of x and y, respectively. Cells C2 and D2 hold the
formulas for the two equations shown above, and cell E2 holds an objective function, which in this
case is just the sum of the two equations, C2 and D2. We want to drive this objective function
to zero, since the sum of the above equations (note the right-hand side) is zero.
Now, in Solver, the Target Cell should be $E$2. Equal to: should be set to Value of: 0. The By
Changing Cells should be $A$2:$B$2. This means that Solver will attempt to make the value of $E$2
be zero by adjusting the variables in cells $A$2:$B$2. Now, in this case, just setting the objective
function to zero does not lead to a unique solution. We can add constraints to the solution to make sure
that both equations are in fact zero. It is sufficient to set a constraint on one of the equations also being
zero and then the fact that the sum must also be zero will require the second equation to also be zero.
Hence, in Subject to the Constraints: box you should set $C$2 = 0. You can add the additional
requirement that $D$2 = 0, but this is redundant.
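A quick numerical check shows why the constraint is needed. In the Python sketch below, the functions c2, d2, and e2 are stand-ins named after the cells described above (the formulas are inferred from the two equations, not copied from the sample spreadsheet). The point x = -1, y = -1 makes the objective zero even though neither equation is satisfied.

```python
def c2(x, y):
    """First equation, rearranged so that a solution gives zero."""
    return 3 * x + y + 2


def d2(x, y):
    """Second equation, rearranged so that a solution gives zero."""
    return x + 2 * y + 5


def e2(x, y):
    """Objective: the sum of the two equations."""
    return c2(x, y) + d2(x, y)


# A point that zeroes the objective without solving the system.
print(e2(-1, -1), c2(-1, -1), d2(-1, -1))

# The true solution zeroes the objective and both equations (to rounding).
print(e2(0.2, -2.6), c2(0.2, -2.6), d2(0.2, -2.6))
```

Any point on the line 4x + 3y + 7 = 0 zeroes the objective, which is why adding the constraint C2 = 0 is what pins down the single solution.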
If you click on the Solve button you should notice that the values in Cells A2:E2 change, with Cells
C2:E2 being very close to zero. You should see a dialog box. Make sure Keep Solver Solution is
checked and click OK. You should have values very close to 0.2 and -2.6 in Cells A2 and B2,
respectively.
If you plan on solving systems of linear equations often with Solver, you should explore the web for
other examples.