GLG410: Computers in Earth & Space Exploration
Spring 2010, ASU
Instructor: Ed Garnero & Matt Fouch; TA: Jeff Lockridge
Lecture 04, Tuesday, February 9, 2010

Today is our third Excel lecture. Our two central themes are: (1) curve fitting, and (2) linear algebra (matrices). We will have a fourth lecture on Excel to further explore linear algebra, since it has become increasingly common in the computational physical sciences.

A note on being computing scientists ...and homework

So far, the Excel assignments have come with a fair bit of guidance, even step-by-step recipes. However, you now have enough tools under your belt to tackle problems without step-by-step instructions. HW#3 will contain a mix of levels of guidance. As the semester carries on, you will do more and more of the work without such detailed instructions.

[1] CURVE FITTING WITH EXCEL

Today, we will experiment with curve fitting using Excel. First, we'll explore Excel's built-in "Trendline" option, which adds trend lines to data sets after a chart has been generated. Later, we will use an Excel tool called Solver, which lets us fit a data set to more general functional forms, i.e., with more freedom than a straight line, exponential, logarithmic, etc. After that, we'll approach the problem completely from scratch.

Curve fitting using the Add Trendline option when right-clicking a data series point in a chart

We have already done this in lecture. All you have to do is right-click a data point within a chart, then select Add Trendline. There are many options to choose from. Let's quickly discuss this with the following spreadsheet (download this from the web to your disk space):

http://geophysics.asu.edu/cese/files/L04_climate.xls

If you are to be your own computing scientist, you should look into what Excel is assuming regarding its different trend line options.
The equations behind these options can be found on Excel's Help page:

[Figure: Excel Help page describing the trendline options]

If the explanations for each option are not clear, you should seek additional information. For example, do you know what least squares is? It is an integral part of most of the options. If not, just go to wikipedia.org, type in "least squares", and you'll find:

Least squares

Least squares is a mathematical optimization technique which, when given a series of measured data, attempts to find a function which closely approximates the data (a "best fit"). It attempts to minimize the sum of the squares of the ordinate differences (called residuals) between points generated by the function and corresponding points in the data. Specifically, it is called least mean squares (LMS) when the number of measured data is 1 and the gradient descent method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest operations (per iteration). But it requires a large number of iterations to converge.

An implicit requirement for the least squares method to work is that errors in each measurement be randomly distributed. The Gauss-Markov theorem proves that least square estimators are unbiased and that the sample data do not have to comply with, for instance, a normal distribution. It is also important that the collected data be well chosen, so as to allow visibility into the variables to be solved for (for giving more weight to particular data, refer to weighted least squares).

The least squares technique is commonly used in curve fitting. Many other optimization problems can also be expressed in a least squares form, by either minimizing energy or maximizing entropy.
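To make the definition concrete, here is a minimal sketch in Python (Python is not used in this lecture; the function names and the five data points are made up purely for illustration). It fits a straight line y = m*x + b with the standard closed-form least-squares formulas and evaluates the sum of squared residuals that the method minimizes.

```python
def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Standard closed-form solution for the slope and intercept.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def sum_squared_residuals(m, b, xs, ys):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements, invented for this example only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

m, b = fit_line(xs, ys)
best = sum_squared_residuals(m, b, xs, ys)
nudged = sum_squared_residuals(m + 0.1, b, xs, ys)
print(f"m = {m:.2f}, b = {b:.2f}")
print(f"sum of squares at the fit: {best:.4f}, with m nudged: {nudged:.4f}")
```

Nudging the slope (or the intercept) away from the fitted value always increases the sum of squared residuals; that is exactly what "best fit" means in the least-squares sense.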
We will actually do this by hand shortly. Play with the different trendline options. Observe the results.

Curve fitting using Excel's SOLVER function

If we do not want to fit the data with a straight line, or with any of Excel's other Trendline options, that's not a problem: we can use Solver. To simply fit a line to some data, the Trendline function is the easiest approach. If we want to fit data to some specific function not contained in Excel's Trendline options, then we'd try Solver. Solver searches for the values that best satisfy the constraints we give it.

Let's look at a simple example. Download the following file from the web to your disk space:

http://geophysics.asu.edu/cese/files/L04_mileage.xls

Once opened, the spreadsheet should look similar to this image:

[Figure: screenshot of the mileage spreadsheet]

Here, we're hoping to minimize the cost of a car trip that depends on more than one variable. While this is perhaps a silly example, it makes a couple of points. For one, you can see our functional form is not of the type y = mx + b. Thus, we can't use the Add Trendline function to optimize a fit to our equation.

First, you will need to make sure that the Solver add-in is installed. We will check this together in class. Once Solver is installed, go to the Data tab and open Solver in the Analysis section. You will see several options already chosen in the Solver Parameters dialog that opens. We want to minimize the target cell E4 (the total cost of the car trip), so make sure the Min radio button is selected.

You can see that cell E4 holds an equation that estimates the cost of the trip. There are two main things that the trip cost depends upon: driver's expenses and gas expenses. Driver's expenses depend on how long it takes to get where you are going (since the driver charges an hourly rate).
The gas expenses depend on the total time the car is operating (obviously), but also on how fast you drive, since that affects your gas mileage. Since the equation is entered in cell E4 and the Min button is selected, just click Solve. Look at the cells in row 4 after "solving".

Finding the minimum of some function with RMS

Instead of finding the minimum of a functional form, we will find the minimum of the RMS, which stands for root mean square. RMS is similar to the standard deviation:

$$\mathrm{RMS} = \sqrt{\frac{1}{p}\sum_{i=1}^{p}\left[H_i - H_{model}(x_i)\right]^2}$$

where p is the number of observations, H_i is the observed value at x_i, and H_model(x_i) is the model (or calculated) value at x_i. Thus, if we can minimize this measure, i.e., the square root of the mean of the squared differences between model (prediction) and observation, then we will get a best fit according to this description of "goodness" of fit. Because taking the square root and dividing by p both preserve ordering, the m and b that minimize the RMS are the same ones that minimize the plain sum of squared differences.

Today, we will use Solver to obtain a best fit to a data set of ocean floor sedimentation rates. Let's work with these data of age of sediment versus depth below the seafloor:

Seafloor Depth (cm)    Sedimentation Age (years)
 407                   10510
 545                   11160
 825                   11730
1158                   12410
1454                   12585
2060                   13445
2263                   14685

Can we figure out the sedimentation rate from these data? Yes, if we assume a functional form (e.g., linear, or quadratic, or whatever). For sedimentation, let's assume the sedimentation rate stays constant; thus the equation should be linear, i.e., of the form y = mx + b. With depth as x and age as y, the slope m has units of years per cm, so the sedimentation rate is 1/m.

To analyze these data, we will solve this problem two ways:

(1) Use Add Trendline on a chart made from these sedimentation data
(2) Use Solver to do the same thing

Thus, by using Solver, we will show you another approach for determining best-fit models to data sets. Open a new Excel workbook; we will put each approach on a different sheet.

(1) Automatic way: Add Trendline

Copy the above data into Excel (put them in columns A and B).
Make a chart of the data (XY scatter plot). Fit the data with a straight line, using Excel's built-in Add Trendline function. Choose the Linear option. Also, display the equation on the chart. Here is a picture of what that might look like (you'll notice that I changed the Y-axis range so that the data span is more appropriately represented in this dimension; always de-junk your chart!):

[Figure: XY scatter chart of the sedimentation data with a linear trendline and its equation]

There, that was easy. If you're just fitting a line (or one of the other automated functional forms), this is really the way to do it. Done.

(2) Less automated, but more flexible: Solver

To illustrate the concept of Solver, we'll solve the same problem by finding the m and b that minimize the RMS (root mean square, a measure of the misfit of the model to the data). To do so, follow these instructions:

1) Somewhere on a clean worksheet (e.g., below your freshly copied data table), define two cells, one for the slope "m" and another for the Y-intercept "b". Go ahead and put some number in each (they'll get updated later) and define names for those cells: I recommend naming them m and b. Also, put your m and b cells next to each other (as in the screenshot of my spreadsheet).

2) Set up a column of model age estimates that uses our guess of m and b. Call this Y_calc, which equals m*x + b, where the x values are the depths of the observations, and m and b are the model parameters that describe our best-fitting line (column C in my sheet).

3) Determine the RMS between the observed and modeled values: make a column that calculates the difference between the observed ages (Y_obs) and those you just estimated from your guessed m and b (Y_calc) (column D in my sheet). In a column to the right of this (column E), square that difference. Here we are calculating the RMS by hand, as in the formula presented earlier in this lecture.
4) Calculate the sum of those squares (simply sum the values in column E), divide by the number of observations, and finally take the square root of the result. I put all of these steps in my column F. NOTE: count the number of observations using Excel's function COUNT(cell range). If your data set had hundreds or thousands of points, you'd surely do it this way. Our final result is the RMS, and that's the cell we will minimize using Solver.

Make sure you understand what we're doing: we compute the misfit of each data point to the line and square it. Then we sum these squares, divide the sum by the number of observations, and take the square root. Thus, the RMS is essentially an average size of the misfits between our model and the data. Hence, minimizing the RMS will minimize our model's misfit. (If you take the square root before dividing by the number of observations, the number in the cell changes, but Solver still arrives at the same m and b, because both versions are smallest when the sum of squares is smallest.) Understand this. My sheet looked like:

[Figure: screenshot of my worksheet]

5) Your last step is to run Solver (as before): set it up to make the target cell, the one containing your estimate of the RMS (F9 in my sheet), a minimum (click the Min button) by varying the cells that contain m and b (enter them as a single cell range). In Excel 2003, it would look like:

[Figure: Solver Parameters dialog in Excel 2003]

6) Click Solve. After Solver finishes searching for the best fit, it will ask whether you want to keep the result, which of course you do. It will also offer a report (usually placed on a separate worksheet); select the Answer option.

7) Finally, compare your answers for the best m and b values (i.e., the ones Solver obtained) to those calculated by the Add Trendline function.

If the above exercise was not clear, please go through these notes again.
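The same experiment can be sketched outside Excel. The Python below is only an illustration under stated assumptions: Python is not part of this lecture, the function names are made up, and the trial-and-error search (a simple "compass search") is a stand-in that mimics the idea of Solver, not the algorithm Solver actually uses. It minimizes the RMS for the sedimentation data from the table in these notes and compares the result with the closed-form least-squares line, which is what a linear trendline computes.

```python
import math

# Sedimentation data from the table in these notes.
depth_cm = [407, 545, 825, 1158, 1454, 2060, 2263]
age_yr = [10510, 11160, 11730, 12410, 12585, 13445, 14685]


def rms(m, b, xs, ys):
    """Root mean square misfit between ys and the line m*x + b."""
    squares = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squares) / len(squares))


def solver_like_fit(xs, ys, m=1.0, b=0.0, tol=1e-6):
    """Minimize the RMS by trial and error (compass search).

    NOT Solver's algorithm; it only mimics the idea of adjusting
    m and b until the misfit stops decreasing.
    """
    # Measure depths from their mean so that changing the slope and
    # changing the offset do not interfere with each other.
    x_mean = sum(xs) / len(xs)
    shifted = [x - x_mean for x in xs]
    params = [m, b + m * x_mean]  # slope, line value at the mean depth
    steps = [1.0, 1000.0]         # initial trial step for each parameter
    best = rms(params[0], params[1], shifted, ys)
    while max(steps) > tol:
        for i in range(2):
            for sign in (1.0, -1.0):
                trial = list(params)
                trial[i] += sign * steps[i]
                value = rms(trial[0], trial[1], shifted, ys)
                if value < best:
                    params, best = trial, value
                    break
            else:
                steps[i] /= 2.0   # neither direction helped: refine the step
    slope, mid = params
    return slope, mid - slope * x_mean, best


def trendline_fit(xs, ys):
    """Closed-form least-squares line, for comparison."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return m, (sy - m * sx) / n


m_solver, b_solver, rms_min = solver_like_fit(depth_cm, age_yr)
m_trend, b_trend = trendline_fit(depth_cm, age_yr)
print(f"search:      m = {m_solver:.4f}, b = {b_solver:.1f}, RMS = {rms_min:.1f}")
print(f"closed form: m = {m_trend:.4f}, b = {b_trend:.1f}")
```

Both routes should agree on m and b to within the search tolerance, which is the same comparison step 7 asks you to make in Excel. The advantage of the search route is that rms() could be swapped for the misfit of any model, not just a straight line.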
This is an important approach in scientific computing, independent of software package (and a ripe topic for a quiz). If you are having trouble understanding it, please meet with the instructor or TA.

[2] LINEAR ALGEBRA IN EXCEL

Solving systems of equations
[Taken from: http://puccini.che.pitt.edu/~karlj/Classes/CHE2101/solver.html]

Someday, somewhere, sometime, you may need to solve a system of equations in your homework or research. You can do this in a wide variety of ways, but a simple way to find a solution is to use Solver. However, be warned that Solver is known to choke, especially if solutions do not exist. In general, finding roots of nonlinear functions is a very difficult problem. Depending on your problem, you might be better off using a tool like Matlab, but Excel works great for simpler problems.

As a very simple example, suppose you want to solve a system of linear equations using Solver. Consider

3x + y = -2
x + 2y = -5

We can rewrite these equations as:

3x + y + 2 = 0
x + 2y + 5 = 0

We can solve these by hand (remember Gaussian elimination?). Or, we can use linear algebra (i.e., matrix algebra) to find the solution. Let's solve it by hand on the board. We should find that x = 0.2, y = -2.6.

We can also use Excel's Solver to find the solution to the above equations. Go to the above web page and find the link that says "sample spreadsheet". Open it; the problem is already set up. You should see that cells A2 and B2 hold initial guess values of x and y, respectively. Cells C2 and D2 hold the formulas for the left-hand sides of the two rewritten equations shown above, and cell E2 holds an objective function, which in this case is just the sum of C2 and D2. We want to drive this objective function to zero, since at the solution both left-hand sides are zero (note the right-hand sides), and so is their sum.
Now, in Solver, the Target Cell should be $E$2. Equal to: should be set to Value of: 0. The By Changing Cells should be $A$2:$B$2. This means that Solver will attempt to make the value of $E$2 zero by adjusting the variables in cells $A$2:$B$2.

Now, in this case, just setting the objective function to zero does not lead to a unique solution: the sum of the two expressions, 4x + 3y + 7, is zero for every (x, y) pair along an entire line. We can add constraints to make sure that both equations are in fact zero. It is sufficient to constrain one of the equations to be zero; the fact that the sum must also be zero then requires the second equation to be zero as well. Hence, in the Subject to the Constraints: box you should set $C$2 = 0. You can add the additional requirement that $D$2 = 0, but this is redundant.

If you click on the Solve button, you should notice that the values in cells A2:E2 change, with cells C2:E2 being very close to zero. You should see a dialog box. Make sure Keep Solver Solution is checked and click OK. You should have values very close to 0.2 and -2.6 in cells A2 and B2, respectively.

If you plan on solving systems of linear equations often with Solver, you should explore the web for other examples.
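As a cross-check on the numbers above, here is a minimal Python sketch (Python is not part of this lecture, and the function name solve_2x2 is made up for illustration). It solves the same two equations directly with Cramer's rule rather than by Solver's iterative search, and it also shows why a zero objective alone is not enough.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*x + a12*y = b1 and a21*x + a22*y = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("no unique solution (determinant is zero)")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - b1 * a21) / det
    return x, y


def residuals(x, y):
    """The two rewritten equations (cells C2 and D2) and their sum (E2)."""
    c2 = 3 * x + y + 2
    d2 = x + 2 * y + 5
    return c2, d2, c2 + d2


# 3x + y = -2 and x + 2y = -5
x, y = solve_2x2(3, 1, -2, 1, 2, -5)
print(f"x = {x:.1f}, y = {y:.1f}")
print("residuals at the solution:", residuals(x, y))

# A point where the sum is zero but neither equation is satisfied,
# which is why the constraint C2 = 0 is needed.
print("residuals at (-1, -1):", residuals(-1, -1))
```

At (-1, -1) the two expressions are -2 and +2: they cancel in the sum, so the objective cell alone cannot tell this point apart from the true solution.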