WEEK #5, Lecture 3: Least-Squares Fitting

Function Fitting

So far in MATLAB, we have not dealt with data that contains error:
• Splines go exactly through the given points.
• Equation-solving methods assume we know all input values exactly.
Another important use of numerical tools is to go from (error-prone) data to a mathematical model.

Data Files and MATLAB

The first step in working with data is getting it into memory.
Look up what the following commands do in MATLAB.
• dlmread -
• csvread -
• textread -
• xlsread -

Week 5 – Interpolation

Example Dataset

Exercise: Start a new script, W5_8.m, and have it read in QuizAndExamGrades.xls
• Can be done with xlsread.
• Can also double-click the file in the Directory listing.
What format is the data now in, in MATLAB?

Histograms

Exercise: Look at the distribution of each variable separately using a histogram. The MATLAB command for this is hist.

Arranging plots can be helpful.
Exercise: In the script, use the commands figure(1) and figure(2) for each separate plot.
Exercise: Look up the subplot command in the Help system. Modify the script so it shows all the graphs in the same graph window using subplot.
How does the subplot command lay out the sub-windows?

Relationships

Of more interest than each variable separately is how they are related.
Exercise: Generate a scatter plot of the exam vs. test grades.
Once you have the scatterplot, what might you want to do next?

Fitting Curves to Data

Exercise: In the MATLAB plot window, select Tools/Basic Fitting.
There is a ‘spline’ option: what happens when you try it? Explain what happened.
Exercise: Play around:
• Move the legend out of the way
• Get the formulas for the best-fit linear and quadratic curves
• What does the big Right Arrow button do?

Model Selection - Which Fit is “Best”?

MATLAB is supposedly finding the “best fit line” or “best fit curve”, i.e. of all possible straight lines, the linear fit shown is the best straight line.
However, what does that mean if we want to compare the best straight-line fit to the best quadratic fit?

Guidelines

Which models keep closest to the actual data points?
• linear, quadratic, or higher order?
Which models match logic/intuition/practical constraints better?
• linear, quadratic, or higher order?
Always ask: Is a closer fit to the data substantial enough to justify a higher-order fit?
We will study the question of “how high a degree should I use” in a more systematic way next class.

Defining the “Best Fit” Within One Model

For today, we will look at selecting
• the best linear fit, among all possible linear fits, or
• the best quadratic fit, among all possible quadratic fits, etc.
How is the best model within each family selected? Or, in other words, what makes the “best fit line” the best?

Mathematics of Least Squares

Our data is a set of $(x_i, y_i)$ pairs.
Before we find the best curve, we select/limit ourselves to one predictive family of functions, e.g.
• Linear: $\hat{y} = p_1 x + p_2$
• Quadratic: $\hat{y} = p_1 x^2 + p_2 x + p_3$
Definition: Finding the “best fit” means “find values for $p_i$ that minimize the squared error”:
$$\sum_i (y_i - \hat{y}_i)^2$$

Graphically

[Figure: plot not recovered; only the axis ticks survive (vertical axis 0 to 60, horizontal axis 0 to 18).]

Naming and Symbols

What symbols are traditionally used to describe the various components of function fitting?
• Original Data
• Fitted function
• Fitted values
• Residuals

Least-Squares Error

Least-squares error is the standard means by which we select the best fit.
The best function fit is selected, from all possible curves in the same family, so as to minimize the sum of the squared y errors.
Are other definitions of “best fit” possible?
Why do we use the least-squares definition of “best” so often?

Next class: identifying when least-squares fits are not “best”, and selecting between multiple “best fit” models.
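
The least-squares definition above can be tried directly at the MATLAB prompt. The sketch below is only an illustration: the x and y values are made up (they are not the course grade data), and the variable names are our own. It uses polyfit to get the best linear and quadratic fits, computes the fitted values and residuals, and checks that nudging the linear coefficients makes the sum of squared errors larger.

```matlab
% Hypothetical data, invented for illustration (NOT the quiz/exam grades).
x = [1 2 3 4 5 6 7 8];
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.1];

% Best-fit coefficients within each family, highest power first.
pLin  = polyfit(x, y, 1);    % yhat = p1*x + p2
pQuad = polyfit(x, y, 2);    % yhat = p1*x^2 + p2*x + p3

% Fitted values: the model evaluated at the original x values.
yhatLin  = polyval(pLin,  x);
yhatQuad = polyval(pQuad, x);

% Residuals: original data minus fitted values.
resLin  = y - yhatLin;
resQuad = y - yhatQuad;

% Sum of squared errors, the quantity least squares minimizes.
sseLin  = sum(resLin.^2);
sseQuad = sum(resQuad.^2);

% Some other straight line (coefficients nudged by an arbitrary amount).
pOther   = pLin + [0.05 -0.1];
sseOther = sum((y - polyval(pOther, x)).^2);

fprintf('Linear SSE:       %.4f\n', sseLin);
fprintf('Quadratic SSE:    %.4f\n', sseQuad);
fprintf('Nudged-line SSE:  %.4f\n', sseOther);
```

The nudged line always gives a larger error than the polyfit line, which is what “best within the family” means. The quadratic error is never larger than the linear one, since every line is also a quadratic with p1 = 0; that alone does not tell us the quadratic is the better model.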