Understanding r-squared Page 1 of 3 Goal: To understand how r2, the coefficient of determination, describes the strength of a linear model. As Rossman points out, r2 “measures how closely the points fall to the least squares line and thus also provides an indication of how confident one can be of predictions made with the line”1 Or to paraphrase Moore “When you report a regression, give r2 as a measure of how successful the model is explaining the response [for a given explanatory value].2 Goal of Modeling 1. To understand r2 we need to ask the following question. If we did not know any modeling techniques such as regression what would the best model be? The answer is that, lacking any sophisticated techniques, our best model is the horizontal line containing the mean of the response values, i.e. y = y . 2. Let’s look at an example. Assume the data from a bivariate experiment are the points shown below.3 On your calculator, draw a scatterplot of the data and a horizontal line representing y . Your plot should look similar to the picture below. (5,15) (4,8) y =7 (3,6) (2,4) (1,2) a. Is there error in this model? b. Draw a segment from each data point showing the error with respect to the model y = y . c. Fill in the following chart: error with respect to the model y = 7 Square of the errors from the model y = 7 (1,2) yi y (2,4) (3,6) (4,8) (5,15) (yi y )2 d. Draw shaded “squares” on your plot to represent the squared values just computed (note that since the scales of the axes are not the same, the “squares” will look like rectangles). Some may overlap. e. (yi y )2 = . f. Write a general phrase describing the quantity you just computed. 1 Workshop Statistics. Allen Rossman. page 139 Basic Practice of Statistics. David S. Moore. p 127 3 Data set from Gretchen Davis at Santa Monica HS via Eric Mulfinger, Westridge School, Pasadena, CA 2 Understanding r-squared Page 2 of 3 3. So what is the goal of our modeling efforts? The goal of our modeling effort is to find a better model than the mean of the response variable. We certainly would want the sum of the squares of the errors from the new model to be less than that of the model using the mean of the response variable. A Better Model 1. So let’s look for a better model. How about a Least Squares Regression Line (LSR line)? Using the same data as before, add the graph of the LSR line to your plot. Your plot should look like the picture below. (5,15) yˆ 3x 2 (4,8) y =7 (3,6) (2,4) (1,2) 2. Is there still error in this new model? Comparing the two models, which appears to have lower error? 3. Draw a segment from each data point showing the error to the model yˆ 3x 2 . 4. What does the “hat” symbol mean on yˆ ? 5. Fill in the following chart: error with respect to the model yˆ 3x 2 Square of the errors with respect to the model yˆ 3x 2 yi yˆ (1,2) (2,4) (3,6) (4,8) (5,15) (yi yˆ )2 6. Draw shaded “squares” on your plot to represent the values just computed. Does the sum of the areas of these squares seem smaller than those from the model using the response mean? 7. (yi yˆ )2 = . 8. Write a general phrase describing the quantity you just computed. Understanding r-squared 9. You have determined that suggests what conclusion? (yi y )2 = 100 and Page 3 of 3 (yi yˆ )2 = 10. This information Comparing Models 1. The natural next step is to find a number which gives us a sense how our new model compares with the model of the mean of the response variable. Of course we would like this number to be r2. (y i yˆ )2 (yi y )2 = . The answer should be 0.1 . computed? Which of the following correctly describes the proportion you just a. The proportion of how much error there is in the new model with respect to the error in the mean model. b. The proportion of how well the new model fits the data. 2. The correct answer to the previous question was The proportion of how much error there is the new model with respect to the error in the mean model. Remembering that we want r2 to show us how well our model measures how closely the observed values fall to the least squares line. How would we compute r2 from the proportion of error in the model? See footnote 4 for a hint. 1 3. So r2 = 1 – 10 = 0.9. Check this against the r2 your calculator computed. So r 2 (y i yˆ )2 1 (yi y ) 2 In other words, you find r2 by finding: 1– the sum of the squares of the error of the observed data with respect to the model the sum of the squares of the error of the observed data with respect to the mean of the response variable - end- 4 What is the correct value of r-squared? See your calculator.