The last stage of a regression analysis is to assess the adequacy of the model. To do this, we need to examine the components of the model which are purely statistical, i.e. the residual terms $e_i$. To see why this is necessary, consider the four data sets in the EXCEL worksheet “Tufte.xls”, which you will find in the EXCEL folder MBA Part 1. All four of these data sets lead to exactly the same values of the regression coefficients and $s_e$. In addition, the y values and x values all have the same means and standard deviations. However, when you run the regressions and examine the automatic residual graphs, you get very different results.

Consider the residual plot below:

[Figure: Residual Plot (Residuals versus X)]

Now consider the next residual plot:

[Figure: Residual Plot (Residuals versus X)]

Now consider the third plot:

[Figure: Residual Plot (Residuals versus X)]

Finally, consider the fourth residual plot:

[Figure: Residual Plot (Residuals versus X)]

For our electrical data, the residual plot looks like the following:

[Figure: Residual Plot (Residuals versus DD)]

EXCEL also provides a plot of the actual values of y and the predicted values for each value of x. This is illustrated below:

[Figure: KWH DD Line Fit Plot (KWH and Predicted KWH versus DD)]

Since our model looks like it fits the data, we can try to use it to predict kilowatt hour usage using degree days as a predictor. The basic equation would be:

Predicted KWH usage = 903.515 + 7.089 × (degree days).

Thus for a billing period with 100 degree days (possibly a period in the spring or autumn in Dallas), the predicted KWH usage would be 903.515 + 7.089 × 100 ≈ 1,612 KWH. This “point” estimate ignores variability.
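As a quick check of the arithmetic, the point forecast can be reproduced with a few lines of Python. This is only a sketch: the function name `point_forecast` and the variable names are illustrative choices, not part of the original analysis. The coefficients are the ones quoted above.

```python
def point_forecast(b0, b1, x):
    """Return the fitted-line prediction b0 + b1 * x."""
    return b0 + b1 * x


# Coefficients quoted in the text for the electricity data.
B0 = 903.515  # intercept, in KWH
B1 = 7.089    # slope, in KWH per degree day

# Billing period with 100 degree days.
kwh_100 = point_forecast(B0, B1, 100)
print(f"Predicted usage: {kwh_100:,.0f} KWH")  # prints "Predicted usage: 1,612 KWH"
```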
However, we can use the results discussed earlier about the “mound” rule and Chebyshev to incorporate variability into our forecasts and obtain what is called an “interval” forecast. In this case, the formula is given by:

$$\text{Forecast} = \hat{b}_0 + \hat{b}_1 x \pm 2 s_e$$

In the case discussed above, a billing period of 100 degree days, the interval forecast would be 1612 ± 2 × (605.9), or approximately 400 KWH to 2,824 KWH.

Although the above method is the most practical way to assess the adequacy of a model for forecasting purposes, it is common to use a single descriptive measure called the “correlation coefficient” to describe the adequacy of the regression fit. The correlation coefficient attempts to quantify how useful x is as a predictor of y. If one were not to use x in the forecasting of y, then one would guess the mean value of y as the forecast of kilowatt hours for each period. If one defines the error made as:

$$i\text{th error} = y_i - \bar{y}$$

and

$$SST = \sum_i (y_i - \bar{y})^2$$

then SST is a measure of the total errors made when not using x as a predictor. If we use x, then we can define the error made as:

$$\hat{e}_i = y_i - \hat{b}_0 - \hat{b}_1 x_i$$

and

$$SSE = \sum_i \hat{e}_i^2$$

SSE is a measure of the total errors made when using x as a predictor. One can show that

$$0 \le SSE \le SST$$

Therefore,

$$0 \le \frac{SSE}{SST} \le 1$$

If we define

$$R^2 = 1 - \frac{SSE}{SST}$$

then

$$0 \le R^2 \le 1$$

By rewriting the definition of $R^2$ as

$$R^2 = \frac{SST - SSE}{SST}$$

one can interpret $R^2$ as the proportion of the variability in y explained (or eliminated) by using x as a predictor.

As we shall see, all of the above generalizes to the case where one has more than one x as a predictor. In the case of a single x predictor, however, one usually encounters the correlation coefficient r, defined as:

$$r = \operatorname{sign}(\hat{b}_1) \sqrt{R^2}$$

In our case $R^2 = .8538$ and $r = .9240$. Both of these values can be found in the EXCEL output, as highlighted below:

A natural question is: how big does $R^2$ have to be in order for the regression analysis to be useful? I would suggest that the important measure is the usefulness of the interval forecast.
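The definitions of SST, SSE, $R^2$, r and the interval forecast can be collected in a short Python sketch. It is an illustration under stated assumptions rather than the procedure EXCEL uses internally: the function names are hypothetical, and the four-point data set `x_toy`, `y_toy` is made up purely so that the numbers are easy to check by hand. Only the last two calls use values quoted in the text (the coefficients, $s_e = 605.9$ and $R^2 = .8538$).

```python
import math


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sums_of_squares(x, y, b0, b1):
    """Return (SST, SSE): total squared error without x and with x."""
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sst, sse


def correlation(b1, r2):
    """r = sign(b1) * sqrt(R^2); max() guards against tiny negative rounding."""
    return math.copysign(math.sqrt(max(r2, 0.0)), b1)


def interval_forecast(b0, b1, x, s_e):
    """Interval forecast: b0 + b1 * x plus or minus 2 * s_e."""
    centre = b0 + b1 * x
    return centre - 2 * s_e, centre + 2 * s_e


# Hypothetical toy data (NOT the electricity data), chosen for easy arithmetic.
x_toy = [0, 1, 2, 3]
y_toy = [1, 3, 2, 4]

b0, b1 = fit_line(x_toy, y_toy)              # b0 is about 1.3, b1 is 0.8
sst, sse = sums_of_squares(x_toy, y_toy, b0, b1)
r2 = 1 - sse / sst                           # 1 - 1.8/5 = 0.64 up to rounding
r = correlation(b1, r2)                      # about 0.8
print(f"SST={sst:.2f} SSE={sse:.2f} R2={r2:.2f} r={r:.2f}")

# Values quoted in the text for the electricity data.
low, high = interval_forecast(903.515, 7.089, 100, 605.9)
r_text = correlation(7.089, 0.8538)
print(f"interval: {low:.1f} to {high:.1f} KWH, r = {r_text:.4f}")
```

The interval computed from the unrounded point forecast agrees with the one quoted in the text up to rounding. Note that the width of the interval depends only on $s_e$, which is why a large $R^2$ does not by itself guarantee a forecast that is narrow enough to be useful.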