12.13 Residual Analysis in Multiple Regression (Optional)

Although Excel and MegaStat are emphasized in Business Statistics in Practice, Second Canadian Edition, some examples in the additional material on Connect can only be demonstrated using other programs, such as MINITAB, SPSS, and SAS. Please consult the user guides for these programs for instructions on their use.

In Section 11.10, we showed how to use residual analysis to check the regression assumptions for a simple linear regression model. In multiple regression, we proceed similarly. Specifically, for a multiple regression model we plot the residuals given by the model against (1) values of each independent variable, (2) predicted values of the dependent variable, and (3) the time order in which the data have been observed (if the regression data are time series data).

A fanning-out pattern on a residual plot indicates an increasing error variance; a funneling-in pattern indicates a decreasing error variance. Both patterns violate the constant-variance assumption. A curved pattern on a residual plot indicates that the functional form of the regression model is incorrect. If the regression data are time series data, a cyclical pattern on the residual plot versus time suggests positive autocorrelation, while an alternating pattern suggests negative autocorrelation. Both patterns violate the independence assumption. On the other hand, if all residual plots have (at least approximately) a horizontal band appearance, then it is reasonable to believe that the constant-variance, correct functional form, and independence assumptions approximately hold.

To check the normality assumption, we can construct a histogram, stem-and-leaf display, and normal plot of the residuals. The histogram and stem-and-leaf display should look bell-shaped and symmetric about 0; the normal plot should have a straight-line appearance.
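To make the normal plot concrete, the following Python sketch computes the points of such a plot from a list of residuals. It is only an illustration under stated assumptions: the residual values are hypothetical, the function name is our own, and the plotting positions (3i - 1)/(3n + 1) are one common convention assumed here, not necessarily the one used by MegaStat.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (normal score, ordered residual) pairs for a normal plot.

    Assumption: the i-th smallest residual is paired with the standard
    normal point whose left-tail area is (3i - 1) / (3n + 1).
    """
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [
        std_normal.inv_cdf((3 * i - 1) / (3 * n + 1))
        for i in range(1, n + 1)
    ]
    return list(zip(scores, ordered))


# Hypothetical residuals, used only to show the mechanics.
example_residuals = [1.2, -0.4, 0.3, -1.5, 0.4]
points = normal_plot_points(example_residuals)
for score, resid in points:
    print(f"{score:7.3f}  {resid:6.2f}")
```

Plotting the ordered residuals against the normal scores gives the normal plot; points that fall close to a straight line are consistent with the normality assumption.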
To illustrate these ideas, consider the sales territory performance data in Table 12.2 (page 422). Figure 12.7 (page 430) gives the MegaStat output of a regression analysis of these data using the model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon$.

The least squares point estimates on the output give the prediction equation

$\hat{y} = -1{,}113.7879 + 3.6121x_1 + 0.0421x_2 + 0.1289x_3 + 256.9555x_4 + 324.5334x_5$.

Using this prediction equation, we can calculate the predicted sales values and residuals given on the MegaStat output of Figure 12.50. For example, observation 10 on this output corresponds to a sales representative for whom $x_1 = 105.69$, $x_2 = 42{,}053.24$, $x_3 = 5{,}673.11$, $x_4 = 8.85$, and $x_5 = 0.31$. If we insert these values into the prediction equation, we obtain a predicted sales value of $\hat{y}_{10} = 4{,}143.597$. Since the actual sales for the sales representative are $y_{10} = 4{,}876.370$, the residual is $e_{10} = y_{10} - \hat{y}_{10} = 4{,}876.370 - 4{,}143.597 = 732.773$.

The normal plot of the residuals in Figure 12.51(a) has a straight-line appearance. The plot of the residuals versus predicted sales in Figure 12.51(b) has a horizontal band appearance, as do the plots of the residuals versus the independent variables (the plot versus $x_3$, advertising, is shown in Figure 12.51(c)). We conclude that the regression assumptions approximately hold for the sales territory performance model (note that because the data are cross-sectional, a residual plot versus time is not appropriate).
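The calculation for observation 10 can be sketched in a few lines of Python. The variable names are our own, and the coefficients are the rounded values printed in the prediction equation, so the result is expected to differ slightly (by less than one sales unit) from the MegaStat value of 4,143.597, which is based on unrounded estimates.

```python
# Least squares point estimates b0, b1, ..., b5 (rounded, as printed in the text).
b = [-1113.7879, 3.6121, 0.0421, 0.1289, 256.9555, 324.5334]

# Values of x1, ..., x5 for observation 10, and the actual sales y10.
x10 = [105.69, 42053.24, 5673.11, 8.85, 0.31]
y10 = 4876.370

# Predicted value: b0 + b1*x1 + ... + b5*x5
y_hat10 = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x10))

# Residual: observed value minus predicted value
e10 = y10 - y_hat10

print(f"predicted = {y_hat10:.3f}, residual = {e10:.3f}")
```

Repeating this calculation for every observation gives the predicted values and residuals that are plotted in the residual plots.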
FIGURE 12.50 MegaStat Output of the Sales Territory Performance Model Residuals

Observation   Sales       Predicted   Residual
1             3,669.880   3,504.990    164.890
2             3,473.950   3,901.180   -427.230
3             2,295.100   2,774.866   -479.766
4             4,675.560   4,911.872   -236.312
5             6,125.960   5,415.196    710.764
6             2,134.940   2,026.090    108.850
7             5,031.660   5,126.127    -94.467
8             3,367.450   3,106.925    260.525
9             6,519.450   6,055.297    464.153
10            4,876.370   4,143.597    732.773
11            2,468.270   2,503.165    -34.895
12            2,533.310   1,827.065    706.245
13            2,408.110   2,478.083    -69.973
14            2,337.380   2,351.344    -13.964
15            4,586.950   4,797.688   -210.738
16            2,729.240   2,904.099   -174.859
17            3,289.400   3,362.660    -73.260
18            2,800.780   2,907.376   -106.596
19            3,264.200   3,625.026   -360.826
20            3,453.620   4,056.443   -602.823
21            1,741.450   1,409.835    331.615
22            2,035.750   2,494.101   -458.351
23            1,578.000   1,617.561    -39.561
24            4,167.440   4,574.903   -407.463
25            2,799.970   2,488.700    311.270

FIGURE 12.51 MegaStat Residual Plots for the Sales Territory Performance Model
(a) Normal plot of the residuals [vertical axis: Residual; horizontal axis: Normal Score]
(b) Plot of the residuals versus predicted sales [vertical axis: Residual (gridlines = std. error, 430.232); horizontal axis: Predicted]
(c) Plot of the residuals versus advertising [vertical axis: Residual (gridlines = std. error, 430.232); horizontal axis: Adver]

To conclude this section, we consider the Durbin–Watson test for first-order autocorrelation. This test is carried out for a multiple regression model exactly as it is for a simple linear regression model (see Section 11.10), except that we use $k$, the number of independent variables used by the model, when looking up the critical values $d_{L,\alpha}$ and $d_{U,\alpha}$. For example, Figure 12.52 gives $n = 16$ weekly values of Folio Bookstore sales ($y$), Folio's advertising expenditure ($x_1$), and competitors' advertising expenditure ($x_2$).
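As a rough sketch of the mechanics, the Python code below computes the Durbin–Watson statistic, $d = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, for the simple linear regression of Folio's sales on Folio's advertising, using the data listed in Figure 12.52(a). It also encodes the usual decision rule for a test for positive autocorrelation. The function names are our own, and the least squares fit is written out by hand rather than taken from MegaStat, so the computed value should agree with the output in the figure only up to rounding.

```python
def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


def positive_autocorrelation_test(d, d_lower, d_upper):
    """Decision rule for first-order positive autocorrelation."""
    if d < d_lower:
        return "positive autocorrelation"
    if d > d_upper:
        return "no positive autocorrelation"
    return "inconclusive"


# Folio Bookstore data from Figure 12.52(a), in time order.
adver = [18, 20, 20, 25, 28, 29, 29, 28, 30, 31, 34, 35, 36, 38, 41, 45]
sales = [22, 27, 23, 31, 45, 47, 45, 42, 37, 39, 45, 52, 57, 62, 73, 84]

# Least squares fit of sales on Folio's advertising (simple linear regression).
n = len(sales)
x_bar = sum(adver) / n
y_bar = sum(sales) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(adver, sales))
sxx = sum((x - x_bar) ** 2 for x in adver)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(adver, sales)]

d_simple = durbin_watson(residuals)
print(f"d for the simple model: {d_simple:.2f}")

# Two-variable model: d = 1.63 and the critical values for n = 16, k = 2
# are the values quoted in the text, not computed here.
decision = positive_autocorrelation_test(1.63, d_lower=0.98, d_upper=1.54)
print(decision)
```

The decision function takes the critical values as arguments because they depend on $n$, $k$, and $\alpha$ and must be looked up in Table A.12.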
The Durbin–Watson statistic for the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ is $d = 1.63$. If we set $\alpha$ equal to 0.05, then we use Table A.12, a portion of which is shown below.

Portion of Table A.12 ($k = 2$)
n     dL,0.05   dU,0.05
15    0.95      1.54
16    0.98      1.54
17    1.02      1.54
18    1.05      1.53

Because $n = 16$ and $k = 2$, the appropriate critical values for a test for first-order positive autocorrelation are $d_{L,0.05} = 0.98$ and $d_{U,0.05} = 1.54$. Because $d = 1.63$ is greater than $d_{U,0.05} = 1.54$, we conclude that there is no first-order positive autocorrelation. In contrast, the Durbin–Watson statistic in Figure 12.52(a) for the simple linear regression model relating $y$ to $x_1$ is $d = 0.65$, and the test based on it indicates that first-order positive autocorrelation does exist for that model. Therefore, adding $x_2$ to this model seems to have removed the autocorrelation.

FIGURE 12.52 Folio Bookstore Sales and Advertising Data, and Residual Analysis

(a) The data and the MegaStat output of the residuals from a simple linear regression relating Folio's sales to Folio's advertising expenditure

Observation   Adver   Compadv   Sales   Predicted   Residual
1             18      10        22      18.7         3.3
2             20      10        27      23.0         4.0
3             20      15        23      23.0        -0.0
4             25      15        31      33.9        -2.9
5             28      15        45      40.4         4.6
6             29      20        47      42.6         4.4
7             29      20        45      42.6         2.4
8             28      25        42      40.4         1.6
9             30      35        37      44.7        -7.7
10            31      35        39      46.9        -7.9
11            34      35        45      53.4        -8.4
12            35      30        52      55.6        -3.6
13            36      30        57      57.8        -0.8
14            38      25        62      62.1        -0.1
15            41      20        73      68.6         4.4
16            45      20        84      77.3         6.7

Durbin–Watson = 0.65

(b) MegaStat output of a plot of the residuals versus time [vertical axis: Residual (gridlines = std. error); horizontal axis: Observation]

Exercises for Section 12.13

CONCEPTS

12.63 Discuss how to use the residuals to check the regression assumptions for a multiple regression model.

12.64 Discuss how to carry out the Durbin–Watson test for a multiple regression model.

METHODS AND APPLICATIONS

12.65 THE HOSPITAL LABOUR NEEDS CASE Consider the hospital labour needs data in Table 12.5 (page 424).
Figure 12.53 gives residual plots that are obtained when we perform a regression analysis of these data by using the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$.
a. Interpret the normal plot of the residuals.
b. Interpret the residual plots versus predicted labour hours, BedDays ($x_2$), and Length ($x_3$). Note: The first two of these plots, as well as the plot versus Xray ($x_1$) (not shown), indicate that 3 hospitals are substantially larger than the other 13 hospitals. We will discuss the potential influence of these three large hospitals in Section 12.14.

12.66 THE FRESH DETERGENT CASE Recall that Table 12.4 (page 424) gives values for $n = 30$ sales periods of demand for Fresh liquid laundry detergent ($y$), price difference ($x_4$), and advertising expenditure ($x_3$).
a. Figure 12.54(a) gives the residual plot versus $x_3$ that is obtained when the regression model relating $y$ to $x_4$ and $x_3$ is used to analyze the Fresh detergent data. Discuss why the residual plot indicates that we should add $x_3^2$ to the model.
b. Figure 12.54(b) gives the residual plot versus time and the Durbin–Watson statistic that are obtained when the regression model relating $y$ to $x_4$, $x_3$, and $x_3^2$ is used to analyze the Fresh detergent data. Test for positive autocorrelation by setting $\alpha$ equal to 0.05.

FIGURE 12.53 MegaStat and Excel Residual Analysis for the Hospital Labour Needs Model (for Exercise 12.65)
(a) MegaStat normal plot of the residuals [vertical axis: Residual; horizontal axis: Normal Score]
(b) MegaStat plot of the residuals versus predicted hours [vertical axis: Residual (gridlines = std. error, 387.160); horizontal axis: Predicted]
(c) Excel plot of the residuals versus BedDays [vertical axis: Residuals; horizontal axis: BedDays]
(d) Excel plot of the residuals versus Length [vertical axis: Residuals; horizontal axis: Length]

FIGURE 12.54 MegaStat Output for the Fresh Detergent Data (Exercise 12.66)
(a) Residual plot for Exercise 12.66(a) [vertical axis: Residual (gridlines = std. error); horizontal axis: X3]
(b) Output for Exercise 12.66(b) [vertical axis: Residual (gridlines = std. error); horizontal axis: Observation]
Durbin–Watson = 1.62

12.67 THE QHIC CASE Consider the quadratic regression model describing the QHIC data. Figure 12.55 shows that the residual plot versus $x$ for this model fans out, indicating that the error term $\varepsilon$ tends to become larger as $x$ increases. To remedy this violation of the constant-variance assumption, we divide all terms in the quadratic model by $x$. This gives the transformed model

$\dfrac{y}{x} = \beta_0\left(\dfrac{1}{x}\right) + \beta_1 + \beta_2 x + \dfrac{\varepsilon}{x}$.

Figure 12.56(a) and (b) gives a regression output and a residual plot versus $x$ for this model.
a. Does the residual plot indicate that the constant-variance assumption holds for the transformed model?
b. Consider a home worth \$220,000. Let $\mu_0$ represent the mean yearly upkeep expenditure for all homes worth \$220,000 and $y_0$ represent the yearly upkeep expenditure for an individual home worth \$220,000. The bottom of the output in Figure 12.56(a) says that $\hat{y}/220 = 5.635$ is a point estimate of $\mu_0/220$ and a point prediction of $y_0/220$. Multiply this result by 220 to obtain $\hat{y}$. Multiply the ends of the confidence interval and prediction interval shown on the output by 220.
This will give a 95 percent confidence interval for $\mu_0$ and a 95 percent prediction interval for $y_0$.

FIGURE 12.55 MegaStat Plot of the Quadratic QHIC Model Residuals Versus x
[Residuals by Value X; vertical axis: Residual (gridlines = std. error, 146.897); horizontal axis: Value X]

FIGURE 12.56 MegaStat Output of the Transformed QHIC Model for Exercise 12.67

(a) Regression output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7134
R Square             0.508939
Adjusted R Square    0.482395
Standard Error       0.793459
Observations         40

ANOVA
             df    SS          MS          F           Significance F
Regression    2    24.14244    12.07122    19.17353    1.93E-06
Residual     37    23.29437    0.629577
Total        39    47.43681

             Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept      3.408925      1.32082          2.580915   0.013954     0.732691     6.085158
1/X          -53.50053      83.19955         -0.643039   0.524164  -222.0787     115.0776
Value X        0.011224      0.004627         2.425865   0.020266     0.001849     0.020598

(b) Residual plots
1/X Residual Plot [vertical axis: Residuals; horizontal axis: 1/X]
Value X Residual Plot [vertical axis: Residuals; horizontal axis: Value X]