Chapter 9: Multiple Regression

1. a. 5.117  b. 4.256  c. 3.863  d. 3.633  e. 3.481

2. a. 0.094  b. 0.075  c. 0.063  d. 0.055  e. 0.049

3. a. Yes. The equation is linear with respect to the error term.
   b. No. The equation is not linear with respect to the error term.
   c. Yes. The equation is linear with respect to the error term.

4. Collinearity is the state in which two or more of the predictor variables are highly correlated with each other. This means that for the purposes of multiple regression, those variables convey much of the same information.

5. a. The correlation matrix and scatterplot matrix for the variables appear as follows:

      [Figure: correlation matrix and scatterplot matrix for Buy Again, Reliability, Repairs, and Support]

   b. In Repairs vs. Support, there is an outlier with Support of 7.2 and Repairs of 8.8, belonging to Texas Instruments. In Buy Again vs. Reliability, Texas Instruments is also an outlier, with Buy Again of 4.5 and Reliability of 8.8. And finally, Texas Instruments is an outlier in Buy Again vs. Repairs, at 4.5 Buy Again and 8.8 Repairs. With respect to the willingness to buy again, the high scores belong to mail-order companies like Gateway 2000 and Dell. Big companies like Texas Instruments, AT&T, and IBM score rather low on willingness to buy again, although Apple, Compaq, Digital Equipment, and Hewlett-Packard have scores near the middle.

   c. Using the Regression command from the Analysis ToolPak, the resulting regression line is

      Buy Again = 1.53(Reliability) – 1.14(Repairs) + 1.50(Support) – 8.76

      Surprisingly, higher repair satisfaction is negatively associated with buying again once the other factors are included. The regression is fairly successful, with an R² value of .72 (.69 adjusted). All three predictors' coefficients are significant at the 5% level.

      Here are the residual plots:

      [Figure: residuals vs. predicted values and normal probability plot]

      The plot of residuals vs. predicted values does not show any special pattern. There is no indication of curvature or unequal variance, and there are no outliers. The normal plot is pretty straight, with no extreme values, so there is no suggestion of problems with the normality assumption. This might be surprising, considering that there are outliers in the scatterplot matrix. Texas Instruments has the most negative residual, and this makes sense because it showed up on the scatterplot matrix as having a very low value of Buy Again relative to its reliability and repair record.

      Here are the residual plots for the three predictors:

      [Figure: residuals vs. Reliability, Repairs, and Support]

      The individual predictors' residual plots do not indicate any model failure with respect to the three predictor variables. To summarize, the four variables are strongly related, although there are some outlying points. The large companies do not necessarily score well with PC customers, and the highest scorers are mail-order firms. The three other variables are quite successful in predicting willingness to buy again. The diagnostic plots indicate no problems with the regression assumptions.

6. a. The correlation and scatterplot matrices appear as follows:

      [Figure: correlation matrix and scatterplot matrix for Calories, Carbo, Protein, and Fat]

   b. The regression statistics are:

      [Table: Analysis ToolPak regression output for Calories on Carbo, Protein, and Fat]

      The regression can be counted as a success in the sense that the multiple R² value is quite high. The value 0.971 indicates the regression accounts for 97.1% of the variability in Calories. Both the Carbo and Protein coefficients are close to their known values (which are both 4). However, the coefficient for Fat is 12.248, and this is pretty far from the known value of 9. The 95% confidence interval for the Fat coefficient does, however, include the known value within its range.
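      To check these numbers outside the ToolPak, here is a minimal sketch using Python's statsmodels; the file name and column names are assumptions, not the workbook's actual layout:

      ```python
      # Minimal sketch: refit the calories regression and inspect the
      # coefficients and their 95% confidence intervals.
      import pandas as pd
      import statsmodels.api as sm

      cereal = pd.read_csv("cereal.csv")   # hypothetical file name
      X = sm.add_constant(cereal[["Carbo", "Protein", "Fat"]])
      fit = sm.OLS(cereal["Calories"], X).fit()

      print(fit.rsquared)    # should be near 0.971
      print(fit.params)      # Carbo and Protein near 4; Fat near 12.2
      print(fit.conf_int())  # check whether the Fat interval covers the known value 9
      ```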
   c. The coefficient for Fat has a high standard error and is inaccurate compared to the known value because Fat amounts are given to the nearest half, and the amounts themselves are small (all 2 or less). This means that the percentage error in the recorded Fat values can be very large, so it is reasonable to expect that the Fat coefficient would be inaccurate.

   d. The residuals vs. predicted values plot appears as follows:

      [Figure: residuals vs. predicted values]

      Wheaties is an outlier, with predicted calories per serving of 110 but actual calories per serving of 100. Using the known values, we find that there should be at least 4(3) + 4(23) + 9(1) = 113 calories per serving, considerably higher than the advertised 100. Pretzels have the same amounts of protein, carbohydrates, and fat, yet state 110 calories per serving. It is possible that the company erred in its measurements, that the error is due to rounding, or that the company intentionally understated the total calories.

7. a. The coefficient for Fat in this regression is 9.74, much closer to the accepted value of 9. Also, the standard error is 0.77, as opposed to 4.10 in the previous regression, so the inclusion of the high-fat danish improved the regression a lot.

8. a. Using the Regression command from the Analysis ToolPak, the resulting regression model is calculated to be:

      Runs = –2.74 + 0.44(singles) + 0.62(doubles) + 2.10(triples) + 1.47(homeruns) + 0.48(walks)

      [Table: complete output from the Regression command]

      The Rosner-Woods coefficients are contained within the 95% confidence intervals for the coefficients from this regression. In other words, the results here are consistent with the results obtained earlier.

   b. Yes, the Rosner-Woods coefficients make more sense, since one would expect the home runs coefficient to be larger than the triples coefficient. Apparently there were lots of runners on base when triples were hit that day, accounting for the large number of runs scored on triples!

   c. With more data, the coefficients would be closer to the Rosner-Woods values.

9. a. The scatterplot matrix and correlation matrix appear as follows:

      [Figure: scatterplot matrix and correlation matrix for Price, Age, and Miles]

      There is a strong negative correlation between Price and Age (–0.873) with a p-value < 0.001. There is also a strong negative correlation between Price and Miles (–0.702) with a p-value of 0.011.

   b. Here is the output from the Analysis ToolPak's Regression command:

      [Table: regression output for Price on Age and Miles]

      The regression equation is Price = 10246 – 721(Age) – 0.019(Miles); R² = .77 (.72 adjusted). Examining the residuals, you note that there is a slight upward trend in the residuals vs. predicted values plot, which may cause you to doubt that the variation in price is completely explained by the model. There is no indication of non-normality in the normal probability plot of the residuals. Examining the plots of the residuals vs. the predictor variables leads you to conclude that the 12-year-old car has a price that is an outlier in relation to the price/age pattern of the younger cars. Aside from the 12-year-old car, there is a strong negative relationship between the residuals and age.

   c. The regression statistics for the reduced data set are:

      Price = 11796 – 1276(Age) – 0.022(Miles); R² = .81 (.77 adjusted)

      [Figure: residual plots for the reduced data set]

      There is no indication of model failure in any of the residual plots.
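      The two diagnostic plots used throughout these answers (residuals vs. predicted values, and a normal probability plot of the residuals) can be sketched programmatically; the car data below are invented for illustration, not the exercise's actual values:

      ```python
      # Minimal sketch of the two standard residual checks.
      import numpy as np
      import matplotlib.pyplot as plt
      import statsmodels.api as sm
      from scipy import stats

      rng = np.random.default_rng(0)
      age   = np.array([1., 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 12])        # invented data
      miles = np.array([14., 18, 25, 30, 41, 44, 52, 55, 60, 65, 72, 120]) * 1000
      price = 11796 - 1276 * age - 0.022 * miles + rng.normal(0, 500, age.size)

      fit = sm.OLS(price, sm.add_constant(np.column_stack([age, miles]))).fit()

      fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
      ax1.scatter(fit.fittedvalues, fit.resid)   # look for trends, curvature, funnels
      ax1.axhline(0, color="gray")
      ax1.set(xlabel="Predicted", ylabel="Residual")
      stats.probplot(fit.resid, plot=ax2)        # normal probability plot
      plt.show()
      ```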
      There is a big change in the coefficient for Age between the two models. By omitting the oldest car, the coefficient nearly doubles, from a price decline of $721/year in the first model to a decline of $1276/year in the second model. The second model is better, since the first model showed serious flaws in the residual plots. The fact that there are no cars with ages between 6 and 12 also makes it unwise to try to model the price/age relationship during those years, since there is no evidence that the relationship will continue to be linear. Finally, there is a slight increase in the adjusted R² value in going from the first model to the second.

   d. The Miles coefficient is not significant in either regression. None of the cars have high mileage relative to their age. It seems that the mileage was not given in the advertisements for cars with high mileage, so the mileage that is given is quite predictable from the age, and the mileage therefore offers no new information beyond what is known from the age.

10. a. The scatterplot matrix and correlation matrix for the log(Price) variable appear as follows:

      [Figure: scatterplot matrix and correlation matrix for Log Price, Age, and Miles]

      The correlations are stronger than the corresponding correlations with Price: –0.970 between Log Price and Age, and –0.732 between Log Price and Miles.

    b. The output from the Analysis ToolPak Regression command is as follows:

      Log Price = 9.42 – 0.169(Age) – 0.0000018(Miles)

      The residual plots appear as follows:

      [Figure: residual plots for the Log Price model]

      There do not appear to be any major departures from the regression assumptions. The 12-year-old car does appear separate from the other models in the Residuals vs. Predicted Values and Residuals vs. Age plots, but the remaining observations do not appear to demonstrate any slope. With Log Price, the R² value is 0.94 (0.93 adjusted), an improvement over the 0.77 (0.72 adjusted) value for the Price model. The correlation is stronger, the R² value is higher, and the old car no longer has a large residual, so the regression is much improved. Miles is even less significant in this regression.

    c. It is more sensible to use Log Price, since cars depreciate quickly when they are new and more slowly when they are older. To see how the log model works, express Log Price in terms of Age and Miles and exponentiate the equation:

      Log Price = 9.42 – 0.169(Age) – 0.0000018(Miles)

      to get:

      Price = e^(9.42 – 0.169(Age) – 0.0000018(Miles)) = e^9.42 × e^(–0.169(Age)) × e^(–0.0000018(Miles)) = 12337 × 0.844^Age × 0.999998^Miles

      This estimates that a new car costs $12,337 and that the value is multiplied by 0.844 for each year of Age. In other words, the price drops by 15.6% each year (1 – 0.844 = 0.156). This makes more sense than having the price drop by $721 or $1276 each year, because eventually the price would become negative, and because newer cars lose value faster than older cars.

11. a. The correlation matrix and scatterplot matrix appear as follows:

      [Figure: correlation matrix and scatterplot matrix for MPG, Cylinders, Engine Disp, Horsepower, Weight, Acceleration, and Year]

      All the p-values for the Pearson correlations are extremely close to 0. The statistical significance is partly due to the large number of observations involved, which makes it possible to detect smaller correlations.

    b. The output from the Analysis ToolPak's Regression command appears as follows:

      MPG = –14.54 – 0.330(Cylinders) + 0.00768(Engine Disp) – 0.00039(Horsepower) – 0.00679(Weight) + 0.0853(Acceleration) + 0.753(Year)

    c. Weight, Horsepower, and Engine Disp are all strongly pairwise related, meaning that changes in one are accompanied by changes in the other two. Once Weight has been added to the regression model, there is little "extra" that Horsepower and Engine Disp can add to the model; therefore their effects will appear insignificant.
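      One common way to quantify this kind of overlap is the variance inflation factor (VIF), which measures how much each coefficient's variance is inflated by correlation with the other predictors. A minimal sketch, with the file and column names assumed:

      ```python
      # Minimal sketch: VIFs for the MPG predictors. A VIF much larger
      # than 10 is a common rule of thumb for serious collinearity.
      import pandas as pd
      import statsmodels.api as sm
      from statsmodels.stats.outliers_influence import variance_inflation_factor

      cars = pd.read_csv("cars.csv")   # hypothetical file name
      X = sm.add_constant(cars[["Cylinders", "Engine Disp", "Horsepower",
                                "Weight", "Acceleration", "Year"]])
      for i, name in enumerate(X.columns):
          if name != "const":
              print(name, variance_inflation_factor(X.values, i))
      ```

      We would expect Weight, Horsepower, and Engine Disp to show the largest VIFs here.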
    d. The plot of Residuals vs. Predicted Values appears as follows:

      [Figure: residuals vs. predicted values]

      The residuals seem to form a U-shaped curve, indicating that the model has some serious flaws. Transforming the data may remove the curve.

    e. The regression statistics for the Log(MPG) variable appear as follows:

      Regression Statistics
        Multiple R           0.935
        R Square             0.874
        Adjusted R Square    0.872
        Standard Error       0.053
        Observations         392

      ANOVA
                     df     SS             MS          F          Significance F
        Regression     6    7.454955486    1.242493    446.1594   6.381E-170
        Residual     385    1.072171984    0.002785
        Total        391    8.52712747

                      Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
        Intercept          0.794          0.073        10.845    0.000       0.650       0.938
        Cylinders         -0.010          0.005        -1.982    0.048      -0.020       0.000
        Engine Disp        0.000          0.000         1.104    0.270       0.000       0.000
        Horsepower         0.000          0.000        -2.047    0.041      -0.001       0.000
        Weight             0.000          0.000       -11.151    0.000       0.000       0.000
        Accelerate        -0.001          0.002        -0.336    0.737      -0.004       0.003
        Year               0.013          0.001        15.930    0.000       0.011       0.014

      Log(MPG) = 0.794 – 0.0101(Cylinders) + 0.00125(Engine Disp) – 0.0044(Horsepower) – 0.00011(Weight) – 0.00053(Acceleration) + 0.0129(Year)

      The plot of Residuals vs. Predicted Values appears as follows:

      [Figure: residuals vs. predicted values for the Log(MPG) model]

      The transformation has improved the scatterplot, though there is still a curvilinear trend in the plot. The plots of the residuals vs. each of the predictor variables appear as follows:

      [Figure: residuals vs. each predictor variable]

      The model appears to fail for the Weight, Cylinders, and Engine Displacement variables. There do not appear to be any problems with Acceleration, Year, or Horsepower. These problems might be fixed by including powers of the different variables, for example Weight² or (Engine Displacement)².

12. a. The following steps are used to remove the variables from the model:

      Step   Status                                                              Action
      1      The least significant nonsignificant predictor is Acceleration      Remove Acceleration
             (p = 0.737)
      2      The least significant nonsignificant predictor is Engine Disp       Remove Engine Disp
             (p = 0.251)
      3      The least significant nonsignificant predictor is Cylinders         Remove Cylinders
             (p = 0.106)
      4      All remaining predictors are significant                            Stop

      The final regression equation is:

      Log(MPG) = 0.76 – 0.00038(Horsepower) – 0.00012(Weight) + 0.013(Year)

    b. The R² value of the full model is 0.874. The R² of the reduced model is 0.873. By removing three variables from the regression model, we have reduced the R² value by only 0.001.
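      The removal procedure in the table above is backward elimination, and it is mechanical enough to automate. A minimal sketch, assuming a data file with a LogMPG column and the six predictor columns (file and column names are assumptions):

      ```python
      # Minimal sketch: backward elimination on p-values.
      import pandas as pd
      import statsmodels.api as sm

      def backward_eliminate(y, X, alpha=0.05):
          """Refit repeatedly, dropping the least significant predictor
          until every remaining p-value is below alpha."""
          X = sm.add_constant(X)
          while True:
              fit = sm.OLS(y, X).fit()
              pvals = fit.pvalues.drop("const")   # never drop the intercept
              worst = pvals.idxmax()
              if pvals[worst] < alpha:
                  return fit
              X = X.drop(columns=worst)

      cars = pd.read_csv("cars.csv")              # hypothetical file name
      final = backward_eliminate(cars["LogMPG"],
                                 cars[["Cylinders", "Engine Disp", "Horsepower",
                                       "Weight", "Acceleration", "Year"]])
      print(final.params)   # should retain Horsepower, Weight, and Year
      ```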
13. a. The regression statistics are:

      Regression Statistics
        Multiple R           0.940
        R Square             0.884
        Adjusted R Square    0.881
        Standard Error       0.046
        Observations         245

      ANOVA
                     df     SS       MS       F          Significance F
        Regression     6    3.871    0.645    301.019    0.000
        Residual     238    0.510    0.002
        Total        244    4.382

                      Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
        Intercept          0.911          0.084        10.780    0.000       0.744       1.077
        Cylinders         -0.018          0.005        -3.488    0.001      -0.028      -0.008
        Engine Disp        0.000          0.000         0.218    0.828       0.000       0.000
        Horsepower        -0.001          0.000        -2.946    0.004      -0.001       0.000
        Weight             0.000          0.000        -7.124    0.000       0.000       0.000
        Accelerate        -0.008          0.002        -4.346    0.000      -0.011      -0.004
        Year               0.012          0.001        13.012    0.000       0.011       0.014

      For the American cars only, the regression model that includes all the predictor variables is:

      Log(MPG) = 0.911 – 0.0179(Cylinders) + 0.0000273(Engine Disp) – 0.00062(Horsepower) – 0.000080(Weight) – 0.0079(Acceleration) + 0.012402(Year)

    b. The following residual plots check the regression assumptions:

      [Figure: residual plots for the American-car model]

      The residual plots show the model does not fulfill all the assumptions. There are a few outliers which show up on the normal probability plot. There is some evidence of non-constant variance in the Residuals vs. Predicted Values plot, as the spread of the residuals is smaller for the low and the high predicted values. There is some evidence of curvature in the Year, Weight, and Engine Displacement plots. However, the multiple R² value is 0.884, which means that this model does account for 88.4% of the variation in American Log(MPG). We would by no means be content with this model in its present form as the final model, but it does make a good starting point for discussion and is "reasonable" in that context.

    c. The first few values of the columns are:

      Model                        Log MPG        Predicted   Residuals
      amc ambassador dpl           1.176091259    1.1536       0.0225
      amc gremlin                  1.322219295    1.2909       0.0314
      amc hornet                   1.255272505    1.2725      -0.0172
      amc rebel sst                1.204119983    1.1817       0.0225
      buick estate wagon (sw)      1.146128036    1.1828      -0.0366
      buick skylark 320            1.176091259    1.1568       0.0193
      chevrolet chevelle malibu    1.255272505    1.1885       0.0668
      chevrolet impala             1.146128036    1.0925       0.0536
      chevrolet monte carlo        1.176091259    1.1778      -0.0018
      chevy c20                    1              1.0517      -0.0517

    d. The plot is:

      [Figure: residuals vs. predicted values, with American, European, and Japanese cars marked separately]

    e. The descriptive statistics for the residuals are:

                                Origin = "American"   Origin = "European"   Origin = "Japanese"
      Count                         245                   68                    79
      Sum                            0.0000                1.9497                2.2801
      Average                        0.000000              0.028672              0.028862
      Median                         0.0037                0.0224                0.0346
      Minimum                       -0.1638               -0.1087               -0.1768
      Maximum                        0.1804                0.2074                0.1852
      Range                          0.3442                0.3161                0.3619
      Standard Deviation             0.0457251             0.0704523             0.0612171
      Variance                       0.0020908             0.0049635             0.0037475
      Standard Error                 0.0029213             0.0085436             0.0068875
      t statistic (mean = 0)         0.000                 3.356                 4.191
      t statistic p-value            1.000                 0.001                 0.000
      lower 95% c.i.                -0.005754              0.011619              0.015150
      upper 95% c.i.                 0.005754              0.045725              0.042574

    f. After adjusting for the other factors as determined by the regression equation for American cars, European cars have a higher Log(MPG) by 0.0116 to 0.0457 points (95% confidence interval). The 95% confidence interval for Japanese cars is (0.0152, 0.0426). In terms of MPG, this means an increase of 2.7% to 11.1% for European cars and 3.6% to 10.3% for Japanese cars.
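      The "t statistic (mean = 0)" rows in the table above are one-sample t-tests of whether each group's mean residual is zero. A minimal sketch of the same computation, assuming a data file with Origin and Residual columns (file and column names are assumptions):

      ```python
      # Minimal sketch: test whether each origin's mean residual differs from 0.
      import pandas as pd
      from scipy import stats

      resids = pd.read_csv("residuals.csv")   # hypothetical file name
      for origin, grp in resids.groupby("Origin"):
          t, p = stats.ttest_1samp(grp["Residual"], popmean=0.0)
          print(f"{origin}: t = {t:.3f}, p = {p:.3f}")
      # American residuals average 0 by construction (those cars were fit);
      # the European and Japanese means come out significantly above 0.
      ```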
14. a. The scatterplot appears as follows:

      [Figure: map plot of January temperatures, plotted by longitude and latitude]

    b. The regression statistics are:

      Regression Statistics
        Multiple R           0.861
        R Square             0.741
        Adjusted R Square    0.731
        Standard Error       6.935
        Observations         56

      ANOVA
                     df    SS          MS          F        Significance F
        Regression    2    7297.335    3648.667    75.875   2.79154E-16
        Residual     53    2548.647      48.088
        Total        55    9845.982

                      Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
        Intercept         98.645          8.327        11.846    0.000      81.943     115.347
        Long               0.134          0.063         2.122    0.039       0.007       0.261
        Lat               -2.164          0.176       -12.314    0.000      -2.516      -1.811

    c. The p-value for the Latitude predictor variable is < 0.001, while the p-value for the Longitude predictor variable is 0.039. Both factors are significant at the 5% level. The R² value is 0.741, so 74.1% of the variation in temperature is explained by the regression equation.

    d. The residual values on the map plot appear as follows:

      [Figure: map plot of the residuals, plotted by longitude and latitude]

    e. Cities on the East and West coasts show positive residuals, while cities in the country's interior show negative residuals.

15. a. The regression statistics are:

      Regression Statistics
        Multiple R           0.857
        R Square             0.735
        Adjusted R Square    0.728
        Standard Error       19847.671
        Observations         117

      ANOVA
                     df     SS             MS          F          Significance F
        Regression     3    1.23375E+11    4.11E+10    104.397    1.95925E-32
        Residual     113    44514094371    3.94E+08
        Total        116    1.67889E+11

                      Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
        Intercept       7530.949        7412.625       1.016     0.312    -7154.793    22216.691
        Square Feet       58.437           3.830      15.258     0.000       50.849       66.025
        Age             -374.156         163.497      -2.288     0.024     -698.072      -50.239
        Features        2257.974        1445.182       1.562     0.121     -605.192     5121.140

      Price = 7530.95 + 58.437(Square Feet) – 374.16(Age) + 2257.97(Features)

    b. The residual plot is:

      [Figure: residuals vs. predicted values]

      The residual plot indicates a possible violation of the assumption of constant variance in the residuals. As the predicted value increases, the spread of the residuals increases as well.

    c. The regression statistics are:

      Regression Statistics
        Multiple R           0.866
        R Square             0.751
        Adjusted R Square    0.744
        Standard Error       0.070
        Observations         117

      ANOVA
                     df     SS             MS          F          Significance F
        Regression     3    1.680060833    0.56002     113.4214   6.10099E-34
        Residual     113    0.557939533    0.004938
        Total        116    2.238000365

                      Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
        Intercept          4.641          0.026        176.837    0.000       4.589       4.693
        Square Feet        0.000          0.000         15.787    0.000       0.000       0.000
        Age               -0.002          0.001         -2.613    0.010      -0.003       0.000
        Features           0.009          0.005          1.765    0.080      -0.001       0.019

      The residual plot is:

      [Figure: residuals vs. predicted values for the Log(Price) model, with the outlier house priced at $129,500 marked]

      The transformation appears to take care of the problem of non-constant variance.
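      Part d below compares a back-transformed prediction from this model with a house's actual price. A minimal sketch of that back-transformation, taking the model to be in log base 10 and using the rounded coefficients from the table above; the Square Feet coefficient is assumed for illustration, since the table rounds it to 0.000:

      ```python
      # Minimal sketch: convert a Log(Price) prediction back to dollars.
      b0, b_sqft, b_age, b_feat = 4.641, 0.00028, -0.002, 0.009   # b_sqft is assumed

      def predicted_price(sqft, age, features):
          log10_price = b0 + b_sqft * sqft + b_age * age + b_feat * features
          return 10 ** log10_price   # undo the log-10 transformation

      print(predicted_price(2000, 10, 3))   # hypothetical house
      ```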
    d. The point belongs to a house that is priced at $129,500. However, based on the regression model using the Log(Price) variable, this house should be priced at about $282,426. Thus, based on the model, the house is very underpriced.

16. a. The scatterplot and trend line appear as follows:

      [Figure: scatterplot of Unemployment vs. the FRB index, with trend line]

      Though the trend line is positive, there is a lot of variability in the data. It is unclear whether unemployment rises with the FRB index.

    b. Unemployment = –0.035 + 0.021(FRB index); R² = 0.098. The regression explains only 9.8% of the variability in unemployment.

    c. After adding Year to the regression, the equation is:

      Unemployment = 13.454 – 0.103(FRB index) + 0.659(Year)

      R² = 0.866, accounting for 86.6% of the variability.

    d. The parameter for the FRB index has changed sign from one regression to the other. Taking these results at face value, we would come to different conclusions about the relationship between unemployment and the FRB index depending on which regression we used.

    e. The Pearson correlation between the FRB index and Year is 0.906 with a p-value < 0.001, so there is a significant correlation between the FRB index and the Year variable. Because Year and the FRB index are highly correlated, they are essentially providing the same information to the regression equation. The collinearity makes it difficult to interpret the regression equation when both are present. Thus the regression equation parameters are highly suspect.
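      To see how collinearity can flip a coefficient's sign like this, here is a small synthetic simulation; the numbers are invented, not the exercise's data:

      ```python
      # Minimal sketch: a sign flip caused by collinearity. The index
      # tracks Year, and Year drives unemployment, so in a simple
      # regression the index "inherits" Year's positive effect.
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      year  = np.arange(1950, 1970, dtype=float)
      index = 2.0 * (year - 1950) + rng.normal(0, 3, year.size)
      unemp = 0.6 * (year - 1950) - 0.1 * index + rng.normal(0, 0.5, year.size)

      # Index alone: the slope comes out positive.
      print(sm.OLS(unemp, sm.add_constant(index)).fit().params)

      # Index plus Year: the index's coefficient turns negative (near -0.1).
      print(sm.OLS(unemp, sm.add_constant(np.column_stack([index, year]))).fit().params)
      ```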