Chapter 08 - Model Selection in Multiple Linear Regression Analysis

CHAPTER 8

Answers to End of Chapter Problems

8.1
a. On average, holding average points per game, average rebounds, and player position constant, if the number of years a player has been in the NBA goes up by one year, salary increases by 16%.

b. $$\ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 F_i \times RPG_i + \beta_7 G_i \times RPG_i + \varepsilon_i$$
To test this hypothesis, perform a t-test of $\beta_6 = 0$ (the salary return to rebounds is the same for forwards and centers), a t-test of $\beta_7 = 0$ (the salary return to rebounds is the same for guards and centers), and an F-test of the joint hypothesis $\beta_6 = \beta_7 = 0$.

c. Note that the question says ANY differences.
$$\ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 Foreign_i + \beta_7 Foreign_i \times Yrs_i + \beta_8 Foreign_i \times PPG_i + \beta_9 Foreign_i \times RPG_i + \varepsilon_i$$
where Foreign = 1 if the player is foreign born and Foreign = 0 if the player is born in the U.S.
Hypothesis:
$H_0: \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0$
$H_1:$ at least one $\beta_j$ is not equal to 0
Test statistic:
$$F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/4}{SSUnexplained_{unrestricted}/(n-k-1)}$$
where the restricted model is the original model in the problem (or, alternatively, the model with the null hypothesis imposed).
Critical value: $F_{\alpha,4,n-k-1}$
Rejection rule: Reject $H_0$ if F-stat > $F_{\alpha,4,n-k-1}$

d. This is a Davidson-MacKinnon test.
(1) Estimate the model
$$\ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 Foreign_i + \beta_7 Foreign_i \times Yrs_i + \beta_8 Foreign_i \times PPG_i + \beta_9 Foreign_i \times RPG_i + \varepsilon_i$$
and obtain the predicted value $\widehat{\ln(Salary_i)}$.
(2) Add the predicted value from step (1) to the model
$$\ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 F_i \times RPG_i + \beta_7 G_i \times RPG_i + \beta_8 \widehat{\ln(Salary_i)} + \varepsilon_i$$

Copyright © 2014 McGraw-Hill Education.
All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

(3) Perform a t-test for the statistical significance of $\beta_8$. If it is statistically significant, then the model from step (1) may be preferred.

8.2
a. The unrestricted model is
$$FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i$$
while the restricted model is
$$FR_i = \beta_0 + \beta_1 Age_i + \varepsilon_i$$
Hypothesis:
$H_0: \beta_2 = \beta_3 = 0$
$H_1:$ at least one $\beta_j$ is not equal to 0
Test statistic:
$$F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/2}{SSUnexplained_{unrestricted}/(n-k-1)}$$
where the restricted model is the original model in the problem (or, alternatively, the model with the null hypothesis imposed).
Critical value: $F_{\alpha,2,n-k-1}$
Rejection rule: Reject $H_0$ if F-stat > $F_{\alpha,2,n-k-1}$

b. Set $\beta_1 - \beta_2 = \theta$, solve for $\beta_1$ to get $\beta_1 = \theta + \beta_2$, and then substitute for $\beta_1$ in the original model:
$$FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i$$
$$FR_i = \beta_0 + (\theta + \beta_2) Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i$$
$$FR_i = \beta_0 + \theta Age_i + \beta_2 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i$$
$$FR_i = \beta_0 + \theta Age_i + \beta_2 (Age_i + Educ_i) + \beta_3 Urban_i + \varepsilon_i$$
This last equation isolates the parameters that need to be estimated: $\theta$, $\beta_2$, and $\beta_3$. A new variable needs to be created by adding the age and education columns together, and FR is then regressed on Age, (Age + Educ), and Urban. The coefficient on Age is the estimate of $\hat\beta_1 - \hat\beta_2$, the standard error on Age is the standard error for this hypothesis, the t-statistic on Age is the test statistic for this test, and the p-value on Age is the p-value for this test. Reject the null hypothesis that the two coefficients are equal if the p-value is less than the significance level $\alpha$.

c.
This is a Davidson-MacKinnon test.
(1) Estimate the model
$$\ln FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i$$
and obtain the predicted value $\widehat{\ln(FR_i)}$.
(2) Add the predicted value from step (1) to the model
$$FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \beta_4 \widehat{\ln(FR_i)} + \varepsilon_i$$
(3) Perform a t-test for the statistical significance of $\beta_4$. If it is statistically significant, then the model from step (1) may be preferred.

The reason that the semi-log model is more likely to lead to biased estimates is that the natural log is a non-linear function, and the estimates it yields (without a transformation) are already biased even if the true model is non-linear. The choice of specification should largely be made on the underlying economics. If economic theory says that when age (or education) goes up by one year the percentage change in the fertility rate is constant, then the semi-log model should be estimated.

The coefficient on Age is interpreted as: on average, holding education and urban constant, if an individual gets one year older, then the fertility rate increases (decreases) by $\beta_1 \times 100$%. The coefficient on Education is interpreted as: on average, holding age and urban constant, if an individual gets one more year of education, then the fertility rate increases (decreases) by $\beta_2 \times 100$%. The coefficient on Urban is interpreted as: on average, holding age and education constant, if an individual lives in an urban area, the fertility rate is $\beta_3 \times 100$% higher (lower) relative to living in a rural area.

8.3
a. To find where pollution reaches a maximum (or where diminishing marginal returns set in), set $4000 - 0.25(2)GDP_i = 0$, which holds when GDP per capita is $8000.

b.
If all of the other multiple linear regression assumptions hold, then the consequence of heteroskedasticity is that the OLS estimates are no longer BLUE, although they remain unbiased. The other consequence is that the usual standard errors and hypothesis tests are incorrect.

c. This is chapter 9 material.

d. This is chapter 9 material.

e. Because the dependent variable hasn't changed, you can compare the R-squared values between the two models, and if one R-squared is clearly larger than the other, then that model is preferred. You could also perform a Davidson-MacKinnon test.

8.4
a. This is the two-step estimator for multiple linear regression analysis. First, a formal proof. The estimates are obtained by minimizing the sum of squared residuals
$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i})^2$$
which amounts to taking the derivatives with respect to $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ and setting them equal to 0, yielding the normal equations
$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0$$
$$\sum_{i=1}^{n} x_{1,i} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0$$
$$\sum_{i=1}^{n} x_{2,i} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0$$
Note that when $x_1$ is regressed on $x_2$, $x_1$ can be written as the sum of the predicted values and the residuals from that regression, or $x_{1,i} = \hat{x}_{1,i} + \hat{r}_{1,i}$.
Substitute this into the second normal equation to obtain
$$\sum_{i=1}^{n} (\hat{x}_{1,i} + \hat{r}_{1,i})(y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0$$
Let $e_i = y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}$ denote the residual. Because the sum of the predicted values times the residual is equal to zero, or $\sum_{i=1}^{n} \hat{x}_{1,i} e_i = 0$, the equation reduces to
$$\sum_{i=1}^{n} \hat{r}_{1,i} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0$$
Now, because the $\hat{r}_{1,i}$ are the residuals from the regression of $x_1$ on $x_2$, we have $\sum_{i=1}^{n} x_{2,i} \hat{r}_{1,i} = 0$, and because the sum of residuals is always equal to 0, $\sum_{i=1}^{n} \hat{r}_{1,i} = 0$. The $\hat\beta_0$ and $\hat\beta_2$ terms therefore drop out and we are left with
$$\sum_{i=1}^{n} \hat{r}_{1,i} (y_i - \hat\beta_1 (\hat{x}_{1,i} + \hat{r}_{1,i})) = 0$$
and then, using the fact that $\sum_{i=1}^{n} \hat{x}_{1,i} \hat{r}_{1,i} = 0$, we get
$$\sum_{i=1}^{n} \hat{r}_{1,i} (y_i - \hat\beta_1 \hat{r}_{1,i}) = 0$$
Solving for $\hat\beta_1$:
$$\hat\beta_1 = \frac{\sum_{i} \hat{r}_{1,i} y_i}{\sum_{i} \hat{r}_{1,i}^2}$$
In Venn diagram form, and less formally: when $x_1$ is regressed on $x_2$, the part the regression captures is the pink and dark orange area, and the residuals of that regression are the red plus yellow area. Then, when y is regressed on those residuals (i.e., only the red and yellow part of $x_1$), only the red area is left.

b. The expression is
$$Var(\hat\beta_1) = \frac{\sum_{i} e_i^2 / (n-k-1)}{SST_1 (1 - R_1^2)}$$
When $x_1$ and $x_2$ have a large amount of independent variation, $R_1^2$ is small, $(1 - R_1^2)$ is large, and 1 divided by that value is small (note that $R_1^2$ is bounded between 0 and 1), so the variance is small. If $x_1$ and $x_2$ have a small amount of independent variation, $R_1^2$ is large, $(1 - R_1^2)$ is small, and 1 divided by that value is large, so the variance is large.

c. No, including irrelevant variables does not cause the estimates to be biased. If being taller is strongly related to being married, then $R_1^2$ is large, $(1 - R_1^2)$ is small, and 1 divided by that value is large.

8.5
a.
Two new variables need to be created by multiplying INF by home runs and INF by batting average, and then the following regression model is estimated (the coefficients on INF and ALLStar are written $\delta_1$ and $\delta_2$ so that no coefficient index is used twice):
$$\ln(Salary_i) = \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \delta_1 INF_i + \delta_2 ALLStar_i + \beta_5 INF_i \times HR_i + \beta_6 INF_i \times BA_i + \varepsilon_i$$
Testing these two hypotheses requires two t-tests.
$H_0: \beta_5 \geq 0$ (infielders do not get paid less to hit home runs than outfielders)
$H_1: \beta_5 < 0$ (infielders get paid less to hit home runs than outfielders)
Reject $H_0$ if t-stat < $-t_{\alpha,n-k-1}$.
$H_0: \beta_6 \leq 0$ (infielders do not get paid more to have a high batting average than outfielders)
$H_1: \beta_6 > 0$ (infielders get paid more to have a high batting average than outfielders)
Reject $H_0$ if t-stat > $t_{\alpha,n-k-1}$.
Note that because these are one-sided tests, the critical value is $t_{\alpha,n-k-1}$ ($\alpha$ remains whole when obtaining the critical value). If the p-value approach is used, the two-sided p-value in the regression output needs to be divided by 2 (provided the estimate has the sign stated in $H_1$) and then compared to $\alpha$.

b. Define a new variable as Native = 1 if the player is native born and Native = 0 if the player is foreign born. Multiply this dummy variable by the independent variables that were originally in the model. The new model becomes
$$\ln(Salary_i) = \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \delta_1 INF_i + \delta_2 ALLStar_i + \beta_5 Native_i + \beta_6 Native_i \times Exp_i + \beta_7 Native_i \times RBI_i + \beta_8 Native_i \times HR_i + \beta_9 Native_i \times INF_i + \beta_{10} Native_i \times ALLStar_i + \varepsilon_i$$
To test for any differences, use an F-test.
Hypothesis:
$H_0: \beta_5 = \beta_6 = \beta_7 = \beta_8 = \beta_9 = \beta_{10} = 0$
$H_1:$ at least one $\beta_j$ is not equal to 0
Test statistic:
$$F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/6}{SSUnexplained_{unrestricted}/(n-k-1)}$$
where the restricted model is the original model in the problem (or, alternatively, the model with the null hypothesis imposed).
Critical value: $F_{\alpha,6,n-k-1}$
Rejection rule: Reject $H_0$ if F-stat > $F_{\alpha,6,n-k-1}$

c. This is a Davidson-MacKinnon test.
(1) Estimate the model
$$\ln(Salary_i) = \beta_0 + \beta_1 \ln(Exp)_i + \beta_2 BA_i + \beta_3 INF_i + \beta_4 BA_i^2 + \beta_5 RBI_i^2 + \varepsilon_i$$
and obtain the predicted value $\widehat{\ln(Salary_i)}$.
(2) Add the predicted value from step (1) to the model (the coefficients on INF and ALLStar are written $\delta_1$ and $\delta_2$ so that no coefficient index is used twice)
$$\ln(Salary_i) = \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \delta_1 INF_i + \delta_2 ALLStar_i + \beta_5 \widehat{\ln(Salary_i)} + \varepsilon_i$$
(3) Perform a t-test for the statistical significance of $\beta_5$ in the model from step (2). If it is statistically significant, then the model from step (1) may be preferred.
Because the left-hand-side variable doesn't change between the two specifications, the R-squared values of the two models can also be compared.

Answers to End of Chapter Exercises

E8.1
a. The problem with ability is that it is hard to obtain a variable that appropriately measures it, even though ability is certainly a determinant of GPA. Individuals with a higher ability also typically have a higher GPA and vice versa. Omitted variable bias becomes an issue because ability is also related to hours studied, work, video games, and possibly even texts. The omission of a relevant variable causes the coefficient estimates to be biased.
This means that all coefficient estimates are wrong on average and that all hypothesis tests and confidence intervals are also incorrect.

b. The consequence of including an irrelevant variable is that the estimates are not biased, but the standard errors typically become inflated. In this case, the inclusion of the irrelevant variable did not change the overall decisions about statistical significance. Notice that when Eye Color was included, the R-squared went up but the adjusted R-squared went down.

c. It is much better to include an irrelevant variable than to omit a relevant variable, because larger standard errors are much better than biased estimators. Most of the time, omitted variables are not omitted because the researcher was sloppy and didn't think to include that variable, but rather because data on that variable are not available.

E8.2
a. See Excel Worksheet.

b. From this regression we see that distance to the beach is not statistically significant but missing is statistically significant, suggesting that the observations with missing data have a housing price that is $297,185.14 lower than that of the observations without missing data.

c. An easy way to test this hypothesis is to regress housing price on the missing column alone, which yields a difference in means.
This regression suggests that the mean housing price for data without missing observations is $795,333.21, while the mean housing price for data with missing observations is $795,333.21 - $168,943.28 = $626,389.93. The p-value suggests that this difference in means is not statistically significant.

Another way to see if the missing data cause issues is to perform the regression with only the data that have the distance to the beach observations.

The regression with only the 43 observations that have data on distance to the beach gives somewhat different results than the regression that accounted for the missing data. The beach distance variable is now statistically significant at the 5% level and suggests that for each additional mile a house is away from the beach, the price drops by $17,497.39.

E8.3
a. See graph below.

[Figure: Units Sold vs. MP Sales. x-axis: Online MP (0 to 1.2); y-axis: Units Sold (0 to 2,500,000)]

The two potential outliers are Call of Duty: Black Ops 2 and Assassin's Creed 3.

b. The coefficient on outlier is statistically significant at the 1% level, suggesting that the two outliers have, on average, 1,162,603.11 more units sold than the 51 other observations.
Interacting this with Online MP, we obtain the regression.

In this regression, the outlier without Online MP has 501,950.17 more sales than a non-outlier, and the outlier with Online MP has 7,424.4 + 501,950 + 1,266,616 = 1,775,808 more units sold than video games that are not outliers and have no Online MP.

c. It doesn't look like either outlier was there due to a special reason, except that both Call of Duty: Black Ops 2 and Assassin's Creed 3 are extremely popular video games.

E8.4
a. In this regression, the only statistically significant independent variable is square feet. On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by one foot, then the price increases by .062%. Even though square feet is statistically significant, it is not economically significant because the coefficient estimate is so small.

b. In this regression, the only statistically significant independent variables are log square feet and bedrooms at the 5% level, and bathrooms at the 10% level. On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by 1%, then the price increases by 1.14%. On average, holding log square feet, bathrooms, lot size, and pool constant, if bedrooms goes up by 1%, then the price decreases by .107%. On average, holding log square feet, bedrooms, lot size, and pool constant, if bathrooms goes up by 1%, then the price increases by .113%.

c.
Performing the Davidson-MacKinnon test:

The predicted ln housing price is not statistically significant, which suggests that the model without the log square feet is preferred. Since the dependent variable in both models is the same, the R-squared values can also be compared, and the R-squared from the initial model is larger than the R-squared from the model with log square feet.

E8.5
Regression from step 1 of the RESET test.

Second regression for the RESET test.

The yhat^2, yhat^3, and yhat^4 terms are individually statistically insignificant, but we need to test whether they are jointly statistically significant.
Hypothesis:
$H_0: \beta_5 = \beta_6 = \beta_7 = 0$
$H_1:$ at least one $\beta_j$ is not equal to 0
Test statistic:
$$F\text{-stat} = \frac{(4.5536 - 4.432)/3}{4.432/58} = 0.5302$$
Critical value: $F_{.05,3,58} = 2.746$
Rejection rule: Reject $H_0$ if F-stat > 2.746
Decision: Because 0.5302 < 2.746, we fail to reject $H_0$ and conclude that the model without the added powers of the predicted values is statistically preferred.
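The F-statistic arithmetic used in E8.5 (and in the F-tests of 8.1c, 8.2a, and 8.5b) can be checked with a short script. This is only a minimal sketch: the function name `f_statistic` and the variable names are illustrative assumptions, the sums of squares and degrees of freedom are the values quoted in E8.5, and the critical value is taken as given from the answer rather than computed.

```python
def f_statistic(ss_restricted, ss_unrestricted, num_restrictions, df_denominator):
    """F statistic for joint restrictions, built from unexplained sums of squares.

    ss_restricted:    unexplained sum of squares with the null imposed
    ss_unrestricted:  unexplained sum of squares of the full model
    num_restrictions: number of coefficients set to zero under the null
    df_denominator:   n - k - 1 for the unrestricted model
    """
    numerator = (ss_restricted - ss_unrestricted) / num_restrictions
    denominator = ss_unrestricted / df_denominator
    return numerator / denominator


# Values quoted in the E8.5 answer (hypothetical variable names).
ss_restricted = 4.5536
ss_unrestricted = 4.432
num_restrictions = 3      # yhat^2, yhat^3, yhat^4
df_denominator = 58
critical_value = 2.746    # taken from the answer, not recomputed here

f_stat = f_statistic(ss_restricted, ss_unrestricted, num_restrictions, df_denominator)
reject_null = f_stat > critical_value

print(f"F-stat = {f_stat:.4f}, reject H0: {reject_null}")
```

With the rounded sums of squares shown in the answer, the script gives an F-statistic of roughly 0.53, well below the quoted critical value, so the decision to fail to reject the null is unchanged. The same function applies to the other F-tests by swapping in the appropriate number of restrictions and degrees of freedom.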