Solutions to Chapter 8

Chapter 08 - Model Selection in Multiple Linear Regression Analysis

Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

CHAPTER 8

Answers to End of Chapter Problems

8.1

a. For the average individual, holding average points per game, average rebounds, and player position constant, if the number of years a player has been in the NBA goes up by one year, salary increases by 16%.

b. The model is

\[ \ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 F_i \cdot RPG_i + \beta_7 G_i \cdot RPG_i + \varepsilon_i \]

To test this hypothesis, perform a t-test of whether \( \beta_6 = 0 \) (the return to rebounds is the same for forwards and centers), a t-test of whether \( \beta_7 = 0 \) (the return to rebounds is the same for guards and centers), and an F-test of whether jointly \( \beta_6 = \beta_7 = 0 \).

c. Note the question says ANY differences.

\[ \ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 Foreign_i + \beta_7 Foreign_i \cdot Yrs_i + \beta_8 Foreign_i \cdot PPG_i + \beta_9 Foreign_i \cdot RPG_i + \varepsilon_i \]

where Foreign = 1 if the player is foreign born and Foreign = 0 if the player is born in the U.S.

Hypothesis:
\( H_0: \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0 \)
\( H_1: \) at least one \( \beta_i \) is not equal to 0

Test statistic:

\[ F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/4}{SSUnexplained_{unrestricted}/(n-k-1)} \]

where the restricted model is the original model in the problem (or alternatively the model with the null hypothesis imposed).

Critical value: \( F_{\alpha,4,n-k-1} \)

Rejection rule: Reject \( H_0 \) if F-stat > \( F_{\alpha,4,n-k-1} \)

d. This is a Davidson-MacKinnon test.

(1) Estimate the model

\[ \ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 Foreign_i + \beta_7 Foreign_i \cdot Yrs_i + \beta_8 Foreign_i \cdot PPG_i + \beta_9 Foreign_i \cdot RPG_i + \varepsilon_i \]

and obtain the predicted value \( \widehat{\ln(Salary_i)} \).

(2) Add the predicted value from step (1) to the model

\[ \ln(Salary_i) = \beta_0 + \beta_1 Yrs_i + \beta_2 PPG_i + \beta_3 RPG_i + \beta_4 F_i + \beta_5 G_i + \beta_6 F_i \cdot RPG_i + \beta_7 G_i \cdot RPG_i + \beta_8 \widehat{\ln(Salary_i)} + \varepsilon_i \]
(3) Perform a t-test for the statistical significance of \( \beta_8 \). If it is statistically significant, then the model from step (1) may be preferred.

8.2

a. The unrestricted model is

\[ FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \]

while the restricted model is

\[ FR_i = \beta_0 + \beta_1 Age_i + \varepsilon_i \]

Hypothesis:
\( H_0: \beta_2 = \beta_3 = 0 \)
\( H_1: \) at least one \( \beta_i \) is not equal to 0

Test statistic:

\[ F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/2}{SSUnexplained_{unrestricted}/(n-k-1)} \]

where the restricted model is the model with the null hypothesis imposed.

Critical value: \( F_{\alpha,2,n-k-1} \)

Rejection rule: Reject \( H_0 \) if F-stat > \( F_{\alpha,2,n-k-1} \)

b. Set \( \beta_1 - \beta_2 = \theta \), solve for \( \beta_1 \) to get \( \beta_1 = \theta + \beta_2 \), and then substitute for \( \beta_1 \) in the original model:

\[ FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \]
\[ FR_i = \beta_0 + (\theta + \beta_2) Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \]
\[ FR_i = \beta_0 + \theta Age_i + \beta_2 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \]
\[ FR_i = \beta_0 + \theta Age_i + \beta_2 (Age_i + Educ_i) + \beta_3 Urban_i + \varepsilon_i \]

This last equation isolates the parameters that need to be estimated: \( \theta \), \( \beta_2 \), and \( \beta_3 \). A new variable needs to be created by adding the Age and Educ columns together; then regress FR on Age, (Age + Educ), and Urban. The coefficient on Age is the estimate of \( \hat{\beta}_1 - \hat{\beta}_2 \), the standard error on Age is the standard error for this hypothesis, the t-statistic on Age is the test statistic for this test, and the p-value on Age is the p-value for this test. Reject the null hypothesis that the two coefficients are equal if the p-value is less than the significance level \( \alpha \).

c. This is a Davidson-MacKinnon test.

(1) Estimate the model

\[ \ln FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \]

and obtain the predicted value \( \widehat{\ln(FR_i)} \).

(2) Add the predicted value from step (1) to the model

\[ FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \beta_4 \widehat{\ln(FR_i)} + \varepsilon_i \]

(3) Perform a t-test for the statistical significance of \( \beta_4 \).
If it is statistically significant, then the model from step (1) may be preferred.

The semi-log model is more likely to lead to biased estimates because the natural log is a non-linear function, so the estimates it yields (without a transformation) are already biased even if the true model is non-linear. The choice of specification should largely be made on the underlying economics. If economic theory says that when age (or education) goes up by one year the percentage change in the fertility rate is constant, then the semi-log model should be estimated.

The coefficient on Age is interpreted as: on average, holding education and urban constant, if an individual gets one year older, then the fertility rate increases (decreases) by \( \beta_1 \times 100\% \). The coefficient on Education is interpreted as: on average, holding age and urban constant, if an individual gets one more year of education, then the fertility rate increases (decreases) by \( \beta_2 \times 100\% \). The coefficient on Urban is interpreted as: on average, holding age and education constant, if an individual lives in an urban area, the fertility rate is \( \beta_3 \times 100\% \) higher (lower) relative to living in a rural area.

8.3

a. To find where pollution reaches a maximum (or where diminishing marginal returns set in), set the derivative equal to zero: \( 4000 - 0.25(2)GDP_i = 0 \), which holds when GDP per capita is $8,000.

b. If all of the other multiple linear regression assumptions hold, then the consequence of heteroskedasticity is that the OLS estimates are no longer BLUE, but they remain unbiased. The other consequence is that all standard errors and hypothesis tests are incorrect.

c. This is chapter 9 material.

d. This is chapter 9 material.

e.
Because the dependent variable has not changed, you can compare the R-squared values between the two models, and if one R-squared is clearly larger than the other, then that model is preferred. You could also perform a Davidson-MacKinnon test.

8.4

a. This is the two-step estimator for multiple linear regression analysis. First, a formal proof. The estimates are obtained by minimizing the sum of squared residuals,

\[ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i})^2 \]

which amounts to taking the derivative with respect to \( \hat{\beta}_0 \), \( \hat{\beta}_1 \), and \( \hat{\beta}_2 \) and setting those equations equal to 0, yielding the normal equations

\[ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0 \]
\[ \sum_{i=1}^{n} x_{1,i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0 \]
\[ \sum_{i=1}^{n} x_{2,i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0 \]

When \( x_1 \) is regressed on \( x_2 \), \( x_1 \) can be written as the sum of the predicted values and the residuals, or \( x_{1,i} = \hat{x}_{1,i} + \hat{r}_{1i} \). Substitute this into the second normal equation to obtain

\[ \sum_{i=1}^{n} (\hat{x}_{1,i} + \hat{r}_{1i}) (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0 \]

Let \( e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i} \) denote the residual. Because \( \hat{x}_{1,i} \) is a linear function of the constant and \( x_{2,i} \), the first and third normal equations imply that the sum of the predicted values times the residual is equal to zero, or \( \sum_{i=1}^{n} \hat{x}_{1,i} e_i = 0 \), so the equation reduces to

\[ \sum_{i=1}^{n} \hat{r}_{1i} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1,i} - \hat{\beta}_2 x_{2,i}) = 0 \]

Because the \( \hat{r}_{1i} \) are the residuals from the regression of \( x_1 \) on \( x_2 \), \( \sum_{i=1}^{n} x_{2,i} \hat{r}_{1i} = 0 \), and because the sum of residuals is always equal to 0, \( \sum_{i=1}^{n} \hat{r}_{1i} = 0 \). The \( \hat{\beta}_0 \) and \( \hat{\beta}_2 \) terms therefore drop out, and we are left with

\[ \sum_{i=1}^{n} \hat{r}_{1i} (y_i - \hat{\beta}_1 (\hat{x}_{1,i} + \hat{r}_{1i})) = 0 \]

Then, using the fact that \( \sum_{i=1}^{n} \hat{x}_{1,i} \hat{r}_{1i} = 0 \), we get

\[ \sum_{i=1}^{n} \hat{r}_{1i} (y_i - \hat{\beta}_1 \hat{r}_{1i}) = 0 \]

Solving for \( \hat{\beta}_1 \):

\[ \hat{\beta}_1 = \frac{\sum_i \hat{r}_{1i} y_i}{\sum_i \hat{r}_{1i}^2} \]

In Venn diagram form, and less formally,
when \( x_1 \) is regressed on \( x_2 \), the part the regression captures is the pink and dark orange area, and the residuals of that regression are the red plus yellow area. Then, when y is regressed on those residuals (i.e. only the red and yellow part of \( x_1 \)), only the red area is left.

b. The expression is

\[ Var(\hat{\beta}_1) = \frac{\sum_i e_i^2 / (n-k-1)}{TSS_1 (1 - R_1^2)} \]

When \( x_1 \) and \( x_2 \) have a large amount of independent variation, \( R_1^2 \) is small, \( (1 - R_1^2) \) is large, and 1 divided by that value is small (note that \( R_1^2 \) is bounded between 0 and 1). If \( x_1 \) and \( x_2 \) have a small amount of independent variation, then \( R_1^2 \) is large, \( (1 - R_1^2) \) is small, and 1 divided by that value is large.

c. No, including irrelevant variables does not cause the estimates to be biased. If being taller is strongly related to being married, then \( R_1^2 \) is large, \( (1 - R_1^2) \) is small, and 1 divided by that value is large.

8.5

a. Two new variables need to be created by multiplying INF by home runs and INF by batting average, and then the following regression model is estimated:

\[ \ln(salary_i) = \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i + \beta_7 INF_i \cdot HR_i + \beta_8 INF_i \cdot BA_i + \varepsilon_i \]

The two hypotheses are tested with two one-sided t-tests.

\( H_0: \beta_7 \geq 0 \) infielders do not get paid less to hit home runs than outfielders
\( H_1: \beta_7 < 0 \) infielders get paid less to hit home runs than outfielders
Reject \( H_0 \) if t-stat < \( -t_{\alpha,n-k-1} \).

\( H_0: \beta_8 \leq 0 \) infielders do not get paid more to have a high batting average than outfielders
\( H_1: \beta_8 > 0 \) infielders get paid more to have a high batting average than outfielders
Reject \( H_0 \) if t-stat > \( t_{\alpha,n-k-1} \).
Note that because these are one-sided tests, the critical value is \( t_{\alpha,n-k-1} \) (\( \alpha \) remains whole when obtaining the critical value). If the p-value approach is used, the two-sided p-value in the regression output needs to be divided by 2 (provided the estimate has the sign stated in \( H_1 \)) and then compared to \( \alpha \).

b. Define a new variable as Native = 1 if the player is native born and Native = 0 if the player is foreign born. Multiply this dummy variable by all independent variables that were originally in the model. The new model becomes

\[ \ln(salary_i) = \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i + \beta_7 Native_i + \beta_8 Native_i \cdot Exp_i + \beta_9 Native_i \cdot RBI_i + \beta_{10} Native_i \cdot HR_i + \beta_{11} Native_i \cdot INF_i + \beta_{12} Native_i \cdot ALLStar_i + \varepsilon_i \]

Testing for any differences requires an F-test.

Hypothesis:
\( H_0: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{11} = \beta_{12} = 0 \)
\( H_1: \) at least one \( \beta_i \) is not equal to 0

Test statistic:

\[ F\text{-stat} = \frac{(SSUnexplained_{restricted} - SSUnexplained_{unrestricted})/6}{SSUnexplained_{unrestricted}/(n-k-1)} \]

where the restricted model is the original model in the problem (or alternatively the model with the null hypothesis imposed).

Critical value: \( F_{\alpha,6,n-k-1} \)

Rejection rule: Reject \( H_0 \) if F-stat > \( F_{\alpha,6,n-k-1} \)

c. This is a Davidson-MacKinnon test.

(1) Estimate the model

\[ \ln(salary_i) = \beta_0 + \beta_1 \ln(Exp_i) + \beta_2 BA_i + \beta_3 INF_i + \beta_4 BA_i^2 + \beta_5 RBI_i^2 + \varepsilon_i \]

and obtain the predicted value \( \widehat{\ln(Salary_i)} \).

(2) Add the predicted value from step (1) to the model

\[ \ln(salary_i) = \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i + \beta_7 \widehat{\ln(Salary_i)} + \varepsilon_i \]

(3) Perform a t-test for the statistical significance of \( \beta_7 \). If it is statistically significant, then the model from step (1) may be preferred.

Because the left-hand-side variable does not change between the two specifications, the R-squared values of the two models can also be compared.
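The three Davidson-MacKinnon steps used in these answers can be sketched numerically. The block below is only an illustration under stated assumptions: the data are synthetic (generated from a log specification, not the salary data from the problem), `solve_ols` is a hypothetical helper written for this sketch, and the rival models are a log specification and a linear specification in a single regressor.

```python
import math
import random


def solve_ols(X, y):
    """OLS via the normal equations (hypothetical helper for this sketch).

    Returns (coefficients, standard errors, fitted values).
    """
    n, p = len(X), len(X[0])
    # Build X'X and X'y.
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # Invert X'X by Gauss-Jordan elimination with partial pivoting on [X'X | I].
    aug = [row[:] + [1.0 if a == b else 0.0 for b in range(p)]
           for a, row in enumerate(xtx)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    inv = [row[p:] for row in aug]
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    fitted = [sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    ssr = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    sigma2 = ssr / (n - p)  # p includes the intercept, so n - p = n - k - 1
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(p)]
    return beta, se, fitted


# Synthetic data (assumption): the log specification is the true model.
rng = random.Random(0)
x = [float(v) for v in range(1, 51)]
y = [1.0 + 2.0 * math.log(v) + rng.gauss(0.0, 0.1) for v in x]

# Step (1): estimate the rival (log) model and keep its predicted values.
X_rival = [[1.0, math.log(v)] for v in x]
_, _, yhat_rival = solve_ols(X_rival, y)

# Step (2): add those predicted values to the other (linear) model.
X_aug = [[1.0, v, f] for v, f in zip(x, yhat_rival)]
beta, se, _ = solve_ols(X_aug, y)

# Step (3): t-statistic on the added predicted value.
t_stat = beta[2] / se[2]
print("t-statistic on predicted value:", round(t_stat, 2))
```

Because the synthetic data come from the log model, the t-statistic on the added predicted value should be large in absolute value, which is the case in which the model from step (1) may be preferred.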
Answers to End of Chapter Exercises

E8.1

a. The problem with ability is that it is hard to obtain a variable that appropriately measures it, even though ability is certainly a determinant of GPA. Individuals with higher ability also typically have a higher GPA and vice versa. Omitted variable bias becomes an issue because ability is also related to hours studied, work, video games, and even possibly texts. The omission of a relevant variable causes the coefficient estimates to be biased. This means that all coefficient estimates are wrong on average and that all hypothesis tests and confidence intervals are also incorrect.

b. Including an irrelevant variable does not yield biased estimates, but the standard errors typically become inflated. In this case, the inclusion of the irrelevant variable did not change the overall decisions about statistical significance. Notice that when Eye Color was included, the R-squared went up but the adjusted R-squared went down.

c. It is much better to include an irrelevant variable than to omit a relevant variable because larger standard errors are much better than biased estimators. Most of the time, omitted variables are not omitted because the researcher is sloppy and did not think to include the variable, but rather because data on that variable are not available.

E8.2

a. See Excel Worksheet.

b.
From this regression we see that distance to the beach is not statistically significant but missing is statistically significant, suggesting that the observations with missing data have a housing price that is $297,185.14 lower than that of the observations without missing data.

c. An easy way to test this hypothesis is to regress housing price on the missing column, which yields a difference in means.

This regression suggests that the mean housing price for data without missing observations is $795,333.21, while the mean housing price for data with missing observations is $795,333.21 - $168,943.28 = $626,389.93. The p-value suggests that this difference in means is not statistically significant.

Another way to see if the missing data cause issues is to perform the regression with only the data that have the distance to the beach observations.

The regression with only the 43 observations that have data on distance to the beach gives somewhat different results than the regression that accounted for the missing data. The beach distance variable is now statistically significant at the 5% level and suggests that for each additional mile a house is away from the beach, the price drops by $17,497.39.

E8.3

a. See graph below.

[Figure: Units Sold vs. MP Sales. Scatter plot of Units Sold (vertical axis, 0 to 2,500,000) against Online MP (horizontal axis, 0 to 1.2).]

The two potential outliers are Call of Duty: Black Ops 2 and Assassin's Creed 3.

b.
The coefficient on outlier is statistically significant at the 1% level, suggesting that the two outliers have, on average, 1,162,603.11 more units sold than the 51 other observations. Interacting this with Online MP, we obtain the regression.

In this regression, the outlier without Online MP has 501,950.17 more sales than a non-outlier, and the outlier with Online MP has 7,424.4+501,950+1,266,616= 1,775,808 more units sold than video games that are not outliers and have no Online MP.

c. It does not look like either outlier is there due to a special reason, except that both Call of Duty: Black Ops 2 and Assassin's Creed 3 are extremely popular video games.

E8.4

a. In this regression, the only statistically significant independent variable is square feet. On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by one foot, then the price increases by .062%. Even though square feet is statistically significant, it is not economically significant because the coefficient estimate is so small.

b. In this regression, the statistically significant independent variables are log square feet, bedrooms at the 5% level, and bathrooms at the 10% level. On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by 1% then the price increases by 1.14%.
On average, holding log square feet, bathrooms, lot size, and pool constant, if bedrooms goes up by 1% then the price decreases by .107%. On average, holding log square feet, bedrooms, lot size, and pool constant, if bathrooms goes up by 1% then the price increases by .113%.

c. Performing the Davidson-MacKinnon test:

The predicted ln housing price is not statistically significant, which suggests that the model without the log square feet is preferred. Since the dependent variable in both models is the same, the R-squared values can also be compared, and the R-squared from the initial model is larger than the R-squared from the model with log square feet.

E8.5

Regression from step 1 of the RESET test.

Second regression for the RESET test.

The yhat^2, yhat^3, and yhat^4 terms are individually statistically insignificant, but we need to test if they are jointly statistically significant.

Hypothesis:
\( H_0: \beta_5 = \beta_6 = \beta_7 = 0 \)
\( H_1: \) at least one \( \beta_i \) is not equal to 0

Test statistic:

\[ F\text{-stat} = \frac{(4.5536 - 4.432)/3}{4.432/58} = 0.5302 \]

Critical value: \( F_{.05,3,58} = 2.746 \)

Rejection rule: Reject \( H_0 \) if F-stat > 2.746

Decision: Because 0.5302 < 2.746, we fail to reject \( H_0 \) and conclude that the model without the added yhat terms is statistically preferred.
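The F-statistic arithmetic in E8.5 can be checked with a few lines of code. This is a minimal sketch: the function name `restricted_f` is hypothetical, the sums of squares are the rounded values printed above, and the critical value is taken from the text rather than computed.

```python
def restricted_f(ss_restricted, ss_unrestricted, num_restrictions, df_unrestricted):
    """F-statistic comparing a restricted and an unrestricted regression."""
    numerator = (ss_restricted - ss_unrestricted) / num_restrictions
    denominator = ss_unrestricted / df_unrestricted
    return numerator / denominator


# Values as reported in E8.5 (rounded sums of squared residuals).
f_stat = restricted_f(4.5536, 4.432, 3, 58)
critical_value = 2.746  # taken from the text, not computed here
reject_null = f_stat > critical_value

print("F-stat:", round(f_stat, 2))  # roughly 0.53
print("Reject H0:", reject_null)
```

With the rounded inputs the result is about 0.53, which agrees with the reported 0.5302 up to rounding, and the decision (fail to reject) is unchanged.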
