Chapter 08 - Model Selection in Multiple Linear Regression Analysis
CHAPTER 8
Answers to End of Chapter Problems
8.1
a. For the average individual, holding the effects of average points per game, average
rebounds, and player position constant, if the number of years a player has been in the
NBA goes up by one year, salary increases by 16%.
b.
\[
\begin{aligned}
\ln(\text{Salary}_i) = {} & \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i \\
& + \beta_6 F_i \cdot \text{RPG}_i + \beta_7 G_i \cdot \text{RPG}_i + \varepsilon_i
\end{aligned}
\]
To test this hypothesis, perform a t-test of \(\beta_6 = 0\) (the salary return to rebounds is the
same for Forwards and Centers), a t-test of \(\beta_7 = 0\) (the salary return to rebounds is the
same for Guards and Centers), and an F-test of the joint hypothesis \(\beta_6 = \beta_7 = 0\).
c. Note the question says ANY differences.
\[
\begin{aligned}
\ln(\text{Salary}_i) = {} & \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i + \beta_6 \text{Foreign}_i \\
& + \beta_7 \text{Foreign}_i \cdot \text{Yrs}_i + \beta_8 \text{Foreign}_i \cdot \text{PPG}_i + \beta_9 \text{Foreign}_i \cdot \text{RPG}_i + \varepsilon_i
\end{aligned}
\]
Where the variable Foreign = 1 if the player is foreign born and Foreign = 0 if the player is
born in the U.S.
Hypothesis:
\(H_0: \beta_6 = \beta_7 = \beta_8 = \beta_9 = 0\)
\(H_1:\) at least one \(\beta_i\) is not equal to 0
Test statistic:
\[
F\text{-stat} = \frac{(\text{SSUnexplained}_{restricted} - \text{SSUnexplained}_{unrestricted})/4}{\text{SSUnexplained}_{unrestricted}/(n - k - 1)}
\]
Where the restricted model is the original model in the problem (or alternatively the
model with the null hypothesis imposed)
Critical Value is \(F_{\alpha,4,n-k-1}\)
Rejection Rule:
Reject \(H_0\) if F-stat > \(F_{\alpha,4,n-k-1}\)
d. This is a Davidson-MacKinnon test.
(1) Estimate the model
\[
\begin{aligned}
\ln(\text{Salary}_i) = {} & \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i + \beta_6 \text{Foreign}_i \\
& + \beta_7 \text{Foreign}_i \cdot \text{Yrs}_i + \beta_8 \text{Foreign}_i \cdot \text{PPG}_i + \beta_9 \text{Foreign}_i \cdot \text{RPG}_i + \varepsilon_i
\end{aligned}
\]
and obtain the predicted value \(\widehat{\ln(\text{Salary}_i)}\).
(2) Add the predicted value from step (1) to the model
\[
\begin{aligned}
\ln(\text{Salary}_i) = {} & \beta_0 + \beta_1 \text{Yrs}_i + \beta_2 \text{PPG}_i + \beta_3 \text{RPG}_i + \beta_4 F_i + \beta_5 G_i \\
& + \beta_6 F_i \cdot \text{RPG}_i + \beta_7 G_i \cdot \text{RPG}_i + \beta_8 \widehat{\ln(\text{Salary}_i)} + \varepsilon_i
\end{aligned}
\]
Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill
Education.
(3) Perform a t-test for the statistical significance of \(\beta_8\). If it is statistically significant
then the model from step (1) may be preferred.
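The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the textbook's procedure for the NBA data: the data are simulated, the variable names `x`, `z`, `w` and the helper `ols` are hypothetical, and "model A" (y on x and z) plays the role of the model in step (1) while "model B" (y on x and w) plays the role of the competing model. Only the standard library is used so the sketch runs anywhere; in practice a regression package would replace the helper.

```python
import math
import random


def ols(X, y):
    """OLS via the normal equations; returns (coefficients, standard errors)."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | I]; Gauss-Jordan turns it into [I | (X'X)^-1].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)  # k already counts the intercept
    se = [math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, se


# Simulated data (assumption): the true model uses x and z; w is only a proxy for z.
random.seed(0)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
w = [0.5 * zi + random.gauss(0, 1) for zi in z]
y = [1 + 2 * xi + 1.5 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

# Step (1): estimate model A and keep its predicted values.
XA = [[1.0, xi, zi] for xi, zi in zip(x, z)]
beta_a, _ = ols(XA, y)
yhat_a = [sum(b * v for b, v in zip(beta_a, row)) for row in XA]

# Step (2): add the predicted values from model A to model B.
XB_aug = [[1.0, xi, wi, fi] for xi, wi, fi in zip(x, w, yhat_a)]
beta_b, se_b = ols(XB_aug, y)

# Step (3): t statistic on the coefficient of the predicted values.
t_stat = beta_b[3] / se_b[3]
print(f"t statistic on the predicted values from model A: {t_stat:.2f}")
```

A large t statistic in absolute value indicates that the predicted values from model A carry information that model B lacks, which is the sense in which the model from step (1) "may be preferred".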
8.2
a. The unrestricted model is \(FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i\) while the
restricted model is \(FR_i = \beta_0 + \beta_1 Age_i + \varepsilon_i\).
Hypothesis:
\(H_0: \beta_2 = \beta_3 = 0\)
\(H_1:\) at least one \(\beta_i\) is not equal to 0
Test statistic:
\[
F\text{-stat} = \frac{(\text{SSUnexplained}_{restricted} - \text{SSUnexplained}_{unrestricted})/2}{\text{SSUnexplained}_{unrestricted}/(n - k - 1)}
\]
Where the restricted model is the original model in the problem (or alternatively the
model with the null hypothesis imposed)
Critical Value is \(F_{\alpha,2,n-k-1}\), because the null hypothesis imposes two restrictions.
Rejection Rule:
Reject \(H_0\) if F-stat > \(F_{\alpha,2,n-k-1}\)
b. Set \(\beta_1 - \beta_2 = \theta\), solve for \(\beta_1\) to get \(\beta_1 = \theta + \beta_2\), and then substitute for \(\beta_1\) in the original model.
\[
\begin{aligned}
FR_i &= \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \\
FR_i &= \beta_0 + (\theta + \beta_2) Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \\
FR_i &= \beta_0 + \theta Age_i + \beta_2 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i \\
FR_i &= \beta_0 + \theta Age_i + \beta_2 (Age_i + Educ_i) + \beta_3 Urban_i + \varepsilon_i
\end{aligned}
\]
This last equation isolates the parameters that need to be estimated: \(\theta\), \(\beta_2\), and \(\beta_3\).
A new variable needs to be created by adding the Age and Educ columns together, and
then FR is regressed on Age, (Age + Educ), and Urban. The coefficient on Age is the estimate
\(\hat\beta_1 - \hat\beta_2\), the standard error on Age is the standard error for this hypothesis, the t-statistic on Age is
the test statistic for this test, and last but not least the p-value on Age is the p-value for this test.
To see if these coefficients are equal, reject the null hypothesis that they are equal if the p-value is
less than the significance level α.
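The substitution above can be checked numerically. The sketch below is illustrative only: the data are simulated (the coefficients, sample size, and seed are assumptions, not the problem's data) and the helper `ols` is a hypothetical stand-in for a regression package. It confirms that the coefficient on Age in the regression on Age, (Age + Educ), and Urban equals the difference of the Age and Educ coefficients from the original regression.

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simulated data (assumption): values chosen only to make the example run.
random.seed(1)
n = 200
age = [random.uniform(15, 45) for _ in range(n)]
educ = [random.uniform(0, 16) for _ in range(n)]
urban = [float(random.random() < 0.5) for _ in range(n)]
fr = [6 - 0.05 * a - 0.10 * e - 0.5 * u + random.gauss(0, 1)
      for a, e, u in zip(age, educ, urban)]

# Original model: FR on Age, Educ, Urban.
b = ols([[1.0, a, e, u] for a, e, u in zip(age, educ, urban)], fr)

# Reparameterized model: FR on Age, (Age + Educ), Urban.
g = ols([[1.0, a, a + e, u] for a, e, u in zip(age, educ, urban)], fr)

theta_hat = g[1]  # coefficient on Age in the reparameterized model
print(theta_hat, b[1] - b[2])
```

The practical benefit of the reparameterization is that the regression output reports the standard error, t-statistic, and p-value for \(\hat\theta\) directly, so no covariance between \(\hat\beta_1\) and \(\hat\beta_2\) has to be looked up.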
c. This is a Davidson-MacKinnon test.
(1) Estimate the model
\[
\ln FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \varepsilon_i
\]
and obtain the predicted value \(\widehat{\ln(FR_i)}\).
(2) Add the predicted value from step (1) to the model
\[
FR_i = \beta_0 + \beta_1 Age_i + \beta_2 Educ_i + \beta_3 Urban_i + \beta_4 \widehat{\ln(FR_i)} + \varepsilon_i
\]
(3) Perform a t-test for the statistical significance of \(\beta_4\). If it is statistically significant
then the model from step (1) may be preferred.
The reason that the semi-log model is more likely to lead to biased estimates is that
the natural log is a non-linear function and the estimates it yields (without a
transformation) are already biased even if the true model is non-linear. The choice of
specification should largely be made on the basis of the underlying economics. If economic
theory says that when age (or education) goes up by one year the percentage change in
the fertility rate is constant, then the semi-log model should be estimated. The coefficient
on Age is interpreted as, on average, holding education and urban constant, if an
individual gets one year older then the fertility rate increases (decreases) by \(\beta_1 \times 100\)%.
The coefficient on Education is interpreted as, on average, holding age and urban
constant, if an individual gets one more year of education then the fertility rate increases
(decreases) by \(\beta_2 \times 100\)%. The coefficient on Urban is interpreted as, on average,
holding age and education constant, if an individual lives in an urban area the fertility rate
is \(\beta_3 \times 100\)% higher (lower) relative to living in a rural area.
8.3
a. To find where pollution reaches a maximum (or where diminishing marginal returns
set in), set \(4000 - 0.25(2)\,GDP_i = 0\), which holds when GDP per capita is $8000.
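As a worked check of the step above, the sketch below writes out the first- and second-order conditions. It assumes the estimated relationship is quadratic in GDP per capita with the coefficients 4000 and -0.25 implied by the derivative above; the intercept \(\hat\beta_0\) and the label Pollution are placeholders, since the estimated regression itself is not reproduced here.

```latex
\begin{align*}
\widehat{\text{Pollution}}_i &= \hat\beta_0 + 4000\,\text{GDP}_i - 0.25\,\text{GDP}_i^2 \\
\frac{\partial \widehat{\text{Pollution}}_i}{\partial \text{GDP}_i}
  &= 4000 - 0.5\,\text{GDP}_i = 0
  \quad\Longrightarrow\quad \text{GDP}_i = \frac{4000}{0.5} = 8000 \\
\frac{\partial^2 \widehat{\text{Pollution}}_i}{\partial \text{GDP}_i^2} &= -0.5 < 0
\end{align*}
```

Because the second derivative is negative, the turning point at 8000 is a maximum rather than a minimum.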
b. If all of the other multiple linear regression assumptions hold, then the consequence of
heteroskedasticity is that the OLS estimates are no longer BLUE, although they remain
unbiased. The other consequence is that all standard errors and hypothesis tests are
incorrect.
c. This is chapter 9 material.
d. This is chapter 9 material.
e. Because the dependent variable hasn't changed, you can compare the R-squared values
between the two models and if one R-squared is clearly larger than the other then that
model is preferred. You could also perform a Davidson-MacKinnon test.
8.4
a. This is the two-step estimator for multiple linear regression analysis. First, a formal
proof. The estimates are obtained by minimizing the sum of squared residuals
\[
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i})^2
\]
which amounts to taking the derivatives with respect to \(\hat\beta_0\), \(\hat\beta_1\), and \(\hat\beta_2\) and setting them
equal to 0, yielding the normal equations
\[
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0
\]
\[
\sum_{i=1}^{n} x_{1,i} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0
\]
\[
\sum_{i=1}^{n} x_{2,i} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0
\]
Note that when \(x_1\) is regressed on \(x_2\), \(x_1\) can be written as the sum of the
predicted values and the residuals, or \(x_{1,i} = \hat x_{1,i} + \hat r_{1i}\). Substitute this into
the second normal equation to obtain
\[
\sum_{i=1}^{n} (\hat x_{1,i} + \hat r_{1i})(y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0
\]
Write \(e_i = y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}\) for the residual. Because \(\hat x_{1,i}\) is a linear
function of \(x_{2,i}\), the first and third normal equations imply that the sum of the predicted
values times the residual is equal to zero, or \(\sum_{i=1}^{n} \hat x_{1,i} e_i = 0\), so the equation reduces to
\[
\sum_{i=1}^{n} \hat r_{1i} (y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \hat\beta_2 x_{2,i}) = 0
\]
Now because the \(\hat r_{1i}\) are the residuals from the regression of \(x_1\) on \(x_2\),
\(\sum_{i=1}^{n} x_{2,i} \hat r_{1i} = 0\), and because the residuals always sum
to 0, \(\sum_{i=1}^{n} \hat r_{1i} = 0\). Therefore we are left with
\[
\sum_{i=1}^{n} \hat r_{1i} (y_i - \hat\beta_1 (\hat x_{1,i} + \hat r_{1i})) = 0
\]
and then using the fact that \(\sum_{i=1}^{n} \hat x_{1,i} \hat r_{1i} = 0\) we get
\[
\sum_{i=1}^{n} \hat r_{1i} (y_i - \hat\beta_1 \hat r_{1i}) = 0
\]
Solving for \(\hat\beta_1\)
\[
\hat\beta_1 = \frac{\sum_i \hat r_{1i} y_i}{\sum_i \hat r_{1i}^2}
\]
In Venn diagram form and less formally,
when \(x_1\) is regressed on \(x_2\) the part the regression captures is pink
and dark orange and the residuals of that regression are the red plus yellow
area. Then when y is regressed on those residuals (i.e. only the red and
yellow part of \(x_1\)) only the red area is left.
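The two-step result above can be verified numerically. The sketch below uses simulated data (the coefficients, sample size, and seed are assumptions chosen only for illustration) and a hypothetical helper `ols`; it compares the coefficient on \(x_1\) from the full regression with the ratio \(\sum_i \hat r_{1i} y_i / \sum_i \hat r_{1i}^2\) built from the residuals of \(x_1\) on \(x_2\).

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Simulated data (assumption): x1 and x2 are correlated regressors.
random.seed(2)
n = 100
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.6 * v + random.gauss(0, 1) for v in x2]
y = [1 + 2 * a - b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Full regression of y on x1 and x2 (with an intercept).
beta_full = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)

# Step 1: regress x1 on x2 (with an intercept) and keep the residuals.
g0, g1 = ols([[1.0, b] for b in x2], x1)
r1 = [a - g0 - g1 * b for a, b in zip(x1, x2)]

# Step 2: beta_1 = sum(r1 * y) / sum(r1^2).
beta1_two_step = sum(r * yi for r, yi in zip(r1, y)) / sum(r * r for r in r1)
print(beta_full[1], beta1_two_step)
```

The two printed numbers agree up to floating-point error, which is the algebraic identity derived above rather than a property of this particular sample.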
b. The expression is
\[
Var(\hat\beta_1) = \frac{\sum_i e_i^2/(n-k-1)}{TSS_1 (1 - R_1^2)}
\]
When \(x_1\) and \(x_2\) have a large amount of independent variation then \(R_1^2\) is small,
\((1 - R_1^2)\) is large, and 1 divided by that value is small (note that \(R_1^2\) is bounded
between 0 and 1). If instead \(x_1\) and \(x_2\) have a small amount of independent variation
then \(R_1^2\) is large, \((1 - R_1^2)\) is small, and 1 divided by that value is large.
c. No, including irrelevant variables does not cause the estimates to be
biased. If being taller is strongly related to being married then \(R_1^2\) is large,
\((1 - R_1^2)\) is small, and 1 divided by that value is large, so the variance of the estimate is inflated.
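To make the role of \(1/(1 - R_1^2)\) concrete, the short sketch below tabulates that factor for a few values of \(R_1^2\). The function name `variance_inflation` and the chosen values are illustrative assumptions, not part of the problem.

```python
def variance_inflation(r_squared_1):
    """Return 1 / (1 - R_1^2), the factor that scales the variance of the estimate."""
    if not 0.0 <= r_squared_1 < 1.0:
        raise ValueError("R_1^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_1)


# Illustrative values of R_1^2 (assumption).
for r2 in (0.0, 0.5, 0.9, 0.99):
    print(f"R_1^2 = {r2:4.2f} -> inflation factor = {variance_inflation(r2):6.1f}")
```

The factor grows slowly at first and then very quickly as \(R_1^2\) approaches 1, which is why an irrelevant regressor that is strongly related to \(x_1\) can inflate the variance of \(\hat\beta_1\) substantially without biasing it.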
8.5
a. Two new variables need to be created by multiplying INF by home runs
and INF by batting average, and then the following regression model is estimated:
\[
\begin{aligned}
\ln(\text{salary}_i) = {} & \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i \\
& + \beta_7 INF_i \cdot HR_i + \beta_8 INF_i \cdot BA_i + \varepsilon_i
\end{aligned}
\]
Testing these two hypotheses requires two t-tests. The first is
\(H_0: \beta_7 \ge 0\) infielders do not get paid less to hit home runs than outfielders
\(H_1: \beta_7 < 0\) infielders get paid less to hit home runs than outfielders
and Reject \(H_0\) if t-stat < \(-t_{\alpha,n-k-1}\).
The second is
\(H_0: \beta_8 \le 0\) infielders do not get paid more to have a high batting average than
outfielders
\(H_1: \beta_8 > 0\) infielders get paid more to have a high batting average than
outfielders
and Reject \(H_0\) if t-stat > \(t_{\alpha,n-k-1}\).
Note that because these are one-sided tests the critical value is \(t_{\alpha,n-k-1}\) (α
remains whole when obtaining the critical value). If the p-value approach is
used, then the two-sided p-value in the regression output needs to be divided by 2
(provided the estimate has the sign stated in \(H_1\)) and then compared to α.
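For the p-value approach to a one-sided test, regression output normally reports a two-sided p-value, and the t distribution is symmetric, so the one-sided p-value is half the two-sided one when the estimate has the sign stated in the alternative and one minus that half otherwise. The sketch below encodes that conversion; the function name `one_sided_p` and the numbers in the example are hypothetical, not taken from the problem's regression output.

```python
def one_sided_p(two_sided_p, estimate, alternative):
    """Convert a two-sided p-value from regression output to a one-sided p-value.

    alternative: "greater" for H1: beta > 0, "less" for H1: beta < 0.
    Relies on the symmetry of the t distribution around zero.
    """
    if alternative not in ("greater", "less"):
        raise ValueError('alternative must be "greater" or "less"')
    if alternative == "greater":
        sign_matches = estimate > 0
    else:
        sign_matches = estimate < 0
    half = two_sided_p / 2
    return half if sign_matches else 1 - half


# Hypothetical output: interaction coefficient -0.02 with two-sided p-value 0.08.
alpha = 0.05
p_less = one_sided_p(0.08, -0.02, "less")        # test of H1: beta < 0
p_greater = one_sided_p(0.08, -0.02, "greater")  # test of H1: beta > 0
print(p_less, p_less < alpha)
print(p_greater, p_greater < alpha)
```

In this hypothetical case the two-sided p-value of 0.08 would not reject at the 5% level, while the one-sided test in the direction of the estimate would.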
b. Define a new variable as Native = 1 if the player is native born and Native = 0
if the player is foreign born. Multiply this dummy variable by all independent
variables that were originally in the model. The new model becomes
\[
\begin{aligned}
\ln(\text{salary}_i) = {} & \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i \\
& + \beta_7 Native_i + \beta_8 Native_i \cdot Exp_i + \beta_9 Native_i \cdot BA_i + \beta_{10} Native_i \cdot RBI_i \\
& + \beta_{11} Native_i \cdot HR_i + \beta_{12} Native_i \cdot INF_i + \beta_{13} Native_i \cdot ALLStar_i + \varepsilon_i
\end{aligned}
\]
To test for any differences, use an F-test.
Hypothesis:
\(H_0: \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{11} = \beta_{12} = \beta_{13} = 0\)
\(H_1:\) at least one \(\beta_i\) is not equal to 0
Test statistic:
\[
F\text{-stat} = \frac{(\text{SSUnexplained}_{restricted} - \text{SSUnexplained}_{unrestricted})/7}{\text{SSUnexplained}_{unrestricted}/(n - k - 1)}
\]
Where the restricted model is the original model in the problem (or alternatively the
model with the null hypothesis imposed)
Critical Value is \(F_{\alpha,7,n-k-1}\)
Rejection Rule:
Reject \(H_0\) if F-stat > \(F_{\alpha,7,n-k-1}\)
c. This is a Davidson-MacKinnon test.
(1) Estimate the model
\[
\ln(\text{salary}_i) = \beta_0 + \beta_1 \ln(Exp_i) + \beta_2 BA_i + \beta_3 INF_i + \beta_4 BA_i^2 + \beta_5 RBI_i^2 + \varepsilon_i
\]
and obtain the predicted value \(\widehat{\ln(\text{salary}_i)}\).
(2) Add the predicted value from step (1) to the model
\[
\begin{aligned}
\ln(\text{salary}_i) = {} & \beta_0 + \beta_1 Exp_i + \beta_2 BA_i + \beta_3 RBI_i + \beta_4 HR_i + \beta_5 INF_i + \beta_6 ALLStar_i \\
& + \beta_7 \widehat{\ln(\text{salary}_i)} + \varepsilon_i
\end{aligned}
\]
(3) Perform a t-test for the statistical significance of \(\beta_7\). If it is statistically significant
then the model from step (1) may be preferred.
Because the left-hand-side variable doesn't change between the two specifications, the
R-squared values of the two models can also be compared.
Answers to End of Chapter Exercises
E8.1
a.
The problem with ability is that it is hard to obtain a variable that appropriately
measures it, even though ability is certainly a determinant of GPA. Individuals
with a higher ability also typically have a higher GPA and vice versa. Omitted variable
bias becomes an issue because ability is also related to hours studied, work, video games,
and possibly even texts. The omission of a relevant variable causes the coefficient
estimates to be biased. This means that all coefficient estimates are wrong on average
and that all hypothesis tests and confidence intervals are also incorrect.
b.
The inclusion of an irrelevant variable does not yield biased
estimates, but the standard errors typically become inflated. In this case, the inclusion of
the irrelevant variable did not change the overall decisions about statistical significance.
Notice that when Eye Color was included the R-squared went up but the adjusted
R-squared went down.
c. It is much better to include an irrelevant variable than to omit a relevant variable because
larger standard errors are much better than biased estimators. Most of the time omitted
variables are not omitted because the researcher is sloppy and didn't think to include that
variable but rather because data on that variable are not available.
E8.2
a. See Excel Worksheet
b.
From this regression we see that distance to the beach is not statistically significant but
missing is statistically significant, suggesting that the observations with missing data have
a housing price that is, on average, $297,185.14 lower than those observations without missing data.
c. An easy way to test this hypothesis is to regress housing price on the missing
column alone, which yields a difference in means.
This regression suggests that the mean housing price for data without missing
observations is $795,333.21 while the mean housing price for data with missing
observations is $795,333.21 - $168,943.28 = $626,389.93. The p-value suggests that this
difference in means is not statistically significant.
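The reason a regression on a single dummy variable yields a difference in means can be seen in a small example. The sketch below uses five made-up prices (hypothetical values, not the exercise's data) and a hypothetical helper `simple_ols`: the intercept equals the mean price of the group with the dummy equal to 0 and the slope equals the difference between the two group means. The last line repeats the subtraction reported above using the two figures from the text.

```python
def simple_ols(d, y):
    """Intercept and slope from regressing y on a single regressor d."""
    n = len(y)
    d_bar = sum(d) / n
    y_bar = sum(y) / n
    sxy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    sxx = sum((di - d_bar) ** 2 for di in d)
    slope = sxy / sxx
    intercept = y_bar - slope * d_bar
    return intercept, slope


# Hypothetical prices; missing = 1 flags a missing beach-distance observation.
price = [800000.0, 750000.0, 820000.0, 600000.0, 650000.0]
missing = [0, 0, 0, 1, 1]

intercept, slope = simple_ols(missing, price)
mean_not_missing = sum(p for p, m in zip(price, missing) if m == 0) / missing.count(0)
mean_missing = sum(p for p, m in zip(price, missing) if m == 1) / missing.count(1)
print(intercept, mean_not_missing)             # intercept = mean of the missing = 0 group
print(slope, mean_missing - mean_not_missing)  # slope = difference in group means

# The subtraction reported in the answer, using the figures from the text.
mean_with_missing = 795333.21 - 168943.28
print(round(mean_with_missing, 2))
```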
Another way to see if the missing data causes issues is to perform the regression with
only the data that have the distance to the beach observations.
The regression with only the 43 observations that have data on distance to the beach
gives somewhat different results than the regression that accounted for the missing data.
The beach distance variable is now statistically significant at the 5% level and suggests
that for each additional mile a house is away from the beach the price drops by
$17,497.39.
E8.3
a. See graph below.
[Figure: scatter plot titled "Units Sold vs. MP Sales"; vertical axis: Units Sold (0 to 2,500,000); horizontal axis: Online MP (0 to 1.2).]
The two potential outliers are Call of Duty: Black Ops 2 and Assassin's Creed 3.
b.
The coefficient on outlier is statistically significant at the 1% level suggesting that the
two outliers have, on average, 1,162,603.11 more units sold than the 51 other
observations.
Interacting this with Online MP we obtain the regression
In this regression, the outlier without Online MP has 501,950.17 more units sold than a
non-outlier, and the outlier with Online MP has 7,424.4 + 501,950 + 1,266,616 = 1,775,808 more
units sold than video games that are not outliers and have no online MP.
c. It doesn't look like either outlier was there for a special reason, other than that both
Call of Duty: Black Ops 2 and Assassin's Creed 3 are extremely popular video games.
E8.4
a.
In this regression, the only statistically significant independent variable is square feet.
On average, holding bedrooms, bathrooms, lot size, and pool constant, if square feet goes
up by one square foot then the price increases by .062%. Even though square feet is statistically
significant, it is not economically significant because the coefficient estimate is so
small.
b.
In this regression, the only statistically significant independent variables are log square
feet, bedrooms at the 5% level, and bathrooms at the 10% level. On average, holding
bedrooms, bathrooms, lot size, and pool constant, if square feet goes up by 1% then the
price increases by 1.14%. On average, holding log square feet, bathrooms, lot size, and
pool constant, if bedrooms goes up by 1% then the price decreases by .107%. On average,
holding log square feet, bedrooms, lot size, and pool constant, if bathrooms goes up by
1% then the price increases by .113%.
c. Performing the Davidson-MacKinnon test
The predicted ln housing price is not statistically significant, which suggests that the
model without the log square feet is preferred. Since the dependent variable in both
models is the same, the R-squares can also be compared and the R-squared from the
initial model is larger than the R-squared from the model with log square feet.
E8.5
Regression from step 1 of the RESET test.
Second regression for the RESET test.
The yhat^2, yhat^3, and yhat^4 are individually statistically insignificant but we need to
test if they are jointly statistically significant.
Hypothesis:
\(H_0: \beta_5 = \beta_6 = \beta_7 = 0\)
\(H_1:\) at least one \(\beta_i\) is not equal to 0
Test statistic:
\[
F\text{-stat} = \frac{(4.5536 - 4.432)/3}{4.432/58} = 0.5302
\]
Critical Value is \(F_{.05,3,58} = 2.746\)
Rejection Rule:
Reject \(H_0\) if F-stat > 2.746
Decision:
Because 0.5302 < 2.746 we fail to reject \(H_0\) and conclude that the model without the
powers of the predicted values is statistically preferred.
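The arithmetic behind the test statistic can be reproduced in a few lines. In this sketch the function name `f_stat` is hypothetical; the two sums of squares, the 3 restrictions, the 58 degrees of freedom, and the critical value 2.746 are the figures given above. With the rounded sums as printed, the ratio comes out at roughly 0.530, in line with the 0.5302 reported (the small gap presumably reflects rounding of the sums).

```python
def f_stat(ssu_restricted, ssu_unrestricted, q, df_unrestricted):
    """F statistic for q restrictions from the two unexplained sums of squares."""
    numerator = (ssu_restricted - ssu_unrestricted) / q
    denominator = ssu_unrestricted / df_unrestricted
    return numerator / denominator


# Figures as reported in the answer above (rounded).
F = f_stat(4.5536, 4.432, q=3, df_unrestricted=58)
critical_value = 2.746  # F_{.05,3,58} as given above
reject_h0 = F > critical_value
print(round(F, 3), reject_h0)
```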