Chapter 7
Further Inference in Multiple Regression
1. Testing the Significance of a Model—The F-Test
2. Cases Where the F-Test and t-Test Give Contradictory Results—Collinearity
   2.1. F-Test for "Restricted Least Squares"
3. Extension of the Regression Model
4. Testing Some Economic Hypotheses
   4.1. Test the Significance of Advertising
   4.2. The Optimal Level of Advertising
5. The Use of Non-sample Information
6. Model Specification
   6.1. Consequences of Omitted and Irrelevant Variables
        6.1.1. The Omitted Variable Problem
               6.1.1.1. Proof of the Omitted Variable Bias
        6.1.2. The Irrelevant Variable Problem
   6.2. The RESET Test for Model Misspecification
7. Identifying and Mitigating Collinearity
8. Confidence and Prediction Intervals
9. A More Practical Way of Finding var(𝑦̂0)
1. Testing the Significance of a Model—The F-Test
In Chapter 5 the 𝐹-test was explained as an alternative to the t-test for the significance of a simple regression model. There we noted that the p-value for the 𝐹-test in the ANOVA section of the regression summary output was equal to the p-value for the t-test for the significance of the slope coefficient 𝑏2.
In multiple regression, however, the 𝐹-test and t-tests have different roles. The t-tests test the significance of the coefficient of each variable individually, while the 𝐹-test tests their joint explanatory power. The model
will have no explanatory power if it turns out that 𝑦 is unrelated to any of the explanatory variables. In the
two-explanatory variable regression model,
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯3 + 𝑒
we perform the following hypothesis test:
𝐻0 : 𝛽2 = 0, 𝛽3 = 0
𝐻1 : at least one of the 𝛽𝑗 is not zero
In Chapter 5 it was explained that the ratio of mean square regression (𝑀𝑆𝑅) over mean square error (𝑀𝑆𝐸)
has an 𝐹-distribution with the numerator and denominator degrees of freedom, respectively, of (π‘˜ − 1) and
(𝑛 − π‘˜).
𝐹(π‘˜−1,𝑛−π‘˜) =
∑(𝑦̂ − 𝑦̅)2 ⁄(π‘˜ − 1)
∑(𝑦 − 𝑦̂)2 ⁄(𝑛 − π‘˜)
This is the test statistic (𝑇𝑆) 𝐹, which is compared to the critical value 𝐢𝑉 = 𝐹𝛼,(π‘˜−1,𝑛−π‘˜). We reject the null hypothesis and conclude that the estimated relationship between 𝑦 and the independent variables is significant if 𝑇𝑆 > 𝐢𝑉.
In the π‘π‘’π‘Ÿπ‘”π‘’π‘Ÿ model presented in the previous chapter,
𝑆𝐴𝐿𝐸𝑆 = 𝛽1 + 𝛽2 𝑃𝑅𝐼𝐢𝐸 + 𝛽3 𝐴𝐷𝑉𝐸𝑅𝑇 + 𝑒
The estimated regression model was,
𝑆𝐴𝐿𝐸𝑆̂ = 118.914 − 7.908 𝑃𝑅𝐼𝐢𝐸 + 1.863 𝐴𝐷𝑉𝐸𝑅𝑇
To test for the significance of the overall model, the 𝐹 statistic was obtained from the ANOVA table:

ANOVA
              df    SS          MS          F           Significance F
Regression    2     1396.5389   698.26946   29.247859   5.04086E-10
Residual      72    1718.9429   23.874207
Total         74    3115.4819

𝑇𝑆 = 𝐹 = 698.2695 ⁄ 23.8742 = 29.248

𝐢𝑉 = 𝐹0.05,(2,72) = 3.124
Since 𝑇𝑆 > 𝐢𝑉, we reject the null hypothesis and conclude that the relationship between 𝑆𝐴𝐿𝐸𝑆, 𝑃𝑅𝐼𝐢𝐸 and
𝐴𝐷𝑉𝐸𝑅𝑇 is significant. Also note that the probability value shown under “Significance F” is practically 0.
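For readers who want to replicate this calculation outside Excel, the following is a minimal Python sketch (scipy is an assumption of this note, not part of the original example). It recomputes the F statistic, the critical value, and the "Significance F" p-value from the ANOVA figures quoted above.

# A minimal sketch of the overall F-test using the ANOVA figures quoted above;
# scipy's F distribution replaces Excel's =F.INV.RT and the "Significance F" column.
from scipy import stats

SSR, SSE = 1396.5389, 1718.9429     # regression and residual sums of squares
n, k = 75, 3                        # observations and number of parameters

MSR = SSR / (k - 1)                 # mean square regression
MSE = SSE / (n - k)                 # mean square error
F = MSR / MSE                       # test statistic, about 29.248

cv = stats.f.ppf(0.95, k - 1, n - k)     # critical value F(0.05, 2, 72), about 3.124
p_value = stats.f.sf(F, k - 1, n - k)    # right-tail probability, about 5.0e-10
print(F, cv, p_value)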
1.1. F-Tests for "Restricted Least Squares"
Most economic data are obtained not from controlled (laboratory) experiments but are collected as "historical" data. When data are the result of an uncontrolled experiment, many of the economic variables may be correlated, or are said to be collinear. The problem is labeled collinearity, or multicollinearity when several variables are involved. Restricted least squares can be useful when the problem of collinearity is present. We might want to test whether an explanatory variable or a group of explanatory variables is relevant in a particular model.
The 𝐹-test for one hypothesis, or a set of hypotheses, is based on a comparison of the sum of squared errors (𝑆𝑆𝐸) from the original, unrestricted model to the 𝑆𝑆𝐸 from a regression model in which the null hypothesis of the restricted model is assumed to be true. This is explained below. The unrestricted model is the original regression model.
Unrestricted model:
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯3 + 𝑒
The restricted model is one in which one or more of the explanatory variables have been removed. In a model with two independent variables we can remove one of the two variables.
Restricted model:
𝑦 = 𝛽1 + 𝛽3 π‘₯3 + 𝑒
We want to test the hypothesis that changes in, say, π‘₯2 , have no effect on 𝑦, against the alternative that it has
an effect.
𝐻0 : 𝛽2 = 0
𝐻1 : 𝛽2 ≠ 0
It was explained in Chapter 6 that a model with a larger number of independent variables would have a
smaller 𝑆𝑆𝐸. Therefore, when we remove one of the independent variables by constraining the model, the
restricted 𝑆𝑆𝐸 increases.
𝑆𝑆𝐸𝑅 > π‘†π‘†πΈπ‘ˆ
The ratio of the difference 𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ (divided by the number of restrictions 𝑗) to π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜) is a test statistic that has an 𝐹 distribution, with numerator degrees of freedom equal to the number of hypotheses in 𝐻0 and denominator degrees of freedom 𝑛 − π‘˜, where π‘˜ is the number of parameters in the unrestricted model. Here we have only one null hypothesis; therefore, 𝑗 = 1.
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)]
If 𝑇𝑆 = 𝐹(𝑗,𝑛−π‘˜) > 𝐢𝑉 = 𝐹𝛼,(𝑗,𝑛−π‘˜), then we reject the null hypothesis (or hypotheses), concluding that π‘₯2 has a significant effect on 𝑦.
In the example relating sales to price and advertising expenditure, constrain the model by removing the impact of price, π‘₯2. The unrestricted 𝑆𝑆𝐸, determined previously, is π‘†π‘†πΈπ‘ˆ = 1718.943. The restricted 𝑆𝑆𝐸 is
then (see the Excel file CH7 DATA, worksheet tab burger)
𝑆𝑆𝐸𝑅 = 2961.827
Thus,
𝐹 = [(2961.827 − 1718.943)⁄1] / [1718.943⁄(75 − 3)] = 52.06
To find the critical value, using the Excel =F.INV.RT function, we have: 𝐹0.05,(1,72) = 3.974. Since the test
statistic exceeds the critical value, we reject the null hypothesis 𝐻0 : 𝛽2 = 0 and conclude that variations in
price influence monthly sales.
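The same restricted-versus-unrestricted comparison can be sketched in a few lines of Python. The SSE values below are the ones quoted above, and scipy's F distribution stands in for Excel's =F.INV.RT; this is a replication sketch, not part of the original worksheet.

# A sketch of the restricted-least-squares F-test for H0: beta2 = 0.
from scipy import stats

SSE_R, SSE_U = 2961.827, 1718.943   # restricted and unrestricted SSE
n, k, j = 75, 3, 1                  # sample size, parameters in the unrestricted model, restrictions

F = ((SSE_R - SSE_U) / j) / (SSE_U / (n - k))   # about 52.06
cv = stats.f.ppf(0.95, j, n - k)                # about 3.974, as with =F.INV.RT
print(F, F > cv)                                # reject H0 when the statistic exceeds the critical value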
Note that when we test the significance of the overall model, in effect, we use the following methodology:
Unrestricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝑒
Restricted model: 𝑦 = 𝛽1 + 𝑒
The null and alternative hypotheses are:
𝐻0 : 𝛽2 = 0, 𝛽3 = 0
𝐻1 : at least one of the 𝛽𝑗 is non-zero
The estimated coefficient for the restricted model is simply 𝑏1 = 𝑦̅, and the regression equation is: 𝑦̂ = 𝑏1 =
𝑦̅. Therefore,
𝑆𝑆𝐸𝑅 = ∑(𝑦 − 𝑦̂)² = ∑(𝑦 − 𝑦̅)² = 𝑆𝑆𝑇
Thus,
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)] = [(𝑆𝑆𝑇 − π‘†π‘†πΈπ‘ˆ)⁄(π‘˜ − 1)] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)]

𝐹 = [𝑆𝑆𝑅⁄(π‘˜ − 1)] / [𝑆𝑆𝐸⁄(𝑛 − π‘˜)]
which is the same 𝐹 statistic as that used to test the overall significance of the original unrestricted model.
1.2. Test of the Significance of Advertising
Using the extended burger model,
𝑆𝐴𝐿𝐸𝑆 = 𝛽1 + 𝛽2 𝑃𝑅𝐼𝐢𝐸 + 𝛽3 𝐴𝐷𝑉𝐸𝑅𝑇 + 𝛽4 𝐴𝐷𝑉𝐸𝑅𝑇² + 𝑒
we can test some interesting economic hypotheses and illustrate the use of t- and 𝐹-tests. In the extended model above, which incorporates 𝐴𝐷𝑉𝐸𝑅𝑇², we wish to test whether advertising has an effect on total revenue. This means we want to perform a joint test for 𝛽3 and 𝛽4. Again, let's make use of the unrestricted/restricted model methodology used above.
Unrestricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛽4π‘₯3² + 𝑒
Restricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝑒
The joint null and alternative hypotheses are:
𝐻0 : 𝛽3 = 0, 𝛽4 = 0
𝐻1 : 𝛽3 or 𝛽4 , or both are non-zero.
From the regression output for the unrestricted model, as seen above, π‘†π‘†πΈπ‘ˆ = 1532.084. From the regression output for the restricted model (see the CH7 DATA Excel file): 𝑆𝑆𝐸𝑅 = 1896.391.
Unrestricted Model

ANOVA
              df    SS          MS
Regression    3     1583.3974   527.79914
Residual      71    1532.0845   21.578654
Total         74    3115.4819

             Coefficients   Standard Error   t Stat
Intercept    109.7190       6.7990           16.1374
PRICE        -7.6400        1.0459           -7.3044
ADVERT       12.1512        3.5562           3.4170
ADVERT²      -2.7680        0.9406           -2.9427

Restricted Model

ANOVA
              df    SS          MS
Regression    1     1219.0910   1219.0910
Residual      73    1896.3908   25.9780
Total         74    3115.4819

             Coefficients   Standard Error   t Stat
Intercept    121.9002       6.5263           18.6783
PRICE        -7.8291        1.1429           -6.8504
The test statistic F is:
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)] = [(1896.3908 − 1532.0845)⁄2] / [1532.0845⁄(75 − 4)] = 8.441

and the critical value is 𝐹0.05,(2,71) = 3.126.
Clearly we reject the null hypothesis and conclude that advertising has an effect on total revenue.
1.3. Test the Significance of Advertising—Optimal Level
What is the optimum level of advertising? Optimality in economics always requires that the marginal benefit of an action equal its marginal cost. If the marginal benefit exceeds the marginal cost, the action should be expanded. If the marginal benefit is less than the marginal cost, the action should be curtailed. The optimum is, therefore, where the two are equal.
The marginal benefit of advertising is the contribution of each additional $1 thousand of advertising expenditure to total revenue. From the model

𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛽4π‘₯3² + 𝑒

the marginal benefit of advertising is:

πœ•π‘¦⁄πœ•π‘₯3 = 𝛽3 + 2𝛽4π‘₯3
Ignoring the cost of producing the additional sales, the marginal cost of advertising is the extra $1 thousand spent on advertising itself. Thus, the optimality requirement 𝑀𝐡 = 𝑀𝐢 is:
𝛽3 + 2𝛽4 π‘₯3 = $1
Using the estimated least squares coefficients, we thus have:
12.151 + 2(−2.768)π‘₯3 = 1
Yielding, π‘₯3 = $2.014 thousand.
Now suppose we wish to use the estimated regression model to test the null hypothesis that the optimum
advertising expenditure is $1.9 thousand. That is,
𝐻0 : 𝛽3 + 2𝛽4 (1.9) = 1
𝐻1 : 𝛽3 + 2𝛽4 (1.9) ≠ 1
The test statistic for this test has a t distribution.
𝑇𝑆 = |𝑑| = (π‘†π‘Žπ‘šπ‘π‘™π‘’ π‘†π‘‘π‘Žπ‘‘π‘–π‘ π‘‘π‘–π‘ − 𝑁𝑒𝑙𝑙 π‘‰π‘Žπ‘™π‘’π‘’) ⁄ π‘†π‘‘π‘Žπ‘›π‘‘π‘Žπ‘Ÿπ‘‘ πΈπ‘Ÿπ‘Ÿπ‘œπ‘Ÿ
The sample statistic for the test is 𝑏3 + 3.8𝑏4 . The standard error in the denominator is se(𝑏3 + 3.8𝑏4 ).
𝑇𝑆 = (𝑏3 + 3.8𝑏4 − 1) ⁄ se(𝑏3 + 3.8𝑏4)
How do we find the value for the standard error? Using the properties of variance, we have
var(𝑏3 + 3.8𝑏4) = var(𝑏3) + 3.8² var(𝑏4) + 2(3.8) cov(𝑏3, 𝑏4)
We can obtain var(𝑏3 ) and var(𝑏4 ) by simply squaring the standard errors from the regression output.
Unfortunately, the Excel regression output does not provide the covariance value. We can, however, still use
Excel to compute cov(𝑏3 , 𝑏4 ). If you recall, using matrix algebra we can determine the variance-covariance
matrix.
[ var(𝑏1)          covar(𝑏1, 𝑏2)    covar(𝑏1, 𝑏3) ]
[ covar(𝑏1, 𝑏2)    var(𝑏2)          covar(𝑏2, 𝑏3) ]  = var(𝑒) X⁻¹
[ covar(𝑏1, 𝑏3)    covar(𝑏2, 𝑏3)    var(𝑏3)       ]
where X⁻¹ is the inverse of the matrix

X = [ 𝑛       ∑π‘₯2      ∑π‘₯3    ]
    [ ∑π‘₯2    ∑π‘₯2²     ∑π‘₯2π‘₯3 ]
    [ ∑π‘₯3    ∑π‘₯2π‘₯3   ∑π‘₯3²   ]

This X matrix is for a model with two independent variables. In our current model we have three independent variables. The solution for this problem is simple because the X matrix can be expanded to incorporate any number of independent variables. For a 3-variable model we have,

X = [ 𝑛       ∑π‘₯2      ∑π‘₯3      ∑π‘₯4    ]
    [ ∑π‘₯2    ∑π‘₯2²     ∑π‘₯2π‘₯3   ∑π‘₯2π‘₯4 ]
    [ ∑π‘₯3    ∑π‘₯2π‘₯3   ∑π‘₯3²     ∑π‘₯3π‘₯4 ]
    [ ∑π‘₯4    ∑π‘₯2π‘₯4   ∑π‘₯3π‘₯4   ∑π‘₯4²   ]
Using Excel we can compute these quantities as the elements of the X matrix, find the inverse, and then multiply the inverse matrix by var(𝑒): var(𝑒)X⁻¹. The result is the following covariance matrix. (The calculations are shown in the Excel file.)
          𝑏1         𝑏2        𝑏3         𝑏4
𝑏1     46.227     -6.426   -11.601      2.939
𝑏2     -6.426      1.094     0.300     -0.086
𝑏3    -11.601      0.300    12.646     -3.289
𝑏4      2.939     -0.086    -3.289      0.885
Thus,
var(𝑏3 + 3.8𝑏4) = var(𝑏3) + 3.8² var(𝑏4) + 2(3.8) covar(𝑏3, 𝑏4)
var(𝑏3 + 3.8𝑏4) = 12.646 + 3.8²(0.885) + 2(3.8)(−3.289) = 0.428
se(𝑏3 + 3.8𝑏4) = √0.428 = 0.6542
𝑇𝑆 = (1.633 − 1) ⁄ 0.6542 = 0.968
The critical value is 𝐢𝑉 = 𝑑0.025,71 = 1.994. Since 𝑇𝑆 < 𝐢𝑉, we do not reject the null hypothesis; the data are consistent with an optimal advertising expenditure of $1.9 thousand.
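The calculation of the standard error and the test statistic can also be sketched in Python; the variance and covariance entries below are the ones reported in the covariance matrix above, and numpy/scipy are assumptions of this sketch rather than part of the Excel workbook.

# A sketch of the variance of the linear combination b3 + 3.8*b4 and the resulting t statistic.
import numpy as np
from scipy import stats

var_b3, var_b4, cov_b34 = 12.646, 0.885, -3.289   # from the covariance matrix above
b3, b4 = 12.1512, -2.768                          # estimated coefficients

var_comb = var_b3 + 3.8**2 * var_b4 + 2 * 3.8 * cov_b34   # about 0.428
se_comb = np.sqrt(var_comb)                               # about 0.654
t_stat = (b3 + 3.8 * b4 - 1) / se_comb                    # about 0.968
cv = stats.t.ppf(0.975, 71)                               # about 1.994
print(t_stat, cv, abs(t_stat) > cv)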
An 𝑭-test alternative can also be used for the optimum advertising hypothesis test. To do this test state the
unrestricted and restricted models:
Unrestricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛽4π‘₯3² + 𝑒
Restricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + (1 − 3.8𝛽4)π‘₯3 + 𝛽4π‘₯3² + 𝑒
Note that we have used the null statement 𝛽3 +2𝛽4 (1.9) = 1, that is, 𝛽3 = 1 − 3.8𝛽4 , in place of the coefficient
of π‘₯3 . To estimate the restricted model, rearrange the model in the following format:
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + π‘₯3 + 𝛽4 (π‘₯32 − 3.8π‘₯3 ) + 𝑒
𝑦 − π‘₯3 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽4 (π‘₯32 − 3.8π‘₯3 ) + 𝑒
Running the regression on the restricted model yields a sum of squared errors of 𝑆𝑆𝐸𝑅 = 1552.286, from which we compute the F-statistic (see the Excel file):

𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄1] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)] = [(1552.286 − 1532.084)⁄1] / [1532.084⁄71] = 0.936
The critical value is 𝐢𝑉 = 𝐹0.05,(1,71) = 3.976. Note that when there is only one null hypothesis, the t- and F-tests are equivalent tests because they yield the same probability value (approximately 0.34).¹ Also note that 𝐹 = 𝑑² = (0.967572)² = 0.936195.
2. The Use of Non-sample Information
In many estimation and inference problems we have information external to the sample data. This non-sample information may come from economic theory or principles, or from experience. When available, we can combine the non-sample information with the sample information to improve the precision of the estimated parameters.
In economic analysis of demand, demand for a good depends on the price of the good, price of substitutes and
complements, and on income. Take the demand for beer. It depends on the price of beer, 𝑃𝐡 , the price of
other liquor, 𝑃𝐿 , the price of all other remaining goods and services, 𝑃𝑅 , and income (𝐼):
𝑄 = 𝑓(𝑃𝐡 , 𝑃𝐿 , 𝑃𝑅 , 𝐼)
Assuming that a log-log function form is appropriate for this demand relationship,
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + 𝛽4 ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + 𝑒
The relevant non-sample information can be derived by assuming that there is no “money illusion” with
respect to simultaneous equal-percentage increase in all prices and income. That is, the demand for beer will
not change if income and all prices, say, double. Impose this assumption on the model by multiplying all prices and income by the factor πœ†:
ln(𝑄) = 𝛽1 + 𝛽2 ln(πœ†π‘ƒπ΅ ) + 𝛽3 ln(πœ†π‘ƒπΏ ) + 𝛽4 ln(πœ†π‘ƒπ‘… ) + 𝛽5 ln(πœ†πΌ)
Using the properties of logarithm, we can rewrite the above equation as:
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + 𝛽4 ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + (𝛽2 + 𝛽3 + 𝛽4 + 𝛽5 ) ln(πœ†)
Since multiplying all prices and income by πœ† must not alter ln(𝑄), the following must be true:
𝛽2 + 𝛽3 + 𝛽4 + 𝛽5 = 0
This non-sample information thus can be imposed as a constraint on the parameters in the demand model.
Solve this restriction for 𝛽4 ,
𝛽4 = −𝛽2 − 𝛽3 − 𝛽5
¹ Use the Excel function =FDIST() to find the tail area under the F-curve for a given F statistic.
Starting from the multiple regression model for the log-log demand function above,
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + 𝛽4 ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + 𝑒
and substituting for 𝛽4 from the constraint, we have
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + (−𝛽2 − 𝛽3 − 𝛽5 ) ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + 𝑒
ln(𝑄) = 𝛽1 + 𝛽2 [ln(𝑃𝐡 ) − ln(𝑃𝑅 )] + 𝛽3 [ln(𝑃𝐿 ) − ln(𝑃𝑅 )] + 𝛽5 [ln(𝐼) − ln(𝑃𝑅 )] + 𝑒
The restricted model then can be written as,
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡⁄𝑃𝑅) + 𝛽3 ln(𝑃𝐿⁄𝑃𝑅) + 𝛽5 ln(𝐼⁄𝑃𝑅) + 𝑒
The data to estimate this model are available in CH7 DATA in the tab BEER. The summary output is presented below:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8989
R Square            0.8079
Adjusted R Square   0.7858
Standard Error      0.0617
Observations        30

ANOVA
              df    SS       MS       F         Significance F
Regression    3     0.4161   0.1387   36.4602   0.0000
Residual      26    0.0989   0.0038
Total         29    0.5150

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -4.7978        3.7139           -1.2918   0.2078    -12.4318    2.8362
ln(PB/PR)    -1.2994        0.1657           -7.8400   0.0000    -1.6401     -0.9587
ln(PL/PR)    0.1868         0.2844           0.6569    0.5170    -0.3977     0.7714
ln(I/PR)     0.9458         0.4270           2.2148    0.0357    0.0680      1.8236
Since in the restricted model β4 = −β2 − β3 − β5, then the estimated β4 is
𝑏4∗ = −𝑏2∗ − 𝑏3∗ − 𝑏5∗ = −(−1.2994) − 0.1868 − 0.9458 = 0.1667
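As a sketch of how this restriction can be imposed in practice outside Excel, the fragment below estimates the restricted model in ratio form with statsmodels and then recovers 𝑏4* from the constraint. The column names Q, PB, PL, PR, and I mirror the notation above and are assumptions, not names taken from the Excel file.

# A sketch of restricted least squares via the ratio form of the beer demand model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# beer = pd.read_excel("CH7 DATA.xlsx", sheet_name="BEER")   # hypothetical load
def restricted_beer_model(beer: pd.DataFrame):
    y = np.log(beer["Q"])
    X = pd.DataFrame({
        "ln(PB/PR)": np.log(beer["PB"] / beer["PR"]),
        "ln(PL/PR)": np.log(beer["PL"] / beer["PR"]),
        "ln(I/PR)":  np.log(beer["I"] / beer["PR"]),
    })
    res = sm.OLS(y, sm.add_constant(X)).fit()
    # recover the restricted estimate of beta4 from the constraint b4 = -b2 - b3 - b5
    b2, b3, b5 = res.params["ln(PB/PR)"], res.params["ln(PL/PR)"], res.params["ln(I/PR)"]
    return res, -b2 - b3 - b5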
The restricted least squares estimates are biased [E(π‘π‘˜) ≠ π›½π‘˜] unless the constraints are exactly true. Also note that the variance of a restricted least squares estimator is smaller than that of the unrestricted one. By combining non-sample information with the sample information, we reduce the variation in the estimation procedure introduced by random sampling.
3. Model Specification
In any econometric investigation, choice of the model is one of the first steps. What are the important
considerations when choosing a model? What are the consequences of choosing the wrong model? Are there
any ways of assessing model specification or misspecification?
The essential features of the model choice are:
• Choice of functional form,
• Choice of explanatory variables, and
• Whether the multiple regression model assumptions hold.²
For choice of functional form and regressors, economic principles and logical reasoning play a prominent and
vital role.
3.1. Consequences of Omitted and Irrelevant Variables
3.1.1. The Omitted Variable Problem
Suppose in a particular industry the wage rate of employees, W, depends on their experience, E, and their motivation, M. Then the model is specified as:
π‘Š = 𝛽1 + 𝛽2 𝐸 + 𝛽3 𝑀 + 𝑒
However, since data on motivation are unavailable, we dispense with M and instead we estimate the model
π‘Š = 𝛽1 + 𝛽2 𝐸 + 𝑒
By estimating the alternative model we are imposing the constraint 𝛽3 = 0 when it is not true, that is, when in
fact 𝛽3 ≠ 0. By imposing this constraint, the least squares estimates 𝑏1 and 𝑏2 will be biased. Only when the
omitted variable is uncorrelated with the retained variables will the estimates be unbiased. But perfectly
uncorrelated regressors are rare.
Because of the possibility of omitted-variable bias (OVB), one must include all important relevant variables. If
an estimated equation has coefficients with unexpected signs, or unrealistic magnitudes, a possible cause of
these strange results is the omission of an important variable.
One method to determine if a variable or a group of variables should be included in a model is to perform
“significance tests”. For one variable (one null hypothesis 𝐻0 : 𝛽3 = 0), we use the t-test and for more than one
variable (two or more null hypotheses 𝐻0: 𝛽3 = 0, 𝛽4 = 0) we use the F-test. But we must also remember that we may fail to reject the null because of the quality or paucity of the data, even though the variable is in fact relevant to the model. Dropping the variable in such cases would induce omitted-variable bias in the remaining coefficient estimates.
3.1.1.1. Proof of the Omitted Variable Bias
Suppose the true model is 𝑦 = 𝛽1 + 𝛽2 π‘₯ + 𝛽3 β„Ž + 𝑒, but by omitting the variable β„Ž we estimate the model 𝑦 =
𝛽1 + 𝛽2 π‘₯ + 𝑒 instead.
Denote the least squares estimator of 𝛽2 in the reduced model by 𝑏2∗ .
² The typical violations of the assumptions are: heteroskedasticity, autocorrelation, and random regressors.
𝑏2∗ = ∑(π‘₯ − π‘₯Μ…)(𝑦 − 𝑦̅) ⁄ ∑(π‘₯ − π‘₯Μ…)² = ∑(π‘₯ − π‘₯Μ…)𝑦 ⁄ ∑(π‘₯ − π‘₯Μ…)²
To simplify the proof, let
𝑀 = (π‘₯ − π‘₯Μ…) ⁄ ∑(π‘₯ − π‘₯Μ…)²
Thus,
𝑏2∗ = ∑𝑀𝑦
Now substitute for 𝑦 from the original model, which includes h,
𝑏2∗ = ∑𝑀(𝛽1 + 𝛽2 π‘₯ + 𝛽3 β„Ž + 𝑒) = 𝛽1 ∑𝑀 + 𝛽2 ∑𝑀π‘₯ + 𝛽3 ∑π‘€β„Ž + ∑𝑀𝑒
It is simple to show that ∑𝑀 = 0 and ∑𝑀π‘₯ = 1. Thus,
𝑏2∗ = 𝛽2 + 𝛽3 ∑π‘€β„Ž + ∑𝑀𝑒
Taking the expectation of both sides (and noting that E(∑𝑀𝑒) = 0), we have
E(𝑏2∗) = 𝛽2 + 𝛽3 ∑π‘€β„Ž ≠ 𝛽2
Consider the term ∑π‘€β„Ž.
∑π‘€β„Ž =
∑(π‘₯ − π‘₯Μ… )β„Ž ∑(π‘₯ − π‘₯Μ… )(β„Ž − β„ŽΜ…)
=
∑(π‘₯ − π‘₯Μ… )2
∑(π‘₯ − π‘₯Μ… )2
∑π‘€β„Ž =
∑(π‘₯ − π‘₯Μ… )(β„Ž − β„ŽΜ…)⁄(𝑛 − 1) cov(π‘₯, β„Ž)
=
∑(π‘₯ − π‘₯Μ… )2 ⁄(𝑛 − 1)
var(π‘₯)
The numerator is the covariance of π‘₯ and β„Ž, and the denominator the variance of π‘₯. This allows us to write:
E(𝑏2∗) = 𝛽2 + 𝛽3 cov(π‘₯, β„Ž) ⁄ var(π‘₯) ≠ 𝛽2
The OVB here is shown as the difference between E(𝑏2∗ ) and 𝛽2 :
bias(𝑏2∗) = E(𝑏2∗) − 𝛽2 = 𝛽3 cov(π‘₯, β„Ž) ⁄ var(π‘₯)
Knowing the sign of 𝛽3 and the sign of cov(π‘₯, β„Ž) tells us the direction of the bias. Also note that if π‘₯ and β„Ž are
uncorrelated, their covariance will be zero. Thus the bias disappears and E(𝑏2∗ ) = 𝛽2 .
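The result can also be illustrated with a small simulation sketch on hypothetical data (not the chapter's data set): when π‘₯ and β„Ž are positively correlated and 𝛽3 > 0, the short regression's slope on π‘₯ is biased upward by roughly 𝛽3 cov(π‘₯, β„Ž)⁄var(π‘₯).

# A simulation sketch of omitted-variable bias with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
h = 0.6 * x + rng.normal(size=n)                     # h is correlated with x
y = 1.0 + 2.0 * x + 3.0 * h + rng.normal(size=n)     # true beta2 = 2, beta3 = 3

X_short = np.column_stack([np.ones(n), x])           # short regression omits h
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

expected_bias = 3.0 * np.cov(x, h)[0, 1] / np.var(x) # about 3 * 0.6 = 1.8
print(b_short[1], 2.0 + expected_bias)               # both close to 3.8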
Example
The worksheet "edu-inc" in the file "CH7 DATA" contains 428 observations relating annual family income (𝑦 = 𝐹𝐴𝑀𝐼𝑁𝐢) to the level of education of the income earners. The explanatory variables are the husband's years of education (π‘₯2 = π»πΈπ·π‘ˆ) and the wife's years of education (π‘₯3 = π‘ŠπΈπ·π‘ˆ).
The regression outcome shows family income rises by $3,132 for each additional year of the husband's
education and by $4,523 for each additional year of the wife's education.
𝑦̂ = −5533.629 + 3131.509π‘₯2 + 4522.641π‘₯3
If, however, we omit π‘₯3 = π‘ŠπΈπ·π‘ˆ from the model the regression equation becomes,
𝑦̂ = 2619.27 + 5155.483π‘₯2
With the effect of an extra year of the husband's education on family income rising by nearly $2,000 to
$5,155, the model overstates the contribution of the husband's educational attainment to the family income.
Denote the biased coefficient as 𝑏2∗ . Then the omitted variable bias, as explained above, is
bias(𝑏2∗) = 𝛽3 cov(π‘₯2, π‘₯3) ⁄ var(π‘₯2)
We can show that the omitted variable imparts a positive bias to the model: the estimated coefficient 𝑏3 = 4522.641 > 0 and, using Excel, cov(π‘₯2, π‘₯3) = 4.113 > 0.³
Now include a third explanatory variable, the number of children under 6 years of age, π‘₯4 = 𝐾𝐿6. The regression equation is as follows. Note that the coefficient of 𝐾𝐿6 is negative, implying that the larger the number of children in the family, the lower the income (fewer hours worked). For each additional child, family income is reduced by $14,311. For comparison, the regression equation for the original model is shown below the new equation.
𝑦̂ = −7755.3 + 3211.5π‘₯2 + 4776.9π‘₯3 − 14310.9π‘₯4
𝑦̂ = −5533.6 + 3131.5π‘₯2 + 4522.6π‘₯3
Note that the inclusion of the π‘₯4 = 𝐾𝐿6 variable does not alter the coefficients of the original model by much.
The reason for this is that the 𝐾𝐿6 variable is not highly correlated with the education variables. Even though
the new variable is relevant (the p-value in the regression output is 0.0044), its omission would not impart an
OVB because of the absence of significant correlation with existing explanatory variables (here the education
variables).
3.1.2. The Irrelevant Variable Problem
The opposite of the omitted variable problem is the irrelevant variable problem. The irrelevant variable does not make the other estimated coefficients biased, but it increases their variance (or standard error) if the irrelevant variable is significantly correlated with the existing variables. Recall the role of (1 − π‘Ÿπ‘—π‘˜²) in the denominator of the formula for the variance of 𝑏𝑗. Even though the irrelevant variable is unlikely to influence the dependent variable, it could be correlated with one or both of the other explanatory variables, thus increasing their variance.
3.2. The RESET Test for Model Misspecification
The general idea behind the RESET test (Regression Specification Error Test) is that if we can significantly improve the model by artificially including powers of the predictions of the model (that is, 𝑦̂² and 𝑦̂³), then we can conclude that the original model is inadequate or misspecified. Let's state the original model as:
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯3 + 𝑒
which is estimated by
³ Do not confuse cov(π‘₯2, π‘₯3) with cov(𝑏2, 𝑏3)!
𝑦̂ = 𝑏1 + 𝑏2 π‘₯2 + 𝑏3 π‘₯3
We can include the squared prediction of 𝑦, 𝑦̂², alone or along with its cubic value, 𝑦̂³, in the original model:
𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛾1𝑦̂² + 𝑒
𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛾1𝑦̂² + 𝛾2𝑦̂³ + 𝑒
(The symbol "γ" is the Greek letter Gamma.) In the first model we can then test the hypotheses,
𝐻0 : 𝛾1 = 0
𝐻1 : 𝛾1 ≠ 0
and in the second one,
𝐻0 : 𝛾1 = 0, 𝛾2 = 0
𝐻1 : 𝛾1 ≠ 0, or 𝛾2 ≠ 0, or both ≠ 0
If we reject the null hypothesis and conclude that 𝛾1 or 𝛾2 is significantly different from zero, then the
inclusion of the powers of prediction of 𝑦 in the model has improved the model. Therefore, the original model
is inadequate. If we have omitted variables, and these variables are correlated with π‘₯2 and π‘₯3 , then some of
their effect may be picked up by the inclusion of the powers of prediction of 𝑦.
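The procedure can be carried out in a few lines of Python. The sketch below treats the original model as the restricted model, adds powers of the fitted values, and forms the F statistic exactly as described above; statsmodels and pandas inputs are assumptions of this sketch, not part of the Excel workflow used in the chapter.

# A "by hand" RESET test sketch: compare the original model with one augmented by powers of yhat.
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def reset_test(y: pd.Series, X: pd.DataFrame, powers=(2, 3)):
    X1 = sm.add_constant(X)
    restricted = sm.OLS(y, X1).fit()                 # original model (restricted)
    X2 = X1.copy()
    for p in powers:                                 # add yhat^2 (and yhat^3)
        X2[f"yhat^{p}"] = restricted.fittedvalues ** p
    unrestricted = sm.OLS(y, X2).fit()
    j = len(powers)                                  # number of restrictions
    n, k = X2.shape                                  # k counts all parameters of the unrestricted model
    F = ((restricted.ssr - unrestricted.ssr) / j) / (unrestricted.ssr / (n - k))
    p_value = stats.f.sf(F, j, n - k)
    return F, p_value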
Let us use the family income model above to do the RESET test. To conduct a RESET test, run the regression first by adding 𝑦̂² alone, and then by including both 𝑦̂² and 𝑦̂³ in the original model. In running a RESET test we treat the original model as the restricted model and use the F-test described above. Using the estimated regression equation with 𝐾𝐿6 included,
𝑦̂ = −7755.3 + 3211.5π‘₯2 + 4776.9π‘₯3 − 14310.9π‘₯4
and then running the model with 𝑦̂² added,
𝑦̂* = 𝑏1 + 𝑏2π‘₯2 + 𝑏3π‘₯3 + 𝑏4π‘₯4 + 𝑔1𝑦̂²
The ANOVA sections of the computer output for the two models are shown below:
𝑦̂* = 𝑏1 + 𝑏2π‘₯2 + 𝑏3π‘₯3 + 𝑏4π‘₯4 + 𝑔1𝑦̂²

ANOVA
              df     SS          MS
Regression    4      1.568E+11   3.92E+10
Residual      423    6.743E+11   1.594E+09
Total         427    8.311E+11

             Coefficients   Standard Error   t Stat
Intercept    87242.9829     40389.3906       2.1600
HEDU         -2381.4657     2419.6918        -0.9842
WEDU         -4235.1089     3832.1395        -1.1052
KL6          10887.3371     11439.2762       0.9518
𝑦̂²           0.000010       0.0000041        2.4462

𝑦̂ = 𝑏1 + 𝑏2π‘₯2 + 𝑏3π‘₯3 + 𝑏4π‘₯4

ANOVA
              df     SS          MS
Regression    3      1.472E+11   4.91E+10
Residual      424    6.838E+11   1.61E+09
Total         427    8.311E+11

             Coefficients   Standard Error   t Stat
Intercept    -7755.330      11162.935        -0.695
HEDU         3211.526       796.703          4.031
WEDU         4776.907       1061.164         4.502
KL6          -14310.921     5003.928         -2.860
The F-statistic for the test 𝐻0 : 𝛾1 = 0 is:
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)]
𝑆𝑆𝐸𝑅 = 6.838E + 11
π‘†π‘†πΈπ‘ˆ = 6.743E + 11
𝑗=1
𝑛 − π‘˜ = 423
𝐹 = 5.984
The probability value is =F.DIST.RT(5.984, 1, 423) = 0.0148. At a 5% level of significance, we reject 𝐻0 and
conclude that predictions squared does improve the original model. This indicates that the original model
(with the kids variable included) is misspecified.
Including, in addition, the “predictions cubed” and running the regression provides the following ANOVA
table and subsequent F-statistic.
ANOVA
              df     SS            MS         F          Significance F
Regression    5      1.57219E+11   3.14E+10   19.69122   1.2E-17
Residual      422    6.73868E+11   1.6E+09
Total         427    8.31087E+11
𝑆𝑆𝐸𝑅 = 6.838E+11
π‘†π‘†πΈπ‘ˆ = 6.739E+11
𝑗=2
𝑛 − π‘˜ = 422
𝐹 = 3.1226
The p-value is =F.DIST.RT(3.1226, 2, 422) = 0.0451. We reject 𝐻0: 𝛾1 = 0, 𝛾2 = 0 at a 5% level of significance. This would indicate that the model could be improved upon by adding more variables, such as the age of the wage earner, experience, or the geographic location of the household (rural versus urban).
4. Cases Where the F-Test and t-Tests Give Contradictory Results—Collinearity
In some cases, the 𝐹-test may indicate that the overall model is significant, but individual t-tests lead us to conclude that the individual 𝛽𝑗 are not significantly different from zero. When collinearity among the explanatory variables exists, that is, when the independent variables themselves are correlated (when their explanatory powers overlap), the standard errors of the regression coefficients will be large, making the t-test statistics small, thus leading us to conclude that each 𝛽𝑗 individually is not significantly different from zero. The following is a discussion of the collinearity problem in multiple regression models.
4.1. Collinearity
To gain a better understanding of the relationship between the dependent variable y and the explanatory
variables, it is important to understand the factors affecting the variance and covariance of the coefficients.
To that end, consider the actual formulas for the variances of the slope coefficients and their covariance.
var(𝑏2) = var(𝑒) ⁄ [∑(π‘₯2 − π‘₯Μ…2)² (1 − π‘Ÿ23²)]⁴

var(𝑏3) = var(𝑒) ⁄ [∑(π‘₯3 − π‘₯Μ…3)² (1 − π‘Ÿ23²)]

covar(𝑏2, 𝑏3) = var(𝑒) [−∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3)] ⁄ [∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)² (1 − π‘Ÿ23²)]

• The larger the variance of the disturbance term var(𝑒), as estimated by the sample variance of the residuals, the larger the variance (and covariance) of the least squares estimators. This means that the dependent variable data are more widely scattered about the regression plane, indicating a weaker association between 𝑦 and the respective explanatory variable.
• Note that the sum of squared deviations for each independent variable appears in the denominator of the three formulas. The bigger these are, the smaller the variances and the covariance. Two factors affect these sum-squares:
  o The sample size: the larger the sample size 𝑛, the bigger the sum-squares.
  o The degree of dispersion of the π‘₯𝑖𝑗 data about their respective mean π‘₯̅𝑗: the more dispersed the π‘₯𝑖𝑗 data are about their mean, the bigger the sum-squares. In order to estimate the population slope parameters 𝛽𝑗 precisely by reducing the variance of 𝑏𝑗, there should be a large amount of variation in the π‘₯𝑖𝑗.
• Finally, the larger the correlation between π‘₯2 and π‘₯3, π‘Ÿ23, the bigger the variance of 𝑏𝑗.
The final point here deserves more detailed attention. Note that the correlation coefficient π‘Ÿ23 , given by the
familiar formula,
π‘Ÿ23 =
∑(π‘₯2 − π‘₯Μ…2 )(π‘₯3 − π‘₯Μ…3 )
√∑(π‘₯2 − π‘₯Μ…2 )2 ∑(π‘₯3 − π‘₯Μ…3 )2
measures the degree of association between π‘₯2 and π‘₯3. Since π‘Ÿ23 appears in the denominators of var(𝑏2), var(𝑏3), and covar(𝑏2, 𝑏3) through the term (1 − π‘Ÿ23²), the bigger the correlation coefficient π‘Ÿ23, the smaller the denominator, hence the bigger the variances and the covariance. When the two independent variables are correlated it is difficult to disentangle their separate effects on the dependent variable.
⁴ The general formula for var(𝑏2) is:
var(𝑏2) = var(𝑒) ∑(π‘₯3 − π‘₯Μ…3)² ⁄ {∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)² − [∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3)]²}
In the denominator we can substitute from the formula for the correlation coefficient π‘Ÿ23, measuring the correlation between π‘₯2 and π‘₯3:
π‘Ÿ23 = ∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3) ⁄ √[∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)²]
so that [∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3)]² = π‘Ÿ23² ∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)². Thus,
var(𝑏2) = var(𝑒) ∑(π‘₯3 − π‘₯Μ…3)² ⁄ [∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)² (1 − π‘Ÿ23²)] = var(𝑒) ⁄ [∑(π‘₯2 − π‘₯Μ…2)² (1 − π‘Ÿ23²)]
In simple regression, 𝑏2 is the total effect of π‘₯ on 𝑦: 𝑏2 = 𝑑𝑦⁄𝑑π‘₯2 . In multiple regression, the coefficient of the
first variable, π‘₯2 , is the estimated net effect of a change in that variable on 𝑦: 𝑏2 = πœ•π‘¦⁄πœ•π‘₯2 , and that of the
second variable π‘₯3 , similarly, is the estimated net effect of a change in π‘₯3 on 𝑦: 𝑏3 = πœ•π‘¦⁄πœ•π‘₯3 . Suppose π‘₯2 and
π‘₯3 are themselves linearly related, so that a change in one induces a change in the other. Then the total effect
of a change in π‘₯2 on 𝑦 involves not only 𝑏2 , but it must also include the effect of the change in π‘₯3 induced by
the change in π‘₯2 .
To further illustrate the effect of correlation among the independent variables, consider the following two
Venn diagrams (𝐴) and (𝐡). In both (𝐴) and (𝐡) each circle represents the total variation in the variable. In
(𝐴) the two independent variables π‘₯2 and π‘₯3 are uncorrelated, as shown by the two non-overlapping circles. Here π‘Ÿ23 = 0. Thus, the variances of the coefficients 𝑏2 and 𝑏3 are "simplified" into:
var(𝑏2) = var(𝑒) ⁄ ∑(π‘₯2 − π‘₯Μ…2)²        var(𝑏3) = var(𝑒) ⁄ ∑(π‘₯3 − π‘₯Μ…3)²
which are the same as the variance of the slope coefficient in the simple linear regression. Also, since π‘Ÿ23 = 0,
the term ∑(π‘₯2 − π‘₯Μ…2 )(π‘₯3 − π‘₯Μ…3 ) in the numerator of covar(𝑏2 , 𝑏3 ) formula equals zero, thus making
covar(𝑏2 , 𝑏3 ) = 0. Thus, a multiple regression of 𝑦 on π‘₯2 and π‘₯3 will contain the same information as is
contained in two separate simple regressions.
[Venn diagrams (𝐴) and (𝐡): each circle represents the total variation in 𝑦, π‘₯2, or π‘₯3. In (𝐴), π‘₯2 and π‘₯3 are uncorrelated and their circles do not overlap with each other; in (𝐡), π‘₯2 and π‘₯3 are correlated and their circles overlap with each other as well as with 𝑦.]
In (𝐡), the two circles representing the variations in the independent variables, in addition to overlapping
with the 𝑦 circle, overlap with each other. Thus there is variation common to all three variables, shown as the
area of intersection of the three circles. A simple regression of 𝑦 on π‘₯2 would involve the entire overlap
between 𝑦 and π‘₯2 , but, as the diagram shows, this overlap includes also some of the variation in π‘₯3 . The
resulting “net effect” overlap of the variation in π‘₯2 and 𝑦 is smaller than the overlap depicting the gross
relationship between these two variables.
The existence of the overlap between the circles representing variations in the independent variables indicates "collinearity". The bigger the area of overlap between π‘₯2 and π‘₯3, the stronger the collinearity. The practical impact of collinearity can be seen from the role of the correlation coefficient π‘Ÿ23 in the denominator of the variance formula for either slope coefficient. Take the variance of 𝑏2:
var(𝑏2) = var(𝑒) ⁄ [∑(π‘₯2 − π‘₯Μ…2)² (1 − π‘Ÿ23²)]
The stronger the collinearity of π‘₯2 and π‘₯3, the bigger π‘Ÿ23², the bigger the variance of 𝑏2, and the less precise the estimate of the parameter 𝛽2. The variation in π‘₯2 about its mean π‘₯Μ…2 adds most to the precision of estimation when it is not connected to the variation in the other explanatory variable. When the variation in π‘₯2 about its
mean is related to the variation in the other explanatory variable, the precision of estimation is diminished.
Example
Refer to the data in the tab “cars” in the Excel file, to estimate the effect of the number of cylinders (CYL),
engine size (displacement in cubic inches, ENG), and vehicle weight (WGT) on fuel consumption (MPG).
First run a simple regression using the number of cylinders as the only explanatory variable. The relevant part of the regression summary output is shown below:
             Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept    42.916         0.835            51.404     0.000     41.274      44.557
CYL          -3.558         0.146            -24.425    0.000     -3.844      -3.272
The 𝑅² value is 0.6047. Given the p-value of 0.000, the number of cylinders clearly has a significant impact on MPG, as expected.
Now run the regression using all independent variables mentioned above.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8362
R Square            0.6993
Adjusted R Square   0.6970
Standard Error      4.2965
Observations        392

ANOVA
              df     SS          MS         F         Significance F
Regression    3      16656.444   5552.148   300.764   0.000
Residual      388    7162.549    18.460
Total         391    23818.993

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    44.3710        1.4807           29.9665   0.0000    41.4598     47.2821
CYL          -0.2678        0.4131           -0.6483   0.5172    -1.0799     0.5443
ENG          -0.0127        0.0083           -1.5362   0.1253    -0.0289     0.0035
WGT          -0.0057        0.0007           -7.9951   0.0000    -0.0071     -0.0043
Note that both 𝑅² = 0.6993 and the F-statistic show that the combined impact of all variables on MPG is significant. However, considered separately, given the t-statistics and p-values for CYL and ENG, we cannot reject the null hypotheses 𝐻0: 𝛽2 = 0 and 𝐻0: 𝛽3 = 0, indicating that these variables individually have no impact on MPG! Also, using the F-test for the joint null hypothesis 𝐻0: 𝛽2 = 𝛽3 = 0, the F-statistic is F = 4.298 with a p-value = 0.0142, leading us to reject the "no-effect" null hypothesis.
These contradictions arise from the fact that there is strong collinearity between the variables CYL and ENG.
Considering the Venn diagram shown above, there is significant overlap (correlation) between the two
independent variables π‘₯2 = CYL and π‘₯3 = ENG. Using the Excel function =CORREL, the correlation coefficient
for the two variables is π‘Ÿ23 = 0.9508.
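A quick way to confirm this outside Excel is to compute the pairwise correlations of the regressors. The sketch below assumes the cars data have been loaded into a pandas dataframe with columns named CYL, ENG, and WGT, mirroring the text; the column names are assumptions of the sketch.

# A sketch of a pairwise collinearity check (the Python counterpart of Excel's =CORREL).
import pandas as pd

# cars = pd.read_excel("CH7 DATA.xlsx", sheet_name="cars")   # hypothetical load
def collinearity_check(cars: pd.DataFrame) -> pd.DataFrame:
    # full correlation matrix of the regressors; r(CYL, ENG) is about 0.95 in this example
    return cars[["CYL", "ENG", "WGT"]].corr()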
5. Identifying and Mitigating Collinearity
As explained above, the collinearity problem arises from the association or correlation between the
independent variables. Theoretically, if there is perfect collinearity between any two independent variables,
the inverse of the X matrix does not exist. Therefore there is no unique solution for the system of normal
equations that is used to obtain values of the regression coefficients. In matrix jargon, we say one row is a
linear combination of another row.
In any regression model, the collinearity problem would indicate that the data do not contain enough
"information" about the individual effects of the explanatory variables to precisely estimate the population
slope parameters 𝛽𝑗 . Even if, in theory, two explanatory variables are perfectly collinear, the sample data may
never indicate perfect collinearity. Therefore, there is always a solution for the values of 𝑏𝑗 . But these
solutions will not be precise estimates of the 𝛽𝑗 . The question is, how can we detect the existence of
significant collinearity?
In a model with two explanatory variables, a simple way to detect collinearity is to compute the correlation
coefficient using the formula,
π‘Ÿ23 =
∑(π‘₯2 − π‘₯Μ…2 )∑(π‘₯3 − π‘₯Μ…3 )
√∑(π‘₯2 − π‘₯Μ…2 )2 ∑(π‘₯3 − π‘₯Μ…3 )2
In Excel, the function is =CORREL(). For example, in the family income model with two explanatory variables
the correlation coefficient between HEDU and WEDU is r₂₃ = 0.5943.
In models with more than two explanatory variables, the collinear relationships may involve more than two
of the explanatory variables. To detect collinearity, we can estimate the "auxiliary" regression, where the
"dependent" or explained variable is one of the explanatory variables. We run the regression using the
remaining explanatory variables. In the family income model where KL6 is the additional explanatory
variable, now we can use this variable as the dependent variable and run the auxiliary regression. The
relevant "regression" equation is then
π‘₯Μ‚4 = π‘Ž1 + π‘Ž2 π‘₯2 + π‘Ž3 π‘₯3
Here, the objective is not to determine the coefficients; rather, the concern is the value of 𝑅². A large 𝑅² value, say above 0.80, would indicate a significant correlation between the variables under consideration. The 𝑅² for this auxiliary regression is 0.0179, which clearly indicates the absence of collinearity.
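The auxiliary-regression idea is easy to automate. The following sketch (statsmodels assumed) returns the auxiliary 𝑅² for any chosen explanatory variable, from which the variance inflation factor 1⁄(1 − 𝑅²) can also be computed; the function and column names are illustrative assumptions.

# A sketch of the auxiliary-regression diagnostic for collinearity.
import pandas as pd
import statsmodels.api as sm

def auxiliary_r2(X: pd.DataFrame, target: str) -> float:
    others = X.drop(columns=[target])                      # regress the target variable on the rest
    res = sm.OLS(X[target], sm.add_constant(others)).fit()
    return res.rsquared                                    # e.g. about 0.018 for KL6 on HEDU and WEDU

# vif = 1 / (1 - auxiliary_r2(X, "KL6"))                   # variance inflation factor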
6. Confidence and Prediction Intervals
The interval estimate for the mean value of the dependent variable, 𝑦̂0, for given values of the independent variables has the familiar general format:
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦̂0 )
In a model with two independent variables, the variance of 𝑦̂0 , from which we obtain the standard error figure
to build the interval estimate, is:
var(𝑦̂0) = var(𝑏1 + 𝑏2π‘₯02 + 𝑏3π‘₯03)
var(𝑦̂0) = var(𝑏1) + π‘₯02² var(𝑏2) + π‘₯03² var(𝑏3) + 2π‘₯02 cov(𝑏1, 𝑏2) + 2π‘₯03 cov(𝑏1, 𝑏3) + 2π‘₯02π‘₯03 cov(𝑏2, 𝑏3)
The prediction interval for an individual value of the dependent variable, 𝑦0 , for given values of the
independent variables, takes the following form:
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦0 )
The interval is still built around 𝑦̂0 . But the standard error is now different. The difference arises from the
fact that the individual value of 𝑦 deviates from the mean value by the prediction error.
𝑦0 = 𝑦̂0 + 𝑒
Therefore,
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
Example
Use the data in the tab “burger2” to estimate the coefficients of the model.
𝑆𝐴𝐿𝐸𝑆 = 𝛽1 + 𝛽2 𝑃𝑅𝐼𝐢𝐸 + 𝛽3 𝐴𝐷𝑉𝐸𝑅𝑇 + 𝛽4 𝐴𝐷𝑉𝐸𝑅𝑇 2 + 𝑒
𝑦 = 𝑆𝐴𝐿𝐸𝑆
π‘₯2 = 𝑃𝑅𝐼𝐢𝐸
π‘₯3 = 𝐴𝐷𝑉𝐸𝑅𝑇
π‘₯4 = 𝐴𝐷𝑉𝐸𝑅𝑇 2
Thus,
𝑦̂ = 𝑏1 + 𝑏2 π‘₯2 + 𝑏3 π‘₯3 + 𝑏4 π‘₯4
𝑦̂ = 109.719 − 7.64π‘₯2 + 12.1512π‘₯3 − 2.768π‘₯4
Now, let
π‘₯02 = $6
π‘₯03 = $1.9
π‘₯04 = (1.9)2 = 3.61
Then,
𝑦̂0 = 109.719 − 7.64(6) + 12.1512(1.9) − 2.768(3.61)
𝑦̂0 = 76.974
First, build a confidence interval for the mean value of 𝑦.
We need to find var(𝑦̂0 ). In the Excel file determine the covariance matrix.
var(𝑦̂0) = var(𝑏1 + 𝑏2π‘₯02 + 𝑏3π‘₯03 + 𝑏4π‘₯04)
var(𝑦̂0) = var(𝑏1) + π‘₯02² var(𝑏2) + π‘₯03² var(𝑏3) + π‘₯04² var(𝑏4)
         + 2π‘₯02 cov(𝑏1, 𝑏2) + 2π‘₯03 cov(𝑏1, 𝑏3) + 2π‘₯04 cov(𝑏1, 𝑏4)
         + 2π‘₯02π‘₯03 cov(𝑏2, 𝑏3) + 2π‘₯02π‘₯04 cov(𝑏2, 𝑏4) + 2π‘₯03π‘₯04 cov(𝑏3, 𝑏4)
var(𝑦̂0) = 46.22702 + (6²)(1.09399) + (1.9²)(12.64630) + (3.61²)(0.88477)
         + 2(6)(−6.42611) + 2(1.9)(−11.60096) + 2(3.61)(2.93903)
         + 2(6)(1.9)(0.30041) + 2(6)(3.61)(−0.08562) + 2(1.9)(3.61)(−3.28875)
var(𝑦̂0 ) = 0.8422
se(𝑦̂0 ) = 0.9177
The 95% confidence interval for 𝑦̂0 when π‘₯02 = 6, π‘₯03 = 1.9, and π‘₯04 = 3.61 is then,
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦̂0 )
𝐿, π‘ˆ = 76.974 ± (1.994)(0.9177) = 76.974 ± 1.830 = [75.144,78.804]
Now the prediction interval for the individual value of 𝑦:
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
var(𝑦0 ) = 0.8422 + 21.5787 = 22.42085
se(𝑦0 ) = 4.7351
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦0 )
𝐿, π‘ˆ = 76.974 ± (1.994)(4.7351) = 76.974 ± 9.441 = [67.533,86.415]
Μ‚πŸŽ )
7. A More Practical Way of Finding se(𝑦̂0)
In many models, where the number of independent variables exceeds two, obtaining the standard error of the
linear combination of the regression coefficients, as we have done above, becomes very tedious and may lead
to miscalculations. There is a simpler way to compute the standard error in question, as shown below.
We will use the same example as above. The general approach is as follows:
• Subtract the given value of each π‘₯𝑗 from each value in that column. For example,
  π‘₯𝑖2 − π‘₯02 = π‘₯𝑖2 − 6
  π‘₯𝑖3 − π‘₯03 = π‘₯𝑖3 − 1.9
  π‘₯𝑖4 − π‘₯04 = π‘₯𝑖4 − 3.61
• Run the regression with the adjusted values of the π‘₯𝑗.
The result for our example (see the Excel file tab “burger 3” for the full calculation) is:
ANOVA
              df    SS          MS         F           Significance F
Regression    3     1583.3974   527.7991   24.459316   5.59996E-11
Residual      71    1532.0845   21.5787
Total         74    3115.4819

             Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept    76.974         0.9177           83.8760   9.19E-73   75.1442     78.8039
PRICE*       -7.640         1.0459           -7.3044   3.24E-10   -9.7255     -5.5545
ADVERT*      12.151         3.5562           3.4170    0.00105    5.0604      19.2420
ADVERT²*     -2.768         0.9406           -2.9427   0.00439    -4.6435     -0.8924
Note that the coefficients of 𝑃𝑅𝐼𝐢𝐸* = 𝑃𝑅𝐼𝐢𝐸 − 6, 𝐴𝐷𝑉𝐸𝑅𝑇* = 𝐴𝐷𝑉𝐸𝑅𝑇 − 1.9, and 𝐴𝐷𝑉𝐸𝑅𝑇²* = 𝐴𝐷𝑉𝐸𝑅𝑇² − 3.61, and their standard errors, are exactly equal to the coefficients and standard errors before
the adjustments to these variables. However the intercept coefficient and its standard error are different. In
fact the intercept value is the predicted 𝑆𝐴𝐿𝐸𝑆 for 𝑃𝑅𝐼𝐢𝐸 = 6, 𝐴𝐷𝑉𝐸𝑅𝑇 = 1.9, and 𝐴𝐷𝑉𝐸𝑅𝑇 2 = 3.61. More
importantly, now we have obtained se(𝑦̂0 ) as the standard error of the intercept directly from running this
regression.
To obtain var(𝑦0 ),
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
var(𝑦0 ) = (0.9177)2 + 21.5787 = 22.4208
Note that var(𝑒) does not change when the variables are adjusted as above.
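A sketch of the re-centering trick in Python is shown below. The dataframe and column names (SALES, PRICE, ADVERT, ADVERT2) are assumptions mirroring the burger2 worksheet; the intercept of the shifted regression delivers 𝑦̂0 and se(𝑦̂0) directly, as described above.

# A sketch of obtaining se(yhat0) as the intercept's standard error after re-centering the regressors.
import pandas as pd
import statsmodels.api as sm

def se_of_prediction(df: pd.DataFrame, x0: dict, y_col: str = "SALES"):
    X = pd.DataFrame({col: df[col] - val for col, val in x0.items()})   # x_ij - x_0j for each regressor
    res = sm.OLS(df[y_col], sm.add_constant(X)).fit()
    return res.params["const"], res.bse["const"]    # yhat0 and se(yhat0)

# y0_hat, se_y0 = se_of_prediction(burger2, {"PRICE": 6, "ADVERT": 1.9, "ADVERT2": 3.61})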