Multicollinearity

Multicollinearity: an introductory example
A high-tech business wants to measure the effect of advertising on sales and to distinguish between traditional advertising (TV and newspapers) and advertising on the internet.
– Y : sales in $m
– X1: traditional advertising in $m
– X2: internet advertising in $m
Data: Sales3.sav
A matrix scatter plot of the data shows strong pairwise relationships:
Cor(y, x1) = 0.983
Cor(y, x2) = 0.986
Cor(x1, x2) = 0.990
x1 and x2 are strongly correlated, i.e. they carry a substantial amount of common information:
x1 = α0 + α1x2 + ε
Regression output: using x1 only (equivalent results are obtained using x2 only)

Model summary: R = .983, R² = .965, Adjusted R² = .962, Std. Error of the Estimate = .9764

ANOVA
                SS        df    MS        F         Sig.
Regression      265.466    1    265.466   278.438   .000
Residual          9.534   10       .953
Total           275.000   11

Coefficients
                                      B      Std. Error      t      Sig.
(Constant)                           .885       .696       1.272    .232
X1 = traditional advertising in $m  2.254       .135      16.686    .000
Regression output: using x1 and x2 (x1 and x2 are no longer individually significant)

Model summary: R = .987, R² = .974, Adjusted R² = .968, Std. Error of the Estimate = .8916

ANOVA
                SS        df    MS        F         Sig.
Regression      267.846    2    133.923   168.483   .000
Residual          7.154    9       .795
Total           275.000   11

Coefficients
                                      B      Std. Error      t      Sig.
(Constant)                          1.992       .902       2.210    .054
X1 = traditional advertising in $m   .767       .868        .884    .400
X2 = internet advertising in $m     1.275       .737       1.730    .118
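This loss of individual significance is not specific to the Sales3.sav data. The sketch below (Python with numpy and statsmodels; the data are simulated, not the course data) reproduces the same pattern: a predictor that is clearly significant on its own loses significance once a nearly identical second predictor is added.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 12
    x1 = rng.uniform(1.0, 5.0, n)
    x2 = x1 + rng.normal(0.0, 0.1, n)      # x2 is almost a copy of x1, so Cor(x1, x2) is close to 1
    y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(0.0, 1.0, n)

    # x1 alone: a highly significant slope
    m1 = sm.OLS(y, sm.add_constant(x1)).fit()
    # x1 and x2 together: the overall F test stays strong, but the individual
    # t tests lose significance because the two predictors carry nearly the same information
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    print(m1.summary())
    print(m2.summary())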
Multicollinearity
Multicollinearity exists when two or more of the independent
variables are moderately or highly correlated with each other.
xi = α0 + α1xj + … + αpx(j+p) + ε,   with j + p < k and i ≠ j, j+1, …, j+p
In the extreme case, if there exists perfect correlation among
some of the independent variables, OLS estimates cannot be
computed.
In practice, if independent variables are (highly) correlated, they contribute largely redundant information, which makes it hard to isolate the effect of each individual independent variable on y. Confusion is often the result.
High levels of multicollinearity:
a) inflate the variance of the β estimates;
b) may make the regression results misleading and confusing.
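As a small numerical illustration of the extreme case (a hypothetical sketch, not from the slides): when one predictor is an exact linear function of another, the matrix X'X is singular and the OLS normal equations have no unique solution.

    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = 2.0 * x1                                 # x2 is an exact linear function of x1
    X = np.column_stack([np.ones_like(x1), x1, x2])

    XtX = X.T @ X
    print(np.linalg.matrix_rank(XtX))             # 2, although X'X is 3x3: the matrix is singular
    # np.linalg.solve(XtX, X.T @ y) would raise LinAlgError ("Singular matrix"),
    # so the OLS estimates cannot be computed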
Detecting Multicollinearity
The following are indicators of multicollinearity:
1. Significant correlations between pairs of independent variables in the model (a sufficient but not necessary condition).
2. Nonsignificant t-tests for all (or nearly all) the individual β parameters when the F test for model adequacy, H0: β1 = β2 = … = βk = 0, is significant.
3. Signs of the estimated parameters opposite to what is expected.
4. A variance inflation factor (VIF) for a β parameter greater than 10.
The VIFs can be calculated in SPSS by selecting
“Collinearity diagnostics” in the “Statistics” options in the
“Regression” dialog box.
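Outside SPSS, the VIFs are easy to compute, using the definition VIF_j = 1/(1 − R_j²), where R_j² is the R² from regressing the j-th independent variable on all the others. A minimal Python sketch (the DataFrame and column names are placeholders, not part of the course material):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def vif_table(predictors: pd.DataFrame) -> pd.Series:
        """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors."""
        X = sm.add_constant(predictors)
        return pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
            index=predictors.columns,
        )

    # usage (hypothetical): vif_table(df[["x1", "x2"]]) -- values above 10 suggest substantial collinearity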
A typical situation
Multicollinearity can arise when transforming variables, e.g. using both x1 and x1² in the regression equation when the range of values of x1 is limited.
[Scatter plot of X square against X for x = 1.0, 1.2, …, 4.0: over this limited range the relationship is almost perfectly linear; Cor(x, x²) = 0.987]
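The near-linearity of x² in x over this limited range can be checked directly; a quick numerical sketch (not from the slides):

    import numpy as np

    x = np.arange(1.0, 4.01, 0.2)            # limited range, as in the plot above
    print(np.corrcoef(x, x ** 2)[0, 1])      # close to 0.99 (the slide reports 0.987)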
Remember: if multicollinearity is present but not excessive (no high correlations, no VIFs above 10), you can ignore it. Each variable then provides enough independent information, and its contribution can be assessed.
If your main goal is prediction (using the available
explanatory variables to predict the response), then
you can safely ignore the multicollinearity.
If your main goal is explaining relationships, then multicollinearity may be a problem, because the measured effects can be misleading.
Some solutions to Multicollinearity
Get more data if you can.
Drop one or more of the correlated independent variables
from the final model. A screening procedure like Stepwise
regression may be helpful in determining which variable to
drop.
If you keep all the independent variables, be cautious in interpreting the parameter values and keep predictions within the range of your data.
Use Ridge regression (we do not cover this subject in the course).
Some solutions to Multicollinearity (continued)
If the multicollinearity is introduced by the use of higher-order terms (e.g. x and x², or x1, x2 and x1x2), use the independent variables as deviations from their means.
Example: suppose multicollinearity is present in
E(Y) = β0 + β1x + β2x²
1) Compute x* = x – Mean(x)
2) Run the regression E(Y) = β0 + β1x* + β2(x*)²
In most cases multicollinearity is greatly reduced. Clearly the parameters β of the new regression will have different values and meanings.
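A small numerical sketch of the effect of centring (simulated x over a limited range, as in the earlier plot; not the course data):

    import numpy as np

    x = np.arange(1.0, 4.01, 0.2)            # predictor over a limited range
    xs = x - x.mean()                        # step 1: deviations from the mean

    print(np.corrcoef(x, x ** 2)[0, 1])      # close to 1: x and x^2 are almost collinear
    print(np.corrcoef(xs, xs ** 2)[0, 1])    # essentially 0 after centring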
Example: Shipping costs – continued
A company conducted a study to investigate the relationship
between the cost of shipment and the variables that control
the shipping charge: weight and distance.
– Y : cost of shipment in dollars
– X1: package weight in pounds
– X2: distance shipped in miles
It is suspected that nonlinear effects may be present, so let us analyze the model
Model 1: E(Y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
Data: Express.sav
Matrix scatter plot
A matrix scatter plot shows at once the bivariate scatter plots for the selected variables. Use it as a preliminary screening.
The matrix is symmetric, so it is enough to look at the lower triangle.
In SPSS choose the “Matrix” option from the “Scatter/Dot” Graph and input the variables of interest.
Note the obvious quadratic relation for some of the variables, very close to linearity.
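A similar matrix scatter plot can also be produced outside SPSS, e.g. with pandas; a minimal sketch with made-up numbers standing in for the Express.sav variables:

    import pandas as pd
    import matplotlib.pyplot as plt

    # tiny made-up data frame, only to make the sketch runnable;
    # in practice df would hold the Express.sav variables of interest
    df = pd.DataFrame({
        "weight": [1.0, 2.0, 3.0, 4.0, 5.0],
        "distance": [100.0, 250.0, 80.0, 300.0, 150.0],
        "cost": [3.0, 7.0, 5.0, 12.0, 8.0],
    })
    pd.plotting.scatter_matrix(df, figsize=(6, 6))
    plt.show()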
Correlation matrix

Correlations (Pearson, n = 20 for every pair; ** = correlation significant at the 0.01 level, 2-tailed)

                     Weight     Distance   Cost of    Weight     Distance   Weight*
                     of parcel  shipped    shipment   squared    squared    Distance
Weight of parcel     1          .182       .774**     .967**     .151       .820**
  Sig. (2-tailed)               .444       .000       .000       .524       .000
Distance shipped     .182       1          .695**     .202       .980**     .633**
  Sig. (2-tailed)    .444                  .001       .393       .000       .003
Cost of shipment     .774**     .695**     1          .799**     .652**     .989**
  Sig. (2-tailed)    .000       .001                  .000       .002       .000
Weight squared       .967**     .202       .799**     1          .160       .821**
  Sig. (2-tailed)    .000       .393       .000                  .500       .000
Distance squared     .151       .980**     .652**     .160       1          .590**
  Sig. (2-tailed)    .524       .000       .002       .500                  .006
Weight*Distance      .820**     .633**     .989**     .821**     .590**     1
  Sig. (2-tailed)    .000       .003       .000       .000       .006

All the candidate predictors are individually strongly related to Y (see the Cost of shipment column), while several of them are also strongly correlated with each other (Weight with Weight squared, Distance with Distance squared, and both with the Weight*Distance interaction).
Model 1: VIF statistics
The VIFs can be calculated in SPSS by selecting “Collinearity
diagnostics” in the “Statistics” options in the “Regression” dialog box.
A VIF statistic larger than 10 is usually considered an indicator of substantial collinearity.

Coefficients
                            B          Std. Error      t       Sig.     VIF
(Constant)                  .827         .702         1.178    .259
Weight of parcel in lbs    -.609         .180        -3.386    .004    20.031
Distance shipped            .004         .008          .503    .623    35.526
Weight squared              .090         .020         4.442    .001    17.027
Distance squared           1.507E-5      .000          .672    .513    28.921
Weight*Distance             .007         .001        11.495    .000    12.618
Model 2: using the independent variables as deviations from their means

Coefficients
              B          Std. Error      t       Sig.     VIF
(Constant)   5.467         .216        25.252    .000
X1star       1.263         .042        30.128    .000    1.087
X2star        .038         .001        27.563    .000    1.081
X1x2star      .007         .001        11.495    .000    1.095
X1star2       .090         .020         4.442    .001    1.113
x2star2      1.507E-5      .000          .672    .513    1.120

x2star2 appears to be irrelevant (Sig. = .513) and can be dropped.
Note: the problems of multicollinearity have disappeared (all VIFs are close to 1).
Note: the adjusted R², the ANOVA table and the predictions are the same for the two models (check).
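That last note can be verified on any data set: centring the predictors only reparameterises the model, so the fitted values and R² do not change. A small sketch on simulated data (not Express.sav):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(1.0, 4.0, 20)
    y = 2.0 + 1.5 * x + 0.8 * x ** 2 + rng.normal(0.0, 0.5, 20)

    xs = x - x.mean()                                              # centred predictor

    raw = sm.OLS(y, sm.add_constant(np.column_stack([x, x ** 2]))).fit()
    cen = sm.OLS(y, sm.add_constant(np.column_stack([xs, xs ** 2]))).fit()

    print(np.allclose(raw.fittedvalues, cen.fittedvalues))         # True: identical predictions
    print(round(raw.rsquared, 6), round(cen.rsquared, 6))          # identical R-squared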