Correlated Explanatory Variables

If there are very many variables, it is likely that they will be highly correlated. This implies that
some variables or sets of variables are measuring similar things. If included in a prediction
formula, there will be considerable redundancy, with many variables merely contributing to the
chance variation in the formula. The task of selecting a reasonable subset of variables is not
easy. While there are a number of more or less automatic techniques available, they are
notoriously unstable and susceptible to influence by exceptional cases.
In addition, causal interpretations of prediction formulas in terms of the variables remaining after
a selection procedure are unreliable. For purely prediction purposes, this is not a serious
problem. However, attempts to use such causal interpretations are likely to mislead.
Even with small numbers of explanatory variables, correlations among them can lead to
problems, particularly with causal interpretation. Consider the Jobtime case study. Following
deletion of exceptional cases, the following regression was calculated.
Regression Analysis: Jobtime versus Units, Ops per Unit, T-Ops, Rushed?

17 cases used, 3 cases contain missing values

Predictor       Coef       SE Coef     T       P
Constant        44.216     9.080       4.87    0.000
Units           -0.06931   0.02853     -2.43   0.032
Ops per Unit    9.8286     0.8873      11.08   0.000
T-Ops           0.107795   0.004114    26.20   0.000
Rushed?         -37.960    3.857       -9.84   0.000

S = 7.41272
In earlier discussion of this case study, we decided that this gave a useful prediction
formula, with a prediction error of approximately ± 2 days. This is helpful to
customers, who want to know reasonably precisely when they can expect a delivery, so that
they can arrange their own production schedules accordingly.
Looking at the coefficient estimates, however, there is a problem. The estimated Units
coefficient is negative, which seems to imply that jobs with more units take less time to
complete. Clearly, such an implication is nonsense. So, what is going on?
The explanation may be found in the correlation matrix,
Correlations: Jobtime, Units, Ops per Unit, T-Ops, Rushed?

              Jobtime    Units    Ops per Unit    T-Ops
Units          0.787
Ops per Unit  -0.039    -0.524
T-Ops          0.961     0.909   -0.270
Rushed?       -0.342    -0.193    0.197          -0.243
Before examining the correlations, note that the diagonal of this matrix is missing; the correlation
of each variable with itself is 1. Also note that the upper triangle elements are missing; corr(X,Y)
= corr(Y,X), so the matrix is symmetric and the lower triangle is sufficient.
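For readers working in other software, a matrix like this is easy to reproduce; a minimal sketch in Python follows. The file name jobtime.csv and the exact column names are assumptions about how the data are stored, so adjust them to match the actual data set.

```python
# Minimal sketch: reproduce the correlation matrix for the Jobtime data.
# "jobtime.csv" and the column names are assumptions about how the data are stored.
import numpy as np
import pandas as pd

jobs = pd.read_csv("jobtime.csv")
cols = ["Jobtime", "Units", "Ops per Unit", "T-Ops", "Rushed?"]

corr = jobs[cols].corr()      # full, symmetric matrix: diagonal = 1, corr(X, Y) = corr(Y, X)
print(corr)

# Only the strictly lower triangle needs to be reported, as in the output above.
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
print(lower)
```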
There is a case to be made for rounding to two decimal places.
Correlations: Jobtime, Units, Ops per Unit, T-Ops, Rushed?

              Jobtime    Units    Ops per Unit    T-Ops
Units          0.79
Ops per Unit  -0.04     -0.52
T-Ops          0.96      0.91    -0.27
Rushed?       -0.34     -0.19     0.20           -0.24
The first column gives the correlations of the response, Jobtime, with the explanatory variables.
(This corresponds to the first row of the scatterplot matrix, which shows scatter plots of the
response against the explanatory variables.) Note that the strongest correlation is with T-Ops, with
a value of 0.96, very close to 1. This simply reflects the fact that the simple linear regression of
Jobtime on T-Ops is very highly significant, t-value = 26.2: simple regression coefficients and
correlation coefficients are related (see SA, p. 208), so the t test of the hypothesis β = 0 is also a
test of the hypothesis ρ = 0. Units shows a correlation of 0.79, also relatively high. The
corresponding t-value is 4.94, also highly significant.
Exercise 1:
Calculate the simple linear regressions of Jobtime on each of T-Ops and Units. Confirm
the corresponding t-values.
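A possible approach to Exercise 1 is sketched below in Python using the statsmodels package. The file name, and the renaming of the columns to formula-friendly identifiers, are assumptions about how the data are stored rather than part of the case study itself.

```python
# Sketch for Exercise 1: simple linear regressions of Jobtime on T-Ops and on Units.
# "jobtime.csv" and the column names are assumptions; rename to match the real data.
import pandas as pd
import statsmodels.formula.api as smf

jobs = pd.read_csv("jobtime.csv").rename(columns={
    "Ops per Unit": "OpsPerUnit", "T-Ops": "T_Ops", "Rushed?": "Rushed"})

for x in ["T_Ops", "Units"]:
    fit = smf.ols(f"Jobtime ~ {x}", data=jobs).fit()   # simple linear regression
    print(fit.summary().tables[1])                     # coefficient, SE, t-value, p-value
```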
Note, however, the negative correlations between Jobtime and each of Ops per Unit and
Rushed. These seem to suggest that Jobtime decreases with Ops per Unit and is lower for
Rushed jobs. While the latter makes sense, the former appears not to.
Exercise 2:
Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the
negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.
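Because the t test of β = 0 in a simple linear regression is equivalent to the test of ρ = 0 (SA, p. 208), the t-value in Exercise 2 can also be anticipated directly from the correlation. A small sketch of that check follows; the use of n = 17 complete cases is an assumption about which cases enter the simple regression.

```python
# Relation between a correlation r and the simple-regression t-value:
#     t = r * sqrt(n - 2) / sqrt(1 - r^2)
import math

def t_from_r(r: float, n: int) -> float:
    """t-statistic for testing rho = 0, given sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(t_from_r(-0.04, 17))   # roughly -0.15: no evidence of a relationship with Ops per Unit
print(t_from_r(0.787, 17))   # roughly 4.94: matches the Units t-value quoted above
```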
Correlations among explanatory variables
The individual correlations of the explanatory variables with the response do not shed much light
on the unexpected pattern of regression coefficients in the multiple regression. Some clues
begin to emerge when we inspect the correlations of the explanatory variables among
themselves, as shown in the remaining columns of the correlation matrix (corresponding to the
remaining rows of the scatterplot matrix).
Note that the highest correlation in this subset of the correlation matrix, with a value of 0.91, is that
between Units and T-Ops, the two explanatory variables most highly correlated with the
response. The fact that these two explanatory variables are so highly correlated may be
interpreted as saying that they measure similar aspects of the jobs involved, or that they convey
closely related information concerning the jobs. Thus, including both of these variables in the
multiple regression involves some degree of redundancy. In choosing values for the regression
coefficients, a balance must be struck over how much information regarding the process to
take from each variable. In some instances, this balancing act entails offsetting a
contribution from one variable with a negative contribution from another, hence the possibility of
a negative coefficient for a variable that is logically positively related to the response.
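This sign-reversal mechanism can be demonstrated with simulated data. The sketch below constructs an artificial response that, like Jobtime, is driven mainly by total operations; the variable playing the role of Units is strongly positively correlated with the response on its own, yet receives a negative coefficient once its highly correlated companion is included. The data are purely illustrative and are not the Jobtime data.

```python
# Illustration (simulated, not the Jobtime data): a predictor with a strong positive
# marginal correlation can take a negative coefficient in a multiple regression once
# a highly correlated companion variable is included.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
units = rng.uniform(50, 500, n)                 # job size
t_ops = units * rng.uniform(4.5, 5.5, n)        # total operations: roughly 5 per unit,
                                                # so corr(units, t_ops) is very high
y = 0.10 * t_ops - 0.05 * units + rng.normal(0, 5, n)

print(np.corrcoef(y, units)[0, 1])              # strongly positive marginal correlation

X = sm.add_constant(np.column_stack([units, t_ops]))
print(sm.OLS(y, X).fit().params)                # the coefficient on units comes out negative
```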
Multiple correlations among explanatory variables
With many explanatory variables, there is the possibility of more complex relationships involving
several variables. To allow for this, it is necessary to calculate multiple correlations among the
variables. Ultimately, this reduces to calculating a multiple correlation coefficient of
each explanatory variable with all of the others. In practice, this may be done by calculating a
value of R² for each of the explanatory variables regressed on all of the others. For example,
the value of R² for Units, given the other explanatory variables, may be read from the results of
the regression of Units on the others,
Regression Analysis: Units versus Ops per Unit, T-Ops, Rushed?

Predictor       Coef       SE Coef    T       P
Constant        177.62     73.26      2.42    0.031
Ops per Unit    -22.147    6.058      -3.66   0.003
T-Ops           0.13534    0.01382    9.80    0.000
Rushed?         31.54      36.47      0.86    0.403

S = 72.0706   R-Sq = 91.4%   R-Sq(adj) = 89.5%

The value of R² in this case is 91.4%.
Repeating this for the other variables leads to:
    R² (Units)         = 91%
    R² (Ops per Unit)  = 55%
    R² (T-Ops)         = 89%
    R² (Rushed?)       = 13%
Exercise 3:
Confirm the calculation of the above R² values.
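A possible sketch for Exercise 3, again in Python with statsmodels, regresses each explanatory variable on the other three and reports the resulting R² values. The file name and the formula-friendly column names are assumptions, as before.

```python
# Sketch for Exercise 3: R-squared of each explanatory variable regressed on the others.
# "jobtime.csv" and the renamed columns are assumptions about how the data are stored.
import pandas as pd
import statsmodels.formula.api as smf

jobs = pd.read_csv("jobtime.csv").rename(columns={
    "Ops per Unit": "OpsPerUnit", "T-Ops": "T_Ops", "Rushed?": "Rushed"})

predictors = ["Units", "OpsPerUnit", "T_Ops", "Rushed"]
for x in predictors:
    others = " + ".join(p for p in predictors if p != x)      # all the other predictors
    r2 = smf.ols(f"{x} ~ {others}", data=jobs).fit().rsquared
    print(f"R-squared for {x} given the others: {100 * r2:.1f}%")
```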
This set of R² values may be taken as measuring the full extent of correlation between the
explanatory variables. A popular convention indicates that such correlation is problematic if
the largest such R² value exceeds 90%. Here, R² (Units) does, and R² (T-Ops) is very close.
Variance inflation factors
A somewhat more mathematical expression of this problem may be seen by expressing these
R² values in a different way and relating them to the standard errors of the regression
coefficients in the prediction formula for the response. It turns out that the more correlation
there is between the explanatory variables, the greater these standard errors are, and so the
less precisely the regression coefficients are estimated from the data. This phenomenon is
described as variance inflation (really, standard error inflation). This parallels the difficulty,
discussed above, of interpreting the regression coefficients in the presence of correlations among
the explanatory variables; not only does interpretation of individual regression coefficients become
difficult, but estimating the values of those regression coefficients also becomes difficult.
To see how variance inflation arises, consider the formula for the standard error of the
regression coefficients in the regression of the response, Y, on explanatory variables X1, . . . , Xp-1:

    SE(β̂k) ≈ (σ / (sk√n)) × 1/√(1 − Rk²),    1 ≤ k ≤ p−1.

Here, σ is the error standard deviation, sk is the standard deviation of variable Xk and Rk² is the
R² value from the regression of Xk on the other explanatory variables, expressed as a proportion
(and not as a percentage, as in standard regression output).
Note that, if Rk² = 0, that is, if Xk is not correlated with the other X variables, then

    SE(β̂k) ≈ σ / (sk√n),    1 ≤ k ≤ p−1.

As Rk² deviates from 0, that is, as Xk becomes more correlated with the other X variables, then

    Rk² gets closer to 1,
    1 − Rk² gets closer to 0,
    1/√(1 − Rk²) increases towards infinity,

and so SE(β̂k) increases towards infinity.

The factor 1/√(1 − Rk²) may be referred to as a standard error inflation factor. Traditionally,

    VIFk = 1/(1 − Rk²)

is referred to as a variance inflation factor.
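For example, taking R² (Units) = 91.4% from the regression of Units on the other explanatory variables above gives VIF(Units) = 1/(1 − 0.914) ≈ 11.6, so the standard error of the Units coefficient is inflated by a factor of about √11.6 ≈ 3.4 compared with what it would be if Units were uncorrelated with the other explanatory variables.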
Exercise 4:
Calculate the variance inflation factors for the Jobtime case study.
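One way to carry out Exercise 4 in Python is sketched below, using the variance_inflation_factor helper from statsmodels. The file name and the restriction to complete cases are assumptions about how the data are stored and which cases should be used.

```python
# Sketch for Exercise 4: variance inflation factors for the Jobtime explanatory variables.
# VIF_k = 1 / (1 - R_k^2), where R_k^2 comes from regressing X_k on the other X variables.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

jobs = pd.read_csv("jobtime.csv")                                  # hypothetical file name
X = jobs[["Units", "Ops per Unit", "T-Ops", "Rushed?"]].dropna()   # complete cases (an assumption)
X = sm.add_constant(X)                                             # include an intercept, as in the
                                                                   # regressions shown above

for i, name in enumerate(X.columns):
    if name == "const":
        continue                                                   # no VIF for the intercept
    print(name, round(variance_inflation_factor(X.values, i), 1))
```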
Traditionally, correlated X variables are regarded as problematic if the largest VIF exceeds 10,
corresponding to the largest R² value exceeding 90%.
Multicollinearity
Mathematicians refer to the problem of highly correlated variables as the multicollinearity
problem. The term derives from the special case in which one (or more) of the explanatory variables is
perfectly correlated with the other explanatory variables. This may be visualised easily in the
case of two explanatory variables, say X1 and X2. In that case, X1 and X2 are perfectly linearly
related, with no deviations from a straight line; recall SA, Figure 6.12, p. 211. X1 and X2 are said
to be collinear. It follows that the correlation of Y with each of them is the same, the simple
linear regressions of Y on each of them are equivalent and adding the other to the simple linear
regression of Y on one of them will make no difference.
When there are several explanatory variables, the linear relationships may be more complicated, for
example, a linear combination of one subset may be perfectly related to a linear combination of
another subset. In such circumstances, the term multicollinearity is used. Perfectly collinear
data may cause computational problems for regression software; it may refuse to fit a model.
Software may be programmed to detect multicollinearity in special cases. For example, Data
Desk will detect the presence of a superfluous indicator variable and fit a 0 coefficient.
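The computational side of perfect collinearity can be seen directly from the normal equations: if one explanatory variable is an exact linear function of another, the matrix X′X is singular, so the least squares coefficients are not uniquely determined. A small numerical sketch:

```python
# Illustration of perfect collinearity: x2 is an exact linear function of x1,
# so the matrix X'X in the normal equations is singular (rank deficient).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = 3.0 * x1 - 1.0                               # exactly collinear with x1 (and the constant)
X = np.column_stack([np.ones(20), x1, x2])        # design matrix with an intercept column

xtx = X.T @ X
print(np.linalg.matrix_rank(xtx))                 # 2 rather than 3: no unique solution
print(np.linalg.cond(xtx))                        # enormous condition number
```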
In practice, perfect multicollinearity rarely arises in observational studies but near perfect
multicollinearity can cause computational problems and will cause interpretational problems.