DATA PROBLEMS: MULTICOLLINEARITY AND INADEQUATE VARIATION
INTRODUCTION
Two problems that often arise when using observational data from uncontrolled experiments are:
1. Multicollinearity among explanatory variables
2. Inadequate variation in explanatory variables
MULTICOLLINEARITY
Multicollinearity exists when two or more explanatory variables are perfectly or highly
correlated. We can distinguish between two types of multicollinearity: 1) perfect
multicollinearity and 2) near multicollinearity.
Perfect Multicollinearity
Perfect multicollinearity exists when two or more explanatory variables are perfectly correlated.
Perfect multicollinearity does not occur often, and usually results from the way in which
variables are constructed, for example when one variable is defined as an exact linear function
of another. If we have perfect multicollinearity, then we cannot obtain estimates of the
parameters.
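As a minimal sketch of why estimation fails (the variables and numbers below are illustrative assumptions, not from the text), suppose one regressor is constructed as an exact linear function of another; the X'X matrix is then singular and the OLS normal equations have no unique solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = 2.0 * x2 + 1.0                        # x3 is an exact linear function of x2
X = np.column_stack([np.ones(n), x2, x3])  # constant, x2, x3

# The third column is a linear combination of the first two, so X has
# deficient rank and X'X cannot be inverted to solve (X'X)b = X'y.
print(np.linalg.matrix_rank(X))            # 2 rather than 3
print(np.linalg.cond(X.T @ X))             # astronomically large: singular
```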
Near Multicollinearity
Near multicollinearity exists when two or more explanatory variables are highly correlated. This
is the more common multicollinearity problem.
CONSEQUENCES OF MULTICOLLINEARITY
What are the consequences of a high correlation between two or more independent variables?
The OLS Estimator is Still BLUE
The OLS estimator is still unbiased. Also, in the class of linear unbiased estimators the OLS
estimator still has minimum variance. Thus, we cannot find an alternative estimator that is better
than the OLS estimator. However, even though the OLS estimator is the “best” estimator, best
may not be very good.
The Fit of the Sample Regression Equation is Unaffected
The “overall fit” of the sample regression equation, as measured by the R2 statistic, is not affected
by the presence of multicollinearity. Thus, if the sole objective of our empirical study is
prediction or forecasting, then multicollinearity does not matter.
The Variances and Standard Errors of the Parameter Estimates Will Increase
The worst consequence of multicollinearity is that it increases the variances and standard errors of
the OLS estimates. High variances mean that the estimates are imprecise, and therefore not very
reliable. High variances and standard errors imply low t-statistics. Thus, multicollinearity
increases the probability of making a Type II error, failing to reject the null hypothesis when it is
false, and therefore concluding that X does not affect Y when in reality it does. That is,
multicollinearity makes it difficult to detect an effect even when one exists.
Variance Formula
For the MCLRM Yt = β1 + β2Xt2 + β3Xt3 + β4Xt4 + εt, the formula for the variance of the OLS
estimates of the slope coefficients is

$$\operatorname{Var}(\hat{\beta}_i) \;=\; \frac{\sigma^2}{\sum_t \left(X_{ti} - \bar{X}_i\right)^2 \left(1 - R_i^2\right)} \qquad \text{for } i = 2, 3, 4,$$

where Ri2 is the R2 statistic from the regression of Xi on all other X’s in the regression equation.
This formula clearly shows that:
1. The variance of the estimate β̂i is higher (lower) the higher (lower) the correlation between Xi
and the other explanatory variables.
2. The variance of the estimate β̂i is lower (higher) the higher (lower) the variation in Xi.
3. The variance of the estimate β̂i is higher (lower) the higher (lower) the error variance.
4. The variance of the estimate β̂i increases at an increasing rate as Ri2 increases (as the
correlation between Xi and the other explanatory variables increases). When Ri2 = 1, the
variance becomes infinite.
This formula shows that multicollinearity has the same effect on the precision of the OLS
estimator as lack of variation in the explanatory variable. Both of these can be viewed as data
problems.
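A small simulation sketch can make points 1 and 4 concrete (the data-generating choices below are illustrative assumptions). It computes Var(β̂2) exactly from σ2(X'X)−1 and checks it against the formula above as the correlation between X2 and X3 rises:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 200, 1.0
x2 = rng.normal(size=n)
z = rng.normal(size=n)

for rho in [0.0, 0.5, 0.9, 0.99]:
    # Construct x3 so that corr(x2, x3) is approximately rho.
    x3 = rho * x2 + np.sqrt(1.0 - rho**2) * z
    X = np.column_stack([np.ones(n), x2, x3])
    # Exact OLS variance: sigma^2 * (X'X)^{-1}; take the diagonal entry for beta2.
    var_b2 = sigma2 * np.linalg.inv(X.T @ X)[1, 1]
    # With a single other regressor, Ri2 is the squared sample correlation.
    ri2 = np.corrcoef(x2, x3)[0, 1] ** 2
    # The variance formula from the text, evaluated on the same data.
    formula = sigma2 / (np.sum((x2 - x2.mean()) ** 2) * (1.0 - ri2))
    print(f"rho={rho:.2f}  Ri2={ri2:.3f}  Var(b2)={var_b2:.6f}  formula={formula:.6f}")
```

The two columns agree, and the variance climbs steeply as Ri2 approaches 1.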
DETECTING MULTICOLLINEARITY
Observational data that comes from uncontrolled social experiments will almost always have
multicollinearity; that is, the explanatory variables will be correlated. How do we know when
multicollinearity is severe enough to be a problem?
Two Ways to Detect Multicollinearity
To detect severe multicollinearity, two approaches are often used: 1) symptoms and
2) diagnostic procedures.
Symptoms
One way to detect severe multicollinearity is to look for symptoms of severe multicollinearity.
Three common symptoms of multicollinearity are the following.
High R2 and Low t-Statistics
As we have seen, multicollinearity does not affect the R2 statistic; it only affects the estimated
standard errors and hence the t-statistics. A possible symptom of severe multicollinearity is
therefore an estimated equation with a relatively high R2 statistic in which most or all of the
individual coefficients are insignificant, i.e., have t-statistics less than 2.
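The sketch below (simulated data; statsmodels assumed available) manufactures this symptom deliberately: both regressors genuinely matter, yet their near-collinearity leaves each individually insignificant while the overall fit, and the joint F-statistic, stay high:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 40
x2 = rng.normal(size=n)
x3 = x2 + 0.05 * rng.normal(size=n)        # x2 and x3 almost collinear
y = 1.0 + 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
print(round(res.rsquared, 3))              # high overall fit
print(np.round(res.tvalues[1:], 2))        # slope t-statistics, typically below 2 here
print(round(res.fvalue, 1))                # yet the regression is jointly significant
```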
Wrong Signs for Estimated Coefficients
A second possible symptom of severe multicollinearity is incorrect signs for theoretically
important variables, or theoretically important variables that are statistically insignificant.
Estimated Coefficients Sensitive to Changes in Specification
A third possible symptom of severe multicollinearity is when you add or delete an independent
variable, or add or delete an observation or two, and the estimates of the coefficients change
dramatically.
Diagnostic Procedures
Two diagnostic procedures often used to detect severe multicollinearity are as follows.
Correlation Coefficients
The simplest diagnostic procedure is to calculate the sample correlation coefficients between all
pairs of independent variables in the sample. High correlation coefficients between pairs of
explanatory variables indicate that these variables are highly correlated, and therefore you may
have severe multicollinearity.
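A minimal sketch of this check, using numpy on illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.30 * rng.normal(size=n)   # deliberately highly correlated with x2
x4 = rng.normal(size=n)                       # roughly independent of both

# Pairwise sample correlations among all explanatory variables.
corr = np.corrcoef(np.column_stack([x2, x3, x4]), rowvar=False)
print(np.round(corr, 2))   # large off-diagonal entries flag possible trouble
```

One caveat: pairwise correlations can miss near-collinearity that involves three or more variables jointly, which is one motivation for the variance inflation factors described next.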
Variance Inflation Factors
You can calculate a variance inflation factor for each estimated slope coefficient. To calculate
the variance inflation factor for the estimate β̂i attached to explanatory variable Xi, do the
following.
Step #1: Run a regression of the explanatory variable Xi on all remaining explanatory variables
in the equation.
Step #2: Find the Ri2 statistic for the regression.
Step #3: Calculate the variance inflation factor as follows: VIF(β̂i) = 1 / (1 - Ri2).
This is an estimate of how much multicollinearity has increased the estimated variance of the
estimate β̂i. Note that (1 - Ri2) is in the denominator of the formula for the variance of β̂i. Many
researchers use the following rule of thumb: a variance inflation factor greater than 5 to 10
indicates severe multicollinearity. However, this cutoff is arbitrary.
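A sketch of the three steps, assuming statsmodels is available (statsmodels also provides a similar helper, variance_inflation_factor, in statsmodels.stats.outliers_influence):

```python
import numpy as np
import statsmodels.api as sm

def vif(X, i):
    """Variance inflation factor for column i of the regressor matrix X
    (X holds the explanatory variables only, without the constant)."""
    others = np.delete(X, i, axis=1)
    # Step #1: regress X_i on all remaining explanatory variables.
    aux = sm.OLS(X[:, i], sm.add_constant(others)).fit()
    # Step #2: find the Ri2 statistic for that auxiliary regression.
    ri2 = aux.rsquared
    # Step #3: VIF = 1 / (1 - Ri2).
    return 1.0 / (1.0 - ri2)

rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.30 * rng.normal(size=n)   # nearly collinear with x2
x4 = rng.normal(size=n)
X = np.column_stack([x2, x3, x4])

for i in range(X.shape[1]):
    print(f"VIF for column {i}: {vif(X, i):.2f}")   # values above 5-10 suggest trouble
```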
REMEDIES FOR MULTICOLLINEARITY
If severe multicollinearity exists, what should we do? No single best remedy exists for the
problem of multicollinearity. What we do is largely a matter of judgement. However, the
following remedies have been proposed.
Do Nothing
If we detect severe multicollinearity, we should try to correct it only if it is a problem. When is it
a problem? When we get insignificant estimates, or estimates with the wrong signs, for
theoretically important variables. The following rules of thumb have been proposed for when to
do nothing:
1. When the t-statistics are all greater than 2.
2. When the t-statistics for all theoretically important variables are greater than 2.
3. When the R2 from the regression exceeds the Ri2 from the regression of each independent
variable on the remaining independent variables (a check sketched below).
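Rule 3 (often attributed to Klein) can be checked directly. A sketch on the same kind of illustrative simulated data as above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + 0.4 * rng.normal(size=n)
y = 1.0 + x2 + x3 + rng.normal(size=n)
X = np.column_stack([x2, x3])

overall_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    ri2 = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
    # Rule 3: doing nothing is defensible if the overall R2 exceeds every Ri2.
    print(f"overall R2={overall_r2:.3f}  Ri2={ri2:.3f}  passes={overall_r2 > ri2}")
```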
Obtain More Data
Since multicollinearity is a data problem, we can try to correct it by obtaining more data.
However, in economics it is usually difficult to get more data.
Drop One or More Variables
A popular way to correct multicollinearity is to drop one or more of the highly correlated
explanatory variables. The problem is that if the true coefficient of the dropped variable is
nonzero in the population, dropping it causes omitted variable bias: the estimates of the
coefficients of the remaining variables that are correlated with the dropped variable will be
biased.
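A simulation sketch of this trade-off (the data-generating values are illustrative assumptions): when a relevant regressor correlated with X2 is dropped, the estimate of β2 absorbs part of the dropped variable's effect:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 5000
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)             # correlated with x2
y = 1.0 + 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)   # both variables truly matter

full = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
short = sm.OLS(y, sm.add_constant(x2)).fit()         # x3 dropped

print(round(full.params[1], 2))    # close to the true beta2 = 1.0
print(round(short.params[1], 2))   # close to 1.8 = beta2 + 0.8 * beta3: biased upward
```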