DATA PROBLEMS: MULTICOLLINEARITY AND INADEQUATE VARIATION

INTRODUCTION

Two problems that often arise when using observational data from uncontrolled experiments are:

1. Multicollinearity among explanatory variables.
2. Inadequate variation in explanatory variables.

MULTICOLLINEARITY

Multicollinearity exists when two or more explanatory variables are perfectly or highly correlated. We can distinguish between two types of multicollinearity:

1) Perfect multicollinearity.
2) Near multicollinearity.

Perfect Multicollinearity

Perfect multicollinearity exists when two or more explanatory variables are perfectly correlated. It does not occur often, and usually results from the way in which variables are constructed. If we have perfect multicollinearity, then we cannot obtain estimates of the parameters.

Near Multicollinearity

Near multicollinearity exists when two or more explanatory variables are highly, but not perfectly, correlated. This is the more common multicollinearity problem.

CONSEQUENCES OF MULTICOLLINEARITY

What are the consequences of a high correlation between two or more independent variables?

The OLS Estimator is Still BLUE

The OLS estimator is still unbiased. Also, in the class of linear unbiased estimators, the OLS estimator still has minimum variance. Thus, we cannot find an alternative estimator that is better than the OLS estimator. However, even though the OLS estimator is the "best" estimator, best may not be very good.

The Fit of the Sample Regression Equation is Unaffected

The "overall fit" of the sample regression equation, as measured by the R2 statistic, is not affected by the presence of multicollinearity. Thus, if the sole objective of our empirical study is prediction or forecasting, then multicollinearity does not matter.

The Variances and Standard Errors of the Parameter Estimates Will Increase

The worst consequence of multicollinearity is that it increases the variances and standard errors of the OLS estimates. High variances mean that the estimates are imprecise, and therefore not very reliable. High variances and standard errors imply low t-statistics. Thus, multicollinearity increases the probability of making a Type II error: accepting the null hypothesis when it is false, and therefore concluding that X does not affect Y when in reality it does. That is, multicollinearity makes it difficult to detect an effect if one exists.

Variance Formula

For the MCLRM Yt = β1 + β2Xt2 + β3Xt3 + β4Xt4 + εt, the formula for the variance of the OLS estimates of the slope coefficients is

    Var(βi^) = σ2 / [Σt (Xti – Xibar)^2 (1 – Ri2)]    for i = 2, 3, 4

where Ri2 is the R2 statistic from the regression of Xi on all other X's in the regression equation. This formula clearly shows that:

1. The variance of the estimate βi^ is higher (lower) the higher (lower) the correlation between Xi and the other explanatory variables.
2. The variance of the estimate βi^ is lower (higher) the higher (lower) the variation in Xi.
3. The variance of the estimate βi^ is higher (lower) the higher (lower) the error variance.
4. The variance of the estimate βi^ increases at an increasing rate as Ri2 increases, that is, as the correlation between Xi and the other explanatory variables increases. When Ri2 = 1, the variance becomes infinite.

This formula shows that multicollinearity has the same effect on the precision of the OLS estimator as lack of variation in the explanatory variable. Both can be viewed as data problems.
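As a quick numerical check of this formula, the sketch below (a minimal illustration; the sample size, coefficient values, and variable names are made up) simulates a data set with two highly correlated regressors and verifies that the variance computed from the formula equals the usual OLS variance taken from the diagonal of s2 (X'X)^-1, with the estimated error variance s2 used in place of σ2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two correlated regressors: X3 is X2 plus a little noise, so their
# sample correlation is high but not perfect (near multicollinearity).
x2 = rng.normal(size=n)
x3 = x2 + 0.3 * rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

# OLS fit and the usual estimated variance of the estimate on X2,
# taken from the diagonal of s^2 (X'X)^-1.
X = np.column_stack([np.ones(n), x2, x3])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])            # estimated error variance
var_ols = s2 * np.linalg.inv(X.T @ X)[1, 1]

# The same variance from the formula in the text: regress X2 on the other
# regressor(s) to get Ri2, then divide s^2 by [the sum of squared deviations
# of X2 about its mean] * (1 - Ri2).
Z = np.column_stack([np.ones(n), x3])
g, *_ = np.linalg.lstsq(Z, x2, rcond=None)
r2_aux = 1 - np.sum((x2 - Z @ g) ** 2) / np.sum((x2 - x2.mean()) ** 2)
var_formula = s2 / (np.sum((x2 - x2.mean()) ** 2) * (1 - r2_aux))

print(var_ols, var_formula)   # the two numbers agree
```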
DETECTING MULTICOLLINEARITY

Observational data that come from uncontrolled social experiments will almost always exhibit multicollinearity; that is, the explanatory variables will be correlated. How do we know when multicollinearity is severe enough to be a problem?

Two Ways to Detect Multicollinearity

To detect severe multicollinearity, two approaches are often used:

1) Symptoms.
2) Diagnostic procedures.

Symptoms

One way to detect severe multicollinearity is to look for its symptoms. Three common symptoms are the following.

High R2 and Low t-Statistics

As we have seen, multicollinearity does not affect the R2 statistic; it only affects the estimated standard errors and hence the t-statistics. A possible symptom of severe multicollinearity is estimating an equation and getting a relatively high R2 statistic, but finding that most or all of the individual coefficients are insignificant, i.e., t-statistics less than 2.

Wrong Signs for Estimated Coefficients

A second possible symptom of severe multicollinearity is incorrect signs for theoretically important variables, or theoretically important variables that are statistically insignificant.

Estimated Coefficients Sensitive to Changes in Specification

A third possible symptom of severe multicollinearity is that when you add or delete an independent variable, or add or delete an observation or two, the estimates of the coefficients change dramatically.

Diagnostic Procedures

Two diagnostic procedures often used to detect severe multicollinearity are as follows.

Correlation Coefficients

The simplest diagnostic procedure is to calculate the sample correlation coefficients between all pairs of independent variables in the sample. High correlation coefficients between pairs of explanatory variables indicate that these variables are highly correlated, and therefore you may have severe multicollinearity.

Variance Inflation Factors

You can calculate a variance inflation factor for each estimated slope coefficient. To calculate the variance inflation factor for the estimate βi^ attached to explanatory variable Xi, do the following.

Step #1: Run a regression of the explanatory variable Xi on all remaining explanatory variables in the equation.
Step #2: Find the Ri2 statistic for this regression.
Step #3: Calculate the variance inflation factor as VIF(βi^) = 1 / (1 - Ri2).

This is an estimate of how much multicollinearity has increased the estimated variance of βi^. Note that (1 - Ri2) appears in the denominator of the formula for the variance of βi^. Many researchers use the following rule of thumb: a variance inflation factor greater than 5 to 10 indicates severe multicollinearity. However, this cutoff is arbitrary.
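A minimal sketch of Steps #1 through #3, using only NumPy (the function name and the example data are hypothetical):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X, where X is an
    n x k array of explanatory variables (excluding the constant)."""
    n, k = X.shape
    out = []
    for i in range(k):
        xi = X[:, i]
        # Step #1: regress Xi on a constant and all remaining explanatory variables.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xi, rcond=None)
        # Step #2: Ri2 from that auxiliary regression.
        r2 = 1 - np.sum((xi - others @ coef) ** 2) / np.sum((xi - xi.mean()) ** 2)
        # Step #3: VIF = 1 / (1 - Ri2).
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Example: the first two columns are highly correlated; the third is not.
rng = np.random.default_rng(1)
x2 = rng.normal(size=100)
X = np.column_stack([x2, x2 + 0.2 * rng.normal(size=100), rng.normal(size=100)])
print(vif(X))   # the first two VIFs are large; the third is close to 1
```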
REMEDIES FOR MULTICOLLINEARITY

If severe multicollinearity exists, what should we do? No single best remedy exists for the problem of multicollinearity. What we do is largely a matter of judgment. However, the following remedies have been proposed.

Do Nothing

If we detect severe multicollinearity, we should try to correct it only if it is a problem. When is it a problem? When we get insignificant estimates, or estimates with wrong signs, for important theoretical variables. The following rules of thumb have been proposed for when to do nothing:

1. When the t-statistics are all greater than 2.
2. When the t-statistics for all theoretically important variables are greater than 2.
3. When the R2 from the regression exceeds the Ri2 from the regression of each independent variable on the remaining independent variables.

Obtain More Data

Since multicollinearity is a data problem, we can try to correct it by obtaining more data. However, in economics it is usually difficult to get more data.

Drop One or More Variables

A popular way to correct multicollinearity is to drop one or more of the highly correlated explanatory variables. The problem is that if the true value of the dropped variable's coefficient is nonzero in the population, then dropping it creates omitted variable bias: the estimates of the coefficients of the remaining variables that are correlated with the dropped variable will be biased.
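The bias-versus-variance trade-off involved in dropping a variable can be seen in the following small simulation (the coefficient values and variable names are made up for illustration): the full model gives estimates centered on the true value but with a large spread because of the multicollinearity, while the model that drops the correlated variable gives tighter but biased estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
b2_full, b2_short = [], []

for _ in range(reps):
    x2 = rng.normal(size=n)
    x3 = x2 + 0.3 * rng.normal(size=n)               # x3 is highly correlated with x2
    y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

    X_full = np.column_stack([np.ones(n), x2, x3])   # keep both regressors
    X_short = np.column_stack([np.ones(n), x2])      # drop x3
    b2_full.append(np.linalg.lstsq(X_full, y, rcond=None)[0][1])
    b2_short.append(np.linalg.lstsq(X_short, y, rcond=None)[0][1])

# Full model: the estimates of the coefficient on x2 are centered near the true
# value 2, but their spread is large because x2 and x3 are highly correlated.
# Short model: the spread is smaller, but the estimates are pulled away from 2
# because the omitted x3 has a nonzero coefficient and is correlated with x2.
print("full: ", np.mean(b2_full), np.std(b2_full))
print("short:", np.mean(b2_short), np.std(b2_short))
```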