INSTRUMENTAL VARIABLES REGRESSION MODEL

INTRODUCTION

We have seen that if the error term is correlated with an explanatory variable, then the OLS estimator is biased in small samples and inconsistent in large samples. This means that even in large samples the OLS estimator may not produce an estimate that is close to the true value of the population parameter being estimated. As a result, an empirical study that estimates a linear regression model using the OLS estimator when the error term is correlated with an explanatory variable is not internally valid.

Three Major Sources of Correlation Between the Error Term and an Explanatory Variable

The three most important sources of correlation between the error term and an explanatory variable are the following.
1) Confounding variable.
2) Reverse causation.
3) Measurement error in an explanatory variable.

The bias caused by a confounding variable can be corrected by including it as an explanatory variable in the model, if it is observable, or by specifying and estimating a fixed effects regression model if it is unobservable and differs across units but is constant over time. However, these methods do not work if the confounding variable is unobservable and varies both across units and over time. Nor do they work for reverse causation or measurement error in an explanatory variable.

INSTRUMENTAL VARIABLE (IV) REGRESSION MODEL

When the error term is correlated with an explanatory variable, it is not possible to find an estimator that is unbiased in small samples. However, it is possible to find an estimator that is consistent in large samples. To obtain consistent estimates in large samples, we can specify an instrumental variable (IV) regression model and use an instrumental variable (IV) estimator.

IV REGRESSION MODEL WITH ONE EXPLANATORY VARIABLE AND ONE INSTRUMENTAL VARIABLE

The IV regression model with one explanatory variable is

Yt = α + βXt + μt

The IV regression model allows the error term to be correlated with the explanatory variable, and therefore allows the error term to have a non-constant, non-zero mean. That is, Corr(μt, Xt) ≠ 0 and E(μt | Xt) ≠ 0. The remaining assumptions are the same as those of the MCLRM. Any variable correlated with the error term is called an endogenous variable. Any variable uncorrelated with the error term is called an exogenous variable.

Estimation

To use the sample of data to obtain estimates of the parameters of the IV regression model, we use an IV estimator. To use an IV estimator, you must have one or more valid instrumental variables. An instrumental variable is also called an instrument. We will designate an instrumental variable as I.

Instrumental Variable

A valid instrumental variable, I, has two properties.
1. Instrument Relevance - An instrumental variable, I, is relevant if it is correlated with the endogenous variable X. That is, Corr(It, Xt) ≠ 0.
2. Instrument Exogeneity - An instrumental variable, I, is exogenous if it is uncorrelated with the error term μ. That is, Corr(It, μt) = 0.

Two-Stage Least Squares (2SLS) Estimator

The most often used IV estimator is the two-stage least squares (2SLS) estimator. It involves two stages.
Stage #1: Regress X on I using the OLS estimator. Save the predicted values X^.
Stage #2: Regress Y on the predicted values X^ using the OLS estimator.
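The two stages can be carried out with any program's OLS routine. Below is a minimal sketch in Python using plain NumPy least squares; the simulated data and the helper function ols( ) are illustrative assumptions, not part of these notes. Because I shifts X but is uncorrelated with the error term, the stage-2 slope is close to the true β in a large sample, whereas OLS of Y on X is not.

```python
# A minimal sketch of the 2SLS procedure with one endogenous X and one
# instrument I. The data-generating process is purely illustrative: X is
# endogenous because the error u enters both X and Y, and I is a valid
# instrument because it shifts X but is unrelated to u.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
I = rng.normal(size=n)                          # instrument
u = rng.normal(size=n)                          # error term
X = 0.8 * I + 0.5 * u + rng.normal(size=n)      # endogenous explanatory variable
Y = 1.0 + 2.0 * X + u                           # true alpha = 1, beta = 2

def ols(y, regressors):
    """OLS coefficients for y on a constant plus the given regressors."""
    M = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(M, y, rcond=None)[0]

# Stage #1: regress X on I, save the predicted values X^
a1, b1 = ols(X, [I])
X_hat = a1 + b1 * I

# Stage #2: regress Y on X^; the slope is the 2SLS estimate of beta
alpha_2sls, beta_2sls = ols(Y, [X_hat])
print(beta_2sls)   # close to 2 in a large sample; OLS of Y on X is not
```

With a single instrument, the stage-2 slope from this sketch is algebraically identical to the ratio formula given next.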
The 2SLS estimator of the slope parameter β is also given by the following formula:

β^2SLS = Cov(I, Y) / Cov(I, X)

where Cov(I, Y) is the sample covariance between I and Y, and Cov(I, X) is the sample covariance between I and X.

Sampling Distribution of the 2SLS Estimator

The large-sample (asymptotic) sampling distribution of the 2SLS estimator β^2SLS is given by

β^2SLS ~ N(β, Variance), where Variance = (1/n){Var[(I – E(I))μ] / [Cov(I, X)]²}

This indicates that the 2SLS estimator has an approximate normal distribution in large samples. Also, as the sample size increases, the sampling distribution of β^2SLS collapses to the true value β. Therefore, in large samples the 2SLS estimator should produce an estimate that is close to the true value of the population parameter.

2SLS Estimator and Estimated Standard Errors

To obtain correct estimates of the standard errors, we must use the residuals calculated with the original explanatory variable, μt^ = Yt – α^ – β^Xt, not the residuals from the second-stage regression, εt^ = Yt – α^ – β^Xt^. Statistical programs with a 2SLS command will calculate the correct standard errors for you.

IV REGRESSION MODEL WITH ONE EXPLANATORY VARIABLE AND TWO OR MORE INSTRUMENTAL VARIABLES

Suppose that we have m instrumental variables, I1, I2, …, Im. If m = 0, then the regression coefficients α and β are said to be underidentified. If m = 1, then they are said to be exactly identified. If m > 1, then they are said to be overidentified. The IV estimator can be used to obtain estimates if α and β are exactly identified or overidentified. It cannot be used if α and β are underidentified.

2SLS Estimator

The 2SLS estimator now involves the following two stages.
Stage #1: Regress X on I1, I2, …, Im using the OLS estimator. Save the predicted values X^.
Stage #2: Regress Y on the predicted value variable X^ using the OLS estimator.
If all instrumental variables are relevant and exogenous, then the 2SLS estimator has an approximate normal distribution and is consistent in large samples.

IV REGRESSION MODEL WITH TWO OR MORE EXPLANATORY VARIABLES AND TWO OR MORE INSTRUMENTAL VARIABLES

The IV regression model with two or more explanatory variables and m instrumental variables is

Yt = β1 + β2X2t + β3Z1t + … + β2+rZrt + μt

where X2t is the endogenous explanatory variable, Z1t, …, Zrt are the r exogenous explanatory variables, and I1, I2, …, Im are the m instrumental variables.

2SLS Estimator

The 2SLS estimator now involves the following two stages (a sketch follows below).
Stage #1: Regress X on I1, I2, …, Im and Z1, Z2, …, Zr using the OLS estimator. Save the predicted values X^.
Stage #2: Regress Y on the predicted value variable X^ and the exogenous explanatory variables Z1, Z2, …, Zr using the OLS estimator.
If I1, I2, …, Im are relevant and exogenous, and Z1, Z2, …, Zr are exogenous, then the 2SLS estimator has an approximate normal distribution and is consistent in large samples.
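Below is a minimal sketch of these two stages in Python, again with simulated data and a hypothetical helper function ols_fit( ) as illustrative assumptions, not part of these notes. The key point is that the exogenous explanatory variable Z appears in both stages.

```python
# A minimal sketch of 2SLS with one endogenous regressor X, one exogenous
# regressor Z, and two instruments I1 and I2. The data-generating process
# is purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
Z = rng.normal(size=n)                               # exogenous explanatory variable
I1, I2 = rng.normal(size=n), rng.normal(size=n)      # instruments
u = rng.normal(size=n)                               # error term
X = 0.6 * I1 + 0.4 * I2 + 0.3 * Z + 0.5 * u + rng.normal(size=n)   # endogenous
Y = 1.0 + 2.0 * X + 1.5 * Z + u                      # true beta2 = 2, beta3 = 1.5

def ols_fit(y, cols):
    """OLS of y on a constant plus cols; returns (coefficients, fitted values)."""
    M = np.column_stack([np.ones(len(y))] + list(cols))
    coef = np.linalg.lstsq(M, y, rcond=None)[0]
    return coef, M @ coef

# Stage #1: regress X on the instruments AND the exogenous regressor Z
_, X_hat = ols_fit(X, [I1, I2, Z])

# Stage #2: regress Y on X^ and Z; the coefficient on X^ is the 2SLS
# estimate of beta2
coef, _ = ols_fit(Y, [X_hat, Z])
print(coef[1], coef[2])   # approximately 2 and 1.5 in a large sample
```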
CHECKING THE VALIDITY OF INSTRUMENTAL VARIABLES

If the instrumental variables I1, I2, …, Im are uncorrelated with the endogenous explanatory variable X, then they are not relevant. If the instrumental variables have a relatively low correlation with X, then they are said to be weak instruments. If the instruments are not relevant, the 2SLS estimator is inconsistent; if they are weak, its sampling distribution is not well approximated by the normal distribution and it is unreliable even in large samples. If any instrumental variable is correlated with the error term, then it is not exogenous, and the 2SLS estimator will be inconsistent in large samples. If 2SLS is inconsistent, then it will not produce an estimate that is close to the true value of the population parameter, even if the sample size is large. Therefore, we should check the validity of our instrumental variable(s).

Checking Instrument Relevance

To check for instrument relevance, calculate the F-statistic for the null hypothesis that the coefficients of the instrumental variables are all zero in the first-stage regression. An often-used rule of thumb is that an F-statistic of less than 10 indicates possibly weak instruments.

Checking Instrument Exogeneity

If you have one instrumental variable, and therefore the regression coefficients are exactly identified, then you cannot check for instrument exogeneity. If you have two or more instrumental variables, and therefore the regression coefficients are overidentified, then you can do a test of the overidentifying restrictions. This allows you to check whether all instrumental variables are exogenous. The null hypothesis is that all instrumental variables are exogenous. The alternative hypothesis is that at least one of the instrumental variables is endogenous (i.e., correlated with the error term). The test is a Lagrange multiplier (LM) test and involves two steps.
Step #1: Estimate the IV regression model using the 2SLS estimator. Save the 2SLS residuals μt^.
Step #2: Regress the residuals μt^ on the instrumental variables I1, I2, …, Im and the exogenous explanatory variables Z1, Z2, …, Zr using OLS. Use the R² statistic from this regression to calculate the LM test statistic: LM = nR², where n is the sample size.
The LM statistic has an approximate chi-square distribution with m – 1 degrees of freedom.

IV REGRESSION MODEL WITH TWO OR MORE ENDOGENOUS EXPLANATORY VARIABLES, TWO OR MORE EXOGENOUS EXPLANATORY VARIABLES, AND TWO OR MORE INSTRUMENTAL VARIABLES

The IV regression model with k – 1 endogenous explanatory variables, r exogenous explanatory variables, and m instrumental variables is

Yt = β1 + β2X2t + … + βkXkt + βk+1Z1t + … + βk+rZrt + μt

where X2t, …, Xkt are the k – 1 endogenous explanatory variables, Z1t, …, Zrt are the r exogenous explanatory variables, and I1, I2, …, Im are the m instrumental variables.

2SLS Estimator

The 2SLS estimator now involves the following two stages.
Stage #1: Regress each Xi on I1, I2, …, Im and Z1, Z2, …, Zr using the OLS estimator. Save the k – 1 predicted value variables X2^, …, Xk^.
Stage #2: Regress Y on the predicted value variables X2^, …, Xk^ and the exogenous explanatory variables Z1, Z2, …, Zr using the OLS estimator.
If I1, I2, …, Im are relevant and exogenous, and Z1, Z2, …, Zr are exogenous, then the 2SLS estimator has an approximate normal distribution and is consistent in large samples.

Underidentified, Exactly Identified, and Overidentified Regression Coefficients

The regression coefficients are underidentified if m < k – 1, exactly identified if m = k – 1, and overidentified if m > k – 1.

Checking the Validity of Instrumental Variables

Instrument Relevance: If k = 2, so that there is one endogenous explanatory variable, then you can use the F-statistic to check for instrument relevance. If k > 2, then the F-test cannot be used to check for instrument relevance.
Instrument Exogeneity: The Lagrange multiplier test can be used to test for instrument exogeneity (i.e., to test the overidentifying restrictions) if m > k – 1. The LM test statistic has an approximate chi-square distribution with m – (k – 1) degrees of freedom. A sketch of both checks for the one-endogenous-variable case is given below.
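Below is a minimal sketch of both validity checks in Python for the case of one endogenous explanatory variable (k = 2), one exogenous explanatory variable Z, and m = 2 instruments. The simulated data, variable names, and helper function are illustrative assumptions, not part of these notes.

```python
# A minimal sketch of the two validity checks: the first-stage F-statistic for
# instrument relevance and the LM = n*R^2 test of the overidentifying
# restrictions. With m = 2 instruments and k - 1 = 1 endogenous regressor,
# the LM statistic has m - (k - 1) = 1 degree of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 5000
Z = rng.normal(size=n)                               # exogenous regressor
I1, I2 = rng.normal(size=n), rng.normal(size=n)      # instruments
u = rng.normal(size=n)                               # error term
X = 0.6 * I1 + 0.4 * I2 + 0.3 * Z + 0.5 * u + rng.normal(size=n)
Y = 1.0 + 2.0 * X + 1.5 * Z + u

def ols(y, cols):
    """OLS of y on a constant plus cols; returns (coefficients, residuals)."""
    M = np.column_stack([np.ones(len(y))] + list(cols))
    coef = np.linalg.lstsq(M, y, rcond=None)[0]
    return coef, y - M @ coef

# Checking instrument relevance: F-statistic for the null hypothesis that the
# coefficients on I1 and I2 are both zero in the first-stage regression.
_, e_u = ols(X, [I1, I2, Z])      # unrestricted first stage (residuals)
_, e_r = ols(X, [Z])              # restricted first stage: instruments dropped
q, k_params = 2, 4                # 2 restrictions; 4 unrestricted coefficients
F = ((e_r @ e_r - e_u @ e_u) / q) / (e_u @ e_u / (n - k_params))
print("first-stage F:", F)        # rule of thumb: F < 10 suggests weak instruments

# Checking instrument exogeneity: LM test of the overidentifying restrictions.
X_hat = X - e_u                            # first-stage fitted values
b, _ = ols(Y, [X_hat, Z])                  # second-stage (2SLS) coefficients
u_hat = Y - (b[0] + b[1] * X + b[2] * Z)   # 2SLS residuals, using the original X
_, e = ols(u_hat, [I1, I2, Z])             # Step #2 regression of the LM test
R2 = 1 - (e @ e) / np.sum((u_hat - u_hat.mean()) ** 2)
LM = n * R2
print("LM =", LM, "p-value =", stats.chi2.sf(LM, df=1))
```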