Research Method Lecture 11-2 (Ch15)
Instrumental Variables Estimation and Two Stage Least Squares

What happens if you use the IV method when the suspected endogenous variable is in fact exogenous?

Consider the following model.

Y=β0+β1x+u

If x is exogenous, you do not need the IV method. The OLS estimators are consistent. Suppose that you have an instrument for x, called z, which satisfies the instrument conditions (instrument exogeneity and instrument relevance, described in handout 111). Then the IV estimators are also consistent.

Then, which one is better, OLS or IV? The answer is OLS. If x is exogenous, the IV estimators have larger variances, so the IV estimators are imprecise (you tend to get smaller t-statistics in absolute value). To see this, notice the following.

$$\mathrm{Var}(\hat\beta_{1,IV}) = \frac{\sigma^2}{SST_x \cdot R^2_{x,z}}, \qquad \mathrm{Var}(\hat\beta_{1,OLS}) = \frac{\sigma^2}{SST_x}$$

Since R²x,z is always between 0 and 1 (it equals 1 only in the case x=z), the variance of the IV estimator is always bigger (asymptotically).

Thus, controlling for endogeneity (i.e., using the IV method) when x is actually exogenous is costly in terms of precision.

Poor instruments: What happens if the instrumental variable does not satisfy the instrument conditions?

Consider the following model.

Y=β0+β1x+u

This time, suppose that x is endogenous. But further suppose that your instrumental variable z does not satisfy the instrument conditions (i.e., you have a poor instrument). Then what happens?

The answer to this question is the following.
1. IV estimators are inconsistent.
2. The directions of the biases in the IV estimators and the OLS estimators can be opposite.
3. The bias in IV can be worse than the bias in OLS.

To understand 1, notice that

$$\mathrm{plim}(\hat\beta_{1,IV}) = \beta_1 + \frac{\mathrm{Corr}(z,u)}{\mathrm{Corr}(z,x)} \cdot \frac{\sigma_u}{\sigma_x}$$

If instrument exogeneity is not satisfied, Corr(z,u) is not zero, so the IV estimator is inconsistent.

$$\mathrm{plim}(\hat\beta_{1,OLS}) = \beta_1 + \mathrm{Corr}(x,u) \cdot \frac{\sigma_u}{\sigma_x}$$

If x is endogenous, Corr(x,u) is not zero, so the OLS estimator is inconsistent. (Proof: See the front board.)

So, both IV and OLS are inconsistent.
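The two probability limits above can be checked numerically. The following is a minimal sketch, not part of the original lecture: the data-generating process, the parameter values, and the helper name `cov` are all hypothetical choices made for illustration. The instrument z is only slightly correlated with u, but it is also only weakly correlated with x, so the ratio Corr(z,u)/Corr(z,x) is large.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(0)    # fixed seed so the run is reproducible
n = 200_000               # large n, so the estimates sit close to their plims
beta0, beta1 = 1.0, 2.0   # hypothetical true parameters

z = [rng.gauss(0, 1) for _ in range(n)]
e = [rng.gauss(0, 1) for _ in range(n)]
# u is slightly correlated with z: instrument exogeneity "almost" holds
u = [0.05 * zi + ei for zi, ei in zip(z, e)]
# x is endogenous (it depends on e) and only weakly related to z
x = [0.1 * zi + 0.2 * ei + rng.gauss(0, 1) for zi, ei in zip(z, e)]
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

# Simple-regression OLS and IV slope estimators
b_ols = cov(x, y) / cov(x, x)
b_iv = cov(z, y) / cov(z, x)

# Population values implied by this design:
#   Cov(z,u) = 0.05, Cov(z,x) = 0.1, Cov(x,u) = 0.205, Var(x) = 1.05
# Corr(z,u)/Corr(z,x) * sigma_u/sigma_x simplifies to Cov(z,u)/Cov(z,x),
# and Corr(x,u) * sigma_u/sigma_x simplifies to Cov(x,u)/Var(x).
plim_iv = beta1 + 0.05 / 0.1
plim_ols = beta1 + 0.205 / 1.05

print("OLS estimate:", round(b_ols, 3), " plim:", round(plim_ols, 3))
print("IV estimate: ", round(b_iv, 3), " plim:", round(plim_iv, 3))
```

In this design both estimators are inconsistent, and the IV estimate ends up further from the true β1 than the OLS estimate, even though Corr(z,u) is small.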
To understand 2, first suppose that Corr(x,u) is positive. Then OLS has a positive bias. But it can happen that Corr(z,u)/Corr(z,x) is negative. In such a case, the IV estimator has a negative bias. This means that, when you have an invalid instrument, you may get very unexpected results.

To understand 3, consider the following scenario.
(i) Instrument exogeneity is almost satisfied but not perfectly satisfied; that is, corr(z,u) is close to 0 but not exactly 0.
(ii) The instrument is not very relevant; i.e., corr(z,x) is very close to 0.

Then, even if instrument exogeneity is almost satisfied, the bias will be magnified by the small corr(z,x).

$$\mathrm{plim}(\hat\beta_{1,IV}) = \beta_1 + \frac{\mathrm{Corr}(z,u)}{\mathrm{Corr}(z,x)} \cdot \frac{\sigma_u}{\sigma_x}$$

If Corr(z,x) is small, the bias will be magnified.

It is possible that the bias is so magnified that the bias in the IV estimator is worse than the bias in OLS.

IV estimation of the multiple regression model

I will extend the discussion to the multiple regression model. I will explain the following 3 cases, step by step.
Case 1: One endogenous variable, one instrument.
Case 2: One endogenous variable, more than one instrument. (Two stage least squares)
Case 3: More than one endogenous variable, more than one instrument. (Two stage least squares)

Case 1: One endogenous variable, one instrument.

Consider the following regression.

log(wage)=β0+β1educ+β2exp+u

Suppose that educ is endogenous but exp is exogenous.

To explain IV regression for multiple regression, it is often useful to use different notation for endogenous and exogenous variables. Let us use y for endogenous variables (i.e., correlated with u) and z for exogenous variables (i.e., uncorrelated with u). Then, we can write the model as:

y1=β0+β1y2+β2z1+u …………………(1)

y1 is log(wage), y2 is educ, and z1 is exp.

This model is called the structural equation to emphasize that this equation shows the causal relationship. Of course, OLS cannot be used to consistently estimate the parameters since y2 is endogenous.
If you have an instrument for y2, you can consistently estimate the model. Let us call this instrument z2.

As before, z2 should satisfy (i) instrument exogeneity, and (ii) instrument relevance. For a multiple regression model, these conditions are written as:

1. Instrument exogeneity
Cov(z2, u)=0 …………………….(2)

2. Instrument relevance
y2=π0+π1z1+π2z2+error and π2≠0 …………….(3)
All the exogenous variables are included in this equation. This equation is often called the reduced form equation.

In addition, z2 should not be a part of the structural equation (1). This is called the exclusion restriction.

Now, we have the following three conditions that can be used to obtain the IV estimators.

E(u)=0
Cov(z1,u)=0
Cov(z2,u)=0 (this is from the instrument exogeneity)

The sample counterparts of these conditions are given below.

$$\sum_{i=1}^{n} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$
If you divide it by n, this is the sample average of û.

$$\sum_{i=1}^{n} z_{i1} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$
If you divide it by n-1, this is the sample covariance between z1 and û.

$$\sum_{i=1}^{n} z_{i2} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$
If you divide it by n-1, this is the sample covariance between z2 and û.

This is a set of three equations with three unknowns: $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$. The solutions to these equations are the IV estimators. There is a simple matrix expression for the IV estimators. However, we will not cover this during the class.

The above method can be easily extended to the case where there are more explanatory variables (but only one endogenous variable). Consider the following model.

y1=β0+β1y2+β2z1+β3z2+β4z3+..+βkzk-1+u

Suppose that zk is the instrument for y2. Then the IV estimators are the solution to the following equations.

$$\sum_{i=1}^{n} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1} - \dots - \hat\beta_k z_{i,k-1}) = 0$$
$$\sum_{i=1}^{n} z_{i1} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1} - \dots - \hat\beta_k z_{i,k-1}) = 0$$
$$\vdots$$
$$\sum_{i=1}^{n} z_{ik} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1} - \dots - \hat\beta_k z_{i,k-1}) = 0$$

The solutions to the above equations are the IV estimators when there are many explanatory variables, but only one endogenous variable and one instrument.
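The three sample moment conditions for Case 1 form a linear system in the three unknown coefficients, so they can be solved directly. The sketch below is an illustration under stated assumptions, not the lecture's own code: the simulated data, the true parameter values, and the helper names `solve` and `iv_estimates` are hypothetical, and a small Gaussian-elimination routine stands in for the matrix expression that the class does not cover.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def iv_estimates(y1, y2, z1, z2):
    """Solve the three sample moment conditions for (b0, b1, b2)."""
    ones = [1.0] * len(y1)
    weights = [ones, z1, z2]      # each condition multiplies the residual by 1, z_i1, z_i2
    regressors = [ones, y2, z1]   # the residual is y_i1 - b0 - b1*y_i2 - b2*z_i1
    A = [[sum(w * x for w, x in zip(wv, xv)) for xv in regressors] for wv in weights]
    b = [sum(w * y for w, y in zip(wv, y1)) for wv in weights]
    return solve(A, b)

# Simulated data (hypothetical): true coefficients are b0=1.0, b1=2.0, b2=0.5
rng = random.Random(1)
n = 5000
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
# y2 is endogenous: it depends on u as well as on z1 and the instrument z2
y2 = [0.5 * a + 1.0 * b + 0.8 * e + rng.gauss(0, 1) for a, b, e in zip(z1, z2, u)]
y1 = [1.0 + 2.0 * v + 0.5 * a + e for v, a, e in zip(y2, z1, u)]

b0, b1, b2 = iv_estimates(y1, y2, z1, z2)
resid = [y - b0 - b1 * v - b2 * a for y, v, a in zip(y1, y2, z1)]
print("IV estimates:", round(b0, 3), round(b1, 3), round(b2, 3))
print("moment conditions:", sum(resid),
      sum(a * r for a, r in zip(z1, resid)),
      sum(b * r for b, r in zip(z2, resid)))
```

The three printed sums should be zero up to floating-point error, which is exactly what the sample moment conditions require.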
Example

Consider the following model.

Log(wage)=β0+β1(educ)+β2Exper+β3Exper²+β4(SMSA)+β5(South)+u

Using college proximity (nearc4) as an IV for education, estimate the model. Use CARD.dta. nearc4 is a dummy variable for someone who grew up near a four-year college.

OLS

. reg lwage educ exper expersq smsa south

      Source |       SS         df       MS           Number of obs =    3010
-------------+--------------------------------        F(  5,  3004) =  214.57
       Model |  155.959797       5   31.1919593       Prob > F      =  0.0000
    Residual |  436.681848    3004   .145366794       R-squared     =  0.2632
-------------+--------------------------------        Adj R-squared =  0.2619
       Total |  592.641645    3009   .196956346       Root MSE      =  .38127

       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |   .0815797    .003499    23.31   0.000     .0747189    .0884405
       exper |   .0838357   .0067735    12.38   0.000     .0705545    .0971169
     expersq |  -.0022021   .0003238    -6.80   0.000    -.0028371   -.0015672
        smsa |   .1508006    .015836     9.52   0.000     .1197501    .1818511
       south |  -.1751761   .0146486   -11.96   0.000    -.2038985   -.1464537
       _cons |   4.611015    .067895    67.91   0.000     4.477889     4.74414

IV

. ivregress 2sls lwage exper expersq smsa south (educ=nearc4)

Instrumental variables (2SLS) regression              Number of obs =    3010
                                                       Wald chi2(5)  =  499.36
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.2051
                                                       Root MSE      =  .39562

       lwage |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |     .13542   .0486085     2.79   0.005     .0401491     .230691
       exper |   .1067727   .0218136     4.89   0.000     .0640188    .1495266
     expersq |  -.0022553   .0003394    -6.64   0.000    -.0029205     -.00159
        smsa |   .1249987   .0284538     4.39   0.000     .0692302    .1807671
       south |  -.1409356   .0343705    -4.10   0.000    -.2083005   -.0735707
       _cons |   3.703427   .8201379     4.52   0.000     2.095986    5.310867

Instrumented:  educ
Instruments:   exper expersq smsa south nearc4

Check if nearc4 satisfies instrument relevance.

. reg educ exper expersq smsa south nearc4, robust

Linear regression                                     Number of obs =    3010
                                                       F(  5,  3004) =  675.83
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4524
                                                       Root MSE      =  1.9825

             |               Robust
        educ |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       exper |  -.4258437   .0320651   -13.28   0.000    -.4887155    -.362972
     expersq |   .0009774   .0017044     0.57   0.566    -.0023646    .0043194
        smsa |   .3639914   .0863314     4.22   0.000     .1947167    .5332661
       south |   -.582683   .0743531    -7.84   0.000    -.7284712   -.4368948
      nearc4 |   .3456458   .0824092     4.19   0.000     .1840616      .50723
       _cons |   16.68131   .1489113   112.02   0.000     16.38933    16.97329

Using the t-test, we can reject the null hypothesis that nearc4 is not correlated with educ after controlling for all other exogenous variables.

Case 2: One endogenous variable, more than one instrument. Two stage least squares

Consider the following model with one endogenous variable.

y1=β0+β1y2+β2z1+u

Now, suppose that you have two instruments for y2 that satisfy the instrument conditions. Call them z2 and z3.

You can apply the IV method using either z2 or z3. But this produces two different estimators. Moreover, they are not efficient. Now, I will show you a more efficient estimator. First, it is important to lay out the instrument conditions.

For z2 and z3 to be valid instruments, they have to satisfy the following two conditions.

1. Instrument exogeneity
Cov(z2, u)=0 and Cov(z3, u)=0

2. Instrument relevance
y2=π0+π1z1+π2z2+π3z3+error and π2≠0 or π3≠0
Include all the exogenous variables in this equation.

In addition, z2 and z3 should not be a part of the structural equation. These are called the exclusion restrictions.

Now, I will explain the estimation method. Instead of using only one instrument, we use a linear combination of z2 and z3 as the instrument. Since a linear combination of z2 and z3 also satisfies the instrument conditions, this is a valid method. The question is how to find the best linear combination of z2 and z3.

It turns out that OLS regression of the following model provides the best linear combination.

y2=π0+π1z1+π2z2+π3z3+error

After you estimate this model, you get the predicted value of y2:
$$\hat y_2 = \hat\pi_0 + \hat\pi_1 z_1 + \hat\pi_2 z_2 + \hat\pi_3 z_3$$

Since ŷ2 is a combination of variables which are not correlated with u, ŷ2 is not correlated with u either. At the same time, ŷ2 is correlated with y2. Thus this is a valid instrument.

Thus, we have the following three conditions that can be used to derive an IV estimator.

E(u)=0
Cov(z1,u)=0
Cov(ŷ2,u)=0

The sample counterparts of the above equations are given by:

$$\sum_{i=1}^{n} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$
$$\sum_{i=1}^{n} z_{i1} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$
$$\sum_{i=1}^{n} \hat y_{i2} (y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}) = 0$$

This is a set of three equations with three unknowns: $\hat\beta_0$, $\hat\beta_1$, $\hat\beta_2$. The solutions to these equations are a special type of IV estimators called the two stage least squares estimators.

You can estimate these parameters by following the above procedure. There is an alternative and equivalent procedure to estimate these parameters. This procedure will give you an idea why it is called two stage least squares.

The estimation procedure of two stage least squares (2SLS)

Stage 1. Estimate the following model using OLS and get the predicted value for y2, ŷ2. Make sure to include all the exogenous variables.

y2=π0+π1z1+π2z2+π3z3+error

Stage 2. Replace y2 with ŷ2, then estimate the following model using OLS.

y1=β0+β1ŷ2+β2z1+error

The OLS estimators of the coefficients are the two stage least squares (2SLS) estimators.

Estimating the standard errors for two stage least squares

When you exactly follow the two stage procedure explained above, you get the correct 2SLS coefficients. But you don't get the correct standard errors. So, after applying the procedure, you have to do some extra work to estimate the standard errors. Under the homoskedasticity assumption, the valid standard errors are computed as follows:

1. Estimate the 2SLS coefficients, then estimate the variance of u as

$$\hat\sigma^2 = \frac{1}{n-k-1} \sum_{i=1}^{n} \hat u_i^2, \qquad \text{where } \hat u_i = y_{i1} - \hat\beta_0 - \hat\beta_1 y_{i2} - \hat\beta_2 z_{i1}$$

Note that you use y2, not ŷ2, in the residual. The coefficients are the 2SLS estimates.

2. Then the estimated variance of $\hat\beta_1$ (the coefficient on y2) is given by

$$\widehat{\mathrm{Var}}(\hat\beta_1) = \frac{\hat\sigma^2}{\widehat{SST}_2 (1 - \hat R_2^2)}$$

where $\widehat{SST}_2$ is the total variation of ŷ2, and $\hat R_2^2$ is the R-squared from regressing ŷ2 on all other exogenous variables appearing in the structural equation.

The square root of this variance is the standard error for $\hat\beta_1$.

Note

STATA automatically estimates the 2SLS model and calculates the correct standard errors. In most cases, you should avoid estimating 2SLS "manually" (although it is a good exercise), since this does not provide you with the correct standard errors.

Exercise

Consider the following model.

Log(wage)=β0+β1(educ)+β2Exper+β3Exper²+u

1. Suppose educ is endogenous but exper and its square are exogenous. Using mother's and father's education as instruments, estimate the 2SLS model. Use Mroz.dta.
2. Manually estimate the model to check if you get the same coefficients. (Note that you will not get the correct standard errors.)

OLS

. reg lwage educ exper expersq, robust

Linear regression                                     Number of obs =     428
                                                       F(  3,   424) =   27.30
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1568
                                                       Root MSE      =  .66642

             |               Robust
       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |   .1074896    .013219     8.13   0.000     .0815068    .1334725
       exper |   .0415665    .015273     2.72   0.007     .0115462    .0715868
     expersq |  -.0008112   .0004201    -1.93   0.054    -.0016369    .0000145
       _cons |  -.5220406   .2016505    -2.59   0.010    -.9183996   -.1256815

2SLS (the "first" option shows both the first stage and the second stage)

. ivregress 2sls lwage exper expersq (educ = motheduc fatheduc), first

First-stage regressions                               Number of obs =     428
                                                       F(  4,   423) =   28.36
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2115
                                                       Adj R-squared =  0.2040
                                                       Root MSE      =  2.0390

        educ |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       exper |   .0452254   .0402507     1.12   0.262    -.0338909    .1243417
     expersq |  -.0010091   .0012033    -0.84   0.402    -.0033744    .0013562
    motheduc |    .157597   .0358941     4.39   0.000      .087044    .2281501
    fatheduc |   .1895484   .0337565     5.62   0.000     .1231971    .2558997
       _cons |    9.10264   .4265614    21.34   0.000     8.264196    9.941084

Instrumental variables (2SLS) regression              Number of obs =     428
                                                       Wald chi2(3)  =   24.65
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.1357
                                                       Root MSE      =  .67155

       lwage |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
       exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
     expersq |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
       _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538

Instrumented:  educ
Instruments:   exper expersq motheduc fatheduc

Estimating 2SLS manually

When you run the first stage regression manually on this data, more observations are used than in the above 2SLS. To use exactly the same observations, first run the 2SLS and find the observations used in the regression.

. ivregress 2sls lwage exper expersq (educ = motheduc fatheduc)

(The output is identical to the 2SLS results shown above.)

. gen fullsample= e(sample)

e(sample) enables you to create a dummy variable that indicates whether the observation was used in the estimation.

Then, estimate the first stage regression. Note that "if fullsample==1" tells STATA to use an observation only if fullsample is 1.

. reg educ exper expersq motheduc fatheduc if fullsample==1

      Source |       SS         df       MS           Number of obs =     428
-------------+--------------------------------        F(  4,   423) =   28.36
       Model |  471.620998       4    117.90525       Prob > F      =  0.0000
    Residual |  1758.57526     423   4.15738833       R-squared     =  0.2115
-------------+--------------------------------        Adj R-squared =  0.2040
       Total |  2230.19626     427   5.22294206       Root MSE      =   2.039

        educ |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
       exper |   .0452254   .0402507     1.12   0.262    -.0338909    .1243417
     expersq |  -.0010091   .0012033    -0.84   0.402    -.0033744    .0013562
    motheduc |    .157597   .0358941     4.39   0.000      .087044    .2281501
    fatheduc |   .1895484   .0337565     5.62   0.000     .1231971    .2558997
       _cons |    9.10264   .4265614    21.34   0.000     8.264196    9.941084

. predict educ_hat, xb

After estimation, type this command. This will automatically create the predicted value of educ.

Finally, estimate the second stage regression. You can see that the coefficients are the same as before, but the standard errors and t-statistics are different.

. reg lwage educ_hat exper expersq if fullsample==1

      Source |       SS         df       MS           Number of obs =     428
-------------+--------------------------------        F(  3,   424) =    7.40
       Model |   11.117828       3   3.70594266       Prob > F      =  0.0001
    Residual |  212.209613     424    .50049437       R-squared     =  0.0498
-------------+--------------------------------        Adj R-squared =  0.0431
       Total |  223.327441     427   .523015084       Root MSE      =  .70746

       lwage |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    educ_hat |   .0613966   .0329624     1.86   0.063    -.0033933    .1261866
       exper |   .0441704   .0140844     3.14   0.002     .0164865    .0718543
     expersq |   -.000899   .0004212    -2.13   0.033    -.0017268   -.0000711
       _cons |   .0481003   .4197565     0.11   0.909    -.7769624     .873163

Case 3: More than one endogenous variable, more than one instrument

Consider the following structural equation.

y1=β0+β1y2+β2y3+β3z1+β4z2+β5z3+u1

There are two endogenous variables, y2 and y3. Thus, OLS will be biased. In order to estimate this model with the IV method, you need at least 2 instruments. When you have multiple endogenous variables, you need at least as many instruments as endogenous variables.

Suppose you have 3 instruments: z4, z5, z6. As usual, these instruments should satisfy 2 conditions. The first is that they should not be correlated with u1 (instrument exogeneity). The second is that they should be correlated with the endogenous variables (instrument relevance). When you have multiple endogenous variables, the second condition has a more complex expression, and it is called the rank condition.

The estimation procedure

The 2SLS procedure when there is more than one endogenous variable is shown here.

y1=β0+β1y2+β2y3+β3z1+β4z2+β5z3+u1

Suppose you have three instruments: z4, z5, z6.
First stage: Estimate the following two reduced form regressions.

y2=π10+π11z1+π12z2+π13z3+π14z4+π15z5+π16z6+error
y3=π20+π21z1+π22z2+π23z3+π24z4+π25z5+π26z6+error

Then obtain ŷ2 and ŷ3.

Second stage: Estimate the following "second stage regression".

y1=β0+β1ŷ2+β2ŷ3+β3z1+β4z2+β5z3+error

The estimated coefficients are the 2SLS coefficients.

Note that the second stage regression does not produce correct standard errors. The derivation of the exact formula for the standard errors is not the focus of this course. The Stata ivregress command automatically computes the correct standard errors.

Testing multiple hypotheses

In the 2SLS method, the F statistic formula we used for OLS is no longer valid. STATA automatically computes a valid F-type statistic for 2SLS.
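The two-stage procedure for Case 3 can be sketched on simulated data. This is a minimal illustration under stated assumptions, not a replacement for ivregress: the data-generating process, the true coefficients, and the helper names `solve`, `ols`, `predict` and `resid_var` are all hypothetical. It also compares the residual variance computed with the actual y2 and y3 against the one a naive second-stage regression would use, which is why the manual standard errors are wrong.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(y, columns):
    """OLS of y on an intercept plus the given columns; returns [b0, b1, ...]."""
    X = [[1.0] * len(y)] + columns
    XtX = [[sum(p * q for p, q in zip(xi, xj)) for xj in X] for xi in X]
    Xty = [sum(p * q for p, q in zip(xi, y)) for xi in X]
    return solve(XtX, Xty)

def predict(coefs, columns):
    """Fitted values from an intercept plus the given columns."""
    n = len(columns[0])
    return [coefs[0] + sum(c * col[i] for c, col in zip(coefs[1:], columns))
            for i in range(n)]

def resid_var(y, coefs, columns):
    """Residual variance with the n-k-1 degrees-of-freedom correction."""
    fitted = predict(coefs, columns)
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (len(y) - len(coefs))

# Simulated data (hypothetical): true beta1 = 1.5 on y2, beta2 = -1.0 on y3
rng = random.Random(2)
n = 3000
z1, z2, z3, z4, z5, z6 = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(6)]
u1 = [rng.gauss(0, 1) for _ in range(n)]
# y2 and y3 both depend on u1, so both are endogenous
y2 = [0.3 * a + d + 0.5 * e + 0.7 * u + rng.gauss(0, 1)
      for a, d, e, u in zip(z1, z4, z5, u1)]
y3 = [0.3 * b + e + 0.5 * f - 0.5 * u + rng.gauss(0, 1)
      for b, e, f, u in zip(z2, z5, z6, u1)]
y1 = [1.0 + 1.5 * p - 1.0 * q + 0.5 * a + 0.5 * b + 0.5 * c + u
      for p, q, a, b, c, u in zip(y2, y3, z1, z2, z3, u1)]

exog = [z1, z2, z3]
all_exog = exog + [z4, z5, z6]   # included exogenous variables plus instruments

# First stage: regress each endogenous variable on ALL exogenous variables
y2_hat = predict(ols(y2, all_exog), all_exog)
y3_hat = predict(ols(y3, all_exog), all_exog)

# Second stage: replace y2, y3 with their fitted values
b_2sls = ols(y1, [y2_hat, y3_hat] + exog)
b_ols = ols(y1, [y2, y3] + exog)   # plain OLS, for comparison

# Valid residual variance uses the actual y2, y3; the naive one uses the fitted values
sigma2_correct = resid_var(y1, b_2sls, [y2, y3] + exog)
sigma2_naive = resid_var(y1, b_2sls, [y2_hat, y3_hat] + exog)

print("2SLS beta1, beta2:", round(b_2sls[1], 3), round(b_2sls[2], 3))
print("OLS  beta1, beta2:", round(b_ols[1], 3), round(b_ols[2], 3))
print("sigma^2 correct vs naive:", round(sigma2_correct, 3), round(sigma2_naive, 3))
```

In this design the variance of u1 is 1, so the estimate built from the actual y2 and y3 should land near 1, while the naive second-stage value is far from it.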