Endogeneity & Simultaneous Equation Models

Interdependence between variables also leads to endogeneity problems (ie it violates the assumption that Cov(X,u) = 0, and so OLS will give biased estimates).

Eg 1.
    C = a + bY + e          (1)
    Y = C + I + G + v       (2)

This is a 2 equation simultaneous equation system. C and Y appear on both sides of their respective equations and are interdependent, since any shock, represented by ∆e,
    → ∆C in (1)
    but ∆C → ∆Y from (2)
    and ∆Y → ∆C from (1)
so changes in C lead to changes in Y and changes in Y lead to changes in C. But the fact that ∆e → ∆Y means Cov(X,u) = Cov(Y,e) ≠ 0 in (1), which given OLS implies

    $\hat{b} = \frac{Cov(X,y)}{Var(X)} = b + \frac{Cov(X,u)}{Var(X)}$

means $E(\hat{b}) \neq b$.

So OLS in the presence of interdependent variables gives biased estimates.

Any right hand side variable which has the property Cov(X,u) ≠ 0 is said to be endogenous.
Any right hand side variable which has the property Cov(X,u) = 0 is said to be exogenous.

Solution: IV estimation (as with measurement error, since the symptom, if not the cause, is the same)

    $\hat{b}_{IV} = \frac{Cov(Z,y)}{Cov(Z,X)}$          (A)

Again, the problem is where to find instruments. In a simultaneous equation model, the answer is often in the system itself.

Eg 2.
    Price = b0 + b1 Wage + e                          (1)
    Wage = d0 + d1 Price + d2 Unemployment + v        (2)

This time wages and prices are interdependent, so OLS on either (1) or (2) will give biased estimates,
but unemployment does not appear in (1), although it is correlated with wages through (2).

This means unemployment can be used as an instrument for wages in (1), as
    Cov(Unemployment, e) = 0, so it is uncorrelated with the residual, and
    Cov(Unemployment, Wage) ≠ 0, so it is correlated with the endogenous RHS variable.

So now

    $\hat{b}_{IV} = \frac{Cov(Unemployment, Price)}{Cov(Unemployment, Wage)}$          (compare with (A))

This process of using extra exogenous variables as instruments for endogenous RHS variables is known as identification.

If there are no additional exogenous variables outside the original equation that can be used as instruments for the endogenous RHS variables, then the equation is said to be unidentified.

(In the example above (2) is unidentified because, despite Price being endogenous, there are no other exogenous variables not already in (2) that can be used as instruments for Price.)

In general we can develop a rule (the order condition) that tells us whether any particular equation will be identified.

“In a system of M simultaneous equations, any one equation is identified if the number of exogenous variables excluded from that equation is greater than or equal to the total number of endogenous variables in that equation less one.”

    K − k ≥ m − 1          (B)

where
    K = total no. of exogenous variables in the system
    k = no. of exogenous variables included in the equation
    m = no. of endogenous variables included in the equation

In practice this rule tells us whether we can find an instrument for each and every endogenous RHS variable.

Consider the previous example
    Price = b0 + b1 Wage + e                          (1)
    Wage = d0 + d1 Price + d2 Unemployment + v        (2)

We know Price and Wage are interdependent, so we can't use OLS. Consider each equation in turn.

In (1)
    ∃ 2 endogenous variables (Price, Wage) ∴ m = 2
    ∃ 1 exogenous variable in the whole system (Unemployment) ∴ K = 1
    ∃ 0 exogenous variables in equation (1) ∴ k = 0
Using (B), K − k ≥ m − 1:
    1 − 0 = 2 − 1
    1 = 1
so equation (1) is identified.
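Rule (B) can be sketched as a small helper. This is only an illustrative sketch: the function name `order_condition` and the labels it returns are assumptions, not part of the notes, and the rule checks only the counting condition stated in (B).

```python
def order_condition(K, k, m):
    """Classify an equation using the order condition K - k >= m - 1.

    K: exogenous variables in the whole system
    k: exogenous variables included in this equation
    m: endogenous variables included in this equation
    """
    excluded = K - k   # exogenous variables available as outside instruments
    needed = m - 1     # endogenous right hand side variables to instrument
    if excluded < needed:
        return "not identified"
    if excluded == needed:
        return "just identified"
    return "over-identified"


# Equation (1): Price = b0 + b1*Wage + e, Unemployment is the only exogenous variable
print(order_condition(K=1, k=0, m=2))
```

The same call with other counts of K, k and m classifies any other equation in a system.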
There is an instrument for Wage (Unemployment) ∴ can use IV.

In (2), as before,
    ∃ 2 endogenous variables (Price, Wage) ∴ m = 2
    ∃ 1 exogenous variable in the whole system (Unemployment) ∴ K = 1
but now
    ∃ 1 exogenous variable in equation (2) ∴ k = 1
Using (B), K − k ≥ m − 1:
    1 − 1 < 2 − 1
    0 < 1
so the condition fails and equation (2) is not identified. There is no instrument (extra exogenous variable) for Price anywhere else in the system of equations ∴ can't use IV.

Eg 3. Suppose now that
    Price = b0 + b1 Wage + e                                              (1)
    Wage = d0 + d1 Price + d2 Unemployment + d3 Productivity + v          (2)
ie we have added another exogenous variable, Productivity, to (2). Since it is exogenous we assume Cov(Productivity, v) = 0.

Now (2) is still not identified, since
    ∃ 2 endogenous variables (Price, Wage) ∴ m = 2
and now
    ∃ 2 exogenous variables in the whole system (Unemployment, Productivity) ∴ K = 2
    ∃ 2 exogenous variables in equation (2) (Unemployment, Productivity) ∴ k = 2
K − k ≥ m − 1:
    2 − 2 < 2 − 1
    0 < 1

But now in (1)
    ∃ 2 endogenous variables (Price, Wage) ∴ m = 2
and
    ∃ 2 exogenous variables in the whole system (Unemployment, Productivity) ∴ K = 2
    ∃ 0 exogenous variables in equation (1) ∴ k = 0
K − k ≥ m − 1:
    2 − 0 > 2 − 1
    2 > 1
Equation (1) is said to be over-identified (there are more instruments, ie external exogenous variables, than strictly necessary for IV estimation).

So which instrument to use for Wage in (1)?

If we use Unemployment then, as before, using (A)

    $\hat{b}_{IV} = \frac{Cov(Unemployment, Price)}{Cov(Unemployment, Wage)}$

If instead we use Productivity then

    $\hat{b}_{IV} = \frac{Cov(Productivity, Price)}{Cov(Productivity, Wage)}$

Which is best? Both will give consistent estimates of the true value (IV estimates are not exactly unbiased in small samples), but it is likely (especially in small samples) that the two estimates will be different (see example).

In practice use both at the same time. This
    - removes the possibility of conflicting estimates
    - is more efficient (smaller variance), at least in large samples.
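The choice between the two instruments can be illustrated with simulated data. The sketch below is an assumption-laden illustration, not the notes' data: the parameter values, the standard normal shocks and the variable names are all hypothetical. It solves the two structural equations jointly, then compares OLS with the two single-instrument IV estimates from (A).

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


random.seed(1)

# Hypothetical structural parameters, chosen only for illustration
b0, b1 = 1.0, 0.5                      # (1) Price = b0 + b1*Wage + e
d0, d1, d2, d3 = 2.0, 0.5, -1.0, 0.8   # (2) Wage = d0 + d1*Price + d2*Unemp + d3*Prod + v

unemp, prod, wage, price = [], [], [], []
for _ in range(20000):
    u, q = random.gauss(0, 1), random.gauss(0, 1)  # exogenous variables
    e, v = random.gauss(0, 1), random.gauss(0, 1)  # structural shocks
    # Solve (1) and (2) jointly for Wage, then obtain Price from (1)
    w = (d0 + d1 * b0 + d2 * u + d3 * q + d1 * e + v) / (1 - d1 * b1)
    p = b0 + b1 * w + e
    unemp.append(u)
    prod.append(q)
    wage.append(w)
    price.append(p)

b_ols = cov(wage, price) / cov(wage, wage)          # biased: Cov(Wage, e) != 0
b_iv_unemp = cov(unemp, price) / cov(unemp, wage)   # (A) with Z = Unemployment
b_iv_prod = cov(prod, price) / cov(prod, wage)      # (A) with Z = Productivity

print("true b1:", b1)
print("OLS:", round(b_ols, 3))
print("IV (unemployment):", round(b_iv_unemp, 3))
print("IV (productivity):", round(b_iv_prod, 3))
```

In this setup OLS overstates b1 because Wage is positively correlated with e, while both IV estimates land near the true value without being identical to each other.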
With small samples it is more efficient to use the minimum number of instruments.

Given
    Price = b0 + b1 Wage + e                                              (1)
we know
    Wage = d0 + d1 Price + d2 Unemployment + d3 Productivity + v          (2)
and that Wage is related to the exogenous variables only by
    Wage = g0 + g1 Unemployment + g2 Productivity + w                     (3)
(the equation which expresses the endogenous right hand side variable solely as a function of exogenous variables is said to be a reduced form).

The idea is then to estimate (3) and save the predicted values

    $\widehat{Wage} = \hat{g}_0 + \hat{g}_1 Unemployment + \hat{g}_2 Productivity$

and use this as the instrument for Wage in (1).

$\widehat{Wage}$ satisfies the properties of an instrument since it is clearly correlated with the variable of interest and, because it is a linear combination only of exogenous variables, it is uncorrelated with the residual e in (1) (by assumption).

This strategy is called Two Stage Least Squares and

    $\hat{b}_{2SLS} = \frac{Cov(\hat{X}, y)}{Cov(\hat{X}, X)} = \frac{Cov(\widehat{Wage}, Price)}{Cov(\widehat{Wage}, Wage)}$

Note: In many cases you will not have a simultaneous system. More than likely you will have just one equation, but there may well be endogenous RHS variables. In this case the principle is exactly the same. Find additional exogenous variables that are correlated with the problem variable but uncorrelated with the error (ie “as if” there were another equation in the system).
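The two stages can be sketched on simulated data. Everything here is an illustrative assumption (parameter values, normal shocks, the helper names `cov` and `fit_two`); it is not the notes' data set. The sketch fits the reduced form (3), forms the predicted wage, and checks that the covariance formula for 2SLS matches a second-stage OLS of Price on the predicted wage.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def fit_two(y, x1, x2):
    """OLS of y on a constant, x1 and x2; returns (g0, g1, g2)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    g1 = (s1y * s22 - s2y * s12) / det
    g2 = (s2y * s11 - s1y * s12) / det
    n = len(y)
    g0 = sum(y) / n - g1 * sum(x1) / n - g2 * sum(x2) / n
    return g0, g1, g2


random.seed(2)
b0, b1 = 1.0, 0.5                      # hypothetical values for equation (1)
d0, d1, d2, d3 = 2.0, 0.5, -1.0, 0.8   # hypothetical values for equation (2)

unemp, prod, wage, price = [], [], [], []
for _ in range(20000):
    u, q = random.gauss(0, 1), random.gauss(0, 1)
    e, v = random.gauss(0, 1), random.gauss(0, 1)
    w = (d0 + d1 * b0 + d2 * u + d3 * q + d1 * e + v) / (1 - d1 * b1)
    unemp.append(u)
    prod.append(q)
    wage.append(w)
    price.append(b0 + b1 * w + e)

# Stage 1: estimate the reduced form (3) and save predicted wages
g0, g1, g2 = fit_two(wage, unemp, prod)
wage_hat = [g0 + g1 * u + g2 * q for u, q in zip(unemp, prod)]

# Stage 2: use predicted wages as the instrument for Wage in (1)
b_2sls = cov(wage_hat, price) / cov(wage_hat, wage)
# Equivalent second-stage OLS slope of Price on predicted wages
b_second_stage = cov(wage_hat, price) / cov(wage_hat, wage_hat)

print("2SLS:", round(b_2sls, 3))
print("second-stage OLS:", round(b_second_stage, 3))
```

The two slopes coincide because the first-stage residual is uncorrelated in the sample with the fitted values, so Cov(fitted wage, Wage) equals Var(fitted wage).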
Example

[Figure: annual % change in nominal wages and annual % change in the price level, 1963 Q1 to 2000 Q4; vertical axis: % change; horizontal axis: time]

The data set prod.dta contains quarterly time series data on wage, price, unemployment and productivity changes.

The graph suggests that wages and prices move together over time, and suggests we may want to run a regression of inflation on wage changes where, by assumption (and nothing else), the direction of causality runs from changes in wages to inflation.

. reg inflat dwages if time>50

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  1,    99) =  381.84
       Model |  2534.57478     1  2534.57478           Prob > F      =  0.0000
    Residual |  657.143032    99  6.63780841           R-squared     =  0.7941
-------------+------------------------------           Adj R-squared =  0.7920
       Total |  3191.71781   100  31.9171781           Root MSE      =  2.5764

------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      dwages |   1.012592   .0518196    19.54   0.000     .9097704    1.115413
       _cons |  -1.609425   .5174281    -3.11   0.002    -2.636115   -.5827354
------------------------------------------------------------------------------

which suggests an almost one-for-one relation between inflation and wage changes over this period.

However you might equally run a regression of wage changes on prices (where now the implied direction of causality is from changes in the inflation rate to changes in wages).

. reg dwages inflat if time>50

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  1,    99) =  381.84
       Model |  1962.98474     1  1962.98474           Prob > F      =  0.0000
    Residual |   508.94602    99  5.14086889           R-squared     =  0.7941
-------------+------------------------------           Adj R-squared =  0.7920
       Total |  2471.93076   100  24.7193076           Root MSE      =  2.2673

------------------------------------------------------------------------------
      dwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    inflatio |    .784235   .0401334    19.54   0.000     .7046016    .8638684
       _cons |    3.04795   .3657581     8.33   0.000     2.322207    3.773694
------------------------------------------------------------------------------

This suggests wages grow at a constant rate of around 3% a year (the coefficient on the constant) and that each 1 percentage point increase in the inflation rate then adds a .78 percentage point increase in wages.

Which is the right specification? In a sense both, since wages affect prices but prices also affect wages. The 2 variables are interdependent and said to be endogenous. This means that Cov(X,u) ≠ 0, ie there is a correlation between right hand side variables and the residuals, which makes OLS estimates biased and inconsistent.

We need to instrument the endogenous right hand side variables, ie find a variable that is correlated with the suspect right hand side variable but uncorrelated with the error term.

Suppose, as in the example above, wages also depend on the unemployment rate and productivity growth as well as inflation.

. reg dwages inflat umprate prod if time>50

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  3,    97) =  133.16
       Model |  1988.96632     3  662.988773           Prob > F      =  0.0000
    Residual |  482.964442    97  4.97901486           R-squared     =  0.8046
-------------+------------------------------           Adj R-squared =  0.7986
       Total |  2471.93076   100  24.7193076           Root MSE      =  2.2314

------------------------------------------------------------------------------
      dwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    inflatio |   .7823925   .0423672   18.467   0.000     .6983054    .8664796
     umprate |   .1898947   .0966741    1.964   0.052    -.0019766    .3817661
        prod |  -.2960195   .1554679   -1.904   0.060    -.6045802    .0125412
       _cons |   2.117253   .8587953    2.465   0.015     .4127825    3.821724
------------------------------------------------------------------------------

(Strangely) over this period higher unemployment is associated with higher wages and higher productivity with lower wages, controlling for the effects of inflation.
(Can you think of reasons why this might be?)

Which instrument to use? Both should give the same estimate if the sample size is large enough, but in finite (small) samples the two IV estimates will be different.

Compare the IV estimate using unemployment as the instrument

. ivreg inflat (dwages=ump ) if time>50

Instrumental variables (2SLS) regression
------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      dwages |   1.601649   .3725813    4.299   0.000     .8623667    2.340931
       _cons |  -6.718597   3.254932   -2.064   0.042    -13.17709   -.2601064
------------------------------------------------------------------------------
Instrumented:  dwages
Instruments:   umprate

with that using productivity as the instrument

. ivreg inflat (dwages=prod ) if time>50

Instrumental variables (2SLS) regression
------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      dwages |   1.089042   .1544639    7.050   0.000      .782552    1.395532
       _cons |  -2.272514   1.364576   -1.665   0.099    -4.980128    .4351012
------------------------------------------------------------------------------
Instrumented:  dwages
Instruments:   prod

Both estimates are higher than the original OLS estimate, but differ widely from each other.

The best idea (which also gives more efficient estimates, ie ones with lower standard errors) is to use all the instruments at the same time, at least in large samples.

Use unemployment and productivity as instruments for wages in the original equation:

1. Regress the endogenous variable (wages) on the instruments (unemployment and productivity).

. reg dwages umprate prod if time>50

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  2,    98) =    6.54
       Model |  290.979911     2  145.489955           Prob > F      =  0.0022
    Residual |  2180.95085    98  22.2546005           R-squared     =  0.1177
-------------+------------------------------           Adj R-squared =  0.0997
       Total |  2471.93076   100  24.7193076           Root MSE      =  4.7175

------------------------------------------------------------------------------
      dwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     umprate |  -.1101753    .201477   -0.547   0.586    -.5099998    .2896493
        prod |  -.9147086   .3209619   -2.850   0.005    -1.551647   -.2777703
       _cons |   11.28755   1.481331    7.620   0.000     8.347894     14.2272
------------------------------------------------------------------------------

Save the predicted wage

. predict dwagehat      /* stata command to save predicted value of dep. var. */

2. Include this instead of wages on the right hand side of the inflation regression.

. reg inflat dwagehat if time>50

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  1,    99) =   13.41
       Model |  380.648731     1  380.648731           Prob > F      =  0.0004
    Residual |  2811.06908    99  28.3946372           R-squared     =  0.1193
-------------+------------------------------           Adj R-squared =  0.1104
       Total |  3191.71781   100  31.9171781           Root MSE      =  5.3287

------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    dwagehat |   1.143749   .3123825    3.661   0.000     .5239143    1.763584
       _cons |  -2.747013   2.760836   -0.995   0.322     -8.22511    2.731083
------------------------------------------------------------------------------

This gives a consistent estimate of the effect of wages on prices. Comparing with the original (biased) estimate below, we can see that the wage effect is now stronger (though the standard error of the IV estimate is larger than in OLS).

. reg inflat dwages if time>50

------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      dwages |   1.012592   .0518196   19.541   0.000     .9097704    1.115413
       _cons |  -1.609425   .5174281   -3.110   0.002    -2.636115   -.5827354
------------------------------------------------------------------------------

Remember that Stata does all this automatically using the ivreg command. The coefficients are identical to the two-step results above, but the standard errors differ: those from the manual second stage are not correct, because they are based on residuals computed with predicted rather than actual wages, so use the ones reported by ivreg.

. ivreg inflat (dwages=ump prod) if time>50, first

First-stage regressions
-----------------------

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  2,    98) =    6.54
       Model |  290.979911     2  145.489955           Prob > F      =  0.0022
    Residual |  2180.95085    98  22.2546005           R-squared     =  0.1177
-------------+------------------------------           Adj R-squared =  0.0997
       Total |  2471.93076   100  24.7193076           Root MSE      =  4.7175

------------------------------------------------------------------------------
      dwages |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     umprate |  -.1101753    .201477    -0.55   0.586    -.5099998    .2896493
        prod |  -.9147086   .3209619    -2.85   0.005    -1.551647   -.2777703
       _cons |   11.28755   1.481331     7.62   0.000     8.347894     14.2272
------------------------------------------------------------------------------

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  1,    99) =   53.86
       Model |  2492.05222     1  2492.05222           Prob > F      =  0.0000
    Residual |  699.665589    99  7.06732918           R-squared     =  0.7808
-------------+------------------------------           Adj R-squared =  0.7786
       Total |  3191.71781   100  31.9171781           Root MSE      =  2.6584

------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      dwages |   1.143749   .1558462     7.34   0.000     .8345162    1.452981
       _cons |  -2.747012   1.377368    -1.99   0.049     -5.48001   -.0140152
------------------------------------------------------------------------------
Instrumented:  dwages
Instruments:   umprate prod
------------------------------------------------------------------------------

If there is more than one potentially endogenous rhs variable in your equation, the order condition (above) tells us that you will have to find at least one different instrument for each endogenous rhs variable (eg 2 endogenous rhs variables require 2 different instruments). Again, each instrument should be correlated with the endogenous rhs variable it replaces, net of the other existing exogenous rhs variables.

Example
    Price = b0 + b1 Wage + b2 InterestRates + e          (1)
where now both wages and interest rates are probably endogenous (prices depend on both interest rates and wages, but wages and interest rates also depend on prices). OLS estimation of (1) will therefore give biased estimates.

To obtain consistent 2SLS estimates you need to

1) find one instrument for wages and a different instrument for interest rates

2) obtain predicted values for both wages and interest rates using these instruments (and any other exogenous variables there may be in the equation)

    $\widehat{Wage} = \hat{g}_0 + \hat{g}_1 Instrument\_1$
    $\widehat{InterestRates} = \hat{d}_0 + \hat{d}_1 Instrument\_2$

and then use these predicted values instead of the originals in (1) (in practice the computer will do this for you, and 2SLS as implemented by ivreg uses all the instruments in each first-stage regression, but you should always be aware of what exactly is going on).

Example of Poor Instruments & IV Estimation

Often a poor choice of instrument can make things much worse than the original OLS estimates. Consider the example of the effect of education on wages (taken from the data set video.dta). Policy makers are often interested in the costs and benefits of education.
Some people argue that education is endogenous (because it also picks up the effects of omitted variables like ability or motivation and so is correlated with the error term). Consider first the OLS estimates from a regression of log hourly wages on years of education.

. reg lhw yearsed

      Source |       SS       df       MS              Number of obs =    6076
-------------+------------------------------           F(  1,  6074) =  589.23
       Model |  246.491385     1  246.491385           Prob > F      =  0.0000
    Residual |  2540.90296  6074   .41832449           R-squared     =  0.0884
-------------+------------------------------           Adj R-squared =  0.0883
       Total |  2787.39434  6075  .458830344           Root MSE      =  .64678

------------------------------------------------------------------------------
         lhw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     yearsed |   .0759492   .0031288    24.27   0.000     .0698156    .0820828
       _cons |   5.468133   .0400159   136.65   0.000     5.389688    5.546579
------------------------------------------------------------------------------

1 extra year of education is associated with a 7.6% increase in earnings.

If endogeneity is a problem, then these estimates are biased (upward if ability and education are positively correlated; see lecture notes on omitted variable bias).

So try to instrument instead. For some reason you choose whether the individual owns a video recorder. To be a good instrument the variable should be
a) uncorrelated with the residual (so that it affects wages only through education) but
b) correlated with the endogenous right hand side variable (education).

As an informal check on a) you first examine the correlation between wages and videos in a regression.

. reg lhw video

      Source |       SS       df       MS              Number of obs =    6706
-------------+------------------------------           F(  1,  6704) =    0.20
       Model |  .089653333     1  .089653333           Prob > F      =  0.6576
    Residual |  3059.15176  6704  .456317387           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared = -0.0001
       Total |  3059.24142  6705  .456262702           Root MSE      =  .67551

------------------------------------------------------------------------------
         lhw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       video |  -.0178585   .0402898    -0.44   0.658    -.0968393    .0611223
       _cons |   6.444377   .0428575   150.37   0.000     6.360363    6.528391
------------------------------------------------------------------------------

The regression shows that video ownership is uncorrelated with wages, which is at least consistent with the first requirement (videos are relatively cheap now, so that ownership is more a matter of taste than income), although a zero correlation with wages cannot by itself prove that the instrument is uncorrelated with the residual.

However when you instrument years of education using video ownership

. ivreg lhw (yearsed=video)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =    6076
-------------+------------------------------           F(  1,  6074) =    0.03
       Model | -57.4179219     1 -57.4179219           Prob > F      =  0.8657
    Residual |  2844.81226  6074   .46835895           R-squared     =       .
-------------+------------------------------           Adj R-squared =       .
       Total |  2787.39434  6075  .458830344           Root MSE      =  .68437

------------------------------------------------------------------------------
         lhw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     yearsed |  -.0083832   .0495839    -0.17   0.866    -.1055853    .0888189
       _cons |    6.52326   .6204326    10.51   0.000     5.306992    7.739528
------------------------------------------------------------------------------
Instrumented:  yearsed
Instruments:   video
------------------------------------------------------------------------------

The IV estimate of the return to education is now negative (and insignificant). This does not appear very sensible.

The reason is that video ownership is hardly correlated with education, as the regression below shows.

. reg yearsed video

      Source |       SS       df       MS              Number of obs =   12334
-------------+------------------------------           F(  1, 12332) =    0.28
       Model |  1.91564494     1  1.91564494           Prob > F      =  0.5943
    Residual |  83278.6218 12332  6.75305075           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared = -0.0001
       Total |  83280.5375 12333  6.75265851           Root MSE      =  2.5987

------------------------------------------------------------------------------
     yearsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       video |  -.0500536   .0939784    -0.53   0.594     -.234266    .1341587
       _cons |   12.15383   .1029141   118.10   0.000      11.9521    12.35556
------------------------------------------------------------------------------

Moral: Always check the correlation of your instrument with the endogenous right hand side variable. The “first” option on Stata's ivreg command will always print the 1st stage of the 2SLS regression (ie the regression above), in which you can tell if the proposed instrument is correlated with the endogenous variable by simply looking at the t value.

Sometimes instruments may be statistically significantly different from zero in the 1st stage and still not be good enough. So there is a rule of thumb (at least in the case of a single endogenous variable) that you should only proceed with IV estimation if the F value on the 1st stage of 2SLS > 10. In the example above (F = 0.28) this is clearly not the case.

Testing Instrumental Validity (Overidentifying Restrictions)

If you have more instruments than endogenous right hand side variables (the equation is overidentified, hence the name for the test) then it is possible to test whether (some of the) instruments are valid, in the sense that they satisfy the assumption of being uncorrelated with the residual in the original model.
One way to do this would be to compute two different 2SLS estimates, one using one instrument and another using the other instrument (rather like in the above example on prices, wages, productivity and unemployment). If these estimates are radically different you might conclude that one (or both) of the instruments was invalid (not exogenous).

An implicit test of this, which avoids having to compute all of the possible IV estimates, is based on the following idea.

Given y = b0 + b1X + u and Cov(X,u) ≠ 0

If an instrument Z is valid (exogenous) it is uncorrelated with u. To test this simply regress u on all the possible instruments.

    u = d0 + d1Z1 + d2Z2 + … + dkZk + v

If the instruments are exogenous they should be uncorrelated with u, and so the coefficients d1 .. dk should all be zero (ie the Z variables have no explanatory power).

Since u is never observed we have to use a proxy for it, which turns out to be the residual from the 2SLS estimation

    $\hat{u}^{2sls} = y - \hat{b}_0^{2sls} - \hat{b}_1^{2sls} X$

(since this is a consistent estimate of the true unknown residuals).

So to Test Overidentifying Restrictions

1. Estimate the model by 2SLS and save the residuals
2. Regress these residuals on all the exogenous variables (including those X variables in the original equation that are not suspect) and save the R2
3. Compute N*R2
4. Under the null that all the instruments are uncorrelated with u, N*R2 ~ χ2 with L-k degrees of freedom (L is the number of instruments and k is the number of endogenous right hand side variables)

Note that you can only do this test if there are more instruments than endogenous right hand side variables.

Example: using the prod.dta file we can test whether the additional variables (unemployment and productivity) are valid instruments for wages.

1st do the two stage least squares regression using all the possible instruments

. ivreg inflat (dwages=prod ump) if time>50

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  1,    99) =   53.86
       Model |  2492.05222     1  2492.05222           Prob > F      =  0.0000
    Residual |  699.665589    99  7.06732918           R-squared     =  0.7808
-------------+------------------------------           Adj R-squared =  0.7786
       Total |  3191.71781   100  31.9171781           Root MSE      =  2.6584

------------------------------------------------------------------------------
    inflatio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      dwages |   1.143749   .1558462     7.34   0.000     .8345162    1.452981
       _cons |  -2.747012   1.377368    -1.99   0.049     -5.48001   -.0140152
------------------------------------------------------------------------------

Now save these 2SLS residuals

. predict ivres, resid

and regress these on all the exogenous variables in the system (in this case productivity and unemployment, but there may be situations where the original equation contained other exogenous variables, in which case include them here)

. reg ivres prod ump if time>50

      Source |       SS       df       MS              Number of obs =     101
-------------+------------------------------           F(  2,    98) =    2.75
       Model |  37.2070174     2  18.6035087           Prob > F      =  0.0687
    Residual |  662.458589    98  6.75978152           R-squared     =  0.0532
-------------+------------------------------           Adj R-squared =  0.0339
       Total |  699.665607   100  6.99665607           Root MSE      =    2.60

------------------------------------------------------------------------------
       ivres |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        prod |   .2554312   .1768927     1.44   0.152    -.0956065     .606469
     umprate |  -.2575159   .1110406    -2.32   0.022    -.4778724   -.0371594
       _cons |   1.557729   .8164104     1.91   0.059    -.0624104    3.177869
------------------------------------------------------------------------------

The test is N*R2 = 101*0.0532 = 5.37, which is ~ χ2(L-k = 2-1 = 1) (L = 2 instruments and k = 1 endogenous right hand side variable).

Since χ2(1) critical = 3.84 at the 5% level, the estimated value exceeds the critical value, so reject the null hypothesis that all of the instruments are valid (which is consistent with earlier results showing that the IV estimates varied widely depending on which instrument was used; one of them, probably unemployment since its IV estimate was far from the OLS one, was wrong).
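The four steps of the overidentification test can be sketched on simulated data. This is a hedged illustration only: the data-generating process, the parameter `gamma` (a hypothetical direct effect of the second instrument on the error) and the helper names `cov`, `fit_two`, `simulate` and `overid_stat` are all assumptions, not part of the notes or of Stata.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def fit_two(y, x1, x2):
    """OLS of y on a constant, x1 and x2; returns (g0, g1, g2)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    g1 = (s1y * s22 - s2y * s12) / det
    g2 = (s2y * s11 - s1y * s12) / det
    n = len(y)
    return sum(y) / n - g1 * sum(x1) / n - g2 * sum(x2) / n, g1, g2


def simulate(gamma, n=5000, seed=3):
    """Hypothetical data; gamma != 0 makes z2 an invalid instrument."""
    rng = random.Random(seed)
    y, x, z1, z2 = [], [], [], []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)      # instruments
        eps, v = rng.gauss(0, 1), rng.gauss(0, 1)    # shocks
        xi = a + b + 0.5 * eps + v                   # endogenous regressor
        y.append(1.0 + 0.5 * xi + gamma * b + eps)   # error = gamma*z2 + eps
        x.append(xi)
        z1.append(a)
        z2.append(b)
    return y, x, z1, z2


def overid_stat(y, x, z1, z2):
    """N*R2 from regressing the 2SLS residuals on both instruments."""
    g0, g1, g2 = fit_two(x, z1, z2)                       # first stage
    x_hat = [g0 + g1 * a + g2 * b for a, b in zip(z1, z2)]
    b1 = cov(x_hat, y) / cov(x_hat, x)                    # step 1: 2SLS
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x) / n
    resid = [yi - b0 - b1 * xi for yi, xi in zip(y, x)]   # uses actual x
    _, h1, h2 = fit_two(resid, z1, z2)                    # step 2
    r2 = (h1 * cov(z1, resid) + h2 * cov(z2, resid)) / cov(resid, resid)
    return n * r2                                         # step 3


stat_valid = overid_stat(*simulate(gamma=0.0))
stat_invalid = overid_stat(*simulate(gamma=0.5))
print("N*R2, both instruments valid:", round(stat_valid, 2))
print("N*R2, z2 invalid:", round(stat_invalid, 2))
print("reject at 5% if N*R2 > 3.84 (chi-squared, 1 degree of freedom)")
```

With two instruments and one endogenous variable there is one degree of freedom, as in the prod.dta example; the invalid-instrument case produces a statistic far above 3.84, while the valid case is a single draw from (approximately) a χ2(1) distribution.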