ECONOMICS 762: 2SLS Stata Example
L. Magee                                                            March, 2008

This example uses data in the file 2slseg.dta. It contains 2932 observations from a sample of young adult males in the U.S. in 1976. The variables are:

1. nearc2    =1 if lived near a 2 yr college in 1966
2. nearc4    =1 if lived near a 4 yr college in 1966
3. educ      years of schooling, 1976
4. age       age in years, 1976
5. smsa      =1 if lived in an SMSA, 1976 (SMSA = "Standard Metropolitan Statistical Area", basically indicates living in an urban area)
6. south     =1 if lived in southern U.S., 1976
7. wage      hourly wage in cents, 1976
8. married   =1 if married, 1976

This data set is used in the article "Using Geographic Variation in College Proximity to Estimate the Returns to Schooling," by D. Card (1994), in L.N. Christophides et al. (ed.), Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, and is used in the textbook Introductory Econometrics: A Modern Approach, second edition, by Jeffrey M. Wooldridge.

The goal is to estimate the percentage effect on the wage of getting an extra year of education, by estimating the coefficient on the EDUC variable in a regression equation with the log of WAGE as the dependent variable, controlling for other factors as follows:

LHS variable:  log of WAGE
RHS variables: EDUC, AGE, MARRIED, SMSA

This will be referred to as the wage equation. It is commonly thought that EDUC is correlated with the error term in the wage equation ("unobserved ability"). This would result in OLS over-estimating the effect of EDUC on the log wage. It is hard to find instruments, though. They need to be uncorrelated with the error term, yet help to predict years of schooling. In this example, some information on how far these young men lived from two types of colleges 10 years earlier is used as instruments.

Here is the do file without comments:

******************************************************************************
** 2SLS.do : March 2007
******************************************************************************
clear
capture log using "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLS.log", replace
use "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLSeg.dta"
summarize
gen lwage=log(wage)

** IV regression (2SLS) **
ivreg lwage age married smsa (educ = nearc2 nearc4)

** general version of Hausman test **
predict ivresid,residuals
est store ivreg
reg lwage educ age married smsa
hausman ivreg .,constant sigmamore df(1)

** Wu version of Hausman test **
quietly reg educ age married smsa nearc2 nearc4
predict educhat,xb
reg lwage educ age married smsa educhat

** overidentification test **
quietly reg ivresid age married smsa nearc2 nearc4
predict explresid,xb
matrix accum rssmat = explresid,noconstant
matrix accum tssmat = ivresid,noconstant
scalar nobs=e(N)
scalar x2=nobs*rssmat[1,1]/tssmat[1,1]
scalar pval=1-chi2(1,x2)
scalar list x2 pval

log close

Here is the same do file with comments about some of the commands inserted below them in italics:

******************************************************************************
** 2SLS.do : March 2007
******************************************************************************
clear
capture log using "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLS.log", replace
use "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLSeg.dta"
summarize
gen lwage=log(wage)

** IV regression (2SLS) **
ivreg lwage age married smsa (educ = nearc2 nearc4)

This ivreg command computes the 2SLS estimates. The dependent variable is lwage. The regressors that are assumed exogenous are left outside of the parentheses: age married smsa. The regressors that are assumed endogenous are in the parentheses to the left of the equals sign; there is just one in this example: educ. In the parentheses to the right of the equals sign are the instrumental variables, which are assumed exogenous and do not appear as regressors in the equation. Here they are nearc2 and nearc4. The key assumption is that distances from 2 yr and 4 yr colleges in 1966 are not correlated with the error in the wage equation, but do help to explain years of schooling in 1976.
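As an illustration of what ivreg is doing (not part of the original do file), here is a minimal sketch of the two stages carried out by hand. The variable name educhat2 is made up for this sketch. The coefficient estimates from the second regression match the 2SLS estimates, but its reported standard errors are not the correct 2SLS standard errors, which is one reason to use ivreg rather than doing this manually.

** illustration only: 2SLS done by hand in two OLS stages **
** stage 1: regress the endogenous regressor on all exogenous variables and instruments **
quietly reg educ age married smsa nearc2 nearc4
predict educhat2, xb
** stage 2: replace educ with its stage-1 fitted values **
reg lwage educhat2 age married smsa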
** general version of Hausman test **
predict ivresid,residuals

This post-estimation command stores the 2SLS residuals in a variable that I called ivresid.

est store ivreg

This post-estimation command stores some of the 2SLS results for later use in a Hausman test.

reg lwage educ age married smsa

This command estimates the same equation by OLS in order to compute the Hausman test statistic.

hausman ivreg .,constant sigmamore df(1)

This command computes the Hausman test statistic. The null hypothesis is that the OLS estimator is consistent. If it is accepted, we probably would prefer to use OLS instead of 2SLS. The constant option is necessary to tell Stata to include the constant term in the comparison of the two sets of estimates. The sigmamore option tells Stata to use the same estimate of the variance of the error term for both models. This is desirable here since the error term has the same interpretation in both models. The df(1) option tells Stata that the null distribution has one degree of freedom. Stata was able to figure this out when I left this option out, even though the Hausman test is comparing values of two five-element (not one-element) vectors. It probably knew this by finding only one non-zero eigenvalue of the 5-by-5 covariance matrix estimate that it calls (V_b-V_B) in the output. It is safer to impose the d.f. in the hausman command as above.

** Wu version of Hausman test **
quietly reg educ age married smsa nearc2 nearc4

The above OLS regression is done only to get the predicted value of educ to perform the Wu version of the Hausman test, as described on p. 82 of the Greene text, 5th edition. To reduce the amount of output in the log file, its output is suppressed by preceding the command with quietly.

predict educhat,xb
reg lwage educ age married smsa educhat

This OLS regression takes the original wage equation and adds the OLS predicted values of all of the (suspected) endogenous variables. Here there is only one, educhat. It was predicted using the full set of exogenous variables. The Wu version of the Hausman test is the standard significance test for the coefficient(s) on these added variables. Since there is just one here, use a two-sided t-test.
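As an aside (not part of the original do file), the same Wu test can be obtained from Stata's test command after the augmented regression. With one added variable the F-statistic it reports is just the square of the t-statistic on educhat; with several suspected endogenous regressors it would give the required joint test.

** aside: Wu test as an F test on the added fitted value(s) **
test educhat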
** overidentification test **
quietly reg ivresid age married smsa nearc2 nearc4

The uncentred R-squared of the above regression will be computed below to produce the overidentification test statistic, also known as the Sargan statistic. The dependent variable ivresid is the 2SLS residual vector, saved earlier.

predict explresid,xb

The predicted values from the regression are saved in order to calculate the uncentred R-squared.

matrix accum rssmat = explresid,noconstant
matrix accum tssmat = ivresid,noconstant

There is probably a neater way to do this, but I used these matrix accum commands with the noconstant option to compute two 1-by-1 matrices: rssmat (containing the sum of squares of explresid) and tssmat (containing the sum of squares of ivresid). Their single elements are extracted below as rssmat[1,1] and tssmat[1,1].

scalar nobs=e(N)

e(N) is the sample size, which was automatically stored earlier. This command stores that value in a scalar variable nobs.

scalar x2=nobs*rssmat[1,1]/tssmat[1,1]

This command computes the overidentification test statistic, called x2. It equals the sample size times the uncentred R-squared of the regression of ivresid on all of the exogenous variables and instruments.

scalar pval=1-chi2(1,x2)

This command computes the P-value using the Stata function chi2(n,x), which computes the area to the left of x under a chi-square distribution with n d.f. The one degree of freedom is the number of instruments (two) minus the number of endogenous regressors (one), i.e. the number of overidentifying restrictions.

scalar list x2 pval

This prints out the values of x2 and pval.

log close
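As another aside, these endogeneity and overidentification tests are automated in more recent versions of Stata that have the ivregress command. A minimal sketch, assuming such a version is available:

** aside: not in the original do file; requires a Stata version with ivregress **
ivregress 2sls lwage age married smsa (educ = nearc2 nearc4)
** tests of endogeneity of educ **
estat endogenous
** Sargan and Basmann overidentification tests **
estat overid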
Now the log file:

. use "C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLSeg.dta"

. 
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      nearc2 |      2932     .430764    .4952676          0          1
      nearc4 |      2932    .6828104    .4654613          0          1
        educ |      2932    13.25887    2.682475          1         18
         age |      2932    28.11937    3.134548         24         34
        smsa |      2932    .7060027    .4556684          0          1
-------------+--------------------------------------------------------
       south |      2932    .3915416    .4881783          0          1
        wage |      2932    577.1872    264.5756        100       2404
     married |      2932    .7141883    .4518772          0          1

. 
. gen lwage=log(wage)

. 
. ** IV regression (2SLS) **
. ivreg lwage age married smsa (educ = nearc2 nearc4)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =    2932
-------------+------------------------------           F(  4,  2927) =  122.71
       Model | -19.3235809     4 -4.83089521           Prob > F      =  0.0000
    Residual |  601.657409  2927  .205554291           R-squared     =       .
-------------+------------------------------           Adj R-squared =       .
       Total |  582.333829  2931  .198680938           Root MSE      =  .45338

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1386543   .0342091     4.05   0.000     .0715779    .2057307
         age |   .0366522   .0027297    13.43   0.000     .0312999    .0420044
     married |   .1937981   .0201602     9.61   0.000     .1542685    .2333277
        smsa |   .0976942   .0417188     2.34   0.019     .0158931    .1794953
       _cons |   3.184304   .4405519     7.23   0.000     2.320481    4.048127
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   age married smsa nearc2 nearc4
------------------------------------------------------------------------------

. 
. ** general version of Hausman test **
. predict ivresid,residuals

. est store ivreg

. reg lwage educ age married smsa

      Source |       SS       df       MS              Number of obs =    2932
-------------+------------------------------           F(  4,  2927) =  243.70
       Model |  145.487691     4  36.3719228           Prob > F      =  0.0000
    Residual |  436.846137  2927  .149247057           R-squared     =  0.2498
-------------+------------------------------           Adj R-squared =  0.2488
       Total |  582.333829  2931  .198680938           Root MSE      =  .38633

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0485886   .0027103    17.93   0.000     .0432742    .0539029
         age |   .0364856   .0023253    15.69   0.000     .0319262     .041045
     married |   .1759239   .0161841    10.87   0.000     .1441906    .2076572
        smsa |   .1962286   .0159841    12.28   0.000     .1648874    .2275698
       _cons |   4.326357    .074032    58.44   0.000     4.181197    4.471517
------------------------------------------------------------------------------

. hausman ivreg .,constant sigmamore df(1)

Note: the rank of the differenced variance matrix (1) does not equal the number
      of coefficients being tested (5); be sure this is what you expect, or
      there may be problems computing the test.  Examine the output of your
      estimators for anything unexpected and possibly consider scaling your
      variables so that the coefficients are on a similar scale.

                 ---- Coefficients ----
             |      (b)            (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     ivreg           .          Difference          S.E.
-------------+------------------------------------------------------------------
        educ |    .1386543       .0485886        .0900657        .0290232
         age |    .0366522       .0364856        .0001666        .0000537
     married |    .1937981       .1759239        .0178742        .0057599
        smsa |    .0976942       .1962286       -.0985344        .0317522
       _cons |    3.184304       4.326357       -1.142053        .3680211
------------------------------------------------------------------------------
                          b = consistent under Ho and Ha; obtained from ivreg
           B = inconsistent under Ha, efficient under Ho; obtained from regress

    Test:  Ho:  difference in coefficients not systematic

                  chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        9.63
                Prob>chi2 =      0.0019
                (V_b-V_B is not positive definite)

. 
. ** Wu version of Hausman test **
. quietly reg educ age married smsa nearc2 nearc4

. predict educhat,xb

. reg lwage educ age married smsa educhat

      Source |       SS       df       MS              Number of obs =    2932
-------------+------------------------------           F(  5,  2926) =  197.47
       Model |  146.924944     5  29.3849888           Prob > F      =  0.0000
    Residual |  435.408884  2926  .148806864           R-squared     =  0.2523
-------------+------------------------------           Adj R-squared =  0.2510
       Total |  582.333829  2931  .198680938           Root MSE      =  .38575

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0478031   .0027181    17.59   0.000     .0424736    .0531327
         age |   .0366522   .0023225    15.78   0.000     .0320982    .0412061
     married |   .1937981   .0171531    11.30   0.000     .1601647    .2274315
        smsa |   .0976942    .035496     2.75   0.006     .0280945    .1672939
     educhat |   .0908512   .0292331     3.11   0.002     .0335316    .1481708
       _cons |   3.184304   .3748395     8.50   0.000     2.449328     3.91928
------------------------------------------------------------------------------

. 
. ** overidentification test **
. quietly reg ivresid age married smsa nearc2 nearc4

. predict explresid,xb

. matrix accum rssmat = explresid,noconstant
(obs=2932)

. matrix accum tssmat = ivresid,noconstant
(obs=2932)

. scalar nobs=e(N)

. scalar x2=nobs*rssmat[1,1]/tssmat[1,1]

. scalar pval=1-chi2(1,x2)

. scalar list x2 pval
        x2 =  5.9600396
      pval =  .01463371

. 
. log close
       log:  C:\Documents and Settings\courses\761 and 762\w07\2SLS\2SLS.log
  log type:  text
 closed on:  13 Mar 2007, 16:28:25
------------------------------------------------------------------------------

The two Hausman tests give the same information. The general version is in chi-square form and equals 9.63, while the Wu version is a t-statistic, t = 3.11, which is essentially the square root of 9.63. They have the same P-value of .002, indicating rejection of the consistency of OLS and providing support for using 2SLS.
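As a check (not part of the original do file), the reported P-values can be reproduced from the test statistics with Stata's display command and its upper-tail probability functions chi2tail and ttail; 2926 is the residual d.f. in the Wu regression.

** aside: reproduce the reported P-values from the test statistics **
display chi2tail(1, 9.63)
display 2*ttail(2926, 3.11)
display chi2tail(1, 5.9600396)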
The overidentification test has a P-value of .014, which is significant at the 5% level but not at the 1% level. So at the 5% level we would reject the hypothesis that the instrumental variables nearc2 and nearc4 are exogenous. If no other instrumental variables are available, it is hard to know what to do about this. We could drop one of the two instruments, but we would not know whether that solves the problem, because then there would be no overidentifying restrictions left to test.
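The overidentification test only addresses the exogeneity requirement. The other requirement mentioned at the start, that the instruments help to predict years of schooling, can be checked with a joint significance test in the first-stage regression; a minimal sketch (not part of the original do file) using commands already seen above plus Stata's test command:

** aside: do the instruments help to predict educ in the first stage? **
quietly reg educ age married smsa nearc2 nearc4
test nearc2 nearc4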