SOCY7706: Longitudinal Data Analysis
Instructor: Natasha Sarkisian

Sample Selection Models

Sample selection issues arise when our dependent variable is only observed for a specific subgroup – for example, timing of retirement is only observed for those who retired, wages are only observed for those employed, etc. In such situations, we can implicitly imagine a selection equation that determines who is in the sample. A selection bias problem may arise because the error term in the outcome equation is correlated with the error term in the selection equation.

Selection bias is, in fact, similar to omitted variable bias. If omitted variables are uncorrelated with the variables already in the model, then the residuals are also uncorrelated with those variables and no assumptions are violated. But if omitted variables are correlated with predictors already in the model, then, because their effects are relegated to the residuals, the residuals become correlated with the predictors, which violates a regression assumption. Importantly, there is no selection problem if every variable influencing selection is controlled in the outcome equation. The problem is that most selection processes are complex, and the complete list of variables influencing selection usually cannot be measured. Therefore, in many cases when dealing with variables observed only for subsamples, we encounter a selection bias problem.

Some selection processes are such that selection depends on the value of the outcome itself – i.e., the outcome is only observed once a certain threshold is passed; e.g., data on the amount of financial support are only available for those who gave $500 or more. In those cases, the Tobit regression model can be used – and for longitudinal data, the xttobit command handles such censored samples. In cases where selection is determined by another variable (e.g., retirement, marriage, etc.), the Heckman sample selection model can be used. This model combines two equations – a first-stage probit (selection equation) and a second-stage OLS (outcome equation) – and can be estimated either by maximum likelihood or as a two-step model. It is typically necessary to identify at least one variable that affects selection but not the outcome; otherwise you will likely run into difficulties with model identification. Besides, it would not make substantive sense to estimate the selection equation separately otherwise (as mentioned above, if every variable influencing selection is already in the outcome equation, then your results are not biased due to selection). In longitudinal data, such variables can sometimes be taken from waves earlier than the analysis period.

. use http://www.sarkisian.net/socy7706/hrs_hours.dta

. reshape long r@workhours80 r@poorhealth r@married r@totalpar r@siblog h@childlg r@al
> lparhelptw, i(hhid pn) j(wave)
(note: j = 1 2 3 4 5 6 7 8 9)

Data                                   wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                         6591   ->   59319
Number of variables                      75   ->   20
j variable (9 values)                         ->   wave
xij variables:
         r1workhours80 r2workhours80 ... r9workhours80   ->   rworkhours80
            r1poorhealth r2poorhealth ... r9poorhealth   ->   rpoorhealth
                     r1married r2married ... r9married   ->   rmarried
                  r1totalpar r2totalpar ... r9totalpar   ->   rtotalpar
                        r1siblog r2siblog ... r9siblog   ->   rsiblog
                     h1childlg h2childlg ... h9childlg   ->   hchildlg
      r1allparhelptw r2allparhelptw ... r9allparhelptw   ->   rallparhelptw
-----------------------------------------------------------------------------
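The xt commands used later in these notes (xtprobit, xtreg, xtivreg, xtivreg2) require the panel structure to be declared; their output reports hhidpn as the group variable. If that combined person identifier and the panel declaration are not already in the dataset, something along these lines would be needed first (a minimal sketch, not part of the original session; the way hhidpn is constructed here is an assumption):

* create a combined person identifier and declare the panel structure
egen hhidpn = group(hhid pn)     // one id per hhid-pn combination (assumed construction)
xtset hhidpn wave                // panel id and time (wave) variable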
. gen rallparhelptw_0 = rallparhelptw
(21949 missing values generated)

. replace rallparhelptw=. if rtotalpar==0
(3815 real changes made, 3815 to missing)

. gen parents=(rtotalpar>0) if rtotalpar~=.
(7846 missing values generated)

. heckman rallparhelptw rmarried rsiblog hchildlg raedyrs female minority, select(
> parents = rmarried rsiblog hchildlg raedyrs female age minority rpoorhealth) clust
> er(hhid)

Iteration 0:   log pseudolikelihood = -103600.47
Iteration 1:   log pseudolikelihood = -103600.47

Heckman selection model                         Number of obs      =     43636
(regression model with sample selection)        Censored obs       =     16355
                                                Uncensored obs     =     27281

                                                Wald chi2(6)       =    246.11
Log pseudolikelihood = -103600.5                Prob > chi2        =    0.0000

                                (Std. Err. adjusted for 4653 clusters in hhid)
-------------------------------------------------------------------------------
              |               Robust
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
rallparhelptw |
     rmarried |  -.2657773   .1211056    -2.19   0.028      -.50314   -.0284147
      rsiblog |  -.4121636   .0706821    -5.83   0.000    -.5506979   -.2736293
     hchildlg |  -.0297258   .0711244    -0.42   0.676    -.1691271    .1096755
      raedyrs |   .0495642   .0112151     4.42   0.000      .027583    .0715455
       female |   .6881667   .0590432    11.66   0.000     .5724441    .8038894
     minority |  -.1268172   .0956978    -1.33   0.185    -.3143815    .0607471
        _cons |   1.571512   .2769189     5.67   0.000     1.028761    2.114263
--------------+----------------------------------------------------------------
parents       |
     rmarried |   .4770129   .0291541    16.36   0.000      .419872    .5341539
      rsiblog |  -.0795487   .0224883    -3.54   0.000     -.123625   -.0354725
     hchildlg |   .0022671   .0233168     0.10   0.923    -.0434329    .0479671
      raedyrs |  -.0028614   .0042837    -0.67   0.504    -.0112573    .0055346
       female |  -.1334079   .0164738    -8.10   0.000    -.1656959   -.1011199
          age |  -.0552654   .0038619   -14.31   0.000    -.0628345   -.0476962
     minority |   .0929978   .0320184     2.90   0.004     .0302429    .1557528
  rpoorhealth |    -.11186    .0232671    -4.81   0.000    -.1574628   -.0662572
        _cons |    3.25997   .2336726    13.95   0.000      2.80198     3.71796
--------------+----------------------------------------------------------------
      /athrho |   .0008414   .0493236     0.02   0.986     -.095831    .0975138
     /lnsigma |   1.357466   .0159669    85.02   0.000     1.326172    1.388761
--------------+----------------------------------------------------------------
          rho |   .0008414   .0493235                      -.0955387    .0972059
        sigma |   3.886333   .0620526                       3.766596    4.009877
       lambda |     .00327   .1916872                        -.37243      .37897
-------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):   chi2(1) =     0.00   Prob > chi2 = 0.9864
-------------------------------------------------------------------------------

Rho indicates whether the unobservables in the selection equation are correlated with the unobservables in the stage 2 (outcome) equation. If they are, we have biased estimates without correction (i.e., in a plain OLS model). If the unobservables in stage 1 are unrelated to the unobservables in stage 2, as they are here, then stage 1 does not affect the stage 2 results. Here, rho appears to be nonsignificant (based on the chi-square test). The adjusted standard error for the outcome equation regression is given by sigma, and the estimated selection coefficient is lambda = sigma×rho.
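As noted above, the same model can also be fit with the two-step estimator instead of ML. A minimal sketch with the same specification (not run here; note that the two-step estimator typically cannot be combined with the cluster() correction used above, so it is shown without it):

* two-step Heckman estimator (sketch)
heckman rallparhelptw rmarried rsiblog hchildlg raedyrs female minority, ///
    select(parents = rmarried rsiblog hchildlg raedyrs female age minority rpoorhealth) twostep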
Next, we try to interpret the estimated selection effect itself. For this, let's compute the average selection (or truncation) effect. First, we calculate the average value of the selection term for the sample of those who have living parents. For that we need to predict the inverse Mills ratio and get summary statistics for it:

. predict mills, mills
(15215 missing values generated)

. sum mills

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       mills |     44104    .6100459    .1600787   .2992288   1.199892

The average truncation effect is computed as lambda × the average value of mills.

. di .00327*.6100459
.00199485

This is very small – it indicates by how much the conditional hours of help to parents are shifted up (or down) due to the selection or truncation effect. Based on this, we can calculate how much higher a person's hours of help to parents are (for a person with sample-average characteristics) when that person is selected into the condition of having parents still living, versus what they would be for a randomly drawn individual from the population (with the same sample-average characteristics).

. di (exp(.00199485)-1)*100
.1996841

So it's just .2% higher – almost the same. In any case, this kind of calculation only makes sense if there is a statistically significant effect of selection, i.e., if the chi-square test for rho is statistically significant. If not, we conclude that there are no effects of selection.

Unfortunately, there is no heckman command designed specifically for longitudinal data. We added the cluster correction here, and that's a good first step; alternatively, to estimate an FE model with a Heckman correction, we can include dichotomies for individuals in both equations. More complex multistage models have been recommended as well, but the implementation is more complicated, e.g., the process suggested on Statalist (http://www.stata.com/statalist/archive/2005-04/msg00109.html); a sketch of these steps appears after the list:
1. Estimate the selection equation via xtprobit.
2. Get estimates of the Mills ratio.
3. Use the Mills ratio as an explanatory variable in the response equation where only the truncated dependent variable is considered, i.e., estimate this equation for selection = 1. This is estimated via xtreg, re (with the Hausman test to check for specification).
4. Bootstrap the standard errors to account for inter-equation correlation.
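A minimal sketch of steps 1–3 in Stata, using the variables from the example above (the names xb_sel and mills2 and the exact specification are assumptions for illustration; step 4, bootstrapping both stages together so that the standard errors reflect the estimated Mills ratio, is omitted here):

* Step 1: random-effects probit selection equation
xtprobit parents rmarried rsiblog hchildlg raedyrs female age minority rpoorhealth, re

* Step 2: inverse Mills ratio from the linear predictor of the selection equation
predict xb_sel, xb
gen mills2 = normalden(xb_sel)/normal(xb_sel)

* Step 3: random-effects outcome equation for those selected (parents==1),
* with the Mills ratio added as a regressor
xtreg rallparhelptw rmarried rsiblog hchildlg raedyrs female minority mills2 if parents==1, re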
Models with Endogenous Independent Variables

When there are concerns about reverse causality, or about the kind of spurious relationship where a third variable affects both the DV and an IV but that third variable is not measured and cannot be explicitly included, instrumental variables approaches can be used to avoid endogeneity bias. Instrumental variables should be selected so that they have an effect on the endogenous IV but no direct effect on the DV.

[Path diagram omitted: the instrument predicts the endogenous regressor (path A), and the endogenous regressor predicts the outcome; the instrument has no direct path to the outcome.]

We will deal with cases with 3 waves or more; if you have 2 waves, you can run IV models using the ivreg and ivreg2 commands. We will again use an example from the HRS dataset.

. xtivreg rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minor
> ity (rallparhelptw= rtotalpar rsiblog), fe

Fixed-effects (within) IV regression            Number of obs      =     30541
Group variable: hhidpn                          Number of groups   =      6243

R-sq:  within  = .                              Obs per group: min =         1
       between = 0.0063                                        avg =       4.9
       overall = 0.0100                                        max =         9

                                                Wald chi2(4)       =  10110.96
corr(u_i, Xb)  = -0.5088                        Prob > chi2        =    0.0000

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.09954   .6178209   -17.97   0.000    -12.31044    -9.88863
 rpoorhealth |  -3.911294   .9392181    -4.16   0.000    -5.752128   -2.070461
    rmarried |  -8.144087   1.660778    -4.90   0.000    -11.39915   -4.889022
    hchildlg |   1.608275   2.026632     0.79   0.427     -2.36385      5.5804
     raedyrs |  (omitted)
      female |  (omitted)
         age |  (omitted)
    minority |  (omitted)
       _cons |    46.8239   2.685722    17.43   0.000     41.55998    52.08782
-------------+----------------------------------------------------------------
     sigma_u |   33.23378
     sigma_e |  41.078617
         rho |   .3955978   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(6242,24294) =     0.76      Prob > F     = 1.0000
------------------------------------------------------------------------------
Instrumented:   rallparhelptw
Instruments:    rpoorhealth rmarried hchildlg raedyrs female age minority
                rtotalpar rsiblog
------------------------------------------------------------------------------

. xtivreg rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minor
> ity (rallparhelptw= rtotalpar rsiblog), re

G2SLS random-effects IV regression              Number of obs      =     30541
Group variable: hhidpn                          Number of groups   =      6243

R-sq:  within  = 0.0192                         Obs per group: min =         1
       between = 0.0805                                        avg =       4.9
       overall = 0.0477                                        max =         9

                                                Wald chi2(8)       =   2843.18
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -5.440523   .6434792    -8.45   0.000    -6.701719   -4.179327
 rpoorhealth |  -11.96851   .4269982   -28.03   0.000    -12.80541    -11.1316
    rmarried |  -3.793056   .5580382    -6.80   0.000    -4.886791   -2.699321
    hchildlg |  -.9219651   .3165478    -2.91   0.004    -1.542387   -.3015429
     raedyrs |   .8746774   .0731375    11.96   0.000     .7313306    1.018024
      female |  -7.071149   .5808851   -12.17   0.000    -8.209663   -5.932635
         age |  -1.319998   .0553041   -23.87   0.000    -1.428392   -1.211604
    minority |  -.6250384   .4461891    -1.40   0.161    -1.499553    .2494762
       _cons |   104.2443   3.321154    31.39   0.000     97.73494    110.7536
-------------+----------------------------------------------------------------
     sigma_u |          0
     sigma_e |     41.082
         rho |          0   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   rallparhelptw
Instruments:    rpoorhealth rmarried hchildlg raedyrs female age minority
                rtotalpar rsiblog
------------------------------------------------------------------------------

. net search ivreg2
Install st0030_3 from http://www.stata-journal.com/software/sj7-4

. net search xtivreg2
Install xtivreg2 from http://fmwww.bc.edu/RePEc/bocode/x
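Before turning to xtivreg2 and its formal diagnostics, it can be instructive to inspect the first stage directly, i.e., whether the excluded instruments predict the endogenous regressor within person (a sketch, not part of the original session; the time-invariant controls are left out because they drop out of the within estimator anyway):

* first-stage fixed-effects regression of the endogenous regressor on the
* excluded instruments and the time-varying included regressors
xtreg rallparhelptw rtotalpar rsiblog rpoorhealth rmarried hchildlg, fe
test rtotalpar rsiblog      // joint test of the excluded instruments ("path A")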
. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age mino
> rity (rallparhelptw= rtotalpar rsiblog), fe

Warning - singleton groups detected.  445 observation(s) not used.
Warning - collinearities detected
Vars dropped:  raedyrs female age minority

FIXED EFFECTS ESTIMATION
------------------------
Number of groups =        5798                  Obs per group: min =         2
                                                               avg =       5.2
                                                               max =         9

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =    30096
                                                      F(  4, 24294) =    99.44
                                                      Prob > F      =   0.0000
Total (centered) SS     =  6411725.312                Centered R2   =  -5.3938
Total (uncentered) SS   =  6411725.312                Uncentered R2 =  -5.3938
Residual SS             =  40994978.63                Root MSE      =    41.08

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.09954   .6177701   -17.97   0.000    -12.31034   -9.888729
 rpoorhealth |  -3.911294   .9391408    -4.16   0.000    -5.751976   -2.070612
    rmarried |  -8.144087   1.660641    -4.90   0.000    -11.39888    -4.88929
    hchildlg |   1.608275   2.026465     0.79   0.427    -2.363523    5.580073
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):         345.874
                                                   Chi-sq(2) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              175.398
Stock-Yogo weak ID test critical values: 10% maximal IV size             19.93
                                         15% maximal IV size             11.59
                                         20% maximal IV size              8.75
                                         25% maximal IV size              7.25
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           3.351
                                                   Chi-sq(1) P-val =    0.0672
------------------------------------------------------------------------------
Instrumented:         rallparhelptw
Included instruments: rpoorhealth rmarried hchildlg
Excluded instruments: rtotalpar rsiblog
Dropped collinear:    raedyrs female age minority
------------------------------------------------------------------------------

The coefficients are said to be exactly identified if there are just enough instruments to estimate them, or overidentified if there are more than enough instruments to estimate them (in which case we can test whether the instruments are valid, which is known as a test of the "overidentifying restrictions"). They are underidentified if there are too few instruments to estimate them (there should be at least as many instruments as endogenous IVs) or if the instruments do not predict the endogenous IVs; in either case you need to find more (or better) instruments.

Underidentification test: Evaluates whether the excluded instruments predict the endogenous regressor. The null hypothesis implies that the excluded instruments do not predict the endogenous IV (i.e., path A is not significant).

Weak identification test: The weak-instruments problem arises when the correlations between the endogenous regressors and the excluded instruments are nonzero but small (i.e., path A is significant but the relationship is weak). Rejection of the null hypothesis indicates the absence of a weak-instruments problem. The null hypothesis being tested is that the estimator is weakly identified, in the sense that it is subject to bias that the investigator finds unacceptably large. Under weak identification, the Wald test for beta rejects too often. The test statistic is based on the rejection rate r (10%, 20%, etc.) that the researcher is willing to tolerate if the true rejection rate should be the standard 5%; weak instruments are defined as instruments that will lead to a rejection rate of r when the true rejection rate is 5%. Stock and Yogo (2005) have tabulated critical values for their two weak-identification tests, and some of these critical values are listed in the Stata output. Usually we get the Cragg-Donald Wald F statistic here, but if we specify the robust, cluster(), or bw() options in xtivreg2, the reported weak-instruments test statistic is a Wald F statistic based on the Kleibergen-Paap rk statistic. When using the rk statistic to test for weak identification, we should apply caution in using the critical values compiled by Stock and Yogo (2005), or refer to the older "rule of thumb" of Staiger and Stock (1997), which says that the F statistic should be at least 10 for weak identification not to be considered a problem. If a weak-instruments problem does arise, the best solution is to look for different instruments, as statistical inference for IV regression in such a case can be severely misleading. Using more instruments is not a solution, because the biases of instrumental variables estimators increase with the number of instruments. When the instruments are only weakly correlated with the endogenous regressors, some Monte Carlo evidence suggests that the LIML estimator (liml option in xtivreg2) performs better than the 2SLS (default) and GMM (gmm2s option) estimators; however, the LIML estimator may result in confidence intervals that are somewhat larger than those from the 2SLS estimator.
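If the diagnostics here had pointed to weak instruments, the same equation could be re-estimated by LIML using the liml option just mentioned. A sketch with the same specification as above (not run here):

* LIML version of the fixed-effects IV model (sketch)
xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minority ///
    (rallparhelptw = rtotalpar rsiblog), fe liml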
Overidentification test (test of overidentifying restrictions): In addition to the requirement that instrumental variables be correlated with the endogenous regressors, the instruments must also be uncorrelated with the error term of the outcome equation, which also means that the instruments must not be correlated with the DV above and beyond their indirect relationship to the DV via the endogenous regressors (after taking into account all the other controls). If that assumption is violated, one or more instruments should be added to the equation predicting the outcome because they have direct relationships to it. Thus, a significant test statistic could represent either an invalid instrument or an incorrectly specified equation for the main outcome. We can only test this assumption if the model is overidentified, meaning that the number of additional instruments exceeds the number of endogenous regressors; that is why this test is called a test of overidentifying restrictions. If the model is just identified, we cannot perform this test. If we fail to reject the null, that indicates that this set of instruments is appropriate, while a rejection of the null indicates that the instruments may not be valid. It is also possible, however, for the test to turn out significant (i.e., to reject the null) if the error terms are heteroskedastic, so that alternative possibility should be examined before we conclude that our instruments are not valid.
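Relatedly, when one particular instrument is suspect, the orthog() option of xtivreg2 reports a difference-in-Sargan (C) statistic testing whether that instrument can be treated as exogenous. A sketch (not run here; rsiblog is singled out purely for illustration):

* C-statistic test of the exogeneity of one instrument (sketch)
xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minority ///
    (rallparhelptw = rtotalpar rsiblog), fe orthog(rsiblog)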
Endogeneity test: We can also conduct an endogeneity test to check whether we can simply treat a given variable as exogenous. If an endogenous regressor is in fact exogenous, the OLS estimator is more efficient and preferable to the instrumental variables approach.

. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age mino
> rity (rallparhelptw= rtotalpar rsiblog), fe endogtest(rallparhelptw)

Warning - singleton groups detected.  445 observation(s) not used.
Warning - collinearities detected
Vars dropped:  raedyrs female age minority

FIXED EFFECTS ESTIMATION
------------------------
Number of groups =        5798                  Obs per group: min =         2
                                                               avg =       5.2
                                                               max =         9

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =    30096
                                                      F(  4, 24294) =    99.44
                                                      Prob > F      =   0.0000
Total (centered) SS     =  6411725.312                Centered R2   =  -5.3938
Total (uncentered) SS   =  6411725.312                Uncentered R2 =  -5.3938
Residual SS             =  40994978.63                Root MSE      =    41.08

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.09954   .6177701   -17.97   0.000    -12.31034   -9.888729
 rpoorhealth |  -3.911294   .9391408    -4.16   0.000    -5.751976   -2.070612
    rmarried |  -8.144087   1.660641    -4.90   0.000    -11.39888    -4.88929
    hchildlg |   1.608275   2.026465     0.79   0.427    -2.363523    5.580073
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):         345.874
                                                   Chi-sq(2) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              175.398
Stock-Yogo weak ID test critical values: 10% maximal IV size             19.93
                                         15% maximal IV size             11.59
                                         20% maximal IV size              8.75
                                         25% maximal IV size              7.25
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           3.351
                                                   Chi-sq(1) P-val =    0.0672
-endog- option:
Endogeneity test of endogenous regressors:                             1968.086
                                                   Chi-sq(1) P-val =    0.0000
Regressors tested:    rallparhelptw
------------------------------------------------------------------------------
Instrumented:         rallparhelptw
Included instruments: rpoorhealth rmarried hchildlg
Excluded instruments: rtotalpar rsiblog
Dropped collinear:    raedyrs female age minority
------------------------------------------------------------------------------

The null hypothesis that the endogenous regressor is exogenous is rejected, so we should indeed treat rallparhelptw as endogenous and instrument for it.

Autocorrelation and heteroskedasticity: To estimate the model adjusted for autocorrelation and heteroskedasticity, we use GMM2S estimation (the two-step efficient generalized method of moments (GMM) estimator), along with the robust option to deal with heteroskedasticity and the bw() (bandwidth) option for the autocorrelation correction:
. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age mino
> rity (rallparhelptw= rtotalpar rsiblog), fe robust bw(3) gmm2s

Warning - singleton groups detected.  445 observation(s) not used.
Warning - collinearities detected
Vars dropped:  raedyrs female age minority

FIXED EFFECTS ESTIMATION
------------------------
Number of groups =        5798                  Obs per group: min =         2
                                                               avg =       5.2
                                                               max =         9

2-Step GMM estimation
---------------------

Estimates efficient for arbitrary heteroskedasticity and autocorrelation
Statistics robust to heteroskedasticity and autocorrelation
  kernel=Bartlett; bandwidth=   3
  time variable (t):  wave
  group variable (i): hhidpn

                                                      Number of obs =    30096
                                                      F(  4, 24294) =    65.94
                                                      Prob > F      =   0.0000
Total (centered) SS     =  6411725.312                Centered R2   =  -5.3711
Total (uncentered) SS   =  6411725.312                Uncentered R2 =  -5.3711
Residual SS             =  40849749.57                Root MSE      =       41

------------------------------------------------------------------------------
             |               Robust
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.07742   .7723191   -14.34   0.000    -12.59114   -9.563703
 rpoorhealth |  -3.843958   1.031333    -3.73   0.000    -5.865333   -1.822583
    rmarried |  -8.165251   1.912089    -4.27   0.000    -11.91288   -4.417625
    hchildlg |   1.701042   1.919486     0.89   0.376    -2.061082    5.463165
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):             212.701
                                                   Chi-sq(2) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Kleibergen-Paap rk Wald F statistic):         111.969
Stock-Yogo weak ID test critical values: 10% maximal IV size             19.93
                                         15% maximal IV size             11.59
                                         20% maximal IV size              8.75
                                         25% maximal IV size              7.25
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):          2.946
                                                   Chi-sq(1) P-val =    0.0861
------------------------------------------------------------------------------
Instrumented:         rallparhelptw
Included instruments: rpoorhealth rmarried hchildlg
Excluded instruments: rtotalpar rsiblog
Dropped collinear:    raedyrs female age minority
------------------------------------------------------------------------------
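As noted earlier, the cluster() option is another way to make the statistics robust to within-person correlation (and arbitrary heteroskedasticity), instead of the bw() kernel correction. A sketch of that variant (not run here):

* cluster-robust two-step GMM version of the fixed-effects IV model (sketch)
xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minority ///
    (rallparhelptw = rtotalpar rsiblog), fe cluster(hhidpn) gmm2s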