SOCY7706: Longitudinal Data Analysis
Instructor: Natasha Sarkisian
Sample Selection Models
Sample selection issues arise when our dependent variable is only observed for a specific
subgroup – for example, timing of retirement is only observed for those retired; wages are only
observed for those employed, etc. In such situations, implicitly, we can imagine the existence of
a selection equation that determines who is in the sample.
A selection bias problem may arise because the error term in the outcome equation is correlated
with the error term in the selection equation. Selection bias is, in fact, similar to omitted variable
bias: if omitted variables are uncorrelated with the variables already in the model, then the
residuals also won't be correlated with the variables in the model and no assumptions are violated;
but if omitted variables are correlated with predictors already in the model, then, since their
effects are relegated to the residuals, the residuals are also correlated with the predictors, which
violates a regression assumption. Importantly, there is no selection problem if every variable
influencing selection is controlled in the outcome equation. The problem is that most selection
processes are complex, and the complete list of variables influencing selection usually cannot be
measured. Therefore, in many cases when dealing with variables observed only for subsamples, we
encounter a selection bias problem.
Some selection processes are such that selection depends on the value of the outcome itself – i.e.,
the outcome is only observed once a certain threshold is passed; e.g., data on the amount of
financial support are only available for those who gave $500 or more. In those cases, a Tobit
regression model can be used – and for longitudinal data, the xttobit command can be used for such
censored samples.
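For example, a minimal sketch of that setup (with hypothetical variable names, not from our dataset): suppose support is recorded only when the amount given is $500 or more, and id and wave identify the panel.

. xtset id wave
. * amounts below $500 are unobserved, so the outcome is left-censored at 500
. xttobit support x1 x2, ll(500)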
In cases where selection is determined by another variable (e.g., retirement, marriage, etc.), the
Heckman sample selection model can be used. This model combines two models -- a first-stage
probit (selection equation) and a second-stage OLS (outcome equation) -- and can be estimated
either by ML or as a two-step model. It is typically necessary to identify at least one variable that
affects selection but not the outcome – otherwise you will likely run into difficulties with model
identification; besides, it would not make substantive sense to estimate the selection equation
separately otherwise (as mentioned above, if every variable influencing selection is already in the
outcome equation, then your results are not biased due to selection). In longitudinal data, such
variables can sometimes be taken from waves earlier than the analysis period.
. use http://www.sarkisian.net/socy7706/hrs_hours.dta
. reshape long r@workhours80 r@poorhealth r@married r@totalpar r@siblog h@childlg r@al
> lparhelptw, i(hhid pn) j(wave)
(note: j = 1 2 3 4 5 6 7 8 9)
Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                     6591   ->   59319
Number of variables                  75   ->   20
j variable (9 values)                     ->   wave
xij variables:
          r1workhours80 r2workhours80 ... r9workhours80 -> rworkhours80
             r1poorhealth r2poorhealth ... r9poorhealth -> rpoorhealth
                      r1married r2married ... r9married -> rmarried
                   r1totalpar r2totalpar ... r9totalpar -> rtotalpar
                         r1siblog r2siblog ... r9siblog -> rsiblog
                      h1childlg h2childlg ... h9childlg -> hchildlg
       r1allparhelptw r2allparhelptw ... r9allparhelptw -> rallparhelptw
-----------------------------------------------------------------------------

. gen rallparhelptw_0= rallparhelptw
(21949 missing values generated)
. replace rallparhelptw=. if rtotalpar==0
(3815 real changes made, 3815 to missing)
. gen parents=(rtotalpar>0) if rtotalpar~=.
(7846 missing values generated)
. heckman rallparhelptw rmarried rsiblog hchildlg raedyrs female minority, select(
> parents = rmarried rsiblog hchildlg raedyrs female age minority rpoorhealth) clust
> er(hhid)
Iteration 0:   log pseudolikelihood = -103600.47
Iteration 1:   log pseudolikelihood = -103600.47

Heckman selection model                         Number of obs      =     43636
(regression model with sample selection)        Censored obs       =     16355
                                                Uncensored obs     =     27281

                                                Wald chi2(6)       =    246.11
Log pseudolikelihood = -103600.5                Prob > chi2        =    0.0000

                                 (Std. Err. adjusted for 4653 clusters in hhid)
------------------------------------------------------------------------------
              |               Robust
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+---------------------------------------------------------------
rallparhelptw |
     rmarried |  -.2657773   .1211056    -2.19   0.028      -.50314   -.0284147
      rsiblog |  -.4121636   .0706821    -5.83   0.000    -.5506979   -.2736293
     hchildlg |  -.0297258   .0711244    -0.42   0.676    -.1691271    .1096755
      raedyrs |   .0495642   .0112151     4.42   0.000      .027583    .0715455
       female |   .6881667   .0590432    11.66   0.000     .5724441    .8038894
     minority |  -.1268172   .0956978    -1.33   0.185    -.3143815    .0607471
        _cons |   1.571512   .2769189     5.67   0.000     1.028761    2.114263
--------------+---------------------------------------------------------------
parents       |
     rmarried |   .4770129   .0291541    16.36   0.000      .419872    .5341539
      rsiblog |  -.0795487   .0224883    -3.54   0.000     -.123625   -.0354725
     hchildlg |   .0022671   .0233168     0.10   0.923    -.0434329    .0479671
      raedyrs |  -.0028614   .0042837    -0.67   0.504    -.0112573    .0055346
       female |  -.1334079   .0164738    -8.10   0.000    -.1656959   -.1011199
          age |  -.0552654   .0038619   -14.31   0.000    -.0628345   -.0476962
     minority |   .0929978   .0320184     2.90   0.004     .0302429    .1557528
  rpoorhealth |    -.11186    .0232671    -4.81   0.000    -.1574628   -.0662572
        _cons |    3.25997   .2336726    13.95   0.000      2.80198     3.71796
--------------+---------------------------------------------------------------
      /athrho |   .0008414   .0493236     0.02   0.986     -.095831    .0975138
     /lnsigma |   1.357466   .0159669    85.02   0.000     1.326172    1.388761
--------------+---------------------------------------------------------------
          rho |   .0008414   .0493235                     -.0955387    .0972059
        sigma |   3.886333   .0620526                      3.766596    4.009877
       lambda |     .00327   .1916872                       -.37243      .37897
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):   chi2(1) =     0.00   Prob > chi2 = 0.9864
------------------------------------------------------------------------------
Rho indicates whether the unobservables in the selection model are correlated with the unobservables
in the stage 2 model. If they are, our estimates are biased without the correction (or in an OLS
model). If the unobservables in stage 1 are unrelated to the unobservables in stage 2, as they are
here, then stage 1 does not affect the stage 2 results. Here, rho appears to be non-significant (based on the chi-square test).
The adjusted standard error for the outcome equation regression is given by sigma. The
estimated selection coefficient lambda = sigma×rho.
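As a quick check, we can verify this relationship using the results stored by heckman right after estimation (a sketch, assuming the usual stored scalars e(rho), e(sigma), and e(lambda) for the ML version):

. di e(rho)*e(sigma)
. di e(lambda)

Up to rounding, both should match the lambda reported in the output above.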
Next, we try to interpret the estimated selection effect itself. For this, let’s compute the average
selection (or truncation) effect. First, let’s calculate the average value for the selection term for
the sample of those who have living parents. For that we need to predict the inverse Mills ratio
and get summary stats for it:
. predict mills, mills
(15215 missing values generated)
. sum mills
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       mills |     44104    .6100459    .1600787   .2992288   1.199892
The average truncation effect is computed as lambda×average mills value.
. di .00327*.6100459
.00199485
This is very small – it indicates by how much the conditional hours of help to parents are shifted
up (or down) due to the selection or truncation effect. Based on this, we can calculate how much
higher a person's hours of help to parents are (for a person with sample-average characteristics)
when that person is selected into the condition of having parents still living vs. what they would
be for a randomly drawn individual from the population (with the same sample-average
characteristics).
. di (exp(.00199485)-1)*100
.1996841
So it's just .2% higher – almost the same. In any case, this kind of calculation only makes sense
if there is a statistically significant effect of selection, i.e., if the chi-square test for rho is
statistically significant. If it is not, we would conclude that there is no evidence of selection effects.
Unfortunately, there is no heckman command designed specifically for longitudinal data; adding
the cluster correction, as we did here, is a good first step. Alternatively, in order to estimate a FE
model with a Heckman correction, we can include dichotomies for individuals in both equations. More
complex multistage models have been recommended as well, but their implementation is more
complicated; e.g., the process suggested on Statalist (a code sketch follows the list below):
http://www.stata.com/statalist/archive/2005-04/msg00109.html
1. Estimate the selection equation via xtprobit.
2. Get estimates of the Mills ratio.
3. Use the Mills ratio as an explanatory variable in the outcome equation, estimated only for the
selected observations (selection = 1), via xtreg, re (with the Hausman test to check the
specification).
4. Bootstrap the standard errors to account for inter-equation correlation.
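Here is a minimal sketch of these four steps using the variables from our example; it only illustrates the logic, and the details (especially the bootstrap wrapper in step 4 and the RE vs. FE choice) would need to be worked out for a real analysis:

. xtset hhidpn wave
. * Step 1: selection equation
. xtprobit parents rmarried rsiblog hchildlg raedyrs female age minority rpoorhealth, re
. * Step 2: inverse Mills ratio from the linear prediction
. predict xb_sel, xb
. gen mills_re = normalden(xb_sel)/normal(xb_sel)
. * Step 3: outcome equation on the selected sample, adding the Mills ratio as a regressor
. xtreg rallparhelptw rmarried rsiblog hchildlg raedyrs female minority mills_re if parents==1, re
. * Step 4: wrap steps 1-3 in a program and bootstrap it, resampling by hhidpn,
. *         to obtain standard errors that account for inter-equation correlation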
Models with Endogenous Independent Variables
When there are concerns about reverse causality, or about the kind of spurious relationship where a
third variable affects both the DV and the IV but that third variable is not measured and cannot be
explicitly included, instrumental variables approaches can be used to avoid endogeneity bias.
Instrumental variables should be selected so that they have an effect on the endogenous IV, but are
not supposed to have any direct effect on the DV.
[Path diagram with nodes Instrument, Endogenous regressor, and Outcome, and paths labeled A through D; path A is the effect of the instrument on the endogenous regressor.]
We will deal with cases with 3 waves or more; if you have only 2 waves, you can run IV models
using the ivreg and ivreg2 commands (see the sketch below). We will again use an example from the HRS dataset.
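As an aside, a minimal sketch of the 2-wave case (hypothetical variable names, not run on our data): with two waves one could, for instance, difference each variable across the waves and estimate a cross-sectional IV model on the differences.

. * d_y, d_x, d_endog, d_z1, d_z2 are wave 2 minus wave 1 differences
. ivreg2 d_y d_x (d_endog = d_z1 d_z2), robust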
. xtivreg rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minor
> ity (rallparhelptw= rtotalpar rsiblog), fe

Fixed-effects (within) IV regression            Number of obs      =     30541
Group variable: hhidpn                          Number of groups   =      6243

R-sq:  within  =      .                         Obs per group: min =         1
       between = 0.0063                                        avg =       4.9
       overall = 0.0100                                        max =         9

                                                Wald chi2(4)       =  10110.96
corr(u_i, Xb)  = -0.5088                        Prob > chi2        =    0.0000

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.09954   .6178209   -17.97   0.000    -12.31044    -9.88863
 rpoorhealth |  -3.911294   .9392181    -4.16   0.000    -5.752128   -2.070461
    rmarried |  -8.144087   1.660778    -4.90   0.000    -11.39915   -4.889022
    hchildlg |   1.608275   2.026632     0.79   0.427     -2.36385      5.5804
     raedyrs |  (omitted)
      female |  (omitted)
         age |  (omitted)
    minority |  (omitted)
       _cons |    46.8239   2.685722    17.43   0.000     41.55998    52.08782
-------------+----------------------------------------------------------------
     sigma_u |   33.23378
     sigma_e |  41.078617
         rho |   .3955978   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(6242,24294) =     0.76       Prob > F   = 1.0000
------------------------------------------------------------------------------
Instrumented:   rallparhelptw
Instruments:    rpoorhealth rmarried hchildlg raedyrs female age minority
                rtotalpar rsiblog
------------------------------------------------------------------------------

. xtivreg rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minor
> ity (rallparhelptw= rtotalpar rsiblog), re
G2SLS random-effects IV regression              Number of obs      =     30541
Group variable: hhidpn                          Number of groups   =      6243

R-sq:  within  = 0.0192                         Obs per group: min =         1
       between = 0.0805                                        avg =       4.9
       overall = 0.0477                                        max =         9

                                                Wald chi2(8)       =   2843.18
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -5.440523   .6434792    -8.45   0.000    -6.701719   -4.179327
 rpoorhealth |  -11.96851   .4269982   -28.03   0.000    -12.80541    -11.1316
    rmarried |  -3.793056   .5580382    -6.80   0.000    -4.886791   -2.699321
    hchildlg |  -.9219651   .3165478    -2.91   0.004    -1.542387   -.3015429
     raedyrs |   .8746774   .0731375    11.96   0.000     .7313306    1.018024
      female |  -7.071149   .5808851   -12.17   0.000    -8.209663   -5.932635
         age |  -1.319998   .0553041   -23.87   0.000    -1.428392   -1.211604
    minority |  -.6250384   .4461891    -1.40   0.161    -1.499553    .2494762
       _cons |   104.2443   3.321154    31.39   0.000     97.73494    110.7536
-------------+----------------------------------------------------------------
     sigma_u |          0
     sigma_e |     41.082
         rho |          0   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   rallparhelptw
Instruments:    rpoorhealth rmarried hchildlg raedyrs female age minority
                rtotalpar rsiblog
------------------------------------------------------------------------------

. net search ivreg2
Install st0030_3 from http://www.stata-journal.com/software/sj7-4
. net search xtivreg2
Install xtivreg2 from http://fmwww.bc.edu/RePEc/bocode/x
. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age mino
> rity (rallparhelptw= rtotalpar rsiblog), fe
Warning - singleton groups detected. 445 observation(s) not used.
Warning - collinearities detected
Vars dropped: raedyrs female age minority
FIXED EFFECTS ESTIMATION
------------------------
Number of groups =        5798                  Obs per group: min =         2
                                                               avg =       5.2
                                                               max =         9

IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =    30096
                                                      F(  4, 24294) =    99.44
                                                      Prob > F      =   0.0000
Total (centered) SS     =  6411725.312                Centered R2   =  -5.3938
Total (uncentered) SS   =  6411725.312                Uncentered R2 =  -5.3938
Residual SS             =  40994978.63                Root MSE      =    41.08

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.09954   .6177701   -17.97   0.000    -12.31034   -9.888729
 rpoorhealth |  -3.911294   .9391408    -4.16   0.000    -5.751976   -2.070612
    rmarried |  -8.144087   1.660641    -4.90   0.000    -11.39888    -4.88929
    hchildlg |   1.608275   2.026465     0.79   0.427    -2.363523    5.580073
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):         345.874
                                                   Chi-sq(2) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              175.398
Stock-Yogo weak ID test critical values: 10% maximal IV size             19.93
                                         15% maximal IV size             11.59
                                         20% maximal IV size              8.75
                                         25% maximal IV size              7.25
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           3.351
                                                   Chi-sq(1) P-val =    0.0672
------------------------------------------------------------------------------
Instrumented:         rallparhelptw
Included instruments: rpoorhealth rmarried hchildlg
Excluded instruments: rtotalpar rsiblog
Dropped collinear:    raedyrs female age minority
------------------------------------------------------------------------------
The coefficients are said to be exactly identified if there are just enough instruments to estimate
them; overidentified if there are more than enough instruments to estimate them (in which case we
can test whether the instruments are valid, which is known as a test of the "overidentifying
restrictions"); or underidentified if there are too few instruments to estimate them (there should be
at least as many instruments as endogenous IVs) or if the instruments do not predict the
endogenous IVs, in which case you need to find additional or better instruments.
Underidentification test:
This test evaluates whether the excluded instruments predict the endogenous regressor. The null
hypothesis is that the excluded instruments do not predict the endogenous IV (i.e., path A is not
significant).
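One way to inspect path A directly is to re-estimate the model requesting the first-stage results (a sketch; the first option of ivreg2/xtivreg2 displays the first-stage regression of the endogenous regressor on the instruments):

. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minority (rallparhelptw = rtotalpar rsiblog), fe first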
Weak identification test:
The weak-instruments problem arises when the correlations between the endogenous regressors
and the excluded instruments are nonzero but small (i.e., path A is significant but the relationship is
weak). Rejection of the null hypothesis indicates the absence of a weak-instruments problem.
The null hypothesis being tested is that the estimator is weakly identified in the sense that it is
subject to bias that the investigator finds unacceptably large. Under weak identification, the
Wald test for beta rejects too often. The test statistic is based on the rejection rate r (10%, 20%,
etc.) that the researcher is willing to tolerate if the true rejection rate should be the standard 5%.
Weak instruments are defined as instruments that will lead to a rejection rate of r when the true
rejection rate is 5%. Stock and Yogo (2005) have tabulated critical values for their two weak
identification tests, and some critical values are listed in Stata output.
Usually we get Cragg-Donald Wald F statistic here, but if we specify the robust, cluster(), or
bw() options in xtivreg2, the reported weak-instruments test statistic is a Wald F statistic based
on the Kleibergen–Paap rk statistic. When using the rk statistic to test for weak identification, we
should apply caution when using the critical values compiled by Stock and Yogo (2005) or refer
to the older “rule of thumb” of Staiger and Stock (1997), which says that the F statistic should be
at least 10 for weak identification not to be considered a problem.
If a weak instruments problem does arise, the best solution is to look for different instruments, as
statistical inference for IV regression in such a case can be severely misleading. Using more
instruments is not a solution, because the biases of instrumental variables estimators increase
with the number of instruments. When the instruments are only weakly correlated with the
endogenous regressors, some Monte Carlo evidence suggests that the LIML estimator (liml
option in xtivreg2) performs better than the 2SLS (default) and GMM (gmm2s option) estimators;
however, the LIML estimator may produce confidence intervals that are somewhat larger than those
from the 2SLS estimator.
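For instance, the LIML version of our model could be requested as follows (a sketch, assuming the liml option is available in the installed version of xtivreg2/ivreg2):

. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minority (rallparhelptw = rtotalpar rsiblog), fe liml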
Overidentification test (test of overidentifying restrictions):
In addition to the requirement that instrumental variables should be correlated with the
endogenous regressors, the instruments must also be uncorrelated with the error term for the
outcome variable, which also means that the instruments might not be correlated with the DV
above and beyond their indirect relationship to the DV via the endogenous regressors (after
taking into account all the other controls). If that assumption is violated, that means one or more
instruments should be added to the equation predicting outcome because they have direct
relationships to it. Thus a significant test statistic could represent either an invalid instrument or
an incorrectly specified equation for the main outcome.
We can only test this assumption if the model is overidentified, meaning that the number of
additional instruments exceeds the number of endogenous regressors. That is why this test is
called a test of overidentifying restrictions. If the model is just identified, we cannot perform this
test. In this test, failing to reject the null indicates that the set of instruments is appropriate, while
rejecting the null indicates that the instruments may not be valid.
It is also possible, however, for this test to turn out significant (i.e., to reject the null) if the error
terms are heteroskedastic, so that alternative should be tested before we conclude that our
instruments are not valid.
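One way to check that alternative is to re-estimate the model with the robust option, in which case ivreg2/xtivreg2 report the Hansen J statistic, which is robust to heteroskedasticity, in place of the Sargan statistic (a sketch; the robust GMM estimation at the end of this handout illustrates the same idea):

. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age minority (rallparhelptw = rtotalpar rsiblog), fe robust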
Endogeneity test:
We can also conduct an endogeneity test to check whether we can simply treat a given variable as
exogenous. If an endogenous regressor is in fact exogenous, the OLS estimator is more efficient
and preferable to the instrumental variables approach.
. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age mino
> rity (rallparhelptw= rtotalpar rsiblog), fe endogtest(rallparhelptw)
Warning - singleton groups detected. 445 observation(s) not used.
Warning - collinearities detected
Vars dropped: raedyrs female age minority
FIXED EFFECTS ESTIMATION
------------------------
Number of groups =        5798                  Obs per group: min =         2
                                                               avg =       5.2
                                                               max =         9

IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics consistent for homoskedasticity only

                                                      Number of obs =    30096
                                                      F(  4, 24294) =    99.44
                                                      Prob > F      =   0.0000
Total (centered) SS     =  6411725.312                Centered R2   =  -5.3938
Total (uncentered) SS   =  6411725.312                Uncentered R2 =  -5.3938
Residual SS             =  40994978.63                Root MSE      =    41.08

------------------------------------------------------------------------------
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.09954   .6177701   -17.97   0.000    -12.31034   -9.888729
 rpoorhealth |  -3.911294   .9391408    -4.16   0.000    -5.751976   -2.070612
    rmarried |  -8.144087   1.660641    -4.90   0.000    -11.39888    -4.88929
    hchildlg |   1.608275   2.026465     0.79   0.427    -2.363523    5.580073
------------------------------------------------------------------------------
Underidentification test (Anderson canon. corr. LM statistic):         345.874
                                                   Chi-sq(2) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic):              175.398
Stock-Yogo weak ID test critical values: 10% maximal IV size             19.93
                                         15% maximal IV size             11.59
                                         20% maximal IV size              8.75
                                         25% maximal IV size              7.25
Source: Stock-Yogo (2005).  Reproduced by permission.
------------------------------------------------------------------------------
Sargan statistic (overidentification test of all instruments):           3.351
                                                   Chi-sq(1) P-val =    0.0672
-endog- option:
Endogeneity test of endogenous regressors:                             1968.086
                                                   Chi-sq(1) P-val =    0.0000
Regressors tested:    rallparhelptw
------------------------------------------------------------------------------
Instrumented:         rallparhelptw
Included instruments: rpoorhealth rmarried hchildlg
Excluded instruments: rtotalpar rsiblog
Dropped collinear:    raedyrs female age minority
------------------------------------------------------------------------------
The null hypothesis that the endogenous regressor is exogenous is rejected, so we should indeed
treat rallparhelptw as endogenous and use the instrumental variables approach.
Autocorrelation and heteroskedasticity:
To estimate the model adjusted for autocorrelation and heteroskedasticity, we use GMM2S
estimation (the two-step efficient generalized method of moments (GMM) estimator), along with
the robust option to deal with heteroskedasticity and the bw() (bandwidth) option for the
autocorrelation correction:
. xtivreg2 rworkhours80 rpoorhealth rmarried hchildlg raedyrs female age mino
> rity (rallparhelptw= rtotalpar rsiblog), fe robust bw(3) gmm2s
Warning - singleton groups detected. 445 observation(s) not used.
Warning - collinearities detected
Vars dropped: raedyrs female age minority
FIXED EFFECTS ESTIMATION
------------------------
Number of groups =        5798                  Obs per group: min =         2
                                                               avg =       5.2
                                                               max =         9

2-Step GMM estimation
---------------------
Estimates efficient for arbitrary heteroskedasticity and autocorrelation
Statistics robust to heteroskedasticity and autocorrelation

  kernel=Bartlett; bandwidth=3
  time variable (t):  wave
  group variable (i): hhidpn

                                                      Number of obs =    30096
                                                      F(  4, 24294) =    65.94
                                                      Prob > F      =   0.0000
Total (centered) SS     =  6411725.312                Centered R2   =  -5.3711
Total (uncentered) SS   =  6411725.312                Uncentered R2 =  -5.3711
Residual SS             =  40849749.57                Root MSE      =       41

------------------------------------------------------------------------------
             |               Robust
rworkhours80 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
rallparhel~w |  -11.07742   .7723191   -14.34   0.000    -12.59114   -9.563703
 rpoorhealth |  -3.843958   1.031333    -3.73   0.000    -5.865333   -1.822583
    rmarried |  -8.165251   1.912089    -4.27   0.000    -11.91288   -4.417625
    hchildlg |   1.701042   1.919486     0.89   0.376    -2.061082    5.463165
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):            212.701
                                                   Chi-sq(2) P-val =    0.0000
------------------------------------------------------------------------------
Weak identification test (Kleibergen-Paap rk Wald F statistic):        111.969
Stock-Yogo weak ID test critical values: 10% maximal IV size             19.93
                                         15% maximal IV size             11.59
                                         20% maximal IV size              8.75
                                         25% maximal IV size              7.25
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         2.946
                                                   Chi-sq(1) P-val =    0.0861
------------------------------------------------------------------------------
Instrumented:         rallparhelptw
Included instruments: rpoorhealth rmarried hchildlg
Excluded instruments: rtotalpar rsiblog
Dropped collinear:    raedyrs female age minority
------------------------------------------------------------------------------