The Truncated Regression

Research Method
Lecture 15-3
Truncated Regression and Heckman Sample Selection Corrections
Truncated regression
Truncated regression differs from censored regression in the following way:
Censored regression: the dependent variable may be censored, but you can still include the censored observations in the regression.
Truncated regression: a subset of observations is dropped entirely, so only the truncated data are available for the regression.
Reasons why data truncation happens
Example 1 (Truncation by survey design): The "Gary negative income experiment" data, used extensively in the economics literature, cover only those families whose income is less than 1.5 times the 1976 poverty line. Families whose incomes are greater than that threshold are dropped from the data by survey design.
Example 2 (Incidental truncation): In a wage offer regression for married women, only those who are working have wage information. Thus, the regression cannot include women who are not working. In this case, it is the person's decision, not the surveyor's decision, that determines the sample selection.
When applying OLS to truncated data causes a bias
Before learning the techniques for dealing with truncated data, it is important to know when applying OLS to truncated data causes a bias.
Suppose that you consider the following regression:
yi=β0+β1xi+ui
and suppose that you have a random sample of size N. We also assume that all the OLS assumptions are satisfied. (The most important assumption is E(ui|xi)=0.)
Now, suppose that, instead of using all N observations, you select a subsample of the original sample and run OLS using this subsample (the truncated sample) only.
Under what conditions would this OLS be unbiased, and under what conditions would it be biased?
A: Running OLS using only the selected subsample (truncated data) does not cause a bias if
(A-1) Sample selection is done randomly.
(A-2) Sample selection is determined solely by the value of the x-variable. For example, suppose that x is age. If you select the sample only when age is greater than 20, this OLS is unbiased.
B: Running OLS using only the selected subsample (truncated data) causes a bias if
(B-1) Sample selection is determined by the value of the y-variable. For example, suppose that y is family income, and you select the sample only when y is greater than a certain threshold. Then this OLS is biased.
(B-2) Sample selection is correlated with ui. For example, suppose you are running the wage regression wage=β0+β1(educ)+u, where u contains unobserved ability. If the sample is selected based on the unobserved ability, this OLS is biased.
In practice, this situation arises when the selection is based on the survey participant's own decision. For example, in the wage regression, a person's decision whether to work determines whether the person is included in the data. Since that decision is likely to be based on unobserved factors contained in u, the selection is likely to be correlated with u.
Understanding why these conditions lead to unbiasedness or biasedness
Now we know the conditions under which OLS on truncated data is biased or not. Let me explain why these conditions do or do not cause a bias.
(There is some repetition in the explanations, but they are more elaborate and contain very important information, so please read them carefully.)
Consider the following regression:
yi=β0+β1xi+ui
Suppose that this regression satisfies all the OLS assumptions.
Now let si be a selection indicator: if si=1, the person is included in the regression; if si=0, the person is dropped from the data.
Running OLS on the selected subsample means you run OLS using only the observations with si=1. This is equivalent to running the following regression:
siyi=β0si+β1sixi+siui
In this regression, sixi is the explanatory variable and siui is the error term. The crucial condition under which this OLS is unbiased is the zero conditional mean assumption: E(siui|sixi)=0. Thus we need to check under what conditions this is satisfied.
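As a quick check of this equivalence, you can verify on simulated data that regressing siyi on si and sixi without a constant reproduces the coefficients from OLS on the si=1 subsample. A minimal Stata sketch (the numbers and selection rule are illustrative, not from the lecture data):

clear
set obs 1000
set seed 1
gen x = rnormal()
gen y = 1 + 2*x + rnormal()     // true model: beta0=1, beta1=2
gen s = (x > 0)                 // an example selection indicator
gen sy = s*y
gen sx = s*x
reg y x if s==1                 // OLS on the selected subsample
reg sy s sx, noconstant         // same coefficients; s plays the role of the constant

The coefficients match exactly; only the degrees of freedom (and hence the standard errors) differ, because the s=0 rows are all zeros and contribute nothing to the estimates.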
To check E(siui|sixi)=0, it is sufficient to check whether E(siui|xi, si)=0. (If the latter is zero, the former is also zero.)
Further notice that E(siui|xi,si)=siE(ui|xi,si), since si is in the conditioning set. Thus it is sufficient to check the conditions that ensure E(ui|xi, si)=0.
To simplify the notation, I drop the i subscript from now on and check the conditions under which E(u|x, s)=0.
Conditions under which running OLS on the selected subsample (truncated data) is unbiased:
(A-1) Sample selection is done randomly. In this case, s is independent of u and x. Then we have E(u|x,s)=E(u|x). But since the original regression satisfies the OLS conditions, we have E(u|x)=0. Therefore, in this case, OLS is unbiased.
(A-2) The sample is selected based solely on the value of the x-variable. For example, if x is age, and you select a person only when age is greater than 20, then s=1 if x≥20 and s=0 if x<20. In this case, s is a deterministic function of x. Thus we have
E(u|x, s)=E(u|x, s(x))=E(u|x),
because a deterministic function of x can be dropped from the conditioning set. But E(u|x)=0 since the original regression satisfies all the OLS conditions. Therefore, in this case, OLS is unbiased.
Conditions under which running OLS on the selected subsample (truncated data) is biased:
(B-1) Sample selection is based on the value of the y-variable. For example, y is monthly family income, and you select families whose income is smaller than $500. Then s=1 if y<500.
Checking whether E(u|x, s)=0 is equivalent to checking whether E(u|x, s=1)=0 and E(u|x, s=0)=0. So we check this.
E(u|x, s=1)=E(u|x, y<500)
=E(u|x, β0+β1x+u<500)
=E(u|x, u<500−β0−β1x)
≠E(u|x)
Since the event {u<500−β0−β1x} directly depends on u, you cannot drop it from the conditioning set. Thus this is not equal to E(u|x), which means that it is not equal to zero. Thus E(u|x,s=1)≠0.
Similarly, you can show that E(u|x,s=0)≠0. Thus E(u|x,s)≠0, and this OLS is biased.
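To see conditions (A-2) and (B-1) at work, here is a minimal Stata simulation sketch (the data-generating values are illustrative): selecting on x leaves the slope estimate nearly unchanged, while selecting on y pushes it toward zero.

clear
set obs 5000
set seed 12345
gen x = rnormal()
gen y = 1 + 2*x + rnormal()     // true slope is 2
reg y x                         // full sample: slope near 2
reg y x if x > 0                // selection on x, case (A-2): still near 2
reg y x if y < 1                // selection on y, case (B-1): slope biased toward zero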
(B-2) Sample selection is correlated with ui. This happens when it is the person's decision, not the surveyor's decision, that determines the sample selection. This type of truncation is called 'incidental truncation', and the bias that arises from it is called the sample selection bias.
The leading example is the wage offer regression of married women: wage=β0+β1educ+ui. When a woman decides not to work, her wage information is not available, so she is dropped from the data. Since it is the woman's decision, this sample selection is likely to be based on some unobservable factors which are contained in ui.
For example, a woman decides to work if the wage offer is greater than her reservation wage. This reservation wage is likely to be determined by unobserved factors in u, such as unobserved ability, unobserved family background, etc. Thus the selection criterion is likely to be correlated with u, which in turn means that s is correlated with u.
Mathematically, it can be shown as follows.
If s is correlated with u, then you cannot drop s from the conditioning set. Thus we have
E(u|x,s)≠E(u|x)
This means that E(u|x,s)≠0. Thus, this OLS is biased. Again, this type of bias is called the sample selection bias.
A slightly more complicated case
Suppose x is IQ, and a survey participant responds to your survey only if IQ>v. In this case, the sample selection is based on the x-variable and a random error v. If you run OLS using only the truncated data, will it cause a bias?
Answer:
Case 1: If v is independent of u, it does not cause a bias.
Case 2: If v is correlated with u, this is the same case as (B-2), so the OLS will be biased.
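The two cases can likewise be checked by simulation. A sketch in Stata (the correlation structure of v is made up for illustration):

clear
set obs 5000
set seed 12345
gen u = rnormal()
gen iq = 100 + 15*rnormal()
gen y = 1 + 0.05*iq + u
gen v1 = 100 + 15*rnormal()            // Case 1: v independent of u
gen v2 = 100 + 15*(0.8*u + rnormal())  // Case 2: v correlated with u
reg y iq if iq > v1                    // no bias
reg y iq if iq > v2                    // biased, as in (B-2)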
Estimation methods when data are truncated
When you have (B-1) type truncation, use the truncated regression model.
When you have (B-2) type truncation (incidental truncation), use the Heckman sample selection correction method, also called the Heckit model.
I will explain these methods one by one.
The Truncated Regression
When the data truncation is of (B-1) type, you apply the truncated regression model. To recap, (B-1) type truncation happens because the surveyor samples people based on the value of the y-variable.
Suppose that the following regression satisfies all the OLS assumptions:
yi=β0+β1xi+ui, ui~N(0,σ²)
But you sample only if yi<ci. (This means you drop observations with yi≥ci by survey design.) In this case, you know the exact value of ci for each person.
[Figure: Example of (B-1) type data truncation. Monthly family income is plotted against the education of the household head. Observations with income above $500 are dropped from the data, and the OLS line fitted to the truncated data is flatter than the true regression line.]
As the figure shows, running OLS on the truncated data causes a bias. The model that produces unbiased estimates is based on maximum likelihood estimation.
The estimation method is as follows. For each observation, we can write ui=yi−β0−β1xi. Thus, the likelihood contribution is the height of the density function. However, since we sample only if yi<ci, we have to use the density function of ui conditional on yi<ci. This conditional density function is given on the next slide.
f(ui | yi<ci) = f(ui | β0+β1xi+ui<ci) = f(ui | ui<ci−β0−β1xi)
= f(ui) / P(ui<ci−β0−β1xi)
= f(ui) / P(ui/σ < (ci−β0−β1xi)/σ)
= f(ui) / Φ((ci−β0−β1xi)/σ)
where
f(ui) = (1/√(2πσ²)) e^(−ui²/(2σ²)) = (1/σ)φ(ui/σ)
Thus
f(ui | yi<ci) = (1/σ)φ(ui/σ) / Φ((ci−β0−β1xi)/σ)
where φ(·) and Φ(·) are the standard normal density and CDF.
Thus, the likelihood contribution for the ith observation is obtained by plugging ui=yi−β0−β1xi into the conditional density function:
Li = (1/σ)φ((yi−β0−β1xi)/σ) / Φ((ci−β0−β1xi)/σ)
The likelihood function is
L(β0, β1, σ) = ∏(i=1 to n) Li
The values of β0, β1, σ that maximize L (in practice, the log likelihood) are the truncated regression estimators.
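For reference, this likelihood can be coded directly with Stata's ml command. The following is a minimal sketch, not the lecture's own code: the evaluator name mytrunc is made up, the truncation point is hard-coded at 800 to match the exercise below, and σ is parameterized as exp(lnsigma) to keep it positive. The built-in truncreg command does all of this for you.

program define mytrunc
    args lnf xb lnsigma
    // log of the truncated-normal likelihood contribution:
    // ln phi((y-xb)/sigma) - ln(sigma) - ln Phi((c-xb)/sigma), with c=800
    quietly replace `lnf' = ln(normalden(($ML_y1-`xb')/exp(`lnsigma'))) ///
        - `lnsigma' - ln(normal((800-`xb')/exp(`lnsigma')))
end

ml model lf mytrunc (xb: familyinc = huseduc) (lnsigma:)
ml maximize

Maximizing this should reproduce the truncreg results shown below, up to the reparameterization of sigma.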
The partial effects
The estimated β1 shows the effect of x on y in the full population. Thus, you can interpret the parameters as if they were OLS parameters.
Exercise
We do not have suitable data for the truncated regression, so let us truncate the data ourselves to check how the truncated regression works.
EX1. Use JPSC_familyinc.dta to estimate the following model using all the observations:
(family income)=β0+β1(husband's educ)+u
Family income is in units of 10,000 yen.
EX2. Then run the OLS using only the observations with familyinc<800. How did the parameters change?
EX3. Run the truncated regression model for the data truncated from above at 800 (i.e., dropping all observations with familyinc≥800). How did the parameters change? Did the truncated regression recover the parameters of the original regression?
OLS using all the observations:

. reg familyinc huseduc

      Source |       SS       df       MS              Number of obs =    7695
-------------+------------------------------           F(  1,  7693) =  924.22
       Model |  38305900.9     1  38305900.9           Prob > F      =  0.0000
    Residual |   318850122  7693  41446.7856           R-squared     =  0.1073
-------------+------------------------------           Adj R-squared =  0.1071
       Total |   357156023  7694  46420.0705           Root MSE      =  203.58

   familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   32.93413   1.083325    30.40   0.000     30.81052    35.05775
       _cons |    143.895   15.09181     9.53   0.000     114.3109     173.479

OLS where observations with familyinc≥800 are dropped. The parameter on huseduc is biased towards zero:

. reg familyinc huseduc if familyinc<800

      Source |       SS       df       MS              Number of obs =    6274
-------------+------------------------------           F(  1,  6272) =  602.70
       Model |  11593241.1     1  11593241.1           Prob > F      =  0.0000
    Residual |   120645494  6272  19235.5699           R-squared     =  0.0877
-------------+------------------------------           Adj R-squared =  0.0875
       Total |   132238735  6273   21080.621           Root MSE      =  138.69

   familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   20.27929   .8260432    24.55   0.000     18.65996    21.89861
       _cons |   244.5233   11.33218    21.58   0.000     222.3084    266.7383
Truncated regression model with the upper truncation limit equal to 800. Observations with familyinc≥800 are automatically dropped from this regression:

. truncreg familyinc huseduc, ul(800)
(note: 1421 obs. truncated)

Fitting full model:

Iteration 0:   log likelihood = -39676.782
Iteration 1:   log likelihood = -39618.757
Iteration 2:   log likelihood = -39618.629
Iteration 3:   log likelihood = -39618.629

Truncated regression
Limit:   lower =       -inf                             Number of obs =   6274
         upper =        800                             Wald chi2(1)  = 569.90
Log likelihood = -39618.629                             Prob > chi2   = 0.0000

   familyinc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     huseduc |   24.50276     1.0264    23.87   0.000     22.49105    26.51446
       _cons |   203.6856   13.75721    14.81   0.000     176.7219    230.6492
-------------+----------------------------------------------------------------
      /sigma |   153.1291   1.805717    84.80   0.000       149.59    156.6683

The bias seems to be corrected, but not perfectly in this example.
Heckman Sample Selection Bias Correction (Heckit Model)
The most common reason for data truncation is the (B-2) type: incidental truncation. This truncation usually occurs because sample selection is determined by the person's decision, not the surveyor's decision. Consider the wage regression example. If a person has chosen to work, the person has "self-selected into the sample". If the person has decided not to work, the person has "self-selected out of the sample". The bias caused by this type of truncation is called the sample selection bias.
Bias correction for this type of data truncation is done by the Heckman sample selection correction method, also called the Heckit model.
Consider the wage regression model. In Heckit, you have a wage equation and a sample selection equation:
Wage eq: yi=xiβ+ui, ui~N(0,σu²)
Selection eq: si*=ziδ+ei, ei~N(0,1)
such that the person works if si*>0. That is, si=1 if si*>0, and si=0 if si*≤0.
In the above equations, I am using the following vector notation: β=(β0,β1,β2,…,βk)T, xi=(1,xi1,xi2,…,xik), δ=(δ0,δ1,…,δm)T, and zi=(1,zi1,zi2,…,zim).
We assume that xi and zi are exogenous in the sense that E(ui|xi, zi)=0.
Further, assume that xi is a strict subset of zi; that is, all the x-variables are also part of zi. For example, xi=(1, experi, agei) and zi=(1, experi, agei, kidslt6i). We require that zi contain at least one variable that is not contained in xi. (Without such an excluded variable, the correction term introduced below would be a nearly linear function of xi, and the second-step regression would suffer from severe collinearity.)
The structural error ui and the sample selection si are correlated only if ui and ei are correlated. In other words, the sample selection causes a bias only if ui and ei are correlated. Let us denote the correlation between ui and ei by ρ=corr(ui, ei).
The data requirements of the Heckit model are as follows.
1. yi is available only for the observations who are currently working.
2. However, xi and zi are available both for those who are working and for those who are not working.
Now, I will describe the Heckit model.
First, the expected value of yi given the fact that
the person has participated in the labor force (i.e.,
si=1) is written as
E(yi|si=1, zi) = E(yi|si*>0, zi)
= E(yi|ziδ+ei>0, zi)
= E(yi|ei>−ziδ, zi)
= E(xiβ+ui|ei>−ziδ, zi)
= xiβ + E(ui|ei>−ziδ, zi)
Using a result for the bivariate normal distribution, the last term can be shown to be E(ui|ei>−ziδ, zi) = ρφ(ziδ)/Φ(ziδ). The term φ(ziδ)/Φ(ziδ) is the inverse Mills ratio, λ(ziδ).
Thus, we have
E(yi|si=1, zi) = xiβ + E(ui|ei>−ziδ, zi) = xiβ + ρλ(ziδ)
Heckman showed that sample selection bias can be viewed as an omitted variable bias, where the omitted variable is λ(ziδ).
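For completeness, here is a sketch of the bivariate normal result used above, assuming (ui, ei) are jointly normal with Var(ui)=σu², Var(ei)=1, and corr(ui, ei)=ρ (strictly, the coefficient on the inverse Mills ratio is ρσu, which these slides write compactly as ρ):
E(ui|ei) = ρσu·ei
⇒ E(ui|ei>−ziδ, zi) = ρσu·E(ei|ei>−ziδ) = ρσu·φ(ziδ)/Φ(ziδ) = ρσu·λ(ziδ),
using the truncated standard normal mean E(e|e>a) = φ(a)/(1−Φ(a)) with a=−ziδ, together with the symmetries φ(−a)=φ(a) and 1−Φ(−a)=Φ(a).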
An important thing to note is that λ(ziδ) can easily be estimated. How? Note that the selection equation is simply a probit model of labor force participation. So, estimate the sample selection equation by probit to obtain δ̂, then compute λ(ziδ̂). You can then correct the bias by including λ(ziδ̂) in the wage regression and estimating the model by OLS.
Heckman showed that this method corrects for the sample selection bias. This method is the Heckit model. The next slide summarizes it.
Heckman Two-step Sample Selection Correction Method (Heckit model)
Wage eq: yi=xiβ+ui, ui~N(0,σu²)
Selection eq: si*=ziδ+ei, ei~N(0,1)
such that the person works if si*>0 and does not work if si*≤0.
Assumption 1: E(ui|xi, zi)=0
Assumption 2: xi is a strict subset of zi.
If ui and ei are correlated, OLS estimation of the wage equation (using only the observations who are working) is biased.
First step: Estimate the sample selection equation parameters δ̂ by probit, then compute λ(ziδ̂).
Second step: Plug λ(ziδ̂) into the wage equation, then estimate the equation by OLS. That is, estimate the following:
yi = xiβ + ρλ(ziδ̂) + error
In this model, ρ is the coefficient on λ(ziδ̂). If ρ≠0, then sample selection bias is present. If ρ=0, then there is evidence that sample selection bias is not present.
Note that when you follow this procedure exactly, you get the correct coefficients, but you do not get the correct standard errors. For the exact formula of the standard errors, consult Wooldridge (2002). Stata automatically computes the correct standard errors.
Exercise
Using Mroz.dta, estimate the wage offer equation using the Heckit model. The explanatory variables for the wage offer equation are educ, exper, expersq. The explanatory variables for the sample selection equation are educ, exper, expersq, nwifeinc, age, kidslt6, kidsge6.
Estimating Heckit manually. (Note: you will not get the correct standard errors.)

. **********************************************
. * Estimating heckit model manually           *
. **********************************************

. ***************************
. * First create selection  *
. * variable                *
. ***************************
. gen s=0 if wage==.
(428 missing values generated)

. replace s=1 if wage~=.
(428 real changes made)

The first step: the probit selection equation.

. *******************************
. * Next, estimate the probit   *
. * selection equation          *
. *******************************
. probit s educ exper expersq nwifeinc age kidslt6 kidsge6

Iteration 0:   log likelihood =  -514.8732
Iteration 1:   log likelihood = -405.78215
Iteration 2:   log likelihood = -401.32924
Iteration 3:   log likelihood = -401.30219
Iteration 4:   log likelihood = -401.30219

Probit regression                                 Number of obs   =        753
                                                  LR chi2(7)      =     227.14
                                                  Prob > chi2     =     0.0000
Log likelihood = -401.30219                       Pseudo R2       =     0.2206

           s |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
       exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
     expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
    nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
         age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
     kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
     kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
       _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901
The second step: create the inverse Mills ratio, then estimate the Heckit regression.

. *******************************
. * Then create inverse lambda  *
. *******************************
. predict xdelta, xb

. gen lambda = normalden(xdelta)/normal(xdelta)

. *************************************
. * Finally, estimate the Heckit model *
. *************************************
. reg lwage educ exper expersq lambda

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =   19.69
       Model |  35.0479487     4  8.76198719           Prob > F      =  0.0000
    Residual |  188.279492   423  .445105182           R-squared     =  0.1569
-------------+------------------------------           Adj R-squared =  0.1490
       Total |  223.327441   427  .523015084           Root MSE      =  .66716

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1090655   .0156096     6.99   0.000     .0783835    .1397476
       exper |   .0438873   .0163534     2.68   0.008     .0117434    .0760313
     expersq |  -.0008591   .0004414    -1.95   0.052    -.0017267    8.49e-06
      lambda |   .0322619   .1343877     0.24   0.810    -.2318889    .2964126
       _cons |  -.5781032    .306723    -1.88   0.060    -1.180994     .024788

Note: the standard errors here are not correct.
Heckit estimated automatically:

. heckman lwage educ exper expersq, select(s=educ exper expersq nwifeinc age kidslt6 kidsge6) twostep

Heckman selection model -- two-step estimates     Number of obs      =     753
(regression model with sample selection)          Censored obs       =     325
                                                  Uncensored obs     =     428

                                                  Wald chi2(3)       =   51.53
                                                  Prob > chi2        =  0.0000

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
lwage        |
        educ |   .1090655    .015523     7.03   0.000     .0786411      .13949
       exper |   .0438873   .0162611     2.70   0.007     .0120163    .0757584
     expersq |  -.0008591   .0004389    -1.96   0.050    -.0017194    1.15e-06
       _cons |  -.5781032   .3050062    -1.90   0.058    -1.175904      .019698
-------------+----------------------------------------------------------------
s            |
        educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
       exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
     expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
    nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
         age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
     kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
     kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
       _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901
-------------+----------------------------------------------------------------
mills        |
      lambda |   .0322619   .1336246     0.24   0.809    -.2296376    .2941613
-------------+----------------------------------------------------------------
         rho |    0.04861
       sigma |  .66362875
      lambda |  .03226186   .1336246

Note that H0: ρ=0 cannot be rejected, so there is little evidence that sample selection bias is present.