Notes 6 - Wharton Statistics Department

Stat 521: Notes 6.
Measurement Error, Instrumental Variables
Reading: Wooldridge, Chapter 5.1, 18.1-18.2.
I. Note on Homework 1
In question 6, I asked you to consider the effect on mean
earnings (Y) of a policy in which everyone who has less than a
high school education instead got a high school education. The
model for earnings with constant variance is
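$$\log Y_i = \beta_0 + \beta_1 \text{educ}_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),$$
which implies
$$E(Y_i \mid \text{educ}_i) = \exp\left(\beta_0 + \beta_1 \text{educ}_i + \frac{\sigma^2}{2}\right).$$
If there is nonconstant variance, then
$$E(Y_i \mid \text{educ}_i) = \exp\left(\beta_0 + \beta_1 \text{educ}_i + \frac{\sigma_i^2}{2}\right),$$
and we need a model for the distribution of $\epsilon_i$ to determine the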
effect of the policy on mean earnings. We will study this later in
the course.
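As a rough numerical check of the constant-variance formula, the R sketch below simulates log earnings and compares the naive prediction exp(b0 + b1*educ) with the prediction that includes the sigma^2/2 term. The coefficient values (9 and 0.08), the error standard deviation (0.6) and the range of education are made-up assumptions for illustration; they are not estimates from the homework data.

```r
# Simulated check of E(Y | educ) = exp(b0 + b1*educ + sigma^2/2).
# All parameter values are illustrative assumptions.
set.seed(6)
n <- 200000
educ <- sample(8:16, n, replace = TRUE)
log_y <- 9 + 0.08 * educ + rnorm(n, sd = 0.6)   # constant-variance log model
y <- exp(log_y)

fit <- lm(log_y ~ educ)
sigma2_hat <- summary(fit)$sigma^2               # estimate of sigma^2

# Predicted mean earnings at 12 years of education
pred_naive <- unname(exp(predict(fit, newdata = data.frame(educ = 12))))
pred_adj <- pred_naive * exp(sigma2_hat / 2)     # adds the sigma^2/2 term
mean_12 <- mean(y[educ == 12])                   # simulated "truth"

c(naive = pred_naive, adjusted = pred_adj, sample_mean = mean_12)
```

In this simulation the naive prediction falls well short of the sample mean at educ = 12, while the adjusted prediction is close to it.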
II. Measurement Error (Chapter 4.4)
The Classical Measurement Error Model:
Suppose we would like to estimate the regression
$$Y_i = \beta_0 + \beta_1 X_{i1}^* + \epsilon_i,$$
but we only measure $X_{i1}$, which contains measurement error:
$$X_{i1} = X_{i1}^* + \eta_i.$$
The classical measurement error model assumes that the measurement
error $\eta_i$ is independent of the true $X_{i1}^*$ and of $\epsilon_i$: the
measurement error arises completely at random.
Let the regression of $Y$ on $X_{i1}$ be
$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \nu_i. \quad (1.1)$$
We showed that
$$\gamma_1 = \beta_1 \frac{Cov(X_{i1}^*, X_{i1})}{Var(X_{i1})} = \beta_1 \frac{\sigma_{X^*}^2}{\sigma_{X^*}^2 + \sigma_\eta^2}.$$
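Summary: Classical measurement error leads to a bias towards
zero (this is called the attenuation bias).
The quantity $\sigma_{X^*}^2 / (\sigma_{X^*}^2 + \sigma_\eta^2)$, which attenuates the regression
coefficient, is called the reliability ratio. Here $\sigma_{X^*}^2$ is the variance
of the true regressor and $\sigma_\eta^2$ is the variance of the measurement error.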
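The attenuation result can be checked by simulation. The R sketch below generates a true regressor, adds independent measurement error, and compares the slope from the mismeasured regression with the true slope times the reliability ratio. The sample size, coefficients and standard deviations are assumptions chosen for illustration.

```r
# Simulated illustration of attenuation bias (all values are assumptions).
set.seed(1)
n <- 100000
beta0 <- 1
beta1 <- 2                  # true slope
sigma_xstar <- 1            # SD of the true regressor X*
sigma_eta <- 0.5            # SD of the classical measurement error

xstar <- rnorm(n, sd = sigma_xstar)
y <- beta0 + beta1 * xstar + rnorm(n)
x <- xstar + rnorm(n, sd = sigma_eta)   # mismeasured regressor

slope_true <- coef(lm(y ~ xstar))[["xstar"]]
slope_obs <- coef(lm(y ~ x))[["x"]]
reliability <- sigma_xstar^2 / (sigma_xstar^2 + sigma_eta^2)

c(slope_true = slope_true, slope_obs = slope_obs,
  predicted = beta1 * reliability)
```

With these values the reliability ratio is 0.8, so the slope on the mismeasured regressor should be close to 1.6 rather than 2.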
The following table shows the reliability ratios of a few key
variables in economics:
Variable             Reliability Ratio  Data Set               Reference
Log Annual Hours     0.63               PSID Validation Study  Bound et al. (1994)
Log Annual Earnings  0.76               PSID Validation Study  Duncan and Hill (1986)
Years of Education   0.90               Twinsburg Twins Study  Ashenfelter and Krueger (1994)
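When the interest is in $\beta_1$ in the multiple regression
$$Y_i = \beta_0 + \beta_1 X_{i1}^* + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \epsilon_i$$
and the classical measurement error model holds for $X_{i1}^*$, then
$\gamma_1$ in the model
$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \cdots + \gamma_K X_{iK} + \nu_i$$
satisfies
$$\gamma_1 = \beta_1 \frac{\sigma_{r_X^*}^2}{\sigma_{r_X^*}^2 + \sigma_\eta^2},$$
where $r_X^*$ is the error from the regression of $X_{i1}^*$ on $X_{i2}, \ldots, X_{iK}$,
$\sigma_{r_X^*}^2$ is its variance, and $\sigma_\eta^2$ is the variance of the measurement error.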
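The multiple regression version of the attenuation formula can also be checked by simulation. In the R sketch below, the true regressor is correlated with a correctly measured covariate, so the relevant variance is that of the residual from regressing the true regressor on the covariate. All numerical values (0.8, 0.6, 0.5, and the outcome coefficients) are assumptions for illustration.

```r
# Simulated check of attenuation with a correctly measured covariate.
# All parameter values are illustrative assumptions.
set.seed(2)
n <- 200000
x2 <- rnorm(n)                        # correctly measured covariate
r_star <- rnorm(n, sd = 0.6)          # part of X1* not explained by X2
xstar1 <- 0.8 * x2 + r_star           # true regressor, Var = 0.64 + 0.36 = 1
x1 <- xstar1 + rnorm(n, sd = 0.5)     # observed with classical error
y <- 1 + 2 * xstar1 - 1 * x2 + rnorm(n)

gamma1_hat <- coef(lm(y ~ x1 + x2))[["x1"]]

factor_multiple <- 0.6^2 / (0.6^2 + 0.5^2)   # residual-variance version
factor_simple <- 1 / (1 + 0.5^2)             # reliability ratio ignoring X2

c(gamma1_hat = gamma1_hat,
  predicted = 2 * factor_multiple,
  simple_prediction = 2 * factor_simple)
```

In this example the attenuation factor is about 0.59, compared with a reliability ratio of 0.8 when the covariate is ignored: because the covariate explains part of the true regressor but none of the measurement error, controlling for it makes the attenuation more severe here.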
Although classical measurement error is a good starting point, in
some important situations, classical measurement error is not
possible. For example, for a binary variable, classical
measurement error cannot hold because there are only two ways the
variable can be mismeasured: a true 0 can be measured as a 1
and a true 1 can be measured as a 0, and consequently the
error depends on the true value. Aigner (1972) shows that random
misclassification of a binary variable still biases the regression
coefficients towards zero. But in general, under nonclassical
measurement error, the ratio of the estimated coefficient to the
true coefficient could be greater or less than one.
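The binary case can be illustrated with a small simulation. The R sketch below flips a binary regressor at random with probability q and shows both that the error is negatively correlated with the true value and that the slope is still attenuated. The values P(true value = 1) = 0.5, q = 0.1 and the outcome coefficients are assumptions; in this symmetric design the slope works out to the true slope times (1 - 2q).

```r
# Random misclassification of a binary regressor (illustrative values).
set.seed(8)
n <- 100000
beta1 <- 2
q <- 0.1                                  # misclassification probability

d_true <- rbinom(n, 1, 0.5)               # true binary variable
flip <- rbinom(n, 1, q)                   # 1 = this observation is misrecorded
d_obs <- ifelse(flip == 1, 1 - d_true, d_true)
y <- 1 + beta1 * d_true + rnorm(n)

err <- d_obs - d_true                     # measurement error
cor_err_truth <- cor(err, d_true)         # negative: error depends on truth
slope_obs <- coef(lm(y ~ d_obs))[["d_obs"]]

c(cor_err_truth = cor_err_truth, slope_obs = slope_obs,
  predicted = beta1 * (1 - 2 * q))
```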
We will discuss approaches to correcting for measurement error
later.
III. Causal Effects and Potential Outcomes Framework (Section
18.2, Wooldridge).
In many economic studies, we are interested in estimating the
causal effect of a policy or decision on an individual.
Example: Does military service raise or lower earnings? We
will consider this question specifically for men in the World
War II era using data from the 1980 Census. For a specific man,
military service in WWII might interrupt education or career
and so cause that man to earn less than he would have if he had
not served. Alternatively, for a specific man, if the labor market
favored veterans, or if various veterans’ programs conferred
advantages, then military service in WWII might cause that man
to earn more than he would have had he not served. The familiar
distinction between association and causation is that these
questions concern the effects caused by military service for the
same man; they do not simply compare the different men who
happened to be WWII veterans and those who happened to not
be WWII veterans. Men may be rejected for military service for
reasons of ill health or criminal behavior, and others may legally
avoid or illegally evade military service, so that we would
expect veterans and nonveterans of WWII to differ in earnings
quite apart from any effect caused by military service.
Potential Outcomes Framework:
$Y_i^{(1)}$ = earnings in 1980 man $i$ would have if he had served in the
military during WWII
$Y_i^{(0)}$ = earnings in 1980 man $i$ would have if he did not serve in
the military during WWII
For man $i$, the causal effect of military service on 1980 earnings
is $Y_i^{(1)} - Y_i^{(0)}$. However, we can only observe one of $Y_i^{(1)}$ or
$Y_i^{(0)}$. Let $D_i$ equal 1 or 0 according to whether man $i$ served in
the military or did not serve in the military. We observe
earnings $Y_i = Y_i^{(D_i)}$ for man $i$.
Consider a model in which military service has the same causal
effect for everybody:
$$Y_i^{(1)} - Y_i^{(0)} = \beta.$$
The causal effect of military service on earnings is $\beta$.
The observed earnings are
$$Y_i = Y_i^{(0)} + \beta D_i.$$
Let $X_i$ be a vector of covariates, e.g., race and state of birth, and
suppose $E(Y_i^{(0)} \mid X_i) = \alpha' X_i$, i.e.,
$$Y_i^{(0)} = \alpha' X_i + \epsilon_i, \quad E(\epsilon_i \mid X_i) = 0.$$
Then
$$Y_i = \beta D_i + \alpha' X_i + \epsilon_i = \beta D_i + \alpha' X_i + E(\epsilon_i \mid D_i, X_i) + \nu_i,$$
where $\nu_i = \epsilon_i - E(\epsilon_i \mid D_i, X_i)$.
Consider running the regression of $Y_i$ on $D_i, X_i$,
$$E(Y_i \mid D_i, X_i) = \tilde{\beta} D_i + \tilde{\alpha}' X_i.$$
Suppose
$E(\epsilon_i \mid D_i, X_i) = \lambda D_i$ [Recall: $E(\epsilon_i \mid X_i) = 0$; if there is no
interaction between $D_i$ and $X_i$ in affecting $\epsilon_i$, then it follows
that $E(\epsilon_i \mid D_i, X_i)$ does not depend on $X_i$.]
Using the omitted variable bias formula with $E(\epsilon_i \mid D_i, X_i)$ as
the omitted variable, we have
$$\tilde{\beta} = \beta + \lambda.$$
When does $\lambda = 0$? If $D_i$ is independent of $Y_i^{(0)}$ conditional on
$X_i$. This is called selection on observables. It means that all
the variables that determine the selection of $D_i$ are recorded in
$X_i$. If $D_i$ is randomly assigned as in a randomized experiment,
then selection on observables is assured.
Summary: We can estimate the causal effect of D on Y by
regression if we record all the covariates X which affect the
selection of D .
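The role of selection can be seen in a simulation in which both potential outcomes are known. The R sketch below uses a constant causal effect of 1000 and compares two ways of determining service: one in which men with higher unobserved earnings potential are more likely to serve (selection on unobservables), and one in which service is randomly assigned. The earnings levels, the selection rule and all other numbers are assumptions for illustration only.

```r
# Potential outcomes simulation: regression recovers the causal effect
# under random assignment but not under selection on unobservables.
set.seed(3)
n <- 100000
beta <- 1000                       # constant causal effect (assumed)
x <- rnorm(n)                      # observed covariate
eps <- rnorm(n, sd = 2000)         # unobserved earnings potential
y0 <- 20000 + 500 * x + eps        # earnings without service
y1 <- y0 + beta                    # earnings with service

# Case 1: men with high eps are more likely to serve
d_sel <- rbinom(n, 1, pnorm(eps / 2000))
y_sel <- ifelse(d_sel == 1, y1, y0)
b_sel <- coef(lm(y_sel ~ d_sel + x))[["d_sel"]]

# Case 2: service is randomly assigned
d_rand <- rbinom(n, 1, 0.5)
y_rand <- ifelse(d_rand == 1, y1, y0)
b_rand <- coef(lm(y_rand ~ d_rand + x))[["d_rand"]]

c(true_effect = beta, selection = b_sel, randomized = b_rand)
```

Under random assignment the coefficient on service is close to 1000. Under the selection rule it is far larger, because it also picks up the difference in unobserved earnings potential between those who serve and those who do not.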
The Census data do not contain detailed personal
characteristics beyond race, sex, place of birth and education, so
selection on observables is unlikely to be satisfied.
A strategy for estimating a causal effect when selection on
observables does not hold is to look for an instrumental
variable.
Instrumental variable (IV): A variable $Z_i$ is an instrumental
variable if it satisfies the following three conditions:
(1) Relevance: $Z_i$ is correlated with $D_i$ conditional on $X_i$.
(2) Random assignment: $Z_i$ is independent of the potential
outcomes $Y_i^{(z,d)}$ (defined in (3) below) conditional on $X_i$.
(3) Exclusion restriction: $Z_i$ has no direct effect on $Y$ other
than through its influence on $D_i$. To explain this assumption,
we allow the potential outcomes to depend on both $Z$ and $D$:
$Y_i^{(z,d)}$ = the potential outcome that would be observed for unit $i$ if
the instrumental variable $Z$ was set to $z$ and the treatment
variable $D$ was set to $d$.
The exclusion restriction says that the potential outcomes
depend only on $d$:
$$Y_i^{(z,d)} = Y_i^{(z',d)} \quad \text{for all possible } z, z', d.$$
How to make use of instrumental variables?
Let’s first consider the case without any other covariates $X_i$.
Consider the additive linear constant effect (ALICE) model:
$$Y_i^{(z,d)} = Y_i^{(0,0)} + \beta d + \delta z,$$
which leads to the observed data model
$$Y_i = Y_i^{(0,0)} + \beta D_i + \delta Z_i.$$
The exclusion restriction assumption says that $\delta = 0$. The
random assignment assumption says that $Y_i^{(0,0)}$ is independent of
$Z_i$.
Thus,
$$E(Y_i \mid Z_i) = E(Y_i^{(0,0)} \mid Z_i) + \beta E(D_i \mid Z_i) + \delta Z_i = E(Y_i^{(0,0)}) + \beta E(D_i \mid Z_i).$$
If we knew $E(D_i \mid Z_i)$, then we could estimate $\beta$ by least
squares regression of $Y_i$ on $E(D_i \mid Z_i)$. Although we don’t
know $E(D_i \mid Z_i)$, we can estimate it.
Two stage least squares estimate of $\beta$
1. First Stage: Regress $D_i$ on $Z_i$ using least squares to obtain
$\hat{E}(D_i \mid Z_i)$.
2. Second Stage: Regress $Y_i$ on $\hat{E}(D_i \mid Z_i)$. The slope on
$\hat{E}(D_i \mid Z_i)$ is the two stage least squares instrumental
variable estimate of $\beta$.
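The two stages above can be sketched in R on simulated data where the true effect is known. In the sketch below, an unobserved confounder u raises both the probability of treatment and the outcome, so ordinary least squares is biased, while the instrument z is randomly assigned and affects the outcome only through d. The true effect of -1000, the selection rule and all other numbers are assumptions made up for illustration.

```r
# Two stage least squares on simulated data (illustrative values only).
set.seed(4)
n <- 100000
beta <- -1000                        # true causal effect (assumed)
u <- rnorm(n)                        # unobserved confounder
z <- rbinom(n, 1, 0.5)               # randomly assigned binary instrument
d <- as.numeric(1.4 * z - 0.7 + u > 0)             # treatment depends on z and u
y <- 20000 + beta * d + 3000 * u + rnorm(n, sd = 2000)

# Ordinary regression of y on d: biased because u is omitted
b_ols <- coef(lm(y ~ d))[["d"]]

# First stage: regress d on z and keep the fitted values
d_hat <- fitted(lm(d ~ z))

# Second stage: regress y on the fitted values
b_2sls <- coef(lm(y ~ d_hat))[["d_hat"]]

c(true_effect = beta, ols = b_ols, tsls = b_2sls)
```

In this simulation the ordinary regression coefficient is positive even though the true effect is negative, while the two stage estimate is close to -1000. Note that the standard errors printed by the second-stage `lm` fit are not valid for the two stage estimate, because they are computed from residuals that use the fitted values rather than the actual treatment.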
If there are covariates $X_i$, we have
$$E(Y_i \mid Z_i, X_i) = \beta E(D_i \mid Z_i, X_i) + \alpha' X_i$$
and we just include the covariates in each stage, i.e., in the first
stage we regress $D_i$ on $Z_i$ and $X_i$ to obtain $\hat{E}(D_i \mid Z_i, X_i)$ and
in the second stage, we regress $Y_i$ on $\hat{E}(D_i \mid Z_i, X_i)$ and $X_i$.
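A simulated sketch of the two stages with a covariate is given below. Here the instrument is as good as randomly assigned only conditional on x, since x affects the instrument, the treatment and the outcome. The comparison with a version that leaves x out of both stages is added for illustration; the true effect of -1000 and all other numbers are assumptions.

```r
# Two stage least squares with a covariate in both stages
# (all parameter values are illustrative assumptions).
set.seed(5)
n <- 100000
beta <- -1000
x <- rnorm(n)                               # observed covariate
u <- rnorm(n)                               # unobserved confounder
z <- rbinom(n, 1, pnorm(0.5 * x))           # instrument depends on x
d <- as.numeric(1.4 * z - 0.7 + 0.5 * x + u > 0)
y <- 20000 + beta * d + 1500 * x + 3000 * u + rnorm(n, sd = 2000)

# Covariate included in both stages
d_hat_x <- fitted(lm(d ~ z + x))
b_with_x <- coef(lm(y ~ d_hat_x + x))[["d_hat_x"]]

# Covariate left out of both stages
d_hat_nox <- fitted(lm(d ~ z))
b_without_x <- coef(lm(y ~ d_hat_nox))[["d_hat_nox"]]

c(true_effect = beta, with_x = b_with_x, without_x = b_without_x)
```

With x in both stages the estimate is close to -1000. Leaving x out gives a much larger value in this design, because the instrument is then correlated with the part of earnings explained by x.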
Example: Angrist and Krueger (1994, “Why Do World War II
Veterans Earn More than Non-Veterans,” Journal of Labor
Economics) propose year/quarter of birth dummy variables as
instrumental variables for estimating the causal effect of military
service during WWII on earnings.
As an example, let’s consider in particular $Z_i = 1$ if the man was born in
quarter 3 or 4 of 1926 and $Z_i = 0$ if the man was born in quarter 3 or 4
of 1928. The data are from the 5% microsample of the 1980
Census (the Census only asked about detailed data such as
earnings for 5% of respondents).
ak_veteran_data=read.table("ak_veteran_data.csv",sep=",",header=TRUE);
earnings=ak_veteran_data$earnings;
veteran=ak_veteran_data$veteran;
z=ak_veteran_data$z;
Discussion of IV Assumptions:
(1) Relevance: Men born in 1926 turned 18 (draft-eligible age)
in 1944, the middle of the war, while men born in 1928 turned 18
in 1946, after the war was over.
mean(veteran[z==1]);
[1] 0.7550714
mean(veteran[z==0]);
[1] 0.2431429
Men born in 1926 were much more likely to be veterans. The
IV is relevant.
(2) Random assignment: $Y_i^{(z=0,d=0)}$ is the earnings man $i$ would
have in 1980 if he were born in 1928 and did not serve in the
military. The random assignment assumption says that the
actual year the man was born is independent of this earnings
potential. The random assignment assumption would be
violated if there are gradual long term trends in fertility, health,
education, apprenticeship and employment which might make
the earnings potential of the men born in 1928 different on
average from that of the men born in 1926. However, the random
assignment assumption seems plausible.
(3) Exclusion restriction: This says that $Y_i^{(z=0,d=0)} = Y_i^{(z=1,d=0)}$ and
$Y_i^{(z=0,d=1)} = Y_i^{(z=1,d=1)}$, i.e., if a man served (did not serve) in the
military regardless of whether he was born in 1926 or 1928, then
his earnings in 1980 would be the same regardless of whether he
was born in 1926 or 1928. This would be violated if there are
differences in earnings by age. The relationship between age and
earnings is relatively flat for the age range 52-54 (those born in
1926 would be 54 in 1980 and those born in 1928 would be 52),
but there is likely to be a small effect of age on earnings.
Angrist and Krueger do a variety of things to control for this in
their paper.
Empirical Results:
Usual regression:
linreg=lm(earnings~veteran);
summary(linreg);
Call:
lm(formula = earnings ~ veteran)

Residuals:
   Min     1Q Median     3Q    Max
-22114  -8003  -2184   4657  54652

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  20348.3      108.1  188.21   <2e-16 ***
veteran       1840.8      153.0   12.03   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12800 on 27998 degrees of freedom
Multiple R-squared: 0.005141, Adjusted R-squared: 0.005106
F-statistic: 144.7 on 1 and 27998 DF, p-value: < 2.2e-16
If selection on observables held, then we would have strong
evidence that military service increases a man’s earnings.
IV analysis:
# First stage regression
fsreg=lm(veteran~z);
veteranhat=predict(fsreg);
# Second stage regression
ssreg=lm(earnings~veteranhat);
betahat=coef(ssreg)[2];
betahat
veteranhat
-1010.588
The IV analysis is estimating that military service decreases a
man’s earnings.
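With a single binary instrument and no covariates, the two stage slope equals a simple ratio: the difference in mean outcomes between the two instrument groups divided by the difference in treatment rates. This follows from evaluating E(Y | Z) = E(Y^(0,0)) + beta E(D | Z) at Z = 1 and Z = 0 and subtracting. The R sketch below checks the equivalence on simulated stand-in data; the helper name `wald_iv` is hypothetical, and the simulated values are assumptions, not the Census extract.

```r
# Hypothetical helper: difference in mean outcomes divided by the
# difference in treatment rates between the two instrument groups.
wald_iv <- function(y, d, z) {
  (mean(y[z == 1]) - mean(y[z == 0])) / (mean(d[z == 1]) - mean(d[z == 0]))
}

# Simulated stand-in data (assumed values, not the Census extract)
set.seed(7)
n <- 20000
z_sim <- rbinom(n, 1, 0.5)
u <- rnorm(n)
d_sim <- as.numeric(1.4 * z_sim - 0.7 + u > 0)
y_sim <- 20000 - 1000 * d_sim + 3000 * u + rnorm(n, sd = 2000)

# Manual two stage slope, following the same steps as the analysis above
d_hat <- predict(lm(d_sim ~ z_sim))
two_stage <- coef(lm(y_sim ~ d_hat))[["d_hat"]]
wald <- wald_iv(y_sim, d_sim, z_sim)

c(two_stage = two_stage, wald = wald)

# With the course data loaded, the same helper could be called as
# wald_iv(earnings, veteran, z)
```

The two numbers agree up to rounding error, so the ratio form gives a quick way to see where the instrumental variable estimate comes from.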