Reading: Wooldridge, Chapter 4.3.2, 4.4
I. Proxy Variables (Chapter 4.3.2)
So far we have largely taken omitted variable bias as given and developed methods for at least assessing its sign. What else can we do? An additional approach is to use a proxy variable. For example, we might use IQ as a proxy for ability in estimating the effect of education on earnings holding ability fixed. The idea is to include another covariate in the regression and thus eliminate at least part of the bias that stems from omitting the unobserved variable. So suppose we would like to run the long regression:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_K X_{iK} + \beta_Z Z_i + \varepsilon_i.$$
We do not observe $Z_i$ but instead we observe a proxy $W_i$. We run the regression
$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \dots + \gamma_K X_{iK} + \gamma_W W_i + u_i.$$
Wooldridge discusses two assumptions that make $W_i$ a perfect proxy variable:

1. $W_i$ is uncorrelated with $\varepsilon_i$. Hence, if we were to run the regression
$$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_K X_{iK} + \beta_Z Z_i + \beta_W W_i + \varepsilon_i,$$
the estimator for $\beta_W$ would converge to zero.
2. Partialling out the proxy variable $W_i$, the omitted variable $Z_i$ is uncorrelated with the included variables $X_{i1}, \dots, X_{iK}$. Formally, if we run the regression
$$Z_i = \delta_0 + \delta_1 X_{i1} + \dots + \delta_K X_{iK} + \delta_W W_i + v_i,$$
the estimators for $\delta_1, \dots, \delta_K$ converge to zero.
Under these assumptions, we can directly use the results on omitted variable bias. Think of the regression function including all of $X_{i1}, \dots, X_{iK}, Z_i, W_i$ as the long regression, the regression function omitting $Z_i$ (but including $X_{i1}, \dots, X_{iK}, W_i$) as the short regression, and the regression of $Z_i$ on $X_{i1}, \dots, X_{iK}, W_i$ as the artificial regression. By the omitted variable bias formula, the coefficient on $X_{ik}$ in the short regression is equal to
$$\beta_k + \beta_Z \delta_k = \beta_k$$
because $\delta_k$ is zero, and so there is no bias for this coefficient. The coefficient on $W_i$ in the short regression is
$$\beta_W + \beta_Z \delta_W = \beta_Z \delta_W$$
(using that $\beta_W = 0$ by the first assumption), which is biased, but typically we do not care about the coefficient on the proxy variable.
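As an illustration, here is a minimal simulation sketch of a perfect proxy. The data generating process and all values are made up: $W_i$ is drawn first, $X_i$ depends on $Z_i$ only through $W_i$, and $Z_i$ equals $W_i$ plus noise independent of $X_i$, so both assumptions above hold.

# Simulation sketch: a perfect proxy (hypothetical DGP, all values made up)
set.seed(1)
n = 100000
w = rnorm(n)               # proxy, drawn first
x = 0.8*w + rnorm(n)       # covariate; related to z only through w
z = w + rnorm(n)           # omitted variable: w plus noise independent of x
y = x + 0.5*z + rnorm(n)   # true coefficient on x is 1
coef(lm(y~x))              # omitting z entirely: coefficient on x is biased up
coef(lm(y~x+w))            # with the perfect proxy: coefficient on x near 1
coef(lm(z~x+w))            # assumption 2: coefficient on x near zero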
Now let us look at this without assuming that $W_i$ is a perfect proxy variable. There are now a large number of regressions floating around. For expositional reasons, we look at the case with a single covariate $X_i$. We are interested in $\beta_1$ in the regression
$$Y_i = \beta_0 + \beta_1 X_i + \beta_Z Z_i + \varepsilon_{1i}.$$
We wish to compare the bias resulting from omitting $Z_i$ from this regression and estimating
$$Y_i = \gamma_0 + \gamma_1 X_i + \varepsilon_{2i} \qquad (1.1)$$
with the bias of the regression where we replace $Z_i$ with $W_i$ and estimate
$$Y_i = \alpha_0 + \alpha_1 X_i + \alpha_2 W_i + \varepsilon_{3i}.$$
We also need to consider the long regression
$$Y_i = \theta_0 + \theta_1 X_i + \theta_Z Z_i + \theta_W W_i + \varepsilon_{4i}$$
and the artificial regressions
$$W_i = \lambda_0 + \lambda_X X_i + \lambda_Z Z_i + \varepsilon_{5i},$$
$$Z_i = \delta_0 + \delta_X X_i + \delta_W W_i + \varepsilon_{6i},$$
$$Z_i = \kappa_0 + \kappa_X X_i + \varepsilon_{7i}.$$
First consider the relationship between the coefficients of interest and the coefficients from the long regression. Using the omitted variable bias formula, we have
$$\beta_Z = \theta_Z + \theta_W \lambda_Z, \qquad \beta_1 = \theta_1 + \theta_W \lambda_X.$$
Next consider the bias from running (1.1). Using the omitted variable bias formula, the bias is equal to
$$\gamma_1 - \beta_1 = \beta_Z \kappa_X = \theta_Z \kappa_X + \theta_W \lambda_Z \kappa_X.$$
Now consider the bias from replacing $Z_i$ with $W_i$. Using the omitted variable bias formula for the third time, we have
$$\alpha_1 = \theta_1 + \theta_Z \delta_X = \beta_1 - \theta_W \lambda_X + \theta_Z \delta_X,$$
so that the bias is
$$\alpha_1 - \beta_1 = \theta_Z \delta_X - \theta_W \lambda_X.$$
Now suppose the own effect of the proxy variable is fairly small ($\theta_W \approx 0$). In that case, the bias from omitting $Z_i$ is about $\theta_Z \kappa_X$, while the bias from the proxy variable regression is about $\theta_Z \delta_X$. The latter is smaller if $|\delta_X| < |\kappa_X|$, that is, if controlling for $W_i$ lowers the correlation between $Z_i$ and $X_i$.
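This comparison can also be checked by simulation. A sketch under a made-up data generating process with an imperfect proxy (all names and values are illustrative), computing both biases directly and via the formulas above; since the omitted variable bias formula is an exact algebraic identity for OLS, the two computations agree up to floating point.

# Simulation sketch: imperfect proxy (hypothetical DGP, values made up)
set.seed(2)
n = 100000
z = rnorm(n)
x = 0.6*z + rnorm(n)              # covariate correlated with z
w = 0.7*z + 0.3*x + rnorm(n)      # imperfect proxy for z
y = x + 0.5*z + 0.1*w + rnorm(n)  # theta_W = 0.1: small own effect of the proxy
theta = coef(lm(y~x+z+w))         # long regression
lamb  = coef(lm(w~x+z))           # artificial regression for w
delta = coef(lm(z~x+w))           # artificial regression for z given x and w
kapp  = coef(lm(z~x))             # artificial regression for z given x only
beta1 = theta["x"] + theta["w"]*lamb["x"]   # coefficient of interest beta_1
coef(lm(y~x))["x"] - beta1                  # bias from omitting z
coef(lm(y~x+w))["x"] - beta1                # bias from the proxy regression
theta["z"]*kapp["x"] + theta["w"]*lamb["z"]*kapp["x"]   # formula for the first bias
theta["z"]*delta["x"] - theta["w"]*lamb["x"]            # formula for the second bias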
Let us see how this plays out with real data. Let's look at the NLS data again.

nlsdata = read.table("nls_2008.txt")   # load the NLS extract
logwage = nlsdata$V1
educ    = nlsdata$V2
exper   = nlsdata$V3
age     = nlsdata$V4
expersq = exper^2
kww     = nlsdata$V7                   # KWW (knowledge of the working world) test score
iq      = nlsdata$V8                   # IQ test score
Suppose we are interested in the regression of log earnings on education controlling for IQ. The estimated regression is:
lm(logwage~educ+iq)
Call: lm(formula = logwage ~ educ + iq)
Coefficients:
(Intercept) educ iq
4.709855 0.045199 0.006183
If we could not observe IQ, we could estimate the regression without it:
lm(logwage~educ)
Call: lm(formula = logwage ~ educ)
Coefficients:
(Intercept) educ
5.04034 0.06717
Alternatively, we could estimate a regression using a test score, KWW (knowledge of the working world), as a proxy variable. This proxy variable regression is:
lm(logwage~educ+kww)
Call: lm(formula = logwage ~ educ + kww)
Coefficients:
(Intercept) educ kww
4.79674 0.04853 0.01382
Here the proxy variable seems to work quite well. To understand that better, let us look at the various components of the bias. First, the long regression
lm(logwage~educ+iq+kww)
Call: lm(formula = logwage ~ educ + iq + kww)
Coefficients:
(Intercept) educ iq kww
4.595812 0.035560 0.004482 0.011630
The three artificial regressions are:
lm(kww~educ+iq)
Call: lm(formula = kww ~ educ + iq)
Coefficients:
(Intercept) educ iq
9.8062 0.8288 0.1462
lm(iq~educ+kww)
Call: lm(formula = iq ~ educ + kww)
Coefficients:
(Intercept) educ kww
44.8282 2.8934 0.4894
lm(iq~educ)
Call: lm(formula = iq ~ educ)
Coefficients:
(Intercept) educ
53.452 3.553
So, first, the decomposed bias from the omitted variable regression is
$$\theta_Z \kappa_X + \theta_W \lambda_Z \kappa_X = 0.0046 \times 3.5388 + 0.0117 \times 0.1475 \times 3.5388 \approx 0.0224,$$
and the bias from the proxy variable regression is
$$\theta_Z \delta_X - \theta_W \lambda_X \approx 0.0035.$$
The bias from the proxy regression is about one sixth of that of the omitted variable regression. This is despite $\theta_W$ not being that small (0.0116); the improvement is instead due to the offsetting biases in the proxy variable regression.
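These numbers can also be computed directly from the fitted regressions, rather than from the rounded printed coefficients. A minimal sketch using the variables already loaded above; it should reproduce the decomposition up to rounding:

# Bias decomposition from the fitted coefficients
theta = coef(lm(logwage~educ+iq+kww))   # long regression
kapp  = coef(lm(iq~educ))               # artificial regression: iq on educ
delta = coef(lm(iq~educ+kww))           # artificial regression: iq on educ and kww
lamb  = coef(lm(kww~educ+iq))           # artificial regression: kww on educ and iq
# bias from omitting iq: theta_Z*kappa_X + theta_W*lambda_Z*kappa_X
theta["iq"]*kapp["educ"] + theta["kww"]*lamb["iq"]*kapp["educ"]
# bias from using kww as a proxy for iq: theta_Z*delta_X - theta_W*lambda_X
theta["iq"]*delta["educ"] - theta["kww"]*lamb["educ"]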
II. Measurement Error (Chapter 4.4)
The Classical Measurement Error Model:
Suppose we would like to estimate the regression
$$Y_i = \beta_0 + \beta_1 X^*_{i1} + \varepsilon_i,$$
but we only measure $X_{i1}$, which contains measurement error:
$$X_{i1} = X^*_{i1} + \eta_i.$$
For example, education is not measured perfectly.
The classical measurement error model assumes that the measurement error $\eta_i$ is independent of the true $X^*_{i1}$ and of $\varepsilon_i$: the measurement error arises completely at random.
Let the regression of $Y_i$ on $X_{i1}$ be
$$Y_i = \gamma_0 + \gamma_1 X_{i1} + u_i. \qquad (1.2)$$
We are interested in the bias of $\gamma_1$ as an estimator of $\beta_1$.
Consider $X_{i1}$ as a proxy variable for $X^*_{i1}$, and consider the regression function
$$Y_i = \theta_0 + \theta_1 X^*_{i1} + \theta_2 X_{i1} + \varepsilon_i. \qquad (1.3)$$
In the last equation, the population value of the coefficient $\theta_2$ on $X_{i1}$ is zero because of the independence between $\eta_i$ and $\varepsilon_i$. Consequently, $\theta_1 = \beta_1$. Running (1.2) instead of (1.3) (and thus omitting $X^*_{i1}$) leads to
$$\gamma_1 = \beta_1 \frac{\mathrm{Cov}(X^*_{i1}, X_{i1})}{\mathrm{Var}(X_{i1})} = \beta_1 \frac{\sigma^2_{X^*}}{\sigma^2_{X^*} + \sigma^2_{\eta}}.$$
Summary: Classical measurement error leads to a bias towards zero (this is called the attenuation bias).
Here is R code which illustrates this:

# Simulation of the classical measurement error model
xstar = rnorm(100)            # true regressor
x = xstar + rnorm(100)        # mismeasured regressor: x = xstar + noise
beta = 1
y = beta*xstar + rnorm(100)   # outcome generated from the true regressor
plot(xstar, y, col="red")
model1 = lm(y~xstar)          # regression on the true regressor
abline(a=coef(model1)[1], b=coef(model1)[2], col="red")
model2 = lm(y~x)              # regression on the mismeasured regressor
points(x, y, col="blue")
abline(a=coef(model2)[1], b=coef(model2)[2], col="blue")   # flatter line: attenuation
The quantity
$$\frac{\sigma^2_{X^*}}{\sigma^2_{X^*} + \sigma^2_{\eta}},$$
which attenuates the regression coefficient, is called the reliability ratio.
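As a quick check, the reliability ratio in the simulation above can be estimated from the generated data. A sketch: by construction both variance components equal 1, so the true ratio is 0.5, though with only 100 observations the estimates are noisy.

# Empirical reliability ratio from the simulation above
var(xstar) / var(x)               # estimates var(xstar) / (var(xstar) + var(noise))
coef(model2)[2] / coef(model1)[2] # ratio of the two slopes is roughly the same quantity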
The following table shows the reliability ratios of a few key variables in economics:

Variable              Reliability Ratio   Data Set                 Reference
Log Annual Hours      0.63                PSID Validation Study    Bound et al. (1994)
Log Annual Earnings   0.76                PSID Validation Study    Duncan and Hill (1986)
Years of Education    0.90                Twinsburg Twins Study    Ashenfelter and Krueger (1994)
When the interest is in $\beta_1$ in the multiple regression
$$Y_i = \beta_0 + \beta_1 X^*_{i1} + \beta_2 X_{i2} + \dots + \beta_K X_{iK} + \varepsilon_i$$
and the classical measurement error model holds for $X^*_{i1}$, then $\gamma_1$ in the model
$$Y_i = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \dots + \gamma_K X_{iK} + u_i$$
satisfies
$$\gamma_1 = \beta_1 \frac{\sigma^2_{r_{X^*}}}{\sigma^2_{r_{X^*}} + \sigma^2_{\eta}},$$
where $r_{X^*}$ is the error from the regression of $X^*_{i1}$ on $X_{i2}, \dots, X_{iK}$.
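A simulation sketch of this formula under a made-up data generating process in which the true regressor is correlated with a correctly measured covariate (all names and values are illustrative). Controlling for the covariate shrinks $\sigma^2_{r_{X^*}}$, so the attenuation is worse than in the simple regression:

# Simulation sketch: attenuation in multiple regression (hypothetical DGP)
set.seed(3)
n = 100000
x2    = rnorm(n)                 # correctly measured covariate
xstar = 0.8*x2 + rnorm(n)        # true regressor, correlated with x2
x1    = xstar + rnorm(n)         # mismeasured regressor, noise variance 1
y = xstar + 0.5*x2 + rnorm(n)    # true coefficient on xstar is 1
r = resid(lm(xstar~x2))          # r_{X*}: part of xstar not explained by x2
var(r) / (var(r) + 1)            # predicted attenuation factor (0.5 by construction)
coef(lm(y~x1+x2))["x1"]          # close to 1 * 0.5
var(xstar) / var(x1)             # simple-regression reliability ratio (about 0.62)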
Although classical measurement error is a good starting point, in some important situations classical measurement error is not possible. For example, for a binary variable, classical measurement error cannot hold, because there are only two ways the variable can be mismeasured: a true 0 can be mismeasured as a 1, and a true 1 can be mismeasured as a 0; consequently, the error depends on the true value. Aigner (1972) shows that random misclassification of a binary variable still biases the regression coefficients towards zero. But in general, non-classical measurement error can produce a multiplicative bias factor greater or less than one.
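Here is a simulation sketch of the binary case (a hypothetical setup for illustration, not taken from Aigner's paper): each observation of a binary regressor is flipped at random with probability 0.1, independently of everything else, and the estimated coefficient is biased towards zero.

# Simulation sketch: random misclassification of a binary regressor
set.seed(4)
n = 100000
d = rbinom(n, 1, 0.5)            # true binary regressor
y = d + rnorm(n)                 # true coefficient on d is 1
flip = rbinom(n, 1, 0.1)         # misclassify 10% of observations at random
dobs = ifelse(flip==1, 1-d, d)   # observed, misclassified regressor
coef(lm(y~dobs))["dobs"]         # attenuated towards zero (about 1 - 2*0.1 here)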