WORKING PAPER
DEPARTMENT OF ECONOMICS
Massachusetts Institute of Technology
50 Memorial Drive, Cambridge, MA 02139
Nonparametric Estimation of Triangular
Simultaneous Equations Models
Whitney K. Newey
James L. Powell
Francis Vella
No. 98-06    July 1998
Nonparametric Estimation of Triangular
Simultaneous Equations Models

Whitney K. Newey
Department of Economics
MIT, E52-262D
Cambridge, MA 02139
wnewey@mit.edu

James L. Powell
Department of Economics
University of California at Berkeley

Francis Vella
Department of Economics
Rutgers University
New Brunswick, NJ 08903
Vella@fas-econ.rutgers.edu

March 1995
Revised, February 1998
Abstract

This paper presents a simple two-step nonparametric estimator for a triangular simultaneous equation model. Our approach employs series approximations that exploit the additive structure of the model. The first step comprises the nonparametric estimation of the reduced form and the corresponding residuals. The second step is the estimation of the primary equation via nonparametric regression with the reduced form residuals included as a regressor. We derive consistency and asymptotic normality results for our estimator, including optimal convergence rates. Finally we present an empirical example, based on the relationship between the hourly wage rate and annual hours worked, which illustrates the utility of our approach.

Keywords: Nonparametric Estimation, Simultaneous Equations, Series Estimation, Two-Step Estimators.
The NSF provided financial support for the work of Newey and Powell on this paper. An early version of this paper was presented at an NBER conference on applications of semiparametric estimation, October 1993, at Northwestern. We are grateful to participants at this conference and workshops at Harvard/MIT and Tilburg for helpful comments. We are also grateful to the referees for their extensive comments, which have improved both the content and style of this paper.
1. Introduction

Structural estimation is important in econometrics, because it is needed to account correctly for endogeneity that comes from individual choice or market equilibrium. Often structural models do not have tight functional form specifications, so that it is useful to consider nonparametric structural models and their estimation. This paper proposes and analyzes one approach to nonparametric structural modeling and estimation. This approach is different than standard nonparametric regression, because the object of estimation is the structural model and not just a conditional expectation.
Nonparametric structural models have been previously considered in Roehrig (1988), Newey and Powell (1989), and Vella (1991). Roehrig (1988) gives identification results for a system of equations when the errors are independent of the instruments. Newey and Powell (1989) consider identification and estimation under the weaker condition that the disturbance has conditional mean zero given the instruments. The results of this paper are complementary to these previous results. The model we consider is in some ways more restrictive than Newey and Powell (1989), but is easier to estimate.
The model we consider is a triangular nonparametric simultaneous equations model

    y = g_0(x, z_1) + ε,    E[ε|u,z] = E[ε|u],
                                                        (1.1)
    x = Π_0(z) + u,         E[u|z] = 0,

where x is a d_x × 1 vector of endogenous variables, z is a d_z × 1 vector of instrumental variables that includes z_1 as a d_1 × 1 subvector, Π_0(z) is a d_x × 1 vector of functions of the instruments z, and u is a d_x × 1 vector of disturbances.
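To fix ideas, the following sketch simulates a minimal data-generating process satisfying the restrictions of (1.1). The functional forms here, g_0(x) = x², Π_0(z) = z, and a multiplicative disturbance, are illustrative assumptions of ours, not choices made in the paper.

```python
import numpy as np

# Minimal simulation of the triangular model (1.1).  All functional forms
# (g0(x) = x**2, Pi0(z) = z, eps = u * e) are hypothetical illustrations.
rng = np.random.default_rng(0)
n = 50_000

z = rng.uniform(-1.0, 1.0, n)     # instrument
u = rng.normal(0.0, 1.0, n)       # reduced-form disturbance, independent of z
x = z + u                         # x = Pi0(z) + u, so E[u|z] = 0 holds

# eps = u * e with e independent of (u, z): E[eps|u,z] = u * E[e] = u, which
# depends on u alone, exactly the control-function restriction in (1.1).
e = 1.0 + rng.normal(0.0, 1.0, n)
eps = u * e
y = x**2 + eps                    # structural equation y = g0(x) + eps

# x is endogenous: eps and u are positively correlated by construction.
print(round(float(np.corrcoef(eps, u)[0, 1]), 2))
```

Note that eps here is conditionally heteroskedastic given u, which is exactly the kind of disturbance the conditional mean restriction accommodates.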
Equation (1.1) generalizes the limited information simultaneous equations model to allow for a nonparametric relationship g_0(x,z_1) between the variables y and (x,z_1) and a nonparametric reduced form Π_0(z). The conditional mean restriction in equation (1.1) is one conditional mean version of the usual orthogonality condition for a linear model. It is a more general assumption than requiring that (ε,u) be independent of z and that E[u] = 0. This added generality allows for some conditional heteroskedasticity in the disturbances, which may be important for some econometric models. For example, y can be purchases of a commodity and x expenditures on a subgroup of commodities. Endogeneity results from expenditure on a commodity subset being a choice variable. Also, heteroskedasticity often results from individual heterogeneity in demand functions, as pointed out by Brown and Walker (1989). Assumption (1.1) allows for both endogeneity and heteroskedasticity in demand models.
An alternative model, considered by Newey and Powell (1989), requires only that E[ε|z] = 0. Strictly speaking, neither model is stronger than the other, because equation (1.1) does not imply that E[ε|z] = 0. [2] The additive separability of x into a reduced form and a residual u satisfying equation (1.1) is a strong restriction. [3] One benefit of such a condition is that it leads to a straightforward estimation approach, as will be further discussed below. [4] In this sense, the model of equation (1.1) is a convenient one for applications where its restrictions are palatable.

We propose a two-step nonparametric estimator of this model. The first step consists of the construction of a residual vector from the nonparametric regression of x on z. The second step is the nonparametric regression of y on x, z_1, and u.

[2] If equation (1.1) is satisfied, u is independent of z, and E[ε] = 0, then E[ε|z] = E[E[ε|u,z]|z] = E[E[ε|u]|z] = E[E[ε|u]] = E[ε] = 0.

[3] If Π(z) were linear in z and the conditional expectations were replaced by population regressions, then the third condition implies that z is orthogonal to u, and the second condition that z is orthogonal to ε.

[4] In contrast, if only the conditional mean assumption E[ε|z] = 0 is satisfied, estimation is very difficult, as discussed in Newey and Powell (1989). The mapping from the reduced form E[y|z] to the structure g_0(x,z_1) turns out to be discontinuous, making it difficult to construct a consistent estimator.
Two-step nonparametric kernel estimation has previously been considered in Ahn (1994). Our results concern series estimators, which are convenient for imposing the additive structure implied by equation (1.1). We derive optimal mean-square convergence rates and asymptotic normality results that account for the first step estimation and allow for √n-consistency of mean-square continuous functionals of the estimator.

Section 2 of the paper considers identification, presenting straightforward sufficient conditions. Section 3 describes the estimator and Section 4 derives convergence rates. Section 5 gives conditions for asymptotic normality and inference, including consistent standard error formulae. Section 6 describes an extension of the model and estimators to semiparametric models. Section 7 contains an empirical example and Monte Carlo results illustrating the utility of our approach.
2. Identification

An implication of our model is that for λ_0(u) = E[ε|u] and w = (x', z_1', u')',

    E[y|x,z] = g_0(x,z_1) + E[ε|u,z] = g_0(x,z_1) + λ_0(u) = h_0(w).    (2.1)

Thus, the function of interest, g_0(x,z_1), is one component of an additive regression of y on (x,z_1) and u. Equation (2.1) also implies that E[ε|u,z] = E[y - g_0(x,z_1)|u,z] = E[λ_0(u) + {y - E[y|x,z]}|u,z] = λ_0(u) depends only on u, so that E[ε|u,z] = E[ε|u], as specified in equation (1.1). Thus, the additive structure of equation (2.1) is equivalent to the conditional mean restriction of equation (1.1). Furthermore, u is identified, so that the identification of g_0 under equation (1.1) is the same as the identification of this additive component in equation (2.1).
To analyze identification of g_0(x,z_1), note that a conditional expectation is unique with probability one, so that any other additive function g(x,z_1) + λ(u) satisfying equation (2.1) must have Prob(g(x,z_1) + λ(u) = g_0(x,z_1) + λ_0(u)) = 1. Equivalently, working with the difference of the two conditional expectations, identification is equivalent to the statement that a zero additive function,

    δ(x,z_1) + ν(u) = 0,    (2.2)

must have only constant components; that is, identification is equivalent to equality of conditional expectations implying equality of the additive components, up to a constant. To be precise we have

Theorem 2.1: g_0(x,z_1) is identified, up to an additive constant, if and only if Prob(δ(x,z_1) + ν(u) = 0) = 1 implies there is a constant c with Prob(δ(x,z_1) = c) = 1.

There is a straightforward interpretation of this result that leads to a simple sufficient condition for identification. Suppose that identification fails. Then by Theorem 2.1 there are δ(x,z_1) and ν(u) with δ(x,z_1) + ν(u) = 0 and δ(x,z_1) nonconstant. Intuitively, this implies a functional relationship between (x,z_1) and u, a degeneracy in the joint distribution of these two random variables. For instance, if ν(u) were a one-to-one function of a subvector of u, then δ(x,z_1) + ν(u) = 0 implies an exact relationship between (x,z_1) and that subvector. More generally, a zero additive function with nonconstant components implies a degeneracy in the joint distribution of the random vectors (x,z_1) and u. Consequently, the absence of an exact relationship between (x,z_1) and u will imply identification.
To formalize these ideas it is helpful to be precise about what we mean by existence of a functional relationship.

Definition: There is a functional relationship between (x,z_1) and u if and only if there exists a function h(x,z_1,u) and a set U such that Prob(u ∈ U) > 0, Prob(h(x,z_1,ū) = 0) < 1 for all fixed ū ∈ U, and Prob(h(x,z_1,u) = 0) = 1.

This condition says that the pair of vectors (x,z_1) and u solve a nontrivial implicit equation. Thus, the functional relationship of this definition is an implicit one.
As previously noted, nonidentification implies existence of such an implicit relationship. Therefore, the contrapositive statement leads to the following sufficiency result for identification.

Theorem 2.2: If there is no functional relationship between (x,z_1) and u, then g_0(x,z_1) is identified up to an additive constant.

Although it is a sufficient condition, nonexistence of a functional relationship between (x,z_1) and u is not a necessary condition for identification. By Theorem 2.1, it is nonexistence of an additive functional relationship that is necessary and sufficient. Thus, identification may still occur when there is an exact, nonadditive functional relationship.
The additive structure is so strong that the model may be identified even when the usual order condition is not satisfied, i.e. even though z is of smaller dimension than x. For example, suppose that x is two dimensional, z is a scalar, and the reduced form is x = Π(z) + u with Π(z) = (z, e^z)', i.e. there is only one instrument although there are two endogenous variables. In this example there is a nonlinear, nonadditive relationship between x and u, namely x_2 = exp(x_1 - u_1) + u_2. The reduced form must also be differentiable, and the restriction that g_0(x) and λ_0(u) are differentiable means that δ(x) and ν(u) in equation (2.2) must also be differentiable. Consider an additive function satisfying equation (2.2),

    δ(x) + ν(u) = δ(x_1, exp(x_1 - u_1) + u_2) + ν(u_1, u_2) = 0.

Let numerical subscripts denote partial derivatives with respect to corresponding arguments. Then equation (2.2) implies

    δ_1(x) + δ_2(x)exp(x_1 - u_1) = 0,
    -δ_2(x)exp(x_1 - u_1) + ν_1(u) = 0,
    δ_2(x) + ν_2(u) = 0.

Combining the last two equations gives ν_1(u) = -ν_2(u)exp(x_1 - u_1). Assuming that the support of x_1 conditional on u contains more than one point with probability one, so that x_1 can vary for given u, it follows from this equation that ν_2(u) = 0 with probability one, and hence ν_1(u) = 0, so that ν is a constant function. Then the second and first equations imply δ_2(x) = δ_1(x) = 0, implying that both δ and ν are constant functions. Thus, g_0(x) is identified, up to a constant, by Theorem 2.1.
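The nonadditive relationship used in this example is easy to confirm numerically; the sketch below only checks the algebra of eliminating z from the reduced form (the distributions drawn are arbitrary).

```python
import numpy as np

# In the example, Pi(z) = (z, e^z)' with scalar z and two-dimensional x, so
# x1 = z + u1 and x2 = e^z + u2.  Eliminating z gives the nonadditive
# relationship x2 = exp(x1 - u1) + u2 stated in the text.
rng = np.random.default_rng(3)
z = rng.normal(size=1000)
u1, u2 = rng.normal(size=1000), rng.normal(size=1000)

x1 = z + u1
x2 = np.exp(z) + u2

gap = x2 - (np.exp(x1 - u1) + u2)     # identically zero up to rounding
print(float(np.max(np.abs(gap))))
```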
There is a simple sufficient condition for identification that generalizes the rank condition for identification in a linear simultaneous equations system. Let z be partitioned as z = (z_1', z_2')', and let subscripts x, 1, and 2 denote differentiation with respect to x, z_1, and z_2, respectively.

Theorem 2.3: Suppose that g_0(x,z_1), λ_0(u), and Π_0(z) are differentiable, the boundary of the support of (z,u) has zero probability, and with probability one rank(Π_2(z)) = d_x. Then g_0(x,z_1) is identified.

If Π(z) were linear in z, then rank(Π_2(z)) = d_x is precisely the condition for identification of one equation of a linear simultaneous system, in terms of the reduced form coefficients. Thus, this condition is a nonlinear, nonparametric generalization of the rank condition.
It is interesting to note, as pointed out by a referee, that this condition leads to an explicit formula for the structure. Note that

    h(x,z) = E[y|x,z] = g_0(x,z_1) + λ_0(x - Π_0(z)).

Differentiating gives

    h_x(x,z) = g_x(x,z_1) + λ_u(u),
    h_1(x,z) = g_1(x,z_1) - Π_1(z)'λ_u(u),
    h_2(x,z) = -Π_2(z)'λ_u(u).

Then, if rank(Π_2(z)) = d_x for almost all z, multiplying h_2(x,z) by D(z) = [Π_2(z)Π_2(z)']^{-1}Π_2(z) and solving gives λ_u(u) = -D(z)h_2(x,z). Plugging this result into the other terms gives

    g_x(x,z_1) = h_x(x,z) + D(z)h_2(x,z),    (2.3)
    g_1(x,z_1) = h_1(x,z) - Π_1(z)'D(z)h_2(x,z).

This formula gives an explicit equation for the derivatives of the structural function in terms of the identified conditional expectations h(x,z) and Π(z).
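A quick sanity check of (2.3) in the linear special case: with hypothetical slopes b, p, r and no included z_1, D(z) reduces to 1/p, and the formula must return the structural slope b.

```python
import numpy as np

# Check of the derivative formula (2.3) in a linear special case (all slopes
# hypothetical): g0(x) = b*x, Pi(z) = p*z, lam0(u) = r*u, no included z1.
b, p, r = 2.0, 1.5, 0.7

# h(x, z) = E[y|x,z] = g0(x) + lam0(x - Pi(z))
h = lambda x, z: b * x + r * (x - p * z)

def num_deriv(f, a, i, eps=1e-6):
    # central-difference partial derivative of f with respect to argument i
    a1, a2 = list(a), list(a)
    a1[i] += eps
    a2[i] -= eps
    return (f(*a1) - f(*a2)) / (2 * eps)

x0, z0 = 0.3, -0.4
h_x = num_deriv(h, (x0, z0), 0)      # equals b + r
h_z = num_deriv(h, (x0, z0), 1)      # equals -p * r

# D(z) = [Pi_z Pi_z']^{-1} Pi_z reduces to 1/p in the scalar case, and
# (2.3) gives g_x = h_x + D(z) h_z, recovering the structural slope b.
g_x = h_x + (1.0 / p) * h_z
print(round(g_x, 6))
```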
So far we have only considered identification of g_0(x,z_1) up to an additive constant. This is sufficient for many purposes, such as comparing g_0(x,z_1) at different values of (x,z_1). In other cases, such as forecasting demand quantity, it is also desirable to know the level of g_0(x,z_1), which can be identified if a location restriction is imposed on the distribution of ε. For example, suppose it is assumed that E[ε] = 0, so that it follows that E[y] = E[g_0(x,z_1)]. Then from equation (2.1), if τ(u) is some function with ∫τ(u)du = 1,

    ∫E[y|x,z_1,u]τ(u)du - E[∫E[y|x,z_1,u]τ(u)du] + E[y]
        = g_0(x,z_1) - E[g_0(x,z_1)] + E[y] = g_0(x,z_1).    (2.4)

Thus, under the familiar restriction E[ε] = 0, the conditions for identification of g_0(x,z_1) up to a constant, as in equation (2.2), also serve to identify the level of g_0(x,z_1).
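The identity (2.4) can be checked numerically for concrete choices of the ingredients; g_0(x) = x², λ_0(u) = sin(u), and a uniform weight τ are illustrative assumptions only.

```python
import numpy as np

# Numeric check of the partial-mean identity behind (2.4), with illustrative
# choices g0(x) = x**2 and lam0(u) = sin(u), so h0(x,u) = g0(x) + lam0(u).
# tau is the uniform density on [-1, 1], which integrates to one.
g0 = lambda x: x**2
h0 = lambda x, u: g0(x) + np.sin(u)

ugrid = np.linspace(-1.0, 1.0, 20_001)
du = ugrid[1] - ugrid[0]
tau = 0.5                                  # uniform density height on [-1, 1]

def partial_mean(x):
    f = h0(x, ugrid) * tau
    return float(np.sum(0.5 * (f[1:] + f[:-1])) * du)   # trapezoid rule

# Integrating out u shifts every x by the SAME constant (here zero, since
# sin is odd), so the partial mean recovers g0 up to an additive constant.
xs = [-0.5, 0.0, 1.0]
shifts = [partial_mean(x) - g0(x) for x in xs]
print(shifts)
```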
Another approach to identifying the constant term is to place restrictions on λ_0(u). For example, the restriction λ_0(0) = 0 is a generalization of a condition that holds in the linear simultaneous equations model because, in the linear model where g_0(x,z_1) and Π(z) are linear and equation (1.1) holds with population regressions replacing conditional expectations, the regression of y on (x,z_1) and u will be linear in u. More generally, we will consider imposing the restriction

    λ_0(ū) = λ̄    (2.5)

for some known ū and λ̄. In this case

    E[y|x,z_1,u]|_{u=ū} - λ̄ = g_0(x,z_1) + λ_0(ū) - λ̄ = g_0(x,z_1),

so that there is a simple, explicit formula for the structural function in terms of h_0. The constant will be identified under other types of restrictions as well, but this will suffice for many purposes.

3. Estimation

Equation (2.1) can be estimated in two steps by combining estimation of the residual u with estimation of the additive regression in equation (2.1). The first step of this procedure is the formation of nonparametrically estimated residuals û_i = x_i - Π̂(z_i), (i = 1, ..., n). The second step is estimation of the additive regression in equation (2.1), using the estimated residuals in place of the true ones. An estimator of the structural function g_0(x,z_1) can then be recovered by pulling out the component that depends on those variables.
Estimation of additive regression models has previously been considered in the literature, and we can apply those results for our second step estimation. There are at least two general approaches to estimation of an additive component of a nonparametric regression. One is to use an estimator that imposes additivity and then pull out the component of interest. This approach for kernel estimators has been considered by Breiman and Friedman (1985), Hastie and Tibshirani (1991), and others. Series estimators are particularly convenient for imposing additivity, by simply excluding interaction terms, as in Stone (1985) and Andrews and Whang (1990).
Another approach to estimating the additive component is to use the fact that, for a function τ(u) such that ∫τ(u)du = 1,

    ∫E[y|x,z_1,u]τ(u)du = g_0(x,z_1) + ∫λ_0(u)τ(u)du.

Then g_0(x,z_1) can be estimated by integrating, over u, an unrestricted estimator of the conditional expectation. This partial means approach to estimating additive models has been proposed by Newey (1994), Tjostheim and Auestad (1994), and Linton and Nielsen (1995). It is computationally convenient for kernel estimators, although it is less asymptotically efficient than imposing additivity for estimation of √n-consistent functionals of additive models when the disturbance is homoskedastic.
In this paper we focus on series estimators because of their computational convenience and high efficiency in imposing additivity. To describe the two-step series estimator we initially consider the first step. For each positive integer L let r^L(z) = (r_1(z),...,r_L(z))' be a vector of approximating functions. In this paper we will consider in detail polynomial and spline approximating functions, which will be further discussed below. Let n be the total number of observations, and let Π̂(z) be the predicted value from a regression of x_i on r^L(z_i),

    Π̂(z) = r^L(z)'γ̂,    γ̂ = (R'R)^{-1}R'(x_1,...,x_n)',    R = [r^L(z_1),...,r^L(z_n)]'.    (3.1)
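In code, the first step (3.1) is an ordinary least-squares regression on the series terms; the sketch below uses a polynomial basis and a made-up reduced form Π_0(z) = sin(z).

```python
import numpy as np

# First-step series estimate of Pi0 and residuals u_hat, following (3.1).
# Polynomial basis r^L(z) = (1, z, z^2, z^3); the DGP is a made-up example.
rng = np.random.default_rng(1)
n, L = 2000, 4

z = rng.uniform(-1, 1, n)
u = rng.normal(0, 0.1, n)
x = np.sin(z) + u                         # Pi0(z) = sin(z), E[u|z] = 0

R = np.vander(z, L, increasing=True)      # n x L matrix of powers of z
gamma = np.linalg.lstsq(R, x, rcond=None)[0]
pi_hat = R @ gamma                        # fitted reduced form at the data
u_hat = x - pi_hat                        # estimated residuals for step two

print(round(float(np.mean(u_hat**2)), 4))
```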
To form the second step, let an i subscript index the observations, let w = (x', z_1', u')', and let p^K(w) = (p_1(w),...,p_K(w))' be a vector of approximating functions of w such that each p_k(w) depends either on (x,z_1) or on u, but not both. Exclusion of the interaction terms will mean that any linear combination of the approximating functions has the same additive structure as in equation (2.1), i.e. that additivity is imposed. Also, let 1(A) denote the indicator function for the event A, and let t(w) denote a trimming function of the form

    t(w) = ∏_{j=1}^{d+d_x} 1(a_j ≤ w_j ≤ b_j),    (3.2)

where the a_j and b_j are finite constants, w_j is the jth component of w, and d = d_x + d_1 is the dimension of (x,z_1). Let û_i = x_i - Π̂(z_i), ŵ_i = (x_i', z_1i', û_i')', p̂_i = p^K(ŵ_i), and t̂_i = t(ŵ_i), where an i subscript refers to the ith observation. The second step is obtained by regressing t̂_i y_i on t̂_i p̂_i, giving

    ĥ(w) = p^K(w)'β̂,    β̂ = (P̂'P̂)^{-1}P̂'Ŷ,    P̂ = [t̂_1 p̂_1,...,t̂_n p̂_n]',    Ŷ = (t̂_1 y_1,...,t̂_n y_n)'.    (3.3)
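Putting the two steps together, the following sketch implements the spirit of (3.1)-(3.3) for scalar x with no included z_1, using polynomial bases and omitting trimming for brevity; the data-generating process and basis sizes are illustrative choices of ours, not the paper's.

```python
import numpy as np

# Two-step additive series estimator sketch: scalar x, no included z1,
# polynomial bases, no trimming.  DGP and basis sizes are illustrative.
rng = np.random.default_rng(2)
n = 5000

z = rng.uniform(-1, 1, n)
u = rng.normal(0, 0.3, n)
x = z + 0.5 * z**2 + u                     # reduced form Pi0(z) = z + z^2/2
eps = u + rng.normal(0, 0.3, n)            # E[eps|u,z] = u: depends on u only
y = np.cos(x) + eps                        # structural g0(x) = cos(x)

# Step 1: series regression of x on powers of z, then residuals u_hat.
R = np.vander(z, 4, increasing=True)
u_hat = x - R @ np.linalg.lstsq(R, x, rcond=None)[0]

# Step 2: regress y on an additive basis [powers of x, powers of u_hat]
# with no interaction terms, so the additivity of (2.1) is imposed.
P = np.column_stack([
    np.vander(x, 5, increasing=True),             # (x,z1)-block, with constant
    np.vander(u_hat, 4, increasing=True)[:, 1:],  # u-block, constant excluded
])
beta = np.linalg.lstsq(P, y, rcond=None)[0]

# g_hat, up to its constant, is the x-block of the fitted additive regression.
g_part = np.vander(x, 5, increasing=True)[:, 1:] @ beta[1:5]
dev_est = g_part - g_part.mean()
dev_true = np.cos(x) - np.cos(x).mean()
print(round(float(np.mean((dev_est - dev_true) ** 2)), 4))
```

Regressing y on powers of x alone would leave the u-component in the error and bias the fit; including the first-step residual as a regressor is what removes the endogeneity.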
The trimming function t(w) is convenient for the asymptotic theory. It can also be used to exclude large values of w that might lead to outliers in the case of polynomial regression. Although we assume that the trimming function is nonrandom, some results would extend to the case where the limits a_j and b_j are estimated. In particular, if these limits are estimated from independent data (e.g. a subsample that is not otherwise used in estimation), then the results described here will still hold.

The estimator ĥ(w) can be used to construct estimators of the structural function g_0(x,z_1) and of λ_0(u), respectively, by collecting those terms that depend only on (x,z_1) and those that depend only on u. Suppose that p_k(w) depends only on (x,z_1) for the first K_1 terms, and that the remaining terms depend only on u. Then the estimators can be formed as

    ĝ(x,z_1) = ĉ_g + Σ_{j=1}^{K_1} β̂_j p_j(x,z_1),    λ̂(u) = ĉ_λ + Σ_{j=K_1+1}^{K} β̂_j p_j(u).    (3.4)

To complete the definition of these estimators the constant terms should be specified; the estimators are uniquely defined except for the constants ĉ_g and ĉ_λ. For many objects of estimation, such as predicting how g(x,z_1) changes as x changes, the constants are not important. For example, if g(x,z_1) represents log-demand and x includes the log of price, a constant shift does not affect elasticities (i.e. derivatives of g). In other cases the constant is important; for example, knowing the level of demand is important for predicting tax receipts. Because of the trimming we cannot use the most familiar condition, E[ε] = 0, to estimate the constants. However, the restriction λ_0(ū) = λ̄ of equation (2.5), which generalizes a condition satisfied by the linear model, can easily be used. Choosing the constants in equation (3.4) to satisfy λ̂(ū) = λ̄, it will follow that the estimators of the individual components are

    ĝ(x,z_1) = ĥ(x,z_1,ū) - λ̄,    λ̂(u) = ĥ(x,z_1,u) - ĥ(x,z_1,ū) + λ̄,    (3.6)

where, by the additivity of ĥ, the right-hand sides do not depend on the arguments that are held fixed.
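The constant-splitting rule behind (3.6) can be illustrated with a stand-in additive fit; the functions below are placeholders for an estimated ĥ, not output of the estimator.

```python
import numpy as np

# Pinning the constants as in (3.6): given an additive fit h_hat(x, u) and a
# restriction lam0(u_bar) = lam_bar, set g_hat(x) = h_hat(x, u_bar) - lam_bar.
# h_hat here is a hypothetical stand-in for an estimated additive regression.
g_term = lambda x: x**2 - 1.0            # x-part of the fit (constant mixed in)
lam_term = lambda u: np.tanh(u) + 1.0    # u-part of the fit

h_hat = lambda x, u: g_term(x) + lam_term(u)

u_bar, lam_bar = 0.0, 0.0                # impose lam0(0) = 0

g_hat = lambda x: h_hat(x, u_bar) - lam_bar
lam_hat = lambda u: h_hat(0.0, u) - h_hat(0.0, u_bar) + lam_bar

# The rebuilt pieces satisfy lam_hat(u_bar) = lam_bar and
# g_hat(x) + lam_hat(u) = h_hat(x, u) for all (x, u).
print(g_hat(2.0), lam_hat(0.5))
```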
We consider two specific types of series estimators, corresponding to polynomial and spline approximating functions. To describe these, let μ = (μ_1,...,μ_{d_z})' denote a multi-index, i.e. a vector of nonnegative integers, with norm |μ| = Σ_j μ_j, and let z^μ = ∏_j (z_j)^{μ_j}. For a sequence (μ(ℓ)) of distinct such vectors, a power series approximation for the first stage is given by r^L(z) = (z^{μ(1)},...,z^{μ(L)})'. To describe a power series in the second stage, let (μ(k)) denote a sequence of multi-indices with dimension the same as w, such that for each k the term w^{μ(k)} depends only on (x,z_1) or only on u, but not both. This feature can be incorporated by making sure that the first d components of μ(k) are zero if any of the last d_x components are nonzero, and vice versa. Then for the second stage a power series approximation can be obtained as p^K(w) = (w^{μ(1)},...,w^{μ(K)})'. For the asymptotic theory it will be assumed that each multi-index sequence is ordered with the degree |μ(ℓ)| increasing in ℓ or k, and that all terms of a particular order are included before increasing the order.

The theory to follow uses orthonormal polynomials, which may also have computational advantages. For the first step, replacing each power of z by the product of univariate polynomials of the same order, where the univariate polynomials are orthonormal with respect to some distribution, may lead to reduced collinearity. The estimator will be numerically invariant to such a replacement, because |μ(ℓ)| is monotonically increasing. An analogous replacement could also be used in the second step.
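One way to build the second-stage multi-indices with additivity imposed is to enumerate by total degree and drop any index that mixes the (x,z_1) block with the u block; the dimensions below are examples.

```python
from itertools import product

# Second-stage power-series multi-indices: ordered by total degree, and each
# index touches only the (x,z1) block or only the u block, so the basis
# imposes additivity as in Section 3.  Dimensions here are examples.
d, d_x = 2, 1                    # dim(x,z1) = 2, dim(u) = 1; w has 3 coords

def additive_multi_indices(max_degree):
    out = []
    for total in range(max_degree + 1):
        for mu in product(range(total + 1), repeat=d + d_x):
            if sum(mu) != total:
                continue
            first, last = mu[:d], mu[d:]
            # keep mu only if at most one block is active (constants allowed);
            # interactions WITHIN the (x,z1) block remain permitted
            if sum(first) == 0 or sum(last) == 0:
                out.append(mu)
    return out

idx = additive_multi_indices(2)
print(idx)
```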
Regression splines are smooth piecewise polynomials with fixed knots (join points). They have been considered in the statistics literature by Agarwal and Studden (1980). They are different than smoothing splines, in that for regression splines the smoothing parameter is the number of knots. They have attractive features relative to power series: they are less sensitive to outliers and to bad approximations over small regions. The theory here requires that the knots be placed in the support of z, which therefore must be known, and that the knots be evenly spaced. It should be possible to generalize the results to allow the boundary of the support to be estimated and the knots not evenly spaced, but these generalizations lead to further technicalities that we leave to future research.
To describe regression splines it is convenient to assume that the support of z is a cartesian product of the interval [-1,1], and in this paper we take a_j = -1 and b_j = 1. For a scalar c, let (c)_+ = 1(c > 0)·c. An m-th degree spline with J-1 evenly spaced knots on [-1,1] is a linear combination of

    p_{ℓJ}(u) = u^{ℓ-1} for 1 ≤ ℓ ≤ m+1,
    p_{ℓJ}(u) = {[u + 1 - 2(ℓ-m-1)/J]_+}^m for m+2 ≤ ℓ ≤ m+J.

In this paper the degree m is fixed, but the number of knots will be allowed to increase with sample size. For a set of multi-indices {μ(ℓ)}, the approximating functions for z will be products of univariate splines. In particular, if J_j is the number of knots for the jth component of z, then the approximating functions could be formed as

    r_{ℓL}(z) = ∏_{j=1}^{d_z} p_{μ_j(ℓ),J_j}(z_j).    (3.7)

Throughout the paper it will be assumed that L depends on the J_j in such a way that the ratio of the numbers of knots for each pair of elements of z is bounded above and away from zero. Spline approximating functions in the second step could be formed in the analogous way, from products of univariate splines, with additivity imposed by only including terms that depend on one of (x,z_1) or u, but not both.

The theory to follow uses B-splines, which are a linear transformation of the above functions with lower multicollinearity. The low multicollinearity of B-splines and a recursive formula for calculation also lead to computational advantages; e.g. see Powell (1981).
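The truncated-power construction above translates directly into code; this sketch builds the univariate basis for a given degree m and knot count J (the grid and parameter values are examples).

```python
import numpy as np

# m-th degree regression spline basis on [-1, 1] with J-1 evenly spaced
# interior knots, following the truncated-power construction in the text.
def spline_basis(u, m, J):
    u = np.asarray(u, dtype=float)
    cols = [u**l for l in range(m + 1)]          # 1, u, ..., u^m
    for l in range(m + 2, m + J + 1):            # truncated-power terms
        knot = -1.0 + 2.0 * (l - m - 1) / J      # evenly spaced interior knots
        cols.append(np.where(u > knot, (u - knot) ** m, 0.0))
    return np.column_stack(cols)

grid = np.linspace(-1, 1, 401)
B = spline_basis(grid, m=2, J=4)                 # quadratic, 3 interior knots
print(B.shape)
```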
4. Convergence Rates

In this Section we derive mean-square error and uniform convergence rates. To obtain results it is important to impose some regularity conditions. Let ε = y - h_0(w), u = x - Π_0(z), and X = (x,z). For a matrix D, let ||D|| = [trace(D'D)]^{1/2}. Also, for a random matrix Y, let ||Y||_ν = {E[||Y||^ν]}^{1/ν} for ν < ∞, and let ||Y||_∞ be the infimum of constants C such that Prob(||Y|| ≤ C) = 1.

Assumption 1: {(y_i, x_i, z_i)}, (i = 1, 2, ...), is i.i.d., and Var(x|z) and Var(y|X) are bounded.

The bounded second conditional moment assumption is quite common in the series estimation literature (e.g. Stone, 1985). Relaxing it would lead to complications that we wish to avoid.
Next, we impose the following requirement.

Assumption 2: z is continuously distributed with density that is bounded away from zero on its support, and the support of z is a cartesian product of compact, connected intervals. Also, w is continuously distributed, the density of w is bounded away from zero on W, and W = {w : t(w) = 1} is contained in the interior of the support of w.
This assumption is useful for deriving the properties of series estimators like those considered here (e.g. see Newey, 1997). It allows us to bound below the eigenvalues of the second moment matrix of the approximating functions. Also, an identification condition like those discussed in Section 2 is embodied in this assumption: the density of w being bounded away from zero means that there is no functional relationship between (x,z_1) and u, so that by Theorem 2.2 identification holds. This assumption, along with the smoothness conditions discussed below, implies that the dimension of z must be at least as large as the dimension of (x,z_1), a familiar order condition for identification.
To see why Assumption 2 implies the order condition, note that the density of w must be bounded away from zero on an open set. But w is a one-to-one function of z_1, Π(z), and u, and (z_1, Π(z)) is a vector function of z that will be smooth under the assumptions given below. Then, unless the dimensions of z and (z_1, Π(z)) are the same, the range of (z_1, Π(z)) must be confined to a manifold of smaller dimension, so that some discreteness in the distribution of w results. In particular, z must have at least as many components as (z_1, Π(z)), i.e. at least as many components as (x,z_1).

Some discreteness can be allowed for without affecting the results. Assumption 2 can be weakened to hold only for some components of (x,z_1), and one could allow some components of z to be discretely distributed, as long as they have finite support. On the other hand, it is not easy to extend our results to the case where there are discrete instruments that take on an infinite number of values.
Some smoothness conditions are useful for controlling the bias of the series estimators.

Assumption 3: g_0(x,z_1) and λ_0(u) are Lipschitz and continuously differentiable of order s on W, and Π_0(z) is continuously differentiable of order s_1 on the support of z.

The derivative orders s and s_1 control the rate at which polynomials or splines approximate the functions. In particular, the rate of approximation for g_0(x,z_1) and λ_0(u), and hence also for h_0(w), will be O(K^{-s/d}), where d is the dimension of (x,z_1).
The last regularity condition restricts the rates of growth of the numbers of terms K and L. Recall that d_z denotes the dimension of z.

Assumption 4: Either a) for power series, (K³ + KL)[(L/n)^{1/2} + L^{-s_1/d_z}] → 0; or b) for splines, (K² + KL^{1/2})[(L/n)^{1/2} + L^{-s_1/d_z}] → 0.

This condition restricts how fast K and L can grow with the sample size; for example, for splines, if K and L grow at the same rate, then it requires that they grow slower than n^{1/5}.
As a preliminary step it is helpful to derive convergence rates for the conditional expectation estimator ĥ(w). Let F_0 denote the distribution function of w. This and subsequent proofs are given in the Appendix.

Lemma 4.1: If Assumptions 1-4 are satisfied, then

    ∫ t(w)[ĥ(w) - h_0(w)]² dF_0(w) = O_p(K/n + K^{-2s/d} + L/n + L^{-2s_1/d_z}).

Also, for q = 1/2 for splines and q = 1 for power series,

    sup_{w∈W} |ĥ(w) - h_0(w)| = O_p(K^q[(K/n)^{1/2} + K^{-s/d} + (L/n)^{1/2} + L^{-s_1/d_z}]).

This result leads to convergence rates for the structural estimator ĝ(x,z_1). We will treat the constant term a little differently for the mean-square and uniform convergence rates, so it is helpful to state and discuss them separately. For mean square convergence we give a result for the de-meaned difference between the estimator and the truth.
Let τ̄ = E[t(w)].

Theorem 4.2: If Assumptions 1-4 are satisfied, then for Δ(x,z_1) = ĝ(x,z_1) - g_0(x,z_1),

    ∫t(w)[Δ(x,z_1) - ∫t(w)Δ(x,z_1)F_0(dw)/τ̄]² F_0(dw)/τ̄ = O_p(K/n + K^{-2s/d} + L/n + L^{-2s_1/d_z}).

The convergence rate is the sum of two terms, depending on the number L of approximating functions in the first step and the number K in the second step. Each of these two terms has a form analogous to that derived in Stone (1985), which would attain the optimal mean-square convergence rate for estimating a conditional expectation if K and L were chosen appropriately. In particular, if K is chosen proportional to n^{d/(d+2s)}, L is chosen proportional to n^{d_z/(d_z+2s_1)}, and Assumption 4 is satisfied, then the conclusion of Theorem 4.2 is

    ∫t(w)[Δ(x,z_1) - ∫t(w)Δ(x,z_1)F_0(dw)/τ̄]² F_0(dw)/τ̄ = O_p(max{n^{-2s/(d+2s)}, n^{-2s_1/(d_z+2s_1)}}).    (4.1)

From Stone (1982) it follows that the convergence rates in this expression are optimal for their respective steps; i.e., n^{-2s/(d+2s)} would be optimal for estimation of g_0(x,z_1) if u did not have to be estimated in the first step, and n^{-2s_1/(d_z+2s_1)} is optimal for estimation of Π_0(z). Thus, when K and L are chosen appropriately, the mean square error convergence rate of the second step estimator is the slower of the optimal rate for the first step and the optimal rate for the second step that would obtain if the first step did not have to be estimated.

An important feature of this result is that the second step convergence rate is the optimal one for a function of a d-dimensional variable, rather than the slower rate that would be optimal for estimation of a general function of w, where additivity was not imposed. This occurs because the additivity of h_0(w) is used in estimation. It is essentially the exclusion of interaction terms that leads to this result, as discussed in Andrews and Whang (1990).
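The rate bound in (4.1) is easy to tabulate; the helper below evaluates max{n^{-2s/(d+2s)}, n^{-2s_1/(d_z+2s_1)}} for example smoothness and dimension values (the arguments s1 and d1 here denote the first-stage smoothness order and the dimension of z).

```python
# Evaluating the rate bound of (4.1), max{n^(-2s/(d+2s)), n^(-2s1/(d1+2s1))},
# for example smoothness/dimension values (chosen for illustration only).
def mse_rate(n, s, d, s1, d1):
    second_step = n ** (-2 * s / (d + 2 * s))     # rate with u known
    first_step = n ** (-2 * s1 / (d1 + 2 * s1))   # rate for estimating Pi0
    return max(second_step, first_step)

# A smoother first stage (s1/d1 >= s/d) leaves the second-step rate binding:
r = mse_rate(10_000, s=2, d=1, s1=4, d1=1)
print(r)
```

Swapping in a rougher first stage (say s1 = 1, d1 = 2) makes the first-step term the binding one, illustrating the "slower of the two optimal rates" conclusion above.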
Although a full derivation of bounds on the rate of convergence of estimators of g_0(x,z_1) is beyond the scope of this paper, something can be said. We know from Stone (1982) that estimation of g_0(x,z_1) when λ_0(u) is unknown cannot be better than the best rate when it is known, and that when λ_0(u) is known the best attainable mean-square error (MSE) rate for g_0(x,z_1) is n^{−2s/(d+2s)}. Setting K to be proportional to n^{d/(d+2s)} and L to be proportional to n^{d_1/(d_1+2s_1)}, this rate will be attained for splines when s/d ≤ s_1/d_1, and for power series when, in addition, s/d > 2. The condition s/d ≤ s_1/d_1 means that the second stage is at least as smooth (relative to its dimension) as the first stage, so that an estimator that minimizes each component of the mean-square error attains the best rate that would hold if λ_0(u) were known; it follows that in these ranges the second-stage rate is the best one (and that the estimator has the best rate). We do not know what the optimal convergence rate would be when the first stage is less smooth than the second stage, although in that case we know that the first-stage convergence rate would dominate the mean-square error of our estimator if K and L are chosen to minimize the mean-square error in Theorem 4.1 or 4.2. Our estimator could conceivably be suboptimal in that case. Also, the side conditions in Assumption 4 are quite strong and rule out any possibility of optimality of our estimator when neither the first nor the second stage is very smooth (e.g. s/d < 1). For power series the side conditions are even stronger and would further narrow the ranges of s/d and s_1/d_1 where the estimator could attain the best convergence rate for the second step.
It is interesting to note that the convergence rates in equation (4.1) are only relevant for the function itself and not for derivatives. Because û is plugged into ĥ(w), one might think that convergence rates for derivatives are also important, as in a Taylor expansion of ĥ(û) around h(u). The reason that they are not is that u depends only on the conditioning variables x and z in equation (2.1), a feature that allows us to work with a Taylor expansion of h_0(w) rather than ĥ(w) in the proofs. (We thank a referee for pointing this out and for giving a slightly more general version of this discussion of optimality.)
Theorem 4.2 gives convergence rates that are net of the constant term. It would be possible to obtain mean-square rates that include the constant term by specifying a restriction used to identify it; with the trimming used in the proofs, the rates will depend on the identifying assumption for the constant term. The usual restriction E[ε] = 0 that motivated equation (3.6) does not lead to mean-square error convergence rates, because the trimmed norm does not identify the constant. The restriction E[τ(w)ε] = 0 would, but this condition does not seem to be well motivated in the model and so will not be considered.

It is straightforward to obtain uniform convergence rates for ĥ(w) from Lemma 4.1, and uniform convergence rates for ĝ(x,z_1) then follow immediately from those for ĥ(w) when the identification restriction is imposed, leading to an estimator equal to the coordinate projection of ĥ(w), up to the constant term, obtained by just fixing u at some value ū in W. Specifically, for ĝ(x,z_1) = ĥ(x,z_1,ū) − Â and Â = λ̂(ū), note that ĝ(x,z_1) − g_0(x,z_1) = [ĥ(x,z_1,ū) − h_0(x,z_1,ū)] − [λ̂(ū) − λ_0(ū)], so that uniform rates follow immediately. Furthermore, as long as the restriction used to identify the constant term is continuous in the supremum norm on W, uniform convergence rates for ĝ(x,z_1) will also follow from uniform convergence rates for ĥ.

Theorem 4.3: If Assumptions 1-4 are satisfied, then for q = 1/2 for splines and q = 1 for power series,

sup_{(x,z_1)∈W} |ĝ(x,z_1) − g_0(x,z_1)| = O_p(K^q[(K/n)^{1/2} + K^{−s/d} + (L/n)^{1/2} + L^{−s_1/d_1}]).
This uniform convergence rate cannot be made equal to Stone's (1982) bounds for either the first or second steps, due to the leading K^q term. Nevertheless, these uniform rates improve on some in the literature, e.g. on Cox (1988), as noted in Newey (1997). Apparently, it is not yet known whether series estimators can attain the optimal uniform convergence rates. The uniform convergence rate is slower, and the conditions on K and L more stringent, for power series than for splines. Nevertheless we retain the results for polynomials because they have some appealing practical features. They are often used for increased flexibility in functional form. Also, polynomials do not require knot placement, and hence are simpler to use in practice.

We could also derive convergence rates for λ̂(u), identical to those in Theorems 4.2 and 4.3. We omit these results for brevity.

5. Inference
Large sample confidence intervals are useful for inference. We focus on pointwise confidence intervals rather than uniform ones, as in much of the literature. We construct confidence intervals using standard errors that account for the nonparametric first-step estimation. In this Section we develop standard errors and asymptotic normality results for all linear functionals of ĥ, which turn out to include the value of the function g_0(x,z_1) at a point.

To describe the standard errors and associated confidence intervals, consider a general framework. For this purpose let h denote a possible true function h_0(w), where the w argument is suppressed for convenience. That is, the symbol h represents the entire function. The object of interest is assumed to be a particular number θ_0 = a(h_0) associated with h_0, where a(h) represents a mapping from possible functions h to the real line, i.e. a functional. The estimator is θ̂ = a(ĥ). This framework is general enough to include many objects of interest, such as the value of g_0 at a point. For example, under the restriction λ_0(ū) = λ̄ discussed earlier, the value of g_0(x,z_1) at a point (x̄,z̄_1) can be represented as

θ_0 = a(h_0),  a(h) = h(x̄,z̄_1,ū) − λ̄.   (5.1)

A general inference procedure for linear functionals will apply to the functional of this equation, i.e. can be used in the construction of large sample confidence intervals for the value of g_0 at a point.
Another example is a weighted average derivative of g_0(x,z_1). This object is particularly interesting when g_0(w_1) is an index model that depends on w_1 = (x,z_1) only through a linear combination, say g_0(w_1) = t(w_1'γ), where t(q) is a differentiable function with derivative t_q(q). Then for any v(w) such that v(w)t_q(w_1'γ) is integrable,

∫ v(w)[∂h_0(w)/∂w_1] dw = [∫ t_q(w_1'γ)v(w) dw]γ,   (5.2)

so that the weighted average derivative a(h) = ∫ v(w)[∂h(w)/∂w_1] dw is proportional to γ at the truth. This functional may also be useful for summarizing the slope of g_0(w_1) over some range, as discussed in Section 7. Unlike the value of g_0 at a point, the weighted average derivative functional will be √n-consistent under regularity conditions discussed below.
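As a numerical illustration of equation (5.2) (an editorial sketch, not from the original analysis): with the hypothetical choices t(q) = q², index coefficient γ = 2, and a uniform weight v on [0,1], a simple quadrature reproduces the proportionality to γ.

```python
def weighted_average_derivative(g_prime, v, lo, hi, m=10_000):
    # Midpoint-rule approximation of the integral of v(x) * g'(x) over [lo, hi].
    h = (hi - lo) / m
    return sum(v(lo + (i + 0.5) * h) * g_prime(lo + (i + 0.5) * h)
               for i in range(m)) * h

# Index model g(x) = t(x * gamma) with t(q) = q**2 and gamma = 2 (both are
# illustrative assumptions).  The derivative is t'(x*gamma)*gamma = 2*(x*gamma)*gamma,
# so the weighted average derivative equals [integral of t'(x*gamma)v(x)dx] * gamma.
gamma = 2.0
wad = weighted_average_derivative(lambda x: 2.0 * (x * gamma) * gamma,
                                  lambda x: 1.0, 0.0, 1.0)
```

Here the bracketed integral is 2, so wad is approximately 2 * gamma, matching (5.2).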
We can derive standard errors for linear functionals of ĥ, a class general enough to include many important examples, such as the value of g_0 at a point or a weighted average derivative. The estimator θ̂ = a(ĥ) is a natural "plug-in" estimator, obtained by applying the functional to ĥ. For example, the value of g_0 at a point could be estimated by applying the functional in equation (5.1) to obtain ĝ(x̄,z̄_1) = a(ĥ) = ĥ(x̄,z̄_1,ū) − λ̂. In general, linearity of a(h) implies that

θ̂ = Aβ̂,  A = (a(p_{1K}), ..., a(p_{KK})),   (5.3)

which is a linear combination of the two-step least squares coefficients β̂. Because θ̂ is a linear combination of the two-step least squares coefficients, a natural estimator of the variance can be obtained by applying the formula for parametric two-step estimators (see Newey, 1984). As for other series estimators, such as in Newey (1997), this should lead to correct inference asymptotically, because it accounts properly for the variability of the estimator. Let ⊗ denote the Kronecker product, and let

Q̂ = P'P/n,  Σ̂ = Σ_{i=1}^n p_i p_i'[y_i − ĥ(w_i)]²/n,
Q̂_1 = I ⊗ (R'R/n),  Σ̂_1 = Σ_{i=1}^n (û_i û_i') ⊗ (r_i r_i')/n,  Ĥ = Σ_{i=1}^n [∂ĥ(w_i)/∂u]' ⊗ p_i r_i'/n.   (5.4)

The variance estimate for θ̂ = a(ĥ) is then

V̂ = AQ̂^{−1}[Σ̂ + ĤQ̂_1^{−1}Σ̂_1Q̂_1^{−1}Ĥ']Q̂^{−1}A'.   (5.5)

This estimator is like that suggested by Newey (1984) for parametric two-step regressions. It is equal to a term AQ̂^{−1}Σ̂Q̂^{−1}A', that is the standard heteroskedasticity consistent variance estimator, plus an additional nonnegative definite term that accounts for the presence of û. It has this additive form, with the variance of the two-step estimator larger than that of the estimator with a known first step, because the second step conditions on the first-step dependent variables, making the second and first steps orthogonal.
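A minimal sketch of the variance formula (5.5) (an editorial illustration, assuming a scalar first-stage residual so the Kronecker products in (5.4) reduce to ordinary products):

```python
import numpy as np

def two_step_variance(A, P, R, e2, u_hat, dh_du):
    # Q = P'P/n; Sigma = sum_i p_i p_i' e_i^2 / n (heteroskedasticity part);
    # Q1, Sigma1 are the first-step analogues; H = sum_i (dh/du)_i p_i r_i' / n.
    n = P.shape[0]
    Q = P.T @ P / n
    Sigma = (P * e2[:, None]).T @ P / n
    Q1 = R.T @ R / n
    Sigma1 = (R * (u_hat ** 2)[:, None]).T @ R / n
    H = (P * dh_du[:, None]).T @ R / n
    Qi, Q1i = np.linalg.inv(Q), np.linalg.inv(Q1)
    # V = A Q^{-1} [Sigma + H Q1^{-1} Sigma1 Q1^{-1} H'] Q^{-1} A'   -- eq. (5.5)
    return A @ Qi @ (Sigma + H @ Q1i @ Sigma1 @ Q1i @ H.T) @ Qi @ A.T
```

Setting dh_du to zero recovers the standard heteroskedasticity-consistent variance, and the extra term is nonnegative definite, matching the discussion above.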
We will show that under certain regularity conditions √nV̂^{−1/2}(θ̂ − θ_0) → N(0,1). Thus, doing inference as if θ̂ were distributed normally with mean θ_0 and variance V̂/n will be a correct large sample approximation. This extends the large sample inference results of Andrews (1991) and Newey (1997) to account for an estimated first step. However, as in this previous work, such results do not specify the convergence rate of θ̂. That rate will depend on how fast V̂ goes to infinity. To date there is still not a complete characterization of the convergence rate of a series estimator available in the literature, although it is known when √n-consistency occurs (see Newey, 1997). Those results can be extended to obtain √n-consistency of functionals of the two-step estimators developed here.
Let ‖·‖ = (E[τ(w)(·)²]/τ̄)^{1/2} denote a trimmed mean-square norm, and let T denote the set of functions that can be approximated arbitrarily well in this norm by a linear combination of p^K(w) for large enough K. It is well known, for p^K(w) corresponding to power series or splines, that T includes all additive functions in (x,z_1) and u where the individual components have finite norm. The critical condition for √n-consistency is that there is ν(w) in T such that

a(h) = E[τ(w)ν(w)h(w)]  for any  h(w) ∈ T.   (5.6)

By the Riesz representation theorem, this is the same as the functional a(h) being continuous with respect to the mean-square norm E[τ(w)(·)²]/τ̄. Essentially, this condition means that a(h) can be represented as an expected product of h with a function ν(w) in the same space as h. In this case √n(θ̂ − θ_0) will be asymptotically normal, and its asymptotic variance can be given an explicit expression.
To describe the asymptotic variance in the √n-consistent case, let

ρ(z) = E[τ(w)ν(w)∂h_0(w)/∂u' | z].

The asymptotic variance of the estimator can then be expressed as

V = E[τ(w)ν(w)ν(w)'Var(y|X)] + E[ρ(z)Var(x|z)ρ(z)'].   (5.7)

For example, for the weighted average derivative in equation (5.2), if v(w) and its derivative ∂v(w)/∂x are zero where the density f_0(w) of w is zero, so that τ(w)v(w) = v(w), then integration by parts shows that equation (5.6) is satisfied with ν(w) = −Proj(f_0(w)^{−1}∂v(w)/∂x | T), where Proj(·|T) denotes the mean-square projection on the space spanned by p^K(w) for large K, that is the set of functions that are additive in (x,z_1) and u. The presence of the projection is necessary to make ν(w) lie in the same space as h. Then

ρ(z) = −E[τ(w){Proj(f_0(w)^{−1}∂v(w)/∂x | T)}∂h_0(w)/∂u' | z],

and the asymptotic variance will be V from equation (5.7).

To state precisely results on asymptotic normality it is helpful to introduce some regularity conditions. Recall that η = y − h_0(w).
Assumption 5: σ²(X) = Var(y|X) is bounded and bounded away from zero, E[‖u‖⁴|X] is bounded, E[η⁴|X] is bounded, and h_0(w) is twice continuously differentiable in u with bounded first and second derivatives.

This condition strengthens the bounded conditional second moment condition to boundedness of the fourth conditional moment and to the conditional variance being bounded away from zero.

The next assumption is the key condition for √n-consistency of the estimator. It requires that the functional a(h) have a representation as an expected product, at least for the truth h_0(w) and the approximating functions.

Assumption 6: There exists ν(w) such that E[τ(w)‖ν(w)‖²] < ∞, a(h_0) = E[τ(w)ν(w)h_0(w)], a(p_{kK}) = E[τ(w)ν(w)p_{kK}(w)] for large K, and there is β_K with E[τ(w)‖ν(w) − p^K(w)'β_K‖²] → 0 as K → ∞.

This condition also requires that ν(w) can be approximated well by a linear combination of p^K(w), which is important for the precise form that V takes. For example, with the average derivative this condition leads ν(w) to be the projection of f_0(w)^{−1}∂v(w)/∂x on functions that are additive in (x,z_1) and u.
The next condition is complementary to Assumption 6. For a vector μ = (μ_1,...,μ_m) of nonnegative integers, let ∂^μ h(w) = ∂^{|μ|}h(w)/∂w_1^{μ_1}···∂w_m^{μ_m}, where |μ| = Σ_j μ_j, and for a nonnegative integer δ let |h|_δ = max_{|μ|≤δ} sup_{w∈W} |∂^μ h(w)|.

Assumption 7: a(h) is a scalar, there is δ ≥ 0 such that |a(h)| ≤ C|h|_δ, and there exists β_K such that, as K → ∞, |a(p^K'β_K)| is bounded away from zero while E[τ(w){p^K(w)'β_K}²] → 0.

This assumption says that the functional a(h) is continuous in the norm |h|_δ but not in mean square. The lack of mean-square continuity will imply that the estimator is not √n-consistent, and is also a useful regularity condition in the proofs. This condition and Assumption 6 are mutually exclusive, and together they are general enough to include many cases of interest. For example, as noted above, Assumption 6 includes the weighted average derivative. For another example, Assumption 7 obviously applies to the functional a(h) = h(x̄,z̄_1,ū) − λ̄ from equation (5.1), which is continuous with respect to the supremum norm and where it is straightforward to show that there is β_K such that p^K(x̄,z̄_1,ū)'β_K − λ̄ is bounded away from zero but E[τ(w){p^K(w)'β_K}²] → 0. When a(h) is a vector, asymptotic normality with an estimated covariance matrix would follow from Assumption iii) of Andrews (1991), but that assumption is difficult to verify. In contrast, Assumption 7 is a primitive condition that is relatively easy to verify.
The next condition restricts the allowed rates of growth for K and L.

Assumption 8: √nK^{−s/d} → 0, √nL^{−s_1/d_1} → 0, and (K⁴L + KL³)/n → 0 for splines and (K⁶L + K⁴L³ + KL⁵)/n → 0 for power series.

One can show that this condition requires that s/d > 5/2 and s_1/d_1 > 2, as was pointed out by a referee. It also limits the growth rates of K and L: if each grows at the same rate, they can grow no faster than a small fractional power of n. These conditions are stronger than those for single-step series estimators, as discussed in Newey (1997), because of complications due to the multistage nature of the estimation problem.

The next condition is useful for the estimator V̂ of the asymptotic variance to have the right properties.
Assumption 9: Either a) x is univariate, any spline used is at least a quadratic one, and √nK^{1−s} → 0; or b) x is multivariate, h_0(w) is differentiable of all orders, there is a constant C such that the jth derivative of h_0(w) is bounded above in absolute value by C(Cj)^j, p^K(w) is a power series, and √nK^{−ε} → 0 for some ε > 0.

This assumption helps guarantee that ∂h_0(w)/∂u can be approximated by ∂[p^K(w)'β_K]/∂u for some β_K, which is important for consistency of the covariance matrix estimator. These conditions are not very general, but more general ones would require series approximation rates for derivatives, which are not available in the literature; they at least provide some set of regularity conditions.
Theorem 5.1: If Assumptions 1-3, 5, either 6 or 7, and 8 and 9 are satisfied, then

√nV̂^{−1/2}(θ̂ − θ_0) → N(0,1).

Furthermore, if Assumption 6 is satisfied, then √n(θ̂ − θ_0) → N(0,V) and V̂ → V.
In addition to the asymptotic normality needed for large sample confidence intervals, this result gives convergence rate bounds and, when Assumption 6 holds, √n-consistency. Bounds on the convergence rate can also be derived from results in Newey (1997). For the nonnegative integer δ of Assumption 7, let ζ_δ(K) = max_{|μ|≤δ} sup_{w∈W} ‖∂^μ p^K(w)‖. It is known from Newey (1997) that for power series ζ_δ(K) ≤ CK^{1+2δ} and for splines ζ_δ(K) ≤ CK^{1/2+δ}, where C is a generic positive constant, so that ζ_δ(K)/√n gives an upper bound on the convergence rate of θ̂. Thus, for example, when a(h) = h(x̄,z̄_1,ū) − λ̄, so that δ = 0, an upper bound on the convergence rate for power series would be K/√n and for splines K^{1/2}/√n. When Assumption 9 b) is satisfied, so that h_0(w) has derivatives of all orders, the convergence rate can be made close to 1/√n by letting K grow slowly enough. In particular, if h_0(w) and the reduced form are continuously differentiable of all orders, then Assumptions 1-4 are satisfied for every s, so that if K = Cn^ε and L = Cn^ε for some positive, small ε all of the conditions will be satisfied, and the convergence rate will be close to 1/√n for ε small enough.
The centering of the limiting distribution of the estimator at zero in this result is a reflection of the "over fitting" that is present, meaning that the bias of these estimators shrinks faster than the variance. This over fitting is implied by Assumption 8: the order of the bias is K^{−s/d}, so that Assumption 8 requires that the bias shrink faster than 1/√n. Since 1/√n is generally the fastest rate at which the standard deviation of estimators can shrink (such as for the sample mean), over fitting is present. This feature of the asymptotic normality result is shared by other theorems for series estimators, as in Andrews (1991) and Newey (1997). When combined with the condition that K go to infinity slower than n^a for 0 < a < 1, the bias shrinking faster than 1/√n means that the convergence rate for the estimator is bounded away from the optimal rate. This is different than asymptotic normality results for other types of nonparametric estimators, such as kernel regression. It would be good to relax this condition, but that is an extension that is beyond the scope of this paper.

6. Additive Semiparametric Models
In economic applications there is often a large number of covariates, thereby making nonparametric estimation problematic. The well known curse of dimensionality can make the estimation of large dimensional nonparametric models difficult. One approach to this problem is to restrict the model in some way while retaining some nonparametric features. The single index model discussed earlier is one example of such a restricted model. Another example is a model that is additive in some components and parametric in others. This type of model has received much attention in the literature as an approach to dimension reduction. Furthermore, it is particularly easy to estimate using a series estimator, by simply imposing restrictions on the approximating functions.

To describe this type of model, let w_1 = (x,z_1) as before, and let w_{1j} (j = 0,1,...,J) denote J+1 subvectors of w_1. A semiparametric additive model is

g_0(w_1) = w_{10}'γ + Σ_{j=1}^J g_j(w_{1j}).   (6.1)

This model can be estimated by specifying that the approximating functions consist of the elements of w_{10} and power series or splines in each w_{1j} (j = 1,...,J). One could also impose similar restrictions on λ_0(u) and on the reduced form. For example, one could specify a partially linear reduced form as

Π_0(z) = z'π_0 + Σ_{m=1}^M r_m(z_m).   (6.2)

This restriction could be imposed in estimation by specifying that the first-step approximating functions consist of the elements of z and power series or splines in each z_m.

The coefficients γ of equation (6.1) are examples of functionals of h that are mean-square continuous, so that a series estimator will be √n-consistent under the conditions of Section 5. Let q(w) be the residual from the mean-square projection of w_{10} on functions of the form Σ_{j=1}^J g_j(w_{1j}) + λ(u), for the probability measure P(B) = E[τ(w)1(w∈B)]/E[τ(w)]. Assume that E[τ(w)q(w)q(w)'] is nonsingular, an identification condition for γ. Then γ = a(h_0) for the functional a(h) = E[τ(w)ν(w)h(w)] with ν(w) = (E[τ(w)q(w)q(w)'])^{−1}q(w). Hence, √n-consistency and asymptotic normality of the series estimator for γ will follow from the results of Section 5. Robinson (1988) also gave some instrumental variable estimators for a semiparametric model that is linear in endogenous variables. The regularity conditions of Section 5 can be weakened somewhat for this model. Basically, only the nonparametric part of the model need satisfy these conditions, so that, for example, the elements of w_{10} need not be continuously distributed.
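As an editorial sketch of how the additivity in (6.1) is imposed in a series estimator: the design matrix simply contains a separate polynomial basis for each subvector and no interaction terms (the variable names and the order are illustrative choices, not from the original).

```python
import numpy as np

def additive_design(w0, parts, order=3):
    # Columns: constant, the linear block w0, and a polynomial basis in each
    # one-dimensional subvector in `parts`; omitting interaction terms is
    # exactly what imposes the additive structure of equation (6.1).
    w0 = np.atleast_2d(np.asarray(w0, dtype=float))
    if w0.shape[0] == 1:
        w0 = w0.T
    cols = [np.ones(w0.shape[0])] + [w0[:, j] for j in range(w0.shape[1])]
    for wj in parts:
        wj = np.asarray(wj, dtype=float)
        cols += [wj ** k for k in range(1, order + 1)]
    return np.column_stack(cols)

# Hypothetical fit: y = 2*w0 + x**2 lies exactly in the additive column space.
x = np.linspace(-1.0, 1.0, 50)
w0 = np.cos(3.0 * x)
y = 2.0 * w0 + x ** 2
X = additive_design(w0, [x], order=3)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The linear coefficient on w0 is then the √n-consistent parametric part, as discussed above.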
7. Empirical Example
To illustrate our approach we investigate the empirical relationship between the hourly wage rate and the number of annual hours worked. While early examinations treated the hourly wage rate as independent of the number of hours worked, recent studies have incorporated an endogenous wage rate (see, for example, Moffitt 1984, Biddle and Zarkin 1989, and Vella 1993) and have uncovered a non-linear relationship between the wage rate and annual hours. The theoretical underpinnings of this relationship can be assigned to an increasing importance of labor when it is involved in a greater number of hours of work, due to related start up costs (see Oi 1962). Alternatively, Barzel (1973) argues that labor is relatively unproductive at low hours of work. Moreover, at high hours of work Barzel argues that fatigue decreases labor productivity. These forces will generate hourly wage rates that initially increase but subsequently decrease as daily hours worked increase. Finally, the taxation/labor supply literature argues that a non-linear relationship between hours and wage rates may exist due to the progressive nature of taxation rates.
We use a semiparametric version of equation (1.1), like that discussed in Section 6, with a parametric reduced form. Thus the non-linear relationship is captured in the following model:

y_i = z'_{1i}β + g_0(x_i) + ε_i,  x_i = z'_iπ + u_i,   (7.1)

where y_i is the log of the hourly wage rate of individual i, x_i is annual hours worked, z_{1i} is a vector of individual characteristics, z_i is a vector of exogenous variables that includes z_{1i}, β and π are parameters, g_0 is an unknown function, and ε_i and u_i are zero mean error terms such that E[ε|u] ≠ 0. We employ this specification because there are many individual characteristics, so that it would be difficult to apply fully nonparametric estimation due to the curse of dimensionality.
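The two-step control-function estimation of a model like (7.1) can be sketched as follows (an editorial illustration; the polynomial orders here are placeholder defaults, not the data-chosen values reported in the text).

```python
import numpy as np

def fit_two_step(y, x, z1, z, p_order=4, q_order=3):
    # Step 1: linear reduced form, hours x on exogenous variables z;
    # form the residuals u_hat.
    Z = np.column_stack([np.ones(len(x)), z])
    pi, *_ = np.linalg.lstsq(Z, x, rcond=None)
    u_hat = x - Z @ pi
    # Step 2: wages y on characteristics z1, a polynomial in hours, and a
    # polynomial in u_hat (the control-function correction for endogeneity).
    X2 = np.column_stack(
        [np.ones(len(x)), np.atleast_2d(z1).reshape(len(x), -1)]
        + [x ** k for k in range(1, p_order + 1)]
        + [u_hat ** k for k in range(1, q_order + 1)]
    )
    beta, *_ = np.linalg.lstsq(X2, y, rcond=None)
    return beta, u_hat, X2
```

Both steps are ordinary least squares, so the residuals from each step are orthogonal to that step's regressors by construction.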
We estimate this model using data on males from the 1989 wave of the Michigan Panel Survey of Income Dynamics. To preserve comparability with previous empirical studies we follow similar data exclusions to those in Biddle and Zarkin (1989). Accordingly, we only include males aged between 22 and 55 who had worked between 1000 and 3500 hours in the previous year. This produced a sample of 1314 observations. Our measures of hours and wages are annual hours worked and the hourly wage rate respectively.
We first estimate the wage equation by linear OLS and report the relevant parameters in column 1 of Table 1. The hours effect is small and not significantly different from zero. Adjusting for the endogeneity of hours via linear two stage least squares (2SLS), however, indicates that the impact of hours is statistically significant although the effect is small. The results in column 2, on the basis of the t-statistic for the residuals, suggest hours are endogenous to wages.
To allow the g function to be non-linear we employed alternative specifications for g and various approximations for the manner in which we account for the endogeneity of hours. We also allow for some non-linearities in the reduced form by including non-linear terms for the experience and tenure variables. We discriminate between alternative specifications on the basis of a cross validation (CV) criterion. That is, we first choose the number of approximating terms in the reduced form, and then, by employing the residuals from the chosen first-step specification, we determine the number of non-linear terms in the primary equation. The CV criterion is well known to minimize asymptotic mean-square error when the first estimation step is not present, so we are hopeful it will lead to estimates with good properties here. Below we consider its properties in a Monte Carlo study, and find that CV performs well. Because the theory requires over fitting, where the bias is smaller than the variance asymptotically, it seems prudent to consider specifications with more terms than those that CV gives, particularly for inference. Accordingly, for both steps we identify the number of terms that minimizes the CV criterion and then add an additional term. For example, while the approximation which minimized the CV criterion for the first step was a fourth order polynomial in tenure and experience, we generate the residuals from a model which includes a fifth order polynomial in these variables. Note that, to enable comparisons, these were the residuals which were included in column 2 discussed above.
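The CV-then-add-one rule described above can be sketched as follows (an editorial illustration; the leave-one-out shortcut via the hat matrix is a standard identity, not a formula from the original).

```python
import numpy as np

def loocv(y, X):
    # Leave-one-out cross-validation sum of squares via the hat-matrix
    # identity: CV = sum_i ((y_i - yhat_i) / (1 - h_ii))^2.
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    e = y - H @ y
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))

def choose_order(y, x, max_order=8):
    # Minimize CV over polynomial orders, then add one term, reflecting the
    # over-fitting requirement discussed in the text.
    design = lambda p: np.column_stack([x ** k for k in range(p + 1)])
    best = min(range(1, max_order + 1), key=lambda p: loocv(y, design(p)))
    return best + 1
```

The same rule applies to each step, with the second step run on the residuals generated from the over-fitted first step.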
The CV values for a subset of the second step specifications examined are reported in Table 2. Although we also experimented with splines in each step, we found that the specifications involving polynomial approximations appeared to dominate. The preferred specification is a fourth order polynomial in hours while accounting for the endogeneity through a third order polynomial in the residuals. The CV criterion for this specification is 182.643. The fourth column of Table 1 reports the estimates of the coefficients of the hours profile employing these approximations. The third column presents the hours coefficients while excluding the residuals. Due to the over fitting requirement we now take as a base case the specification with a fifth order polynomial in hours and a fourth order polynomial in the residual. We will also check the sensitivity of some results to this choice.
While Table 1 suggests the non-linearity and endogeneity are statistically important, it is useful to examine their respective roles in determining the hours profile. For comparison purposes Figure 1 plots the change in the log of the hourly wage rate as annual hours increase, where the nonparametric specification is the over fitted case with h = 5 and r = 4. As annual hours affect the log of the hourly wage rate in an additive manner, we simply plot the impact of hours on the log of the hourly wage rate. The function is evaluated over the observations, and the graph shows the deviation of the function from its mean; to facilitate comparisons each function has been centered at its mean. In Figure 1 we plot quadratic 2SLS estimates like those of Biddle and Zarkin (1989), and the profile is very similar to theirs. We also plot the predicted relationship between hours and wages for these data with and without the adjustment for endogeneity using our base specification. It indicates the relationship is highly non-linear. Furthermore, failing to adjust for the endogeneity leads to incorrect inferences regarding the overall impact of hours on wages and, more specifically, the value of the turning points.
Although we found that the polynomial approximations appeared to best fit the data, we also identified the best fitting spline approximations. We employ cubic splines, and the parameters we allow to be chosen by the data are the number of join points. We do this for the first and second steps, and in this instance we over fit by including an additional join point, in each step, above what is determined on the basis of the CV values. Note that the best fitting approximation involved 1 join point in the reduced form and a primary specification of 1 join point in the hours function and 1 join point for the residuals. In Figure 2 we plot the predicted relationship between hours and the wage rate from the over fitted spline approximation. This predicted relationship looks remarkably similar to that in Figure 1. Furthermore, the points noted above related to the failure to account for the endogeneity are similar.
Figures 1 and 2 indicate that at high numbers of hours the implied decrease in the wage rate is sufficient to decrease total earnings. This may be due to the small number of observations in this area of the profile and the associated large standard errors. Figures 3 and 4 explore this issue by plotting the 95 percent confidence intervals for our adjusted nonparametric estimates of the wage hours profile. This confirms that the estimated standard errors are large at the upper end of the hours range.
To quantify an average measure of the effect of hours on wages we consider the weighted average derivative of the estimated g function over the range where the function is shown to be upward sloping. This part of the hours profile, from 1300 to 2500 hours, represents 81 percent of the total sample. There were only 125 observations above 2600, so that not very many were excluded by ignoring the upper tail. The average derivative for the 1300-2500 region from the polynomial approximation, based on h = 5 and r = 4, is .000513 with a standard error of .000155. The corresponding estimate from our spline approximation of the weighted average derivative is .000505 with a standard error of .000147. These estimates are quite different than the linear 2SLS estimate reported in Table 1, which is consistent with the presence of nonlinearity. Furthermore, for the quadratic 2SLS estimates in Figure 1 the average derivative is .000350 with a standard error of .000142, which is also quite different than the average derivative for the nonparametric estimators. This difference is present despite the relatively wide pointwise confidence bands in Figures 3 and 4. Thus, in our example the average wage change for most of the individuals in the sample is found to be substantially larger for our nonparametric approach than by linear or quadratic 2SLS.
Before proceeding we examined the robustness of these estimates of the average derivative to the specification of the reduced form and to alternative approximations of the wage hours profile. First, we reduced the number of approximating terms in the reduced form. While we found that the CV criteria for each step increased with the exclusion of these terms in the reduced form, we found that there was virtually no impact on the total profile and a very minor effect on our estimate of the average derivative of the profile for the region discussed above. For example, consider the estimates from the polynomial approximation. When we included only linear terms in the reduced form the estimated average derivative is .000546 with a standard error of .0000175. The corresponding value for the specification including quadratic terms is .000515 with a standard error of .000174. Finally, the addition of cubic terms resulted in an estimate of .000522 with a standard error of .000153. Changes in the specification of the model based on the spline approximations produced similarly small differences.
We also explored the sensitivity of the average derivative estimate to the number of terms in the series approximation. Both the average derivative and its standard error were relatively unaffected by increasing the number of terms in the approximations. For example, with an 8th order polynomial in hours and a 7th order polynomial in the residual, the average derivative estimate is .000519 with a standard error of .000156.
Finally, we examine whether we are able to reject the additive structure implied by our model. We do this by including an interaction term capturing the product of hours and the residuals in our preferred specification. The t-statistic on this included variable was .231. The CV value for this specification is 182.924, while that for our preferred specification plus this interaction term and the term capturing the product of hours squared and the residual squared is 183.188. On the basis of these results we conclude there is no evidence against the additive structure of the model.
In addition to this empirical evidence we provide simulation evidence featuring the data examined above. The objective of this exercise is to explore the ability of our procedure to accurately estimate the unknown function in the empirical setting examined above. We simulate the endogenous variable through the exogenous variables and parameter estimates from each of the polynomial and spline approximations reported above. We generate the hours variable by simulating the reduced form and incorporating a random component drawn from the reduced form empirical residual vector. From this reduced form we estimate the residuals, and from the simulated hours vector we generate the higher order terms for hours. We then simulate wages by employing the simulated values of hours and residuals, along with the true exogenous variables, and the parameter estimates discussed above. The random component in the wage equation is drawn from the distribution of the empirical residuals. As we employ the parameter vector from the overfitted models, the polynomial model has a fifth order polynomial in the reduced form, while the wage equation has a 5th order polynomial in hours and a fourth order polynomial in the residuals. The spline approximation employs a cubic spline with 2 join points in the reduced form, while the wage equation has 2 join points for hours and 2 join points for the residuals. We examine the performance of our procedure for 3000 replications of the model.
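The steps of this design can be sketched as follows (a hedged illustration, not the authors' code; the function names, basis functions, and toy inputs are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_once(Z, X1, pi_hat, beta_hat, rf_resid, wage_resid, basis_rf, basis_wage):
    """One Monte Carlo replication: draw reduced-form errors from the empirical
    residuals, rebuild hours, then rebuild wages from the fitted wage equation
    plus a draw from its empirical residuals."""
    n = len(Z)
    u = rng.choice(rf_resid, size=n, replace=True)      # bootstrap reduced-form errors
    hours = basis_rf(Z) @ pi_hat + u                    # simulated endogenous hours
    e = rng.choice(wage_resid, size=n, replace=True)    # bootstrap wage errors
    wages = basis_wage(hours, u, X1) @ beta_hat + e     # simulated wages
    return hours, wages

# Toy check: with zero residual vectors the simulation reproduces fitted values.
Z = np.linspace(0.0, 1.0, 5)
basis_rf = lambda z: np.column_stack([np.ones_like(z), z])
basis_wage = lambda h, u, x1: np.column_stack([np.ones_like(h), h, u, x1])
hours, wages = simulate_once(Z, Z, np.array([0.0, 2.0]),
                             np.array([1.0, 0.5, 0.0, 0.0]),
                             np.zeros(5), np.zeros(5), basis_rf, basis_wage)
```

Resampling from the empirical residual vectors, rather than from a parametric error distribution, keeps the simulated data close to the features of the actual sample.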
In order to relax the parametric assumptions of the model we use the CV criterion in each step of the estimation. That is, for the model which employs the polynomial approximation we first choose the number of approximating terms in the reduced form. Then, on the basis of the residuals from the overfit reduced form, we choose the number of terms in the wage equation. For the model generated by the spline approximation we do the same, except we employ a cubic spline and choose the number of join points. For both approximations we trim the bottom and top two and a half percent of the observations, on the basis of the hours residuals, from the data with which we estimate the wage equation.
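A minimal sketch of the cross-validation step (illustrative only; the paper's criterion is applied to series terms and join points, while this toy version scores polynomial orders using the standard leave-one-out shortcut for least squares):

```python
import numpy as np

def cv_score(X, y):
    """Leave-one-out cross-validation sum of squares for least squares on X."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    resid = y - H @ y
    loo = resid / (1.0 - np.diag(H))       # leave-one-out residuals
    return float(loo @ loo)

def choose_order(x, y, max_order):
    """Return the polynomial order (1..max_order) minimizing the CV criterion."""
    scores = [cv_score(np.vander(x, k + 1, increasing=True), y)
              for k in range(1, max_order + 1)]
    return 1 + int(np.argmin(scores))
```

The two-step procedure described in the text would apply such a criterion first to the reduced form and then, given the fitted residuals, to the wage equation.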
An important aspect of our procedure is the ability of the CV method to correctly identify the specification. To examine this we computed the CV criteria for the polynomial model for all 25 specifications of the wage equation combining up to a fifth order polynomial in hours and a fifth order polynomial in residuals. As there were also 5 choices in the reduced form, this generated 125 possible specifications. For the spline model we examined up to 3 join points in the reduced form and then 3 each for hours and residuals in the wage equation. This produced 27 possibilities. To evaluate the performance of an average derivative estimator like that applied to the actual data, we once again overfit the data: we computed the estimator by choosing the number of terms that minimizes the CV criterion, plus one additional term, and computed standard errors for the same specification.

First consider the results from the polynomial model. The mean of the average derivative estimates, across the Monte Carlo replications, was .000456, which represents a bias of 8.9 percent. The standard error across the replications was .000153, while the average of the estimated standard errors was .000152. Therefore the estimated standard errors accurately reflect the variability of the estimator in this experiment. For the spline approximation the average estimate was .000442, which represents a bias of 11.1 percent. The standard error across the replications was .000145, while the average of the estimated standard errors was .000144.
We also computed rejection frequencies for tests that the average derivative was equal to its true value. For the polynomial model the rejection rates at the nominal 5, 10 and 20 percent significance levels were 6.5, 11.8 and 22.5 percent respectively. The corresponding figures for the spline model were 7.4, 13.1 and 24.0 percent. Thus, asymptotic inference procedures turned out to be quite accurate in this experiment, lending some credence to our inference in the empirical example.
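The rejection frequencies above compare Monte Carlo t-ratios with standard normal critical values. A compact sketch of that tabulation (illustrative, with hypothetical inputs):

```python
import numpy as np

# Two-sided standard normal critical values at the 5, 10 and 20 percent levels.
CRIT = {0.05: 1.960, 0.10: 1.645, 0.20: 1.282}

def rejection_rates(estimates, std_errs, true_value):
    """Fraction of replications whose t-test rejects H0: theta = true_value."""
    t = np.abs((np.asarray(estimates) - true_value) / np.asarray(std_errs))
    return {level: float((t > c).mean()) for level, c in CRIT.items()}

# Example: t-ratios 0 and 3, so one of two replications rejects at every level.
print(rejection_rates([0.0, 3.0], [1.0, 1.0], 0.0))
```

Applied to the 3000 replications, rates near the nominal levels indicate that the asymptotic approximation is adequate.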
Appendix

We will first prove one of the results on identification.

Proof of Theorem 2.3: To show identification we will use Theorem 2.1. It suffices to consider functions $\delta(x,z_1)$ and $\gamma(u)$ that are differentiable. Consider differentiable $\delta(x,z_1)$ and $\gamma(u)$, and suppose that $\delta(x,z_1) + \gamma(u) = 0$ identically in $z$ and $u$, where $x = \Pi(z) + u$. Differentiating with respect to $u$, $z_1$, and the components of $z$ excluded from $z_1$ then gives

$$0 = \delta_x(x,z_1) + \gamma_u(u), \qquad 0 = \Pi_{z_1}(z)'\delta_x(x,z_1) + \delta_{z_1}(x,z_1), \qquad 0 = \Pi_{z_2}(z)'\delta_x(x,z_1).$$

If $\Pi_{z_2}(z)$ is full rank, it follows from the last equation that $\delta_x(x,z_1) = 0$. It then follows from the first two equations that $\gamma_u(u) = 0$ and $\delta_{z_1}(x,z_1) = 0$. Also, if $\delta(x,z_1) + \gamma(u) = 0$ holds only with probability one, then by continuity of the functions and the boundary having zero probability, this equality holds identically on the interior of the support of $(u,z)$. Then, as above, all the partial derivatives of $\delta(x,z_1)$ and $\gamma(u)$ are zero with probability one, implying they are constant. Identification then follows by Theorem 2.1. QED.

To avoid repetition, it may be useful to prove consistency and asymptotic normality
for a general two-step weighted least squares estimator. To state the lemmas some notation is needed. Throughout the appendix $C$ will denote a generic positive constant, that may be different in different uses, and $\pi$ represents a possible value of $\Pi(z)$. Let $X$ be a vector of variables that includes $x$ and $z$ among its components, and let $w(X,\pi)$ be a vector of functions of $X$ and $\pi$, where it will be assumed that

(A.1) $\quad E[y|X] = h_0(w), \quad w = w(X,\Pi_0(z)), \quad \Pi_0(z) = E[x|z].$

Also, let $\psi(X)$ be a scalar, nonnegative function of $X$, let $\hat\Pi(z)$ be as in the body of the paper, and let $\psi_i = \psi(X_i)$, $\hat w_i = w(X_i,\hat\Pi(z_i))$, $\hat p_i = p^K(\hat w_i)$, $\hat\tau_i = \tau(\hat w_i)$, and $\hat P = [\hat\tau_1\psi_1\hat p_1,\ldots,\hat\tau_n\psi_n\hat p_n]'$. The estimator is

(A.2) $\quad \hat\beta = (\hat P'\hat P)^{-1}\hat P'Y, \quad \hat h(w) = p^K(w)'\hat\beta.$

Let $W = \{w : \tau(w) = 1\}$ and, for $d$ a nonnegative integer, let $|h|_d = \max_{|\mu|\le d}\sup_{w\in W}|\partial^\mu h(w)|$.
Assumption A1: i) $\psi(X)$ is bounded; for each $j$, $w_j(X,\pi)$ either does not depend on $\pi$ or is Lipschitz in $\pi$, and $w$ is continuously distributed with bounded density; ii) $E[P^K(w)P^K(w)']$ and $E[R^L(z)R^L(z)']$ have smallest eigenvalues that are bounded away from zero, uniformly in $K$ and $L$; iii) for each nonnegative integer $d$ there exists $\zeta_d(K)$ such that $\max_{|\mu|\le d}\sup_{w\in W}\|\partial^\mu p^K(w)\| \le \zeta_d(K)$ and $\sup_z\|r^L(z)\| \le \zeta(L)$; iv) there exist $\alpha > 0$ and $\beta_K$ such that $|h_0 - p^K(\cdot)'\beta_K|_d \le CK^{-\alpha}$; v) there exist $\alpha_1 > 0$ and $\gamma_L$ such that $\sup_z\|\Pi_0(z) - \gamma_L'r^L(z)\| \le CL^{-\alpha_1}$.

Note that the value of $\hat h$ is unchanged if a nonsingular linear transformation of $p^K(w)$ and/or $r^L(z)$ is used, so that if the hypotheses are satisfied for some nonsingular linear transformations $Bp^K(w)$ and $B_1r^L(z)$, then it can be assumed that they hold for $p^K(w)$ and $r^L(z)$ themselves.

Lemma A1: If Assumption A1 is satisfied for $d = 0$, $\zeta_0(K)^2K/n \to 0$, and $\zeta(L)^2L/n \to 0$, then

$$\int\tau(w)\psi(X)^2[\hat h(w) - h_0(w)]^2dF_0(X) = O_p(K/n + K^{-2\alpha} + L/n + L^{-2\alpha_1}).$$

If Assumption A1 is satisfied for a nonnegative integer $d$, then for $\Delta = (K/n)^{1/2} + K^{-\alpha} + (L/n)^{1/2} + L^{-\alpha_1}$,

$$|\hat h - h_0|_d = O_p(\zeta_d(K)\Delta).$$

Proof of Lemma A1: Let $p_i = p^K(w_i)$, $P = [\tau_1\psi_1p_1,\ldots,\tau_n\psi_np_n]'$, and $Q = E[\tau_i\psi_i^2p_ip_i']$. By Assumption A1, the smallest eigenvalue of $Q$ is bounded away from zero, so that the largest eigenvalue of $Q^{-1}$ is bounded. Consider replacing $p^K(w)$ by $\tilde p^K(w) = Q^{-1/2}p^K(w)$. By the Cauchy-Schwarz inequality there is a constant $C$ such that $\sup_{w\in W}\|\partial^d\tilde p^K(w)\| \le C\zeta_d(K)$, so that the hypotheses and the conclusion will be satisfied with this replacement. Since $E[\tau(w)\psi(X)^2\tilde p^K(w)\tilde p^K(w)'] = I$ by construction, it suffices to prove the result with $Q = I$. By analogous reasoning, it can be assumed for notational simplicity that $Q_1 = E[r^L(z)r^L(z)'] = I$.

Next, let $\Pi_i = \Pi_0(z_i)$, $\hat\Pi_i = \hat\Pi(z_i)$, and $\Delta_\pi = (L/n)^{1/2} + L^{-\alpha_1}$. By Theorem 1 of Newey (1997),

(A.3) $\quad \sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n = O_p(\Delta_\pi^2), \quad \max_{i\le n}\|\hat\Pi_i - \Pi_i\| = O_p(\zeta(L)\Delta_\pi).$

Also, by $w(X,\pi)$ Lipschitz in $\pi$, $\|\hat p_i - p_i\| \le \zeta_1(K)\|\hat w_i - w_i\| \le C\zeta_1(K)\|\hat\Pi_i - \Pi_i\|$, so that

(A.4) $\quad \sum_{i=1}^n\tau_i\psi_i^2\|\hat p_i - p_i\|^2/n \le C\zeta_1(K)^2\sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n = O_p(\zeta_1(K)^2\Delta_\pi^2),$

and by Lemma A3 and the Markov inequality, $\sum_{i=1}^n|\hat\tau_i - \tau_i|/n = o_p(1)$. Let $\hat Q = \hat P'\hat P/n$ and $\bar Q = P'P/n$. Then, by the triangle and Cauchy-Schwarz inequalities,

(A.5) $\quad \|\hat Q - \bar Q\| \le C\sum_{i=1}^n\tau_i\psi_i^2(\|\hat p_i - p_i\|^2 + 2\|p_i\|\,\|\hat p_i - p_i\|)/n + C\zeta_0(K)^2\sum_{i=1}^n|\hat\tau_i - \tau_i|/n = o_p(1),$

while $E[\|\bar Q - I\|^2] \le \zeta_0(K)^2K/n \to 0$, implying $\|\bar Q - I\| \overset{p}{\to} 0$. Therefore $\|\hat Q - I\| \overset{p}{\to} 0$, and the smallest eigenvalue of $\hat Q$ is bounded away from zero with probability approaching one.

Let $\eta_i = \psi_i[y_i - h_0(w_i)]$ and $\eta = (\eta_1,\ldots,\eta_n)'$, so that $E[\eta_i|X] = 0$, and, by independence of the observations and $\mathrm{Var}(y_i|X) = \mathrm{Var}(y_i|X_i)$ bounded, $E[\eta_i\eta_j|X] = 0$ for $i \ne j$ with $E[\eta_i^2|X]$ bounded. Then

(A.6) $\quad E[\|(\hat P - P)'\eta/n\|^2\,|\,X] \le Cn^{-2}\sum_{i=1}^n\psi_i^2\|\hat\tau_i\hat p_i - \tau_ip_i\|^2 = o_p(K/n),$

(A.7) $\quad E[\|P'\eta/n\|^2] = E[\sum_{i=1}^n\tau_i\psi_i^2p_i'p_i\mathrm{Var}(y_i|X_i)/n^2] \le CE[\tau_i\psi_i^2p_i'p_i]/n = C\,\mathrm{tr}(Q)/n = C\,\mathrm{tr}(I)/n = CK/n.$

Therefore, by the Markov and triangle inequalities, $\|\hat P'\eta/n\|^2 \le C[\|(\hat P - P)'\eta/n\|^2 + \|P'\eta/n\|^2] = O_p(K/n)$. For $\hat M = \hat P(\hat P'\hat P)^{-1}\hat P'$ it follows that, with probability approaching one, $\eta'\hat M\eta/n = (\eta'\hat P/n)\hat Q^{-1}(\hat P'\eta/n) \le O_p(1)\|\hat P'\eta/n\|^2 = O_p(K/n)$.

Next, by Assumption A1 iv), $\sup_{w\in W}|h_0(w) - p^K(w)'\beta_K| = O(K^{-\alpha})$. Let $\bar h_i = \hat\tau_i\psi_ih_0(\hat w_i)$ and $\tilde h_i = \hat\tau_i\psi_ih_0(w_i)$, stacked into $\bar h$ and $\tilde h$. Since $\hat M$ is idempotent and $(\hat M - I)\hat P\beta_K = 0$, the triangle inequality gives

$$\sum_{i=1}^n\hat\tau_i\psi_i^2[\hat h(\hat w_i) - h_0(\hat w_i)]^2/n \le C\{\eta'\hat M\eta/n + \sum_{i=1}^n\hat\tau_i\psi_i^2[h_0(\hat w_i) - h_0(w_i)]^2/n + \sum_{i=1}^n\hat\tau_i\psi_i^2[h_0(\hat w_i) - \hat p_i'\beta_K]^2/n\}$$
$$\le O_p(K/n) + C\sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n + O(K^{-2\alpha}) = O_p(\Delta^2).$$

It then follows, by the smallest eigenvalue of $\hat Q$ being bounded away from zero with probability approaching one, that $\|\hat\beta - \beta_K\|^2 \le O_p(1)(\hat\beta - \beta_K)'\hat Q(\hat\beta - \beta_K) = O_p(\Delta^2)$. Then, by the triangle inequality,

$$\{\int\tau(w)\psi(X)^2[\hat h(w) - h_0(w)]^2dF_0(X)\}^{1/2} \le \{\int\tau(w)\psi(X)^2[p^K(w)'(\hat\beta - \beta_K)]^2dF_0(X)\}^{1/2} + \{\int\tau(w)\psi(X)^2[p^K(w)'\beta_K - h_0(w)]^2dF_0(X)\}^{1/2} = O_p(\Delta) + O(K^{-\alpha}),$$

giving the first conclusion. Also, by the triangle inequality,

$$|\hat h - h_0|_d \le |p^K(\cdot)'\beta_K - h_0|_d + |p^K(\cdot)'(\hat\beta - \beta_K)|_d \le O(K^{-\alpha}) + \zeta_d(K)\|\hat\beta - \beta_K\| = O_p(\zeta_d(K)\Delta). \quad QED.$$
To state Lemma A2, some additional notation is needed. Let $u_i = x_i - \Pi_i$, $U = (u_1,\ldots,u_n)'$, $r_i = r^L(z_i)$, $R = [r_1,\ldots,r_n]'$, $\hat Q_1 = R'R/n$, and let $\hat\Sigma_1$ denote the first-step residual second-moment matrix, with population analogues $Q_1 = E[r_ir_i']$ and $\Sigma_1 = E[r_ir_i'\mathrm{Var}(x_i|z_i)]$. Also let

$$\hat\Sigma = \sum_{i=1}^n\hat\tau_i\psi_i^2\hat p_i\hat p_i'[y_i - \hat h(\hat w_i)]^2/n, \qquad \Sigma = E[\tau_i\psi_i^2p_ip_i'\mathrm{Var}(y_i|X_i)], \qquad \hat Q = \hat P'\hat P/n, \qquad Q = E[\tau_i\psi_i^2p_ip_i'],$$
$$\hat H = \sum_{i=1}^n\hat\tau_i\psi_i\hat p_i\{[\partial\hat h(\hat w_i)/\partial w]'\partial w(X_i,\hat\Pi_i)/\partial\pi \otimes r_i'\}/n, \qquad H = E[\tau_i\psi_ip_i\{[\partial h_0(w_i)/\partial w]'\partial w(X_i,\Pi_i)/\partial\pi \otimes r_i'\}],$$
$$\hat V = A\hat Q^{-1}(\hat\Sigma + \hat H\hat Q_1^{-1}\hat\Sigma_1\hat Q_1^{-1}\hat H')\hat Q^{-1}A', \qquad V = AQ^{-1}(\Sigma + HQ_1^{-1}\Sigma_1Q_1^{-1}H')Q^{-1}A',$$

where $A$ is such that $a(p^K(\cdot)'\beta) = A\beta$, $\hat\theta = a(\hat h)$, and $\theta_0 = a(h_0)$. For notational convenience, $K$ and $L$ subscripts for $V$ are suppressed.

Assumption A2: i) $a(h)$ is a linear functional with $\|a(h)\| \le C|h|_d$; ii) $h_0(w)$ and $w(X,\pi)$ are twice continuously differentiable in $w$ and $\pi$ respectively, and the first and second derivatives are bounded; iii) either a) $a(h)$ is a scalar and there exists $\nu(w)$ such that $a(p_{Kk}) = E[\tau(w)\psi(X)^2\nu(w)p_{Kk}(w)]$ for each element $p_{Kk}(w)$ of $p^K(w)$, $a(h_0) = E[\tau(w)\psi(X)^2\nu(w)h_0(w)]$, and there are $\beta_K$ with $E[\tau(w)\psi(X)^2\|\nu(w) - \beta_K'p^K(w)\|^2] \to 0$; or b) there are $\beta_K$ such that, for $h_K(w) = p^K(w)'\beta_K$, $E[h_K(w)^2] \to 0$ and $a(h_K)$ is bounded away from zero. Under iii) a), let $d(X) = [\partial h_0(w)/\partial w]'\partial w(X,\Pi_0(z))/\partial\pi$, let $\rho(z)$ be the matrix of projections of $\tau(w)\psi(X)\nu(w)d(X)$ on $r^L(z)$, and let

$$\bar V = E[\tau(w)\psi(X)^2\nu(w)\nu(w)'\mathrm{Var}(y|X)] + E[\rho(z)\mathrm{Var}(x|z)\rho(z)'].$$
Lemma A2: If Assumptions A1 and A2 are satisfied, $\mathrm{Var}(y|X)$ is bounded away from zero, and each of the following converges to zero as $K$, $L$, and $n$ grow: $\sqrt nK^{-\alpha}$, $\sqrt nL^{-\alpha_1}$, $\zeta_1(K)^2K^2/n$, $(K^2L + L^2K)\zeta(L)^2/n$, $L^2\zeta_0(K)^2\zeta(L)^2/n$, and $\zeta_0(K)^2\zeta_1(K)^2(LK + L^2)/n$, then

$$\sqrt n\,\hat V^{-1/2}(\hat\theta - \theta_0) \overset{d}{\to} N(0,I).$$

Furthermore, if Assumption A2 iii) a) is satisfied, then

$$\sqrt n(\hat\theta - \theta_0) \overset{d}{\to} N(0,\bar V), \qquad \hat V \overset{p}{\to} \bar V.$$
Proof of Lemma A2: First, let

$$\Delta_L = L^{1/2}/\sqrt n + L^{-\alpha_1}, \qquad \Delta_\pi = \zeta(L)\Delta_L, \qquad \Delta_h = K^{1/2}/\sqrt n + K^{-\alpha} + \Delta_\pi,$$
$$\Delta_Q = [K^{1/2}\zeta_1(K) + \zeta_0(K)\zeta(L)]\Delta_\pi + \zeta_1(K)K^{1/2}/\sqrt n, \qquad \Delta_{Q_1} = [L^{1/2}\zeta_1(K) + \zeta_0(K)\zeta(L)^2]\Delta_\pi + K^{1/2}\zeta(L)/\sqrt n.$$

Note that by the rate conditions, including $\sqrt nK^{-\alpha} \to 0$ and $\sqrt nL^{-\alpha_1} \to 0$,

(A.8) $\quad K^{1/2}\Delta_Q \to 0, \quad L^{1/2}\Delta_{Q_1} \to 0, \quad \sqrt n\zeta_0(K)\Delta_L^2 = O(\zeta_0(K)L/\sqrt n) \to 0, \quad \zeta_0(K)L^{1/2}\zeta_1(K)\Delta_L = O(\{\zeta_0(K)^2\zeta_1(K)^2(K + L)\}^{1/2}L^{1/2}/\sqrt n) \to 0.$

It also follows from these results that $\zeta_1(K)\Delta_h \to 0$, $\Delta_Q \to 0$, and $\Delta_{Q_1} \to 0$. For simplicity the remainder of the proof will be given for the scalar case and, as in the proof of Lemma A1, it suffices to show the result with $Q$ and $Q_1$ equal to identity matrices.

Let $F$ be a symmetric square root of $V^{-1}$. By $\mathrm{Var}(y|X)$ bounded away from zero, $\Sigma - CI$ is positive semidefinite for some $C > 0$, and $H\Sigma_1H'$ is positive semidefinite, so that

(A.9) $\quad \|FA\| = \{\mathrm{tr}(FAA'F')\}^{1/2} \le \{\mathrm{tr}(CFA\Sigma A'F')\}^{1/2} \le \{\mathrm{tr}(CFVF')\}^{1/2} \le C.$

Suppose that Assumption A2 iii) a) is satisfied, and let $\nu_K(w) = Ap^K(w)$. Since the mean square error (MSE) of a least squares projection is no greater than the MSE of the random variable being projected, $E[\tau(w)\psi(X)^2\|\nu_K(w) - \nu(w)\|^2] \le E[\tau(w)\psi(X)^2\|\nu(w) - \beta_K'p^K(w)\|^2] \to 0$. Also let $b_{KL}(z) = E[\tau(w)\psi(X)d(X)\nu_K(w)r^L(z)']r^L(z)$ and $\bar b_L(z) = E[\tau(w)\psi(X)d(X)\nu(w)r^L(z)']r^L(z)$. By the same MSE property, $E[\|b_{KL}(z) - \bar b_L(z)\|^2] \to 0$ as $K \to \infty$, and by Assumption A2 ii), $E[\|\bar b_L(z) - \rho(z)\|^2] \to 0$ as $L \to \infty$. Then, by boundedness of $\sigma^2(X) = \mathrm{Var}(y|X)$ and $\mathrm{Var}(x|z)$, and by the fact that MSE convergence implies convergence of second moments, it follows that

(A.10) $\quad V = E[\tau(w)\psi(X)^2\nu_K(w)\nu_K(w)'\sigma^2(X)] + E[b_{KL}(z)\mathrm{Var}(x|z)b_{KL}(z)'] \to \bar V.$

This shows that $F$ is bounded. Suppose instead that Assumption A2 iii) b) is satisfied. By the Cauchy-Schwarz inequality, $|a(h_K)| = |A\beta_K| \le \|A\|(E[h_K(w)^2])^{1/2}$, so that $\|A\| \to \infty$; also $V \ge A\Sigma A' \ge C\|A\|^2$, so $F$ is also bounded in this case.

Now, similarly to the proof of Lemma A1,

(A.11) $\quad \|\hat Q - I\| = O_p(\Delta_Q) = o_p(1), \qquad \|\hat Q_1 - I\| = O_p(\Delta_{Q_1}) = o_p(1),$

implying that the smallest eigenvalue of $\hat Q$ is bounded away from zero with probability approaching one, so that the largest eigenvalue of $\hat Q^{-1}$ is $O_p(1)$. Further, for $\bar H = \sum_{i=1}^n\tau_i\psi_ip_id(X_i)r_i'/n$,

(A.12) $\quad \|\bar H - H\| = o_p(1),$

and, for any matrix $B$, $\|B\hat Q^{-1}\| \le \|B\|O_p(1)$, implying

(A.13) $\quad \|FA(\hat Q^{-1} - I)\| \le \|FA(I - \hat Q)\hat Q^{-1}\| \le \|FA\|\,\|\hat Q - I\|O_p(1) = o_p(1), \qquad \|FA\hat Q^{-1/2}\|^2 = \mathrm{tr}(FA\hat Q^{-1}A'F') \le CO_p(1)\|FA\|^2 = O_p(1).$

Next, by Assumption A1 iv) and linearity of $a$,

(A.14) $\quad \sqrt n\|F[a(p^K(\cdot)'\beta_K) - a(h_0)]\| \le C\sqrt n\,O(K^{-\alpha}) = o(1),$

and, for $\bar h_i = \hat\tau_i\psi_ih_0(\hat w_i)$ stacked into $\bar h$,

(A.15) $\quad \|FA\hat Q^{-1}\hat P'(\bar h - \hat P\beta_K)/\sqrt n\| \le C\sqrt n\|FA\hat Q^{-1}\hat P'/\sqrt n\|\sup_{w\in W}|p^K(w)'\beta_K - h_0(w)| = O_p(\sqrt nK^{-\alpha}) = o_p(1).$

Then, for $\tilde h_i = \hat\tau_i\psi_ih_0(w_i)$ stacked into $\tilde h$, eqs. (A.14) and (A.15) and the triangle inequality give

(A.16) $\quad \sqrt nF[a(\hat h) - a(h_0)] = FA\hat Q^{-1}\hat P'\eta/\sqrt n + FA\hat Q^{-1}\hat P'(\tilde h - \bar h)/\sqrt n + o_p(1).$

By a second order mean-value expansion of each $h_0(\hat w_i)$ around $w_i$, with $d_i = d(X_i)$, and by $\sqrt n\zeta_0(K)\Delta_L^2 \to 0$,

(A.17) $\quad FA\hat Q^{-1}\hat P'(\tilde h - \bar h)/\sqrt n = FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'U/\sqrt n + FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'(\Pi - R\gamma_L)/\sqrt n + FA\hat Q^{-1}\sum_{i=1}^n\hat\tau_i\psi_i\hat p_id_i[r_i'\gamma_L - \Pi_i]/\sqrt n + o_p(1).$

Also, $\bar H\hat Q_1^{-1}\bar H'$ is bounded by the sum of squares from the multivariate regression of $\tau_i\psi_ip_id_i$ on $r_i$, which satisfies $\sum_{i=1}^n\tau_i\psi_i^2p_ip_i'd_i^2/n \le C\hat Q$. Therefore, by Assumption A1 v),

(A.18) $\quad \|FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'(\Pi - R\gamma_L)/\sqrt n\| \le [\mathrm{tr}(FA\hat Q^{-1}\bar H\hat Q_1^{-1}\bar H'\hat Q^{-1}A'F')]^{1/2}O(\sqrt nL^{-\alpha_1}) \le C\|FA\hat Q^{-1/2}\|O(\sqrt nL^{-\alpha_1}) = o_p(1),$

and similarly,

(A.19) $\quad \|FA\hat Q^{-1}\sum_{i=1}^n\hat\tau_i\psi_i\hat p_id_i[r_i'\gamma_L - \Pi_i]/\sqrt n\| \le C\|FA\hat Q^{-1/2}\|O(\sqrt nL^{-\alpha_1}) = o_p(1).$

Next, note that $E[\|R'U/\sqrt n\|^2] = \mathrm{tr}(\Sigma_1) \le C\,\mathrm{tr}(I_L) \le CL$ by $E[u^2|z]$ bounded, so by the Markov inequality,

(A.20) $\quad \|R'U/\sqrt n\| = O_p(L^{1/2}).$

Also, note that

(A.21) $\quad \|FA\hat Q^{-1}\bar H\hat Q_1^{-1/2}\| = O_p(1).$

By similar reasoning, and by eq. (A.8),

(A.22) $\quad \|FA\hat Q^{-1}\bar H(\hat Q_1^{-1} - I)R'U/\sqrt n\| \le \|FA\hat Q^{-1}\bar H\hat Q_1^{-1}\|\,\|\hat Q_1 - I\|\,\|R'U/\sqrt n\| = O_p(L^{1/2}\Delta_{Q_1}) = o_p(1),$

and $\|FA\hat Q^{-1}(\bar H - H)R'U/\sqrt n\| \le \|FA\hat Q^{-1}\|\,\|\bar H - H\|\,\|R'U/\sqrt n\| = o_p(1)$. Noting also that $HH'$ is bounded by the population matrix mean-square of the regression of $\tau_i\psi_ip_id_i$ on $r_i$, so that $HH' \le CI$, it follows that $E[\|HR'U/\sqrt n\|^2] = \mathrm{tr}(H\Sigma_1H') \le CK$ and $\|HR'U/\sqrt n\| = O_p(K^{1/2})$. Therefore,

(A.23) $\quad \|FA(\hat Q^{-1} - I)HR'U/\sqrt n\| \le \|FA\hat Q^{-1}\|\,\|I - \hat Q\|\,\|HR'U/\sqrt n\| = O_p(\Delta_QK^{1/2}) = o_p(1).$

Combining eqs. (A.16)-(A.23) and the triangle inequality gives

(A.24) $\quad FA\hat Q^{-1}\hat P'(\tilde h - \bar h)/\sqrt n = FAHR'U/\sqrt n + o_p(1).$

Next, it follows as in the proof of Lemma A1 that $\|\hat Q^{-1/2}(\hat P - P)'\eta/\sqrt n\| = o_p(1)$, implying

(A.25) $\quad \|FA\hat Q^{-1}(\hat P - P)'\eta/\sqrt n\| \le \|FA\hat Q^{-1/2}\|\,\|\hat Q^{-1/2}(\hat P - P)'\eta/\sqrt n\| = o_p(1).$

Also, by $E[\eta|X] = 0$,

(A.26) $\quad E[\|FA(\hat Q^{-1} - I)P'\eta/\sqrt n\|^2\,|\,X] \le C\|FA(I - \hat Q)\hat Q^{-1}\|^2 \le \|FA\|^2\|I - \hat Q\|^2O_p(1) = o_p(1),$

so that $\|FA(\hat Q^{-1} - I)P'\eta/\sqrt n\| \overset{p}{\to} 0$. Then combining eqs. (A.25)-(A.26) with eq. (A.16) and the triangle inequality gives

(A.27) $\quad \sqrt nF[a(\hat h) - a(h_0)] = FA(P'\eta/\sqrt n + HR'U/\sqrt n) + o_p(1).$

Next, for any vector $\phi$ with $\|\phi\| = 1$, let $Z_{in} = \phi'FA[\tau_i\psi_ip_i\eta_i + Hr_iu_i]/\sqrt n$. Note that the $Z_{in}$ are i.i.d. across $i$ for a given $n$, $E[Z_{in}] = 0$, and $\mathrm{Var}(Z_{in}) = 1/n$. Furthermore, for any $\epsilon > 0$, by $\|FA\| \le C$ and $\|FAH\| \le C\|FA\| \le C$,

$$nE[1(|Z_{in}| > \epsilon)Z_{in}^2] \le n\epsilon^{-2}E[Z_{in}^4] \le Cn^{-1}\epsilon^{-2}\{\zeta_0(K)^2E[\|\tau_i\psi_ip_i\|^2E[\eta_i^4|X_i]] + \zeta(L)^2E[\|r_i\|^2E[u_i^4|z_i]]\} = O([\zeta_0(K)^2K + \zeta(L)^2L]/n) = o(1).$$

Therefore $\sqrt nF(\hat\theta - \theta_0) \overset{d}{\to} N(0,I)$ follows by the Lindeberg-Feller theorem and equation (A.27).

It remains to show consistency of $\hat V$. As previously noted, $a(h)$ is a scalar, so it suffices to show that $F\hat VF' \overset{p}{\to} 1$; dividing then gives $\sqrt n\hat V^{-1/2}(\hat\theta - \theta_0) = (V/\hat V)^{1/2}\sqrt nV^{-1/2}(\hat\theta - \theta_0) \overset{d}{\to} N(0,I)$. Let $\tilde\Sigma = \sum_{i=1}^n\tau_ip_ip_i'\eta_i^2/n$, $\hat h_i = \hat h(\hat w_i)$, and $h_i = h_0(w_i)$. Note that by Theorem 1 of Newey (1997) and Lemma A1, $\max_{i\le n}|\hat h_i - h_i| = O_p(\zeta_1(K)\Delta_h) = o_p(1)$. Then, since $\tau_i(\hat\eta_i^2 - \eta_i^2) = \tau_i\psi_i^2[-2(y_i - h_i)(\hat h_i - h_i) + (\hat h_i - h_i)^2]$, it follows by $E[\eta_i^2|X]$ uniformly bounded and the Markov inequality that $\|\hat\Sigma - \tilde\Sigma\| = o_p(1)$, while $\|\tilde\Sigma - \Sigma\| = O_p(\Delta_Q + \{E[\tau_i\|p_i\|^4E[\eta_i^4|X_i]]/n\}^{1/2}) = O_p(\Delta_Q + \zeta_0(K)(K/n)^{1/2}) = o_p(1)$.

Next, let $\hat d_i = [\partial\hat h(\hat w_i)/\partial w]'\partial w(X_i,\hat\Pi_i)/\partial\pi$. By Lemma A1, $\partial w(X,\pi)/\partial\pi$ bounded, and the triangle inequality, $\sum_{i=1}^n\tau_i|\hat d_i - d_i|^2/n \le C(\sup_W|\hat h - h_0|_1^2 + \sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n) = O_p(\zeta_1(K)^2\Delta_h^2)$, so that by the Cauchy-Schwarz inequality,

$$\|\hat H - \bar H\| \le C(\sum_{i=1}^n\tau_i\|\hat p_i\|^2\|r_i\|^2/n)^{1/2}(\sum_{i=1}^n\tau_i|\hat d_i - d_i|^2/n)^{1/2} = O_p(\zeta_0(K)L^{1/2}\zeta_1(K)\Delta_h) = o_p(1).$$

Then, by logic like that above, $\|FA\hat Q^{-1}(\hat\Sigma - \Sigma)\hat Q^{-1}A'F'\| \le \|FA\hat Q^{-1}\|^2\|\hat\Sigma - \Sigma\| = o_p(1)$ and $\|FA\hat Q^{-1}\hat H\hat Q_1^{-1}(\hat\Sigma_1 - \Sigma_1)\hat Q_1^{-1}\hat H'\hat Q^{-1}A'F'\| \le \|FA\hat Q^{-1}\hat H\hat Q_1^{-1}\|^2\|\hat\Sigma_1 - \Sigma_1\| = o_p(1)$. Also, for any matrix $B$, $\|B\Sigma\| \le C\|B\|$, so by the triangle inequality,

$$\|FA(\hat Q^{-1}\Sigma\hat Q^{-1} - \Sigma)A'F'\| \le \|FA(\hat Q^{-1} - I)\Sigma\hat Q^{-1}A'F'\| + \|FA\Sigma(\hat Q^{-1} - I)A'F'\| \le \|FA\hat Q^{-1}\|^2\|(I - \hat Q)\Sigma\| + \|FA\Sigma\|\,\|I - \hat Q\|\,\|FA\hat Q^{-1}\| = o_p(1),$$

and it follows similarly that $\|FA(\hat Q^{-1}\hat H\hat Q_1^{-1}\hat\Sigma_1\hat Q_1^{-1}\hat H'\hat Q^{-1} - H\Sigma_1H')A'F'\| = o_p(1)$. The conclusion $F\hat VF' \overset{p}{\to} 1$ then follows by the triangle inequality, so that $\sqrt n\hat V^{-1/2}(\hat\theta - \theta_0) \overset{d}{\to} N(0,I)$. In the case of Assumption A2 iii) a), $F \to \bar V^{-1/2}$ by eq. (A.10), and the last conclusions follow by taking a square root and applying similar arguments. QED.
Lemma A3: If $v_i$ is i.i.d. and continuously distributed with bounded density, $a < b$, and $\max_{i\le n}|\hat v_i - v_i| \overset{p}{\to} 0$, then

$$\sum_{i=1}^n|1(a < \hat v_i \le b) - 1(a \le v_i \le b)|/n \overset{p}{\to} 0.$$

Proof: Let $\Delta_n = \max_{i\le n}|\hat v_i - v_i|$, so that $\Delta_n = o_p(1)$. We also use the well known result that there exist positive sequences $\delta_n \to 0$ going to zero slowly enough that $\Delta_n/\delta_n \overset{p}{\to} 0$, i.e. that $\Delta_n \le \delta_n$ for all $i$ with probability approaching one (w.p.a.1) as $n \to \infty$. By the density of $v_i$ being bounded, for any $\delta > 0$,

$$E[\sum_{i=1}^n\{1(|v_i - a| \le \delta) + 1(|v_i - b| \le \delta)\}/n] \le C\delta,$$

so by the Markov inequality $\sum_{i=1}^n[1(|v_i - a| \le \delta_n) + 1(|v_i - b| \le \delta_n)]/n = O_p(\delta_n)$. Then w.p.a.1, $|\hat v_i - v_i| \le \delta_n$ for all $i \le n$, and on that event $|1(a < \hat v_i \le b) - 1(a \le v_i \le b)| \le 1(|v_i - a| \le \delta_n) + 1(|v_i - b| \le \delta_n)$ for each $i$. Therefore w.p.a.1,

$$\sum_{i=1}^n|1(a < \hat v_i \le b) - 1(a \le v_i \le b)|/n \le \sum_{i=1}^n[1(|v_i - a| \le \delta_n) + 1(|v_i - b| \le \delta_n)]/n = O_p(\delta_n) = o_p(1). \quad QED.$$
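Lemma A3's conclusion is easy to check numerically: as the uniform error in $\hat v_i$ shrinks, the fraction of indicator disagreements shrinks with it (a simulation sketch; the distributions, sample size, and tolerances below are assumptions, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def indicator_gap(v, v_hat, a, b):
    """Mean absolute disagreement between 1(a < v_hat <= b) and 1(a <= v <= b)."""
    lhs = ((a < v_hat) & (v_hat <= b)).astype(float)
    rhs = ((a <= v) & (v <= b)).astype(float)
    return float(np.mean(np.abs(lhs - rhs)))

n = 100_000
v = rng.standard_normal(n)
for eps in (0.5, 0.05, 0.005):
    v_hat = v + rng.uniform(-eps, eps, size=n)   # max_i |v_hat_i - v_i| <= eps
    print(eps, indicator_gap(v, v_hat, -1.0, 1.0))
```

Only observations within the error band of a boundary point can disagree, which is exactly the counting argument in the proof.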
Proof of Lemma 4.1: Because series estimators are invariant to nonsingular linear transformations of the approximating functions, a location and scale shift allows us to assume that $W = [0,1]$ and $Z = [0,1]$. Also, in the polynomial case, the powers can be replaced by polynomials of the same order that are orthonormal with respect to the uniform distribution on $[0,1]$, and in the spline case by a B-spline basis. In both cases the resulting vector of functions is a nonsingular linear combination of the original vector, because of the assumption that the order of the multi-index is increasing. Also, assume that the same operation is carried out on the first step approximating functions $r^L(z)$. For any nonnegative integer $j$, let

$$\zeta_j(K) = \max_{|\mu|\le j}\sup_{w\in W}\|\partial^\mu p^K(w)\|, \qquad \zeta_j(L) = \max_{|\mu|\le j}\sup_{z\in Z}\|\partial^\mu r^L(z)\|.$$

Then it follows from Newey (1997) that in the polynomial case $\zeta_j(K) \le CK^{1+2j}$ and $\zeta(L) \le CL$, and in the spline case $\zeta_j(K) \le CK^{1/2+j}$ and $\zeta(L) \le CL^{1/2}$. It follows from these inequalities and Assumption 4 that the hypotheses of Lemma A1 are satisfied, so the conclusion follows by Lemma A1. QED.
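The polynomial bound on $\zeta_0(K)$ can be seen concretely for the shifted Legendre basis: the orthonormal polynomial of order $j$ on $[0,1]$ attains $\sqrt{2j+1}$ at the endpoints, so $\sup_x\|p^K(x)\| = (\sum_{j<K}(2j+1))^{1/2} = K$, growing linearly in $K$. A numerical check (illustrative code; the evaluation grid is an assumption):

```python
import numpy as np
from numpy.polynomial import legendre

def zeta0(K, grid):
    """sup_x ||p^K(x)|| for the orthonormal shifted-Legendre basis on [0,1]."""
    norms = []
    for x in grid:
        # sqrt(2j+1) * P_j(2x - 1) is orthonormal w.r.t. Uniform[0,1]
        vals = [np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, [0.0] * j + [1.0])
                for j in range(K)]
        norms.append(np.linalg.norm(vals))
    return max(norms)

grid = np.linspace(0.0, 1.0, 201)
print([round(zeta0(K, grid), 6) for K in (2, 4, 8)])  # sup norm equals K
```

The supremum is attained at the endpoints, which the grid includes, so the printed values match $K$ up to rounding.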
Proof of Theorem 4.2: Without changing the notation, let $F_0(w)$ denote the conditional distribution of $w = (w_1,u)$, $w_1 = (x,z_1)$, given $\tau(w) = 1$, with $F_0(w_1)$ and $F_0(u)$ the corresponding marginals. Let $\bar g(w_1) = \hat g(w_1) - \int\hat g(w_1)dF_0(w_1)$, $\bar g_0(w_1) = g_0(w_1) - \int g_0(w_1)dF_0(w_1)$, $\bar\lambda(u) = \hat\lambda(u) - \int\hat\lambda(u)dF_0(u)$, $\bar\lambda_0(u) = \lambda_0(u) - \int\lambda_0(u)dF_0(u)$, and $\Delta = \int\{\hat h(w) - h_0(w)\}dF_0(w)$. By the density of $w$ being bounded and bounded away from zero,

$$\int[\hat h(w) - h_0(w)]^2dF_0(w) \ge C\int[\bar g(w_1) + \bar\lambda(u) - \bar g_0(w_1) - \bar\lambda_0(u) + \Delta]^2dF_0(w_1)dF_0(u)$$
$$= C\int[\bar g(w_1) - \bar g_0(w_1)]^2dF_0(w_1) + C\int[\bar\lambda(u) - \bar\lambda_0(u)]^2dF_0(u) + C\Delta^2$$
$$\ge C\max\{\int[\bar g(w_1) - \bar g_0(w_1)]^2dF_0(w_1),\ \int[\bar\lambda(u) - \bar\lambda_0(u)]^2dF_0(u),\ \Delta^2\},$$

where the product terms drop out by the construction of $\bar g(w_1)$, $\bar\lambda(u)$, etc., e.g.

$$\int[\bar g(w_1) - \bar g_0(w_1)][\bar\lambda(u) - \bar\lambda_0(u)]dF_0(w_1)dF_0(u) = \int[\bar g(w_1) - \bar g_0(w_1)]dF_0(w_1)\cdot\int[\bar\lambda(u) - \bar\lambda_0(u)]dF_0(u) = 0. \quad QED.$$
Proof of Theorem 4.3: Follows from Lemma 4.1 by fixing the value of $u$ at any point in $W$. QED.
Proof of Theorem 5.1: For power series, using the inequalities in the proof of Theorem 4.1, it follows that

$$L^2\zeta_0(K)^2\zeta(L)^4 \le CK^2L^6, \qquad \zeta_0(K)^2\zeta_1(K)^2(LK + L^2) \le CK^8(KL + L^2),$$

and for splines that

$$L^2\zeta_0(K)^2\zeta(L)^4 \le CKL^4, \qquad \zeta_0(K)^2\zeta_1(K)^2(LK + L^2) \le CK^4(KL + L^2).$$

It then follows from the rate conditions of Theorem 4.2 that the rate conditions of Lemma A2 are satisfied. The conclusion then follows from Lemma A2, similarly to the proof of Theorem 4.1. QED.
References

Agarwal, G.G. and Studden, W.J. (1980): "Asymptotic Integrated Mean Square Error Using Least Squares and Bias Minimizing Spline," Annals of Statistics 8, 1307-1325.

Ahn, H. (1994): "Nonparametric Estimation of Conditional Choice Probabilities in a Binary Choice Model Under Uncertainty," working paper, Virginia Polytechnic Institute.

Andrews, D.W.K. (1991): "Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Models," Econometrica 59, 307-345.

Andrews, D.W.K. and Whang, Y.J. (1990): "Additive Interactive Regression Models: Circumvention of the Curse of Dimensionality," Econometric Theory 6, 466-479.

Barzel, Y. (1973): "The Determination of Daily Hours and Wages," Quarterly Journal of Economics 87, 220-238.

Biddle, J. and Zarkin, G. (1989): "Choice among Wage-Hours Packages: An Empirical Investigation of Male Labor Supply," Journal of Labor Economics 7, 415-437.

Breiman, L. and Friedman, J.H. (1985): "Estimating Optimal Transformations for Multiple Regression and Correlation," Journal of the American Statistical Association 80, 580-598.

Brown, B.W. and Walker, M.B. (1989): "The Random Utility Hypothesis and Inference in Demand Systems," Econometrica 47, 814-829.

Cox, D.D. (1988): "Approximation of Least Squares Regression on Nested Subspaces," Annals of Statistics 16, 713-732.

Hastie, T. and Tibshirani, R. (1991): Generalized Additive Models, Chapman and Hall, London.

Linton, O. and Nielsen, J.B. (1995): "A Kernel Method of Estimating Structured Nonparametric Regression Based on Marginal Integration," Biometrika 82, 93-100.

Linton, O., Mammen, E., and Nielsen, J. (1997): "The Existence and Asymptotic Properties of a Backfitting Projection Algorithm Under Weak Conditions," preprint.

Moffitt, R. (1984): "The Estimation of a Joint Wage-Hours Labor Supply Model," Journal of Labor Economics 2, 550-566.

Newey, W.K. (1984): "A Method of Moments Interpretation of Sequential Estimators," Economics Letters 14, 201-206.

Newey, W.K. (1994): "Kernel Estimation of Partial Means and a General Variance Estimator," Econometric Theory 10, 233-253.

Newey, W.K. (1997): "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics 79, 147-168.

Newey, W.K. and Powell, J.L. (1989): "Nonparametric Instrumental Variables Estimation," working paper, MIT Department of Economics.

Oi, W. (1962): "Labor as a Quasi Fixed Factor," Journal of Political Economy 70, 538-555.

Powell, M.J.D. (1981): Approximation Theory and Methods. Cambridge: Cambridge University Press.

Robinson, P. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica 56, 931-954.

Roehrig, C.S. (1988): "Conditions for Identification in Nonparametric and Parametric Models," Econometrica 56, 433-447.

Stone, C.J. (1982): "Optimal Global Rates of Convergence for Nonparametric Regression," Annals of Statistics 10, 1040-1053.

Stone, C.J. (1985): "Additive Regression and Other Nonparametric Models," Annals of Statistics 13, 689-705.

Tjostheim, D. and Auestad, B. (1994): "Nonparametric Identification of Nonlinear Time Series: Projections," Journal of the American Statistical Association 89, 1398-1409.

Vella, F. (1991): "A Simple Two-Step Estimator for Approximating Unknown Functional Forms in Models with Endogenous Explanatory Variables," working paper, Department of Economics, Australian National University.

Vella, F. (1993): "Nonwage Benefits in a Simultaneous Model of Wages and Hours: Labor Supply Functions of Young Females," Journal of Labor Economics 11, 704-723.
Table 1: Estimates of the Wage Equation

            OLS         2SLS        NP            NPIV
h           .000017     .000387     -.0094        -.01097
            (.5597)     (2.9170)    (2.7447)      (2.5513)
h^2                                 7.0151e-6     7.4577e-6
                                    (2.6363)      (2.7986)
h^3                                 -2.1707e-9    -2.0250e-9
                                    (2.6304)      (2.4640)
h^4                                 2.3692e-13    1.8929e-13
                                    (2.5537)      (2.0311)
r                                                 -.0005
                                                  (2.6732)
r^2                                               -.000384
                                                  (2.7643)
r^3                                               -6.6464e-8
                                                  (.5211)
r^4                                               3.4978e-10
                                                  (2.7516)

Notes: i) h denotes hours worked and r is the reduced form residual.
ii) The regressors in the wage equation included education dummies, union status, tenure, full-time work experience, black dummy and regional variables.
iii) The variables included in Z which are excluded from X are marital status, health status, presence of young children, rural dummy and non-labor income.
iv) Absolute values of t-statistics are reported in parentheses.
Table 2: Cross Validation Values

Specification     Cross Validation Value
h=0; r=0          185.567
h=1; r=1          183.158
h=2; r=2          183.738
h=3; r=3          182.899
h=4; r=4          182.882
h=5; r=5          183.128
Figure 1: Estimate of Wage/Hours Profile from Polynomial Approximation. (Profiles plotted against annual hours worked, 1000-3800; series shown include the unadjusted nonparametric estimate and two stage least squares.)
Figure 2: Estimate of Wage/Hours Profile from Spline Approximation. (Profiles plotted against annual hours worked, 1000-3800; series shown include the nonparametric and unadjusted nonparametric estimates.)
Figure 3: 95% Confidence Interval for Wage/Hours Profile from Polynomial Approximation. (Plotted against annual hours worked, 1000-3800.)
Figure 4: 95% Confidence Interval for Wage/Hours Profile from Spline Approximation. (Plotted against annual hours worked, 1000-3800.)