WORKING PAPER
DEPARTMENT OF ECONOMICS

Nonparametric Estimation of Triangular
Simultaneous Equations Models

Whitney Newey

No. 98-16
October 1998

Massachusetts Institute of Technology
50 Memorial Drive
Cambridge, MA 02142
Nonparametric Estimation of Triangular
Simultaneous Equations Models

Whitney K. Newey
Department of Economics
MIT, E52-262D
Cambridge, MA 02139
wnewey@mit.edu

James L. Powell
Department of Economics
University of California at Berkeley

Francis Vella
Department of Economics
Rutgers University
New Brunswick, NJ 08903
vella@fas-econ.rutgers.edu

March 1995
Revised, February 1998
Abstract

This paper presents a simple two-step nonparametric estimator for a triangular simultaneous equation model. Our approach employs series approximations that exploit the additive structure of the model. The first step comprises the nonparametric estimation of the reduced form and the corresponding residuals. The second step is the estimation of the primary equation via nonparametric regression with the reduced form residuals included as a regressor. We derive consistency and asymptotic normality results for our estimator, including optimal convergence rates. Finally we present an empirical example, based on the relationship between the hourly wage rate and annual hours worked, which illustrates the utility of our approach.

Keywords: Nonparametric Estimation, Simultaneous Equations, Series Estimation, Two-Step Estimators.
The NSF provided financial support for the work of Newey and Powell on this paper. An
early version of this paper was presented at an NBER conference on applications of
semiparametric estimation, October 1993, at Northwestern. We are grateful to
participants at this conference and workshops at Harvard/MIT and Tilburg for helpful
comments. We are also grateful to the referees for their extensive comments that have
helped both the content and style of this paper.
1. Introduction

Structural estimation is important in econometrics, because it is needed to account correctly for endogeneity that comes from individual choice or market equilibrium. Often structural models do not have tight functional form specifications, so that it is useful to consider nonparametric structural models and their estimation. This paper proposes and analyzes one approach to nonparametric structural modeling and estimation. This approach is different than standard nonparametric regression, because the object of estimation is the structural model and not just a conditional expectation.
Nonparametric structural models have been previously considered in Roehrig (1988), Newey and Powell (1989), and Vella (1991). Roehrig (1988) gives identification results for a system of equations when the errors are independent of the instruments. Newey and Powell (1989) consider identification and estimation under the weaker condition that the disturbance has conditional mean zero given the instruments. The results of this paper are complementary to these previous results. The model we consider is in some ways more restrictive than that of Newey and Powell (1989), but is easier to estimate.

The model we consider is a triangular nonparametric simultaneous equations model

(1.1)  y = g_0(x,z_1) + ε,   x = Π_0(z) + u,   E[ε|u,z] = E[ε|u],   E[u|z] = 0,

where x is a d_x × 1 vector of endogenous variables, z is a d_z × 1 vector of instrumental variables that includes z_1 as a d_1 × 1 subvector, Π_0(z) is a d_x × 1 vector of functions of the instruments z, and u is a d_x × 1 vector of disturbances. Equation (1.1) generalizes the limited information simultaneous equations model to allow for a nonparametric relationship g_0(x,z_1) between the variables y and (x,z_1) and a nonparametric reduced form Π_0(z).

The conditional mean restriction in equation (1.1) is one conditional mean version of the usual orthogonality condition for a linear model.² It is a more general assumption than requiring that (ε,u) be independent of z and that E[ε] = 0. This added generality may be important for some econometric models, because it allows for some conditional heteroskedasticity in the disturbances. For example, y can be purchases of a commodity and x expenditures on a subgroup of commodities, in some separable demand models. Endogeneity results from expenditure on a commodity subset being a choice variable. Also, heteroskedasticity often results from individual heterogeneity in demand functions, as pointed out by Brown and Walker (1989). Equation (1.1) allows for both endogeneity and heteroskedasticity.

_________
²If Π_0(z) were linear in z and the conditional expectations were replaced by population regressions, these conditions would say that z is orthogonal to u and that z is orthogonal to ε.
An alternative model, considered by Newey and Powell (1989), requires only the conditional mean assumption E[ε|z] = 0. Strictly speaking, neither model is stronger than the other, because equation (1.1) does not imply that E[ε|z] = 0, nor does E[ε|z] = 0 imply equation (1.1).³ ⁴ One benefit of the model of equation (1.1) is that the additive separability of u in the reduced form leads to a straightforward estimation approach, as will be further discussed below. In contrast, under E[ε|z] = 0 the mapping from the reduced form E[y|z] to the structure g_0(x,z_1) turns out to be discontinuous, making it very difficult to construct a consistent estimator. In this sense, equation (1.1) is a convenient one for applications where its restrictions are palatable.

We propose a two-step nonparametric estimator of this model. The first step consists of the construction of a residual vector from the nonparametric regression of x on z. The second step is the nonparametric regression of y on x, z_1, and the estimated residual u.

_________
³If equation (1.1) is satisfied, E[ε] = 0, and u is independent of z, then E[ε|z] = E[E[ε|u,z]|z] = E[E[ε|u]|z] = E[E[ε|u]] = E[ε] = 0.
⁴Requiring in addition that u be independent of z, so that both models hold, is a strong restriction; see Newey and Powell (1989).
Two-step nonparametric kernel estimation has previously been considered in Ahn (1994). Our results concern series estimators, which are convenient for imposing the additive structure implied by equation (1.1). We derive optimal mean-square convergence rates and asymptotic normality results that account for the first step estimation and allow for √n-consistency of mean-square continuous functionals of the estimator.

Section 2 of the paper considers identification, presenting straightforward sufficient conditions. Section 3 describes the estimator and Section 4 derives convergence rates. Section 5 gives conditions for asymptotic normality and inference, including consistent standard error formulae. Section 6 describes an extension of the model and estimators to semiparametric models. Section 7 contains an empirical example and Monte Carlo results illustrating the utility of our approach.
2. Identification

An implication of our model is that, for w = (x′, z_1′, u′)′,

(2.1)  E[y|x,z] = g_0(x,z_1) + E[ε|x,z] = g_0(x,z_1) + E[ε|u,z] = g_0(x,z_1) + λ_0(u) = h_0(w),   λ_0(u) = E[ε|u],

where the second equality uses the fact that (x,z) and (u,z) are one-to-one functions of each other. Thus, the function of interest, g_0(x,z_1), is one component of an additive regression of y on (x,z_1) and u. Equation (2.1) also implies that E[ε|u,z] = E[y − g_0(x,z_1)|u,z] = λ_0(u), which only depends on u, as specified in equation (1.1). Thus, the additive structure of equation (2.1) is equivalent to the conditional mean restriction of equation (1.1), so that identification of g_0 under equation (1.1) is the same as the identification of this additive component in equation (2.1).
To analyze identification of g_0(x,z_1), note that a conditional expectation is unique with probability one, so that any other additive function g(x,z_1) + λ(u) satisfying equation (2.1) must have Prob(g(x,z_1) + λ(u) = g_0(x,z_1) + λ_0(u)) = 1. Equivalently, working with the difference of the two conditional expectations, identification is equivalent to the statement that a zero additive function must have only constant components; that is, equality of conditional expectations must imply equality of the additive components, up to a constant. To be precise we have the following result.

Theorem 2.1: g_0(x,z_1) is identified, up to an additive constant, if and only if Prob(δ(x,z_1) + γ(u) = 0) = 1 implies there is a constant c with Prob(δ(x,z_1) = c) = 1.

There is a straightforward interpretation of this result that leads to a simple sufficient condition for identification. Suppose that identification fails. Then by Theorem 2.1, there are δ(x,z_1) and γ(u) with δ(x,z_1) + γ(u) = 0 and δ(x,z_1) nonconstant. Intuitively, this implies a functional relationship between the random vectors (x,z_1) and u, i.e. a degeneracy in the joint distribution of these two random variables. For instance, if γ(u) were a one-to-one function of a subvector ũ of u, then ũ = γ⁻¹(−δ(x,z_1)), an exact relationship between (x,z_1) and ũ. Consequently, the absence of an exact relationship between (x,z_1) and u will imply identification.
To formalize these ideas it is helpful to be precise about what we mean by existence of a functional relationship.

Definition: There is a functional relationship between (x,z_1) and u if and only if there exists h(x,z_1,u) and a set U such that Prob(u ∈ U) > 0, Prob(h(x,z_1,u) = 0) = 1, and Prob(h(x,z_1,ū) = 0) < 1 for all fixed ū ∈ U.

This condition says that the pair of vectors (x,z_1) and u solve a nontrivial implicit equation. Thus, the functional relationship of this definition is an implicit one. As previously noted, nonidentification implies existence of such an implicit relationship. Therefore, the contrapositive statement leads to the following sufficiency result for identification.

Theorem 2.2: If there is no functional relationship between (x,z_1) and u, then g_0(x,z_1) is identified up to an additive constant.

Although it is a sufficient condition, nonexistence of a functional relationship between (x,z_1) and u is not a necessary condition for identification. By Theorem 2.1, it is nonexistence of an additive functional relationship that is necessary and sufficient. Thus, identification may still occur when there is an exact, nonadditive functional relationship.
The additive structure of equation (2.1) is so strong that the model is identified even when the usual order condition is not satisfied, i.e. even when z has smaller dimension than the endogenous variables x. For example, suppose x is two dimensional, z is a scalar, and the reduced form is x = Π(z) + u with Π(z) = (z, e^z)′, so there is only one instrument although there are two endogenous variables. This reduced form implies a nonlinear, nonadditive relationship between x and u, x_2 − u_2 = exp(x_1 − u_1). Consider an additive function δ(x) + γ(u) satisfying

(2.2)  0 = δ(x) + γ(u) = δ(x_1, exp(x_1 − u_1) + u_2) + γ(u_1, u_2).

The restriction that g_0(x) and λ_0(u) are differentiable means that δ(x) and γ(u) must also be differentiable. Let numerical subscripts denote partial derivatives with respect to corresponding arguments. Then equation (2.2) implies

δ_1(x) + δ_2(x)exp(x_1 − u_1) = 0,   −δ_2(x)exp(x_1 − u_1) + γ_1(u) = 0,   δ_2(x) + γ_2(u) = 0.

Combining the last two equations gives γ_2(u) = −γ_1(u)exp(u_1 − x_1). Assuming that the support of x_1 conditional on u contains more than one point with probability one, so that x_1 can vary for given u, it follows from this equation that γ_1(u) = γ_2(u) = 0 with probability one. Then the second and first equations imply δ_2(x) = δ_1(x) = 0, implying that both γ and δ are constant functions. Thus any δ(x) satisfying equation (2.2) is a constant function, and hence g_0(x) is identified, up to a constant, by Theorem 2.1.
Theorem 2.3 provides a simple sufficient condition for identification that generalizes the rank condition for identification in a linear simultaneous equations system. Let z be partitioned as z = (z_1, z_2), and let subscripts x, 1, and 2 denote differentiation with respect to x, z_1, and z_2, respectively.

Theorem 2.3: Suppose that g(x,z_1), λ(u), and Π(z) are differentiable, the boundary of the support of (z,u) has zero probability, and with probability one rank(Π_2(z)) = d_x. Then g_0(x,z_1) is identified.

If Π(z) were linear in z, then rank(Π_2(z)) = d_x is precisely the condition for identification of one equation of a linear simultaneous system, in terms of the reduced form coefficients. Thus, this condition is a nonlinear, nonparametric generalization of the rank condition.

It is interesting to note, as pointed out by a referee, that this condition leads to an explicit formula for the structure. Note that h(x,z) = E[y|x,z] = g(x,z_1) + λ(x − Π(z)). Differentiating gives

h_x(x,z) = g_x(x,z_1) + λ_u(u),   h_1(x,z) = g_1(x,z_1) − Π_1(z)′λ_u(u),   h_2(x,z) = −Π_2(z)′λ_u(u).

Then, if rank(Π_2(z)) = d_x for almost all z, multiplying h_2(x,z) by D(z) = [Π_2(z)Π_2(z)′]⁻¹Π_2(z) and solving gives λ_u(u) = −D(z)h_2(x,z). Plugging this result into the other terms gives

(2.3)  g_x(x,z_1) = h_x(x,z) + D(z)h_2(x,z),   g_1(x,z_1) = h_1(x,z) − Π_1(z)′D(z)h_2(x,z).

This formula gives an explicit equation for the derivatives of the structural function in terms of the identified conditional expectations h(x,z) and Π(z).
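To illustrate equation (2.3), the following numerical check (our own sketch in Python, not part of the paper's analysis; the functions g0, lam, and Pi below are arbitrary toy choices satisfying the rank condition) builds h(x,z) = g_0(x,z_1) + λ_0(x − Π(z)) for scalar x and verifies by finite differences that h_x + D(z)h_2 recovers ∂g_0/∂x:

    g0  = lambda x, z1: x**2 + z1        # structural function (toy choice)
    lam = lambda u: u**3                  # lambda_0(u) = E[eps|u] (toy choice)
    Pi  = lambda z1, z2: z1 + 2.0 * z2    # reduced form; dPi/dz2 = 2, rank = d_x = 1

    def h(x, z1, z2):
        # h(x,z) = E[y|x,z] = g0(x,z1) + lambda_0(x - Pi(z))
        return g0(x, z1) + lam(x - Pi(z1, z2))

    x, z1, z2, d = 0.7, 0.3, -0.2, 1e-6
    h_x  = (h(x + d, z1, z2) - h(x - d, z1, z2)) / (2 * d)   # dh/dx
    h_z2 = (h(x, z1, z2 + d) - h(x, z1, z2 - d)) / (2 * d)   # dh/dz2
    Pi2  = 2.0                                               # dPi/dz2
    D    = Pi2 / (Pi2 * Pi2)              # D(z) = [Pi_2 Pi_2']^{-1} Pi_2
    print(h_x + D * h_z2, "vs", 2 * x)    # both approximately dg0/dx = 2x

The same check at any other point in the support behaves identically, which is just formula (2.3) at work.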
So far we have only considered identification of g_0(x,z_1) up to an additive constant. This is sufficient for many purposes, such as comparing g_0(x,z_1) at different values of (x,z_1). In other cases, such as forecasting demand quantity, it is also desirable to know the level of g_0(x,z_1), which can be identified if a location restriction is imposed on the distribution of ε. For example, suppose that E[ε] = 0, so that E[y] = E[g_0(x,z_1)]. Then for any function τ(u) with ∫τ(u)du = 1, equation (2.1) implies

(2.4)  ∫E[y|x,z_1,u]τ(u)du − E[∫E[y|x,z_1,u]τ(u)du] + E[y] = g_0(x,z_1) − E[g_0(x,z_1)] + E[y] = g_0(x,z_1).

Thus, under the familiar restriction E[ε] = 0, the conditions for identification of g_0(x,z_1) up to a constant also serve to identify the level of g_0(x,z_1).

Another approach to identifying the constant term is to place restrictions on λ_0(u). For example, the restriction λ_0(0) = 0 is a generalization of a condition that holds in the linear simultaneous equations model because, in the linear model where g_0(x,z_1) and Π(z) are linear and equation (1.1) holds with population regressions replacing conditional expectations, the regression of y on (x,z_1) and u will be linear in u. More generally, we will consider imposing the restriction

(2.5)  λ_0(ū) = λ̄  for some known ū and λ̄.

In this case E[y|x,z_1,ū] − λ̄ = g_0(x,z_1) + λ_0(ū) − λ̄ = g_0(x,z_1). Thus, under the restriction (2.5) there is a simple, explicit formula for the structural function in terms of h_0. The constant will be identified under other types of restrictions as well, but this will suffice for many purposes.
3. Estimation

Equation (2.1) can be estimated in two steps by combining estimation of the residual u with estimation of the additive regression in equation (2.1). The first step of this procedure is the formation of nonparametrically estimated residuals û_i = x_i − Π̂(z_i), (i = 1, ..., n). The second step is estimation of the additive regression in equation (2.1), using the estimated residuals in place of the true ones. An estimator of the structural function g_0(x,z_1) can then be recovered by pulling out the component that depends on those variables.

Estimation of additive regression models has previously been considered in the literature, and we can apply those results for our second step estimation. There are at least two general approaches to estimation of an additive component of a nonparametric regression. One is to use an estimator that imposes additivity and then pull out the component of interest. This approach for kernel estimators has been considered by Breiman and Friedman (1985), Hastie and Tibshirani (1991), and others. Series estimators are particularly convenient for imposing additivity, by simply excluding interaction terms, as in Stone (1985) and Andrews and Whang (1990).

Another approach to estimating the additive component g_0(x,z_1) is to use the fact that, for a function τ(u) such that ∫τ(u)du = 1,

∫E[y|x,z_1,u]τ(u)du = g_0(x,z_1) + ∫λ_0(u)τ(u)du.

Then g_0(x,z_1) can be estimated by integrating, over u, an unrestricted estimator of the conditional expectation. This partial means approach to estimating additive models has been proposed by Newey (1994), Tjostheim and Auestad (1994), and Linton and Nielsen (1995). It is computationally convenient for kernel estimators, although it is less asymptotically efficient than imposing additivity for estimation of √n-consistent functionals of additive models when the disturbance is homoskedastic.
In this paper we focus on series estimators because of their computational convenience and high efficiency in imposing additivity.

To describe the two-step series estimator we initially consider the first step. For each positive integer L, let r^L(z) = (r_{1L}(z), ..., r_{LL}(z))′ be a vector of approximating functions. In this paper we will consider in detail polynomial and spline approximating functions, which will be further discussed below. Let an i subscript index the observations, let n be the total number of observations, and let r_i = r^L(z_i) and R = [r_1, ..., r_n]′. Then Π̂(z) is the predicted value from a regression of x_i on r^L(z_i):

(3.1)  Π̂(z) = r^L(z)′γ̂,   γ̂ = (R′R)⁻¹R′(x_1, ..., x_n)′.
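A minimal sketch of the first step in Python (the variable names and the use of numpy are ours; the basis and least squares formula follow (3.1), with a univariate power series for concreteness):

    import numpy as np

    def power_basis(z, degree):
        # r^L(z) = (1, z, z^2, ..., z^degree)' for a scalar instrument z;
        # one observation per row.
        return np.column_stack([z**j for j in range(degree + 1)])

    def first_step(x, z, degree):
        # Least squares of x on r^L(z), as in (3.1); returns fitted values
        # Pi_hat(z_i) and residuals u_hat_i = x_i - Pi_hat(z_i).
        R = power_basis(z, degree)
        gamma, *_ = np.linalg.lstsq(R, x, rcond=None)
        return R @ gamma, x - R @ gamma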
To form the second step, let w = (x′, z_1′, u′)′ and let p^K(w) = (p_{1K}(w), ..., p_{KK}(w))′ be a vector of approximating functions of w such that each p_{kK}(w) depends either on (x,z_1) or on u, but not both. Exclusion of the interaction terms will mean that any linear combination of the approximating functions has the same additive structure as in equation (2.1), i.e. that additivity is imposed. Also, let 1(A) denote the indicator function for the event A, and let τ(w) denote a trimming function of the form

(3.2)  τ(w) = ∏_{j=1}^{d+d_x} 1(a_j ≤ w_j ≤ b_j),

where a_j and b_j are finite constants, w_j is the jth component of w, and d = d_1 + d_x is the dimension of (x,z_1). The trimming function is discussed further below. Let û_i = x_i − Π̂(z_i), ŵ_i = (x_i′, z_{1i}′, û_i′)′, τ̂_i = τ(ŵ_i), and p̂_i = p^K(ŵ_i) for each observation, where an i subscript refers to the observation. The second step is obtained by regressing y_i on p̂_i for the observations with τ̂_i = 1, giving

(3.3)  ĥ(w) = p^K(w)′β̂,   β̂ = (P̂′P̂)⁻¹P̂′Y,   P̂ = [τ̂_1 p̂_1, ..., τ̂_n p̂_n]′,   Y = (τ̂_1 y_1, ..., τ̂_n y_n)′.

The trimming function τ(w) can be used to exclude large values of w that might lead to outliers in the case of polynomial regression. It is also convenient for the asymptotic theory. Although we assume that the trimming function is nonrandom, some results would extend to the case where the limits a_j and b_j are estimated. In particular, if these limits are estimated from independent data (e.g. a subsample that is not otherwise used in estimation), then the results described here will still hold.
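Continuing the sketch above, the second step (3.3) can be written as follows, with scalar x, z_1, and û, the trimming indicators set to one, and illustrative low polynomial orders (all of these are our simplifications). The key point is that the regressor matrix contains no interactions between the (x,z_1) block and the û block, so additivity is imposed:

    import numpy as np

    def second_step(y, x, z1, u_hat, deg_x=4, deg_z=2, deg_u=3):
        # Regressors: constant, powers of x and z1 (the (x,z1) block), and
        # powers of u_hat (the u block). No cross terms, so any fitted linear
        # combination is additive, as equation (2.1) requires.
        cols = ([np.ones_like(y)]
                + [x**j for j in range(1, deg_x + 1)]
                + [z1**j for j in range(1, deg_z + 1)]
                + [u_hat**j for j in range(1, deg_u + 1)])
        P = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(P, y, rcond=None)
        return P, beta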
The estimator ĥ(w) can be used to construct an estimator of the structural function g_0(x,z_1) by collecting those terms that depend only on (x,z_1) and those that depend only on u. Suppose that p_{1K}(w) = 1, that the next K_x terms depend only on (x,z_1), and that the remaining terms depend only on u. Then the estimators can be formed as

(3.4)  ĝ(x,z_1) = ĉ_g + Σ_{j=2}^{K_x+1} β̂_j p_{jK}(x,z_1),   λ̂(u) = ĉ_λ + Σ_{j=K_x+2}^{K} β̂_j p_{jK}(u).

To complete the definition of these estimators the constant terms ĉ_g and ĉ_λ should be specified; the estimators are uniquely defined except for these constants. For many objects of estimation, such as predicting how g(x,z_1) changes as x shifts, the constants are not important. For example, if g(x,z_1) represents log-demand and x includes the log of price, a constant is not needed for estimation of elasticities (i.e. derivatives of g). In other cases the constant is important; for example, knowing the level of demand is important for predicting tax receipts. Because of the trimming we cannot use the most familiar condition, E[ε] = 0, to estimate the constants. However, the restriction λ_0(ū) = λ̄ of equation (2.5), which generalizes the linear model condition, can easily be used to estimate the constants. In particular, choosing

(3.5)  ĉ_λ = λ̄ − Σ_{j=K_x+2}^{K} β̂_j p_{jK}(ū),   ĉ_g = β̂_1 − ĉ_λ,

leads to an estimator satisfying λ̂(ū) = λ̄ and ĝ(x,z_1) + λ̂(u) = ĥ(w). In this case it will follow that the estimators of the individual components are

(3.6)  ĝ(x,z_1) = ĥ(x,z_1,ū) − λ̄,   λ̂(u) = ĥ(x,z_1,u) − ĥ(x,z_1,ū) + λ̄.
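Under the normalizations ū = 0 and λ̄ = 0 (an assumption we make purely for illustration), the fitted components can be recovered from ĥ via (3.6); a sketch continuing the code above, where the degrees must match those used in the second step:

    import numpy as np

    def h_hat(beta, x, z1, u, deg_x=4, deg_z=2, deg_u=3):
        # Evaluate the fitted second-step series at one point w = (x, z1, u).
        p = ([1.0] + [x**j for j in range(1, deg_x + 1)]
                   + [z1**j for j in range(1, deg_z + 1)]
                   + [u**j for j in range(1, deg_u + 1)])
        return float(np.dot(p, beta))

    def g_hat(beta, x, z1, u_bar=0.0, lam_bar=0.0, **deg):
        # Structural function via (3.6): g(x, z1) = h(x, z1, u_bar) - lam_bar.
        return h_hat(beta, x, z1, u_bar, **deg) - lam_bar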
We consider two specific types of series estimators, corresponding to polynomial and spline approximating functions. To describe these, let μ = (μ_1, ..., μ_{d_z})′ denote a multi-index, i.e. a vector of nonnegative integers, with norm |μ| = Σ_j μ_j, and let z^μ = ∏_j z_j^{μ_j}. For a sequence (μ(ℓ))_{ℓ=1}^∞ of distinct such vectors, a power series approximation for the first stage is given by r_{ℓL}(z) = z^{μ(ℓ)}. To describe a power series in the second stage, let (μ(k)) denote a sequence of multi-indices with dimension the same as w, and such that for each k, w^{μ(k)} depends only on (x,z_1) or on u, but not both. This feature can be incorporated by making sure that the first d components of μ(k) are zero if any of the last d_x components are nonzero, and vice versa. Then for the second stage a power series approximation can be obtained as

p^K(w) = (w^{μ(1)}, ..., w^{μ(K)})′.

For the asymptotic theory it will be assumed that each multi-index sequence is ordered, with degree |μ(ℓ)| or |μ(k)| increasing in ℓ or k, and that all terms of a particular order are included before increasing the order.

The theory to follow uses orthonormal polynomials, which may also have computational advantages. For the first step, replacing each power z^{μ(ℓ)} by the product of univariate polynomials of the same order, where the univariate polynomials are orthonormal with respect to some distribution, may lead to reduced collinearity. The estimator will be numerically invariant to such a replacement, because |μ(ℓ)| is monotonically increasing. An analogous replacement could also be used in the second step.
Regression splines are smooth piecewise polynomials with fixed knots (join points). They have been considered in the statistics literature by Agarwal and Studden (1980). They are different than smoothing splines; for regression splines the smoothing parameter is the number of knots. They have attractive features relative to power series, in that they are less sensitive to outliers and to bad approximations over small regions. The theory here requires that the knots be placed in the support of z, which therefore must be known, and that the knots be evenly spaced. It should be possible to generalize the results to allow the boundary of the support to be estimated and the knots not evenly spaced, but these generalizations lead to further technicalities that we leave to future research.

To describe regression splines it is convenient to assume that a_j = −1, b_j = 1, and that the support of z is a cartesian product of the interval [−1,1]. For a scalar c, let (c)_+ = 1(c > 0)·c. An mth degree spline with J−1 evenly spaced knots on [−1,1] is a linear combination of

p_{ℓJ}(c) = c^{ℓ−1},  1 ≤ ℓ ≤ m+1;   p_{ℓJ}(c) = {(c + 1 − 2(ℓ−m−1)/J)_+}^m,  m+2 ≤ ℓ ≤ m+J.

In this paper we take m fixed but will allow the number of knots to increase with sample size. For a number J_j of knots for the jth component of z, the approximating functions for z will be products of univariate splines. In particular, for a set of multi-indices {μ(ℓ)} with 1 ≤ μ_j(ℓ) ≤ m+J_j for each j, the approximating functions could be formed as

(3.7)  r_{ℓL}(z) = ∏_{j=1}^{d_z} p_{μ_j(ℓ),J_j}(z_j),  ℓ = 1, ..., L.

Throughout the paper it will be assumed that the numbers of knots grow with the sample size in such a way that the ratio of numbers of knots for each pair of elements of z is bounded above and away from zero. Spline approximating functions in the second step could be formed in the analogous way, from products of univariate splines, with additivity imposed by only including terms that depend on one of (x,z_1) or u.

The theory to follow uses B-splines, which are a linear transformation of the above functions that has lower multicollinearity. The low multicollinearity of B-splines and the recursive formula for calculation also lead to computational advantages; e.g. see Powell (1981).
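A sketch of the univariate truncated power basis in the display above (in Python, continuing the earlier sketches; in practice a B-spline basis would replace this one for numerical stability, as just noted):

    import numpy as np

    def spline_basis(c, m, J):
        # m-th degree spline with J-1 evenly spaced knots on [-1, 1], in the
        # truncated power form: c^{l-1} for l = 1, ..., m+1, then
        # ((c - t_k)_+)^m at knots t_k = -1 + 2k/J, k = 1, ..., J-1.
        cols = [c**l for l in range(m + 1)]
        cols += [np.maximum(c - (-1.0 + 2.0 * k / J), 0.0)**m
                 for k in range(1, J)]
        return np.column_stack(cols)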
4. Convergence Rates

In this Section we derive mean-square error and uniform convergence rates. To obtain results it is important to impose some regularity conditions. Let ε = y − h_0(w), u = x − Π_0(z), and X = (x,z). Also, for a matrix D let ||D|| = [trace(D′D)]^{1/2}, and for a random matrix Y and a constant ν with 1 ≤ ν < ∞ let ||Y||_ν = {E[||Y||^ν]}^{1/ν}, and let ||Y||_∞ be the infimum of constants C such that Prob(||Y|| ≤ C) = 1.

Assumption 1: {(y_i, x_i, z_i)}, (i = 1, 2, ...), is i.i.d., and Var(x|z) and Var(y|X) are bounded.

The bounded second conditional moment assumption is quite common in the series estimation literature (e.g. Stone, 1985). Relaxing it would lead to complications that we wish to avoid.
Next, we impose the following requirement. Let W = {w : τ(w) = 1}.

Assumption 2: z is continuously distributed with density that is bounded away from zero on its support, and the support of z is a cartesian product of compact, connected intervals. Also, w is continuously distributed, the density of w is bounded away from zero on W, and W is contained in the interior of the support of w.

This assumption is useful for deriving the properties of series estimators like those considered here (e.g. see Newey, 1997). It allows us to bound below the eigenvalues of the second moment matrix of the approximating functions.
Also, an identification condition like those discussed in Section 2 is embodied in this assumption. The density of w being bounded away from zero means that there is no functional relationship between (x,z_1) and u, so that by Theorem 2.2 identification holds. This assumption, along with smoothness conditions discussed below, implies that the dimension of z must be at least as large as the dimension of (x,z_1), a familiar order condition for identification. To see why Assumption 2 implies the order condition, note that the density of w is bounded away from zero on an open set. Then, because w is a one-to-one function of (z_1, Π(z), u), the density of (z_1, Π(z)) must be bounded away from zero on an open set. But (z_1, Π(z)) is a vector function of z that will be smooth under assumptions given below, so its range must be confined to a manifold of smaller dimension than (z_1, Π(z)) unless z has at least as many components as (z_1, Π(z)). Since the dimensions of Π(z) and x are the same, z must have as many components as (x,z_1).

Some discreteness in the distribution of z can be allowed for without affecting the results. In particular, Assumption 2 can be weakened to hold only for some components of z. Also, one could allow some components of z to be discretely distributed, as long as they have finite support. On the other hand, it is not easy to extend our results to the case where there are discrete instruments that take on an infinite number of values.
Some smoothness conditions are useful for controlling the bias of the series estimators.

Assumption 3: Π_0(z) is continuously differentiable of order s_1 on the support of z, and g_0(x,z_1) and λ_0(u) are Lipschitz and continuously differentiable of order s on W.

The derivative orders s and s_1 control the rate at which polynomials or splines approximate the functions. In particular, the rate of approximation for g_0(x,z_1) and λ_0(u), and hence also for h_0(w), will be O(K^{−s/d}), where d is the dimension of (x,z_1). The last regularity condition restricts the rate of growth of the numbers of terms K and L. Recall that d_z denotes the dimension of z.

Assumption 4: Either a) for power series, s > 2d and (K³ + K²L² + KL³)/n → 0; or b) for splines, (K² + KL^{1/2})[(L/n)^{1/2} + L^{−s_1/d_z}] → 0.

For example, for splines, if K and L grow at the same rate, then this condition restricts how quickly they can grow relative to the sample size n.
As a preliminary step it is helpful to derive convergence rates for the conditional expectation estimator ĥ(w). Let F_0 denote the distribution function of w.

Lemma 4.1: If Assumptions 1-4 are satisfied then

∫τ(w)[ĥ(w) − h_0(w)]²dF_0(w) = O_p(K/n + K^{−2s/d} + L/n + L^{−2s_1/d_z}),

and, for q = 1/2 for splines and q = 1 for power series,

sup_{w∈W}|ĥ(w) − h_0(w)| = O_p(K^q[(K/n)^{1/2} + K^{−s/d} + (L/n)^{1/2} + L^{−s_1/d_z}]).

This and all subsequent proofs are given in the Appendix.

This result leads to convergence rates for the structural estimator ĝ(x,z_1) in equation (3.4). We will treat the constant term a little differently for the mean square and uniform convergence rates, so it is helpful to state and discuss them separately. For mean square convergence we give a result for mean-square convergence of the de-meaned difference between the estimator and the truth.
Let τ̄ = E[τ(w)].

Theorem 4.2: If Assumptions 1-4 are satisfied then, for Δ̂(x,z_1) = ĝ(x,z_1) − g_0(x,z_1),

∫τ(w)[Δ̂(x,z_1) − ∫τ(w)Δ̂(x,z_1)F_0(dw)/τ̄]²F_0(dw) = O_p(K/n + K^{−2s/d} + L/n + L^{−2s_1/d_z}).

The convergence rate is the sum of two terms, depending on the number K of approximating functions in the second step and the number L in the first step. Each of these two terms has a form analogous to that derived in Stone (1985), which would attain the optimal mean-square convergence rate for estimating a conditional expectation if K and L were chosen appropriately. In particular, if K is chosen proportional to n^{d/(d+2s)} and L proportional to n^{d_z/(d_z+2s_1)}, and Assumption 4 is satisfied, then the conclusion of Theorem 4.2 is

(4.1)  ∫τ(w)[Δ̂(x,z_1) − ∫τ(w)Δ̂(x,z_1)F_0(dw)/τ̄]²F_0(dw) = O_p(max{n^{−2s/(2s+d)}, n^{−2s_1/(2s_1+d_z)}}).

From Stone (1982) it follows that the convergence rates in this expression are optimal for their respective steps, i.e. n^{−2s/(2s+d)} would be optimal for estimation of g_0(x,z_1) if u did not have to be estimated in the first step, and n^{−2s_1/(2s_1+d_z)} is optimal for the first step. Thus, when K and L are chosen appropriately, the mean square error convergence rate of the second step estimator is the slower of the optimal rate for the first step and the optimal rate for the second step that would obtain if the first step did not have to be estimated.

An important feature of this result is that the second step convergence rate is the optimal one for a function of a d-dimensional variable, rather than the slower rate that would be optimal for estimation of a general function of w, where additivity was not imposed. This occurs because the additivity of h_0(w) is used in estimation. It is essentially the exclusion of interaction terms that leads to this result, as discussed in Andrews and Whang (1990).
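To make the proportionality rules above concrete, a small helper (our own sketch; the proportionality constants, set to one here, are left free by the theory):

    def rate_optimal_terms(n, s, d, s1, dz):
        # K ~ n^{d/(d+2s)} and L ~ n^{dz/(dz+2s1)}, the choices in the
        # discussion following Theorem 4.2 (constants set to one).
        return round(n**(d / (d + 2.0 * s))), round(n**(dz / (dz + 2.0 * s1)))

    # e.g. n = 1000, s = s1 = 2, d = dz = 2 gives K = L = round(1000**(1/3)) = 10
    print(rate_optimal_terms(1000, 2, 2, 2, 2))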
Although a full derivation of bounds on the rate of convergence of estimators of g_0(x,z_1) is beyond the scope of this paper, something can be said.⁵ We know from Stone (1982) that the best attainable mean-square error (MSE) rate for estimation of g_0(x,z_1) when u is known is n^{−2s/(2s+d)}, and the best rate when u is unknown cannot be better than the best rate when u is known. Setting K and L to minimize each component of the mean-square error in Theorem 4.1 or 4.2, it follows that this rate, and hence the optimal second stage rate, will be attained for splines when s/d ≤ s_1/d_z and s/d > 2. For power series the side conditions of Assumption 4 are even stronger and would narrow the range of s/d where the estimator could attain the best convergence rate for the second step. The condition s/d ≤ s_1/d_z means that the second stage is at least as smooth (relative to its dimension) as the first stage. We do not know what the optimal convergence rate would be when the first stage is less smooth than the second stage, although in that case we know that the first stage convergence rate would dominate the mean-square error of our estimator, so our estimator could conceivably be suboptimal in that case. Also, the side conditions in Assumption 4 are quite strong and rule out any possibility of optimality of our estimator when neither the first nor the second stage is very smooth (e.g. when s/d ≤ 1).

It is interesting to note that the convergence rates in equation (4.1) are only relevant for the function itself and not for derivatives. Because û is plugged into ĥ(w), one might think that convergence rates for derivatives are also important, as in a Taylor expansion of ĥ(û) around ĥ(u). The reason that they are not is that u depends only on the conditioning variables x and z in equation (2.1), a feature that allows us to work with a Taylor expansion of h_0(w) rather than ĥ(w) in the proofs.

_________
⁵We thank a referee for pointing this out and for giving a slightly more general version of the following discussion of optimality.
Theorem 4.2 gives convergence rates that are net of the constant term. It would be possible to obtain mean-square rates that include the constant term by specifying a restriction used to identify it, but the rates would then depend on the identifying assumption for the constant term. With the trimming, the restriction E[τ(w)ε] = 0 would identify the constant, but this condition does not seem to be well motivated in the model and so will not be considered. The restriction λ_0(ū) = λ̄ is a pointwise one, so it does not lead to mean-square error convergence rates for the constant term, but uniform convergence rates can be obtained under it.

Uniform convergence rates for ĝ(x,z_1), including the constant term, follow immediately from the uniform rates for ĥ(w) in Lemma 4.1, as long as the restriction λ_0(ū) = λ̄ used to identify the constant term holds. In particular, ĝ(x,z_1) = ĥ(x,z_1,ū) − λ̄ is equal to the coordinate projection of ĥ(w) obtained by fixing u at the value ū, which is continuous in the supremum norm on W, so that uniform rates follow immediately.

Theorem 4.3: If Assumptions 1-4 are satisfied and λ_0(ū) = λ̄, then for q = 1/2 for splines and q = 1 for power series,

sup_{(x,z_1):(x,z_1,ū)∈W} |ĝ(x,z_1) − g_0(x,z_1)| = O_p(K^q[(K/n)^{1/2} + K^{−s/d} + (L/n)^{1/2} + L^{−s_1/d_z}]).
This uniform convergence rate cannot be made equal to Stone's (1982) bounds for either the first or second steps, due to the leading K^q term. Nevertheless, these uniform rates improve on some in the literature, e.g. on Cox (1988), as noted in Newey (1997). Apparently, it is not yet known whether series estimators can attain the optimal uniform convergence rates.

The uniform convergence rate is slower, and the conditions on K and L more stringent, for power series than for splines. Nevertheless we retain the results for polynomials because they have some appealing practical features. They are often used for increased flexibility in functional form. Also polynomials do not require knot placement, and hence are simpler to use in practice.

We could also derive convergence rates for λ̂(u), and obtain convergence rates identical to those in Theorems 4.2 and 4.3. We omit these results for brevity.
5. Inference

Large sample confidence intervals are useful for inference. We focus on pointwise confidence intervals rather than uniform ones, as in much of the literature. We construct confidence intervals using standard errors that account for the nonparametric first step estimation.

To describe the standard errors and associated confidence intervals it is useful to consider a general framework. For this purpose let h denote a possible true function h_0(w), where the w argument is suppressed for convenience. That is, the symbol h represents the entire function. Let a(h) denote a functional, i.e. a(h) represents a mapping from possible functions h to the real line. The object of interest is assumed to be the value θ_0 = a(h_0) of the functional at h_0. In this Section we develop standard errors for the estimator of θ_0 and give asymptotic normality results that allow formation of large sample confidence intervals. We consider all linear functionals of h, which turn out to include the value of the function g_0(x,z_1) at a point.
This framework is general enough to include many objects of interest, such as the value of g_0 at a point. For example, under the restriction λ_0(ū) = λ̄ discussed earlier, the value of g_0(x,z_1) at a point can be represented as

(5.1)  g_0(x,z_1) = a(h_0),   a(h) = h(x,z_1,ū) − λ̄.

A general inference procedure for linear functionals will apply to the functional of this equation, i.e. can be used in the construction of large sample confidence intervals for the value of g_0 at a point.

Another example is a weighted average derivative of g_0(x,z_1). This object is particularly interesting when g_0 is an index model that depends on w_1 = (x,z_1) only through a linear combination, say g_0(w_1) = ḡ(w_1′β_0) for a differentiable function ḡ with derivative ḡ′. Then for any v(w) such that v(w)ḡ′(w_1′β_0) is integrable,

(5.2)  ∫v(w)[∂h_0(w)/∂w_1]dw = [∫ḡ′(w_1′β_0)v(w)dw]β_0,

so that the weighted average derivative a(h) = ∫v(w)[∂h(w)/∂w_1]dw is proportional to β_0. This is also a linear functional of h. This functional may also be useful for summarizing the slope of g_0(w_1) over some range, as discussed in Section 7. Unlike the value of g_0(w_1) at a point, the weighted average derivative functional will be √n-consistent under regularity conditions discussed below.
We can derive standard errors for linear functionals of ĥ, a class general enough to include many important examples, such as the value of g at a point or a weighted average derivative. The estimator θ̂ = a(ĥ) is a natural "plug-in" estimator, obtained by applying the functional a to ĥ. For example, the value of g at a point could be estimated by ĝ(x,z_1) = a(ĥ) = ĥ(x,z_1,ū) − λ̄. In general, linearity of a(h) implies that

(5.3)  θ̂ = Âβ̂,   Â = (a(p_{1K}), ..., a(p_{KK})).

Because θ̂ is a linear combination of the two-step least squares coefficients β̂, a natural estimator of the variance can be obtained by applying the formula for parametric two-step estimators (see Newey, 1984). As for other series estimators (e.g. Newey, 1997), this should lead to correct inference asymptotically, provided it accounts properly for the variability of the estimator.
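For the point-evaluation functional in (5.1), the vector Â of (5.3) is easy to compute; a sketch continuing the earlier code (our notation; note that the constant λ̄ enters affinely rather than linearly, so it is subtracted after forming Âβ̂):

    import numpy as np

    def A_point(x0, z10, u_bar, deg_x=4, deg_z=2, deg_u=3):
        # A_k = value of the k-th basis function at (x0, z10, u_bar), so that
        # theta_hat = A @ beta_hat - lam_bar reproduces g_hat(x0, z10).
        return np.array([1.0] + [x0**j for j in range(1, deg_x + 1)]
                              + [z10**j for j in range(1, deg_z + 1)]
                              + [u_bar**j for j in range(1, deg_u + 1)])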
Let ⊗ denote the Kronecker product. The variance estimate for θ̂ is V̂/n, where

(5.4)  V̂ = ÂQ̂⁻¹[Σ̂ + ĤQ̂_1⁻¹Σ̂_1Q̂_1⁻¹Ĥ′]Q̂⁻¹Â′,

(5.5)  Q̂ = P̂′P̂/n,   Σ̂ = Σ_{i=1}^n τ̂_i p̂_i p̂_i′[y_i − ĥ(ŵ_i)]²/n,   Ĥ = Σ_{i=1}^n τ̂_i [∂ĥ(ŵ_i)/∂u]′ ⊗ p̂_i r_i′/n,   Q̂_1 = I ⊗ (R′R/n),   Σ̂_1 = Σ_{i=1}^n (û_i û_i′) ⊗ (r_i r_i′)/n.

This estimator is like that suggested by Newey (1984) for parametric two-step regressions. It is equal to a term like the standard heteroskedasticity consistent variance estimator, plus an additional nonnegative definite term that accounts for the presence of û. It has this additive form, with the variance of the two-step estimator larger than that of the corresponding one-step estimator, because the second step conditions on the first step dependent variables, making the second and first steps orthogonal.
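A sketch of (5.4)-(5.5) in Python for the special case of scalar u, where the Kronecker products collapse to ordinary weighted cross-moments (our simplification; trimming indicators are set to one, and the caller must supply e2, the squared second-step residuals, and dh_du, the derivative of the fitted series in u):

    import numpy as np

    def two_step_variance(A, P, R, e2, u_hat, dh_du):
        n = P.shape[0]
        Q      = P.T @ P / n                              # Q_hat
        Sigma  = (P * e2[:, None]).T @ P / n              # Sigma_hat
        H      = (P * dh_du[:, None]).T @ R / n           # H_hat (scalar u)
        Q1     = R.T @ R / n                              # Q1_hat (scalar u)
        Sigma1 = (R * (u_hat**2)[:, None]).T @ R / n      # Sigma1_hat
        Qi, Q1i = np.linalg.inv(Q), np.linalg.inv(Q1)
        V = A @ Qi @ (Sigma + H @ Q1i @ Sigma1 @ Q1i @ H.T) @ Qi @ A
        return V   # Var(theta_hat) is approximately V / n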
We will show that under certain regularity conditions

√n V̂^{−1/2}(θ̂ − θ_0) →_d N(0,1).

Thus doing inference as if θ̂ were distributed normally with mean θ_0 and variance V̂/n will be a correct large sample approximation. This extends large sample inference results of Andrews (1991) and Newey (1997) to account for an estimated first step. However, as in this previous work, such results do not specify the convergence rate of a(ĥ). That rate will depend on how fast V̂ goes to infinity. To date there is still not a complete characterization of the convergence rate of a series estimator available in the literature, although it is known when √n-consistency occurs (see Newey, 1997).
Those results can be extended to obtain √n-consistency of functionals of the two-step estimators developed here. Let ||·|| = (E[τ(w)(·)²]/τ̄)^{1/2} denote a trimmed mean-square norm, and let T denote the set of functions that can be approximated arbitrarily well in this norm by a linear combination of p^K(w) for large enough K. It is well known, for p^K(w) corresponding to power series or splines, that T includes all additive functions in (x,z_1) and u where the individual components have finite ||·|| norm. The critical condition for √n-consistency is that there is ν(w) ∈ T such that

(5.6)  a(h) = E[τ(w)ν(w)h(w)]  for any  h(w) ∈ T.

Essentially this means that a(h) can be represented as an expected product of h(w) with a function ν(w) in the same space as h. By the Riesz representation theorem, this is the same as the functional a(h) being continuous with respect to the norm ||·||. In this case √n(θ̂ − θ_0) will be asymptotically normal, and its asymptotic variance can be given an explicit expression.
To describe the asymptotic variance in the √n-consistent case, let ε = y − h_0(w) and

ρ(z) = E[τ(w)ν(w)∂h_0(w)/∂u′ | z].

The asymptotic variance of the estimator can then be expressed as:

(5.7)  V = E[τ(w)ν(w)²Var(y|X)] + E[ρ(z)Var(x|z)ρ(z)′].

For example, for the weighted average derivative in equation (5.2), if v(w) is zero where the density f_0(w) of w is zero, so that τ(w)v(w) = v(w), then integration by parts shows that equation (5.6) is satisfied with ν(w) = −Proj(f_0(w)⁻¹∂v(w)/∂x | T), where Proj(·|T) denotes the mean-square projection on the space T of functions that are additive in (x,z_1) and u, and ρ(z) = −E[τ(w){Proj(f_0(w)⁻¹∂v(w)/∂x | T)}∂h_0(w)/∂u′ | z]. The presence of the projection Proj(·|T) is necessary to make ν(w) lie in T.

To state precise results on asymptotic normality it is helpful to introduce some regularity conditions.
Assumption 5: σ²(X) = Var(y|X) is bounded away from zero, E[ε⁴|X] is bounded, E[||u||⁴|X] is bounded, and h_0(w) is twice continuously differentiable in u with bounded first and second derivatives.

This condition strengthens the bounded conditional second moment condition of Assumption 1 to boundedness of the fourth conditional moment, and requires the conditional variance to be bounded away from zero.
The next assumption is the key condition for √n-consistency of the estimator. It requires that a(h) have a representation as an expected product form, at least for the truth h_0 and the approximating functions.

Assumption 6: There exists ν(w) with E[τ(w)||ν(w)||²] < ∞ such that a(h_0) = E[τ(w)ν(w)h_0(w)], a(p_{kK}) = E[τ(w)ν(w)p_{kK}(w)] for all k ≤ K, and there are β_K such that E[τ(w)||ν(w) − p^K(w)′β_K||²] → 0 as K → ∞.

This condition also requires that ν(w) can be approximated well by a linear combination of p^K(w). For example, with the average derivative this condition leads ν(w) to be the projection of −f_0(w)⁻¹∂v(w)/∂x on functions that are additive in (x,z_1) and u.

The next condition is complementary to Assumption 6 and is important for the precise form that the asymptotic normality result takes. For a (d+d_x) × 1 vector μ of nonnegative integers, let ∂^μ h(w) = ∂^{|μ|}h(w)/∂w_1^{μ_1}···∂w_{d+d_x}^{μ_{d+d_x}}. Also, let δ denote a nonnegative integer and |h|_δ = max_{|μ|≤δ} sup_{w∈W} |∂^μ h(w)|.
Assumption 7: a(h) is a scalar, |a(h)| ≤ C|h|_δ for some δ, and there exists β_K such that, as K → ∞, E[τ(w){p^K(w)′β_K}²] → 0 while a(p^K′β_K) is bounded away from zero.

This assumption says that the functional a(h) is continuous in the norm |h|_δ but not in the mean square norm. The lack of mean-square continuity will imply that the estimator is not √n-consistent, so this condition and Assumption 6 are mutually exclusive, as pointed out by a referee. Another restriction imposed is that a(h) is a scalar. When a(h) is a vector, asymptotic normality with an estimated covariance matrix would follow from an assumption like condition iii) of Andrews (1991), which is difficult to verify. In contrast, Assumption 7 is a primitive condition that is relatively easy to verify, and it is general enough to cover many cases of interest. For example, as noted above, Assumption 6 includes the weighted average derivative. For another example, Assumption 7 obviously applies to the functional a(h) = h(x,z_1,ū) − λ̄ from equation (5.1), which is continuous with respect to the supremum norm; it is straightforward to show that there is β_K such that p^K(x,z_1,ū)′β_K is bounded away from zero but E[τ(w){p^K(w)′β_K}²] → 0.
The next condition restricts the allowed rates of growth for K and L.

Assumption 8: √nK^{−s/d} → 0, √nL^{−s_1/d_z} → 0, and (K⁵L + K³L³ + KL⁵)/n → 0 for splines and (K⁷L + K⁴L⁴ + KL⁷)/n → 0 for power series.

One can show that this condition requires that s/d > 5/2 and s_1/d_z > 2, and it also limits the growth rates of K and L: if each grows at the same rate, they can grow no faster than n^{1/6} for splines and n^{1/8} for power series. These conditions are stronger than for single step series estimators as discussed in Newey (1997), because of complications due to the multistage nature of the estimation problem.
The next condition is useful for the estimator V̂ of the asymptotic variance to have the right properties.

Assumption 9: Either a) x is univariate and the spline used is at least a quadratic one; or b) x is multivariate, p^K(w) is a power series, λ_0(u) is differentiable of all orders, there is a constant C with the absolute value of the jth derivative bounded above by C^j, and √nK^{−ε} → 0 for some ε > 0.

This assumption helps guarantee that the derivative of λ_0(u) can be approximated by the derivative of a linear combination of p^K(w), which is important for consistency of the covariance matrix estimator. These conditions are not very general, but more general ones would require series approximation rates for derivatives, which are not available in the literature. This assumption is at least useful for showing that the asymptotic variance estimator will be consistent under some set of regularity conditions.

Next, for the nonnegative integer δ of Assumption 7, let

ζ_δ(K) = max_{|μ|≤δ} sup_{w∈W} ||∂^μ p^K(w)||.
Theorem 5.1: If Assumptions 1-3, 5, either 6 or 7, and 8 and 9 are satisfied, then θ̂ = θ_0 + O_p(ζ_δ(K)/√n) and √nV̂^{−1/2}(θ̂ − θ_0) →_d N(0,1). Furthermore, if Assumption 6 is satisfied, then √n(θ̂ − θ_0) →_d N(0,V) and V̂ →_p V.
In addition to the asymptotic normality needed for large sample confidence intervals, this result gives convergence rate bounds and, when Assumption 6 holds, √n-consistency. Bounds on the convergence rate can also be derived from the results shown in Newey (1997): for power series ζ_δ(K) ≤ CK^{1+2δ} and for splines ζ_δ(K) ≤ CK^{1/2+δ}, where C is a generic positive constant. Thus, for example, when a(h) = h(x,z_1,ū) − λ̄ and δ = 0, an upper bound on the convergence rate would be K/√n for power series and √K/√n for splines. Furthermore, if h_0(w) and Π_0(z) have derivatives of all orders, i.e. are continuously differentiable of order s and s_1 for all s and s_1, the convergence rate can be made close to 1/√n by letting K and L grow slowly enough. In particular, for some positive, small ε, setting K = Cn^ε and L = Cn^ε will satisfy all of the conditions. Then the convergence rate bound for power series will be n^ε/√n, which will be close to 1/√n for ε small enough.
The centering of the limiting distribution at zero in this result is a reflection of "over fitting" that is present, meaning that the bias of these estimators shrinks faster than the variance. This over fitting is implied by Assumption 8. The order of the bias is K^{−s/d}, so that Assumption 8 requires that the bias shrink faster than 1/√n. Since 1/√n is generally the fastest rate at which the standard deviation of estimators can shrink (such as for the sample mean), over fitting is present. When combined with the condition that K go to infinity slower than n^α for some α < 1, the bias shrinking faster than 1/√n means that the convergence rate for the estimator is bounded away from the optimal rate. This feature of the asymptotic normality result is shared by other theorems for series estimators, as in Andrews (1991) and Newey (1997). It is different than asymptotic normality results for other types of nonparametric estimators, such as kernel regression. It would be good to relax this condition, but that is an extension that is beyond the scope of this paper.
6. Additive Semiparametric Models

In economic applications there is often a large number of covariates, thereby making nonparametric estimation problematic. The well known curse of dimensionality can make the estimation of large dimensional nonparametric models difficult. One approach to this problem is to restrict the model in some way while retaining some nonparametric features. The single index model discussed earlier is one example of such a restricted model. Another example is a model that is additive in some components and parametric in others. This type of model has received much attention in the literature as an approach to dimension reduction. Furthermore, it is particularly easy to estimate using a series estimator, by simply imposing restrictions on the approximating functions. To describe this type of model, let w_1 = (x,z_1) as before, and let w_{1j}, (j = 0, 1, ..., J), denote J+1 subvectors of w_1. A semiparametric additive model is

(6.1)  g_0(w_1) = w_{10}′β_0 + Σ_{j=1}^J g_{j0}(w_{1j}).

This model can be estimated by specifying that the approximating functions consist of the elements of w_{10} and power series or splines in each w_{1j}, (j = 1, ..., J). One could also impose similar restrictions on λ(u) and the reduced form, by specifying that they depend on a linear combination plus some additive components. For example, one could specify a partially linear reduced form as

(6.2)  Π_0(z) = z′π_0 + Σ_{m=1}^M π_{m0}(z_m).

This restriction could be imposed in estimation by specifying that r^L(z) includes the elements of z and power series or splines in each z_m.

The coefficients β_0 of equation (6.1) are examples of functionals of h_0 that are mean-square continuous, so that a series estimator will be √n-consistent under the conditions of Section 5. Let q(w) be the residual from the mean-square projection of w_{10} on functions of the form Σ_{j=1}^J g_j(w_{1j}) + λ_0(u), for the probability measure where P(A) = E[τ(w)1(w∈A)]/E[τ(w)]. Assume that E[τ(w)q(w)q(w)′] is nonsingular, an identification condition for β_0. Then equation (5.6) is satisfied with ν(w) = (E[τ(w)q(w)q(w)′])⁻¹q(w). Hence, √n-consistency and asymptotic normality of the series estimator for β follow from the results of Section 5. Robinson (1988) also gave some instrumental variable estimators for a semiparametric model that is linear in endogenous variables. The regularity conditions of Section 5 can be weakened somewhat for this model. Basically, only the nonparametric part of the model need satisfy these conditions, so that, for example, w_{10} need not be continuously distributed.
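A sketch of how the restrictions of (6.1) are imposed through the second-step basis (our notation; w10 is the hypothetical n × p matrix of linearly entering regressors, additive_parts a list of the scalar components w_{1j}):

    import numpy as np

    def semiparametric_basis(w10, additive_parts, u_hat, deg=3):
        # The parametric part w10 enters linearly; each additive component and
        # the control function u_hat get separate power series, with no
        # interactions, which imposes the structure of (6.1).
        n = len(u_hat)
        cols = [np.ones((n, 1)), w10]
        for v in additive_parts + [u_hat]:
            cols.append(np.column_stack([v**j for j in range(1, deg + 1)]))
        return np.hstack(cols)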
7. Empirical Example

To illustrate our approach we investigate the empirical relationship between the hourly wage rate and the number of annual hours worked. While early examinations treated the hourly wage rate as independent of the number of hours worked, recent studies have incorporated an endogenous wage rate (see, for example, Moffitt 1984, Biddle and Zarkin 1989, and Vella 1993) and have uncovered a non-linear relationship between the wage rate and annual hours. The theoretical underpinnings of this relationship can be assigned to an increasing importance of labor when it is involved in a greater number of hours (see Oi 1962). Alternatively, Barzel (1973) argues that labor is relatively unproductive at low hours of work due to related start up costs. Moreover, at high hours of work Barzel argues that fatigue decreases labor productivity. These forces will generate hourly wage rates that initially increase but subsequently decrease as daily hours of work increase. Finally, the taxation/labor supply literature argues that a non-linear relationship between hours and wage rates may exist due to the progressive nature of taxation rates. Thus the non-linear relationship is potentially the outcome of many influences, and we capture it in the following model:

(7.1)  y_i = z_{1i}′β_0 + g_{20}(x_i) + ε_i,   x_i = z_i′π_0 + u_i,

where y_i is the log of the hourly wage rate of individual i, x_i is annual hours worked, z_{1i} is a vector of individual characteristics, z_i is a vector of exogenous variables that includes z_{1i}, β_0 and π_0 are parameters, g_{20} is an unknown function, and ε_i and u_i are zero mean error terms such that E[ε|u] ≠ 0. This is a semiparametric version of equation (1.1), like that discussed in Section 6, with a parametric reduced form. We use this specification because there are many individual characteristics, and it would be difficult to apply fully nonparametric estimation due to the curse of dimensionality. We estimate this model using data on males from the 1989 wave of the Michigan Panel Survey of Income Dynamics.
To preserve comparability with previous empirical studies we follow similar data exclusions to those in Biddle and Zarkin (1989). Accordingly, we only include males aged between 22 and 55 who had worked between 1000 and 3500 hours in the previous year. This produced a sample of 1314 observations. Our measures of hours and wages are annual hours worked and the hourly wage rate, respectively.

We first estimate the wage equation by OLS and report the relevant parameters in column 1 of Table 1. The linear hours effect is small and not significantly different from zero. Adjusting for the endogeneity of hours via linear two stage least squares (2SLS), however, indicates that the impact of hours is statistically significant, although the effect is small. The results in column 2, on the basis of the t-statistic for the residuals, suggest hours are endogenous to wages.
To allow the g function to be non-linear we employed alternative specifications for g and various approximations for the manner in which we account for the endogeneity of hours. We also allow for some non-linearities in the reduced form by including non-linear terms for the experience and tenure variables. We first choose the number of non-linear terms in the first step, and then, by employing the associated residuals from the chosen specification, we determine the number of approximating terms in the primary equation. We discriminate between alternative specifications on the basis of a cross validation (CV) criterion. The CV criterion is well known to minimize asymptotic mean-square error when the first estimation step is not present, so we are hopeful it will lead to estimates with good properties here. Below we consider its properties in a Monte Carlo study, and find that CV performs well. Because the theory requires over fitting, where the bias is smaller than the variance asymptotically, it seems prudent to consider specifications with more terms than those that CV gives, particularly for inference. Accordingly, for both steps we identify the number of terms that minimizes the CV criterion and then add an additional term. For example, while the approximation which minimized the CV criterion for the first step was a fourth order polynomial in tenure and experience, we generate the residuals from a model which includes a fifth order polynomial in these variables. Note that, to enable comparisons, these were the residuals which were included in column 2 discussed above.
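For a single least squares step, the CV criterion can be computed without refitting via the standard leave-one-out identity (a generic sketch in our notation; in the two-step setting it is applied specification by specification):

    import numpy as np

    def cv_criterion(y, P):
        # Leave-one-out cross validation for a least squares fit of y on P:
        # CV = sum_i [e_i / (1 - h_ii)]^2, with e the residuals and h_ii the
        # diagonal of the hat matrix P (P'P)^{-1} P'.
        beta, *_ = np.linalg.lstsq(P, y, rcond=None)
        e = y - P @ beta
        h_ii = np.einsum('ij,ji->i', P, np.linalg.pinv(P))
        return float(np.sum((e / (1.0 - h_ii))**2))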
The CV values for a subset of the second step specifications examined are reported in Table 2. Although we also experimented with splines in each step, we found that the specifications involving polynomial approximations appeared to dominate. The preferred specification is a fourth order polynomial in hours while accounting for the endogeneity through a third order polynomial in the residuals; the CV criterion for this specification is 182.643. The third column of Table 1 presents the hours coefficients while excluding the residuals, and the fourth column reports the estimates of the hours profile employing these approximations. Due to the over fitting requirement we now take as a base case the specification with a fifth order polynomial in hours and a fourth order polynomial in the residual. We will also check the sensitivity of some results to this choice.
While Table 1 suggests the non-linearity and endogeneity are statistically important,
it is useful to examine their respective roles in determining the hours profile. For
comparison purposes Figure 1 plots the change in the log of the hourly wage rate as
annual hours increase, where the nonparametric specification is the over fitted case
with h = 5 and r = 4. To facilitate comparisons each function has been centered at
its mean over the observations, and the graph shows the deviation of the function
from its mean. As annual hours affect the log of the hourly wage rate in an additive
manner we simply plot the impact of hours on the log of the hourly wage rate. In
Figure 1 we plot the quadratic 2SLS estimates like those of Biddle and Zarkin (1989),
and the profile is very similar to theirs. In Figure 1 we also plot the predicted
relationship between hours and wages for these data with and without the adjustment
for endogeneity using our base specification. It indicates the relationship is highly
non-linear. Furthermore, failing to adjust for the endogeneity leads to incorrect
inferences regarding the overall impact of hours on wages and, more specifically, the
value of the turning points.

Although we found that the polynomial approximations appeared to best fit the data,
we also identified the best fitting spline approximations. We employ cubic splines
and the parameters we allow to be chosen by the data are the number of join points.
We do this for the first and second steps and we over fit the data in this instance
by including an additional join point, in each step, above what is determined on the
basis of the CV values. Note that the best fitting approximation involved 1 join
point in the reduced form and a primary specification of 1 join point in the hours
function and 1 join point for the residuals. In Figure 2 we plot the predicted
relationship between hours and the wage rate from the over fitted spline
approximation. This predicted relationship looks remarkably similar to that in Figure
1. Furthermore, the points noted above related to the failure to account for the
endogeneity are similar.
Figures 1 and 2 indicate that at high numbers of hours the implied decrease in the
wage rate is sufficient to decrease total earnings. This may be due to the small
number of observations in this area of the profile and the associated large standard
errors. Figures 3 and 4 explore this issue by plotting the 95 percent confidence
intervals for our adjusted nonparametric estimates of the wage hours profile. They
confirm that the estimated standard errors are large at the upper end of the hours
range.
To quantify an average measure of the effect of hours on wages we consider the
weighted average derivative of the estimated g function over the range where the
function is shown to be upward sloping in Figures 1 and 2, the region from 1300 to
2500 hours. This part of the hours profile represents 81 percent of the total sample.
There were only 125 observations above 2500, so that not very many were excluded by
ignoring the upper tail. We estimated the average derivative for the 1300-2500 region
from the polynomial approximation, based on h = 5 and r = 4; the derivative is
.000513 with a standard error of .000155. The corresponding estimate of the weighted
average derivative from our spline approximation is .000505 with a standard error of
.000147. These estimates are quite different than the linear 2SLS estimate reported
in Table 1, which is consistent with the presence of nonlinearity. Furthermore, for
the quadratic 2SLS in Figure 1 the average derivative is .000350 with a standard
error of .000142, which is also quite different than the average derivative for the
nonparametric estimators. This difference is present despite the relatively wide
pointwise confidence bands in Figures 3 and 4. Thus, in our example the average wage
change for most of the individuals in the sample is found to be substantially larger
for our nonparametric approach than by linear or quadratic 2SLS.
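A sketch of this calculation, assuming b holds the fitted coefficients on the powers
of hours (for h, h^2, ..., h^q) from the second step; the function name, defaults,
and the use of the empirical distribution of hours as the weighting are illustrative:

    import numpy as np

    def average_derivative(b, hours, lo=1300.0, hi=2500.0):
        """Average d g_hat(h)/dh over observations with lo <= h <= hi,
        where g_hat(h) = b[0]*h + b[1]*h**2 + ... + b[q-1]*h**q."""
        h = hours[(hours >= lo) & (hours <= hi)]
        # Derivative of the fitted polynomial at each retained observation.
        deriv = sum((j + 1) * b[j] * h ** j for j in range(len(b)))
        return deriv.mean()

Because this quantity is linear in the estimated coefficients given the sample of
hours, a standard error can be formed directly from their estimated covariance
matrix.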
Before proceeding we examined the robustness of these estimates of the average
derivative to the specification of the reduced form and alternative approximations of
the wage hours profile. First, we reduced the number of approximating terms in the
reduced form. While we found that the CV criteria for each step increased with the
exclusion of these terms in the reduced form, we found that there was virtually no
impact on the total profile and a very minor effect on our estimate of the average
derivative of the profile for the region discussed above. For example, consider the
estimates from the polynomial approximation. When we included only linear terms in
the reduced form the estimated average derivative is .000546 with a standard error of
.0000175. The corresponding estimate for the specification including quadratic terms
is .000522 with a standard error of .000174. Finally, the addition of cubic terms
resulted in an estimate of .000515 with a standard error of .000153. Changes in the
specification of the model based on the spline approximations produced similarly
small differences.
We also explored the sensitivity of the average derivative estimate to the number of
terms in the series approximation. Both the average derivative and its standard error
were relatively unaffected by increasing the number of terms in the approximations.
For example, with an 8th order polynomial in hours and 7th order polynomial in the
residual the average derivative estimate is .000519 with a standard error of .000156.

Finally we examine whether we are able to reject the additive structure implied by
our model. We do this by including an interaction term capturing the product of hours
and the residuals in our preferred specification. The t-statistic on this included
variable was .231. The CV value for this specification is 182.924 while that for our
preferred specification plus this interaction term and the term capturing the product
of hours squared and the residual squared is 183.188. On the basis of these results
we conclude there is no evidence against the additive structure of the model.
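This check amounts to adding interaction columns to the preferred design matrix and
comparing CV values; a minimal sketch using the cv_criterion function sketched above
(X_pref, hours, u and y are assumed to be the preferred design matrix, hours, first
step residuals and log wages; all names are ours):

    import numpy as np

    # Add the h*r interaction, then additionally the h^2 * r^2 term.
    X_int = np.column_stack([X_pref, hours * u])
    X_both = np.column_stack([X_int, (hours ** 2) * (u ** 2)])
    print(cv_criterion(X_pref, y), cv_criterion(X_int, y),
          cv_criterion(X_both, y))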
In addition to this empirical evidence we provide simulation evidence featuring the
data examined above. The objective of this exercise is to explore the ability of our
procedure to accurately estimate the unknown function in the empirical setting
examined above. We simulate the endogenous variable through the exogenous variables
and parameter estimates from each of the polynomial and spline approximations
reported above. We generate the hours variable by simulating the reduced form and
incorporating a random component drawn from the reduced form empirical residual
vector. From this reduced form we estimate the residuals, and from the simulated
hours vector we generate the higher order terms for hours. We then simulate wages by
employing the simulated values of hours and residuals, along with the true exogenous
variables, and the parameter estimates discussed above. The random component in the
wage equation is drawn from the distribution of the empirical residuals. As we employ
the parameter vectors from the over fitted models in the empirical example, the
polynomial model has a fifth order polynomial in the reduced form while the wage
equation has a 5th order polynomial in hours and a fourth order polynomial in the
residuals. The spline approximation employs a cubic spline with 2 join points in the
reduced form while the wage equation has 2 join points for hours and 2 join points
for the residuals. We examine the performance of our procedure for 3000 replications
of the model.

In order to relax the parametric assumptions of the model we use the CV criterion in
each step of the estimation. That is, for the model which employs the polynomial
approximation we first choose the number of terms in the reduced form. Then on the
basis of the residuals from the over fit reduced form we choose the number of
approximating terms in the wage equation. For the model generated by the spline
approximation we do the same except we employ a cubic spline and choose the number of
join points. For both approximations we trim the bottom and top two and a half
percent of the observations, on the basis of the hours residuals, from the data for
which we estimate the wage equation.
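One replication of this design can be sketched as follows, assuming the estimated
reduced form coefficients gamma_hat, wage equation coefficients delta_hat (which must
conform to the columns built below), and the empirical residual vectors u_hat and
e_hat from the over fitted models; all names and defaults here are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def one_replication(Zc, Z1, gamma_hat, delta_hat, u_hat, e_hat,
                        h_order=5, r_order=4):
        """Generate one Monte Carlo sample from the over fitted model."""
        n = Zc.shape[0]
        # Simulated hours: fitted reduced form plus a draw from the
        # empirical reduced form residuals.
        x_sim = Zc @ gamma_hat + rng.choice(u_hat, size=n, replace=True)
        # Residuals from re-estimating the reduced form on simulated hours.
        g, *_ = np.linalg.lstsq(Zc, x_sim, rcond=None)
        u_sim = x_sim - Zc @ g
        # Simulated wages: over fitted wage equation plus a draw from the
        # empirical wage equation residuals.
        X2 = np.column_stack([np.ones(n), Z1]
                             + [x_sim ** j for j in range(1, h_order + 1)]
                             + [u_sim ** j for j in range(1, r_order + 1)])
        y_sim = X2 @ delta_hat + rng.choice(e_hat, size=n, replace=True)
        return y_sim, x_sim

CV selection, trimming and estimation then proceed on each simulated sample exactly
as described in the text.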
An important aspect of our procedure is the ability of the CV method to identify the
correct specification. To examine this we computed the CV criteria for all possible
specifications. For the polynomial model we examined specifications of the wage
equation combining up to a fifth order polynomial in hours and a fifth order
polynomial in residuals, which produced 25 possibilities; as there were also 5 CV
choices in the reduced form, this generated 125 possible specifications. For the
spline model we examined up to 3 join points in the reduced form and then 3 each for
hours and the residuals in the wage equation, which produced 27 possible
specifications. From the Monte Carlo results we can also evaluate the performance of
an average derivative estimator like that applied to the actual data. Once again we
over fit the data by choosing the number of terms that minimizes the CV criterion,
plus one additional term, and we computed standard errors for the same specification.

First consider the results from the polynomial model. The mean of the average
derivative estimates, across the Monte Carlo replications, was .000456, which
represents a bias of 8.9 percent. The standard error across the replications was
.000153, while the average of the estimated standard errors was .000152. Therefore
the estimated standard errors accurately reflect the variability of the estimator in
this experiment. For the spline approximation the average estimate was .000442, which
represents a bias of 11.1 percent. The standard error across the replications was
.000145, while the average of the estimated standard errors was .000144.

We also computed rejection frequencies for tests that the average derivative was
equal to its true value. For the polynomial model the rejection rates at the nominal
5, 10 and 20 percent significance levels were 6.5, 11.8 and 22.5 percent
respectively. The corresponding figures for the spline model were 7.4, 13.1 and 24.0
percent. Thus, asymptotic inference procedures turned out to be quite accurate in
this experiment, lending some credence to our inference in the empirical example.
Appendix

We will first prove one of the results on identification.

Proof of Theorem 2.3: We will use Theorem 2.1. Consider differentiable functions
$\delta(x,z_1)$ and $\gamma(u)$, and suppose that
$\delta(x,z_1) + \gamma(u) = \delta(\Pi(z)+u,z_1) + \gamma(u) = 0$ identically in $z$
and $u$. Differentiating then gives, identically in $z$ and $u$,

$0 = \Pi_z(z)'\delta_x(x,z_1) + [\delta_{z_1}(x,z_1)', 0']'$,
$\quad 0 = \delta_x(x,z_1) + \gamma_u(u)$.

If $\Pi_z(z)$ is full rank, it follows from the first equation that
$\delta_x(x,z_1) = 0$ and $\delta_{z_1}(x,z_1) = 0$, and it then follows from the last
equation that $\gamma_u(u) = 0$. Now, to show identification, note that if
$\delta(x,z_1) + \gamma(u) = 0$ with probability one, then by continuity of the
functions and the boundary having zero probability, this equality holds identically
on the interior of the support of $(u,z)$. Then, as above, all the partial
derivatives of $\delta(x,z_1)$ and $\gamma(u)$ are zero with probability one,
implying they are constant. Identification then follows from Theorem 2.1. QED.

To avoid repetition, it is useful to prove consistency and asymptotic normality for a
general two-step weighted least squares estimator; throughout the appendix, $C$ will
denote a generic positive constant that may be different in different uses.
To state the lemmas some notation is needed. Let $X$ denote a vector of variables
that includes $x$ and $z$ among its components, let $\pi$ represent a possible value
of $\Pi(z)$, and let $w(X,\pi)$ be a vector of functions of $X$ and $\pi$. For
$w = w(X,\Pi_0(z))$, where $\Pi_0(z) = E[x|z]$, it will be assumed that

(A.1)    $E[y|X] = h_0(w)$.

Also, let $\psi = \psi(X)$ be a scalar, nonnegative function of $X$, let $\tau_i$ and
$\hat{\Pi}(z)$ be as in the body of the paper, and let
$\hat{w}_i = w(X_i,\hat{\Pi}(z_i))$, $\hat{p}_i = p^K(\hat{w}_i)$, and
$\hat{P} = [\hat{\tau}_1\psi_1\hat{p}_1,\ldots,\hat{\tau}_n\psi_n\hat{p}_n]'$. The
estimator is

(A.2)    $\hat{\beta} = (\hat{P}'\hat{P})^{-1}\hat{P}'y$,
$\quad \hat{h}(w) = p^K(w)'\hat{\beta}$.

Let $W = \{w : \tau(w) = 1\}$ and, for a nonnegative integer $d$, let
$|h|_d = \max_{|\mu|\leq d}\sup_{w\in W}|\partial^\mu h(w)|$.

Assumption A1: i) Each component $w_j(X,\pi)$ of $w(X,\pi)$ is Lipschitz in $\pi$,
$\psi(X)$ is bounded, and each $w_j(X,\pi)$ either does not depend on $\pi$ or is
continuously distributed with bounded density; ii) there are nonsingular matrices $B$
and $\bar{L}$ such that, for $\tilde{P}^K(w) = Bp^K(w)$ and
$\tilde{R}^L(z) = \bar{L}R^L(z)$, the matrices
$E[\tau(w)\psi(X)^2\tilde{P}^K(w)\tilde{P}^K(w)']$ and
$E[\tilde{R}^L(z)\tilde{R}^L(z)']$ have smallest eigenvalues that are bounded away
from zero, uniformly in $K$ and $L$; iii) for each nonnegative integer $d$ there are
$\zeta_d(K)$ and $\zeta(L)$ such that
$\max_{|\mu|\leq d}\sup_W\|\partial^\mu\tilde{P}^K(w)\| \leq \zeta_d(K)$ and
$\sup_Z\|\tilde{R}^L(z)\| \leq \zeta(L)$; iv) there exist $\alpha > 0$,
$\alpha_1 > 0$, $\beta_K$, and $\gamma_L$ such that
$|h_0 - p^{K\prime}\beta_K|_d \leq CK^{-\alpha}$ and
$\sup_Z\|\Pi_0(z) - \gamma_L'R^L(z)\| \leq CL^{-\alpha_1}$.
Lemma A1: If Assumptions 1 and A1 are satisfied for $d = 0$ and

$[K^{1/2}\zeta_0(K) + \zeta_0(K)\zeta(L)][(L/n)^{1/2} + L^{-\alpha_1}] \to 0$,

then

$\int\tau(w)\psi(X)^2[\hat{h}(w) - h_0(w)]^2 dF_0(X)
= O_p(K/n + K^{-2\alpha} + L/n + L^{-2\alpha_1})$.

If Assumptions 1 and A1 are satisfied for some nonnegative integer $d$, then

$|\hat{h} - h_0|_d
= O_p(\zeta_d(K)[(K/n)^{1/2} + K^{-\alpha} + (L/n)^{1/2} + L^{-\alpha_1}])$.
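To illustrate the side condition, suppose the spline bounds
$\zeta_0(K) \leq CK^{1/2}$ and $\zeta(L) \leq CL^{1/2}$ from the proof of Lemma 4.1
below hold (a worked instance we add for illustration). Then

\[
[K^{1/2}\zeta_0(K) + \zeta_0(K)\zeta(L)]\left[(L/n)^{1/2} + L^{-\alpha_1}\right]
\leq C\,[K + (KL)^{1/2}]\left[(L/n)^{1/2} + L^{-\alpha_1}\right],
\]

so that with $K = L$ the condition reduces to
$(K^3/n)^{1/2} + K^{1-\alpha_1} \to 0$, i.e. it suffices that $K^3/n \to 0$ and
$\alpha_1 > 1$.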
Proof of Lemma A1: Note that the value of $\hat{h}$ is unchanged if $Bp^K(w)$ and/or
$\bar{L}R^L(z)$ is used in place of $p^K(w)$ and $R^L(z)$, so that it can be assumed
that $\tilde{P}^K(w) = p^K(w)$ and $\tilde{R}^L(z) = R^L(z)$. Also, since a
nonsingular constant linear transformation of $p^K(w)$ leaves $\hat{h}$ unchanged, it
suffices to prove the results with $Q = E[\tau_i\psi_i^2 p_i p_i'] = I$ and
$Q_1 = E[R^L(z_i)R^L(z_i)'] = I$, where $\tau_i = \tau(w_i)$ and $p_i = p^K(w_i)$.
Let $\hat{\Pi}_i = \hat{\Pi}(z_i)$, $\Pi_i = \Pi_0(z_i)$, and
$\Lambda_\pi = (L/n)^{1/2} + L^{-\alpha_1}$. By Theorem 1 of Newey (1997) and
Assumption A1 iv),

(A.3)    $\sum_{i=1}^{n}\|\hat{\Pi}_i - \Pi_i\|^2/n = O_p(\Lambda_\pi^2)$,
$\quad \max_{i\leq n}\|\hat{\Pi}_i - \Pi_i\| = O_p(\zeta(L)\Lambda_\pi)$.

Since each component of $w(X,\pi)$ is Lipschitz in $\pi$,
$\|\hat{w}_i - w_i\| \leq C\|\hat{\Pi}_i - \Pi_i\|$, and by a mean value expansion and
Assumption A1 iii),

(A.4)    $\|\hat{p}_i - p_i\| \leq \zeta_1(K)\|\hat{w}_i - w_i\|
\leq C\zeta_1(K)\|\hat{\Pi}_i - \Pi_i\|$,

so that $\sum_{i=1}^{n}\|\hat{p}_i - p_i\|^2/n = O_p(\zeta_1(K)^2\Lambda_\pi^2)$. Let
$\hat{Q} = \hat{P}'\hat{P}/n$ and $\bar{Q} = \sum_{i=1}^{n}\tau_i\psi_i^2 p_ip_i'/n$.
By the triangle and Cauchy-Schwartz inequalities, $\psi_i$ bounded, eq. (A.4), and
$E[\|\bar{Q} - I\|^2] \leq C\zeta_0(K)^2K/n$,

(A.5)    $\|\hat{Q} - I\| \leq \|\hat{Q} - \bar{Q}\| + \|\bar{Q} - I\|
= O_p(\zeta_0(K)\zeta_1(K)\Lambda_\pi + \zeta_0(K)\zeta(L)\Lambda_\pi
+ \zeta_0(K)(K/n)^{1/2}) \xrightarrow{p} 0$,

so that the smallest eigenvalue of $\hat{Q}$ is bounded away from zero with
probability approaching one. Next, let $\eta_i = \psi_i[y_i - h_0(w_i)]$ and
$\eta = (\eta_1,\ldots,\eta_n)'$. By $E[\eta_i|X] = 0$, $Var(y|X)$ bounded, and
independence of the observations,

(A.6)    $E[\|P'\eta/n\|^2|X] \leq C\,tr(\bar{Q})/n$, so that
$\|P'\eta/n\|^2 = O_p(K/n)$,

and, using eq. (A.4) and the triangle inequality,

(A.7)    $\|\hat{P}'\eta/n\|^2 \leq C[\|(\hat{P}-P)'\eta/n\|^2 + \|P'\eta/n\|^2]
= O_p(K/n)$.

Now let $\beta_K$ be as in Assumption A1 iv) with $d = 0$, let
$h_i = \tau_i\psi_i h_0(w_i)$, $h = (h_1,\ldots,h_n)'$, and
$M = \hat{P}(\hat{P}'\hat{P})^{-1}\hat{P}'$. Since $M$ and $M - I$ are idempotent,
and, by a second order expansion together with eqs. (A.3) and (A.4),
$\sum_{i=1}^{n}\tau_i\psi_i^2[h_0(\hat{w}_i) - h_0(w_i)]^2/n = O_p(\Lambda_\pi^2)$,
it follows from the smallest eigenvalue of $\hat{Q}$ being bounded away from zero
with probability approaching one that

$\|\hat{\beta} - \beta_K\|^2 \leq C(\hat{\beta} - \beta_K)'\hat{Q}(\hat{\beta}
- \beta_K) = O_p(K/n + K^{-2\alpha} + \Lambda_\pi^2)$.

Then by the triangle inequality,

$\{\int\tau(w)\psi(X)^2[\hat{h}(w) - h_0(w)]^2 dF_0(X)\}^{1/2}
\leq \{\int\tau(w)\psi(X)^2[p^K(w)'(\hat{\beta} - \beta_K)]^2 dF_0(X)\}^{1/2}
+ \{\int\tau(w)\psi(X)^2[p^K(w)'\beta_K - h_0(w)]^2 dF_0(X)\}^{1/2}
\leq \|\hat{\beta} - \beta_K\| + O(K^{-\alpha})$,

giving the first conclusion. Also, by the triangle inequality,

$|\hat{h} - h_0|_d \leq |p^{K\prime}\beta_K - h_0|_d
+ |p^K(w)'(\hat{\beta} - \beta_K)|_d
\leq O(K^{-\alpha}) + \zeta_d(K)\|\hat{\beta} - \beta_K\|
= O_p(\zeta_d(K)[(K/n)^{1/2} + K^{-\alpha} + \Lambda_\pi])$,

giving the second conclusion. QED.
To state Lemma A2, some additional notation is needed. Let $p_i = p^K(w_i)$,
$\hat{p}_i = p^K(\hat{w}_i)$, $r_i = R^L(z_i)$, and let $\hat{\Sigma}_1$ and
$\Sigma_1$ be as given in the body of the paper. Also, let

$\hat{Q} = \hat{P}'\hat{P}/n$,
$\quad \hat{\Sigma} = \sum_{i=1}^{n}\tau_i\psi_i^2\hat{p}_i\hat{p}_i'
[y_i - \hat{h}(\hat{w}_i)]^2/n$,
$\quad \Sigma = E[\tau_i\psi_i^2 p_ip_i'Var(y_i|X_i)]$,

$\hat{H} = \sum_{i=1}^{n}\tau_i\psi_i\hat{p}_i\{[\partial\hat{h}(\hat{w}_i)/\partial w]'
\partial w(X_i,\hat{\Pi}_i)/\partial\pi \otimes r_i'\}/n$,

$H = E[\tau_i\psi_i p_i\{[\partial h_0(w_i)/\partial w]'
\partial w(X_i,\Pi_i)/\partial\pi \otimes r_i'\}]$,

$\hat{V} = A\hat{Q}^{-1}(\hat{\Sigma}
+ \hat{H}\hat{Q}_1^{-1}\hat{\Sigma}_1\hat{Q}_1^{-1}\hat{H}')\hat{Q}^{-1}A'$,
$\quad V = AQ^{-1}(\Sigma + HQ_1^{-1}\Sigma_1 Q_1^{-1}H')Q^{-1}A'$.

For notational convenience, subscripts for $K$ and $L$ are suppressed.

Assumption A2: i) $h_0(w)$ and $w(X,\pi)$ are twice continuously differentiable in
$w$ and $\pi$ respectively, and the first and second derivatives are bounded; ii)
$\sigma^2(X) = Var(y|X)$ is bounded; iii) either a) there exists $\nu(w)$ such that
$a(h) = E[\tau(w)\psi(X)^2\nu(w)h(w)]$ and, for
$A = E[\tau(w)\psi(X)^2\nu(w)p^K(w)']$,
$E[\tau(w)\psi(X)^2\|\nu(w) - Ap^K(w)\|^2] \to 0$ as $K \to \infty$; or b) $a(h)$ is
a scalar with $\|a(h)\| \leq C|h|_d$, there are $h_K(w) = p^K(w)'\beta_K$ with
$E[h_K(w)^2]$ bounded and $a(h_K)$ bounded away from zero, and $V$ is bounded away
from zero. Under iii) a), let
$d(X) = [\partial h_0(w)/\partial w]'\partial w(X,\Pi_0(z))/\partial\pi$, let
$\rho(z)$ be the limit of the matrix of projections of the elements of
$\tau(w)\psi(X)\nu(w)d(X)$ on $R^L(z)$, and let

$V = E[\tau(w)\psi(X)^2\nu(w)\nu(w)'Var(y|X)] + E[\rho(z)Var(x|z)\rho(z)']$.
Lemma A2: If Assumptions A1 and A2 are satisfied, $\zeta_0(K)$ and $\zeta(L)$ are
bounded away from zero as $K$ and $L$ grow, $\sqrt{n}K^{-\alpha} \to 0$,
$\sqrt{n}L^{-\alpha_1} \to 0$, and each of the following go to zero as $n \to \infty$:

$L^2\zeta_0(K)^2\zeta(L)^2/n$, $\quad KL\zeta_0(K)^2\zeta(L)^2/n$,
$\quad \zeta_0(K)^4K^2/n$, $\quad (K^2L + L^2)\zeta_1(K)^2/n$,
$\quad \zeta_0(K)^2\zeta_1(K)^2(LK + L^2)/n$,

then $\sqrt{n}V^{-1/2}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,I)$ and
$\sqrt{n}\hat{V}^{-1/2}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,I)$.
Furthermore, if Assumption A2 iii) a) is satisfied, then
$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,V)$ and
$\hat{V} \xrightarrow{p} V$.
Proof of Lemma A2: First, let

(A.8)    $\Lambda_Q = [K^{1/2}\zeta_1(K) + \zeta_0(K)\zeta(L)]\Lambda_\pi
+ \zeta_0(K)K^{1/2}/\sqrt{n}$,
$\quad \Lambda_{Q_1} = \zeta(L)L^{1/2}/\sqrt{n}$,
$\quad \Lambda = K^{1/2}/\sqrt{n} + K^{-\alpha} + \Lambda_\pi$.

Note that by $\zeta_0(K)$ and $\zeta(L)$ bounded away from zero,
$\sqrt{n}L^{-\alpha_1} \to 0$, and the rate conditions,

$K^{1/2}\Lambda_Q = O(\{K^2L\zeta_1(K)^2 + KL\zeta_0(K)^2\zeta(L)^2
+ K^2\zeta_0(K)^4\}^{1/2}/\sqrt{n}) \to 0$,

and similarly $L^{1/2}\Lambda_{Q_1} \to 0$, $\zeta_0(K)\Lambda \to 0$, and
$\sqrt{n}\zeta_0(K)\Lambda_\pi^2 \to 0$. For simplicity the remainder of the proof
will be given for the scalar case; also, as in the proof of Lemma A1, we can take $Q$
and $Q_1$ equal to identity matrices. Let $F$ be a symmetric square root of
$V^{-1}$.

Suppose first that Assumption A2 iii) a) is satisfied, and let
$\nu_K(w) = Ap^K(w)$ and
$b_{KL}(z) = E[\tau(w)\psi(X)d(X)\nu_K(w)R^L(z)']R^L(z)$. Since the mean square error
(MSE) of a least squares projection is no greater than the MSE of the random variable
being projected, $E[\tau(w)\psi(X)^2\|\nu_K(w) - \nu(w)\|^2] \to 0$ as
$K \to \infty$ and $E[\|b_{KL}(z) - \rho(z)\|^2] \to 0$ as $K, L \to \infty$. Then by
boundedness of $\sigma^2(X) = Var(y|X)$ and $Var(x|z)$, and by the fact that MSE
convergence implies convergence of second moments,

$V = \lim\{E[\tau(w)\psi(X)^2\nu_K(w)\nu_K(w)'\sigma^2(X)]
+ E[b_{KL}(z)Var(x|z)b_{KL}(z)']\}$,

so that $F$ and $FA$ are bounded. If instead Assumption A2 iii) b) is satisfied, then
by the Cauchy-Schwartz inequality
$|a(h_K)| = |A\beta_K| \leq \|A\|\|\beta_K\| = \|A\|(E[h_K(w)^2])^{1/2}$, while
$V \geq A\Sigma A' \geq C\|A\|^2$, so that $F$ and $FA$ are again bounded.

Next, by the proof of Lemma A1, $\|\hat{Q} - I\| = O_p(\Lambda_Q) = o_p(1)$ and
$\|\hat{Q}_1 - I\| = O_p(\Lambda_{Q_1}) = o_p(1)$, implying that the smallest
eigenvalues of $\hat{Q}$ and $\hat{Q}_1$ are bounded away from zero with probability
approaching one and that
$\|FA\hat{Q}^{-1/2}\|^2 = tr(FA\hat{Q}^{-1}A'F') \leq C\,O_p(1)\|FA\|^2 = O_p(1)$.
Also, $\|\hat{H} - H\| = o_p(1)$, and $HH' \leq CI$, since $HH'$ is bounded by the
population matrix mean-square of the regression of $\tau_i\psi_i p_id_i$ on $r_i$,
for $d_i = d(X_i)$.

Let $\eta_i = \psi_i[y_i - h_0(w_i)]$, $u_i = x_i - \Pi_i$,
$U = (u_1,\ldots,u_n)'$, and $R = [R^L(z_1),\ldots,R^L(z_n)]'$. By
$\sup_W|p^K(w)'\beta_K - h_0(w)| = O(K^{-\alpha})$ with
$\sqrt{n}K^{-\alpha} \to 0$, a second order mean-value expansion of $h_0(\hat{w}_i)$
around $w_i$, $\sup_Z\|\Pi_0(z) - \gamma_L'R^L(z)\| = O(L^{-\alpha_1})$ with
$\sqrt{n}L^{-\alpha_1} \to 0$, the bounds
$E[\|R'U/\sqrt{n}\|^2] = tr(\Sigma_1) \leq CL$ and
$E[\|HR'U/\sqrt{n}\|^2] = tr(H\Sigma_1H') \leq CK$ (so that, by the Markov
inequality, $\|R'U/\sqrt{n}\| = O_p(L^{1/2})$ and
$\|HR'U/\sqrt{n}\| = O_p(K^{1/2})$), and the results of the previous paragraph, it
follows as in the proof of Lemma A1 that

(A.9)    $\sqrt{n}[a(\hat{h}) - a(h_0)] = FA(P'\eta/\sqrt{n} + HR'U/\sqrt{n})
+ o_p(1)$.

Next, for any vector $\phi$ with $\|\phi\| = 1$, let
$Z_{in} = \phi'FA[\tau_i\psi_i p_i\eta_i + Hr_iu_i]/\sqrt{n}$. The $Z_{in}$ are
i.i.d. across $i$ for a given $n$, $E[Z_{in}] = 0$, and
$Var(\sum_{i=1}^{n}Z_{in}) = \phi'FVF'\phi = 1$. Furthermore, for any
$\varepsilon > 0$,

$nE[1(|Z_{in}| > \varepsilon)Z_{in}^2] \leq n\varepsilon^{-2}E[Z_{in}^4]
\leq Cn^{-1}\varepsilon^{-2}\{\zeta_0(K)^2E[\|\tau_i\psi_ip_i\|^2]
+ \zeta(L)^2E[\|r_i\|^2]\}
= O([\zeta_0(K)^2K + \zeta(L)^2L]/n) = o(1)$.

Therefore $\sqrt{n}V^{-1/2}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,I)$ follows
by the Lindeberg-Feller theorem and equation (A.9). For the variance estimator,
arguments similar to those above, together with
$\max_{i\leq n}|\hat{h}(\hat{w}_i) - h_0(w_i)| = o_p(1)$ (which follows from Lemma A1
and $\zeta_0(K)\Lambda \to 0$), give
$\|FA\hat{Q}^{-1}(\hat{\Sigma} - \Sigma)\hat{Q}^{-1}A'F'\| = o_p(1)$ and
$\|FA\hat{Q}^{-1}\hat{H}\hat{Q}_1^{-1}(\hat{\Sigma}_1
- \Sigma_1)\hat{Q}_1^{-1}\hat{H}'\hat{Q}^{-1}A'F'\| = o_p(1)$, so that by the
triangle inequality $F\hat{V}F' \xrightarrow{p} I$. Taking a square root then gives
$\sqrt{n}\hat{V}^{-1/2}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,I)$. Under
Assumption A2 iii) a), $V$ is bounded, so
$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,V)$ and
$\hat{V} \xrightarrow{p} V$ follow from the above results. QED.
Lemma A3: If $v_i$ is i.i.d. and continuously distributed with bounded density and
$\max_{i\leq n}|\hat{v}_i - v_i| \xrightarrow{p} 0$, then for any constants
$a \leq b$,
$\sum_{i=1}^{n}|1(a < \hat{v}_i \leq b) - 1(a < v_i \leq b)|/n \xrightarrow{p} 0$.

Proof: By the density of $v_i$ bounded, for any positive sequence $\Delta_n$ with
$\Delta_n \to 0$,
$Prob(|v_i - a| \leq \Delta_n) + Prob(|v_i - b| \leq \Delta_n) \leq C\Delta_n$, so by
the Markov inequality,

$\sum_{i=1}^{n}[1(|v_i - a| \leq \Delta_n) + 1(|v_i - b| \leq \Delta_n)]/n
= O_p(\Delta_n)$.

We also use the well known result that for positive sequences,
$\varepsilon_n \xrightarrow{p} 0$ if and only if $\varepsilon_n = O_p(\delta_n)$ for
some positive sequence $\delta_n \to 0$, i.e. for $\delta_n$ going to zero slowly
enough. Choose such a $\delta_n$ for
$\varepsilon_n = \max_{i\leq n}|\hat{v}_i - v_i|$, so that with probability
approaching one (w.p.a.1) as $n \to \infty$,
$\max_{i\leq n}|\hat{v}_i - v_i| \leq \delta_n$. Then w.p.a.1, for all $i$,
$|1(a < \hat{v}_i \leq b) - 1(a < v_i \leq b)|
\leq 1(|v_i - a| \leq \delta_n) + 1(|v_i - b| \leq \delta_n)$, so that

$\sum_{i=1}^{n}|1(a < \hat{v}_i \leq b) - 1(a < v_i \leq b)|/n
\leq \sum_{i=1}^{n}[1(|v_i - a| \leq \delta_n) + 1(|v_i - b| \leq \delta_n)]/n
= O_p(\delta_n) = o_p(1)$. QED.
Proof of Lemma 4.1: Because series estimators are invariant to nonsingular linear
transformations of the approximating functions, a location and scale shift allows us
to assume that $W$ and $Z$ are unit cubes. Also, in the polynomial case, the
component powers can be replaced by polynomials of the same order that are
orthonormal with respect to the uniform distribution on $[0,1]$, and in the spline
case by a B-spline basis. In both cases the resulting vector of functions is a
nonsingular linear combination of the original vector, because of the assumption that
the order of the multi-index is increasing. Also, assume that the same operation is
carried out on the first step approximating functions $R^L(z)$. For any nonnegative
integer $j$, let
$\zeta_j(K) = \max_{|\mu|\leq j}\sup_W\|\partial^\mu p^K(w)\|$ and
$\zeta_j(L) = \max_{|\mu|\leq j}\sup_Z\|\partial^\mu R^L(z)\|$. Then it follows from
Newey (1997) that in the polynomial case $\zeta_j(K) \leq CK^{1+2j}$ and
$\zeta(L) \leq CL$, and in the spline case $\zeta_j(K) \leq CK^{1/2+j}$ and
$\zeta(L) \leq CL^{1/2}$. It follows from these inequalities and Assumption 4 that,
for $\Lambda_\pi = (L/n)^{1/2} + L^{-\alpha_1}$,
$[\zeta_0(K)\zeta_1(K) + \zeta_0(K)^2\zeta_0(L)]\Lambda_\pi \to 0$, so the conclusion
follows by Lemma A1. QED.
Proof of Theorem 4.2: Without changing the notation, for $w = (w_1,u)$ let $F_0$
denote the distribution of $w$ and the corresponding marginals $F_0(w_1)$ and
$F_0(u)$, and let
$\tilde{g}(w_1) = \hat{g}(w_1) - \int\hat{g}(w_1)dF_0(w_1)$,
$\tilde{g}_0(w_1) = g_0(w_1) - \int g_0(w_1)dF_0(w_1)$,
$\tilde{\lambda}(u) = \hat{\lambda}(u) - \int\hat{\lambda}(u)dF_0(u)$,
$\tilde{\lambda}_0(u) = \lambda_0(u) - \int\lambda_0(u)dF_0(u)$, and
$\Delta = \int\{\hat{h}(w) - h_0(w)\}dF_0(w)$. By the density of $w$ being bounded
and bounded away from zero, so is the conditional density on $W$, and the
corresponding marginals. Therefore,

$\int[\hat{h}(w) - h_0(w)]^2dF_0(w)
\geq C\int[\hat{g}(w_1) + \hat{\lambda}(u) - g_0(w_1)
- \lambda_0(u)]^2dF_0(w_1)dF_0(u)$
$= C\int[\tilde{g}(w_1) - \tilde{g}_0(w_1) + \tilde{\lambda}(u)
- \tilde{\lambda}_0(u) + \Delta]^2dF_0(w_1)dF_0(u)$
$= C\int[\tilde{g}(w_1) - \tilde{g}_0(w_1)]^2dF_0(w_1)
+ C\int[\tilde{\lambda}(u) - \tilde{\lambda}_0(u)]^2dF_0(u) + C\Delta^2$
$\geq C\max\{\int[\tilde{g}(w_1) - \tilde{g}_0(w_1)]^2dF_0(w_1),
\int[\tilde{\lambda}(u) - \tilde{\lambda}_0(u)]^2dF_0(u), \Delta^2\}$,

where the product terms drop out by the construction of $\tilde{g}(w_1)$,
$\tilde{\lambda}(u)$, etc., e.g.

$\int[\tilde{g}(w_1) - \tilde{g}_0(w_1)][\tilde{\lambda}(u)
- \tilde{\lambda}_0(u)]dF_0(w_1)dF_0(u)
= \int[\tilde{g}(w_1) - \tilde{g}_0(w_1)]dF_0(w_1)\cdot
\int[\tilde{\lambda}(u) - \tilde{\lambda}_0(u)]dF_0(u) = 0$. QED.
Proof of Theorem 4.3: Follows from Lemma 4.1 by fixing the value of $u$ at any point
in $W$. QED.
Proof of Theorem 5.1: For power series, using the inequalities in the proof of
Theorem 4.1, it follows that

$\zeta_0(K)^2K^2 \leq CK^4$, $\quad L^2\zeta_0(K)^2\zeta(L)^2 \leq CK^2L^4$,
$\quad KL\zeta_0(K)^2\zeta(L)^2 \leq CK^3L^3$,
$\quad (K^2L + L^2)\zeta_1(K)^2 \leq C(K^2L + L^2)K^6$,
$\quad \zeta_0(K)^2\zeta_1(K)^2(LK + L^2) \leq CK^8(KL + L^2)$,

and for splines that

$\zeta_0(K)^2K^2 \leq CK^3$, $\quad L^2\zeta_0(K)^2\zeta(L)^2 \leq CKL^3$,
$\quad KL\zeta_0(K)^2\zeta(L)^2 \leq CK^2L^2$,
$\quad (K^2L + L^2)\zeta_1(K)^2 \leq C(K^2L + L^2)K^3$,
$\quad \zeta_0(K)^2\zeta_1(K)^2(LK + L^2) \leq CK^4(KL + L^2)$.

It then follows from the rate conditions of Theorem 4.2 that the rate conditions of
Lemma A2 are satisfied. The conclusion then follows from Lemma A2, similarly to the
proof of Theorem 4.1. QED.
References

Agarwal, G.G. and Studden, W.J. (1980): "Asymptotic Integrated Mean Square Error
Using Least Squares and Bias Minimizing Splines," Annals of Statistics 8, 1307-1325.

Ahn, H. (1994): "Nonparametric Estimation of Conditional Choice Probabilities in a
Binary Choice Model Under Uncertainty," working paper, Virginia Polytechnic
Institute.

Andrews, D.W.K. (1991): "Asymptotic Normality of Series Estimators for Nonparametric
and Semiparametric Models," Econometrica 59, 307-345.

Andrews, D.W.K. and Whang, Y.J. (1990): "Additive Interactive Regression Models:
Circumvention of the Curse of Dimensionality," Econometric Theory 6, 466-479.

Barzel, Y. (1973): "The Determination of Daily Hours and Wages," Quarterly Journal of
Economics 87, 220-238.

Biddle, J. and Zarkin, G. (1989): "Choice among Wage-Hours Packages: An Empirical
Investigation of Male Labor Supply," Journal of Labor Economics 7, 415-437.

Breiman, L. and Friedman, J.H. (1985): "Estimating Optimal Transformations for
Multiple Regression and Correlation," Journal of the American Statistical
Association 80, 580-598.

Brown, B.W. and Walker, M.B. (1989): "The Random Utility Hypothesis and Inference in
Demand Systems," Econometrica 57, 815-829.

Cox, D.D. (1988): "Approximation of Least Squares Regression on Nested Subspaces,"
Annals of Statistics 16, 713-732.

Hastie, T. and Tibshirani, R. (1991): Generalized Additive Models, London: Chapman
and Hall.

Linton, O. and Nielsen, J.B. (1995): "A Kernel Method of Estimating Structured
Nonparametric Regression Based on Marginal Integration," Biometrika 82, 93-100.

Linton, O., Mammen, E. and Nielsen, J. (1997): "The Existence and Asymptotic
Properties of a Backfitting Projection Algorithm Under Weak Conditions," preprint.

Moffitt, R. (1984): "The Estimation of a Joint Wage-Hours Labor Supply Model,"
Journal of Labor Economics 2, 550-566.

Newey, W.K. (1984): "A Method of Moments Interpretation of Sequential Estimators,"
Economics Letters 14, 201-206.

Newey, W.K. (1994): "Kernel Estimation of Partial Means and a General Variance
Estimator," Econometric Theory 10, 233-253.

Newey, W.K. (1997): "Convergence Rates and Asymptotic Normality for Series
Estimators," Journal of Econometrics 79, 147-168.

Newey, W.K. and Powell, J.L. (1989): "Nonparametric Instrumental Variables
Estimation," working paper, MIT Department of Economics.

Oi, W. (1962): "Labor as a Quasi Fixed Factor," Journal of Political Economy 70,
538-555.

Powell, M.J.D. (1981): Approximation Theory and Methods, Cambridge: Cambridge
University Press.

Robinson, P. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica 56,
931-954.

Roehrig, C.S. (1988): "Conditions for Identification in Nonparametric and Parametric
Models," Econometrica 56, 433-447.

Stone, C.J. (1982): "Optimal Global Rates of Convergence for Nonparametric
Regression," Annals of Statistics 10, 1040-1053.

Stone, C.J. (1985): "Additive Regression and Other Nonparametric Models," Annals of
Statistics 13, 689-705.

Tjostheim, D. and Auestad, B. (1994): "Nonparametric Identification of Nonlinear Time
Series: Projections," Journal of the American Statistical Association 89, 1398-1409.

Vella, F. (1991): "A Simple Two-Step Estimator for Approximating Unknown Functional
Forms in Models with Endogenous Explanatory Variables," working paper, Australian
National University, Department of Economics.

Vella, F. (1993): "Nonwage Benefits in a Simultaneous Model of Wages and Hours: Labor
Supply Functions of Young Females," Journal of Labor Economics 11, 704-723.
Table 1: Estimates

Variable     OLS          2SLS         NP             NPIV
h            .000017      .000387      -.0094         -.01097
             (.5597)      (2.7447)     (2.5513)       (2.9170)
h^2                                    7.0151e-6      7.4577e-6
                                       (2.6363)       (2.7986)
h^3                                    -2.1707e-9     -2.0250e-9
                                       (2.6304)       (2.4640)
h^4                                    2.3692e-13     1.8929e-13
                                       (2.5537)       (2.0311)
r                         -.0005                      -.000384
                          (2.7643)                    (2.6732)
r^2                                                   -6.6464e-8
                                                      (.5211)
r^3                                                   3.4978e-10
                                                      (2.7516)

Notes: i) h denotes hours worked and r is the reduced form residual.
ii) The regressors in the wage equation included education dummies, union status,
tenure, full-time work experience, black dummy and regional variables.
iii) The variables included in Z which are excluded from X are marital status, health
status, presence of young children, rural dummy and non-labor income.
iv) Absolute values of t-statistics are reported in parentheses.
Table 2: Cross Validation Values

Specification      Cross Validation Value
h=0; r=0           185.567
h=1; r=1           183.158
h=2; r=2           183.738
h=3; r=3           182.899
h=4; r=4           182.882
h=5; r=5           183.128
[Figure 1: Estimate of Wage/Hours Profile from Polynomial Approximation. Horizontal
axis: Annual Hours Worked (1000-3800). Legend: Unadjusted Nonparametric; Two Stage
Least Squares.]
[Figure 2: Estimate of Wage/Hours Profile from Spline Approximation. Horizontal axis:
Annual Hours Worked (1000-3800). Legend: Unadjusted Nonparametric.]
[Figure 3: 95% Confidence Interval for Wage/Hours Profile from Polynomial
Approximation. Horizontal axis: Annual Hours Worked (1000-3800).]
[Figure 4: 95% Confidence Interval for Wage/Hours Profile from Spline Approximation.
Horizontal axis: Annual Hours Worked (1000-3800).]