SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS

JONG-MYUN MOON
Abstract. This paper studies transformation models $T_0(Y) = X'\beta_0 + \varepsilon$ with an unknown monotone transformation $T_0$. Our focus is on the identification and estimation of $\beta_0$, leaving the specification of $T_0$ and the distribution of $\varepsilon$ nonparametric. We identify $\beta_0$ under a new set of conditions; specifically, we demonstrate that identification may be achieved even when the regressor $X$ has bounded support and contains discrete random variables. Our identification is constructive and leads to a sieve extremum estimator. The empirical criterion of our estimator has a U-process structure and therefore does not conform to existing results in the sieve estimation literature. We derive the convergence rate of the estimator and demonstrate its asymptotic normality. For inference, the weighted bootstrap is proved to be consistent. The estimator is simple to implement with standard optimization algorithms. A simulation study provides insight into its finite-sample performance.
Date: October 27, 2014.
Affiliation and Contact Information: UCL and CeMMAP. Email: jong-myun.moon@ucl.ac.uk.
1. Introduction
Data transformation is often used in econometric analysis. For example, dependent variables are routinely log-transformed in linear regressions in order to mitigate nonlinearity and heteroskedasticity. This effective but arbitrary technique can be justified if transforming functions are included as model parameters and then estimated using data. The most prominent example of this approach is the influential Box-Cox transformation model (Box and Cox, 1964). Those authors suggested a parametric family of power functions, including the log-transformation, as candidate functions for data transformation. There are several variations of this approach, which involve different sets of transformation functions. However, if complex patterns are possible, then a nonparametric approach provides a useful alternative. This paper concerns the identification and estimation of regression models with a nonparametric transformation.
Regression models with a transformed dependent variable are called transformation models. Specifically, transformation models are represented by the equation

(1) $T_0(Y) = X'\beta_0 + \varepsilon$,

where $Y \in \mathbb{R}$ and $X \in \mathbb{R}^{d_x}$ are observed random variables, and $\varepsilon \in \mathbb{R}$ is an unobserved error term. In the model (1), there are three parameters: (i) the regressor coefficient $\beta_0$, (ii) the transformation $T_0$ and (iii) the error distribution. We consider the case when both $T_0$ and the error distribution are nonparametric. Horowitz (1996) and Chen (2002) review the literature regarding identification and estimation of model (1). For related models in econometrics, see Matzkin (2007). Following the literature, we assume (i) $\varepsilon$ is independent of $X$ and (ii) $T_0$ is strictly monotone.
There are several applications of the transformation model (1). An important class of transformation models is duration models. In labor economics, the study of employment and unemployment durations is an important area of research, and the duration model has been the main vehicle of empirical studies (Kiefer, 1988, Farber, 1999). More recently, unemployment duration is often studied through the labor-market search model (Mortensen and Pissarides, 1999, Rogerson, Shimer, and Wright, 2005), which imposes testable implications on the duration models (Eckstein and Van den Berg, 2007). See Meyer (1996), van den Berg and Ridder (1998) and van den Berg (2001) for related works. Also, hedonic models with additive marginal utility and additive marginal production technology, studied by Ekeland, Heckman, and Nesheim (2002, 2004), are closely related to the transformation model (1). Chiappori, Komunjer, and Kristensen (2013) provide an extensive list of applications in different areas.
We contribute to the literature by producing new conditions for identification and by proposing a new estimator for $\beta_0$. The identification exploits two model features: the monotonicity of $T_0$ and the additive separability of $X'\beta_0$ and $\varepsilon$. First, we notice that an ordering is preserved by any monotone transformation. Therefore, if we use only the ordering induced by $Y$ for the identification, then the specific form of the transformation $T_0$ is entirely irrelevant. This is not to say $T_0$ is not identified; indeed, if $\beta_0$ is identified, then the identification of $T_0$ can be established following Chen (2002). However, in order to identify $\beta_0$, it is enough to consider the ordering induced by $Y$, as will be demonstrated. Further, in general, the ordering is completely characterized by a binary relation. Therefore, if we are to use the information on the ordering only, it is enough to consider the binary comparison of every pair of observations.
The second observation leading to the identification is that the ordering of $Y$ is determined by the linear function $X'\beta_0 + \varepsilon$. Suppose we have two observations $(Y_1, X_1')$ and $(Y_2, X_2')$. Then we see that $Y_1 < Y_2$ if and only if $X_1'\beta_0 + \varepsilon_1 < X_2'\beta_0 + \varepsilon_2$. This observation may be summarized by the equality

(2) $\mathbf{1}\{Y_2 - Y_1 > 0\} = \mathbf{1}\{-(X_2 - X_1)'\beta_0 < \varepsilon_2 - \varepsilon_1\}$,

for the indicator function $\mathbf{1}\{\cdot\}$. The relation (2) is similar to the binary choice model. The difference of errors $\varepsilon_2 - \varepsilon_1$ plays the role of a random threshold, and the binary outcome of whether the inequality $Y_2 - Y_1 > 0$ holds is determined by whether the threshold is crossed by the difference of the two "single indices," $(X_2 - X_1)'\beta_0$.
These two observations help us formulate a minimization problem that identifies the model parameter $\beta_0$ as its unique solution. Our identification result is similar in spirit to the identification underlying the maximum rank correlation (MRC) estimator of Han (1987). A distinctive feature of our approach is that the cumulative distribution function (cdf) of $\varepsilon_2 - \varepsilon_1$, denoted by $F_0$, is identified along with $\beta_0$. Our identification result is new and provides new identifying conditions. Specifically, we allow the regressor vector $X$ to contain discrete random variables. Further, all continuous regressors may have bounded support. Our key identifying condition is intuitive: we require that the discrete regressors not dominate the continuous regressors in terms of their relative contribution to the single index $X'\beta_0$. However, regardless of whether this condition is met, the subvector of $\beta_0$ for the continuous regressors is identified.
The identification is constructive in the sense that it suggests a natural estimator. Our estimator is defined as a minimizing solution to an empirical criterion, and the empirical criterion is obtained as a sample analogue of the identifying criterion. We propose to use the method of sieves. Sieves refer to a collection of subsets of the parameter space which approximate the original parameter space increasingly well. Conceptually, a denser sieve is employed as more data are collected. See Chen (2007) for a survey of the literature on sieve estimation. Our estimation procedure involves minimizing an empirical criterion, which is a function of $\theta$ and $F$, over a sieve space.
As implied by the equation (2), the criterion function involves pairwise combinations of observations in its formulation. As such, our empirical criterion has a U-process structure; in other words, it appears as a double summation over every pair of observations. Extremum estimation involving U-processes has been studied by Sherman (1993, 1994) for parametric problems, and that theory is applied to MRC estimation. The MRC criterion function has a U-process structure, and it is a step function of a Euclidean parameter. On the other hand, our empirical criterion function is a smooth function of the parameters, which is one advantage of our approach. However, we need to extend the existing literature to deal with a semi-nonparametric problem in order to account for the infinite-dimensional parameter $F$. To do so, we adopt and modify the existing results on sieve M-estimation by Shen and Wong (1994) and Shen (1997). The main contribution here is to show that the estimator minimizing the U-process can be represented as an approximate M-estimator. We achieve this by approximating the U-process with a more familiar empirical process. The theoretical device used for this task is the U-process maximal inequality; in Appendix B, we present its working form.
We show that the estimator of $F_0$ converges faster than the $n^{1/4}$-rate in terms of the $L^2$-norm. The estimator of $\beta_0$ converges at the $n^{1/2}$-rate to a normal distribution. Regarding inference on $\beta_0$, because we provide an explicit form of the asymptotic variance, inference can be conducted relying on the asymptotic approximation. A downside of this approach is that the asymptotic covariance matrix has quite a complex form, and that it requires estimation of even more nonparametric objects, such as a conditional expectation. Therefore, we prefer simulation-based methods and suggest a weighted-bootstrap scheme to approximate the finite-sample distribution of the estimator. The consistency of the weighted bootstrap has recently been shown by Ma and Kosorok (2005) and Chen and Pouzo (2009) for sieve M-estimation and the conditional moment model, respectively. We extend these earlier works to the case when the empirical criterion has a U-process structure.
There are several strands of literature related to this paper. First, several papers have proposed to estimate $T_0$ nonparametrically when a $\sqrt{n}$-consistent estimator of $\beta_0$ is available. As such, these papers and our work are complementary. See Horowitz (1996), Ye and Duan (1997), Klein and Sherman (2002), and Chen (2002). If $T_0$ is parametrized, then all the model parameters can be estimated jointly, including $\beta_0$ and $T_0$. Relevant works in this approach include Linton, Sperlich, and Van Keilegom (2008) and Santos (2011) among others. Second, there are rank-based estimators initiated by Han (1987). Other relevant works in this strand include Cavanagh and Sherman (1998), Abrevaya (2003), Khan and Tamer (2007) and Khan, Shin, and Tamer (2011) among others. A common aspect shared by these methods is that $\beta_0$ is identified and estimated without knowledge of $T_0$ and the error distribution. Third, the methods for the single-index model are applicable to the transformation model. Single-index models have been extensively studied in econometrics and statistics since Ichimura (1993); see Horowitz (1998) and Ichimura and Todd (2007) for surveys. In addition, although our estimator is designed specifically for the transformation model, its technical aspect is akin to that of the single-index regression model. This is because the Euclidean parameter $\beta_0$ enters the infinite-dimensional parameter $F_0$ as its argument. Rather unexpectedly, however, few works relate sieve estimation to single-index models; see Ding and Nan (2011) and the references therein. These results are not applicable to our problem.¹ Therefore, we develop a suitable asymptotic theory that applies to the single-index problem in the context of sieve estimation.

¹The recent work by Ding and Nan (2011) assumes that the empirical criterion is twice Fréchet differentiable with respect to a certain pseudo-metric. See pp. 3035-3036 of Ding and Nan (2011). Our empirical criterion is not Fréchet differentiable.
The remainder of this paper is organized as follows. Section 2 defines the model and establishes the identification. Section 3 defines our estimator and shows its consistency. Section 4 derives the rate of convergence. Section 5 shows the asymptotic normality of the estimator; it also includes the consistency of the weighted bootstrap procedure. Section 6 contains a simulation study. Section 7 discusses possible extensions. Proofs are gathered in the Appendix. Most notation is defined in Section 2 and in Appendix A.1, but inevitably more notation will be added throughout the paper.
2. Identification
We define the criterion function that identifies $\beta_0$ and $F_0$ as its minimizing solution. To this end, we need to introduce scale and location normalizations. As a scale normalization, the first component of $\beta_0$ is normalized to $\pm 1$, and thus $\beta_0$ is written as $(\beta_{0,1}, \theta_0')'$ for a scalar $\beta_{0,1}$ such that $|\beta_{0,1}| = 1$ and some $(d_x - 1)$-dimensional vector $\theta_0$. To see why this is necessary, consider $\tilde T = cT_0$ and $\tilde\varepsilon_i = c\varepsilon_i$ for some positive constant $c > 0$. Because $\tilde T$ is strictly increasing and $\tilde\varepsilon_i$ is not observed, the alternative model $\tilde T(Y_i) = X_i'(c\beta_0) + \tilde\varepsilon_i$ is observationally equivalent to the original model (1). Therefore, for the point identification of $\beta_0$, we need to restrict the parameter space for $\beta$ so that no two admissible points $\beta_1$ and $\beta_2$ can be related as constant multiples of each other. There are other ways to achieve the scale normalization. For instance, we could set $|\beta_0| = 1$, so that the parameter space for $\beta$ is the unit sphere in $\mathbb{R}^{d_x}$.
The location normalization is achieved by not allowing a constant term in $X$. Suppose we have a constant term $c$, and write the model as $T_0(Y_i) = c + X_i'\beta_0 + \varepsilon_i$. This can be equivalently written as $T_0(Y_i) = \tilde c + X_i'\beta_0 + \tilde\varepsilon_i$ for $\tilde\varepsilon_i = \varepsilon_i + c - \tilde c$ and any constant $\tilde c$. As these two models are observationally equivalent, the constant term $c$ or $\tilde c$ is not identified. Notice that we do not impose a location normalization on $\varepsilon_i$; its mean or median is not restricted.
As mentioned in the introduction, our criterion function is motivated by the relation (2). We develop (2) further to induce the identifying criterion. By taking conditional expectations on both sides of (2), we have

$P(\Delta Y > 0 \mid X_1, X_2) = P(\Delta\varepsilon > -\Delta X'\beta_0 \mid X_1, X_2) = 1 - F_0(-\Delta X'\beta_0)$,

where $F_0$ is the cdf of $\varepsilon_2 - \varepsilon_1$ and the notation $\Delta$ denotes the difference of two consecutive observations; that is, $\Delta(\cdot) = (\cdot)_2 - (\cdot)_1$. Recall $\varepsilon_1$ and $\varepsilon_2$ are from an i.i.d. sample, and hence the distribution of $\varepsilon_1 - \varepsilon_2$ is equal to the distribution of $\varepsilon_2 - \varepsilon_1$. This implies that $1 - F_0(-z) = F_0(z)$ for any $z \in \mathbb{R}$. Then we have the equation

(3) $P(\Delta Y > 0 \mid X_1, X_2) = F_0(\Delta X'\beta_0)$.
This relation leads us to define a new criterion. To state it, let us relabel the parameters $\beta$ and $F$. Because the first component of $\beta$ is normalized to $\pm 1$, we denote it separately by $b \in \{-1, 1\}$. Then $\theta$ is a $(d_x - 1)$-by-1 vector such that $\beta = (b, \theta')'$. Combining the parameters of interest $\theta$ and $F$, we write $\alpha = (\theta, F)$. Then we define a nonlinear least squares criterion implied by the relation (3) as follows; for $V_i = (Y_i, X_i')'$,

(4) $h(b, \alpha; V_1, V_2) = \{\mathbf{1}\{\Delta Y > 0\} - F(\Delta X'\beta)\}^2$, $\quad Q(b, \alpha) = E[h(b, \alpha; V_1, V_2)]$.

We call $Q(b, \alpha)$ the population criterion. A corresponding empirical criterion will be defined in Section 3. Theorem 2.1 below shows that $\beta_0$ and $F_0$ are uniquely identified as a minimizer of the population criterion function $Q$, and that the infinite-dimensional parameter $F_0$ is uniquely identified on the support of $\Delta X'\beta_0$.
The following notation is needed. Because we have different conditions for continuous and discrete regressors (see Assumption 2.3), let us divide $X_i$ into a continuous random vector $X_{i,c} \in \mathbb{R}^{d_c}$ and a discrete random vector $X_{i,d} \in \mathbb{R}^{d_x - d_c}$ so that $X_i' = (X_{i,c}', X_{i,d}')$. Divide $\beta$ into $\beta_c$ and $\beta_d$ accordingly. Similarly, we write $\Delta X' = (\Delta X_c', \Delta X_d')$. The support of a random vector $X$ is denoted by $\operatorname{supp} X$.² Lastly, we denote

$N_j = \operatorname{supp}\Delta\varepsilon \cap \{x'\beta_{0,c} + \gamma_{j-1}'\beta_{0,d} : x \in \operatorname{supp}\Delta X_c\} \cap \{x'\beta_{0,c} + \gamma_j'\beta_{0,d} : x \in \operatorname{supp}\Delta X_c\}$,

for some constants $\{\gamma_j\}_{j=0}^{d_x - d_c}$ and $j \in \{1, \ldots, d_x - d_c\}$. The notation $N_j$ is used only for the identification purpose (see Assumption 2.4).
Assumption 2.1. $\{Y_i, X_i, \varepsilon_i\}_{i=1}^n$ is independent and identically distributed (i.i.d.) and conforms to the equation (1). $\varepsilon_i$ is continuous and independent of $X_i$.

²For a random variable $X$, its support is defined as the smallest closed set $B$ such that $P[X \in B^c] = 0$.
Assumption 2.2. (i) $\beta_0 = (\beta_{0,1}, \theta_0')'$ for $\beta_{0,1} \in \{-1, 1\}$ and $\theta_0 \in \Theta$, a compact subset of $\mathbb{R}^{d_x - 1}$. (ii) $\mathcal{F}$ is a collection of continuous monotone functions on $\mathbb{R}$. $F_0 \in \mathcal{F}$.

Assumption 2.3. (i) For $X_i = (X_{i,c}', X_{i,d}')'$, $X_{i,c} \in \mathbb{R}^{d_c}$ is jointly continuous, and $X_{i,d} \in \mathbb{R}^{d_x - d_c}$ is discrete. There is no constant in $X_i$. (ii) $\operatorname{supp} X_i = \operatorname{supp} X_{i,c} \times \operatorname{supp} X_{i,d}$. (iii) $\operatorname{supp}\Delta X_d$ is not contained in a proper linear subspace of $\mathbb{R}^{d_x - d_c}$.

Assumption 2.4. There exists a set of points $\{\gamma_0 = 0, \gamma_1, \ldots, \gamma_{d_x - d_c}\}$ such that $\gamma_j \in \operatorname{supp}\Delta X_d$ for $j = 1, \ldots, d_x - d_c$ and $\{\gamma_1, \ldots, \gamma_{d_x - d_c}\}$ are linearly independent. In addition, the set $N_j$ has a non-empty interior for every $j = 1, \ldots, d_x - d_c$.
; d x dc .
Assumption 2.1 is standard in the literature although the independence of "i and Xi0 0 can
be weakened to the conditional median independence; see Khan, Shin and Tamer (2012).
We do not consider this possibility. Assumption 2.2 regards the parameter spaces for
and F . Assumption 2.2 (i) restates our scale normalization and restricts
to be compact.
Assumption 2.2 (ii) de…nes F0 , and restricts the parameter space F for F0 . Regarding the
identi…cation, F0 needs not to be smooth. However, it is essential that any F 2 F is continuous and monotone. The monotonicity requirement is not required if the support for X 0 0
is R. Heuristically speaking, this assumption regulates the possible value of the parameter
when the support of Xi is not connected due to the existence of discrete regressors.
Assumption 2.3 (i) allows that both continuous and discrete regressors in Xi . The requirement that Xi;c is jointly continuous implies that supp Xc has a non-empty interior, or,
equivalently, that supp Xc is not included in a proper subspace of Rdx;c . If this assumption
is violated, then supp Xc will exhibit the multi-collinearity. As explained above, a constant
term is not allowed. Assumption 2.3 (ii) means that the support of Xi;c does not depend on
the realization of Xi;d . This assumption can be weakened, and we may allow the support
of Xi;c depends on the value of the discrete regressors Xi;d , as long as the support of Xi;c
conditional on Xi;d has non-empty interior in Rdc . What is necessary to this generalization is
only to modify Assumption 2.4 accordingly. The proof of identi…cation will remain essentially
same. For simplicity, however, we do not attempt this generalization. Assumption 2.3 (iii) is
a requirement for the discrete regressor Xi;d , parallel to the requirement that Xi;c is jointly
continuous.
Assumption 2.4 requires that (i) the contribution of discrete variables to the single index
Xi0 0 is not too large relative to that of continuous variables, and (ii) the variation of the
error term is not too small. This assumption regards the identi…cation of 0;d , that is, the
regression parameter for the discrete regressors. If there is no discrete regressor, therefore,
Assumptions 2.4 is not needed. Even with discrete regressors, if the support of any regressor
is R, then we can omit it.
Theorem 2.1. Suppose Assumptions 2.1-2.4 hold and define $A = \Theta \times \mathcal{F}$. Let

$Q(b_1, \alpha_1) = \min_{b \in \{-1,1\},\, \alpha \in A} Q(b, \alpha)$

for some $b_1 \in \{-1, 1\}$ and $\alpha_1 = (\theta_1, F_1) \in A$. Then $b_1 = \beta_{0,1}$, $\theta_1 = \theta_0$ and $F_1(z) = F_0(z)$ for any $z \in \operatorname{supp}\Delta X'\beta_0$.
The proof of Theorem 2.1 is in Appendix A. Theorem 2.1 establishes the identification of $\beta_0$ and $F_0$. We stress that $F_0$ is identified only on the support of $\Delta X'\beta_0$. This fact adds some complication when we study the estimation of $\beta_0$ and $F_0$.
3. Consistency
3.1. Extremum Estimation and Method of Sieves. The identification result of Theorem 2.1 is constructive in the sense that it suggests an extremum estimator. This section defines our estimator and proves its consistency. As there is an infinite-dimensional parameter, the consistency will be stated in terms of a particular norm that we define shortly. Before proceeding, let us add one simplification. Henceforth, we assume $\beta_{0,1}$ is known and its value is 1; this can be accepted without loss of generality, because our estimator of $\beta_{0,1}$ exactly equals the true value with probability approaching 1. Therefore we let $\beta_0 = (1, \theta_0')'$ and, further, simplify the notation to $Q(\alpha) = Q(1, \alpha)$ and $h(\alpha; \cdot, \cdot) = h(1, \alpha; \cdot, \cdot)$.
The sample analogue to the population criterion $Q$ defined in (4) is

(5) $Q_n(\alpha) = \dfrac{1}{n(n-1)} \displaystyle\sum_{i \neq j} h(\alpha; V_i, V_j)$.

Let us call $Q_n$ the empirical criterion. It is immediate from the definition that $E[Q_n(\alpha)] = Q(\alpha)$. Also, $Q_n(\alpha)$ is a U-statistic for $Q(\alpha)$. If viewed as a stochastic process, $Q_n(\alpha)$ induces a U-process, a generalization of the U-statistic; it is a U-process after centering by $Q(\alpha)$ and scaling by $\sqrt{n}$. Much of our asymptotic theory will rely on U-process theory.³
We minimize $Q_n$ not over $A$ but over a subset of $A$, called a sieve. Let us denote a collection of sieves by $\{A_k\}$. It is required that the sieve $A_k$ approximate the entire parameter space $A$ increasingly accurately as the index $k$ grows. For a finite sample size, we get to pick one sieve $A_k$ to use. However, conceptually, a different sieve $A_{k_n}$ is used as the sample size $n$ changes. The sieve index $k_n$ depends on $n$ and grows to infinity along with the sample size $n$. Our discussion below will rely on abstract assumptions on the sieve spaces $\{A_k\}$ and the speed of divergence of the sieve index $k_n$. Because $\Theta$ is finite-dimensional, we may define the sieve $A_k$ as a product of $\Theta$ and $\mathcal{F}_k$; that is, only the infinite-dimensional $\mathcal{F}$ is "sieved."

³U-process theory is similar to empirical process theory. For more about U-process theory, see Arcones and Giné (1993), Sherman (1994) and de la Peña and Giné (1999) among others.
Using the sieve $A_{k_n} = \Theta \times \mathcal{F}_{k_n}$, we define the estimator $\hat\alpha_n$ as follows:

(6) $\hat\alpha_n \in \operatorname{argmin}_{\alpha \in A_{k_n}} Q_n(\alpha)$.

We write $\hat\alpha_n = (\hat\theta_n, \hat F_n)$ for $\hat\theta_n \in \Theta$ and $\hat F_n \in \mathcal{F}_{k_n}$. If there are multiple minimizers in (6), any point among them can be chosen as the estimator.
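To fix ideas, the following sketch shows how (5) and (6) can be implemented with a standard optimizer. It is only an illustration under our own assumptions, not the implementation used in Section 6: the sieve here is a hypothetical mixture of logistic cdfs with fixed scales (so that every sieve element is itself a cdf), whereas the paper's simulations use symmetrized I-spline bases; all function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, softmax

# Hypothetical finite-dimensional sieve for F: a convex combination of
# logistic cdfs with fixed scales, so each element of the sieve is a cdf.
SCALES = np.array([0.5, 1.0, 2.0])           # k = 3 basis functions

def F_sieve(z, w):
    """Sieve cdf F(z) = sum_k w_k * Lambda(z / s_k), w >= 0, sum(w) = 1."""
    return expit(z[..., None] / SCALES) @ w

def Q_n(params, y, x):
    """Empirical U-process criterion (5): the average of
    {1{dY > 0} - F(dX' beta)}^2 over all ordered pairs i != j."""
    d_theta = x.shape[1] - 1                  # scale normalization: beta = (1, theta')'
    theta, w = params[:d_theta], softmax(params[d_theta:])
    beta = np.concatenate(([1.0], theta))
    idx = x @ beta
    dy = y[:, None] - y[None, :]              # pairwise differences, Delta Y
    dxb = idx[:, None] - idx[None, :]         # Delta X' beta
    h = ((dy > 0).astype(float) - F_sieve(dxb, w)) ** 2
    n = len(y)
    return (h.sum() - np.trace(h)) / (n * (n - 1))

# Toy data: log Y = X' beta + eps with beta = (1, -1) and logistic errors.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 2))
y = np.exp(x @ np.array([1.0, -1.0]) + rng.logistic(size=n))
res = minimize(Q_n, x0=np.zeros(1 + len(SCALES)), args=(y, x),
               method="Nelder-Mead")
print("theta_hat:", res.x[0])                 # true theta = -1
```

Note that the criterion is numerically invariant to monotone transformations of $Y$: only the pairwise ordering $\Delta y > 0$ enters.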
3.2. Consistency. In semi-nonparametric problems, there are several candidates for a norm attached to the parameter space. This is due to the infinite-dimensional nature of the problem. One of the main tasks in studying a semi-nonparametric problem is to find a norm proper to the context. In contrast, in parametric problems, the Euclidean norm is a natural choice to measure distance. We start by defining a suitable norm in which to state the consistency of the estimator $\hat\alpha_n$.⁴ When defining norms on $\mathcal{F}$, an important fact is that $F_0$ is identified only on the support of $\Delta X'\beta_0$. Therefore, we first define a norm on $\mathcal{F}$ as

$\|F\|_{\mathcal{F},c} = \max\Big\{ \sup_{z \in \operatorname{supp}\Delta X'\beta_0} |F(z)|, \; \sup_{z \in \operatorname{supp}\Delta X'\beta_0} |F'(z)| \Big\}$.

Then define the consistency norm $\|\cdot\|_c$ on $A$ as $\|\alpha\|_c = |\theta| + \|F\|_{\mathcal{F},c}$. Also, denote the usual sup-norm by $\|\cdot\|_\infty$.
Then, de…ne the consistency norm k kc on A as k kc = j j + kF kF ;c . Also, denote the usual
sup-norm by k k1 .
We are ready to state assumptions for the consistency. We assume Xi has at least fourth
moment (EjXi j4 < 1). Also fVi = (Yi ; Xi0 )0 g is always a random sample. These two premises
are maintained throughout the paper. We list other more substantial assumptions.
Assumption 3.1. (i) The parameter $\alpha_0$ is uniquely identified in the sense of Theorem 2.1. (ii) $\Theta$ is a compact subset of $\mathbb{R}^{d_\theta}$ with a non-empty interior. $\theta_0$ is an interior point of $\Theta$.

Assumption 3.2. (i) For some integer $\kappa \geq 3$, $\max_{i \in \{0,1,\ldots,\kappa\}} \sup_{z \in \mathbb{R}} |\frac{d^i}{dz^i} F_0(z)| < \infty$. (ii) For some constant $\omega > 0$, collect every monotone function $F$ on $\mathbb{R}$ such that

$\max_{i \in \{0,1,\ldots,\kappa\}} \sup_{z \in \mathbb{R}} \Big| \dfrac{d^i}{dz^i}\{F(z) - F_0(z)\} \Big| (1 + z^2)^{\omega/2} \leq B$,

for some positive constant $B > 0$. The set $\mathcal{F}$ is the closure of this function class in the norm $\|F\|_{1,\infty} = \|F\|_\infty \vee \|F'\|_\infty$.

Assumption 3.3. There exists a sequence $\{\pi_k F_0\}$ such that $\pi_k F_0 \in \mathcal{F}_k$ and $\max_{i \in \{0,1\}} \sup_{z \in \mathbb{R}} |\frac{d^i}{dz^i}\{\pi_k F_0(z) - F_0(z)\}| \to 0$ as $k \to \infty$.
Assumption 3.1 is standard. The true parameter $\theta_0$ need not be an interior point for consistency, but this is included for later results. Assumption 3.2(i) states that $F_0$ is at least $\kappa$-times differentiable and that its derivatives are uniformly bounded. Assumption 3.2(ii) defines the set $\mathcal{F}$. There are several implications. First, by definition, $F_0$ is an interior point⁵ of $\mathcal{F}$. Second, the weighting function $(1 + (\cdot)^2)^{\omega/2}$ is included to address the case when $X_i$ has unbounded support. The particular form of the weighting function and its technical usage come from Gallant and Nychka (1987). Third, $F \in \mathcal{F}$ need not be a cdf. Recall that $F \in \mathcal{F}$ being continuous and monotone is enough for the identification (Assumption 2.2(ii)). However, it is possible to make $\mathcal{F}$ include only cdfs. Similarly, knowing that $F_0$ is symmetric (that is, $F_0(z) = 1 - F_0(-z)$ for any $z \in \mathbb{R}$), we may restrict every $F \in \mathcal{F}$ to be symmetric. The asymptotic distribution of $\hat\theta_n$ is not affected by the choice of $\mathcal{F}$. Assumption 3.3 specifies the approximation property of the sieves. For consistency, it is enough that the true parameter $F_0$ be well approximated. We define $\pi_k\alpha_0 = (\theta_0, \pi_k F_0)$. Notice that $\|\pi_k\alpha_0 - \alpha_0\|_c \to 0$.

⁴Later we add more norms when needed. See the definition (7) and Appendix A.1. In fact, all those norms are only semi-norms. We do not stress this fact.

Theorem 3.1. Suppose Assumptions 3.1-3.3 hold. Then $\|\hat\alpha_n - \alpha_0\|_c \stackrel{p}{\to} 0$.
The proof of Theorem 3.1 is in the appendix. Notice that the derivative $F_0'$, as well as $F_0$, is consistently estimated, uniformly on the support of $\Delta X'\beta_0$. This result is used to establish the convergence rate of $\hat\alpha_n$ in a weaker norm.
4. Rate of Convergence
This section derives the convergence rate of the estimator $\hat\alpha_n$. The first step is to define an appropriate norm on $A$. To this end, we need to show that the population criterion $Q$ induces a norm on the parameter space local to $\alpha_0$. We provide heuristic explanations. Given the consistency result, we can focus on a subset of the parameter space $A$ near $\alpha_0$. Consider a local neighborhood of $\alpha_0$ in the normed space $(A, \|\cdot\|_c)$. By the equality (3), it is easy to show that

$Q(\alpha) - Q(\alpha_0) = E[F(\Delta X'\beta) - F_0(\Delta X'\beta_0)]^2$.

Recall that we set $\beta_{0,1} = 1$, and as such $\Delta X'\beta = \Delta X_1 + \Delta\tilde X'\theta$ for $\Delta X_j = X_{2,j} - X_{1,j}$ and $\Delta\tilde X = [\Delta X_2, \ldots, \Delta X_{d_x}]'$. Applying the Taylor expansion to $F(\Delta X'\beta) - F_0(\Delta X'\beta_0)$, we obtain the following approximate equality:

$Q(\alpha) - Q(\alpha_0) \simeq E[\{F'(\Delta X'\beta_0)\,\Delta\tilde X'(\theta - \theta_0) + F(\Delta X'\beta_0) - F_0(\Delta X'\beta_0)\}^2]$.

If $\|\alpha - \alpha_0\|_c$ is small, then we may replace $F'(\Delta X'\beta_0)$ by $F_0'(\Delta X'\beta_0)$ in the last expression. This is the reason why the consistency norm $\|\cdot\|_c$ is chosen to involve the first-order derivative of $F$.

⁵Here, we regard $\mathcal{F}$ as a normed space equipped with $\|\cdot\|_{1,\infty}$; that is, $\mathcal{F}$ is the whole set. Note that $F_0$ is not an interior point of the larger normed space $\{F : \|F\|_{1,\infty} < \infty\}$ with the same norm.
This heuristic observation motivates us to define the following norm as a measure of the rate of convergence for $\hat\alpha_n$; define the rate norm $\|\cdot\|_q$ as

(7) $\|\alpha\|_q = \{E[\{F_0'(\Delta X'\beta_0)\,\Delta\tilde X'\theta + F(\Delta X'\beta_0)\}^2]\}^{1/2}$.

The subscript $q$ is chosen to indicate that the norm is derived from the population criterion $Q$. Lemma A.15 proves that $Q(\alpha) - Q(\alpha_0)$ is locally similar to $\|\alpha - \alpha_0\|_q^2$ on an open neighborhood of $\alpha_0$ in the normed space $(A, \|\cdot\|_c)$. In standard parametric problems, a similar relation holds with the Euclidean norm $|\cdot|$. The rate norm $\|\cdot\|_q$ is not necessarily an object of interest; however, it turns out that the rate norm $\|\cdot\|_q$ is equivalent⁶ to the norm $|\theta| + \|F(\Delta X'\beta_0)\|_{L^2(P)}$ for the usual $L^2$-norm $\|\cdot\|_{L^2(P)}$ with respect to the probability measure $P$. Then, for instance, the upper bound for the rate of $|\hat\theta_n - \theta_0|$ is given by the $\|\cdot\|_q$-norm rate. The following three assumptions, in addition to the assumptions for consistency, will be used to derive the rate of convergence.
Assumption 4.1. $2\kappa\omega > \omega + \kappa$.

Assumption 4.2. There exists a sequence $\{\pi_k\alpha_0 = (\theta_0, F_{0,k}) : k \in \mathbb{N}, F_{0,k} \in \mathcal{F}_k\}$ such that $r_n\|\pi_{k_n}\alpha_0 - \alpha_0\|_q = o(1)$.

Assumption 4.3. Denote, for $\Delta\tilde X = [\Delta X_2, \ldots, \Delta X_{d_x}]'$,

$\Sigma = E\big[F_0'(\Delta X'\beta_0)^2 \{\Delta\tilde X - E[\Delta\tilde X \mid \Delta X'\beta_0]\}\{\Delta\tilde X - E[\Delta\tilde X \mid \Delta X'\beta_0]\}'\big]$.

The matrix $\Sigma$ is non-singular.
Assumption 4.1 limits the possible values of the constants $\kappa$ and $\omega$. Recall that these two constants are used to define the parameter space $\mathcal{F}$ in Assumptions 3.2-3.3. Note that the convergence rate $r_n$ will be determined by these constants. Assumption 4.2 states that the sieve approximation error $\|\pi_{k_n}\alpha_0 - \alpha_0\|_q$ vanishes faster than the convergence rate $r_n$. This requirement is intuitive because the rate of $\|\pi_{k_n}\alpha_0 - \alpha_0\|_q$ is an upper bound for the rate of $\|\hat\alpha_n - \alpha_0\|_q$. Assumption 4.3 is the key condition for the entire rate calculation. It has a role similar to the nonsingularity of the Hessian matrix in usual parametric problems. The particular form of the matrix $\Sigma$ will be suggested in the proof of Lemma A.14, which proves the norm equivalence of $\|\cdot\|_q$ and $|\theta| + \|F(\Delta X'\beta_0)\|_{L^2(P)}$. These three assumptions, together with the consistency of $\hat\alpha_n$ in the norm $\|\cdot\|_c$, are sufficient for the following result. Recall that the constants $\kappa$ and $\omega$ are defined in Assumption 3.2.
Theorem 4.1 (Rate of Convergence). Suppose Assumptions 3.1-3.3 and 4.1-4.3 hold. Then

$r_n \|\hat\alpha_n - \alpha_0\|_q = O_p(1)$,

for the rate-of-convergence factor $r_n = n^{\kappa\omega/(2\kappa\omega + \kappa + \omega)}$.

⁶Two norms are equivalent if their ratio remains within a fixed range $[a, b]$, $0 < a < b < \infty$, at any point. This equivalence result is proved in Lemma A.14.
The convergence rate for sieve M-estimators is proved by Shen and Wong (1994). A similar result can be found in van der Vaart and Wellner (1996), Theorem 3.4.1. We use a proof method similar to that of van der Vaart and Wellner (1996). When doing so, it needs to be taken into account that the empirical criterion $Q_n$ has a U-process structure. Sherman (1993, 1994) studies a similar problem in parametric settings. Our result extends Sherman (1993, 1994) to infinite-dimensional problems with sieve spaces.
To facilitate the asymptotic analysis, we need to decompose the criterion function. Define, for $v, v_1, v_2 \in \mathbb{R}^{2+d_x}$,

$m(\alpha, v) = E[h(\alpha; V_1, V_2) \mid V_1 = v] + E[h(\alpha; V_1, V_2) \mid V_2 = v] - Q(\alpha)$;
$g(\alpha; v_1, v_2) = h(\alpha; v_1, v_2) - E[h(\alpha; V_1, V_2) \mid V_1 = v_1] - E[h(\alpha; V_1, V_2) \mid V_2 = v_2] + Q(\alpha)$.

Note that $E[m(\alpha, V_2)] = Q(\alpha)$ and $E[g(\alpha; V_1, V_2)] = 0$. Moreover, it can be checked that

(8) $Q_n(\alpha) = \dfrac{1}{n}\displaystyle\sum_{i=1}^n m(\alpha, V_i) + \dfrac{1}{n(n-1)}\displaystyle\sum_{i \neq j} g(\alpha; V_i, V_j)$.

The expression (8) is called the Hoeffding decomposition; this is a fundamental result in U-statistic theory. Because $E[g(\alpha; V_1, V_2) \mid V_1] = E[g(\alpha; V_1, V_2) \mid V_2] = 0$ for any $\alpha \in A$, the second term on the right of (8) is called a degenerate U-process. From the last expression, it is clear that the U-process criterion is the sum of a sample-mean process and a degenerate U-process.
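The check behind (8) is pure bookkeeping; we record it here for completeness. Each conditioning term $E[h(\alpha; V_1, V_2) \mid V_1 = V_i]$ appears in exactly $n - 1$ of the ordered pairs $(i, j)$ with $i \neq j$, and likewise for $V_2 = V_j$, so

```latex
\frac{1}{n(n-1)}\sum_{i\neq j}
  \Big\{ E[h(\alpha;V_1,V_2)\mid V_1=V_i]
       + E[h(\alpha;V_1,V_2)\mid V_2=V_j] - Q(\alpha) \Big\}
  = \frac{1}{n}\sum_{i=1}^{n} m(\alpha,V_i).
```

Adding $g(\alpha; V_i, V_j)$ to the summand on the left recovers $h(\alpha; V_i, V_j)$ by the definition of $g$, and averaging over the ordered pairs then yields (8).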
As such, our proof of Theorem 4.1 can be divided into two parts. First, we show that the degenerate U-process in (8) is asymptotically negligible; this is proved in Lemma A.13 in the appendix. We can then treat $\hat\alpha_n$ as an M-estimator minimizing the sample mean of $m(\alpha, V_i)$ up to some error, where the error comes from the degenerate U-process. Second, we prove the rate of convergence using empirical process theory, in a manner similar to van der Vaart and Wellner (1996), Theorem 3.4.1.
5. Asymptotic Normality
This section focuses on the asymptotic distribution of $\hat\theta_n$; recall that $\beta = (1, \theta')'$. The infinite-dimensional parameter $F$ is treated as a nuisance parameter. The first step is to express $\lambda'\theta$ as a functional of $\alpha$. We are to express such a functional as an inner product of $\alpha$ and a special point $v^*$. The inner product is induced by the norm $\|\cdot\|_q$. To define it, let $\mathcal{V}$ be the product space of $\mathbb{R}^{d_\theta}$ and $\{F : \|F(\Delta X'\beta_0)\|_{L^2(P)} < \infty\}$. For two arbitrary points $v, w$ in $\mathcal{V}$, we define

(9) $\langle v, w \rangle = E[\{F_0'(\Delta X'\beta_0)\,\Delta\tilde X'\theta_v + F_v(\Delta X'\beta_0)\}\{F_0'(\Delta X'\beta_0)\,\Delta\tilde X'\theta_w + F_w(\Delta X'\beta_0)\}]$,
for $v = (\theta_v, F_v)$ and $w = (\theta_w, F_w)$. It can easily be verified that the bilinear map $\langle\cdot,\cdot\rangle$ is indeed an inner product. Then the special point $v^*$ is defined as follows. Let $v^* = (\theta^*, F^*)$ for

$\theta^* = \Sigma^{-1}\lambda$ and $F^*(z) = -F_0'(z)\,E[\Delta\tilde X' \mid \Delta X'\beta_0 = z]\,\Sigma^{-1}\lambda$.

Assume that $v^*$ is in $\mathcal{V}$ or, equivalently, assume that $\|F^*(\Delta X'\beta_0)\|_{L^2(P)}$ is finite. By an easy calculation⁷, one can show that

(10) $\lambda'\theta = \langle \alpha, v^* \rangle$.

Therefore, we know the exact expression for the special point $v^*$. Even when its expression is unknown, however, the existence of $v^*$ is guaranteed by the Riesz representation theorem if $\mathcal{V}$ is a Hilbert space and the map $\alpha \mapsto \lambda'\theta$ is bounded and linear. For this reason, $v^*$ is often called the Riesz representer.

⁷A similar calculation appears in the proof of Lemma A.14.
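For completeness, here is one way to carry out the easy calculation behind (10) under the definitions above (it parallels the argument in the proof of Lemma A.14, as footnote 7 notes). Write $Z_0 = \Delta X'\beta_0$; for any $\alpha = (\theta, F) \in \mathcal{V}$,

```latex
\begin{aligned}
\langle \alpha, v^* \rangle
&= E\big[\{F_0'(Z_0)\,\Delta\tilde X'\theta + F(Z_0)\}\,
        F_0'(Z_0)\{\Delta\tilde X - E[\Delta\tilde X\mid Z_0]\}'\big]\,\Sigma^{-1}\lambda\\
&= \theta'\, E\big[F_0'(Z_0)^2\{\Delta\tilde X - E[\Delta\tilde X\mid Z_0]\}
        \{\Delta\tilde X - E[\Delta\tilde X\mid Z_0]\}'\big]\,\Sigma^{-1}\lambda
 = \theta'\,\Sigma\,\Sigma^{-1}\lambda = \lambda'\theta,
\end{aligned}
```

where the first equality substitutes $F_0'(Z_0)\Delta\tilde X'\theta^* + F^*(Z_0) = F_0'(Z_0)\{\Delta\tilde X - E[\Delta\tilde X \mid Z_0]\}'\Sigma^{-1}\lambda$, and the second holds because $F(Z_0)$ and $E[\Delta\tilde X \mid Z_0]$ are functions of $Z_0$, so their products with $\{\Delta\tilde X - E[\Delta\tilde X \mid Z_0]\}$ vanish in expectation by the law of iterated expectations.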
The representation of $\lambda'\theta$ as the inner product (10) is instrumental, since it is possible to approximate the inner product by the population criterion. Note that the inner product (9) is equivalently defined by the polarization identity:

(11) $4\langle v, w \rangle = \|v + w\|_q^2 - \|v - w\|_q^2$.

Therefore, if the two squared norms in (11) are well approximated, so is the inner product. A relevant fact is that the rate norm $\|\cdot\|_q$ is chosen to approximate the population criterion $Q$ locally at $\alpha_0$; see (7) in the previous section. Therefore it is foreseeable that $\lambda'\theta_0$ can be expressed using $Q$. There are technical subtleties in doing so, and more details can be found in the proof of Theorem 5.1.
To obtain the asymptotic normality, the following assumptions are used.

Assumption 5.1. $\kappa\omega > \kappa + \omega$.

Assumption 5.2. For $\pi_{k_n}\alpha_0 = (\theta_0, F_{0,k_n})$ defined in Assumption 4.2,

$\|F_{0,k_n}(\Delta X'\beta_0) - F_0(\Delta X'\beta_0)\|_{L^2(P)} = o(n^{-2/3})$, $\quad \|F_{0,k_n}'(\Delta X'\beta_0) - F_0'(\Delta X'\beta_0)\|_{L^4(P)} = o(n^{-1/3})$.

Assumption 5.3. (i) $\mathcal{F}_k \subset \operatorname{span}\{p_1, \ldots, p_k\}$ for all $k$; (ii) $\{\|p_j\|_\infty\}_{j=1}^\infty$ is uniformly bounded.
Assumption 5.4. Let $\varphi_j(k) = \max_{1 \leq i \leq k} \|\frac{d^j}{dz^j} p_i\|_\infty$. Then the following hold: (i) $\varphi_1(k_n) \vee \varphi_2(k_n) \lesssim \sqrt{k_n} \wedge r_n$, (ii) $k_n^3 r_n = o(n)$ and (iii) $k_n^2 r_n^{-1} = o(1)$.
Assumption 5.5. Let $p^k(z) = (p_1(z), \ldots, p_k(z))'$. The smallest eigenvalue of $E[p^k(\Delta X'\beta_0)p^k(\Delta X'\beta_0)']$ is bounded away from zero uniformly in $k \in \mathbb{N}$.
Assumption 5.6. For any $\lambda \in \mathbb{R}^{d_\theta}$, there exists a sequence

$\{\pi_{k_n}v^* : \pi_{k_n}v^* = (\theta^*, F^*_{k_n}), \; F^*_{k_n} \in \operatorname{span}\{p_1, \ldots, p_{k_n}\}\}$,

such that (i) $\sqrt{n}\,r_n^{-1}\|\pi_{k_n}v^* - v^*\|_q \to 0$ as $n \to \infty$ and (ii) $\sup_{n \in \mathbb{N}} \|F^*_{k_n}\|_\infty$ is bounded.
Assumption 5.1 is stronger than Assumption 4.1. For the asymptotic normality of $\hat\theta_n$, we need $\|\alpha - \alpha_0\|_q^2$ to be well approximated by $Q(\alpha) - Q(\alpha_0)$ for $\alpha$ close to $\hat\alpha_n$. Therefore, if $\hat\alpha_n$ converges faster, then the approximation shows less error. By imposing Assumption 5.1, we achieve the faster convergence rate and hence control the approximation error. Assumption 5.2 demands that the sieve approximation error vanish at a certain rate not only for $F_0$ but also for its derivative $F_0'$. Assumption 5.3 limits the sieve spaces that we consider. As mentioned already, we choose $\mathcal{F}_k$ to be finite-dimensional and linear. The functions $\{p_1, p_2, \ldots\}$ are called basis functions. Assumption 5.4 concerns the smoothness properties of the basis functions. Note that $\varphi_j(k)$ can be regarded as a smoothness measure for the basis functions $\{p_1, \ldots, p_k\}$. The role of Assumption 5.4 is to control the convergence of the derivatives of $\hat F_n$. Recall that the convergence rate is stated in terms of the rate norm $\|\cdot\|_q$, and the convergence of $\|\hat\alpha_n - \alpha_0\|_q$ does not imply that the derivatives $\hat F_n'$ and $\hat F_n''$ converge in some norm. However, by imposing Assumption 5.4, we can control the convergence rates of $\|\hat F_n' - F_0'\|_\infty$ and $\|\hat F_n'' - F_0''\|_\infty$ in terms of $\|\hat F_n - F_0\|_\infty$. Assumption 5.5 is used to establish the norm equivalence between $\|F(\Delta X'\beta_0)\|_{L^2(P)}$ and $\|F(\Delta X'\beta_0)\|_{L^\infty(P)}$ for $F$ in $\mathcal{F}_k$. This is possible because $\mathcal{F}_k$ is a finite-dimensional sieve; recall that in a Euclidean space, $L^p$-norms are equivalent for $1 \leq p \leq \infty$. Assumption 5.6 states that the Riesz representer $v^*$ can be approximated by a sequence in the sieves to a certain precision. Before stating the main result, we add one more notation. Define the linear directional derivative of $h(\alpha; \cdot, \cdot)$ in the direction $v = (\theta_v, F_v) \in \mathcal{V}$ as

(12) $h'(\alpha; \cdot, \cdot)[v] = \dfrac{d}{dt}\, h(\alpha + tv; \cdot, \cdot)\Big|_{t=0}$.
Now we state the main result of this paper.

Theorem 5.1. Suppose Assumptions 3.1-3.3, 4.3 and 5.1-5.5 hold. Then

$\sqrt{n}(\hat\theta_n - \theta_0) \stackrel{d}{\to} N(0, \Omega)$,

where the matrix $\Omega$ is such that, for any $\lambda \in \mathbb{R}^{d_\theta}$,

$\lambda'\Omega\lambda = E\big[h'(\alpha_0; V_1, V_2)[v^*]\; h'(\alpha_0; V_1, V_3)[v^*]\big]$.
The proof of Theorem 5.1 can be found in Appendix A, and the functional form of $h'(\alpha_0; V_1, V_2)[\cdot]$ is derived in Lemma A.19. Because we have an explicit expression for $v^*$, it is possible to estimate $v^*$ and then the matrix $\Omega$. If $\Omega$ is consistently estimated, inference on $\theta_0$ can be conducted relying on the asymptotic normality result of the above theorem. A downside to this approach is that it involves several nonparametric estimations. For instance, to estimate $v^*$, the conditional expectation $E[\Delta\tilde X' \mid \Delta X'\beta_0]$ needs to be estimated. Therefore, a simulation-based method is preferred. Below, we prove the consistency of the weighted bootstrap.
5.1. Weighted Bootstrap. Consider a randomly generated sequence of weights $\{B_i\}_{i=1}^n$. We assume $E[B_i] = 1$ and $\operatorname{Var}(B_i) = 1$. If these conditions are met, the weight may have any distribution. Possible distributions are the discrete uniform distribution on $\{0, 2\}$ or the normal distribution $N(1, 1)$. Define the weighted empirical criterion

$Q_n^*(\alpha) = \dfrac{1}{n(n-1)}\displaystyle\sum_{i \neq j} B_i B_j\, h(\alpha; V_i, V_j)$.

Next, define $\hat\alpha_n^*$ to be a point such that $\hat\alpha_n^* \in A_{k_n}$ and

$\hat\alpha_n^* \in \operatorname{argmin}_{\alpha \in A_{k_n}} Q_n^*(\alpha)$.

Also, write $\hat\alpha_n^* = (\hat\theta_n^*, \hat F_n^*)$. The following theorem proves that the asymptotic distribution of $\sqrt{n}(\hat\theta_n^* - \hat\theta_n)$ conditional on the sample $\{V_1, \ldots, V_n\}$ is the same as the unconditional asymptotic distribution of $\sqrt{n}(\hat\theta_n - \theta_0)$.

Theorem 5.2. Suppose all the conditions of Theorem 5.1 hold. If $\{B_i\}_{i=1}^n$ is an i.i.d. sequence such that $E[B_i] = 1$ and $\operatorname{Var}(B_i) = 1$, then for any $c \in \mathbb{R}^{d_\theta}$ and any $n \in \mathbb{N}$,

$P[\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \leq c \mid V_1, \ldots, V_n] = P[\sqrt{n}(\hat\theta_n - \theta_0) \leq c] + o_p(1)$.
The bootstrap inference is easy to implement. Fix the distribution for $B_i$ and draw the random weights $\{B_i\}_{i=1}^n$. Then estimate $\hat\alpha_n^*$ by minimizing the weighted empirical criterion $Q_n^*$. The sieve-size index $k_n$ remains the same as in the original problem. By repeating this procedure, we obtain the empirical distribution of $\sqrt{n}(\hat\theta_n^* - \hat\theta_n)$ conditional on $\{V_i\}_{i=1}^n$. Then the quantiles of this empirical distribution can be used as critical values for inference on $\sqrt{n}(\hat\theta_n - \theta_0)$.
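The following sketch spells out this loop. It assumes a function `criterion_pairs(params, y, x)` that returns the $n \times n$ matrix of $h(\alpha; V_i, V_j)$, as computed inside `Q_n` in the Section 3 sketch; the choice of $N(1,1)$ weights, the helper names and the number of bootstrap repetitions are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_Q_n(params, y, x, B, criterion_pairs):
    """Q*_n: the pairwise criterion reweighted by B_i * B_j over i != j."""
    h = criterion_pairs(params, y, x)       # n-by-n matrix of h(alpha; V_i, V_j)
    W = np.outer(B, B)
    np.fill_diagonal(W, 0.0)                # drop the i = j terms
    n = len(y)
    return (W * h).sum() / (n * (n - 1))

def weighted_bootstrap(y, x, criterion_pairs, x0, theta_hat,
                       n_boot=200, seed=0):
    """Draw multiplier weights with mean 1 and variance 1, re-minimize the
    weighted criterion over the same sieve, and collect the draws of
    sqrt(n) * (theta* - theta_hat)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        B = rng.normal(loc=1.0, scale=1.0, size=n)   # N(1,1) weights
        res = minimize(weighted_Q_n, x0=x0,
                       args=(y, x, B, criterion_pairs), method="Nelder-Mead")
        draws[b] = np.sqrt(n) * (res.x[0] - theta_hat)
    # the quantiles serve as critical values for sqrt(n)(theta_hat - theta_0)
    return np.quantile(draws, [0.025, 0.975])
```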
6. Simulation Study
Many duration models are examples of the transformation model. Proportional hazard models and mixed proportional hazard models are all nested in transformation models.⁸ We use these two models to conduct the following simulation study.

⁸Proportional hazard models assume the error distribution is fixed to be a negative extreme-value distribution, whereas the transformation function (or baseline hazard) remains nonparametric. Mixed proportional hazard models are more general, but still restrictive; for instance, the normal distribution is not allowed for the error distribution (Ridder, 1990).
[Figure 1. CDF of the error $\varepsilon$ under Designs 1, 2 and 3; plot omitted.]
We consider three designs. The transformation function $T(y) = \log y$ is chosen for data generation. However, note that all three estimators are numerically invariant even if $Y$ is transformed by any other monotone function. The data are generated from the following equation:

$\log Y = X_1 - \theta_1 X_2 - \theta_2 X_3 - \varepsilon$, for $(\theta_1, \theta_2) = (1, 1)$.

This specification is shared by all three designs. Further, we fix the distribution of $(X_1, X_2, X_3)$: $X_1$ and $X_2$ are standard normal random variables and $X_3$ is a binary random variable with equal probabilities of being 0 or 1. $(X_1, X_2, X_3)$ are mutually independent. Across the three designs, we vary only the distribution of $\varepsilon$. This is summarized below:

Design 1: $\varepsilon \sim EV(0,1)$;
Design 2: $\varepsilon \stackrel{d}{=} \log v + u$, for $v \sim \Gamma(1,1)$ and $u \sim EV(0,1)$;
Design 3: $\varepsilon \stackrel{d}{=} \log v + u$, for $v \sim \Gamma(3,3)$ and $u \sim EV(0,1)$;

where $EV(0,1)$ means the standard extreme-value distribution with cdf $F(z) = \exp(-\exp(-z))$, and $\Gamma(\mu, \sigma)$ denotes the gamma distribution with mean $\mu$ and variance $\sigma^2$. Design 1 conforms to the proportional hazard model. Designs 2 and 3 belong to the mixed proportional hazard model, or frailty model. As the additional random error $v$ follows a gamma distribution, they are also called gamma frailty models.
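A data-generating sketch for the three designs follows. The minus signs in front of $\theta_1 X_2$, $\theta_2 X_3$ and $\varepsilon$ follow the equation as reconstructed above and should be read as an assumption of this sketch; the gamma parametrizations translate the mean/variance pairs $(1,1)$ and $(3,9)$ into NumPy's shape/scale convention.

```python
import numpy as np

def simulate(design, n, rng):
    """Draw (Y, X) from log Y = X1 - X2 - X3 - eps, (theta1, theta2) = (1, 1)."""
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    x3 = rng.integers(0, 2, size=n).astype(float)    # binary with P = 1/2
    u = rng.gumbel(size=n)                           # EV(0,1): cdf exp(-exp(-z))
    if design == 1:
        eps = u
    elif design == 2:
        eps = np.log(rng.gamma(shape=1.0, scale=1.0, size=n)) + u  # mean 1, var 1
    else:
        eps = np.log(rng.gamma(shape=1.0, scale=3.0, size=n)) + u  # mean 3, var 9
    y = np.exp(x1 - x2 - x3 - eps)
    return y, np.column_stack([x1, x2, x3])

# Example: one sample of size 300 from Design 2.
y, x = simulate(2, 300, np.random.default_rng(0))
```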
The finite-sample distributions of several estimators are compared. Let us call the sieve extremum estimator developed in this paper the sieve estimator. We compare the sieve estimator with two other estimators: the Cox estimator for the proportional hazard model and the MRC estimator of Han (1987). Note that the Cox estimator is mis-specified for Design 2 and Design 3. We still report its results because the Cox model is widely used in empirical research.
For each design, we generate samples of size 100 and 300. Then the parameter $\theta = (\theta_1, \theta_2)$ is estimated by (i) the Cox estimator, (ii) the sieve estimator, and (iii) the MRC estimator. Regarding the sieve estimator, we vary the dimension of the sieve space over $k = 3, 5, 7$. The estimation procedure is repeated 500 times, and we report the sample bias and the sample mean squared error (MSE) from the 500 estimates of the five estimators. To implement the sieve estimator, the sieve $\mathcal{F}_k$ needs to be specified. We choose I-splines as basis functions. Ramsay (1988) explains the construction of I-splines. What is useful about I-splines is that each basis function is a cdf of some continuous random variable. Therefore, it is easy to tune $\mathcal{F}_k$ to our purpose of estimating a symmetric cdf. We construct $\mathcal{F}_k$ to contain only symmetric cdfs built from I-spline bases. The dimension of $\mathcal{F}_k$ equals the index $k$.
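One way to build such a symmetric sieve is sketched below: I-splines are obtained as rescaled antiderivatives of a B-spline basis on $[0, z_{\max}]$, and each is then reflected around zero so that any convex combination is a symmetric cdf. The knot grid, spline degree and truncation point are illustrative choices of ours, not the paper's.

```python
import numpy as np
from scipy.interpolate import BSpline

def ispline_basis(knots, degree=3):
    """I-splines on [knots[0], knots[-1]]: antiderivatives of B-splines,
    rescaled to run from 0 to 1, so each is a cdf on the knot interval."""
    t = np.r_[[knots[0]] * (degree + 1), knots[1:-1], [knots[-1]] * (degree + 1)]
    k = len(t) - degree - 1                  # number of B-spline basis functions
    basis = []
    for j in range(k):
        c = np.zeros(k); c[j] = 1.0
        I = BSpline(t, c, degree).antiderivative()
        lo, hi = I(knots[0]), I(knots[-1])
        basis.append(lambda z, I=I, lo=lo, hi=hi:
                     (I(np.clip(z, knots[0], knots[-1])) - lo) / (hi - lo))
    return basis

def symmetric_cdf(z, w, basis):
    """F(z) = 1/2 + sign(z) * G(|z|) / 2 with G a convex combination of
    I-splines; then F(-z) = 1 - F(z), i.e. F is a symmetric cdf."""
    z = np.asarray(z, dtype=float)
    G = sum(wj * bj(np.abs(z)) for wj, bj in zip(w, basis))
    return 0.5 + 0.5 * np.sign(z) * G

basis = ispline_basis(np.linspace(0.0, 4.0, 5))
w = np.full(len(basis), 1.0 / len(basis))    # nonnegative weights summing to 1
print(symmetric_cdf(np.array([-1.0, 0.0, 1.0]), w, basis))
```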
The simulation results are summarized in Figures 2-7. In each figure, the left panel shows the bias and the right panel shows the MSE. Bias1 indicates the bias in estimating $\theta_1$; Bias2 is for $\theta_2$. MSE1 and MSE2 also correspond to $\theta_1$ and $\theta_2$, respectively. Design 1 provides a good benchmark for our estimator, because there the Cox estimator is correctly specified and has one less infinite-dimensional parameter. Not surprisingly, the Cox estimator shows the smallest MSE. Our estimator behaves comparably well. The efficiency loss of our estimator relative to the Cox estimator seems bearable when considering Design 2 and Design 3: the Cox estimator shows a large bias in these mis-specified designs. By contrast, the sieve estimator performs well across all three designs. Compared to the MRC estimator, the sieve estimator shows smaller MSE, especially for the smaller sample size of $n = 100$. We also notice that the sieve estimator is not sensitive to the different sieve-size indexes $k \in \{3, 5, 7\}$. In summary, we find that the sieve estimator behaves well, even for a small sample size.
7. Conclusion
The intuition that binary comparisons specify an ordering is used to identify the transformation model. A new estimator is constructed from the identification result. Its asymptotic distribution is derived, and the bootstrap inference is justified. As technical by-products, we contribute to the literature on sieve estimation by studying a U-process problem and by showing how to handle the single-index structure in a semi-nonparametric problem.

Several important extensions are possible. Regarding applications to duration models, we may extend the current method to account for censoring and time-varying regressors. Another direction is to consider competing risks models. We hope to study these extensions in future research.
[Figures 2-7. Simulation results: Figure 2, Design 1 with n = 100; Figure 3, Design 1 with n = 300; Figure 4, Design 2 with n = 100; Figure 5, Design 2 with n = 300; Figure 6, Design 3 with n = 100; Figure 7, Design 3 with n = 300. Each figure reports the bias (left panel: Bias1, Bias2) and the MSE (right panel: MSE1, MSE2) of the Cox, Sieve3, Sieve5, Sieve7 and MRC estimators; plots omitted.]
Appendix A. Proofs

A.1. Notations. We define and use several norms throughout the appendix.

$\|\cdot\|_{\kappa,\infty,\omega}$ : $\|F\|_{\kappa,\infty,\omega} = \max_{0 \leq i \leq \kappa} \sup_{z \in \mathbb{R}} |\frac{d^i}{dz^i}F(z)|(1+z^2)^{\omega/2}$
$\|\cdot\|_{\kappa,\infty}$ : $\|F\|_{\kappa,\infty} = \max_{0 \leq i \leq \kappa} \sup_{z \in \mathbb{R}} |\frac{d^i}{dz^i}F(z)|$
$\|\cdot\|_\infty$ : $\|F\|_\infty = \sup_{z \in \mathbb{R}} |F(z)|$
$\|\cdot\|_{L^\infty(P)}$ : $\|X\|_{L^\infty(P)}$ is the essential supremum of the random variable $X$
$\|\cdot\|_{L^p(P)}$ : $\|X\|_{L^p(P)} = \{E|X|^p\}^{1/p}$ for any integer $p \geq 1$
$\|\cdot\|_{\mathcal{F},c}$ : $\|F\|_{\mathcal{F},c} = \|F(Z_0)\|_{L^\infty(P)} + \|F'(Z_0)\|_{L^\infty(P)}$ for $Z_0 = \Delta X'\beta_0$
$\|\cdot\|_{e,\kappa,\infty}$ : $\|\alpha\|_{e,\kappa,\infty} = |\theta| + \|F\|_{\kappa,\infty}$
$\|\cdot\|_{e,\infty}$ : $\|\alpha\|_{e,\infty} = |\theta| + \|F\|_\infty$
$\|\cdot\|_c$ : $\|\alpha\|_c = |\theta| + \|F\|_{\mathcal{F},c}$
$\|\cdot\|_q$ : defined in (7)
$\|\cdot\|_{e,L^p}$ : $\|\alpha\|_{e,L^p} = |\theta| + \|F(Z_0)\|_{L^p(P)}$
Other notations used in the appendix are gathered below.

$Z_0$ : a scalar random variable such that $Z_0 = \Delta X'\beta_0$
$a \lesssim b$ : $a \leq Kb$ for a universal constant $K$ not depending on $a$ or $b$
$a \asymp b$ : $a \lesssim b$ and $a \gtrsim b$
$N(\varepsilon, \mathcal{F}, \|\cdot\|)$ : the covering number⁹ of size $\varepsilon$ for a set $\mathcal{F}$ under the norm $\|\cdot\|$
$N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ : the bracketing number of size $\varepsilon$ for a function class $\mathcal{F}$ under the norm $\|\cdot\|$
$C_1, C_2, \ldots$ : generic positive constants which do not depend on the context of the proof
$\kappa$ : the degree of smoothness of $\mathcal{F}$; see Assumption 3.2
$\omega$ : the constant for the weighting function $(1+z^2)^{\omega/2}$; see Assumption 3.2
$\varphi_j$ : see Assumption 5.4
$\delta_n$ : see Remark A.17

⁹See p. 83 of van der Vaart and Wellner (1996) for the precise definition.
A.2. Proof for Section 2.

Lemma A.1. Suppose Assumptions 2.1-2.4 hold. Suppose $\beta \in \{-1, 1\} \times \Theta$ and $F \in \mathcal{F}$. If $F(x'\beta) = F_0(x'\beta_0)$ for any $x \in \operatorname{supp}\Delta X$, then $\beta = \beta_0$ and $F(z) = F_0(z)$ for any $z \in \operatorname{supp}\Delta X'\beta_0$.

Proof. Note $0 \in \operatorname{supp}\Delta X_d$. Hence, if $x = (x_c, 0)$, by Assumption 2.3(ii), we have

(13) $F(x_c'\beta_c) = F_0(x_c'\beta_{0,c})$ for any $x_c \in \operatorname{supp}\Delta X_c$.

As $\Delta\varepsilon$ is the difference of two i.i.d. continuous random variables, 0 is an interior point of $\operatorname{supp}\Delta\varepsilon$. Regarding $\operatorname{supp}\Delta X_c$, the same holds by Assumption 2.3(i). We know $\beta_c \neq 0$ and $\beta_{0,c} \neq 0$ since $|\beta_{0,1}| = 1$. Observe that $0 \in \mathbb{R}$ is an interior point of both $\operatorname{supp}\Delta X_c'\beta_c$ and $\operatorname{supp}\Delta X_c'\beta_{0,c}$. Therefore we can find an open neighborhood of 0, denoted by $N_c$, such that

(14) $N_c \subset \operatorname{supp}\Delta\varepsilon \cap \operatorname{supp}\Delta X_c'\beta_c \cap \operatorname{supp}\Delta X_c'\beta_{0,c}$.

We first show $F$ is strictly increasing on $N_c$. Suppose not. Then find two points $x_1, x_2 \in \operatorname{supp}\Delta X_c$, with $x_1'\beta_c, x_2'\beta_c \in N_c$, having the following three properties: (i) $x_1$ and $x_2$ differ only in the first coordinate, say $x_{1,1} \neq x_{2,1}$; (ii) $x_{1,1} < x_{2,1}$; (iii) $F(x_1'\beta_c) \geq F(x_2'\beta_c)$. Because $F_0$ is strictly increasing on $N_c$, $F_0(x_1'\beta_{0,c}) < F_0(x_2'\beta_{0,c})$. Then either $F(x_1'\beta_c) \neq F_0(x_1'\beta_{0,c})$ or $F(x_2'\beta_c) \neq F_0(x_2'\beta_{0,c})$. This contradicts the condition of the lemma. As such, $F$ is strictly increasing on $N_c$. Next, we prove $\beta_c = \beta_{0,c}$. Suppose not. Find two points $x_1, x_2 \in \operatorname{supp}\Delta X_c$ such that $x_1'\beta_c > x_2'\beta_c$ and $x_1'\beta_{0,c} < x_2'\beta_{0,c}$. By the strict monotonicity of $F$ and $F_0$ on $N_c$, $F(x_1'\beta_c) > F(x_2'\beta_c)$ and $F_0(x_1'\beta_{0,c}) < F_0(x_2'\beta_{0,c})$. We reach a contradiction and conclude $\beta_c = \beta_{0,c}$. Then by (13), we can infer that $F(z) = F_0(z)$ for any $z \in \operatorname{supp}\Delta X_c'\beta_{0,c}$. So far, $\beta_{0,c}$ is identified and $F_0$ is identified only on $\operatorname{supp}\Delta X_c'\beta_{0,c}$.

We move on to the identification of $\beta_{0,d}$. To this end, we find the values of $\{\gamma_j'\beta_{0,d}\}_{j=1}^{d_x - d_c}$; for the definition of $\gamma_j$, see Assumption 2.4. Start with $j = 1$. By Assumption 2.4, there are two points $x_1, x_2 \in \operatorname{supp}\Delta X_c$ such that $x_1'\beta_{0,c} = x_2'\beta_{0,c} + \gamma_1'\beta_{0,d} \in N_1$. Because $F_0$ is strictly increasing on $N_1$ and $F = F_0$ on $N_1$, it follows that $F(x_1'\beta_{0,c}) \gtrless F(z)$ if $x_1'\beta_{0,c} \gtrless z$. In other words,

(15) $x_1'\beta_{0,c} = z$ if $F(x_1'\beta_{0,c}) = F(z)$.

By the condition of the lemma,

(16) $F(x_1'\beta_{0,c}) = F_0(x_1'\beta_{0,c}) = F_0(x_2'\beta_{0,c} + \gamma_1'\beta_{0,d}) = F(x_2'\beta_{0,c} + \gamma_1'\beta_d)$.

From (15) and (16), we see that $\gamma_1'\beta_d = x_1'\beta_{0,c} - x_2'\beta_{0,c} = \gamma_1'\beta_{0,d}$ and

$F(z) = F_0(z)$ on $z \in \operatorname{supp}\Delta X_c'\beta_{0,c} \cup \{x'\beta_{0,c} + \gamma_1'\beta_{0,d} : x \in \operatorname{supp}\Delta X_c\}$.

Repeat the same argument for each $j$ to identify the other $\gamma_j'\beta_{0,d}$'s. Then we identify $\{\gamma_j'\beta_{0,d}\}_{j=1}^{d_x - d_c}$. As the last step, we note that, since $\{\gamma_1, \ldots, \gamma_{d_x - d_c}\}$ is linearly independent, $\beta_{0,d} \in \mathbb{R}^{d_x - d_c}$ is identified. Conclude that $\beta = \beta_0$ and that $F(z) = F_0(z)$ for any $z \in \operatorname{supp}\Delta X'\beta_0$. $\square$
Proof of Theorem 2.1. We know $P(\Delta Y > 0 \mid X_1, X_2) = F_0(\Delta X'\beta_0)$. By this fact and the law of iterated expectations,

$Q(b, \alpha) = E[E[\mathbf{1}\{\Delta Y > 0\}\{1 - 2F(\Delta X'\beta)\} \mid X_1, X_2] + F(\Delta X'\beta)^2]$
$\quad = E[F_0(\Delta X'\beta_0)(1 - 2F(\Delta X'\beta)) + F(\Delta X'\beta)^2]$.

The last expectation can be simplified to the sum of $E[\{F_0(\Delta X'\beta_0) - F(\Delta X'\beta)\}^2]$ and a constant not depending on the parameters. From this observation, it is obvious that $Q(\alpha)$ is minimized only if $F_0(\Delta X'\beta_0) = F(\Delta X'\beta)$ almost surely. Lemma A.1 proves that if $F_0(\Delta X'\beta_0) = F(\Delta X'\beta)$, then $\beta = \beta_0$ and $F(z) = F_0(z)$ for any $z \in \operatorname{supp}\Delta X'\beta_0$. Hence we conclude. $\square$
A.3. Proof for Section 3.

Remark A.2 (The constant $\bar B$). Note that $\|F\|_{\kappa,\infty}$ is uniformly bounded for any $F \in \mathcal{F}$. By the triangle inequality and Assumption 3.2,

$\|F\|_{\kappa,\infty} \leq \|F - F_0\|_{\kappa,\infty} + \|F_0\|_{\kappa,\infty} \leq B + \|F_0\|_{\kappa,\infty}$.

The second inequality holds because the weighting function is strictly larger than 1. As $\|F_0\|_{\kappa,\infty}$ is bounded by Assumption 3.2, $\|F\|_{\kappa,\infty}$ is bounded by the universal constant $B + \|F_0\|_{\kappa,\infty}$. We denote $\bar B = B + \|F_0\|_{\kappa,\infty}$.
Lemma A.3. Under Assumptions 3.1(ii) and 3.2(ii), for any $\alpha_1, \alpha_2 \in A$ and $v_i = (d_i, y_i, x_i')' \in \operatorname{supp} V_i$,

(17) $|h(\alpha_1; v_1, v_2) - h(\alpha_2; v_1, v_2)| \lesssim (|x_1| + |x_2| + 1)\|\alpha_1 - \alpha_2\|_{e,\infty}$.

Proof. We use the notations $\Delta x$, $\Delta\tilde x$ below; they are defined similarly to $\Delta X$, $\Delta\tilde X$. Observe

(18) $|h(\alpha_1; v_1, v_2) - h(\alpha_2; v_1, v_2)| = d_1 d_2\,|2\mathbf{1}\{\Delta y > 0\} - F_1(\Delta x'\beta_1) - F_2(\Delta x'\beta_2)|\;|F_1(\Delta x'\beta_1) - F_2(\Delta x'\beta_2)| \lesssim |F_1(\Delta x'\beta_1) - F_2(\Delta x'\beta_2)|$,

where the inequality holds by Remark A.2 and the fact that $d_1$ is a binary variable. By a Taylor expansion after an obvious expansion, $|F_1(\Delta x'\beta_1) - F_2(\Delta x'\beta_2)|$ is equal to

$|F_1'(z^*)\,\Delta\tilde x'(\theta_1 - \theta_2) + F_1(\Delta x'\beta_2) - F_2(\Delta x'\beta_2)|$

for some $z^*$ between $\Delta x'\beta_1$ and $\Delta x'\beta_2$. Since $\|F_1'\|_\infty \leq \bar B$ by Remark A.2, using Hölder's inequality, we have

(19) $|h(\alpha_1; v_1, v_2) - h(\alpha_2; v_1, v_2)| \lesssim \bar B|\Delta\tilde x|\,|\theta_1 - \theta_2| + |F_1(\Delta x'\beta_2) - F_2(\Delta x'\beta_2)| \lesssim (|x_1| + |x_2| + 1)\big\{|\theta_1 - \theta_2| + |F_1(\Delta x'\beta_2) - F_2(\Delta x'\beta_2)|\big\}$,

where the second inequality holds because $\bar B|\Delta\tilde x| + 1 \lesssim |x_1| + |x_2| + 1$. The result (17) follows from (19). $\square$
Lemma A.4. Under Assumptions 3.1(ii) and 3.2(ii),

$|Q(\alpha_1) - Q(\alpha_2)| \lesssim \|\alpha_1 - \alpha_2\|_{e,\infty}$

for any $\alpha_1, \alpha_2 \in A$.

Proof. By Jensen's inequality, $|Q(\alpha_1) - Q(\alpha_2)| \leq E|h(\alpha_1; V_1, V_2) - h(\alpha_2; V_1, V_2)|$. The claim follows by Lemma A.3. $\square$
Lemma A.5. Under Assumptions 3.1-3.2, $\mathcal{F}$ is compact in the $\|\cdot\|_{1,\infty}$-norm and $A$ is compact in the $\|\cdot\|_{e,1,\infty}$-norm.

Proof. We recall Lemma A.4 of Gallant and Nychka (1987); let us call it GN. Let $\delta = 0$, $\delta_0 = \omega$, $m = 1$, $m_0 = \kappa$ and $k = 1$ in the cited lemma. Although one of the conditions there is that $0 < \delta < \delta_0$, it can be learnt from the proof that $\delta$ can be zero (and indeed can be negative). The set $\mathcal{F}$ defined in Assumption 3.2 is smaller than the corresponding set in the cited lemma; note we define $\mathcal{F}$ as a $\|\cdot\|_{\kappa,\infty,\omega}$-ball of radius $B$, whereas GN set up $\mathcal{F}$ as a ball in an $L^2$-type norm defined similarly to $\|\cdot\|_{\kappa,\infty,\omega}$. All other conditions of GN are included verbatim in Assumptions 3.1-3.2. Therefore, we know $\mathcal{F}$ is relatively compact in the $\|\cdot\|_{1,\infty}$-norm. As $\mathcal{F}$ is closed in this norm by Assumption 3.2, $\mathcal{F}$ is compact in the $\|\cdot\|_{1,\infty}$-norm. The second claim follows immediately. $\square$
Lemma A.6. Suppose Assumptions 3.1-3.2 hold. Let $\varepsilon > 0$ be small enough. Then

(20) $\log N(\varepsilon, A, \|\cdot\|_{e,\infty}) \lesssim -\log\varepsilon + \varepsilon^{-\gamma}$, $\quad \gamma = \dfrac{\kappa + \omega}{\kappa\omega}$.

Proof. The following inequality is immediate from the definitions of the covering number and the norm $\|\cdot\|_{e,\infty}$:

$N(\varepsilon, A, \|\cdot\|_{e,\infty}) \leq N(\varepsilon/2, \Theta, |\cdot|) \times N(\varepsilon/2, \mathcal{F}, \|\cdot\|_\infty)$.

Because $\Theta$ is compact, the $\varepsilon/2$-covering number of $\Theta$ is proportional to $\varepsilon^{-d_\theta}$. As such, ignoring constant terms,

$\log N(\varepsilon/2, \Theta, |\cdot|) \lesssim -\log\varepsilon$.

Denote $C_B^{\kappa,\omega} = \{F : \|F\|_{\kappa,\infty,\omega} \leq B\}$. By Lemma A.3 of Santos (2012), for some $\bar\varepsilon > 0$, if $\varepsilon < \bar\varepsilon$, then $\log N(\varepsilon, C_B^{\kappa,\omega}, \|\cdot\|_\infty) \lesssim \varepsilon^{-\gamma}$. Since, by Assumption 3.2,

$\{F - F_0 : F \in \mathcal{F}\} \subset C_B^{\kappa,\omega}$,

it follows that $N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty) \leq N(\varepsilon, C_B^{\kappa,\omega}, \|\cdot\|_\infty)$. Hence the claim is shown. $\square$

Remark A.7. When we use Lemma A.6, we ignore the fact that it holds only for small $\varepsilon$. This is a harmless simplification.
Lemma A.8. Under Assumptions 3.1-3.2, $\sup_{\alpha \in A} |Q_n(\alpha) - Q(\alpha)| \stackrel{p}{\to} 0$ as $n \to \infty$.

Proof. Let $\mathcal{H} = \{h(\alpha; \cdot, \cdot) : \alpha \in A\}$. By Lemma A.3,

$E|h(\alpha_1; V_1, V_2) - h(\alpha_2; V_1, V_2)| \leq K\|\alpha_1 - \alpha_2\|_{e,\infty}$

for some positive number $K > 0$. Then we can apply Theorem 2.7.11 of van der Vaart and Wellner (1996) to obtain, for any $\varepsilon > 0$,

$N_{[\,]}(\varepsilon, \mathcal{H}, L_1(P)) \leq N(\varepsilon/(2K), A, \|\cdot\|_{e,\infty})$.

Lemma A.6 implies $N(\varepsilon/(2K), A, \|\cdot\|_{e,\infty})$ is finite. Then by Corollary 5.2.5 of de la Peña and Giné (1999) (a U-process ULLN), the claim is proven. $\square$
Lemma A.9. Suppose Assumptions 3.1-3.2 hold. Let $N = \{\alpha^* \in A : \|\alpha^* - \alpha_0\|_c = 0\}$. Also let

$N_\varepsilon = \bigcap_{\alpha^* \in N} \{\alpha \in A : \|\alpha - \alpha^*\|_{e,1,\infty} < \varepsilon\}$.

Then for any $\varepsilon > 0$,

$Q(\alpha_0) < \inf_{\alpha \in A \setminus N_\varepsilon} Q(\alpha)$.

Proof. The set $N_\varepsilon$ is an intersection of open $\varepsilon$-balls in $(A, \|\cdot\|_{e,1,\infty})$, and hence is itself open in $(A, \|\cdot\|_{e,1,\infty})$. As $A$ is compact in the $\|\cdot\|_{e,2,\infty}$-norm, it is also compact in the weaker norm $\|\cdot\|_{e,1,\infty}$. See that $A \setminus N_\varepsilon$ is compact in the $\|\cdot\|_{e,1,\infty}$-norm. By the extreme value theorem, there is $\alpha_\varepsilon \in A \setminus N_\varepsilon$ such that $Q(\alpha_\varepsilon) = \inf_{\alpha \in A \setminus N_\varepsilon} Q(\alpha)$ for any $\varepsilon > 0$. By Theorem 2.1, $Q(\alpha_0) = Q(\alpha_\varepsilon)$ only if $\alpha_\varepsilon \in N$. Conclude $Q(\alpha_0) < Q(\alpha_\varepsilon)$. $\square$
Lemma A.10. Under Assumptions 3.1(ii) and 3.2(ii),

$|m(\alpha_1, v) - m(\alpha_2, v)| \lesssim (|x| + 1)\Big\{|\theta_1 - \theta_2| + \sup_{z \in \operatorname{supp} Z_0}|F_1(z) - F_2(z)|\Big\}$

for any $\alpha_1, \alpha_2 \in A$ and $v = (d, y, x')' \in \operatorname{supp} V_i$.

Proof. By the triangle inequality and Jensen's inequality,

$|m(\alpha_1, v) - m(\alpha_2, v)| \leq E|h(\alpha_1; V_1, v) - h(\alpha_2; V_1, v)| + E|h(\alpha_1; v, V_2) - h(\alpha_2; v, V_2)| + |Q(\alpha_1) - Q(\alpha_2)|$.

By Lemma A.3,

$E|h(\alpha_1; V_1, v) - h(\alpha_2; V_1, v)| \lesssim (E|X_1| + |x| + 1)\Big\{|\theta_1 - \theta_2| + \sup_{z \in \operatorname{supp} Z_0}|F_1(z) - F_2(z)|\Big\}$.

Note $E|X_1| + |x| + 1 \lesssim |x| + 1$. The same inequality holds for the second term. The third term is bounded by Lemma A.4. The result then follows. $\square$
Lemma A.11. Under Assumptions 3.1(ii) and 3.2(ii),

$|g(\alpha_1; v_1, v_2) - g(\alpha_2; v_1, v_2)| \lesssim (|x_1| + |x_2| + 1)\Big\{|\theta_1 - \theta_2| + \sup_{z \in \operatorname{supp} Z_0}|F_1(z) - F_2(z)|\Big\}$

for any $\alpha_1, \alpha_2 \in A$ and any $v_1, v_2 \in \operatorname{supp} V_i$.

Proof. This follows immediately from Lemmas A.3, A.10 and A.4. $\square$
Proof of Theorem 3.1. Define $N_\varepsilon$ as in Lemma A.9. Note that for any $\varepsilon > 0$,

$\|\hat\alpha_n - \alpha_0\|_c \geq \varepsilon \Rightarrow \|\hat\alpha_n - \alpha_0\|_{e,1,\infty} \geq \varepsilon \Rightarrow \hat\alpha_n \in A \setminus N_\varepsilon \Leftrightarrow \hat\alpha_n \in A_{k_n} \setminus N_\varepsilon$,

where the arrow $\Rightarrow$ means the "only if" relation. Therefore, it is enough to show $P[\hat\alpha_n \in A_{k_n} \setminus N_\varepsilon] \to 0$. Observe

(21) $P[\hat\alpha_n \in A_{k_n} \setminus N_\varepsilon] \leq P\Big[\inf_{\alpha \in A_{k_n} \setminus N_\varepsilon} Q_n(\alpha) \leq Q_n(\pi_{k_n}\alpha_0)\Big] \leq P\Big[\inf_{\alpha \in A \setminus N_\varepsilon} Q_n(\alpha) \leq Q_n(\pi_{k_n}\alpha_0)\Big]$.

By the continuous mapping theorem and Lemma A.8,

$\inf_{\alpha \in A \setminus N_\varepsilon} Q_n(\alpha) \stackrel{p}{\to} \inf_{\alpha \in A \setminus N_\varepsilon} Q(\alpha)$.

Write

$Q_n(\pi_{k_n}\alpha_0) = Q_n(\pi_{k_n}\alpha_0) - Q(\pi_{k_n}\alpha_0) + Q(\pi_{k_n}\alpha_0) - Q(\alpha_0) + Q(\alpha_0)$.

Note that $Q_n(\pi_{k_n}\alpha_0) - Q(\pi_{k_n}\alpha_0) \stackrel{p}{\to} 0$ by Lemma A.8. In addition, it is true that $Q(\pi_{k_n}\alpha_0) - Q(\alpha_0) \to 0$, by the continuity of $Q$ in $\|\cdot\|_{e,\infty}$ (Lemma A.4). Hence, we see that $Q_n(\pi_{k_n}\alpha_0) \stackrel{p}{\to} Q(\alpha_0)$. Therefore, it follows that

(22) $P\Big[\inf_{\alpha \in A \setminus N_\varepsilon} Q_n(\alpha) \leq Q_n(\pi_{k_n}\alpha_0)\Big] = P\Big[\inf_{\alpha \in A \setminus N_\varepsilon} Q(\alpha) + o_p(1) \leq Q(\alpha_0)\Big]$.

By Lemma A.9, (22) converges to zero. This shows $P[\hat\alpha_n \in A_{k_n} \setminus N_\varepsilon] \to 0$ and therefore $\|\hat\alpha_n - \alpha_0\|_c \stackrel{p}{\to} 0$. $\square$
A.4. Proof for Section 4.

Lemma A.12. Under Assumptions 3.1-3.2,

(23) $E\Big[\sup_{\alpha \in A}\Big|\dfrac{1}{n}\displaystyle\sum_{i \neq j} g(\alpha; V_i, V_j)\Big|\Big] = O(1)$.

Proof. Let us write $g(\alpha) = g(\alpha; V_i, V_j)$ and $\mathcal{G} = \{g(\alpha) : \alpha \in A\}$. Also define

(24) $d_n(g(\alpha_1), g(\alpha_2)) = \Big\{\dfrac{1}{n^2}\displaystyle\sum_{i \neq j} |g(\alpha_1) - g(\alpha_2)|^2\Big\}^{1/2}$, $\quad D_n = \sup_{\alpha_1, \alpha_2 \in A} d_n(g(\alpha_1), g(\alpha_2))$.

By Theorem B.1, the left side of (23) is bounded above by

(25) $\{E[g(\alpha_0)^2]\}^{1/2} + E\Big[\displaystyle\int_0^{D_n} \log N(\tau, \mathcal{G}, d_n)\,d\tau\Big]$.

Notice $E[g(\alpha_0)^2]$ is finite. We are left to show that the second term of (25) is bounded. By Lemma A.11, for some $c > 0$,

(26) $d_n(g(\alpha_1), g(\alpha_2)) \leq c\,\|\alpha_1 - \alpha_2\|_{e,\infty}\Big\{\dfrac{1}{n^2}\displaystyle\sum_{i \neq j}(|X_i| + |X_j| + 1)^2\Big\}^{1/2}$.

Because $\|\alpha_1 - \alpha_2\|_{e,\infty}$ is uniformly bounded for any $\alpha_1, \alpha_2 \in A$,

(27) $D_n \lesssim L_n$ for $L_n = \Big\{\dfrac{1}{n^2}\displaystyle\sum_{i \neq j}(|X_i| + |X_j| + 1)^2\Big\}^{1/2}$.

By Theorem 2.7.11 of van der Vaart and Wellner (1996), $N(\tau, \mathcal{G}, d_n) \leq N(\tau c^{-1}L_n^{-1}, A, \|\cdot\|_{e,\infty})$. Then by Lemma A.6,

(28) $\log N(\tau, \mathcal{G}, d_n) \leq \log N(\tau c^{-1}L_n^{-1}, A, \|\cdot\|_{e,\infty}) \lesssim 1 - \log\tau + \log L_n + \tau^{-\gamma}L_n^{\gamma}$,

where $\gamma = (\kappa + \omega)/(\kappa\omega)$. Consequently,

(29) $\displaystyle\int_0^{D_n} \log N(\tau, \mathcal{G}, d_n)\,d\tau \lesssim D_n + 1 + D_n\log L_n + D_n^{1-\gamma}L_n^{\gamma} \lesssim 1 + L_n + L_n^2$.

The first inequality of (29) follows from the facts that $\int_0^D (-\log\tau)\,d\tau \lesssim D \vee 1$ and $\int_0^D \tau^{-\gamma}\,d\tau \lesssim D^{1-\gamma}$ for any $D \geq 0$. The second inequality holds by (27) and the elementary inequality $x\log x \leq x^2$. Note $E[L_n^2] \lesssim E[X_1^2 + X_2^2 + 1] < \infty$ and $E[L_n] \leq \sqrt{E[L_n^2]} < \infty$. Therefore the last term of (25) is bounded, and we obtain (23). $\square$
Lemma A.13. Under Assumptions 3.1-3.2,

(30) $\dfrac{1}{n}\displaystyle\sum_{i=1}^n m(\hat\alpha_n, V_i) \leq \min_{\alpha \in A_{k_n}} \dfrac{1}{n}\displaystyle\sum_{i=1}^n m(\alpha, V_i) + O_p\Big(\dfrac{1}{n}\Big)$.

Proof. By the Markov inequality,

(31) $P\Big[\sup_{\alpha \in A}\Big|\dfrac{1}{n}\displaystyle\sum_{i \neq j} g(\alpha; V_i, V_j)\Big| > \varepsilon\Big] \leq \varepsilon^{-1}\, E\Big[\sup_{\alpha \in A}\Big|\dfrac{1}{n}\displaystyle\sum_{i \neq j} g(\alpha; V_i, V_j)\Big|\Big]$.

By Lemma A.12, the right-hand side of (31) is uniformly bounded over $n \in \mathbb{N}$. From the definition of $\hat\alpha_n$ in (6) and the decomposition (8), the claim (30) follows. $\square$
Lemma A.14 (Norm equivalence). Under Assumptions 3.1 and 4.3, $\|\cdot\|_q$ and $\|\cdot\|_{e,L^2(P)}$ are equivalent norms.

Proof. For an upper bound on $\|\cdot\|_q$,

(32) $\|\alpha\|_q = \|F_0'(Z_0)\Delta\tilde X'\theta + F(Z_0)\|_{L^2(P)} \leq \{E[F_0'(Z_0)\Delta\tilde X'\theta]^2\}^{1/2} + \{E[F(Z_0)]^2\}^{1/2} \leq \bar B\,\{\theta' E[\Delta\tilde X \Delta\tilde X']\theta\}^{1/2} + \|F(Z_0)\|_{L^2(P)} \lesssim \|\alpha\|_{e,L^2}$,

where the second inequality is a result of Hölder's inequality. For a lower bound, observe

(33) $\|\alpha\|_q^2 = E[\{F_0'(Z_0)\{\Delta\tilde X - E[\Delta\tilde X \mid Z_0]\}'\theta\}^2] + E[\{F_0'(Z_0)E[\Delta\tilde X \mid Z_0]'\theta + F(Z_0)\}^2]$,

where the equality is valid because the cross-product term not showing up in (33) vanishes by the law of iterated expectations. (33) is bounded below by

(34) $\sigma\,\theta'\theta + E[\{F_0'(Z_0)E[\Delta\tilde X \mid Z_0]'\theta + F(Z_0)\}^2]$,

where $\sigma$ is the smallest eigenvalue of $\Sigma$. By Assumption 4.3, $\sigma > 0$. Let $\tau > 0$ be the largest eigenvalue of $E[F_0'(Z_0)^2 E[\Delta\tilde X \mid Z_0]E[\Delta\tilde X \mid Z_0]']$. Then

(35) $\sigma\,\theta'\theta \geq \dfrac{\sigma}{\tau}\,\theta'\,E\big[F_0'(Z_0)^2 E[\Delta\tilde X \mid Z_0]E[\Delta\tilde X \mid Z_0]'\big]\,\theta$.

Using (35), bound (34) below by

(36) $\dfrac{\sigma}{2}\,\theta'\theta + \dfrac{\sigma}{2\tau}\,E[\{F_0'(Z_0)E[\Delta\tilde X \mid Z_0]'\theta\}^2] + E[\{F_0'(Z_0)E[\Delta\tilde X \mid Z_0]'\theta + F(Z_0)\}^2]$.

By the elementary inequality $a^2 + b^2 \geq \frac{1}{2}(a - b)^2$, the expression (36) is further bounded below by

(37) $\dfrac{\sigma}{2}\,\theta'\theta + \dfrac{\sigma}{4\tau}\,E[F(Z_0)]^2 \gtrsim |\theta|^2 + \|F(Z_0)\|_{L^2(P)}^2$.

And by Jensen's inequality,

$\{|\theta|^2 + \|F\|_{L^2(P)}^2\}^{1/2} \geq \dfrac{1}{\sqrt 2}\{|\theta| + \|F\|_{L^2(P)}\}$.

Therefore we have shown $\|\alpha\|_{e,L^2(P)} \lesssim \|\alpha\|_q$. This, combined with the inequality (32), shows that the two norms are equivalent. $\square$
Lemma A.15. Suppose Assumptions 3.1-3.2 and 4.3 hold. Then for any
2 (0; 1), if
2B =f 2A:k
0 kc < g, there exists a positive constant M such that
p
2
2
(38)
jQ( ) Q( 0 ) k
M k
0 kq j
0 kq :
The constant M > 0 does not depend on .
Proof. Applying the Taylor expansion to F ( X 0 )
F ( X0 )
F0 ( X 0
0)
F0 ( X 0
= F 0 (Z0 ) X 0 (
0 ),
we have
0)
1
0 00
0
+ (
0 ) F (Z ) X X (
0 ) + F (Z0 )
2
where Z is a random variable between Z0 and X 0 . Denote
D1 = F00 (Z0 ) X 0 (
0)
+ F (Z0 )
D2 = fF 0 (Z0 ) F00 (Z0 )g X 0 (
1
0 00
0
D3 =
(
0 ) F (Z ) X X (
2
F0 (Z0 );
0 );
0 ):
F0 (Z0 );
28
JONG-MYUN MOON
Recall that Q( ) Q( 0 ) = E[F ( X 0 ) F0 ( X 0 0 )]2 . Therefore, Q( ) Q( 0 ) = E[fD1 +
D2 +D3 g2 ]. We expand and examine this term by term. Note that (i) E[j Xj4 ] and E[j Xj2 ]
are bounded, (ii) kF 00 k1 < B for any F 2 F (see Remark A.2), and (iii) for any 2 B ;
kF 0 (Z0 )
(39)
F00 (Z0 )kL1 (P )
k
0 kc
< :
These three facts are used in (41)-(42). First, by de…nition,
E[D12 ] = k
(40)
2
0 kq :
Second, by Cauchy-Schwartz inequality,
E[D22 ]
(41)
F 0 (Z0 )
F00 (Z0 )
L1 (P )
2
0 j E[j
j
Xj2 ] . j
2
0j :
Third, again by Cauchy-Schwartz inequality,
1 2
4
4
4
B j
0 j Ej Xj . j
0j :
4
Fourth, applying Hölder inequality to cross-product terms, by (40)-(42),
p
(43)
k
EjD1 D2 j .
0 kq j
0 j;
E[D32 ]
(42)
EjD1 D3 j . k
p
EjD2 D3 j .
j
(44)
(45)
By the norm equivalence (Lemma A.14), j
collecting (40)-(45),
p
2
(46) jQ( ) Q( 0 ) k
)k
0 kq j . ( +
Recall that if
2
0j ;
0 kq j
3
0j :
0j
k
0 ke;L2 (P )
2
0 kq
+ (1 +
p
k
3
0 kq
)k
0 kq .
Hence,
4
0 kq ;
+k
2B ,
k
0 kq
k
0 ke;L2 (P )
Hence, the right of (46) is less than
p
p
f +
+ (1 +
) +
.k
0 kc
2
2
0 kq
gk
< :
As we con…ne to be less than 1, the above expression is bounded above by M
for some …xed constant M > 0.
p
2
0 kq
k
Lemma A.16. Suppose Assumptions 3.1-3.2 hold. Let M be the constant appearing in
Lemma A.15. Let
A kn ; = f 2 A k n : k
Then, for large n, we have
"
E
sup jGn fm( ; )
2Akn ;
m(
kn
0;
0 kc
#
<M
)gj . ( +
2
;k
n)
0 kq
2 !
!
2 !
g
1
+p ( +
n
n)
+!
!
;
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
for
n
=k
0 kq .
0
kn
29
2.
Proof. Let us write m( ) = m( ; Vi ) and h( ) = h( ; V1 ; V2 ). Suppose k kn 0
0 kc < M
This holds for large n, because k kn 0
0 kc ! 0 by Assumption 3.3. Let M be a constant
such that
(47)
sup
sup jm( ; v) m(
; v)j < M < 1:
kn
2A v2R2+dx
We may take M = 3B by Remark A.2 and the de…nition of m( ) in (5). By Hölder inequality,
(48)
km( )
m(
0 )kL2 (P )
kn
+kE[h( )
kE[h( )
h(
h(
0 )jV2 ]kL2 (P )
kn
0 )jV1 ]kL2 (P )
kn
+ jQ( )
Q(
0 )j:
kn
By Jensen’s inequality,
(49)
kE[h( )
Recall that
0
kn
jh( )
h(
kn
0 )jVi ]kL2 (P )
kh( )
h(
kn
0 )kL2 (P ) :
= ( 0 ; F0;kn ). Hence, using the …rst inequality of (19), we obtain
h(
0 )j
kn
. Bj Xj
j
0j
+ jF ( X 0
F0;kn ( X 0
0)
0 )j;
which yields
(50)
kh( )
h(
kn
. j
0 )kL2 (P )
0j
= k
In addition, by Lemma A.15, if k
jQ( )
Q(
kn
2k
(51)
2
<M
jQ( )
F0;kn ( X 0
0)
0 )kL2 (P )
0 ke;L2 (P ) :
kn
0 kc
0 )j
+ kF ( X 0
and k
Q(
0 )j
2
0 kq
+ 2k
0 kc
0
kn
+ jQ(
kn
Q(
0 )j
2
0 kq :
0
kn
0)
2,
<M
Also note that, by Lemma A.14,
(52)
k
kn
0 ke;L2 (P )
k
By (48)-(52) and Assumption 4.2, if
(53)
km( )
m(
kn
kn
0 kq
k
0 kq
+k
+k
kn
kn
0 kq :
0
2 A kn ; ,
0 )kL2 (P )
. k
0 kq
2
0 kq
+k
(54)
+
n
+
2
+k
+
2
n
0 kq
0
kn
0
. +
2
0 kq
n;
where the last inequality holds because possible values for are bounded and
n ! 1. The maximal inequality is invoked to conclude. Denote
Mn; = fm( )
m(
kn
0)
:
2 Akn ; g;
n
! 0 as
30
JONG-MYUN MOON
and …nd a sequence fMn; g1
m( kn 0 )kL2 (P )
n=1 such that km( )
+ n . By Lemma 3.4.2 of van der Vaart and Wellner (1996),
(55)
E[ sup jGn fm( )
m(
2Akn ;
. J~[] (Mn; ; Mn; ; k kL2 (P ) )
0 )gj]
kn
(1 +
where
Z
J~[] ( ; Mn; ; k kL2 (P ) ) =
0
By Lemma A.10, for some constant C > 0,
km( ; Vi )
Mn; . By (54), Mn;
J~[] (Mn; ; Mn; ; k kL2 (P ) )
p
M );
n(Mn; )2
q
1 + log N[] ("; Mn; ; k kL2 (P ) )d":
m( ; Vi )kL2 (P )
Ck
ke;1 :
The …rst inequality of (56) is acquired by Theorem 2.7.11 of van der Vaart and Wellner (1996)
and the second inequality is trivial from Akn ;
A.
(56)
N[] ("; Mn; ; k kL2 (P ) )
Lemma A.6 shows log N ("C
(57)
1
N ("C
1 ; A; k
J~[] ( ; Mn; ; k kL2 (P ) ) .
Z
; Akn ; ; k ke;1 )
ke;1 ) . 1
p
0
2
log " + "
m(
2Akn ;
kn
for =
Z p
d" .
"
log " + "
; A; k ke;1 )
+!
! .
Therefore we have
2
d"
2
;
0
where the second inequality is from the fact that j2
Plugging (57) into (55),
E[ sup jGn fm( )
1
N ("C
0 )gj]
log "j < "
. (Mn; )
2
2
when " is small enough.
1
(1 + p (Mn; )
n
2
2
M ):
The result follows after simpli…cation.
Proof of Theorem 4.1. We start by introducing some abbreviations. Let
De…ne a k kq -norm shell as
(58)
Sj;n = f 2 Akn : sj;n < k
0 kq
2sj;n g for sj;n = 2j
1
= ( + !)= !.
rn 1 :
As in Lemma A.15, denote
Bc = f 2 A : k
p
0 kc
< cg:
1
Let cn be a sequence such that cn 1 k^ n
0 kc ! 0. Such a sequence fcn gn=1 exists since ^ n
is consistent in k kc -norm. Let " > 0 and J 2 N be arbitrary positive constants.
The theorem is proved if P [rn k^ n
large enough. Observe
(59)
P [rn k^ n
0 kq
> 2J ]
0 kq
> 2J ] can be made arbitrarily small by letting J
P [rn k^ n
0 kq
> 2J ; ^ n 2 Bcn ] + P [^ n 2
= Bcn ];
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
31
where the last term P [^ n 2
= Bcn ] converges to zero by de…nition of Bcn . The …rst term on
the right of (59) is bounded above by (60):
X
(60)
P [^ n 2 Sj;n \ Bcn ] + P [k^ n
0 kq > "]:
j J
2sj;n "
The second term of (60) can be ignored again by the consistency of ^ n and the fact that
k kq . k kc . Therefore it is enough to show that the …rst term of (60) decreases to 0 as J
increases.
First, we obtain a bound for P [^ n 2 Sj;n \ Bcn ], showing up in (65) below. By Lemma A.13,
if ^ n 2 Sj;n \ Bcn , then
n
(61)
inf
2Sj;n \Bcn
1X
fm( ; Vi )
n
m(
1
Op ( ):
n
0 ; Vi )g
kn
i=1
Let us treat the random term Op ( n1 ) in (61) as if it were a deterministic sequence Cn for some
C > 0. This simpli…cation comes without loss of generality. Recall Lemma A.15. Find an
integer N such that if n N , then
Q( )
for
2 Bcn . Then for n
Q(
1
k
2
0)
2
0 kq
N,
(62)
inf
2Sj;n \Bcn
Q( )
Expand the left of (61) to
" n
1X
(63)
inf
fm( ; Vi )
2Sj;n \Bcn n
Q(
1
(sj;n )2 :
2
0)
E[m( ; Vi )]g
i=1
n
1X
fm(
n
kn
0 ; Vi )
E[m(
kn
0 ; Vi )]g
+ Q( )
Q(
kn
i=1
Combine and denote the two summation term in (63) by n 1=2 Gn fm( )
is a standard notation in the empirical process theory). Expand
Q( )
Q(
0)
kn
= Q( )
Q(
0)
+ Q(
0)
Q(
kn
m(
kn
#
0)
:
0 )g
(this
0 ):
Then, (63) is bounded below by (64):
(64)
inf
2Sj;n \Bcn
1
p Gn fm( )
n
m(
kn
0 )g
1
+ (sj;n )2
2
kF0;kn (Z0 )
where the second term is obtained by (62) and the third term is by
Q(
0)
Q(
kn
0)
fkF0;kn (Z0 )
F0 (Z0 )kL2 (P ) g2 :
F0 (Z0 )k2L2 (P ) ;
32
JONG-MYUN MOON
From (61)-(64), we obtain
P [^ n 2 Sj;n \ Bcn ]
(65)
for
j;n
inf
2Sj;n \Bcn
Gn fm( )
m(
p
1p
n(sj;n )2 + nkF0;kn (Z0 )
2
C
=p
n
Note
P[
kn
0 )g
j;n ];
F0 (Z0 )k2L2 (P ) :
p
F0 (Z0 )k2L2 (P ) = o( nrn 2 )
p
by Assumption 4.2 and A.14. And also note that n 1=2 = o( nrn 2 ). On the other hand,
p
p
n(sj;n )2
nrn 2 . Therefore, j;n < 0 for large enough n and j.
p
nkF0;kn (Z0 )
Second, we evaluate the probability (65). Suppose n and j are large enough so that
By Markov inequality, the right of (65) is less than
1
(66)
j
j;n j
E[
sup
2Sj;n \Bcn
jGn fm( )
m(
kn
j;n
< 0.
0 )gj]:
Note, for large n,
if
n
<M
2
Sj;n \ Bcn
Asj;n
for M appearing in Lemma A.15. Also note
sj;n + s2j;n + rn 1 . sj;n
for large n. Then by Lemma A.16, the expression (66) is less than, up to a …xed scale,
1
(67)
j
j;n j
(sj;n )
2 !
!
2 !
1
+ p (sj;n )
n
+!
!
:
Finally, the …rst term of (60) is bounded. For large enough n and J, (68) holds by (65)-(67).
p 2
The second inequality of (68) is from j j;n j
nsj;n . The last equality is from Assumption
4.1.
X
X 1
2 !
!
+!
1
(68)
P [^ n 2 Sj;n \ Bcn ] .
(sj;n ) 2 ! + p (sj;n ) !
j j;n j
n
j J
2sj;n "
j J
.
1 X 2
p
sj;n
n
!+ +!
2 !
j J
= f1 + n
This shows (60) is O(2
J)
f(
+
X
sj;n
1 2
f p rn
n
+! 2 !
!
j J
+! 2 !
) 2 !+! +!
!
g g2
J
= O(2
J
!+ +!
2 !
+ rn
+! 2 !
!
);
+ o(1). In turn, from (59), it is shown that
P [rn 1 k^ n
0 kq
> 2J ]
O(2
J
) + o(1):
Therefore we conclude rn is the rate-of-convergence factor for the estimator ^ n .
g2
J
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
33
A.5. Proof for Section 5. Higher-order directional derivatives of h( ; ; ) are similarly
de…ned to the …rst-order derivative de…ned in (12). Let v1 ; v2 ; v3 2 V be arbitrary directions
(vectors). We denote second and third order derivatives by h00 ( )[v1 ; v2 ] and h000 ( )[v1 ; v2 ; v3 ]
respectively. Also let h00 ( )[v]2 = h00 ( )[v; v] and h00 ( )[v]3 = h000 ( )[v; v; v]. Following de la
Peña and Giné (1999), we denote the U-process empirical measure by Un ; it assigns 1=n(n 1)
probability to each pair (Vi ; Vj )i6=j from the random sample fV1 ;
; Vn g. The expectation
of h( ) with respect to the random probability measure Un is denoted by Un h( ); that is,
X
1
h( ; Vi ; Vj ):
Un h( ) =
n(n 1)
i6=j
Further, we write (Un
E)h( ) = Un h( )
E[h( )]. Note this is a U-process indexed by .
Remark A.17 (De…nition of n ). We choose an arbitrary sequence f n g that converges to
zero slower than rn 1 . Also we require that all the conditions of Assumption 5.4 hold when rn 1
is substituted by n . For now, if these two conditions are met, n can converge at any rate; its
rate will be picked in the proof of Theorem 5.1. The role of this sequence is to de…ne a local
neighborhood around 0 . In order to guarantee that the local neighborhood encompasses the
estimator ^ n , we choose n slower than rn 1 . On the other hand, it will be required that, on
the local neighborhood, the empirical criterion Qn approximates the population criterion Q
well enough then; n needs to shrink fast enough to obtain a good approximation. We will
choose n so that it converges in…nitesimally slower than rn 1 .
Lemma A.18. Suppose Assumptions 5.3 and 5.5 hold. For any Fk 2 Fk ;
(69)
p
p
dj
kFk (Z0 )kL1 (P ) . kkFk (Z0 )kL2 (P ) and k Fk (Z0 )kL1 (P ) . k j (k)kFk (Z0 )kL2 (P ) ;
dz
for j de…ned in Assumption 5.4 (but Assumption 5.4 is not needed).
Proof. Let k denote the the smallest eigenvalue of E[pk (Z0 )pk (Z0 )0 ]. By Assumption 5.5,
> 0 for some and for every k 2 N. Because Fk is a …nite dimensional linear
k
sieve (Assumption 5.3), Fk (Z0 ) is a linear combination of basis functions fp1 ;
; pk g; write
c0 pk (Z0 ) for some c 2 Rk . Then,
p
p
0
(70)
kFk (Z0 )kL2 (P ) = fc0 E[pk (Z0 )pk (Z0 )0 ]cg1=2
c0 c;
k cc&
where the last inequality holds by Assumption 5.5. Next, observe
(71)
(72)
kFk (Z0 )kL1 (P )
k
dj
Fk (Z0 )kL1 (P )
dz
k
X
i=1
k
X
i=1
jci jkpi (Z0 )kL1 (P ) .
jci jk
dj
pi (Z0 )kL1 (P )
dz
k
X
i=1
jci j
j (k)
k
X
i=1
jci j.
34
JONG-MYUN MOON
By Cauchy-Schwarz inequality,
P
p p
k c0 c. By this and (70)-(72), (69) is shown.
jci j
Lemma A.19. Suppose F; Fv and Fw are functions on R. Suppose ; v ; w are in the set
. Denote = ( ; F ), v = ( v ; Fv ) and w = ( w ; Fw ). (i) If z 7! F (z) is di¤ erentiable,
(73)
h0 ( ; V1 ; V2 )[v] =
2fIf Y
F ( X 0 )gfF 0 ( X 0 ) X 0
0g
+ Fv ( X 0 )g:
v
(ii) If F is twice di¤ erentiable and Fv is di¤ erentiable,
(74)
h00 ( ; V1 ; V2 )[v; w] = 2fF 0 ( X 0 ) X 0
fF 0 ( X 0 ) X 0
fF 00 ( X 0 ) X 0
w
+ Fw ( X 0 )g
+ Fv ( X 0 )g
v
X0
w
2fIf Y
0g
+ Fw0 ( X 0 ) X 0
v
v
F ( X 0 )g
+ Fv0 ( X 0 ) X 0
wg
(iii) If F is three-times di¤ erentiable and Fv is twice di¤ erentiable,
h000 ( ; V1 ; V2 )[v; v; w] = 4fF 00 ( X 0 ) X 0
(75)
+Fw ( X 0 ) X 0
0
0
v
+ Fv0 ( X 0 ) X 0
0
fF ( X ) X
0
+2fF ( X ) X
0
w
0
v
wg
0
+ Fv ( X )g
2fIf Y
F ( X 0 )g
0g
X 0 v )2 + Fw00 ( X 0 )( X 0 v )2
w(
+2Fv00 ( X 0 ) X 0
:
v
+ Fw ( X 0 )gfF 00 ( X 0 )( X 0 v )2
+2Fv0 ( X 0 ) X 0 v g
fF 000 ( X 0 ) X 0
X0
w
X0 vg
w
Proof. From
h( + tv) = fIf Y
0g
F ( X 0 + t X 0 v ))
tFv ( X 0 + t X 0 v )g2 :
by elementary calculus, we obtain (73). Similarly, from
h0 ( + tw)[v] =
2fIf Y
tFw ( X 0 + t X 0
+tFw0 ( X 0 + t X 0
F ( X0 + t X0
w)
fF 0 ( X 0 + t X 0
w)
0g
w )g
X0
w)
v
X0
+ Fv ( X 0 + t X 0
v
w )g;
we obtain (74). Lastly, from
h00 ( + tw)[v; v] = 2fF 0 ( X 0 + t X 0
w)
X0
v
+tFw0 ( X 0 + t X 0
w)
X0
v
+ Fv ( X 0 + t X 0
w )g
fF 0 ( X 0 + t X 0
w)
X0
v
+ tFw0 ( X 0 + t X 0
w)
+Fv ( X 0 + t X 0
tFw ( X
+tFw00 (
X
0
0
w )g
+t X
0
+t X
0
w )g
w )(
2fIf Y
00
0g
fF ( X
X
0
2
v)
+
0
F ( X0 + t X0
+t X
2Fv0 (
X0
X
0
0
w )(
X
+t X
0
0
v
w)
2
v)
w)
X 0 v g;
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
35
we can derive (75).
Lemma A.20. Suppose Assumptions 3.1- 3.2, 4.3, 5.1-5.5 hold. Consider any non-random
sequence f n 2 Akn g such that k n
0 ke;L2 (P ) = o( n ). Then
k
n
kn
2
0 ke;L2 (P )
= Q(
Proof. Denote vn = n
kn 0 and vn = (
Assumption 5.2. By the triangle inequality,
(76)
kvn kq
kvn ke;L2 (P )
k
n)
vn ; Fvn ).
0 ke;L2 (P )
n
Q(
0)
kn
Note k
+k
kn
1
+ o( ):
n
kn
0 ke;L2 (P )
0
0 ke;L2 (P )
n
= o( n ) by
= O( n ):
Recall that any element of Fkn is at least three-times di¤erentiable by Assumptions 3.1-3.2
and Remark A.2. Then h( ) is three-times directionally di¤erentiable (Lemma A.19). By
Taylor expansion,
1
1
+ h00 ( kn 0 )[vn ]2 + h000 ( n )[vn ]3 ;
2
6
for some n 2 Akn between n and kn 0 . Denote n = ( n ; Fn ). We examine each term on
the right of (77) from below. First, from Lemma A.19,
(77)
h(
h0 (
n)
kn
0 )[vn ]
Since E[If Y
E h0 (
h(
kn
=
0)
= h0 (
kn
2fIf Y
0 )[vn ]
0
(Z0 ) X 0
F0;kn (Z0 )gfF0;k
n
0g
vn
+ Fvn (Z0 )g:
0gjX1 ; X2 ] = F0 (Z0 ), by the law of iterated expectation, it follows
kn
0 )[v]
=
0
(Z0 ) X 0
F0;kn (Z0 )gfF0;k
n
2E fF0 (Z0 )
vn
+ Fvn (Z0 )g :
Then by Hölder inequality and Assumption 5.2,
E h0 (
(78)
kn
0 )[v]
(79)
. kF0 (Z0 )
F0;kn (Z0 )kL2 (P )
0
kF0;k
(Z0 ) X 0
n
. o(n
3=4
)
= o(n
3=4
)
By (76), the last expression of (78) is o(n
j
vn j
vn
+ Fvn (Z0 )kL2 (P ) :
+ kFvn (Z0 )kL2 (P )
kvn ke;L2 (P ) :
3=4
n )-term.
As
n
= o(n
1=3 ),
we obtain
1
= o( ):
n
Next, the second-order term of the Taylor expansion (77) is examined. From Lemma A.19,
E h0 (
(80)
(81)
h00 (
kn
2
0 ; V1 ; V2 )[vn ]
kn
0 )[v]
0
= 2fF0;k
(Z0 ) X 0
n
00
4fF0;k
(Z0 )( X 0
n
2
vn )
vn
+ Fvn (Z0 )g2
+ 2Fv0 (Z0 ) X 0
vn gfIf
Y
By the law of iterated expectation and the triangle inequality,
(82)
E[h00 (
kn
2
0 ; V1 ; V2 )[vn ] ]
0
. E[fF0;k
(Z0 ) X 0
n
vn
+ Fvn (Z0 )g2 ]
0g
F0;kn (Z0 )g:
36
JONG-MYUN MOON
00
(Z0 )( X 0
+E[jfF0;k
n
2
vn )
+ 2Fv0 (Z0 ) X 0
vn gfF0 (Z0 )
F0;kn (Z0 )gj]:
After some algebra, the …rst term on the right of (82) is shown to be bounded by
F00 (Z0 )k2L4 (P ) k X 0
0
(Z0 )
kvn k2q + kF0;k
n
2
vn kL4 (P )
F00 (Z0 )kL4 (P ) k X 0
0
(Z0 )
+kF0;k
n
vn kL4 (P ) kvn kq :
By Assumption 5.2and (76), the last expression is further simpli…ed to
kvn k2q + o(n
2=3
)O( 2n ) + o(n
1=3
)O( 2n ) = kvn k2q + o(n
1
);
The second term on the right of (82) is bounded above by
00
(Z0 )( X 0
kF0;k
n
. fj
2
vn )
2
vn j
+ 2Fv0n (Z0 ) X 0
+j
= O( n )o(n
F0;kn (Z0 )kL2 (P )
vn jgkF0 (Z0 )
2=3
1
) = o(n
F0;kn (Z0 )kL2 (P ) :
vn kL2 (P ) kF0 (Z0 )
)
where the inequality holds by the triangle inequality and Remark A.2, and the last inequality
holds by Assumption 5.2 and (76). Therefore, we have
1
= kvk2q + o( ):
n
The third order term in (77) follows. As given by Lemma A.19,
E[h00 (
(83)
h000 (
3
n ; V1 ; V2 )[vn ]
0
fFn ( X
0
000
= f6Fn 00 ( X 0
n)
fFn ( X
0
2
0 ; V1 ; V2 )[v] ]
kn
0
vn
X
0
X
n )(
vn )
+
2
vn )
0
n )g
3Fv00n (
0
+ Fv n ( X
3
X0
n )(
X
+ 8Fv0n (Z0 )) X 0
2fIf Y
n )(
X
0
0g
2
vn )
+ 4Fvn (Z0 ) X 0
vn
Fn ( X
0
vn g
n )g
g:
Similar to above, using Hölder inequality and the triangle inequality multiple times,
(84) E h000 (
. fkFn 00 ( X 0 )( X 0
3
n ; V1 ; V2 )[v]
+kFvn (Z0 ) X
+kF0 ( X
0
0
vn kL2 (P ) g
0)
+kFv00 ( X 0
0
Fn ( X
X0
n )(
0
fkFn ( X
n )kL2 (P )
2
vn ) kL2 (P ) g:
0
2
vn ) kL2 (P )
n)
X
0
000
+ kFv0n (Z0 ) X 0
vn kL2 (P )
0
fkFn ( X )( X
0
vn kL2 (P )
+ kFvn ( X 0
3
vn )
n )kL2 (P ) g
kL2 (P )
We examine each term of the last expression. By Lemma A.18,
kFn 00 ( X 0
n )(
X0
2
vn ) kL2 (P )
kFvn (Z0 ) X 0
0
kFv0n (Z0 ) X 0
kFn ( X
000
0
0
n)
X
kFn ( X )( X
0
0
vn kL2 (P )
vn kL2 (P )
vn kL2 (P )
3
vn )
. j
2
vn j
= O( 2n );
p
kn O( 2n );
p
. kFv0n (Z0 )kL1 (P ) j vn j = kn 1 (kn )O( 2n );
. kFvn (Z0 )kL1 (P ) j
. j
kL2 (P ) . j
vn j
= O( n );
3
vn j
= O( 3n ):
vn j
=
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
There are two more terms; namely, kFvn ( X 0
By Taylor expansion and Hölder inequality,
kFvn ( X 0
n )kL2 (P )
= kFvn (Z0 ) + Fvn ( X 0
n )kL2 (P )
and kF0 ( X 0
for n is between n and
Hence from the above,
0.
Recall that
n
kFvn ( X 0
n )kL2 (P ) .
n
) X 0(
0 )kL2 (P )
n
0 j;
n
is between
n )kL2 (P )
Fn ( X 0
0)
Fvn (Z0 )kL2 (P )
n)
kFvn (Z0 )kL2 (P ) + kFv0n ( X 0
. kFvn (Z0 )kL2 (P ) + j
37
and
n
0.
As such, j
0j
n
= O( n ).
= O( n )
It is similar to show
kF0 ( X 0
Fn ( X 0
0)
n )kL2 (P )
= O( n ):
In sum, using the above bounds for each term of (84), we see that
p
p
p
E[h000 ( n ; V1 ; V2 )[vn ]3 ] = O(f 2n + kn 2n + kn 1 (kn ) 2n g n + n f 3n + 2n g) = O( kn 1 (kn ) 3n ):
p
By Remark A.17, kn 1 (kn ) 3n = o(n 1 ). This, along with (80) and (83) proves the lemma.
Lemma A.21. Suppose Assumptions 3.1-3.2, 4.3, 5.3-5.5 hold. Let M be the constant
appearing in Lemma A.15. Denote
Bkn ; = f
0
kn
2 A kn ; k
:
0 kc
<M
2
;k
kn
0 kq
g:
Then, for a sequence f n g1
n=1 speci…ed in Remark A.17,
(85)
sup (Un
2Bkn ;
E)h00 (
kn
n
Proof. By Markov inequality, the lemma is proved if
"
(86)
E
1
] = op ( ):
n
0 )[
sup (Un
2Bkn ;
E)h00 (
kn
0 )[
n
#
1
] = o( ):
n
The claim (86) can be shown by the maximal inequality.
h00 ( kn 0 )[ ]2 = h00 ( kn 0 ; V1 ; V2 )[ ]2 . De…ne
m00 (
g 00 (
kn
kn
0 ; v)[
]2 = E[h00 (
0 ; v1 ; v2 )[
kn
]2 = h00 (
0 )[
kn
]2 jV1 = v] + E[h00 (
0 )[
]2
E[h00 (
kn
E[h00 (
0 )[
kn
0 )[
kn
As before, abbreviate
]2 jV2 = v]
E[h00 (
kn
0 )[
]2 jV1 = v1 ]
0 )[
]2 jV2 = v2 ] + E[h00 (
kn
0 )[
]2 ];
]2 ]:
38
JONG-MYUN MOON
We shorten m00 ( kn 0 ; Vi )[ ]2 to m00 ( kn 0 )[ ]2 , and g 00 ( kn 0 ; Vi ; Vj )[ ]2 to g 00 (
The following equation is yet another application of Hoe¤ding decomposition;
(87)
E)h00 (
(Un
0 )[
kn
]2 = (Pn
E)m00 (
0 )[
kn
]2 + (Un
By the decomposition (87), we may bound the left of (86) by
#
"
"
E
sup (Pn
2Bkn ;
E)m00 (
kn
0 )[
]2 + E
sup (Un
2Bkn ;
n
E)g 00 (
E)g 00 (
kn
kn
0 )[
n
kn
0 )[
0 )[
]2 .
]2 ;
#
]2 :
Above two expectations are shown to converge to zero faster than 1=n-rate by Lemma A.23
and Lemma A.24 respectively. Therefore the claim (85) is proved.
Lemma A.22. Suppose the conditions of Lemma A.21. Notations are also same with Lemma
A.21. Then,
"
log N[] ("; M00kn ; n ; L2 (P )) . kn log 2
kn n
Proof. Let
n
and Fkn ;
n
be such that Bkn ;
n
=
kF k1;L1 (P ) = kF (Z0 )kL1 (P ) + kF 0 (Z0 )kL1 (P ) ;
n
Fkn ; n . Also, denote
k ke;1;L1 (P ) = j j + kF k1;L1 (P ) ,
By Lemma A.26, Theorem 2.7.11 of van der Vaart and Wellner (1996) can be applied:
"
N(
N[] ("; M00kn ; n ; L2 (P ))
; Bkn ; n ; k ke;1;L1 (P ) )
C1
"
"
; n ; j j) N (
; Fkn ; n ; k k1;L1 (P ) );
(88)
N(
2C1
4C1
for some constant C1 > 0. The covering number N ("=4C1 ; Fkn ; n ; k k1;L1 (P ) ) in (88) is not
easy to calculate. However, by Lemma A.18 and Assumption 5.4(i), for any F 2 Fkn ; n Fkn ,
(89)
kF k1;L1 (P ) . kn kF (Z0 )kL1 (P ) :
Therefore we have the following inequality; for some positive constant C2 ,
"
"
(90)
N(
; Fkn ; n ; k k1;1 ) N (
; Fkn ; n ; k kL1 (P ) );
4C1
C2 kn
where kF kL1 (P ) is de…ned to be kF (Z0 )kL1 (P ) for simplicity.
Next, in order to calculate the covering number (90), we exploit the fact that Fkn is a
linear combination of …nite basis functions. For F 2 Fkn , let us write F (z) = c0 pkn (z) for
c = (c1 ;
; ckn )0 . Notice that any coe¢ cient cj must be bounded. To see this, suppose not.
As F (z) is a bounded function, this implies some linear combination of basis functions is zero.
This contradicts with Assumption 5.5, and so any coe¢ cient cj is bounded. Without loss of
generality, assume that cj is within an interval [ 1; 1] for any j. Next, consider F 2 Fkn ; n
and suppose kF (Z0 )kL2 (P )
c n for some constant c. Because any norm is linear, this
implies that the coe¢ cient cj is in an interval [ c n ; c n ]. Since p = supj kpj k1 is bounded
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
39
by Assumption 5.3, it follows
kc0 pkn kL1 (P )
pfjc1 j +
+ jckn jg:
Hence, a "-radius ball in (Fkn ; k kL1 (P ) ) is smaller than a set
n
o
c0 pkn (z) : jc1 j +
+ jckn j "=p
All these considerations lead to the following calculation; the covering number on the right
of (90) is bounded above, up to a …xed scale, by
(91)
N(
"
; [ c n ; c n ]; j j)]
C2 kn2
kn
kn2 n
"
kn
:
Therefore, the claim of the lemma is proved. Now, back to (88), after the log transformation,
"
"
log N[] ("; M00kn ; n ; L2 (P )) . d log
kn log 2 :
k
n
n n
As the second term dominates on the right of the previous inequality, we obtain the claimed
inequality.
Lemma A.23. Suppose the conditions of Lemma A.21. Notations are also same. Then,
"
#
1
00
2
E
sup Gn m ( kn 0 )[v] = o( p ):
n
v2Bkn ; n
Proof. We prove this lemma by using the maximal inequality of Theorem 2.14.2 in van der
Vaart and Wellner (1996). To prepare for the application of the maximal inequality, de…ne
M00kn ;
n
= fv 7! m00 (
kn
2
0 )[v]
: v 2 Bkn ; n g:
From (81), it can be easily checked that the envelope for fh00 (
kn
0 )[v]
: v 2 Bkn ; n g is
j Xj2 j v j2 + jFv (Z0 )j2 + fjFv (Z0 )j + jFv0 (Z0 )jgj Xjj v j:
Hn (V1 ; V2 )
From this, an envelop function for M00kn ;
n
can be obtained; let
Mn (v) = E[Hn (V1 ; V2 )jV1 = v] + E[Hn (V1 ; V2 )jV2 = v] + E[Hn (V1 ; V2 )]:
Then for any v 2 Bkn ; n ;
km00 (
kn
0 ; Vi )[
]2 kL2 (P )
kMn (Vi )kL2 (P ) ;
by using Jensen’s inequality and Minkowski inequality. We need to calculate kMn (Vi )kL2 (P ) ;
kMn (Vi )kL2 (P )
(92)
3kHn (V1 ; V2 )kL2 (P )
. j v j2 + kFv (Z0 )k2L4 (P ) + fkFv (Z0 )kL4 (P ) + kFv0 (Z0 )kL4 (P ) gj v j
40
JONG-MYUN MOON
where the …rst inequality holds by Jensens’inequality, and the second inequality is a consequence of Hölder inequality. Note, by Lemma A.18,
p
p
kF (Z0 )kL4 (P ) . kn kF (Z0 )kL2 (P ) ,
kF 0 (Z0 )kL4 (P ) . kn 1 (kn )kF (Z0 )kL2 (P ) :
p
Recall that Assumption 5.4 (i) states kn 1 (kn ) . kn . Hence, from (92),
(93)
kMn (Vi )kL2 (P )
C1 kn
2
n:
Next, the following bracketing integral is calculated:
Z 1q
00
J[] (1; Mkn ; n ; L2 (P )) =
1 + log N[] ("kMn (Vi )kL2 (P ) ; M00kn ; n ; L2 (P ))d":
0
By (93), this is bounded above by
Z 1q
1 + log N[] ("C1 kn
0
2
00
n ; Mkn ;
n
; L2 (P ))d";
which, by the change of variable, can be written as
Z C1 kn 2n q
2
1
(C1 kn n )
1 + log N[] ("; M00kn ; n ; L2 (P ))d"
0
By Lemma A.22, the above integral is bounded above by
Z C1 kn 2n r
p
p kn Z
"
2
1
(94)
C2 kn (C1 kn n )
log 2 d"
kn
kn n
n 0
0
Notice that kn =
integral rule,
n
n =kn
p
log "0 d"0 :
! 1. To evaluate the last integral of (94), observe that, by the Leibniz
Z xp
p
d
log "0 d"0 =
log x:
dx 0
Then by l’Hopital’s rule, for any c > 1,
Z xp
1
1 p
lim c
log "0 d" = lim c 1
log x = 0:
x!0 x
x!0 cx
0
From these calculations, we learn that, for any small number > 0, the right of (94) is less
p
than kn (kn = n ) if the number n is large enough. Keeping this in mind, recall Theorem
2.14.2 of van der Vaart and Wellner (1996) to obtain
"
#
p
p
E
sup Gn m00 ( kn 0 )[ ]2 . kn (kn = n ) kMn (Vi )kL2 (P ) . kn (kn = n ) kn 2n :
2Akn ;
n
p
By Assumption 5.4, kn kn 2n = o(n 1=2 ). By choosing arbitrarily small, we conclude that
p
kn (kn = n ) kn 2n = o(n 1=2 ). This concludes the proof.
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
41
Lemma A.24. Suppose the conditions of Lemma A.21. Notations are also same. Then,
#
"
1
00
2
E
sup Un g ( kn 0 )[v] = o( ):
n
v2Bkn ; n
Proof. By Theorem B.1, for any
"
(95)
E
00
Un g (
sup
v2Bkn ;
kn
n
2 A kn ; n ,
#
2
0 )[v]
1
. E
n
where the random metric n is de…ned as
8
<1 X
(96)
g 00 ( kn 0 ; Vi ; Vj )[
n( 1; 2) =
: n2
Z
Dn
log N ("; Bkn ; n ;
0
2
1]
g 00 (
kn
n )d"
2 2
0 ; Vi ; Vj )[ 2 ]
i6=j
and Dn is the diameter of Akn ;
C,
n( 1;
2)
n
measured by
n.
;
91=2
=
;
;
By Lemma A.27, for some positive constant
91=2
8
=
<1 X
2
k
C
G(V
;
V
)
i
j
;
: n2
2 ke;1;L1 (P ) ;
1
i6=j
where the “Lipschitz constant” G( ; ) is de…ned in the Lemma A.27. Let us denote
8
91=2
<
=
X
1
(97)
Gn = C
:
G(Vi ; Vj )2
: n(n 1)
;
i6=j
Notice that
n( 1;
2)
. Gn k
2 ke;1;L1 (P ) :
1
Then, since kF k1;L1 (P ) . kn kF (Z0 )kL1 (P ) as derived in (89),
(98)
Dn =
sup
1 ; 2 2Akn ; n
n( 1;
2)
. kn
The following two inequalities are almost identical to (88):
"
"
N ("; Bkn ; n ; n ) N (
; Bkn ; n ; k ke;1;L1 (P ) ) N (
;
Gn
2Gn
n Gn :
n
; j j) N (
"
; Fkn ; n ; k k1;L1 (P ) );
2Gn
Further, similar to (90)-(91),
(99)
log N ("; Bkn ; n ;
From (99),
Z Dn
0
log N ("; Bkn ; n ;
n)
.
n )d"
log " + log
. Dn
n Gn
kn log " + kn log kn2
Dn log Dn + Dn log
n Gn :
n Gn
+kn Dn + kn Dn log Dn + kn Dn log kn2
n Gn :
42
JONG-MYUN MOON
Notice that x log x x2 for any positive x. And recall the bound (98). Using these two, we
can show that the right of the above inequality is less than
kn
n Gn
+ (kn
2
n Gn )
+ kn ( n Gn )2 + kn2
n Gn
+ kn (kn
2
n Gn )
+ (kn2
n Gn )
By the de…nition (97), E[G2n ] = E[G(Vi ; Vj )2 ] and this is bounded. By Jensen’s inequality,
E[Gn ] . fE[G(Vi ; Vj )2 ]g1=2 . Therefore, we obtain
Z Dn
log N ("; Bkn ; n ; n )d"] . kn2 n :
E[
0
By Assumption 5.4,
kn2 n
converges to zero. This result combined with (95) proves the lemma.
Lemma A.25. Suppose Assumptions 3.1-3.2 hold. For any
A A,
(100)
jh00 (
h00 (
0 ; V1 ; V2 )[w]
kn
. (j Xj + j Xj2 )fj
+kFw0 (Z0 )
= ( 0 ; F ) 2 A and w; w 2
0 ; V1 ; V2 )[w]j
kn
wj
w
+ kFw (Z0 )
Fw (Z0 )kL1 (P )
Fw0 (Z0 )kL1 (P ) g:
Proof. The directional derivative h00 ( ; v1 ; v2 )[w] exists if F is twice di¤erentiable and Fw is
di¤erentiable. This is guaranteed by 3.1-3.2. By easy algebra,
h00 ( ; V1 ; V2 )[w]2
h00 ( ; V1 ; V2 )[w]2
j2fF 0 (Z0 ) X 0 (
w1
fF (Z0 ) X (
w1
0
0
+j4fF 00 (Z0 ) X 0 (
+2Fw0 1 (Z0 ) X 0 (
fIf Y
+
w1
w2 )
+ Fw1 (Z0 )
w2 )
+ Fw1 (Z0 ) + Fw2 (Z0 )gj
+
w1
0g
X 0(
w2 )
w2 )
Fw2 (Z0 )g
w2 )
w1
+ 2fFw0 1 (Z0 )
Fw0 2 (Z0 )g X 0
w2 g
F (Z0 )gj:
Recall that for any = ( ; F ) 2 A, F , F 0 and F 00 are all uniformly bounded as explained
in Remark A.2. Hence the right of the above inequality is smaller than, for strictly positive
constants fC1 ;
; C4 g,
(101)
C1 fj X 0 (
w1
+C2 j X 0 (
+C4 j X 0
w2 )j
w1
w2 j
+
+ kFw1 (Z0 )
w2 )
X 0(
kFw0 1 (Z0 )
Fw2 (Z0 )kL1 (P ) g
w1
w2 )j
+ C3 j X 0 (
w1
w2 )j
Fw0 2 (Z0 )kL1 (P ) :
The product terms in (101) can be bounded using Cauchy-Schwartz inequality. Ignoring
constant terms, the result (100) can be acquired from the expression (101).
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
43
The following two lemmas can be easily proved, similarly to Lemma A.10 and Lemma A.11;
hence proofs are omitted.
Lemma A.26. Suppose Assumptions 3.1-3.2 hold. For any
A A,
jm00 (
kn
0 ; v)[w]
m00 (
0 ; v)[w]j
kn
. (1 + jxj + jxj2 )fj
= ( 0 ; F ) 2 A and w; w 2
wj
w
+ kFw (Z0 )
Fw (Z0 )kL1 (P ) + kFw0 (Z0 )
Lemma A.27. Suppose Assumptions 3.1-3.2 hold. For any
A A,
jg 00 (
kn
g 00 (
0 ; v1 ; v2 )[w]
. G(v1 ; v2 )fj
wj
w
Fw0 (Z0 )kL1 (P ) g:
= ( 0 ; F ) 2 A and w; w 2
0 ; v1 ; v2 )[w]j
kn
Fw (Z0 )kL1 (P ) + kFw0 (Z0 )
+ kFw (Z0 )
Fw0 (Z0 )kL1 (P ) g;
where G(v1 ; v2 ) = 1 + jx1 j + jx2 j + jx1 jjx2 j + jx1 j2 + jx2 j2 :
Lemma A.28. Suppose Assumptions 3.1-3.2, 5.3-5.5 hold. Let M be the constant de…ned
in Lemma A.15. Denote
Bkn ; = f
0
kn
2 A kn ; k
:
0 kc
2
<M
;k
kn
0 kq
g:
Then, for a sequence f n g1
n=1 speci…ed in Remark A.17,
sup
2[
where [
kn
0;
kn
0
Proof. Let us write
h000 (
(Un
v2Bkn ; n
kn 0 ; kn
1
)[v]3 = op ( );
n
E)h000 (
0 +v]
+ v] is a line connecting two points
0
and
kn
0
+ v.
= ( ; F ) and v = ( v ; Fv ). From Lemma A.19, we have
; V1 ; V2 )[v]3 = 6fF 00 ( X 0
fF 0 ( X 0
fF
000
)( X 0 v )2 + 2Fv ( X 0
) X0
( X0
v
+ Fv ( X 0
+j Xj n jFv ( X 0
We need bounds for jFv ( X 0
)g
)( X 0 v )3 + 3Fv00 ( X 0
Observe that, since kF k3;1 < B (see Remark A.2),
p
(102)
jh000 ( ; V1 ; V2 )[v]3 j . j Xj3 3n + j Xj2 kn
Fv ( X 0
kn
) X0 vg
2fIf Y
0g
F ( X0
)( X 0 v )2 g:
3
n
)j2 + j Xj2
2
00
n jFv (
X0
)j:
)j and jFv00 ( X 0 ) j. First, by Taylor expansion,
) = Fv (Z0 ) + Fv0 (Z0 ) X 0 (
0)
1
+ Fv00 ( X 0
2
)f X 0 (
0 )g
2
;
)g
44
JONG-MYUN MOON
for some point
between
and 0 . Because kFv (Z0 )kL2 (P ) . n , by Lemma A.18 and
Assumption 5.4(i),
p
p
p
kFv (Z0 )kL1 (P ) . kn n ; kFv0 (Z0 )kL1 (P ) . kn 1 (kn ) n ; kFv00 (Z0 )kL1 (P ) . kn 2 (kn ) n :
Using these inequalities and Hölder inequality, we obtain
p
p
p
jFv ( X 0 )j . kn n + kn 1 (kn ) 2n j Xj + kn 2 (kn ) 3n j Xj2 :
Second, after repeating similar calculations, it follows that
p
jFv00 ( X 0 )j . jFv00 (Z0 )j + j Xjj v j . kn 2 (kn )
n
+ j Xj n :
From Assumption 5.4, it is easy to show that
p
1
= o( ),
n
From this, after some algebra, we can show that
kn
3
n
_ kn 1 (kn )
5
n
_ kn 2 (kn )
7
n
kn 2 (kn )
3
n
1
= o( ):
n
1
; V1 ; V2 )[v]3 j . (j Xj4 + j Xj3 + j Xj2 + j Xj)o( ):
n
Applying Jensen’s inequality, from (103), it follows that
jh000 (
(103)
1
1
)[v]3 j . Op (1)o( );
Ejh000 ( )[v]3 j . O(1)o( );
n
n
where the Op (1)-term is a result of the law of large numbers for U-statistic. This proves the
lemma.
Un jh000 (
Lemma A.29. Suppose Assumptions 3.1-3.2, 4.1-4.3, 5.1-5.5 hold. Let ~ n = ^ n "n kn v
for any non-random sequence "n such that "n = O(n 1=2 ); v is the Riesz representer. Then,
for vn = ~ n
kn 0 ,
(104)
k~ n
kn
2
0 kq
=
E)h0 (
(Un
kn
0 )[vn ]
+ Qn (~ n )
Qn (
kn
0)
1
+ op ( ):
n
Proof. Note the lemma supposes all the assumptions for Theorem 3.1 and Theorem 4.1.
Because "n is decreasing faster than n to zero, we can see that k~ n
0 k = op ( n ). By
Lemma A.20,
(105)
k~ n
kn
2
0 kq
kn
0)
= Q(~ n )
We may expand Q(~ n )
Qn (
(106)
Q(~ n )g + fQ(
fQn (~ n )
1
+ op ( ):
n
Qn (
kn
0)
Qn (
kn
0 )g
to
kn
0)
+ Qn (~ n ) + Qn (
and note that the …rst two terms of the above expression can be written as
(107)
(Un
E)fh(~ n )
h(
kn
0 )g:
kn
0 );
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
45
As 7! h( ) = h( ; Vi ; Vj ) permits the third-order Taylor expansion thanks to Lemma A.19,
for vn = ~ n
kn 0 , (107) is equivalent to
1
1
+ h00 ( kn 0 )[vn ]2 + h000 ( n )[vn ]3 g;
2
6
where n 2 Akn is a random point between n and kn 0 . Recall the de…nition of the
set Bkn ; de…ned in Lemma A.21. Note k~ n
0 kc converges in probability to zero, and
therefore kvn kc converges in probability to zero. Then vn is in the set Bkn ; with probability
approaching one. By this fact, we can apply Lemma A.21 and Lemma A.28 to bound the
second and third order derivatives in (108) to obtain
(108)
E)fh0 (
(Un
(109)
0 )[vn ]
kn
E)fh0 (
(Un
kn
0 )[vn ]
1
+ op ( ):
n
From (105), (106) and (109), the claim (104) is shown.
Lemma A.30. Adopt the conditions of Theorem 5.1. Then
p
p
n(Un E)h0 ( kn 0 )[ kn v ] = n(Un E)h0 (
Proof. Denote
h0 (
0 )[ kn v
kn
= ( ;F
kn v
h0 (
]
;kn ).
] + op (1):
After short algebra, we obtain
0 )[ kn v
F00 (Z0 ))
0
(Z0 )
f(F0;k
n
0 )[v
]=
X
0
2fIf Y
0g
F0 (Z0 )g
0
(Z0 ) X 0
F0 (Z0 )gfF0;k
n
g + 2fF0;kn (Z0 )
+F
;kn (Z0 )g:
Therefore,
jh0 (
h0 (
kn
0 )[ kn v
]
.
0
(Z0 )
j(F0;k
n
0 )[ kn v
F00 (Z0 ))
X
]j
0
j + jF0;kn (Z0 )
0
(Z0 ) X 0
F0 (Z0 )jjF0;k
n
+F
;kn (Z0 )gj:
By Hölder inequality,
(110)
Ejh0 (
kn
0 )[ kn v
]
h0 (
+kF0;kn (Z0 )
0 )[ kn v
0
]j . kF0;k
(Z0 )
n
0
F0 (Z0 )kL2 (P ) kF0;k
(Z0 ) X 0
n
1=2 ).
By Assumption 5.2, the right of (110) is o(n
(111) j(Un
E)fh0 (
Un jh0 (
kn
kn
0 )[ kn v
0 )[ kn v
]
h0 (
]
h0 (
0 )[ kn v
0 )[ kn v
F00 (Z0 )kL2 (P ) k X 0
+F
kn v
]
h0 (
0 )[ kn v
]
;kn (Z0 )kL2 (P ) :
By Jensen’s inequality,
]gj
]j + Ejh0 (
0 )[ kn v
kn
We already showed that the second term on the right of (111) is o(n
by Markov inequality, for any " > 0;
P Un jh0 ( kn 0 )[
1
Ejh0 ( kn
"
kL2 (P )
0 )[ kn v
h0 (
h0 (
]
1=2 ).
1
]j = o(n
"
]j:
For the …rst term,
]j > "
0 )[ kn v
0 )[ kn v
1=2
):
46
JONG-MYUN MOON
This shows
Un jh0 (
0 )[ kn v
kn
]
h0 (
0 )[ kn v
1=2
]j = op (n
):
Next, we have
h0 (
0 )[ kn v
]
h0 (
0 )[v
]=
2fIf Y
0g
;kn (Z0 )
F0 (Z0 )gfF
F (Z0 )g:
Then it is immediate that
Ejh0 (
0 )[ kn v
h0 (
]
0 )[v
]j . kF
;kn (Z0 )
F (Z0 )kL2 (P ) . k
kn v
v kq ;
where the last inequality holds by Lemma A.14. By Assumption 5.6,
k
kn v
v kq = o(n
1=2
rn 1 ) = o(n
1=2
):
By the similar argument with above, we show
Un jh0 (
0 )[ kn v
]
h0 (
0 )[v
]j = op (n
1=2
):
This concludes the proof.
p
Proof of Theorem 5.1. Any linear transform of n(^n
product induced by k kq ; for any 2 Rd ,
p
p
(112)
n(^n
n h^ n
0) =
kn
0)
0; v
can be expressed by the inner
i:
To exploit the property of …nite-dimensional sieve, we want both entries of the above inner
product are in the sieve space Akn . Observe
p
p
p
(113)
n h^ n
n h^ n
kn v i ;
kn 0 ; v i =
kn 0 ; kn v i + n h^ n
kn 0 ; v
and then, by Cauchy-Schwartz inequality, the second term on the right of (113) is bounded
above by
p
(114)
nk^ n
kn 0 kq kv
kn v kq :
By Assumption 5.5 and Theorem 4.1 (rate of convergence), (114) converges in probability
to zero. As such, we are left with the …rst term of (113). A trick is to express the inner
product by a function of squared norms. Let "n be arbitrary non-random sequence such that
"n = o(n 1=2 ). This is a local “perturbation” factor. By the polarization identity and the
bi-linearity of inner product,
p
p
1
4 n h^ n
kn 0 ; kn v i = 4 n"n h^ n
k n 0 ; " n kn v i
p
p
2
2
(115)
= n"n 1 k^ n + "n kn v
n"n 1 k^ n + "n kn v
k n 0 kq
k n 0 kq :
The reason why we bring in the ad-hoc sequence "n is to exploit the local-quadraticity of
Q( ) near 0 . Note that ^ n
"n kn v is close to ^ n ,
kn v is far away from 0 , but ^ n
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
47
and hence ^ n "n kn v approaches 0 along with ^ n . An important fact is that, when the
estimator ^ n is perturbed slightly by "n -factor, it still converges faster than n ; that is, for
any v 2 V,
k^ n "n v
0 k = op ( n ):
p
This is true because the rate of convergence is slower than the parametric rate n, and n is
slower than n 1=2 .
By Lemma A.29,
k^ n + "n
(116)
=
kn v
(Un
kn
E)h0 (
2
0 kq
kn
k^ n + "n
0 )[2"n kn v
kn v
kn
2
0 kq
] + Qn (^ n + "n
kn v
)
Qn (^ n
"n
kn v
1
) + op ( ):
n
Using Taylor expansion around ^ n , it is easy to show that
(117)
Qn (^ n + "n
kn v
)
Qn (^ n
"n
kn v
) = Op ("2n ):
Collecting (115)-(117), we obtain
(118)
p
p
p
1p
1
n h^ n
n(Un E)h0 ( kn 0 )[ kn v ] + n"n 1 Op ("2n ) + n"n 1 op ( ):
kn 0 ; k n v i =
2
n
Since the sequence "n is arbitrary, we may choose it such that the last two term in (118) are
o(1)-terms. Then, thanks to Lemma A.30, from (118),
1p
n(Un E)h0 ( 0 )[v ] + op (1):
2
By Theorem 12.3 of van der Vaart (2000) (the central limit theorem for U-statistic), the claim
is proved.
(119)
p
n(^n
0)
=
p
n h^ n
kn
0;
kn v
i=
Proof of Theorem 5.2. Let V~i = (Vi0 ; Bi )0 , and denote
h ( ; V~i ; V~j ) = Bi Bj h( ; Vi ; Vj );
and we can write
Qn ( ) =
1
n(n
1)
X
h ( ; V~i ; V~j ):
i6=j
Note that the weighted empirical criterion Qn has the same structure with the unweighted
empiricial criterion Qn . Also, importantly, both criteria induce the same population criterion;
~ ; V~i ; V~j )]:
Q( ) = E[h( ; Vi ; Vj )] = E[h(
Therefore, it is straightforward to modify the proof of Theorem 3.1 and Theorem 4.1 for the
weighted empirical criterion Qn . Then we obtain
k^ n
0 kc
p
! 0;
rn k^ n
0 kq
= Op (1):
48
JONG-MYUN MOON
Next, we modify the proof of Theorem 5.1. To this end, note
d~
h( + tv; V~i ; V~j )[v]
dt
0(
= Bi Bj
t=0
d
h( + tv; Vi ; Vj )[v]
dt
= Bi Bj h0 ( )[v]:
t=0
h0 (
Denote h
)[v] = Bi Bj
)[v]. Higher order derivatives are de…ned in the same manner.
Hence, it is again a straightforward business to adopt the proof of Theorem 5.1 to the weighted
empirical criterion. Then we can obtain, similar to (119), for any 2 Rd ;
(120)
p
n 0 (^n
0)
=
p
n h^ n
kn
0;
kn v
i=
1p ~
n(Un
2
~ 0(
E)h
0 )[v
] + op (1);
~n is the U-process empirical measure for the augmented random variable V~i . Note
where U
p
that the unconditional limiting distribution of n 0 (^n
0 ) is immediately derived from the
expression (120).
Now, subtract (120) from (119) to get
(121)
p
n h^ n
^ n;
kn v
i =
=
1p ~
~ 0 ( 0 )[v ]g + op (1)
n(Un E)fh0 ( 0 )[v ] h
2
p
1
n X
f(Bi Bj 1)h0 ( 0 )[v ]g + op (1):
2 n(n 1)
i6=j
p
We will study the conditional limiting distribution of n h^ n ^ n ; kn v i. To state the
claim precise, following the construction of van der Vaart and Wellner (1996), the underlying
probability space is expanded as follows. Let the sample fVi g1
i=1 be the projection of the …rst
1 coordinates in the probability space (V 1 Z; A1 C; P 1 Q) and the random weights
fBi g1
i=1 depend on the last coordinate only. We denote by EQ [ ] the conditional expectation,
1
conditional on fVi g1
i=1 . When a probabilistic statement holds conditional on fVi gi=1 for any
1
1
realization of fVi g1
i=1 but P -measure zero set, we will simply say that it holds P -almost
surely (a.s.).
Let us write, suppressing its dependence on
0
Wij = h (
0 ; Vi ; Vj )[v
Then, by Hoe¤ding decomposition,
p
1
n X
(Bi Bj 1)Wij
(122)
2 n(n 1)
i6=j
=
and n,
];
Wi =
1
n
1
n
X
1 X
p
(Bi 1)W i :
n
i
p
X
n
+
(Bi Bj
2n(n 1)
i6=j
Wij
j6=i
Bi
Bj + 1)Wij :
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
49
First, the convergence of the …rst term is shown. Note that EQ [f(Bi 1)W i g2 ] = (W i )2 and
8
9
n
<
=
X
X
X
1
1
2
2
(W i ) =
Wij +
Wij Wik ! 0 + E[W12 W13 ] P 1 -a.s.,
;
n
n(n 1)2 :
i=1
i6=j
i6=j6=k
by the strong law of large numbers for U-statistic (Theorem 4.1.4 of de la Peña and Giné
(1999)). Then, the Lindberg condition is, for any " > 0,
X
1
1
(123)
E[f p (Bi 1)W i g2 Ifj p (Bi 1)W i j > "g] ! 0 P 1 -a.s..
n
n
i
To show the condition (123), it su¢ ces to show, for every i 2 N,
p
(124)
E[(Bi 1)2 Ifj(Bi 1)W i j > n"g] ! 0 ,
and in turn, the condition (124) holds if W i is P 1 -a.s. uniformly bounded in n (recall that
W i depends on n). Observe that, conditional on Vi ,
W i ! E[Wij jVi ] P 1 -a.s. for any j 6= i:
by the strong law of large numbers. Any covering sequence is uniformly bounded. Therefore,
we verify the condition (124) and hence the Lindberg condition (123). This implies
1 X
d
p
(125)
(Bi 1)W i ! N (0; E[W12 W13 ]) P 1 -a.s..
n
i
It is left to show that the second term on the right of (122) degenerates. Let us write
~ij = Bi Bj Bi Bj + 1. See that EQ [B
~ij B
~ik ] = 0 if j 6= k, and EQ [B
~ 2 ] = 5. Therefore, it
B
ij
is easy to show that
p
X
X
n
1
~ij Wij g2 ] =
EQ [f
B
5Wij2 ! 0 P 1 -a.s.,
2
2n(n 1)
4n(n 1)
i6=j
i6=j
by Theorem 4.1.4 of de la Peña and Giné (1999). Therefore the conditional limiting distribution of (121) is N (0; E[W12 W13 ]) which coincides with the unconditional limiting distribution
p
of n 0 (^n
0 ). The claim follows by the Cramer-Wold device.
50
JONG-MYUN MOON
Appendix B. Maximal Inequality for U-Process
We collects several results in de la Peña and Giné (1999) (Henceforth, DG) to formulate a
convenient form of maximal inequality. Notations are independent from the rest of the paper.
1
fXi g1
i=1 is a i.i.d. sequence of random variables. f"i gi=1 be an i.i.d. sequence of Redemacher
random variables; "i is either 1 or 1 with equal probabilities. f"i g and fXi g are mutually
independent. N ("; F; d) is a covering number. F is a class of functions f : R2 ! R such that
E[f (X1 ; X2 )jX2 ] = E[f (X1 ; X2 )jX1 ] = 0. However, f is not necessarily symmetric; that is,
f (x1 ; x2 ) may not be equal to f (x2 ; x1 ). De…ne a random metric for f; g 2 F;
)1=2
(
1 P
"i "j (f (Xi ; Xj ) g(Xi ; Xj ))]2
;
dn (f; g) = E" [
n i6=j
for E" [ ] = E[ jX1 ;
; Xn ]. A short calculation shows
)1=2
(
1 P
2
:
[f (Xi ; Xj ) g(Xi ; Xj )]
dn (f; g) =
n2 i6=j
(126)
Theorem B.1. There is a universal constant K > 0 such that for any f0 2 F,
Z Dn
1X
2 1=2
log N ( ; F; dn )d
(127)
E[sup j
f (Xi ; Xj )j] . fE[f0 (Xi ; Xj ) ]g + K E
f 2F n
0
;
i6=j
where Dn is the diameter of F measured by a random metric dn ( ; ).
P
Proof. For an arbitrary deterministic sequence fxi gni=1 , f 7! n1 i6=j "i "j f (xi ; xj ) is a homogeneous Rademacher chaos process of degree 2.10 Thus by DG Corollary 5.1.8 (maximal
inequality for Rademacher chaos process), we obtain (128); for any f0 2 F and any deterministic sequence fxi g, there is a universal constant K > 0 such that
(128)
k sup fj
f 2F
1 P
"i "j f (xi ; xj )jgk
n i6=j
1
1 P
"i "j f0 (xi ; xj )k 1
n i6=j
Z Dn;x
+K
log N ("; F; dn;x )d";
k
0
11
where k k 1 is an Orlicz norm ; dn;x ( ; ) and Dn;x are equal to dn ( ; ) and Dn for Xi = xi .
The equations (4.3.3) and (4.3.4) of DG show the lower and upper bound for the Orlicz norm
in terms of the Lp -norms. Employing these bounds, the following inequality is obtained from
(128):
E[sup j
f 2F
1 P
"i "j f (xi ; xj )j]
n i6=j
10 See p. 110 of DG for the de…nition.
11 For the de…nition of the Orlicz norm, see p.188 of DG.
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
(129)
(130)
51
Z Dn;x
1 P
2 1=2
log N ("; F; dn;x )d"
. fE[
"i "j f0 (xi ; xj )] g + K
n i6=j
0
)1=2
(
Z Dn;x
1 P
2
log N ("; F; dn;x )d":
f0 (xi ; xj )
=
+K
n2 i6=j
0
The deterministic sequence fxi g in (130) will be substituted with a random sequence fXi g.
To do this, the unconditional expectation in the left of (129) is replaced by the conditional
expectation E" [ ], and as a result, we obtain
)1=2
(
Z Dn
1 P
1 P
2
f0 (Xi ; Xj )
+K
log N ("; F; dn )d";
"i "j f (Xi ; Xj )j] .
(131) E" [sup j
n2 i6=j
f 2F n i6=j
0
for Dn and dn ( ; ) de…ned already. After taking an unconditional expectation on both sides
of (131), apply Jensen’s inequality. Then (132) follows:
Z Dn
1 P
1=2
"i "j f (Xi ; Xj )j] . E[f0 (Xi ; Xj )2 ]
+KE
log N ( ; F; dn )d
(132) E[sup j
f 2F n i6=j
0
Note that
(133)
1 P
1 P
f (Xi ; Xj ) =
2
n i6=j
n i6=j
1
ff (Xi ; Xj ) + f (Xj ; Xi )g;
and (x1 ; x2 ) 7! 2 1 ff (x1 ; x2 ) + f (x2 ; x1 )g is a symmetric kernel. As such, by the randomization theorem (DG Theorem 3.5.3) and (133),
1 P
1X
E[sup j
f (Xi ; Xj )j]
E[sup j
"i "j 2 1 ff (Xi ; Xj ) + f (Xj ; Xi )gj]
n
n
f 2F
f 2F
i6=j
i6=j
1X
E[sup j
"i "j f (Xi ; Xj )j]:
f 2F n
(134)
i6=j
By (132) and (134), we conclude.
Theorem B.2. Suppose F is as described above. Then for some K > 0,
Z Dn
1X
(135)
E[ sup
jf (Xi ; Xj ) g(Xi ; Xj )j] . K E[
log N ( ; F; dn )d ];
f;g2F n
0
i6=j
where Dn is a random diameter of F measured by dn (f; g).
Proof. The same argument with the above theorem repeats. Apply the second inequality of
DG Corollary 5.1.8, instead of the …rst inequality.
52
JONG-MYUN MOON
References
Abrevaya, J. (2003): “Pairwise-di¤erence rank estimation of the transformation model,”
Journal of Business & Economic Statistics, 21(3).
Arcones, M. A., and E. Giné (1993): “Limit theorems for U-processes,” Annals of Probability, pp. 1494–1542.
Box, G. E., and D. R. Cox (1964): “An analysis of transformations,”Journal of the Royal
Statistical Society. Series B (Methodological), pp. 211–252.
Cavanagh, C., and R. P. Sherman (1998): “Rank estimators for monotonic index models,”
Journal of Econometrics, 84(2), 351–381.
Chen, S. (2002): “Rank estimation of transformation models,” Econometrica, 70(4), 1683–
1697.
Chen, X. (2007): “Large sample sieve estimation of semi-nonparametric models,”Handbook
of Econometrics, 6, 5549–5632.
Chen, X., and D. Pouzo (2009): “E¢ cient estimation of semiparametric conditional moment models with possibly nonsmooth residuals,”Journal of Econometrics, 152(1), 46–60.
Chiappori, P.-A., I. Komunjer, and D. Kristensen (2013): “Nonparametric identi…cation and estimation of transformation models,” Discussion paper.
de la Peña, V., and E. Giné (1999): Decoupling: from dependence to independence.
Springer Verlag.
Ding, Y., and B. Nan (2011): “A sieve m-theorem for bundled parameters in semiparametric models, with application to the e¢ cient estimation in a linear model for censored
data,” Annals of Statistics, 39(6), 3032–3061.
Eckstein, Z., and G. J. Van den Berg (2007): “Empirical labor search: A survey,”
Journal of Econometrics, 136(2), 531–564.
Ekeland, I., J. J. Heckman, and L. Nesheim (2002): “Identifying hedonic models,”
American Economic Review, 92(2), 304–309.
(2004): “Identi…cation and estimation of hedonic models.,” Journal of Political
Economy, 112(1), 60.
Farber, H. S. (1999): “Mobility and stability: the dynamics of job change in labor markets,”
Handbook of labor economics, 3, 2439–2483.
Gallant, A. R., and D. W. Nychka (1987): “Semi-nonparametric maximum likelihood
estimation,” Econometrica, pp. 363–390.
Han, A. K. (1987): “Non-parametric Analysis Of A Generalized Regression Model: The
Maximum Rank Correlation Estimator,” Journal of Econometrics, 35(2), 303–316.
Horowitz, J. L. (1996): “Semiparametric estimation of a regression model with an unknown
transformation of the dependent variable,” Econometrica, pp. 103–137.
SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS
53
(1998): Semiparametric methods in econometric, Lecture Notes In Statistics.
Springer Verlag.
Ichimura, H. (1993): “Semiparametric least squares (SLS) and weighted SLS estimation of
single-index models,” Journal of Econometrics, 58(1), 71–120.
Ichimura, H., and P. E. Todd (2007): “Implementing nonparametric and semiparametric
estimators,” Handbook of Econometrics, 6, 5369–5468.
Khan, S., Y. Shin, and E. Tamer (2011): “Heteroscedastic transformation models with
covariate dependent censoring,” Journal of Business & Economic Statistics, 29(1).
Khan, S., and E. Tamer (2007): “Partial rank estimation of duration models with general
forms of censoring,” Journal of Econometrics, 136(1), 251–280.
Kiefer, N. M. (1988): “Economic duration data and hazard functions,” Journal of Economic Literature, 26(2), 646–679.
Klein, R. W., and R. P. Sherman (2002): “Shift restrictions and semiparametric estimation in ordered response models,” Econometrica, 70(2), 663–691.
Linton, O., S. Sperlich, and I. Van Keilegom (2008): “Estimation of a semiparametric
transformation model,” Annals of Statistics, 36(2), 686–718.
Ma, S., and M. R. Kosorok (2005): “Robust semiparametric M-estimation and the
weighted bootstrap,” Journal of Multivariate Analysis, 96(1), 190–217.
Matzkin, R. L. (2007): “Nonparametric identi…cation,”Handbook of Econometrics, 6, 5307–
5368.
Meyer, B. D. (1996): “What have we learned from the Illinois reemployment bonus experiment?,” Journal of Labor Economics, pp. 26–51.
Mortensen, D. T., and C. A. Pissarides (1999): “New developments in models of search
in the labor market,” Handbook of labor economics, 3, 2567–2627.
Ramsay, J. (1988): “Monotone regression splines in action,”Statistical Science, pp. 425–441.
Ridder, G. (1990): “The non-parametric identi…cation of generalized accelerated failuretime models,” Review of Economic Studies, 57(2), 167–181.
Rogerson, R., R. Shimer, and R. Wright (2005): “Search-theoretic models of the labor
market: a survey,” Journal of Economic Literature, pp. 959–988.
Santos, A. (2011): “Semiparametric estimation of invertible models,” Discussion paper.
(2012): “Inference in nonparametric instrumental variables with partial identi…cation,” Econometrica, 80(1), 213–275.
Shen, X. (1997): “On methods of sieves and penalization,” Annals of Statistics, 25(6),
2555–2591.
Shen, X., and W. H. Wong (1994): “Convergence rate of sieve estimates,” Annals of
Statistics, pp. 580–615.
Sherman, R. P. (1993): “The limiting distribution of the maximum rank correlation estimator,” Econometrica, pp. 123–137.
54
JONG-MYUN MOON
(1994): “Maximal inequalities for degenerate U-processes with applications to optimization estimators,” Annals of Statistics, pp. 439–459.
van den Berg, G. J. (2001): “Duration models: speci…cation, identi…cation and multiple
durations,” Handbook of econometrics, 5, 3381–3460.
van den Berg, G. J., and G. Ridder (1998): “An empirical equilibrium search model of
the labor market,” Econometrica, pp. 1183–1221.
van der Vaart, A. W. (2000): Asymptotic statistics, vol. 3. Cambridge university press.
van der Vaart, A. W., and J. A. Wellner (1996): Weak convergence and empirical
processes. Springer-Verlag.
Ye, J., and N. Duan (1997): “Nonparametric n 1=2 -consistent estimation for the general
transformation models,” Annals of Statistics, 25(6), 2682–2717.
Download