SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS JONG-MYUN MOON Abstract. This paper studies transformation models T0 (Y ) = X 0 0 + " with an unknown monotone transformation T0 . Our focus is on the identi…cation and estimation of 0 , leaving the speci…cation of T0 and the distribution of " nonparametric. We identify 0 under a new set of conditions; speci…cally, we demonstrate that identi…cation may be achieved even when the regressor X has bounded support and contains discrete random variables. Our identi…cation is constructive and leads to sieve extremum estimator. The empirical criterion of our estimator has a U-process structure, and therefore does not conform to existing results in the sieve estimation literature. We derive the convergence rate of the estimator and demonstrate its asymptotic normality. For inference, the weighted bootstrap is proved to be consistent. The estimator is simple to implement with standard optimization algorithms. A simulation study provides insight on its …nite-sample performance. Date: October 27, 2014 A¢ liation and Contact Information: UCL and CeMMAP, Email: jong-myun.moon@ucl.ac.uk. 1 2 JONG-MYUN MOON 1. Introduction Data transformation is often used in econometric analysis. For example, dependent variables are routinely log-transformed in linear regressions in order to mitigate nonlinearity and heteroskedasticity. This e¤ective but arbitrary technique can be justi…ed if transforming functions are included as model parameters and then estimated using data. The most prominent example of this approach is the in‡uential Box-Cox transformation model (Box and Cox, 1964). Those authors suggested a parametric family of power functions, including a log-transformation, as candidate functions for data transformation. There are several variations of this approach, which involve di¤erent sets of transformation functions. However, if complex patterns are possible, then a nonparametric approach provides a useful alternative. This paper concerns identi…cation and estimation of regression models with a nonparametric transformation. Regression models with a transformed dependent variable are called transformation models. Speci…cally, transformation models are represented by the equation (1) T0 (Y ) = X 0 0 + "; where Y 2 R and X 2 Rdx are observed random variables, and " 2 R is an unobserved error term. In the model (1), there are three parameters: (i) the regressor coe¢ cient 0 , (ii) the transformation T0 and (iii) the error distribution. We consider the case when both T0 and the error distribution are nonparametric. Horowitz (1996) and Chen (2002) review the literature regarding identi…cation and estimation of model (1). For related models in econometrics, see Matzkin (2007). Following the literature, we assume (i) " is independent of X and (ii) T0 is strictly monotone. There are several applications of the transformation model (1). An important class of transformation models are duration models. In labor economics, the study of employment and unemployment durations is an important area of research, and the duration model has been the main vehicle of empirical studies (Kiefer, 1988, Farber, 1999). More recently, the unemployment duration is often studied through the labor-market search model (Mortensen and Pissarides, 1999, Rogerson, Shimer, and Wright, 2005), which imposes testable implications on the duration models (Eckstein and Van den Berg, 2007). See Meyer (1996), van den Berg and Ridder (1998) and van den Berg (2001) for related works. 
Also, hedonic models with additive marginal utility and additive marginal production technology, studied by Ekeland, Heckman, and Nesheim (2002, 2004), are closely related to the transformation model (1). Chiappori, Komunjer, and Kristensen (2013) provides an extensive list of applications in di¤erent areas. SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 3 We contribute to the literature by producing new conditions for identi…cation and proposing a new estimator for 0 . The identi…cation exploits two model features: the monotonicity of T0 and the additive separability of X 0 0 and ". First, we notice that an ordering is preserved by any monotone transformation. Therefore, if we use only the ordering induced by Y for the identi…cation, then the speci…c form of the transformation T0 is entirely irrelevant. This is not to say T0 is not identi…ed; indeed, if 0 is identi…ed, then the identi…cation of T0 can be established following Chen (2002). However, in order to identify 0 , it is enough to consider the ordering induced by Y , as will be demonstrated. Further, in general, the ordering is completely characterized by a binary relation. Therefore, if we are to use the information on ordering only, it is enough to see the binary comparison of any pair of data. The second observation leading to the identi…cation is that the ordering of Y is determined by a linear function X 0 0 + ". Suppose we have two observations (Y1 ; X10 ) and (Y2 ; X20 ). Then we see that Y1 < Y2 if and only if X10 0 + "1 < X20 0 + "2 . This observation may be summarized to the equality (2) IfY2 Y1 > 0g = If (X2 X1 )0 0 < "2 "1 g; for an indicator function If g. The relation (2) is similar to the binary choice model. The di¤erence of errors "2 "1 has the role of a random threshold, and the binary outcome of whether the inequality Y2 Y1 > 0 holds or not is determined by whether the threshold is crossed by the di¤erence of two “single indices” (X2 X1 )0 0 . These two observations help us formulate a minimization problem that identi…es the model parameter 0 as a unique solution. Our identi…cation result is similar in spirit to the identi…cation of the maximum rank correlation (MRC) estimator of Han (1987). A distinctive feature of our approach is that the cumulative distribution function (cdf) of "2 "1 , denoted by F0 , will be identi…ed along with 0 . Our identi…cation result is new, and provides new identifying conditions. Speci…cally, we allow the regressor vector X to contain discrete random variables. Further, all continuous regressors may have bounded support. Our key identifying condition is intuitive; we require that discrete regressors do not dominate the continuous regressors in terms of the relative contribution to the single index X 0 0 . However, regardless of the condition’s being met, the subvector of 0 for continuous regressors is identi…ed. The identi…cation is constructive in the sense that it suggests a natural estimator. Our estimator is de…ned as a minimizing solution to an empirical criterion, and the empirical criterion is acquired as a sample analogue of the identifying criterion. We propose to use the method of sieves. Sieves refer to a collection of subsets of parameter space which approximates the original parameter space increasingly well. Conceptually, a denser sieve is employed as more data are collected. See Chen (2007) for a survey of the literature on sieve estimation. 
4 JONG-MYUN MOON Our estimation procedure involves minimizing an empirical criterion, which is a function of and F , over a sieve space. As implied by the equation (2), the criterion function will involve pairwise combinations of observations in its formulation. As such, our empirical criterion has a U-process structure; in other words, it appears as a double-summation over every pair of observations. Extremum estimation involving U-processes has been studied by Sherman (1993, 1994) for parametric problems, and the theory is applied to the MRC estimation. The MRC criterion function has a U-process structure, and it is a step function of a Euclidean parameter. On the other hand, our empirical criterion function is a smooth function of parameters, which is one advantage of our approach. However, we need to extend the existing literature to deal with a seminonparametric problem to account for the in…nite-dimensional parameter F . To do so, we adopt and modify the existing results on sieve M-estimation by Shen and Wong (1994) and Shen (1997). The main contribution here is to show that the estimator minimizing the Uprocess can be represented as an approximate M-estimator. We achieve this by approximating the U-process using a more familiar empirical process. The theoretical device used for this task is the U-process maximal inequality; in Appendix B, we present its working form. We show that the estimator of F0 converges faster than the n1=4 -rate in terms of L2 -norm. The estimator of 0 converges at the n1=2 -rate to the normal distribution. Regarding the inference on 0 , because we provide an explicit form of the asymptotic variance, an inference can be conducted relying on the asymptotic approximation. A downside of this approach is that the asymptotic covariance matrix has quite a complex form, and that it requires estimation of even more nonparametric objects, such as a conditional expectation. Therefore, we prefer simulation-based methods and suggest a weighted-bootstrap scheme to approximate the …nite-sample distribution of the estimator. The consistency of weighted bootstrap has been recently shown by Ma and Kosorok (2005) and Chen and Pouzo (2009) for the sieve M-estimation and the conditional moment model, respectively. We extend these earlier works to the case when the empirical criterion has a U-process structure. There are several literatures related to this paper. First, several papers have proposed to p estimate T0 nonparametrically, when n-consistent estimator of 0 is available. As such, these papers and our work are complementary. See Horowitz (1996), Ye and Duan (1997), Klein and Sherman (2002), and Chen (2002). If T0 is parametrized, then all the model parameters can be estimated jointly, including 0 and T0 . Relevant works in this approach include Linton, Sperlich, and Van Keilegom (2008) and Santos (2011) among others. Second, there are rank-based estimators initiated by Han (1987). Other relevant works in this strand include Cavanagh and Sherman (1998), Abrevaya (2003), Khan and Tamer (2007) and Khan, Shin, and Tamer (2011) among others. A common aspect shared by these methods is that SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 5 is identi…ed and estimated without knowledge of T0 and the error distribution. Third, the methods for the single-index model are applicable to the transformation model. Single-index models have been extensively studied in econometrics and statistics since Ichimura (1993); see Horowitz (1998) and Ichimura and Todd (2007) for surveys. 
In addition, although our estimator is designed speci…cally for the transformation model, its technical aspect is akin to that of the single-index regression model. This is because the Euclidean parameter 0 enters the in…nite-dimensional parameter F0 as its argument. Rather unexpectedly, however, few works relate the sieve estimation to single-index models; see Ding and Nan (2011) and references therein. These results are not applicable to our problem.1 Therefore, we develop a suitable asymptotic theory that applies to the single-index problem in the context of sieve estimation. 0 The remainder of this paper is organized as follows. Section 2 de…nes the model and establishes the identi…cation. Section 3 de…nes our estimator and shows its consistency. Section 4 derives the rate-of-convergence. Section 5 shows the asymptotic normality of the estimator. It also includes the consistency of the weighted bootstrap procedure. Section 6 contains a simulation study. Section 7 discusses possible extensions. Proofs are gathered in the Appendix. Most notations will be de…ned in Section 2 and in Appendix A.1, but inevitably more notations will be added throughout the paper. 2. Identification We de…ne the criterion function that identi…es 0 and F0 as its minimizing solution. To this end, we need to introduce scale and location normalizations. As a scale normalization, the …rst component of 0 is normalized to 1, and thus 0 is written as ( 0;1 ; 00 )0 for a scalar 0;1 such that j 0;1 j = 1 and some (dx 1)-dimensional vector 0 . To see why it is necessary, consider T = cT0 and "i = c"i for some positive constant c > 0. Because T is strictly increasing and "i is not observed, an alternative model T (Yi ) = Xi0 (c 0 ) + "i is observationally equivalent to the original model (1). Therefore, for the point identi…cation of , we need to restrict the parameter space for so that no two possible points 1 and 2 can be related as a constant multiple of the other. There are other ways to achieve the scale normalization. For instance, we could set j 0 j = 1, so that the parameter space for is a unit sphere in Rdx . The location normalization is achieved by not allowing a constant term in X. Suppose we have a constant term c, and write the model as T0 (Yi ) = c+Xi0 0 +"i . This can be equivalently 1 The recent work by Ding and Nan (2011) assumes that the empirical criterion is twice Fréchet di¤erentiable with respect to a certain pseudo-metric. See pp.3035-3036 of Ding and Nan (2011). Our empirical criterion is not Fréchet di¤erentiable. 6 JONG-MYUN MOON written as T0 (Yi ) = c + Xi0 0 + "i for "i = "i + c c for any constant c. As these two models are observationally equivalent, the constant term c or c is not identi…ed. Notice that we do not impose a location normalization to "i ; its mean or median is not restricted. As mentioned in the introduction, our criterion function is motivated by the relation (2). We develop further from (2) to induce the identifying criterion. By taking a conditional expectation in both sides of (2), we have P ( Y > 0jX1 ; X2 ) = P ( " > X0 0 jX1 ; X2 ) =1 F0 ( X0 0 ); where F0 is the cdf of "2 "1 and the notation denotes a di¤erence of two consecutive observations; that is, ( ) = ( )2 ( )1 . Recall "1 and "2 are from i.i.d. samples, and hence the distribution of "1 "2 is equal to the distribution "2 "1 . This implies that 1 F0 ( z) = F0 (z) for any z 2 R. Then we have an equation P ( Y > 0jX1 ; X2 ) = F0 ( X 0 (3) 0 ): This relation leads us to de…ne a new criterion. 
To state it, let us relabel the parameters and F . Because the …rst component of is normalized to 1, we denote it separately by b 2 f 1; 1g. Then is a (dx 1)-by-1 vector such that = (b; 0 )0 . Combining the parameters of interest and F , we write = ( ; F ). Then, we de…ne a nonlinear least squares criterion implied by the relation (3) as follows; for Vi = (Yi ; Xi0 )0 , (4) h(b; ; V1 ; V2 ) = fIf Y > 0g F ( X 0 )g2 ; Q(b; ) = E[h(b; ; V1 ; V2 )]: We call Q(b; ) the population criterion. A corresponding empirical criterion will be de…ned in Section 3. Theorem 2.1 below shows that 0 and F0 are uniquely identi…ed as a minimizer of the population criterion function Q, and the in…nite-dimensional parameter F0 is uniquely identi…ed on the support of X 0 0 . The following notations are needed. Because we have di¤erent conditions for continuous and discrete regressors (see Assumption 2.3), let us divide Xi to a continuous random vector 0 ; X 0 ). Divide Xi;c 2 Rdc and a discrete random vector Xi;c 2 Rdx dc so that Xi0 = (Xi;c i;d to c and d accordingly. Similarly, we write X 0 = ( Xc0 ; Xd0 ). The support of a random vector X is denoted by supp X.2 Lastly, we denote Nj = supp " \ fx0 0;c + 0 j 1 0;d : x 2 supp Xi;c g \ fx0 x dc for some constants f j gdj=0 and j 2 f1; ; dx the identi…cation purpose (see Assumption 2.4). 0;c + 0 j 0;d : x 2 supp Xi;c g; dc g. The last notation Nj is used only for Assumption 2.1. fYi ; Xi ; "i gni=1 is independent and identically distributed (i.i.d.) and conforms to the equation (1). "i is continuous and independent of Xi . 2 For a random variable X, its support is de…ned as the smallest closed set B such that P [X 2 B c ] = 0. SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS Assumption 2.2. (i) 0 = ( 0;1 ; 00 )0 for 0;1 2 f 1; 1g and collection of continuous monotone functions on R. F0 2 F. 0 2 Rdx 7 1. (ii) F is a 0 ; X 0 )0 , X dc is jointly continuous, and X Assumption 2.3. (i) For Xi = (Xi;c i;c 2 R i;d 2 i;d d d R x c is discrete. There is no constant in Xi . (ii) supp Xi = supp Xi;c supp Xi;d . (iii) supp Xd is not contained in a proper linear subspace of Rdx dc . Assumption 2.4. There exist a set of points f 0 = 0; 1 ; ; dx dc g such that j 2 supp Xd for j = 1; ; dx dc and f 1 ; ; dx dc g are linearly independent. In addition, the set Nj has a non-empty interior for every j = 1; ; d x dc . Assumption 2.1 is standard in the literature although the independence of "i and Xi0 0 can be weakened to the conditional median independence; see Khan, Shin and Tamer (2012). We do not consider this possibility. Assumption 2.2 regards the parameter spaces for and F . Assumption 2.2 (i) restates our scale normalization and restricts to be compact. Assumption 2.2 (ii) de…nes F0 , and restricts the parameter space F for F0 . Regarding the identi…cation, F0 needs not to be smooth. However, it is essential that any F 2 F is continuous and monotone. The monotonicity requirement is not required if the support for X 0 0 is R. Heuristically speaking, this assumption regulates the possible value of the parameter when the support of Xi is not connected due to the existence of discrete regressors. Assumption 2.3 (i) allows that both continuous and discrete regressors in Xi . The requirement that Xi;c is jointly continuous implies that supp Xc has a non-empty interior, or, equivalently, that supp Xc is not included in a proper subspace of Rdx;c . If this assumption is violated, then supp Xc will exhibit the multi-collinearity. As explained above, a constant term is not allowed. 
Assumption 2.3 (ii) means that the support of Xi;c does not depend on the realization of Xi;d . This assumption can be weakened, and we may allow the support of Xi;c depends on the value of the discrete regressors Xi;d , as long as the support of Xi;c conditional on Xi;d has non-empty interior in Rdc . What is necessary to this generalization is only to modify Assumption 2.4 accordingly. The proof of identi…cation will remain essentially same. For simplicity, however, we do not attempt this generalization. Assumption 2.3 (iii) is a requirement for the discrete regressor Xi;d , parallel to the requirement that Xi;c is jointly continuous. Assumption 2.4 requires that (i) the contribution of discrete variables to the single index Xi0 0 is not too large relative to that of continuous variables, and (ii) the variation of the error term is not too small. This assumption regards the identi…cation of 0;d , that is, the regression parameter for the discrete regressors. If there is no discrete regressor, therefore, Assumptions 2.4 is not needed. Even with discrete regressors, if the support of any regressor is R, then we can omit it. 8 JONG-MYUN MOON Theorem 2.1. Suppose Assumptions 2.1-2.4 hold and de…ne A = Q(b1 ; for some b1 2 f 1; 1g and any z 2 supp X 0 0 . 1 1) = min b2f 1;1g; 2A F. let Q(b; ) = ( 1 ; F1 ) 2 A. Then b1 = 0;1 , 1 = 0 and F1 (z) = F0 (z) for The proof of Theorem 2.1 is in the Appendix A. Theorem 2.1 establishes the identi…cation of 0 and F0 . We stress that F0 is identi…ed only on the support of X 0 0 . This fact adds some complication when we study the estimation of 0 and F0 . 3. Consistency 3.1. Extremum Estimation and Method of Sieves. The identi…cation result of Theorem 2.1 is constructive in the sense that it suggests an extremum estimator. This section de…nes our estimator and proves its consistency. As there is an in…nite-dimensional parameter, the consistency will be stated in terms of a particular norm that we de…ne soon. Before proceeding, let us add one simpli…cation. Henceforth, we assume 0;1 is known and its value is 1; this can be accepted without loss of generality, because our estimator of 0;1 exactly equals to the true value, with probability approaching 1. Therefore we let 0 = (1; 00 )0 , and further, simplify notations to Q( ) = Q(1; ) and h( ; ; ) = h(1; ; ; ). The sample analogue to the population criterion Q de…ned in (4) is X 1 (5) Qn ( ) = h( ; Vi ; Vj ): n(n 1) i6=j Let us call Qn the empirical criterion. It is immediate from the de…nition that E[Qn ( )] = Q( ). Also, Qn ( ) is a U-statistic for Q( ). If viewed as a stochastic process, then Qn ( ) induces a U-process, a generalization of U-statistic; it is a U-process after centering by Q( ) p and scaling by n. Much of our asymptotic theory will rely on the U-process theory3. We minimize Qn not over A but over a subset of A, called a sieve. Let us denote a collection of sieves by fAk g. It is required that the sieve Ak approximates the entire parameter space A increasingly accurately as the index k grows. For a …nite sample size, we get to pick one sieve Ak to use. However, conceptually, a di¤erent sieve Akn is used as the sample size n changes. The sieve index kn depends on n, and grows to the in…nity along with the sample size n. Our discussion below will rely on abstract assumptions on the sieve spaces fAk g and the speed of divergence of the sieve index kn . 
Because is …nite-dimensional, we may de…ne the sieve Ak as a product of and Fk ; that is, only the in…nite-dimensional F is “sieved.” 3 U-Process theory is similar to the empirical process theory. For more about U-process theory, see Arcones and Giné (1993), Sherman (1994) and de la Peña and Giné (1999) among others. SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS Using the sieve Akn = 9 Fkn , we de…ne the estimator ^ n as follows; ^ n 2 argmin Qn ( ): (6) 2Akn We write ^ n = (^n ; F^n ) for ^n 2 and F^n 2 Fkn . If there are multiple minimizers in (6), any point among them can be chosen as the estimator. 3.2. Consistency. In semi-nonparametric problems, there are several candidates for a norm attached to the parameter space. This is due to the in…nite-dimensional nature of the problem. One of the main task in studying a semi-nonparametric problem is to …nd out a proper norm to the context. In contrast, in parametric problems, the Euclidean norm is a natural choice to measure a distance. We start by de…ning a suitable norm to state the consistency of the estimator ^ n .4 When de…ning norms on F, an important fact is that F0 is only identi…ed on the support of X 0 0 . Therefore, we …rst de…ne a norm on F as ( ) kF kF ;c = max sup z2supp X0 0 jF (z)j; sup supp X0 0 jF 0 (z)j : Then, de…ne the consistency norm k kc on A as k kc = j j + kF kF ;c . Also, denote the usual sup-norm by k k1 . We are ready to state assumptions for the consistency. We assume Xi has at least fourth moment (EjXi j4 < 1). Also fVi = (Yi ; Xi0 )0 g is always a random sample. These two premises are maintained throughout the paper. We list other more substantial assumptions. Assumption 3.1. (i) The parameter 0 is uniquely identi…ed in the sense of Theorem 2.1. (ii) is a compact subset of Rd with a non-empty interior. 0 is an interior point of . i d Assumption 3.2. (i) For some integer 3, maxi2f0;1; ; g supz2R j dz i F0 (z)j < 1. (ii) For some constant ! > 0, collect every monotone function F on R such that max i2f0;1; sup ; g z2R d fF (z) dz i F0 (z)g (1 + z 2 )!=2 B; for some positive constant B > 0. The set F is the closure of this function class in the norm kF k1;1 = kF k1 _ kF 0 k1 . Assumption 3.3. There exists a sequence f k F0 g such that maxi2f0;1g supz2R dzd i f k F0 (z) F0 (z)g ! 0 as k ! 1: k F0 2 Fk and Assumptions 3.1 is standard. The true parameter 0 needs not to be an interior point for consistency, but it is included for later results. Assumption 3.2 (i) states that F0 is at least 4 Later we add more norms when needed. See the de…nition (7) and Appendix A.1. In fact, all those norms are only semi-norms. We do not stress this fact. 10 JONG-MYUN MOON -times di¤erentiable and its derivatives are uniformly bounded. Assumption 3.2 (ii) de…nes the set F. There are several implications. First, by de…nition, F0 is an interior point5 of F. Second, the weighting function (1 + ( )2 )!=2 is included to address the case when Xi have an unbounded support. The particular form of the weighting function and its technical usage come from Gallant and Nychka (1987). Third, F 2 F needs not to be a cdf. Recall that F 2 F being continuous monotone is enough for the identi…cation (Assumption 2.2 (ii)). However, it is possible to make F include only cdfs. Similarly, knowing that F0 is symmetric (that is, F0 (z) = 1 F0 ( z) for any z 2 R), we may restrict that every F 2 F is symmetric. The asymptotic distribution of ^n is not a¤ected by the choice of F. Assumption 3.3 speci…es the approximation property of the sieves. 
For consistency, it is enough that the true parameter F0 is well approximated. We de…ne k 0 = ( 0 ; k F0 ). Notice that k k 0 0 kc ! 0. Theorem 3.1. Suppose Assumptions 3.1-3.3 hold. Then k^ n 0 kc p ! 0. The proof of Theorem 3.1 is in the appendix. Notice that the derivative F00 , as well as F0 , is consistently estimated, uniformly on the support of X 0 0 . This result is used to establish the convergence rate of ^ n in a weaker norm. 4. Rate of Convergence This section derives the convergence rate of the estimator ^ n . The …rst step is to de…ne an appropriate norm on A. To this end, we need to show that the population criterion Q induces a norm on the parameter space local to 0 . We provide heuristic explanations. Given the consistency result, we can focus on a subset of parameter space A near to 0 . Consider a local neighborhood of 0 in the normed space (A; k kc ). By the equality (3), it is easy to show that Q( ) Q( 0 ) = E[F ( X 0 ) F0 ( X 0 0 )]2 : Recall that we set 0;1 = 1 and as such X 0 = X1 + X 0 for Xj = X2;j X1;j and X = [ X2 ; ; Xdx ]0 . Applying the Taylor expansion to F ( X 0 ) F0 ( X 0 0 ), we obtain the following approximate equality: Q( ) Q( 0) ' E[fF 0 ( X 0 ) X 0 ( 0) + F ( X0 0) F0 ( X 0 0 )g 2 ]; 0 0 0 0 If k 0 kc is small, then we may replace F ( X ) by F0 ( X 0 ) in the last expression. This is the reason why the consistency norm k kc is chosen to involve the …rst-order derivative of F . 5 Here, we regard F as a normed space attached with k k 1;1 ; that is, F is a whole set. Note that F0 is not an interior point of a larger normed space fF : kF k1;1 < 1g with the same norm. SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 11 This heuristic observation motivates us to de…ne the following norm as a measure of rate of convergence for ^ n ; de…ne the rate norm k kq as (7) k kq = fE[fF00 ( X 0 X0 + F ( X0 0) 0 )g 2 ]g1=2 : The subscript q is chosen to indicate that the norm is derived from the population criterion Q. Lemma A.15 proves that Q( ) Q( 0 ) is locally similar to k k2q on the open neighborhood of 0 in the normed space (A; k kc ). In standard parametric problems, a similar relation holds with the Euclidean norm j j. The rate norm k kq is not necessarily an object of interest; however, it turns out that the rate norm k kq is equivalent6 to the norm j j+kF ( X 0 0 )kL2 (P ) for the usual L2 -norm k kL2 (P ) with respect to the probability measure P . Then, for instance, the upper bound for the rate of j^n 0 j is given by the k kq -norm rate. The following three assumptions, in addition to the assumptions for the consistency, will be used to derive the rate of convergence. Assumption 4.1. 2 ! > ! + : Assumption 4.2. There exists a sequence f rn k kn 0 0 kq = o(1). Assumption 4.3. Denote, for = E[F00 ( X 0 The matrix 2 0) f k 0 ; Xdx ]0 , X = [ X2 ; X = ( 0 ; F0;k ) : k 2 N; Fk 2 Fk g such that E[ Xj X 0 0 ]gf X E[ Xj X 0 0 ]g 0 ]: is non-singular. Assumption 4.1 limits possible values for the constants and !. Recall that these two constants are used to de…ne the parameter space F in Assumptions 3.2-3.3. Note that the convergence rate rn will be determined by these constants. Assumption 4.2 states that the sieve approximation error k kn 0 0 kq vanishes faster than the convergence rate rn . This requirement is intuitive because the rate of k kn 0 0 kq is an upper bound for the rate of k^ n 0 kq . Assumption 4.3 is a key condition to the entire rate calculation. It has a similar role to the nonsingularity of Hessian matrix in usual parametric problems. 
The particular form of the matrix will be suggested in the proof of Lemma A.14, which proves the norm equivalence of k kq and j j + kF ( X 0 0 )kL2 (P ) . These three assumptions with the consistency of ^ n in the norm k kc are su¢ cient to have the following result. Recall that the constants and ! are de…ned in Assumption 3.2. Theorem 4.1 (Rate of Convergence). Suppose Assumptions 3.1-3.3 and 4.1-4.3 hold. Then rn k^ n 0 kq = Op (1); 6 Two norms are equivalent if their ratio remains within a …xed range [a; b] for 0 < a < b < 1, for any point. This equivalence result is proved in Lemma A.14. 12 JONG-MYUN MOON for the rate-of-convergence factor rn = n !=(2 !+ +!) . The convergence rate for sieve M-estimator is proved by Shen and Wong (1994). A similar result can be found in van der Vaart and Wellner (1996), Theorem 3.4.1. We use the proof method similar to van der Vaart and Wellner (1996). When doing so, it needs to be considered that the empirical criterion Qn has a U-process structure. Sherman (1993, 1994) study a similar problem in parametric problems. Our result extends Sherman (1993, 1994) to in…nite-dimensional problems with sieve spaces. To facilitate asymptotic analysis, we need to decompose the criterion function. De…ne for v; v1 ; v2 2 R2+dx , m( ; v) = E[h( ; V1 ; V2 )jV1 = v] + E[h( ; V1 ; V2 )jV2 = v] g( ; v1 ; v2 ) = h( ; v1 ; v2 ) Q( ); E[h( ; V1 ; V2 )jV1 = v1 ] + E[h( ; V1 ; V2 )jV2 = v2 ] + Q( ): Note that E[m( ; V2 )] = Q( ) and E[g( ; V1 ; V2 )] = 0. Moreover, it can be checked that n (8) Qn ( ) = X 1 1X g( ; Vi ; Vj ): m( ; Vi ) + n n(n 1) i=1 i6=j The expression (8) is called the Höe¤ding decomposition; this is a fundamental result to the U-statistic theory. Because E[g( ; V1 ; V2 )jV1 ] = E[g( ; V1 ; V2 )jV2 ] = 0 for any 2 A, the second term in the right of (8) is called a degenerate U-process. From the last expression, it is clear that the U-process criterion is the sum of a sample-mean process and the degenerate U-process. As such, our proof of Theorem 4.1 can be divided to two parts. First, we show that the degenerate U-process in (8) is asymptotically negligible. This is proved in Lemma A.13 in the appendix. Then, we can treat ^ n as a M-estimator minimizing the sample mean of m( ; Vi ) with some error; the error comes from the degenerate U-process. Second, we prove the rate-of-convergence using the empirical process theory, similar to van der Vaart and Wellner (1996) Theorem 3.4.1. 5. Asymptotic Normality This section focuses on the asymptotic distribution of ^n ; recall that = (1; 0 )0 . The in…nitedimensional parameter F is treated as a nuisance parameter. The …rst step is to express as a function of . We are to express such a functional as an inner product of and a special point v . The inner product is induced by the norm k kq . To de…ne it, let V be a product space of Rd and fF : kF ( X 0 0 )kL2 (P ) < 1g. For arbitrary two points v; w in V, we de…ne (9) hv; wi = E[fF00 ( X 0 0) X0 v + Fv ( X 0 0 0 )gfF0 ( X0 0) X0 w + Fw ( X 0 0 )g]; SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 13 for v = ( v ; Fv ) and w = ( w ; Fw ). It can be easily veri…ed that the bilinear map h ; i is indeed an inner product. Then, the special point v is de…ned as follows. Let v = ( ; F ) for 0 1 = and F (z) = F0 (z)E[ X 0 j X 0 0 = z] 1 : Assume that v is in V, or, equivalent, assume that kF ( X 0 calculation7, one can show that 0 (10) 0 )kL1 (P ) is …nite. By easy = h ; v i: Therefore, we know the exact expression for the special point v . 
Even when its expression is unknown, however, the existence of v is guaranteed by the Riesz representation theorem if V is a Hilbert space and the map 7! 0 is bounded linear. For this reason, v is often called the Riesz representer. The representation of 0 as an inner product (10) is instrumental, since it is possible to approximate the inner product by the population criterion. Note that the inner product (9) is equivalently de…ned by the polarization identify: 4 hv; wi = kv + wk2q (11) kv wk2q : Therefore, if the two squared norms in (11) are well approximated, so is the inner product. A relevant fact is that the rate-norm k kq is chosen to approximate the population criterion Q locally to 0 ; see (7) in the previous section. Therefore it is foreseeable that 0 can be expressed using Q. There are technical subtlety in doing so, and more details can be found in the proof of Theorem 5.1. To obtain the asymptotic normality, the following assumptions are used. Assumption 5.1. !> Assumption 5.2. For kF0;kn ( X 0 0) + !. kn F0 ( X 0 0 = ( 0 ; F0;kn ) de…ned in Assumption 4.2, 0 )kL2 (P ) Assumption 5.3. (i) Fk = o(n spanfp1 ; 2=3 0 ), kF0;k ( X0 n 0) F00 ( X 0 0 )kL4 (P ) = o(n 1=3 ): ; pk g for all k; (ii) fkpj k1 g1 j=1 is uniformly bounded. j d Assumption 5.4. Let j (k) = max1 i k k dz j pi k1 . Then the followings hold: (i) p 2 3 1 kn ^ rn , (ii) kn rn = o(n ) and (iii) kn2 rn 1 = o(1). 2 (kn ) . 1 (kn ) _ Assumption 5.5. Let pk (z) = (p1 (z); ; pk (z))0 . The smallest eigenvalue of k 0 k 0 0 E[p ( X 0 )p ( X 0 ) ] is bounded away from zero uniformly in k 2 N. Assumption 5.6. For any f kn v : kn v 2 Rd , there exists a sequence = ( ;F ;kn ); 2 ;F 7 A similar calculation appears in the proof of Lemma A.14. ;kn 2 spanfp1 ; ; pkn gg; 14 such that (i) JONG-MYUN MOON p nrn k kn v v kq ! 0 as n ! 1 and that (ii) supn2R kF ;kn k1 is bounded. Assumption 5.1 is stronger than Assumption 4.1. For the asymptotic normality of ^n , we 2 need that k close to ^ n . Therefore, if ^ n 0 kq is well approximated by Q( ) Q( 0 ) for converges faster, then the approximation shows less error. By imposing Assumption 5.1, we achieve the faster convergence rate and hence control the approximation error. Assumption 5.2 demands the sieve approximation error vanishes not only for F0 but also for its derivative F00 at a certain rate. Assumption 5.3 limits the sieve space that we consider. As mentioned already, we choose Fk to be …nite-dimensional and linear. The functions fp1 ; p2 ; g are called basis functions. Assumption 5.4 regards the smoothness property of the basis functions. Note that j (k) can be regarded as a smoothness measure for a basis functions fp1 ; ; pk g. The role of Assumption 5.4 is to control the convergence of derivatives of F^n . Recall that the convergence rate is stated in terms of the rate norm k kq , and that the convergence of ^ 00 ^0 k^ n 0 kq does not imply that the derivatives Fn and Fn converge in some norm. However, by imposing Assumption 5.4, we can control the convergence rate of kF^n0 F0 k1 and kF^n00 F0 k1 with regard to kF^n F0 k1 . Assumption 5.5 is used to establish the norm equivalence between kF ( X 0 0 )kL2 (P ) and kF ( X 0 0 )kL1 (P ) for F in Fk . This is possible because Fk is a …nitedimensional sieve; recall that in the Euclidian space, Lp -norms are equivalent for 1 p 1. Assumption 5.6 states that the Riesz representer v can be approximated by a sequence in the sieves to a certain precision. Before stating the main result, we add one more notation. 
De…ne the linear directional derivative of h( ; ; ) to the direction v = ( v ; Fv ) 2 V as h0 ( ; ; )[v] = (12) d h( + tv; ; ) dt : t=0 Now we state the main result of this paper. Theorem 5.1. Suppose Assumptions 3.1-3.3, 4.3, 5.1-5.5 hold. Then p d n(^n 0 ) ! N (0; ); where the matrix is such that, for any 0 = E h0 ( 2 Rd , 0 ; V1 ; V2 )[v h0 ( 0 ; V1 ; V3 )[v ]]: The proof of Theorem 5.1 can be found in Appendix A, and the functional form of h0 ( 0 ; V1 ; V2 )[ ] is derived by Lemma A.19. Because we have an explicit expression for v , it is possible to estimate v and then the matrix . If is consistently estimated, the inference on 0 can be conducted relying on the asymptotic normality result of the above theorem. A downside to this approach is that it involves several nonparametric estimations. For instance, SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 15 to estimate v , the conditional expectation E[ X 0 j X 0 0 ] needs to be estimated. Therefore, a simulation-based method is preferred. Below, we prove the consistency of weighted bootstrap. 5.1. Weighted Bootstrap. Consider a randomly generated sequence of weights fBi gni=1 . We assume E[Bi ] = 1 and V ar(Bi ) = 1. If these conditions are met, it may have any distribution. Possible distributions are the discrete uniform distribution on f0; 2g or the normal distribution N (1; 1). De…ne the weighted empirical criterion X 1 Bi Bj h( ; Vi ; Vj ): Qn ( ) = n(n 1) i6=j Next, de…ne ^ n to be a point such that ^ n 2 Akn and ^ n 2 argmin 2Akn Qn ( ): Also, write ^ n = (^n ; F^n ). The following theorem proves that the asymptotic distribution of p ^ ^ n( n n ) conditional on the sample fV1 ; ; Vn g is same with the unconditional asymptotic p ^ distribution of n( n 0 ). Theorem 5.2. Suppose all the conditions of Theorem 5.1 hold. If fBi gni=1 is an i.i.d. sequence such that E[Bi ] = 1 and V ar(Bi ) = 1, then for any c 2 Rd and any n 2 N, p p P [ n(^n ^n ) cjV1 ; ; Vn ] = P [ n(^n c] + op (1): 0) The bootstrap inference is easy to implement. Fix the distribution for Bi , and draw the random weights fBi gni=1 . Then, estimate ^ n by minimizing the weighted empirical criterion Qn . The sieve-size index kn remains same with the original problem. By repeating this p procedure, we obtain the empirical distribution of n(^n ^n ) conditional on fVi gni=1 . Then, the quantiles of the empirical distribution can be used as critical values for the inference on p ^ n( n 0 ). 6. Simulation Study Many duration models are examples of the transformation model. Proportional hazard models and mixed proportional hazard models are all nested in transformation models.8 We use those two models to conduct the following simulation study. 8 Proportional hazard models assume the error distribution is …xed to be a negative extreme-value distribution, whereas the transformation function (or baseline hazard) remains nonparametric. Mixed proportional hazard models are more general, but still restrictive; for instance, the normal distribution is not allowed for the error distribution (Ridder, 1990). 16 JONG-MYUN MOON 1 Design 1 Design 2 Design 3 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 −8 −6 −4 −2 0 2 4 6 Figure 1. CDF of We consider three designs. The transformation function T (y) = log y is chosen for data generation. However, note that all three estimators are numerically invariant even if Y is transformed by any other monotone function. 
The data are generated from the following equation log Y = X1 ", for ( 1 ; 2 ) = (1; 1): 1 X2 2 X3 This speci…cation is shared by all three designs. Further, we …x the distribution of (X1 ; X2 ; X3 ); X1 and X2 are standard normal random variables and X3 is a binary random variable with equal probabilities of being 0 or 1. (X1 ; X2 ; X3 ) are mutually independent. Across three designs, we vary only the distribution of ". This is summarized below: Design 1: " EV (0; 1); d Design 2: " = log v + u; for v d Design 3: " = log v + u; for v (1; 1) and u EV (0; 1); (3; 3) and u EV (0; 1); where EV (0; 1) means the standard extreme-value distribution with cdf F (z) = exp( exp( z)), and ( ; ) denotes the gamma distribution with mean and variance 2 . Design 1 conforms to the proportional hazard model. Design 2 and 3 belong to the mixed proportional hazard model or frailty model. As the additional random error v follows the gamma distribution, they are also called a gamma frailty model. Finite-sample distributions of several estimators are compared. Let us call the sieve extremum estimator developed by this paper, the sieve estimator. We compare the sieve estimator with two others estimators: Cox estimator for the proportional hazard model and the MRC SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 17 estimator of Han (1987). Note that the Cox estimator is mis-speci…ed for Design 2 and Design 3. We still report the result because Cox model is widely used in empirical researches. For each design, we generate samples of size 100 and 300. Then the parameter = ( 1 ; 2 ) is estimated for (i) Cox estimator, (ii) sieve estimator, and (iii) MRC estimator. Regarding the sieve estimator, we vary the dimension of sieve space to k = 3; 5; 7. The estimation procedure is repeated 500 times, and we report the sample bias and the sample mean squared error (MSE) from 500 estimates of …ve estimators. To implement the sieve estimator, sieve Fk needs to be speci…ed. We choose I-spline as basis functions. Ramsay (1988) explains the construction of I-spline. What is useful with I-spline is that each basis function is a cdf of some continuous random variable. Therefore, it is easy to tune Fk to our purpose of estimating a symmetric cdf. We construct Fk to contain only symmetric cdfs from I-spline bases. The dimension of Fk equals to the index k. The simulation results are summarized in Figure 2-7. In each …gure, the left panel shows the bias and the right panel shows MSE. Bias1 indicates the bias of estimating 1 . Bias2 is for 2 . MSE1 and MSE2 also correspond to 1 and 2 respectively. Design 1 provides a good benchmark to our estimator. It is because the Cox estimator is correctly speci…ed and has one less in…nite-dimensional parameter. Not surprisingly, the Cox estimator shows the least MSE. Our estimator behaves comparably well. The e¢ ciency loss of our estimator relative to the Cox estimator seems bearable when considering Design 2 and Design 3. The Cox estimator shows a large bias in these mis-speci…ed designs. On the contrary, the sieve estimator performs well across all three designs. Compared to MRC estimator, the sieve estimator shows less MSE, especially for a smaller sample size of n = 100. We also notice that the sieve estimator is not sensitive to di¤erent sieve-size indexes k 2 f3; 5; 7g. In summary, we …nd that the sieve estimator behaves well, even for a small sample size. 7. Conclusion The intuition that a binary comparison speci…es the ordering is used to identify the transformation model. 
A new estimator is constructed from the identi…cation result. Its asymptotic distribution is derived, and the bootstrap inference is justi…ed. As technical by-products, we contribute to the literature on the sieve estimation by studying a U-process problem and showing how to handle the single-index structure in the semi-nonparametric problem. Several important extensions are possible. Regarding its application to duration models, we may extend the current method to account for censoring and time-varying regressors. Another direction is to consider competing risks models. We hope to study these extensions in future researches. 18 JONG-MYUN MOON 0.2 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 0.15 0.1 0.05 0 Cox Sieve3 Sieve5 Bias1 Sieve7 MRC Cox Bias2 Sieve3 Sieve5 MSE1 Sieve7 MRC MSE2 Figure 2. Simulation result for Design 1 when n = 100 0.06 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 0.05 0.04 0.03 0.02 0.01 0 Cox Sieve3 Sieve5 Sieve7 Bias1 MRC Cox Bias2 Sieve3 Sieve5 Sieve7 MSE1 MRC MSE2 Figure 3. Simulation result for Design 1 when n = 300 0.25 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.2 0.15 0.1 0.05 0 Cox Sieve3 Sieve5 Bias1 Sieve7 Bias2 MRC Cox Sieve3 Sieve5 MSE1 Sieve7 MSE2 Figure 4. Simulation result for Design 2 when n = 100 MRC SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 19 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 Cox Sieve3 Sieve5 Bias1 Sieve7 MRC Cox Bias2 Sieve3 Sieve5 Sieve7 MSE1 MRC MSE2 Figure 5. Simulation result for Design 2 when n = 300 1.4 4 3.5 3 2.5 2 1.5 1 0.5 0 1.2 1 0.8 0.6 0.4 0.2 0 Cox Sieve3 Sieve5 Bias1 Sieve7 MRC Cox Bias2 Sieve3 Sieve5 MSE1 Sieve7 MRC MSE2 Figure 6. Simulation result for Design 3 when n = 100 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1.4 1.2 1 0.8 0.6 0.4 0.2 0 Cox Sieve3 Sieve5 Bias1 Sieve7 Bias2 MRC Cox Sieve3 Sieve5 MSE1 Sieve7 MSE2 Figure 7. Simulation result for Design 3 when n = 300 MRC 20 JONG-MYUN MOON Appendix A. Proofs A.1. Notations. We de…ne and use several norms throughout the appendix. k k k k k k k k k k k k ;1;! k ;1 k1 kL1 (P ) kLp (P ) kF ;c ke; ;1 ke;1 kc kq ke;Lp The The The The The The The The The The The norm norm norm norm norm norm norm norm norm norm norm i d 2 !=2 kF k ;1;! = max0 i supz2R j dz i F (z)j(1 + z ) i d kF k ;1 = max0 i supz2R j dz i F (z)j kF k1 = supz2R jF (z)j kXkL1 (P ) is the essential supremum of the random variable X kXkLp (P ) = fEjXjP g1=p for any integer p 1 kF kF ;c = kF (Z0 )kL1 (P ) + kF 0 (Z0 )kL1 (P ) for Z0 = X 0 0 k ke; ;1 = j j + kF k ;1 k ke;1 = j j + kF k1 k kc = j j + kF kF ;c k kq is de…ned in (7) k ke;Lp = j j + kF (Z0 )kLp (P ) Other notations used in the appendix are gathered in the table below. Z0 a.b a b N ("; F; k k) N[] ("; F; k k) C1 ; C2 ; ! j n d A scalar random variable such that Z0 = X 0 0 a Kb for a universal constant K not depending on a or b a . b and a & b The covering number9of size " for a set F under the norm k k The bracketing number of size " for a function class F under the norm k k Generic positive constants which do no depend on the context of the proof The degree of smoothness of F; see Assumption 3.2 The constant for the weighting function (1 + z 2 )!=2 ; see Assumption 3.2 See Assumption 5.4 See Remark A.17 A.2. Proof for Section 2. Lemma A.1. Suppose Assumptions 2.1-2.4 hold. Suppose 2 f 1; 1g and F 2 F. 0 0 If F (x ) = F0 (x 0 ) for any x 2 supp X, then = 0 and F (z) = F0 (z) for any z 2 supp X 0 0 : Proof. Note 0 2 supp Xd . 
Hence, if x = (xc ; 0), by Assumption 2.3 (ii), we have (13) F (x0c c) = F0 (x0c 0;c ) for any xc 2 supp Xc : As " is a di¤erence of two i.i.d. continuous RVs, 0 is an interior point of supp ". Regarding supp Xc , the same holds by Assumption 2.3 (i). We know c 6= 0 and 0;c = 6 0 since 9 See p.83 of van der Vaart and Wellner (1996) for the precise de…nition. SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS j 0;1 j = 1. Observe that 0 2 R is an interior point to both supp Xc0 c and supp Xc0 Therefore we can …nd an open neighborhood of 0 denoted by Nc Rdc such that (14) Nc supp " \ supp Xc0 c \ supp Xc0 21 0;c . 0;c : We …rst show F is strictly increasing on Nc . Suppose not. Then …nd two points x1 ; x2 2 Nc with the following three properties; (i) x1 and x2 are di¤erent only in the …rst coordinates, say x1;1 6= x2;1 ; (ii) x1;1 < x2;1 ; (iii) F (x01 c ) F (x02 c ). Because F0 is strictly increasing on Nc , F0 (x01 0;c ) < F0 (x02 0;c ). Then either F (x01 c ) 6= F0 (x01 0;c ) or F (x02 c ) 6= F0 (x02 0;c ). This contradicts the condition of the lemma. As such, F is strictly increasing on Nc . Next, we prove c = 0;c . Suppose not. Find two points x1 ; x2 2 Nc such that x01 c > x02 c and x01 0;c < x02 0;c . By strong monotonicity of F and F0 on Nc , F (x01 c ) > F (x02 c ) and F0 (x01 0;c ) < F0 (x02 0;c ). We reach a contradiction and conclude c = 0;c . Then by (13), we can infer that F (z) = F0 (z) for any z 2 supp Xc0 0;c . So far, 0;c is identi…ed and F0 is identi…ed only on supp Xc0 0;c . x dc We move on to the identi…cation of 0;d . To this end, we …nd the values of f 0j 0;d gdj=1 ; for the de…nition of j , see Assumption 2.4. Start by j = 1. By Assumption 2.4, there are two points x1 ; x2 2 supp Xc such that x01 0;c = x02 0;c + 01 0;d 2 N1 . Because F0 is strictly increasing on N1 and F = F0 on N1 , then it follows that F (x01 0;c ) ? F (z) if x01 0;c ? z. In other words, x01 (15) 0;c = z if F (x01 0;c ) = F (z): By the condition of the lemma, (16) F (x01 0;c ) = F0 (x01 0;c ) From (15) and (16), we see that = F0 (x02 0 1 d = x01 F (z) = F0 (z) on z 2 supp Xc0 + 0 1 0;d ) x02 0;c = [ fx0 0;c + 0;c 0;c 0;c = F (x02 0 1 0;d 0 1 0;d 0;c + 0 1 d ): and : x 2 supp Xc g: Repeat the same argument for each j to identify other 0j 0;d ’s. Then we identify dx dc f 0j 0;d gj=1 . As the last step, we note that, since f 1 ; ; dx dc g is linearly independent, 0;d 2 Rdx dc is identi…ed. Conclude that = 0 and that F (z) = F0 (z) for any z 2 supp X 0 0 . Proof of Theorem 2.1. We know P ( Y > 0jX1 ; X2 ) = F0 ( X 0 iterated expectation, Q(b; ) = E[E[If Y = E[ F0 ( X 0 0gf1 0 )(1 2F ( X 0 )gjX1; X2 ]f1 0 ). By this fact and the 2F ( X 0 )g + F ( X 0 )2 ] 2F ( X 0 )) + F ( X 0 )2 ]: 22 JONG-MYUN MOON The last expectation can be simpli…ed to the sum of E[fF0 ( X 0 0 ) F ( X 0 )g2 ] and some constant not depending on parameters. From this observation, it is obvious that Q( ) is minimized only if F0 ( X 0 0 ) = F ( X 0 ) almost surely. Lemma A.1 proves that if F0 ( X 0 0 ) = F ( X 0 ), then it follows that = 0 and F (z) = F0 (z) for any 0 z 2 supp X 0 . Hence we conclude. A.3. Proof for Section 3. Remark A.2 (The constant B). Note that kF k By Hölder inequality and Assumptions 3.1, kF k kF ;1 F0 k ;1 for any F 2 F is uniformly bounded. ;1 + kF0 k B + kF0 k ;1 ;1 : The second inequality holds because the weighting function is strictly larger than 1. As kF0 k ;1 is bounded by Assumption 3.1, kF k ;1 is bounded by a universal constant B + kF0 k ;1 . We denote B = B + kF0 k ;1 . Lemma A.3. 
Under Assumptions3.1(ii) and 3.2(ii), for any (di ; yi ; x0i )0 2 supp Vi , (17) jh( 1 ; V1 ; V2 ) Proof. We use notations jh( 1 ; V1 ; V2 ) h( h( . (jx1 j + jx2 j + 1)k 2 A and vi = 2 2 ke;1 : 1 x; x below; they are de…ned similarly to X; X. Observe 2 ; V1 ; V2 )j = d1 d2 j2If y . jF1 ( x0 (18) 2 ; V1 ; V2 )j 1; 1) 0g F1 ( x0 F2 ( x0 F2 ( X 0 1) 2 )j jF1 ( x0 1) F2 ( x0 2 )j 2 )j; where the inequality holds by Remark A.2 and the fact that d1 is a binary variable. By Taylor expansion after obvious expansion, jF1 ( x0 1 ) F2 ( x0 2 )j is equal to jF10 (z ) X 0 ( 1 2) + F1 ( x0 2) F2 ( x0 2 )j: for some z 2 [ X 0 1 ; X 0 2 ]. Since kF10 k1 < B by Remark A.2, using Hölder inequality, we have jh( (19) 1 ; V1 ; V2 ) h( 2 ; V1 ; V2 )j . Bj xj j 1 2j + jF1 ( x0 . (jx1 j + jx2 j + 1) j 1 2j 2) F2 ( x0 0 + jF1 ( x 2) 2 )j F2 ( x0 2 )j ; where the second inequality holds by that Bj xj + 1 . jx1 j + jx2 j + 1. The result (17) follows (19). Lemma A.4. Under Assumptions3.1(ii) and 3.2(ii), jQ( 1) Q( 2 )j .k 1 2 ke;1 ; SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS for any 1; 2 23 2 A. Proof. By Jensen’s inequality, jQ( follows by Lemma A.3. 1) Q( 2 )j E jh( 1 ; V1 ; V2 ) h( 2 ; V1 ; V2 )j. The claim Lemma A.5. Under Assumptions 3.1-3.2, F is compact in k k1;1 -norm and A is compact in k ke;1;1 -norm. Proof. We recall Lemma A.4 of Gallant and Nychka (1987); let us call it GN. Let = 0; 0 = !; m = 1; m0 = 1; k = 1 in the cited lemma. Although one of the conditions is that 0 < < 0 , it can be learnt from the proof that can be zero (and indeed can be negative). The set F de…ned in Assumption 3.2 is smaller than a corresponding set in the cited lemma; note we de…ne F as k k ;1;! -ball of radius B=2 whereas GN sets up F as a ball in the L2 type norm similarly de…ned to k k ;1;! . All other conditions of GN are included verbatim in Assumptions 3.1-3.2. Therefore, we know F is relatively compact in k k1;1 -norm. By Assumption 3.2, F is compact in k k1;1 -norm. The second claim follows immediately. Lemma A.6. Suppose Assumptions 3.1-3.2 hold. Let " > 0 be small enough. Then +! : log N ("; A; k ke;1 ) . 1 log " + " ; = ! Proof. The inequality (20) is immediate from the de…nitions of the covering number and the norm k ke;1 : (20) N ("; A; k ke;1 ) N ("=2; ; j j) N ("=2; F; k k1 ): Because is compact, "=2-covering number of is proportional to "d . As such, ignoring constant terms, log N ("=2; ; j j) . 1 log ": ;! Denote CB=2 = fF : kF k ;1;! ;! " < ", log N ("; CB=2 ; k k1 ) . " B=2g. By Lemma A.3 of Santos (2012), for some " > 0, if . Since, by Assumption 3.2, fF it follows that N ("; F; k k1 ) F0 : F 2 Fg ;! CB=2 ; ;! N ("; CB=2 ; k k1 ). Hence the claim is shown. Remark A.7. When we use Lemma A.6, we ignore that it holds for small ". This is harmless simpli…cation. Lemma A.8. Under Assumptions 3.1-3.2, sup Proof. Let H = fh ( ; ; ) : 2A jQn ( ) p Q( )j ! 0 as n ! 1. 2 Ag. By Lemma A.3, Ejh( ; V1 ; V2 ) h( ; V1 ; V2 )j . Kk 1 2 ke;1 ; 24 JONG-MYUN MOON for some positive number K > 0. Then we can apply Theorem 2.7.11 of van der Vaart and Wellner (1996) to obtain, for any " > 0, N[] ("; H; L1 (P )) N ("=(2K); A; k ke;1 ): Lemma A.6 implies N ("=(2K); A; k ke;1 ) is …nite. Then by Corollary 5.2.5 of de la Peña and Giné (1999) (U-process ULLN), the claim is proven. Lemma A.9. Suppose Assumptions 3.1-3.2 hold. Let N = f 2 A : k let T N" = f 2A:k ke;1;1 < "g: 0 kc = 0g. Also 2N Then for any " > 0, Q( 0) < inf 2AnN" Q( ): Proof. 
The set N" is an intersection of open "-balls in (A; k ke;1;1 ), and hence itself is open in (A; k ke;1;1 ). As A is compact in k ke;2;1 -norm, it is also compact in the weaker norm k ke;1;1 . See that AnN" is compact in k ke;1;1 -norm. By the extremum value theorem, there is " 2 AnN" such that Q( " ) = inf 2AnN" Q( ) for any " > 0. By Theorem 2.1, Q( 0 ) = Q( " ) only if " 2 N . Conclude Q( 0 ) < Q( " ). Lemma A.10. Under Assumptions3.1(ii) and 3.2(ii), jm( for any 1; 2 1 ; v) m( 2 ; v)j . (jxj + 1) j 1 2j + sup jF1 (z) F2 (z)j z2Z0 2 A, v = (d; y; x0 )0 2 supp Vi . Proof. By the triangular inequality and Jensen’s inequality, jm( 1 ; v) m( 2 ; v)j Ejh( 1 ; V1 ; v) h( 2 ; V1 ; v)j+Ejh( 1 ; v; V2 ) h( 2 ; v; V2 )j+jQ( ) Q( )j: By Lemma A.3, Ejh( 1 ; V1 ; v) h( 2 ; V1 ; v)j . (EjX1 j + jxj + 1) j 1 2j + sup jF1 (z) F2 (z)j : z2Z0 Note EjX1 j + jxj + 1 . jxj + 1. The same inequality holds for the second term. The third term is bounded by Lemma A.4. Then the result follows. Lemma A.11. Under Assumptions3.1(ii) and 3.2(ii), jg( for any 1 ; v1 ; v2 ) 1; 2 g( 2 ; v1 ; v2 )j . (jx1 j + jx2 j + 1) j 1 2j + sup jF1 (z) 2 A and any v1 ; v2 2 supp Vi . Proof. This follows immediately by Lemma A.3, A.10 and A.4. z2Z0 F2 (z)j SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 25 Proof of Theorem 3.1. De…ne N" as in Lemma A.9. Note for any " > 0, k^ n 0 kc " ) k^ n 0 ke;1;1 " ) ^ n 2 AnN" , ^ n 2 Akn nN" ; where the arrow ) means “only if” relation. Therefore, it is enough to show P [^ n 2 Akn nN" ] ! 0. Observe (21) P [^ n 2 Akn nN" ] P[ inf 2Akn nN" Qn ( ) Qn ( kn 0 )] P [ inf 2AnN" Qn ( ) Qn ( kn 0 )]: By the continuous mapping theorem and Lemma A.8, inf p 2AnN" Qn ( ) ! inf 2AnN" Q( ): Write Qn ( kn 0) = Qn ( kn 0) p Q( 0) kn + Q( kn 0) Q( 0) + Q( 0 ): Note that Qn ( kn 0 ) Q( kn 0 ) ! 0 by Lemma A.8. In addition, it is true that Q( kn 0 ) p Q( 0 ) ! 0, by the continuity of Qin k ke;1 (Lemma A.4). Hence, we see that Qn ( kn 0 ) ! Q( 0 ). Therefore, it follows (22) P [ inf 2AnN" Qn ( ) Qn ( kn 0 )] = P [ inf 2AnN" Q( ) + op (1) Q( 0 )]: By Lemma A.9, (22) converges to zero. This shows P [^ n 2 Akn nN" ] ! 0 and therefore p k^ n 0 kc ! 0: A.4. Proof for Section 4. Lemma A.12. Under Assumptions 3.1-3.2, 1X (23) E[sup j g( ; Vi ; Vj )j] = O(1): 2A n i6=j Proof. Let us write g( ) = g( ; Vi ; Vj ) and G = fg( ) : 1 X (24) dn (g( 1 ); g( 2 )) = f 2 jg( 1 ) g( 2 )j2 g1=2 ; n i6=j 2 Ag. Also de…ne Dn = sup dn (g( 1 ); g( 2 )): 1 ; 2 2A By Theorem B.1, the left of (23) bounded above by Z Dn 2 1=2 (25) fE[g( 0 ) ]g + E[ log N ( ; G; dn )d ]: 0 Notice E[g( 0 )2 ] is …nite. We are left to show that the second term of (25) is bounded. By Lemma A.11, for some c > 0, 1 X (26) dn (g( 1 ); g( 2 )) ck 1 (jXi j + jXj j + 1)2 g1=2 : 2 ke;1 f 2 n i6=j 26 JONG-MYUN MOON Because k 1 (27) 2 ke;1 is uniformly bounded for any 1 ; 2 2 A, 1 X (jXi j + jXj j + 1)2 g1=2 : Dn . Ln for Ln = f 2 n i6=j By Theorem 2.7.11 of van der Vaart and Wellner (1996), N ( ; G; dn ) Then by Lemma A.6, (28) log N ( ; G; dn ) log N ( c 1 N( c 1 L 1 ; A; k n ke;1 ). Ln 1 ; A; k ke;1 ) . 1 log + log Ln + Ln ; RD for = +! log "d" 1 and . The …rst inequality of (29) follows by the fact that 0 ! RD 1 for any D 0. The second inequality holds by (27) and an elementary 0 " d" . D 2 inequality of x log x x : Z Dn log N ( ; G; dn )d . Dn + 1 + Dn log Ln + Dn1 Ln . 1 + Ln + L2n : (29) 0 Note E[L2n ] E[X12 + X22 + 1] < 1 and E[Ln ] < (25) is bounded, and we obtain (23). p E[L2n ] < 1. Therefore the last term of Lemma A.13. 
Under Assumptions 3.1-3.2, n (30) 1X m(^ n ; Vi ) n i=1 n 1X 1 min m( ; Vi ) + Op ( ): 2Akn n n Proof. By Markov inequality, 1X g( ; Vi ; Vj )j > "] (31) P [sup j 2A n i=1 " 1 E[sup j 2A i6=j 1X g( ; Vi ; Vj )j] n i6=j By Lemma A.12, the right of (31) is uniformly bounded for n 2 N. From the de…nition of ^ n in (6) and the decomposition (8), the claim (30) follows. Lemma A.14 (Norm equivalence). Under Assumptions 3.1 and 4.3, k kq and k ke;L2 (P ) are equivalent norms. Proof. For an upper bound of k kq , (32) k kq = kF00 (Z0 ) X 0 + F (Z0 )kL2 (P ) fE[F00 (Z0 ) X 0 ]2 g1=2 + fE[F (Z0 )]2 g1=2 B 0 E[ X X 0 ] 1=2 + kF (Z0 )kL2 (P ) . k ke;2 ; where the second inequality is a result of Hölder inequality. For a lower bound, observe k k2q & kF00 (Z0 ) X 0 + F (Z0 )kL2 (P ) (33) = E[fF00 (Z0 )f X E[ XjZ0 ]g0 g2 ] + E[fF00 (Z0 )E[ XjZ0 ]0 + F (Z0 )g2 ]; SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 27 where the inequality holds by Assumption 3.2(ii) and the equality is valid because a crossproduct term not showing up in (33) is vanished by the law of iterated expectation. (33) is bounded below by 0 (34) + E[F00 (Z0 )E[ XjZ0 ]0 + F (Z0 )]2 ; where is the smallest eigenvalue of . By Assumption 4.2, eigenvalue of E[F00 (Z0 )2 E[ XjZ0 ]E[ XjZ0 ]0 ]. Then 0 (35) 2 0 + 2 0 > 0. Let E F00 (Z0 )2 E[ XjZ0 ]E[ XjZ0 ]0 > 0 be the largest : Using (35), bound (34) below by (36) 2 0 + 2 0 E F00 (Z0 )2 E[ XjZ0 ]E[ XjZ0 ]0 By an elementary inequality of a2 + b2 below by (37) 2 0 + 4 1 2 (a + 2 E[F00 (Z0 )E[ XjZ0 ]0 + F (Z0 )]2 : b)2 , the expression (36) is further bounded E[F (Z0 )]2 & j j2 + kF (Z0 )k2L2 (P ) : And by Jensen’s inequality, fj j2 + kF k2L2 (P ) g1=2 1 p fj j + kF kL2 (P ) g: 2 Therefore we show k ke;L2 (P ) . k kq . This combined with the inequality (32) shows that the two norms are equivalent. Lemma A.15. Suppose Assumptions 3.1-3.2 and 4.3 hold. Then for any 2 (0; 1), if 2B =f 2A:k 0 kc < g, there exists a positive constant M such that p 2 2 (38) jQ( ) Q( 0 ) k M k 0 kq j 0 kq : The constant M > 0 does not depend on . Proof. Applying the Taylor expansion to F ( X 0 ) F ( X0 ) F0 ( X 0 0) F0 ( X 0 = F 0 (Z0 ) X 0 ( 0 ), we have 0) 1 0 00 0 + ( 0 ) F (Z ) X X ( 0 ) + F (Z0 ) 2 where Z is a random variable between Z0 and X 0 . Denote D1 = F00 (Z0 ) X 0 ( 0) + F (Z0 ) D2 = fF 0 (Z0 ) F00 (Z0 )g X 0 ( 1 0 00 0 D3 = ( 0 ) F (Z ) X X ( 2 F0 (Z0 ); 0 ); 0 ): F0 (Z0 ); 28 JONG-MYUN MOON Recall that Q( ) Q( 0 ) = E[F ( X 0 ) F0 ( X 0 0 )]2 . Therefore, Q( ) Q( 0 ) = E[fD1 + D2 +D3 g2 ]. We expand and examine this term by term. Note that (i) E[j Xj4 ] and E[j Xj2 ] are bounded, (ii) kF 00 k1 < B for any F 2 F (see Remark A.2), and (iii) for any 2 B ; kF 0 (Z0 ) (39) F00 (Z0 )kL1 (P ) k 0 kc < : These three facts are used in (41)-(42). First, by de…nition, E[D12 ] = k (40) 2 0 kq : Second, by Cauchy-Schwartz inequality, E[D22 ] (41) F 0 (Z0 ) F00 (Z0 ) L1 (P ) 2 0 j E[j j Xj2 ] . j 2 0j : Third, again by Cauchy-Schwartz inequality, 1 2 4 4 4 B j 0 j Ej Xj . j 0j : 4 Fourth, applying Hölder inequality to cross-product terms, by (40)-(42), p (43) k EjD1 D2 j . 0 kq j 0 j; E[D32 ] (42) EjD1 D3 j . k p EjD2 D3 j . j (44) (45) By the norm equivalence (Lemma A.14), j collecting (40)-(45), p 2 (46) jQ( ) Q( 0 ) k )k 0 kq j . ( + Recall that if 2 0j ; 0 kq j 3 0j : 0j k 0 ke;L2 (P ) 2 0 kq + (1 + p k 3 0 kq )k 0 kq . 
Lemma A.16. Suppose Assumptions 3.1-3.2 hold. Let $M$ be the constant appearing in Lemma A.15, and let

$\mathcal{A}_{k_n,\delta} = \{\alpha\in\mathcal{A}_{k_n} : \|\alpha-\alpha_0\|_c < M^{-2},\ \|\alpha-\alpha_0\|_q \le \delta\}.$

Then, for large $n$,

$E\Big[\sup_{\alpha\in\mathcal{A}_{k_n,\delta}} \big|\mathbb{G}_n\{m(\alpha;\cdot) - m(\pi_{k_n}\alpha_0;\cdot)\}\big|\Big] \lesssim (\delta+\delta_{2,n})^{1-\beta}\Big\{1 + \frac{1}{\sqrt{n}}(\delta+\delta_{2,n})^{-(1+\beta)}\Big\},$

for $\delta_{2,n} = \|\pi_{k_n}\alpha_0 - \alpha_0\|_q$ and $\beta = \gamma/(\gamma+\omega)$.

Proof. Write $m(\alpha) = m(\alpha;V_i)$ and $h(\alpha) = h(\alpha;V_1,V_2)$. Suppose $\|\pi_{k_n}\alpha_0 - \alpha_0\|_c < M^{-2}$; this holds for large $n$, because $\|\pi_{k_n}\alpha_0 - \alpha_0\|_c \to 0$ by Assumption 3.3. Let $\bar M$ be a constant such that

(47) $\sup_{\alpha\in\mathcal{A}_{k_n}} \sup_{v\in\mathbb{R}^{2+d_x}} |m(\alpha;v) - m(\pi_{k_n}\alpha_0;v)| < \bar M < \infty.$

We may take $\bar M = 3B$ by Remark A.2 and the definition of $m(\cdot)$ in (5). By the Hölder inequality,

(48) $\|m(\alpha) - m(\pi_{k_n}\alpha_0)\|_{L^2(P)} \le \|E[h(\alpha)-h(\pi_{k_n}\alpha_0)|V_1]\|_{L^2(P)} + \|E[h(\alpha)-h(\pi_{k_n}\alpha_0)|V_2]\|_{L^2(P)} + |Q(\alpha)-Q(\pi_{k_n}\alpha_0)|.$

By Jensen's inequality,

(49) $\|E[h(\alpha)-h(\pi_{k_n}\alpha_0)|V_i]\|_{L^2(P)} \le \|h(\alpha)-h(\pi_{k_n}\alpha_0)\|_{L^2(P)}.$

Recall that $\pi_{k_n}\alpha_0 = (\theta_0, F_{0,k_n})$. Hence, using the first inequality of (19), we obtain

$|h(\alpha) - h(\pi_{k_n}\alpha_0)| \lesssim B|\Delta X|\,|\theta-\theta_0| + |F(\Delta X'\theta) - F_{0,k_n}(\Delta X'\theta_0)|,$

which yields

(50) $\|h(\alpha)-h(\pi_{k_n}\alpha_0)\|_{L^2(P)} \lesssim |\theta-\theta_0| + \|F(\Delta X'\theta) - F_{0,k_n}(\Delta X'\theta_0)\|_{L^2(P)} \lesssim \|\alpha-\pi_{k_n}\alpha_0\|_{e,L^2(P)}.$

In addition, by Lemma A.15, if $\|\alpha-\alpha_0\|_c < M^{-2}$ and $\|\pi_{k_n}\alpha_0-\alpha_0\|_c < M^{-2}$,

(51) $|Q(\alpha)-Q(\pi_{k_n}\alpha_0)| \le |Q(\alpha)-Q(\alpha_0)| + |Q(\pi_{k_n}\alpha_0)-Q(\alpha_0)| \le 2\|\alpha-\alpha_0\|_q^2 + 2\|\pi_{k_n}\alpha_0-\alpha_0\|_q^2.$

Also note that, by Lemma A.14,

(52) $\|\alpha-\pi_{k_n}\alpha_0\|_{e,L^2(P)} \lesssim \|\alpha-\pi_{k_n}\alpha_0\|_q \le \|\alpha-\alpha_0\|_q + \|\pi_{k_n}\alpha_0-\alpha_0\|_q.$

By (48)-(52) and Assumption 4.2, if $\alpha\in\mathcal{A}_{k_n,\delta}$,

(53) $\|m(\alpha)-m(\pi_{k_n}\alpha_0)\|_{L^2(P)} \lesssim \|\alpha-\alpha_0\|_q + \|\pi_{k_n}\alpha_0-\alpha_0\|_q + \|\alpha-\alpha_0\|_q^2 + \|\pi_{k_n}\alpha_0-\alpha_0\|_q^2$

(54) $\le \delta + \delta_{2,n} + \delta^2 + \delta_{2,n}^2 \lesssim \delta + \delta_{2,n},$

where the last inequality holds because the possible values of $\delta$ are bounded and $\delta_{2,n} \to 0$ as $n\to\infty$.

The maximal inequality is invoked to conclude. Denote $\mathcal{M}_{n,\delta} = \{m(\alpha)-m(\pi_{k_n}\alpha_0) : \alpha\in\mathcal{A}_{k_n,\delta}\}$, and find a sequence $\{M_{n,\delta}\}_{n=1}^\infty$ such that $\|m(\alpha)-m(\pi_{k_n}\alpha_0)\|_{L^2(P)} \le M_{n,\delta} \lesssim \delta+\delta_{2,n}$. By Lemma 3.4.2 of van der Vaart and Wellner (1996),

(55) $E\Big[\sup_{\alpha\in\mathcal{A}_{k_n,\delta}} \big|\mathbb{G}_n\{m(\alpha)-m(\pi_{k_n}\alpha_0)\}\big|\Big] \lesssim \tilde J_{[\,]}(M_{n,\delta};\mathcal{M}_{n,\delta};\|\cdot\|_{L^2(P)})\Big(1 + \frac{\tilde J_{[\,]}(M_{n,\delta};\mathcal{M}_{n,\delta};\|\cdot\|_{L^2(P)})}{\sqrt{n}(M_{n,\delta})^2}\,\bar M\Big),$

where

$\tilde J_{[\,]}(\eta;\mathcal{M}_{n,\delta};\|\cdot\|_{L^2(P)}) = \int_0^\eta \sqrt{1+\log N_{[\,]}(\varepsilon;\mathcal{M}_{n,\delta};\|\cdot\|_{L^2(P)})}\,d\varepsilon.$

By Lemma A.10, for some constant $C>0$, $\|m(\alpha_1;V_i)-m(\alpha_2;V_i)\|_{L^2(P)} \le C\|\alpha_1-\alpha_2\|_{e,\infty}$. Hence

(56) $N_{[\,]}(\varepsilon;\mathcal{M}_{n,\delta};\|\cdot\|_{L^2(P)}) \le N(\varepsilon C^{-1};\mathcal{A}_{k_n,\delta};\|\cdot\|_{e,\infty}) \le N(\varepsilon C^{-1};\mathcal{A};\|\cdot\|_{e,\infty}).$

The first inequality of (56) is acquired by Theorem 2.7.11 of van der Vaart and Wellner (1996), and the second is trivial from $\mathcal{A}_{k_n,\delta} \subset \mathcal{A}$. Lemma A.6 shows $\log N(\varepsilon C^{-1};\mathcal{A};\|\cdot\|_{e,\infty}) \lesssim \log\frac{1}{\varepsilon} + \varepsilon^{-2\beta}$. Therefore we have

(57) $\tilde J_{[\,]}(\eta;\mathcal{M}_{n,\delta};\|\cdot\|_{L^2(P)}) \lesssim \int_0^\eta \sqrt{\log\frac{1}{\varepsilon} + \varepsilon^{-2\beta}}\,d\varepsilon \lesssim \int_0^\eta \varepsilon^{-\beta}\,d\varepsilon \lesssim \eta^{1-\beta},$

where the second inequality is from the fact that $|\log\varepsilon| \lesssim \varepsilon^{-2\beta}$ when $\varepsilon$ is small enough. Plugging (57) into (55),

$E\Big[\sup_{\alpha\in\mathcal{A}_{k_n,\delta}} \big|\mathbb{G}_n\{m(\alpha)-m(\pi_{k_n}\alpha_0)\}\big|\Big] \lesssim (M_{n,\delta})^{1-\beta}\Big(1 + \frac{(M_{n,\delta})^{-(1+\beta)}}{\sqrt{n}}\,\bar M\Big).$

The result follows after simplification.
Proof of Theorem 4.1. We start by introducing some abbreviations. Recall $\beta = \gamma/(\gamma+\omega)$ from Lemma A.16, and define a $\|\cdot\|_q$-norm shell as

(58) $S_{j,n} = \{\alpha\in\mathcal{A}_{k_n} : s_{j,n} < \|\alpha-\alpha_0\|_q \le 2s_{j,n}\} \quad\text{for } s_{j,n} = 2^{j-1}r_n^{-1}.$

As in Lemma A.15, denote $\mathcal{B}_c = \{\alpha\in\mathcal{A} : \|\alpha-\alpha_0\|_c < c\}$. Let $c_n$ be a sequence such that $c_n^{-1}\|\hat\alpha_n-\alpha_0\|_c \to_p 0$; such a sequence $\{c_n\}_{n=1}^\infty$ exists since $\hat\alpha_n$ is consistent in the $\|\cdot\|_c$-norm. Let $\varepsilon>0$ and $J\in\mathbb{N}$ be arbitrary. The theorem is proved if $P[r_n\|\hat\alpha_n-\alpha_0\|_q > 2^J]$ can be made arbitrarily small by letting $J$ large enough. Observe

(59) $P[r_n\|\hat\alpha_n-\alpha_0\|_q > 2^J] \le P[r_n\|\hat\alpha_n-\alpha_0\|_q > 2^J,\ \hat\alpha_n\in\mathcal{B}_{c_n}] + P[\hat\alpha_n\notin\mathcal{B}_{c_n}],$

where the last term $P[\hat\alpha_n\notin\mathcal{B}_{c_n}]$ converges to zero by the definition of $\mathcal{B}_{c_n}$. The first term on the right of (59) is bounded above by (60):

(60) $\sum_{j\ge J,\,2s_{j,n}\le\varepsilon} P[\hat\alpha_n\in S_{j,n}\cap\mathcal{B}_{c_n}] + P[\|\hat\alpha_n-\alpha_0\|_q > \varepsilon].$

The second term of (60) can be ignored, again by the consistency of $\hat\alpha_n$ and the fact that $\|\cdot\|_q \lesssim \|\cdot\|_c$. Therefore it is enough to show that the first term of (60) decreases to 0 as $J$ increases.

First, we obtain a bound for $P[\hat\alpha_n\in S_{j,n}\cap\mathcal{B}_{c_n}]$, given in (65) below. By Lemma A.13, if $\hat\alpha_n\in S_{j,n}\cap\mathcal{B}_{c_n}$, then

(61) $\inf_{\alpha\in S_{j,n}\cap\mathcal{B}_{c_n}} \frac{1}{n}\sum_{i=1}^n \{m(\alpha;V_i) - m(\pi_{k_n}\alpha_0;V_i)\} \le O_p\Big(\frac{1}{n}\Big).$

Let us treat the random $O_p(\frac{1}{n})$ term in (61) as if it were a deterministic sequence $\frac{C}{n}$ for some $C>0$; this simplification comes without loss of generality. Recall Lemma A.15. Find an integer $N$ such that if $n\ge N$, then $Q(\alpha) - Q(\alpha_0) \ge \frac{1}{2}\|\alpha-\alpha_0\|_q^2$ for $\alpha\in\mathcal{B}_{c_n}$. Then for $n\ge N$,

(62) $\inf_{\alpha\in S_{j,n}\cap\mathcal{B}_{c_n}} Q(\alpha) - Q(\alpha_0) \ge \frac{1}{2}(s_{j,n})^2.$

Expand the left of (61) to

(63) $\inf_{\alpha\in S_{j,n}\cap\mathcal{B}_{c_n}} \Big[\frac{1}{n}\sum_{i=1}^n \{m(\alpha;V_i) - E[m(\alpha;V_i)]\} - \frac{1}{n}\sum_{i=1}^n \{m(\pi_{k_n}\alpha_0;V_i) - E[m(\pi_{k_n}\alpha_0;V_i)]\} + Q(\alpha) - Q(\pi_{k_n}\alpha_0)\Big].$

Combine and denote the two summation terms in (63) by $n^{-1/2}\mathbb{G}_n\{m(\alpha) - m(\pi_{k_n}\alpha_0)\}$ (standard notation in empirical process theory). Expand $Q(\alpha)-Q(\pi_{k_n}\alpha_0) = Q(\alpha)-Q(\alpha_0) + Q(\alpha_0)-Q(\pi_{k_n}\alpha_0)$. Then (63) is bounded below by (64):

(64) $\inf_{\alpha\in S_{j,n}\cap\mathcal{B}_{c_n}} \frac{1}{\sqrt{n}}\mathbb{G}_n\{m(\alpha)-m(\pi_{k_n}\alpha_0)\} + \frac{1}{2}(s_{j,n})^2 - \{\|F_{0,k_n}(Z_0)-F_0(Z_0)\|_{L^2(P)}\}^2,$

where the second term is obtained by (62) and the third term by $|Q(\alpha_0)-Q(\pi_{k_n}\alpha_0)| \lesssim \|F_{0,k_n}(Z_0)-F_0(Z_0)\|_{L^2(P)}^2$. From (61)-(64), we obtain

(65) $P[\hat\alpha_n\in S_{j,n}\cap\mathcal{B}_{c_n}] \le P\Big[\inf_{\alpha\in S_{j,n}\cap\mathcal{B}_{c_n}} \mathbb{G}_n\{m(\alpha)-m(\pi_{k_n}\alpha_0)\} \le \zeta_{j,n}\Big],$

for

$\zeta_{j,n} = \frac{C}{\sqrt{n}} - \frac{\sqrt{n}}{2}(s_{j,n})^2 + \sqrt{n}\,\|F_{0,k_n}(Z_0)-F_0(Z_0)\|_{L^2(P)}^2.$

Note $\sqrt{n}\,\|F_{0,k_n}(Z_0)-F_0(Z_0)\|_{L^2(P)}^2 = o(\sqrt{n}\,r_n^{-2})$ by Assumption 4.2 and Lemma A.14. And also note that $n^{-1/2} = o(\sqrt{n}\,r_n^{-2})$. On the other hand, $\sqrt{n}(s_{j,n})^2 \ge \sqrt{n}\,r_n^{-2}$. Therefore $\zeta_{j,n} < 0$ for large enough $n$ and $j$, and then $|\zeta_{j,n}| \gtrsim \sqrt{n}(s_{j,n})^2$.

Second, we evaluate the probability (65). Suppose $n$ and $j$ are large enough so that $\zeta_{j,n} < 0$. By the Markov inequality, the right of (65) is less than

(66) $\frac{1}{|\zeta_{j,n}|}\,E\Big[\sup_{\alpha\in S_{j,n}\cap\mathcal{B}_{c_n}} \big|\mathbb{G}_n\{m(\alpha)-m(\pi_{k_n}\alpha_0)\}\big|\Big].$

Note, for large $n$, if $c_n < M^{-2}$ then $S_{j,n}\cap\mathcal{B}_{c_n} \subset \mathcal{A}_{k_n,2s_{j,n}}$, for $M$ appearing in Lemma A.15. Also note $s_{j,n} + s_{j,n}^2 + r_n^{-1} \lesssim s_{j,n}$ for large $n$, so that $2s_{j,n} + \delta_{2,n} \lesssim s_{j,n}$. Then by Lemma A.16, the expression (66) is less than, up to a fixed scale,

(67) $\frac{1}{|\zeta_{j,n}|}\,(s_{j,n})^{1-\beta}\Big\{1 + \frac{1}{\sqrt{n}}(s_{j,n})^{-(1+\beta)}\Big\}.$

Finally, the first term of (60) is bounded. For large enough $n$ and $J$, (68) holds by (65)-(67); the second inequality uses $|\zeta_{j,n}| \gtrsim \sqrt{n}(s_{j,n})^2$, and the last equality is from Assumption 4.1:

(68) $\sum_{j\ge J,\,2s_{j,n}\le\varepsilon} P[\hat\alpha_n\in S_{j,n}\cap\mathcal{B}_{c_n}] \lesssim \sum_{j\ge J}\Big\{\frac{1}{\sqrt{n}}(s_{j,n})^{-(1+\beta)} + \frac{1}{n}(s_{j,n})^{-2(1+\beta)}\Big\} \lesssim \Big\{\frac{r_n^{1+\beta}}{\sqrt{n}} + \Big(\frac{r_n^{1+\beta}}{\sqrt{n}}\Big)^2\Big\}\,2^{-J(1+\beta)} = O(2^{-J}).$

This shows (60) is $O(2^{-J}) + o(1)$. In turn, from (59), it is shown that

$P[r_n\|\hat\alpha_n-\alpha_0\|_q > 2^J] \le O(2^{-J}) + o(1).$

Therefore we conclude $r_n$ is the rate-of-convergence factor for the estimator $\hat\alpha_n$.
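As a quick check on the geometric-series step in (68): for any exponent $a>0$, the shells (58) satisfy

$$\sum_{j\ge J}(s_{j,n})^{-a} = r_n^{a}\sum_{j\ge J}2^{-a(j-1)} = \frac{r_n^{a}\,2^{-a(J-1)}}{1-2^{-a}} \lesssim r_n^{a}\,2^{-aJ},$$

so each tail sum contributes a factor $2^{-aJ}$, and the remaining $r_n^{a}$ factors are exactly what Assumption 4.1 (the definition of $r_n$) is designed to absorb.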
A.5. Proof for Section 5. Higher-order directional derivatives of $h(\cdot;\cdot,\cdot)$ are defined similarly to the first-order derivative defined in (12). Let $v_1,v_2,v_3\in\mathcal{V}$ be arbitrary directions (vectors). We denote the second- and third-order derivatives by $h''(\alpha)[v_1,v_2]$ and $h'''(\alpha)[v_1,v_2,v_3]$ respectively. Also let $h''(\alpha)[v]^2 = h''(\alpha)[v,v]$ and $h'''(\alpha)[v]^3 = h'''(\alpha)[v,v,v]$. Following de la Peña and Giné (1999), we denote the U-process empirical measure by $U_n$; it assigns probability $1/n(n-1)$ to each pair $(V_i,V_j)$, $i\neq j$, from the random sample $\{V_1,\dots,V_n\}$. The expectation of $h(\alpha)$ with respect to the random probability measure $U_n$ is denoted $U_nh(\alpha)$; that is,

$U_nh(\alpha) = \frac{1}{n(n-1)}\sum_{i\neq j} h(\alpha;V_i,V_j).$

Further, we write $(U_n - E)h(\alpha) = U_nh(\alpha) - E[h(\alpha)]$. Note this is a U-process indexed by $\alpha$.

Remark A.17 (Definition of $\delta_n$). We choose an arbitrary sequence $\{\delta_n\}$ that converges to zero more slowly than $r_n^{-1}$. We also require that all the conditions of Assumption 5.4 hold when $r_n^{-1}$ is substituted by $\delta_n$. For now, subject to these two conditions, $\delta_n$ may converge at any rate; its rate will be picked in the proof of Theorem 5.1. The role of this sequence is to define a local neighborhood around $\alpha_0$. In order to guarantee that the local neighborhood encompasses the estimator $\hat\alpha_n$, we choose $\delta_n$ slower than $r_n^{-1}$. On the other hand, it will be required that, on the local neighborhood, the empirical criterion $Q_n$ approximates the population criterion $Q$ well enough; for that, $\delta_n$ needs to shrink fast enough. We will choose $\delta_n$ so that it converges infinitesimally more slowly than $r_n^{-1}$.

Lemma A.18. Suppose Assumptions 5.3 and 5.5 hold. For any $F_k\in\mathcal{F}_k$,

(69) $\|F_k(Z_0)\|_{L^\infty(P)} \lesssim \sqrt{k}\,\|F_k(Z_0)\|_{L^2(P)} \quad\text{and}\quad \Big\|\frac{d^j}{dz^j}F_k(Z_0)\Big\|_{L^\infty(P)} \lesssim \sqrt{k}\,\xi_j(k)\,\|F_k(Z_0)\|_{L^2(P)},$

for $\xi_j$ defined in Assumption 5.4 (but Assumption 5.4 itself is not needed).

Proof. Let $\lambda_k$ denote the smallest eigenvalue of $E[p^k(Z_0)p^k(Z_0)']$. By Assumption 5.5, $\lambda_k \ge \lambda > 0$ for some $\lambda$ and for every $k\in\mathbb{N}$. Because $\mathcal{F}_k$ is a finite-dimensional linear sieve (Assumption 5.3), $F_k(Z_0)$ is a linear combination of basis functions $\{p_1,\dots,p_k\}$; write $F_k(Z_0) = c'p^k(Z_0)$ for some $c\in\mathbb{R}^k$. Then

(70) $\|F_k(Z_0)\|_{L^2(P)} = \{c'E[p^k(Z_0)p^k(Z_0)']c\}^{1/2} \ge \sqrt{\lambda_k}\sqrt{c'c} \gtrsim \sqrt{c'c},$

where the last inequality holds by Assumption 5.5. Next, observe

(71) $\|F_k(Z_0)\|_{L^\infty(P)} \le \sum_{i=1}^k |c_i|\,\|p_i(Z_0)\|_{L^\infty(P)} \lesssim \sum_{i=1}^k |c_i|,$

(72) $\Big\|\frac{d^j}{dz^j}F_k(Z_0)\Big\|_{L^\infty(P)} \le \sum_{i=1}^k |c_i|\,\Big\|\frac{d^j}{dz^j}p_i(Z_0)\Big\|_{L^\infty(P)} \lesssim \xi_j(k)\sum_{i=1}^k |c_i|.$

By the Cauchy-Schwarz inequality, $\sum_{i=1}^k |c_i| \le \sqrt{k}\sqrt{c'c}$. By this and (70)-(72), (69) is shown.
Lemma A.19. Suppose $F$, $F_v$ and $F_w$ are functions on $\mathbb{R}$. Suppose $\theta$, $\theta_v$, $\theta_w$ are in the set $\Theta$. Denote $\alpha = (\theta,F)$, $v = (\theta_v,F_v)$ and $w = (\theta_w,F_w)$.

(i) If $z\mapsto F(z)$ is differentiable,

(73) $h'(\alpha;V_1,V_2)[v] = -2\{I\{\Delta Y\le 0\} - F(\Delta X'\theta)\}\{F'(\Delta X'\theta)\Delta X'\theta_v + F_v(\Delta X'\theta)\}.$

(ii) If $F$ is twice differentiable and $F_v$ is differentiable,

(74) $h''(\alpha;V_1,V_2)[v,w] = 2\{F'(\Delta X'\theta)\Delta X'\theta_v + F_v(\Delta X'\theta)\}\{F'(\Delta X'\theta)\Delta X'\theta_w + F_w(\Delta X'\theta)\}$
$\qquad - 2\{I\{\Delta Y\le 0\} - F(\Delta X'\theta)\}\{F''(\Delta X'\theta)\Delta X'\theta_v\,\Delta X'\theta_w + F_v'(\Delta X'\theta)\Delta X'\theta_w + F_w'(\Delta X'\theta)\Delta X'\theta_v\}.$

(iii) If $F$ is three-times differentiable and $F_v$ is twice differentiable,

(75) $h'''(\alpha;V_1,V_2)[v,v,w] = 4\{F'(\Delta X'\theta)\Delta X'\theta_v + F_v(\Delta X'\theta)\}\{F''(\Delta X'\theta)\Delta X'\theta_v\,\Delta X'\theta_w + F_w'(\Delta X'\theta)\Delta X'\theta_v + F_v'(\Delta X'\theta)\Delta X'\theta_w\}$
$\qquad + 2\{F'(\Delta X'\theta)\Delta X'\theta_w + F_w(\Delta X'\theta)\}\{F''(\Delta X'\theta)(\Delta X'\theta_v)^2 + 2F_v'(\Delta X'\theta)\Delta X'\theta_v\}$
$\qquad - 2\{I\{\Delta Y\le 0\} - F(\Delta X'\theta)\}\{F'''(\Delta X'\theta)(\Delta X'\theta_v)^2\,\Delta X'\theta_w + F_w''(\Delta X'\theta)(\Delta X'\theta_v)^2 + 2F_v''(\Delta X'\theta)\Delta X'\theta_v\,\Delta X'\theta_w\}.$

Proof. From

$h(\alpha+tv) = \{I\{\Delta Y\le 0\} - F(\Delta X'\theta + t\Delta X'\theta_v) - tF_v(\Delta X'\theta + t\Delta X'\theta_v)\}^2,$

elementary calculus gives (73). Similarly, from

$h'(\alpha+tw)[v] = -2\{I\{\Delta Y\le 0\} - F(\Delta X'\theta + t\Delta X'\theta_w) - tF_w(\Delta X'\theta + t\Delta X'\theta_w)\}$
$\qquad\times\{F'(\Delta X'\theta + t\Delta X'\theta_w)\Delta X'\theta_v + tF_w'(\Delta X'\theta + t\Delta X'\theta_w)\Delta X'\theta_v + F_v(\Delta X'\theta + t\Delta X'\theta_w)\},$

we obtain (74). Lastly, differentiating $h''(\alpha+tw)[v,v]$ with respect to $t$ at $t=0$ in the same manner, we can derive (75).
Lemma A.20. Suppose Assumptions 3.1-3.2, 4.3, 5.1-5.5 hold. Consider any non-random sequence $\{\alpha_n\in\mathcal{A}_{k_n}\}$ such that $\|\alpha_n-\alpha_0\|_{e,L^2(P)} = o(\delta_n)$. Then

$\|\alpha_n - \pi_{k_n}\alpha_0\|_q^2 = Q(\alpha_n) - Q(\pi_{k_n}\alpha_0) + o\Big(\frac{1}{n}\Big).$

Proof. Denote $v_n = \alpha_n - \pi_{k_n}\alpha_0$ and write $v_n = (\theta_{v_n}, F_{v_n})$. Note $\|\pi_{k_n}\alpha_0 - \alpha_0\|_{e,L^2(P)} = o(\delta_n)$ by Assumption 5.2. By the triangle inequality,

(76) $\|v_n\|_q \lesssim \|v_n\|_{e,L^2(P)} \le \|\alpha_n-\alpha_0\|_{e,L^2(P)} + \|\pi_{k_n}\alpha_0-\alpha_0\|_{e,L^2(P)} = O(\delta_n).$

Recall that any element of $\mathcal{F}_{k_n}$ is at least three-times differentiable by Assumptions 3.1-3.2 and Remark A.2. Then $h(\alpha)$ is three-times directionally differentiable (Lemma A.19). By Taylor expansion,

(77) $h(\alpha_n) - h(\pi_{k_n}\alpha_0) = h'(\pi_{k_n}\alpha_0)[v_n] + \frac{1}{2}h''(\pi_{k_n}\alpha_0)[v_n]^2 + \frac{1}{6}h'''(\bar\alpha_n)[v_n]^3,$

for some $\bar\alpha_n\in\mathcal{A}_{k_n}$ between $\alpha_n$ and $\pi_{k_n}\alpha_0$. Denote $\bar\alpha_n = (\bar\theta_n,\bar F_n)$. We examine each term on the right of (77). First, from Lemma A.19,

$h'(\pi_{k_n}\alpha_0)[v_n] = -2\{I\{\Delta Y\le 0\} - F_{0,k_n}(Z_0)\}\{F_{0,k_n}'(Z_0)\Delta X'\theta_{v_n} + F_{v_n}(Z_0)\}.$

Since $E[I\{\Delta Y\le 0\}|X_1,X_2] = F_0(Z_0)$, by the law of iterated expectations it follows that

$E\big[h'(\pi_{k_n}\alpha_0)[v_n]\big] = -2E\big[\{F_0(Z_0) - F_{0,k_n}(Z_0)\}\{F_{0,k_n}'(Z_0)\Delta X'\theta_{v_n} + F_{v_n}(Z_0)\}\big].$

Then by the Hölder inequality and Assumption 5.2,

(78) $\big|E[h'(\pi_{k_n}\alpha_0)[v_n]]\big| \lesssim \|F_0(Z_0)-F_{0,k_n}(Z_0)\|_{L^2(P)}\,\|F_{0,k_n}'(Z_0)\Delta X'\theta_{v_n} + F_{v_n}(Z_0)\|_{L^2(P)}$

(79) $\lesssim o(n^{-3/4})\{|\theta_{v_n}| + \|F_{v_n}(Z_0)\|_{L^2(P)}\} \lesssim o(n^{-3/4})\,\|v_n\|_{e,L^2(P)}.$

By (76), the last expression of (78) is an $o(n^{-3/4}\delta_n)$ term. As $\delta_n = o(n^{-1/3})$, we obtain

$E\big[h'(\pi_{k_n}\alpha_0)[v_n]\big] = o\Big(\frac{1}{n}\Big).$

Next, the second-order term of the Taylor expansion (77) is examined. From Lemma A.19,

(80)-(81) $h''(\pi_{k_n}\alpha_0;V_1,V_2)[v_n]^2 = 2\{F_{0,k_n}'(Z_0)\Delta X'\theta_{v_n} + F_{v_n}(Z_0)\}^2$
$\qquad - 2\{F_{0,k_n}''(Z_0)(\Delta X'\theta_{v_n})^2 + 2F_{v_n}'(Z_0)\Delta X'\theta_{v_n}\}\{I\{\Delta Y\le 0\} - F_{0,k_n}(Z_0)\}.$

By the law of iterated expectations and the triangle inequality,

(82) $\Big|\frac{1}{2}E\big[h''(\pi_{k_n}\alpha_0;V_1,V_2)[v_n]^2\big] - E\big[\{F_{0,k_n}'(Z_0)\Delta X'\theta_{v_n} + F_{v_n}(Z_0)\}^2\big]\Big|$
$\qquad \le E\big[\big|\{F_{0,k_n}''(Z_0)(\Delta X'\theta_{v_n})^2 + 2F_{v_n}'(Z_0)\Delta X'\theta_{v_n}\}\{F_0(Z_0) - F_{0,k_n}(Z_0)\}\big|\big].$

After some algebra, the main term on the left of (82) is shown to be bounded by

$\|v_n\|_q^2 + \|F_{0,k_n}'(Z_0) - F_0'(Z_0)\|_{L^4(P)}^2\,\|\Delta X'\theta_{v_n}\|_{L^4(P)}^2 + \|F_{0,k_n}'(Z_0) - F_0'(Z_0)\|_{L^4(P)}\,\|\Delta X'\theta_{v_n}\|_{L^4(P)}\,\|v_n\|_q.$

By Assumption 5.2 and (76), the last expression simplifies to $\|v_n\|_q^2 + o(n^{-2/3})O(\delta_n^2) + o(n^{-1/3})O(\delta_n^2) = \|v_n\|_q^2 + o(n^{-1})$. The term on the right of (82) is bounded above by

$\big\{|\theta_{v_n}|^2 + |\theta_{v_n}|\big\}\,\|F_0(Z_0)-F_{0,k_n}(Z_0)\|_{L^2(P)} = O(\delta_n)\,o(n^{-2/3}) = o(n^{-1}),$

where the inequality holds by the triangle inequality and Remark A.2, and the last equalities by Assumption 5.2 and (76). Therefore, we have

$\frac{1}{2}E\big[h''(\pi_{k_n}\alpha_0;V_1,V_2)[v_n]^2\big] = \|v_n\|_q^2 + o\Big(\frac{1}{n}\Big).$

The third-order term in (77) follows. As given by Lemma A.19 (with $w=v=v_n$),

(83) $h'''(\bar\alpha_n;V_1,V_2)[v_n]^3 = 6\{\bar F_n'(\Delta X'\bar\theta_n)\Delta X'\theta_{v_n} + F_{v_n}(\Delta X'\bar\theta_n)\}\{\bar F_n''(\Delta X'\bar\theta_n)(\Delta X'\theta_{v_n})^2 + 2F_{v_n}'(\Delta X'\bar\theta_n)\Delta X'\theta_{v_n}\}$
$\qquad - 2\{I\{\Delta Y\le 0\} - \bar F_n(\Delta X'\bar\theta_n)\}\{\bar F_n'''(\Delta X'\bar\theta_n)(\Delta X'\theta_{v_n})^3 + 3F_{v_n}''(\Delta X'\bar\theta_n)(\Delta X'\theta_{v_n})^2\}.$

Similar to the above, using the Hölder inequality and the triangle inequality multiple times, $E|h'''(\bar\alpha_n;V_1,V_2)[v_n]^3|$ is bounded by products of the following norms, each of which we examine via Lemma A.18:

(84) $\|\bar F_n''(\Delta X'\bar\theta_n)(\Delta X'\theta_{v_n})^2\|_{L^2(P)} \lesssim |\theta_{v_n}|^2 = O(\delta_n^2);$
$\|F_{v_n}(Z_0)\Delta X'\theta_{v_n}\|_{L^2(P)} \lesssim \|F_{v_n}(Z_0)\|_{L^\infty(P)}|\theta_{v_n}| \lesssim \sqrt{k_n}\,O(\delta_n^2);$
$\|F_{v_n}'(Z_0)\Delta X'\theta_{v_n}\|_{L^2(P)} \lesssim \|F_{v_n}'(Z_0)\|_{L^\infty(P)}|\theta_{v_n}| \lesssim \sqrt{k_n}\,\xi_1(k_n)\,O(\delta_n^2);$
$\|\bar F_n'(\Delta X'\bar\theta_n)\Delta X'\theta_{v_n}\|_{L^2(P)} \lesssim |\theta_{v_n}| = O(\delta_n);$
$\|\bar F_n'''(\Delta X'\bar\theta_n)(\Delta X'\theta_{v_n})^3\|_{L^2(P)} \lesssim |\theta_{v_n}|^3 = O(\delta_n^3).$

There are two more terms, namely $\|F_{v_n}(\Delta X'\bar\theta_n)\|_{L^2(P)}$ and $\|F_0(\Delta X'\theta_0) - \bar F_n(\Delta X'\bar\theta_n)\|_{L^2(P)}$. By Taylor expansion and the Hölder inequality,

$\|F_{v_n}(\Delta X'\bar\theta_n)\|_{L^2(P)} \le \|F_{v_n}(Z_0)\|_{L^2(P)} + \|F_{v_n}'(\Delta X'\check\theta_n)\,\Delta X'(\bar\theta_n-\theta_0)\|_{L^2(P)} \lesssim \|F_{v_n}(Z_0)\|_{L^2(P)} + |\bar\theta_n-\theta_0|,$

for some $\check\theta_n$ between $\bar\theta_n$ and $\theta_0$. Recall that $\bar\theta_n$ is between $\theta_n$ and $\theta_0$; as such, $|\bar\theta_n-\theta_0| = O(\delta_n)$. Hence $\|F_{v_n}(\Delta X'\bar\theta_n)\|_{L^2(P)} = O(\delta_n)$. It is similar to show $\|F_0(\Delta X'\theta_0) - \bar F_n(\Delta X'\bar\theta_n)\|_{L^2(P)} = O(\delta_n)$.

In sum, using the above bounds for each term of (84), we see that

$E\big[h'''(\bar\alpha_n;V_1,V_2)[v_n]^3\big] = O\big(\{\delta_n^2 + \sqrt{k_n}\,\delta_n^2 + \sqrt{k_n}\,\xi_1(k_n)\,\delta_n^2\}\delta_n + \delta_n\{\delta_n^3 + \delta_n^2\}\big) = O\big(\sqrt{k_n}\,\xi_1(k_n)\,\delta_n^3\big).$

By Remark A.17, $\sqrt{k_n}\,\xi_1(k_n)\,\delta_n^3 = o(n^{-1})$. Together with the bounds on the first- and second-order terms above, this proves the lemma.
Lemma A.21. Suppose Assumptions 3.1-3.2, 4.3, 5.3-5.5 hold. Let $M$ be the constant appearing in Lemma A.15. Denote

$\mathcal{B}_{k_n,\delta} = \{\alpha\in\mathcal{A}_{k_n} : \|\alpha-\pi_{k_n}\alpha_0\|_c < M^{-2},\ \|\alpha-\pi_{k_n}\alpha_0\|_q \le \delta\}.$

Then, for the sequence $\{\delta_n\}_{n=1}^\infty$ specified in Remark A.17,

(85) $\sup_{v\in\mathcal{B}_{k_n,\delta_n}} (U_n - E)h''(\pi_{k_n}\alpha_0)[v]^2 = o_p\Big(\frac{1}{n}\Big).$

Proof. By the Markov inequality, the lemma is proved if

(86) $E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}} \big|(U_n-E)h''(\pi_{k_n}\alpha_0)[v]^2\big|\Big] = o\Big(\frac{1}{n}\Big).$

The claim (86) can be shown by the maximal inequality. As before, abbreviate $h''(\pi_{k_n}\alpha_0)[v]^2 = h''(\pi_{k_n}\alpha_0;V_1,V_2)[v]^2$. Define

$m''(\pi_{k_n}\alpha_0;v)[\cdot]^2 = E\big[h''(\pi_{k_n}\alpha_0)[\cdot]^2\,\big|\,V_1=v\big] + E\big[h''(\pi_{k_n}\alpha_0)[\cdot]^2\,\big|\,V_2=v\big] - E\big[h''(\pi_{k_n}\alpha_0)[\cdot]^2\big],$

$g''(\pi_{k_n}\alpha_0;v_1,v_2)[\cdot]^2 = h''(\pi_{k_n}\alpha_0)[\cdot]^2 - E\big[h''(\pi_{k_n}\alpha_0)[\cdot]^2\,\big|\,V_1=v_1\big] - E\big[h''(\pi_{k_n}\alpha_0)[\cdot]^2\,\big|\,V_2=v_2\big] + E\big[h''(\pi_{k_n}\alpha_0)[\cdot]^2\big].$

We shorten $m''(\pi_{k_n}\alpha_0;V_i)[v]^2$ to $m''(\pi_{k_n}\alpha_0)[v]^2$, and $g''(\pi_{k_n}\alpha_0;V_i,V_j)[v]^2$ to $g''(\pi_{k_n}\alpha_0)[v]^2$. The following equation is yet another application of the Hoeffding decomposition:

(87) $(U_n-E)h''(\pi_{k_n}\alpha_0)[v]^2 = (P_n-E)m''(\pi_{k_n}\alpha_0)[v]^2 + (U_n-E)g''(\pi_{k_n}\alpha_0)[v]^2.$

By the decomposition (87), we may bound the left of (86) by

$E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}} \big|(P_n-E)m''(\pi_{k_n}\alpha_0)[v]^2\big|\Big] + E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}} \big|(U_n-E)g''(\pi_{k_n}\alpha_0)[v]^2\big|\Big].$

The above two expectations are shown to converge to zero faster than the $1/n$ rate by Lemma A.23 and Lemma A.24 respectively. Therefore the claim (85) is proved.

Lemma A.22. Suppose the conditions of Lemma A.21; the notation is also the same as in Lemma A.21. Then

$\log N_{[\,]}(\varepsilon;\mathcal{M}''_{k_n,\delta_n};L^2(P)) \lesssim k_n\log\frac{k_n^2\delta_n}{\varepsilon}.$

Proof. Let $\Theta_n$ and $\mathcal{F}_{k_n,\delta_n}$ be such that $\mathcal{B}_{k_n,\delta_n} \subset \Theta_n\times\mathcal{F}_{k_n,\delta_n}$. Also denote $\|F\|_{1,L^\infty(P)} = \|F(Z_0)\|_{L^\infty(P)} + \|F'(Z_0)\|_{L^\infty(P)}$ and $\|v\|_{e,1,L^\infty(P)} = |\theta_v| + \|F_v\|_{1,L^\infty(P)}$. By Lemma A.26, Theorem 2.7.11 of van der Vaart and Wellner (1996) can be applied:

(88) $N_{[\,]}(\varepsilon;\mathcal{M}''_{k_n,\delta_n};L^2(P)) \le N\Big(\frac{\varepsilon}{C_1};\mathcal{B}_{k_n,\delta_n};\|\cdot\|_{e,1,L^\infty(P)}\Big) \le N\Big(\frac{\varepsilon}{2C_1};\Theta_n;|\cdot|\Big)\,N\Big(\frac{\varepsilon}{4C_1};\mathcal{F}_{k_n,\delta_n};\|\cdot\|_{1,L^\infty(P)}\Big),$

for some constant $C_1>0$. The covering number $N(\varepsilon/4C_1;\mathcal{F}_{k_n,\delta_n};\|\cdot\|_{1,L^\infty(P)})$ in (88) is not easy to calculate directly. However, by Lemma A.18 and Assumption 5.4(i), for any $F\in\mathcal{F}_{k_n,\delta_n}\subset\mathcal{F}_{k_n}$,

(89) $\|F\|_{1,L^\infty(P)} \lesssim k_n\,\|F(Z_0)\|_{L^2(P)}.$

Therefore we have the following inequality: for some positive constant $C_2$,

(90) $N\Big(\frac{\varepsilon}{4C_1};\mathcal{F}_{k_n,\delta_n};\|\cdot\|_{1,L^\infty(P)}\Big) \le N\Big(\frac{\varepsilon}{C_2k_n};\mathcal{F}_{k_n,\delta_n};\|\cdot\|_{L^2}\Big),$

where $\|F\|_{L^2}$ is defined to be $\|F(Z_0)\|_{L^2(P)}$ for simplicity. Next, in order to calculate the covering number in (90), we exploit the fact that $\mathcal{F}_{k_n}$ is spanned by finitely many basis functions. For $F\in\mathcal{F}_{k_n}$, write $F(z) = c'p^{k_n}(z)$ for $c = (c_1,\dots,c_{k_n})'$. Notice that any coefficient $c_j$ must be bounded. To see this, suppose not: as $F(z)$ is a bounded function, this would imply that some linear combination of the basis functions is zero, which contradicts Assumption 5.5; so any coefficient $c_j$ is bounded. Without loss of generality, assume $c_j\in[-1,1]$ for every $j$. Next, consider $F\in\mathcal{F}_{k_n,\delta_n}$ and suppose $\|F(Z_0)\|_{L^2(P)} \le c\delta_n$ for some constant $c$. By (70), this implies that each coefficient $c_j$ lies in an interval $[-\bar c\delta_n, \bar c\delta_n]$. Since $\bar p = \sup_j\|p_j\|_\infty$ is bounded by Assumption 5.3, it follows that

$\|c'p^{k_n}\|_{L^2} \le \bar p\{|c_1| + \cdots + |c_{k_n}|\}.$

Hence an $\varepsilon$-radius ball in $(\mathcal{F}_{k_n}, \|\cdot\|_{L^2})$ contains the set $\{c'p^{k_n}(z) : |c_1|+\cdots+|c_{k_n}| \le \varepsilon/\bar p\}$, and covering coefficients coordinate-by-coordinate suffices. All these considerations lead to the following calculation: the covering number on the right of (90) is bounded above, up to a fixed scale, by

(91) $\Big[N\Big(\frac{\varepsilon}{C_2k_n^2};[-\bar c\delta_n,\bar c\delta_n];|\cdot|\Big)\Big]^{k_n} \lesssim \Big(\frac{k_n^2\delta_n}{\varepsilon}\Big)^{k_n}.$

Now, back to (88): after taking logarithms,

$\log N_{[\,]}(\varepsilon;\mathcal{M}''_{k_n,\delta_n};L^2(P)) \lesssim d\,\log\frac{\delta_n}{\varepsilon} + k_n\log\frac{k_n^2\delta_n}{\varepsilon},$

where $d$ is the dimension of $\theta$. As the second term dominates, we obtain the claimed inequality.
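The count in (91) is the standard volumetric bound for a coordinate cube; for completeness (an elementary fact, not specific to this paper): an interval $[-\delta,\delta]$ can be covered by at most $\delta/\varepsilon + 1$ intervals of radius $\varepsilon$, so covering each of the $k$ coordinates separately gives

$$N\big(\varepsilon;[-\delta,\delta]^k;|\cdot|_\infty\big) \le \Big(\frac{\delta}{\varepsilon}+1\Big)^k, \qquad \log N \lesssim k\log\frac{\delta}{\varepsilon} \quad (\varepsilon\le\delta),$$

which is the source of the $k_n\log(k_n^2\delta_n/\varepsilon)$ rate in Lemma A.22.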
Lemma A.23. Suppose the conditions of Lemma A.21; the notation is also the same. Then

$E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}} \big|\mathbb{G}_n m''(\pi_{k_n}\alpha_0)[v]^2\big|\Big] = o\Big(\frac{1}{\sqrt{n}}\Big).$

Proof. We prove this lemma using the maximal inequality of Theorem 2.14.2 in van der Vaart and Wellner (1996). To prepare for its application, define

$\mathcal{M}''_{k_n,\delta_n} = \{v\mapsto m''(\pi_{k_n}\alpha_0)[v]^2 : v\in\mathcal{B}_{k_n,\delta_n}\}.$

From (81), it is easily checked that an envelope for $\{h''(\pi_{k_n}\alpha_0)[v]^2 : v\in\mathcal{B}_{k_n,\delta_n}\}$ is obtained, up to a fixed scale, by taking the supremum over $v\in\mathcal{B}_{k_n,\delta_n}$ of

$|\Delta X|^2|\theta_v|^2 + |F_v(Z_0)|^2 + \{|F_v(Z_0)| + |F_v'(Z_0)|\}|\Delta X|\,|\theta_v|;$

denote this envelope $H_n(V_1,V_2)$. From this, an envelope function for $\mathcal{M}''_{k_n,\delta_n}$ can be obtained: let

$M_n(v) = E[H_n(V_1,V_2)|V_1=v] + E[H_n(V_1,V_2)|V_2=v] + E[H_n(V_1,V_2)].$

Then for any $v\in\mathcal{B}_{k_n,\delta_n}$, $\|m''(\pi_{k_n}\alpha_0;V_i)[v]^2\|_{L^2(P)} \le \|M_n(V_i)\|_{L^2(P)}$, by Jensen's and Minkowski's inequalities. We need to calculate $\|M_n(V_i)\|_{L^2(P)}$:

(92) $\|M_n(V_i)\|_{L^2(P)} \le 3\|H_n(V_1,V_2)\|_{L^2(P)} \lesssim \sup_{v\in\mathcal{B}_{k_n,\delta_n}}\big[|\theta_v|^2 + \|F_v(Z_0)\|_{L^4(P)}^2 + \{\|F_v(Z_0)\|_{L^4(P)} + \|F_v'(Z_0)\|_{L^4(P)}\}|\theta_v|\big],$

where the first inequality holds by Jensen's inequality, and the second is a consequence of the Hölder inequality. Note, by Lemma A.18,

$\|F_v(Z_0)\|_{L^4(P)} \lesssim \sqrt{k_n}\,\|F_v(Z_0)\|_{L^2(P)}, \qquad \|F_v'(Z_0)\|_{L^4(P)} \lesssim \sqrt{k_n}\,\xi_1(k_n)\,\|F_v(Z_0)\|_{L^2(P)}.$

Recall that Assumption 5.4(i) states $\sqrt{k_n}\,\xi_1(k_n) \lesssim k_n$. Hence, from (92),

(93) $\|M_n(V_i)\|_{L^2(P)} \le C_1 k_n\delta_n^2.$

Next, the following bracketing integral is calculated:

$J_{[\,]}(1;\mathcal{M}''_{k_n,\delta_n};L^2(P)) = \int_0^1 \sqrt{1+\log N_{[\,]}(\varepsilon\|M_n(V_i)\|_{L^2(P)};\mathcal{M}''_{k_n,\delta_n};L^2(P))}\,d\varepsilon.$

By (93), this is bounded above by $\int_0^1 \sqrt{1+\log N_{[\,]}(\varepsilon C_1k_n\delta_n^2;\mathcal{M}''_{k_n,\delta_n};L^2(P))}\,d\varepsilon$, which, by a change of variable, can be written as

$(C_1k_n\delta_n^2)^{-1}\int_0^{C_1k_n\delta_n^2} \sqrt{1+\log N_{[\,]}(\varepsilon;\mathcal{M}''_{k_n,\delta_n};L^2(P))}\,d\varepsilon.$

By Lemma A.22, the above integral is bounded above by

(94) $(C_1k_n\delta_n^2)^{-1}\,C_2\sqrt{k_n}\int_0^{C_1k_n\delta_n^2}\sqrt{\log\frac{k_n^2\delta_n}{\varepsilon}}\,d\varepsilon \lesssim \frac{k_n^{3/2}}{\delta_n}\int_0^{C_1\delta_n/k_n}\sqrt{\log\frac{1}{\varepsilon'}}\,d\varepsilon'.$

Notice that $\delta_n/k_n \to 0$. To evaluate the last integral of (94), observe that, by the Leibniz integral rule,

$\frac{d}{dx}\int_0^x \sqrt{\log(1/\varepsilon')}\,d\varepsilon' = \sqrt{\log(1/x)}.$

Then by l'Hôpital's rule, for any $c\in(0,1)$,

$\lim_{x\downarrow 0}\, x^{-c}\int_0^x \sqrt{\log(1/\varepsilon')}\,d\varepsilon' = \lim_{x\downarrow 0}\frac{\sqrt{\log(1/x)}}{c\,x^{c-1}} = \lim_{x\downarrow 0}\frac{x^{1-c}\sqrt{\log(1/x)}}{c} = 0.$

From these calculations we learn that, for any small number $\tau>0$, the right side of (94) is less than $\sqrt{k_n}\,(k_n/\delta_n)^\tau$ if $n$ is large enough. Keeping this in mind, recall Theorem 2.14.2 of van der Vaart and Wellner (1996) to obtain

$E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}}\big|\mathbb{G}_n m''(\pi_{k_n}\alpha_0)[v]^2\big|\Big] \lesssim J_{[\,]}(1;\mathcal{M}''_{k_n,\delta_n};L^2(P))\,\|M_n(V_i)\|_{L^2(P)} \lesssim \sqrt{k_n}\,(k_n/\delta_n)^\tau\,k_n\delta_n^2.$

By Assumption 5.4, $\sqrt{k_n}\,k_n\delta_n^2 = o(n^{-1/2})$. By choosing $\tau$ arbitrarily small, we conclude that $\sqrt{k_n}\,(k_n/\delta_n)^\tau k_n\delta_n^2 = o(n^{-1/2})$. This concludes the proof.

Lemma A.24. Suppose the conditions of Lemma A.21; the notation is also the same. Then

$E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}} \big|U_n g''(\pi_{k_n}\alpha_0)[v]^2\big|\Big] = o\Big(\frac{1}{n}\Big).$

Proof. By Theorem B.1 (applied with $f_0 = 0$, which lies in the class since $g''(\pi_{k_n}\alpha_0)[0]^2 = 0$, and noting that $U_n$ carries an extra factor $(n-1)^{-1}$ relative to the normalization of Theorem B.1),

(95) $E\Big[\sup_{v\in\mathcal{B}_{k_n,\delta_n}} \big|U_n g''(\pi_{k_n}\alpha_0)[v]^2\big|\Big] \lesssim \frac{1}{n}\,E\Big[\int_0^{D_n}\log N(\varepsilon;\mathcal{B}_{k_n,\delta_n};\rho_n)\,d\varepsilon\Big],$

where the random metric $\rho_n$ is defined as

(96) $\rho_n(v_1,v_2) = \Big\{\frac{1}{n^2}\sum_{i\neq j}\big[g''(\pi_{k_n}\alpha_0;V_i,V_j)[v_1]^2 - g''(\pi_{k_n}\alpha_0;V_i,V_j)[v_2]^2\big]^2\Big\}^{1/2},$

and $D_n$ is the diameter of $\mathcal{B}_{k_n,\delta_n}$ measured by $\rho_n$. By Lemma A.27, for some positive constant $C$,

$\rho_n(v_1,v_2) \le C\Big\{\frac{1}{n^2}\sum_{i\neq j}G(V_i,V_j)^2\Big\}^{1/2}\,\|v_1-v_2\|_{e,1,L^\infty(P)},$

where the "Lipschitz coefficient" $G(\cdot,\cdot)$ is defined in Lemma A.27. Let us denote

(97) $G_n = C\Big\{\frac{1}{n(n-1)}\sum_{i\neq j}G(V_i,V_j)^2\Big\}^{1/2}.$

Notice that $\rho_n(v_1,v_2) \lesssim G_n\|v_1-v_2\|_{e,1,L^\infty(P)}$. Then, since $\|F\|_{1,L^\infty(P)} \lesssim k_n\|F(Z_0)\|_{L^2(P)}$ as derived in (89),

(98) $D_n = \sup_{v_1,v_2\in\mathcal{B}_{k_n,\delta_n}}\rho_n(v_1,v_2) \lesssim k_n\delta_nG_n.$

The following two inequalities are almost identical to (88):

$N(\varepsilon;\mathcal{B}_{k_n,\delta_n};\rho_n) \le N\Big(\frac{\varepsilon}{G_n};\mathcal{B}_{k_n,\delta_n};\|\cdot\|_{e,1,L^\infty(P)}\Big) \le N\Big(\frac{\varepsilon}{2G_n};\Theta_n;|\cdot|\Big)\,N\Big(\frac{\varepsilon}{2G_n};\mathcal{F}_{k_n,\delta_n};\|\cdot\|_{1,L^\infty(P)}\Big).$

Further, similar to (90)-(91),

(99) $\log N(\varepsilon;\mathcal{B}_{k_n,\delta_n};\rho_n) \lesssim k_n\log\frac{k_n^2\delta_nG_n}{\varepsilon}.$

From (99),

$\int_0^{D_n}\log N(\varepsilon;\mathcal{B}_{k_n,\delta_n};\rho_n)\,d\varepsilon \lesssim k_nD_n + k_nD_n\log\frac{k_n^2\delta_nG_n}{D_n}.$

Notice that $x\log x \le x^2$ for any positive $x$, and recall the bound (98). Using these two facts, the right side of the above inequality is less than a fixed multiple of $k_n^2\delta_nG_n + (k_n^2\delta_nG_n)^2$. By the definition (97), $E[G_n^2] = C^2E[G(V_i,V_j)^2]$, which is bounded; by Jensen's inequality, $E[G_n] \lesssim \{E[G(V_i,V_j)^2]\}^{1/2}$. Therefore we obtain

$E\Big[\int_0^{D_n}\log N(\varepsilon;\mathcal{B}_{k_n,\delta_n};\rho_n)\,d\varepsilon\Big] \lesssim k_n^2\delta_n.$

By Assumption 5.4, $k_n^2\delta_n$ converges to zero. This result combined with (95) proves the lemma.
Lemma A.25. Suppose Assumptions 3.1-3.2 hold. For any directions $w = (\theta_w,F_w)$ and $\bar w = (\theta_{\bar w},F_{\bar w})$,

(100) $\big|h''(\pi_{k_n}\alpha_0;V_1,V_2)[w]^2 - h''(\pi_{k_n}\alpha_0;V_1,V_2)[\bar w]^2\big| \lesssim (|\Delta X| + |\Delta X|^2)\big\{|\theta_w-\theta_{\bar w}| + \|F_w(Z_0)-F_{\bar w}(Z_0)\|_{L^\infty(P)} + \|F_w'(Z_0)-F_{\bar w}'(Z_0)\|_{L^\infty(P)}\big\}.$

Proof. The directional derivative $h''(\alpha;v_1,v_2)[w]$ exists if $F$ is twice differentiable and $F_w$ is differentiable. This is guaranteed by Assumptions 3.1-3.2. By easy algebra, the difference $h''(\pi_{k_n}\alpha_0;V_1,V_2)[w]^2 - h''(\pi_{k_n}\alpha_0;V_1,V_2)[\bar w]^2$ is a sum of products in which each factor is either a difference term (in $\theta_w-\theta_{\bar w}$, $F_w-F_{\bar w}$ or $F_w'-F_{\bar w}'$) or a term that is bounded by Remark A.2 (recall that for any $\alpha = (\theta,F)\in\mathcal{A}$, the functions $F$, $F'$ and $F''$ are all uniformly bounded). Hence the difference is smaller than, for strictly positive constants $C_1,\dots,C_4$,

(101) $C_1\big\{|\Delta X'(\theta_w-\theta_{\bar w})| + \|F_w(Z_0)-F_{\bar w}(Z_0)\|_{L^\infty(P)}\big\} + C_2\,|\Delta X'(\theta_w-\theta_{\bar w})|\,|\Delta X'(\theta_w+\theta_{\bar w})| + C_3\,|\Delta X'(\theta_w-\theta_{\bar w})| + C_4\,|\Delta X|\,\|F_w'(Z_0)-F_{\bar w}'(Z_0)\|_{L^\infty(P)}.$

The product terms in (101) can be bounded using the Cauchy-Schwarz inequality. Ignoring constant factors, the result (100) is acquired from the expression (101).

The following two lemmas can be proved easily, similarly to Lemma A.10 and Lemma A.11; hence the proofs are omitted.

Lemma A.26. Suppose Assumptions 3.1-3.2 hold. For any directions $w,\bar w$,

$\big|m''(\pi_{k_n}\alpha_0;v)[w]^2 - m''(\pi_{k_n}\alpha_0;v)[\bar w]^2\big| \lesssim (1+|x|+|x|^2)\big\{|\theta_w-\theta_{\bar w}| + \|F_w(Z_0)-F_{\bar w}(Z_0)\|_{L^\infty(P)} + \|F_w'(Z_0)-F_{\bar w}'(Z_0)\|_{L^\infty(P)}\big\}.$

Lemma A.27. Suppose Assumptions 3.1-3.2 hold. For any directions $w,\bar w$,

$\big|g''(\pi_{k_n}\alpha_0;v_1,v_2)[w]^2 - g''(\pi_{k_n}\alpha_0;v_1,v_2)[\bar w]^2\big| \lesssim G(v_1,v_2)\big\{|\theta_w-\theta_{\bar w}| + \|F_w(Z_0)-F_{\bar w}(Z_0)\|_{L^\infty(P)} + \|F_w'(Z_0)-F_{\bar w}'(Z_0)\|_{L^\infty(P)}\big\},$

where $G(v_1,v_2) = 1 + |x_1| + |x_2| + |x_1||x_2| + |x_1|^2 + |x_2|^2$.

Lemma A.28. Suppose Assumptions 3.1-3.2, 5.3-5.5 hold. Let $M$ be the constant defined in Lemma A.15. Denote $\mathcal{B}_{k_n,\delta} = \{\alpha\in\mathcal{A}_{k_n} : \|\alpha-\pi_{k_n}\alpha_0\|_c < M^{-2},\ \|\alpha-\pi_{k_n}\alpha_0\|_q \le \delta\}$. Then, for the sequence $\{\delta_n\}_{n=1}^\infty$ specified in Remark A.17,

$\sup_{\bar\alpha\in[\pi_{k_n}\alpha_0,\,\pi_{k_n}\alpha_0+v],\ v\in\mathcal{B}_{k_n,\delta_n}} (U_n-E)h'''(\bar\alpha)[v]^3 = o_p\Big(\frac{1}{n}\Big),$

where $[\pi_{k_n}\alpha_0, \pi_{k_n}\alpha_0+v]$ is the line segment connecting the two points $\pi_{k_n}\alpha_0$ and $\pi_{k_n}\alpha_0+v$.

Proof. Write $\bar\alpha = (\bar\theta,\bar F)$ and $v = (\theta_v,F_v)$. From Lemma A.19, we have

$h'''(\bar\alpha;V_1,V_2)[v]^3 = 6\{\bar F'(\Delta X'\bar\theta)\Delta X'\theta_v + F_v(\Delta X'\bar\theta)\}\{\bar F''(\Delta X'\bar\theta)(\Delta X'\theta_v)^2 + 2F_v'(\Delta X'\bar\theta)\Delta X'\theta_v\}$
$\qquad - 2\{I\{\Delta Y\le 0\} - \bar F(\Delta X'\bar\theta)\}\{\bar F'''(\Delta X'\bar\theta)(\Delta X'\theta_v)^3 + 3F_v''(\Delta X'\bar\theta)(\Delta X'\theta_v)^2\}.$

Observe that, since $\|F\|_{3,\infty} < B$ (see Remark A.2),

(102) $|h'''(\bar\alpha;V_1,V_2)[v]^3| \lesssim |\Delta X|^3\delta_n^3 + |\Delta X|^2\delta_n^2\{|F_v(\Delta X'\bar\theta)| + |F_v''(\Delta X'\bar\theta)|\} + |\Delta X|\delta_n|F_v(\Delta X'\bar\theta)||F_v'(\Delta X'\bar\theta)|.$

We need bounds for $|F_v(\Delta X'\bar\theta)|$, $|F_v'(\Delta X'\bar\theta)|$ and $|F_v''(\Delta X'\bar\theta)|$. First, by Taylor expansion,

$F_v(\Delta X'\bar\theta) = F_v(Z_0) + F_v'(Z_0)\Delta X'(\bar\theta-\theta_0) + \frac{1}{2}F_v''(\Delta X'\check\theta)\{\Delta X'(\bar\theta-\theta_0)\}^2,$

for some point $\check\theta$ between $\bar\theta$ and $\theta_0$. Because $\|F_v(Z_0)\|_{L^2(P)} \lesssim \delta_n$, by Lemma A.18 and Assumption 5.4(i),

$\|F_v(Z_0)\|_{L^\infty(P)} \lesssim \sqrt{k_n}\,\delta_n,\quad \|F_v'(Z_0)\|_{L^\infty(P)} \lesssim \sqrt{k_n}\,\xi_1(k_n)\delta_n,\quad \|F_v''(Z_0)\|_{L^\infty(P)} \lesssim \sqrt{k_n}\,\xi_2(k_n)\delta_n.$

Using these inequalities and the Hölder inequality, we obtain

$|F_v(\Delta X'\bar\theta)| \lesssim \sqrt{k_n}\,\delta_n + \sqrt{k_n}\,\xi_1(k_n)\delta_n^2|\Delta X| + \sqrt{k_n}\,\xi_2(k_n)\delta_n^3|\Delta X|^2.$

Second, after repeating similar calculations, it follows that

$|F_v''(\Delta X'\bar\theta)| \lesssim |F_v''(Z_0)| + |\Delta X||\theta_v| \lesssim \sqrt{k_n}\,\xi_2(k_n)\delta_n + |\Delta X|\delta_n.$

From Assumption 5.4, it is easy to show that

$\sqrt{k_n}\,\delta_n^3 \vee \sqrt{k_n}\,\xi_1(k_n)\delta_n^5 \vee \sqrt{k_n}\,\xi_2(k_n)\delta_n^7 = o\Big(\frac{1}{n}\Big), \qquad \sqrt{k_n}\,\xi_2(k_n)\delta_n^3 = o\Big(\frac{1}{n}\Big).$

From this, after some algebra, we can show that

(103) $|h'''(\bar\alpha;V_1,V_2)[v]^3| \lesssim (|\Delta X|^4 + |\Delta X|^3 + |\Delta X|^2 + |\Delta X|)\,o\Big(\frac{1}{n}\Big).$

Applying Jensen's inequality, from (103) it follows that

$U_n|h'''(\bar\alpha)[v]^3| \lesssim O_p(1)\,o\Big(\frac{1}{n}\Big), \qquad E|h'''(\bar\alpha)[v]^3| \lesssim O(1)\,o\Big(\frac{1}{n}\Big),$

where the $O_p(1)$ term results from the law of large numbers for U-statistics. This proves the lemma.
Lemma A.29. Suppose Assumptions 3.1-3.2, 4.1-4.3, 5.1-5.5 hold. Let $\tilde\alpha_n = \hat\alpha_n \pm \varepsilon_n\pi_{k_n}v^*$ for any non-random sequence $\varepsilon_n$ such that $\varepsilon_n = O(n^{-1/2})$; $v^*$ is the Riesz representer. Then, for $v_n = \tilde\alpha_n - \pi_{k_n}\alpha_0$,

(104) $\|\tilde\alpha_n - \pi_{k_n}\alpha_0\|_q^2 = -(U_n-E)h'(\pi_{k_n}\alpha_0)[v_n] + Q_n(\tilde\alpha_n) - Q_n(\pi_{k_n}\alpha_0) + o_p\Big(\frac{1}{n}\Big).$

Proof. Note the lemma assumes all the conditions of Theorem 3.1 and Theorem 4.1. Because $\varepsilon_n$ decreases to zero faster than $\delta_n$, we can see that $\|\tilde\alpha_n - \alpha_0\|_{e,L^2(P)} = o_p(\delta_n)$. By Lemma A.20,

(105) $\|\tilde\alpha_n - \pi_{k_n}\alpha_0\|_q^2 = Q(\tilde\alpha_n) - Q(\pi_{k_n}\alpha_0) + o_p\Big(\frac{1}{n}\Big).$

We may expand $Q(\tilde\alpha_n) - Q(\pi_{k_n}\alpha_0)$ to

(106) $\{Q(\tilde\alpha_n) - Q_n(\tilde\alpha_n)\} - \{Q(\pi_{k_n}\alpha_0) - Q_n(\pi_{k_n}\alpha_0)\} + Q_n(\tilde\alpha_n) - Q_n(\pi_{k_n}\alpha_0),$

and note that the first two terms of the above expression can be written as

(107) $-(U_n - E)\{h(\tilde\alpha_n) - h(\pi_{k_n}\alpha_0)\}.$

As $\alpha\mapsto h(\alpha) = h(\alpha;V_i,V_j)$ permits a third-order Taylor expansion thanks to Lemma A.19, for $v_n = \tilde\alpha_n - \pi_{k_n}\alpha_0$, (107) is equal to

(108) $-(U_n-E)\Big\{h'(\pi_{k_n}\alpha_0)[v_n] + \frac{1}{2}h''(\pi_{k_n}\alpha_0)[v_n]^2 + \frac{1}{6}h'''(\bar\alpha_n)[v_n]^3\Big\},$

where $\bar\alpha_n\in\mathcal{A}_{k_n}$ is a random point between $\tilde\alpha_n$ and $\pi_{k_n}\alpha_0$. Recall the set $\mathcal{B}_{k_n,\delta}$ defined in Lemma A.21. Note that $\|\tilde\alpha_n - \alpha_0\|_c$ converges in probability to zero, and therefore $\|v_n\|_c$ converges in probability to zero. Then $v_n$ is in the set $\mathcal{B}_{k_n,\delta_n}$ with probability approaching one. By this fact, we can apply Lemma A.21 and Lemma A.28 to bound the second- and third-order derivatives in (108) to obtain

(109) $-(U_n-E)\{h(\tilde\alpha_n) - h(\pi_{k_n}\alpha_0)\} = -(U_n-E)h'(\pi_{k_n}\alpha_0)[v_n] + o_p\Big(\frac{1}{n}\Big).$

From (105), (106) and (109), the claim (104) is shown.

Lemma A.30. Adopt the conditions of Theorem 5.1. Then

$\sqrt{n}(U_n-E)h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] = \sqrt{n}(U_n-E)h'(\alpha_0)[v^*] + o_p(1).$

Proof. Denote $\pi_{k_n}v^* = (\theta_{v^*}, F_{v^*,k_n})$. After short algebra, we obtain

$h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*] = -2\{I\{\Delta Y\le 0\} - F_0(Z_0)\}\{(F_{0,k_n}'(Z_0) - F_0'(Z_0))\Delta X'\theta_{v^*}\}$
$\qquad + 2\{F_{0,k_n}(Z_0) - F_0(Z_0)\}\{F_{0,k_n}'(Z_0)\Delta X'\theta_{v^*} + F_{v^*,k_n}(Z_0)\}.$

Therefore,

$\big|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]\big| \lesssim |(F_{0,k_n}'(Z_0)-F_0'(Z_0))\Delta X'\theta_{v^*}| + |F_{0,k_n}(Z_0)-F_0(Z_0)|\,|F_{0,k_n}'(Z_0)\Delta X'\theta_{v^*} + F_{v^*,k_n}(Z_0)|.$

By the Hölder inequality,

(110) $E\big|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]\big| \lesssim \|F_{0,k_n}'(Z_0)-F_0'(Z_0)\|_{L^2(P)}\,\|\Delta X'\theta_{v^*}\|_{L^2(P)}$
$\qquad + \|F_{0,k_n}(Z_0)-F_0(Z_0)\|_{L^2(P)}\,\|F_{0,k_n}'(Z_0)\Delta X'\theta_{v^*} + F_{v^*,k_n}(Z_0)\|_{L^2(P)}.$

By Assumption 5.2, the right of (110) is $o(n^{-1/2})$. By Jensen's inequality,

(111) $\big|(U_n-E)\{h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]\}\big| \le U_n\big|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]\big| + E\big|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]\big|.$

We already showed that the second term on the right of (111) is $o(n^{-1/2})$. For the first term, by the Markov inequality, for any $\varepsilon>0$,

$P\big[U_n|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]| > \varepsilon\big] \le \frac{1}{\varepsilon}\,E\big|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]\big| = \frac{1}{\varepsilon}\,o(n^{-1/2}).$

This shows $U_n|h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[\pi_{k_n}v^*]| = o_p(n^{-1/2})$. Next, we have

$h'(\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[v^*] = -2\{I\{\Delta Y\le 0\} - F_0(Z_0)\}\{F_{v^*,k_n}(Z_0) - F_{v^*}(Z_0)\}.$

Then it is immediate that

$E\big|h'(\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[v^*]\big| \lesssim \|F_{v^*,k_n}(Z_0) - F_{v^*}(Z_0)\|_{L^2(P)} \lesssim \|\pi_{k_n}v^* - v^*\|_q,$

where the last inequality holds by Lemma A.14. By Assumption 5.6,

$\|\pi_{k_n}v^* - v^*\|_q = o(n^{-1/2}r_n^{-1}) = o(n^{-1/2}).$

By the same argument as above, we show $U_n|h'(\alpha_0)[\pi_{k_n}v^*] - h'(\alpha_0)[v^*]| = o_p(n^{-1/2})$. This concludes the proof.
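Before the proof of Theorem 5.1, one small fact is worth making explicit, since it is used silently when passing from (104) to (116) below: by (73), for fixed $\alpha$ the directional derivative is linear in the direction,

$$v\mapsto h'(\alpha)[v] = -2\{I\{\Delta Y\le 0\}-F(\Delta X'\theta)\}\{F'(\Delta X'\theta)\Delta X'\theta_v + F_v(\Delta X'\theta)\},$$

so applying Lemma A.29 at $\hat\alpha_n+\varepsilon_n\pi_{k_n}v^*$ and at $\hat\alpha_n-\varepsilon_n\pi_{k_n}v^*$ and differencing leaves precisely the term $h'(\pi_{k_n}\alpha_0)[2\varepsilon_n\pi_{k_n}v^*]$.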
Proof of Theorem 5.1. Any linear transform of $\sqrt{n}(\hat\theta_n - \theta_0)$ can be expressed through the inner product induced by $\|\cdot\|_q$: for any $\lambda\in\mathbb{R}^d$,

(112) $\sqrt{n}\,\lambda'(\hat\theta_n - \theta_0) = \sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, v^*\rangle.$

To exploit the properties of the finite-dimensional sieve, we want both entries of the above inner product to lie in the sieve space $\mathcal{A}_{k_n}$. Observe

(113) $\sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, v^*\rangle = \sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, \pi_{k_n}v^*\rangle + \sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, v^* - \pi_{k_n}v^*\rangle,$

and then, by the Cauchy-Schwarz inequality, the second term on the right of (113) is bounded above by

(114) $\sqrt{n}\,\|\hat\alpha_n - \pi_{k_n}\alpha_0\|_q\,\|v^* - \pi_{k_n}v^*\|_q.$

By Assumption 5.5 and Theorem 4.1 (rate of convergence), (114) converges in probability to zero. As such, we are left with the first term of (113). A trick is to express the inner product through squared norms. Let $\varepsilon_n$ be an arbitrary non-random sequence such that $\varepsilon_n = o(n^{-1/2})$; this is a local "perturbation" factor. By the polarization identity and the bilinearity of the inner product,

(115) $4\sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, \pi_{k_n}v^*\rangle = 4\sqrt{n}\,\varepsilon_n^{-1}\langle\hat\alpha_n - \pi_{k_n}\alpha_0, \varepsilon_n\pi_{k_n}v^*\rangle$
$\qquad = \sqrt{n}\,\varepsilon_n^{-1}\|\hat\alpha_n + \varepsilon_n\pi_{k_n}v^* - \pi_{k_n}\alpha_0\|_q^2 - \sqrt{n}\,\varepsilon_n^{-1}\|\hat\alpha_n - \varepsilon_n\pi_{k_n}v^* - \pi_{k_n}\alpha_0\|_q^2.$

The reason why we bring in the ad hoc sequence $\varepsilon_n$ is to exploit the local quadraticity of $Q(\cdot)$ near $\alpha_0$. Note that $\hat\alpha_n \pm \varepsilon_n\pi_{k_n}v^*$ is close to $\hat\alpha_n$ (even though $\pi_{k_n}v^*$ itself is far from 0), and hence $\hat\alpha_n \pm \varepsilon_n\pi_{k_n}v^*$ approaches $\alpha_0$ along with $\hat\alpha_n$. An important fact is that, when the estimator $\hat\alpha_n$ is perturbed slightly by the $\varepsilon_n$ factor, it still converges faster than $\delta_n$; that is, for any $v\in\mathcal{V}$,

$\|\hat\alpha_n \pm \varepsilon_n v - \alpha_0\| = o_p(\delta_n).$

This is true because the rate of convergence $r_n$ is slower than the parametric rate $\sqrt{n}$, and $\delta_n$ shrinks more slowly than $n^{-1/2}$. By Lemma A.29 (applied with $\pm\varepsilon_n$),

(116) $\|\hat\alpha_n + \varepsilon_n\pi_{k_n}v^* - \pi_{k_n}\alpha_0\|_q^2 - \|\hat\alpha_n - \varepsilon_n\pi_{k_n}v^* - \pi_{k_n}\alpha_0\|_q^2$
$\qquad = -(U_n-E)h'(\pi_{k_n}\alpha_0)[2\varepsilon_n\pi_{k_n}v^*] + Q_n(\hat\alpha_n + \varepsilon_n\pi_{k_n}v^*) - Q_n(\hat\alpha_n - \varepsilon_n\pi_{k_n}v^*) + o_p\Big(\frac{1}{n}\Big).$

Using a Taylor expansion around $\hat\alpha_n$, it is easy to show that

(117) $Q_n(\hat\alpha_n + \varepsilon_n\pi_{k_n}v^*) - Q_n(\hat\alpha_n - \varepsilon_n\pi_{k_n}v^*) = O_p(\varepsilon_n^2).$

Collecting (115)-(117), we obtain

(118) $\sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, \pi_{k_n}v^*\rangle = -\frac{1}{2}\sqrt{n}(U_n-E)h'(\pi_{k_n}\alpha_0)[\pi_{k_n}v^*] + \sqrt{n}\,\varepsilon_n^{-1}O_p(\varepsilon_n^2) + \sqrt{n}\,\varepsilon_n^{-1}o_p\Big(\frac{1}{n}\Big).$

Since the sequence $\varepsilon_n$ is arbitrary, we may choose it such that the last two terms in (118) are $o_p(1)$ terms. Then, thanks to Lemma A.30, from (118),

(119) $\sqrt{n}\,\lambda'(\hat\theta_n - \theta_0) = \sqrt{n}\,\langle\hat\alpha_n - \pi_{k_n}\alpha_0, \pi_{k_n}v^*\rangle = -\frac{1}{2}\sqrt{n}(U_n-E)h'(\alpha_0)[v^*] + o_p(1).$

By Theorem 12.3 of van der Vaart (2000) (the central limit theorem for U-statistics), the claim is proved.
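For completeness, the polarization identity invoked in (115) is the standard Hilbert-space identity: for any $a, b$ in an inner-product space,

$$4\langle a, b\rangle = \|a+b\|^2 - \|a-b\|^2,$$

applied here with $a = \hat\alpha_n - \pi_{k_n}\alpha_0$, $b = \varepsilon_n\pi_{k_n}v^*$ and the inner product inducing $\|\cdot\|_q$, together with the bilinearity $\langle a, \varepsilon_n b'\rangle = \varepsilon_n\langle a, b'\rangle$.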
Proof of Theorem 5.2. Let $\tilde V_i = (V_i', B_i)'$, and denote $h^*(\alpha;\tilde V_i,\tilde V_j) = B_iB_j\,h(\alpha;V_i,V_j)$, so that we can write

$Q_n^*(\alpha) = \frac{1}{n(n-1)}\sum_{i\neq j} h^*(\alpha;\tilde V_i,\tilde V_j).$

Note that the weighted empirical criterion $Q_n^*$ has the same structure as the unweighted empirical criterion $Q_n$. Also, importantly, both criteria induce the same population criterion:

$Q(\alpha) = E[h(\alpha;V_i,V_j)] = E[h^*(\alpha;\tilde V_i,\tilde V_j)].$

Therefore, it is straightforward to modify the proofs of Theorem 3.1 and Theorem 4.1 for the weighted empirical criterion $Q_n^*$. Then we obtain

$\|\hat\alpha_n^* - \alpha_0\|_c \to_p 0, \qquad r_n\|\hat\alpha_n^* - \alpha_0\|_q = O_p(1).$

Next, we modify the proof of Theorem 5.1. To this end, note

$\frac{d}{dt}h^*(\alpha+tv;\tilde V_i,\tilde V_j)\Big|_{t=0} = B_iB_j\,\frac{d}{dt}h(\alpha+tv;V_i,V_j)\Big|_{t=0} = B_iB_j\,h'(\alpha)[v].$

Denote $h^{*\prime}(\alpha)[v] = B_iB_j\,h'(\alpha)[v]$; higher-order derivatives are defined in the same manner. Hence it is again a straightforward business to adapt the proof of Theorem 5.1 to the weighted empirical criterion. Then we can obtain, similarly to (119), for any $\lambda\in\mathbb{R}^d$,

(120) $\sqrt{n}\,\lambda'(\hat\theta_n^* - \theta_0) = \sqrt{n}\,\langle\hat\alpha_n^* - \pi_{k_n}\alpha_0, \pi_{k_n}v^*\rangle = -\frac{1}{2}\sqrt{n}(\tilde U_n - E)h^{*\prime}(\alpha_0)[v^*] + o_p(1),$

where $\tilde U_n$ is the U-process empirical measure for the augmented random variables $\tilde V_i$. Note that the unconditional limiting distribution of $\sqrt{n}\,\lambda'(\hat\theta_n^* - \theta_0)$ is immediately derived from the expression (120). Now, subtract (120) from (119) to get

(121) $\sqrt{n}\,\langle\hat\alpha_n - \hat\alpha_n^*, \pi_{k_n}v^*\rangle = -\frac{1}{2}\sqrt{n}(\tilde U_n - E)\{h'(\alpha_0)[v^*] - h^{*\prime}(\alpha_0)[v^*]\} + o_p(1) = \frac{\sqrt{n}}{2n(n-1)}\sum_{i\neq j}(B_iB_j - 1)h'(\alpha_0)[v^*] + o_p(1).$

We will study the conditional limiting distribution of $\sqrt{n}\,\langle\hat\alpha_n - \hat\alpha_n^*, \pi_{k_n}v^*\rangle$. To state the claim precisely, following the construction of van der Vaart and Wellner (1996), the underlying probability space is expanded as follows. Let the sample $\{V_i\}_{i=1}^\infty$ be the projection of the first coordinates in the probability space $(\mathcal{V}^\infty\times\mathcal{Z}, \mathcal{A}^\infty\otimes\mathcal{C}, P^\infty\times Q)$, and let the random weights $\{B_i\}_{i=1}^\infty$ depend on the last coordinate only. We denote by $E_Q[\cdot]$ the conditional expectation given $\{V_i\}_{i=1}^\infty$. When a probabilistic statement holds conditionally on $\{V_i\}_{i=1}^\infty$ for every realization of $\{V_i\}_{i=1}^\infty$ outside a $P^\infty$-measure zero set, we simply say that it holds $P^\infty$-almost surely (a.s.).

Let us write, suppressing the dependence on $\alpha_0$ and $n$,

$W_{ij} = h'(\alpha_0;V_i,V_j)[v^*], \qquad \bar W_i = \frac{1}{n-1}\sum_{j\neq i} W_{ij}.$

Then, by the Hoeffding decomposition,

(122) $\frac{\sqrt{n}}{2n(n-1)}\sum_{i\neq j}(B_iB_j - 1)W_{ij} = \frac{1}{\sqrt{n}}\sum_i (B_i - 1)\bar W_i + \frac{\sqrt{n}}{2n(n-1)}\sum_{i\neq j}(B_iB_j - B_i - B_j + 1)W_{ij}.$

First, the convergence of the first term is shown. Note that $E_Q[\{(B_i - 1)\bar W_i\}^2] = (\bar W_i)^2$ and

$\frac{1}{n}\sum_{i=1}^n(\bar W_i)^2 = \frac{1}{n(n-1)^2}\Big\{\sum_{i\neq j}W_{ij}^2 + \sum_{i\neq j\neq k}W_{ij}W_{ik}\Big\} \to 0 + E[W_{12}W_{13}] \quad P^\infty\text{-a.s.},$

by the strong law of large numbers for U-statistics (Theorem 4.1.4 of de la Peña and Giné (1999)). Then the Lindeberg condition is, for any $\varepsilon>0$,

(123) $\frac{1}{n}\sum_i E_Q\Big[\{(B_i-1)\bar W_i\}^2\,I\{|(B_i-1)\bar W_i| > \sqrt{n}\,\varepsilon\}\Big] \to 0 \quad P^\infty\text{-a.s.}$

To show the condition (123), it suffices to show, for every $i\in\mathbb{N}$,

(124) $E_Q\big[(B_i-1)^2\,I\{|(B_i-1)\bar W_i| > \sqrt{n}\,\varepsilon\}\big] \to 0,$

and in turn, the condition (124) holds if $\bar W_i$ is $P^\infty$-a.s. uniformly bounded in $n$ (recall that $\bar W_i$ depends on $n$). Observe that, conditionally on $V_i$, $\bar W_i \to E[W_{ij}|V_i]$ $P^\infty$-a.s. for any $j\neq i$, by the strong law of large numbers, and any converging sequence is uniformly bounded. Therefore we verify the condition (124) and hence the Lindeberg condition (123). This implies

(125) $\frac{1}{\sqrt{n}}\sum_i (B_i-1)\bar W_i \to_d N(0, E[W_{12}W_{13}]) \quad P^\infty\text{-a.s.}$

It is left to show that the second term on the right of (122) degenerates. Let us write $\tilde B_{ij} = B_iB_j - B_i - B_j + 1$. See that $E_Q[\tilde B_{ij}\tilde B_{ik}] = 0$ if $j\neq k$, and $E_Q[\tilde B_{ij}^2] = 5$. Therefore it is easy to show that

$E_Q\Big[\Big\{\frac{\sqrt{n}}{2n(n-1)}\sum_{i\neq j}\tilde B_{ij}W_{ij}\Big\}^2\Big] = \frac{n}{4n^2(n-1)^2}\sum_{i\neq j} 5W_{ij}^2 \to 0 \quad P^\infty\text{-a.s.},$

by Theorem 4.1.4 of de la Peña and Giné (1999). Therefore the conditional limiting distribution of (121) is $N(0, E[W_{12}W_{13}])$, which coincides with the unconditional limiting distribution of $\sqrt{n}\,\lambda'(\hat\theta_n - \theta_0)$. The claim follows by the Cramér-Wold device.
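Although the paper contains no code, the weighted-bootstrap procedure whose consistency is proved above is simple to describe operationally. The following Python sketch illustrates the construction of the weighted U-statistic criterion $Q_n^*$ and a single bootstrap draw. The kernel `h`, the encoding of $\alpha = (\theta, F)$, and the routine `minimize_over_sieve` are hypothetical placeholders, not part of the paper; the exponential weights are one common choice of positive, mean-one, variance-one weights in the weighted-bootstrap literature, not necessarily the paper's exact specification.

```python
import numpy as np

def criterion(h, alpha, V, weights=None):
    """Weighted U-statistic criterion
    Q_n*(alpha) = (1/(n(n-1))) * sum_{i != j} B_i B_j h(alpha; V_i, V_j).

    h       : callable h(alpha, v_i, v_j) -- pairwise criterion kernel (placeholder)
    alpha   : the parameter alpha = (theta, F) in some encoding (placeholder)
    V       : array of shape (n, d) holding the observations V_i
    weights : array of shape (n,) of bootstrap weights B_i; None means B_i = 1
    """
    n = V.shape[0]
    if weights is None:
        weights = np.ones(n)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += weights[i] * weights[j] * h(alpha, V[i], V[j])
    return total / (n * (n - 1))

def bootstrap_draw(h, V, minimize_over_sieve, rng):
    """One weighted-bootstrap replication: draw weights, re-minimize the criterion.
    `minimize_over_sieve` stands in for the sieve optimization and is hypothetical."""
    B = rng.exponential(scale=1.0, size=V.shape[0])  # mean-one, variance-one weights
    return minimize_over_sieve(lambda alpha: criterion(h, alpha, V, weights=B))
```

Repeating `bootstrap_draw` many times and reading off the empirical distribution of the resulting $\hat\theta_n^*$ draws gives the bootstrap approximation whose validity Theorem 5.2 establishes.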
Appendix B. Maximal Inequality for U-Processes

We collect several results from de la Peña and Giné (1999) (henceforth, DG) to formulate a convenient form of maximal inequality. The notation is independent of the rest of the paper. $\{X_i\}_{i=1}^\infty$ is an i.i.d. sequence of random variables. $\{\varepsilon_i\}_{i=1}^\infty$ is an i.i.d. sequence of Rademacher random variables: $\varepsilon_i$ is either $1$ or $-1$ with equal probability. $\{\varepsilon_i\}$ and $\{X_i\}$ are mutually independent. $N(\varepsilon;\mathcal{F},d)$ is a covering number. $\mathcal{F}$ is a class of functions $f:\mathbb{R}^2\to\mathbb{R}$ such that $E[f(X_1,X_2)|X_2] = E[f(X_1,X_2)|X_1] = 0$. However, $f$ is not necessarily symmetric; that is, $f(x_1,x_2)$ may not equal $f(x_2,x_1)$. Define a random metric: for $f,g\in\mathcal{F}$,

$d_n(f,g) = \Big\{E_\varepsilon\Big[\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_j\{f(X_i,X_j) - g(X_i,X_j)\}\Big]^2\Big\}^{1/2},$

for $E_\varepsilon[\cdot] = E[\cdot\,|X_1,\dots,X_n]$. A short calculation shows

(126) $d_n(f,g) = \Big\{\frac{1}{n^2}\sum_{i\neq j}[f(X_i,X_j) - g(X_i,X_j)]^2\Big\}^{1/2}.$

Theorem B.1. There is a universal constant $K>0$ such that for any $f_0\in\mathcal{F}$,

(127) $E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}f(X_i,X_j)\Big|\Big] \lesssim \{E[f_0(X_i,X_j)^2]\}^{1/2} + K\,E\Big[\int_0^{D_n}\log N(\varepsilon;\mathcal{F},d_n)\,d\varepsilon\Big],$

where $D_n$ is the diameter of $\mathcal{F}$ measured by the random metric $d_n(\cdot,\cdot)$.

Proof. For an arbitrary deterministic sequence $\{x_i\}_{i=1}^n$, $f\mapsto\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf(x_i,x_j)$ is a homogeneous Rademacher chaos process of degree 2 (see p. 110 of DG for the definition). Thus by DG Corollary 5.1.8 (the maximal inequality for Rademacher chaos processes), we obtain (128): for any $f_0\in\mathcal{F}$ and any deterministic sequence $\{x_i\}$, there is a universal constant $K>0$ such that

(128) $\Big\|\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf(x_i,x_j)\Big|\Big\|_{\Psi_1} \le \Big\|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf_0(x_i,x_j)\Big\|_{\Psi_1} + K\int_0^{D_{n,x}}\log N(\varepsilon;\mathcal{F},d_{n,x})\,d\varepsilon,$

where $\|\cdot\|_{\Psi_1}$ is an Orlicz norm (for the definition, see p. 188 of DG); $d_{n,x}(\cdot,\cdot)$ and $D_{n,x}$ equal $d_n(\cdot,\cdot)$ and $D_n$ with $X_i = x_i$. Equations (4.3.3) and (4.3.4) of DG give lower and upper bounds for the Orlicz norm in terms of $L_p$-norms. Employing these bounds, the following inequality is obtained from (128):

(129)-(130) $E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf(x_i,x_j)\Big|\Big] \lesssim \Big\{E\Big[\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf_0(x_i,x_j)\Big]^2\Big\}^{1/2} + K\int_0^{D_{n,x}}\log N(\varepsilon;\mathcal{F},d_{n,x})\,d\varepsilon$
$\qquad = \Big\{\frac{1}{n^2}\sum_{i\neq j}f_0(x_i,x_j)^2\Big\}^{1/2} + K\int_0^{D_{n,x}}\log N(\varepsilon;\mathcal{F},d_{n,x})\,d\varepsilon.$

The deterministic sequence $\{x_i\}$ in (130) is now substituted with the random sequence $\{X_i\}$. To do this, the unconditional expectation on the left of (129) is replaced by the conditional expectation $E_\varepsilon[\cdot]$, and as a result we obtain

(131) $E_\varepsilon\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf(X_i,X_j)\Big|\Big] \lesssim \Big\{\frac{1}{n^2}\sum_{i\neq j}f_0(X_i,X_j)^2\Big\}^{1/2} + K\int_0^{D_n}\log N(\varepsilon;\mathcal{F},d_n)\,d\varepsilon,$

for $D_n$ and $d_n(\cdot,\cdot)$ as defined above. After taking an unconditional expectation on both sides of (131), apply Jensen's inequality. Then (132) follows:

(132) $E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf(X_i,X_j)\Big|\Big] \lesssim \{E[f_0(X_i,X_j)^2]\}^{1/2} + K\,E\Big[\int_0^{D_n}\log N(\varepsilon;\mathcal{F},d_n)\,d\varepsilon\Big].$

Note that

(133) $\frac{1}{n}\sum_{i\neq j}f(X_i,X_j) = \frac{1}{n}\sum_{i\neq j}\frac{1}{2}\{f(X_i,X_j) + f(X_j,X_i)\},$

and $(x_1,x_2)\mapsto\frac{1}{2}\{f(x_1,x_2)+f(x_2,x_1)\}$ is a symmetric kernel. As such, by the randomization theorem (DG Theorem 3.5.3) and (133),

(134) $E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}f(X_i,X_j)\Big|\Big] \lesssim E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_j\frac{1}{2}\{f(X_i,X_j)+f(X_j,X_i)\}\Big|\Big] \le E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\varepsilon_i\varepsilon_jf(X_i,X_j)\Big|\Big].$

By (132) and (134), we conclude.

Theorem B.2. Suppose $\mathcal{F}$ is as described above. Then for some $K>0$,

(135) $E\Big[\sup_{f,g\in\mathcal{F}}\Big|\frac{1}{n}\sum_{i\neq j}\{f(X_i,X_j) - g(X_i,X_j)\}\Big|\Big] \le K\,E\Big[\int_0^{D_n}\log N(\varepsilon;\mathcal{F},d_n)\,d\varepsilon\Big],$

where $D_n$ is the random diameter of $\mathcal{F}$ measured by $d_n(f,g)$.

Proof. The same argument as in the above theorem repeats; apply the second inequality of DG Corollary 5.1.8 instead of the first.

References

Abrevaya, J. (2003): "Pairwise-difference rank estimation of the transformation model," Journal of Business & Economic Statistics, 21(3).

Arcones, M. A., and E. Giné (1993): "Limit theorems for U-processes," Annals of Probability, pp. 1494-1542.

Box, G. E., and D. R. Cox (1964): "An analysis of transformations," Journal of the Royal Statistical Society, Series B (Methodological), pp. 211-252.

Cavanagh, C., and R. P. Sherman (1998): "Rank estimators for monotonic index models," Journal of Econometrics, 84(2), 351-381.

Chen, S. (2002): "Rank estimation of transformation models," Econometrica, 70(4), 1683-1697.
Chen, X. (2007): "Large sample sieve estimation of semi-nonparametric models," Handbook of Econometrics, 6, 5549-5632.

Chen, X., and D. Pouzo (2009): "Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals," Journal of Econometrics, 152(1), 46-60.

Chiappori, P.-A., I. Komunjer, and D. Kristensen (2013): "Nonparametric identification and estimation of transformation models," Discussion paper.

de la Peña, V., and E. Giné (1999): Decoupling: From Dependence to Independence. Springer-Verlag.

Ding, Y., and B. Nan (2011): "A sieve M-theorem for bundled parameters in semiparametric models, with application to the efficient estimation in a linear model for censored data," Annals of Statistics, 39(6), 3032-3061.

Eckstein, Z., and G. J. van den Berg (2007): "Empirical labor search: A survey," Journal of Econometrics, 136(2), 531-564.

Ekeland, I., J. J. Heckman, and L. Nesheim (2002): "Identifying hedonic models," American Economic Review, 92(2), 304-309.

Ekeland, I., J. J. Heckman, and L. Nesheim (2004): "Identification and estimation of hedonic models," Journal of Political Economy, 112(1), 60.

Farber, H. S. (1999): "Mobility and stability: the dynamics of job change in labor markets," Handbook of Labor Economics, 3, 2439-2483.

Gallant, A. R., and D. W. Nychka (1987): "Semi-nonparametric maximum likelihood estimation," Econometrica, pp. 363-390.

Han, A. K. (1987): "Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator," Journal of Econometrics, 35(2), 303-316.

Horowitz, J. L. (1996): "Semiparametric estimation of a regression model with an unknown transformation of the dependent variable," Econometrica, pp. 103-137.

Horowitz, J. L. (1998): Semiparametric Methods in Econometrics, Lecture Notes in Statistics. Springer-Verlag.

Ichimura, H. (1993): "Semiparametric least squares (SLS) and weighted SLS estimation of single-index models," Journal of Econometrics, 58(1), 71-120.

Ichimura, H., and P. E. Todd (2007): "Implementing nonparametric and semiparametric estimators," Handbook of Econometrics, 6, 5369-5468.

Khan, S., Y. Shin, and E. Tamer (2011): "Heteroscedastic transformation models with covariate dependent censoring," Journal of Business & Economic Statistics, 29(1).

Khan, S., and E. Tamer (2007): "Partial rank estimation of duration models with general forms of censoring," Journal of Econometrics, 136(1), 251-280.

Kiefer, N. M. (1988): "Economic duration data and hazard functions," Journal of Economic Literature, 26(2), 646-679.

Klein, R. W., and R. P. Sherman (2002): "Shift restrictions and semiparametric estimation in ordered response models," Econometrica, 70(2), 663-691.

Linton, O., S. Sperlich, and I. Van Keilegom (2008): "Estimation of a semiparametric transformation model," Annals of Statistics, 36(2), 686-718.

Ma, S., and M. R. Kosorok (2005): "Robust semiparametric M-estimation and the weighted bootstrap," Journal of Multivariate Analysis, 96(1), 190-217.

Matzkin, R. L. (2007): "Nonparametric identification," Handbook of Econometrics, 6, 5307-5368.

Meyer, B. D. (1996): "What have we learned from the Illinois reemployment bonus experiment?," Journal of Labor Economics, pp. 26-51.

Mortensen, D. T., and C. A. Pissarides (1999): "New developments in models of search in the labor market," Handbook of Labor Economics, 3, 2567-2627.

Ramsay, J. (1988): "Monotone regression splines in action," Statistical Science, pp. 425-441.
Ridder, G. (1990): "The non-parametric identification of generalized accelerated failure-time models," Review of Economic Studies, 57(2), 167-181.

Rogerson, R., R. Shimer, and R. Wright (2005): "Search-theoretic models of the labor market: a survey," Journal of Economic Literature, pp. 959-988.

Santos, A. (2011): "Semiparametric estimation of invertible models," Discussion paper.

Santos, A. (2012): "Inference in nonparametric instrumental variables with partial identification," Econometrica, 80(1), 213-275.

Shen, X. (1997): "On methods of sieves and penalization," Annals of Statistics, 25(6), 2555-2591.

Shen, X., and W. H. Wong (1994): "Convergence rate of sieve estimates," Annals of Statistics, pp. 580-615.

Sherman, R. P. (1993): "The limiting distribution of the maximum rank correlation estimator," Econometrica, pp. 123-137.

Sherman, R. P. (1994): "Maximal inequalities for degenerate U-processes with applications to optimization estimators," Annals of Statistics, pp. 439-459.

van den Berg, G. J. (2001): "Duration models: specification, identification and multiple durations," Handbook of Econometrics, 5, 3381-3460.

van den Berg, G. J., and G. Ridder (1998): "An empirical equilibrium search model of the labor market," Econometrica, pp. 1183-1221.

van der Vaart, A. W. (2000): Asymptotic Statistics, vol. 3. Cambridge University Press.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer-Verlag.

Ye, J., and N. Duan (1997): "Nonparametric $n^{-1/2}$-consistent estimation for the general transformation models," Annals of Statistics, 25(6), 2682-2717.