Massachusetts Institute of Technology
Department of Economics Working Paper Series

CONDITIONAL QUANTILE PROCESSES BASED ON SERIES OR MANY REGRESSORS

Alexandre Belloni, Victor Chernozhukov, and Iván Fernández-Val

Working Paper 11-15
June 1, 2011

Room E52-251, 50 Memorial Drive, Cambridge, MA 02142

This paper can be downloaded without charge from the Social Science Research Network Paper Collection at http://ssrn.com/abstract=1908413

arXiv:1105.6154v1 [stat.ME] 31 May 2011

Abstract. Quantile regression (QR) is a principal regression method for analyzing the impact of covariates on outcomes. The impact is described by the conditional quantile function and its functionals. In this paper we develop the nonparametric QR series framework, covering many regressors as a special case, for performing inference on the entire conditional quantile function and its linear functionals. In this framework, we approximate the entire conditional quantile function by a linear combination of series terms with quantile-specific coefficients and estimate the function-valued coefficients from the data. We develop large sample theory for the empirical QR coefficient process; namely, we obtain uniform strong approximations to the empirical QR coefficient process by conditionally pivotal and Gaussian processes, as well as by gradient and weighted bootstrap processes. We apply these results to obtain estimation and inference methods for linear functionals of the conditional quantile function, such as the conditional quantile function itself, its partial derivatives, average partial derivatives, and conditional average partial derivatives. Specifically, we obtain uniform rates of convergence, large sample distributions, and inference methods based on strong pivotal and Gaussian approximations and on gradient and weighted bootstraps. All of the above results are for function-valued parameters, holding uniformly in both the quantile index and the covariate value, and covering the pointwise case as a by-product. If the function of interest is monotone, we show how to use monotonization procedures to improve estimation and inference. We demonstrate the practical utility of these results with an empirical example, where we estimate the price elasticity function of the individual demand for gasoline, as indexed by the individual unobserved propensity for gasoline consumption.

Keywords. Quantile regression series processes, uniform inference.
JEL Subject Classification. C12, C13, C14.
AMS Subject Classification. 62G05, 62G15, 62G32.

Date: first version May 2006; this version June 1, 2011. The main results of this paper, particularly the pivotal method for inference based on the entire quantile regression process, were first presented at the NAWM Econometric Society, New Orleans, January 2008, and at Stats in the Chateau in September 2009. We are grateful to Arun Chandraksekhar, Denis Chetverikov, Ye Luo, Denis Tkachenko, and Sami Stouli for careful readings of several versions of the paper. We thank Andrew Chesher, Roger Koenker, Oliver Linton, Tatiana Komarova, and seminar participants at the Econometric Society meeting, CEMMAP master-class, Stats in the Chateau, BU, Duke, and MIT for many useful suggestions. We gratefully acknowledge research support from the NSF.
1. Introduction

Quantile regression (QR) is a principal regression method for analyzing the impact of covariates on outcomes, particularly when the impact might be heterogeneous. This impact is characterized by the conditional quantile function and its functionals [3, 7, 24]. For example, we can model the log of the individual demand for some good, $Y$, as a function of the price of the good, the income of the individual, and other observed individual characteristics $X$, and an unobserved preference $U$ for consuming the good, as $Y = Q(X, U)$, where the function $Q$ is strictly increasing in the unobservable $U$. With the normalization that $U \sim \text{Uniform}(0,1)$ and the assumption that $U$ and $X$ are independent, the function $Q(X, u)$ is the $u$-th conditional quantile of $Y$ given $X$, i.e. $Q(X, u) = Q_{Y|X}(u|X)$. This function can be used for policy analysis. For example, we can determine how changes in taxes for the good could impact demand heterogeneously across individuals.

In this paper we develop the nonparametric QR series framework for performing inference on the entire conditional quantile function and its linear functionals. In this framework, we approximate the entire conditional quantile function $Q_{Y|X}(u|x)$ by a linear combination of series terms, $Z(x)'\beta(u)$. The vector $Z(x)$ includes transformations of $x$ that have good approximation properties, such as powers, trigonometric terms, local polynomials, or B-splines. The function $u \mapsto \beta(u)$ contains quantile-specific coefficients that can be estimated from the data using the QR estimator of Koenker and Bassett [25]. As the number of series terms grows, the approximation error $Q_{Y|X}(u|x) - Z(x)'\beta(u)$ decreases, approaching zero in the limit. By controlling the growth of the number of terms, we can obtain consistent estimators and perform inference on the entire conditional quantile function and its linear functionals. The QR series framework also covers as a special case the so-called many regressors model, which is motivated by many new types of data that emerge in the new information age, such as scanner and online shopping data.

We now describe the main results in more detail. Let $\hat{\beta}(\cdot)$ denote the QR estimator of $\beta(\cdot)$. The first set of results provides large-sample theory for the empirical QR coefficient process of increasing dimension, $\sqrt{n}(\hat{\beta}(\cdot) - \beta(\cdot))$. We obtain uniform strong approximations to this process by a sequence of the following stochastic processes of increasing dimension: (i) a conditionally pivotal process, (ii) a gradient bootstrap process, (iii) a Gaussian process, and (iv) a weighted bootstrap process. To the best of our knowledge, all of the above results are new. The existence of the pivotal approximation emerges from the special nature of QR, where a (sub)gradient of the sample objective function evaluated at the truth is pivotal conditional on the regressors. This allows us to perform high-quality inference without even resorting to Gaussian approximations. We also show that the gradient bootstrap, introduced by Parzen, Wei and Ying [30] in the parametric context, is effectively a means of carrying out the conditionally pivotal approximation without explicitly estimating Jacobian matrices. The conditions for validity of these two schemes require only a mild restriction on the growth of the number of series terms in relation to the sample size. We also obtain a Gaussian approximation to the entire distribution of the QR process of increasing dimension by using chaining arguments and Yurinskii's coupling.
Moreover, we show that the weighted bootstrap works to approximate the distribution of the QR process for the same reason as the Gaussian approximation. The conditions for validity of the Gaussian and weighted bootstrap approximations, however, appear to be substantively stronger than for the pivotal and gradient bootstrap approximations.

The second set of results provides estimation and inference methods for linear functionals of the conditional quantile function, including
(i) the conditional quantile function itself, $(u, x) \mapsto Q_{Y|X}(u|x)$,
(ii) the partial derivative function, $(u, x) \mapsto \partial_{x_k} Q_{Y|X}(u|x)$,
(iii) the average partial derivative function, $u \mapsto \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x)$, and
(iv) the conditional average partial derivative, $(u, x_k) \mapsto \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x|x_k)$,
where $\mu$ is a given measure and $x_k$ is the $k$-th component of $x$. Specifically, we derive uniform rates of convergence, large sample distributions, and inference methods based on the strong pivotal and Gaussian approximations and on the gradient and weighted bootstraps. It is noteworthy that all of the above results apply to function-valued parameters, holding uniformly in both the quantile index and the covariate value, and covering pointwise normality and rate results as a special case. If the function of interest is monotone, we show how to use monotonization procedures to improve estimation and inference.

The paper contributes to and builds on the existing important literature on conditional quantile estimation. First and foremost, we build on the work of He and Shao [22], which studied the many regressors model and gave pointwise limit theorems for the QR estimator in the case of a single quantile index. We go beyond the many regressors model to the series model and develop large sample estimation and inference results for the entire QR process. We also develop analogous estimation and inference results for the conditional quantile function and its linear functionals, such as derivatives, average derivatives, conditional average derivatives, and others. None of these results were available in the previous work. We also build on Lee [28], which studied QR estimation of partially linear models in the series framework for a single quantile index, and on Horowitz and Lee [23], which studied nonparametric QR estimation of additive quantile models for a single quantile index in a series framework. Our framework covers these partially linear models and additive models as important special cases, and allows us to perform inference on a considerably richer set of functionals, uniformly across covariate values and a continuum of quantile indices. Other very important work includes Chaudhuri [10], Chaudhuri, Doksum and Samarov [11], Härdle, Ritov, and Song [20], Cattaneo, Crump, and Jansson [8], and Kong, Linton, and Xia [27], among others, but this work focused on local, non-series, methods. Our work also relies on the series literature, at least in a motivational and conceptual sense. In particular, we rely on the work of Stone [35], Andrews [2], Newey [29], Chen and Shen [13], Chen [12], and others that rigorously motivated the series framework as an approximation scheme and gave pointwise normality results for least squares estimators, and on Chen [12] and van de Geer [36], which gave (non-uniform) consistency and rate results for general series estimators, including quantile regression for the case of a single quantile index.
White [38] established non-uniform consistency of nonparametric estimators of the conditional quantile function based on a nonlinear series approximation using artificial neural networks. In contrast to the previous results, our rate results are uniform in covariate values and quantile indices, and cover both the quantile function and its functionals. Moreover, we not only provide estimation rate results, but also derive a full set of results on feasible inference based on the entire quantile regression process.

While relying on previous work for motivation, our results require us to develop both new proof techniques and new approaches to inference. In particular, our proof techniques rely on new maximal inequalities for function classes with growing moments and uniform entropy. One of our inference approaches involves an approximation to the entire conditional quantile process by a conditionally pivotal process, which is not Donsker in general, but can be used for high-quality inference. The utility of this new technique is particularly apparent in our high-dimensional setting. Under stronger conditions, we also establish an asymptotically valid approximation to the quantile regression process by Gaussian processes using Yurinskii's coupling. Previously, [15] used Yurinskii's coupling to obtain a strong approximation to the least squares series estimator. The use of this technique in our context is new and much more involved, because we approximate an entire empirical QR process of increasing dimension, instead of a vector of increasing dimension, by a Gaussian process. Finally, it is noteworthy that our uniform inference results on functionals, where uniformity is over covariate values, do not even have analogs in the least squares series literature (the extension of our results to least squares is a subject of ongoing research, [5]).

This paper does not deal with sparse models, where there are some key series terms and many "non-key" series terms which ideally should be omitted from estimation. In these settings, the goal is to find and indeed remove most of the "non-key" series terms before proceeding with estimation. [6] obtained rate results for quantile regression estimators in this case, but did not provide inference results. Even though our paper does not explicitly deal with inference in sparse models after model selection, the methods and bounds provided herein are useful for analyzing this problem. Investigating this matter rigorously is a challenging issue, since it needs to take into account the model selection mistakes in estimation, and is beyond the scope of the present paper; however, it is a subject of our ongoing research.

Plan of the paper. The rest of the paper is organized as follows. In Section 2, we describe the nonparametric QR series model and estimators. In Section 3, we derive asymptotic theory for the series QR process. In Section 4, we give estimation and inference theory for linear functionals of the conditional quantile function and show how to improve estimation and inference by imposing monotonicity restrictions. In Section 5, we present an empirical application to the demand for gasoline and a computational experiment calibrated to the application. The computational algorithms to implement our inference methods and the proofs of the main results are collected in the Appendices.

Notation. In what follows, $S^{m-1}$ denotes the unit sphere in $\mathbb{R}^m$. For $x \in \mathbb{R}^m$, we define the Euclidean norm as $\|x\| := \sup_{\alpha \in S^{m-1}} |\alpha' x|$.
For a set $I$, $\text{diam}(I) = \sup_{v, \bar{v} \in I} \|v - \bar{v}\|$ denotes the diameter of $I$, and $\text{int}(I)$ denotes the interior of $I$. For any two real numbers $a$ and $b$, $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. Calligraphic letters are used to denote the support of interest of a random variable or vector. For example, $\mathcal{U} \subset (0,1)$ is the support of $U$, $\mathcal{X} \subset \mathbb{R}^d$ is the support of $X$, and $\mathcal{Z} = \{Z(x) \in \mathbb{R}^m : x \in \mathcal{X}\}$ is the support of $Z = Z(X)$. The relation $a_n \lesssim b_n$ means that $a_n \le C b_n$ for a constant $C$ and for all $n$ large enough. We denote by $P^*$ the probability measure induced by conditioning on any realization of the data $\mathcal{D}_n := \{(Y_i, X_i) : 1 \le i \le n\}$. We say that a random variable $\Delta_n = o_{P^*}(1)$ in $P$-probability if for any positive numbers $\epsilon > 0$ and $\eta > 0$, $P\{P^*(|\Delta_n| > \epsilon) > \eta\} = o(1)$, or equivalently, $P^*(|\Delta_n| > \epsilon) = o_P(1)$. We typically shall omit the qualifier "in $P$-probability." The operator $E$ denotes the expectation with respect to the probability measure $P$, $E_n$ denotes the expectation with respect to the empirical measure, and $\mathbb{G}_n$ denotes $\sqrt{n}(E_n - E)$.

2. Series Framework

2.1. The set-up. The set-up corresponds to the nonparametric QR series framework:
$$Y = Q_{Y|X}(U|X) = Z'\beta(U) + R(X, U), \quad U|X \sim \text{Uniform}(0,1), \quad \beta(u) \in \mathbb{R}^m,$$
where $X$ is a $d$-vector of elementary regressors, and $Z = Z(X)$ is an $m$-vector of approximating functions formed by transformations of $X$, with the first component of $Z$ equal to one. The function $Z'\beta(u)$ is the series approximation to $Q_{Y|X}(u|X)$ and is linear in the parameter vector $\beta(u)$, which is defined below. The term $R(X, u)$ is the approximation error, with the special case of $R(X, u) = 0$ corresponding to the many regressors model. We allow the series terms $Z(X)$ to change with the sample size $n$, i.e. $Z(X) = Z_n(X)$, but we shall omit the explicit index indicating the dependence on $n$. We refer the reader to Newey [29] and Chen [12] for examples of series functions, including (i) regression splines, (ii) polynomials, (iii) trigonometric series, (iv) compactly supported wavelets, and others. Interestingly, in the latter scheme, the entire collection of series terms is dependent upon the sample size.

For each quantile index $u \in (0,1)$, the population coefficient $\beta(u)$ is defined as a solution to
$$\min_{\beta \in \mathbb{R}^m} E[\rho_u(Y - Z'\beta)], \qquad (2.1)$$
where $\rho_u(z) = (u - 1\{z < 0\})z$ is the check function ([24]). Thus, this coefficient is not a solution to an approximation problem but to a prediction problem. However, we show in Lemma 1 in Appendix B that the solution to (2.1) inherits important approximation properties from the best least squares approximation of $Q_{Y|X}(u|\cdot)$ by a linear combination of $Z(\cdot)$.

We consider estimation of the coefficient function $u \mapsto \beta(u)$ using the entire quantile regression process over $\mathcal{U}$, a compact subset of $(0,1)$,
$$\{u \mapsto \hat{\beta}(u),\ u \in \mathcal{U}\},$$
namely, for each $u \in \mathcal{U}$, the estimator $\hat{\beta}(u)$ is a solution to the empirical analog of the population problem (2.1),
$$\min_{\beta \in \mathbb{R}^m} E_n[\rho_u(Y_i - Z_i'\beta)]. \qquad (2.2)$$
We are also interested in various functionals of $Q_{Y|X}(\cdot|x)$. We estimate them by the corresponding functionals of $Z(x)'\hat{\beta}(\cdot)$, such as the estimator of the entire conditional quantile function, $(x, u) \mapsto Z(x)'\hat{\beta}(u)$, and derivatives of this function with respect to the elementary covariates $x$.
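For concreteness, the following R sketch computes the series QR estimates $\hat{\beta}(u)$ over a grid of quantile indices and the implied conditional quantile function at a point. It uses the quantreg package cited in Section 5 and a cubic B-spline basis; the simulated data and all variable names are illustrative choices of ours rather than part of the formal development.

```r
library(quantreg)  # QR estimation (Koenker and Bassett)
library(splines)   # B-spline series terms

## illustrative simulated data (not from the paper)
set.seed(1)
n <- 1000
x <- runif(n)
y <- sin(2 * pi * x) + x + (0.5 + 0.5 * x) * rnorm(n)

## series terms Z(x): an intercept plus a cubic B-spline basis in x
B <- bs(x, df = 5)
Z <- cbind(1, B)

## estimate beta(u) over a grid of quantile indices in U
u.grid   <- seq(0.10, 0.90, by = 0.05)
beta.hat <- sapply(u.grid, function(u) rq.fit(Z, y, tau = u)$coefficients)

## estimated conditional quantile function u -> Z(x0)' beta.hat(u) at a point x0
x0    <- 0.5
Z0    <- c(1, predict(B, x0))
Q.hat <- drop(Z0 %*% beta.hat)
```

Each column of beta.hat estimates $\beta(u)$ for one quantile index, so estimators of linear functionals of the conditional quantile function can be formed by applying the corresponding loading vector to these columns.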
2.2. Regularity Conditions. In our framework the entire model can change with $n$, although we shall omit the explicit indexing by $n$. We will use the following primitive assumptions on the data generating process, as $n \to \infty$ and $m = m(n) \to \infty$.

Condition S.
S.1 The data $\mathcal{D}_n = \{(Y_i, X_i')',\ 1 \le i \le n\}$ are an i.i.d. sequence of real $(1+d)$-vectors, and $Z_i = Z(X_i)$ is a real $m$-vector for $i = 1, \ldots, n$.
S.2 The conditional density of the response variable $f_{Y|X}(y|x)$ is bounded above by $\bar{f}$ and its derivative in $y$ is bounded above by $\bar{f}'$, uniformly in the arguments $y$ and $x \in \mathcal{X}$ and in $n$; moreover, $f_{Y|X}(Q_{Y|X}(u|x)|x)$ is bounded away from zero uniformly for all arguments $u \in \mathcal{U}$, $x \in \mathcal{X}$, and $n$.
S.3 For every $m$, the eigenvalues of the Gram matrix $\Sigma_m = E[ZZ']$ are bounded from above and away from zero, uniformly in $n$.
S.4 The norm of the series terms obeys $\max_{i \le n} \|Z_i\| \le \zeta(m, d, n) =: \zeta_m$.
S.5 The approximation error term $R(X, U)$ is such that $\sup_{x \in \mathcal{X}, u \in \mathcal{U}} |R(x, u)| \lesssim m^{-\kappa}$.

Comment 2.1. Condition S is a simple set of sufficient conditions. Our proofs and lemmas gathered in the appendices hold under more general, but less primitive, conditions. Condition S.2 imposes mild smoothness assumptions on the conditional density function. Conditions S.3 and S.4 impose plausible conditions on the design; see Newey [29]. Conditions S.2 and S.3 together imply that the eigenvalues of the Jacobian matrix $J_m(u) = E[f_{Y|X}(Q_{Y|X}(u|X)|X)ZZ']$ are bounded away from zero and from above; Lemma 12 further shows that this together with S.5 implies that the eigenvalues of $\tilde{J}_m(u) = E[f_{Y|X}(Z'\beta(u)|X)ZZ']$ are bounded away from zero and from above, since the two matrices are uniformly close. Assumption S.4 imposes a uniform bound on the norm of the transformed vector, which can grow with the sample size; for example, we have $\zeta_m = \sqrt{m}$ for splines and $\zeta_m = m$ for polynomials. Assumption S.5 introduces a bound on the approximation error and is the only non-primitive assumption. Deriving sharp bounds on the sup norm of the approximation error is a delicate subject of approximation theory even for least squares, with the exception of local polynomials; see, e.g., [5] for a discussion in the least squares case. However, in Lemma 1 in Appendix B, we characterize the nature of the approximation by establishing that the vector $\beta(u)$ solving (2.1) is equivalent to the least squares approximation to $Q_{Y|X}(u|x)$ in terms of the order of the $L^2$ error. We also deduce a simple upper bound on the sup norm of the approximation error. In applications, we recommend analyzing the size of the approximation error numerically, by considering examples of plausible functions $Q_{Y|X}(u|x)$ and comparing them to the best linear approximation schemes, as this directly leads to sharp results. We give an example of how to implement this approach in the empirical application.
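As a concrete version of this recommendation, one can take a plausible candidate for $Q_{Y|X}(u|\cdot)$, compute its best least squares approximation by the span of $Z(\cdot)$ over a fine grid on the support, and report the resulting $L^2$ and sup-norm errors. A minimal R sketch follows; the candidate function and the basis dimension are illustrative choices of ours, not quantities from the paper.

```r
library(splines)

## a plausible candidate for the conditional quantile function at a fixed u
g.u <- function(x) sin(2 * pi * x) + qnorm(0.75) * (0.5 + 0.5 * x)

## fine grid on the support and series terms Z(x)
x.grid <- seq(0, 1, length.out = 10000)
Z <- cbind(1, bs(x.grid, df = 8))

## best least squares (L2) approximation of g.u by the span of Z
beta.ls    <- lm.fit(Z, g.u(x.grid))$coefficients
approx.err <- g.u(x.grid) - drop(Z %*% beta.ls)

max(abs(approx.err))      # numerical proxy for the sup-norm error in S.5
sqrt(mean(approx.err^2))  # numerical proxy for the L2 error
```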
3. Asymptotic Theory for QR Coefficient Processes based on Series

3.1. Uniform Convergence Rates for Series QR Coefficients. The first main result is a uniform rate of convergence for the series estimators of the QR coefficients.

Theorem 1 (Uniform Convergence Rate for Series QR Coefficients). Under Condition S, and provided that $\zeta_m^2 m \log n = o(n)$,
$$\sup_{u \in \mathcal{U}} \|\hat{\beta}(u) - \beta(u)\| \lesssim_P \sqrt{\frac{m \log n}{n}}.$$

Thus, up to a logarithmic factor, the uniform convergence over $\mathcal{U}$ is achieved at the same rate as for a single quantile index ([22]). The proof of this theorem relies on new concentration inequalities that control the behavior of the empirical eigenvalues of the design matrix.

3.2. Uniform Strong Approximations to the Series QR Process. The second main result is the approximation of the QR process by a linear, conditionally pivotal process.

Theorem 2 (Strong Approximations to the QR Process by a Pivotal Coupling). Under Condition S, $m^3 \zeta_m^2 \log^7 n = o(n)$, and $m^{-\kappa+1} \log^3 n = o(1)$, the QR process is uniformly close to a conditionally pivotal process, namely
$$\sqrt{n}(\hat{\beta}(u) - \beta(u)) = J_m^{-1}(u)\,\mathbb{U}_n(u) + r_n(u), \quad \text{where} \quad \mathbb{U}_n(u) := \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i(u - 1\{U_i \le u\}), \qquad (3.3)$$
where $U_1, \ldots, U_n$ are i.i.d. Uniform(0,1), independently distributed of $Z_1, \ldots, Z_n$, and
$$\sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \frac{m^{3/4} \zeta_m^{1/2} \log^{3/4} n}{n^{1/4}} + \sqrt{m^{1-\kappa} \log n} = o(1/\log n).$$

The theorem establishes that the QR process is approximately equal to the process $J_m^{-1}(\cdot)\mathbb{U}_n(\cdot)$, which is pivotal conditional on $Z_1, \ldots, Z_n$. This is quite useful, since we can simulate $\mathbb{U}_n(\cdot)$ in a straightforward fashion conditional on $Z_1, \ldots, Z_n$, and we can estimate the matrices $J_m(\cdot)$ using Powell's method [32]. Thus, we can perform inference based on the entire empirical QR process in series regression or many regressors contexts. The following theorem establishes the formal validity of this approach, which we call the pivotal method.

Theorem 3 (Pivotal Method). Let $\hat{J}_m(u)$ denote the estimator of $J_m(u)$ defined in (3.7), with bandwidth $h_n$ obeying $h_n = o(1)$ and $h_n \sqrt{m} \log^{3/2} n = o(1)$. Under Condition S, $m^{-\kappa+1/2} \log^{3/2} n = o(1)$, and $\zeta_m^2 m^2 \log^4 n = o(nh_n)$, the feasible pivotal process $\hat{J}_m^{-1}(\cdot)\mathbb{U}_n^*(\cdot)$ correctly approximates a copy $J_m^{-1}(\cdot)\mathbb{U}_n^*(\cdot)$ of the pivotal process defined in Theorem 2:
$$\hat{J}_m^{-1}(u)\mathbb{U}_n^*(u) = J_m^{-1}(u)\mathbb{U}_n^*(u) + r_n(u), \quad \text{where} \quad \mathbb{U}_n^*(u) := \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i(u - 1\{U_i^* \le u\}), \qquad (3.4)$$
$U_1^*, \ldots, U_n^*$ are i.i.d. Uniform(0,1), independently distributed of $Z_1, \ldots, Z_n$, and
$$\sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \sqrt{\frac{\zeta_m^2 m^2 \log^2 n}{n h_n}} + m^{-\kappa+1/2}\sqrt{\log n} + h_n\sqrt{m \log n} = o(1/\log n).$$

This method is closely related to another approach to inference, which we refer to here as the gradient bootstrap method. This approach was previously introduced by Parzen, Wei and Ying [30] for parametric models with fixed dimension. We extend it to a considerably more general nonparametric series framework. The main idea is to generate draws $\hat{\beta}^*(\cdot)$ of the QR process as solutions to QR problems with gradients perturbed by a pivotal quantity $\mathbb{U}_n^*(\cdot)/\sqrt{n}$. In particular, let us define a gradient bootstrap draw $\hat{\beta}^*(u)$ as the solution to the problem
$$\min_{\beta \in \mathbb{R}^m} E_n[\rho_u(Y_i - Z_i'\beta)] - \mathbb{U}_n^*(u)'\beta/\sqrt{n}, \qquad (3.5)$$
for each $u \in \mathcal{U}$, where $\mathbb{U}_n^*(\cdot)$ is defined in (3.4). The problem is solved many times for independent draws of $\mathbb{U}_n^*(\cdot)$, and the distribution of $\sqrt{n}(\hat{\beta}(\cdot) - \beta(\cdot))$ is approximated by the empirical distribution of the bootstrap draws of $\sqrt{n}(\hat{\beta}^*(\cdot) - \hat{\beta}(\cdot))$.

Theorem 4 (Gradient Bootstrap Method). Under Condition S, $\zeta_m^2 m^3 \log^7 n = o(n)$, and $m^{-\kappa+1/2} \log^{3/2} n = o(1)$, the gradient bootstrap process correctly approximates a copy $J_m^{-1}(\cdot)\mathbb{U}_n^*(\cdot)$ of the pivotal process defined in Theorem 2,
$$\sqrt{n}(\hat{\beta}^*(u) - \hat{\beta}(u)) = J_m^{-1}(u)\mathbb{U}_n^*(u) + r_n(u),$$
where $\mathbb{U}_n^*(u)$ is defined in (3.4) and
$$\sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \frac{m^{3/4} \zeta_m^{1/2} \log^{3/4} n}{n^{1/4}} + m^{-\kappa}\sqrt{m \log n} = o(1/\log n).$$
The stated bound continues to hold in $P$-probability if we replace the unconditional probability $P$ by the conditional probability $P^*$.

The main advantage of this method relative to the pivotal method is that it does not require estimating $J_m(\cdot)$ explicitly. The disadvantage is that in large problems the gradient bootstrap can be computationally much more expensive.

Next we turn to a strong approximation based on a sequence of Gaussian processes.
Theorem 5 (Strong Approximation to the QR Process by a Gaussian Coupling). Under conditions S.1-S.4 and $m^7 \zeta_m^6 \log^{22} n = o(n)$, there exists a sequence of zero-mean Gaussian processes $G_n(\cdot)$ with a.s. continuous paths that has the same covariance functions as the pivotal process $\mathbb{U}_n(\cdot)$ in (3.4) conditional on $Z_1, \ldots, Z_n$, namely,
$$E[G_n(u)G_n(u')'] = E[\mathbb{U}_n(u)\mathbb{U}_n(u')'] = E_n[Z_i Z_i'](u \wedge u' - uu'),$$
for all $u$ and $u' \in \mathcal{U}$. Also, $G_n(\cdot)$ approximates the empirical process $\mathbb{U}_n(\cdot)$, namely,
$$\sup_{u \in \mathcal{U}} \|\mathbb{U}_n(u) - G_n(u)\| \lesssim_P o(1/\log n).$$
Consequently, if in addition S.5 holds with $m^{-\kappa+1} \log^3 n = o(1)$,
$$\sup_{u \in \mathcal{U}} \|\sqrt{n}(\hat{\beta}(u) - \beta(u)) - J_m^{-1}(u)G_n(u)\| \lesssim_P o(1/\log n).$$
Moreover, under the conditions of Theorem 3, the feasible Gaussian process $\hat{J}_m^{-1}(\cdot)G_n(\cdot)$ correctly approximates $J_m^{-1}(\cdot)G_n(\cdot)$:
$$\hat{J}_m^{-1}(u)G_n(u) = J_m^{-1}(u)G_n(u) + r_n(u),$$
where
$$\sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \sqrt{\frac{\zeta_m^2 m^2 \log^2 n}{n h_n}} + m^{-\kappa+1/2}\sqrt{\log n} + h_n\sqrt{m \log n} = o(1/\log n).$$

The requirement on the growth of the dimension $m$ in Theorem 5 is a consequence of a step in the proof that relies upon Yurinskii's coupling. Therefore, improving that step through the use of another coupling could lead to significant improvements in such requirements. Note, however, that the pivotal approximation has a much weaker growth restriction, and it is also clear that this approximation should be more accurate than any further approximation of the pivotal approximation by a Gaussian process.

Another related inference method is the weighted bootstrap for the entire QR process. Consider a set of weights $h_1, \ldots, h_n$ that are i.i.d. draws from the standard exponential distribution. For each draw of such weights, define the weighted bootstrap draw of the QR process as a solution to the QR problem weighted by $h_1, \ldots, h_n$:
$$\hat{\beta}^b(u) \in \arg\min_{\beta \in \mathbb{R}^m} E_n[h_i \rho_u(Y_i - Z_i'\beta)], \quad \text{for } u \in \mathcal{U}.$$
The following theorem establishes that the weighted bootstrap distribution is valid for approximating the distribution of the QR process.

Theorem 6 (Weighted Bootstrap Method). (1) Under Condition S, $\zeta_m^2 m^3 \log^9 n = o(n)$, and $m^{-\kappa+1} \log^3 n = o(1)$, the weighted bootstrap process satisfies
$$\sqrt{n}(\hat{\beta}^b(u) - \hat{\beta}(u)) = \frac{J_m^{-1}(u)}{\sqrt{n}} \sum_{i=1}^n (h_i - 1) Z_i (u - 1\{U_i \le u\}) + r_n(u),$$
where $U_1, \ldots, U_n$ are i.i.d. Uniform(0,1), independently distributed of $Z_1, \ldots, Z_n$, and
$$\sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \frac{m^{3/4} \zeta_m^{1/2} \log^{5/4} n}{n^{1/4}} + \sqrt{m^{1-\kappa} \log n} = o(1/\log n).$$
The bound continues to hold in $P$-probability if we replace the unconditional probability $P$ by $P^*$.
(2) Furthermore, under the conditions of Theorem 5, the weighted bootstrap process approximates the Gaussian process $J_m^{-1}(\cdot)G_n(\cdot)$ defined in Theorem 5, that is,
$$\sup_{u \in \mathcal{U}} \|\sqrt{n}(\hat{\beta}^b(u) - \hat{\beta}(u)) - J_m^{-1}(u)G_n(u)\| \lesssim_P o(1/\log n).$$

In comparison with the pivotal and gradient bootstrap methods, the Gaussian and weighted bootstrap methods require stronger assumptions.

3.3. Estimation of Matrices. In order to implement some of the inference methods, we need uniformly consistent estimators of the Gram and Jacobian matrices. The natural candidates are
$$\hat{\Sigma}_m = E_n[Z_i Z_i'], \qquad (3.6)$$
$$\hat{J}_m(u) = \frac{1}{2h_n} E_n[1\{|Y_i - Z_i'\hat{\beta}(u)| \le h_n\} \cdot Z_i Z_i'], \qquad (3.7)$$
where $h_n$ is a bandwidth parameter such that $h_n \to 0$, and $u \in \mathcal{U}$. The following result establishes uniform consistency of these estimators and provides an appropriate rate for the bandwidth $h_n$, which depends on the growth of the model.
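In R, the estimators (3.6) and (3.7) take only a few lines. The sketch below assumes that the $n \times m$ matrix of series terms Z, the response vector y, the fitted QR coefficients beta.hat.u at the quantile index of interest, and a bandwidth h.n are already available; these names are ours.

```r
## Gram matrix estimator (3.6): E_n[Z_i Z_i']
Sigma.hat <- crossprod(Z) / nrow(Z)

## Powell kernel estimator (3.7) of the Jacobian J_m(u)
powell.J <- function(Z, y, beta.hat.u, h.n) {
  e <- y - Z %*% beta.hat.u                  # QR residuals at quantile index u
  k <- as.numeric(abs(e) <= h.n)             # uniform kernel weights 1{|e_i| <= h_n}
  crossprod(Z * k, Z) / (2 * h.n * nrow(Z))  # E_n[1{|e_i| <= h_n} Z_i Z_i'] / (2 h_n)
}
```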
Theorem 7 (Estimation of Gram and Jacobian Matrices). If conditions S.1-S.4 and $\zeta_m^2 \log n = o(n)$ hold, then $\hat{\Sigma}_m - \Sigma_m = o_P(1)$ in the eigenvalue norm. If conditions S.1-S.5, $h_n = o(1)$, and $m\zeta_m^2 \log n = o(nh_n)$ hold, then $\hat{J}_m(u) - J_m(u) = o_P(1)$ in the eigenvalue norm, uniformly in $u \in \mathcal{U}$.

4. Estimation and Inference Methods for Linear Functionals

In this section we apply the previous results to develop estimation and inference methods for various linear functionals of the conditional quantile function.

4.1. Functionals of Interest. We are interested in various functions $\theta$ created as linear functionals of $Q_{Y|X}(u|\cdot)$ for all $u \in \mathcal{U}$. Particularly useful examples include, for $x = (w, v)$ and $x_k$ denoting the $k$-th component of $x$:
1. the function itself: $\theta(u, x) = Q_{Y|X}(u|x)$;
2. the derivative: $\theta(u, x) = \partial_{x_k} Q_{Y|X}(u|x)$;
3. the average derivative: $\theta(u) = \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x)$;
4. the conditional average derivative: $\theta(u|w) = \int \partial_{x_k} Q_{Y|X}(u|w, v)\,d\mu(v|w)$.
The measures $\mu$ entering the definitions above are taken as known; the results can be extended to include estimated measures.

Note that in each example we could be interested in estimating $\theta(u, w)$ simultaneously for many values of $(u, w)$. For $w \in \mathcal{W} \subset \mathbb{R}^d$, let $I \subset \mathcal{U} \times \mathcal{W}$ denote the set of indices $(u, w)$ of interest. For example:
• the function at a particular point $(u, w)$, in which case $I = \{(u, w)\}$;
• the function $u \mapsto \theta(u, w)$, having fixed $w$, in which case $I = \mathcal{U} \times \{w\}$;
• the function $w \mapsto \theta(u, w)$, having fixed $u$, in which case $I = \{u\} \times \mathcal{W}$;
• the entire function $(u, w) \mapsto \theta(u, w)$, in which case $I = \mathcal{U} \times \mathcal{W}$.

By the linearity of the series approximations, the above parameters can be seen as linear functions of the quantile regression coefficients $\beta(u)$, up to an approximation error, that is,
$$\theta(u, w) = \ell(w)'\beta(u) + r_n(u, w), \quad (u, w) \in I, \qquad (4.8)$$
where $\ell(w)'\beta(u)$ is the series approximation, with $\ell(w)$ denoting the $m$-vector of loadings on the coefficients, and $r_n(u, w)$ is the remainder term, which corresponds to the approximation error. Indeed, in each of the examples above, this decomposition arises from the application of different linear operators $A$ to the decomposition $Q_{Y|X}(u|\cdot) = Z(\cdot)'\beta(u) + R(u, \cdot)$ and evaluating the resulting functions at $w$:
$$(A Q_{Y|X}(u|\cdot))[w] = (A Z(\cdot))[w]'\beta(u) + (A R(u, \cdot))[w]. \qquad (4.9)$$
In the four examples above the operator $A$ is given by, respectively:
1. the identity operator: $(Af)[x] = f[x]$, so that $\ell(x) = Z(x)$ and $r_n(u, x) = R(u, x)$;
2. a differential operator: $(Af)[x] = (\partial_{x_k} f)[x]$, so that $\ell(x) = \partial_{x_k} Z(x)$ and $r_n(u, x) = \partial_{x_k} R(u, x)$;
3. an integro-differential operator: $Af = \int \partial_{x_k} f(x)\,d\mu(x)$, so that $\ell = \int \partial_{x_k} Z(x)\,d\mu(x)$ and $r_n(u) = \int \partial_{x_k} R(u, x)\,d\mu(x)$;
4. a partial integro-differential operator: $(Af)[w] = \int \partial_{x_k} f(x)\,d\mu(v|w)$, so that $\ell(w) = \int \partial_{x_k} Z(x)\,d\mu(v|w)$ and $r_n(u, w) = \int \partial_{x_k} R(u, x)\,d\mu(v|w)$.

For notational convenience, we use the formulation (4.8) in the analysis, instead of the motivational formulation (4.9). We shall provide inference tools that are valid for inference on the series approximation $\ell(w)'\beta(u)$, $(u, w) \in I$, and, provided that the approximation error $r_n(u, w)$, $(u, w) \in I$, is small enough as compared to the estimation noise, these tools are also valid for inference on the functional of interest, $\theta(u, w)$, $(u, w) \in I$. Therefore, the series approximation is an important penultimate target, whereas the functional $\theta$ is the ultimate target. The inference will be based on the plug-in estimator $\hat{\theta}(u, w) := \ell(w)'\hat{\beta}(u)$ of the series approximation $\ell(w)'\beta(u)$, and hence of the final target $\theta(u, w)$.
4.2. Pointwise Rates and Inference on the Linear Functionals. Let $(u, w)$ be a pair of a quantile index value $u$ and a covariate value $w$ at which we are interested in performing inference on $\theta(u, w)$. In principle, this deterministic value can be implicitly indexed by $n$, although we suppress the dependence.

Condition P.
P.1 The approximation error is small, namely $\sqrt{n}\,|r_n(u, w)|/\|\ell(w)\| = o(1)$.
P.2 The norm of the loading $\ell(w)$ satisfies $\|\ell(w)\| \lesssim \xi_\theta(m, w)$.

Theorem 8 (Pointwise Convergence Rate for Linear Functionals). Assume that the conditions of Theorem 2 and Condition P hold. Then
$$|\hat{\theta}(u, w) - \theta(u, w)| \lesssim_P \frac{\xi_\theta(m, w)}{\sqrt{n}}. \qquad (4.10)$$

In order to perform inference, we construct an estimator of the variance as
$$\hat{\sigma}_n^2(u, w) = u(1-u)\,\ell(w)'\hat{J}_m^{-1}(u)\hat{\Sigma}_m\hat{J}_m^{-1}(u)\ell(w)/n. \qquad (4.11)$$
Under the stated conditions this quantity is consistent for
$$\sigma_n^2(u, w) = u(1-u)\,\ell(w)'J_m^{-1}(u)\Sigma_m J_m^{-1}(u)\ell(w)/n. \qquad (4.12)$$
Finally, consider the t-statistic
$$t_n(u, w) = \frac{\hat{\theta}(u, w) - \theta(u, w)}{\hat{\sigma}_n(u, w)}.$$
Condition P also ensures that the approximation error is small, so that
$$t_n(u, w) = \frac{\ell(w)'(\hat{\beta}(u) - \beta(u))}{\hat{\sigma}_n(u, w)} + o_P(1).$$
We can carry out standard inference based on this statistic because $t_n(u, w) \to_d N(0, 1)$. Moreover, we can base inference on the quantiles of the following statistics:
pivotal coupling: $t_n^*(u, w) = \dfrac{\ell(w)'\hat{J}_m^{-1}(u)\mathbb{U}_n^*(u)/\sqrt{n}}{\hat{\sigma}_n(u, w)}$;
gradient bootstrap coupling: $t_n^*(u, w) = \dfrac{\ell(w)'(\hat{\beta}^*(u) - \hat{\beta}(u))}{\hat{\sigma}_n(u, w)}$; (4.13)
weighted bootstrap coupling: $t_n^*(u, w) = \dfrac{\ell(w)'(\hat{\beta}^b(u) - \hat{\beta}(u))}{\hat{\sigma}_n(u, w)}$.
Accordingly, let $k_n(1-\alpha)$ be the $1-\alpha/2$ quantile of the standard normal variable, or let $k_n(1-\alpha)$ denote the $1-\alpha$ quantile of the random variable $|t_n^*(u, w)|$ conditional on the data, i.e.,
$$k_n(1-\alpha) = \inf\{t : P(|t_n^*(u, w)| \le t \mid \mathcal{D}_n) \ge 1-\alpha\};$$
then a pointwise $(1-\alpha)$-confidence interval can be formed as
$$[\dot\iota(u, w), \ddot\iota(u, w)] = [\hat{\theta}(u, w) - k_n(1-\alpha)\hat{\sigma}_n(u, w),\ \hat{\theta}(u, w) + k_n(1-\alpha)\hat{\sigma}_n(u, w)].$$

Theorem 9 (Pointwise Inference for Linear Functionals). Suppose that the conditions of Theorems 2 and 7 and Condition P hold. Then, $t_n(u, w) \to_d N(0, 1)$. For the case of using the gradient bootstrap method, suppose that the conditions of Theorem 4 hold. For the case of using the weighted bootstrap method, suppose that the conditions of Theorem 6 hold. Then, $t_n^*(u, w) \to_d N(0, 1)$ conditional on the data. Moreover, the $(1-\alpha)$-confidence band $[\dot\iota(u, w), \ddot\iota(u, w)]$ covers $\theta(u, w)$ with probability that is asymptotically no less than $1-\alpha$, namely
$$P\{\theta(u, w) \in [\dot\iota(u, w), \ddot\iota(u, w)]\} \ge 1-\alpha + o(1).$$
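A minimal R sketch of the normal-based pointwise interval follows. It assumes that the loading vector ell.w, the quantile index u, the estimates Sigma.hat and J.hat.u (for example, from the sketch in Section 3.3), the coefficient estimates beta.hat.u, and the sample size n are in hand; these names are ours.

```r
## variance estimator (4.11) at the fixed pair (u, w)
J.inv   <- solve(J.hat.u)
var.hat <- u * (1 - u) *
  drop(t(ell.w) %*% J.inv %*% Sigma.hat %*% J.inv %*% ell.w) / n
se.hat  <- sqrt(var.hat)

## plug-in estimate theta.hat(u, w) = ell(w)' beta.hat(u) and normal-based interval
theta.hat <- drop(crossprod(ell.w, beta.hat.u))
alpha <- 0.10
k.n   <- qnorm(1 - alpha / 2)  # 1 - alpha/2 quantile of the standard normal
c(lower = theta.hat - k.n * se.hat, upper = theta.hat + k.n * se.hat)
```

Replacing k.n by the conditional quantile of one of the couplings in (4.13), simulated as in Appendix A, gives the simulation-based version of the same interval.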
4.3. Uniform Rates and Inference on the Linear Functionals. For $f : I \mapsto \mathbb{R}^k$, define the norm
$$\|f\|_I := \sup_{(u, w) \in I} \|f(u, w)\|.$$
We shall invoke the following assumptions to establish rates and uniform inference results over the region $I$.

Condition U.
U.1 The approximation error is small, namely $\sqrt{n}\,\log n\,\sup_{(u, w) \in I} |r_n(u, w)|/\|\ell(w)\| = o(1)$.
U.2 The loadings $\ell(w)$ are uniformly bounded and admit Lipschitz coefficients $\xi_\theta^L(m, I)$; that is, $\|\ell\|_I \lesssim \xi_\theta(m, I)$, $\|\ell(w) - \ell(w')\| \le \xi_\theta^L(m, I)\|w - w'\|$, and $\log[\text{diam}(I) \vee \xi_\theta(m, I) \vee \xi_\theta^L(m, I) \vee \zeta_m] \lesssim \log m$.

Theorem 10 (Uniform Convergence Rate for Linear Functionals). Assume that the conditions of Theorem 2 and Condition U hold, and that $d\,\xi_\theta^2(m, I)\zeta_m^2 \log^2 n = o(n)$. Then
$$\sup_{(u, w) \in I} |\hat{\theta}(u, w) - \theta(u, w)| \lesssim_P \frac{\xi_\theta(m, I) \vee 1}{\sqrt{n}}\sqrt{\log n}. \qquad (4.14)$$

As in the pointwise case, consider the estimator of the variance $\hat{\sigma}_n^2(u, w)$ in (4.11). Under our conditions this quantity is uniformly consistent for $\sigma_n^2(u, w)$ defined in (4.12), namely
$$\hat{\sigma}_n^2(u, w)/\sigma_n^2(u, w) = 1 + o_P(1/\log n), \quad \text{uniformly over } (u, w) \in I.$$
Then we consider the t-statistic process
$$\left\{ t_n(u, w) = \frac{\hat{\theta}(u, w) - \theta(u, w)}{\hat{\sigma}_n(u, w)},\ (u, w) \in I \right\}.$$
Under our assumptions the approximation error is small, so that
$$t_n(u, w) = \frac{\ell(w)'(\hat{\beta}(u) - \beta(u))}{\hat{\sigma}_n(u, w)} + o_P(1/\log n) \quad \text{in } \ell^\infty(I).$$
The main result on inference is that the t-statistic process can be strongly approximated by the following pivotal processes or couplings:
pivotal coupling: $\left\{ t_n^*(u, w) = \dfrac{\ell(w)'\hat{J}_m^{-1}(u)\mathbb{U}_n^*(u)/\sqrt{n}}{\hat{\sigma}_n(u, w)},\ (u, w) \in I \right\}$;
gradient bootstrap coupling: $\left\{ t_n^*(u, w) = \dfrac{\ell(w)'(\hat{\beta}^*(u) - \hat{\beta}(u))}{\hat{\sigma}_n(u, w)},\ (u, w) \in I \right\}$; (4.15)
Gaussian coupling: $\left\{ t_n^*(u, w) = \dfrac{\ell(w)'\hat{J}_m^{-1}(u)G_n(u)/\sqrt{n}}{\hat{\sigma}_n(u, w)},\ (u, w) \in I \right\}$;
weighted bootstrap coupling: $\left\{ t_n^*(u, w) = \dfrac{\ell(w)'(\hat{\beta}^b(u) - \hat{\beta}(u))}{\hat{\sigma}_n(u, w)},\ (u, w) \in I \right\}$.
The following theorem shows that these couplings approximate the distribution of the t-statistic process in large samples.

Theorem 11 (Strong Approximation of Inferential Processes by Couplings). Suppose that the conditions of Theorems 2 and 3 and Condition U hold. For the case of using the gradient bootstrap method, suppose that the conditions of Theorem 4 hold. For the case of using the Gaussian approximation, suppose that the conditions of Theorem 5 hold. For the case of using the weighted bootstrap method, suppose that the conditions of Theorem 6 hold. Then,
$$t_n(u, w) =_d t_n^*(u, w) + o_P(1/\log n) \quad \text{in } \ell^\infty(I),$$
where $P$ can be replaced by $P^*$.

To construct uniform two-sided confidence bands for $\{\theta(u, w) : (u, w) \in I\}$, we consider the maximal t-statistic
$$\|t_n\|_I = \sup_{(u, w) \in I} |t_n(u, w)|,$$
as well as the couplings to this statistic in the form
$$\|t_n^*\|_I = \sup_{(u, w) \in I} |t_n^*(u, w)|.$$
Ideally, we would like to use quantiles of the first statistic as critical values, but we do not know them. We instead use quantiles of the second statistic as large sample approximations. Let $k_n(1-\alpha)$ denote the $1-\alpha$ quantile of the random variable $\|t_n^*\|_I$ conditional on the data, i.e.,
$$k_n(1-\alpha) = \inf\{t : P(\|t_n^*\|_I \le t \mid \mathcal{D}_n) \ge 1-\alpha\}.$$
This quantity can be computed numerically by Monte Carlo methods, as we illustrate in the empirical section.
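In practice, $k_n(1-\alpha)$ is simply the empirical $1-\alpha$ quantile of the simulated maximal t-statistic. A short R sketch, assuming the draws of one of the couplings in (4.15) are stored as the columns of a matrix tstar, with one row per grid point in $I$ and one column per simulation, and with n denoting the sample size (these names are ours):

```r
## ||t_n*||_I for each simulation draw
max.t <- apply(abs(tstar), 2, max)

## critical value k_n(1 - alpha) and the finite sample expansion factor delta_n
alpha   <- 0.10
k.n     <- quantile(max.t, probs = 1 - alpha, type = 1)
delta.n <- 1 / (4 * log(n)^(3/4))  # recommended choice discussed in the text
c.n     <- k.n + delta.n

## uniform band over the grid: theta.hat(u, w) +/- c.n * sigma.hat(u, w)
```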
Let $\delta_n > 0$ be a finite sample expansion factor such that $\delta_n \log^{1/2} n \to 0$ but $\delta_n \log n \to \infty$; for example, we recommend setting $\delta_n = 1/(4\log^{3/4} n)$. Then for $c_n(1-\alpha) = k_n(1-\alpha) + \delta_n$ we define the confidence bands of asymptotic level $1-\alpha$ to be
$$[\dot\iota(u, w), \ddot\iota(u, w)] = [\hat{\theta}(u, w) - c_n(1-\alpha)\hat{\sigma}_n(u, w),\ \hat{\theta}(u, w) + c_n(1-\alpha)\hat{\sigma}_n(u, w)], \quad (u, w) \in I.$$
The following theorem establishes the asymptotic validity of these confidence bands.

Theorem 12 (Uniform Inference for Linear Functionals). Suppose that the conditions of Theorem 11 hold for a given coupling method. (1) Then
$$P\{\|t_n\|_I \le c_n(1-\alpha)\} \ge 1-\alpha + o(1). \qquad (4.16)$$
(2) As a consequence, the confidence bands constructed above cover $\theta(u, w)$ uniformly for all $(u, w) \in I$ with probability that is asymptotically no less than $1-\alpha$, namely
$$P\{\theta(u, w) \in [\dot\iota(u, w), \ddot\iota(u, w)] \text{ for all } (u, w) \in I\} \ge 1-\alpha + o(1). \qquad (4.17)$$
(3) The width of the confidence band, $2c_n(1-\alpha)\hat{\sigma}_n(u, w)$, obeys, uniformly in $(u, w) \in I$,
$$2c_n(1-\alpha)\hat{\sigma}_n(u, w) = 2k_n(1-\alpha)(1 + o_P(1))\sigma_n(u, w). \qquad (4.18)$$
(4) Furthermore, if $\|t_n^*\|_I$ does not concentrate at $k_n(1-\alpha)$ at a rate faster than $\sqrt{\log n}$, that is, it obeys the anti-concentration property $P(\|t_n^*\|_I \le k_n(1-\alpha) + \varepsilon_n) = 1-\alpha + o(1)$ for any $\varepsilon_n = o(1/\sqrt{\log n})$, then the inequalities in (4.16) and (4.17) hold as equalities, and the finite sample adjustment factor $\delta_n$ could be set to zero.

The theorem shows that the confidence bands constructed above maintain the required level asymptotically and establishes that the uniform width of the bands is of the same order as the uniform rate of convergence. Moreover, under anti-concentration the confidence intervals are asymptotically similar.

Comment 4.1. This inferential strategy builds on [15], who proposed a similar strategy for inference on the minimum of a function. The idea is not to find the limit distribution, which may not exist in some cases, but to use distributions provided by couplings. Since the limit distribution need not exist, it is not immediately clear that the confidence intervals maintain the right asymptotic level. However, the additional adjustment factor $\delta_n$ assures the right asymptotic level. A small price to pay for using the adjustment $\delta_n$ is that the confidence intervals may not be similar, i.e., they may remain asymptotically conservative in coverage. However, the width of the confidence intervals is not asymptotically conservative, since $\delta_n$ is negligible compared to $k_n(1-\alpha)$. If an additional property, called anti-concentration, holds, then the confidence intervals automatically become asymptotically similar. The anti-concentration property holds if, after appropriate scaling by some deterministic sequences $a_n$ and $b_n$, the inferential statistic $a_n(\|t_n\|_I - b_n)$ has a continuous limit distribution. More generally, it holds if for any subsequence of integers $\{n_k\}$ there is a further subsequence $\{n_{k_r}\}$ along which $a_{n_{k_r}}(\|t_{n_{k_r}}\|_I - b_{n_{k_r}})$ has a continuous limit distribution, possibly dependent on the subsequence. For an example of the latter see [9], where certain inferential processes converge in distribution to tight Gaussian processes subsubsequentially. We expect anti-concentration to hold in our case, but our constructions and results do not critically hinge on it.

Comment 4.2 (Primitive Bounds on $\xi_\theta(m, I)$). The results of this section rely on the quantity $\xi_\theta(m, I)$. Its value depends on the choice of basis for the series estimator and on the type of linear functional. Here we discuss the case of regression splines and refer to [29] and [12] for other choices of basis. After a possible renormalization of $X$, we assume its support is $\mathcal{X} = [-1, 1]^d$. For splines it has been established that $\zeta_m \lesssim \sqrt{m}$ and that, for derivatives of order $k$, $\sup_{x \in \mathcal{X}} \|\partial_x^k Z(x)\| \lesssim m^{1/2+k}$, [29]. Then we have, for:
• the function itself: $\theta(u, x) = Q_{Y|X}(u|x)$, $\ell(x) = Z(x)$, and $\xi_\theta(m, I) \lesssim \sqrt{m}$;
• the derivative: $\theta(u, x) = \partial_{x_k} Q_{Y|X}(u|x)$, $\ell(x) = \partial_{x_k} Z(x)$, and $\xi_\theta(m, I) \lesssim m^{3/2}$;
• the average derivative: $\theta(u) = \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x)$, with $\text{supp}(\mu) \subset \text{int}\,\mathcal{X}$ and $|\partial_{x_k}\mu(x)| \lesssim 1$: integration by parts gives $\ell = \int \partial_{x_k} Z(x)\mu(x)\,dx = -\int Z(x)\partial_{x_k}\mu(x)\,dx$, so that $\xi_\theta(m) \lesssim 1$.

4.4. Imposing Monotonicity on Linear Functionals. The functionals of interest might be naturally monotone in some of their arguments. For example, the conditional quantile function is increasing in the quantile index, and the conditional quantile demand function is decreasing in price and increasing in the quantile index. Therefore, it might be desirable to impose the same requirements on the estimators of these functions. Let $\theta(u, w)$, where $(u, w) \in I$, be a weakly increasing function in $(u, w)$, i.e., $\theta(u', w') \le \theta(u, w)$ whenever $(u', w') \le (u, w)$ componentwise. (If $\theta(u, w)$ is decreasing in $w$, we take the transformation $\tilde{w} = -w$ and $\tilde{\theta}(u, \tilde{w}) = \theta(u, -\tilde{w})$, where $\tilde{\theta}(u, \tilde{w})$ is increasing in $\tilde{w}$.) Let $\hat{\theta}$ and $[\dot\iota, \ddot\iota]$ be the point and band estimators of $\theta$, constructed using one of the methods described in the previous sections. These estimators might not satisfy the monotonicity requirement due to either estimation error or imperfect approximation.
However, we can monotonize these estimates and perform inference using the following method, suggested in [14]. Let $q, f : I \mapsto K$, where $K$ is a bounded subset of $\mathbb{R}$, and consider any monotonization operator $M$ that satisfies:
(1) a monotone-neutrality condition: $Mq = q$ if $q$ is monotone; (4.19)
(2) a distance-reducing condition: $\|Mq - Mf\|_I \le \|q - f\|_I$; (4.20)
(3) an order-preserving condition: $q \le f$ implies $Mq \le Mf$. (4.21)
Examples of operators that satisfy these conditions include: 1. multivariate rearrangement [14]; 2. isotonic projection [4]; 3. convex combinations of rearrangement and isotonic regression [14]; and 4. convex combinations of monotone minorants and monotone majorants.

The following corollary establishes that monotonizing point estimators reduces estimation error, and that monotonizing confidence bands increases coverage while reducing the length of the bands. It follows from Theorem 12 using the same arguments as in Propositions 2 and 3 in [14].

Corollary 1 (Inference on Monotone Linear Functionals). Let $\theta : I \mapsto K$ be weakly increasing over $I$ and $\hat{\theta}$ be the QR series estimator of Theorem 10. If $M$ satisfies conditions (4.19) and (4.20), then the monotonized QR estimator is necessarily closer to the true value:
$$\|M\hat{\theta} - \theta\|_I \le \|\hat{\theta} - \theta\|_I.$$
Let $[\dot\iota, \ddot\iota]$ be a confidence band for $\theta$ of Theorem 12. If $M$ satisfies conditions (4.19) and (4.21), the monotonized confidence bands maintain the asymptotic level of the original intervals:
$$P\{\theta(u, w) \in [M\dot\iota(u, w), M\ddot\iota(u, w)] : (u, w) \in I\} \ge 1-\alpha + o(1).$$
If $M$ satisfies condition (4.20), the monotonized confidence bands are weakly shorter in length than the original intervals:
$$\|M\ddot\iota - M\dot\iota\|_I \le \|\ddot\iota - \dot\iota\|_I.$$

Comment 4.3. Another strategy, as mentioned e.g. in [14], is to simply intersect the initial confidence interval with the set of monotone functions. This is done by taking the smallest majorant of the lower band and the largest minorant of the upper band. This, in principle, produces the shortest intervals with the desired asymptotic level. However, this approach may not have good properties under misspecification, i.e., when the approximation error is not relatively small (we do not analyze such cases in this paper, but they can occur in practice), whereas the strategy explored in the corollary retains a number of good properties even in this case. See [14] for a detailed discussion of this point and for practical guidance on choosing particular monotonization schemes.
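For the leading case of a functional of the quantile index alone, the one-dimensional increasing rearrangement operator amounts to sorting the estimated values over a grid of quantile indices. The R sketch below illustrates this operator; theta.hat, lower, and upper denote hypothetical point and band estimates on the grid, and implementations of the multivariate rearrangement of [14] are also available (e.g., in the Rearrangement R package).

```r
## one-dimensional increasing rearrangement on a grid of quantile indices:
## sorting the values implements the operator M, which on the grid is
## monotone-neutral (4.19), distance-reducing (4.20), and order-preserving (4.21)
rearrange.inc <- function(f.grid) sort(f.grid)

theta.m <- rearrange.inc(theta.hat)  # monotonized point estimate, M theta.hat
lower.m <- rearrange.inc(lower)      # monotonized lower band
upper.m <- rearrange.inc(upper)      # monotonized upper band
```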
5. Examples

This section illustrates the finite sample performance of the estimation and inference methods with two examples. All the calculations were carried out with the software R ([33]), using the package quantreg for quantile regression ([26]).

5.1. Empirical Example. To illustrate our methods with real data, we consider an empirical application to nonparametric estimation of the demand for gasoline. [21], [34], and [39] estimated the average demand function nonparametrically. We estimate the quantile demand and elasticity functions nonparametrically and apply our inference methods to construct confidence bands for the average quantile elasticity function. We use the same data set as in [39], which comes from the National Private Vehicle Use Survey, conducted by Statistics Canada between October 1994 and September 1996. The main advantage of this data set, relative to similar data sets for the U.S., is that it is based on fuel purchase diaries and contains detailed household level information on prices, fuel consumption patterns, vehicles, and demographic characteristics. (See [39] for a more detailed description of the data.) Our sample selection and variable construction also follow [39]. We select into the sample households with non-zero licensed drivers, vehicles, and distance driven, and we focus on regular grade gasoline consumption. This selection leaves us with a sample of 5,001 households. Fuel consumption and expenditure are recorded by the households at the purchase level.

We consider the following empirical specification:
$$Y = Q_{Y|X}(U|X), \quad Q_{Y|X}(U|X) = g(W, U) + V'\beta(U), \quad X = (W, V),$$
where $Y$ is the log of total gasoline consumption in liters per month; $W$ is the log of price per liter; $U$ is the unobservable preference of the household to consume gasoline; and $V$ is a vector of 28 covariates. Following [39], the covariate vector includes the log of age, a dummy for the top-coded value of age, the log of income, a set of dummies for household size, a dummy for urban dwellers, a dummy for young-single (age less than 36 and household size of one), the number of drivers, a dummy for more than 4 drivers, 5 province dummies, and 12 monthly dummies.

To estimate the function $g(W, U)$, we consider three series approximations in $W$: linear, a power orthogonal polynomial of degree 6, and a cubic B-spline with 5 knots at the $\{0, 1/4, 1/2, 3/4, 1\}$ quantiles of the observed values of $W$. The number of series terms is selected by undersmoothing over the specifications chosen by applying least squares cross validation to the corresponding conditional average demand functions. In the next section, we analyze the size of the specification error of these series approximations in a numerical experiment calibrated to mimic this example.

The empirical results are reported in Figures 1-3. Fig. 1 plots the initial and rearranged estimates of the quantile demand surface for gasoline as a function of price and the quantile index, that is, $(u, \exp(w)) \mapsto \theta(u, w) = \exp(g(w, u) + v'\beta(u))$, where the value of $v$ is fixed at the sample median values of the ordinal variables and at one for the dummies corresponding to the sample modal values of the rest of the variables. (The median values of the ordinal covariates are $40K for income, 46 for age, and 2 for the number of drivers. The modal values for the rest of the covariates are 0 for the top-coding of age, 2 for household size, 1 for urban dwellers, 0 for young-single, 0 for the dummy of more than 4 drivers, 4 (Prairie) for province, and 11 (November) for month.) The rows of the figure correspond to the three series approximations. The monotonized estimates in the right panels are obtained using the average rearrangement over both the price and quantile dimensions proposed in [14]. The power and B-spline series approximations show most noticeably non-monotone areas with respect to price at high quantiles, which are removed by the rearrangement.

Fig. 2 shows series estimates of the quantile elasticity surface as a function of price and the quantile index, that is, $(u, \exp(w)) \mapsto \theta(u, w) = \partial_w g(w, u)$. The estimates from the linear approximation show that the elasticity decreases with the quantile index in the middle of the distribution, but this pattern is reversed at the tails. The power and B-spline estimates show substantial heterogeneity of the elasticity across prices, with individuals at the high quantiles being more sensitive to high prices. (These estimates are smoothed by local weighted polynomial regression across the price dimension ([16]), because the unsmoothed elasticity estimates display very erratic behavior.)

Fig. 3 shows 90% uniform confidence bands for the average quantile elasticity function
$$u \mapsto \theta(u) = \int \partial_w g(w, u)\,d\mu(w).$$
The rows of the figure correspond to the three series approximations and the columns correspond to the inference methods.
We construct the bands using the pivotal, Gaussian, and weighted bootstrap methods. For the pivotal and Gaussian methods the distribution of the maximal t-statistic is obtained by 1,000 simulations; the weighted bootstrap uses standard exponential weights and 199 repetitions. The confidence bands show that the evidence of heterogeneity in the elasticities across quantiles is not statistically significant, because we can trace a horizontal line within the bands. They show, however, that there is significant evidence of negative price sensitivity at most quantiles, as the bands are bounded away from zero for most quantiles.

[Figure 1. Quantile demand surfaces for gasoline as a function of price and the quantile index. The left panels display linear, power and B-spline series estimates, and the right panels show the corresponding estimates monotonized by rearrangement over both dimensions.]

[Figure 2. Quantile elasticity surfaces as a function of price and the quantile index. The power and B-spline series estimates are smoothed by local weighted polynomial regression with bandwidth 0.5.]

5.2. Numerical Example. To evaluate the performance of our estimation and inference methods in finite samples, we conduct a Monte Carlo experiment designed to mimic the previous empirical example. We consider the following design for the data generating process:
$$Y = g(W) + V'\beta + \sigma\Phi^{-1}(U), \qquad (5.22)$$
where $g(w) = \alpha_0 + \alpha_1 w + \alpha_2\sin(2\pi w) + \alpha_3\cos(2\pi w) + \alpha_4\sin(4\pi w) + \alpha_5\cos(4\pi w)$, $V$ is the same covariate vector as in the empirical example, $U \sim \text{Uniform}(0, 1)$, and $\Phi^{-1}$ denotes the inverse of the CDF of the standard normal distribution. The parameters of $g(w)$ and $\beta$ are calibrated by applying least squares to the data set in the empirical example, and $\sigma$ is calibrated to the least squares residual standard deviation. We consider linear, power, and B-spline series methods to approximate $g(w)$, with the same number of series terms and other tuning parameters as in the empirical example.
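For reference, a sketch of one draw from the design (5.22) in R follows; the coefficient vector a for $g(w)$, the coefficients beta.v, the scale sigma, and the regressor values W.data and V.data are placeholders for the quantities calibrated from the data set, not the calibrated values themselves.

```r
## the trigonometric trend g(w) in the design (5.22)
g <- function(w, a) {
  a[1] + a[2] * w + a[3] * sin(2 * pi * w) + a[4] * cos(2 * pi * w) +
    a[5] * sin(4 * pi * w) + a[6] * cos(4 * pi * w)
}

## one draw from the DGP (5.22): Y = g(W) + V'beta + sigma * Phi^{-1}(U)
draw.dgp <- function(W.data, V.data, a, beta.v, sigma) {
  n <- length(W.data)
  U <- runif(n)  # U ~ Uniform(0, 1)
  g(W.data, a) + drop(V.data %*% beta.v) + sigma * qnorm(U)
}
```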
Figures 4 and 5 examine the quality of the series approximations in population. They compare the true quantile function, $(u, \exp(w)) \mapsto \theta(u, w) = g(w) + v'\beta + \sigma\Phi^{-1}(u)$, and the quantile elasticity function, $(u, \exp(w)) \mapsto \theta(u, w) = \partial_w g(w)$, to the estimands of the series approximations. In the quantile demand function the value of $v$ is fixed at the sample median values of the ordinal variables and at one for the dummies corresponding to the sample modal values of the rest of the variables. The estimands are obtained numerically from a mega-sample (a proxy for the infinite population) of $100 \times 5{,}001$ observations, with the values of $(W, V)$ as in the data set (repeated 100 times) and with $Y$ generated from the DGP (5.22). Although the derivative function does not depend on $u$ in our design, we do not impose this restriction on the estimands. Both figures show that the power and B-spline estimands are close to the true target functions, whereas the more parsimonious linear approximation misses important curvature features of the target functions, especially in the elasticity function.

[Figure 3. 90% confidence bands for the average quantile elasticity function, by series approximation (rows: linear, power, B-splines) and inference method (columns: pivotal, Gaussian, weighted bootstrap). Pivotal and Gaussian bands are obtained by 1,000 simulations. Weighted bootstrap bands are based on 199 bootstrap repetitions with standard exponential weights.]

[Figure 4. Estimands of the quantile demand surface. Estimands for the linear, power and B-spline series estimators are obtained numerically using 500,100 simulations.]

[Figure 5. Estimands of the quantile elasticity surface. Estimands for the linear, power and B-spline series estimators are obtained numerically using 500,100 simulations.]

To analyze the properties of the inference methods in finite samples, we draw 500 samples from the DGP in equation (5.22) with 3 sample sizes, $n$: 5,001, 1,000, and 500 observations. For $n = 5{,}001$ we fix $W$ to the values in the data set, whereas for the smaller sample sizes we draw $W$ with replacement from the values in the data set and keep it fixed across samples. To speed up computation, we drop the vector $V$ by fixing it at the sample median values of the ordinal components and at one for the dummies corresponding to the sample modal values for all the individuals. We focus on the average quantile elasticity function
$$u \mapsto \theta(u) = \int \partial_w g(w)\,d\mu(w),$$
over the region $I = [0.1, 0.9]$.
We estimate this function using linear, power, and B-spline quantile regression with the same number of terms and other tuning parameters as in the empirical example. Although $\theta(u)$ does not change with $u$ in our design, again we do not impose this restriction on the estimators. For inference, we compare the performance of 90% confidence bands for the entire elasticity function. These bands are constructed using the pivotal, Gaussian, and weighted bootstrap methods, all implemented in the same fashion as in the empirical example. The interval $I$ is approximated by a finite grid of 81 quantiles, $\tilde{I} = \{0.10, 0.11, \ldots, 0.90\}$.

Table 1 reports estimation and inference results averaged across 200 simulations. The true value of the elasticity function is $\theta(u) = -0.74$ for all $u \in \tilde{I}$. Bias and RMSE are the absolute bias and root mean squared error integrated over $\tilde{I}$. SE/SD reports the ratio of the empirical average standard error to the empirical standard deviation, where the standard errors are the analytical ones from expression (4.11); the bandwidth for $\hat{J}_m(u)$ is chosen using the Hall-Sheather option of the quantreg R package ([19]). Length gives the empirical average of the length of the confidence band. SE/SD and Length are integrated over the grid of quantiles $\tilde{I}$. Cover reports the empirical coverage of the confidence bands with nominal level of 90%. Stat is the empirical average of the 90% quantile of the maximal t-statistic used to construct the bands.

Table 1 shows that the linear estimator has higher absolute bias than the more flexible power and B-spline estimators, but displays lower RMSE, especially for small sample sizes. The analytical standard errors provide good approximations to the standard deviations of the estimators. The confidence bands have empirical coverage close to the nominal level of 90% for all the estimators and sample sizes considered, and the weighted bootstrap bands tend to have larger average length than the pivotal and Gaussian bands. All in all, these results strongly confirm the practical value of the theoretical results and methods developed in the paper. They also support the empirical example by verifying that our estimation and inference methods work quite nicely in a very similar setting.

Table 1. Finite Sample Properties of Estimation and Inference Methods for the Average Quantile Elasticity Function

                              Pivotal               Gaussian              Weighted Bootstrap
           Bias  RMSE  SE/SD  Cover  Length  Stat   Cover  Length  Stat   Cover  Length  Stat
n = 5,001
Linear     0.05  0.14  1.04    90    0.77    2.64    90    0.76    2.64    87    0.82    2.87
Power      0.00  0.15  1.03    91    0.85    2.65    91    0.85    2.65    88    0.91    2.83
B-spline   0.01  0.15  1.02    90    0.86    2.64    88    0.86    2.64    90    0.93    2.84
n = 1,000
Linear     0.03  0.29  1.09    92    1.78    2.64    93    1.78    2.64    90    1.96    2.99
Power      0.03  0.33  1.07    92    2.01    2.66    91    2.00    2.65    95    2.17    2.95
B-spline   0.02  0.35  1.05    90    2.08    2.65    90    2.07    2.65    96    2.21    2.95
n = 500
Linear     0.04  0.45  1.01    88    2.60    2.64    88    2.60    2.64    90    2.84    3.05
Power      0.02  0.52  1.04    90    3.12    2.65    90    3.13    2.66    95    3.29    3.01
B-spline   0.02  0.52  1.04    90    3.25    2.65    90    3.25    2.65    96    3.35    3.00

Notes: 200 repetitions. Simulation standard error for coverage probability is 2%.

Appendix A. Implementation Algorithms

Throughout this section we assume that we have a random sample $\{(Y_i, Z_i) : 1 \le i \le n\}$. We are interested in approximating the distribution of the process $\sqrt{n}(\hat{\beta}(\cdot) - \beta(\cdot))$ or of the statistics associated with functionals of it.
Appendix A. Implementation Algorithms

Throughout this section we assume that we have a random sample {(Yi, Zi) : 1 ≤ i ≤ n}. We are interested in approximating the distribution of the process √n(β̂(·) − β(·)), or of statistics associated with functionals of it. Recall that for each quantile u ∈ U ⊂ (0, 1) we estimate β(u) by quantile regression, β̂(u) = arg min_{β∈R^m} En[ρu(Yi − Zi′β)], the Gram matrix Σm by Σ̂m = En[Zi Zi′], and the Jacobian matrix Jm(u) by the Powell [32] estimator

  Ĵm(u) = En[1{|Yi − Zi′β̂(u)| ≤ hn} · Zi Zi′]/(2hn),

where we recommend choosing the bandwidth hn as in the quantreg R package with the Hall–Sheather option ([19]).

We begin by describing the algorithms that approximate the distribution of the process √n(β̂(·) − β(·)) indexed by U.

Algorithm 1 (Pivotal method). (1) For b = 1, ..., B, draw U_1^b, ..., U_n^b i.i.d. from U ~ Uniform(0, 1) and compute U_n^b(u) = n^{−1/2} Σ_{i=1}^n Zi(u − 1{U_i^b ≤ u}), u ∈ U. (2) Approximate the distribution of {√n(β̂(u) − β(u)) : u ∈ U} by the empirical distribution of {Ĵm^{−1}(u) U_n^b(u) : u ∈ U, 1 ≤ b ≤ B}.

Algorithm 2 (Gaussian method). (1) For b = 1, ..., B, generate an m-dimensional standard Brownian bridge on U, B_m^b(·), and define G_n^b(u) = Σ̂m^{1/2} B_m^b(u) for u ∈ U. (2) Approximate the distribution of {√n(β̂(u) − β(u)) : u ∈ U} by the empirical distribution of {Ĵm^{−1}(u) G_n^b(u) : u ∈ U, 1 ≤ b ≤ B}.

Algorithm 3 (Weighted bootstrap method). (1) For b = 1, ..., B, draw h_1^b, ..., h_n^b i.i.d. from the standard exponential distribution and compute the weighted quantile regression process β̂^b(u) = arg min_{β∈R^m} Σ_{i=1}^n h_i^b · ρu(Yi − Zi′β), u ∈ U. (2) Approximate the distribution of {√n(β̂(u) − β(u)) : u ∈ U} by the empirical distribution of {√n(β̂^b(u) − β̂(u)) : u ∈ U, 1 ≤ b ≤ B}.

Algorithm 4 (Gradient bootstrap method). (1) For b = 1, ..., B, draw U_1^b, ..., U_n^b i.i.d. from U ~ Uniform(0, 1) and compute U_n^b(u) = n^{−1/2} Σ_{i=1}^n Zi(u − 1{U_i^b ≤ u}), u ∈ U. (2) For b = 1, ..., B, estimate the quantile regression process β̂^b(u) = arg min_{β∈R^m} Σ_{i=1}^n ρu(Yi − Zi′β) + ρu(Y_{n+1} − X_{n+1}^b(u)′β), u ∈ U, where X_{n+1}^b(u) = −√n U_n^b(u)/u and Y_{n+1} = n max_{1≤i≤n} |Yi| to ensure Y_{n+1} > X_{n+1}^b(u)′β̂^b(u) for all u ∈ U. (3) Approximate the distribution of {√n(β̂(u) − β(u)) : u ∈ U} by the empirical distribution of {√n(β̂^b(u) − β̂(u)) : u ∈ U, 1 ≤ b ≤ B}.

The previous algorithms provide approximations to the distribution of √n(β̂(u) − β(u)) that are uniformly valid in u ∈ U. We can use these approximations directly to make inference on linear functionals of QY|X(·|X), including the conditional quantile function itself, provided the approximation error is small, as stated in Theorems 9 and 11.
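For concreteness, here is a minimal sketch (ours, not the authors' code) of the Powell estimator above and of the pivotal draws of Algorithm 1; Z is the n × m design matrix of series terms and h a given bandwidth:

  # A minimal sketch (not the paper's code) of Powell's Jacobian estimator and
  # of Algorithm 1; Z is the n x m series design matrix, h a given bandwidth.
  powell_jacobian <- function(Z, y, beta_u, h) {
    k <- as.numeric(abs(y - Z %*% beta_u) <= h)  # uniform kernel weights
    crossprod(Z * k, Z) / (2 * h * nrow(Z))      # En[1{|Y - Z'b| <= h} Z Z'] / (2h)
  }

  pivotal_draws <- function(Z, Jinv, us, B = 1000) {
    # Jinv[[k]] is an estimate of Jm(us[k])^{-1}, e.g. solve(powell_jacobian(...))
    n <- nrow(Z)
    lapply(seq_len(B), function(b) {
      Ub <- runif(n)                                         # U_1^b, ..., U_n^b i.i.d. Uniform(0,1)
      sapply(seq_along(us), function(k) {
        Un <- crossprod(Z, us[k] - (Ub <= us[k])) / sqrt(n)  # U_n^b(u)
        drop(Jinv[[k]] %*% Un)                               # Jhat_m^{-1}(u) U_n^b(u)
      })
    })
  }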
Each linear functional is represented by {θ(u, w) = ℓ(w)′β(u) + rn(u, w) : (u, w) ∈ I}, where ℓ(w)′β(u) is the series approximation, ℓ(w) ∈ R^m is a loading vector, rn(u, w) is the remainder term, and I is the set of pairs of quantile indices and covariate values of interest; see Section 4 for details and examples. Next we provide algorithms to conduct pointwise or uniform inference on linear functionals. Let B be a pre-specified number of bootstrap or simulation repetitions.

Algorithm 5 (Pointwise Inference for Linear Functionals). (1) Compute the variance estimate σ̂n²(u, w) = u(1 − u) ℓ(w)′Ĵm^{−1}(u) Σ̂m Ĵm^{−1}(u) ℓ(w)/n. (2) Using any of Algorithms 1–4, compute vectors V1(u), ..., VB(u) whose empirical distribution approximates the distribution of √n(β̂(u) − β(u)). (3) For b = 1, ..., B, compute the t-statistic t_n^{*b}(u, w) = ℓ(w)′Vb(u)/(√n σ̂n(u, w)). (4) Form a (1 − α)-confidence interval for θ(u, w) as ℓ(w)′β̂(u) ± kn(1 − α) σ̂n(u, w), where kn(1 − α) is the 1 − α sample quantile of {t_n^{*b}(u, w) : 1 ≤ b ≤ B}.

Algorithm 6 (Uniform Inference for Linear Functionals). (1) Compute the variance estimates σ̂n²(u, w) = u(1 − u) ℓ(w)′Ĵm^{−1}(u) Σ̂m Ĵm^{−1}(u) ℓ(w)/n for (u, w) ∈ I. (2) Using any of Algorithms 1–4, compute processes V1(·), ..., VB(·) whose empirical distribution approximates the distribution of {√n(β̂(u) − β(u)) : u ∈ U}. (3) For b = 1, ..., B, compute the maximal t-statistic ‖t_n^{*b}‖_I = sup_{(u,w)∈I} |ℓ(w)′Vb(u)|/(√n σ̂n(u, w)). (4) Form a (1 − α)-confidence band for {θ(u, w) : (u, w) ∈ I} as {ℓ(w)′β̂(u) ± kn(1 − α) σ̂n(u, w) : (u, w) ∈ I}, where kn(1 − α) is the 1 − α sample quantile of {‖t_n^{*b}‖_I : 1 ≤ b ≤ B}.
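A minimal sketch (ours, not the authors' code) of steps (3)–(4) of Algorithm 6, given draws of the coefficient process (e.g., from pivotal_draws above):

  # A hedged sketch of Algorithm 6's critical value; V is a list of B matrices
  # (m x length(us)) approximating sqrt(n)(betahat(u) - beta(u)); L is the
  # length(ws) x m matrix of loadings l(w)'; sig is the length(ws) x length(us)
  # matrix of standard errors sigmahat_n(u, w).
  uniform_critical_value <- function(V, L, sig, n, alpha = 0.10) {
    stats <- sapply(V, function(Vb) {
      tb <- (L %*% Vb) / (sqrt(n) * sig)  # t-statistics over the (w, u) grid
      max(abs(tb))                        # maximal t-statistic ||t*_n||_I
    })
    unname(quantile(stats, 1 - alpha))    # k_n(1 - alpha)
  }

  # The band is then l(w)'betahat(u) +/- k_n(1 - alpha) * sigmahat_n(u, w).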
Appendix B. A Result on Identification of the QR Series Approximation in Population and its Relation to the Best L2-Approximation

In this section we provide results on identification and approximation properties for the QR series approximation. In what follows, we denote z = Z(x) ∈ Z for some x ∈ X, and for a function h : X → R we define Qu(h) = E[ρu(Y − h(X))], so that β(u) ∈ arg min_{β∈R^m} Qu(Z′β). Also let f := inf_{u∈U, x∈X} fY|X(QY|X(u|x)|x), where f > 0 by condition S.2.

Consider the best L2-approximation to the conditional quantile function gu(·) = QY|X(u|·) by a linear combination of the chosen basis, namely

  β̃*(u) ∈ arg min_{β∈R^m} E[|Z′β − gu(X)|²].  (B.23)

We consider the following approximation rates associated with β̃*(u):

  c²_{u,2} = E[|Z′β̃*(u) − gu(X)|²] and c_{u,∞} = sup_{x∈X, z=Z(x)} |z′β̃*(u) − gu(x)|.

Lemma 1. Assume that conditions S.2–S.4, ζm c_{u,2} = o(1), and c_{u,∞} = o(1) hold. Then, as n grows, we have the following approximation properties for β(u):

  E[|Z′β(u) − gu(X)|²] ≤ (16 ∨ 3f̄/f) c²_{u,2}, E[|Z′β(u) − Z′β̃*(u)|²] ≤ (9 ∨ 8f̄/f) c²_{u,2},

and

  sup_{x∈X, z=Z(x)} |z′β(u) − gu(x)| ≲ c_{u,∞} + ζm √(9 ∨ 8f̄/f) c_{u,2}/√(mineig(Σm)).

Proof of Lemma 1. We assume that E[|Z′β(u) − Z′β̃*(u)|²]^{1/2} ≥ 3 E[|Z′β̃*(u) − gu(X)|²]^{1/2}; otherwise the statements follow directly. The proof proceeds in 4 steps.

Step 1 (Main argument). For notational convenience let

  q̄ = (f^{3/2}/f̄′) E[|Z′β(u) − gu(X)|²]^{3/2}/E[|Z′β(u) − gu(X)|³].

By Steps 2 and 3 below we have, respectively,

  Qu(Z′β(u)) − Qu(gu) ≤ f̄ c²_{u,2}  (B.24)

and

  Qu(Z′β(u)) − Qu(gu) ≥ (f/3) E[|Z′β(u) − gu(X)|²] ∧ (q̄/3) √(f E[|Z′β(u) − gu(X)|²]).  (B.25)

Thus

  (f/3) E[|Z′β(u) − gu(X)|²] ∧ (q̄/3) √(f E[|Z′β(u) − gu(X)|²]) ≤ f̄ c²_{u,2}.

As n grows, since f̄ c²_{u,2} < q̄²/3 by Step 4 below, it follows that E[|Z′β(u) − gu(X)|²] ≤ 3c²_{u,2} f̄/f, which proves the first statement regarding β(u). The second statement regarding β(u) follows since

  E[|Z′β(u) − Z′β̃*(u)|²]^{1/2} ≤ E[|Z′β(u) − gu(X)|²]^{1/2} + E[|Z′β̃*(u) − gu(X)|²]^{1/2} ≤ √(3c²_{u,2} f̄/f) + c_{u,2}.

Finally, the third statement follows by the triangle inequality, S.4, and the second statement:

  sup_{x∈X, z=Z(x)} |z′β(u) − gu(x)| ≤ sup_{x∈X, z=Z(x)} |z′β̃*(u) − gu(x)| + ζm ‖β(u) − β̃*(u)‖
   ≤ c_{u,∞} + ζm √(E[|Z′β̃*(u) − Z′β(u)|²]/mineig(Σm))
   ≤ c_{u,∞} + ζm √(9 ∨ 8f̄/f) c_{u,2}/√(mineig(Σm)).

Step 2 (Upper bound). For any two scalars w and v we have

  ρu(w − v) − ρu(w) = −v(u − 1{w ≤ 0}) + ∫₀^v (1{w ≤ t} − 1{w ≤ 0}) dt.  (B.26)

By (B.26) and the law of iterated expectations, for any measurable function h,

  Qu(h) − Qu(gu) = E[∫₀^{h−gu} FY|X(gu + t|X) − FY|X(gu|X) dt]
   = E[∫₀^{h−gu} t fY|X(gu + t̃_{X,t}|X) dt] ≤ (f̄/2) E[|h − gu|²],  (B.27)

where t̃_{X,t} lies between 0 and t for each t ∈ [0, h(x) − gu(x)]. Thus, (B.27) with h(X) = Z′β̃*(u) and Qu(Z′β(u)) ≤ Qu(Z′β̃*(u)) imply that

  Qu(Z′β(u)) − Qu(gu) ≤ Qu(Z′β̃*(u)) − Qu(gu) ≤ f̄ c²_{u,2}.

Step 3 (Lower bound). To establish a lower bound, note that for a measurable function h,

  Qu(h) − Qu(gu) = E[∫₀^{h−gu} FY|X(gu + t|X) − FY|X(gu|X) dt]
   = E[∫₀^{h−gu} t fY|X(gu|X) + (t²/2) f′Y|X(gu + t̃_{X,t}|X) dt]
   ≥ (f/2) E[|h − gu|²] − (1/6) f̄′ E[|h − gu|³].  (B.28)

If √(f E[|Z′β(u) − gu(X)|²]) ≤ q̄, then f̄′ E[|Z′β(u) − gu(X)|³] ≤ f E[|Z′β(u) − gu(X)|²], and (B.28) with the function h(Z) = Z′β(u) yields

  Qu(Z′β(u)) − Qu(gu) ≥ (f/3) E[|Z′β(u) − gu(X)|²].

On the other hand, if √(f E[|Z′β(u) − gu(X)|²]) > q̄, let hu(X) = (1 − α)Z′β(u) + α gu(X), where α ∈ (0, 1) is picked so that √(f E[|hu − gu|²]) = q̄. Then by (B.28) and the convexity of Qu we have

  Qu(Z′β(u)) − Qu(gu) ≥ (√(f E[|Z′β(u) − gu(X)|²])/q̄) · (Qu(hu) − Qu(gu)).

Next note that hu(X) − gu(X) = (1 − α)(Z′β(u) − gu(X)); thus

  √(f E[|hu − gu|²]) = q̄ = f^{3/2} E[|Z′β(u) − gu(X)|²]^{3/2}/(f̄′ E[|Z′β(u) − gu(X)|³]) = f^{3/2} E[|hu − gu|²]^{3/2}/(f̄′ E[|hu − gu|³]).

Using this and applying (B.28) with hu we obtain

  Qu(hu) − Qu(gu) ≥ (f/2) E[|hu − gu|²] − (1/6) f̄′ E[|hu − gu|³] = q̄²/3.

Therefore, combining the two cases,

  Qu(Z′β(u)) − Qu(gu) ≥ (f/3) E[|Z′β(u) − gu(X)|²] ∧ (q̄/3) √(f E[|Z′β(u) − gu(X)|²]).

Step 4 (f̄ c²_{u,2} < q̄²/3 as n grows). Recall that by S.4, ‖Z‖ ≤ ζm, and that we can assume E[|Z′β(u) − Z′β̃*(u)|²]^{1/2} ≥ 3 E[|Z′β̃*(u) − gu(X)|²]^{1/2}. Then, using this relation (in the second inequality),

  E[|Z′β(u) − gu(X)|²]^{3/2}/E[|Z′β(u) − gu(X)|³]
   ≥ E[|Z′β(u) − gu(X)|²]^{1/2}/sup_{x∈X, z=Z(x)} |z′β(u) − gu(x)|
   ≥ (1/2)(E[|Z′β(u) − Z′β̃*(u)|²]^{1/2} + E[|Z′β̃*(u) − gu(X)|²]^{1/2})/(sup_{x∈X, z=Z(x)} |z′β(u) − z′β̃*(u)| + sup_{x∈X, z=Z(x)} |z′β̃*(u) − gu(x)|)
   ≥ (1/2) [ E[|Z′β(u) − Z′β̃*(u)|²]^{1/2}/sup_{x∈X, z=Z(x)} |z′β(u) − z′β̃*(u)| ∧ E[|Z′β̃*(u) − gu(X)|²]^{1/2}/sup_{x∈X, z=Z(x)} |z′β̃*(u) − gu(x)| ]
   ≥ (1/2) [ κ̄‖β(u) − β̃*(u)‖/(ζm ‖β(u) − β̃*(u)‖) ∧ c_{u,2}/c_{u,∞} ] = (1/2)(κ̄/ζm ∧ c_{u,2}/c_{u,∞}),

where κ̄ = √(mineig(Σm)). Finally,

  f̄ c_{u,2} < (f^{3/2}/f̄′)(1/2)(κ̄/ζm ∧ c_{u,2}/c_{u,∞})/√3 ≤ q̄/√3

as n grows, under the conditions ζm c_{u,2} = o(1) and c_{u,∞} = o(1) and conditions S.2 and S.3.

Appendix C. Proof of Theorems 1–7

In this section we gather the proofs of Theorems 1–7 stated in the main text. We adopt the standard notation of the empirical process literature [37]. We begin by assuming that the sequences m and n satisfy m/n → 0 as m, n → ∞. For notational convenience we write ψi(β, u) = Zi(1{Yi ≤ Zi′β} − u), where Zi = Z(Xi). Also, for any sequence r = rn = o(1) and any fixed 0 < B < ∞, we define the set

  Rn,m := {(u, β) ∈ U × R^m : ‖β − β(u)‖ ≤ Br}

and the following error terms:

  ε0(m, n) := sup_{u∈U} ‖Gn(ψi(β(u), u))‖,
  ε1(m, n) := sup_{(u,β)∈Rn,m} ‖Gn(ψi(β, u)) − Gn(ψi(β(u), u))‖,
  ε2(m, n) := sup_{(u,β)∈Rn,m} n^{1/2} ‖E[ψi(β, u)] − E[ψi(β(u), u)] − Jm(u)(β − β(u))‖.

In what follows, we say that the data are in general position if, for any γ ∈ R^m, P(Yi = Zi′γ for at least one i) = 0. Under the bounded density condition S.2, the data are in general position.
We also assume i.i.d. sampling throughout the appendix, although this condition is not needed for most of the results.

C.1. Proof of Theorem 1. We start by establishing uniform rates of convergence. The following technical lemma will be used in the proof of Theorem 1.

Lemma 2 (Rates in Euclidean Norm for the Perturbed QR Process). Suppose that β̂(u) is a minimizer of En[ρu(Yi − Zi′β)] + An(u)′β for each u ∈ U, and that the perturbation term obeys sup_{u∈U} ‖An(u)‖ ≲P r = o(1). The unperturbed case corresponds to An(·) = 0. If inf_{u∈U} mineig[Jm(u)] > J > 0 and the conditions

  R1. ε0(m, n) ≲P √n r,
  R2. ε1(m, n) ≲P √n r,
  R3. ε2(m, n) ≲P √n r,

hold, where the constants in the bounds above can be taken to be independent of the constant B in the definition of Rn,m, then for any ε > 0 there is a sufficiently large B such that, with probability at least 1 − ε, (u, β̂(u)) ∈ Rn,m uniformly over u ∈ U, that is,

  sup_{u∈U} ‖β̂(u) − β(u)‖ ≲P r.  (C.29)

Proof of Lemma 2. Due to the convexity of the objective function, it suffices to show that for any ε > 0 there exists B < ∞ such that

  P( inf_{u∈U} inf_{‖η‖=1} η′[En[ψi(β, u)] + An(u)]|_{β=β(u)+Brη} > 0 ) ≥ 1 − ε.  (C.30)

Indeed, the quantity En[ψi(β, u)] + An(u) is a subgradient of the objective function at β. Observe that, uniformly in u ∈ U,

  √n η′En[ψi(β(u) + Brη, u)] ≥ Gn(η′ψi(β(u), u)) + η′Jm(u)η B√n r − ε1(m, n) − ε2(m, n),

since E[ψi(β(u), u)] = 0 by definition of β(u) (see the argument in the proof of Lemma 3). Invoking R2 and R3, ε1(m, n) + ε2(m, n) ≲P √n r, and by R1, uniformly in η ∈ S^{m−1} we have

  |Gn(η′ψi(β(u), u))| ≤ sup_{u∈U} ‖Gn(ψi(β(u), u))‖ = ε0(m, n) ≲P √n r.

Then the event of interest in (C.30) is implied by the event

  { η′Jm(u)η B√n r − ε0(m, n) − ε1(m, n) − ε2(m, n) − √n sup_{u∈U} ‖An(u)‖ > 0 },

whose probability can be made arbitrarily close to 1, for large n, by setting B sufficiently large, since sup_{u∈U} ‖An(u)‖ ≲P r and η′Jm(u)η ≥ J > 0 by the condition on the eigenvalues of Jm(u).

Proof of Theorem 1. Let φn = sup_{α∈S^{m−1}} E[(Zi′α)²] ∨ En[(Zi′α)²]. Under ζm² log n = o(n) and S.3, φn ≲P 1 by Corollary 2 in Appendix G. Next recall that Rn,m = {(u, β) ∈ U × R^m : ‖β − β(u)‖ ≤ Br} for some fixed B large enough. Under S.1–S.5, ε0(m, n) ≲P √(m φn log n) by Lemma 23, and ε1(m, n) ≲P √(m φn log n) by Lemma 22, where neither of these bounds depends on B. Under S.1–S.5, ε2(m, n) ≲ √n ζm B²r² + √n m^{−κ} Br ≲ √n r by Lemma 24, provided ζm B²r = o(1) and m^{−κ}B = o(1). Finally, since ζm² m log n = o(n) by the growth condition of the theorem, we can take r = √((m log n)/n) in Lemma 2 with An(u) = 0, and the result follows.

C.2. Proof of Theorems 2, 3, 4, 5 and 6. In order to establish our uniform linear approximations for the perturbed QR process, it is convenient to define the following approximation error:

  ε3(m, n) := sup_{u∈U} n^{1/2} ‖En[ψi(β̂(u), u)] + An(u)‖.

Lemma 3 (Uniform Linear Approximation). Suppose that β̂(u) is a minimizer of En[ρu(Yi − Zi′β)] + An(u)′β for each u ∈ U, the data are in general position, and the conditions of Lemma 2 hold. The unperturbed case corresponds to An(·) = 0. Then

  √n Jm(u)(β̂(u) − β(u)) = −(1/√n) Σ_{i=1}^n ψi(β(u), u) − √n An(u) + rn(u),  (C.31)

where sup_{u∈U} ‖rn(u)‖ ≲P ε1(m, n) + ε2(m, n) + ε3(m, n).

Proof of Lemma 3. First note that E[ψi(β(u), u)] = 0 by definition of β(u). Indeed, despite the possible approximation error, β(u) minimizes E[ρu(Y − Z′β)], which yields E[ψi(β(u), u)] = 0 by the first-order conditions.
Therefore equation (C.31) can be recast as

  rn(u) = √n Jm(u)(β̂(u) − β(u)) + Gn(ψi(β(u), u)) + √n An(u).

Next note that if (u, β̂(u)) ∈ Rn,m uniformly in u ∈ U, then by the triangle inequality, uniformly in u ∈ U,

  ‖rn(u)‖ ≤ ‖Gn(ψi(β(u), u)) − Gn(ψi(β̂(u), u))‖
   + n^{1/2} ‖E[ψi(β̂(u), u)] − E[ψi(β(u), u)] − Jm(u)(β̂(u) − β(u))‖
   + n^{1/2} ‖En[ψi(β̂(u), u)] + An(u)‖ ≤ ε1(m, n) + ε2(m, n) + ε3(m, n).

The result follows by Lemma 2, which bounds from below the probability that (u, β̂(u)) ∈ Rn,m uniformly in u ∈ U.

Proof of Theorem 2. We first control the approximation errors ε0, ε1, ε2, and ε3. Lemma 23 implies that ε0(m, n) ≲P √(m φn log n). Lemma 24 implies that ε1(m, n) ≲P √(m ζm r log n) + (m ζm/√n) log n and ε2(m, n) ≲P √n ζm r² + √n m^{−κ} r. Next note that under the bounded density condition in S.2, the data are in general position. Therefore, by Lemma 26, with probability 1 we have

  ε3(m, n) ≤ m ζm/√n.

Thus the assumptions required by Lemma 3 follow under m³ζm² log³ n = o(n), the uniform rate r = √((m log n)/n), and the condition on κ. Thus, by Lemma 3 with An(·) = 0,

  √n Jm(u)(β̂(u) − β(u)) = −(1/√n) Σ_{i=1}^n ψi(β(u), u) + rn(u), where
  sup_{u∈U} ‖rn(u)‖ ≲P √(m ζm r log n) + (m ζm/√n) log n + m^{−κ} √(m log n) + m ζm/√n.

Next, let

  r̃u := (1/√n) Σ_{i=1}^n Zi(1{Yi ≤ Zi′β(u) + R(Xi, u)} − 1{Yi ≤ Zi′β(u)}).

By Lemma 25, sup_{u∈U} ‖r̃u‖ ≲P √(m^{1−κ} log n) + (m ζm/√n) log n. The result follows since Un(u) =d (1/√n) Σ_{i=1}^n Zi(u − 1{Yi ≤ Zi′β(u) + R(Xi, u)}).

Proof of Theorem 3. Note that

  sup_{u∈U} ‖rn(u)‖ ≤ sup_{α∈S^{m−1}, u∈U} |α′(Jm^{−1}(u) − Ĵm^{−1}(u))α| · sup_{u∈U} ‖U*n(u)‖.

To bound the second term, by Lemma 23,

  sup_{u∈U} ‖U*n(u)‖ ≲P √(m log n),  (C.32)

since φn ≲P 1 by Corollary 2 in Appendix G under ζm² log n = o(n). Next, since Jm(u) has eigenvalues bounded away from zero by S.2 and S.3, Jm^{−1}(u) has eigenvalues bounded above by a constant. Moreover, by Lemma 4,

  sup_{α∈S^{m−1}, u∈U} |α′(Ĵm(u) − Jm(u))α| ≲P ε5(m, n) + ε6(m, n)
   ≲P √((ζm² m log n)/(n hn)) + (m ζm² log n)/(n hn) + ζm √((m log n)/n) + m^{−κ} + hn,

where the second inequality follows by Lemma 28 with r = √((m log n)/n). Under the growth conditions stated in the theorem,

  sup_{α∈S^{m−1}, u∈U} |α′(Ĵm(u) − Jm(u))α| = oP(1/√(m log³ n)).

Thus, with probability approaching one, Ĵm^{−1}(u) has eigenvalues bounded above by a constant uniformly in u ∈ U. Moreover, by the matrix identity A^{−1} − B^{−1} = B^{−1}(B − A)A^{−1},

  |α′(Jm^{−1}(u) − Ĵm^{−1}(u))α| = |α′Ĵm^{−1}(u)(Ĵm(u) − Jm(u))Jm^{−1}(u)α|
   ≤ ‖α′Ĵm^{−1}(u)‖ sup_{α̃∈S^{m−1}} |α̃′(Ĵm(u) − Jm(u))α̃| ‖Jm^{−1}(u)α‖ = oP(1/√(m log³ n)),  (C.33)

which holds uniformly over α ∈ S^{m−1} and u ∈ U. Finally, the stated bound on sup_{u∈U} ‖rn(u)‖ follows by combining relations (C.32) and (C.33).

Proof of Theorem 4. The first part of the proof is similar to the proof of Theorem 2, but applying Lemma 3 twice, once to the unperturbed problem and once to the perturbed problem with An(u) = −U*n(u)/√n for every u ∈ U. Consider the set E of realizations of the data Dn such that sup_{‖α‖=1} En[(α′Zi)²] ≲ 1. By Corollary 2 in Appendix G and assumptions S.3 and S.4, P(E) = 1 − o(1) under our growth conditions. Thus, by Lemma 23,

  sup_{u∈U} ‖An(u)‖ ≲P r = √((m log n)/n),

where we used that φn ≲P 1 by Corollary 2 in Appendix G under ζm² log n = o(n) and S.3. Then

  √n Jm(u)(β̂*(u) − β̂(u)) = √n Jm(u)(β̂*(u) − β(u)) − √n Jm(u)(β̂(u) − β(u)) = U*n(u) + rn^{pert}(u) − rn^{unpert}(u),

where

  sup_{u∈U} ‖rn^{pert}(u) − rn^{unpert}(u)‖ ≲P √(m ζm r log n) + (m ζm/√n) log n + m^{−κ} √(m log n) + m ζm/√n.
Note also that the results continue to hold in P-probability if we replace P by P*, since if a random variable Bn = OP(1), then Bn = OP*(1). Indeed, the first relation means that P(|Bn| > ℓn) = o(1) for any ℓn → ∞, while the second means that P*(|Bn| > ℓn) = oP(1) for any ℓn → ∞. But the second clearly follows from the first by the Markov inequality, observing that E[P*(|Bn| > ℓn)] = P(|Bn| > ℓn) = o(1).

Proof of Theorem 5. The existence of the Gaussian process with the specified covariance structure is trivial. To establish the additional coupling, note that under S.1–S.4 and the growth conditions, sup_{‖α‖=1} En[(α′Zi)²] ≲P 1 by Corollary 2 in Appendix G. Conditioning on any realization of Z1, ..., Zn such that sup_{‖α‖=1} En[(α′Zi)²] ≲ 1, the existence of the Gaussian process with the specified covariance structure that also satisfies the coupling condition follows from Lemma 8. Under the conditions of the theorem, sup_{u∈U} ‖Un(u) − Gn(u)‖ = oP(1/log n). Therefore, the second statement follows since

  sup_{u∈U} ‖√n(β̂(u) − β(u)) − Jm^{−1}(u)Gn(u)‖
   ≤ sup_{u∈U} ‖Jm^{−1}(u)Un(u) − Jm^{−1}(u)Gn(u)‖ + oP(1/log n)
   ≤ sup_{u∈U} ‖Jm^{−1}(u)‖ sup_{u∈U} ‖Un(u) − Gn(u)‖ + oP(1/log n) = oP(1/log n),

where we invoke Theorem 2 under the growth conditions, and the fact that the eigenvalues of Jm(u) are bounded away from zero uniformly in n by S.2 and S.3. The last statement proceeds similarly to the proof of Theorem 3.

Proof of Theorem 6. Note that β̂^b(u) solves the quantile regression problem for the rescaled data {(hi Yi, hi Zi) : 1 ≤ i ≤ n}. The weight hi is independent of (Yi, Zi), E[hi] = 1, E[(hi − 1)²] = 1, and max_{1≤i≤n} hi ≲P log n. This allows us to extend all results from β̂(u) to β̂^b(u), replacing ζm by ζ̂m = ζm log n to account for the larger envelope, and ψi(β, u) by ψi^b(β, u) = hi ψi(β, u). The first part of the proof is similar to the proof of Theorem 2, but applying Lemma 3 twice, once to the original problem and once to the problem weighted by {hi}. Then

  √n(β̂^b(u) − β̂(u)) = √n(β̂^b(u) − β(u)) − √n(β̂(u) − β(u))
   = (Jm^{−1}(u)/√n) Σ_{i=1}^n hi Zi(u − 1{Yi ≤ Zi′β(u)}) + rn^b(u) − (Jm^{−1}(u)/√n) Σ_{i=1}^n Zi(u − 1{Yi ≤ Zi′β(u)}) − rn(u),

where sup_{u∈U} ‖rn(u) − rn^b(u)‖ ≲P √(m ζm r log² n) + (m ζm/√n) log² n + m^{−κ} √(m log n) + (m ζm/√n) log n. Next, let

  r̃u := (1/√n) Σ_{i=1}^n (hi − 1) Zi(1{Yi ≤ Zi′β(u) + R(Xi, u)} − 1{Yi ≤ Zi′β(u)}).

By Lemma 25, sup_{u∈U} ‖r̃u‖ ≲P √(m^{1−κ} log n) + (ζm m/√n) log² n. The result follows since 1{Ui ≤ u} = 1{Yi ≤ Zi′β(u) + R(Xi, u)}. Note also that the results continue to hold in P-probability if we replace P by P*, by the same argument as in the proof of Theorem 4.

The second part of the theorem follows by applying Lemma 8 with vi = hi − 1, where hi ~ exp(1), so that E[vi²] = 1, E[|vi|³] ≲ 1 and max_{1≤i≤n} |vi| ≲P log n. Lemma 8 implies that there is a Gaussian process Gn(·) with covariance structure En[Zi Zi′](u ∧ u′ − uu′) such that

  sup_{u∈U} ‖(1/√n) Σ_{i=1}^n (hi − 1) Zi(u − 1{Ui ≤ u}) − Gn(u)‖ = oP(1/log n).

Combining the result above with the first part of the theorem, the second part follows by the triangle inequality.
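The rescaling observation in the proof above also gives a direct implementation of the weighted bootstrap of Algorithm 3: since ρu(h(y − z′β)) = h ρu(y − z′β) for h > 0, the weighted QR equals ordinary QR on the rescaled data {(hi Yi, hi Zi)}. A minimal sketch (ours, not the authors' code) for a single quantile u:

  # A hedged sketch of the exponential-weight bootstrap of Algorithm 3,
  # implemented by rescaling the data and reusing a standard QR solver.
  library(quantreg)

  weighted_bootstrap <- function(Z, y, u, B = 199) {
    b0 <- rq.fit(Z, y, tau = u)$coefficients           # betahat(u)
    replicate(B, {
      h <- rexp(length(y))                             # standard exponential weights
      rq.fit(Z * h, y * h, tau = u)$coefficients - b0  # betahat^b(u) - betahat(u)
    })
  }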
C.3. Proof of Theorem 7 on Matrices Estimation. Consider the following quantities:

  ε4(m, n) = sup_{α∈S^{m−1}} n^{−1/2} |Gn((α′Zi)²)|,
  ε5(m, n) = sup_{α∈S^{m−1}, (u,β)∈Rn,m} (n^{−1/2}/(2hn)) |Gn(1{|Yi − Zi′β| ≤ hn}(α′Zi)²)|,
  ε6(m, n) = sup_{α∈S^{m−1}, (u,β)∈Rn,m} |(1/(2hn)) E[1{|Yi − Zi′β| ≤ hn}(α′Zi)²] − α′Jm(u)α|,

where hn is a bandwidth parameter such that hn → 0.

Lemma 4 (Estimation of Variance Matrices and Jacobians). Under the definitions of ε4, ε5, and ε6 we have

  sup_{α∈S^{m−1}} |α′(Σ̂m − Σm)α| ≤ ε4(m, n),
  sup_{α∈S^{m−1}, u∈U} |α′(Ĵm(u) − Jm(u))α| ≤ ε5(m, n) + ε6(m, n).

Proof. The first relation follows by the definition of ε4(m, n), and the second by the triangle inequality and the definitions of ε5(m, n) and ε6(m, n).

Proof of Theorem 7. This follows immediately from Lemma 4, Lemma 27, and Lemma 28.

Appendix D. Proofs of Theorems 8 and 9

Proof of Theorem 8. By Theorem 2 and P1,

  |θ̂(u, w) − θ(u, w)| ≤ |ℓ(w)′(β̂(u) − β(u))| + |rn(u, w)|
   ≤ |ℓ(w)′Jm^{−1}(u)Un(u)|/√n + oP(‖ℓ(w)‖/(√n log n)) + o(‖ℓ(w)‖/√n)
   ≤ |ℓ(w)′Jm^{−1}(u)Un(u)|/√n + oP(ξθ(m, w)/(√n log n)) + o(ξθ(m, w)/√n),

where the last inequality follows by ‖ℓ(w)‖ ≲ ξθ(m, w), assumed in P2. Finally, since E[|ℓ(w)′Jm^{−1}(u)Un(u)|²] ≲ ‖ℓ(w)‖² ‖Jm^{−1}(u)‖² sup_{α∈S^{m−1}} E[(α′Z)²] ≲ ξθ²(m, w), the result follows by applying the Chebyshev inequality to establish that |ℓ(w)′Jm^{−1}(u)Un(u)| ≲P ξθ(m, w).

Proof of Theorem 9. By Assumption P1 and Theorem 2,

  tn(u, w) = ℓ(w)′(β̂(u) − β(u))/σ̂n(u, w) + rn(u, w)/σ̂n(u, w)
   = ℓ(w)′Jm^{−1}(u)Un(u)/(√n σ̂n(u, w)) + oP(‖ℓ(w)‖/(√n σ̂n(u, w) log n)) + oP(‖ℓ(w)‖/(√n σ̂n(u, w))).

To show that the last two terms are oP(1), note that under conditions S.2 and S.3, σ̂n(u, w) ≳P ‖ℓ(w)‖/√n, since σ̂n(u, w) = (1 + oP(1))σn(u, w) by Theorem 7. Finally, for any ε > 0,

  E[(ℓ(w)′Jm^{−1}(u)Zi)²/σn²(u, w) · 1{|ℓ(w)′Jm^{−1}(u)Zi| ≥ ε√n σn(u, w)}] → 0

under our growth conditions. Thus, by the Lindeberg–Feller central limit theorem and σ̂n(u, w) = (1 + oP(1))σn(u, w), it follows that

  ℓ(w)′Jm^{−1}(u)Un(u)/(√n σ̂n(u, w)) →d N(0, 1),

and the first result follows. The remaining results for t*n(u, w) follow similarly under the corresponding conditions.

Appendix E. Proofs of Theorems 10–12

Lemma 5 (Entropy Bound). Let W ⊂ R^d be a bounded set, and let ℓ : W → R^m be a mapping such that, for ξθ(m) and ξθ^L(m) ≥ 1,

  ‖ℓ(w)‖ ≤ ξθ(m) and ‖ℓ(w) − ℓ(w̃)‖ ≤ ξθ^L(m)‖w − w̃‖, for all w, w̃ ∈ W.

Moreover, assume that ‖Jm(u) − Jm(u′)‖ ≤ ξ(m)|u − u′| for u, u′ ∈ U in the operator norm, with ξ(m) ≥ 1, and let µ = sup_{u∈U} ‖Jm(u)^{−1}‖ > 0. Let Lm be the class of functions

  Lm = {f_{u,w}(g) = ℓ(w)′Jm^{−1}(u)g : u ∈ U, w ∈ W} ∪ {f(g) = 1},

where g ∈ B(0, ζm), and let F denote the envelope of Lm, i.e. F = sup_{f∈Lm} |f|. Then, for any ε < ε0, the uniform entropy of Lm is bounded by

  sup_Q N(ε‖F‖_{Q,2}, Lm, L2(Q)) ≤ ((ε0 + diam(U × W)K)/ε)^{d+1},

where K := µ ξθ^L(m) ζm + µ² ξθ(m) ξ(m) ζm.

Proof. The uniform entropy bound is based on the proof of Theorem 2 of Andrews [1]. We first exclude the constant function equal to 1 and compute the Lipschitz constant associated with the other functions in Lm as follows:

  |ℓ(w)′Jm^{−1}(u)g − ℓ(w̃)′Jm^{−1}(ũ)g| = |(ℓ(w) − ℓ(w̃))′Jm^{−1}(u)g + ℓ(w̃)′(Jm^{−1}(u) − Jm^{−1}(ũ))g|
   = |(ℓ(w) − ℓ(w̃))′Jm^{−1}(u)g + ℓ(w̃)′Jm^{−1}(ũ)(Jm(ũ) − Jm(u))Jm^{−1}(u)g|
   ≤ ξθ^L(m)‖w − w̃‖ ‖Jm^{−1}(u)g‖ + ‖ℓ(w̃)′Jm^{−1}(ũ)‖ ξ(m)|ũ − u| ‖Jm^{−1}(u)g‖
   ≤ K(‖w − w̃‖ + |ũ − u|),

where K := µ ξθ^L(m) ζm + µ² ξθ(m) ξ(m) ζm. Consider the functions f_{ũj,w̃j}, where the (ũj, w̃j)'s are points at the centers of disjoint cubes of diameter ε‖F‖_{Q,2}/K whose union covers U × W.
Thus, for any (u, w) ∈ U × W,

  min_{ũj,w̃j} ‖f_{u,w} − f_{ũj,w̃j}‖_{Q,2} ≤ K min_{ũj,w̃j} (‖w − w̃j‖ + |ũj − u|) ≤ ε‖F‖_{Q,2}.

Adding (if necessary) the constant function to the cover, for any measure Q we obtain

  N(ε‖F‖_{Q,2}, Lm, L2(Q)) ≤ 1 + (diam(U × W)K/(ε‖F‖_{Q,2}))^{d+1}.

The result follows by noting that ε < ε0 and that ‖F‖_{Q,2} ≥ 1, since Lm contains the constant function equal to 1.

Lemma 6. Let F and G be two classes of functions with envelopes F and G, respectively. Then the entropy of FG = {fg : f ∈ F, g ∈ G} satisfies

  sup_Q N(ε‖FG‖_{Q,2}, FG, L2(Q)) ≤ sup_Q N((ε/2)‖F‖_{Q,2}, F, L2(Q)) · sup_Q N((ε/2)‖G‖_{Q,2}, G, L2(Q)).

Proof. The proof is similar to Theorem 3 of Andrews [1]. For any measure Q we denote cF = ‖F‖²_{Q,2} and cG = ‖G‖²_{Q,2}. Note that for any measurable set A, QF(A) = ∫_A F²(x) dQ(x)/cF and QG(A) = ∫_A G²(x) dQ(x)/cG are also measures. Let

  K = sup_{QF} N(ε‖G‖_{QF,2}, G, L2(QF)) = sup_Q N(ε‖G‖_{Q,2}, G, L2(Q)) and
  L = sup_{QG} N(ε‖F‖_{QG,2}, F, L2(QG)) = sup_Q N(ε‖F‖_{Q,2}, F, L2(Q)).

Let g1, ..., gK and f1, ..., fL denote the functions in G and F used to build the covers above. Since F ≥ |f| and G ≥ |g| we have

  min_{ℓ≤L, k≤K} ‖fg − fℓ gk‖_{Q,2} ≤ min_{ℓ≤L, k≤K} (‖f(g − gk)‖_{Q,2} + ‖gk(f − fℓ)‖_{Q,2})
   ≤ min_{k≤K} ‖F(g − gk)‖_{Q,2} + min_{ℓ≤L} ‖G(f − fℓ)‖_{Q,2}
   = √cF min_{k≤K} ‖g − gk‖_{QF,2} + √cG min_{ℓ≤L} ‖f − fℓ‖_{QG,2}
   ≤ ε√cF ‖G‖_{QF,2} + ε√cG ‖F‖_{QG,2} = 2ε‖FG‖_{Q,2},

using cF ‖G‖²_{QF,2} = ‖FG‖²_{Q,2} = cG ‖F‖²_{QG,2}. Therefore, by taking pairwise products of g1, ..., gK and f1, ..., fL to create a net, we have

  N(ε‖FG‖_{Q,2}, FG, L2(Q)) ≤ N((ε/2)‖F‖_{Q,2}, F, L2(Q)) · N((ε/2)‖G‖_{Q,2}, G, L2(Q)).

The result follows by taking the supremum over Q on both sides.

Proof of Theorem 10. By the triangle inequality,

  sup_{(u,w)∈I} |θ̂(u, w) − θ(u, w)| ≤ sup_{(u,w)∈I} |ℓ(w)′(β̂(u) − β(u))| + sup_{(u,w)∈I} |rn(u, w)|,

where the second term satisfies sup_{(u,w)∈I} |rn(u, w)/‖ℓ(w)‖| = o(n^{−1/2} log^{−1} n) by condition U.1. By Theorem 2, the first term is bounded uniformly over I by

  |ℓ(w)′(β̂(u) − β(u))| ≲P |ℓ(w)′Jm^{−1}(u)Un(u)|/√n + oP(ξθ(m, I)/(√n log n)),  (E.34)

since ‖ℓ(w)‖ ≤ ξθ(m, I) by U.2 and the remainder term of the linear representation in Theorem 2 satisfies sup_{u∈U} ‖rn(u)‖ = oP(1/log n). Given Zi = Z(Xi), consider the classes of functions

  Fm = {ℓ(w)′Jm^{−1}(u)Zi : (u, w) ∈ I}, G = {(1{Ui ≤ u} − u) : u ∈ U}, and Lm = Fm ∪ {f(Zi) = 1}.

By Lemma 5 and Lemma 6, the entropy of the class of functions LmG = {fg : f ∈ Lm, g ∈ G} satisfies, for K ≲ (ξθ(m, I) ∨ 1)(ξθ^L(m, I) ∨ 1)(ζm ∨ 1),

  sup_Q log N(ε‖F‖_{Q,2}, LmG, L2(Q)) ≲ (d + 1) log[(1 + diam(I)K)/ε] ≲ log(n/ε)

by our assumptions. Noting that under S.3

  sup_{(u,w)∈I} √(u(1 − u) ℓ(w)′Jm^{−1}(u)Σm Jm^{−1}(u)ℓ(w)) ≲ ξθ(m, I),

Lemma 16 applied to LmG with J(m) = √(d log n) and M(m, n) ≲ 1 ∨ ξθ(m, I)ζm yields

  sup_{(u,w)∈I} |ℓ(w)′Jm^{−1}(u)Un(u)| ≲P √(d log n) ( 1 ∨ ξθ²(m, I) + (d log n)(1 ∨ ξθ²(m, I)ζm²)(log n)/n )^{1/2} log^{1/2} n,

where the second term in the parenthesis is negligible compared to the first under our growth conditions. The result follows by using this bound in (E.34).

Proof of Theorem 11. Under the conditions of Theorems 2 and 3, sup_{u∈U} |maxeig(Ĵm^{−1}(u) − Jm^{−1}(u))| = oP(1/(√m log^{3/2} n)), and σ̂n(u, w) = (1 + oP(1/(√m log^{3/2} n))) σn(u, w) uniformly in (u, w) ∈ I. Note that σ̂n(u, w) ≳P ‖ℓ(w)‖/√n. Next define t̄*n(u, w) as follows:

  for the pivotal and gradient bootstrap couplings: t̄*n(u, w) = ℓ(w)′Jm^{−1}(u)U*n(u)/(√n σn(u, w));
  for the Gaussian and weighted bootstrap couplings: t̄*n(u, w) = ℓ(w)′Jm^{−1}(u)Gn(u)/(√n σn(u, w)).
Note that t̄*n is independent of the data conditional on the regressor sequence Zn = (Z1, ..., Zn), unlike t*n, which has some dependence on the data through various estimated quantities. For the case of the pivotal and gradient bootstrap couplings, by Theorem 2 and condition U,

  tn(u, w) =d t̄*n(u, w) + oP(1/log n) in ℓ∞(I).

Moreover, for the case of the Gaussian and weighted bootstrap couplings, under the conditions of Theorem 5,

  tn(u, w) =d t̄*n(u, w) + oP(1/log n) in ℓ∞(I).

Finally, under the growth conditions, t̄*n(u, w) = t*n(u, w) + oP(1/log n) in ℓ∞(I). Thus, it follows that, uniformly in (u, w) ∈ I,

  tn(u, w) =d t̄*n(u, w) + oP(1/log n) = t*n(u, w) + oP(1/log n),

and the result follows.

Proof of Theorem 12. Let εn = 1/log n, and let δn be such that δn log^{1/2} n → 0 and δn/εn → ∞.

Step 1. By the proof of Theorem 11 there is an approximation ‖t̄*n‖_I = sup_{(u,w)∈I} |t̄*n(u, w)| to ‖t*n‖_I = sup_{(u,w)∈I} |t*n(u, w)|, which does not depend on the data conditional on the regressor sequence Zn = (Z1, ..., Zn), such that

  P(| ‖t̄*n‖_I − ‖t*n‖_I | ≤ εn) = 1 − o(1).

Now let

  kn(1 − α) := the (1 − α)-quantile of ‖t*n‖_I, conditional on Dn, and
  κn(1 − α) := the (1 − α)-quantile of ‖t̄*n‖_I, conditional on Dn.

Note that since ‖t̄*n‖_I is conditionally independent of Y1, ..., Yn, κn(1 − α) equals the (1 − α)-quantile of ‖t̄*n‖_I conditional on Zn. Then, applying Lemma 7 to ‖t̄*n‖_I and ‖t*n‖_I, we get that for some νn ↓ 0,

  P[κn(p) ≥ kn(p − νn) − εn and kn(p) ≥ κn(p − νn) − εn] = 1 − o(1).

Step 2. Claim (1) now follows by noting that

  P{‖tn‖_I > kn(1 − α) + δn} ≤ P{‖tn‖_I > κn(1 − α − νn) − εn + δn} + o(1)
   ≤ P{‖t*n‖_I > κn(1 − α − νn) − 2εn + δn} + o(1)
   ≤ P{‖t*n‖_I > kn(1 − α − 2νn) − 3εn + δn} + o(1)
   ≤ P{‖t*n‖_I > kn(1 − α − 2νn)} + o(1)
   = EP[P{‖t*n‖_I > kn(1 − α − 2νn)|Dn}] + o(1) ≤ EP[α + 2νn] + o(1) = α + o(1).

Claim (2) follows from the equivalence of the event {θ(u, w) ∈ [ι̇(u, w), ϊ(u, w)] for all (u, w) ∈ I} and the event {‖tn‖_I ≤ cn(1 − α)}.

To prove Claim (3), note that σ̂n(u, w) = (1 + oP(1))σn(u, w) uniformly in (u, w) ∈ I under the conditions of Theorems 2 and 7. Moreover, cn(1 − α) = kn(1 − α)(1 + oP(1)) because 1/kn(1 − α) ≲P 1 and δn → 0. Combining these relations, the result follows.

Claim (4) follows from Claim (1) and from the following lower bound. By Lemma 7, we get that for some νn ↓ 0,

  P[κn(p + νn) + εn ≥ kn(p) and kn(p + νn) + εn ≥ κn(p)] = 1 − o(1).

Then

  P{‖tn‖_I ≥ kn(1 − α) + δn} ≥ P{‖tn‖_I ≥ κn(1 − α + νn) + εn + δn} − o(1)
   ≥ P{‖t*n‖_I ≥ κn(1 − α + νn) + 2εn + δn} − o(1)
   ≥ P{‖t*n‖_I ≥ kn(1 − α + 2νn) + 3εn + δn} − o(1)
   ≥ P{‖t*n‖_I ≥ kn(1 − α + 2νn) + 2δn} − o(1)
   ≥ E[P{‖t*n‖_I ≥ kn(1 − α + 2νn) + 2δn|Dn}] − o(1) = α − 2νn − o(1) = α − o(1),

where we used the anti-concentration property in the last step.

We used the following lemma in the proof of Theorem 12.

Lemma 7 (Closeness in Probability Implies Closeness of Conditional Quantiles). Let Xn and Yn be random variables and Dn be a random vector. Let F_{Xn}(x|Dn) and F_{Yn}(x|Dn) denote the conditional distribution functions, and F^{−1}_{Xn}(p|Dn) and F^{−1}_{Yn}(p|Dn) the corresponding conditional quantile functions. If |Xn − Yn| = oP(ε), then, for some νn ↓ 0, with probability converging to one,

  F^{−1}_{Xn}(p|Dn) ≤ F^{−1}_{Yn}(p + νn|Dn) + ε and F^{−1}_{Yn}(p|Dn) ≤ F^{−1}_{Xn}(p + νn|Dn) + ε, ∀p ∈ (νn, 1 − νn).

Proof. We have that for some νn ↓ 0, P{|Xn − Yn| > ε} = o(νn). This implies that P[P{|Xn − Yn| > ε|Dn} ≤ νn] → 1, i.e.,
there is a set Ωn such that P(Ωn) → 1 and P{|Xn − Yn| > ε|Dn} ≤ νn for all Dn ∈ Ωn. So, for all Dn ∈ Ωn,

  F_{Xn}(x|Dn) ≥ F_{Yn+ε}(x|Dn) − νn and F_{Yn}(x|Dn) ≥ F_{Xn+ε}(x|Dn) − νn, ∀x ∈ R,

which implies the inequality stated in the lemma, by definition of the conditional quantile function and equivariance of quantiles to location shifts.

Appendix F. A Lemma on Strong Approximation of an Empirical Process of Increasing Dimension by a Gaussian Process

Lemma 8 (Approximation of a Sequence of Empirical Processes of Increasing Dimension by a Sequence of Gaussian Processes). Consider the empirical process Un in [ℓ∞(U)]^m, U ⊆ (0, 1), conditional on Zi ∈ R^m, i = 1, ..., n, defined by

  Un(u) = Gn(vi Zi ψi(u)), ψi(u) = u − 1{Ui ≤ u},

where Ui, i = 1, ..., n, is an i.i.d. sequence of standard uniform random variables, and vi, i = 1, ..., n, is an i.i.d. sequence of real-valued random variables such that E[vi²] = 1, E[|vi|³] ≲ 1, and max_{1≤i≤n} |vi| ≲P log n. Suppose that Zi, i = 1, ..., n, are such that

  sup_{‖α‖≤1} En[(α′Zi)²] ≲ 1, max_{1≤i≤n} ‖Zi‖ ≲ ζm, and m⁷ ζm⁶ log²² n = o(n).

There exists a sequence of zero-mean Gaussian processes Gn with a.s. continuous paths that has the same covariance functions as Un, conditional on Z1, ..., Zn, namely

  E[Gn(u)Gn(u′)′] = E[Un(u)Un(u′)′] = En[Zi Zi′](u ∧ u′ − uu′) for all u, u′ ∈ U,

and that approximates the empirical process Un, namely

  sup_{u∈U} ‖Un(u) − Gn(u)‖ = oP(1/log n).

Proof. The proof is based on the use of maximal inequalities and Yurinskii's coupling. Throughout the proof all probability statements are conditional on Z1, ..., Zn. We define the sequence of projections πj : U → U, j = 0, 1, 2, ..., by πj(u) = (2k − 1)/2^j if u ∈ ((2k − 2)/2^j, (2k)/2^j), k = 1, ..., 2^{j−1}, and πj(u) = u if u = 0 or 1. In what follows, given a process G in [ℓ∞(U)]^m and its projection G ∘ πj, whose paths are step functions with 2^j steps, we shall identify the process G ∘ πj with a random vector G ∘ πj in R^{2^j m} when convenient. Analogously, given a random vector W in R^{2^j m}, we identify it with a process W in [ℓ∞(U)]^m whose paths are step functions with 2^j steps. The following relations are proven below:

(1) (Finite-Dimensional Approximation)
  r1 = sup_{u∈U} ‖Un(u) − Un ∘ πj(u)‖ = oP(1/log n);

(2) (Coupling with a Normal Vector) there exists Nnj =d N(0, var[Un ∘ πj]) such that
  r2 = ‖Nnj − Un ∘ πj‖₂ = oP(1/log n);

(3) (Embedding a Normal Vector into a Gaussian Process) there exists a Gaussian process Gn with the properties stated in the lemma such that Nnj = Gn ∘ πj a.s.;

(4) (Infinite-Dimensional Approximation)
  r3 = sup_{u∈U} ‖Gn(u) − Gn ∘ πj(u)‖ = oP(1/log n).

The result then follows from the triangle inequality:

  sup_{u∈U} ‖Un(u) − Gn(u)‖ ≤ r1 + r2 + r3.

Relation (1) follows from

  r1 = sup_{u∈U} ‖Un(u) − Un ∘ πj(u)‖ ≤ sup_{|u−u′|≤2^{−j}} ‖Un(u) − Un(u′)‖
   ≲P √(2^{−j} m log n) + √((m² ζm² log⁴ n)/n) = oP(1/log n),

where the last inequality holds by Lemma 9, and the final rate follows by choosing 2^j = (m log³ n) ℓn for some ℓn → ∞ slowly enough.

Relation (2) follows from the use of Yurinskii's coupling (Pollard [31], Chapter 10, Theorem 10): Let ξ1, ..., ξn be independent p-vectors with E[ξi] = 0 for each i, and with κ := Σi E‖ξi‖³ finite. Let S = ξ1 + ··· + ξn. For each δ > 0 there exists a random vector T with a N(0, var(S)) distribution such that

  P{‖S − T‖ > 3δ} ≤ C0 B (1 + |log(1/B)|/p), where B := κ p δ^{−3},

for some universal constant C0.
In order to apply the coupling, we collapse vi Zi ψi ∘ πj to a p-vector,

  ξi = vi Zi ψi ∘ πj ∈ R^p, p = 2^j m,

so that Un ∘ πj = Σ_{i=1}^n ξi/√n. Then

  En E[‖ξi‖³] = En E[ ( Σ_{k=1}^{2^j} Σ_{w=1}^m ψi(u_{kj})² vi² Z²_{iw} )^{3/2} ] ≤ 2^{3j/2} E[|vi|³] En[‖Zi‖³] ≲ 2^{3j/2} ζm³.

Therefore, by Yurinskii's coupling, since log n ≲ 2^j m by the choice 2^j = m log³ n,

  P{ ‖Σ_{i=1}^n ξi/√n − Nn,j‖ ≥ 3δ } ≲ (n 2^{3j/2} ζm³ · 2^j m)/(δ√n)³ = (2^{5j/2} m ζm³)/(δ³ n^{1/2}) → 0

by setting δ = ((2^{5j} m² ζm⁶ log n)/n)^{1/6}. This verifies relation (2) with

  r2 ≲P ((2^{5j} m² ζm⁶ log n)/n)^{1/6} = o(1/log n),

provided j is chosen as above.

Relation (3) follows from the a.s. embedding of a finite-dimensional random normal vector into a path of a continuous Gaussian process, which is possible by Lemma 11.

Relation (4) follows from

  r3 = sup_{u∈U} ‖Gn(u) − Gn ∘ πj(u)‖ ≤ sup_{|u−u′|≤2^{−j}} ‖Gn(u) − Gn(u′)‖ ≲P √(2^{−j} m log n) = oP(1/log n),

where the last inequality holds by Lemma 10 since, by assumption of this lemma,

  sup_{‖α‖≤1} E En[vi²(α′Zi)²] = sup_{‖α‖≤1} En[(α′Zi)²] ≲ 1,

and the rate follows from setting j as above. Note that putting the bounds together we also get an explicit bound on the approximation error:

  sup_{u∈U} ‖Un(u) − Gn(u)‖ ≲P √(2^{−j} m log n) + √((m² ζm² log⁴ n)/n) + ((2^{5j} m² ζm⁶ log n)/n)^{1/6}.

Next we establish the auxiliary relations (1) and (4) appearing in the preceding proof.

Lemma 9 (Finite-Dimensional Approximation). Let Z1, ..., Zn ∈ R^m be such that max_{i≤n} ‖Zi‖ ≲ ζm and ϕ = sup_{‖α‖≤1} En[(α′Zi)²]; let vi be i.i.d. random variables such that E[vi²] = 1 and max_{1≤i≤n} |vi| ≲P log n; and let ψi(u) = u − 1{Ui ≤ u}, i = 1, ..., n, where U1, ..., Un are i.i.d. Uniform(0, 1) random variables. Then, for γ > 0 and Un(u) = Gn(vi Zi ψi(u)),

  sup_{‖α‖≤1, |u−u′|≤γ} |α′(Un(u) − Un(u′))| ≲P √(γ ϕ m log n) + √((m² ζm² log⁴ n)/n).

Proof. For notational convenience let An := √((m² ζm² log⁴ n)/n). Using the second maximal inequality of Lemma 16 with M(m, n) = ζm log n,

  ε(m, n, γ) = sup_{‖α‖≤1, |u−u′|≤γ} |α′(Un(u) − Un(u′))|
   ≲P √(m log n) sup_{‖α‖≤1, |u−u′|≤γ} √(E En[vi²(α′Zi)²(ψi(u) − ψi(u′))²]) + An.

By the independence between Zi, vi and Ui, and E[vi²] = 1,

  ε(m, n, γ) ≲P √(ϕ m log n) sup_{|u−u′|≤γ} √(E[(ψi(u) − ψi(u′))²]) + An.

Since Ui ~ Uniform(0, 1) we have (ψi(u) − ψi(u′))² =d (|u − u′| − 1{Ui ≤ |u − u′|})². Thus, since |u − u′| ≤ γ,

  ε(m, n, γ) ≲P √(ϕ m log n) √(γ(1 − γ)) + An.

Lemma 10 (Infinite-Dimensional Approximation). Let Gn : U → R^m be a zero-mean Gaussian process whose covariance structure conditional on Z1, ..., Zn is given by

  E[Gn(u)Gn(u′)′] = En[Zi Zi′](u ∧ u′ − uu′)

for any u, u′ ∈ U ⊂ (0, 1), where Zi ∈ R^m, i = 1, ..., n. Then, for any γ > 0, we have

  sup_{|u−u′|≤γ} ‖Gn(u) − Gn(u′)‖ ≲P √(ϕ γ m log m),

where ϕ = sup_{‖α‖≤1} En[(α′Zi)²].

Proof. We will use the following maximal inequality for Gaussian processes (Proposition A.2.7 of [37]): Let X be a separable zero-mean Gaussian process indexed by a set T. Suppose that for some K > σ(X) = sup_{t∈T} σ(Xt) and 0 < ε0 ≤ σ(X), we have

  N(ε, T, ρ) ≤ (K/ε)^V for 0 < ε < ε0,

where N(ε, T, ρ) is the covering number of T by ε-balls with respect to the standard deviation metric ρ(t, t′) = σ(Xt − Xt′). Then there exists a universal constant D such that, for every λ ≥ σ²(X)(1 + √V)/ε0,

  P( sup_{t∈T} Xt > λ ) ≤ ( DKλ/(√V σ²(X)) )^V Φ̄(λ/σ(X)),

where Φ̄ = 1 − Φ, and Φ is the cumulative distribution function of a standard Gaussian random variable.
We apply this result to the zero-mean Gaussian process Xn : S^{m−1} × U × U → R defined as

  X_{n,t} = α′(Gn(u) − Gn(u′)), t = (α, u, u′), α ∈ S^{m−1}, |u − u′| ≤ γ.

It follows that sup_{t∈T} X_{n,t} = sup_{|u−u′|≤γ} ‖Gn(u) − Gn(u′)‖. For the process Xn we have

  σ(Xn) ≤ √(γ sup_{‖α‖≤1} En[(α′Zi)²]), K ≲ √(sup_{‖α‖≤1} En[(α′Zi)²]), and V ≲ m.

Therefore the result follows by setting λ ≃ √(γ m log m · sup_{‖α‖≤1} En[(α′Zi)²]).

In what follows, as before, given a process G in [ℓ∞(U)]^m and its projection G ∘ πj, whose paths are step functions with 2^j steps, we shall identify the process G ∘ πj with a random vector G ∘ πj in R^{2^j m} when convenient. Analogously, given a random vector W in R^{2^j m}, we identify it with a process W in [ℓ∞(U)]^m whose paths are step functions with 2^j steps.

Lemma 11 (Construction of a Gaussian Process with a Prescribed Projection). Let Nj be a given random vector such that Nj =d G̃ ∘ πj =: N(0, Σj), where Σj := Var[Nj] and G̃ is a zero-mean Gaussian process in [ℓ∞(U)]^m whose paths are a.s. uniformly continuous with respect to the Euclidean metric |·| on U. There exists a zero-mean Gaussian process G in [ℓ∞(U)]^m, whose paths are a.s. uniformly continuous with respect to the Euclidean metric |·| on U, such that Nj = G ∘ πj and G =d G̃ in [ℓ∞(U)]^m.

Proof. Consider a vector G̃ ∘ πℓ for ℓ > j. Then Ñj = G̃ ∘ πj is a subvector of G̃ ∘ πℓ = Ñℓ. Denote the remaining components of Ñℓ by Ñℓ\j. We can construct an identically distributed copy Nℓ of Ñℓ such that Nj is a subvector of Nℓ. Indeed, we set Nℓ as a vector with components Nj and Nℓ\j, arranged in the appropriate order, namely so that Nℓ ∘ πj = Nj, where

  Nℓ\j = Σ_{ℓ\j,j} Σ_{j,j}^{−1} Nj + ηj, with ηj ⊥ Nj and ηj =d N(0, Σ_{ℓ\j,ℓ\j} − Σ_{ℓ\j,j} Σ_{j,j}^{−1} Σ_{j,ℓ\j}),

where

  [ Σ_{j,j} Σ_{j,ℓ\j} ; Σ_{ℓ\j,j} Σ_{ℓ\j,ℓ\j} ] := var( (Ñj, Ñℓ\j) ).

We then identify the vector Nℓ with a process Nℓ in [ℓ∞(U)]^m, and define the pointwise limit G of this process as

  G(u) := lim_{ℓ→∞} Nℓ(u) for each u ∈ U0,

where U0 = ∪_{j=1}^∞ ∪_{k=1}^{2^j} {u_{kj}} is a countable dense subset of U. The pointwise limit exists since, by construction of {πℓ} and U0, for each u ∈ U0 we have πℓ(u) = u for all ℓ ≥ ℓ(u), where ℓ(u) is a sufficiently large constant.

By construction, Nℓ = Nℓ ∘ πℓ =d G̃ ∘ πℓ. Therefore, for each ε > 0, there exists η(ε) > 0 small enough such that

  P( sup_{u,u′∈U0: |u−u′|≤η(ε)} ‖G(u) − G(u′)‖ ≥ ε )
   ≤ P( sup_{|u−u′|≤η(ε)} sup_k ‖G ∘ πk(u) − G ∘ πk(u′)‖ ≥ ε )
   ≤ P( sup_{|u−u′|≤η(ε)} sup_k ‖G̃ ∘ πk(u) − G̃ ∘ πk(u′)‖ ≥ ε )
   ≤ P( sup_{|u−u′|≤η(ε)} ‖G̃(u) − G̃(u′)‖ ≥ ε ) ≤ ε,

where the last display is true because sup_{|u−u′|≤η} ‖G̃(u) − G̃(u′)‖ → 0 as η → 0 almost surely, and thus also in probability, by a.s. continuity of the sample paths of G̃.

Setting ε = 2^{−m} for each m ∈ N in the above display and summing the resulting inequalities over m, we get a finite number on the right side. Conclude by the Borel–Cantelli lemma that, for almost all ω ∈ Ω, ‖G(u) − G(u′)‖ ≤ 2^{−m} for all |u − u′| ≤ η(2^{−m}), for all sufficiently large m. This implies that almost all sample paths are uniformly continuous on U0, and we can extend the process by continuity to a process {G(u), u ∈ U} with almost all paths uniformly continuous.

In order to show that the law of G is equal to the law of G̃ in [ℓ∞(U)]^m, it suffices to demonstrate that E[g(G)] = E[g(G̃)] for all g : [ℓ∞(U)]^m → R such that |g(z) − g(z̃)| ≤ sup_{u∈U} ‖z(u) − z̃(u)‖ ∧ 1.
We have that

  |E[g(G)] − E[g(G̃)]| ≤ |E[g(G ∘ πℓ)] − E[g(G̃ ∘ πℓ)]| + E[sup_{u∈U} ‖G ∘ πℓ(u) − G(u)‖ ∧ 1] + E[sup_{u∈U} ‖G̃ ∘ πℓ(u) − G̃(u)‖ ∧ 1] → 0 as ℓ → ∞.

The first term converges to zero by construction, and the second and third terms converge to zero by the dominated convergence theorem and by G ∘ πℓ → G and G̃ ∘ πℓ → G̃ in [ℓ∞(U)]^m as ℓ → ∞ a.s., which holds due to the a.s. uniform continuity of the sample paths of G and G̃.

Appendix G. Technical Lemmas on Bounding Empirical Errors

In Appendix G.2 we establish technical results needed for our main results — uniform rates of convergence, uniform linear approximations, and a uniform central limit theorem — under high-level conditions. In Appendix G.3 we verify that these conditions are implied by the primitive Condition S stated in Section 2.

G.1. Some Preliminary Lemmas.

Lemma 12. Under conditions S.2 and S.5, for any u ∈ U and α ∈ S^{m−1},

  |α′(Jm(u) − J̃m(u))α| ≲ m^{−κ} = o(1), where J̃m(u) = E[fY|X(Z′β(u)|X)ZZ′].

Proof of Lemma 12. For any α ∈ S^{m−1},

  |α′(Jm(u) − J̃m(u))α| = E[|fY|X(Z′β(u) + R(X, u)|X) − fY|X(Z′β(u)|X)|(Z′α)²] ≲ α′Σm α · f̄′ m^{−κ}.

The result follows since Σm has bounded eigenvalues, κ > 0, and m → ∞ as n → ∞.

Lemma 13 (Auxiliary Matrix). Under conditions S.1–S.5, for u′, u ∈ U we have that, uniformly over z ∈ Z,

  |z′(Jm^{−1}(u′) − Jm^{−1}(u))Un(u′)| / √(u(1 − u) z′Jm^{−1}(u)Σm Jm^{−1}(u)z) ≲P |u − u′| √(m log n).

Proof. Recall that Jm(u) = E[fY|X(QY|X(u|X)|X)ZZ′] for any u ∈ U. Moreover, under S.1–S.5 we have ‖Jm^{−1}(u′)Un(u′)‖ ≲P √(m log n) uniformly in u′ ∈ U by Lemma 23 and Corollary 2 of Appendix G. Using the matrix identity A^{−1} − B^{−1} = B^{−1}(B − A)A^{−1},

  Jm^{−1}(u′) − Jm^{−1}(u) = Jm^{−1}(u)(Jm(u) − Jm(u′))Jm^{−1}(u′).

Moreover, since |fY|X(QY|X(u|x)|x) − fY|X(QY|X(u′|x)|x)| ≤ (f̄′/f)|u − u′| by Lemma 14,

  Jm(u) − Jm(u′) ≼ (f̄′/f)|u − u′| Σm and Jm(u′) − Jm(u) ≼ (f̄′/f)|u − u′| Σm,

where the inequalities are in the positive semi-definite sense. Using these relations and the definition of sn(u, x), we obtain

  |z′(Jm^{−1}(u′) − Jm^{−1}(u))Un(u′)| / √(u(1 − u) z′Jm^{−1}(u)Σm Jm^{−1}(u)z)
   = |(z′Jm^{−1}(u)/sn(u, x))(Jm(u) − Jm(u′))Jm^{−1}(u′)Un(u′)|
   ≲P (f̄′/f)|u − u′| maxeig(Σm) √(m log n).

The result follows since f̄′ is bounded above, f is bounded away from zero, and the eigenvalues of Σm are bounded above and below by constants uniformly in n by S.2 and S.3.

Lemma 14 (Primitive Condition S.2). Under condition S.2, there are positive constants c, C, C1′, C2′, C′′ such that the conditional quantile functions satisfy the following properties uniformly over u, u′ ∈ U and x ∈ X:

  (i) c|u − u′| ≤ |QY|X(u|x) − QY|X(u′|x)| ≤ C|u − u′|;
  (ii) |fY|X(QY|X(u|x)|x) − fY|X(QY|X(u′|x)|x)| ≤ C1′|u − u′|;
  (iii) fY|X(y|x) ≤ 1/c and |f′Y|X(y|x)| ≤ C2′;
  (iv) |d²QY|X(u|x)/du²| ≤ C′′.

Proof. Under S.2, fY|X(·|x) is a differentiable function, so that QY|X(·|x) is twice differentiable. To show the first statement, note that

  (d/du) QY|X(u|x) = 1/fY|X(QY|X(u|x)|x)

by an application of the inverse function theorem. Recall that f = inf_{x∈X, u∈U} fY|X(QY|X(u|x)|x) > 0 and sup_{x∈X, y∈R} fY|X(y|x) ≤ f̄. This proves the first statement with c = 1/f̄ and C = 1/f. To show the second statement, let f̄′ = sup_{x∈X} sup_y |(d/dy) fY|X(y|x)| and

  C1′ = sup_{x∈X, u∈U} |(d/du) fY|X(QY|X(u|x)|x)| = sup_{x∈X, u∈U} |(d/dy) fY|X(y|x)|_{y=QY|X(u|x)} · (d/du)QY|X(u|x)| ≤ f̄′/f.

By a Taylor expansion we have |fY|X(QY|X(u|x)|x) − fY|X(QY|X(u′|x)|x)| ≤ C1′|u − u′|.
The second part of the third statement follows with C2′ = f̄′; the first part was already shown in the proof of part (i). For the fourth statement, using the implicit function theorem for second-order derivatives,

  (d²/du²) QY|X(u|x) = −(d²/dy²) FY|X(y|x)|_{y=QY|X(u|x)} · ((d/du) QY|X(u|x))³ = −f′Y|X(QY|X(u|x)|x)/f³Y|X(QY|X(u|x)|x).

Thus, the statement holds with C′′ = f̄′/f³. Under S.2 we can take c, C, C1′, C2′, and C′′ to be fixed positive constants uniformly over n.

G.2. Maximal Inequalities. In this section we derive maximal inequalities that are needed for verifying the preliminary high-level conditions. These inequalities rely mainly on uniform entropy bounds and VC classes of functions. In what follows, F denotes a class of functions whose envelope is F. Recall that for a probability measure Q with ‖F‖_{Q,p} > 0, N(ε‖F‖_{Q,p}, F, Lp(Q)) denotes the covering number under the specified metric (i.e., the minimum number of Lp(Q)-balls of radius ε‖F‖_{Q,p} needed to cover F). We refer to Dudley [17] for the details of the definitions. Suppose that we have the following upper bound on the L2(P) covering numbers for F:

  N(ε‖F‖_{P,2}, F, L2(P)) ≤ n(ε, F, P) for each ε > 0,

where n(ε, F, P) is increasing in 1/ε, and ε√(log n(ε, F, P)) → 0 as 1/ε → ∞ and is decreasing in 1/ε. Let ρ(F, P) := sup_{f∈F} ‖f‖_{P,2}/‖F‖_{P,2}. Let us call a threshold function x : R^n → R k-sub-exchangeable if, for any v, w ∈ R^n and any vectors ṽ, w̃ created by the pairwise exchange of the components in v with components in w, we have x(ṽ) ∨ x(w̃) ≥ [x(v) ∨ x(w)]/k. Several functions satisfy this property; in particular, x(v) = ‖v‖ with k = √2, and constant functions with k = 1.

Lemma 15 (Exponential Inequality for a Separable Empirical Process). Consider a separable empirical process Gn(f) = n^{−1/2} Σ_{i=1}^n {f(Zi) − E[f(Zi)]} and the empirical measure Pn for Z1, ..., Zn, an underlying independent data sequence. Let K > 1 and τ ∈ (0, 1) be constants, and let en(F, Pn) = en(F, Z1, ..., Zn) be a k-sub-exchangeable random variable such that

  ‖F‖_{Pn,2} ∫₀^{ρ(F,Pn)/4} √(log n(ε, F, Pn)) dε ≤ en(F, Pn) and sup_{f∈F} varP f ≤ (τ/2)(4kcK en(F, Pn))²

for some universal constant c > 1. Then

  P( sup_{f∈F} |Gn(f)| ≥ 4kcK en(F, Pn) ) ≤ (4/τ) EP[ ( ∫₀^{ρ(F,Pn)/2} ε^{−1} n(ε, F, Pn)^{−{K²−1}} dε ) ∧ 1 ] + τ.

Proof. See Lemma 18 of [6], and note that the proof does not use that the Zi's are i.i.d., only independent, which was the requirement of Lemma 17 of [6].

The next lemma establishes a new maximal inequality which will be used in the following sections.

Lemma 16. Suppose that for all large m and all 0 < ε ≤ ε0,

  n(ε, Fm, P) ≤ (ω/ε)^{J(m)²} and n(ε, Fm², P) ≤ (ω/ε)^{J(m)²},  (G.35)

for some ω such that log ω ≲ log n, and let Fm = sup_{f∈Fm} |f| denote the envelope function associated with Fm.

1. (A Maximal Inequality Based on Entropy and Moments) Then, as n grows, we have

  sup_{f∈Fm} |Gn(f)| ≲P J(m) ( sup_{f∈Fm} E[f²] + n^{−1/2} J(m) log^{1/2} n ( sup_{f∈Fm} En[f⁴] ∨ E[f⁴] )^{1/2} )^{1/2} log^{1/2} n.

2. (A Maximal Inequality Based on Entropy, Moments, and Extremum) Suppose that Fm ≤ M(m, n) with probability going to 1 as n grows. Then, as n grows,

  sup_{f∈Fm} |Gn(f)| ≲P J(m) ( sup_{f∈Fm} E[f²] + n^{−1} J(m)² M²(m, n) log n )^{1/2} log^{1/2} n.

Proof. We divide the proof into steps. Step 1 is the main argument, Step 2 is an application of Lemma 15, and Step 3 contains some auxiliary calculations.

Step 1 (Main Argument). Proof of Part 1.
By Step 2 below, which invokes Lemma 15,

  sup_{f∈Fm} |Gn(f)| ≲P J(m) √(log n) sup_{f∈Fm} (En[f²]^{1/2} ∨ E[f²]^{1/2}).  (G.36)

We can assume that sup_{f∈Fm} En[f²]^{1/2} ≥ sup_{f∈Fm} E[f²]^{1/2} throughout the proof; otherwise we are done with both bounds. Again by Step 2 and (G.35) applied to Fm²,

  sup_{f∈Fm} |En[f²] − E[f²]| = n^{−1/2} sup_{f∈Fm} |Gn(f²)| ≲P n^{−1/2} J(m) √(log n) ( sup_{f∈Fm} En[f⁴] ∨ E[f⁴] )^{1/2}.  (G.37)

Thus we have

  sup_{f∈Fm} En[f²] ≲P sup_{f∈Fm} E[f²] + n^{−1/2} J(m) log^{1/2} n ( sup_{f∈Fm} En[f⁴] ∨ E[f⁴] )^{1/2}.  (G.38)

Therefore, inserting the bounds (G.38) into equation (G.36) yields the result.

Proof of Part 2. Once more we can assume that sup_{f∈Fm} En[f²] ≥ sup_{f∈Fm} E[f²]; otherwise we are done. By (G.37) we have

  sup_{f∈Fm} |En[f²] − E[f²]| ≲P n^{−1/2} J(m) log^{1/2} n ( sup_{f∈Fm} En[f⁴] ∨ E[f⁴] )^{1/2}
   ≲P n^{−1/2} J(m) log^{1/2} n M(m, n) ( sup_{f∈Fm} En[f²] )^{1/2},

where we used that f⁴ ≤ f² M²(m, n) with probability going to 1. Since for positive numbers a, c, and x, x ≤ a + c|x|^{1/2} implies x ≤ 4a + 4c², we conclude

  sup_{f∈Fm} En[f²] ≲P sup_{f∈Fm} E[f²] + n^{−1} J(m)² M²(m, n) log n.

Inserting this bound into equation (G.36) gives the result.

Step 2 (Applying Lemma 15). We apply Lemma 15 to Fm with τm = 1/(4J(m)²[K² − 1]) for some large constant K to be set later, and

  en(Fm, Pn) = J(m) √(log n) ( sup_{f∈Fm} En[f²]^{1/2} ∨ E[f²]^{1/2} ),

assuming that n is sufficiently large (i.e., n ≥ ω). We observe that by (G.35) the bound ε ↦ n(ε, Fm, Pn) satisfies the monotonicity hypotheses of Lemma 15. Next note that en(Fm, Pn) is √2-sub-exchangeable, because sup_{f∈Fm} ‖f‖_{Pn,2} is √2-sub-exchangeable, and ρ(Fm, Pn) := sup_{f∈Fm} ‖f‖_{Pn,2}/‖Fm‖_{Pn,2} ≥ 1/√n by Step 3 below. Thus,

  ‖Fm‖_{Pn,2} ∫₀^{ρ(Fm,Pn)/4} √(log n(ε, Fm, P)) dε ≤ ‖Fm‖_{Pn,2} J(m) ∫₀^{ρ(Fm,Pn)/4} √(log(ω/ε)) dε
   ≤ J(m) √(log(n ∨ ω)) sup_{f∈Fm} ‖f‖_{Pn,2}/2 ≤ en(Fm, Pn),

which follows by ∫₀^ρ √(log(ω/ε)) dε ≤ (∫₀^ρ 1 dε)^{1/2} (∫₀^ρ log(ω/ε) dε)^{1/2} ≤ ρ √(2 log(n ∨ ω)) for 1/√n ≤ ρ ≤ 1.

Let K > 1 be sufficiently large (to be set below). Recall that 4√2 c > 4, where c > 1 is universal. Note that for any f ∈ Fm, by the Chebyshev inequality,

  P(|Gn(f)| > 4√2 cK en(Fm, Pn)) ≤ sup_{f∈Fm} ‖f‖²_{P,2}/(4√2 cK en(Fm, Pn))² ≤ 1/((4√2 cK)² J(m)² log n) ≤ τm/2.

By Lemma 15 with our choice of τm, ω > 1, and ρ(Fm, Pn) ≤ 1,

  P( sup_{f∈Fm} |Gn(f)| > 4√2 cK en(Fm, Pn) ) ≤ (4/τm) ∫₀^{1/2} (ω/ε)^{1−J(m)²[K²−1]} dε + τm
   ≤ (4/(τm J(m)²[K² − 1])) (1/[2ω])^{J(m)²[K²−1]} + τm,

which can be made arbitrarily small by choosing K sufficiently large (recalling that τm → 0 as K grows).

Step 3 (Auxiliary calculations). To establish that sup_{f∈Fm} ‖f‖_{Pn,2} is √2-sub-exchangeable, define Z̃ and Ỹ by exchanging any components in Z with corresponding components in Y. Then

  √2 ( sup_{f∈Fm} ‖f‖_{Pn(Z̃),2} ∨ sup_{f∈Fm} ‖f‖_{Pn(Ỹ),2} ) ≥ ( sup_{f∈Fm} ‖f‖²_{Pn(Z̃),2} + sup_{f∈Fm} ‖f‖²_{Pn(Ỹ),2} )^{1/2}
   ≥ ( sup_{f∈Fm} En[f(Z̃i)²] + En[f(Ỹi)²] )^{1/2} = ( sup_{f∈Fm} En[f(Zi)²] + En[f(Yi)²] )^{1/2}
   ≥ ( sup_{f∈Fm} ‖f‖²_{Pn(Z),2} ∨ sup_{f∈Fm} ‖f‖²_{Pn(Y),2} )^{1/2} = sup_{f∈Fm} ‖f‖_{Pn(Z),2} ∨ sup_{f∈Fm} ‖f‖_{Pn(Y),2}.

Next we show that ρ(Fm, Pn) := sup_{f∈Fm} ‖f‖_{Pn,2}/‖Fm‖_{Pn,2} ≥ 1/√n. The latter bound follows from En[Fm²] = En[sup_{f∈Fm} |f(Zi)|²] ≤ sup_{i≤n} sup_{f∈Fm} |f(Zi)|², and from the inequality sup_{f∈Fm} En[|f(Zi)|²] ≥ sup_{f∈Fm} sup_{i≤n} |f(Zi)|²/n.

The last technical lemma in this section bounds the uniform entropy of VC classes of functions (we refer to Dudley [17] for formal definitions).
Lemma 17 (Uniform Entropy of VC Classes). Suppose F has VC index V. Then, as ε > 0 goes to zero, we have, for J = O(V),

  sup_Q N(ε‖F‖_{Q,2}, F, L2(Q)) ≲ (1/ε)^J,

where Q ranges over all discrete probability measures.

Proof. Being a VC class of index V, by Theorem 2.6.7 in [37] we have that the bound sup_Q log N(ε‖F‖_{Q,2}, F, L2(Q)) ≲ V log(1/ε) holds for ε sufficiently small (also making the expression bigger than 1).

Comment G.1. Although the product of two VC classes of functions may not be a VC class, if F has VC index V, the square of F is still a VC class whose VC index is at most 2V.

G.3. Bounds on Various Empirical Errors. In this section we provide probabilistic bounds for the error terms under the primitive Condition S. Our results rely on empirical process techniques, in particular on the maximal inequalities derived in Section G.2. We start with a sequence of technical lemmas which are used in the proofs of the lemmas that bound the error terms ε0–ε6.

Lemma 18. Let r = o(1). The class of functions

  Fm,n = {α′(ψi(β, u) − ψi(β(u), u)) : u ∈ U, ‖α‖ ≤ 1, ‖β − β(u)‖ ≤ r}

has VC index of O(m).

Proof. Consider the classes W := {Zi′α : α ∈ R^m} and V := {1{Yi ≤ Zi′β} : β ∈ R^m} (for convenience let Ai = (Zi, Yi)). Their VC index is bounded by m + 2. Next consider f ∈ Fm,n, which can be written in the form f(Ai) := g(Ai)(1{h(Ai) ≤ 0} − 1{p(Ai) ≤ 0}), where g ∈ W and 1{h ≤ 0}, 1{p ≤ 0} ∈ V. Then

  {(Ai, t) : f(Ai) ≤ t} = {(Ai, t) : h(Ai) > 0, p(Ai) > 0, t ≥ 0}
   ∪ {(Ai, t) : h(Ai) ≤ 0, p(Ai) ≤ 0, t ≥ 0}
   ∪ {(Ai, t) : h(Ai) ≤ 0, p(Ai) > 0, g(Ai) ≤ t}
   ∪ {(Ai, t) : h(Ai) > 0, p(Ai) ≤ 0, −g(Ai) ≤ t}.

Since each of these sets can be written as three intersections of basic sets, it follows that Fm,n has VC index of at most O(m).

Lemma 19. The class of functions Hm,n = {1{|Yi − Zi′β| ≤ h}(α′Zi)² : ‖β − β(u)‖ ≤ r, h ∈ (0, H], α ∈ S^{m−1}} has VC index of O(m).

Proof. The proof is similar to that of Lemma 18.

Lemma 20. The family of functions Gm,n = {α′ψi(β(u), u) : u ∈ U, α ∈ S^{m−1}} has VC index of O(m).

Proof. The proof is similar to the proof of Lemma 18.

Lemma 21. The family of functions

  An,m = {α′Z(1{Y ≤ Z′β(u) + R(X, u)} − 1{Y ≤ Z′β(u)}) : α ∈ S^{m−1}, u ∈ U}

has VC index of O(m).

Proof. The key observation is that the function Z′β(u) + R(X, u) is monotone in u, so that {1{Y ≤ Z′β(u) + R(X, u)} : u ∈ U} has VC index of 1, and that {1{Y ≤ Z′β(u)} : u ∈ U} ⊂ {1{Y ≤ Z′β} : β ∈ R^m}. The proof then follows similarly to Lemma 18.

Consider the maximum of the maximal eigenvalue of the empirical Gram matrix and that of the population Gram matrix:

  φn = max_{α∈S^{m−1}} En[(α′Zi)²] ∨ E[(α′Zi)²].  (G.39)

The factor φn will be used to bound the quantities ε0 and ε1 in the analysis of the rate of convergence. Next we state a result due to Guédon and Rudelson [18], specialized to our framework.

Theorem 13 (Guédon and Rudelson [18]). Let Zi ∈ R^m, i = 1, ..., n, be random vectors such that

  δ² := (log n/n) · E[max_{i≤n} ‖Zi‖²]/max_{α∈S^{m−1}} E[(Zi′α)²] satisfies δ < 1.

Then we have

  E[ max_{α∈S^{m−1}} |(1/n) Σ_{i=1}^n (Zi′α)² − E[(Zi′α)²]| ] ≤ 2δ · max_{α∈S^{m−1}} E[(Zi′α)²].

Corollary 2. Under Condition S and ζm² log n = o(n), for λmax = max_{α∈S^{m−1}} E[(Zi′α)²], we have that for n large enough φn as defined in (G.39) satisfies

  E[φn] ≤ (1 + 2√((ζm² log n)/(n λmax))) λmax and P(φn > 2λmax) ≤ 2√((ζm² log n)/(n λmax)).

Proof. Let δ be defined as in Theorem 13. Next note that E[max_{i=1,...,n} ‖Zi‖²] ≲ ζm² under S.4, and λmax ≲ 1 and λmax ≳ 1 under S.3.
Therefore δ² ≲ (ζm² log n)/n, and the growth condition ζm² log n = o(n) yields δ < 1 as n grows. The first result follows by applying Theorem 13 and the triangle inequality. To show the second relation, note that the event {φn > 2λmax} cannot occur if φn = max_{α∈S^{m−1}} E[(Zi′α)²] = λmax. Thus

  P(φn > 2λmax) = P(max_{α∈S^{m−1}} En[(Zi′α)²] > 2λmax)
   ≤ P(max_{α∈S^{m−1}} |En[(Zi′α)²] − E[(Zi′α)²]| > λmax)
   ≤ E[max_{α∈S^{m−1}} |En[(Zi′α)²] − E[(Zi′α)²]|]/λmax ≤ 2δ,

by the triangle inequality, the Markov inequality, and Theorem 13.

Next we proceed to bound the various approximation error terms.

Lemma 22 (Controlling the Error ε1). Under conditions S.1–S.4 we have ε1(m, n) ≲P √(m φn log n).

Proof. Consider the class of functions Fm,n defined in Lemma 18, so that ε1(m, n) = sup_{f∈Fm,n} |Gn(f)|. From Lemma 17 we have that J(m) ≲ √m. By Step 2 of Lemma 16 (see equation (G.36)),

  sup_{f∈Fm,n} |Gn(f)| ≲P √(m log n) sup_{f∈Fm,n} (En[f²] ∨ E[f²])^{1/2}.  (G.40)

The score function ψi(·, ·) satisfies the following inequality for any α ∈ S^{m−1}:

  |(ψi(β, u) − ψi(β(u), u))′α| = |α′Zi| |1{Yi ≤ Zi′β} − 1{Yi ≤ Zi′β(u)}| ≤ |α′Zi|.

Therefore

  En[f²] ≤ En[(α′Zi)²] ≤ φn and E[f²] ≤ E[(α′Zi)²] ≤ φn  (G.41)

by definition (G.39). Combining (G.41) with (G.40), we obtain the result.

Lemma 23 (Controlling the Error ε0 and the Pivotal Process Norm). Under conditions S.1–S.4,

  ε0(m, n) ≲P √(m φn log n) and sup_{u∈U} ‖Un(u)‖ ≲P √(m φn log n).

Proof. For ε0, the proof is similar to the proof of Lemma 22, relying on Lemma 20 and noting that for g ∈ Gm,n,

  En[g²] = En[(α′ψi(β(u), u))²] = En[(α′Zi)²(1{Yi ≤ Zi′β(u)} − u)²] ≤ En[(α′Zi)²] ≤ φn.

Similarly, E[g²] ≤ φn. The second relation follows similarly.

Lemma 24 (Bounds on ε1(m, n) and ε2(m, n)). Under conditions S.1–S.4 we have

  ε1(m, n) ≲P √(m ζm r log n) + (m ζm/√n) log n and ε2(m, n) ≲P √n ζm r² + √n m^{−κ} r.

Proof. Part 1 (ε1(m, n)). The first bound will follow from the application of the maximal inequality derived in Lemma 16. For any B ≥ 0, define the class of functions Fm,n as in Lemma 18, which characterizes ε1. By the (second) maximal inequality in Lemma 16,

  ε1(m, n) = sup_{f∈Fm,n} |Gn(f)| ≲P J(m) ( sup_{f∈Fm,n} E[f²] + n^{−1} J(m)² M²(m, n) log n )^{1/2} log^{1/2} n,

where J(m) ≲ √m by the VC dimension of Fm,n being O(m) (Lemma 18), and M(m, n) = max_{1≤i≤n} ‖Zi‖ ≤ ζm. The bound stated in the lemma holds provided we can show that

  sup_{f∈Fm,n} E[f²] ≲ r ζm.  (G.42)

The score function ψi(·, ·) satisfies, for any α ∈ S^{m−1},

  |(ψi(β, u) − ψi(β(u), u))′α| = |Zi′α| |1{Yi ≤ Zi′β} − 1{Yi ≤ Zi′β(u)}| ≤ |Zi′α| · 1{|Yi − Zi′β(u)| ≤ |Zi′(β − β(u))|}.

Thus

  E[(α′(ψi(β, u) − ψi(β(u), u)))²] ≤ E[∫ |Z′α|² 1{|y − Z′β(u)| ≤ |Z′(β − β(u))|} fY|X(y|X) dy]
   ≤ E[(Z′α)² · min(2f̄|Z′(β − β(u))|, 1)]
   ≤ 2f̄ ‖β − β(u)‖ sup_{α∈S^{m−1}, γ∈S^{m−1}} E[|Zi′α|²|Zi′γ|]
   ≤ 2f̄ ‖β − β(u)‖ sup_{α∈S^{m−1}} E[|Zi′α|³],

where sup_{α∈S^{m−1}} E[|Zi′α|³] ≲ ζm by S.3 and S.4. Therefore we have the upper bound (G.42).

Part 2 (ε2(m, n)). To show the bound for ε2, note that by Lemma 12, for any α ∈ S^{m−1},

  √n |α′(Jm(u) − J̃m(u))(β − β(u))| ≲ √n m^{−κ} r.

For α ∈ S^{m−1} and J̃m(u) = E[fY|X(Z′β(u)|X)ZZ′], define

  ε2(m, n, α) = n^{1/2} |α′(E[ψi(β, u)] − E[ψi(β(u), u)]) − α′J̃m(u)(β − β(u))|.

Thus ε2(m, n) ≲ sup_{α∈S^{m−1}, (u,β)∈Rn,m} ε2(m, n, α) + √n m^{−κ} r. Note that since E[α′ψi(β(u), u)] = 0, for some β̃ in the line segment between β(u) and β,

  E[α′ψi(β, u)] = E[fY|X(Zi′β̃|X)(Zi′α)Zi′](β − β(u)).
Lemma 25 (Bounds on Approximation Error for Uniform Linear Approximation). Under Condition S,
\[
\tilde{r}_n(u) := \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i \left( 1\{Y_i \le Z_i'\beta(u) + R(X_i,u)\} - 1\{Y_i \le Z_i'\beta(u)\} \right), \quad u \in \mathcal{U},
\]
satisfies
\[
\sup_{\alpha \in S^{m-1}, u \in \mathcal{U}} |\alpha'\tilde{r}_n(u)| \lesssim_P \min\left\{ \sqrt{m^{1-\kappa} \log n} + \frac{\zeta_m m}{\sqrt{n}} \log n,\ \sqrt{n\, \phi_n\, m^{-\kappa}\, \bar{f}} \right\}.
\]

Proof. The second bound follows from the Cauchy--Schwarz inequality, $|R(X_i,u)| \lesssim m^{-\kappa}$, and the boundedness of the conditional probability density function of $Y$ given $X$. The proof of the first bound is similar to the bound on $\epsilon_1$ in Lemma 24; it also follows from the application of the maximal inequality derived in Lemma 16. Consider the class of functions
\[
\mathcal{A}_{n,m} = \{\alpha' Z (1\{Y \le Z'\beta(u) + R(X,u)\} - 1\{Y \le Z'\beta(u)\}) : \alpha \in S^{m-1},\ u \in \mathcal{U}\}.
\]
By the (second) maximal inequality in Lemma 16,
\[
\sup_{\alpha \in S^{m-1}, u \in \mathcal{U}} |\alpha'\tilde{r}_n(u)| = \sup_{f \in \mathcal{A}_{n,m}} |\mathbb{G}_n(f)| \lesssim_P J(m) \left( \sup_{f \in \mathcal{A}_{n,m}} E[f^2] + n^{-1} J(m)^2 M^2(m,n) \log n \right)^{1/2} \log^{1/2} n,
\]
where $J(m) \lesssim \sqrt{m}$ since the VC index of $\mathcal{A}_{n,m}$ is $O(m)$ by Lemma 21, and $M(m,n) = \zeta_m$ by S.4. The bound stated in the lemma holds provided we can show that
\[
\sup_{f \in \mathcal{A}_{n,m}} E[f^2] \lesssim m^{-\kappa}. \tag{G.43}
\]
For any $f \in \mathcal{A}_{n,m}$,
\[
|f(Y_i,Z_i,X_i)| = |Z_i'\alpha|\, |1\{Y_i \le Z_i'\beta(u) + R(X_i,u)\} - 1\{Y_i \le Z_i'\beta(u)\}| \le |Z_i'\alpha| \cdot 1\{|Y_i - Z_i'\beta(u)| \le |R(X_i,u)|\}.
\]
Thus, since $|f_{Y|X}(y|X)| \le \bar{f}$,
\begin{align*}
E[f^2(Y,Z,X)] &\le E\left[ \int |Z'\alpha|^2\, 1\{|y - Z'\beta(u)| \le |R(X,u)|\}\, f_{Y|X}(y|X)\, dy \right]\\
&\le E[(Z'\alpha)^2 \cdot \min(2\bar{f}\, |R(X,u)|, 1)] \le 2\bar{f}\, m^{-\kappa} \sup_{\alpha \in S^{m-1}} E[|Z_i'\alpha|^2],
\end{align*}
where $\sup_{\alpha \in S^{m-1}} E[|Z_i'\alpha|^2] \lesssim 1$ by S.3. Therefore we obtain the upper bound (G.43).

Lemma 26 (Bound on $\epsilon_3(m,n)$). Let $\hat{\beta}(u)$ be a solution to the perturbed QR problem
\[
\hat{\beta}(u) \in \arg\min_{\beta \in \mathbb{R}^m} E_n[\rho_u(Y_i - Z_i'\beta)] + A_n(u)'\beta.
\]
If the data are in general position, then
\[
\epsilon_3(m,n) = \sup_{u \in \mathcal{U}} n^{1/2} \|E_n[\psi_i(\hat{\beta}(u),u)] + A_n(u)\| \le \min\left\{ \frac{m \zeta_m}{\sqrt{n}},\ \sqrt{\phi_n\, m} \right\}
\]
holds with probability 1.

Proof. Note that the dual program associated with the perturbed QR problem is
\[
\max_{(u-1) \le a_i \le u} E_n[Y_i a_i] \quad \text{s.t.} \quad E_n[Z_i a_i] = -A_n(u).
\]
Let $\hat{a}(u)$ denote the solution of the dual program above, and let $a_i(\hat{\beta}(u)) := u - 1\{Y_i \le Z_i'\hat{\beta}(u)\}$. By the triangle inequality,
\[
\epsilon_3(m,n) \le \sup_{\|\alpha\| \le 1, u \in \mathcal{U}} \sqrt{n}\, \left| E_n[(Z_i'\alpha)(a_i(\hat{\beta}(u)) - \hat{a}_i(u))] \right| + \sup_{u \in \mathcal{U}} \sqrt{n}\, \|E_n[Z_i \hat{a}_i(u)] + A_n(u)\|.
\]
By dual feasibility, $E_n[Z_i \hat{a}_i(u)] = -A_n(u)$, so the second term is identically zero. Moreover, $a_i(\hat{\beta}(u)) \ne \hat{a}_i(u)$ only if the $i$-th point is interpolated. Since the data are in general position, with probability one the quantile regression fit interpolates exactly $m$ points ($Z_i'\hat{\beta}(u) = Y_i$ for $m$ points, for every $u \in \mathcal{U}$). Therefore, noting that $|a_i(\hat{\beta}(u)) - \hat{a}_i(u)| \le 1$,
\[
\epsilon_3(m,n) \le \sup_{\|\alpha\| \le 1, u \in \mathcal{U}} \sqrt{n}\, \sqrt{E_n[(Z_i'\alpha)^2]}\, \sqrt{E_n[\{a_i(\hat{\beta}(u)) - \hat{a}_i(u)\}^2]} \le \sqrt{\phi_n\, m},
\]
and, with probability 1,
\[
\epsilon_3(m,n) \le \sup_{u \in \mathcal{U}} \sqrt{n}\, E_n[1\{a_i(\hat{\beta}(u)) \ne \hat{a}_i(u)\}] \max_{1 \le i \le n} \|Z_i\| \le \frac{m}{\sqrt{n}} \max_{1 \le i \le n} \|Z_i\| \le \frac{m \zeta_m}{\sqrt{n}},
\]
where the last step uses $\max_{1 \le i \le n} \|Z_i\| \le \zeta_m$ by S.4.
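The interpolation property invoked in the proof of Lemma 26 can be seen directly by solving the QR problem as a linear program: a vertex (basic) solution fits exactly $m$ observations when the data are in general position. The sketch below uses the unperturbed problem ($A_n(u) = 0$) with an illustrative design; the numerical tolerance is an assumption, and an interior-point solve could blur the count.

```python
import numpy as np
from scipy.optimize import linprog

# Lemma 26's dual/interpolation argument: the Koenker-Bassett LP
#   min_{b, r+, r-}  u 1'r+ + (1-u) 1'r-   s.t.  Z b + r+ - r- = Y,  r+, r- >= 0
# has a vertex solution interpolating exactly m points (general position a.s.).
rng = np.random.default_rng(2)
n, m, u = 200, 5, 0.5
Z = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
Y = Z @ rng.normal(size=m) + rng.standard_t(df=3, size=n)  # continuous errors

c = np.concatenate([np.zeros(m), u * np.ones(n), (1 - u) * np.ones(n)])
A_eq = np.hstack([Z, np.eye(n), -np.eye(n)])
bounds = [(None, None)] * m + [(0, None)] * (2 * n)
sol = linprog(c, A_eq=A_eq, b_eq=Y, bounds=bounds, method="highs")

beta_hat = sol.x[:m]
n_interp = np.sum(np.abs(Y - Z @ beta_hat) < 1e-8)  # tolerance is an assumption
print(f"interpolated points: {n_interp} (should equal m = {m})")
```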
Lemma 27 (Bound on $\epsilon_4(m,n)$). Under conditions S.1--S.4 and $\zeta_m^2 \log n = o(n)$,
\[
\epsilon_4(m,n) \lesssim_P \sqrt{\frac{\zeta_m^2 \log n}{n}} = o(1).
\]

Proof. The result follows from Theorem 13 of Guédon and Rudelson [18] under our assumptions.

Lemma 28 (Bounds on $\epsilon_5(m,n)$ and $\epsilon_6(m,n)$). Under S.2, $h_n = o(1)$, and $r = o(1)$,
\[
\epsilon_5(m,n) \lesssim_P \sqrt{\frac{\zeta_m^2 m \log n}{n h_n}} + \frac{m \zeta_m^2}{n h_n} \log n \quad \text{and} \quad \epsilon_6(m,n) \lesssim m^{-\kappa} + r \zeta_m + h_n.
\]

Proof. To bound $\epsilon_5$, recall the class of functions from Lemma 19,
\[
\mathcal{H}_{m,n} = \{1\{|Y_i - Z_i'\beta| \le h\}(\alpha' Z_i)^2 : \|\beta - \beta(u)\| \le r,\ h \in (0,H],\ \alpha \in S^{m-1}\}.
\]
Then, by Lemma 16,
\[
\epsilon_5(m,n) = \frac{1}{\sqrt{n}\, h_n} \sup_{f \in \mathcal{H}_{m,n}} |\mathbb{G}_n(f)| \lesssim_P \frac{J(m)}{\sqrt{n}\, h_n} \left( \sup_{f \in \mathcal{H}_{m,n}} E[f^2] + n^{-1} J(m)^2 M^2(m,n) \log n \right)^{1/2} \log^{1/2} n,
\]
where $J(m) \lesssim \sqrt{m}$ since the VC index of $\mathcal{H}_{m,n}$ is $O(m)$ by Lemma 19, and $M(m,n) = \max_{1 \le i \le n} \|Z_i\|^2$, since $H_{m,n,i} = \|Z_i\|^2$ is the envelope of $\mathcal{H}_{m,n}$ at observation $i$. We also have
\[
\sup_{f \in \mathcal{H}_{m,n}} E[f^2] \le \sup_{\beta \in \mathbb{R}^m, \alpha \in S^{m-1}} E[1\{|Y_i - Z_i'\beta| \le h_n\}(\alpha' Z_i)^4] \le 2\bar{f} h_n \sup_{\alpha \in S^{m-1}} E[(\alpha' Z_i)^4].
\]
By S.2, $\bar{f}$ is bounded, and by S.4,
\[
\sup_{f \in \mathcal{H}_{m,n}} E[f^2] \le 2\bar{f}\, \zeta_m^2 h_n \sup_{\alpha \in S^{m-1}} E[(\alpha' Z_i)^2] \lesssim \zeta_m^2 h_n.
\]
Collecting terms,
\begin{align*}
\epsilon_5(m,n) &\lesssim_P \frac{1}{\sqrt{n}\, h_n} \sqrt{m} \left( \zeta_m^2 h_n + \frac{m \max_{1 \le i \le n} \|Z_i\|^4}{n} \log n \right)^{1/2} \log^{1/2} n\\
&\lesssim_P \sqrt{\frac{\zeta_m^2 m \log n}{n h_n}} + \sqrt{\frac{m^2 \max_{1 \le i \le n} \|Z_i\|^4}{n^2 h_n^2} \log^2 n} = \sqrt{\frac{\zeta_m^2 m \log n}{n h_n}} + \frac{m \max_{1 \le i \le n} \|Z_i\|^2}{n h_n} \log n,
\end{align*}
and the stated bound follows since $\max_{1 \le i \le n} \|Z_i\| \le \zeta_m$ by S.4.

To show the bound on $\epsilon_6$, note that $|f'_{Y|X}(y|x)| \le \bar{f}'$ by S.2. By the mean-value theorem (with $\tilde{t}$ between $0$ and $t$),
\begin{align*}
E[1\{|Y - Z'\beta| \le h_n\}(\alpha' Z)^2] &= E\left[ (\alpha' Z)^2 \int_{-h_n}^{h_n} f_{Y|X}(Z'\beta + t|X)\, dt \right]\\
&= E\left[ (\alpha' Z)^2 \int_{-h_n}^{h_n} \left( f_{Y|X}(Z'\beta|X) + t f'_{Y|X}(Z'\beta + \tilde{t}|X) \right) dt \right]\\
&= 2 h_n E[f_{Y|X}(Z'\beta|X)(\alpha' Z)^2] + O(2 h_n^2\, \bar{f}'\, E[(Z'\alpha)^2]).
\end{align*}
Moreover, for any $(u,\beta) \in R_{m,n}$,
\begin{align*}
E[f_{Y|X}(Z'\beta|X)(\alpha' Z)^2] &= E[f_{Y|X}(Z'\beta(u)|X)(\alpha' Z)^2] + E[(f_{Y|X}(Z'\beta|X) - f_{Y|X}(Z'\beta(u)|X))(\alpha' Z)^2]\\
&= E[f_{Y|X}(Z'\beta(u)|X)(\alpha' Z)^2] + O(\bar{f}'\, E[|Z'(\beta - \beta(u))|\, (\alpha' Z)^2])\\
&= \alpha'\widetilde{J}_m(u)\alpha + O(\bar{f}'\, r \sup_{\alpha \in S^{m-1}} E[|\alpha' Z|^3])\\
&= \alpha' J_m(u)\alpha + O(m^{-\kappa}) + O(\bar{f}'\, r \sup_{\alpha \in S^{m-1}} E[|\alpha' Z|^3]),
\end{align*}
where the last line follows from Lemma 13, S.2 and S.5. Since $\bar{f}'$ is bounded by assumption S.2, we obtain $\epsilon_6(m,n) \lesssim h_n + m^{-\kappa} + r \sup_{\alpha \in S^{m-1}} E[|\alpha' Z|^3]$. Finally, conditions S.3 and S.4 yield $\sup_{\alpha \in S^{m-1}} E[|\alpha' Z|^3] \le \zeta_m \sup_{\alpha \in S^{m-1}} E[|\alpha' Z|^2] \lesssim \zeta_m$, and the result follows.
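The terms $\epsilon_5$ and $\epsilon_6$ control, respectively, the sampling error and the bias incurred when $J_m(u)$ is estimated by a Powell-type kernel estimator $\widehat{J}_m(u) = (2 h_n)^{-1} E_n[1\{|Y_i - Z_i'\hat{\beta}(u)| \le h_n\} Z_i Z_i']$ (cf. [32]). A minimal simulation sketch, assuming a Gaussian location model (so that $\beta(u)$ and $J_m(u)$ are known in closed form), an illustrative bandwidth, and the true $\beta(u)$ in place of $\hat{\beta}(u)$:

```python
import numpy as np
from scipy.stats import norm

# Powell-type kernel estimator of J_m(u) = E[f_{Y|X}(Z'beta(u)|X) Z Z'].
# In the location model Y = Z'b0 + N(0,1), beta(u) shifts the intercept by
# Phi^{-1}(u) and J_m(u) = phi(Phi^{-1}(u)) E[Z Z']. All choices below
# (design, bandwidth rate, using the true beta(u)) are illustrative assumptions.
rng = np.random.default_rng(3)
n, m, u = 5000, 5, 0.75
Z = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, m - 1))])
beta0 = np.ones(m)
Y = Z @ beta0 + rng.normal(size=n)

beta_u = beta0.copy()
beta_u[0] += norm.ppf(u)                   # beta(u) in the pure location-shift model
h_n = (m * np.log(n) / n) ** (1 / 3)       # bandwidth rate chosen for illustration

w = (np.abs(Y - Z @ beta_u) <= h_n).astype(float)
J_hat = (Z * w[:, None]).T @ Z / (2 * h_n * n)       # Powell-type estimator
J_true = norm.pdf(norm.ppf(u)) * (Z.T @ Z / n)       # f(Q(u)) En[Z Z'] as a proxy

err = np.linalg.norm(J_hat - J_true, ord=2)
print(f"operator-norm error of the kernel Jacobian estimator: {err:.4f}")
# The error shrinks in line with the eps_5 + eps_6 bounds as n grows and h_n -> 0.
```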
References

[1] D. W. K. Andrews. Empirical process methods in econometrics. In R. F. Engle and D. L. McFadden, editors, Handbook of Econometrics, Volume IV, Chapter 37, pages 2247--2294. Elsevier Science B.V., 1994.
[2] D. W. K. Andrews. Asymptotic normality of series estimators for nonparametric and semiparametric regression models. Econometrica, 59(2):307--345, 1991.
[3] O. Arias, K. F. Hallock, and W. Sosa-Escudero. Individual heterogeneity in the returns to schooling: instrumental variables quantile regression using twins data. Empirical Economics, 26(1):7--40, 2001.
[4] R. Barlow, D. Bartholomew, J. Bremner, and H. Brunk. Statistical Inference Under Order Restrictions. John Wiley, New York, 1972.
[5] A. Belloni, X. Chen, and V. Chernozhukov. New asymptotic theory for series estimators. 2009.
[6] A. Belloni and V. Chernozhukov. ℓ1-penalized quantile regression for high dimensional sparse models. Ann. Statist., 39(1):82--130, 2011.
[7] M. Buchinsky. Changes in the U.S. wage structure 1963--1987: application of quantile regression. Econometrica, 62(2):405--458, 1994.
[8] M. D. Cattaneo, R. K. Crump, and M. Jansson. Robust data-driven inference for density-weighted average derivatives. J. Amer. Statist. Assoc., 105(491):1070--1083, 2010.
[9] A. G. Chandrasekhar, V. Chernozhukov, F. Molinari, and P. Schrimpf. Inference on sets of best linear approximations to functions. MIT Working Paper, 2010.
[10] P. Chaudhuri. Nonparametric estimates of regression quantiles and their local Bahadur representation. Ann. Statist., 19(2):760--777, 1991.
[11] P. Chaudhuri, K. Doksum, and A. Samarov. On average derivative quantile regression. Ann. Statist., 25(2):715--744, 1997.
[12] X. Chen. Large sample sieve estimation of semi-nonparametric models. In J. J. Heckman and E. Leamer, editors, Handbook of Econometrics, Volume 6B, Chapter 76, 2006.
[13] X. Chen and X. Shen. Sieve extremum estimates for weakly dependent data. Econometrica, 66(2):289--314, 1998.
[14] V. Chernozhukov, I. Fernández-Val, and A. Galichon. Improving point and interval estimators of monotone functions by rearrangement. Biometrika, 96(3):559--575, 2009.
[15] V. Chernozhukov, S. Lee, and A. M. Rosen. Intersection bounds: estimation and inference. arXiv:0907.3503, 2009.
[16] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74(368):829--836, 1979.
[17] R. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics, 2000.
[18] O. Guédon and M. Rudelson. Lp-moments of random vectors via majorizing measures. Advances in Mathematics, 208:798--823, 2007.
[19] P. Hall and S. J. Sheather. On the distribution of a studentized quantile. J. Roy. Statist. Soc. Ser. B, 50(3):381--391, 1988.
[20] W. K. Härdle, Y. Ritov, and S. Song. Partial linear quantile regression and bootstrap confidence bands. SFB 649 Discussion Paper 2010-002, 2009.
[21] J. A. Hausman and W. K. Newey. Nonparametric estimation of exact consumers surplus and deadweight loss. Econometrica, 63(6):1445--1476, 1995.
[22] X. He and Q.-M. Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73:120--135, 2000.
[23] J. L. Horowitz and S. Lee. Nonparametric estimation of an additive quantile regression model. J. Amer. Statist. Assoc., 100(472):1238--1249, 2005.
[24] R. Koenker. Quantile Regression. Cambridge University Press, New York, 2005.
[25] R. Koenker and G. Bassett. Regression quantiles. Econometrica, 46(1):33--50, 1978.
[26] R. Koenker. quantreg: Quantile Regression. R package version 4.24, 2008.
[27] E. Kong, O. Linton, and Y. Xia. Uniform Bahadur representation for local polynomial estimates of M-regression and its application to the additive model. Econometric Theory, 26(5):1529--1564, 2010.
[28] S. Lee. Efficient semiparametric estimation of a partially linear quantile regression model. Econometric Theory, 19:1--31, 2003.
[29] W. K. Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79:147--168, 1997.
[30] M. I. Parzen, L. J. Wei, and Z. Ying. A resampling method based on pivotal estimating functions. Biometrika, 81(2):341--350, 1994.
[31] D. Pollard. A User's Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics, 2001.
[32] J. L. Powell. Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25(3):303--325, 1984.
[33] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0.
[34] R. Schmalensee and T. M. Stoker. Household gasoline demand in the United States. Econometrica, 67(3):645--662, 1999.
[35] C. J. Stone. Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10(4):1040--1053, 1982.
[36] S. van de Geer. M-estimation using penalties or sieves. Journal of Statistical Planning and Inference, 108(1-2):55--69, 2002.
[37] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics, 1996.
[38] H. L. White. Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the Symposium on the Interface, pages 190--199, 1992.
[39] A. Yatchew and J. A. No. Household gasoline demand in Canada. Econometrica, 69(6):1697--1709, 2001.