Massachusetts Institute of Technology
Department of Economics
Working Paper Series
CONDITIONAL QUANTILE PROCESSES BASED ON SERIES
OR MANY REGRESSORS
Alexandre Belloni
Victor Chernozhukov
Iván Fernández-Val
Working Paper 11-15
June 1, 2011
Room E52-251
50 Memorial Drive
Cambridge, MA 02142
This paper can be downloaded without charge from the
Social Science Research Network Paper Collection at
http://ssrn.com/abstract=1908413
CONDITIONAL QUANTILE PROCESSES BASED ON SERIES OR MANY
REGRESSORS
ALEXANDRE BELLONI, VICTOR CHERNOZHUKOV AND IVÁN FERNÁNDEZ-VAL
Abstract. Quantile regression (QR) is a principal regression method for analyzing the
impact of covariates on outcomes. The impact is described by the conditional quantile
function and its functionals. In this paper we develop the nonparametric QR series framework, covering many regressors as a special case, for performing inference on the entire
conditional quantile function and its linear functionals. In this framework, we approximate the entire conditional quantile function by a linear combination of series terms with
quantile-specific coefficients and estimate the function-valued coefficients from the data.
We develop large sample theory for the empirical QR coefficient process, namely we obtain uniform strong approximations to the empirical QR coefficient process by conditionally
pivotal and Gaussian processes, as well as by gradient and weighted bootstrap processes.
We apply these results to obtain estimation and inference methods for linear functionals of the conditional quantile function, such as the conditional quantile function itself, its
partial derivatives, average partial derivatives, and conditional average partial derivatives.
Specifically, we obtain uniform rates of convergence, large sample distributions, and inference methods based on strong pivotal and Gaussian approximations and on gradient and
weighted bootstraps. All of the above results are for function-valued parameters, holding
uniformly in both the quantile index and in the covariate value, and covering the pointwise
case as a by-product. If the function of interest is monotone, we show how to use monotonization procedures to improve estimation and inference. We demonstrate the practical
utility of these results with an empirical example, where we estimate the price elasticity
function of the individual demand for gasoline, as indexed by the individual unobserved
propensity for gasoline consumption.
Keywords. Quantile regression series processes, uniform inference.
JEL Subject Classification. C12, C13, C14.
AMS Subject Classification. 62G05, 62G15, 62G32.
Date: first version May 2006; this version June 1, 2011. The main results of this paper, particularly the pivotal
method for inference based on the entire quantile regression process, were first presented at the NAWM of the
Econometric Society, New Orleans, January 2008, and at the Stats in the Chateau in September 2009. We are
grateful to Arun Chandrasekhar,
Denis Chetverikov, Ye Luo, Denis Tkachenko, and Sami Stouli for careful readings of several versions of the paper. We thank
Andrew Chesher, Roger Koenker, Oliver Linton, Tatiana Komarova, and seminar participants at the Econometric Society
meeting, CEMMAP master-class, Stats in the Chateau, BU, Duke, and MIT for many useful suggestions. We gratefully
acknowledge research support from the NSF.
1. Introduction
Quantile regression (QR) is a principal regression method for analyzing the impact of
covariates on outcomes, particularly when the impact might be heterogeneous. This impact
is characterized by the conditional quantile function and its functionals [3, 7, 24]. For example, we can model the log of the individual demand for some good, Y , as a function of the
price of the good, the income of the individual, and other observed individual characteristics
X and an unobserved preference U for consuming the good, as
Y = Q(X, U ),
where the function Q is strictly increasing in the unobservable U . With the normalization
that U ∼ Uniform(0, 1) and the assumption that U and X are independent, the function
Q(X, u) is the u-th conditional quantile of Y given X, i.e. Q(X, u) = QY |X (u|X). This
function can be used for policy analysis. For example, we can determine how changes in
taxes for the good could impact demand heterogeneously across individuals.
In this paper we develop the nonparametric QR series framework for performing inference
on the entire conditional quantile function and its linear functionals. In this framework,
we approximate the entire conditional quantile function QY |X (u|x) by a linear combination
of series terms, Z(x)′ β(u). The vector Z(x) includes transformations of x that have good
approximation properties such as powers, trigonometrics, local polynomials, or B-splines.
The function u 7→ β(u) contains quantile-specific coefficients that can be estimated from
the data using the QR estimator of Koenker and Bassett [25]. As the number of series
terms grows, the approximation error QY |X (u|x) − Z(x)′ β(u) decreases, approaching zero
in the limit. By controlling the growth of the number of terms, we can obtain consistent
estimators and perform inference on the entire conditional quantile function and its linear
functionals. The QR series framework also covers as a special case the so-called many
regressors model, which is motivated by the many new types of data that have emerged in
the information age, such as scanner and online shopping data.
We now describe the main results in more detail. Let $\hat\beta(\cdot)$ denote the QR estimator of
$\beta(\cdot)$. The first set of results provides large-sample theory for the empirical QR coefficient
process of increasing dimension, $\sqrt{n}(\hat\beta(\cdot) - \beta(\cdot))$. We obtain uniform strong approximations
to this process by a sequence of the following stochastic processes of increasing dimension:
(i) a conditionally pivotal process,
(ii) a gradient bootstrap process,
(iii) a Gaussian process, and
(iv) a weighted bootstrap process.
To the best of our knowledge, all of the above results are new. The existence of the pivotal
approximation emerges from the special nature of QR, where a (sub) gradient of the sample objective function evaluated at the truth is pivotal conditional on the regressors. This
allows us to perform high-quality inference without even resorting to Gaussian approximations. We also show that the gradient bootstrap, introduced by Parzen, Wei and Ying [30]
in the parametric context, is effectively a means of carrying out the conditionally pivotal
approximation without explicitly estimating Jacobian matrices. The conditions for validity
of these two schemes require only a mild restriction on the growth of the number of series
terms in relation to the sample size. We also obtain a Gaussian approximation to the entire
distribution of QR process of increasing dimension by using chaining arguments and Yurinskii’s coupling. Moreover, we show that the weighted bootstrap works to approximate the
distribution of QR process for the same reason as the Gaussian approximation. The conditions for validity of the Gaussian and weighted bootstrap approximations, however, appear
to be substantively stronger than for the pivotal and gradient bootstrap approximations.
The second set of results provides estimation and inference methods for linear functionals
of the conditional quantile function, including
(i) the conditional quantile function itself, $(u,x) \mapsto Q_{Y|X}(u|x)$,
(ii) the partial derivative function, $(u,x) \mapsto \partial_{x_k} Q_{Y|X}(u|x)$,
(iii) the average partial derivative function, $u \mapsto \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x)$, and
(iv) the conditional average partial derivative, $(u,x_k) \mapsto \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x|x_k)$,
where µ is a given measure and xk is the k-th component of x. Specifically, we derive uniform rates of convergence, large sample distributions and inference methods based on the
strong pivotal and Gaussian approximations and on the gradient and weighted bootstraps.
It is noteworthy that all of the above results apply to function-valued parameters, holding
uniformly in both the quantile index and the covariate value, and covering pointwise normality and rate results as a special case. If the function of interest is monotone, we show
how to use monotonization procedures to improve estimation and inference.
The paper contributes and builds on the existing important literature on conditional
quantile estimation. First and foremost, we build on the work of He and Shao [22] that
studied the many regressors model and gave pointwise limit theorems for the QR estimator
in the case of a single quantile index. We go beyond the many regressors model to the series
model and develop large sample estimation and inference results for the entire QR process.
We also develop analogous estimation and inference results for the conditional quantile
function and its linear functionals, such as derivatives, average derivatives, conditional
average derivatives, and others. None of these results were available in the previous work.
We also build on Lee [28] that studied QR estimation of partially linear models in the
series framework for a single quantile index, and on Horowitz and Lee [23] that studied
nonparametric QR estimation of additive quantile models for a single quantile index in a
series framework. Our framework covers these partially linear models and additive models
as important special cases, and allows us to perform inference on a considerably richer set
of functionals, uniformly across covariate values and a continuum of quantile indices. Other
very important work includes Chaudhuri [10], and Chaudhuri, Doksum and Samarov [11],
Härdle, Ritov, and Song [20], Cattaneo, Crump, and Jansson [8], and Kong, Linton, and
Xia [27], among others, but this work focused on local, non-series, methods.
Our work also relies on the series literature, at least in a motivational and conceptual
sense. In particular, we rely on the work of Stone [35], Andrews [2], Newey [29], Chen
and Shen [13], Chen [12] and others that rigorously motivated the series framework as an
approximation scheme and gave pointwise normality results for least squares estimators,
and on Chen [12] and van de Geer [36] that gave (non-uniform) consistency and rate results
for general series estimators, including quantile regression for the case of a single quantile
index. White [38] established non-uniform consistency of nonparametric estimators of the
conditional quantile function based on a nonlinear series approximation using artificial neural networks. In contrast to the previous results, our rate results are uniform in covariate
values and quantile indices, and cover both the quantile function and its functionals. Moreover, we not only provide estimation rate results, but also derive a full set of results on
feasible inference based on the entire quantile regression process.
While relying on previous work for motivation, our results require us to develop both new
proof techniques and new approaches to inference. In particular, our proof techniques
rely on new maximal inequalities for function classes with growing moments and uniform
entropy. One of our inference approaches involves an approximation to the entire conditional
quantile process by a conditionally pivotal process, which is not Donsker in general, but
can be used for high-quality inference. The utility of this new technique is particularly
apparent in our high-dimensional setting. Under stronger conditions, we also establish an
asymptotically valid approximation to the quantile regression process by Gaussian processes
using Yurinskii’s coupling. Previously, [15] used Yurinskii’s coupling to obtain a strong
approximation to the least squares series estimator. The use of this technique in our context
is new and much more involved, because we approximate an entire empirical QR process of
an increasing dimension, instead of a vector of increasing dimension, by a Gaussian process.
Finally, it is noteworthy that our uniform inference results on functionals, where uniformity
is over covariate values, do not even have analogs in the least squares series literature (the
extension of our results to least squares is a subject of ongoing research, [5]).
This paper does not deal with sparse models, where there are some key series terms and
many “non-key” series terms which ideally should be omitted from estimation. In these
settings, the goal is to find and indeed remove most of the “non-key” series terms before
proceeding with estimation. [6] obtained rate results for quantile regression estimators in
this case, but did not provide inference results. Even though our paper does not explicitly deal with inference in sparse models after model selection, the methods and bounds
provided herein are useful for analyzing this problem. Investigating this matter rigorously
is a challenging issue, since it needs to take into account the model selection mistakes in
estimation, and is beyond the scope of the present paper; however, it is a subject of our
ongoing research.
Plan of the paper. The rest of the paper is organized as follows. In Section 2, we describe
the nonparametric QR series model and estimators. In Section 3, we derive asymptotic
theory for the series QR process. In Section 4, we give estimation and inference theory for
linear functionals of the conditional quantile function and show how to improve estimation
and inference by imposing monotonicity restrictions. In Section 5, we present an empirical
application to the demand for gasoline and a computational experiment calibrated to the
application. The computational algorithms to implement our inference methods and the
proofs of the main results are collected in the Appendices.
Notation. In what follows, $S^{m-1}$ denotes the unit sphere in $\mathbb{R}^m$. For $x \in \mathbb{R}^m$, we define
the Euclidean norm as $\|x\| := \sup_{\alpha \in S^{m-1}} |\alpha'x|$. For a set $I$, $\operatorname{diam}(I) = \sup_{v,\bar v \in I} \|v - \bar v\|$
denotes the diameter of $I$, and $\operatorname{int}(I)$ denotes the interior of $I$. For any two real numbers $a$
and $b$, $a \vee b = \max\{a,b\}$ and $a \wedge b = \min\{a,b\}$. Calligraphic letters are used to denote the
support of interest of a random variable or vector. For example, $\mathcal{U} \subset (0,1)$ is the support of
$U$, $\mathcal{X} \subset \mathbb{R}^d$ is the support of $X$, and $\mathcal{Z} = \{Z(x) \in \mathbb{R}^m : x \in \mathcal{X}\}$ is the support of $Z = Z(X)$.
The relation $a_n \lesssim b_n$ means that $a_n \le C b_n$ for a constant $C$ and for all $n$ large enough.
We denote by $P^*$ the probability measure induced by conditioning on any realization of
the data $\mathcal{D}_n := \{(Y_i, X_i) : 1 \le i \le n\}$. We say that a random variable $\Delta_n = o_{P^*}(1)$ in
$P$-probability if for any positive numbers $\epsilon > 0$ and $\eta > 0$, $P\{P^*(|\Delta_n| > \epsilon) > \eta\} = o(1)$, or,
equivalently, $P^*(|\Delta_n| > \epsilon) = o_P(1)$. We typically shall omit the qualifier "in $P$-probability."
The operator $E$ denotes the expectation with respect to the probability measure $P$, $\mathbb{E}_n$
denotes the expectation with respect to the empirical measure, and $\mathbb{G}_n$ denotes $\sqrt{n}(\mathbb{E}_n - E)$.
2. Series Framework
2.1. The set-up. The set-up corresponds to the nonparametric QR series framework:
\[ Y = Q_{Y|X}(U|X) = Z'\beta(U) + R(X,U), \qquad U|X \sim \text{Uniform}(0,1), \ \ \beta(u) \in \mathbb{R}^m, \]
where X is a d-vector of elementary regressors, and Z = Z(X) is an m-vector of approximating functions formed by transformations of X with the first component of Z equal to
one. The function Z ′ β(u) is the series approximation to QY |X (u|X) and is linear in the
parameter vector β(u), which is defined below. The term R(X, u) is the approximation
error, with the special case of R(X, u) = 0 corresponding to the many regressors model.
We allow the series terms Z(X) to change with the sample size n, i.e. Z(X) = Zn (X), but
we shall omit the explicit index indicating the dependence on n. We refer the reader to
Newey [29] and Chen [12] for examples of series functions, including (i) regression splines,
(ii) polynomials, (iii) trigonometric series, (iv) compactly supported wavelets, and others.
Interestingly, in the latter scheme, the entire collection of series terms is dependent upon
the sample size.
For each quantile index $u \in (0,1)$, the population coefficient $\beta(u)$ is defined as a solution to
\[ \min_{\beta \in \mathbb{R}^m} E[\rho_u(Y - Z'\beta)], \qquad (2.1) \]
where $\rho_u(z) = (u - 1\{z < 0\})z$ is the check function ([24]). Thus, this coefficient is not
a solution to an approximation problem but to a prediction problem. However, we show
in Lemma 1 in Appendix B that the solution to (2.1) inherits important approximation
properties from the best least squares approximation of $Q_{Y|X}(u|\cdot)$ by a linear combination
of $Z(\cdot)$.
We consider estimation of the coefficient function $u \mapsto \beta(u)$ using the entire quantile
regression process over $\mathcal{U}$, a compact subset of $(0,1)$,
\[ \{u \mapsto \hat\beta(u), \ u \in \mathcal{U}\}, \]
namely, for each $u \in \mathcal{U}$, the estimator $\hat\beta(u)$ is a solution to the empirical analog of the
population problem (2.1),
\[ \min_{\beta \in \mathbb{R}^m} \mathbb{E}_n[\rho_u(Y_i - Z_i'\beta)]. \qquad (2.2) \]
We are also interested in various functionals of $Q_{Y|X}(\cdot|x)$. We estimate them by the
corresponding functionals of $Z(x)'\hat\beta(\cdot)$, such as the estimator of the entire conditional quantile
function, $(x,u) \mapsto Z(x)'\hat\beta(u)$, and derivatives of this function with respect to the elementary
covariates $x$.
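To fix ideas, the following minimal R sketch computes the QR process estimator (2.2) on a grid of quantile indices using the quantreg package of [26]; the data-generating process and basis choice here are hypothetical and purely illustrative:

    library(quantreg)                      # rq() implements Koenker-Bassett QR
    library(splines)                       # bs() builds B-spline series terms
    set.seed(1)
    n <- 500
    x <- runif(n)                          # elementary regressor X
    y <- sqrt(x) + 0.25 * (1 + x) * rnorm(n)    # hypothetical outcome equation
    U.grid <- seq(0.10, 0.90, by = 0.01)   # grid approximating the index set U
    # Z(x) = (1, B-spline terms); rq() adds the intercept automatically
    fit <- rq(y ~ bs(x, df = 6), tau = U.grid)
    betahat <- coef(fit)                   # m x length(U.grid) matrix: u -> betahat(u)
    # conditional quantile estimates at new points: predict(fit, newdata = ...)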
2.2. Regularity Conditions. In our framework the entire model can change with n, although we shall omit the explicit indexing by n. We will use the following primitive assumptions on the data generating process, as n → ∞ and m = m(n) → ∞.
Condition S.
S.1 The data $\mathcal{D}_n = \{(Y_i, X_i')' : 1 \le i \le n\}$ are an i.i.d. sequence of real $(1+d)$-vectors,
and $Z_i = Z(X_i)$ is a real $m$-vector for $i = 1, \dots, n$.
S.2 The conditional density of the response variable $f_{Y|X}(y|x)$ is bounded above by $\bar f$
and its derivative in $y$ is bounded above by $\bar f'$, uniformly in the arguments $y$ and
$x \in \mathcal{X}$ and in $n$; moreover, $f_{Y|X}(Q_{Y|X}(u|x)|x)$ is bounded away from zero uniformly
for all arguments $u \in \mathcal{U}$, $x \in \mathcal{X}$, and $n$.
S.3 For every $m$, the eigenvalues of the Gram matrix $\Sigma_m = E[ZZ']$ are bounded from
above and away from zero, uniformly in $n$.
S.4 The norm of the series terms obeys $\max_{i \le n} \|Z_i\| \le \zeta(m,d,n) := \zeta_m$.
S.5 The approximation error term $R(X,U)$ is such that $\sup_{x \in \mathcal{X}, u \in \mathcal{U}} |R(x,u)| \lesssim m^{-\kappa}$.
Comment 2.1. Condition S is a simple set of sufficient conditions. Our proofs and the lemmas
gathered in the appendices hold under more general, but less primitive, conditions.
Condition S.2 imposes mild smoothness assumptions on the conditional density function.
Conditions S.3 and S.4 impose plausible conditions on the design; see Newey [29]. Conditions
S.2 and S.3 together imply that the eigenvalues of the Jacobian matrix
\[ J_m(u) = E[f_{Y|X}(Q_{Y|X}(u|X)|X)\,ZZ'] \]
are bounded away from zero and from above; Lemma 12 further shows that this together
with S.5 implies that the eigenvalues of $\tilde J_m(u) = E[f_{Y|X}(Z'\beta(u)|X)\,ZZ']$ are bounded away
from zero and from above, since the two matrices are uniformly close. Assumption S.4
imposes a uniform bound on the norm of the transformed vector, which can grow with
the sample size. For example, we have $\zeta_m = \sqrt{m}$ for splines and $\zeta_m = m$ for polynomials.
Assumption S.5 introduces a bound on the approximation error and is the only non-primitive
assumption. Deriving sharp bounds on the sup norm of the approximation error is a
delicate subject of approximation theory even for least squares, with the exception of local
polynomials; see, e.g., [5] for a discussion of the least squares case. However, in Lemma 1
in Appendix B, we characterize the nature of the approximation by establishing that the
vector $\beta(u)$ solving (2.1) is equivalent to the least squares approximation to $Q_{Y|X}(u|x)$ in
terms of the order of the $L^2$ error. We also deduce a simple upper bound on the sup norm
of the approximation error. In applications, we recommend analyzing the size of the
approximation error numerically. This is possible by considering examples of plausible
functions $Q_{Y|X}(u|x)$ and comparing them to the best linear approximation schemes, as this
directly leads to sharp results. We give an example of how to implement this approach in
the empirical application; a minimal version of the check appears in the sketch below.
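As an illustration, the following minimal R sketch performs this check for a hypothetical target function $Q_{Y|X}(u|x)$ and a B-spline basis; both the function and the basis choice are assumptions made purely for illustration:

    library(splines)
    u  <- 0.5
    xg <- seq(0, 1, length.out = 1000)            # fine grid over X = [0, 1]
    Q  <- sqrt(xg) + qnorm(u) * 0.25 * (1 + xg)   # hypothetical Q_{Y|X}(u|x)
    Zg <- cbind(1, bs(xg, df = 6))                # series terms Z(x) on the grid
    ls <- lm.fit(Zg, Q)                           # best L2 linear approximation
    max(abs(ls$residuals))                        # sup-norm approximation error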
3. Asymptotic Theory for QR Coefficient Processes based on Series
3.1. Uniform Convergence Rates for Series QR Coefficients. The first main result
is a uniform rate of convergence for the series estimators of the QR coefficients.
Theorem 1 (Uniform Convergence Rate for Series QR Coefficients). Under Condition S,
and provided that $\zeta_m^2 m \log n = o(n)$,
\[ \sup_{u \in \mathcal{U}} \|\hat\beta(u) - \beta(u)\| \lesssim_P \sqrt{\frac{m \log n}{n}}. \]
Thus, up to a logarithmic factor, the uniform convergence over U is achieved at the
same rate as for a single quantile index ([22]). The proof of this theorem relies on new
concentration inequalities that control the behavior of the empirical eigenvalues of the design
matrix.
3.2. Uniform Strong Approximations to the Series QR Process. The second main
result is the approximation of the QR process by a linear, conditionally pivotal process.
Theorem 2 (Strong Approximations to the QR Process by a Pivotal Coupling). Under
Condition S, $m^3 \zeta_m^2 \log^7 n = o(n)$, and $m^{-\kappa+1} \log^3 n = o(1)$, the QR process is uniformly
close to a conditionally pivotal process, namely
\[ \sqrt{n}\,(\hat\beta(u) - \beta(u)) = J_m^{-1}(u)\,\mathbb{U}_n(u) + r_n(u), \]
where
\[ \mathbb{U}_n(u) := \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i (u - 1\{U_i \le u\}), \qquad (3.3) \]
$U_1, \dots, U_n$ are i.i.d. Uniform(0,1), independently distributed of $Z_1, \dots, Z_n$, and
\[ \sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \frac{m^{3/4} \zeta_m^{1/2} \log^{3/4} n}{n^{1/4}} + \sqrt{m^{1-\kappa} \log n} = o(1/\log n). \]
The theorem establishes that the QR process is approximately equal to the process
$J_m^{-1}(\cdot)\,\mathbb{U}_n(\cdot)$, which is pivotal conditional on $Z_1, \dots, Z_n$. This is quite useful, since we can
simulate $\mathbb{U}_n(\cdot)$ in a straightforward fashion, conditional on $Z_1, \dots, Z_n$, and we can estimate
the matrices $J_m(\cdot)$ using Powell's method [32]. Thus, we can perform inference based on the
entire empirical QR process in series regression or many-regressors contexts. The following
theorem establishes the formal validity of this approach, which we call the pivotal method.
Theorem 3 (Pivotal Method). Let $\hat J_m(u)$ denote the estimator of $J_m(u)$ defined in (3.7),
with bandwidth $h_n$ obeying $h_n = o(1)$ and $h_n \sqrt{m} \log^{3/2} n = o(1)$. Under Condition S,
$m^{-\kappa+1/2} \log^{3/2} n = o(1)$, and $\zeta_m^2 m^2 \log^4 n = o(nh_n)$, the feasible pivotal process $\hat J_m^{-1}(\cdot)\,\mathbb{U}_n^*(\cdot)$
correctly approximates a copy $J_m^{-1}(\cdot)\,\mathbb{U}_n^*(\cdot)$ of the pivotal process defined in Theorem 2:
\[ \hat J_m^{-1}(u)\,\mathbb{U}_n^*(u) = J_m^{-1}(u)\,\mathbb{U}_n^*(u) + r_n(u), \]
where
\[ \mathbb{U}_n^*(u) := \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i (u - 1\{U_i^* \le u\}), \qquad (3.4) \]
$U_1^*, \dots, U_n^*$ are i.i.d. Uniform(0,1), independently distributed of $Z_1, \dots, Z_n$, and
\[ \sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \sqrt{\frac{\zeta_m^2 m^2 \log^2 n}{n h_n}} + m^{-\kappa+1/2} \sqrt{\log n} + h_n \sqrt{m \log n} = o(1/\log n). \]
This method is closely related to another approach to inference, which we refer to here as
the gradient bootstrap method. This approach was previously introduced by Parzen, Wei
and Ying [30] for parametric models of fixed dimension. We extend it to the considerably
more general nonparametric series framework. The main idea is to generate draws $\hat\beta^*(\cdot)$ of
the QR process as solutions to QR problems with gradients perturbed by the pivotal quantity
$\mathbb{U}_n^*(\cdot)/\sqrt{n}$. In particular, we define a gradient bootstrap draw $\hat\beta^*(u)$ as the solution to
the problem
\[ \min_{\beta \in \mathbb{R}^m} \mathbb{E}_n[\rho_u(Y_i - Z_i'\beta)] - \mathbb{U}_n^*(u)'\beta/\sqrt{n}, \qquad (3.5) \]
for each $u \in \mathcal{U}$, where $\mathbb{U}_n^*(\cdot)$ is defined in (3.4). The problem is solved many times for
independent draws of $\mathbb{U}_n^*(\cdot)$, and the distribution of $\sqrt{n}(\hat\beta(\cdot) - \beta(\cdot))$ is approximated by the
empirical distribution of the bootstrap draws $\sqrt{n}(\hat\beta^*(\cdot) - \hat\beta(\cdot))$.
Theorem 4 (Gradient Bootstrap Method). Under Condition S, $\zeta_m^2 m^3 \log^7 n = o(n)$, and
$m^{-\kappa+1/2} \log^{3/2} n = o(1)$, the gradient bootstrap process correctly approximates a copy $J_m^{-1}(\cdot)\,\mathbb{U}_n^*(\cdot)$
of the pivotal process defined in Theorem 2,
\[ \sqrt{n}\,(\hat\beta^*(u) - \hat\beta(u)) = J_m^{-1}(u)\,\mathbb{U}_n^*(u) + r_n(u), \]
where $\mathbb{U}_n^*(u)$ is defined in (3.4) and
\[ \sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \frac{m^{3/4} \zeta_m^{1/2} \log^{3/4} n}{n^{1/4}} + \sqrt{m^{1-\kappa} \log n} = o(1/\log n). \]
The stated bound continues to hold in $P$-probability if we replace the unconditional probability
$P$ by the conditional probability $P^*$.
The main advantage of this method relative to the pivotal method is that it does not
require estimating $J_m(\cdot)$ explicitly. The disadvantage is that in large problems the gradient
bootstrap can be computationally much more expensive.
Next we turn to a strong approximation based on a sequence of Gaussian processes.
Theorem 5 (Strong Approximation to the QR Process by a Gaussian Coupling). Under
conditions S.1-S.4 and $m^7 \zeta_m^6 \log^{22} n = o(n)$, there exists a sequence of zero-mean Gaussian
processes $G_n(\cdot)$ with a.s. continuous paths that have the same covariance functions as the
pivotal process $\mathbb{U}_n(\cdot)$ in (3.4) conditional on $Z_1, \dots, Z_n$, namely,
\[ E[G_n(u)G_n(u')'] = E[\mathbb{U}_n(u)\mathbb{U}_n(u')'] = \mathbb{E}_n[Z_i Z_i'](u \wedge u' - uu'), \ \text{for all } u, u' \in \mathcal{U}. \]
Also, $G_n(\cdot)$ approximates the empirical process $\mathbb{U}_n(\cdot)$, namely,
\[ \sup_{u \in \mathcal{U}} \|\mathbb{U}_n(u) - G_n(u)\| \lesssim_P o(1/\log n). \]
Consequently, if in addition S.5 holds with $m^{-\kappa+1} \log^3 n = o(1)$,
\[ \sup_{u \in \mathcal{U}} \|\sqrt{n}(\hat\beta(u) - \beta(u)) - J_m^{-1}(u)G_n(u)\| \lesssim_P o(1/\log n). \]
Moreover, under the conditions of Theorem 3, the feasible Gaussian process $\hat J_m^{-1}(\cdot)G_n(\cdot)$
correctly approximates $J_m^{-1}(\cdot)G_n(\cdot)$:
\[ \hat J_m^{-1}(u)G_n(u) = J_m^{-1}(u)G_n(u) + r_n(u), \]
where
\[ \sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \sqrt{\frac{\zeta_m^2 m^2 \log^2 n}{n h_n}} + m^{-\kappa+1/2} \sqrt{\log n} + h_n \sqrt{m \log n} = o(1/\log n). \]
The requirement on the growth of the dimension m in Theorem 5 is a consequence
of a step in the proof that relies upon Yurinskii’s coupling. Therefore, improving that
step through the use of another coupling could lead to significant improvements in such
requirements. Note, however, that the pivotal approximation has a much weaker growth
restriction, and it is also clear that this approximation should be more accurate than any
further approximation of the pivotal approximation by a Gaussian process.
Another related inference method is the weighted bootstrap for the entire QR process.
Consider a set of weights h1 , ..., hn that are i.i.d. draws from the standard exponential
distribution. For each draw of such weights, define the weighted bootstrap draw of the QR
process as a solution to the QR problem weighted by h1 , . . . , hn :
\[ \hat\beta^b(u) \in \arg\min_{\beta \in \mathbb{R}^m} \mathbb{E}_n[h_i \,\rho_u(Y_i - Z_i'\beta)], \quad \text{for } u \in \mathcal{U}. \]
The following theorem establishes that the weighted bootstrap distribution is valid for
approximating the distribution of the QR process.
Theorem 6 (Weighted Bootstrap Method). (1) Under Condition S, $\zeta_m^2 m^3 \log^9 n = o(n)$,
and $m^{-\kappa+1} \log^3 n = o(1)$, the weighted bootstrap process satisfies
\[ \sqrt{n}\,(\hat\beta^b(u) - \hat\beta(u)) = \frac{J_m^{-1}(u)}{\sqrt{n}} \sum_{i=1}^n (h_i - 1) Z_i (u - 1\{U_i \le u\}) + r_n(u), \]
where $U_1, \dots, U_n$ are i.i.d. Uniform(0,1), independently distributed of $Z_1, \dots, Z_n$, and
\[ \sup_{u \in \mathcal{U}} \|r_n(u)\| \lesssim_P \frac{m^{3/4} \zeta_m^{1/2} \log^{5/4} n}{n^{1/4}} + \sqrt{m^{1-\kappa} \log n} = o(1/\log n). \]
The bound continues to hold in $P$-probability if we replace the unconditional probability $P$
by $P^*$.
(2) Furthermore, under the conditions of Theorem 5, the weighted bootstrap process
approximates the Gaussian process $J_m^{-1}(\cdot)G_n(\cdot)$ defined in Theorem 5, that is,
\[ \sup_{u \in \mathcal{U}} \|\sqrt{n}(\hat\beta^b(u) - \hat\beta(u)) - J_m^{-1}(u)G_n(u)\| \lesssim_P o(1/\log n). \]
In comparison with the pivotal and gradient bootstrap methods, the Gaussian and
weighted bootstrap methods require stronger assumptions.
3.3. Estimation of Matrices. In order to implement some of the inference methods, we
need uniformly consistent estimators of the Gram and Jacobian matrices. The natural
candidates are
\[ \hat\Sigma_m = \mathbb{E}_n[Z_i Z_i'], \qquad (3.6) \]
\[ \hat J_m(u) = \frac{1}{2h_n}\,\mathbb{E}_n[1\{|Y_i - Z_i'\hat\beta(u)| \le h_n\}\, Z_i Z_i'], \qquad (3.7) \]
where $h_n$ is a bandwidth parameter such that $h_n \to 0$, and $u \in \mathcal{U}$. The following result
establishes uniform consistency of these estimators and provides an appropriate rate for the
bandwidth $h_n$, which depends on the growth of the model.
Theorem 7 (Estimation of Gram and Jacobian Matrices). If conditions S.1-S.4 and
$\zeta_m^2 \log n = o(n)$ hold, then $\hat\Sigma_m - \Sigma_m = o_P(1)$ in the eigenvalue norm. If conditions S.1-S.5,
$h_n = o(1)$, and $m \zeta_m^2 \log n = o(nh_n)$ hold, then $\hat J_m(u) - J_m(u) = o_P(1)$ in the eigenvalue
norm, uniformly in $u \in \mathcal{U}$.
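For concreteness, a minimal R sketch of the estimators (3.6)-(3.7), assuming hypothetical objects Z (the n x m matrix of series terms), y (the outcomes), and betahat_u (a QR coefficient estimate at quantile u, e.g. from the sketch in Section 2):

    Sigmahat <- crossprod(Z) / nrow(Z)       # Gram matrix: E_n[Z_i Z_i']
    Jhat <- function(u, betahat_u, h) {      # Powell kernel estimator of J_m(u)
      e <- drop(y - Z %*% betahat_u)         # QR residuals at quantile u
      crossprod(Z[abs(e) <= h, , drop = FALSE]) / (2 * h * nrow(Z))
    }
    # bandwidth.rq(u, nrow(Z), hs = TRUE) from quantreg gives the Hall-Sheather
    # bandwidth we recommend in Appendix A.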
4. Estimation and Inference Methods for Linear Functionals
In this section we apply the previous results to develop estimation and inference methods
for various linear functionals of the conditional quantile function.
4.1. Functionals of Interest. We are interested in various functions $\theta$ created as linear
functionals of $Q_{Y|X}(u|\cdot)$ for all $u \in \mathcal{U}$. Particularly useful examples include, for $x = (w,v)$
and $x_k$ denoting the $k$-th component of $x$:
1. the function itself: $\theta(u,x) = Q_{Y|X}(u|x)$;
2. the derivative: $\theta(u,x) = \partial_{x_k} Q_{Y|X}(u|x)$;
3. the average derivative: $\theta(u) = \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x)$;
4. the conditional average derivative: $\theta(u|w) = \int \partial_{x_k} Q_{Y|X}(u|w,v)\,d\mu(v|w)$.
The measures $\mu$ entering the definitions above are taken as known; the results can be
extended to include estimated measures.
Note that in each example we could be interested in estimating $\theta(u,w)$ simultaneously
for many values of $(u,w)$. For $w \in \mathcal{W} \subset \mathbb{R}^d$, let $I \subset \mathcal{U} \times \mathcal{W}$ denote the set of indices $(u,w)$
of interest. For example:
• the function at a particular point $(u,w)$, in which case $I = \{(u,w)\}$;
• the function $u \mapsto \theta(u,w)$, having fixed $w$, in which case $I = \mathcal{U} \times \{w\}$;
• the function $w \mapsto \theta(u,w)$, having fixed $u$, in which case $I = \{u\} \times \mathcal{W}$;
• the entire function $(u,w) \mapsto \theta(u,w)$, in which case $I = \mathcal{U} \times \mathcal{W}$.
By the linearity of the series approximations, the above parameters can be seen as linear
functionals of the quantile regression coefficients $\beta(u)$ up to an approximation error, that is,
\[ \theta(u,w) = \ell(w)'\beta(u) + r_n(u,w), \quad (u,w) \in I, \qquad (4.8) \]
where $\ell(w)'\beta(u)$ is the series approximation, with $\ell(w)$ denoting the $m$-vector of loadings on
the coefficients, and $r_n(u,w)$ is the remainder term, which corresponds to the approximation
error. Indeed, in each of the examples above, this decomposition arises from the application
of a linear operator $A$ to the decomposition $Q_{Y|X}(u|\cdot) = Z(\cdot)'\beta(u) + R(u,\cdot)$ and
evaluating the resulting functions at $w$:
\[ \big(A Q_{Y|X}(u|\cdot)\big)[w] = (A Z(\cdot))[w]'\beta(u) + (A R(u,\cdot))[w]. \qquad (4.9) \]
In the four examples above the operator $A$ is given by, respectively:
1. the identity operator: $(Af)[x] = f[x]$, so that $\ell(x) = Z(x)$, $r_n(u,x) = R(u,x)$;
2. a differential operator: $(Af)[x] = (\partial_{x_k} f)[x]$, so that $\ell(x) = \partial_{x_k} Z(x)$, $r_n(u,x) = \partial_{x_k} R(u,x)$;
3. an integro-differential operator: $Af = \int \partial_{x_k} f(x)\,d\mu(x)$, so that
$\ell = \int \partial_{x_k} Z(x)\,d\mu(x)$, $r_n(u) = \int \partial_{x_k} R(u,x)\,d\mu(x)$;
4. a partial integro-differential operator: $(Af)[w] = \int \partial_{x_k} f(x)\,d\mu(v|w)$, so that
$\ell(w) = \int \partial_{x_k} Z(x)\,d\mu(v|w)$, $r_n(u,w) = \int \partial_{x_k} R(u,x)\,d\mu(v|w)$.
For notational convenience, we use the formulation (4.8) in the analysis, instead of the
motivational formulation (4.9).
We shall provide inference tools that are valid for inference on the series approximation
\[ \ell(w)'\beta(u), \quad (u,w) \in I, \]
and, provided that the approximation error $r_n(u,w)$, $(u,w) \in I$, is small enough compared
to the estimation noise, these tools will also be valid for inference on the functional of
interest,
\[ \theta(u,w), \quad (u,w) \in I. \]
Therefore, the series approximation is an important penultimate target, whereas the
functional $\theta$ is the ultimate target. The inference will be based on the plug-in estimator
$\hat\theta(u,w) := \ell(w)'\hat\beta(u)$ of the series approximation $\ell(w)'\beta(u)$ and hence of the final target
$\theta(u,w)$.
4.2. Pointwise Rates and Inference on the Linear Functionals. Let $(u,w)$ be a pair
of a quantile index value $u$ and a covariate value $w$ at which we are interested in performing
inference on $\theta(u,w)$. In principle, this deterministic value can be implicitly indexed by $n$,
although we suppress the dependence.
Condition P.
P.1 The approximation error is small, namely $\sqrt{n}\,|r_n(u,w)|/\|\ell(w)\| = o(1)$.
P.2 The norm of the loading $\ell(w)$ satisfies $\|\ell(w)\| \lesssim \xi_\theta(m,w)$.
Theorem 8 (Pointwise Convergence Rate for Linear Functionals). Assume that the
conditions of Theorem 2 and Condition P hold. Then
\[ |\hat\theta(u,w) - \theta(u,w)| \lesssim_P \frac{\xi_\theta(m,w)}{\sqrt{n}}. \qquad (4.10) \]
In order to perform inference, we construct the variance estimator
\[ \hat\sigma_n^2(u,w) = u(1-u)\,\ell(w)'\hat J_m^{-1}(u)\,\hat\Sigma_m\,\hat J_m^{-1}(u)\,\ell(w)/n. \qquad (4.11) \]
Under the stated conditions this quantity is consistent for
\[ \sigma_n^2(u,w) = u(1-u)\,\ell(w)'J_m^{-1}(u)\,\Sigma_m\,J_m^{-1}(u)\,\ell(w)/n. \qquad (4.12) \]
Finally, consider the t-statistic
\[ t_n(u,w) = \frac{\hat\theta(u,w) - \theta(u,w)}{\hat\sigma_n(u,w)}. \]
Condition P also ensures that the approximation error is small, so that
\[ t_n(u,w) = \frac{\ell(w)'(\hat\beta(u) - \beta(u))}{\hat\sigma_n(u,w)} + o_P(1). \]
We can carry out standard inference based on this statistic because $t_n(u,w) \to_d N(0,1)$.
Moreover, we can base inference on the quantiles of the following statistics:
\[ \text{pivotal coupling:} \quad t_n^*(u,w) = \frac{\ell(w)'\hat J_m^{-1}(u)\,\mathbb{U}_n^*(u)/\sqrt{n}}{\hat\sigma_n(u,w)}; \]
\[ \text{gradient bootstrap coupling:} \quad t_n^*(u,w) = \frac{\ell(w)'(\hat\beta^*(u) - \hat\beta(u))}{\hat\sigma_n(u,w)}; \qquad (4.13) \]
\[ \text{weighted bootstrap coupling:} \quad t_n^*(u,w) = \frac{\ell(w)'(\hat\beta^b(u) - \hat\beta(u))}{\hat\sigma_n(u,w)}. \]
Accordingly, let $k_n(1-\alpha)$ be the $1-\alpha/2$ quantile of the standard normal variable, or let
$k_n(1-\alpha)$ denote the $1-\alpha$ quantile of the random variable $|t_n^*(u,w)|$ conditional on the data,
i.e. $k_n(1-\alpha) = \inf\{t : P(|t_n^*(u,w)| \le t \mid \mathcal{D}_n) \ge 1-\alpha\}$. Then a pointwise $(1-\alpha)$-confidence
interval can be formed as
\[ [\dot\iota(u,w), \ddot\iota(u,w)] = [\hat\theta(u,w) - k_n(1-\alpha)\hat\sigma_n(u,w), \ \hat\theta(u,w) + k_n(1-\alpha)\hat\sigma_n(u,w)]. \]
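As an illustration, a minimal R sketch of this pointwise interval, assuming the hypothetical objects Sigmahat and Jhat from the sketch in Section 3.3, a loading vector ell = l(w), and a point estimate thetahat = l(w)'betahat(u):

    sigmahat <- function(u, ell, betahat_u, h) {    # sigma-hat_n(u,w) in (4.11)
      Ji <- solve(Jhat(u, betahat_u, h))
      sqrt(u * (1 - u) * drop(t(ell) %*% Ji %*% Sigmahat %*% Ji %*% ell) / nrow(Z))
    }
    s  <- sigmahat(u, ell, betahat_u, h)
    ci <- thetahat + c(-1, 1) * qnorm(1 - 0.10 / 2) * s   # 90% pointwise interval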
Theorem 9 (Pointwise Inference for Linear Functionals). Suppose that the conditions of
Theorems 2 and 7 and Condition P hold. Then
\[ t_n(u,w) \to_d N(0,1). \]
For the case of using the gradient bootstrap method, suppose that the conditions of Theorem 4
hold. For the case of using the weighted bootstrap method, suppose that the conditions of
Theorem 6 hold. Then $t_n^*(u,w) \to_d N(0,1)$ conditional on the data.
Moreover, the $(1-\alpha)$-confidence band $[\dot\iota(u,w), \ddot\iota(u,w)]$ covers $\theta(u,w)$ with probability that
is asymptotically no less than $1-\alpha$, namely
\[ P\{\theta(u,w) \in [\dot\iota(u,w), \ddot\iota(u,w)]\} \ge 1 - \alpha + o(1). \]
4.3. Uniform Rates and Inference on the Linear Functionals. For $f : I \mapsto \mathbb{R}^k$,
define the norm
\[ \|f\|_I := \sup_{(u,w) \in I} \|f(u,w)\|. \]
We shall invoke the following assumptions to establish rates and uniform inference results
over the region $I$.
Condition U.
U.1 The approximation error is small, namely $\sqrt{n}\,\log n\ \sup_{(u,w) \in I} |r_n(u,w)|/\|\ell(w)\| = o(1)$.
U.2 The loadings $\ell(w)$ are uniformly bounded and admit Lipschitz coefficients $\xi_\theta^L(m,I)$,
that is, $\|\ell\|_I \lesssim \xi_\theta(m,I)$, $\|\ell(w) - \ell(w')\| \le \xi_\theta^L(m,I)\|w - w'\|$, and
\[ \log[\operatorname{diam}(I) \vee \xi_\theta(m,I) \vee \xi_\theta^L(m,I) \vee \zeta_m] \lesssim \log m. \]
Theorem 10 (Uniform Convergence Rate for Linear Functionals). Assume that the
conditions of Theorem 2 and Condition U hold, and that $d\,\xi_\theta^2(m,I)\,\zeta_m^2 \log^2 n = o(n)$. Then
\[ \sup_{(u,w) \in I} |\hat\theta(u,w) - \theta(u,w)| \lesssim_P \frac{\xi_\theta(m,I) \vee 1}{\sqrt{n}}\,\log n. \qquad (4.14) \]
As in the pointwise case, consider the estimator $\hat\sigma_n^2(u,w)$ of the variance in (4.11). Under
our conditions this quantity is uniformly consistent for $\sigma_n^2(u,w)$ defined in (4.12), namely
$\hat\sigma_n^2(u,w)/\sigma_n^2(u,w) = 1 + o_P(1/\log n)$ uniformly over $(u,w) \in I$. Then we consider the
t-statistic process
\[ \left\{\, t_n(u,w) = \frac{\hat\theta(u,w) - \theta(u,w)}{\hat\sigma_n(u,w)}, \ (u,w) \in I \,\right\}. \]
Under our assumptions the approximation error is small, so that
\[ t_n(u,w) = \frac{\ell(w)'(\hat\beta(u) - \beta(u))}{\hat\sigma_n(u,w)} + o_P(1/\log n) \quad \text{in } \ell^\infty(I). \]
The main result on inference is that the t-statistic process can be strongly approximated
by the following pivotal processes or couplings:
\[ \text{pivotal coupling:} \quad \left\{\, t_n^*(u,w) = \frac{\ell(w)'\hat J_m^{-1}(u)\,\mathbb{U}_n^*(u)/\sqrt{n}}{\hat\sigma_n(u,w)}, \ (u,w) \in I \,\right\}; \]
\[ \text{gradient bootstrap coupling:} \quad \left\{\, t_n^*(u,w) = \frac{\ell(w)'(\hat\beta^*(u) - \hat\beta(u))}{\hat\sigma_n(u,w)}, \ (u,w) \in I \,\right\}; \]
\[ \text{Gaussian coupling:} \quad \left\{\, t_n^*(u,w) = \frac{\ell(w)'\hat J_m^{-1}(u)\,G_n(u)/\sqrt{n}}{\hat\sigma_n(u,w)}, \ (u,w) \in I \,\right\}; \]
\[ \text{weighted bootstrap coupling:} \quad \left\{\, t_n^*(u,w) = \frac{\ell(w)'(\hat\beta^b(u) - \hat\beta(u))}{\hat\sigma_n(u,w)}, \ (u,w) \in I \,\right\}. \qquad (4.15) \]
The following theorem shows that these couplings approximate the distribution of the
t-statistic process in large samples.
Theorem 11 (Strong Approximation of Inferential Processes by Couplings). Suppose that
the conditions of Theorems 2 and 3, and Condition U hold. For the case of using the
gradient bootstrap method, suppose that the conditions of Theorem 4 hold. For the case of
using the Gaussian approximation, suppose that the conditions of Theorem 5 hold. For the
case of using the weighted bootstrap method, suppose that the conditions of Theorem 6 hold.
Then
\[ t_n(u,w) =_d t_n^*(u,w) + o_P(1/\log n) \quad \text{in } \ell^\infty(I), \]
where $P$ can be replaced by $P^*$.
To construct uniform two-sided confidence bands for $\{\theta(u,w) : (u,w) \in I\}$, we consider
the maximal t-statistic
\[ \|t_n\|_I = \sup_{(u,w) \in I} |t_n(u,w)|, \]
as well as the couplings to this statistic in the form
\[ \|t_n^*\|_I = \sup_{(u,w) \in I} |t_n^*(u,w)|. \]
Ideally, we would like to use quantiles of the first statistic as critical values, but we do not
know them. We instead use quantiles of the second statistic as large-sample approximations.
Let $k_n(1-\alpha)$ denote the $1-\alpha$ quantile of the random variable $\|t_n^*\|_I$ conditional on the data,
i.e.
\[ k_n(1-\alpha) = \inf\{t : P(\|t_n^*\|_I \le t \mid \mathcal{D}_n) \ge 1-\alpha\}. \]
This quantity can be computed numerically by Monte Carlo methods, as we illustrate in
the empirical section.
Let $\delta_n > 0$ be a finite-sample expansion factor such that $\delta_n \log^{1/2} n \to 0$ but $\delta_n \log n \to \infty$.
For example, we recommend setting $\delta_n = 1/(4 \log^{3/4} n)$. Then for $c_n(1-\alpha) = k_n(1-\alpha) + \delta_n$
we define the confidence bands of asymptotic level $1-\alpha$ to be
\[ [\dot\iota(u,w), \ddot\iota(u,w)] = [\hat\theta(u,w) - c_n(1-\alpha)\hat\sigma_n(u,w), \ \hat\theta(u,w) + c_n(1-\alpha)\hat\sigma_n(u,w)], \quad (u,w) \in I. \]
The following theorem establishes the asymptotic validity of these confidence bands.
Theorem 12 (Uniform Inference for Linear Functionals). Suppose that the conditions of
Theorem 11 hold for a given coupling method.
(1) Then
\[ P\{\|t_n\|_I \le c_n(1-\alpha)\} \ge 1 - \alpha + o(1). \qquad (4.16) \]
(2) As a consequence, the confidence bands constructed above cover $\theta(u,w)$ uniformly for
all $(u,w) \in I$ with probability that is asymptotically no less than $1-\alpha$, namely
\[ P\{\theta(u,w) \in [\dot\iota(u,w), \ddot\iota(u,w)] \ \text{for all } (u,w) \in I\} \ge 1 - \alpha + o(1). \qquad (4.17) \]
(3) The width of the confidence band, $2 c_n(1-\alpha)\hat\sigma_n(u,w)$, obeys, uniformly in $(u,w) \in I$,
\[ 2 c_n(1-\alpha)\hat\sigma_n(u,w) = 2 k_n(1-\alpha)(1 + o_P(1))\,\sigma_n(u,w). \qquad (4.18) \]
(4) Furthermore, if $\|t_n^*\|_I$ does not concentrate at $k_n(1-\alpha)$ at a rate faster than $\sqrt{\log n}$,
that is, if it obeys the anti-concentration property
\[ P(\|t_n^*\|_I \le k_n(1-\alpha) + \varepsilon_n) = 1 - \alpha + o(1) \]
for any $\varepsilon_n = o(1/\sqrt{\log n})$, then the inequalities in (4.16) and (4.17) hold as equalities, and
the finite-sample adjustment factor $\delta_n$ could be set to zero.
The theorem shows that the confidence bands constructed above maintain the required
level asymptotically and establishes that the uniform width of the bands is of the same
order as the uniform rate of convergence. Moreover, under anti-concentration the confidence
intervals are asymptotically similar.
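For concreteness, a minimal R sketch of the band based on the pivotal coupling, assuming the hypothetical objects Z, U.grid, betahat, Jhat, a bandwidth h, a loading vector ell, and precomputed standard errors sighat from the earlier sketches:

    B <- 1000; n <- nrow(Z); alpha <- 0.10
    maxstat <- replicate(B, {
      Ustar <- runif(n)                              # U*_1, ..., U*_n
      max(sapply(seq_along(U.grid), function(j) {
        u  <- U.grid[j]
        Un <- colSums(Z * (u - (Ustar <= u))) / sqrt(n)    # U*_n(u) in (3.4)
        tj <- drop(ell %*% solve(Jhat(u, betahat[, j], h), Un)) / sqrt(n)
        abs(tj) / sighat[j]                          # |t*_n(u, w)|
      }))
    })
    kn <- quantile(maxstat, 1 - alpha)               # critical value k_n(1 - alpha)
    cn <- kn + 1 / (4 * log(n)^(3/4))                # c_n = k_n + delta_n
    # band at each grid point j: thetahat[j] +/- cn * sighat[j]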
Comment 4.1. This inferential strategy builds on [15], who proposed a similar strategy
for inference on the minimum of a function. The idea is not to find the limit distribution,
which may not exist in some cases, but to use distributions provided by couplings. Since the
limit distribution need not exist, it is not immediately clear that the confidence intervals
maintain the right asymptotic level. However, the additional adjustment factor $\delta_n$ ensures
the right asymptotic level. A small price to pay for using the adjustment $\delta_n$ is that the
confidence intervals may not be similar, i.e. they may remain asymptotically conservative in
coverage. However, the width of the confidence intervals is not asymptotically conservative,
since $\delta_n$ is negligible compared to $k_n(1-\alpha)$. If an additional property, called anti-concentration,
holds, then the confidence intervals automatically become asymptotically similar. The
anti-concentration property holds if, after appropriate scaling by some deterministic sequences
$a_n$ and $b_n$, the inferential statistic $a_n(\|t_n\|_I - b_n)$ has a continuous limit distribution. More
generally, it holds if for any subsequence of integers $\{n_k\}$ there is a further subsequence $\{n_{k_r}\}$
along which $a_{n_{k_r}}(\|t_{n_{k_r}}\|_I - b_{n_{k_r}})$ has a continuous limit distribution, possibly dependent on
the subsequence. For an example of the latter see [9], where certain inferential processes
converge in distribution to tight Gaussian processes along such further subsequences. We
expect anti-concentration to hold in our case, but our constructions and results do not
critically hinge on it.
Comment 4.2 (Primitive Bounds on $\xi_\theta(m,I)$). The results of this section rely on the
quantity $\xi_\theta(m,I)$. The value of $\xi_\theta(m,I)$ depends on the choice of basis for the series estimator
and on the type of linear functional. Here we discuss the case of regression splines and
refer to [29] and [12] for other choices of basis. After a possible renormalization of $X$, we
assume its support is $\mathcal{X} = [-1,1]^d$. For splines it has been established that $\zeta_m \lesssim \sqrt{m}$ and
that $k$-th order derivatives obey $\sup_{x \in \mathcal{X}} \|\partial_x^k Z(x)\| \lesssim m^{1/2+k}$, [29]. Then we have, for
• the function itself: $\theta(u,x) = Q_{Y|X}(u|x)$, $\ell(x) = Z(x)$, $\xi_\theta(m,I) \lesssim \sqrt{m}$;
• the derivative: $\theta(u,x) = \partial_{x_k} Q_{Y|X}(u|x)$, $\ell(x) = \partial_{x_k} Z(x)$, $\xi_\theta(m,I) \lesssim m^{3/2}$;
• the average derivative: $\theta(u) = \int \partial_{x_k} Q_{Y|X}(u|x)\,d\mu(x)$, with $\operatorname{supp}(\mu) \subset \operatorname{int}\mathcal{X}$ and
$|\partial_{x_k}\mu(x)| \lesssim 1$: $\ell = \int \partial_{x_k} Z(x)\,\mu(x)\,dx = -\int Z(x)\,\partial_{x_k}\mu(x)\,dx$, $\xi_\theta(m) \lesssim 1$.
4.4. Imposing Monotonicity on Linear Functionals. The functionals of interest might
be naturally monotone in some of their arguments. For example, the conditional quantile
function is increasing in the quantile index and the conditional quantile demand function
is decreasing in price and increasing in the quantile index. Therefore, it might be desirable
to impose the same requirements on the estimators of these functions.
Let $\theta(u,w)$, where $(u,w) \in I$, be weakly increasing in $(u,w)$, i.e. $\theta(u',w') \le \theta(u,w)$
whenever $(u',w') \le (u,w)$ componentwise.¹ Let $\hat\theta$ and $[\dot\iota, \ddot\iota]$ be the point and band
estimators of $\theta$, constructed using one of the methods described in the previous sections.
These estimators might not satisfy the monotonicity requirement due to either estimation
error or imperfect approximation. However, we can monotonize these estimates and perform
inference using the following method, suggested in [14].
Let $q, f : I \mapsto K$, where $K$ is a bounded subset of $\mathbb{R}$, and consider any monotonization
operator $\mathcal{M}$ that satisfies: (1) a monotone-neutrality condition,
\[ \mathcal{M} q = q \quad \text{if } q \text{ is monotone}; \qquad (4.19) \]
(2) a distance-reducing condition,
\[ \|\mathcal{M} q - \mathcal{M} f\|_I \le \|q - f\|_I; \qquad (4.20) \]
and (3) an order-preserving condition,
\[ q \le f \ \text{implies} \ \mathcal{M} q \le \mathcal{M} f. \qquad (4.21) \]
Examples of operators that satisfy these conditions include:
1. multivariate rearrangement [14],
2. isotonic projection [4],
3. convex combinations of rearrangement and isotonic regression [14], and
4. convex combinations of monotone minorants and monotone majorants.
¹If $\theta(u,w)$ is decreasing in $w$, we take the transformation $\tilde w = -w$ and $\tilde\theta(u,\tilde w) = \theta(u,-\tilde w)$, where $\tilde\theta(u,\tilde w)$
is increasing in $\tilde w$.
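In one dimension the simplest such operator is the increasing rearrangement, which on a grid amounts to sorting the estimated values; a minimal R sketch:

    # Increasing rearrangement on a grid: sorting the values of the estimated
    # function is the one-dimensional rearrangement operator of [14] and
    # satisfies (4.19)-(4.21).
    rearrange_up <- function(theta.grid) sort(theta.grid)
    # e.g., monotonize an estimated quantile function evaluated on U.grid:
    # theta.mono <- rearrange_up(theta.hat)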
The following result establishes that monotonizing point estimators reduces estimation
error, and that monotonizing confidence bands increases coverage while reducing the length
of the bands. It follows from Theorem 12 using the same arguments as in Propositions 2
and 3 of [14].
Corollary 1 (Inference on Monotone Linear Functionals). Let $\theta : I \mapsto K$ be weakly increasing
over $I$ and $\hat\theta$ be the QR series estimator of Theorem 10. If $\mathcal{M}$ satisfies conditions (4.19)
and (4.20), then the monotonized QR estimator is necessarily closer to the true value:
\[ \|\mathcal{M}\hat\theta - \theta\|_I \le \|\hat\theta - \theta\|_I. \]
Let $[\dot\iota, \ddot\iota]$ be a confidence band for $\theta$ from Theorem 12. If $\mathcal{M}$ satisfies conditions (4.19)
and (4.21), the monotonized confidence bands maintain the asymptotic level of the original
intervals:
\[ P\{\theta(u,w) \in [\mathcal{M}\dot\iota(u,w), \mathcal{M}\ddot\iota(u,w)] \ \text{for all } (u,w) \in I\} \ge 1 - \alpha + o(1). \]
If $\mathcal{M}$ satisfies condition (4.20), the monotonized confidence bands are shorter in length
than the original intervals:
\[ \|\mathcal{M}\ddot\iota - \mathcal{M}\dot\iota\|_I \le \|\ddot\iota - \dot\iota\|_I. \]
Comment 4.3. Another strategy, as mentioned e.g. in [14], is to simply intersect the initial
confidence band with the set of monotone functions. This is done by taking the smallest
majorant of the lower band and the largest minorant of the upper band. This, in principle,
produces the shortest bands with the desired asymptotic level. However, this approach may
not have good properties under misspecification, i.e. when the approximation error is not
relatively small (we do not analyze such cases in this paper, but they can occur in practice),
whereas the strategy explored in the corollary retains a number of good properties even in
this case. See [14] for a detailed discussion of this point and for practical guidance on
choosing particular monotonization schemes.
5. Examples
This section illustrates the finite sample performance of the estimation and inference
methods with two examples. All the calculations were carried out with the software R
([33]), using the package quantreg for quantile regression ([26]).
5.1. Empirical Example. To illustrate our methods with real data, we consider an empirical application on nonparametric estimation of the demand for gasoline. [21], [34], and [39]
estimated nonparametrically the average demand function. We estimate nonparametrically
the quantile demand and elasticity functions and apply our inference methods to construct
confidence bands for the average quantile elasticity function. We use the same data set as
in [39], which comes from the National Private Vehicle Use Survey, conducted by Statistics
Canada between October 1994 and September 1996. The main advantage of this data set,
relative to similar data sets for the U.S., is that it is based on fuel purchase diaries and
contains detailed household level information on prices, fuel consumption patterns, vehicles
and demographic characteristics. (See [39] for a more detailed description of the data.)
Our sample selection and variable construction also follow [39]. We select into the sample
households with non-zero licensed drivers, vehicles, and distance driven. We focus on regular grade gasoline consumption. This selection leaves us with a sample of 5,001 households.
Fuel consumption and expenditure are recorded by the households at the purchase level.
We consider the following empirical specification:
\[ Y = Q_{Y|X}(U|X), \qquad Q_{Y|X}(U|X) = g(W,U) + V'\beta(U), \qquad X = (W,V), \]
where Y is the log of total gasoline consumption in liters per month; W is the log of
price per liter; U is the unobservable preference of the household to consume gasoline;
and V is a vector of 28 covariates. Following [39], the covariate vector includes the log
of age, a dummy for the top coded value of age, the log of income, a set of dummies for
household size, a dummy for urban dwellers, a dummy for young-single (age less than 36 and
household size of one), the number of drivers, a dummy for more than 4 drivers, 5 province
dummies, and 12 monthly dummies. To estimate the function g(W, U ), we consider three
series approximations in W : linear, a power orthogonal polynomial of degree 6, and a cubic
B-spline with 5 knots at the {0, 1/4, 1/2, 3/4, 1} quantiles of the observed values of W .
The number of series terms is selected by undersmoothing over the specifications chosen
by applying least squares cross validation to the corresponding conditional average demand
functions. In the next section, we analyze the size of the specification error of these series
approximations in a numerical experiment calibrated to mimic this example.
The empirical results are reported in Figures 1–3. Fig. 1 plots the initial and rearranged
estimates of the quantile demand surface for gasoline as a function of price and the quantile
index, that is,
\[ (u, \exp(w)) \mapsto \theta(u,w) = \exp(g(w,u) + v'\beta(u)), \]
where the value of $v$ is fixed at the sample median values of the ordinal variables and at one
for the dummies corresponding to the sample modal values of the rest of the variables.² The
rows of the figure correspond to the three series approximations. The monotonized estimates
in the right panels are obtained using the average rearrangement over both the price and
quantile dimensions proposed in [14]. The power and B-spline series approximations show
most noticeably non-monotone areas with respect to price at high quantiles, which are
removed by the rearrangement.
Fig. 2 shows series estimates of the quantile elasticity surface as a function of price and
the quantile index, that is,
\[ (u, \exp(w)) \mapsto \theta(u,w) = \partial_w g(w,u). \]
The estimates from the linear approximation show that the elasticity decreases with the
quantile index in the middle of the distribution, but this pattern is reversed at the tails.
The power and B-spline estimates show substantial heterogeneity of the elasticity across
prices, with individuals at the high quantiles being more sensitive to high prices.3
Fig. 3 shows 90% uniform confidence bands for the average quantile elasticity function
\[ u \mapsto \theta(u) = \int \partial_w g(w,u)\,d\mu(w). \]
The rows of the figure correspond to the three series approximations and the columns
correspond to the inference methods. We construct the bands using the pivotal, Gaussian
and weighted bootstrap methods. For the pivotal and Gaussian methods the distribution
of the maximal t-statistic is obtained by 1,000 simulations. The weighted bootstrap uses
standard exponential weights and 199 repetitions. The confidence bands show that the
2The median values of the ordinal covariates are $40K for income, 46 for age, and 2 for the number of
drivers. The modal values for the rest of the covariates are 0 for the top-coding of age, 2 for household size,
1 for urban dwellers, 0 for young-single, 0 for the dummy of more than 4 drivers, 4 (Prairie) for province,
and 11 (November) for month.
3These estimates are smoothed by local weighted polynomial regression across the price dimension ([16]),
because the unsmoothed elasticity estimates display very erratic behavior.
[Figure 1 about here.]
Figure 1. Quantile demand surfaces for gasoline as a function of price and
the quantile index. The left panels display linear, power and B-spline series
estimates, and the right panels show the corresponding estimates monotonized
by rearrangement over both dimensions.
[Figure 2 about here.]
Figure 2. Quantile elasticity surfaces as a function of price and the quantile
index. The power and B-spline series estimates are smoothed by local
weighted polynomial regression with bandwidth 0.5.
evidence of heterogeneity in the elasticities across quantiles is not statistically significant,
because we can trace a horizontal line within the bands. They show, however, that there is
significant evidence of negative price sensitivity at most quantiles as the bands are bounded
away from zero for most quantiles.
5.2. Numerical Example. To evaluate the performance of our estimation and inference
methods in finite samples, we conduct a Monte Carlo experiment designed to mimic the
previous empirical example. We consider the following design for the data generating process:
\[ Y = g(W) + V'\beta + \sigma\Phi^{-1}(U), \qquad (5.22) \]
where g(w) = α0 + α1 w + α2 sin(2πw) + α3 cos(2πw) + α4 sin(4πw) + α5 cos(4πw), V is
the same covariate vector as in the empirical example, U ∼ U (0, 1), and Φ−1 denotes the
inverse of the CDF of the standard normal distribution. The parameters of g(w) and β
are calibrated by applying least squares to the data set in the empirical example and σ is
calibrated to the least squares residual standard deviation. We consider linear, power and
B-spline series methods to approximate g(w), with the same number of series terms and
other tuning parameters as in the empirical example.
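A minimal R sketch of one draw from the DGP (5.22), with hypothetical parameter values standing in for the calibrated ones (which we do not reproduce here) and with V omitted for simplicity:

    set.seed(2)
    n <- 1000
    w <- runif(n, log(0.45), log(0.70))          # hypothetical log-price support
    a <- c(4.5, -1, 0.1, 0.1, 0.05, 0.05)        # hypothetical coefficients of g
    g <- a[1] + a[2]*w + a[3]*sin(2*pi*w) + a[4]*cos(2*pi*w) +
         a[5]*sin(4*pi*w) + a[6]*cos(4*pi*w)
    sigma <- 0.5                                 # hypothetical error scale
    y <- g + sigma * qnorm(runif(n))             # Y = g(W) + sigma * Phi^{-1}(U)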
Figures 4 and 5 examine the quality of the series approximations in population. They
compare the true quantile function
\[ (u, \exp(w)) \mapsto \theta(u,w) = g(w) + v'\beta + \sigma\Phi^{-1}(u) \]
and the quantile elasticity function
\[ (u, \exp(w)) \mapsto \theta(u,w) = \partial_w g(w) \]
to the estimands of the series approximations. In the quantile demand function the value of
v is fixed at the sample median values of the ordinal variables and at one for the dummies
corresponding to the sample modal values of the rest of the variables. The estimands are
obtained numerically from a mega-sample (a proxy for infinite population) of 100 × 5, 001
observations with the values of (W, V ) as in the data set (repeated 100 times) and with
Y generated from the DGP (5.22). Although the derivative function does not depend on
u in our design, we do not impose this restriction on the estimands. Both figures show
that the power and B-spline estimands are close to the true target functions, whereas the
more parsimonious linear approximation misses important curvature features of the target
functions, especially in the elasticity function.
[Figure 3 about here.]
Figure 3. 90% confidence bands for the average quantile elasticity function.
Pivotal and Gaussian bands are obtained by 1,000 simulations. Weighted
bootstrap bands are based on 199 bootstrap repetitions with standard
exponential weights.
[Figure 4 about here.]
Figure 4. Estimands of the quantile demand surface. Estimands for the
linear, power and B-spline series estimators are obtained numerically using
500,100 simulations.
[Figure 5 about here.]
Figure 5. Estimands of the quantile elasticity surface. Estimands for the
linear, power and B-spline series estimators are obtained numerically using
500,100 simulations.
To analyze the properties of the inference methods in finite samples, we draw 500 samples
from the DGP in equation (5.22) with 3 sample sizes, n: 5, 001, 1, 000, and 500 observations.
For n = 5, 001 we fix W to the values in the data set, whereas for the smaller sample sizes we
draw W with replacement from the values in the data set and keep it fixed across samples.
To speed up computation, we drop the vector V by fixing it at the sample median values
of the ordinal components and at one for the dummies corresponding to the sample modal
values for all the individuals. We focus on the average quantile elasticity function
\[ u \mapsto \theta(u) = \int \partial_w g(w)\,d\mu(w), \]
over the region $I = [0.1, 0.9]$. We estimate this function using linear, power and B-spline
quantile regression with the same number of terms and other tuning parameters as in the
empirical example. Although $\theta(u)$ does not change with $u$ in our design, again we do not
impose this restriction on the estimators. For inference, we compare the performance of
90% confidence bands for the entire elasticity function. These bands are constructed using
the pivotal, Gaussian and weighted bootstrap methods, all implemented in the same fashion
as in the empirical example. The interval $I$ is approximated by a finite grid of 81 quantiles
$\tilde I = \{0.10, 0.11, \dots, 0.90\}$.
Table 1 reports estimation and inference results averaged across 200 simulations. The
true value of the elasticity function is $\theta(u) = -0.74$ for all $u \in \tilde I$. Bias and RMSE are
the absolute bias and root mean squared error integrated over $\tilde I$. SE/SD reports the ratio
of the empirical average standard error to the empirical standard deviation; SE/SD uses the
analytical standard errors from expression (4.11), and the bandwidth for $\hat J_m(u)$ is chosen
using the Hall-Sheather option of the quantreg R package ([19]). Length gives the empirical
average of the length of the confidence band. SE/SD and Length are integrated over the
grid of quantiles $\tilde I$. Cover reports the empirical coverage of the confidence bands with a
nominal level of 90%. Stat is the empirical average of the 90% quantile of the maximal
t-statistic used to construct the bands. Table 1 shows that the linear estimator has higher
absolute bias than the more flexible power and B-spline estimators, but displays lower
RMSE, especially for small sample sizes. The analytical standard errors provide good
approximations to the standard deviations of the estimators. The confidence bands have
empirical coverage close to the nominal level of 90% for all the estimators and sample sizes
considered; the weighted bootstrap bands tend to have larger average length than the pivotal
and Gaussian bands.
All in all, these results strongly confirm the practical value of the theoretical results and
methods developed in the paper. They also support the empirical example by verifying that
our estimation and inference methods work quite nicely in a very similar setting.
Table 1. Finite Sample Properties of Estimation and Inference Methods for Average Quantile Elasticity Function

                                      Pivotal              Gaussian          Weighted Bootstrap
Estimator   Bias  RMSE  SE/SD   Cover Length Stat    Cover Length Stat    Cover Length Stat

n = 5,001
Linear      0.05  0.14  1.04     90   0.77  2.64      90   0.76  2.64      87   0.82  2.87
Power       0.00  0.15  1.03     91   0.85  2.65      91   0.85  2.65      88   0.91  2.83
B-spline    0.01  0.15  1.02     90   0.86  2.64      88   0.86  2.64      90   0.93  2.84

n = 1,000
Linear      0.03  0.29  1.09     92   1.78  2.64      93   1.78  2.64      90   1.96  2.99
Power       0.03  0.33  1.07     92   2.01  2.66      91   2.00  2.65      95   2.17  2.95
B-spline    0.02  0.35  1.05     90   2.08  2.65      90   2.07  2.65      96   2.21  2.95

n = 500
Linear      0.04  0.45  1.01     88   2.60  2.64      88   2.60  2.64      90   2.84  3.05
Power       0.02  0.52  1.04     90   3.12  2.65      90   3.13  2.66      95   3.29  3.01
B-spline    0.02  0.52  1.04     90   3.25  2.65      90   3.25  2.65      96   3.35  3.00

Notes: 200 repetitions. Simulation standard error for coverage probability is 2%.
Appendix A. Implementation Algorithms
Throughout this section we assume that we have a random sample $\{(Y_i, Z_i) : 1 \leq i \leq n\}$. We are interested in approximating the distribution of the process $\sqrt{n}(\hat{\beta}(\cdot) - \beta(\cdot))$ or of statistics associated with functionals of it. Recall that for each quantile $u \in \mathcal{U} \subset (0,1)$ we estimate $\beta(u)$ by quantile regression, $\hat{\beta}(u) = \arg\min_{\beta \in \mathbb{R}^m} E_n[\rho_u(Y_i - Z_i'\beta)]$, the Gram matrix $\Sigma_m$ by $\hat{\Sigma}_m = E_n[Z_iZ_i']$, and the Jacobian matrix $J_m(u)$ by the Powell [32] estimator $\hat{J}_m(u) = E_n[1\{|Y_i - Z_i'\hat{\beta}(u)| \leq h_n\}\cdot Z_iZ_i']/(2h_n)$, where we recommend choosing the bandwidth $h_n$ as in the quantreg R package with the Hall-Sheather option ([19]).
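As an illustration, the following minimal sketch shows how these three ingredients can be computed in Python with statsmodels; the computations in the paper rely on R's quantreg instead, the function names here are hypothetical, and the Hall-Sheather formula below is one common form of that rule, stated as an assumption rather than a quotation from the quantreg source.

```python
import numpy as np
from scipy.stats import norm
from statsmodels.regression.quantile_regression import QuantReg

def hall_sheather_bandwidth(u, n, alpha=0.05):
    # One common form of the Hall-Sheather rule (assumption of this sketch).
    z_a = norm.ppf(1 - alpha / 2)
    z_u = norm.ppf(u)
    return n ** (-1 / 3) * z_a ** (2 / 3) * (
        1.5 * norm.pdf(z_u) ** 2 / (2 * z_u ** 2 + 1)) ** (1 / 3)

def qr_ingredients(Y, Z, u):
    """Compute beta_hat(u), Sigma_hat, and the Powell estimator J_hat(u)."""
    n = len(Y)
    beta_hat = QuantReg(Y, Z).fit(q=u).params        # QR coefficient at u
    Sigma_hat = Z.T @ Z / n                          # Gram matrix E_n[Z Z']
    h_n = hall_sheather_bandwidth(u, n)
    kept = np.abs(Y - Z @ beta_hat) <= h_n           # 1{|Y - Z'beta_hat| <= h_n}
    J_hat = (Z[kept].T @ Z[kept]) / (n * 2 * h_n)    # Powell kernel estimator
    return beta_hat, Sigma_hat, J_hat
```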
We begin by describing the algorithms to implement the methods to approximate the distribution of the process $\sqrt{n}(\hat{\beta}(\cdot) - \beta(\cdot))$ indexed by $\mathcal{U}$.
Algorithm 1 (Pivotal method). (1) For $b = 1, \ldots, B$, draw $U_1^b, \ldots, U_n^b$ i.i.d. from $U \sim$ Uniform(0,1) and compute $\mathbb{U}_n^b(u) = n^{-1/2}\sum_{i=1}^n Z_i(u - 1\{U_i^b \leq u\})$, $u \in \mathcal{U}$. (2) Approximate the distribution of $\{\sqrt{n}(\hat{\beta}(u) - \beta(u)) : u \in \mathcal{U}\}$ by the empirical distribution of $\{\hat{J}_m^{-1}(u)\mathbb{U}_n^b(u) : u \in \mathcal{U}, 1 \leq b \leq B\}$.
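A minimal sketch of Algorithm 1, assuming $\mathcal{U}$ is approximated by a finite grid u_grid and that the inverse Jacobians J_hat_inv[k] $= \hat{J}_m^{-1}(u_k)$ have been precomputed (e.g., with the hypothetical qr_ingredients helper above):

```python
import numpy as np

def pivotal_draws(Z, J_hat_inv, u_grid, B=1000, seed=0):
    """Algorithm 1: B draws of {J_hat^{-1}(u) U_n^b(u) : u in u_grid}."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    draws = np.empty((B, len(u_grid), m))
    for b in range(B):
        U = rng.uniform(size=n)                          # U_1^b, ..., U_n^b
        for k, u in enumerate(u_grid):
            score = Z.T @ (u - (U <= u)) / np.sqrt(n)    # U_n^b(u)
            draws[b, k] = J_hat_inv[k] @ score
    return draws
```

Each slice draws[:, k, :] then serves as the simulated law of $\sqrt{n}(\hat{\beta}(u_k) - \beta(u_k))$ in Algorithms 5 and 6 below.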
Algorithm 2 (Gaussian method). (1) For $b = 1, \ldots, B$, generate an m-dimensional standard Brownian bridge on $\mathcal{U}$, $B_m^b(\cdot)$, and define $\mathbb{G}_n^b(u) = \hat{\Sigma}_m^{1/2}B_m^b(u)$ for $u \in \mathcal{U}$. (2) Approximate the distribution of $\{\sqrt{n}(\hat{\beta}(u) - \beta(u)) : u \in \mathcal{U}\}$ by the empirical distribution of $\{\hat{J}_m^{-1}(u)\mathbb{G}_n^b(u) : u \in \mathcal{U}, 1 \leq b \leq B\}$.
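A sketch of Algorithm 2 under the same grid approximation; it simulates the Brownian bridge exactly on the grid from its finite-dimensional Gaussian law, whose covariance is $u \wedge u' - uu'$ coordinatewise (the small ridge term is a numerical safeguard, an implementation choice rather than part of the algorithm):

```python
import numpy as np
from scipy.linalg import cholesky, sqrtm

def gaussian_draws(Sigma_hat, J_hat_inv, u_grid, B=1000, seed=0):
    """Algorithm 2: B draws of {J_hat^{-1}(u) Sigma_hat^{1/2} B_m^b(u)}."""
    rng = np.random.default_rng(seed)
    u = np.asarray(u_grid)
    m = Sigma_hat.shape[0]
    bridge_cov = np.minimum.outer(u, u) - np.outer(u, u)   # u ^ u' - u u'
    L = cholesky(bridge_cov + 1e-10 * np.eye(len(u)), lower=True)
    S_half = np.real(sqrtm(Sigma_hat))                     # Sigma_hat^{1/2}
    draws = np.empty((B, len(u), m))
    for b in range(B):
        bridge = L @ rng.standard_normal((len(u), m))      # B_m^b on the grid
        G = bridge @ S_half                                # G_n^b(u_k) in rows
        for k in range(len(u)):
            draws[b, k] = J_hat_inv[k] @ G[k]
    return draws
```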
Algorithm 3 (Weighted bootstrap method). (1) For $b = 1, \ldots, B$, draw $h_1^b, \ldots, h_n^b$ i.i.d. from the standard exponential distribution and compute the weighted quantile regression process $\hat{\beta}^b(u) = \arg\min_{\beta \in \mathbb{R}^m}\sum_{i=1}^n h_i^b\cdot\rho_u(Y_i - Z_i'\beta)$, $u \in \mathcal{U}$. (2) Approximate the distribution of $\{\sqrt{n}(\hat{\beta}(u) - \beta(u)) : u \in \mathcal{U}\}$ by the empirical distribution of $\{\sqrt{n}(\hat{\beta}^b(u) - \hat{\beta}(u)) : u \in \mathcal{U}, 1 \leq b \leq B\}$.
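A sketch of Algorithm 3. It exploits the fact, also used in the proof of Theorem 6, that minimizing the exponentially weighted check function is equivalent to running ordinary quantile regression on the rescaled data $(h_iY_i, h_iZ_i)$, since the weights are positive; beta_hat[k] denotes $\hat{\beta}(u_k)$ on the grid:

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def weighted_bootstrap_draws(Y, Z, beta_hat, u_grid, B=199, seed=0):
    """Algorithm 3: B draws of sqrt(n)(beta^b(u) - beta_hat(u)) on the grid."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    draws = np.empty((B, len(u_grid), Z.shape[1]))
    for b in range(B):
        h = rng.standard_exponential(size=n)     # standard exponential weights
        Yw, Zw = h * Y, Z * h[:, None]           # weighted QR = QR on (hY, hZ)
        for k, u in enumerate(u_grid):
            beta_b = QuantReg(Yw, Zw).fit(q=u).params
            draws[b, k] = np.sqrt(n) * (beta_b - beta_hat[k])
    return draws
```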
Algorithm 4 (Gradient bootstrap method). (1) For $b = 1, \ldots, B$, draw $U_1^b, \ldots, U_n^b$ i.i.d. from $U \sim$ Uniform(0,1) and compute $\mathbb{U}_n^b(u) = n^{-1/2}\sum_{i=1}^n Z_i(u - 1\{U_i^b \leq u\})$, $u \in \mathcal{U}$. (2) For $b = 1, \ldots, B$, estimate the quantile regression process $\hat{\beta}^b(u) = \arg\min_{\beta \in \mathbb{R}^m}\sum_{i=1}^n\rho_u(Y_i - Z_i'\beta) + \rho_u(Y_{n+1} - X_{n+1}^b(u)'\beta)$, $u \in \mathcal{U}$, where $X_{n+1}^b(u) = -\sqrt{n}\,\mathbb{U}_n^b(u)/u$ and $Y_{n+1} = n\max_{1\leq i\leq n}|Y_i|$ to ensure $Y_{n+1} > X_{n+1}^b(u)'\hat{\beta}^b(u)$ for all $u \in \mathcal{U}$. (3) Approximate the distribution of $\{\sqrt{n}(\hat{\beta}(u) - \beta(u)) : u \in \mathcal{U}\}$ by the empirical distribution of $\{\sqrt{n}(\hat{\beta}^b(u) - \hat{\beta}(u)) : u \in \mathcal{U}, 1 \leq b \leq B\}$.
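A sketch of Algorithm 4, which augments the sample with the single synthetic observation $(Y_{n+1}, X_{n+1}^b(u))$ described above; as in the previous sketches, the grid approximation and helper names are assumptions:

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def gradient_bootstrap_draws(Y, Z, beta_hat, u_grid, B=199, seed=0):
    """Algorithm 4: QR on the sample augmented with one synthetic observation."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    y_new = n * np.max(np.abs(Y))                        # Y_{n+1}
    draws = np.empty((B, len(u_grid), Z.shape[1]))
    for b in range(B):
        U = rng.uniform(size=n)
        for k, u in enumerate(u_grid):
            score = Z.T @ (u - (U <= u)) / np.sqrt(n)    # U_n^b(u)
            x_new = -np.sqrt(n) * score / u              # X_{n+1}^b(u)
            beta_b = QuantReg(np.append(Y, y_new),
                              np.vstack([Z, x_new])).fit(q=u).params
            draws[b, k] = np.sqrt(n) * (beta_b - beta_hat[k])
    return draws
```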
The previous algorithms provide approximations to the distribution of $\sqrt{n}(\hat{\beta}(u) - \beta(u))$ that are uniformly valid in $u \in \mathcal{U}$. We can use these approximations directly to make inference on linear functionals of $Q_{Y|X}(\cdot|X)$, including the conditional quantile function itself, provided the approximation error is small, as stated in Theorems 9 and 11. Each linear functional is represented by $\{\theta(u,w) = \ell(w)'\beta(u) + r_n(u,w) : (u,w) \in I\}$, where $\ell(w)'\beta(u)$ is the series approximation, $\ell(w) \in \mathbb{R}^m$ is a loading vector, $r_n(u,w)$ is the remainder term, and $I$ is the set of pairs of quantile indices and covariate values of interest; see Section 4 for details and examples. Next we provide algorithms to conduct pointwise or uniform inference on linear functionals.

Let $B$ be a pre-specified number of bootstrap or simulation repetitions.
Algorithm 5 (Pointwise Inference for Linear Functionals). (1) Compute the variance estimate $\hat{\sigma}_n^2(u,w) = u(1-u)\ell(w)'\hat{J}_m^{-1}(u)\hat{\Sigma}_m\hat{J}_m^{-1}(u)\ell(w)/n$. (2) Using any of Algorithms 1-4, compute vectors $V_1(u), \ldots, V_B(u)$ whose empirical distribution approximates the distribution of $\sqrt{n}(\hat{\beta}(u) - \beta(u))$. (3) For $b = 1, \ldots, B$, compute the t-statistic $t_n^{*b}(u,w) = \ell(w)'V_b(u)/[\sqrt{n}\,\hat{\sigma}_n(u,w)]$. (4) Form a $(1-\alpha)$-confidence interval for $\theta(u,w)$ as $\ell(w)'\hat{\beta}(u) \pm k_n(1-\alpha)\hat{\sigma}_n(u,w)$, where $k_n(1-\alpha)$ is the $1-\alpha$ sample quantile of $\{t_n^{*b}(u,w) : 1 \leq b \leq B\}$.
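A sketch of Algorithm 5 for a single pair $(u,w)$, where V_u collects the B simulated vectors from any of Algorithms 1-4; the absolute value in the critical value is taken so that the two-sided interval has the stated level, which we read as the intent of step (4):

```python
import numpy as np

def pointwise_ci(ell, beta_hat_u, V_u, Sigma_hat, J_inv_u, u, n, alpha=0.10):
    """Algorithm 5: (1 - alpha) confidence interval for theta(u, w)."""
    omega = J_inv_u @ Sigma_hat @ J_inv_u
    sigma = np.sqrt(u * (1 - u) * ell @ omega @ ell / n)   # sigma_hat_n(u, w)
    t_stats = (V_u @ ell) / (np.sqrt(n) * sigma)           # t_n^{*b}(u, w)
    k_crit = np.quantile(np.abs(t_stats), 1 - alpha)       # k_n(1 - alpha)
    center = ell @ beta_hat_u
    return center - k_crit * sigma, center + k_crit * sigma
```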
Algorithm 6 (Uniform Inference for Linear Functionals). (1) Compute the variance estimates $\hat{\sigma}_n^2(u,w) = u(1-u)\ell(w)'\hat{J}_m^{-1}(u)\hat{\Sigma}_m\hat{J}_m^{-1}(u)\ell(w)/n$ for $(u,w) \in I$. (2) Using any of Algorithms 1-4, compute processes $V_1(\cdot), \ldots, V_B(\cdot)$ whose empirical distribution approximates the distribution of $\{\sqrt{n}(\hat{\beta}(u) - \beta(u)) : u \in \mathcal{U}\}$. (3) For $b = 1, \ldots, B$, compute the maximal t-statistic $\|t_n^{*b}\|_I = \sup_{(u,w)\in I}|\ell(w)'V_b(u)|/[\sqrt{n}\,\hat{\sigma}_n(u,w)]$. (4) Form a $(1-\alpha)$-confidence band for $\{\theta(u,w) : (u,w) \in I\}$ as $\{\ell(w)'\hat{\beta}(u) \pm k_n(1-\alpha)\hat{\sigma}_n(u,w) : (u,w) \in I\}$, where $k_n(1-\alpha)$ is the $1-\alpha$ sample quantile of $\{\|t_n^{*b}\|_I : 1 \leq b \leq B\}$.
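Finally, a sketch of Algorithm 6 over a finite grid approximating $I$, pairing each quantile $u_k$ with a loading vector $\ell(w_k)$; the data layout is again an assumption of this sketch:

```python
import numpy as np

def uniform_band(ells, beta_hat, V, Sigma_hat, J_inv, u_grid, n, alpha=0.10):
    """Algorithm 6: simultaneous (1 - alpha) band over the grid approximating I."""
    K, B = len(u_grid), V.shape[0]
    sigma = np.array([np.sqrt(u_grid[k] * (1 - u_grid[k]) *
                              ells[k] @ J_inv[k] @ Sigma_hat @ J_inv[k] @ ells[k] / n)
                      for k in range(K)])
    t_abs = np.abs(np.einsum('km,bkm->bk', ells, V)) / (np.sqrt(n) * sigma)
    k_crit = np.quantile(t_abs.max(axis=1), 1 - alpha)     # quantile of ||t*||_I
    center = np.einsum('km,km->k', ells, beta_hat)
    return center - k_crit * sigma, center + k_crit * sigma
```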
Appendix B. A result on identification of QR series approximation in
population and its relation to the best L2 -approximation
In this section we provide results on identification and approximation properties for the
QR series approximation. In what follows, we denote z = Z(x) ∈ Z for some x ∈ X , and
for a function h : X → R we define
Qu (h) = E[ρu (Y − h(X))],
so that $\beta(u) \in \arg\min_{\beta \in \mathbb{R}^m} Q_u(Z'\beta)$. Also let $\underline{f} := \inf_{u \in \mathcal{U}, x \in \mathcal{X}} f_{Y|X}(Q_{Y|X}(u|x)|x)$, where $\underline{f} > 0$ by condition S.2.
Consider the best $L_2$-approximation to the conditional quantile function $g_u(\cdot) = Q_{Y|X}(u|\cdot)$ by a linear combination of the chosen basis, namely
$$\tilde{\beta}^*(u) \in \arg\min_{\beta \in \mathbb{R}^m} E\big[|Z'\beta - g_u(X)|^2\big]. \tag{B.23}$$
We consider the following approximation rates associated with $\tilde{\beta}^*(u)$:
$$c_{u,2}^2 = E\big[|Z'\tilde{\beta}^*(u) - g_u(X)|^2\big] \quad\text{and}\quad c_{u,\infty} = \sup_{x \in \mathcal{X}, z = Z(x)}|z'\tilde{\beta}^*(u) - g_u(x)|.$$
Lemma 1. Assume that conditions S.2-S.4, $\zeta_m c_{u,2} = o(1)$ and $c_{u,\infty} = o(1)$ hold. Then, as $n$ grows, we have the following approximation properties for $\beta(u)$:
$$E\big[|Z'\beta(u) - g_u(X)|^2\big] \leq (16 \vee 3\bar{f}/\underline{f})\,c_{u,2}^2, \qquad E\big[|Z'\beta(u) - Z'\tilde{\beta}^*(u)|^2\big] \leq (9 \vee 8\bar{f}/\underline{f})\,c_{u,2}^2,$$
and
$$\sup_{x \in \mathcal{X}, z = Z(x)}|z'\beta(u) - g_u(x)| \lesssim c_{u,\infty} + \zeta_m\sqrt{(9 \vee 8\bar{f}/\underline{f})}\,c_{u,2}\big/\sqrt{\mathrm{mineig}(\Sigma_m)}.$$
Proof of Lemma 1. We assume that $E\big[|Z'\beta(u) - Z'\tilde{\beta}^*(u)|^2\big]^{1/2} \geq 3E\big[|Z'\tilde{\beta}^*(u) - g_u(X)|^2\big]^{1/2}$; otherwise the statements follow directly. The proof proceeds in 4 steps.

Step 1 (Main argument). For notational convenience let
$$\bar{q} = (\underline{f}^{3/2}/\bar{f}')\,E\big[|Z'\beta(u) - g_u(X)|^2\big]^{3/2}\Big/E\big[|Z'\beta(u) - g_u(X)|^3\big].$$
By Steps 2 and 3 below we have, respectively,
$$Q_u(Z'\beta(u)) - Q_u(g_u) \leq \bar{f}c_{u,2}^2 \tag{B.24}$$
and
$$Q_u(Z'\beta(u)) - Q_u(g_u) \geq \frac{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}{3} \wedge \frac{\bar{q}}{3}\sqrt{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}. \tag{B.25}$$
Thus
$$\Big\{\frac{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}{3} \wedge (\bar{q}/3)\sqrt{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}\Big\} \leq \bar{f}c_{u,2}^2.$$
As $n$ grows, since $\bar{f}c_{u,2}^2 < \bar{q}^2/3$ by Step 4 below, it follows that $E\big[|Z'\beta(u) - g_u(X)|^2\big] \leq 3c_{u,2}^2\bar{f}/\underline{f}$, which proves the first statement regarding $\beta(u)$.

The second statement regarding $\beta(u)$ follows since
$$\sqrt{E\big[|Z'\beta(u) - Z'\tilde{\beta}^*(u)|^2\big]} \leq \sqrt{E[|Z'\beta(u) - g_u(X)|^2]} + \sqrt{E\big[|Z'\tilde{\beta}^*(u) - g_u(X)|^2\big]} \leq \sqrt{3c_{u,2}^2\bar{f}/\underline{f}} + c_{u,2}.$$
Finally, the third statement follows by the triangle inequality, S.4, and the second statement:
$$\sup_{x \in \mathcal{X}, z = Z(x)}|z'\beta(u) - g_u(x)| \leq \sup_{x \in \mathcal{X}, z = Z(x)}|z'\tilde{\beta}^*(u) - g_u(x)| + \zeta_m\|\beta(u) - \tilde{\beta}^*(u)\|$$
$$\leq c_{u,\infty} + \zeta_m\sqrt{E\big[|Z'\tilde{\beta}^*(u) - Z'\beta(u)|^2\big]\big/\mathrm{mineig}(\Sigma_m)} \leq c_{u,\infty} + \zeta_m\sqrt{(9 \vee 8\bar{f}/\underline{f})}\,c_{u,2}\big/\sqrt{\mathrm{mineig}(\Sigma_m)}.$$

Step 2 (Upper Bound). For any two scalars $w$ and $v$ we have that
$$\rho_u(w - v) - \rho_u(w) = -v(u - 1\{w \leq 0\}) + \int_0^v(1\{w \leq t\} - 1\{w \leq 0\})\,dt. \tag{B.26}$$
By (B.26) and the law of iterated expectations, for any measurable function $h$,
$$Q_u(h) - Q_u(g_u) = E\Big[\int_0^{h-g_u}F_{Y|X}(g_u + t|X) - F_{Y|X}(g_u|X)\,dt\Big] = E\Big[\int_0^{h-g_u}tf_{Y|X}(g_u + \tilde{t}_{X,t}|X)\,dt\Big] \leq (\bar{f}/2)E[|h - g_u|^2], \tag{B.27}$$
where $\tilde{t}_{X,t}$ lies between 0 and $t$ for each $t \in [0, h(x) - g_u(x)]$.
Thus, (B.27) with $h(X) = Z'\tilde{\beta}^*(u)$ and $Q_u(Z'\beta(u)) \leq Q_u(Z'\tilde{\beta}^*(u))$ imply that
$$Q_u(Z'\beta(u)) - Q_u(g_u) \leq Q_u(Z'\tilde{\beta}^*(u)) - Q_u(g_u) \leq \bar{f}c_{u,2}^2.$$
Step 3 (Lower Bound). To establish a lower bound note that for a measurable function $h$,
$$Q_u(h) - Q_u(g_u) = E\Big[\int_0^{h-g_u}F_{Y|X}(g_u + t|X) - F_{Y|X}(g_u|X)\,dt\Big] = E\Big[\int_0^{h-g_u}tf_{Y|X}(g_u|X) + \frac{t^2}{2}f'_{Y|X}(g_u + \tilde{t}_{X,t}|X)\,dt\Big] \geq (\underline{f}/2)E[|h - g_u|^2] - \frac{1}{6}\bar{f}'E[|h - g_u|^3]. \tag{B.28}$$
If $\sqrt{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]} \leq \bar{q}$, then $\bar{f}'E[|Z'\beta(u) - g_u(X)|^3] \leq \underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]$, and (B.28) with the function $h(Z) = Z'\beta(u)$ yields
$$Q_u(Z'\beta(u)) - Q_u(g_u) \geq \frac{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}{3}.$$
On the other hand, if $\sqrt{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]} > \bar{q}$, let $h_u(X) = (1-\alpha)Z'\beta(u) + \alpha g_u(X)$, where $\alpha \in (0,1)$ is picked so that $\sqrt{\underline{f}\,E[|h_u - g_u|^2]} = \bar{q}$. Then by (B.28) and convexity of $Q_u$ we have
$$Q_u(Z'\beta(u)) - Q_u(g_u) \geq \frac{\sqrt{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}}{\bar{q}}\cdot\big(Q_u(h_u) - Q_u(g_u)\big).$$
Next note that $h_u(X) - g_u(X) = (1-\alpha)(Z'\beta(u) - g_u(X))$, thus
$$\sqrt{\underline{f}\,E[|h_u - g_u|^2]} = \bar{q} = \frac{\underline{f}^{3/2}E[|Z'\beta(u) - g_u(X)|^2]^{3/2}}{\bar{f}'E[|Z'\beta(u) - g_u(X)|^3]} = \frac{\underline{f}^{3/2}E[|h_u - g_u|^2]^{3/2}}{\bar{f}'E[|h_u - g_u|^3]}.$$
Using that and applying (B.28) with $h_u$ we obtain
$$Q_u(h_u) - Q_u(g_u) \geq (\underline{f}/2)E[|h_u - g_u|^2] - \frac{1}{6}\bar{f}'E[|h_u - g_u|^3] = \bar{q}^2/3.$$
Therefore
$$Q_u(Z'\beta(u)) - Q_u(g_u) \geq \frac{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}{3} \wedge \frac{\bar{q}}{3}\sqrt{\underline{f}\,E[|Z'\beta(u) - g_u(X)|^2]}.$$
Step 4 ($\bar{f}c_{u,2}^2 < \bar{q}^2/3$ as $n$ grows). Recall that by S.4 $\|Z\| \leq \zeta_m$ and that we can assume
$$E\big[|Z'\beta(u) - Z'\tilde{\beta}^*(u)|^2\big]^{1/2} \geq 3E\big[|Z'\tilde{\beta}^*(u) - g_u(X)|^2\big]^{1/2}.$$
Then, using the relation above (in the second inequality),
$$\frac{E[|Z'\beta(u) - g_u(X)|^2]^{3/2}}{E[|Z'\beta(u) - g_u(X)|^3]} \geq \frac{E[|Z'\beta(u) - g_u(X)|^2]^{1/2}}{\sup_{x \in \mathcal{X}, z = Z(x)}|z'\beta(u) - g_u(x)|} \geq \frac{1}{2}\,\frac{E\big[|Z'\beta(u) - Z'\tilde{\beta}^*(u)|^2\big]^{1/2} + E\big[|Z'\tilde{\beta}^*(u) - g_u(X)|^2\big]^{1/2}}{\sup_{x,z}|z'\beta(u) - z'\tilde{\beta}^*(u)| + \sup_{x,z}|z'\tilde{\beta}^*(u) - g_u(x)|}$$
$$\geq \frac{1}{2}\Bigg(\frac{E\big[|Z'\beta(u) - Z'\tilde{\beta}^*(u)|^2\big]^{1/2}}{\sup_{x,z}|z'\beta(u) - z'\tilde{\beta}^*(u)|} \wedge \frac{E\big[|Z'\tilde{\beta}^*(u) - g_u(X)|^2\big]^{1/2}}{\sup_{x,z}|z'\tilde{\beta}^*(u) - g_u(x)|}\Bigg) \geq \frac{1}{2}\Big(\frac{\bar{\kappa}\|\beta(u) - \tilde{\beta}^*(u)\|}{\zeta_m\|\beta(u) - \tilde{\beta}^*(u)\|} \wedge \frac{c_{u,2}}{c_{u,\infty}}\Big) = \frac{1}{2}\Big(\frac{\bar{\kappa}}{\zeta_m} \wedge \frac{c_{u,2}}{c_{u,\infty}}\Big),$$
where $\bar{\kappa} = \sqrt{\mathrm{mineig}(\Sigma_m)}$.
Finally,
$$\sqrt{\bar{f}}\,c_{u,2} < (\underline{f}^{3/2}/\bar{f}')\,\frac{1}{2}\Big(\frac{\bar{\kappa}}{\zeta_m} \wedge \frac{c_{u,2}}{c_{u,\infty}}\Big)\Big/\sqrt{3} \leq \bar{q}/\sqrt{3}$$
as $n$ grows under the conditions $\zeta_m c_{u,2} = o(1)$, $c_{u,\infty} = o(1)$, and conditions S.2 and S.3.
Appendix C. Proof of Theorems 1-7
In this section we gather the proofs of Theorems 1-7 stated in the main text. We adopt the standard notation of the empirical process literature [37]. We begin by assuming that the sequences $m$ and $n$ satisfy $m/n \to 0$ as $m, n \to \infty$. For notational convenience we write $\psi_i(\beta, u) = Z_i(1\{Y_i \leq Z_i'\beta\} - u)$, where $Z_i = Z(X_i)$. Also, for any sequence $r = r_n = o(1)$ and any fixed $0 < B < \infty$, we define the set
$$R_{n,m} := \{(u, \beta) \in \mathcal{U}\times\mathbb{R}^m : \|\beta - \beta(u)\| \leq Br\}$$
and the following error terms:
$$\epsilon_0(m,n) := \sup_{u \in \mathcal{U}}\|\mathbb{G}_n(\psi_i(\beta(u), u))\|,$$
$$\epsilon_1(m,n) := \sup_{(u,\beta) \in R_{n,m}}\|\mathbb{G}_n(\psi_i(\beta, u)) - \mathbb{G}_n(\psi_i(\beta(u), u))\|,$$
$$\epsilon_2(m,n) := \sup_{(u,\beta) \in R_{n,m}}n^{1/2}\|E[\psi_i(\beta, u)] - E[\psi_i(\beta(u), u)] - J_m(u)(\beta - \beta(u))\|.$$
In what follows, we say that the data are in general position if for any $\gamma \in \mathbb{R}^m$, $P(Y_i = Z_i'\gamma \text{ for at least one } i) = 0$. Under the bounded density condition S.2, the data are in general position. We also assume i.i.d. sampling throughout the appendix, although this condition is not needed for most of the results.
C.1. Proof of Theorem 1. We start by establishing uniform rates of convergence. The
following technical lemma will be used in the proof of Theorem 1.
Lemma 2 (Rates in Euclidean Norm for Perturbed QR Process). Suppose that $\hat{\beta}(u)$ is a minimizer of
$$E_n[\rho_u(Y_i - Z_i'\beta)] + A_n(u)'\beta$$
for each $u \in \mathcal{U}$, and that the perturbation term obeys $\sup_{u \in \mathcal{U}}\|A_n(u)\| \lesssim_P r = o(1)$. The unperturbed case corresponds to $A_n(\cdot) = 0$. If $\inf_{u \in \mathcal{U}}\mathrm{mineig}[J_m(u)] > J > 0$ and the conditions
R1. $\epsilon_0(m,n) \lesssim_P \sqrt{n}\,r$,
R2. $\epsilon_1(m,n) \lesssim_P \sqrt{n}\,r$,
R3. $\epsilon_2(m,n) \lesssim_P \sqrt{n}\,r$,
hold, where the constants in the bounds above can be taken to be independent of the constant $B$ in the definition of $R_{m,n}$, then for any $\varepsilon > 0$ there is a sufficiently large $B$ such that with probability at least $1 - \varepsilon$, $(u, \hat{\beta}(u)) \in R_{m,n}$ uniformly over $u \in \mathcal{U}$, that is,
$$\sup_{u \in \mathcal{U}}\|\hat{\beta}(u) - \beta(u)\| \lesssim_P r. \tag{C.29}$$
Proof of Lemma 2. Due to the convexity of the objective function, it suffices to show that for any $\varepsilon > 0$ there exists $B < \infty$ such that
$$P\Big(\inf_{u \in \mathcal{U}}\inf_{\|\eta\|=1}\eta'\big[E_n[\psi_i(\beta, u)] + A_n(u)\big]\Big|_{\beta = \beta(u) + Br\eta} > 0\Big) \geq 1 - \varepsilon. \tag{C.30}$$
Indeed, the quantity $E_n[\psi_i(\beta, u)] + A_n(u)$ is a subgradient of the objective function at $\beta$. Observe that uniformly in $u \in \mathcal{U}$,
$$\sqrt{n}\,\eta'E_n[\psi_i(\beta(u) + Br\eta, u)] \geq \mathbb{G}_n(\eta'\psi_i(\beta(u), u)) + \eta'J_m(u)\eta B\sqrt{n}\,r - \epsilon_1(m,n) - \epsilon_2(m,n),$$
since $E[\psi_i(\beta(u), u)] = 0$ by definition of $\beta(u)$ (see the argument in the proof of Lemma 3). Invoking R2 and R3,
$$\epsilon_1(m,n) + \epsilon_2(m,n) \lesssim_P \sqrt{n}\,r,$$
and by R1, uniformly in $\eta \in S^{m-1}$ we have
$$|\mathbb{G}_n(\eta'\psi_i(\beta(u), u))| \leq \sup_{u \in \mathcal{U}}\|\mathbb{G}_n(\psi_i(\beta(u), u))\| = \epsilon_0(m,n) \lesssim_P \sqrt{n}\,r.$$
Then the event of interest in (C.30) is implied by the event
$$\Big\{\eta'J_m(u)\eta B\sqrt{n}\,r - \epsilon_0(m,n) - \epsilon_1(m,n) - \epsilon_2(m,n) - \sqrt{n}\sup_{u \in \mathcal{U}}\|A_n(u)\| > 0\Big\},$$
whose probability can be made arbitrarily close to 1, for large $n$, by setting $B$ sufficiently large, since $\sup_{u \in \mathcal{U}}\|A_n(u)\| \lesssim_P r$ and $\eta'J_m(u)\eta \geq J > 0$ by the condition on the eigenvalues of $J_m(u)$.
Proof of Theorem 1. Let $\phi_n = \sup_{\alpha \in S^{m-1}}E[(Z_i'\alpha)^2] \vee E_n[(Z_i'\alpha)^2]$. Under $\zeta_m^2\log n = o(n)$ and S.3, $\phi_n \lesssim_P 1$ by Corollary 2 in Appendix G.
Next recall that $R_{m,n} = \{(u,\beta) \in \mathcal{U}\times\mathbb{R}^m : \|\beta - \beta(u)\| \leq Br\}$ for some fixed $B$ large enough. Under S.1-S.5, $\epsilon_0(m,n) \lesssim_P \sqrt{m\phi_n\log n}$ by Lemma 23, and $\epsilon_1(m,n) \lesssim_P \sqrt{m\phi_n\log n}$ by Lemma 22, where none of these bounds depend on $B$. Under S.1-S.5, $\epsilon_2(m,n) \lesssim \sqrt{n}\,\zeta_mB^2r^2 + \sqrt{n}\,m^{-\kappa}Br \lesssim \sqrt{n}\,r$ by Lemma 24, provided $\zeta_mB^2r = o(1)$ and $m^{-\kappa}B = o(1)$. Finally, since $\zeta_m^2m\log n = o(n)$ by the growth condition of the theorem, we can take $r = \sqrt{(m\log n)/n}$ in Lemma 2 with $A_n(u) = 0$, and the result follows.
C.2. Proof of Theorems 2, 3, 4, 5 and 6. In order to establish our uniform linear approximations for the perturbed QR process, it is convenient to define the following approximation error:
$$\epsilon_3(m,n) := \sup_{u \in \mathcal{U}}n^{1/2}\big\|E_n[\psi_i(\hat{\beta}(u), u)] + A_n(u)\big\|.$$
Lemma 3 (Uniform Linear Approximation). Suppose that $\hat{\beta}(u)$ is a minimizer of
$$E_n[\rho_u(Y_i - Z_i'\beta)] + A_n(u)'\beta$$
for each $u \in \mathcal{U}$, the data are in general position, and the conditions of Lemma 2 hold. The unperturbed case corresponds to $A_n(\cdot) = 0$. Then
$$\sqrt{n}\,J_m(u)\big(\hat{\beta}(u) - \beta(u)\big) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n\psi_i(\beta(u), u) - A_n(u) + r_n(u), \tag{C.31}$$
where $\sup_{u \in \mathcal{U}}\|r_n(u)\| \lesssim_P \epsilon_1(m,n) + \epsilon_2(m,n) + \epsilon_3(m,n)$.

Proof of Lemma 3. First note that $E[\psi_i(\beta(u), u)] = 0$ by definition of $\beta(u)$. Indeed, despite the possible approximation error, $\beta(u)$ minimizes $E[\rho_u(Y - Z'\beta)]$, which yields $E[\psi_i(\beta(u), u)] = 0$ by the first-order conditions.
Therefore equation (C.31) can be recast as
$$r_n(u) = n^{1/2}J_m(u)(\hat{\beta}(u) - \beta(u)) + \mathbb{G}_n(\psi_i(\beta(u), u)) + A_n(u).$$
Next note that if $(u, \hat{\beta}(u)) \in R_{n,m}$ uniformly in $u \in \mathcal{U}$, then by the triangle inequality, uniformly in $u \in \mathcal{U}$,
$$\|r_n(u)\| \leq \big\|\mathbb{G}_n(\psi_i(\beta(u), u)) - \mathbb{G}_n(\psi_i(\hat{\beta}(u), u))\big\| + n^{1/2}\big\|E[\psi_i(\hat{\beta}(u), u)] - E[\psi_i(\beta(u), u)] - J_m(u)\big(\hat{\beta}(u) - \beta(u)\big)\big\| + n^{1/2}\big\|E_n[\psi_i(\hat{\beta}(u), u)] + A_n(u)\big\| \leq \epsilon_1(m,n) + \epsilon_2(m,n) + \epsilon_3(m,n).$$
The result follows by Lemma 2, which bounds from below the probability that $(u, \hat{\beta}(u)) \in R_{n,m}$ uniformly in $u \in \mathcal{U}$.
Proof of Theorem 2. We first control the approximation errors $\epsilon_0$, $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$. Lemma 23 implies that $\epsilon_0(m,n) \lesssim_P \sqrt{m\phi_n\log n}$. Lemma 24 implies that
$$\epsilon_1(m,n) \lesssim_P \sqrt{m\zeta_mr\log n} + \frac{\sqrt{m}\,\zeta_m\log n}{\sqrt{n}} \quad\text{and}\quad \epsilon_2(m,n) \lesssim_P \sqrt{n}\,\zeta_mr^2 + \sqrt{n}\,m^{-\kappa}r.$$
Next note that under the bounded density condition in S.2, the data are in general position. Therefore, by Lemma 26, with probability 1 we have
$$\epsilon_3(m,n) \leq \frac{m\zeta_m}{\sqrt{n}}.$$
Thus, the assumptions required by Lemma 3 follow under $m^3\zeta_m^2\log^3n = o(n)$, the uniform rate $r = \sqrt{(m\log n)/n}$, and the condition on $\kappa$.
Thus by Lemma 3 with $A_n(\cdot) = 0$,
$$\sqrt{n}\,J_m(u)\big(\hat{\beta}(u) - \beta(u)\big) = -\frac{1}{\sqrt{n}}\sum_{i=1}^n\psi_i(\beta(u), u) + r_n(u), \quad\text{where}$$
$$\sup_{u \in \mathcal{U}}\|r_n(u)\| \lesssim_P \sqrt{m\zeta_mr\log n} + \frac{\sqrt{m}\,\zeta_m\log n}{\sqrt{n}} + m^{-\kappa}\sqrt{m\log n} + \frac{m\zeta_m}{\sqrt{n}}.$$
Next, let
$$\tilde{r}_u := \frac{1}{\sqrt{n}}\sum_{i=1}^nZ_i\big(1\{Y_i \leq Z_i'\beta(u) + R(X_i, u)\} - 1\{Y_i \leq Z_i'\beta(u)\}\big).$$
By Lemma 25, $\sup_{u \in \mathcal{U}}\|\tilde{r}_u\| \lesssim_P \sqrt{m^{1-\kappa}\log n} + m\zeta_m\log n/\sqrt{n}$. The result follows since $\mathbb{U}_n(u) =_d \frac{1}{\sqrt{n}}\sum_{i=1}^nZ_i(u - 1\{Y_i \leq Z_i'\beta(u) + R(X_i, u)\})$.
Proof of Theorem 3. Note that
$$\sup_{u \in \mathcal{U}}\|r_n(u)\| \leq \sup_{\alpha \in S^{m-1}, u \in \mathcal{U}}\big|\alpha'(J_m^{-1}(u) - \hat{J}_m^{-1}(u))\alpha\big|\;\sup_{u \in \mathcal{U}}\|\mathbb{U}_n^*(u)\|.$$
To bound the second term, by Lemma 23,
$$\sup_{u \in \mathcal{U}}\|\mathbb{U}_n^*(u)\| \lesssim_P \sqrt{m\log n} \tag{C.32}$$
since $\phi_n \lesssim_P 1$ by Corollary 2 in Appendix G under $\zeta_m^2\log n = o(n)$.
Next, since $J_m(u)$ has eigenvalues bounded away from zero by S.2 and S.3, $J_m^{-1}(u)$ has eigenvalues bounded above by a constant. Moreover, by Lemma 4,
$$\sup_{\alpha \in S^{m-1}, u \in \mathcal{U}}\big|\alpha'(\hat{J}_m(u) - J_m(u))\alpha\big| \lesssim_P \epsilon_5(m,n) + \epsilon_6(m,n) \lesssim_P \sqrt{\frac{\zeta_m^2\log n}{nh_n}} + \sqrt{\frac{m\zeta_m^2\log n}{nh_n}} + m^{-\kappa} + \sqrt{\frac{m\log n}{n}}\,\zeta_m + h_n,$$
where the second inequality follows by Lemma 28 with $r = \sqrt{(m\log n)/n}$. Under the growth conditions stated in the theorem,
$$\sup_{\alpha \in S^{m-1}, u \in \mathcal{U}}\big|\alpha'(\hat{J}_m(u) - J_m(u))\alpha\big| = o_P\big(1/\sqrt{m\log^3n}\big).$$
Thus, with probability approaching one, $\hat{J}_m^{-1}(u)$ has eigenvalues bounded above by a constant uniformly in $u \in \mathcal{U}$.
Moreover, by the matrix identity $A^{-1} - B^{-1} = B^{-1}(B - A)A^{-1}$,
$$\big|\alpha'(J_m^{-1}(u) - \hat{J}_m^{-1}(u))\alpha\big| = \big|\alpha'\hat{J}_m^{-1}(u)(\hat{J}_m(u) - J_m(u))J_m^{-1}(u)\alpha\big| \leq \|\alpha'\hat{J}_m^{-1}(u)\|\sup_{\tilde{\alpha} \in S^{m-1}}\big|\tilde{\alpha}'(\hat{J}_m(u) - J_m(u))\tilde{\alpha}\big|\,\|J_m^{-1}(u)\alpha\| = o_P\big(1/\sqrt{m\log^3n}\big), \tag{C.33}$$
which holds uniformly over $\alpha \in S^{m-1}$ and $u \in \mathcal{U}$.
Finally, the stated bound on $\sup_{u \in \mathcal{U}}\|r_n(u)\|$ follows by combining relations (C.32) and (C.33).
Proof of Theorem 4. The first part of the proof is similar to the proof of Theorem 2 but applying Lemma 3 twice, once to the unperturbed problem and once to the perturbed problem with $A_n(u) = -\mathbb{U}_n^*(u)/\sqrt{n}$, for every $u \in \mathcal{U}$. Consider the set $E$ of realizations of the data $\mathcal{D}_n$ such that $\sup_{\|\alpha\|=1}E_n[(\alpha'Z_i)^2] \lesssim 1$. By Corollary 2 in Appendix G and assumptions S.3 and S.4, $P(E) = 1 - o(1)$ under our growth conditions. Thus, by Lemma 23,
$$\sup_{u \in \mathcal{U}}\|A_n(u)\| \lesssim_P r = \sqrt{\frac{m\log n}{n}},$$
where we used that $\phi_n \lesssim_P 1$ by Corollary 2 in Appendix G under $\zeta_m^2\log n = o(n)$ and S.3. Then,
$$\sqrt{n}\,J_m(u)\big(\hat{\beta}^*(u) - \hat{\beta}(u)\big) = \sqrt{n}\,J_m(u)\big(\hat{\beta}^*(u) - \beta(u)\big) - \sqrt{n}\,J_m(u)\big(\hat{\beta}(u) - \beta(u)\big) = \mathbb{U}_n^*(u) + r_n^{\mathrm{pert}}(u) - r_n^{\mathrm{unpert}}(u),$$
where
$$\sup_{u \in \mathcal{U}}\|r_n^{\mathrm{pert}}(u) - r_n^{\mathrm{unpert}}(u)\| \lesssim_P \sqrt{m\zeta_mr\log n} + \frac{\sqrt{m}\,\zeta_m\log n}{\sqrt{n}} + m^{-\kappa}\sqrt{m\log n} + \frac{m\zeta_m}{\sqrt{n}}.$$
Note also that the results continue to hold in $P$-probability if we replace $P$ by $P^*$, since if a random variable $B_n = O_P(1)$ then $B_n = O_{P^*}(1)$. Indeed, the first relation means that $P(|B_n| > \ell_n) = o(1)$ for any $\ell_n \to \infty$, while the second means that $P^*(|B_n| > \ell_n) = o_P(1)$ for any $\ell_n \to \infty$. But the second clearly follows from the first by the Markov inequality, observing that $E[P^*(|B_n| > \ell_n)] = P(|B_n| > \ell_n) = o(1)$.
Proof of Theorem 5. The existence of the Gaussian process with the specified covariance structure is trivial. To establish the additional coupling, note that under S.1-S.4 and the growth conditions, $\sup_{\|\alpha\|=1}E_n[(\alpha'Z_i)^2] \lesssim_P 1$ by Corollary 2 in Appendix G. Conditioning on any realization of $Z_1, \ldots, Z_n$ such that $\sup_{\|\alpha\|=1}E_n[(\alpha'Z_i)^2] \lesssim 1$, the existence of a Gaussian process with the specified covariance structure that also satisfies the coupling condition follows from Lemma 8. Under the conditions of the theorem, $\sup_{u \in \mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{G}_n(u)\| = o_P(1/\log n)$. Therefore, the second statement follows since
$$\sup_{u \in \mathcal{U}}\big\|\sqrt{n}(\hat{\beta}(u) - \beta(u)) - J_m^{-1}(u)\mathbb{G}_n(u)\big\| \leq \sup_{u \in \mathcal{U}}\|J_m^{-1}(u)\mathbb{U}_n(u) - J_m^{-1}(u)\mathbb{G}_n(u)\| + o_P(1/\log n)$$
$$\leq \sup_{u \in \mathcal{U}}\|J_m^{-1}(u)\|\sup_{u \in \mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{G}_n(u)\| + o_P(1/\log n) = o_P(1/\log n),$$
where we invoke Theorem 2 under the growth conditions, and the fact that the eigenvalues of $J_m(u)$ are bounded away from zero uniformly in $n$ by S.2 and S.3.
The last statement proceeds similarly to the proof of Theorem 3.
Proof of Theorem 6. Note that $\hat{\beta}^b(u)$ solves the quantile regression problem for the rescaled data $\{(h_iY_i, h_iZ_i) : 1 \leq i \leq n\}$. The weight $h_i$ is independent of $(Y_i, Z_i)$, $E[h_i] = 1$, $\mathrm{Var}(h_i) = 1$, and $\max_{1\leq i\leq n}h_i \lesssim_P \log n$. That allows us to extend all results from $\hat{\beta}(u)$ to $\hat{\beta}^b(u)$, replacing $\zeta_m$ by $\zeta_m^b = \zeta_m\log n$ to account for the larger envelope, and $\psi_i(\beta, u)$ by $\psi_i^b(\beta, u) = h_i\psi_i(\beta, u)$.
The first part of the proof is similar to the proof of Theorem 2 but applying Lemma 3 twice, once to the original problem and once to the problem weighted by $\{h_i\}$. Then
$$\sqrt{n}\big(\hat{\beta}^b(u) - \hat{\beta}(u)\big) = \sqrt{n}\big(\hat{\beta}^b(u) - \beta(u)\big) - \sqrt{n}\big(\hat{\beta}(u) - \beta(u)\big)$$
$$= \frac{J_m^{-1}(u)}{\sqrt{n}}\sum_{i=1}^nh_iZ_i(u - 1\{Y_i \leq Z_i'\beta(u)\}) + r_n^b(u) - \frac{J_m^{-1}(u)}{\sqrt{n}}\sum_{i=1}^nZ_i(u - 1\{Y_i \leq Z_i'\beta(u)\}) - r_n(u),$$
where $\sup_{u \in \mathcal{U}}\|r_n(u) - r_n^b(u)\| \lesssim_P \sqrt{m\zeta_mr\log^2n} + \frac{\sqrt{m}\,\zeta_m\log^2n}{\sqrt{n}} + m^{-\kappa}\sqrt{m\log n} + \frac{m\zeta_m\log n}{\sqrt{n}}$.
Next, let
$$\tilde{r}_u := \frac{1}{\sqrt{n}}\sum_{i=1}^n(h_i - 1)Z_i\big(1\{Y_i \leq Z_i'\beta(u) + R(X_i, u)\} - 1\{Y_i \leq Z_i'\beta(u)\}\big).$$
By Lemma 25, $\sup_{u \in \mathcal{U}}\|\tilde{r}_u\| \lesssim_P \sqrt{m^{1-\kappa}\log n} + \frac{m\zeta_m\log^2n}{\sqrt{n}}$. The result follows since $1\{U_i \leq u\} = 1\{Y_i \leq Z_i'\beta(u) + R(X_i, u)\}$.
Note also that the results continue to hold in $P$-probability if we replace $P$ by $P^*$, since if a random variable $B_n = O_P(1)$ then $B_n = O_{P^*}(1)$. Indeed, the first relation means that $P(|B_n| > \ell_n) = o(1)$ for any $\ell_n \to \infty$, while the second means that $P^*(|B_n| > \ell_n) = o_P(1)$ for any $\ell_n \to \infty$. But the second clearly follows from the first by the Markov inequality, observing that $E[P^*(|B_n| > \ell_n)] = P(|B_n| > \ell_n) = o(1)$.
The second part of the theorem follows by applying Lemma 8 with $v_i = h_i - 1$, where $h_i \sim \exp(1)$ so that $E[v_i^2] = 1$, $E[|v_i|^3] \lesssim 1$, and $\max_{1\leq i\leq n}|v_i| \lesssim_P \log n$. Lemma 8 implies that there is a Gaussian process $\mathbb{G}_n(\cdot)$ with covariance structure $E_n[Z_iZ_i'](u\wedge u' - uu')$ such that
$$\sup_{u \in \mathcal{U}}\Big\|\frac{1}{\sqrt{n}}\sum_{i=1}^n(h_i - 1)Z_i(u - 1\{U_i \leq u\}) - \mathbb{G}_n(u)\Big\| \lesssim_P o(1/\log n).$$
Combining the result above with the first part of the theorem, the second part follows by the triangle inequality.
C.3. Proof of Theorem 7 on Matrices Estimation. Consider the following quantities:
$$\epsilon_4(m,n) = \sup_{\alpha \in S^{m-1}}n^{-1/2}\big|\mathbb{G}_n\big((\alpha'Z_i)^2\big)\big|,$$
$$\epsilon_5(m,n) = \sup_{\alpha \in S^{m-1},\,(u,\beta) \in R_{m,n}}\frac{n^{-1/2}}{2h_n}\big|\mathbb{G}_n\big(1\{|Y_i - Z_i'\beta| \leq h_n\}(\alpha'Z_i)^2\big)\big|,$$
$$\epsilon_6(m,n) = \sup_{\alpha \in S^{m-1},\,(u,\beta) \in R_{m,n}}\Big|\frac{1}{2h_n}E\big[1\{|Y_i - Z_i'\beta| \leq h_n\}(\alpha'Z_i)^2\big] - \alpha'J_m(u)\alpha\Big|,$$
where $h_n$ is a bandwidth parameter such that $h_n \to 0$.

Lemma 4 (Estimation of Variance Matrices and Jacobians). Under the definitions of $\epsilon_4$, $\epsilon_5$, and $\epsilon_6$ we have
$$\sup_{\alpha \in S^{m-1}}\big|\alpha'(\hat{\Sigma}_m - \Sigma_m)\alpha\big| \leq \epsilon_4(m,n), \qquad \sup_{\alpha \in S^{m-1},\,u \in \mathcal{U}}\big|\alpha'(\hat{J}_m(u) - J_m(u))\alpha\big| \leq \epsilon_5(m,n) + \epsilon_6(m,n).$$

Proof. The first relation follows by the definition of $\epsilon_4(m,n)$, and the second by the triangle inequality and the definitions of $\epsilon_5(m,n)$ and $\epsilon_6(m,n)$.

Proof of Theorem 7. This follows immediately from Lemma 4, Lemma 27, and Lemma 28.
Appendix D. Proofs of Theorems 8 and 9
Proof of Theorem 8. By Theorem 2 and P1,
$$|\hat{\theta}(u,w) - \theta(u,w)| \leq |\ell(w)'(\hat{\beta}(u) - \beta(u))| + |r_n(u,w)| \leq \frac{|\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)|}{\sqrt{n}} + o_P\Big(\frac{\|\ell(w)\|}{\sqrt{n}\log n}\Big) + o\big(\|\ell(w)\|/\sqrt{n}\big)$$
$$\leq \frac{|\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)|}{\sqrt{n}} + o_P\Big(\frac{\xi_\theta(m,w)}{\sqrt{n}\log n}\Big) + o\big(\xi_\theta(m,w)/\sqrt{n}\big),$$
where the last inequality follows by $\|\ell(w)\| \lesssim \xi_\theta(m,w)$, assumed in P2. Finally, since $E\big[|\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)|^2\big] \lesssim \|\ell(w)\|^2\|J_m^{-1}(u)\|^2\sup_{\alpha \in S^{m-1}}E[(\alpha'Z)^2] \lesssim \xi_\theta^2(m,w)$, the result follows by applying the Chebyshev inequality to establish that $|\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)| \lesssim_P \xi_\theta(m,w)$.
Proof of Theorem 9. By Assumption P1 and Theorem 2,
$$t_n(u,w) = \frac{\ell(w)'(\hat{\beta}(u) - \beta(u))}{\hat{\sigma}_n(u,w)} + \frac{r_n(u,w)}{\hat{\sigma}_n(u,w)} = \frac{\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)}{\sqrt{n}\,\hat{\sigma}_n(u,w)} + o_P\Big(\frac{\|\ell(w)\|}{\sqrt{n}\,\hat{\sigma}_n(u,w)\log n}\Big) + o_P\Big(\frac{\|\ell(w)\|}{\sqrt{n}\,\hat{\sigma}_n(u,w)}\Big).$$
To show that the last two terms are $o_P(1)$, note that under conditions S.2 and S.3, $\hat{\sigma}_n(u,w) \gtrsim_P \|\ell(w)\|/\sqrt{n}$ since $\hat{\sigma}_n(u,w) = (1 + o_P(1))\sigma_n(u,w)$ by Theorem 7.
Finally, for any $\epsilon > 0$, $E\big[(\ell(w)'J_m^{-1}(u)Z_i)^2/\sigma_n^2(u,w)\,1\{|\ell(w)'J_m^{-1}(u)Z_i| \geq \epsilon\sqrt{n}\,\sigma_n(u,w)\}\big] \to 0$ under our growth conditions. Thus, by the Lindeberg-Feller central limit theorem and $\hat{\sigma}_n(u,w) = (1 + o_P(1))\sigma_n(u,w)$, it follows that
$$\frac{\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)}{\sqrt{n}\,\hat{\sigma}_n(u,w)} \to_d N(0,1),$$
and the first result follows.
The remaining results for $t_n^*(u,w)$ follow similarly under the corresponding conditions.
Appendix E. Proofs of Theorems 10-12
Lemma 5 (Entropy Bound). Let $\mathcal{W} \subset \mathbb{R}^d$ be a bounded set, and $\ell : \mathcal{W} \to \mathbb{R}^m$ be a mapping such that, for $\xi_\theta(m)$ and $\xi_\theta^L(m) \geq 1$,
$$\|\ell(w)\| \leq \xi_\theta(m) \quad\text{and}\quad \|\ell(w) - \ell(\tilde{w})\| \leq \xi_\theta^L(m)\|w - \tilde{w}\|, \quad\text{for all } w, \tilde{w} \in \mathcal{W}.$$
Moreover, assume that $\|J_m(u) - J_m(u')\| \leq \xi(m)|u - u'|$ for $u, u' \in \mathcal{U}$ in the operator norm, $\xi(m) \geq 1$, and let $\mu = \sup_{u \in \mathcal{U}}\|J_m(u)^{-1}\| > 0$. Let $\mathcal{L}_m$ be the class of functions
$$\mathcal{L}_m = \{f_{u,w}(g) = \ell(w)'J_m^{-1}(u)g : u \in \mathcal{U}, w \in \mathcal{W}\} \cup \{f(g) = 1\},$$
where $g \in B(0, \zeta_m)$, and let $F$ denote the envelope of $\mathcal{L}_m$, i.e. $F = \sup_{f \in \mathcal{L}_m}|f|$. Then, for any $\varepsilon < \varepsilon_0$, the uniform entropy of $\mathcal{L}_m$ is bounded by
$$\sup_Q N(\varepsilon\|F\|_{Q,2}, \mathcal{L}_m, L_2(Q)) \leq \Big(\frac{\varepsilon_0 + \mathrm{diam}(\mathcal{U}\times\mathcal{W})K}{\varepsilon}\Big)^{d+1},$$
where $K := \mu\xi_\theta^L(m)\zeta_m + \mu^2\xi_\theta(m)\xi(m)\zeta_m$.
Proof. The uniform entropy bound is based on the proof of Theorem 2 of Andrews [1]. We first exclude the constant function equal to 1 and compute the Lipschitz constant associated with the other functions in $\mathcal{L}_m$ as follows:
$$|\ell(w)'J_m^{-1}(u)g - \ell(\tilde{w})'J_m^{-1}(\tilde{u})g| = |(\ell(w) - \ell(\tilde{w}))'J_m^{-1}(u)g + \ell(\tilde{w})'(J_m^{-1}(u) - J_m^{-1}(\tilde{u}))g|$$
$$\leq |(\ell(w) - \ell(\tilde{w}))'J_m^{-1}(u)g| + |\ell(\tilde{w})'J_m^{-1}(\tilde{u})(J_m(\tilde{u}) - J_m(u))J_m^{-1}(u)g|$$
$$\leq \xi_\theta^L(m)\|w - \tilde{w}\|\,\|J_m^{-1}(u)g\| + \|\ell(\tilde{w})'J_m^{-1}(\tilde{u})\|\,\xi(m)|\tilde{u} - u|\,\|J_m^{-1}(u)g\| \leq K(\|w - \tilde{w}\| + |\tilde{u} - u|),$$
where $K := \mu\xi_\theta^L(m)\zeta_m + \mu^2\xi_\theta(m)\xi(m)\zeta_m$.
Consider functions $f_{\tilde{u}_j,\tilde{w}_j}$, where the $(\tilde{u}_j, \tilde{w}_j)$'s are points at the centers of disjoint cubes of diameter $\varepsilon\|F\|_{Q,2}/K$ whose union covers $\mathcal{U}\times\mathcal{W}$. Thus, for any $(u,w) \in \mathcal{U}\times\mathcal{W}$,
$$\min_{\tilde{u}_j,\tilde{w}_j}\|f_{u,w} - f_{\tilde{u}_j,\tilde{w}_j}\|_{Q,2} \leq K\min_{\tilde{u}_j,\tilde{w}_j}(\|w - \tilde{w}_j\| + |\tilde{u}_j - u|) \leq \varepsilon\|F\|_{Q,2}.$$
Adding (if necessary) the constant function to the cover, for any measure $Q$ we obtain
$$N(\varepsilon\|F\|_{Q,2}, \mathcal{L}_m, L_2(Q)) \leq 1 + (\mathrm{diam}(\mathcal{U}\times\mathcal{W})K/\varepsilon\|F\|_{Q,2})^{d+1}.$$
The result follows by noting that $\varepsilon < \varepsilon_0$ and that $\|F\|_{Q,2} \geq 1$ since $\mathcal{L}_m$ contains the constant function equal to 1.
Lemma 6. Let $\mathcal{F}$ and $\mathcal{G}$ be two classes of functions with envelopes $F$ and $G$ respectively. Then the entropy of $\mathcal{FG} = \{fg : f \in \mathcal{F}, g \in \mathcal{G}\}$ satisfies
$$\sup_Q N(\varepsilon\|FG\|_{Q,2}, \mathcal{FG}, L_2(Q)) \leq \sup_Q N((\varepsilon/2)\|F\|_{Q,2}, \mathcal{F}, L_2(Q))\;\sup_Q N((\varepsilon/2)\|G\|_{Q,2}, \mathcal{G}, L_2(Q)).$$

Proof. The proof is similar to Theorem 3 of Andrews [1]. For any measure $Q$ we denote $c_F = \|F\|_{Q,2}^2$ and $c_G = \|G\|_{Q,2}^2$. Note that for any measurable set $A$, $Q_F(A) = \int_AF^2(x)dQ(x)/c_F$ and $Q_G(A) = \int_AG^2(x)dQ(x)/c_G$ are also measures. Let
$$K = \sup_{Q_F}N(\varepsilon\|G\|_{Q_F,2}, \mathcal{G}, L_2(Q_F)) = \sup_QN(\varepsilon\|G\|_{Q,2}, \mathcal{G}, L_2(Q)) \quad\text{and}\quad L = \sup_{Q_G}N(\varepsilon\|F\|_{Q_G,2}, \mathcal{F}, L_2(Q_G)) = \sup_QN(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)).$$
Let $g_1, \ldots, g_K$ and $f_1, \ldots, f_L$ denote functions in $\mathcal{G}$ and $\mathcal{F}$ used to build covers of $\mathcal{G}$ and $\mathcal{F}$ by cubes with diameter $\varepsilon\|FG\|_{Q,2}$. Since $F \geq |f|$ and $G \geq |g|$ we have
$$\min_{\ell\leq L,k\leq K}\|fg - f_\ell g_k\|_{Q,2} \leq \min_{\ell\leq L,k\leq K}\big(\|f(g - g_k)\|_{Q,2} + \|g_k(f - f_\ell)\|_{Q,2}\big) \leq \min_{k\leq K}\|F(g - g_k)\|_{Q,2} + \min_{\ell\leq L}\|G(f - f_\ell)\|_{Q,2}$$
$$= \min_{k\leq K}\|g - g_k\|_{Q_F,2} + \min_{\ell\leq L}\|f - f_\ell\|_{Q_G,2} \leq \varepsilon\|G\|_{Q_F,2} + \varepsilon\|F\|_{Q_G,2} = 2\varepsilon\|FG\|_{Q,2}.$$
Therefore, by taking pairwise products of $g_1, \ldots, g_K$ and $f_1, \ldots, f_L$ to create a net we have
$$N(\varepsilon\|FG\|_{Q,2}, \mathcal{FG}, L_2(Q)) \leq N((\varepsilon/2)\|F\|_{Q,2}, \mathcal{F}, L_2(Q))\,N((\varepsilon/2)\|G\|_{Q,2}, \mathcal{G}, L_2(Q)).$$
The result follows by taking the supremum over $Q$ on both sides.
Proof of Theorem 10. By the triangle inequality,
$$\sup_{(u,w)\in I}|\hat{\theta}(u,w) - \theta(u,w)| \leq \sup_{(u,w)\in I}|\ell(w)'(\hat{\beta}(u) - \beta(u))| + \sup_{(u,w)\in I}|r_n(u,w)|,$$
where the second term satisfies $\sup_{(u,w)\in I}\big|r_n(u,w)/\|\ell(w)\|\big| = o(n^{-1/2}\log^{-1}n)$ by condition U.1. By Theorem 2, the first term is bounded uniformly over $I$ by
$$|\ell(w)'(\hat{\beta}(u) - \beta(u))| \lesssim_P \frac{|\ell(w)'J_m^{-1}(u)\mathbb{U}_n(u)|}{\sqrt{n}} + o_P\big(\xi_\theta(m,I)/[\sqrt{n}\log n]\big) \tag{E.34}$$
since $\|\ell(w)\| \leq \xi_\theta(m,I)$ by U.2 and the remainder term of the linear representation in Theorem 2 satisfies $\sup_{u\in\mathcal{U}}\|r_n(u)\| = o_P(1/\log n)$.
Given $Z_i = Z(X_i)$, consider the classes of functions
$$\mathcal{F}_m = \{\ell(w)'J_m^{-1}(u)Z_i : (u,w) \in I\}, \quad \mathcal{G} = \{(1\{U_i \leq u\} - u) : u \in \mathcal{U}\}, \quad \mathcal{L}_m = \mathcal{F}_m \cup \{f(Z_i) = 1\}.$$
By Lemma 5 and Lemma 6, the entropy of the class of functions $\mathcal{L}_m\mathcal{G} = \{fg : f \in \mathcal{L}_m, g \in \mathcal{G}\}$ satisfies, for $K \lesssim (\xi_\theta(m,I)\vee1)(\xi_\theta^L(m,I)\vee1)(\zeta_m\vee1)$,
$$\sup_Q\log N(\varepsilon\|F\|_{Q,2}, \mathcal{L}_m\mathcal{G}, L_2(Q)) \lesssim (d+1)\log[(1 + \mathrm{diam}(I)K)/\varepsilon] \lesssim \log(n/\varepsilon)$$
by our assumptions. Noting that under S.3
$$\sup_{(u,w)\in I}\sqrt{u(1-u)\ell(w)'J_m^{-1}(u)\Sigma_mJ_m^{-1}(u)\ell(w)} \lesssim \xi_\theta(m,I),$$
Lemma 16 applied to $\mathcal{L}_m\mathcal{G}$ with $J(m) = \sqrt{d\log n}$ and $M(m,n) \lesssim 1\vee\xi_\theta(m,I)\zeta_m$ yields
$$\sup_{(u,w)\in I}|\ell(w)'J_m(u)^{-1}\mathbb{U}_n(u)| \lesssim_P \sqrt{d\log n}\Big(1\vee\xi_\theta^2(m,I) + \frac{d\log(n)\,(1\vee\xi_\theta^2(m,I)\zeta_m^2)\log n}{n}\Big)^{1/2}\log^{1/2}n,$$
where the second term in the parentheses is negligible compared to the first under our growth conditions. The result follows by using this bound in (E.34).
Proof of Theorem 11. Under the conditions of Theorems 2 and 3, $\sup_{u\in\mathcal{U}}\big|\mathrm{maxeig}(\hat{J}_m^{-1}(u) - J_m^{-1}(u))\big| = o_P\big(1/(\sqrt{m}\log^{3/2}n)\big)$, and $\hat{\sigma}_n(u,w) = \big(1 + o_P(1/(\sqrt{m}\log^{3/2}n))\big)\sigma_n(u,w)$ uniformly in $(u,w) \in I$. Note that $\hat{\sigma}_n(u,w) \gtrsim_P \|\ell(w)\|/\sqrt{n}$. Next define $\bar{t}_n^*(u,w)$ as follows. For the pivotal and gradient bootstrap couplings:
$$\bar{t}_n^*(u,w) = \frac{\ell(w)'J_m^{-1}(u)\mathbb{U}_n^*(u)/\sqrt{n}}{\sigma_n(u,w)};$$
for the Gaussian and weighted bootstrap couplings:
$$\bar{t}_n^*(u,w) = \frac{\ell(w)'J_m^{-1}(u)\mathbb{G}_n(u)/\sqrt{n}}{\sigma_n(u,w)}.$$
Note that $\bar{t}_n^*$ is independent of the data conditional on the regressor sequence $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, unlike $t_n^*$, which has some dependence on the data through various estimated quantities.
For the case of the pivotal and gradient bootstrap couplings, by Theorem 2 and condition U,
$$t_n(u,w) =_d \bar{t}_n^*(u,w) + o_P(1/\log n) \text{ in } \ell^\infty(I).$$
Moreover, for the case of the Gaussian and weighted bootstrap couplings, under the conditions of Theorem 5,
$$t_n(u,w) =_d \bar{t}_n^*(u,w) + o_P(1/\log n) \text{ in } \ell^\infty(I).$$
Finally, under the growth conditions,
$$\bar{t}_n^*(u,w) = t_n^*(u,w) + o_P(1/\log n) \text{ in } \ell^\infty(I).$$
Thus, it follows that uniformly in $(u,w) \in I$,
$$t_n(u,w) =_d t_n^*(u,w) + o_P(1/\log n) = t_n(u,w) + o_P(1/\log n),$$
and the result follows.
Proof of Theorem 12. Let $\varepsilon_n = 1/\log n$, and let $\delta_n$ be such that $\delta_n\log^{1/2}n \to 0$ and $\delta_n/\varepsilon_n \to \infty$.

Step 1. By the proof of Theorem 11 there is an approximation $\|\bar{t}_n^*\|_I = \sup_{(u,w)\in I}|\bar{t}_n^*(u,w)|$ to $\|t_n^*\|_I = \sup_{(u,w)\in I}|t_n^*(u,w)|$, which does not depend on the data conditional on the regressor sequence $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, such that
$$P\big(\big|\|\bar{t}_n^*\|_I - \|t_n^*\|_I\big| \leq \varepsilon_n\big) = 1 - o(1).$$
Now let
$$k_n(1-\alpha) := (1-\alpha)\text{-quantile of }\|t_n^*\|_I, \text{ conditional on } \mathcal{D}_n,$$
and let
$$\kappa_n(1-\alpha) := (1-\alpha)\text{-quantile of }\|\bar{t}_n^*\|_I, \text{ conditional on } \mathcal{D}_n.$$
Note that since $\|\bar{t}_n^*\|_I$ is conditionally independent of $Y_1, \ldots, Y_n$,
$$\kappa_n(1-\alpha) = (1-\alpha)\text{-quantile of }\|\bar{t}_n^*\|_I, \text{ conditional on } \mathcal{Z}_n.$$
Then, applying Lemma 7 to $\|\bar{t}_n^*\|_I$ and $\|t_n^*\|_I$, we get that for some $\nu_n \searrow 0$,
$$P[\kappa_n(p) \geq k_n(p - \nu_n) - \varepsilon_n \text{ and } k_n(p) \geq \kappa_n(p - \nu_n) - \varepsilon_n] = 1 - o(1).$$

Step 2. Claim (1) now follows by noting that
$$P\{\|t_n\|_I > k_n(1-\alpha) + \delta_n\} \leq P\{\|t_n\|_I > \kappa_n(1-\alpha-\nu_n) - \varepsilon_n + \delta_n\} + o(1)$$
$$\leq P\{\|t_n^*\|_I > \kappa_n(1-\alpha-\nu_n) - 2\varepsilon_n + \delta_n\} + o(1) \leq P\{\|t_n^*\|_I > k_n(1-\alpha-2\nu_n) - 3\varepsilon_n + \delta_n\} + o(1)$$
$$\leq P\{\|t_n^*\|_I > k_n(1-\alpha-2\nu_n)\} + o(1) = E_P[P\{\|t_n^*\|_I > k_n(1-\alpha-2\nu_n)|\mathcal{D}_n\}] + o(1) \leq E_P[\alpha + 2\nu_n] + o(1) = \alpha + o(1).$$
Claim (2) follows from the equivalence of the event that $\theta(u,w)$ lies in the confidence band for all $(u,w) \in I$ and the event $\{\|t_n\|_I \leq c_n(1-\alpha)\}$.
To prove Claim (3), note that $\hat{\sigma}_n(u,w) = (1 + o_P(1))\sigma_n(u,w)$ uniformly in $(u,w) \in I$ under the conditions of Theorems 2 and 7. Moreover, $c_n(1-\alpha) = k_n(1-\alpha)(1 + o_P(1))$ because $1/k_n(1-\alpha) \lesssim_P 1$ and $\delta_n \to 0$. Combining these relations the result follows.
Claim (4) follows from Claim (1) and from the following lower bound. By Lemma 7, we get that for some $\nu_n \searrow 0$,
$$P[\kappa_n(p + \nu_n) + \varepsilon_n \geq k_n(p) \text{ and } k_n(p + \nu_n) + \varepsilon_n \geq \kappa_n(p)] = 1 - o(1).$$
Then
$$P\{\|t_n\|_I \geq k_n(1-\alpha) + \delta_n\} \geq P\{\|t_n\|_I \geq \kappa_n(1-\alpha+\nu_n) + \varepsilon_n + \delta_n\} - o(1)$$
$$\geq P\{\|t_n^*\|_I \geq \kappa_n(1-\alpha+\nu_n) + 2\varepsilon_n + \delta_n\} - o(1) \geq P\{\|t_n^*\|_I \geq k_n(1-\alpha+2\nu_n) + 3\varepsilon_n + \delta_n\} - o(1)$$
$$\geq P\{\|t_n^*\|_I \geq k_n(1-\alpha+2\nu_n) + 2\delta_n\} - o(1) \geq E[P\{\|t_n^*\|_I \geq k_n(1-\alpha+2\nu_n) + 2\delta_n|\mathcal{D}_n\}] - o(1) = \alpha - 2\nu_n - o(1) = \alpha - o(1),$$
where we used the anti-concentration property in the last step.
The following lemma was used in the proof of Theorem 12.
Lemma 7 (Closeness in Probability Implies Closeness of Conditional Quantiles). Let $X_n$ and $Y_n$ be random variables and $\mathcal{D}_n$ be a random vector. Let $F_{X_n}(x|\mathcal{D}_n)$ and $F_{Y_n}(x|\mathcal{D}_n)$ denote the conditional distribution functions, and $F_{X_n}^{-1}(p|\mathcal{D}_n)$ and $F_{Y_n}^{-1}(p|\mathcal{D}_n)$ the corresponding conditional quantile functions. If $|X_n - Y_n| = o_P(\varepsilon)$, then, for some $\nu_n \searrow 0$, with probability converging to one,
$$F_{X_n}^{-1}(p|\mathcal{D}_n) \leq F_{Y_n}^{-1}(p + \nu_n|\mathcal{D}_n) + \varepsilon \quad\text{and}\quad F_{Y_n}^{-1}(p|\mathcal{D}_n) \leq F_{X_n}^{-1}(p + \nu_n|\mathcal{D}_n) + \varepsilon, \quad \forall p \in (\nu_n, 1 - \nu_n).$$

Proof. We have that for some $\nu_n \searrow 0$, $P\{|X_n - Y_n| > \varepsilon\} = o(\nu_n)$. This implies that $P[P\{|X_n - Y_n| > \varepsilon|\mathcal{D}_n\} \leq \nu_n] \to 1$, i.e. there is a set $\Omega_n$ such that $P(\Omega_n) \to 1$ and $P\{|X_n - Y_n| > \varepsilon|\mathcal{D}_n\} \leq \nu_n$ for all $\mathcal{D}_n \in \Omega_n$. So, for all $\mathcal{D}_n \in \Omega_n$,
$$F_{X_n}(x|\mathcal{D}_n) \geq F_{Y_n+\varepsilon}(x|\mathcal{D}_n) - \nu_n \quad\text{and}\quad F_{Y_n}(x|\mathcal{D}_n) \geq F_{X_n+\varepsilon}(x|\mathcal{D}_n) - \nu_n, \quad \forall x \in \mathbb{R},$$
which implies the inequality stated in the lemma, by definition of the conditional quantile function and equivariance of quantiles to location shifts.
Appendix F. A Lemma on Strong Approximation of an Empirical Process of
an Increasing Dimension by a Gaussian Process
Lemma 8 (Approximation of a Sequence of Empirical Processes of Increasing Dimension by a Sequence of Gaussian Processes). Consider the empirical process $\mathbb{U}_n$ in $[\ell^\infty(\mathcal{U})]^m$, $\mathcal{U} \subseteq (0,1)$, conditional on $Z_i \in \mathbb{R}^m$, $i = 1, \ldots, n$, defined by
$$\mathbb{U}_n(u) = \mathbb{G}_n(v_iZ_i\psi_i(u)), \quad \psi_i(u) = u - 1\{U_i \leq u\},$$
where $U_i$, $i = 1, \ldots, n$, is an i.i.d. sequence of standard uniform random variables, and $v_i$, $i = 1, \ldots, n$, is an i.i.d. sequence of real-valued random variables such that $E[v_i^2] = 1$, $E[|v_i|^3] \lesssim 1$, and $\max_{1\leq i\leq n}|v_i| \lesssim_P \log n$. Suppose that $Z_i$, $i = 1, \ldots, n$, are such that
$$\sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2] \lesssim 1, \quad \max_{1\leq i\leq n}\|Z_i\| \lesssim \zeta_m, \quad\text{and}\quad m^7\zeta_m^6\log^{22}n = o(n).$$
Then there exists a sequence of zero-mean Gaussian processes $\mathbb{G}_n$ with a.s. continuous paths, with the same covariance functions as $\mathbb{U}_n$ conditional on $Z_1, \ldots, Z_n$, namely
$$E[\mathbb{G}_n(u)\mathbb{G}_n(u')'] = E[\mathbb{U}_n(u)\mathbb{U}_n(u')'] = E_n[Z_iZ_i'](u\wedge u' - uu'), \quad\text{for all } u, u' \in \mathcal{U},$$
that approximates the empirical process $\mathbb{U}_n$, namely
$$\sup_{u\in\mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{G}_n(u)\| \lesssim_P o\Big(\frac{1}{\log n}\Big).$$
Proof. The proof is based on the use of maximal inequalities and Yurinskii's coupling. Throughout the proof all the probability statements are conditional on $Z_1, \ldots, Z_n$.
We define the sequence of projections $\pi_j : \mathcal{U} \to \mathcal{U}$, $j = 0, 1, 2, \ldots, \infty$, by $\pi_j(u) = (2k-1)/2^j$ if $u \in ((2k-2)/2^j, 2k/2^j)$, $k = 1, \ldots, 2^{j-1}$, and $\pi_j(u) = u$ if $u = 0$ or 1. In what follows, given a process $G$ in $[\ell^\infty(\mathcal{U})]^m$ and its projection $G\circ\pi_j$, whose paths are step functions with $2^j$ steps, we shall identify the process $G\circ\pi_j$ with a random vector $G\circ\pi_j$ in $\mathbb{R}^{2^jm}$ when convenient. Analogously, given a random vector $W$ in $\mathbb{R}^{2^jm}$, we identify it with a process $W$ in $[\ell^\infty(\mathcal{U})]^m$, whose paths are step functions with $2^j$ steps.
The following relations are proven below:
(1) (Finite-Dimensional Approximation)
$$r_1 = \sup_{u\in\mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{U}_n\circ\pi_j(u)\| \lesssim_P o\Big(\frac{1}{\log n}\Big);$$
(2) (Coupling with a Normal Vector) there exists $N_{nj} =_d N(0, \mathrm{var}[\mathbb{U}_n\circ\pi_j])$ such that
$$r_2 = \|N_{nj} - \mathbb{U}_n\circ\pi_j\|_2 \lesssim_P o\Big(\frac{1}{\log n}\Big);$$
(3) (Embedding a Normal Vector into a Gaussian Process) there exists a Gaussian process $\mathbb{G}_n$ with the properties stated in the lemma such that $N_{nj} = \mathbb{G}_n\circ\pi_j$ a.s.;
(4) (Infinite-Dimensional Approximation)
$$r_3 = \sup_{u\in\mathcal{U}}\|\mathbb{G}_n(u) - \mathbb{G}_n\circ\pi_j(u)\| \lesssim_P o\Big(\frac{1}{\log n}\Big).$$
The result then follows from the triangle inequality:
$$\sup_{u\in\mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{G}_n(u)\| \leq r_1 + r_2 + r_3.$$
Relation (1) follows from
$$r_1 = \sup_{u\in\mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{U}_n\circ\pi_j(u)\| \leq \sup_{|u-u'|\leq2^{-j}}\|\mathbb{U}_n(u) - \mathbb{U}_n(u')\| \lesssim_P \sqrt{2^{-j}m\log n} + \sqrt{\frac{m^2\zeta_m^2\log^4n}{n}} \lesssim_P o(1/\log n),$$
where the last inequality holds by Lemma 9, and the final rate follows by choosing $2^j = (m\log^3n)\ell_n$ for some $\ell_n \to \infty$ slowly enough.
Relation (2) follows from the use of Yurinskii's coupling (Pollard [31], Chapter 10, Theorem 10): Let $\xi_1, \ldots, \xi_n$ be independent $p$-vectors with $E[\xi_i] = 0$ for each $i$, and with $\kappa := \sum_iE\|\xi_i\|^3$ finite. Let $S = \xi_1 + \cdots + \xi_n$. For each $\delta > 0$ there exists a random vector $T$ with a $N(0, \mathrm{var}(S))$ distribution such that
$$P\{\|S - T\| > 3\delta\} \leq C_0B\Big(1 + \frac{|\log(1/B)|}{p}\Big), \quad\text{where } B := \kappa p\delta^{-3},$$
for some universal constant $C_0$.
In order to apply the coupling, we collapse $v_iZ_i\psi_i\circ\pi_j$ to a $p$-vector, and let
$$\xi_i = v_iZ_i\psi_i\circ\pi_j \in \mathbb{R}^p, \quad p = 2^jm,$$
so that $\mathbb{U}_n\circ\pi_j = \sum_{i=1}^n\xi_i/\sqrt{n}$. Then
$$E_nE[\|\xi_i\|^3] = E_nE\Bigg[\Big(\sum_{k=1}^{2^j}\sum_{w=1}^m\psi_i(u_{kj})^2v_i^2Z_{iw}^2\Big)^{3/2}\Bigg] \leq 2^{3j/2}E[|v_i|^3]E_n[\|Z_i\|^3] \lesssim 2^{3j/2}\zeta_m^3.$$
Therefore, by Yurinskii's coupling, since $\log n \lesssim 2^jm$, by the choice $2^j = m\log^3n$,
$$P\Bigg\{\Big\|\frac{\sum_{i=1}^n\xi_i}{\sqrt{n}} - N_{n,j}\Big\| \geq 3\delta\Bigg\} \lesssim \frac{n2^{3j/2}\zeta_m^3\cdot2^jm}{(\delta\sqrt{n})^3} = \frac{2^{5j/2}m\zeta_m^3}{\delta^3n^{1/2}} \to 0$$
by setting $\delta = \big(2^{5j}m^2\zeta_m^6\log n/n\big)^{1/6}$. This verifies relation (2) with
$$r_2 \lesssim_P \Big(\frac{2^{5j}m^2\zeta_m^6\log n}{n}\Big)^{1/6} = o(1/\log n),$$
provided $j$ is chosen as above.
Relation (3) follows from the a.s. embedding of a finite-dimensional random normal vector into a path of a continuous Gaussian process, which is possible by Lemma 11.
Relation (4) follows from
$$r_3 = \sup_{u\in\mathcal{U}}\|\mathbb{G}_n(u) - \mathbb{G}_n\circ\pi_j(u)\| \leq \sup_{|u-u'|\leq2^{-j}}\|\mathbb{G}_n(u) - \mathbb{G}_n(u')\| \lesssim_P \sqrt{2^{-j}m\log n} \lesssim_P o(1/\log n),$$
where the last inequality holds by Lemma 10 since, by assumption of this lemma,
$$\sup_{\|\alpha\|\leq1}EE_n[v_i^2(\alpha'Z_i)^2] = \sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2] \lesssim 1,$$
and the rate follows from setting $j$ as above.
Note that putting the bounds together we also get an explicit bound on the approximation error:
$$\sup_{u\in\mathcal{U}}\|\mathbb{U}_n(u) - \mathbb{G}_n(u)\| \lesssim_P \sqrt{2^{-j}m\log n} + \sqrt{\frac{m^2\zeta_m^2\log^4n}{n}} + \Big(\frac{2^{5j}m^2\zeta_m^6\log n}{n}\Big)^{1/6}.$$
Next we establish the auxiliary relations (1) and (4) appearing in the preceding proof.
Lemma 9 (Finite-Dimensional Approximation). Let $Z_1, \ldots, Z_n \in \mathbb{R}^m$ be such that $\max_{i\leq n}\|Z_i\| \lesssim \zeta_m$ and $\varphi = \sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2]$; let $v_i$ be i.i.d. random variables such that $E[v_i^2] = 1$ and $\max_{1\leq i\leq n}|v_i| \lesssim_P \log n$; and let $\psi_i(u) = u - 1\{U_i \leq u\}$, $i = 1, \ldots, n$, where $U_1, \ldots, U_n$ are i.i.d. Uniform(0,1) random variables. Then, for $\gamma > 0$ and $\mathbb{U}_n(u) = \mathbb{G}_n(v_iZ_i\psi_i(u))$,
$$\sup_{\|\alpha\|\leq1,\,|u-u'|\leq\gamma}\big|\alpha'\big(\mathbb{U}_n(u) - \mathbb{U}_n(u')\big)\big| \lesssim_P \sqrt{\gamma\varphi m\log n} + \sqrt{\frac{m^2\zeta_m^2\log^4n}{n}}.$$

Proof. For notational convenience let $A_n := \sqrt{\frac{m^2\zeta_m^2\log^4n}{n}}$. Using the second maximal inequality of Lemma 16 with $M(m,n) = \zeta_m\log n$,
$$\varepsilon(m,n,\gamma) = \sup_{\|\alpha\|\leq1,\,|u-u'|\leq\gamma}\big|\alpha'\big(\mathbb{U}_n(u) - \mathbb{U}_n(u')\big)\big| \lesssim_P \sqrt{m\log n}\sup_{\|\alpha\|\leq1,\,|u-u'|\leq\gamma}\sqrt{EE_n[v_i^2(\alpha'Z_i)^2(\psi_i(u) - \psi_i(u'))^2]} + A_n.$$
By the independence between $Z_i$, $v_i$ and $U_i$, and $E[v_i^2] = 1$,
$$\varepsilon(m,n,\gamma) \lesssim_P \sqrt{\varphi m\log n}\sup_{|u-u'|\leq\gamma}\sqrt{E[(\psi_i(u) - \psi_i(u'))^2]} + A_n.$$
Since $U_i \sim$ Uniform(0,1) we have $(\psi_i(u) - \psi_i(u'))^2 =_d (|u-u'| - 1\{U_i \leq |u-u'|\})^2$. Thus, since $|u-u'| \leq \gamma$,
$$\varepsilon(m,n,\gamma) \lesssim_P \sqrt{\varphi m\log n}\,\sqrt{\gamma(1-\gamma)} + A_n.$$
Lemma 10 (Infinite-Dimensional Approximation). Let $\mathbb{G}_n : \mathcal{U} \to \mathbb{R}^m$ be a zero-mean Gaussian process whose covariance structure conditional on $Z_1, \ldots, Z_n$ is given by
$$E[\mathbb{G}_n(u)\mathbb{G}_n(u')'] = E_n[Z_iZ_i'](u\wedge u' - uu')$$
for any $u, u' \in \mathcal{U} \subset (0,1)$, where $Z_i \in \mathbb{R}^m$, $i = 1, \ldots, n$. Then, for any $\gamma > 0$ we have
$$\sup_{|u-u'|\leq\gamma}\|\mathbb{G}_n(u) - \mathbb{G}_n(u')\| \lesssim_P \sqrt{\varphi\gamma m\log m},$$
where $\varphi = \sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2]$.

Proof. We will use the following maximal inequality for Gaussian processes (Proposition A.2.7 of [37]). Let $X$ be a separable zero-mean Gaussian process indexed by a set $T$. Suppose that for some $K > \sigma(X) = \sup_{t\in T}\sigma(X_t)$, $0 < \epsilon_0 \leq \sigma(X)$, we have
$$N(\varepsilon, T, \rho) \leq \Big(\frac{K}{\varepsilon}\Big)^V, \quad\text{for } 0 < \varepsilon < \epsilon_0,$$
where $N(\varepsilon, T, \rho)$ is the covering number of $T$ by $\varepsilon$-balls with respect to the standard deviation metric $\rho(t,t') = \sigma(X_t - X_{t'})$. Then there exists a universal constant $D$ such that for every $\lambda \geq \sigma^2(X)(1 + \sqrt{V})/\epsilon_0$,
$$P\Big(\sup_{t\in T}X_t > \lambda\Big) \leq \Big(\frac{DK\lambda}{\sqrt{V}\sigma^2(X)}\Big)^V\bar{\Phi}(\lambda/\sigma(X)),$$
where $\bar{\Phi} = 1 - \Phi$, and $\Phi$ is the cumulative distribution function of a standard Gaussian random variable.
We apply this result to the zero-mean Gaussian process $X_n : S^{m-1}\times\mathcal{U}\times\mathcal{U} \to \mathbb{R}$ defined as
$$X_{n,t} = \alpha'(\mathbb{G}_n(u) - \mathbb{G}_n(u')), \quad t = (\alpha, u, u'), \quad \alpha \in S^{m-1}, \quad |u-u'| \leq \gamma.$$
It follows that $\sup_{t\in T}X_{n,t} = \sup_{|u-u'|\leq\gamma}\|\mathbb{G}_n(u) - \mathbb{G}_n(u')\|$.
For the process $X_n$ we have
$$\sigma(X_n) \leq \sqrt{\gamma\sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2]}, \quad K \lesssim \sqrt{\sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2]}, \quad\text{and}\quad V \lesssim m.$$
Therefore the result follows by setting $\lambda \simeq \sqrt{\gamma m\log m\sup_{\|\alpha\|\leq1}E_n[(\alpha'Z_i)^2]}$.
In what follows, as before, given a process $G$ in $[\ell^\infty(\mathcal{U})]^m$ and its projection $G\circ\pi_j$, whose paths are step functions with $2^j$ steps, we shall identify the process $G\circ\pi_j$ with a random vector $G\circ\pi_j$ in $\mathbb{R}^{2^jm}$ when convenient. Analogously, given a random vector $W$ in $\mathbb{R}^{2^jm}$, we identify it with a process $W$ in $[\ell^\infty(\mathcal{U})]^m$, whose paths are step functions with $2^j$ steps.
Lemma 11 (Construction of a Gaussian Process with a Prescribed Projection). Let $N_j$ be a given random vector such that
$$N_j =_d \tilde{G}\circ\pi_j =: N(0, \Sigma_j),$$
where $\Sigma_j := \mathrm{Var}[N_j]$ and $\tilde{G}$ is a zero-mean Gaussian process in $[\ell^\infty(\mathcal{U})]^m$ whose paths are a.s. uniformly continuous with respect to the Euclidean metric $|\cdot|$ on $\mathcal{U}$. There exists a zero-mean Gaussian process $G$ in $[\ell^\infty(\mathcal{U})]^m$, whose paths are a.s. uniformly continuous with respect to the Euclidean metric $|\cdot|$ on $\mathcal{U}$, such that
$$N_j = G\circ\pi_j \quad\text{and}\quad G =_d \tilde{G} \text{ in } [\ell^\infty(\mathcal{U})]^m.$$

Proof. Consider a vector $\tilde{G}\circ\pi_\ell$ for $\ell > j$. Then $\tilde{N}_j = \tilde{G}\circ\pi_j$ is a subvector of $\tilde{G}\circ\pi_\ell = \tilde{N}_\ell$. Thus, denote the remaining components of $\tilde{N}_\ell$ as $\tilde{N}_{\ell\setminus j}$. We can construct an identically distributed copy $N_\ell$ of $\tilde{N}_\ell$ such that $N_j$ is a subvector of $N_\ell$. Indeed, we set $N_\ell$ as a vector with components $N_j$ and $N_{\ell\setminus j}$, arranged in appropriate order, namely such that $N_\ell\circ\pi_j = N_j$, where
$$N_{\ell\setminus j} = \Sigma_{\ell\setminus j,j}\Sigma_{j,j}^{-1}N_j + \eta_j,$$
where $\eta_j \perp N_j$ and $\eta_j =_d N(0, \Sigma_{\ell\setminus j,\ell\setminus j} - \Sigma_{\ell\setminus j,j}\Sigma_{j,j}^{-1}\Sigma_{j,\ell\setminus j})$, with
$$\begin{pmatrix}\Sigma_{j,j} & \Sigma_{j,\ell\setminus j}\\ \Sigma_{\ell\setminus j,j} & \Sigma_{\ell\setminus j,\ell\setminus j}\end{pmatrix} := \mathrm{var}\begin{pmatrix}\tilde{N}_j\\ \tilde{N}_{\ell\setminus j}\end{pmatrix}.$$
We then identify the vector $N_\ell$ with a process $N_\ell$ in $\ell^\infty(\mathcal{U})$, and define the pointwise limit $G$ of this process as
$$G(u) := \lim_{\ell\to\infty}N_\ell(u) \quad\text{for each } u \in \mathcal{U}_0,$$
where $\mathcal{U}_0 = \cup_{j=1}^\infty\cup_{k=1}^{2^j}\{u_{kj}\}$ is a countable dense subset of $\mathcal{U}$. The pointwise limit exists since, by construction of $\{\pi_\ell\}$ and $\mathcal{U}_0$, for each $u \in \mathcal{U}_0$ we have that $\pi_\ell(u) = u$ for all $\ell \geq \ell(u)$, where $\ell(u)$ is a sufficiently large constant.
By construction, $G\circ\pi_\ell =_d \tilde{G}\circ\pi_\ell$. Therefore, for each $\epsilon > 0$ there exists $\eta(\epsilon) > 0$ small enough such that
$$P\Big(\sup_{u,u'\in\mathcal{U}_0:|u-u'|\leq\eta(\epsilon)}\|G(u) - G(u')\| \geq \epsilon\Big) \leq P\Big(\sup_{|u-u'|\leq\eta(\epsilon)}\sup_k\|G\circ\pi_k(u) - G\circ\pi_k(u')\| \geq \epsilon\Big)$$
$$\leq P\Big(\sup_{|u-u'|\leq\eta(\epsilon)}\sup_k\|\tilde{G}\circ\pi_k(u) - \tilde{G}\circ\pi_k(u')\| \geq \epsilon\Big) \leq P\Big(\sup_{|u-u'|\leq\eta(\epsilon)}\|\tilde{G}(u) - \tilde{G}(u')\| \geq \epsilon\Big) \leq \epsilon,$$
where the last display holds because $\sup_{|u-u'|\leq\eta}\|\tilde{G}(u) - \tilde{G}(u')\| \to 0$ as $\eta \to 0$ almost surely, and thus also in probability, by the a.s. continuity of the sample paths of $\tilde{G}$. Setting $\epsilon = 2^{-m}$ for each $m \in \mathbb{N}$ in the display above, and summing the resulting inequalities over $m$, we get a finite number on the right side. Conclude by the Borel-Cantelli lemma that, for almost all $\omega \in \Omega$, $\|G(u) - G(u')\| \leq 2^{-m}$ for all $|u-u'| \leq \eta(2^{-m})$, for all sufficiently large $m$. This implies that almost all sample paths are uniformly continuous on $\mathcal{U}_0$, and we can extend the process by continuity to a process $\{G(u), u \in \mathcal{U}\}$ with almost all paths uniformly continuous.
In order to show that the law of $G$ is equal to the law of $\tilde{G}$ in $\ell^\infty(\mathcal{U})$, it suffices to demonstrate that
$$E[g(G)] = E[g(\tilde{G})] \text{ for all } g : [\ell^\infty(\mathcal{U})]^m \to \mathbb{R} \text{ such that } |g(z) - g(\tilde{z})| \leq \sup_{u\in\mathcal{U}}\|z(u) - \tilde{z}(u)\|\wedge1.$$
We have that
$$|E[g(G)] - E[g(\tilde{G})]| \leq |E[g(G\circ\pi_\ell)] - E[g(\tilde{G}\circ\pi_\ell)]| + E\Big[\sup_{u\in\mathcal{U}}\|G\circ\pi_\ell(u) - G(u)\|\wedge1\Big] + E\Big[\sup_{u\in\mathcal{U}}\|\tilde{G}\circ\pi_\ell(u) - \tilde{G}(u)\|\wedge1\Big] \to 0 \text{ as } \ell \to \infty.$$
The first term converges to zero by construction, and the second and third terms converge to zero by the dominated convergence theorem and by
$$G\circ\pi_\ell \to G \text{ and } \tilde{G}\circ\pi_\ell \to \tilde{G} \text{ in } [\ell^\infty(\mathcal{U})]^m \text{ as } \ell \to \infty \text{ a.s.},$$
holding due to the a.s. uniform continuity of the sample paths of $G$ and $\tilde{G}$.
Appendix G. Technical Lemmas on Bounding Empirical Errors
In Appendix G.2 we establish technical results needed for our main results (uniform rates of convergence, uniform linear approximations, and the uniform central limit theorem) under high-level conditions. In Appendix G.3 we verify that these conditions are implied by the primitive Condition S stated in Section 2.
G.1. Some Preliminary Lemmas.
Lemma 12. Under conditions S.2 and S.5, for any $u \in \mathcal{U}$ and $\alpha \in S^{m-1}$,
$$|\alpha'(J_m(u) - \tilde{J}_m(u))\alpha| \lesssim m^{-\kappa} = o(1),$$
where $\tilde{J}_m(u) = E[f_{Y|X}(Z'\beta(u)|X)ZZ']$.

Proof of Lemma 12. For any $\alpha \in S^{m-1}$,
$$|\alpha'(J_m(u) - \tilde{J}_m(u))\alpha| \leq E\big[|f_{Y|X}(Z'\beta(u) + R(X,u)|X) - f_{Y|X}(Z'\beta(u)|X)|(Z'\alpha)^2\big] \lesssim \alpha'\Sigma_m\alpha\cdot\bar{f}'m^{-\kappa}.$$
The result follows since $\Sigma_m$ has bounded eigenvalues, $\kappa > 0$, and $m \to \infty$ as $n \to \infty$.
Lemma 13 (Auxiliary Matrix). Under conditions S.1-S.5, for $u', u \in \mathcal{U}$ we have, uniformly over $z \in \mathcal{Z}$,
$$\frac{\big|z'(J_m^{-1}(u') - J_m^{-1}(u))\mathbb{U}_n(u')\big|}{\sqrt{u(1-u)z'J_m^{-1}(u)\Sigma_mJ_m^{-1}(u)z}} \lesssim_P |u-u'|\sqrt{m\log n}.$$

Proof. Recall that $J_m(u) = E[f_{Y|X}(Q_{Y|X}(u|X)|X)ZZ']$ for any $u \in \mathcal{U}$. Moreover, under S.1-S.5 we have $\|J_m^{-1}(u')\mathbb{U}_n(u')\| \lesssim_P \sqrt{m\log n}$ uniformly in $u' \in \mathcal{U}$ by Lemma 23 and Corollary 2 of Appendix G.
Using the matrix identity $A^{-1} - B^{-1} = B^{-1}(B - A)A^{-1}$,
$$J_m^{-1}(u') - J_m^{-1}(u) = J_m^{-1}(u)(J_m(u) - J_m(u'))J_m^{-1}(u').$$
Moreover, since $|f_{Y|X}(Q_{Y|X}(u|x)|x) - f_{Y|X}(Q_{Y|X}(u'|x)|x)| \leq (\bar{f}'/\underline{f})|u-u'|$ by Lemma 14,
$$J_m(u) - J_m(u') \preceq (\bar{f}'/\underline{f})|u-u'|\Sigma_m \quad\text{and}\quad J_m(u') - J_m(u) \preceq (\bar{f}'/\underline{f})|u-u'|\Sigma_m,$$
where the inequalities are in the positive semi-definite sense. Using these relations and the definition of $s_n(u,x)$, we obtain
$$\frac{\big|z'(J_m^{-1}(u') - J_m^{-1}(u))\mathbb{U}_n(u')\big|}{\sqrt{u(1-u)z'J_m^{-1}(u)\Sigma_mJ_m^{-1}(u)z}} = \Big|\frac{z'J_m^{-1}(u)}{s_n(u,x)}(J_m(u) - J_m(u'))J_m^{-1}(u')\mathbb{U}_n(u')\Big| \lesssim_P (\bar{f}'/\underline{f})|u-u'|\,\mathrm{maxeig}(\Sigma_m)\sqrt{m\log n}.$$
The result follows since $\bar{f}'$ is bounded above, $\underline{f}$ is bounded below, and the eigenvalues of $\Sigma_m$ are bounded above and below by constants uniformly in $n$ by S.2 and S.3.
Lemma 14 (Primitive Condition S.2). Under condition S.2, there are positive constants $c$, $C$, $C_1'$, $C_2'$, $C''$ such that the conditional quantile functions satisfy the following properties uniformly over $u, u' \in \mathcal{U}$, $x \in \mathcal{X}$:
(i) $c|u-u'| \leq |Q_{Y|X}(u|x) - Q_{Y|X}(u'|x)| \leq C|u-u'|$;
(ii) $|f_{Y|X}(Q_{Y|X}(u|x)|x) - f_{Y|X}(Q_{Y|X}(u'|x)|x)| \leq C_1'|u-u'|$;
(iii) $f_{Y|X}(y|x) \leq 1/c$ and $|f'_{Y|X}(y|x)| \leq C_2'$;
(iv) $\big|\frac{d^2}{du^2}Q_{Y|X}(u|x)\big| \leq C''$.

Proof. Under S.2, $f_{Y|X}(\cdot|x)$ is a differentiable function, so that $Q_{Y|X}(\cdot|x)$ is twice differentiable.
To show the first statement, note that
$$\frac{d}{du}Q_{Y|X}(u|x) = \frac{1}{f_{Y|X}(Q_{Y|X}(u|x)|x)}$$
by an application of the inverse function theorem. Recall that $\underline{f} = \inf_{x\in\mathcal{X},u\in\mathcal{U}}f_{Y|X}(Q_{Y|X}(u|x)|x) > 0$ and $\sup_{x\in\mathcal{X},y\in\mathbb{R}}f_{Y|X}(y|x) \leq \bar{f}$. This proves the first statement with $c = 1/\bar{f}$ and $C = 1/\underline{f}$.
To show the second statement, let $\bar{f}' = \sup_{x\in\mathcal{X},y\in\mathbb{R}}\big|\frac{d}{dy}f_{Y|X}(y|x)\big|$ and
$$C_1' = \sup_{x\in\mathcal{X},u\in\mathcal{U}}\Big|\frac{d}{du}f_{Y|X}(Q_{Y|X}(u|x)|x)\Big| = \sup_{x\in\mathcal{X},u\in\mathcal{U}}\Big|\frac{d}{dy}f_{Y|X}(y|x)\Big|_{y=Q_{Y|X}(u|x)}\cdot\frac{d}{du}Q_{Y|X}(u|x)\Big| \leq \frac{\bar{f}'}{\underline{f}}.$$
By a Taylor expansion we have
$$|f_{Y|X}(Q_{Y|X}(u|x)|x) - f_{Y|X}(Q_{Y|X}(u'|x)|x)| \leq C_1'|u-u'|.$$
The second part of the third statement follows with $C_2' = \bar{f}'$. The first part was already shown in the proof of part (i).
For the fourth statement, using the implicit function theorem for second-order derivatives,
$$\frac{d^2}{du^2}Q_{Y|X}(u|x) = -\Big(\frac{d^2}{dy^2}F_{Y|X}(y|x)\Big|_{y=Q_{Y|X}(u|x)}\Big)\Big(\frac{d}{du}Q_{Y|X}(u|x)\Big)^3 = -\frac{f'_{Y|X}(Q_{Y|X}(u|x)|x)}{f_{Y|X}^3(Q_{Y|X}(u|x)|x)}.$$
Thus the statement holds with $C'' = \bar{f}'/\underline{f}^3$.
Under S.2, we can take $c$, $C$, $C_1'$, $C_2'$, and $C''$ to be fixed positive constants uniformly over $n$.
G.2. Maximal Inequalities. In this section we derive maximal inequalities that are needed for verifying the preliminary high-level conditions. These inequalities rely mainly on uniform entropy bounds and VC classes of functions. In what follows, $\mathcal{F}$ denotes a class of functions whose envelope is $F$. Recall that for a probability measure $Q$ with $\|F\|_{Q,p} > 0$, $N(\varepsilon\|F\|_{Q,p}, \mathcal{F}, L_p(Q))$ denotes the covering number under the specified metric (i.e., the minimum number of $L_p(Q)$-balls of radius $\varepsilon\|F\|_{Q,p}$ needed to cover $\mathcal{F}$). We refer to Dudley [17] for the details of the definitions.
Suppose that we have the following upper bound on the $L_2(P)$ covering numbers for $\mathcal{F}$:
$$N(\epsilon\|F\|_{P,2}, \mathcal{F}, L_2(P)) \leq n(\epsilon, \mathcal{F}, P) \quad\text{for each } \epsilon > 0,$$
where $n(\epsilon, \mathcal{F}, P)$ is increasing in $1/\epsilon$, and $\epsilon\sqrt{\log n(\epsilon, \mathcal{F}, P)} \to 0$ as $1/\epsilon \to \infty$ and is decreasing in $1/\epsilon$. Let $\rho(\mathcal{F}, P) := \sup_{f\in\mathcal{F}}\|f\|_{P,2}/\|F\|_{P,2}$. Let us call a threshold function $x : \mathbb{R}^n \mapsto \mathbb{R}$ $k$-sub-exchangeable if, for any $v, w \in \mathbb{R}^n$ and any vectors $\tilde{v}, \tilde{w}$ created by the pairwise exchange of the components in $v$ with components in $w$, we have that $x(\tilde{v})\vee x(\tilde{w}) \geq [x(v)\vee x(w)]/k$. Several functions satisfy this property; in particular, $x(v) = \|v\|$ with $k = \sqrt{2}$ and constant functions with $k = 1$.
Lemma 15 (Exponential inequality for separable empirical process). Consider a separable empirical process $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - E[f(Z_i)]\}$ and the empirical measure $\mathbb{P}_n$ for $Z_1, \ldots, Z_n$, an underlying independent data sequence. Let $K > 1$ and $\tau \in (0,1)$ be constants, and let $e_n(\mathcal{F}, \mathbb{P}_n) = e_n(\mathcal{F}, Z_1, \ldots, Z_n)$ be a $k$-sub-exchangeable random variable such that
$$\|F\|_{\mathbb{P}_n,2}\int_0^{\rho(\mathcal{F},\mathbb{P}_n)/4}\sqrt{\log n(\epsilon, \mathcal{F}, \mathbb{P}_n)}\,d\epsilon \leq e_n(\mathcal{F}, \mathbb{P}_n) \quad\text{and}\quad \sup_{f\in\mathcal{F}}\mathrm{var}_Pf \leq \frac{\tau}{2}(4kcKe_n(\mathcal{F}, \mathbb{P}_n))^2$$
for some universal constant $c > 1$. Then
$$P\Big\{\sup_{f\in\mathcal{F}}|\mathbb{G}_n(f)| \geq 4kcKe_n(\mathcal{F}, \mathbb{P}_n)\Big\} \leq \frac{4}{\tau}E_P\Bigg[\Bigg(\int_0^{\rho(\mathcal{F},\mathbb{P}_n)/2}\epsilon^{-1}n(\epsilon, \mathcal{F}, \mathbb{P}_n)^{-\{K^2-1\}}d\epsilon\Bigg)\wedge1\Bigg] + \tau.$$

Proof. See [6], Lemma 18, and note that the proof does not use that the $Z_i$'s are i.i.d., only independent, which was the requirement of Lemma 17 of [6].
The next lemma establishes a new maximal inequality which will be used in the following
sections.
Lemma 16. Suppose that for all large $m$ and all $0 < \varepsilon \leq \varepsilon_0$,
$$n(\epsilon, \mathcal{F}_m, P) \leq (\omega/\varepsilon)^{J(m)^2} \quad\text{and}\quad n(\epsilon, \mathcal{F}_m^2, P) \leq (\omega/\varepsilon)^{J(m)^2}, \tag{G.35}$$
for some $\omega$ such that $\log\omega \lesssim \log n$, and let $F_m = \sup_{f\in\mathcal{F}_m}|f|$ denote the envelope function associated with $\mathcal{F}_m$.
1. (A Maximal Inequality Based on Entropy and Moments) Then, as $n$ grows, we have
$$\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| \lesssim_P J(m)\Bigg(\sup_{f\in\mathcal{F}_m}E[f^2] + n^{-1/2}J(m)\log^{1/2}n\Big(\sup_{f\in\mathcal{F}_m}E_n[f^4]\vee E[f^4]\Big)^{1/2}\Bigg)^{1/2}\log^{1/2}n.$$
2. (A Maximal Inequality Based on Entropy, Moments, and Extremum) Suppose that $F_m \leq M(m,n)$ with probability going to 1 as $n$ grows. Then, as $n$ grows, we have
$$\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| \lesssim_P J(m)\Bigg(\sup_{f\in\mathcal{F}_m}E[f^2] + n^{-1}J(m)^2M^2(m,n)\log n\Bigg)^{1/2}\log^{1/2}n.$$
Proof. We divide the proof into steps. Step 1 is the main argument, Step 2 is an application of Lemma 15, and Step 3 contains some auxiliary calculations.
Step 1 (Main Argument). Proof of Part 1. By Step 2 below, which invokes Lemma 15,
$$\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| \lesssim_P J(m)\sqrt{\log n}\sup_{f\in\mathcal{F}_m}\big(E_n[f^2]^{1/2}\vee E[f^2]^{1/2}\big). \tag{G.36}$$
We can assume that $\sup_{f\in\mathcal{F}_m}E_n[f^2]^{1/2} \geq \sup_{f\in\mathcal{F}_m}E[f^2]^{1/2}$ throughout the proof; otherwise we are done with both bounds.
Again by Step 2 and (G.35) applied to $\mathcal{F}_m^2$, we also have
$$\sup_{f\in\mathcal{F}_m}|E_nf^2 - Ef^2| = n^{-1/2}\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f^2)| \lesssim_P n^{-1/2}J(m)\sqrt{\log n}\sup_{f\in\mathcal{F}_m}\big(E_n[f^4]\vee E[f^4]\big)^{1/2}. \tag{G.37}$$
Thus we have
$$\sup_{f\in\mathcal{F}_m}E_n[f^2] \lesssim_P \sup_{f\in\mathcal{F}_m}E[f^2] + n^{-1/2}J(m)\log^{1/2}n\sup_{f\in\mathcal{F}_m}\big(E_n[f^4]\vee E[f^4]\big)^{1/2}. \tag{G.38}$$
Therefore, inserting the bound (G.38) into equation (G.36) yields the result.
Proof of Part 2. Once more we can assume that $\sup_{f\in\mathcal{F}_m}E_n[f^2] \geq \sup_{f\in\mathcal{F}_m}E[f^2]$; otherwise we are done. By (G.37) we have
$$\sup_{f\in\mathcal{F}_m}|E_nf^2 - Ef^2| \lesssim_P n^{-1/2}J(m)\log^{1/2}n\sup_{f\in\mathcal{F}_m}\big(E_nf^4\vee E[f^4]\big)^{1/2} \lesssim_P n^{-1/2}J(m)\log^{1/2}n\,M(m,n)\sup_{f\in\mathcal{F}_m}\big(E_nf^2\big)^{1/2},$$
where we used that $f^4 \leq f^2M^2(m,n)$ with probability going to 1. Since for positive numbers $a$, $c$, and $x$, $x \leq a + c|x|^{1/2}$ implies $x \leq 4a + 4c^2$, we conclude
$$\sup_{f\in\mathcal{F}_m}E_nf^2 \lesssim_P \sup_{f\in\mathcal{F}_m}Ef^2 + n^{-1}J(m)^2M^2(m,n)\log n.$$
Inserting this bound into equation (G.36) gives the result.
Step 2 (Applying Lemma 15). We apply Lemma 15 to $\mathcal{F}_m$ with $\tau_m = 1/(4J(m)^2[K^2-1])$ for some large constant $K$ to be set later, and
$$e_n(\mathcal{F}_m, \mathbb{P}_n) = J(m)\sqrt{\log n}\Big(\sup_{f\in\mathcal{F}_m}E_n[f^2]^{1/2}\vee E[f^2]^{1/2}\Big),$$
assuming that $n$ is sufficiently large (i.e., $n \geq \omega$). We observe that by (G.35), the bound $\epsilon \mapsto n(\epsilon, \mathcal{F}_m, \mathbb{P}_n)$ satisfies the monotonicity hypotheses of Lemma 15. Next note that $e_n(\mathcal{F}_m, \mathbb{P}_n)$ is $\sqrt{2}$-sub-exchangeable, because $\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}$ is $\sqrt{2}$-sub-exchangeable, and $\rho(\mathcal{F}_m, \mathbb{P}_n) := \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}/\|F_m\|_{\mathbb{P}_n,2} \geq 1/\sqrt{n}$ by Step 3 below. Thus,
$$\|F_m\|_{\mathbb{P}_n,2}\int_0^{\rho(\mathcal{F}_m,\mathbb{P}_n)/4}\sqrt{\log n(\epsilon, \mathcal{F}_m, P)}\,d\epsilon \leq \|F_m\|_{\mathbb{P}_n,2}\int_0^{\rho(\mathcal{F}_m,\mathbb{P}_n)/4}\sqrt{J(m)^2\log(\omega/\epsilon)}\,d\epsilon \leq J(m)\sqrt{\log(n\vee\omega)}\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}/2 \leq e_n(\mathcal{F}_m, \mathbb{P}_n),$$
which follows from $\int_0^\rho\sqrt{\log(\omega/\epsilon)}\,d\epsilon \leq \big(\int_0^\rho1\,d\epsilon\big)^{1/2}\big(\int_0^\rho\log(\omega/\epsilon)\,d\epsilon\big)^{1/2} \leq \rho\sqrt{2\log(n\vee\omega)}$, for $1/\sqrt{n} \leq \rho \leq 1$.
Let $K > 1$ be sufficiently large (to be set below). Recall that $4\sqrt{2}c > 4$, where $c > 1$ is universal. Note that for any $f \in \mathcal{F}_m$, by the Chebyshev inequality,
$$P\big(|\mathbb{G}_n(f)| > 4\sqrt{2}cKe_n(\mathcal{F}_m, \mathbb{P}_n)\big) \leq \frac{\sup_{f\in\mathcal{F}_m}\|f\|_{P,2}^2}{(4\sqrt{2}cKe_n(\mathcal{F}_m, \mathbb{P}_n))^2} \leq \frac{1}{(4\sqrt{2}cK)^2J(m)^2\log n} \leq \tau_m/2.$$
By Lemma 15 with our choice of $\tau_m$, $\omega > 1$, and $\rho(\mathcal{F}_m, \mathbb{P}_n) \leq 1$,
$$P\Big\{\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| > 4\sqrt{2}cKe_n(\mathcal{F}_m, \mathbb{P}_n)\Big\} \leq \frac{4}{\tau_m}\int_0^{1/2}(\omega/\epsilon)^{1-J(m)^2[K^2-1]}d\epsilon + \tau_m \leq \frac{4}{\tau_m}\frac{(1/[2\omega])^{J(m)^2[K^2-1]}}{J(m)^2[K^2-1]} + \tau_m,$$
which can be made arbitrarily small by choosing $K$ sufficiently large (recalling that $\tau_m \to 0$ as $K$ grows).
Step 3 (Auxiliary calculations). To establish that $\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}$ is $\sqrt{2}$-sub-exchangeable, define $\tilde{Z}$ and $\tilde{Y}$ by exchanging any components in $Z$ with corresponding components in $Y$. Then
$$\sqrt{2}\Big(\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde{Z}),2}\vee\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde{Y}),2}\Big) \geq \Big(\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde{Z}),2}^2 + \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde{Y}),2}^2\Big)^{1/2} \geq \Big(\sup_{f\in\mathcal{F}_m}E_n[f(\tilde{Z}_i)^2] + E_n[f(\tilde{Y}_i)^2]\Big)^{1/2}$$
$$= \Big(\sup_{f\in\mathcal{F}_m}E_n[f(Z_i)^2] + E_n[f(Y_i)^2]\Big)^{1/2} \geq \Big(\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Z),2}^2\vee\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Y),2}^2\Big)^{1/2} = \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Z),2}\vee\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Y),2}.$$
Next we show that $\rho(\mathcal{F}_m, \mathbb{P}_n) := \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}/\|F_m\|_{\mathbb{P}_n,2} \geq 1/\sqrt{n}$. The latter bound follows from $E_nF_m^2 = E_n[\sup_{f\in\mathcal{F}_m}|f(Z_i)|^2] \leq \sup_{i\leq n}\sup_{f\in\mathcal{F}_m}|f(Z_i)|^2$ and from the inequality $\sup_{f\in\mathcal{F}_m}E_n[|f(Z_i)|^2] \geq \sup_{f\in\mathcal{F}_m}\sup_{i\leq n}|f(Z_i)|^2/n$.
The last technical lemma in this section bounds the uniform entropy for VC classes of functions (we refer to Dudley [17] for formal definitions).

Lemma 17 (Uniform Entropy of VC classes). Suppose $\mathcal{F}$ has VC index $V$. Then, as $\varepsilon > 0$ goes to zero, we have, for $J = O(V)$,
$$\sup_QN(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \lesssim (1/\varepsilon)^J,$$
where $Q$ ranges over all discrete probability measures.

Proof. Being a VC class of index $V$, by Theorem 2.6.7 in [37] we have that the bound $\sup_Q\log N(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \lesssim V\log(1/\varepsilon)$ holds for $\varepsilon$ sufficiently small (also making the expression bigger than 1).

Comment G.1. Although the product of two VC classes of functions may not be a VC class, if $\mathcal{F}$ has VC index $V$, the square of $\mathcal{F}$ is still a VC class whose VC index is at most $2V$.
G.3. Bounds on Various Empirical Errors. In this section we provide probabilistic bounds for the error terms under the primitive Condition S. Our results rely on empirical process techniques; in particular, on the maximal inequalities derived in Section G.2.
We start with a sequence of technical lemmas which are used in the proofs of the lemmas that bound the error terms $\epsilon_0$-$\epsilon_6$.
Lemma 18. Let $r = o(1)$. The class of functions
$$\mathcal{F}_{m,n} = \{\alpha'(\psi_i(\beta, u) - \psi_i(\beta(u), u)) : u \in \mathcal{U}, \|\alpha\| \leq 1, \|\beta - \beta(u)\| \leq r\}$$
has VC index of $O(m)$.

Proof. Consider the classes $\mathcal{W} := \{Z_i'\alpha : \alpha \in \mathbb{R}^m\}$ and $\mathcal{V} := \{1\{Y_i \leq Z_i'\beta\} : \beta \in \mathbb{R}^m\}$ (for convenience let $A_i = (Z_i, Y_i)$). Their VC indices are bounded by $m + 2$. Next consider $f \in \mathcal{F}_{m,n}$, which can be written in the form $f(A_i) := g(A_i)(1\{h(A_i) \leq 0\} - 1\{p(A_i) \leq 0\})$, where $g \in \mathcal{W}$, and $1\{h \leq 0\}, 1\{p \leq 0\} \in \mathcal{V}$. Then
$$\{(A_i, t) : f(A_i) \leq t\} = \{(A_i, t) : h(A_i) > 0, p(A_i) > 0, t \geq 0\} \cup \{(A_i, t) : h(A_i) \leq 0, p(A_i) \leq 0, t \geq 0\}$$
$$\cup\,\{(A_i, t) : h(A_i) \leq 0, p(A_i) > 0, g(A_i) \leq t\} \cup \{(A_i, t) : h(A_i) > 0, p(A_i) \leq 0, -g(A_i) \leq t\}.$$
Since each one of these sets can be written as three intersections of basic sets, it follows that $\mathcal{F}_{m,n}$ has VC index at most $O(m)$.
Lemma 19. The class of functions
$$\mathcal{H}_{m,n} = \{1\{|Y_i - Z_i'\beta| \leq h\}(\alpha'Z_i)^2 : \|\beta - \beta(u)\| \leq r, h \in (0, H], \alpha \in S^{m-1}\}$$
has VC index of $O(m)$.

Proof. The proof is similar to that of Lemma 18.
Lemma 20. The family of functions $\mathcal{G}_{m,n} = \{\alpha'\psi_i(\beta(u), u) : u \in \mathcal{U}, \alpha \in S^{m-1}\}$ has VC index of $O(m)$.

Proof. The proof is similar to the proof of Lemma 18.
Lemma 21. The family of functions
$$\mathcal{A}_{n,m} = \{\alpha'Z(1\{Y \leq Z'\beta(u) + R(X, u)\} - 1\{Y \leq Z'\beta(u)\}) : \alpha \in S^{m-1}, u \in \mathcal{U}\}$$
has VC index of $O(m)$.

Proof. The key observation is that the function $Z'\beta(u) + R(X, u)$ is monotone in $u$, so that $\{1\{Y \leq Z'\beta(u) + R(X, u)\} : u \in \mathcal{U}\}$ has VC index of 1, and that $\{1\{Y \leq Z'\beta(u)\} : u \in \mathcal{U}\} \subset \{1\{Y \leq Z'\beta\} : \beta \in \mathbb{R}^m\}$. The proof then follows similarly to Lemma 18.
Consider the maximum of the largest eigenvalues of the empirical and population Gram matrices,
\[
\phi_n = \max_{\alpha \in S^{m-1}} \mathbb E_n[(\alpha' Z_i)^2] \vee \mathrm E[(\alpha' Z_i)^2]. \tag{G.39}
\]
The factor φn will be used to bound the quantities ǫ0 and ǫ1 in the analysis of the rate of convergence. Next we state a result due to Guédon and Rudelson [18], specialized to our framework.
Theorem 13 (Guédon and Rudelson [18]). Let Z_i ∈ R^m, i = 1, …, n, be random vectors such that
\[
\delta := \left( \frac{\log n}{n} \cdot \frac{\mathrm E \max_{i\le n} \|Z_i\|^2}{\max_{\alpha\in S^{m-1}} \mathrm E (Z_i'\alpha)^2} \right)^{1/2} < 1.
\]
Then we have
\[
\mathrm E\left[ \max_{\alpha\in S^{m-1}} \left| \frac{1}{n}\sum_{i=1}^n (Z_i'\alpha)^2 - \mathrm E (Z_i'\alpha)^2 \right| \right] \le 2\delta \cdot \max_{\alpha\in S^{m-1}} \mathrm E (Z_i'\alpha)^2.
\]
Corollary 2. Under Condition S and ζm² log n = o(n), for λmax = max_{α∈S^{m−1}} E(Z_i′α)², we have that for n large enough φn as defined in (G.39) satisfies
\[
\mathrm E[\phi_n] \le \left( 1 + 2\sqrt{\frac{\zeta_m^2 \log n}{n\,\lambda_{\max}}} \right) \lambda_{\max} \quad \text{and} \quad P(\phi_n > 2\lambda_{\max}) \le 2\sqrt{\frac{\zeta_m^2 \log n}{n\,\lambda_{\max}}}.
\]
Proof. Let δ be defined as in Theorem 13. Next note that E max_{i=1,…,n} ‖Z_i‖² ≲ ζm² under S.4, and λmax ≲ 1 and λmax ≳ 1 under S.3. Therefore, δ² ≲ (ζm² log n)/n in Theorem 13. The growth condition ζm² log n = o(n) yields δ < 1 as n grows.
The first result follows by applying Theorem 13 and the triangle inequality.
To show the second relation, note that the event {φn > 2λmax} cannot occur if φn = max_{α∈S^{m−1}} E[(Z_i′α)²] = λmax. Thus
\[
\begin{aligned}
P(\phi_n > 2\lambda_{\max}) &= P\Big( \max_{\alpha\in S^{m-1}} \mathbb E_n[(Z_i'\alpha)^2] > 2\lambda_{\max} \Big) \\
&\le P\Big( \max_{\alpha\in S^{m-1}} \big| \mathbb E_n[(Z_i'\alpha)^2] - \mathrm E(Z_i'\alpha)^2 \big| > \lambda_{\max} \Big) \\
&\le \mathrm E\Big[ \max_{\alpha\in S^{m-1}} \big| \mathbb E_n[(Z_i'\alpha)^2] - \mathrm E(Z_i'\alpha)^2 \big| \Big] \Big/ \lambda_{\max} \le 2\delta,
\end{aligned}
\]
by the triangle inequality, the Markov inequality, and Theorem 13.
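As an illustration (our own, recalling the standard bounds ζm ≲ √m for spline and trigonometric bases and ζm ≲ m for polynomial bases; see, e.g., Newey [29]), the growth condition specializes as
\[
\zeta_m \lesssim \sqrt m:\quad \zeta_m^2 \log n \lesssim m\log n; \qquad \zeta_m \lesssim m:\quad \zeta_m^2 \log n \lesssim m^2\log n,
\]
so ζm² log n = o(n) amounts to m log n = o(n) for spline-type bases and to m² log n = o(n) for polynomials.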
Next we proceed to bound the various approximation error terms.
Lemma 22 (Controlling error ǫ1). Under conditions S.1-S.4 we have
\[
\epsilon_1(m,n) \lesssim_P \sqrt{m \log n\, \phi_n}.
\]
Proof. Consider the class of functions F_{m,n} defined in Lemma 18, so that ǫ1(m,n) = sup_{f∈F_{m,n}} |G_n(f)|. From Lemma 17 we have that J(m) ≲ √m. By Step 2 of Lemma 16, see equation (G.36),
\[
\sup_{f\in\mathcal F_{m,n}} |\mathbb G_n(f)| \lesssim_P \sqrt{m\log n}\, \sup_{f\in\mathcal F_{m,n}} \big( \mathbb E_n[f^2] \vee \mathrm E[f^2] \big)^{1/2}. \tag{G.40}
\]
The score function ψi(·,·) satisfies the following inequality for any α ∈ S^{m−1}:
\[
|(\psi_i(\beta,u) - \psi_i(\beta(u),u))'\alpha| = |\alpha' Z_i|\, \big| 1\{Y_i \le Z_i'\beta\} - 1\{Y_i \le Z_i'\beta(u)\} \big| \le |\alpha' Z_i|.
\]
Therefore
\[
\mathbb E_n[f^2] \le \mathbb E_n[|\alpha' Z_i|^2] \le \phi_n \quad \text{and} \quad \mathrm E[f^2] \le \mathrm E\Big[ \frac 1 n \sum_{i=1}^n |\alpha' Z_i|^2 \Big] \le \phi_n \tag{G.41}
\]
by definition (G.39). Combining (G.41) with (G.40) we obtain the result.
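Spelled out, the combination of the last two displays reads (our own intermediate step):
\[
\epsilon_1(m,n) = \sup_{f\in\mathcal F_{m,n}} |\mathbb G_n(f)| \lesssim_P \sqrt{m\log n}\,(\phi_n \vee \phi_n)^{1/2} = \sqrt{m\log n\,\phi_n}.
\]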
Lemma 23 (Controlling error ǫ0 and pivotal process norm). Under the conditions S.1-S.4,
\[
\epsilon_0(m,n) \lesssim_P \sqrt{m\log n\,\phi_n} \quad \text{and} \quad \sup_{u\in\mathcal U} \|\mathbb U_n(u)\| \lesssim_P \sqrt{m\log n\,\phi_n}.
\]
Proof. For ǫ0, the proof is similar to the proof of Lemma 22, relying on Lemma 20 and noting that for g ∈ G_{m,n}
\[
\mathbb E_n[g^2] = \mathbb E_n[(\alpha'\psi_i(\beta(u),u))^2] = \mathbb E_n[(\alpha' Z_i)^2 (1\{Y_i \le Z_i'\beta(u)\} - u)^2] \le \mathbb E_n[(\alpha' Z_i)^2] \le \phi_n.
\]
Similarly, E[g²] ≤ φn. The second relation follows similarly.
Lemma 24 (Bounds on ǫ1(m,n) and ǫ2(m,n)). Under conditions S.1-S.4 we have
\[
\epsilon_1(m,n) \lesssim_P \sqrt{m\,\zeta_m\, r \log n} + \frac{m\,\zeta_m}{\sqrt n}\,\log n \quad \text{and} \quad \epsilon_2(m,n) \lesssim_P \sqrt n\,\zeta_m r^2 + \sqrt n\, m^{-\kappa} r.
\]
Proof. Part 1 (ǫ1(m,n)). The first bound will follow from the application of the maximal inequality derived in Lemma 16.
Define the class of functions F_{m,n} as in Lemma 18, which characterizes ǫ1. By the (second) maximal inequality in Lemma 16,
\[
\epsilon_1(m,n) = \sup_{f\in\mathcal F_{m,n}} |\mathbb G_n(f)| \lesssim_P J(m) \Big( \sup_{f\in\mathcal F_{m,n}} \mathrm E f^2 + n^{-1} J(m)^2 M^2(m,n) \log n \Big)^{1/2} \log^{1/2} n,
\]
where J(m) ≲ √m by the VC index of F_{m,n} being of O(m) (Lemma 18), and M(m,n) = max_{1≤i≤n} ‖Z_i‖ ≤ ζm. The bound stated in the lemma holds provided we can show that
\[
\sup_{f\in\mathcal F_{m,n}} \mathrm E f^2 \lesssim r\,\zeta_m. \tag{G.42}
\]
The score function ψi(·,·) satisfies the following inequality for any α ∈ S^{m−1}:
\[
|(\psi_i(\beta,u) - \psi_i(\beta(u),u))'\alpha| = |Z_i'\alpha|\, |1\{Y_i \le Z_i'\beta\} - 1\{Y_i \le Z_i'\beta(u)\}| \le |Z_i'\alpha| \cdot 1\{|Y_i - Z_i'\beta(u)| \le |Z_i'(\beta - \beta(u))|\}.
\]
Thus,
\[
\begin{aligned}
\mathrm E\big[ (\alpha'(\psi_i(\beta,u) - \psi_i(\beta(u),u)))^2 \big]
&\le \mathrm E\Big[ \int |Z'\alpha|^2\, 1\{|y - Z'\beta(u)| \le |Z'(\beta - \beta(u))|\}\, f_{Y|X}(y|X)\, dy \Big] \\
&\le \mathrm E\big[ (Z'\alpha)^2 \cdot \min(2\bar f\,|Z'(\beta - \beta(u))|, 1) \big] \\
&\le 2\bar f\, \|\beta - \beta(u)\| \sup_{\alpha\in S^{m-1},\,\gamma\in S^{m-1}} \mathrm E\big[ |Z_i'\alpha|^2 |Z_i'\gamma| \big] \\
&= 2\bar f\, \|\beta - \beta(u)\| \sup_{\alpha\in S^{m-1}} \mathrm E\big[ |Z_i'\alpha|^3 \big],
\end{aligned}
\]
where sup_{α∈S^{m−1}} E[|Z_i′α|³] ≲ ζm by S.3 and S.4. Therefore, we have the upper bound (G.42).
Part 2 (ǫ2(m,n)). To show the bound for ǫ2, note that by Lemma 12, for any α ∈ S^{m−1},
\[
\sqrt n\, |\alpha'(J_m(u) - \widetilde J_m(u))(\beta - \beta(u))| \lesssim \sqrt n\, m^{-\kappa} r.
\]
For α ∈ S^{m−1} and J̃m(u) = E[f_{Y|X}(Z′β(u)|X) Z Z′], define
\[
\epsilon_2(m,n,\alpha) = n^{1/2} \big| \alpha'\big( \mathrm E[\psi(Y,Z,\beta,u)] - \mathrm E[\psi(Y,Z,\beta(u),u)] \big) - \alpha' \widetilde J_m(u)(\beta - \beta(u)) \big|.
\]
Thus ǫ2(m,n) ≲ sup_{α∈S^{m−1}, (u,β)∈R_{n,m}} ǫ2(m,n,α) + √n m^{−κ} r.
Note that since E[α′ψ_i(β(u),u)] = 0, for some β̃ in the line segment between β(u) and β we have E[α′ψ_i(β,u)] = E[f_{Y|X}(Z_i′β̃|X)(Z_i′α)Z_i′](β − β(u)). Thus, using that |f_{Y|X}(Z′β̃|X) − f_{Y|X}(Z′β(u)|X)| ≤ f̄′|Z′(β̃ − β(u))| ≤ f̄′|Z′(β − β(u))|,
\[
\begin{aligned}
\epsilon_2(m,n,\alpha) &= n^{1/2} \Big| \mathrm E\big[ (\alpha' Z)\big( f_{Y|X}(Z'\tilde\beta|X) - f_{Y|X}(Z'\beta(u)|X) \big) Z'(\beta - \beta(u)) \big] \Big| \\
&\le n^{1/2}\, \mathrm E\big[ |\alpha' Z|\, |Z'(\beta - \beta(u))|^2 \big]\, \bar f'
\le n^{1/2}\, \bar f'\, \|\beta - \beta(u)\|^2 \sup_{\alpha\in S^{m-1}} \mathrm E[|\alpha' Z|^3],
\end{aligned}
\]
where sup_{α∈S^{m−1}} E[|Z_i′α|³] ≲ ζm by S.3 and S.4.
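To connect (G.42) with the maximal inequality explicitly (an intermediate computation we spell out, using √(a+b) ≤ √a + √b in the last step):
\[
\epsilon_1(m,n) \lesssim_P \sqrt m\, \big( r\,\zeta_m + n^{-1} m\, \zeta_m^2 \log n \big)^{1/2} \log^{1/2} n \lesssim \sqrt{m\,\zeta_m\, r \log n} + \frac{m\,\zeta_m}{\sqrt n}\,\log n,
\]
which is the first bound stated in the lemma.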
Lemma 25 (Bounds on Approximation Error for Uniform Linear Approximation). Under Condition S,
\[
\tilde r_n(u) := \frac{1}{\sqrt n} \sum_{i=1}^n Z_i\big( 1\{Y_i \le Z_i'\beta(u) + R(X_i,u)\} - 1\{Y_i \le Z_i'\beta(u)\} \big), \quad u \in \mathcal U,
\]
satisfies
\[
\sup_{\alpha\in S^{m-1},\, u\in\mathcal U} |\alpha' \tilde r_n(u)| \lesssim_P \min\left\{ \sqrt{m^{1-\kappa}\log n} + \frac{\zeta_m\, m \log n}{\sqrt n},\ \sqrt{n\,\phi_n\, m^{-\kappa}\, \bar f} \right\}.
\]
Proof. The second bound follows from the Cauchy-Schwarz inequality, |R(X_i,u)| ≲ m^{−κ}, and the boundedness of the conditional probability density function of Y given X.
The proof of the first bound is similar to the bound on ǫ1 in Lemma 24; it also follows from the application of the maximal inequality derived in Lemma 16. Define the class of functions
\[
\mathcal A_{n,m} = \{ \alpha' Z\,(1\{Y \le Z'\beta(u) + R(X,u)\} - 1\{Y \le Z'\beta(u)\}) : \alpha \in S^{m-1},\ u \in \mathcal U \}.
\]
By the (second) maximal inequality in Lemma 16,
\[
\sup_{\alpha\in S^{m-1},\, u\in\mathcal U} |\alpha' \tilde r_n(u)| = \sup_{f\in\mathcal A_{n,m}} |\mathbb G_n(f)| \lesssim_P J(m) \Big( \sup_{f\in\mathcal A_{n,m}} \mathrm E f^2 + n^{-1} J(m)^2 M^2(m,n) \log n \Big)^{1/2} \log^{1/2} n,
\]
where J(m) ≲ √m by the VC index of A_{n,m} being of O(m) by Lemma 21, and M(m,n) = ζm by S.4. The bound stated in the lemma holds provided we can show that
\[
\sup_{f\in\mathcal A_{n,m}} \mathrm E f^2 \lesssim m^{-\kappa}. \tag{G.43}
\]
For any f ∈ A_{n,m},
\[
|f(Y_i,Z_i,X_i)| = |Z_i'\alpha|\, |1\{Y_i \le Z_i'\beta(u) + R(X_i,u)\} - 1\{Y_i \le Z_i'\beta(u)\}| \le |Z_i'\alpha| \cdot 1\{|Y_i - Z_i'\beta(u)| \le |R(X_i,u)|\}.
\]
Thus, since |f_{Y|X}(y|X)| ≤ f̄,
\[
\begin{aligned}
\mathrm E f^2(Y,Z,X) &\le \mathrm E\Big[ \int |Z'\alpha|^2\, 1\{|y - Z'\beta(u)| \le |R(X,u)|\}\, f_{Y|X}(y|X)\, dy \Big] \\
&\le \mathrm E\big[ (Z'\alpha)^2 \cdot \min(2\bar f\,|R(X,u)|, 1) \big] \le 2\bar f\, m^{-\kappa} \sup_{\alpha\in S^{m-1}} \mathrm E[|Z_i'\alpha|^2],
\end{aligned}
\]
where sup_{α∈S^{m−1}} E[|Z_i′α|²] ≲ 1 by invoking S.3. Therefore, we have the upper bound (G.43).
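As in Lemma 24, plugging (G.43), J(m) ≲ √m, and M(m,n) = ζm into the maximal inequality gives (an intermediate computation we spell out):
\[
\sup_{\alpha\in S^{m-1},\,u\in\mathcal U} |\alpha' \tilde r_n(u)| \lesssim_P \sqrt m\, \big( m^{-\kappa} + n^{-1} m\, \zeta_m^2 \log n \big)^{1/2} \log^{1/2} n \lesssim \sqrt{m^{1-\kappa}\log n} + \frac{\zeta_m\, m \log n}{\sqrt n}.
\]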
Lemma 26 (Bound on ǫ3(m,n)). Let β̂(u) be a solution to the perturbed QR problem
\[
\hat\beta(u) \in \arg\min_{\beta\in\mathbb R^m}\ \mathbb E_n[\rho_u(Y_i - Z_i'\beta)] + A_n(u)'\beta.
\]
If the data are in general position, then
\[
\epsilon_3(m,n) = \sup_{u\in\mathcal U} n^{1/2} \big\| \mathbb E_n[\psi_i(\hat\beta(u),u)] + A_n(u) \big\| \le \min\left\{ \frac{m}{\sqrt n}\,\zeta_m,\ \sqrt{\phi_n\, m} \right\}
\]
holds with probability 1.
Proof. Note that the dual problem associated with the perturbed QR problem is
\[
\max_{(u-1)\le a_i \le u}\ \mathbb E_n[Y_i a_i] \quad : \quad \mathbb E_n[Z_i a_i] = -A_n(u).
\]
Letting â(u) denote the solution to the dual problem above, and letting a_i(β̂(u)) := (u − 1{Y_i ≤ Z_i′β̂(u)}), by the triangle inequality
\[
\epsilon_3(m,n) \le \sup_{\|\alpha\|\le 1,\, u\in\mathcal U} \sqrt n\, \Big| \mathbb E_n\big[ (Z_i'\alpha)(a_i(\hat\beta(u)) - \hat a_i(u)) \big] \Big| + \sup_{u\in\mathcal U} \sqrt n\, \big\| \mathbb E_n[Z_i \hat a_i(u)] + A_n(u) \big\|.
\]
By dual feasibility, E_n[Z_i â_i(u)] = −A_n(u), so the second term is identically equal to zero.
We note that a_i(β̂(u)) ≠ â_i(u) only if the i-th point is interpolated. Since the data are in general position, with probability one the quantile regression interpolates m points (Z_i′β̂(u) = Y_i for m points, for every u ∈ U).
Therefore, noting that |a_i(β̂(u)) − â_i(u)| ≤ 1,
\[
\epsilon_3(m,n) \le \sup_{\|\alpha\|\le 1,\, u\in\mathcal U} \sqrt n\, \sqrt{ \mathbb E_n[(Z_i'\alpha)^2] }\, \sqrt{ \mathbb E_n\big[ \{a_i(\hat\beta(u)) - \hat a_i(u)\}^2 \big] } \le \sqrt{\phi_n\, m}
\]
and, with probability 1,
\[
\epsilon_3(m,n) \le \sup_{\|\alpha\|\le 1,\, u\in\mathcal U} \sqrt n\, \mathbb E_n\big[ 1\{a_i(\hat\beta(u)) \neq \hat a_i(u)\} \big] \max_{1\le i\le n} \|Z_i\| \le \frac{m}{\sqrt n} \max_{1\le i\le n} \|Z_i\|.
\]
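Both final displays rest on the interpolation property; writing out the counting step explicitly (our own intermediate step): since at most m of the n points are interpolated and |a_i(β̂(u)) − â_i(u)| ≤ 1,
\[
\mathbb E_n\big[ \{a_i(\hat\beta(u)) - \hat a_i(u)\}^2 \big] \le \mathbb E_n\big[ 1\{a_i(\hat\beta(u)) \neq \hat a_i(u)\} \big] \le \frac m n,
\]
so that √n · √φn · √(m/n) = √(φn m) and √n · (m/n) · max_{1≤i≤n}‖Z_i‖ = (m/√n) max_{1≤i≤n}‖Z_i‖.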
Lemma 27 (Bound on ǫ4(m,n)). Under conditions S.1-S.4 and ζm² log n = o(n),
\[
\epsilon_4(m,n) \lesssim_P \sqrt{\frac{\zeta_m^2 \log n}{n}} = o(1).
\]
Proof. The result follows from Theorem 13, due to Guédon and Rudelson [18], under our assumptions.
Lemma 28 (Bounds on ǫ5(m,n) and ǫ6(m,n)). Under S.2, hn = o(1), and r = o(1),
\[
\epsilon_5(m,n) \lesssim_P \sqrt{\frac{\zeta_m^2\, m \log n}{n h_n}} + \frac{m\,\zeta_m^2}{n h_n}\,\log n \quad \text{and} \quad \epsilon_6(m,n) \lesssim m^{-\kappa} + r\,\zeta_m + h_n.
\]
Proof. To bound ǫ5, first let
\[
\mathcal H_{m,n} = \{ 1\{|Y_i - Z_i'\beta| \le h\}(\alpha' Z_i)^2 : \|\beta - \beta(u)\| \le r,\ h \in (0,H],\ \alpha \in S^{m-1} \}.
\]
Then, by Lemma 16,
\[
\epsilon_5(m,n) = \frac{n^{-1/2}}{h_n} \sup_{f\in\mathcal H_{m,n}} |\mathbb G_n(f)| \lesssim_P \frac{n^{-1/2}}{h_n}\, J(m) \Big( \sup_{f\in\mathcal H_{m,n}} \mathrm E f^2 + n^{-1} J(m)^2 M^2(m,n) \log n \Big)^{1/2} \log^{1/2} n,
\]
where J(m) ≲ √m by the VC index of H_{m,n} being of O(m) by Lemma 19, and M(m,n) = max_{i≤n} H_{m,n,i}, where H_{m,n,i} = ‖Z_i‖² is the envelope of H_{m,n} in the sample of size n. Therefore the envelope is bounded by max_{1≤i≤n} ‖Z_i‖². We also have
\[
\sup_{f\in\mathcal H_{m,n}} \mathrm E f^2 \le \sup_{\beta\in\mathbb R^m,\, \alpha\in S^{m-1}} \mathrm E\big[ 1\{|Y_i - Z_i'\beta| \le h_n\}(\alpha' Z_i)^4 \big] \le 2\bar f\, h_n \sup_{\alpha\in S^{m-1}} \mathrm E[(\alpha' Z_i)^4].
\]
By S.2, f̄ is bounded, and by S.4,
\[
\sup_{f\in\mathcal H_{m,n}} \mathrm E f^2 \le 2\bar f\, \zeta_m^2\, h_n \sup_{\alpha\in S^{m-1}} \mathrm E[(\alpha' Z_i)^2] \lesssim \zeta_m^2 h_n.
\]
Collecting terms,
\[
\begin{aligned}
\epsilon_5(m,n) &\lesssim_P \frac{n^{-1/2}}{h_n}\, \sqrt m\, \Big( \zeta_m^2 h_n + \big( m \max_{1\le i\le n} \|Z_i\|^4 / n \big) \log n \Big)^{1/2} \log^{1/2} n \\
&\lesssim_P \sqrt{ \frac{\zeta_m^2\, m \log n}{n h_n} } + \sqrt{ \frac{m^2 \max_{1\le i\le n} \|Z_i\|^4}{n^2 h_n^2} }\, \log n
\lesssim_P \sqrt{ \frac{\zeta_m^2\, m}{n h_n} }\, \log^{1/2} n + \frac{m \max_{1\le i\le n} \|Z_i\|^2}{n h_n}\, \log n.
\end{aligned}
\]
To show the bound on ǫ6, note that |f′_{Y|X}(y|x)| ≤ f̄′ by S.2. Therefore
\[
\begin{aligned}
\mathrm E\big[ 1\{|Y - Z'\beta| \le h_n\}(\alpha' Z)^2 \big]
&= \mathrm E\Big[ (\alpha' Z)^2 \int_{-h_n}^{h_n} f_{Y|X}(Z'\beta + t|X)\, dt \Big] \\
&= \mathrm E\Big[ (\alpha' Z)^2 \int_{-h_n}^{h_n} \big\{ f_{Y|X}(Z'\beta|X) + t f'_{Y|X}(Z'\beta + \tilde t|X) \big\}\, dt \Big] \\
&= 2h_n\, \mathrm E\big[ f_{Y|X}(Z'\beta|X)(\alpha' Z)^2 \big] + O\big( 2h_n^2\, \bar f'\, \mathrm E[(Z'\alpha)^2] \big)
\end{aligned}
\]
by the mean-value theorem. Moreover, we have for any (u,β) ∈ R_{m,n}
\[
\begin{aligned}
\mathrm E\big[ f_{Y|X}(Z'\beta|X)(\alpha' Z)^2 \big]
&= \mathrm E\big[ f_{Y|X}(Z'\beta(u)|X)(\alpha' Z)^2 \big] + \mathrm E\big[ (f_{Y|X}(Z'\beta|X) - f_{Y|X}(Z'\beta(u)|X))(\alpha' Z)^2 \big] \\
&= \mathrm E\big[ f_{Y|X}(Z'\beta(u)|X)(\alpha' Z)^2 \big] + O\big( \mathrm E\big[ \bar f'\, |Z'(\beta - \beta(u))|\, (\alpha' Z)^2 \big] \big) \\
&= \alpha' \widetilde J_m(u) \alpha + O\big( \bar f' r \sup_{\alpha\in S^{m-1}} \mathrm E|\alpha' Z|^3 \big) \\
&= \alpha' J_m(u) \alpha + O(m^{-\kappa}) + O\big( \bar f' r \sup_{\alpha\in S^{m-1}} \mathrm E|\alpha' Z|^3 \big),
\end{aligned}
\]
where the last line follows from Lemma 13, S.2 and S.5. Since f̄′ is bounded by assumption S.2, we obtain ǫ6(m,n) ≲ hn + m^{−κ} + r sup_{α∈S^{m−1}} E|α′Z|³. Finally, conditions S.3 and S.4 yield sup_{α∈S^{m−1}} E|α′Z|³ ≤ ζm sup_{α∈S^{m−1}} E|α′Z|² ≲ ζm, and the result follows.
References
[1] D. W. K. Andrews. Empirical process methods in econometrics. In R. F. Engle and D. L. McFadden, editors, Handbook of Econometrics, Volume IV, Chapter 37, pages 2247-2294. Elsevier Science B.V., 1994.
[2] D. W. K. Andrews. Asymptotic normality of series estimators for nonparametric and semiparametric regression models. Econometrica, 59(2):307-345, 1991.
[3] O. Arias, K. F. Hallock, and W. Sosa-Escudero. Individual heterogeneity in the returns to schooling: instrumental variables quantile regression using twins data. Empirical Economics, 26(1):7-40, 2001.
[4] R. Barlow, D. Bartholomew, J. Bremner, and H. Brunk. Statistical Inference Under Order Restrictions. John Wiley, New York, 1972.
[5] A. Belloni, X. Chen, and V. Chernozhukov. New asymptotic theory for series estimators. 2009.
[6] A. Belloni and V. Chernozhukov. ℓ1-penalized quantile regression for high dimensional sparse models. Ann. Statist., 39(1):82-130, 2011.
[7] M. Buchinsky. Changes in the U.S. wage structure 1963-1987: application of quantile regression. Econometrica, 62(2):405-458, 1994.
[8] M. D. Cattaneo, R. K. Crump, and M. Jansson. Robust data-driven inference for density-weighted average derivatives. J. Amer. Statist. Assoc., 105(491):1070-1083, 2010. With supplementary material available online.
[9] A. G. Chandrasekhar, V. Chernozhukov, F. Molinari, and P. Schrimpf. Inference on sets of best linear approximations to functions. MIT Working Paper, 2010.
[10] P. Chaudhuri. Nonparametric estimates of regression quantiles and their local Bahadur representation. Ann. Statist., 19(2):760-777, 1991.
[11] P. Chaudhuri, K. Doksum, and A. Samarov. On average derivative quantile regression. Ann. Statist., 25(2):715-744, 1997.
[12] X. Chen. Large sample sieve estimation of semi-nonparametric models. In J. J. Heckman and E. Leamer, editors, Handbook of Econometrics, Volume 6B, Chapter 76, 2006.
[13] X. Chen and X. Shen. Sieve extremum estimates for weakly dependent data. Econometrica, 66(2):289-314, 1998.
[14] V. Chernozhukov, I. Fernández-Val, and A. Galichon. Improving point and interval estimators of monotone functions by rearrangement. Biometrika, 96(3):559-575, 2009.
[15] V. Chernozhukov, S. Lee, and A. M. Rosen. Intersection bounds: estimation and inference. arXiv:0907.3503v1, 2009.
[16] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74(368):829-836, 1979.
[17] R. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics, 2000.
[18] O. Guédon and M. Rudelson. Lp-moments of random vectors via majorizing measures. Advances in Mathematics, 208:798-823, 2007.
[19] P. Hall and S. J. Sheather. On the distribution of a studentized quantile. Journal of the Royal Statistical Society, Series B (Methodological), 50(3):381-391, 1988.
[20] W. K. Härdle, Y. Ritov, and S. Song. Partial linear quantile regression and bootstrap confidence bands. SFB 649 Discussion Paper 2010-002, 2009.
[21] J. A. Hausman and W. K. Newey. Nonparametric estimation of exact consumers surplus and deadweight loss. Econometrica, 63(6):1445-1476, 1995.
[22] X. He and Q.-M. Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73:120-135, 2000.
[23] J. L. Horowitz and S. Lee. Nonparametric estimation of an additive quantile regression model. Journal of the American Statistical Association, 100(472):1238-1249, 2005.
[24] R. Koenker. Quantile Regression. Cambridge University Press, New York, 2005.
[25] R. Koenker and G. Bassett. Regression quantiles. Econometrica, 46(1):33-50, 1978.
[26] R. Koenker. quantreg: Quantile Regression, 2008. R package version 4.24.
[27] E. Kong, O. Linton, and Y. Xia. Uniform Bahadur representation for local polynomial estimates of M-regression and its application to the additive model. Econometric Theory, 26(5):1529-1564, 2010.
[28] S. Lee. Efficient semiparametric estimation of a partially linear quantile regression model. Econometric Theory, 19:1-31, 2003.
[29] W. K. Newey. Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79:147-168, 1997.
[30] M. I. Parzen, L. J. Wei, and Z. Ying. A resampling method based on pivotal estimating functions. Biometrika, 81(2):341-350, 1994.
[31] D. Pollard. A User's Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics, 2001.
[32] J. L. Powell. Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25(3):303-325, 1984.
[33] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0.
[34] R. Schmalensee and T. M. Stoker. Household gasoline demand in the United States. Econometrica, 67(3):645-662, 1999.
[35] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040-1053, 1982.
[36] S. van de Geer. M-estimation using penalties or sieves. Journal of Statistical Planning and Inference, 108(1-2):55-69, 2002.
[37] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics, 1996.
[38] H. L. White. Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the Symposium on the Interface, pages 190-199, 1992.
[39] A. Yatchew and J. A. No. Household gasoline demand in Canada. Econometrica, 69(6):1697-1709, 2001.