Massachusetts Institute of Technology
Department of Economics
Working Paper Series

ℓ1-PENALIZED QUANTILE REGRESSION IN HIGH-DIMENSIONAL SPARSE MODELS

Alexandre Belloni
Victor Chernozhukov

Working Paper 09-1
April 24, 2009

Room E52-251
50 Memorial Drive
Cambridge, MA 02142

This paper can be downloaded without charge from the
Social Science Research Network Paper Collection at
http://ssrn.com/abstract=1394734
ℓ1-PENALIZED QUANTILE REGRESSION IN HIGH-DIMENSIONAL SPARSE MODELS

By Alexandre Belloni and Victor Chernozhukov
Duke University and Massachusetts Institute of Technology
We consider median regression and, more generally, quantile regression in high-dimensional sparse models. In these models the overall number p of regressors is very large, possibly larger than the sample size n, but only s of these regressors have non-zero impact on the conditional quantile of the response variable, where s grows slower than n. Since in this case the ordinary quantile regression is not consistent, we consider quantile regression penalized by the ℓ1-norm of coefficients (ℓ1-QR). First, we show that ℓ1-QR is consistent at the rate √(s/n) √(log p), which is close to the oracle rate √(s/n), achievable when the minimal true model is known. The overall number of regressors p affects the rate only through the log p factor, thus allowing nearly exponential growth in the number of zero-impact regressors. The rate result holds under relatively weak conditions, requiring that s/n converges to zero at a super-logarithmic speed and that the regularization parameter satisfies certain theoretical constraints. Second, we propose a pivotal, data-driven choice of the regularization parameter and show that it satisfies these theoretical constraints. Third, we show that ℓ1-QR correctly selects the true minimal model as a valid submodel, when the non-zero coefficients of the true model are well separated from zero. We also show that the number of non-zero coefficients of ℓ1-QR is of the same stochastic order as s, the number of non-zero coefficients in the minimal true model. Fourth, we analyze the rate of convergence of a two-step estimator that applies ordinary quantile regression to the selected model. Fifth, we evaluate the performance of ℓ1-QR in a Monte-Carlo experiment, and provide an application to the analysis of international economic growth.
*First version: December, 2007. This version: April 19, 2009. The authors gratefully acknowledge the research support from the National Science Foundation.
AMS 2000 subject classifications: Primary 62H12, 62J99; secondary 62J07.
Keywords and phrases: median regression, quantile regression, sparse models.
1. Introduction. Quantile regression is an important statistical method for analyzing the impact of regressors on the conditional distribution of a response variable (cf. Laplace [22], Koenker and Bassett [19], and Koenker [20]). It captures the heterogeneity of the impact of regressors on different parts of the distribution [7], exhibits robustness to outliers [19], has excellent computational properties [29], and has a wide applicability [20]. The asymptotic theory for quantile regression is well-developed under both a fixed number of regressors and an increasing number of regressors. The asymptotic theory under a fixed number of regressors is given by Koenker and Bassett [19], Portnoy [28], Gutenbrunner and Jureckova [14], Knight [17], and Chernozhukov [9], among others. The asymptotic theory under an increasing number of regressors is given in He and Shao [15], covering the case where the number of regressors is negligible relative to the sample size (p = o(n)).

In this paper, we consider quantile regression in high-dimensional sparse models (HDSMs). In such models, the overall number of regressors p is very large, possibly much larger than the sample size n. However, the number s of significant regressors, those having a non-zero impact on the response variable, is smaller than the sample size, that is, s = o(n). HDSMs ([8], [26]) have emerged to deal with many new applications, arising in biometrics, signal processing, machine learning, econometrics, and other areas of data analysis, where high-dimensional data sets have become widely available.

A number of papers began to investigate estimation of HDSMs, primarily focusing on penalized mean regression, with the ℓ1-norm acting as a penalty function [4, 5]. Candes and Tao [8] demonstrated that, remarkably, an estimator, called the Dantzig selector, achieves the rate √(s/n) √(log p), which is very close to the oracle rate √(s/n) obtainable when the true model is known. Thus the estimator can be consistent even under very rapid, nearly exponential growth in the total number of regressors p. Meinshausen and Yu [26] demonstrated similar striking results for the ℓ1-penalized least squares, proposed by Tibshirani [35]. van der Geer [37] derived valuable finite sample bounds on empirical risk for ℓ1-penalized estimators in generalized linear models. Fan and Lv [11] used correlation screening and derived asymptotic results under even weaker conditions on the number of regressors p, and Zhang and Huang [39] analyzed the sparsity and bias of ℓ1-penalized least squares. There were many other interesting developments, which we shall not review here.

Our paper's contribution is to develop, within the HDSM framework, a set of results on model selection and rates of convergence for quantile regression. Since the ordinary quantile regression is not consistent in HDSMs, we consider quantile regression penalized by the ℓ1-norm of parameter coefficients. We show that the ℓ1-penalized quantile regression is consistent at the rate √(s/n) √(log p), which is close to the oracle rate √(s/n) achievable when the true minimal model is known. In order to make the penalized estimator practical, we propose a pivotal, data-driven choice of the regularization parameter, and show that this choice leads to the same sharp convergence rate. Further, we show that the penalized quantile regression correctly selects the true minimal model as a valid submodel, when the non-zero coefficients of the true model are well separated from zero. We also analyze a two-step estimator that applies standard quantile regression to the selected model and aims at reducing the bias of the penalized quantile estimator. We illustrate the use of the penalized and post-penalized estimators with a Monte Carlo experiment and an international economic growth example. Thus, our results contribute to the literature on HDSMs by examining a new class of problems. Moreover, our proof strategy, developed to cope with the non-linearities and non-smoothness of quantile regression, may be of interest in other M-estimation problems. (We provide more detailed comparisons to the literature in Section 2.)

Finally, let us comment on the role of computational considerations in our analysis. The choice of the ℓ1-penalty function arises from considering a tradeoff between statistical efficiency and computational efficiency, with the latter being of particular importance in high-dimensional applications. Indeed, in model selection problems, the statistical efficiency criterion favors the use of ℓ0-penalty functions (Akaike [1] and Schwarz [32]), where the ℓ0-penalty counts the number of non-zero components of a parameter vector. However, the computational efficiency criterion favors the use of convex penalty functions. Indeed, convex penalty functions lead to efficient convex programming problems ([27]); in contrast, ℓ0-penalty functions lead to inefficient combinatorial search problems, plagued by the computational curse of dimensionality. Precisely because it is a convex function that is closest to the ℓ0-penalty (e.g. [30]), the ℓ1-penalty has emerged to play a central role in HDSMs in general (e.g. [25]), and in our analysis in particular. In other words, the use of the ℓ1-penalty takes us close to performing the most effective model selection, while respecting the computational efficiency constraint.

We organize the rest of the paper as follows. In Section 2, we introduce the problem, describe our key results under some simple primitive assumptions D.1-D.4, propose pivotal choices for the regularization parameter, and provide detailed comparisons with the literature. In Section 3, we develop the main results under conditions E.1-E.5, which are implied by D.1-D.4 but also hold much more generally. Section 4 analyzes the pivotal choice of the penalization parameter. In Section 5, we carry out a computational experiment and provide an application to an international growth example. In Section 6, we provide conclusions and discuss possible extensions. In Appendix A, we verify that conditions E.1-E.5 are implied by conditions D.1-D.4 and also hold more generally.
1.1. Notation. In what follows, we implicitly index all parameter values by the sample size n, but we omit the index whenever this does not cause confusion. We carry out the asymptotic analysis as n → ∞. We use the notation a ≲ b to denote a ≤ cb for all sufficiently large n, for some constant c that does not depend on n; and we use a ≲_P b to denote a = O_P(b). We also use a ∼ b to denote a ≲ b ≲ a, and a ∼_P b to denote a ≲_P b and b ≲_P a. We write a ∨ b = max{a, b} and a ∧ b = min{a, b}. We denote the ℓ2-norm by ‖·‖, the ℓ1-norm by ‖·‖_1, the ℓ∞-norm by ‖·‖_∞, and the ℓ0-"norm" by ‖·‖_0 (the number of non-zero components).
2. Basic Settings, the Estimator, and Overview of Results. In this section, we formulate the setting, define the estimator, and state primitive regularity conditions. We also provide an overview of the main results.

2.1. Basic Setting. The set-up of interest corresponds to a parametric quantile regression model, where the dimension p of the underlying model increases with the sample size n. Namely, we consider a response variable y and p-dimensional covariates x such that the u-th conditional quantile function of y given x is given by

(2.1)  Q_{y|x}(u) = x'β(u),  β(u) ∈ ℝ^p.

We consider the case where the dimension p of the model is large, possibly much larger than the available sample size n, but the true model β(u) is sparse, having only s = s(u) < p non-zero components. Throughout the paper the quantile index u ∈ (0, 1) is fixed. The population coefficient β(u) is known to be a minimizer of the criterion function

(2.2)  Q_u(β) = E[ρ_u(y − x'β)],

where ρ_u(t) = (u − 1{t ≤ 0}) t is the asymmetric absolute deviation function [20]. Given a random sample (y_1, x_1), ..., (y_n, x_n), the quantile regression estimator β̂(u) of β(u) is defined as a minimizer of

(2.3)  Q̂_u(β) = E_n[ρ_u(y_i − x_i'β)],

where E_n[f(y_i, x_i)] := n⁻¹ Σ_{i=1}^n f(y_i, x_i) denotes the empirical expectation of a function f in the given sample.
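As a concrete illustration of these definitions, the following sketch (ours, for illustration only; plain NumPy with hypothetical names) evaluates the check function ρ_u and the empirical criterion Q̂_u(β) of (2.3):

```python
import numpy as np

def check_loss(t, u):
    # asymmetric absolute deviation: rho_u(t) = (u - 1{t <= 0}) * t
    return (u - (t <= 0)) * t

def qr_criterion(beta, y, X, u):
    # empirical criterion (2.3): E_n[rho_u(y_i - x_i'beta)]
    return np.mean(check_loss(y - X @ beta, u))

# toy usage on simulated data
rng = np.random.default_rng(0)
n, p, u = 100, 5, 0.5
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta0 + rng.standard_normal(n)
print(qr_criterion(beta0, y, X, u))
```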
In the high-dimensional settings, particularly when p ≥ n, quantile regression is generally not consistent, which motivates the use of penalization in order to remove all, or at least nearly all, regressors whose population coefficients are zero, thereby possibly restoring consistency. A penalization that has proven to be quite useful in least squares settings is the ℓ1-penalty leading to the lasso estimator [35].
2.2. The Choice of Estimator, Linear Programming Formulation, and Its Dual. The ℓ1-penalized quantile regression estimator β̂(u) is a solution to the following optimization problem:

(2.4)  min_{β ∈ ℝ^p}  Q̂_u(β) + (λ/n) Σ_{j=1}^p |β_j|.

When the solution is not unique, we define β̂(u) as a basic solution having the minimal number of non-zero components. The criterion function in (2.4) is the sum of the criterion function (2.3) and a penalty function given by a scaled ℓ1-norm of the parameter vector. This ℓ1-penalized quantile regression, or quantile regression lasso, has been considered by Knight and Fu [18] under the small (fixed) p asymptotics.

For computational purposes, it is important to note that the penalized quantile regression problem (2.4) is equivalent to the following linear programming problem:

(2.5)  min_{ξ⁺, ξ⁻, β⁺, β⁻ ∈ ℝ₊^{2n+2p}}  E_n[u ξ_i⁺ + (1 − u) ξ_i⁻] + (λ/n) Σ_{j=1}^p (β_j⁺ + β_j⁻)
       s.t.  ξ_i⁺ − ξ_i⁻ = y_i − x_i'(β⁺ − β⁻),  i = 1, ..., n.

The problem minimizes the sum of an average of asymmetrically weighted residuals ξ_i⁺ and ξ_i⁻ and of the ℓ1-norm of the positive β⁺ and negative β⁻ parts of the parameter, β_j = β_j⁺ − β_j⁻. The linear programming formulation is useful for computation of the estimator, particularly in high-dimensional applications. There are a number of efficient, polynomial time, algorithms for the linear programming problem (2.5). Using these algorithms, one can compute the estimator (2.4) efficiently, avoiding the computational curse of dimensionality.
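To make the formulation concrete, here is a minimal sketch of (2.5) built on scipy.optimize.linprog; the solver choice and function names are our assumptions for illustration, not part of the paper. The decision vector stacks (ξ⁺, ξ⁻, β⁺, β⁻), all non-negative:

```python
import numpy as np
from scipy.optimize import linprog

def l1_penalized_qr(y, X, u, lam):
    """Solve (2.5): min E_n[u xi+ + (1-u) xi-] + (lam/n) sum(beta+ + beta-)
    subject to xi+ - xi- = y - X(beta+ - beta-), all variables >= 0."""
    n, p = X.shape
    c = np.concatenate([np.full(n, u / n),          # weights on xi+
                        np.full(n, (1 - u) / n),    # weights on xi-
                        np.full(2 * p, lam / n)])   # penalty on beta+, beta-
    # equality constraints: xi+ - xi- + X beta+ - X beta- = y
    A_eq = np.hstack([np.eye(n), -np.eye(n), X, -X])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[2 * n:2 * n + p] - res.x[2 * n + p:]   # beta = beta+ - beta-
```

The recovered coefficient vector is β = β⁺ − β⁻; in line with the discussion above, basic solutions of this linear program tend to be sparse.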
Furthermore, for both computational and theoretical purposes, it is important to note that the primal problem (2.5) has the following dual problem:

(2.6)  max_a  E_n[y_i a_i]  s.t.  |E_n[x_{ij} a_i]| ≤ λ/n,  j = 1, ..., p,  (u − 1) ≤ a_i ≤ u,  i = 1, ..., n.
The dual problem maximizes the correlation between the response variable and the rank scores, subject to the condition requiring the rank scores to be approximately uncorrelated with the regressors. This condition is reasonable, since the true rank scores, defined as a_i^*(u) = (u − 1{y_i ≤ x_i'β(u)}), should be independent of the regressors by (2.1). This follows because the event {y_i ≤ x_i'β(u)} is equivalent to the event {u_i ≤ u}, for a standard uniformly distributed variable u_i which is independent of x_i.
Since both primal and dual problems are feasible, by strong duality for linear programming the optimal value of (2.6) equals the optimal value of (2.4) (see, for example, Bertsimas and Tsitsiklis [6]). The optimal solution to the dual problem plays an important role in our analysis, helping us control the sparseness of the penalized estimator β̂(u) as well as choose the penalization parameter λ. Of course, the optimal solution to the dual problem (2.6) also plays an important role in the non-penalized case, with λ = 0, yielding the regression generalization of Hajek-Sidak rank scores (Gutenbrunner and Jureckova [14]).

Another potential approach worth considering is the Dantzig selector approach of Candes and Tao [8], proposed in the context of mean regression. We can extend this approach to quantile regression by defining the estimator as a solution to the following problem:

(2.7)  min_{β ∈ ℝ^p}  Σ_{j=1}^p |β_j|  s.t.  ‖S_u(β)‖_∞ ≤ λ/n,

where λ is a penalization parameter, and S_u is a subgradient of the quantile regression objective function Q̂_u(β):

(2.8)  S_u(β) = E_n[(1{y_i ≤ x_i'β} − u) x_i].

The estimator (2.7) minimizes the ℓ1-norm of the coefficients subject to a goodness-of-fit constraint.
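For completeness, the subgradient (2.8) is straightforward to evaluate; a small sketch (ours):

```python
import numpy as np

def qr_subgradient(beta, y, X, u):
    # S_u(beta) = E_n[(1{y_i <= x_i'beta} - u) x_i], equation (2.8)
    indicator = (y <= X @ beta).astype(float)
    return X.T @ (indicator - u) / len(y)
```

Note that the output jumps as β crosses any hyperplane {y_i = x_i'β}; this piece-wise constancy is exactly the computational obstacle discussed next.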
On computational grounds, we prefer the ℓ1-penalized estimator (2.4) over the Dantzig selector estimator (2.7). The reason is that the subgradient S_u in (2.8) is a piece-wise constant function of the parameters, leading to a serious difficulty in computing the estimator (2.7). In particular, the problem (2.7) can be recast as a mixed integer programming problem with n binary variables, for which (generally) there is no known polynomial time algorithm. (In sharp contrast, in the mean regression case the subgradient S(β) = E_n[(y_i − x_i'β) x_i], corresponding to the objective function Q̂(β) = E_n[(y_i − x_i'β)²]/2, is a linear function. Accordingly, in the mean regression case, the optimization problem can be recast as a linear programming problem, for which there are polynomial time algorithms.)
Another idea for formulating a Dantzig-type estimator for quantile regression would be to minimize the ℓ1-norm of the coefficients subject to a convex goodness-of-fit constraint, namely

(2.9)  min_{β ∈ ℝ^p}  Σ_{j=1}^p |β_j|  s.t.  Q̂_u(β) ≤ γ.

Since the constraint set {β : Q̂_u(β) ≤ γ} is piece-wise linear and convex, this problem is equivalent to a linear programming problem. Of course, this is hardly a surprise, since this problem is equivalent to an ℓ1-penalized quantile regression problem (2.4) that we started with in the first place. Indeed, for every feasible choice of γ in (2.9) there is a choice of λ that makes the solutions to (2.9) and to (2.4) identical. To see this, fix a feasible γ for the constraint Q̂_u(β) ≤ γ, and let κ denote the optimal value of the Lagrange multiplier for this constraint; then the problem (2.9) is equivalent to min_{β ∈ ℝ^p} Σ_{j=1}^p |β_j| + κ(Q̂_u(β) − γ), which is in turn equivalent to the original problem (2.4) with λ = n/κ. Therefore it suffices to focus our analysis on the original problem.
2.3. The Choice of the Regularization Parameter. Here we propose a pivotal, data-driven choice for the regularization parameter value λ. We shall verify in Section 4 that such a choice will agree with our theoretical choice of λ maximizing the speed of convergence of the penalized estimate to the true parameter value.

Because the objective function in the primal problem (2.4) is not pivotal in either small or large samples, finding a pivotal λ appears to be difficult a priori. However, instead of looking at the primal problem, let us look at its linear programming dual (2.6), which requires that

(2.10)  |E_n[x_{ij} a_i]| ≤ λ/n for all j = 1, ..., p  ⟺  ‖E_n[x_i a_i]‖_∞ ≤ λ/n.

This restriction requires that potential rank scores must be approximately uncorrelated with the regressors. It then makes sense to select λ so that the true rank scores

a_i^*(u) = (u − 1{y_i ≤ x_i'β(u)}),  for i = 1, ..., n,

satisfy this constraint. That is, we can potentially set λ = Λ_n, where

(2.11)  Λ_n = n ‖E_n[x_i a_i^*(u)]‖_∞.

Of course, since we do not observe the true rank scores, this choice is not available to us. The key observation is that the finite sample distribution of Λ_n is pivotal conditional on the regressors x_1, ..., x_n. We know that the rank scores can be represented almost surely as a_i^*(u) = (u − 1{u_i ≤ u}), for i = 1, ..., n, where u_1, ..., u_n are i.i.d. uniform (0, 1) random variables, independently distributed from the regressors x_1, ..., x_n. Thus, we have

(2.12)  Λ_n = n ‖E_n[x_i (u − 1{u_i ≤ u})]‖_∞,

which has a known distribution conditional on x_1, ..., x_n. Therefore we can use the quantiles of Λ_n as our choice for λ. In particular, we can set λ = λ(x_1, ..., x_n) as the 1 − α_n quantile of Λ_n:

(2.13)  λ = inf{c : P(Λ_n ≤ c | x_1, ..., x_n) ≥ 1 − α_n},

where α_n is a tail probability with rate to be determined below. Finally, let us note that we can also derive the pivotal quantity Λ_n, and thus also our choice of the regularization parameter λ, from the subgradient characterization of optimality for the primal problem (2.4).
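The conditional quantile (2.13) is easy to approximate by simulation, since (2.12) depends only on the observed design and on i.i.d. uniform draws. A minimal sketch (ours; names illustrative):

```python
import numpy as np

def pivotal_lambda(X, u, alpha, n_sims=10_000, seed=0):
    """Simulate Lambda_n = n * ||E_n[x_i (u - 1{u_i <= u})]||_inf of (2.12)
    conditional on the design X, and return its (1 - alpha) quantile (2.13)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = np.empty(n_sims)
    for k in range(n_sims):
        a_star = u - (rng.uniform(size=n) <= u)   # simulated rank scores
        draws[k] = n * np.max(np.abs(X.T @ a_star / n))
    return np.quantile(draws, 1 - alpha)
```

For example, lam = pivotal_lambda(X, u=0.5, alpha=0.10) reproduces the 0.9-quantile rule used in the Monte Carlo work of Section 5.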
2.4. Primitive Conditions. We follow Huber's framework of high-dimensional parameters [16], which formally consists of a sequence of models with parameter dimension p = p_n tending to infinity as the sample size n grows to infinity. Thus, the parameters of the models, the parameter space, and the parameter dimension are all indexed by the sample size n. However, following Huber's convention, we will omit the index n whenever this does not cause confusion. Let us consider the following set of conditions:
D.1. Sampling. Data (y_i, x_i'), i = 1, ..., n, are an i.i.d. sequence of real (1 + p)-vectors, with the conditional u-quantile function given by (2.1), and with the first component of x_i equal to one.

D.2. Sparseness of the True Model. The number of non-zero components of β(u) is bounded by 1 ≤ s = s_n ≤ n/log(n ∨ p).

D.3. Smooth Conditional Density. The conditional density f_{y_i|x_i}(y|x) and its derivative (∂/∂y) f_{y_i|x_i}(y|x) are bounded above uniformly in y and x ranging over the supports of y_i and x_i, and uniformly in n.

D.4. Identifiability in Population and Well-Behaved Regressors. Eigenvalues of the population design matrix E[x_i x_i'] are bounded above and away from zero, and sup_{‖α‖=1} E[|x_i'α|³] is bounded above, uniformly in n. The conditional density evaluated at the conditional quantile, f_{y_i|x_i}(x'β(u)|x), is bounded away from zero, uniformly in x ranging over the support of x_i, and uniformly in n.
The conditions D.1-D.4 stated above are a set of simple conditions that ensure that the high-level conditions developed in Section 3 hold. These conditions allow us to demonstrate the general applicability of our results and to compare straightforwardly to other results in the literature. In particular, condition D.1 imposes random sampling on the data, which is a conventional assumption in asymptotic statistics (e.g. [38]). Condition D.2 requires that the effective dimension of the true model is smaller than the sample size. Condition D.3 imposes some smoothness on the conditional distribution of the response variable. Condition D.4 requires the population design matrix to be uniformly non-singular and the regressors' moments to be well-behaved.
Further, let φ(k) be the maximal k-sparse eigenvalue of the empirical design matrix E_n[x_i x_i'], that is,

(2.14)  φ(k) = sup_{‖α‖≤1, ‖α‖_0≤k}  E_n[(α'x_i)²].
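Exact evaluation of (2.14) requires a search over all size-k supports and is therefore combinatorial; the following brute-force sketch (ours) is feasible only for very small p, but it makes the definition concrete:

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalue(X, k):
    """Maximal k-sparse eigenvalue phi(k) of E_n[x_i x_i'], equation (2.14),
    by brute force over size-k supports (supports of size < k cannot give a
    larger value). Exponential in p: for illustration only."""
    n, p = X.shape
    M = X.T @ X / n                      # empirical design matrix E_n[x x']
    best = 0.0
    for S in combinations(range(p), k):
        sub = M[np.ix_(S, S)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])   # largest eigenvalue
    return best
```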
Following Meinshausen and Yu [26], we will state our general results on convergence rates of the penalized estimator in terms of the maximal sparse eigenvalue φ(m_0). Meinshausen and Yu [26] worked with m_0 = n ∧ p as an initial upper bound on the zero norm of the penalized estimator. In this paper we can work with a much smaller m_0; in particular, under D.1-D.4, we can work with

m_0 = p ∧ (n/log(n ∨ p)),

as this provides a valid initial bound on the zero norm of our penalized estimator under a suitable choice of the penalization parameter.
By using an assumption on the growth rate of φ(k), we avoid imposing Candes and Tao's [8] uniform uncertainty principle on the empirical design matrix E_n[x_i x_i']. Meinshausen and Yu [26] argue that the assumption in terms of φ(k) is less stringent than the uniform uncertainty principle, since it allows for non-vanishing correlation between the regressors. Meinshausen and Yu [26] provide a thorough discussion of the behavior of φ(n) in many cases of interest. In particular, they show that the condition φ(n) ≲_P 1 appears reasonable in several cases (for example, when the empirical design matrix is block diagonal). Note that if the intercept is included as a covariate, we have φ(1) ≥ 1.
For the purposes of a basic overview of results in the next subsection, we employ the assumption

(2.15)  φ(n/log(n ∨ p)) ≲_P 1,

which will cover standard Gaussian regressors and some other regressors considered in Meinshausen and Yu [26] (because φ(n/log(n ∨ p)) ≤ φ(n)). Furthermore, in our general analysis presented in Section 3, we do not impose (2.15) and allow for the sparse eigenvalue φ(n/log(n ∨ p)) to diverge, which should permit situations with regressors having tails thicker than Gaussian.

In order to illustrate our conditions we employ the following canonical examples throughout the paper.
Example 1 (Isotropic Normal Design). Let us consider estimating the median (u = 1/2) of the following regression model: y = x'β_0 + ε, where the covariate x_1 = 1 is the intercept, the covariates x_{−1} ∼ N(0, I), and the errors ε are independent identically distributed with a smooth probability density function which is positive at zero and has bounded derivatives. This example satisfies conditions D.1, D.3, D.4, and also D.2 if ‖β_0‖_0 = s = o(n/log(n ∨ p)). Moreover, the maximal k-sparse eigenvalues satisfy, for k ≤ n,

φ(k) := sup_{‖α‖=1, ‖α‖_0≤k} E_n[(α'x_i)²] ≲_P 1 + √((k log p)/n),

as shown in Lemma 14. Thus, this design satisfies our conditions with φ(n/log(n ∨ p)) ∼_P 1, and it also satisfies Candes and Tao's uniform uncertainty principle.
Example 2 (Correlated Normal Design). We consider the same setup as in Example 1, but instead we suppose that the covariates are correlated, namely x_{−1} ∼ N(0, Σ), where Σ_{ij} = ρ^{|i−j|} and −1 < ρ < 1 is fixed. This example satisfies conditions D.1, D.3, D.4, and also D.2 if ‖β_0‖_0 = s = o(n/log(n ∨ p)). The maximal k-sparse eigenvalues for k ≤ n satisfy

φ(k) ≲_P ((1 + |ρ|)/(1 − |ρ|)) (1 + √((k log p)/n)),

by Lemmas 1 and 14. Thus, this design satisfies our conditions with φ(n/log(n ∨ p)) ∼_P 1. However, as mentioned in [26], this design violates Candes and Tao's uniform uncertainty principle, which requires |ρ| to vanish at a log p rate.
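For later reference, a short sketch (ours) that draws the designs of Examples 1 and 2; setting rho=0 recovers the isotropic case:

```python
import numpy as np

def simulate_design(n, p, rho=0.0, seed=0):
    """Draw the designs of Examples 1 and 2: an intercept column plus
    covariates x_{-1} ~ N(0, Sigma) with Sigma_{ij} = rho^{|i-j|}."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p - 1)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
    return np.hstack([np.ones((n, 1)), Z])
```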
Finally, it is worth noting that our analysis in Sections 3 and 4, and in Appendix A, allows the key parameters of the model, such as the bounds on the eigenvalues of the design matrix and on the density function, to change with the sample size. This will explicitly allow us to trace out the impact of these parameters on the large sample behavior of the penalized estimator. In particular, we will be able to immediately see how some basic changes in the primitive conditions stated above affect the large sample behavior of the penalized estimator.
2.5. Overview of Main Results. Here we discuss our results under the simplest assumptions, consisting of conditions D.1-D.4 and condition (2.15) on the maximal (n/log(n ∨ p))-sparse eigenvalue. These simplest assumptions allow us to straightforwardly compare our results to those obtained in the literature, without getting into nuisance details. We state our results under more general conditions in the subsequent sections: in Section 3, we present various results on convergence rates and model selection; in Section 4, we analyze our choice of the penalization parameter.
In order to achieve the most rapid rate of convergence, we need to choose

(2.16)  λ = t √(n log(n ∨ p)),

with t growing as slowly as possible with n; for concreteness, let t ∝ log log n.

Our first main result is that the ℓ1-penalized quantile regression estimator converges at the rate

(2.17)  ‖β̂(u) − β(u)‖ ≲_P t √((s/n) log(n ∨ p)),

provided that the number of non-zero components s satisfies

(2.18)  (s/n) t² log(n ∨ p) → 0.

We note that the total number of regressors p affects the rate of convergence (2.17) only through a logarithm in p. Hence if p is polynomial in n, the rate of convergence is very close to the oracle rate √(s/n), obtainable when we know the minimal true model. Further, we note that our resulting restriction (2.18) on the dimension s of the minimal true model is very weak: s can be of almost the same order as n, namely s = o(n/(t² log n)), when p is polynomial in n and t ∝ log log n.
Our second main result is that the dimension ‖β̂(u)‖_0 of the model selected by the ℓ1-penalized estimator is of the same stochastic order as the dimension s of the minimal true model, namely

(2.19)  ‖β̂(u)‖_0 ≲_P s.

Further, if the parameter values of the minimal true model are well separated away from zero, namely

(2.20)  min_{j ∈ support(β(u))} |β_j(u)| ≥ ℓ t √((s/n) log(n ∨ p)),

for some diverging sequence ℓ of positive constants, then with probability converging to one, the model selected by the ℓ1-penalized estimator correctly nests the true minimal model:

(2.21)  support(β(u)) ⊆ support(β̂(u)).

Moreover, we provide conditions under which a hard-thresholding of the estimator selects the correct support.

Our third main result is that a two-step estimator β̂^T̂(u), which applies standard quantile regression to the selected model T̂, achieves a similar rate of convergence:

(2.22)  ‖β̂^T̂(u) − β(u)‖ ≲_P √((s/n) log(n ∨ p)),  provided  (s/n) log(n ∨ p) → 0,

and provided the true non-zero coefficients are well-separated from zero in the sense of equation (2.20).
Finally, our fourth main result is to propose (2.13), a data-driven choice of the regularization parameter λ, which has a pivotal finite sample distribution conditional on the regressors, and to verify that (2.13) satisfies the theoretical restriction (2.16), supporting its use in practical estimation.

Our results for quantile regression parallel the results for least squares by Meinshausen and Yu [26] and by Candes and Tao [8]. Our results on the pivotal choice of the regularization parameter partly parallel the results by Candes and Tao [8], except that our choice is pivotal, whereas Candes and Tao's choice relies upon the knowledge of the standard deviation of the regression disturbances. The existence of close parallels may seem surprising, since, in contrast to the least squares problem, our problem is highly non-linear and non-smooth. Nevertheless, there is an intuition, presented below, suggesting that we can overcome these difficulties.
While our results for quantile regression parallel results for least squares, our proof strategy is substantially different, as it has to address non-linearities and non-smoothness. In order to explain the difference, let us recall, e.g., the proof strategy of Meinshausen and Yu [26]. They first analyze the problem with no disturbances, recognize sparseness of the solution for this zero-noise problem, and then analyze a sequence of problems along the path interpolating the zero-noise problem and the full-noise problem. Along this sequence, they bound the increments in the number of non-zero components and in the rates of convergence. This approach does not seem to work for our problem, where the zero-noise problem does not seem to have either the required sparseness or the required smoothness. In sharp contrast, our approach directly focuses on the full-noise problem, and simultaneously bounds the number of non-zero components and the convergence rates. Thus, our approach may be of independent interest for other M-estimation problems and even for the least squares problem.
Our analysis is perhaps closer in spirit to, but still quite different from, the important work of van der Geer [37], which derived finite sample bounds on the empirical risk of ℓ1-penalized estimators in generalized linear models (but did not investigate quantile regression models). The major difference between our proof and van der Geer's [37] proof strategies is that we analyze the sparseness of the solution to the penalized problem and then further exploit sparseness to control empirical errors in the sample criterion function. As a result, we derive not only the results on model selection and on sparseness of solutions, which are of prime interest, but also the results on consistency and rates of convergence under weak conditions on the number of non-zero components s. As mentioned above, our approach allows s to be of almost the same order as the sample size n, and delivers convergence rates that are close to √(s/n). In contrast, van der Geer's [37] approach requires s to be much smaller than n, namely s²/n → 0, and thus does not deliver consistency or rates of convergence when s²/n → ∞.

In our proofs we critically rely on two key quantities: the number of non-zero components m̂ = ‖β̂(u)‖_0 of the solution β̂(u), and the empirical error in the sample criterion, Q̂_u(β) − Q_u(β) (with β ranging over all m̂-dimensional submodels of the large p-dimensional model). In particular, we make use of the following relations: (1) lower m̂ implies smaller empirical error, and (2) smaller empirical error can imply lower m̂. Starting with a structural initial upper bound on m̂ (see condition E.3 below), we can use the two relations to solve for sharp bounds on m̂ and the empirical error, given which we then can solve for convergence rates.
Let us comment on the intuition behind relations (1) and (2). Relation (1) follows from an application of the usual entropy-based maximal inequalities, upon realizing that the entropy of all m-dimensional submodels grows at the rate m log p. In particular, the lower the m, the closer the sample criterion function Q̂_u is to a locally quadratic function, uniformly across all m-dimensional submodels. Relation (2) follows from the use of the ℓ1-penalty, which tends to favor lower-dimensional solutions when Q̂_u is close to being quadratic. Figure 1 provides a visual illustration of this, using a two-dimensional example with a one-dimensional true minimal submodel; in the example, the true parameter value (β_1(u), β_2(u)) is (1, 0). Figure 1 plots a diamond, centered at the origin, representing a contour set of the ℓ1-penalty function, and a pearl, representing a contour set of the criterion function Q̂_u. By the dual interpretation (2.9) of our estimation problem, the penalized estimator looks for a minimal diamond, subject to the diamond having a non-empty intersection with a fixed pearl. The set of optimal solutions is then given by the intersection of the minimal diamond with the pearl. Smaller empirical errors shape the pearl into an ellipse and center it closer to the true parameter value of (1, 0) (left panel of Figure 1). Larger empirical errors shape the pearl like a non-ellipse and can center it far away from the true parameter value (right panel of Figure 1). Therefore, smaller empirical errors tend to cause sparse optimal solutions, correctly setting β̂_2(u) = 0; larger empirical errors tend to cause non-sparse optimal solutions, incorrectly setting β̂_2(u) ≠ 0.
[Figure 1 appears here, showing contour sets of the ℓ1-penalty (diamond) and of the criterion function Q̂_u (pearl) in the (β_1, β_2) plane.]

Fig 1. These figures provide a geometric illustration for the discussion given in the text concerning why ℓ1-penalized estimation may be (left panel) or may not be (right panel) successful at selecting the minimal true model.
3. Analysis and Main Results Under High-Level Conditions. In this section we prove the main results under general conditions that encompass the simple conditions D.1-D.4 as a special case.
3.1. The Five Basic Conditions. We will work with the following five basic conditions E.1-E.5, which are the essential ingredients needed for our asymptotic approximations. In Appendix A, we verify that conditions E.1-E.5 hold under the simple sufficient conditions D.1-D.4 stated in Section 2, and we also show that E.1-E.5 arise much more generally. In particular, in Appendix A we characterize the key constants appearing in E.1-E.5 in terms of the parameters of the model.
E.1. True Model Sparseness. The true parameter value β(u) has at most s ≤ n/log(n ∨ p) non-zero components, namely

(3.1)  ‖β(u)‖_0 = s ≤ n/log(n ∨ p).

E.2. Identification in Population. In the population, the true parameter value β(u) is the unique solution to the quantile objective function. Moreover, the following minorization condition holds:

(3.2)  Q_u(β) − Q_u(β(u)) ≥ q (‖β − β(u)‖² ∧ g(‖β − β(u)‖)),

uniformly in β ∈ ℝ^p, where g : ℝ₊ → ℝ₊ is a fixed convex function with g'(0) > 0, and q is a sequence of positive numbers that characterizes the strength of identification in the population.
E.3. Empirical Pre-Sparseness. The number m̂ = ‖β̂(u)‖_0 of non-zero components of the solution to the penalized quantile regression problem (2.4) obeys the inequality

(3.3)  m̂ ≤ n ∧ p ∧ (n² φ(m̂)/λ²),

where φ(m̂) is the maximal m̂-sparse eigenvalue.

E.4. Empirical Sparseness. For r̂ = ‖β̂(u) − β(u)‖ and m̂ = ‖β̂(u)‖_0, the following stochastic inequality holds:

(3.4)  √m̂ ≲_P μ (r̂ ∧ 1) (n/λ) + √m̂ · √(n log(n ∨ p) φ(m̂)) / λ,

where μ ≥ q is a sequence of positive constants. The sequence of constants μ is the population analog of the empirical sparse eigenvalue φ(m_0) and is determined in Appendix A.
E.5. Sparse Control of Empirical Error. The empirical error, which describes the deviation of the empirical criterion from the population criterion, satisfies the following stochastic inequality:

(3.5)  |Q̂_u(β) − Q_u(β) − (Q̂_u(β(u)) − Q_u(β(u)))| ≲_P r √((m + s) log(n ∨ p) φ(m + s) / n),

uniformly over {β ∈ ℝ^p : ‖β‖_0 ≤ m ∧ n ∧ p, ‖β − β(u)‖ ≤ r}, and uniformly over m ≤ n, r > 0.

Let us briefly comment on each of the conditions. As stated earlier, condition E.1 is a basic modeling assumption, and condition E.2 is an identification assumption, required to hold in the population. Conditions E.3 and E.4 arise from two characterizations of sparseness of the solution to the optimization problem (2.4) defining the estimator. Condition E.3 arises from simple bounds applied to the first characterization. Condition E.4 arises from maximal inequalities applied to the second characterization. Condition E.5 arises from maximal inequalities applied to the empirical criterion function. To derive conditions E.4 and E.5, we crucially exploit the fact that the entropy of all m-dimensional submodels of the p-dimensional model is of order m log p, which depends on p only logarithmically. Finally, we note that conditions E.1-E.5 easily hold under the primitive assumptions D.1-D.4, in particular with μ ∼ q ∼ 1, but we also permit them to hold more generally. We refer the reader to Appendix A for verification and further analysis of these conditions.
Theorem 1 combines conditions E.1-E.5 to establish bounds on the rate of convergence and sparseness of the estimator (2.4).

Theorem 1. Assume that conditions E.1-E.5 hold. Let t →_P ∞ be a sequence of positive numbers, possibly data-dependent; define

(3.6)  m_0 = p ∧ (n/log(n ∨ p))  and set  λ = t √(n log(n ∨ p) φ(m_0 + s)) (μ/q).

Then we have that

(3.7)  ‖β̂(u) − β(u)‖ ≲_P λ√s/(qn) = t √((s/n) log(n ∨ p) φ(m_0 + s)) (μ/q²),

provided that λ√s/(qn) →_P 0, and

(3.8)  ‖β̂(u)‖_0 ≲_P s (μ/q)².
This is the main result of the paper: it derives the rate of convergence of the ℓ1-penalized quantile regression estimator and a stochastic bound on the dimension of the selected model. Our results parallel the results of Meinshausen and Yu [26] obtained for the ℓ1-penalized mean regression. We refer the reader to Section 2 for a detailed discussion of this and other main results of this section under simplified conditions. Here we only note that the rate of convergence generally depends on the number of significant regressors s, the logarithm of the number of regressors p, the strength of identification q, the empirical sparse eigenvalue φ(m_0), and the constant μ determined by the population sparse eigenvalue. The bound on the dimension also depends on the sequence of constants s, q, and μ.

It is also helpful to state the main result separately under the simple set of conditions D.1-D.4, where q ∼ μ ∼ 1.

Corollary 1 (A Leading Case). Conditions D.1-D.4 imply conditions E.1-E.5 with q ∼ μ ∼ 1. Therefore, under D.1-D.4, we have for m_0 = p ∧ (n/log(n ∨ p)), setting λ = t √(n log(n ∨ p) φ(m_0)) with t √((s/n) log(n ∨ p) φ(m_0)) →_P 0, that

‖β̂(u) − β(u)‖ ≲_P t √((s/n) log(n ∨ p) φ(m_0))  and  ‖β̂(u)‖_0 ≲_P s.

If in addition φ(m_0) ≲_P 1, then we obtain the rate result listed in equation (2.17).

This corollary follows from the lemmas stated in Appendix A, where we verify that conditions D.1-D.4 imply conditions E.1-E.5. Moreover, we use the fact that φ(m_0 + s) ≤ φ(2m_0) if s log(n ∨ p) ≤ n for m_0 = p ∧ (n/log(n ∨ p)), and that φ(2m_0) ≤ 2φ(m_0) by Lemma 11.

It is useful to revisit our concrete examples.
Example 3 (Isotropic Normal Design, continued). In the isotropic normal design considered earlier, recall that φ(k) ≲_P 1 + √((k/n) log p). Since s ≤ n/log(n ∨ p) and m_0 ≤ n/log(n ∨ p), we have φ(m_0 + s) ≲_P 1 by Lemma 11. Also, we verify in Appendix A that q ∼ μ ∼ 1. Thus, by Theorem 1, the rate result listed in equation (2.17) applies to this example.

Example 4 (Correlated Normal Design, continued). In the correlated normal design considered earlier, we have φ(k) ≲_P ((1 + |ρ|)/(1 − |ρ|)) (1 + √((k/n) log p)). Since s ≤ n/log(n ∨ p) and m_0 ≤ n/log(n ∨ p), by Lemma 11 we have φ(m_0 + s) ≲_P (1 + |ρ|)/(1 − |ρ|) ≲_P 1. Also, we verify in Appendix A that q ∼ μ ∼ 1. Thus, by Theorem 1, the rate result listed in equation (2.17) applies to this example too.
Proof. (Theorem 1) Let r̂ := ‖β̂(u) − β(u)‖ and m̂ := ‖β̂(u)‖_0. We divide the proof in four steps. The first step provides an initial bound on m̂, the second step obtains preliminary inequalities, the third step verifies consistency, and the fourth step establishes the rate result.

Step 1. We start by proving that m̂ ≤ m_0 with probability converging to one. If m_0 = p, then m̂ ≤ n ∧ p ≤ m_0 directly. Next consider the case m_0 = n/log(n ∨ p) < p. By condition E.3, m̂ satisfies the inequality

(3.9)  m̂ ≤ m̄ := max{ m : m ≤ n ∧ p ∧ (n² φ(m)/λ²) },

where m̄ is finite. Suppose that m̄ > m_0. Then m̄ ≤ ⌈m̄/m_0⌉ m_0 and, by Lemma 11, φ(m̄) ≤ ⌈m̄/m_0⌉ φ(m_0) ≤ (2m̄/m_0) φ(m_0). Since φ(m_0) ≤ φ(m_0 + s) and μ ≥ q, the construction (3.6) gives λ² ≥ t² n log(n ∨ p) φ(m_0). Then, by the definition of m̄,

m̄ ≤ n² φ(m̄)/λ² ≤ (2m̄/m_0) n² φ(m_0) / (t² n log(n ∨ p) φ(m_0)) = (2m̄/m_0) (m_0/t²) = 2m̄/t² < m̄

whenever t > √2, a contradiction that occurs with probability converging to one since t →_P ∞. Hence m̂ ≤ m̄ ≤ m_0 with probability converging to one.

Step 2. In this step we obtain some preliminary inequalities. By condition E.1, the support of β(u), T_u := support(β(u)) := {j ∈ {1, ..., p} : |β_j(u)| > 0}, has exactly s elements, that is, |T_u| = s. Let β̂_{T_u}(u) denote the vector whose T_u components agree with the T_u components of β̂(u), and whose remaining components are equal to zero. By the definition of β̂(u), the penalized criterion at β̂(u) is no larger than at β(u), so that

Q̂_u(β̂(u)) − Q̂_u(β(u)) ≤ (λ/n)(‖β(u)‖_1 − ‖β̂(u)‖_1) ≤ (λ/n)(‖β(u)‖_1 − ‖β̂_{T_u}(u)‖_1).

Using that

|‖β(u)‖_1 − ‖β̂_{T_u}(u)‖_1| ≤ ‖β(u) − β̂_{T_u}(u)‖_1 ≤ √s ‖β̂_{T_u}(u) − β(u)‖ ≤ √s r̂,

we obtain that

Q̂_u(β̂(u)) − Q̂_u(β(u)) ≤ (λ/n) √s r̂.

Applying condition E.5 to control the difference between the sample and population criterion functions, we further get that

Q_u(β̂(u)) − Q_u(β(u)) ≲_P (λ/n) √s r̂ + √((m̂ + s) log(n ∨ p) φ(m̂ + s)/n) r̂.

Invoking the identification condition E.2 and the definition of r̂, we obtain

(3.10)  q (r̂² ∧ g(r̂)) ≲_P (λ/n) √s r̂ + √((m̂ + s) log(n ∨ p) φ(m̂ + s)/n) r̂.

Step 3. In this step we show consistency, namely r̂ = o_P(1). By Step 1 we have m̂ ≤ m_0 with probability converging to one. The construction (3.6) of λ, t →_P ∞, and the condition λ√s/(qn) →_P 0 assumed in the theorem imply

λ√s/(qn) = t (μ/q²) √((s/n) log(n ∨ p) φ(m_0 + s)) →_P 0.

The condition μ ≥ q and the empirical sparseness condition E.4, stated in equation (3.4), imply that

(3.11)  √m̂ ≲_P μ (r̂ ∧ 1) n/λ + √m̂ o_P(1),

which implies the following second bound on m̂:

(3.12)  √m̂ ≲_P μ (r̂ ∧ 1) n/λ.

Using (3.12) and m̂ ≤ m_0 in equation (3.10) gives

(3.13)  1{m̂ > s} q (r̂² ∧ g(r̂)) ≲_P (λ/n) √s r̂ + μ (r̂ ∧ 1) r̂ √(n log(n ∨ p) φ(m̂ + s)) / λ = r̂ o_P(q),

where the last equality follows from the construction of λ and the conditions of the theorem. On the other hand, using m̂ ≤ s in equation (3.10) gives

(3.14)  1{m̂ ≤ s} q (r̂² ∧ g(r̂)) ≲_P (λ/n) √s r̂ + r̂ √((s/n) log(n ∨ p) φ(m_0 + s)) = r̂ o_P(q),

where the last equality follows by the same conditions. Conclude from (3.13) and (3.14) that

(3.15)  q (r̂² ∧ g(r̂)) = r̂ o_P(q).

Next we show that (3.15) implies r̂ = o_P(1). Dividing both sides of (3.15) by q r̂, we have 1{r̂ > 0}[r̂ ∧ (g(r̂)/r̂)] = 1{r̂ > 0} o_P(1). By condition E.2, g is a fixed convex function with g'(0) > 0, so that g(r̂)/r̂ ≥ g'(0). Thus, 1{r̂ > 0}[r̂ ∧ g'(0)] = 1{r̂ > 0} o_P(1), that is, r̂ = o_P(1).

Step 4. This step derives the rate of convergence. Since r̂ = o_P(1), we have r̂ ∧ 1 = r̂ with probability converging to one, and we improve the bound (3.12) on m̂ to the following third bound:

(3.16)  √m̂ ≲_P μ r̂ n/λ.

Plugging (3.16) into (3.10), and using that r̂² ∧ g(r̂) = r̂² with probability converging to one (since g(r̂) ≥ g'(0) r̂ and r̂ = o_P(1)), gives us

(3.17)  q r̂² ≲_P (λ/n) √s r̂ + o_P(q) r̂²,  or equivalently  r̂ ≲_P λ√s/(qn),

which verifies the rate bound (3.7). Finally, inserting (3.17) into (3.16), we obtain √m̂ ≲_P √s (μ/q), which verifies the final bound (3.8) on m̂. □
3.2. Model Selection Properties. Next we turn to the model selection properties of the estimator.

Theorem 2. If the conditions of Theorem 1 hold, and if the non-zero components of β(u) are separated away from zero, namely

(3.18)  min_{j ∈ support(β(u))} |β_j(u)| ≥ ℓ t √((s/n) log(n ∨ p) φ(m_0 + s)) (μ/q²),

for some diverging sequence ℓ of positive constants, ℓ → ∞, then with probability approaching one

(3.19)  support(β(u)) ⊆ support(β̂(u)).

Moreover, the hard-thresholded estimator β̄(u), defined by

β̄_j(u) = β̂_j(u) 1{ |β̂_j(u)| > ℓ' t √((s/n) log(n ∨ p) φ(m_0 + s)) (μ/q²) },

where ℓ' → ∞ and ℓ'/ℓ → 0, satisfies, with probability converging to one,

support(β̄(u)) = support(β(u)).

Theorem 2 derives the model selection properties of the ℓ1-penalized quantile regression. These results parallel analogous results obtained by Meinshausen and Yu [26] for the ℓ1-penalized mean regression. The first result says that in order for the support of the estimator to include the support of the true model, non-zero coefficients need to be well-separated from zero, which is a stronger condition than what we required for consistency. The inclusion of the true support is in general one-sided; the support of the estimator can include some unnecessary components having true coefficients equal to zero. The second result describes the performance of the ℓ1-penalized estimator with an additional hard thresholding, which does eliminate the inclusion of such unnecessary components. However, the value of the right threshold explicitly depends on the parameter values characterizing the separation of the non-zero coefficients from zero.
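Operationally, the hard-thresholding step is a one-liner; a sketch (ours), where the threshold tau stands for the ℓ'-scaled rate of Theorem 2 and must be supplied by the user:

```python
import numpy as np

def hard_threshold(beta_hat, tau):
    """Hard-thresholded estimator of Theorem 2: zero out every component
    of beta_hat with absolute value at most tau."""
    beta_bar = beta_hat.copy()
    beta_bar[np.abs(beta_bar) <= tau] = 0.0
    return beta_bar
```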
Proof. (Theorem 2) The result on the inclusion of the support stated in equation (3.19) follows from the separation assumption (3.18) and the inequality ‖β̂(u) − β(u)‖_∞ ≤ ‖β̂(u) − β(u)‖. Indeed, by Theorem 1 we have, with probability going to one,

(3.20)  ‖β̂(u) − β(u)‖_∞ ≤ ‖β̂(u) − β(u)‖ < min_{j ∈ support(β(u))} |β_j(u)|,

where the last inequality follows from the rate result of Theorem 1 and from the separation assumption (3.18). Next, the converse of the inclusion event (3.19) implies that ‖β̂(u) − β(u)‖_∞ ≥ min_{j ∈ support(β(u))} |β_j(u)|. Since the latter can occur only with probability approaching zero, we conclude that the event (3.19) occurs with probability converging to one.

Consider the hard-thresholded estimator next. Let r_n = t √((s/n) log(n ∨ p) φ(m_0 + s)) (μ/q²). To establish the inclusion, note that by Theorem 1 and the separation assumption (3.18),

min_{j ∈ support(β(u))} |β̂_j(u)| ≥ min_{j ∈ support(β(u))} { |β_j(u)| − |β̂_j(u) − β_j(u)| } ≳_P ℓ r_n − r_n,

so that min_{j ∈ support(β(u))} |β̂_j(u)| > ℓ' r_n with probability going to one, by ℓ' → ∞ and ℓ'/ℓ → 0. Therefore, support(β(u)) ⊆ support(β̄(u)) with probability going to one.

To establish the opposite inclusion, consider the quantity

ε_n = max_{j ∉ support(β(u))} |β̂_j(u)|.

By Theorem 1, ε_n ≲_P r_n, so that ε_n < ℓ' r_n with probability going to one, by ℓ' → ∞. Since all components smaller than ℓ' r_n are excluded from the support of β̄(u) by the hard-threshold rule, we have that support(β̄(u)) ⊆ support(β(u)) with probability going to one. □
3.3. Two-Step Estimator. Next we consider the following two-step estimator, which applies the ordinary quantile regression to the selected model. Let T̂ be a model, that is, a subset of {1, ..., p}, selected by a data-dependent procedure. We define the two-step estimator β̂^T̂(u) as a solution of the following optimization problem:

(3.21)  β̂^T̂(u) ∈ arg min_{β ∈ ℝ^p : β_j = 0 for j ∉ T̂}  Q̂_u(β).

In this problem we constrain the components of the parameter vector β that were not selected to be zero; or, equivalently, we remove the regressors that were not selected from further estimation. Moreover, we no longer use ℓ1-penalization.
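A sketch of (3.21) (ours): refit ordinary quantile regression on the selected support, here implemented by calling any penalized-QR routine, such as the LP sketch of Section 2.2, with λ = 0; this wiring is an illustrative implementation choice, not a prescription of the paper:

```python
import numpy as np

def two_step_qr(y, X, beta_hat, u, solver):
    """Two-step estimator (3.21): ordinary quantile regression on the
    support selected by the penalized fit. `solver(y, X, u, lam)` is any
    penalized QR routine (e.g., l1_penalized_qr from Section 2.2)."""
    T_hat = np.flatnonzero(beta_hat)          # selected model T_hat
    beta_T = solver(y, X[:, T_hat], u, 0.0)   # lam = 0: no penalization
    beta = np.zeros(X.shape[1])
    beta[T_hat] = beta_T
    return beta
```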
Theorem 3. Suppose that conditions E.1, E.2, and E.5 hold. Let T̂ be any selected model that contains the true model T_u with probability converging to one, and whose dimension satisfies |T̂| ≲_P s. Then

‖β̂^T̂(u) − β(u)‖ ≲_P (1/q) √((s/n) log(n ∨ p) φ(s)),

provided the right side converges to zero in probability.

Under the conditions of the theorem, we see that the rate of convergence of the two-step estimator is generally faster than the rate of the one-step penalized estimator, unless φ(m_0 + s) ∼_P φ(s), in which case the rate is the same. It is also helpful to note that when φ(s) ≲_P 1 and q ∼ 1, we have ‖β̂^T̂(u) − β(u)‖ ≲_P √((s/n) log(n ∨ p)).
Proof. (Theorem 3) Let r̂ = ‖β̂^T̂(u) − β(u)‖. By the definition of β̂^T̂(u), and since T_u ⊆ T̂ with probability approaching one, we have that, with probability approaching one,

Q̂_u(β̂^T̂(u)) − Q̂_u(β(u)) ≤ 0.

First note that, since |T̂| ≲_P s, by Lemma 11 we have that φ(|T̂| + s) ≲_P φ(s). Applying condition E.5 to control the empirical error in the objective function, we get that

Q_u(β̂^T̂(u)) − Q_u(β(u)) ≲_P √((s/n) log(n ∨ p) φ(|T̂| + s)) r̂ ≲_P √((s/n) log(n ∨ p) φ(s)) r̂.

Invoking the identification condition E.2, we obtain that

(3.22)  q (r̂² ∧ g(r̂)) ≲_P √((s/n) log(n ∨ p) φ(s)) r̂.

Since we assumed that √((s/n) log(n ∨ p) φ(s)) = o_P(q), this implies that r̂ = o_P(1), and hence that r̂² ∧ g(r̂) = r̂² with probability approaching one. Therefore, as in the proof of Theorem 1, we can refine the bound (3.22) to

q r̂² ≲_P √((s/n) log(n ∨ p) φ(s)) r̂,  or  r̂ ≲_P (1/q) √((s/n) log(n ∨ p) φ(s)),

proving the result. □
4. Analysis of the Pivotal Choice of the Penalization Parameter. In this section we show that, under some conditions, the pivotal choice of the penalization parameter λ proposed in Section 2.3 satisfies the theoretical requirements needed to achieve the rates of convergence stated in Theorem 1.

Recall that the true rank scores can be represented almost surely as

(4.1)  a_i^*(u) = (u − 1{u_i ≤ u}),  for i = 1, ..., n,

where u_1, ..., u_n are i.i.d. uniform (0, 1) random variables, independently distributed from the regressors X = (x_1, ..., x_n). Thus,

(4.2)  Λ_n = n ‖E_n[x_i (u − 1{u_i ≤ u})]‖_∞,

which has a known distribution conditional on X. Let the regularization parameter λ(X) be defined as

λ(X) = inf{λ : P(Λ_n ≤ λ | X) ≥ 1 − α_n},  with  α_n = (n ∨ p)^{−t̄²},

for some sequence t̄ → ∞.
Theorem 4. Assume that there exists a sequence c_{n,p} such that, uniformly in j = 1, ..., p,

(4.3)  max_{1≤i≤n} |x_{ij}| ≤ c_{n,p} (E_n[x_{ij}²])^{1/2},  with  c_{n,p} √(log(n ∨ p)/n) → 0.

Moreover, assume that φ(m_0 + s) ∼_P φ(1), q ∼ μ, and t̄ √((s/n) log(n ∨ p) φ(1)) / q →_P 0, where m_0 = p ∧ (n/log(n ∨ p)). Then λ = λ(X) satisfies the assumptions on the regularization parameter imposed in Theorem 1, namely

(4.4)  λ = t √(n log(n ∨ p) φ(m_0 + s)) (μ/q)  for some  t →_P ∞,  and  λ√s/(qn) →_P 0.
We will use the following inequalities of Stout [34], Theorem 5.2.2. Let {X_i, i ≥ 1} denote a sequence of independent random variables with zero mean and finite variances, and let S_n = Σ_{i=1}^n X_i and s_n² = Σ_{i=1}^n E[X_i²] for each n ≥ 1. Suppose |X_i| ≤ c s_n almost surely for all i ≤ n and n ≥ 1. Then, for ε > 0, the condition εc ≤ 1 implies that

(4.5)  P(S_n/s_n ≥ ε) ≤ exp(−(ε²/2)(1 − εc/2)),

and, for each γ > 0, there exist constants ε(γ) and π(γ) such that if ε ≥ ε(γ) and εc ≤ π(γ), then

(4.6)  P(S_n/s_n ≥ ε) ≥ exp(−(ε²/2)(1 + γ)).
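The upper-tail inequality (4.5) can be checked numerically; a quick sketch (ours) for Rademacher variables, for which |X_i| = 1 and s_n = √n, so that c = 1/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, eps = 200, 50_000, 2.0
# Rademacher variables: zero mean, |X_i| = 1, s_n = sqrt(n), c = 1/sqrt(n)
S = rng.choice([-1.0, 1.0], size=(reps, n)).sum(axis=1)
emp_tail = np.mean(S / np.sqrt(n) >= eps)
c = 1.0 / np.sqrt(n)
bound = np.exp(-(eps ** 2 / 2) * (1 - eps * c / 2))   # right side of (4.5)
print(f"empirical tail {emp_tail:.4f} <= bound {bound:.4f}")
```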
Proof. (Theorem 4) We need to establish upper and lower bounds on the value of λ. We first establish an upper bound. Let v_j² = Σ_{i=1}^n x_{ij}², and note that n φ(1) = sup_{j≤p} v_j². Next observe that Var(Σ_{i=1}^n x_{ij} a_i^*(u) | X) = u(1 − u) v_j². Note that by (4.3) we have sup_{j≤p, i≤n} |x_{ij} a_i^*(u)| ≤ c_{n,p} v_j / √n. Moreover, for n large enough, condition (4.3) also implies that

(4.7)  2 c_{n,p} (t̄ + 1) √(log(n ∨ p)) / √(u(1 − u) n) ≤ 1/2.

Under (4.7), we can apply (4.5) with ε = 2(t̄ + 1) √(log(n ∨ p)) and c = c_{n,p} / √(u(1 − u) n) to obtain that, for every j = 1, ..., p,

(4.8)  P( |Σ_{i=1}^n x_{ij} a_i^*(u)| ≥ √(u(1 − u)) v_j ε | X ) ≤ exp(−(ε²/2)(1 − εc/2)) ≤ exp(−(t̄ + 1)² log(n ∨ p)).

Therefore, since √(n φ(1)) ≥ v_j, we have

(4.9)  P( |Σ_{i=1}^n x_{ij} a_i^*(u)| ≥ √(u(1 − u) n φ(1)) ε | X ) ≤ (n ∨ p)^{−(t̄+1)²}.

Next note that, using (4.9), we have

(4.10)  P( Λ_n ≥ √(u(1 − u) n φ(1)) ε | X ) ≤ p max_{j≤p} P( |Σ_{i=1}^n x_{ij} a_i^*(u)| ≥ √(u(1 − u) n φ(1)) ε | X ) ≤ p (n ∨ p)^{−(t̄+1)²} ≤ (n ∨ p)^{−t̄²} = α_n.

Since P(Λ_n > λ | X) is decreasing in λ, we conclude that

(4.11)  λ ≤ √(u(1 − u) n φ(1)) ε ≤ 2(t̄ + 1) √(n log(n ∨ p) φ(1)).

Next we turn to establishing the lower bound. Let j_n ∈ {1, ..., p} denote an index such that v_{j_n}² = n φ(1). Fix γ = 1, which implicitly fixes ε(γ) and π(γ), and set ε = t̄ √(2 log(n ∨ p)/(1 + γ)) and c = c_{n,p} / √(u(1 − u) n). Since ε diverges, we have ε ≥ ε(γ), and, by (4.3), εc ≤ π(γ) for n large enough. Therefore we can apply (4.6) to obtain

(4.12)  P( |Σ_{i=1}^n x_{ij_n} a_i^*(u)| ≥ √(u(1 − u)) v_{j_n} ε | X ) ≥ exp(−(ε²/2)(1 + γ)) = exp(−t̄² log(n ∨ p)) = α_n.

Thus, taking into account that P(Λ_n > λ | X) is decreasing in λ, it follows that

λ² ≥ u(1 − u) v_{j_n}² ε² = 2 t̄² u(1 − u) n log(n ∨ p) φ(1) / (1 + γ).

Combining the two bounds, we have established that λ ∼_P t̄ √(n log(n ∨ p) φ(1)).

In order to verify (4.4), define t := λ / [√(n log(n ∨ p) φ(m_0 + s)) (μ/q)]. By φ(m_0 + s) ∼_P φ(1) and μ ∼ q, we have t ∼_P t̄ → ∞. Thus, the first result of (4.4) follows, and the second result of (4.4) follows from the assumption that t̄ √((s/n) log(n ∨ p) φ(1)) / q →_P 0. □
For concreteness, we now verify the conditions of Theorem 4 in our examples.

Example 5 (Isotropic Normal Design, continued). Let x_{·j} denote the n-vector associated with the j-th covariate, where x_{·1} is a column of ones representing the intercept. Next we use standard Gaussian concentration bounds; see [23], Section 3. For any value K we have

(4.13)  P(|x_{ij}| > K) ≤ exp(−K²/2).

In turn this implies that max_{1≤i≤n, 1≤j≤p} |x_{ij}| ≲_P √(log(n ∨ p)). Moreover, the vectors x_{·j} are such that

(4.14)  P( | ‖x_{·j}‖ − E[‖x_{·j}‖] | > K ) ≤ 2 exp(−2K²/π²),

and E[‖x_{·j}‖] ∼ √n. Combining these bounds we obtain min_{j=1,...,p} ‖x_{·j}‖ ∼_P √n. Therefore, condition (4.3) holds with c_{n,p} ∼_P √(log(n ∨ p)), provided t̄² log²(n ∨ p) = o(n). On the other hand, we have φ(1) ∼_P 1 and, by Lemma 14 and the definition of m_0, φ(m_0 + s) ≲_P 1 + √((m_0/n) log p) ≲_P 1, so that φ(m_0 + s) ∼_P φ(1). We also verify that q ∼ μ ∼ 1, as in Example 3. Theorem 4 also requires t̄² s log(n ∨ p) = o(n). Thus, Theorem 4 applies in this example.
Example 6 (Correlated Normal Design, continued). We analyze the correlated design similarly, using comparison theorems for Gaussian random variables, in particular Corollary 3.12 of Ledoux and Talagrand [23]. The upper bound for the case ρ > 0 follows from the result that, for z_{ij} ∼ N(0, 1) i.i.d. and fixed K ≥ 1,

(4.15)  P( max_{i,j} |x_{ij}| > K {(1 + |ρ|)/(1 − |ρ|)} ) ≤ P( max_{i,j} |z_{ij}| > K ).

(The case with ρ < 0 follows by changing the signs of x_{ij} for each even j and redefining the parameter β(u) for these components, so that after the transformation we obtain a design with ρ > 0.) The lower bound relies only on the independence within the components of each vector x_{·j}; since x_{ij} and x_{i'j} are independent for i ≠ i', we can invoke the same results as in Example 5. Therefore we obtain c_{n,p} ≲_P {(1 + |ρ|)/(1 − |ρ|)} √(log(n ∨ p)). In addition, φ(1) ∼_P 1 and, by Lemma 14 and the definition of m_0, φ(m_0 + s) ≲_P {(1 + |ρ|)/(1 − |ρ|)} (1 + √((m_0/n) log p)) ≲_P 1, so that φ(m_0 + s) ∼_P φ(1). We also verify that q ∼ μ ∼ 1, as in Example 4. Theorem 4 again requires t̄² s log(n ∨ p) = o(n) and t̄² log²(n ∨ p) = o(n) in this case. Thus, Theorem 4 applies in this example as well; we use this design in the Monte Carlo work of the next section.
5. Empirical Performance. In order to assess the finite sample practical performance of the proposed estimators, we conducted a Monte Carlo study and an application to international economic growth.

5.1. Monte Carlo Simulation. In this section we compare the performance of the canonical quantile regression estimator, the ℓ1-penalized quantile regression, the two-step estimator, and the ideal oracle estimator. Recall that the two-step estimator applies the canonical quantile regression to the model selected by the penalized estimator. The oracle estimator applies the canonical quantile regression on the minimal true model. (Of course, such an estimator is not available outside Monte Carlo experiments.) We focus our attention on the model selection properties of the penalized estimator and on the biases and standard deviations of these estimators.
We begin by considering the following regression model (see Example 1):

y = x'β(1/2) + ε,  β(1/2) = (1, 1, 1, 1, 1, 0, ..., 0)',

where x consists of an intercept and covariates x_{−1} ∼ N(0, I), and the errors ε are independent identically distributed, ε ∼ N(0, 1). We set the dimension p of the covariates x equal to 1000, the dimension s of the true minimal model to 5, and the sample size n to 200. We also consider a variant of the model above with correlated regressors, namely x_{−1} ∼ N(0, Σ), where Σ_{ij} = ρ^{|i−j|} and ρ = 0.5, as specified in Example 2. This design is noteworthy because it easily violates the condition of the uniform uncertainty principle, but it satisfies our conditions. We set the regularization parameter λ equal to the 0.9-quantile of the pivotal random variable Λ_n, following our proposal in Section 2.3.
A''(0, 1).
=
where
1)
We
is
much
is
particularly good.
larger
model than the
also see that in the design with correlated regressors,
quite good, as
we would expect from our
These results confirm the theoretical results of Theorem
2,
theoretical
namely, that
when
the
non-zero coefficients are well-separated from zero, with probability tending to one, the
penalized estimator should select the model that includes the minimal true model as a
BELLONI AND CHERNOZHUKOV
28
subset. Moreover, these results also confirm the theoretical result of
Theorem
namely that
1,
the dimension of the selected model should be of the same stochastic order as the dimension
of the true
minimal model. In summary, we that find the model selection performance of
the penalized estimator very well agree with our theoretical results.
We summarize results on the estimation performance in Table
1
.
We see that the penalized
quantile regression estimator significantly outperforms the canonical quantile regression, as
we would expect from Theorem
regressors
bias, as
is
and from inconsistency
1
larger than the sample
we would expect from the
The penalized
size.
of the latter
definition of the estimator
of coefficients from zero. Furthermore,
we
when the number of
quantile regression has a substantial
which penalizes large deviations
see that the two-step estimator improves
upon
the penalized quantile regression, particularly in terms of drastically reducing the bias.
The
two-step estimator in fact does almost as well as the ideal oracle estimator, as we would
expect from Theorem
harm
4.
also see that the (unarbitrary) correlation of regressors does not
the performance of the penalized and the two-step estimators, which
from our theoretical
results. In fact, since data-driven value of
the correlated case, as
8,
We
we would expect
A tends to be slightly lower for
we would expect by the comparison theorem mentioned
in
Example
the penalized estimator selects smaller models and also makes smaller estimation errors
than
in the canonical
of the penalized
uncorrelated case. In summary,
and two-step estimators to be
in
we
find the estimation
performance
agreement with our theoretical
results.
MONTE CARLO RESULTS
Example
Mean
QR
QR
Two-step QR
Oracle QR
Canonical
Penalized
norm
Bias
Std Deviation
1.6929
0.99
5.14
2.43
1.1519
0.37
5.14
4.97
0.0276
0.29
5.00
5.00
0.0012
0.20
table displays the average Co
norrn
Co
Mean
QR
QR
Two-.step QR
Oracle QR
We
Mean
25.27
Canonical
Penalized
deviation.
Isotropic Gaussian Design
992.29
Example
The
1:
2:
Co
/"i
Correlated Gaussian Design
norm
Mean
(i
norm
Bias
Std Deviation
988.41
29.40
1.2526
1.11
5.19
4.09
0.4316
0.29
5.19
5.02
0.0075
0.27
5,00
5.00
0.0013
0.25
and
fi
obtained the results
Taule 1
norm of the estimators as well as mean bias and standard
using 5000 Monte Carlo repetitions for each design.
PENALIZED QUANTILE REGRESSION
Hjstugram of the numbflr of imrv-zcro components
in
3(1/2)
Hisloi^rnin of Ihp immlx-r of correct r om(HinpnU stl<*u^
29
{/,
=
a.
p
-
f))
Fig 2. The figure siimmanzes the covariate selection results for the isotropic normal design example, based
on 5000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates selected out
of the possible 1000 covariates. The right panel plots the histogram for the number of significant covariates
selected; there are in total 5 significant covariates a,mongst 1000 covariates.
Histogram of the number of noivzero components
Fig
The
in
J(1/2)
HrstoRrtim of the number of coiTvrt eom|ionents selected (s
=
f>, /j
=
5)
summarizes the covariate selection results for the correlated normal design example with
= .5, based on 5000 Monte Carlo repetitions. The left panel plots the histogram for
the number of covariates selected out of the possible 1000 covari.ates. The ri,ght panel plots the histogram,
for the number of significant covariates selected; there are in total 5 significant covariates amongst 1000
covariates. We obtained the results using 5000 Monte Carlo repetitions.
3.
figure
correlation coefficient p
BELLONI AND CHERNOZHUKOV
30
5.2.
International Economic Growth Example.
we apply £i-penalized
In this section
quantile regression to an international economic growth example, using
method
for
model
We
selection.
primarily as a
it
use the Barro and Lee data consisting of a panel of 138
countries for the period of 1960 to 1985.
We
consider the national growth rates in gross
domestic product (GDP) per capita as a dependent variable y for periods 1965-75 and
we
1975-85.^ In our analysis,
allows for a total of
n
=
will consider
=
model with nearly p
90 complete observations. Our goal here
is
60 covariates, which
to select a subset of
these covariates and briefly compare the resulting models to the standard models used in
the empirical growth literature (Barro and Sala-i-Martin
One
[2],
of the central issues in the empirical growth literature
of an initial (lagged) level of
GDP
Koenker and Machado
is
the estimation of the effect
per capita on the growth rates of
GDP
per capita. In
Solow-Swan-Ramsey growth model
particular, a key prediction from the classical
[21]).
the
is
hypothesis of convergence, which states that poorer countries should typically grow faster
and therefore should tend
to catch
up with the
states that the effect of initial level of
pointed out in Barro and Sala-i-Martin
GDP
[3],
richer countries. Thus, such a hypothesis
on the growth rate should be negative. As
this hypothesis
regression of growth rates on the initial level of
GDP.
is
rejected using a simple bivariate
(In our case,
median regression yields
a positive coefhcient of 0.00045.) In order to reconcile the data and the theory, the literature
has focused on estimating the effect conditional on the pertinent characteristics of countries.
Covariates that describe such characteristics can include variables measuring education and
science policies, strength of market institutions, trade openness, savings rates
[3].
The theory then
of the initial level of
predicts that for countries with similar other characteristics the effect
GDP
on the growth rate should be negative
Given that the number of covariates we can condition on
size,
particular, past previous findings
came under
procedures for covariate selection. In
resolve the
([24]).
model
Since the
selection
comparable to the sample
fact, in
number
is
some
The growth
rate in
GDP
median
[31]).
of covariates
is
high, there
classical tools.
very large, though see
[24]
have
no simple way to
Indeed the number of
and
we use the
is
[31]
for
an attempt
lasso selection device,
regressions, to resolve this important issue.
over period from
t\
to t2
is
In
upon ad hoc
cases, all of the previous findings
to search over several millions of these models. Here
specifically the £i-penalized
([24],
severe criticisms for relying
problem using only the
possible lower-dimensional model
^
is
([3])
the covariate selection becomes an important issue in this analysis
been questioned
and others
coTnmonly defined as \og{GDP,^/GDPi^) —
1.
PENALIZED QUANTILE REGRESSION
Let us
now turn
parameter
A.
This
We
performed povariate selection using the £i-
initially
used our data-driven choice of penahzation
to our empirical results.
penalized median regressions, where
initial
we
31
choice led us to select no covariates, which
is
consistent with the
situations in which the true coefficients are not well-separated from zero.
to slowly decrease the penalization parameter in order to allow for
selected.
We
present the model selection results in Table
we
the choice of A,
select the black
openness) and a measure of pohtical
we
select
an additional
With a
in the table.
in the table.
We
With
some
the
covariates to
first
be
relaxation of
market exchange rate premium (characterizing trade
instability.
set of educational
third relaxation of A
refer the reader to
2.
We then proceeded
[2]
With a second
relaxation of the choice of A
attainment variables, and several others reported
we include
and
[3]
yet another set of variables also reported
for a
complete definition and discussion of
each of these variables.
We then
proceeded to apply the standard median and quantile regressions to the selected
models and we also report the standard confidence
and
5
we show
intervals for these estimates. In Figures 4
these results graphically, plotting estimates of quantile regression coefficients
P{u) and pointwise confidence intervals on the vertical axis against the quantile index u on
the horizontal axis.
that
we have
We
should note that the confidence intervals do not take into account
selected the models using the data. (In an ongoing
working on devising procedures that
we have
selected, the
will
median regression
account
for this.)
coefficients
We
on the
companion work, we are
find that, in all
initial level of
models that
GDP
is
always
negative and the standard confidence intervals do not include zero. Similar conclusions also
hold for quantile regressions with quantile indices in the middle range. In summary,
we
believe that our empirical findings support the hypothesis of convergence from the classical
Solow-Swan-Ramsey growth model. Of
methods
course,
it
would be good to
find formal inferential
to fully support this hypothesis. Finally, our findings also agree
the previous findings reported in Barro and Sala-i-Martin
[2]
and thus support
and Koenker and Machado
[21].
6.
model
Conclusion and Extensions.
In this
work we characterize the estimation and
selection properties of the £i-penalized quantile regression for high-dimensional sparse
models. Despite the non-linear nature of the problem, we provide results on estimation
and model
selection that parallel those recently obtained for the penalized least squares
estimator. It
is
likely that
M-estimation problems.
our proof techniques can be useful for deriving results for other
32
BELLONI AND CHERNOZHUKOV
,
MODEL SELECTION RESULTS FOR THE INTERNATIONAL
GROWTH REGRESSIONS
Penalization
Parameter
A
=
Real
GDP
1.077968
per capita (log)
is
included in
all
models
Additional Selected Variables
A
Black Market
A/2
Premium
(log)
',
Political Instability
Black Market
A/3
/
'
'
'
'"
J
(log)
Political Instability
Measure of
.
Premium
:.
tariff restriction
Infant mortality rate
Ratio of real government ''consumption" net of defense and education
Exchange rate
~
% of "higher school complete" in
% of "secondary school complete"
-
female population
male population
in
Black Market Premium
A/4
(log)
Political Instability
Measiu'e of tariff restriction
Infant mortality rate
Ratio of real government "consumption" net of defense and education
Exchange rate
% of "higher school complete" in
% of "secondaj-y school complete"
Female gross enrollment ratio
%
of "no education" in the
female population
in
male population
for higher
education
male population
Population proportion over 65
Average years of secondary schooling in the male population
Black Market Premium (log)
A/5
Political Instability
•,
Measure of
tariff restriction
Infant mortality rate
Ratio of real government "consumption" net of defense and education
Exchange rate
% of "higher school complete" in
% of "secondary school complete"
Female gross enrollment ratio
-
%
•
.
of "no education" in the
female population
in
male population
for higher
education
male population
Population proportion over 65
in the male population
Average years of secondarj' schooling
Growth
%
rate of population
male population
Ratio of nominal government expenditure on defense to nominal
Ratio of import to GDP
of "higher school attained" in
Table
For
GDP
2
sequence of penalization parameters we obtained nested models. All the
columns of the design matrix were normalized to have unit length.
this particular decreafung
PENALIZED QUANTILE REGRESSION
Real
Intercept
0,02
/
0.5
GDP
33
per capita (log)
t
\
\
/
1
/
0.4
'
y
0.3
0.2
0.1
-0.1
//
^/
0.2
--
- - - -
/
/
"^'^-0.02
\
\
^
\
-0.04
^
\
-
0.4
0.6
0.8
0.2
1
04
u
0.6
0.8
1
u
Black Market Premium
Measure of Political
(log)
/
instability
2
/
0.1
/
1.5
;
~ ~ - - - - 0.1
1
"J^
_ - -
0.2
/
;
0.5
/
/
0.3
______
^__
/
/
/
/
-0.5
0.2
0.4
0.6
0.(
0,2
0,4
0,6
0!
Fig 4. This figure plots the coefficient estimates and standard pointwise 90 % confidence intervals for
model associated with A/2 which selected two covariates in addition to the initial level of GDP.
the
BELLONI AND CHERNOZIIUKOV
34
Real
Intercept
GDP
pci
i.a])ita {loj;)
1
U.Ub
— ~ _ _ _ _
•'
0.5
"~
,
-0
02
OA
0.8
0.6
Black Market Premium
s.
N
2
1
6
0,4
Measure of
(ii}(i)
0.2
~
1
OB
1
Political instabililv
04
^
f
0,2
^.
-..-
.
/...
.
0-0 2
/
0.2
-0 4
0.2
04
06
08
0.2
1
6
4
u
u
M<^[isiUT of tariff rrstricrion
Infant mon.ality rate
0.2
0.4
0.6
8
1
8
1
u
Excliange rate
2
4
"f '^WRhcr
.-icfmol
6
0.6
2
1
"''
0.6
0.4
u
X 10"°
OS
1
u
complelc"
in
,q-i
female population
^
%
r,f
spiondary srhool rnmplele"
in
male population
/
-'~--''
0.2
0.4
6
0,8
6
1
2
0,4
5
0.8
1
Fig 5. This figure plots the coefficient estimates and standard point-wise 90% confidence intervals for the
model associated with A/3 which selected eight covariates in addition to the initial level of GDP.
PENALIZED QUANTILE REGRESSION
There are several possible extensions that we would
First,
we would hke
indices.
We expect
35
pursue
like to
to extend out results to hold uniformly across a
in the future
work.
continuum of quantile
that most of our results will generahze to this case with a few appropriate
modifications. Second, following van der Geer
we would
[37],
like to allow for regressor-
we would
specific choice of the penalization parameter. Specifically,
like to
consider the
following estimator:
(6.1)
.
where
a-j
= J^
p{u) e arg rnin E„ [p^iy,
Z]"=i ^fj-
The dual problem
max
E„
x^/3)]
To map
this to our previous
^
a,|^,|
associated with (6.1) has the form:
\En[x^Ja^]\<^aJ,
{u
+-
lj/,a,]
a€R"
(6-2)
-
~
1)
<
ai
<
u,
j
i
=
=
l,...,p,
I,
.
.
.
,n.
framework, we can redefine the regressors and the parameter
spaces via transformations Xij
=
Xij/a-j
=
and Pj{u)
analogous proof strategy. Third, we would
like to
aj(5j{u).
We
can then proceed with an
extend our analysis to cover non- sparse
models that are well-approximated by sparse models. In such a framework, the components
of ^(u) reordered by magnitude,
namely
|/3(i)(w)|
>
\P[2){u)\
exhibit a sufficiently rapid decay behavior, for example,
stants
R
and
threshold can
t.
Therefore, truncation to zero of
still
we
D.1-D.4 discussed
>
|/3(fc)(ti)|
\P(p-i){u)\
< Rk"^'^
for
>
|/?(p)(u)|,
some con-
components below a particular moving
lead to consistent estimation.
APPENDIX
In this section
all
>
A:
VERIFICATION OF CONDITIONS
verify that conditions E.1-E.5 hold
in Section 2
and
also hold
E.1-E.5
under the simple
much more
set of conditions
generally. For convenience,
we
denote by
S^-
=
{QeIRP:||Q||
the /c-sparse unit sphere in IR^. In what follows,
and
p.
=
l,||a||o<fc}
we show how
the key constants, such as q
appearing in E.1-E.5, are functions of the following population constants (which can
—
36
BELLONI AND CHERNOZriUKOV
.
r
.
"
.
possibly depend on the sample size n):
•
/•=
"2^
fy,\x,i^'P{u)\x),
f :=
sup
fy,\:c,iy\x),
:=
inf
E
/' :=
sup
_4|,_(y|x),
g{k)
[(Q'a;)^]
,
where values of y and x range over the support of
y,
and
x,
The
.
depend on the
results also
sparse eigenvalue of sample design matrix
4i{k)
:- sup En
already mentioned earlier. As an illustration
common
we compute the constants
7 (Isotropic
We
Normal Design, continued).
For concreteness assume that the errors are e
~
N{{), 1).
can compute the values of the several constants involved
/
= 1/V^ >
Example
ample
2.
>
=
0.39, /'
l/y27re
8 (Correlated
<
/i^v^ >
1/3,
Lemma
model
1
(2.1)
~
/
1/6,
=
Under
>
0.6, Q{k)
this simple design
=
1,
and
—
Lp{k)
1/V2tt
>
and ^{k) <
0.39, /'
j^
=
=
=
0.4,
1.
in
Ex-
relevant con-
l/x/^Tre
<
0.25,
3.
Conditions E.l (model sparseness)
we impose throughout, including
The
1/2.
<
l/\/27r
we
is
the key underly-
in condition D.2.
Lemmas
5 below establish the remaining conditions E.2-E.5.
(Verifying Condition E.2
-
We have that in the linear quantile
— l3{u)\\ = r,
— /?(u)||o < rn,
Identification).
under random sampling, for each
/?
£ R^
QuiP) - QuiPiu)) > qim)
(A.2)
:
\\/3
||/3
[r^
A
gir))
,
where
(A.3)
=
/
Consider next the design
and that p
7V(0, 1)
0.4,
Q{k)>V^ =
ing model assumption, which
and
<
\/V2tt
A.l. Verification of E.1-E.5.
1, 2, 3, 4,
^JtT/S
Example
revisit the design of
in the analysis:
Normal Design, continued).
=
bounded by /
=
0.25, 7(/c)
For concreteness assume that e
stants are
7(fc)
two
in (A.l) for
designs used in the literature.
Example
1.
[(q''x,)'
g{r)
=
r
and
q(m)
^
—^minO,
\-j^^[m)\
\.
—
PENALIZED QUANTILE REGRESSION
Thus, condition E.2 holds with q
E.2 holds with
=
q
q{n)
Proof. (Lemma
Knight
1)
under Conditions D.I-D.4, condition
q{n). In particular,
ci 1.
Let Fy^^ denote the conditional distribution of y given
w
any two scalars
[17], for
=
37
Prom
x.
and v we have that
rv
pu{w -
(A. 4)
- pu{w) = -v{u - l{w <
v)
0})
+
<z]-
{\{w
/
l{w < 0})dz.
Jo
Applying (A. 4) with
w=
=
y—x'(5{u) and v
we have that E [-v{u - l{w <
x'{f3-l3{u))
Using the law of iterated expectations and mean value expansion, we obtain
0.
Q„(/3)
-
Q„(/?(u))
= E
+
Fy\,,{x' p{u)
z)
-
E
>
|e
-
i(x'(/3
[{x'iP
£
=
[0, z]
Fy^,[x' (i(u))dz
(A.5)
>
for z^^^
0})]
- f E[|x;(/3 - /3(u))|3].
P{u)))'fyiAx'/3{u))]
- I3{um - f E[|a:'(/3
/3{u))\']
Next define
=
r„
By
(A.5)
sup |f
we have
Quif3{u)
:
+ fd) -
^f^~E [[x'df]
"
that
construction of
r,„
E[\a'x\^]
.
and the convexity
of Qu, for
'
3 /
any
/3
,
d G S^
'
"
•
.
for all
,
'
^.
^ 37
By
>
Q„(/3(u))
.
,
,
such that
||
P{u)\\
<
TTi,,
we
have that
Qu{P)-QuW{u)) > =E[{x'{f3 Letting
r
{
\\(3
—
f3{u)\\
=
r
first
QuiPiu))
P{u)\\
fMQuWiu) + r,nd)-QuW{u))
)
>
rfQ{m)rl
4
and
=E\ix'{P-(3{u))f\ >
' '
-^^
'
4
inequality holds by construction of r^; hence
QuiP) - Qu[p{u)) >
for
-
we have
M^ Quif3{u) + Tmd) -
where the
(3{u)))^]aI\\P
q{m) defined
in (A. 3).
—^
A^
>q(m)(r^Ar)
D
''
38
':
The
Lemma
•
lemmas
following two
'
BELLONI AND CHERNOZHUKOV
'
^
verify the Empirical Pre-sparseness condition.
2 (Verifying Condition E.3
ber of non-zero components of (3{u)
-
is
We
Empirical Pre-Sparseness).
num-
have that the
bounded by n/\p, namely
-
mu)\\o<nAp.
Suppose that
yi
,
.
y„ are absolutely continuous conditional onxi,
.
.
,
of interpolated points, h
Proof.
—
e
=
I,
\{i
we have
Trivially
with rows x[,i
=
.
.
.
,n, c
:
=
yi
<
{ue', (1
Let
p-
-
Y =
(yi,
u)e', Ae', Ae')',
.
.
.
and
Note that the penalized quantile
min
A =
ue'i+
+
(1
-
A
Matrix
has rank n, since
Tsitsiklis
components.
[6]
We
there
u)e'E,~
+
+
Ae'/3+
||/3(w)||o
of interpolated points.
Therefore,
-
[I
we have
||/3(u)||o
on
A' consider the
of vertices. Since c
>
set.
=
We
+ {n—
||/3"''(u)||o
-l-
By Theorem
||/5-(u)||o since
have that n
h)
<n
componentwise
if
=
0.
such that Y'{a^
conditional on
—
X
a^)
—
is
>
A
<
||/9(u)||o
this
unique solution with probability one.
n, see
[G].
Therefore,
\\P(u)\\o^h.
is
:
A'a
polyhedron
<
is
c]
Bertsimas
2.4 of
number
vertices
h
<
If
is
:
A' a
which has a
(i.e.,
is,
finite.
the
:
<
c}.
finite
Condi-
number
zero
is
always
A'a
<
c}
is
a
not unique there exist vertices a^,a~
Y
is
absolutely continuous
Therefore the dual problem has a
the dual basic solution
non-degenerate, that
n.
form of A' implies that {a
is
of non-zero
Let h denote the
0).
non-empty
a zero probability event since
and the number of
primal basic solution
nxn
h components of ^ and ^ are non-zero.
which leads to
the solution of the dual
This
where
Aw = Y
"^
polyhedron defined by {a
Therefore,
A'],
one optimal basic solution w* with at most n non-zero
feasible for the dual problem). Moreover, the
bounded
—
c'w
niin
To prove the second statement, consider the dual problem maxa{ya
tional
X
/
and / denotes the
Ae'/?-
has linearly independent rows.
at least
is
be the n x p matrix
defined P{u) as a basic solution with the minimal
components (note that
number
it
number
can be written as
regi'ession
C -^- + XP+ - Xp- = Y
and
X
,y„)',
(1,1,...,!)' denotes vectors of ones of conformable dimensions,
identity matrix.
,Xn, then the
.
x^f3{u)}\, is equal to \\/3{u)\\o with probability one.
||/9(u)||o
=
.
.
number
we have with probability one that
is
unique,
we have that the
of non-zero variables equals
||/3(w)||o
+
(n
—
h)
=
n, or that
-
D
PENALIZED QUANTILE REGRESSION
Prom
[6]
,
the complementary slackness condition of linear programming, see
we have
that for any component j G
Pj{u)
(A.6)
^
Pj{u)
>
only
<
,
.
,
E„
E„
if
Theorem
4.5 of
p}
.
.
if
only
[2;,,ai(ti)]
[xijai{u)]
= — and
"a
=
,
(2.6).
Lemm.'\ 3 (Verifying Condition E.3
For any A >
{1
.
where a{u) solves the dual problem
||;9(u)||o.
39
Empirical Pre-Sparseness, continued).
-
m =
Let
we have
m<
'n?(j)[m)
A2
Proof. Let
m=
k
=
||^(u)||o
^T we have
mX =
<
For any
\T\.
sign(^fc(u))
We
shall
=
fc
0.
6
from (A.6) we have
T",
Therefore, by Cauchy-Schwarz inequality
sign0{u)ysign0{u))X
||Xsign(^(u))i|||a(u)||
where we used that
||sign(;5(u))||
T — support(^(u)), and
{X'a{u))k = sign(^fc(u))A and, for
a{u) be the solution of the dual problem (2.6),
—
||sign(^(u))||o
= i/m we
have
mX <
^
<
sign0{u)y{X'a{u))
= (x sign0 {u)))' a{u)
V^^W^)\\signi(3{u))\\\\aiu)\\,
m. Since
||2(u)||
n^/m^im), which
need some additional notation
denote the score function
we have
in
what
<
i/max{u,
1
— u}n <
-
follows. Let
D
/:
,;
m-sparse vectors near to
for the ith observation. Define the set of
.
R{r,m):={PG-RP .m\o<m, \\0-p{u)\\<r},
and define the sparse sphere associated with a given vector
§(/3)
=
{a e IRP
:
||q.||
<
1,
support(Q)
C
/?
'
'
eo(m,n,p)
-^^
as
support(/3)}.
.-
Also, define the following empirical and linearization errors
(A. 7)
and
yields the result.
the true value P{u)
-
y/n,
suPaes-
|G„(qV,(/3(u),u))|,
-
ei(r,
m, n,p)
suP;geR(r,m),a£S(/3).
|G„(aVj(/3, u))
e2(r,
m,n,p)
sup/3efi(r,m),aeS(/3)
V^\E[a'^i{P,u)]
G„(a't/'i(/?(w),u))|,
-
E[aVj(/3(w),u)]|.
,:
.
40
.
where Gn
is
BELLONI AND CHERNOZHUKOV
.
the empirical process operator, that
Next we
verify condition E.4.
Lemma
4 (Verifying Condition E.4
\\/3{u)—P{u)\\.
We
have
.
ip{m)
<p
m
< n and
yjn
iJi{m)
4'{m), condition E.4 holds with
^—
I
n,
^
<P
-^[r
IJ^
4) It will
(2. 1)
under random sampling,
log(r7
V p)0(m) V ^Jn log(n V
jjl
= ^J ip{m){%J ip(rn) f
= n{n), namely
— Jnlog(n
^
,
<
\/
,
1
and
,Xn-
-;
p)ip{m.)
p)(j){m)'
>
1,
so that condition E.4 holds with
be convenient to define three vectors of rank scores (dual
'
'
.
,
— u — \{yi < x[l3{u)] for = 1,
=
ai{u) = u — l{y, < x\P{u)} for
the true rank scores, a*{u)
2.
the estimated rank scores,
3.
the dual optimal rank scores, 2(u) that solve the dual program
i
=
nE„
.
i
denote the support of p{u). Let x.^
sign(/3^(u))
.
=
{xij,j
G
T)',
'-,
i.e.
^M =
This
is
1,
.
.
,
.
n;
(2.6).
=
[Pj{u),j e T)'
we have that
=
||sign(/?^(u))||
"Eji [2;jf2i(-u)j
A
Therefore we can bound the number of non-zero components of
in (A.8).
n;
and Pf{u)
a:.-a,(u)
^-^
bound the empirical expectation
,
.
Fi'om the complementary slackness characterizations (A.G)
(A.8)
.
.
VI). Therefore, provided
1.
T
.
=
r
||/3(u)||o,
^
(I>{1)
variables):
Let
=
absolutely continuous conditional onxi,
M) + M-
In particular, under D.I-D.4, <p{m)
Proof. (Lemma
m
Let
^/m.
where
r,
o,i^e
model
^
+
^
Empirical Sparseness).
-
,yn
that, in the linear quantile
uniformly in
(G„(/) := n~^''^ Y17=\if{^i) ~E[/(X,)]).
.
and suppose thatyj,
Vrn <p jj{m) -{r Al)
is
/9(u)
provided we can
achieved in the next step by combining
the maximal inequalities and assumptions on the design matrix.
Using the triangle inequality
A\/m < nE„ i
II
-(ai(u)
L'-'
-
in (A.8), write
a,(u))|
+ nE„
Jllll
a;
-(af(zt)
L'-'
-
a*(u))|
+ nE„
Jllll
x..*a*(u)l
L^-*
Jll
.
.
PENALIZED QUANTILE REGRESSION
Then we bound each
we
Lemma
use
<
^/7^€o
To bound the
last
component,
(m,n,p) <p ^JnmAog(nyp) (^fp{m)V \/4'{m)
component, we observe that ai{u)
first
2 the penalized quantile regression
one. This implies that
\\nEn \x c;{ai{u)
To bound
in this display.
9 to get
||nE„ [a;.^a*(u)]||
To bound the
components
of the three
41
E„
-
[|2i(u)
-
fit
ai{u) only
< m/n.
we
Therefore,
<
nsup^ggm E„
<
n sup„g§^ ^/En
<
^n(f>{m)m.
if y,
m
can interpolate at most
ai(u)p]
ai(u))|
^
=
x'^P{u).
By Lemma
points with probability
get
-
[\a'xi\ \ai{u)
ai{u)\]
[\a'xi\'^]^En \\ai{u)
-
ai{u)[^]
the second component, note that
-
||nE„ \x^!f{a.,{u)
=
a*(u))j
||\/n
G„
[x.f{ai{u)
-
+
a*(ti)))
||
-
||nE [x.^(ai(u)
a*(u))]
||
||
< v^ei {r,m,n,p) +
<p sjnm. \o%(n V
where we use
Lemma 8
Setting fi{m)
=
and
Lemma
10 to
\f^p{m){-~/ipijri) f
V
1)
p)
bound
y/ne2 {r,m,n,p)
\/^[m) V 4>{m)
respectively ei(r,
> ^ip{m) and
follows.
n^J Lp{m){sJ ip{m)]r A
m,n,p) and
using that
m
< n
e2(r,
the
1)
m, n,p).
first
result
_
Under D.1-D.4 we
Lemma
5
<
(^(n)
1,
and 0(1) >
(Verifying Condition E.5
quantile model (2.1) under
-
1
D
and condition E.4 holds.
Empirical Error).
We
random sampling, and uniformly over
have
m
<
that,
n, r
in the linear
>
0,
and the
...
region R{r, m);
\Qu{f3)
+
- Qu{P) -
{Qu{f3{u))
^og("
- Qu{0[u))) <p V("^ +
^
[^^[m + s)v4'{m + s)
)
.
I
In particular, under D.I-D.4 we have that condition E.5 holds, namely
\Qu{I3)
- QuW) - [QuWiu]) - QuiPH)) <p -^Y^(m + s)log(nVp)0(m +
5)
I
uniformly over
m
<
n, r
>
0,
and
the region
R[r,m).
Proof. For convenience let e„ := \QuiP) - Qu{P) - P{u)\\, and ||/3 - P{u)\\o <m. + s we have that
11/3
-
-^lo
^i('^^'rn
+
s,n,p)dz=^ -^ei{r,m
.-
(<3u(/3(u))
+
s,n,p).
-
Qu(/3('")))
I-
Since r
•_
>
BELLONI AND CHERNOZHUKOV
42
The
Lemma
follows from
first result
Under D.1-D.4 we have
<
(/'(n)
8.
and 0(1) >
1
/
A. 2. Controlling Empirical and Linecirization Errors.
nical results of
Appendix A. 4
maximal
results provide the
submodels' dimensions m.
results
and
number
We
<
for a class of
which may be of some independent
n,
functions
J^T
VC
VC
Proof.
We
Tc
index bounded by
their
VC index
2.6.15.
0}
is
(for
cm
convenience
bounded by
Next consider / G
- l{p{Z) <
0})
— m.
The classes of functions
cr},
:Qe§(/3),support(/3)
VC
W :=
Z —
2; see, for
{x'a
:
c.
<
is
fixed
similar.
0} and l{p
0} G V.
index of the class of sets {{Z,
t)
x' (3}
:
:
and has cardinality m,
[38]
Lemma
form f{Z) := g{Z){l{h{Z)
in the
<
<
:= {l{y
example, van der Vaart and Wellner
which can be written
l{h
T
it is
C T} and V
support(a)
[y,x)). Since
and
C T}
support(a)
:
for some universal constant
let
eW,
where g
definition equal to the
m+
J-t
These technical
classes.
prove the result for !Ft, and we omit the proof for Qt as
C T}
by
dimension and the uniform covering
{1,2,... ,p}, \T\
= {a'{MP,u)~ij^iP{u),u))
Consider the classes of functions
support(/3)
VC
interest.
dimension of relevant functions
St = {QV^(/3(u),")
have their
These technical
ei.
(see, e.g., [38]).
Consider a fixed subset
6.
and
inequalities for a collection of empirical processes indexed
begin with a bound on the
Lemma
Here we exploit the tech-
to control the empirical errors eo
on the concepts of the
their usage rely
D
and condition E.5 holds.
1
VC
The
f{Z) <
t},
index of J^t
f e Tt^
t
is
<
by
E^. We
have that
{{Z,t):f{Z)<t}
.
-
"
Thus each
set {{Z,t)
:
t)
:
g{Z)
<
t}.
{iZ,t):g(Z){l{h(Z)<0}-l{piZ)<0})<t}
U
U
{iZ,t):h{Z)<0,p(Z)<0,t>0}u
{{Z,t):h(Z)<0,piZ)>0,g{Z)<t}U
U
{{Z,t):hiZ)>0,piZ)<0,giZ)>t}.
f{Z) <
complements of the basic
and {[Z,
=
=
sets
{Z
{{Z,t):h{Z)>0,p(Z)>0,t>0}U
t}
:
These basic
is
created by taking
h{Z) >
sets
0},
form
{Z p(Z) <
finite
VC
:
classes,
unions, intersections, and
0}, {t
>
0}, {(Z,
each having
VC
t)
:
g{Z) >
t],
index of order m.
,
PENy\LIZED QUANTILE REGRESSION
VC
Therefore, the
order as the
sum
index of a class of sets {(Z,
of the
der Vaart and Wellner
VC
[38]
t)
:
f{Z) <
indices of the set classes formed
Lemma
Lemma
For any
7.
^m,n,p
m<
e
i
by the basic
R is of the same
VC sets; see van
D
for function classes
m-dimensional subsets of a p-dimensional
all
f G Tf,
2.6.17.
Next we control the uniform L^ covering numbers
the union of
t},
43
generated by taking
set.
n, consider the classes of functions
= {a'{MP,u) - ^p^{(5{u),u))
Qm,n,v
=
:
G RP,
/?
< m, a G
||/3||o
and
S(/3)}
iaeS;^},
{a'i>i{P[u),u)
with envelope functions Fm,n,p o-nd Gm,n,p- For each
>
e
/16e\^('^'"~^) [
SUp7V(e||Fm,„,p||Q,2,-^m,n,p,i'2((5))
^
SUpiV(e||G^,„,p||Q,2,em,n,p,i^2(Q))
<C[
<^
j
(
ep\^
—
\
/leeN^^'^™"^^ /en
for
some universal constants
Proof. Let
components.
C
and
-^
c.
It follows
that
implies that the covering
its
VC
number
dimension
of J-t
is
at
most
bounded by
is
cm by Lemma
,,
<
In turn this
2(cm-l)
,.
[38]
Theorem
2.6.7. Since
{ep/m)"^ different restrictions T, the total covering number
is
Lemma
to control the empirical errors eo
8 (Controlling error ei).
We
<
n.
ej
as defined in (A. 7).
have that in the linear quantile model (2.1)
ei{r,m,n,p) <p v'mlog(n V p)
uniformly in r and m.
and
.
max
| V'(/?(77i),
y^(p{m)j
.
we
bounded
according the statement of the lemma. The proof for Gm,n,p follows similarly.
Next we proceed
.-
< C(cm)(16e) cm
C is an universal constant, see van der Vaart and Wellner
have at most (^)
6.
non-zero
.
'-|^\
N{e\\FT\\Q,2.J'T.L2{Q))
where
m
J-t denote a restriction of !Fm,n,p for a particular choice of
D
BELLONI AND CHERNOZHUKOV
44
Proof. By
7 the uniform covering
Lemma
18
Since
we have
of J-n,m,p
that uniformly in
-
|q' (t/.^(/3,^i.)
t/<i(/3(u), u))\
E„[/2]
(A. 10)
number
m,n,p) < supy^^^^
ei(r,
< En
using the definition of
=
[\a'x,f]
<?i(m,)
and
''^"'~
^
Prom Lemma
(ep/m)'". Using
m<n
\a'x^\ \l{y,
<
|(G„(/)|.
bounded by C(16e/e)
is
|G„(/)|<p\/mlog(r!,Vp)max|
sup
(A.9)
we have
definition of ej
<
x'^p}
-
Combining
<
l{y,
and E[/2] < E
(i>{m)
ip{m).
E[/2]V2,
sup
x;/?(u)}|
[|Q-'a;,|2]
E„[/2]i/2l
sup
<
(A. 10) with (A.9)
<
,,
\a'x,\,
>^[rn),
we obtain the
result.
a
Next we bound
Lemma
eo
using the
same
tools
we used
9 (Controlling empirical error eo)-
<p
eo[m,n,p)
uniformly in
bound
to
ei.
In the linear quantile model (2.1) we have
^m log(n V p) max | ^/ip{m)
,
\/(p(m)|
m < n.
Proof. The proof
is
similar to the proof of
Lemma S
with Qm,n,p instead of
thatfor^ e g„,,n.p^Qh&ve¥,rAg'^]=En\{a'^,{P{u),u)f]=Kn [{a'xif{\{y^ < x'^Piu)}
< E„
[[a'x^f]
<
Alternatively
on
4){1)
4){m) for
a e §™.
we could bound
instead of
proceed to bound
Lemma
all
eq
using
Note
J-m,n,p-
-
D
,
Theorem
5.2.2 of Stout [34] to achieve a
0(m) by making additional assumptions on the covariates
dependence
x^^-j.
Now we
£2.
10 (Controlling linearization error
e2(r,
(o)-
m, n,p) < v/nyi^(m)
We
f
have that in the linear quantile model
y'v?(m,)/r
A
1
^
uniformly in r
Proof. By
>
and
m <
definition
eoir.m.n.p)
u)^]
=
=
n.
we have
sup^gR(^„)
,,,gg(^)
vn|E[aVt(/3,
sup^6fl(r,m).a6S(/3) \/H|E[(a'2;,
)
«)]
(l{y,
-
E[aVj(/3(u),u)]|
<
x'^p}
-
l{y,
<
x[P{u)})]\.
PENALIZED QUANTILE REGRESSION
By
the Cauchy-Schwarz inequality the expression above
v^
By
sup JE[(q'x02]
aeS^
definition v'(m)
-
l{\y,
x'^P{u)\
E[(l{y,
<
<
=
-
-
I3{u))\},
1{?/.
<
bounded by
^E[[\{y^<x'^l3]-\{y,<x'^(3[^)]Y].
/36R(r,m)
sup^^gm E[(Q'xi)^]. Next, since
[x'iiP
x[P}
sup
is
45
|
<
\{y^
-
x'^P}
<
l{yi
x[P{u)]
\
<
we have
x',0iu)})^]
= E [|l{y, < </?} - l{y, < x'^P{u)}\]
< E l{\y,-x',p{u)\<\x',iP-0{um]
< E [/^i^yj(l')')| fyiAt + x'Mu)\x,)dt A
< (2/11^ - P[u)\\ sup^gsjr E [\a'x^\]) A 1
<
(2rf/^{^) A
l]
1.
n
A. 3.
Lemmas on
maximum
Sparse Eigenvalues.
fc-sparse eigenvalues that are
Recall the notation for the unit sphere
unit sphere Sp
maximum
We
=
{q G
M^
:
||a||
/c-sparse eigenvalue of
begin with a
lemma
=
In this section
£
>
1
11.
Let
M
=
S"""-^
1, ||a||o
M, namely
<
{a G IR"
k}. For a
(pM{k)
=
=
||a||
:
lemmas on the
matrix
sup{ a'
M,
Ma
:
and the
1}
let
earlier.
fc-sparse
4>M{k) denote the
a £ §p
that establishes a type of subadditivity of the
}.
maximum
sparse
.
be a semi-definite "positive matrix.
For any integers k and ik with
we have
(i>Mm<
Proof. Let a
We
collect
used in some of the derivations presented
eigenvalues as a function of the cardinality.
Lemma
we
achieve 4>M[£k). Moreover
can choose ai's such that
Since
M
is
||Qi||o
<
'
\(i\<t>M{k).
let
YllJ\ ^i
k since \C\k
positive semi-definite, for any i,j
>
ik.
=^
such that
•
•' -"
.^"
',
'..
SUi
ll^i'tllo
—
l|o'||o-
'
'
we have a[Mai + a'.Maj >
2
|Q'Mqj|
.
BELLONI AND CHERNOZ?IUKOV
46
Therefore,
,
.
we have
= a'Ma = J2^^^^i + Y.J2'^^^^J
(l)M{ik)
•
,
'-
:
•'
...;'•:
i=l
m
;
' <
:
.
-
^
...
mEi!«'ii'<^M(iia,iio).
.
.
1=1
Note that E,=i
The
11^,1^
=
1
lemmas
following
and thus
<
<PA./(£fc)
max,=i,...jf] 0a/(||q,||o)
\i]
characterize the behavior of the
We
case of correlated Gaussian regressors.
Let
(T^(/c).
12.
Consider
0(fc)
be the
Xi
=
T}l~Zi, where
,fi{k)
n X p matrix
Gk
{a, a)
it
<
a(k)
^ a'Za/^,
where
11^^11=
sup
z-,
i
(a, q)
=
Lemma
[38]
results
A. 2.1)
(^1
<
>
N{0,lp), p
n,
for k
[sio;^],
we have
that P{||5i.||
<
Then with
n.
+ /kJ^V^^
S"-^ Note
\aZ a/ y/n] =
for the
and sup^ggt tt'Ea <
n,
l,...,n as rows.
G S^ x
Using Borell's concentration inequality
ner
~
Zi
<
suffices to establish the result for k
collecting vectors
(p{k)
,
maximal k-sparse eigenvalue o/ E„
11
sup
n/2. Let
Z
be the
Consider the Gaussian process
that
J a'E„[ziz[]a= J (f){k).
Gaussian process
— median||5/;|| >
r}
(see
<
van der Vaart and Well-
e^"'" ". Also,
by
classical
on the behavior of the maximal eigenvalues of the Gaussian covariance matrices
German
y/k/n)
[13]), as
—
limit in
0.
n
-^ oo, for any k/n
Since k/n
[0, 1].
\/kn/n)
<
fore, for
any ro
0.
lies
Therefore,
within
[0, 1],
—
>
7 G
[0, 1],
we have that limt „(median]|5^-||
(see
—
1
—
any sequence k„/n has convergent subsequence with
we can conclude
This further implies
>
the
for
by establishing an upper bound on
start
probability converging to one, uniformly in k
Proof. By Lemma
\i](pM{k)-
maximal sparse eigenvalue
that holds with high probability.
Lemma
<
that, as
hmsup„
n
—
>
00, limsup;.^_„(median||^fc^
supj,<„(median]|5^,|]
there exists no large enough such that for
—
all
1
—
\/k/n)
<
n > no and
0.
all
—
||
1
—
Therek
<
n,
PENALIZED QUANTILE REGRESSION
P1
11^^.
>
II
1
+
sjk/n
p\
over k
< n we
by 2^_j
=
%Jck/n\ogp
one, uniformly for
sup^ggjc
a'Ea <
no,
>
1
+
>
1
+ Jk/^ + r,+ro\ < f^
+ r^ + ro
^/k/^
—
>
for c
as
>
all
containing
Zi
< (^6-""^/^
I
(fle""''^'-
V^V
fc=l
n -^
sup^ggit
k:
(J~(k),
and using that
1
We
cx).
(^)
<
p^
we bound the
,
J a'¥.n\ziz[]a
a{k){l
+
right side
conclude that with probability converging to
<
1
+
\/k/n^\ogp. Furthermore, since
we conclude that with probability converging
J a'Er,[x^x'^\a <
sup^gg^
for all k:
subvectors of
(^)
n >
sup Ja'Enlz^z'^a
"€S^
e'^~'^^'^'°SP
There are at most
obtain
fc=l
r/c
that for
sup ^a'¥.n\ziz'^a
TpI
Setting
e""'' /^.
[
we conclude
k elements, so
Summing
+ r + To <
47
to one, uniformly
D
y'k/n^/logp).
Next, relying on Sudakov's minoration, we show a lower bound on the expectation of
the
maximum
shows that the upper bound
result
Lemma
4>{k)
fc-sparse eigenvalue.
be the
we have
(1)
Consider
13.
—
is
do not use the lower bound
sharp
where
Y}l'^Zi,
in the analysis,
terms of the rate dependence on
in
~
Zi
maximal k-sparse eigenvalue o/E„
N{0,lp), and
[x^a;^],
inf^^gic
< n <
for k
fc,p,
a'Ea >
but the
and
n.
a^{k). Let
Then for any even k
p.
'
'
that:
•
E ,/m]>^^/{k/m^^EiP^)andi2)
X
Proof. Let
supremum
(A. 11)
:
h->
a'Xa/y/n^ where
of this Gaussian process
sup
'.'
Hence we proceed
the lower
Of)
bound
(1)
\a'
X a/ ^\ =
in three steps: In
Step
{a, a)
Xj,
>p
i
=
G S^ x
^^(fc/2) log(p 1,
.
,
.
.
S""-*.
n
as rows.
Note that
/c).
Consider
\/4>{k) is
sup
1,
J a'En[xix'^\a ^ J 4>{k).
we consider the uncorrelated
;'-,.•,.;
,.
case and prove
on the expectation of the supremum using Sudakov's minoration, using
packing number. In Step
(2)
"
>
a lower bound on a relevant packing number. In Step
the lower bound
''
/^
be the n x p matrix collecting vectors
the Gaussian process (a,
the
Xi
We
3,
we
generalize Step
on the supremum
itself
1
2,
we
derive the lower
bound on the
to the correlated case. In Step 4,
we prove
using Borell's concentration inequality.
-
1
'
BELLONI AND CHERNOZHUKOV
,48
Step
we consider the
In this step
1.
minoration.
By
sup^ggfc Za,
where a
£'[sup|^ggfc
We
a =
fixing
(1,
.
.
,
.
Zq ;=
i—*
= ^a^Zt -
Zs)
=
V^E[(Zt
-
d),
Z,)2]
= ^E1E„
the largest
number from below
of elements
in the
i
=
(ii,
remaining p
T
for e
.
.
.
—
,
=
ip)
-4=. In
order to do this
number
€ Sp such that
=
ij
we
-
[{x[{t
with respect to the metric d that can be packed into Sp, see
€
sup^^g^ E„[x^q]
We
will
consider the standard deviation metric on Sp induced by Z: for any t,s € Sp,
Consider the packing number D{e, Sp,
s,t
>
y/(p{k)
a Gaussian process on Sp.
is
E„[2;Jq']
/ and show the result using Sudakov's
S"~\ we have
l)'/\/H €
,.
[10].
=
IT"!
\\t
-
s\\/V^.
We will bound the packing
k components and
(^)
^
\\s-~tf-t\t,-s,\'>
of such elements. Consider
t
in at
Z
1+
j€svipport(t)
ji€
\support(s)
support(s)|
> k/2
for every s,t
eV. By the inequality
Using Sudakov's minoration
support (s)
I
> 2^1 =
I.
(A.12)
Theorem
([12],
4.1.4),
\V\
>
proving the claim of the
In this step
lemma
we show that
\V\
>
[p
—
:
tj
-
l/Vk}, which has
|support(<) \ support(s)|
<
k/2}.
shown below maxfgr
By
|-A/'(t)|
—
[p
>
k)^/'^.
we conclude that
k)/{2n),
/.
k)^^'^.
t
£
i
construction
<
d)
T with the set support(^), where support(^) =
cardinahty k. For any e T let Af{t) ^ {s e T
convenient to identify every element
{j e {I,.. .,p}
S=
for the case
|support(i)\
we have that D{l/y/n, Sp,
EfsupZJ >sup^yiogi?(e,S^,d) > J\ogD{l/^,%,d) > \jk\og[p -
Since as
any
\support(()
Furthermore, by Step 2 given below we have that
It is
=
most k/2 elements. In
V be the set of the maximal cardinality, consisting of elements in T such that
2.
1^
T
•
7=1
Step
,
restrict attention to the collection
such that the support of s agrees with the support of
(A.12)
\V\.
bound
of disjoint closed balls of radius
l/\//c for exactly
k components. There are
=
s))^]]
this case
Let
=
Za] from below using Sudakov's minoration.
d{s,t)
e
S=
case where
..
K
:=
:
we have that maxter
(fc/2)(''fc/2
)
^°^'
every
t,
\J^{t)\\V\
>
\T\.
we conclude that
\V\>\T\/K={l)/K>ip-k)'^/^.
It
remains only to show that
|A/'(i)|
<
{k/2l^~kl2
) Consider an arbitrary
any k/2 components of support(i), and generate elements
s
G
A/'(i)
t
E T. Fix
by switching any of
the remaining k/2 components in support(i) to any of the possible p
—
k/2 values. This
PENALIZED QUANTILE REGRESSION
gives us at
all
most
{^~^/2
other combinations of
combinations
is
the construction
Step
The
3.
elements s S N'{t). Next
^'^^^
)
(j|./2)-
we conclude that
S
case where
I'^
this
\-N'{t)\
way we generate every element
<
{k/2)i^'k/2
Using Borell's concentration inequality
A. 2.1) for the
E[\/(l){k)]\
>
supremum
r}
<
Lemma
1
and
14.
new
metric, d{s,t)
(see
||s
-
i||o
<
=
2fc.
van der Vaart and Wellner
we have
[38]
Lemma
P{\\/<f>[k)
—
D
which proves the second claim of the lemma.
Next we combine the previous lemmas
Examples
since
of the Gaussian process defined in (A. 11),
2e~"'" Z^,
G M{t). From
satisfies
dis,t)>a{2k)\\s-t\\/^
4.
s
)•
/ follows similarly noting that the
7^
^a'^{Zt-Z,) = ^E[{Zt-Zs)''l
Step
us repeat this procedure for
let
k/2 components of support (t), where the number of such
initial
bounded by
49
to control the empirical sparse eigenvalues of
2.
For k <
n,
under the design of Example
we have
1
^(fc)
~p
+J
1
°^^
.
For k <n, under the design of Example 2 we have
Proof. Consider Example
Let Xi-i denote the ith observation without the
1.
ponent. Write
first
com-
,;
E„
1
E„
[x,
We first bound 4>N{k)Lemma 12 implies that
[x',^_,
E„
= E„
Letting N^i^^i
<p
(?!>Ar(A;)
1
+
because (pN-i-Ak) >p yJ{k/2n)\og{p -
\\b\\
=
a'^
<
l
[a;t,_ix-
^k/ny^logp.
Lemma
13
M + N.
_i
we have
Xi,_iXj_j
(f)N{k)
=
4>N-i _i(^)-
bounds (p^ik) from below
k).
We then bound 4>M{k)- Since Mu = 1, we have
let w = {a,b'y achieve 0m(^) where a G IR, 6 e
= ^l < 1,
\\w\\o < k. Note that \a\ < 1,
4>M{.k)^w'Mw
=
E„
_i]
\a\'^
(?I>m(1)
>
IR''""'.
By
||5||i
<
1-
To produce an upper bound
definition
we have
||u;||
^\\b\\. Therefore
+ 2ab'En[xi^li]<l+2b'En[x,^^i]
+ 2||6||i||E„[a;,,_i]||oo<l + 2v^||6||||E„[xi,_i]||oo.
^
''r
1
=
1,
BELLONI AND CHERNOZHUKOV
50
Next we bound
+
(f)j^{k).
The
intercept.
The proof
fixed, the
On
—
(p.
_i]
|E„
E„
Since
[xij]|.
':
.
+
(pik)
—
cr^(/c)
=
sup^^^h
>
1
Example
V
(p^^-^
2 is similar
sup^ggt a'T,a
<
(1
as for the uncorrelated design to hold.
a'
Ma + a'Na <
an
-lik) since the covariates contain
with the same steps. Since — 1
S
+ |/3|)/(1 -
|p|)
=
inf^ggt
allows for the
(4.1.5)
Lemma
p
<
1 is
Lemmas
to apply
and a~{k)
<
12
a'Sa >
same bound
D
.
A. 4. Maximal Inequalities for a Collection of Empirical Processes.
is
=
,
a'{M + N)a = sup^ggt
To bound 4)m{^) comparison theorem
result of this section
for j
by using the bounds derived above.
the design of
\p\)-
N{0,l/n)
'
bounds on the eigenvalues of the population design matrix
|/3|)/(1
^
[xij]
<p y/jljnjlogp. Therefore we have 0a/ (/c) <p
||oo
'
Note that
result follows
for
[a;^
the other hand, (p{k)
and 13 are given by
^(1
||E„
n'iaxj=2,...,p
'
we bound
Finally,
=
•
+ 2y/k/^^/E^.
4>M{k)
||oo
by (4.13) we have
2,... ,p,
1
||E„ [xi^-i]
'
The main
maximal inequality that controls the empirical
18, stating a
process uniformly over a collection of classes of functions using class-dependent bounds.
We
need
lemma, because the standard maximal inequalities applied
this
function classes yield a single class-independent
We
prove the main result by
stating
first
bound that
Lemma
is
15, giving
to the union of
too large for our purposes.
a
bound on
tail probabilities
of a separable sub-Gaussian process, stated in terms of uniform covering numbers. Here
we want
to explicitly trace the impact of covering
numbers on the
tail probability,
since
these covering numbers grow rapidly under increasing parameter dimension. Using the sym-
metrization approach, we then obtain
Lemma
17, giving
a bound on
tail probabilities
of
a general separable empirical process, also stated in terms of uniform covering numbers.
Finally given a growth rate on the covering numbers,
we obtain our
final
Lemma
IS,
which
we repeatedly employ throughout the paper.
Lemma
15
(Exponential Inequality for Sub-Gaussian Process).
zero-mean separable process {G(/)
:
/ S
with a L2{P) norm, and has envelope
namely for each g G
J-
—
F
J"),
.
whose index
includes zero,
is
is
equipped
sub- Gaussian,
J-:
i2
D
T
Suppose further that the process
P{!'G(5)|>r?}<2exp(-^77Vi?Hg||p,2)
with
set
Consider any linear
a positive constant;
and suppose
that
we have
for any
r?
>
0,
the following upper
bound for the
:
PENALIZED QUANTILE REGRESSION
51
T
uniform L2 covering numbers for
<
sup7V(e||F||Q,2,^,i2(Q))
n(e, J^,L2) for each
>
e
0,
Q
where n(e,^, L2)
is
decreasing in 1/e.
—
increasing in 1/e, and €\/logn(e,^, L2)
K
Then for
> D,
for
—
as 1/e
>
some universal constant
<
c
cx)
>
anrf is
30, p{!F,P)
:
=
SUP/e^||/||p,2/||F||p,2,
sup^g^|G(/)|
I^I|P,2 /o"'^'''^^'
The
Lemma
result of
page 302, on
common
e-in(e,^,L2)-{(-^/^)'-i>de.
similar in spirit to the result of
is
Ledoux and Talagrand
[23],
of a process stated in terms of Orlicz-norm covering numbers.
15 gives a tail probability stated in terms of the uniform L2 covering
numbers. The reason
for
15
tail probabilities
Lemma
However,
>cK'^ <
V^ogn{x,J^,L2)dx
is
that in our context estimates of the uniform L2 covering numbers
function classes are more readily available than the estimates the Orlicz-norm
covering numbers.
In order to prove a
we need
to
bound on
tail probabilities
go through a symmetrization argument. Since we use data-dependent threshold,
we need an appropriate extension
Let us
call
of a general separable empirical process,
a threshold function x
of the classical symmetrization
:
IR"
IR /c-sub-exchangeable
1-^
lemma
if
in w,
we have that
=
particular x{v)
x{v)
\\v\\
\/
x{w)
with k
=
>
Lemma
and constant functions with k
\/2
random threshold x that
lemma
=
is
for probabilities
x{Zi,
.
.
,
.
with components
in v
.
,
.
.
—
The
following result
(Lemma
2.3.7 of [38]) to
1.
sub-exchangeable.
16 (Symmetrization with Data-dependent Threshold).
dependent stochastic processes Zi,
x{Z)
Z„ and
arbitrary functions
Consider arbitrary
/Lii,
.
.
.
,
/i.„
denote the r quantile of x{Z), pr := P{x{Z)
<
g,-)
>
and pr
r,
= P{x{Z)
have
i=l
where xq
is
:
JT
Zn) be a k-sub-exchangeable random variable and for any t G
>
T
xo
Vx(Z)
and
[x{v) \/x{w)]/k. Several functions satisfy this property, in
generalizes the standard symmetrization
the case of a
K"
any v,w E
for
any vectors v,w created by the pairwise exchange of the components
to allow for this.
<
—P
Y^ej{Z, -
PT
1=1
a constant such that infjgjrP
(E"=i
>
Hi)
4/c
r
Zi{f)\
< ^) >
xoVxiZ]
1
- ^.
<
»-*
]R.
in-
Let
(0, 1) let q-r
Qr)
<
r.
We
'
BELLONI AND CHERNOZHUKOV
52
Note that we can recover the
setting
fc
=
1,
pr
=
1)
=
and p^
symmetrization lemma where threshold
classical
underlying
e„{!F, Zi,
.
i.i.d.
,
.
.
Zn)
by
from combining the previous
follows
Consider a sep-
17 (Exponential inequality for separable empirical process).
G„(/) =
arable empirical process
fixed
/,'„
The next lemma
0.
two lemmas.
Lemma
is
^^"'^'^
K
data sequence. Let
~
I3"=i{/(^i)
>
and r £
\
random
be a k- sub -exchangeable
E[/(Zi)]}, where Zi,...,Zn
(0,1) be constants,
and
is
e„(J^, P^)
an
—
variable, such that
-
'
rp(.?-,Pn)/4
I
J\ogn[t,J^,L2)de
|lF||p„,2 /
<
T
Jo
>
Lemma
as in
Finally, our
main
15, then
<-Ep(
P<^sup|G„(/)|>4fccA'e„(^,P„)
result in this section
is
1,
.
.
.
is
an underlying
,n with envelopes
Fm =
as follows.
er,(.Fm,Pr2)
1
=
and u >
all
=
m
'n~^^'^
is
1,
.
.
Y17=i{fiZi)
"
Consider a
E[/(Z,)]}, where
(nVp)'"(K/e)""^,
1.
For a constant
<
.
,n,
and with upper bounds on the
C
sup
<
<
e
:= (1
+
||/i|p,2,
1,
\/2t')/4 set
sup
K
>
.for all
m.
a large enough constant
||/||p„,2
\/2/S, for
n
sufficiently
large, the inequality
sup |G„(/)|
holds with probability at least
Now we
prove
Lemmas
1
—
< 4V2cKe^{J^m,
5,
15, 16, 17,
Pn),
where the constant
and
18.
=
by
=
CJm log(r) Vp) max
Then, for any 5 £ (0,1), there
=
Gn{f)
supjg_;r^ \f{x)\,m
n(e,.f„,,L2)
some constants k >
+T.
data sequence, defined over function classes !Frn,Tn
i.i.d.
uniform covering numbers of J-m given for
with
e-'n{e,T,L2r^^ ''Ue Al
j
18 (Maximal Inequality for a Collection of Empirical Processes).
collection of separable empirical processes
Zi,...,Zn
^
f^T
for the same constant c
Lemma
< -(4fcc/^e„(^,P„))
e„(.F,P„) and snpvar^f
c is the
<
n,
same
as in
Lemma
15.
PENALIZED QUANTILE REGRESSION
Proof of
[36],
Lema^ia 15.
The proof follows by
specializing
53
arguments given van der Vaart
page 286, to the sub-Gaussian processes and also tracing out the bounds on
tail
prob-
abilities in full detail.
Step
1,
is
.
.
.}
1.
There
exists a sequence of nested partitions of j^, {{^qi,
where the q-th partition consists of
sets of I/2(P) radius at
the largest positive integer such that 2~'°
<
balls of
To construct the
here:
L2{P) radius
e.g.
=
1,
most
P)/4 so that
p{J-,
such partition follows from a standard argument,
we repeat
i
qo
van der Vaart
q-th partition, cover J" with at
.
.
=
>
The
2.
qo,qo
+
where go
||F||p,22~'^,
existence of
page 286, which
[36],
=
most Uq
and replace these by the same number
||F||p,22~''
,Nq),q
.
n(2~'',^, L2)
of disjoint sets. If
the sequence of partitions does not yet consist of successive refinements, then replace the
partition at stage q by the set of
most
into at
Let
=
A''^
Uqg
-Uq
all
intersections of the form
sets, so
be an arbitrary point of
fqi
we can
of the process,
~
J-qi.
N^ —
Set iTq{f)
—
E
and |G(/)| < E^<,o+i "^ax/
00
Step
2.
By
E^4<
Setting
21ogA^g
rjq
>
|G(7r,(/)
E
E
7r^o(/)
=
GK(/)-7r,_i(/)),
-
^,_i(/))|
+ max/
|G(^,J/))|. Thus
/-
P
-.
max|GK(/)-7r,_i(/))|>7?J
P|max|G(^,„(/))|>r?,„
-*
9=90 + 1
•-
chosen below.
'
-
7r,_i(/)||p,2
<
2||F||p.22-(^-i)
<
NqNq-i > logUq, using that
q
1-^
'.
4||F||p,22-^ for g
8A'||F||p^22"'v^log7Vg, using sub-Gaussianity, setting
log
of the process
(?=go + l
'
—
separability
we can decompose / —
set only. In this case,
construction of the partition sets
||7r,(/)
By
J-q^.
supremum norm
GK(/))-GU,_:(/))=
00
9=90
77g
/ G
if
fqi
9=90+1
for constants
This gives partition
^9-i(/))- Hence by linearity
GU)-G{^q,U))=
'{sup|G(/)|>
^^^
.Fjj.
Yl^^qa lognj.
replace !F by Uq^ifqi, since the
can be computed by taking this
E^go+i('^'?(/)
that log
rf^
logn^
is
>
K
.
go
+
1-
> D,
increasing in
q,
'
using that
and 2~* <
54
:
.
BELLONI AND CriERNOZHUKOV
.
p(^, P)/4, we obtain
OO
•
•
<-
DO
-v
.
^
.
P
J2
=90 + 1
<
max|G(7r,(/)-^,_i(/))|>77J
,
oo
„
.
'
:
7V,iV,_i2exp (-77,V(4D||F||p,22-^)=
9=90 + 1
-'
-^
'^
<
-
:
Yl N,N,^r2exp{-{K/Df2\ogN,)
9=90 + 1
E
^
'
..
^
,,
;
2exp(-{(K/Z?)2-l}log(A^,A^,_i))
OO
'<
^..-^
,
2exp(-{(A7D)'-l}logn,)
/oo
•
•
-
.
2exp(-{(A-/Z?)2-l}Iogn,)dg
/
'90
.
rp{T.P)/4
Jo
By
<
Jensen's inequality y^oglV^
a, := J2'j=qo
oo
oo
E+
?=qo
Letting
bq
=
2
V^ogUq, so that
^
^9^8
2~', noting a^+i
A'||F||p,22-%.
9=90 + 1
l
—
—
aq
sJlogUq+i and 6^+1
—
—
bq
—2"'',
we
get using
summation by parts
oo
CO
^
2"''a,
=
g=go + l
(6,+i
-
6,)a,
= -
00
.
^log 7T.(jo+i Using that 2"''\/logn^
.
2
2-'yi^<
f;
Using a change of variables and that
oo
Step
3.
Letting
r]q^
A'||F||p,2p(J^,
*
as g
—
+
2-^0^^,
9=90 + 1
oo, so that
— cig^gl^+i
decreasing in g by assumption,
•^'JO
2~'>°
16
=
is
—
^
r2-'^^logn[2-i,J^,L2{P))dq.
2
9=90 + 1
00
/
''y^logn^
2
o,;
=2
2.2-('+i)ybi^^
9=90 + 1
where we use the assumption that
-
\
^
2.2-"°+iVlogn,„+i+
V
2
6,+i(a5+i
9=90 + 1
\
/
2-(9o+i)
^
-
a,&,|^+i
(
9=90+1
=
oo
f
^
-
<
p{J-, -P)/4,
we
finally
conclude that
/•p(^,-P)/4
P) ^2
log Nq^ recalhng that Nq^
,
=
n^^, using ||7rqo(/)||p,2
<
PENALIZED QUANTILE REGRESSION
55
and sub-Gaussianity, we conclude
|F||p,2
>
p{ma^|G(7r,„(/))i
<r
2exp
(
^
Jqo-l
=
Also, since Uq^
<
Vgo}
Vexp {-{K/Dflogrig)
-{{K/Df -l}\ogna)dq=
'
n(2 '°,^, F), 2
''o
<
r
< 2exp[-{iK/Df - l}logn,
{x\n2)-^2n{x,J',L2{P)r^^'^^^^^-^Ux.
'
•/p(^,P)/4
p{J^,P)/A, and n{x,!F,P)
is
increasing in 1/x,
we
obtain;
=4i^||F||p,2[p(^,P)/4]^21ogn(2-9o,^,P) <4V2A'||F||p,2
775,
^\ogn{x,J^,P)dx.
J/U
Step
tail
4. Finally,
bound stated
using c
=
adding the bounds on
in the
16/log2
main
+ 4\/2 <
^
adding bounds on
text. Further,
30,
<
77,
probabilities from Steps 2
tail
rjg
and
3
we obtain the
from Steps 2 and
3,
and
we obtain
^\ogn{x,:F,L2{P))dx.
cK||F||p,2y^
9=90
D
Proof of Lemma
(page 112) in
[38]
16.
The proof proceeds analogously
with the necessary adjustments. Letting
to the proof of
q^
Lemma
2.3.7
be the r quantile of x{Z) we
have
F
> XoVxiZ)} < P{x{Z) >
!
=1
first
Z =
pendent copy of
Z
such that x{Z)
|X]r=i ^i{fz)\
>
term of the expression above. Let
(Zi,
>
q-r
.
.
.
and
pr,
>
||Z!r=i ^iWj^
= Pt/2-
.
,
.
.
F„) be an inde-
E(^' 1=1
{Z
:
and using the triangular inequality we
f
left
>
T
xq
}
>
hand
x{Z) >
^'^
such that
.
I
Therefore, over the set
J-
.
we have inf/g;rF{|Er=i^i(/) <
T^Py
(V'l,
V x{Z). Therefore 3fz £
Z
;
.
by Bonferroni inequality we have that the
pT —Pt/'2'
xq
xq\/ x(Z). Conditional on such
of xo
Y —
,Z„), suitably defined on a product space. Fix a realization
have that
By definition
<qr}.
.r
Next we bound the
of
>Xoyx{Z)) +P{x{Z)
Qr,
T
l-Pr/2. Since Fy- {x(y)
side
is
<
qr)
=
bounded from below by
qr, ||E"=i ^illjr
V x{Z) V x{Y)
>
^o Va;(Z)}
we have
BELLONI AND CHERNOZHUKOV
56
Z we
Integrating over
\p{x{Z)>qr,
obtain
>xoVa;(Z)^ < PzP^
E^.
E(^' -
Vx(Z)
xo
>
^^;
\l
x{Y)
T
Let
£i, ...,£„
set {Yi
Y
£i,
and
.
,£n,
.
.
be an independent sequence of Rademacher random variables. Given
— Yi,Zi = Zi) if ^i = 1 and {Yi — Z^, Z^ = Yi)
Z by pairwise exchanging their components;
{y, Z) has the same distribution as
Y.^Y, -
PzPy
>
Z,,
-^^ -^'^]^
-^^'-^
^
if
=
e^
—1. That
is,
we
ei,
.
.
.
,
£„,
create vectors
by construction, conditional on each
(F, Z). Therefore,
= E^P^Pr
E(^' -
xo
\l
x{Z)
\l
x{Y)
^')
T
By
x(-)
S.PzPy
being /c-sub-exchangeable, and since Si^Yi
vx(Z) vx(y)
xn
>
Y.^y^-Z^)
-^
^-:^
-^
—
Zj)
=
„ „ „
< E.PzPy
,
\
(Fi
—
we have that
Zi),
Xo
Vx(Z) Vx(y)
Y.^Ay^-2c>
2k
r
By
the triangular inequality and removing x{Y) or x{Z), the latter
P
^£,(y, -
>
/!,)
•TO
V xjY)
^£,(Z, -/ij
4k
xq V x(Z)
>
4k
1=1
.;^
bounded by
is
J^
a
Proof of Lemma
17.
We
would
is
^n
-1/2
separable empirical process G„(/)
Gaussian; here Zi,...,Z„
apply exponential inequalities to the general
like to
an underlying
E7=l{fiZ^) - E[/(Z,)]}, which
i.i.d.
the standard symmetrization approach. Indeed,
G°{f)
i.e.,
=
P{£i
we
data sequence. To achieve this we use
first
n~^^'^Yl?=i{^if{^i)}^ where £!,...,£:„ are
=
1)
=
P{£i
= — 1) =
1/2,
not sub-
is
introduce the symmetrized process
i.i.d.
Rademacher random
which are independent of Zi,
probabihties of the general empirical process are bounded by the
.
,
.
.
Z„.
variables,
Then
the tail
the
tail probabilities of
symmetrized process using the symmetrization lemma recalled below. Further, we know that
by Hoeffding inequality the symmetrized process
with respect to the L2(Pn) norm, where ¥„
result.
By
.
is
is
sub-Gaussian conditional on Zj,
.
,
.
Z^
the empirical measure, and this delivers the
.
the Chebyshev's inequahty and the assumption on e„(.F, P„)
T fixed in the statement of the
P(|G„(/)|
.
we have
for the
lemma
> 4kcKeni:F,Vn)
'
)
'
< ^"P/ ^"'^P'^"(-^) ^
- (4kcKe„{^,Fn)f
sup^g^varp/
(4fcc/<'e„(^,P„))2
^
-
constant
>
,
PENALIZED QUANTILE REGRESSION
Therefore,
bj^
the symmetrization
>
sup |G„(/)|
Lemma
4fcci^e„(^,P„)
We then condition on the values of Zi,
as Pg. Conditional on Zi,
is
sub-Gaussian
.
,
.
.
.
< -P
.
sup
> ci^e„(^,P„)
|G°'(/)|
+ r.
I
G°
Z„, by the Hoeffding inequality the symmetrized process
forgeJ^ — J^
L2(Pn) norm, namely
for the
I
Z„, denoting the conditional probability measure
,
.
we obtain
16
I
57
P,{G°(5)>x}<2exp(-^^^
\
Lemma
Hence by
D=
15 with
1,
-^
Il5llp„,2
we can bound
<
sup|G°(/)|>c/re„(F,P„)
e-in(e,^,L2)-^^-^>de Al.
j^
The
result follows
from taking the expectation over Zi,
Proof of Lemma
The proof proceeds
18.
in
two
.
,
.
.
Zn-
with the
steps,
step containing
first
the main argument and the second step containing some auxiliary calculations.
Step
1.
we prove the main
In this step
result. First,
n{e,J-m, L2) satisfies the monotonicity hypotheses of
C=
sup/gj-^
i|i^m||p„,2
(1
+
V2u)/4. Note that supjgjr^
ll/liiPn,2/ll-P'm||p„,2
>
ll/l|p„,2 is
<
m
17 uniformly in
||/||p,2,
<
£
supyg^r^
||F„||ip„,2
m
<
ll/l|p„,2}
:
=
n:
^/m log(n V p) +
/
i—
n.
\/2-sub-exchangeable and p(J^m,P„)
l/\/" by Step 2 below. Thus, uniformly in
\/logn(e,J?^,L2)de
/
Lemma
C^/m log(n V p) max{supjgjr„
Second, recall that e„(J^m,P„) :=
for
we observe that the bound
vm \og{K/e)de
Jo
Jo
/'P(^m,P„)/4
<
(l/4)^mlog(n Vp) sup
||/||p„,2
+
||-?V.||p„,2
;
'
<
^mloginWp)
!S
^n
l,.'
771
:
^n
1
and K < n
Third, set
K
p)). Recall that
m<n
and / G
P(|G„(/)|
for
n
fl
+
v^)
/4
\
'',
'i-
:
'
'
'
(/(f
pVTb^,
<
Ide)^/^ {fP\og{K/e)dc)^^^
for
1/^
<
sufficiently large.
:= ./2/5
>
4\/2cC >
J-m,
||/||j.„,2
j
which follows by J^ ^/log{K/e)de <
p <
sup
,
Jo
feJ'n,
':
^vm]og{K/€)d€
/
1
1
B{K)
so that
where
1
<
c
<
we have by Chebyshev
> 4^/2cA'e„(^^,P„)
)
<
:= (i^^
30
is
._ i^
^
2/5,
defined in
and
let
Lemma
Tm
15.
=
(5/(2mlog(n V
Note that
for
any
'
inequality
'^^g^-f";^
(4v2c/i^e„(.fm,Pn))'^
"
~'
'
:
<
,
'/^
„,/^^"
'\
,
^
(4v2cC)^mlog(nVp)
< r„/2.
BELLONI AND CHERNOZHUKOV
58
Using
Lemma
17 with our choice of Tm,
m
<
k
n,
>
v >
I,
and p{J-m,^n) <
I,
1,
we
obtain
p| sup |Gn(/)|
>
4%/2c/^e„(J^^,P„),
3
m
^
^E
< n| <
V pi
sup |G„(/)|
V
,
M
Step
To
2.
2 log(n
•
.
by our choice
of
B{K) and n
In this step
ll/llp„(^),2
{sup^,^^E„
||/||Fn,2 is
SUP/e^„
+ ^"P/e^^
+E„
[/(Z,)2]
li/l|p„{Z),2
\/2-sub-exchangeable,
Next we show that
latter follows
>
Y
ll/llp„(z).2
^ {^up^e^.En
> {sup^e^^
Z,
let
be created by pairwise
^
[/(Z,)2]
il/ll^„(z),2
supjgjr^ ll/llp„(y),2)
+E„
Vsup^,^^
[/(>;)']
}'^'
\\f\\KiY).2f'~
=
=
||/||p„(y),2-
p(J"rn,IPn)
from E„ [F^]
suP/ejr^ E„[|/(Zi)|2]
11/111(^,2}'^'
[/(l')2]}'^'
Vsupjg;r„
Vp)
calculations.
exchanging any componentsof Z and Y. Then ^/2 (sup^gjr^
{sup/e^„
i
sufficiently large.
we perform some auxiliary
establish that supygjr„
4x/2cA'e„(J-^, P„)
^0
Tn=l
,
>
=
:= supygjr^
||/||p„,2/||.Fm||p„,2
En[supj£_;r^ |/(Z,)p]
<
>
l/\/n for
m
<
n.
The
supj<„ sup^gj^^ |/(Zj)p, and from
supjg^r^ sup,<„ \f{Zi)\'^/n.
ACKNOWLEDGEMENTS
We
would
Maugis
for
like to
thank Arun Chandrasekhar, Moshe Cohen, Ye Luo, and Pierre-Andre
thorough proof-reading of several versions of
this
paper and their detailed com-
ments that helped to considerably improve the paper.
REFERENCES

[1] H. Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723.
[2] R. J. Barro and J.-W. Lee (1994). Data set for a panel of 139 countries. Discussion paper, NBER, http://www.nber.org/pub/barro.lee.
[3] R. J. Barro and X. Sala-i-Martin (1995). Economic Growth. McGraw-Hill, New York.
[4] A. Belloni and V. Chernozhukov (2008). On the Computational Complexity of MCMC-based Estimators in Large Samples, forthcoming in the Annals of Statistics.
[5] A. Belloni and V. Chernozhukov (2008). Conditional Quantile Processes under Increasing Dimension, Duke Technical Report.
[6] D. Bertsimas and J. Tsitsiklis (1997). Introduction to Linear Optimization. Athena Scientific.
[7] M. Buchinsky (1994). Changes in the U.S. Wage Structure 1963-1987: Application of Quantile Regression. Econometrica, Vol. 62, No. 2 (Mar.), pp. 405-458.
[8] E. Candes and T. Tao (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., Vol. 35, No. 6, 2313-2351.
[9] V. Chernozhukov (2005). Extremal quantile regression. Ann. Statist. 33, no. 2, 806-839.
[10] R. Dudley (2000). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics.
[11] J. Fan and J. Lv (2008). Sure Independence Screening for Ultra-High Dimensional Feature Space. Journal of the Royal Statistical Society Series B, 70, 849-911.
[12] X. Fernique (1997). Fonctions aleatoires gaussiennes, vecteurs aleatoires gaussiens. Publications du Centre de Recherches Mathematiques, Montreal.
[13] S. Geman (1980). A Limit Theorem for the Norm of Random Matrices. Ann. Probab., Vol. 8, No. 2, 252-261.
[14] C. Gutenbrunner and J. Jureckova (1992). Regression Rank Scores and Regression Quantiles. Annals of Statistics, Vol. 20, No. 1 (Mar.), pp. 305-330.
[15] X. He and Q.-M. Shao (2000). On Parameters of Increasing Dimensions. Journal of Multivariate Analysis 73, 120-135.
[16] P. J. Huber (1973). Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. 1, 799-821.
[17] K. Knight (1998). Limiting distributions for L1 regression estimators under general conditions. Annals of Statistics, 26, no. 2, 755-770.
[18] K. Knight and W. J. Fu (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28, 1356-1378.
[19] R. Koenker (2005). Quantile Regression. Econometric Society Monographs, Cambridge University Press.
[20] R. Koenker and G. Bassett (1978). Regression Quantiles. Econometrica, Vol. 46, No. 1 (January), 33-50.
[21] R. Koenker and J. Machado (1999). Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association, 94, 1296-1310.
[22] P.-S. Laplace (1818). Theorie analytique des probabilites. Editions Jacques Gabay (1995), Paris.
[23] M. Ledoux and M. Talagrand (1991). Probability in Banach Spaces (Isoperimetry and Processes). Ergebnisse der Mathematik und ihrer Grenzgebiete, Springer-Verlag.
[24] R. Levine and D. Renelt (1992). A Sensitivity Analysis of Cross-Country Growth Regressions. The American Economic Review, Vol. 82, No. 4, pp. 942-963.
[25] N. Meinshausen and P. Buhlmann (2006). High dimensional graphs and variable selection with the Lasso. Ann. Statist. 34, 1436-1462.
[26] N. Meinshausen and B. Yu (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, Vol. 37, No. 1, 246-270.
[27] Y. Nesterov and A. Nemirovskii (1993). Interior-Point Polynomial Algorithms in Convex Programming, Vol. 13, SIAM Studies in Applied Mathematics, Philadelphia.
[28] S. Portnoy (1991). Asymptotic behavior of regression quantiles in nonstationary, dependent cases. J. Multivariate Anal. 38, no. 1, 100-113.
[29] S. Portnoy and R. Koenker (1997). The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statist. Sci., Vol. 12, No. 4, 279-300.
[30] B. Recht, M. Fazel, and P. A. Parrilo (2007). Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. ArXiv preprint arXiv:0706.4138, submitted.
[31] X. Sala-i-Martin (1997). I Just Ran Two Million Regressions. The American Economic Review, Vol. 87, No. 2, pp. 178-183.
[32] G. Schwarz (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
[33] A. N. Shiryaev (1995). Probability. Springer.
[34] W. F. Stout (1974). Almost Sure Convergence. Probability and Mathematical Statistics, Academic Press.
[35] R. Tibshirani (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58, 267-288.
[36] A. W. van der Vaart (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics.
[37] S. A. van de Geer (2008). High-dimensional generalized linear models and the lasso. Annals of Statistics, Vol. 36, No. 2, 614-645.
[38] A. W. van der Vaart and J. A. Wellner (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics.
[39] C.-H. Zhang and J. Huang (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Statist., Vol. 36, No. 4, 1567-1594.
Alexandre Belloni
Duke University
Fuqua School of Business
1 Towerview Drive
PO Box 90120
Durham, NC 27708-0120
E-mail: abn5@duke.edu

Victor Chernozhukov
Massachusetts Institute of Technology
Department of Economics and Operations Research Center
Room E52-262F
50 Memorial Drive
Cambridge, MA 02142
E-mail: vchern@mit.edu