Massachusetts Institute of Technology
Department of Economics
Working Paper Series

POST-ℓ1-PENALIZED ESTIMATORS IN HIGH-DIMENSIONAL LINEAR REGRESSION MODELS

Alexandre Belloni
Victor Chernozhukov

Working Paper 10-5
January 4, 2009

Room E52-251, 50 Memorial Drive, Cambridge, MA 02142

This paper can be downloaded without charge from the Social Science Research Network Paper Collection at http://ssrn.com/abstract=1582954
POST-ℓ1-PENALIZED ESTIMATORS IN HIGH-DIMENSIONAL LINEAR REGRESSION MODELS

ALEXANDRE BELLONI AND VICTOR CHERNOZHUKOV
Abstract. In this paper we study post-penalized estimators which apply ordinary, unpenalized linear regression to the model selected by first-step penalized estimators, typically LASSO. It is well known that LASSO can estimate the mean regression function at nearly the oracle rate, and is thus hard to improve upon. We show that post-LASSO performs at least as well as LASSO in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the LASSO-based model selection "fails" in the sense of missing some components of the "true" regression model. By the "true" model we mean the best $s$-dimensional approximation to the regression function chosen by the oracle. Furthermore, post-LASSO can perform strictly better than LASSO, in the sense of a strictly faster rate of convergence, if the LASSO-based model selection correctly includes all components of the "true" model as a subset and also achieves a sufficient sparsity. In the extreme case, when LASSO perfectly selects the "true" model, the post-LASSO estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by LASSO, which guarantees that this dimension is at most of the same order as the dimension of the "true" model. Our rate results are non-asymptotic and hold in both parametric and nonparametric models. Moreover, our analysis is not limited to the LASSO estimator in the first step, but also applies to other estimators, for example, the trimmed LASSO, the Dantzig selector, or any other estimator with good rates and good sparsity. Our analysis covers both traditional trimming and a new practical, completely data-driven trimming scheme that induces maximal sparsity subject to maintaining a certain goodness-of-fit. The latter scheme has theoretical guarantees similar to those of LASSO or post-LASSO, but it dominates these procedures as well as traditional trimming in a wide variety of experiments.
Key words. LASSO, post-LASSO, post-model-selection estimators.

AMS Codes. Primary 62H12, 62J99; secondary 62J07.

Date: First Version: January 4, 2009. Current Revision: March 29, 2010. First arXiv version: December 2009.

We thank Don Andrews, Whitney Newey, and Alexandre Tsybakov as well as participants of the Cowles Foundation Lecture at the 2009 Summer Econometric Society meeting and the joint Harvard-MIT seminar for useful comments. We also would like to thank Denis Chetverikov, Brigham Fradsen, and Joonhwan Lee for thorough proof-reading and many useful suggestions on several versions of this paper.
1. Introduction

In this work we study post-model selected estimators for linear regression in high-dimensional sparse models (HDSMs). In such models, the overall number of regressors $p$ is very large, possibly much larger than the sample size $n$. However, the number $s$ of significant regressors - those having a non-zero impact on the response variable - is smaller than the sample size, that is, $s = o(n)$. HDSMs ([6], [13]) have emerged to deal with many new applications arising in biometrics, signal processing, machine learning, econometrics, and other areas of data analysis where high-dimensional data sets have become widely available.

Several papers have begun to investigate estimation of HDSMs, primarily focusing on penalized mean regression, with the $\ell_1$-norm acting as a penalty function [2, 6, 10, 13, 17, 20, 19]. References [2, 6, 10, 13, 20, 19] demonstrated the fundamental result that $\ell_1$-penalized least squares estimators achieve the rate $\sqrt{s/n}\sqrt{\log p}$, which is very close to the oracle rate $\sqrt{s/n}$ achievable when the true model is known. Reference [17] demonstrated a similar fundamental result on the excess forecasting error loss under both quadratic and non-quadratic loss functions. Thus the estimator can be consistent and can have excellent forecasting performance even under very rapid, nearly exponential growth of the total number of regressors $p$. Reference [1] investigated the $\ell_1$-penalized quantile regression process, obtaining similar results. See [9, 2, 3, 4, 5, 11, 12, 15] for many other interesting developments and a detailed review of the existing literature.
In this paper we derive theoretical properties of post-penalized estimators which apply ordinary, unpenalized linear least squares regression to the model selected by first-step penalized estimators, typically LASSO. It is well known that LASSO can estimate the mean regression function at nearly the oracle rate, and hence is hard to improve upon. We show that post-LASSO can perform at least as well as LASSO in terms of the rate of convergence, and has the advantage of a smaller bias. This nice performance occurs even if the LASSO-based model selection "fails" in the sense of missing some components of the "true" regression model. Here by the "true" model we mean the best $s$-dimensional approximation to the regression function chosen by the oracle. The intuition for this result is that LASSO-based model selection omits only those components with relatively small coefficients. Furthermore, post-LASSO can perform strictly better than LASSO, in the sense of a strictly faster rate of convergence, if the LASSO-based model selection correctly includes all components of the "true" model as a subset and is sufficiently sparse. Of course, in the extreme case, when LASSO perfectly selects the "true" model, the post-LASSO estimator becomes the oracle estimator.
Importantly, our rate analysis is not limited to the LASSO estimator in the first step, but applies to a wide variety of other first-step estimators, including, for example, the trimmed LASSO, the Dantzig selector, and their various modifications. We give generic rate results that cover any first-step estimator for which a rate and a sparsity bound are available. We also give a generic result on trimmed first-step estimators, where trimming can be performed by a traditional hard-thresholding scheme or by a new trimming scheme we introduce in the paper. Our new trimming scheme induces maximal sparsity subject to maintaining a certain goodness-of-fit (goof) in the sample, and is completely data-driven. We show that our post-goof-trimmed estimator performs at least as well as the first-step estimator; for example, the post-goof-trimmed LASSO performs at least as well as LASSO, but can be strictly better under good model selection properties. It should also be noted that traditional trimming schemes do not in general have such nice theoretical guarantees, even in simple diagonal models.

Finally, we conduct a series of computational experiments and find that the results confirm our theoretical findings. In particular, we find that the post-goof-trimmed LASSO and the post-traditional-trimmed LASSO emerge clearly as the best and second best, both substantially outperforming LASSO and post-LASSO.

To the best of our knowledge, our paper is the first to establish the aforementioned rate results on post-LASSO and the proposed post-goof-trimmed LASSO in the mean regression problem. Our analysis builds upon the ideas in [1], who established the properties of post-penalized procedures for the related, but different, problem of median regression. Our analysis also builds on the fundamental results of [2] and the other works cited above that established the properties of the first-step LASSO-type estimators. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by LASSO, which guarantees that this dimension is at most of the same order as the dimension of the "true" model. Our sparsity analysis relies on some inequalities for sparse eigenvalues and reasoning previously given in [1] in the context of median regression. Our sparsity bounds for LASSO improve upon the analogous bounds in [20] obtained under a larger penalty level. We also rely on maximal inequalities in [20] to provide primitive conditions for the sharp sparsity bounds to hold.

We organize the remainder of the paper as follows. In Section 2, we review some benchmark results of [2] for LASSO, albeit with a slightly improved choice of penalty, and model selection results of [11, 13, 21]. In Section 3, we present a generic rate result on post-penalized estimators. In Section 4, we present a generic rate result for post-trimmed estimators, where trimming can be traditional or based on goodness-of-fit. In Section 5, we apply our generic results to the post-LASSO and the post-trimmed LASSO estimators. In Section 6, we present the results of our computational experiments.
Notation. In what follows, all parameter values are indexed by the sample size $n$, but we omit the index whenever this does not cause confusion. We use the notation $(a)_+ = \max\{a,0\}$, $a \vee b = \max\{a,b\}$, and $a \wedge b = \min\{a,b\}$. The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm $\|\cdot\|_0$ denotes the number of non-zero components of a vector. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1,\dots,p\}$, we denote by $\delta_T$ the vector in which $\delta_{Tj} = \delta_j$ if $j \in T$ and $\delta_{Tj} = 0$ if $j \notin T$. We also use standard notation from the empirical process literature, $\mathbb{E}_n[f] = \mathbb{E}_n[f(z_i)] = \sum_{i=1}^n f(z_i)/n$ and $\mathbb{G}_n(f) = \sum_{i=1}^n (f(z_i) - E[f(z_i)])/\sqrt{n}$. We use the notation $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on $n$, and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event $E$, we say that $E$ wp $\to 1$ when $E$ occurs with probability approaching one as $n$ grows.
2. LASSO as a Benchmark in Parametric and Nonparametric Models

The purpose of this section is to define the models for which we state our results and also to revisit some known results for the LASSO estimator, which we will use as a benchmark and as inputs to subsequent proofs. In particular, we revisit the fundamental rate results of [2], but with a slightly improved, data-driven penalty level.

2.1. Model 1: Parametric Model. Let us consider the following parametric linear regression model:
$$y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2), \quad \beta_0 \in \mathbb{R}^p, \quad i = 1,\dots,n,$$
where $T = \mathrm{support}(\beta_0)$ has $s$ elements, with $s < n$ but $p > n$; $T$ is unknown, and the regressors $X = [x_1,\dots,x_n]'$ are fixed and normalized so that $\mathbb{E}_n[x_{ij}^2] = 1$ for all $j = 1,\dots,p$.

Given the large number of regressors $p > n$, some regularization is required in order to avoid overfitting the data. The LASSO estimator [16] is one way to achieve this regularization. Specifically, define
$$\hat\beta \in \arg\min_\beta \; \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \quad \text{where } \hat Q(\beta) = \mathbb{E}_n[y_i - x_i'\beta]^2. \tag{2.1}$$
Our goal is to revisit convergence results for $\hat\beta$ in the prediction (pseudo) norm,
$$\|\delta\|_{2,n} = \sqrt{\mathbb{E}_n[x_i'\delta]^2}.$$
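To make the objective in (2.1) concrete, here is a minimal sketch (an illustration, not the authors' code) that fits a LASSO estimator with scikit-learn and evaluates the prediction norm $\|\hat\beta - \beta_0\|_{2,n}$; all sizes and variable names are hypothetical, and the conversion `alpha = lam / (2 * n)` accounts for scikit-learn's $1/(2n)$ scaling of the squared-error term relative to $\hat Q(\beta)$.

```python
# Minimal illustration of the LASSO objective (2.1); not the authors' code.
# sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1, while (2.1) uses
# Q(b) + (lambda/n)||b||_1 with Q(b) = (1/n)||y - Xb||^2, so alpha = lambda/(2n).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 500, 5, 1.0           # hypothetical sizes
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))          # normalize so E_n[x_ij^2] = 1
beta0 = np.zeros(p); beta0[:s] = 1.0
y = X @ beta0 + sigma * rng.standard_normal(n)

lam = 2.0 * sigma * np.sqrt(2 * n * np.log(p))   # conservative penalty level, c = 2
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False, max_iter=50_000).fit(X, y)
beta_hat = fit.coef_

delta = beta_hat - beta0
pred_norm = np.sqrt(np.mean((X @ delta) ** 2))   # ||beta_hat - beta0||_{2,n}
print(f"selected {np.sum(beta_hat != 0)} regressors, prediction norm {pred_norm:.3f}")
```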
The key quantity in the analysis is the gradient at the true value:
$$S = 2\mathbb{E}_n[x_i \epsilon_i].$$
This gradient is the effective "noise" in the problem. Indeed, for $\delta = \hat\beta - \beta_0$, we have by the Hölder inequality
$$\hat Q(\hat\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2 = 2\mathbb{E}_n[\epsilon_i x_i'\delta] \ge -\|S\|_\infty\|\delta\|_1. \tag{2.2}$$
Thus, $\hat Q(\hat\beta) - \hat Q(\beta_0)$ provides noisy information about $\|\delta\|_{2,n}^2$, and the amount of noise is controlled by $\|S\|_\infty\|\delta\|_1$. This noise should be dominated by the penalty, so that the rate of convergence can be deduced from a relationship between the penalty term and the quadratic term $\|\delta\|_{2,n}^2$. This reasoning suggests choosing $\lambda$ so that
$$\lambda \ge c\,n\|S\|_\infty, \quad \text{for some fixed } c > 1.$$
However this choice is not feasible, since we do not know $n\|S\|_\infty$. We propose setting
$$\lambda = c\cdot\Lambda(1-\alpha\mid X), \quad \text{where } \Lambda(1-\alpha\mid X) := (1-\alpha)\text{-quantile of } n\|S\|_\infty, \tag{2.3}$$
so that for this choice
$$\lambda \ge c\,n\|S\|_\infty \quad \text{with probability at least } 1-\alpha. \tag{2.4}$$
Note that the quantity $\Lambda(1-\alpha\mid X)$ is easily computed by simulation. We refer to this choice of $\lambda$ as the data-driven choice, reflecting the dependence of the choice on the design matrix $X$.
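Since $\Lambda(1-\alpha\mid X)$ in (2.3) is the $(1-\alpha)$-quantile of $n\|S\|_\infty = 2\|\sum_i x_i\epsilon_i\|_\infty$ conditional on the design, it can be simulated directly. A minimal sketch (assuming $\sigma$ is known; the function name and defaults are hypothetical):

```python
# Monte Carlo computation of Lambda(1 - alpha | X), the (1 - alpha)-quantile of
# n * ||S||_inf with S = 2 * E_n[x_i eps_i] and eps_i ~ N(0, sigma^2), given X.
import numpy as np

def data_driven_lambda(X, sigma, alpha=0.05, c=1.1, n_sims=1_000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = np.empty(n_sims)
    for r in range(n_sims):
        eps = sigma * rng.standard_normal(n)
        draws[r] = 2.0 * np.abs(X.T @ eps).max()   # n * ||S||_inf for this draw
    Lambda = np.quantile(draws, 1.0 - alpha)        # Lambda(1 - alpha | X)
    return c * Lambda                               # penalty level (2.3)

# usage: lam = data_driven_lambda(X, sigma=1.0, alpha=0.05)
```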
Comment 2.1 (Data-driven choice vs. standard choice). The standard choice of $\lambda$ employs
$$\lambda = c\sigma A\sqrt{2n\log p}, \tag{2.5}$$
where $A > 1$ is a constant that does not depend on $X$, chosen so that (2.4) holds no matter what $X$ is. Note that $\sqrt{n}\|S\|_\infty$ is a maximum of $N(0,\sigma^2)$ variables, which are correlated if the columns of $X$ are correlated, as they typically are in the sample. In order to compute $\lambda$, the standard choice uses the conservative assumption that these variables are uncorrelated. When the variables are highly correlated, the standard choice (2.5) becomes quite conservative and may be too large. The $X$-dependent choice of penalty (2.3) takes advantage of the in-sample correlations induced by the design matrix and yields smaller penalties. To illustrate this point, we simulated many designs by drawing $x_i$ as i.i.d. $N(0,\Sigma)$, with $\Sigma_{jj} = 1$ and varying correlations $\Sigma_{jk}$ for $j \ne k$ among three design options: $\Sigma_{jk} = 0$, $\rho^{|j-k|}$, or $\rho$, and defining $x_{ij} = \tilde x_{ij}/\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]}$. Figure 1 plots the sorted realized values of the $X$-dependent $\Lambda$ and illustrates the impact of in-sample correlation on these values. As expected, for a fixed confidence level $1-\alpha$, the more correlated the regressors are, the smaller the data-driven penalty levels (2.3), relative to the standard conservative choice (2.5).
[Figure 1 plots the standard bound and the data-dependent $\Lambda$ for Designs 1-3 against the quantile index.]

Figure 1. Realized values of $\Lambda(0.95\mid X)$ sorted in increasing order. $X$ is drawn by generating $x_i$ as i.i.d. $N(0,\Sigma)$, where for $j \ne k$ Design 1 has $\Sigma_{jk} = 0$, Design 2 has $\Sigma_{jk} = (1/2)^{|j-k|}$, and Design 3 has $\Sigma_{jk} = 1/2$. We used $n = 100$, $p = 500$, and $\sigma^2 = 1$. For each design 100 design matrices were drawn.
Under (2.3), $\hat\delta = \hat\beta - \beta_0$ will obey the following "restricted condition" with probability at least $1-\alpha$:
$$\|\hat\delta_{T^c}\|_1 \le \bar c\,\|\hat\delta_T\|_1, \quad \text{where } \bar c := \frac{c+1}{c-1}. \tag{2.6}$$
Therefore, in order to get convergence rates in the prediction norm $\|\hat\delta\|_{2,n} = \sqrt{\hat\delta'\,\mathbb{E}_n[x_i x_i']\,\hat\delta}$, we consider the following modulus of continuity between the penalty and the prediction norm:
$$\text{RE.1}(\bar c)\qquad \kappa_1(T) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta\ne 0}\ \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\delta_T\|_1},$$
where $\kappa_1(T)$ can depend on $n$. In turn, the convergence rate in the usual Euclidean norm $\|\delta\|$ is determined by the following modulus of continuity between the prediction norm and the Euclidean norm:
$$\text{RE.2}(\bar c)\qquad \kappa_2(T) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta\ne 0}\ \frac{\|\delta\|_{2,n}}{\|\delta\|},$$
where $\kappa_2(T)$ can depend on $n$. Conditions RE.1 and RE.2 are simply variants of the original restricted eigenvalue conditions imposed in Bickel, Ritov and Tsybakov [2]. In what follows, we suppress dependence on $T$ whenever convenient. Lemma 1 below states the rate of convergence in the prediction norm under a data-driven choice of penalty.
Lemma 1 (Essentially in Bickel, Ritov, and Tsybakov [2]). If $\lambda \ge c\,n\|S\|_\infty$, then
$$\|\hat\beta - \beta_0\|_{2,n} \le \Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\kappa_1}.$$
Thus, under the data-driven choice (2.3), we have with probability at least $1-\alpha$
$$\|\hat\beta - \beta_0\|_{2,n} \le (1+c)\,\frac{\sqrt{s}\,\Lambda(1-\alpha\mid X)}{n\kappa_1}, \quad \text{where } \Lambda(1-\alpha\mid X) \le \sigma\sqrt{2n\log(p/\alpha)}.$$

Thus, provided $\kappa_1$ is bounded away from zero, LASSO estimates the regression function at nearly the rate $\sqrt{s/n}$ (achievable when the true model $T$ is known) with probability at least $1-\alpha$. Since $\hat\delta = \hat\beta - \beta_0$ obeys the restricted condition with probability at least $1-\alpha$, the rate in the Euclidean norm immediately follows from the relation
$$\|\hat\beta - \beta_0\| \le \|\hat\beta - \beta_0\|_{2,n}/\kappa_2, \tag{2.7}$$
which also holds with probability at least $1-\alpha$. Thus, if $\kappa_2$ is also bounded away from zero, LASSO estimates the regression coefficients at a near $\sqrt{s/n}$ rate with probability at least $1-\alpha$. Note that the $\sqrt{s/n}$ rate is not the oracle rate in general, but under some further conditions stated in Section 2.3, namely when the parametric model is the oracle model, this rate is an oracle rate.
2.2. Model 2: Nonparametric Model. Next we consider the nonparametric model given by
$$y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2), \quad i = 1,\dots,n,$$
where $y_i$ are the outcomes, $z_i$ are vectors of fixed regressors, and $\epsilon_i$ are disturbances. Letting $f_i = f(z_i)$ and $x_i = p(z_i)$, where $p(z_i)$ is a $p$-vector of transformations of $z_i$, we can rewrite, for any conformable vector $\beta_0$,
$$y_i = x_i'\beta_0 + u_i, \quad u_i = r_i + \epsilon_i, \quad \text{where } r_i := f_i - x_i'\beta_0.$$
Next we choose our target or "true" $\beta_0$, with the corresponding cardinality of its support $s = \|\beta_0\|_0 = |T|$, as any solution to the following "oracle" risk minimization problem:
$$\min_{0\le k\le p\wedge n}\ \min_{\|\beta\|_0\le k}\ \mathbb{E}_n[x_i'\beta - f_i]^2 + \sigma^2\frac{k}{n}. \tag{2.8}$$
Letting
$$c_s^2 := \mathbb{E}_n[r_i^2] = \mathbb{E}_n[(x_i'\beta_0 - f_i)^2]$$
denote the error from approximating $f_i$ by $x_i'\beta_0$, then $c_s^2 + \sigma^2 s/n$ is the optimal value of (2.8). In order to simplify exposition, we focus some results and discussions on the case where the following holds:
$$c_s^2 \le K\sigma^2 s/n \quad \text{with } K = 1, \tag{2.9}$$
which covers most cases of interest. Alternatively, we could consider an arbitrary $K$, which does not affect the results modulo constants.

Note that $c_s^2 + \sigma^2 s/n$ is the expected estimation error $E[\mathbb{E}_n[f_i - x_i'\hat\beta^o]^2]$ of the (infeasible) oracle estimator $\hat\beta^o$ that minimizes the expected estimation error among all $k$-sparse least squares estimators, by searching for the best $k$-dimensional model and then choosing $k$ to balance the approximation error with the sampling error, which the oracle knows how to compute. The rate of convergence of the oracle estimator, $\sqrt{c_s^2 + \sigma^2 s/n}$, becomes an ideal goal for the rate of convergence, and in general it can be achieved only up to logarithmic terms (see Donoho and Johnstone [7] and Rigollet and Tsybakov [14]), except under very special circumstances, such as in most cases when it becomes possible to perform perfect model selection. Finally, note that when the approximation error, $c_s^2$, is zero, the oracle model becomes the parametric model of the previous section, where we had $r_i = 0$.

Next we state a rate of convergence in the prediction norm under the data-driven choice of penalty.

Lemma 2 (Essentially in Bickel, Ritov, and Tsybakov [2]). If $\lambda \ge c\,n\|S\|_\infty$, then
$$\|\hat\beta - \beta_0\|_{2,n} \le \Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\kappa_1} + 2c_s.$$
Under the data-driven choice (2.3), we have with probability at least $1-\alpha$
$$\|\hat\beta - \beta_0\|_{2,n} \le (1+c)\,\frac{\sqrt{s}\,\Lambda(1-\alpha\mid X)}{n\kappa_1} + 2c_s, \quad \text{where } \Lambda(1-\alpha\mid X) \le \sigma\sqrt{2n\log(p/\alpha)}.$$

Thus, provided $\kappa_1$ is bounded away from zero, LASSO estimates the regression function at a near-oracle rate with probability at least $1-\alpha$. Furthermore, the bound on the empirical risk follows from the triangle inequality:
$$\sqrt{\mathbb{E}_n[x_i'\hat\beta - f_i]^2} \le \|\hat\beta - \beta_0\|_{2,n} + c_s. \tag{2.10}$$
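For intuition about the oracle program (2.8), it can be evaluated exactly by brute force when $p$ is small: for each cardinality $k$, take the best $k$-term least squares approximation to $f$ and add the $\sigma^2 k/n$ penalty. The sketch below is purely illustrative and infeasible in practice (it requires knowing $f$); the function name is hypothetical.

```python
# Brute-force evaluation of the oracle risk minimization (2.8) for small p.
# Infeasible in practice: it uses the true regression function values f_i.
import itertools
import numpy as np

def oracle_target(X, f, sigma2):
    n, p = X.shape
    best = (np.mean(f ** 2), 0, np.zeros(p))         # k = 0: approximate f by zero
    for k in range(1, p + 1):
        for support in itertools.combinations(range(p), k):
            Xk = X[:, support]
            coef, *_ = np.linalg.lstsq(Xk, f, rcond=None)
            approx_err = np.mean((Xk @ coef - f) ** 2)   # E_n[(x_i'b - f_i)^2]
            value = approx_err + sigma2 * k / n
            if value < best[0]:
                beta0 = np.zeros(p); beta0[list(support)] = coef
                best = (value, k, beta0)
    return best   # (optimal value c_s^2 + sigma^2 s/n, oracle s, oracle beta_0)
```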
2.3. Model Selection Properties. The primary results we develop do not require the first-step estimators like LASSO to perfectly select the true model. In fact, we are specifically interested in the most common cases where these estimators do not perfectly select the true model. For these cases, we prove that post-model selection estimators such as post-LASSO achieve near-oracle rates like those of LASSO. However, in some special cases, perfect model selection is possible, and thus post-model selection estimators can achieve the exact oracle rates and can be even better than LASSO. The purpose of this section is to describe these very special cases where perfect model selection is possible.

In the discussion of our results on post-penalized estimators we will refer to the following model selection results for the parametric case.

Lemma 3 (Essentially in Meinshausen and Yu [13] and Lounici [11]). 1) In the parametric model, if the coefficients are well separated from zero, that is
$$\min_{j\in T}|\beta_{0j}| > t + \zeta, \quad \text{for } t \ge \zeta := \max_{j=1,\dots,p}|\hat\beta_j - \beta_{0j}|,$$
then the true model is a subset of the selected model, $T := \mathrm{support}(\beta_0) \subseteq \hat T := \mathrm{support}(\hat\beta)$. Moreover, the true model $T$ can be perfectly selected by applying trimming of level $t$:
$$T = \hat T(t) := \{j \in \{1,\dots,p\} : |\hat\beta_j| > t\}.$$
2) In particular, if $\lambda \ge c\,n\|S\|_\infty$, then $\zeta \le \dfrac{\lambda\sqrt{s}}{n\kappa_1\kappa_2}$.
3) In particular, if $\lambda \ge c\,n\|S\|_\infty$, and there is a constant $u > 1$ such that the design matrix satisfies $|\mathbb{E}_n[x_{ij}x_{ik}]| \le 1/(u(1+2\bar c)s)$ for all $1 \le j < k \le p$, then $\zeta \le \dfrac{2\lambda}{n}$.

Thus, we see from parts 1) and 2), which follow from [13] and Lemma 2, that perfect model selection is possible under strong assumptions on the coefficients' separation away from zero. We also see from part 3), which is due to [11], that the strong separation of coefficients can be considerably weakened in exchange for a strong assumption on the design matrix. Finally, the following extreme result also requires strong assumptions on separation of coefficients and the design matrix.
Lemma 4 (Essentially in Zhao and Yu [21]). In the parametric model, under more restrictive conditions on the design, separation of non-zero coefficients, and penalty parameter, specified in [21], with a high probability
$$\hat T = \mathrm{support}(\hat\beta) = T = \mathrm{support}(\beta_0).$$
Comment 2.2. We only review model selection in the parametric case. There are two reasons for this: first, the results stated above have been developed for the parametric case only, and extending them to nonparametric cases is outside the main focus of this paper. Second, it is clear from the stated conditions that in the nonparametric context, in order to select the oracle model $T$ perfectly, the oracle models have to be either (a) parametric (i.e. $c_s = 0$) or (b) very close to parametric (with $c_s^2$ much smaller than $\sigma^2 s/n$) and satisfy other strong conditions similar to those stated above. Since we only argue that post-LASSO and related estimators are as good as LASSO and can be strictly better only in some cases, it suffices to demonstrate the latter for case (a). Moreover, if oracle performance is achieved for case (a), then by continuity of empirical risk with respect to the underlying model, the oracle performance should extend to a neighborhood of case (a), which is case (b).

3. A Generic Result on Post-Model Selection Estimators

Let $\hat\beta$ be any first-step estimator acting as a model selection device, and denote by $\hat T := \mathrm{support}(\hat\beta)$ the model selected by this estimator; we assume $|\hat T| \le n$ throughout. Define the post-model selection estimator as
$$\tilde\beta = \arg\min_{\mathrm{support}(\beta)\subseteq\hat T}\ \hat Q(\beta). \tag{3.11}$$
If model selection works perfectly (as it will under some rather stringent conditions), then this estimator is simply the oracle estimator and its properties are well known. However, of interest is the case when model selection does not work perfectly, as occurs for many designs of interest in applications. In this section we derive a generic result on the performance of any post-model selection estimator.
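The second-step estimator (3.11) is simply ordinary least squares restricted to the selected support $\hat T$. A minimal sketch of this refitting step, applicable to any first-step estimator (function and variable names hypothetical):

```python
# Post-model-selection estimator (3.11): refit OLS on the support of a first-step estimate.
import numpy as np

def post_model_selection(X, y, beta_first_step, tol=0.0):
    p = X.shape[1]
    T_hat = np.flatnonzero(np.abs(beta_first_step) > tol)   # selected model
    beta_tilde = np.zeros(p)
    if T_hat.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, T_hat], y, rcond=None)
        beta_tilde[T_hat] = coef
    return beta_tilde, T_hat

# usage with a LASSO first step (post-LASSO):
#   beta_tilde, T_hat = post_model_selection(X, y, beta_hat_lasso)
```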
In order to derive rates, we need the following minimal restricted sparse eigenvalue
$$\text{RSE.1}(m)\qquad \kappa(m)^2 := \min_{\|\delta_{T^c}\|_0\le m,\ \delta\ne 0}\ \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2},$$
as well as the following maximal restricted sparse eigenvalue
$$\text{RSE.2}(m)\qquad \phi(m) := \max_{\|\delta_{T^c}\|_0\le m,\ \delta\ne 0}\ \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2},$$
where $m$ is the restriction on the number of non-zero components outside the support $T$. It will be convenient to define the following condition number associated with the sample design matrix:
$$\mu_m = \frac{\sqrt{\phi(m)}}{\kappa(m)}. \tag{3.12}$$
The following theorem establishes bounds on the prediction error of a generic second-step estimator.
Theorem 1 (Performance of a generic second-step estimator). In either the parametric model or the nonparametric model, let $\hat\beta$ be any first-step estimator with support $\hat T$, define $\hat m := |\hat T\setminus T|$, and let $\tilde\beta$ be the second-step estimator. Let $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0)$ and $C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)$. For any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+},$$
where $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey bounds (3.13) stated below.

The following lemma bounds $B_n$ and $C_n$, although in many cases we can bound $B_n$ by other means, as we shall do in the case of LASSO.

Lemma 5 (Generic control of $B_n$ and $C_n$). Let $\hat m = |\hat T\setminus T|$ be the number of wrong regressors selected, and $k = |T\setminus\hat T|$ be the number of correct regressors missed. For any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$B_n \le \|\hat\beta-\beta_0\|_{2,n}^2 + K_\varepsilon\sigma\|\hat\beta-\beta_0\|_{2,n}\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s\|\hat\beta-\beta_0\|_{2,n},$$
$$C_n \le 1\{T\not\subseteq\hat T\}\Big[\|\beta_{0\hat T}-\beta_0\|_{2,n}^2 + K_\varepsilon\sigma\|\beta_{0\hat T}-\beta_0\|_{2,n}\sqrt{\frac{k\log p + (k+s)\log\mu_{k}}{n}} + 2c_s\|\beta_{0\hat T}-\beta_0\|_{2,n}\Big], \tag{3.13}$$
where $c_s = 0$ in the parametric case.
Three implications of Theorem 1 are worth noting. Firstly, the bounds on the prediction norm stated in Theorem 1 apply to any generic post-model selection estimator, provided we can bound both the sparsity $\hat m$ and the minimum between $B_n$ and $C_n$. Intuitively, $B_n$ measures the in-sample goodness-of-fit (or loss-of-fit) of the first-step estimator relative to the "true" parameter value $\beta_0$, and $C_n$ measures the in-sample loss-of-fit induced by truncating the "true" parameter $\beta_0$ outside the selected model $\hat T$.

Secondly, and most importantly, note that if the selected model contains the true model, that is $T \subseteq \hat T$, then $C_n = 0$ and $B_n$ does not affect the rate at all, and the performance of the second-step estimator is dictated by the sparsity $\hat m$ of the first-step estimator, which controls the magnitude of the empirical errors. Otherwise, if the selected model fails to contain the true model $T$, the performance of the second-step estimator is determined by both the sparsity $\hat m$ and $(B_n)_+ \wedge (C_n)_+$, and by Lemma 5 we can bound both of these in terms of the rate of convergence $\|\hat\beta - \beta_0\|_{2,n}$ of the first-step estimator and $\hat m$, the number of wrong regressors selected.

Finally, note that rates in other norms of interest immediately follow from the following relations:
$$\sqrt{\mathbb{E}_n[x_i'\tilde\beta - f_i]^2} \le \|\tilde\beta - \beta_0\|_{2,n} + c_s \quad \text{and} \quad \|\tilde\beta - \beta_0\| \le \|\tilde\beta - \beta_0\|_{2,n}/\kappa(\hat m), \tag{3.14}$$
where $\hat m = |\hat T\setminus T|$.

The proofs of Theorem 1 and Lemma 5 rely on the sparsity-based control of the empirical error provided by the following lemma.

Lemma 6 (Sparsity-based control of noise). 1) For any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\big|\hat Q(\beta_0+\delta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2\big| \le K_\varepsilon\sigma\sqrt{\frac{m\log p + (m+s)\log\mu_m}{n}}\,\|\delta\|_{2,n} + 2c_s\|\delta\|_{2,n},$$
uniformly for all $\delta$ such that $\|\delta_{T^c}\|_0 \le m$, and uniformly over $m \le n$, where $c_s = 0$ in the parametric model. 2) Furthermore, with at least the same probability,
$$\big|\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0) - \|\beta_{0\hat T}-\beta_0\|_{2,n}^2\big| \le K_\varepsilon\sigma\sqrt{\frac{k\log p + (k+s)\log\mu_k}{n}}\,\|\beta_{0\hat T^c}\|_{2,n} + 2c_s\|\beta_{0\hat T^c}\|_{2,n},$$
uniformly for all $\hat T \subset \{1,\dots,p\}$ such that $\|\beta_{0\hat T^c}\|_0 \le k$, and uniformly over $k \le s$, where $c_s = 0$ in the parametric model.
POST-^i-PENALIZED ESTIMATORS
The proof of the lemma
a separate theorem since
on the following maximal inequality, which we state as
in turn rehes
may be
it
13
The proof
of independent interest.
of the
theorem involves
the use of Samorodnitsky-Talagrand's inequaUty.
Theorem
2 (Maximal inequality for a collection of empirical processes). Let
independent for
i
—
Cnim,,]) :=
for any
rj
G
(0, 1)
I,
.
.
.
and for
,n,
a2V2
m
=
1,
.
,
.
.
n
+ v/(m +
./log
£j
~
N{0,cj'^) be
define
s) log
+ ^{m +
(D/x^)
s)log{l/T])
and some universal constant D. Then
e.x'J
<
Gn
sup
eri{m..
rj),
for
m<
all
n,
\\6Tc\\o<mM5\\2,n>0
with probability at least
homogenous of degree
1.
-
*/(!
r]e
Proof. Step 0. Note that we can restrict the supremum to $\|\delta\| = 1$, since the function is homogeneous of degree zero in $\delta$.

Step 1. For each non-negative integer $m \le n$, and each set $\tilde T \subset \{1,\dots,p\}$ with $|\tilde T\setminus T| \le m$, define the class of functions
$$\mathcal{G}_{\tilde T} = \big\{\epsilon_i x_i'\delta/\|\delta\|_{2,n} : \mathrm{support}(\delta) \subseteq \tilde T,\ \|\delta\| = 1\big\}. \tag{3.15}$$
Also define
$$\mathcal{F}_m = \bigcup\big\{\mathcal{G}_{\tilde T} : \tilde T \subset \{1,\dots,p\},\ |\tilde T\setminus T| \le m\big\}.$$
It follows that
$$P\Big(\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| > e_n(m,\eta)\Big) \le \binom{p}{m}\max_{|\tilde T\setminus T|\le m} P\Big(\sup_{f\in\mathcal{G}_{\tilde T}}|\mathbb{G}_n(f)| > e_n(m,\eta)\Big). \tag{3.16}$$
We apply Samorodnitsky-Talagrand's inequality (Proposition A.2.7 in van der Vaart and Wellner [18]) to bound the right hand side of (3.16). Let
$$\rho(f,g) := \sqrt{E[\mathbb{G}_n(f) - \mathbb{G}_n(g)]^2} = \sqrt{E_\epsilon\mathbb{E}_n[(f-g)^2]} \quad \text{for } f,g \in \mathcal{G}_{\tilde T};$$
by Step 2 below, the covering number of $\mathcal{G}_{\tilde T}$ with respect to $\rho$ obeys
$$N(\varepsilon,\mathcal{G}_{\tilde T},\rho) \le (6\sigma\mu_m/\varepsilon)^{m+s}, \quad \text{for each } 0 < \varepsilon < \sigma, \tag{3.17}$$
and $\sigma^2(\mathcal{G}_{\tilde T}) = \max_{f\in\mathcal{G}_{\tilde T}}E[\mathbb{G}_n(f)]^2 = \sigma^2$. Then, by Samorodnitsky-Talagrand's inequality,
$$P\Big(\sup_{f\in\mathcal{G}_{\tilde T}}|\mathbb{G}_n(f)| > e_n(m,\eta)\Big) \le \Big(\frac{D\sigma\mu_m\,e_n(m,\eta)}{\sqrt{m+s}\,\sigma^2}\Big)^{m+s}\bar\Phi\big(e_n(m,\eta)/\sigma\big),$$
for some universal constant $D \ge 1$. For $e_n(m,\eta)$ defined in the statement of the theorem, it follows that
$$P\Big(\sup_{f\in\mathcal{G}_{\tilde T}}|\mathbb{G}_n(f)| > e_n(m,\eta)\Big) \le \eta\,e^{-m}\binom{p}{m}^{-1}.$$
Then,
$$P\Big(\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| > e_n(m,\eta),\ \exists\, m\le n\Big) \le \sum_{m=0}^{n}\eta\,e^{-m} \le \frac{\eta}{1-1/e},$$
which proves the claim of the theorem.
Step 2. This step establishes (3.17). Consider any two functions $\epsilon_i x_i't/\|t\|_{2,n}$ and $\epsilon_i x_i'\tilde t/\|\tilde t\|_{2,n}$ in $\mathcal{G}_{\tilde T}$, for a given $\tilde T \subset \{1,\dots,p\}$ with $|\tilde T\setminus T| \le m$. By definition of $\mathcal{G}_{\tilde T}$ in (3.15), $\mathrm{support}(t) \subseteq \tilde T$ and $\mathrm{support}(\tilde t) \subseteq \tilde T$, so that $\mathrm{support}(t - \tilde t) \subseteq \tilde T$, and $\|t\| = \|\tilde t\| = 1$ by (3.15). Hence, by definitions RSE.1$(m)$ and RSE.2$(m)$, we have that
$$E_\epsilon\mathbb{E}_n\Big[\Big(\frac{\epsilon_i x_i't}{\|t\|_{2,n}} - \frac{\epsilon_i x_i'\tilde t}{\|t\|_{2,n}}\Big)^2\Big] = \sigma^2\,\frac{\mathbb{E}_n[(x_i'(t-\tilde t))^2]}{\|t\|_{2,n}^2} \le \sigma^2\phi(m)\|t-\tilde t\|^2/\kappa(m)^2,$$
and
$$E_\epsilon\mathbb{E}_n\Big[\Big(\frac{\epsilon_i x_i'\tilde t}{\|t\|_{2,n}} - \frac{\epsilon_i x_i'\tilde t}{\|\tilde t\|_{2,n}}\Big)^2\Big] = \sigma^2\,\mathbb{E}_n[(x_i'\tilde t)^2]\Big(\frac{1}{\|t\|_{2,n}} - \frac{1}{\|\tilde t\|_{2,n}}\Big)^2 = \sigma^2\,\frac{(\|\tilde t\|_{2,n} - \|t\|_{2,n})^2}{\|t\|_{2,n}^2} \le \sigma^2\phi(m)\|t-\tilde t\|^2/\kappa(m)^2.$$
Thus
$$\rho\Big(\frac{\epsilon_i x_i't}{\|t\|_{2,n}},\frac{\epsilon_i x_i'\tilde t}{\|\tilde t\|_{2,n}}\Big) \le 2\sigma\|t-\tilde t\|\sqrt{\phi(m)}/\kappa(m) = 2\sigma\mu_m\|t-\tilde t\|.$$
Then the bound (3.17) follows from the bound on covering numbers of the unit ball given in [18], page 94, with $R = 2\sigma\mu_m$, for any $\varepsilon < \sigma$. $\square$
Proof of Theorem 1. Let $\tilde\delta := \tilde\beta - \beta_0$. By definition of the second-step estimator, $\hat Q(\tilde\beta) \le \hat Q(\hat\beta)$ and $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\hat T})$. Thus,
$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \big(\hat Q(\hat\beta) - \hat Q(\beta_0)\big) \wedge \big(\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)\big) \le B_n \wedge C_n.$$
By Lemma 6 part 1, for any $\varepsilon > 0$ there exists a constant $K_\varepsilon$ such that with probability at least $1-\varepsilon$:
$$\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \hat Q(\tilde\beta) - \hat Q(\beta_0),$$
where $A_{\varepsilon,n} = K_\varepsilon\sigma\sqrt{(\hat m\log p + (\hat m+s)\log\mu_{\hat m})/n}$. Combining these relations we obtain the inequality
$$\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le B_n \wedge C_n,$$
solving which we obtain the stated result:
$$\|\tilde\delta\|_{2,n} \le A_{\varepsilon,n} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}. \qquad\square$$
4. A Generic Result on Post-Trimmed Estimators

In this section we investigate post-trimmed estimators that arise from applying unpenalized least squares in the second step to the models selected by trimmed estimators in the first step. Formally, given a first-step estimator $\hat\beta$, we define its trimmed support at level $t \ge 0$ as
$$\hat T(t) := \{j \in \{1,\dots,p\} : |\hat\beta_j| > t\}.$$
We then define the post-trimmed estimator as
$$\tilde\beta^t = \arg\min_{\mathrm{support}(\beta)\subseteq\hat T(t)}\ \hat Q(\beta). \tag{4.18}$$
The traditional trimming scheme sets the trimming threshold $t \ge \max_{1\le j\le p}|\hat\beta_j - \beta_{0j}|$, so as to trim all small coefficient estimates smaller than the uniform estimation error. As discussed in Section 2.3, this method is particularly appealing in parametric models in which the non-zero components are well separated from zero, where it acts as a very effective model selection device. Unfortunately, this scheme may perform poorly in parametric models with true coefficients not well separated from zero and in nonparametric models.
Indeed, even in parametric models with many small but non-zero true coefficients, trimming the estimates too aggressively may result in large goodness-of-fit losses, and consequently in slow rates of convergence and even inconsistency for the second-step estimators. This issue directly motivates our new goodness-of-fit based trimming method, which trims small coefficient estimates as much as possible subject to maintaining a certain goodness-of-fit level. Unlike traditional trimming, our new method is completely data-driven, which makes it appealing for practice. Moreover, our method is at least as good as LASSO or post-LASSO theoretically, but performs better than both of these methods practically, in a wide range of experiments. In the remainder of the section we present generic performance bounds for both the new method and the traditional trimming method.
4.1. Goodness-of-Fit Trimming. Here we propose a trimming method that selects the trimming level $t$ based on the goodness-of-fit of the post-trimmed estimator. Let $\gamma$ denote the maximal allowed loss (gain) in goodness-of-fit (goof) relative to the first-step estimator. We define the goof-trimming threshold $t_\gamma$ as the solution to
$$t_\gamma := \max\big\{t : \hat Q(\tilde\beta^t) - \hat Q(\hat\beta) \le \gamma\big\}. \tag{4.19}$$
Then we define the selected model and the post-goof-trimmed estimator as
$$\hat T := \hat T(t_\gamma) \quad \text{and} \quad \tilde\beta := \tilde\beta^{t_\gamma}. \tag{4.20}$$
Our construction (4.19) and (4.20) selects the most aggressive trimming threshold subject to maintaining a certain level of goodness-of-fit as measured by the least squares criterion function. Note that we can compute the data-driven trimming threshold (4.19) very effectively using a binary search procedure described below.
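Because $\hat Q(\tilde\beta^t)$ is nondecreasing in $t$ (enlarging $t$ can only shrink the allowed support), $t_\gamma$ in (4.19) can be found by a binary search over the distinct values of $|\hat\beta_j|$, each step solving one small least squares problem. A minimal sketch of this search (names hypothetical, not the authors' code):

```python
# Goodness-of-fit (goof) trimming (4.19)-(4.20): find the largest threshold t such that
# refitting OLS on {j : |beta_hat_j| > t} loses at most gamma in the LS criterion Q.
import numpy as np

def Q(X, y, beta):
    return np.mean((y - X @ beta) ** 2)

def refit(X, y, support):
    beta = np.zeros(X.shape[1])
    if support.size > 0:
        beta[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return beta

def goof_trim(X, y, beta_hat, gamma=0.0):
    Q_hat = Q(X, y, beta_hat)
    # candidate thresholds: the distinct magnitudes of the first-step coefficients
    cand = np.unique(np.abs(beta_hat[beta_hat != 0]))
    lo, hi, t_gamma = -1, cand.size - 1, 0.0
    while lo < hi:                          # binary search: loss of fit is monotone in t
        mid = (lo + hi + 1) // 2
        beta_t = refit(X, y, np.flatnonzero(np.abs(beta_hat) > cand[mid]))
        if Q(X, y, beta_t) - Q_hat <= gamma:
            t_gamma, lo = cand[mid], mid    # feasible: try a more aggressive threshold
        else:
            hi = mid - 1
    support = np.flatnonzero(np.abs(beta_hat) > t_gamma)
    return refit(X, y, support), support, t_gamma
```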
Theorem 3 (Performance of a generic post-goof-trimmed estimator). In either the parametric or the nonparametric model, let $\hat\beta$ be any first-step estimator, and let $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0)$ and $C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)$, with $\hat T = \hat T(t_\gamma)$ and $\hat m := |\hat T\setminus T|$. For any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s + \sqrt{(\gamma + B_n)_+ \wedge (C_n)_+}, \tag{4.21}$$
where $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey bounds (3.13) stated earlier, with $\hat T = \hat T(t_\gamma)$.
Note that the bounds on the prediction norm stated in Theorem 3 and equation (3.13) of Lemma 5 apply to any generic post-goof-trimmed estimator, provided we can bound both $\hat m := |\hat T\setminus T|$ and the rate of convergence $\|\hat\beta - \beta_0\|_{2,n}$ of the first-step estimator. For the purpose of obtaining rates, we can often use the bound $\hat m \le \tilde m$, where $\tilde m$ is the number of wrong regressors selected by the first-step estimator, provided the bounds on $\tilde m$ are tight, as, for example, in the case of LASSO. Of course, $\hat m$ is potentially much smaller than $\tilde m$, resulting in a smaller variance for the post-goof-trimmed estimator. For instance, in the case of (feasible) LASSO, we can even have $\hat m = 0$.

Also, note that if the selected model contains the true model, that is $T \subseteq \hat T$, then we have $C_n = 0$, and these terms drop out of the bound (4.21). Otherwise, if the selected model fails to contain the true model, that is $T \not\subseteq \hat T$, the performance of the second-step estimator is determined by both $\hat m$ and $(\gamma + B_n)_+ \wedge C_n$. Lemma 3 provides sufficient conditions for $T \subseteq \hat T$ to hold for the given threshold $t_\gamma$. Also, note that if $\hat m = 0$ and $B_n = C_n = 0$, the conditions of Lemma 3 on perfect model selection in the parametric model hold with the threshold $t = t_\gamma$.

Comment 4.1 (Recommended choice of $\gamma$). A nice feature of the theorem above is that it allows for a wide range of choices of $\gamma$. If we set $\gamma = -B_n$, so that $\gamma + B_n = 0$, we would eliminate the second term in the rate bound (4.21). This makes sense, since the first-step estimator can suffer from a substantial regularization bias; a negative $\gamma$ actually requires the second-step estimator to gain fit relative to the first-step estimator. Although this choice is not practical and not always feasible, since $B_n$ is not available, it provides a simple rationale for choosing $\gamma < 0$. The simplest choice with good theoretical guarantees is $\gamma = 0$, which requires there to be no loss of fit relative to the first-step estimator. Consequently, our recommended data-driven choice is a negative $\gamma$; although the theoretical guarantees of this choice are comparable to those of $\gamma = 0$, this proposal led to the best performance of the post-trimmed estimator in our computational experiments.

Comment 4.2 (Efficient computation of $t_\gamma$). For any $\gamma$, we can compute the value $t_\gamma$ in (4.19) exactly by running at most $\log_2|\hat T|$ unpenalized least squares problems. Since there are at most $|\hat T|$ possible relevant values of $t$, we can compute $t_\gamma$ by a binary search over these values.
Proof of Theorem 3. Let $\tilde\delta := \tilde\beta - \beta_0$. By definition, $\hat Q(\tilde\beta) \le \hat Q(\hat\beta) + \gamma$, so that
$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \gamma + \hat Q(\hat\beta) - \hat Q(\beta_0) = \gamma + B_n.$$
On the other hand, since $\tilde\beta$ is a minimizer of $\hat Q$ over the support $\hat T$, $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\hat T})$, so that
$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0) = C_n.$$
By Lemma 6 part (1), for any $\varepsilon > 0$, there is a constant $K_\varepsilon$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \hat Q(\tilde\beta) - \hat Q(\beta_0),$$
where $A_{\varepsilon,n} = K_\varepsilon\sigma\sqrt{(\hat m\log p + (\hat m+s)\log\mu_{\hat m})/n}$. Combining the inequalities gives
$$\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le (\gamma + B_n) \wedge C_n.$$
Solving this inequality for $\|\tilde\delta\|_{2,n}$ gives the stated result. $\square$
4.2. Traditional Trimming. Next we consider the traditional trimming scheme, which is based on the magnitude of the estimated coefficients. Given the first-step estimator $\hat\beta$, define the trimmed first-step estimator $\hat\beta_t$ by setting $\hat\beta_{tj} = \hat\beta_j 1\{|\hat\beta_j| > t\}$ for $j = 1,\dots,p$. Finally define the selected model and the post-trimmed estimator as
$$\hat T = \hat T(t) \quad \text{and} \quad \tilde\beta = \tilde\beta^t. \tag{4.23}$$
Let $\hat m := |\hat T\setminus T|$ denote the number of components selected outside the support $T$, $m_t := |T\setminus\hat T|$ the number of trimmed components of the first-step estimator, and $\gamma_t := \|\hat\beta_t - \hat\beta\|_{2,n}$ the prediction-norm distance from the first-step estimator $\hat\beta$ to the trimmed estimator $\hat\beta_t$.
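Traditional trimming (4.23) is simply hard thresholding of the first-step coefficients followed by the same OLS refit. A minimal sketch (hypothetical names), using $t = \lambda/n$ as one of the standard threshold choices discussed later in the paper:

```python
# Traditional (hard-thresholding) trimming (4.23): zero out small first-step coefficients,
# then refit OLS on the surviving support.
import numpy as np

def traditional_trim(X, y, beta_hat, t):
    beta_t = beta_hat * (np.abs(beta_hat) > t)           # trimmed first-step estimator
    support = np.flatnonzero(beta_t)                      # selected model T_hat(t)
    beta_tilde = np.zeros(X.shape[1])
    if support.size > 0:
        beta_tilde[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return beta_tilde, support

# e.g. with the threshold t = lambda / n used in the experiments of Section 6:
#   beta_tilde, support = traditional_trim(X, y, beta_hat_lasso, t=lam / n)
```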
Theorem 4 (Performance of a generic post-traditional-trimmed estimator). In either the parametric or the nonparametric model, let $\hat\beta$ be any first-step estimator, and let $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0)$ and $C_n := \hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)$. For any $\varepsilon > 0$, there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s + \sqrt{\Big(\gamma_t\big(K_\varepsilon\sigma\hat G_t + 2c_s + \gamma_t + 2\|\hat\beta-\beta_0\|_{2,n}\big) + B_n\Big)_+ \wedge (C_n)_+},$$
where $\hat G_t := \sqrt{m_t\log(p\,\mu_{m_t})}/\sqrt{n}$, $\gamma_t \le t\sqrt{\phi(m_t)\,m_t}$, and $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey bounds (3.13) stated earlier, with $\hat T = \hat T(t)$.
Note that the bounds on the prediction norm stated in Theorem 4 and equation (3.13) in Lemma 5 apply to any generic post-traditional-trimmed estimator. All of the bounds are easily controlled, just as in the case of LASSO, except for one major determinant of the performance: $\gamma_t$, which measures the loss-of-fit due to trimming. If the trimming threshold $t$ is too aggressive, for example as suggested in the model selection literature, $\gamma_t$ can be very large. Indeed, in parametric models with true coefficients not well separated from zero and in nonparametric models, aggressive trimming can result in large goodness-of-fit losses $\gamma_t$, and consequently very slow rates of convergence and even inconsistency for the second-step estimators. There are of course exceptional cases where good model selection is possible. One example is the parametric model with well-separated coefficients, as in Lemma 3 (2), where $T \subseteq \hat T$ wp $\to 1$ and $C_n = 0$ wp $\to 1$. We further discuss this issue in the next section in the context of LASSO.

Comment 4.3 (Traditional trimming based on goodness-of-fit). We can fix some drawbacks of traditional trimming by selecting the threshold $t$ subject to maintaining a certain goodness-of-fit level, as measured by the prediction norm. For a given $\bar\gamma_t > 0$, we can set $t = \max\{t : \|\hat\beta_t - \hat\beta\|_{2,n} \le \bar\gamma_t\}$. This choice uses maximal trimming subject to imposing at most a specific loss of fit $\bar\gamma_t$, which controls the dependence of the performance bounds on $\gamma_t$ completely. The theorem above formally covers this choice. However, it is not easy to specify a practical, data-driven $\bar\gamma_t$. Our main proposal described in the previous subsection resolves just such difficulties.

Proof of Theorem 4. Let $\hat\delta := \hat\beta - \beta_0$, $\bar\delta := \hat\beta_t - \beta_0$, and $\tilde\delta := \tilde\beta - \beta_0$. By definition of the post-trimmed estimator, $\hat Q(\tilde\beta) \le \hat Q(\hat\beta_t) \wedge \hat Q(\beta_{0\hat T})$, so that
$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le \big(\hat Q(\hat\beta_t) - \hat Q(\beta_0)\big) \wedge \big(\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)\big) \le \big(\hat Q(\hat\beta_t) - \hat Q(\hat\beta) + B_n\big) \wedge C_n,$$
where $B_n = \hat Q(\hat\beta) - \hat Q(\beta_0)$. By Lemma 6 (1), for any $\varepsilon > 0$ there is a constant $K_{\varepsilon,1}$ such that with probability at least $1-\varepsilon/2$
$$\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \hat Q(\tilde\beta) - \hat Q(\beta_0),$$
where $A_{\varepsilon,n} = K_{\varepsilon,1}\sigma\sqrt{(\hat m\log p + (\hat m+s)\log\mu_{\hat m})/n}$. On the other hand, we have
$$\hat Q(\hat\beta_t) - \hat Q(\hat\beta) = \hat Q(\hat\beta_t) - \hat Q(\beta_0) + \hat Q(\beta_0) - \hat Q(\hat\beta) = 2\mathbb{E}_n[\epsilon_i x_i'(\hat\beta_t - \hat\beta)] + 2\mathbb{E}_n[r_i x_i'(\hat\beta_t - \hat\beta)] + \|\bar\delta\|_{2,n}^2 - \|\hat\delta\|_{2,n}^2.$$
To bound the terms above, note first that by Theorem 2, there is a constant $K_{\varepsilon,2}$ such that with probability at least $1-\varepsilon/2$
$$\big|2\mathbb{E}_n[\epsilon_i x_i'(\hat\beta_t - \hat\beta)]\big| \le \sigma K_{\varepsilon,2}\hat G_t\|\hat\beta_t - \hat\beta\|_{2,n},$$
and, second, by Cauchy-Schwarz, $|2\mathbb{E}_n[r_i x_i'(\hat\beta_t - \hat\beta)]| \le 2c_s\|\hat\beta_t - \hat\beta\|_{2,n}$. Moreover,
$$\|\bar\delta\|_{2,n}^2 - \|\hat\delta\|_{2,n}^2 = \big(\|\bar\delta\|_{2,n} - \|\hat\delta\|_{2,n}\big)\big(\|\bar\delta\|_{2,n} + \|\hat\delta\|_{2,n}\big) \le \|\hat\beta_t - \hat\beta\|_{2,n}\big(\|\hat\beta_t - \hat\beta\|_{2,n} + 2\|\hat\delta\|_{2,n}\big).$$
Combining these inequalities and using that $\gamma_t = \|\hat\beta_t - \hat\beta\|_{2,n}$, we obtain with probability at least $1-\varepsilon$
$$\|\tilde\delta\|_{2,n}^2 - A_{\varepsilon,n}\|\tilde\delta\|_{2,n} - 2c_s\|\tilde\delta\|_{2,n} \le \big(\sigma K_{\varepsilon,2}\hat G_t\gamma_t + 2c_s\gamma_t + \gamma_t(\gamma_t + 2\|\hat\delta\|_{2,n}) + B_n\big) \wedge C_n.$$
Thus, solving the resulting quadratic inequality for $\|\tilde\delta\|_{2,n}$, we obtain the stated result. Also, note that $\gamma_t \le t\sqrt{\phi(m_t)\,m_t}$, by the Cauchy-Schwarz inequality and the definition of $\phi(m_t)$. $\square$
5. Post Model Selection Results for LASSO

In this section we specialize our results on post-penalized estimators to the case of LASSO being the first-step estimator. The previous generic results allow us to use sparsity bounds and the rate of convergence of LASSO to derive the rate of convergence of post-penalized estimators in both the parametric and nonparametric models. We also derive new sharp sparsity bounds for LASSO, which may be of independent interest.

5.1. A new, oracle sparsity bound for LASSO. We begin with a preliminary sparsity bound for LASSO.

Lemma 7 (Empirical pre-sparsity for LASSO). In either the parametric model or the nonparametric model, let $\hat m = |\hat T\setminus T|$ and $\lambda \ge c\,n\|S\|_\infty$. We have
$$\sqrt{\hat m} \le \sqrt{s}\,\sqrt{\phi(\hat m)}\,\frac{2\bar c}{\kappa_1} + 3(\bar c+1)\sqrt{\phi(\hat m)}\,\frac{n c_s}{\lambda},$$
where $c_s = 0$ in the parametric model.
POST-^i-PENALIZED ESTIMATORS
The lemma above
LASSO
states that
The lemma above immediately
achieves the oracle sparsity
example
When
bounded.
in
[2]
is
to a factor of
m<ps(P{n),
and
example when
>p
</)(n)
not sharp. However, for this case
by combining the preceding pre-sparsity
(t){rn).
(5.24)
Unfortunately, this
[13].
(p[n) diverges, for
p > 2n, the bound
up
upper bound on the sparsity of the form
yields the simple
_
as obtained for
21
bound
is
sharp only when
(f){n) is
\/logp in the Gaussian design with
we can construct a sharp
bound
sparsity
result with the following sub-linearity property of the
restricted sparse eigenvalues.
Lemma
i>l we
A
8 (Sub-linearity of restricted sparse eigenvalues). For any integer k
have
0(("^/c])
version of this
<
>0
and constant
\{](p{k).
lemma
has been previously proven in
for unrestricted eigenvalues
[1].
The
combination of the preceding two lemmas gives the following sparsity theorem. Recall that we
assume
c^
Theorem
< osj sjn and
a <
for
we have A(l -
1/4
5 (Sparsity bound for
LASSO
1/4,
S(f>{m
A
> ay^.
under data-driven penalty). In either the parametric
LASSO
model or the nonparametric model, consider the
a <
Oi\X)
estimator with A
< a^/s/n. and let m :- \f \T\. Consider the
2(2c/ki + 3(c — 1))'}. With probability at least I — a
Cs
n)
set
M
> cA(l - n\X),
- {m
G
N
:
m,
>
"
m<
The main imphcation
to be valid in
Lemmas
of
s
Theorem
and 10
9
min (b(m A n)
for
5
is
that
if
(2c
_
h
3(c - 1)
imnmeM
4>{'iti
A n) <p
1,
which we show below
important designs, then with probability at
least 1
m<ps.
Consequently, for these designs,
namely
s
:=
\T\
< s+rh <p
<p{n) for these designs,
in
[2]
and
[13].
sharpness, but
designs
all
of the
same order
with high probability. The reason
new bound
for this
is
as the oracle sparsity,
that iiimmeM
'/'("^)
^
is
comparable to the bounds in
[20] in
terms of order of
depend on the unknown
[20].
Next we show that miximeM 0("i
bound
is
requires a smaller penalty level A which also does not
sparse eigenvalues as in
the
sparsity
(5.25)
which allows us to sharpen the previous sparsity bound (5.24) considered
Also, our
it
s
LASSO's
—a
/^
^)
^P
(5.25) holds as a consequence.
1
f™"
As a
two very
common
side contribution,
designs of interest, so that
we
also
show that
for these
the restricted sparse eigenvalues and restricted eigenvalues defined earlier behave
We state these results in asymptotic form for the sake of exposition, although we can convert them to finite sample form using the results in [20] and Lemma 7. The following lemma deals with a Gaussian design; it uses the standard concept of (unrestricted) sparse eigenvalues (see, e.g. [2]) to state a primitive condition on the population design matrix.

Lemma 9 (Gaussian design). Suppose $\tilde x_i$, $i = 1,\dots,n$, are i.i.d. zero-mean Gaussian random vectors, such that the population design matrix $E[\tilde x_i\tilde x_i']$ has ones on the diagonal, and its $s\log n$-sparse eigenvalues are bounded from above by $\varphi < \infty$ and bounded from below by $\kappa^2 > 0$. Define $x_i$ as a normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]}$. Then for any $m \le (s\log(n/e)) \wedge (n/[16\log p])$, with probability at least $1 - 2\exp(-n/16)$,
$$\phi(m) \le 8\varphi, \quad \kappa(m)^2 \ge \kappa^2/72, \quad \text{and} \quad \mu_m \le 24\sqrt{\varphi}/\kappa.$$
Therefore, under the conditions of Theorem 5 and $n/(s\log p) \to \infty$, we have that as $n \to \infty$,
$$\hat m \le s\cdot 8\varphi\Big(\frac{2\bar c}{\kappa_1} + 3(\bar c-1)\Big)^2$$
with probability approaching at least $1-\alpha$, where we can take $\kappa_1 \ge \kappa/24$.

The following lemma deals with arbitrary bounded regressors.

Lemma 10 (Bounded design). Suppose $\tilde x_i$, $i = 1,\dots,n$, are i.i.d. bounded zero-mean random vectors, with $\max_{1\le i\le n,\,1\le j\le p}|\tilde x_{ij}| \le K_B$. Assume that the population design matrix $E[\tilde x_i\tilde x_i']$ has ones on the diagonal, and its $s\log n$-sparse eigenvalues are bounded from above by $\varphi < \infty$ and bounded from below by $\kappa^2 > 0$. Define $x_i$ as a normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/\sqrt{\mathbb{E}_n[\tilde x_{ij}^2]}$. Then there is a constant $\kappa'_B > 0$ such that if $\sqrt{n}/K_B \to \infty$, for any $m \le (s\log(n/e)) \wedge (\kappa'_B\sqrt{n}/\log p)$ we have that as $n \to \infty$, with probability approaching 1,
$$\phi(m) \le 4\varphi, \quad \kappa(m)^2 \ge \kappa^2/4, \quad \text{and} \quad \mu_m \le 4\sqrt{\varphi}/\kappa.$$
Therefore, under the conditions of Theorem 5, and provided $\sqrt{n}/K_B \to \infty$ and $n/(s\log p) \to \infty$, we have that as $n \to \infty$,
$$\hat m \le s\cdot 4\varphi\Big(\frac{2\bar c}{\kappa_1} + 3(\bar c-1)\Big)^2$$
with probability approaching at least $1-\alpha$, where we can take $\kappa_1 \ge \kappa/8$.
POST-f:-PENALIZED ESTIMATORS
Proof of Theorem
A
>
c
111
n||5||oo-
The
5.
23
choice of A imphes that with probabihty at least
that event, by
Va <
Lemma
v/0(m)
•
1
— q we have
7
2cyf^/Ki
+ 3(c+ l)\/0(m) ncjA,
•
which can be rewritten as
m
Note that
Therefore by
< n by optimahty
Lemma
.
< 2k
[/c]
.,
(5.26)
.
G M., and suppose
m
> M.
on sublinearity of sparse eigenvalues
8
fh
Thus, since
2
2c
nc, \
„,_
—
+3(c+l)-^
conditions. Consider any M
rn<s-(/)(m)
for
<
any
+!)"''''"
777.
s
M
>
A:
1
<^(M)f- +
3(c
we have
2c
nc^\
M < s-2(f)(M) —
+3(c+l)
M and
which violates the condition on
we must have
Therefore,
777.
s since c^
< a\J sjn,
more with
7Ti
< [M A
77)
Further, using again that Cg
nc.
— +3(c+l) —
^
< a\J sjn and
m<s- (t){M A
5.2.
l)/c
=
c
—
The
1.
cay/n, and (c
+
=
l)/c
c
—
1.
we obtain
/2c
fn<s-(/)(7WA7i)
+
>
< M.
In turn, applying (5.26) once
since (c
A
result follows
\^
.
> casjn we have
A
— + 3(c 2c-
77)
1
by minimizing the bound over
Performance of the post-LASSO Estimator. Here we show
M e M-
that the
D
post-LASSO
estimator enjoys good theoretical performance despite possibly "poor" selection of the model
by LASSO.
Theorem 6 (Performance of post-LASSO). In either the parametric model or the nonparametric model, if $\lambda \ge c\,n\|S\|_\infty$, then for any $\varepsilon > 0$ there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s + 1\{T\not\subseteq\hat T\}\sqrt{\frac{\lambda\sqrt{s}}{n\kappa_1}\,\|\hat\beta-\beta_0\|_{2,n}},$$
where $\hat m := |\hat T\setminus T|$ and $c_s = 0$ in the parametric model. In particular, under the data-driven choice of $\lambda$ specified in (2.3) with $\log(1/\alpha) \lesssim \log p$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that with probability at least $1-\alpha-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K'_{\varepsilon,\alpha}\sigma\sqrt{\frac{\hat m\log(p\,\mu_{\hat m})}{n}} + 2c_s + 1\{T\not\subseteq\hat T\}\,K'_{\varepsilon,\alpha}\sigma\,\frac{\sqrt{s\log p}}{\sqrt{n}\,\kappa_1}. \tag{5.27}$$

Proof of Theorem 6. Note that by the optimality of $\hat\beta$ in the LASSO problem, and letting $\hat\delta = \hat\beta - \beta_0$, we have
$$\hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big) \le \frac{\lambda}{n}\big(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1\big). \tag{5.28}$$
If $\|\hat\delta_{T^c}\|_1 > \bar c\|\hat\delta_T\|_1$, then $B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \le 0$ since $\bar c \ge 1$. Otherwise, if $\|\hat\delta_{T^c}\|_1 \le \bar c\|\hat\delta_T\|_1$, by RE.1$(\bar c)$ we have
$$B_n \le \frac{\lambda}{n}\|\hat\delta_T\|_1 \le \frac{\lambda\sqrt{s}\,\|\hat\delta\|_{2,n}}{n\kappa_1}. \tag{5.29}$$
The result follows by applying Theorem 1, and also noting that if $T \subseteq \hat T$ we have $C_n = 0$, so that $B_n \wedge C_n \le 1\{T\not\subseteq\hat T\}B_n$. The second claim is immediate from the first, by applying Lemma 2 to bound $\|\hat\delta\|_{2,n}$, and using the condition that $c_s \le \sigma\sqrt{s/n}$, relation (2.9), in the case of the nonparametric model. $\square$
This theorem provides a performance bound for post-LASSO as a function of 1) LASSO's sparsity characterized by $\hat m$, 2) LASSO's rate of convergence, and 3) LASSO's model selection ability. For common designs this bound implies that post-LASSO performs at least as well as LASSO, but it can be strictly better in some cases, and has smaller regularization bias. We provide further theoretical comparisons in what follows, and computational examples supporting these comparisons appear in Section 6.
It is also worth repeating here that performance bounds in other norms of interest immediately follow by the triangle inequality and by the definition of $\kappa(\hat m)$:
$$\sqrt{\mathbb{E}_n[x_i'\tilde\beta - f_i]^2} \le \|\tilde\beta - \beta_0\|_{2,n} + c_s \quad \text{and} \quad \|\tilde\beta - \beta_0\| \le \|\tilde\beta - \beta_0\|_{2,n}/\kappa(\hat m). \tag{5.30}$$

Comment 5.1 (Comparison of the performance of post-LASSO vs LASSO). In order to carry out complete and formal comparisons between LASSO and post-LASSO, we assume that
$$\phi(\hat m) \lesssim_P 1, \quad \kappa_1 \gtrsim_P 1, \quad \mu_{\hat m} \lesssim_P 1, \quad \text{and} \quad \log(1/\alpha) \lesssim \log p \text{ with } \alpha = o(1). \tag{5.31}$$
Lemmas 9 and 10 established fairly general sufficient conditions for the first three relations in (5.31) to hold. The fourth relation is a mild condition on the choice of $\alpha$ in the definition of the data-driven choice (2.3) of penalty level $\lambda$, which simplifies the probability statements in what follows.
We first note that under (5.31) post-LASSO with the data-driven penalty level $\lambda$ specified in (2.3) obeys:
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\hat m\log p}{n}} + \sigma\sqrt{\frac{s}{n}} + 1\{T\not\subseteq\hat T\}\,\sigma\sqrt{\frac{s\log p}{n}}. \tag{5.32}$$
In addition, conditions (5.31) and Theorem 5 imply the oracle sparsity $\hat m \lesssim_P s$. It follows that post-LASSO generally achieves the same near-oracle rate as LASSO:
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}}. \tag{5.33}$$
Notably, this occurs despite the fact that LASSO may in general fail to correctly select the oracle model $T$ as a subset, that is $T \not\subseteq \hat T$. Furthermore, there is a class of well-behaved models - a neighborhood of parametric models with well-separated coefficients - in which post-LASSO strictly improves upon LASSO's rate. Specifically, if $\hat m = o_P(s)$ and $T \subseteq \hat T$ wp $\to 1$, then post-LASSO strictly improves upon LASSO's rate:
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{o(s)\log p}{n}} + \sigma\sqrt{\frac{s}{n}}.$$
Finally, in the extreme case of perfect model selection, that is, when $\hat m = 0$ and $T \subseteq \hat T$ wp $\to 1$, as under conditions of Lemma 4, post-LASSO naturally achieves the oracle performance, $\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$.

5.3. Performance of the post-goof-trimmed LASSO estimator. In what follows we provide performance bounds for the post-goof-trimmed estimator $\tilde\beta$ defined in equation (4.20) for the case where the first-step estimator $\hat\beta$ is LASSO.

Theorem 7 (Performance of post-goof-trimmed LASSO). In either the parametric model or the nonparametric model, if $\lambda \ge c\,n\|S\|_\infty$, then for any $\varepsilon > 0$ there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s + 1\{T\not\subseteq\hat T\}\sqrt{\gamma + \frac{\lambda\sqrt{s}}{n\kappa_1}\,\|\hat\beta-\beta_0\|_{2,n}},$$
where $\hat m := |\hat T\setminus T|$, $\hat T = \hat T(t_\gamma)$, and $c_s = 0$ in the parametric case. Under $\gamma \le \sigma^2 s/n$ and the data-driven choice of $\lambda$ specified in (2.3) with $\log(1/\alpha) \lesssim \log p$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that with probability at least $1-\alpha-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K'_{\varepsilon,\alpha}\sigma\sqrt{\frac{\hat m\log(p\,\mu_{\hat m}) + s\log\mu_{\hat m}}{n}} + 2c_s + 1\{T\not\subseteq\hat T\}\,K'_{\varepsilon,\alpha}\sigma\,\frac{\sqrt{s\log p}}{\sqrt{n}\,\kappa_1}. \tag{5.34}$$
(5.34)
ALEXANDRE BELLONI AND VICTOR CHERNOZHUKOV
26
Proof.
The proof
Theorem
3 in the last step.
use the condition
7 <
of the first claim follows the
imposed
Cg
The second claim
< g^JsJu from
same
steps as the proof of
follows immediately from the
(2.9) in the
ability of the
LASSO,
first,
LASSO's
characterized by m, 2)
trimming scheme. Generally,
is
this
LASSO
since the post-goof-trimmed
where we
also
D
LASSO
rate of convergence,
bound
is
at least as
much
trims as
as a function of
model
selection
good as the bound
for post-
and
3) the
as possible subject to maintaining
also appealing that this estimator determines the
trimming
completely data-driven fashion. Moreover, by construction the estimated model
post-LASSO's model, which leads
post-LASSO
in
some
computational examples
Comment
invoking
in the construction of the estimator.
certain goodness-of-fit. It
over
6,
nonparametric model, in addition the condition
This theorem provides a performance bound for post-goof-trimmed
1) its spai'sity
Theorem
cases.
to the superior
We
is
level in
a
sparser than
performance of post-goof-trimmed
LASSO
further provide further theoretical comparisons below and
in Section 6.
Comment 5.2 (Comparison of the performance of post-goof-trimmed LASSO vs LASSO and post-LASSO). In order to carry out complete and formal comparisons, we assume condition (5.31) as before. Under these conditions, post-goof-trimmed LASSO obeys the following performance bound:
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\hat m\log p}{n}} + \sigma\sqrt{\frac{s}{n}} + 1\{T\not\subseteq\hat T\}\,\sigma\sqrt{\frac{s\log p}{n}}, \tag{5.35}$$
which is analogous to the bound (5.32) for post-LASSO, since the sparsity $\hat m$ here is no larger than the corresponding quantity for post-LASSO, and $\hat m \lesssim_P s$ by conditions (5.31) and Theorem 5. It follows that in general post-goof-trimmed LASSO matches the near-oracle rate of convergence of LASSO and post-LASSO:
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}}.$$
Nonetheless, there is also a class of models - a neighborhood of parametric models with well-separated coefficients - for which improvements upon the rate of convergence of LASSO and post-LASSO are possible. Specifically, if $\hat m = o_P(s)$ and $T \subseteq \hat T$ wp $\to 1$, then we obtain the performance bound
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{o(s)\log p}{n}} + \sigma\sqrt{\frac{s}{n}},$$
so that post-goof-trimmed LASSO strictly improves upon LASSO's rate. Furthermore, if $\hat m = o_P(\tilde m)$, where $\tilde m$ is the corresponding sparsity of post-LASSO, and $T \subseteq \hat T$ wp $\to 1$, then post-goof-trimmed LASSO also outperforms post-LASSO:
$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{o(\tilde m)\log p}{n}} + \sigma\sqrt{\frac{s}{n}}.$$
Lastly, under conditions of Lemma 3 holding for the threshold $t_\gamma$, post-goof-trimmed LASSO achieves the oracle performance, $\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$.
5.4. Performance of the post-traditional-trimmed LASSO estimator. Next we consider the traditional trimming scheme, which truncates to zero all components below a threshold $t$. This is arguably the most used trimming scheme in the literature. To state the result, recall that $\hat\beta_{tj} = \hat\beta_j 1\{|\hat\beta_j| > t\}$, $\hat T = \hat T(t)$, $\hat m := |\hat T\setminus T|$, $m_t := |T\setminus\hat T|$, and $\gamma_t := \|\hat\beta_t - \hat\beta\|_{2,n}$.

Theorem 8 (Performance of post-traditional-trimmed LASSO). In either the parametric model or the nonparametric model, if $\lambda \ge c\,n\|S\|_\infty$, for any $\varepsilon > 0$ there is a constant $K_\varepsilon$ independent of $n$ such that with probability at least $1-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K_\varepsilon\sigma\sqrt{\frac{\hat m\log p + (\hat m+s)\log\mu_{\hat m}}{n}} + 2c_s + 1\{T\not\subseteq\hat T\}\sqrt{\gamma_t\big(K_\varepsilon\sigma\hat G_t + 2c_s + \gamma_t + 2\|\hat\beta-\beta_0\|_{2,n}\big) + \frac{\lambda\sqrt{s}}{n\kappa_1}\,\|\hat\beta-\beta_0\|_{2,n}},$$
where $\hat G_t := \sqrt{m_t\log(p\,\mu_{m_t})}/\sqrt{n}$, $\gamma_t \le t\sqrt{\phi(m_t)\,m_t}$, and $c_s = 0$ in the parametric case. Under the data-driven choice of $\lambda$ specified in (2.3) for $\log(1/\alpha) \lesssim \log p$, for any $\varepsilon > 0$ there is a constant $K'_{\varepsilon,\alpha}$ such that with probability at least $1-\alpha-\varepsilon$,
$$\|\tilde\beta - \beta_0\|_{2,n} \le K'_{\varepsilon,\alpha}\sigma\sqrt{\frac{\hat m\log(p\,\mu_{\hat m}) + s\log\mu_{\hat m}}{n}} + 2c_s + 1\{T\not\subseteq\hat T\}\Big(\sqrt{\gamma_t\big(K'_{\varepsilon,\alpha}\sigma\hat G_t + 6c_s + \gamma_t\big)} + K'_{\varepsilon,\alpha}\sigma\,\frac{\sqrt{s\log p}}{\sqrt{n}\,\kappa_1}\Big).$$

Proof. The proof of the first claim follows the same steps as the proof of Theorem 4, in the last step invoking Theorem 6. The second claim follows from the first, where we also use the condition $c_s \le \sigma\sqrt{s/n}$, relation (2.9), for the nonparametric model. $\square$

This theorem provides a performance bound for post-traditional-trimmed LASSO as a function of 1) its sparsity characterized by $\hat m$, 2) LASSO's rate of convergence, 3) the goodness-of-fit loss $\gamma_t$ relative to LASSO induced by trimming, and 4) the model selection ability of the trimming scheme. Generally, this bound may be worse than the bound for LASSO, and this arises because the post-traditional-trimmed LASSO may potentially use too aggressive a trimming threshold $t$, resulting in large goodness-of-fit losses $\gamma_t$. We provide further theoretical comparisons below and computational examples in Section 6.
ALEXANDRE BELLONI AND VICTOR CHERNOZHUKOV
28
this arises
ming
LASSO may
We provide
because the post-traditional-trimmed
resulting in large goodness-of-fit losses
7(.
below and computational examples in Section
Comment
potentially use too
this discussion
we
also
6.
assume conditions
made
(5.31)
LASSO vs LASSO
in the previous for-
LASSO
mal comparisons. Under these conditions, post-traditional-trimmed
obeys the bound:
$$\|\widehat\beta - \beta_0\|_{2,n} \;\lesssim_P\; \sigma\sqrt{\frac{\widehat m_t \log p}{n}} + \sigma\sqrt{\frac{s}{n}} + 1\{T \not\subseteq \widehat T_t\}\Big[\gamma_t \vee \sigma\sqrt{\frac{s \log p}{n}}\Big]. \qquad (5.36)$$

In this case we have $\widehat m_t \vee m_t \leq s + \widehat m \lesssim_P s$ by Theorem 5, and, in general, the rate above cannot improve upon LASSO's rate of convergence given in Lemma 2. As expected, the choice of $t$, which controls the goodness-of-fit loss $\gamma_t$ via the bound $\gamma_t \leq t\sqrt{\phi(m_t)\, m_t}$, can have a large impact on the performance bounds:

$$t \lesssim_P \sigma\sqrt{\frac{\log p}{n}} \;\Longrightarrow\; \|\widehat\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s \log p}{n}}, \qquad (5.37)$$

$$t \lesssim_P \sigma\sqrt{\frac{s \log p}{n}} \;\Longrightarrow\; \|\widehat\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s^2 \log p}{n}}.$$

Both options are standard suggestions reviewed in the literature on model selection via LASSO. The first choice (5.37), suggested by [11], is theoretically sound, since it guarantees that post-traditional-trimmed LASSO achieves the near-oracle rate of LASSO. The second choice, however, results in a very poor performance bound, and even suggests inconsistency if $s^2$ is large relative to $n$. Note that in practice we can implement the first choice (5.37) by setting $t = \lambda/n$, as suggested by Lemma 3 parts (2) and (3).
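To get a rough sense of the gap between these two bounds, one can plug in the values used in the computational experiments of Section 6, namely $s = 5$, $p = 500$, $n = 100$, and $\sigma = 1$ (constants are ignored, so this is only an order-of-magnitude illustration, not a statement from the theory above):

$$\sigma\sqrt{\frac{s\log p}{n}} = \sqrt{\frac{5\log 500}{100}} \approx 0.56, \qquad \sigma\sqrt{\frac{s^2\log p}{n}} = \sqrt{\frac{25\log 500}{100}} \approx 1.25,$$

that is, the second bound is larger by a factor of $\sqrt{s} \approx 2.2$.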
Furthermore, there is a special class of models - a neighborhood of parametric models with well-separated coefficients - for which improvements upon the rate of convergence of LASSO are possible. Specifically, under condition (5.33), that is, if $\widehat m_t = o_P(m)$, so that also $\widehat m_t = o_P(s)$, and $T \subseteq \widehat T_t$ wp $\to 1$, then we recover the performance bound

$$\|\widehat\beta - \beta_0\|_{2,n} \;\lesssim_P\; \sigma\sqrt{\frac{o(m)\log p}{n}} + \sigma\sqrt{\frac{s}{n}},$$

so that post-traditional-trimmed LASSO strictly improves upon LASSO's rate. Furthermore, post-traditional-trimmed LASSO also outperforms post-LASSO. Lastly, under the conditions of Lemma 3 holding for the given $t$, post-traditional-trimmed LASSO achieves the oracle performance, $\|\widehat\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$. $\square$
6. Empirical Performance Relative to LASSO

In this section we assess the finite-sample performance of the following estimators: 1) LASSO, which is our benchmark, 2) post-LASSO, 3) post-goof-trimmed LASSO, and 4) post-traditional-trimmed LASSO with the trimming threshold $t = \lambda/n$ suggested by Lemma 3 part (3).
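For concreteness, the following is a minimal sketch (not the authors' code) of how the post-model-selection estimators above can be computed from a first-step LASSO fit, written in Python with scikit-learn. Note that scikit-learn's penalty parameter alpha is parameterized differently from the paper's $\lambda$, so the correspondence with $t = \lambda/n$ is only schematic, and the data-driven goodness-of-fit ("goof") trimming rule, which is defined earlier in the paper, is omitted from the sketch.

    # Sketch only: post-LASSO and post-traditional-trimmed LASSO from a first-step LASSO fit.
    import numpy as np
    from sklearn.linear_model import Lasso

    def lasso_fit(X, y, alpha):
        """First-step LASSO coefficient estimate (alpha is scikit-learn's penalty level)."""
        return Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_

    def ols_refit(X, y, support):
        """Ordinary least squares on the selected columns; zeros elsewhere."""
        beta = np.zeros(X.shape[1])
        if support.any():
            beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
        return beta

    def post_lasso(X, y, alpha):
        """Refit OLS on the support selected by LASSO."""
        b = lasso_fit(X, y, alpha)
        return ols_refit(X, y, b != 0)

    def post_traditional_trimmed(X, y, alpha, t):
        """Set to zero all LASSO coefficients with |b_j| <= t, then refit OLS on the survivors."""
        b = lasso_fit(X, y, alpha)
        return ols_refit(X, y, np.abs(b) > t)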
We consider a "parametric" and a "nonparametric" model of the form:

$$y_i = f_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad f_i = x_i'\beta_0, \qquad i = 1, \ldots, n, \qquad (6.39)$$

where

$$\beta_0 = C \cdot [1, 1, 1, 1, 1, 0, 0, \ldots, 0]' \ \text{in the parametric model}, \qquad \beta_0 = C \cdot [1, 1/2, 1/3, \ldots, 1/p]' \ \text{in the nonparametric model}. \qquad (6.40)$$

The parameter $C$ determines the size of the coefficients, representing the "strength of the signal", and we vary $C$ between 0 and 2. The number of regressors is $p = 500$, the sample size is $n = 100$, the variance of the noise is $\sigma^2 = 1$, and we used 1000 simulations for each design. We generate regressors from the normal law $x_i \sim N(0, \Sigma)$, and consider three designs of the covariance matrix $\Sigma$: a) the isotropic design with $\Sigma_{jk} = 0$ for $j \neq k$; b) the Toeplitz design with $\Sigma_{jk} = (1/2)^{|j-k|}$; and c) the equi-correlated design with $\Sigma_{jk} = 1/2$ for $j \neq k$; in all designs $\Sigma_{jj} = 1$. Thus our parametric model is very sparse and offers a rather favorable setting for applying LASSO-type methods, while our nonparametric model is non-sparse and much less favorable.
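As an illustration, the designs above can be generated as follows. This is only a sketch under the stated parameter values; the function names are ours, not the authors'.

    # Sketch of the Monte Carlo designs: p = 500 regressors, n = 100 observations, sigma = 1.
    import numpy as np

    def make_design(n=100, p=500, design="isotropic", rho=0.5, rng=None):
        """Draw X with rows x_i ~ N(0, Sigma) for one of the three covariance designs."""
        rng = np.random.default_rng(rng)
        if design == "isotropic":
            Sigma = np.eye(p)
        elif design == "toeplitz":
            Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
        else:  # equi-correlated
            Sigma = np.full((p, p), rho)
            np.fill_diagonal(Sigma, 1.0)
        return rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    def make_beta0(C, p=500, model="parametric"):
        """beta0 = C*[1,1,1,1,1,0,...,0]' (parametric) or C*[1,1/2,...,1/p]' (nonparametric)."""
        b = np.zeros(p)
        if model == "parametric":
            b[:5] = 1.0
        else:
            b = 1.0 / np.arange(1, p + 1)
        return C * b

    def simulate(C, design, model, n=100, p=500, sigma=1.0, rng=None):
        """One simulation draw: returns (X, y, f, beta0) with y_i = f_i + eps_i."""
        rng = np.random.default_rng(rng)
        X = make_design(n=n, p=p, design=design, rng=rng)
        beta0 = make_beta0(C, p=p, model=model)
        f = X @ beta0
        y = f + sigma * rng.standard_normal(n)
        return X, y, f, beta0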
We present the results of the computational experiments for each design a)-c) in Figures 2-4. The left column of each figure reports the results for the parametric model, and the right column of each figure reports the results for the nonparametric model. For each model the figures plot the following as a function of the signal strength $C$ for each estimator $\widehat\beta$:

• in the top panel, the number of regressors selected, $|\widehat T|$,
• in the middle panel, the norm of the bias, $\|E[\widehat\beta - \beta_0]\|$,
• in the bottom panel, the average empirical risk, namely $E[E_n[(f_i - x_i'\widehat\beta)^2]]$.

We will focus the discussion on the isotropic design, and only highlight differences for the other designs.
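A rough sketch of how these three summaries can be computed across the 1000 simulation replications for a fixed signal strength $C$ is given below; the helper is ours, for illustration only, and assumes the estimates and simulation draws have already been collected.

    # betas: list of estimated coefficient vectors, one per replication;
    # X_list, f_list: the matching design matrices and regression functions.
    import numpy as np

    def summarize(betas, beta0, X_list, f_list):
        n_selected = np.mean([(b != 0).sum() for b in betas])        # top panel: |T_hat|
        bias_norm = np.linalg.norm(np.mean(betas, axis=0) - beta0)   # middle panel: ||E[beta_hat - beta0]||
        emp_risk = np.mean([np.mean((f - X @ b) ** 2)                # bottom panel: E[E_n[(f_i - x_i'beta_hat)^2]]
                            for b, X, f in zip(betas, X_list, f_list)])
        return n_selected, bias_norm, emp_risk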
Figure 2, left panel, shows the results for the parametric model with the isotropic design. We see from the bottom panel that, for a wide range of signal strength $C$, both post-LASSO and post-goof-trimmed LASSO significantly outperform both LASSO and post-traditional-trimmed LASSO in terms of empirical risk. The middle panel shows that the first two estimators' superior performance stems from their much smaller bias. We see from the top panel that
LASSO achieves good sparsity, ensuring that post-LASSO performs well, but post-goof-trimmed
LASSO achieves even better sparsity. Under very high signal strength, post-goof-trimmed
LASSO achieves the performance of the oracle estimator; post-traditional-trimmed LASSO
also achieves this performance; post-LASSO nearly matches it; while LASSO does not match
this performance. Interestingly, the post-traditional-trimmed LASSO performs very poorly for
intermediate ranges of signal.
Figure 2, right panel, shows the results for the nonparametric model with the isotropic design. We first see from the bottom panel that, as in the parametric model, both post-LASSO and post-goof-trimmed LASSO significantly outperform both LASSO and post-traditional-trimmed LASSO in terms of empirical risk. As in the parametric model, the middle panel shows that the first two estimators are able to outperform the last two because they have a much smaller bias. We also see from the top panel that, as in the parametric model, LASSO achieves good sparsity, while post-goof-trimmed LASSO achieves excellent sparsity. In contrast to the parametric model, in the nonparametric setting the post-traditional-trimmed LASSO performs poorly in terms of empirical risk for almost all signals, except for very weak signals. Also in contrast to the parametric model, no estimator achieves the exact oracle performance, although LASSO, and especially post-LASSO and post-goof-trimmed LASSO, perform nearly as well, as we would expect from the theoretical results.
Figure 3 shows the results for the parametric and nonparametric model with the Toeplitz design. This design deviates only moderately from the isotropic design, and we see that the previous findings continue to hold.

Figure 4 shows the results under the equi-correlated design. This design strongly deviates from the isotropic design, but we see that the previous findings continue to hold with only a few differences. Specifically, we see from the top panels that in this case LASSO no longer selects very sparse models, while post-goof-trimmed LASSO continues to perform well and selects very sparse models. Consequently, in the case of the parametric model, post-goof-trimmed LASSO still substantially outperforms post-LASSO in terms of empirical risk, as the bottom-left panel shows. In contrast, we see from the bottom right panel that in the nonparametric model, post-goof-trimmed LASSO performs equally as well as post-LASSO in terms of empirical risk, despite the fact that it uses a much sparser model for estimation.
POST-^i-PENALIZED ESTIMATORS
The
._
findings above confirm our theoretical results on post-penalized estimators in parametric
LASSO
and nonparametric models. Indeed, we see that post-goof-trimmed
at least as
bias.
good
as
LASSO,
Post-goof-trimmed
(or very close to
LASSO
LASSO
outperforms post-LASSO whenever
when the
signal
is
strong and the model
does not produce
parametric and sparse
is
being such), the LASSO-based model selection permits the selection of oracle
That allows
or near-oracle model.
in empirical risk over
for
LASSO. Of
post-model selection estimators to achieve improvements
particular note
the excellent performance of post-goof-
is
trimmed LASSO, which uses data-driven trimming to
fully consistent
for
and post-LASSO are
and often perform considerably better since they remove penalization
excellent sparsity. Moreover,
is
31
with our theoretical
intermediate ranges of signal.
results.
select
a sparse model. This performance
trimming performs poorly
Finally, traditional
In particular,
exhibits very large biases leading to large
it
goodness-of-fit losses.
Appendix A. Proofs of Lemmas 1 and 2

Proof of Lemma 1. Following Bickel, Ritov and Tsybakov [2], to establish the result we make use of the following relations: for $\delta = \widehat\beta - \beta_0$,

$$\widehat Q(\widehat\beta) - \widehat Q(\beta_0) = \|\delta\|_{2,n}^2 - 2E_n[\epsilon_i x_i'\delta] \geq \|\delta\|_{2,n}^2 - \|S\|_\infty \|\delta\|_1 \geq \|\delta\|_{2,n}^2 - \frac{\lambda}{cn}(\|\delta_T\|_1 + \|\delta_{T^c}\|_1), \qquad (A.41)$$

since $\lambda \geq cn\|S\|_\infty$. By definition of $\widehat\beta$, $\widehat Q(\widehat\beta) - \widehat Q(\beta_0) \leq \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\widehat\beta\|_1$, so that

$$\widehat Q(\widehat\beta) - \widehat Q(\beta_0) \leq \frac{\lambda}{n}(\|\delta_T\|_1 - \|\delta_{T^c}\|_1). \qquad (A.42)$$

Combining these relations, we obtain

$$\|\delta\|_{2,n}^2 \leq \frac{\lambda}{n}\Big(1 + \frac{1}{c}\Big)\|\delta_T\|_1 - \frac{\lambda}{n}\Big(1 - \frac{1}{c}\Big)\|\delta_{T^c}\|_1, \qquad (A.43)$$

which, by (A.41) and (A.42), implies that

$$\|\delta_{T^c}\|_1 \leq \frac{c+1}{c-1}\|\delta_T\|_1 = \bar c\,\|\delta_T\|_1, \qquad (A.44)$$

where we used that $c > 1$. Going back to (A.43), we get

$$\|\delta\|_{2,n}^2 \leq \frac{\lambda}{n}\Big(1 + \frac{1}{c}\Big)\|\delta_T\|_1 \leq \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\,\kappa_1}\|\delta\|_{2,n},$$

where we used that $\|\delta_T\|_1 \leq \sqrt{s}\,\|\delta\|_{2,n}/\kappa_1$ and invoked RE.1(c) since (A.44) holds. Solving for $\|\delta\|_{2,n}$ we get that

$$\|\delta\|_{2,n} \leq \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\,\kappa_1}.$$

Finally, the bound on $\lambda(1-\alpha|X)$ follows from the union bound and a probability inequality for Gaussian random variables, $P(|\xi| > M) \leq \exp(-M^2/2)$ if $\xi \sim N(0,1)$, see Proposition 2.2.1(a) in [8]. $\square$
[Figure 2 here: panels A. Sparsity, B. Bias, C. Empirical Risk; left column: parametric model, right column: nonparametric model; horizontal axis: signal strength C; estimators plotted: LASSO, post-LASSO, post-goof-trimmed LASSO, post-traditional-trimmed LASSO.]

Figure 2. This figure plots the performance of the estimators listed in the text under the isotropic design for the covariates, $\Sigma_{jk} = 0$ if $j \neq k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$, with 1000 simulations for each value of $C$.
[Figure 3 here: panels A. Sparsity, B. Bias, C. Empirical Risk; left column: parametric model, right column: nonparametric model; horizontal axis: signal strength C; estimators plotted: LASSO, post-LASSO, post-goof-trimmed LASSO, post-traditional-trimmed LASSO.]

Figure 3. This figure plots the performance of the estimators listed in the text under the Toeplitz design for the covariates, $\Sigma_{jk} = (1/2)^{|j-k|}$ if $j \neq k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$, with 1000 simulations for each value of $C$.
[Figure 4 here: panels A. Sparsity, B. Bias, C. Empirical Risk; left column: parametric model, right column: nonparametric model; horizontal axis: signal strength C; estimators plotted: LASSO, post-LASSO, post-goof-trimmed LASSO, post-traditional-trimmed LASSO.]

Figure 4. This figure plots the performance of the estimators listed in the text under the equi-correlated design for the covariates, $\Sigma_{jk} = 1/2$ if $j \neq k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$, with 1000 simulations for each value of $C$.
Proof of Lemma 2. Similar to the proof of Lemma 1, we make use of the following relation: for $\delta = \widehat\beta - \beta_0$,

$$\widehat Q(\widehat\beta) - \widehat Q(\beta_0) = \|\delta\|_{2,n}^2 - 2E_n[\epsilon_i x_i'\delta] + 2E_n[r_i x_i'\delta] \geq \|\delta\|_{2,n}^2 - \|S\|_\infty\|\delta\|_1 - 2c_s\|\delta\|_{2,n}. \qquad (A.45)$$

By definition of $\widehat\beta$, $\widehat Q(\widehat\beta) - \widehat Q(\beta_0) \leq \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\widehat\beta\|_1$, which implies that

$$\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \leq \frac{\lambda}{n}(\|\delta_T\|_1 - \|\delta_{T^c}\|_1) + \frac{\lambda}{cn}\|\delta\|_1. \qquad (A.46)$$

If $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \leq 0$, then we have established the bound $\|\delta\|_{2,n} \leq 2c_s$, and therefore $\delta$ satisfies the domination condition (2.6). On the other hand, if $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} > 0$, we get

$$\|\delta_{T^c}\|_1 \leq \frac{c+1}{c-1}\|\delta_T\|_1 = \bar c\,\|\delta_T\|_1. \qquad (A.47)$$

From (A.46) and using RE.1(c) we get

$$\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \leq \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\,\kappa_1}\|\delta\|_{2,n},$$

which gives the result on the prediction norm. Finally, the bound on $\lambda(1-\alpha|X)$ follows from the union bound and a probability inequality for Gaussian random variables, $P(|\xi| > M) \leq \exp(-M^2/2)$ if $\xi \sim N(0,1)$, see Proposition 2.2.1(a) in [8]. $\square$

Appendix B. Proofs of Lemmas for Post-Model Selection Estimators

Proof of Lemma 5. The bound on $B_n$ follows from Lemma 6 result (1). The bound on $C_n$ follows from the relation

$$\big|\widehat Q(\beta_0 + \delta) - \widehat Q(\beta_0) - \|\delta\|_{2,n}^2\big| = \big|2E_n[\epsilon_i x_i'\delta] + 2E_n[r_i x_i'\delta]\big|,$$

then applying Theorem 2 on sparse control of noise to $|2E_n[\epsilon_i x_i'\delta]|$, bounding $|2E_n[r_i x_i'\delta]|$ by $2c_s\|\delta\|_{2,n}$ using the Cauchy-Schwarz inequality, and bounding $\binom{p}{m}$ by $p^m$, to prove the statement of the theorem. $\square$

Proof of Lemma 6. Result (1) follows from Theorem 2. Result (2) also follows from Theorem 2, but applying it with $s = 0$ and $m = k$, and noting that we can take $c_s = 0$ (since only the components in $T$ are modified). $\square$
Appendix C. Proofs of Lemmas for Sparsity of the LASSO Estimator

Proof of Lemma 7. Let $\widehat T = \operatorname{support}(\widehat\beta)$ and $\widehat m = |\widehat T \setminus T|$. We have from the optimality conditions that

$$2E_n[x_{ij}(y_i - x_i'\widehat\beta)] = \operatorname{sign}(\widehat\beta_j)\,\lambda/n \quad \text{for each } j \in \widehat T \setminus T.$$

Therefore we have, for $R = (r_1, \ldots, r_n)'$,

$$\sqrt{\widehat m}\,\lambda = 2\|(X'(Y - X\widehat\beta))_{\widehat T\setminus T}\| \leq 2\|(X'(Y - R - X\beta_0))_{\widehat T\setminus T}\| + 2\|(X'R)_{\widehat T\setminus T}\| + 2\|(X'X(\beta_0 - \widehat\beta))_{\widehat T\setminus T}\| \leq \sqrt{\widehat m}\cdot n\|S\|_\infty + 2n\sqrt{\phi(\widehat m)}\,c_s + 2n\sqrt{\phi(\widehat m)}\,\|\widehat\beta - \beta_0\|_{2,n},$$

where we used that

$$\|(X'X(\beta_0 - \widehat\beta))_{\widehat T\setminus T}\| \leq \sup_{\|\alpha_{T^c}\|_0 \leq \widehat m,\,\|\alpha\| \leq 1} |\alpha' X' X(\beta_0 - \widehat\beta)| \leq \sup_{\|\alpha_{T^c}\|_0 \leq \widehat m,\,\|\alpha\| \leq 1} \|\alpha' X'\|\,\|X(\beta_0 - \widehat\beta)\| \leq n\sqrt{\phi(\widehat m)}\,\|\beta_0 - \widehat\beta\|_{2,n},$$

and similarly $\|(X'R)_{\widehat T\setminus T}\| \leq n\sqrt{\phi(\widehat m)}\,c_s$. Since $\lambda/c \geq n\|S\|_\infty$ and, by Lemma 2, $\|\beta_0 - \widehat\beta\|_{2,n} \leq (1 + 1/c)\,\lambda\sqrt{s}/(n\kappa_1) + 2c_s$, the result follows:

$$(1 - 1/c)\sqrt{\widehat m} \leq 2\sqrt{\phi(\widehat m)}\,(1 + 1/c)\sqrt{s}/\kappa_1 + 6\sqrt{\phi(\widehat m)}\,n c_s/\lambda. \qquad \square$$

Proof of Lemma 8. Let $W := E_n[x_i x_i']$, and let $\bar\alpha$ be such that $\phi(\lceil\bar k\rceil k) = \bar\alpha' W \bar\alpha$ with $\|\bar\alpha_{T^c}\|_0 \leq \lceil\bar k\rceil k$ and $\|\bar\alpha\| \leq 1$. We can decompose $\bar\alpha = \sum_{i=1}^{\lceil\bar k\rceil}\alpha_i$, where we can choose the $\alpha_i$'s such that $\|\alpha_{iT^c}\|_0 \leq k$ for each $i$ (by definition of $\lceil\bar k\rceil$, since $\|\bar\alpha_{T^c}\|_0 \leq \lceil\bar k\rceil k$) and $\alpha_{iT} = \bar\alpha_T/\lceil\bar k\rceil$. Since $W$ is positive semi-definite, $\alpha_i' W \alpha_i + \alpha_j' W \alpha_j \geq 2|\alpha_i' W \alpha_j|$ for any pair $(i,j)$. Therefore

$$\phi(\lceil\bar k\rceil k) = \bar\alpha' W \bar\alpha = \sum_{i=1}^{\lceil\bar k\rceil}\sum_{j=1}^{\lceil\bar k\rceil}\alpha_i' W \alpha_j \leq \sum_{i=1}^{\lceil\bar k\rceil}\sum_{j=1}^{\lceil\bar k\rceil}\frac{\alpha_i' W \alpha_i + \alpha_j' W \alpha_j}{2} = \lceil\bar k\rceil\sum_{i=1}^{\lceil\bar k\rceil}\alpha_i' W \alpha_i \leq \lceil\bar k\rceil\sum_{i=1}^{\lceil\bar k\rceil}\|\alpha_i\|^2\,\phi(\|\alpha_{iT^c}\|_0) \leq \lceil\bar k\rceil\,\max_i \phi(\|\alpha_{iT^c}\|_0) \leq \lceil\bar k\rceil\,\phi(k),$$

where we used that $\sum_{i=1}^{\lceil\bar k\rceil}\|\alpha_i\|^2 = \|\bar\alpha_T\|^2/\lceil\bar k\rceil + \|\bar\alpha_{T^c}\|^2 \leq \|\bar\alpha\|^2 \leq 1$. $\square$
Proof of Lemma 9. First note that $P(\max_{1\leq j\leq p}|\widehat\sigma_j - 1| \leq 1/4) \to 1$ as $n$ grows under the side condition on $n$. Let $c_*(m)$ and $c^*(m)$ denote the minimum and maximum $m$-sparse eigenvalues associated with $E_n[\tilde x_i \tilde x_i']$ (unnormalized covariates). It follows that $\phi(m) \leq \max_{1\leq j\leq p}\widehat\sigma_j^2\,c^*(m+s)$ and $\kappa(m)^2 \geq \min_{1\leq j\leq p}\widehat\sigma_j^2\,c_*(m+s)$, and [20]'s proof of Proposition 2 bounds the deviations of $c_*(m+s)$ and $c^*(m+s)$ from their population counterparts. Next let $M = (s\log(n/e)) \wedge (n/[16\log p])$, so that as $n$ grows $M/s \to \infty$ under the side condition on $n$, and we have $M \in \mathcal M$. Thus, the bound on $\phi(M)$ follows since $c^*(M+s)$ is bounded from above with probability going to one, and the bound on the restricted eigenvalue $\kappa_1$ then follows since $c_*(M+s)$ is bounded from below and $\phi(M)$ is bounded from above with probability going to one. $\square$

Proof of Lemma 10. First note that $P(\max_{1\leq j\leq p}|\widehat\sigma_j - 1| \leq 1/4) \to 1$ as $n$ grows under the side condition on $n$, and that the bound on $\widehat m$ follows from Theorem 5 if $\lambda \geq cn\|S\|_\infty$, which occurs with probability at least $1 - \alpha$ under the data-driven choice of $\lambda$. Let $c_*(m)$ and $c^*(m)$ denote the minimum and maximum $m$-sparse eigenvalues associated with $E_n[\tilde x_i \tilde x_i']$ (unnormalized covariates), so that $\phi(m) \leq \max_{1\leq j\leq p}\widehat\sigma_j^2\,c^*(m+s)$ and $\kappa(m)^2 \geq \min_{1\leq j\leq p}\widehat\sigma_j^2\,c_*(m+s)$; the deviations of $c_*(m+s)$ and $c^*(m+s)$ from their population counterparts are again controlled by [20]'s proof of Proposition 2. Let $M = (s\log(n/e)) \wedge (n/[16\log p])$, so that as $n$ grows $M/s \to \infty$ under the side condition on $s$, and we have $M \in \mathcal M$. Thus, the bound on $\phi(M)$ follows since $c^*(M+s)$ is bounded from above with probability going to one, and the bound on the restricted eigenvalue $\kappa_1$ follows since $c_*(M+s)$ is bounded from below with probability going to one. $\square$
References

[1] A. Belloni and V. Chernozhukov (2009). ℓ1-Penalized Quantile Regression for High Dimensional Sparse Models, arXiv:0904.2931v3 [math.ST].
[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov (2009). Simultaneous analysis of Lasso and Dantzig selector, Ann. Statist., Volume 37, Number 4, 1705-1732.
[3] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp (2006). Aggregation and Sparsity via ℓ1-Penalized Least Squares, in Proceedings of the 19th Annual Conference on Learning Theory (COLT 2006) (G. Lugosi and H. U. Simon, eds.), Lecture Notes in Artificial Intelligence 4005, 379-391, Springer, Berlin.
[4] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp (2007). Aggregation for Gaussian regression, The Annals of Statistics, Vol. 35, No. 4, 1674-1697.
[5] F. Bunea, A. Tsybakov, and M. H. Wegkamp (2007). Sparsity oracle inequalities for the Lasso, Electronic Journal of Statistics, Vol. 1, 169-194.
[6] E. Candes and T. Tao (2007). The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist., Volume 35, Number 6, 2313-2351.
[7] D. L. Donoho and I. M. Johnstone (1994). Ideal spatial adaptation by wavelet shrinkage, Biometrika, 81(3):425-455.
[8] R. M. Dudley (2000). Uniform Central Limit Theorems, Cambridge Studies in Advanced Mathematics.
[9] J. Fan and J. Lv (2008). Sure Independence Screening for Ultra-High Dimensional Feature Space, Journal of the Royal Statistical Society, Series B, vol. 70 (5), pp. 849-911.
[10] V. Koltchinskii (2009). Sparsity in penalized empirical risk minimization, Ann. Inst. H. Poincaré Probab. Statist., Volume 45, Number 1, 7-57.
[11] K. Lounici (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators, Electron. J. Statist., Volume 2, 90-102.
[12] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer (2009). Taking Advantage of Sparsity in Multi-Task Learning, arXiv:0903.1468v1 [stat.ML].
[13] N. Meinshausen and B. Yu (2009). Lasso-type recovery of sparse representations for high-dimensional data, Annals of Statistics, vol. 37(1), 246-270.
[14] P. Rigollet and A. B. Tsybakov (2010). Exponential Screening and optimal rates of sparse estimation, arXiv:1003.2654.
[15] M. Rosenbaum and A. B. Tsybakov (2008). Sparse recovery under matrix uncertainty, arXiv:0812.2818v1 [math.ST].
[16] R. Tibshirani (1996). Regression shrinkage and selection via the Lasso, J. Roy. Statist. Soc. Ser. B, 58, 267-288.
[17] S. A. van de Geer (2008). High-dimensional generalized linear models and the lasso, Annals of Statistics, Vol. 36, No. 2, 614-645.
[18] A. W. van der Vaart and J. A. Wellner (1996). Weak Convergence and Empirical Processes, Springer Series in Statistics.
[19] M. Wainwright (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso), IEEE Transactions on Information Theory, 55:2183-2202, May.
[20] C.-H. Zhang and J. Huang (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression, Ann. Statist., Volume 36, Number 4, 1567-1594.
[21] P. Zhao and B. Yu (2006). On Model Selection Consistency of Lasso, J. Machine Learning Research, 7 (Nov), 2541-2567.