Massachusetts Institute of Technology
Department of Economics Working Paper Series

POST-ℓ1-PENALIZED ESTIMATORS IN HIGH-DIMENSIONAL LINEAR REGRESSION MODELS

Alexandre Belloni
Victor Chernozhukov

Working Paper 10-5
January 4, 2009

Room E52-251, 50 Memorial Drive, Cambridge, MA 02142

This paper can be downloaded without charge from the Social Science Research Network Paper Collection at http://ssrn.com/abstract=1582954

POST-ℓ1-PENALIZED ESTIMATORS IN HIGH-DIMENSIONAL LINEAR REGRESSION MODELS

ALEXANDRE BELLONI AND VICTOR CHERNOZHUKOV

Abstract. In this paper we study post-penalized estimators, which apply ordinary, unpenalized linear regression to the model selected by first-step penalized estimators, typically LASSO. It is well known that LASSO can estimate the regression function at nearly the oracle rate, and is thus hard to improve upon. We show that post-LASSO performs at least as well as LASSO in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the LASSO-based model selection "fails" in the sense of missing some components of the "true" regression model. By the "true" model we mean the best s-dimensional approximation to the regression function chosen by the oracle. Furthermore, post-LASSO can perform strictly better than LASSO, in the sense of a strictly faster rate of convergence, if the LASSO-based model selection correctly includes all components of the "true" model as a subset and also achieves sufficient sparsity. In the extreme case, when LASSO perfectly selects the "true" model, the post-LASSO estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by LASSO, which guarantees that this dimension is at most of the same order as the dimension of the "true" model. Our rate results are non-asymptotic and hold for both parametric and nonparametric models. Moreover, our analysis is not limited to the LASSO estimator in the first step, but also applies to other estimators, for example, the trimmed LASSO, the Dantzig selector, or any other estimator with good rates and good sparsity. Our analysis covers both traditional trimming and a new practical, completely data-driven trimming scheme that induces maximal sparsity subject to maintaining a certain goodness-of-fit. The latter scheme has theoretical guarantees similar to those of LASSO or post-LASSO, but it dominates these procedures as well as traditional trimming in a wide variety of experiments.

Key words. LASSO, post-LASSO, post-model-selection estimators.
AMS Codes. Primary 62H12, 62J99; secondary 62J07.
Date: First Version: January 4, 2009. Current Revision: March 29, 2010. First arXiv version: December 2009.
We thank Don Andrews, Whitney Newey, and Alexandre Tsybakov, as well as participants of the Cowles Foundation Lecture at the 2009 Econometric Society Summer meeting and the joint Harvard-MIT seminar, for useful comments. We would also like to thank Denis Chetverikov, Brigham Frandsen, and Joonhwan Lee for thorough proof-reading and many useful suggestions on several versions of this paper.

1. Introduction

In this work we study post-model-selection estimators for linear regression in high-dimensional sparse models (HDSMs). In such models, the overall number of regressors p is very large, possibly much larger than the sample size n. However, the number s of significant regressors - those having a non-zero impact on the response variable - is smaller than the sample size, that is, s = o(n). HDSMs ([6], [13]) have emerged to deal with many new applications arising in biometrics, signal processing, machine learning, econometrics, and other areas of data analysis where high-dimensional data sets have become widely available.

Several papers have begun to investigate estimation of HDSMs, primarily focusing on penalized mean regression with the ℓ1-norm acting as a penalty function [2, 6, 10, 13, 17, 20, 19]. References [2, 6, 10, 13, 20, 19] demonstrated the fundamental result that ℓ1-penalized least squares estimators achieve the rate $\sqrt{s/n}\sqrt{\log p}$, which is very close to the oracle rate $\sqrt{s/n}$ achievable when the true model is known. Reference [17] demonstrated a similar fundamental result on the excess forecasting error loss under both quadratic and non-quadratic loss functions. Thus the estimator can be consistent and can have excellent forecasting performance even under very rapid, nearly exponential growth of the total number of regressors p. Reference [1] investigated the ℓ1-penalized quantile regression process, obtaining similar results. See [9, 2, 3, 4, 5, 11, 12, 15] for many other interesting developments and a detailed review of the existing literature.

In this paper we derive theoretical properties of post-penalized estimators, which apply ordinary, unpenalized linear least squares regression to the model selected by first-step penalized estimators, typically LASSO. It is well known that LASSO can estimate the mean regression function at nearly the oracle rate, and hence is hard to improve upon. We show that post-LASSO performs at least as well as LASSO in terms of the rate of convergence, and has the advantage of a smaller bias. This nice performance occurs even if the LASSO-based model selection "fails" in the sense of missing some components of the "true" regression model. Here by the "true" model we mean the best s-dimensional approximation to the regression function chosen by the oracle. The intuition for this result is that LASSO-based model selection omits only those components with relatively small coefficients. Furthermore, post-LASSO can perform strictly better than LASSO, in the sense of a strictly faster rate of convergence, if the LASSO-based model selection correctly includes all components of the "true" model as a subset and the selected model is sufficiently sparse. Of course, in the extreme case, when LASSO perfectly selects the "true" model, the post-LASSO estimator becomes the oracle estimator.

Importantly, our rate analysis is not limited to the LASSO estimator in the first step, but applies to a wide variety of other first-step estimators, including, for example, the trimmed LASSO, the Dantzig selector, and their various modifications. We give generic rate results that cover any first-step estimator for which a rate and a sparsity bound are available.
We also give a generic rate result for trimmed first-step estimators, where trimming can be performed by a traditional hard-thresholding scheme or by a new trimming scheme that we introduce in this paper. Our new trimming scheme induces maximal sparsity subject to maintaining a certain goodness-of-fit (goof) in the sample, and is completely data-driven. We show that our post-goof-trimmed estimator performs at least as well as the first-step estimator; for example, the post-goof-trimmed LASSO performs at least as well as LASSO, but can be strictly better under good model selection properties. It should also be noted that traditional trimming schemes do not in general have such nice theoretical guarantees, even in simple diagonal models.

Finally, we conduct a series of computational experiments and find that the results confirm our theoretical findings. In particular, we find that the post-goof-trimmed LASSO and the post-LASSO emerge clearly as the best and second best, both substantially outperforming LASSO and the post-traditional-trimmed LASSO.

To the best of our knowledge, our paper is the first to establish the aforementioned rate results on post-LASSO and the proposed post-goof-trimmed estimators in the mean regression problem. Our analysis builds upon the ideas in [1], who established the properties of post-penalized procedures for the related, but different, problem of median regression. Our analysis also builds on the fundamental results of [2] and the other works cited above that established the properties of the first-step LASSO-type estimators. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by LASSO, which guarantees that this dimension is at most of the same order as the dimension of the "true" model. Our sparsity bounds for the mean regression problem improve upon the analogous bounds in [2] and are comparable to the bounds in [20] obtained under a larger penalty level. We also rely on some inequalities for sparse eigenvalues and reasoning previously given in [1] in the context of median regression, and on maximal inequalities in [20], to provide primitive conditions under which the sharp sparsity bounds hold.

We organize the remainder of the paper as follows. In Section 2, we review some benchmark results of [2] for LASSO, albeit with a slightly improved, data-driven choice of penalty, and the model selection results of [11, 13, 21]. In Section 3, we present a generic rate result on post-penalized estimators. In Section 4, we present a generic rate result for post-trimmed estimators, where trimming can be traditional or based on goodness-of-fit. In Section 5, we apply our generic results to the post-LASSO and the post-trimmed LASSO estimators. In Section 6, we present the results of our computational experiments.

Notation. In what follows, all parameter values are indexed by the sample size n, but we omit the index whenever this does not cause confusion. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$, and $a \wedge b = \min\{a, b\}$. The ℓ2-norm is denoted by $\|\cdot\|$, and the ℓ0-norm $\|\cdot\|_0$ denotes the number of non-zero components of a vector. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector for which $\delta_{Tj} = \delta_j$ if $j \in T$ and $\delta_{Tj} = 0$ if $j \notin T$.
We also use standard notation from the empirical process literature: $E_n[f] = E_n[f(z_i)] = \sum_{i=1}^n f(z_i)/n$ and $G_n(f) = \sum_{i=1}^n (f(z_i) - E[f(z_i)])/\sqrt{n}$. We use $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on $n$, and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event $E$, we say that $E$ occurs wp $\to 1$ when $E$ occurs with probability approaching one as $n$ grows.

2. LASSO as a Benchmark in Parametric and Nonparametric Models

The purpose of this section is to define the models for which we state our results and also to revisit some known results for the LASSO estimator, which we will use as a benchmark and as inputs to subsequent proofs. In particular, we revisit the fundamental rate results of [2], but with a slightly improved, data-driven penalty level.

2.1. Model 1: Parametric Model. Let us consider the following parametric linear regression model:
\[
y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad \beta_0 \in \mathbb{R}^p, \quad i = 1, \ldots, n,
\]
\[
T = \mathrm{support}(\beta_0) \text{ has } s \text{ elements, where } s < n \text{ but } p > n,
\]
where $T$ is unknown, and the regressors $X = [x_1, \ldots, x_n]'$ are fixed and normalized so that $E_n[x_{ij}^2] = 1$ for all $j = 1, \ldots, p$.

Given the large number of regressors $p > n$, some regularization is required in order to avoid overfitting the data. The LASSO estimator [16] is one way to achieve this regularization. Specifically, define
\[
\widehat{\beta} \in \arg\min_{\beta}\ \widehat{Q}(\beta) + \frac{\lambda}{n}\|\beta\|_1, \quad \text{where } \widehat{Q}(\beta) = E_n[(y_i - x_i'\beta)^2]. \tag{2.1}
\]
Our goal is to revisit convergence results for $\widehat{\beta}$ in the prediction (pseudo) norm
\[
\|\delta\|_{2,n} = \sqrt{E_n[(x_i'\delta)^2]}.
\]
The key quantity in the analysis is the gradient at the true value:
\[
S = 2E_n[x_i\epsilon_i]. \tag{2.2}
\]
This gradient is the effective "noise" in the problem. Indeed, for $\delta = \widehat{\beta} - \beta_0$, we have by the Hölder inequality
\[
\widehat{Q}(\widehat{\beta}) - \widehat{Q}(\beta_0) - \|\delta\|_{2,n}^2 = -2E_n[\epsilon_i x_i'\delta] \ge -\|S\|_\infty\|\delta\|_1.
\]
Thus, $\widehat{Q}(\widehat{\beta}) - \widehat{Q}(\beta_0)$ provides noisy information about $\|\delta\|_{2,n}^2$, and the amount of noise is controlled by $\|S\|_\infty\|\delta\|_1$. This noise should be dominated by the penalty, so that the rate of convergence can be deduced from the relationship between the penalty term and the quadratic term $\|\delta\|_{2,n}^2$. This reasoning suggests choosing $\lambda \ge c\,n\|S\|_\infty$ for some fixed $c > 1$. However this choice is not feasible, since we do not know $S$. We propose setting
\[
\lambda = c\cdot\Lambda(1-\alpha \mid X), \tag{2.3}
\]
where $\Lambda(1-\alpha \mid X)$ is the $(1-\alpha)$-quantile of $n\|S\|_\infty$, so that for this choice
\[
\lambda \ge c\,n\|S\|_\infty \ \text{ with probability at least } 1-\alpha. \tag{2.4}
\]
Note that the quantity $\Lambda(1-\alpha \mid X)$ is easily computed by simulation. We refer to this choice of $\lambda$ as the data-driven choice, reflecting the dependence of the choice on the design matrix $X$.

Comment 2.1 (Data-driven choice vs standard choice). The standard choice of $\lambda$ employs
\[
\lambda = c\,\sigma\sqrt{2n\log p}, \tag{2.5}
\]
where $c > 1$ is a constant that does not depend on $X$ and is chosen so that (2.4) holds no matter what $X$ is. Note that $n\|S\|_\infty$ is a maximum of (scaled) $N(0, \sigma^2)$ variables, which are correlated in the sample if the columns of $X$ are correlated, as they typically are. In order to compute $\lambda$, the standard choice uses the conservative assumption that these variables are uncorrelated. When the regressors are highly correlated, the standard choice (2.5) becomes quite conservative and may be too large. The X-dependent choice of penalty (2.3) takes advantage of the in-sample correlations induced by the design matrix and yields smaller penalties.
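Since $\Lambda(1-\alpha \mid X)$ is just a quantile of $n\|S\|_\infty = 2\|X'\epsilon\|_\infty$ with $\epsilon_i \sim N(0, \sigma^2)$ drawn independently of the fixed design, it can be approximated by Monte Carlo. The following minimal sketch (Python; the function and variable names are ours, and $\sigma$ is treated as known, as in the discussion above) illustrates the computation of the penalty level in (2.3).

```python
import numpy as np

def data_driven_lambda(X, sigma, alpha=0.05, c=1.1, n_sim=1000, seed=0):
    """Approximate lambda = c * Lambda(1 - alpha | X) from (2.3) by simulation,
    where Lambda(1 - alpha | X) is the (1 - alpha)-quantile of n * ||S||_inf
    with S = 2 * E_n[x_i * eps_i] and eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    stats = np.empty(n_sim)
    for b in range(n_sim):
        eps = rng.normal(scale=sigma, size=n)
        S = 2.0 * X.T @ eps / n            # S = 2 E_n[x_i eps_i]
        stats[b] = n * np.max(np.abs(S))   # n * ||S||_inf
    Lambda = np.quantile(stats, 1.0 - alpha)
    # Conservative benchmark corresponding to (2.5), ignoring in-sample correlation:
    # lam_standard = c * sigma * np.sqrt(2.0 * n * np.log(p))
    return c * Lambda
```

In correlated designs the simulated quantile is typically noticeably smaller than the conservative benchmark, which is the point illustrated in Figure 1 below.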
To illustrate this point, we simulated three designs: the variables $\tilde x_{ij}$ are drawn as i.i.d. $N(0, \Sigma)$ with $\Sigma_{jj} = 1$, we define $x_{ij} = \tilde x_{ij}/\sqrt{E_n[\tilde x_{ij}^2]}$, and the correlations vary across the designs: $\Sigma_{jk} = 0$, $\rho^{|j-k|}$, or $\rho$ for $j \neq k$. Figure 1 plots the sorted realized values of the X-dependent penalty levels and illustrates the impact of in-sample correlation on these values. As expected, for a fixed confidence level $1-\alpha$, the more correlated the regressors are, the smaller the data-driven penalty (2.3) is relative to the standard conservative choice (2.5).

[Figure 1. Realized values of $\Lambda(0.95 \mid X)$ sorted in increasing order, compared with the standard bound. $X$ is drawn by generating $\tilde x_i$ as i.i.d. $N(0, \Sigma)$, where design 1 has $\Sigma_{jk} = 0$ for $j \neq k$, design 2 has $\Sigma_{jk} = (1/2)^{|j-k|}$, and design 3 has $\Sigma_{jk} = 1/2$. We used $n = 100$, $p = 500$, and $\sigma^2 = 1$. For each design, 100 design matrices were drawn.]

Under (2.3), $\delta = \widehat{\beta} - \beta_0$ will obey the following "restricted condition" with probability at least $1-\alpha$:
\[
\|\delta_{T^c}\|_1 \le \bar c\,\|\delta_T\|_1, \quad \text{where } \bar c := \frac{c+1}{c-1}. \tag{2.6}
\]
Therefore, in order to get convergence rates in the prediction norm $\|\delta\|_{2,n} = \sqrt{E_n[(x_i'\delta)^2]}$, we consider the following modulus of continuity between the penalty and the prediction norm:
\[
\text{RE.1}(\bar c): \quad \kappa_1(T) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta \neq 0} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\delta_T\|_1},
\]
where $\kappa_1(T)$ can depend on $n$. In turn, the convergence rate in the usual Euclidean norm $\|\delta\|$ is determined by the following modulus of continuity between the prediction norm and the Euclidean norm:
\[
\text{RE.2}(\bar c): \quad \kappa_2(T) := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta \neq 0} \frac{\|\delta\|_{2,n}}{\|\delta\|},
\]
where $\kappa_2(T)$ can depend on $n$. Conditions RE.1 and RE.2 are simply variants of the original restricted eigenvalue conditions imposed in Bickel, Ritov and Tsybakov [2]. In what follows, we suppress the dependence on $T$ whenever convenient.

Lemma 1 below states the rate of convergence in the prediction norm under the data-driven choice of penalty.

Lemma 1 (Essentially in Bickel, Ritov, and Tsybakov [2]). If $\lambda \ge c\,n\|S\|_\infty$, then
\[
\|\widehat{\beta} - \beta_0\|_{2,n} \le \left(1 + \frac1c\right)\frac{\lambda\sqrt s}{n\kappa_1}.
\]
Under the data-driven choice (2.3), we have with probability at least $1-\alpha$
\[
\|\widehat{\beta} - \beta_0\|_{2,n} \le (1 + c)\,\frac{\sqrt s}{n\kappa_1}\,\Lambda(1-\alpha\mid X),
\]
where $\Lambda(1-\alpha\mid X) \le \sigma\sqrt{2n\log(p/\alpha)}$.

Thus, provided $\kappa_1$ is bounded away from zero, LASSO estimates the regression function at nearly the rate $\sqrt{s/n}$ (achievable when the true model $T$ is known) with probability at least $1-\alpha$. Since $\delta = \widehat\beta - \beta_0$ obeys the restricted condition with probability at least $1-\alpha$, the rate in the Euclidean norm immediately follows from the relation
\[
\|\widehat\beta - \beta_0\| \le \|\widehat\beta - \beta_0\|_{2,n}/\kappa_2, \tag{2.7}
\]
which also holds with probability at least $1-\alpha$. Thus, if $\kappa_2$ is also bounded away from zero, LASSO estimates the regression coefficients at a near-$\sqrt{s/n}$ rate with probability at least $1-\alpha$. Note that the $\sqrt{s/n}$ rate is not the oracle rate in general, but under some further conditions stated in Section 2.3, namely when the parametric model is the oracle model, this rate is an oracle rate.
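For completeness, here is a minimal sketch of how the LASSO estimator (2.1) and the prediction-norm error can be computed with standard software. The objective in (2.1) is $E_n[(y_i - x_i'\beta)^2] + (\lambda/n)\|\beta\|_1$, whereas scikit-learn's Lasso minimizes $\frac{1}{2n}\|y - X\beta\|^2 + a\|\beta\|_1$; matching the two objectives gives $a = \lambda/(2n)$. The mapping and the function names are our own illustration, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_lasso(X, y, lam):
    """LASSO estimator (2.1): argmin E_n[(y_i - x_i'b)^2] + (lam/n)*||b||_1.
    scikit-learn's Lasso minimizes (1/(2n))*||y - Xb||^2 + a*||b||_1,
    so the two objectives coincide for a = lam / (2n)."""
    n = X.shape[0]
    model = Lasso(alpha=lam / (2.0 * n), fit_intercept=False, max_iter=50000)
    model.fit(X, y)
    return model.coef_

def prediction_norm(X, delta):
    """Prediction (pseudo) norm ||delta||_{2,n} = sqrt(E_n[(x_i' delta)^2])."""
    return np.sqrt(np.mean((X @ delta) ** 2))
```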
2.2. Model 2: Nonparametric Model. Next we consider the nonparametric model
\[
y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n,
\]
where $y_i$ are the outcomes, $z_i$ are vectors of fixed regressors, and $\epsilon_i$ are disturbances. Let $x_i = p(z_i)$, where $p(z_i)$ is a $p$-vector of transformations of $z_i$, and write $f_i := f(z_i)$. For any conformable vector $\beta_0$, we can rewrite the model as
\[
y_i = x_i'\beta_0 + u_i, \quad u_i = r_i + \epsilon_i, \quad \text{where } r_i := f_i - x_i'\beta_0.
\]
Next we choose our target or "true" $\beta_0$, with $s = \|\beta_0\|_0 = |T|$ the corresponding cardinality of its support, as any solution to the following "oracle" risk minimization problem:
\[
\min_{0 \le k \le p\wedge n}\ \min_{\|\beta\|_0 \le k}\ E_n[(x_i'\beta - f_i)^2] + \frac{\sigma^2 k}{n}. \tag{2.8}
\]
Letting
\[
c_s^2 := E_n[r_i^2] = E_n[(x_i'\beta_0 - f_i)^2]
\]
denote the error from approximating $f_i$ by $x_i'\beta_0$, the optimal value of (2.8) is $c_s^2 + \sigma^2 s/n$. In order to simplify the exposition, we focus some results and discussions on the case where the following holds:
\[
c_s^2 \le K\sigma^2 s/n \ \text{ with } K = 1, \tag{2.9}
\]
which covers most cases of interest. Alternatively, we could consider an arbitrary $K$, which does not affect the results modulo constants. Note that $c_s^2 + \sigma^2 s/n$ is the expected estimation error $E[E_n[(f_i - x_i'\beta^o)^2]]$ of the (infeasible) oracle estimator $\beta^o$ that minimizes the expected estimation error among all $k$-sparse least squares estimators, by searching for the best $k$-dimensional model and then choosing $k$ to balance the approximation error against the sampling error, which the oracle knows how to compute. The rate of convergence of the oracle estimator, $\sqrt{c_s^2 + \sigma^2 s/n}$, thus becomes an ideal goal for the rate of convergence, and in general it can be achieved only up to logarithmic terms (see Donoho and Johnstone [7] and Rigollet and Tsybakov [14]), except under very special circumstances, such as when it becomes possible to perform perfect model selection. Finally, note that when the approximation error $c_s$ is zero, the oracle model becomes the parametric model of the previous section, where we had $r_i = 0$.

Next we state a rate of convergence in the prediction norm under the data-driven penalty.

Lemma 2 (Essentially in Bickel, Ritov, and Tsybakov [2]). If $\lambda \ge c\,n\|S\|_\infty$, then
\[
\|\widehat\beta - \beta_0\|_{2,n} \le \left(1 + \frac1c\right)\frac{\lambda\sqrt s}{n\kappa_1} + 2c_s.
\]
Under the data-driven choice (2.3), we have with probability at least $1-\alpha$
\[
\|\widehat\beta - \beta_0\|_{2,n} \le (1 + c)\,\frac{\sqrt s}{n\kappa_1}\,\Lambda(1-\alpha\mid X) + 2c_s,
\]
where $\Lambda(1-\alpha\mid X) \le \sigma\sqrt{2n\log(p/\alpha)}$.

Thus, provided $\kappa_1$ is bounded away from zero, LASSO estimates the regression function at a near-oracle rate with probability at least $1-\alpha$:
\[
\|\widehat\beta - \beta_0\|_{2,n} \le (1+c)\frac{\sigma}{\kappa_1}\sqrt{\frac{2s\log(p/\alpha)}{n}} + 2c_s.
\]
Furthermore, a bound on the empirical risk follows from the triangle inequality:
\[
\sqrt{E_n[(x_i'\widehat\beta - f_i)^2]} \le \|\widehat\beta - \beta_0\|_{2,n} + c_s.
\]
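The oracle program (2.8) involves a search over all supports of each size and is combinatorial in general. A minimal illustrative sketch is possible under the simplifying assumption that the candidate supports are nested, for example the first k terms of an ordered series expansion, as in the nonparametric example of Section 6. The nested-support restriction and the function names below are our own simplification, not part of the definition (2.8).

```python
import numpy as np

def oracle_target_nested(X, f, sigma):
    """Approximate the oracle program (2.8) over the nested supports
    {1}, {1,2}, ..., {1,...,p}: for each k, project f on the first k columns
    of X and trade off c_k^2 = E_n[r_i^2] against sigma^2 * k / n."""
    n, p = X.shape
    best = (np.inf, 0, None)
    for k in range(1, min(n, p) + 1):
        Xk = X[:, :k]
        beta_k, *_ = np.linalg.lstsq(Xk, f, rcond=None)  # best fit on {1,...,k}
        c2_k = np.mean((f - Xk @ beta_k) ** 2)           # approximation error c_k^2
        risk = c2_k + sigma ** 2 * k / n                 # oracle risk at size k
        if risk < best[0]:
            beta0 = np.zeros(p)
            beta0[:k] = beta_k
            best = (risk, k, beta0)
    return best  # (oracle risk, oracle size s, oracle target beta_0)
```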
2.3. Model Selection Properties of LASSO. The primary results we develop do not require the first-step estimators, such as LASSO, to perfectly select the true model. In fact, we are specifically interested in the most common cases, in which these estimators do not perfectly select the true model. For these cases, we prove that post-model-selection estimators such as post-LASSO achieve near-oracle rates like those of LASSO. However, in some special cases where perfect model selection is possible, these estimators can achieve the exact oracle rates, and thus can be even better than LASSO. The purpose of this section is to describe these very special cases. In the discussion of our results on post-penalized estimators we will refer to the following model selection results for the parametric case.

Lemma 3 (Essentially in Meinshausen and Yu [13] and Lounici [11]). 1) In the parametric model, if the coefficients are well separated from zero, that is,
\[
\min_{j \in T} |\beta_{0j}| > \zeta + t \quad \text{for some } t \ge \zeta := \max_{1\le j\le p}|\widehat\beta_j - \beta_{0j}|,
\]
then the true model is a subset of the selected model,
\[
T := \mathrm{support}(\beta_0) \subseteq \widehat T := \mathrm{support}(\widehat\beta),
\]
and, moreover, $T$ can be perfectly selected by applying trimming at level $t$:
\[
T = \widehat T(t) := \{j \in \{1, \ldots, p\} : |\widehat\beta_j| > t\}.
\]
2) In particular, if $\lambda \ge c\,n\|S\|_\infty$, then $\zeta \le (1 + 1/c)\,\lambda\sqrt s/(n\kappa_1\kappa_2)$. 3) In particular, if $\lambda \ge c\,n\|S\|_\infty$ and there is a constant $u > 1$ such that the design matrix satisfies $|E_n[x_{ij}x_{ik}]| \le 1/(u(1+2\bar c)s)$ for all $1 \le j < k \le p$, then $\zeta \le C_u\,\lambda/n$, where the constant $C_u$ depends only on $u$ and $\bar c$; see [11].

Thus, we see from parts 1) and 2), which follow from [13] and Lemma 1, that perfect model selection is possible under strong assumptions on the separation of the coefficients away from zero. We also see from part 3), which is due to [11], that the strong separation requirement can be considerably weakened in exchange for a strong assumption on the design matrix. Finally, the following extreme result also requires strong assumptions on the separation of the coefficients and on the design matrix.

Lemma 4 (Essentially in Zhao and Yu [21]). In the parametric model, under more restrictive conditions on the design, on the separation of the non-zero coefficients, and on the penalty parameter, specified in [21], with high probability
\[
T = \mathrm{support}(\beta_0) = \widehat T = \mathrm{support}(\widehat\beta).
\]

Comment 2.2. We only review model selection in the parametric case. There are two reasons for this. First, the results stated above have been developed for the parametric case only, and extending them to nonparametric cases is outside the main focus of this paper. Second, it is clear from the stated conditions that in the nonparametric context, in order to select the oracle model $T$ perfectly, the oracle models have to be either (a) parametric (i.e. $c_s = 0$) or (b) very close to parametric (with $c_s^2$ much smaller than $\sigma^2 s/n$) and satisfy other strong conditions similar to those stated above. Since we only argue that post-LASSO and related estimators are as good as LASSO and can be strictly better in some cases, it suffices to demonstrate the latter for case (a). Moreover, if oracle performance is achieved for case (a), then by continuity of the empirical risk with respect to the underlying model, the oracle performance should extend to a neighborhood of case (a), which covers case (b).

3. A Generic Result on Post-Model Selection Estimators

Let $\widehat\beta$ be any first-step estimator acting as a model selection device, and denote by $\widehat T := \mathrm{support}(\widehat\beta)$ the model selected by this estimator; we assume $|\widehat T| \le n$ throughout. Define the post-model selection estimator as
\[
\widetilde\beta = \arg\min_{\beta:\ \mathrm{support}(\beta) \subseteq \widehat T} \widehat Q(\beta). \tag{3.11}
\]
If model selection works perfectly (as it will under some rather stringent conditions), then this estimator is simply the oracle estimator and its properties are well known. However, of interest is the case when model selection does not work perfectly, as occurs for many designs of interest in applications. In this section we derive a generic result on the performance of any post-model selection estimator.

In order to derive rates, we need the following minimal restricted sparse eigenvalue,
\[
\text{RSE.1}(m): \quad \kappa(m)^2 := \min_{\|\delta_{T^c}\|_0 \le m,\ \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2},
\]
as well as the following maximal restricted sparse eigenvalue,
\[
\text{RSE.2}(m): \quad \phi(m) := \max_{\|\delta_{T^c}\|_0 \le m,\ \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2},
\]
where $m$ is the restriction on the number of non-zero components outside the support $T$. It will be convenient to define the following condition number associated with the sample design matrix:
\[
\mu_m = \frac{\sqrt{\phi(m)}}{\kappa(m)}. \tag{3.12}
\]
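The restricted sparse eigenvalues $\kappa(m)^2$ and $\phi(m)$, and hence $\mu_m$, are defined by an optimization over the supports of $\delta_{T^c}$ of size at most $m$, so exact computation is combinatorial and feasible only for very small $p$. The brute-force sketch below (our own illustration; the support $T$ and all names are inputs we supply) simply makes the definitions RSE.1, RSE.2 and (3.12) concrete.

```python
import numpy as np
from itertools import combinations

def restricted_sparse_eigs(X, T, m):
    """Brute-force kappa(m)^2 (RSE.1), phi(m) (RSE.2) and mu_m = sqrt(phi)/kappa:
    extreme eigenvalues of the Gram matrix E_n[x_i x_i'] restricted to supports
    T union J, with J outside T and |J| <= m.  Only feasible for very small p."""
    n, p = X.shape
    G = X.T @ X / n                        # sample Gram matrix E_n[x_i x_i']
    outside = [j for j in range(p) if j not in set(T)]
    kappa2, phi = np.inf, 0.0
    for size in range(m + 1):
        for J in combinations(outside, size):
            idx = list(T) + list(J)
            eigs = np.linalg.eigvalsh(G[np.ix_(idx, idx)])  # ascending eigenvalues
            kappa2 = min(kappa2, eigs[0])
            phi = max(phi, eigs[-1])
    mu = np.sqrt(phi) / np.sqrt(kappa2)
    return kappa2, phi, mu
```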
The following theorem establishes bounds on the prediction error of a generic second-step estimator.

Theorem 1 (Performance of a generic second-step estimator). In either the parametric model or the nonparametric model, let $\widehat\beta$ be any first-step estimator with support $\widehat T$, let $\widetilde\beta$ be the second-step estimator, and define $\widehat m := |\widehat T\setminus T|$, $B_n := \widehat Q(\widehat\beta) - \widehat Q(\beta_0)$, and $C_n := \widehat Q(\beta_{0\widehat T}) - \widehat Q(\beta_0)$. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K_\epsilon\,\sigma\sqrt{\frac{\widehat m\log p + (\widehat m + s)\log\mu_{\widehat m}}{n}} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+},
\]
where $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey the bounds (3.13) stated below.

The following lemma bounds $B_n$ and $C_n$, although we can also bound $B_n$ by other means, as we shall do in many cases.

Lemma 5 (Generic control of $B_n$ and $C_n$). Let $\widehat m = |\widehat T\setminus T|$ be the number of wrong regressors selected and $|T\setminus\widehat T|$ the number of correct regressors missed. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
B_n \le \|\widehat\beta - \beta_0\|_{2,n}^2 + \left(K_\epsilon\,\sigma\sqrt{\frac{(\widehat m + s)\log(p\,\mu_{\widehat m})}{n}} + 2c_s\right)\|\widehat\beta - \beta_0\|_{2,n},
\]
\[
C_n \le 1\{T\not\subseteq\widehat T\}\left[\|\beta_{0\widehat T^c}\|_{2,n}^2 + \left(K_\epsilon\,\sigma\sqrt{\frac{s\log p}{n}} + 2c_s\right)\|\beta_{0\widehat T^c}\|_{2,n}\right], \tag{3.13}
\]
where $c_s = 0$ in the parametric model.

Three implications of Theorem 1 and Lemma 5 are worth noting. First, and most importantly, the bounds on the prediction norm apply to any generic post-model-selection estimator, provided we can bound both the sparsity $\widehat m$ of the first-step estimator, which controls the magnitude of the empirical errors, and its rate of convergence $\|\widehat\beta - \beta_0\|_{2,n}$. Second, the performance of the second-step estimator is driven by the minimum of $B_n$ and $C_n$: intuitively, $B_n$ measures the in-sample goodness-of-fit (or loss-of-fit) of the first-step estimator relative to the "true" parameter value $\beta_0$, while $C_n$ measures the in-sample loss-of-fit induced by truncating the "true" parameter value $\beta_0$ outside the selected model $\widehat T$. Third, if the selected model contains the true model, that is $T \subseteq \widehat T$, then $C_n = 0$, $B_n$ does not affect the rate at all, and the performance of the second-step estimator is dictated entirely by the sparsity $\widehat m$ of the first-step estimator; otherwise, if $T \not\subseteq \widehat T$, the performance is determined by both $\widehat m$ and the minimum of $B_n$ and $C_n$.

Finally, note that rates in other norms of interest immediately follow from the relations
\[
\sqrt{E_n[(x_i'\widetilde\beta - f_i)^2]} \le \|\widetilde\beta - \beta_0\|_{2,n} + c_s, \qquad \|\widetilde\beta - \beta_0\| \le \|\widetilde\beta - \beta_0\|_{2,n}/\kappa(\widehat m), \tag{3.14}
\]
where $\widehat m = |\widehat T\setminus T|$.

The proof of Theorem 1 relies on the sparsity-based control of the empirical error provided by the following lemma.

Lemma 6 (Sparsity-based control of noise). 1) For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\left|\widehat Q(\beta_0 + \delta) - \widehat Q(\beta_0) - \|\delta\|_{2,n}^2\right| \le \left(K_\epsilon\,\sigma\sqrt{\frac{m\log p + (m+s)\log\mu_m}{n}} + 2c_s\right)\|\delta\|_{2,n},
\]
uniformly for all $\delta$ with $\|\delta_{T^c}\|_0 \le m$, and uniformly over $m \le n$. 2) Furthermore, with at least the same probability,
\[
\left|\widehat Q(\beta_{0\widetilde T}) - \widehat Q(\beta_0) - \|\beta_{0\widetilde T^c}\|_{2,n}^2\right| \le \left(K_\epsilon\,\sigma\sqrt{\frac{k\log p}{n}} + 2c_s\right)\|\beta_{0\widetilde T^c}\|_{2,n},
\]
uniformly for all $\widetilde T \subseteq T$ with $|T\setminus\widetilde T| = k$, and uniformly over $k \le s$. In both parts, $c_s = 0$ in the parametric model.

The proof of the lemma in turn relies on the following maximal inequality, which we state as a separate theorem since it may be of independent interest. The proof of the theorem involves the use of the Samorodnitsky-Talagrand inequality.

Theorem 2 (Maximal inequality for a collection of empirical processes). Let $\epsilon_i \sim N(0, \sigma^2)$ be independent for $i = 1, \ldots, n$, and for any $\eta \in (0, 1)$ and $m = 1, \ldots, n$ define
\[
e_n(m, \eta) := 2\sigma\sqrt 2\left(\sqrt{(m+s)\log p} + \sqrt{(m+s)\log(D\mu_m)} + \sqrt{(m+s)\log(1/\eta)}\right)
\]
for some universal constant $D$. Then
\[
\sup_{\|\delta_{T^c}\|_0 \le m,\ \|\delta\|_{2,n} > 0}\left|G_n\!\left(\epsilon_i\frac{x_i'\delta}{\|\delta\|_{2,n}}\right)\right| \le e_n(m, \eta) \quad \text{for all } m \le n,
\]
with probability at least $1 - \eta/(1 - 1/e)$.
Proof. Step 0. Note that we can restrict the supremum to $\|\delta\| = 1$, since the function $\delta \mapsto x_i'\delta/\|\delta\|_{2,n}$ is homogeneous of degree zero in $\delta$.

Step 1. For each non-negative integer $m \le n$ and each set $\widetilde T \subset \{1, \ldots, p\}$ with $|\widetilde T\setminus T| \le m$, define the class of functions
\[
\mathcal G_{\widetilde T} = \left\{\epsilon_i x_i'\delta/\|\delta\|_{2,n} :\ \mathrm{support}(\delta) \subseteq \widetilde T,\ \|\delta\| = 1\right\}, \tag{3.15}
\]
and also define $\mathcal F_m = \bigcup\{\mathcal G_{\widetilde T} : \widetilde T \subset \{1, \ldots, p\},\ |\widetilde T\setminus T| \le m\}$. It follows that
\[
P\left(\sup_{f\in\mathcal F_m}|G_n(f)| > e_n(m,\eta)\right) \le \binom{p}{m}\max_{|\widetilde T\setminus T|\le m} P\left(\sup_{f\in\mathcal G_{\widetilde T}}|G_n(f)| > e_n(m,\eta)\right). \tag{3.16}
\]
We apply the Samorodnitsky-Talagrand inequality (Proposition A.2.7 in van der Vaart and Wellner [18]) to bound the right-hand side of (3.16). Let
\[
\rho(f, g) := \sqrt{E[G_n(f) - G_n(g)]^2} = \sqrt{E_\epsilon E_n[(f-g)^2]} \quad \text{for } f, g \in \mathcal G_{\widetilde T},
\]
and $\sigma^2(\mathcal G_{\widetilde T}) := \max_{f\in\mathcal G_{\widetilde T}} E[G_n(f)]^2 = \sigma^2$. By Step 2 below, the covering number of $\mathcal G_{\widetilde T}$ with respect to $\rho$ obeys
\[
N(\varepsilon, \mathcal G_{\widetilde T}, \rho) \le \left(\frac{6\sigma\mu_m\sqrt{m+s}}{\varepsilon}\right)^{m+s} \quad \text{for each } 0 < \varepsilon < \sigma. \tag{3.17}
\]
Then, by the Samorodnitsky-Talagrand inequality, for some universal constant $D > 1$,
\[
P\left(\sup_{f\in\mathcal G_{\widetilde T}}|G_n(f)| > e_n(m,\eta)\right) \le \left(\frac{D\mu_m\,e_n(m,\eta)}{\sqrt{m+s}\,\sigma}\right)^{m+s} e^{-e_n(m,\eta)^2/(2\sigma^2)} \le \eta\,e^{-m}p^{-m},
\]
where the last inequality holds for $e_n(m,\eta)$ as defined in the statement of the theorem. Then,
\[
P\left(\sup_{f\in\mathcal F_m}|G_n(f)| > e_n(m,\eta)\ \text{for some } m \le n\right) \le \sum_{m=0}^n \binom{p}{m}\,\eta\,e^{-m}p^{-m} \le \sum_{m=0}^n \eta\,e^{-m} \le \frac{\eta}{1 - 1/e},
\]
which proves the claim of the theorem.

Step 2. This step establishes (3.17). For $\widetilde T$ with $|\widetilde T\setminus T| \le m$, consider any two functions $\epsilon_i x_i't/\|t\|_{2,n}$ and $\epsilon_i x_i'\tilde t/\|\tilde t\|_{2,n}$ in $\mathcal G_{\widetilde T}$, with $\|t\| = \|\tilde t\| = 1$ and $\mathrm{support}(t), \mathrm{support}(\tilde t) \subseteq \widetilde T$. We have
\[
E_\epsilon E_n\left[\epsilon_i^2\left(\frac{x_i't}{\|t\|_{2,n}} - \frac{x_i'\tilde t}{\|\tilde t\|_{2,n}}\right)^2\right] \le 2\sigma^2\,E_n\left[\frac{(x_i'(t-\tilde t))^2}{\|t\|_{2,n}^2}\right] + 2\sigma^2\,E_n\left[(x_i'\tilde t)^2\left(\frac{1}{\|t\|_{2,n}} - \frac{1}{\|\tilde t\|_{2,n}}\right)^2\right].
\]
Since $\mathrm{support}(t - \tilde t) \subseteq \widetilde T$, the definitions RSE.1($m$) and RSE.2($m$) give $E_n[(x_i'(t-\tilde t))^2] \le \phi(m)\|t-\tilde t\|^2$ and $\|t\|_{2,n}^2 \ge \kappa(m)^2$, so the first term is at most $2\sigma^2\phi(m)\|t-\tilde t\|^2/\kappa(m)^2$. Moreover, $|\|t\|_{2,n} - \|\tilde t\|_{2,n}| \le \|t-\tilde t\|_{2,n} \le \sqrt{\phi(m)}\|t-\tilde t\|$, so the second term obeys the same bound. Thus
\[
\rho\left(\frac{\epsilon_i x_i't}{\|t\|_{2,n}}, \frac{\epsilon_i x_i'\tilde t}{\|\tilde t\|_{2,n}}\right) \le 2\sigma\mu_m\|t - \tilde t\|,
\]
and the bound (3.17) follows from the standard covering-number bound for the Euclidean ball ([18], page 94) with $R = 2\sigma\mu_m$, for any $\varepsilon < \sigma$.  □

Proof of Theorem 1. Let $\bar\delta := \widetilde\beta - \beta_0$. By definition of the second-step estimator, $\widehat Q(\widetilde\beta) \le \widehat Q(\widehat\beta)$ and $\widehat Q(\widetilde\beta) \le \widehat Q(\beta_{0\widehat T})$, so that
\[
\widehat Q(\widetilde\beta) - \widehat Q(\beta_0) \le \left(\widehat Q(\widehat\beta) - \widehat Q(\beta_0)\right)\wedge\left(\widehat Q(\beta_{0\widehat T}) - \widehat Q(\beta_0)\right) = B_n \wedge C_n.
\]
By Lemma 6 part (1), for any $\epsilon > 0$ there exists a constant $K_\epsilon$ such that, with probability at least $1-\epsilon$,
\[
\left|\widehat Q(\widetilde\beta) - \widehat Q(\beta_0) - \|\bar\delta\|_{2,n}^2\right| \le A_{\epsilon,n}\|\bar\delta\|_{2,n} + 2c_s\|\bar\delta\|_{2,n},
\quad \text{where } A_{\epsilon,n} = K_\epsilon\,\sigma\sqrt{\frac{\widehat m\log p + (\widehat m+s)\log\mu_{\widehat m}}{n}}.
\]
Combining these relations we obtain the inequality
\[
\|\bar\delta\|_{2,n}^2 - A_{\epsilon,n}\|\bar\delta\|_{2,n} - 2c_s\|\bar\delta\|_{2,n} \le B_n \wedge C_n,
\]
solving which for $\|\bar\delta\|_{2,n}$ we obtain the stated result:
\[
\|\bar\delta\|_{2,n} \le A_{\epsilon,n} + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}. \qquad \square
\]

4. A Generic Result on Post-Trimmed Estimators

In this section we investigate post-trimmed estimators, which arise from applying unpenalized least squares in the second step to the models selected by trimmed estimators in the first step. Formally, given a first-step estimator $\widehat\beta$, we define its trimmed support at level $t \ge 0$ as
\[
\widehat T(t) := \{j \in \{1, \ldots, p\} : |\widehat\beta_j| > t\},
\]
and we then define the post-trimmed estimator as
\[
\widetilde\beta^t = \arg\min_{\beta:\ \mathrm{support}(\beta)\subseteq\widehat T(t)} \widehat Q(\beta). \tag{4.18}
\]
The traditional trimming scheme sets the trimming threshold $t \ge \zeta = \max_{1\le j\le p}|\widehat\beta_j - \beta_{0j}|$, so as to trim all coefficient estimates smaller than the uniform estimation error. As discussed in Section 2.3, this method is particularly appealing in parametric models in which the non-zero components are well separated from zero, where it acts as a very effective model selection device. Unfortunately, this scheme may perform poorly in parametric models with true coefficients not well separated from zero and in nonparametric models. Indeed, even in parametric models with many small but non-zero true coefficients, trimming the estimates too aggressively may result in large goodness-of-fit losses, and consequently in slow rates of convergence and even inconsistency of the second-step estimators. This issue directly motivates our new goodness-of-fit based trimming method, which trims small coefficient estimates as much as possible subject to maintaining a certain goodness-of-fit level. Unlike traditional trimming, our new method is completely data-driven, which makes it appealing for practice. Moreover, our method is theoretically at least as good as LASSO or post-LASSO, and it performs better than both of these methods in a wide range of experiments. In the remainder of the section we present generic performance bounds for both the new method and the traditional trimming method.

4.1. Goodness-of-Fit Trimming. Here we propose a trimming method that selects the trimming level based on the goodness-of-fit of the post-trimmed estimator. Let $\gamma$ denote the maximal allowed loss (or, if negative, the required gain) in goodness-of-fit (goof) relative to the first-step estimator. We define the goof-trimming threshold $t_\gamma$ as
\[
t_\gamma := \max\left\{t :\ \widehat Q(\widetilde\beta^t) - \widehat Q(\widehat\beta) \le \gamma\right\}, \tag{4.19}
\]
and we then define the selected model and the post-goof-trimmed estimator as
\[
\widetilde T := \widehat T(t_\gamma) \quad \text{and} \quad \widetilde\beta := \widetilde\beta^{t_\gamma}. \tag{4.20}
\]
Our construction selects the most aggressive trimming threshold subject to maintaining a certain level of goodness-of-fit, as measured by the least squares criterion function. Note that we can compute the data-driven trimming threshold (4.19) very effectively using the binary search procedure described below.

Theorem 3 (Performance of a generic post-goof-trimmed estimator). In either the parametric or the nonparametric model, let $\widehat\beta$ be any first-step estimator, and let $\widetilde T = \widehat T(t_\gamma)$, $\widehat m := |\widetilde T\setminus T|$, $B_n := \widehat Q(\widehat\beta) - \widehat Q(\beta_0)$, and $C_n := \widehat Q(\beta_{0\widetilde T}) - \widehat Q(\beta_0)$. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K_\epsilon\,\sigma\sqrt{\frac{\widehat m\log p + (\widehat m+s)\log\mu_{\widehat m}}{n}} + 2c_s + \sqrt{(\gamma + B_n)_+ \wedge (C_n)_+}, \tag{4.21}
\]
where $c_s = 0$ in the parametric model. Furthermore, $B_n$ and $C_n$ obey the bounds (3.13) stated earlier, with $\widehat T$ replaced by $\widetilde T$.

Note that the bounds on the prediction norm stated in Theorem 3 and equation (3.13) of Lemma 5 apply to any generic post-goof-trimmed estimator, provided we can bound both the rate of convergence $\|\widehat\beta - \beta_0\|_{2,n}$ of the first-step estimator and the number $\widehat m$ of wrong regressors selected by the trimmed first-step estimator. For the purpose of obtaining rates, we can often bound $\widehat m$ by the number of wrong regressors selected by the untrimmed first-step estimator, since trimming can only remove regressors; of course, $\widehat m$ is potentially much smaller, resulting in a smaller variance for the second-step estimator. For instance, in the case of LASSO the bounds on the latter quantity are tight, as we show in Section 5. As before, if the selected model contains the true model, that is $T \subseteq \widetilde T$, then $C_n = 0$ and these terms drop out of the rate; otherwise the performance is determined by both $\widehat m$ and $(\gamma + B_n)_+ \wedge (C_n)_+$. Lemma 3 provides sufficient conditions for perfect model selection to hold with the given threshold $t = t_\gamma$, in which case we can even have $\widetilde T = T$. A nice feature of the theorem above is that it allows for a wide range of choices of $\gamma$.

Comment 4.1 (Recommended choice of $\gamma$). The simplest choice with good theoretical guarantees is $\gamma = 0$, which requires that there be no loss of fit relative to the first-step estimator. A negative $\gamma$, which requires the second-step estimator to gain in fit relative to the first-step estimator, also makes sense, since the first-step estimator typically suffers from a substantial regularization bias. Indeed, setting $\gamma = -B_n$ would eliminate the second term $(\gamma + B_n)_+$ in the rate bound (4.21), which provides a simple rationale for choosing $\gamma < 0$; this ideal choice is not feasible, however, since $B_n = \widehat Q(\widehat\beta) - \widehat Q(\beta_0)$ is not available. In the parametric model $B_n \approx \widehat Q(\widehat\beta) - \sigma^2$, which suggests the feasible, completely data-driven choice
\[
\gamma = -\left(\widehat Q(\widehat\beta) - \widehat\sigma^2\right)_+, \tag{4.22}
\]
where $\widehat\sigma^2$ is an estimate of $\sigma^2$. The theoretical guarantees of this choice are comparable to those of $\gamma = 0$, and in our computational experiments this proposal led to the best performance among the choices we tried.

Comment 4.2 (Efficient computation of $t_\gamma$). Since there are at most $|\widehat T|$ relevant values of $t$, namely the distinct magnitudes $|\widehat\beta_j|$ for $j \in \widehat T$, for any $\gamma$ we can compute the value $t_\gamma$ exactly by running at most $|\widehat T|$ unpenalized least squares problems. Moreover, since $\widehat Q(\widetilde\beta^t)$ is non-decreasing in $t$, we can compute $t_\gamma$ by a binary search over these values, running only on the order of $\log_2|\widehat T|$ least squares problems.
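A minimal sketch of the computation described in Comment 4.2 is given below (Python; the helper and variable names are ours). It scans the candidate thresholds in increasing order; because $\widehat Q(\widetilde\beta^t)$ is non-decreasing in $t$ (the feasible support only shrinks as $t$ grows), a bisection over the sorted candidates would also be valid and would give the $\log_2|\widehat T|$ variant mentioned above.

```python
import numpy as np

def refit_ols(X, y, support):
    """Least squares refit on a given support; returns (beta, Qhat(beta))."""
    n, p = X.shape
    beta = np.zeros(p)
    if len(support) > 0:
        b, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = b
    return beta, np.mean((y - X @ beta) ** 2)

def goof_trimming(X, y, beta_hat, gamma):
    """Goodness-of-fit trimming (4.19)-(4.20): the largest candidate threshold t
    whose refit on T_hat(t) = {j : |beta_hat_j| > t} loses at most gamma in fit
    relative to Qhat(beta_hat).  Returns (t_gamma, post-goof-trimmed estimator)."""
    Q_first = np.mean((y - X @ beta_hat) ** 2)          # Qhat(beta_hat)
    candidates = np.concatenate(
        ([0.0], np.sort(np.unique(np.abs(beta_hat[beta_hat != 0])))))
    t_gamma, beta_tilde = 0.0, beta_hat.copy()          # fallback if even t = 0 fails
    for t in candidates:
        support = np.flatnonzero(np.abs(beta_hat) > t)
        beta_t, Q_t = refit_ols(X, y, support)
        if Q_t - Q_first <= gamma:
            t_gamma, beta_tilde = t, beta_t
        else:
            break                                       # Qhat is monotone in t
    return t_gamma, beta_tilde
```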
Proof of Theorem 3. Let $\bar\delta := \widetilde\beta - \beta_0$. By definition (4.19),
\[
\widehat Q(\widetilde\beta) - \widehat Q(\beta_0) \le \widehat Q(\widehat\beta) + \gamma - \widehat Q(\beta_0) = \gamma + B_n.
\]
On the other hand, since $\widetilde\beta$ is a minimizer of $\widehat Q$ over the support $\widetilde T$,
\[
\widehat Q(\widetilde\beta) - \widehat Q(\beta_0) \le \widehat Q(\beta_{0\widetilde T}) - \widehat Q(\beta_0) = C_n,
\]
so that $\widehat Q(\widetilde\beta) - \widehat Q(\beta_0) \le (\gamma + B_n)\wedge C_n$. By Lemma 6 part (1), for any $\epsilon > 0$ there is a constant $K_\epsilon$ such that, with probability at least $1-\epsilon$,
\[
\|\bar\delta\|_{2,n}^2 - A_{\epsilon,n}\|\bar\delta\|_{2,n} - 2c_s\|\bar\delta\|_{2,n} \le \widehat Q(\widetilde\beta) - \widehat Q(\beta_0),
\quad \text{where } A_{\epsilon,n} = K_\epsilon\,\sigma\sqrt{\frac{\widehat m\log p + (\widehat m+s)\log\mu_{\widehat m}}{n}}.
\]
Combining the inequalities gives
\[
\|\bar\delta\|_{2,n}^2 - A_{\epsilon,n}\|\bar\delta\|_{2,n} - 2c_s\|\bar\delta\|_{2,n} \le (\gamma + B_n)\wedge C_n,
\]
and solving this inequality for $\|\bar\delta\|_{2,n}$ gives the stated result.  □

4.2. Traditional Trimming. Next we consider the traditional trimming scheme, which is based on the magnitude of the estimated coefficients. Given the first-step estimator $\widehat\beta$, define the trimmed first-step estimator $\widehat\beta_t$ by setting $\widehat\beta_{tj} = \widehat\beta_j 1\{|\widehat\beta_j| > t\}$ for $j = 1, \ldots, p$. Finally, define the selected model and the post-traditional-trimmed estimator as
\[
\widetilde T = \widehat T(t) \quad \text{and} \quad \widetilde\beta = \widetilde\beta^t. \tag{4.23}
\]
Let $\widehat m_t := |\widetilde T\setminus T|$ denote the number of components selected outside the support $T$, let $m_t := |\widehat T\setminus\widetilde T|$ denote the number of trimmed components of the first-step estimator, and let $\gamma_t := \|\widehat\beta_t - \widehat\beta\|_{2,n}$ denote the prediction-norm distance from the trimmed estimator to the first-step estimator.

Theorem 4 (Performance of a generic post-traditional-trimmed estimator). In either the parametric or the nonparametric model, let $\widehat\beta$ be any first-step estimator, and let $B_n := \widehat Q(\widehat\beta) - \widehat Q(\beta_0)$ and $C_n := \widehat Q(\beta_{0\widetilde T}) - \widehat Q(\beta_0)$. For any $\epsilon > 0$, there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K_\epsilon\,\sigma\sqrt{\frac{\widehat m_t\log p + (\widehat m_t+s)\log\mu_{\widehat m_t}}{n}} + 2c_s
+ \sqrt{\gamma_t\left(K_\epsilon\,\sigma\,G_t + \gamma_t + 2\|\widehat\beta - \beta_0\|_{2,n} + 2c_s\right) + (B_n)_+\wedge(C_n)_+},
\]
where $G_t := \sqrt{m_t\log(p\,\mu_{m_t})}/\sqrt n$ and $c_s = 0$ in the parametric model. Furthermore, $\gamma_t \le t\sqrt{\phi(m_t)\,m_t}$, and $B_n$ and $C_n$ obey the bounds (3.13) stated earlier, with $\widehat T$ replaced by $\widetilde T$.
Note that the bounds on the prediction norm stated in Theorem 4 and equation (3.13) apply to any generic post-traditional-trimmed estimator. All the terms are easily controlled, just as in the case of Theorem 3, except for $\gamma_t$, which measures the loss of fit due to trimming; if the trimming threshold $t$ is too aggressive, $\gamma_t$ can be very large. Indeed, in parametric models with true coefficients not well separated from zero and in nonparametric models, aggressive trimming can result in large goodness-of-fit losses and consequently in very slow rates of convergence and even inconsistency of the second-step estimators. Parametric models with well-separated coefficients are of course an exceptional case where good model selection is possible; for example, as suggested in Lemma 3 part (2), under the separation condition we have $T \subseteq \widetilde T$ wp $\to 1$ and $\gamma_t$ is controlled. We further discuss this issue in the next section in the context of LASSO.

Comment 4.3 (Traditional trimming based on goodness-of-fit). One way to eliminate the dependence of the performance bounds on $\gamma_t$ completely is to select the traditional trimming threshold subject to maintaining a certain goodness-of-fit level, as measured by the prediction norm: for a given $\bar\gamma_t > 0$, we can set $t = \max\{t : \|\widehat\beta_t - \widehat\beta\|_{2,n} \le \bar\gamma_t\}$. This choice uses maximal trimming subject to a specific loss of fit $\bar\gamma_t$, and the theorem above formally covers it. However, it is not easy to specify a practical, data-driven $\bar\gamma_t$; our main proposal described in the previous subsection resolves just such difficulties.

Proof of Theorem 4. Let $\bar\delta := \widetilde\beta - \beta_0$, $\delta^t := \widehat\beta_t - \beta_0$, and $\delta := \widehat\beta - \beta_0$. By definition of the second-step estimator, $\widehat Q(\widetilde\beta) \le \widehat Q(\widehat\beta_t)\wedge\widehat Q(\beta_{0\widetilde T})$, so that
\[
\widehat Q(\widetilde\beta) - \widehat Q(\beta_0) \le \left(\widehat Q(\widehat\beta_t) - \widehat Q(\widehat\beta) + B_n\right)\wedge C_n.
\]
By Lemma 6 part (1), for any $\epsilon > 0$ there is a constant $K_{\epsilon,1}$ such that, with probability at least $1-\epsilon/2$,
\[
\|\bar\delta\|_{2,n}^2 - A_{\epsilon,n}\|\bar\delta\|_{2,n} - 2c_s\|\bar\delta\|_{2,n} \le \widehat Q(\widetilde\beta) - \widehat Q(\beta_0),
\quad \text{where } A_{\epsilon,n} = K_{\epsilon,1}\,\sigma\sqrt{\frac{\widehat m_t\log p + (\widehat m_t+s)\log\mu_{\widehat m_t}}{n}}.
\]
On the other hand, we have
\[
\widehat Q(\widehat\beta_t) - \widehat Q(\widehat\beta) = -2E_n[\epsilon_i x_i'(\widehat\beta_t - \widehat\beta)] - 2E_n[r_i x_i'(\widehat\beta_t - \widehat\beta)] + \|\delta^t\|_{2,n}^2 - \|\delta\|_{2,n}^2.
\]
To bound the terms above, note first that, by Theorem 2, there is a constant $K_{\epsilon,2}$ such that, with probability at least $1-\epsilon/2$,
\[
\left|2E_n[\epsilon_i x_i'(\widehat\beta_t - \widehat\beta)]\right| \le \sigma K_{\epsilon,2}\,G_t\,\|\widehat\beta_t - \widehat\beta\|_{2,n},
\]
and, second, by the Cauchy-Schwarz inequality,
\[
\left|2E_n[r_i x_i'(\widehat\beta_t - \widehat\beta)]\right| \le 2c_s\,\|\widehat\beta_t - \widehat\beta\|_{2,n}.
\]
Moreover,
\[
\|\delta^t\|_{2,n}^2 - \|\delta\|_{2,n}^2 = \left(\|\delta^t\|_{2,n} - \|\delta\|_{2,n}\right)\left(\|\delta^t\|_{2,n} + \|\delta\|_{2,n}\right) \le \|\widehat\beta_t - \widehat\beta\|_{2,n}\left(\|\widehat\beta_t - \widehat\beta\|_{2,n} + 2\|\delta\|_{2,n}\right).
\]
Combining these inequalities and using that $\gamma_t = \|\widehat\beta_t - \widehat\beta\|_{2,n}$, we obtain, with probability at least $1-\epsilon$,
\[
\|\bar\delta\|_{2,n}^2 - A_{\epsilon,n}\|\bar\delta\|_{2,n} - 2c_s\|\bar\delta\|_{2,n} \le \left(\sigma K_{\epsilon,2}G_t\gamma_t + 2c_s\gamma_t + \gamma_t(\gamma_t + 2\|\delta\|_{2,n}) + B_n\right)\wedge C_n.
\]
Solving the resulting quadratic inequality for $\|\bar\delta\|_{2,n}$ gives the stated result, with $K_\epsilon = K_{\epsilon,1}\vee K_{\epsilon,2}$. Also, note that $\gamma_t \le t\sqrt{\phi(m_t)\,m_t}$ by the definition of RSE.2($m_t$), since $\widehat\beta_t - \widehat\beta$ has at most $m_t$ non-zero components, each of magnitude at most $t$.  □

5. Post-Model-Selection Results for LASSO

In this section we specialize our generic results on post-penalized estimators to the case of LASSO being the first-step estimator. The generic results of the previous sections allow us to use sparsity bounds and the rate of convergence of LASSO to derive the rate of convergence of post-penalized estimators in both the parametric and nonparametric models. We also derive new sharp sparsity bounds for LASSO, which may be of independent interest.
5.1. Sparsity Bounds for LASSO. We begin with a preliminary sparsity bound for LASSO and then derive our new, oracle sparsity bound.

Lemma 7 (Empirical pre-sparsity for LASSO). In either the parametric model or the nonparametric model, let $\widehat m = |\widehat T\setminus T|$ and $\lambda \ge c\,n\|S\|_\infty$. We have
\[
\sqrt{\widehat m} \le \sqrt{\phi(\widehat m)}\,\frac{2\bar c}{\kappa_1}\sqrt s + 3(\bar c+1)\sqrt{\phi(\widehat m)}\,\frac{n c_s}{\lambda},
\]
where $c_s = 0$ in the parametric model.

The lemma above states that LASSO achieves the oracle sparsity $s$ up to a factor of $\phi(\widehat m)$. It immediately yields a simple upper bound on the sparsity of the form
\[
\widehat m \lesssim s\,\phi(n), \tag{5.24}
\]
as obtained, for example, in [2] and [13]. Unfortunately, this bound is sharp only when $\phi(n)$ is bounded. When $\phi(n)$ diverges, for example when $p \ge 2n$ in the Gaussian design, where $\phi(n) \gtrsim \sqrt{\log p}$, the bound (5.24) is not sharp. However, for this case we can construct a sharp sparsity bound by combining the preceding pre-sparsity result with the following sub-linearity property of the restricted sparse eigenvalues.

Lemma 8 (Sub-linearity of restricted sparse eigenvalues). For any integer $k \ge 0$ and constant $\ell \ge 1$, we have
\[
\phi(\lceil \ell k\rceil) \le \lceil\ell\rceil\,\phi(k).
\]
A version of this lemma has previously been proven in [1] for unrestricted sparse eigenvalues.

The combination of the preceding two lemmas gives the following sparsity theorem. Recall that we assume $c_s \le \sigma\sqrt{s/n}$; also take $\alpha \le 1/4$, so that $\Lambda(1-\alpha\mid X) \ge \sigma\sqrt n$.

Theorem 5 (Sparsity bound for LASSO under data-driven penalty). In either the parametric model or the nonparametric model, consider the LASSO estimator with $\lambda \ge c\,\Lambda(1-\alpha\mid X)$, $\alpha \le 1/4$, and $c_s \le \sigma\sqrt{s/n}$, and let $\widehat m := |\widehat T\setminus T|$. Consider the set
\[
\mathcal M := \left\{m \in \mathbb N :\ m > s\cdot 2\phi(m\wedge n)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2\right\}.
\]
With probability at least $1-\alpha$,
\[
\widehat m \le s\cdot\min_{m\in\mathcal M}\phi(m\wedge n)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2. \tag{5.25}
\]

The main implication of Theorem 5 is that if $\min_{m\in\mathcal M}\phi(m\wedge n) \lesssim_P 1$, which we show below to be valid in Lemmas 9 and 10 for important designs, then with probability at least $1-\alpha$ we have $\widehat m \lesssim s$ as a consequence of (5.25). Consequently, for these designs, $\widehat s := |\widehat T| \le s + \widehat m \lesssim_P s$, that is, LASSO's sparsity is of the same order as the oracle sparsity with high probability. The reason (5.25) allows us to sharpen the previous sparsity bound (5.24) is that $\min_{m\in\mathcal M}\phi(m\wedge n) \ll \phi(n)$ for these designs. Also, our new bound is comparable to the bounds in [20] in terms of order of sharpness, but it requires a smaller penalty level $\lambda$, and it does not depend on the unknown sparse eigenvalues as the bound in [20] does. As a side contribution, we also show that for two very common designs of interest the restricted sparse eigenvalues and restricted eigenvalues defined earlier behave nicely. We state these results in asymptotic form for the sake of exposition, although we can convert them to finite sample form using the results in [20].
The following lemma uses the standard concept of (unrestricted) sparse eigenvalues (see, e.g., [2]) to state a primitive condition on the population design matrix.

Lemma 9 (Gaussian design). Suppose $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. zero-mean Gaussian random vectors such that the population design matrix $E[\tilde x_i\tilde x_i']$ has ones on the diagonal and its $s\log n$-sparse eigenvalues are bounded from above by $\psi < \infty$ and from below by $\kappa^2 > 0$. Define $x_i$ as the normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/\sqrt{E_n[\tilde x_{ij}^2]}$. Then, for any $m \le (s\log(n/e))\wedge(n/(16\log p))$, with probability at least $1 - 2\exp(-n/16)$,
\[
\phi(m) \le 8\psi, \quad \kappa(m)^2 \ge \kappa^2/72, \quad \text{and} \quad \mu_m \le 24\sqrt\psi/\kappa.
\]
Therefore, under the conditions of Theorem 5, and provided $n/(s\log p) \to \infty$, we have that as $n \to \infty$
\[
\widehat m \le s\cdot 8\psi\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2
\]
with probability approaching at least $1-\alpha$, where we can take $\kappa_1 \ge \kappa/24$.

The following lemma deals with arbitrary bounded regressors.

Lemma 10 (Bounded design). Suppose $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. bounded zero-mean random vectors, with $\max_{1\le i\le n, 1\le j\le p}|\tilde x_{ij}| \le K_B < \infty$. Assume that the population design matrix $E[\tilde x_i\tilde x_i']$ has ones on the diagonal and that its $s\log n$-sparse eigenvalues are bounded from above by $\psi < \infty$ and from below by $\kappa^2 > 0$. Define $x_i$ as the normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij}/\sqrt{E_n[\tilde x_{ij}^2]}$. Then there is a universal constant $\varepsilon > 0$ such that, if $\sqrt n/(K_B^2\,s\log p) \to \infty$, for any $m \le (s\log(n/e))\wedge(\varepsilon\sqrt n/(K_B\sqrt{\log p}))$ we have, with probability approaching 1,
\[
\phi(m) \le 4\psi, \quad \kappa(m)^2 \ge \kappa^2/4, \quad \text{and} \quad \mu_m \le 4\sqrt\psi/\kappa.
\]
Therefore, under the conditions of Theorem 5, we have that as $n \to \infty$
\[
\widehat m \le s\cdot 4\psi\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2
\]
with probability approaching at least $1-\alpha$, where we can take $\kappa_1 \ge \kappa/8$.

Proof of Theorem 5. The choice of $\lambda$ implies that $\lambda \ge c\,n\|S\|_\infty$ with probability at least $1-\alpha$. On that event, by Lemma 7 we have
\[
\sqrt{\widehat m} \le \sqrt{\phi(\widehat m)}\left(\frac{2\bar c}{\kappa_1}\sqrt s + 3(\bar c+1)\frac{nc_s}{\lambda}\right),
\]
which can be rewritten as
\[
\widehat m \le s\,\phi(\widehat m)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c+1)\frac{nc_s}{\lambda\sqrt s}\right)^2. \tag{5.26}
\]
Note that $\widehat m \le n$ by the optimality conditions. Consider any $M \in \mathcal M$, and suppose $\widehat m > M$. Then, by the sub-linearity of the restricted sparse eigenvalues (Lemma 8) and since $\lceil k\rceil \le 2k$ for $k \ge 1$,
\[
\widehat m \le s\cdot\left\lceil\frac{\widehat m}{M}\right\rceil\phi(M)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c+1)\frac{nc_s}{\lambda\sqrt s}\right)^2
\quad \Longrightarrow \quad
M \le s\cdot 2\phi(M)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c+1)\frac{nc_s}{\lambda\sqrt s}\right)^2,
\]
which violates the condition defining $M \in \mathcal M$, since $c_s \le \sigma\sqrt{s/n}$ and $\lambda \ge c\Lambda(1-\alpha\mid X) \ge c\sigma\sqrt n$ imply
\[
3(\bar c+1)\frac{nc_s}{\lambda\sqrt s} \le \frac{3(\bar c+1)}{c} = 3(\bar c - 1).
\]
Therefore, we must have $\widehat m \le M$. In turn, applying (5.26) once more with $\widehat m \le M\wedge n$, we obtain
\[
\widehat m \le s\,\phi(M\wedge n)\left(\frac{2\bar c}{\kappa_1} + 3(\bar c - 1)\right)^2.
\]
The result follows by minimizing the bound over $M \in \mathcal M$.  □

5.2. Performance of the Post-LASSO Estimator. Here we show that the post-LASSO estimator enjoys good theoretical performance despite possibly "poor" selection of the model by LASSO.

Theorem 6 (Performance of post-LASSO). In either the parametric model or the nonparametric model, if $\lambda \ge c\,n\|S\|_\infty$, then for any $\epsilon > 0$ there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K_\epsilon\,\sigma\sqrt{\frac{\widehat m\log p + (\widehat m+s)\log\mu_{\widehat m}}{n}} + 2c_s
+ 1\{T\not\subseteq\widehat T\}\sqrt{\frac{\lambda\sqrt s}{n\kappa_1}\left(\left(1+\frac1c\right)\frac{\lambda\sqrt s}{n\kappa_1} + 2c_s\right)},
\]
where $\widehat m := |\widehat T\setminus T|$ and $c_s = 0$ in the parametric model. In particular, under the data-driven choice of $\lambda$ specified in (2.3) with $\log(1/\alpha) \le \log p$, for any $\epsilon > 0$ there is a constant $K'_{\epsilon,\alpha}$ such that
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K'_{\epsilon,\alpha}\,\sigma\sqrt{\frac{\widehat m\log(p\,\mu_{\widehat m})}{n}} + 2c_s + 1\{T\not\subseteq\widehat T\}\,K'_{\epsilon,\alpha}\,\frac{\sigma}{\kappa_1}\sqrt{\frac{s\log p}{n}} \tag{5.27}
\]
with probability at least $1-\alpha-\epsilon$.

Proof of Theorem 6. Note that, by the optimality of $\widehat\beta$ in the LASSO problem, and letting $\delta = \widehat\beta - \beta_0$,
\[
B_n := \widehat Q(\widehat\beta) - \widehat Q(\beta_0) \le \frac{\lambda}{n}\left(\|\beta_0\|_1 - \|\widehat\beta\|_1\right) \le \frac{\lambda}{n}\left(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\right). \tag{5.28}
\]
If $\|\delta_{T^c}\|_1 > \bar c\|\delta_T\|_1$, then $B_n \le 0$, since $\bar c > 1$. Otherwise, if $\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1$, by RE.1($\bar c$) we have
\[
B_n \le \frac{\lambda}{n}\|\delta_T\|_1 \le \frac{\lambda\sqrt s}{n\kappa_1}\|\delta\|_{2,n}. \tag{5.29}
\]
The result then follows by applying the bound (5.29), Lemma 2 to bound $\|\delta\|_{2,n}$, and Theorem 1, and also noting that if $T \subseteq \widehat T$ we have $C_n = 0$, so that $(B_n)_+\wedge(C_n)_+ \le 1\{T\not\subseteq\widehat T\}(B_n)_+$. The second claim is immediate from the first, using the condition $c_s \le \sigma\sqrt{s/n}$, that is, relation (2.9), in the case of the nonparametric model.  □

This theorem provides a performance bound for post-LASSO as a function of 1) LASSO's sparsity, characterized by $\widehat m$, 2) LASSO's rate of convergence, and 3) LASSO's model selection ability. For common designs this bound implies that post-LASSO performs at least as well as LASSO, can be strictly better in some cases, and has a smaller regularization bias. We provide further theoretical comparisons in what follows, and computational examples supporting these comparisons appear in Section 6. It is also worth repeating here that performance bounds in other norms of interest immediately follow by the triangle inequality and by the definition of $\kappa(\widehat m)$:
\[
\sqrt{E_n[(x_i'\widetilde\beta - f_i)^2]} \le \|\widetilde\beta - \beta_0\|_{2,n} + c_s \quad \text{and} \quad \|\widetilde\beta - \beta_0\| \le \|\widetilde\beta - \beta_0\|_{2,n}/\kappa(\widehat m). \tag{5.30}
\]
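For concreteness, a minimal sketch of the two-step procedure analyzed in Theorem 6, LASSO in the first step and ordinary least squares on the selected support in the second, might look as follows. The function names are our own, and `fit_lasso` refers to the earlier sketch; the penalty level can be supplied by the data-driven simulation sketch given in Section 2.

```python
import numpy as np

def post_lasso(X, y, lam):
    """Two-step post-LASSO: (i) LASSO with penalty level lam as in (2.1),
    (ii) unrestricted OLS refit on the selected support T_hat, as in (3.11)."""
    n, p = X.shape
    beta_hat = fit_lasso(X, y, lam)              # first step: LASSO
    support = np.flatnonzero(beta_hat != 0)      # selected model T_hat
    beta_tilde = np.zeros(p)
    if support.size > 0:                         # second step: OLS on T_hat
        b, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_tilde[support] = b
    return beta_hat, beta_tilde, support
```

The goof-trimmed variant is obtained by replacing the plain refit in the second step with the `goof_trimming` routine sketched in Section 4.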
Comment 5.1 (Comparison of the performance of post-LASSO vs LASSO). In order to carry out complete and formal comparisons between LASSO and post-LASSO, we assume that
\[
\phi(\widehat m) \lesssim_P 1, \quad \kappa_1 \gtrsim_P 1, \quad \mu_{\widehat m} \lesssim_P 1, \quad \log(1/\alpha) \le \log p, \quad \text{and} \quad \alpha = o(1). \tag{5.31}
\]
We established fairly general sufficient conditions for the first three relations in Lemmas 9 and 10; the fourth is a mild condition on the choice of $\alpha$ in the definition of the data-driven penalty level (2.3), which simplifies the probability statements in what follows. We first note that under (5.31) post-LASSO with the data-driven penalty level $\lambda$ specified in (2.3) obeys
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\widehat m\log p}{n}} + c_s + 1\{T\not\subseteq\widehat T\}\,\sigma\sqrt{\frac{s\log p}{n}}. \tag{5.32}
\]
Since the sparsity bound of Theorem 5 gives $\widehat m \lesssim_P s$ under (5.31), it follows that post-LASSO generally achieves the same near-oracle rate as LASSO:
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}} + c_s.
\]
Notably, this occurs despite the fact that LASSO may in general fail to correctly select the oracle model $T$ as a subset, that is, $T \not\subseteq \widehat T$. Furthermore, there is a class of well-behaved models, a neighborhood of parametric models with well-separated coefficients, in which post-LASSO strictly improves upon LASSO. Specifically, if $\widehat m = o_P(s)$ and $T \subseteq \widehat T$ wp $\to 1$, then post-LASSO strictly improves upon LASSO's rate:
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{o(s)\log p}{n}} + c_s. \tag{5.33}
\]
Finally, in the extreme case of perfect model selection, when $T = \widehat T$ wp $\to 1$, as occurs under the conditions of Lemma 4, post-LASSO achieves the oracle performance, $\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$.

5.3. Performance of the Post-Goof-Trimmed LASSO Estimator. In what follows we provide performance bounds for the post-goof-trimmed estimator $\widetilde\beta$ defined in equation (4.20), with LASSO as the first-step estimator.

Theorem 7 (Performance of post-goof-trimmed LASSO). In either the parametric model or the nonparametric model, if $\lambda \ge c\,n\|S\|_\infty$, then for any $\epsilon > 0$ there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K_\epsilon\,\sigma\sqrt{\frac{\widehat m\log p + (\widehat m+s)\log\mu_{\widehat m}}{n}} + 2c_s
+ 1\{T\not\subseteq\widetilde T\}\sqrt{(\gamma)_+ + \frac{\lambda\sqrt s}{n\kappa_1}\left(\left(1+\frac1c\right)\frac{\lambda\sqrt s}{n\kappa_1} + 2c_s\right)},
\]
where $\widehat m := |\widetilde T\setminus T|$ and $c_s = 0$ in the parametric model. In particular, under the data-driven choice of $\lambda$ specified in (2.3) with $\log(1/\alpha) \le \log p$ and $\gamma \le \sigma^2 s/n$, for any $\epsilon > 0$ there is a constant $K'_{\epsilon,\alpha}$ such that, with probability at least $1-\alpha-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K'_{\epsilon,\alpha}\,\sigma\sqrt{\frac{\widehat m\log(p\,\mu_{\widehat m})}{n}} + 2c_s + 1\{T\not\subseteq\widetilde T\}\,K'_{\epsilon,\alpha}\,\frac{\sigma}{\kappa_1}\sqrt{\frac{s\log p}{n}}. \tag{5.34}
\]

Proof. The proof of the first claim follows the same steps as the proof of Theorem 6, invoking Theorem 3 in the last step. The second claim follows immediately from the first, where we also use the imposed condition $\gamma \le \sigma^2 s/n$ and the condition $c_s \le \sigma\sqrt{s/n}$ from (2.9) in the nonparametric model.  □

This theorem provides a performance bound for post-goof-trimmed LASSO as a function of 1) LASSO's rate of convergence, 2) LASSO's model selection ability, characterized by $\widehat m$, and 3) the goodness-of-fit trimming scheme. Generally, this bound is at least as good as the bound for post-LASSO, since the post-goof-trimmed LASSO trims as much as possible subject to maintaining a certain goodness-of-fit. It is also appealing that this estimator determines the trimming level in a completely data-driven fashion. Moreover, by construction the estimated model is sparser than post-LASSO's model, which leads to the superior performance of post-goof-trimmed LASSO over post-LASSO in some of the computational examples of Section 6. We provide further theoretical comparisons below.

Comment 5.2 (Comparison of the performance of post-goof-trimmed LASSO vs LASSO and post-LASSO). We again assume condition (5.31). Under these conditions, post-goof-trimmed LASSO obeys the performance bound
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\widehat m\log p}{n}} + c_s + 1\{T\not\subseteq\widetilde T\}\,\sigma\sqrt{\frac{s\log p}{n}}, \tag{5.35}
\]
which is analogous to the bound for post-LASSO, since the number $\widehat m$ of wrong regressors selected by the goof-trimmed LASSO is no larger than the number selected by LASSO itself, which is $\lesssim_P s$ by Theorem 5 and conditions (5.31). It follows that in general post-goof-trimmed LASSO matches the near-oracle rate of LASSO and post-LASSO, $\sigma\sqrt{s\log p/n} + c_s$. Nonetheless, there is a class of models, a neighborhood of parametric models with well-separated coefficients, for which improvements are possible. Specifically, as in (5.33), if $\widehat m = o_P(s)$ and $T \subseteq \widetilde T$ wp $\to 1$, then post-goof-trimmed LASSO strictly improves upon LASSO's rate; furthermore, if $\widehat m$ is also $o_P$ of the number of wrong regressors selected by LASSO, post-goof-trimmed LASSO outperforms post-LASSO as well:
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\widehat m\log p}{n}} + \sigma\sqrt{\frac{s}{n}}.
\]
Lastly, under the conditions of Lemma 3 holding for $t = t_\gamma$, post-goof-trimmed LASSO achieves the oracle performance, $\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$.

5.4. Performance of the Post-Traditional-Trimmed LASSO Estimator. Next we consider the traditional trimming scheme, which truncates to zero all components below a given threshold $t$ and is arguably the most used trimming scheme in the literature. To state the result, recall that $\widehat\beta_{tj} = \widehat\beta_j 1\{|\widehat\beta_j| > t\}$, $\widetilde T = \widehat T(t)$, $\widehat m_t := |\widetilde T\setminus T|$, $m_t := |\widehat T\setminus\widetilde T|$, and $\gamma_t := \|\widehat\beta_t - \widehat\beta\|_{2,n}$.

Theorem 8 (Performance of post-traditional-trimmed LASSO). In either the parametric model or the nonparametric model, if $\lambda \ge c\,n\|S\|_\infty$, then for any $\epsilon > 0$ there is a constant $K_\epsilon$ independent of $n$ such that, with probability at least $1-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K_\epsilon\,\sigma\sqrt{\frac{\widehat m_t\log p + (\widehat m_t+s)\log\mu_{\widehat m_t}}{n}} + 2c_s
+ \sqrt{\gamma_t\left(K_\epsilon\,\sigma G_t + \gamma_t + 2\|\widehat\beta - \beta_0\|_{2,n} + 2c_s\right)}
+ 1\{T\not\subseteq\widetilde T\}\sqrt{\frac{\lambda\sqrt s}{n\kappa_1}\left(\left(1+\frac1c\right)\frac{\lambda\sqrt s}{n\kappa_1} + 2c_s\right)},
\]
where $G_t := \sqrt{m_t\log(p\,\mu_{m_t})}/\sqrt n$ and $\gamma_t \le t\sqrt{\phi(m_t)m_t}$. In particular, under the data-driven choice of $\lambda$ specified in (2.3) with $\log(1/\alpha) \le \log p$, for any $\epsilon > 0$ there is a constant $K'_{\epsilon,\alpha}$ such that, with probability at least $1-\alpha-\epsilon$,
\[
\|\widetilde\beta - \beta_0\|_{2,n} \le K'_{\epsilon,\alpha}\left(\sigma\sqrt{\frac{\widehat m_t\log(p\,\mu_{\widehat m_t})}{n}} + \sqrt{\gamma_t\left(\sigma G_t + \gamma_t + \frac{\sigma}{\kappa_1}\sqrt{\frac{s\log p}{n}}\right)}\right) + 6c_s
+ 1\{T\not\subseteq\widetilde T\}\,K'_{\epsilon,\alpha}\,\frac{\sigma}{\kappa_1}\sqrt{\frac{s\log p}{n}}.
\]

Proof. The proof of the first claim follows the same steps as the proof of Theorem 6, invoking Theorem 4 in the last step. The second claim follows from the first, where we also use the relation (2.9), $c_s \le \sigma\sqrt{s/n}$, in the nonparametric model.  □
This theorem provides a performance bound for post-traditional-trimmed LASSO as a function of 1) LASSO's rate of convergence, 2) LASSO's sparsity, characterized by $\widehat m_t$, 3) the goodness-of-fit loss $\gamma_t$ relative to LASSO induced by the trimming threshold $t$, and 4) LASSO's model selection ability. Generally, this bound may be worse than the bound for LASSO, and this arises because post-traditional-trimmed LASSO may use too much trimming, resulting in large goodness-of-fit losses $\gamma_t$. We provide further theoretical comparisons below and computational examples in Section 6.

Comment 5.3 (Comparison of the performance of post-traditional-trimmed LASSO vs LASSO and post-LASSO). In this discussion we again assume conditions (5.31). Under these conditions, post-traditional-trimmed LASSO obeys the bound
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\widehat m_t\log p}{n}} + c_s + \sqrt{\gamma_t\left(\sigma\sqrt{\frac{m_t\log p}{n}} + \gamma_t + \sigma\sqrt{\frac{s\log p}{n}}\right)} + 1\{T\not\subseteq\widetilde T\}\,\sigma\sqrt{\frac{s\log p}{n}}. \tag{5.36}
\]
In this case we have $\widehat m_t \vee m_t \le s + \widehat m \lesssim_P s$ by Theorem 5, and, in general, the rate above cannot improve upon LASSO's rate of convergence given in Lemma 2. As expected, the choice of $t$, which controls trimming, can have a large impact on the performance bound through $\gamma_t \le t\sqrt{\phi(m_t)m_t}$:
\[
t \lesssim \sigma\sqrt{\frac{\log p}{n}} \implies \|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}}, \qquad
t \lesssim \sigma\sqrt{\frac{s\log p}{n}} \implies \|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s^2\log p}{n}}. \tag{5.37}
\]
Both options are standard suggestions in the literature on model selection via LASSO, as reviewed in Lemma 3 parts (3) and (2), respectively. The first choice in (5.37), suggested by [11], guarantees that post-traditional-trimmed LASSO theoretically achieves the near-oracle rate of LASSO; to implement it in practice we can set $t = \lambda/n$. The second choice, however, results in a very poor performance bound, and even suggests inconsistency when $s^2\log p$ is large relative to $n$.

Furthermore, there is a special class of models, a neighborhood of parametric models with well-separated coefficients, for which improvements upon the rate of convergence of LASSO are possible. Specifically, as in (5.33), if the first choice of threshold in (5.37) is used, $\widehat m_t = o_P(s)$, and $T \subseteq \widetilde T$ wp $\to 1$, then post-traditional-trimmed LASSO strictly improves upon LASSO's rate:
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{o(s)\log p}{n}} + \sigma\sqrt{\frac{s}{n}}.
\]
Moreover, if in addition $\widehat m_t$ is $o_P$ of the number of wrong regressors selected by LASSO, post-traditional-trimmed LASSO also outperforms post-LASSO:
\[
\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{\widehat m_t\log p}{n}} + \sigma\sqrt{\frac{s}{n}}.
\]
Lastly, under the conditions of Lemma 3 holding for the given $t$, post-traditional-trimmed LASSO achieves the oracle performance, $\|\widetilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n}$.

6. Empirical Performance Relative to LASSO

In this section we assess the finite sample performance of the following estimators: 1) LASSO, which is our benchmark, 2) post-LASSO, 3) post-goof-trimmed LASSO, and 4) post-traditional-trimmed LASSO with the trimming threshold $t = \lambda/n$ suggested by Lemma 3 part (3).
6. Empirical Performance Relative to LASSO

In this section we assess the finite sample performance of the following estimators: 1) LASSO, which is our benchmark, 2) post-LASSO, 3) post-goof-trimmed LASSO, and 4) post-traditional-trimmed LASSO with the trimming threshold $t = \lambda/n$ suggested by Lemma 3 part (3).

We consider a "parametric" and a "nonparametric" model of the form:
$$y_i = f_i + \varepsilon_i, \quad f_i = x_i'\beta_0, \quad \varepsilon_i \sim N(0,\sigma^2), \quad i = 1,\dots,n, \qquad (6.39)$$
$$\beta_0 = C\cdot[1,1,1,1,1,0,0,\dots,0]' \ \text{in the parametric model}, \qquad \beta_0 = C\cdot[1,1/2,1/3,\dots,1/p]' \ \text{in the nonparametric model}. \qquad (6.40)$$
The parameter $C$ determines the size of the coefficients, representing the "strength of the signal", and we vary $C$ between 0 and 2. The variance of the noise is $\sigma^2 = 1$. The number of regressors is $p = 500$, the sample size is $n = 100$, and we used 1000 simulations for each design. We generate regressors from the normal law $x_i \sim N(0,\Sigma)$, and consider three designs of the covariance matrix $\Sigma$: a) the isotropic design, with $\Sigma_{jk} = 0$ for $j \ne k$; b) the Toeplitz design, with $\Sigma_{jk} = (1/2)^{|j-k|}$; and c) the equi-correlated design, with $\Sigma_{jk} = 1/2$ for $j \ne k$; in all designs $\Sigma_{jj} = 1$. Thus our parametric model is very sparse and offers a rather favorable setting for applying LASSO-type methods, while our nonparametric model is non-sparse and much less favorable. (An illustrative sketch of these designs and of the quantities reported below is given after the discussion of the isotropic-design results.)

We present the results of the computational experiments for each design a)-c) in Figures 2-4. The left column of each figure reports the results for the parametric model, and the right column of each figure reports the results for the nonparametric model. For each model the figures plot the following as a function of the signal strength $C$ for each estimator $\widehat\beta$:
• in the top panel, the number of regressors selected, $|\widehat T|$;
• in the middle panel, the norm of the bias, $\|E[\widehat\beta] - \beta_0\|$;
• in the bottom panel, the average empirical risk, namely $E[E_n[(f_i - x_i'\widehat\beta)^2]]$.
We will focus the discussion on the isotropic design, and only highlight differences for the other designs.

Figure 2, left panel, shows the results for the parametric model with the isotropic design. We see from the bottom panel that, for a wide range of signal strength $C$, both post-LASSO and post-goof-trimmed LASSO significantly outperform both LASSO and post-traditional-trimmed LASSO in terms of empirical risk. The middle panel shows that the first two estimators' superior performance stems from their much smaller bias. We see from the top panel that LASSO achieves good sparsity, ensuring that post-LASSO performs well, but post-goof-trimmed LASSO achieves even better sparsity. Under very high signal strength, post-goof-trimmed LASSO achieves the performance of the oracle estimator; post-traditional-trimmed LASSO also achieves this performance; post-LASSO nearly matches it; while LASSO does not match this performance. Interestingly, the post-traditional-trimmed LASSO performs very poorly for intermediate ranges of signal.

Figure 2, right panel, shows the results for the nonparametric model with the isotropic design. We see from the bottom panel that, as in the parametric model, both post-LASSO and post-goof-trimmed LASSO significantly outperform both LASSO and post-traditional-trimmed LASSO in terms of empirical risk for almost all signals. As in the parametric model, the middle panel shows that the two estimators are able to outperform because they have a much smaller bias. We also see from the top panel that, as in the parametric model, LASSO achieves good sparsity, while post-goof-trimmed LASSO achieves excellent sparsity. In contrast to the parametric model, in the nonparametric setting the post-traditional-trimmed LASSO performs poorly in terms of empirical risk except for very weak signals. Also in contrast to the parametric model, no estimator achieves the exact oracle performance, although LASSO, and especially post-LASSO and post-goof-trimmed LASSO, perform nearly as well, as we would expect from the theoretical results.
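As noted above, the data-generating designs in (6.39)-(6.40) and the three quantities plotted in Figures 2-4 can be sketched as follows. The function names, the penalty value `alpha`, and the use of scikit-learn's Lasso as the first-step estimator are our own illustrative assumptions; the paper does not supply code.

```python
import numpy as np
from scipy.linalg import toeplitz
from sklearn.linear_model import Lasso

def make_design(n=100, p=500, design="isotropic", rho=0.5, rng=None):
    """Draw x_i ~ N(0, Sigma) for the three covariance designs in the text."""
    rng = np.random.default_rng(rng)
    if design == "isotropic":
        Sigma = np.eye(p)
    elif design == "toeplitz":
        Sigma = toeplitz(rho ** np.arange(p))
    else:  # equi-correlated
        Sigma = np.full((p, p), rho)
        np.fill_diagonal(Sigma, 1.0)
    return rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def make_coefs(p=500, C=1.0, parametric=True):
    """beta_0 from (6.40): five unit coefficients, or the 1/j decay."""
    beta0 = np.zeros(p)
    if parametric:
        beta0[:5] = 1.0
    else:
        beta0 = 1.0 / np.arange(1, p + 1)
    return C * beta0

# One simulation draw and the three quantities plotted in Figures 2-4.
rng = np.random.default_rng(0)
X = make_design(design="toeplitz", rng=rng)
beta0 = make_coefs(C=1.0, parametric=True)
f = X @ beta0
y = f + rng.standard_normal(len(f))                      # sigma^2 = 1
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)    # alpha: placeholder penalty level
sparsity = np.count_nonzero(fit.coef_)                   # top panel: |T-hat|
bias_vec = fit.coef_ - beta0                             # averaged over draws -> middle panel
emp_risk = np.mean((f - X @ fit.coef_) ** 2)             # bottom panel: E_n[(f_i - x_i'b)^2]
```

In the actual experiments these quantities would be averaged over the 1000 simulation draws for each value of $C$ and each estimator.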
Figure 3 shows the results for the parametric and nonparametric models with the Toeplitz design. This design deviates only moderately from the isotropic design, and we see that the previous findings continue to hold.

Figure 4 shows the results under the equi-correlated design. This design strongly deviates from the isotropic design, but we see that in this case the previous findings continue to hold with only a few differences. Specifically, we see from the top panels that LASSO no longer selects very sparse models, while post-goof-trimmed LASSO continues to perform well and selects very sparse models. Consequently, in the case of the parametric model, post-goof-trimmed LASSO still substantially outperforms post-LASSO in terms of empirical risk, as the bottom-left panel shows. In contrast, we see from the bottom-right panel that in the nonparametric model, post-goof-trimmed LASSO performs equally as well as post-LASSO in terms of empirical risk, despite the fact that it uses a much sparser model for estimation.

The findings above confirm our theoretical results on post-penalized estimators in parametric and nonparametric models. Indeed, we see that post-goof-trimmed LASSO and post-LASSO are at least as good as LASSO, and often perform considerably better, since they remove penalization bias. Post-goof-trimmed LASSO outperforms post-LASSO whenever LASSO does not produce excellent sparsity. When the signal is strong and the model is parametric and sparse (or very close to being such), the LASSO-based model selection permits the selection of the oracle or near-oracle model. That allows post-model selection estimators to achieve improvements in empirical risk over LASSO. Of particular note is the excellent performance of post-goof-trimmed LASSO, which uses data-driven trimming to select a sparse model. This performance is fully consistent with our theoretical results. Finally, traditional trimming performs poorly for intermediate ranges of signal. In particular, it exhibits very large biases, leading to large goodness-of-fit losses.

Appendix A. Proofs of Lemmas 1 and 2

Proof of Lemma 1. Following Bickel, Ritov and Tsybakov [2], to establish the result we make use of the following relations: for $\delta = \widehat\beta - \beta_0$ and $\lambda \ge cn\|S\|_\infty$,
$$Q(\widehat\beta) - Q(\beta_0) = \|\delta\|_{2,n}^2 - 2E_n[\varepsilon_i x_i'\delta] \ \ge\ \|\delta\|_{2,n}^2 - \|S\|_\infty\|\delta\|_1 \ \ge\ \|\delta\|_{2,n}^2 - \frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big), \qquad (A.41)$$
and, by definition of $\widehat\beta$,
$$Q(\widehat\beta) - Q(\beta_0) \ \le\ \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\widehat\beta\|_1 \ \le\ \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big). \qquad (A.42)$$
Since $\|\delta\|_{2,n}^2 \ge 0$, relations (A.41) and (A.42) imply that
$$\|\delta_{T^c}\|_1 \ \le\ \frac{c+1}{c-1}\,\|\delta_T\|_1 \ =\ \bar c\,\|\delta_T\|_1, \qquad (A.43)$$
where we used that $c > 1$. Going back to (A.41) and (A.42), we also get
$$\|\delta\|_{2,n}^2 \ \le\ \Big(1 + \frac{1}{c}\Big)\frac{\lambda}{n}\,\|\delta_T\|_1 \ \le\ \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n}\,\frac{\|\delta\|_{2,n}}{\kappa_{\bar c}}, \qquad (A.44)$$
where we invoked RE.1($\bar c$) since (A.43) holds. Solving for $\|\delta\|_{2,n}$ gives the result on the prediction norm. Finally, the bound on $\lambda(1-\alpha\mid X)$ follows from the union bound and a probability inequality for Gaussian random variables, $P(|\xi| > M) \le \exp(-M^2/2)$ for $\xi \sim N(0,1)$; see Proposition 2.2.1(a) in [8]. $\square$
[Figure 2 about here. Panels: A. Sparsity, B. Bias, C. Empirical Risk, each plotted against the signal strength C, for the parametric model (left column) and the nonparametric model (right column). Estimators: LASSO, post-LASSO, post-goof-trimmed LASSO, post-traditional-trimmed LASSO.]

Figure 2. This figure plots the performance of the estimators listed in the text under the isotropic design for the covariates, $\Sigma_{jk} = 0$ if $j \ne k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$, with 1000 simulations for each value of $C$.

[Figure 3 about here. Same layout as Figure 2.]

Figure 3. This figure plots the performance of the estimators listed in the text under the Toeplitz design for the covariates, $\Sigma_{jk} = \rho^{|j-k|}$ if $j \ne k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$, with 1000 simulations for each value of $C$.

[Figure 4 about here. Same layout as Figure 2.]

Figure 4. This figure plots the performance of the estimators listed in the text under the equi-correlated design for the covariates, $\Sigma_{jk} = \rho$ if $j \ne k$. The left column corresponds to the parametric case and the right column corresponds to the nonparametric case described in the text. The number of regressors is $p = 500$ and the sample size is $n = 100$, with 1000 simulations for each value of $C$.

Proof of Lemma 2. Similar to the proof of Lemma 1, we make use of the following relations: for $\delta = \widehat\beta - \beta_0$ and $\lambda \ge cn\|S\|_\infty$,
$$Q(\widehat\beta) - Q(\beta_0) = \|\delta\|_{2,n}^2 - 2E_n[\varepsilon_i x_i'\delta] + 2E_n[r_i x_i'\delta] \ \ge\ \|\delta\|_{2,n}^2 - \frac{\lambda}{cn}\big(\|\delta_T\|_1 + \|\delta_{T^c}\|_1\big) - 2c_s\|\delta\|_{2,n}, \qquad (A.45)$$
and, by definition of $\widehat\beta$,
$$Q(\widehat\beta) - Q(\beta_0) \ \le\ \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\widehat\beta\|_1 \ \le\ \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big). \qquad (A.46)$$
If $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le 0$, then $\|\delta\|_{2,n} \le 2c_s$ and we have established the bound. On the other hand, if $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} > 0$, then from (A.45) and (A.46) we get the domination condition (2.6), which implies that
$$\|\delta_{T^c}\|_1 \ \le\ \bar c\,\|\delta_T\|_1, \qquad (A.47)$$
and therefore, using RE.1($\bar c$),
$$\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \ \le\ \Big(1 + \frac{1}{c}\Big)\frac{\lambda}{n}\,\|\delta_T\|_1 \ \le\ \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n}\,\frac{\|\delta\|_{2,n}}{\kappa_{\bar c}},$$
which gives the result on the prediction norm. Finally, the bound on $\lambda(1-\alpha\mid X)$ follows from the union bound and a probability inequality for Gaussian random variables, $P(|\xi| > M) \le \exp(-M^2/2)$ for $\xi \sim N(0,1)$; see Proposition 2.2.1(a) in [8]. $\square$

Appendix B. Proofs of Lemmas for Post-Model Selection Estimators

Proof of Lemma 5. The bound on $R_n$ follows from Lemma 6 result (1). The bound on $C_n$ follows from Lemma 6 result (2), which proves the statement of the theorem. $\square$

Proof of Lemma 6. Result (1) follows from the relation
$$\big|Q(\beta_0 + \delta) - Q(\beta_0) - \|\delta\|_{2,n}^2\big| = \big|2E_n[\varepsilon_i x_i'\delta] + 2E_n[r_i x_i'\delta]\big|,$$
then applying Theorem 2 on sparse control of noise to $|2E_n[\varepsilon_i x_i'\delta]|$, bounding $|2E_n[r_i x_i'\delta]|$ by using the Cauchy-Schwarz inequality, and bounding $\binom{p}{\widehat m}$ by $p^{\widehat m}$. Result (2) also follows from Theorem 2, but applying it with $s = 0$ (since only the components in $T$ are modified) and $\widehat m = k$. $\square$
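Computationally, the post-model-selection estimators analyzed in Appendix B are ordinary least squares refits on a selected support $\widehat T$, with zeros elsewhere. The following is a minimal illustrative sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def post_model_selection_ols(X, y, selected):
    """OLS refit on a selected support T-hat; coefficients outside it are zero.
    Illustrative sketch of the post-model-selection estimators in Appendix B."""
    beta = np.zeros(X.shape[1])
    selected = np.asarray(selected, dtype=int)
    if selected.size:
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        beta[selected] = coef
    return beta
```

Feeding in the support of the LASSO, goof-trimmed, or traditionally trimmed fit yields post-LASSO, post-goof-trimmed LASSO, or post-traditional-trimmed LASSO, respectively.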
Appendix C. Proofs of Lemmas for Sparsity of the LASSO Estimator

Proof of Lemma 7. Let $\widehat T = \mathrm{support}(\widehat\beta)$ and $\widehat m = |\widehat T \setminus T|$. We have from the optimality conditions that
$$2E_n[x_{ij}(y_i - x_i'\widehat\beta)] = \mathrm{sign}(\widehat\beta_j)\,\lambda/n \qquad \text{for each } j \in \widehat T \setminus T.$$
Therefore, letting $R = (r_1,\dots,r_n)'$, we have
$$\sqrt{\widehat m}\,\lambda = 2\big\|\big(X'(Y - X\widehat\beta)\big)_{\widehat T\setminus T}\big\| \ \le\ 2\big\|\big(X'(Y - R - X\beta_0)\big)_{\widehat T\setminus T}\big\| + 2\big\|(X'R)_{\widehat T\setminus T}\big\| + 2\big\|\big(X'X(\beta_0 - \widehat\beta)\big)_{\widehat T\setminus T}\big\|$$
$$\le\ \sqrt{\widehat m}\,n\|S\|_\infty + 2n\sqrt{\phi(\widehat m)}\,c_s + 2n\sqrt{\phi(\widehat m)}\,\|\beta_0 - \widehat\beta\|_{2,n},$$
where we used that
$$\big\|\big(X'X(\beta_0 - \widehat\beta)\big)_{\widehat T\setminus T}\big\| = \sup_{\|\alpha_{T^c}\|_0 \le \widehat m,\ \|\alpha\|\le 1}\big|\alpha'X'X(\beta_0 - \widehat\beta)\big| \ \le\ \sup_{\|\alpha_{T^c}\|_0 \le \widehat m,\ \|\alpha\|\le 1}\|\alpha'X'\|\,\|X(\beta_0 - \widehat\beta)\| \ \le\ n\sqrt{\phi(\widehat m)}\,\|\beta_0 - \widehat\beta\|_{2,n},$$
and similarly $\|(X'R)_{\widehat T\setminus T}\| \le n\sqrt{\phi(\widehat m)}\,c_s$. Since $\lambda/c \ge n\|S\|_\infty$ and $\|\beta_0 - \widehat\beta\|_{2,n} \le (1 + 1/c)\,\lambda\sqrt{s}/(n\kappa_{\bar c}) + 2c_s$, the result follows by noting that
$$\Big(1 - \frac{1}{c}\Big)\sqrt{\widehat m} \ \le\ 2\sqrt{\phi(\widehat m)}\,\Big(1 + \frac{1}{c}\Big)\frac{\sqrt{s}}{\kappa_{\bar c}} \ +\ 6\sqrt{\phi(\widehat m)}\,\frac{n c_s}{\lambda}. \qquad \square$$

Proof of Lemma 8. Let $W := E_n[x_i x_i']$, and let $\bar\alpha$ be such that $\|\bar\alpha_{T^c}\|_0 \le \lceil k\rceil m$ and $\|\bar\alpha\| \le 1$. We can decompose $\bar\alpha = \sum_{i=1}^{\lceil k\rceil}\alpha_i$, where we can choose the $\alpha_i$'s such that $\|\alpha_{i,T^c}\|_0 \le m$ and $\alpha_{i,T} = \bar\alpha_T/\lceil k\rceil$ for each $i = 1,\dots,\lceil k\rceil$. Since $W$ is positive semi-definite, $\alpha_i'W\alpha_i + \alpha_j'W\alpha_j \ge 2|\alpha_i'W\alpha_j|$ for any pair $(i,j)$. Therefore
$$\bar\alpha'W\bar\alpha = \sum_{i=1}^{\lceil k\rceil}\sum_{j=1}^{\lceil k\rceil}\alpha_i'W\alpha_j \ \le\ \lceil k\rceil\sum_{i=1}^{\lceil k\rceil}\alpha_i'W\alpha_i \ \le\ \lceil k\rceil\sum_{i=1}^{\lceil k\rceil}\|\alpha_i\|^2\,\phi(\|\alpha_{i,T^c}\|_0) \ \le\ \lceil k\rceil\,\max_{i}\phi(\|\alpha_{i,T^c}\|_0)\sum_{i=1}^{\lceil k\rceil}\|\alpha_i\|^2 \ \le\ \lceil k\rceil\,\phi(m),$$
where we used that $\sum_{i=1}^{\lceil k\rceil}\|\alpha_i\|^2 \le 1$ by the definition of the $\alpha_i$'s. Taking the supremum over such $\bar\alpha$ gives $\phi(\lceil k\rceil m) \le \lceil k\rceil\,\phi(m)$. $\square$
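The first-order optimality conditions invoked at the start of the proof of Lemma 7 are easy to check numerically. The sketch below does so with scikit-learn's Lasso, whose objective $(1/2n)\|y - X\beta\|^2 + \alpha\|\beta\|_1$ differs from the paper's normalization only by constant factors; the data-generating choices and the penalty value are our own illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Numerical check of the optimality conditions used in the proof of Lemma 7:
# on the active set, the score (1/n) X_j'(y - X b) equals alpha * sign(b_j),
# up to the solver's tolerance; off the active set it is at most alpha.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 1.0                      # sparse "true" model, as in the simulations
y = X @ beta0 + rng.standard_normal(n)

alpha = 0.1
fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=50_000).fit(X, y)
score = X.T @ (y - X @ fit.coef_) / n
active = np.flatnonzero(fit.coef_)

if active.size:
    # approximately zero on the active set
    print(np.max(np.abs(score[active] - alpha * np.sign(fit.coef_[active]))))
# approximately bounded by alpha everywhere
print(np.max(np.abs(score)) <= alpha * (1 + 1e-6))
```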
Let Ct{m) and c*(7n) denote the s) as 0(m) < maxi<j<pCT|c*(m + follows that and 1/2, 00 under the side condition on bound on 1 > their population counterparts. bounded from below and 0(M) is to one. Thus, the Proof of > = Lemma 3 restricted eigenvalue ki follows from that as n gxows from s) It Thus, the bound on 1/3, £2 — 1/4) minimum and maximum m-sparse associated with E„[a:,xJ] (unnormahzed covariates). s) < 1| least \ — m is then follows from a. References [1] A. BellONI and V. Chernozhukov (2009). Ci-Penalized Quantile Regression for High Dimensional Sparse Models, arXiv:0904.2931v3 [math. ST]. [2] P. J. Ann. BiCKEL, Y. RiTOV AND A. B. TsYBAKOV (2009). Simultaneous analysis of Lasso and Dantzig Statist. Volume 37, Number 4 (2009), 1705-1732. selector, ALEXANDRE BELLONI AND VICTOR CHERNOZHUKOV 38 [3] BuNEA, A. B. TSYBAKOV, AND M. H. Wegkamp(2006). Aggregation and Sparsity F. (COLT Least Squares, in Proceedings of 19th Annual Conference on Learning Theoi'y via h Penalized 2006) (G. Lugosi and H. U. Simon, eds.). Lecture Notes in Artificial Intelligence 4005 379-391. Springer, Berlin. [4] [5] No. Statistics, Vol. 35, n. [7] Canoes and E. Ann. D. L. Statist. Tao T. Wegkamp Volume DONOHO AND J. 35, (2007). Sparsity oracle inequalities for the Lasso, Elec- 169-194. 1, The Dantzig (2007). Number selector: statistical estimation [9] R. J. Dudley Fan and is much larger than M. Johnstone (1994). Ideal spatial adaptation by wavelet shrinkage, Biometrika '_ Uniform Cental Limit Theorems, Cambridge Studies (2000). J. when p 2313-2351. 6, 1994 81(3):425-455. [8] The (2007). Aggregation for Gaussian regression, 1674-1697. 4, BuNEA, A. TsYBAKOV, AND M. H. F. tronic Journal of Statistics, Vol. [6] Wegkamp BuNEA, A. B. TsYBAKOV, AND M. H. F. Annals of advanced mathematics. in Lv. (2008). Sure Independence Screening for Ultra-High Dimensional Feature Space, Journal of the Royal Statistical Society. Series B, vol. 70 (5), pp. 849-911. [10] V. KOLTCHINSKII (2009). Sparsity Statist. [11] Volume Number 45, 7-57. K. LOUNICI (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. [12] 1, penalized empirical risk minimization, Ann. Inst. H. Poincar Probab. in Volume J. Statist. 2, 90-102. K. LouNici, M. PoNTiL, A, B. Tsybakov, and S. van de Geer (2009). Taking Advantage of Sparsity in Multi-Task Learning, arXiv:0903.1468vl [stat.ML]. [13] N. Meinshausen and B. Yu Lasso-type recovery of sparse representations (2009). for high-dimensional data. Annals of Statistics, vol. 37(1), 2246-2270. [14] P. RiGOLLET and A. B. TSYBAKOV (2010). Exponential Screening and optimal rates of sparse estimation, ArXiv arXiv:1003.2654. [15] M. ROSENBAUM and A. B. TSYBAKOV (2008). Sparse recovery under matrix uncertainty, arXiv:0812.2818vl [math. ST]. [16] R. Tibshirani (1996). Regression shrinkage and selection via the Lasso. Roy. Statist. See. Ser. J. B 58 267-288. [17] S. A. VAN de Vol. 36, No. [18] A. 2, Geer High-dimensional generalized linear models and the (2008). lasso, Annals of Statistics, 614-645. W. VAN DER Vaart AND J. A. Wellner (1996). Weak Convergence and Empirical Processes, Springer Series in Statistics. [19] M. Wainwright (2009). Sharp thresholds for noisy constrained quadratic programming (Lasso) , IEEE and high-dimensional recovery of sparsity using (.\- Transactions on Information Theory, 55:2183-2202, May. [20] C.-H. Zhang and regression. [21] P. Ann. Zhao and J. Huang Statist. B. (nov), 2541-2567. Yu (2008). Volume (2006). 
36, The sparsity Number On Model 4, and bias of the lasso selection in high-dimensional linear 1567-1594. Selection Consistency of Lasso. J. Machine Learning Research, 7