Massachusetts Institute of Technology
Department of Economics Working Paper Series

ℓ1-PENALIZED QUANTILE REGRESSION IN HIGH-DIMENSIONAL SPARSE MODELS

Alexandre Belloni
Victor Chernozhukov

Working Paper 09-1
April 24, 2009

Room E52-251
50 Memorial Drive
Cambridge, MA 02142

This paper can be downloaded without charge from the Social Science Research Network Paper Collection at http://ssrn.com/abstract=1394734

ℓ1-PENALIZED QUANTILE REGRESSION IN HIGH-DIMENSIONAL SPARSE MODELS

By Alexandre Belloni and Victor Chernozhukov*
Duke University and Massachusetts Institute of Technology

We consider median regression and, more generally, quantile regression in high-dimensional sparse models. In these models the overall number p of regressors is very large, possibly larger than the sample size n, but only s of these regressors have non-zero impact on the conditional quantile of the response variable, where s grows slower than n. Since in this case the ordinary quantile regression is not consistent, we consider quantile regression penalized by the ℓ1-norm of coefficients (ℓ1-QR). First, we show that ℓ1-QR is consistent at the rate √(s/n)·√(log p), which is close to the oracle rate √(s/n), achievable when the minimal true model is known. The overall number of regressors p affects the rate only through the log p factor, thus allowing nearly exponential growth in the number of zero-impact regressors. The rate result holds under relatively weak conditions, requiring that s/n converges to zero at a super-logarithmic speed and that the regularization parameter satisfies certain theoretical constraints. Second, we propose a pivotal, data-driven choice of the regularization parameter and show that it satisfies these theoretical constraints. Third, we show that ℓ1-QR correctly selects the true minimal model as a valid submodel, when the non-zero coefficients of the true model are well separated from zero. We also show that the number of non-zero coefficients in ℓ1-QR is of the same stochastic order as s, the number of non-zero coefficients in the minimal true model. Fourth, we analyze the rate of convergence of a two-step estimator that applies ordinary quantile regression to the selected model. Fifth, we evaluate the performance of ℓ1-QR in a Monte-Carlo experiment, and provide an application to the analysis of international economic growth.

*First version: December, 2007. This version: April 19, 2009. The authors gratefully acknowledge the research support from the National Science Foundation.
AMS 2000 subject classifications: Primary 62H12, 62J99; secondary 62J07.
Keywords and phrases: median regression, quantile regression, sparse models.

1. Introduction. Quantile regression is an important
statistical method for analyzing the impact of regressors on the conditional distribution of a response variable (cf. [22]). It captures the heterogeneity of the impact of regressors on different parts of the distribution and exhibits robustness to outliers ([7], [9]), has excellent computational properties [29], and has a wide applicability [19]. The asymptotic theory for ordinary quantile regression is well-developed under both a fixed number of regressors and an increasing number of regressors. The asymptotic theory under a fixed number of regressors is given in Koenker and Bassett [20], Gutenbrunner and Jureckova [14], Knight [17], and Portnoy [28], among others. The asymptotic theory under an increasing number of regressors is given in He and Shao and in Belloni and Chernozhukov [4, 5], covering the case where the number of regressors is negligible relative to the sample size (p = o(n)).

In this paper, we consider quantile regression in high-dimensional sparse models (HDSMs). In such models, the overall number of regressors p is very large, possibly much larger than the sample size n. However, the number s of significant regressors — those having a non-zero impact on the response variable — is much smaller than the sample size, that is, s = o(n). HDSMs ([8], [26]) have emerged to deal with many new applications, arising in biometrics, signal processing, machine learning, econometrics, and other areas of data analysis, where high-dimensional data sets have become widely available.

A number of papers began to investigate estimation of HDSMs, primarily focusing on penalized mean regression, with the ℓ1-norm acting as a penalty function. Candes and Tao [8] demonstrated that, remarkably, an estimator called the Dantzig selector achieves the rate √(s/n)·√(log p), which is very close to the oracle rate √(s/n) obtainable when the true model is known. Thus the estimator can be consistent even under very rapid, nearly exponential growth in the total number of regressors p. Meinshausen and Yu [26] demonstrated similar striking results for the ℓ1-penalized least squares estimator proposed by Tibshirani [35]. van der Geer [37] derived valuable finite sample bounds on the empirical risk for ℓ1-penalized estimators in generalized linear models. Fan and Lv [11] used correlation-based screening, and Zhang and Huang [39] derived asymptotic results under even weaker conditions on the number of significant regressors. There were other interesting developments, which we shall not review here.

Our paper's contribution is to develop, within the HDSM framework, a set of results on model selection and rates of convergence for quantile regression. Since in this setting the ordinary quantile regression is not consistent, we consider quantile regression penalized by the ℓ1-norm of coefficients. First, we show that the ℓ1-penalized quantile regression is consistent at the rate √(s/n)·√(log p), which is close to the oracle rate √(s/n) achievable when the true minimal model is known. In order to make the penalized estimator practical, we propose a pivotal, data-driven choice of the regularization parameter, and show that this choice leads to the same sharp convergence rate. Further, we show that the penalized quantile regression correctly selects the true minimal model as a valid submodel, when the non-zero coefficients of the true model are well separated from zero.
We also analyze a two-step estimator that applies standard quantile regression to the selected model and aims at reducing the bias of the penalized quantile estimator. We illustrate the use of penalized and post-penalized estimators with a Monte Carlo experiment and an international economic growth example. Thus, our results contribute to the literature on HDSMs by examining a new class of problems. Moreover, our proof strategy, developed to cope with non-linearities and non-smoothness of quantile regression, may be of interest in other M-estimation problems. (We provide more detailed comparisons to the literature in Section 2.)

Finally, let us comment on the role of computational considerations in our analysis. The choice of the ℓ1-penalty function arises from considering a tradeoff between statistical efficiency and computational efficiency, with the latter being of particular importance in high-dimensional applications. Indeed, in model selection problems, the statistical efficiency criterion favors the use of ℓ0-penalty functions (Akaike [1] and Schwarz [32]), where the ℓ0-penalty counts the number of non-zero components of a parameter vector. However, the computational efficiency criterion favors the use of convex penalty functions: convex penalty functions lead to efficient convex programming problems ([27]); in contrast, ℓ0-penalty functions lead to inefficient combinatorial search problems, plagued by the computational curse of dimensionality. Precisely because it is a convex function that is closest to the ℓ0-penalty (e.g. [30]), the ℓ1-penalty has emerged to play a central role in HDSMs in general (e.g. [25]), and in our analysis in particular. In other words, the use of the ℓ1-penalty takes us close to performing the most effective model selection while respecting the computational efficiency constraint.

We organize the rest of the paper as follows. In Section 2, we introduce the problem under some simple primitive assumptions D.1-D.4, propose pivotal choices for the regularization parameter, describe our key results under D.1-D.4, and provide detailed comparisons with the literature. In Section 3, we develop the main results on convergence rates and model selection under conditions E.1-E.5, which are implied by D.1-D.4, while also holding much more generally. In Section 4, we analyze the pivotal choice of the penalization parameter. In Section 5, we carry out a computational experiment and provide an application to an international growth example. In Section 6, we provide conclusions and discuss possible extensions. In Appendix A, we verify that conditions E.1-E.5 are implied by conditions D.1-D.4 and also hold more generally.

1.1. Notation. In what follows, we implicitly index all parameter values by the sample size n, but we omit the index whenever this does not cause confusion. We use the notation a ∨ b = max{a, b} and a ∧ b = min{a, b}. We use the notation a ≲ b to denote a ≤ cb for all sufficiently large n, for some constant c > 0 that does not depend on n, and a ≲_P b to denote a = O_P(b). We also use the notation a ~ b to denote a ≲ b and b ≲ a, and a ~_P b to denote a ≲_P b and b ≲_P a. We denote the ℓ2-norm by ‖·‖, the ℓ1-norm by ‖·‖₁, the ℓ∞-norm by ‖·‖∞, and the ℓ0-"norm" by ‖·‖₀ (the number of non-zero components of a vector).

2. Basic Settings, the Estimator, and Overview of Results. In this section, we formulate the setting, define the estimator, state primitive regularity conditions, and provide an overview of the main results.

2.1. The Basic Setting. We formulate
the asymptotic analysis as follows. The set-up of interest corresponds to a parametric quantile regression model, where the dimension p of the underlying model increases with the sample size n. Namely, we consider a response variable y and p-dimensional covariates x such that the u-th conditional quantile function of y given x is given by

(2.1)    Q_{y|x}(u) = x'β(u),    β(u) ∈ ℝ^p.

We consider the case where the dimension p of the model is large, possibly much larger than the available sample size n, but the true model β(u) is sparse, having only s = s(u) < p non-zero components. Throughout the paper the quantile index u ∈ (0,1) is fixed. The population coefficient β(u) is known to be a minimizer of the criterion function

(2.2)    Q_u(β) = E[ρ_u(y − x'β)],

where ρ_u(t) = (u − 1{t ≤ 0})t is the asymmetric absolute deviation function [20]. Given a random sample (y₁, x₁), ..., (y_n, x_n), the quantile regression estimator β̂(u) of β(u) is defined as a minimizer of

(2.3)    Q̂_u(β) = E_n[ρ_u(y_i − x_i'β)],

where E_n[f(y_i, x_i)] := n⁻¹ Σ_{i=1}^n f(y_i, x_i) denotes the empirical expectation of a function f in the given sample.

In the high-dimensional settings, particularly when p ≥ n, quantile regression is generally not consistent, which motivates the use of penalization in order to remove all, or at least nearly all, regressors whose population coefficients are zero, thereby possibly restoring consistency. A penalization that has been proven to be quite useful in least squares settings is the ℓ1-penalty leading to the lasso estimator [35].

2.2. The Choice of Estimator, Linear Programming Formulation, and Its Dual. The ℓ1-penalized quantile regression estimator β̂(u) is a solution to the following optimization problem:

(2.4)    min_{β ∈ ℝ^p}  Q̂_u(β) + (λ/n) Σ_{j=1}^p |β_j|.

When the solution is not unique, we define β̂(u) as a basic solution having the minimal number of non-zero components. The criterion function in (2.4) is the sum of the criterion function (2.3) and a penalty function given by a scaled ℓ1-norm of the parameter vector. This ℓ1-penalized quantile regression or quantile regression lasso has been considered by Knight and Fu [18] under the small (fixed) p asymptotics.

For computational purposes, it is important to note that the penalized quantile regression problem (2.4) is equivalent to the following linear programming problem:

(2.5)    min_{ξ⁺, ξ⁻, β⁺, β⁻ ∈ ℝ₊^{2n+2p}}  E_n[u ξ_i⁺ + (1−u) ξ_i⁻] + (λ/n) Σ_{j=1}^p (β_j⁺ + β_j⁻)
         s.t.  ξ_i⁺ − ξ_i⁻ = y_i − x_i'(β⁺ − β⁻),  i = 1, ..., n.

The problem minimizes a sum of an average of asymmetrically weighted positive ξ_i⁺ and negative ξ_i⁻ parts of the residuals and of the ℓ1-norm of the parameter, where β_j = β_j⁺ − β_j⁻. The linear programming formulation is useful for computation of the estimator, particularly in high-dimensional applications. There are a number of efficient, polynomial time, algorithms for the linear programming problem (2.5). Using these algorithms, one can compute the estimator (2.4) efficiently, avoiding the computational curse of dimensionality.

Furthermore, for both computational and theoretical purposes, it is important to note that the primal problem (2.5) has the following dual problem:

(2.6)    max_a  E_n[y_i a_i]
         s.t.  |E_n[x_{ij} a_i]| ≤ λ/n,  j = 1, ..., p;    (u − 1) ≤ a_i ≤ u,  i = 1, ..., n.

The dual problem maximizes the correlation between the response variable and the rank scores, subject to the condition requiring the rank scores to be approximately uncorrelated with the regressors. A computational sketch of the primal problem (2.5) is given below.
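For illustration, the following minimal sketch sets up the primal linear program (2.5) and solves it with an off-the-shelf LP solver. The function name l1_qr and its interface are ours, not part of the paper; any polynomial-time LP solver could replace scipy here.

    import numpy as np
    from scipy.optimize import linprog

    def l1_qr(y, X, u, lam):
        # Solve the primal LP (2.5). Decision vector:
        # [xi_plus (n), xi_minus (n), beta_plus (p), beta_minus (p)], all >= 0.
        n, p = X.shape
        c = np.concatenate([
            u * np.ones(n) / n,          # E_n[u * xi_plus]
            (1 - u) * np.ones(n) / n,    # E_n[(1 - u) * xi_minus]
            (lam / n) * np.ones(2 * p),  # (lam/n) * sum_j (beta_plus_j + beta_minus_j)
        ])
        # Equality constraints: xi_plus - xi_minus + X (beta_plus - beta_minus) = y
        A_eq = np.hstack([np.eye(n), -np.eye(n), X, -X])
        sol = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
        return sol.x[2 * n:2 * n + p] - sol.x[2 * n + p:]  # beta = beta+ - beta-

Note that setting lam = 0 recovers the ordinary quantile regression estimator (2.3), a fact we reuse later for the two-step estimator.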
The approximate lack of correlation required in (2.6) is a reasonable condition, since the true rank scores, defined as

a_i*(u) = (u − 1{y_i ≤ x_i'β(u)}),

should in fact be independent of the regressors x_i. This follows because by (2.1) the event {y_i ≤ x_i'β(u)} is equivalent to the event {u_i ≤ u}, for a standard uniformly distributed variable u_i which is independent of x_i.

Since both primal and dual problems are feasible, by strong duality for linear programming the optimal value of (2.6) equals the optimal value of (2.4) (see, for example, Bertsimas and Tsitsiklis [6]). The optimal solution to the dual problem plays an important role in our analysis, helping us control the sparseness of the penalized estimator β̂(u) as well as choose the penalization parameter λ. Of course, the optimal solution to the dual problem also plays an important role in the non-penalized case, with λ = 0, yielding the regression generalization of Hajek-Sidak rank scores (Gutenbrunner and Jureckova [14]).

Another potential approach worth considering is the Dantzig selector approach of Candes and Tao [8], proposed in the context of mean regression. We can extend this approach to quantile regression by defining the estimator as a solution to the following problem:

(2.7)    min_β  Σ_{j=1}^p |β_j|    s.t.  ‖Ŝ_u(β)‖∞ ≤ λ/n,

where λ is a penalization parameter, and Ŝ_u is a subgradient of the quantile regression objective function Q̂_u(β):

(2.8)    Ŝ_u(β) = E_n[(1{y_i ≤ x_i'β} − u) x_i].

The estimator (2.7) minimizes the ℓ1-norm of the coefficients subject to a goodness-of-fit constraint. On computational grounds, we prefer the ℓ1-penalized estimator (2.4) to the Dantzig selector estimator (2.7). The reason is that the subgradient Ŝ_u in (2.8) is a piece-wise constant function of the parameters, leading to a serious difficulty in computing the estimator (2.7). In particular, the problem (2.7) can be recast as a mixed integer programming problem with n binary variables, for which (generally) there is no known polynomial time algorithm. (In sharp contrast, in the mean regression case the subgradient corresponding to the objective function Q̂(β) = E_n[(y_i − x_i'β)²]/2 is a linear function, Ŝ(β) = E_n[(y_i − x_i'β)x_i]. Accordingly, in the mean regression case, the optimization problem can be recast as a linear programming problem, for which there are polynomial time algorithms.)

Another idea for formulating a Dantzig type estimator for quantile regression would be to minimize the ℓ1-norm of the coefficients subject to a convex goodness-of-fit constraint, namely

(2.9)    min_{β ∈ ℝ^p}  Σ_{j=1}^p |β_j|    s.t.  Q̂_u(β) ≤ γ.

Since the constraint set {β : Q̂_u(β) ≤ γ} is piece-wise linear and convex, this problem is equivalent to a linear programming problem. Of course, this is hardly a surprise, since this problem is equivalent to an ℓ1-penalized quantile regression problem (2.4) that we started with in the first place. Indeed, for every feasible choice of γ in (2.9) there is a choice of λ that makes the solutions to (2.9) and to (2.4) identical. To see this, fix a γ for the constraint Q̂_u(β) ≤ γ, and let κ denote the optimal value of the Lagrange multiplier for the inequality; then the problem min_{β ∈ ℝ^p} Σ_{j=1}^p |β_j| + κ(Q̂_u(β) − γ) is equivalent to the original problem (2.4) with λ = n/κ. Therefore it suffices to focus our analysis on the original problem.
2.3. The Choice of the Regularization Parameter. Here we propose a pivotal, data-driven choice for the regularization parameter value λ. We shall verify in Section 4 that this choice agrees with our theoretical choice of λ maximizing the speed of convergence of the penalized estimate to the true parameter value.

Because the objective function in the primal problem (2.4) is not pivotal in either small or large samples, finding a pivotal λ appears to be difficult a priori. However, instead of looking at the primal problem, let us look at its linear programming dual (2.6), which requires that

(2.10)    |E_n[x_{ij} a_i]| ≤ λ/n for all j = 1, ..., p    ⟺    ‖E_n[x_i a_i]‖∞ ≤ λ/n.

This restriction requires that potential rank scores must be approximately uncorrelated with the regressors. It then makes sense to select λ so that the true rank scores

a_i*(u) = (u − 1{y_i ≤ x_i'β(u)}),    i = 1, ..., n,

satisfy this constraint. That is, we can potentially set λ = Λ_n, where

(2.11)    Λ_n = n ‖E_n[x_i a_i*(u)]‖∞.

Of course, since we do not observe the true rank scores, this choice is not available to us. The key observation is that the finite sample distribution of Λ_n is pivotal conditional on the regressors x₁, ..., x_n. We know that the rank scores can be represented almost surely as a_i*(u) = (u − 1{u_i ≤ u}), for i = 1, ..., n, where u₁, ..., u_n are i.i.d. uniform (0,1) random variables, independently distributed from the regressors x₁, ..., x_n. Thus, we have

(2.12)    Λ_n = n ‖E_n[x_i (u − 1{u_i ≤ u})]‖∞,

which has a known distribution conditional on x₁, ..., x_n. Therefore we can use the quantiles of Λ_n as our choice for λ. In particular, we set λ = Λ(x₁, ..., x_n) as the 1 − α_n quantile of Λ_n,

(2.13)    Λ = inf{c : P(Λ_n ≤ c | x₁, ..., x_n) ≥ 1 − α_n},

where α_n → 0 at some rate to be determined below. Finally, let us note that we can also derive the pivotal quantity Λ_n, and thus also our choice of the regularization parameter λ, from the subgradient characterization of optimality for the primal problem (2.4). A simulation sketch of this choice is given below.
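Because Λ_n in (2.12) depends only on the design and on i.i.d. uniform draws, the quantile in (2.13) can be computed by direct simulation. The sketch below is ours (the name pivotal_lambda and the simulation size are illustrative); the Monte Carlo study in Section 5 uses the 0.9-quantile, corresponding to alpha = 0.1.

    import numpy as np

    def pivotal_lambda(X, u, alpha=0.1, n_sim=10000, seed=0):
        # Simulate Lambda_n = n * ||E_n[x_i (u - 1{u_i <= u})]||_inf conditional on X,
        # and return its (1 - alpha) quantile, as in (2.13).
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        draws = np.empty(n_sim)
        for k in range(n_sim):
            a = u - (rng.uniform(size=n) <= u)    # simulated rank scores a_i*(u)
            draws[k] = np.max(np.abs(X.T @ a))    # equals n * ||E_n[x_i a_i]||_inf
        return np.quantile(draws, 1 - alpha)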
2.4. Primitive Conditions. We follow Huber's [16] framework of high-dimensional parameters, which formally consists of a sequence of models with parameter dimension p = p_n tending to infinity as the sample size n grows to infinity. Thus, the parameters of the models, the parameter space, and the parameter dimension are all indexed by the sample size n. However, following Huber's convention, we will omit the index n whenever this does not cause confusion. Let us consider the following set of conditions:

D.1. Sampling. Data (y_i, x_i'), i = 1, ..., n, are an i.i.d. sequence of real (1+p)-vectors, with the conditional u-quantile function given by (2.1), and with the first component of x_i equal to one.

D.2. Sparseness of the True Model. The number of non-zero components of β(u) is bounded by 1 ≤ s = s_n < n/log(n ∨ p), uniformly in n.

D.3. Smooth Conditional Density. The conditional density f_{y_i|x_i}(y|x) and its derivative in y are uniformly bounded above, uniformly in y and x ranging over the supports of y_i and x_i, and uniformly in n.

D.4. Identifiability in Population and Well-Behaved Regressors. Eigenvalues of the population design matrix E[x_i x_i'] are bounded above and away from zero, and sup_{‖α‖≤1} E[|x_i'α|³] is bounded above, uniformly in n. The conditional density evaluated at the conditional quantile, f_{y_i|x_i}(x'β(u)|x), is bounded away from zero, uniformly in x ranging over the support of x_i, and uniformly in n.

The conditions D.1-D.4 stated above are a set of simple conditions that ensure that the high-level conditions developed in Section 3 hold. These conditions allow us to demonstrate the general applicability of our results and to compare straightforwardly to other results in the literature. In particular, condition D.1 imposes random sampling on the data, which is a conventional assumption in asymptotic statistics (e.g. [38]). Condition D.2 requires that the effective dimension of the true model is smaller than the sample size. Condition D.3 imposes some smoothness on the conditional distribution of the response variable. Condition D.4 requires the population design matrix to be uniformly non-singular and the regressors' moments to be well-behaved.

Further, let φ(k) be the maximal k-sparse eigenvalue of the empirical design matrix E_n[x_i x_i'], that is,

(2.14)    φ(k) = sup_{‖α‖≤1, ‖α‖₀≤k} E_n[(α'x_i)²].

Following Meinshausen and Yu [26], we will state our general results on convergence rates of the penalized estimator in terms of the maximal sparse eigenvalue φ(m₀), where m₀ is an initial upper bound on the zero norm of the penalized estimator. Meinshausen and Yu [26] worked with m₀ = n ∧ p. In this paper we can work with a smaller m₀; in particular, under D.1-D.4, we can work with

m₀ = p ∧ (n/log(n ∨ p)),

as this provides a valid initial bound on the zero norm of our penalized estimator under a suitable choice of the penalization parameter.

By using an assumption on the growth rate of φ(k), we avoid imposing Candes and Tao's [8] uniform uncertainty principle on the empirical design matrix E_n[x_i x_i']. Meinshausen and Yu [26] argue that assumptions in terms of φ(k) are less stringent than the uniform uncertainty principle, since they allow for non-vanishing correlation between the regressors. Meinshausen and Yu [26] provide a thorough discussion of the behavior of φ(n) in several cases of interest. In particular, they show that the condition φ(n) ≲_P 1 appears reasonable in many cases (for example, when the empirical design matrix is block diagonal). Note that if the intercept is included as a covariate we have φ(1) ≥ 1. For the purposes of a basic overview of results in the next subsection,
Moreover, the maximal A'-sparse eigenvalues satisfy - (/)(/c) :— E„ sup (q't,)' ~j, 1 k\ogp + =l,i|a||o<A- by Lemma as shown in 2 (Correlated our conditions with (t){n/ log(nVp)) ~p Candes and Tao's uniform uncertainty Normal Design). We consider the = D.2 if ||,/3o||o p''~-'l and — 1 < p < < s = mentioned p)). This example The maximal Thus, this design satisfies in [2G] this design violates which requires \p\ ^ at logp rate. satisfies ~ i±M 1 - IpI 1 \ Moreover, as in Example A^(0, E), where conditions D.l, D.3. D.4, and fc-sparse eigenvalues for k sup E„ [(a'.,f ] -, ^ = l,||al|o<fc J l|a!| 14. fixed. is o(n/log(n V m:= by Lemmas 1 1. principle. same setup but instead we suppose that the covariates are correlated, namely x^i E,j as satisfies this design satisfies [8], Example 1, Thus, this design 14. < n satisfy + /^Li^ \ n j our conditions with 4>{n/ log(nVp)) ~p 1. Candes and Tao's uniform uncertainty However, principle, PENALIZED QUANTILE REGRESSION Finally, it is worth noting that our analysis in Sections 3 and 4, 11 and in Appendix A allows the key parameters of the model, such as the bounds on the eigenvalues of the design matrix and on the density function, to change with the sample to trace out the impact of these parameters we estimator. In particular, size. This will explicitly allow us on the large sample behavior of the penalized be able to immediately see how some basic changes in will the primitive conditions stated above affect the large sample behavior of the penalized estimator. 2.5. Overview of Mam Results. Here we discuss our results under simplest assumptions, and condition consisting of conditions D.1-D.4 (2,15) on the maximal (n/log(n Vp))-sparse eigenvalue. These simplest assumptions allow us to straightforwardly compare our those obtained in the literature, without getting into nuisance details. under more general conditions results subsequent sections: in Section in the on convergence rates and model selection; in Section 4, We 3, results to state our results we present various we analyze our choice of the penalization parameter. In order to achieve the most rapid rate of convergence, with t Our growing as slowly as possible with first main result is to choose = i^nlog(nVp) A (2.16) we need n; for concreteness, let t oc log log n. that the ^j-penalized quantile regression estimator converges at the rate: (2.17) - ;, mu)-(3iu)\\ , S ^ = ^.t provided that the number of non-zero components (2.18) =• -: ,.;. '- s satisfies n * - , '" - \ ' ' ' , ,. ^ ' note that the total number of regressors p affects the rate of convergence (2.17) only through a logarithm t 0og(nVp), J--t'J\og{nVp)^0. - \ We . in p. \/log(n Vp), which is Hence if p is polynomial in n, the rate of very close to the oracle rate \/s/n, obtainable minimal true model. Further, we note that our resulting s of the minimal true model is very weak; can be of almost the same order as Our second main penalized estimator result is convergence is of the n, when p namely s = is is sjs/n when we know the restriction (2.18) on the dimension polynomial in n and t a log log n, s o(n/(i^logn))). that the dimension ||5(u)||o of the model selected by the ii- same stochastic order as the dimension s of the minimal true "• 12., .^ . Further, zero, if 7 ; . '" V. ;;: .''',- the parameter values of the minimal true model are well separated away from ' ' ' - -,,_„,. ' ~ min '. some diverging sequence for -, . ; mu)\\o<ps. \ , namely (2.20) - '. 
model, namely (2.19) BELLONI AND CHERNOZHUKOV ,' , |/3,(u)| > ^ J- i of positive constants, • t 0og(n • Vp), then with probability converging to one, the model selected by the £i-penahzed estimator correctly nests the true minimal model: support (2.21) C support (/3(u)) {0iu)). Moreover, we provide conditions under which a hard-thresholding selects the correct support. Our third main result is that a two-step estimator, which applies standard quantile re- gression to the selected model, achieves a similar rate of convergence; ^-^log(nvp)^0, y-\/log("Vp) (2.22) provided the true non-zero coefficients are well-separated from zero in the sense of equation ' (2.20). ~ Finally, our fourth larization result is to propose (2.13), a data-driven choice of the regu- parameter A which has a pivotal regressors, its main and finite sample distribution conditional on the to verify that (2.13) satisfies the theoretical restriction (2.16), supporting use in practical estimation. Our results for quantile regression parallel the results for least squares and Yu [26] and by Candes and Tao [8]. Our results on the pivotal choice of the regularization parameter partly parallel the results by Candes and Tao whereas Candes and Tao's choice the regression disturbances. The relies is [8], except that our choice upon the knowledge existence of close parallels contrast to the least squares problem, our problem Nevertheless, there by Meinshausen is is pivotal of the standard deviation of may seem surprising, since, in highly non-linear and non-smooth. an intuition presented below, suggesting that we can overcome these difficulties. While our egy is results for quantile regression parallel results for least squares, our proof strat- substantially different, as it has to address non-linearities and non-smoothness. In PENALIZED QUANTILE REGRESSION order to explain the difference, Yu [26]. They first let us recall, 13 and the proof strategy of Meinshausen e.g., analyze the problem with no disturbances, recognize sparseness of the solution for this zero noise problem, and then analyze a sequence of problems along the path interpolating the zero-noise problem and the full-noise problem. Along this sequence, they bound the increments in the number of non-zero components and in the rates of convergence. This approach does not seem to work for our problem, where the zero-noise problem does not seem to have either the required sparseness or the required smoothness. In sharp contrast, our approach directly focuses on the full-noise problem, and simultaneously bounds the number of non-zero components and convergence be of independent interest rates. M-estimation problems and even for other may Thus, our approach for the least squares problem. Our analysis is perhaps closer work of van der Geer [37] but in spirit to, which derived finite still quite different from, the important sample bounds on the empirical risk of ii- penalized estimators in generalized linear models (but did not investigate quantile regression models). is that The major and van der Geer difference between our proof we analyze the sparseness of the solution to the penalized [37] 's proof strategies problem and then further exploit sparseness to control empirical errors in the sample criterion function. 
derive not only the results on prime model selection and on sparseness of As a result, solutions, which are of a but also the results on the consistency and rates of convergence under weak interest, conditions on the number components of non-zero allows s to be of almost the same order as the s. As mentioned above, our approach sample size n, and delivers convergence rates that are close to a/s/tz. In contrast, van der Geer's [37] approach requires s to much smaller than n, namely s'^/n convergence when s'^/n In our proofs m = \\(3{u)\\o we — > 0, all and thus does not be deliver consistency or rates of ' , : . . . , :- on two key quantities: the number of non-zero components of the solution /3{u) m > . critically rely we make use — oo. QuiP) (with P ranging over In particular, we and the empirical error in the sample criterion, Qu{P) — m-dimensional submodels of the large p-dimensional model). of the following relations: - . implies smaller empirical error, and (1) lower (2) smaller empirical error can imply lower m. Starting with a structural initial upper bound on the two relations to solve for sharp bounds on' m m (see condition E.3 below) and the empirical we can use error, given which we then can solve for convergence rates. Let us comment on the intuition behind relations (1) and (2). Relation (1) follows from BELLONI AND CHERNOZHUKOV 14 an application of the usual entropy-based maximal tropy of all m dimensional models grows the closer the sample criterion function m-dimensional submodels. Relation all to favor lower-dimensional solutions inequalities, at the rate Q^ mlogp. upon realizing that the en- In particular, the lower the m., to a locally quadratic function, uniformly across (2) follows when Qu is from the use of ^i-penalty, which tends close to being quadratic. Figure 1 provides a visual illustration of this, using a two-dimensional example with a one-dimensional true minimal submodel; ure in the example, the true parameter value {Pi{u), (32{u)) is plots a diamond, centered at the origin, representing a contour set of the 1 (1,0). Fig^i -penalty By the dual interpretation (2.9) of our estimation problem, the penalized estimator looks for a minimal function and a pearl, representing a contour set of the criterion function Q^. diamond, subject to the diamond having a non-empty intersection with a fixed set of optimal solutions is pearl. Smaller empirical errors shape the pearl into an parameter value of (1,0) panel of Figure (left a non-ellipse and can center like Figure 1). pearl. The then given by the intersection of the minimal diamond with the it far 1). ellipse and center it closer to the true Larger empirical errors shape the pearl away from the true parameter value (right panel of Therefore, smaller empirical errors tend to cause sparse optimal solutions, cor- = rectly setting P2{u) 0; larger incorrectly setting /?2(w) empirical errors tend to cause non-sparse optimal solutions, 7^ 0. \ /52' 132' J / \ \ Fig 1 . /J{") //3{«) ' 01 "^->*,«i!«.1-'""'' / \/ (iiu) These figures provide a geometric illustration for the discussion given in the Ci-penalized estimation may be (left panel) or may text Pi concerning why not be (right panel) successful at selecting the minimal true model. 3. Analysis and we prove the main Main Results Under High-Level results Conditions. In this section under general conditions that encompass the simple conditions PENALIZED QUANTILE REGRESSION 15 D.1-D.4 as a special case. 3.1. We The Five Basic Conditions. 
work with the following will five basic conditions E.1-E.5 which are the essential ingredients needed for our asymptotic approximations. In Appendix A, we D.4 stated in Section 2, Appendix particular, in under simple verify that conditions E.1-E.5 hold and we A we also show that E.1-E.5 arise sufficient conditions much more D.l- generally. In characterize key constants appearing in E.1-E.5 in terms of the parameters of the model. E.l. True Model Sparseness. The true parameter value has at most s l3{u) < n/ log(n V p) non-zero components, namely (3.1) ||/3(u)||o = s<n/log(nVp). E.2. Identification in Population. In the population, the true parameter value /3(u) the is unique solution to the quantile objective function. Moreover, the following minorization condition holds, QuiP) - QuWiu)) > (3.2) uniformly g is in /3 G ]R^, where g : q (||/3 R+ — R+ » - Piu)f A g{\\P - /3(u)||)) , a fixed convex function with ^'(0) is > 0, and a sequence of positive numbers that characterizes the strength of identification in the population. E.3. Empirical Pre-Sparseness. The number m= of the solution to the penalized quantile regression m<77,ApA (3.3) problem ), A" where (/>(m) is the maximal m-sparse E.4- Em.pirical Sparseness. For r tic ,„ components of P{u) (2.4) obej''s the inequality - \ - ., "' . eigenvalue. ||/?('u)— /3(u)||, m= ||^(u)||o obeys the following stochas- inequality ,, where — ^ n , , — v'nlog(n V p)(j)(m) ^'^^ ^^ Vm <p H- {r A I) + x^y , (3.4) by = of non-zero ||/3(u)||o /i > g is . a sequence of positive constants. The sequence ' of constants the population analog of the empirical sparse eigenvalue (f){mo) E.5. Sparse Control of Empirical Error. The (cf. ^ is determined Appendix A). empirical error that describes the deviation of the empirical criterion from the population criterion satisfies the following stochastic — 16 • -. ' BELLONI AND CHERNOZHUKOV . inequality <^/3'".^^ ^ ""' - '^ /^ /^/..^^ lOy^' <p {QuWiu)) - Qumu)))\ UP) - QuiP) \ (35) uniformly over G M'' {/? Let us briefly : ||/3||o < m A n A p, comment on each ||/? — i /3(u)|| is ^ < log(n V p)(^(m / r}, uniformly over As stated of the conditions. basic modeling assumption, and condition E.2 s) r^^— ^. earlier, + s) m < n, r > condition E.l 0. is a an identification assumption, required to hold in population. Conditions E.3 and E.4 arise from two characterizations of sparseness of the solution to the optimization problem (2.4) defining the estimator. Condition E.3 arises from simple bounds applied to the mal inequalities applied to the second characterization. Condition E.5 arises appUed inequalities we characterization. Condition E.4 arises from maxi- first to the empirical criterion function. is of order derive conditions E.4 and E.5, m-dimensional submodels of the p- crucially exploit the fact that the entropy of all dimensional model To from maximal mlogp, which depends on p only logarithmically. Finally, we note that Conditions E.1-E.5 easily hold under primitive assumptions D.1-D.4, in particular /i ~ g ~ 1, but we also permit them to hold more generally. 5 for verification Theorem and further analysis refer the reader to Section of these conditions. combines conditions E.1-E.5 to establish bounds on the rate of convergence 1 and sparseness of the estimator Theorem We Assume 1. (2.4). that conditions E.1-E.5 hold. Lett -^p oo he a sequence of positive numbers, possibly data-dependent, define TTiQ (3.6) = p A I ' Then we have (3.7) ;' r~^ :^ log(n W p) fi- J and set X = tJnlog(nV p)4'(mo + * . J s) — . 
Theorem 1. Assume that conditions E.1-E.5 hold. Let t →_P ∞ be a sequence of positive numbers, possibly data-dependent, define

(3.6)    m₀ = p ∧ (n/log(n ∨ p)) (μ/q)²,    and set    λ = t √(n log(n ∨ p) φ(m₀ + s)) (μ/q).

Then we have

(3.7)    ‖β̂(u) − β(u)‖ ≲_P λ√s/(qn) = t √((s/n) log(n ∨ p) φ(m₀ + s)) (μ/q²),

provided that λ√s/(qn) →_P 0, and

(3.8)    ‖β̂(u)‖₀ ≲_P s (μ/q)².

This is the main result of the paper: it derives the rate of convergence of the ℓ1-penalized quantile regression estimator and a stochastic bound on the dimension of the selected model. Our results parallel the results of Meinshausen and Yu [26] obtained for the ℓ1-penalized mean regression. We refer the reader to Section 2 for a detailed discussion of this and other main results of this section under simplified conditions. Here we only note that the rate of convergence generally depends on the number of significant regressors s, the number of regressors p through its logarithm, the strength of identification q, the empirical sparse eigenvalue φ(m₀), and the constant μ, determined by the population sparse eigenvalue. The bound on the dimension of the selected model also depends on the sequence of constants s, q, and μ. It is helpful to state the main result separately under the simple set of conditions D.1-D.4, where q ~ μ ~ 1.

Corollary 1 (A Leading Case). Conditions D.1-D.4 imply conditions E.1-E.5 with q ~ μ ~ 1. Therefore, under D.1-D.4, we have λ ~ t √(n log(n ∨ p) φ(m₀)) with m₀ = p ∧ (n/log(n ∨ p)), so that

‖β̂(u) − β(u)‖ ≲_P t √((s log(n ∨ p) φ(m₀))/n)    and    ‖β̂(u)‖₀ ≲_P s.

If in addition φ(m₀) ≲_P 1, then we obtain the rate result listed in equation (2.17).

This corollary follows from lemmas stated in Appendix A, where we verify that conditions D.1-D.4 imply conditions E.1-E.5. Moreover, we use the facts that s log(n ∨ p) ≤ n for m₀ = p ∧ (n/log(n ∨ p)), and that φ(m₀ + s) ≤ φ(2m₀) ≤ 2φ(m₀) if s ≤ m₀, by Lemma 11.

It is useful to revisit our concrete examples.

Example 3 (Isotropic Normal Design, continued). In the isotropic normal design considered earlier, recall that we have φ(k) ≲_P 1 + √((k/n) log p). If we assume s ≤ n/log(n ∨ p) and λ/√(n log(n ∨ p)) →_P ∞, we have m₀ ≤ n/log(n ∨ p) and, by Lemma 11, φ(m₀ + s) ≲_P 1. Also, we verify in Appendix A that q ~ μ ~ 1. Thus, the rate result listed in equation (2.17) applies to this example.

Example 4 (Correlated Normal Design, continued). In the correlated normal design considered earlier, we have φ(k) ≲_P ((1 + |ρ|)/(1 − |ρ|))(1 + √((k/n) log p)). If we assume s ≤ n/log(n ∨ p) and λ/√(n log(n ∨ p)) →_P ∞, we have m₀ ≤ n/log(n ∨ p) and, by Lemma 11, φ(m₀ + s) ≲_P (1 + |ρ|)/(1 − |ρ|) ≲_P 1. Also, we verify in Appendix A that q ~ μ ~ 1. Thus, the rate result listed in equation (2.17) applies to this example too.
s) the value of mo, and fh — mol in (3.9), f''nlog(nVp) is Step By (p{m.o) /U t- log(n V p) 11 and „„<?" 2 fi- t" > t \/2 bound on we obtain ^ a contradiction. 2. In this step Condition E.l, we obtain some preliminary tire support of has exactly s elements, that is, \Tu\ inequalities. /?(u) T^ := support(/3(u)) := {j S {1, agi'ee Lemma n <l>(moi) q^ n^ which and then using Inserting this — s. . . . ,p} : |/?,(u)| > 0} Let Pt^{u) denote a vector whose r„ components with T^ components of P{u), and whose remaining components are equal to zero. . PENALIZED QUANTILE REGRESSION and since B}' definition of ^(u) Quik^)) - QuiPiu)) < ^ ||^ru(")lli ^(||/?(u)||i - ll/§('")lli 19 we have that mu)h) < - ^(||/3(u)||i ||^T„(^^)||l). Using that Ill/9(ti)lli - mu) < Pt.(u)||i| - PTAu)h < yfil^mAu) - < Vsr P{u)\\ we obtain that Qu0{u)) - QuWiu)) < -v^r. n Applying condition E.5 to control the difference between the sample and population criterion functions, we further get that ^ ^ ss /-PT/ /^/ ^ ^ XN (m + r- n n V Invoking the identification condition E.2 and the definition of /Q ^n^ Step 3. /(m + ^ r ^ ^^ < + ^g[r))<p-^/sr n 2 ( . q{r~ (3.10) In this step ( we show s) log(n V r, we obtain p)(?!>(m r\' + s) n V consistency, + s) s)\og,(n\/ p)d>(m namely r = Op(l). By Step we have 1 with probability converging to one. The construction (3.6) of A, t m. < mo ;, —+p oo, —>p and the condition Xy/s/{qn) assumed in the theorem imply A^s ls\og{nVp)(l>{Tno n Condition {in), fJ. + n \ > q, s) ^n\og(nV p)<i){m.o + n s) A q and empirical sparseness condition E.4, stated in equation (3.4), imply that (3.11) - '-,• v^<p/i(7-Al)n/A + v^Op(l), • which implies the following second bound on m: V^<p^n/X. (3.12) Using (3.12) and ^o ION (3.13) m> l{m> •, r s in equation (3.10) gives / 2 / ^^ s]q[r^ hg[r)) 1 . ^ ^ /- v/n log(n V <pr-^s + r^ ^ "^ p)(j){rrn) + ^ s)/x = , , rOp{q) - 20 BELLONI AND CIIERNOZHUKOV ' ". where the and m< ,' s in On {in). the other hand, using (3.12) /sIog(nVp)(?i(mo S log n V piffllTTln A ^ by conditions (z) and ^-^^^^ and iii) /J. > + i- 5) s) = roM , , Conclude from (3.13) q. (3.14) that q[r'/\9{r)) =rOj,{q). / . . Next we show that we have l{r > r is, Step = 4. = (3.15) implies r 0}[r A > 0, function with g'(0) that and ^ r^ <^r-^s^T^-^^ -s\ , last equality follows (3.15) r o / ^ where the (i) equation (3.10) gives \{m.<s)q{f- Kg(r)) (3.14) and by conditions last equality follows <p l{r > {g{r)/r)] > so that g(r) 0}op(l). g'(0)r. condition E.2, g > Thus, l{r = Op(l) we improve the bound = 0}[r A5'(0)] q and by a fixed convex l{r > 0}op(l), gr^ <„ third bound: . Plugging (3.1G) into (3.10) and using the relation r^ (3.17) m to the following on (3.11) ^</J^. ., = Op{g{r)) under r X /s — Op(l), gives us X /s r (- Op(q)r'^ or equivalently r n <p . qn Finally, inserting (3.17) into (3.16), (3.8) is by This step derives the rate of convergence. (3.16) 3.2. By (3.15) Op(l). Using that r bound both sides of Op(l). Dividing we obtain -Jm <p ^/s{fi/q), which verifies the final on m. Model Selection Properties. Next we turn to the model selectioir properties of the estimator. Theorem are separated (3.18) 2. // conditions of away from zero, Theorem mm i and if the |/3j(u)| > «W ^, n \ of positive constants, i — > cx), q- then with probability approaching one (3.19) non-zero components of j3{u) namely jesuppoTt{0(,u}) for some diverging sequence hold, 1 support {P(u)) C support {0{u)). PENALIZED QUANTILE REGRESSION 21 Moreover, the hard-thresholded estimator P{u), defined by n ( \ 0j[u) where — £' oo and » (.' 
n \^) =/?j(u)W I jt — > \R ( W |/3j(u)| ^ > oU it\i sion. 2 derives These some model = support mean The regression. s) /i -2 I > {(3{u)). selection properties of the results parallel analogous results obtained £i-penalized + with probability converging to one, 0, satisfies support (/3(u)) Theorem s\og{n.yp)4>{mQ first result £] —penalized quantile regres- by Meinshausen and Yu [26] for the says that in order for the support of the estima- tor to include the support of the true model, non-zero coefficients need to be well-separated from zero, which is a stronger condition than what we required for consistency. sion of the true support is in general one-sided; the some unnecessary components having the true The inclu- support of the estimator can include The second coefficients equal zero. result de- scribes the performance of the fj -penalized estimator with an additional hard thresholding, which does eliminate inclusions of such unnecessary components. However, the value of the right threshold explicitly depends on the parameter values characterizing the separation of non-zero coefficients from zero. Proof. (Theorem 2) The result on inclusion of the support stated follows from the separation assumption (3.18) P{u)\\. Indeed, by Theorem Mu) (3.20) - 1 we have with /3(u)|U < and the inequahty in equation (3.19) \\(3{u) — P{u)\\oo < 0iu) — probability going to one, mu) - /3{u)\\ < min |,5,(u)|. ;£support(/3(ii)) The last inequality follows assumption > /3(w)||oo from the rate result of Theorem 1 and from the separation (3.18). Next, the converse of the inclusion event (3.19) implies that ||/3(w) — Since the latter can occur only with probability ap- niiiij£support(/3(u)) |/?j(^i)|- proaching zero, we conclude that the event (3.19) occurs with probability converging to one. ," , ^ , ' ; , Consider the hard-thresholded estimator next. Let r„ To establish the inclusion note that by . min jesupport {p{u)) \P,iu)\ > min jSsupport Theorem 1 = t^(s/n) log(n V p)(f){ma and the separation assumption J\3j{u)\ - \P,{u) - P,iu)\} >p £r„ (/3(u)) + - s)iJ,/q'^. (3.18) r„ 22 BELLONT AND CHERNOZHUKOV , SO that rnirijggupport — £'/£ > 0. |^j(w)| (/3(u)) > C . with probability going to one by ^'^n Therefore, support (/?(u)) , • support ^ £' oo and with probabihty going to one. (/3(u)) To establish the opposite inclusion, consider the quantity e„= max Theorem B_y 1 e„ <p all with probability going to one by components smaller than we have that support {0{u)) C support Two-step estimator. 3.3. < ^V„ r„ so that £„ by the hard-threshold rule of ^(u), 0j{u)\. jgsupport (P(u)) . (/3(u)) . . . ,p}, selected In this problem ^(u)earg We further estimation. Moreover, 3. is be a model, that the true D a subset of ' ,, Q„(/3). of the parameter vector we remove the is, define the two-step estimator /3'^{u) /? that were not regressors that were not selected from we no longer use ^i-penaUzation. Suppose that conditions E.l, E.2, and E.5 model that contains sion \T\ T" . min we constrain the components selected to be zero; or, equivalently, Theorem Since Next we consider the following two-step estimator that apphes by a data-dependent procedure. . cx3. with probability going to one. as a solution of the following optimization problem: (3.21) ^ excluded from the support i'r„ are the ordinary quantile regression to the selected model. Let {1, £' hold. 
model T^ with probability converging Let to one, T be any selected and whose dimen- of stochastic orders, then P^iu) - piu) IV provided the right side converges to zero Under conditions of the Is login \Jp)4>{s)l ^ 9 m probability. theorem see that the rate of convergence of the two-step estimator is generally faster than the rate of the one-step penalized estimator, unless 0(n) in which case the rate is the same. It is also helpful to note that \\f iu)-P{u)\\<p Proof. (Theorem 3). Let r = ||/3^(u) - when ~ ~p (p{s), and 0(s) <p 1, ^^(u) and by r„ C f g 1 ^^ login Vp). /9(u)||. By definition of with probability approaching one, we have that with probability approaching one Qu{fiu))-QuiP{u))<0. PENALIZED QUANTILE REGRESSION <p First note that since \T\ by s, Lemma we have 11 that 23 (t){\T\ + s) condition E.5 to control the empirical error in the objective function, QuiP [u)) s\og{nV - Qu[p{u)) <p + p)(f>{\T\ r]l s) I <p n <p we Applying (p{s). get that slog{n\/ p)(p{s) Invoking the identification condition E.2 we obtain that .(.^A,(r))<,^./^^°^("^^)^^^) (3.22) Since As we assumed that in the we can ^/s log(n proof of Theorem refine the bound 1, V p)4>{s)/n = Op{q), this implies that r = and that Op(l), <p that q{r~Ag{r)) r^ = rOp{q). Op(g{r)). Therefore (3.22) to slog(n Vp)0(s) qr we conclude '^pT\j or r s\og{nV p)(j){s) j ^ < I g' n D proving the result. 4. Analysis of the Pivotal Choice of the Penalization Pcirameter. we show that under some conditions proposed the pivotal choice for the penalization parameter A in Section 2.3 satisfies the theoretical convergence stated Theorem in 1. requirements needed to achieve the rates of , _^^ . Recall that the true rank scores can be represented almost surely as al{u) where ui,. .. ,Un are the regressors, xi, (4.1) . i.i.d. . , . = [u uniform x„. Thus, — \{ui (0, 1) <u]), random (4.2) 4. i = 1, . . . A' = inf{A: ., . , variables, independently distributed = (xj, . . . ,x„). Let the regularization parameter A(A') be defined as A(A') . P(A„ < A|A) > 1-Q>,}, Q„ = •„ ,n, we have which has a known distribution conditional on Theorem for A„ = n||En[x,(u-l{u, <u})]||^, - In this section ^ ^ n\J p from ; 24 BELLONI AND CHERNOZHUKOV ' for some sequence — f oo. > Assume that there exists a sequence Cn^p such that uniformly in Nl/2' ' n / jn^^ (4.3) i^ul ^ Moreover, assume g = X{X) r/ien A 0. Theorem 1, = = {A',,?' > ;r 4) We and for the inequahty ec let each < 1 I < V ond p)), l°g(" i/iai ^ p) ii/slog(n 0. V p)(t>{l)/n/q —>p —p t oo such that + s)— and »p , and i ~p '^ i. - : 12"=! < V Stout [34], . , Theorem 5.2.2: independent random variables with zero mean and ^i ^^^ *n n and . • . n > 1. = I3r=i Suppose e ^ [-^f] and 7 > > n > ^°^ all 1. Then 0. Let \Xi\ for < each n cs„ > 1, implies that P(5„/s„ >£)<exp(-(£72j(l-ec/2)j, (4.5) and there and exist constants £(7) 7r(7) such that if £ > £(7) and ec P(5„/s„>£)>exp(-(£V2)(l + (4.6) We i V will use the following inequalities of = 5„ * assumptions on the regularization parameter assumed in 1} denote a sequence of finite variances, almost surely (?)(n/log(n ri :; '^".P ) tJnlog(n\/ p)4>{mo q^\ — — yiog(nVp)/i-y f PA Proof. (Theorem Let —p (i'(l) ^ ""^ X] ^u 1 there exists a sequence X where tuq ^, c„,p satisfies the namely (4.4) < ,' < 7r(7), then 7)' need to establish upper and lower bounds on the value of A. We first establish an = X!"=i ^^f, and note that 0(1) = supj<pD?/n. Next observe that Var (J^"^i Xija*(u)|X) = u(l - u)v'j. 
Note that by (4.3) we have supj<,<„ |x,ja*(u)| < Cn.pVj/\/u{l — u), j = l,...,p. Moreover, for n large enough, condition (4.3) also implies upper bound. Let v^ that 2c„,p(i (4.7) Under (4.7), we can apply obtain that for every j = I, + l)0og(n Vp)/yu(l - (4.5) . . . with e ,p = 2{t + u) < 1/2. l)v^log(n Vp), and c = Cn,p/\/u{l — u) to PENALIZED QUANTILE REGRESSION Therefore, since y/n$(l) > Vj 25 we have p(^^^l^M$^>e\x]<exp(-{e + l)\og{nVp))= (4.9) \/u{l — Next note that using ^ (n u)n(l){l) Vp) \{ny p) we have (4.9) (4.10) P > V"(l - (a„ < u)n4>{l)6\x) > Ju(l - ^a;yO*(u JZP u)n(?!>(l)£|X i=\ n < p max P ^x,j<(w > Ju{l-u)n(l){l)e\X 1=1 P(A„ > A|X) Since A (4.11) decreasing in A, is < yju{l - we conclude that u)n4>il)s < 2{t + l)yJnlog{nVp)(l)il). Next we turn to establishing the lower bound. Let jn G = that Vj^ i/ni^(l). 1 {nV > Fix 7 definition of > P max \j<p > — £(7) and ec < 7r(7). and, by (4.3) we have ec \. Since -u)n(f){l) A A|A') > is ... decreasing in eyju{l -u)n(j){l) Thus, taking in account that In order to verify (4.4) define i ~p i — > 00. MX = = ti/21og(n V p)/(l o(l), for n large + 7), c = enough we ^ exp(-(£V2)(l+7)) > exp(-t2log(nVp)) J ,, . (4.12) have that \ , P (A„ > ,p} denote an index such Therefore we can apply (4.6) to obtain pf \E7=iX,,„a'du)\ \Vu{l . 1=1 £(7) and 7r(7)), and set e u). Since £ diverges, . >P[\J2xrjX{u] > 1=1 fix . A„ we have ^a;,ja*(u) >A|A') I p) (which implicitly Cn,p/\/u{l have e By {1, /i t Thus, the ~ = A, it follows = (^ that =t^2u{l - u)nlog(n Vp)0(l)/(1 + g, we have established A A/[\/nlog(n V p)(p{mo first result follows from the assumptions that t^/s log(n + 7) ~p t^Jn\og{n V p)(p{mQ s){^/q)]. By + s)^. construction we of (4.4) follows, and the second result of (4,4) V p)4>{l)/n/q -^p 0. ! ' -. D. BELLONI AND CHERNOZHUKOV 26 For concreteness, we now verify the conditions of Tlieorem 4 in our examples. Example Normal Design, continued). 5 (Isotropic ated with the jth covariate, where x.i is we use standard Gaussian concentration bounds, we have , see , , •. (4.13)- P(|a;,j| > denote the n-vector associ- x.j Section [2-3] K For any value 3. • , . \xij\ > \ . '" <exp(-/\V2). /r) In turn this implies that max] <,;<„, i<j<p Let a column of ones representing the intercept. Next <p \/\og(n Vp). Moreover, the vectors x.j are such that P{\ (4.14) |>A')<2exp(-2A'V7r2) ||x.,||-E[||x.,||] Combining these bounds we obtain minj=i_. ditions (4.3) hold with Cn^p have (?l)(l) > 1 + and ^(mo —p y the definition of mo. Thus, g ~ /u ~ 1 in Example <p s) "s^^-^p^ \ + ,^p \jYL^=i Theorem 4 + 6 (Correlated . ^p. \/logp. Therefore, con- = o{n). On p) requires i^slog(n V < = p) the other hand, by 1 We o{n). Lemma we and 14 also verify that We analyze the correlated de- random variables. Corollary 3.12 of Normal Design, continued). The upper bound [23]. P z^j for the case p > follows from the result ~ max A^(0, 1) are i.i.d. |x,,| as in > < A' Example P 5. max \ |3„| K > (The case with p < follows by changing the signs of x^ for each even j and redefining the parameter j3{u) for these so that after the transformation we obtain the design with p > 0.) only on the independence within the components of each vector are independent for obtain Cn,p —p i' j^ i, w-^Si^^ ^nd + |p|)/(l-|p|)} (l + we can invoke the same t"^ log'^(n 4 also requires t^slog(rt Vp) = v/(mo/n) logp and the definition of mo. Since p section. . K >l (4.15) {(l . the next section. 
Ledoux and Talagrand where 1, V^" yj[s/n) logp sign similarly using comparison theorems for Gaussian that for = V^.,j ^u ~p and i^log^(n V y(m.o/n)logp ^ a.nd E{\\x.,\\] V p) = is it logp) < {(l follows that 0(1) o{n) in this case. We regressors; The lower bound x.j. Since x^ij Example o(n). In addition, 0(1) +y(s/n) fixed results of new > -I- also verify that g and Xij Therefore we + s) <p Lemma 14 and 0(mo 1 + |p|)/(l - ~p 0(mo 5. relies |p|)} s). ~ p by Thus, Theorem ~ 1 in the next PENALIZED QUANTILE REGRESSION 5. Empirical Performance. mance 27 In order to access the, finite sample practical perfor- we conducted a Monte Carlo study and an application of the proposed estimators, to international economic growth. Monte Carlo Simulation. 5.1. In this section canonical quantile regression estimator, the we will £i -penalized compare the performance of the quantile regression, the two-step estimator, and the ideal oracle estimator. Recall that the two-step estimator applies the canonical quantile regression to the model selected by the penalized estimator. The oracle estimator applies the canonical quantile regression on the minimal true model. (Of course, such an estimator is not available outside Monte Carlo experiments.) We focus our attention on model selection properties of the penalized estimator and biases and standard deviations of these estimators. We begin by considering the following regression model (see Example y - + e, x'Pil/2) /?(l/2) where an intercept and the covariates x_i identically distributed e and the dimension ~ T,ij worthy because it satisfies We " 2. = p''"-'', In the results size n to 200. the right panels From the panels minimal true model is We ;'.-"' - . =" Example 2 with p = regressors, 0.5. namely x_i This design is it ~ noteeasily on model selection performance of the penalized estimator panels of Figures 2-3, left Prom we we we we plot the frequencies of the dimensions of the model selection performance see that the frequency of selecting very small. in plot the firequencies of selecting the correct components. see that the the performance of the estimator results. and the sample 5, ; selected model; in the right panel left ': as specified in our conditions. 2-.3. dimension p of covariates x equal to 1000, violates the condition of the uniform uncertainty principle, but summarize the Figures N{Q,I), and the errors e are independent model above with correlated also consider a variant of the N{0,T,), where set the 0)', parameter A equal to 0.9-quantile of the pivotal random variable A„, following our proposal in Section We ~ (1,1,1,1,1,0,..., minimal model to s of the true set the regularization We A''(0, 1). = where 1) We is much is particularly good. larger model than the also see that in the design with correlated regressors, quite good, as we would expect from our These results confirm the theoretical results of Theorem 2, theoretical namely, that when the non-zero coefficients are well-separated from zero, with probability tending to one, the penalized estimator should select the model that includes the minimal true model as a BELLONI AND CHERNOZHUKOV 28 subset. Moreover, these results also confirm the theoretical result of Theorem namely that 1, the dimension of the selected model should be of the same stochastic order as the dimension of the true minimal model. In summary, we that find the model selection performance of the penalized estimator very well agree with our theoretical results. We summarize results on the estimation performance in Table 1 . 
We see that the penalized quantile regression estimator significantly outperforms the canonical quantile regression, as we would expect from Theorem regressors bias, as is and from inconsistency 1 larger than the sample we would expect from the The penalized size. of the latter definition of the estimator of coefficients from zero. Furthermore, we when the number of quantile regression has a substantial which penalizes large deviations see that the two-step estimator improves upon the penalized quantile regression, particularly in terms of drastically reducing the bias. The two-step estimator in fact does almost as well as the ideal oracle estimator, as we would expect from Theorem harm 4. also see that the (unarbitrary) correlation of regressors does not the performance of the penalized and the two-step estimators, which from our theoretical results. In fact, since data-driven value of the correlated case, as 8, We we would expect A tends to be slightly lower for we would expect by the comparison theorem mentioned in Example the penalized estimator selects smaller models and also makes smaller estimation errors than in the canonical of the penalized uncorrelated case. In summary, and two-step estimators to be in we find the estimation performance agreement with our theoretical results. MONTE CARLO RESULTS Example Mean QR QR Two-step QR Oracle QR Canonical Penalized norm Bias Std Deviation 1.6929 0.99 5.14 2.43 1.1519 0.37 5.14 4.97 0.0276 0.29 5.00 5.00 0.0012 0.20 table displays the average Co norrn Co Mean QR QR Two-.step QR Oracle QR We Mean 25.27 Canonical Penalized deviation. Isotropic Gaussian Design 992.29 Example The 1: 2: Co /"i Correlated Gaussian Design norm Mean (i norm Bias Std Deviation 988.41 29.40 1.2526 1.11 5.19 4.09 0.4316 0.29 5.19 5.02 0.0075 0.27 5,00 5.00 0.0013 0.25 and fi obtained the results Taule 1 norm of the estimators as well as mean bias and standard using 5000 Monte Carlo repetitions for each design. PENALIZED QUANTILE REGRESSION Hjstugram of the numbflr of imrv-zcro components in 3(1/2) Hisloi^rnin of Ihp immlx-r of correct r om(HinpnU stl<*u^ 29 {/, = a. p - f)) Fig 2. The figure siimmanzes the covariate selection results for the isotropic normal design example, based on 5000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates selected out of the possible 1000 covariates. The right panel plots the histogram for the number of significant covariates selected; there are in total 5 significant covariates a,mongst 1000 covariates. Histogram of the number of noivzero components Fig The in J(1/2) HrstoRrtim of the number of coiTvrt eom|ionents selected (s = f>, /j = 5) summarizes the covariate selection results for the correlated normal design example with = .5, based on 5000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates selected out of the possible 1000 covari.ates. The ri,ght panel plots the histogram, for the number of significant covariates selected; there are in total 5 significant covariates amongst 1000 covariates. We obtained the results using 5000 Monte Carlo repetitions. 3. figure correlation coefficient p BELLONI AND CHERNOZHUKOV 30 5.2. International Economic Growth Example. we apply £i-penalized In this section quantile regression to an international economic growth example, using method for model We selection. primarily as a it use the Barro and Lee data consisting of a panel of 138 countries for the period of 1960 to 1985. 
We consider the national growth rates in gross domestic product (GDP) per capita as a dependent variable y for periods 1965-75 and we 1975-85.^ In our analysis, allows for a total of n = will consider = model with nearly p 90 complete observations. Our goal here is 60 covariates, which to select a subset of these covariates and briefly compare the resulting models to the standard models used in the empirical growth literature (Barro and Sala-i-Martin One [2], of the central issues in the empirical growth literature of an initial (lagged) level of GDP Koenker and Machado is the estimation of the effect per capita on the growth rates of GDP per capita. In Solow-Swan-Ramsey growth model particular, a key prediction from the classical [21]). the is hypothesis of convergence, which states that poorer countries should typically grow faster and therefore should tend to catch up with the states that the effect of initial level of pointed out in Barro and Sala-i-Martin GDP [3], richer countries. Thus, such a hypothesis on the growth rate should be negative. As this hypothesis regression of growth rates on the initial level of GDP. is rejected using a simple bivariate (In our case, median regression yields a positive coefhcient of 0.00045.) In order to reconcile the data and the theory, the literature has focused on estimating the effect conditional on the pertinent characteristics of countries. Covariates that describe such characteristics can include variables measuring education and science policies, strength of market institutions, trade openness, savings rates [3]. The theory then of the initial level of predicts that for countries with similar other characteristics the effect GDP on the growth rate should be negative Given that the number of covariates we can condition on size, particular, past previous findings came under procedures for covariate selection. In resolve the ([24]). model Since the selection comparable to the sample fact, in number is some The growth rate in GDP median [31]). of covariates is high, there classical tools. very large, though see [24] have no simple way to Indeed the number of and we use the is [31] for an attempt lasso selection device, regressions, to resolve this important issue. over period from t\ to t2 is In upon ad hoc cases, all of the previous findings to search over several millions of these models. Here specifically the £i-penalized ([24], severe criticisms for relying problem using only the possible lower-dimensional model ^ is ([3]) the covariate selection becomes an important issue in this analysis been questioned and others coTnmonly defined as \og{GDP,^/GDPi^) — 1. PENALIZED QUANTILE REGRESSION Let us now turn parameter A. This We performed povariate selection using the £i- initially used our data-driven choice of penahzation to our empirical results. penalized median regressions, where initial we 31 choice led us to select no covariates, which is consistent with the situations in which the true coefficients are not well-separated from zero. to slowly decrease the penalization parameter in order to allow for selected. We present the model selection results in Table we the choice of A, select the black openness) and a measure of pohtical we select an additional With a in the table. in the table. We With some the covariates to first be relaxation of market exchange rate premium (characterizing trade instability. set of educational third relaxation of A refer the reader to 2. 
We then proceeded [2] With a second relaxation of the choice of A attainment variables, and several others reported we include and [3] yet another set of variables also reported for a complete definition and discussion of each of these variables. We then proceeded to apply the standard median and quantile regressions to the selected models and we also report the standard confidence and 5 we show intervals for these estimates. In Figures 4 these results graphically, plotting estimates of quantile regression coefficients P{u) and pointwise confidence intervals on the vertical axis against the quantile index u on the horizontal axis. that we have We should note that the confidence intervals do not take into account selected the models using the data. (In an ongoing working on devising procedures that we have selected, the will median regression account for this.) coefficients We on the companion work, we are find that, in all initial level of models that GDP is always negative and the standard confidence intervals do not include zero. Similar conclusions also hold for quantile regressions with quantile indices in the middle range. In summary, we believe that our empirical findings support the hypothesis of convergence from the classical Solow-Swan-Ramsey growth model. Of methods course, it would be good to find formal inferential to fully support this hypothesis. Finally, our findings also agree the previous findings reported in Barro and Sala-i-Martin [2] and thus support and Koenker and Machado [21]. 6. model Conclusion and Extensions. In this work we characterize the estimation and selection properties of the £i-penalized quantile regression for high-dimensional sparse models. Despite the non-linear nature of the problem, we provide results on estimation and model selection that parallel those recently obtained for the penalized least squares estimator. It is likely that M-estimation problems. our proof techniques can be useful for deriving results for other 32 BELLONI AND CHERNOZHUKOV , MODEL SELECTION RESULTS FOR THE INTERNATIONAL GROWTH REGRESSIONS Penalization Parameter A = Real GDP 1.077968 per capita (log) is included in all models Additional Selected Variables A Black Market A/2 Premium (log) ', Political Instability Black Market A/3 / ' ' ' '" J (log) Political Instability Measure of . Premium :. tariff restriction Infant mortality rate Ratio of real government ''consumption" net of defense and education Exchange rate ~ % of "higher school complete" in % of "secondary school complete" - female population male population in Black Market Premium A/4 (log) Political Instability Measiu'e of tariff restriction Infant mortality rate Ratio of real government "consumption" net of defense and education Exchange rate % of "higher school complete" in % of "secondaj-y school complete" Female gross enrollment ratio % of "no education" in the female population in male population for higher education male population Population proportion over 65 Average years of secondary schooling in the male population Black Market Premium (log) A/5 Political Instability •, Measure of tariff restriction Infant mortality rate Ratio of real government "consumption" net of defense and education Exchange rate % of "higher school complete" in % of "secondary school complete" Female gross enrollment ratio - % • . 
of "no education" in the female population in male population for higher education male population Population proportion over 65 in the male population Average years of secondarj' schooling Growth % rate of population male population Ratio of nominal government expenditure on defense to nominal Ratio of import to GDP of "higher school attained" in Table For GDP 2 sequence of penalization parameters we obtained nested models. All the columns of the design matrix were normalized to have unit length. this particular decreafung PENALIZED QUANTILE REGRESSION Real Intercept 0,02 / 0.5 GDP 33 per capita (log) t \ \ / 1 / 0.4 ' y 0.3 0.2 0.1 -0.1 // ^/ 0.2 -- - - - - / / "^'^-0.02 \ \ ^ \ -0.04 ^ \ - 0.4 0.6 0.8 0.2 1 04 u 0.6 0.8 1 u Black Market Premium Measure of Political (log) / instability 2 / 0.1 / 1.5 ; ~ ~ - - - - 0.1 1 "J^ _ - - 0.2 / ; 0.5 / / 0.3 ______ ^__ / / / / -0.5 0.2 0.4 0.6 0.( 0,2 0,4 0,6 0! Fig 4. This figure plots the coefficient estimates and standard pointwise 90 % confidence intervals for model associated with A/2 which selected two covariates in addition to the initial level of GDP. the BELLONI AND CHERNOZIIUKOV 34 Real Intercept GDP pci i.a])ita {loj;) 1 U.Ub — ~ _ _ _ _ •' 0.5 "~ , -0 02 OA 0.8 0.6 Black Market Premium s. N 2 1 6 0,4 Measure of (ii}(i) 0.2 ~ 1 OB 1 Political instabililv 04 ^ f 0,2 ^. -..- . /... . 0-0 2 / 0.2 -0 4 0.2 04 06 08 0.2 1 6 4 u u M<^[isiUT of tariff rrstricrion Infant mon.ality rate 0.2 0.4 0.6 8 1 8 1 u Excliange rate 2 4 "f '^WRhcr .-icfmol 6 0.6 2 1 "'' 0.6 0.4 u X 10"° OS 1 u complelc" in ,q-i female population ^ % r,f spiondary srhool rnmplele" in male population / -'~--'' 0.2 0.4 6 0,8 6 1 2 0,4 5 0.8 1 Fig 5. This figure plots the coefficient estimates and standard point-wise 90% confidence intervals for the model associated with A/3 which selected eight covariates in addition to the initial level of GDP. PENALIZED QUANTILE REGRESSION There are several possible extensions that we would First, we would hke indices. We expect 35 pursue like to to extend out results to hold uniformly across a in the future work. continuum of quantile that most of our results will generahze to this case with a few appropriate modifications. Second, following van der Geer we would [37], like to allow for regressor- we would specific choice of the penalization parameter. Specifically, like to consider the following estimator: (6.1) . where a-j = J^ p{u) e arg rnin E„ [p^iy, Z]"=i ^fj- The dual problem max E„ x^/3)] To map this to our previous ^ a,|^,| associated with (6.1) has the form: \En[x^Ja^]\<^aJ, {u +- lj/,a,] a€R" (6-2) - ~ 1) < ai < u, j i = = l,...,p, I, . . . ,n. framework, we can redefine the regressors and the parameter spaces via transformations Xij = Xij/a-j = and Pj{u) analogous proof strategy. Third, we would like to aj(5j{u). We can then proceed with an extend our analysis to cover non- sparse models that are well-approximated by sparse models. In such a framework, the components of ^(u) reordered by magnitude, namely |/3(i)(w)| > \P[2){u)\ exhibit a sufficiently rapid decay behavior, for example, stants R and threshold can t. Therefore, truncation to zero of still we D.1-D.4 discussed > |/3(fc)(ti)| \P(p-i){u)\ < Rk"^'^ for > |/?(p)(u)|, some con- components below a particular moving lead to consistent estimation. APPENDIX In this section all > A: VERIFICATION OF CONDITIONS verify that conditions E.1-E.5 hold in Section 2 and also hold E.1-E.5 under the simple much more set of conditions generally. 
For convenience, we denote by S^- = {QeIRP:||Q|| the /c-sparse unit sphere in IR^. In what follows, and p. = l,||a||o<fc} we show how the key constants, such as q appearing in E.1-E.5, are functions of the following population constants (which can — 36 BELLONI AND CHERNOZriUKOV . r . " . possibly depend on the sample size n): • /•= "2^ fy,\x,i^'P{u)\x), f := sup fy,\:c,iy\x), := inf E /' := sup _4|,_(y|x), g{k) [(Q'a;)^] , where values of y and x range over the support of y, and x, The . depend on the results also sparse eigenvalue of sample design matrix 4i{k) :- sup En already mentioned earlier. As an illustration common we compute the constants 7 (Isotropic We Normal Design, continued). For concreteness assume that the errors are e ~ N{{), 1). can compute the values of the several constants involved / = 1/V^ > Example ample 2. > = 0.39, /' l/y27re 8 (Correlated < /i^v^ > 1/3, Lemma model 1 (2.1) ~ / 1/6, = Under > 0.6, Q{k) this simple design = 1, and — Lp{k) 1/V2tt > and ^{k) < 0.39, /' j^ = = = 0.4, 1. in Ex- relevant con- l/x/^Tre < 0.25, 3. Conditions E.l (model sparseness) we impose throughout, including The 1/2. < l/\/27r we is the key underly- in condition D.2. Lemmas 5 below establish the remaining conditions E.2-E.5. (Verifying Condition E.2 - We have that in the linear quantile — l3{u)\\ = r, — /?(u)||o < rn, Identification). under random sampling, for each /? £ R^ QuiP) - QuiPiu)) > qim) (A.2) : \\/3 ||/3 [r^ A gir)) , where (A.3) = / Consider next the design and that p 7V(0, 1) 0.4, Q{k)>V^ = ing model assumption, which and < \/V2tt A.l. Verification of E.1-E.5. 1, 2, 3, 4, ^JtT/S Example revisit the design of in the analysis: Normal Design, continued). = bounded by / = 0.25, 7(/c) For concreteness assume that e stants are 7(fc) two in (A.l) for designs used in the literature. Example 1. [(q''x,)' g{r) = r and q(m) ^ —^minO, \-j^^[m)\ \. — PENALIZED QUANTILE REGRESSION Thus, condition E.2 holds with q E.2 holds with = q q{n) Proof. (Lemma Knight 1) under Conditions D.I-D.4, condition q{n). In particular, ci 1. Let Fy^^ denote the conditional distribution of y given w any two scalars [17], for = 37 Prom x. and v we have that rv pu{w - (A. 4) - pu{w) = -v{u - l{w < v) 0}) + <z]- {\{w / l{w < 0})dz. Jo Applying (A. 4) with w= = y—x'(5{u) and v we have that E [-v{u - l{w < x'{f3-l3{u)) Using the law of iterated expectations and mean value expansion, we obtain 0. Q„(/3) - Q„(/?(u)) = E + Fy\,,{x' p{u) z) - E > |e - i(x'(/3 [{x'iP £ = [0, z] Fy^,[x' (i(u))dz (A.5) > for z^^^ 0})] - f E[|x;(/3 - /3(u))|3]. P{u)))'fyiAx'/3{u))] - I3{um - f E[|a:'(/3 /3{u))\'] Next define = r„ By (A.5) sup |f we have Quif3{u) : + fd) - ^f^~E [[x'df] " that construction of r,„ E[\a'x\^] . and the convexity of Qu, for ' 3 / any /3 , d G S^ ' " • . for all , ' ^. ^ 37 By > Q„(/3(u)) . , , such that || P{u)\\ < TTi,, we have that Qu{P)-QuW{u)) > =E[{x'{f3 Letting r { \\(3 — f3{u)\\ = r first QuiPiu)) P{u)\\ fMQuWiu) + r,nd)-QuW{u)) ) > rfQ{m)rl 4 and =E\ix'{P-(3{u))f\ > ' ' -^^ ' 4 inequality holds by construction of r^; hence QuiP) - Qu[p{u)) > for - we have M^ Quif3{u) + Tmd) - where the (3{u)))^]aI\\P q{m) defined in (A. 3). —^ A^ >q(m)(r^Ar) D '' 38 ': The Lemma • lemmas following two ' BELLONI AND CHERNOZHUKOV ' ^ verify the Empirical Pre-sparseness condition. 2 (Verifying Condition E.3 ber of non-zero components of (3{u) - is We Empirical Pre-Sparseness). num- have that the bounded by n/\p, namely - mu)\\o<nAp. Suppose that yi , . y„ are absolutely continuous conditional onxi, . . 
, of interpolated points, h Proof. — e = I, \{i we have Trivially with rows x[,i = . . . ,n, c : = yi < {ue', (1 Let p- - Y = (yi, u)e', Ae', Ae')', . . . and Note that the penalized quantile min A = ue'i+ + (1 - A Matrix has rank n, since Tsitsiklis components. [6] We there u)e'E,~ + + Ae'/3+ ||/3(w)||o of interpolated points. Therefore, - [I we have ||/3(u)||o on A' consider the of vertices. Since c > set. = We + {n— ||/3"''(u)||o -l- By Theorem ||/5-(u)||o since have that n h) <n componentwise if = 0. such that Y'{a^ conditional on — X a^) — is > A < ||/9(u)||o this unique solution with probability one. n, see [G]. Therefore, \\P(u)\\o^h. is : A'a polyhedron < is c] Bertsimas 2.4 of number vertices h < If is : A' a which has a (i.e., is, finite. the : < c}. finite Condi- number zero is always A'a < c} is a not unique there exist vertices a^,a~ Y is absolutely continuous Therefore the dual problem has a the dual basic solution non-degenerate, that n. form of A' implies that {a is of non-zero Let h denote the 0). non-empty a zero probability event since and the number of primal basic solution nxn h components of ^ and ^ are non-zero. which leads to the solution of the dual This where Aw = Y "^ polyhedron defined by {a Therefore, A'], one optimal basic solution w* with at most n non-zero feasible for the dual problem). Moreover, the bounded — c'w niin To prove the second statement, consider the dual problem maxa{ya tional X / and / denotes the Ae'/?- has linearly independent rows. at least is be the n x p matrix defined P{u) as a basic solution with the minimal components (note that number it number can be written as regi'ession C -^- + XP+ - Xp- = Y and X ,y„)', (1,1,...,!)' denotes vectors of ones of conformable dimensions, identity matrix. ,Xn, then the . x^f3{u)}\, is equal to \\/3{u)\\o with probability one. ||/9(u)||o = . . number we have with probability one that is unique, we have that the of non-zero variables equals ||/3(w)||o + (n — h) = n, or that - D PENALIZED QUANTILE REGRESSION Prom [6] , the complementary slackness condition of linear programming, see we have that for any component j G Pj{u) (A.6) ^ Pj{u) > only < , . , E„ E„ if Theorem 4.5 of p} . . if only [2;,,ai(ti)] [xijai{u)] = — and "a = , (2.6). Lemm.'\ 3 (Verifying Condition E.3 For any A > {1 . where a{u) solves the dual problem ||;9(u)||o. 39 Empirical Pre-Sparseness, continued). - m = Let we have m< 'n?(j)[m) A2 Proof. Let m= k = ||^(u)||o ^T we have mX = < For any \T\. sign(^fc(u)) We shall = fc 0. 6 from (A.6) we have T", Therefore, by Cauchy-Schwarz inequality sign0{u)ysign0{u))X ||Xsign(^(u))i|||a(u)|| where we used that ||sign(;5(u))|| T — support(^(u)), and {X'a{u))k = sign(^fc(u))A and, for a{u) be the solution of the dual problem (2.6), — ||sign(^(u))||o = i/m we have mX < ^ < sign0{u)y{X'a{u)) = (x sign0 {u)))' a{u) V^^W^)\\signi(3{u))\\\\aiu)\\, m. Since ||2(u)|| n^/m^im), which need some additional notation denote the score function we have in what < i/max{u, 1 — u}n < - follows. Let D /: ,; m-sparse vectors near to for the ith observation. Define the set of . R{r,m):={PG-RP .m\o<m, \\0-p{u)\\<r}, and define the sparse sphere associated with a given vector §(/3) = {a e IRP : ||q.|| < 1, support(Q) C /? ' ' eo(m,n,p) -^^ as support(/3)}. .- Also, define the following empirical and linearization errors (A. 7) and yields the result. the true value P{u) - y/n, suPaes- |G„(qV,(/3(u),u))|, - ei(r, m, n,p) suP;geR(r,m),a£S(/3). 
|G„(aVj(/3, u)) e2(r, m,n,p) sup/3efi(r,m),aeS(/3) V^\E[a'^i{P,u)] G„(a't/'i(/?(w),u))|, - E[aVj(/3(w),u)]|. ,: . 40 . where Gn is BELLONI AND CHERNOZHUKOV . the empirical process operator, that Next we verify condition E.4. Lemma 4 (Verifying Condition E.4 \\/3{u)—P{u)\\. We have . ip{m) <p m < n and yjn iJi{m) 4'{m), condition E.4 holds with ^— I n, ^ <P -^[r IJ^ 4) It will (2. 1) under random sampling, log(r7 V p)0(m) V ^Jn log(n V jjl = ^J ip{m){%J ip(rn) f = n{n), namely — Jnlog(n ^ , < \/ , 1 and ,Xn- -; p)ip{m.) p)(j){m)' > 1, so that condition E.4 holds with be convenient to define three vectors of rank scores (dual ' ' . , — u — \{yi < x[l3{u)] for = 1, = ai{u) = u — l{y, < x\P{u)} for the true rank scores, a*{u) 2. the estimated rank scores, 3. the dual optimal rank scores, 2(u) that solve the dual program i = nE„ . i denote the support of p{u). Let x.^ sign(/3^(u)) . = {xij,j G T)', '-, i.e. ^M = This is 1, . . , . n; (2.6). = [Pj{u),j e T)' we have that = ||sign(/?^(u))|| "Eji [2;jf2i(-u)j A Therefore we can bound the number of non-zero components of in (A.8). n; and Pf{u) a:.-a,(u) ^-^ bound the empirical expectation , . Fi'om the complementary slackness characterizations (A.G) (A.8) . . VI). Therefore, provided 1. T . = r ||/3(u)||o, ^ (I>{1) variables): Let = absolutely continuous conditional onxi, M) + M- In particular, under D.I-D.4, <p{m) Proof. (Lemma m Let ^/m. where r, o,i^e model ^ + ^ Empirical Sparseness). - ,yn that, in the linear quantile uniformly in (G„(/) := n~^''^ Y17=\if{^i) ~E[/(X,)]). . and suppose thatyj, Vrn <p jj{m) -{r Al) is /9(u) provided we can achieved in the next step by combining the maximal inequalities and assumptions on the design matrix. Using the triangle inequality A\/m < nE„ i II -(ai(u) L'-' - in (A.8), write a,(u))| + nE„ Jllll a; -(af(zt) L'-' - a*(u))| + nE„ Jllll x..*a*(u)l L^-* Jll . . PENALIZED QUANTILE REGRESSION Then we bound each we Lemma use < ^/7^€o To bound the last component, (m,n,p) <p ^JnmAog(nyp) (^fp{m)V \/4'{m) component, we observe that ai{u) first 2 the penalized quantile regression one. This implies that \\nEn \x c;{ai{u) To bound in this display. 9 to get ||nE„ [a;.^a*(u)]|| To bound the components of the three 41 E„ - [|2i(u) - fit ai{u) only < m/n. we Therefore, < nsup^ggm E„ < n sup„g§^ ^/En < ^n(f>{m)m. if y, m can interpolate at most ai(u)p] ai(u))| ^ = x'^P{u). By Lemma points with probability get - [\a'xi\ \ai{u) ai{u)\] [\a'xi\'^]^En \\ai{u) - ai{u)[^] the second component, note that - ||nE„ \x^!f{a.,{u) = a*(u))j ||\/n G„ [x.f{ai{u) - + a*(ti))) || - ||nE [x.^(ai(u) a*(u))] || || < v^ei {r,m,n,p) + <p sjnm. \o%(n V where we use Lemma 8 Setting fi{m) = and Lemma 10 to \f^p{m){-~/ipijri) f V 1) p) bound y/ne2 {r,m,n,p) \/^[m) V 4>{m) respectively ei(r, > ^ip{m) and follows. n^J Lp{m){sJ ip{m)]r A m,n,p) and using that m < n e2(r, the 1) m, n,p). first result _ Under D.1-D.4 we Lemma 5 < (^(n) 1, and 0(1) > (Verifying Condition E.5 quantile model (2.1) under - 1 D and condition E.4 holds. Empirical Error). We random sampling, and uniformly over have m < that, n, r in the linear > 0, and the ... region R{r, m); \Qu{f3) + - Qu{P) - {Qu{f3{u)) ^og(" - Qu{0[u))) <p V("^ + ^ [^^[m + s)v4'{m + s) ) . I In particular, under D.I-D.4 we have that condition E.5 holds, namely \Qu{I3) - QuW) - [QuWiu]) - QuiPH)) <p -^Y^(m + s)log(nVp)0(m + 5) I uniformly over m < n, r > 0, and the region R[r,m). Proof. For convenience let e„ := \QuiP) - Qu{P) - P{u)\\, and ||/3 - P{u)\\o <m. 
+ s we have that 11/3 - -^lo ^i('^^'rn + s,n,p)dz=^ -^ei{r,m .- (<3u(/3(u)) + s,n,p). - Qu(/3('"))) I- Since r •_ > BELLONI AND CHERNOZHUKOV 42 The Lemma follows from first result Under D.1-D.4 we have < (/'(n) 8. and 0(1) > 1 / A. 2. Controlling Empirical and Linecirization Errors. nical results of Appendix A. 4 maximal results provide the submodels' dimensions m. results and number We < for a class of which may be of some independent n, functions J^T VC VC Proof. We Tc index bounded by their VC index 2.6.15. 0} is (for cm convenience bounded by Next consider / G - l{p{Z) < 0}) — m. The classes of functions cr}, :Qe§(/3),support(/3) VC W := Z — 2; see, for {x'a : c. < is fixed similar. 0} and l{p 0} G V. index of the class of sets {{Z, t) x' (3} : : and has cardinality m, [38] Lemma form f{Z) := g{Z){l{h{Z) in the < < := {l{y example, van der Vaart and Wellner which can be written l{h T it is C T} and V support(a) [y,x)). Since and C T} support(a) : for some universal constant let eW, where g definition equal to the m+ J-t These technical classes. prove the result for !Ft, and we omit the proof for Qt as C T} by dimension and the uniform covering {1,2,... ,p}, \T\ = {a'{MP,u)~ij^iP{u),u)) Consider the classes of functions support(/3) VC interest. dimension of relevant functions St = {QV^(/3(u),") have their These technical ei. (see, e.g., [38]). Consider a fixed subset 6. and inequalities for a collection of empirical processes indexed begin with a bound on the Lemma Here we exploit the tech- to control the empirical errors eo on the concepts of the their usage rely D and condition E.5 holds. 1 VC The f{Z) < t}, index of J^t f e Tt^ t is < by E^. We have that {{Z,t):f{Z)<t} . - " Thus each set {{Z,t) : t) : g{Z) < t}. {iZ,t):g(Z){l{h(Z)<0}-l{piZ)<0})<t} U U {iZ,t):h{Z)<0,p(Z)<0,t>0}u {{Z,t):h(Z)<0,piZ)>0,g{Z)<t}U U {{Z,t):hiZ)>0,piZ)<0,giZ)>t}. f{Z) < complements of the basic and {[Z, = = sets {Z {{Z,t):h{Z)>0,p(Z)>0,t>0}U t} : These basic is created by taking h{Z) > sets 0}, form {Z p(Z) < finite VC : classes, unions, intersections, and 0}, {t > 0}, {(Z, each having VC t) : g{Z) > t], index of order m. , PENy\LIZED QUANTILE REGRESSION VC Therefore, the order as the sum index of a class of sets {(Z, of the der Vaart and Wellner VC [38] t) : f{Z) < indices of the set classes formed Lemma Lemma For any 7. ^m,n,p m< e i by the basic R is of the same VC sets; see van D for function classes m-dimensional subsets of a p-dimensional all f G Tf, 2.6.17. Next we control the uniform L^ covering numbers the union of t}, 43 generated by taking set. n, consider the classes of functions = {a'{MP,u) - ^p^{(5{u),u)) Qm,n,v = : G RP, /? < m, a G ||/3||o and S(/3)} iaeS;^}, {a'i>i{P[u),u) with envelope functions Fm,n,p o-nd Gm,n,p- For each > e /16e\^('^'"~^) [ SUp7V(e||Fm,„,p||Q,2,-^m,n,p,i'2((5)) ^ SUpiV(e||G^,„,p||Q,2,em,n,p,i^2(Q)) <C[ <^ j ( ep\^ — \ /leeN^^'^™"^^ /en for some universal constants Proof. Let components. C and -^ c. It follows that implies that the covering its VC number dimension of J-t is at most bounded by is cm by Lemma ,, < In turn this 2(cm-l) ,. [38] Theorem 2.6.7. Since {ep/m)"^ different restrictions T, the total covering number is Lemma to control the empirical errors eo 8 (Controlling error ei). We < n. ej as defined in (A. 7). have that in the linear quantile model (2.1) ei{r,m,n,p) <p v'mlog(n V p) uniformly in r and m. and . max | V'(/?(77i), y^(p{m)j . we bounded according the statement of the lemma. The proof for Gm,n,p follows similarly. 
Next we proceed .- < C(cm)(16e) cm C is an universal constant, see van der Vaart and Wellner have at most (^) 6. non-zero . '-|^\ N{e\\FT\\Q,2.J'T.L2{Q)) where m J-t denote a restriction of !Fm,n,p for a particular choice of D BELLONI AND CHERNOZHUKOV 44 Proof. By 7 the uniform covering Lemma 18 Since we have of J-n,m,p that uniformly in - |q' (t/.^(/3,^i.) t/<i(/3(u), u))\ E„[/2] (A. 10) number m,n,p) < supy^^^^ ei(r, < En using the definition of = [\a'x,f] <?i(m,) and ''^"'~ ^ Prom Lemma (ep/m)'". Using m<n \a'x^\ \l{y, < |(G„(/)|. bounded by C(16e/e) is |G„(/)|<p\/mlog(r!,Vp)max| sup (A.9) we have definition of ej < x'^p} - Combining < l{y, and E[/2] < E (i>{m) ip{m). E[/2]V2, sup x;/?(u)}| [|Q-'a;,|2] E„[/2]i/2l sup < (A. 10) with (A.9) < ,, \a'x,\, >^[rn), we obtain the result. a Next we bound Lemma eo using the same tools we used 9 (Controlling empirical error eo)- <p eo[m,n,p) uniformly in bound to ei. In the linear quantile model (2.1) we have ^m log(n V p) max | ^/ip{m) , \/(p(m)| m < n. Proof. The proof is similar to the proof of Lemma S with Qm,n,p instead of thatfor^ e g„,,n.p^Qh&ve¥,rAg'^]=En\{a'^,{P{u),u)f]=Kn [{a'xif{\{y^ < x'^Piu)} < E„ [[a'x^f] < Alternatively on 4){1) 4){m) for a e §™. we could bound instead of proceed to bound Lemma all eq using Note J-m,n,p- - D , Theorem 5.2.2 of Stout [34] to achieve a 0(m) by making additional assumptions on the covariates dependence x^^-j. Now we £2. 10 (Controlling linearization error e2(r, (o)- m, n,p) < v/nyi^(m) We f have that in the linear quantile model y'v?(m,)/r A 1 ^ uniformly in r Proof. By > and m < definition eoir.m.n.p) u)^] = = n. we have sup^gR(^„) ,,,gg(^) vn|E[aVt(/3, sup^6fl(r,m).a6S(/3) \/H|E[(a'2;, ) «)] (l{y, - E[aVj(/3(u),u)]| < x'^p} - l{y, < x[P{u)})]\. PENALIZED QUANTILE REGRESSION By the Cauchy-Schwarz inequality the expression above v^ By sup JE[(q'x02] aeS^ definition v'(m) - l{\y, x'^P{u)\ E[(l{y, < < = - - I3{u))\}, 1{?/. < bounded by ^E[[\{y^<x'^l3]-\{y,<x'^(3[^)]Y]. /36R(r,m) sup^^gm E[(Q'xi)^]. Next, since [x'iiP x[P} sup is 45 | < \{y^ - x'^P} < l{yi x[P{u)] \ < we have x',0iu)})^] = E [|l{y, < </?} - l{y, < x'^P{u)}\] < E l{\y,-x',p{u)\<\x',iP-0{um] < E [/^i^yj(l')')| fyiAt + x'Mu)\x,)dt A < (2/11^ - P[u)\\ sup^gsjr E [\a'x^\]) A 1 < (2rf/^{^) A l] 1. n A. 3. Lemmas on maximum Sparse Eigenvalues. fc-sparse eigenvalues that are Recall the notation for the unit sphere unit sphere Sp maximum We = {q G M^ : ||a|| /c-sparse eigenvalue of begin with a lemma = In this section £ > 1 11. Let M = S"""-^ 1, ||a||o M, namely < {a G IR" k}. For a (pM{k) = = ||a|| : lemmas on the matrix sup{ a' M, Ma : and the 1} let earlier. fc-sparse 4>M{k) denote the a £ §p that establishes a type of subadditivity of the }. maximum sparse . be a semi-definite "positive matrix. For any integers k and ik with we have (i>Mm< Proof. Let a We collect used in some of the derivations presented eigenvalues as a function of the cardinality. Lemma we achieve 4>M[£k). Moreover can choose ai's such that Since M is ||Qi||o < ' \(i\<t>M{k). let YllJ\ ^i k since \C\k positive semi-definite, for any i,j > ik. =^ such that • •' -" .^" ', '.. SUi ll^i'tllo — l|o'||o- ' ' we have a[Mai + a'.Maj > 2 |Q'Mqj| . BELLONI AND CHERNOZ?IUKOV 46 Therefore, , . we have = a'Ma = J2^^^^i + Y.J2'^^^^J (l)M{ik) • , '- : •' ...;'•: i=l m ; ' < : . - ^ ... mEi!«'ii'<^M(iia,iio). . . 1=1 Note that E,=i The 11^,1^ = 1 lemmas following and thus < <PA./(£fc) max,=i,...jf] 0a/(||q,||o) \i] characterize the behavior of the We case of correlated Gaussian regressors. 
Let (T^(/c). 12. Consider 0(fc) be the Xi = T}l~Zi, where ,fi{k) n X p matrix Gk {a, a) it < a(k) ^ a'Za/^, where 11^^11= sup z-, i (a, q) = Lemma [38] results A. 2.1) (^1 < > N{0,lp), p n, for k [sio;^], we have that P{||5i.|| < Then with n. + /kJ^V^^ S"-^ Note \aZ a/ y/n] = for the and sup^ggt tt'Ea < n, l,...,n as rows. G S^ x Using Borell's concentration inequality ner ~ Zi < suffices to establish the result for k collecting vectors (p{k) , maximal k-sparse eigenvalue o/ E„ 11 sup n/2. Let Z be the Consider the Gaussian process that J a'E„[ziz[]a= J (f){k). Gaussian process — median||5/;|| > r} (see < van der Vaart and Well- e^"'" ". Also, by classical on the behavior of the maximal eigenvalues of the Gaussian covariance matrices German y/k/n) [13]), as — limit in 0. n -^ oo, for any k/n Since k/n [0, 1]. \/kn/n) < fore, for any ro 0. lies Therefore, within [0, 1], — > 7 G [0, 1], we have that limt „(median]|5^-|| (see — 1 — any sequence k„/n has convergent subsequence with we can conclude This further implies > the for by establishing an upper bound on start probability converging to one, uniformly in k Proof. By Lemma \i](pM{k)- maximal sparse eigenvalue that holds with high probability. Lemma < that, as hmsup„ n — > 00, limsup;.^_„(median||^fc^ supj,<„(median]|5^,|] there exists no large enough such that for — all 1 — \/k/n) < n > no and 0. all — || 1 — Therek < n, PENALIZED QUANTILE REGRESSION P1 11^^. > II 1 + sjk/n p\ over k < n we by 2^_j = %Jck/n\ogp one, uniformly for sup^ggjc a'Ea < no, > 1 + > 1 + Jk/^ + r,+ro\ < f^ + r^ + ro ^/k/^ — > for c as > all containing Zi < (^6-""^/^ I (fle""''^'- V^V fc=l n -^ sup^ggit k: (J~(k), and using that 1 We cx). (^) < p^ we bound the , J a'¥.n\ziz[]a a{k){l + right side conclude that with probability converging to < 1 + \/k/n^\ogp. Furthermore, since we conclude that with probability converging J a'Er,[x^x'^\a < sup^gg^ for all k: subvectors of (^) n > sup Ja'Enlz^z'^a "€S^ e'^~'^^'^'°SP There are at most obtain fc=l r/c that for sup ^a'¥.n\ziz'^a TpI Setting e""'' /^. [ we conclude k elements, so Summing + r + To < 47 to one, uniformly D y'k/n^/logp). Next, relying on Sudakov's minoration, we show a lower bound on the expectation of the maximum shows that the upper bound result Lemma 4>{k) fc-sparse eigenvalue. be the we have (1) Consider 13. — is do not use the lower bound sharp where Y}l'^Zi, in the analysis, terms of the rate dependence on in ~ Zi maximal k-sparse eigenvalue o/E„ N{0,lp), and [x^a;^], inf^^gic < n < for k fc,p, a'Ea > but the and n. a^{k). Let Then for any even k p. ' ' that: • E ,/m]>^^/{k/m^^EiP^)andi2) X Proof. Let supremum (A. 11) : h-> a'Xa/y/n^ where of this Gaussian process sup '.' Hence we proceed the lower Of) bound (1) \a' X a/ ^\ = in three steps: In Step {a, a) Xj, >p i = G S^ x ^^(fc/2) log(p 1, . , . . S""-*. n as rows. Note that /c). Consider \/4>{k) is sup 1, J a'En[xix'^\a ^ J 4>{k). we consider the uncorrelated ;'-,.•,.; ,. case and prove on the expectation of the supremum using Sudakov's minoration, using packing number. In Step (2) " > a lower bound on a relevant packing number. In Step the lower bound '' /^ be the n x p matrix collecting vectors the Gaussian process (a, the Xi We 3, we generalize Step on the supremum itself 1 2, we derive the lower bound on the to the correlated case. In Step 4, we prove using Borell's concentration inequality. - 1 ' BELLONI AND CHERNOZHUKOV ,48 Step we consider the In this step 1. minoration. By sup^ggfc Za, where a £'[sup|^ggfc We a = fixing (1, . . , . 
Zq ;= i—* = ^a^Zt - Zs) = V^E[(Zt - d), Z,)2] = ^E1E„ the largest number from below of elements in the i = (ii, remaining p T for e . . . — , = ip) -4=. In order to do this number € Sp such that = ij we - [{x[{t with respect to the metric d that can be packed into Sp, see € sup^^g^ E„[x^q] We will consider the standard deviation metric on Sp induced by Z: for any t,s € Sp, Consider the packing number D{e, Sp, s,t > y/(p{k) a Gaussian process on Sp. is E„[2;Jq'] / and show the result using Sudakov's S"~\ we have l)'/\/H € ,. [10]. = IT"! \\t - s\\/V^. We will bound the packing k components and (^) ^ \\s-~tf-t\t,-s,\'> of such elements. Consider t in at Z 1+ j€svipport(t) ji€ \support(s) support(s)| > k/2 for every s,t eV. By the inequality Using Sudakov's minoration support (s) I > 2^1 = I. (A.12) Theorem ([12], 4.1.4), \V\ > proving the claim of the In this step lemma we show that \V\ > [p — : tj - l/Vk}, which has |support(<) \ support(s)| < k/2}. shown below maxfgr By |-A/'(t)| — [p > k)^/'^. we conclude that k)/{2n), /. k)^^'^. t £ i construction < d) T with the set support(^), where support(^) = cardinahty k. For any e T let Af{t) ^ {s e T convenient to identify every element {j e {I,.. .,p} S= for the case |support(i)\ we have that D{l/y/n, Sp, EfsupZJ >sup^yiogi?(e,S^,d) > J\ogD{l/^,%,d) > \jk\og[p - Since as any \support(() Furthermore, by Step 2 given below we have that It is = most k/2 elements. In V be the set of the maximal cardinality, consisting of elements in T such that 2. 1^ T • 7=1 Step , restrict attention to the collection such that the support of s agrees with the support of (A.12) \V\. bound of disjoint closed balls of radius l/\//c for exactly k components. There are = s))^]] this case Let = Za] from below using Sudakov's minoration. d{s,t) e S= case where .. K := : we have that maxter (fc/2)(''fc/2 ) ^°^' every t, \J^{t)\\V\ > \T\. we conclude that \V\>\T\/K={l)/K>ip-k)'^/^. It remains only to show that |A/'(i)| < {k/2l^~kl2 ) Consider an arbitrary any k/2 components of support(i), and generate elements s G A/'(i) t E T. Fix by switching any of the remaining k/2 components in support(i) to any of the possible p — k/2 values. This PENALIZED QUANTILE REGRESSION gives us at all most {^~^/2 other combinations of combinations is the construction Step The 3. elements s S N'{t). Next ^'^^^ ) (j|./2)- we conclude that S case where I'^ this \-N'{t)\ way we generate every element < {k/2)i^'k/2 Using Borell's concentration inequality A. 2.1) for the E[\/(l){k)]\ > supremum r} < Lemma 1 and 14. new metric, d{s,t) (see ||s - i||o < = 2fc. van der Vaart and Wellner we have [38] Lemma P{\\/<f>[k) — D which proves the second claim of the lemma. Next we combine the previous lemmas Examples since of the Gaussian process defined in (A. 11), 2e~"'" Z^, G M{t). From satisfies dis,t)>a{2k)\\s-t\\/^ 4. s )• / follows similarly noting that the 7^ ^a'^{Zt-Z,) = ^E[{Zt-Zs)''l Step us repeat this procedure for let k/2 components of support (t), where the number of such initial bounded by 49 to control the empirical sparse eigenvalues of 2. For k < n, under the design of Example we have 1 ^(fc) ~p +J 1 °^^ . For k <n, under the design of Example 2 we have Proof. Consider Example Let Xi-i denote the ith observation without the 1. ponent. Write first com- ,; E„ 1 E„ [x, We first bound 4>N{k)Lemma 12 implies that [x',^_, E„ = E„ Letting N^i^^i <p (?!>Ar(A;) 1 + because (pN-i-Ak) >p yJ{k/2n)\og{p - \\b\\ = a'^ < l [a;t,_ix- ^k/ny^logp. Lemma 13 M + N. 
_i we have Xi,_iXj_j (f)N{k) = 4>N-i _i(^)- bounds (p^ik) from below k). We then bound 4>M{k)- Since Mu = 1, we have let w = {a,b'y achieve 0m(^) where a G IR, 6 e = ^l < 1, \\w\\o < k. Note that \a\ < 1, 4>M{.k)^w'Mw = E„ _i] \a\'^ (?I>m(1) > IR''""'. By ||5||i < 1- To produce an upper bound definition we have ||u;|| ^\\b\\. Therefore + 2ab'En[xi^li]<l+2b'En[x,^^i] + 2||6||i||E„[a;,,_i]||oo<l + 2v^||6||||E„[xi,_i]||oo. ^ ''r 1 = 1, BELLONI AND CHERNOZHUKOV 50 Next we bound + (f)j^{k). The intercept. The proof fixed, the On — (p. _i] |E„ E„ Since [xij]|. ': . + (pik) — cr^(/c) = sup^^^h > 1 Example V (p^^-^ 2 is similar sup^ggt a'T,a < (1 as for the uncorrelated design to hold. a' Ma + a'Na < an -lik) since the covariates contain with the same steps. Since — 1 S + |/3|)/(1 - |p|) = inf^ggt allows for the (4.1.5) Lemma p < 1 is Lemmas to apply and a~{k) < 12 a'Sa > same bound D . A. 4. Maximal Inequalities for a Collection of Empirical Processes. is = , a'{M + N)a = sup^ggt To bound 4)m{^) comparison theorem result of this section for j by using the bounds derived above. the design of \p\)- N{0,l/n) ' bounds on the eigenvalues of the population design matrix |/3|)/(1 ^ [xij] <p y/jljnjlogp. Therefore we have 0a/ (/c) <p ||oo ' Note that result follows for [a;^ the other hand, (p{k) and 13 are given by ^(1 ||E„ n'iaxj=2,...,p ' we bound Finally, = • + 2y/k/^^/E^. 4>M{k) ||oo by (4.13) we have 2,... ,p, 1 ||E„ [xi^-i] ' The main maximal inequality that controls the empirical 18, stating a process uniformly over a collection of classes of functions using class-dependent bounds. We need lemma, because the standard maximal inequalities applied this function classes yield a single class-independent We prove the main result by stating first bound that Lemma is 15, giving to the union of too large for our purposes. a bound on tail probabilities of a separable sub-Gaussian process, stated in terms of uniform covering numbers. Here we want to explicitly trace the impact of covering numbers on the tail probability, since these covering numbers grow rapidly under increasing parameter dimension. Using the sym- metrization approach, we then obtain Lemma 17, giving a bound on tail probabilities of a general separable empirical process, also stated in terms of uniform covering numbers. Finally given a growth rate on the covering numbers, we obtain our final Lemma IS, which we repeatedly employ throughout the paper. Lemma 15 (Exponential Inequality for Sub-Gaussian Process). zero-mean separable process {G(/) : / S with a L2{P) norm, and has envelope namely for each g G J- — F J"), . whose index includes zero, is is equipped sub- Gaussian, J-: i2 D T Suppose further that the process P{!'G(5)|>r?}<2exp(-^77Vi?Hg||p,2) with set Consider any linear a positive constant; and suppose that we have for any r? > 0, the following upper bound for the : PENALIZED QUANTILE REGRESSION 51 T uniform L2 covering numbers for < sup7V(e||F||Q,2,^,i2(Q)) n(e, J^,L2) for each > e 0, Q where n(e,^, L2) is decreasing in 1/e. — increasing in 1/e, and €\/logn(e,^, L2) K Then for > D, for — as 1/e > some universal constant < c cx) > anrf is 30, p{!F,P) : = SUP/e^||/||p,2/||F||p,2, sup^g^|G(/)| I^I|P,2 /o"'^'''^^' The Lemma result of page 302, on common e-in(e,^,L2)-{(-^/^)'-i>de. similar in spirit to the result of is Ledoux and Talagrand [23], of a process stated in terms of Orlicz-norm covering numbers. 15 gives a tail probability stated in terms of the uniform L2 covering numbers. 
The reason for 15 tail probabilities Lemma However, >cK'^ < V^ogn{x,J^,L2)dx is that in our context estimates of the uniform L2 covering numbers function classes are more readily available than the estimates the Orlicz-norm covering numbers. In order to prove a we need to bound on tail probabilities go through a symmetrization argument. Since we use data-dependent threshold, we need an appropriate extension Let us call of a general separable empirical process, a threshold function x of the classical symmetrization : IR" IR /c-sub-exchangeable 1-^ lemma if in w, we have that = particular x{v) x{v) \\v\\ \/ x{w) with k = > Lemma and constant functions with k \/2 random threshold x that lemma = is for probabilities x{Zi, . . , . with components in v . , . . — The following result (Lemma 2.3.7 of [38]) to 1. sub-exchangeable. 16 (Symmetrization with Data-dependent Threshold). dependent stochastic processes Zi, x{Z) Z„ and arbitrary functions Consider arbitrary /Lii, . . . , /i.„ denote the r quantile of x{Z), pr := P{x{Z) < g,-) > and pr r, = P{x{Z) have i=l where xq is : JT Zn) be a k-sub-exchangeable random variable and for any t G > T xo Vx(Z) and [x{v) \/x{w)]/k. Several functions satisfy this property, in generalizes the standard symmetrization the case of a K" any v,w E for any vectors v,w created by the pairwise exchange of the components to allow for this. < —P Y^ej{Z, - PT 1=1 a constant such that infjgjrP (E"=i > Hi) 4/c r Zi{f)\ < ^) > xoVxiZ] 1 - ^. < »-* ]R. in- Let (0, 1) let q-r Qr) < r. We ' BELLONI AND CHERNOZHUKOV 52 Note that we can recover the setting fc = 1, pr = 1) = and p^ symmetrization lemma where threshold classical underlying e„{!F, Zi, . i.i.d. , . . Zn) by from combining the previous follows Consider a sep- 17 (Exponential inequality for separable empirical process). G„(/) = arable empirical process fixed /,'„ The next lemma 0. two lemmas. Lemma is ^^"'^'^ K data sequence. Let ~ I3"=i{/(^i) > and r £ \ random be a k- sub -exchangeable E[/(Zi)]}, where Zi,...,Zn (0,1) be constants, and is e„(J^, P^) an — variable, such that - ' rp(.?-,Pn)/4 I J\ogn[t,J^,L2)de |lF||p„,2 / < T Jo > Lemma as in Finally, our main 15, then <-Ep( P<^sup|G„(/)|>4fccA'e„(^,P„) result in this section is 1, . . . is an underlying ,n with envelopes Fm = as follows. er,(.Fm,Pr2) 1 = and u > all = m 'n~^^'^ is 1, . . Y17=i{fiZi) " Consider a E[/(Z,)]}, where (nVp)'"(K/e)""^, 1. For a constant < . ,n, and with upper bounds on the C sup < < e := (1 + ||/i|p,2, 1, \/2t')/4 set sup K > .for all m. a large enough constant ||/||p„,2 \/2/S, for n sufficiently large, the inequality sup |G„(/)| holds with probability at least Now we prove Lemmas 1 — < 4V2cKe^{J^m, 5, 15, 16, 17, Pn), where the constant and 18. = by = CJm log(r) Vp) max Then, for any 5 £ (0,1), there = Gn{f) supjg_;r^ \f{x)\,m n(e,.f„,,L2) some constants k > +T. data sequence, defined over function classes !Frn,Tn i.i.d. uniform covering numbers of J-m given for with e-'n{e,T,L2r^^ ''Ue Al j 18 (Maximal Inequality for a Collection of Empirical Processes). collection of separable empirical processes Zi,...,Zn ^ f^T for the same constant c Lemma < -(4fcc/^e„(^,P„)) e„(.F,P„) and snpvar^f c is the < n, same as in Lemma 15. PENALIZED QUANTILE REGRESSION Proof of [36], Lema^ia 15. The proof follows by specializing 53 arguments given van der Vaart page 286, to the sub-Gaussian processes and also tracing out the bounds on tail prob- abilities in full detail. Step 1, is . . .} 1. 
There exists a sequence of nested partitions of j^, {{^qi, where the q-th partition consists of sets of I/2(P) radius at the largest positive integer such that 2~'° < balls of To construct the here: L2{P) radius e.g. = 1, most P)/4 so that p{J-, such partition follows from a standard argument, we repeat i qo van der Vaart q-th partition, cover J" with at . . = > The 2. qo,qo + where go ||F||p,22~'^, existence of page 286, which [36], = most Uq and replace these by the same number ||F||p,22~'' ,Nq),q . n(2~'',^, L2) of disjoint sets. If the sequence of partitions does not yet consist of successive refinements, then replace the partition at stage q by the set of most into at Let = A''^ Uqg -Uq all intersections of the form sets, so be an arbitrary point of fqi we can of the process, ~ J-qi. N^ — Set iTq{f) — E and |G(/)| < E^<,o+i "^ax/ 00 Step 2. By E^4< Setting 21ogA^g rjq > |G(7r,(/) E E 7r^o(/) = GK(/)-7r,_i(/)), - ^,_i(/))| + max/ |G(^,J/))|. Thus /- P -. max|GK(/)-7r,_i(/))|>7?J P|max|G(^,„(/))|>r?,„ -* 9=90 + 1 •- chosen below. ' - 7r,_i(/)||p,2 < 2||F||p.22-(^-i) < NqNq-i > logUq, using that q 1-^ '. 4||F||p,22-^ for g 8A'||F||p^22"'v^log7Vg, using sub-Gaussianity, setting log of the process (?=go + l ' — separability we can decompose / — set only. In this case, construction of the partition sets ||7r,(/) By J-q^. supremum norm GK(/))-GU,_:(/))= 00 9=90 77g / G if fqi 9=90+1 for constants This gives partition ^9-i(/))- Hence by linearity GU)-G{^q,U))= '{sup|G(/)|> ^^^ .Fjj. Yl^^qa lognj. replace !F by Uq^ifqi, since the can be computed by taking this E^go+i('^'?(/) that log rf^ logn^ is > K . go + 1- > D, increasing in q, ' using that and 2~* < 54 : . BELLONI AND CriERNOZHUKOV . p(^, P)/4, we obtain OO • • <- DO -v . ^ . P J2 =90 + 1 < max|G(7r,(/)-^,_i(/))|>77J , oo „ . ' : 7V,iV,_i2exp (-77,V(4D||F||p,22-^)= 9=90 + 1 -' -^ '^ < - : Yl N,N,^r2exp{-{K/Df2\ogN,) 9=90 + 1 E ^ ' .. ^ ,, ; 2exp(-{(K/Z?)2-l}log(A^,A^,_i)) OO '< ^..-^ , 2exp(-{(A7D)'-l}logn,) /oo • • - . 2exp(-{(A-/Z?)2-l}Iogn,)dg / '90 . rp{T.P)/4 Jo By < Jensen's inequality y^oglV^ a, := J2'j=qo oo oo E+ ?=qo Letting bq = 2 V^ogUq, so that ^ ^9^8 2~', noting a^+i A'||F||p,22-%. 9=90 + 1 l — — aq sJlogUq+i and 6^+1 — — bq —2"'', we get using summation by parts oo CO ^ 2"''a, = g=go + l (6,+i - 6,)a, = - 00 . ^log 7T.(jo+i Using that 2"''\/logn^ . 2 2-'yi^< f; Using a change of variables and that oo Step 3. Letting r]q^ A'||F||p,2p(J^, * as g — + 2-^0^^, 9=90 + 1 oo, so that — cig^gl^+i decreasing in g by assumption, •^'JO 2~'>° 16 = is — ^ r2-'^^logn[2-i,J^,L2{P))dq. 2 9=90 + 1 00 / ''y^logn^ 2 o,; =2 2.2-('+i)ybi^^ 9=90 + 1 where we use the assumption that - \ ^ 2.2-"°+iVlogn,„+i+ V 2 6,+i(a5+i 9=90 + 1 \ / 2-(9o+i) ^ - a,&,|^+i ( 9=90+1 = oo f ^ - < p{J-, -P)/4, we finally conclude that /•p(^,-P)/4 P) ^2 log Nq^ recalhng that Nq^ , = n^^, using ||7rqo(/)||p,2 < PENALIZED QUANTILE REGRESSION 55 and sub-Gaussianity, we conclude |F||p,2 > p{ma^|G(7r,„(/))i <r 2exp ( ^ Jqo-l = Also, since Uq^ < Vgo} Vexp {-{K/Dflogrig) -{{K/Df -l}\ogna)dq= ' n(2 '°,^, F), 2 ''o < r < 2exp[-{iK/Df - l}logn, {x\n2)-^2n{x,J',L2{P)r^^'^^^^^-^Ux. ' •/p(^,P)/4 p{J^,P)/A, and n{x,!F,P) is increasing in 1/x, we obtain; =4i^||F||p,2[p(^,P)/4]^21ogn(2-9o,^,P) <4V2A'||F||p,2 775, ^\ogn{x,J^,P)dx. J/U Step tail 4. Finally, bound stated using c = adding the bounds on in the 16/log2 main + 4\/2 < ^ adding bounds on text. Further, 30, < 77, probabilities from Steps 2 tail rjg and 3 we obtain the from Steps 2 and 3, and we obtain ^\ogn{x,:F,L2{P))dx. 
cK||F||p,2y^ 9=90 D Proof of Lemma (page 112) in [38] 16. The proof proceeds analogously with the necessary adjustments. Letting to the proof of q^ Lemma 2.3.7 be the r quantile of x{Z) we have F > XoVxiZ)} < P{x{Z) > ! =1 first Z = pendent copy of Z such that x{Z) |X]r=i ^i{fz)\ > term of the expression above. Let (Zi, > q-r . . . and pr, > ||Z!r=i ^iWj^ = Pt/2- . , . . F„) be an inde- E(^' 1=1 {Z : and using the triangular inequality we f left > T xq } > hand x{Z) > ^'^ such that . I Therefore, over the set J- . we have inf/g;rF{|Er=i^i(/) < T^Py (V'l, V x{Z). Therefore 3fz £ Z ; . by Bonferroni inequality we have that the pT —Pt/'2' xq xq\/ x(Z). Conditional on such of xo Y — ,Z„), suitably defined on a product space. Fix a realization have that By definition <qr}. .r Next we bound the of >Xoyx{Z)) +P{x{Z) Qr, T l-Pr/2. Since Fy- {x(y) side is < qr) = bounded from below by qr, ||E"=i ^illjr V x{Z) V x{Y) > ^o Va;(Z)} we have BELLONI AND CHERNOZHUKOV 56 Z we Integrating over \p{x{Z)>qr, obtain >xoVa;(Z)^ < PzP^ E^. E(^' - Vx(Z) xo > ^^; \l x{Y) T Let £i, ...,£„ set {Yi Y £i, and . ,£n, . . be an independent sequence of Rademacher random variables. Given — Yi,Zi = Zi) if ^i = 1 and {Yi — Z^, Z^ = Yi) Z by pairwise exchanging their components; {y, Z) has the same distribution as Y.^Y, - PzPy > Z,, -^^ -^'^]^ -^^'-^ ^ if = e^ —1. That is, we ei, . . . , £„, create vectors by construction, conditional on each (F, Z). Therefore, = E^P^Pr E(^' - xo \l x{Z) \l x{Y) ^') T By x(-) S.PzPy being /c-sub-exchangeable, and since Si^Yi vx(Z) vx(y) xn > Y.^y^-Z^) -^ ^-:^ -^ — Zj) = „ „ „ < E.PzPy , \ (Fi — we have that Zi), Xo Vx(Z) Vx(y) Y.^Ay^-2c> 2k r By the triangular inequality and removing x{Y) or x{Z), the latter P ^£,(y, - > /!,) •TO V xjY) ^£,(Z, -/ij 4k xq V x(Z) > 4k 1=1 .;^ bounded by is J^ a Proof of Lemma 17. We would is ^n -1/2 separable empirical process G„(/) Gaussian; here Zi,...,Z„ apply exponential inequalities to the general like to an underlying E7=l{fiZ^) - E[/(Z,)]}, which i.i.d. the standard symmetrization approach. Indeed, G°{f) i.e., = P{£i we data sequence. To achieve this we use first n~^^'^Yl?=i{^if{^i)}^ where £!,...,£:„ are = 1) = P{£i = — 1) = 1/2, not sub- is introduce the symmetrized process i.i.d. Rademacher random which are independent of Zi, probabihties of the general empirical process are bounded by the . , . . Z„. variables, Then the tail the tail probabilities of symmetrized process using the symmetrization lemma recalled below. Further, we know that by Hoeffding inequality the symmetrized process with respect to the L2(Pn) norm, where ¥„ result. By . is is sub-Gaussian conditional on Zj, . , . Z^ the empirical measure, and this delivers the . the Chebyshev's inequahty and the assumption on e„(.F, P„) T fixed in the statement of the P(|G„(/)| . we have for the lemma > 4kcKeni:F,Vn) ' ) ' < ^"P/ ^"'^P'^"(-^) ^ - (4kcKe„{^,Fn)f sup^g^varp/ (4fcc/<'e„(^,P„))2 ^ - constant > , PENALIZED QUANTILE REGRESSION Therefore, bj^ the symmetrization > sup |G„(/)| Lemma 4fcci^e„(^,P„) We then condition on the values of Zi, as Pg. Conditional on Zi, is sub-Gaussian . , . . . < -P . sup > ci^e„(^,P„) |G°'(/)| + r. I G° Z„, by the Hoeffding inequality the symmetrized process forgeJ^ — J^ L2(Pn) norm, namely for the I Z„, denoting the conditional probability measure , . we obtain 16 I 57 P,{G°(5)>x}<2exp(-^^^ \ Lemma Hence by D= 15 with 1, -^ Il5llp„,2 we can bound < sup|G°(/)|>c/re„(F,P„) e-in(e,^,L2)-^^-^>de Al. 
j^ The result follows from taking the expectation over Zi, Proof of Lemma The proof proceeds 18. in two . , . . Zn- with the steps, step containing first the main argument and the second step containing some auxiliary calculations. Step 1. we prove the main In this step result. First, n{e,J-m, L2) satisfies the monotonicity hypotheses of C= sup/gj-^ i|i^m||p„,2 (1 + V2u)/4. Note that supjgjr^ ll/liiPn,2/ll-P'm||p„,2 > ll/l|p„,2 is < m 17 uniformly in ||/||p,2, < £ supyg^r^ ||F„||ip„,2 m < ll/l|p„,2} : = n: ^/m log(n V p) + / i— n. \/2-sub-exchangeable and p(J^m,P„) l/\/" by Step 2 below. Thus, uniformly in \/logn(e,J?^,L2)de / Lemma C^/m log(n V p) max{supjgjr„ Second, recall that e„(J^m,P„) := for we observe that the bound vm \og{K/e)de Jo Jo /'P(^m,P„)/4 < (l/4)^mlog(n Vp) sup ||/||p„,2 + ||-?V.||p„,2 ; ' < ^mloginWp) !S ^n l,.' 771 : ^n 1 and K < n Third, set K p)). Recall that m<n and / G P(|G„(/)| for n fl + v^) /4 \ '', 'i- : ' ' ' (/(f pVTb^, < Ide)^/^ {fP\og{K/e)dc)^^^ for 1/^ < sufficiently large. := ./2/5 > 4\/2cC > J-m, ||/||j.„,2 j which follows by J^ ^/log{K/e)de < p < sup , Jo feJ'n, ': ^vm]og{K/€)d€ / 1 1 B{K) so that where 1 < c < we have by Chebyshev > 4^/2cA'e„(^^,P„) ) < := (i^^ 30 is ._ i^ ^ 2/5, defined in and let Lemma Tm 15. = (5/(2mlog(n V Note that for any ' inequality '^^g^-f";^ (4v2c/i^e„(.fm,Pn))'^ " ~' ' : < , '/^ „,/^^" '\ , ^ (4v2cC)^mlog(nVp) < r„/2. BELLONI AND CHERNOZHUKOV 58 Using Lemma 17 with our choice of Tm, m < k n, > v > I, and p{J-m,^n) < I, 1, we obtain p| sup |Gn(/)| > 4%/2c/^e„(J^^,P„), 3 m ^ ^E < n| < V pi sup |G„(/)| V , M Step To 2. 2 log(n • . by our choice of B{K) and n In this step ll/llp„(^),2 {sup^,^^E„ ||/||Fn,2 is SUP/e^„ + ^"P/e^^ +E„ [/(Z,)2] li/l|p„{Z),2 \/2-sub-exchangeable, Next we show that latter follows > Y ll/llp„(z).2 ^ {^up^e^.En > {sup^e^^ Z, let be created by pairwise ^ [/(Z,)2] il/ll^„(z),2 supjgjr^ ll/llp„(y),2) +E„ Vsup^,^^ [/(>;)'] }'^' \\f\\KiY).2f'~ = = ||/||p„(y),2- p(J"rn,IPn) from E„ [F^] suP/ejr^ E„[|/(Zi)|2] 11/111(^,2}'^' [/(l')2]}'^' Vsupjg;r„ Vp) calculations. exchanging any componentsof Z and Y. Then ^/2 (sup^gjr^ {sup/e^„ i sufficiently large. we perform some auxiliary establish that supygjr„ 4x/2cA'e„(J-^, P„) ^0 Tn=l , > = := supygjr^ ||/||p„,2/||.Fm||p„,2 En[supj£_;r^ |/(Z,)p] < > l/\/n for m < n. The supj<„ sup^gj^^ |/(Zj)p, and from supjg^r^ sup,<„ \f{Zi)\'^/n. ACKNOWLEDGEMENTS We would Maugis for like to thank Arun Chandrasekhar, Moshe Cohen, Ye Luo, and Pierre-Andre thorough proof-reading of several versions of this paper and their detailed com- ments that helped to considerably improve the paper. REFERENCES [1] H. Ak.mke (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716-723. [2] R. J. Barro and J.-W. Lee (1994). Data set for a panel of 139 countries, Discussion paper, NBER, http://ww~s\'. nber.org/pub/barro. lee. [3] R. [4] A. Barro and X. Sala-i-Martin (1995). Economic Growth. McGraw-Hill, New Belloni and V. Chernozhukov (2008). On the Computational Complexity J. Estimators in Large Samples, forthcoming in the Annals of Statistics. York. of MCMC-based ^ PENALIZED QUANTILE REGRESSION [5] Chernozhukov A. Belloni and V. 59 Conditional Quantile Processes under Increasing Dimen- (2008). Duke Technical Report. sion, [6] D. Bertsimas and [7] M. BuCHlNSKY TsiTSlKLls (1997). Introduction to Linear Optimization, Athena Scientific. J. Changes (1994). in the U.S. Wage Structure 1963-1987: Application of Quantile Re- gression Econometrica, Vol. 
62, No. 2 (Mar.), pp. 405-458. [8] Candes and T. Tao E. than [9] V. Ann. n. Chernozhukov [lo; R. Dudley 111 J. Fan and (2005). Lv 35, Number 6, selector: statistical estimation when p is much larger 2313-2351. Extremal quantile regression. Ann. Statist. 33, no. 2, Uniform Cental Limit Theorems, Cambridge Studies (2000). J. The Dantzig (2007). Volume Statist. in 806-839. advanced mathematics. Sure Independence Screening for Ultra-High Dimensional Feature Space, (2008). Journal of the Royal Statistical Society Series B, 70, 849-911. [12: X. Fernique (1997). Fonctions aleatoires gaussiennes, ecteurs aleatoires gaussiens. Publications Centre de Recherches Mathematiques, Montreal., Ann. Probab. Volume [is; S. German Number [14 C. 2, Limit Theorem for the Jureckova J. No. Statistics, Vol. 20, He and Q.-M. Shao X. Norm of Random 8, 2, du 252-261. Matrices, Ann. Probab. Volume 8, 252-261. Gutenbrunner and Annals of [15- A (1980). Number 1 (1992). Regression Rank The Scores and Regression Quantiles (Mar.), pp. 305-330. On (2000). Parameters of Increasing Dimenions, Journal of Multivariate Analysis 73, 120-135. [16 Huber P. J. Robust regression; asymptotics, conjectures and Monte Carlo. Ann. (1993). Statist. 1, 799-821. [17: K. Knight (1998). Limiting distributions for Li regression estimators under general conditions. Annals of Statistics, 26, no. 2, 755-770. [is: [19 K. Knight and Fu, Koenker R. W. Asymptotics J. (2000). for lasso-type estimators. Ann. Statist. 28 1356-1378. (2005). Quantile regression. Econometric Society Monographs, Cambridge University Press. [20 R. Koenker and G. Basset 33-50. [21 R. Koenker and regression Journal of the [22 [23 P.-S. Laplace ;,, Machado J. Regression Quantiles, Econometrica, Vol. 46, No. (1978). " - ', American (1999). , ,,., . ..,„ ;„ Goodness of ,.,., fit .': January, ' and related inference process for quantile Statistical Association, 94, 1296-1310. (1818). Theorie analytique des probabilites. Editions Jacques M. Ledoux and M. Talagrand 1, ' (1991). Probability in Gabay (1995), Paris. Banach Spaces (Isoperimetry and processes). Ergebnisse der Mathematik und ihrer Grenzgebiete, Springer- Verlag. [24 R. Levine and Renelt D. American Economic Review, [25 N. Meinshausen and p. (1992). A Sensitivity Analysis of Cross- Country Growth Regressions, The ' Vol. 82, No. 4, pp. 942-963. Buhlmann (2006). High dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436-1462. [26: N. Meinshausen and B. Yu dimensional data. [27: (2009). Lasso-type recovery of sparse representations for high- of Statistics, Vol. 37, No. 1, 246270. Y. Nesterov and a. Nemirovskii (1993). Interior-Point Polynomial Algorithms ming, vol [28 The Annals S. 13, SIAM Portnoy in Convex Program- in nonstationary, dependent cases. Studies in Applied Mathematics, Philadelphia. (1991). Asymptotic behavior of regression quantiles BELLONI AND CHERNOZHUKOV 60 J. [29j Multivariate Anal. 38, no. S. PoRTNOY AND R. 100-113. 1, KOENKER (1997). The Gaussian hare and of squared-error versus absolute-error estimators. Statist. Sci. [30] [31] Volume 12, Number 4, 279-300. Recht, M. Fazel, and P. A. Parrilo Matrix Equations via Nuclear Norm Minimization, Arxiv preprint arXiv:0706.4138, submitted. B. X. X. Sala-I-Martin(1997). Just I Schwarz G. [33] A. N. Shiryaev (1995). F. (1978). Stout Ran Two Million Regressions, The American Economic Review, . ' (1974). ,. ' . , Estimating the dimension of a model. Annals of Statistics, [32] W. Guaranteed Minimum-Rank Solutions of Linear (2007). ' Vol. 
87, No. 2, pp. 178-183. [34] the Laplacian tortoise: computability 6, 461-464. Probability, Springer. Almost sure convergence. Probability and Mathematical Statistics, Academic Press. [35] R. TiBSHiRANi (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58, 267-288. [36] A. W. VAN DER Vaart Asymptotic (1998). Statistics, Cambridge Series in Statistical and Probabilistic Mathematics. [37] S. A. van DER Statistics, Vol. 36, [38] A. Geer (2008). No. 61-645. 2, W. VAN DER Vaart and High-dimensional generalized linear models and the J. A. Wellner (1996). Weak Convergence and lasso. Annals of Empirical Processes, Springer Series in Statistics. [39] C.-H. Zhang and linear regression. J. Ann. Huang Statist. (2008). Volume The 36, sparsity Number and bias 4, of the lasso selection in high-dimensional 1567-1594. Alexandre Belloni Victor Chernozhukov Duke University FuQUA School of Business 1 TowERviEw Drive Massachusetts Institute of Technology Department of Economics and Operations research Center Durham, NC 27708-0120 PO Box 90120 Room E52-262F E-MAIL: abn5@diike.edu E-MAIL: vchern@mit.edu . 50 Memorial Drive Cambridge, MA 02142