L2 BOOSTING IN HIGH-DIMENSIONS: RATE OF CONVERGENCE

YE LUO AND MARTIN SPINDLER

Abstract. Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of L2 Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called “post-Boosting”. This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by L2 Boosting. Another variant is orthogonal boosting, where after each step an orthogonal projection is conducted. We show that both post-L2 Boosting and orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. The “classical” L2 Boosting achieves a slower convergence rate for prediction, but no assumptions on the design matrix are imposed for this result, in contrast to the rates established, e.g., for LASSO. We also introduce rules for early stopping which can easily be implemented and used in applied work. Moreover, our results allow a direct comparison between LASSO and boosting which has been missing from the literature. Finally, we present simulation studies to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In the simulation studies post-L2 Boosting clearly outperforms LASSO.

Key words: L2 Boosting, post-L2 Boosting, orthogonal L2 Boosting, high-dimensional models, LASSO, rate of convergence

Date: March 1, 2016. Luo: University of Florida, USA, lkurt@mit.edu. Spindler: Max Planck Society, Munich Center for the Economics of Aging, Amalienstr. 33, 80799 Munich, Germany, spindler@mea.mpisoc.mpg.de. We thank the German Research Foundation (DFG) for financial support through a grant for “Initiation of International Collaborations”.

1. Introduction

In this paper we consider L2 Boosting algorithms for regression. Boosting algorithms represent one of the major advances in machine learning and statistics in recent years. Freund and Schapire’s AdaBoost algorithm for classification (Freund and Schapire (1997)) has attracted much attention in the machine learning community as well as in statistics. Many variants of the AdaBoost algorithm have been introduced and proven to be very competitive in terms of prediction accuracy in a variety of applications, with a strong resistance to overfitting. Boosting methods were originally proposed as ensemble methods, which rely on the principle of generating multiple predictions and majority voting (averaging) among the individual classifiers (cf. Bühlmann and Hothorn (2007)). An important step in the analysis of Boosting algorithms was Breiman’s interpretation of Boosting as a gradient descent algorithm in function space, inspired by numerical optimization and statistical estimation (Breiman (1996), Breiman (1998)). Building on this insight, Friedman et al. (2000) and Friedman (2001) embedded Boosting algorithms into the framework of statistical estimation and additive basis expansion. This also enabled the application of boosting to regression analysis. Boosting for regression was proposed by Friedman (2001), and Bühlmann and Yu (2003) defined and introduced L2 Boosting. An extensive overview of the development of Boosting and its manifold applications is given in the survey by Bühlmann and Hothorn (2007). In this paper we analyze the rate of convergence of L2 Boosting in a high-dimensional setting.
We show that L2 Boosting reaches a slower rate of convergence than the typical rate of LASSO, but without any assumptions on the design matrix. We also show that the so-called “post-Boosting”, which we introduce in this paper, and the orthogonal boosting variant both achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. The idea of post-Boosting is to apply ordinary least squares (OLS) to the model selected by first-step L2 Boosting. Orthogonal Boosting conducts after each step an orthogonal projection onto the space of the already selected variables. Boosting uses – compared to LASSO – a somewhat unusual penalization scheme. Penalization is done by “early stopping” to avoid overfitting in the high-dimensional case. In the low-dimensional case, boosting without stopping converges to the ordinary least squares (OLS) solution. In a high-dimensional setting, early stopping is key for avoiding overfitting and for the predictive performance of boosting. We give a new stopping rule which is simple to implement and works very well in practical settings, as demonstrated in the simulation studies. In a deterministic setting (e.g. in approximation theory) the boosting methods are also known as greedy algorithms (pure greedy algorithm (PGA) and orthogonal greedy algorithm (OGA)). In signal processing, L2 Boosting is essentially the same as the matching pursuit algorithm of Mallat and Zhang (1993). We will employ the abbreviations post-BA (post-L2 Boosting algorithm) and oBA (orthogonal L2 Boosting algorithm) for the stochastic versions we analyze.

As mentioned above, Boosting for regression was introduced by Friedman (2001). L2 Boosting was defined in Bühlmann and Yu (2003). Numerical convergence, consistency and statistical rates of convergence of boosting with early stopping in a low-dimensional setting were provided in Zhang and Yu (2005). Consistency in prediction norm of L2 Boosting in a high-dimensional setting was first proved in Bühlmann (2006). The orthogonal Boosting algorithm in a statistical setting under different assumptions was analyzed in Ing and Lai (2011). Rates for the PGA and OGA in the deterministic case are provided in Barron et al. (2008).

The structure of the paper is as follows: In Section 2 the L2 Boosting algorithm (BA) is defined together with its modifications, the post-L2 Boosting algorithm (post-BA) and the orthogonalized version (oBA). In Section 3 we present the main results of our analysis. The proofs are provided in the Appendix. Section 4 contains a simulation study which offers some insights into the methods and also provides some guidance for stopping rules in applications. Section 5 presents an application and, finally, we conclude.

Notation: Let x and y be n-dimensional vectors. We define the empirical average as $E_n[x] = \frac{1}{n}\sum_{i=1}^n x_i$ and the empirical $L_2$-norm as $\|x\|_{2,n} := \sqrt{E_n[x^2]}$. $\langle\cdot,\cdot\rangle_n$ denotes the empirical inner product, defined as $\langle x, y\rangle_n = \frac{1}{n}\sum_{i=1}^n x_i y_i$, and $\langle\cdot,\cdot\rangle_{n,2} = (\langle\cdot,\cdot\rangle_n)^2$. For a random variable X, E[X] denotes its expectation, and corr(X, Y) the correlation between two random variables X and Y. We use the notation $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$. We also write $a \lesssim b$ if $a \le cb$ for some constant c > 0 that does not depend on n, and $a \lesssim_P b$ if $a = O_P(b)$. For a vector U, supp(U) denotes the set of indices of its components which are not equal to zero. Given a vector $\beta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \dots, p\}$, we denote by $\beta_T$ the vector with $\beta_{T,j} = \beta_j$ if $j \in T$ and $\beta_{T,j} = 0$ if $j \notin T$.
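This notation can be made concrete in code. The following R helpers are a minimal sketch of the empirical moments defined above; the function names are ours and not taken from any package, and the later code sketches use the same quantities.

```r
# Empirical moments for n-vectors x, y (illustrative helper functions, not from a package).
En      <- function(x) mean(x)             # E_n[x] = (1/n) sum_i x_i
norm2n  <- function(x) sqrt(mean(x^2))     # ||x||_{2,n} = sqrt(E_n[x^2])
inprodn <- function(x, y) mean(x * y)      # <x, y>_n = (1/n) sum_i x_i y_i

# Quick check on simulated vectors:
set.seed(1)
x <- rnorm(100); y <- rnorm(100)
c(En(x), norm2n(x), inprodn(x, y))
```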
2. L2 Boosting with componentwise least squares

To define the boosting algorithm for linear models, we consider the following regression model:

(1) $y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \dots, n,$

with vector $x_i = (x_{i,1}, \dots, x_{i,p_n})$ consisting of $p_n$ predictor variables, $\beta$ a $p_n$-dimensional coefficient vector, and a random, mean-zero error term $\varepsilon_i$. Further assumptions will be employed in the next sections. We allow the dimension of the predictors $p_n$ to grow with the sample size $n$ and to be even larger than the sample size, i.e. $\dim(\beta) = p_n \gg n$. But we will require a sparsity condition: there is a large set of potential variables, but the number of variables with non-zero coefficients, denoted by s, is small compared to the sample size, i.e. $s \ll n$. This can be weakened to approximate sparsity, which is defined and explained later. More precise assumptions are also made later. In the following we drop the dependence of $p_n$ on the sample size and denote it by p if there is no confusion. X denotes the $n \times p$ design matrix whose rows are the observations $x_i$. $X_j$ denotes the jth column of the design matrix and $x_{i,j}$ the jth component of the vector $x_i$. We consider a fixed design for the regressors. We assume that the regressors are standardized with mean zero and variance one, i.e. $E_n[x_{i,j}] = 0$ and $E_n[x_{i,j}^2] = 1$ for $j = 1, \dots, p$.

The basic principle of Boosting can be described as follows. We follow the interpretation of Breiman (1998) and Friedman (2001) of Boosting as a functional gradient descent optimization (minimization) method. The goal is to minimize a loss function, e.g. an $L_2$ loss or the negative log-likelihood function of a model, by an iterative optimization scheme. In each step the (negative) gradient, which is used for updating the current solution, is modelled and estimated by a parametric or nonparametric statistical model, the so-called base learner. The fitted gradient is used for updating the solution of the optimization problem. A strength of boosting, besides the fact that it can be used for different loss functions, is its flexibility with regard to the base learners. This procedure is repeated until some stopping criterion is met.

Many different forms of boosting algorithms have been developed in the literature. In this paper we consider L2 Boosting with componentwise linear least squares and two variants, all of which are designed for regression analysis. “L2” refers to the loss function, which is the usual sum of squared residuals $Q_n(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2$ of regression analysis. In this case the gradient equals the residuals. “Componentwise linear least squares” refers to the base learners: we fit the gradient (i.e. the residuals) against each regressor (p univariate regressions) and select the predictor variable which is most correlated with the gradient/residuals, i.e. which decreases the loss function most, and then update the estimator in this direction. We then update the residuals and repeat the procedure until some stopping criterion is met.

We consider L2 Boosting and two modifications: the “classical” version, which was introduced in Friedman (2001) and refined in Bühlmann and Yu (2003) for regression analysis, an orthogonal variant, and post-L2 Boosting. As far as we know, post-L2 Boosting has not been defined and analyzed in the literature before.
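As a concrete illustration of model (1), the following R sketch generates a data set with p > n, an exactly sparse coefficient vector and (approximately) standardized regressors; the particular values of n, p and s are ours and purely illustrative.

```r
set.seed(42)
n <- 100; p <- 200; s <- 10                 # high-dimensional: p > n, sparse: s << n
X <- matrix(rnorm(n * p), n, p)
X <- scale(X)                               # center and rescale columns (variance approx. 1;
                                            # scale() uses the n-1 denominator)
beta <- c(rep(1, s), rep(0, p - s))         # exactly sparse coefficient vector, |supp(beta)| = s
eps  <- rnorm(n)                            # mean-zero error term
y    <- drop(X %*% beta + eps)              # model (1): y_i = x_i' beta + eps_i
```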
In signal processing and approximation theory, the first two methods are known as the pure greedy algorithm (PGA) and the orthogonal greedy algorithm (OGA) in the deterministic setting, i.e. in a setting without stochastic error terms.

2.1. L2 Boosting. For L2 Boosting with componentwise least squares the algorithm is given below.

Algorithm 1 (L2 Boosting).
(1) Start / Initialization: $\beta^0 = 0$ (p-vector), $f^0 = 0$; set the maximum number of iterations $m_{stop}$ and set the iteration index m to 0.
(2) Calculate the residuals $U_i^m = y_i - x_i'\beta^m$.
(3) For each predictor variable $j = 1, \dots, p$ calculate the componentwise least squares coefficient
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j\rangle_n}{E_n[x_{i,j}^2]}.$$
Select the variable $j^m$ which is most correlated with the residuals, i.e. the maximizer of $\max_{1\le j\le p}|corr(U^m, x_j)|$ (equivalently, the variable which fits the gradient best in an $L_2$-sense).
(4) Update the estimator: $\beta^{m+1} := \beta^m + \gamma_{j^m}^m e_{j^m}$, where $e_{j^m}$ is the $j^m$th index vector, and $f^{m+1} := f^m + \gamma_{j^m}^m x_{j^m}$.
(5) Increase m by one. If $m < m_{stop}$, continue with (2); otherwise stop.

The act of stopping is crucial for boosting algorithms, as stopping too late or never stopping leads to overfitting, and therefore some kind of penalization is required. A suitable solution is to stop early, i.e. before overfitting takes place. “Early stopping” can be interpreted as a form of penalization. Similar to LASSO, early stopping might induce bias through shrinkage. A potential way to decrease the bias is “post-Boosting”. This is a post-model selection estimator that applies ordinary least squares (OLS) to the model selected by first-step L2 Boosting. To define this estimator properly, we make the following definitions: $T := supp(\beta)$ and $\hat T := supp(\beta^{m^*})$, the support of the true model and the support of the model estimated by L2 Boosting as described above with stopping at $m^*$. A superscript C denotes the complement of a set with regard to $\{1, \dots, p\}$. In the context of LASSO, OLS after model selection was analyzed in Belloni and Chernozhukov (2013). Given the above definitions, the post-model selection estimator or OLS post-L2 Boosting estimator takes the form

(2) $\tilde\beta = \arg\min_{\beta\in\mathbb{R}^p}\{Q_n(\beta) : \beta_j = 0 \text{ for each } j \in \hat T^C\}.$

Comment 2.1. For boosting algorithms it has been recommended – supported by simulation studies – not to update by the full step $\gamma_{j^m}^m x_{j^m}$ but only by a small fraction $\nu$ of it. The parameter $\nu$ can be interpreted as a shrinkage parameter or, alternatively, as describing the step size when updating the function estimate along the gradient. Small step sizes (or shrinkage) make the boosting algorithm slower to converge and require a larger number of iterations, but the additional computational cost often results in better out-of-sample prediction performance. By default, $\nu$ is usually set to 0.1. Our analysis in the later sections also extends to a restricted step size $0 < \nu < 1$.

2.2. Orthogonal L2 Boosting. A variant of the Boosting Algorithm is orthogonal Boosting (oBA), or the Orthogonal Greedy Algorithm in its deterministic version. Only the updating step is changed: an orthogonal projection of the response variable onto all variables which have been selected up to this point is conducted. The advantage of this method is that any variable is selected at most once, while in the previous version the same variable might be selected in different steps, which makes the analysis far more complicated.
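Before the orthogonal variant is formalized below, the following R sketch illustrates Algorithm 1 with componentwise least squares, the shrinkage parameter ν of Comment 2.1, and the post-L2 Boosting refit (2). It is a minimal sketch written for this exposition, not the authors' implementation; under the standardization assumed above, selecting the variable with maximal absolute correlation coincides with the largest reduction of the squared loss.

```r
l2boost <- function(X, y, mstop = 50, nu = 0.1) {
  p <- ncol(X)
  beta <- numeric(p)
  U <- y                                               # residuals for beta^0 = 0
  for (m in seq_len(mstop)) {
    gamma <- as.vector(crossprod(X, U)) / colSums(X^2) # componentwise LS coefficients gamma_j^m
    j <- which.max(abs(cor(X, U)))                     # variable most correlated with the residuals
    beta[j] <- beta[j] + nu * gamma[j]                 # update only coordinate j, shrunk by nu
    U <- y - X %*% beta                                # update residuals
  }
  beta
}

# Post-L2 Boosting (2): OLS restricted to the selected support T-hat.
post_l2boost <- function(X, y, beta_boost) {
  That <- which(beta_boost != 0)
  beta_post <- numeric(ncol(X))
  if (length(That) > 0)
    beta_post[That] <- coef(lm.fit(X[, That, drop = FALSE], y))
  beta_post
}

# Example, using X and y from the data-generating sketch above:
# bhat  <- l2boost(X, y, mstop = 200, nu = 0.1)
# bpost <- post_l2boost(X, y, bhat)
```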
More formally, the method can be described as follows by modifying Step (4):

Algorithm 2 (Orthogonal L2 Boosting).
(4′) $\hat y^{m+1} \equiv f_o^{m+1} = P^m y$ and $U_i^{m+1} = y_i - \hat y_i^{m+1}$, where $P^m$ denotes the projection of the variable y onto the space spanned by the first m selected variables (the corresponding regression coefficient is denoted $\beta_o^{m+1}$).

Define $X_o^m$ as the matrix which consists only of the columns corresponding to the variables selected in the first m steps, i.e. all $X_{j^k}$, $k = 1, \dots, m$. Then we have:

(3) $\beta_o^m = ((X_o^m)'X_o^m)^{-1}(X_o^m)'y$
(4) $\hat y^{m+1} = f_o^{m+1} = X_o^m\beta_o^m$

Comment 2.2. Orthogonal L2 Boosting might be interpreted as post-L2 Boosting where the refit takes place after each step.

Comment 2.3. Both post-Boosting and orthogonal Boosting require that the number of selected variables be smaller than the sample size in order to be well-defined. This is enforced by our stopping rule, as we will see later.

3. Main Results

In this section we discuss the main results regarding the L2 Boosting procedure (BA), post-L2 Boosting (post-BA) and the orthogonal procedure (oBA) in a high-dimensional setting. L2 Boosting (BA) is discussed first. Then we analyze the remaining procedures under bounded restricted eigenvalue conditions, which are also used in the traditional LASSO literature.

3.1. L2 Boosting with Componentwise Least Squares. We analyze the linear regression model introduced in the previous section in a high-dimensional setting. The following definitions will be helpful for the analysis: $U^m$ denotes the residuals at the mth iteration, given by $U^m = Y - X\beta^m$, where $\beta^m$ is the estimator at the mth iteration. We define the difference between the true and the estimated coefficient vector as $\alpha^m := \beta - \beta^m$. The prediction error is given by $V^m = X\alpha^m$.

For the Boosting algorithm it is essential to determine when to stop, i.e. the stopping criterion. In the low-dimensional case, the stopping time is not important: the value of the objective function decreases and converges to the traditional OLS solution exponentially fast, as described in Bühlmann and Yu (2006). In the high-dimensional case, such fast convergence rates are usually not available: the residual $\varepsilon$ can be explained by n linearly independent variables $x_j$, so selecting more and more terms only leads to overfitting. Early stopping is comparable to the penalization in LASSO, which prevents one from choosing too many variables and hence from overfitting.

Comparable to LASSO, a sparse structure will be needed for the analysis.

A.1 (Exact Sparsity). $T = supp(\beta)$ and $s = |T| \ll n$.

This means that the set of potential variables might be large, but the number of variables with non-zero coefficients is restricted to be small compared to the sample size.

Comment 3.1. The exact sparsity assumption can be weakened to an approximate sparsity condition. This means that the set of relevant regressors is small, and the other variables do not have to be exactly zero but must be negligible compared to the estimation error.

At each step, we minimize $\|U^m\|_{2,n}^2$ along the “most greedy” variable $X_{j^m}$. Below is a traditional assumption on the error term $\varepsilon$:

A.2. With probability greater than or equal to $1-\alpha$ we have $\sup_{1\le j\le p}|\langle X_j, \varepsilon\rangle_n| \le 2\sigma\sqrt{\frac{\log(2p/\alpha)}{n}} := \lambda_n$.

Comment 3.2. The previous assumption is implied, for example, if the error terms are i.i.d. $N(0, \sigma^2)$ random variables. This in turn can be generalized/weakened to non-normal cases by self-normalized random vector theory (de la Peña et al. (2009)) or by the approach introduced in Chernozhukov et al. (2014).
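To get a feel for assumption A.2 and the size of $\lambda_n$, the following R sketch computes $\lambda_n = 2\sigma\sqrt{\log(2p/\alpha)/n}$ and checks by simulation, for a fixed design and Gaussian errors, how often $\sup_j|\langle X_j, \varepsilon\rangle_n|$ stays below it; all numerical choices (n, p, σ, α, the number of replications) are ours and purely illustrative.

```r
set.seed(123)
n <- 100; p <- 200; sigma <- 1; alpha <- 0.05
lambda_n <- 2 * sigma * sqrt(log(2 * p / alpha) / n)   # the bound lambda_n in A.2

X <- scale(matrix(rnorm(n * p), n, p))                 # fixed, standardized design
sup_inprod <- replicate(500, {
  eps <- rnorm(n, sd = sigma)                          # i.i.d. N(0, sigma^2) errors
  max(abs(crossprod(X, eps) / n))                      # sup_j |<X_j, eps>_n|
})
mean(sup_inprod <= lambda_n)                           # should be at least about 1 - alpha
```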
By definition, $U^m := \varepsilon + X\alpha^m = \varepsilon + V^m$, where $|supp(\alpha^m)| \le s + m$ since $\alpha^m := \beta - \beta^m$, and $\max_{j\in supp(\alpha^m)}\rho^2(U^m, X_j) \ge \frac{1}{|supp(\alpha^m)|} \ge \frac{1}{m+s}$.

Lemma 1. $\|U^{m+1}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - \langle U^m, X_{j^m}\rangle_n^2 = \|U^m\|_{2,n}^2(1 - \rho^2(U^m, X_{j^m}))$, and $\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 - 2\langle V^m, \gamma_{j^m}^m X_{j^m}\rangle_n + (\gamma_{j^m}^m)^2$, where $\gamma_{j^m}^m = \langle U^m, X_{j^m}\rangle_n$. Moreover, since $V^m = U^m - \varepsilon$,
$$\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 + 2\langle U^m, X_{j^m}\rangle_n\langle\varepsilon, X_{j^m}\rangle_n - \langle U^m, X_{j^m}\rangle_n^2.$$

Denote $\sigma_n^2 := E_n[\varepsilon^2]$. From Lemma 1 we expect $\|V^m\|_{2,n}^2$ and $\|U^m\|_{2,n}^2$ to behave very similarly, since $\sup_{1\le j\le p}|\langle\varepsilon, X_j\rangle_n| \le \lambda_n$. Below are the key lemmas that describe the difference between $\|U^m\|_{2,n}^2$ and $\|V^m\|_{2,n}^2$:

Lemma 2. Suppose assumptions A.1-A.2 hold. Let $Z_m = \|U^m\|_{2,n}^2 - \|V^m\|_{2,n}^2$. Then $|Z_m - \sigma_n^2| \le 2\sqrt{m+s}\,\lambda_n\|V^m\|_{2,n}$.

Lemma 2 bounds the difference between $\|U^m\|_{2,n}^2$ and $\|V^m\|_{2,n}^2$. This difference is $\sigma^2(1 - O_p(s/n))$ if $\beta^m = \beta$.

Recall that $\|U^{m+1}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - (\gamma_{j^m}^m)^2$, where $|\gamma_{j^m}^m| = \max_{1\le j\le p}|\langle X_j, U^m\rangle_n| = \max_{1\le j\le p}|\langle X_j, V^m\rangle_n + \langle X_j, \varepsilon\rangle_n|$. It is obvious that $\sup_j|\langle X_j, V^m\rangle_n| \ge \frac{1}{\sqrt{m+s}}\|V^m\|_{2,n}$, since we can pick the $j \in supp(\alpha^m)$ that is most correlated with $V^m$. The next key lemma establishes a lower bound on the speed of decay of $\|U^m\|_{2,n}^2$.

Lemma 3 (Lower bound on $\gamma_{j^m}^m$). $|\gamma_{j^m}^m| \ge \frac{1}{\sqrt{m+s}}\|V^m\|_{2,n} - \lambda_n$.

Lemma 4 (Bounds). Suppose assumptions A.1 and A.2 hold and $\frac{s\log(p)}{n} \to 0$. Let $m_0 := C\frac{\sqrt{s}}{\lambda_n} = \frac{C}{2\sigma}\sqrt{\frac{sn}{\log p}} = o\big(\frac{n}{\log(p)}\big)$ for some constant C. Then the prediction error satisfies $\|V^{m_0}\|_{2,n} \lesssim_P \big(\frac{s\log(p)}{n}\big)^{1/4}$.

Comment 3.3. The above bound is obtained without any of the additional assumptions on the design matrix X which are often required for LASSO, e.g., the bounded restricted eigenvalue assumption. The bound also works well under certain approximate sparsity conditions.

Comment 3.4. We proved that there exists a stopping rule that leads to a certain convergence rate. From these theoretical insights we derive and recommend the following stopping rule for practical applications: stop at iteration m if $\frac{\|U^m\|_{2,n}^2}{\|U^{m-1}\|_{2,n}^2} > 1 - c\log(p)/n$, where c is a constant with c > 1. For the simulation studies we employ c = 1.1, which seems to work quite well.

Comment 3.5. It is also important to have an estimator for the variance of the error term $\sigma^2$, denoted by $\hat\sigma_{n,m}^2$. A consistent estimator of the variance is given by $\|U^m\|_{2,n}^2$ at the stopping time.

3.2. Orthogonal L2 Boosting in a high-dimensional setting with bounded restricted eigenvalue assumptions. In this section we analyze orthogonal L2 Boosting. For the orthogonal case, we obtain a faster rate of convergence than for the variant analyzed in the previous subsection. We use similar notation as before: $U_o^m$ denotes the residual and $V_o^m$ the prediction error, formally defined below. Again, $\beta_o^m$ denotes the parameter estimate after the mth iteration. The orthogonal Boosting Algorithm was introduced in the previous section. For completeness we give here the full version with some additional notation which will be required in the later analysis.

Algorithm 3 (Orthogonal L2 Boosting).
(1) Initialization: Set $\beta_o^0 = 0$, $f_o^0 = 0$, $U_o^0 = Y$ and the iteration index m = 0.
(2) Define $X_o^m$ as the matrix of all $X_{j^k}$, $k = 0, 1, \dots, m$, and $P_o^m$ as the projection matrix given $X_o^m$.
(3) Let $j_m$ be the maximizer of $\max_{1\le j\le p}\rho^2(X_j, U_o^m)$.
Then $f_o^{m+1} = P_o^m Y$, with corresponding regression (projection) coefficient $\beta_o^{m+1}$.
(4) Calculate the residual $U_o^m = Y - X\beta_o^m = (I - P_o^m)Y := M_{P_o^m}Y$ and $V_o^m = M_{P_o^m}X\beta$.
(5) Increase m by one. If some stopping criterion is reached, then stop; else continue with Step 2.

It is easy to see that

(5) $\|U_o^{m+1}\|_{2,n} \le \|U_o^m\|_{2,n}$.

The benefit of the oBA method, compared to L2 Boosting, is that once a variable $X_j$ has been selected, the procedure will never select this variable again; every variable is selected at most once.

For any square matrix A, we denote by $\phi_s(A)$ and $\phi_l(A)$ the smallest and largest eigenvalues of A. We make a restricted eigenvalue assumption, which is also commonly used in the analysis of LASSO.

Definition 3.1. The smallest and largest restricted eigenvalues are defined as
$$\phi_s(s, M) := \min_{\substack{\dim(A)\le s\times s,\\ A \text{ a diagonal submatrix of } M}}\phi_s(A) \qquad\text{and}\qquad \phi_l(s, M) := \max_{\substack{\dim(A)\le s\times s,\\ A \text{ a diagonal submatrix of } M}}\phi_l(A).$$

A.3. We assume that there exist constants c, C such that $0 < c \le \phi_s(s', E_n[x_i'x_i]) \le \phi_l(s', E_n[x_i'x_i]) \le C < \infty$ for any $s' \le s\eta_n$, where $\eta_n$ is a sequence such that $\eta_n \to \infty$ slowly, e.g., $\eta_n = \log(n)$.

Denote by $T^m$ the set of variables selected after m iterations and denote $S^m := T - T^m$. We know that $|T^m| = m$ by construction. Denote $S_c^m := T^m - S^m$.

Lemma 5 (lower bound of the remainder). Suppose assumptions A.1-A.3 hold. For any m such that $|T^m| < s\eta_n$, $\|V_o^m\|_{2,n} \ge c_1\|X\beta_{S^m}\|_{2,n}$ for some constant $c_1 > 0$. If $S^m = \emptyset$, then $\|V_o^m\|_{2,n} = 0$.

The above lemma essentially says: if $S^m$ is non-empty, then there is still room for significant improvement in the value $\|V_o^m\|_{2,n}^2$. The next lemma is key and shows how rapidly the estimates decay. It is obvious that $\|U_o^m\|_{2,n}$ and $\|V_o^m\|_{2,n}$ are both decaying sequences.

A.4. $\min_{j\in T}|\beta_j| \ge J$ and $\max_{j\in T}|\beta_j| \le J'$ for some constants $J > 0$ and $J' < \infty$.

Comment 3.6. The first part of assumption A.4 is known as a “beta-min” assumption, as it restricts the size of the non-zero coefficients.

Lemma 6 (upper bound of the remainder). Suppose assumptions A.1-A.4 hold. Assume that $\sqrt{s}\lambda_n \to 0$. Let $m^*$ be the first time that $\|U^{m^*}\|_{2,n}^2 < \sigma^2 + 2K\sigma s\lambda_n^2$. Then $m^* < Ks$ and $\|V_o^{m^*}\|_{2,n}^2 \lesssim s\log(p)/n$ with probability going to 1.

Although in general L2 Boosting may have a slower convergence rate than LASSO, oBA attains the rate of LASSO (under additional conditions). The same technique used in Lemma 6 also applies to the post-L2 Boosting case. Basically, we can prove, using similar arguments, that $T^{Ks} \supset T$ when K is a large enough constant. Thus post-L2 Boosting enjoys the LASSO convergence rate under assumptions A.1-A.4. We state this formally in the next subsection.

3.3. Post-L2 Boosting with Componentwise Least Squares. In many cases, penalized estimators like LASSO introduce some bias through shrinkage. The idea of “post-estimators” is to estimate the final model by ordinary least squares including all the variables which have been selected in the first step. We introduced post-L2 Boosting in Section 2. Now we give the rate of convergence for this procedure. Surprisingly, it improves upon the rate of convergence of L2 Boosting and reaches the rate of LASSO (under stronger assumptions). The proof of the result follows the idea of the proof of Lemma 6.

Lemma 7 (Post-L2 Boosting). Suppose assumptions A.1-A.4 hold. Assume that $\sqrt{s}\lambda_n \to 0$. Let $m^* = Ks$ be the stopping time with K a large enough constant. Let $T^{m^*}$ be the set of variables selected at iteration $m^*$ of the L2 Boosting procedure.
Then $T \subset T^{m^*}$ with probability going to 1 and $\|P_{X_{T^{m^*}}}Y - X\beta\|_{2,n}^2 \le CKs\lambda_n^2 \lesssim s\log(p)/n$.

The proof of this lemma is similar to that of Lemma 6.

Comment 3.7. Our procedure is particularly sensitive to the starting value of the algorithm, at least in the non-asymptotic case. This might be exploited for screening and model selection: the procedure is started from different starting values and run until stopping; then the intersection of the variables selected in each run is taken. This procedure might establish a sure screening property.

3.4. Approximate sparsity. In the previous sections we used Assumption A.1 to establish the results. These results can be relaxed to the approximately sparse case.

A.5 (Approximate Sparsity). There exists a set $T \subset \{1, 2, \dots, p\}$ such that $|T| = s \ll n$ and $f(x) = x'\beta_T + r_n$ with $\|r_n\|_{2,n} \le \sqrt{s}\lambda_n$ with probability going to 1.

The main results in Lemma 4, Lemma 6 and Lemma 7 then hold under A.2-A.5 rather than under A.1-A.4. The beta-min condition stated in Assumption A.4 can be relaxed to the following condition stated in Candes and Tao (2007):

A.6. There exists a constant C large enough such that $\min_{j\in T}|\beta_j| \ge C\lambda_n$.

Lemma 8 (Approximate sparsity). Suppose conditions A.2, A.3, A.5 and A.6 hold. For the orthogonal (oBA) and post-L2 Boosting procedures, define $m^* := Ks\log(n)$ for some constant K. Then $T^{m^*} \supset T$ with probability going to 1, and $\|V_o^{m^*}\|_{2,n}^2 \lesssim_P \frac{s\log(n\vee p)}{n}$ and $\|P_{X_{T^{m^*}}}Y - X\beta\|_{2,n}^2 \lesssim_P \frac{s\log(n\vee p)}{n}$.

4. Simulation Study

In this section we present the results of our simulation study. The goal of this exercise is to illustrate the relevance of our theoretical results and to provide insights into the functioning and the practical aspects of boosting. In particular, we demonstrate that the rules for early stopping we propose work reasonably well in the simulations and give guidance for practical applications. Moreover, the comparison with LASSO might also be of interest. We consider the following linear model

(6) $y = \sum_{j=1}^p \beta_j x_j + \varepsilon,$

with i.i.d. standard normal errors ε. For the coefficient vector β we consider two designs. First, a sparse design in which the first s elements of β are set equal to one and all other components to zero, i.e. β = (1, …, 1, 0, …, 0); second, a polynomial design in which the jth coefficient is given by 1/j, i.e. β = (1, 1/2, 1/3, …, 1/p). For the design matrix X we consider two different settings: an “orthogonal” setting and a “correlated” setting. In the former the entries of X are iid draws from a standard normal distribution. In the correlated design the $x_i$ (rows of X) are distributed according to a multivariate normal distribution where the correlations are given by a Toeplitz matrix with factor 0.5 and alternating sign. We have the following settings:
• X: “orthogonal” or “correlated”
• coefficient vector β: sparse design or polynomially decaying design
• n = 100, 200, 400, 800
• p = 100, 200
• s = 10
• K = 2
• out-of-sample prediction size n1 = 50
• number of repetitions R = 500

We consider the following estimators: L2 Boosting with componentwise least squares, orthogonal L2 Boosting and LASSO. For Boosting and LASSO we also consider the post-selection estimators (“post”). For LASSO we consider a data-driven, regressor-dependent choice of the penalization parameter (Belloni et al. (2012)) and cross-validation.
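For concreteness, the orthogonal L2 Boosting estimator combined with the data-dependent stopping rule of Comment 3.4 (with c = 1.1) can be sketched in R as follows; this is a minimal illustration written for this exposition and not the simulation code used below.

```r
oboost <- function(X, y, c_stop = 1.1, max_steps = min(nrow(X) - 1, ncol(X))) {
  n <- nrow(X); p <- ncol(X)
  selected <- integer(0)
  U <- y                                                  # U_o^0 = y
  for (m in seq_len(max_steps)) {
    rho <- abs(cor(X, U))
    rho[selected] <- 0                                    # each variable is selected at most once
    j <- which.max(rho)
    fit <- lm.fit(X[, c(selected, j), drop = FALSE], y)   # orthogonal projection on selected vars
    U_new <- y - fit$fitted.values
    # data-dependent early stopping: stop once the residual variance ratio
    # ||U^m||^2 / ||U^{m-1}||^2 exceeds 1 - c * log(p) / n
    if (mean(U_new^2) / mean(U^2) > 1 - c_stop * log(p) / n) break
    selected <- c(selected, j)
    U <- U_new
  }
  beta <- numeric(p)
  if (length(selected) > 0)
    beta[selected] <- coef(lm.fit(X[, selected, drop = FALSE], y))
  list(beta = beta, selected = selected)
}
```

The same stopping rule is applied to the non-orthogonal variant (BA-our in the tables below), with $U^m$ taken from Algorithm 1 instead of the orthogonal residual.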
Although cross-validation is very popular, it does not rely on established theoretical results, and therefore we prefer a comparison with the formal penalty choice developed in Belloni et al. (2012). For Boosting we consider three stopping rules: “oracle”, “Ks”, and a “data-dependent” stopping criterion which stops at iteration m if
$$\frac{\|U^m\|_{2,n}^2}{\|U^{m-1}\|_{2,n}^2} = \frac{\hat\sigma_{m,n}^2}{\hat\sigma_{m-1,n}^2} > 1 - c\log(p)/n.$$
This means that we stop if the ratio of the estimated variances does not improve by a certain amount any more. The Ks-rule stops after K × s variables have been selected. As s is unknown, this rule is not directly applicable. The oracle rule stops when the mean squared error (MSE), defined below, is minimized, which is also not feasible in practical applications.

The simulations are performed in R (R Core Team (2014)). For LASSO estimation the packages of Chernozhukov et al. (2015) and Friedman et al. (2010) (for cross-validation) are used. The Boosting procedures were implemented by the authors. Our code is available upon request.

To evaluate the performance of the estimators we use the MSE criterion. We estimate the models on the same data sets and use the estimators to predict 50 observations out-of-sample. The (out-of-sample) MSE is defined as

(7) $MSE = E[(f(X) - f^m(X))^2] = E[(X'(\beta - \beta^m))^2],$

where m denotes the iteration at which we stop, depending on the employed stopping rule. The MSE is estimated by

(8) $\widehat{MSE} = \sum_{i=1}^{n_1}(f(x_i) - f^m(x_i))^2 = \sum_{i=1}^{n_1}(x_i'(\beta - \beta^m))^2$

for the out-of-sample predictions.

The results of the simulation study are shown in the tables below. As expected, the oracle-based estimator dominates in all cases except in the correlated, sparse setting with more parameters than observations. Our stopping criterion gives very good results, on par with the infeasible Ks-rule. Not surprisingly, given our results, both post-Boosting and orthogonal Boosting outperform the standard L2 Boosting in most cases. A comparison of post- and orthogonal Boosting does not provide a clear answer, with advantages on both sides.

Table 1. Simulation results (out-of-sample MSE)

setting       n    p    s   BA-oracle  BA-Ks  BA-our  pBA-oracle  p-BA-Ks  p-BA-our  oBA-oracle  oBA-Ks  oBA-our
iid, sparse   100  100  10  0.475      0.713  0.624   0.121       0.566    0.335     0.118       0.802   0.485
iid, sparse   100  200  10  0.581      0.953  0.804   0.127       0.733    0.443     0.124       1.002   0.715
iid, sparse   200  200  10  0.184      0.367  0.295   0.054       0.313    0.195     0.054       0.424   0.260
iid, sparse   400  200  10  0.084      0.173  0.144   0.026       0.145    0.103     0.026       0.197   0.128
iid, sparse   800  200  10  0.036      0.079  0.065   0.013       0.067    0.048     0.013       0.090   0.057
cor., sparse  100  100  10  1.382      1.651  2.134   0.447       0.846    1.704     0.279       0.743   1.000
cor., sparse  100  200  10  3.112      3.034  3.121   2.198       2.498    2.953     3.026       2.580   2.679
cor., sparse  200  200  10  0.631      0.735  0.811   0.060       0.174    0.167     0.057       0.390   0.228
cor., sparse  400  200  10  0.199      0.259  0.247   0.025       0.038    0.087     0.025       0.189   0.115
cor., sparse  800  200  10  0.079      0.207  0.102   0.014       0.014    0.042     0.014       0.086   0.052
iid, poly.    100  100  10  0.398      0.772  0.575   0.392       0.959    0.616     0.413       1.148   0.719
iid, poly.    100  200  10  0.457      1.021  0.673   0.445       1.319    0.728     0.440       1.574   0.862
iid, poly.    200  200  10  0.274      0.513  0.364   0.269       0.602    0.382     0.269       0.655   0.435
iid, poly.    400  200  10  0.199      0.276  0.240   0.194       0.294    0.244     0.191       0.308   0.257
iid, poly.    800  200  10  0.125      0.149  0.147   0.125       0.152    0.149     0.126       0.155   0.152
cor., poly.   100  100  10  0.231      0.648  0.419   0.217       0.864    0.434     0.215       1.052   0.524
cor., poly.   100  200  10  0.261      0.896  0.506   0.255       1.207    0.534     0.255       1.538   0.657
cor., poly.   200  200  10  0.195      0.482  0.289   0.169       0.566    0.271     0.172       0.619   0.308
cor., poly.   400  200  10  0.146      0.258  0.204   0.119       0.270    0.180     0.114       0.294   0.185
cor., poly.   800  200  10  0.105      0.139  0.126   0.083       0.132    0.110     0.077       0.142   0.109
Table 2. Simulation results continued (out-of-sample MSE)

setting       n    p    s   LASSO  post-LASSO  Lasso-CV  post-Lasso-CV
iid, sparse   100  100  10  0.897  0.606       0.540     0.877
iid, sparse   100  200  10  1.209  1.335       0.758     1.135
iid, sparse   200  200  10  0.348  0.416       0.276     0.510
iid, sparse   400  200  10  0.141  0.188       0.119     0.270
iid, sparse   800  200  10  0.067  0.088       0.056     0.135
cor., sparse  100  100  10  2.896  1.271       0.830     1.201
cor., sparse  100  200  10  3.115  2.260       1.874     2.422
cor., sparse  200  200  10  1.529  0.392       0.466     0.758
cor., sparse  400  200  10  0.415  0.170       0.192     0.356
cor., sparse  800  200  10  0.176  0.079       0.088     0.171
iid, poly.    100  100  10  0.436  0.653       0.428     0.709
iid, poly.    100  200  10  0.504  1.170       0.551     0.877
iid, poly.    200  200  10  0.325  0.521       0.311     0.518
iid, poly.    400  200  10  0.212  0.294       0.190     0.343
iid, poly.    800  200  10  0.145  0.166       0.118     0.213
cor., poly.   100  100  10  0.316  0.569       0.339     0.585
cor., poly.   100  200  10  0.354  0.877       0.412     0.658
cor., poly.   200  200  10  0.263  0.426       0.269     0.430
cor., poly.   400  200  10  0.184  0.245       0.170     0.304
cor., poly.   800  200  10  0.134  0.141       0.107     0.191

It is interesting to see that post-LASSO improves upon LASSO, but there are some exceptions, probably driven by overfitting. Cross-validation works very well in many constellations. An important point of the simulation study is the comparison between Boosting and LASSO. It seems that in the polynomially decaying setting, Boosting (orthogonal Boosting with our stopping rule) dominates post-LASSO. This also seems to be true in the iid, sparse setting. In the correlated, sparse setting they are on par. Summing up, it seems that Boosting is a serious contender for LASSO.

Table 3. Average number of variables and iterations run in the simulation settings

setting       n    p    s   num var BA  num iteration BA  num var oBA  num var Lasso
iid, sparse   100  100  10  13.080      14.692            14.358       16.354
iid, sparse   100  200  10  13.740      15.020            16.399       20.169
iid, sparse   200  200  10  13.670      15.396            14.508       20.238
iid, sparse   400  200  10  13.994      16.206            14.594       20.616
iid, sparse   800  200  10  13.858      16.266            14.148       20.590
cor., sparse  100  100  10  9.830       10.630            13.720       11.340
cor., sparse  100  200  10  8.230       8.450             11.480       14.270
cor., sparse  200  200  10  13.280      17.760            13.960       18.520
cor., sparse  400  200  10  13.430      25.580            14.040       19.170
cor., sparse  800  200  10  13.290      31                13.680       19.400
iid, poly.    100  100  10  7.630       7.630             8.450        12.510
iid, poly.    100  200  10  7.450       7.450             8.510        15.660
iid, poly.    200  200  10  9.580       9.590             10.480       18.190
iid, poly.    400  200  10  13.150      13.150            13.600       22.020
iid, poly.    800  200  10  17.540      17.550            18.130       27.440
cor., poly.   100  100  10  5.420       5.470             5.740        7.690
cor., poly.   100  200  10  5.360       5.380             6.080        11.160
cor., poly.   200  200  10  5.810       5.900             6.280        11.760
cor., poly.   400  200  10  7.420       7.880             8.260        13.470
cor., poly.   800  200  10  9.220       11.220            10.030       15.830

Table 3 shows the average number of variables selected for LASSO, BA and oBA (both with our stopping criterion) and the number of iterations run in the BA case in the simulations above.
An interesting pattern is that in most cases LASSO selects more variables or, put differently, the boosting algorithms give sparser solutions.

Comment 4.1. It seems that in very high-dimensional settings, i.e. when the number of signals s is bigger than the sample size n (e.g. n = 20, p = 50, s = 30), boosting performs quite well and outperforms LASSO, which seems to break down. This case is not covered by our setting, but it is an interesting topic for future research and shows one of the advantages of boosting.

5. Application: Riboflavin production

In this section we present an application to demonstrate how the methods work when applied to a real data set and to compare them to related methods, i.e. LASSO. The application involves genetic data and analyzes the production of riboflavin. First we describe the data set, then we present the results.

5.1. Data set. The data set has been provided by DSM (Kaiserburg, Switzerland) and was made publicly available for academic research in Bühlmann et al. (2014) (Supplemental Material). The real-valued response/dependent variable is the logarithm of the riboflavin production rate. The covariables measure the logarithm of the expression level of 4,088 genes (p = 4,088), which are normalized: the covariables are standardized to have variance one, and the dependent variable and the regressors are de-meaned, which is equivalent to including an unpenalized intercept. The data set consists of n = 71 observations which were hybridized repeatedly during a fed-batch fermentation process in which different engineered strains and strains grown under different fermentation conditions were analyzed. For further details we refer to Bühlmann et al. (2014), their Supplemental Material and the references therein.

5.2. Results. We analyzed the data set on the production of riboflavin (vitamin B2) with B. subtilis. We split the data set randomly into two samples: a training set and a testing set. We estimated the models with the different methods on the training set and then used the testing set to calculate out-of-sample mean squared errors (MSE) in order to evaluate the predictive accuracy. The size of the training set was 60 observations, and the remaining 11 observations were used for forecasting. The table below shows the out-of-sample MSE for the different methods discussed in the previous sections.

Table 4. Results Riboflavin Production – out-of-sample MSE

BA-Ks   BA-our  p-BA-Ks  p-BA-our  oBA-Ks  oBA-our  LASSO   p-LASSO
0.4669  0.3641  0.4385   0.1237    0.4246  0.1080   0.1687  0.1539

All calculations are performed in R (R Core Team (2014)) with the package of Chernozhukov et al. (2015) and our own code. Replication files are available upon request. The results show that, again, post- and orthogonal L2 Boosting give comparable results. They both outperform LASSO and post-LASSO in this application.

Appendix A. Proofs of Section 3

A.1. Proofs for L2 Boosting.

Proof of Lemma 2. From Lemma 1, $Z_{m+1} = Z_m - 2\gamma_{j^m}^m\langle\varepsilon, X_{j^m}\rangle_n$, and hence
$$Z_m = Z_0 - 2\sum_{k=0}^{m-1}\gamma_{j^k}^k\langle\varepsilon, X_{j^k}\rangle_n = Z_0 - 2\langle\varepsilon, X\beta - V^m\rangle_n.$$
Moreover, $Z_0 = \|y\|_{2,n}^2 - \|X\beta\|_{2,n}^2 = \|\varepsilon\|_{2,n}^2 + 2\langle\varepsilon, X\beta\rangle_n$. Then
$$Z_m = \|\varepsilon\|_{2,n}^2 + 2\langle\varepsilon, V^m\rangle_n = \|\varepsilon\|_{2,n}^2 + 2\Big\langle\varepsilon, \frac{V^m}{\|V^m\|_{2,n}}\Big\rangle_n\|V^m\|_{2,n}.$$
Thus $|Z_m - \sigma_n^2| \le 2\sqrt{m+s}\,\lambda_n\|V^m\|_{2,n}$, since $|supp(V^m)| \le m + s$. □

Proof of Lemma 4. Denote by $m^* + 1$ the first time that $\|V^m\|_{2,n} \le \eta_n\sqrt{m+s}\,\lambda_n$, where $\eta_n$ is some sequence of positive real numbers. We know that in high-dimensional settings $\|U^m\|_{2,n} \to 0$ as m grows, so $\|V^m\|_{2,n}^2 \to \sigma_n^2$.
Thus, by fixing p and n, such an $m^*$ must exist. First, we prove that for any $m \le m^*$ we have $\|V^{m+1}\|_{2,n}^2 \le \|V^m\|_{2,n}^2$. By Lemma 1,
$$\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 - \gamma_{j^m}^m\big(\gamma_{j^m}^m - 2\langle\varepsilon, X_{j^m}\rangle_n\big).$$
We only need to prove that $\gamma_{j^m}^m$ and $(\gamma_{j^m}^m - 2\langle\varepsilon, X_{j^m}\rangle_n)$ have the same sign, i.e., $|\gamma_{j^m}^m| > 2|\langle\varepsilon, X_{j^m}\rangle_n|$. It suffices to prove $|\gamma_{j^m}^m| > 2\lambda_n$. We know that $|\gamma_{j^m}^m| \ge \frac{\|V^m\|_{2,n}}{\sqrt{m+s}} - \lambda_n \ge \lambda_n(\eta_n - 1)$. Thus, for any $\eta_n > 3$, $\|V^{m+1}\|_{2,n}^2 \le \|V^m\|_{2,n}^2$. Let $\eta_n = 3 + \mu$ for some constant $\mu$.

Let $\xi_n = (\frac{sn}{\log(p)})^{1/2}$. If there exists some constant $C'$ such that $m^* + 1 < C'\xi_n$, then by definition
$$\|V^{m^*+1}\|_{2,n} \le 3(1+\mu)\sqrt{s + m^* + 1}\,\lambda_n \le 3(1+2\mu)\sqrt{C'}\,2\sigma\Big(\frac{s\log(p)}{n}\Big)^{1/4} \lesssim \Big(\frac{s\log(p)}{n}\Big)^{1/4}.$$

We consider two cases, depending on whether $m_0$ is smaller or larger than $m^*$.

If $m_0 \ge m^*$, then $\|V^{m_0}\|_{2,n}^2 = \|V^{m^*}\|_{2,n}^2 + (\|U^{m_0}\|_{2,n}^2 - \|U^{m^*}\|_{2,n}^2) - Z_{m_0} + Z_{m^*}$. So $\|V^{m_0}\|_{2,n}^2 \le \|V^{m^*}\|_{2,n}^2 + 2\sqrt{m_0 \vee m^*}\,\lambda_n \le C''\big(\frac{s\log(p)}{n}\big)^{1/4}$ for some generic constant $C''$.

If $m_0 < m^*$, we know that $\|V^m\|_{2,n}^2$ is decreasing for $m = 0, 1, \dots, m_0$. For any $m < m_0$ we have
$$\|U^{m+1}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - (\gamma_{j^m}^m)^2 \le \|U^m\|_{2,n}^2 - \Big(\frac{\|V^m\|_{2,n}}{\sqrt{m+s}} - \lambda_n\Big)^2$$
$$\le \|U^m\|_{2,n}^2 - \frac{\|V^m\|_{2,n}^2}{m+s} + 2\lambda_n\frac{\|V^m\|_{2,n}}{\sqrt{m+s}} + \lambda_n^2$$
$$= \|U^m\|_{2,n}^2\Big(1 - \frac{1}{m+s}\Big) + \frac{Z_m}{m+s} + 2\lambda_n\frac{\|V^m\|_{2,n}}{\sqrt{m+s}} + \lambda_n^2$$
$$\le \|U^m\|_{2,n}^2\Big(1 - \frac{1}{m+s}\Big) + \frac{\hat\sigma^2}{m+s} + 4(1+\mu)\lambda_n\frac{\|V^m\|_{2,n}}{\sqrt{m+s}}. \tag{9}$$

Based on Lemma 2 and inequality (9),
$$\|U^{m+1}\|_{2,n}^2 - \hat\sigma^2 - 8(1+\mu)\lambda_n\sqrt{m+s+1}\,\|V^{m+1}\|_{2,n} \le \Big(1 - \frac{1}{m+s}\Big)\Big(\|U^m\|_{2,n}^2 - \hat\sigma^2 - 8(1+\mu)\lambda_n\sqrt{m+s}\,\|V^m\|_{2,n}\Big),$$
since
$$\sqrt{m+s+1}\,\|V^{m+1}\|_{2,n} - \sqrt{m+s}\,\|V^m\|_{2,n} \le \big(\sqrt{m+s+1} - \sqrt{m+s}\big)\|V^m\|_{2,n} \le \frac{1}{2\sqrt{m+s}}\|V^m\|_{2,n}.$$

Denote by $m_1^* + 1$ the first m such that $\|U^{m+1}\|_{2,n}^2 \le \sigma^2 + 8(1+\mu)\lambda_n\sqrt{m+s+1}\,\|V^{m+1}\|_{2,n}$. If $m_1^* < m_0$, then $\|U^{m_1^*+1}\|_{2,n}^2 \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m_1^*+s+1}\,\|V^{m_1^*+1}\|_{2,n}$. Moreover, we know that $\|V^{m_1^*+1}\|_{2,n}^2 = \|U^{m_1^*+1}\|_{2,n}^2 - Z_{m_1^*+1} \le (10+8\mu)\lambda_n\sqrt{m_1^*+s+1}\,\|V^{m_1^*+1}\|_{2,n}$, and therefore $\|V^{m_1^*+1}\|_{2,n} \le (10+8\mu)\lambda_n\sqrt{m_1^*+s+1} \le 10(1+\mu)\sqrt{m_0}\,\lambda_n = 10(1+\mu)\big(\frac{s\log(p)}{n}\big)^{1/4} \lesssim \big(\frac{s\log(p)}{n}\big)^{1/4}$.

If $m_1^* \ge m_0$, define a new sequence by $q^0 := \|U^0\|_{2,n}^2 > \hat\sigma^2 + 8(1+\mu)\sqrt{s}\,\lambda_n\|V^0\|_{2,n}$ and $q^{m+1} = q^m\big(1 - \frac{1}{m+s}\big)$. (We require the LASSO growth rate $\sqrt{s}\lambda_n \to 0$.) So we have $q^m = q^0\frac{s}{m+s}$. Hence, for any $m \le m_0 \le m_1^* \wedge m^*$, we have
$$\|U^m\|_{2,n}^2 - \hat\sigma^2 - 8(1+\mu)\lambda_n\sqrt{m+s}\,\|V^m\|_{2,n} \le q^m \le C\frac{s}{m+s}.$$
(Here we require $q^0 < \infty$, which holds under $\|\beta\|_2 < \infty$; Bühlmann (2006) requires the stronger condition $\|\beta\|_1 < \infty$. If we assume $\sup_j|\beta_j| < C$, then the bound becomes $\|U^m\|_{2,n}^2 \le C\frac{s^2}{m+s}$.)

Thus we know that for any $m \le m_1^*$,
$$\|U^{m+1}\|_{2,n}^2 \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m+1+s}\,\|V^{m+1}\|_{2,n} + C\frac{s}{m+s+1},$$
and
$$\|V^{m+1}\|_{2,n}^2 = \|U^{m+1}\|_{2,n}^2 - Z_{m+1} \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m+1+s}\,\|V^{m+1}\|_{2,n} + C\frac{s}{m+s+1} - \hat\sigma^2 + 2\sqrt{m+s+1}\,\lambda_n\|V^{m+1}\|_{2,n}$$
$$= (10+8\mu)\sqrt{m+s+1}\,\lambda_n\|V^{m+1}\|_{2,n} + \frac{Cs}{m+s+1}.$$
Setting $m = m_0 - 1$, we get $\|V^{m_0}\|_{2,n} \le C_1\big(\frac{s\log(p)}{n}\big)^{1/4}$ for some constant $C_1 > 0$. Thus stopping at $m_0 = C\frac{\sqrt{s}}{\lambda_n}$ leads to the convergence rate $\big(\frac{s\log(p)}{n}\big)^{1/4}$. □

A.2. Proofs for the OGA.

Proof of Lemma 5. By the sparse eigenvalue condition (A.3),
$$\|V_o^m\|_{2,n}^2 \ge c\|\beta_{S^m}\|_2^2 \ge \frac{c}{C}\|X\beta_{S^m}\|_{2,n}^2.$$
Similarly, for $U_o^m$,
$$\|U_o^m\|_{2,n}^2 = \|M_{P_o^m}\varepsilon + M_{P_o^m}(X\beta)\|_{2,n}^2 = \|M_{P_o^m}\varepsilon\|_{2,n}^2 + \|V_o^m\|_{2,n}^2 + 2\langle\varepsilon, V_o^m\rangle_n \ge \frac{n-m}{n}\sigma^2 + \|V_o^m\|_{2,n}^2 - 2\lambda_n\sqrt{m+s}\,\|V_o^m\|_{2,n}. \quad\Box$$

Proof of Lemma 6. At step Ks, if $T^{Ks} \supset T$, then $\|V_o^{Ks}\|_{2,n}^2 = \|M_{P_o^{Ks}}X\beta\|_{2,n}^2 = 0$.
The estimated predictor then satisfies $\|X\beta - X\beta^m\|_{2,n} \le 2(1+\eta)\sigma\sqrt{\frac{s\log(p)}{n}}$ with probability going to 1, where $\eta > 0$ is a constant.

Let $A_1 = \{T^{Ks} \supset T\}$ and consider the event $A_1^c$. On this event there exists a $j \in T$ which is never picked in the process at steps $k = 0, 1, \dots, Ks$. At every step we pick a j that maximizes $|\langle X_j, U_o^m\rangle_n| = |\langle X_j, V_o^m\rangle_n + \langle X_j, \varepsilon\rangle_n|$. Let $W^m = X\beta_{S^m}$ and $\widetilde W^m = X\alpha^m_{S_c^m}$. Then $\|V_o^m\|_{2,n}^2 \ge c(\|\beta_{S^m}\|_2^2 + \|\alpha^m_{S_c^m}\|_2^2) \ge c\sum_{j\in S^m}\beta_j^2$. Also $V_o^m = M_{P_o^m}X\beta_{S^m}$, so
$$\langle V_o^m, X\beta_{S^m}\rangle_n = \|M_{P_o^m}X\beta_{S^m}\|_{2,n}^2 = \|X\beta_{S^m} - X_{S_c^m}\zeta\|_{2,n}^2 \ge c\|\beta_{S^m}\|_2^2,$$
where $X_{S_c^m}\zeta = P_{X_{S_c^m}}(X\beta_{S^m})$. Thus it is easy to see that $\langle V_o^m, X\beta_{S^m}\rangle_n = \sum_{j\in S^m}\beta_j\langle V_o^m, X_j\rangle_n \ge c\sum_{j\in S^m}\beta_j^2$. Hence there exists some $j^*$ such that $|\langle V_o^m, X_{j^*}\rangle_n| \ge c|\beta_{j^*}| \ge cJ$. We know that the optimal $j_m$ must satisfy $|\langle U_o^m, X_{j_m}\rangle_n| \ge |\langle V_o^m, X_{j^*}\rangle_n - \langle\varepsilon, X_{j^*}\rangle_n| \ge cJ - \lambda_n$. Thus $|\langle V_o^m, X_{j_m}\rangle_n| > cJ - 2\lambda_n$.

Hence,
$$\|V_o^{m+1}\|_{2,n}^2 \le \|V_o^m - \gamma_{j_m}X_{j_m}\|_{2,n}^2 = \|V_o^m\|_{2,n}^2 - 2\gamma_{j_m}\langle U_o^m, X_{j_m}\rangle_n + 2\gamma_{j_m}\langle\varepsilon, X_{j_m}\rangle_n + \gamma_{j_m}^2$$
$$\le \|V_o^m\|_{2,n}^2 - \gamma_{j_m}^2 + 2\lambda_n|\gamma_{j_m}| \le \|V_o^m\|_{2,n}^2 - (cJ - \lambda_n)^2 + 2(cJ - \lambda_n)\lambda_n \le \|V_o^m\|_{2,n}^2 - (cJ)^2 + 4cJ\lambda_n.$$
Consequently, $\|V_o^{Ks}\|_{2,n}^2 \le \|V_o^0\|_{2,n}^2 - K(cJ)^2 s + 4KcJs\lambda_n$. Since $\|V_o^0\|_{2,n}^2 \le CJ'^2 s$, choosing $K > \frac{CJ'^2}{c^2J^2}$ and using $\lambda_n \to 0$ would lead to $\|V_o^{Ks}\|_{2,n}^2 < 0$ asymptotically. That is, the assumption that “there exists a j which is never picked up in the process at $k = 0, 1, 2, \dots, Ks$” fails with probability going to 1. Thus, at time Ks, $A_1$ must hold with probability going to 1. Therefore we know that $\|U_o^{Ks}\|_{2,n}^2 = \|M_{P_o^{Ks}}\varepsilon\|_{2,n}^2 \le \hat\sigma^2 = \sigma^2 + O_p(\frac{1}{\sqrt{n}})$. By the definition of $m^*$, $m^* \le Ks$. Therefore, $\|V_o^{m^*}\|_{2,n}^2 \le \|U_o^{m^*}\|_{2,n}^2 - \sigma^2 + m^*\lambda_n^2 = O_p(s\lambda_n^2)$. □

References

Barron, Andrew R., Cohen, Albert, Dahmen, Wolfgang and DeVore, Ronald A. (2008). ‘Approximation and learning by greedy algorithms’, Ann. Statist. 36(1), 64–94. URL: http://dx.doi.org/10.1214/009053607000000631

Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). ‘Sparse Models and Methods for Optimal Instruments With an Application to Eminent Domain’, Econometrica 80(6), 2369–2429. URL: http://dx.doi.org/10.3982/ECTA9626

Belloni, Alexandre and Chernozhukov, Victor. (2013). ‘Least squares after model selection in high-dimensional sparse models’, Bernoulli 19(2), 521–547. URL: http://dx.doi.org/10.3150/11-BEJ410

Breiman, Leo. (1996). ‘Bagging Predictors’, Machine Learning 24, 123–140. URL: http://dx.doi.org/10.1007/BF00058655

Breiman, Leo. (1998). ‘Arcing Classifiers’, The Annals of Statistics 26(3), 801–824.

Bühlmann, Peter. (2006). ‘Boosting for High-Dimensional Linear Models’, The Annals of Statistics 34(2), 559–583.

Bühlmann, Peter and Hothorn, Torsten. (2007). ‘Boosting Algorithms: Regularization, Prediction and Model Fitting’, Statistical Science 22(4), 477–505. With discussion. URL: http://dx.doi.org/10.1214/07-STS242

Bühlmann, Peter, Kalisch, Markus and Meier, Lukas. (2014). ‘High-Dimensional Statistics with a View Toward Applications in Biology’, Annual Review of Statistics and Its Application 1(1), 255–278. URL: http://dx.doi.org/10.1146/annurev-statistics-022513-115545

Bühlmann, Peter and Yu, Bin. (2003). ‘Boosting with the L2 Loss: Regression and Classification’, Journal of the American Statistical Association 98(462), 324–339. URL: http://www.jstor.org/stable/30045243

Candes, Emmanuel and Tao, Terence. (2007).
‘The Dantzig selector: statistical estimation when p is much larger than n’, The Annals of Statistics, pp. 2313–2351.

Chernozhukov, Victor, Chetverikov, Denis and Kato, Kengo. (2014). ‘Gaussian approximation of suprema of empirical processes’, The Annals of Statistics 42(4), 1564–1597.

Chernozhukov, Victor, Hansen, Christian and Spindler, Martin. (2015). hdm: High-Dimensional Metrics. R package version 0.1.

de la Peña, Victor H., Lai, Tze Leung and Shao, Qi-Man. (2009). Self-normalized Processes: Limit Theory and Statistical Applications, Probability and its Applications (New York), Springer-Verlag, Berlin.

Freund, Yoav and Schapire, Robert E. (1997). ‘A decision-theoretic generalization of on-line learning and an application to boosting’, Journal of Computer and System Sciences 55(1), 119–139.

Friedman, Jerome H. (2001). ‘Greedy Function Approximation: A Gradient Boosting Machine’, The Annals of Statistics 29(5), 1189–1232. URL: http://www.jstor.org/stable/2699986

Friedman, J. H., Hastie, T. and Tibshirani, R. (2000). ‘Additive Logistic Regression: A Statistical View of Boosting’, The Annals of Statistics 28, 337–407. With discussion. URL: http://projecteuclid.org/euclid.aos/1016218223

Friedman, Jerome, Hastie, Trevor and Tibshirani, Robert. (2010). ‘Regularization Paths for Generalized Linear Models via Coordinate Descent’, Journal of Statistical Software 33(1), 1–22. URL: http://www.jstatsoft.org/v33/i01

Ing, Ching-Kang and Lai, Tze Leung. (2011). ‘A stepwise regression method and consistent model selection for high-dimensional sparse linear models’, Statistica Sinica 21(4), 1473–1513.

Mallat, S. G. and Zhang, Zhifeng. (1993). ‘Matching Pursuits with Time-frequency Dictionaries’, IEEE Transactions on Signal Processing 41(12), 3397–3415. URL: http://dx.doi.org/10.1109/78.258082

R Core Team. (2014). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/

Zhang, T. and Yu, B. (2005). ‘Boosting with Early Stopping: Convergence and Consistency’, The Annals of Statistics 33, 1538–1579. URL: http://projecteuclid.org/euclid.aos/1123250222