L2 BOOSTING IN HIGH-DIMENSIONS: RATE OF CONVERGENCE
YE LUO AND MARTIN SPINDLER

Date: March 1, 2016.
Luo: University of Florida, USA, lkurt@mit.edu. Spindler: Max Planck Society, Munich Center for the Economics of Aging, Amalienstr. 33, 80799 Munich, Germany, spindler@mea.mpisoc.mpg.de. We thank the German Research Foundation (DFG) for financial support through a grant for “Initiation of International Collaborations”.
Abstract. Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of L2 Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called “post-Boosting”. This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by L2 Boosting. Another variant is orthogonal boosting, where after each step an orthogonal projection is conducted. We show that both post-L2 Boosting and orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. The “classical” L2 Boosting achieves a slower convergence rate for prediction, but, in contrast to the rates established e.g. for LASSO, no assumptions on the design matrix are imposed for this result. We also introduce rules for early stopping which can easily be implemented and used in applied work. Moreover, our results allow a direct comparison between LASSO and boosting that has been missing in the literature. Finally, we present simulation studies to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In the simulation studies post-L2 Boosting clearly outperforms LASSO.
Key words: L2 Boosting, post-L2 Boosting, orthogonal L2 Boosting, High-dimensional models, LASSO,
rate of convergence
1. Introduction
In this paper we consider L2 Boosting algorithms for regression. Boosting algorithms
represent one of the major advances in machine learning and statistics in recent years.
Freund and Schapire’s AdaBoost algorithm for classification (Freund and Schapire (1997))
has attracted much attention in the machine learning community as well as in statistics. Many variants of the AdaBoost algorithm have been introduced and proven to
be very competitive in terms of prediction accuracy in a variety of applications with a
strong resistance to overfitting. Boosting methods were originally proposed as ensemble
methods, which rely on the principle of generating multiple predictions and majority voting (averaging) among the individual classifiers (cf Bühlmann and Hothorn (2007)). An
important step in the analysis of Boosting algorithms was Breiman’s interpretation of
Boosting as a gradient descent algorithm in function space, inspired by numerical optimization and statistical estimation (Breiman (1996), Breiman (1998)). Building on this
insight, Friedman et al. (2000) and Friedman (2001) embedded Boosting algorithms into
the framework of statistical estimation and additive basis expansion. This also enabled
the application of boosting for regression analysis. Boosting for regression was proposed
by Friedman (2001) and Bühlmann and Yu (2003) defined and introduced L2 Boosting. An
extensive overview of the development of Boosting and its manifold applications is given
in the survey Bühlmann and Hothorn (2007).
In this paper we analyze the rate of convergence of L2 Boosting in a high-dimensional
setting. We show that L2 Boosting reaches a slower rate of convergence than the typical
rate of LASSO without any assumptions on the design matrix. But we can show that the so-called “post-Boosting”, which we introduce in this paper, and the orthogonal boosting variant both achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. The idea of post-Boosting is to apply ordinary least squares (OLS) to the model selected by first-step L2 Boosting. After each step, orthogonal Boosting conducts an orthogonal projection on the space of the already selected variables.
Boosting uses – compared to LASSO – a somewhat unusual penalization scheme. Penalization is done by “early stopping” to avoid overfitting in the high-dimensional case. In
the low-dimensional case boosting without stopping converges to the ordinary least squares
(OLS) solution. In a high-dimensional setting early stopping is key for avoiding overfitting
and for the predictive performance of boosting. We give a new stopping rule which is
simple to implement and also works very well in practical settings as demonstrated in the
simulation studies.
In a deterministic setting (e.g. in approximation theory) the boosting methods are
also known as greedy algorithms (pure greedy algorithm (PGA) and orthogonal greedy
algorithm (OGA)). In signal processing L2 Boosting is essentially the same as the matching
pursuit algorithm by Mallat and Zhang (1993). We will employ the abbreviations post-BA
(post-L2 Boosting algorithm) and oBA (orthogonal L2 Boosting algorithm) for the stochastic
versions we analyze.
As mentioned above, Boosting for regression was introduced by Friedman (2001). L2 Boosting
was defined in Bühlmann and Yu (2003). Numerical convergence, consistency and statistical rates of convergence of boosting with early stopping in a low-dimensional setting were
provided in Zhang and Yu (2005). Consistency in prediction norm of L2 Boosting in a
high-dimensional setting was first proved in Bühlmann (2006). The orthogonal Boosting
algorithm in a statistical setting under different assumptions was analyzed in Ing and Lai
(2011). Rates for the PGA and OGA case are provided in Barron et al. (2008).
The structure of the paper is as follows: In Section 2 the L2 Boosting algorithm (BA)
is defined together with its modifications, the post-L2 Boosting algorithm (post-BA) and
the orthogonalized version (oBA). In Section 3 we present the main results of our analysis.
The proofs are provided in the Appendix. Section 4 contains a simulation study which
offers some insights into the methods and also provides some guidance for stopping rules
in applications. Section 5 shows an application and finally, we conclude.
Notation: Let $x$ and $y$ be $n$-dimensional vectors. We define the empirical average as $\mathbb{E}_n[x] = \frac{1}{n}\sum_{i=1}^n x_i$; $\|x\|_{2,n} := \sqrt{\mathbb{E}_n[x^2]}$ denotes the empirical $L_2$-norm, and $\langle\cdot,\cdot\rangle_n$ the inner product defined as $\langle x,y\rangle_n = \frac{1}{n}\sum_{i=1}^n x_i y_i$, with $\langle\cdot,\cdot\rangle_{n,2} = (\langle\cdot,\cdot\rangle_n)^2$. For a random variable $X$, $E[X]$ denotes its expectation, and $\operatorname{corr}(X,Y)$ the correlation between two random variables $X$ and $Y$. We use the notation $a \vee b = \max\{a,b\}$ and $a \wedge b = \min\{a,b\}$. We also write $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on $n$, and $a \lesssim_P b$ to denote $a = O_P(b)$. For a vector $U$, $\operatorname{supp}(U)$ denotes the set of indices of its non-zero entries. Given a vector $\beta \in \mathbb{R}^p$ and a set of indices $T \subset \{1,\ldots,p\}$, we denote by $\beta_T$ the vector with $\beta_{T_j} = \beta_j$ if $j \in T$ and $\beta_{T_j} = 0$ if $j \notin T$.
2. L2 Boosting with componentwise least squares
To define the boosting algorithm for linear models, we consider the following regression
model:
(1) $\quad y_i = x_i'\beta + \varepsilon_i, \qquad i = 1,\ldots,n,$
with vector $x_i = (x_{i,1},\ldots,x_{i,p_n})$ consisting of $p_n$ predictor variables, $\beta$ a $p_n$-dimensional coefficient vector, and a random, mean-zero error term $\varepsilon_i$. Further assumptions will be
employed in the next sections.
We allow that the dimension of the predictors pn grows with sample size n and is even
larger than the sample size, i.e. dim(β) = pn ≫ n. But we will require some sparsity
condition. This means that there is a large set of potential variables, but the number of
variables which have non-zero coefficients, denoted by s, is small compared to the sample
size, i.e. s ≪ n. This can be weakened to approximate sparsity as defined and to be
explained later. More precise assumptions will also be made later. In the following we will
drop the dependence of pn on the sample size and denote it with p if there is no confusion.
X denotes the n × p design matrix where the single observations xi form the rows. Xj
denotes the jth column of design matrix, xi,j the jth component of the vector xi . We
consider a fixed design for the regressors. We assume that the regressors are standardized
with mean zero and variance one, i.e. En [xi,j ] = 0 and En [x2i,j ] = 1 for j = 1, . . . , p,
The basic principle of Boosting can be described as follows. We follow the interpretation of Breiman (1998) and Friedman (2001) of Boosting as a functional gradient descent optimization (minimization) method. The goal is to minimize a loss function, e.g. an $L_2$-loss or the negative log-likelihood function of a model, by an iterative optimization scheme. In each step the (negative) gradient, which is used for updating the current solution, is modelled and estimated by a parametric or nonparametric statistical model, the so-called base learner. The fitted gradient is used for updating the solution of the optimization problem. A strength of boosting, besides that it can be used for different loss
functions, is its flexibility with regard to the base learners. We then repeat this procedure
until some stopping criterion is met.
Many different forms of boosting algorithms have been developed in the literature. In this paper we consider L2 Boosting with componentwise linear least squares and two variants, which are all designed for regression analysis. “L2” refers to the loss function, which is the typical sum-of-squared residuals $Q_n(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2$ in regression analysis. In this case the gradient equals the residuals. “Componentwise linear least squares” refers to the base learners. We fit the gradient (i.e. the residuals) against each regressor ($p$ univariate regressions) and select the predictor/variable which correlates most with the gradient/residual, i.e. which decreases the loss function most, and then update the estimator in this direction. We next update the residuals and repeat the procedure until some stopping criterion is met.
We consider L2 Boosting and two modifications: the “classical” one, which was introduced in Friedman (2001) and refined in Bühlmann and Yu (2003) for regression analysis, an orthogonal variant, and post-L2 Boosting. As far as we know, post-L2 Boosting has not been defined and analyzed in the literature before. In signal processing and approximation theory the first two methods are known as the pure greedy algorithm (PGA) and the orthogonal greedy algorithm (OGA) in the deterministic setting, i.e. in a setting without stochastic error terms.
2.1. L2 Boosting. For L2 Boosting with componentwise least squares the algorithm is given
below.
Algorithm 1 (L2 Boosting).
(1) Start / Initialization: $\beta^0 = 0$ ($p$-vector), $f^0 = 0$; set the maximum number of iterations $m_{stop}$ and set the iteration index $m$ to 0.
(2) Calculate the residuals $U_i^m = y_i - x_i'\beta^m$.
(3) For each predictor variable $j = 1,\ldots,p$ calculate the correlation with the residuals:
$$\gamma_j^m := \frac{\sum_{i=1}^n U_i^m x_{i,j}}{\sum_{i=1}^n x_{i,j}^2} = \frac{\langle U^m, x_j\rangle_n}{\mathbb{E}_n[x_{i,j}^2]}.$$
Select the variable $j^m$ which is most correlated with the residuals, i.e. $\max_{1\le j\le p}|\operatorname{corr}(U^m, x_j)|$ (equivalently, the variable which fits the gradient best in an $L_2$-sense).
(4) Update the estimator: $\beta^{m+1} := \beta^m + \gamma_{j^m}^m e_{j^m}$, where $e_{j^m}$ is the $j^m$th index vector, and $f^{m+1} := f^m + \gamma_{j^m}^m x_{j^m}$.
(5) Increase $m$ by one. If $m < m_{stop}$, continue with (2); otherwise stop.
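To make the componentwise updates concrete, here is a minimal R sketch of Algorithm 1 (an illustration only, not the implementation used for the simulations below; it assumes standardized regressors and a fixed number of iterations mstop, and the step-size parameter nu anticipates Comment 2.1 below):

```r
# Minimal sketch of Algorithm 1: L2 Boosting with componentwise least squares.
# Assumes the columns of X are standardized (mean zero, variance one).
# 'nu' is the step-size / shrinkage parameter discussed in Comment 2.1.
l2boost <- function(X, y, mstop = 100, nu = 1) {
  p <- ncol(X)
  beta <- rep(0, p)
  U <- y                                               # residuals U^0 = y, since f^0 = 0
  for (m in seq_len(mstop)) {
    gamma <- as.vector(crossprod(X, U)) / colSums(X^2) # componentwise least squares fits
    j <- which.max(abs(gamma))                         # most correlated variable (standardized X)
    beta[j] <- beta[j] + nu * gamma[j]                 # update the coefficient vector
    U <- U - nu * gamma[j] * X[, j]                    # update the residuals
  }
  list(beta = beta, residuals = U, selected = which(beta != 0))
}
```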
The act of stopping is crucial for boosting algorithms, as stopping too late or never
stopping leads to overfitting and therefore some kind of penalization is required. A suitable
solution is to stop early, i.e. before overfitting takes place. “Early stopping” can be
interpreted as a form of penalization. Similar to LASSO, early stopping might induce bias
through shrinkage. A potential way to decrease the bias is by “post-Boosting”. This is
a post-model selection estimator that applies ordinary least squares (OLS) to the model
selected by first-step L2 Boosting. To define this estimator properly, we make the following definitions: $T := \operatorname{supp}(\beta)$ and $\hat T := \operatorname{supp}(\beta^{m^*})$, the support of the true model and the support of the model estimated by L2 Boosting as described above with stopping at $m^*$. A superscript $C$ denotes the complement of the set with regard to $\{1,\ldots,p\}$. In the context of LASSO, OLS after model selection was analyzed in Belloni and Chernozhukov (2013). Given the above definitions, the post-model selection estimator or OLS post-L2 Boosting estimator takes the form
(2) $\quad \tilde\beta = \operatorname{argmin}_{\beta\in\mathbb{R}^p} Q_n(\beta)\ :\ \beta_j = 0 \text{ for each } j \in \hat T^C.$
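A minimal sketch of the post-L2 Boosting estimator in (2), reusing the illustrative l2boost function above; the OLS refit is restricted to the columns selected in the first stage:

```r
# Minimal sketch of post-L2 Boosting: OLS applied to the variables selected
# by first-stage L2 Boosting ('l2boost' from the sketch above).
post_l2boost <- function(X, y, mstop = 100, nu = 1) {
  fit  <- l2boost(X, y, mstop = mstop, nu = nu)
  That <- fit$selected                                  # estimated support \hat{T}
  beta_tilde <- rep(0, ncol(X))
  if (length(That) > 0) {
    ols <- lm.fit(x = X[, That, drop = FALSE], y = y)   # OLS refit on the selected columns
    beta_tilde[That] <- ols$coefficients
  }
  list(beta = beta_tilde, selected = That)
}
```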
Comment 2.1. For boosting algorithms it has been recommended – supported by simulation studies – not to take the full update step $\gamma_{j^m}^m x_{j^m}$ but only a fraction $\nu$ of it. The parameter $\nu$ can be interpreted as a shrinkage parameter or, alternatively, as describing the step size when updating the function estimate along the gradient. Small step sizes (or shrinkage) make the boosting algorithm slower to converge and require a larger number of iterations. But the additional computational cost often results in better out-of-sample prediction performance. By default, $\nu$ is usually set to 0.1. Our analysis in the later sections also extends to a restricted step size $0 < \nu < 1$.
2.2. Orthogonal L2 Boosting. A variant of the boosting algorithm is orthogonal Boosting (oBA), or the Orthogonal Greedy Algorithm in its deterministic version. Only the updating step is changed: an orthogonal projection of the response variable onto the space of all variables which have been selected up to this point is conducted. The advantage of this method is that any variable is selected at most once in this procedure, while in the previous version the same variable might be selected in different steps, which makes the analysis far more complicated. More formally, the method can be described as follows by modifying Step (4):
Algorithm 2 (Orthogonal L2 Boosting).
(4′) $\quad \hat y^{m+1} \equiv f^{m+1} = P_m y \quad$ and $\quad U_i^{m+1} = Y_i - \hat Y_i^{m+1},$
where $P_m$ denotes the projection of the variable $y$ onto the space spanned by the first $m$ selected variables. (The corresponding regression coefficient is denoted $\beta_o^{m+1}$.)
Define $X_o^m$ as the matrix which consists only of the columns which correspond to the variables selected in the first $m$ steps, i.e. all $X_{j_k}$, $k = 0,1,\ldots,m$. Then we have:
(3) $\quad \beta_o^m = ((X_o^m)'X_o^m)^{-1}(X_o^m)'y$
(4) $\quad \hat y^{m+1} = f_o^{m+1} = X_o^m\beta_o^m$
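A minimal R sketch of the orthogonal variant (illustration only; it presumes that the number of iterations stays below the sample size, cf. Comment 2.3 below):

```r
# Minimal sketch of orthogonal L2 Boosting (oBA): after each selection the
# response is projected onto all variables selected so far.
ol2boost <- function(X, y, mstop = 50) {
  selected <- integer(0)
  U <- y                                                  # U_o^0 = y
  for (m in seq_len(mstop)) {
    score <- abs(as.vector(crossprod(X, U)))              # |<X_j, U_o^m>| for every j
    score[selected] <- -Inf                               # selected variables are never re-selected
    selected <- c(selected, which.max(score))
    fit <- lm.fit(x = X[, selected, drop = FALSE], y = y) # orthogonal projection of y
    U <- fit$residuals                                    # U_o^{m+1} = (I - P_m) y
  }
  beta_o <- rep(0, ncol(X))
  beta_o[selected] <- fit$coefficients
  list(beta = beta_o, selected = selected, residuals = U)
}
```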
Comment 2.2. Orthogonal L2 Boosting might be interpreted as post-L2 Boosting where
the refit takes place after each step.
Comment 2.3. Both post-Boosting and orthogonal Boosting require that the number of
selected variables be smaller than the sample size to be well-defined. This is enforced by
our stopping rule as we will see later.
3. Main Results
In this section we discuss the main results regarding the L2 Boosting procedure (BA), post-L2 Boosting (post-BA) and the orthogonal procedure (oBA) in a high-dimensional setting. L2 Boosting (BA) is discussed first. Then we analyze the remaining procedures under bounded restricted eigenvalue conditions, which are also used in the traditional LASSO literature.
3.1. L2 Boosting with Componentwise Least Squares. We analyze the linear regression model introduced in the previous section in a high-dimensional setting.
The following definitions will be helpful for the analysis: U m denotes the residuals at
the mth iteration given by U m = Y − Xβ m . β m is the estimator at the mth iteration. We
define the difference between the true and the estimated vector as αm := β − β m . The
prediction error is given by V m = Xαm .
For the Boosting algorithm it is essential to determine when to stop, i.e. the stopping criterion. In the low-dimensional case, the stopping time is not important: the value of the objective function decreases and converges to the traditional OLS solution exponentially fast, as described in Bühlmann and Yu (2006). In the high-dimensional case, such fast convergence rates are usually not available: the residual ε can be explained by n linearly independent variables $x_j$. Thus selecting more terms only leads to overfitting. Early stopping is comparable to the penalization in LASSO, which prevents one from choosing too many variables and hence overfitting. Comparable to LASSO, a sparse structure will be needed for the analysis.
A.1 (Exact Sparsity). T = supp(β) and s = |T | ≪ n.
This means that the set of potential variables might be large, but the number of variables
which are non-zero is restricted to be smaller than the sample size.
Comment 3.1. The exact sparsity assumption can be weakened to an approximate sparsity condition. This means that the set of relevant regressors is small, and the other
variables do not have to be exactly zero but must be negligible compared to the estimation
error.
At each step, we minimize $\|U^m\|_{2,n}^2$ along the “most greedy” variable $X_{j^m}$. Below is a traditional assumption on the error term ε:

A.2. With probability greater than or equal to $1-\alpha$ we have $\sup_{1\le j\le p}|\langle X_j,\varepsilon\rangle_n| \le 2\sigma\sqrt{\frac{\log(2p/\alpha)}{n}} := \lambda_n$.
Comment 3.2. The previous assumption is e.g. implied if the error terms are i.i.d.
N (0, σ 2 ) random variables. This in turn can be generalized / weakened to non-normality
cases by self-normalized random vector theory (de la Peña et al. (2009)) or the approach
introduced in Chernozhukov et al. (2014).
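To make the constant in A.2 transparent, a short calculation for the i.i.d. Gaussian case of Comment 3.2 (with standardized regressors, so that $\langle X_j,\varepsilon\rangle_n \sim N(0,\sigma^2/n)$): for $t = \sigma\sqrt{2\log(2p/\alpha)/n} \le \lambda_n$,
$$P\Bigl(\sup_{1\le j\le p}|\langle X_j,\varepsilon\rangle_n| > t\Bigr) \le \sum_{j=1}^{p} P\bigl(|\langle X_j,\varepsilon\rangle_n| > t\bigr) \le 2p\exp\Bigl(-\frac{nt^2}{2\sigma^2}\Bigr) = \alpha.$$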
By definition, $U^m := \varepsilon + X\alpha^m = \varepsilon + V^m$, where $|\operatorname{supp}(\alpha^m)| \le s+m$ since $\alpha^m := \beta - \beta^m$, and $\max_{j\in\operatorname{supp}(\alpha^m)}\rho^2(U^m, X_j) \ge \frac{1}{|\operatorname{supp}(\alpha^m)|} \ge \frac{1}{m+s}$.
Lemma 1. $\|U^{m+1}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - \langle U^m, X_{j_m}\rangle_n^2 = \|U^m\|_{2,n}^2\,(1 - \rho^2(U^m, X_{j_m}))$, and $\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 - 2\langle V^m, \gamma_{j_m}^m X_{j_m}\rangle_n + (\gamma_{j_m}^m)^2$, where $\gamma_{j_m}^m = \langle U^m, X_{j_m}\rangle_n$.
Moreover, since $V^m = U^m - \varepsilon$, $\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 + 2\langle U^m, X_{j_m}\rangle_n\langle\varepsilon, X_{j_m}\rangle_n - \langle U^m, X_{j_m}\rangle_n^2$.
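For the reader's convenience, the first identity of Lemma 1 follows from a one-line expansion, using $\gamma_{j_m}^m = \langle U^m, X_{j_m}\rangle_n$ and the standardization $\|X_{j_m}\|_{2,n} = 1$:
$$\|U^{m+1}\|_{2,n}^2 = \|U^m - \gamma_{j_m}^m X_{j_m}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - 2\gamma_{j_m}^m\langle U^m, X_{j_m}\rangle_n + (\gamma_{j_m}^m)^2 = \|U^m\|_{2,n}^2 - \langle U^m, X_{j_m}\rangle_n^2;$$
the remaining displays follow in the same way from $V^{m+1} = V^m - \gamma_{j_m}^m X_{j_m}$ and $\langle V^m, X_{j_m}\rangle_n = \langle U^m, X_{j_m}\rangle_n - \langle\varepsilon, X_{j_m}\rangle_n$.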
Denote $\sigma_n^2 := \mathbb{E}_n[\varepsilon^2]$. From Lemma 1 we expect $\|V^m\|_{2,n}^2$ and $\|U^m\|_{2,n}^2$ to have very similar behavior, since $\sup_{1\le j\le p}|\langle\varepsilon, X_j\rangle_n| \le \lambda_n$. Below are the key lemmas that describe the difference between $\|U^m\|_{2,n}^2$ and $\|V^m\|_{2,n}^2$:
Lemma 2. Suppose assumptions A.1 and A.2 hold. Let $Z_m = \|U^m\|_{2,n}^2 - \|V^m\|_{2,n}^2$. Then $|Z_m - \sigma_n^2| \le 2\sqrt{m+s}\,\lambda_n\|V^m\|_{2,n}$.
Lemma 2 bounds the difference between $\|U^m\|_{2,n}^2$ and $\|V^m\|_{2,n}^2$. This difference is $\sigma^2(1 - O_p(s/n))$ if $\beta^m = \beta$.
Recall that $\|U^{m+1}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - (\gamma_{j_m}^m)^2$, where $|\gamma_{j_m}^m| = \max_{1\le j\le p}|\langle X_j, U^m\rangle_n| = \max_{1\le j\le p}|\langle X_j, V^m\rangle_n + \langle X_j,\varepsilon\rangle_n|$. It is obvious that $\sup_j|\langle X_j, V^m\rangle_n| \ge \frac{1}{\sqrt{m+s}}\|V^m\|_{2,n}$, since we can pick the $j\in\operatorname{supp}(\alpha^m)$ that is most correlated with $V^m$. The next key lemma establishes a lower bound on the speed of decay of $\|U^m\|_{2,n}^2$.

Lemma 3 (Lower bound on $\gamma_{j_m}^m$). $|\gamma_{j_m}^m| \ge \frac{1}{\sqrt{m+s}}\|V^m\|_{2,n} - \lambda_n$.
Lemma 4 (Bounds). Suppose assumptions A.1 and A.2 hold and $\frac{s\log(p)}{n} \to 0$. Let $m_0 := C\frac{\sqrt{s}}{\lambda_n} = \frac{C}{2\sigma}\sqrt{\frac{sn}{\log p}} = o\!\left(\frac{n}{\log(p)}\right)$ for some constant $C$. Then the prediction error satisfies $\|V^{m_0}\|_{2,n} \lesssim_P \left(\frac{s\log(p)}{n}\right)^{1/4}$.
Comment 3.3. The above bound is obtained without any of the additional assumptions
on the design matrix X which are often required in LASSO, e.g., the bounded restricted
eigenvalue assumption. The bound also works well under certain approximate sparsity
conditions.
Comment 3.4. We proved that there exists a stopping rule that leads to a certain convergence rate. From these theoretical insights we derive and recommend the following stopping rule for practical applications: stop at iteration $m$ if
$$\frac{\|U^m\|_{2,n}^2}{\|U^{m-1}\|_{2,n}^2} > 1 - c\log(p)/n,$$
where $c$ is a constant with $c > 1$. For the simulation studies we employ $c = 1.1$, which seems to work quite well.
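A minimal sketch of Algorithm 1 combined with this stopping rule (illustration only; $c = 1.1$ as recommended above, and the final residual variance is the estimator discussed in Comment 3.5 below):

```r
# Minimal sketch of L2 Boosting with the data-dependent stopping rule of
# Comment 3.4: stop once the squared residual norm no longer improves by a
# factor of at least (1 - c*log(p)/n).
l2boost_stop <- function(X, y, mmax = 200, nu = 1, c = 1.1) {
  p <- ncol(X); n <- nrow(X)
  beta <- rep(0, p); U <- y
  rss_old <- mean(U^2)                                  # ||U^0||_{2,n}^2
  for (m in seq_len(mmax)) {
    gamma <- as.vector(crossprod(X, U)) / colSums(X^2)
    j <- which.max(abs(gamma))
    beta[j] <- beta[j] + nu * gamma[j]
    U <- U - nu * gamma[j] * X[, j]
    rss_new <- mean(U^2)                                # ||U^m||_{2,n}^2
    if (rss_new / rss_old > 1 - c * log(p) / n) break   # stopping rule of Comment 3.4
    rss_old <- rss_new
  }
  list(beta = beta, mstop = m, sigma2_hat = rss_new)    # rss_new: variance estimate (Comment 3.5)
}
```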
Comment 3.5. It is also important to have an estimator for the variance of the error term $\sigma^2$, denoted by $\hat\sigma_{n,m}^2$. A consistent estimator of the variance is given by $\|U^m\|_{2,n}^2$ at the stopping time.
3.2. Orthogonal L2 Boosting in a high-dimensional setting with bounded restricted eigenvalue assumptions. In this section we analyze orthogonal L2 Boosting.
For the orthogonal case, we obtain a faster rate of convergence than with the variant analyzed in the section before. We make use of similar notation as in the previous subsection:
Uom denotes the residual and Vom the prediction error, formally defined below. Again, define
βom as the parameter estimate after the mth iteration.
The orthogonal Boosting Algorithm was introduced in the previous section. For completeness we give here the full version with some additional notation which will be required
in a later analysis.
Algorithm 3 (Orthogonal L2 Boosting).
(1) Initialization: Set $\beta_o^0 = 0$, $f_o^0 = 0$, $U_o^0 = Y$ and the iteration index $m = 0$.
(2) Define $X_o^m$ as the matrix of all $X_{j_k}$, $k = 0,1,\ldots,m$, and $P_o^m$ as the projection matrix given $X_o^m$.
(3) Let $j_m$ be the maximizer of $\max_{1\le j\le p}\rho^2(X_j, U_o^m)$. Then $f_o^{m+1} = P_o^m Y$ with corresponding regression projection coefficient $\beta_o^{m+1}$.
(4) Calculate the residual $U_o^m = Y - X\beta_o^m = (I - P_o^m)Y := M_{P_o^m}Y$ and $V_o^m = M_{P_o^m}X\beta$.
(5) Increase $m$ by one. If some stopping criterion is reached, then stop; else continue with Step 2.
It is easy to see that
(5) $\quad \|U_o^{m+1}\|_{2,n} \le \|U_o^m\|_{2,n}.$
The benefit of the oBA method, compared to L2 Boosting, is that once a variable $X_j$ is selected, the procedure will never select this variable again. This means that every variable is selected at most once.
For any square matrix A, we denote φs (A) and φl (A) as the smallest and largest eigenvalues of A.
We make a restricted eigenvalue assumption which is also commonly used in the analysis
of LASSO.
Definition 3.1. The smallest and largest restricted eigenvalues are defined as
$$\phi_s(s, M) := \min_{\substack{\dim(A)\le s\times s,\\ A\text{ any diagonal submatrix of }M}}\phi_s(A) \qquad\text{and}\qquad \phi_l(s, M) := \max_{\substack{\dim(A)\le s\times s,\\ A\text{ any diagonal submatrix of }M}}\phi_l(A).$$
A.3. We assume that there exist constants $c, C$ such that $0 < c \le \phi_s(s', \mathbb{E}_n[x_i x_i']) \le \phi_l(s', \mathbb{E}_n[x_i x_i']) \le C < \infty$ for any $s' \le s\eta_n$, where $\eta_n$ is a sequence such that $\eta_n \to \infty$ slowly, e.g. $\eta_n = \sqrt{\log(n)}$.
Denote by $T^m$ the set of variables selected at the $m$th iteration and $S^m := T - T^m$. We know that $|T^m| = m$ by construction. Denote $S_c^m = T^m - S^m$.
Lemma 5 (lower bound of the remainder). Suppose assumptions A.1-A.3 hold. For any
m such that |T m | < sηn , ||Vom ||2,n ≥ c1 ||XβS m ||2,n , for some constant c1 > 0. If S m = ∅,
then ||Vom ||2,n = 0.
The above Lemma essentially says: if S m is non-empty, then there is still room for
significant improvement in the value ||V m ||22,n . The next Lemma is key and shows how
rapidly the estimates decay. It is obvious that ||Uom ||2,n and ||Vom ||2,n are both decaying
sequences.
A.4. minj∈T |βj | ≥ J and maxj∈T |βj | ≤ J ′ for some constants J > 0 and J ′ < ∞.
Comment 3.6. The first part of Assumption A.4 is known as a “beta-min” assumption, as it restricts the size of the non-zero coefficients from below.
Lemma 6 (upper bound of the remainder). Suppose assumptions A.1-A.4 hold and $\sqrt{s}\lambda_n \to 0$. Let $m^*$ be the first time that $\|U^{m^*}\|_{2,n}^2 < \sigma^2 + 2K\sigma s\lambda_n^2$. Then $m^* < Ks$ and $\|V_o^{m^*}\|_{2,n}^2 \lesssim s\log(p)/n$ with probability going to 1.
Although in general L2 Boosting may have a slower convergence rate than LASSO, oBA
reaches the rate of LASSO (under some additional conditions). The same technique used
in Lemma 6 also holds for the post-L2 Boosting case. Basically, we can prove, using similar
arguments, that T Ks ⊃ T when K is a large enough constant. Thus, post-L2 Boosting
enjoys the LASSO convergence rate given assumptions A.1-A.4. We state this in the next
section formally.
3.3. Post-L2 Boosting with Componentwise Least Squares. In many cases, penalization estimators like LASSO introduce some bias by shrinkage. The idea of “post-estimators” is to estimate the final model by ordinary least squares, including all the variables which have been selected in the first step. We introduced post-L2 Boosting in Section 2. Now we give the convergence rate for this procedure. Surprisingly, it improves upon the rate of convergence of L2 Boosting and reaches the rate of LASSO (under stronger assumptions). The proof of the result follows the idea of the proof of Lemma 6.
Lemma 7 (Post-L2 Boosting). Suppose assumptions A.1-A.4 hold and $\sqrt{s}\lambda_n \to 0$. Let $m^* = Ks$ be the stopping time with $K$ a large enough constant, and let $T^{m^*}$ be the set of variables selected at iteration $m^*$ of the L2 Boosting procedure. Then $T \subset T^{m^*}$ with probability going to 1 and $\|P_{X_{T^{m^*}}}Y - X\beta\|_{2,n}^2 \le CKs\lambda_n^2 \lesssim s\log(p)/n$.
The proof of this Lemma is similar to that of Lemma 6.
Comment 3.7. Our procedure is particularly sensitive to the starting value of the algorithm, at least in the non-asymptotic case. This might be exploited for screening and model selection: the procedure is run from different starting values until stopping, and then the intersection of the variables selected in the individual runs is taken. This procedure might establish a sure screening property.
3.4. Approximate sparsity. In the previous sections we utilized Assumption A.1 (exact sparsity) for deriving the results. These results can be relaxed to the approximate sparsity case.
A.5 (Approximate Sparsity). There exists a set $T \subset \{1,2,\ldots,p\}$ such that $|T| = s \ll n$, and $f(x) = x'\beta_T + r_n$ with $\|r_n\|_{2,n} \le \sqrt{s}\lambda_n$ with probability going to 1.
The main results in Lemma 4, Lemma 6 and Lemma 7 hold under A.2-A.5 rather than under A.1-A.4. The beta-min condition stated in Assumption A.4 can be relaxed to the following condition stated in Candes and Tao (2007):
A.6. There exists a constant $C$ large enough such that $\min_{j\in T}|\beta_j| \ge C\lambda_n$.
Lemma 8 (Approximate sparsity). Suppose conditions A.2, A.3, A.5 and A.6 hold. For the orthogonal (oBA) and post-L2 Boosting procedures, define $m^* := Ks\log(n)$ for some constant $K$. Then $T^{m^*} \supset T$ with probability going to 1, and $\|V_o^{m^*}\|_{2,n}^2 \lesssim_P \frac{s\log(n\vee p)}{n}$ and $\|P_{X_{T^{m^*}}}Y - X\beta\|_{2,n}^2 \lesssim_P \frac{s\log(n\vee p)}{n}$.
4. Simulation Study
In this section we present the results of our simulation study. The goal of this exercise is to illustrate the relevance of our theoretical results and to provide insights into the functionality and practical aspects of boosting. In particular, we demonstrate that the rules for early stopping we propose work reasonably well in the simulations, and we give guidance for practical applications. Moreover, the comparison with LASSO might also be of interest.
We consider the following linear model
(6) $\quad y = \sum_{j=1}^{p}\beta_j x_j + \varepsilon,$
with ε iid standard normally distributed. For the coefficient vector β we consider two designs. First, we consider a sparse design, i.e. the first $s$ elements of β are set equal to one and all other components to zero ($\beta = (1,\ldots,1,0,\ldots,0)$), and second a polynomial design in which the $j$th coefficient is given by $1/j$, i.e. $\beta = (1, 1/2, 1/3, \ldots, 1/p)$.
For the design matrix X we consider two different settings: an “orthogonal” setting and a “correlated” setting. In the former setting the entries of X are drawn as iid draws from a standard normal distribution. In the correlated design the $x_i$ (rows of X) are distributed according to a multivariate normal distribution where the correlations are given by a Toeplitz matrix with factor 0.5 and alternating sign. A sketch of this data-generating process is given after the list of settings below.
We have the following settings:
• X: “orthogonal” or “correlated”
• coefficient vector β: sparse design or polynomial decaying design
• n = 100, 200, 400, 800
• p = 100, 200
• s = 10
• K = 2
• out-of-sample prediction size n1 = 50
• number of repetitions R = 500
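A minimal sketch of one simulation draw for the designs just described (an illustration, not the replication code; function and argument names are ours):

```r
# Minimal sketch of the data-generating process used in the simulation study.
library(MASS)   # for mvrnorm

gen_data <- function(n, p, s, design = c("orthogonal", "correlated"),
                     coefs = c("sparse", "poly"), n_test = 50) {
  design <- match.arg(design); coefs <- match.arg(coefs)
  beta  <- if (coefs == "sparse") c(rep(1, s), rep(0, p - s)) else 1 / (1:p)
  Sigma <- if (design == "orthogonal") diag(p) else
    toeplitz((-0.5)^(0:(p - 1)))                  # Toeplitz, factor 0.5, alternating sign
  X <- mvrnorm(n + n_test, mu = rep(0, p), Sigma = Sigma)
  y <- as.vector(X %*% beta) + rnorm(n + n_test)  # iid standard normal errors
  list(X = X[1:n, ], y = y[1:n],
       X_test = X[-(1:n), ], f_test = as.vector(X[-(1:n), ] %*% beta))
}

# Out-of-sample prediction error of an estimate 'bhat' (cf. the MSE defined below):
# mean((dat$f_test - dat$X_test %*% bhat)^2)
```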
We consider the following estimators: L2 Boosting with componentwise least squares, orthogonal L2 Boosting and LASSO. For Boosting and LASSO we also consider the post-selection estimators (“post”). For LASSO we consider a data-driven, regressor-dependent choice of the penalization parameter (Belloni et al. (2012)) and cross-validation. Although cross-validation is very popular, it does not rely on established theoretical results and therefore we prefer a comparison with the formal penalty choice developed in Belloni et al. (2012). For Boosting we consider three stopping rules: “oracle”, “Ks”, and a “data-dependent” stopping criterion which stops if
$$\frac{\|U^m\|_{2,n}^2}{\|U^{m-1}\|_{2,n}^2} = \frac{\hat\sigma_{m,n}^2}{\hat\sigma_{m-1,n}^2} > 1 - c\log(p)/n.$$
This means we stop if the ratio of the estimated variances does not improve by a certain amount any more. The Ks-rule stops after $K\times s$ variables have been selected. As $s$ is unknown, this rule is not directly applicable. The oracle rule stops when the mean squared error (MSE), defined below, is minimized, which is also not feasible in practical applications.
The simulations are performed in R (R Core Team (2014)). For LASSO estimation the packages hdm (Chernozhukov et al. (2015)) and glmnet (Friedman et al. (2010), for cross-validation) are used. The Boosting procedures were implemented by the authors. Our code is available upon request.
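For illustration, the LASSO benchmarks can be computed with the cited packages along the following lines, reusing the gen_data sketch above (the exact tuning options behind the reported results may differ):

```r
# Illustrative calls for the LASSO benchmarks (hdm: data-driven penalty,
# glmnet: cross-validation); options shown here may differ from those used
# for the reported results.
library(hdm)
library(glmnet)

dat <- gen_data(n = 100, p = 200, s = 10, design = "correlated", coefs = "sparse")

fit_lasso  <- rlasso(dat$X, dat$y, post = FALSE)   # LASSO with data-driven penalty
fit_plasso <- rlasso(dat$X, dat$y, post = TRUE)    # post-LASSO
fit_cv     <- cv.glmnet(dat$X, dat$y)              # LASSO with cross-validation

pred_cv <- predict(fit_cv, newx = dat$X_test, s = "lambda.min")
mse_cv  <- mean((dat$f_test - pred_cv)^2)          # out-of-sample MSE for Lasso-CV
```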
To evaluate the performance of the estimators we use the MSE criterion. We estimate the models on the same data sets and use the estimators to predict 50 observations out-of-sample. The (out-of-sample) MSE is defined as
(7) $\quad MSE = E[(f(X) - f^m(X))^2] = E[(X'(\beta - \beta^m))^2],$
where $m$ denotes the iteration at which we stop, depending on the employed stopping rule. The MSE is estimated by
(8) $\quad \widehat{MSE} = \sum_{i=1}^{n_1}(f(x_i) - f^m(x_i))^2 = \sum_{i=1}^{n_1}(x_i'(\beta - \beta^m))^2$
for the out-of-sample predictions.
The results of the simulation study are shown in the tables below.

Table 1. Simulation Results

                  n    p    s   BA-oracle  BA-Ks   BA-our  pBA-oracle  p-BA-Ks  p-BA-our  oBA-oracle  oBA-Ks  oBA-our
iid, sparse     100  100   10      0.475   0.713    0.624       0.121    0.566     0.335       0.118   0.802    0.485
iid, sparse     100  200   10      0.581   0.953    0.804       0.127    0.733     0.443       0.124   1.002    0.715
iid, sparse     200  200   10      0.184   0.367    0.295       0.054    0.313     0.195       0.054   0.424    0.260
iid, sparse     400  200   10      0.084   0.173    0.144       0.026    0.145     0.103       0.026   0.197    0.128
iid, sparse     800  200   10      0.036   0.079    0.065       0.013    0.067     0.048       0.013   0.090    0.057
cor., sparse    100  100   10      1.382   1.651    2.134       0.447    0.846     1.704       0.279   0.743    1.000
cor., sparse    100  200   10      3.112   3.034    3.121       2.198    2.498     2.953       3.026   2.580    2.679
cor., sparse    200  200   10      0.631   0.735    0.811       0.060    0.174     0.167       0.057   0.390    0.228
cor., sparse    400  200   10      0.199   0.259    0.247       0.025    0.038     0.087       0.025   0.189    0.115
cor., sparse    800  200   10      0.079   0.207    0.102       0.014    0.014     0.042       0.014   0.086    0.052
iid, poly.      100  100   10      0.398   0.772    0.575       0.392    0.959     0.616       0.413   1.148    0.719
iid, poly.      100  200   10      0.457   1.021    0.673       0.445    1.319     0.728       0.440   1.574    0.862
iid, poly.      200  200   10      0.274   0.513    0.364       0.269    0.602     0.382       0.269   0.655    0.435
iid, poly.      400  200   10      0.199   0.276    0.240       0.194    0.294     0.244       0.191   0.308    0.257
iid, poly.      800  200   10      0.125   0.149    0.147       0.125    0.152     0.149       0.126   0.155    0.152
cor., poly.     100  100   10      0.231   0.648    0.419       0.217    0.864     0.434       0.215   1.052    0.524
cor., poly.     100  200   10      0.261   0.896    0.506       0.255    1.207     0.534       0.255   1.538    0.657
cor., poly.     200  200   10      0.195   0.482    0.289       0.169    0.566     0.271       0.172   0.619    0.308
cor., poly.     400  200   10      0.146   0.258    0.204       0.119    0.270     0.180       0.114   0.294    0.185
cor., poly.     800  200   10      0.105   0.139    0.126       0.083    0.132     0.110       0.077   0.142    0.109

Table 2. Simulation Results continued

                  n    p    s   LASSO  post-LASSO  Lasso-CV  post-Lasso-CV
iid, sparse     100  100   10   0.897       0.606     0.540          0.877
iid, sparse     100  200   10   1.209       1.335     0.758          1.135
iid, sparse     200  200   10   0.348       0.416     0.276          0.510
iid, sparse     400  200   10   0.141       0.188     0.119          0.270
iid, sparse     800  200   10   0.067       0.088     0.056          0.135
cor., sparse    100  100   10   2.896       1.271     0.830          1.201
cor., sparse    100  200   10   3.115       2.260     1.874          2.422
cor., sparse    200  200   10   1.529       0.392     0.466          0.758
cor., sparse    400  200   10   0.415       0.170     0.192          0.356
cor., sparse    800  200   10   0.176       0.079     0.088          0.171
iid, poly.      100  100   10   0.436       0.653     0.428          0.709
iid, poly.      100  200   10   0.504       1.170     0.551          0.877
iid, poly.      200  200   10   0.325       0.521     0.311          0.518
iid, poly.      400  200   10   0.212       0.294     0.190          0.343
iid, poly.      800  200   10   0.145       0.166     0.118          0.213
cor., poly.     100  100   10   0.316       0.569     0.339          0.585
cor., poly.     100  200   10   0.354       0.877     0.412          0.658
cor., poly.     200  200   10   0.263       0.426     0.269          0.430
cor., poly.     400  200   10   0.184       0.245     0.170          0.304
cor., poly.     800  200   10   0.134       0.141     0.107          0.191

As expected, the oracle-based estimator dominates in all cases except in the correlated, sparse setting with more parameters than observations. Our stopping criterion gives very good results, on par with the infeasible Ks-rule. Not surprisingly, given our results, both post-Boosting and orthogonal Boosting outperform the standard L2 Boosting in most cases. A comparison of post- and orthogonal Boosting does not provide a clear answer, with advantages on both sides. It is interesting to see that post-LASSO improves upon LASSO, but there are some exceptions, probably driven by overfitting. Cross-validation works very well in many constellations. An important point of the simulation study is to compare Boosting and LASSO. It seems that in the polynomial decaying setting, Boosting (orthogonal Boosting with our stopping rule) dominates post-LASSO. This also seems true in the iid, sparse setting. In the correlated, sparse setting they are on par. Summing up, it seems that Boosting is a serious contender for LASSO.
Table 3. Average number of variables and iterations run in the simulation settings

                  n    p    s   num var BA  num iteration BA  num var oBA  num var Lasso
iid, sparse     100  100   10       13.080            14.692       14.358         16.354
iid, sparse     100  200   10       13.740            15.020       16.399         20.169
iid, sparse     200  200   10       13.670            15.396       14.508         20.238
iid, sparse     400  200   10       13.994            16.206       14.594         20.616
iid, sparse     800  200   10       13.858            16.266       14.148         20.590
cor., sparse    100  100   10        9.830            10.630       13.720         11.340
cor., sparse    100  200   10        8.230             8.450       11.480         14.270
cor., sparse    200  200   10       13.280            17.760       13.960         18.520
cor., sparse    400  200   10       13.430            25.580       14.040         19.170
cor., sparse    800  200   10       13.290            31            13.680        19.400
iid, poly.      100  100   10        7.630             7.630        8.450         12.510
iid, poly.      100  200   10        7.450             7.450        8.510         15.660
iid, poly.      200  200   10        9.580             9.590       10.480         18.190
iid, poly.      400  200   10       13.150            13.150       13.600         22.020
iid, poly.      800  200   10       17.540            17.550       18.130         27.440
cor., poly.     100  100   10        5.420             5.470        5.740          7.690
cor., poly.     100  200   10        5.360             5.380        6.080         11.160
cor., poly.     200  200   10        5.810             5.900        6.280         11.760
cor., poly.     400  200   10        7.420             7.880        8.260         13.470
cor., poly.     800  200   10        9.220            11.220       10.030         15.830
Table 3 shows the average number of variables selected for LASSO, BA and oBA (both with our stopping criterion) and the number of iterations run in the BA case in the simulations above. An interesting pattern is that in most cases LASSO selects more variables or, put differently, the boosting algorithms give sparser solutions.
Comment 4.1. It seems that in very high-dimensional settings, i.e. when the number of signals s is bigger than the sample size n (e.g. n = 20, p = 50, s = 30), boosting performs quite well and outperforms LASSO, which seems to break down. This case is not covered by our setting, but it is an interesting topic for future research and shows one of the advantages of boosting.
5. Application: Riboflavin production
In this section we present an application to demonstrate how the methods work when applied to a real data set and compare them to a related method, namely LASSO. The application involves genetic data and analyzes the production of riboflavin. First, we describe the data set, then we present the results.
5.1. Data set. The data set has been provided by DSM (Kaiseraugst, Switzerland) and was made publicly available for academic research in Bühlmann et al. (2014) (Supplemental Material).
The real-valued response / dependent variable is the logarithm of the riboflavin production rate. The (co-)variables measure the logarithm of the expression level of 4,088 genes (p = 4,088), which are normalized. This means that the covariables are standardized to have variance 1, and the dependent variable and the regressors are “de-meaned”, which is equivalent to including an unpenalized intercept. The data set consists of n = 71 observations which were hybridized repeatedly during a fed-batch fermentation process in which
different engineered strains and strains grown under different fermentation conditions were
analyzed. For further details we refer to Bühlmann et al. (2014), their Supplemental Material and the references therein.
5.2. Results. We analyzed the data set on the production of riboflavin (vitamin B2) with B. subtilis described above. We split the data set randomly into two samples: a training set and a testing set. We estimated the models with the different methods on the training set and then used the testing set to calculate the out-of-sample mean squared errors (MSE) in order to evaluate their predictive accuracy. The size of the training set was 60 and the remaining 11 observations were used for forecasting. The table below shows the out-of-sample MSE for the different methods discussed in the previous sections.
Table 4. Results Riboflavin Production – out-of-sample MSE

   BA-Ks   BA-our  p-BA-Ks  p-BA-our  oBA-Ks  oBA-our   LASSO  p-LASSO
  0.4669   0.3641   0.4385    0.1237  0.4246   0.1080  0.1687   0.1539
All calculations are performed in R (R Core Team (2014)) with the package of Chernozhukov et al. (2015) and our own code. Replication files are available upon request.
The results show that, again, post- and orthogonal L2 Boosting give comparable results.
They both outperform LASSO and post-LASSO in this application.
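For illustration, the train/test evaluation described above can be sketched as follows, reusing the boosting sketches from Sections 2 and 3 (X and y stand for the gene-expression matrix and the log production rate, however they are loaded; the split mimics the 60/11 split above):

```r
# Sketch of the out-of-sample evaluation in the riboflavin application.
# 'X' (n x p standardized gene expressions) and 'y' (log riboflavin production
# rate) are assumed to be already in memory.
set.seed(1)
train <- sample(nrow(X), 60)                       # 60 training, 11 test observations

fit_boost <- l2boost_stop(X[train, ], y[train])    # L2 Boosting with our stopping rule
fit_post  <- post_l2boost(X[train, ], y[train], mstop = fit_boost$mstop)

oos_mse <- function(bhat) mean((y[-train] - X[-train, ] %*% bhat)^2)
c(BA_our = oos_mse(fit_boost$beta), p_BA_our = oos_mse(fit_post$beta))
```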
Appendix A. Proofs of Section 3
A.1. Proofs of L2 boosting.
Proof of Lemma 2. From Lemma 1, $Z_{m+1} = Z_m - 2\gamma_{j_m}^m\langle\varepsilon, X_{j_m}\rangle_n$, and hence $Z_m = Z_0 - 2\sum_{k=0}^{m-1}\gamma_{j_k}^k\langle\varepsilon, X_{j_k}\rangle_n = Z_0 - 2\langle\varepsilon, X\beta - V^m\rangle_n$. Moreover, $Z_0 = \|y\|_{2,n}^2 - \|X\beta\|_{2,n}^2 = \|\varepsilon\|_{2,n}^2 + 2\langle\varepsilon, X\beta\rangle_n$.
Then $Z_m = \|\varepsilon\|_{2,n}^2 + 2\langle\varepsilon, V^m\rangle_n = \|\varepsilon\|_{2,n}^2 + 2\bigl\langle\varepsilon, \tfrac{V^m}{\|V^m\|_{2,n}}\bigr\rangle_n\|V^m\|_{2,n}$. Thus $|Z_m - \sigma_n^2| \le 2\sqrt{m+s}\,\lambda_n\|V^m\|_{2,n}$, since $|\operatorname{supp}(V^m)| \le m+s$. $\square$
Proof of Lemma 4. Denote by $m^*+1$ the first time that $\|V^m\|_{2,n} \le \eta_n\sqrt{m+s}\,\lambda_n$, where $\eta_n$ is some sequence of positive real numbers. We know that in high-dimensional settings $\|U^m\|_{2,n} \to 0$ and thus $\|V^m\|_{2,n}^2 \to \sigma_n^2$. Thus, for fixed $p$ and $n$, such an $m^*$ must exist.
First, we prove that for any $m \le m^*$ we have $\|V^{m+1}\|_{2,n}^2 \le \|V^m\|_{2,n}^2$. By Lemma 1, $\|V^{m+1}\|_{2,n}^2 = \|V^m\|_{2,n}^2 - \gamma_{j_m}^m(\gamma_{j_m}^m - 2\langle\varepsilon, X_{j_m}\rangle_n)$. We only need to prove that $\gamma_{j_m}^m$ and $(\gamma_{j_m}^m - 2\langle\varepsilon, X_{j_m}\rangle_n)$ have the same sign, i.e. $|\gamma_{j_m}^m| > 2|\langle\varepsilon, X_{j_m}\rangle_n|$. It suffices to prove $|\gamma_{j_m}^m| > 2\lambda_n$. We know that $|\gamma_{j_m}^m| \ge \frac{\|V^m\|_{2,n}}{\sqrt{m+s}} - \lambda_n \ge \lambda_n(\eta_n - 1)$. Thus, for any $\eta_n > 3$, $\|V^{m+1}\|_{2,n}^2 \le \|V^m\|_{2,n}^2$. Let $\eta_n = 3 + \mu$ for some constant $\mu$.
Let $\xi_n = \bigl(\frac{sn}{\log(p)}\bigr)^{1/2}$. If there exists some constant $C'$ such that $m^*+1 < C'\xi_n$, then by definition $\|V^{m^*+1}\|_{2,n} \le 3(1+\mu)\sqrt{s+m^*+1}\,\lambda_n \le 3(1+2\mu)\sqrt{C'}\,2\sigma\bigl(\frac{s\log(p)}{n}\bigr)^{1/4} \lesssim \bigl(\frac{s\log(p)}{n}\bigr)^{1/4}$.
We will consider two cases, depending on whether $m_0$ is smaller or larger than $m^*$.
If $m_0 \ge m^*$, then $\|V^{m_0}\|_{2,n}^2 = \|V^{m^*}\|_{2,n}^2 + (\|U^{m_0}\|_{2,n}^2 - \|U^{m^*}\|_{2,n}^2) - Z_{m_0} + Z_{m^*}$. So $\|V^{m_0}\|_{2,n}^2 \le \|V^{m^*}\|_{2,n}^2 + 2\sqrt{m_0\vee m^*}\,\lambda_n \le C''\bigl(\frac{s\log(p)}{n}\bigr)^{1/4}$ for some generic constant $C''$.
If $m_0 < m^*$, we know that for any $m = 0,1,\ldots,m_0$, $\|V^m\|_{2,n}^2$ is decreasing. For any $m < m_0$ we have:
(9) $\quad \|U^{m+1}\|_{2,n}^2 = \|U^m\|_{2,n}^2 - (\gamma_{j_m}^m)^2 \le \|U^m\|_{2,n}^2 - \Bigl(\frac{\|V^m\|_{2,n}}{\sqrt{m+s}} - \lambda_n\Bigr)^2 \le \|U^m\|_{2,n}^2 - \frac{\|V^m\|_{2,n}^2}{m+s} + 2\lambda_n\frac{\|V^m\|_{2,n}}{\sqrt{m+s}} + \lambda_n^2 = \|U^m\|_{2,n}^2\Bigl(1 - \frac{1}{m+s}\Bigr) + \frac{Z_m}{m+s} + 2\lambda_n\frac{\|V^m\|_{2,n}}{\sqrt{m+s}} + \lambda_n^2 \le \|U^m\|_{2,n}^2\Bigl(1 - \frac{1}{m+s}\Bigr) + \frac{\hat\sigma^2}{m+s} + 4(1+\mu)\lambda_n\frac{\|V^m\|_{2,n}}{\sqrt{m+s}}.$
Based on Lemma 2 and inequality (9),
$$\|U^{m+1}\|_{2,n}^2 - \hat\sigma^2 - 8(1+\mu)\lambda_n\sqrt{m+s+1}\,\|V^{m+1}\|_{2,n} \le \Bigl(1 - \frac{1}{m+s}\Bigr)\Bigl(\|U^m\|_{2,n}^2 - \hat\sigma^2 - 8(1+\mu)\lambda_n\sqrt{m+s}\,\|V^m\|_{2,n}\Bigr),$$
since
$$\sqrt{m+s+1}\,\|V^{m+1}\|_{2,n} - \sqrt{m+s}\,\|V^m\|_{2,n} \le (\sqrt{m+s+1} - \sqrt{m+s})\|V^m\|_{2,n} \le \frac{1}{2\sqrt{m+s}}\|V^m\|_{2,n}.$$
Denote by $m_1^*+1$ the first $m$ such that $\|U^{m+1}\|_{2,n}^2 \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m+s+1}\,\|V^{m+1}\|_{2,n}$.
If $m_1^* < m_0$, then $\|U^{m_1^*+1}\|_{2,n}^2 \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m_1^*+s+1}\,\|V^{m_1^*+1}\|_{2,n}$. Moreover, we know that $\|V^{m_1^*+1}\|_{2,n}^2 = \|U^{m_1^*+1}\|_{2,n}^2 - Z_{m_1^*+1} \le (10+8\mu)\lambda_n\sqrt{m_1^*+s+1}\,\|V^{m_1^*+1}\|_{2,n}$, and therefore $\|V^{m_1^*+1}\|_{2,n} \le (10+8\mu)\lambda_n\sqrt{m_1^*+s+1} \le 10(1+\mu)\sqrt{m_0}\,\lambda_n = 10(1+\mu)\bigl(\frac{s\log(p)}{n}\bigr)^{1/4} \lesssim \bigl(\frac{s\log(p)}{n}\bigr)^{1/4}$.
If $m_1^* \ge m_0$, then define a new sequence $q^0 := \|U^0\|_{2,n}^2 > \hat\sigma^2 + 8(1+\mu)\sqrt{s}\,\lambda_n\|V^0\|_{2,n}$. We require the LASSO growth rate $\sqrt{s}\lambda_n \to 0$. Let $q^{m+1} = q^m\bigl(1 - \frac{1}{m+s}\bigr)$. So we have $q^m = q^0\frac{s}{m+s}$.
So for any $m \le m_0 \le m_1^*\wedge m^*$, we have $\|U^m\|_{2,n}^2 - \hat\sigma^2 - 8(1+\mu)\lambda_n\sqrt{m+s}\,\|V^m\|_{2,n} \le q^m \le C\frac{s}{m+s}$. (Here we require $q^0 < \infty$, which holds under $\|\beta\|_2 < \infty$; Bühlmann (2006) requires the stronger condition $\|\beta\|_1 < \infty$. If we assume $\sup|\beta| < C$, then the result would be $\|U^m\|_{2,n}^2 \le C\frac{s^2}{m+s}$.)
Thus, we know that for any $m \le m_1^*$, $\|U^{m+1}\|_{2,n}^2 \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m+1+s}\,\|V^{m+1}\|_{2,n} + C\frac{s}{m+s+1}$. And:
$$\|V^{m+1}\|_{2,n}^2 = \|U^{m+1}\|_{2,n}^2 - Z_{m+1} \le \hat\sigma^2 + 8(1+\mu)\lambda_n\sqrt{m+1+s}\,\|V^{m+1}\|_{2,n} + C\frac{s}{m+s+1} - \hat\sigma^2 + 2\sqrt{m+s+1}\,\lambda_n\|V^{m+1}\|_{2,n} = (10+8\mu)\sqrt{m+s+1}\,\lambda_n\|V^{m+1}\|_{2,n} + \frac{Cs}{m+s+1}.$$
Letting $m = m_0 - 1$, we get $\|V^{m_0}\|_{2,n} \le C_1\bigl(\frac{s\log(p)}{n}\bigr)^{1/4}$ for some constant $C_1 > 0$.
Thus, stopping at $m_0 = C\frac{\sqrt{s}}{\lambda_n}$ leads to the $\bigl(\frac{s\log(p)}{n}\bigr)^{1/4}$ convergence rate. $\square$
A.2. Proofs of OGA.
Proof of Lemma 5. By the sparse eigenvalue condition,
$$\|V_o^m\|_{2,n}^2 \ge c\|\beta_{S^m}\|_2^2 \ge \frac{c}{C}\|X\beta_{S^m}\|_{2,n}^2.$$
Similarly, for $U_o^m$, $\|U_o^m\|_{2,n}^2 = \|M_{P_o^m}\varepsilon + M_{P_o^m}(X\beta)\|_{2,n}^2 = \|M_{P_o^m}\varepsilon\|_{2,n}^2 + \|V_o^m\|_{2,n}^2 + 2\langle\varepsilon, V_o^m\rangle_n \ge \frac{n-m}{n}\sigma^2 + \|V_o^m\|_{2,n}^2 - 2\lambda_n\sqrt{m+s}\,\|V_o^m\|_{2,n}$. $\square$
Proof of Lemma 6. At step $Ks$, if $T^{Ks} \supset T$, then $\|V_o^m\|_{2,n}^2 = \|M_{P_o^m}X\beta\|_{2,n}^2 = 0$. The estimated predictor then satisfies $\|X\beta - X\beta^m\|_{2,n} \le 2(1+\eta)\sigma\sqrt{\frac{s\log(p)}{n}}$ with probability going to 1, where $\eta > 0$ is a constant.
Let $A_1 = \{T^{Ks} \supset T\}$ and consider the event $A_1^c$. Then there exists a $j \in T$ which is never picked up in the process at $k = 0,1,\ldots,Ks$.
At every step we pick a $j$ to maximize $|\langle X_j, U_o^m\rangle_n| = |\langle X_j, V_o^m\rangle_n + \langle X_j,\varepsilon\rangle_n|$.
Let $W^m = X\beta_{S^m}$ and $\widetilde W^m = X\alpha^m_{S_c^m}$. So $\|V_o^m\|_{2,n}^2 \ge c(\|\beta_{S^m}\|_2^2 + \|\alpha_{S_c^m}\|_2^2) \ge c\sum_{j\in S^m}|\beta_j|^2$. Also $V_o^m = M_{P_o^m}X\beta_{S^m}$, so $\langle V_o^m, X\beta_{S^m}\rangle_n = \|M_{P_o^m}X\beta_{S^m}\|_{2,n}^2 = \|X\beta_{S^m} - X_{S_c^m}\zeta\|_{2,n}^2 \ge c\|\beta_{S^m}\|_2^2$, where $X_{S_c^m}\zeta = P_{X_{S_c^m}}(X\beta_{S^m})$.
Thus, it is easy to see that $\langle V_o^m, X\beta_{S^m}\rangle_n = \sum_{j\in S^m}\beta_j\langle V_o^m, X_j\rangle_n \ge c\sum_{j\in S^m}\beta_j^2$. Thus, there exists some $j^*$ such that $|\langle V_o^m, X_{j^*}\rangle_n| \ge c|\beta_{j^*}| \ge cJ$.
We know that the optimal $j_m$ must satisfy $|\langle U_o^m, X_{j_m}\rangle_n| \ge |\langle V_o^m, X_{j^*}\rangle_n| - |\langle\varepsilon, X_{j^*}\rangle_n| \ge cJ - \lambda_n$. Thus, $|\langle V_o^m, X_{j_m}\rangle_n| > cJ - 2\lambda_n$.
Hence, $\|V_o^{m+1}\|_{2,n}^2 = \|V_o^m - \gamma_{j_m}X_{j_m}\|_{2,n}^2 = \|V_o^m\|_{2,n}^2 - 2\gamma_{j_m}\langle U_o^m, X_{j_m}\rangle_n + 2\gamma_{j_m}\langle\varepsilon, X_{j_m}\rangle_n + \gamma_{j_m}^2 \le \|V_o^m\|_{2,n}^2 - \gamma_{j_m}^2 + 2\lambda_n|\gamma_{j_m}| \le \|V_o^m\|_{2,n}^2 - (cJ - \lambda_n)^2 + 2(cJ - \lambda_n)\lambda_n \le \|V_o^m\|_{2,n}^2 - (cJ)^2 + 4cJ\lambda_n$.
Consequently, at $m = Ks$, $\|V_o^{Ks}\|_{2,n}^2 \le \|V_o^0\|_{2,n}^2 - K(cJ)^2 s + 4KcJ s\lambda_n$. Since $\|V_o^0\|_{2,n}^2 \le CJ'^2 s$, choosing $K > \frac{CJ'^2}{c^2J^2}$ and assuming $\lambda_n \to 0$ would lead to $\|V_o^{Ks}\|_{2,n}^2 < 0$ asymptotically. Hence the assumption that there exists a $j$ which is never picked up in the process at $k = 0,1,2,\ldots,Ks$ must fail with probability going to 1.
Thus, at time $Ks$, $A_1$ must hold with probability going to 1. Therefore, we know that $\|U_o^m\|_{2,n}^2 = \|M_{P_o^{Ks}}\varepsilon\|_{2,n}^2 \le \hat\sigma^2 = \sigma^2 + O_p(\tfrac{1}{\sqrt{n}})$. By the definition of $m^*$, $m^* \le Ks$. Therefore, $\|V_o^{m^*}\|_{2,n}^2 \le \|U_o^{m^*}\|_{2,n}^2 - \sigma^2 + m^*\lambda_n = O_p(s\lambda_n^2)$. $\square$
References
Barron, Andrew R., Cohen, Albert, Dahmen, Wolfgang and DeVore, Ronald A. (2008). ‘Approximation and learning by greedy algorithms’, Ann. Statist. 36(1), 64–94.
URL: http://dx.doi.org/10.1214/009053607000000631
Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). ‘Sparse Models and Methods for
Optimal Instruments With an Application to Eminent Domain’, Econometrica 80(6), 2369–2429.
URL: http://dx.doi.org/10.3982/ECTA9626
Belloni, Alexandre and Chernozhukov, Victor. (2013). ‘Least squares after model selection in high-dimensional sparse models’, Bernoulli 19(2), 521–547.
URL: http://dx.doi.org/10.3150/11-BEJ410
Breiman, Leo. (1996). ‘Bagging Predictors’, Machine Learning 24, 123–140.
URL: http://dx.doi.org/10.1007/BF00058655
Breiman, Leo. (1998). ‘Arcing Classifiers’, The Annals of Statistics 26(3), 801–824.
Bühlmann, Peter. (2006). ‘Boosting for High-Dimensional Linear Models’, The Annals of Statistics
34(2), 559–583.
Bühlmann, Peter and Hothorn, Torsten. (2007). ‘Boosting Algorithms: Regularization, Prediction
and Model Fitting’, Statistical Science 22(4), 477–505. with discussion.
URL: http://dx.doi.org/10.1214/07-STS242
Bühlmann, Peter, Kalisch, Markus and Meier, Lukas. (2014). ‘High-Dimensional Statistics with a
View Toward Applications in Biology’, Annual Review of Statistics and Its Application 1(1), 255–278.
URL: http://dx.doi.org/10.1146/annurev-statistics-022513-115545
Bühlmann, Peter and Yu, Bin. (2003). ‘Boosting with the L2 Loss: Regression and Classification’,
Journal of the American Statistical Association 98(462), 324–339.
URL: http://www.jstor.org/stable/30045243
Candes, Emmanuel and Tao, Terence. (2007). ‘The Dantzig selector: statistical estimation when p is
much larger than n’, The Annals of Statistics pp. 2313–2351.
Chernozhukov, Victor, Chetverikov, Denis, Kato, Kengo et al. (2014). ‘Gaussian approximation
of suprema of empirical processes’, The Annals of Statistics 42(4), 1564–1597.
Chernozhukov, Victor, Hansen, Christian and Spindler, Martin. (2015), hdm: High-Dimensional
Metrics. R package version 0.1.
de la Peña, Victor H., Lai, Tze Leung and Shao, Qi-Man. (2009), Self-normalized processes, Probability and its Applications (New York), Springer-Verlag, Berlin. Limit theory and statistical applications.
Freund, Yoav and Schapire, Robert E. (1997). ‘A decision-theoretic generalization of on-line learning
and an application to boosting’, Journal of computer and system sciences 55(1), 119–139.
Friedman, Jerome H. (2001). ‘Greedy Function Approximation: A Gradient Boosting Machine’, The
Annals of Statistics 29(5), 1189–1232.
URL: http://www.jstor.org/stable/2699986
Friedman, J. H., Hastie, T. and Tibshirani, R. (2000). ‘Additive Logistic Regression: A Statistical
View of Boosting’, The Annals of Statistics 28, 337–407. with discussion.
URL: http://projecteuclid.org/euclid.aos/1016218223
Ing, Ching-Kang and Lai, Tze Leung. (2011). ‘A stepwise regression method and consistent model
selection for high-dimensional sparse linear models’, Statistica Sinica 21(4), 1473–1513.
Friedman, Jerome, Hastie, Trevor and Tibshirani, Robert. (2010). ‘Regularization Paths for Generalized Linear Models via Coordinate Descent’, Journal of Statistical Software 33(1), 1–22.
URL: http://www.jstatsoft.org/v33/i01
Mallat, S.G. and Zhang, Zhifeng. (1993). ‘Matching Pursuits with Time-frequency Dictionaries’, Trans.
Sig. Proc. 41(12), 3397–3415.
URL: http://dx.doi.org/10.1109/78.258082
R Core Team. (2014), R: A Language and Environment for Statistical Computing, R Foundation for
Statistical Computing, Vienna, Austria.
URL: http://www.R-project.org/
Zhang, T. and Yu, B. (2005). ‘Boosting with Early Stopping: Convergence and Consistency’, The Annals
of Statistics 33, 1538–1579.
URL: http://projecteuclid.org/euclid.aos/1123250222