STA315 Final Project:
Regularization and Variable Selection via the Elastic Net
Hasaan Soomro, Kyuson Lim
April 25, 2021
Abstract
Zou and Hastie (2005) introduced a regularization method called the Elastic Net that
combines the L1 and L2 penalties, and performs variable selection and regularization simultaneously. The Elastic Net can perform grouped variable selection when predictors are
strongly correlated, and can potentially select all p predictors in the p > n case. Simulation
studies show that the Elastic Net has a better prediction accuracy than the Lasso, while
enjoying a similar sparsity. A new algorithm called LARS-EN can be used for computing
Elastic Net regularization paths efficiently, which is similar to the LARS algorithm for the
Lasso. Finally, the Elastic Net can be extended to Classification and Sparse PCA problems.
1 Introduction and Motivation

1.1 Importance of Variable Selection
Variable Selection is an important problem in the field of Statistics and Machine Learning. With
the rise of Big Data, modern data analysis often involves a large number of predictors (p > n).
In such scenarios, we would like to model the relationship between the response and the best
subset of predictors by eliminating redundant or irrelevant predictors. Both parsimony
and prediction performance are important aspects. Variable selection helps overcome the challenges
posed by high-dimensional settings: it reduces computation time, reduces overfitting,
mitigates the curse of dimensionality, and avoids complex, uninterpretable models. Multiple
variable selection models such as the Lasso and Elastic Net have been introduced over the years
that are well-suited for specific scenarios. For example, when the number of predictors is greater
than the number of observations (p > n) and there are groups of strongly correlated predictors,
then the Elastic Net model is often used.
1.2 Evaluating the Quality of a Model
Typically, there are two major aspects that we consider when evaluating the quality of a model:
• Prediction Accuracy on future unseen data: Models with good prediction accuracy on
future unseen data are preferred.
• Model Interpretation: Simpler models are preferred because they help understand the
relationship between the response variable and predictors better. Also, when the number
of predictors is large, parsimony is particularly important.
A good model will have good prediction accuracy on future unseen data, and be simple to help
better understand the relationship between the response variable and predictors. A common
model is OLS (Ordinary Least Squares), where estimates are obtained by minimizing the residual
sum of squares:
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \left(y_i - \beta^{T} x_i\right)^2, \qquad \hat{\beta} = \arg\min_{\beta} \mathrm{RSS}(\beta) \qquad (1.2.1)
\]
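As a quick illustration (not from the original report), the OLS estimate in (1.2.1) can be computed with a standard least-squares solver; the toy data below are made up for demonstration.

```python
import numpy as np

# Hypothetical toy data: n = 50 observations, p = 3 predictors.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

# OLS: minimize the residual sum of squares in (1.2.1).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_ols) ** 2)
print(beta_ols, rss)
```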
Ordinary Least Squares often performs poorly in both prediction and interpretation. Penalization
techniques such as Lasso (L1 penalty) and Ridge Regression (L2 penalty) have been introduced to
improve the performance of Ordinary Least Squares. However, Lasso and Ridge Regression also
perform poorly in some scenarios such as when predictors are strongly correlated and grouped
variable selection is required. This aspect will be discussed further in the upcoming sections.
1.3 Grouped Variable Selection and the p ≫ n Problem
In various problems and scenarios, there are more predictors than the number of observations (p >
n). Furthermore, there are groups of variables/predictors among which the pairwise correlations
are very high. For example, a typical microarray data set has fewer than 100 observations and
thousands of predictors, which are genes. Moreover, genes sharing the same biological pathway
tend to be highly correlated and can be regarded as forming a group. In such a scenario, an
ideal model would do the following:
• Automatic Variable Selection: The model automatically eliminates the trivial predictors and selects the best subset of predictors.
• Grouped Selection: If one predictor is selected within a highly correlated group, then
the model selects the whole group automatically.
1.4 Lasso and Ridge Regression
Recall that the Lasso (Tibshirani 1996) is a penalized least squares method that does both
continuous shrinkage and automatic variable selection simultaneously by imposing an L1 -penalty
on the regression coefficients:
\[
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^{2} + \lambda_1 \lVert \beta \rVert_{1} \qquad (1.4.1)
\]
Also, recall that Ridge Regression is a continuous shrinkage method that achieves good prediction
performance via the bias-variance trade-off, however, it does not perform variable selection. Ridge
Regression imposes an L2 -penalty on the regression coefficients:
\[
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^{2} + \lambda_2 \lVert \beta \rVert^{2} \qquad (1.4.2)
\]

1.5 Limitations of Lasso and Ridge Regression
The Lasso and Ridge Regression have been successful in various scenarios, however, certain
limitations make them unsuitable for the p > n and grouped variables scenario, as well as for
some n > p cases.
Firstly, due to the type of the convex optimization problem, the Lasso can select at most n
variables prior to saturation in the p > n case. Furthermore, unless the bound of the L1 -penalty
on the regression coefficients is below a certain threshold, the Lasso is not well defined. Secondly,
the Lasso cannot perform grouped variable selection. When there is a group of highly correlated
variables, the Lasso tends to select only one variable from the group and does not care which one is selected,
instead of selecting or eliminating the entire group. Lastly, when n > p and the predictors are
highly correlated, studies have shown that Ridge Regression tends to have a better prediction
performance than the Lasso.
Therefore, the main objective is to create a model or method that enjoys a similar prediction
performance and sparsity as the Lasso, and overcomes the limitations mentioned above. The
model should potentially be able to select all p predictors in the p > n case, perform grouped
variable selection when there is a group of highly correlated variables, and have a better prediction
performance than the Lasso when n > p and the predictors are highly correlated. To solve
this problem, the upcoming sections will talk about the Elastic Net model introduced by Zou
and Hastie (2005), which can perform automatic variable selection and continuous shrinkage
simultaneously, and can select group(s) of correlated predictors.
2 Naive Elastic Net

2.1 Definition
Zou and Hastie (2005) introduced the Naive Elastic Net which combines the L1 and L2 penalties
of the Lasso and Ridge Regression models. Given a data set with n observations with p predictors,
let y = (y1 , . . . , yn )T be the response and X = (x1 |. . . |xp ) be the design matrix, with y centered
and X standardized:
\[
\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \quad \text{and} \quad \sum_{i=1}^{n} x_{ij}^{2} = 1, \qquad \text{for } j = 1, 2, \ldots, p \qquad (2.1.1)
\]
Then the Naive Elastic Net optimization problem for non-negative λ1 , λ2 is:
\[
\hat{\beta} = \arg\min_{\beta} \left\{ \lvert y - X\beta \rvert^{2} + \lambda_2 \lvert \beta \rvert^{2} + \lambda_1 \lvert \beta \rvert_{1} \right\} \qquad (2.1.2)
\]
In equation (2.1.2), the L1 penalty enables the Naive Elastic Net to produce a sparse model, while the
L2 penalty allows grouped variable selection and removes the limit on the number of selected
predictors, so that all p predictors can potentially be selected. Sections 2.2 and 2.3 discuss the
grouping effect and the possibility of selecting all p predictors in more detail. Also, note that as
λ1 → 0 the Naive Elastic Net behaves more like Ridge Regression, and as λ2 → 0 it behaves
more like the Lasso.
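To make the criterion (2.1.2) concrete, the following is a minimal proximal-gradient (ISTA) sketch that minimizes the Naive Elastic Net objective directly. This is our own illustration, not the computational approach taken by Zou and Hastie; the function names and settings are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def naive_elastic_net(X, y, lam1, lam2, n_iter=5000):
    """Minimize |y - Xb|^2 + lam2*|b|^2 + lam1*|b|_1 by proximal gradient (ISTA)."""
    p = X.shape[1]
    beta = np.zeros(p)
    # Step size 1/L, where L bounds the Lipschitz constant of the smooth part.
    L = 2.0 * (np.linalg.norm(X, 2) ** 2 + lam2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y) + 2.0 * lam2 * beta
        beta = soft_threshold(beta - grad / L, lam1 / L)
    return beta
```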
An alternative representation of the Naive Elastic Net criterion can also be formulated. Let
α = λ2 /(λ1 + λ2 ); then the optimization problem is equivalent to:
\[
\hat{\beta} = \arg\min_{\beta} \lvert y - X\beta \rvert^{2}, \quad \text{subject to } (1-\alpha)\lvert \beta \rvert_{1} + \alpha \lvert \beta \rvert^{2} \le t \ \text{ for some } t \qquad (2.1.3)
\]
In equation (2.1.3), when α = 1 the Naive Elastic Net penalty reduces to the Ridge penalty, and when
α = 0 it reduces to the Lasso penalty. The Naive Elastic Net penalty is strictly convex whenever α > 0
and is singular (non-differentiable) at 0 whenever α < 1. Therefore, for α ∈ (0, 1) the Naive Elastic Net
enjoys the properties of both Ridge Regression and the Lasso. Figure 1 shows the singularities at the
vertices and the strictly convex edges; the strength of the convexity varies with α.
Figure 1: Contours of the Ridge penalty, the Elastic Net penalty with α = 0.5, and the Lasso penalty (figure omitted).

2.2 Solution
The Naive Elastic Net optimization problem can be solved efficiently at the same computational
cost as the Lasso. This is because the Naive Elastic Net optimization problem presented in
equation (2.1.2) can be transformed into an equivalent Lasso-type optimization problem on augmented data. Consider the following lemma:
Lemma 1: Given the dataset (y, X) and (λ1 , λ2 ), construct the artificial dataset (y∗ , X∗ ):
\[
X^{*}_{(n+p)\times p} = (1+\lambda_2)^{-1/2}\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I_p \end{pmatrix}, \qquad y^{*}_{(n+p)} = \begin{pmatrix} y_{n\times 1} \\ 0_{p\times 1} \end{pmatrix} \qquad (2.2.1)
\]
Let γ = λ1 /√(1 + λ2 ) and β∗ = √(1 + λ2 ) β. Then the Naive Elastic Net problem can be rewritten as
\[
L(\gamma, \beta) = L(\gamma, \beta^{*}) = \lvert y^{*} - X^{*}\beta^{*} \rvert^{2} + \gamma \lvert \beta^{*} \rvert_{1} \qquad (2.2.2)
\]
Define
\[
\hat{\beta}^{*} = \arg\min_{\beta^{*}} L(\gamma, \beta^{*}) \qquad (2.2.3)
\]
Then
\[
\hat{\beta} = \frac{1}{\sqrt{1+\lambda_2}}\, \hat{\beta}^{*} \qquad (2.2.4)
\]
Lemma 1 shows that the Naive Elastic Net has the potential to select all p predictors in all
situations: in the transformed artificial dataset, X∗ has rank p and the sample size is n + p, so the
equivalent Lasso-type problem never saturates. Recall that in the p > n case the Lasso can select
at most n predictors; the Naive Elastic Net overcomes this major limitation of the Lasso.
Moreover, since the transformed problem is of Lasso type, Lemma 1 also indicates that the
Naive Elastic Net can perform automatic variable selection.
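As a sketch of how Lemma 1 can be used in practice (our own illustration; the function name is hypothetical), one can build the augmented data (2.2.1), run any off-the-shelf Lasso solver on it, and map the solution back via (2.2.4). The rescaling of the penalty below accounts for scikit-learn's 1/(2n) factor in its Lasso objective and should be double-checked against the solver actually used.

```python
import numpy as np
from sklearn.linear_model import Lasso

def naive_enet_via_augmented_lasso(X, y, lam1, lam2):
    """Solve the Naive Elastic Net through the augmented Lasso problem of Lemma 1."""
    n, p = X.shape
    # Augmented data (2.2.1): X* is (n+p) x p with rank p; y* pads y with p zeros.
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    gamma = lam1 / np.sqrt(1.0 + lam2)
    # scikit-learn's Lasso minimizes (1/(2m))|y - Xb|^2 + alpha*|b|_1 with m = n + p,
    # so gamma is divided by 2m to match |y* - X*b|^2 + gamma*|b|_1.
    alpha = gamma / (2.0 * (n + p))
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=50000).fit(X_star, y_star)
    # (2.2.4): recover the Naive Elastic Net coefficients from beta*.
    return lasso.coef_ / np.sqrt(1.0 + lam2)
```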
2.3 Solution (Orthogonal Design)
For an orthogonal design matrix (X T X = I), the Naive Elastic Net solution becomes:
\[
\hat{\beta}_i(\text{naive elastic net}) = \frac{\bigl(\lvert \hat{\beta}_i(\text{OLS}) \rvert - \lambda_1/2\bigr)_{+}}{1+\lambda_2} \cdot \operatorname{sign}\{\hat{\beta}_i(\text{OLS})\} \qquad (2.3.1)
\]
The Ridge Regression solution under the orthogonal design is:
\[
\hat{\beta}_i(\text{ridge}) = \hat{\beta}_i(\text{OLS})/(1+\lambda_2) \qquad (2.3.2)
\]
Finally, the Lasso solution for the orthogonal design matrix is:
\[
\hat{\beta}_i(\text{lasso}) = \bigl(\lvert \hat{\beta}_i(\text{OLS}) \rvert - \lambda_1/2\bigr)_{+} \cdot \operatorname{sign}\{\hat{\beta}_i(\text{OLS})\} \qquad (2.3.3)
\]
Note that β̂i (OLS) = xTi y under the orthogonal design, and sign(·) is either −1, 0, or 1.
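The closed forms (2.3.1)-(2.3.3) are straightforward to code; the sketch below is ours (hypothetical function name) and assumes the columns of X have been constructed so that X T X = I.

```python
import numpy as np

def orthogonal_design_solutions(X, y, lam1, lam2):
    """Closed-form estimates (2.3.1)-(2.3.3) when X^T X = I."""
    beta_ols = X.T @ y                                        # OLS under orthogonality
    shrunk = np.maximum(np.abs(beta_ols) - lam1 / 2.0, 0.0)   # the (.)_+ part
    lasso = shrunk * np.sign(beta_ols)                        # (2.3.3)
    ridge = beta_ols / (1.0 + lam2)                           # (2.3.2)
    naive_enet = lasso / (1.0 + lam2)                         # (2.3.1): double shrinkage
    return naive_enet, lasso, ridge
```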
2.4 Grouping Effect
One of the major advantages of the Naive Elastic Net method over the Lasso is the ability to
perform grouped variable selection, where whole groups of correlated predictors are selected or
completely eliminated. This section will provide a more in-depth justification of the grouping
effect in the Naive Elastic Net method. Generally, a regression method exhibits the grouping effect
if the regression coefficients of a group of highly correlated variables tend to be (nearly) equal. To
understand this better, consider the general penalization method:
\[
\hat{\beta} = \arg\min_{\beta} \lvert y - X\beta \rvert^{2} + \lambda J(\beta), \qquad J(\beta) > 0 \ \text{for} \ \beta \neq 0 \qquad (2.4.1)
\]
Lemma 2: Assume that predictors i and j are equal (xi = xj ) for some i, j ∈ {1, ..., p}.
• (i) If J(·) is strictly convex, then β̂i = β̂j for all λ > 0.
• (ii) If J(β) = |β|1 , then β̂i β̂j ≥ 0, and β̂∗ is another minimizer of (2.4.1), where
\[
\hat{\beta}^{*}_{k} =
\begin{cases}
\hat{\beta}_k & \text{if } k \neq i \text{ and } k \neq j \\
(\hat{\beta}_i + \hat{\beta}_j)\cdot s & \text{if } k = i \\
(\hat{\beta}_i + \hat{\beta}_j)\cdot (1-s) & \text{if } k = j
\end{cases}
\qquad \text{for any } s \in [0, 1] \qquad (2.4.2)
\]
From (i) and (ii) in Lemma 2, strict convexity ensures that a model exhibits the grouping effect
in the extreme situation when predictors are equal. The Naive Elastic Net penalty is strictly
convex due to the quadratic penalty, therefore it can exhibit the grouping effect. However, the
Lasso penalty is not strictly convex and the Lasso does not have a unique solution, therefore it cannot perform
grouped selection.
Moreover, a quantitative description can be provided to justify the grouping effect of the
Naive Elastic Net. We can define Dλ1 ,λ2 (i, j) as the difference between the coefficient paths of
xi and xj . Then, the following theorem can be used:
Theorem 1: Given data (y, X) and parameters (λ1 , λ2 ), suppose the response y is centered and the
predictors X are standardized. Let β̂ be the Naive Elastic Net estimate and suppose
β̂i (λ1 , λ2 ) β̂j (λ1 , λ2 ) > 0. Then, defining
\[
D_{\lambda_1,\lambda_2}(i,j) = \frac{1}{\lvert y \rvert_1} \left\lvert \hat{\beta}_i(\lambda_1,\lambda_2) - \hat{\beta}_j(\lambda_1,\lambda_2) \right\rvert, \qquad (2.4.3)
\]
we have
\[
D_{\lambda_1,\lambda_2}(i,j) \le \frac{1}{\lambda_2}\sqrt{2(1-\rho)}, \qquad \rho = x_i^{T}x_j \ \text{(the sample correlation)} \qquad (2.4.4)
\]
Notice that as ρ → 1, the upper bound on the difference between β̂i and β̂j converges to 0, so the
coefficients of predictors i and j become nearly identical. Therefore, Theorem 1 shows that the
coefficients of strongly correlated predictors are nearly identical under the Naive Elastic Net, and
ultimately provides a quantitative justification for grouped variable selection in the Naive Elastic
Net method.
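A small numerical illustration of the grouping effect (ours, not from the report): with two nearly identical predictors, the Lasso tends to pick only one of them, while the Elastic Net gives both nearly equal coefficients. The penalty values and the use of scikit-learn's parameterization are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
# Columns 0 and 1 are nearly identical; columns 2-4 are independent noise.
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=(n, 3))])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 * z + rng.normal(size=n)
y = y - y.mean()

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```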
2.5 Limitations of the Naive Elastic Net
Even though the Naive Elastic Net can perform grouped variable selection and has the potential
to select all p predictors in all situations, it incurs a double amount of shrinkage as it goes
through a two-step procedure involving Lasso and Ridge type shrinkage. The double shrinkage
introduces unnecessary bias and hinders the prediction performance of the Naive Elastic Net.
Therefore, Zou and Hastie (2005) introduce the Elastic Net, which is obtained by multiplying
the Naive Elastic Net coefficients by 1 + λ2 . The Elastic Net is a rescaled Naive Elastic Net that
overcomes this double amount of shrinkage.
One of the main motivations behind using 1 + λ2 as the rescaling factor is that the ridge-type part of
the penalty shrinks the estimates and thereby reduces their variance. More specifically, the ridge
operator is
\[
R = (X^{T}X + \lambda_2 I)^{-1}X^{T}, \qquad (2.5.1)
\]
and under the orthogonal design of Section 2.3 it equals (1 + λ2 )−1 R∗ , where R∗ = X T is the
least-squares operator, so multiplying by 1 + λ2 exactly undoes this ridge shrinkage. This also leads to
the following theorem, which shows that the Elastic Net can be viewed as a stabilized version of the Lasso.
Theorem 2: Given data (y, X) and (λ1 , λ2 ), the Elastic Net estimate β̂ satisfies
\[
\hat{\beta} = \arg\min_{\beta}\; \beta^{T}\left(\frac{X^{T}X + \lambda_2 I}{1+\lambda_2}\right)\beta - 2y^{T}X\beta + \lambda_1\lvert\beta\rvert_1, \qquad (2.5.2)
\]
whereas the Lasso solves the same criterion with X T X in place of (X T X + λ2 I)/(1 + λ2 ).
Since X is standardized, Σ̂ = X T X is the sample correlation matrix of the predictors, and
Theorem 2 says that the Elastic Net replaces Σ̂ by the shrunken version
\[
\hat{\Sigma}_{\lambda_2} = \frac{1}{1+\lambda_2}\,\hat{\Sigma} + \frac{\lambda_2}{1+\lambda_2}\, I, \qquad (2.5.3)
\]
which pulls Σ̂ towards the identity matrix. In other words, the Elastic Net applies Lasso-type
shrinkage to a stabilized (decorrelated) version of the predictors' correlation matrix.
3 Elastic Net

3.1 Definition/Estimate
Zou and Hastie (2005) introduced the Elastic Net, which rescales the coefficients of the Naive
Elastic Net. Previously, we identified that the Naive Elastic Net problem is equivalent to a
Lasso-type problem on augmented data, which is particularly useful in the p > n case.
For data (y, X), penalty parameters (λ1 , λ2 ), and augmented data (y∗ , X∗ ), the Naive Elastic
Net solves the Lasso-type problem
\[
\hat{\beta}^{*} = \arg\min_{\beta^{*}} \lvert y^{*} - X^{*}\beta^{*} \rvert^{2} + \frac{\lambda_1}{\sqrt{1+\lambda_2}}\, \lvert \beta^{*} \rvert_{1}, \qquad (3.1.1)
\]
which is equivalent to the original criterion
\[
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^{2} + \lambda_2 \lVert \beta \rVert^{2} + \lambda_1 \lVert \beta \rVert_{1}. \qquad (3.1.2)
\]
Hence, the Elastic Net (corrected) estimates are defined by
\[
\hat{\beta}(\text{elastic net}) = \sqrt{1+\lambda_2}\; \hat{\beta}^{*}. \qquad (3.1.3)
\]
Note that β̂(naive elastic net) = {1/√(1 + λ2 )} β̂∗ , thus allowing
\[
\hat{\beta}(\text{elastic net}) = (1+\lambda_2)\, \hat{\beta}(\text{naive elastic net}) \qquad (3.1.4)
\]
Thus, we can see that the Elastic Net coefficient is a rescaled Naive Elastic Net coefficient.
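Continuing the augmented-data sketch from Section 2.2 (again our own illustration with hypothetical helper names), the corrected Elastic Net estimate is just the rescaling in (3.1.4):

```python
import numpy as np

def elastic_net_from_naive(beta_naive, lam2):
    """Apply the (1 + lam2) rescaling of (3.1.4) to undo the double shrinkage."""
    return (1.0 + lam2) * np.asarray(beta_naive)

# Example usage (hypothetical values), reusing naive_enet_via_augmented_lasso
# from the Section 2.2 sketch:
#   beta_naive = naive_enet_via_augmented_lasso(X, y, lam1=1.0, lam2=0.5)
#   beta_enet  = elastic_net_from_naive(beta_naive, lam2=0.5)
```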
Now, we introduce the LARS-EN algorithm, which is similar to the LARS algorithm for the
Lasso problem, to solve the Elastic Net problem efficiently. For a fixed λ2 , LARS-EN computes the
entire Elastic Net solution path over λ1 , and it can be stopped at an early stage, which further
reduces the computation.
3.2 LARS-EN Computation Algorithm
For each fixed λ2 , the LARS-EN algorithm computes the entire Elastic Net solution path, at roughly
the cost of a single least-squares fit, by running a LARS-type procedure on the augmented data
(y∗ , X∗ ) from Lemma 1. This is effective because X∗ is sparse: its lower block √λ2 I contributes
p − 1 zeros per column, which is especially beneficial in the p ≫ n case. At step k, with active
variable set Ak , the algorithm needs the Gram matrix
\[
G_{A_k} = X^{*T}_{A_k} X^{*}_{A_k} = \frac{1}{1+\lambda_2}\left(X^{T}_{A_k} X_{A_k} + \lambda_2 I\right), \qquad (3.2.1)
\]
and instead of recomputing it from scratch, the Cholesky factorization of GAk−1 from the previous
iteration is updated or downdated as variables enter or leave the active set. The non-zero
coefficients are recorded at each step k. In the p ≫ n case, additional computational savings come
from stopping the algorithm at an early stage.
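The LARS-EN idea can be mimicked with a generic LARS implementation applied to the augmented data; the sketch below uses scikit-learn's lars_path and is an illustration under our assumptions, not the authors' LARS-EN code (which additionally exploits the sparse structure and Cholesky updates described above).

```python
import numpy as np
from sklearn.linear_model import lars_path

def elastic_net_path(X, y, lam2):
    """Whole Elastic Net path for a fixed lam2 via a Lasso-LARS run on augmented data."""
    n, p = X.shape
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_star = np.concatenate([y, np.zeros(p)])
    # Lasso path on the augmented data; each column of `coefs` is a beta* solution.
    alphas, active, coefs = lars_path(X_star, y_star, method="lasso")
    # beta(naive) = beta*/sqrt(1+lam2) and beta(enet) = (1+lam2)*beta(naive),
    # so the corrected path is sqrt(1+lam2) * beta*, as in (3.1.3).
    return np.sqrt(1.0 + lam2) * coefs
```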
3.3 Choice of Tuning Parameters
The Elastic Net has two tuning parameters, (λ1 , λ2 ). For a fixed λ2 , the second parameter can be
specified in several equivalent ways: by λ1 itself, by the L1 -norm bound t, by the fraction of
Lasso-type shrinkage s ∈ [0, 1], or by the number of steps k of the LARS-EN algorithm, which
traces the whole solution path just as LARS does for the Lasso. Because of the proportional
relationship β̂ ∝ β̂∗ in (3.1.3), any of these parameterizations can be used, so the Elastic Net
parameters are typically chosen as a pair such as (λ2 , s) or (λ2 , t).
When only a training set is available, generic 10-fold cross-validation is used. A grid is fixed for
λ2 , say λ2 ∈ {0, 0.01, 0.1, 1, 10, 100}; for each λ2 , the second parameter (λ1 , s, or k) is selected by
10-fold CV, and the final λ2 is the value giving the lowest CV error. In p ≫ n problems the early
stopping of LARS-EN keeps the computation cheap, whereas the computationally expensive case for
cross-validation is the n > p setting, which requires full least-squares-sized fits.
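A minimal sketch of this two-dimensional tuning scheme (ours, reusing the hypothetical naive_enet_via_augmented_lasso helper from Section 2.2 and parameterizing the second tuning parameter directly by λ1 rather than by s or k):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_error(X, y, lam1, lam2, n_splits=10, seed=0):
    """Mean squared prediction error of the corrected Elastic Net under K-fold CV."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errs = []
    for train, test in kf.split(X):
        beta_naive = naive_enet_via_augmented_lasso(X[train], y[train], lam1, lam2)
        beta = (1.0 + lam2) * beta_naive                      # rescaling (3.1.4)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errs))

def tune_elastic_net(X, y, lam1_grid, lam2_grid=(0, 0.01, 0.1, 1, 10, 100)):
    """Pick the (lam1, lam2) pair with the lowest 10-fold CV error over the grid."""
    scores = {(l1, l2): cv_error(X, y, l1, l2) for l2 in lam2_grid for l1 in lam1_grid}
    return min(scores, key=scores.get)
```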
4 Extending Elastic Net to Classification and Sparse PCA
First, sparse principal component analysis (SPCA) applies Elastic Net regularization to principal
component analysis in order to obtain principal components with sparse loadings, formulated through
the solution of a regression-type optimization problem. Since PCA can be written as a multivariate
regression-type problem, SPCA constrains the regression coefficients so that the loadings, which play
the role of the multivariate regression coefficients, become sparse (Zou et al., 2004).
For notation, let X be the n × p data matrix and xi the ith row vector of X. SPCA can be used as a
two-step exploratory analysis: first perform PCA, and then obtain sparse approximations of the
loadings by solving Elastic Net problems of the form β̂ = arg minβ ||y − Xβ||2 + λ2 ||β||2 +
λ1 ||β||1 (Zou et al., 2004).
4.1 Method of SPCA and Direct Sparse Approximation
By placing the Elastic Net constraint on the coefficients, modified principal components with sparse
loadings are obtained; this approach is referred to as SPCA. Within the multivariate regression
framework of PCA, the Elastic Net penalty directly modifies the criterion so that the estimated
loadings are sparse.
The idea is to introduce an auxiliary vector α which, without the penalty, is exactly the first
loading vector of ordinary PCA. With α and β both p-vectors, the leading sparse PC of SPCA is
defined by
\[
(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \lVert x_i - \alpha\beta^{T}x_i \rVert^{2} + \lambda_2 \lVert \beta \rVert^{2} + \lambda_1 \lVert \beta \rVert_{1}, \qquad (4.1.1)
\]
subject to αT α = ||α||2 = 1, and the loading vector is v̂ = β̂/||β̂||.
In SPCA, a larger λ1 produces sparser loadings, and different values λ1,j can be used to penalize
the loadings of different principal components. When λ1 = 0, SPCA recovers ordinary PCA for any
λ2 > 0; in particular, when p > n, SPCA recovers PCA if and only if λ2 > 0.
Also, for fixed β, the constraint αT α = 1 implies that α̂ solves the maximization problem
\[
\hat{\alpha} = \arg\max_{\alpha:\,\alpha^{T}\alpha = 1} \alpha^{T} X^{T} X \beta, \qquad (4.1.2)
\]
so that α̂ ∝ X T Xβ.
Extending to Ap×k = [α1 , ..., αk ] and Bp×k = [β1 , ..., βk ], SPCA obtains the first k sparse PCs from
\[
\min_{A,B} \sum_{i=1}^{n} \lVert x_i - AB^{T}x_i \rVert^{2} + \lambda_2 \sum_{j=1}^{k} \lVert \beta_j \rVert^{2} + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_{1}, \qquad (4.1.3)
\]
subject to AT A = Ik×k . Note that v̂j = β̂j /||β̂j ||, j = 1, ..., k. For fixed A, (4.1.3) decomposes
into k independent Elastic Net problems for the columns of B; for fixed B, A has an exact solution
obtained from an SVD (singular value decomposition), as described in Section 4.2.
The LARS-EN algorithm from the Elastic Net regularization problem efficiently computes the whole
sequence of sparse approximations for each PC across the corresponding values of λ1,j . Similar to
the grouping effect, a positive λ2 also handles potential collinearity in the multivariate data X
(Zou et al., 2004).
4.2 Elicitation of Sparse Principal Components
Previously, we defined v̂j = β̂j /||β̂j || as the sparse approximation to the loading vector Vj , so that
X v̂j approximates the jth principal component. Since β̂ is proportional to V1 (β̂ ∝ V1 ) when λ1 = 0,
increasing λ1 makes β̂ sparser and hence yields a sparse v̂j . The whole path over λ1 is solved
efficiently by the LARS-EN algorithm, which is an extension of the Elastic Net regularization method.
The algorithm starts with A set to V[, 1:k], the loadings of the first k ordinary principal components.
Given a fixed A, the second step is to solve the k Naive Elastic Net problems
\[
\beta_j = \arg\min_{\beta}\, (\alpha_j - \beta)^{T} X^{T} X (\alpha_j - \beta) + \lambda_2 \lVert \beta \rVert^{2} + \lambda_{1,j} \lVert \beta \rVert_{1}, \qquad j = 1, \ldots, k. \qquad (4.2.1)
\]
For fixed B = [β1 , ..., βk ], we compute the SVD (singular value decomposition) X T XB = U DV T and
update A = U V T . These two steps are repeated until B converges. Finally, we normalize
v̂j = βj /||βj ||, j = 1, ..., k.
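The alternating procedure just described can be sketched as follows (our own illustration). The β-step uses the identity (αj − β)T X T X(αj − β) = ||Xαj − Xβ||2 , i.e. an Elastic Net regression of Xαj on X; scikit-learn's (alpha, l1_ratio) parameterization is used in place of (λ1,j , λ2 ), so the penalty values here are not on the same scale as in (4.2.1).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def spca(X, k, alpha=0.1, l1_ratio=0.5, n_iter=50):
    """Alternating SPCA sketch: elastic net B-step, SVD A-step, then normalization."""
    p = X.shape[1]
    gram = X.T @ X
    # Step 1: start A at the loadings of the first k ordinary principal components.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                   # p x k
    B = np.zeros((p, k))
    for _ in range(n_iter):                        # fixed iteration count for simplicity
        # Step 2: for fixed A, solve k elastic net regressions of X @ alpha_j on X.
        for j in range(k):
            fit = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                             fit_intercept=False, max_iter=10000).fit(X, X @ A[:, j])
            B[:, j] = fit.coef_
        # Step 3: for fixed B, update A from the SVD of X^T X B (A = U V^T).
        U, _, Vk = np.linalg.svd(gram @ B, full_matrices=False)
        A = U @ Vk
    # Final step: normalize the columns of B to obtain the sparse loadings.
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0.0] = 1.0
    return B / norms
```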
Therefore, SPCA, which is based on the Elastic Net penalization and the LARS-EN algorithm as an
extension of the regularization problem to multivariate analysis, is computationally efficient both
when p is small and when the number of predictors p is large, and it is less prone to missing
important latent variables (Zou et al., 2004).
5 Simulations
The goal of the Monte Carlo simulation study is to examine whether the Elastic Net improves on the
Lasso in both prediction and variable selection performance, especially when there are groups of
correlated predictors. Pseudo-random samples are generated from the linear model
\[
y = X\beta + \sigma\varepsilon, \qquad \varepsilon \sim N(0, 1).
\]
To compare Ridge Regression and the Lasso with the Elastic Net, four simulation examples are used,
varying the correlation structure, the dimensions of X, and the coefficients β. For each example,
the whole data set is generated at once and split into three independent training, validation, and
test sets, with sizes denoted ·/·/· respectively. The training set is used only to fit the models,
the validation set is used to select the tuning parameters, and the test set is used to compare test
errors. The settings for the four simulation examples are as follows (a data-generation sketch for
example (iv) appears after the list):
(i) We generated 100 data sets consisting of 20/20/200 observations with 8 predictors,
β = (3, 1.5, 0, 0, 2, 0, 0, 0), and σ = 3. The pairwise correlation between the ith and jth
predictors was corr(i, j) = 0.5^{|i−j|}.
(ii) The second example is the same as example (i), except βi = 0.85 for all i ∈ {1, ..., 8}.
(iii) In the third example, we simulated 100 data sets with 100/100/400 observations and 40
predictors, with
\[
\beta = (\underbrace{0, \ldots, 0}_{10}, \underbrace{2, \ldots, 2}_{10}, \underbrace{0, \ldots, 0}_{10}, \underbrace{2, \ldots, 2}_{10}),
\]
corr(i, j) = 0.5 for all i ≠ j, and σ = 15.
(iv) In the fourth example, we simulated 100 data sets with 50/50/400 observations and 40
predictors, where the predictors X are generated as
\[
\begin{aligned}
x_i &= Z_1 + \varepsilon^{x}_i, \quad Z_1 \sim N(0,1), \quad i = 1, \ldots, 5,\\
x_i &= Z_2 + \varepsilon^{x}_i, \quad Z_2 \sim N(0,1), \quad i = 6, \ldots, 10,\\
x_i &= Z_3 + \varepsilon^{x}_i, \quad Z_3 \sim N(0,1), \quad i = 11, \ldots, 15,\\
x_i &\overset{\text{i.i.d.}}{\sim} N(0,1), \quad i = 16, \ldots, 40,\\
\varepsilon^{x}_i &\overset{\text{i.i.d.}}{\sim} N(0, 0.01), \quad i = 1, \ldots, 15,
\end{aligned}
\]
and the coefficients are
\[
\beta = (\underbrace{3, \ldots, 3}_{15}, \underbrace{0, \ldots, 0}_{25}),
\]
with σ = 15.
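As a sketch (ours) of how the grouped design of example (iv) can be generated, with hypothetical seeds:

```python
import numpy as np

def simulate_example_iv(n, sigma=15.0, seed=0):
    """Generate one data set from simulation example (iv)."""
    rng = np.random.default_rng(seed)
    p = 40
    X = np.empty((n, p))
    # Three latent factors Z1, Z2, Z3, each shared by a group of five predictors.
    for cols in (range(0, 5), range(5, 10), range(10, 15)):
        Z = rng.normal(size=n)
        for j in cols:
            X[:, j] = Z + rng.normal(scale=np.sqrt(0.01), size=n)  # eps ~ N(0, 0.01)
    # The remaining 25 predictors are independent standard normal noise.
    X[:, 15:] = rng.normal(size=(n, 25))
    beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y, beta

# 50/50/400 observations for the training/validation/test split.
X_train, y_train, beta = simulate_example_iv(50, seed=1)
X_val, y_val, _ = simulate_example_iv(50, seed=2)
X_test, y_test, _ = simulate_example_iv(400, seed=3)
```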
Table 1 and Figure 2 show the median MSE comparison between the Elastic Net, Lasso, and
Ridge Regression based on 100 runs. Please note that the numbers in parentheses in Table 1
are the corresponding standard errors of the medians. Table 2 and Figure 3 show the median
number of non-zero coefficients selected by the Elastic Net and Lasso.
Method         Ex.1           Ex.2           Ex.3            Ex.4
Elastic-net    3.42 (0.33)    3.43 (0.30)    16.03 (0.62)    17.63 (1.07)
Lasso          3.47 (0.37)    3.61 (0.33)    16.38 (0.65)    18.89 (1.55)
Ridge          3.59 (0.36)    3.37 (0.28)    15.77 (0.57)    20.40 (1.33)

Table 1: Median MSE for the simulated examples and 3 methods based on 100 simulations
Method         Ex.1   Ex.2   Ex.3   Ex.4
Lasso          5      6      20     20
Elastic-net    6      7      28     20.5

Table 2: Median number of non-zero coefficients
Figure 2: Box-plot of MSE for the simulated examples and 3 methods based on 100 simulations
Figure 3: Plot of the median number of non-zero coefficients for the Lasso and Elastic Net
Observing from Table 1 and Figure 2, we can see that the Elastic Net is more accurate
than the Lasso in all four examples in terms of prediction performance, even when the Lasso
is significantly more accurate than Ridge Regression. From these results, we can conclude that
the Elastic Net dominates the Lasso, especially under collinearity. Furthermore, looking at the
median number of non-zero coefficients in Table 2 and Figure 3, the Elastic Net selects more
predictors than the Lasso due to the grouping effect and can generate sparse solutions. Also,
notice that in example (iv) there are three equally important groups, each with 5 predictors, in
addition to 25 noise predictors. Example (iv) requires grouped variable selection, and the Elastic
Net behaves like the ideal model in this scenario.
6 Conclusion
The Lasso can select at most n predictors in the p > n case and cannot perform grouped selection.
Furthermore, Ridge regression usually has a better prediction performance than the Lasso when
there are high correlations between predictors in the n > p case. The Elastic Net can produce
a sparse model with good prediction accuracy, while selecting group(s) of strongly correlated
predictors. It can also potentially select all p predictors in all situations. A new algorithm
called LARS-EN can be used for computing elastic net regularization paths efficiently, similar
to the LARS algorithm for the Lasso. The Elastic Net has two tuning parameters as opposed
to one tuning parameter like the Lasso, which can be selected using a training and validation
set. Simulation results indicate that the Elastic Net dominates the Lasso, especially under
collinearity. Finally, the Elastic Net can be extended to Classification and Sparse PCA problems.
References
[1] Hui Zou and Trevor Hastie (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67(2), 301-320.
[2] Hui Zou, Trevor Hastie, and Robert Tibshirani (2004). Sparse Principal Component Analysis. https://web.stanford.edu/~hastie/Papers/spc_jcgs.pdf