STA315 Final Project: Regularization and Variable Selection via the Elastic Net

Hasaan Soomro, Kyuson Lim

April 25, 2021

Abstract

Zou and Hastie (2005) introduced a regularization method called the Elastic Net that combines the L1 and L2 penalties and performs variable selection and regularization simultaneously. The Elastic Net can perform grouped variable selection when predictors are strongly correlated, and can potentially select all p predictors in the p > n case. Simulation studies show that the Elastic Net has better prediction accuracy than the Lasso while enjoying a similar sparsity. A new algorithm called LARS-EN, similar to the LARS algorithm for the Lasso, can be used for computing Elastic Net regularization paths efficiently. Finally, the Elastic Net can be extended to Classification and Sparse PCA problems.

1 Introduction and Motivation

1.1 Importance of Variable Selection

Variable selection is an important problem in Statistics and Machine Learning. With the rise of Big Data, modern data analysis often involves a large number of predictors (p > n). In such scenarios, we would like to model the relationship between the response and the best subset of predictors by eliminating redundant or irrelevant predictors. Both parsimony and prediction performance are important. Variable selection helps overcome the challenges posed by high-dimensional settings: it speeds up computation, reduces overfitting, mitigates the curse of dimensionality, and avoids complex, uninterpretable models. Several variable selection methods, such as the Lasso and the Elastic Net, have been introduced over the years, each well suited to specific scenarios. For example, when the number of predictors exceeds the number of observations (p > n) and there are groups of strongly correlated predictors, the Elastic Net is often used.

1.2 Evaluating the Quality of a Model

Typically, two major aspects are considered when evaluating the quality of a model:

• Prediction accuracy on future unseen data: models that predict well on new data are preferred.

• Model interpretation: simpler models are preferred because they make the relationship between the response variable and the predictors easier to understand. When the number of predictors is large, parsimony is particularly important.

A good model therefore predicts well on future unseen data and is simple enough to reveal the relationship between the response variable and the predictors. A common baseline is OLS (Ordinary Least Squares), whose estimates are obtained by minimizing the residual sum of squares:

\hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^T x_i)^2    (1.2.1)

Ordinary Least Squares often performs poorly in both prediction and interpretation. Penalization techniques such as the Lasso (L1 penalty) and Ridge Regression (L2 penalty) have been introduced to improve on Ordinary Least Squares. However, the Lasso and Ridge Regression also perform poorly in some scenarios, for example when predictors are strongly correlated and grouped variable selection is required. This aspect is discussed further in the upcoming sections.
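To make the OLS criterion in (1.2.1) concrete, here is a minimal sketch in Python (NumPy) using synthetic data of our own; the variable names and data are purely illustrative and not taken from the paper.

```python
# Minimal sketch of the OLS criterion in (1.2.1) on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS estimate: the beta that minimizes the residual sum of squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)
print("OLS coefficients:", np.round(beta_hat, 3))
print("Residual sum of squares:", round(rss, 3))
```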
1.3 Grouped Variable Selection and the p ≫ n Problem

In many problems there are more predictors than observations (p > n). Furthermore, there are often groups of predictors among which the pairwise correlations are very high. For example, a typical microarray data set has fewer than 100 observations and thousands of predictors (genes). Genes sharing the same biological pathway tend to be highly correlated with one another and can be regarded as forming a group. In such a scenario, an ideal model would do the following:

• Automatic variable selection: the model automatically eliminates the trivial predictors and selects the best subset of predictors.

• Grouped selection: if one predictor in a highly correlated group is selected, the model selects the whole group automatically.

1.4 Lasso and Ridge Regression

Recall that the Lasso (Tibshirani, 1996) is a penalized least squares method that performs continuous shrinkage and automatic variable selection simultaneously by imposing an L1 penalty on the regression coefficients:

\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1    (1.4.1)

Also recall that Ridge Regression is a continuous shrinkage method that achieves good prediction performance through the bias-variance trade-off, but does not perform variable selection. Ridge Regression imposes an L2 penalty on the regression coefficients:

\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2    (1.4.2)

1.5 Limitations of Lasso and Ridge Regression

The Lasso and Ridge Regression have been successful in many settings, but certain limitations make them unsuitable for the p > n scenario with grouped variables, as well as for some n > p cases. Firstly, because of the nature of its convex optimization problem, the Lasso can select at most n variables before saturating in the p > n case. Furthermore, unless the bound on the L1 penalty of the regression coefficients is below a certain threshold, the Lasso is not well defined. Secondly, the Lasso cannot perform grouped variable selection: when there is a group of highly correlated variables, the Lasso tends to select only one variable from the group, and it does not care which one is selected, instead of selecting or eliminating the entire group. Lastly, when n > p and the predictors are highly correlated, studies have shown that Ridge Regression tends to have better prediction performance than the Lasso.

Therefore, the main objective is a method that enjoys prediction performance and sparsity similar to the Lasso while overcoming the limitations above. The model should potentially be able to select all p predictors in the p > n case, perform grouped variable selection when there is a group of highly correlated variables, and predict better than the Lasso when n > p and the predictors are highly correlated. To address this, the upcoming sections discuss the Elastic Net model introduced by Zou and Hastie (2005), which performs automatic variable selection and continuous shrinkage simultaneously and can select groups of correlated predictors.
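As a rough illustration of these limitations, the following sketch (our own toy setup, assuming scikit-learn is available) fits the Lasso and Ridge Regression on a p > n problem; the penalty values are arbitrary and the non-zero counts are only indicative.

```python
# Illustrative sketch (not from the paper): in a p > n problem the lasso
# saturates at no more than n non-zero coefficients, while ridge keeps all p.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 30, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:40] = 1.0                 # 40 relevant predictors, more than n = 30
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.05, max_iter=50_000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The lasso path saturates at no more than n predictors; ridge shrinks but
# keeps every coefficient non-zero.
print("non-zero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
```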
2 Naive Elastic Net

2.1 Definition

Zou and Hastie (2005) introduced the Naive Elastic Net, which combines the L1 and L2 penalties of the Lasso and Ridge Regression. Given a data set with n observations and p predictors, let y = (y_1, ..., y_n)^T be the response and X = (x_1 | ... | x_p) the design matrix, with y centered and X standardized:

\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \quad \text{for } j = 1, 2, \ldots, p    (2.1.1)

Then the Naive Elastic Net optimization problem for non-negative λ1, λ2 is:

\hat{\beta} = \arg\min_{\beta} \left\{ |y - X\beta|^2 + \lambda_2 |\beta|^2 + \lambda_1 |\beta|_1 \right\}    (2.1.2)

In equation (2.1.2), the L1 penalty enables the Naive Elastic Net to produce a sparse model, while the L2 penalty lets it perform grouped variable selection and does not limit the number of predictors that can be selected, so all p predictors can potentially enter the model. Sections 2.2 and 2.3 discuss the grouping effect and the possibility of selecting all p predictors in more detail. Note also that as λ1 → 0 the Naive Elastic Net behaves more like Ridge Regression, and as λ2 → 0 it behaves more like the Lasso.

An alternative representation of the Naive Elastic Net criterion can also be formulated. Let α = λ2/(λ1 + λ2); then the optimization problem is equivalent to:

\hat{\beta} = \arg\min_{\beta} \; |y - X\beta|^2, \quad \text{subject to } (1-\alpha)|\beta|_1 + \alpha|\beta|^2 \le t, \text{ for some } t    (2.1.3)

In equation (2.1.3), when α = 1 the Naive Elastic Net becomes Ridge Regression, and when α = 0 it becomes the Lasso. The Naive Elastic Net penalty is strictly convex for α > 0 and singular (non-differentiable) at 0 for α < 1. Therefore, for α ∈ (0, 1), the Naive Elastic Net enjoys the properties of both Ridge Regression and the Lasso. Figure 1 shows that the penalty has singularities at the vertices, which produce sparsity, and strictly convex edges, with the strength of convexity varying with α.

Figure 1: Contours of the ridge penalty, the elastic net penalty with α = 0.5, and the Lasso penalty.

2.2 Solution

The Naive Elastic Net optimization problem can be solved at essentially the same computational cost as the Lasso, because the problem in equation (2.1.2) can be transformed into an equivalent Lasso-type problem on augmented data. Consider the following lemma:

Lemma 1: Given the data set (y, X) and (λ1, λ2), construct the artificial data set (y*, X*):

X^*_{(n+p)\times p} = (1+\lambda_2)^{-1/2} \begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}, \qquad y^*_{(n+p)} = \begin{pmatrix} y_{n\times 1} \\ 0_{p\times 1} \end{pmatrix}    (2.2.1)

Let γ = λ1/√(1 + λ2) and β* = √(1 + λ2) β. Then the Naive Elastic Net problem can be rewritten as

L(\gamma, \beta) = L(\gamma, \beta^*) = |y^* - X^* \beta^*|^2 + \gamma |\beta^*|_1    (2.2.2)

Define

\hat{\beta}^* = \arg\min_{\beta^*} \; L(\gamma, \beta^*)    (2.2.3)

Then

\hat{\beta} = \frac{1}{\sqrt{1+\lambda_2}} \, \hat{\beta}^*    (2.2.4)

Lemma 1 shows that the Naive Elastic Net has the potential to select all p predictors in all situations: in the transformed artificial data set, X* has rank p and the sample size is n + p. Recall that in the p > n case the Lasso can select at most n predictors; the Naive Elastic Net overcomes this major limitation of the Lasso precisely because X* has rank p and the augmented sample size is n + p. Moreover, Lemma 1 also indicates that the Naive Elastic Net can perform automatic variable selection.

2.3 Solution (Orthogonal Design)

For an orthogonal design matrix (X^T X = I), the Naive Elastic Net solution becomes:

\hat{\beta}_i(\text{naive elastic net}) = \frac{(|\hat{\beta}_i(\text{OLS})| - \lambda_1/2)_+}{1 + \lambda_2} \cdot \operatorname{sign}\{\hat{\beta}_i(\text{OLS})\}    (2.3.1)

The Ridge Regression solution under the orthogonal design is:

\hat{\beta}_i(\text{ridge}) = \hat{\beta}_i(\text{OLS})/(1 + \lambda_2)    (2.3.2)

Finally, the Lasso solution for the orthogonal design matrix is:

\hat{\beta}_i(\text{lasso}) = (|\hat{\beta}_i(\text{OLS})| - \lambda_1/2)_+ \cdot \operatorname{sign}\{\hat{\beta}_i(\text{OLS})\}    (2.3.3)

Note that \hat{\beta}_i(\text{OLS}) = x_i^T y under the orthogonal design, and sign(·) is either −1, 0 or 1.
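The following sketch illustrates Lemma 1 under our own assumptions: it builds the augmented data (y*, X*), solves the resulting lasso with scikit-learn (whose objective divides the squared loss by twice the sample size, so the penalty must be rescaled accordingly), and compares the result with a direct naive elastic net fit. The penalty values and data are illustrative only.

```python
# Sketch of Lemma 1 (not the authors' code): solve the naive elastic net by
# running a lasso on the augmented data (y*, X*) and rescaling the solution.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(2)
n, p = 40, 10
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc ** 2).sum(axis=0))        # standardize as in (2.1.1)
beta_true = np.zeros(p); beta_true[:3] = [5.0, -5.0, 3.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)
y = y - y.mean()                               # center the response

lam1, lam2 = 0.5, 1.0

# Augmented data of Lemma 1: X* is (n+p) x p, y* is (n+p).
X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
y_star = np.concatenate([y, np.zeros(p)])
gamma = lam1 / np.sqrt(1 + lam2)

# scikit-learn's Lasso minimizes (1/2m)|y - Xb|^2 + a|b|_1, so a = gamma/(2m).
m = n + p
lasso = Lasso(alpha=gamma / (2 * m), fit_intercept=False,
              max_iter=200_000, tol=1e-10).fit(X_star, y_star)
beta_naive_via_lemma = lasso.coef_ / np.sqrt(1 + lam2)   # beta = beta*/sqrt(1+lam2)

# Cross-check against a direct naive elastic net fit with the same criterion,
# mapped to scikit-learn's (alpha, l1_ratio) parametrization.
enet = ElasticNet(alpha=(lam1 + 2 * lam2) / (2 * n),
                  l1_ratio=lam1 / (lam1 + 2 * lam2),
                  fit_intercept=False, max_iter=200_000, tol=1e-10).fit(X, y)
print(np.round(beta_naive_via_lemma, 4))
print(np.round(enet.coef_, 4))   # the two should agree up to solver tolerance
```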
2.4 Grouping Effect

One of the major advantages of the Naive Elastic Net over the Lasso is its ability to perform grouped variable selection, where whole groups of correlated predictors are selected or eliminated together. This section gives a more detailed justification of the grouping effect. Generally, a regression model tends to exhibit the grouping effect if the regression coefficients of highly correlated variables are nearly equal. To understand this, consider the general penalization method:

\hat{\beta} = \arg\min_{\beta} \; |y - X\beta|^2 + \lambda J(\beta), \qquad J(\beta) > 0 \text{ for } \beta \neq 0    (2.4.1)

Lemma 2: Assume that predictors i and j are identical (x_i = x_j) for some i, j ∈ {1, ..., p}.

• (i) If J(·) is strictly convex, then β̂_i = β̂_j for all λ > 0.

• (ii) If J(β) = |β|_1, then β̂_i β̂_j ≥ 0 and β̂* is another minimizer of (2.4.1), where

\hat{\beta}^*_k = \begin{cases} \hat{\beta}_k & k \neq i,\; k \neq j \\ (\hat{\beta}_i + \hat{\beta}_j)\, s & k = i \\ (\hat{\beta}_i + \hat{\beta}_j)\,(1 - s) & k = j \end{cases}, \qquad s \in [0, 1]    (2.4.2)

Parts (i) and (ii) of Lemma 2 show that strict convexity guarantees the grouping effect in the extreme situation where predictors are identical. The Naive Elastic Net penalty is strictly convex because of its quadratic term, so it exhibits the grouping effect. The Lasso penalty, by contrast, is not strictly convex and does not have a unique solution in this situation, so the Lasso cannot perform grouped selection.

A quantitative description of the grouping effect of the Naive Elastic Net can also be given. Define D_{λ1,λ2}(i, j) as the (scaled) difference between the coefficient paths of x_i and x_j. Then the following theorem applies:

Theorem 1: Given data (y, X) and parameters (λ1, λ2), with the response y centered and the predictors X standardized, let β̂(λ1, λ2) be the Naive Elastic Net estimate. Suppose β̂_i(λ1, λ2) β̂_j(λ1, λ2) > 0, and define

D_{\lambda_1,\lambda_2}(i, j) = \frac{1}{|y|_1} \, \bigl| \hat{\beta}_i(\lambda_1, \lambda_2) - \hat{\beta}_j(\lambda_1, \lambda_2) \bigr|    (2.4.3)

Then

D_{\lambda_1,\lambda_2}(i, j) \le \frac{1}{\lambda_2} \sqrt{2(1 - \rho)}, \qquad \rho = x_i^T x_j \text{ (the sample correlation)}    (2.4.4)

As ρ → 1, the difference between β̂_i and β̂_j converges to 0, i.e. the coefficients of predictors i and j become nearly identical. Theorem 1 therefore shows that strongly correlated predictors receive nearly identical coefficients under the Naive Elastic Net, which provides a quantitative justification of its grouped variable selection.
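A quick numerical illustration of the grouping effect (our own toy data; the penalty values are arbitrary): two nearly identical predictors should receive nearly equal coefficients under the (naive) elastic net penalty, whereas the lasso typically keeps only one of them.

```python
# Illustrative grouping-effect check, not taken from the paper.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
n = 100
z = rng.normal(size=n)
x1 = z + 0.01 * rng.normal(size=n)
x2 = z + 0.01 * rng.normal(size=n)     # corr(x1, x2) is close to 1
x3 = rng.normal(size=n)                # an unrelated predictor
X = np.column_stack([x1, x2, x3])
y = 3 * z + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # naive elastic net style penalty

print("lasso coefficients:      ", np.round(lasso.coef_, 3))  # typically one of the pair is ~0
print("elastic net coefficients:", np.round(enet.coef_, 3))   # the pair gets similar values
```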
2.5 Limitations of the Naive Elastic Net

Even though the Naive Elastic Net can perform grouped variable selection and can potentially select all p predictors in all situations, it incurs a double amount of shrinkage because it applies a two-step procedure combining Lasso-type and Ridge-type shrinkage. This double shrinkage introduces unnecessary bias and hurts prediction performance. Therefore, Zou and Hastie (2005) introduce the Elastic Net, obtained by multiplying the Naive Elastic Net coefficients by 1 + λ2; the Elastic Net is a rescaled Naive Elastic Net that undoes the extra shrinkage.

One motivation for using 1 + λ2 as the rescaling factor comes from the ridge operator. Under the orthogonal design, the ridge operator satisfies

R = (X^T X + \lambda_2 I)^{-1} X^T = (1 + \lambda_2)^{-1} R^*,    (2.5.1)

where R* is the least squares operator, so the ridge fit is the least squares fit shrunk by the factor (1 + λ2)^{-1}; multiplying by 1 + λ2 removes this extra shrinkage while keeping the variable selection property. The following theorem shows that the Elastic Net can also be interpreted as a stabilized version of Lasso-type shrinkage.

Theorem 2: Given data (y, X) and (λ1, λ2), the Elastic Net estimates β̂(elastic net) = (1 + λ2) β̂(naive elastic net) are given by

\hat{\beta}(\text{elastic net}) = \arg\min_{\beta} \; \beta^T \left( \frac{X^T X + \lambda_2 I}{1 + \lambda_2} \right) \beta - 2 y^T X \beta + \lambda_1 |\beta|_1    (2.5.2)

For comparison, the Lasso solves β̂(lasso) = arg min_β β^T (X^T X) β − 2 y^T X β + λ1 |β|_1. Since X is standardized, Σ̂ = X^T X is the sample correlation matrix of the predictors, and the Elastic Net replaces it with the shrunken version

\hat{\Sigma}_{\lambda_2} = \frac{1}{1+\lambda_2} \hat{\Sigma} + \frac{\lambda_2}{1+\lambda_2} I    (2.5.3)

As one can see, replacing Σ̂ in this way is the same as applying Lasso-type shrinkage after shrinking the sample correlation matrix toward the identity, so the Elastic Net acts as a stabilized Lasso.

3 Elastic Net

3.1 Definition/Estimate

Zou and Hastie (2005) define the Elastic Net by rescaling the coefficients of the Naive Elastic Net. We saw above that the Naive Elastic Net problem is equivalent to a Lasso-type problem, which also covers the p > n case. For data (y, X), penalty parameters (λ1, λ2) and augmented data (y*, X*), the Naive Elastic Net solves the Lasso-type problem

\hat{\beta}^* = \arg\min_{\beta^*} \; |y^* - X^* \beta^*|^2 + \frac{\lambda_1}{\sqrt{1+\lambda_2}} |\beta^*|_1,    (3.1.1)

which corresponds to the Naive Elastic Net criterion

\hat{\beta}(\text{naive elastic net}) = \arg\min_{\beta} \; \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1.    (3.1.2)

Hence the Elastic Net (corrected) estimates of β̂ are defined by

\hat{\beta}(\text{elastic net}) = \sqrt{1+\lambda_2} \; \hat{\beta}^*.    (3.1.3)

Since β̂(naive elastic net) = (1/√(1+λ2)) β̂*, it follows that

\hat{\beta}(\text{elastic net}) = (1 + \lambda_2)\, \hat{\beta}(\text{naive elastic net}).    (3.1.4)

Thus the Elastic Net coefficients are a rescaling of the Naive Elastic Net coefficients. We now introduce the LARS-EN algorithm, which is similar to the LARS algorithm for the Lasso and solves the Elastic Net problem efficiently; for each fixed λ2, it computes the entire path in λ1 and can be stopped early.

3.2 LARS-EN Computation Algorithm

For a fixed λ2, the LARS algorithm computes the entire Lasso solution path at roughly the computational cost of a single least squares fit. Applying it to the augmented data of Lemma 1 is particularly attractive because X* is sparse in the p ≫ n case (its lower block √λ2 I contributes only one non-zero entry per column), so the Elastic Net solutions are obtained along the path with little extra work. At step k, with active variable set A_k, the matrix

G_{A_k} = X^{*T}_{A_k} X^{*}_{A_k} = \frac{1}{1+\lambda_2} \left( X^T_{A_k} X_{A_k} + \lambda_2 I \right)    (3.2.1)

must be inverted; instead of recomputing it, LARS-EN updates or downdates the Cholesky factorization R_{k-1} of G_{A_{k-1}} found at the previous iteration. Only the non-zero coefficients and the active set are recorded at each of the k steps. In the p ≫ n case, the algorithm can be stopped after an early number of steps, which makes the computation efficient.

3.3 Choice of Tuning Parameters

The Elastic Net has two tuning parameters (λ1, λ2), and the amount of Lasso-type shrinkage can be indexed in three equivalent ways, as in Lasso regression: by λ1 itself, by the L1 bound t, or by the fraction of the L1 norm s ∈ [0, 1]. Because of the proportional relationship in (3.1.3), the Elastic Net parameters can therefore be chosen as a pair (λ2, s) or (λ2, t). Since the LARS-EN algorithm delivers the whole solution path of its forward fitting process, the number of steps k can also be used in place of λ1. When only a training set is available, generic 10-fold cross-validation is used: for each fixed λ2 in a grid such as (0, 0.01, 0.1, 1, 10, 100), LARS-EN is run and the other parameter (λ1, s or k) is selected by cross-validation; the λ2 giving the lowest CV error is then chosen. Although early stopping keeps LARS-EN cheap in p ≫ n examples, cross-validation remains relatively expensive when n > p, since it requires repeated least squares fits.
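The following sketch mimics this tuning scheme with plain 10-fold cross-validation over a (λ2, λ1) grid, assuming scikit-learn's ElasticNet is available. Because scikit-learn uses an (alpha, l1_ratio) parametrization and divides the squared loss by 2n, a small conversion helper is needed; the grid values, data and helper name are our own choices, and the search is over λ1 rather than over s or the number of LARS-EN steps.

```python
# Sketch of the (lambda2, lambda1) selection scheme via 10-fold CV.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

def to_sklearn(lam1, lam2, n):
    """Map |y-Xb|^2 + lam1|b|_1 + lam2|b|^2 to scikit-learn's
    (1/2n)|y-Xb|^2 + alpha*l1_ratio*|b|_1 + 0.5*alpha*(1-l1_ratio)*|b|^2."""
    alpha = (lam1 + 2 * lam2) / (2 * n)
    l1_ratio = lam1 / (lam1 + 2 * lam2)
    return alpha, l1_ratio

rng = np.random.default_rng(4)
n, p = 60, 20
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(size=n)

lam2_grid = [0, 0.01, 0.1, 1, 10, 100]      # mirrors the grid mentioned above
lam1_grid = np.logspace(-2, 1, 10)
best = None
for lam2 in lam2_grid:
    for lam1 in lam1_grid:
        alpha, l1_ratio = to_sklearn(lam1, lam2, n)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=50_000)
        cv_mse = -cross_val_score(model, X, y, cv=10,
                                  scoring="neg_mean_squared_error").mean()
        if best is None or cv_mse < best[0]:
            best = (cv_mse, lam1, lam2)
print("lowest CV error %.3f at lambda1=%.3g, lambda2=%.3g" % best)
```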
4 Extending the Elastic Net to Classification and Sparse PCA

The recently extended method of sparse principal component analysis (SPCA) applies elastic net regularization to principal component analysis so that the loadings become sparse, by recasting PCA as a regression-type optimization problem. Since PCA can be formulated as a multivariate regression, SPCA constrains the multivariate regression coefficients so that the resulting loadings are sparse (Zou et al., 2004). In the notation below, X is n × p and x_i is the ith row vector of X. SPCA can be used as a two-step exploratory analysis: first perform PCA, then use the elastic net criterion β̂ = arg min_β ||y − Xβ||² + λ2||β||² + λ1||β||₁ to obtain sparse approximations of the loadings (Zou et al., 2004).

4.1 Method of SPCA and Direct Sparse Approximation

By placing an elastic net constraint on the coefficients, modified principal components with sparse loadings are obtained; this is what is referred to as SPCA. Building on the multivariate regression formulation of PCA, the elastic net penalty directly modifies the derivation of the loadings so that they are sparse. The idea is to constrain α, which corresponds to the first k loading vectors of ordinary PCA. With α and β both p-vectors, the leading sparse principal component is defined by

(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \; \sum_{i=1}^{n} \|x_i - \alpha \beta^T x_i\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1,    (4.1.1)

subject to α^T α = ||α||² = 1, with loadings v̂ = β̂/||β̂||.

In SPCA, a large λ1 generates sparse loadings, and different values λ1,j allow the loadings of different principal components to be penalized differently. When λ1 = 0 and λ2 > 0, SPCA recovers ordinary PCA; when p > n, a strictly positive λ2 is required for this equivalence to hold. Also, for given β, the constraint α^T α = 1 means that α̂ is the maximizer

\hat{\alpha} = \arg\max_{\alpha^T \alpha = 1} \; \operatorname{Tr}(\alpha^T \beta).    (4.1.2)

Extending to A_{p×k} = [α_1, ..., α_k] and B_{p×k} = [β_1, ..., β_k], SPCA obtains the first k sparse principal components from

\min_{A, B} \; \sum_{i=1}^{n} \|x_i - A B^T x_i\|^2 + \lambda_2 \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1,    (4.1.3)

subject to A^T A = I_{k×k}, with v̂_j = β_j/||β_j||, j = 1, ..., k. For fixed A, this decouples into k independent elastic net problems for B; for fixed B, A has an exact solution obtained from a singular value decomposition (SVD). The LARS-EN algorithm from the elastic net regularization problem efficiently solves the whole sequence of sparse approximations for each principal component and its corresponding values of λ1,j. Analogously to the grouping effect, a positive ridge penalty λ2 > 0 also overcomes the potential collinearity in the multivariate data X (Zou et al., 2004).

4.2 Computing the Sparse Principal Components

As defined above, v̂_j = β̂_j/||β̂_j|| approximates the loading vector V_j, and X V̂_j is the jth principal component. Since β̂ is proportional to V_1 (β̂ ∝ V_1), a larger λ1 gives a sparser β̂ and hence a sparse V̂_j; the path over λ1 is solved efficiently by the LARS-EN algorithm, the extension of the regularization method described above. The computation alternates between two steps. The loadings A are initialized at the first k ordinary principal component loadings, A = V[, 1:k]. Given a fixed A, the first step solves the naive elastic net problem

\beta_j = \arg\min_{\beta} \; (\alpha_j - \beta)^T X^T X (\alpha_j - \beta) + \lambda_2 \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1, \qquad j = 1, \ldots, k.    (4.2.1)

Given a fixed B = [β_1, ..., β_k], the second step computes the singular value decomposition X^T X B = U D V^T and updates A = U V^T. The two steps are repeated until β converges. Finally, the loadings are normalized: v̂_j = β_j/||β_j||, j = 1, ..., k.

In summary, SPCA, which is based on the elastic net penalization and the LARS-EN algorithm as an extension of the regularization problem to multivariate analysis, is computationally efficient for both small and very large numbers of predictors p, and it is less prone to missing important latent variables (Zou et al., 2004).
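Below is a rough re-implementation of this alternating procedure (not the authors' code): the B-step solves each problem (4.2.1) as an elastic net regression of Xα_j on X, since (α_j − β)^T X^T X (α_j − β) = |Xα_j − Xβ|², and the A-step uses the SVD update. The penalty values, the shared λ1 across components, and the convergence tolerance are our own simplifications.

```python
# Rough sketch of the alternating SPCA procedure described above.
import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_pca(X, k, alpha=0.05, l1_ratio=0.9, n_iter=100, tol=1e-6):
    X = X - X.mean(axis=0)                      # work with centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                 # initialize A with the first k PC loadings
    B = np.zeros_like(A)
    for _ in range(n_iter):
        B_old = B.copy()
        # B-step: k independent elastic net regressions, one per component.
        for j in range(k):
            enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                              fit_intercept=False, max_iter=50_000)
            B[:, j] = enet.fit(X, X @ A[:, j]).coef_
        # A-step: Procrustes-type update via the SVD of X'X B.
        U, _, Vt2 = np.linalg.svd(X.T @ X @ B, full_matrices=False)
        A = U @ Vt2
        if np.max(np.abs(B - B_old)) < tol:
            break
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0
    return B / norms                             # normalized sparse loadings

# Toy data with two latent factors driving two blocks of predictors.
rng = np.random.default_rng(5)
n = 80
z1, z2 = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
X = np.hstack([z1 + 0.1 * rng.normal(size=(n, 4)),
               z2 + 0.1 * rng.normal(size=(n, 4)),
               rng.normal(size=(n, 2))])
print(np.round(sparse_pca(X, k=2), 3))
```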
5 Simulations

The goal of the Monte Carlo simulation study is to assess whether the Elastic Net outperforms the Lasso in both prediction and variable selection, especially when p > n and there are groups of correlated predictors. Pseudo-random samples are generated from the linear model

y = X\beta + \sigma \epsilon, \qquad \epsilon \sim N(0, 1).

To compare Ridge Regression and the Lasso with the Elastic Net, four simulation settings are used, varying the correlation structure, the dimensions of X and the coefficient vector β. For each example, the data are generated and split into three independent training, validation and test sets, denoted ·/·/·, respectively. The training set is used only to fit the models, the validation set is used to select the optimal tuning parameters, and the test set is used to compare test errors. The settings for the four simulation examples are as follows (a reduced-scale sketch of example (i) follows the list):

(i) We generated 100 data sets consisting of 20/20/200 observations with 8 predictors, β = (3, 1.5, 0, 0, 2, 0, 0, 0), and σ = 3. The pairwise correlation between the ith and jth predictors was corr(i, j) = 0.5^{|i−j|}.

(ii) The second example is the same as example (i), except that β_i = 0.85 for all i ∈ {1, ..., 8}.

(iii) In the third example, we simulated 100 data sets with 100/100/400 observations and 40 predictors, with

β = (0, ..., 0, 2, ..., 2, 0, ..., 0, 2, ..., 2)  (four blocks of 10),

corr(i, j) = 0.5 for all i ≠ j, and σ = 15.

(iv) In the fourth example, we simulated 100 data sets with 50/50/400 observations and 40 predictors, where the predictors are generated as

x_i = Z_1 + \epsilon^x_i, \quad Z_1 \sim N(0, 1), \quad i = 1, \ldots, 5,
x_i = Z_2 + \epsilon^x_i, \quad Z_2 \sim N(0, 1), \quad i = 6, \ldots, 10,
x_i = Z_3 + \epsilon^x_i, \quad Z_3 \sim N(0, 1), \quad i = 11, \ldots, 15,
x_i \sim N(0, 1) \text{ i.i.d.}, \quad i = 16, \ldots, 40,
\epsilon^x_i \sim N(0, 0.01) \text{ i.i.d.}, \quad i = 1, \ldots, 15,

with coefficients β = (3, ..., 3, 0, ..., 0) (15 threes followed by 25 zeros) and σ = 15.
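Below is a reduced-scale sketch of simulation example (i): 20 runs instead of 100, a small tuning grid of our own choice, and scikit-learn's ElasticNet (i.e. the naive elastic net without the (1 + λ2) rescaling, with a fixed l1_ratio). It is meant only to illustrate the protocol of validation-set tuning and test-set comparison, not to reproduce Table 1.

```python
# Reduced-scale sketch of simulation example (i).
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(6)
p, sigma = 8, 3.0
beta = np.array([3, 1.5, 0, 0, 2, 0, 0, 0], dtype=float)
cov = np.array([[0.5 ** abs(i - j) for j in range(p)] for i in range(p)])

def draw(n):
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return X, X @ beta + sigma * rng.normal(size=n)

def tune(make_model, grid, Xtr, ytr, Xva, yva):
    """Pick the grid value with the lowest validation MSE."""
    fits = [make_model(g).fit(Xtr, ytr) for g in grid]
    errs = [np.mean((yva - f.predict(Xva)) ** 2) for f in fits]
    return fits[int(np.argmin(errs))]

alphas = np.logspace(-2, 0.5, 10)
results = {"lasso": [], "enet": []}
for _ in range(20):
    Xtr, ytr = draw(20); Xva, yva = draw(20); Xte, yte = draw(200)
    best_lasso = tune(lambda a: Lasso(alpha=a, max_iter=50_000),
                      alphas, Xtr, ytr, Xva, yva)
    best_enet = tune(lambda a: ElasticNet(alpha=a, l1_ratio=0.5, max_iter=50_000),
                     alphas, Xtr, ytr, Xva, yva)
    for name, m in [("lasso", best_lasso), ("enet", best_enet)]:
        results[name].append((np.mean((yte - m.predict(Xte)) ** 2),
                              int(np.sum(m.coef_ != 0))))
for name, vals in results.items():
    mse, nz = np.array(vals).T
    print(f"{name}: median test MSE {np.median(mse):.2f}, "
          f"median non-zero coefficients {np.median(nz):.1f}")
```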
Table 1 and Figure 2 show the median MSE comparison between the Elastic Net, the Lasso and Ridge Regression based on 100 runs; the numbers in parentheses in Table 1 are the corresponding standard errors of the medians. Table 2 and Figure 3 show the median number of non-zero coefficients selected by the Elastic Net and the Lasso.

Method        Ex.1          Ex.2          Ex.3           Ex.4
Elastic-net   3.42 (0.33)   3.43 (0.30)   16.03 (0.62)   17.63 (1.07)
Lasso         3.47 (0.37)   3.61 (0.33)   16.38 (0.65)   18.89 (1.55)
Ridge         3.59 (0.36)   3.37 (0.28)   15.77 (0.57)   20.40 (1.33)

Table 1: Median MSE for the simulated examples and the three methods, based on 100 simulations.

Method        Ex.1   Ex.2   Ex.3   Ex.4
Lasso         5      6      20     20
Elastic-net   6      7      28     20.5

Table 2: Median number of non-zero coefficients.

Figure 2: Box plots of the MSE for the simulated examples and the three methods, based on 100 simulations.

Figure 3: Plot of the median number of non-zero coefficients for the Lasso and the Elastic Net.

From Table 1 and Figure 2, the Elastic Net is more accurate than the Lasso in all four examples in terms of prediction performance, even in the example where the Lasso is clearly more accurate than Ridge Regression. From these results, we can conclude that the Elastic Net dominates the Lasso, especially under collinearity. Furthermore, looking at the median number of non-zero coefficients in Table 2 and Figure 3, the Elastic Net selects more predictors than the Lasso because of the grouping effect, while still producing sparse solutions. Notice also that example (iv) contains three equally important groups of 5 predictors each, in addition to 25 noise predictors; this example requires grouped variable selection, and the Elastic Net behaves like the ideal model in this scenario.

6 Conclusion

The Lasso can select at most n predictors in the p > n case and cannot perform grouped selection. Furthermore, Ridge Regression usually has better prediction performance than the Lasso when there are high correlations between predictors in the n > p case. The Elastic Net can produce a sparse model with good prediction accuracy while selecting groups of strongly correlated predictors, and it can potentially select all p predictors in all situations. A new algorithm called LARS-EN computes Elastic Net regularization paths efficiently, in the same spirit as the LARS algorithm for the Lasso. The Elastic Net has two tuning parameters rather than one, which can be selected using a training and validation set. Simulation results indicate that the Elastic Net dominates the Lasso, especially under collinearity. Finally, the Elastic Net can be extended to Classification and Sparse PCA problems.

References

[1] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2), 301–320.

[2] Zou, H., Hastie, T. and Tibshirani, R. Sparse principal component analysis. https://web.stanford.edu/~hastie/Papers/spc_jcgs.pdf