LASSO estimators in linear regression for microarray data

Angela Recchia^1, Ernst Wit^2 and Alessio Pollice^1

^1 Dipartimento di Scienze Statistiche - Università degli Studi di Bari, ang.r@dss.uniba.it, apollice@dss.uniba.it
^2 Department of Mathematics and Statistics - University of Lancaster, ernst@stats.gla.ac.uk

Summary. The Least Absolute Shrinkage and Selection Operator, or LASSO [Tib96], is a technique for model selection and estimation in linear regression models. The LASSO minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Tibshirani solves a quadratic optimization problem with 2^p linear inequality constraints to obtain the LASSO solution for a fixed value of the tuning parameter t. The LASSO can be particularly useful in the analysis of microarray data, in which the number of genes (predictors) is much larger than the number of sample observations. In this paper we apply the LASSO methodology to a microarray study in breast cancer.

Key words: microarray, LASSO, constrained regression

1 Introduction

The LASSO, an acronym for Least Absolute Shrinkage and Selection Operator, is a method developed by Robert Tibshirani in 1996 for model selection and estimation in linear regression models. It achieves better prediction accuracy and gives a sparse solution by shrinking many of the coefficients exactly to zero. Tibshirani demonstrates that the LASSO is more stable and accurate than traditional methods such as partial least squares, ridge regression and subset variable selection. The LASSO combines the prediction stability and the parsimony of the two main penalization procedures, ridge regression and subset regression, which in turn were developed to overcome deficiencies of ordinary least squares (OLS) estimates. Two main shortcomings are ascribed to OLS: first, prediction accuracy suffers in the case of sparse data and can be improved by shrinking or zeroing selected coefficients; second, interpretation is complicated when a large number of covariates are retained [HTF01]. Ridge regression shrinks the coefficients continuously towards zero but keeps all the predictors in the model, so it does not produce a sparse or parsimonious model. Subset variable selection produces sparse models but is extremely variable because of its inherent discreteness. The LASSO shrinks the coefficients of many predictors exactly to zero and thus performs variable selection automatically. It minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. The geometric shape of the constraint forces some coefficients to be exactly zero, so the LASSO embodies the advantageous features of both ridge regression and subset selection.

2 Definition

In the standard multiple linear regression model, the effects of the predictor variables on the response are summarized by

\[
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad (1)
\]

where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent identically distributed random variables with zero mean and constant variance $\sigma^2$, $y_i$ ($i = 1, \ldots, n$) are the $n$ response observations, $x_{ij}$ ($j = 1, \ldots, p$) are the $p$ explanatory variables, and $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients.
By omitting the intercept $\beta_0$, the LASSO estimates are given by the solution of the following optimization problem:

\[
\hat{\beta}^{LASSO} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t, \qquad (2)
\]

where $t \ge 0$ is a tuning parameter controlling the amount of shrinkage towards zero applied to the estimates. The constrained minimization problem (2) may be transformed into an equivalent unconstrained minimization problem

\[
\hat{\beta}^{LASSO} = \arg\min_{\beta} \Bigg[ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigg], \qquad (3)
\]

where $\lambda$ is a positive Lagrangian parameter that affects the absolute size of the regression coefficients and is chosen in such a way that $\sum_{j=1}^{p} |\beta_j| \le t$ [HTF01, OMT01]. Problems (2) and (3) are equivalent, in the sense that for any given LASSO regularization parameter $\lambda$ there exists a tuning parameter $t$ such that the two problems have the same solution. For a sufficiently large value of $t$ the constraint has no effect and the least squares solution is obtained. For smaller values of $t$ the solutions tend to become sparser, that is, they are shrunken versions of the least squares estimates and some of the coefficients are exactly zero. The constraint $t$ can be chosen to minimize the estimated expected prediction error [HTF01].

Table 1. Simple data example of a linear model without intercept, one response y and two explanatory variables, x1 and x2 (n = 3).

      y     x1   x2
    0.10     1    3
    2.18     2   -1
    4.59     3   -1

Fig. 1. The LASSO constraint $|\beta_1| + |\beta_2| \le t$ ($t = 1$), the ridge constraint $\beta_1^2 + \beta_2^2 \le t^2$ ($t = 1$) and the elliptical contours of the residual sum of squares.

3 Geometry of LASSO

In this section we consider the $l_1$ LASSO penalty $\sum_{j=1}^{p} |\beta_j| \le t$ as an alternative to the $l_2$ ridge penalty $\sum_{j=1}^{p} \beta_j^2 \le t$ [HTF01]. Table 1 shows a simple data example related to a linear model without intercept and with two explanatory variables, where $\beta_1 = 1$ and $\beta_2 = 0$. Table 2 shows the OLS, ridge ($\lambda = 3.5$) and LASSO ($\lambda = 3.5$) regression estimates. Ridge regression shrinks the coefficient of x2 towards zero, whereas the LASSO shrinks the coefficient of x2 exactly to zero. Figure 1 shows the elliptical contours of the least squares criterion, the circular constraint region of ridge regression, $\beta_1^2 + \beta_2^2 \le t^2$, and the LASSO constraint region, a rotated square centered at the origin, $|\beta_1| + |\beta_2| \le t$. The solution of the optimization problem is represented by the point where a contour first touches the constraint region. When $t = 1$ the LASSO constraint leads to a solution at a corner, where the coefficient $\beta_2$ is equal to zero, while this is not the case for the ridge penalty. In general it is quite likely that the contour and the LASSO constraint first intersect in one of the corners, i.e. where one of the parameters is zero; for ridge regression this is not particularly likely [OMT01, HTF01, Tib96, WM04].
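To make this comparison concrete, the following is a minimal R sketch fitting the toy data of Table 1 by OLS, by ridge regression in closed form, and by the LASSO via the lars package of Efron et al. [EHJT04]. The package choice and the penalty values used here are illustrative and are not claimed to reproduce the exact computations behind Table 2.

```r
## Toy data of Table 1: linear model without intercept, n = 3, p = 2.
x <- cbind(x1 = c(1, 2, 3), x2 = c(3, -1, -1))
y <- c(0.10, 2.18, 4.59)

## Ordinary least squares (no intercept): both coefficients stay nonzero.
beta.ols <- coef(lm(y ~ x - 1))

## Ridge regression in closed form: (X'X + lambda I)^{-1} X'y shrinks the
## coefficients towards zero but does not set any of them exactly to zero.
lambda <- 3.5
beta.ridge <- drop(solve(crossprod(x) + lambda * diag(2), crossprod(x, y)))

## LASSO via the LARS algorithm, with the l1 bound t = 1 of Fig. 1
## (mode = "norm" interprets s as the l1 norm of the coefficient vector);
## as discussed above, the solution sits at a corner and the coefficient
## of x2 is driven exactly to zero.
library(lars)
fit <- lars(x, y, type = "lasso", intercept = FALSE, normalize = FALSE)
beta.lasso <- coef(fit, s = 1, mode = "norm")
```

Printing beta.ols, beta.ridge and beta.lasso gives point estimates of the kind reported in Table 2; the standard errors and p-values shown there additionally require the approximation described in Section 5.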
4 Algorithms for finding the LASSO solution

Besides the original solution by Tibshirani, which uses quadratic programming (QP) for least squares regression and iteratively reweighted least squares with QP for generalized linear models, a number of algorithms have been proposed to obtain the LASSO estimates.

Table 2. OLS, ridge and LASSO regression estimates.

             Estimate  Std.Error  t-value  Pr(>|t|)
  OLS    x1     1.375      0.341    4.031     0.155
         x2    -0.254      0.385   -0.659     0.629
  ridge  x1     1.066      0.292    3.655     0.105
         x2    -0.107      0.312   -0.344     0.774
  LASSO  x1     1.052      0.251    4.187     0.045
         x2     0.000      0.000   -0.409     0.719

Tibshirani's solution of the LASSO optimization problem is unique only under the assumption of an orthonormal X, and works only if $X^T X$ has full rank [Tib96]. Gay suggests a transformation of Tibshirani's algorithm [Tib96]. Osborne et al. [OPBT00] propose a faster QP algorithm for linear regression, which was implemented by Lokhorst as the lasso2 package in the R system. The advantage of Osborne's algorithm over Tibshirani's is related to the inefficiency of the latter when p is larger than n (p > n) and for small to medium values of t when p is large [OPBT00]. Osborne demonstrates that the solution $\hat{\beta}$ of (2) is a continuous piecewise linear function of the constraint parameter t [EHJT04]. Osborne's algorithm starts with the zero vector and adds variables iteratively, that is, it builds up the optimal solution from a small base, whereas Tibshirani's algorithm starts with the full ordinary least squares (OLS) estimate and removes variables, that is, it starts at a solution of the unconstrained problem [OPBT00].

Efron et al. [EHJT04] develop an algorithm called LARS (Least Angle Regression) for linear regression. Efron proposes LARS as an alternative variable selection method and provides a geometrical interpretation of the piecewise linear homotopy. Efron shows that the number of linear pieces of the LASSO path is approximately equal to the number of variables in the design matrix X. The method used by Efron to solve the optimization problem (2) is close to the homotopy method of Osborne [EHJT04]. Osborne and Efron show that continuity and piecewise linearity extend to a general X [EHJT04, OPBT00].

5 Standard errors of LASSO estimates

Tibshirani approximates the $l_1$ penalty of the LASSO estimate as

\[
\sum_{j=1}^{p} |\beta_j| = \sum_{j=1}^{p} \frac{\beta_j^2}{|\beta_j|} \le t,
\]

where $\beta_j$ is the jth parameter estimate [Tib96]. This LASSO constraint can be transformed into a ridge regression constraint by adding the Lagrangian penalty $\lambda \sum_{j=1}^{p} \beta_j^2 / |\beta_j|$ to the residual sum of squares. Therefore a LASSO solution can be defined by the ridge-type estimator

\[
\tilde{\beta} = (X^T X + \lambda W)^{-1} X^T y,
\]

where $W$ is a diagonal matrix with diagonal elements $1/|\tilde{\beta}_j|$ if $|\tilde{\beta}_j| > 0$. The covariance matrix of the LASSO estimate is estimated as

\[
\widehat{\mathrm{var}}(\tilde{\beta}) = \mathrm{var}\big( (X^T X + \lambda W)^{-1} X^T y \big)
= \hat{\sigma}^2 (X^T X + \lambda W)^{-1} X^T X (X^T X + \lambda W)^{-1},
\]

where $\hat{\sigma}^2$ is the standard estimate of the error variance. In particular, the effective number of parameters $p(t)$ is approximated by the trace of the matrix $H = X (X^T X + \lambda W)^{-1} X^T$.
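The following sketch implements this ridge-type approximation in plain R, as an illustration of the formulas above. The function name lasso.se and the use of n − p(t) as the residual degrees of freedom for $\hat{\sigma}^2$ are our own choices rather than prescriptions of [Tib96]; the design matrix X, the response y, the penalty λ and a LASSO fit beta.hat are assumed to be available.

```r
## Ridge-type approximation to the covariance of the LASSO estimate (Section 5).
## Coefficients that are exactly zero are excluded, since W has diagonal
## elements 1/|beta_j| only where |beta_j| > 0.
lasso.se <- function(X, y, beta.hat, lambda) {
  nz  <- which(abs(beta.hat) > 0)             # indices of nonzero coefficients
  Xnz <- X[, nz, drop = FALSE]
  W   <- diag(1 / abs(beta.hat[nz]), nrow = length(nz))
  A   <- solve(crossprod(Xnz) + lambda * W)   # (X'X + lambda W)^{-1}
  beta.tilde <- drop(A %*% crossprod(Xnz, y)) # ridge-type re-estimate of beta
  H   <- Xnz %*% A %*% t(Xnz)                 # hat matrix
  p.t <- sum(diag(H))                         # effective number of parameters p(t)
  sigma2 <- sum((y - Xnz %*% beta.tilde)^2) / (length(y) - p.t)  # error variance
  cov.beta <- sigma2 * A %*% crossprod(Xnz) %*% A
  list(beta = beta.tilde, se = sqrt(diag(cov.beta)), df = p.t)
}
```

The returned df component is the trace approximation of p(t), which reappears below when the tuning parameter is chosen by generalized cross-validation.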
6 Estimate of t

There are a number of criteria to estimate the value of the hyper-parameter t in the LASSO model. Tibshirani [HT90] proposed the two most popular, k-fold cross-validation (CV) and generalized cross-validation (GCV). In k-fold cross-validation the full data set L is randomly divided into k subsets of equal size. Denoting by $L_s$ the test sets and by $L - L_s$ the cross-validation training sets ($s = 1, \ldots, k$), for each $t > 0$ and $s$ the training set is used to estimate the parameters and the test set to validate the estimates [OMT01, WM04, Tib96]. The cross-validation criterion is

\[
CV(t) = \sum_{s=1}^{k} \sum_{(y_i, x_i) \in L_s} (y_i - \hat{y}_i)^2
      = \sum_{s=1}^{k} \sum_{(y_i, x_i) \in L_s} \big( y_i - x_i^T \hat{\beta}^{s} \big)^2, \qquad (4)
\]

where $\hat{\beta}^{s}$ is the estimator of $\beta$ obtained from the corresponding training set. The estimate of the tuning parameter t is obtained by the minimization of (4) [FL01, OMT01], that is

\[
\hat{t} = \arg\min_{t} \, CV(t).
\]

The generalized cross-validation error is defined as

\[
GCV(t) = \frac{1}{n} \sum_{s=1}^{k} \sum_{(y_i, x_i) \in L_s} \left( \frac{y_i - \hat{y}_i}{1 - p(t)/n} \right)^2 \qquad (5)
\]

[FL01, Fu98, Tib96, WM04], where $p(t)$ is the approximate number of effective parameters of the constrained fit (Section 5). The estimate of the tuning parameter t is then obtained by the minimization of (5), that is

\[
\hat{t} = \arg\min_{t} \, GCV(t).
\]

Tibshirani proposes to minimize the generalized cross-validation for a linear approximation of the LASSO estimate with k = n (leave-one-out CV) [Tib96].

7 Breast cancer example

We apply the LASSO penalty to a linear regression model for breast cancer data; breast cancer is the most common neoplasm in women. The data of this example come from a study performed on 62 biopsies of breast cancer patients over 59 genes, aimed at determining which genes influence the severity of breast cancer [WGB02]. For each patient a variety of clinical information is available, such as age at diagnosis, follow-up time, whether or not the patient died of breast cancer, whether or not the patient died, the size of the tumor, and the Nottingham Prognostic Index (NPI), a numerical index based on the translation of information provided in pathology reports for breast cancer, with assigned grades from 1 to 9. Using the LASSO penalization we try to find a genetic component of the Nottingham Prognostic Index, that is, the genes informative for the NPI. A multiple linear regression model is applied to these data: the response variable is the severity score NPI and the predictors are the gene expression profiles of the breast cancer tumor of each patient. Results for the full least squares approach and for the two penalized procedures, ridge regression and LASSO, were obtained.

All the ordinary least squares estimates of the model parameters are nonzero. The genes statistically significant for the NPI are 8 and 33 (p-value < 0.005); 10, 21, 29, 50, 52 and 59 (p-value < 0.007); 6 and 19 (p-value < 0.008); 1 (p-value < 0.009). The other genes are not statistically significant. Ridge regression (λ = 2.8) shrinks the coefficients towards zero, but not exactly to zero as the LASSO does. A few of the β coefficients obtained by ridge regression, namely those corresponding to independent variables 1, 22, 50 and 51, are nearly zero. The only two statistically significant genes are 23 and 21. For a fixed value of the penalty term (λ = 2), chosen via cross-validation and generalized cross-validation (Figs. 2 and 3), the nonzero LASSO coefficients correspond to genes 2, 5, 9, 12, 19, 20, 21, 23, 25, 29, 32, 36, 37, 40, 45, 46, 49, and 57. The set of nonzero coefficients depends strongly on the λ parameter. In particular, genes 23 (p-value = 0.006), 21 (p-value = 0.008), and 40 (p-value = 0.028) are statistically significant for the NPI.

Fig. 2. Linear regression. The top panel is the cross-validation curve for the breast cancer data; the bottom panel shows the LASSO estimates obtained by CV.

Fig. 3. Linear regression. The top panel is the generalized cross-validation curve for the breast cancer data; the bottom panel shows the LASSO estimates obtained by GCV.
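The cross-validatory choice of the amount of shrinkage described in Section 6 can be sketched with cv.lars from the lars package. The objects X (the assumed 62 × 59 matrix of gene expression profiles) and npi (the Nottingham Prognostic Index) are not distributed with this paper, and the component names of the cv.lars output refer to recent versions of the package; this is a sketch, not the exact analysis behind Figs. 2 and 3.

```r
## K-fold cross-validation over the LASSO path (Section 6). The shrinkage is
## indexed here by the fraction |beta|/max|beta| rather than by t directly.
library(lars)

set.seed(1)                                    # CV folds are drawn at random
cv <- cv.lars(X, npi, K = 10, type = "lasso", mode = "fraction")
s.hat <- cv$index[which.min(cv$cv)]            # fraction minimizing the CV error

fit <- lars(X, npi, type = "lasso")
beta.cv <- coef(fit, s = s.hat, mode = "fraction")
which(beta.cv != 0)                            # genes retained by the LASSO
```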
Figure 4 displays the profiles of the LASSO coefficients for the breast cancer study as the tuning parameter t varies, ranging from t = 0 to the completely unconstrained solution, for which the values of β correspond to the OLS estimates. The coefficient values were normalized and plotted versus the standardized tuning parameter |beta|/max|beta|, the ratio of the $l_1$ norm $\|\beta\|_1$ of the constrained estimates to that of the unconstrained estimates. In Figure 4 the estimates of the LASSO coefficients obtained by LARS [EHJT04] are represented by curves corresponding to individual genes. The lars function covers all the variants of the LASSO and computes the complete solution path for all possible values of the shrinkage parameter. In terms of coefficient estimates, standard deviations and p-values, the LASSO achieves better prediction accuracy by shrinkage and gives a sparser solution than the competing OLS and ridge estimators.

Fig. 4. LASSO estimates (given by lars) for the 59 gene amplification profiles and NPI in the linear regression model. (The y-axis shows the standardized coefficients and the x-axis the fraction |beta|/max|beta|.)
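A coefficient path of the kind displayed in Figure 4 can be reproduced with the lars package; as in the previous sketch, X and npi denote the assumed design matrix and response of the breast cancer example.

```r
## Complete LASSO solution path for the breast cancer data, computed by the
## LARS algorithm [EHJT04]: one curve per gene, plotted against |beta|/max|beta|.
library(lars)

path <- lars(X, npi, type = "lasso")
plot(path, xvar = "norm", breaks = FALSE)
```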
8 Conclusions

In this paper the LASSO method of shrinkage and variable selection for regression is surveyed. The LASSO yields estimators with small variance and achieves good estimation and prediction by shrinking the smaller coefficients to zero, so that the dual goals of accurate estimation and consistent variable selection can be pursued simultaneously. The LASSO approach is an alternative to standard regression and variable subset selection techniques. If a large number of explanatory variables have a small effect, then the ridge regression estimator is the best choice, while if a small number of variables have a large effect, then subset selection does best; if a moderate number of variables have a moderate effect, then the LASSO is to be preferred [Tib96].

Shrinking the size of the regression coefficients and selecting variables are important goals in the analysis of microarray data, where the number of samples collected is much smaller than the number of genes per chip, that is, the number of predictor variables is much larger than the number of independent samples. Estimation in such high-dimensional, low-sample-size settings can be carried out with the LASSO. We propose to use $l_1$-penalized estimation for linear regression to select the genes that are relevant to the patients' cancer; our goal in the microarray experiment is to determine which genes influence the severity of breast cancer. The LASSO regression gives much better predictive performance than OLS and $l_2$-penalized regression. In summary, the set of nonzero LASSO coefficients represents the genes that simultaneously best explain the NPI. Genes 23, 21 and 40 are the most statistically significant among the genes informative for the NPI, over the complete set of 59 genes.

References

[Mil90] Miller, A.: Subset Selection in Regression. Chapman and Hall, London (1990)
[HTF01] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer-Verlag, New York (2001)
[WM04] Wit, E., McClure, J.D.: Statistics for Microarrays. John Wiley and Sons, New York (2004)
[HT90] Hastie, T., Tibshirani, R.: Generalized Additive Models. Chapman and Hall, London (1990)
[OPBT00] Osborne, M.R., Presnell, B., Turlach, B.A.: On the LASSO and its dual. J. Comp. and Graph. Stat., 9, 2, 319–338 (2000)
[Tib96] Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58, 1, 267–288 (1996)
[EHJT04] Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least Angle Regression (with discussion). Annals of Statistics, 32, 407–499 (2004)
[OMT01] Ojelund, H., Madsen, H., Thyregod, P.: Calibration with absolute shrinkage. J. Chemometrics, 15, 497–509 (2001)
[FL01] Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 456, 1348–1360 (2001)
[Fu98] Fu, W.J.: Penalized Regressions: the Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 3, 397–416 (1998)
[WGB02] Witton, C.J., Going, J.J., Burkhardt, H., Vass, K., Wit, E., Cooke, T.G., Ruffalo, T., Seeling, S., King, W., Bartlett, J.M.S.: The sub-classification of breast cancer using molecular cytogenetic gene chips. Proceedings of the American Association for Cancer Research, 43, 289, 122–128 (2002)