LASSO estimators in linear regression for microarray data

Angela Recchia¹, Ernst Wit² and Alessio Pollice¹

¹ Dipartimento di Scienze Statistiche - Università degli Studi di Bari, ang.r@dss.uniba.it, apollice@dss.uniba.it
² Department of Mathematics and Statistics - University of Lancaster, ernst@stats.gla.ac.uk
Summary. The Least Absolute Shrinkage and Selection Operator, or LASSO [Tib96], is a technique for model selection and estimation in linear regression models. The LASSO minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Tibshirani solves a quadratic optimization problem with $2^p$ linear inequality constraints to obtain the LASSO solution for a fixed value of the tuning parameter $t$. The LASSO can be particularly useful in the analysis of microarray data, in which the number of genes (predictors) is much larger than the number of sample observations. In this paper we apply the LASSO methodology to a microarray study in breast cancer.
Key words: microarray, LASSO, constrained regression
1 Introduction
The LASSO, an acronym for Least Absolute Shrinkage and Selection Operator, is a method developed by Robert Tibshirani in 1996 for model selection and estimation in linear regression models. It achieves better prediction accuracy and gives a sparse solution by shrinking many coefficients exactly to zero. Tibshirani demonstrates that the LASSO is more stable and accurate than traditional variable selection methods such as partial least squares, ridge regression and subset variable selection. The LASSO combines the prediction stability and parsimony of the two main penalization procedures, namely ridge regression and subset regression, which in turn were developed to overcome deficiencies of ordinary least squares (OLS) regression estimates.
There are two main shortcomings ascribed to OLS. First, prediction accuracy is affected in the case of sparse data and can be improved by shrinking or zeroing selected coefficients. Second, interpretation is complicated when a large number of covariates are retained [HTF01]. Ridge regression shrinks the coefficients continuously towards zero but keeps all the predictors in the model, so it does not produce a sparse or parsimonious model. Subset variable selection produces sparse models but is extremely variable because of its inherent discreteness. The LASSO shrinks the coefficients
of many predictors to zero and thus performs variable selection automatically. It
minimizes the residual sum of squares subject to the sum of the absolute values of
the coefficients being less than a constant. The geometric shape of the constraint
forces some coefficients to be exactly zero, so it embodies the advantageous features
of both ridge regression and subset selection.
2 Definition
In the standard linear multivariate regression model, the effects of the predictor
variables on the response are summarized by
$$ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad (1) $$
where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are independent identically distributed random variables with zero mean and constant variance $\sigma^2$, $y_i$ $(i = 1, \ldots, n)$ are the $n$ response observations, $x_{ij}$ $(j = 1, \ldots, p)$ are the $p$ explanatory variables, and $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients.
By omitting the intercept $\beta_0$, the LASSO estimates are given by the solution of the following optimization problem:
$$ \hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t, \qquad (2) $$
where t ≥ 0 is a tuning parameter controlling the amount of shrinkage towards
zero applied to the estimates. The constrained minimization problem (2) may be
transformed to an equivalent unconstrained minimization problem
$$ \hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \left[ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right], \qquad (3) $$
where $\lambda$ is a positive Lagrangian parameter that affects the absolute size of the regression coefficients and is chosen in such a way that $\sum_{j=1}^{p} |\beta_j| \le t$ [HTF01, OMT01]. Problems (2) and (3) are equivalent, in the sense that for any given LASSO regularization parameter $\lambda$ there exists a tuning parameter $t$ such that the two problems have the same solution.
For a sufficiently large value of $t$ the constraint has no effect and the least squares solution is obtained. For smaller values of $t$ the solutions tend to become sparser, that is, they are shrunken versions of the least squares estimates and some of the coefficients are exactly zero. The constraint $t$ can be chosen to minimize the estimated expected prediction error [HTF01].
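As a concrete illustration of the penalized form (3), the following sketch (not from the paper) fits the LASSO on synthetic data with scikit-learn and shows how increasing the penalty shrinks the $\ell_1$ norm of the coefficient vector and sets more coefficients exactly to zero. Note that scikit-learn's Lasso penalizes the mean squared error, so its alpha corresponds to $\lambda/(2n)$ in (3); the data and parameter values are illustrative assumptions.

```python
# Illustrative sketch: the penalized LASSO problem (3) on synthetic data.
# sklearn's Lasso minimizes (1/(2n))*RSS + alpha*sum|beta_j|, so alpha plays
# the role of lambda/(2n) in equation (3).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)

for alpha in [0.01, 0.1, 0.5, 1.0]:          # increasing penalty strength
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    n_nonzero = int(np.sum(fit.coef_ != 0.0))
    print(f"alpha={alpha:4.2f}  nonzero coefficients={n_nonzero:2d}  "
          f"sum|beta|={np.abs(fit.coef_).sum():.3f}")
```

As the penalty grows, the printed $\ell_1$ norm decreases and the number of nonzero coefficients drops, mirroring the effect of decreasing $t$ in the constrained form (2).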
Table 1. Simple data example of a linear model without intercept, one response y and two explanatory variables, x1 and x2 (n = 3).

   y     x1   x2
 0.10     1    3
 2.18     2   −1
 4.59     3   −1
Fig. 1. The LASSO constraint $|\beta_1| + |\beta_2| \le t$ ($t = 1$), the ridge constraint $\beta_1^2 + \beta_2^2 \le t^2$ ($t = 1$) and the elliptical contours of the residual sum of squares.
3 Geometry of LASSO
In this section we consider the $\ell_1$ LASSO penalty $\sum_{j=1}^{p} |\beta_j| \le t$ as an alternative to the $\ell_2$ ridge penalty $\sum_{j=1}^{p} \beta_j^2 \le t$ [HTF01]. Table 1 shows a simple data example related to a linear model without intercept and with two explanatory variables, where $\beta_1 = 1$ and $\beta_2 = 0$. Table 2 shows the OLS, ridge ($\lambda = 3.5$) and LASSO ($\lambda = 3.5$) regression estimates. Ridge regression shrinks the coefficient of $x_2$ towards zero, whereas the LASSO shrinks the coefficient of $x_2$ exactly to zero. Figure 1 shows the elliptical contours of the least squares criterion, the circular constraint region of ridge regression, $\beta_1^2 + \beta_2^2 \le t^2$, and the LASSO constraint region, a rotated square centered at the origin, $|\beta_1| + |\beta_2| \le t$. The solution of the optimization problem is represented by the point where the contours first touch the constraint region. When $t = 1$ the LASSO constraint leads to a solution at a corner, where the coefficient $\beta_2$ is equal to zero, while this is not the case for the ridge penalty. In general, it is quite likely that the contours and the LASSO constraint intersect at one of the corners, i.e. where one of the parameters is zero; for ridge regression this is not particularly likely [OMT01, HTF01, Tib96, WM04].
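A minimal sketch of the Table 1 example follows, assuming scikit-learn's penalty parameterizations; these differ from the paper's $\lambda = 3.5$, so the numbers will not reproduce Table 2 exactly, but the qualitative contrast is the same: ridge shrinks the coefficient of $x_2$ towards zero, while the LASSO sets it exactly to zero.

```python
# Sketch of the Table 1 toy example (no intercept, n = 3).  The alpha values
# below are illustrative and chosen so that the qualitative behaviour is
# visible; they do not correspond to the paper's lambda = 3.5.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X = np.array([[1.0,  3.0],
              [2.0, -1.0],
              [3.0, -1.0]])
y = np.array([0.10, 2.18, 4.59])

ols   = LinearRegression(fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=3.5, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=2.0, fit_intercept=False).fit(X, y)

print("OLS  :", ols.coef_)    # both coefficients nonzero
print("ridge:", ridge.coef_)  # x2 coefficient shrunk towards zero, still nonzero
print("lasso:", lasso.coef_)  # x2 coefficient exactly zero at this penalty
```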
4 Algorithms for finding the LASSO solution
Besides the original solution by Tibshirani, using quadratic programming (QP) for least squares regression and iteratively reweighted least squares with QP for generalized linear models, a number of algorithms have been proposed to obtain the LASSO estimates.
Table 2. OLS, ridge and LASSO regression estimates.

              Estimate  Std. Error  t-value  Pr(>|t|)
OLS     x1      1.375      0.341     4.031     0.155
        x2     −0.254      0.385    −0.659     0.629
ridge   x1      1.066      0.292     3.655     0.105
        x2     −0.107      0.312    −0.344     0.774
LASSO   x1      1.052      0.251     4.187     0.045
        x2      0.000      0.000    −0.409     0.719
Tibshirani's solution of the LASSO optimization problem is unique only under the assumption of an orthonormal $X$, and works only if $X^T X$ has full rank [Tib96]. Gay suggests a transformation of Tibshirani's algorithm [Tib96].
Osborne et al. [OPBT00] propose a faster QP algorithm for linear regression, which was implemented by Lokhorst as the lasso2 package in the R system. The advantage of Osborne's algorithm over Tibshirani's is related to the inefficiency of the latter when p is larger than n (p > n) and for small to medium values of t when p is large [OPBT00]. Osborne demonstrates that the solution $\hat{\beta}$ of (2) is a continuous piecewise linear function of the constraint parameter t [EHJT04]. Osborne's algorithm starts with the zero vector and adds variables iteratively, that is, it builds up the optimal solution from a small base, whereas Tibshirani's algorithm starts with the full ordinary least squares (OLS) estimate and removes variables, that is, it starts at a solution of the unconstrained problem [OPBT00].
Efron et al. [EHJT04] develop an algorithm called LARS for linear regression. Efron proposes LARS as an alternative variable selection method and provides a geometrical interpretation of the piecewise linear homotopy. Efron shows that the number of linear pieces of the LASSO path is approximately equal to the number of variables in the design matrix X. The quadratic programming approach used by Efron to solve the optimization problem (2) is close to the homotopy method of Osborne [EHJT04]. Osborne and Efron show that continuity and piecewise linearity extend to a general X [EHJT04, OPBT00].
5 Standard errors of LASSO estimates
Tibshirani proposes the LASSO estimate with an approximate form of the $\ell_1$ penalty as
$$ \sum_{j=1}^{p} |\beta_j| = \sum_{j=1}^{p} \frac{\beta_j^2}{|\beta_j|} \le t, $$
where $\beta_j$ is the $j$th parameter estimate [Tib96]. This LASSO constraint can be transformed into a ridge regression constraint by adding a Lagrangian penalty $\lambda \sum_{j=1}^{p} \beta_j^2 / |\beta_j|$
to the residual sum of squares. Therefore a LASSO solution can be defined by the ridge regression estimator
$$ \tilde{\beta} = (X^T X + \lambda W)^{-1} X^T y, $$
where $W$ is a diagonal matrix with diagonal elements $1/|\tilde{\beta}_j|$ whenever $|\tilde{\beta}_j| > 0$. The covariance matrix of the LASSO estimate is estimated as
$$ \widehat{\mathrm{var}}(\tilde{\beta}) = \mathrm{var}\big((X^T X + \lambda W)^{-1} X^T y\big) = \hat{\sigma}^2 (X^T X + \lambda W)^{-1} X^T X (X^T X + \lambda W)^{-1}, $$
where $\hat{\sigma}^2$ is the standard estimate of the error variance. In particular, the effective number of parameters $p(t)$ is approximated by the trace of the matrix $H = X (X^T X + \lambda W)^{-1} X^T$.
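A small sketch of this ridge-type approximation follows, assuming a LASSO fit $\hat{\beta}$, a penalty $\lambda$ and an error variance estimate $\hat{\sigma}^2$ are already available; it restricts the computation to the active (nonzero) coefficients, as the weights $1/|\hat{\beta}_j|$ are undefined at zero.

```python
# Sketch of the ridge-type approximation to the LASSO covariance matrix and
# to the effective number of parameters (Section 5).  beta_hat, lambda_ and
# sigma2_hat are assumed to come from a LASSO fit obtained elsewhere.
import numpy as np

def lasso_approx_covariance(X, beta_hat, lambda_, sigma2_hat):
    """Sandwich-type covariance of the active coefficients and trace(H)."""
    nonzero = np.abs(beta_hat) > 0
    Xa = X[:, nonzero]                          # active predictors only
    W = np.diag(1.0 / np.abs(beta_hat[nonzero]))
    A = np.linalg.inv(Xa.T @ Xa + lambda_ * W)
    cov = sigma2_hat * A @ (Xa.T @ Xa) @ A      # approximate covariance matrix
    H = Xa @ A @ Xa.T                           # hat-type matrix
    p_eff = np.trace(H)                         # effective number of parameters
    return cov, p_eff
```

Standard errors for the active coefficients are then the square roots of the diagonal of the returned covariance matrix, and p_eff plays the role of $p(t)$ in the GCV criterion of the next section.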
6 Estimate of t
There are a number of criteria to estimate the value of the hyper-parameter t in the LASSO model. Tibshirani [HT90] proposed the two most popular, k-fold cross-validation (CV) and generalized cross-validation (GCV). In k-fold cross-validation the full data set $L$ is randomly divided into $k$ subsets of equal size. Denoting by $L_s$ the test sets and by $L - L_s$ the cross-validation training sets ($s = 1, \ldots, k$), for each $t > 0$ and each $s$ the training set is used to estimate the parameters and the test set to validate the estimates [OMT01, WM04, Tib96]. The cross-validation criterion is
$$ CV(t) = \sum_{s=1}^{k} \sum_{(y_i, x_i) \in L_s} (y_i - \hat{y}_i)^2 = \sum_{s=1}^{k} \sum_{(y_i, x_i) \in L_s} \big( y_i - x_i^T \hat{\beta}^s \big)^2, \qquad (4) $$
where $\hat{\beta}^s$ is the estimator of $\beta$ obtained from the training set. The estimate of the tuning parameter t is obtained by minimizing (4) [FL01, OMT01], that is
$$ \hat{t} = \arg\min_{t} \, CV(t). $$
The generalized cross-validation error is defined as
$$ GCV(t) = \frac{1}{n} \sum_{s=1}^{k} \sum_{(y_i, x_i) \in L_s} \left( \frac{y_i - \hat{y}_i}{1 - p(t)/n} \right)^2 \qquad (5) $$
[FL01, Fu81, Tib96, WM04], where $p(t)$ is the constrained approximation of the number of effective parameters. The estimate of the tuning parameter t is obtained by minimizing (5), that is
$$ \hat{t} = \arg\min_{t} \, GCV(t). $$
Tibshirani proposes to minimize the generalized cross-validation criterion for a linear approximation of the LASSO estimate with $k = n$ (leave-one-out CV) [Tib96].
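The sketch below implements the k-fold criterion (4) over a grid of penalty values, using scikit-learn's Lasso and KFold on synthetic data; the grid is expressed in terms of scikit-learn's alpha rather than the paper's t, so it illustrates the procedure rather than reproducing Figures 2 and 3. The GCV criterion (5) would be analogous, with the squared residuals inflated by $(1 - p(t)/n)^{-2}$ instead of being summed over folds.

```python
# Sketch of k-fold cross-validation over a grid of LASSO penalties,
# following criterion (4).  Data and grid are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_curve(X, y, alphas, k=5, seed=0):
    """CV(alpha): sum of squared test-set residuals over the k folds."""
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    cv = np.zeros(len(alphas))
    for train, test in folds.split(X):
        for i, a in enumerate(alphas):
            fit = Lasso(alpha=a, fit_intercept=False).fit(X[train], y[train])
            resid = y[test] - fit.predict(X[test])
            cv[i] += np.sum(resid ** 2)
    return cv

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=60)
alphas = np.logspace(-3, 0, 25)
cv = cv_curve(X, y, alphas)
print("penalty minimizing CV:", alphas[np.argmin(cv)])
```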
7 Breast cancer example
We apply the LASSO penalty to a linear regression model for breast cancer data; breast cancer is the most common neoplasm in women. The data for this example come from a study performed on 62 biopsies of breast cancer patients over 59 genes, aimed at determining which genes influence the severity of breast cancer [WGB02]. For each patient a variety of clinical information is available, such as age at diagnosis, follow-up time, whether or not the patient died of breast cancer, whether or not the patient died, the size of the tumor and the Nottingham Prognostic Index (NPI), a numerical index based on the translation of information provided in pathology reports for breast cancer, with assigned grades 1 to 9.
Using the penalized LASSO method, we try to find a genetic component of the Nottingham Prognostic Index, that is, the genes informative for the NPI. The linear multivariate regression model is applied to these data. The response variable is the severity score NPI; the predictors are the gene expression profiles of each patient's breast tumor. Results for the full least squares approach and for the two penalized procedures, namely ridge regression and the LASSO, were obtained. All the ordinary least squares estimates of the model parameters are nonzero. The statistically significant genes for the NPI are 8 and 33 (p-value < 0.005); 10, 21, 29, 50, 52 and 59 (p-value < 0.007); 6 and 19 (p-value < 0.008); 1 (p-value < 0.009). The other genes are not statistically significant. Ridge regression ($\lambda = 2.8$) shrinks the coefficients towards zero but, unlike the LASSO, not exactly to zero. Few of the $\beta$ coefficients obtained by ridge regression, namely those corresponding to independent variables 1, 22, 50 and 51, are nearly zero. The only two statistically significant genes are 23 and 21. For a fixed value of the penalty term ($\lambda = 2$) chosen via cross-validation and generalized cross-validation (Fig. 2 and 3), the nonzero LASSO coefficients correspond to genes 2, 5, 9, 12, 19, 20, 21, 23, 25, 29, 32, 36, 37, 40, 45, 46, 49, and 57. The set of nonzero coefficients depends strongly on the $\lambda$ parameter. In particular, genes 23 (p-value = 0.006), 21 (p-value = 0.008), and 40 (p-value = 0.028) are statistically significant for the NPI.
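A hedged sketch of this analysis pipeline is given below: since the data of [WGB02] are not reproduced here, a synthetic 62 × 59 expression matrix and a synthetic NPI response stand in for the real ones, and the penalty value is arbitrary; the point is only how the nonzero LASSO coefficients identify the "informative genes".

```python
# Sketch of the Section 7 pipeline with synthetic placeholders standing in
# for the 62 x 59 gene expression matrix and the NPI response of [WGB02].
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_patients, n_genes = 62, 59
expr = rng.normal(size=(n_patients, n_genes))            # expression matrix
npi = expr[:, [20, 22, 39]] @ np.array([1.5, -2.0, 1.0]) \
      + rng.normal(scale=0.5, size=n_patients)           # synthetic NPI score

fit = Lasso(alpha=0.1).fit(expr, npi)                     # arbitrary penalty
informative = np.flatnonzero(fit.coef_)                   # genes with nonzero beta
print("selected genes (0-based indices):", informative)
print("their coefficients:", np.round(fit.coef_[informative], 3))
```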
Fig. 2. Linear regression. The top figure is the cross validation curve for breast
cancer data. The bottom figure shows the LASSO estimates obtained by the CV.
Fig. 3. Linear regression. The top figure is the generalized cross validation curve
for breast cancer data. The bottom figure shows the LASSO estimates obtained by
the GCV.
Figure 4 displays the profiles of the LASSO coefficients for the breast cancer study as the tuning parameter t varies, ranging from t = 0 to the completely unconstrained solution, for which the values of $\beta$ correspond to the OLS estimates. The coefficient values were normalized and plotted versus the standardized tuning parameter $t/\|\hat{\beta}\|_1$. In Figure 4 the estimates of the LASSO coefficients obtained by LARS [EHJT04] are represented by curves corresponding to individual genes. The lars function implements all variants of the LASSO and computes the complete solution for all possible values of the shrinkage parameter. In terms of coefficient estimates, standard deviations and p-values, the LASSO achieves better prediction accuracy by shrinkage and gives a sparser solution than the competing OLS and ridge estimators.
Fig. 4. LASSO estimates (given by lars) for the 59 gene amplification profiles and
NPI in linear regression model.
8 Conclusions
In this paper the LASSO method of shrinkage and variable selection for regression
is surveyed. The LASSO yields small variance of the estimators and achieves good
estimation and prediction by shrinking smaller coefficients to zero. A dual goal of
accurate estimation and consistent variable selection can be achieved simultaneously.
The LASSO approach is an alternative to standard regression and variable subset
selection techniques. If a large number of explanatory variables have a small effect,
then the ridge regression estimator is the best choice, while if a small number of
variables have a large effect, then subset selection does the best. If a moderate
number of variables have a moderate effect, then LASSO is to be preferred [Tib96].
Shrinking the size of the regression coefficients and selecting variables are important goals in the analysis of microarray data, where the number of samples collected is much smaller than the number of genes per chip, that is, the number of predictor variables is much larger than the number of independent samples. Estimation in such high-dimensional, low-sample-size settings can be done by using the LASSO.
We propose to use $\ell_1$ penalized estimation for linear regression to select genes that are relevant to the patients' cancer. Our goal is to determine which genes influence the severity of breast cancer in the microarray experiment. The LASSO regression gives much better predictive performance than OLS and $\ell_2$ penalized regression. In summary, the set of nonzero LASSO coefficients represents the genes that simultaneously best explain the NPI. Genes 23, 21, and 40 are the most statistically significant of the informative genes for the NPI, in terms of p-value, over the complete set of 59 genes.
References
[Mil90] Miller, A.: Subset Selection in Regression. Chapman and Hall, London (1990)
[HTF01] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer-Verlag, New York (2001)
[WM04] Wit, E., McClure, J.D.: Statistics for Microarrays. John Wiley and Sons, New York (2004)
[HT90] Hastie, T., Tibshirani, R.: Generalized Additive Models. Chapman and Hall, London (1990)
[OPBT00] Osborne, M.R., Presnell, B., Turlach, B.A.: On the LASSO and its dual. J. Comp. and Graph. Stat., 9, 2, 319–338 (2000)
[Tib96] Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58, 1, 267–288 (1996)
[EHJT04] Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least Angle Regression (with discussion). Annals of Statistics, 32, 407–499 (2004)
[OMT01] Ojelund, H., Madsen, H., Thyregod, P.: Calibration with absolute shrinkage. J. Chemometrics, 15, 497–509 (2001)
[FL01] Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 456, 1348–1360 (2001)
[Fu81] Fu, W.J.: Penalized Regressions: the Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 3, 397–416 (1998)
[WGB02] Witton, C.J., Going, J.J., Burkhardt, H., Vass, K., Wit, E., Cooke, T.G., Ruffalo, T., Seeling, S., King, W., Bartlett, J.M.S.: The sub-classification of breast cancer using molecular cytogenetic gene chips. Proceedings of the American Association of Cancer Researchers, 43, 289, 122–128 (2002)