Chapter 2: The Lasso for Linear Models
Alkeos Tsokos
Reading Group on ‘Statistical Learning with Sparsity: The Lasso and Generalizations’
alkeos.tsokos.10@ucl.ac.uk
February 19, 2016
Overview
1. The Lasso Estimator
2. Why Does the Lasso Give Sparse Solutions?
3. Computing the Solution
4. Degrees of Freedom
5. Choosing The Tuning Parameter
6. Uniqueness of Lasso Estimates
7. Nonnegative Garrote
8. Bayes Estimates
9. Perspective
Notation and Recurring Assumptions
- ||·||_q denotes the l_q norm, i.e. ||x||_q = (Σ_{j=1}^p |x_j|^q)^{1/q}. For q = 0, we define 0^0 = 0, so that ||·||_0 counts the non-zero elements of a vector, even though this does not satisfy the definition of a norm.
- ⟨v, x⟩ denotes the inner product between v and x, i.e. ⟨v, x⟩ = Σ_{j=1}^p v_j x_j.
- All predictor variables are assumed to be standardised (i.e. zero mean and unit variance) and response variables centred (i.e. zero mean).
- ∂_+ f(·) and ∂_− f(·) denote the derivative of f(·) from the right and left respectively.
- ∇_+ f(·) and ∇_− f(·) denote the gradient of f(·) from the right and left respectively.
Introduction
Consider a standard linear regression model

y = Xβ + ε ,    (1)

where y is a vector of responses of length n, X is an n × p design matrix containing replicates of covariates, ε ∼ N(0_n, σ^2 I_n) is a vector of errors, and β is a vector of coefficients (or parameters).
Goal: Estimate parameter vector β.
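As a running example, data from model (1) can be simulated as below (a minimal sketch: the dimensions, noise level, and sparse coefficient vector are illustrative assumptions, not taken from the slides). Later snippets reuse these arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20

# Standardised predictors (zero mean, unit variance), as assumed throughout.
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# A sparse 'true' coefficient vector: only the first three predictors matter.
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]

# Centred response generated from model (1).
y = X @ beta_true + rng.normal(scale=1.0, size=n)
y = y - y.mean()
```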
The standard approach is ordinary least squares, i.e. compute

β̂ = argmin_β (1/2) ||y − Xβ||_2^2 ,    (2)

which has solution

β̂ = (X^T X)^{-1} X^T y .    (3)
Problem 1: If p ≈ n, β̂ has large variance; the model has too much freedom and may model noise rather than signal, i.e. it overfits.

Problem 2: It may be the case that ||β||_0 ≪ p, but ||β̂||_0 = p always; the fitting mechanism doesn't exclude predictors that don't influence the response.
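In the running sketch (np.linalg.lstsq is used rather than forming (X^T X)^{-1} explicitly, which is numerically safer):

```python
# Ordinary least squares solution of (2)-(3).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Illustrates Problem 2: every coefficient is (almost surely) non-zero,
# even though only three predictors influence the response.
print(np.count_nonzero(beta_ols))   # typically prints p = 20
```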
Solutions to Problems
Problem 1: Ridge regression solves

β̂^ridge = argmin_β (1/2) ||y − Xβ||_2^2 + λ||β||_2^2 ,    (4)
where λ is a tuning parameter. This is a shrinkage estimator; forcing ||β̂||_2^2 to be small reduces the freedom of the model, hence β̂^ridge has lower variance (at the cost of being biased).
Problem 2: Variable selection techniques such as best subset
selection, stepwise regression, etc.
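In code, the ridge solution of (4) is a one-liner: setting the gradient of (4) to zero gives (X^T X + 2λI) β̂ = X^T y, the factor 2 arising from the 1/2 on the squared-error term (the λ value below is an arbitrary illustration).

```python
lam = 1.0   # illustrative tuning parameter
beta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(p), X.T @ y)
```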
Two Birds One Stone: The Lasso Estimator
Is there a way to solve both problems at once? Yes, with the
‘Least Absolute Shrinkage and Selection Operator’ (Lasso).
β̂^lasso = argmin_β (1/2) ||y − Xβ||_2^2 + λ||β||_1 .    (5)
- Shrinkage: Like ridge regression, forcing ||β̂||_1 to be small reduces the variance of β̂^lasso.
- Selection: For large enough λ, ||β̂^lasso||_0 < p, i.e. some coefficients are estimated to be exactly zero, and the corresponding predictors are excluded from the model.
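For illustration, a sketch using scikit-learn's Lasso (note that scikit-learn minimises (1/(2n))||y − Xβ||_2^2 + α||β||_1, so its α corresponds to λ/n in the notation of (5)):

```python
from sklearn.linear_model import Lasso

# fit_intercept=False because the predictors are standardised and y is centred.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.count_nonzero(fit.coef_))   # typically far fewer than p non-zero coefficients
```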
Why Do We Get Sparsity? One Predictor
In the case of one predictor, the model is

y_i = β x_i + ε_i ,   i = 1, . . . , n .    (6)

The Lasso objective is

L^lasso(β) = (1/2) Σ_{i=1}^n (y_i − β x_i)^2 + λ|β| .    (7)
L^lasso(β) is convex, hence β̂ is a minimiser of L^lasso(β) if

∂_− L^lasso(β)|_{β=β̂} ≤ 0 ≤ ∂_+ L^lasso(β)|_{β=β̂} .    (8)
Then when will 0 be a minimiser of L^lasso(β)? We have:

∂_+ L^lasso(β)|_{β=0} = −⟨y, x⟩ + λ ,    (9)

and

∂_− L^lasso(β)|_{β=0} = −⟨y, x⟩ − λ .    (10)

Hence, 0 is a minimiser of L^lasso(β) if

−⟨y, x⟩ − λ ≤ 0 ≤ −⟨y, x⟩ + λ .    (11)

This will be the case precisely when

λ ≥ |⟨y, x⟩| .    (12)
Why Do We Get Sparsity? Multiple Predictors
In the case of multiple predictors the Lasso objective is

L^lasso(β) = (1/2) ||y − Xβ||_2^2 + λ||β||_1 ,    (13)

and as before, the optimality conditions are

∇_− L^lasso(β)|_{β=β̂} ≤ 0 ≤ ∇_+ L^lasso(β)|_{β=β̂} .    (14)
We now have:

∇_+ L^lasso(β)|_{β=0} = −X^T y + λ1 ,    (15)

and

∇_− L^lasso(β)|_{β=0} = −X^T y − λ1 ,    (16)

and hence the zero vector will be the solution when

−X^T y − λ1 ≤ 0 ≤ −X^T y + λ1 ,    (17)

where the inequalities are element-wise and all must hold. This will be the case when

λ ≥ max_j |⟨x_j, y⟩| .    (18)
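This threshold is easy to compute in the running sketch; it reappears below as λ_max when choosing the tuning parameter.

```python
# Smallest lambda for which every coefficient in (5)/(13) is exactly zero.
lam_max = np.max(np.abs(X.T @ y))
```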
When Are Only Some Predictors Removed?
The optimality condition for the individual coefficient β_j is, once again,

(∂_−/∂β_j) L^lasso(β) ≤ 0 ≤ (∂_+/∂β_j) L^lasso(β) .    (19)

We have

(∂_+/∂β_j) L^lasso(β)|_{β_j=0} = −x_j^T (y − Σ_{k≠j} β_k x_k) + λ    (20)

and

(∂_−/∂β_j) L^lasso(β)|_{β_j=0} = −x_j^T (y − Σ_{k≠j} β_k x_k) − λ .    (21)

Hence, depending on Σ_{k≠j} β_k x_k, β̂_j may or may not be zero.
Which Penalties Give Sparsity in General?
Consider now the more general penalised least squares problem

L^q(β) = (1/2) ||y − Xβ||_2^2 + λ Σ_{j=1}^p |β_j|^q .    (22)
For q = 1, this is the Lasso, while for q = 2 this is ridge regression. Assuming the one-dimensional case again, we now have that for all q > 1,

∇_+ L^q(β)|_{β=0} = ∇_− L^q(β)|_{β=0} = −X^T y ,    (23)
which is identical to the gradient of the ordinary least squares loss
at 0. Hence, a coefficient will be estimated to be 0 only if the least
squares estimate is also 0. For continuous data, this will never be
the case.
What about q < 1?
- For q < 1, L^q(·) is a sparsity-inducing loss function, however it is no longer convex.
- This means that it may not have a unique minimum, and in fact 0 will always be a local minimum.
- L^0(·) amounts to best subset selection, which can only be solved by evaluating the loss function at every combination of active predictors.
Computing The Lasso Solution: One Predictor
When λ ≥ |⟨y, x⟩| we know the solution is 0. For λ < |⟨y, x⟩| we have:

∂ L^lasso(β) = −⟨y, x⟩ + β + λ sgn(β) ,    (24)

giving the estimating equation

β̂ = ⟨y, x⟩ − λ sgn(β̂) .    (25)

Trick: notice that sgn(β̂) = sgn(⟨y, x⟩), from which we obtain

β̂ = ⟨y, x⟩ − λ sgn(⟨y, x⟩) .    (26)
Combining The Results
Noticing that ⟨y, x⟩ is precisely the ordinary least squares solution β̂^ols, we finally obtain

β̂^lasso =  β̂^ols + λ    if β̂^ols < −λ
            0             if −λ ≤ β̂^ols ≤ λ        (27)
            β̂^ols − λ    if β̂^ols > λ

The right-hand side of equation (27) can be denoted by S_λ(β̂^ols), and is referred to as the soft-thresholding operator.
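A minimal sketch of (27) in code (the function name is illustrative; it is reused in the coordinate descent sketch below):

```python
def soft_threshold(b, lam):
    """Soft-thresholding operator S_lambda(b) from (27); works elementwise."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```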
Computing The Lasso Solution: Multiple Predictors
The Lasso criterion for multiple predictors can be written as

L^lasso(β) = (1/2) Σ_{i=1}^N (y_i − Σ_{j=1}^p β_j x_ij)^2 + λ Σ_{j=1}^p |β_j|
           = (1/2) Σ_{i=1}^N (r_ij − β_j x_ij)^2 + λ|β_j| + c ,    (28)

where

r_ij = y_i − Σ_{k≠j} x_ik β̂_k ,    (29)

and c is a constant that doesn't depend on β_j. Ignoring c, this is exactly the form of the Lasso criterion with one predictor, x_j, and response vector r^(j).
Coordinate Descent
- Coordinate descent is a classic optimisation technique for multidimensional functions; the idea is to optimise with respect to one variable at a time, holding the others constant, cycling through all variables in some given order repeatedly, until convergence.
- Given that we can optimise the Lasso objective with respect to one coefficient while holding the others constant, the idea naturally arises to use coordinate descent to fit the Lasso.
- The update for the j-th coefficient is given by S_λ(⟨r^(j), x_j⟩), where r^(j) is the vector of partial residuals from (29).
- Because the updates have closed forms, this method is efficient, and one of the fastest available fitting algorithms; a sketch is given after this list.
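A minimal sketch of cyclic coordinate descent for objective (13). The update divides by ||x_j||_2^2, which reduces to the slides' update S_λ(⟨r^(j), x_j⟩) when each column has unit norm; the iteration cap, tolerance, and λ value are illustrative choices.

```python
def lasso_coordinate_descent(X, y, lam, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for (1/2)||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)           # ||x_j||_2^2 for each column
    r = y - X @ beta                        # full residual, kept up to date
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            r_j = r + X[:, j] * beta[j]     # partial residual r^(j) from (29)
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
            r = r_j - X[:, j] * beta[j]     # restore the full residual
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

beta_cd = lasso_coordinate_descent(X, y, lam=5.0)   # illustrative lambda
```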
Degrees of Freedom of Lasso Fit
- The degrees of freedom of a fitted model are a measure of the complexity of the model.
- Often defined as ‘the number of parameters that are allowed to vary’ when estimating the model.
- When using regularisation, such as ridge regression or the Lasso, this definition is unclear.
- A more rigorous definition is df(ŷ) = (1/σ^2) Σ_{i=1}^N Cov(y_i, ŷ_i).
- For ordinary linear regression, when the design matrix is of full rank, the degrees of freedom are simply p.
Degrees of Freedom of Lasso Fit
- For the Lasso, an unbiased estimator of the degrees of freedom is d̂f(ŷ) = ||β̂^lasso||_0, i.e. the number of non-zero coefficients.
- This is surprising: the non-zero coefficients were chosen adaptively from a larger number of coefficients, so the degrees of freedom should be higher. However, the shrinkage effect on the non-zero coefficients reduces the degrees of freedom, and these two forces in a sense cancel each other out.
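In the running sketch, this estimate is simply the count of non-zero coefficients in the fit:

```python
df_hat = np.count_nonzero(beta_cd)   # estimated degrees of freedom of the lasso fit
```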
Choosing Tuning Parameter λ
As is always the case for regularisation methods, we must determine the amount of regularisation.

- The standard method is cross-validation: split the data into training and test sets, fit the model for different λ values on the training data, and evaluate performance on the test sets. Choose the λ with the lowest squared error (or any other error measure) on the test set. The λ sequence is usually a logarithmic sequence from λ_max (the smallest λ for which β̂ = 0) down to 0; see the sketch after this list.
- Given that we can estimate the degrees of freedom of a fit, many other methods exist as well: generalized cross-validation, Mallow’s Cp, AIC, BIC, etc.
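A minimal cross-validation sketch using the coordinate descent function above (the number of folds, grid length, and lower end of the grid are illustrative choices):

```python
def cv_lasso(X, y, lams, n_folds=5, seed=0):
    """Choose lambda by K-fold cross-validation on mean squared prediction error."""
    folds = np.random.default_rng(seed).integers(0, n_folds, size=len(y))
    errors = np.zeros(len(lams))
    for k in range(n_folds):
        train, test = folds != k, folds == k
        for i, lam in enumerate(lams):
            beta = lasso_coordinate_descent(X[train], y[train], lam)
            errors[i] += np.mean((y[test] - X[test] @ beta) ** 2)
    return lams[np.argmin(errors)]

# Logarithmic grid from lam_max (computed earlier) down towards 0.
lam_grid = np.geomspace(lam_max, 1e-3 * lam_max, num=50)
best_lam = cv_lasso(X, y, lam_grid)
```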
Uniqueness of Lasso Estimates
- Like least squares estimates, in the case p > n the Lasso fitted values are unique, however the coefficient estimates are not necessarily unique.
- If, however, the columns of X are in general position, then for λ > 0 the Lasso estimates are always unique, even for p > n; however, at most n coefficients will be non-zero.
- If the elements of X are drawn from a continuous distribution, then the columns of X will be in general position with probability 1.
- Points x_1, . . . , x_p are in general position if the affine span of any k + 1 points s_1 x_{i_1}, . . . , s_{k+1} x_{i_{k+1}}, for any signs s_1, . . . , s_{k+1}, does not contain any element of the set {±x_i : i ≠ i_1, . . . , i_{k+1}}.
Nonnegative Garrote
Given initial parameter estimates β̃, the nonnegative garrote solves

min_{c ∈ R^p} Σ_{i=1}^n (y_i − Σ_{j=1}^p c_j x_ij β̃_j)^2   subject to c_j ≥ 0 and ||c||_1 ≤ t ,    (30)

whereby the final estimates are given by β̂_j = c_j β̃_j. This is also a sparsity-inducing and shrinkage procedure, however it results in smaller shrinkage for larger β̃_j, making it similar to the adaptive Lasso, in which each coefficient is penalised with a different λ, proportional to the inverse of the least squares estimate.
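A minimal sketch of (30) in its Lagrangian form, replacing the explicit l1 bound with a penalty λ Σ_j c_j (the initial estimates and λ value are illustrative choices):

```python
from scipy.optimize import minimize

def nonnegative_garrote(X, y, beta_init, lam):
    """Lagrangian form of (30): min_c 0.5*||y - Z c||^2 + lam*sum(c) with c >= 0,
    where Z has columns x_j scaled by the initial estimates beta_init[j]."""
    Z = X * beta_init                                    # column-wise scaling
    def obj(c):
        resid = y - Z @ c
        return 0.5 * resid @ resid + lam * c.sum()
    res = minimize(obj, x0=np.ones(X.shape[1]),
                   bounds=[(0.0, None)] * X.shape[1], method="L-BFGS-B")
    return res.x * beta_init                             # beta_hat_j = c_j * beta_tilde_j

beta_garrote = nonnegative_garrote(X, y, beta_ols, lam=2.0)
```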
Bayes Estimates
- The Lasso has a Bayesian interpretation, in which it is the MAP estimate (the expectation of the posterior would not yield sparsity) of a linear model in which a Laplacian prior is placed on the coefficient vector.
- Ridge regression, on the other hand, is equivalent to a linear model in which a normal prior is placed on the coefficient vector.
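The correspondence can be seen from the negative log-posterior (a short sketch assuming Gaussian errors as in (1) and independent Laplace(0, b) priors on the coefficients; constants not depending on β are dropped):

−log p(β | y, X) = (1/(2σ^2)) ||y − Xβ||_2^2 + (1/b) ||β||_1 + const ,

so maximising the posterior over β is exactly the Lasso problem (5) with λ = σ^2/b.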
Perspective
- l1 regularisation has a long history, with precursors in signal processing.
- It has grown to become popular in a variety of fields.
- Good properties of l1 regularisation:
  1. A natural way to enforce sparsity and obtain simple models.
  2. ‘Bet on sparsity’: if the true signal is sparse, the Lasso does well; if the true signal is not sparse, the Lasso does not do well, but no method will.
  3. Because the Lasso involves a convex optimisation problem, it can be solved efficiently and can deal with large-scale problems.