Chapter 2: The Lasso for Linear Models
Alkeos Tsokos
Reading Group on 'Statistical Learning with Sparsity: The Lasso and Generalizations'
alkeos.tsokos.10@ucl.ac.uk
February 19, 2016

Overview
1. The Lasso Estimator
2. Why Does the Lasso Give Sparse Solutions?
3. Computing the Solution
4. Degrees of Freedom
5. Choosing the Tuning Parameter
6. Uniqueness of Lasso Estimates
7. Nonnegative Garrote
8. Bayes Estimates
9. Perspective

Notation and Recurring Assumptions
- ||·||_q denotes the ℓq norm, i.e. \|x\|_q = \big(\sum_{j=1}^{p} |x_j|^q\big)^{1/q}. For q = 0 we define 0^0 = 0, so that ||·||_0 counts the non-zero elements of a vector, even though it does not satisfy the definition of a norm.
- ⟨v, x⟩ denotes the inner product between v and x, i.e. \langle v, x\rangle = \sum_{j=1}^{p} v_j x_j.
- All predictor variables are assumed to be standardised (zero mean and unit variance) and the response variable centred (zero mean).
- ∂_+ f(·) and ∂_− f(·) denote the derivative of f(·) from the right and from the left respectively.
- ∇_+ f(·) and ∇_− f(·) denote the gradient of f(·) from the right and from the left respectively.

Introduction
Consider a standard linear regression model

    y = X\beta + \varepsilon,    (1)

where y is a vector of responses of length n, X is an n × p design matrix containing the observed covariates, ε ∼ N(0_n, σ²I_n) is a vector of errors, and β is a vector of coefficients (or parameters).

Goal: estimate the parameter vector β.

The standard approach is ordinary least squares, i.e. compute

    \hat{\beta} = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2,    (2)

which has solution

    \hat{\beta} = (X^T X)^{-1} X^T y.    (3)

Problem 1: If p ≈ n, β̂ has large variance; the model has too much freedom and may model noise rather than signal, i.e. it overfits.

Problem 2: It may be the case that ||β||_0 << p, but ||β̂||_0 = p always; the fitting mechanism never excludes predictors that do not influence the response.

Solutions to the Problems
Problem 1: Ridge regression solves

    \hat{\beta}^{ridge} = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2,    (4)

where λ is a tuning parameter. This is a shrinkage estimator: forcing ||β̂||²_2 to be small reduces the freedom of the model, hence β̂^ridge has lower variance (at the cost of being biased).

Problem 2: Variable selection techniques such as best subset selection, stepwise regression, etc.
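To make the fix for Problem 1 concrete, the sketch below (my own numpy illustration with simulated data, not from the slides) computes the ordinary least squares and ridge solutions in closed form. Differentiating (4) gives β̂^ridge = (XᵀX + 2λI)^{-1}Xᵀy; the factor 2 appears because (4) puts ½ on the squared-error term but λ (not λ/2) on the penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n barely larger than p, so OLS is unstable (Problem 1).
n, p = 30, 25
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise predictors
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]                  # sparse true signal
y = X @ beta_true + rng.standard_normal(n)
y = y - y.mean()                                  # centre the response

# Ordinary least squares, equation (3).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: setting the gradient of (4) to zero gives
# beta = (X'X + 2*lam*I)^{-1} X'y.
lam = 5.0
beta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(p), X.T @ y)

print("||beta_ols||_2   =", np.linalg.norm(beta_ols))
print("||beta_ridge||_2 =", np.linalg.norm(beta_ridge))  # shrunk towards zero
```

Note that neither estimate contains exact zeros: ridge shrinks all coefficients but does not select, which is Problem 2 and the motivation for what follows.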
Two Birds, One Stone: The Lasso Estimator
Is there a way to solve both problems at once? Yes, with the 'Least Absolute Shrinkage and Selection Operator' (Lasso):

    \hat{\beta}^{lasso} = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.    (5)

- Shrinkage: As in ridge regression, forcing ||β̂||_1 to be small reduces the variance of β̂^lasso.
- Selection: For large enough λ, ||β̂^lasso||_0 < p, i.e. some coefficients are estimated to be exactly zero and the corresponding predictors are excluded from the model.

Why Do We Get Sparsity? One Predictor
In the case of one predictor the model is

    y_i = \beta x_i + \varepsilon_i, \quad i = 1, \dots, n.    (6)

The Lasso objective is

    L^{lasso}(\beta) = \tfrac{1}{2}\sum_{i=1}^{n} (y_i - \beta x_i)^2 + \lambda|\beta|.    (7)

L^lasso(β) is convex, hence β̂ is a minimiser of L^lasso(β) if

    \partial_- L^{lasso}(\beta)\big|_{\beta=\hat{\beta}} \le 0 \le \partial_+ L^{lasso}(\beta)\big|_{\beta=\hat{\beta}}.    (8)

When will 0 be a minimiser of L^lasso(β)? We have

    \partial_+ L^{lasso}(\beta)\big|_{\beta=0} = -\langle y, x\rangle + \lambda    (9)

and

    \partial_- L^{lasso}(\beta)\big|_{\beta=0} = -\langle y, x\rangle - \lambda.    (10)

Hence 0 is a minimiser of L^lasso(β) if

    -\langle y, x\rangle - \lambda \le 0 \le -\langle y, x\rangle + \lambda,    (11)

which is the case precisely when

    \lambda \ge |\langle y, x\rangle|.    (12)

Why Do We Get Sparsity? Multiple Predictors
In the case of multiple predictors the Lasso objective is

    L^{lasso}(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1,    (13)

and, as before, the optimality conditions are

    \nabla_- L^{lasso}(\beta)\big|_{\beta=\hat{\beta}} \le 0 \le \nabla_+ L^{lasso}(\beta)\big|_{\beta=\hat{\beta}}.    (14)

We now have

    \nabla_+ L^{lasso}(\beta)\big|_{\beta=0} = -X^T y + \lambda\mathbf{1}    (15)

and

    \nabla_- L^{lasso}(\beta)\big|_{\beta=0} = -X^T y - \lambda\mathbf{1},    (16)

and hence the zero vector is the solution when

    -X^T y - \lambda\mathbf{1} \le 0 \le -X^T y + \lambda\mathbf{1},    (17)

where the inequalities are element-wise and all must hold. This is the case when

    \lambda \ge \max_{j} |\langle x_j, y\rangle|.    (18)
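The threshold in (18) is easy to check numerically. The sketch below is my own illustration (not from the slides) using scikit-learn's Lasso; note that scikit-learn minimises (1/2n)||y − Xβ||²_2 + α||β||_1, so α in the code corresponds to λ/n in the notation above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardised predictors
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)
y = y - y.mean()                                # centred response

# lambda_max in the slides' notation; divide by n to get sklearn's alpha.
lam_max = np.max(np.abs(X.T @ y))

for factor in (1.01, 0.5):
    alpha = factor * lam_max / n
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
    n_nonzero = np.sum(fit.coef_ != 0)
    print(f"lambda = {factor:.2f} * lambda_max -> {n_nonzero} non-zero coefficients")
# Just above lambda_max every coefficient is zero; below it, predictors enter the model.
```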
When Are Only Some Predictors Removed?
The optimality condition for the individual coefficient β_j is, once again,

    \frac{\partial_-}{\partial \beta_j} L^{lasso}(\beta) \le 0 \le \frac{\partial_+}{\partial \beta_j} L^{lasso}(\beta).    (19)

We have

    \frac{\partial_+}{\partial \beta_j} L^{lasso}(\beta)\Big|_{\beta_j=0} = -x_j^T \Big(y - \sum_{k \ne j} \beta_k x_k\Big) + \lambda    (20)

and

    \frac{\partial_-}{\partial \beta_j} L^{lasso}(\beta)\Big|_{\beta_j=0} = -x_j^T \Big(y - \sum_{k \ne j} \beta_k x_k\Big) - \lambda.    (21)

Hence, depending on the partial residual y − Σ_{k≠j} β_k x_k, β̂_j may or may not be zero.

Which Penalties Give Sparsity in General?
Consider now the more general penalised least squares objective

    L_q(\beta) = \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q.    (22)

For q = 1 this is the Lasso, while for q = 2 it is ridge regression. Assuming the one-dimensional case again, for all q > 1 we have

    \nabla_+ L_q(\beta)\big|_{\beta=0} = \nabla_- L_q(\beta)\big|_{\beta=0} = -X^T y,    (23)

which is identical to the gradient of the ordinary least squares loss at 0. Hence a coefficient will be estimated to be 0 only if the least squares estimate is also 0. For continuous data this will never be the case.

What About q < 1?
- For q < 1, L_q(·) is a sparsity-inducing objective; however, it is no longer convex.
- This means it may not have a unique minimum, and in fact 0 will always be a local minimum.
- L_0(·) amounts to best subset selection, which can only be solved exactly by evaluating the objective at every combination of active predictors.

Computing the Lasso Solution: One Predictor
When λ ≥ |⟨y, x⟩| we know the solution is 0. For λ < |⟨y, x⟩| we have, for β ≠ 0,

    \frac{\partial}{\partial \beta} L^{lasso}(\beta) = -\langle y, x\rangle + \beta\langle x, x\rangle + \lambda\,\mathrm{sgn}(\beta) = -\langle y, x\rangle + \beta + \lambda\,\mathrm{sgn}(\beta),    (24)

assuming the predictor is scaled so that ⟨x, x⟩ = 1, giving the estimating equation

    \hat{\beta} = \langle y, x\rangle - \lambda\,\mathrm{sgn}(\hat{\beta}).    (25)

Trick: notice that sgn(β̂) = sgn(⟨y, x⟩), from which we obtain

    \hat{\beta} = \langle y, x\rangle - \lambda\,\mathrm{sgn}(\langle y, x\rangle).    (26)

Combining the Results
Noticing that ⟨y, x⟩ is precisely the ordinary least squares solution β̂^ols (again with ⟨x, x⟩ = 1), we finally obtain

    \hat{\beta}^{lasso} =
    \begin{cases}
      \hat{\beta}^{ols} + \lambda & \text{if } \hat{\beta}^{ols} < -\lambda \\
      0 & \text{if } -\lambda \le \hat{\beta}^{ols} \le \lambda \\
      \hat{\beta}^{ols} - \lambda & \text{if } \hat{\beta}^{ols} > \lambda.
    \end{cases}    (27)

The right-hand side of equation (27) is denoted S_λ(β̂^ols) and is referred to as the soft-thresholding operator.

Computing the Lasso Solution: Multiple Predictors
Viewed as a function of the single coefficient β_j, with the other coefficients held fixed, the Lasso criterion for multiple predictors can be written as

    L^{lasso}(\beta) = \tfrac{1}{2}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
                     = \tfrac{1}{2}\sum_{i=1}^{n}(r_{ij} - \beta_j x_{ij})^2 + \lambda|\beta_j| + c,    (28)

where

    r_{ij} = y_i - \sum_{k \ne j} x_{ik}\hat{\beta}_k    (29)

is the partial residual and c is a constant that does not depend on β_j. Ignoring c, this is exactly the form of the Lasso criterion with a single predictor x_j and response vector r_j.
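Before turning to the full algorithm, here is a minimal numpy sketch (my own, not from the slides) of the soft-thresholding operator S_λ from (27) and of the resulting single-coordinate update; the general update divides by ⟨x_j, x_j⟩, which equals 1 when the column has unit norm as assumed above.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S_lambda(z) from equation (27); works element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_update(X, y, beta, j, lam):
    """Minimise the Lasso objective over beta_j alone, holding the other coefficients fixed."""
    r_j = y - X @ beta + X[:, j] * beta[j]        # partial residual, equation (29)
    return soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])

# Quick check against the single-predictor closed form.
rng = np.random.default_rng(2)
x = rng.standard_normal(50)
x = (x - x.mean()) / np.linalg.norm(x)            # unit-norm column, <x, x> = 1
y = 2.0 * x + 0.1 * rng.standard_normal(50)
y = y - y.mean()
lam = 0.5
beta_ols = x @ y                                  # OLS solution when <x, x> = 1
print(soft_threshold(beta_ols, lam))              # Lasso solution, equation (27)
```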
Coordinate Descent
- Coordinate descent is a classic optimisation technique for multidimensional functions: optimise with respect to one variable at a time, holding the others fixed, and cycle through all variables in some given order repeatedly until convergence.
- Given that we can optimise the Lasso objective with respect to one coefficient while holding the others fixed, the idea naturally arises to use coordinate descent to fit the Lasso.
- The update for the jth coefficient is S_λ(⟨r^(j), x_j⟩), where r^(j) is the vector of partial residuals from (29). (A full implementation sketch appears after the tuning-parameter discussion below.)
- Because the updates have closed forms, this method is efficient and is one of the fastest available fitting algorithms.

Degrees of Freedom of the Lasso Fit
- The degrees of freedom of a fitted model are a measure of the complexity of the model.
- They are often defined as 'the number of parameters that are allowed to vary' when estimating the model.
- Under regularisation, such as ridge regression or the Lasso, this definition is unclear.
- A more rigorous definition is df(ŷ) = (1/σ²) Σ_{i=1}^n Cov(y_i, ŷ_i).
- For ordinary linear regression with a full-rank design matrix, the degrees of freedom are simply p.
- For the Lasso, an unbiased estimator of the degrees of freedom is d̂f(ŷ) = ||β̂^lasso||_0, the number of non-zero coefficients.
- This is surprising: the non-zero coefficients were chosen adaptively from a larger set, so the degrees of freedom ought to be higher. However, the shrinkage applied to the non-zero coefficients reduces the degrees of freedom, and these two effects in a sense cancel each other out.

Choosing the Tuning Parameter λ
As is always the case with regularisation methods, we must determine the amount of regularisation.
- The standard method is cross-validation: split the data into training and test sets, fit the model for a range of λ values on the training data, and evaluate performance on the test sets; choose the λ with the lowest squared error (or other error measure) on the test set. The λ sequence is usually a logarithmic grid from λ_max (the smallest λ for which β̂ = 0) down towards 0.
- Given that we can estimate the degrees of freedom of a fit, many other methods are available as well: generalised cross-validation, Mallows' Cp, AIC, BIC, etc.
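As promised above, here is a minimal cyclic coordinate descent sketch for the Lasso (my own illustrative code, not from the slides; the convergence check and dense-matrix updates are deliberately simple compared with optimised implementations such as glmnet).

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for (1/2)||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                 # <x_j, x_j> for each column
    r = y.copy()                                  # full residual y - X beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            r_j = r + X[:, j] * beta[j]           # partial residual r^(j)
            new_bj = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
            r = r_j - X[:, j] * new_bj            # restore the full residual
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta

# Example: a sparse signal is recovered with exact zeros elsewhere.
rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.standard_normal(n)
y = y - y.mean()

beta_hat = lasso_coordinate_descent(X, y, lam=0.3 * np.max(np.abs(X.T @ y)))
print(np.round(beta_hat, 2))
```

In practice one runs this over a decreasing λ grid starting at λ_max = max_j |⟨x_j, y⟩|, warm-starting each fit from the previous solution, and chooses λ by cross-validation as described above.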
Uniqueness of Lasso Estimates
- When p > n, the Lasso fitted values Xβ̂ are unique, but the coefficient estimates β̂ are not necessarily unique (just as for least squares).
- If, however, the columns of X are in general position, then for λ > 0 the Lasso estimates are unique even when p > n; in that case at most n coefficients are non-zero.
- If the entries of X are drawn from a continuous distribution, the columns of X are in general position with probability 1.
- Points x_1, ..., x_p are in general position if, for any k + 1 of them and any signs s_1, ..., s_{k+1}, the affine span of s_1 x_{i_1}, ..., s_{k+1} x_{i_{k+1}} contains no element of the set {±x_i : i ≠ i_1, ..., i_{k+1}}.

Nonnegative Garrote
Given initial parameter estimates β̃, the nonnegative garrote solves

    \min_{c \in \mathbb{R}^p} \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} c_j x_{ij}\tilde{\beta}_j\Big)^2 \quad \text{subject to } c_j \ge 0 \text{ and } \|c\|_1 \le t,    (30)

and the final estimates are β̂_j = c_j β̃_j. This is also a sparsity-inducing shrinkage procedure, but it applies less shrinkage to coefficients with larger β̃_j, making it similar to the adaptive Lasso, in which each coefficient is penalised with its own λ proportional to the inverse of the least squares estimate. (A small numerical sketch of (30) is given at the end of this section.)

Bayes Estimates
- The Lasso has a Bayesian interpretation: it is the MAP estimate of a linear model with a Laplacian prior on the coefficient vector (the posterior mean would not yield sparsity).
- Ridge regression, on the other hand, is equivalent to a linear model in which a normal prior is placed on the coefficient vector.

Perspective
- ℓ1 regularisation has a long history, with precursors in signal processing.
- It has grown to become popular in a variety of fields.
- Good properties of ℓ1 regularisation:
  1. It is a natural way to enforce sparsity and obtain simple models.
  2. 'Bet on sparsity': if the true signal is sparse, the Lasso does well; if the true signal is not sparse, the Lasso does not do well, but then no method will.
  3. Because the Lasso involves a convex optimisation problem, it can be solved efficiently and scales to large problems.
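Finally, the promised sketch of the nonnegative garrote (30). This is my own illustration, not from the slides: it takes ordinary least squares as the initial estimates β̃ and hands the constrained problem to scipy's SLSQP solver; the budget t and the simulated data are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p = 100, 6
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)
y = y - y.mean()

beta_tilde = np.linalg.solve(X.T @ X, X.T @ y)     # initial estimates (OLS)
Z = X * beta_tilde                                 # column j is x_j * beta_tilde_j
t = 3.0                                            # budget on ||c||_1

def objective(c):
    resid = y - Z @ c
    return resid @ resid

res = minimize(
    objective,
    x0=np.full(p, 0.5),
    method="SLSQP",
    bounds=[(0.0, None)] * p,                                        # c_j >= 0
    constraints=[{"type": "ineq", "fun": lambda c: t - np.sum(c)}],  # sum(c) <= t, i.e. ||c||_1 <= t since c >= 0
)

beta_garrote = res.x * beta_tilde                  # final estimates beta_j = c_j * beta_tilde_j
print(np.round(res.x, 3))
print(np.round(beta_garrote, 3))
```

Components of c driven to zero remove the corresponding predictors, while values of c_j between 0 and 1 shrink them, mirroring the selection-plus-shrinkage behaviour of the Lasso.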