Variable Selection
LASSO: Sparse Regression
Machine Learning – CSE446
Carlos Guestrin, University of Washington
April 10, 2013
©2005-2013 Carlos Guestrin

Regularization in Linear Regression
- Overfitting usually leads to very large parameter choices, e.g.:
  - -2.2 + 3.1 X - 0.30 X^2
  - -1.1 + 4,700,910.7 X - 8,585,638.4 X^2 + ...
- Regularized or penalized regression aims to impose a "complexity" penalty by penalizing large weights
  - "Shrinkage" method

Variable Selection
- Ridge regression: penalizes large weights
- What if we want to perform "feature selection"?
  - E.g., which regions of the brain are important for word prediction?
  - Can't simply choose the features with the largest coefficients in the ridge solution
  - Computationally intractable to perform "all subsets" regression
- Try a new penalty: penalize non-zero weights
  - Regularization penalty: the L1 norm of the weights, \lambda \sum_{i=1}^k |w_i|
  - Leads to sparse solutions
  - Just like ridge regression, the solution is indexed by a continuous parameter \lambda
  - This simple approach has changed statistics, machine learning & electrical engineering

LASSO Regression
- LASSO: least absolute shrinkage and selection operator
- New objective: least squares plus an L1 penalty on the weights (written out below)

Geometric Intuition for Sparsity
[Figure: "Picture of Lasso and Ridge regression" — contours of the squared-error term around the least-squares solution \hat{\beta}, with the diamond-shaped Lasso constraint region vs. the circular Ridge constraint region in the (\beta_1, \beta_2) plane; the diamond's corners are what push Lasso solutions exactly onto the axes. From Rob Tibshirani slides.]

Optimizing the LASSO Objective
- LASSO solution:
  \hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \right)^2 + \lambda \sum_{i=1}^k |w_i|

Coordinate Descent
- Given a function F
  - Want to find its minimum
- Often, it is hard to minimize over all coordinates at once, but easy to minimize over one coordinate
- Coordinate descent: repeatedly minimize F over a single coordinate while holding all the others fixed
- How do we pick the next coordinate? At random, or by cycling through the coordinates sequentially
- Super useful approach for *many* problems
  - Converges to the optimum in some cases, such as LASSO

Optimizing the LASSO Objective One Coordinate at a Time
- Objective:
  \sum_{j=1}^N \left( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \right)^2 + \lambda \sum_{i=1}^k |w_i|
- Taking the derivative:
  - Residual sum of squares (RSS) term:
    \frac{\partial}{\partial w_\ell} RSS(w) = -2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \right)
  - Penalty term: \lambda |w_\ell| is not differentiable at w_\ell = 0

Subgradients of Convex Functions
- Gradients lower bound convex functions: F(w') \ge F(w) + \nabla F(w) \cdot (w' - w) for all w'
- Gradients are unique at w iff the function is differentiable at w
- Subgradients: generalize gradients to non-differentiable points
  - Any plane that lower bounds the function: g is a subgradient at w if F(w') \ge F(w) + g \cdot (w' - w) for all w'

Taking the Subgradient
- Gradient of the RSS term:
  \frac{\partial}{\partial w_\ell} RSS(w) = a_\ell w_\ell - c_\ell
  where
  a_\ell = 2 \sum_{j=1}^N (h_\ell(x_j))^2
  c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \big(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\big) \right)
- If there were no penalty: setting the derivative to zero gives w_\ell = c_\ell / a_\ell
- Subgradient of the full objective: shown on the next slide

Setting the Subgradient to 0
\partial_{w_\ell} F(w) =
  a_\ell w_\ell - c_\ell - \lambda                  if w_\ell < 0
  [\,-c_\ell - \lambda,\; -c_\ell + \lambda\,]      if w_\ell = 0
  a_\ell w_\ell - c_\ell + \lambda                  if w_\ell > 0

Soft Thresholding
\hat{w}_\ell =
  (c_\ell + \lambda)/a_\ell    if c_\ell < -\lambda
  0                            if c_\ell \in [-\lambda, \lambda]
  (c_\ell - \lambda)/a_\ell    if c_\ell > \lambda
[Figure: soft-thresholding function, from Kevin Murphy textbook]

Coordinate Descent for LASSO (aka Shooting Algorithm)
- Repeat until convergence:
  - Pick a coordinate \ell (at random or sequentially)
  - Set:
    \hat{w}_\ell = (c_\ell + \lambda)/a_\ell    if c_\ell < -\lambda
                 = 0                            if c_\ell \in [-\lambda, \lambda]
                 = (c_\ell - \lambda)/a_\ell    if c_\ell > \lambda
  - Where:
    a_\ell = 2 \sum_{j=1}^N (h_\ell(x_j))^2
    c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \big(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\big) \right)
- For convergence rates, see Shalev-Shwartz and Tewari 2009
- Other common technique = LARS
  - Least angle regression and shrinkage, Efron et al. 2004
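The shooting algorithm above maps almost line-for-line onto code. Below is a minimal NumPy sketch using the slides' notation (H is the N×k matrix with entries h_i(x_j), t is the target vector); the function name lasso_shooting, the unpenalized intercept update, and the convergence check are implementation choices made for this sketch, not something specified in the lecture.

```python
import numpy as np


def lasso_shooting(H, t, lam, num_passes=100, tol=1e-6):
    """Coordinate descent ("shooting") for the LASSO objective from the slides:
        sum_j (t_j - (w0 + sum_i w_i h_i(x_j)))^2 + lam * sum_i |w_i|
    H: (N, k) array whose columns are the features h_i; t: length-N targets.
    Returns (w0, w). Parameter names/defaults are choices made for this sketch."""
    N, k = H.shape
    w = np.zeros(k)
    w0 = t.mean()                        # unpenalized intercept
    a = 2.0 * np.sum(H ** 2, axis=0)     # a_l = 2 * sum_j h_l(x_j)^2, fixed per coordinate

    for _ in range(num_passes):
        w_old = w.copy()
        w0 = np.mean(t - H @ w)          # re-fit the (unpenalized) intercept
        for l in range(k):               # sweep the coordinates sequentially
            # c_l = 2 * sum_j h_l(x_j) * (t_j - (w0 + sum_{i != l} w_i h_i(x_j)))
            resid_without_l = t - (w0 + H @ w) + w[l] * H[:, l]
            c = 2.0 * H[:, l] @ resid_without_l
            # Soft thresholding: the "Setting the Subgradient to 0" solution.
            if c < -lam:
                w[l] = (c + lam) / a[l]
            elif c > lam:
                w[l] = (c - lam) / a[l]
            else:
                w[l] = 0.0
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w0, w


# Tiny usage sketch on synthetic data (purely illustrative).
rng = np.random.default_rng(0)
H = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 0.7]
t = 1.0 + H @ true_w + 0.1 * rng.normal(size=100)
w0, w = lasso_shooting(H, t, lam=5.0)
print(np.round(w, 3))   # the seven irrelevant coefficients typically come out exactly 0
```

Increasing lam zeroes out more coordinates, which is exactly the sparsity behavior the coefficient-path slides below illustrate.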
Recall: Ridge Coefficient Path
[Figure: ridge coefficient path, from Kevin Murphy textbook]
- Typical approach: select \lambda using cross validation

Now: LASSO Coefficient Path
[Figure: LASSO coefficient path, from Kevin Murphy textbook]

LASSO Example
Estimated coefficients (blank Lasso entries are coefficients driven exactly to zero):

  Term        Least Squares   Ridge    Lasso
  Intercept    2.465           2.452    2.468
  lcavol       0.680           0.420    0.533
  lweight      0.263           0.238    0.169
  age         -0.141          -0.046
  lbph         0.210           0.162    0.002
  svi          0.305           0.227    0.094
  lcp         -0.288           0.000
  gleason     -0.021           0.040
  pgg45        0.267           0.133

From Rob Tibshirani slides

What you need to know
- Variable selection: find a sparse solution to the learning problem
- L1 regularization is one way to do variable selection
  - Applies beyond regression
  - Hundreds of other approaches out there
- The LASSO objective is non-differentiable, but convex → use the subgradient
- No closed-form solution for the minimization → use coordinate descent
- The shooting algorithm is a very simple approach for solving LASSO
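The coefficient-path slides and the "select \lambda by cross validation" remark can also be reproduced with an off-the-shelf LASSO solver. The sketch below uses scikit-learn, which is not a tool used in the lecture; note that scikit-learn calls the regularization weight alpha and scales the squared-error term by 1/(2N), so its alpha plays the role of \lambda only up to that constant.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lasso_path

# Synthetic regression problem: 8 features, only 3 of them informative.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
alphas = np.logspace(1, -2, 30)

# Coefficient path: one column of coefficients per alpha, analogous to the
# "LASSO Coefficient Path" figure.
path_alphas, coefs, _ = lasso_path(X, y, alphas=alphas)
print(coefs.shape)               # (8 features, 30 alphas)

# Pick the regularization strength by 5-fold cross validation,
# as the slides recommend for lambda.
model = LassoCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)              # selected regularization strength
print(np.round(model.coef_, 3))  # sparse: uninformative features at (or near) 0
```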