Variable Selection
LASSO: Sparse Regression
Machine Learning – CSE446
Carlos Guestrin, University of Washington
April 10, 2013
©2005-2013 Carlos Guestrin

Regularization in Linear Regression
- Overfitting usually leads to very large parameter choices, e.g.:
  - -2.2 + 3.1 X - 0.30 X^2
  - -1.1 + 4,700,910.7 X - 8,585,638.4 X^2 + ...
- Regularized or penalized regression aims to impose a "complexity" penalty by penalizing large weights
  - "Shrinkage" method

Variable Selection
- Ridge regression: penalizes large weights
- What if we want to perform "feature selection"?
  - E.g., which regions of the brain are important for word prediction?
  - Can't simply choose the features with the largest coefficients in the ridge solution
  - Computationally intractable to perform "all subsets" regression
- Try a new penalty: penalize non-zero weights
  - Regularization penalty: the L1 norm of the weights, \lambda \sum_{i=1}^k |w_i|
  - Leads to sparse solutions
  - Just like ridge regression, the solution is indexed by a continuous parameter \lambda
  - This simple approach has changed statistics, machine learning & electrical engineering

LASSO Regression
- LASSO: least absolute shrinkage and selection operator
- New objective: least squares plus an L1 penalty on the weights (written out below)

Geometric Intuition for Sparsity
[Figure: "Picture of Lasso and Ridge regression" — contours of the squared-error term around the least-squares solution \hat{\beta}, with the diamond-shaped Lasso constraint region vs. the circular Ridge constraint region in the (\beta_1, \beta_2) plane; the diamond's corners are what push Lasso solutions exactly onto the axes. From Rob Tibshirani slides.]

Optimizing the LASSO Objective
- LASSO solution:
  \hat{w}_{LASSO} = \arg\min_w \sum_{j=1}^N \left( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \right)^2 + \lambda \sum_{i=1}^k |w_i|

Coordinate Descent
- Given a function F
  - Want to find its minimum
- Often, it is hard to minimize over all coordinates at once, but easy to minimize over one coordinate
- Coordinate descent: repeatedly minimize F over a single coordinate while holding all the others fixed
- How do we pick the next coordinate? At random, or by cycling through the coordinates sequentially
- Super useful approach for *many* problems
  - Converges to the optimum in some cases, such as LASSO

Optimizing the LASSO Objective One Coordinate at a Time
- Objective:
  \sum_{j=1}^N \left( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \right)^2 + \lambda \sum_{i=1}^k |w_i|
- Taking the derivative:
  - Residual sum of squares (RSS) term:
    \frac{\partial}{\partial w_\ell} RSS(w) = -2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \big(w_0 + \sum_{i=1}^k w_i h_i(x_j)\big) \right)
  - Penalty term: \lambda |w_\ell| is not differentiable at w_\ell = 0

Subgradients of Convex Functions
- Gradients lower bound convex functions: F(w') \ge F(w) + \nabla F(w) \cdot (w' - w) for all w'
- Gradients are unique at w iff the function is differentiable at w
- Subgradients: generalize gradients to non-differentiable points
  - Any plane that lower bounds the function: g is a subgradient at w if F(w') \ge F(w) + g \cdot (w' - w) for all w'

Taking the Subgradient
- Gradient of the RSS term:
  \frac{\partial}{\partial w_\ell} RSS(w) = a_\ell w_\ell - c_\ell
  where
  a_\ell = 2 \sum_{j=1}^N (h_\ell(x_j))^2
  c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \big(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\big) \right)
- If there were no penalty: setting the derivative to zero gives w_\ell = c_\ell / a_\ell
- Subgradient of the full objective: shown on the next slide

Setting the Subgradient to 0
\partial_{w_\ell} F(w) =
  a_\ell w_\ell - c_\ell - \lambda                  if w_\ell < 0
  [\,-c_\ell - \lambda,\; -c_\ell + \lambda\,]      if w_\ell = 0
  a_\ell w_\ell - c_\ell + \lambda                  if w_\ell > 0

Soft Thresholding
\hat{w}_\ell =
  (c_\ell + \lambda)/a_\ell    if c_\ell < -\lambda
  0                            if c_\ell \in [-\lambda, \lambda]
  (c_\ell - \lambda)/a_\ell    if c_\ell > \lambda
[Figure: soft-thresholding function, from Kevin Murphy textbook]

Coordinate Descent for LASSO (aka Shooting Algorithm)
- Repeat until convergence:
  - Pick a coordinate \ell (at random or sequentially)
  - Set:
    \hat{w}_\ell = (c_\ell + \lambda)/a_\ell    if c_\ell < -\lambda
                 = 0                            if c_\ell \in [-\lambda, \lambda]
                 = (c_\ell - \lambda)/a_\ell    if c_\ell > \lambda
  - Where:
    a_\ell = 2 \sum_{j=1}^N (h_\ell(x_j))^2
    c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \big(w_0 + \sum_{i \neq \ell} w_i h_i(x_j)\big) \right)
- For convergence rates, see Shalev-Shwartz and Tewari 2009
- Other common technique = LARS
  - Least angle regression and shrinkage, Efron et al. 2004
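The shooting algorithm above maps almost line-for-line onto code. Below is a minimal NumPy sketch using the slides' notation (H is the N×k matrix with entries h_i(x_j), t is the target vector); the function name lasso_shooting, the unpenalized intercept update, and the convergence check are implementation choices made for this sketch, not something specified in the lecture.

```python
import numpy as np


def lasso_shooting(H, t, lam, num_passes=100, tol=1e-6):
    """Coordinate descent ("shooting") for the LASSO objective from the slides:
        sum_j (t_j - (w0 + sum_i w_i h_i(x_j)))^2 + lam * sum_i |w_i|
    H: (N, k) array whose columns are the features h_i; t: length-N targets.
    Returns (w0, w). Parameter names/defaults are choices made for this sketch."""
    N, k = H.shape
    w = np.zeros(k)
    w0 = t.mean()                        # unpenalized intercept
    a = 2.0 * np.sum(H ** 2, axis=0)     # a_l = 2 * sum_j h_l(x_j)^2, fixed per coordinate

    for _ in range(num_passes):
        w_old = w.copy()
        w0 = np.mean(t - H @ w)          # re-fit the (unpenalized) intercept
        for l in range(k):               # sweep the coordinates sequentially
            # c_l = 2 * sum_j h_l(x_j) * (t_j - (w0 + sum_{i != l} w_i h_i(x_j)))
            resid_without_l = t - (w0 + H @ w) + w[l] * H[:, l]
            c = 2.0 * H[:, l] @ resid_without_l
            # Soft thresholding: the "Setting the Subgradient to 0" solution.
            if c < -lam:
                w[l] = (c + lam) / a[l]
            elif c > lam:
                w[l] = (c - lam) / a[l]
            else:
                w[l] = 0.0
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w0, w


# Tiny usage sketch on synthetic data (purely illustrative).
rng = np.random.default_rng(0)
H = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.5, 0.7]
t = 1.0 + H @ true_w + 0.1 * rng.normal(size=100)
w0, w = lasso_shooting(H, t, lam=5.0)
print(np.round(w, 3))   # the seven irrelevant coefficients typically come out exactly 0
```

Increasing lam zeroes out more coordinates, which is exactly the sparsity behavior the coefficient-path slides below illustrate.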
Recall: Ridge Coefficient Path
[Figure: ridge coefficient path, from Kevin Murphy textbook]
- Typical approach: select \lambda using cross validation

Now: LASSO Coefficient Path
[Figure: LASSO coefficient path, from Kevin Murphy textbook]

LASSO Example
Estimated coefficients (blank Lasso entries are coefficients driven exactly to zero):

  Term        Least Squares   Ridge    Lasso
  Intercept    2.465           2.452    2.468
  lcavol       0.680           0.420    0.533
  lweight      0.263           0.238    0.169
  age         -0.141          -0.046
  lbph         0.210           0.162    0.002
  svi          0.305           0.227    0.094
  lcp         -0.288           0.000
  gleason     -0.021           0.040
  pgg45        0.267           0.133

From Rob Tibshirani slides

What you need to know
- Variable selection: find a sparse solution to the learning problem
- L1 regularization is one way to do variable selection
  - Applies beyond regression
  - Hundreds of other approaches out there
- The LASSO objective is non-differentiable, but convex → use the subgradient
- No closed-form solution for the minimization → use coordinate descent
- The shooting algorithm is a very simple approach for solving LASSO
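The coefficient-path slides and the "select \lambda by cross validation" remark can also be reproduced with an off-the-shelf LASSO solver. The sketch below uses scikit-learn, which is not a tool used in the lecture; note that scikit-learn calls the regularization weight alpha and scales the squared-error term by 1/(2N), so its alpha plays the role of \lambda only up to that constant.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lasso_path

# Synthetic regression problem: 8 features, only 3 of them informative.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
alphas = np.logspace(1, -2, 30)

# Coefficient path: one column of coefficients per alpha, analogous to the
# "LASSO Coefficient Path" figure.
path_alphas, coefs, _ = lasso_path(X, y, alphas=alphas)
print(coefs.shape)               # (8 features, 30 alphas)

# Pick the regularization strength by 5-fold cross validation,
# as the slides recommend for lambda.
model = LassoCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)              # selected regularization strength
print(np.round(model.coef_, 3))  # sparse: uninformative features at (or near) 0
```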