Machine Learning & Data Mining
CS/CNS/EE 155 — Lecture 3: Regularization, Sparsity & Lasso

Homework 1
• Check course website!
• Some coding required
• Some plotting required
  – I recommend Matlab
• Has supplementary datasets
• Submit via Moodle (due Jan 20th @ 5pm)

Recap: Complete Pipeline
Training Data:   S = \{(x_i, y_i)\}_{i=1}^N
Model Class(es): f(x \mid w, b) = w^T x - b
Loss Function:   L(a, b) = (a - b)^2
Training:        \arg\min_{w,b} \sum_{i=1}^N L(y_i, f(x_i \mid w, b))
Then: Cross Validation & Model Selection → Profit!

Different Model Classes?
• Option 1: SVMs vs ANNs vs LR vs LS
• Option 2: Regularization
  \arg\min_{w,b} \sum_{i=1}^N L(y_i, f(x_i \mid w, b))
  followed by Cross Validation & Model Selection

Notation
• L0 norm — number of non-zero entries:
  \|w\|_0 = \sum_d 1[w_d \neq 0]
• L1 norm — sum of absolute values:
  |w| = \|w\|_1 = \sum_d |w_d|
• L2 norm and squared L2 norm — square root of the sum of squares, and the sum of squares:
  \|w\|_2 = \sqrt{\sum_d w_d^2} = \sqrt{w^T w}, \qquad \|w\|_2^2 = \sum_d w_d^2 = w^T w
• L-infinity norm — maximum absolute value:
  \|w\|_\infty = \lim_{p \to \infty} \big(\sum_d |w_d|^p\big)^{1/p} = \max_d |w_d|

Notation, Part 2
• Minimizing squared loss (unless otherwise stated)
  – Regression / Least-Squares:
    \arg\min_w \sum_i (y_i - w^T x_i + b)^2
• Other losses will be named explicitly — e.g., Logistic Regression = log loss

Ridge Regression
\arg\min_{w,b} \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2
(regularization term + training loss)
• aka L2-regularized regression
• Trades off model complexity vs. training loss
• Each choice of λ is a "model class"
  – Will discuss this further later

Example: predicting height
Features and label:
  x = [ 1[age > 10],  1[gender = male] ],    y = 1 if height > 55", 0 if height ≤ 55"

  Person   Age>10   Male?   Height>55"
  Alice      1        0        1
  Bob        0        1        0
  Carol      0        0        0
  Dave       1        1        1
  Erin       1        0        1
  Frank      0        1        1
  Gena       0        0        0
  Harold     1        1        1
  Irene      1        0        0
  John       0        1        1
  Kelly      1        0        1
  Larry      1        1        1

(The slide splits these 12 people into train and test sets.)
Training objective: \arg\min_{w,b} \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2
[Table on slide: learned w and b and the train/test error for increasing λ — the weight magnitudes shrink toward zero as λ grows.]

Updated Pipeline
Training Data:  S = \{(x_i, y_i)\}_{i=1}^N
Model Class:    f(x \mid w, b) = w^T x - b
Loss Function:  L(a, b) = (a - b)^2
Training:       \arg\min_{w,b} \lambda w^T w + \sum_{i=1}^N L(y_i, f(x_i \mid w, b))
Then: Cross Validation & Model Selection → Profit!

Model scores with increasing λ:

  Person   Age>10   Male?   Height>55"   Scores (increasing λ →)
  Alice      1        0        1         0.91  0.89  0.83  0.75  0.67
  Bob        0        1        0         0.42  0.45  0.50  0.58  0.67
  Carol      0        0        0         0.17  0.26  0.42  0.50  0.67
  Dave       1        1        1         1.16  1.06  0.91  0.83  0.67
  Erin       1        0        1         0.91  0.89  0.83  0.79  0.67
  Frank      0        1        1         0.42  0.45  0.50  0.54  0.67
  Gena       0        0        0         0.17  0.27  0.42  0.50  0.67
  Harold     1        1        1         1.16  1.06  0.91  0.83  0.67
  Irene      1        0        0         0.91  0.89  0.83  0.79  0.67
  John       0        1        1         0.42  0.45  0.50  0.54  0.67
  Kelly      1        0        1         0.91  0.89  0.83  0.79  0.67
  Larry      1        1        1         1.16  1.06  0.91  0.83  0.67

As λ increases, every score is pulled toward a common constant prediction (0.67); the slide highlights the column achieving the best test error.
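The λ sweep above is easy to reproduce with the closed-form ridge solution. The sketch below is my addition, not part of the lecture: the features and labels are the toy table from the slides, but the choice to fold the offset into an unregularized constant feature, to fit on all 12 people rather than a train split, and the λ grid are assumptions for illustration (the slides write f(x | w, b) = wᵀx − b instead).

```python
import numpy as np

# Toy data from the slides: x = [1[age>10], 1[gender=male]], y = 1[height > 55"].
X = np.array([[1, 0], [0, 1], [0, 0], [1, 1], [1, 0], [0, 1],
              [0, 0], [1, 1], [1, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1], dtype=float)

# Fold the offset into a constant feature and leave it unregularized.
Xb = np.hstack([X, np.ones((len(X), 1))])

for lam in [0.0, 0.5, 1.0, 5.0, 50.0]:
    R = lam * np.eye(Xb.shape[1])
    R[-1, -1] = 0.0                                  # do not penalize the offset term
    w = np.linalg.solve(Xb.T @ Xb + R, Xb.T @ y)     # closed-form ridge solution
    mse = np.mean((y - Xb @ w) ** 2)
    print(f"lambda={lam:5.1f}  w={np.round(w[:2], 3)}  offset={w[2]:.3f}  MSE={mse:.3f}")
```

As λ grows both weights shrink toward zero and the predictions collapse toward a constant near the label mean (≈0.67), the same limiting behavior as the last column of the score table above.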
Choice of λ Depends on Training Size
[Plot: performance vs. λ for varying amounts of training data — 25-dimensional space, randomly generated linear response function plus noise.]

Recap: Ridge Regularization
• Ridge Regression = L2-regularized least-squares:
  \arg\min_{w,b} \lambda w^T w + \sum_i (y_i - w^T x_i + b)^2
• Large λ → more stable predictions
  – Less likely to overfit the training data
  – Too large λ → underfit
• Works with other losses
  – Hinge loss, log loss, etc.

Model Class Interpretation
\arg\min_{w,b} \lambda w^T w + \sum_{i=1}^N L(y_i, f(x_i \mid w, b))
• This is not a model class!
  – At least not what we've discussed so far...
• It is an optimization procedure
  – Is there a connection?

Norm Constrained Model Class
f(x \mid w, b) = w^T x - b   subject to   w^T w \le c   (equivalently \|w\|_2^2 \le c)
• c seems to correspond to λ…
[Visualization: feasible regions for c = 1, 2, 3 alongside \arg\min_{w,b} \lambda w^T w + \sum_{i=1}^N L(y_i, f(x_i \mid w, b)).]

Lagrange Multipliers
(Omitting b and using a single training point for simplicity.)
\arg\min_w L(y, w) \equiv (y - w^T x)^2   subject to   w^T w \le c
• Optimality condition:
  – Gradients aligned!
  – Constraint satisfied
  \exists \lambda \ge 0 : \big(\partial_w L(y, w) = \lambda \, \partial_w w^T w\big) \wedge \big(w^T w \le c\big)
(See http://en.wikipedia.org/wiki/Lagrange_multiplier)

Norm Constrained Model Class
Training:  \arg\min_w L(y, w) \equiv (y - w^T x)^2   subject to   w^T w \le c
Two conditions must be satisfied at optimality (the observation above):
  \exists \lambda \ge 0 : \big(\partial_w L(y, w) = \lambda \, \partial_w w^T w\big) \wedge \big(w^T w \le c\big)
Lagrangian:
  \arg\min_{w,\lambda} L(w, \lambda) = (y - w^T x)^2 + \lambda (w^T w - c)
Claim: solving the Lagrangian solves the norm-constrained training problem.
• Optimality of the Lagrangian w.r.t. w — satisfies the first condition:
  \partial_w L(w, \lambda) = -2x(y - w^T x) + 2\lambda w \equiv 0 \;\Rightarrow\; 2x(y - w^T x) = 2\lambda w
• Optimality of the Lagrangian w.r.t. λ — satisfies the second condition (w^T w \le c):
  \partial_\lambda L(w, \lambda) = \begin{cases} 0 & \text{if } w^T w < c \\ w^T w - c & \text{if } w^T w \ge c \end{cases}

L2 Regularized Training vs. Norm Constrained Model Class
Norm-constrained training:  \arg\min_w (y - w^T x)^2   subject to   w^T w \le c
L2-regularized training:    \arg\min_w \lambda w^T w + (y - w^T x)^2
Lagrangian:                 \arg\min_{w,\lambda} (y - w^T x)^2 + \lambda (w^T w - c)
• Lagrangian = norm-constrained training:
  \exists \lambda \ge 0 : \big(\partial_w L(y, w) = \lambda \, \partial_w w^T w\big) \wedge \big(w^T w \le c\big)
• Lagrangian = L2-regularized training:
  – Hold λ fixed
  – Then it is equivalent to solving the norm-constrained problem, for some c

Recap #2: Ridge Regularization
• Ridge Regression: L2-regularized least-squares = norm-constrained model
  \arg\min_{w,b} \lambda w^T w + L(w) \;\equiv\; \arg\min_{w,b} L(w) \text{ subject to } w^T w \le c
• Large λ → more stable predictions
  – Less likely to overfit the training data
  – Too large λ → underfit
• Works with other losses
  – Hinge loss, log loss, etc.

Hallucinating Data Points
(Omitting b for simplicity.)
\arg\min_w \lambda w^T w + \sum_{i=1}^N (y_i - w^T x_i)^2, \qquad \partial_w = 2\lambda w - 2\sum_{i=1}^N x_i (y_i - w^T x_i)
• Instead, hallucinate D extra data points \{(\sqrt{\lambda}\, e_d,\ 0)\}_{d=1}^D, where e_d is the unit vector along the d-th dimension:
  \arg\min_w \sum_{d=1}^D (0 - w^T \sqrt{\lambda}\, e_d)^2 + \sum_{i=1}^N (y_i - w^T x_i)^2
  \partial_w = 2\sum_{d=1}^D \sqrt{\lambda}\, e_d (w^T \sqrt{\lambda}\, e_d) - 2\sum_{i=1}^N x_i (y_i - w^T x_i)
  and 2\sum_{d=1}^D \sqrt{\lambda}\, e_d (w^T \sqrt{\lambda}\, e_d) = 2\sum_{d=1}^D \lambda\, w_d\, e_d = 2\lambda w
• Identical to L2 regularization!
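The hallucinated-data-points equivalence is easy to verify numerically. The sketch below is my addition (not from the lecture); the synthetic data, dimensions, and variable names are arbitrary assumptions. It compares the closed-form ridge solution with plain least squares on the dataset augmented by the D points (√λ · e_d, 0):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 5, 2.0
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Ridge via the regularized normal equations.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Ordinary least squares on the augmented data: D hallucinated points
# (sqrt(lam) * e_d, 0), one per dimension, appended to the training set.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_hallucinated, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_hallucinated))   # True: the two solutions coincide
```

The augmented squared loss contains the extra terms \sum_d (\sqrt{\lambda} w_d)^2 = \lambda w^T w, so the two objectives, and hence their minimizers, are identical.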
Extension: Multi-task Learning
• Two prediction tasks:
  – Spam filter for Alice
  – Spam filter for Bob
• Limited training data for both…
  – … but Alice is similar to Bob

Extension: Multi-task Learning
(Omitting b for simplicity.)
• Two training sets, with N relatively small:
  S^{(1)} = \{(x_i^{(1)}, y_i^{(1)})\}_{i=1}^N, \qquad S^{(2)} = \{(x_i^{(2)}, y_i^{(2)})\}_{i=1}^N
• Option 1: train separately
  \arg\min_w \lambda w^T w + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2
  \arg\min_v \lambda v^T v + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2
  Both models have high error.
• Option 2: train jointly
  \arg\min_{w,v} \lambda w^T w + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2 + \lambda v^T v + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2
  Doesn't accomplish anything! (w and v don't depend on each other.)

Multi-task Regularization
\arg\min_{w,v} \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2
(standard regularization + multi-task regularization + training loss)
• Prefer w and v to be "close"
  – Controlled by γ
  – Tasks similar → larger γ helps!
  – Tasks not identical → γ not too large
[Plot: test loss on Task 2 as a function of γ.]

Lasso
L1-Regularized Least-Squares

L1 Regularized Least Squares
(Omitting b for simplicity.)
L1: \arg\min_w \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2 \qquad \text{vs.} \qquad L2: \arg\min_w \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2
• L2 penalty: shrinking a weight from 2 to 1 saves 3 (4 → 1), but shrinking it from 1 to 0 saves only 1
• L1 penalty: shrinking a weight from 2 to 1 and from 1 to 0 each save exactly 1

Subgradient (sub-differential)
\nabla_a R(a) = \{c \mid \forall a' : R(a') - R(a) \ge c\,(a' - a)\}
• Differentiable case: \nabla_a R(a) = \{\partial_a R(a)\}
• L1: \nabla_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}
  Continuous range of subgradients at w_d = 0!

L1 Regularized Least Squares
L1: \arg\min_w \lambda \|w\|_1 + \sum_i (y_i - w^T x_i)^2 \qquad \text{vs.} \qquad L2: \arg\min_w \lambda \|w\|_2^2 + \sum_i (y_i - w^T x_i)^2
• L2: \nabla_{w_d} \|w\|_2^2 = 2 w_d
• L1: \nabla_{w_d} \|w\|_1 = \begin{cases} -1 & \text{if } w_d < 0 \\ +1 & \text{if } w_d > 0 \\ [-1, +1] & \text{if } w_d = 0 \end{cases}

Lagrange Multipliers
(Omitting b and using a single training point for simplicity.)
\arg\min_w L(y, w) \equiv (y - w^T x)^2   subject to   \|w\|_1 \le c
\exists \lambda \ge 0 : \big(\partial_w L(y, w) \in \lambda \nabla_w \|w\|_1\big) \wedge \big(\|w\|_1 \le c\big)
Solutions tend to be at corners of the L1 ball!
(See http://en.wikipedia.org/wiki/Lagrange_multiplier)

Sparsity
(Omitting b for simplicity.)
• w is sparse if it is mostly 0's — small L0 norm:
  \|w\|_0 = \sum_d 1[w_d \neq 0]
• Why not L0 regularization?
  \arg\min_w \lambda \|w\|_0 + \sum_{i=1}^N (y_i - w^T x_i)^2
  – Not continuous!
• L1 induces sparsity — and is continuous! (See the sketch after this slide.)
  \arg\min_w \lambda \|w\|_1 + \sum_{i=1}^N (y_i - w^T x_i)^2
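To make the sparsity claim concrete, here is a small demo I added rather than anything from the lecture: it minimizes the L1-regularized least-squares objective with proximal gradient descent (ISTA, not covered in the slides), whose update is the soft-threshold operator implied by the subgradient of the L1 norm, and compares the result with ridge on synthetic data. The function name lasso_ista, the data, and all constants are assumptions for illustration.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=5000):
    """Proximal gradient (ISTA) for  lam * ||w||_1 + ||y - X w||^2  (no offset term)."""
    w = np.zeros(X.shape[1])
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)    # step size from the Lipschitz constant of the loss gradient
    for _ in range(steps):
        grad = -2 * X.T @ (y - X @ w)              # gradient of the squared loss
        z = w - eta * grad
        w = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # soft-threshold = prox of the L1 penalty
    return w

rng = np.random.default_rng(0)
N, D = 50, 20
w_true = np.zeros(D)
w_true[[3, 7]] = [1.5, -2.0]                       # sparse ground truth
X = rng.normal(size=(N, D))
y = X @ w_true + 0.1 * rng.normal(size=N)

w_l1 = lasso_ista(X, y, lam=5.0)
w_l2 = np.linalg.solve(X.T @ X + 5.0 * np.eye(D), X.T @ y)        # ridge, for comparison
print("L1 non-zero dims:", np.flatnonzero(np.abs(w_l1) > 1e-6))   # typically just the true support
print("L2 non-zero dims:", np.flatnonzero(np.abs(w_l2) > 1e-6))   # typically all D dimensions
```

On this data the L1 solution zeroes out the irrelevant dimensions while ridge only shrinks them — the behavior formalized by the Lasso guarantee later in the lecture.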
Why is Sparsity Important?
• Computational / memory efficiency
  – Instead of storing 1M numbers in a dense array, store 2 numbers per non-zero entry
    • (Index, Value) pairs, e.g. [ (50, 1), (51, 1) ]
  – Dot product w^T x is more efficient
• Sometimes the true w is sparse
  – Want to recover the non-zero dimensions

Lasso Guarantee
\arg\min_w \lambda \|w\|_1 + \sum_{i=1}^N (y_i - w^T x_i + b)^2
• Suppose the data are generated as:  y_i \sim \text{Normal}(w^{*T} x_i, \sigma^2)
• Then if:
  \lambda > \frac{2}{k} \sqrt{\frac{2\sigma^2 \log D}{N}}
• With high probability (increasing with N):
  \text{Supp}(w) \subseteq \text{Supp}(w^*)   — high precision parameter recovery
  and if additionally \forall d : |w_d| \ge \lambda c, then \text{Supp}(w) = \text{Supp}(w^*)   — sometimes high recall
  where \text{Supp}(w^*) = \{d \mid w^*_d \neq 0\}
See also:
https://www.cs.utexas.edu/~pradeepr/courses/395T-LT/filez/highdimII.pdf
http://www.eecs.berkeley.edu/~wainwrig/Papers/Wai_SparseInfo09.pdf

[Example revisited: the Person / Age>10 / Male? / Height>55" table from the Ridge example.]

Recap: Lasso vs Ridge
• Model assumptions
  – Lasso learns a sparse weight vector
• Predictive accuracy
  – Lasso often not as accurate
  – Re-run least squares on the dimensions selected by Lasso
• Ease of inspection
  – Sparse w's are easier to inspect
• Ease of optimization
  – Lasso is somewhat trickier to optimize

Recap: Regularization
(Omitting b for simplicity.)
• L2:  \arg\min_w \lambda \|w\|_2^2 + \sum_{i=1}^N (y_i - w^T x_i)^2
• L1 (Lasso):  \arg\min_w \lambda \|w\|_1 + \sum_{i=1}^N (y_i - w^T x_i)^2
• Multi-task:
  \arg\min_{w,v} \lambda w^T w + \lambda v^T v + \gamma (w - v)^T (w - v) + \sum_{i=1}^N (y_i^{(1)} - w^T x_i^{(1)})^2 + \sum_{i=1}^N (y_i^{(2)} - v^T x_i^{(2)})^2
• [Insert Yours Here!]

Next Lecture: Recent Applications of Lasso
• Cancer detection
• Personalization via Twitter
Recitation on Wednesday: Probability & Statistics
Image sources:
http://statweb.stanford.edu/~tibs/ftp/canc.pdf
https://dl.dropboxusercontent.com/u/16830382/papers/badgepaper-kdd2013.pdf