Annotated Slides - University of Washington

Variable Selection
LASSO: Sparse Regression
Machine Learning – CSE446
Carlos Guestrin
University of Washington
April 10, 2013
©2005-2013 Carlos Guestrin
Regularization in Linear Regression

- Overfitting usually leads to very large parameter choices, e.g. (illustrated in the sketch below):
  - -2.2 + 3.1 X - 0.30 X²
  - -1.1 + 4,700,910.7 X - 8,585,638.4 X² + ...
- Regularized or penalized regression aims to impose a "complexity" penalty by penalizing large weights
  - "Shrinkage" method
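The effect is easy to reproduce. Below is a minimal, hypothetical sketch (not part of the slides; the data and penalty strength are made up) in which an unregularized high-degree polynomial fit to a few noisy points produces very large weights, while a small L2 (ridge) penalty shrinks them.

```python
# Hypothetical illustration: an exact high-degree polynomial fit to noisy points
# tends to produce huge coefficients; an L2 (ridge) penalty shrinks them.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

degree = 7
H = np.vander(x, degree + 1, increasing=True)   # features h_i(x) = x^i

# Unregularized least squares
w_ls = np.linalg.lstsq(H, t, rcond=None)[0]

# Ridge: w = (H^T H + lam I)^{-1} H^T t
lam = 1e-3
w_ridge = np.linalg.solve(H.T @ H + lam * np.eye(degree + 1), H.T @ t)

print("least-squares weights:", np.round(w_ls, 1))    # typically very large
print("ridge weights:        ", np.round(w_ridge, 1)) # much smaller
```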
Variable Selection

- Ridge regression: penalizes large weights
- What if we want to perform "feature selection"?
  - E.g., which regions of the brain are important for word prediction?
  - Can't simply choose the features with the largest coefficients in the ridge solution
  - Computationally intractable to perform "all subsets" regression
- Try a new penalty: penalize non-zero weights
  - Regularization penalty: $\lambda \sum_{i=1}^k |w_i|$, the L1 norm of the weights scaled by λ
  - Leads to sparse solutions (see the sketch after this list)
  - Just like ridge regression, the solution is indexed by a continuous parameter λ
  - This simple approach has changed statistics, machine learning & electrical engineering
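As a quick, hypothetical illustration of that sparsity (my own example, not from the slides; the data and penalty strengths are made up), scikit-learn's Ridge and Lasso fit to data where only two features matter behave very differently:

```python
# L1 penalty drives many weights exactly to zero; L2 only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 relevant features
y = X @ true_w + 0.5 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))  # usually all 10
print("lasso nonzero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))  # usually just a few
```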
LASSO Regression

- LASSO: least absolute shrinkage and selection operator
- New objective:

  $$F(w) = \sum_{j=1}^N \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)^2 + \lambda \sum_{i=1}^k |w_i|$$
Geometric Intuition for Sparsity

[Figure: "Picture of Lasso and Ridge regression", from Rob Tibshirani's slides. Two panels in (β1, β2) space, each showing the least-squares contours centered at the unconstrained estimate β̂ (w_MLE), with the Lasso constraint region (a diamond) in one panel and the Ridge constraint region (a circle) in the other. The diamond's corners lie on the axes, which is why the Lasso solution often has some coefficients exactly zero.]
Optimizing the LASSO Objective

- LASSO solution (a code transcription follows below):

  $$\hat{w}_{\text{LASSO}} = \arg\min_{w} \; \sum_{j=1}^N \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)^2 + \lambda \sum_{i=1}^k |w_i|$$
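For concreteness, here is a minimal transcription of this objective into code (my own sketch, not the course's code), with the features $h_i(x_j)$ collected in a matrix:

```python
# LASSO objective: RSS plus lambda times the L1 norm of the weights.
# H has shape (N, k) with H[j, i] = h_i(x_j); t holds the targets;
# w0 is the intercept, w the weight vector, lam stands in for lambda.
import numpy as np

def lasso_objective(w0, w, H, t, lam):
    residuals = t - (w0 + H @ w)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))
```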
Coordinate Descent

- Given a function F
  - Want to find its minimum
- Often, it is hard to minimize over all coordinates at once, but easy to minimize over a single coordinate with the others held fixed
- Coordinate descent: pick a coordinate, minimize F with respect to that coordinate while keeping all others fixed, then repeat (a generic sketch follows below)
- How do we pick the next coordinate? At random, or by cycling through the coordinates sequentially
- Super useful approach for *many* problems
  - Converges to the optimum in some cases, such as LASSO
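A generic coordinate-descent skeleton might look like the following (my own illustration, not the lecture's code; `argmin_along_coordinate` is a hypothetical helper that returns the minimizer along one coordinate, which for LASSO has the closed form derived later in the slides):

```python
# Repeatedly minimize F along one coordinate while holding the others fixed.
import numpy as np

def coordinate_descent(F, argmin_along_coordinate, w_init, n_passes=100):
    w = np.array(w_init, dtype=float)
    for _ in range(n_passes):
        for l in range(w.size):                 # sequential choice; random also works
            w[l] = argmin_along_coordinate(F, w, l)
    return w
```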
Optimizing the LASSO Objective One Coordinate at a Time

$$F(w) = \sum_{j=1}^N \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)^2 + \lambda \sum_{i=1}^k |w_i|$$

- Residual sum of squares (RSS): the first term above
  - Taking the derivative with respect to a single coordinate $w_\ell$:

    $$\frac{\partial}{\partial w_\ell} \text{RSS}(w) = -2 \sum_{j=1}^N h_\ell(x_j) \left( t(x_j) - \Big( w_0 + \sum_{i=1}^k w_i h_i(x_j) \Big) \right)$$

- Penalty term: $\lambda \sum_{i=1}^k |w_i|$, which is not differentiable at $w_i = 0$; this is handled with subgradients on the next slides (a short derivation connecting the RSS derivative to the next slide follows below)
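To connect this derivative to the $a_\ell w_\ell - c_\ell$ form used on the following slides, separate the $w_\ell$ term from the rest of the prediction (my reconstruction of the intermediate step, following the slide's notation):

```latex
\begin{align*}
\frac{\partial}{\partial w_\ell} \text{RSS}(w)
  &= -2 \sum_{j=1}^N h_\ell(x_j)\Big( t(x_j) - w_0 - \sum_{i \ne \ell} w_i h_i(x_j) - w_\ell h_\ell(x_j) \Big) \\
  &= \underbrace{\Big( 2 \sum_{j=1}^N h_\ell(x_j)^2 \Big)}_{a_\ell} w_\ell
     \;-\; \underbrace{2 \sum_{j=1}^N h_\ell(x_j)\Big( t(x_j) - w_0 - \sum_{i \ne \ell} w_i h_i(x_j) \Big)}_{c_\ell}
  \;=\; a_\ell w_\ell - c_\ell
\end{align*}
```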
Subgradients of Convex Functions

- Gradients lower bound convex functions: if $f$ is convex and differentiable at $x$, then $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ for all $y$
- Gradients are unique at $w$ iff the function is differentiable at $w$
- Subgradients: generalize gradients to non-differentiable points
  - Any plane that lower bounds the function: $v$ is a subgradient of $f$ at $x$ if $f(y) \ge f(x) + v^\top (y - x)$ for all $y$
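A standard example, worth writing out because it is exactly the non-differentiable piece of the LASSO penalty (my addition, not on the slide): the subdifferential of the absolute value,

```latex
\partial_w \, |w| \;=\;
\begin{cases}
\{-1\} & w < 0 \\
[-1,\, 1] & w = 0 \\
\{+1\} & w > 0
\end{cases}
```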
Taking the Subgradient

- Gradient of the RSS term:

  $$\frac{\partial}{\partial w_\ell} \text{RSS}(w) = a_\ell w_\ell - c_\ell$$

  where

  $$a_\ell = 2 \sum_{j=1}^N \big( h_\ell(x_j) \big)^2, \qquad
    c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \Big( t(x_j) - \big( w_0 + \sum_{i \ne \ell} w_i h_i(x_j) \big) \Big)$$

  - If there were no penalty, setting the derivative to zero would give $\hat{w}_\ell = c_\ell / a_\ell$
- Subgradient of the full objective: $\partial_{w_\ell} F(w) = a_\ell w_\ell - c_\ell + \lambda \, \partial_{w_\ell} |w_\ell|$, written out case by case on the next slide (a code transcription of $a_\ell$ and $c_\ell$ follows below)
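The quantities $a_\ell$ and $c_\ell$ translate directly into code; a sketch of my own, following the slide's formulas, with features in `H` of shape `(N, k)`, targets `t`, intercept `w0`, and current weights `w`:

```python
import numpy as np

def compute_a_c(H, t, w0, w, l):
    h_l = H[:, l]
    # Prediction using every feature except l (the i != l sum in c_l).
    partial_pred = w0 + H @ w - w[l] * h_l
    a_l = 2.0 * np.sum(h_l ** 2)
    c_l = 2.0 * np.sum(h_l * (t - partial_pred))
    return a_l, c_l
```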
Setting the Subgradient to 0

$$\partial_{w_\ell} F(w) =
\begin{cases}
a_\ell w_\ell - c_\ell - \lambda & w_\ell < 0 \\
[\,-c_\ell - \lambda,\; -c_\ell + \lambda\,] & w_\ell = 0 \\
a_\ell w_\ell - c_\ell + \lambda & w_\ell > 0
\end{cases}$$
Soft Thresholding

$$\hat{w}_\ell =
\begin{cases}
(c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\
0 & c_\ell \in [-\lambda,\, \lambda] \\
(c_\ell - \lambda)/a_\ell & c_\ell > \lambda
\end{cases}$$

[Figure: soft thresholding of $c_\ell$, from the Kevin Murphy textbook.]
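A direct translation of this update into code (my own sketch):

```python
def soft_threshold(c_l, a_l, lam):
    """Closed-form minimizer of the LASSO objective along coordinate l."""
    if c_l < -lam:
        return (c_l + lam) / a_l
    elif c_l > lam:
        return (c_l - lam) / a_l
    return 0.0
```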
Coordinate Descent for LASSO (aka Shooting Algorithm)

- Repeat until convergence (see the sketch after this list):
  - Pick a coordinate $\ell$ (at random or sequentially)
    - Set:

      $$\hat{w}_\ell =
      \begin{cases}
      (c_\ell + \lambda)/a_\ell & c_\ell < -\lambda \\
      0 & c_\ell \in [-\lambda,\, \lambda] \\
      (c_\ell - \lambda)/a_\ell & c_\ell > \lambda
      \end{cases}$$

    - Where:

      $$a_\ell = 2 \sum_{j=1}^N \big( h_\ell(x_j) \big)^2, \qquad
        c_\ell = 2 \sum_{j=1}^N h_\ell(x_j) \Big( t(x_j) - \big( w_0 + \sum_{i \ne \ell} w_i h_i(x_j) \big) \Big)$$

  - For convergence rates, see Shalev-Shwartz and Tewari 2009
- Other common technique = LARS
  - Least angle regression and shrinkage, Efron et al. 2004
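Putting the pieces together, a compact sketch of the shooting algorithm (my own implementation of the updates above, not the course's reference code; the intercept `w0` is kept fixed for simplicity):

```python
import numpy as np

def lasso_shooting(H, t, lam, w0=0.0, n_passes=100, tol=1e-6):
    """Coordinate descent for LASSO: H is (N, k), t the targets, lam the L1 strength."""
    N, k = H.shape
    w = np.zeros(k)
    a = 2.0 * np.sum(H ** 2, axis=0)          # a_l never changes, so precompute it
    for _ in range(n_passes):
        w_old = w.copy()
        for l in range(k):                    # sequential coordinate choice
            h_l = H[:, l]
            partial_pred = w0 + H @ w - w[l] * h_l
            c_l = 2.0 * np.sum(h_l * (t - partial_pred))
            if c_l < -lam:
                w[l] = (c_l + lam) / a[l]
            elif c_l > lam:
                w[l] = (c_l - lam) / a[l]
            else:
                w[l] = 0.0
        if np.max(np.abs(w - w_old)) < tol:   # stop once no coordinate moves much
            break
    return w
```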
Recall: Ridge Coefficient Path

[Figure: ridge regression coefficient path, i.e., coefficient values plotted against the regularization strength; from the Kevin Murphy textbook.]

- Typical approach: select λ using cross-validation (a brief example follows below)
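As a concrete, hypothetical example of that typical approach (my own; the data are made up), scikit-learn's LassoCV selects the penalty strength by cross-validation. Note that sklearn calls λ "alpha":

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(200)

model = LassoCV(cv=5).fit(X, y)        # 5-fold cross-validation over a grid of alphas
print("selected alpha:", model.alpha_)
print("coefficients:  ", np.round(model.coef_, 2))
```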
Now: LASSO Coefficient Path

[Figure: LASSO coefficient path, from the Kevin Murphy textbook; unlike the ridge path, coefficients reach exactly zero as the penalty grows.]
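A path like this can be traced by simply refitting over a grid of penalty strengths; a sketch of my own with made-up data, using scikit-learn's Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 0.3 * rng.standard_normal(200)

alphas = np.logspace(0.5, -2, 30)      # from strong to weak penalty
path = np.array([Lasso(alpha=a, max_iter=10000).fit(X, y).coef_ for a in alphas])

for a, coef in zip(alphas[::10], path[::10]):
    print(f"alpha={a:7.3f}  nonzero coefficients={np.sum(np.abs(coef) > 1e-8)}")
```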
LASSO Example

Estimated coefficients (from Rob Tibshirani's slides; blank entries mean the LASSO drops the variable, i.e., its coefficient is exactly zero):

  Term        Least Squares    Ridge     Lasso
  Intercept        2.465       2.452     2.468
  lcavol           0.680       0.420     0.533
  lweight          0.263       0.238     0.169
  age             -0.141      -0.046
  lbph             0.210       0.162     0.002
  svi              0.305       0.227     0.094
  lcp             -0.288       0.000
  gleason         -0.021       0.040
  pgg45            0.267       0.133
What you need to know

- Variable selection: find a sparse solution to the learning problem
- L1 regularization is one way to do variable selection
  - Applies beyond regression
  - Hundreds of other approaches are out there
- The LASSO objective is non-differentiable, but convex → use the subgradient
- There is no closed-form solution for the minimization → use coordinate descent
- The shooting algorithm is a very simple approach for solving LASSO