Generalizations of the LASSO Penalty
Department of Statistical Science - UCL
18th March 2016
Hojjat Akhondi-Asl
Introduction
• In the previous talk, we considered some generalizations of the LASSO obtained by varying the loss function.
– In this talk, I will also show one further variation of that kind, called the robust PCA method.
• The main aim of this talk, however, is to vary the LASSO ℓ1-penalty itself and show the useful features this brings, e.g.
– LASSO does not perform well with highly correlated variables ⇒
Elastic Net
– Features may be structurally grouped (select all/omit all) ⇒ Group
LASSO
– We may want neighbouring coefficients to be the same or similar
(piecewise constant) ⇒ Fused LASSO
– We may want neighbouring coefficients to be piecewise polynomial
⇒ Trend Filtering
Robust PCA
• Let us assume that the data matrix can be modelled as

H = L + S + N,

• Here, L is a low-rank matrix, S is a sparse matrix that contains the outliers, and N contains the Gaussian noise.
• The decomposition can be found by solving a convex optimisation problem in L and S (Candès et al. 2009):

min_{L,S} ||L||_* + λ1 ||S||_1 + ||H − L − S||_2^2
• Here, ||L||_* is the nuclear norm, which measures the sum of the singular values of L, that is ||L||_* = ∑_i σ_i(L).
• ||S||_1 is used to impose sparsity on S so as to capture the outliers.
Robust PCA
• The term ||H − L − S||_2^2 penalises the residual with a quadratic cost function, which is optimal when the noise N is Gaussian distributed.
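• A minimal alternating-minimisation sketch of this decomposition is given below; the weight mu on the quadratic term, the thresholds and the iteration count are illustrative choices of mine, not part of the talk.

% Sketch: alternating minimisation for
%   min_{L,S}  ||L||_* + lambda*||S||_1 + (mu/2)*||H - L - S||_F^2
% (mu and max_iter are assumed tuning parameters).
function [L, S] = robust_pca_sketch(H, lambda, mu, max_iter)
L = zeros(size(H)); S = zeros(size(H));
for iter = 1:max_iter
    % L-update: singular value thresholding of the residual H - S
    [U, Sig, V] = svd(H - S, 'econ');
    L = U * max(Sig - 1/mu, 0) * V';
    % S-update: entrywise soft-thresholding of the residual H - L
    R = H - L;
    S = max(abs(R) - lambda/mu, 0) .* sign(R);
end
end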
Definitions and Symbols
• We define LASSO as:

min_X (1/2) ||Y − DX||_2^2 + λ ||X||_1
• Here:
– Y is the input signal/data vector, of size R^{M×1}
– D is the dictionary/feature matrix whose columns are the covariates/atoms, of size R^{M×N}, i.e. D = [D1 D2 . . . DN]
– X is the sparse coefficient vector, of size R^{N×1}
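• For reference, here is a minimal proximal-gradient (ISTA) sketch for this problem; the fixed step size and iteration count are illustrative choices of mine, not something covered in the talk.

% Sketch: ISTA for  min_X 0.5*||Y - D*X||_2^2 + lambda*||X||_1
function X = lasso_ista(Y, D, lambda, max_iter)
t = 1 / norm(D)^2;                 % step size: 1 / largest eigenvalue of D'*D
X = zeros(size(D, 2), 1);
for iter = 1:max_iter
    G = X - t * (D' * (D * X - Y));              % gradient step on the quadratic loss
    X = max(abs(G) - lambda * t, 0) .* sign(G);  % soft-thresholding (prox of the l1 term)
end
end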
LASSO and Highly Correlated Variables
• LASSO does not handle highly correlated variables very well
• The coefficient paths tend to be erratic and can sometimes show wild
behaviour.
• Consider a simple example:
– Say the coefficient for a feature/atom Dj at a particular value of λ is Xj > 0.
– Let us also say that feature Dj+1 = Dj, i.e. we have an identical copy (an extreme case).
– The two features can share this coefficient in infinitely many ways: any split with X̃j + X̃j+1 = Xj gives the same fit.
– The loss and the ℓ1 penalty are indifferent to the split, so the coefficients for this pair are not uniquely defined.
– A quadratic penalty, in contrast, will divide Xj exactly equally between the two twins.
Elastic Net
• In practice it is unlikely that two features are identical; however, the features could well be highly correlated.
• The elastic net makes a compromise between the ridge and the LASSO penalties (Zou and Hastie 2005):

min_X (1/2) ||Y − DX||_2^2 + λα ||X||_1 + (1/2) λ(1 − α) ||X||_2^2    (1)
• Here, if α = 1 the equation collapses down to LASSO and if α = 0, the
equation collapses down to ridge regression.
• Let us consider the following example (a simulation sketch follows the list):
– Z1, Z2, γj ∼ N(0, 1), where j = 1, . . . , 6 and N = 100
– Dj = Z1 + γj/5 for j = 1, 2, 3
– Dj = Z2 + γj/5 for j = 4, 5, 6
– Y = 3Z1 − 1.5Z2 + 2
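• A sketch of this simulation using lasso() from the Statistics and Machine Learning Toolbox; I assume the "+ 2" term denotes Gaussian noise with standard deviation 2, and alpha = 0.5 for the Elastic Net fit (both are my assumptions, not stated on the slide).

% Sketch of the simulated example (noise term and alpha value are assumptions)
N = 100;
Z1 = randn(N, 1);  Z2 = randn(N, 1);
D = zeros(N, 6);
for j = 1:3, D(:, j) = Z1 + randn(N, 1) / 5; end   % features correlated with Z1
for j = 4:6, D(:, j) = Z2 + randn(N, 1) / 5; end   % features correlated with Z2
Y = 3 * Z1 - 1.5 * Z2 + 2 * randn(N, 1);
B_lasso = lasso(D, Y, 'Alpha', 1);                  % LASSO coefficient paths
B_enet  = lasso(D, Y, 'Alpha', 0.5);                % Elastic Net coefficient paths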
Elastic Net
[Figure: coefficient paths against Lambda for the LASSO (top panel) and the Elastic Net (bottom panel) on this example.]
Elastic Net Ball
• The Elastic-Net ball shares attributes of the ℓ2 ball and the ℓ1 ball.
• The sharp corners and edges encourage selection.
• The curved contours encourage sharing of coefficients.
[Figure 1: the Elastic Net ball compared with the LASSO ball.]
Elastic Net Minimisation - ADMM
• Augmented Lagrangian Framework (both penalties are placed on the split variable Z):

min_{X,Z} (1/2) ||Y − DX||_2^2 + λα ||Z||_1 + (1/2) λ(1 − α) ||Z||_2^2
subject to: Z − X = 0
• The procedure for Elastic Net using ADMM:
• X^{k+1} = (D^T D + ρI)^{−1} (D^T Y + U^k + ρZ^k)
• Z^{k+1} = S_{λα/γ}( (ρ/γ) X^{k+1} − (1/γ) U^k )
• U^{k+1} = U^k + ρ(Z^{k+1} − X^{k+1})
where S is the soft-thresholding operator and γ = λ(1 − α) + ρ
Elastic Net Minimisation - ADMM
function x_new = Elastic_Net_ADMM(b, D, options)
% ADMM for the Elastic Net:
%   min_x 0.5*||b - D*x||_2^2 + lambda*alpha*||x||_1 + 0.5*lambda*(1-alpha)*||x||_2^2
% options.beta plays the role of the ADMM penalty parameter rho.
A = D' * D;
A_inv = inv(A + options.beta * eye(size(A)));
z = 0; u = 0; x = 0;
gamma = options.lambda * (1 - options.alpha) + options.beta;
halt = false;
iter = 0; dx_norm = inf;
while ~halt && iter < options.max_iter
    iter = iter + 1;
    % X-update
    x_new = A_inv * (D' * b + u + options.beta * z);
    % relative change of the iterate, used as the stopping criterion
    dx_norm(iter) = (norm(x - x_new, 2)^2) / (norm(x, 2)^2);
    if dx_norm(iter) < options.thres
        halt = true;
    else
        x = x_new;
    end
    % Z-update: soft-thresholding
    z = prox_lasso((options.beta / gamma) * x_new - (1 / gamma) * u, ...
        options.lambda * options.alpha / gamma);
    % dual update
    u = u + options.beta * (z - x_new);
end
end

% Soft-thresholding operator
function x_new = prox_lasso(x, thres)
x_new = max(abs(x) - thres, 0) .* sign(x);
end
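• A hypothetical call to the routine above, using the options fields it reads (beta plays the role of the ADMM parameter ρ); the values are illustrative only.

options.lambda   = 0.1;    % overall regularisation weight
options.alpha    = 0.5;    % mix between l1 (alpha = 1) and l2 (alpha = 0)
options.beta     = 1;      % ADMM penalty parameter rho
options.max_iter = 500;
options.thres    = 1e-6;   % stopping threshold on the relative change of x
x_hat = Elastic_Net_ADMM(Y, D, options);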
Group LASSO
• There are many regression problems in which the covariates/atoms have
a natural group structure
• It is desirable to have all coefficients within a group become nonzero (or
zero) simultaneously ⇒ e.g. classification problem
• Consider a linear regression model involving K groups of
covariates/atoms.
• In this case, for k = 1, . . . , K, the sub-matrix Dk collects the covariates/atoms in group k.
• The goal is to predict the response Y based on the collection of covariates D = [D1, D2, . . . , DK].
Group LASSO
• Group LASSO is defined as:

min_X (1/2) ||Y − DX||_2^2 + λ ||X||_{2,1}

• where

||X||_{2,1} = ∑_{k=1}^{K} ||Xk||_2

• The sparsity is now on the K groups and not on individual atoms.
• Either the entire vector Xk will be zero, or all its elements will be nonzero (depending on λ ≥ 0).
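• A small sketch of the mixed norm, assuming a hypothetical index vector groups that assigns each coefficient of X to one of K groups:

% Sketch: ||X||_{2,1} for a grouped coefficient vector
function v = mixed_norm_21(X, groups)
v = 0;
for k = 1:max(groups)
    v = v + norm(X(groups == k), 2);   % l2 norm of the k-th block
end
end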
Group LASSO Ball
• If each Dk ∈ R^{M×1}, i.e. every group contains a single atom, then ||X||_{2,1} = ||X||_1.
• That is, if all the groups are singletons, the optimization problem reduces to LASSO.
• The Group LASSO ball shares attributes of both the ℓ2 and ℓ1 balls.
[Figure 2: the Group LASSO ball compared with the LASSO ball.]
Group LASSO
• Generally we expect:
• Example:

[Figure: example results for the LASSO (left panel) and the Group LASSO (right panel).]
Group LASSO Minimisation
• Augmented Lagrangian Framework:

min_{X,Z} (1/2) ||Y − DX||_2^2 + λ ||Z||_{2,1}
subject to: Z − X = 0

• The procedure for Group LASSO using ADMM:
• X^{k+1} = (D^T D + ρI)^{−1} (D^T Y + U^k + ρZ^k)
• Z_g^{k+1} = ((||P_g||_2 − λ/ρ) / ||P_g||_2) P_g  if λ/ρ < ||P_g||_2, and Z_g^{k+1} = 0 otherwise,
  where P = X^{k+1} − (1/ρ) U^k and P_g is the block of P corresponding to group g
• U^{k+1} = U^k + ρ(Z^{k+1} − X^{k+1})
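• The Z-update above is a block soft-thresholding operator; a sketch is given below, assuming P and the groups index vector come from an ADMM loop analogous to the Elastic Net routine shown earlier, with thres = λ/ρ.

% Sketch: block soft-thresholding, i.e. the prox of (lambda/rho)*||.||_{2,1}
function Z = prox_group_lasso(P, groups, thres)
Z = zeros(size(P));
for k = 1:max(groups)
    idx = (groups == k);
    ng = norm(P(idx), 2);
    if ng > thres
        Z(idx) = (1 - thres / ng) * P(idx);   % shrink the whole block
    end                                        % otherwise the block is set to zero
end
end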
Sparse Group LASSO
• When a group is included in a Group LASSO fit, all the coefficients in that group are nonzero. This is a consequence of the ℓ2 norm.
• Sometimes we would like sparsity both with respect to which groups are selected, and with respect to which coefficients are nonzero within a group.
• The sparse Group LASSO is defined as:

min_X (1/2) ||Y − DX||_2^2 + λ(1 − α) ||X||_{2,1} + λα ||X||_1    (2)

• Much like the Elastic Net, the parameter α creates a bridge between the Group LASSO (α = 0) and the LASSO (α = 1).
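• The proximal operator of the combined penalty can be evaluated by soft-thresholding each coefficient and then block-shrinking each group; a sketch assuming the prox_group_lasso helper and the groups vector from the previous slides, with step size t:

% Sketch: prox of  t*( lambda*alpha*||.||_1 + lambda*(1-alpha)*||.||_{2,1} )
function Z = prox_sparse_group_lasso(P, groups, lambda, alpha, t)
S = max(abs(P) - lambda * alpha * t, 0) .* sign(P);        % l1 soft-thresholding
Z = prox_group_lasso(S, groups, lambda * (1 - alpha) * t); % group-wise shrinkage
end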
Sparse Group LASSO
• Sparse Group LASSO vs. Group LASSO:
Fused Lasso
• Sometimes data are very noisy, so some kind of smoothing is essential.
• One assumption we could make is that the underlying signal is constant within segments, so we expect a piecewise-constant estimate.
• The Fused Lasso signal approximator exploits this structure:

min_X (1/2) ||Y − X||_2^2 + λ1 ||X||_1 + λ2 ||F X||_1
• F is a difference matrix of dimension (n − 1) × n:

F = [ 1 −1  0 · · ·  0
      0  1 −1 · · ·  0
      ⋮             ⋱
      0  0 · · ·  1 −1 ]
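• A short sketch of building F explicitly (n = 10 is just for illustration):

n = 10;
I = speye(n);
F = I(1:n-1, :) - I(2:n, :);   % row i of F is [0 ... 0 1 -1 0 ... 0]
% F*X then returns the vector of neighbouring differences X(i) - X(i+1)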
Fused Lasso
• The first penalty is the ℓ1-norm, and serves to shrink the Xi toward zero.
• The second penalty encourages neighbouring coefficients Xi to be similar.
• This will cause some of them to be identical (also known as total-variation de-noising).
• In the previous equation every observation is associated with a coefficient. More generally, we can solve

min_X (1/2) ||Y − DX||_2^2 + λ1 ||X||_1 + λ2 ||F X||_1

in an alternative domain or dictionary.
Fused LASSO Minimisation
• We want to solve:

min_{X,Z,W} (1/2) ||Y − X||_2^2 + λ1 ||Z||_1 + λ2 ||W||_1
subject to: Z − X = 0 and W − F X = 0

• The procedure for Fused Lasso using ADMM:
• X^{k+1} = ((1 + ρ1) I + ρ2 F^T F)^{−1} (Y + U^k + ρ1 Z^k + F^T(V^k + ρ2 W^k))
• Z^{k+1} = S_{λ1/ρ1}(X^{k+1} − (1/ρ1) U^k)
• W^{k+1} = S_{λ2/ρ2}(F X^{k+1} − (1/ρ2) V^k)
• U^{k+1} = U^k + ρ1(Z^{k+1} − X^{k+1})
• V^{k+1} = V^k + ρ2(W^{k+1} − F X^{k+1})
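• A minimal ADMM sketch of the procedure above; the penalty parameters rho1, rho2 and the iteration count are illustrative choices of mine.

% Sketch: ADMM for the fused LASSO signal approximator
function X = fused_lasso_admm(Y, lambda1, lambda2, rho1, rho2, max_iter)
Y = Y(:);
n = numel(Y);
I = speye(n);
F = I(1:n-1, :) - I(2:n, :);                   % first-order difference matrix
M = (1 + rho1) * I + rho2 * (F' * F);          % system matrix of the X-update
X = zeros(n, 1); Z = zeros(n, 1); W = zeros(n-1, 1);
U = zeros(n, 1); V = zeros(n-1, 1);
soft = @(v, t) max(abs(v) - t, 0) .* sign(v);  % soft-thresholding operator
for iter = 1:max_iter
    X = M \ (Y + U + rho1 * Z + F' * (V + rho2 * W));   % X-update
    Z = soft(X - U / rho1, lambda1 / rho1);             % Z-update
    W = soft(F * X - V / rho2, lambda2 / rho2);         % W-update
    U = U + rho1 * (Z - X);                             % dual updates
    V = V + rho2 * (W - F * X);
end
end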
Fused LASSO
• Yahoo Stock Price:
[Figure: fused LASSO fits to the Yahoo stock price series (two panels).]
Trend Filtering
• The first-order absolute difference penalty in the fused LASSO can be
generalized with a higher-order difference:
min_X (1/2) ||Y − X||_2^2 + λ ||F^{M+1} X||_1

• Here F^{M+1} is a matrix that computes discrete differences of order M + 1.
• e.g. F^2 is a difference matrix of dimension (n − 2) × n:

F^2 = [ 1 −2  1  0 · · ·  0
        0  1 −2  1 · · ·  0
        ⋮                ⋱
        0  0 · · ·  1 −2  1 ]

• In general, trend filtering of order M results in solutions that are piecewise polynomials of degree M.
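• A sketch of building F^{M+1} by composing first-order differences; with M = 1 this reproduces the second-difference matrix above (n = 10 and M = 1 are just for illustration):

n = 10; M = 1;
FM = speye(n);
for m = 1:(M + 1)
    k = size(FM, 1);
    E = speye(k);
    FM = (E(1:k-1, :) - E(2:k, :)) * FM;   % apply one more first-order difference
end
full(FM)   % for M = 1: rows look like [1 -2 1 0 ...], [0 1 -2 1 ...], ...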
Trend Filtering
• Yahoo Stock Price (Trend Filtering with Order 2):
[Figure: trend filtering fit of order 2 to the Yahoo stock price series (two panels).]
Trend Filtering
• Yahoo Stock Price (Trend Filtering with Order 3):
[Figure: trend filtering fit of order 3 to the Yahoo stock price series (two panels).]
Q/A
Thank you