Generalizations of the LASSO Penalty
Hojjat Akhondi-Asl
Department of Statistical Science, UCL
18th March 2016

Introduction
• In the previous talk, we considered some generalizations of the LASSO obtained by varying the loss function.
– In this talk, I will also show another such variation, the robust PCA method.
• In this talk, we aim to vary the LASSO ℓ1-penalty itself and show its useful features, e.g.:
– LASSO does not perform well with highly correlated variables ⇒ Elastic Net
– Features may be structurally grouped (select all / omit all) ⇒ Group LASSO
– We may want neighbouring coefficients to be the same or similar (piecewise constant) ⇒ Fused LASSO
– We may want neighbouring coefficients to be piecewise polynomial ⇒ Trend Filtering

Robust PCA
• Let us assume that the data matrix can be modelled as H = L + S + N.
• Here, L is a low-rank matrix, S is a sparse matrix that contains the outliers and N contains the Gaussian noise.
• L and S can be recovered by solving a convex optimisation problem (Candès et al., 2009):
min_{L,S} ||L||_* + λ1 ||S||_1 + ||H − L − S||_F^2.
• ||L||_* is the nuclear norm, the sum of the singular values of L: ||L||_* = Σ_i σ_i(L).
• ||S||_1 imposes sparsity on S so that it captures the outliers.
• ||H − L − S||_F^2 penalises the residual with a quadratic (entrywise squared) cost, which is optimal when the noise is Gaussian.

Definitions and Symbols
• We define the LASSO as:
min_X (1/2) ||Y − DX||_2^2 + λ ||X||_1
• Here:
– Y is the input signal/data vector, of size R^{M×1}
– D is the dictionary/feature matrix of covariates/atoms, of size R^{M×N}, with D = [D_1 D_2 ... D_N]
– X is the sparse vector of coding coefficients, of size R^{N×1}

LASSO and Highly Correlated Variables
• The LASSO does not handle highly correlated variables very well.
• The coefficient paths tend to be erratic and can sometimes show wild behaviour.
• Consider a simple example:
– Say the coefficient of a feature/atom D_j at a particular value of λ is X_j > 0.
– Suppose feature D_{j+1} = D_j is an identical copy (the extreme case).
– The pair can share this coefficient in infinitely many ways: any split with X̃_j + X̃_{j+1} = X_j leaves the loss and the ℓ1 penalty unchanged, so the coefficients of this pair are not uniquely defined.
– A quadratic penalty, in contrast, divides X_j exactly equally between the two twins.

Elastic Net
• In practice it is unlikely that a feature has an identical copy, but features may well be highly correlated.
• The Elastic Net makes a compromise between the ridge and the LASSO penalties (Zou and Hastie, 2005):
min_X (1/2) ||Y − DX||_2^2 + λα ||X||_1 + (1/2) λ(1 − α) ||X||_2^2.   (1)
• If α = 1 the problem reduces to the LASSO; if α = 0 it reduces to ridge regression.
• Let us consider the following example:
– Z_1, Z_2, ε, γ_j ~ N(0, 1) for j = 1, ..., 6, with N = 100 observations
– D_j = Z_1 + γ_j/5 for j = 1, 2, 3
– D_j = Z_2 + γ_j/5 for j = 4, 5, 6
– Y = 3 Z_1 − 1.5 Z_2 + 2ε
[Figure: coefficient paths versus λ (Lambda) for the LASSO (top) and the Elastic Net (bottom) on this simulated example.]
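A rough MATLAB sketch of this simulated example is given below. It assumes the Statistics and Machine Learning Toolbox function lasso (whose Alpha argument corresponds to α in equation (1), with Alpha = 1 giving the LASSO); the random seed and the value Alpha = 0.3 are arbitrary illustrative choices rather than the settings used for the figure.

% Simulate the correlated-features example and trace coefficient paths.
% Assumes the Statistics and Machine Learning Toolbox lasso() function,
% which standardises the columns of D by default.
rng(1);                                   % arbitrary seed, for reproducibility
N  = 100;
Z1 = randn(N, 1);  Z2 = randn(N, 1);  e = randn(N, 1);
D  = zeros(N, 6);
for j = 1:3, D(:, j) = Z1 + randn(N, 1)/5; end    % features correlated with Z1
for j = 4:6, D(:, j) = Z2 + randn(N, 1)/5; end    % features correlated with Z2
Y  = 3*Z1 - 1.5*Z2 + 2*e;

[B_lasso, info_l] = lasso(D, Y, 'Alpha', 1.0);    % LASSO paths
[B_enet,  info_e] = lasso(D, Y, 'Alpha', 0.3);    % Elastic Net paths

figure;
subplot(2, 1, 1); plot(info_l.Lambda, B_lasso'); title('LASSO');       xlabel('Lambda');
subplot(2, 1, 2); plot(info_e.Lambda, B_enet');  title('Elastic Net'); xlabel('Lambda');

The Elastic Net paths for the three features within each correlated group tend to stay close together, whereas the LASSO tends to split the weight erratically between them.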
Elastic Net Ball
• The Elastic Net ball shares attributes of the ℓ2 ball and the ℓ1 ball.
• The sharp corners and edges encourage selection.
• The curved contours encourage sharing of coefficients between correlated features.
Figure 1: The Elastic Net and LASSO constraint balls.

Elastic Net Minimisation - ADMM
• Augmented Lagrangian framework (the ridge term is attached to the copy Z so that it enters the Z-update):
min_{X,Z} (1/2) ||Y − DX||_2^2 + λα ||Z||_1 + (1/2) λ(1 − α) ||Z||_2^2
subject to: Z − X = 0.
• The procedure for the Elastic Net using ADMM:
• X^{k+1} = (D^T D + ρI)^{-1} (D^T Y + U^k + ρ Z^k)
• Z^{k+1} = S_{λα/γ}( (ρ/γ) X^{k+1} − (1/γ) U^k )
• U^{k+1} = U^k + ρ (Z^{k+1} − X^{k+1})
where S is the soft-thresholding operator, U is the multiplier for the constraint Z − X = 0, and γ = λ(1 − α) + ρ.
• A MATLAB implementation (options.beta plays the role of ρ):

function x_new = Elastic_Net_ADMM(b, D, options)
    % ADMM for the Elastic Net: options.lambda = lambda, options.alpha = alpha,
    % options.beta = rho (the augmented Lagrangian parameter).
    A     = D' * D;
    A_inv = inv(A + options.beta * eye(size(A)));
    z = 0; u = 0; x = 0;
    gamma = options.lambda * (1 - options.alpha) + options.beta;
    halt = false; iter = 0; dx_norm = inf;
    while ~halt && iter < options.max_iter
        iter  = iter + 1;
        x_new = A_inv * (D' * b + u + options.beta * z);          % X-update
        dx_norm(iter) = (norm(x - x_new, 2)^2) / (norm(x, 2)^2);  % relative change
        if dx_norm(iter) < options.thres
            halt = true;
        else
            x = x_new;
        end
        z = prox_lasso((options.beta / gamma) * x_new - (1 / gamma) * u, ...
                       options.lambda * options.alpha / gamma);   % Z-update
        u = u + options.beta * (z - x_new);                       % dual update
    end
end

% Soft-thresholding operator
function x_new = prox_lasso(x, thres)
    x_new = max(abs(x) - thres, 0) .* sign(x);
end

Group LASSO
• There are many regression problems in which the covariates/atoms have a natural group structure.
• It is often desirable that all coefficients within a group become nonzero (or zero) simultaneously, e.g. in classification problems.
• Consider a linear regression model involving K groups of covariates/atoms.
• For k = 1, ..., K, the block D_k collects the covariates/atoms of group k.
• The goal is to predict the response Y from the collection of covariates D = [D_1, D_2, ..., D_K].
• The Group LASSO is defined as:
min_X (1/2) ||Y − DX||_2^2 + λ ||X||_{2,1}
where ||X||_{2,1} = Σ_{k=1}^K ||X_k||_2.
• The sparsity is now over the K groups rather than over individual atoms.
• Either the entire block X_k is zero, or all of its elements are nonzero (depending on λ ≥ 0).

Group LASSO Ball
• If every group is a singleton (each D_k is a single column in R^{M×1}), then ||X||_{2,1} = ||X||_1 and the optimisation problem reduces to the LASSO.
• The Group LASSO ball shares attributes of both the ℓ2 and ℓ1 balls.
Figure 2: The Group LASSO and LASSO constraint balls.
• Generally we expect group-wise selection, as in the following example:
[Figure: estimated coefficients for the LASSO (left) and the Group LASSO (right) on a grouped example.]

Group LASSO Minimisation
• Augmented Lagrangian framework:
min_{X,Z} (1/2) ||Y − DX||_2^2 + λ ||Z||_{2,1}
subject to: Z − X = 0.
• The procedure for the Group LASSO using ADMM:
• X^{k+1} = (D^T D + ρI)^{-1} (D^T Y + U^k + ρ Z^k)
• For each group g, with P = X^{k+1} − (1/ρ) U^k:
Z_g^{k+1} = ((||P_g||_2 − λ/ρ) / ||P_g||_2) P_g   if λ/ρ < ||P_g||_2,
Z_g^{k+1} = 0   otherwise
(a block soft-thresholding; see the sketch below).
• U^{k+1} = U^k + ρ (Z^{k+1} − X^{k+1})
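A minimal MATLAB sketch of the block soft-thresholding used in the Z-update; the function name group_soft_threshold and the cell-array representation of the groups are illustrative assumptions rather than notation from the slides.

% Block soft-thresholding for the Group LASSO ADMM Z-update: each group of P
% is shrunk towards zero in the l2 sense by kappa = lambda/rho, and is set
% exactly to zero when its l2 norm is at most kappa.
function Z = group_soft_threshold(P, groups, kappa)
    % P      : the point X^{k+1} - (1/rho) * U^k
    % groups : cell array; groups{g} holds the indices belonging to group g
    % kappa  : the threshold lambda / rho
    Z = zeros(size(P));
    for g = 1:numel(groups)
        idx = groups{g};
        ng  = norm(P(idx), 2);
        if ng > kappa
            Z(idx) = (1 - kappa / ng) * P(idx);   % shrink the whole block
        end                                       % else: the block stays zero
    end
end

Inside the ADMM loop this step would be called as Z = group_soft_threshold(X_new - U/rho, groups, lambda/rho); when every group has size one it reduces to the ordinary soft-thresholding operator S used for the Elastic Net.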
Sparse Group LASSO
• When a group is included in a Group LASSO fit, all the coefficients in that group are nonzero. This is a consequence of the ℓ2 norm.
• Sometimes we would like sparsity both with respect to which groups are selected and with respect to which coefficients are nonzero within a group.
• The Sparse Group LASSO is defined as:
min_X (1/2) ||Y − DX||_2^2 + λ(1 − α) ||X||_{2,1} + λα ||X||_1   (2)
• Much like the Elastic Net, the parameter α creates a bridge between the Group LASSO (α = 0) and the LASSO (α = 1).
• Sparse Group LASSO vs. Group LASSO: [figure]

Fused LASSO
• Sometimes data are very noisy, so some kind of smoothing is essential.
• One assumption we could make is that the data are constant (replicated) within segments, so we expect a piecewise-constant estimate.
• The fused LASSO signal approximator exploits this structure:
min_X (1/2) ||Y − X||_2^2 + λ1 ||X||_1 + λ2 ||FX||_1
• F is the first-order difference matrix of dimension (n − 1) × n:

      [ 1 -1  0 ...  0  0 ]
  F = [ 0  1 -1 ...  0  0 ]
      [ :  :  :      :  : ]
      [ 0  0  0 ...  1 -1 ]

• The first penalty is the ℓ1 norm, and serves to shrink the X_i towards zero.
• The second penalty encourages neighbouring coefficients X_i to be similar, and will cause some of them to be identical (this is also known as total-variation de-noising).
• In the equation above every observation is associated with its own coefficient. More generally we can solve
min_X (1/2) ||Y − DX||_2^2 + λ1 ||X||_1 + λ2 ||FX||_1
in an alternative domain or dictionary.

Fused LASSO Minimisation
• We want to solve:
min_{X,Z,W} (1/2) ||Y − X||_2^2 + λ1 ||Z||_1 + λ2 ||W||_1
subject to: Z − X = 0 and W − FX = 0.
• The procedure for the fused LASSO using ADMM:
• X^{k+1} = ((1 + ρ1) I + ρ2 F^T F)^{-1} (Y + U^k + ρ1 Z^k + F^T (V^k + ρ2 W^k))
• Z^{k+1} = S_{λ1/ρ1}(X^{k+1} − (1/ρ1) U^k)
• W^{k+1} = S_{λ2/ρ2}(F X^{k+1} − (1/ρ2) V^k)
• U^{k+1} = U^k + ρ1 (Z^{k+1} − X^{k+1})
• V^{k+1} = V^k + ρ2 (W^{k+1} − F X^{k+1})
[Figure: Yahoo stock price series with piecewise-constant fused LASSO fits.]

Trend Filtering
• The first-order absolute difference penalty in the fused LASSO can be generalized to a higher-order difference:
min_X (1/2) ||Y − X||_2^2 + λ ||F^(M+1) X||_1
• Here F^(M+1) is a matrix that computes discrete differences of order M + 1.
• e.g. F^(2) is the second-order difference matrix of dimension (n − 2) × n:

          [ 1 -2  1  0 ...  0 ]
  F^(2) = [ 0  1 -2  1 ...  0 ]
          [ :  :  :  :      : ]
          [ 0  0 ...  1 -2  1 ]

• In general, trend filtering of order M results in solutions that are piecewise polynomials of degree M.
[Figure: Yahoo stock price series with trend filtering of order 2.]
[Figure: Yahoo stock price series with trend filtering of order 3.]

Q/A
Thank you
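As a small backup sketch, one way to construct the difference matrices used in the fused LASSO and trend filtering slides in MATLAB; dense matrices are used purely for clarity (for long signals a sparse construction, e.g. with spdiags, would be preferable), and all variable names are illustrative.

% First-order difference matrix F of size (n-1) x n, rows [1 -1 0 ...],
% as used by the fused LASSO penalty ||F X||_1.
n = 6;
F = zeros(n - 1, n);
for i = 1:n-1
    F(i, i)     =  1;
    F(i, i + 1) = -1;
end

% Higher-order difference matrices are obtained by composing first-order ones:
% F2 is (n-2) x n with rows [1 -2 1 0 ...], i.e. the matrix F^(2) of the
% trend filtering slide.
F1s = zeros(n - 2, n - 1);          % first-order differences on length n-1
for i = 1:n-2
    F1s(i, i)     =  1;
    F1s(i, i + 1) = -1;
end
F2 = F1s * F;
disp(F2)                            % each row reads [... 1 -2 1 ...]

An ADMM splitting analogous to the fused LASSO one, with W = F2 X, can then be used for the trend-filtering problem.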