ℓ1-regularized regression with generalized linear models
largely based on Chapter 3 of “Statistical Learning with Sparsity”

Ioannis Kosmidis
Department of Statistical Science
University College London
i.kosmidis@ucl.ac.uk

Reading group on “Statistical Learning with Sparsity”
4 March 2016
Outline

1. Generalized linear models
   - Model specification and estimation
   - Weighted lasso regression
   - Proximal Newton iteration
   - Logistic regression
   - Poisson log-linear models
   - Multinomial regression
2. Discussion
Exponential family of distributions

A random variable Y has a distribution from the exponential family if its density/pmf is of the form

    f(y; θ, φ) = exp{ (yθ − b(θ))/φ + c(y, φ) }

- θ is the “natural parameter”
- b(θ) is the “cumulant transform”
- E(Y; θ) = b'(θ); Var(Y; θ, φ) = φ b''(θ)
- The variance is usually¹ expressed in terms of the “dispersion” φ and the mean µ = b'(θ) via the variance function V(µ) = b''(b'⁻¹(µ))

¹ e.g. in standard textbooks, like McCullagh and Nelder (1989), and family objects in R
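The footnote's mention of family objects in R can be made concrete: each family object stores the variance function V(µ). A quick illustration with standard base-R families (not from the slides):

## Variance functions V(mu) as stored in R family objects
binomial()$variance(0.3)   # mu * (1 - mu) = 0.21
poisson()$variance(2)      # V(mu) = mu, so 2
Gamma()$variance(2)        # V(mu) = mu^2, so 4
gaussian()$variance(5)     # V(mu) = 1 (constant variance)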
Generalized linear models

Data: (y1, x1ᵀ), …, (yN, xNᵀ), where yi is the ith outcome and xi ∈ ℝ^p is a vector of predictors

Generalized Linear Model (GLM) specification:
- Random component: Y1, …, YN are conditionally independent given X1, …, XN, and Yi | Xi = xi has a distribution from the exponential family with mean µ(β; xi), where β = (β0, β(x)ᵀ)ᵀ
- Systematic component: the predictors enter the model through η(β; x) = β0 + xᵀβ(x)
- Link function: the mean is linked to the systematic component via a link function g : C ⊂ ℝ → ℝ as g(µ(β; x)) = η(β; x)
Maximum (regularized) likelihood

Let y = (y1, …, yN)ᵀ and X the N × p matrix with ith row xi and jth column gj, such that 1ᵀgj = 0 and ‖gj‖₂² = 1

Maximum likelihood (ML) estimator β̂ (assuming that X is of full rank and g is strictly monotonic):

    min_β { −ℓ(β, φ; y, X) }

where ℓ(β, φ; y, X) = Σ_{i=1}^N log f(yi; θ(β; xi), φ) with θ(β; xi) = b'⁻¹(µ(β; xi))

Maximum regularized likelihood (MRL) estimator β̂(λ):

    min_β { −ℓ(β, φ; y, X) + Pλ(β) },

where the penalty Pλ(β) is either λ‖β(x)‖₁ or variants imposed by the model structure and/or the task (e.g. a sum of ℓ2 norms of subsets of β, a.k.a. the “grouped lasso” penalty)
ML via iteratively reweighted least squares

A Taylor expansion of ℓ(β, φ; y, X) around an estimate β̃ gives the log-likelihood approximation

    −(1/2φ) Σ_{i=1}^N w(β̃; xi) { z(β̃; yi, xi) − η(β; xi) }² + C(β̃, φ; y, X)    (1)

where
- z(β; y, x) = η(β; x) + {y − µ(β; x)}/d(β; x)   (“working variate”)
- w(β; x) = {d(β; x)}²/V(µ(β; x))   (“working weights”)
- d(β; x) = 1/g'(µ(β; x))
- C(β̃, φ; y, X) collects terms that do not depend on β

Maximisation of (1) is a weighted least-squares problem with responses z(β̃; yi, xi), predictors η(β; xi) and weights w(β̃; xi)
Algorithm IWLS: Iteratively reweighted least squares

Input: y, X; β(0)
Output: β̂

Iteration:
0. k ← 0
1. Calculate W(k) = diag{ w(β(k); x1), …, w(β(k); xN) }
2. Calculate z(k) = ( z(β(k); y1, x1), …, z(β(k); yN, xN) )ᵀ
3. β(k+1) ← (Xᵀ W(k) X)⁻¹ Xᵀ W(k) z(k)
4. k ← k + 1
5. Go to 1

A QR decomposition W(k)^{1/2} X = QR can be used to simplify the weighted least-squares solve in step 3
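As a concrete illustration, here is a minimal R sketch of the IWLS iteration for logistic regression (canonical logit link). The function name, the convergence tolerance and the use of lm.wfit() are my own choices, not part of the slides:

## Minimal IWLS sketch for logistic regression; X is assumed to contain a
## column of ones for the intercept.
iwls_logistic <- function(y, X, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (k in seq_len(maxit)) {
    eta <- drop(X %*% beta)              # linear predictor eta(beta; x_i)
    mu  <- plogis(eta)                   # mean mu(beta; x_i) = g^{-1}(eta)
    w   <- mu * (1 - mu)                 # working weights
    z   <- eta + (y - mu) / w            # working variates
    beta_new <- lm.wfit(x = X, y = z, w = w)$coefficients  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

## Usage on simulated data; the result should match glm(y ~ X - 1, family = binomial)
set.seed(1)
X <- cbind(1, matrix(rnorm(200 * 3), 200, 3))
y <- rbinom(200, 1, plogis(X %*% c(-0.5, 1, 0, 0.5)))
iwls_logistic(y, X)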
Weighted lasso regression

Data:
- Response vector: z = (z1, …, zN)ᵀ such that Σ_{i=1}^N zi = 0
- Predictors: N × p matrix X with jth column gj, such that 1ᵀgj = 0 and ‖gj‖₂² = 1
- Weights: w1, …, wN with wi > 0

Weighted lasso regression:

    min_β { (1/2) Σ_{i=1}^N wi (zi − xiᵀβ)² + λ‖β‖₁ }
Algorithm CCD: Cyclic coordinate descent

Input: z, W, X; β(0), λ
Output: β̂(λ)

Iteration:
0. k ← 0
1. For j = 1, …, p

       βj,(k+1) ← Sλ( gjᵀ W (z − η−j,(k)) ) / ( gjᵀ W gj )

   where Sλ(a) = sign(a)(|a| − λ)₊ and
   η−j,(k) = Σ_{t=1}^{j−1} βt,(k+1) gt + Σ_{t=j+1}^{p} βt,(k) gt, with Σ_{t=a}^{b}(·) = 0 if a > b
2. k ← k + 1
3. Go to 1

“Pathwise coordinate descent”: start from a value of λ for which all coefficients are zero (e.g. λmax > maxj |gjᵀ W z|) and then move towards λ = 0, using “warm starts” for each CCD run
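A minimal R sketch of Algorithm CCD for the weighted lasso; the function names and the stopping rule are my own additions:

## Soft-thresholding operator S_lambda(a) = sign(a)(|a| - lambda)_+
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda, 0)

## Cyclic coordinate descent for
## min_beta (1/2) sum_i w_i (z_i - x_i' beta)^2 + lambda ||beta||_1
ccd_weighted_lasso <- function(z, X, w, lambda, maxit = 100, tol = 1e-8) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (k in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      gj <- X[, j]
      eta_minus_j <- drop(X[, -j, drop = FALSE] %*% beta[-j])   # fit excluding g_j
      beta[j] <- soft_threshold(sum(w * gj * (z - eta_minus_j)), lambda) / sum(w * gj^2)
    }
    if (max(abs(beta - beta_old)) < tol) break   # stop when a full cycle changes little
  }
  beta
}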
MRL via a proximal Newton method²

    min_β { −ℓ(β, φ; y, X) + λ‖β(x)‖₁ }

Algorithm PN: Proximal Newton iteration

Input: y, X; β(0), λ
Output: β̂(λ)

Iteration:
0. k ← 0
1. Update the quadratic approximation of ℓ(β, φ; y, X) at β̃ := β(k) using (1) (step of the “outer loop”)
2. Run CCD for the ℓ1-regularized weighted least-squares problem (“inner loop”)
3. k ← k + 1
4. Go to 1

² See Lee et al. (2014) for details on proximal Newton-type methods and their convergence properties
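Combining the two previous sketches gives a bare-bones proximal Newton loop for ℓ1-regularized logistic regression. Intercept handling, column standardization and a proper outer convergence check are deliberately omitted, so this is illustrative only:

## Bare-bones proximal Newton loop for l1-regularized logistic regression,
## reusing ccd_weighted_lasso() from the previous sketch.
prox_newton_logistic <- function(y, X, lambda, outer = 20) {
  beta <- rep(0, ncol(X))
  for (k in seq_len(outer)) {
    eta <- drop(X %*% beta)
    mu  <- plogis(eta)
    w   <- pmax(mu * (1 - mu), 1e-5)              # working weights, bounded away from zero
    z   <- eta + (y - mu) / w                     # working variates
    beta <- ccd_weighted_lasso(z, X, w, lambda)   # inner CCD on the approximation (1)
  }
  beta
}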
Logistic regression

Model: Y1, …, YN are conditionally independent given X1, …, XN with Yi | Xi = xi ∼ Bernoulli(π(β, xi)), where

    log{ π(β, xi) / (1 − π(β, xi)) } = β0 + xiᵀβ(x)

Log-likelihood:

    ℓ(β; y, X) = Σ_{i=1}^N [ yi (β0 + xiᵀβ(x)) − log{ 1 + exp(β0 + xiᵀβ(x)) } ]

Penalty: Pλ(β) = λ‖β(x)‖₁

Working variate and working weight:

    z(β; yi, xi) = β0 + xiᵀβ(x) + (yi − π(β, xi)) / [ π(β, xi)(1 − π(β, xi)) ]

and w(β, xi) = π(β, xi)(1 − π(β, xi))
Logistic regression

- If X has rank p, ℓ(β; y, X) is concave³
- The ML estimate can have infinite components
- If p > N − 1, then the model is over-parameterized and regularization is required to achieve a stable solution

³ See, for example, Wedderburn (1976)
20-Newsgroups data

Document classification based on the 20-Newsgroups corpus with feature set and class definition as in Koh et al. (2007)⁴

## Load the data
con <- url(paste0("https://www.jstatsoft.org/index.php/jss/article/",
                  "downloadSuppFile/v033i01/NewsGroup.RData"))
load(con)
## p >> N
dim(NewsGroup$x)
## [1] 11314 777811
## Feature matrix is sparse
mean(NewsGroup$x)
## [1] 0.0007721394

⁴ Data is available as a supplement to Friedman et al. (2010)
20-Newsgroups data: 1000 × 1000 block of feature matrix

[Figure: sparsity pattern of a block of the feature matrix (Row vs Column)]
20-Newsgroups data: proximal Newton with warm starts

## Load glmnet
require("glmnet")
## Compute lasso path (alpha = 1 in glmnet arguments)
system.time(NewsFit <- glmnet(x = NewsGroup$x, y = NewsGroup$y,
                              family = "binomial",
                              lambda.min.ratio = 1e-02,
                              thresh = 1e-05,
                              alpha = 1))
##    user  system elapsed
##   6.063   1.170   7.479
20-Newsgroups data: Solution path (λmin/λmax = 10^-2)

[Figure: coefficient paths against the fraction of deviance explained, Dλ² = (Devnull − Devλ)/Devnull; the top axis shows the number of non-zero coefficients]
20-Newsgroups data: Solution path (λmin /λmax = 10−4 )
88
278
1087
3342
8103
0.0
0.2
0.4
0.6
0.8
1.0
0
−5
−10
Coefficients
5
0
Fraction Deviance Explained
Dλ2 = (Devnull − Devλ )/Devnull
Some observations

- If p > N − 1, then the estimates diverge to ±∞ as λ → 0 in order to achieve fitted probabilities of 0 or 1
- Plotting the solution path versus log(λ) or ‖β̂(λ)‖₁ is possible, though it is not preferable to Dλ² due to interpretation issues in p ≫ N settings
20-Newsgroups data: Solution path (λmin/λmax = 10^-2), three x-axis choices

[Figures: the same coefficient paths plotted against the fraction of deviance explained, against log(λ), and against the ℓ1 norm of the coefficients]
Cross-validation

require("doMC")
registerDoMC(4)
## Run a 10-fold cross-validation using doMC
cvNewsGroupC <- cv.glmnet(x = NewsGroup$x, y = NewsGroup$y,
                          family = "binomial",
                          nfolds = 10,
                          lambda = NewsFit$lambda,
                          type.measure = "class",
                          parallel = TRUE)
cvNewsGroupD <- cv.glmnet(x = NewsGroup$x, y = NewsGroup$y,
                          family = "binomial",
                          nfolds = 10,
                          lambda = NewsFit$lambda,
                          type.measure = "deviance",
                          parallel = TRUE)
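Not shown on the slides, but useful in practice: the selected λ and the corresponding sparse coefficient vector can be extracted from the cv.glmnet objects (lambda.min and lambda.1se are standard cv.glmnet fields):

## CV-selected values of lambda
cvNewsGroupD$lambda.min   # lambda minimizing the CV deviance
cvNewsGroupD$lambda.1se   # largest lambda within one SE of the minimum
## Sparse coefficient vector at lambda.min; most entries are exactly zero
betaHat <- coef(cvNewsGroupD, s = "lambda.min")
sum(betaHat != 0)         # number of selected features (plus the intercept)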
Cross-validation

[Figures: cross-validated binomial deviance and misclassification error against log(λ), with the number of non-zero coefficients along the top axis]
Poisson log-linear models

Model: Y1, …, YN are conditionally independent given X1, …, XN with Yi | Xi = xi ∼ Poisson(µ(β, xi)), where

    log µ(β, xi) = β0 + xiᵀβ(x)

Log-likelihood:

    ℓ(β; y, X) = Σ_{i=1}^N { yi (β0 + xiᵀβ(x)) − exp(β0 + xiᵀβ(x)) }

Penalty: Pλ(β) = λ‖β(x)‖₁

Working variate and working weight:

    z(β; yi, xi) = β0 + xiᵀβ(x) + (yi − µ(β, xi)) / µ(β, xi)

and w(β, xi) = µ(β, xi)
Poisson log-linear models

- If X has rank p, ℓ(β; y, X) is concave⁵
- If the intercept is not penalized (typically the case), then direct differentiation of the penalized log-likelihood gives

      (1/N) Σ_{i=1}^N µ(β̂(λ), xi) = ȳ

- When p > N − 1 and zero counts are observed, the estimates diverge to ±∞ as λ → 0 in order to achieve fitted means of 0

⁵ See, for example, Wedderburn (1976)
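The mean-matching property above is easy to check numerically with glmnet; the simulated data below are my own, not from the slides:

## l1-regularized Poisson regression on simulated counts
library(glmnet)
set.seed(11)
N <- 500; p <- 30
X <- matrix(rnorm(N * p), N, p)
y <- rpois(N, lambda = exp(0.5 + X[, 1] - 0.7 * X[, 3]))

poisfit <- glmnet(X, y, family = "poisson")
## Fitted means at one value of lambda on the path
mu_hat <- predict(poisfit, newx = X, s = poisfit$lambda[10], type = "response")
## Unpenalized intercept => average fitted mean equals the sample mean (up to tolerance)
c(mean(mu_hat), mean(y))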
Distribution smoothing

Data: vector of proportions r = {ri}, i = 1, …, N, such that Σ_{i=1}^N ri = 1

Target distribution: u = {ui}, i = 1, …, N

Task: find an estimate q = {qi}, i = 1, …, N, such that
- the relative entropy between q and u is as small as possible
- q and r are within a given max-norm tolerance

Constrained maximum-entropy problem⁶:

    min_{q ∈ ℝ^N, qi ≥ 0} Σ_{i=1}^N qi log(qi/ui)   such that   ‖q − r‖∞ ≤ δ,  Σ_{i=1}^N qi = 1

⁶ See Dubiner and Singer (2011)
Distribution smoothing: Lagrange dual

The Lagrange dual is an ℓ1-regularized Poisson log-linear regression:

    max_{α, β} [ Σ_{i=1}^N { ri log qi(α, βi) − qi(α, βi) } − δ‖β‖₁ ]

where qi(α, βi) = ui exp{α + βi}

- The presence of the unpenalized intercept ensures that Σ_{i=1}^N qi(α̂(δ), β̂i(δ)) = 1
- As δ grows, β̂i(δ) → 0 (i = 1, …, N) and α̂(δ) → 0
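A self-contained sketch of this dual in R. The simulated target and observed proportions are my own construction; the glmnet call mirrors the “basic call” shown on the following slides, and the smoothed estimates are recovered from the fitted intercept and coefficients:

## Distribution smoothing via the l1-regularized Poisson dual
library(glmnet)
library(Matrix)

set.seed(123)
N <- 200
x <- seq(-6, 6, length.out = N)
u <- dnorm(x); u <- u / sum(u)                                   # target distribution
obs <- sample(x, 500, replace = TRUE, prob = u)                  # observed sample
r <- as.numeric(table(factor(obs, levels = x))) / length(obs)    # observed proportions

fit <- glmnet(x = Diagonal(N), y = r, offset = log(u), family = "poisson")

## Smoothed estimate at one value of lambda (one delta): q_i = u_i * exp(alpha + beta_i)
cf <- coef(fit)                        # (N + 1) x nlambda matrix: intercept, then beta
q  <- u * exp(cf[1, 10] + cf[-1, 10])
sum(q)                                 # approximately 1, thanks to the unpenalized intercept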
Artificial example

[Figure: target and observed distributions f(x) over x ∈ (−6, 6)]
Artificial example: estimates along the path

[Figures: target, observed and estimated distributions for decreasing values of δ
(δ = 0.001396317, 0.000457231, 0.0001497225, 4.57231e−05, 1.497225e−05, 4.902738e−06, 1.497225e−06)]

Basic call:
glmnet(x = Diagonal(N), y = r, offset = log(u), family = "poisson")
Artificial example (best λ from 10-fold CV with MSE loss): δopt = 0.0001721445

[Figure: target, observed and estimated distributions at the CV-selected δ]

Basic call:
glmnet(x = Diagonal(N), y = r, offset = log(u), family = "poisson")
Multinomial regression

Model: the vectors {(Yi1, …, YiK)}, i = 1, …, N, are conditionally independent given X1, …, XN, and (Yi1, …, YiK) | Xi = xi has a multinomial distribution with Σ_{k=1}^K Yik = mi and kth-category probability

    π(βk, xi) = exp(β0,k + xiᵀβ(x),k) / Σ_{l=1}^K exp(β0,l + xiᵀβ(x),l)    (2)

Log-likelihood:

    ℓ(β; Y, X) = Σ_{i=1}^N [ Σ_{k=1}^K yik (β0,k + xiᵀβ(x),k) − mi log{ Σ_{k=1}^K exp(β0,k + xiᵀβ(x),k) } ]

Penalty: Pλ(β) = λ Σ_{k=1}^K ‖β(x),k‖₁
Multinomial regression

- The parameterization in (2) is not identifiable; adding γ0 + xiᵀγ(x) to the ith predictors gives the same likelihood
- ML can only estimate contrasts of parameters
  - e.g. setting βK = 0 gives the identifiable “baseline-category logit” model that involves contrasts with category K
  - The ML fit is invariant to the choice of baseline category
  - If X has rank p, the log-likelihood for the baseline-category logit is concave
- The MRL fit (using the ℓ1-regularized likelihood) is not invariant to the choice of baseline category
- MRL estimation (using the ℓ1-regularized likelihood) takes care of the redundancy in (2) by an implied, coordinate-dependent re-centering of β(x) (one also needs to set Σ_{k=1}^K β0,k = 0)
Maximum regularized likelihood

Partial quadratic approximation of the log-likelihood at β̃, allowing only β0,k and β(x),k to vary:

    −(1/2) Σ_{i=1}^N w(β̃k, xi) { z(β̃k, yik, xi) − β0,k − xiᵀβ(x),k }² + C({β̃t}_{t≠k}; Y, X)    (3)

where C({β̃t}_{t≠k}; Y, X) collects terms that do not depend on β0,k and β(x),k,

    z(βk, yik, xi) = β0,k + xiᵀβ(x),k + (yik/mi − π(βk, xi)) / [ π(βk, xi) {1 − π(βk, xi)} ]

and w(βk, xi) = mi π(βk, xi) {1 − π(βk, xi)}
Nested loops for maximum regularized likelihood

- outer: cycle over the categories k = {1, 2, …, K, 1, 2, …}
- middle: for each k, fix {β̃0,t, β̃t}_{t≠k} in the partial quadratic approximation (3) at the current estimates
- inner: run CCD for the ℓ1-regularized weighted least-squares problem

The penalty Pλ(β) = λ Σ_{k=1}^K ‖β(x),k‖₁ can select different variables for different classes
Grouped lasso for multinomial regression

“Grouped-lasso” penalty:

    Pλ^(G)(β) = λ Σ_{j=1}^p ‖δj‖₂

where δj = (β(x),1j, …, β(x),Kj)ᵀ

- Selects all coefficients for a particular variable to be in or out of the model together
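In glmnet, both penalties are available for multinomial models through the type.multinomial argument; a small sketch on simulated data (the data-generating setup is my own, only the argument names come from glmnet):

## Per-class l1 penalty vs grouped-lasso penalty for multinomial regression
library(glmnet)

set.seed(42)
N <- 300; p <- 20; K <- 3
X <- matrix(rnorm(N * p), N, p)
B <- cbind(c(2, -1, rep(0, p - 2)),        # class-specific coefficients
           c(0, 1, 1.5, rep(0, p - 3)),
           rep(0, p))
probs <- exp(X %*% B); probs <- probs / rowSums(probs)
y <- factor(apply(probs, 1, function(pr) sample(1:K, 1, prob = pr)))

## Per-class l1 penalty: classes may select different variables
fit_l1  <- glmnet(X, y, family = "multinomial", type.multinomial = "ungrouped")
## Grouped-lasso penalty: a variable is selected for all classes at once
fit_grp <- glmnet(X, y, family = "multinomial", type.multinomial = "grouped")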
Discussion
Other regression models

- Similar procedures/algorithms can be applied in regression settings with objectives other than the log-likelihood
  - e.g. Cox proportional hazards models using a regularized partial likelihood⁷

⁷ See Hastie et al. (2015, §3.5)
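For instance, glmnet fits the ℓ1-regularized Cox partial likelihood when family = "cox"; a minimal sketch on simulated survival data (the data are my own):

## l1-regularized Cox regression: y is a two-column matrix (time, status)
library(glmnet)

set.seed(7)
N <- 200; p <- 50
X <- matrix(rnorm(N * p), N, p)
time   <- rexp(N, rate = exp(X[, 1] - 0.5 * X[, 2]))   # hazard depends on two variables
status <- rbinom(N, 1, 0.8)                            # 1 = event observed, 0 = censored
y <- cbind(time = time, status = status)

coxfit <- glmnet(X, y, family = "cox")   # path of regularized partial-likelihood fits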
References

Dubiner, M. and Y. Singer (2011). Entire relaxation path for maximum entropy problems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Stroudsburg, PA, USA, pp. 941–948. Association for Computational Linguistics.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.

Hastie, T., R. Tibshirani, and M. Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Chapman and Hall/CRC.

Koh, K., S.-J. Kim, and S. Boyd (2007). An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research 8, 1519–1555.

Lee, J. D., Y. Sun, and M. A. Saunders (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization 24(3), 1420–1443.

McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (Second edition). London: Chapman & Hall.

Wedderburn, R. W. M. (1976). On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika 63(1), 27–32.