Chapter 5: Optimization methods (part 1)
Piotr Zwiernik and Omiros Papaspiliopoulos
Universitat Pompeu Fabra
April 8, 2016
1
Optimization literature
Convex Optimization
• S. Boyd and L. Vandenberghe, “Convex Optimization”, 2004.
• S. Boyd’s video lectures:
http://stanford.edu/class/ee364a/videos.html
• R. Tyrrell Rockafellar, “Convex Analysis”, 1970.
Optimization
• J. Nocedal, S.J. Wright, “Numerical Optimization”, 2006.
Links to exponential families
• O.E. Barndorff-Nielsen, “Information and Exponential Families
in Statistical Theory”, 1978.
2
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials.
4. Some remarks on the lasso problem
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
3
Convex sets
• C ⊆ Rp is convex if
γ(t) = (1 − t)α + tβ ∈ C
for all α, β ∈ C and t ∈ (0, 1).
4
Convex functions
• f : C → R is convex if
f ((1−t)α+tβ) ≤ (1−t)f (α)+tf (β)
∀α, β ∈ C , t ∈ (0, 1).
• f convex iff {(β, y ) ∈ Rp+1 : y ≥ f (β)} is a convex set
5
Examples
• Linear functions are both concave and convex.
Univariate functions, x ∈ R
• exponential e^{ax} is convex on R for any a ∈ R
• log x is concave on (0, ∞)
• negative entropy x log x is convex on (0, ∞)
Multivariate functions, x ∈ Rp
• every norm is convex by the triangle inequality
• f (x) = max{x1 , . . . , xp } is convex
• f(x) = log(e^{x1} + · · · + e^{xp}) is convex on Rp
Functions on S^p++ (X symmetric positive definite)
• f(X) = log det X is concave
• for the eigenvalues λ1(X) ≥ · · · ≥ λp(X): λ1(X) + . . . + λk(X) is
convex, and λk(X) + . . . + λp(X) is concave.
6
Operations that preserve convexity
Basic operations
• If f1 , . . . , fk are convex then f (β) = max{f1 (β), . . . , fk (β)} is
convex
• If g (α, β) is convex in β for every α ∈ A then
f (β) = supα∈A g (α, β) is convex
• If g is convex then f (β) = g (Aβ + b) is convex
Some conclusions
• For any A ⊆ Rp, f(β) = supα∈A ||α − β|| is convex
• f(β) = ||y − Xβ||₂² is convex
7
Pointwise supremum of convex functions
• If f1 , f2 : C → R convex then f (β) = max{f1 (β), f2 (β)} is
convex
• Pointwise supremum of any collection of convex functions is
convex.
• Pointwise infimum of any collection of concave functions is
concave.
8
How to quickly verify non-convexity
Quick test
• Sample many pairs of points x, y ∈ Rp
• For each pair, plot f on the segment between x and y; any point above the chord is a witness of non-convexity (a numerical version is sketched below)
Example: Gaussian likelihood
• ℓ(Σ) = − log det(Σ) − trace(Sn Σ⁻¹)
• This function is not concave.
• It is concave over {Σ : 2Sn − Σ ⪰ 0}.
• If n/p is large, it may be hard to find a witness of non-concavity.
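A minimal numerical version of this quick test, applied to the Gaussian log-likelihood above (the sizes, tolerance, and way of sampling positive definite matrices below are illustrative choices, not part of the slides): check the defining inequality of concavity along randomly drawn segments; with p = 3 and a small n a witness of non-concavity is typically found.

% Quick numerical test for non-concavity of l(Sigma); illustrative sketch.
p = 3; n = 5;
Sn = cov(randn(n, p));                        % a sample covariance matrix
ell = @(S) -log(det(S)) - trace(Sn/S);        % l(Sigma) = -log det(Sigma) - tr(Sn*inv(Sigma))
found = false;
for trial = 1:5000
    A1 = randn(p); S1 = A1*A1' + 0.1*eye(p);  % random positive definite matrices
    A2 = randn(p); S2 = A2*A2' + 0.1*eye(p);
    t = rand;
    if ell((1-t)*S1 + t*S2) < (1-t)*ell(S1) + t*ell(S2) - 1e-8
        found = true;                         % concavity inequality violated: a witness
    end
end
disp(found)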
9
A few points
The domain of a function
• Writing f : Rp → R does not mean that f is finite on all of Rp!
• By convention, f equals +∞ wherever it is not defined.
• dom(f ) := {β ∈ Rp : f (β) < +∞}
• For a convex function dom(f ) must be convex.
Constraining the domain
• Define the function δ(x|C) = 0 if x ∈ C, and +∞ otherwise.
• δ(x|C) is convex if and only if C is a convex set.
• f(x) + δ(x|C) is the restriction of f to C.
10
Convex optimization problem
Problem:
      minimize f(β) over β ∈ Rp such that β ∈ C
Standard form
• C = {β ∈ Rp : g1 (β) ≤ 0, . . . , gm (β) ≤ 0}
• g1 , . . . , gm : Rp → R convex functions
• g (β) := (g1 (β), . . . , gm (β))
• additional linear equality constraints possible
Optimum guarantee
• if β ∈ C is a local optimum then it is a global optimum
• f ∗ := inf{f (β) : β ∈ C } ∈ R ∪ {−∞, +∞}
• β ∗ ∈ C is optimal if f (β ∗ ) = f ∗
11
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials.
4. Some remarks on the lasso problem
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
12
Convexity conditions
Function f : C → R, such that C ⊆ Rp convex.
First order conditions
• suppose ∇f (β) exists for every β ∈ C
• f is convex iff f(β′) ≥ f(β) + ⟨∇f(β), β′ − β⟩ for all β, β′ ∈ C
Second order conditions
• suppose ∇²f(β) exists for every β ∈ C
• f is convex iff ∇²f(β) is positive semi-definite for all β ∈ C
  (a numerical check is sketched below)
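As a small illustration of the second-order test (the function and sizes are chosen for the sketch, not taken from the slides): for f(x) = log(e^{x1} + · · · + e^{xp}) the Hessian is diag(π) − ππᵀ with π the softmax of x, and its smallest eigenvalue should be nonnegative everywhere.

% Numerical second-order check for log-sum-exp; illustrative sketch.
p = 6;
min_eig = Inf;
for trial = 1:1000
    x = randn(p, 1);
    pi_x = exp(x - max(x)); pi_x = pi_x/sum(pi_x);  % softmax of x, computed stably
    H = diag(pi_x) - pi_x*pi_x';                    % Hessian of log-sum-exp at x
    min_eig = min(min_eig, min(eig(H)));
end
disp(min_eig)   % nonnegative up to rounding error, consistent with convexity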
13
First order optimality conditions
Suppose ∇f(β) exists for all β ∈ C.
• β∗ ∈ C is a global optimum if and only if ⟨∇f(β∗), β − β∗⟩ ≥ 0 for all β ∈ C
• If β∗ is an interior point of C, then ∇f(β∗) = 0
14
Lagrangian function
Lagrangian function L : Rp × Rm≥0 → R
• L(β; λ) = f(β) + λ1 g1(β) + . . . + λm gm(β)
• L(β; λ) is convex in β and linear in λ (in particular concave)
Simple observation (no convexity assumed)
• Fix β ∈ Rp. If g(β) ≤ 0 then supλ≥0 λᵀg(β) = 0;
• if gi(β) > 0 for some i then supλ≥0 λᵀg(β) = ∞, and so
      supλ≥0 L(β; λ) = f(β) if g(β) ≤ 0, and +∞ otherwise,
  and thus f∗ = infβ∈Rp supλ≥0 L(β; λ).
• Question: How does supλ≥0 infβ∈Rp L(β; λ) relate to f∗?
15
Lagrange duality
Dual function (no convexity assumed)
• dual function h : Rm≥0 → R,
      h(λ) = infβ∈Rp L(β; λ) = infβ∈Rp (f(β) + λᵀg(β))
• h is concave (pointwise infimum of concave (linear) functions)
• lower bound on the optimal value: h(λ) ≤ f∗
  ▶ if β ∈ C then f(β) + λᵀg(β) ≤ f(β), and so
      infβ∈Rp (f(β) + λᵀg(β)) ≤ infβ∈C (f(β) + λᵀg(β)) ≤ infβ∈C f(β) = f∗.
Convex dual problem
• maximize h(λ) subject to λ ≥ 0
• λ∗ is called dual optimal
• h(λ∗) is the best lower bound on f∗ (a numerical illustration follows below)
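A hedged numerical illustration of the lower bound (it assumes CVX is installed, as in the appendix example; the quadratic program and all names are made up for this sketch). The primal below is the projection of a point c onto {Aβ ≤ b}; its Lagrange dual works out to a concave quadratic over λ ≥ 0, and the two optimal values satisfy h(λ∗) ≤ f∗ (here they coincide, since the problem has a strictly feasible point).

% Primal:  minimize 0.5*||x - c||_2^2  subject to  A*x <= b.
% Dual:    maximize -0.5*||A'*lam||_2^2 + lam'*(A*c - b)  subject to  lam >= 0.
m = 20; p = 10;
A = randn(m, p); b = rand(m, 1); c = 5*randn(p, 1);

cvx_begin quiet
    variable x(p)
    minimize(0.5*sum_square(x - c))
    subject to
        A*x <= b
cvx_end
f_star = cvx_optval;

cvx_begin quiet
    variable lam(m)
    maximize(-0.5*sum_square(A'*lam) + lam'*(A*c - b))
    subject to
        lam >= 0
cvx_end
h_star = cvx_optval;

disp([h_star, f_star])   % weak duality: h_star <= f_star (equal here)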
16
Karush-Kuhn-Tucker conditions
We say that strong duality holds if h(λ∗ ) = f ∗ .
Slater’s condition
• If g (β) < 0 for some β ∈ C then strong duality holds.
KKT conditions
Let λ∗ ≥ 0 be the optimal dual vector and β ∗ ∈ Rp the optimal
primal vector. The following conditions are necessary and sufficient
for β ∗ to be the global optimum.
(a) Primal feasibility: g (β ∗ ) ≤ 0.
(b) Complementary slackness: λ∗i gi (β ∗ ) = 0 for i = 1, . . . , m
(c) Lagrangian condition: the pair (β∗, λ∗) satisfies
      0 = ∇β L(β∗; λ∗) = ∇f(β∗) + λ∗1 ∇g1(β∗) + . . . + λ∗m ∇gm(β∗).
• Recall the Lagrange theorem. . .
17
Some intuition behind KKT conditions
• Consider a triangle described by three linear inequalities.
18
Example: minimum over the nonnegative orthant
• minimize f (β) subject to −β ≤ 0
• L(β; λ) = f (β) − λT β
• ∇β L(β; λ) = 0 if and only if λ = ∇f (β)
KKT conditions
• primal feasibility: β ∗ ≥ 0
• dual feasibility: ∇f (β ∗ ) ≥ 0
• complementary slackness: βi∗ (∇f (β ∗ ))i = 0 for all i
• Lagrangian condition: λ∗ = ∇f(β∗) (these conditions are verified numerically in the sketch below)
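A minimal numerical check of these conditions (the objective f(β) = ½||β − c||₂² and all names are chosen for this sketch): the minimizer over the nonnegative orthant is β∗ = max(c, 0), with ∇f(β∗) = β∗ − c, and primal feasibility, dual feasibility and complementary slackness can be verified directly.

% KKT check for: minimize 0.5*||beta - c||_2^2 subject to beta >= 0.
p = 6;
c = randn(p, 1);
beta_star = max(c, 0);                          % closed-form minimizer
grad = beta_star - c;                           % gradient of f; equals lambda_star
disp(all(beta_star >= 0))                       % primal feasibility
disp(all(grad >= -1e-12))                       % dual feasibility: grad f(beta_star) >= 0
disp(max(abs(beta_star .* grad)))               % complementary slackness: = 0 for every i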
19
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials
4. Some remarks on the lasso problem
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
20
Nondifferentiable f and subdifferentials
• If f is convex and differentiable, then f(β′) ≥ f(β) + ⟨∇f(β), β′ − β⟩
• A vector z ∈ Rp is said to be a subgradient of f at β if
      f(β′) ≥ f(β) + ⟨z, β′ − β⟩
  for all β′ ∈ Rp.
• it defines a supporting hyperplane of the epigraph of f
• subdifferential ∂f (β): the set of all subgradients of f at β
21
Examples
Recall: z ∈ ∂f(β) if f(β′) ≥ f(β) + ⟨z, β′ − β⟩ for all β′ ∈ Rp
Absolute value f(β) = |β|, β ∈ R
      ∂f(β) = {1} if β > 0,   {−1} if β < 0,   [−1, 1] if β = 0.
Indicator function δ(β|C), C convex
• We have z ∈ ∂δ(β|C) if and only if
  ▶ δ(β′|C) ≥ δ(β|C) + ⟨z, β′ − β⟩ for all β′ ∈ Rp, or equivalently
  ▶ β ∈ C and ⟨z, β′ − β⟩ ≤ 0 for all β′ ∈ C, that is,
  ▶ z defines a supporting hyperplane of C at β.
22
Basic properties of subdifferentials
Geometric properties
• ∂f (β) is a closed convex set
• ∂f (β) = {∇f (β)} if f differentiable at β
• β ∗ is the minimum if and only if 0 ∈ ∂f (β ∗ )
Basic calculus
• if t > 0 then ∂(t f ) = t ∂f
• ∂(f + g ) = ∂f + ∂g (r.h.s. is a Minkowski sum)
• if f (β) = g (Aβ + b), then ∂f (β) = AT ∂g (Aβ + b)
Our main example
• ∂||β||₁ = S, where
S = {s ∈ Rp : si = sgn(βi) if βi ≠ 0, and si ∈ [−1, 1] otherwise}
(a numerical check of the subgradient inequality is sketched below)
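A minimal numerical check of this example (names and sizes are illustrative): take s with si = sgn(βi) on the support and si = 0 elsewhere, which lies in S, and verify the subgradient inequality ||b||₁ ≥ ||β||₁ + ⟨s, b − β⟩ at random points b.

% Check a subgradient of the l1-norm; illustrative sketch.
p = 8;
beta = randn(p, 1); beta(rand(p, 1) < 0.4) = 0;   % make some coordinates exactly zero
s = sign(beta);                                    % si = sgn(beta_i), and 0 in [-1,1] where beta_i = 0
ok = true;
for trial = 1:1000
    b = randn(p, 1);
    if norm(b, 1) < norm(beta, 1) + s'*(b - beta) - 1e-12
        ok = false;                                % subgradient inequality violated
    end
end
disp(ok)   % true: the inequality held at every sampled point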
23
Generalized KKT conditions
• β is a minimum of f if and only if 0 ∈ ∂f (β), i.e., 0 is a
subgradient of f at β
• z ∈ ∂f(β∗) if and only if ⟨z, β⟩ − f(β) achieves its supremum
  at β = β∗ (conjugate function)
  ▶ f(β) ≥ f(β∗) + ⟨z, β − β∗⟩ is equivalent to
  ▶ ⟨z, β∗⟩ − f(β∗) ≥ ⟨z, β⟩ − f(β)
• The generalized KKT theory can be applied with
      0 ∈ ∂f(β∗) + λ∗1 ∂g1(β∗) + . . . + λ∗m ∂gm(β∗).
• KKT conditions can be derived from the theory of
subdifferentiation (see Rockafellar, p. 283)
24
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials
4. Some remarks on the lasso problem
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
25
Finding the optimal parameter, λ fixed
• minβ∈Rp ½||y − Xβ||₂² + λ||β||₁
• ∂f(β) = −Xᵀ(y − Xβ) + λS, where S = ∂||β||₁.
Solve 0 ∈ ∂f(β) to minimize
• for each coordinate: 0 ∈ −Xiᵀ(y − Xβ) + λSi
  ▶ Recall: Si = sign(βi) if βi ≠ 0 and Si = [−1, 1] if βi = 0.
• These conditions are coupled across coordinates, so they are hard
  (in general impossible) to solve in closed form.
• This motivates coordinate descent and other iterative methods
  (a numerical check of these conditions is sketched below).
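A hedged numerical check of these coordinatewise conditions (it assumes CVX is installed, as in the appendix example; the data and names are made up for the sketch): solve a small lasso problem and verify that Xiᵀ(y − Xβ̂) = λ sign(β̂i) on the active set and |Xiᵀ(y − Xβ̂)| ≤ λ elsewhere.

% Check the lasso optimality conditions at a CVX solution; sketch only.
n = 50; p = 20;
X = randn(n, p); y = randn(n, 1);
lambda = 0.1*norm(X'*y, Inf);          % below lambda_max, so the solution is nonzero
cvx_begin quiet
    variable bhat(p)
    minimize(0.5*sum_square(y - X*bhat) + lambda*norm(bhat, 1))
cvx_end
g = X'*(y - X*bhat);                   % vector with entries Xi'(y - X*bhat)
active = abs(bhat) > 1e-6;
disp(max([0; abs(g(active) - lambda*sign(bhat(active)))]))  % ~0 on the active set
disp(max([0; abs(g(~active))]) <= lambda + 1e-6)            % true off the active set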
26
Connection to the constraint ||β||1 ≤ R
• f(β) = ½||y − Xβ||₂² + λ||β||₁ already looks like a Lagrangian.
Consider the following problem
• minβ∈Rp ½||y − Xβ||₂² subject to g(β) = ||β||₁ − R ≤ 0
• L(β; λ) = ½||y − Xβ||₂² + λ(||β||₁ − R)
• Lagrangian condition: 0 ∈ −Xᵀ(y − Xβ∗) + λ∗S∗
• This means that for a suitable λ (depending on R) the two problems are equivalent.
27
Dual of the lasso
• Lasso primal: minβ∈Rp ½||y − Xβ||₂² + λ||β||₁
• Equivalently: minβ,r ½||r||₂² + λ||β||₁ subject to r = y − Xβ
• We use the Lagrangian
      L(β, r, θ) = ½||r||₂² + λ||β||₁ − θᵀ(r − y + Xβ)
• This can be minimised separately with respect to r and β,
  which gives r = θ and
      minβ∈Rp { −θᵀXβ + λ||β||₁ } = 0 if ||Xᵀθ||∞ ≤ λ, and −∞ otherwise.
• Lasso dual: maxθ ½{ ||y||₂² − ||y − θ||₂² } subject to ||Xᵀθ||∞ ≤ λ.
  This is a projection of y onto a polytope! (A numerical check follows below.)
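A hedged numerical check of the dual (again assuming CVX, with illustrative data): at the lasso solution β̂ the residual θ̂ = y − Xβ̂ should be dual feasible, ||Xᵀθ̂||∞ ≤ λ, and the primal and dual objective values should coincide (zero duality gap).

% Lasso duality check; sketch only.
n = 50; p = 20;
X = randn(n, p); y = randn(n, 1);
lambda = 0.1*norm(X'*y, Inf);
cvx_begin quiet
    variable bhat(p)
    minimize(0.5*sum_square(y - X*bhat) + lambda*norm(bhat, 1))
cvx_end
theta = y - X*bhat;                                   % candidate dual variable
primal = 0.5*norm(y - X*bhat)^2 + lambda*norm(bhat, 1);
dual   = 0.5*(norm(y)^2 - norm(y - theta)^2);
disp(norm(X'*theta, Inf) <= lambda + 1e-6)            % dual feasibility
disp(abs(primal - dual))                              % duality gap, ~0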
28
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials.
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
29
Basic algorithms (unconstrained case)
If f differentiable then at β ∗ we have ∇f (β ∗ ) = 0.
First-order method
• Gradient descent: β t+1 = β t − s t ∇f (β t ) for t = 0, 1, 2, . . .
• −∇f (β t ) is the direction of the steepest descent, s t > 0.
• alternatively: instead of −∇f(β^t), take any direction ∆^t with ⟨−∇f(β^t), ∆^t⟩ > 0 (a minimal sketch follows below)
Newton’s method
• ∆t = −(∇2 f (β t ))−1 ∇f (β t )
• obtained by minimization of the local quadratic approximation
      f(β^t) + ⟨∇f(β^t), β − β^t⟩ + ½(β − β^t)ᵀ∇²f(β^t)(β − β^t).
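A minimal gradient-descent sketch on a smooth problem (least squares; the data, step-size choice and iteration count are illustrative): with a fixed step s ≤ 1/λmax(XᵀX) the iterates converge to a point where the gradient vanishes.

% Gradient descent for f(beta) = 0.5*||y - X*beta||_2^2; sketch only.
n = 100; p = 10;
X = randn(n, p); y = randn(n, 1);
grad = @(beta) -X'*(y - X*beta);       % gradient of f
s = 1/norm(X)^2;                       % fixed step: 1/largest eigenvalue of X'X
beta = zeros(p, 1);
for t = 1:500
    beta = beta - s*grad(beta);        % beta^{t+1} = beta^t - s*grad f(beta^t)
end
disp(norm(grad(beta)))                 % ~0: (unconstrained) first-order condition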
30
Projected gradient methods
Suppose we have a constrained optimization problem, β ∈ C .
Alternative interpretation of gradient descent
• Gradient descent: β t+1 = β t − s t ∇f (β t ) for t = 0, 1, 2, . . .
• The step of the algorithm is the solution to
      β^{t+1} = arg minβ∈Rp { f(β^t) + ⟨∇f(β^t), β − β^t⟩ + (1/(2s^t))||β − β^t||₂² }
Projected gradient descent
      β^{t+1} = arg minβ∈C { f(β^t) + ⟨∇f(β^t), β − β^t⟩ + (1/(2s^t))||β − β^t||₂² }
31
Projected gradient methods (geometry)
• Let F(β) = f(β^t) + ⟨∇f(β^t), β − β^t⟩ + (1/(2s^t))||β − β^t||₂²
• Around its unconstrained minimum β̃^{t+1} (the gradient-descent step):
      F(β) = F(β̃^{t+1}) + (1/(2s^t))||β − β̃^{t+1}||₂²,
  so the level sets are spheres centred at the gradient-descent step.
• Projected gradient descent: β^{t+1} = arg minβ∈C F(β) is the
  orthogonal projection of the gradient-descent step onto C.
32
Projected gradient methods (the ball)
• If C = {β ∈ Rp : ||β||₂ ≤ R}, the projection is trivial: rescale any point outside the ball back onto it
• If C = {β ∈ Rp : ||β||₁ ≤ R}, the projection is a variant of
  soft thresholding (a sketch for the ℓ₂ ball follows below)
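A minimal projected-gradient sketch for the ℓ₂ ball (problem and names are illustrative): each iteration takes a gradient step on ½||y − Xβ||₂² and then rescales the result back onto {||β||₂ ≤ R} whenever it falls outside.

% Projected gradient descent onto the l2-ball; sketch only.
n = 100; p = 10; R = 0.5;
X = randn(n, p); y = randn(n, 1);
proj = @(z) z*min(1, R/norm(z));       % projection onto {||beta||_2 <= R}
grad = @(beta) -X'*(y - X*beta);
s = 1/norm(X)^2;
beta = zeros(p, 1);
for t = 1:500
    beta = proj(beta - s*grad(beta));  % gradient step, then projection
end
disp(norm(beta) <= R + 1e-12)          % feasible by construction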
33
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials.
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
34
Proximal methods (non-differentiable case)
Set-up
• f = g + h, where g differentiable and convex, h convex
• β^{t+1} = arg minβ∈Rp { g(β^t) + ⟨∇g(β^t), β − β^t⟩ + (1/(2s^t))||β − β^t||₂² + h(β) }
Proximal map
• prox_h(z) = arg minβ∈Rp { ½||z − β||₂² + h(β) }
• prox_{sh}(z) = arg minβ∈Rp { (1/(2s))||z − β||₂² + h(β) }, for all s > 0
Proximal update
• Simple algebra gives: β^{t+1} = prox_{s^t h}(β^t − s^t ∇g(β^t))
35
Proximal methods for constrained problems
prox_{sh}(z) = arg minβ∈Rp { (1/(2s))||z − β||₂² + h(β) }, for all s > 0
Generalizes projection
• if h(β) = δ(β|C), then prox_{sh}(z) = arg minβ∈C ||z − β||₂.
Generalizes projected gradient descent
• β^{t+1} = prox_{δ(·|C)}(β^t − s^t ∇f(β^t)).
Statistical applications
• This method can be efficient only for special forms of h(β) (those whose proximal map is cheap to evaluate).
• It works well if h is the ℓ₁-norm, the group-lasso ℓ₂-norm, etc.
36
Proximal method for lasso
• Suppose the nondifferentiable component is h(β) = λ||β||₁.
• Soft-thresholding: Sλ (x) = sign(x)(|x| − λ)+ for x, λ ∈ R
Proximal algorithm
Step 1. Take a gradient step z = β t − s t ∇g (β t )
Step 2. Perform elementwise soft-thresholding β t+1 = Ss t λ (z)
• Step 2 follows from the standard calculation that
      arg minβ∈Rp { ½||z − β||₂² + s^t λ||β||₁ } = S_{s^t λ}(z)
  (a full proximal-gradient sketch follows below)
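A minimal proximal-gradient (ISTA-style) sketch for the lasso (data, step size and iteration count are illustrative): Step 1 is a gradient step on the smooth part g(β) = ½||y − Xβ||₂², Step 2 is elementwise soft-thresholding with threshold s^t λ.

% Proximal gradient for the lasso; sketch only.
n = 100; p = 20;
X = randn(n, p); y = randn(n, 1);
lambda = 0.1*norm(X'*y, Inf);
soft = @(z, t) sign(z).*max(abs(z) - t, 0);   % soft-thresholding S_t(z), elementwise
s = 1/norm(X)^2;                              % step size <= 1/lambda_max(X'X)
beta = zeros(p, 1);
for t = 1:1000
    z = beta + s*X'*(y - X*beta);             % Step 1: gradient step on the smooth part
    beta = soft(z, s*lambda);                 % Step 2: proximal (soft-thresholding) step
end
disp(nnz(beta))                               % typically sparse at this lambda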
37
Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials.
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent
38
Coordinate descent
• A convex function is convex in each coordinate.
• Coordinate descent: Minimize one-dimensional functions for
each coordinate.
• This gives the global minimum under some additional
conditions.
Old idea in statistics. . .
• Iterative Proportional Fitting for log-linear models dates back
to Bartlett (1935), Csiszár (1970s).
• Gaussian graphical models: Dempster (1972), Wermuth and
Scheidt (1977), Speed and Kiiveri (1982).
39
When does coordinate descent work?
Your intuition is correct
• If f is continuously differentiable and strictly convex in each
coordinate then coordinate descent converges to the global
optimum.
Separability condition
• Suppose f(β) = g(β) + Σ_{j=1}^p hj(βj), where
  ▶ g : Rp → R is differentiable and convex
  ▶ hj : R → R are convex
• Tseng (1988, 2001): in this scenario coordinate descent
  converges to the global optimum (a lasso sketch follows below)
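A minimal coordinate-descent sketch for the lasso, which fits this separable setting with g(β) = ½||y − Xβ||₂² and hj(βj) = λ|βj| (data and names are illustrative): each coordinate update is a univariate soft-thresholding step on the partial residual.

% Cyclic coordinate descent for the lasso; sketch only.
n = 100; p = 20;
X = randn(n, p); y = randn(n, 1);
lambda = 0.1*norm(X'*y, Inf);
soft = @(z, t) sign(z).*max(abs(z) - t, 0);
beta = zeros(p, 1);
r = y - X*beta;                                    % residual
for sweep = 1:100
    for j = 1:p
        r = r + X(:,j)*beta(j);                    % partial residual: remove column j
        beta(j) = soft(X(:,j)'*r, lambda)/(X(:,j)'*X(:,j));  % univariate lasso update
        r = r - X(:,j)*beta(j);                    % put column j back with the new beta_j
    end
end
disp(nnz(beta))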
40
Thank you!
41
Appendix: Conjugate functions
• Every closed convex set is the intersection of all closed half-spaces containing it.
• If f is convex then epi(f) ⊂ Rp+1 is a convex set
• its non-vertical supporting hyperplanes are graphs of affine functions β ↦ ⟨b, β⟩ − a
• Consider the set of points (β∗, y∗) ∈ Rp+1 such that
  ⟨β∗, β⟩ − y∗ defines a supporting hyperplane of epi(f)
• In particular supβ (⟨β∗, β⟩ − f(β)) ≤ y∗ and so
  (β∗, y∗) ∈ epi(f∗), where
      f∗(β∗) = supβ (⟨β∗, β⟩ − f(β)).
  For example, for f(β) = ½||β||₂² one gets f∗(β∗) = ½||β∗||₂².
42
Appendix: CVX lasso example
m = 500;                   % number of examples
n = 2500;                  % number of features
b0 = sprandn(n,1,0.05);
A = randn(m,n);
A = A*spdiags(1./sqrt(sum(A.^2))',0,n,n);  % normalize columns
v = sqrt(0.001)*randn(m,1);
y = A*b0 + v;
gamma_max = norm(A'*y,'inf');
gamma = 0.1*gamma_max;

cvx_begin quiet
    cvx_precision low
    variable b(n)
    minimize(0.5*sum_square(A*b - y) + gamma*norm(b,1))
cvx_end
• https://web.stanford.edu/~boyd/papers/prox_algs/lasso.html
43