Chapter 5: Optimization methods (part 1)
Piotr Zwiernik and Omiros Papaspiliopoulos
Universitat Pompeu Fabra
April 8, 2016

Optimization literature
Convex Optimization
• S. Boyd and L. Vandenberghe, “Convex Optimization”, 2004.
• S. Boyd’s video lectures:
• R. Tyrrell Rockafellar, “Convex Analysis”, 1970.
Optimization
• J. Nocedal, S.J. Wright, “Numerical Optimization”, 2006.
Links to exponential families
• O.E. Barndorff-Nielsen, “Information and Exponential Families in Statistical Theory”, 1978.

Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials
4. Some remarks on the lasso problem
Algorithmic part
1. Gradient descent
2. Proximal methods
3. Coordinate descent

Convex sets
• $C \subseteq \mathbb{R}^p$ is convex if $\gamma(t) = (1-t)\alpha + t\beta \in C$ for all $\alpha, \beta \in C$ and $t \in (0,1)$.

Convex functions
• $f : C \to \mathbb{R}$ is convex if $f((1-t)\alpha + t\beta) \le (1-t) f(\alpha) + t f(\beta)$ for all $\alpha, \beta \in C$, $t \in (0,1)$.
• $f$ is convex iff its epigraph $\{(\beta, y) \in \mathbb{R}^{p+1} : y \ge f(\beta)\}$ is a convex set.

Examples
• Linear functions are both concave and convex.
Univariate functions, $x \in \mathbb{R}$
• the exponential $e^{ax}$ is convex on $\mathbb{R}$ for any $a \in \mathbb{R}$
• $\log x$ is concave on $(0, \infty)$
• the negative entropy $x \log x$ is convex on $(0, \infty)$
Multivariate functions, $x \in \mathbb{R}^p$
• every norm is convex by the triangle inequality
• $f(x) = \max\{x_1, \dots, x_p\}$ is convex
• $f(x) = \log(e^{x_1} + \dots + e^{x_p})$ is convex on $\mathbb{R}^p$
Functions on $S^p_{++}$ ($X$ symmetric positive definite)
• $f(X) = \log\det X$ is concave
• if $\lambda_1(X) \ge \dots \ge \lambda_p(X)$ are the eigenvalues of $X$, then $\lambda_1(X) + \dots + \lambda_k(X)$ is convex and $\lambda_k(X) + \dots + \lambda_p(X)$ is concave.

Operations that preserve convexity
Basic operations
• If $f_1, \dots, f_k$ are convex then $f(\beta) = \max\{f_1(\beta), \dots$
$, f_k(\beta)\}$ is convex
• If $g(\alpha, \beta)$ is convex in $\beta$ for every $\alpha \in A$ then $f(\beta) = \sup_{\alpha \in A} g(\alpha, \beta)$ is convex
• If $g$ is convex then $f(\beta) = g(A\beta + b)$ is convex
Some conclusions
• Let $A \subseteq \mathbb{R}^p$; then $f(\beta) = \sup_{\alpha \in A} \|\alpha - \beta\|$ is convex
• $f(\beta) = \|y - X\beta\|_2^2$ is convex

Pointwise supremum of convex functions
• If $f_1, f_2 : C \to \mathbb{R}$ are convex then $f(\beta) = \max\{f_1(\beta), f_2(\beta)\}$ is convex.
• The pointwise supremum of any collection of convex functions is convex.
• The pointwise infimum of any collection of concave functions is concave.

How to quickly verify non-convexity
Quick test
• Repeatedly sample pairs of points $x, y \in \mathbb{R}^p$.
• For each pair, plot $f$ on the segment between $x$ and $y$; a single segment on which $f$ rises above its chord is a witness of non-convexity.
Example: Gaussian likelihood
• $\ell(\Sigma) = -\log\det(\Sigma) - \mathrm{trace}(S_n \Sigma^{-1})$
• This function is not concave.
• It is concave over $\{\Sigma : 2S_n - \Sigma \succ 0\}$.
• If $n/p$ is large then it may be hard to find a witness of non-concavity.

A few points
The domain of a function
• Writing $f : \mathbb{R}^p \to \mathbb{R}$ does not mean that $f$ is defined on all of $\mathbb{R}^p$!
• This notation means that $f$ takes the value $+\infty$ wherever it is not defined.
• $\mathrm{dom}(f) := \{\beta \in \mathbb{R}^p : f(\beta) < +\infty\}$
• For a convex function $\mathrm{dom}(f)$ must be convex.
Constraining the domain
• Define the function
  $$\delta(x|C) = \begin{cases} 0 & \text{if } x \in C, \\ +\infty & \text{otherwise.} \end{cases}$$
• $\delta(x|C)$ is convex if and only if $C$ is a convex set.
• $f(x) + \delta(x|C)$ is the restriction of $f$ to $C$.

Convex optimization problem
Problem: minimize $f(\beta)$ over $\beta \in \mathbb{R}^p$ subject to $\beta \in C$
Standard form
• $C = \{\beta \in \mathbb{R}^p : g_1(\beta) \le 0, \dots, g_m(\beta) \le 0\}$
• $g_1, \dots, g_m : \mathbb{R}^p \to \mathbb{R}$ convex functions
• $g(\beta) := (g_1(\beta), \dots, g_m(\beta))$
• additional linear equality constraints are possible
Optimum guarantee
• if $\beta \in C$ is a local optimum then it is a global optimum
• $f^* := \inf\{f(\beta) : \beta \in C\} \in \mathbb{R} \cup \{-\infty, +\infty\}$
• $\beta^* \in C$ is optimal if $f(\beta^*) = f^*$

Basic outline
Theory part
1. Basic definitions and examples
2. Differentiable case: optimization duality, KKT conditions
3. Non-differentiable case: subdifferentials
4. Some remarks on the lasso problem
Algorithmic part
1. Gradient descent
2. Proximal methods
3.
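Coordinate descent

Convexity conditions
Function $f : C \to \mathbb{R}$, where $C \subseteq \mathbb{R}^p$ is convex.
First order conditions
• suppose $\nabla f(\beta)$ exists for every $\beta \in C$
• $f$ is convex iff $f(\beta') \ge f(\beta) + \langle \nabla f(\beta), \beta' - \beta \rangle$ for all $\beta, \beta' \in C$
Second order conditions
• suppose $\nabla^2 f(\beta)$ exists for every $\beta \in C$
• $f$ is convex iff $\nabla^2 f(\beta)$ is positive semi-definite for all $\beta \in C$

First order optimality conditions
Suppose $F(\beta) = \nabla f(\beta)$ exists for all $\beta \in C$.
• $\beta^* \in C$ is a global optimum if and only if $\langle \nabla f(\beta^*), \beta - \beta^* \rangle \ge 0$ for all $\beta \in C$
• If an optimal $\beta^*$ is an interior point of $C$, then $\nabla f(\beta^*) = 0$

Lagrangian function
Lagrangian function $L : \mathbb{R}^p \times \mathbb{R}^m_{\ge 0} \to \mathbb{R}$
• $L(\beta; \lambda) = f(\beta) + \lambda_1 g_1(\beta) + \dots + \lambda_m g_m(\beta)$
• $L(\beta; \lambda)$ is convex in $\beta$ and linear in $\lambda$ (in particular concave)
Simple observation (no convexity assumed)
• Fix $\beta \in \mathbb{R}^p$. If $g(\beta) \le 0$ then $\sup_{\lambda \ge 0} \lambda^T g(\beta) = 0$;
• if $g_i(\beta) > 0$ for some $i$ then $\sup_{\lambda \ge 0} \lambda^T g(\beta) = \infty$, and so
  $$\sup_{\lambda \ge 0} L(\beta; \lambda) = \begin{cases} f(\beta) & \text{if } g(\beta) \le 0, \\ +\infty & \text{otherwise,} \end{cases}$$
  and thus $f^* = \inf_{\beta \in \mathbb{R}^p} \sup_{\lambda \ge 0} L(\beta; \lambda)$.
• Question: How does $\sup_{\lambda \ge 0} \inf_{\beta \in \mathbb{R}^p} L(\beta; \lambda)$ relate to $f^*$?

Lagrange duality
Dual function (no convexity assumed)
• dual function $h : \mathbb{R}^m_{\ge 0} \to \mathbb{R}$,
  $$h(\lambda) = \inf_{\beta \in \mathbb{R}^p} L(\beta; \lambda) = \inf_{\beta \in \mathbb{R}^p} \big( f(\beta) + \lambda^T g(\beta) \big)$$
• $h$ is concave (a pointwise infimum of functions that are linear, hence concave, in $\lambda$)
• lower bound on the optimal value: $h(\lambda) \le f^*$
  ◦ if $\beta \in C$ then $f(\beta) + \lambda^T g(\beta) \le f(\beta)$, and so
    $$\inf_{\beta \in \mathbb{R}^p} \big( f(\beta) + \lambda^T g(\beta) \big) \le \inf_{\beta \in C} \big( f(\beta) + \lambda^T g(\beta) \big) \le \inf_{\beta \in C} f(\beta) = f^*.$$
Convex dual problem
• maximize $h(\lambda)$ subject to $\lambda \ge 0$
• a maximizer $\lambda^*$ is called dual optimal
• $h(\lambda^*)$ is the best lower bound on $f^*$

Karush-Kuhn-Tucker conditions
We say that strong duality holds if $h(\lambda^*) = f^*$.
Slater’s condition
• If $g(\beta) < 0$ for some $\beta \in C$ then strong duality holds.
KKT conditions
Let $\lambda^* \ge 0$ be a dual vector and $\beta^* \in \mathbb{R}^p$ a primal vector. Under strong duality, the following conditions are necessary and sufficient for $\beta^*$ to be a global optimum and $\lambda^*$ to be dual optimal.
(a) Primal feasibility: $g(\beta^*) \le 0$.
(b) Complementary slackness: $\lambda_i^* g_i(\beta^*) = 0$ for $i = 1, \dots$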
$, m$.
(c) Lagrangian condition: the pair $(\beta^*, \lambda^*)$ satisfies $0 = \nabla_\beta L(\beta^*; \lambda^*) = \nabla f(\beta^*) + (\lambda^*)^T \nabla g(\beta^*)$.
• Recall the Lagrange theorem...

Some intuition behind KKT conditions
• Consider a triangle described by three linear inequalities.

Example: minimum over the nonnegative orthant
• minimize $f(\beta)$ subject to $-\beta \le 0$
• $L(\beta; \lambda) = f(\beta) - \lambda^T \beta$
• $\nabla_\beta L(\beta; \lambda) = 0$ if and only if $\lambda = \nabla f(\beta)$
KKT conditions
• primal feasibility: $\beta^* \ge 0$
• dual feasibility: $\nabla f(\beta^*) \ge 0$
• complementary slackness: $\beta_i^* (\nabla f(\beta^*))_i = 0$ for all $i$
• Lagrangian condition: $\lambda^* = \nabla f(\beta^*)$

Nondifferentiable f and subdifferentials
• If $f$ is differentiable, then $f(\beta') \ge f(\beta) + \langle \nabla f(\beta), \beta' - \beta \rangle$
• A vector $z \in \mathbb{R}^p$ is said to be a subgradient of $f$ at $\beta$ if $f(\beta') \ge f(\beta) + \langle z, \beta' - \beta \rangle$ for all $\beta' \in \mathbb{R}^p$.
• it defines a supporting hyperplane of the epigraph of $f$
• subdifferential $\partial f(\beta)$: the set of all subgradients of $f$ at $\beta$

Examples
Recall: $z \in \partial f(\beta)$ if $f(\beta') \ge f(\beta) + \langle z, \beta' - \beta \rangle$ for all $\beta' \in \mathbb{R}^p$
Absolute value $f(\beta) = |\beta|$, $\beta \in \mathbb{R}$
  $$\partial f(\beta) = \begin{cases} \{1\} & \text{if } \beta > 0, \\ \{-1\} & \text{if } \beta < 0, \\ [-1, 1] & \text{if } \beta = 0. \end{cases}$$
Indicator function $\delta(\beta|C)$, $C$ convex
• We have $z \in \partial\delta(\beta|C)$ if and only if
  ◦ $\delta(\beta'|C) \ge \delta(\beta|C) + \langle z, \beta' - \beta \rangle$ for all $\beta' \in \mathbb{R}^p$, or equivalently
  ◦ $\beta \in C$ and $\langle z, \beta' - \beta \rangle \le 0$ for all $\beta' \in C$,
  ◦ that is, $z$ defines a supporting hyperplane of $C$ at $\beta$.
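
The subgradient inequality for the absolute value can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper names `abs_subdifferential` and `is_subgradient` and the test grid are assumptions made for illustration, and a finite grid can only refute a candidate subgradient, never prove one.

```python
def abs_subdifferential(beta):
    """Subdifferential of f(b) = |b| at beta, returned as a closed interval (lo, hi)."""
    if beta > 0:
        return (1.0, 1.0)
    if beta < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)


def is_subgradient(z, beta, grid, tol=1e-12):
    """Check |b'| >= |beta| + z * (b' - beta) for every b' in the grid."""
    return all(abs(b) >= abs(beta) + z * (b - beta) - tol for b in grid)


# Hypothetical test grid on [-3, 3] (an assumption; any finite grid is only a spot check).
grid = [k / 10.0 for k in range(-30, 31)]

lo, hi = abs_subdifferential(0.0)
print(is_subgradient(lo, 0.0, grid), is_subgradient(hi, 0.0, grid))  # endpoints of [-1, 1]
print(is_subgradient(0.3, 0.0, grid))  # a point inside [-1, 1]
print(is_subgradient(1.5, 0.0, grid))  # a point outside [-1, 1]
print(is_subgradient(1.0, 2.0, grid))  # the derivative at beta = 2
```

Slopes inside $[-1, 1]$ pass the check at $\beta = 0$, while a slope such as 1.5 fails it already at $\beta' = 1$, in line with the case distinction above.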

Basic properties of subdifferentials
Geometric properties
• $\partial f(\beta)$ is a closed convex set
• $\partial f(\beta) = \{\nabla f(\beta)\}$ if $f$ is differentiable at $\beta$
• $\beta^*$ is a minimum if and only if $0 \in \partial f(\beta^*)$
Basic calculus
• if $t > 0$ then $\partial(t f) = t\, \partial f$
• $\partial(f + g) = \partial f + \partial g$ (the right-hand side is a Minkowski sum)
• if $f(\beta) = g(A\beta + b)$, then $\partial f(\beta) = A^T \partial g(A\beta + b)$
Our main example
• $\partial\|\beta\|_1 = S$, where $S = \{s \in \mathbb{R}^p : s_i = \mathrm{sgn}(\beta_i) \text{ if } \beta_i \ne 0, \text{ and } s_i \in [-1, 1] \text{ otherwise}\}$

Generalized KKT conditions
• $\beta$ is a minimum of $f$ if and only if $0 \in \partial f(\beta)$, i.e., $0$ is a subgradient of $f$ at $\beta$
• $z \in \partial f(\beta^*)$ if and only if $\langle z, \beta \rangle - f(\beta)$ achieves its supremum at $\beta = \beta^*$ (conjugate function)
  ◦ $f(\beta) \ge f(\beta^*) + \langle z, \beta - \beta^* \rangle$ is equivalent to $\langle z, \beta^* \rangle - f(\beta^*) \ge \langle z, \beta \rangle - f(\beta)$
• The generalized KKT theory can be applied with $0 \in \partial f(\beta^*) + \lambda_1^* \partial g_1(\beta^*) + \dots + \lambda_m^* \partial g_m(\beta^*)$.
• KKT conditions can be derived from the theory of subdifferentiation (see Rockafellar, p. 283)

Finding the optimal parameter, λ fixed
• $\min_{\beta \in \mathbb{R}^p} \tfrac12 \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$
• $\partial f(\beta) = -X^T (y - X\beta) + \lambda S$, where $S = \partial\|\beta\|_1$. Solve $0 \in \partial f(\beta)$ to minimize.
• for each coordinate: $0 \in -X_i^T (y - X\beta) + \lambda S_i$, where $X_i$ is the $i$-th column of $X$
  ◦ Recall: $S_i = \{\mathrm{sign}(\beta_i)\}$ if $\beta_i \ne 0$ and $S_i = [-1, 1]$ if $\beta_i = 0$.
• These conditions are coupled across coordinates, so they are hard (impossible?) to solve exactly.
• This motivates coordinate descent and other methods.

Connection to the constraint $\|\beta\|_1 \le R$
• $f(\beta) = \tfrac12 \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ already looks like a Lagrangian.
Consider the following problem
• $\min_{\beta \in \mathbb{R}^p} \tfrac12 \|y - X\beta\|_2^2$ subject to $g(\beta) = \|\beta\|_1 - R \le 0$
• $L(\beta; \lambda) = \tfrac12 \|y - X\beta\|_2^2 + \lambda(\|\beta\|_1 - R)$
• Lagrangian condition: $0 \in -X^T (y - X\beta^*) + \lambda^* S^*$, where $S^* = \partial\|\beta^*\|_1$
• This means that for some $\lambda$ these two problems are equivalent.

Dual of the lasso
• Lasso primal: $\min_{\beta \in \mathbb{R}^p} \tfrac12 \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$
• Equivalently: $\min_{\beta \in \mathbb{R}^p,\, r} \tfrac12 \|r\|_2^2 + \lambda\|\beta\|_1$ subject to $r = y - X\beta$
• We use the Lagrangian
  $$L(\beta, r, \theta) = \tfrac12 \|r\|_2^2 + \lambda\|\beta\|_1 - \theta^T (r - y + X\beta)$$
• This can be minimised separately with respect to $\beta$ and $r$, which gives $\theta = r$ and
  $$\min_{\beta \in \mathbb{R}^p} \big( -\theta^T X\beta + \lambda\|\beta\|_1 \big) = \begin{cases} 0 & \text{if } \|X^T\theta\|_\infty \le \lambda, \\ -\infty & \text{otherwise.} \end{cases}$$
• Lasso dual: $\max_\theta \tfrac12 \{\|y\|_2^2 - \|y - \theta\|_2^2\}$ subject to $\|X^T\theta\|_\infty \le \lambda$. This is a projection of $y$ onto a polytope!

Basic algorithms (unconstrained case)
If $f$ is differentiable then at an optimum $\beta^*$ we have $\nabla f(\beta^*) = 0$.
First-order method
• Gradient descent: $\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t)$ for $t = 0, 1, 2, \dots$
• $-\nabla f(\beta^t)$ is the direction of steepest descent and $s^t > 0$ is the step size.
• alternatively: instead of $-\nabla f(\beta^t)$ use any direction $\Delta^t$ with $\langle -\nabla f(\beta^t), \Delta^t \rangle > 0$
Newton’s method
• $\Delta^t = -(\nabla^2 f(\beta^t))^{-1} \nabla f(\beta^t)$
• obtained by minimization of the quadratic approximation $f(\beta^t) + (\nabla f(\beta^t))^T (\beta - \beta^t) + \tfrac12 (\beta - \beta^t)^T \nabla^2 f(\beta^t) (\beta - \beta^t)$.

Projected gradient methods
Suppose we have a constrained optimization problem, $\beta \in C$.
Alternative interpretation of gradient descent
• Gradient descent: $\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t)$ for $t = 0, 1, 2, \dots$
• The step of the algorithm is the solution to
  $$\beta^{t+1} = \arg\min_{\beta \in \mathbb{R}^p} \Big\{ f(\beta^t) + (\nabla f(\beta^t))^T (\beta - \beta^t) + \frac{1}{2s^t} \|\beta - \beta^t\|_2^2 \Big\}$$
Projected gradient descent
  $$\beta^{t+1} = \arg\min_{\beta \in C} \Big\{ f(\beta^t) + (\nabla f(\beta^t))^T (\beta - \beta^t) + \frac{1}{2s^t} \|\beta - \beta^t\|_2^2 \Big\}$$

Projected gradient methods (geometry)
• Let $F(\beta) = f(\beta^t) + (\nabla f(\beta^t))^T (\beta - \beta^t) + \frac{1}{2s^t} \|\beta - \beta^t\|_2^2$
• Unconstrained minimum: writing $\tilde\beta^{t+1} = \beta^t - s^t \nabla f(\beta^t)$ for the gradient descent step, $F(\beta) = F(\tilde\beta^{t+1}) + \frac{1}{2s^t} \|\beta - \tilde\beta^{t+1}\|_2^2$, so the level sets are spheres around the gradient descent step.
• Projected gradient descent: $\beta^{t+1} = \arg\min_{\beta \in C} F(\beta)$ is the orthogonal projection of the gradient descent step onto $C$.

Projected gradient methods (the ball)
• If $C = \{\beta \in \mathbb{R}^p : \|\beta\|_2 \le R\}$ the projection is trivial: a point outside the ball is rescaled to have norm $R$
• If $C = \{\beta \in \mathbb{R}^p : \|\beta\|_1 \le R\}$ the projection is a variant of soft thresholding

Proximal methods (non-differentiable case)
Set-up
• $f = g + h$, where $g$ is differentiable and convex, and $h$ is convex
• $\beta^{t+1} = \arg\min_{\beta \in \mathbb{R}^p} \big\{ g(\beta^t) + (\nabla g(\beta^t))^T (\beta - \beta^t) + \frac{1}{2s^t} \|\beta - \beta^t\|_2^2 + h(\beta) \big\}$
Proximal map
• $\mathrm{prox}_h(z) = \arg\min_{\beta \in \mathbb{R}^p} \{ \tfrac12 \|z - \beta\|_2^2 + h(\beta) \}$
• $\mathrm{prox}_{sh}(z) = \arg\min_{\beta \in \mathbb{R}^p} \{ \frac{1}{2s} \|z - \beta\|_2^2 + h(\beta) \}$, for all $s > 0$
Proximal update
• Simple algebra gives: $\beta^{t+1} = \mathrm{prox}_{s^t h}(\beta^t - s^t \nabla g(\beta^t))$

Proximal methods for constrained problems
$\mathrm{prox}_{sh}(z) = \arg\min_{\beta \in \mathbb{R}^p} \{ \frac{1}{2s} \|z - \beta\|_2^2 + h(\beta) \}$, for all $s > 0$
Generalizes projection
• if $h(\beta) = \delta(\beta|C)$ then $\mathrm{prox}_{sh}(z) = \arg\min_{\beta \in C} \|z - \beta\|_2$.
Generalizes projected gradient descent
• $\beta^{t+1} = \mathrm{prox}_{\delta(\cdot|C)}(\beta^t - s^t \nabla f(\beta^t))$.
Statistical applications
• This method can be efficient only for special forms of $h(\beta)$.
• It does work well if $h$ is the $\ell_1$-norm, the group lasso $\ell_2$-norm, etc.

Proximal method for lasso
• Suppose the nondifferentiable component is $h(\beta) = \lambda\|\beta\|_1$.
• Soft-thresholding: $S_\lambda(x) = \mathrm{sign}(x)(|x| - \lambda)_+$ for $x, \lambda \in \mathbb{R}$
Proximal algorithm
Step 1. Take a gradient step $z = \beta^t - s^t \nabla g(\beta^t)$
Step 2. Perform elementwise soft-thresholding $\beta^{t+1} = S_{s^t\lambda}(z)$
• Step 2 follows by the standard calculation that $\arg\min_{\beta \in \mathbb{R}^p} \{ \tfrac12 \|z - \beta\|_2^2 + s^t\lambda\|\beta\|_1 \} = S_{s^t\lambda}(z)$

Coordinate descent
• A convex function is convex in each coordinate.
• Coordinate descent: minimize the one-dimensional functions, one coordinate at a time.
• This gives the global minimum under some additional conditions.
Old idea in statistics...
• Iterative Proportional Fitting for log-linear models dates back to Bartlett (1935), Csiszár (1970s).
• Gaussian graphical models: Dempster (1972), Wermuth and Scheidt (1977), Speed and Kiiveri (1982).

When does coordinate descent work?
Your intuition is correct
• If $f$ is continuously differentiable and strictly convex in each coordinate then coordinate descent converges to the global optimum.
Separability condition
• Suppose $f(\beta) = g(\beta) + \sum_{j=1}^p h_j(\beta_j)$, where
  ◦ $g : \mathbb{R}^p \to \mathbb{R}$ is differentiable and convex
  ◦ $h_j : \mathbb{R} \to \mathbb{R}$ are convex
• Tseng (1988, 2001): in this scenario coordinate descent converges to the global optimum

Thank you!

Appendix: Conjugate functions
• Every closed convex set is the intersection of all supporting half-spaces containing it.
• if $f$ is convex then $\mathrm{epi}(f) \subset \mathbb{R}^{d+1}$ is convex
• all supporting hyperplanes are of the form $\langle b, \beta \rangle - a$
• Consider the set of points $(\beta^*, y^*) \in \mathbb{R}^{d+1}$ such that $\langle \beta^*, \beta \rangle - y^*$ is a supporting hyperplane of $\mathrm{epi}(f)$
• In particular $\sup_\beta (\langle \beta^*, \beta \rangle - f(\beta)) \le y^*$ and so $(\beta^*, y^*) \in \mathrm{epi}(f^*)$, where $f^*(\beta^*) = \sup_\beta (\langle \beta^*, \beta \rangle - f(\beta))$.
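
As a worked instance of the conjugate, consider the illustrative choice $f(\beta) = \tfrac12\|\beta\|_2^2$ on $\mathbb{R}^d$. This choice of $f$ is an assumption made here for illustration and does not appear in the slides.

```latex
\begin{align*}
f^*(\beta^*)
  &= \sup_{\beta \in \mathbb{R}^d}
     \Big( \langle \beta^*, \beta \rangle - \tfrac12 \|\beta\|_2^2 \Big) \\
  &\qquad \text{(concave in } \beta \text{; the gradient }
     \beta^* - \beta \text{ vanishes at } \beta = \beta^* \text{)} \\
  &= \langle \beta^*, \beta^* \rangle - \tfrac12 \|\beta^*\|_2^2
   = \tfrac12 \|\beta^*\|_2^2 .
\end{align*}
```

So this $f$ is its own conjugate. By the same definition, the conjugate of $\lambda\|\beta\|_1$ is $0$ when $\|\beta^*\|_\infty \le \lambda$ and $+\infty$ otherwise, which is where the constraint $\|X^T\theta\|_\infty \le \lambda$ in the lasso dual comes from.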

Appendix: CVX lasso example

    m = 500;    % number of examples
    n = 2500;   % number of features

    b0 = sprandn(n,1,0.05);                     % sparse true coefficient vector
    A = randn(m,n);
    A = A*spdiags(1./sqrt(sum(A.^2))',0,n,n);   % normalize columns
    v = sqrt(0.001)*randn(m,1);                 % observation noise
    y = A*b0 + v;

    gamma_max = norm(A'*y,'inf');               % for gamma >= gamma_max the solution is b = 0
    gamma = 0.1*gamma_max;

    cvx_begin quiet
        cvx_precision low
        variable b(n)
        minimize(0.5*sum_square(A*b - y) + gamma*norm(b,1))
    cvx_end

• lasso.html
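
For comparison with the CVX formulation, the two-step proximal algorithm from "Proximal method for lasso" (gradient step, then elementwise soft-thresholding) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy data, the fixed step size and the iteration count are all hypothetical choices made for illustration.

```python
def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0), for lam >= 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_gradient(X, y, beta):
    """Gradient of g(beta) = 0.5 * ||y - X beta||_2^2, i.e. -X^T (y - X beta)."""
    n, p = len(X), len(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    return [-sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]


def proximal_gradient_lasso(X, y, lam, step, n_iter=500):
    """Iterate beta <- S_{step*lam}(beta - step * grad g(beta)), starting from 0."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = lasso_gradient(X, y, beta)
        # Step 1 (gradient step) and Step 2 (soft-thresholding) in one pass.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta


# Hypothetical toy data (an assumption): 3 observations, 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [3.0, 0.2, 3.0]
lam = 1.0
# Fixed step 1/L, where L = 3 is the largest eigenvalue of X^T X for this X.
beta_hat = proximal_gradient_lasso(X, y, lam, step=1.0 / 3.0)
print(beta_hat)
```

For this toy problem the subgradient conditions from "Finding the optimal parameter, λ fixed" can be solved by hand and give $\beta = (2.5, 0)$, so the iterates should approach that point. A constant step of $1/L$ is one simple choice; a line search would be a natural alternative when $L$ is not known.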