Shiqian Ma, SEEM 5121, Dept. of SEEM, CUHK

Chapter 6: Proximal Gradient Method and Proximal Mapping

6.1. Proximal gradient method

Unconstrained optimization with the objective split into two components:
$$\min\; f(x) = g(x) + h(x)$$
• $g$ convex, differentiable, $\operatorname{dom} g = \mathbf{R}^n$
• $h$ convex, but non-differentiable

So,
• we can't use the gradient method, because $h$ is non-differentiable
• we do not want to use the subgradient method, because it is too slow
• but is there any special structure in $h$?
• if $h$ has no special structure, we still have to use the subgradient method

Proximal mapping

We assume that $h$ has a special structure: the proximal mapping (prox-operator) of $h$ is easy to compute!

The proximal mapping (prox-operator) of a convex function $h$ is defined as
$$\operatorname{prox}_h(x) = \operatorname*{argmin}_u \Big( h(u) + \tfrac12 \|u - x\|_2^2 \Big)$$

Examples:
• $h(x) = 0$: $\operatorname{prox}_h(x) = x$
• $h(x) = I_C(x)$ (indicator function of $C$): $\operatorname{prox}_h$ is the projection on $C$,
$$\operatorname{prox}_h(x) = \operatorname*{argmin}_{u \in C} \|u - x\|_2^2 = P_C(x),$$
where the indicator function of $C$ is defined as
$$I_C(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{otherwise} \end{cases}$$
• $h(x) = \|x\|_1$: $\operatorname{prox}_h$ is the "soft-threshold" (shrinkage) operation
$$\operatorname{prox}_h(x)_i = \begin{cases} x_i - 1 & \text{if } x_i \ge 1 \\ 0 & \text{if } |x_i| \le 1 \\ x_i + 1 & \text{if } x_i \le -1 \end{cases}$$
• proof:
$$\operatorname{prox}_h(x) = \operatorname*{argmin}_u \Big( \|u\|_1 + \tfrac12 \|u - x\|_2^2 \Big)$$
This problem is separable! It suffices to solve, for each $i$,
$$\min_{u_i}\; |u_i| + \tfrac12 (u_i - x_i)^2,$$
whose optimality condition $0 \in \partial |u_i| + u_i - x_i$ gives the three cases above.

Proximal gradient method

We want to solve
$$\min\; f(x) = g(x) + h(x)$$
• $g$ convex, differentiable, $\operatorname{dom} g = \mathbf{R}^n$
• $h$ convex, but non-differentiable

The proximal gradient method:
$$x^k = \operatorname*{argmin}_x \Big( \frac{1}{2t_k} \big\|x - \big(x^{k-1} - t_k \nabla g(x^{k-1})\big)\big\|^2 + h(x) \Big)$$
or equivalently,
$$x^k = \operatorname{prox}_{t_k h}\big( x^{k-1} - t_k \nabla g(x^{k-1}) \big)$$
$t_k > 0$ is the step size, constant or determined by line search.

Interpretation

$$x^+ = \operatorname*{argmin}_u \Big( \tfrac12 \|u - (x - t\nabla g(x))\|^2 + t\, h(u) \Big)$$
or equivalently, $x^+ = \operatorname{prox}_{th}(x - t\nabla g(x))$.

Note that
$$x^+ = \operatorname*{argmin}_u \Big( h(u) + \frac{1}{2t}\|u - x + t\nabla g(x)\|_2^2 \Big) = \operatorname*{argmin}_u \Big( h(u) + g(x) + \nabla g(x)^\top (u - x) + \frac{1}{2t}\|u - x\|_2^2 \Big)$$
so $x^+$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around $x$.

Another interpretation

We want to solve $\min f(x) = g(x) + h(x)$. The optimality condition is
$$0 \in \nabla g(x) + \partial h(x)$$
This is equivalent to (for any $t > 0$)
$$0 \in t\nabla g(x) + t\,\partial h(x) \quad\Leftrightarrow\quad 0 \in x - (x - t\nabla g(x)) + t\,\partial h(x)$$
Algorithm:
• find $x^k$ such that $0 \in x^k - (x^{k-1} - t\nabla g(x^{k-1})) + t\,\partial h(x^k)$

This means $x^k$ is the optimal solution of
$$\min_x\; \tfrac12 \big\|x - \big(x^{k-1} - t\nabla g(x^{k-1})\big)\big\|^2 + t\, h(x)$$

Examples

Gradient method: special case with $h(x) = 0$,
$$x^+ = x - t\nabla g(x)$$
Gradient projection method: special case with $h(x) = I_C(x)$,
$$x^+ = P_C(x - t\nabla g(x))$$
Iterative soft-thresholding algorithm (ISTA): special case with $h(x) = \|x\|_1$,
$$x^+ = \operatorname{prox}_{th}(x - t\nabla g(x))$$
where
$$\operatorname{prox}_{th}(u)_i = \begin{cases} u_i - t & \text{if } u_i \ge t \\ 0 & \text{if } -t \le u_i \le t \\ u_i + t & \text{if } u_i \le -t \end{cases}$$
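As a concrete illustration of the update $x^+ = \operatorname{prox}_{th}(x - t\nabla g(x))$, here is a minimal ISTA sketch for a lasso-type problem. It is not part of the notes: the problem instance, the names (`A`, `b`, `lam`), and the constant step size $t = 1/\|A\|_2^2$ are illustrative choices.

```python
# ISTA for  min_x  (1/2)||Ax - b||_2^2 + lam*||x||_1
# (g smooth, h = lam*||.||_1 with an easy prox)
import numpy as np

def soft_threshold(u, tau):
    # prox of tau*||.||_1: the shrinkage operation from the Examples slide
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    x = np.zeros(A.shape[1])
    t = 1.0 / np.linalg.norm(A, 2) ** 2      # t <= 1/L with L = ||A^T A||_2
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)             # gradient of g
        x = soft_threshold(x - t * grad, t * lam)   # x+ = prox_{t h}(x - t grad g(x))
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
print("objective:", 0.5 * np.linalg.norm(A @ x_hat - b) ** 2 + 0.1 * np.abs(x_hat).sum())
```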
6.2. Conjugate function

We next review conjugate functions; they are needed for the properties of proximal mappings developed below.

Closed function
• definition: a function is closed if its epigraph is a closed set (a set that contains its boundary)
• sublevel sets: $f$ is closed iff all its sublevel sets are closed
• minimum: if $f$ is closed with bounded sublevel sets, then it has a minimizer

Conjugate function
• the conjugate of a function $f$ is
$$f^*(y) = \sup_{x \in \operatorname{dom} f} \big( y^\top x - f(x) \big)$$
$f^*$ is closed (its epigraph is a closed set) and convex, even if $f$ is not
• Fenchel's inequality:
$$f(x) + f^*(y) \ge x^\top y, \quad \forall x, y$$
(extends the inequality $x^\top x/2 + y^\top y/2 \ge x^\top y$ to non-quadratic convex $f$)

Quadratic function
$$f(x) = \tfrac12 x^\top A x + b^\top x + c$$
• strictly convex case ($A \succ 0$):
$$f^*(y) = \tfrac12 (y - b)^\top A^{-1} (y - b) - c$$
• general convex case ($A \succeq 0$):
$$f^*(y) = \tfrac12 (y - b)^\top A^\dagger (y - b) - c, \qquad \operatorname{dom} f^* = \operatorname{range}(A) + b$$

Negative entropy and negative logarithm
• negative entropy:
$$f(x) = \sum_{i=1}^n x_i \log x_i, \qquad f^*(y) = \sum_{i=1}^n e^{y_i - 1}$$
• negative logarithm:
$$f(x) = -\sum_{i=1}^n \log x_i, \qquad f^*(y) = -\sum_{i=1}^n \log(-y_i) - n$$
• matrix logarithm:
$$f(X) = -\log\det X \;\;(\operatorname{dom} f = \mathbf{S}^n_{++}), \qquad f^*(Y) = -\log\det(-Y) - n$$

Indicator function and norm
• indicator of a convex set $C$: the conjugate is the support function of $C$
$$f(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{if } x \notin C \end{cases} \qquad f^*(y) = \sup_{x \in C} y^\top x$$
• norm: the conjugate is the indicator of the unit dual norm ball
$$f(x) = \|x\|, \qquad f^*(y) = \begin{cases} 0 & \text{if } \|y\|_* \le 1 \\ +\infty & \text{if } \|y\|_* > 1 \end{cases}$$
(proof below)

Proof: recall the definition of the dual norm,
$$\|y\|_* = \sup_{\|x\| \le 1} x^\top y$$
To evaluate $f^*(y) = \sup_x (y^\top x - \|x\|)$ we distinguish two cases:
• if $\|y\|_* \le 1$, then (by the generalized Cauchy–Schwarz inequality $y^\top x \le \|y\|_* \|x\|$)
$$y^\top x \le \|x\|, \quad \forall x,$$
and equality holds at $x = 0$; therefore $\sup_x (y^\top x - \|x\|) = 0$
• if $\|y\|_* > 1$, there exists an $x$ with $\|x\| \le 1$, $x^\top y > 1$; then
$$f^*(y) \ge y^\top (tx) - \|tx\| = t(y^\top x - \|x\|)$$
and the right-hand side goes to infinity as $t \to \infty$

The second conjugate
$$f^{**}(x) = \sup_{y \in \operatorname{dom} f^*} \big( x^\top y - f^*(y) \big)$$
• $f^{**}$ is closed and convex
• from Fenchel's inequality ($x^\top y - f^*(y) \le f(x)$ for all $y$ and $x$):
$$f^{**}(x) \le f(x), \quad \forall x$$
equivalently, $\operatorname{epi} f \subseteq \operatorname{epi} f^{**}$ (for any $f$)
• if $f$ is closed and convex, then
$$f^{**}(x) = f(x), \quad \forall x$$
equivalently, $\operatorname{epi} f = \operatorname{epi} f^{**}$ (if $f$ is closed convex); proof below

Proof ($f^{**} = f$ if $f$ is closed and convex), by contradiction: suppose $(x, f^{**}(x)) \notin \operatorname{epi} f$; then there is a strictly separating hyperplane:
$$\begin{pmatrix} a \\ b \end{pmatrix}^{\!\top} \begin{pmatrix} z - x \\ s - f^{**}(x) \end{pmatrix} \le c < 0, \quad \forall (z, s) \in \operatorname{epi} f$$
for some $a, b, c$ with $b \le 0$ ($b > 0$ gives a contradiction as $s \to \infty$)
• if $b < 0$: define $y = a/(-b)$ and maximize the left-hand side over $(z, s) \in \operatorname{epi} f$:
$$f^*(y) - y^\top x + f^{**}(x) \le c/(-b) < 0,$$
which contradicts Fenchel's inequality
• if $b = 0$: then $a^\top (z - x) \le c < 0$ for all $z \in \operatorname{dom} f$; choosing $z = x$ gives a contradiction

Conjugates and subgradients

If $f$ is closed and convex, then
$$y \in \partial f(x) \;\Leftrightarrow\; x \in \partial f^*(y) \;\Leftrightarrow\; x^\top y = f(x) + f^*(y)$$
Proof: if $y \in \partial f(x)$, then $f^*(y) = \sup_u (y^\top u - f(u)) = y^\top x - f(x)$; hence
$$f^*(v) = \sup_u (v^\top u - f(u)) \ge v^\top x - f(x) = x^\top (v - y) - f(x) + y^\top x = f^*(y) + x^\top (v - y)$$
for all $v$; therefore, $x$ is a subgradient of $f^*$ at $y$ ($x \in \partial f^*(y)$).
The reverse implication $x \in \partial f^*(y) \Rightarrow y \in \partial f(x)$ follows from $f^{**} = f$.
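Before moving on, a quick numerical sanity check (not part of the notes) of the quadratic conjugate formula above: for strictly convex $f$ the supremum in $f^*(y)$ is attained at $x^\star = A^{-1}(y - b)$, so the closed form can be compared against $y^\top x^\star - f(x^\star)$. All names below are illustrative.

```python
# Verify f*(y) = (1/2)(y-b)^T A^{-1} (y-b) - c  for  f(x) = (1/2)x^T A x + b^T x + c, A > 0
import numpy as np

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # random positive definite A
b = rng.standard_normal(n)
c = 0.7

f = lambda x: 0.5 * x @ A @ x + b @ x + c
y = rng.standard_normal(n)

x_star = np.linalg.solve(A, y - b)        # maximizer of y^T x - f(x)
conj_numeric = y @ x_star - f(x_star)
conj_formula = 0.5 * (y - b) @ np.linalg.solve(A, y - b) - c
print(abs(conj_numeric - conj_formula))   # ~ 1e-15

# Fenchel's inequality f(x) + f*(y) >= x^T y at a random point
x = rng.standard_normal(n)
assert f(x) + conj_formula >= x @ y - 1e-12
```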
6.3. Proximal mapping

If $h$ is convex and closed (has a closed epigraph), then
$$\operatorname{prox}_h(x) = \operatorname*{argmin}_u \Big( h(u) + \tfrac12 \|u - x\|_2^2 \Big)$$
exists and is unique for all $x$:
• existence: $h(u) + \tfrac12\|u - x\|^2$ is closed with bounded sublevel sets
• uniqueness: $h(u) + \tfrac12\|u - x\|^2$ is strictly (in fact, strongly) convex

Subgradient characterization: from the optimality conditions of the minimization in the definition,
$$u = \operatorname{prox}_h(x) \;\Leftrightarrow\; x - u \in \partial h(u) \;\Leftrightarrow\; h(z) \ge h(u) + (x - u)^\top (z - u), \;\forall z$$

Projection on a closed convex set

The proximal mapping of the indicator function $I_C$ is the Euclidean projection on $C$:
$$\operatorname{prox}_{I_C}(x) = \operatorname*{argmin}_{u \in C} \|u - x\|_2^2 = P_C(x)$$
Subgradient characterization:
$$u = P_C(x) \;\Leftrightarrow\; (x - u)^\top (z - u) \le 0, \;\forall z \in C$$
We will see that proximal mappings have many properties of projections.

Nonexpansiveness

If $u = \operatorname{prox}_h(x)$ and $v = \operatorname{prox}_h(y)$, then
$$(u - v)^\top (x - y) \ge \|u - v\|_2^2,$$
i.e., $\operatorname{prox}_h$ is firmly nonexpansive.
• proof: follows from the monotonicity of $\partial h$:
$$x - u \in \partial h(u),\;\; y - v \in \partial h(v) \;\Rightarrow\; (x - u - y + v)^\top (u - v) \ge 0$$
• implies (by the Cauchy–Schwarz inequality)
$$\|\operatorname{prox}_h(x) - \operatorname{prox}_h(y)\|_2 \le \|x - y\|_2,$$
i.e., $\operatorname{prox}_h$ is nonexpansive, or Lipschitz continuous with constant 1.

6.3.1. Examples of Easy Proximal Mappings

"Easy" means that the computational effort of the proximal mapping is approximately equal to the effort of computing a gradient or subgradient of $h$. Examples:
• quadratic function ($A \succeq 0$):
$$f(x) = \tfrac12 x^\top A x + b^\top x + c, \qquad \operatorname{prox}_{tf}(x) = (I + tA)^{-1}(x - tb)$$
• Euclidean norm $f(x) = \|x\|_2$:
$$\operatorname{prox}_{tf}(x) = \begin{cases} (1 - t/\|x\|_2)\,x & \text{if } \|x\|_2 \ge t \\ 0 & \text{otherwise} \end{cases}$$
• logarithmic barrier:
$$f(x) = -\sum_{i=1}^n \log x_i, \qquad \operatorname{prox}_{tf}(x)_i = \frac{x_i + \sqrt{x_i^2 + 4t}}{2}, \quad i = 1, \ldots, n$$

Some simple calculus rules
• separable sum:
$$f\begin{pmatrix} x \\ y \end{pmatrix} = g(x) + h(y), \qquad \operatorname{prox}_f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \operatorname{prox}_g(x) \\ \operatorname{prox}_h(y) \end{pmatrix}$$
• scaling and translation of the argument: with $\lambda \ne 0$,
$$f(x) = g(\lambda x + a), \qquad \operatorname{prox}_f(x) = \frac{1}{\lambda}\big( \operatorname{prox}_{\lambda^2 g}(\lambda x + a) - a \big)$$
• scalar multiplication: with $\lambda > 0$,
$$f(x) = \lambda\, g(x/\lambda), \qquad \operatorname{prox}_f(x) = \lambda \operatorname{prox}_{\lambda^{-1} g}(x/\lambda)$$

Addition to a linear or quadratic function
• linear function:
$$f(x) = g(x) + a^\top x, \qquad \operatorname{prox}_f(x) = \operatorname{prox}_g(x - a)$$
• quadratic function: with $\mu > 0$,
$$f(x) = g(x) + \frac{\mu}{2}\|x - a\|_2^2, \qquad \operatorname{prox}_f(x) = \operatorname{prox}_{\theta g}\big( \theta x + (1 - \theta) a \big), \quad \theta = \frac{1}{1 + \mu}$$

Moreau decomposition
$$x = \operatorname{prox}_f(x) + \operatorname{prox}_{f^*}(x), \quad \forall x$$
• proof: follows from the properties of conjugates and subgradients:
$$u = \operatorname{prox}_f(x) \;\Leftrightarrow\; x - u \in \partial f(u) \;\Leftrightarrow\; u \in \partial f^*(x - u) \;\Leftrightarrow\; x - u = \operatorname{prox}_{f^*}(x)$$
• generalizes the decomposition by orthogonal projection on subspaces:
$$x = P_L(x) + P_{L^\perp}(x)$$
(this is the Moreau decomposition with $f = I_L$, $f^* = I_{L^\perp}$)

Extended Moreau decomposition

For $\lambda > 0$,
$$x = \operatorname{prox}_{\lambda f}(x) + \lambda \operatorname{prox}_{\lambda^{-1} f^*}(x/\lambda), \quad \forall x$$
Proof: apply the Moreau decomposition to $\lambda f$:
$$x = \operatorname{prox}_{\lambda f}(x) + \operatorname{prox}_{(\lambda f)^*}(x) = \operatorname{prox}_{\lambda f}(x) + \lambda \operatorname{prox}_{\lambda^{-1} f^*}(x/\lambda)$$
the second step uses $(\lambda f)^*(y) = \lambda f^*(y/\lambda)$ and the scalar multiplication rule above.

Composition with affine mapping
• for general $A$, the prox-operator of $f(x) = g(Ax + b)$ does not follow easily from the prox-operator of $g$
• however, if $AA^\top = (1/\alpha) I$, we have (proof below)
$$\operatorname{prox}_f(x) = (I - \alpha A^\top A)x + \alpha A^\top \big( \operatorname{prox}_{\alpha^{-1} g}(Ax + b) - b \big)$$
• example: $f(x_1, \ldots, x_m) = g(x_1 + x_2 + \cdots + x_m)$; here $A = [\,I \;\; I \;\; \cdots \;\; I\,]$, so $AA^\top = mI$ and $\alpha = 1/m$, giving
$$\operatorname{prox}_f(x_1, \ldots, x_m)_i = x_i - \frac{1}{m}\Big( \sum_{j=1}^m x_j - \operatorname{prox}_{mg}\Big( \sum_{j=1}^m x_j \Big) \Big)$$

Proof: $u = \operatorname{prox}_f(x)$ is the solution of the optimization problem
$$\min_{u, y}\; g(y) + \tfrac12 \|u - x\|_2^2 \quad \text{s.t.} \quad Au + b = y$$
The Lagrangian function is
$$L(u, y; \lambda) = g(y) + \tfrac12 \|u - x\|_2^2 + \lambda^\top (Au + b - y)$$
KKT conditions:
$$u - x + A^\top \lambda = 0 \tag{6.1a}$$
$$\lambda \in \partial g(y) \tag{6.1b}$$
$$Au + b = y \tag{6.1c}$$
From (6.1a) and (6.1c) we have
$$y = A(x - A^\top \lambda) + b = Ax - \lambda/\alpha + b,$$
which implies
$$\lambda = \alpha(Ax + b - y) \tag{6.2}$$
From (6.2) and (6.1b) we have
$$0 \in \partial g(y) + \alpha(y - Ax - b),$$
i.e., $y$ is the minimizer of
$$\min_y\; \alpha^{-1} g(y) + \tfrac12 \|y - (Ax + b)\|_2^2,$$
so $y = \operatorname{prox}_{\alpha^{-1} g}(Ax + b)$. Thus we have
$$\lambda = \alpha\big( Ax + b - \operatorname{prox}_{\alpha^{-1} g}(Ax + b) \big)$$
and
$$u = x - \alpha A^\top \big( Ax + b - \operatorname{prox}_{\alpha^{-1} g}(Ax + b) \big)$$
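A numerical sanity check of the composition formula (a minimal sketch, not from the notes): take the block-sum example above with $g = \|\cdot\|_1$ and $b = 0$. Since the prox objective is convex, its value at the formula's output should not decrease under small random perturbations.

```python
# Check prox_f(x)_i = x_i - (1/m)(s - prox_{mg}(s)), s = sum_j x_j, for g = ||.||_1
import numpy as np

def soft_threshold(u, tau):                   # prox of tau*||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

m, n = 4, 3
rng = np.random.default_rng(2)
x = rng.standard_normal((m, n))               # m blocks of size n

s = x.sum(axis=0)
u = x - (s - soft_threshold(s, m)) / m        # candidate prox_f(x), blockwise

# F(u) = ||sum_i u_i||_1 + (1/2)||u - x||_F^2; u should be its global minimizer
F = lambda z: np.abs(z.sum(axis=0)).sum() + 0.5 * np.linalg.norm(z - x) ** 2
assert all(F(u) <= F(u + 1e-3 * rng.standard_normal((m, n))) + 1e-12
           for _ in range(1000))
print("prox formula passes the perturbation check")
```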
6.3.2. Projections

Projection on affine sets
• hyperplane: $C = \{x \mid a^\top x = b\}$ (with $a \ne 0$):
$$P_C(x) = x + \frac{b - a^\top x}{\|a\|_2^2}\, a$$
• affine set: $C = \{x \mid Ax = b\}$ (with $A \in \mathbf{R}^{p \times n}$ and $\operatorname{rank}(A) = p$):
$$P_C(x) = x + A^\top (AA^\top)^{-1}(b - Ax)$$
inexpensive if $p \ll n$, or $AA^\top = I$, ...

Projection on simple polyhedral sets
• halfspace: $C = \{x \mid a^\top x \le b\}$ (with $a \ne 0$):
$$P_C(x) = x + \frac{b - a^\top x}{\|a\|_2^2}\, a \;\;\text{if } a^\top x > b, \qquad P_C(x) = x \;\;\text{if } a^\top x \le b$$
• rectangle: $C = [l, u] = \{x \mid l \le x \le u\}$:
$$P_C(x)_i = \begin{cases} l_i & \text{if } x_i \le l_i \\ x_i & \text{if } l_i \le x_i \le u_i \\ u_i & \text{if } x_i \ge u_i \end{cases}$$
• nonnegative orthant: $C = \mathbf{R}^n_+$:
$$P_C(x) = x_+ = \max\{x, 0\}$$
• probability simplex: $C = \{x \mid e^\top x = 1,\; x \ge 0\}$:
$$P_C(x) = (x - \lambda e)_+$$
where $\lambda$ is the solution of the equation
$$e^\top (x - \lambda e)_+ = \sum_{i=1}^n \max\{0, x_i - \lambda\} = 1$$
• intersection of a hyperplane and a rectangle: $C = \{x \mid a^\top x = b,\; l \le x \le u\}$:
$$P_C(x) = P_{[l,u]}(x - \lambda a)$$
where $\lambda$ is the solution of $a^\top P_{[l,u]}(x - \lambda a) = b$

Projection on norm balls
• Euclidean ball: $C = \{x \mid \|x\|_2 \le 1\}$:
$$P_C(x) = \frac{1}{\|x\|_2}\, x \;\;\text{if } \|x\|_2 > 1, \qquad P_C(x) = x \;\;\text{if } \|x\|_2 \le 1$$
• 1-norm ball: $C = \{x \mid \|x\|_1 \le 1\}$:
$$P_C(x)_k = \begin{cases} x_k - \lambda & \text{if } x_k > \lambda \\ 0 & \text{if } -\lambda \le x_k \le \lambda \\ x_k + \lambda & \text{if } x_k < -\lambda \end{cases}$$
$\lambda = 0$ if $\|x\|_1 \le 1$; otherwise $\lambda$ is the solution of the equation
$$\sum_{k=1}^n \max\{|x_k| - \lambda, 0\} = 1$$

Projection on the positive semidefinite cone
• positive semidefinite cone: $C = \mathbf{S}^n_+$:
$$P_C(X) = \sum_{i=1}^n \max\{0, \lambda_i\}\, q_i q_i^\top$$
if $X = \sum_{i=1}^n \lambda_i q_i q_i^\top$ is the eigenvalue decomposition of $X$
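Several of the projections above reduce to a one-dimensional equation in $\lambda$. As an illustration (a minimal sketch, not from the notes; the bracket and tolerance are illustrative choices), the probability-simplex projection can be computed by bisection on $\lambda$, since $\phi(\lambda) = \sum_i \max\{0, x_i - \lambda\}$ is continuous and decreasing:

```python
# Projection onto {x | e^T x = 1, x >= 0} by solving sum_i max(0, x_i - lam) = 1
import numpy as np

def project_simplex(x, tol=1e-12):
    phi = lambda lam: np.maximum(x - lam, 0.0).sum() - 1.0   # decreasing in lam
    lo, hi = x.min() - 1.0, x.max()        # phi(lo) >= n - 1 >= 0, phi(hi) = -1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0:
            lo = mid
        else:
            hi = mid
    return np.maximum(x - 0.5 * (lo + hi), 0.0)

x = np.array([0.8, -0.3, 1.6, 0.2])
p = project_simplex(x)
print(p, p.sum())                          # entries >= 0, summing to 1
```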
6.3.3. Support functions, norms, distances

Support function
• the conjugate of the support function of a closed convex set is the indicator function:
$$f(x) = S_C(x) = \sup_{y \in C} x^\top y, \qquad f^*(y) = I_C(y)$$
• prox-operator of a support function: apply the Moreau decomposition:
$$\operatorname{prox}_{tf}(x) = x - t \operatorname{prox}_{t^{-1} f^*}(x/t) = x - t P_C(x/t)$$
• example: $f(x)$ is the sum of the $r$ largest components of $x$:
$$f(x) = x_{[1]} + \cdots + x_{[r]} = S_C(x), \qquad C = \{y \mid 0 \le y \le 1,\; e^\top y = r\}$$
the prox-operator of $f$ is easily evaluated via projection on $C$

Norms
• the conjugate of a norm is the indicator function of the dual norm ball:
$$f(x) = \|x\|, \qquad f^*(y) = I_B(y) \quad (B = \{y \mid \|y\|_* \le 1\})$$
• prox-operator of a norm: apply the Moreau decomposition:
$$\operatorname{prox}_{tf}(x) = x - t \operatorname{prox}_{t^{-1} f^*}(x/t) = x - t P_B(x/t) = x - P_{tB}(x)$$
a useful formula for $\operatorname{prox}_{t\|\cdot\|}$ when the projection on $tB = \{x \mid \|x\| \le t\}$ is cheap
• example: for $\|\cdot\|_1$ and $\|\cdot\|_2$, calculate their prox-operators

Distance to a point
• distance (in a general norm): $f(x) = \|x - a\|$
• prox-operator: from the translation rule above, with $g(x) = \|x\|$:
$$\operatorname{prox}_{tf}(x) = a + \operatorname{prox}_{tg}(x - a) = a + x - a - t P_B\Big( \frac{x - a}{t} \Big) = x - P_{tB}(x - a)$$
$B$ is the unit ball for the dual norm $\|\cdot\|_*$

Euclidean distance to a set
• Euclidean distance (to a closed convex set $C$):
$$d(x) = \inf_{y \in C} \|x - y\|_2$$
• prox-operator of the distance (proof below):
$$\operatorname{prox}_{td}(x) = \theta P_C(x) + (1 - \theta)x, \qquad \theta = \begin{cases} t/d(x) & \text{if } d(x) \ge t \\ 1 & \text{otherwise} \end{cases}$$
• prox-operator of the squared distance: $f(x) = d(x)^2/2$:
$$\operatorname{prox}_{tf}(x) = \frac{1}{1 + t}\, x + \frac{t}{1 + t}\, P_C(x)$$

Proof (expression for $\operatorname{prox}_{td}(x)$):
• if $u = \operatorname{prox}_{td}(x) \notin C$, then from the subgradient characterization of the prox-operator and the gradient of $d$ outside $C$, $\nabla d(u) = (u - P_C(u))/d(u)$, we get
$$x - u = \frac{t}{d(u)}\big( u - P_C(u) \big)$$
this implies $P_C(u) = P_C(x)$, $d(x) \ge t$, and $u$ is a convex combination of $x$ and $P_C(x)$
• if $u \in C$ minimizes $d(u) + \frac{1}{2t}\|u - x\|_2^2$, then $u = P_C(x)$

Proof (expression for $\operatorname{prox}_{tf}(x)$ when $f(x) = d(x)^2/2$):
$$\operatorname{prox}_{tf}(x) = \operatorname*{argmin}_u \Big( \tfrac12 d(u)^2 + \frac{1}{2t}\|u - x\|_2^2 \Big) = \operatorname*{argmin}_u \inf_{v \in C} \Big( \tfrac12 \|u - v\|_2^2 + \frac{1}{2t}\|u - x\|_2^2 \Big)$$
the optimal $u$ as a function of $v$ is
$$u = \frac{t}{t + 1}\, v + \frac{1}{t + 1}\, x$$
substituting this $u$, the optimal $v$ minimizes
$$\tfrac12 \Big\| \frac{t}{t+1}\, v + \frac{1}{t+1}\, x - v \Big\|_2^2 + \frac{1}{2t} \Big\| \frac{t}{t+1}\, v + \frac{1}{t+1}\, x - x \Big\|_2^2 = \frac{1}{2(1 + t)}\, \|v - x\|_2^2$$
over $C$, i.e., $v = P_C(x)$
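A numerical check of the $\operatorname{prox}_{td}$ formula (a minimal sketch, not from the notes), with $C$ the Euclidean unit ball so that $P_C(x) = x / \max\{1, \|x\|_2\}$. Since $d(u)$ depends only on $\|u\|_2$ and, for fixed $\|u\|_2$, $\|u - x\|_2$ is smallest when $u$ is aligned with $x$, the minimizer lies on the ray through $x$, and a dense 1-D search along that ray serves as a cross-check:

```python
# Compare prox_{t d}(x) = theta*P_C(x) + (1-theta)*x against a radial grid search
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(4)
nx0 = np.linalg.norm(x)
x *= 3.0 / nx0                              # place x at distance 2 from the unit ball
t = 1.0

P_C = lambda u: u / max(1.0, np.linalg.norm(u))
d = lambda u: np.linalg.norm(u - P_C(u))

dx = d(x)
theta = t / dx if dx >= t else 1.0
u_formula = theta * P_C(x) + (1 - theta) * x    # formula from the slide

nx = np.linalg.norm(x)
ray = x / nx
ss = np.linspace(0.0, nx, 200001)               # radius grid along the ray
vals = t * np.maximum(ss - 1.0, 0.0) + 0.5 * (ss - nx) ** 2
u_search = ss[np.argmin(vals)] * ray

print(np.linalg.norm(u_formula - u_search))     # ~ 1e-5 (grid resolution)
```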