Lecture Slides 6

Chapter 6
Proximal Gradient Method and
Proximal Mapping
6.1. Proximal gradient method
unconstrained optimization with objective split in two components
    min_x  f(x) = g(x) + h(x)
• g convex, differentiable, dom g = R^n
• h convex, but non-differentiable
So:
• we can’t use the gradient method, because h is non-differentiable
• we do not want to use the subgradient method, because it is too slow
• but does h have any special structure we can exploit?
• if h has no special structure, we still have to use the subgradient method
Proximal mapping
We assume that h has a special structure: the proximal mapping
(prox-operator) of h is easy to compute!
the proximal mapping (prox-operator) of a convex function h is defined as
    prox_h(x) = argmin_u { h(u) + (1/2)‖u − x‖_2^2 }
examples
• h(x) = 0: prox_h(x) = x
• more examples below ...
• h(x) = I_C(x) (indicator function of C): prox_h is projection on C

    prox_h(x) = argmin_{u ∈ C} ‖u − x‖_2^2 = P_C(x)

  where the indicator function of C is defined as

    I_C(x) = { 0    if x ∈ C
             { +∞   otherwise
• h(x) = ‖x‖_1: prox_h is the “soft-threshold” (shrinkage) operation

    prox_h(x)_i = { x_i − 1   if x_i ≥ 1
                  { 0         if |x_i| ≤ 1
                  { x_i + 1   if x_i ≤ −1
• proof.

    prox_h(x) = argmin_u ‖u‖_1 + (1/2)‖u − x‖_2^2

  This problem is separable across coordinates, so it suffices to solve

    min_{u_i}  |u_i| + (1/2)(u_i − x_i)^2

  for each i; writing the optimality condition 0 ∈ ∂|u_i| + (u_i − x_i), with ∂|u_i| = {sign(u_i)} for u_i ≠ 0 and [−1, 1] at u_i = 0, gives the three cases above.
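The three cases can be written compactly as prox_h(x)_i = sign(x_i) max{|x_i| − 1, 0}. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def soft_threshold(x, lam=1.0):
    """Prox of lam * ||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([2.5, 0.3, -1.2])
print(soft_threshold(x))   # [ 1.5  0.  -0.2]
```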
proximal gradient method
we want to solve
    min_x  f(x) = g(x) + h(x)
• g convex, differentiable, dom g = R^n
• h convex, but non-differentiable
the proximal gradient method:
    x^k = argmin_x { (1/(2t_k)) ‖x − (x^{k−1} − t_k ∇g(x^{k−1}))‖_2^2 + h(x) }

or equivalently,

    x^k = prox_{t_k h}( x^{k−1} − t_k ∇g(x^{k−1}) )

t_k > 0 is the step size, constant or determined by line search
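In code, each iteration is a gradient step on g followed by a prox step on h. A minimal numpy sketch, assuming a constant step size t and a user-supplied prox_h(v, t) that evaluates prox_{th}(v) (all names are ours):

```python
import numpy as np

def proximal_gradient(grad_g, prox_h, x0, t, max_iter=500):
    """x_k = prox_{t h}( x_{k-1} - t * grad_g(x_{k-1}) ), constant step t."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        x = prox_h(x - t * grad_g(x), t)
    return x
```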
Interpretation
    x^+ = argmin_u { (1/2)‖u − (x − t∇g(x))‖_2^2 + t h(u) }

or equivalently,

    x^+ = prox_{th}(x − t∇g(x))

Note that

    x^+ = argmin_u h(u) + (1/(2t))‖u − x + t∇g(x)‖_2^2
        = argmin_u h(u) + g(x) + ∇g(x)^T (u − x) + (1/(2t))‖u − x‖_2^2

(the two objectives differ only by terms that do not depend on u)

x^+ minimizes h(u) plus a simple quadratic local model of g(u) around x
Another Interpretation
we want to solve

    min_x  f(x) = g(x) + h(x)

The optimality condition is

    0 ∈ ∇g(x) + ∂h(x)

This is equivalent to (for any t > 0)

    0 ∈ t∇g(x) + t∂h(x)
    ⇔  0 ∈ x − (x − t∇g(x)) + t∂h(x)

Algorithm:
• find x^k such that

    0 ∈ x^k − (x^{k−1} − t∇g(x^{k−1})) + t∂h(x^k)

  This means x^k is the optimal solution of

    min_x  (1/2)‖x − (x^{k−1} − t∇g(x^{k−1}))‖_2^2 + t h(x)
Examples
gradient method: special case with h(x) = 0

    x^+ = x − t∇g(x)

gradient projection method: special case with h(x) = I_C(x)

    x^+ = P_C(x − t∇g(x))

iterative soft-thresholding algorithm (ISTA): special case with h(x) = ‖x‖_1

    x^+ = prox_{th}(x − t∇g(x))

where

    prox_{th}(u)_i = { u_i − t   if u_i ≥ t
                     { 0         if −t ≤ u_i ≤ t
                     { u_i + t   if u_i ≤ −t
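To make ISTA concrete, here is a sketch on a small synthetic lasso problem min_x (1/2)‖Ax − b‖_2^2 + λ‖x‖_1, i.e. g(x) = (1/2)‖Ax − b‖_2^2 and h(x) = λ‖x‖_1 (all problem data below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
lam = 0.5
t = 1.0 / np.linalg.norm(A, 2) ** 2   # step 1/L, L = ||A||_2^2 Lipschitz const. of grad g

x = np.zeros(50)
for _ in range(1000):
    v = x - t * A.T @ (A @ x - b)                            # gradient step on g
    x = np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)    # prox of t*lam*||.||_1
print(np.count_nonzero(x))   # the l1 term promotes a sparse solution
```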
6.2. Conjugate function
We now review some further facts about conjugate functions, which will be needed for the properties of proximal mappings below.
Closed function
• definition: a function is closed if its epigraph is a closed set (a set that contains its boundary)
• sublevel sets: f is closed iff all its sublevel sets are closed
• minimum: if f is closed with bounded sublevel sets, then it has a minimizer
Conjugate function
• the conjugate of a function f is

    f*(y) = sup_{x ∈ dom f} ( y^T x − f(x) )

  f* is closed (its epigraph is a closed set) and convex even if f is not
• Fenchel’s inequality

    f(x) + f*(y) ≥ x^T y,   ∀x, y

  (extends the inequality x^T x/2 + y^T y/2 ≥ x^T y to non-quadratic convex f)
Quadratic function
    f(x) = (1/2) x^T A x + b^T x + c

• strictly convex case (A ≻ 0) (numerical check below)

    f*(y) = (1/2) (y − b)^T A^{-1} (y − b) − c

• general convex case (A ⪰ 0)

    f*(y) = (1/2) (y − b)^T A^† (y − b) − c,    dom f* = range(A) + b
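For the strictly convex case, the supremum defining f*(y) is attained where ∇f(x) = Ax + b = y; a short numpy sketch with made-up data verifies the closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)                 # A symmetric positive definite
b, c = rng.standard_normal(4), 0.7
y = rng.standard_normal(4)

xs = np.linalg.solve(A, y - b)          # maximizer of y^T x - f(x)
f_xs = 0.5 * xs @ A @ xs + b @ xs + c
closed_form = 0.5 * (y - b) @ xs - c    # (1/2)(y-b)^T A^{-1}(y-b) - c
print(np.isclose(y @ xs - f_xs, closed_form))   # True
```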
Negative entropy and negative logarithm
• negative entropy

    f(x) = Σ_{i=1}^n x_i log x_i,    f*(y) = Σ_{i=1}^n e^{y_i − 1}

• negative logarithm

    f(x) = − Σ_{i=1}^n log x_i,    f*(y) = − Σ_{i=1}^n log(−y_i) − n

• matrix logarithm

    f(X) = − log det X (dom f = S^n_{++}),    f*(Y) = − log det(−Y) − n
Indicator function and norm
• indicator of convex set C: conjugate is the support function of C

    f(x) = { 0    if x ∈ C          f*(y) = sup_{x ∈ C} y^T x
           { +∞   if x ∉ C

• norm: conjugate is the indicator of the unit dual norm ball

    f(x) = ‖x‖,    f*(y) = { 0    if ‖y‖_* ≤ 1
                            { +∞   if ‖y‖_* > 1

  (proof below)
proof: recall the definition of the dual norm:

    ‖y‖_* = sup_{‖x‖ ≤ 1} x^T y

to evaluate f*(y) = sup_x ( y^T x − ‖x‖ ) we distinguish two cases

• if ‖y‖_* ≤ 1, then (by the generalized Cauchy–Schwarz inequality, y^T x ≤ ‖x‖ ‖y‖_*)

    y^T x ≤ ‖x‖,   ∀x

  and equality holds for x = 0; therefore sup_x ( y^T x − ‖x‖ ) = 0
• if ‖y‖_* > 1, there exists an x with ‖x‖ ≤ 1, x^T y > 1; then

    f*(y) ≥ y^T (tx) − ‖tx‖ = t ( y^T x − ‖x‖ )

  and the right-hand side goes to infinity as t → ∞
The second conjugate
    f**(x) = sup_{y ∈ dom f*} ( x^T y − f*(y) )

• f** is closed and convex
• from Fenchel’s inequality (x^T y − f*(y) ≤ f(x) for all y and x):

    f**(x) ≤ f(x),   ∀x

  equivalently, epi f ⊆ epi f** (for any f)
• if f is closed and convex, then

    f**(x) = f(x),   ∀x

  equivalently, epi f = epi f** (if f is closed and convex); proof below
proof. (f** = f if f is closed and convex): by contradiction

suppose (x, f**(x)) ∉ epi f; then there is a strictly separating hyperplane:

    a^T (z − x) + b (s − f**(x)) ≤ c < 0,   ∀(z, s) ∈ epi f

for some a, b, c with b ≤ 0 (b > 0 gives a contradiction as s → ∞)

• if b < 0, define y = a/(−b) and maximize the left-hand side over (z, s) ∈ epi f:

    f*(y) − y^T x + f**(x) ≤ c/(−b) < 0

  which contradicts Fenchel’s inequality
• if b = 0, then a^T (z − x) ≤ c < 0 for all z ∈ dom f; choosing z = x gives a contradiction
Conjugates and subgradients
if f is closed and convex, then

    y ∈ ∂f(x)  ⇔  x ∈ ∂f*(y)  ⇔  x^T y = f(x) + f*(y)

proof: if y ∈ ∂f(x), then f*(y) = sup_u ( y^T u − f(u) ) = y^T x − f(x); hence, for all v,

    f*(v) = sup_u ( v^T u − f(u) )
          ≥ v^T x − f(x)
          = x^T (v − y) − f(x) + y^T x
          = f*(y) + x^T (v − y)

therefore, x is a subgradient of f* at y (x ∈ ∂f*(y))

the reverse implication x ∈ ∂f*(y) ⇒ y ∈ ∂f(x) follows by applying the same argument to f*, using f** = f
6.3. Proximal mapping
if h is convex and closed (has a closed epigraph), then
    prox_h(x) = argmin_u { h(u) + (1/2)‖u − x‖_2^2 }

exists and is unique for all x

• existence: h(u) + (1/2)‖u − x‖_2^2 is closed with bounded sublevel sets
• uniqueness: h(u) + (1/2)‖u − x‖_2^2 is strictly (in fact, strongly) convex

subgradient characterization: from the optimality conditions of the minimization in the definition,

    u = prox_h(x)  ⇔  x − u ∈ ∂h(u)
                   ⇔  h(z) ≥ h(u) + (x − u)^T (z − u),   ∀z
Projection on closed convex set
the proximal mapping of the indicator function I_C is the Euclidean projection on C

    prox_{I_C}(x) = argmin_{u ∈ C} ‖u − x‖_2^2 = P_C(x)

subgradient characterization:

    u = P_C(x)  ⇔  (x − u)^T (z − u) ≤ 0,   ∀z ∈ C

we will see that proximal mappings have many properties of projections
Nonexpansiveness
if u = prox_h(x), v = prox_h(y), then

    (u − v)^T (x − y) ≥ ‖u − v‖_2^2

i.e., prox_h is firmly nonexpansive

• proof. follows from the monotonicity of ∂h:

    x − u ∈ ∂h(u),  y − v ∈ ∂h(v)  ⇒  (x − u − y + v)^T (u − v) ≥ 0

• this implies (via the Cauchy–Schwarz inequality)

    ‖prox_h(x) − prox_h(y)‖_2 ≤ ‖x − y‖_2

  i.e., prox_h is nonexpansive, or Lipschitz continuous with constant 1
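Both inequalities are easy to observe numerically; a small sketch with h = ‖·‖_1, whose prox is the soft-thresholding operator from Section 6.1:

```python
import numpy as np

def prox_l1(x):
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)   # prox of ||.||_1

rng = np.random.default_rng(2)
x, y = rng.standard_normal(100), rng.standard_normal(100)
u, v = prox_l1(x), prox_l1(y)
print((u - v) @ (x - y) >= (u - v) @ (u - v))              # firm nonexpansiveness
print(np.linalg.norm(u - v) <= np.linalg.norm(x - y))      # nonexpansiveness
```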
6.3.1. Examples of Easy Proximal Mappings
“easy” means that the computational effort of the proximal mapping is approximately equal to the effort of computing a gradient or subgradient of h.

Examples:
• quadratic function (A ⪰ 0) (numerical check after this list)

    f(x) = (1/2) x^T A x + b^T x + c,    prox_{tf}(x) = (I + tA)^{-1} (x − tb)

• Euclidean norm: f(x) = ‖x‖_2

    prox_{tf}(x) = { (1 − t/‖x‖_2) x   if ‖x‖_2 ≥ t
                   { 0                 otherwise
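For the quadratic example, prox_{tf}(x) is a single linear solve; a sketch (our own data) that verifies the optimality condition t(Au + b) + u − x = 0:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M @ M.T                          # A positive semidefinite
b = rng.standard_normal(5)
x, t = rng.standard_normal(5), 0.5

u = np.linalg.solve(np.eye(5) + t * A, x - t * b)    # prox_{tf}(x)
print(np.allclose(t * (A @ u + b) + u - x, 0.0))     # optimality condition holds
```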
• logarithmic barrier

    f(x) = − Σ_{i=1}^n log x_i,    prox_{tf}(x)_i = ( x_i + √(x_i^2 + 4t) ) / 2,   i = 1, . . . , n
Some simple calculus rules
• separable sum

    f(x, y) = g(x) + h(y),    prox_f(x, y) = ( prox_g(x), prox_h(y) )

• scaling and translation of argument: with λ ≠ 0

    f(x) = g(λx + a),    prox_f(x) = (1/λ) ( prox_{λ^2 g}(λx + a) − a )

• scalar multiplication: with λ > 0

    f(x) = λ g(x/λ),    prox_f(x) = λ prox_{λ^{-1} g}(x/λ)
Addition to linear or quadratic function
• linear function

    f(x) = g(x) + a^T x,    prox_f(x) = prox_g(x − a)

• quadratic function: with µ > 0

    f(x) = g(x) + (µ/2)‖x − a‖_2^2,    prox_f(x) = prox_{θg}( θx + (1 − θ)a )

  where θ = 1/(1 + µ)
Moreau decomposition
    x = prox_f(x) + prox_{f*}(x),   ∀x

• proof. follows from the properties of conjugates and subgradients:

    u = prox_f(x)  ⇔  x − u ∈ ∂f(u)
                   ⇔  u ∈ ∂f*(x − u)
                   ⇔  x − u = prox_{f*}(x)

• generalizes the decomposition by orthogonal projection on subspaces:

    x = P_L(x) + P_{L⊥}(x)

  (this is the Moreau decomposition with f = I_L, f* = I_{L⊥})
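The decomposition is easy to check numerically: with f = ‖·‖_1, f* is the indicator of the ∞-norm unit ball, so prox_{f*} is clipping to [−1, 1] (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(10)

prox_f  = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)   # prox of ||.||_1
prox_fs = np.clip(x, -1.0, 1.0)                           # projection onto {||y||_inf <= 1}
print(np.allclose(prox_f + prox_fs, x))                   # Moreau decomposition holds
```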
Extended Moreau decomposition
for λ > 0,

    x = prox_{λf}(x) + λ prox_{λ^{-1} f*}(x/λ),   ∀x

proof: apply the Moreau decomposition to λf:

    x = prox_{λf}(x) + prox_{(λf)*}(x)
      = prox_{λf}(x) + λ prox_{λ^{-1} f*}(x/λ)

the second line uses (λf)*(y) = λ f*(y/λ) and the scalar-multiplication rule above
Composition with affine mapping
• for general A, the prox-operator of

    f(x) = g(Ax + b)

  does not follow easily from the prox-operator of g
• however, if AA^T = (1/α)I, we have (proof below)

    prox_f(x) = (I − αA^T A) x + αA^T ( prox_{α^{-1} g}(Ax + b) − b )

• example: f(x_1, . . . , x_m) = g(x_1 + x_2 + . . . + x_m); here A = [I I · · · I], so AA^T = mI and α = 1/m, giving (numerical check below)

    prox_f(x_1, . . . , x_m)_i = x_i − (1/m) ( Σ_{j=1}^m x_j − prox_{mg}( Σ_{j=1}^m x_j ) )
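A small numerical check of this example with scalar blocks and g(s) = s^2/2 (our choice), for which prox_{mg}(s) = s/(1 + m) and f is differentiable, so u = prox_f(x) can be tested via ∇f(u) + u − x = 0:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 6
x = rng.standard_normal(m)           # m scalar blocks
S = x.sum()

u = x - (S - S / (1 + m)) / m        # formula above with prox_{mg}(s) = s/(1+m)
# for f(u) = (sum_j u_j)^2 / 2, grad f(u) = (sum_j u_j) * ones
print(np.allclose(u.sum() + u - x, 0.0))   # optimality condition holds
```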
proof: u = prox_f(x) is the solution of the optimization problem

    min_{u,y}  g(y) + (1/2)‖u − x‖_2^2
    s.t.       Au + b = y

The Lagrangian function is

    L(u, y; λ) = g(y) + (1/2)‖u − x‖_2^2 + λ^T (Au + b − y)

KKT conditions:

    u − x + A^T λ = 0     (6.1a)
    λ ∈ ∂g(y)             (6.1b)
    Au + b = y            (6.1c)

From (6.1a) and (6.1c), using AA^T = (1/α)I, we have

    y = A(x − A^T λ) + b = Ax − λ/α + b,
which implies

    λ = α(Ax + b − y)     (6.2)

From (6.2) and (6.1b) we have

    0 ∈ ∂g(y) + α(y − Ax − b)

i.e., y is the minimizer of

    min_y  α^{-1} g(y) + (1/2)‖y − (Ax + b)‖_2^2

so y = prox_{α^{-1} g}(Ax + b). Thus we have

    λ = α( Ax + b − prox_{α^{-1} g}(Ax + b) )

and, from (6.1a),

    u = x − A^T λ = x − αA^T ( Ax + b − prox_{α^{-1} g}(Ax + b) )
6.3.2. Projections
Projection on affine sets
• hyperplane: C = {x | a^T x = b} (with a ≠ 0)

    P_C(x) = x + ( (b − a^T x) / ‖a‖_2^2 ) a

• affine set: C = {x | Ax = b} (with A ∈ R^{p×n} and rank(A) = p)

    P_C(x) = x + A^T (AA^T)^{-1} (b − Ax)

  inexpensive if p ≪ n, or if AA^T = I, ...
Projection on simple polyhedral sets
• halfspace: C = {x | a^T x ≤ b} (with a ≠ 0)

    P_C(x) = x + ( (b − a^T x) / ‖a‖_2^2 ) a   if a^T x > b,
    P_C(x) = x                                  if a^T x ≤ b

• rectangle: C = [l, u] = {x | l ≤ x ≤ u}

    P_C(x)_i = { l_i   if x_i ≤ l_i
               { x_i   if l_i ≤ x_i ≤ u_i
               { u_i   if x_i ≥ u_i

• nonnegative orthant: C = R^n_+

    P_C(x) = x_+ = max{x, 0}
• probability simplex: C = {x | e^T x = 1, x ≥ 0}

    P_C(x) = (x − λe)_+

  where λ is the solution of the equation (a bisection sketch follows this list)

    e^T (x − λe)_+ = Σ_{i=1}^n max{0, x_i − λ} = 1

• intersection of hyperplane and rectangle: C = {x | a^T x = b, l ≤ x ≤ u}

    P_C(x) = P_{[l,u]}(x − λa)

  where λ is the solution of

    a^T P_{[l,u]}(x − λa) = b
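Since e^T (x − λe)_+ is nonincreasing and piecewise linear in λ, the simplex equation above can be solved by bisection; a minimal sketch (the function name is ours):

```python
import numpy as np

def project_simplex(x, iters=60):
    """Project x onto {z | e^T z = 1, z >= 0} by bisection on lambda."""
    lo, hi = x.min() - 1.0, x.max()      # brackets: sum >= 1 at lo, sum = 0 at hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.maximum(x - lam, 0.0).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    return np.maximum(x - 0.5 * (lo + hi), 0.0)

p = project_simplex(np.array([0.8, 1.1, -0.4]))
print(p, p.sum())                        # nonnegative entries summing to 1
```

The same bisection idea applies to the other scalar λ equations on these slides.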
Projection on norm balls
• Euclidean ball: C = {x | ‖x‖_2 ≤ 1}

    P_C(x) = x / ‖x‖_2   if ‖x‖_2 > 1,
    P_C(x) = x           if ‖x‖_2 ≤ 1

• 1-norm ball: C = {x | ‖x‖_1 ≤ 1}

    P_C(x)_k = { x_k − λ   if x_k > λ
               { 0         if −λ ≤ x_k ≤ λ
               { x_k + λ   if x_k < −λ

  λ = 0 if ‖x‖_1 ≤ 1; otherwise λ is the solution of the equation

    Σ_{k=1}^n max{|x_k| − λ, 0} = 1
Projection on positive semidefinite cone
• positive semidefinite cone: C = S^n_+

    P_C(X) = Σ_{i=1}^n max{0, λ_i} q_i q_i^T

  if X = Σ_{i=1}^n λ_i q_i q_i^T is the eigenvalue decomposition of X
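A direct numpy sketch via the symmetric eigendecomposition (our own test data):

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto S^n_+ by clipping negative eigenvalues."""
    lam, Q = np.linalg.eigh(X)
    return (Q * np.maximum(lam, 0.0)) @ Q.T

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 4))
P = project_psd(0.5 * (M + M.T))                 # symmetrize, then project
print(np.linalg.eigvalsh(P).min() >= -1e-12)     # True: P is positive semidefinite
```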
6.3.3. Support functions, norms, distances
Support function
• conjugate of the support function of a closed convex set is the indicator function

    f(x) = S_C(x) = sup_{y ∈ C} x^T y,    f*(y) = I_C(y)

• prox-operator of the support function: apply the (extended) Moreau decomposition

    prox_{tf}(x) = x − t prox_{t^{-1} f*}(x/t) = x − t P_C(x/t)

• example: f(x) is the sum of the r largest components of x

    f(x) = x_[1] + . . . + x_[r] = S_C(x),    C = {y | 0 ≤ y ≤ 1, e^T y = r}

  the prox-operator of f is easily evaluated via projection onto C
Norms
• conjugate of a norm is the indicator function of the dual norm ball:

    f(x) = ‖x‖,    f*(y) = I_B(y)   (B = {y | ‖y‖_* ≤ 1})

• prox-operator of a norm: apply the (extended) Moreau decomposition

    prox_{tf}(x) = x − t prox_{t^{-1} f*}(x/t)
                 = x − t P_B(x/t)
                 = x − P_{tB}(x)

  a useful formula for prox_{t‖·‖} when projection onto tB = {x | ‖x‖_* ≤ t} is cheap
• example: for ‖·‖_1 and ‖·‖_2, calculate their prox-operators
Distance to a point
• distance (in a general norm)

    f(x) = ‖x − a‖

• prox-operator: from the translation rule above, with g(x) = ‖x‖,

    prox_{tf}(x) = a + prox_{tg}(x − a)
                 = a + (x − a) − t P_B( (x − a)/t )
                 = x − P_{tB}(x − a)

  where B is the unit ball of the dual norm ‖·‖_*
Euclidean distance to a set
• Euclidean distance (to a closed convex set C)

    d(x) = inf_{y ∈ C} ‖x − y‖_2

• prox-operator of the distance (proof and numerical check below)

    prox_{td}(x) = θ P_C(x) + (1 − θ) x,    θ = { t/d(x)   if d(x) ≥ t
                                                { 1        otherwise

• prox-operator of the squared distance: f(x) = d(x)^2/2

    prox_{tf}(x) = (1/(1 + t)) x + (t/(1 + t)) P_C(x)
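Before the proofs, a quick numerical check of prox_{td} (our own example, with C the unit Euclidean ball, so P_C(x) = x / max{‖x‖_2, 1}); at the computed u ∉ C it verifies the optimality condition x − u = t∇d(u) = t (u − P_C(u)) / d(u):

```python
import numpy as np

x = np.array([3.0, 4.0])                   # ||x||_2 = 5, so d(x) = 4 for the unit ball
t = 2.0
theta = t / 4.0                            # d(x) >= t, so theta = t/d(x)
u = theta * (x / 5.0) + (1 - theta) * x    # prox_{td}(x) from the formula above

Pu = u / np.linalg.norm(u)                 # projection of u onto the unit ball (u is outside)
du = np.linalg.norm(u) - 1.0               # d(u)
print(np.allclose(x - u, t * (u - Pu) / du))   # optimality condition holds
```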
proof. (expression for prox_{td}(x))

• if u = prox_{td}(x) ∉ C, then the subgradient characterization of the prox-operator, together with ∇d(u) = (u − P_C(u))/d(u) for u ∉ C, gives

    x − u = (t/d(u)) (u − P_C(u))

  this implies P_C(u) = P_C(x), d(x) ≥ t, and that u is a convex combination of x and P_C(x)
• if u ∈ C minimizes d(u) + (1/(2t))‖u − x‖_2^2, then u = P_C(x)
proof. (expression for prox_{tf}(x) when f(x) = d(x)^2/2)

    prox_{tf}(x) = argmin_u { (1/2) d(u)^2 + (1/(2t)) ‖u − x‖_2^2 }
                 = argmin_u inf_{v ∈ C} { (1/2) ‖u − v‖_2^2 + (1/(2t)) ‖u − x‖_2^2 }

the optimal u as a function of v is

    u = (t/(t + 1)) v + (1/(t + 1)) x

the optimal v minimizes

    (1/2) ‖ (t/(t+1)) v + (1/(t+1)) x − v ‖_2^2 + (1/(2t)) ‖ (t/(t+1)) v + (1/(t+1)) x − x ‖_2^2 = (1/(2(1 + t))) ‖v − x‖_2^2

over C, i.e., v = P_C(x)