New perspectives for increasing efficiency of optimization schemes

Yurii Nesterov, CORE/INMA (UCL)
January 7, 2016 (Big Data Conference, UCL, London)
Joint results with S. Stich (ETH Zurich)
Thirty years ago ...
Problem: $\min_{x\in\mathbb{R}^n} f(x)$, where $f$ is convex and
$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, $x, y \in \mathbb{R}^n$.
Fast gradient method (N. 1984): $x_0 \in \mathbb{R}^n$,
$y_k = x_k + \tfrac{k}{k+2}(x_k - x_{k-1})$, $x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k)$, $k \ge 0$.
Result: $f(x_k) - f^* \le \frac{2LR^2}{k(k+1)}$, $k \ge 1$. (Optimal)
Compare: Heavy ball method (B. Polyak, 1964):
$x_{k+1} = x_k + \alpha_k(x_k - x_{k-1}) - \beta_k \nabla f(x_k)$, $k \ge 0$.
(Convergence analysis for QP.)
Applications (1983-2003): Nothing ...
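A minimal sketch of the 1984 scheme above, in Python with NumPy; the quadratic test objective, problem sizes, and iteration count are illustrative assumptions, not part of the slides.

```python
import numpy as np

def fast_gradient(grad, L, x0, iters=1000):
    """Fast gradient method: y_k = x_k + k/(k+2) (x_k - x_{k-1}),
    x_{k+1} = y_k - (1/L) grad(y_k)."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(iters):
        y = x + k / (k + 2) * (x - x_prev)   # extrapolation (momentum) step
        x_prev, x = x, y - grad(y) / L       # gradient step from y_k
    return x

# Illustrative use on a smooth convex quadratic f(x) = 0.5 * ||Ax - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2                # Lipschitz constant of the gradient
x_best = fast_gradient(grad, L, np.zeros(50))
```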
Twenty years later ...
Problem: $\min_{x\in Q} f(x)$, where $f$ is convex, $Q$ is convex, and
$\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$, $x, y \in Q$.
Prox-function: strongly convex d(x), x ∈ Q.
Fast gradient method (N. 03): $v_0 = x_0 \in \mathbb{R}^n$,
$v_k = \arg\min_{x\in Q}\Big\{ d(x) + \sum_{i=0}^{k-1} \tfrac{i+1}{2L}\big[f(y_i) + \langle\nabla f(y_i),\, x - y_i\rangle\big]\Big\}$,
$y_k = \tfrac{k+1}{k+3}\,x_k + \tfrac{2}{k+3}\,v_k$, $x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k)$, $k \ge 0$.
Result: $f(x_k) - f^* \le \frac{2LR^2}{k(k+1)}$, $k \ge 1$.
Applications (2003 - 2015(?)): Smooth approximations of
nonsmooth functions.
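A minimal sketch of the 2003 scheme above, specialized to the unconstrained case $Q = \mathbb{R}^n$ with the Euclidean prox-function $d(x) = \tfrac12\|x - x_0\|^2$, for which the arg min has the closed form $v_k = x_0 - \sum_{i<k}\tfrac{i+1}{2L}\nabla f(y_i)$; the coupling coefficients follow the reconstruction above, and the unconstrained Euclidean setting is an illustrative assumption.

```python
import numpy as np

def fgm_2003(grad, L, x0, iters=1000):
    """Fast gradient method in the 2003 form, specialized to Q = R^n and
    d(x) = 0.5 * ||x - x0||^2, so v_k = x0 - sum_{i<k} (i+1)/(2L) * grad(y_i)."""
    x, v = x0.copy(), x0.copy()                        # v_0 = x_0
    weighted_grads = np.zeros_like(x0, dtype=float)
    for k in range(iters):
        y = (k + 1) / (k + 3) * x + 2 / (k + 3) * v    # coupling of x_k and v_k
        g = grad(y)
        x = y - g / L                                  # gradient step
        weighted_grads += (k + 1) / (2 * L) * g        # accumulate the linear model
        v = x0 - weighted_grads                        # closed-form arg min of d(x) + model
    return x
```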
Smoothing technique
Problem: $\min_{x\in Q} f(x)$, $\operatorname{diam} Q = D_1$.
Model: $f(x) = \max_{u\in U}\{\langle Ax, u\rangle - \phi(u)\}$.
Smoothing: $f_\mu(x) = \max_{u\in U}\{\langle Ax, u\rangle - \phi(u) - \mu d_2(u)\}$,
where $d_2$ is strongly convex on $U$, $\operatorname{diam} U = D_2$.
Then $\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_* \le \tfrac{1}{\mu}\|A\|^2 \cdot \|x - y\|$, $x, y \in \mathbb{R}^n$, $\mu > 0$,
where $\|A\| = \max_{x,u}\{\langle Ax, u\rangle : \|x\| \le 1, \|u\| \le 1\}$.
Complexity: Choose $\mu = O(\epsilon)$.
Then we get an $\epsilon$-solution in $O\big(\tfrac{1}{\epsilon}\|A\| D_1 D_2\big)$ iterations.
NB: Subgradient schemes need $O\big(\tfrac{1}{\epsilon^2}\|A\|^2 D_1^2 D_2^2\big)$ iterations.
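A minimal sketch of this construction for one concrete instance that fits the template (an illustrative assumption, not taken from the slides): $f(x) = \|Ax - b\|_1 = \max_{\|u\|_\infty \le 1}\langle Ax - b, u\rangle$, smoothed with the Euclidean prox $d_2(u) = \tfrac12\|u\|^2$, which gives a Huber-type approximation.

```python
import numpy as np

def grad_f_mu(A, b, x, mu):
    """Gradient of f_mu(x) = max_{||u||_inf <= 1} { <Ax - b, u> - (mu/2)||u||^2 }.
    The inner maximizer is u*_i = clip((Ax - b)_i / mu, -1, 1), so
    grad f_mu(x) = A^T u*(x)."""
    u_star = np.clip((A @ x - b) / mu, -1.0, 1.0)
    return A.T @ u_star

# f_mu has Lipschitz-continuous gradient with constant ||A||^2 / mu, so the
# fast gradient method applies with L = ||A||^2 / mu and mu = O(eps).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 30))
b = rng.standard_normal(100)
mu = 1e-2
L_mu = np.linalg.norm(A, 2) ** 2 / mu
# x_eps = fast_gradient(lambda x: grad_f_mu(A, b, x, mu), L_mu, np.zeros(30))
```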
Ten years later ...
Huge-scale problems ⇒ Coordinate descent methods
Problem: $\min_{x\in\mathbb{R}^n} f(x)$, with the objective satisfying the conditions
$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L_i |h|$, $\forall x \in \mathbb{R}^n$, $h \in \mathbb{R}$. (NB: $L_i \le L_{\nabla f}$.)
Hence: $f(x) - f\big(x - \tfrac{1}{L_i}\nabla_i f(x)\, e_i\big) \ge \tfrac{1}{2L_i}(\nabla_i f(x))^2$, $i = 1:n$.
Consequences [N.12]: Denote $S_\alpha = \sum_{i=1}^n L_i^\alpha$, $\alpha \in [0, 1]$.
Choose $i$ with probability $\pi_i = \frac{1}{S_1} L_i$. Then
$E(f(x_+)) \le f(x) - \frac{1}{2S_1}\|\nabla f(x)\|_{[0]}^2 \;\Rightarrow\; E(f(x_k)) - f^* \le \frac{S_1 R_0^2}{k}$,
where $\|x\|_{[\alpha]}^2 = \sum_{i=1}^n L_i^{\alpha}\,(x^{(i)})^2$ and $\|g\|_{[\alpha]}^2 = \sum_{i=1}^n L_i^{-\alpha}\,(g^{(i)})^2$.
Choose $i$ with probability $\pi_i = \frac{1}{n}$. Then
$E(f(x_+)) \le f(x) - \frac{1}{2n}\|\nabla f(x)\|_{[1]}^2 \;\Rightarrow\; E(f(x_k)) - f^* \le \frac{n R_1^2}{k}$.
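A minimal sketch of the coordinate step with the $\pi_i = L_i / S_1$ sampling rule; the quadratic test objective, for which $L_i = A_{ii}$, is an illustrative assumption.

```python
import numpy as np

def rcd(grad_i, L, x0, iters=20000, seed=0):
    """Random coordinate descent: pick i with probability pi_i = L_i / S_1,
    then take the step x_+ = x - (1/L_i) * grad_i f(x) * e_i."""
    rng = np.random.default_rng(seed)
    p = L / L.sum()                          # pi_i = L_i / S_1
    x = x0.copy()
    for _ in range(iters):
        i = rng.choice(len(x), p=p)
        x[i] -= grad_i(x, i) / L[i]          # coordinate gradient step
    return x

# Illustrative use: f(x) = 0.5 <Ax, x> - <b, x>, so grad_i f(x) = <A_i, x> - b_i
# and the coordinate Lipschitz constants are L_i = A_ii.
rng = np.random.default_rng(1)
M = rng.standard_normal((40, 40))
A = M.T @ M + np.eye(40)
b = rng.standard_normal(40)
x_hat = rcd(lambda x, i: A[i] @ x - b[i], np.diag(A).copy(), np.zeros(40))
```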
Fast Coordinate Descent
Random generator: For $\beta \in [0, 1]$, get $j = R_\beta(L) \in \{1:n\}$
with probabilities $\pi_\beta[i] \equiv \mathrm{Prob}(j = i) \stackrel{\mathrm{def}}{=} \frac{1}{S_\beta} L_i^\beta$, $i \in \{1:n\}$.
[N.12] $\beta = 0$: $E(f(x_k)) - f^* \le 2\big(\frac{n}{k+1}\big)^2 L_{\max} R_{[0]}^2$.
[Lee, Sidford 13] $\beta = 1$: $E(f(x_k)) - f^* \le \frac{2 n S_1}{(k+1)^2} R_{[0]}^2$.
[N., Stich 15] $\beta = \frac{1}{2}$: $E(f(x_k)) - f^* \le 2\big(\frac{S_{1/2}}{k+1}\big)^2 R_{[0]}^2$.
Fast CD: Choose $v_0 = x_0 \in \mathbb{R}^n$. Set $A_0 = 0$.
1) Choose active coordinate $i_t = R_{1/2}(L)$.
2) Solve $a_{t+1}^2 S_{1/2}^2 = A_t + a_{t+1}$. Set $A_{t+1} = A_t + a_{t+1}$, $\alpha_t = \frac{a_{t+1}}{A_{t+1}}$.
3) Set $y_t = (1 - \alpha_t) x_t + \alpha_t v_t$, $x_{t+1} = y_t - \frac{1}{L_{i_t}} \nabla_{i_t} f(y_t)\, e_{i_t}$,
   $v_{t+1} = v_t - \frac{a_{t+1} S_{1/2}}{L_{i_t}^{1/2}} \nabla_{i_t} f(y_t)\, e_{i_t}$.
NB: Each iteration needs O(n) a.o. ⇒ Not for Huge Scale.
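A minimal sketch of steps 1)-3) in Python with NumPy, following the reconstruction above; step 2) takes the positive root of $a^2 S_{1/2}^2 - a - A_t = 0$, and the test objective is an illustrative assumption. As the NB notes, the coupling in step 3) touches full vectors, so every iteration costs $O(n)$ a.o.

```python
import numpy as np

def fast_cd(grad_i, L, x0, iters=20000, seed=0):
    """Fast coordinate descent with beta = 1/2 sampling."""
    rng = np.random.default_rng(seed)
    S_half = np.sum(np.sqrt(L))
    p = np.sqrt(L) / S_half                               # pi_{1/2}[i] = L_i^{1/2} / S_{1/2}
    x, v, A_t = x0.copy(), x0.copy(), 0.0                 # v_0 = x_0, A_0 = 0
    for _ in range(iters):
        i = rng.choice(len(x), p=p)                       # 1) active coordinate i_t
        a = (1 + np.sqrt(1 + 4 * S_half**2 * A_t)) / (2 * S_half**2)  # 2) a^2 S^2 = A_t + a
        A_t += a
        alpha = a / A_t
        y = (1 - alpha) * x + alpha * v                   # 3) coupling y_t
        g = grad_i(y, i)
        x = y.copy()
        x[i] -= g / L[i]                                  #    coordinate gradient step
        v = v.copy()
        v[i] -= a * S_half / np.sqrt(L[i]) * g            #    estimate-sequence step
    return x

# Usable with the same quadratic setup (A, b, L_i = A_ii) as in the previous sketch.
```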
Complexity
For getting $\epsilon$-accuracy we need
$\frac{S_{1/2} R_{[0]}}{\epsilon^{1/2}}$ iterations $\;\le\; \frac{n L_{\nabla f}^{1/2} R_{[0]}}{\epsilon^{1/2}}$.
Main question: When is the CD-oracle $n$ times cheaper?
NB: For FGM, the oracle/method complexity is often unbalanced.
Model: $f(x) = F(Ax, x)$, where $F(s, x) : \mathbb{R}^{m+n} \to \mathbb{R}$.
Main assumption: $F(s, x)$ can be computed in $O(m + n)$ a.o. (And $\nabla F \in \mathbb{R}^{m+n}$ too!)
Consequences: Let $A = (A_1, \dots, A_n)$.
After a coordinate move, the product $Ax$ can be computed in $O(m)$ a.o.
If $Ax$ is known, $\nabla_i f(x) = \langle A_i, \nabla_s F\rangle + \nabla_{x_i} F$ can be computed in $O(m)$ a.o.
If $Ax$ and $Ay$ are known, $\alpha Ax + \beta Ay$ can be computed in $O(m)$ a.o.
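A minimal sketch of this bookkeeping for $f(x) = F(Ax, x)$: the vector $s = Ax$ is stored once and updated in $O(m)$ a.o. after a coordinate move, after which one partial derivative also costs $O(m)$; the class name and the callables grad_s_F, grad_x_F are illustrative assumptions.

```python
import numpy as np

class CoordinateOracle:
    """Maintains s = Ax so that a coordinate move and one partial derivative
    of f(x) = F(Ax, x) each cost O(m) arithmetic operations."""

    def __init__(self, A, grad_s_F, grad_x_F, x0):
        self.A = A
        self.grad_s_F = grad_s_F            # (s, x) -> dF/ds, assumed cheap to evaluate
        self.grad_x_F = grad_x_F            # (s, x) -> dF/dx, assumed cheap to evaluate
        self.x = x0.astype(float).copy()
        self.s = A @ self.x                 # one full product Ax at initialization

    def partial(self, i):
        """grad_i f(x) = <A_i, grad_s F(Ax, x)> + grad_{x_i} F(Ax, x)."""
        return self.A[:, i] @ self.grad_s_F(self.s, self.x) + self.grad_x_F(self.s, self.x)[i]

    def move(self, i, delta):
        """Coordinate move x_i <- x_i + delta; update s = Ax in O(m) a.o."""
        self.x[i] += delta
        self.s += delta * self.A[:, i]
```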
Example 1: Unconstrained quadratic minimization
Let $A = A^T \succeq 0 \in \mathbb{R}^{n\times n}$ be dense.
Define $F(s, x) = \tfrac12\langle s, x\rangle - \langle b, x\rangle$.
Then $f(x) = \tfrac12\langle Ax, x\rangle - \langle b, x\rangle$. We have
$T_{CD} = O\Big(\frac{n S_{1/2}}{\epsilon^{1/2}} R_{[0]}\Big) \;\le\; T_{FGM} = O\Big(\frac{n^2 \lambda_{\max}^{1/2}(A)}{\epsilon^{1/2}} R_{[0]}\Big)$.
NB: We use the inequality $L_i^{1/2} \le \lambda_{\max}^{1/2}(A)$, $i \in \{1:n\}$.
In some cases this inequality is too weak.
Example: $0 < \gamma_1 \le A^{(i,j)} \le \gamma_2$, $i, j \in \{1:n\}$.
Then $L_i^{1/2} \le \gamma_2^{1/2}$ and $\lambda_{\max}(A) \ge \gamma_1 \lambda_{\max}(1_n 1_n^T) = n\gamma_1$.
We gain $O(n^{1/2})$ in the total number of operations.
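A small numerical sketch of this comparison; the random symmetric matrix with entries in $[\gamma_1, \gamma_2]$ is an illustrative assumption (it is not forced to be positive semidefinite, but the bounds $L_i \le \gamma_2$ and $\lambda_{\max}(A) \ge n\gamma_1$ it illustrates hold regardless).

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma1, gamma2 = 500, 0.5, 2.0
B = rng.uniform(gamma1, gamma2, (n, n))
A = (B + B.T) / 2                              # symmetric, entries still in [gamma1, gamma2]

L = np.diag(A)                                 # coordinate Lipschitz constants L_i = A_ii <= gamma2
S_half = np.sum(np.sqrt(L))                    # S_{1/2} <= n * gamma2^{1/2}
lam_max = np.linalg.eigvalsh(A)[-1]            # lambda_max(A) >= n * gamma1

cost_cd = n * S_half                           # ~ T_CD  (up to the common factor R_[0] / eps^{1/2})
cost_fgm = n**2 * np.sqrt(lam_max)             # ~ T_FGM (same common factor)
print(cost_cd, cost_fgm, cost_cd / cost_fgm)   # ratio ~ (gamma2/gamma1)^{1/2} / n^{1/2}
```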
Example 2: Smoothing technique
Consider the function $f(x) = \max_{u\in Q}\{\langle Ax, u\rangle - \phi(u)\}$,
where $Q \subset \mathbb{R}^m$ is closed, convex, and bounded.
Define $f_\mu(x) = \max_{u\in Q}\{\langle Ax, u\rangle - \phi(u) - \mu d(u)\}$, $\mu > 0$.
Main assumption: $F(s) = \max_{u\in Q}\{\langle s, u\rangle - \phi(u) - \mu d(u)\}$ is
computable in $O(m)$ operations ($m \ge n$). Then
$T_{CD} = O\Big(\frac{m R_{[0]}}{\mu^{1/2}\epsilon^{1/2}} \sum_{i=1}^n \|Ae_i\|\Big) \;\le\; T_{FGM} = O\Big(\frac{m n R_{[0]}}{\mu^{1/2}\epsilon^{1/2}} \|A\|\Big)$.
It can be that $\|Ae_i\| \ll \|A\|$, $i \in \{1:n\}$.
Example: $0 < \gamma_1 \le A^{(i,j)} \le \gamma_2$, $i \in \{1:m\}$, $j \in \{1:n\}$.
Then $\|Ae_i\| \le \gamma_2\sqrt{m}$, and $\|A\| \ge \gamma_1\|1_m 1_n^T\| = \gamma_1\sqrt{mn}$.
We gain $O(\sqrt{n})$ in the complexity.
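A minimal sketch of the main assumption for one simple choice, which is an illustrative assumption rather than the slides' setting: $Q = [0, 1]^m$, $\phi \equiv 0$, and $d(u) = \tfrac12\|u\|^2$, so the inner maximizer is coordinate-wise and $F(s)$ costs $O(m)$ operations.

```python
import numpy as np

def F_and_maximizer(s, mu):
    """F(s) = max_{u in [0,1]^m} { <s, u> - (mu/2) * ||u||^2 }.
    Coordinate-wise maximizer: u*_i = clip(s_i / mu, 0, 1); O(m) operations."""
    u_star = np.clip(s / mu, 0.0, 1.0)
    value = s @ u_star - 0.5 * mu * np.sum(u_star**2)
    return value, u_star                     # u* also equals grad F(s) (Danskin)

# With s = Ax maintained as in the earlier sketch, one coordinate derivative of
# f_mu is grad_i f_mu(x) = <A e_i, u*(Ax)>, which again costs O(m) operations.
```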
Conclusion
1. Provided that $T(\nabla_i f(x)) = \tfrac{1}{n} T(f)$, we get methods that
are always more efficient than the usual FGM.
2. Sometimes the gain reaches $O(\sqrt{n})$.
3. We get better results on a problem class that contains
many applications of the smoothing technique.
4. Our method is oracle/iteration balanced (O(n + m)
operations for both).
But:
5. Constrained minimization is not covered.
6. Numerical testing is needed.