New perspectives for increasing efficiency of optimization schemes
Yurii Nesterov, CORE/INMA (UCL)
January 7, 2016 (Big Data Conference, UCL, London)
Joint results with S. Stich (ETH Zurich)

Thirty years ago ...

Problem: $\min_{x \in \mathbb{R}^n} f(x)$, $f$ is convex and
$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$, $x, y \in \mathbb{R}^n$.

Fast gradient method (N. 1984): $x_0 \in \mathbb{R}^n$,
$y_k = x_k + \frac{k}{k+2}(x_k - x_{k-1})$, $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$, $k \ge 0$.

Result: $f(x_k) - f^* \le \frac{2 L R^2}{k(k+1)}$, $k \ge 1$. (Optimal)

Compare: Heavy ball method (B. Polyak, 1964):
$x_{k+1} = x_k + \alpha_k (x_k - x_{k-1}) - \beta_k \nabla f(x_k)$, $k \ge 0$. (Convergence analysis for QP.)

Applications (1983-2003): Nothing ...

Twenty years later ...

Problem: $\min_{x \in Q} f(x)$, $f$ is convex, $Q$ is convex, and
$\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$, $x, y \in Q$.

Prox-function: strongly convex $d(x)$, $x \in Q$.

Fast gradient method (N.03): $v_0 = x_0 \in \mathbb{R}^n$,
$v_k = \arg\min_{x \in Q} \Big\{ d(x) + \sum_{i=0}^{k-1} \frac{i+1}{2L} \big[ f(y_i) + \langle \nabla f(y_i), x - y_i \rangle \big] \Big\}$,
$y_k = \frac{k+1}{k+3} v_k + \frac{2}{k+3} x_k$, $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$, $k \ge 0$.

Result: $f(x_k) - f^* \le \frac{2 L R^2}{k(k+1)}$, $k \ge 1$.

Applications (2003 - 2015(?)): Smooth approximations of nonsmooth functions.

Smoothing technique

Problem: $\min_{x \in Q} f(x)$, $\operatorname{diam} Q = D_1$.
Model: $f(x) = \max_{u \in U} \{ \langle Ax, u \rangle - \phi(u) \}$.
Smoothing: $f_\mu(x) = \max_{u \in U} \{ \langle Ax, u \rangle - \phi(u) - \mu d_2(u) \}$,
where $d_2$ is strongly convex on $U$, $\operatorname{diam} U = D_2$. Then
$\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_* \le \frac{1}{\mu} \|A\|^2 \cdot \|x - y\|$, $x, y \in \mathbb{R}^n$, $\mu > 0$,
where $\|A\| = \max_{x, u} \{ \langle Ax, u \rangle : \|x\| \le 1, \|u\| \le 1 \}$.

Complexity: choose $\mu = O(\epsilon)$. Then we get an $\epsilon$-solution in $O\big(\frac{1}{\epsilon} \|A\| D_1 D_2\big)$ iterations.
NB: Subgradient schemes need $O\big(\frac{1}{\epsilon^2} \|A\|^2 D_1^2 D_2^2\big)$ iterations.

Ten years later ...

Huge-scale problems ⇒ Coordinate descent methods

Problem: $\min_{x \in \mathbb{R}^n} f(x)$, with objective satisfying the conditions
$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L_i |h|$, $\forall x \in \mathbb{R}^n$, $h \in \mathbb{R}$. (NB: $L_i \le L_{\nabla f}$.)
Hence: $f(x) - f\big(x - \frac{1}{L_i} \nabla_i f(x) e_i\big) \ge \frac{1}{2 L_i} (\nabla_i f(x))^2$, $i = 1 : n$.

Consequences [N.12]: Denote $S_\alpha = \sum_{i=1}^n L_i^\alpha$, $\alpha \in [0, 1]$, and
$\|x\|_{[\alpha]}^2 = \sum_{i=1}^n L_i^\alpha (x^{(i)})^2$, $\|g\|_{[\alpha]}^2 = \sum_{i=1}^n L_i^{-\alpha} (g^{(i)})^2$.
- Choose $i$ with probability $\pi_i = \frac{1}{S_1} L_i$. Then
  $E(f(x_+)) \le f(x) - \frac{1}{2 S_1} \|\nabla f(x)\|_{[0]}^2 \;\Rightarrow\; E(f(x_k)) - f^* \le \frac{S_1 R_{[0]}^2}{k}$.
- Choose $i$ with probability $\pi_i = \frac{1}{n}$. Then
  $E(f(x_+)) \le f(x) - \frac{1}{2n} \|\nabla f(x)\|_{[1]}^2 \;\Rightarrow\; E(f(x_k)) - f^* \le \frac{n R_{[1]}^2}{k}$.

Fast Coordinate Descent

Random generator: for $\beta \in [0, 1]$, get $j = R_\beta(L) \in \{1 : n\}$ with probabilities
$\pi_\beta[i] \equiv \operatorname{Prob}(j = i) \stackrel{\mathrm{def}}{=} \frac{1}{S_\beta} L_i^\beta$, $i \in \{1 : n\}$.

[N.12] $\beta = 0$: $E(f(x_k)) - f^* \le 2 \big(\frac{n}{k+1}\big)^2 L_{\max} R_{[0]}^2$.
[Lee, Sidford 13] $\beta = 1$: $E(f(x_k)) - f^* \le \frac{2 n S_1}{(k+1)^2} R_{[0]}^2$.
[N., Stich 15] $\beta = \frac{1}{2}$: $E(f(x_k)) - f^* \le 2 \big(\frac{S_{1/2}}{k+1}\big)^2 R_{[0]}^2$.

Fast CD: choose $v_0 = x_0 \in \mathbb{R}^n$. Set $A_0 = 0$.
1) Choose the active coordinate $i_t = R_{1/2}(L)$.
2) Solve $a_{t+1}^2 S_{1/2}^2 = A_t + a_{t+1}$. Set $A_{t+1} = A_t + a_{t+1}$, $\alpha_t = \frac{a_{t+1}}{A_{t+1}}$.
3) Set $y_t = (1 - \alpha_t) x_t + \alpha_t v_t$,
   $x_{t+1} = y_t - \frac{1}{L_{i_t}} \nabla_{i_t} f(y_t) e_{i_t}$,
   $v_{t+1} = v_t - \frac{a_{t+1} S_{1/2}}{L_{i_t}^{1/2}} \nabla_{i_t} f(y_t) e_{i_t}$.

NB: Each iteration needs $O(n)$ a.o. ⇒ Not for Huge Scale.

Complexity

For getting $\epsilon$-accuracy we need $\frac{S_{1/2} R_{[0]}}{\epsilon^{1/2}}$ iterations $\le \frac{n L_{\nabla f}^{1/2} R_{[0]}}{\epsilon^{1/2}}$.

Main question: when is the CD-oracle $n$ times cheaper?
NB: For FGM, the oracle/method complexity is often unbalanced.

Model: $f(x) = F(Ax, x)$, where $F(s, x) : \mathbb{R}^{m+n} \to \mathbb{R}$.
Main assumption: $F(s, x)$ can be computed in $O(m + n)$ a.o. (And $\nabla F \in \mathbb{R}^{m+n}$ too!)

Consequences: let $A = (A_1, \ldots, A_n)$.
- After a coordinate move, the product $Ax$ can be updated in $O(m)$ a.o.
- If $Ax$ is known, $\nabla_i f(x) = \langle A_i, \nabla_s F \rangle + \nabla_{x_i} F$ can be computed in $O(m)$ a.o.
- If $Ax$ and $Ay$ are known, $\alpha Ax + \beta Ay$ can be computed in $O(m)$ a.o.
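To make these consequences concrete, here is a minimal sketch (not part of the talk) of such a cheap coordinate oracle. The class and callable names (CoordinateOracle, grad_s_F, grad_x_F) are invented for illustration; the only point is that the product $Ax$ is stored and updated after every coordinate move, so one partial derivative costs $O(m)$ a.o. instead of recomputing $Ax$ from scratch.

```python
import numpy as np

# Sketch of a cheap coordinate oracle for f(x) = F(Ax, x), assuming F and
# its gradient are computable in O(m + n) a.o. given the pair (s, x) = (Ax, x).
class CoordinateOracle:
    def __init__(self, A, grad_s_F, grad_x_F, x0):
        self.A = A                  # A = (A_1, ..., A_n), columns A_i in R^m
        self.grad_s_F = grad_s_F    # callable (s, x) -> nabla_s F(s, x) in R^m
        self.grad_x_F = grad_x_F    # callable (s, x) -> nabla_x F(s, x) in R^n
        self.x = np.array(x0, dtype=float)
        self.s = A @ self.x         # full product Ax, computed once in O(mn) a.o.

    def partial_grad(self, i):
        # nabla_i f(x) = <A_i, nabla_s F(Ax, x)> + nabla_{x_i} F(Ax, x): O(m + n) a.o.
        return self.A[:, i] @ self.grad_s_F(self.s, self.x) + self.grad_x_F(self.s, self.x)[i]

    def move(self, i, h):
        # coordinate step x <- x + h * e_i; the stored product Ax is updated in O(m) a.o.
        self.x[i] += h
        self.s += h * self.A[:, i]
```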
Example 1: Unconstrained quadratic minimization

Let $A = A^T \succeq 0 \in \mathbb{R}^{n \times n}$ be dense. Define $F(s, x) = \frac{1}{2} \langle s, x \rangle - \langle b, x \rangle$.
Then $f(x) = \frac{1}{2} \langle Ax, x \rangle - \langle b, x \rangle$. We have
$T_{CD} = O\Big(\frac{n S_{1/2}}{\epsilon^{1/2}} R_{[0]}\Big) \le T_{FGM} = O\Big(\frac{n^2 \lambda_{\max}^{1/2}(A)}{\epsilon^{1/2}} R_{[0]}\Big)$.

NB: We use the inequality $L_i^{1/2} \le \lambda_{\max}^{1/2}(A)$, $i \in \{1 : n\}$. For some cases it is too weak.
Example: $0 < \gamma_1 \le A^{(i,j)} \le \gamma_2$, $i, j \in \{1 : n\}$.
Then $L_i^{1/2} \le \gamma_2^{1/2}$ and $\lambda_{\max}(A) \ge \gamma_1 \lambda_{\max}(1_n 1_n^T) = n \gamma_1$.
We gain $O(n^{1/2})$ in the total number of operations.

Example 2: Smoothing technique

Consider the function $f(x) = \max_{u \in Q} \{ \langle Ax, u \rangle - \phi(u) \}$, where $Q \subset \mathbb{R}^m$ is closed, convex, and bounded.
Define $f_\mu(x) = \max_{u \in Q} \{ \langle Ax, u \rangle - \phi(u) - \mu d(u) \}$, $\mu > 0$.

Main assumption: $F(s) = \max_{u \in Q} \{ \langle s, u \rangle - \phi(u) - \mu d(u) \}$ is computable in $O(m)$ operations ($m \ge n$). Then
$T_{CD} = O\Big(\frac{m R_{[0]}}{\mu^{1/2} \epsilon^{1/2}} \sum_{i=1}^n \|A e_i\|\Big) \le T_{FGM} = O\Big(\frac{m n R_{[0]}}{\mu^{1/2} \epsilon^{1/2}} \|A\|\Big)$.

It can happen that $\|A e_i\| \ll \|A\|$, $i \in \{1 : n\}$.
Example: $0 < \gamma_1 \le A^{(i,j)} \le \gamma_2$, $i \in \{1 : m\}$, $j \in \{1 : n\}$.
Then $\|A e_i\| \le \gamma_2 \sqrt{m}$, and $\|A\| \ge \gamma_1 \|1_m 1_n^T\| = \gamma_1 \sqrt{mn}$.
We gain $O(\sqrt{n})$ in the complexity.

Conclusion

1. Provided that $T(\nabla_i f(x)) = \frac{1}{n} T(f)$, we get methods which are always more efficient than the usual FGM.
2. Sometimes the gain reaches $O(\sqrt{n})$.
3. We get better results on a problem class which contains many applications of the Smoothing Technique.
4. Our method is oracle/iteration balanced ($O(n + m)$ operations for both).
But:
5. Constrained minimization is not covered.
6. Numerical testing is needed.
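As a minimal illustration of point 6, the following toy sketch (not from the talk) runs the $\beta = \frac{1}{2}$ Fast CD recursion on a dense quadratic as in Example 1. It uses the naive $O(n)$-per-iteration form of the method; the problem size, data, and iteration count are invented purely for illustration.

```python
import numpy as np

# Toy run of Fast CD (beta = 1/2 sampling) on f(x) = 1/2 <Ax, x> - <b, x>.
rng = np.random.default_rng(0)
n = 50
B = rng.uniform(0.5, 1.5, size=(n, n))
A = B.T @ B + np.eye(n)                 # A = A^T > 0, dense
b = rng.standard_normal(n)

L = np.diag(A).copy()                   # coordinate Lipschitz constants: L_i = A_ii
S_half = np.sum(np.sqrt(L))             # S_{1/2} = sum_i L_i^{1/2}
probs = np.sqrt(L) / S_half             # pi_{1/2}[i] = L_i^{1/2} / S_{1/2}

f = lambda x: 0.5 * x @ (A @ x) - b @ x
x = np.zeros(n)
v = x.copy()
A_t = 0.0

for t in range(20000):
    i = rng.choice(n, p=probs)                      # i_t = R_{1/2}(L)
    # positive root of a^2 * S_{1/2}^2 = A_t + a
    a_next = (1.0 + np.sqrt(1.0 + 4.0 * S_half**2 * A_t)) / (2.0 * S_half**2)
    A_t += a_next                                   # A_{t+1} = A_t + a_{t+1}
    alpha = a_next / A_t                            # alpha_t = a_{t+1} / A_{t+1}
    y = (1.0 - alpha) * x + alpha * v               # naive O(n) a.o. per iteration
    g_i = A[i] @ y - b[i]                           # nabla_i f(y_t)
    x = y.copy()
    x[i] -= g_i / L[i]                              # x_{t+1} = y_t - (1/L_i) g_i e_i
    v[i] -= a_next * S_half / np.sqrt(L[i]) * g_i   # v_{t+1} = v_t - (a_{t+1} S_{1/2} / L_i^{1/2}) g_i e_i

x_star = np.linalg.solve(A, b)
print("f(x_T) - f* =", f(x) - f(x_star))
```

In the structured setting of the talk, the full-vector operations above would be replaced by the $O(m)$ updates listed under Consequences, which is what balances the oracle and iteration costs.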