Iteration Complexity Analysis of Block Coordinate Descent Method

Mingyi Hong
Joint work with Xiangfeng Wang, Meisam Razaviyayn, and Zhi-Quan (Tom) Luo
University of Minnesota, May 2014


The Problem

We consider the following problem with K block variables:

    minimize    f(x) := g(x_1, ..., x_K) + \sum_{k=1}^{K} h_k(x_k)
    subject to  x_k \in X_k,   k = 1, ..., K                                  (P)

- g(·) is a smooth convex function; each h_k(·) is a nonsmooth convex function.
- x = (x_1^T, ..., x_K^T)^T \in R^n is a partition of the variable x; X_k \subseteq R^{n_k}.
- The gradient of g(·) is block-wise Lipschitz continuous:

    \|\nabla_k g([x_{-k}, x_k]) - \nabla_k g([x_{-k}, \hat{x}_k])\| \le M_k \|x_k - \hat{x}_k\|,   \forall x \in X, \forall k.


The Algorithm

Consider the cyclic block coordinate minimization (BCM) method:
1. At iteration r+1, block k is updated by

    x_k^{r+1} \in \arg\min_{x_k \in X_k} g(x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k, x_{k+1}^r, ..., x_K^r) + h_k(x_k).

2. Sweep over all blocks in cyclic order.

- Each step is a small-dimensional, hence simple, computation.
- Popular for solving modern large-scale problems.
- Many known variants of the block coordinate descent (BCD) type:
  1. Block selection rule: cyclic, randomized, parallel, greedy, ...
  2. Block update rule: exact minimization, proximal gradient, ...


The Algorithm (cont.)

A popular variant: cyclic block coordinate proximal gradient (BCPG).
1. At iteration r+1, block k is updated by

    x_k^{r+1} \in \arg\min_{x_k \in X_k} u_k(x_k; x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k^r, ..., x_K^r) + h_k(x_k).

2. Sweep over all blocks in cyclic order.

Here u_k(x_k; y) is a quadratic approximation of g(·) with respect to x_k:

    u_k(x_k; y) = g(y) + \langle \nabla_k g(y), x_k - y_k \rangle + \frac{L_k}{2} \|x_k - y_k\|^2.

This achieves descent in each block (provided L_k is large enough).
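To make the BCPG update concrete, below is a minimal Python sketch under simplifying assumptions: unconstrained blocks (X_k = R^{n_k}) and an l1 block penalty h_k(x_k) = lam * ||x_k||_1, so each per-block step is a soft-thresholded gradient step. The names (cyclic_bcpg, grad_block, prox_block) and the toy LASSO instance are illustrative only and are not taken from the talk.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1, used here for the l1 block penalty.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def cyclic_bcpg(grad_block, prox_block, x0, L, num_sweeps=100):
    # One sweep visits the blocks in cyclic order; block k is replaced by the
    # prox of h_k (with step 1/L_k) evaluated after a block gradient step on g,
    # always using the latest values of the previously updated blocks.
    x = [xk.copy() for xk in x0]
    for _ in range(num_sweeps):
        for k in range(len(x)):
            g_k = grad_block(x, k)                      # block gradient of g at the current point
            x[k] = prox_block(x[k] - g_k / L[k], k, 1.0 / L[k])
    return x

# Toy instance: min 0.5 * ||A x - b||^2 + lam * ||x||_1, split into two blocks.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
b = rng.standard_normal(40)
lam = 0.1
blocks = [np.arange(0, 10), np.arange(10, 20)]

def grad_block(x, k):
    # Block gradient of 0.5 * ||A x - b||^2 with respect to x_k.
    residual = A @ np.concatenate(x) - b
    return A[:, blocks[k]].T @ residual

def prox_block(v, k, step):
    # Prox of step * lam * ||.||_1 (the same penalty for every block here).
    return soft_threshold(v, lam * step)

L_blocks = [np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks]  # block Lipschitz constants
x_sol = cyclic_bcpg(grad_block, prox_block,
                    [np.zeros(10), np.zeros(10)], L_blocks, num_sweeps=200)
```

Each inner update is exactly the quadratic upper-bound minimization above with step size 1/L_k, evaluated at the latest values of the already-updated blocks.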
The Contribution

Question: for BCD-type algorithms, how many iterations are required to solve problem (P) to ǫ-optimality (a.k.a. the iteration complexity)?

- Contribution 1: cyclic BCPG-type inexact schemes achieve an O(1/r) rate.
- Contribution 2: variants of BCM achieve an O(1/r) rate (randomized, greedy, cyclic).


Agenda

1. Prior Art
   1. BCPG and its variants
   2. BCM and its variants
2. Analyzing the cyclic BCPG-type algorithm
   1. The block successive upper-bound minimization (BSUM) method
   2. Main steps of the iteration complexity analysis
3. Analyzing the BCM-type algorithm
   1. The randomized and greedy BCM
   2. The cyclic BCM
4. Conclusion


Prior Art: BCPG

There is a large literature on analyzing BCPG-type algorithms.

Randomized BCPG: each step randomly picks a block to update.
- Strongly convex problems: linear rate [Richtárik-Takáč 12]
- Convex problems with an O(1/r) rate:
  1. Smooth [Nesterov 12]
  2. Smooth + l1 penalization [Shalev-Shwartz-Tewari 11]
  3. General nonsmooth (P) [Richtárik-Takáč 12][Lu-Xiao 13]


Prior Art: BCPG (cont.)

How about cyclic BCPG? Less is known.
- Strongly convex problems: linear rate
- Convex problems with an O(1/r) rate:
  1. LASSO, when the so-called "isotonicity assumption" is satisfied [Saha-Tewari 13]
  2. Smooth [Beck-Tetruashvili 13]
  3. General nonsmooth, i.e., (P)?


Prior Art: BCM

How about cyclic BCM? Even less is known.
- Strongly convex problems: linear rate
- Convex problems with an O(1/r) rate:
  1. Smooth unconstrained problems with K = 2 (two block variables) [Beck-Tetruashvili 13]
  2. Other cases? Randomized BCM? Other coordinate selection rules (e.g., Gauss-Southwell)?


Prior Art: Linear Rate for Convex Problems

It is also possible to achieve a linear rate for a subset of convex problems (without strong convexity).
- Assume g(·) is a composite function g(x) = \ell(Ax), with \ell strongly convex and A a linear mapping.
- Assume h(·) is a mixed l1/l2 norm.
- Randomized/cyclic BCPG achieve a linear rate [Luo-Tseng 92][Hong et al 13, 14][Necoara-Clipici 13][Kadkhodaei et al 14].
- For BCM: the additional requirement of per-block strong convexity.


The Summary of Results

A summary of existing results on sublinear rates for BCD-type methods.
(NS = nonsmooth, S = smooth, C = constrained, U = unconstrained, K = K blocks, 2 = 2 blocks; Cy = cyclic, GS = Gauss-Southwell, R = randomized)

Table: Summary of Prior Art

    Method        Problem    O(1/r) rate
    R-BCPG        NS-C-K     √
    Cy-BCPG       S-C-K      √
    Cy/GS-BCPG    NS-C-K     ?
    R/GS-BCM      NS-C-K     ?
    Cy-BCM        S-U-2      √
    Cy-BCM        NS-C-K     ?
This Work

This work establishes the following results (abbreviations as in the previous table).

Table: Summary of Prior Art + This Work

    Method        Problem    O(1/r) rate
    R-BCPG        NS-C-K     √
    Cy-BCPG       S-C-K      √
    Cy/GS-BCPG    NS-C-K     √ (this work)
    R/GS-BCM      NS-C-K     √ (this work)
    Cy-BCM        S-U-2      √
    Cy-BCM        NS-C-K     partial (this work)


The BSUM Method

We first introduce the so-called block successive upper-bound minimization (BSUM) method [Razaviyayn-Hong-Luo 13], a general framework that includes BCPG as a special case.

Main steps:
1. At iteration r+1, block k is updated by

    x_k^{r+1} \in \arg\min_{x_k \in X_k} u_k(x_k; x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k^r, ..., x_K^r) + h_k(x_k).

2. Sweep over all blocks in cyclic order.

Here u_k(x_k; y) is a locally tight upper-bound function for g(x).


The BSUM Method (cont.)

Assumptions on u_k(v_k; x):
(a) u_k(x_k; x) = g(x), \forall x                                   (locally tight)
(b) u_k(v_k; x) is continuous in v_k and x                          (continuity)
(c) u_k(v_k; x) \ge g(v_k, x_{-k}), \forall v_k, \forall x          (upper bound)
(d) u_k(v_k; x) is strongly convex in v_k with coefficient \gamma_k
(e) u_k(v_k; x) has a Lipschitz continuous gradient with coefficient L_k

We refer to such a u_k(v_k; x) as a valid upper bound.

[Figure: illustration of the upper-bound function.]


The BSUM Method (cont.)

- The convergence of BSUM is guaranteed, even for nonconvex problems [Razaviyayn-Hong-Luo 13].
- BSUM reduces to cyclic BCPG when u_k(x_k; y) is a quadratic approximation.
- BSUM reduces to cyclic BCM when no approximation is used.
- Many other choices of upper-bound functions are possible [Razaviyayn-Hong-Luo 13][Mairal 13].


Iteration Complexity Analysis for BSUM

Define X^* as the optimal solution set, and let x^* \in X^* be one of the optimal solutions.
Define the optimality gap as \Delta^r := f(x^r) - f(x^*).

Main steps:
1. Sufficient descent: \Delta^r - \Delta^{r+1} is large enough.
2. Estimate the cost-to-go: \Delta^{r+1} is small enough.
3. Estimate the rate: show that (\Delta^{r+1})^2 \le c (\Delta^r - \Delta^{r+1}).


Sufficient Descent

Lemma (Sufficient Descent)
For BSUM, we have for all r \ge 1

    \Delta^r - \Delta^{r+1} = f(x^r) - f(x^{r+1}) \ge \gamma \|x^r - x^{r+1}\|^2,

where the constant \gamma := (1/2) \min_k \gamma_k > 0.

The proof is easy, using per-block strong convexity.
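The slides leave the proof to the reader; here is one way the per-block strong convexity argument can be filled in (the intermediate iterates w^{(k)} are notation introduced only for this sketch). Let w^{(k)} := (x_1^{r+1}, ..., x_k^{r+1}, x_{k+1}^r, ..., x_K^r), so that w^{(0)} = x^r and w^{(K)} = x^{r+1}. Since x_k^{r+1} minimizes the \gamma_k-strongly convex function u_k(\cdot; w^{(k-1)}) + h_k(\cdot) over X_k, and x_k^r \in X_k,

    u_k(x_k^r; w^{(k-1)}) + h_k(x_k^r) \ge u_k(x_k^{r+1}; w^{(k-1)}) + h_k(x_k^{r+1}) + \frac{\gamma_k}{2} \|x_k^r - x_k^{r+1}\|^2.

By the tightness assumption (a), u_k(x_k^r; w^{(k-1)}) = g(w^{(k-1)}); by the upper-bound assumption (c), u_k(x_k^{r+1}; w^{(k-1)}) \ge g(w^{(k)}). Since only the k-th nonsmooth term changes between w^{(k-1)} and w^{(k)}, this gives

    f(w^{(k-1)}) - f(w^{(k)}) \ge \frac{\gamma_k}{2} \|x_k^r - x_k^{r+1}\|^2.

Summing over k = 1, ..., K and lower-bounding each \gamma_k/2 by \gamma = (1/2)\min_k \gamma_k yields the lemma.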
Estimating the Cost-To-Go

Define R := \max_{x \in X} \max_{x^* \in X^*} \{ \|x - x^*\| : f(x) \le f(x^1) \}.

We utilize the optimality of {x_k^{r+1}} as well as the Lipschitz continuity of the gradients of g and of the u_k's.

Lemma (Cost-to-go Estimate)
For BSUM, we have

    (\Delta^{r+1})^2 = ( f(x^{r+1}) - f(x^*) )^2 \le 2K (M^2 + L_{max}^2) R^2 \|x^{r+1} - x^r\|^2,

where L_{max} = \max_k L_k.


O(1/r) Iteration Complexity

Linking the above two lemmas, we obtain for all r \ge 1

    \Delta^r - \Delta^{r+1} \ge \frac{\gamma}{2K (M^2 + L_{max}^2) R^2} (\Delta^{r+1})^2 =: \sigma (\Delta^{r+1})^2.

By a simple induction argument, we obtain

    \Delta^r \le \frac{c}{\sigma r},   with  c := \max\{ 4\sigma - 2, \; f(x^1) - f^*, \; 2 \}.


Special Case: Cyclic BCPG

For BCPG, the upper-bound function is given by

    u_k(x_k; y) := g(y) + \langle \nabla_k g(y), x_k - y_k \rangle + \frac{L_k}{2} \|x_k - y_k\|^2,

where L_k is the largest eigenvalue of \nabla_k^2 g(x). Then \gamma_k = L_k = M_k, and the bound reduces to

    \Delta^r \le \frac{8 c K M^2 R^2}{\min_k M_k} \cdot \frac{1}{r}.

This bound is of the same order as the one stated in [Beck-Tetruashvili 13, Theorem 6.1].


Remark

- The analysis also addresses the rate of cyclic BCM when the per-block subproblem is strongly convex, for example:
  - the LASSO problem when each block is a single variable:  \min \|Ax - b\|^2 + \|x\|_1;
  - the rate maximization problem in wireless communication networks:

        \max_{\{p_k\}} \sum_{k=1}^{K} \log( 1 + |h_k|^2 p_k ),   s.t.  p_k \in P_k, \forall k.

- The analysis extends to the Gauss-Southwell update rule (i.e., greedy coordinate selection).


The Key Challenge

Without per-block strong convexity (SC):
1. the per-block subproblems may have multiple optimal solutions;
2. the sufficient descent estimate is lost.

The road map:
1. Still focus on analyzing BSUM.
2. Drop the strong convexity assumption on u_k(x_k; y).
3. Consider randomized/greedy update rules for (P).
4. Consider cyclic rules for a subset of the problems in (P).


The R-BSUM (no per-block SC)

Consider the randomized BSUM (R-BSUM) without per-block SC:
1. Pick {p_k}_{k=1}^{K} satisfying p_k > 0 and \sum_k p_k = 1.
2. At iteration r+1, randomly choose a block index k^*, with k^* = k chosen with probability p_k.
3. Compute

    x_{k^*}^{r+1} \in \arg\min_{x_{k^*} \in X_{k^*}} u_{k^*}(x_{k^*}; x_{-k^*}^r) + h_{k^*}(x_{k^*}),

   and set x_j^{r+1} = x_j^r for all j \ne k^*.

In each step, one of the (possibly multiple) optimal solutions x_{k^*}^{r+1} is picked.


Rate Analysis for R-BSUM (Brief)

- Additional assumption: h is Lipschitz continuous.
- The same three steps as before; convergence holds in expectation.
- Covers R-BCM as a special case.
- Extends to BSUM with a Gauss-Southwell-type update rule (greedy coordinate selection).
- Question: does the analysis extend to the cyclic rule?


Rate Analysis for Cyclic BSUM (no per-block SC)

Consider the following family of problems:
1. Composite smooth part:  g(x) = \sum_{i=1}^{I} \ell_i( A_{i,1} x_1, ..., A_{i,K} x_K ).
2. Each \ell_i(y_1, ..., y_K) is strongly convex w.r.t. each component y_k; the matrices A_{i,k} can be rank deficient.

Special cases:
1. Group LASSO:  \| \sum_k A_k x_k - b \|^2 + \sum_k \|x_k\|_2  (a quick check that this fits the template is sketched below)
2. Sparse logistic regression over a compact set
3. Optimizing wireless networks (with rank-deficient channels)

Key steps (the same three-step approach):
- \Delta^r - \Delta^{r+1} \ge \hat{\gamma} \sum_{i=1}^{I} \sum_{k=1}^{K} \| A_{i,k} (x_k^r - x_k^{r+1}) \|^2
- (\Delta^{r+1})^2 \le K \max_{k,i} \rho( A_{i,k}^T A_{i,k} ) R^2 \sum_{i=1}^{I} \sum_{k=1}^{K} \| A_{i,k} (x_k^r - x_k^{r+1}) \|^2
- ...
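As a quick check, not spelled out on the slides, the group-LASSO special case fits this template with I = 1: take

    \ell(y_1, ..., y_K) := \| \sum_{k=1}^{K} y_k - b \|^2,   with  y_k = A_k x_k  and  h_k(x_k) = \|x_k\|_2.

For any fixed values of the other components, y_k \mapsto \| y_k + \sum_{j \ne k} y_j - b \|^2 is strongly convex in y_k (with modulus 2), so \ell is strongly convex w.r.t. each component, even though the matrices A_k, and hence g itself, may be rank deficient.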
Rate Analysis for Cyclic BSUM (cont.)

The analysis extends to the following smooth part (the L2-SVM):

    g(x) = \sum_{i=1}^{I} \big[ \max\{0, 1 - x^T a_i\} \big]^2.

- This implies the same rate for cyclic BCM in these two cases.
- In [Beck-Tetruashvili 13], cyclic BCM is shown to have an O(1/r) rate for 2-block, smooth, unconstrained problems.
- The above analysis handles K-block, nonsmooth, constrained problems, but requires special structure.


Conclusion

We show an O(1/r) sublinear rate of convergence for a number of BCD-type methods. The main results are summarized below.
(NS = nonsmooth, C = constrained, K = K blocks, SC = strongly convex; Cy = cyclic, GS = Gauss-Southwell, R = randomized)

Table: Summary

    Method        Problem     Assumption
    Cy/GS-BSUM    NS+C+K      u_k a valid upper bound
    R/GS-BCM      NS+C+K      h Lipschitz / g block-SC
    Cy-BCM        NS+C+K      g composite / L2-SVM / block-SC


Future Work

- Reduce the constants, especially w.r.t. K; some preliminary results in [Sun 14].
- Analysis of cyclic BCM methods for the general problem (P)?
- Linearly constrained problems? Some recent results in this direction: [Hong-Luo 12][Hong et al 13][Hong et al 14][Necoara et al 13].


Thank You!


References

1. [Richtárik-Takáč 12] P. Richtárik and M. Takáč, "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function," Mathematical Programming, 2012.
2. [Nesterov 12] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, 2012.
3. [Shalev-Shwartz-Tewari 11] S. Shalev-Shwartz and A. Tewari, "Stochastic methods for l1-regularized loss minimization," Journal of Machine Learning Research, 2011.
4. [Beck-Tetruashvili 13] A. Beck and L. Tetruashvili, "On the convergence of block coordinate descent type methods," SIAM Journal on Optimization, 2013.
5. [Lu-Xiao 13] Z. Lu and L. Xiao, "On the complexity analysis of randomized block-coordinate descent methods," preprint, 2013.
6. [Saha-Tewari 13] A. Saha and A. Tewari, "On the nonasymptotic convergence of cyclic coordinate descent methods," SIAM Journal on Optimization, 2013.
7. [Luo-Tseng 92] Z.-Q. Luo and P. Tseng, "On the convergence of the coordinate descent method for convex differentiable minimization," Journal of Optimization Theory and Applications, 1992.
8. [Hong et al 14] M. Hong, T.-H. Chang, X. Wang, M. Razaviyayn, S. Ma, and Z.-Q. Luo, "A block successive minimization method of multipliers," preprint, 2014.
9. [Hong-Luo 12] M. Hong and Z.-Q. Luo, "On the convergence of ADMM," preprint, 2012.
10. [Necoara-Clipici 13] I. Necoara and D. Clipici, "Distributed coordinate descent methods for composite minimization," preprint, 2013.
11. [Kadkhodaei et al 14] M. Kadkhodaei, M. Sanjabi, and Z.-Q. Luo, "On the linear convergence of the approximate proximal splitting method for non-smooth convex optimization," preprint, 2014.
12. [Razaviyayn-Hong-Luo 13] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," 2013.
13. [Mairal 13] J. Mairal, "Optimization with first-order surrogate functions," Journal of Machine Learning Research, 2013.
14. [Sun 14] R. Sun, "Improved iteration complexity analysis for cyclic block coordinate descent method," technical note, 2014.