Iteration Complexity Analysis of Block Coordinate Descent Method Mingyi Hong University of Minnesota

Iteration Complexity Analysis of Block
Coordinate Descent Method
Mingyi Hong
Joint work with Xiangfeng Wang, Meisam Razaviyayn
and Zhi-Quan (Tom) Luo
University of Minnesota
May 2014
The Problem

We consider the following problem (with K block variables):

    minimize   f(x) := g(x_1, ..., x_K) + ∑_{k=1}^K h_k(x_k)
    subject to x_k ∈ X_k,  k = 1, ..., K                          (P)

g(·) smooth convex function; h_k(·) nonsmooth convex function
x = (x_1^T, ..., x_K^T)^T ∈ R^n, partitioned into blocks x_k ∈ X_k ⊆ R^{n_k}
The gradient of g(·) is block-wise Lipschitz continuous:

    ‖∇_k g([x_{-k}, x_k]) − ∇_k g([x_{-k}, x̂_k])‖ ≤ M_k ‖x_k − x̂_k‖,  ∀ x ∈ X, ∀ k

The Algorithm

Consider the cyclic block coordinate minimization (BCM)

1. At iteration r + 1, block k is updated by

       x_k^{r+1} ∈ arg min_{x_k ∈ X_k} g(x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k, x_{k+1}^r, ..., x_K^r) + h_k(x_k)

2. Sweep over all blocks in cyclic order

Each step has small dimension – simple computation
Popular for solving modern large-scale problems
Many known variants – block coordinate descent (BCD) type
1. Block Selection Rule: cyclic, randomized, parallel, greedy, ...
2. Block Update Rule: exact minimization, proximal gradient, ...

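As a concrete illustration, the cyclic BCM sweep above can be sketched as follows. This is a minimal numerical sketch, not code from the talk: the quadratic objective, block size, and iteration count are made-up choices, with h_k ≡ 0 and X_k = R^{n_k}, so each block subproblem is an exact small linear solve.

```python
import numpy as np

# Cyclic BCM with exact block minimization on a strongly convex quadratic
#   f(x) = 0.5 x^T Q x - b^T x   (made-up data; h_k = 0, X_k = R^{n_k}).
rng = np.random.default_rng(0)
n, blk = 6, 2                         # 3 blocks of size 2
B = rng.standard_normal((n, n))
Q = B.T @ B + np.eye(n)               # positive definite -> strongly convex
b = rng.standard_normal(n)

def f(x):
    return 0.5 * x @ Q @ x - b @ x

x = np.zeros(n)
for r in range(2000):                 # outer iterations (cyclic sweeps)
    for k in range(0, n, blk):        # sweep blocks in cyclic order
        idx = slice(k, k + blk)
        # exact minimization over block k with the other blocks fixed:
        #   Q_kk x_k = b_k - Q_{k,-k} x_{-k}
        rhs = b[idx] - Q[idx, :] @ x + Q[idx, idx] @ x[idx]
        x[idx] = np.linalg.solve(Q[idx, idx], rhs)

x_star = np.linalg.solve(Q, b)        # global minimizer, for reference
gap = f(x) - f(x_star)
```

Each inner step only touches one small block, which is exactly the "small dimension, simple computation" point above.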
The Algorithm (cont.)

A popular variant: cyclic block coordinate proximal gradient (BCPG)

1. At iteration r + 1, block k is updated by

       x_k^{r+1} ∈ arg min_{x_k ∈ X_k} u_k(x_k; x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k^r, ..., x_K^r) + h_k(x_k)

2. Sweep over all blocks in cyclic order

u_k(x_k; y): a quadratic approximation of g(·) w.r.t. x_k

    u_k(x_k; y) = g(y) + ⟨∇_k g(y), x_k − y_k⟩ + (L_k/2) ‖x_k − y_k‖²

Achieving descent in each block (L_k large enough)

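Minimizing the quadratic surrogate plus h_k amounts to one proximal gradient step per block. Below is a minimal sketch for a group-LASSO instance (made-up data, not from the talk): h_k(x_k) = λ‖x_k‖₂, whose prox is block soft-thresholding, and L_k is taken as ‖A_k‖², the block Lipschitz constant of the smooth part.

```python
import numpy as np

# Cyclic BCPG for  min 0.5||Ax - b||^2 + lam * sum_k ||x_k||_2,
# with blocks of size 2 (all data and lam are made up).
rng = np.random.default_rng(1)
m, n, blk, lam = 20, 6, 2, 0.3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
blocks = [slice(k, k + blk) for k in range(0, n, blk)]
L = [np.linalg.norm(A[:, s], 2) ** 2 for s in blocks]  # block constants L_k

def F(x):
    pen = sum(np.linalg.norm(x[s]) for s in blocks)    # sum_k ||x_k||_2
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * pen

def prox_l2(v, t):                                     # prox of t*||.||_2
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= t else (1.0 - t / nv) * v

x = np.zeros(n)
vals = [F(x)]
for r in range(300):
    for s, Lk in zip(blocks, L):                       # cyclic sweep
        gk = A[:, s].T @ (A @ x - b)                   # block gradient of g
        x[s] = prox_l2(x[s] - gk / Lk, lam / Lk)       # prox-gradient step
    vals.append(F(x))
```

With L_k at least the true block Lipschitz constant, every block step is guaranteed not to increase F, which is the per-block descent property noted above.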
The Contribution

Question: For BCD-type algorithms, what is the number of iterations
required to solve problem (P) to ε-optimality (a.k.a. the iteration
complexity)?

Contribution 1: Cyclic BCPG-type inexact schemes achieve an O(1/r) rate
Contribution 2: Variants of BCM achieve an O(1/r) rate (randomized,
greedy, cyclic)

Agenda

Prior Art
  1. BCPG and its variants
  2. BCM and its variants
Analyzing the cyclic BCPG-type Algorithm
  1. The block successive upper-bound minimization (BSUM)
  2. Main steps for iteration complexity analysis
Analyzing the BCM-type Algorithm
  1. The randomized and greedy BCM
  2. The cyclic BCM
Conclusion

Prior Art: BCPG

A large literature on analyzing BCPG-type algorithms

Randomized BCPG: each step randomly picks a block to update
  Strongly convex problems, linear rate [Richtárik-Takáč 12]
  Convex problems with O(1/r) rate
    1. Smooth [Nesterov 12]
    2. Smooth + L1 penalization [Shalev-Shwartz-Tewari 11]
    3. General nonsmooth (P) [Richtárik-Takáč 12][Lu-Xiao 13]

Prior Art: BCPG (cont.)

How about cyclic BCPG? Less is known

For strongly convex problems, linear rate
For convex problems with O(1/r) rate
  1. LASSO, when it satisfies the so-called "isotonicity assumption"
     [Saha-Tewari 13]
  2. Smooth [Beck-Tetruashvili 13]
  3. General nonsmooth, i.e., (P)?

Prior Art: BCM

How about cyclic BCM? Even less is known

For strongly convex problems, linear rate
For convex problems, O(1/r) rate
  1. Smooth unconstrained problem with K = 2 (two block variables)
     [Beck-Tetruashvili 13]
  2. Other cases?
Randomized BCM? Other coordinate update rules (e.g., Gauss-Southwell)?

Prior Art: Linear Rate for Convex Problems

It is also possible to achieve a linear rate for a subset of convex
problems (without strong convexity)

Assume g(·) is a composite function:

    g(x) = ℓ(Ax),  ℓ strongly convex, A a linear mapping.

Assume h(·) is a mixed L1/L2 norm
Randomized/cyclic BCPG with linear rate [Luo-Tseng 92][Hong et al 13, 14]
[Necoara-Clipici 13][Kadkhodaei et al 14]
For BCM: additional requirement of per-block strong convexity

The Summary of Results

A summary of existing results on sublinear rates for BCD-type methods
NS=NonSmooth, S=Smooth, C=Constrained, U=Unconstrained, K=K-block,
2=2-Block; Cy=Cyclic, GS=Gauss-Southwell, R=Randomized

Table: Summary of Prior Art

    Method        Problem    O(1/r) Rate
    R-BCPG        NS-C-K     √
    Cy-BCPG       S-C-K      √
    Cy/GS-BCPG    NS-C-K     ?
    R/GS-BCM      NS-C-K     ?
    Cy-BCM        S-U-2      √
    Cy-BCM        NS-C-K     ?

This Work

This work shows the following results
NS=NonSmooth, S=Smooth, C=Constrained, U=Unconstrained, K=K-block,
2=2-Block; Cy=Cyclic, GS=Gauss-Southwell, R=Randomized

Table: Summary of Prior Art + This Work

    Method        Problem    O(1/r) Rate
    R-BCPG        NS-C-K     √
    Cy-BCPG       S-C-K      √
    Cy/GS-BCPG    NS-C-K     √
    R/GS-BCM      NS-C-K     √
    Cy-BCM        S-U-2      √
    Cy-BCM        NS-C-K     partial

The BSUM method

We first introduce the so-called block successive upper-bound
minimization (BSUM) method [Razaviyayn-Hong-Luo 13]

A general framework that includes BCPG as a special case

Main steps
1. At iteration r + 1, block k is updated by

       x_k^{r+1} ∈ arg min_{x_k ∈ X_k} u_k(x_k; x_1^{r+1}, ..., x_{k-1}^{r+1}, x_k^r, ..., x_K^r) + h_k(x_k)

2. Sweep over all blocks in cyclic order

u_k(x_k; y): a locally tight upper-bound function for g(x)

The BSUM method (cont.)

Assumptions on u_k(v_k; x):
(a) u_k(x_k; x) = g(x), ∀ x (locally tight)
(b) u_k(v_k; x) is continuous in v_k and x (continuity)
(c) u_k(v_k; x) ≥ g(v_k, x_{-k}), ∀ v_k, x (upper bound)
(d) u_k(v_k; x) is strongly convex in v_k with coefficient γ_k
(e) u_k(v_k; x) has Lipschitz gradient with coefficient L_k

We refer to such a u_k(v_k; x) as a valid upper bound

Figure: Illustration of the upper-bound.

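Assumptions (a) and (c) can be checked numerically for the quadratic surrogate used by BCPG. The sketch below does this for a made-up quadratic g (not an example from the talk), taking L_k as the largest eigenvalue of the block Hessian, which makes the surrogate a valid upper bound in block k.

```python
import numpy as np

# Check tightness (a) and the upper-bound property (c) for the surrogate
#   u_k(v; y) = g(y) + <grad_k g(y), v - y_k> + (L_k/2)||v - y_k||^2
# on g(x) = 0.5 x^T Q x (made-up Q, block k = coordinates 2..3).
rng = np.random.default_rng(2)
n, k0, blk = 6, 2, 2
B = rng.standard_normal((n, n))
Q = B.T @ B
g = lambda x: 0.5 * x @ Q @ x
idx = slice(k0, k0 + blk)
Lk = np.linalg.eigvalsh(Q[idx, idx]).max()   # largest eig of block Hessian

def u_k(v, y):                               # quadratic surrogate in block k
    gk = (Q @ y)[idx]                        # block gradient of g at y
    d = v - y[idx]
    return g(y) + gk @ d + 0.5 * Lk * (d @ d)

ok_tight, ok_upper = True, True
for _ in range(100):
    y = rng.standard_normal(n)
    v = rng.standard_normal(blk)
    ok_tight &= bool(abs(u_k(y[idx], y) - g(y)) < 1e-9)   # (a) at v = y_k
    z = y.copy(); z[idx] = v                              # the point (v, y_{-k})
    ok_upper &= bool(u_k(v, y) >= g(z) - 1e-9)            # (c)
```

Here u_k(v; y) − g(v, y_{-k}) = ½ d^T (L_k I − Q_kk) d ≥ 0, so (c) holds exactly for this choice of L_k; the loop just confirms it on random points.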
The BSUM method (cont.)

The convergence of BSUM is guaranteed, even for nonconvex problems
[Razaviyayn-Hong-Luo 13]

Reduces to cyclic BCPG when u_k(x_k; y) is a quadratic approximation
Reduces to cyclic BCM when no approximation is used
Many other choices of upper-bound functions [Razaviyayn-Hong-Luo 13]
[Mairal 13]

Iteration Complexity Analysis for BSUM

Define X* as the optimal solution set, and let x* ∈ X* be one of the
optimal solutions

Define the optimality gap as

    ∆^r := f(x^r) − f(x*)

Main Steps
1. Sufficient Descent: ∆^r − ∆^{r+1} large enough
2. Estimate Cost-To-Go: ∆^{r+1} small enough
3. Estimate the Rate: show (∆^{r+1})² ≤ c (∆^r − ∆^{r+1})

Sufficient Descent

Lemma (Sufficient Descent)
For BSUM, we have for all r ≥ 1

    ∆^r − ∆^{r+1} = f(x^r) − f(x^{r+1}) ≥ γ ‖x^r − x^{r+1}‖²,

where the constant γ := (1/2) min_k γ_k > 0.

The proof is easy, using per-block strong convexity

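The lemma can be sanity-checked numerically. For exact cyclic coordinate minimization of a quadratic (a BSUM instance with u_k = g), the per-block strong convexity coefficient γ_k is the diagonal entry Q_kk; the sketch below, on made-up data, verifies the inequality sweep by sweep.

```python
import numpy as np

# Verify  f(x^r) - f(x^{r+1}) >= gamma * ||x^r - x^{r+1}||^2  for exact
# cyclic coordinate minimization on f(x) = 0.5 x^T Q x - b^T x
# (made-up data; scalar blocks, gamma_k = Q_kk).
rng = np.random.default_rng(3)
n = 8
B = rng.standard_normal((n, n))
Q = B.T @ B + 0.5 * np.eye(n)                  # strongly convex quadratic
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ Q @ x - b @ x
gamma = 0.5 * min(Q[k, k] for k in range(n))   # gamma = (1/2) min_k gamma_k

x = rng.standard_normal(n)
descent_ok = True
for r in range(50):
    x_old = x.copy()
    for k in range(n):                         # one cyclic sweep, exact solves
        x[k] = (b[k] - Q[k] @ x + Q[k, k] * x[k]) / Q[k, k]
    lhs = f(x_old) - f(x)                      # Delta^r - Delta^{r+1}
    rhs = gamma * np.sum((x_old - x) ** 2)
    descent_ok &= bool(lhs >= rhs - 1e-8)
```

For this quadratic the per-sweep decrease is exactly Σ_k ½ Q_kk (Δx_k)², so the bound holds with γ = ½ min_k Q_kk, matching the lemma.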
Estimating the Cost-To-Go

Define R := max_{x ∈ X} max_{x* ∈ X*} {‖x − x*‖ : f(x) ≤ f(x^1)}

Utilize the optimality of {x_k^{r+1}} as well as the Lipschitz
continuity of the gradients of g and the u_k's

Lemma (Cost-to-go Estimate)
For BSUM, we have

    (∆^{r+1})² = (f(x^{r+1}) − f(x*))² ≤ 2K (M² + L_max²) R² ‖x^{r+1} − x^r‖²,

where L_max = max_k L_k.

O(1/r) Iteration Complexity

Linking the above two lemmas, we obtain for all r ≥ 1

    ∆^r − ∆^{r+1} ≥ [γ / (2K (M² + L_max²) R²)] (∆^{r+1})² := σ (∆^{r+1})²

By a simple induction argument, we obtain

    ∆^r ≤ c / (σ r),  with c := max{4σ − 2, f(x^1) − f*, 2}.

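The induction can be checked numerically by iterating the worst case, where the key inequality holds with equality. The constants below (σ, ∆¹, and the envelope constant c = max{2, σ∆¹}) are made-up illustrative choices, with c a simplified stand-in for the slide's constant, chosen so the induction step c/r − c/(r+1) ≤ c²/(r+1)² goes through for c ≥ 2.

```python
import math

# Worst case of  Delta^r - Delta^{r+1} >= sigma * (Delta^{r+1})^2:
# iterate the equality recursion and check the O(1/r) envelope
# Delta^r <= c/(sigma*r).  sigma and Delta^1 are made-up constants.
sigma, d = 0.05, 10.0
c = max(2.0, sigma * d)                  # simplified envelope constant
ok = True
for r in range(1, 2001):
    ok &= d <= c / (sigma * r) + 1e-12   # check Delta^r <= c/(sigma*r)
    # next gap solves  d_r = d_{r+1} + sigma * d_{r+1}^2
    d = (-1.0 + math.sqrt(1.0 + 4.0 * sigma * d)) / (2.0 * sigma)
```

Since the equality recursion is the slowest admissible sequence, the envelope holding here is evidence for the O(1/r) claim on every admissible sequence.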
Special Case: cyclic BCPG

For BCPG, the upper-bound function is given by

    u_k(x_k; y) := g(y) + ⟨∇_k g(y), x_k − y_k⟩ + (L_k/2) ‖x_k − y_k‖²,

where L_k is the largest eigenvalue of ∇_k² g(x)

Then we have γ_k = L_k = M_k, and the bound reduces to

    ∆^r ≤ (8 c K M² R² / min_k M_k) (1/r)

The bound is of the same order as the one stated in
[Beck-Tetruashvili 13, Theorem 6.1]

Remark

Also addresses the rate for cyclic BCM when the per-block subproblem is
strongly convex

The LASSO problem when each block is a single variable:

    min ‖Ax − b‖² + ‖x‖_1

Rate maximization problem in wireless communication networks:

    max_p log(∑_{k=1}^K |h_k|² p_k + 1),  s.t. p_k ∈ P_k, ∀ k

Extends to the Gauss-Southwell update rule (i.e., greedy coordinate
selection)

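For the single-variable-block LASSO case above, each BCM subproblem is strongly convex (curvature ‖a_k‖²) and has a closed-form soft-thresholding solution. A minimal sketch with made-up data, not from the talk:

```python
import numpy as np

# Cyclic BCM with exact scalar-block minimization for
#   min 0.5||Ax - b||^2 + lam*||x||_1   (made-up data and lam).
rng = np.random.default_rng(4)
m, n, lam = 30, 8, 0.2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
d = (A * A).sum(axis=0)                      # per-block curvature ||a_k||^2

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()

x = np.zeros(n)
vals = [F(x)]
for r in range(200):
    for k in range(n):                       # cyclic sweep, exact solves
        resid = b - A @ x + A[:, k] * x[k]   # residual with block k removed
        z = A[:, k] @ resid                  # unnormalized least-squares fit
        x[k] = np.sign(z) * max(abs(z) - lam, 0.0) / d[k]   # soft-threshold
    vals.append(F(x))
```

Because each scalar subproblem is solved exactly, the objective is monotonically nonincreasing, and the per-block strong convexity needed by the analysis above is d_k = ‖a_k‖² > 0.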
The Key Challenge

No per-block strong convexity (SC) anymore
  1. multiple optimal solutions
  2. the sufficient descent estimate is lost

The road map
  1. Still focus on analyzing BSUM
  2. Drop the strong convexity assumption on u_k(x_k; y)
  3. Consider randomized/greedy update rules for (P)
  4. Consider cyclic rules for a subset of problems in (P)

The R-BSUM (no per-block SC)

Consider the Randomized-BSUM without per-block SC
  1. Pick {p_k}_{k=1}^K satisfying p_k > 0, ∑_k p_k = 1.
  2. At iteration r + 1, randomly choose block k* = k with probability p_k.
  3. Compute

         x_{k*}^{r+1} ∈ arg min_{x_{k*} ∈ X_{k*}} u_{k*}(x_{k*}; x_{-k*}^r) + h_{k*}(x_{k*})

     and set x_j^{r+1} = x_j^r for all j ≠ k*.

In each step, one of the optimal solutions x_{k*}^{r+1} is picked

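The three steps above can be sketched in the simplest instance, R-BCM with u_k = g and exact block minimization, on a made-up strongly convex quadratic (not an example from the talk):

```python
import numpy as np

# Randomized BCM (R-BSUM with u_k = g): at each iteration draw one block
# with probability p_k and minimize f exactly over it (made-up data).
rng = np.random.default_rng(5)
n, blk = 6, 2
B = rng.standard_normal((n, n))
Q = B.T @ B + np.eye(n)                  # positive definite quadratic
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ Q @ x - b @ x

K = n // blk
p = np.full(K, 1.0 / K)                  # uniform probabilities, p_k > 0
x = rng.standard_normal(n)
vals = [f(x)]
for r in range(600):
    k = rng.choice(K, p=p)               # step 2: draw block k* ~ p
    idx = slice(k * blk, (k + 1) * blk)
    rhs = b[idx] - Q[idx, :] @ x + Q[idx, idx] @ x[idx]
    x[idx] = np.linalg.solve(Q[idx, idx], rhs)   # step 3: exact block min
    vals.append(f(x))

x_star = np.linalg.solve(Q, b)           # reference minimizer
```

All blocks other than k* are untouched at each iteration, matching the x_j^{r+1} = x_j^r rule; since each update is an exact block minimization, the objective never increases along the run.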
Rate Analysis for R-BSUM (Brief)

Additional assumption: h is Lipschitz continuous
Same three steps as before; convergence in expectation
Covers R-BCM as a special case
Extends to BSUM with a Gauss-Southwell-type update rule (greedy
coordinate selection)
Question: does it extend to the cyclic rule?

Rate Analysis for cyclic BSUM (no per-block SC)

Consider the following family of problems
  1. Composite smooth part: g(x) = ∑_{i=1}^I ℓ_i(A_{i,1} x_1, ..., A_{i,K} x_K)
  2. ℓ_i(y_1, ..., y_K) strongly convex w.r.t. each component y_k; the
     A_{i,k} can be rank deficient

Special cases:
  1. Group-LASSO: ‖∑_k A_k x_k − b‖² + ∑_k ‖x_k‖_2
  2. Sparse logistic regression over a compact set
  3. Optimizing wireless networks (with rank-deficient channels)

Key steps: the same three-step approach

    ∆^r − ∆^{r+1} ≥ γ̂ ∑_{i=1}^I ∑_{k=1}^K ‖A_{i,k}(x_k^r − x_k^{r+1})‖²
    (∆^{r+1})² ≤ K max_{i,k} ρ(A_{i,k}^T A_{i,k}) R² ∑_{i=1}^I ∑_{k=1}^K ‖A_{i,k}(x_k^r − x_k^{r+1})‖²
    ...

Rate Analysis for cyclic BSUM (no per-block SC) (cont.)

Extends to the following smooth part (L2-SVM):

    g(x) = ∑_{i=1}^I [ max{0, 1 − x^T a_i} ]²

Implies the same rate for cyclic BCM in these two cases
In [Beck-Tetruashvili 13], cyclic BCM is shown to have an O(1/r) rate
for 2-block, smooth, unconstrained problems
The above analysis holds for K-block, nonsmooth, constrained problems,
but needs special structure

Conclusion

We show an O(1/r) sublinear rate of convergence for a few BCD-type
methods

Main results summarized below

Table: Summary

    Method        Problem   Assumption
    Cy/GS-BSUM    NS+C+K    u_k valid upper-bound
    R/GS-BCM      NS+C+K    h Lipschitz / g block-SC
    Cy-BCM        NS+C+K    g composite / L2-SVM / block-SC

NS=Nonsmooth, C=Constrained, K=K-block, SC=Strongly Convex,
GS=Gauss-Southwell, Cy=Cyclic, R=Randomized

Future Work

Reduce the constants, especially w.r.t. K; some preliminary results
[Sun 14]
Analysis of cyclic BCM methods for general (P)?
Linearly constrained problems? Some recent results in this direction
[Hong-Luo 12][Hong et al 14][Hong et al 13][Necoara et al 13]

Thank You!
References

1. [Richtárik-Takáč 12] P. Richtárik and M. Takáč, "Iteration complexity
   of randomized block-coordinate descent methods for minimizing a
   composite function," Mathematical Programming, 2012.
2. [Nesterov 12] Y. Nesterov, "Efficiency of coordinate descent methods
   on huge-scale optimization problems," SIAM Journal on Optimization,
   2012.
3. [Shalev-Shwartz-Tewari 11] S. Shalev-Shwartz and A. Tewari,
   "Stochastic methods for L1 regularized loss minimization," Journal of
   Machine Learning Research, 2011.
4. [Beck-Tetruashvili 13] A. Beck and L. Tetruashvili, "On the
   convergence of block coordinate descent type methods," SIAM Journal
   on Optimization, 2013.
5. [Lu-Xiao 13] Z. Lu and L. Xiao, "On the complexity analysis of
   randomized block-coordinate descent methods," Preprint, 2013.
6. [Saha-Tewari 13] A. Saha and A. Tewari, "On the nonasymptotic
   convergence of cyclic coordinate descent method," SIAM Journal on
   Optimization, 2013.
7. [Luo-Tseng 92] Z.-Q. Luo and P. Tseng, "On the convergence of the
   coordinate descent method for convex differentiable minimization,"
   Journal of Optimization Theory and Applications, 1992.
8. [Hong et al 14] M. Hong, T.-H. Chang, X. Wang, M. Razaviyayn, S. Ma,
   and Z.-Q. Luo, "A Block Successive Minimization Method of
   Multipliers," Preprint, 2014.
9. [Hong-Luo 12] M. Hong and Z.-Q. Luo, "On the convergence of ADMM,"
   Preprint, 2012.
10. [Necoara-Clipici 13] I. Necoara and D. Clipici, "Distributed
    Coordinate Descent Methods for Composite Minimization," Preprint,
    2013.
11. [Kadkhodaei et al 14] M. Kadkhodaei, M. Sanjabi, and Z.-Q. Luo, "On
    the Linear Convergence of the Approximate Proximal Splitting Method
    for Non-Smooth Convex Optimization," Preprint, 2014.
12. [Razaviyayn-Hong-Luo 13] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A
    unified convergence analysis of block successive minimization
    methods for nonsmooth optimization," 2013.
13. [Mairal 13] J. Mairal, "Optimization with First-Order Surrogate
    Functions," Journal of Machine Learning Research, 2013.
14. [Sun 14] R. Sun, "Improved Iteration Complexity Analysis for Cyclic
    Block Coordinate Descent Method," Technical Note, 2014.