Stochastic Proximal Gradient Consensus Over Time-Varying Multi-Agent Networks

Mingyi Hong (joint work with Tsung-Hui Chang)
IMSE and ECE Department, Iowa State University
Presented at INFORMS 2015

Main Content
Setup: optimization over a time-varying multi-agent network.

Main Results
1. An algorithm for a large class of convex problems, with rate guarantees
2. Connections among a number of popular algorithms

Outline
1. Review of Distributed Optimization
2. The Proposed Algorithm (the proposed algorithms, distributed implementation, convergence analysis)
3. Connection to Existing Methods
4. Numerical Results
5. Concluding Remarks

Review of Distributed Optimization

Basic Setup
Consider the following convex optimization problem:
$\min_{y \in \mathbb{R}^M} \; f(y) := \sum_{i=1}^{N} f_i(y).$   (P)
Each $f_i(y)$ is a convex and possibly nonsmooth function. A collection of $N$ agents is connected by a network:
1. the network is defined by an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$,
2. with $|\mathcal{V}| = N$ vertices and $|\mathcal{E}| = E$ edges;
3. each agent can only communicate with its immediate neighbors.

Basic Setup (cont.)
Numerous applications in optimizing networked systems:
1. Cloud computing [Foster et al 08]
2. Smart grid optimization [Gan et al 13] [Liu-Zhu 14] [Kekatos 13]
3. Distributed learning [Mateos et al 10] [Boyd et al 11] [Bekkerman et al 12]
4. Communication and signal processing [Rabbat-Nowak 04] [Schizas et al 08] [Giannakis et al 15]
5. Seismic tomography [Zhao et al 15]
6. ...

The Algorithms
Many algorithms are available for problem (P):
1. the distributed subgradient (DSG) based methods,
2. the Alternating Direction Method of Multipliers (ADMM) based methods,
3. the distributed dual averaging based methods,
4. ...
The algorithm families differ in the problems they can handle and in their convergence conditions.

The DSG Algorithm
Each agent $i$ keeps a local copy of $y$, denoted $x_i$, and iteratively computes
$x_i^{r+1} = \sum_{j=1}^{N} w_{ij}^r x_j^r - \gamma^r d_i^r, \quad \forall\, i \in \mathcal{V},$
where
1. $d_i^r \in \partial f_i(x_i^r)$ is a subgradient of the local function $f_i$,
2. $w_{ij}^r \ge 0$ is the weight for link $e_{ij}$ at iteration $r$,
3. $\gamma^r > 0$ is a stepsize parameter.

The DSG Algorithm (cont.)
Compactly, the algorithm can be written in vector form as
$x^{r+1} = W x^r - \gamma^r d^r,$
where
1. $x^r$ is the stacked vector of the agents' local variables,
2. $d^r$ is the corresponding vector of subgradients,
3. $W$ is a row-stochastic weight matrix.
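To make the compact DSG recursion above concrete, here is a minimal numerical sketch; the quadratic local costs, the ring-graph weights, and the $1/\sqrt{r}$ stepsize are illustrative assumptions, not part of the talk.

```python
import numpy as np

np.random.seed(0)
N = 5                                     # number of agents (assumed for illustration)

# Row-stochastic mixing matrix for a ring graph: each agent averages
# itself and its two neighbors with weight 1/3.
W = np.zeros((N, N))
for i in range(N):
    W[i, [i, (i - 1) % N, (i + 1) % N]] = 1.0 / 3.0

# Local costs f_i(y) = 0.5 * (y - a_i)^2, whose gradient at x_i is x_i - a_i.
a = np.random.randn(N)
x = np.zeros(N)                           # local copies x_i, one per agent

for r in range(1, 201):
    d = x - a                             # d^r: stacked (sub)gradients of the f_i
    gamma = 1.0 / np.sqrt(r)              # diminishing stepsize, as DSG theory requires
    x = W @ x - gamma * d                 # x^{r+1} = W x^r - gamma^r d^r

# The x_i cluster around the minimizer of sum_i f_i (the average of the a_i);
# with a diminishing stepsize convergence is slow, consistent with the rates above.
print(x, a.mean())
```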
The DSG Algorithm (cont.)
Convergence has been analyzed in many works [Nedić-Ozdaglar 09a] [Nedić-Ozdaglar 09b]. The algorithm converges at a rate of $O(\ln(r)/\sqrt{r})$ [Chen 12], usually with a diminishing stepsize. The algorithm has been generalized to problems with
1. constraints [Nedić-Ozdaglar-Parrilo 10],
2. quantized messages [Nedić et al 08],
3. directed graphs [Nedić-Olshevsky 15],
4. stochastic gradients [Ram et al 10],
5. ...
Accelerated versions achieve rates of $O(\ln(r)/r)$ [Chen 12] [Jakovetić et al 14].

The EXTRA Algorithm
Recently, [Shi et al 14] proposed the EXTRA algorithm
$x^{r+1} = W x^r - \frac{1}{\beta} d^r + \frac{1}{\beta} d^{r-1} + x^r - \widetilde{W} x^{r-1},$
where $\widetilde{W} = \frac{1}{2}(I + W)$, $f$ is assumed to be smooth, and $W$ is symmetric.
EXTRA is an error-corrected version of DSG:
$x^{r+1} = W x^r - \frac{1}{\beta} d^r + \sum_{t=1}^{r} (W - \widetilde{W}) x^{t-1}.$
It is shown that
1. a constant stepsize $\beta$ can be used (with a computable lower bound);
2. the algorithm converges at an (improved) rate of $O(1/r)$.

The ADMM Algorithm
The general ADMM solves the following two-block optimization problem:
$\min_{x, y} \; f(x) + g(y) \quad \text{s.t.} \quad Ax + By = c, \; x \in X, \; y \in Y.$
The augmented Lagrangian is
$L(x, y; \lambda) = f(x) + g(y) + \langle \lambda, c - Ax - By \rangle + \frac{\rho}{2}\|c - Ax - By\|^2.$
The algorithm:
1. minimize $L(x, y; \lambda)$ w.r.t. $x$;
2. minimize $L(x, y; \lambda)$ w.r.t. $y$;
3. $\lambda \leftarrow \lambda + \rho(c - Ax - By)$.

The ADMM for Network Consensus
For each link $e_{ij}$ introduce two link variables $z_{ij}, z_{ji}$ and reformulate problem (P) as [Schizas et al 08]
$\min \; f(x) := \sum_{i=1}^{N} f_i(x_i)$
$\text{s.t.} \quad x_i = z_{ij}, \; x_j = z_{ij}, \; x_i = z_{ji}, \; x_j = z_{ji}, \quad \forall\, e_{ij} \in \mathcal{E}.$

The ADMM for Network Consensus (cont.)
The above problem is equivalent to
$\min \; f(x) := \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.} \quad Ax + Bz = 0,$   (1)
where $A, B$ are matrices related to the network topology.
1. It converges at an $O(1/r)$ rate [Wei-Ozdaglar 13].
2. When the objective is smooth and strongly convex, linear convergence has been shown in [Shi et al 14].
3. For a star network, convergence to a stationary solution of a nonconvex problem (at rate $O(1/\sqrt{r})$) is shown in [H.-Luo-Razaviyayn 14].

Comparison of ADMM and DSG
Table: Comparison of ADMM and DSG.

                   | DSG                   | ADMM
Problem type       | general convex        | smooth / smooth + simple nonsmooth
Stepsize           | diminishing (a)       | constant
Convergence rate   | $O(\ln(r)/\sqrt{r})$  | $O(1/r)$
Network topology   | dynamic               | static (b)
Subproblem         | simple                | difficult (c)

(a) Except [Shi et al 14], which uses a constant stepsize.
(b) Except [Wei-Ozdaglar 13], which allows a random graph.
(c) Except [Chang-H.-Wang 14] [Ling et al 15], which use gradient-type subproblems.

Connections?

The Proposed Algorithm

Setup
The proposed method is ADMM based. We consider
$\min_y \; f(y) := \sum_{i=1}^{N} f_i(y) = \sum_{i=1}^{N} \big( g_i(y) + h_i(y) \big).$   (Q)
Each $h_i$ is lower semicontinuous with an easy "prox" operator
$\operatorname{prox}_{h_i}^{\beta}(u) := \arg\min_y \; h_i(y) + \frac{\beta}{2}\|y - u\|^2.$
Each $g_i$ has a Lipschitz continuous gradient, i.e., for some $P_i > 0$,
$\|\nabla g_i(y) - \nabla g_i(v)\| \le P_i \|y - v\|, \quad \forall\, y, v \in \operatorname{dom}(h), \; \forall\, i.$
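As a concrete illustration of an "easy" prox operator: for the $\ell_1$ penalty $h_i(y) = \nu\|y\|_1$ used in the LASSO experiment later in the talk, the prox reduces to coordinate-wise soft thresholding. A minimal sketch (the numerical values are only for illustration):

```python
import numpy as np

def prox_l1(u, nu, beta):
    """prox_{h}^{beta}(u) = argmin_y  nu*||y||_1 + (beta/2)*||y - u||^2,
    i.e. soft thresholding of u at level nu/beta."""
    return np.sign(u) * np.maximum(np.abs(u) - nu / beta, 0.0)

u = np.array([1.5, -0.2, 0.05, -3.0])
print(prox_l1(u, nu=0.1, beta=1.0))   # entries smaller than nu/beta become exactly zero
```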
Graph Structure
Both static and random time-varying graphs are considered. For the random network we assume:
1. at a given iteration, $\mathcal{G}^r$ is a subgraph of a connected graph $\mathcal{G}$;
2. each link $e$ has a probability $p_e \in (0, 1]$ of being active;
3. a node $i$ is active if an active link connects to it;
4. at each iteration the graph realization is drawn independently.

Gradient Information
Each agent has access to an estimate $\tilde{g}_i(x_i, \xi_i)$ of its gradient such that
$\mathbb{E}[\tilde{g}_i(x_i, \xi_i)] = \nabla g_i(x_i),$
$\mathbb{E}\big[\|\tilde{g}_i(x_i, \xi_i) - \nabla g_i(x_i)\|^2\big] \le \sigma^2, \quad \forall\, i.$
This can be extended to the case where only a subgradient of the objective is available.

The Augmented Lagrangian
The problem we solve is still given by
$\min \; f(x) := \sum_{i=1}^{N} \big( g_i(x_i) + h_i(x_i) \big) \quad \text{s.t.} \quad Ax + Bz = 0.$
The augmented Lagrangian is
$L_\Gamma(x, z, \lambda) = \sum_{i=1}^{N} \big( g_i(x_i) + h_i(x_i) \big) + \langle \lambda, Ax + Bz \rangle + \frac{1}{2}\|Ax + Bz\|^2_\Gamma.$
A diagonal matrix $\Gamma$ is used as the penalty parameter (one $\rho_{ij}$ per edge):
$\Gamma := \operatorname{diag}\{\rho_{ij}\}_{e_{ij} \in \mathcal{E}}.$

The DySPGC Algorithm
The proposed algorithm is named DySPGC (Dynamic Stochastic Proximal Gradient Consensus). It optimizes $L_\Gamma(x, z, \lambda)$ using steps similar to ADMM, with the x-step replaced by a proximal gradient step.

The DySPGC: Static Graph + Exact Gradient

Algorithm 1. PGC Algorithm
At iteration 0, let $B^T \lambda^0 = 0$, $z^0 = \frac{1}{2} M_+^T x^0$.
At each iteration $r+1$, update the variable blocks by:
$x^{r+1} = \arg\min_x \; \langle \nabla g(x^r), x - x^r \rangle + h(x) + \frac{1}{2}\|Ax + Bz^r + \Gamma^{-1}\lambda^r\|^2_\Gamma + \frac{1}{2}\|x - x^r\|^2_\Omega$
$z^{r+1} = \arg\min_z \; \frac{1}{2}\|Ax^{r+1} + Bz + \Gamma^{-1}\lambda^r\|^2_\Gamma$
$\lambda^{r+1} = \lambda^r + \Gamma\big(Ax^{r+1} + Bz^{r+1}\big)$

The DySPGC: Static Graph + Stochastic Gradient

Algorithm 2. SPGC Algorithm
At iteration 0, let $B^T \lambda^0 = 0$, $z^0 = \frac{1}{2} M_+^T x^0$.
At each iteration $r+1$, update the variable blocks by:
$x^{r+1} = \arg\min_x \; \langle \widetilde{G}(x^r, \xi^{r+1}), x - x^r \rangle + h(x) + \frac{1}{2}\|Ax + Bz^r + \Gamma^{-1}\lambda^r\|^2_\Gamma + \frac{1}{2}\|x - x^r\|^2_{\Omega + \eta^{r+1} I_{MN}}$
$z^{r+1} = \arg\min_z \; \frac{1}{2}\|Ax^{r+1} + Bz + \Gamma^{-1}\lambda^r\|^2_\Gamma$
$\lambda^{r+1} = \lambda^r + \Gamma\big(Ax^{r+1} + Bz^{r+1}\big)$
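The only change from PGC to SPGC is that the exact gradient $\nabla g(x^r)$ is replaced by the stochastic estimate $\widetilde{G}(x^r, \xi^{r+1})$ satisfying the unbiasedness and bounded-variance assumptions stated above. A minimal sketch of one way such an estimate can arise; the least-squares local cost, its dimensions, and the mini-batch size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 200, 100
A_i = rng.standard_normal((K, M))       # local data of one agent (illustrative)
b_i = rng.standard_normal(K)

def grad_exact(x):
    # gradient of g_i(x) = 0.5 * ||A_i x - b_i||^2
    return A_i.T @ (A_i @ x - b_i)

def grad_stochastic(x, batch=10):
    # unbiased mini-batch estimate: sample rows uniformly and rescale,
    # so that E[g_tilde(x)] = grad_exact(x); its variance is bounded on bounded sets
    idx = rng.integers(0, K, size=batch)
    return (K / batch) * A_i[idx].T @ (A_i[idx] @ x - b_i[idx])

x = rng.standard_normal(M)
est = np.mean([grad_stochastic(x) for _ in range(2000)], axis=0)
print(np.linalg.norm(est - grad_exact(x)) / np.linalg.norm(grad_exact(x)))  # small
```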
The DySPGC: Dynamic Graph + Stochastic Gradient

Algorithm 3. DySPGC Algorithm
At iteration 0, let $B^T \lambda^0 = 0$, $z^0 = \frac{1}{2} M_+^T x^0$.
At each iteration $r+1$, update the variable blocks by:
$x^{r+1} = \arg\min_x \; \langle \widetilde{G}^{r+1}(x^r, \xi^{r+1}), x - x^r \rangle + h^{r+1}(x) + \frac{1}{2}\|A^{r+1} x + B^{r+1} z^r + \Gamma^{-1}\lambda^r\|^2_\Gamma + \frac{1}{2}\|x - x^r\|^2_{\Omega^{r+1} + \eta^{r+1} I_{MN}}$
$x_i^{r+1} = x_i^r, \quad \text{if } i \notin \mathcal{V}^{r+1}$
$z^{r+1} = \arg\min_z \; \frac{1}{2}\|A^{r+1} x^{r+1} + B^{r+1} z + \Gamma^{-1}\lambda^r\|^2_\Gamma$
$z_{ij}^{r+1} = z_{ij}^r, \quad \text{if } e_{ij} \notin \mathcal{A}^{r+1}$ (the set of active edges at iteration $r+1$)
$\lambda^{r+1} = \lambda^r + \Gamma\big(A^{r+1} x^{r+1} + B^{r+1} z^{r+1}\big)$

Distributed Implementation
The algorithms admit a distributed implementation. In particular, PGC admits a single-variable characterization.

Implementation of PGC
Define a stepsize parameter
$\beta_i := \sum_{j \in N_i} (\rho_{ij} + \rho_{ji}) + \omega_i, \quad \forall\, i,$
where $\{\omega_i\}$ are the proximal parameters and $\{\rho_{ij}\}$ are the penalty parameters for the constraints.
Define a stepsize matrix $\Upsilon := \operatorname{diag}([\beta_1, \cdots, \beta_N]) \succ 0$.
Define a weight matrix $W \in \mathbb{R}^{N \times N}$ (a row-stochastic matrix) by
$W[i,j] = \dfrac{\rho_{ji} + \rho_{ij}}{\sum_{\ell \in N_i}(\rho_{\ell i} + \rho_{i\ell}) + \omega_i} = \dfrac{\rho_{ji} + \rho_{ij}}{\beta_i}$, if $e_{ij} \in \mathcal{E}$;
$W[i,i] = \dfrac{\omega_i}{\sum_{\ell \in N_i}(\rho_{\ell i} + \rho_{i\ell}) + \omega_i} = \dfrac{\omega_i}{\beta_i}$, $\forall\, i \in \mathcal{V}$;
$W[i,j] = 0$, otherwise.

Implementation of PGC (cont.)
Let $\zeta^r \in \partial h(x^r)$ be some subgradient vector of the nonsmooth function; then the PGC algorithm admits the following single-variable characterization:
$x^{r+1} - x^r + \Upsilon^{-1}(\zeta^{r+1} - \zeta^r) = \Upsilon^{-1}\big( -\nabla g(x^r) + \nabla g(x^{r-1}) \big) + W x^r - \frac{1}{2}(I_N + W) x^{r-1}.$
In particular, for smooth problems,
$x^{r+1} = W x^r - \Upsilon^{-1} \nabla g(x^r) + \Upsilon^{-1} \nabla g(x^{r-1}) + x^r - \frac{1}{2}(I_N + W) x^{r-1}.$
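To illustrate the smooth-case recursion just stated, here is a minimal sketch on a small quadratic consensus problem with $g_i(y) = \frac{1}{2}\|y - a_i\|^2$ (so $P_i = 1$). The ring graph, the values of $\rho_{ij}$ and $\omega_i$, and the EXTRA-style first step are assumptions made for illustration; $\omega_i = 1.5 > P_i$ is chosen so the convergence conditions discussed next hold.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]    # assumed ring graph
rho, omega = 0.5, 1.5                        # rho_ij = rho for all edges, omega_i = omega

a = rng.standard_normal((N, M))              # g_i(y) = 0.5*||y - a_i||^2
grad = lambda X: X - a                       # row i is grad g_i(x_i)

# beta_i = sum_{j in N_i}(rho_ij + rho_ji) + omega_i, and W as defined above
deg = np.zeros(N)
for i, j in edges:
    deg[i] += 1; deg[j] += 1
beta = 2 * rho * deg + omega
W = np.zeros((N, N))
for i, j in edges:
    W[i, j] = 2 * rho / beta[i]
    W[j, i] = 2 * rho / beta[j]
W[np.diag_indices(N)] = omega / beta         # row-stochastic by construction

X_prev = np.zeros((N, M))
X = W @ X_prev - grad(X_prev) / beta[:, None]            # first step, EXTRA-style (assumption)
for r in range(300):
    X_next = (W @ X - (grad(X) - grad(X_prev)) / beta[:, None]
              + X - 0.5 * (X_prev + W @ X_prev))
    X_prev, X = X, X_next

print(np.abs(X - a.mean(axis=0)).max())      # every row approaches the global average
```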
Convergence Analysis
We analyze the rates of convergence of the proposed methods. Define a matrix of Lipschitz constants $\widetilde{P} = \operatorname{diag}([P_1, \cdots, P_N])$. The convergence rate is measured by [Gao et al 14, Ouyang et al 14]
$|f(\bar{x}^r) - f(x^*)|$ (objective gap) and $\|A\bar{x}^r + B\bar{z}^r\|$ (consensus gap).

Table: Main convergence results.

Algorithm | Network type | Gradient type | Convergence condition | Convergence rate
PGC       | static       | exact         | $\Upsilon W + \Upsilon \succeq 2\widetilde{P}$ | $O(1/r)$
SPGC      | static       | stochastic    | $\Upsilon W + \Upsilon \succeq 2\widetilde{P}$ | $O(1/\sqrt{r})$
DySPGC    | random       | exact         | $\Omega \succeq \widetilde{P}$                 | $O(1/r)$
DySPGC    | random       | stochastic    | $\Omega \succeq \widetilde{P}$                 | $O(1/\sqrt{r})$

Note: for the exact gradient case, the stepsize requirement can be halved if only convergence (and not a rate) is needed.

Connection to Existing Methods

Comparison with Different Algorithms

Algorithm          | Connection with DySPGC | Special setting
EXTRA [Shi 14]     | special case           | static, $h \equiv 0$, $W = W^T$, $\widetilde{G} = \nabla g$
DSG [Nedić 09]     | different x-step       | static, $g$ smooth, $\widetilde{G} = \nabla g$
IC-ADMM [Chang 14] | special case           | static, $\widetilde{G} = \nabla g$, $g$ composite
DLM [Ling 15]      | special case           | static, $\widetilde{G} = \nabla g$, $h \equiv 0$, $\beta_{ij} = \beta$, $\rho_{ij} = \rho$
PG-EXTRA [Shi 15]  | special case           | static, $W = W^T$, $\widetilde{G} = \nabla g$

Figure: Relationship among different algorithms.

The EXTRA-Related Algorithms
The EXTRA-related algorithms (for either the smooth or the nonsmooth case) [Shi et al 14, 15] are special cases of DySPGC with
1. a symmetric weight matrix $W = W^T$,
2. exact gradients,
3. a scalar stepsize,
4. a static graph.

The DSG Method
Replace our x-update by (setting the dual variable $\lambda^r = 0$)
$x^{r+1} = \arg\min_x \; \langle \nabla g(x^r), x - x^r \rangle + \langle 0, Ax + Bz^r \rangle + \frac{1}{2}\|Ax + Bz^r\|^2_\Gamma + \frac{1}{2}\|x - x^r\|^2_\Omega.$
Let $\beta_i = \beta_j = \beta$; then the PGC algorithm becomes
$x^{r+1} = -\frac{1}{\beta}\nabla g(x^r) + \widetilde{W} x^r,$
with $\widetilde{W} = \frac{1}{2}(I + W)$. This is precisely the DSG iteration; its convergence is not covered by our results.

Numerical Results
Some preliminary numerical results, obtained by solving the LASSO problem
$\min_x \; \frac{1}{2}\sum_{i=1}^{N} \|A_i x - b_i\|^2 + \nu\|x\|_1,$
where $A_i \in \mathbb{R}^{K \times M}$ and $b_i \in \mathbb{R}^K$.
1. The parameters: $N = 16$, $M = 100$, $\nu = 0.1$, $K = 200$.
2. The data matrices are randomly generated.
3. Static graphs are generated according to the method proposed in [Yildiz-Scaglione 08], with a radius parameter of 0.4.

Comparison between PG-EXTRA and PGC
1. The stepsize of PG-EXTRA is chosen according to the conditions given in [Shi 14].
2. $W$ is the Metropolis constant-edge-weight matrix.
3. PGC: $\omega_i = P_i/2$, $\rho_{ij} = 10^{-3}$.
Figure: Comparison between PG-EXTRA and PGC.

Comparison between DSG and Stochastic PGC
1. The stepsize of DSG is chosen as a small constant; $\sigma^2 = 0.1$.
2. $W$ is the Metropolis constant-edge-weight matrix.
3. SPGC: $\omega_i = P_i$, $\rho_{ij} = 10^{-3}$.
Figure: Comparison between DSG and SPGC.
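A minimal sketch of how the synthetic consensus LASSO instance above could be set up, with each agent holding the smooth part $g_i(x) = \frac{1}{2}\|A_i x - b_i\|^2$ and the shared nonsmooth part $h(x) = \nu\|x\|_1$; the standard-normal data generation is an assumption, since the slides only state the dimensions and $\nu$.

```python
import numpy as np

rng = np.random.default_rng(2015)
N, M, K, nu = 16, 100, 200, 0.1          # dimensions and nu as stated in the slides

A = [rng.standard_normal((K, M)) for _ in range(N)]   # assumed data generation
b = [rng.standard_normal(K) for _ in range(N)]

def g_grad(i, x):
    # gradient of the local smooth part g_i(x) = 0.5*||A_i x - b_i||^2
    return A[i].T @ (A[i] @ x - b[i])

def lip_const(i):
    # P_i: Lipschitz constant of grad g_i, i.e. the largest eigenvalue of A_i^T A_i
    return np.linalg.norm(A[i], 2) ** 2

def objective(x):
    return 0.5 * sum(np.linalg.norm(A[i] @ x - b[i]) ** 2 for i in range(N)) + nu * np.abs(x).sum()

P = [lip_const(i) for i in range(N)]
omega = [Pi / 2 for Pi in P]             # the proximal parameters used for PGC in the experiment
print(min(P), max(P), objective(np.zeros(M)))
```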
Concluding Remarks

Summary
We developed the DySPGC algorithm for multi-agent optimization. It can deal with
1. stochastic gradients,
2. time-varying networks,
3. nonsmooth composite objectives,
and it comes with convergence rate guarantees for various scenarios.

Future Work / Generalization
We identified the relation between DSG-type and ADMM-type methods, which allows for significant generalizations:
1. acceleration [Ouyang et al 15];
2. variance reduction for the local problem when $f_i$ is a finite sum, $f_i(x_i) = \sum_{j=1}^{M} \ell_j(x_i)$;
3. inexact x-subproblems (using, e.g., conditional gradient);
4. nonconvex problems [H.-Luo-Razaviyayn 14];
5. ...

Thank You!

Parameter Selection
It is easy to pick the various parameters in the different scenarios.
Case A: the weight matrix $W$ is given and symmetric.
1. We must have $\beta_i = \beta_j = \beta$;
2. for any fixed $\beta$, we can compute $(\Omega, \{\rho_{ij}\})$;
3. increase $\beta$ to satisfy the convergence condition.
Case B: the user has the freedom to pick $(\{\rho_{ij}\}, \Omega)$.
1. For any set $(\{\rho_{ij}\}, \Omega)$, we can compute $W$ and $\{\beta_i\}$ (see the sketch at the end of these slides);
2. increase $\Omega$ to satisfy the convergence condition.
In either case, the convergence condition can be verified by the local agents.

Case 1: Exact Gradient with Static Graph

Convergence of the PGC Algorithm
Suppose that problem (Q) has a nonempty set of optimal solutions $X^* \neq \emptyset$, and that $\mathcal{G}^r = \mathcal{G}$ for all $r$ with $\mathcal{G}$ connected. Then PGC converges to a primal-dual optimal solution if
$2\Omega + M_+ \Xi M_+^T = \Upsilon W + \Upsilon \succeq \widetilde{P},$
where $M_+ \Xi M_+^T$ is a matrix related to the network topology.
A sufficient condition is $\Omega \succeq \widetilde{P}$, or $\omega_i > P_i$ for all $i \in \mathcal{V}$, which can be determined locally.

Case 2: Stochastic Gradient with Static Graph

Convergence of the SPGC Algorithm
Assume that $\operatorname{dom}(h)$ is a bounded set, and suppose that
$\eta^{r+1} = \sqrt{r+1}, \; \forall\, r,$ and that the stepsize matrix satisfies
$2\Omega + M_+ \Xi M_+^T = \Upsilon W + \Upsilon \succeq 2\widetilde{P}.$   (8)
Then at a given iteration $r$,
$\mathbb{E}[f(\bar{x}^r) - f(x^*)] + \rho\|A\bar{x}^r + B\bar{z}^r\| \le \frac{\sigma^2}{\sqrt{r}} + \frac{d_x^2}{2\sqrt{r}} + \frac{1}{2r}\Big( d_z^2 + d_\lambda^2(\rho) + \max_i \omega_i\, d_x^2 \Big),$
where $d_\lambda(\rho) > 0$, $d_x > 0$, $d_z > 0$ are problem-dependent constants.

Case 2: Stochastic Gradient with Static Graph (cont.)
1. Both the objective value and the constraint violation converge at rate $O(1/\sqrt{r})$.
2. The result is easy to extend to the exact gradient case, with rate $O(1/r)$.
3. A larger proximal parameter $\Omega$ is required than in Case 1.

Case 3: Exact Gradient with Time-Varying Graph

Convergence of the DySPGC Algorithm
Suppose that problem (Q) has a nonempty set of optimal solutions $X^* \neq \emptyset$, that $\widetilde{G}(x^r, \xi^{r+1}) = \nabla g(x^r)$ for all $r$, and that the graph is randomly generated. If we choose the stepsize such that
$\Omega \succeq \frac{1}{2}\widetilde{P},$
then $(x^r, z^r, \lambda^r)$ converges w.p.1 to a primal-dual solution.
1. The stepsize condition is more restrictive than in Case 1 (it does not depend on the graph).
2. Convergence is in the sense of with probability 1.

Case 4: Stochastic Gradient with Time-Varying Graph

Convergence of the DySPGC Algorithm
Suppose $\{w^t\} = \{x^t, z^t, \lambda^t\}$ is a sequence generated by DySPGC, and that
$\eta^{r+1} = \sqrt{r+1}, \; \forall\, r,$ and $\Omega \succeq \widetilde{P}$.
Then
$\mathbb{E}\big[ f(\bar{x}^r) - f(x^*) + \rho\|A\bar{x}^r + B\bar{z}^r\| \big] \le \frac{\sigma^2}{\sqrt{r}} + \frac{d_x^2}{2\sqrt{r}} + \frac{1}{2r}\Big( 2 d_J^2 + d_z^2 + d_\lambda^2(\rho) + \max_i \omega_i\, d_x^2 \Big),$
where $d_\lambda(\rho)$, $d_J$, $d_x$, $d_z$ are positive constants.
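As referenced under Case B of the parameter-selection discussion above, here is a minimal sketch of how the stepsizes $\{\beta_i\}$ and the weight matrix $W$ follow from a chosen $(\{\rho_{ij}\}, \Omega)$, and how each agent can check the sufficient condition $\omega_i > P_i$ locally. The small graph, the $\rho_{ij}$ values, and the Lipschitz constants are illustrative assumptions.

```python
import numpy as np

# Assumed small network and parameters, for illustration only.
N = 4
edges = {(0, 1): 0.2, (1, 2): 0.2, (2, 3): 0.5, (0, 3): 0.5}   # rho_ij = rho_ji per edge
P = np.array([1.0, 2.0, 0.5, 1.5])        # Lipschitz constants P_i of grad g_i
omega = 1.1 * P                           # Case B choice: omega_i > P_i (sufficient condition)

# beta_i = sum_{j in N_i} (rho_ij + rho_ji) + omega_i
beta = omega.copy()
for (i, j), rho in edges.items():
    beta[i] += 2 * rho
    beta[j] += 2 * rho

# Row-stochastic weight matrix W implied by ({rho_ij}, Omega)
W = np.diag(omega / beta)
for (i, j), rho in edges.items():
    W[i, j] = 2 * rho / beta[i]
    W[j, i] = 2 * rho / beta[j]

print(np.allclose(W.sum(axis=1), 1.0))    # W is row-stochastic by construction
print(np.all(omega > P))                  # each agent can verify its own condition locally
```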