ITERATIVE METHODS AND REGULARIZATION IN THE DESIGN OF FAST ALGORITHMS
A unified framework for optimization and online learning beyond Multiplicative Weight Updates
Lorenzo Orecchia, MIT Math

Talk Outline: A Tale of Two Halves
PART 1: REGULARIZATION AND ITERATIVE TECHNIQUES FOR ONLINE LEARNING
• Online Linear Optimization
• Online Linear Optimization over Simplex and Multiplicative Weight Updates (MWUs)
• A Regularization Framework to generalize MWUs: Follow the Regularized Leader
MESSAGE: REGULARIZATION IS A POWERFUL ALGORITHMIC TECHNIQUE
Optimization: Regularized Updates
Online Learning: Multiplicative Weight Updates (MWUs)
PART 2: NON-SMOOTH OPTIMIZATION AND FAST ALGORITHMS FOR MAXFLOW
• Non-smooth vs Smooth Convex Optimization
• Non-smooth Convex Optimization reduces to Online Linear Optimization
• Application: Understanding Undirected Maxflow algorithms based on MWUs
MESSAGE: FASTEST ALGORITHMS REQUIRE PRIMAL-DUAL APPROACH

Applications of MWUs
Fast Algorithms for solving specific LPs and SDPs:
• Maximum Flow problems [PST], [GK], [F], [CKMST]
• Covering-packing problems [PST]
• Oblivious routing [R], [M]
Fast Approximation Algorithms based on LP and SDP relaxations:
• Maxcut [AK]
• Graph Partitioning Problems [AK], [S], [OSV]
Proof Technique:
• Hardcore Lemma [BHK]
• QIP = PSPACE [W]
• Derandomization [Y]
… and more

Machine Learning meets Optimization meets TCS
These techniques have been rediscovered multiple times in different fields: Machine Learning, Convex Optimization, TCS.
Three surveys emphasizing the different viewpoints and literatures:
1) ML: Prediction, Learning, and Games by Cesa-Bianchi and Lugosi
2) Optimization: Lectures on Modern Convex Optimization by Ben-Tal and Nemirovski
3) TCS: The Multiplicative Weights Update Method: a Meta-Algorithm and Applications by Arora, Hazan and Kale

REGULARIZATION 101

What is Regularization?
Regularization is a fundamental technique in optimization.
OPTIMIZATION PROBLEM → WELL-BEHAVED OPTIMIZATION PROBLEM (Regularizer $F$, Parameter $\lambda > 0$)
A well-behaved optimization problem has:
• Stable optimum
• Unique optimal solution
• Smoothness conditions
…
Benefits of Regularization in Learning and Statistics:
• Prevents overfitting
• Increases stability
• Decreases sensitivity to random noise

Example: Regularization Helps Stability
Consider a convex set $S \subset \mathbb{R}^n$ and a linear optimization problem:
$f(c) = \arg\min_{x \in S} c^T x$
The optimal solution $f(c)$ may be very unstable under perturbation of $c$:
$\|c' - c\| \le \delta$ and $\|f(c') - f(c)\| \gg \delta$
[Figure: the set $S$ with cost vectors $c$, $c'$ and the corresponding optima $f(c)$, $f(c')$]

Example: Regularization Helps Stability
Consider a convex set $S \subset \mathbb{R}^n$ and a regularized linear optimization problem
$f(c) = \arg\min_{x \in S} c^T x + F(x)$
where $F$ is $\sigma$-strongly convex.
Then: $\|c' - c\| \le \delta$ implies $\|f(c') - f(c)\| \le \delta / \sigma$
[Figure: the objectives $c^T x + F(x)$ and $c'^T x + F(x)$ with minimizers $f(c)$ and $f(c')$; annotation: $\|\text{slope}\| \le \delta$]

ONLINE LINEAR OPTIMIZATION AND MULTIPLICATIVE WEIGHT UPDATES

Online Linear Minimization
SETUP: Convex set $X \subseteq \mathbb{R}^n$, generic norm, repeated game over $T$ rounds. At round $t$:
ALGORITHM: current solution $x^{(t)} \in X$
ADVERSARY: current linear objective (loss vector) $\ell^{(t)} \in \mathbb{R}^n$ with $\|\ell^{(t)}\|_* \le \rho$
Algorithm’s loss: $(\ell^{(t)})^T x^{(t)}$
ALGORITHM: updated solution $x^{(t+1)} \in X$
ADVERSARY: new loss vector $\ell^{(t+1)} \in \mathbb{R}^n$ with $\|\ell^{(t+1)}\|_* \le \rho$
GOAL: update $x^{(t)}$ to minimize regret
$\frac{1}{T} \sum_{t=1}^{T} (\ell^{(t)})^T x^{(t)} - \min_{x \in X} \frac{1}{T} \sum_{t=1}^{T} (\ell^{(t)})^T x$
The first term is the average algorithm’s loss $\hat{L}$; the second term is the a posteriori optimum $L^*$.

Simplex Case: Learning with Experts
SETUP: Simplex $X \subseteq \mathbb{R}^n$ under the $\ell_1$ norm.
At round $t$:
ALGORITHM: $p^{(t)}$, a distribution over dimensions, i.e. experts
ADVERSARY: experts’ losses $\ell^{(t)}$ with $\|\ell^{(t)}\|_\infty \le \rho$
Algorithm’s loss: $\mathbb{E}_{i \leftarrow p^{(t)}} [\ell_i^{(t)}] = (p^{(t)})^T \ell^{(t)}$
ALGORITHM: update distribution to $p^{(t+1)}$

Simplex Case: Multiplicative Weight Updates
ALGORITHM: $p^{(t)}$; ADVERSARY: $\ell^{(t)}$
Weights: $w_i^{(t+1)} = (1 - \epsilon)^{\ell_i^{(t)}} w_i^{(t)}$, with $w^{(1)} = \vec{1}$ (MULTIPLICATIVE WEIGHT UPDATE)
Distribution: $p_i^{(t+1)} = \frac{w_i^{(t+1)}}{\sum_{j=1}^{n} w_j^{(t+1)}}$
Parameter $\epsilon \in (0, 1)$, on a scale from 0 (CONSERVATIVE) to 1 (AGGRESSIVE)

MWUs: Unraveling the Update
Update: $p_i^{(t+1)} \propto w_i^{(t+1)} = (1 - \epsilon)^{\ell_i^{(t)}} \cdot w_i^{(t)}$
WEIGHT: $w_i^{(t+1)} = (1 - \epsilon)^{\sum_t \ell_i^{(t)}}$
CUMULATIVE LOSS: $\sum_t \ell_i^{(t)}$

MWUs: Regret Bound
Update: $p_i^{(t+1)} \propto w_i^{(t+1)} = (1 - \epsilon)^{\ell_i^{(t)}} \cdot w_i^{(t)}$
For $\epsilon < \frac{1}{2}$ and $\|\ell^{(t)}\|_\infty \le \rho$:
$\hat{L} - L^* \le \frac{\rho \log n}{\epsilon T} + \rho \epsilon$
Algorithm’s regret: $\hat{L} - L^*$
Start-up penalty: $\frac{\rho \log n}{\epsilon T}$
Penalty for being greedy: $\rho \epsilon$

ONLINE LINEAR OPTIMIZATION BEYOND MWUs: A REGULARIZATION FRAMEWORK

MWUs: Proof Sketch of Regret Bound
Update: $p_i^{(t+1)} \propto w_i^{(t+1)} = (1 - \epsilon)^{\sum_{s=1}^{t} \ell_i^{(s)}}$
• Proof is a potential function argument:
$\Phi^{(t+1)} = \log_{1-\epsilon} \sum_{i=1}^{n} w_i^{(t+1)}$
• Potential function bounds loss of best expert (since $\log_{1-\epsilon}$ is decreasing):
$\Phi^{(t+1)} \le \min_{i=1}^{n} \log_{1-\epsilon} w_i^{(t+1)} = \min_{i=1}^{n} \left( \sum_{s=1}^{t} \ell_i^{(s)} \right)$
• Potential function is related to algorithm’s performance:
$\Phi^{(t+1)} - \Phi^{(t)} \ge (\ell^{(t)})^T p^{(t)} - \epsilon$
DOES THIS PROOF TECHNIQUE GENERALIZE BEYOND THE SIMPLEX CASE?

Designing a Regularized Update
GOAL: Design an update and its potential function analysis
QUESTION: Choice of potential function?
DESIDERATA:
1) lower bounds best expert’s loss
2) tracks algorithm’s performance
Attempt 1 – FOLLOW THE LEADER:
Cumulative loss: $L^{(t)} = \sum_{s=1}^{t} \ell^{(s)}$
$x^{(t+1)} = \arg\min_{x \in X} x^T L^{(t)}$ (pick best current solution)
$\Phi^{(t+1)} = \min_{x \in X} x^T L^{(t)}$ (potential is current best loss)
Fails if the best expert changes drastically.
How to make the update more stable?

Regularized Update: Definition
QUESTION: Choice of potential function?
DESIDERATA:
1) lower bounds best expert’s loss
2) tracks algorithm’s performance
Attempt 2 – FOLLOW THE REGULARIZED LEADER:
$x^{(t+1)} = \arg\min_{x \in X} x^T L^{(t)} + \eta \cdot F(x)$
$\Phi^{(t+1)} = \min_{x \in X} x^T L^{(t)} + \eta \cdot F(x)$
Properties of Regularizer $F(x)$:
1. Convex, differentiable
2. $\sigma$-strongly convex w.r.t. norm
Parameter $\eta \ge 0$, TBD
These properties are actually sufficient to get a regret bound.

Regularized Update: Analysis
1) lower bounds best expert’s loss:
$\Phi^{(t+1)} \le \min_{x \in X} (L^{(t)})^T x + \eta \cdot \max_{x \in X} F(x)$
where $\eta \cdot \max_{x \in X} F(x)$ is the regularization error.
2) tracks algorithm’s performance: ?

Tracking the Algorithm: Proof by Picture
[Figure: the curves $f^{(t)}(x)$ and $f^{(t+1)}(x)$, with minima $\Phi^{(t)}$ at $x^{(t)}$ and $\Phi^{(t+1)}$ at $x^{(t+1)}$; the gap between the curves at $x^{(t)}$ is $(\ell^{(t)})^T x^{(t)}$]
Define: $f^{(t+1)}(x) = x^T L^{(t)} + \eta \cdot F(x)$
Notice: $f^{(t+1)}(x) - f^{(t)}(x) = (\ell^{(t)})^T x$, where $\ell^{(t)}$ is the latest loss vector
Compare: $(\ell^{(t)})^T x^{(t)}$ and $\Phi^{(t+1)} - \Phi^{(t)}$
$\Phi^{(t+1)} - \Phi^{(t)} = f^{(t+1)}(x^{(t+1)}) - f^{(t+1)}(x^{(t)}) + (\ell^{(t)})^T x^{(t)}$
Want: $f^{(t+1)}(x^{(t)}) \approx f^{(t+1)}(x^{(t+1)})$

Regularization in Action
[Figure: same picture as above, with a quadratic lower bound to $f^{(t+1)}$]
REGULARIZATION:
$f^{(t+1)}(x) = (L^{(t)})^T x + \eta \cdot F(x)$
$f^{(t)}$ is $(\eta \cdot \sigma)$-strongly convex
$\|f^{(t+1)} - f^{(t)}\|_* = \|\ell^{(t)}\|_*$
STABILITY:
$\|x^{(t+1)} - x^{(t)}\| \le \frac{\|\ell^{(t)}\|_*}{\eta \cdot \sigma}$

Analysis: Progress in One Iteration
$\Phi^{(t+1)} - \Phi^{(t)} = f^{(t+1)}(x^{(t+1)}) - f^{(t+1)}(x^{(t)}) + (\ell^{(t)})^T x^{(t)}$
$\nabla f^{(t+1)}(x^{(t)}) = \ell^{(t)}$
$\|x^{(t+1)} - x^{(t)}\| \le \frac{\|\ell^{(t)}\|_*}{\eta \cdot \sigma}$
$f^{(t+1)}$ is $(\eta \cdot \sigma)$-strongly convex, so
$f^{(t+1)}(x^{(t+1)}) - f^{(t+1)}(x^{(t)}) \ge (\ell^{(t)})^T (x^{(t+1)} - x^{(t)}) + \frac{\eta \cdot \sigma}{2} \|x^{(t+1)} - x^{(t)}\|^2$
$\ge -\|\ell^{(t)}\|_* \|x^{(t+1)} - x^{(t)}\| + \frac{\eta \cdot \sigma}{2} \|x^{(t+1)} - x^{(t)}\|^2 \ge -\frac{\|\ell^{(t)}\|_*^2}{2 \eta \cdot \sigma}$
The last step minimizes the quadratic in $\|x^{(t+1)} - x^{(t)}\|$.

Completing the Analysis
Progress in one iteration:
$\Phi^{(t+1)} - \Phi^{(t)} \ge (\ell^{(t)})^T x^{(t)} - \frac{\|\ell^{(t)}\|_*^2}{2 \sigma \eta}$
(annotated on the slide: regret at iteration $t$)
Telescopic sum:
$\Phi^{(T+1)} \ge \sum_{t=1}^{T} (\ell^{(t)})^T x^{(t)} + \Phi^{(1)} - T \cdot \frac{\max_t \|\ell^{(t)}\|_*^2}{2 \eta \cdot \sigma}$
Final regret bound, with regularizer $F$ and $\|\ell^{(t)}\|_* \le \rho$:
$\frac{1}{T} \left( \sum_{t=1}^{T} (\ell^{(t)})^T x^{(t)} - \min_{x \in X} \sum_{t=1}^{T} (\ell^{(t)})^T x \right) \le \frac{\eta}{T} \cdot \left( \max_{x \in X} F(x) - \min_{x \in X} F(x) \right) + \frac{\rho^2}{2 \sigma \eta}$
Start-up penalty: $\frac{\eta}{T} \cdot (\max_{x \in X} F(x) - \min_{x \in X} F(x))$
Penalty for being greedy: $\frac{\rho^2}{2 \sigma \eta}$
SAME TYPE OF BOUND AS FOR MWUs

Reinterpreting MWUs
Potential function (SOFT-MAX):
$\Phi^{(t+1)} = \min_{p \ge 0,\ \sum_i p_i = 1} p^T L^{(t)} + \eta \cdot \sum_{i=1}^{n} p_i \log p_i$
Regularizer: $F(p) = \sum_{i=1}^{n} p_i \log p_i$ is negative entropy
$F(p)$ is 1-strongly convex w.r.t.
$\|\cdot\|_1$
Update:
$p^{(t+1)} = \arg\min_{p \ge 0,\ \sum_i p_i = 1} p^T L^{(t)} + \eta \cdot \sum_{i=1}^{n} p_i \log p_i$
$p_i^{(t+1)} = \frac{e^{-\frac{1}{\eta} L_i^{(t)}}}{\sum_{j=1}^{n} e^{-\frac{1}{\eta} L_j^{(t)}}} = \frac{(1 - \epsilon)^{L_i^{(t)}}}{\sum_{j=1}^{n} (1 - \epsilon)^{L_j^{(t)}}}$, where $1 - \epsilon = e^{-1/\eta}$

Beyond MWUs: which regularizer?
Regret bound, optimizing over $\eta$:
$\frac{1}{T} \left( \sum_{t=1}^{T} (\ell^{(t)})^T x^{(t)} - \min_{x \in X} \sum_{t=1}^{T} (\ell^{(t)})^T x \right) \le \frac{\rho \sqrt{2 \cdot (\max_{x \in X} F(x) - \min_{x \in X} F(x))}}{\sqrt{\sigma T}}$
Best choice of regularizer and norm minimizes
$\frac{\max_t \|\ell^{(t)}\|_*^2 \cdot (\max_{x \in X} F(x) - \min_{x \in X} F(x))}{\sigma}$
Negative entropy with the $\ell_1$-norm is approximately optimal for the simplex.
QUESTION: are other regularizers ever useful?

Different Regularizers in Algorithm Design
QUESTION 1: Are other regularizers, besides entropy, ever useful? YES!
Applications: Graph Partitioning and Random Walks
Spectral algorithms for balanced separator running in time $\tilde{O}(m)$
Uses random-walk framework and SDP MWUs
Different walks correspond to different regularizers for the eigenvector problem:
• SDP MWU: $F(X) = \mathrm{Tr}(X \log X)$, Heat Kernel Random Walk
• $p$-norm, $1 \le p \le \infty$: $F(X) = \mathrm{Tr}(X^p)$, Lazy Random Walk
• NEW REGULARIZER: $F(X) = \mathrm{Tr}(X^{1/2})$, Personalized PageRank
[Mahoney, Orecchia, Vishnoi 2011], [Orecchia, Sachdeva, Vishnoi 2012]

Different Regularizers in Algorithm Design
QUESTION 1: Are other regularizers, besides entropy, ever useful? YES!
Applications:
• Graph Partitioning and Random Walks
• Sparsification
  $\epsilon$-spectral-sparsifiers with $O\left(\frac{n \log n}{\epsilon^2}\right)$ edges. Uses a matrix concentration bound equivalent to SDP MWUs [Spielman, Srivastava 2008]
  $\epsilon$-spectral-sparsifiers with $O\left(\frac{n}{\epsilon^2}\right)$ edges. Can be interpreted as a different regularizer: $F(X) = \mathrm{Tr}(X^{1/2})$ [Batson, Spielman, Srivastava 2009]
• Many more in Online Learning: Bandit Online Learning [AHR], …

NON-SMOOTH CONVEX OPTIMIZATION REDUCES TO ONLINE LINEAR OPTIMIZATION

Convex Optimization Setup
$\min_{x \in X} f(x)$
$f$ convex, differentiable
$X \subseteq \mathbb{R}^n$ closed, convex set
NON-SMOOTH: $\forall x \in X,\ \|\nabla f(x)\|_* \le \rho$ ($\rho$-Lipschitz continuous)
• NO GRADIENT STEP GUARANTEE
• ONLY DUAL GUARANTEE
SMOOTH: $\forall x, y \in X,\ \|\nabla f(y) - \nabla f(x)\|_* \le L \|y - x\|$ ($L$-Lipschitz continuous gradient)
• Gradient step is guaranteed to decrease function value:
$f(x^{(t+1)}) \le f(x^{(t)}) - \frac{\|\nabla f(x^{(t)})\|_*^2}{2L}$
[Figure: a gradient step from $x^{(t)}$ to $x^{(t+1)}$]

Non-Smooth Setup: Dual Approach
$f$ convex, differentiable
$\min f(x)$
over $x \in X$, where $X \subseteq \mathbb{R}^n$ is a closed, convex set
$\forall x \in X,\ \|\nabla f(x)\|_* \le \rho$ ($\rho$-Lipschitz continuous)
APPROACH: Each iterate solution provides a lower bound and an upper bound
LOWER: $f(x^*) \ge f(x^{(t)}) + \nabla f(x^{(t)})^T (x^* - x^{(t)})$
UPPER: $f(x^{(t)}) \ge f(x^*)$
[Figure: iterates $x^{(t)}$, $x^{(t+1)}$, $x^{(t+2)}$ with their upper and lower bounds]
CAN WEAKEN DIFFERENTIABILITY ASSUMPTION: SUBGRADIENTS SUFFICE
HOW TO UPDATE ITERATES? HOW TO CHOOSE WEIGHTS?
Take a convex combination of both upper bounds and lower bounds with weights $\gamma_t$:
UPPER: $\frac{1}{\sum_{t=1}^{T} \gamma_t} \left( \sum_{t=1}^{T} \gamma_t f(x^{(t)}) \right) \ge f(x^*)$
LOWER: $f(x^*) \ge \frac{1}{\sum_{t=1}^{T} \gamma_t} \left[ \sum_{t=1}^{T} \gamma_t \left( f(x^{(t)}) + \nabla f(x^{(t)})^T (x^* - x^{(t)}) \right) \right]$

Reduction to Online Linear Minimization
Fix weights $\gamma_t$ to be uniform for simplicity.
DUALITY GAP:
$\frac{1}{T} \left[ \sum_{t=1}^{T} f(x^{(t)}) \right] - f(x^*) \le \frac{1}{T} \cdot \sum_{t=1}^{T} -\nabla f(x^{(t)})^T (x^* - x^{(t)})$
The right-hand side is a LINEAR FUNCTION of the iterates.
ONLINE SETUP:
ALGORITHM: $x^{(t)} \in X$
ADVERSARY: $\ell^{(t)} = \nabla f(x^{(t)})$ (loss vector is gradient)
Recall that by assumption: $\|\ell^{(t)}\|_* = \|\nabla f(x^{(t)})\|_* \le \rho$
$\frac{1}{T} \cdot \sum_{t=1}^{T} -\nabla f(x^{(t)})^T (x^* - x^{(t)}) = \frac{1}{T} \cdot \sum_{t=1}^{T} (\ell^{(t)})^T (x^{(t)} - x^*) \le$ REGRET

Final Bound
RESULTING ALGORITHM: MIRROR DESCENT
Error bound with $\sigma$-strongly-convex regularizer $F$:
$\epsilon_{MD} \le \frac{\rho \sqrt{2 \cdot (\max_{x \in X} F(x) - \min_{x \in X} F(x))}}{\sqrt{\sigma T}}$
ASYMPTOTICALLY OPTIMAL BY INFORMATION COMPLEXITY LOWER BOUND

Non-Smooth Optimization over Simplex
RESULTING ALGORITHM: MIRROR DESCENT OVER SIMPLEX = MWU
Regularizer $F$ is negative entropy, with $\|\nabla f(x^{(t)})\|_\infty \le \rho$:
$\epsilon_{MD} \le \frac{\rho \sqrt{2 \cdot \log n}}{\sqrt{T}}$

APPLICATIONS IN ALGORITHM DESIGN

Warm-up Example: Linear Programming
LP Feasibility problem: $A \in \mathbb{R}^{m \times n}$; does there exist $x \in X$ with $b - Ax \ge 0$?
• Easy constraints ($x \in X$): maintain feasible
• Hard constraints ($b - Ax \ge 0$): require fixing
Convert into a non-smooth optimization problem over the simplex:
$\min_{p \in \Delta_m} \max_{x \in X} p^T (b - Ax)$
Non-differentiable objective: $f(p) = \max_{x \in X} p^T (b - Ax)$ (best response to dual solution $p$)
Admits subgradients, for all $p$:
$x_p : p^T (b - A x_p) \ge 0; \quad (b - A x_p) \in \partial f(p)$
Subgradient is slack in constraints.
If we can pick $x_p$ such that $\|b - A x_p\|_\infty \le \rho$, then
$\epsilon_{MD} \le \frac{\rho \sqrt{2 \cdot \log m}}{\sqrt{T}}$, i.e. $T \le \frac{2 \cdot \rho^2 \cdot \log m}{\epsilon^2}$

MWU and s-t Maxflow
Maximum flow feasibility for value $F$ over undirected graph $G$ with incidence matrix $B$:
$\forall e \in E,\ F \cdot \frac{|f_e|}{c_e} \le 1$
$B^T f = e_s - e_t$ (will enforce this)
Turn into a non-smooth minimization problem over the simplex:
$f(p) = \min_{B^T f = e_s - e_t} \sum_{e \in E} p_e \cdot \left( \frac{F \cdot |f_e|}{c_e} - 1 \right)$
Best response $f_p$ is the shortest s-t path with lengths $p_e / c_e$.
For any $p$, if $f_p$ has length > 1, there is no subgradient, i.e. the problem is infeasible. Otherwise, the following is a subgradient:
$\partial f(p)_e = \frac{F \cdot |(f_p)_e|}{c_e} - 1$
Unfortunately, the width can be large: $\|\partial f(p)\|_\infty \le \frac{F}{c_{min}}$
[PST 91]: $T = O\left( \frac{F \log n}{\epsilon^2 c_{min}} \right)$

Width Reduction: make primal nicer
NEED PRIMAL ARGUMENT
PROBLEM: Optimal for this specific formulation: $\|\partial f(p)\|_\infty \le \frac{F}{c_{min}}$
SOLUTION: Regularize primal
$f(p) = \min_{B^T f = e_s - e_t} \sum_{e \in E} F \cdot \frac{|f_e|}{c_e} \left( p_e + \frac{\epsilon}{m} \right) - 1$
REGULARIZATION ERROR: $\epsilon F$
NEW WIDTH: $\|\partial f(p)\|_\infty \le \frac{m}{\epsilon}$
ITERATION BOUND: $T = O\left( \frac{m \log n}{\epsilon^2} \right)$ [GK 98]

Electrical Flow Approach [CKMST]
Different formulation yields the basis for the CKMST algorithm:
$\forall e \in E,\ F \cdot \frac{f_e^2}{c_e^2} \le 1$
$B^T f = e_s - e_t$ (will enforce this)
Non-smooth optimization problem:
$f(p) = \min_{B^T f = e_s - e_t} \sum_{e \in E} p_e \cdot \left( \frac{F \cdot f_e^2}{c_e^2} - 1 \right)$
Best response is the electrical flow $f_p$.
Original width: $\|\partial f(p)\|_\infty \le m$
Regularize primal:
$f(p) = \min_{B^T f = e_s - e_t} \sum_{e \in E} F \cdot \frac{f_e^2}{c_e^2} \left( p_e + \frac{\epsilon}{m} \right) - 1$
New width: $\|\partial f(p)\|_\infty \le \sqrt{\frac{m}{\epsilon}}$

Conclusion: Take-away messages
• Regularization is a powerful tool for the design of fast algorithms.
• Most iterative algorithms can be understood as regularized updates: MWUs, Width Reduction, Interior Point, Gradient descent, …
• They perform well in practice. Regularization also helps eliminate noise.
• ULTIMATE GOAL: Development of a library of iterative methods for fast graph algorithms. Regularization plays a fundamental role in this effort.

THE END – THANK YOU
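
As a hedged illustration of the simplex case from the talk, the sketch below runs the multiplicative weight update (each weight is multiplied by (1 - eps) raised to that expert's loss) on synthetic losses in [0, 1], so rho = 1. It compares the average regret with log n / (eps T) + eps, and checks that the entropy-regularized leader with 1/eta = -ln(1 - eps) produces the same distribution. The function names, the synthetic loss generator and the parameter values are assumptions made for illustration; none of them come from the talk.

```python
import math
import random


def demo_losses(T, n, seed=0):
    """Hypothetical synthetic data: T loss vectors in [0, 1]^n.

    Expert 0 is made better on average so there is a clear best expert.
    """
    rng = random.Random(seed)
    return [[(0.5 if i == 0 else 1.0) * rng.random() for i in range(n)]
            for _ in range(T)]


def mwu(losses, eps):
    """Multiplicative weight updates over the simplex.

    Returns (average regret, final distribution, cumulative losses).
    """
    n = len(losses[0])
    weights = [1.0] * n       # w^(1) is the all-ones vector
    cumulative = [0.0] * n    # L_i = sum over rounds of expert i's loss
    algorithm_loss = 0.0
    for loss in losses:
        total = sum(weights)
        p = [w / total for w in weights]            # p_i = w_i / sum_j w_j
        algorithm_loss += sum(pi * li for pi, li in zip(p, loss))
        # multiplicative update: w_i <- (1 - eps)^(loss_i) * w_i
        weights = [w * (1.0 - eps) ** li for w, li in zip(weights, loss)]
        cumulative = [c + li for c, li in zip(cumulative, loss)]
    T = len(losses)
    total = sum(weights)
    p_final = [w / total for w in weights]
    regret = (algorithm_loss - min(cumulative)) / T
    return regret, p_final, cumulative


def regret_bound(n, T, eps):
    """log n / (eps T) + eps: the talk's bound with rho = 1 (natural log)."""
    return math.log(n) / (eps * T) + eps


def eta_from_eps(eps):
    """Regularization parameter eta satisfying 1 - eps = exp(-1 / eta)."""
    return -1.0 / math.log(1.0 - eps)


def entropy_ftrl(cumulative, eta):
    """Follow the regularized leader with negative entropy.

    The minimizer over the simplex is p_i proportional to exp(-L_i / eta).
    Subtracting min(L) first only improves numerical stability.
    """
    shift = min(cumulative)
    scores = [math.exp(-(c - shift) / eta) for c in cumulative]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    T, n, eps = 1000, 8, 0.05
    regret, p, cumulative = mwu(demo_losses(T, n), eps)
    q = entropy_ftrl(cumulative, eta_from_eps(eps))
    print("average regret:", regret)
    print("bound         :", regret_bound(n, T, eps))
    print("max |MWU - FTRL| entry:", max(abs(a - b) for a, b in zip(p, q)))
```

Losses are kept in [0, 1] here because that is the range in which the standard potential argument gives exactly this bound for the (1 - eps)^loss update; for a general width rho, one common choice (an assumption, not stated in the talk) is to rescale the losses by 1/rho before updating.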