Frank-Wolfe optimization insights in machine learning
Simon Lacoste-Julien
INRIA / École Normale Supérieure, SIERRA project team
SMILE – November 4th 2013

Outline
- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction
  - links with previous algorithms
  - block-coordinate extension
  - results for sequence prediction
- Herding as Frank-Wolfe optimization
  - extension: weighted herding
  - simulations for quadrature

Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)
- algorithm for constrained optimization: $\min_{\alpha \in \mathcal{M}} f(\alpha)$, where $f$ is convex & continuously differentiable and $\mathcal{M}$ is convex & compact
- FW algorithm – repeat:
  1) find a good feasible direction by minimizing the linearization of $f$: $s_{t+1} \in \arg\min_{s \in \mathcal{M}} \langle f'(\alpha_t), s \rangle$
  2) take a convex step in that direction: $\alpha_{t+1} = (1 - \gamma_t)\,\alpha_t + \gamma_t\, s_{t+1}$
- Properties: $O(1/T)$ rate; sparse iterates; get the duality gap for free; affine invariant; the rate holds even if the linear subproblem is only solved approximately

Frank-Wolfe: properties
- convex steps => the iterate is a sparse convex combination: $\alpha_T = \rho_0\,\alpha_0 + \sum_{t=1}^{T} \rho_t\, s_t$ where $\sum_{t=0}^{T} \rho_t = 1$
- get a duality gap certificate for free (a special case of the Fenchel duality gap); it also converges as $O(1/T)$!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant! [see Jaggi ICML 2013]

Block-Coordinate Frank-Wolfe Optimization for Structured SVMs [ICML 2013]
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, Patrick Pletscher

Structured SVM optimization
- structured prediction: learn a classifier $h_w(x) = \arg\max_{y} \langle w, \phi(x,y)\rangle$ (decoding)
- structured hinge loss: $\tilde H_i(w) = \max_{y}\big(L_i(y) - \langle w, \psi_i(y)\rangle\big)$ -> loss-augmented decoding (vs. the binary hinge loss)
- structured SVM primal: $\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n}\tilde H_i(w)$
- structured SVM dual: exponential number of variables (one per example and output)!
- primal-dual pair: the primal $w$ is recovered as an affine function $w(\alpha)$ of the dual variables

Structured SVM optimization (2)
popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
  - pros: online! cons: sensitive to the step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09], with a guaranteed rate after K passes through the data
  - pros: automatic step-size; duality gap. cons: batch! -> slow for large n
our approach: block-coordinate Frank-Wolfe on the dual -> combines the best of both worlds:
- online!
- automatic step-size via analytic line search
- duality gap
- rates also hold for approximate oracles

(Recap: the Frank-Wolfe algorithm [Frank, Wolfe 1956] from above, $\min_{\alpha\in\mathcal M} f(\alpha)$ with steps $\alpha_{t+1} = (1-\gamma_t)\,\alpha_t + \gamma_t\, s_{t+1}$.)

Frank-Wolfe for structured SVM
- apply FW to the structured SVM dual $\min_{\alpha\in\mathcal M} f(\alpha)$; use the primal-dual link $w(\alpha)$
- link between FW and the subgradient method: see [Bach 12]
- key insight: minimizing the linearization of $f$ over $\mathcal M$ (step 1 of FW) decomposes into loss-augmented decoding on each example $i$
- the convex step $\alpha_{t+1} = (1-\gamma_t)\,\alpha_t + \gamma_t\, s_{t+1}$ (step 2) becomes a batch subgradient step in the primal
- choose $\gamma_t$ by analytic line search on the quadratic dual $f(\alpha)$

FW for structured SVM: properties
- running FW on the dual corresponds to batch subgradient on the primal, still with the $O(1/T)$ rate, but it provides an adaptive step-size from the analytic line search and a duality gap stopping criterion
- 'fully corrective' FW on the dual corresponds to the cutting plane algorithm (SVMstruct); this gives a simpler proof of SVMstruct convergence + guarantees with approximate oracles; not faster than simple FW in our experiments
- BUT: still batch => slow for large n...
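Not on the original slides: a minimal Python sketch of the generic Frank-Wolfe loop described above, assuming a toy quadratic objective over the probability simplex (where the linear minimization oracle just returns the best vertex). The function names and the toy problem are illustrative, not the structured SVM setting; the gap it tracks is the Frank-Wolfe duality gap used as a stopping certificate.

```python
import numpy as np

def frank_wolfe(grad, lmo, alpha0, T=1000, tol=1e-6):
    """Generic Frank-Wolfe: alpha_{t+1} = (1 - gamma_t) alpha_t + gamma_t s_{t+1}."""
    alpha = alpha0
    for t in range(T):
        g = grad(alpha)
        s = lmo(g)                      # linear minimization oracle over M
        gap = g @ (alpha - s)           # Frank-Wolfe duality gap (certificate)
        if gap < tol:
            break
        gamma = 2.0 / (t + 2)           # standard step size, gives the O(1/T) rate
        alpha = (1 - gamma) * alpha + gamma * s
    return alpha, gap

# Toy example (not from the slides): minimize 0.5 ||A alpha - b||^2 over the simplex.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad = lambda alpha: A.T @ (A @ alpha - b)
lmo = lambda g: np.eye(len(g))[np.argmin(g)]   # best vertex of the probability simplex
alpha, gap = frank_wolfe(grad, lmo, np.ones(5) / 5)
print("final duality gap:", gap)
```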
Block-Coordinate Frank-Wolfe (new!)
- for constrained optimization over a compact product domain $\mathcal M = \mathcal M^{(1)} \times \dots \times \mathcal M^{(n)}$
- pick $i$ at random; update only block $i$ with a FW step
- for the structured SVM, the domain is a product of simplices (one block per example), and the block FW step on block $i$ is exactly loss-augmented decoding on example $i$
- we proved the same $O(1/T)$ rate as batch FW -> but each step is $n$ times cheaper -> and the constant can be the same (e.g. for the SVM)
- Properties: $O(1/T)$ rate; sparse iterates; duality gap guarantees; affine invariant; the rate holds even if the linear subproblem is only solved approximately

BCFW for structured SVM: properties
- each update requires only 1 oracle call, and the guarantee is stated as the error after K passes through the data (vs. n oracle calls per iteration for SVMstruct)
- advantages over stochastic subgradient:
  - step-sizes by line search -> more robust
  - duality gap certificate -> know when to stop
  - guarantees hold for approximate oracles
- implementation: https://github.com/ppletscher/BCFWstruct
- almost as simple as the stochastic subgradient method
- caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM it reduces to the DCA method [Hsieh et al. 08]; interesting link with prox-SDCA [Shalev-Shwartz et al. 12]
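Not from the slides: a minimal sketch of the block-coordinate Frank-Wolfe update above, assuming the domain is a product of probability simplices and a generic per-block gradient. The names `bcfw`, `block_grad` and the toy problem at the end are illustrative; this is not the structured SVM oracle itself, and in practice the analytic line search replaces the fixed step-size schedule.

```python
import numpy as np

def bcfw(block_grad, blocks, T=5000, seed=0):
    """Block-coordinate Frank-Wolfe over a product of probability simplices.

    blocks: list of block dimensions; alpha[i] lives on the simplex of that size.
    block_grad(alpha, i): gradient of the objective w.r.t. block i.
    """
    rng = np.random.default_rng(seed)
    alpha = [np.ones(d) / d for d in blocks]         # feasible start: uniform points
    n = len(blocks)
    for t in range(T):
        i = rng.integers(n)                          # pick a block uniformly at random
        g = block_grad(alpha, i)
        s = np.eye(len(g))[np.argmin(g)]             # block LMO: best simplex vertex
        gamma = 2 * n / (t + 2 * n)                  # default schedule (line search in practice)
        alpha[i] = (1 - gamma) * alpha[i] + gamma * s  # FW step on block i only
    return alpha

# Toy usage (illustrative): f(alpha) = sum_i 0.5 ||alpha_i - c_i||^2 with c_i a vertex.
dims = [4, 4, 4]
cs = [np.array([1., 0, 0, 0]), np.array([0, 1., 0, 0]), np.array([0, 0, 1., 0])]
alpha = bcfw(lambda a, i: a[i] - cs[i], dims)
```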
More info about the constants...
- batch FW rate: governed by the "curvature" constant
- BCFW rate: governed by the "product curvature" constant (-> the extra term is removed with line search)
- comparing the constants: for the structured SVM they are the same (full speed-up); for an identity Hessian with a cube constraint there is no speed-up

Sidenote: weighted averaging
- it is standard to average the iterates of the stochastic subgradient method
- uniform averaging $\bar w_T = \frac{1}{T}\sum_{t=1}^{T} w_t$ vs. $t$-weighted averaging $\bar w_T = \frac{2}{T(T+1)}\sum_{t=1}^{T} t\, w_t$ [L.-J. et al. 12], [Shamir & Zhang 13]
- weighted averaging improves the duality gap for BCFW
- it also makes a big difference in test error!

Experiments
- OCR dataset and CoNLL dataset [plots of optimization error and test error vs. passes through the data]
- surprising test error though! On the CoNLL dataset, the ordering of the methods on test error is flipped compared to the optimization error

Conclusions for 1st part
- applying FW on the dual of the structured SVM unified previous algorithms and provided a line-search version of batch subgradient
- new block-coordinate variant of the Frank-Wolfe algorithm: same convergence rate but with a cheaper iteration cost
- yields a robust & fast algorithm for the structured SVM
- future work: caching tricks, non-uniform sampling, regularization path, explaining the weighted-averaging test error mystery

On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012]
Francis Bach, Simon Lacoste-Julien, Guillaume Obozinski

A motivation: quadrature
- approximating integrals: $\int_{\mathcal X} f(x)\, p(x)\, dx \approx \frac{1}{T}\sum_{t=1}^{T} f(x_t)$
- random sampling $x_t \sim p(x)$ yields $O(1/\sqrt{T})$ error
- herding [Welling 2009], [Chen et al. 2010] yields $O(1/T)$ error! (like quasi-MC)
- this part links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
- it suggests extensions, e.g. a weighted version with $O(e^{-cT})$ error
- BUT the extensions are worse for learning??? -> yields interesting insights on the properties of herding...

Outline
- Background: equivalence between herding & conditional gradient
  - herding
  - conditional gradient algorithm
- Extensions; new rates & theorems
- Simulations
  - approximation of integrals with conditional gradient variants
  - learned distribution vs. max entropy

Review of herding [Welling ICML 2009]
- learning in an MRF: $p_\theta(x) = \frac{1}{Z_\theta}\exp(\langle\theta, \Phi(x)\rangle)$, with feature map $\Phi : \mathcal X \to \mathcal F$
- motivation: learning ((approximate) ML / max entropy, i.e. moment matching) maps data to a parameter $\theta_{ML}$; (approximate) inference (sampling) maps the parameter to samples; herding maps data to (pseudo)samples directly

Herding updates
- zero temperature limit of the log-likelihood, the 'Tipi' function:
  $\lim_{\beta\to 0}\Big(\langle\theta,\mu\rangle - \beta \log \sum_{x\in\mathcal X}\exp\big(\tfrac{1}{\beta}\langle\theta,\Phi(x)\rangle\big)\Big) = \langle\theta,\mu\rangle - \max_{x\in\mathcal X}\langle\theta,\Phi(x)\rangle$
- herding updates are subgradient ascent updates on this function:
  $x_{t+1} \in \arg\max_{x\in\mathcal X}\langle\theta_t,\Phi(x)\rangle$, then $\theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})$
- Properties (thanks to Max Welling for the picture):
  1) $\theta_t$ is weakly chaotic -> entropy?
  2) moment matching: $\big\|\mu - \frac{1}{T}\sum_{t=1}^{T}\Phi(x_t)\big\|^2 = O(1/T^2)$ -> our focus

Approximating integrals in an RKHS
- controlling the moment discrepancy is enough to control the error of integrals in an RKHS $\mathcal F$
- reproducing property: $f \in \mathcal F \Rightarrow f(x) = \langle f, \Phi(x)\rangle$
- define the mean map $\mu = E_{p(x)}\Phi(x)$
- we want to approximate integrals of the form $E_{p(x)} f(x) = E_{p(x)}\langle f, \Phi(x)\rangle = \langle f, \mu\rangle$
- use a weighted sum to get an approximate mean: $\hat\mu = E_{\hat p(x)}\Phi(x) = \sum_{t=1}^{T} w_t\,\Phi(x_t)$
- the approximation error is then bounded by $|E_{p(x)} f(x) - E_{\hat p(x)} f(x)| \le \|f\|\,\|\mu - \hat\mu\|$

Conditional gradient algorithm (aka Frank-Wolfe)
- algorithm to optimize $\min_{g\in\mathcal M} J(g)$, with $J$ convex & (twice) continuously differentiable and $\mathcal M$ convex & compact
- repeat:
  1) find a good feasible direction by minimizing the linearization of $J$: $\bar g_{t+1} \in \arg\min_{g\in\mathcal M}\langle J'(g_t), g\rangle$
  2) take a convex step in that direction: $g_{t+1} = (1-\rho_t)\,g_t + \rho_t\,\bar g_{t+1}$, with $\rho_t = 1/(t+1)$
- converges in $O(1/T)$ in general
- here we use $J(g) = \frac{1}{2}\|g - \mu\|^2$

Herding & conditional gradient are equivalent
- trick: look at conditional gradient on the dummy objective $\min_{g\in\mathcal M} J(g) = \frac{1}{2}\|g-\mu\|^2$ with $\mathcal M = \mathrm{conv}\{\Phi(x) : x\in\mathcal X\}$
- do the change of variable $g_t - \mu = -\theta_t / t$
- herding updates: $x_{t+1} \in \arg\max_{x\in\mathcal X}\langle\theta_t,\Phi(x)\rangle$, $\theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})$
- conditional gradient updates: $\bar g_{t+1} \in \arg\min_{g\in\mathcal M}\langle g_t - \mu, g\rangle$, so $\bar g_{t+1} = \Phi(x_{t+1})$, then $g_{t+1} = (1-\rho_t)\,g_t + \rho_t\,\bar g_{t+1}$ with $\rho_t = 1/(t+1)$, $\rho_0 = 1$
- subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
- with this step size, $(t+1)\,g_{t+1} = t\,g_t + \Phi(x_{t+1})$, so $g_T = \frac{1}{T}\sum_{t=1}^{T}\Phi(x_t) = \hat\mu_T$

Extensions of herding
- more general step sizes give a weighted sum: $g_T = \sum_{t=1}^{T} w_t\,\Phi(x_t)$
- two extensions: 1) line search for $\rho_t$; 2) min-norm point algorithm (minimize $J(g)$ on the convex hull of the previously visited points)
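Not on the slides: a small Python sketch of the herding updates above on a finite domain $\mathcal X$, so that the argmax is a table lookup. The feature table `Phi`, the target mean `mu`, the initialization $\theta_0 = \mu$ and the toy distribution are assumptions for illustration; the printed errors simply show the moment discrepancy $\|\mu - \hat\mu_T\|$ shrinking with $T$.

```python
import numpy as np

def herding(Phi, mu, T=1000):
    """Herding on a finite domain: Phi is a (|X|, d) array of feature vectors,
    mu is the target mean E_p[Phi(x)]. Returns the selected indices and hat_mu_T."""
    theta = mu.copy()                         # one common initialization (theta_0 = mu)
    picks = []
    for t in range(T):
        x = int(np.argmax(Phi @ theta))       # x_{t+1} in argmax_x <theta_t, Phi(x)>
        theta = theta + mu - Phi[x]           # theta_{t+1} = theta_t + mu - Phi(x_{t+1})
        picks.append(x)
    hat_mu = Phi[picks].mean(axis=0)          # uniform weights w_t = 1/T
    return picks, hat_mu

# Toy check (illustrative): random features and a full-support p on 50 points.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 5))
p = rng.dirichlet(np.ones(50))
mu = p @ Phi
for T in (10, 100, 1000):
    _, hat_mu = herding(Phi, mu, T)
    print(T, np.linalg.norm(mu - hat_mu))
```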
Rates of convergence & theorems
- no assumption: conditional gradient yields $\|g_t - \mu\|^2 = O(1/t)$
- if we assume $\mu$ lies in the relative interior of $\mathcal M$ with radius $r > 0$:
  - [Chen et al. 2010] yields, for herding ($\rho_t = 1/(t+1)$): $\|g_t - \mu\|^2 = O(1/t^2)$
  - whereas the line-search version yields $\|g_t - \mu\|^2 = O(e^{-ct})$ [Guélat & Marcotte 1986, Beck & Teboulle 2004]
- Propositions (suppose $\mathcal X$ compact and $\Phi$ continuous):
  1) $\mathcal F$ finite-dimensional and $p$ with full support imply $\exists\, r > 0$
  2) $\mathcal F$ infinite-dimensional implies $r = 0$ (i.e. the [Chen et al. 2010] result does not hold!)

Simulation 1: approximating integrals
- kernel herding on $\mathcal X = [0, 1]$
- use an RKHS with the Bernoulli polynomial kernel (infinite-dimensional); density $p(x) \propto \big(\sum_{k\ge 1} a_k\cos(2k\pi x) + b_k\sin(2k\pi x)\big)^2$, for which $\mu$ is available in closed form
- [plot: $\log_{10}\|\hat\mu_T - \mu\|$ versus $T$ for the different variants]

Simulation 2: max entropy?
- learning independent bits: $\mathcal X = \{-1, 1\}^d$, $d = 10$, $\Phi(x) = x$
- [plots: error on the moments $\log_{10}\|\hat\mu_T - \mu\|$ and error on the distribution $\log_{10}\|\hat p_T - p\|$, for rational and irrational $\mu$]

Conclusions for 2nd part
- equivalence of herding and conditional gradient:
  -> yields better algorithms for quadrature based on moments
  -> but highlights the max entropy / moment matching tradeoff!
- other interesting points:
  - setting up fake optimization problems -> harvest properties of known algorithms
  - the conditional gradient algorithm is useful to know...
  - the duality between subgradient and conditional gradient is more general
- recent related work:
  - link with Bayesian quadrature [Huszár & Duvenaud UAI 2012]
  - herded Gibbs sampling [Bornn et al. ICLR 2013]

Thank you!