Frank-Wolfe optimization
insights in machine learning
Simon Lacoste-Julien
INRIA / École Normale Supérieure
SIERRA Project Team
SMILE – November 4th 2013
Outline

- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction
  - links with previous algorithms
  - block-coordinate extension
  - results for sequence prediction
- Herding as Frank-Wolfe optimization
  - extension: weighted Herding
  - simulations for quadrature
Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient)

- algorithm for constrained optimization: $\min_{\alpha \in \mathcal{M}} f(\alpha)$,
  where $f$ is convex & continuously differentiable and $\mathcal{M}$ is convex & compact

FW algorithm – repeat:
1) Find a good feasible direction by minimizing the linearization of $f$:
   $s_{t+1} \in \arg\min_{s \in \mathcal{M}} \langle \nabla f(\alpha_t), s \rangle$
2) Take a convex step in that direction:
   $\alpha_{t+1} = (1 - \gamma_t)\,\alpha_t + \gamma_t\, s_{t+1}$

Properties:
- O(1/T) rate
- sparse iterates
- get the duality gap for free
- affine invariant
- rate holds even if the linear subproblem is solved approximately
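To make the two steps concrete, here is a minimal Frank-Wolfe sketch (not the talk's implementation) for a smooth objective over the probability simplex, where the linear oracle reduces to picking the coordinate with the smallest gradient; the simplex domain, the example objective and all names are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, alpha0, T=100):
    """Minimal Frank-Wolfe sketch over the probability simplex.

    grad_f: callable returning the gradient of f at alpha.
    The linear subproblem min_{s in simplex} <grad, s> is solved by
    putting all the mass on the coordinate with the smallest gradient.
    """
    alpha = alpha0.copy()
    for t in range(T):
        g = grad_f(alpha)
        # 1) linear oracle: simplex vertex minimizing <g, s>
        s = np.zeros_like(alpha)
        s[np.argmin(g)] = 1.0
        # duality gap certificate: <g, alpha - s> bounds the suboptimality
        gap = g @ (alpha - s)
        if gap < 1e-8:
            break
        # 2) convex step; 2/(t+2) is a standard default schedule
        gamma = 2.0 / (t + 2.0)
        alpha = (1.0 - gamma) * alpha + gamma * s
    return alpha

# usage: minimize f(alpha) = 0.5 * ||alpha - b||^2 over the simplex
b = np.array([0.1, 0.7, 0.2])
alpha_star = frank_wolfe_simplex(lambda a: a - b, np.ones(3) / 3)
```

The `gap` variable computed inside the loop is exactly the free certificate described on the next slide.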
Frank-Wolfe: properties

- convex steps => the iterate is a sparse convex combination:
  $\alpha_T = \rho_0 \alpha_0 + \sum_{t=1}^{T} \rho_t s_t$, where $\sum_{t=0}^{T} \rho_t = 1$
- get a duality gap certificate for free
  (special case of the Fenchel duality gap) – it also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant! [see Jaggi ICML 2013]
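For reference, the gap certificate mentioned above can be written out with the notation of the previous slides (a standard identity, stated here as an aside rather than transcribed from the slide):

$$
g(\alpha_t) \;=\; \max_{s \in \mathcal{M}} \langle \nabla f(\alpha_t),\, \alpha_t - s \rangle
\;=\; \langle \nabla f(\alpha_t),\, \alpha_t - s_{t+1} \rangle
\;\ge\; f(\alpha_t) - \min_{\alpha \in \mathcal{M}} f(\alpha),
$$

so the linear oracle's output $s_{t+1}$ already yields an upper bound on the suboptimality, which is why the gap comes "for free".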
Block-Coordinate Frank-Wolfe Optimization for Structured SVMs [ICML 2013]
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, Patrick Pletscher
Structured SVM optimization

- structured prediction: learn a classifier whose prediction is a decoding step
- structured hinge loss / structured SVM primal -> loss-augmented decoding
  (vs. the binary hinge loss)
- structured SVM dual -> exponential number of variables!
- primal-dual pair linking $w$ and $\alpha$
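One standard way of writing the primal-dual pair referred to above, using the notation $\psi_i(y) = \phi(x_i, y_i) - \phi(x_i, y)$ and task loss $L_i(y)$ (this formulation and notation are assumptions, consistent with but not copied from the slide):

$$
\min_{w}\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}_i}\big(L_i(y) - \langle w, \psi_i(y)\rangle\big)
\qquad \text{(the inner max is loss-augmented decoding)}
$$

with dual variables $\alpha_i(y) \ge 0$, $\sum_{y} \alpha_i(y) = 1$ for each example $i$ (hence the exponential number of variables):

$$
\min_{\alpha \in \mathcal{M}}\ f(\alpha) = \frac{\lambda}{2}\Big\|\sum_{i,y}\frac{\alpha_i(y)}{\lambda n}\psi_i(y)\Big\|^2 - \sum_{i,y}\frac{\alpha_i(y)}{n}L_i(y),
\qquad
w(\alpha) = \sum_{i,y}\frac{\alpha_i(y)}{\lambda n}\psi_i(y).
$$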
Structured SVM optimization (2)

popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]
  - pros: online!
  - cons: sensitive to the step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]
  - rate: guaranteed error after K passes through the data
  - pros: automatic step-size; duality gap
  - cons: batch! -> slow for large n

our approach: block-coordinate Frank-Wolfe on the dual
-> combines the best of both worlds:
- online!
- automatic step-size via analytic line search
- duality gap
- rates also hold for approximate oracles
Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient) – recap

- constrained optimization: $\min_{\alpha \in \mathcal{M}} f(\alpha)$, with $f$ convex & continuously differentiable and $\mathcal{M}$ convex & compact
- repeat: 1) $s_{t+1} \in \arg\min_{s \in \mathcal{M}} \langle \nabla f(\alpha_t), s \rangle$ (minimize the linearization of $f$);
  2) $\alpha_{t+1} = (1 - \gamma_t)\,\alpha_t + \gamma_t\, s_{t+1}$ (convex step)
- properties: O(1/T) rate; sparse iterates; duality gap for free; affine invariant;
  rate holds even if the linear subproblem is solved approximately
Frank-Wolfe for structured SVM

- structured SVM dual: $-\min_{\alpha \in \mathcal{M}} f(\alpha)$; use the primal-dual link $w = w(\alpha)$
- link between FW and the subgradient method: see [Bach 12]

FW algorithm – repeat (key insight):
1) Find a good feasible direction by minimizing the linearization of $f$:
   = loss-augmented decoding on each example $i$
2) Take a convex step in that direction: $\alpha_{t+1} = (1 - \gamma_t)\,\alpha_t + \gamma_t\, s_{t+1}$
   -> becomes a batch subgradient step;
   choose $\gamma_t$ by analytic line search on the quadratic dual $f(\alpha)$
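A short derivation of that analytic line search, assuming the quadratic dual form $f(\alpha) = \frac{\lambda}{2}\|A\alpha\|^2 - b^\top\alpha$ used in the sketch above (that form is an assumption carried over from the sketch, not transcribed from the slide). For the direction $d_t = s_{t+1} - \alpha_t$, the map $\gamma \mapsto f(\alpha_t + \gamma d_t)$ is a univariate quadratic, so

$$
\gamma_t^\ast = \operatorname{clip}_{[0,1]}\!\left(\frac{\langle \nabla f(\alpha_t),\, \alpha_t - s_{t+1}\rangle}{\lambda\,\|A(\alpha_t - s_{t+1})\|^2}\right),
$$

i.e. the current Frank-Wolfe gap divided by $\lambda$ times the squared distance between the corresponding primal vectors $w(\alpha_t)$ and $w(s_{t+1})$.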
FW for structured SVM: properties

- running FW on the dual ≈ batch subgradient on the primal
  - still O(1/T) rate, but provides an adaptive step-size from the analytic line-search
    and a duality gap stopping criterion
- 'fully corrective' FW on the dual ≈ cutting plane algorithm (SVMstruct)
  - gives a simpler proof of SVMstruct convergence + approximate-oracle guarantees
  - not faster than simple FW in our experiments
- BUT: still batch => slow for large n...
Block-Coordinate Frank-Wolfe (new!)

- for constrained optimization over a compact product domain:
  $\mathcal{M} = \mathcal{M}^{(1)} \times \cdots \times \mathcal{M}^{(n)}$
- pick $i$ at random; update only block $i$ with a FW step
- we proved the same O(1/T) rate as batch FW
  -> but each step is n times cheaper
  -> and the constant can be the same (e.g., for the SVM)

Properties:
- O(1/T) rate
- sparse iterates
- duality gap guarantees
- affine invariant
- rate holds even if the linear subproblem is solved approximately
Block-Coordinate Frank-Wolfe (new!) – applied to the structured SVM

- same setting: constrained optimization over a compact product domain
- pick $i$ at random; update only block $i$ with a FW step:
  for the structured SVM, the block-$i$ linear subproblem is loss-augmented decoding on example $i$
- we proved the same O(1/T) rate as batch FW
  -> but each step is n times cheaper
  -> and the constant can be the same (e.g., for the structured SVM)
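A schematic sketch of the block-coordinate variant over a product domain, with the block linear oracle left abstract (for the structured SVM it would be loss-augmented decoding on example $i$); `grad_block`, `linear_oracle_block` and the step-size rule are illustrative assumptions, not the paper's exact pseudo-code.

```python
import numpy as np

def bcfw(grad_block, linear_oracle_block, blocks, T, step_size=None):
    """Block-coordinate Frank-Wolfe sketch over M = M_1 x ... x M_n.

    blocks: list of initial iterates, one array per block M_i.
    grad_block(blocks, i): partial gradient of f w.r.t. block i.
    linear_oracle_block(g_i, i): argmin_{s in M_i} <g_i, s>
        (for the structured SVM: loss-augmented decoding on example i).
    step_size(blocks, i, s_i): optional analytic line search.
    """
    n = len(blocks)
    for k in range(T):
        i = np.random.randint(n)             # pick one block uniformly at random
        g_i = grad_block(blocks, i)          # only the block-i gradient is needed
        s_i = linear_oracle_block(g_i, i)    # FW vertex of block i only
        if step_size is not None:
            gamma = step_size(blocks, i, s_i)
        else:
            gamma = 2.0 * n / (k + 2.0 * n)  # a schedule of this form appears in the analysis
        blocks[i] = (1.0 - gamma) * blocks[i] + gamma * s_i
    return blocks
```

For the structured SVM one would not store the exponential-size dual vector explicitly but rather one parameter vector per example, which is the storage caveat mentioned on the next slide.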
BCFW for structured SVM: properties

- each update requires only 1 oracle call (vs. n for SVMstruct),
  so the guaranteed error after K passes through the data is reached
  with n times fewer oracle calls than SVMstruct
- advantages over stochastic subgradient:
  - step-sizes by line-search -> more robust
  - duality gap certificate -> know when to stop
  - guarantees hold for approximate oracles
  - almost as simple as the stochastic subgradient method
- implementation: https://github.com/ppletscher/BCFWstruct
- caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM -> reduces to the DCA method [Hsieh et al. 08]
- interesting link with prox SDCA [Shalev-Shwartz et al. 12]
More info about constants...

- batch FW rate: depends on the "curvature" constant
- BCFW rate: depends on the "product curvature" constant
  -> removed with line-search
- comparing constants:
  - for the structured SVM – same constants
  - identity Hessian + cube constraint: (no speed-up)
Sidenote: weighted averaging

- standard to average the iterates of the stochastic subgradient method:
  uniform averaging vs. t-weighted averaging [L.-J. et al. 12], [Shamir & Zhang 13]
- weighted averaging improves the duality gap for BCFW
- it also makes a big difference in test error!
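Writing out the two schemes for iterates $w^{(t)}$ (standard definitions; the exact constants are my reading rather than a transcription of the slide):

$$
\text{uniform:}\quad \bar w_T = \frac{1}{T}\sum_{t=1}^{T} w^{(t)},
\qquad
\text{t-weighted:}\quad \bar w_T = \frac{2}{T(T+1)}\sum_{t=1}^{T} t\, w^{(t)},
$$

and the t-weighted average can be maintained online via $\bar w_t = \big(1 - \tfrac{2}{t+1}\big)\bar w_{t-1} + \tfrac{2}{t+1}\, w^{(t)}$.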
Experiments

[plots: optimization error and test error on the OCR and CoNLL datasets]

Surprising test error though! On CoNLL, the ordering of the methods on test error is flipped
compared to their ordering on optimization error.
Conclusions for 1st part

- applying FW on the dual of the structured SVM:
  - unified previous algorithms
  - provided a line-search version of batch subgradient
- new block-coordinate variant of the Frank-Wolfe algorithm:
  - same convergence rate but with a cheaper iteration cost
  - yields a robust & fast algorithm for the structured SVM
- future work:
  - caching tricks
  - non-uniform sampling
  - regularization path
  - explain the weighted-averaging test error mystery
On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012]
Francis Bach, Simon Lacoste-Julien, Guillaume Obozinski
A motivation: quadrature

- Approximating integrals:
  $\int_{\mathcal{X}} f(x)\, p(x)\, dx \;\approx\; \frac{1}{T}\sum_{t=1}^{T} f(x_t)$
  - random sampling $x_t \sim p(x)$ yields $O(1/\sqrt{T})$ error
  - Herding [Welling 2009], [Chen et al. 2010] yields $O(1/T)$ error! (like quasi-MC)
- This part -> links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
  - suggests extensions – e.g. a weighted version with $O(e^{-cT})$ error
  - BUT the extensions are worse for learning???
  - -> yields interesting insights on the properties of herding...
Outline

- Background:
  - Herding
  - [Conditional gradient algorithm]
- Equivalence between herding & cond. gradient
  - Extensions
  - New rates & theorems
- Simulations
  - Approximation of integrals with cond. gradient variants
  - Learned distribution vs. max entropy
Review of herding [Welling ICML 2009]

- Learning in an MRF:
  $p_\theta(x) = \frac{1}{Z_\theta}\exp(\langle \theta, \Phi(x)\rangle)$,
  with feature map $\Phi : \mathcal{X} \to \mathcal{F}$
- Motivation (pipeline): data -> learning [(approx.) ML / max. entropy; moment matching]
  -> parameter $\theta_{ML}$ -> (approx.) inference [sampling] -> samples;
  (pseudo)herding produces the samples directly
Herding updates

- Zero temperature limit of the log-likelihood gives the 'Tipi' function:
  $\lim_{\beta \to 0}\; \langle \theta, \mu\rangle - \beta \log \sum_{x \in \mathcal{X}} \exp\!\big(\tfrac{1}{\beta}\langle \theta, \Phi(x)\rangle\big) \;=\; \langle \theta, \mu\rangle - \max_{x \in \mathcal{X}} \langle \theta, \Phi(x)\rangle$
- Herding updates = subgradient ascent updates:
  $x_{t+1} \in \arg\max_{x \in \mathcal{X}} \langle \theta_t, \Phi(x)\rangle$
  $\theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})$
- Properties: (thanks to Max Welling for the picture)
  1) $\theta_t$ weakly chaotic -> entropy?
  2) Moment matching: $\big\|\mu - \tfrac{1}{T}\sum_{t=1}^{T}\Phi(x_t)\big\|^2 = O(1/T^2)$ -> our focus
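The two herding updates above translate directly into code; a minimal sketch for a finite state space with an explicit feature matrix (the variable names, the finite-$\mathcal{X}$ setting and the initialization are illustrative assumptions):

```python
import numpy as np

def herding(Phi, mu, T):
    """Herding sketch on a finite state space.

    Phi: (num_states, d) array; Phi[x] is the feature vector of state x.
    mu:  (d,) target moment vector, mu = E_p[Phi(x)].
    Returns the indices of the T generated pseudo-samples.
    """
    theta = mu.copy()                      # one possible initialization (an assumption here)
    samples = []
    for _ in range(T):
        x = int(np.argmax(Phi @ theta))    # x_{t+1} in argmax_x <theta_t, Phi(x)>
        samples.append(x)
        theta = theta + mu - Phi[x]        # theta_{t+1} = theta_t + mu - Phi(x_{t+1})
    return samples

# moment matching can be checked via:
# mu_hat = Phi[np.array(samples)].mean(axis=0)   # should approach mu
```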
Approx. integrals in RKHS

- Controlling the moment discrepancy is enough to control the error of integrals in the RKHS $\mathcal{F}$:
  - Reproducing property: $f \in \mathcal{F} \Rightarrow f(x) = \langle f, \Phi(x)\rangle$
  - Define the mean map: $\mu = E_{p(x)} \Phi(x)$
  - Want to approximate integrals of the form: $E_{p(x)} f(x) = E_{p(x)} \langle f, \Phi(x)\rangle = \langle f, \mu\rangle$
  - Use a weighted sum to get the approximated mean: $\hat\mu = E_{\hat p(x)} \Phi(x) = \sum_{t=1}^{T} w_t \Phi(x_t)$
  - The approximation error is then bounded by: $|E_{p(x)} f(x) - E_{\hat p(x)} f(x)| \le \|f\|\,\|\mu - \hat\mu\|$
Conditional gradient algorithm (aka Frank-Wolfe)

- Algorithm to optimize $\min_{g \in \mathcal{M}} J(g)$,
  with $J$ convex & (twice) continuously differentiable and $\mathcal{M}$ convex & compact
- Repeat:
  - Find a good feasible direction by minimizing the linearization of $J$ at $g_t$,
    $J(g_t) + \langle J'(g_t), g - g_t\rangle$:
    $\bar g_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle J'(g_t), g\rangle$
  - Take a convex step in that direction: $g_{t+1} = (1 - \rho_t)\, g_t + \rho_t\, \bar g_{t+1}$
- Converges in O(1/T) in general
- (on the next slide: $\rho_t = 1/(t+1)$ and $J(g) = \tfrac{1}{2}\|g - \mu\|^2$)
Herding & cond. grad. are equivalent

- Trick: look at conditional gradient on the dummy objective
  $\min_{g \in \mathcal{M}} \big\{ J(g) = \tfrac{1}{2}\|g - \mu\|^2 \big\}$, with $\mathcal{M} = \mathrm{conv}\{\Phi(x) : x \in \mathcal{X}\}$,
  and do the change of variable $g_t - \mu = -\theta_t / t$
- herding updates:
  $x_{t+1} \in \arg\max_{x \in \mathcal{X}} \langle \theta_t, \Phi(x)\rangle$,  $\theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})$
- correspond to the conditional gradient updates:
  $\bar g_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle g_t - \mu,\, g\rangle = \Phi(x_{t+1})$,  $g_{t+1} = (1 - \rho_t)\, g_t + \rho_t\, \bar g_{t+1}$
- same with the step-size $\rho_t = 1/(t+1)$ ($\rho_0 = 1$):
  $(t+1)\, g_{t+1} = t\, g_t + \Phi(x_{t+1})$, so
  $g_T = \frac{1}{T}\sum_{t=1}^{T} \Phi(x_t) = \hat\mu_T$
- subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
Extensions of herding

- More general step-sizes -> give a weighted sum:
  $g_T = \sum_{t=1}^{T} w_t \Phi(x_t)$
- Two extensions:
  1) Line search for $\rho_t$
  2) Min-norm point algorithm
     (minimize $J(g)$ over the convex hull of the previously visited points)
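For extension 1), the line search has a closed form because $J(g) = \tfrac{1}{2}\|g - \mu\|^2$ is quadratic; a short derivation under that objective (the clipping to $[0,1]$, which keeps the iterate feasible, is spelled out here as an assumption):

$$
\rho_t^\ast
= \operatorname*{arg\,min}_{\rho \in [0,1]} J\big((1-\rho)\,g_t + \rho\,\bar g_{t+1}\big)
= \operatorname{clip}_{[0,1]}\!\left(\frac{\langle g_t - \mu,\; g_t - \bar g_{t+1}\rangle}{\|g_t - \bar g_{t+1}\|^2}\right).
$$

Since $\bar g_{t+1} = \Phi(x_{t+1})$, using such non-uniform step-sizes is exactly what produces the weighted sum of features above.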
Rates of convergence & theorems

- No assumption: conditional gradient yields* $\|g_t - \mu\|^2 = O(1/t)$
- If we assume $\mu$ is in the relative interior of $\mathcal{M}$ with radius $r > 0$:
  - [Chen et al. 2010] yields $\|g_t - \mu\|^2 = O(1/t^2)$ for herding ($\rho_t = 1/(t+1)$)
  - whereas the line-search version yields $\|g_t - \mu\|^2 = O(e^{-ct})$
    [Guélat & Marcotte 1986, Beck & Teboulle 2004]
- Propositions: suppose $\mathcal{X}$ is compact and $\Phi$ is continuous
  1) $\mathcal{F}$ finite dimensional and $p$ of full support imply $\exists\, r > 0$
  2) $\mathcal{F}$ infinite dimensional implies $r = 0$ (i.e. [Chen et al. 2010] doesn't hold!)
Simulation 1: approximating integrals

- Kernel herding on $\mathcal{X} = [0, 1]$
- Use an RKHS with the Bernoulli polynomial kernel (infinite dimensional)
- $p(x) \propto \big(\sum_{k=1}^{K} a_k \cos(2k\pi x) + b_k \sin(2k\pi x)\big)^2$  (closed form)

[plot: $\log_{10} \|\hat\mu_T - \mu\|$ vs. $T$]
Simulation 2: max entropy?

- learning independent bits: $\mathcal{X} = \{-1, 1\}^d$, $d = 10$, $\Phi(x) = x$

[plots, for irrational $\mu$ and rational $\mu$: error on the moments $\log_{10}\|\hat\mu_T - \mu\|$
 and error on the distribution $\log_{10}\|\hat p_T - p\|$]
Conclusions for 2nd part

- Equivalence of herding and conditional gradient:
  -> yields better algorithms for quadrature based on moments
  -> but highlights the max entropy / moment matching tradeoff!
- Other interesting points:
  - setting up fake optimization problems -> harvest properties of known algorithms
  - the conditional gradient algorithm is useful to know...
  - the duality of subgradient & conditional gradient is more general
- Recent related work:
  - link with Bayesian quadrature [Huszar & Duvenaud UAI 2012]
  - herded Gibbs sampling [Bornn et al. ICLR 2013]
Thank you!