online learning with structured experts – a biased survey
Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona
on-line prediction
A repeated game between forecaster and environment.
At each round t,
the forecaster chooses an action $I_t \in \{1, \dots, N\}$
(actions are often called experts);
the environment chooses losses $\ell_t(1), \dots, \ell_t(N) \in [0,1]$;
the forecaster suffers loss $\ell_t(I_t)$.
The goal is to minimize the regret
\[
R_n = \sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) .
\]
simplest example
Is it possible to make $(1/n)R_n \to 0$ for all loss assignments?

Not with a deterministic forecaster. Let $N = 2$ and define, for all $t = 1, \dots, n$,
\[
\ell_t(1) = \begin{cases} 0 & \text{if } I_t = 2 \\ 1 & \text{if } I_t = 1 \end{cases}
\qquad \text{and} \qquad
\ell_t(2) = 1 - \ell_t(1) .
\]
Then
\[
\sum_{t=1}^n \ell_t(I_t) = n
\qquad \text{and} \qquad
\min_{i=1,2} \sum_{t=1}^n \ell_t(i) \le \frac{n}{2} ,
\]
so
\[
\frac{1}{n} R_n \ge \frac{1}{2} .
\]
randomized prediction
Key to solution: randomization.
At time $t$, the forecaster chooses a probability distribution
\[
p_{t-1} = (p_{1,t-1}, \dots, p_{N,t-1})
\]
and chooses action $i$ with probability $p_{i,t-1}$.

Simplest model: all losses $\ell_s(i)$, $i = 1, \dots, N$, $s < t$, are observed: full information.
Hannan and Blackwell
Hannan (1957) and Blackwell (1956) showed that the forecaster has a strategy such that
\[
\frac{1}{n} \left( \sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right) \to 0
\]
almost surely for all strategies of the environment.
basic ideas
expected loss of the forecaster:
\[
\ell_t(p_{t-1}) = \sum_{i=1}^N p_{i,t-1}\, \ell_t(i) = \mathbb{E}_t\, \ell_t(I_t) .
\]
By martingale convergence,
\[
\frac{1}{n} \left( \sum_{t=1}^n \ell_t(I_t) - \sum_{t=1}^n \ell_t(p_{t-1}) \right) = O_P(n^{-1/2}) ,
\]
so it suffices to study
\[
\frac{1}{n} \left( \sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right) .
\]
weighted average prediction
Idea: assign a higher probability to better-performing actions.
Vovk (1990), Littlestone and Warmuth (1989).
A popular choice is
\[
p_{i,t-1} = \frac{\exp\left( -\eta \sum_{s=1}^{t-1} \ell_s(i) \right)}{\sum_{k=1}^N \exp\left( -\eta \sum_{s=1}^{t-1} \ell_s(k) \right)} ,
\qquad i = 1, \dots, N,
\]
where $\eta > 0$. Then
\[
\frac{1}{n} \left( \sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right) \le \sqrt{\frac{\ln N}{2n}}
\]
with $\eta = \sqrt{8 \ln N / n}$.
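As a concrete illustration (our own sketch, not part of the survey), the exponentially weighted average forecaster takes a few lines of numpy; the function name and the random test losses are ours:

```python
import numpy as np

def exponentially_weighted_forecaster(losses, eta):
    """Exponentially weighted averages: play expert i with probability
    proportional to exp(-eta * cumulative loss of expert i).

    losses: (n, N) array with entries in [0, 1].
    Returns the expected losses ell_t(p_{t-1}), one per round.
    """
    n, N = losses.shape
    cum = np.zeros(N)                         # cumulative losses L_{i,t-1}
    expected = np.empty(n)
    for t in range(n):
        w = np.exp(-eta * (cum - cum.min()))  # shift for numerical stability
        p = w / w.sum()
        expected[t] = p @ losses[t]
        cum += losses[t]
    return expected

rng = np.random.default_rng(0)
n, N = 10_000, 20
losses = rng.random((n, N))
eta = np.sqrt(8 * np.log(N) / n)
regret = exponentially_weighted_forecaster(losses, eta).sum() - losses.sum(axis=0).min()
print(regret <= np.sqrt(n * np.log(N) / 2))  # True: the bound is deterministic
```

The final comparison holds for every loss sequence, not only random ones, since the regret bound above is deterministic.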
proof
Let $L_{i,t} = \sum_{s=1}^t \ell_s(i)$ and
\[
W_t = \sum_{i=1}^N w_{i,t} = \sum_{i=1}^N e^{-\eta L_{i,t}}
\]
for $t \ge 1$, and $W_0 = N$. First observe that
\[
\ln \frac{W_n}{W_0}
= \ln \left( \sum_{i=1}^N e^{-\eta L_{i,n}} \right) - \ln N
\ge \ln \left( \max_{i=1,\dots,N} e^{-\eta L_{i,n}} \right) - \ln N
= -\eta \min_{i=1,\dots,N} L_{i,n} - \ln N .
\]
proof
On the other hand, for each $t = 1, \dots, n$,
\[
\ln \frac{W_t}{W_{t-1}}
= \ln \frac{\sum_{i=1}^N w_{i,t-1}\, e^{-\eta \ell_t(i)}}{\sum_{j=1}^N w_{j,t-1}}
\le -\eta\, \frac{\sum_{i=1}^N w_{i,t-1}\, \ell_t(i)}{\sum_{j=1}^N w_{j,t-1}} + \frac{\eta^2}{8}
= -\eta\, \ell_t(p_{t-1}) + \frac{\eta^2}{8}
\]
by Hoeffding's inequality.

Hoeffding (1963): if $X \in [0,1]$, then for all $s \in \mathbb{R}$,
\[
\ln \mathbb{E}\, e^{sX} \le s\, \mathbb{E} X + \frac{s^2}{8} .
\]
proof
For each $t = 1, \dots, n$,
\[
\ln \frac{W_t}{W_{t-1}} \le -\eta\, \ell_t(p_{t-1}) + \frac{\eta^2}{8} .
\]
Summing over $t = 1, \dots, n$,
\[
\ln \frac{W_n}{W_0} \le -\eta \sum_{t=1}^n \ell_t(p_{t-1}) + \frac{\eta^2}{8}\, n .
\]
Combining these, we get
\[
\sum_{t=1}^n \ell_t(p_{t-1}) \le \min_{i=1,\dots,N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta}{8}\, n .
\]
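Optimizing the final bound over $\eta$ is a one-line calculation:

```latex
\min_{\eta > 0} \left( \frac{\ln N}{\eta} + \frac{\eta n}{8} \right)
= \sqrt{\frac{n \ln N}{2}},
\qquad \text{attained at } \eta = \sqrt{\frac{8 \ln N}{n}},
```

which, after dividing by $n$, gives the $\sqrt{(\ln N)/(2n)}$ rate quoted earlier.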
lower bound
The upper bound is optimal: for all predictors,
\[
\sup_{n, N, \ell_t(i)} \frac{\sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i)}{\sqrt{(n/2) \ln N}} \ge 1 .
\]
Idea: choose the $\ell_t(i)$ to be i.i.d. symmetric Bernoulli coin flips. Then $\mathbb{E} \sum_{t=1}^n \ell_t(I_t) = n/2$ for every forecaster, and
\[
\sup_{\ell_t(i)} \left( \sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right)
\ge \mathbb{E} \left[ \sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right]
= \frac{n}{2} - \mathbb{E} \min_{i \le N} B_i ,
\]
where $B_1, \dots, B_N$ are independent Binomial$(n, 1/2)$.

Use the central limit theorem.
follow the perturbed leader
\[
I_t = \arg\min_{i=1,\dots,N} \left( \sum_{s=1}^{t-1} \ell_s(i) + Z_{i,t} \right)
\]
where the $Z_{i,t}$ are random noise variables.

The original forecaster of Hannan (1957) is based on this idea.
follow the perturbed leader
If the $Z_{i,t}$ are i.i.d. uniform on $[0, \sqrt{nN}]$, then
\[
\frac{1}{n} R_n \le 2 \sqrt{\frac{N}{n}} + O_p(n^{-1/2}) .
\]
If the $Z_{i,t}$ are i.i.d. with density $(\eta/2) e^{-\eta |z|}$, then for $\eta \approx \sqrt{\log N / n}$,
\[
\frac{1}{n} R_n \le c \sqrt{\frac{\log N}{n}} + O_p(n^{-1/2}) .
\]
Kalai and Vempala (2003).
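A minimal numpy sketch (ours) of follow the perturbed leader with the two-sided exponential perturbation; numpy's Laplace distribution with scale $1/\eta$ has exactly the density $(\eta/2)e^{-\eta|z|}$:

```python
import numpy as np

def fpl(losses, eta, rng):
    """Follow the perturbed leader: each round, draw fresh Laplace noise
    and play the action minimizing perturbed cumulative loss."""
    n, N = losses.shape
    cum = np.zeros(N)
    actions = np.empty(n, dtype=int)
    for t in range(n):
        z = rng.laplace(scale=1.0 / eta, size=N)  # density (eta/2) e^{-eta|z|}
        actions[t] = np.argmin(cum + z)
        cum += losses[t]
    return actions

rng = np.random.default_rng(1)
n, N = 20_000, 10
losses = rng.random((n, N))
actions = fpl(losses, np.sqrt(np.log(N) / n), rng)
regret = losses[np.arange(n), actions].sum() - losses.sum(axis=0).min()
print(regret / n)  # of order sqrt(log N / n)
```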
combinatorial experts
Often the class of experts is very large but has some combinatorial
structure. Can the structure be exploited?
path planning. At each time instance, the forecaster chooses a path in a graph between two fixed nodes. Each edge has an associated loss. The loss of a path is the sum of the losses over the edges in the path.

[Map image © Google. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.]

N is huge!!!
assignments: learning permutations
Given a complete bipartite graph $K_{m,m}$, the forecaster chooses a perfect matching. The loss is the sum of the losses over the edges.

[Image removed due to copyright restrictions. Please see the image at http://38.media.tumblr.com/tumblr_m0ol5tggjZ1qir7tc.gif]

Helmbold and Warmuth (2007): full information case.
spanning trees
The forecaster chooses a spanning tree in the complete graph $K_m$. The cost is the sum of the losses over the edges.
combinatorial experts
Formally, the class of experts is a set $S \subset \{0,1\}^d$ of cardinality $|S| = N$.

At each time $t$, a loss is assigned to each component: $\ell_t \in \mathbb{R}^d$.

Loss of expert $v \in S$ is $\ell_t(v) = \ell_t^\top v$.

Forecaster chooses $I_t \in S$.

The goal is to control the regret
\[
\sum_{t=1}^n \ell_t(I_t) - \min_{k=1,\dots,N} \sum_{t=1}^n \ell_t(k) .
\]
computing the exponentially weighted average forecaster
One needs to draw a random element of $S$ with distribution proportional to
\[
w_t(v) = \exp\left( -\eta\, L_{t-1}(v) \right)
= \exp\left( -\eta \sum_{s=1}^{t-1} \ell_s^\top v \right)
= \prod_{j=1}^d \exp\left( -\eta \sum_{s=1}^{t-1} \ell_{s,j}\, v_j \right) .
\]
computing the exponentially weighted average forecaster
path planning: sampling may be done by dynamic programming.

assignments: the sum of weights (the partition function) is the permanent of a non-negative matrix. Sampling may be done by the FPRAS of Jerrum, Sinclair, and Vigoda (2004).

spanning trees: Propp and Wilson (1998) define an exact sampling algorithm. The expected running time is the average hitting time of the Markov chain defined by the edge weights.
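For path planning in a layered DAG, the product form of the weights makes exact sampling a two-pass dynamic program. A small sketch (ours; the layered-graph encoding is an assumption chosen for illustration):

```python
import numpy as np

def sample_path(edge_weight, rng):
    """Sample a source-to-sink path in a layered DAG with probability
    proportional to the product of edge weights along the path.

    edge_weight[l][i, j]: weight of the edge from node i in layer l to
    node j in layer l + 1 (e.g. exp(-eta * cumulative loss of the edge)).
    """
    L = len(edge_weight)
    # Backward pass: Z[l][i] = total weight of all paths from node i to the sink.
    Z = [None] * (L + 1)
    Z[L] = np.ones(edge_weight[-1].shape[1])
    for l in range(L - 1, -1, -1):
        Z[l] = edge_weight[l] @ Z[l + 1]
    # Forward pass: extend the path one layer at a time, weighting each
    # candidate edge by its weight times the downstream path weight.
    path = [0]  # single source node in layer 0
    for l in range(L):
        probs = edge_weight[l][path[-1]] * Z[l + 1]
        path.append(int(rng.choice(len(probs), p=probs / probs.sum())))
    return path

rng = np.random.default_rng(0)
w = [np.array([[1.0, 2.0]]), np.array([[3.0], [4.0]])]  # two paths: weights 3 and 8
samples = [tuple(sample_path(w, rng)) for _ in range(10_000)]
freq = sum(s == (0, 0, 0) for s in samples) / len(samples)
print(freq)  # close to 3/11 ≈ 0.27
```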
computing the follow-the-perturbed leader forecaster
In general, much easier. One only needs to solve a linear
optimization problem over S. This may be hard but it is well
understood.
In our examples it becomes either a shortest path problem, or an
assignment problem, or a minimum spanning tree problem.
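To make the linear-optimization point concrete, here is a toy follow-the-perturbed-leader loop over spanning trees of $K_5$ (our sketch; the Laplace perturbation scale is an illustrative choice), where each round costs one minimum-spanning-tree computation:

```python
import numpy as np
from itertools import combinations

def kruskal_mst(m, weight):
    """Minimum spanning tree of the complete graph K_m (Kruskal).
    weight: dict mapping each edge (i, j) with i < j to its weight."""
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = set()
    for e in sorted(weight, key=weight.get):
        ri, rj = find(e[0]), find(e[1])
        if ri != rj:            # adding e does not create a cycle
            parent[ri] = rj
            tree.add(e)
    return tree

rng = np.random.default_rng(6)
m, n = 5, 2_000
edges = list(combinations(range(m), 2))
eta = np.sqrt(np.log(len(edges)) / n)   # illustrative perturbation scale
cum = {e: 0.0 for e in edges}           # cumulative edge losses
total = 0.0
for t in range(n):
    perturbed = {e: cum[e] + rng.laplace(scale=1.0 / eta) for e in edges}
    tree = kruskal_mst(m, perturbed)    # one linear optimization per round
    losses = {e: rng.random() for e in edges}
    total += sum(losses[e] for e in tree)
    for e in edges:
        cum[e] += losses[e]
best = sum(cum[e] for e in kruskal_mst(m, cum))  # best tree in hindsight
print((total - best) / n)  # small per-round regret
```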
follow the leader: random walk perturbation
Suppose N experts, no structure. Define
\[
I_t = \arg\min_{i=1,\dots,N} \sum_{s=1}^{t} \left( \ell_{i,s-1} + X_{i,s} \right)
\]
where the $X_{i,s}$ are either i.i.d. normal or $\pm 1$ coin flips.

This is like follow-the-perturbed-leader, but with a random walk perturbation: $\sum_{s=1}^t X_{i,s}$.

Advantage: the forecaster rarely changes actions!
follow the leader: random walk perturbation
If $R_n$ is the regret and $C_n$ is the number of times $I_t \ne I_{t-1}$, then
\[
\mathbb{E} R_n \le 2\, \mathbb{E} C_n \le 8 \sqrt{2 n \log N} + 16 \log n + 16 .
\]
Devroye, Lugosi, and Neu (2015).

Key tool: the number of leader changes in $N$ independent random walks with drift.
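A quick simulation (ours) of the random-walk perturbation, checking that the number of action switches stays far below $n$; the i.i.d. uniform losses are only for illustration:

```python
import numpy as np

def random_walk_fpl(losses, rng):
    """Follow the leader with random-walk perturbation: each round, add a
    fresh standard normal to every expert's perturbed cumulative loss, so
    expert i's total perturbation is the random walk sum_{s<=t} X_{i,s}."""
    n, N = losses.shape
    perturbed = rng.standard_normal(N)
    actions = np.empty(n, dtype=int)
    for t in range(n):
        actions[t] = np.argmin(perturbed)
        perturbed += losses[t] + rng.standard_normal(N)
    return actions

rng = np.random.default_rng(2)
n, N = 10_000, 10
losses = rng.random((n, N))
actions = random_walk_fpl(losses, rng)
switches = int((actions[1:] != actions[:-1]).sum())
print(switches)  # of order sqrt(n log N), far below n
```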
follow the leader: random walk perturbation
This also works in the "combinatorial" setting: just add an independent $N(0, d)$ perturbation at each time to every component. Then
\[
\mathbb{E} R_n = \widetilde{O}\left( B^{3/2} \sqrt{n \log d} \right)
\qquad \text{and} \qquad
\mathbb{E} C_n = O\left( B \sqrt{n \log d} \right) ,
\]
where $B = \max_{v \in S} \|v\|_1$.
why exponentially weighted averages?
May be adapted to many different variants of the problem, including bandits, tracking, etc.
multi-armed bandits
The forecaster only observes $\ell_t(I_t)$ but not $\ell_t(i)$ for $i \ne I_t$.

Herbert Robbins (1952).

[Photo removed due to copyright restrictions. Please see the image at https://en.wikipedia.org/wiki/Herbert_Robbins#/media/File:1966-HerbertRobbins.jpg]
multi-armed bandits
Trick: estimate $\ell_t(i)$ by
\[
\tilde{\ell}_t(i) = \frac{\ell_t(I_t)}{p_{I_t,t-1}}\, \mathbb{1}_{\{I_t = i\}} .
\]
This is an unbiased estimate:
\[
\mathbb{E}_t\, \tilde{\ell}_t(i) = \sum_{j=1}^N p_{j,t-1}\, \frac{\ell_t(j)}{p_{j,t-1}}\, \mathbb{1}_{\{j = i\}} = \ell_t(i) .
\]
Use the estimated losses to define exponential weights and mix with the uniform distribution (Auer, Cesa-Bianchi, Freund, and Schapire, 2002):
\[
p_{i,t-1} = (1 - \gamma)\, \underbrace{\frac{\exp\left( -\eta \sum_{s=1}^{t-1} \tilde{\ell}_s(i) \right)}{\sum_{k=1}^N \exp\left( -\eta \sum_{s=1}^{t-1} \tilde{\ell}_s(k) \right)}}_{\text{exploitation}} + \underbrace{\frac{\gamma}{N}}_{\text{exploration}} .
\]
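A compact sketch (ours) of this forecaster; coupling $\eta = \gamma$ is one common tuning, not necessarily the one used in the references, and the Bernoulli arms are only for illustration:

```python
import numpy as np

def exp3(loss_fn, n, N, eta, gamma, rng):
    """Exp3: exponential weights over importance-weighted loss estimates,
    mixed with the uniform distribution for exploration."""
    cum_est = np.zeros(N)                 # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = (1 - gamma) * w / w.sum() + gamma / N
        arm = int(rng.choice(N, p=p))
        loss = loss_fn(t, arm)            # only the played arm is observed
        total += loss
        cum_est[arm] += loss / p[arm]     # unbiased estimate of ell_t(arm)
    return total

rng = np.random.default_rng(3)
n, N = 50_000, 5
means = np.array([0.5, 0.5, 0.5, 0.5, 0.2])     # arm 4 is best
loss_fn = lambda t, i: float(rng.random() < means[i])
eta = gamma = min(1.0, np.sqrt(N * np.log(N) / n))
total = exp3(loss_fn, n, N, eta, gamma, rng)
print(total / n)  # approaches the best mean loss 0.2
```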
multi-armed bandits
\[
\mathbb{E} \left[ \frac{1}{n} \left( \sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right) \right]
= O\left( \sqrt{\frac{N \ln N}{n}} \right) .
\]
multi-armed bandits
Lower bound:
\[
\sup_{\ell_t(i)} \mathbb{E} \left[ \frac{1}{n} \left( \sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i \le N} \sum_{t=1}^n \ell_t(i) \right) \right]
\ge C \sqrt{\frac{N}{n}} .
\]
Dependence on $N$ is not logarithmic anymore!

Audibert and Bubeck (2009) constructed a forecaster with
\[
\max_{i \le N} \mathbb{E} \left[ \frac{1}{n} \left( \sum_{t=1}^n \ell_t(p_{t-1}) - \sum_{t=1}^n \ell_t(i) \right) \right]
= O\left( \sqrt{\frac{N}{n}} \right) .
\]
calibration
Sequential probability assignment. A binary sequence $x_1, x_2, \dots$ is revealed one by one.

After observing $x_1, \dots, x_{t-1}$, the forecaster issues a prediction $I_t \in \{0, 1, \dots, N\}$.

Meaning: "the chance of rain is $I_t / N$."

The forecast is calibrated if, for every $i$,
\[
\left| \frac{\sum_{t=1}^n x_t\, \mathbb{1}_{\{I_t = i\}}}{\sum_{t=1}^n \mathbb{1}_{\{I_t = i\}}} - \frac{i}{N} \right|
\le \frac{1}{2N} + o(1)
\]
whenever $\limsup_n (1/n) \sum_{t=1}^n \mathbb{1}_{\{I_t = i\}} > 0$.

Is there a forecaster that is calibrated for all possible sequences?

NO. (Dawid, 1985.)
randomized calibration
However, if the forecaster is allowed to randomize then it is
possible! (Foster and Vohra, 1997).
This can be achieved by a simple modification of any regret
minimization procedure.
Set of actions (experts): {0, 1, . . . , N}.
At time $t$, assign loss $\ell_t(i) = (x_t - i/N)^2$ to action $i$.

One can now define a forecaster. Minimizing regret is not sufficient.
internal regret
Recall that the (expected) regret is
\[
\sum_{t=1}^n \ell_t(p_{t-1}) - \min_i \sum_{t=1}^n \ell_t(i)
= \max_i \sum_{t=1}^n \sum_j p_{j,t} \left( \ell_t(j) - \ell_t(i) \right) .
\]
The internal regret is defined by
\[
\max_{i,j} \sum_{t=1}^n p_{j,t} \left( \ell_t(j) - \ell_t(i) \right) ,
\]
where
\[
p_{j,t} \left( \ell_t(j) - \ell_t(i) \right)
= \mathbb{E}_t \left[ \mathbb{1}_{\{I_t = j\}} \left( \ell_t(j) - \ell_t(i) \right) \right]
\]
is the expected regret of having taken action $j$ instead of action $i$.
internal regret and calibration
By guaranteeing small internal regret, one obtains a calibrated
forecaster.
This can be achieved by an exponentially weighted average forecaster defined over $N^2$ actions.

It can be extended even to calibration with checking rules.
prediction with partial monitoring
For each round t = 1, . . . , n,
the environment chooses the next outcome $J_t \in \{1, \dots, M\}$ without revealing it;

the forecaster chooses a probability distribution $p_{t-1}$ and draws an action $I_t \in \{1, \dots, N\}$ according to $p_{t-1}$;

the forecaster incurs loss $\ell(I_t, J_t)$ and each action $i$ incurs loss $\ell(i, J_t)$; none of these values is revealed to the forecaster;

the feedback $h(I_t, J_t)$ is revealed to the forecaster.

$H = [h(i,j)]_{N \times M}$ is the feedback matrix.
$L = [\ell(i,j)]_{N \times M}$ is the loss matrix.
examples
Dynamic pricing. Here $M = N$, and $L = [\ell(i,j)]_{N \times N}$ where
\[
\ell(i,j) = \frac{(j - i)\, \mathbb{1}_{\{i \le j\}}}{N} + c\, \mathbb{1}_{\{i > j\}} ,
\qquad i, j = 1, \dots, N,
\]
and $h(i,j) = \mathbb{1}_{\{i > j\}}$ or $h(i,j) = a\, \mathbb{1}_{\{i \le j\}} + b\, \mathbb{1}_{\{i > j\}}$.

Multi-armed bandit problem. The only information the forecaster receives is his own loss: $H = L$.
examples
Apple tasting. $N = M = 2$.
\[
L = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},
\qquad
H = \begin{bmatrix} a & a \\ b & c \end{bmatrix} .
\]
The predictor only receives feedback when he chooses the second action.

Label efficient prediction. $N = 3$, $M = 2$.
\[
L = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix},
\qquad
H = \begin{bmatrix} a & b \\ c & c \\ c & c \end{bmatrix} .
\]
a general predictor
A forecaster first proposed by Piccolboni and Schindelhauer (2001).
Crucial assumption: $H$ can be encoded such that there exists an $N \times N$ matrix $K = [k(i,j)]_{N \times N}$ with
\[
L = K \cdot H .
\]
Thus,
\[
\ell(i,j) = \sum_{l=1}^N k(i,l)\, h(l,j) .
\]
Then we may estimate the losses by
\[
\tilde{\ell}(i, J_t) = \frac{k(i, I_t)\, h(I_t, J_t)}{p_{I_t, t-1}} .
\]
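A toy numerical check (ours, not from the survey) that this estimate is unbiased whenever $L = K \cdot H$: averaging $k(i, I_t) h(I_t, J_t) / p_{I_t}$ over the played action recovers $\ell(i, J_t)$ exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 3
K = rng.standard_normal((N, N))
H = rng.standard_normal((N, M))
L = K @ H                        # loss matrix, by construction L = K . H
p = rng.dirichlet(np.ones(N))    # any full-support sampling distribution
j = 1                            # a fixed outcome J_t

# E_t[ell_tilde(i, j)] = sum_k p_k * K[i, k] * H[k, j] / p_k = (K @ H)[i, j]
expected_estimate = sum(p[k] * K[:, k] * H[k, j] / p[k] for k in range(N))
print(np.allclose(expected_estimate, L[:, j]))  # True
```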
a general predictor
Observe that
\[
\mathbb{E}_t\, \tilde{\ell}(i, J_t)
= \sum_{k=1}^N p_{k,t-1}\, \frac{k(i,k)\, h(k, J_t)}{p_{k,t-1}}
= \sum_{k=1}^N k(i,k)\, h(k, J_t) = \ell(i, J_t) ,
\]
so $\tilde{\ell}(i, J_t)$ is an unbiased estimate of $\ell(i, J_t)$.

Let
\[
p_{i,t-1} = (1 - \gamma)\, \frac{e^{-\eta \tilde{L}_{i,t-1}}}{\sum_{k=1}^N e^{-\eta \tilde{L}_{k,t-1}}} + \frac{\gamma}{N} ,
\qquad \text{where } \tilde{L}_{i,t} = \sum_{s=1}^t \tilde{\ell}(i, J_s) .
\]
performance bound
With probability at least $1 - \delta$,
\[
\frac{1}{n} \sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\dots,N} \frac{1}{n} \sum_{t=1}^n \ell(i, J_t)
\le C\, n^{-1/3} N^{2/3} \sqrt{\ln(N/\delta)} ,
\]
where $C$ depends on $K$. (Cesa-Bianchi, Lugosi, Stoltz, 2006.)

Hannan consistency is achieved with rate $O(n^{-1/3})$ whenever $L = K \cdot H$.

This solves the dynamic pricing problem.

Bartók, Pál, and Szepesvári (2010): if $M = 2$, the only possible rates are $n^{-1/2}$, $n^{-1/3}$, and $1$.
imperfect monitoring: a general framework
$\mathcal{S}$ is a finite set of signals.

Feedback matrix: $H : \{1, \dots, N\} \times \{1, \dots, M\} \to \mathcal{P}(\mathcal{S})$.

For each round $t = 1, 2, \dots, n$,

the environment chooses the next outcome $J_t \in \{1, \dots, M\}$ without revealing it;

the forecaster chooses $p_{t-1}$ and draws an action $I_t \in \{1, \dots, N\}$ according to it;

the forecaster receives loss $\ell(I_t, J_t)$ and each action $i$ suffers loss $\ell(i, J_t)$; none of these values is revealed to the forecaster;

a feedback $s_t$, drawn at random according to $H(I_t, J_t)$, is revealed to the forecaster.
target
Define
\[
\ell(p, q) = \sum_{i,j} p_i q_j\, \ell(i,j)
\]
and
\[
H(\cdot, q) = \left( H(1, q), \dots, H(N, q) \right),
\qquad \text{where } H(i, q) = \sum_j q_j H(i,j) .
\]
Denote by $\mathcal{F}$ the set of those $\sigma$ that can be written as $H(\cdot, q)$ for some $q$: $\mathcal{F}$ is the set of "observable" vectors of signal distributions.

The key quantity is
\[
\rho(p, \sigma) = \max_{q \,:\, H(\cdot, q) = \sigma} \ell(p, q) .
\]
$\rho$ is convex in $p$ and concave in $\sigma$.
rustichini’s theorem
The value of the base one-shot game is
\[
\min_p \max_q \ell(p, q) = \min_p \max_{\sigma \in \mathcal{F}} \rho(p, \sigma) .
\]
If $\bar{q}_n$ is the empirical distribution of $J_1, \dots, J_n$, then even with the knowledge of $H(\cdot, \bar{q}_n)$ we cannot hope to do better than $\min_p \rho(p, H(\cdot, \bar{q}_n))$.

Rustichini (1999) proved that there exists a strategy such that for all strategies of the opponent, almost surely,
\[
\limsup_{n \to \infty} \left( \frac{1}{n} \sum_{t=1}^n \ell(I_t, J_t) - \min_p \rho\left( p, H(\cdot, \bar{q}_n) \right) \right) \le 0 .
\]
rustichini’s theorem
Rustichini’s proof relies on an approachability theorem for a
continuum of types (Mertens, Sorin, and Zamir, 1994).
It is non-constructive.
It does not imply any convergence rate.
Lugosi, Mannor, and Stoltz (2008) construct efficiently computable
strategies that guarantee fast rates of convergence.
combinatorial bandits
The class of actions is a set $S \subset \{0,1\}^d$ of cardinality $|S| = N$.

At each time $t$, a loss is assigned to each component: $\ell_t \in \mathbb{R}^d$.

Loss of expert $v \in S$ is $\ell_t(v) = \ell_t^\top v$.

Forecaster chooses $I_t \in S$.

The goal is to control the regret
\[
\sum_{t=1}^n \ell_t(I_t) - \min_{k=1,\dots,N} \sum_{t=1}^n \ell_t(k) .
\]
combinatorial bandits
Three models.
(Full information.) All d components of the loss vector are
observed.
(Semi-bandit.) Only the components corresponding to the chosen
object are observed.
(Bandit.) Only the total loss of the chosen object is observed.
Challenge: is $O(n^{1/2}\, \mathrm{poly}(d))$ regret achievable for the semi-bandit and bandit problems?
combinatorial prediction game
Adversary vs. Player: the adversary assigns losses $\ell_1, \dots, \ell_d$ to the edges of a graph; the player picks a path between two fixed nodes and suffers the sum of the losses of the chosen edges, e.g. $\ell_2 + \ell_7 + \cdots + \ell_{d-1}$.

[Illustrations © Google. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.]

Feedback:
Full Info: $\ell_1, \ell_2, \dots, \ell_d$;
Semi-Bandit: $\ell_2, \ell_7, \dots, \ell_{d-1}$ (the losses of the chosen edges);
Bandit: $\ell_2 + \ell_7 + \cdots + \ell_{d-1}$ (only the total loss).
notation
$S \subset \{0,1\}^d$, and at each time $t$ the adversary picks $\ell_t \in \mathbb{R}_+^d$.

The player picks $V_t \in S$ and suffers loss $\ell_t^\top V_t$.

regret:
\[
R_n = \mathbb{E} \sum_{t=1}^n \ell_t^\top V_t - \min_{u \in S} \mathbb{E} \sum_{t=1}^n \ell_t^\top u .
\]
loss assumption: $|\ell_t^\top v| \le 1$ for all $v \in S$ and $t = 1, \dots, n$.
weighted average forecaster
At time $t$ assign a weight $w_{t,i}$ to each $i = 1, \dots, d$. The weight of each $v_k \in S$ is
\[
\overline{w}_t(k) = \prod_{i \,:\, v_k(i) = 1} w_{t,i} .
\]
Let $q_{t-1}(k) = \overline{w}_{t-1}(k) / \sum_{k'=1}^N \overline{w}_{t-1}(k')$.

At each time $t$, draw $K_t$ from the distribution
\[
p_{t-1}(k) = (1 - \gamma)\, q_{t-1}(k) + \gamma\, \mu(k) ,
\]
where $\mu$ is a fixed distribution on $S$ and $\gamma > 0$. Here
\[
w_{t,i} = \exp\left( -\eta \tilde{L}_{t,i} \right),
\]
where $\tilde{L}_{t,i} = \tilde{\ell}_{1,i} + \cdots + \tilde{\ell}_{t,i}$ and $\tilde{\ell}_{t,i}$ is an estimated loss.
loss estimates
Dani, Hayes, and Kakade (2008).
Define the scaled incidence vector
\[
X_t = \ell_t(K_t)\, V_{K_t} ,
\]
where $K_t$ is distributed according to $p_{t-1}$.

Let $P_{t-1} = \mathbb{E}\left[ V_{K_t} V_{K_t}^\top \right]$ be the $d \times d$ correlation matrix. Hence
\[
P_{t-1}(i,j) = \sum_{k \,:\, v_k(i) = v_k(j) = 1} p_{t-1}(k) .
\]
Similarly, let $Q_{t-1}$ and $M$ be the correlation matrices $\mathbb{E}\left[ V V^\top \right]$ when $V$ has law $q_{t-1}$ and $\mu$, respectively. Then
\[
P_{t-1}(i,j) = (1 - \gamma)\, Q_{t-1}(i,j) + \gamma\, M(i,j) .
\]
The vector of loss estimates is defined by
\[
\tilde{\ell}_t = P_{t-1}^{+} X_t ,
\]
where $P_{t-1}^{+}$ is the pseudo-inverse of $P_{t-1}$.
key properties
$M M^{+} v = v$ for all $v \in S$.

$Q_{t-1}$ and $P_{t-1}$ are positive semidefinite for every $t$.

$P_{t-1} P_{t-1}^{+} v = v$ for all $t$ and $v \in S$.

By definition,
\[
\mathbb{E}_t\, X_t = P_{t-1}\, \ell_t ,
\]
and therefore
\[
\mathbb{E}_t\, \tilde{\ell}_t = P_{t-1}^{+}\, \mathbb{E}_t\, X_t = \ell_t .
\]
An unbiased estimate!
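A small numerical check (ours) of this unbiasedness on a toy action set whose span is all of $\mathbb{R}^3$:

```python
import numpy as np

# With X = (ell^T v_K) v_K and P = E[V V^T], the expectation of P^+ X is
# P^+ P ell, which equals ell whenever ell lies in span(S).
rng = np.random.default_rng(5)
S = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)  # 3 actions, span = R^3
p = np.array([0.5, 0.3, 0.2])                                  # sampling distribution
ell = rng.random(3)

P = (S.T * p) @ S               # P[i, j] = sum_k p_k v_k(i) v_k(j)
expected_X = sum(p[k] * (ell @ S[k]) * S[k] for k in range(3))  # E[X]
expected_estimate = np.linalg.pinv(P) @ expected_X
print(np.allclose(expected_estimate, ell))  # True
```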
performance bound
The regret of the forecaster satisfies
\[
\frac{1}{n} \left( \mathbb{E} \hat{L}_n - \min_{k=1,\dots,N} L_n(k) \right)
\le 2 \sqrt{\left( \frac{2 B^2}{d\, \lambda_{\min}(M)} + 1 \right) \frac{d \ln N}{n}} ,
\]
where
\[
\lambda_{\min}(M) = \min_{x \in \mathrm{span}(S) \,:\, \|x\| = 1} x^\top M x > 0
\]
is the smallest "relevant" eigenvalue of $M$. (Cesa-Bianchi and Lugosi, 2009.)

A large $\lambda_{\min}(M)$ is needed to make sure no $|\tilde{\ell}_{t,i}|$ is too large.
performance bound
Other bounds:
$B \sqrt{d \ln N / n}$ (Dani, Hayes, and Kakade). No condition on $S$; sampling is over a barycentric spanner.

$d \sqrt{(\theta \ln n)/n}$ (Abernethy, Hazan, and Rakhlin). Computationally efficient.
eigenvalue bounds
\[
\lambda_{\min}(M) = \min_{x \in \mathrm{span}(S) \,:\, \|x\| = 1} \mathbb{E}\left[ (V^\top x)^2 \right] ,
\]
where $V$ has distribution $\mu$ over $S$.

In many cases it suffices to take $\mu$ uniform.
multitask bandit problem
The decision maker acts in m games in parallel.
In each game, the decision maker selects one of R possible actions.
After selecting the m actions, the sum of the losses is observed.
\[
\lambda_{\min}(M) = \frac{1}{R} ,
\qquad
\max_k \mathbb{E}\left[ \hat{L}_n - L_n(k) \right] \le 2m \sqrt{3 n R \ln R} .
\]
The price of only observing the sum of losses is a factor of $m$.

Generating a random joint action can be done in polynomial time.
assignments
Perfect matchings of Km,m .
At each time one of the N = m! perfect matchings of Km,m is
selected.
\[
\lambda_{\min}(M) = \frac{1}{m - 1} ,
\qquad
\max_k \mathbb{E}\left[ \hat{L}_n - L_n(k) \right] \le 2m \sqrt{3 n \ln(m!)} .
\]
Only a factor of $m$ worse than the naive full-information bound.
spanning trees
In a network of $m$ nodes, the cost of communication between two nodes joined by edge $e$ is $\ell_t(e)$ at time $t$. At each time a minimal connected subnetwork (a spanning tree) is selected. The goal is to minimize the total cost. $N = m^{m-2}$.
\[
\lambda_{\min}(M) = \frac{1}{m} - O\left( \frac{1}{m^2} \right) .
\]
The entries of $M$ are
\[
\mathbb{P}\{V_i = 1\} = \frac{2}{m} ,
\qquad
\mathbb{P}\{V_i = 1, V_j = 1\} =
\begin{cases}
3/m^2 & \text{if } i \sim j , \\
4/m^2 & \text{if } i \not\sim j .
\end{cases}
\]
stars
At each time a central node of a network of m nodes is selected.
Cost is the total cost of the edges adjacent to the node.
\[
\lambda_{\min}(M) = 1 - O\left( \frac{1}{m} \right) .
\]
cut sets
A balanced cut in $K_{2m}$ is the collection of all edges between a set of $m$ vertices and its complement. Each balanced cut has $m^2$ edges and there are $N = \binom{2m}{m}$ balanced cuts.
\[
\lambda_{\min}(M) = \frac{1}{4} - O\left( \frac{1}{m^2} \right) .
\]
Choosing from the exponentially weighted average distribution is equivalent to sampling from a ferromagnetic Ising model. FPRAS by Randall and Wilson (1999).
hamiltonian cycles
A Hamiltonian cycle in $K_m$ is a cycle that visits each vertex exactly once and returns to the starting vertex. $N = (m-1)!/2$. Here $\lambda_{\min}(M)$ is of order $1/m$.

Efficient computation is hopeless.
sampling paths
In all these examples µ is uniform over S.
For path planning it does not always work.
What is the optimal choice of µ?
What is the optimal way of exploration?
minimax regret
\[
\overline{R}_n = \inf_{\text{strategy}} \; \max_{S \subset \{0,1\}^d} \; \sup_{\text{adversary}} \; R_n .
\]
Theorem. Let $n \ge d^2$. In the full information and semi-bandit games, we have
\[
0.008\, d \sqrt{n} \le \overline{R}_n \le d \sqrt{2n} ,
\]
and in the bandit game,
\[
0.01\, d^{3/2} \sqrt{n} \le \overline{R}_n \le 2\, d^{5/2} \sqrt{2n} .
\]
proof
upper bounds: $D = [0, +\infty)^d$, $F(x) = \frac{1}{\eta} \sum_{i=1}^d x_i \log x_i$ works for full information, but it is only optimal up to a logarithmic factor in the semi-bandit case. In the bandit case it does not work at all! The exponentially weighted average forecaster is used.

lower bounds: careful construction of a randomly chosen set $S$ in each case.
a general strategy
Let $D$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\mathrm{int}(D)$.

A function $F : D \to \mathbb{R}$ is Legendre if

• $F$ is strictly convex and admits continuous first partial derivatives on $\mathrm{int}(D)$;

• for $u \in \partial D$ and $v \in \mathrm{int}(D)$, we have
\[
\lim_{s \to 0,\, s > 0} (u - v)^\top \nabla F\left( (1 - s) u + s v \right) = +\infty .
\]
The Bregman divergence $D_F : D \times \mathrm{int}(D) \to \mathbb{R}$ associated to a Legendre function $F$ is
\[
D_F(u, v) = F(u) - F(v) - (u - v)^\top \nabla F(v) .
\]
CLEB (Combinatorial LEarning with Bregman divergences)
Parameter: F Legendre on D
(1) Gradient step: find $w'_{t+1} \in D$ such that
\[
\nabla F(w'_{t+1}) = \nabla F(w_t) - \tilde{\ell}_t .
\]
(2) Projection step:
\[
w_{t+1} \in \arg\min_{w \in \mathrm{Conv}(S)} D_F(w, w'_{t+1}) .
\]
(3) Randomization: choose a distribution $p_{t+1}$ on $S$ such that $w_{t+1} = \mathbb{E}_{V \sim p_{t+1}} V$.
General regret bound for CLEB
Theorem
If $F$ admits a Hessian $\nabla^2 F$ that is always invertible, then
\[
R_n \lesssim \mathrm{diam}_{D_F}(S) + \mathbb{E} \sum_{t=1}^n \tilde{\ell}_t^\top \left( \nabla^2 F(w_t) \right)^{-1} \tilde{\ell}_t .
\]
Different instances of CLEB: LinExp (entropy function)

$D = [0, +\infty)^d$, $F(x) = \frac{1}{\eta} \sum_{i=1}^d x_i \log x_i$.

Full Info: exponentially weighted average.
Semi-Bandit = Bandit: Exp3 (Auer et al., 2002).

Full Info: Component Hedge (Koolen, Warmuth, and Kivinen, 2010).
Semi-Bandit: MW (Kale, Reyzin, and Schapire, 2010).
Bandit: new algorithm.
Different instances of CLEB: LinINF (exchangeable Hessian)

$D = [0, +\infty)^d$, $F(x) = \sum_{i=1}^d \int_0^{x_i} \psi^{-1}(s)\, ds$.

INF, Audibert and Bubeck (2009):
\[
\psi(x) = \exp(\eta x) : \text{LinExp};
\qquad
\psi(x) = (-\eta x)^{-q}, \; q > 1 : \text{LinPoly}.
\]
Different instances of CLEB: Follow the regularized leader

$D = \mathrm{Conv}(S)$; then
\[
w_{t+1} \in \arg\min_{w \in D} \left( \sum_{s=1}^t \tilde{\ell}_s^\top w + F(w) \right) .
\]
A particularly interesting choice: $F$ a self-concordant barrier function (Abernethy, Hazan, and Rakhlin, 2008).
MIT OpenCourseWare
http://ocw.mit.edu
18.657 Mathematics of Machine Learning
Fall 2015
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.