18.657: Mathematics of Machine Learning
Lecture 20
Nov. 23, 2015
Lecturer: Philippe Rigollet
Scribe: Vira Semenova
In this lecture, we talk about adversarial bandits under limited feedback. The adversarial bandit problem is a setup in which the loss function l(a, z) : A \times Z \to \mathbb{R} is deterministic. Limited feedback means that the information available to the decision maker (DM) after step t is I_t = \{l(a_1, z_1), \ldots, l(a_t, z_t)\}, i.e., it consists only of the realized losses of the past steps.
5. ADVERSARIAL BANDITS
Consider the problem of prediction with expert advice. Let Z be the set of adversary moves and A = \{e_1, \ldots, e_K\} the set of actions of the decision maker. At time t, a_t \in A and z_t \in Z are revealed simultaneously. Denote by l(a_t, z_t) the loss associated with the decision a_t \in A when the adversary plays z_t. We compare the total loss after n steps to the minimum expert loss, namely
\[ \min_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t), \]
where e_j is the choice of expert j \in \{1, 2, \ldots, K\}.
The cumulative regret is then defined as
\[ R_n = \sum_{t=1}^{n} l(a_t, z_t) - \min_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t). \]
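For concreteness, here is a minimal sketch (not from the lecture) of how R_n would be computed in hindsight, assuming the full n × K table of losses were available; the names losses and actions are hypothetical placeholders, with losses[t, j] standing for l(e_j, z_t) and actions[t] for the index of a_t.

    import numpy as np

    def cumulative_regret(losses, actions):
        # losses: (n, K) array with losses[t, j] = l(e_j, z_t)
        # actions: length-n integer array with the index of the action a_t played at round t
        n = losses.shape[0]
        incurred = losses[np.arange(n), actions].sum()   # sum_t l(a_t, z_t)
        best_expert = losses.sum(axis=0).min()           # min_j sum_t l(e_j, z_t)
        return incurred - best_expert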
The feedback at step t can be either full or limited. The full feedback setup means that the vector f_t = (l(e_1, z_t), \ldots, l(e_K, z_t))^\top of losses incurred at the adversary's choice z_t by each action e_j \in \{e_1, \ldots, e_K\} is observed after each step t. Hence, the information available to the DM after step t is I_t = \cup_{t'=1}^{t} \{l(e_1, z_{t'}), \ldots, l(e_K, z_{t'})\}. Limited feedback means that the time-t feedback consists of the realized loss l(a_t, z_t) only; namely, the information available to the DM is I_t = \{l(a_1, z_1), \ldots, l(a_t, z_t)\}. An example of the first setup is portfolio optimization, where the losses of all possible portfolios are observed at time t. Examples of the second setup are path planning and dynamic pricing, where only the loss of the chosen decision is observed. This lecture considers the limited feedback setup.
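The two information models differ only in what is revealed at round t. A tiny illustrative sketch (the names and numbers are hypothetical, with loss_t playing the role of (l(e_1, z_t), ..., l(e_K, z_t))):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 5
    loss_t = rng.random(K)            # hypothetical loss vector (l(e_1, z_t), ..., l(e_K, z_t))
    a_t = 2                           # index of the action played at round t

    feedback_full = loss_t            # full feedback: the entire loss vector is revealed
    feedback_limited = loss_t[a_t]    # limited feedback: only the scalar l(a_t, z_t) is revealed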
The two strategies defined in the past lectures were exponential weights, which yields a regret of order R_n \le c\sqrt{n \log K}, and Follow the Perturbed Leader. We would like to play exponential weights, defined as
\[ p_{t,j} = \frac{\exp\big(-\eta \sum_{s=1}^{t-1} l(e_j, z_s)\big)}{\sum_{k=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} l(e_k, z_s)\big)}. \]
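Computationally, this rule is just a softmax of the negative cumulative losses. A minimal sketch (not from the lecture), assuming a hypothetical length-K vector cum_losses holding \sum_{s=1}^{t-1} l(e_j, z_s) for each j:

    import numpy as np

    def exp_weights(cum_losses, eta):
        # cum_losses[j] = sum_{s < t} l(e_j, z_s); eta > 0 is the learning rate
        logits = -eta * np.asarray(cum_losses, dtype=float)
        logits -= logits.max()        # subtract the max for numerical stability
        w = np.exp(logits)
        return w / w.sum()            # p_{t,j} proportional to exp(-eta * cumulative loss)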
This decision rule is not feasible, since the loss l(e_j, z_t) is not part of the feedback if e_j \ne a_t. We will estimate it by
\[ \hat{l}(e_j, z_t) = \frac{l(e_j, z_t)\, \mathbb{1}(a_t = e_j)}{P(a_t = e_j)}. \]
Lemma: \hat{l}(e_j, z_t) is an unbiased estimator of l(e_j, z_t).

Proof.
\[ \mathbb{E}_{a_t} \hat{l}(e_j, z_t) = \sum_{k=1}^{K} \frac{l(e_j, z_t)\, \mathbb{1}(e_k = e_j)}{P(a_t = e_j)}\, P(a_t = e_k) = l(e_j, z_t). \]
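A quick Monte Carlo sanity check of the lemma (not part of the lecture; all names and values below are illustrative): draw a_t from a fixed distribution p_t with positive entries and verify that the importance-weighted estimate averages to l(e_j, z_t).

    import numpy as np

    rng = np.random.default_rng(0)
    K = 4
    p_t = np.array([0.1, 0.2, 0.3, 0.4])      # distribution of a_t (all entries positive)
    loss_t = np.array([0.9, 0.5, 0.2, 0.7])   # hypothetical losses l(e_j, z_t)
    j = 1                                     # expert whose loss we estimate

    draws = rng.choice(K, size=200_000, p=p_t)      # samples of a_t
    est = loss_t[j] * (draws == j) / p_t[j]         # hat{l}(e_j, z_t) for each draw
    print(est.mean(), loss_t[j])                    # the two numbers should be close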
Definition (Exp3 Algorithm): Let \eta > 0 be fixed. Define the exponential weights as
\[ p_{t+1,j} = \frac{\exp\big(-\eta \sum_{s=1}^{t} \hat{l}(e_j, z_s)\big)}{\sum_{k=1}^{K} \exp\big(-\eta \sum_{s=1}^{t} \hat{l}(e_k, z_s)\big)}, \]
and at each round draw the action a_t at random according to the distribution p_t = (p_{t,1}, \ldots, p_{t,K}), so that P(a_t = e_j) = p_{t,j}.
(Exp3 stands for Exponential weights for Exploration and Exploitation.)
We will show that the regret of Exp3 is bounded by \sqrt{2 n K \log K}. This bound is \sqrt{K} times bigger than the bound on the regret under full feedback. The \sqrt{K} multiplier is the price of having a smaller information set at time t. There are methods that allow one to get rid of the \sqrt{\log K} factor in this expression. On the other hand, it can be shown that \sqrt{2 n K} is the optimal regret.
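The update is straightforward to implement. Below is a minimal sketch of Exp3 (an illustration, not the lecture's reference code), assuming losses in [0, 1]; the environment is represented by a hypothetical function loss_fn(t, j) returning l(e_j, z_t), which the algorithm queries only at the arm it actually plays.

    import numpy as np

    def exp3(loss_fn, n, K, eta, seed=0):
        # Run Exp3 for n rounds over K arms; return the sequence of incurred losses.
        rng = np.random.default_rng(seed)
        cum_hat = np.zeros(K)               # cumulative estimates sum_s hat{l}(e_j, z_s)
        incurred = np.zeros(n)
        for t in range(n):
            logits = -eta * cum_hat
            logits -= logits.max()          # numerical stability
            p = np.exp(logits)
            p /= p.sum()                    # exponential-weights distribution p_t
            a = rng.choice(K, p=p)          # draw a_t ~ p_t
            loss = loss_fn(t, a)            # only the played arm's loss is observed
            cum_hat[a] += loss / p[a]       # importance-weighted estimate hat{l}(e_{a_t}, z_t)
            incurred[t] = loss
        return incurred

For example, exp3(lambda t, j: loss_table[t, j], n, K, eta) would replay a pre-generated n × K loss table while only ever reading the entry of the played arm.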
Proof. Let W_{t,j} = \exp(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)), W_t = \sum_{j=1}^{K} W_{t,j}, and p_t = \sum_{j=1}^{K} \frac{W_{t,j}}{W_t} e_j. Then
\[ \log\Big(\frac{W_{t+1}}{W_t}\Big) = \log\Bigg(\frac{\sum_{j=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)\big)\exp\big(-\eta \hat{l}(e_j, z_t)\big)}{\sum_{j=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)\big)}\Bigg) \tag{5.1} \]
\[ = \log\Big(\mathbb{E}_{J \sim p_t} \exp\big(-\eta \hat{l}(e_J, z_t)\big)\Big) \tag{5.2} \]
\[ \le^{*} \log\Big(1 - \eta\, \mathbb{E}_{J \sim p_t} \hat{l}(e_J, z_t) + \frac{\eta^2}{2}\, \mathbb{E}_{J \sim p_t} \hat{l}^2(e_J, z_t)\Big), \tag{5.3} \]
where the inequality * is obtained by plugging x = \hat{l}(e_J, z_t) into the inequality \exp(-\eta x) \le 1 - \eta x + \frac{\eta^2 x^2}{2}, which holds for all x \ge 0.
Moreover,
\[ \mathbb{E}_{J \sim p_t} \hat{l}(e_J, z_t) = \sum_{j=1}^{K} p_{t,j}\, \hat{l}(e_j, z_t) = \sum_{j=1}^{K} p_{t,j}\, \frac{l(e_j, z_t)\, \mathbb{1}(a_t = e_j)}{P(a_t = e_j)} = l(a_t, z_t) \tag{5.4} \]
and
\[ \mathbb{E}_{J \sim p_t} \hat{l}^2(e_J, z_t) = \sum_{j=1}^{K} p_{t,j}\, \hat{l}^2(e_j, z_t) = \sum_{j=1}^{K} p_{t,j}\, \frac{l^2(e_j, z_t)\, \mathbb{1}(a_t = e_j)}{P^2(a_t = e_j)} = \frac{l^2(a_t, z_t)}{p_{a_t,t}} \le \frac{1}{p_{a_t,t}}, \tag{5.5, 5.6} \]
where we used P(a_t = e_j) = p_{t,j}, the notation p_{a_t,t} for the probability that p_t assigns to the chosen action a_t, and, in the last inequality, the fact that l(a, z) \in [0, 1].
Summing over t from 1 through n and using \log(1 + u) \le u, we get
\[ \log(W_{n+1}) \le \log(W_1) - \eta \sum_{t=1}^{n} l(a_t, z_t) + \frac{\eta^2}{2} \sum_{t=1}^{n} \frac{1}{p_{a_t,t}}. \]
For t = 1, we initialize W_{1,j} = 1, so that W_1 = K.
Since \mathbb{E}_{a_t \sim p_t}\big[\tfrac{1}{p_{a_t,t}}\big] = \sum_{j=1}^{K} \frac{p_{t,j}}{p_{t,j}} = K, taking expectations the expression above becomes
\[ \mathbb{E} \log(W_{n+1}) - \log K \le -\eta\, \mathbb{E} \sum_{t=1}^{n} l(a_t, z_t) + \frac{\eta^2 K n}{2}. \]
Noting that \log(W_{n+1}) = \log\big(\sum_{j=1}^{K} \exp\big(-\eta \sum_{s=1}^{n} \hat{l}(e_j, z_s)\big)\big) and defining j^* = \mathrm{argmin}_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t), we obtain
\[ \log(W_{n+1}) \ge \log\Big(\exp\big(-\eta \sum_{s=1}^{n} \hat{l}(e_{j^*}, z_s)\big)\Big) = -\eta \sum_{s=1}^{n} \hat{l}(e_{j^*}, z_s), \]
and hence, since \hat{l} is unbiased, \mathbb{E} \log(W_{n+1}) \ge -\eta \sum_{s=1}^{n} l(e_{j^*}, z_s).
Together, the last two displays give
\[ \mathbb{E} \sum_{t=1}^{n} l(a_t, z_t) - \min_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t) \le \frac{\log K}{\eta} + \frac{\eta K n}{2}. \]
The choice \eta := \sqrt{\frac{2 \log K}{n K}}, which balances the two terms on the right-hand side, yields the bound \mathbb{E} R_n \le \sqrt{2 n K \log K}.
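To get a feel for the constants, one can plug in illustrative values, say K = 10 and n = 10000 (these numbers are not from the lecture):

    import math

    K, n = 10, 10_000
    eta = math.sqrt(2 * math.log(K) / (n * K))          # optimal learning rate, about 0.0068
    bandit_bound = math.sqrt(2 * n * K * math.log(K))   # Exp3 bound, about 679
    full_bound = math.sqrt(2 * n * math.log(K))         # full-feedback bound, about 215
    print(eta, bandit_bound, full_bound)                # the ratio of the bounds is sqrt(K), about 3.16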
MIT OpenCourseWare
http://ocw.mit.edu
18.657 Mathematics of Machine Learning
Fall 2015
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.