18.657: Mathematics of Machine Learning
Lecture 18
Nov. 2, 2015
Lecturer: Philippe Rigollet
Scribe: Haihao (Sean) Lu
3. STOCHASTIC BANDITS
3.1 Setup
The stochastic multi-armed bandit is a classical model for decision making and is defined
as follows:
There are K arms (different actions). Iteratively, a decision maker chooses an arm k ∈
{1, . . . , K}, yielding a sequence X_{k,1}, . . . , X_{k,t}, . . . of i.i.d. random variables with
mean µ_k. Define µ* = max_j µ_j and let * ∈ argmax_j µ_j denote an optimal arm. A policy π is a
sequence {π_t}_{t≥1} indicating which arm is pulled at time t: π_t ∈ {1, . . . , K} and it depends
only on the observations strictly anterior to t. The regret is then defined as:
Rn = max_k IE[ ∑_{t=1}^n X_{k,t} ] − IE[ ∑_{t=1}^n X_{π_t,t} ]
   = nµ* − IE[ ∑_{t=1}^n X_{π_t,t} ]
   = nµ* − ∑_{t=1}^n IE[ IE[ X_{π_t,t} | π_t ] ]
   = ∑_{k=1}^K ∆_k IE[T_k(n)] ,
where ∆_k = µ* − µ_k and T_k(n) = ∑_{t=1}^n 1I(π_t = k) is the number of times arm k was
pulled; the last equality above follows since IE[X_{π_t,t} | π_t = k] = µ_k, so grouping the
rounds by the arm pulled gives IE[∑_{t=1}^n X_{π_t,t}] = ∑_{k=1}^K µ_k IE[T_k(n)].
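To make the setup concrete, here is a minimal simulation sketch in Python (not part of the original notes; the Gaussian reward model, the arm means, and all function names are illustrative). It runs an arbitrary policy on a K-armed bandit with subGaussian (here Gaussian) rewards and reports the realized pseudo-regret ∑_k ∆_k T_k(n), whose expectation is Rn.

```python
import numpy as np

def simulate_pseudo_regret(mu, policy, n, sigma=1.0, seed=0):
    """Run `policy` for n rounds on a K-armed Gaussian bandit with means `mu`.

    `policy(t, history)` returns the arm pulled at round t (0-indexed arms);
    `history` is the list of (arm, reward) pairs observed so far.
    Returns the realized pseudo-regret sum_k Delta_k * T_k(n).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    pulls = np.zeros(len(mu), dtype=int)      # T_k(n)
    history = []
    for t in range(n):
        k = policy(t, history)
        reward = rng.normal(mu[k], sigma)     # X_{k,t}: Gaussian, hence subGaussian
        pulls[k] += 1
        history.append((k, reward))
    return float(np.sum((mu.max() - mu) * pulls))   # sum_k Delta_k * T_k(n)

# A policy that always pulls arm 0 incurs pseudo-regret n * Delta_0 here (0.2 * 1000).
print(simulate_pseudo_regret([0.3, 0.5], policy=lambda t, history: 0, n=1000))
```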
3.2 Warm Up: Full Info Case
Assume in this subsection that K = 2 and that we observe the full information vector
(X_{1,t}, . . . , X_{K,t}) at time t after choosing π_t. In each iteration, a natural idea is then
to choose the arm with the highest average reward so far, that is,
π_t = argmax_{k=1,2} X̄_{k,t} ,
where
X̄_{k,t} = (1/t) ∑_{s=1}^t X_{k,s} .
Assume from now on that all random variables X_{k,t} are subGaussian with variance proxy σ²,
which means IE[e^{u(X − IE[X])}] ≤ e^{u²σ²/2} for all u ∈ IR. For example, N(0, σ²) is subGaussian
with variance proxy σ², and any bounded random variable X ∈ [a, b] is subGaussian with variance
proxy (b − a)²/4 by Hoeffding's Lemma.
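As a quick numerical illustration (not part of the original notes; the distribution and the values of u are chosen for illustration), the following Python snippet checks the moment generating function bound for a centered variable supported on {−1/2, 1/2}, for which Hoeffding's Lemma gives variance proxy (b − a)²/4 = 1/4.

```python
import numpy as np

# Centered random variable on {-1/2, +1/2} with equal probabilities:
# Hoeffding's Lemma gives variance proxy sigma^2 = (b - a)^2 / 4 = 1/4.
sigma2 = 0.25
for u in [0.5, 1.0, 2.0, 4.0]:
    mgf = 0.5 * np.exp(u / 2) + 0.5 * np.exp(-u / 2)  # IE[exp(uX)] = cosh(u/2)
    bound = np.exp(u ** 2 * sigma2 / 2)               # exp(u^2 sigma^2 / 2)
    print(f"u = {u}: IE[exp(uX)] = {mgf:.4f} <= {bound:.4f}: {mgf <= bound}")
```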
In the two-armed full-information setting above, the regret decomposition gives
Rn = ∆ IE[T_2(n)] ,      (3.1)
where ∆ = µ_1 − µ_2 > 0 (assuming without loss of generality that arm 1 is the better arm). Moreover,
T_2(n) = 1 + ∑_{t=2}^n 1I(X̄_{2,t} > X̄_{1,t})
       = 1 + ∑_{t=2}^n 1I(X̄_{2,t} − X̄_{1,t} − (µ_2 − µ_1) ≥ ∆) .
It is easy to check that (X̄_{2,t} − X̄_{1,t}) − (µ_2 − µ_1) is centered subGaussian with variance
proxy 2σ²/t, whereby
IE[1I(X̄_{2,t} > X̄_{1,t})] ≤ e^{−t∆²/(4σ²)}
by a simple Chernoff bound. Therefore,
Rn ≤ ∆ ( 1 + ∑_{t=1}^∞ e^{−t∆²/(4σ²)} ) ≤ ∆ + 4σ²/∆ ,      (3.2)
whereby the benchmark is Rn ≤ ∆ + 4σ²/∆.
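As a sanity check on (3.2), here is a small simulation sketch (Python with NumPy; the means, horizon, and number of runs are arbitrary choices, not from the notes) of the full-information policy above with K = 2 Gaussian arms: the average of ∆ T_2(n) over many runs should stay below the benchmark ∆ + 4σ²/∆.

```python
import numpy as np

def full_info_regret(mu1, mu2, n, sigma=1.0, runs=2000, seed=0):
    """Average pseudo-regret of 'pull the arm with the best running average',
    when both arms are observed at every round (full information), K = 2."""
    rng = np.random.default_rng(seed)
    delta = mu1 - mu2                                 # assume mu1 > mu2
    total = 0.0
    for _ in range(runs):
        x1 = rng.normal(mu1, sigma, size=n)
        x2 = rng.normal(mu2, sigma, size=n)
        xbar1 = np.cumsum(x1) / np.arange(1, n + 1)   # running averages X-bar_{1,t}
        xbar2 = np.cumsum(x2) / np.arange(1, n + 1)
        # T_2(n) = 1 + sum_{t=2}^n 1I(X-bar_{2,t} > X-bar_{1,t})
        t2 = 1 + np.sum(xbar2[1:] > xbar1[1:])
        total += delta * t2
    return total / runs

delta, sigma = 0.5, 1.0
print(full_info_regret(delta, 0.0, n=1000, sigma=sigma))   # empirical pseudo-regret
print(delta + 4 * sigma ** 2 / delta)                      # benchmark: 8.5
```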
3.3 Upper Confidence Bound (UCB)
Without loss of generality, from now on we assume σ = 1. A naive idea is the following: after s
pulls of arm k, use µ̂_{k,s} = (1/s) ∑_{j ∈ pulls of k} X_{k,j} and choose the arm with the largest µ̂_{k,s}.
The problem with this naive policy is that some arm might be tried only a limited number of
times, yield a bad average by chance, and then never be tried again. To overcome this limitation,
a better idea is to choose the arm with the highest upper bound estimate on its mean at some
probability level. Note that an arm with fewer pulls may have a larger deviation of its empirical
mean from its true mean. This is called the Upper Confidence Bound policy.
Algorithm 1 Upper Confidence Bound (UCB)
for t = 1 to K do
    π_t = t
end for
for t = K + 1 to n do
    T_k(t) = ∑_{s=1}^{t−1} 1I(π_s = k)      (number of times arm k has been pulled before time t)
    µ̂_{k,t} = (1 / T_k(t)) ∑_{s=1}^{t−1} X_{k,s} 1I(π_s = k)
    π_t ∈ argmax_{k ∈ [K]} { µ̂_{k,t} + 2 √(2 log(t) / T_k(t)) }
end for
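Below is a minimal Python sketch of Algorithm 1 (not part of the notes; NumPy only, Gaussian rewards with σ = 1, arbitrary tie-breaking via argmax). It uses the index µ̂_{k,t} + 2√(2 log t / T_k(t)) exactly as above and returns the realized pseudo-regret.

```python
import numpy as np

def ucb(mu, n, seed=0):
    """Algorithm 1 (UCB) on a K-armed Gaussian bandit with unit variance proxy."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    pulls = np.zeros(K, dtype=int)    # T_k(t)
    sums = np.zeros(K)                # cumulative reward of each arm
    for t in range(1, n + 1):
        if t <= K:
            k = t - 1                 # first K rounds: pull each arm once
        else:
            mu_hat = sums / pulls                           # empirical means
            bonus = 2.0 * np.sqrt(2.0 * np.log(t) / pulls)  # exploration bonus
            k = int(np.argmax(mu_hat + bonus))              # largest UCB index
        reward = rng.normal(mu[k], 1.0)
        pulls[k] += 1
        sums[k] += reward
    return float(np.sum((mu.max() - mu) * pulls))   # pseudo-regret sum_k Delta_k T_k(n)

print(ucb([0.5, 0.3, 0.1], n=5000))
```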
Theorem: The UCB policy has regret
Rn ≤ 8 ∑_{k: ∆_k > 0} (log n) / ∆_k + (1 + π²/3) ∑_{k=1}^K ∆_k .
Proof. From now on we fix k such that ∆_k > 0. Then
IE[T_k(n)] = 1 + ∑_{t=K+1}^n IP(π_t = k) .
Note that for t > K,
{π_t = k} ⊆ { µ̂_{*,t} + 2√(2 log t / T_*(t)) ≤ µ̂_{k,t} + 2√(2 log t / T_k(t)) }
          ⊆ { µ_k ≥ µ̂_{k,t} + 2√(2 log t / T_k(t)) } ∪ { µ_* ≥ µ̂_{*,t} + 2√(2 log t / T_*(t)) }
            ∪ { µ_* ≤ µ_k + 2√(2 log t / T_k(t)), π_t = k } .
And from a union bound over the possible values of T_k(t) ∈ {1, . . . , t} together with a
Chernoff bound, we have
IP( µ̂_{k,t} − µ_k < −2√(2 log t / T_k(t)) ) ≤ ∑_{s=1}^t exp( −(s/2) · (8 log t)/s ) = ∑_{s=1}^t t^{−4} = 1/t³ .
Thus IP( µ_k > µ̂_{k,t} + 2√(2 log t / T_k(t)) ) ≤ 1/t³, and similarly we have
IP( µ_* > µ̂_{*,t} + 2√(2 log t / T_*(t)) ) ≤ 1/t³, whereby
∑_{t=K+1}^n IP(π_t = k) ≤ 2 ∑_{t=1}^n 1/t³ + ∑_{t=1}^n IP( µ_* ≤ µ_k + 2√(2 log t / T_k(t)), π_t = k )
                        ≤ 2 ∑_{t=1}^∞ 1/t³ + ∑_{t=1}^n IP( T_k(t) ≤ 8 log t / ∆_k², π_t = k )
                        ≤ 2 ∑_{t=1}^∞ 1/t³ + ∑_{t=1}^n IP( T_k(t) ≤ 8 log n / ∆_k², π_t = k )
                        ≤ 2 ∑_{t=1}^∞ 1/t³ + ∑_{s=1}^∞ IP( s ≤ 8 log n / ∆_k² )
                        ≤ 2 ∑_{t=1}^∞ 1/t² + 8 log n / ∆_k²
                        = π²/3 + 8 log n / ∆_k² ,
where s is the counter of pulls of arm k (each value s of T_k(t) on a round with π_t = k occurs at most once). Therefore we have
Rn = ∑_{k=1}^K ∆_k IE[T_k(n)] ≤ ∑_{k: ∆_k > 0} ∆_k ( 1 + π²/3 + 8 log n / ∆_k² ) ,
which furnishes the proof.
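To see what the theorem gives on a concrete instance, the snippet below (illustrative values, not from the notes; the `ucb` sketch after Algorithm 1 can be run on the same instance for comparison) evaluates the right-hand side of the bound.

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.1])    # illustrative arm means
n = 5000
deltas = mu.max() - mu            # gaps Delta_k
gaps = deltas[deltas > 0]

# Theorem: R_n <= 8 * sum_{k: Delta_k > 0} log(n)/Delta_k + (1 + pi^2/3) * sum_k Delta_k
bound = 8 * np.sum(np.log(n) / gaps) + (1 + np.pi ** 2 / 3) * deltas.sum()
print(bound)   # upper bound on the expected pseudo-regret for this instance
```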
Consider first the case K = 2. From the theorem above we know Rn ∼ (log n)/∆, which is
consistent with the intuition that when the gap between the two arms is small, it is hard to
distinguish which arm to choose. On the other hand, it always holds that Rn ≤ n∆. Combining
these two results, we have Rn ≤ ((log n)/∆) ∧ n∆, whereby Rn ≤ log(n∆²)/∆ up to a constant.
This actually turns out to be the optimal bound. When K ≥ 3, we can similarly get the
result that Rn ≤ ∑_k log(n∆_k²)/∆_k. This, however, is not the optimal bound. The optimal bound
should be ∑_k log(n/H)/∆_k, where H = ∑_k 1/∆_k² plays the role of a harmonic sum. See [Lat15].
3.4 Bounded Regret
From the above we know that the UCB policy yields regret that grows with n at rate at most
log n. In this section we consider whether it is possible to achieve bounded regret. It turns
out that if a separator between the expected reward of the optimal arm and those of the other
arms is known, then there is a policy with bounded regret.
We only consider the case K = 2 here. Without loss of generality, we assume µ_1 = ∆/2 and
µ_2 = −∆/2, so that 0 is a natural separator.
Algorithm 2 Bounded Regret Policy (BRP)
π_1 = 1 and π_2 = 2
for t = 3 to n do
    if max_k µ̂_{k,t} > 0 then
        π_t = argmax_k µ̂_{k,t}
    else
        π_t = 1, π_{t+1} = 2
    end if
end for
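A minimal Python sketch of Algorithm 2 follows (again illustrative, not from the notes: Gaussian rewards with unit variance proxy, µ_1 = ∆/2, µ_2 = −∆/2). When both empirical means are nonpositive, the policy commits to pulling arm 1 at time t and arm 2 at time t + 1, as in the else branch above.

```python
import numpy as np

def brp(delta, n, seed=0):
    """Algorithm 2 (BRP) with mu_1 = delta/2, mu_2 = -delta/2 and known separator 0."""
    rng = np.random.default_rng(seed)
    mu = np.array([delta / 2.0, -delta / 2.0])
    pulls = np.zeros(2, dtype=int)
    sums = np.zeros(2)

    def pull(k):
        pulls[k] += 1
        sums[k] += rng.normal(mu[k], 1.0)

    pull(0)                               # pi_1 = 1
    pull(1)                               # pi_2 = 2
    t = 3
    while t <= n:
        mu_hat = sums / pulls
        if mu_hat.max() > 0:
            pull(int(np.argmax(mu_hat)))  # pi_t = argmax_k mu-hat_{k,t}
            t += 1
        else:
            pull(0)                       # pi_t = 1
            if t + 1 <= n:
                pull(1)                   # pi_{t+1} = 2
            t += 2
    return delta * int(pulls[1])          # pseudo-regret Delta * T_2(n)

delta = 0.5
print(brp(delta, n=10000))                # stays bounded as n grows
print(delta + 16 / delta)                 # theorem's bound: 32.5
```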
Theorem: BRP has regret
Rn ≤ ∆ + 16/∆ .
Proof.
IP(π_t = 2) = IP(µ̂_{2,t} > 0, π_t = 2) + IP(µ̂_{2,t} ≤ 0, π_t = 2) .
Note that
∑_{t=3}^n IP(µ̂_{2,t} > 0, π_t = 2) ≤ IE[ ∑_{t=3}^n 1I(µ̂_{2,t} > 0, π_t = 2) ]
                                    = IE[ ∑_{t=3}^n 1I(µ̂_{2,t} − µ_2 > ∆/2, π_t = 2) ]
                                    ≤ ∑_{s=1}^∞ e^{−s∆²/8}
                                    ≤ 8/∆² ,
where s is the counter of pulls of arm 2, each value of s occurs at most once among rounds with
π_t = 2, and the bound by the geometric sum is a Chernoff bound applied for each fixed s.
Similarly,
∑_{t=3}^n IP(µ̂_{2,t} ≤ 0, π_t = 2) ≤ ∑_{t=3}^n IP(µ̂_{1,t−1} ≤ 0, π_{t−1} = 1) ≤ 8/∆² ,
since arm 2 is pulled with µ̂_{2,t} ≤ 0 only in the second step of the else branch, which requires
µ̂_{1,t−1} ≤ 0 as well, and µ̂_{1,t−1} ≤ 0 is a deviation of size at least ∆/2 = µ_1.
Combining these two inequalities, we have
Rn ≤ ∆ ( 1 + 16/∆² ) = ∆ + 16/∆ .
References
[Lat15] Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv:1507.07880, 2015.
MIT OpenCourseWare
http://ocw.mit.edu
18.657 Mathematics of Machine Learning
Fall 2015
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.