EECS 495: Randomized Algorithms
Lecture 13: Multi-armed Bandits
Reading: none
Multi-armed bandits (Robbins '52):

• slot machine with multiple levers
• levers give rewards according to a distribution
• want to maximize the sum of rewards
• no initial knowledge about payoffs

Problem: Tradeoff between

• exploiting the lever with the highest expected payoff, and
• exploring to get more information about the expected payoffs of the other levers.
Example: Keyword Allocations

• n advertisers with CPCs v1, ..., vn and CTRs p1, ..., pn
• one slot per keyword
• CTRs unknown, but fixed over time
• which ad to show?
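This example can be cast as a bandit instance in the obvious way (a sketch; the pay-per-click accounting below is an assumption, not spelled out in the notes): showing ad i is one pull of arm i, which pays vi if it is clicked, so arm i has unknown mean reward µi = pi × vi.

    import random

    # Hypothetical toy instance: each ad is an arm. Showing ad i yields
    # reward v[i] if it is clicked (probability p[i]), else 0, so the
    # (unknown) mean reward of arm i is p[i] * v[i].
    v = [1.0, 0.8, 0.5]     # CPCs, known to the policy
    p = [0.02, 0.05, 0.04]  # CTRs, unknown to the policy

    def show_ad(i):
        """One pull of arm i: reward v[i] with probability p[i]."""
        return v[i] if random.random() < p[i] else 0.0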
Problem:

• n arms
• reward Xi ∈ [0, 1] of arm i
• Xi is a random variable with mean µi
• assume n << T

Goal: Given a finite horizon T, seek a policy to minimize regret:

    T × µi∗ − E[ Σ_{t=1}^T X_{i_t} ],

where i∗ is the arm with the highest mean µi and i_t is the arm played at step t.

Claim: There is a policy that obtains regret O(√(nT log T)) (and hence per-turn regret vanishes for large T).

Question: Easy policies?

Algorithm: Play all arms for a while, then play the best one:

• play each arm for T^{2/3} steps,
• choose the arm with the maximum sample average and play it for the remaining T − nT^{2/3} steps.

Claim: O(nT^{2/3} √(log T)) regret.
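A minimal sketch of this explore-then-exploit policy (assuming a pull(i) helper that returns the reward of one play of arm i; the function name and interface are illustrative, not from the notes):

    def explore_then_exploit(pull, n, T):
        """Play each arm T^(2/3) times, then commit to the best sample average.

        pull(i) -> reward in [0, 1] for one play of arm i (assumed given).
        """
        k = int(T ** (2 / 3))              # exploration plays per arm
        totals = [0.0] * n
        for i in range(n):                 # explore phase: n * T^(2/3) plays
            for _ in range(k):
                totals[i] += pull(i)
        best = max(range(n), key=lambda i: totals[i] / k)  # max sample average
        reward = sum(totals)
        for _ in range(T - n * k):         # exploit phase: commit to best arm
            reward += pull(best)
        return reward

For Bernoulli arms with means mu, the assumed helper can be as simple as pull = lambda i: float(random.random() < mu[i]).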
Claim (Chernoff-Hoeffding inequality): Let X1, ..., Xk be k independent draws from a distribution on [0, 1] with mean µ, and let µ̂ = (1/k) Σ_{i=1}^k Xi be the sample average. Then for any ε > 0:

    Pr[µ̂ − µ > ε] ≤ e^{−2kε²}   and   Pr[µ − µ̂ > ε] ≤ e^{−2kε²},

and hence Pr[|µ̂ − µ| > ε] ≤ 2e^{−2kε²}.
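A quick numerical sanity check of the two-sided form Pr[|µ̂ − µ| > ε] ≤ 2e^{−2kε²} (illustrative parameters, estimated by simulation):

    import math
    import random

    mu, k, eps, trials = 0.5, 400, 0.05, 20000
    exceed = 0
    for _ in range(trials):
        sample_avg = sum(random.random() < mu for _ in range(k)) / k
        exceed += abs(sample_avg - mu) > eps
    print("empirical:", exceed / trials)                      # ~0.045 here
    print("Hoeffding bound:", 2 * math.exp(-2 * k * eps**2))  # 2e^{-2} ~ 0.27

The bound is loose but holds, as expected.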
Proof (of the regret bound): Let µ̂i be the sample average of arm i after the explore phase. By Hoeffding with k = T^{2/3}:

    Pr[ |µi − µ̂i| > √(log T)/T^{1/3} ] ≤ 2/T².

By a union bound over the n ≤ T arms:

    Pr[ ∃i : |µi − µ̂i| > √(log T)/T^{1/3} ] ≤ 2n/T² ≤ 2/T.

With probability 1 − 2/T, the chosen arm i therefore has µi ≥ µi∗ − 2√(log T)/T^{1/3}, so the regret is at most

    nT^{2/3} + T × 2√(log T)/T^{1/3} + (2/T) × T,

where

– the first term is regret due to the initial explore phase,
– the second term is regret due to a slightly sub-optimal arm being played for at most T steps, and
– the third term is regret due to an arm sub-optimal by as much as 1 being played for at most T steps, in the failure event of probability at most 2/T.

All three terms are O(nT^{2/3} √(log T)), giving the claimed bound.

Idea: Treating all arms equally wastes time. Instead, play the arm with the highest upper confidence interval. Either that arm

• also has the highest mean, or
• we narrow its confidence interval;

either way, we're happy.
Def: If µ̂i is the sample average of arm i and ti is the number of times arm i has been played, then the index Φi of arm i is

    Φi = µ̂i + √(log T / ti).

Algorithm: Index policy. At each step:

• play the arm with the highest index,
• update its index.
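A minimal sketch of the index policy, under the same assumed pull(i) interface as above (initializing by playing each arm once, a common convention not specified in the notes):

    import math

    def index_policy(pull, n, T):
        """Repeatedly play the arm with the highest index mu_hat_i + sqrt(log T / t_i)."""
        counts = [0] * n       # t_i: times arm i has been played
        totals = [0.0] * n     # sum of rewards seen from arm i
        reward = 0.0
        for t in range(T):
            if t < n:
                i = t          # play each arm once so every index is defined
            else:
                i = max(range(n), key=lambda j: totals[j] / counts[j]
                        + math.sqrt(math.log(T) / counts[j]))
            r = pull(i)
            counts[i] += 1
            totals[i] += r
            reward += r
        return reward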
Claim: O(√(nT log T)) regret.
Proof: Let

• i∗ be the arm with the highest mean,
• ∆i = µi∗ − µi be the per-turn regret for playing arm i,
• Qi be the expected number of times arm i is played in T steps.

For each arm i ≠ i∗, E[Qi] ≤ 4 log T/∆i² + 2:

• Pr[Φi∗ < µi∗] ≤ 1/T, no matter how long we play i∗. If i∗ is played continuously, then at each step t,

      Pr[Φi∗(t) < µi∗] = Pr[ µi∗ − µ̂i∗(t) > √(log T / t) ] ≤ 1/T²

  by Hoeffding, so the index dips below µi∗ with probability at most 1/T by a union bound over the T steps.

• Pr[Φi > µi∗] ≤ 1/T after enough plays of i. If i has been played ti = 4 log T/∆i² times, its index is µ̂i + ∆i/2, so

      Pr[Φi > µi∗] = Pr[ µ̂i − µi > ∆i/2 ] ≤ e^{−2 log T} = 1/T² ≤ 1/T

  by Hoeffding.

If neither event happens, arm i is played at most ti times; otherwise, with probability at most 2/T, it is played at most T times. Hence E[Qi] ≤ ti + (2/T) × T = 4 log T/∆i² + 2.

The regret is therefore

    Σ_i ∆i E[Qi] ≈ Σ_i 4 log T/∆i

(the additive 2s contribute at most 2∆i ≤ 2 per arm). Split the arms at ∆ = √(4n log T / T):

• arms with large regret ∆i > √(4n log T / T) incur total regret at most n × 4 log T/∆ = 2√(nT log T),
• arms with small regret ∆i ≤ √(4n log T / T) incur total regret at most T × max_i ∆i ≤ 2√(nT log T).

So the total regret is at most 4√(nT log T) = O(√(nT log T)).
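An illustrative head-to-head of the two sketches above on a small Bernoulli instance (arm means are made up; results vary run to run):

    import random

    mu = [0.2, 0.7, 0.4]                  # hypothetical Bernoulli arm means
    pull = lambda i: float(random.random() < mu[i])
    n, T, runs = len(mu), 20000, 20
    best = max(mu) * T                    # benchmark: always play the best arm

    for policy in (explore_then_exploit, index_policy):
        avg = sum(policy(pull, n, T) for _ in range(runs)) / runs
        print(policy.__name__, "average regret:", best - avg)
    # With gaps this large, the index policy typically incurs noticeably
    # less regret than the explore-then-exploit sketch.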
Lower bound:

Claim: Any bandit policy incurs regret Ω(√(nT)).

Proof (sketch): Take n − 1 arms with mean 1/2 and one arm with mean 1/2 + ε for ε = Θ(√(n/T)). With t samples of an arm, the variance of its sample average is on the order of 1/t, so Ω(1/ε²) = Ω(T/n) samples of an arm are needed to decide whether it is the good one with constant probability. There are not enough samples to resolve all n arms this way, so with constant probability the policy fails to find the good arm and incurs regret Ω(εT) = Ω(√(nT)).
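A toy simulation of this hard instance (the even split of the sampling budget and the constant in ε are illustrative choices, not from the notes):

    import math
    import random

    n, T, trials = 20, 2000, 2000
    eps = math.sqrt(n / T)          # gap of the single good arm
    good = 0
    for _ in range(trials):
        best_arm = random.randrange(n)
        means = [0.5 + (eps if i == best_arm else 0.0) for i in range(n)]
        k = T // n                  # samples per arm when the budget is split evenly
        avgs = [sum(random.random() < means[i] for _ in range(k)) / k
                for i in range(n)]
        good += max(range(n), key=lambda i: avgs[i]) == best_arm
    print("identified the good arm in", good / trials, "of trials")
    # With eps ~ sqrt(n/T), this fraction stays bounded away from 1,
    # matching the constant failure probability in the proof.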