Machine Learning & Data Mining
CS/CNS/EE 155
Lecture 17: The Multi-Armed Bandit Problem
Announcements
• Lecture Tuesday will be Course Review
• Final should only take 4-5 hours to do
  – We give you 48 hours for flexibility
• Homework 2 is graded
  – We graded pretty leniently
  – Approximate grade breakdown:
    64: A, 61: A-, 58: B+, 53: B, 50: B-, 47: C+, 42: C, 39: C-
• Homework 3 will be graded soon
Today
• The Multi-Armed Bandit Problem
  – And extensions
• Advanced topics course on this next year
Recap: Supervised Learning
• Training Data: S = {(x_i, y_i)}_{i=1}^N, with x ∈ R^D and y ∈ {-1, +1}
• Model Class (e.g., linear models): f(x | w, b) = w^T x − b
• Loss Function (e.g., squared loss): L(a, b) = (a − b)^2
• Learning Objective (an optimization problem):

      argmin_{w,b}  Σ_{i=1}^N L( y_i, f(x_i | w, b) )
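As a minimal sketch of this recap (the synthetic data and the NumPy least-squares solver are my own illustrative choices, not part of the lecture), the snippet below fits a linear model under squared loss:

```python
import numpy as np

# Synthetic training data (illustrative only): N points in R^D with labels in {-1, +1}
rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = np.where(X @ w_true > 0, 1.0, -1.0)

# Absorb the bias b into the weights by appending a constant feature
X_aug = np.hstack([X, np.ones((N, 1))])

# Squared loss + linear model => the learning objective is ordinary least squares
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

def f(x, w=w_hat):
    """Linear model f(x | w, b) = w^T x - b, with b folded into the last weight."""
    return np.append(x, 1.0) @ w

print("training squared loss:", np.mean((X_aug @ w_hat - y) ** 2))
```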
Labeled and Unlabeled data
[Figure: examples of inputs x with their labels y — images labeled "Crystal" / "Needle" / "Empty", digits labeled "0" / "1" / "2", and documents labeled "Sports" / "News" / "World News" / "Science". Labels come from a human expert, special equipment, or running an experiment.]
• x: cheap and abundant!
• y: expensive and scarce!
Image Source: http://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture23.pdf
Solution?
• Let’s grab some labels!
– Label images
– Annotate webpages
– Rate movies
– Run Experiments
– Etc…
• How should we choose?
Interactive Machine Learning
• Start with unlabeled data:
• Loop:
– select xi
– receive feedback/label yi
• How to measure cost?
• How to define goal?
Crowdsourcing
[Figure: repeat — pick an unlabeled example, send it to a human expert / special equipment / an experiment, and add the returned label (e.g., "Mushroom") to the initially empty labeled set. x is cheap and abundant; y is expensive and scarce.]
Aside: Active Learning
Goal: Maximize Accuracy with Minimal Cost
[Figure: the same labeling loop as crowdsourcing, but the learner chooses which unlabeled example to send for labeling.]
Passive Learning
[Figure: the same labeling loop, but the example to label is chosen at random.]
Comparison with Passive Learning
• Conventional supervised learning is considered "passive" learning
• The unlabeled training set is sampled according to the test distribution
• So we label it at random
  – Very expensive!
Aside: Active Learning
• Cost: uniform
  – E.g., each label costs $0.10
• Goal: maximize accuracy of the trained model
  – Control the distribution of labeled training data
[Figure: the labeling loop from before (human expert / special equipment / experiment); object-recognition example from Krizhevsky, Sutskever & Hinton 2012 (slide credit: Y. LeCun, M. A. Ranzato)]
Problems with Crowdsourcing
• Assumes you can label by proxy
– E.g., have someone else label objects in images
• But sometimes you can’t!
– Personalized recommender systems
• Need to ask the user whether content is interesting
– Personalized medicine
• Need to try treatment on patient
– Requires actual target domain
Personalized Labels
[Figure: the same choose/label loop, but the labels now come from the end user of a real system (e.g., choose a Sports article and observe whether the user likes it). The labeled set is initially empty. What is the cost here?]
The Multi-Armed Bandit Problem
Formal Definition
Basic setting: K classes, no features.
• K actions/classes
• Each action k has an average reward: μ_k
  – Unknown to us
  – Assume WLOG that μ_1 is largest
• For t = 1…T   (the algorithm simultaneously predicts & receives labels)
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ_a(t)
• Goal: minimize  T·μ_1 − (μ_a(1) + μ_a(2) + … + μ_a(T))
  – T·μ_1: the reward if we had perfect information to start
  – μ_a(1) + … + μ_a(T): the expected reward of the algorithm
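To make the protocol concrete, here is a minimal simulation sketch of this loop; the reward means and the uniformly-random placeholder policy are illustrative assumptions, not part of the formal definition.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.44, 0.40, 0.33, 0.20])   # hidden mean rewards; mu[0] is largest
K, T = len(mu), 10000

total_reward = 0.0
for t in range(T):
    a = rng.integers(K)                 # placeholder policy: pick an arm uniformly at random
    y = float(rng.random() < mu[a])     # Bernoulli reward with expectation mu[a]
    total_reward += y

# Monte Carlo estimate of the regret  T*mu_1 - (mu_a(1) + ... + mu_a(T))
print("estimated regret of the random policy:", T * mu.max() - total_reward)
```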
Interactive Personalization
(5 Classes, No features)
[Animation over several slides: starting from empty statistics over the five classes (Celebrity, Economy, Politics, Sports, World), the system recommends a Sports article (not liked), then Politics (liked), then World (not liked), then Economy (liked). After each round it updates that class's average likes and # shown; the running total of likes reaches 2.]
What should the Algorithm Recommend?
• Exploit: Economy (highest observed average likes)
• Explore: Celebrity (never shown, so we know nothing about it)
• Best: Politics
• How to optimally balance the explore/exploit tradeoff?
  – Characterized by the Multi-Armed Bandit Problem
Running-example statistics after 70 recommendations (24 total likes):

  Class        Average Likes   # Shown
  Celebrity         --             0
  Economy          0.44           25
  Politics         0.4            10
  Sports           0.33           15
  World            0.2            20
Regret
• Over the time horizon T, compare the expected reward of the optimal policy (always playing the best arm) with the expected reward of the algorithm:
      reward(OPT) = μ_1 + μ_1 + … + μ_1
      reward(ALG) = μ_a(1) + μ_a(2) + … + μ_a(T)
• Regret:  R(T) = reward(OPT) − reward(ALG)
• Opportunity cost of not knowing the preferences
• "No-regret" if R(T)/T → 0
  – Efficiency measured by convergence rate
Recap: The Multi-Armed Bandit Problem
Basic setting: K classes, no features.
• K actions/classes
• Each action k has an average reward: μ_k
  – All unknown to us
  – Assume WLOG that μ_1 is largest
• For t = 1…T   (the algorithm simultaneously predicts & receives labels)
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ_a(t)
• Goal: minimize the regret  T·μ_1 − (μ_a(1) + μ_a(2) + … + μ_a(T))
The Motivating Problem
• Slot machine = one-armed bandit
• Each arm has a different payoff
• Goal: minimize regret from pulling suboptimal arms
http://en.wikipedia.org/wiki/Multi-armed_bandit
Implications of Regret
Regret:  R(T) = reward(OPT) − reward(ALG)
• If R(T) grows linearly w.r.t. T:
  – Then R(T)/T → constant > 0
  – I.e., we converge to predicting something suboptimal
• If R(T) is sub-linear w.r.t. T:
  – Then R(T)/T → 0
  – I.e., we converge to predicting the optimal action
Experimental Design
• How to split trials to collect information
• Static Experimental Design
– Standard practice
– (pre-planned)
Treatment
Placebo
Treatment
Placebo
Treatment
…
http://en.wikipedia.org/wiki/Design_of_experiments
Sequential Experimental Design
• Adapt experiments based on outcomes
Treatment
Placebo
Treatment
Treatment
Treatment
…
Sequential Experimental Design Matters
http://www.nytimes.com/2010/09/19/health/research/19trial.html
Sequential Experimental Design
• The (basic) MAB models sequential experimental design!
• Each treatment has hidden expected value
– Need to run trials to gather information
– “Exploration”
• In hindsight, should always have used treatment
with highest expected value
• Regret = opportunity cost of exploration
Online Advertising
• The largest use-case of multi-armed bandit problems
The UCB1 Algorithm
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
Confidence Intervals
• Maintain a confidence interval for each action
  – Often derived using Chernoff-Hoeffding bounds (**)
• For the running-example statistics above, e.g.:
  – Politics (average 0.4, shown 10 times): interval ≈ [0.25, 0.55]
  – World (average 0.2, shown 20 times): tighter interval ≈ [0.1, 0.3]
  – Celebrity (never shown): interval undefined
** http://www.cs.utah.edu/~jeffp/papers/Chern-Hoeff.pdf
http://en.wikipedia.org/wiki/Hoeffding%27s_inequality
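A small sketch of computing such an interval from an arm's empirical statistics, assuming rewards in [0, 1] and a two-sided Hoeffding bound (the confidence level `delta` is a placeholder choice):

```python
import math

def hoeffding_interval(avg_reward, n_shown, delta=0.05):
    """Two-sided Hoeffding confidence interval for a mean of rewards in [0, 1].

    With probability at least 1 - delta, the true mean lies within
    avg_reward +/- sqrt(ln(2/delta) / (2 * n_shown)).
    """
    if n_shown == 0:
        return (0.0, 1.0)  # no data: the interval is vacuous
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n_shown))
    return (max(0.0, avg_reward - radius), min(1.0, avg_reward + radius))

# Example with statistics like those on the slide
print(hoeffding_interval(0.4, 10))   # arm shown 10 times -> wider interval
print(hoeffding_interval(0.44, 25))  # arm shown 25 times -> tighter interval
print(hoeffding_interval(0.0, 0))    # arm never shown -> vacuous interval
```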
UCB1 Confidence Interval
• For each arm k, UCB1 uses the interval

      m_k ± √( 2 ln t / t_k )

  – m_k: expected reward of arm k, estimated from data
  – t: total iterations so far (70 in the running example)
  – t_k: number of times action k was chosen
[Table: the running-example statistics from above]
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
The UCB1 Algorithm
• At each iteration:
  – Play the arm with the highest Upper Confidence Bound:

      argmax_k  m_k + √( 2 ln t / t_k )

[Table: the running-example statistics from above]
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
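A compact sketch of UCB1 as stated above; the Bernoulli reward model and the play-each-arm-once initialization are illustrative choices for this sketch:

```python
import math
import numpy as np

def ucb1(mu, T, seed=0):
    """Run UCB1 for T rounds on K Bernoulli arms with (hidden) means mu.

    Returns the empirical means, play counts, and total observed reward.
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K, dtype=int)   # t_k: times each arm was chosen
    means = np.zeros(K)               # m_k: empirical average reward of each arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                 # play every arm once to initialize the estimates
        else:
            ucb = means + np.sqrt(2.0 * math.log(t) / counts)
            a = int(np.argmax(ucb))   # optimism in the face of uncertainty
        y = float(rng.random() < mu[a])          # Bernoulli reward, expectation mu[a]
        counts[a] += 1
        means[a] += (y - means[a]) / counts[a]   # incremental average update
        total += y
    return means, counts, total
```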
Balancing Explore/Exploit
• "Optimism in the Face of Uncertainty"

      argmax_k  m_k + √( 2 ln t / t_k )

  – m_k: exploitation term (empirical average reward)
  – √(2 ln t / t_k): exploration term (shrinks as the arm is played more)
[Table: the running-example statistics from above]
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
Analysis (Intuition)
• UCB1 plays  a(t+1) = argmax_k  m_k + √( 2 ln t / t_k )
• With high probability (**):
  – The chosen arm's upper confidence bound is at least the best arm's, which in turn upper-bounds the best arm's true value:
        m_a(t+1) + √( 2 ln t / t_a(t+1) )  ≥  m_1 + √( 2 ln t / t_1 )  ≥  μ_1
  – The chosen arm's true value is at least its lower confidence bound:
        μ_a(t+1)  ≥  m_a(t+1) − √( 2 ln t / t_a(t+1) )
• Combining the two bounds the regret at time t+1:
        μ_1 − μ_a(t+1)  ≤  2 √( 2 ln t / t_a(t+1) )
** Proof of Theorem 1 in http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
How often each arm gets played (example run, 5 arms):
  After   500 iterations:   158,  145,   89,  34,  74
  After  2000 iterations:   913,  676,  139,  82, 195
  After  5000 iterations:  2442, 1401,  713, 131, 318
  After 25000 iterations: 20094, 2844, 1418, 181, 468
The best arm accounts for a growing share of plays, while suboptimal arms are played far less often.
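For instance, using the `ucb1` sketch defined above with hypothetical means, one could produce this kind of play-count table (exact numbers will vary from run to run):

```python
mu = [0.5, 0.44, 0.40, 0.33, 0.20]  # hypothetical true means; arm 0 is best
for T in (500, 2000, 5000, 25000):
    means, counts, total = ucb1(mu, T)
    print(f"after {T:6d} iterations, play counts: {counts.tolist()}, "
          f"regret estimate: {T * max(mu) - total:.1f}")
```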
How Often Sub-Optimal Arms Get Played
• A suboptimal arm k stops being selected once:
      m_k + √( 2 ln t / t_k )  ≤  μ_1
  – The left-hand side grows only slowly with time t (via ln t)
  – and shrinks quickly with the number of trials t_k
• The number of times arm k is selected is  O( ln t / (μ_1 − μ_k)² )
  – Prove using Hoeffding's Inequality
Theorem 1 in http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
Regret Guarantee
• With high probability, UCB1 accumulates regret at most:

      R(T) = O( (K/ε) ln T )

  – K: #actions
  – T: time horizon
  – ε = μ_1 − μ_2: gap between best & 2nd-best arm
Theorem 1 in http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
Recap: MAB & UCB1
• Interactive setting
– Receives reward/label while making prediction
• Must balance explore/exploit
• Sub-linear regret is good
– Average regret converges to 0
Extensions
• Contextual Bandits
– Features of environment
• Dependent-Arms Bandits
– Features of actions/classes
• Dueling Bandits
• Combinatorial Bandits
• General Reinforcement Learning
Contextual Bandits
Setting: K classes; the best class depends on features of the environment.
• K actions/classes
• Rewards depend on the context x: μ_a(x)
• For t = 1…T   (the algorithm simultaneously predicts & receives labels)
  – Algorithm receives context x_t
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ_a(t)(x_t)
• Goal: minimize regret ("bandit multiclass prediction")
http://arxiv.org/abs/1402.0555
http://www.research.rutgers.edu/~lihong/pub/Li10Contextual.pdf
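As a hedged illustration of the interaction protocol only (this is a simple ε-greedy scheme with one least-squares reward model per action, not an algorithm from the linked papers; the environment is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, T, eps = 3, 4, 5000, 0.1
W_true = rng.normal(size=(K, D))          # hidden per-action reward weights (synthetic)

A = [np.eye(D) for _ in range(K)]         # per-action regularized X^T X
b = [np.zeros(D) for _ in range(K)]       # per-action X^T y

for t in range(T):
    x = rng.normal(size=D)                # observe context x_t
    if rng.random() < eps:
        a = rng.integers(K)               # explore uniformly with probability eps
    else:
        preds = [x @ np.linalg.solve(A[k], b[k]) for k in range(K)]
        a = int(np.argmax(preds))         # exploit the current reward estimates
    y = W_true[a] @ x + 0.1 * rng.normal()  # noisy reward with expectation mu_a(x)
    A[a] += np.outer(x, x)                # update only the chosen action's model
    b[a] += y * x
```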
Linear Bandits
Setting: K classes with linear dependence between arms.
• K actions/classes
  – Each action k has features x_k
  – Reward function: μ(x) = w^T x
• For t = 1…T   (the algorithm simultaneously predicts & receives labels)
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ_a(t)
  – Each observed reward shares information with the other actions (through w)
• Goal: regret scaling independent of K
http://webdocs.cs.ualberta.ca/~abbasiya/linear-bandits-NIPS2011.pdf
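A rough sketch in the spirit of linear UCB methods (e.g., the linked NIPS 2011 paper); the confidence width `alpha` and the synthetic arm features are placeholder assumptions, not that paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(2)
K, D, T, alpha = 50, 6, 3000, 1.0
X_arms = rng.normal(size=(K, D))          # known feature vector x_k for each arm
w_true = rng.normal(size=D)               # hidden weight vector defining mu_k = w^T x_k

A = np.eye(D)                             # ridge-regularized design matrix
b = np.zeros(D)

for t in range(T):
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ b                     # current least-squares estimate of w
    # Optimistic score: estimated reward + confidence width along each arm's direction
    widths = np.sqrt(np.einsum("kd,de,ke->k", X_arms, A_inv, X_arms))
    a = int(np.argmax(X_arms @ w_hat + alpha * widths))
    y = w_true @ X_arms[a] + 0.1 * rng.normal()   # noisy reward for the chosen arm
    A += np.outer(X_arms[a], X_arms[a])           # all arms share the same w estimate
    b += y * X_arms[a]
```

Note how a single shared estimate of w is updated, so information from one arm's reward transfers to every other arm; this is what lets the regret scale with the feature dimension rather than with K.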
Example
• Treatment of spinal cord injury patients
  – Studied by Joel Burdick's group @Caltech
• Multi-armed bandit problem:
  – Thousands of arms
[Figure: numbered electrode-array stimulation configurations; images from Yanan Sui]
• Recall the UCB1 regret bound:  R(T) = O( (K/ε) ln T )
• Here we want a regret bound that scales independently of the #arms
  – E.g., linearly in the dimensionality of the features x describing the arms
Dueling Bandits
Setting: K classes; we can only measure pairwise preferences.
• K actions/classes
  – Preference model P(a_k > a_k')
• For t = 1…T   (the algorithm simultaneously predicts & receives labels)
  – Algorithm chooses a pair of actions a(t) & b(t)
  – Receives random reward y(t) with expectation P(a(t) > b(t))
    • Only pairwise rewards are observed
• Goal: low regret despite only pairwise feedback
http://www.yisongyue.com/publications/jcss2012_dueling_bandit.pdf
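A naive illustrative sketch of the dueling interaction (uniform exploration of pairs with empirical preference estimation, not the algorithm from the linked paper; the preference matrix is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 4, 20000
# Synthetic preference matrix: P[i, j] = probability that arm i beats arm j
strength = np.array([0.9, 0.7, 0.5, 0.3])
P = strength[:, None] / (strength[:, None] + strength[None, :])

wins = np.zeros((K, K))
plays = np.zeros((K, K))

for t in range(T):
    a, b = rng.choice(K, size=2, replace=False)   # naive: explore pairs uniformly
    a_wins = rng.random() < P[a, b]               # observe one pairwise comparison
    wins[a, b] += a_wins
    wins[b, a] += 1 - a_wins
    plays[a, b] += 1
    plays[b, a] += 1

# Estimate P(i beats j) and pick the arm that beats the most other arms
P_hat = np.divide(wins, plays, out=np.full((K, K), 0.5), where=plays > 0)
np.fill_diagonal(P_hat, 0.5)
best = int(np.argmax((P_hat > 0.5).sum(axis=1)))
print("estimated best arm:", best)
```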
Example in Sensory Testing
• (Hypothetical) taste experiment: two soft drinks, tested in a natural usage context
• Experiment 1: Absolute Metrics
  – Each participant reports how many cans of each drink they consumed
  – [Figure: per-participant can counts; totals of 8 vs. 9 cans, with one participant marked "Very Thirsty!"]
  – Absolute counts barely separate the two drinks and are dominated by individual consumption levels
• Experiment 2: Relative Metrics
  – Each participant reports a pairwise comparison: 2-1, 3-0, 2-0, 1-0, 4-1, 2-1
  – All 6 prefer Pepsi: the relative signal is clear
Example Revisited
• Treatment of spinal cord injury patients
  – Studied by Joel Burdick's group @Caltech
[Figure: numbered electrode-array stimulation configurations; images from Yanan Sui]
• Patients cannot reliably rate individual treatments
• Patients can reliably compare pairs of treatments
• Dueling Bandits Problem!
http://dl.acm.org/citation.cfm?id=2645773
Combinatorial Bandits
• Sometimes, actions must be selected from
combinatorial action space:
– E.g., shortest path problems with unknown costs
on edges
• aka: Routing under uncertainty
• If you knew all the parameters of model:
– standard optimization problem
http://www.yisongyue.com/publications/nips2011_submod_bandit.pdf
http://www.cs.cornell.edu/~rdk/papers/OLSP.pdf
http://homes.di.unimi.it/cesa-bianchi/Pubblicazioni/comband.pdf
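A small sketch of the shortest-path flavor (an optimistic edge-level scheme improvised here for illustration, not one of the algorithms in the linked papers): keep a UCB-style optimistic cost estimate per edge and route along the currently optimistic shortest path.

```python
import heapq
import math
import random

random.seed(0)
# Tiny graph: edge -> true mean cost (hidden from the algorithm)
true_cost = {("s", "a"): 1.0, ("s", "b"): 1.5, ("a", "t"): 2.0, ("b", "t"): 1.0}
graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
counts = {e: 0 for e in true_cost}
means = {e: 0.0 for e in true_cost}

def optimistic_cost(e, round_t):
    """Lower confidence bound on an edge cost (optimism for a minimization problem)."""
    if counts[e] == 0:
        return 0.0
    return max(0.0, means[e] - math.sqrt(2.0 * math.log(round_t) / counts[e]))

def optimistic_shortest_path(round_t):
    """Dijkstra from "s" to "t" using the current optimistic edge costs."""
    dist, prev, heap = {"s": 0.0}, {}, [(0.0, "s")]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v in graph[u]:
            nd = d + optimistic_cost((u, v), round_t)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], "t"
    while node != "s":
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]

for round_t in range(1, 2001):
    for e in optimistic_shortest_path(round_t):     # traverse the optimistic shortest path
        c = random.gauss(true_cost[e], 0.3)          # observe a noisy cost per edge used
        counts[e] += 1
        means[e] += (c - means[e]) / counts[e]
```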
General Reinforcement Learning
Treatment
Placebo
Treatment
Placebo
Treatment
…
• Bandit setting assumes actions do not affect
the world
– E.g., sequence of experiments does not affect the
distribution of future trials
Markov Decision Process
• M states
• K actions
• Reward: μ(s,a)
  – Depends on state
• Example: Personalized Tutoring [Emma Brunskill et al.] (**)
  [Figure: the MultiLearn+ multi-user tutoring system]
• For t = 1…T
  – Algorithm (approximately) observes current state s_t
    • Depends on previous state & action taken
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ(s_t, a(t))
** http://www.cs.cmu.edu/~ebrun/FasterTeachingPOMDP_planning.pdf
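A minimal hedged sketch of this interaction loop (the toy two-state MDP and the ε-greedy tabular reward estimates are illustrative only; properly learning good policies in MDPs must also account for how actions change future states, which is beyond this sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
M, K, T, eps = 2, 2, 5000, 0.1
# Toy MDP (synthetic): reward means mu(s, a) and transition probabilities P[s, a, s']
mu = np.array([[0.2, 0.8], [0.6, 0.1]])
P = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.5, 0.5], [0.8, 0.2]]])

est = np.zeros((M, K))     # tabular estimate of mu(s, a)
n = np.zeros((M, K))
s = 0
for t in range(T):
    # epsilon-greedy on the immediate-reward estimate (myopic: ignores long-term effects)
    a = rng.integers(K) if rng.random() < eps else int(np.argmax(est[s]))
    y = float(rng.random() < mu[s, a])          # reward with expectation mu(s, a)
    n[s, a] += 1
    est[s, a] += (y - est[s, a]) / n[s, a]
    s = int(rng.choice(M, p=P[s, a]))           # next state depends on current state & action

print(np.round(est, 2))
```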
Summary
• Interactive Machine Learning
– Multi-armed Bandit Problem
– Basic result: UCB1
– Surveyed Extensions
• Advanced Topics in ML course next year
• Next lecture: course review
– Bring your questions!