Machine Learning & Data Mining CS/CNS/EE 155
Lecture 17: The Multi-Armed Bandit Problem

Announcements
• Lecture Tuesday will be the Course Review
• The final should only take 4-5 hours to do
  – We give you 48 hours for flexibility
• Homework 2 is graded
  – We graded pretty leniently
  – Approximate grade breakdown: 64: A, 61: A-, 58: B+, 53: B, 50: B-, 47: C+, 42: C, 39: C-
• Homework 3 will be graded soon

Today
• The Multi-Armed Bandit Problem
  – And extensions
• Advanced topics course on this next year

Recap: Supervised Learning
• Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$, with $x \in \mathbb{R}^D$ and $y \in \{-1, +1\}$
• Model Class: e.g., linear models $f(x \mid w, b) = w^T x - b$
• Loss Function: e.g., squared loss $L(a, b) = (a - b)^2$
• Learning Objective (an optimization problem):
  $\arg\min_{w,b} \sum_{i=1}^N L(y_i, f(x_i \mid w, b))$
• (A minimal numerical sketch of this objective appears after the passive-learning comparison below.)

Labeled and Unlabeled Data
• Unlabeled examples x are cheap and abundant: images, documents, news articles, experimental conditions, ...
• Labels y are expensive and scarce: they require a human expert, special equipment, or running an experiment
  – E.g., labeling crystallography images as "Crystal" / "Needle" / "Empty", digit images as "0" / "1" / "2", or articles as "Sports" / "News" / "Science"
Image source: http://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture23.pdf

Solution?
• Let's grab some labels!
  – Label images
  – Annotate webpages
  – Rate movies
  – Run experiments
  – Etc.
• How should we choose?

Interactive Machine Learning
• Start with unlabeled data
• Loop:
  – Select x_i
  – Receive feedback/label y_i
• How to measure cost?
• How to define the goal?

Crowdsourcing
• Repeatedly choose an unlabeled example, send it to a human expert (or special equipment, or an experiment), and add the returned label (e.g., "Mushroom") to an initially empty labeled set

Aside: Active Learning
• Goal: maximize accuracy with minimal labeling cost
• The learner chooses which unlabeled examples get sent for labeling

Passive Learning
• Unlabeled examples are chosen at random for labeling

Comparison with Passive Learning
• Conventional supervised learning is considered "passive" learning
• The unlabeled training set is sampled according to the test distribution
• So we label it at random
  – Very expensive!
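For concreteness, here is a minimal sketch of the passive/supervised baseline recapped above: fit the linear model under squared loss on a randomly labeled synthetic dataset. The data, dimensions, and variable names are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Synthetic "passively labeled" training set: x in R^D, y in {-1, +1}.
rng = np.random.default_rng(0)
N, D = 200, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = np.sign(X @ w_true - 0.1)

# Linear model f(x | w, b) = w^T x - b with squared loss, solved in closed form
# by appending a constant feature and using least squares.
X_aug = np.hstack([X, -np.ones((N, 1))])          # last weight plays the role of b
w_b, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = w_b[:-1], w_b[-1]

train_acc = np.mean(np.sign(X @ w_hat - b_hat) == y)
print(f"training accuracy of the passive least-squares fit: {train_acc:.2f}")
```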
Aside: Active Learning
• Cost: uniform
  – E.g., each label costs $0.10
• Goal: maximize accuracy of the trained model
  – E.g., an object recognition model [Krizhevsky, Sutskever, Hinton 2012]
• The learner controls the distribution of labeled training data

Problems with Crowdsourcing
• Assumes you can label by proxy
  – E.g., have someone else label objects in images
• But sometimes you can't!
  – Personalized recommender systems
    • Need to ask the user whether content is interesting
  – Personalized medicine
    • Need to try the treatment on the patient
  – Requires the actual target domain

Personalized Labels
• Repeat: choose an item (e.g., a "Sports" article), show it to the end user, and record the feedback in an initially empty labeled set
• This happens inside the real system: what is the cost?

The Multi-Armed Bandit Problem

Formal Definition (basic setting: K classes, no features)
• K actions/classes
• Each action k has an average reward μ_k
  – Unknown to us
  – Assume WLOG that μ_1 is the largest
• For t = 1 ... T:
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ_{a(t)}
• The algorithm simultaneously predicts and receives labels
• Goal: minimize $T\mu_1 - (\mu_{a(1)} + \mu_{a(2)} + \cdots + \mu_{a(T)})$
  – $T\mu_1$ is what we would collect if we had perfect information to start; the second term is the expected reward of the algorithm

Interactive Personalization (5 classes, no features)
• The system repeatedly picks one of five topics (Celebrity, Economy, Politics, Sports, World), shows the user an article from that topic, and observes whether the user likes it
• For each topic it tracks the average likes and the number of times shown; a running trace looks like:
  – Show Sports → no like (Sports: average 0, shown 1; total likes 0)
  – Show Politics → like (Politics: average 1, shown 1; total likes 1)
  – Show World → no like (World: average 0, shown 1; total likes 1)
  – Show Economy → like (Economy: average 1, shown 1; total likes 2)
  – ... and so on
• (See the simulation sketch below for this interaction loop in code.)
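A minimal simulation sketch of the interaction protocol above, tracking "Average Likes" and "# Shown" per topic. The topic names match the running example, but the like probabilities and the uniformly random policy are made-up placeholders; choosing the policy well is the subject of the rest of the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-topic like probabilities (the true mu_k are unknown to the algorithm).
topics = ["Celebrity", "Economy", "Politics", "Sports", "World"]
mu = np.array([0.30, 0.45, 0.50, 0.35, 0.20])

n_shown = np.zeros(len(topics))       # "# Shown"
sum_likes = np.zeros(len(topics))     # used to form "Average Likes"

T = 70
for t in range(T):
    k = rng.integers(len(topics))     # placeholder policy: pick a topic uniformly at random
    like = rng.random() < mu[k]       # Bernoulli like/no-like feedback from the user
    n_shown[k] += 1
    sum_likes[k] += like

avg_likes = np.divide(sum_likes, n_shown,
                      out=np.full(len(topics), np.nan), where=n_shown > 0)
for name, a, n in zip(topics, avg_likes, n_shown):
    print(f"{name:10s} average likes = {a:.2f}  # shown = {int(n)}")
print("total likes:", int(sum_likes.sum()))
```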
What should the Algorithm Recommend?
• State after 70 articles shown (24 total likes):
  Topic          Celebrity   Economy   Politics   Sports   World
  Average Likes     --         0.44      0.40      0.33     0.20
  # Shown            0         25        10        15       20
• Exploit: Economy (highest empirical average)
• Explore: Celebrity (never shown)
• Best: Politics (the true best topic, but undersampled so far)
• How do we optimally balance the explore/exploit tradeoff?
  – This is exactly what the Multi-Armed Bandit Problem characterizes

Regret
• OPT = expected reward of always playing the best action over the time horizon T; ALG = expected reward of the actions the algorithm actually chose
• Regret: R(T) = OPT − ALG
• The opportunity cost of not knowing the user's preferences in advance
• An algorithm is "no-regret" if R(T)/T → 0
  – Efficiency is measured by the convergence rate

Recap: The Multi-Armed Bandit Problem (basic setting: K classes, no features)
• K actions/classes, each with an unknown average reward μ_k; assume WLOG that μ_1 is the largest
• For t = 1 ... T: the algorithm chooses action a(t) and receives a random reward y(t) with expectation μ_{a(t)}
  – The algorithm simultaneously predicts and receives labels
• Goal: minimize the regret $T\mu_1 - (\mu_{a(1)} + \mu_{a(2)} + \cdots + \mu_{a(T)})$

The Motivating Problem
• Slot machine = "one-armed bandit"; each arm has a different payoff
• Goal: minimize the regret from pulling suboptimal arms
http://en.wikipedia.org/wiki/Multi-armed_bandit

Implications of Regret
• If R(T) grows linearly w.r.t. T:
  – Then R(T)/T → a constant > 0
  – I.e., we converge to predicting something suboptimal
• If R(T) is sub-linear w.r.t. T:
  – Then R(T)/T → 0
  – I.e., we converge to predicting the optimal action
• (A small sketch after the experimental-design slides below illustrates the linear-regret case for a fixed allocation.)

Experimental Design
• How to split trials to collect information
• Static experimental design (standard practice): the allocation is pre-planned
  – E.g., Treatment, Placebo, Treatment, Placebo, Treatment, ...
http://en.wikipedia.org/wiki/Design_of_experiments

Sequential Experimental Design
• Adapt experiments based on outcomes
  – E.g., Treatment, Placebo, Treatment, Treatment, Treatment, ...

Sequential Experimental Design Matters
http://www.nytimes.com/2010/09/19/health/research/19trial.html

Sequential Experimental Design
• The (basic) MAB problem models sequential experimental design!
• Each treatment has a hidden expected value
  – Need to run trials to gather information ("exploration")
• In hindsight, we should always have used the treatment with the highest expected value
• Regret = the opportunity cost of exploration
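To make the linear-regret case concrete, here is a small sketch computing the expected regret of a static, pre-planned design that allocates trials round-robin, as on the static experimental design slide. The treatment means are made-up numbers.

```python
import numpy as np

# Hypothetical Bernoulli "treatments"; mu[0] is the best arm.
mu = np.array([0.6, 0.5, 0.3])
K, T = len(mu), 3000

# Static design: allocate trials to treatments in a fixed round-robin order,
# ignoring the observed outcomes entirely.
arms = np.arange(T) % K
expected_reward = mu[arms].sum()

regret = T * mu.max() - expected_reward
print(f"static design: R(T) = {regret:.0f}, R(T)/T = {regret / T:.3f}")
# R(T) grows linearly in T, so R(T)/T stays bounded away from 0 --
# exactly the "linear regret" case on the Implications of Regret slide.
```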
Online Advertising
• The largest use-case of multi-armed bandit problems

The UCB1 Algorithm
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

Confidence Intervals
• Maintain a confidence interval for each action
  – Often derived using Chernoff-Hoeffding bounds (**)
• For the running example (Average Likes: --, 0.44, 0.40, 0.33, 0.20; # Shown: 0, 25, 10, 15, 20), the intervals look like [0.25, 0.55] or [0.1, 0.3] for the topics that have been shown, and are undefined for the never-shown topic
** http://www.cs.utah.edu/~jeffp/papers/Chern-Hoeff.pdf
   http://en.wikipedia.org/wiki/Hoeffding%27s_inequality

UCB1 Confidence Interval
• Confidence interval around the expected reward estimated from data:
  $\hat{\mu}_k \pm \sqrt{2\ln t \,/\, t_k}$
  – t = total iterations so far (70 in the example above)
  – t_k = number of times action k was chosen
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

The UCB1 Algorithm
• At each iteration, play the arm with the highest Upper Confidence Bound:
  $a(t+1) = \arg\max_k \; \hat{\mu}_k + \sqrt{2\ln t \,/\, t_k}$
http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

Balancing Explore/Exploit
• "Optimism in the Face of Uncertainty"
• $\hat{\mu}_k$ is the exploitation term; $\sqrt{2\ln t / t_k}$ is the exploration term

Analysis (Intuition)
• The chosen arm maximizes the upper confidence bound:
  $a(t+1) = \arg\max_k \; \hat{\mu}_k + \sqrt{2\ln t \,/\, t_k}$
• With high probability (**):
  – Its UCB is at least the UCB of the best arm, which is at least the best arm's true value:
    $\hat{\mu}_{a(t+1)} + \sqrt{2\ln t / t_{a(t+1)}} \;\ge\; \hat{\mu}_1 + \sqrt{2\ln t / t_1} \;\ge\; \mu_1$
  – The chosen arm's true value is at least its lower confidence bound:
    $\mu_{a(t+1)} \;\ge\; \hat{\mu}_{a(t+1)} - \sqrt{2\ln t / t_{a(t+1)}}$
  – Combining the two:
    $\mu_1 - \mu_{a(t+1)} \;\le\; 2\sqrt{2\ln t / t_{a(t+1)}}$
    i.e., a bound on the regret incurred at time t+1
** Proof of Theorem 1 in http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf
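A minimal sketch of UCB1 as stated above, on Bernoulli arms with the same hypothetical like probabilities as in the earlier sketch. It plays each arm once to initialize and then follows the argmax rule; the printed counts mirror the kind of per-arm play counts shown just below.

```python
import numpy as np

def ucb1(mu, T, seed=0):
    """Run UCB1 for T rounds on Bernoulli arms with (unknown) means mu.
    Returns per-arm pull counts and the cumulative (pseudo-)regret."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K)      # t_k: number of times arm k was chosen
    sums = np.zeros(K)        # running sum of rewards for arm k
    regret = 0.0

    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                          # play each arm once to initialize
        else:
            means = sums / counts              # empirical means \hat{mu}_k
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            k = int(np.argmax(means + bonus))  # optimism in the face of uncertainty
        reward = float(rng.random() < mu[k])
        counts[k] += 1
        sums[k] += reward
        regret += mu.max() - mu[k]
    return counts, regret

mu = np.array([0.30, 0.45, 0.50, 0.35, 0.20])   # same hypothetical topics as before
for T in (500, 2000, 5000, 25000):
    counts, regret = ucb1(mu, T)
    print(T, counts.astype(int), f"regret ~ {regret:.0f}")
```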
How Often Does UCB1 Play Each Arm? (simulation with 5 arms)
  Iterations     Times each arm was played
  500              158    145     89     34     74
  2,000            913    676    139     82    195
  5,000           2442   1401    713    131    318
  25,000         20094   2844   1418    181    468
• The best arm is played an increasingly dominant fraction of the time

How Often Sub-Optimal Arms Get Played
• An arm stops being selected once
  $\hat{\mu}_k + \sqrt{2\ln t \,/\, t_k} \;\le\; \mu_1$
  – The exploration bonus grows only slowly with time (ln t) but shrinks quickly with the number of trials t_k of that arm
• The number of times a sub-optimal arm k gets selected is
  $O\!\left(\dfrac{\ln t}{(\mu_1 - \mu_k)^2}\right)$
  – Proved using Hoeffding's Inequality
Theorem 1 in http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

Regret Guarantee
• With high probability, UCB1 accumulates regret at most
  $R(T) = O\!\left(\dfrac{K}{\epsilon} \ln T\right)$
  – K = number of actions, T = time horizon, and ε = μ_1 − μ_2 is the gap between the best and 2nd-best arms
Theorem 1 in http://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf

Recap: MAB & UCB1
• Interactive setting
  – Receives the reward/label while making the prediction
• Must balance explore/exploit
• Sub-linear regret is good
  – Average regret converges to 0

Extensions
• Contextual Bandits
  – Features of the environment
• Dependent-Arms Bandits
  – Features of the actions/classes
• Dueling Bandits
• Combinatorial Bandits
• General Reinforcement Learning

Contextual Bandits (K classes; the best class depends on features)
• K actions/classes
• Rewards depend on a context x: the expected reward of action k in context x is μ_k(x)
• For t = 1 ... T:
  – Algorithm receives context x_t
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ_{a(t)}(x_t)
• "Bandit multiclass prediction": the algorithm simultaneously predicts and receives labels
• Goal: minimize regret
http://arxiv.org/abs/1402.0555
http://www.research.rutgers.edu/~lihong/pub/Li10Contextual.pdf

Linear Bandits (K classes; linear dependence between arms)
• Each action k has features x_k
• Reward function: μ(x) = w^T x
  – Rewards observed for one action share information with the other actions
• For t = 1 ... T:
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ(x_{a(t)})
• Goal: regret that scales independently of K
  – (A sketch of a UCB-style rule for this setting follows the example below)
http://webdocs.cs.ualberta.ca/~abbasiya/linear-bandits-NIPS2011.pdf

Example
• Treatment of spinal cord injury patients
  – Studied by Joel Burdick's group @ Caltech
  – (The original slide shows grids of numbered candidate stimulation configurations)
• A multi-armed bandit problem with thousands of arms
• We want a regret bound that scales independently of the number of arms
  – E.g., linearly in the dimensionality of the features x describing the arms
  – By contrast, UCB1's bound R(T) = O((K/ε) ln T) scales with K
Images from Yanan Sui
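Below is a rough sketch of a UCB-style rule for the linear-bandit setting: a ridge-regression estimate of w plus an exploration bonus derived from the feature covariance, in the spirit of the linear-bandits paper linked above. The constants (alpha, lam), the noise model, and all names are illustrative assumptions, not the exact algorithm or bounds from that paper.

```python
import numpy as np

def linear_ucb(arm_features, w_true, T, alpha=1.0, lam=1.0, noise=0.1, seed=0):
    """Sketch of a UCB-style rule for linear bandits: ridge-regression estimate
    of w plus an exploration bonus based on the regularized feature covariance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(arm_features, dtype=float)      # one row of features per arm
    d = X.shape[1]
    A = lam * np.eye(d)                            # regularized covariance matrix
    b = np.zeros(d)
    regret = 0.0
    best = (X @ w_true).max()

    for t in range(T):
        A_inv = np.linalg.inv(A)
        w_hat = A_inv @ b                          # ridge-regression estimate of w
        bonus = np.sqrt(np.sum((X @ A_inv) * X, axis=1))   # sqrt(x_k^T A^{-1} x_k)
        k = int(np.argmax(X @ w_hat + alpha * bonus))
        y = X[k] @ w_true + noise * rng.standard_normal()  # noisy linear reward
        A += np.outer(X[k], X[k])
        b += y * X[k]
        regret += best - X[k] @ w_true
    return regret

# Hypothetical setup: many arms, low-dimensional features.
rng = np.random.default_rng(3)
arms = rng.normal(size=(1000, 6))
w_true = rng.normal(size=6)
print("cumulative regret:", round(linear_ucb(arms, w_true, T=2000), 1))
```

The point of the example: the estimate and the bonus depend only on the feature dimension d, not on the number of arms, which is what lets the regret scale independently of K.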
Dueling Bandits (K classes; can only measure pairwise preferences)
• K actions/classes
  – Preference model P(a_k > a_k')
• For t = 1 ... T:
  – Algorithm chooses a pair of actions a(t) & b(t)
  – Receives random reward y(t) with expectation P(a(t) > b(t))
  – Only pairwise rewards; the algorithm simultaneously predicts and receives labels
• Goal: low regret despite only pairwise feedback
http://www.yisongyue.com/publications/jcss2012_dueling_bandit.pdf

Example in Sensory Testing
• (Hypothetical) taste experiment between two colas, in a natural usage context
• Experiment 1: absolute metrics
  – Count how many cans of each drink the subjects consume
  – The totals (e.g., 8 cans vs. 9 cans) are noisy and easily skewed, e.g., by one subject who happened to be very thirsty
• Experiment 2: relative metrics
  – Each subject directly compares the two drinks (e.g., 2-1, 3-0, 2-0, 1-0, 4-1, 2-1)
  – The relative signal is consistent: all 6 subjects prefer the same drink (Pepsi in the example)

Example Revisited
• Treatment of spinal cord injury patients
  – Studied by Joel Burdick's group @ Caltech
• Patients cannot reliably rate individual treatments
• Patients can reliably compare pairs of treatments
• This is a Dueling Bandits problem! (See the sketch below for the pairwise-feedback loop.)
Images from Yanan Sui
http://dl.acm.org/citation.cfm?id=2645773
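A sketch of the dueling-bandit feedback interface: only pairwise outcomes are observed, and regret is measured relative to comparing the best arm against itself. The preference matrix and the random-pair baseline are made-up placeholders; the actual algorithms are in the dueling-bandit paper linked above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical preference matrix: P[i, j] = probability that arm i beats arm j.
# Arm 0 is the best arm (it beats every other arm with probability > 1/2).
P = np.array([
    [0.50, 0.60, 0.70, 0.80],
    [0.40, 0.50, 0.60, 0.70],
    [0.30, 0.40, 0.50, 0.60],
    [0.20, 0.30, 0.40, 0.50],
])
K, best = P.shape[0], 0

T, regret = 2000, 0.0
wins = np.zeros((K, K))    # wins[i, j] = observed number of times i beat j

for t in range(T):
    a, b = rng.choice(K, size=2, replace=False)   # placeholder policy: random pair
    a_wins = rng.random() < P[a, b]               # only the pairwise outcome is observed
    wins[a, b] += a_wins
    wins[b, a] += 1 - a_wins
    # Dueling-bandit regret: how much worse the compared pair is than (best, best).
    regret += (P[best, a] - 0.5) + (P[best, b] - 0.5)

print(f"cumulative regret of the random-pair policy: {regret:.0f} (grows linearly in T)")
```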
Combinatorial Bandits
• Sometimes actions must be selected from a combinatorial action space
  – E.g., shortest-path problems with unknown costs on the edges
    • a.k.a. routing under uncertainty
• If you knew all the parameters of the model, this would be a standard optimization problem
http://www.yisongyue.com/publications/nips2011_submod_bandit.pdf
http://www.cs.cornell.edu/~rdk/papers/OLSP.pdf
http://homes.di.unimi.it/cesa-bianchi/Pubblicazioni/comband.pdf

General Reinforcement Learning
• The bandit setting assumes actions do not affect the world
  – E.g., the sequence of experiments (Treatment, Placebo, Treatment, ...) does not affect the distribution of future trials
• General reinforcement learning removes this assumption

Markov Decision Process
• M states, K actions
• Reward μ(s, a)
  – Depends on the state
• For t = 1 ... T:
  – Algorithm (approximately) observes the current state s_t
    • Depends on the previous state & action taken
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation μ(s_t, a(t))
• Example: personalized tutoring [Emma Brunskill et al.] (**)
  – (The slide shows screenshots and an excerpt describing the MultiLearn+ multi-student tutoring system.)
** http://www.cs.cmu.edu/~ebrun/FasterTeachingPOMDP_planning.pdf

Summary
• Interactive Machine Learning
  – The Multi-Armed Bandit Problem
  – Basic result: UCB1
  – Surveyed extensions
• Advanced Topics in ML course next year
• Next lecture: course review
  – Bring your questions!