>> Yuval Peres: All right. Good afternoon, everyone. Welcome to theory day. We have, I'm sure, a bunch of great talks coming. In fact, we have a combination of some [inaudible] veterans and some young researchers and postdocs. When I told our postdocs that they should not feel too pressured, that if the talks don't go well, then in a few years they can recover from it, somehow this reassurance didn't work so well, because I saw them at 11 p.m. yesterday practicing their talks. So, anyway, I'm sure they'll all be fantastic. And we're starting with a talk by Bobby Kleinberg on incentivizing exploration.

>> Bobby Kleinberg: Thanks so much, Yuval, for organizing this workshop and inviting me to present this paper. I'm going to be telling you about some work I did on incentivizing exploration with Peter Frazier, David Kempe, and Jon Kleinberg. This picture you might be wondering about. This is Christopher Columbus receiving his gifts from Ferdinand and Isabella for discovering North America. So this is a depiction of how incentivizing exploration used to work at some time in the past. We had all-powerful monarchs who would command their loyal subjects to go out and explore. The subjects would obediently do it, and then, if successful, they would earn their rewards. In the online world, also, there are many powerful entities that depend on their subjects, now called users, to explore a world of possibilities for them. Amazon, for example, is a store where you can buy almost anything and read reviews of almost anything, but it's not as if Amazon employs full-time reviewers to figure out which products are good and which ones are bad. They just depend on the autonomous activities of users going about their business. You see the same pattern repeated in many, many other contexts, both in the commercial world and elsewhere. Social news readers need to recommend articles to you, but in order to do it, they need you to recommend articles to them. There are these collective citizen science efforts where amateurs at home with their telescopes are mapping the sky, but there's no global governing organization that can dictate to them which part of the sky to aim their telescopes at. And it's a bit of a stretch, but you could even say that the same types of issues play out when we talk about things like national funding of scientific research. We have an organization like the NSF that may have priorities for what research projects it wants people to undertake. But in the end the research is being done by individual scientists who will pursue their own aims, which may align with, but are not obedient to, the dictates of the funding agency. In all these situations, what we have really is a misalignment of incentives, where we have a principal whose goal is to explore a broad space of alternatives and collect information about all of them, and we have individual users whose goal is to select the best alternative for them. To state it in a more pithy way, we have a principal whose goal is to explore and individual users whose goal is to exploit. Stated in this way, it's very tempting to model this dilemma of misaligned incentives as some type of multi-armed bandit problem. So in the next slide I'm going to introduce, or recap for you, the multi-armed bandit problem. The version of the problem summarized on this slide is what generally gets called the Bayesian multi-armed bandit problem.
So this is a problem where there are K alternatives, conventionally called arms. The strange name stems from the metaphor of K different slot machines, each of which has an arm that you can pull and get some random payout. And we'll assume that the payout distribution of one of these slot machines is determined by some unobservable parameter called its type. So the K arms have independent random types, but the type itself is unobservable. All you can do to find out the type of an arm is to pull it, observe one random sample from its distribution, and as you accumulate more and more samples, you accumulate more and more certainty about what its type is. Okay. In the model of decentralized, crowdsourced exploration of a set of alternatives, I'm going to assume that you have a sequence of users -- in this talk it will be modeled as an infinite sequence -- indexed by time. So the user that comes in at time T will participate in one and only one round of decision-making in which it chooses one of K arms. I'll call it I sub T. Think of this as a user coming to amazon.com to buy an 8-megapixel camera; there's some finite set of 8-megapixel cameras that are sold at Amazon, and the user selects one of them. Afterwards it observes how good that alternative was, maybe on a scale of one to five stars, and reports what reward it got from pulling that arm. The sequence then repeats at the next time step. And in this paper we're assuming the history is fully observable, so the user that comes in at time T can see all of the star ratings that were provided in all the time steps 1 through T minus 1.

And then, to reiterate with a bit more formalism what I said on the previous slide, the principal's goal is to maximize the long-run average of people's payoffs. I'm going to do the standard thing in this field and assume that there's some geometric discount factor gamma less than one, and payoffs that people get T time steps in the future contribute with weight gamma to the T into this weighted average that the principal is trying to optimize. The user at time T, on the other hand, has the goal of just maximizing their individual constituent of that weighted average. And so the policies that would maximize this quantity and this one respectively I call the optimal policy and the myopic policy. So the optimal policy chooses a sequence of I sub T's that maximizes the expectation of this average reward. The myopic policy is a completely greedy algorithm that in each step T chooses the I sub T that maximizes the expected R sub T given the history encoded in the public log file of the past T minus 1 observations.

What's known about these two alternatives? The optimal policy in principle could be very, very complex. It's a function that maps every possible history of T minus 1 steps into a decision at time T. There are exponentially many possible histories, so encoding the truth table of this policy could potentially take doubly exponential space. And, in fact, for a long time this problem of finding the optimal policy for the multi-armed bandit was thought to be so hard that statisticians who were working on it on the Allied side during World War II joked that if we wanted to defeat the Germans, what we should do is drop leaflets containing descriptions of the multi-armed bandit problem over Germany, and their scientists would become so preoccupied with this unsolvable problem that they would get distracted from the war effort and we would vanquish them.
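To pin down the two objectives just described, here is one way to write them in symbols. The normalization by 1 minus gamma and the time indexing starting at t = 1 are conventions of mine; the talk only specifies the geometric weights.

```latex
% Principal's objective: discounted weighted average of all users' rewards
V(\pi) \;=\; \mathbb{E}\Big[(1-\gamma)\sum_{t=1}^{\infty} \gamma^{\,t-1} R_t\Big],
\qquad 0 < \gamma < 1,
\qquad \text{optimal policy} \;=\; \arg\max_{\pi} V(\pi).

% Myopic policy: user t greedily maximizes only its own expected reward,
% conditioned on the public history H_{t-1} of the first t-1 pulls
I_t^{\mathrm{myopic}} \;=\; \arg\max_{i \in \{1,\dots,K\}}
  \mathbb{E}\big[R_t \,\big|\, H_{t-1},\, I_t = i\big].
```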
So then it came as quite a surprise in the 1970s when this man, John Gittins, came along and solved the multi-armed bandit problem with an amazingly simple and succinct solution. He defined a number called the Gittins index that you can compute for each arm that depends only on the past history that you observed for that arm. I'm not going to tell you how this number is computed because it won't be relevant for my talk, but the optimal policy is simply: at every time, compute the Gittins index for every arm and pull the one with the maximum index. Question.

>>: What's the model of -- how are these R sub T distributed? Do you need to -- I mean, if you know this or you don't know this, this could change your approach.

>> Bobby Kleinberg: So you need to have a prior belief about how the reward distribution of an arm depends upon its type parameter. The Gittins index policy is generic in the sense that no matter what your prior is, it gives a procedure for computing a score such that the optimal thing to do is to pull an arm with the highest score. The specific function that that procedure computes will depend on the form of your prior. So this might be easy or hard to compute depending on how complex your prior is. For standard priors, like a beta prior -- a common one that's used in reality is you believe that the arm has some unknown parameter theta between 0 and 1.

>>: Independently drawn from this.

>> Bobby Kleinberg: Independently drawn from their beta distribution. When you pull it, you see a binary reward, which is either 0 or 1, and you do a Bayesian update that gives you a beta distribution with different parameters. So that would be a typical prior that people would use in practice.

>>: 0, 1 [inaudible]?

>> Bobby Kleinberg: In the specific instantiation of the model that I just talked about with beta distributions, outcomes are 0 or 1. For Gittins' theorem, all he needs is that there exists a bounded subinterval of the real line, and outcomes are distributed on that interval. So the result is really quite general. Okay. For myopic users, the policy is even simpler to describe. I described it on the last slide. It's a purely greedy policy that always computes the posterior expected reward of each arm and pulls the one with the highest expectation. A lemma that we prove in our paper, whose proof I'll skip in the interest of time, is that the value of the myopic policy is always at least 1 minus gamma times the value of the optimum. And this multiplicative factor 1 minus gamma can't be improved in the worst case.

>>: Gamma is the discounting [inaudible]?

>> Bobby Kleinberg: Gamma is the discounting factor. That's right. So if the principal is very patient, gamma is close to 1, it might be .999, and then this is a 1 over 1,000 factor, which is not very desirable. If the principal is very impatient -- so let's consider gamma equals one half. This is a principal who is so impatient that a reward of one in the present day is as good as getting 0 in the present day and one every day from tomorrow until eternity. And so if you're discounting so steeply that the present period is worth as much as the future combined, this says that being myopic is half as good as being optimal, which is actually not so surprising. Right? Myopic is doing exactly the best thing you could do in the present, and the present accounts for half of all the value you could get over time. Okay.
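As a small illustration of the beta-Bernoulli example from the exchange above, here is a minimal sketch of the myopic (greedy) policy with Bayesian updating. The Beta(1, 1) prior, the horizon, and the example arms are assumptions for illustration only, and this deliberately implements the greedy rule rather than the Gittins index.

```python
import random

def myopic_bernoulli_bandit(true_thetas, horizon, prior=(1.0, 1.0)):
    """Greedy policy for Bernoulli arms with independent Beta priors.

    Arm i pays 1 with unknown probability true_thetas[i]. The policy keeps
    a Beta(alpha, beta) posterior per arm and always pulls the arm with the
    highest posterior mean -- no exploration bonus of any kind.
    """
    k = len(true_thetas)
    alpha = [prior[0]] * k
    beta = [prior[1]] * k
    rewards = []
    for _ in range(horizon):
        # Posterior mean of arm j is alpha[j] / (alpha[j] + beta[j]).
        i = max(range(k), key=lambda j: alpha[j] / (alpha[j] + beta[j]))
        r = 1 if random.random() < true_thetas[i] else 0
        # Bayesian update: Beta(a, b) -> Beta(a + r, b + 1 - r).
        alpha[i] += r
        beta[i] += 1 - r
        rewards.append(r)
    return rewards

def discounted_value(rewards, gamma):
    """Normalized geometric discounting, as in the objective above."""
    return (1 - gamma) * sum(r * gamma**t for t, r in enumerate(rewards))

# Two arms, one clearly better; the greedy policy can lock onto whichever
# arm happens to look better early, which is exactly the principal's worry.
print(discounted_value(myopic_bernoulli_bandit([0.3, 0.7], 200), gamma=0.95))
```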
So the proof of this lemma is not quite as brief as I just made it out to be, because neither of these policies is getting a stationary reward sequence whose expected value in each time period is the same. The expected values are probably increasing over time. So you need to do a little bit more work to prove this approximation. But trust me that it's valid. The trouble is that when the principal is very patient, this approximation factor is very close to zero, and we want to know if we can do something better.

To do something better, we're going to introduce a new feature into our model. We're going to allow the principal to pay the agents a bonus for exploring alternatives that are not myopically optimal for them. So, you know, in the simplest implementation of these bonus payments, you might just post a sign that says: for arms 1 through K, here's the bonus that you would get if you were to pull each one of them right now. That changes the user's decision problem so that they're maximizing posterior expected reward plus bonus. And in the applications that I talked about, you presumably would not put up a sign that literally says I'm paying you to do suboptimal exploration and here's how much I'm paying you. But you might instead, for example, if you were Amazon, just silently offer a discount on the 8-megapixel camera that does not yet have as many reviews as its competitors. And in that way you might hope to accumulate a more diverse set of training data without the users knowing that you were making them do your experimentation for you. And in some of these other environments, like social news readers, where they're not actually exchanging money with their users, you can think of these payments as being in some kind of artificial virtual currency, like the reputation points of a user of the system. And indeed you often find on these systems that they give their users reputation points, and when you reach certain milestones, you get a badge and other people can see your badge. And there's a research literature that I have not contributed to, but that people like my brother are quite interested in, having to do with how to design these virtual reward systems to encourage the maximum amount of participation.

Okay. Now, in my model where we have this publicly observable log file of every transaction that's ever taken place, if the principal and the users are correctly doing Bayesian updates on the evidence of that log file, they will always have the same posterior beliefs about the arms at all points in time. Which means that if I'm a principal that's trying to get somebody to pull arm I instead of arm J, I know exactly how much bonus I need to pay them to bridge the difference in expected rewards between those two alternatives. So an equivalent description of what the principal is trying to do is: it can adopt any policy it wants for selecting which arm to play at time T based on past history, but if it makes a selection other than the myopically optimal arm at any time, it needs to pay this much cost to bridge the gap in the user's utility between doing what's myopic and doing what the principal is asking. Okay. So that specifies the model of the problem that we're investigating. This is a good time for me to take a break and tell you about a bunch of very interesting related work, much of it coming out of MSR.
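Before the related work, a minimal sketch of the equivalence just described: since the principal and the users share the same posterior, the bonus needed to steer a user toward a particular arm is just the gap in posterior expected rewards. The function and the list-of-means representation are hypothetical, purely for illustration.

```python
def required_bonus(posterior_means, target_arm):
    """Bonus the principal must offer so that pulling target_arm is weakly
    preferred by a myopic user who shares the principal's posterior beliefs.

    posterior_means: posterior expected reward of each arm.
    Returns 0 if target_arm is already myopically optimal.
    """
    myopic_value = max(posterior_means)
    return max(0.0, myopic_value - posterior_means[target_arm])

# Arm 2 trails the myopic favorite by 0.15, so a bonus of 0.15 makes the
# user indifferent between exploiting and doing what the principal asks.
print(required_bonus([0.60, 0.45, 0.45], target_arm=2))
```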
That work has to do with models that share the same motivation of incentivizing users to explore a space of alternatives for you, but don't make use of this assumption that you can pay people to pull arms that are not myopically optimal for them. So in the absence of that assumption, what other mechanisms do you have for manipulating their behavior? Well, you could withhold certain information from them. In my model I'm assuming you can see the full history of every transaction that's ever happened in the past. But many recommendation systems don't show you every review that's ever been produced; they just sort of say here's one or a couple of alternatives that we're recommending to you and, you know, maybe here's some small amount of evidence why we think this is a good recommendation. Think of Netflix, for example. Netflix doesn't make public the set of all star ratings that all of its users have ever given to movies. Okay. So in those models, an exemplary one being this 2013 paper of Kremer, Mansour, and Perry -- with apologies to Yishay, who's in the audience, I have the conference citation, but I think it's now in some journal --

>>: JPE.

>> Bobby Kleinberg: JPE, yes. So not just some journal but a top-five one. They have a multi-armed bandit problem where the rewards are privately observed and the principal controls the information channel by which information about the past history is funneled back to the users. And their paper for the most part focuses on the case of two arms, one of which is a priori better than the other, and which are collapsing in the sense that after you pull the arm once and observe a reward, you have no remaining uncertainty about its type. So your prior collapses to a point mass distribution the very first time you pull the arm. And in nonstrategic settings, it's trivial to design an optimal policy for these problems, but in the strategic setting, where you have to give people advice and they have to be willing to take your advice, it turns out to be quite challenging to solve for the optimal policy even in this setting. Their paper is primarily devoted to doing that. They have some follow-up results dealing with a greater number of arms. But then there's this paper from two years later where Yishay, Alex, and Syrgkanis did an extension from two to many arms; they allow for much more general prior distributions over the arms. And they give a policy which, while not being optimal, achieves a regret that has the optimal scaling in terms of the number of time steps up to constant factors. I haven't defined regret in this talk, but for non-Bayesian analysis of the multi-armed bandit problem, this is sort of the gold standard for how you evaluate the quality of algorithms. And in this discussion of related work, I skipped over another important reference, which is this working paper of Yeon-Koo Che and Johannes Horner, two economists who are looking at a continuous time model that's very similar to the Kremer, Mansour, Perry discrete time paper and obtaining qualitatively similar conclusions. Okay. As I said, all of these papers are on mechanisms without money that try to achieve the same effect as the mechanism in our paper. But now let me return to our model, in which rather than withholding information about the past we are paying people side payments to get them to do things that are not greedy in the present time step. So to state our main results, I need to define a term called the achievable region.
And the way to think about it is that if you're the principal in this problem, there are sort of two measures of cost that you want to simultaneously minimize. One is the opportunity cost relative to the first best policy, the Gittins index policy that you would use to explore this space of alternatives in a perfect world where everybody's incentives were perfectly aligned. And I can normalize this to be a number between 0 and 1. So if we scale all of the rewards so that the expected geometric discounted reward of the optimal policy is exactly one, and the principal instead adopts some suboptimal policy that gets 1 minus A, then I'm going to call that number A the opportunity cost and plot it on the X axis. Okay. And, you know, the other thing the principal would like to minimize is the amount of bonuses that it has to pay out to the users. So I'm also going to express that as a multiplicative factor B times the value of the optimal policy. We now have two parameters A and B, and we'd like to drive both of them down to zero, but in general you can't do that. Okay. And just to make sure the meaning of the two axes is transparent to everybody, let's take that result that I presented a few slides back that said the value of the myopic policy is at least 1 minus gamma times the value of the optimal one. Okay. The myopic policy never has to pay anybody any bonuses, so it's at 0 on the Y axis. And the theorem says that it's getting at least 1 minus gamma times the optimum, so on the X axis it lies somewhere between 0 and gamma. Okay. So that's one example of plotting a point in the achievable region. So, oh, right, I forgot to define achievability. If a policy satisfies these two inequalities, we say it achieves the loss pair (A, B). And then we say that a loss pair is achievable, or maybe I should say universally achievable, if for every instance of the multi-armed bandit problem, no matter what priors you have on the arms, there always exists a policy that achieves that loss pair. So the previous result with the 1 minus gamma in it says that the point (gamma, 0) is universally achievable, and the objective of the paper is to map out the entire achievable region and not just its X intercept.

>>: Is this an expectation [inaudible]?

>> Bobby Kleinberg: This is -- yes, yes, yes. Very important question. Both of these should be interpreted as expected values. In a lot of these problems it would be interesting to be able to solve, for example, for the maximum expected reward under a hard constraint on the payment. Right? Like, you know, I give you a budget of $100,000 of bonus payments that you can pay out to your users, and I don't want you to satisfy it in expectation; you should never exceed the budget. We don't know how to solve that problem, but I think it's a really interesting one. Okay. You know, once we had formulated the problem this way, one thing that struck me as really appealing about it is that this is a model that in some concrete sense lets you plot or depict the exploration versus exploitation tradeoff inherent in multi-armed bandit problems. I've written a lot of papers in my life about multi-armed bandits. I can't tell you how many times I've stood in front of a projector screen saying you should think of the multi-armed bandit problem as a theoretical construct that abstracts the exploration-exploitation tradeoff that decision-makers often face. Okay. But here's a model where, if you want to explore, you have to pay people to do the exploration.
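In symbols, the two inequalities defining the loss pair, with rewards scaled so the unconstrained optimum has value 1 (the notation is mine, not the slide's):

```latex
% V(\pi): discounted value of policy \pi;  P(\pi): discounted expected bonuses.
% Normalize so that V(\mathrm{OPT}) = 1.  Policy \pi achieves the loss pair (A, B) if
V(\pi) \;\ge\; (1 - A)\,V(\mathrm{OPT})
\qquad\text{and}\qquad
P(\pi) \;\le\; B\,V(\mathrm{OPT}),
% and (A, B) is universally achievable if, for every prior over the arms'
% types, some policy achieves it.
```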
So the cost of exploration is very nicely encoded on the Y axis. And if you let people exploit, they do something that's in general suboptimal, and so the cost of allowing people to exploit is nicely encoded on the X axis. And so the shape of this curve is the shape of the exploration-exploitation tradeoff curve for the multi-armed bandit problem. Okay. So once I had conceptualized our problem in that way, I became very curious to know what the shape of this curve is. And our main theorem tells you exactly what the curve is. The achievable region is the set of all pairs (A, B) that satisfy this inequality: the square root of A plus the square root of B is greater than or equal to the square root of gamma. Let's pause and reflect on what this result tells you about the incentive dilemma. Okay. The first qualitative takeaway from this main theorem is that the achievable region is convex, closed, and upward monotone, meaning that if I have any (A, B) and I increase one or the other coordinate, I remain in the achievable set. Okay. Except for being a closed set, these other two properties are obvious in hindsight. If I have a policy that achieves (A, B) and another that achieves (A prime, B prime), for example, I can achieve their midpoint by tossing a coin at time 0 and deciding whether to use policy 1 or policy 2. The achievable region is set-wise decreasing in gamma. As I increase gamma towards 1, the set shrinks. This is also consistent with our intuition that as a principal becomes more and more patient, its incentives are less and less aligned with those of the myopic users, and so it should become harder and harder to achieve the points that are close to zero. A more interesting qualitative takeaway is that there are certain points, even ones that are not very far away from zero, that belong to the universally achievable region no matter how patient the principal is. So any time root A plus root B is greater than or equal to 1, even as the principal approaches infinite patience, it still remains possible to achieve that loss pair. You could state this more interpretably as saying that the point (0.1, 0.5) corresponds to the statement that the principal can always run their system at 90 percent of the optimal learning policy, while giving back only 50 percent of the surplus to the users in the form of these bonus payments. And that holds even in the infinite patience limit. A final thing that I should say, although I forgot to put it on the slide, is that I told you about the result that the X intercept of this region is at (gamma, 0). I said that that's a pretty easy lemma, although I skipped the proof. The result says that the Y intercept is also at (0, gamma). As far as I know, that's a hard result. And even proving that the Y intercept is finite, as far as I know, is a hard result.

>>: [inaudible]?

>> Bobby Kleinberg: No. No. So the Y intercept means the opportunity cost is pinned at zero. So you have to run the Gittins index policy and pay them whatever it takes to get them to keep doing what the Gittins index policy wants them to do. So you keep paying them the difference between the myopically best arm and the best arm according to the Gittins index. There's no reason that those payments have to be bounded above by the value of the arms that Gittins is pulling. I guess maybe I could depict it, using a whiteboard marker that works. You have the payoff sequence of the optimal policy, which presumably looks like this.
And you have the payoff sequence of the myopically best arm, which is also increasing. It must eventually increase more slowly; if it were better than this one out to time infinity, then this one wouldn't be optimal. So these two eventually cross each other. But initially this one might be way more than twice as high as that one. It might be a thousand times as high. So the policy that achieves the Y intercept is paying for these gaps, and the optimal policy is only gaining this amount. And so if you chop at any finite initial time, there's actually no bounded approximation factor between what the optimal policy is getting and how much you have to pay people to play it. It's only if you let things go out to infinity and take advantage of this geometric averaging that the approximation kicks in. Okay. I want to devote the remainder of my talk to giving you some sketch of how we prove this result. It's an if and only if, so there's an unachievability part and an achievability part. And the one that's easier to talk about is the unachievability, so let me do that first. There's a particular type of stereotypical hard instance of incentivized exploration that we call Diamonds in the Rough. This is searching through a bunch of risky alternatives that are probably worthless but have a huge upside if they pay off, when there's an outside option that's a safe bet that everybody would rather take instead. Okay. So to link this back to the citizen science example: you're trying to get birdwatchers to go out and record observations of what birds they see in order to collect ecological data. This is like everybody in Ithaca wants to go to Sapsucker Woods and look at the birds there, because it's the most beautiful place to go birding. But maybe if someone would go and sit next to the airport runway for a while and watch the birds there, we would actually learn some very ecologically relevant information about the ways that air traffic interferes with bird migratory patterns. So you want to somehow get people to take these risky but potentially very valuable bets, but none of them wants to do it on their own. Okay. So how does that work quantitatively? We're going to have infinitely many collapsing arms, each of which -- you know, think of it as like a sealed envelope: the first time you pull the arm you see a payoff which completely reveals its type. Either it's an arm that always gives you some high payoff, capital M, or it's an arm that always gives you 0. The probability of giving the high payoff is 1 over capital M times 1 minus gamma squared. This magic number is chosen so that if you spend your whole life opening these sealed envelopes until you finally find the one that yields the big payoff and then you always play that one, the expected value of that policy is normalized to be equal to 1. And there's an outside option whose payoff is fixed, and this is normalized so that if you always play that one, then your geometric discounted reward is going to be phi. Okay. So there are two obvious policies to pursue here; I just told you what they are. One of them searches through the blues until it finds a winner; the other one always picks the yellow.

>>: Your phi is less than one?

>> Bobby Kleinberg: Phi is less than one, otherwise the problem is not interesting. Yeah. So the optimal policy gets a reward of essentially one. What cost does it pay? It pays the difference between this and the a priori expected value of one of those.
And it does that from time zero to infinity, which gets rid of the 1 minus gamma factor. Okay. So there's an opportunity cost of zero, and there's an incentive cost of phi minus 1 plus gamma, if you work out what the optimal policy does. The myopic policy has no incentive cost, and it only gets phi, which is less than 1, so there's an opportunity cost of 1 minus phi. Okay. And it's not obvious but it's true that the achievable region for this instance of the multi-armed bandit problem, in the limit as the number of these blue arms goes to infinity, is just an unbounded polygon with two corners at these two points and everything in the first quadrant that sits above them. And so as I vary this parameter phi, I get a bunch of unachievable points. All the white points lying below this line are unachievable. And as I vary phi, I get a bunch of different lines and a bunch of unachievable points that lie below them. So the union of all those white regions gives me a bunch of unachievable points. The theorem statement that I had on the previous slide exactly says that those are all the unachievable points. I want to take a second to point out to you something cute about the form of this curve. If you look at all these tangent lines that I derived by varying the value of phi, you'll notice that the sum of the X intercept and the Y intercept doesn't depend on phi; it's always equal to gamma. Okay. So the envelope that you trace out is the one that you get by starting with a point at the origin and one on the Y axis at (0, gamma), and letting them move at equal speed until the Y axis point drops down to the origin and the X axis point is at (gamma, 0). That's actually an art project that lots of people do when they're in elementary school. Okay. I found on the Web somebody who had done it with string. There's the curve. I remember doing this when I was in elementary school. My brother did too. And it's mysterious. I loved the beautiful shape of the envelope that came out of that. And at the time it was unimaginable to me that I would someday write a research paper where the answer to the question was that curve. There are other people besides me who are captivated by this curve. Here's somebody who constructed it between a fallen tree and a standing one. Okay. On to the achievability result. So now I need to take a point that lies inside the purple region and show that there is a policy that achieves it. And it's going to be a proof by contradiction. Suppose that, rather than the purple region, the achievable region was some other subset, denoted here in yellow. I explained to you that trivial reasoning justifies that the achievable region is convex. Okay. So if there's a purple point that doesn't belong to the achievable region, then there's a separating hyperplane, a line that passes through that point that separates it from everything that's achievable. And that line has some slope lambda, and to say that the line separates this point from the achievable region is to say that there's some instance of the multi-armed bandit problem where no policy can achieve reward minus lambda times cost as large as the value that that objective function attains at this putatively unachievable point (A, B). So reward minus lambda times cost: I'm going to call that the Lagrangean objective, or I'll abbreviate it as the lambda objective.
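As an aside on the string-art picture: the envelope of the family of lines whose X and Y intercepts sum to gamma is exactly the boundary curve of the main theorem. A short calculation, in my notation:

```latex
% Lines with intercepts a and \gamma - a, for 0 < a < \gamma:
L_a:\quad \frac{x}{a} + \frac{y}{\gamma - a} = 1 .

% Envelope condition: the partial derivative in a also vanishes,
-\frac{x}{a^{2}} + \frac{y}{(\gamma - a)^{2}} = 0
\;\Longrightarrow\;
a = \frac{\gamma\sqrt{x}}{\sqrt{x}+\sqrt{y}},\qquad
\gamma - a = \frac{\gamma\sqrt{y}}{\sqrt{x}+\sqrt{y}} .

% Substituting back into L_a:
\frac{x}{a} + \frac{y}{\gamma - a}
  = \frac{(\sqrt{x}+\sqrt{y})^{2}}{\gamma} = 1
\;\Longrightarrow\;
\sqrt{x} + \sqrt{y} = \sqrt{\gamma}.
```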
It's an objective function, parameterized by this parameter lambda, that measures the tradeoff between learning and earning: the tradeoff between the incentive payments that you have to make to agents and the value of the rewards that they reap. Okay. So our theorem that the achievable region is this purple one can be reinterpreted as saying that for every lambda you can always guarantee that there's a policy whose lambda objective value is at least some approximation factor times the opt. To extract what that approximation factor is, you would merely have to calculate, for the curve root A plus root B equals root gamma: if I draw a line of slope lambda -- I guess negative lambda -- where's the point of tangency, and what's the value of 1 minus A minus lambda B at that tangency point? Okay, I've done the calculation for you. The value is 1 minus lambda gamma over 1 plus lambda. And our theorem is simply equivalent to the assertion that not only for Diamonds in the Rough but for every multi-armed bandit instance, for any specified value of lambda, you can find a policy whose lambda objective is at least this big. That's what I need to prove for you. And, okay, so let me check in with Yuval. It's 1:20, which is when I was scheduled to stop talking.

>> Yuval Peres: You have ten more minutes.

>> Bobby Kleinberg: I have ten more minutes. Excellent.

>> Yuval Peres: [inaudible].

>> Bobby Kleinberg: Okay. I'm planning to spend less than ten more minutes. Okay. So proving that the answer to this question is yes is the bulk of the technical work in our paper. And I just want to present maybe two slides that give you a sketch of why this is true. So rather than computing the policy that optimizes the lambda objective, which we believe is probably a very, very hard, maybe PSPACE-hard, problem, we design a simple but suboptimal policy that we're able to analyze. The policy incorporates two features that are crucial to its analysis, time expansion and random censorship. So I'll call it the TERC policy, for time expansion with random censorship. And how does this policy work? At initialization time, before it pulls any arms, it tosses a biased coin for each arm and with this probability marks the arm as censored. Censoring an arm basically means I'm never going to try to learn whether that arm is the best. Okay. After initialization, at every time step it tosses an independent coin with bias 1 over lambda plus 1. You'll see why this 1 over lambda plus 1 in a second. If it gets heads, it plays the next step of the optimal policy limited to the set of uncensored arms. If it gets -- Yishay, yeah.

>>: [inaudible].

>> Bobby Kleinberg: It's a finite number K, and -- I mean, the parameter K is not represented in the pseudocode because --

>>: [inaudible].

>> Bobby Kleinberg: It's assumed to be finite. Yeah, that's right. So when I was informally analyzing the Diamonds in the Rough instance, I said that there are an infinite number of those blue arms, but it's really a limit of a sequence of finite instances, and so the proof of unachievability for the white points requires actually doing it for every finite number of arms and taking a limit. Okay. If the coin toss comes up tails, you just don't offer people any bonuses and they pull an arm myopically. I didn't say it on the slide, but you also ignore whatever observation came out of that arm.
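Stepping back for a moment: the value 1 minus lambda gamma over 1 plus lambda quoted above comes from the following tangency calculation on the boundary curve (my parametrization):

```latex
% Parametrize the boundary \sqrt{A} + \sqrt{B} = \sqrt{\gamma} by
A = \gamma s^{2},\qquad B = \gamma (1-s)^{2},\qquad s \in [0,1].

% The lambda objective 1 - A - \lambda B is largest on the boundary where
\frac{d}{ds}\Big[\gamma s^{2} + \lambda\gamma(1-s)^{2}\Big] = 0
\;\Longrightarrow\;
s = \frac{\lambda}{1+\lambda},

% which gives
A + \lambda B
  = \frac{\gamma\lambda^{2}}{(1+\lambda)^{2}} + \frac{\lambda\gamma}{(1+\lambda)^{2}}
  = \frac{\lambda\gamma}{1+\lambda},
\qquad
1 - A - \lambda B = 1 - \frac{\lambda\gamma}{1+\lambda}.
```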
So your posterior in this policy is actually not conditioning on the entire history; you only do Bayesian updates on the observations that you saw when your coin toss came out heads. Okay. So this trick where we interleave the optimum and the myopic policy, that's what we call time expansion. Why does it work, and why this 1 over lambda plus 1 in the bias? Here's a simple calculation. In any step, if you look at the expected value of lambda times the bonus that you pay out, that's lambda times 1 over lambda plus 1 -- because you only pay a bonus in the event that the coin lands on heads -- times the amount of the bonus, which is the payoff of the myopic arm minus that of the one that you told the person to play instead of the myopic arm. Now, in the remaining time steps, where you tossed tails, you actually got some surplus reward above what the optimum policy on the uncensored arms thought it was going to get. That is to say, had you been playing this restricted optimal policy, you would have gotten this payoff, but instead you told people to just play myopically, and they got, in expectation, a higher payoff. Okay. So in terms of the Lagrangean objective, that surplus, when you toss tails, exactly cancels the deficit that you get when you toss heads. Because tails comes up with probability lambda over lambda plus 1, and when it comes up, the surplus that you get in your Lagrangean objective is this thing in parentheses. Okay. So the time expansion trick brings about this cancellation, where from now on we can ignore the bonus payments that we're paying to the agents, because they're, in expectation, getting exactly canceled out by these windfalls that happen when the coin toss comes up tails. Okay. So, in other words, what I've said is that the Lagrangean objective of the TERC policy at time step T is equal to the expected payoff that the restricted optimal policy gets in the same time step. Okay. Now, this analysis has a bit of a something-for-nothing sound to it. We're giving people these bonus payments, and yet I just told you that in expectation our reward minus the bonus payment is equal to the reward of the optimum policy. So why are we censoring arms at all? We should just leave every arm uncensored and always get as much payoff as the optimal policy in every time step, and then we would be achieving the magical point (0, 0) on the achievability plot. Okay. So the reason that this something-for-nothing reasoning doesn't work is that the TERC policy is learning more slowly than the opt policy. The opt policy gets to pull an arm and see feedback in every time step. The TERC policy in step T is getting the same payoff that the restricted opt policy was expected to get in step T based on the knowledge state that the TERC was actually in. But the opt policy always gets a new data point and advances its knowledge state; this one only gets new data with this probability, and the rest of the time its knowledge state remains paused. So it's basically trying to simulate this policy but it's playing more slowly. Okay. So, in other words, the Lagrangean objective for the TERC policy is the same as the expected payoff for a stuttering opt that, every time it pulls an arm, stays on it for some random geometrically distributed number of steps and then advances. Okay. Stuttering slows down the rate of learning and reduces payoff. The censorship is designed to compensate for this slowdown.
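To make the control flow concrete, here is a schematic sketch of the TERC loop as described above. The helper callables (censoring probability, restricted-opt recommendation, myopic choice, environment pull, posterior bookkeeping) are assumptions standing in for machinery the talk leaves abstract; this illustrates the structure of the policy, not the paper's implementation.

```python
import random

def terc(arms, lam, horizon, censor_prob,
         restricted_opt_arm, myopic_arm, pull, update_posterior):
    """Schematic Time Expansion with Random Censorship (TERC) loop.

    censor_prob(arm)          -> probability of censoring this arm at time 0
    restricted_opt_arm(S, H)  -> arm the optimal policy restricted to the
                                 uncensored set S would pull given history H
    myopic_arm(H)             -> arm with the highest posterior mean given H
    pull(arm)                 -> one observed reward from the environment
    update_posterior(H, i, r) -> new knowledge state after reward r on arm i
    """
    # Random censorship: decided once, up front, before any arm is pulled.
    uncensored = [a for a in arms if random.random() >= censor_prob(a)]
    history = None  # empty knowledge state
    for _ in range(horizon):
        if random.random() < 1.0 / (lam + 1.0):
            # Heads (probability 1/(lambda+1)): pay whatever bonus is needed
            # so the user pulls the restricted-opt arm; record the observation.
            i = restricted_opt_arm(uncensored, history)
            history = update_posterior(history, i, pull(i))
        else:
            # Tails: no bonus, the user plays myopically, and the observation
            # is discarded, so the knowledge state stays paused.
            pull(myopic_arm(history))
    return history
```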
So, in other words, conditional on arm one remaining uncensored, the slowdown in learning delays the time when I get to pull arm one for the hundredth time. But censoring the other arms accelerates the time when I get to pull arm one for the hundredth time, because it didn't have to wait behind all those pulls of all those other arms that got censored. And the censorship probability is designed to exactly trade off this slowdown with this speedup, and that's about as much as I can say about the analysis of the censorship. So I'm moving on to my concluding slide now. In summary, this paper presented a model of crowdsourced information discovery that models phenomena like reviewing products or articles online or crowdsourced scientific exploration. We saw that situations like this involve a misalignment of incentives that has to do with explore-exploit tradeoffs, and the analysis that I provided allows a surprisingly precise quantification of that tradeoff as a function of the principal's patience. It allows you to make statements like: the principal can always achieve at least 75 percent of the social surplus while paying back only 25 percent to the users. The structure of the TERC policy lets you say that simple policies that randomize between just letting people do what they want and providing incentives for them to do something optimal -- simple policies like this -- are sufficient to achieve the approximation guarantees in this bullet. And the analysis of the lower bound gave us some insight into what the hardest instances look like. And so this was going to be the last slide of my talk, but yesterday I was waiting in the Overlake transit center for my bus, and I saw a quote from Mahatma Gandhi that I had never seen before. He said: Live as if you were to die tomorrow; learn as if you were to live forever. I hope you can see the relevance of that quote to this talk. But maybe you can also see that, uncharacteristically for Gandhi, his wisdom was sort of incomplete here. He was overlooking a fundamental tradeoff in life. We learn by living. So you can't adopt a different policy when learning from the policy that you adopt when living.

>>: [inaudible] randomization.

>> Bobby Kleinberg: Oh, yeah. That's right. So replacing Gandhi's maxim with the much wiser Yuval Peres maxim, what he should have said was: live and learn as if you were to die tomorrow with probability 1 over lambda plus 1, and live forever with probability lambda over lambda plus 1.

[laughter]

>> Bobby Kleinberg: And with that, I conclude.

[applause]

>>: [inaudible] make sense in practice [inaudible] in principle [inaudible] I have to see the whole history of all the [inaudible].

>>: Thousands and thousands of reviews.

>>: [inaudible] because otherwise.

>> Bobby Kleinberg: You wouldn't have to see that, because if we all have a common prior and you have a correct belief about what discount policy I'm using, then you could compute for yourself what discounts were applied.

>>: But I'm just wondering [inaudible] I'm wondering whether there's any middle ground where, you know, you could publicly announce [inaudible] X which is not the optimal thing but it's very easy for us to understand.

>>: Why do you care about the discount? You don't have [inaudible] you just care about -- you get to see the results. That's all the information there is about the arms. You don't actually care why people chose [inaudible].

>>: [inaudible] how can I find a policy [inaudible].
>> Bobby Kleinberg: Sorry, are you portraying that as a modification of Ander's question, or just --

>>: It's purely [inaudible]. I just want to say, is it clear to see them only [inaudible].

>> Bobby Kleinberg: Yeah. So no. Given A and B, it's not easy to define a policy [inaudible]. What's easy to do is, given a lambda, I showed you how to get a simple policy whose Lagrangean objective achieves a 1 minus lambda gamma over 1 plus lambda approximation to the optimum. But the business of achieving a specified A and B depends on knowing more about the parameters of your multi-armed bandit instance. It can't be described as a simple transformation of the restricted optimal policy. Ander, I think your question is a good one and a perceptive one. I haven't thought about the question enough to say anything of substance. So I think the one thing I'll say in response to your question is that I object that I was here for an entire morning of the Algorithms for Technology Transfer workshop, and I did not once hear a question that was prefaced by, you know, your theory is all fine and good, but bringing it a step closer to practice --

>>: [inaudible].

>> Yuval Peres: Okay. I think we have to move on, so let's thank our speaker.

[applause]