>> Ofer Dekel: Let's start. So in our problem we're going to have a
player, and player is synonymous with maybe an agent or learner or
decision-maker. And this player plays in the world and sometimes we
call the world nature or the environment. And I'd like you to imagine
maybe an application of an algorithm that's making some automated
investment decisions for us in handling our portfolio and the world is
the stock market or maybe a spam filtering algorithm that's observing a
stream of emails coming in and it has to determine which
ones are spam, which ones are not.
That's the player world. In the machine learning world we make
assumptions that the world is simple. We assume that the world behaves
like some stochastic process, very often an i.i.d. stochastic process, and in online learning we like to say that the world is oblivious, meaning the world doesn't react to us. We have some formulations of worlds that react, but even then we assume how they react.
And I want to make the point this is often an unrealistic assumption.
When we work in the world, when we make choices, when we make decisions
the world reacts and it's often in a way that we really can't model in
any good way. For example, if you think of the spam filtering example
that I gave, as I managed to detect spam better and better, the
spammers are going to notice that and they're going to change their
strategy.
This is a little bit of an arms race going on. As I improve and make better decisions, they get better as well. If you think of the online
investment or portfolio management problem also what's the world? The
world is the thousands of other investors that are trying to make a
profit at my expense.
It's kind of a zero-sum game in some sense; I mean, they want me to lose so they can win, and they're strategic. So the world is not i.i.d.; they're trying to hurt me.
So the assumptions are unrealistic, and the problem is we don't know how to model them very well: how does the world react to my actions?
In this talk what I'd like to ask, the question I'd like to ask is how
far can I go if I assume that the world reacts to me in the worst
possible way. So in fact the world isn't a world, it's an adaptive adversary, a bad guy. He's malicious. He knows my algorithm. He knows
the pseudo code of the algorithm I am going to run and has infinite
computational power. And assume that's the world. If I can deal with
that guy, then I can certainly deal with the real world. That's why
it's an interesting question to ask. That's going to be the topic of
the talk.
So let me explain how the player interacts with the adaptive adversary.
A little protocol that goes on. A repeated game. Assume that the
player has the power of randomization. This is critical. The player
has access to random bits and the player has to perform one of a finite
number of actions. So here they're depicted by the slot machines, by
these arms. So the player has to choose which arm she pulls.
And this is a repeated game. So it goes on for a while. Let me
describe one round of the game, round t of the game. So t minus one rounds have already finished, concluded, and they've played those
rounds so they've learned a little bit about each other. The player
has learned about the adversary and the adversary has learned a little
bit about the player.
Here's round T. It starts when the adversary chooses a loss for each
one of these possible arms and immediately conceals them. So the
player doesn't see what the losses are.
And the player defines a probability distribution over the arms, draws
from random bits, and according to the distribution chooses the arm she
wants to pull. She pulls the blue arm and suffers a loss of 0.8. She
gets to see what this loss is and she doesn't get to see what the other
losses are. This is why we use slot machines. This is the idea when
you pull one of the arms you don't know what you would have gotten had
you pulled any one of the other arms. This is called bandit feedback, and the name is for historic reasons: it's because these slot machines
are nicknamed one-armed bandits. This is called bandit feedback. It's
when I only see the value that I actually suffered, I don't see the
loss I would have gotten if I pulled something else.
This is the game they played. There's another game, similar game,
which is called the full information feedback game, which is the same
thing but there's one additional step at the end of each round. And that's a step where the other loss functions are in fact revealed.
At the end of each round the player sees what would have happened had
she done something else. That's called full information. If she doesn't, that's called bandit feedback. We'll talk about these settings
in parallel.
So let's define things a little bit more formally. So this is the K
armed online learning problem. So it's a capital-T round repeated game. So let's assume that capital T, the number of rounds that's going
to be played is known in advance and agreed upon between the adversary
and the player.
The player is randomized. The adversary is a deterministic adaptive adversary. So he's adaptive: he sees what the player plays and adapts his strategy based on what the player does, and he does so deterministically. Here's a more formal way to state the game. So, first of all, the notation. The player has an action set; on each round she plays one of K actions.
And here's something that makes things simpler. We can notice that the
adversary, because we assume that he's deterministic, we can actually
have him do all his decision making before the game begins.
So, right, he has infinite computational power. He knows the code, the
pseudo code of the algorithm that the player is playing. So he can run
all these simulations beforehand and just specify to us before the game
begins how he will react to any sequence of actions played by the
player. This is exploiting the fact that the adversary is
deterministic.
So actually the adversary doesn't participate actively in the game
itself. Before the game begins, he defines a sequence of T loss
functions where each loss function is a function of the entire history.
So this is how we have him specify how he's going to react to the
player's actions.
So he defines T functions: f_t takes t actions, the history of everything that happened until time t, and maps it to a loss. Now the game begins, and only the player plays in the game actively. So for T rounds, what happens is the player chooses a distribution theta_t over the action space, draws the action, and then suffers the loss f_t defined by the adversary, evaluated on the concrete sequence that the player played. Of course the player doesn't get to see the loss functions themselves; they're kept secret from him, but they're well defined for the purpose of the formal setup. The full information feedback game has one additional step that happens here.
This is at the end of each round: the player gets to observe, for every
possible X what would have happened had I played X now. I don't get to
see the counterfactual, what would have happened if I changed my
strategy from the beginning of time, but today I get to see what would
have happened had I done something else.
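To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. It is only an illustration of the setup just described; the particular constants and the placeholder loss and strategy are mine, not anything from the talk.

    import random

    K = 3        # number of arms
    T = 1000     # number of rounds, agreed upon in advance

    def adversary_loss(t, history):
        # Placeholder loss in [0, 1]; a real adaptive adversary may use the
        # entire history (x_1, ..., x_t), not just the last action.
        return ((history[-1] + t) % K) / max(K - 1, 1)

    history = []
    cumulative_loss = 0.0
    for t in range(1, T + 1):
        # The player chooses a distribution theta_t over the K actions
        # (uniform here, purely as a placeholder) and draws an action.
        theta_t = [1.0 / K] * K
        x_t = random.choices(range(K), weights=theta_t)[0]
        history.append(x_t)

        # Bandit feedback: only the loss of the chosen arm is observed.
        loss = adversary_loss(t, history)
        cumulative_loss += loss

        # Full information feedback would additionally reveal, for every
        # alternative action x, the value adversary_loss(t, history[:-1] + [x]).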
Here are two examples. Just to motivate. So bandit feedback. Here's
one little toy example: Online content optimization. So I assume that
I have a news website, and my editors or my reporters give me a bunch
of different articles. And I can only advertise three of them on the
top of the page. Which three do I show? And I want people to click on
these ads.
So I choose three to show. Some user comes to my website. I show
these three little ads, either get a click, which is great. That's a
loss of 0. Or if the guy doesn't click that's a 1. That's bad. But I
don't know what would have happened if I showed some other ads; that's a bandit example. And for full information feedback a very real example is investing in one stock. Let's say I just want to own one stock every day. I choose one stock. I buy it. I own it. But at the end of the day I know how the stock market did; I know how much money I would have made
had I invested in some other stock so I get full information in this
case. This is just a little bit of a motivation.
Okay. So now let's say I have a player, how do I evaluate the quality
of this player? So first let's define the expected cumulative loss of
the player. So what we do is we just sum up over these T rounds the
loss suffered by the player. We take expectation because the player
plays randomly. We define L of T to be the loss accumulated by the
player over the rounds of the game. But notice that this really isn't
meaningful. We didn't restrict the adversary in any way. The
adversary can assign a maximal loss to all actions. So he can inflate
my loss as much as he likes. That's not interesting. So we need to
compare to something. We need a reasonable basis of comparison. We
need to compare to some alternative which makes sense and I'm going to
use the very, very simple comparison which is to compare against
constant actions.
So I'm going to say how much do I regret not just sticking to maybe the
blue arm and just pulling that for the duration of the game. So I
choose one action. And I say my regret for not having chosen and
played this action throughout the entire game is just the amount of
loss in expectation that I suffered minus the cumulative loss I would
have suffered had I played that action again and again.
Now, my regret against the best action is, again, my cumulative loss.
The difference between my cumulative loss and the loss of the best
action in hindsight. So after the game is over, assume that all the
loss functions are now exposed. I can say if I had only played this
one best action all the time, this is what I would have gotten.
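Written out, with my own symbols for the quantities just described (the second sum plays the fixed action x on every round, so the entire counterfactual history is x, ..., x):

    L_T = \mathbb{E}\Big[ \sum_{t=1}^{T} f_t(x_1, \dots, x_t) \Big],
    \qquad
    R_T = L_T \; - \; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x, \dots, x).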
This is my notion of regret. Okay. So that kind of clears up the problem that the adversary can artificially inflate everything: if he inflates everything he'll hurt me and the competitor equally.
This is the notion of regret. What we'll try to do is bound regret, prove bounds on regret. Notice that as the game goes along the loss accumulates, the regret accumulates, and the regret can grow; the question is how fast does it grow?
So if the regret is guaranteed, if I can prove to you that it grows
only sublinearly, so maybe it grows like big O of T to the power of Q
for some Q less than 1, it means that as time goes by, the per-round gap between me and the best arm in hindsight is shrinking; I'm doing better, I'm improving. On the other hand, if I can prove to you that the regret is lower bounded by a linear function of T, it's Omega of T, I'm not learning anything. It means that round after round there's still a big gap between me and the best guy, a gap that persists for the entire game, so I'm not learning. These are the two kinds of things we want to talk about.
And then Q, the exponent here, is the learning rate. Smaller is
better. If we have a smaller Q we are learning faster.
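As a one-line restatement of this dichotomy (my own summary of the slide):

    R_T = O(T^q),\ q < 1 \;\Rightarrow\; R_T / T \to 0 \quad \text{(the per-round gap vanishes)},
    \qquad
    R_T = \Omega(T) \;\Rightarrow\; R_T / T \not\to 0 \quad \text{(no learning)}.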
So we've defined regret. We've defined the game. Can we prove some
bounds on regret? Yes. Here's the first result. It's kind of
depressing. No, we can't learn. So there always exists an adaptive
sequence of loss functions such that my regret grows linearly. It's
impossible to learn against adaptive adversaries. How do I prove this?
I can tell you what the adversary does. I can define to you an
adversarial strategy and the algorithm for the adaptive adversary that
will cause me to suffer linear regret. Here it is. He knows my pseudo
code. Before I've seen anything, he knows the distribution that I'm going to start out with. The player on each round defines some distribution over the arms, and theta one is just this distribution in the first round, before having seen anything. So he just chooses some action X hat that has positive probability under theta one. Now he defines his adaptive sequence of loss functions as follows. He ignores everything but the first action. It doesn't matter whether t equals two or three or four; he only looks at the first action. If that first action was X hat, which happens with some positive probability, then the loss is one.
Otherwise, the loss is zero. He doesn't care if I prove -- he doesn't
care if I learn, he just looks at my one first play. If that was X
hat, I will suffer a unit loss forever. After T rounds I'll suffer a
loss of T. I'll accumulate T units of loss.
There's some positive probability of that occurring. So my expected loss is T times that positive probability. What's the alternative? The best action in hindsight could be anything that isn't X hat. If I just
stuck to that I would have suffered zero throughout. So the regret
indeed is T times this constant. Very, very easy. That's the
adversary. And I can't learn.
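Here is a minimal sketch of that adversarial strategy in code, following the description above; the function and variable names are mine, and x_hat stands for any action that the player's first-round distribution theta_1 gives positive probability.

    # Sketch of the adaptive adversary that forces linear regret.
    # x_hat: an action with theta_1[x_hat] > 0, which the adversary knows
    # because he can simulate the player's algorithm before the game starts.
    def make_linear_regret_losses(x_hat):
        def f_t(history):
            # The loss ignores everything except the player's first action.
            return 1.0 if history[0] == x_hat else 0.0
        return f_t

    # If the first draw is x_hat (probability theta_1[x_hat] > 0), the player
    # pays 1 on every round, so her expected cumulative loss is at least
    # theta_1[x_hat] * T.  Any fixed action other than x_hat pays 0 throughout,
    # so the regret is at least theta_1[x_hat] * T, which is linear in T.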
So why did we bother and set up all these things? So we'll do what we
always do in computer science or machine learning. When a problem is
too hard, we just have to make it a little bit easier. We have to
restrict the adversary just a little bit. So this is the class of all
adaptive adversaries. We have to restrict it. We showed a point in
here which causes me to suffer a linear regret. I can't learn. So
here's another little subclass in there called oblivious adversaries.
Oblivious adversaries are a very special case of adaptive adversaries: they adapt in a trivial way, which is that in fact they do not adapt at all. So let's define that. Oblivious adversaries are adversaries
that don't rely on the past. We'll still use the same notation, the
function still takes the entire history, but it ignores the first T
minus 1 actions in that history, just looks at my current action.
So formally for any sequence of T actions and any other prefix of the
sequence of T minus 1 actions, if I take these first T minus one
actions replace them by the other guys, the last one is the same, then
the loss is the same. This is an oblivious adversary. And this is in
fact the standard assumption in 99 percent of papers in online
learning. We like to assume that the world is oblivious.
We can use the convenient shorthand, since we only depend on the last
guy, we can write F, give it that one argument. And then our
definition of regret simplifies to this simpler thing. You'll see this
in most papers. Most papers will use this kind of abbreviated notion
of regret; but just keep in mind that if we don't assume that the
adversary is oblivious, then this doesn't make sense. We have to refer
to the more complicated one that uses the full expressive form, and
you'll see actually quite a few papers that talk about adaptive
adversaries and use this notion of regret. You should be very
suspicious of those results. This only makes sense in the oblivious
case.
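In symbols (my transcription of the definitions just described): the adversary is oblivious if for every round t, every history x_1, ..., x_t and every alternative prefix x'_1, ..., x'_{t-1},

    f_t(x_1, \dots, x_{t-1}, x_t) \;=\; f_t(x'_1, \dots, x'_{t-1}, x_t),

so one may use the shorthand f_t(x_t), and the regret simplifies to

    R_T \;=\; \mathbb{E}\Big[ \sum_{t=1}^{T} f_t(x_t) \Big] \; - \; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x).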
To again give an example, when is it reasonable to assume the world is
oblivious, let's say I'm investing a thousand dollars in a stock, no
one is going to care. The world isn't going to change by my little
investment. If I'm investing $100 million in a stock, the stock market
is going to react; it's going to change. Big institutional investor
cannot assume that the world is oblivious to his actions, can't assume
that he's some negligible little speck of dust that doesn't affect the
world. His actions will cause the world to change and affect his
losses in the future. I don't know if this is true -- I didn't know
whether to say a billion -- I've never invested $100 million, so I
don't know.
So we have oblivious adversaries. This is online learning. This is
impossible. Let's define some intermediate steps in between. The
first one I want to define is the switching cost adversary. What's
that? It's more general than the oblivious, less general than the
adaptive. Oblivious adversary with switching cost is like an oblivious
adversary but he charges me an additional cost of C, some constant
switching cost whenever I change my action. So whenever my action on
time T is different than my action on T minus 1, I pay an additional
loss of C. Or, in other words, he defines this oblivious sequence
which depends only on the current action and then the real loss is that
oblivious loss plus C times the indicator function of did I change, did
I swap arms. And again the example is that maybe I'm investing a
little bit of money in the stock market so the stock market is
oblivious, but maybe I pay some transaction cost. Go on the Web, pay a $7 commission for each trade; you don't want to hop around every day, because every switch from one stock to another costs a little bit of a penalty. Notice this loss relies on the current action and the
previous action. So we say it has a memory of one; whereas in that
counterexample, remember that, he remembered all the way, every
iteration. He penalized me for the very first thing I did.
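Spelled out, the switching-cost loss described here has the form (my rendering of the slide, with \ell_t the oblivious part and C the constant switching cost):

    f_t(x_1, \dots, x_t) \;=\; \ell_t(x_t) \;+\; C \cdot \mathbf{1}\!\left[ x_t \neq x_{t-1} \right].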
So switching costs. They have a memory of one. So they are a special case of adaptive adversaries that only look at my current action and my previous action, which are in turn a special case of adversaries that have a finite memory, that can look M steps back in time but can't look all the way infinitely back in time. This is the little hierarchy I
want to talk about.
Just to mention -- I mean, most of machine learning is here. Most people like to assume this i.i.d. thing, online learning assumes the oblivious thing, and I want to talk about the rest of this hierarchy.
Now we have a table. When theoreticians see empty tables, they get
excited, sparkle in their eye. We have to fill in the values of this
table, here are different types of adaptive adversaries. We want upper
bounds and lower bounds on regret, in both the bandit feedback and the full information case.
I have to make some technical assumptions, just to fit everything in a
table, because I'm going to refer to previous work in other papers and
sometimes the assumptions are a little bit different. So I won't go
through the details of this but I'm just going to make two very general
assumptions. One is bounded range which means that whether I choose
this action or that action at time T, the difference isn't going to be
unbounded. There's some bound on that. And bounded drift means if I
play action X today or tomorrow, again there's not going to be a huge
gap between them. So these are necessary. You can go without them but
let's skip that.
An upper bound of T is trivial. I said everything is bounded. If I play any action, the regret I can accumulate in each round is bounded by one
or some constant. So if I play T rounds I'm going to have a regret of
T.
We proved a few slides ago that in the fully adaptive case I can't do better than that. Here I have a perfect characterization: when these
two things are equal I understand it. So I understand this so far.
Now I can take all the previous work done in this field. So a bunch of papers. They prove a lot of things. Again, online learning, or classic online learning, is what we understand the best. Here we have the best characterization: we have square root T, sublinear growth of the regret. We do have nice, quick learning, and the upper and lower bounds match, both in the bandit and the full information case. So this was the
state of the art when I started working on this problem.
Based on ten years of past work on online learning. And then the nice
thing about this table is that when one problem is a special case of another problem, the lower bound of course still applies. Lower bounds propagate this way and upper bounds propagate that way. If I could get T to the two-thirds against any guy with bounded memory, I could certainly get T to the two-thirds against unit memory; whenever we have a result here, it propagates this way.
So this is the state of the art.
I want to talk about two results. One is a result I had last summer with Raman Arora and Ambuj Tewari. There we used a blocking technique to give an algorithm that achieves T to the two-thirds against bounded memory adversaries in the bandit case, and, bam, bam, it propagates to the left. Blocking just means we take a standard off-the-shelf online learning algorithm and run it in blocks: it chooses some action and sticks to it for the duration of a predefined block, and only when a block ends can it change to a different action.
So, I mean, intuitively that would reduce the number of switches, but we can also prove that it actually works against memory-bounded adaptive adversaries. This is the first result; a rough sketch of the blocking idea is below.
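Here is a rough Python sketch of the blocking idea. It is schematic rather than the paper's exact algorithm: base_algorithm stands in for any off-the-shelf bandit method with hypothetical choose() and update(loss) methods, adversary_loss is the bandit-feedback loss oracle, and the block length B is a parameter that the analysis tunes, roughly on the order of T^(1/3).

    def run_in_blocks(base_algorithm, adversary_loss, T, B):
        # Blocking meta-algorithm (sketch): commit to one arm per block.
        history = []
        t = 0
        while t < T:
            arm = base_algorithm.choose()          # pick an arm once per block
            block_loss = 0.0
            for _ in range(min(B, T - t)):
                t += 1
                history.append(arm)                # play the same arm all block
                block_loss += adversary_loss(t, history)
            # Feed the whole block's loss back as if the block were one round.
            base_algorithm.update(block_loss)
        return history

    # With block length B of roughly T^(1/3), the player switches at most
    # T / B (about T^(2/3)) times, and the analysis converts the base
    # algorithm's ordinary regret guarantee into a T^(2/3) policy-regret
    # bound against bounded-memory adversaries.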
And now very, very recently, so hot off the presses, this is just this summer, with Nicolò Cesa-Bianchi, who came as our visitor to Microsoft Research, and Ohad Shamir from New England -- sorry, he's now in our group; he used to be in MSR New England and is now moving to our group. So this is yet unpublished work. And we prove this lower bound. So it's a lower bound on the number of switches. So again we're looking at the bandit
feedback case. And we're asking how much does an oblivious adversary penalize me for the number of switches, how much do I have to pay, and we showed that in order to get something better than T to the two-thirds regret on the oblivious part of the loss, we have to make at least T to the two-thirds switches.
So this is how we proved that. And, again, we have this magic
propagation to the right. And there's also this nontrivial propagation of why this implies that. It's very simple, but I can't go into it right now. So how do you prove these lower bounds? I want to say one
word about that. I have to prove the existence of an adversary.
Remember when we showed this lower bound, I just told you what that
adversary is. I don't know what this adversary is, but I have to prove
that he exists. So we used a very convenient technique called the
probabilistic method. There, if you want to prove something exists,
the way you do it is you define some probability distribution. Here it's a distribution over adversaries, or over sequences of loss functions, and you show there's a positive probability, when you sample from that
distribution of getting a sequence of loss functions that will inflict
this much damage. If there's a positive probability of finding that
guy, then he must exist. If he doesn't exist, then there's probability
0 of finding him when I just want to sample randomly.
All we have to do to use this technique is define what the probability distribution is, and the one that we just used is a Gaussian random walk. We define two arms whose losses hop around according to a Gaussian random walk, do some theory, and prove that that's enough to show that with positive probability we'll get a guy that will hurt us that much.
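For intuition only, here is a tiny sketch of the kind of randomized construction being described: two arms driven by a shared Gaussian random walk, with a small hidden gap on a randomly chosen arm. The step size and gap are illustrative placeholders, not the values used in the actual analysis, which is considerably more delicate.

    import random

    def sample_random_walk_losses(T, sigma=0.1, gap=0.01):
        # Sketch of the probabilistic-method construction: both arms share a
        # Gaussian random walk, and one randomly chosen arm is better by `gap`.
        better_arm = random.randrange(2)
        walk = 0.5
        losses = []                 # losses[t] = (loss of arm 0, loss of arm 1)
        for _ in range(T):
            walk += random.gauss(0.0, sigma)
            pair = [walk, walk]
            pair[better_arm] -= gap  # hide a small advantage on one arm
            losses.append(tuple(pair))
        return better_arm, losses

    # The probabilistic method then shows that, with positive probability over
    # this draw, any player that switches too rarely must suffer regret on the
    # order of T^(2/3); since that event has positive probability, such a hard
    # loss sequence must exist.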
So this is now the current state of the art. You can see that it's all
tied up here. So we understand, we understand what happens with switching; we have upper and lower bounds on regret that match up for memory one and for memory M. We have one little glitch here, which is in the full information feedback case: we're not exactly sure what the situation is when you have a memory of one. But when you have a memory of two or more we already understand it perfectly. So we almost have a perfect characterization of what's going on. The super interesting thing that just pops out here is this. You can see that there's a fundamental difference between learning with bandit feedback and learning with full information. And this isn't trivial, because maybe intuitively it seems trivial, but in what we did before they always looked the same. Somehow the bandit result was always a few years' delay until the paper came out, but you often saw that the regret behaved the same in the bandit and full information settings. Here we have a very clear indication that the bandit case and the full information case are not the same.
Let me summarize. The world reacts to our actions. We ask the
question how far can we go under the paranoid assumption that the world
is out to get us. It reacts in the worst possible way. We model this
problem as online learning against an adaptive adversary and then we
derive an almost complete characterization of policy regret in the
setting and we got this very, very interesting result which shows that
bandit feedback and full information feedback have a qualitative difference between them.
And that's it.
[applause]
Thanks.
Are there any questions?
Pedro.
>>: So what you said brings up [inaudible] spam in the [inaudible] of
spam your move is actually not classified particular spam but a whole
filter in the case of link spamming your whole function your link is
not -- there's a very large number of possible moves right there. And
your learning is choosing that function.
>> Ofer Dekel: Yes, I made things very easy for myself. This thing
completely ignores the dependence on the number of arms and certainly
none of this works if the number of arms is infinite. This is -- I
have research on this but it wasn't presented in this talk.
>>: That's my question, what happens in that case?
>> Ofer Dekel: So I mean each result is different. They all have a
polynomial dependence on the number of arms. Is it logarithmic or
square root or linear. We have to go into each one of the results and
kind of refresh our memory about it. Here I really just cared about
how these things developed with time. But definitely the dependence on
number of arms is an active, interesting field of research that I'm
engaged in. Another one is well I just compared against constant
actions, that kind of sucks, can we find more interesting comparison
classes and certainly we can do that. We can define structured comparators, we can define deterministic Markov decision processes, we can compete against constant actions that occasionally are allowed to switch, and so on. So we have all those results but I just didn't have
time to go into them.
We certainly don't have this beautiful kind of complete picture there.
>>: Two questions which are about the same. So the first one is, what's the dependency on M? Because do you lose linearly on M --
>> Ofer Dekel: Yeah, do you?
>>: Lose linearly on M.
>> Ofer Dekel: Yes, you lose -- yes.
>>: So when -- lower bound?
>> Ofer Dekel: Lower bound, I don't know. The lower bound is two
arms. We just need to show an adversary that gets this dependence on T, one with two arms. I don't think we really thought the M dependence through, but this upper bound depends on M linearly, and linear is good, because when you take MDP, kind of reinforcement learning type of algorithms, they will sometimes have an exponential dependence on the history, because the number of states is the number of actions to the power of the history length. So you're linear in the number of states, which is exponential --
>>: Much more complex there. The policy space is exponentially large; not a fair comparison in that sense.
>> Ofer Dekel: You're right. You're competing against policies, not fixed actions. That's why it's a higher bar.
>>: So the other question is do you have like a linear setting bounds
for this memory, when you have memory, so for the featurized case, I
guess [inaudible] this question like linearly bounded [inaudible].
>> Ofer Dekel: These results -- so, okay, so you're asking me -- so
here we had a discrete choice type of setting. We had one of a finite
number of actions what happens -- many of the upper bound -- these
upper bounds work for pretty much any online learning setting. The
lower bound is very, very specific to two arms, which is good enough
for this. And it's also very recent work. I don't think we've thought
about how this affects the linear case, but the upper bounds always work.
Yeah?
>>: I have a question about is the assumption about [inaudible]
situation [inaudible], for example, I'm analyzing click through rate
and I wanted to use choose optimal advertisement. And user calculated
also gives me one [inaudible] and let's say he finds [inaudible] in the
third position in the second position. You can count the strategy
affect the strategy to second position.
So he's not i.i.d., but he is going to get maximum click through rate. So through this area to improve i.i.d. -- can he calculate it with you?
>> Ofer Dekel: I don't know. I'd have to think about it. Let's take it off line and chat about it. I'm not fully sure who cooperates with who. In this world it's me against the world, everyone's against me and I'm running away. Let's definitely take it off line.
So now I'm putting on my organizer hat. So we have lunch. And here's
how it's going to work. Is there something I need to know?
>>: [inaudible] the option today we have those actually set aside out
front, up here, so if you want people to go this way [inaudible] and
then know what the options are.
>> Ofer Dekel: So here's how it's going to work. So we have three stations that are
giving out the food. One is right behind this wall. It's out these doors and to the left.
Another one -- the other two are in the two rooms that are just opposite here. So if
we go around this way or that way you'll see there's two other smaller classrooms set up
with tables and there's food there. What I'd like to ask if you're sitting on this side of
the room, if you could please go out these doors and then either take a left or a right
and get to that room, if you're sitting on that side of the room if you could exit
through those doors and go around. You could go to the first or second room. The
atrium of the building is open to us. So if you -- once you take your food you can
either sit in any of these rooms anywhere you see a chair you can sit down you can walk
through the glass doors there's an atrium, there's security guards that will prevent you
from going where you shouldn't be going. If you find yourself in a headlock or handcuffs,
you'll know you've gone too far. Please go wherever you like. There are soft drinks at
the end of this corridor. And bon appétit.