>> Ofer Dekel: Umar Syed from the University of Pennsylvania today. And this is joint work with Rob Schapire: Computing Mistake Exploiting Optimal Strategies in Very Large Zero-Sum Games.
>> Umar Syed: Thanks for inviting me. It's a pleasure to be here. As Ofer said, today I'll be talking about computing a certain kind of optimal strategy in really large games. This is joint work with my former advisor at Princeton, Rob Schapire. I graduated from there about a year and a half ago, a little more than that, but I've continued to work with him, and today I'm going to be describing recent work that we've been collaborating on together. Okay. So I'll be talking about zero-sum games today. A zero-sum game is a strategic interaction between two autonomous agents who have diametrically opposed goals. The usual and canonical example of a simple zero-sum game is the children's game rock, paper, scissors. In rock, paper, scissors each player has three strategies, and they have this cyclic relationship: rock breaks scissors, scissors cuts paper, and paper covers rock. Because of the cyclic nature among the strategies, there's no single strategy that is guaranteed to win for either player. So this raises the question of how we define an optimal strategy. In zero-sum games, and really in the study of game theory more generally, people usually define optimal strategies in a worst case kind of way. The best strategy in a zero-sum game is typically defined to be one that is best assuming a perfectly adversarial opponent. That is, choose the strategy that's going to do the best assuming that the opponent plays the worst counter strategy. So in rock, paper, scissors, if you ever played it as a child, the best thing is to randomly choose one of the three moves. Even if your opponent knows that you're doing that, this strategy is guaranteed to get a draw on average, and no strategy can be guaranteed to do better, so therefore it's optimal. Now, you might think that planning for the worst case is a safe and conservative approach, and you might wonder: if I plan for the worst case, does that mean I'm going to be optimal in every case? And the answer is no. By assuming that your opponent is perfectly adversarial, you may fail to exploit a mistake by your opponent. So let me try to extend this simple example to illustrate what I mean. Let's extend the game a little bit and consider the game rock, paper, scissors, bomb. This is an asymmetric game. One of the players, player one, has a move called bomb, and bomb has the property that it beats everything. So obviously in this game, the optimal strategy for player one is bomb. Naturally. And now what's interesting is that, using our worst case definition of optimality, every strategy for player two is optimal. Because every strategy is a sure loss, if you assume that player one is going to play the best counter strategy. But what if player one makes a mistake? What if he forgets to play bomb, what if he doesn't know he has a bomb? In that case the best strategy, if player one forgets to play bomb, would again be the uniform distribution on all three moves. But nothing in our current definition of optimality is guiding us towards that strategy.
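A tiny numeric illustration of the rock, paper, scissors, bomb point above. This is a sketch, not from the talk: it assumes a conventional payoff encoding (+1 if the column player wins, 0 for a draw, -1 if the column player loses) and takes the row player to be the one who has the bomb. Every column strategy has the same worst-case value, but only the uniform mix exploits a forgotten bomb:

```python
import numpy as np

#            rock  paper  scissors   <- column player (player two, no bomb)
M = np.array([[ 0,   +1,    -1],     # row plays rock
              [-1,    0,    +1],     # row plays paper
              [+1,   -1,     0],     # row plays scissors
              [-1,   -1,    -1]])    # row plays bomb: beats everything

uniform = np.ones(3) / 3
pure_rock = np.array([1.0, 0.0, 0.0])

# Worst-case (classical) value: the same for every column strategy,
# because the row player can always answer with bomb.
print((M @ uniform).min(), (M @ pure_rock).min())   # -1.0 -1.0

# But if the row player forgets the bomb, uniform is clearly better:
print(np.sort(M @ uniform))      # [-1.  0.  0.  0.]
print(np.sort(M @ pure_rock))    # [-1. -1.  0.  1.]
```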
To give you a slightly less trivial example, this position in chess is called the Saavedra position. So here we have the black player and the white player. The white player has a king and a pawn, and the black player has a king and a rook. If you play chess, you know that a rook is much more powerful than a pawn. So it would appear that this position is a sure win for black; black seems to have an enormous advantage. And indeed for many, many years in the 19th century, people assumed that this position was a win for black. But then in 1895, this Spanish priest named Reverend Saavedra -- I'm not sure if I'm saying it correctly; if anyone speaks Spanish, correct me -- discovered an ingenious sequence of moves that actually wins this position for white. Now imagine you're the black player and you find yourself in this position.
>>: Who has the move?
>> Umar Syed: White. White to play and win. And white is going towards the top of the board. So he can't kill the rook right away. So if you're the black player in this situation, you find yourself in the Saavedra position, and you're a very good black player, then you know that once you reach this position the game is a sure loss, assuming your opponent is perfectly adversarial. And so from this position you would regard every possible sequence of moves as equally good -- that is, equally bad, because they all lead to a loss. But if your opponent, the white player, doesn't know about the Saavedra position, or fails to properly execute the ingenious sequence of moves that leads to a win, then you should exploit that. But again our current pessimistic definition of optimality isn't guiding us towards any strategy like that. When I was reading about this, I found that there's a chess grandmaster named Fred Mitchell who has a saying that the hardest thing to win in chess is a won game. That is, once you've reached a position that is known to be a win, it's actually very hard to execute that through to completion. So in this talk I'm going to be talking about how to efficiently compute mistake exploiting optimal strategies in really large zero-sum games, and that will be the bulk of the talk. Towards the end of the talk I'll try to connect this more to machine learning, and specifically reinforcement learning, which is an interest of mine, and show how the algorithm we developed for computing these mistake exploiting strategies can be used to solve certain problems in reinforcement learning. But that will just be a little bit at the end of the talk. Okay. So here's kind of a roadmap. Let me more formally define what exactly a zero-sum game is. Every zero-sum game is defined by a matrix called the game matrix, and there are two players called the row player and the column player. The game proceeds as follows. The row player chooses a distribution on the rows of the matrix. The column player chooses a distribution, Q, on the columns of the matrix. And the payoff of the game is just the expected value of an entry of the matrix with respect to the row and column distributions that the two players selected. That's the end of the game. These distributions are called strategies, and a degenerate distribution that's just concentrated on one row or one column is called a pure strategy. So for rock, paper, scissors, the three pure strategies were rock, paper and scissors for each player. Since the players have diametrically opposed goals, we just need to assign some sort of polarity to each player. So we say the row player is the minimizing player.
He wants to minimize the expected payoff, and the column player is the maximizing player; he wants to maximize the expected payoff. And then those worst case optimal strategies have the following definition. For the row player, his optimal strategy is the min/max strategy -- that is, the row strategy that minimizes the expected payoff for the worst case choice of the column distribution -- and the best strategy for the column player is just the reverse definition, the max/min strategy, the best column distribution for the worst case row distribution. Okay. So that's the traditional definition of optimal strategies. Now I want to define some notion of a mistake exploiting optimal strategy. There's more than one way to define such a thing. I'm going to choose a certain definition, and I'm going to try to motivate it as follows. Let's look at the game from the perspective of the column player. His max/min strategy is the strategy Q that satisfies this objective. And another way of looking at this objective is saying: okay, I'm going to choose a column distribution Q so that the column vector MQ has the largest minimum element. Right? When I hit M with this column distribution, I'm basically presenting the row player with this column vector of choices, and they each have some expected payoff. And now, if the row player is maximally adversarial, there's a best choice for him that's just concentrated on a single row. So all I'm doing is saying: okay, of those choices the row player now has, I'm going to try to maximize the smallest one of them. And if the row player makes a mistake -- one way we can think of a mistake is failing to choose that minimizing row -- then he'll choose some other row. What's interesting is that our current objective doesn't care about the values of the other rows. So maybe we should modify our objective so that it not only tries to maximize the minimum row but also says something about all of those rows. And so this discussion suggests the following definition of a mistake exploiting optimal strategy. It is the strategy Q that, for this column vector MQ, maximizes the minimum element of that vector; and among all such column strategies, maximizes the second smallest element; and among all such column strategies, maximizes the third; and so on. So, in other words, it's trying to increase these things lexicographically.
>>: I see, right. So it could be that if my opponent -- if I know the strategy of my opponent, I could have done much better than that.
>> Umar Syed: Definitely.
>>: Right? But you're still being conservative in the sense that I want to guarantee the minimal payoff, as in classical game theory?
>> Umar Syed: That's right.
>>: But --
>> Umar Syed: That's right. So this is a special case of a max/min strategy.
>>: Is this willing to pay off the optimal, the strongest, to get the better [inaudible]?
>> Umar Syed: This definition is not willing to do that.
>>: How do you define the second, the second smallest, all the maxes? Because sometimes they're conflicting. For example, if you want the smallest, it has to be -- these three are independent.
>> Umar Syed: You're right. There may be no strategy that maximizes all rows simultaneously. So the definition is the following:
>>: But --
>> Umar Syed: I look at the set of column strategies that maximize the smallest value of the row -- that maximize the minimum row value. This is some set.
Now I look in that set and I ask which of them maximize the second smallest row value. Now I have a smaller set. And then I just recurse. And a shorter way of saying that is -- well, let me give the more formal definition. So the formal definition of what I just said is: we define the column player's lexicographic max/min strategy to be the strategy here that -- so what is this saying? Here's MQ, that column vector I was talking about. This theta is a function I'm going to define that just sorts the elements of this column vector in nondecreasing order. And this lex max operator is the maximum with respect to lexicographic order. Okay. So I hope it's clear what's going on.
>>: The lexicographic --
>> Umar Syed: Meaning, first of all, component 1 should be as large as possible. So, two vectors, dictionary order. Compare two vectors component by component, starting with the first one, and at the first place where they differ, the one with the smaller component is the smaller vector.
>>: Where do you put the -- suppose, to make the first one smallest, then after that you have a unique solution?
>> Umar Syed: I beg your pardon?
>>: Suppose you insert Q, it makes the one smaller.
>> Umar Syed: This is sort of degenerate. So then the maximum.
>>: Have you failed to make any change [inaudible] the second one, you have a change to the first one. It's kind of --
>> Umar Syed: It could happen that --
>>: You have a single optimal strategy in the classical sense.
>> Umar Syed: That is also lexicographically optimal. That's correct.
>>: So we're just trying to understand: in the setting, for example, of the game rock, paper, scissors, you actually have just the one optimal strategy.
>> Umar Syed: That's right. That's right. But in rock, paper, scissors --
>>: When you have the bomb, then it becomes more interesting.
>> Umar Syed: That's right. Anytime you have two pure strategies that are both max/min, then any convex combination of them is also max/min, and in that setting you have infinitely many optimal strategies.
>>: But it's not guaranteed. So could it be, in the case that there are many optimal strategies in the classical sense, that they also match in all the higher order properties --
>> Umar Syed: It could happen.
>>: It could happen?
>> Umar Syed: Yes. There could be a pure strategy that is also lexicographically optimal. It could certainly happen.
>>: Another idea: you could define a distribution, for example, over the outcomes -- the first or second or third one. So you use a combination -- you want to maximize the [inaudible] first and second, third one.
>> Umar Syed: You could. That sounds like it would require some kind of prior knowledge about how your opponent is going to behave. If you had this prior knowledge you could certainly exploit it. But in this setting we assume we have no prior knowledge, and we're trying to take the usual game theoretic approach but refine it somewhat. So let me give you some history. We didn't invent the idea of lexicographic optimal strategies. They were first introduced by Melvin Dresher way back in 1961. He didn't give them a name; if you look at the literature, people just call them Dresher optimal or D-optimal strategies. And what's interesting about them is that people have made connections between these and other notions of equilibria in game theory. So today I'm just going to be talking about zero-sum games. More generally, there are general-sum games.
In general-sum games there's a concept called the Nash equilibrium, which you're probably familiar with, which is a situation where every player is playing a strategy and no player has any incentive to deviate from that strategy. And something that people have studied a lot about Nash equilibria is refinements of the equilibria that are less sensitive to perturbations. It turns out that some Nash equilibria in some games are really sensitive to perturbations, meaning that if one person slightly changes their strategy then it's suddenly not Nash at all. So people are interested in Nash equilibria that are not just Nash but also insensitive to small perturbations, and there's a whole menagerie of refined equilibrium concepts in general-sum games that correspond to different definitions of perturbations. But in zero-sum games, everything sort of collapses: all of these refinements coincide with lexicographic optimal strategies in a zero-sum game. So, right, if both players in the game are playing a lexicographic optimal strategy, that's a proper equilibrium, it's also a perfect equilibrium, and there are a few other things that it is. Okay. So having defined and hopefully motivated this solution concept, let's turn to trying to compute it. Throughout the talk I'm going to be looking at the game primarily from the perspective of the column player, but most of what I say can be applied to the row player as well; you just turn everything around. So if you wanted to compute the classical max/min column strategy, you could do it just by solving this objective. From the linearity of this problem you can see directly that you can solve it by solving a linear program, and compute this Q star. And the size of that linear program is roughly the size of the game matrix, so you can find this strategy in time polynomial in the size of the game matrix. And do we have a solution like that for the lexicographic optimal strategy? Indeed we do. Miltersen and Sorensen back in '06 described a similar approach to computing these lexicographic optimal strategies using a sequence of linear programs. The way their algorithm works is as follows. The first linear program in the sequence outputs a set of linear constraints. These linear constraints define the set of optimal strategies in the classical sense, and those constraints are fed to the second linear program in the sequence, and that second linear program uses them to find strategies that are optimal on both the minimum and the second smallest row, and it outputs a set of constraints that describes that set. And each successive stage does better: the Kth stage figures out the Kth row, and by the time you're done you've found your lexicographic optimal strategy. Since you have to solve one linear program per row of the matrix, the running time of this entire procedure is polynomial in the size of the matrix. So that's fine. But a setting that we're interested in, and that I'll be talking about today, is when one of the players has a very large number of strategies -- exponential, maybe even infinite. So the game matrix looks like this: it's much wider than it is tall.
>>: Can I ask a question about this? So you're saying that the output of each one of these LPs is a set of constraints.
>> Umar Syed: Yes.
>>: So basically these would be the constraints that have non-zero dual variables, that's [inaudible]. Like, the algorithm would give you an answer which is just a point inside the constraints, but you're asking -- you want the entire set of all feasible points?
>> Umar Syed: Yeah. So let me give you the more naive version of their algorithm, which I think is easier to understand. Let's take the original problem. Not only do they figure out a strategy that is optimal in the classical sense, they also figure out the value of the game, the payoff that strategy earns, and the support of that strategy, which essentially tells you what the hardest rows in the matrix are, the ones that are tied for the minimization. Then in the second stage of the algorithm, those rows are bound to be that value. So if you know which are the hardest rows and you bound them to be the optimal value, then the set of column strategies that gives you those values on those rows is the set of all optimal strategies.
>>: These are equality constraints.
>> Umar Syed: These are equality constraints. Solving exactly. Each of those linear programs has size roughly proportional to the size of the game matrix. And we're interested in a setting where one player has a whole bunch of strategies.
>>: Are there any results [inaudible] the later rounds will be much easier to solve [phonetic], given that you have [inaudible]?
>> Umar Syed: I guess intuitively.
>>: Is there any way it's going to be formalized in terms of the actual velocity that you [inaudible] depending on how many [inaudible]?
>> Umar Syed: They didn't really investigate that. Everything was just bounded in terms of the size of the matrix. It sounds like the worst case would be that you figure out one row per stage, I guess, yeah. Okay. But, again, that algorithm is polynomial in the size of the matrix, and sometimes the game you want to solve doesn't have the property that it's a reasonable size. Towards the end of the talk I'll give you some examples of games from reinforcement learning that have that property. But maybe to give you an example that's familiar to more of you, let me talk a little bit about boosting. Boosting is an algorithm for supervised classification. You're given a set of training examples and a set of weak hypotheses, and the goal of boosting is to find a weighted combination of the weak hypotheses that maximizes the minimum margin on your training set. And you can formulate that goal as finding a max/min strategy in a zero-sum game. The game matrix for that game has the following shape: the rows correspond to the training examples, one for each example; the columns correspond to the weak hypotheses; and an entry in the matrix is the margin of that weak hypothesis on that example. So you're trying to find a weighted combination -- a strategy -- of weak hypotheses that maximizes the minimum margin; that's a max/min column strategy. The trouble is that typically the number of weak hypotheses is very, very large. In some formulations it can be infinite. We can't just use a linear program to find an optimal strategy in this game. So this is an example of a game where it's too big for the LP solution to work.
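As a concrete, hedged sketch of the one-linear-program-per-stage idea described above: this is not Miltersen and Sorensen's exact procedure, just a generic way to do the same computation with SciPy when the whole matrix fits in memory. The delicate step is deciding which rows are genuinely stuck at the current level; here that is checked with one extra LP per row rather than by reading off the tied constraints.

```python
import numpy as np
from scipy.optimize import linprog

def _stage_lp(M, pinned, free, maximize_row=None, floor=None):
    """One LP over variables (q, v).  If maximize_row is None, maximize v,
    a common lower bound on the free rows; otherwise maximize that single
    row's payoff while the other free rows stay at or above `floor`."""
    n_rows, n_cols = M.shape
    c = np.zeros(n_cols + 1)
    if maximize_row is None:
        c[-1] = -1.0                          # maximize v
    else:
        c[:n_cols] = -M[maximize_row, :]      # maximize that row's payoff
    A_ub, b_ub = [], []
    for i in free:
        if maximize_row is None:
            A_ub.append(np.append(-M[i, :], 1.0)); b_ub.append(0.0)     # (Mq)_i >= v
        else:
            A_ub.append(np.append(-M[i, :], 0.0)); b_ub.append(-floor)  # (Mq)_i >= floor
    A_eq = [np.append(np.ones(n_cols), 0.0)]  # q is a distribution
    b_eq = [1.0]
    for i, val in pinned.items():             # rows already pinned keep their value
        A_eq.append(np.append(M[i, :], 0.0)); b_eq.append(val)
    res = linprog(c,
                  A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=np.array(b_ub) if b_ub else None,
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n_cols + [(None, None)],
                  method="highs")
    return res.x[:n_cols], -res.fun

def lex_maxmin_by_lps(M, tol=1e-8):
    """Raise the smallest row value as far as it will go, pin the rows that
    cannot go any higher, and recurse on the rest."""
    pinned = {}
    q = None
    while len(pinned) < M.shape[0]:
        free = [i for i in range(M.shape[0]) if i not in pinned]
        q, v = _stage_lp(M, pinned, free)
        for i in free:
            others = [j for j in free if j != i]
            _, best_i = _stage_lp(M, pinned, others, maximize_row=i, floor=v)
            if best_i <= v + tol:             # row i is genuinely stuck at v
                pinned[i] = v
    return q
```

For example, on the rock, paper, scissors, bomb matrix from the earlier sketch, this should return (up to numerical tolerance) the uniform distribution over the three columns, which is the mistake exploiting answer from the introduction.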
Okay. So now let me talk a little bit about existing algorithms for finding optimal strategies in very large games. Typically the assumption that you make when you're faced with this kind of situation is that you have what's called a best response oracle for the player that has a very large number of strategies. The best response oracle is the following: it's an algorithm that, for example for the column player, given a fixed row strategy, can tell you what the best column strategy is for that fixed row strategy. In the boosting example that I described earlier, that oracle is just the weak learner. If you're familiar with boosting: boosting, in every round, just adjusts the weights on the examples, and for a given weighting of the examples you find the best weak hypothesis, the one that has the lowest error; the algorithm that gives you that weak hypothesis is the weak learner. That's the oracle. And again, at the end of the talk I'll describe some situations in reinforcement learning where you have exactly these kinds of games and you have an oracle for one of the players. So if you have this really wide game and you have an oracle for the column player, here's a pretty intuitive algorithm that finds an optimal strategy in the game. It works as follows. You start by putting a uniform distribution on the rows, and in each round you find the best response to the current weighting of the rows using your oracle, and then you update the weights. And what is this saying? The update of the weights puts more weight onto rows where -- so in each round T you learn some best response, right? And that best response has some performance against every row strategy. And you just shift the weight onto the rows where that column strategy did worst; the worse it did on a row, the more weight that row gets. By shifting the weight in this manner, over time you're forcing the column strategy oracle to focus on the hardest rows, because more and more weight goes onto the ones where it did poorly in the past. And the analysis of this algorithm says that, because of that property, over time the minimum row of the average of all the column strategies that you generate is maximized. And this is basically AdaBoost; I just described AdaBoost. To describe what this algorithm is doing, let me describe how it would behave on a standard game matrix. So you initialize -- this is supposed to represent the weights -- you initialize all the weights on the rows. As the algorithm proceeds it keeps generating these best responses, one for each weighting of the rows. And as the algorithm proceeds, more and more weight is being put on some rows, and on other rows it's decaying, going all the way to 0. The rows getting all the weight are the hardest ones, the hardest ones to increase the value for. So towards the end of the algorithm the best response algorithm is forced to concentrate on those rows.
>>: [inaudible].
>> Umar Syed: What is?
>>: The way in which it's increasing and decreasing --
>> Umar Syed: No, it's not monotone. And I'll illustrate that in a couple of slides. So the analysis of this algorithm is that if you run this weight shifting algorithm the way I described, then the averages of both the row strategies and the column strategies converge to optimal strategies in the classical sense, and the rate is not too bad. Here N is the number of rows in the matrix.
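A minimal sketch of the weight shifting scheme just described -- essentially the AdaBoost / multiplicative weights idea. The learning rate eta, the horizon T, and the toy enumerated oracle below are illustrative assumptions, not the speaker's exact parameters; in the intended setting the oracle would be something like a weak learner rather than an explicit argmax over columns.

```python
import numpy as np

def approx_maxmin_with_oracle(best_response, M, T=2000, eta=0.05):
    """Approximate a classical max/min column strategy using only a
    best response oracle for the column player."""
    n_rows, n_cols = M.shape
    p = np.ones(n_rows) / n_rows          # start with uniform row weights
    q_avg = np.zeros(n_cols)              # running average of the best responses
    for _ in range(T):
        q = best_response(p)              # best column strategy for row weights p
        q_avg += q / T
        payoffs = M @ q                   # how q does on every pure row
        p *= np.exp(-eta * payoffs)       # more weight on rows where q did poorly
        p /= p.sum()
    return q_avg

# When the columns are few enough to enumerate, the oracle is just an argmax.
M = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])             # rock, paper, scissors
oracle = lambda p: np.eye(M.shape[1])[np.argmax(p @ M)]
print(approx_maxmin_with_oracle(oracle, M))   # roughly [1/3, 1/3, 1/3]
```

With an appropriately tuned eta, the standard multiplicative weights analysis gives an error on the order of sqrt(log N / T), which is the "not too bad" rate referred to above; the fixed eta here is only for the sketch.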
And notice there's no explicit dependence on the number of columns of M; as long as you have the best response oracle, there's no explicit dependence. Okay. So, again, just looking at things from the perspective of the column player: we want a lexicographic max/min strategy for the column player. I've already shown you that the average of the column strategies generated by this no regret algorithm, which is basically AdaBoost, converges to a max/min strategy in the classical sense. Could it be that it also converges to a lexicographic max/min strategy, which is just a special case of a classical max/min strategy? If that were true then we would be really lucky and we would be done. But unfortunately this does not hold in general. I'm going to describe a counterexample, and I'll spend a little bit of time describing this counterexample and explaining why the algorithm fails, because that's going to motivate and explain why our algorithm is able to overcome this obstacle. So here's the matrix; let me give you a tour of it. This matrix has three rows. Basically the first two rows are about A -- think of A as some positive number, and epsilon as some small positive number -- so the first two rows of the matrix are all roughly A. And on the third row you can divide the columns into two groups: columns that do quite well on the third row and columns that don't. These columns do twice as well on the third row as these columns. Okay. So that's the tour of the matrix. What happens if we run our no regret weight updating algorithm on this matrix? Again, you initialize the row weights to be uniform. And what happens initially is that the best response toggles between the first two columns. Let me try to explain why that's happening. First of all, why is it selecting columns in the first pair and not in the second pair? Well, this is happening because there's some weight on the third row, and early in the algorithm that weight is not negligible. And since these column strategies do so much better on that third row than these do, these strategies are much better than the ones over here -- and they all do about the same on the first two rows. Now, why is it toggling? Well -- and this maybe addresses your question, this is very typical behavior of these algorithms -- at some point in the algorithm one row has slightly more weight than the other. Notice the way these two columns deviate from A, from their min, is sort of symmetric: one is a little bit more than A and the other is a little bit less, and then they change places. So in this iteration, between the first two columns, it prefers the first column slightly more because it's slightly more biased towards the row that has a little bit more weight. And because the weight now shifts onto the rows where it didn't do well, the bias just flip-flops back and forth. Okay. So that's what happens initially. But here's what happens after a while. Remember the weight increases on the hardest rows of the matrix, and the hardest rows of this matrix are the first two. In the first iterations it's doing really well on the third row -- it's getting 2A instead of just A -- and the weight on that row decays rapidly to 0. Pretty soon the weight on that row is so small that basically the best response oracle is ignoring that row entirely. It's having basically no effect on the payoff.
And now it toggles between these two, because the deviations here are slightly larger than the deviations here. So as soon as the weight on that third row becomes less than basically the magnitude of this oscillation that's happening, it's going to flop over to just toggling between those last two columns, and it will do that indefinitely. So putting this all together: if I run the no regret algorithm on this matrix, then asymptotically it converges to a column strategy that gives the value A to every row, whereas we know there's a lexicographically optimal strategy that gives 2A to the last row, basically the uniform average of these first two columns. Okay. So just to sum up what I said: the basic problem is that this no regret algorithm -- the reason it works is that it shifts all the weight of these row strategies onto the hardest rows of the matrix, and that causes the best response oracle to ultimately ignore the other rows. It just focuses on the hardest rows, the ones that are hardest to maximize.
>>: So there's an underlying convex optimization going on here. So that's still converging to something; you're oscillating and the value converges, but the oscillation continues in these last two columns.
>> Umar Syed: That's right. The behavior of the averages -- the averages are well-behaved. The actual strategies themselves are not well-behaved at all. In fact, people have studied the dynamics of that distribution and it's very complicated. You mentioned Cynthia Rudin; her thesis was on studying the dynamics of those solutions. There's no simple answer there. It's highly complex.
>>: So [inaudible] ground motion.
>> Umar Syed: Yeah.
>>: So how does it relate to this?
>> Umar Syed: So if we look at this algorithm as a boosting algorithm, which it is, the problem that I'm describing is that the algorithm focuses too much on outliers. Outliers are the ones that have small margin. And that's sort of the fundamental problem with AdaBoost. There's a whole family of algorithms that have been designed to try to address this problem. BrownBoost is one of them. There's [inaudible] boost. There's meta boost. There's just a lot of them. And they all try to basically -- so AdaBoost puts an extreme amount of weight on these outliers, and they one way or another try to soften that extreme behavior to get better performance.
>>: So in one sense, would you say what you propose falls in the same family, or is it something different?
>> Umar Syed: It's something different, because although it's the same issue, we're taking a slightly different approach. So, again, the problem is that it focuses too much on those hardest rows. It ignores the ones that aren't hardest. But in our objective we care about those rows. So our solution, which I'm about to describe, is to change the weight updates in such a way that the weights do not decay all the way to zero on those other rows, and that will make the algorithm not ignore them. That's the outline of what we're about to do.
>>: So what's the analogy of the lexicographic objective for boosting? Compared to Saavedra, is it still the same worst case optimum, but with this sort of less-than-worst-case -- what is the game theory?
>> Umar Syed: So the analogy would be: usual boosting is trying to maximize the minimum margin. This objective is maximizing the minimum margin and, subject to that constraint, maximizing the second smallest margin, and then the third smallest, and so forth.
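A small sketch of the lexicographic comparison at work, on a toy matrix shaped like the counterexample above; the concrete numbers A = 1 and eps = 0.05 are made up for illustration.

```python
import numpy as np

A, eps = 1.0, 0.05
M = np.array([
    [A + eps, A - eps, A + 2 * eps, A - 2 * eps],
    [A - eps, A + eps, A - 2 * eps, A + 2 * eps],
    [2 * A,   2 * A,   A,           A          ],
])

def sorted_payoffs(M, q):
    """theta(MQ): the payoff vector sorted in nondecreasing order."""
    return np.sort(M @ q)

q_good = np.array([0.5, 0.5, 0.0, 0.0])   # uniform average of the first two columns
q_bad  = np.array([0.0, 0.0, 0.5, 0.5])   # what the no regret run ends up preferring

print(sorted_payoffs(M, q_good))   # [1. 1. 2.]  i.e. [A, A, 2A]
print(sorted_payoffs(M, q_bad))    # [1. 1. 1.]  i.e. [A, A, A]
# Both have the same worst case value A, so both are classically optimal,
# but q_good is lexicographically larger -- exactly what the refined
# objective (and a mistake exploiting player) prefers.
```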
>>: So, sort of, think about it as: you don't care about just the support vectors, but you're stepping a little further out, you're adding these kind of soft support vectors, and you care about their margins.
>> Umar Syed: That's a good phrase for it. Soft support vector machine. Another way of thinking about it: you care about the whole margin distribution and you're trying to push it all up, but while maintaining the worst case property.
>>: But do you care about the amount it declines as you step further up?
>> Umar Syed: I don't know. So it's lexicographic. So it's not that I'm putting an amount of weight on each one, it's --
>>: The weight is now impacted?
>> Umar Syed: Yes.
>>: Is this a generalization --
>> Umar Syed: Not that I know of. I mean, that's something I would like to study, because we're working on this now.
>>: The work by Forester [phonetic] and I don't remember who else, who studied the full distribution.
>> Umar Syed: That's right, I should mention that. People have studied how to bound generalization in terms of the entire margin distribution. But I don't know how to connect their work to this lexicographic property. Okay. So maybe I can -- I'll try to explain our algorithm. Before jumping into exactly what our algorithm does, let me set some foundation. For any vector C, I want to think about the shifted version of our game matrix. So M sub C is: just take each column of our game matrix and subtract the vector C from it. If you did it in Matlab, this is what I mean: M minus C. Now, this shifting operation has two interesting properties. One is that our assumption that we have a best response oracle for the original matrix implies that we have a best response oracle for any shifted version of the matrix. And the proof is just this equation: the payoff of any two strategies in the shifted version is just the original payoff minus this quantity that's independent of Q. So if you can arg max this quantity, then you're also arg maxing this quantity with respect to Q. So we have a best response oracle for any shifted version of the matrix. Now, here is the second fact, which is why we care about shifted matrices. If you let V star be the vector of values -- the vector of payoffs that you're going to get on each pure row strategy under the lexicographic optimal strategy -- and you just shift your matrix by that vector, then any optimal strategy in the traditional sense on this shifted matrix is a lexicographic optimal strategy on the original matrix.
>>: So this requires somewhat --
>> Umar Syed: Yeah.
>>: So the vector V is now -- if you had -- if it was D dimensional, now it's sorted.
>> Umar Syed: It's sorted, that's right.
>>: So --
>> Umar Syed: This is a very good point. I decided not to put this in; I thought it was getting too much into the weeds, but I can see you're going there. So you may be concerned that -- let me just say this. It turns out that we can assume without loss of generality that the first row of our matrix is the hardest one, the second row is the second hardest, and so on. And further, every lexicographic optimal strategy is also sorted.
>>: So that means -- let me follow that line. That means that if there are two optimal strategies in the lexicographical sense, the hardness of a pure strategy is fixed, independent, right? It's not that for one strategy this will be the second hardest and this will be the sixth hardest, and for another one they will switch places.
This is fixed.
>> Umar Syed: That's right. I used to have a slide on this. I think another way of saying what you're saying is that there's an unambiguous ordering of the rows: one of them is the hardest, another one is the second hardest, and so on, and it's independent of the actual strategy that's being used. That's part of the reason -- that fact is used to prove this lemma. I should have put this in the presentation. So now if I just shift down all the rows of the matrix by how big those worst case values are, now any optimal strategy in the traditional sense is a lexicographic optimal strategy for the original matrix.
>>: [inaudible].
>> Umar Syed: Yes, that's right.
>>: Any optimal strategy must be --
>> Umar Syed: That's right. So combining these facts, in order to figure out a lexicographic optimal strategy I just have to compute this vector V star. If I've got this vector, then I can just run my usual algorithm on the shifted version of the matrix, and I have a best response oracle for that shifted version -- I argued for that earlier. Now the problem reduces to computing this vector of values V star. Okay. So how will I do that? Here's the algorithm, in pictures, for how to compute V star. I run my no regret algorithm on my matrix M. After a while I get a picture that looks like this: I have a lot of weight on the hardest rows and a little weight on everything else. So I can use that fact to identify which are the hardest rows of the matrix, and here I've colored them blue and everything else I've colored black. Now that I've identified those rows -- and as you pointed out, they're unambiguously the hardest, they're the hardest for all strategies -- now I rerun the algorithm. But when I rerun it, in every round I change the weight update a little bit. Basically I look at the distribution the algorithm is giving me on the rows. If I find that the weight on the non-hardest rows is too small, which is the problem in the counterexample, right, I just rescale the weights so they're not too small. So I force those weights to be bounded.
>>: The matrix, or do you change the update?
>> Umar Syed: Just the updates. So those weights that were decaying to zero, which was the problem -- I just force those weights not to decay to zero. I bound them away from zero, and the order is roughly one over epsilon to the fourth. Go ahead.
>>: In other work, in other problems with a similar flavor, like problems of tracking the best expert in online learning, or maybe bandit algorithms, they have this problem of multiplicative weights going to 0, and what they usually do is just mix in a little bit of the uniform distribution and fix the problem that way.
>> Umar Syed: Yeah. But that doesn't work for us.
>>: Why not?
>> Umar Syed: Or at least I couldn't get it to work. Because the uniform distribution sort of smears everything. One way to think about it is: what I really want is for my no regret algorithm to be running in parallel on the hardest rows and the non-hardest rows. I want it to focus on both. And these black ones, the non-hardest ones, those weights are getting really small. If I just mix this distribution with uniform -- there's some structure to this distribution that I want to preserve, but if I mix it with the uniform distribution, those weights are so small it will just get smeared, and my regret property will no longer hold. I'm not sure if that's making sense.
But I need to sort of run -- I want to run two versions of my no regret algorithm on these two groups of rows, and I want them to run at different scales, but within each group I want the proportions to be preserved so that I have the no regret property.
>>: So you have an algorithm with a proof. I have nothing.
>> Umar Syed: But your idea for a first approach is the right one, because that's the first thing we tried.
>>: Then I would say: adding that little bit of uniform distribution, the guys that became very small, it would make them big again. The guys that are big, it wouldn't change, because it would just be epsilon changes; it wouldn't affect them at all. It would have exactly the effect of amplifying the little ones while not touching the big ones.
>> Umar Syed: So the trouble is -- suppose I just cared about these non-hard rows. I want to maximize the smallest one of the non-hard ones. What I'd really like to do is run my no regret algorithm on just those rows. Now, remember, the blue weights are on a totally different scale than the black weights. So a small number compared to the blue weights is a very big number for the black weights. So if I mix the entire strategy with the uniform distribution, even with a small mixing coefficient, that's going to be a big mixing coefficient compared to the weights on those black rows. And it will sort of totally screw up the algorithm that's running on those rows, because on their scale, I'm mixing in a huge uniform distribution.
>>: Can you be more explicit about what it means to rescale, so we get a picture of the difference that makes?
>> Umar Syed: You want to ask your question first?
>>: Would you say that in general you can propose different strategies for doing this regularization slash smoothing -- think of it as weights or smoothing and so on -- but your lexicographic objective in the end gives you a single correct way to do it, the one that maximizes it? So the fact that the other ways to do the smoothing are inferior is because lexicographically there's one of them that is the right one, the one that maximizes it. Is that fair?
>> Umar Syed: Well, I don't know. I would say it's inferior because of the algorithm we're using. There may be another algorithm. We're solving this problem in a no regret way, and there's nothing that tells me that the no regret algorithm is the best way to solve this problem.
>>: But it's a function of both the objective and the regret approach.
>> Umar Syed: Yeah. So let me answer your question. When I say "rescale", what I mean is I have a constant in mind, lambda, and I want the sum of the black weights to always be at least lambda, and the sum of the blue weights to be at most 1 minus lambda.
>>: How do you do that? Is it that you could always take the smallest one and make it lambda?
>> Umar Syed: No -- you're right, I do a differential normalization. I look at the blue rows and I normalize them so their sum is 1 minus lambda, and then separately I normalize the black rows so that their sum is lambda. But within the groups I maintain the proportions between them, and that's very important.
>>: How do you define the lexi [inaudible]?
>> Umar Syed: So in the previous step of the algorithm, I run the algorithm and I look at the places where it's putting a lot of weight.
>>: There needs to be some threshold. It needs to be modified, right?
So --
>> Umar Syed: The threshold comes from the analysis, which says that if the weights on certain rows are above the threshold, then I know they should be colored blue, and if the weight is below the threshold then they should be colored black.
>>: So that looks -- let me see if I get it right. So you say: okay, I run the algorithm and then I set a threshold. And I make sure that all the rows that are below the threshold sum up to lambda. So renormalize them --
>> Umar Syed: I identify the rows first, then I start from the beginning. And at every round I ensure that the black rows get at least lambda and the blue rows get at most 1 minus lambda.
>>: You said this threshold comes from your analysis; what does it depend on?
>> Umar Syed: The desired error. And --
>>: And the gaps.
>> Umar Syed: And the gap.
>>: So it looks like you're precisely choosing the active dual variables again, just like the 2006 paper.
>> Umar Syed: Yes.
>>: Choosing active constraints, choosing them separately from the inactive constraints?
>> Umar Syed: Yeah, exactly. That's right. Yeah.
>>: So just to make sure that I understand what you said: you had the initial step in which you identified the blues and the blacks. And now you start from scratch, but after every AdaBoost update, after every multiplicative update that you do, you redo the --
>> Umar Syed: Normalization. I normalize one group separately from the other. Within the groups I maintain the proportions. That's super important. That's why the uniform mixing doesn't work for us: if I just mix them with the uniform, the black group will just get blown away; the proportions within them will be totally screwed up. So I do that. So what does this get me? Before I can explain what this gets me, let me give you a little more notation. Remember this vector of payoff values; this is its definition. By the definition of theta, this is a sorted vector: the first component is smallest, then the second, and so on. So let's just identify the first break point in that vector. What this is saying is that the first K values of the vector are all equal, and then the (K plus 1)st one is a little bit more. K could be one. It could be N. It could be anything. So if I do the scheme that I just described -- identify the hardest rows, start from scratch, and differentially normalize -- then the minimum over the hardest rows will be V star 1, sort of the hardest value, the smallest value, and the minimum over the non-hardest rows will be V star K plus 1, the minimum over those rows. It's like I was saying earlier: I'm sort of running the algorithm in parallel on these two groups, and they're each computing their value separately. And the key to making this happen is that differential normalization that kept the weights on the non-hard rows from going all the way to 0. This means that the best response oracle can't ignore those rows, and this is why -- oops -- this is why I get this property. Otherwise I would run into the problem I had with the counterexample, and this might be as small as V star 1. Okay. So I just showed you how to get basically V star 1 through V star K plus 1. And this procedure that I described can be iterated to recover all the values in the entire vector.
>>: So what you want to do is first recover the first K plus 1 values and then just add them to the objective and --
>> Umar Syed: Shift them. Shift the first rows so they're all the same value. Now they become the hardest rows. Before they were in two groups; now they'll be in the same group. That's right.
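A rough sketch of the modified update with the differential normalization just described. The learning rate eta and the way the hard rows are identified are illustrative assumptions; the actual lambda and threshold come from the analysis.

```python
import numpy as np

def rescaled_update(p, payoffs, hard_rows, lam, eta=0.1):
    """One round of the weight update with differential normalization.

    p         : current row weights (sums to 1)
    payoffs   : M @ q_t, the payoff of this round's best response on each row
    hard_rows : boolean mask of the rows identified as hardest in the first run
    lam       : total weight reserved for the non-hard rows
    """
    w = p * np.exp(-eta * payoffs)        # the usual multiplicative update
    p_new = np.empty_like(w)
    # Hard (blue) rows share 1 - lam of the mass and non-hard (black) rows
    # share lam of it, but within each group the proportions are preserved,
    # so each group's no regret behavior survives at its own scale.
    p_new[hard_rows]  = (1 - lam) * w[hard_rows]  / w[hard_rows].sum()
    p_new[~hard_rows] = lam       * w[~hard_rows] / w[~hard_rows].sum()
    return p_new
```

This is also why plain uniform mixing fails here: mixing smears the tiny black weights together and destroys the proportions inside that group, whereas the rescaling keeps them.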
And so after you do all of this, the theorem that we have is that this algorithm I just described at a high level finds an epsilon-optimal lexicographic strategy in about this much time, which is about quadratically worse than the no regret algorithms: they're usually squared and this is to the fourth.
>>: Can you say something about how [inaudible] finds a nonstandard min/max solution? Because it does.
>> Umar Syed: In the same amount of time. That's basically the first -- no, that's not right. That's not right. No, no, no, it's lower. It will be the same rate. It will be the same rate.
>>: Do you think that's just inevitable, or could you find the best of both?
>> Umar Syed: I wouldn't say it is inevitable.
>>: But just by the same [inaudible], spectral gaps and eigen-gaps and so forth, does it somehow relate to [inaudible] spectral gap?
>> Umar Syed: I'm not sure --
>>: If you take a matrix and you want to find the first eigenvector, you can just apply the matrix over and over again to a vector, and all the other weights decay exponentially. This is what happens here, right? You get this exponential decay of weights, and you know a lot of properties of matrices come from the gap between the first eigenvalue and the second eigenvalue -- how fast it will converge to the first eigenvector, just the spectral gap. And it seems as if this is exactly what happens here, right? When will it be hard -- when the gap is small, right?
>> Umar Syed: Yeah, the key is this gap right here. But I don't know if it's related to that gap.
>>: Is there any connection? Can we tell some unified story about that?
>> Umar Syed: I don't know. I don't know that these values are related to the eigenvalues of the matrix. It doesn't --
>>: Isn't your analysis -- the difference between these, sort of V star K and V star K plus 1 [phonetic], is in there?
>> Umar Syed: It is.
>>: So if it's smaller then you don't care; if it's large then...
>> Umar Syed: But I don't know whether it is, or if it is, how it's connected to the eigenvalues. I'm not sure. Okay. And now, so that's our algorithm. Now I'll just briefly describe some problems in reinforcement learning that can be solved using this kind of algorithm. Okay. So what's reinforcement learning? Let's step back. The goal in reinforcement learning is that there's some agent navigating some unknown stochastic environment, and you're trying to come up with a control policy for that agent. Typical applications are things like navigating a vehicle, guiding a robot, something like that. And the only feedback that this agent receives is some exogenous reward signal that tells it, in every state of the environment, whether that state is good or not good. The goal for the control policy is to maximize the cumulative reward that the agent receives over time. And so the value of a policy is defined as the total cumulative reward that the policy receives over time, and that's called the value function. I'm going a bit fast because I think maybe you guys are familiar with this. Okay. Now, there are really two definitions of an optimal policy, and sometimes the distinction is not made clear in papers that you read. A weakly optimal policy is one that maximizes the value for some fixed initial distribution on the states; we can call that a weakly optimal policy. And a strongly optimal policy is one that maximizes the value of the policy no matter what the initial state distribution is.
So it simultaneously has maximum value at every state. We'll call that a strongly optimal policy. So, weakly and strongly optimal. Clearly a weakly optimal policy exists, by continuity or something, but it's not obvious that a strongly optimal policy exists, one that maximizes the value at every state simultaneously. But there's a classical result that in a Markov decision process, which is the usual model for reinforcement learning, there's always a strongly optimal policy: there's a control policy that maximizes reward no matter what state you start in, simultaneously. But a lot of reinforcement learning algorithms, especially the ones that do not assume that there's an explicit description of the environment, only return a weakly optimal policy. That is, you specify some start state distribution and they will find the optimal policy with respect to that initial distribution, but perhaps not at all states simultaneously. Those are called model-free algorithms, the ones that don't see an explicit description of the environment. So this lexicographic optimal strategy finding algorithm that I just described to you is a way of converting a reinforcement learning algorithm that returns weakly optimal policies into one that returns strongly optimal policies, in the following way. I construct the following game matrix. The rows of the matrix correspond to states of the environment, and the columns of the matrix correspond to all possible policies. And an entry in the matrix is a value: the I, Jth entry is the value of the Jth policy when you start at state I. And it's easy to show that the lexicographic max/min strategy in this game is a strongly optimal policy; it's maximizing the value at every state simultaneously.
>>: [inaudible].
>> Umar Syed: [inaudible] -- well, probably for technical reasons it might be better to assume that there's not. That would imply there might be infinitely many actions, and that gets hairy. But there is certainly an exponential number of columns.
>>: [inaudible] to make it [inaudible], would those same bounds hold?
>> Umar Syed: If you reduced it to the bandit case, you would have one arm per policy. There would be exponentially many arms, exponential in the number of states. So a strongly optimal policy is a lexicographic max/min strategy in this game. And what is the best response oracle for this game? Well, it's an algorithm for computing a weakly optimal policy: you fix some distribution on the states, and the best response is the optimal policy for that fixed input distribution. Okay. So I've given you a reduction. And I'm a little hesitant to say this because it seems a little surprising, but so far as I know, no one has actually given this kind of reduction before, going from an algorithm that computes weakly optimal policies to strongly optimal ones. But let me add an asterisk to it, because it seems kind of too good to be true.
>>: Where does the lexicographic stuff come in? Why doesn't the standard boosting do --
>> Umar Syed: What boosting would do is return a policy that has the highest value on the start state that's sort of the worst start state to be in.
>>: I see. Okay.
>> Umar Syed: That's what standard boosting would give you. And this gives you the highest value on all of them simultaneously. Okay. So that's application number one.
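A toy sketch of the reduction in application number one. The tiny MDP, its random transitions and rewards, and the discount factor are all made-up illustrative inputs; the columns are enumerated deterministic policies here only because the example is small, whereas in the intended setting one would instead query a weakly optimal reinforcement learning algorithm as the best response oracle.

```python
import itertools
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.random((n_states, n_actions))                             # R[s, a]

def policy_values(policy):
    """Exact discounted value of a deterministic policy, one entry per start state."""
    P_pi = P[np.arange(n_states), policy]          # transition matrix under the policy
    R_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Game matrix: one row per start state, one column per policy,
# entry (i, j) = value of policy j when started from state i.
policies = list(itertools.product(range(n_actions), repeat=n_states))
M = np.column_stack([policy_values(np.array(pi)) for pi in policies])

# A lexicographic max/min column strategy of M is (a distribution over
# policies that is) strongly optimal: it maximizes the value at every start
# state simultaneously, not just for one fixed start distribution.
```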
Application number two is perhaps a little bit more obvious. In a lot of environments, there's not just one natural value function that tells you the value of a policy; there are several. So if you're driving a car, there are multiple objectives you're trying to satisfy: you're trying to go as fast as you can, trying to avoid crashes, trying to stay on the road, et cetera. And these folks gave an algorithm for finding a policy that lexicographically maximizes that set of value functions, but assuming a fixed ordering on the value functions. So they assume, as input to the algorithm, that one of the objectives -- say speed -- is the most important, and then another objective, like staying on the road, is the second most important, and so on. So you give them that ordering and they find the policy that's lexicographically best with respect to that ordering. If you were to apply our algorithm to this problem, the game matrix would be defined as follows: again the columns are the policies, and the rows are the different value functions, and the I, Jth entry is the value of the Jth policy with respect to the Ith value function. The lexicographic max/min strategy for this game is a policy that does the same thing: it finds the lexicographically best policy, but without having to specify which value functions are more important than the others.
>>: The value functions would need to be on the same scale?
>> Umar Syed: That's exactly right. That's exactly right. I was just about to say that. This assumes implicitly that these are on the same scale. If they're on different scales, then the different scales are telling you implicitly which are more important than the others -- that's totally right. Okay. And that's it. That was just a very brief tour of the applications. So again, I gave you an efficient algorithm for computing these mistake exploiting strategies in very large games, and a few applications to reinforcement learning. So thanks a lot for your time.
[applause].
>>: Does anyone have questions?
>>: Have you tried this on real problems?
>> Umar Syed: So we have a sort of toy simulator, a car driving simulator. We tried it on that. And you can definitely set up a situation where -- we give it multiple objectives, and it's quite easy to set up a situation where a traditional algorithm is not going to be maximizing all objectives simultaneously.
>>: Crash, stay on the road.
>> Umar Syed: Stay on the road, drive as fast as we can.
>>: [inaudible] equally.
>> Umar Syed: They're all on similar scales, that's right.
>>: I have a question I want to address. I think what Chris said at the beginning of the talk is interesting: what happens if I don't want a strict lexicographic order? I'm saying, you know, I'm willing to sacrifice something with respect to this max guy. So I won't be optimal with respect to the best thing, but maybe I can be much better on the second best, and so on.
>> Umar Syed: Yeah.
>>: And I think an equivalent way of saying this is to say that I have some model of how my [inaudible] chooses his strategy. Perhaps he can sort them from the best to the worst, but rather than picking the best thing, maybe he has like a stupidity coin and he flips it: am I going to choose the first one, yes or no? If it's heads I choose that one; if not, I move on to the second one. I think that's maybe equivalent to what I want to say about -- I believe that's [inaudible]. But it doesn't have to be that. Let's think of the more general thing that Chris was saying: I just want to be somehow more soft with respect to the first one. Softer.
So it seems that the algorithm from 2006 -- I forget the name, the recursive LP one -- is easily adaptable to that setting. Maybe I'm missing something here, but it seems like rather than putting out a set of equality constraints, the constraints you pass on to the next step would be -- instead of linear constraints defining a linear subspace, you would have some set of constraints that say you're roughly in this optimal space, up to some epsilon. And of course you move away from the space and get worse on the first one, but now forget about that.
>> Umar Syed: That sounds right.
>>: Just a comment about the [inaudible]: if you know what the [inaudible] and trade off over --
>>: For this one I don't know anything. I'm just saying I'm willing to pay epsilon on the first objective, or let's say --
>>: But that implicitly assumes that you have some higher level cost structure that you can go back to when defining it.
>>: I think it's equivalent to saying the coin flipping, the p-stupid type of thing. But my question is: I think we now understand how to take the simple recursive LP algorithm and make it softer with respect to the more important decisions. How do you take your algorithm --
>> Umar Syed: I don't know. That's an excellent question. The answer is I don't know if this algorithm is suited to the kind of extension that you're describing.
>>: But why? It seems that --
>>: Isn't it like going from -- we had all these algorithms, [inaudible] and stuff like that, going from each [inaudible]. So basically what you say is: if I have a policy now, I measure it with respect to the worst case, and I kind of penalize it based on how far it is from the best policies, as opposed to the spacing that will cut out everything that's not optimal -- you're just going to penalize it in some way and proceed. So at every step I have some sort of aggregate of the penalties that this policy gets. But the question that I can see is: you need to have some sort of model. So you propose this, point us to this model. But I think this model does not assume anything about the opponent; you don't have to assume that --
>>: I'm assuming that the opponent is p-stupid. I don't want to exploit his mistakes. I'm saying he'll make mistakes with a probability I know.
>> Umar Syed: So let me try to answer your question more fully. One thing you might guess is -- so I defined this term lambda: you're normalizing these two groups of rows differently, one of them is at least lambda and the other is at most one minus lambda. You might think that to make things softer you change the value of lambda somehow, right?
>>: Yes.
>> Umar Syed: But the thing is, whatever lambda is, if it's more than 0, then if you run the algorithm forever it will end up in the same place. As long as lambda is more than 0, it ends up in exactly the same place; it just changes how long it takes to get there. So lambda is tuned for speed, to make the convergence time as good as possible, but it doesn't affect where the algorithm ends up. So that's an idea to address what you're saying that won't work, and I guess I don't have another idea off the top of my head.
>>: Good answer.
>> Ofer Dekel: No more questions? Thank you again.
[applause]