>> Ofer Dekel: Umar Syed from University of Pennsylvania today. And this is
joint work with Rob Schapire, Computing Mistake Exploiting Optimal Strategies in
Very Large Zero-Sum Games.
>> Umar Syed: Thanks for inviting me. It's a pleasure to be here. As Ofer said,
today I'll be talking about computing certain kind of optimal strategies in really
large games. This is joint work with my former advisor at Princeton, Rob
Schapire. I graduated from there about a year and a half ago, a little more than
that. But I've continued to work with him.
Today I'm going to be describing sort of recent work that we've been
collaborating on together.
Okay. So I'll be talking about zero-sum games today. So a zero-sum game is a
strategic interaction between two autonomous agents who have diametrically
opposed goals. Sort of the usual and canonical example of a simple zero-sum
game is the children's game rock, paper, scissors. In rock, paper, scissors each
player has three strategies, and they have this sort of cyclic relationship. So rock
breaks scissors, scissors cuts paper and paper covers rock.
So because of the cyclic nature between the strategies or among the strategies,
there's no single strategy that is guaranteed to win for any player. So this raises
the question of well how do we define an optimal strategy?
So in zero-sum games typically, and really in the study of game theory more
generally, people usually define optimal strategies in a worst case kind of way.
So the best strategy in a zero-sum game is typically defined to be one that is best
in assuming a perfectly adversarial opponent. That is, choose the strategy that's
going to do the best assuming that the opponent plays the worst counter
strategy.
So in rock, paper, scissors, if you have ever played it as a child, the best thing is
to randomly choose one of the three moves. Even if your opponent knows that
you're doing that, this strategy is guaranteed to get a draw on average. And no
strategy can be guaranteed to do better so therefore it's optimal.
Now, you might think that planning for the worst case is really a sort of safe and
conservative approach. And you might wonder, well if I plan for the worst case
does that mean I'm going to be optimal in every case?
And the answer is no. By assuming that your opponent is perfectly adversarial,
you may fail to exploit a mistake by your opponent.
So let me try to extend this simple example to try to illustrate what I mean. So
let's extend the game a little bit, and let's consider the game rock, paper, scissors
bomb. So this is an asymmetric game. Let's say one of the players, player one,
has a move called bomb. And bomb has the property that it beats everything.
So obviously in this game, the optimal strategy for player one is bomb. Naturally.
And now what's interesting is that using our worst case definition of optimality,
every strategy for player two is optimal. Because every strategy is sure to lose,
if you assume that player one is going to play the best counter strategy. But
what if player one makes a mistake, what if he forgets to play bomb, what if he
doesn't know he has a bomb?
So in this case the best strategy for player two, if player one forgets to play
bomb, would again be the uniform distribution on all three moves. But nothing in our current
definition of optimality is guiding us towards that strategy.
To give you a slightly less trivial example, this position in chess is called the
Saavedra position. So here we have the black player and the white player. The
white player has a king and a pawn. And the black player has a king and a rook.
If you play chess, you know that a rook is much more powerful than a pawn. So
it would appear that in this position, this is a sure win for black; black seems to
have an enormous advantage.
And indeed for many, many years, in the 19th century, people assumed that this
position was a win for black. But then in 1895, this Spanish priest named
Reverend Saavedra -- I'm not sure if I'm saying it correctly, if anyone speaks
Spanish, correct me, discovered sort of an ingenious sequence of moves that
actually wins this position for white.
Now imagine you're the black player and you find yourself in this position?
>>: Who has the move?
>> Umar Syed: White. White to play and win. And white is going towards the
top of the board. So he can't kill the rook right away.
So if you're the black player in this situation, you find yourself in the Saavedra
position, and you're a very good black player, then you know once you reach the
position the game is a sure loss. Assuming your opponent is perfectly
adversarial.
And so now from this position you would regard every possible sequence of
moves as equally good. That is, equally bad because they all lead to a loss.
But if your opponent, the white player, doesn't know about the Saavedra position
or fails to properly execute the ingenious sequence of moves that leads to a win,
then you should exploit that. But again our current pessimistic definition of
optimality isn't guiding us towards any strategy like that.
There's -- when I was reading about this, there's a chess grand master named
Fred Mitchell. He has a saying that the hardest thing to win in chess is a won
game. That is, once you've reached a position that is known to be a win, it's
actually very hard to execute that through to completion.
So in this talk I'm going to be talking about how to efficiently compute mistake
exploiting optimal strategies in really large zero-sum games, and that will
be the bulk of the talk. And towards the end of the talk I'll try to connect this more to
machine learning and specifically reinforcement learning, which is an interest of
mine, and show how the algorithm we developed for computing these mistake
exploiting strategies can be used to solve certain problems in reinforcement
learning.
But that will just be a little bit at the end of the talk. Okay. So here's kind of a
roadmap. So let me sort of more formally define what exactly is a zero-sum
game. Every zero-sum game is defined by a matrix called the game matrix.
And there are two players called the row player and the column player. And the
game proceeds as follows.
The row player chooses a distribution on the rows of the matrix. The column
player chooses a distribution, Q, on the columns of the matrix, and the payoff of
the game is just the expected value of an entry of the matrix with respect to the
row and column distribution that the two players selected.
And that's the end of the game. These distributions are called strategies. And a
degenerate distribution, one that's concentrated on a single row or a single column,
is called a pure strategy.
So for rock, paper, scissors, the three pure strategies were rock, paper, scissors
for each player.
So we need -- they have diametrically opposed goals, so we just need to assign
some sort of polarity to each player. So we say the row player is the minimizing
player. He wants to minimize the expected payoff and the column player is the
maximizing player. He wants to maximize the expected payoff.
And then those worst case optimal strategies now have the following definition.
So for the row player, his optimal strategy is the min/max strategy. That is, the
row strategy that minimizes the expected payoff for the worst case choice of the
column distribution and the best strategy for the column player is just the sort of
reverse definition, the max/min strategy, the best column distribution for the worst
case row distribution.
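A tiny sketch, mine rather than the speaker's, of these definitions for an explicit matrix: the expected payoff of a pair of strategies, and the worst-case value a column strategy guarantees against an adversarial row player:

import numpy as np

def expected_payoff(M, p, q):
    # p: distribution on the rows, q: distribution on the columns.
    return p @ M @ q

def column_guarantee(M, q):
    # What the maximizing column player is guaranteed if the row player
    # responds adversarially: the smallest entry of M q.
    return np.min(M @ q)

# Rock, paper, scissors written from the row (minimizing) player's side:
M = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
uniform = np.ones(3) / 3
print(expected_payoff(M, uniform, uniform))   # 0.0
print(column_guarantee(M, uniform))           # 0.0: uniform is max/min here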
Okay. So that's the traditional definition of optimal strategies. Now I want to
define some notion of a mistake exploiting optimal strategy. Now, there's more
than one way to define such a thing. I'm going to choose a certain definition, and
I'm going to try to motivate it as follows: So let's look at the game from the
perspective of the column player.
So his max/min strategy is the strategy Q that satisfies this objective. And
another way of looking at this objective is saying, okay, I'm going to choose a
column distribution Q so that the column vector MQ has the largest minimum row.
Right? When I hit M with this column distribution, I'm now basically presenting
the row player with this column vector of choices and they each have some
expected payoff. And now the row player if he's maximally adversarial, there's a
best choice for him that's just concentrated on a single row. And so all I'm doing
is saying, okay, of those choices the row player now has I'm going to try to
maximize the smallest one of them.
And if the row player makes a mistake, one way we can think of a mistake is
failing to choose that minimizing row. If he fails to choose that, he'll choose some
other row. What's interesting is that our current objective doesn't care about the
values of the other rows.
And so maybe we should modify our objective so that it not only tries to
maximize the minimum row but also does something about all of those rows.
And so this discussion suggests the following definition of a mistake exploiting
optimal strategy. So it is the strategy Q that, for this column vector MQ,
maximizes the minimum element of that vector. And among all such column
strategies, maximizes the second smallest element, and among all such column
strategies maximizes the third and so on.
So, in other words, it's trying to increase these things lexicographically.
>>: I see it, right, so it could be that if my opponent -- if I know the strategy of my
opponent, I could have done much better than that.
>> Umar Syed: Definitely.
>>: Right? But you're still being conservative in the sense, I want to guarantee
the minimal payoff as in the classical game theory?
>> Umar Syed: That's right.
>>: But --
>> Umar Syed: That's right. So this is a special case of a max/min strategy.
>>: Is it willing to give up a little on the optimal worst case to get better [inaudible]?
>> Umar Syed: This definition is not willing to do that.
>>: Can you use -- how do you define the second, second smaller, all max?
Because sometimes they're conflicting. For example, if you want the smallest,
has to be -- these three are independent.
>> Umar Syed: You're right. There may be no strategy that maximizes all rows
simultaneously. So the definition is the following:
>>: But --
>> Umar Syed: I look at the set of the strategies, column strategies, that
maximize the minimum row value.
This is some set. Now I look in that set and I ask which of them maximize the
second smallest row value. Now I have a smaller set. And then I just recurse.
And a shorter way of saying that is -- well, let me give the more formal
definition.
So the formal definition of what I just said is we define the column player's
lexicographic max/min strategy to be the strategy here that -- so what is this sort
of saying? Here's MQ, that column vector I was talking about.
This theta is a function I'm going to define that just sorts the elements of this
column vector in nondecreasing order.
And this lexmax operator, this is the maximum with respect to lexicographic
order. Okay. So I hope that's clear what's going on.
>>: The lexicographic --
>> Umar Syed: Meaning, first of all, the component 1 should be as large as
possible. So two vectors, dictionary order. So compare two vectors, component
by component, starting with the first one. And the first place where they differ,
the smaller component is the smaller vector.
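A small illustration, mine rather than anything on the slides: the theta operator just sorts the payoff vector MQ, and the comparison is ordinary dictionary order on those sorted vectors. This brute-force version only picks the lexicographic max/min from an explicitly enumerated candidate set, which is exactly what the rest of the talk tries to avoid:

import numpy as np

def sorted_payoffs(M, q):
    # theta(M q): the row payoffs under column strategy q, in nondecreasing order.
    return np.sort(M @ q)

def lex_greater(u, v, tol=1e-9):
    # Dictionary order: compare component by component, starting with the first.
    for a, b in zip(u, v):
        if a > b + tol:
            return True
        if a < b - tol:
            return False
    return False

def lex_maxmin_among(M, candidates):
    best = candidates[0]
    for q in candidates[1:]:
        if lex_greater(sorted_payoffs(M, q), sorted_payoffs(M, best)):
            best = q
    return best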
>>: Where do you put the supposed -- to make the first one smallest, then after -- you have a unique solution?
>> Umar Syed: I beg your pardon?
>>: Suppose you insert Q it makes the one smaller.
>> Umar Syed: This is sort of degenerative. So then the maximum.
>>: Have you failed to make any change [inaudible] the second one you have a
change to the first one. It's kind of --
>> Umar Syed: It could happen that --
>>: You have a simple optimal strategy in the classical sense.
>> Umar Syed: Is also the lexicographic property. That's correct.
>>: So we're just trying to understand, in the setting, for example, of the game
rock, paper, scissors, actually you have just the one optimal strategy.
>> Umar Syed: That's right. That's right. But in rock, paper, scissors --
>>: When you have the bomb, then it becomes more interesting.
>> Umar Syed: That's right. Anytime you have two pure strategies that are both
max/min, then any convex combination of them is also max/min, and in that
setting you have infinitely many optimal strategies.
>>: But it's not guaranteed. So could it be that in the case that there are many
optimal strategies in the classical sense, could it be also that they also match in
all properties in all higher order --
>> Umar Syed: It could happen.
>>: It could happen?
>> Umar Syed: Yes. There could be a pure strategy that is also
lexicographically optimal. It could certainly happen.
>>: Another way -- other ideas. You could define a distribution, for example, over the
outcomes, how likely the first or the second or the third one is. So you use a combination of -- you
want to maximize the [inaudible] of the first and second and third one.
>> Umar Syed: You could. That sounds like you would require some kind of
prior knowledge about how your opponent is going to behave. If you had this
prior knowledge you could certainly exploit it. But in this setting we assume we
have no prior knowledge and trying to take sort of the usual game-theoretic
approach but refining it somewhat.
So let me give you some history. So we didn't invent the idea of lexicographical
optimal strategies. They were first introduced by Melvin Dresher way back in
1961. He didn't give them a name. If you look at the literature, people just called
them Dresher optimal or D-optimal strategies. And what's interesting about them
is that people have made connections between these and other notions of
equilibria in game theory. So today I'm just going to be talking about zero-sum
games. More generally, there are general-sum games. In general-sum games
there's a concept called the Nash equilibrium, which you're probably familiar with,
which is a situation where every player is playing a strategy and no player has
any incentive to deviate from that strategy.
And something that people have studied a lot about Nash equilibria are
refinements of the equilibria that are less sensitive to perturbations. So it turns
out that some Nash equilibria in some games are really sensitive to
perturbations, meaning if one person slightly changes their strategy then it's
suddenly not Nash at all. So people are interested in strategies that are, Nash
equilibria that are not just Nash but also insensitive to small perturbations and
there's like a whole menagerie of refined equilibrium concepts in general sum
games that correspond to different definitions of perturbations.
But in zero-sum games, everything sort of collapses. All of these refinements
coincide with lexicographic optimal strategies in a zero-sum game.
So, right, so if both players in the game are playing a lexicographic optimal
strategy, that's a proper equilibrium, it's also a perfect equilibrium, and there's also
a few other things that it is.
Okay. So having defined and hopefully motivated this solution concept, let's turn
to trying to compute it.
Throughout the talk I'm going to be looking at the game primarily from the
perspective of the column player. But most of what I say can be applied to the
row player as well. You just turn everything around.
So if you wanted to compute the classical max/min column strategy, you could do
it just by solving this objective. From the linearity of this problem you can sort of
see directly that you can solve it by solving a linear program. You can compute this
Q star.
And the size of that linear program is roughly the size of the game matrix. So
you can find this strategy in polynomial time in the game matrix, in the size of the
game matrix.
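A minimal sketch, not from the talk itself, of that single linear program, assuming a small explicit matrix M and using scipy; the variables are the column distribution Q plus the guaranteed value t:

import numpy as np
from scipy.optimize import linprog

def maxmin_column_strategy(M):
    n_rows, n_cols = M.shape
    # Variables: the column distribution q, followed by the guaranteed value t.
    c = np.zeros(n_cols + 1)
    c[-1] = -1.0                              # maximize t  <=>  minimize -t
    # For every row i:  t - (M q)_i <= 0.
    A_ub = np.hstack([-M, np.ones((n_rows, 1))])
    b_ub = np.zeros(n_rows)
    # q must be a probability distribution.
    A_eq = np.hstack([np.ones((1, n_cols)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_cols + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_cols], res.x[-1]          # strategy q and its guaranteed value

M = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
print(maxmin_column_strategy(M))              # roughly uniform q, value 0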
And do we have a solution like that for the lexicographical optimal strategy? And
indeed we do. So Miltersen and Sorensen back in '06 described a similar
approach to computing these lexicographical optimal strategies using a
sequence of linear programs. And the way their algorithm works is as follows:
So the first linear program in the sequence outputs a set of linear constraints. These
linear constraints define the set of optimal strategies in the classical sense, and
those constraints are fed to the second linear program in the sequence, and that
second linear program uses it to find strategies that are optimal on both the
minimum and second smallest row.
And it outputs a set of constraints that describes that set. And each successive
stage does a little more: the Kth stage figures out the Kth row value, and by the time
you're done you've found your lexicographic optimal strategy. Since you have
to solve one linear program per row of the matrix, the time of this entire
procedure is polynomial in the size of the matrix.
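A rough sketch of the sequence-of-LPs idea; to be clear, this is my simplification and not Miltersen and Sorensen's actual procedure. In particular, deciding which rows are genuinely pinned at each stage really requires looking at the LP duals, and taking the rows that happen to be tight at one optimum, as below, is only a heuristic:

import numpy as np
from scipy.optimize import linprog

def lex_maxmin_by_lps(M, tol=1e-7):
    n_rows, n_cols = M.shape
    pinned = {}                                # row index -> value it is held at
    q = None
    while len(pinned) < n_rows:
        free = [i for i in range(n_rows) if i not in pinned]
        c = np.zeros(n_cols + 1)
        c[-1] = -1.0                           # maximize t over the free rows
        A_ub = np.hstack([-M[free], np.ones((len(free), 1))])
        b_ub = np.zeros(len(free))
        A_eq = [np.append(np.ones(n_cols), 0.0)]
        b_eq = [1.0]
        for i, v in pinned.items():            # hold earlier rows at their values
            A_eq.append(np.append(M[i], 0.0))
            b_eq.append(v)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=[(0, None)] * n_cols + [(None, None)])
        q, t = res.x[:n_cols], res.x[-1]
        newly = [i for i in free if M[i] @ q <= t + tol]
        if not newly:                          # numerical safeguard
            newly = [min(free, key=lambda i: M[i] @ q)]
        for i in newly:
            pinned[i] = float(M[i] @ q)
    return q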
So that's fine. But a setting that we're interested in, and I'll be talking about
today, is when one of the players has a very large number of strategies.
Exponential, maybe even infinite. So the game matrix looks like this. It's much
wider than it is tall.
>>: Can I take the next question about this? So you're saying that the output of
each one of these LPs is a set of constraints.
>> Umar Syed: Yes.
>>: So basically what this would be the constraints that have non-zero dual
variables, that's [inaudible] like if you have a -- the algorithm would give you an
answer which is just a point inside the constraints, but you're looking -- you're asking
what is the -- you want the entire set of all feasible?
>> Umar Syed: Yeah. So let me give you the more naive version of their
algorithm, which is I think easier to understand. Take the original problem: not only
do they figure out a strategy that is optimal in the classical sense, they also figure
out sort of the value of the game, the payoff that strategy earns, and the support of
that strategy, which essentially tells you what are the hardest rows in the matrix,
the ones that are sort of tied for the minimization.
Then in the second stage of the algorithm, those rows are now bound to be that
value. So if you know what are the hardest rows and you bound them to be the
optimal value, then the set of column strategies that gives you those values on
those rows is the set of all optimal strategies.
>>: These are equality constraints.
>> Umar Syed: These are equality constraints. Solving exactly. Each of those
linear programs have size roughly proportional to the size of the game matrix.
And we're interested in a setting where one player has a whole bunch of
strategies.
>>: Is there any results, [inaudible] the sigma rounds will be much easier to
solve, [phonetic] given that you have [inaudible].
>> Umar Syed: I guess intuitively.
>>: Is there any way, is it going to be formalized in terms of the actual velocity
that you [inaudible] depending on how many [inaudible].
>> Umar Syed: They didn't really investigate that. Everything was just bounded
in terms of the size of the matrix.
It sounds like the worst case would be you figure out one row per stage, I guess,
yeah. Okay. But, again, that algorithm is polynomial in the size of the matrix.
And sometimes the game you want to solve doesn't have the property that it's a
reasonable size. And towards the end of the talk I'll give you some examples of
games from reinforcement learning that have that property. But maybe to give
you an example that's perhaps familiar to more of you, I think, let me talk a little
bit about boosting.
So boosting is an algorithm for supervised classification. And the goal of
boosting is you're given a set of training examples and a set of weak hypotheses,
and the goal of boosting is to find a weighted combination of the weak hypotheses
that maximizes the minimum margin on your training set.
And you can formulate that goal as finding a max/min strategy in a zero-sum
game. So the game is -- so the game matrix for that game has the following
shape: The rows correspond to the training examples. One for each example.
The columns correspond to these weak hypotheses. And an entry in the matrix
is the margin of that weak hypothesis on that example. So you're trying to find a
weighted combination, a strategy of weak hypotheses, that maximizes the minimum
margin; that's a max/min column strategy. The trouble is typically the number of
weak hypotheses is very, very large. In some settings it can be infinite. We
can't just use a linear program to find an optimal strategy in this game.
So this is an example of a game where it's too big for the LP solution to work.
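A tiny sketch of that boosting game matrix, with hypothetical data and hypotheses of my own; labels are assumed to be in {-1, +1}, so the margin of hypothesis j on example i is just y_i * h_j(x_i):

import numpy as np

def boosting_game_matrix(X, y, hypotheses):
    # Rows are training examples, columns are weak hypotheses,
    # and each entry is the margin of that hypothesis on that example.
    return np.array([[y[i] * h(X[i]) for h in hypotheses]
                     for i in range(len(X))], dtype=float)

A max/min column strategy of this matrix is a weighting of the hypotheses that maximizes the minimum margin; when the columns cannot be enumerated, the weak learner plays the role of the best-response oracle described next.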
Okay. So let me talk a little bit about now existing algorithms for finding optimal
strategies in very large games. So typically the assumption that you make when
you're faced with this kind of situation is that you assume you have what's called
a best response oracle for the player that has a very large number of strategies.
And the best response oracle is the following. It's an algorithm that, for example,
for the column player, if you fix a row strategy, then this oracle can tell you what
is the best column strategy for that fixed row strategy. And in the boosting
example that I described earlier, that oracle is just the weak learner. So if you're
familiar with boosting: boosting, every round, just adjusts the weights on the
examples. And for a given weighting of the examples you find the best weak
hypothesis, the one that has the lowest error, and the algorithm that gives you that
weak hypothesis is the weak learner. That's the oracle.
And again at the end of the talk I'll describe some situations in reinforcement
learning where you have exactly these kind of games and you have an oracle for
one of the players.
So if you have this really wide game and you have an oracle for the column
player, here's an algorithm, a pretty intuitive algorithm, that finds an optimal
strategy in the game. And it works as follows: So you start with, you put a
uniform distribution on the rows. You find the best response, and in each round
you find the best response with the current weighting of the rows using your
oracle. And then you update the weights.
And what is this saying? This is just saying that the update of the weights puts
more weight on to rows where the -- so in each round T, you learn some best
response, right? And that best response has some performance against every
row strategy. And you just shift the weight onto those row strategies where that
column strategy did poorly; the worse it did, the more weight they get. By shifting
the weight in this manner, over time you're forcing the column or the column
strategy oracle to focus on the hardest row strategies, because it puts more and
more weight on the ones that it did poorly in the past.
And the analysis of this algorithm says that because of that property over time
the minimum row of the average of all the column strategies that you generate is
maximized. And this is basically AdaBoost; I just described AdaBoost. To
describe what this algorithm is doing let me describe how it would do against a
standard game matrix.
So you initialize, this is supposed to represent the weights. You initialize all the
weights on the rows. As the algorithm proceeds it keeps generating these -- it
keeps generating these best responses for each weighting of the rows.
And as the algorithm proceeds, more and more weight is being put on some
rows, and other rows it's decaying, going all the way to 0. The rows getting all
the weight are the hardest ones, the hardest ones to increase the value for.
So towards the end of the algorithm the best response algorithm is forced to
concentrate on those rows.
>>: [inaudible].
>> Umar Syed: What is?
>>: The way in which it's increasing and decreasing --
>> Umar Syed: No, it's not monotone. And I'll illustrate that in a couple of slides.
So the analysis of this algorithm is that if you run this weight shifting algorithm the
way I described, then both the average of the row strategies and the column
strategies converge asymptotically to optimal strategies in the classical sense, and the rate is not
too bad.
And here N is the number of rows in the matrix. And notice there's no explicit
dependence on the number of columns of M; as long as you have the best
response oracle there's no explicit dependence.
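A minimal sketch of that weight-shifting scheme, assuming a generic best_response oracle; the exponential update below is one standard choice of no-regret update rather than the specific one on the slide, and the claim from the talk is that the average of the returned columns approaches a classical max/min strategy:

import numpy as np

def no_regret_maxmin(M, best_response, T=1000, eta=0.1):
    n_rows = M.shape[0]
    p = np.full(n_rows, 1.0 / n_rows)          # start with uniform row weights
    avg_q = np.zeros(M.shape[1])
    for _ in range(T):
        q = best_response(p)                   # oracle: best column vs. current weights
        avg_q += q
        payoffs = M @ q                        # payoff of q against every pure row
        p *= np.exp(-eta * payoffs)            # more weight on rows where q did worst
        p /= p.sum()
    return avg_q / T

def explicit_oracle(M):
    # For a small explicit matrix the oracle is just a max over columns.
    def best_response(p):
        q = np.zeros(M.shape[1])
        q[np.argmax(p @ M)] = 1.0
        return q
    return best_response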
Okay. So, again, just looking at things from the perspective of the column player,
I have -- we want a lexicographic max/min strategy for the column player. I've
already shown you that the average of the column strategies generated by this no
regret algorithm, which is basically AdaBoost, converges to a max/min strategy
in the classical sense. Could it be that it also converges to a lexicographic
max/min strategy, which is just a special case of a classical max/min strategy? If
that were true then we would be really lucky and we would be done.
But unfortunately this does not hold in general. And I'm going to describe a
counter example and I'll spend a little bit of time describing this counter example
and explaining why the algorithm fails because that's going to motivate and
explain why our algorithm is able to overcome this obstacle.
So here's this matrix. So let me give you a tour of this matrix. So this matrix has
three rows. Basically the first two rows are about A. Think of A as some positive
number, and epsilon is some small positive number. So the first two rows of the
matrix are all roughly A. And on the third row you can divide the columns into
two groups. Columns that do quite well on the third row and columns that don't.
So these columns do twice as well on the third row than these columns.
Okay. So that's the tour of the matrix. So what if we ran our no regret weight
updating algorithm on this matrix, what would happen? So again you initialize
the row weights are uniform. And what happens initially is that the best response
toggles between the first two columns. And let me try to explain why that's
happening. So, first of all, why is it selecting columns in the first pair and not in
the second pair? Well, this is happening because there's some weight on this
third row. And early in the algorithm that weight is not negligible.
And so since these column strategies do so much better on that third row than
these do, these strategies are much better than the ones over here.
And they all do about the same on the first two rows.
Now, why is it toggling? Well, the reason it's toggling is that so this maybe
addresses your question, this is sort of very typical behavior of these algorithms.
At some point in the algorithm one row has slightly more weight than the other.
Notice the way these two columns deviate from A from their min is sort of
symmetric. So one is a little bit more than A and the other is a little bit less,
and then they change places.
So in this iteration, between the first two columns, it prefers the first column
slightly more because it's slightly more biased towards the row that has a little bit
more weight. And because the weight now shifts onto the rows where it didn't
do well, the bias just flip-flops back and forth.
Okay. So that's what happens initially. But now here's what happens after a
while. So remember the weight increases on the hardest rows of the matrix, the
hardest rows of this matrix are the first two. And in the first iterations it's doing
really well on the third row; it's 2A instead of just A. And the weight on that row
decays rapidly to 0. And pretty soon the weight on that row is so small that
basically the best response oracle is ignoring that row entirely. It's just having
basically no effect on the payoff. And now it toggles between these two because
the deviations here are slightly larger than the deviations here.
So as soon as the weight on that third row becomes less than basically the
magnitude of this oscillation that's happening, it's going to flop over to just
toggling between those last two columns and it will do that indefinitely.
So putting this all together if I run the no regret algorithm on this matrix then
asymptotically it converges to a column strategy that gives the value A to every row,
whereas we know there's a lexicographic optimal strategy that gives 2A to the
last row, basically the uniform average of these first two columns.
Okay. So just to sum up what I said, the basic problem is that this no regret
algorithm, the reason it works is that it shifts all the weight of these row strategies
on to the hardest rows of the matrix.
And that causes the best response oracle to ultimately ignore the other rows. It
just focuses on the hardest rows, the ones that are hardest to maximize.
>>: So there's an underlying convex optimization process going on here. So that's
still converging to something, you're oscillating and the value converges,
oscillation continues in these last two columns.
>> Umar Syed: That's right. The behavior of the averages, the averages are
well-behaved. The actual strategies themselves are not well-behaved at all. In
fact, people have studied the dynamics of that distribution and it's very
complicated. In fact, you mentioned Cynthia Rudin; her thesis was on studying
the dynamics of those solutions. There's no simple answer there. It's highly
complex.
>>: So [inaudible] ground motion.
>> Umar Syed: Yeah.
>>: So how does it relate to this?
>> Umar Syed: So if we look at this algorithm as a boosting algorithm, which it
is, the problem that I'm describing is that the algorithm focuses too much on
outliers. Outliers are the ones that have small margin. And that's like sort of the
fundamental problem with AdaBoost.
There's a whole family of algorithms that have been designed to try to address
this problem. BrownBoost is one of them. There's [inaudible] boost. There's
meta boost. There's just a lot of them. And they all try to basically -- so AdaBoost
puts an extreme amount of weight on these outliers. They one way or
another try to soften that extreme behavior to try to get it to perform better.
>>: So in one sense, so would you propose it falls in the same family or is it
something different?
>> Umar Syed: It's something different, because although it's the same issue,
we're taking a slightly different approach.
So, again, the problem is that it focuses too much on those hardest rows. It
ignores the ones that aren't hardest. But in our objective we care about those
rows. So our solution which I'm about to describe is to change the weight
updates in such a way so that the weights do not decay all the way to zero on
those other rows. And that will make the algorithm not ignore them. That's sort
of the outline of what we're about to do.
>>: So the analogy of lexicographic objective for boosting is compared to
Saavedra, is it still the same worst case optimum, but this sort of less-than-worst
case, what is the game theory?
>> Umar Syed: So the analogy would be usual boosting is trying to maximize the
minimum margin. This objective is maximizing the minimum margin and subject
to that constraint maximize the second smallest margin and then the third
smallest and so forth.
>>: So sort of think about it: you don't care about just support vectors but you
care, you're stepping a little further out and you're adding this kind of soft core
vectors and you care about their margins.
>> Umar Syed: That's a good phrase for it. Soft support vector matrix. Another
way of thinking about it you care about the whole margin distribution and you're
trying to push it all up. But while maintaining the worst case property.
>>: But you care about the amount of declines set further up.
>> Umar Syed: I don't know. So it's lexicographic. So it's not that I'm putting an
amount of weight on each one, it's --
>>: The weight is now impacted?
>> Umar Syed: Yes.
>>: This is a generalization --
>> Umar Syed: Not that I know of. I mean, that's something I would like to
study. Because we're working on this now.
>>: The work by Forester [phonetic] and I don't remember who else who studied
the full distribution.
>> Umar Syed: That's right. I should mention that. People have studied how to
bound generalization in terms of the entire margin distribution. But not -- I don't
know how to connect their work to this lexicographic property.
Okay. So maybe I can -- I'll try to explain our algorithm. So before jumping into
exactly what our algorithm does, let me give a little, set some foundation. So for
any vector C, I want to think about the shifted version of our game matrix. So M
sub C is just take each column of our game matrix and subtract the vector C from
it. If you did it in Matlab, this is what I mean, M minus C.
Now, this shifting operation has two interesting properties. One is that our
assumption that we have a best response oracle for the original matrix implies
that we have a best response oracle for any shifted version of the matrix. And
the proof is just this equation.
The payoff of any two strategies in the shifted version is just the original payoff
minus this quantity that's independent of Q. So if you can arg max this quantity,
then you're also arg maxing this quantity with respect to Q.
So we have a best response oracle for any shifted version of the matrix. Now,
here is the second fact, which is why we care about shifted matrices. Which is if
you let V star be sort of the vector of values, the vector of payoffs that you're
going to get on each pure row strategy under the lexicographic optimum, if you just
shift your matrix by that vector, then any optimal strategy in the traditional sense
on this shifted matrix is a lexicographic optimal strategy on the original matrix.
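A small sketch, my illustration with hypothetical helper names, of those two facts: the column-wise shift by a vector c, why a best-response oracle for M already serves the shifted matrix, and how an ordinary max/min solve on the v*-shifted matrix would then give a lexicographic optimum for the original M:

import numpy as np

def shift(M, c):
    # M_c: subtract the vector c from every column of M.
    return M - np.asarray(c)[:, None]

def shifted_best_response(best_response):
    # p^T (M - c 1^T) q = p^T M q - p^T c, and the second term does not depend
    # on q, so the original oracle already answers shifted queries.
    return best_response

def lex_optimum_via_shift(M, v_star, classical_maxmin_solver):
    # classical_maxmin_solver: any routine (an LP, the no-regret scheme, ...)
    # that returns an ordinary max/min column strategy for its input matrix.
    return classical_maxmin_solver(shift(M, v_star))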
>>: So this requires somewhat --
>> Umar Syed: Yeah.
>>: So the vector V is now, if you had -- if it was D dimension, now it's sorted.
>> Umar Syed: It's sorted, that's right.
>>: So --
>> Umar Syed: This is a very good point. So I decided not to put this in. I
thought it was getting too much into the weeds. But I can see you're going there.
So you may be concerned that -- let me just say this. So it turns out that we can
assume that without loss of generality that the first row of our matrix is the
hardest one. The second row is the second hardest and so on.
And further every lexicographic optimal strategy is also sorted.
>>: So that means -- let me follow that line. That means if there are two optimal
strategies in the lexicographical sense, the hardness of a pure strategy is fixed
independent, right? It's not that for one strategy this will be the second hardest and
this will be the sixth hardest, and for another one they will switch places. This is fixed.
>> Umar Syed: That's right. I used to have a slide. I think another way of saying what
you're saying is that there's an unambiguous ordering of the rows. One of them is the
hardest and the other one is the second hardest and so on. It's independent of
the actual strategy that's being used. That's why it's kind of -- that's part of the
reason -- that fact is used to prove this lemma. I should have put this in the
presentation. So now if I just -- so if I just shift down all the rows of the matrix by
how big those worst case values are, now any optimal strategy in the traditional
sense is a lexicographic optimal strategy for the original matrix.
>>: [inaudible].
>> Umar Syed: Yes, that's right.
>>: Any optimal strategy must be --
>> Umar Syed: That's right. So combining these facts, now in order to figure out
a lexicographic optimal strategy I just have to compute this vector V star. If I've
got this vector, again then I can just run my usual algorithm on the shifted version
of the matrix, and I have a best response oracle for that shifted version. I argued
for that earlier. Now the problem reduces to computing this vector of values V
star.
Okay. So how will I do that? So here's the algorithm in the pictures for how to
compute V star. So I run my no regret algorithm on my matrix M. And after a
while I get a picture that looks like this. I have a lot of weight in the hardest rows
and a little weight on everything else.
So I can use that fact to identify which are the hardest rows of the matrix. And
here I've coded them blue and everything else I've coded black.
Now that I've identified those rows, okay, and as you pointed out they're sort of
unambiguously the hardest, they're the hardest for all strategies, now I rerun the
algorithm. But when I rerun it, every round I change the weight update a little bit.
Basically I look at my distribution the algorithm is giving me on the rows. If I find
that the weight on the nonhardest rows is too small, which is the problem in the
counter example, right, I just rescale the weight so it's not too small. So I force
those weights to be bounded.
>>: Matching or do you change the update?
>> Umar Syed: Just the updates. So those weights that were decaying to zero,
which was the problem, I just force those weights not to decay to zero. I bound
them away from zero and the order is roughly one over epsilon to the fourth. Go
ahead.
>>: In other work, in other problems, with a similar flavor, like problems of
tracking the best expert in online learning, or maybe bandit prediction algorithms, they
have this problem of multiplicative weights going to 0, and what they usually do is
just mix in kind of a little bit of uniform distribution and fix the problem that way.
>> Umar Syed: Yeah. But that doesn't work for us.
>>: Why not?
>> Umar Syed: Or at least I couldn't get that to work. Because uniform
distribution sort of smears everything.
One way to think about it is what I really want is I want my algorithm to be
running -- I want my no regret algorithm to be sort of running on parallel on the
hardest rows and nonhardest rows. I want them to focus on both.
And these black ones, the nonhardest ones those weights are getting really
small. If I just mix this distribution with uniform, it's going to like -- like there's
some structure to this distribution that I want to preserve.
But if I mix it with uniform distribution, those weights are so small it will just get
smeared. And my regret property will no longer hold. I'm not sure if it's making
sense.
But I need to sort of run -- I want to run two versions of my no regret algorithm on
these two groups of rows and I want them run at different scales but within each
group I want those proportions to be preserved so that I have the no regret
property.
>>: So you have an algorithm with the proof. I have nothing.
>> Umar Syed: But your idea for first approach is the right one, because that's
the first thing we tried.
>>: Then I would say adding that little bit of uniform distribution, the guys that
became very small, it would make them big again. The guys that are big, it
wouldn't change because it would just be epsilon changes, it wouldn't affect them
at all. It would have the same effect of exactly amplifying the little ones while not
touching the big ones.
>> Umar Syed: So trouble -- suppose I just cared about these non-hard rows. I
want to maximize the smallest one of the non-hard ones. So what I could do -- what I'd really like to do is run my no regret algorithm on just those rows.
Now, remember these weights aren't -- the blue weights are on a totally different
scale than the black weights. So a small number compared to the blue weights
is a very big number for the black weights.
So if I mix -- if I mix the entire strategy with the uniform distribution, even with a
small mixing coefficient, that's going to be a big mixing coefficient compared to
the weights on those black rows. And it will just -- it will sort of totally screw up
the algorithm that's running on those rows, because sort of on their scale, I'm mixing
it with a huge uniform distribution.
>>: More explicitly, what does it mean to rescale it, and can we get a picture of
what the difference is of doing that?
>> Umar Syed: You want to ask the question first?
>>: Would you say in general you can propose different strategies for doing this
regularization slash smoothing, think of it as weights or smoothing and so on, but
your lexicographic objective in the end gives you a single correct way that will
hope to do it that will maximize it. So the fact that there's -- that there's other
ways to do the smoothing, transition, are inferior is because lexicographically
there's one of them that is the right one that maximizes it. Is that fair?
>> Umar Syed: Well, I don't know. I would say it's inferior because of the
algorithm we're using. There may be another algorithm. So we're solving this
problem in a no regret way. And there's nothing that -- there's nothing that tells
me that the no regret algorithm is the best way to solve this problem.
>>: But there's a function of both objective and the regret approach.
>> Umar Syed: Yeah. So let me answer your question. When I say "rescale",
what I mean is I have a constant in mind, lambda. And I want the sum of the
black weights to always be at least lambda.
And the sum of the blue weights will be at most 1 minus lambda.
>>: How do you define that? Isn't it so it could be always take the smallest one
and make it lambda.
>> Umar Syed: No, you're right. I do a differential normalization. So I look at the
blue rows, and I normalize them so their sum is 1 minus lambda. And then
separately I normalize the black rows so that their sum is lambda. But within the
groups I maintain the proportions between them and that's very important.
>>: How do you define the lexi [inaudible].
>> Umar Syed: So in the previous step of the algorithm, I run the algorithm. I
look at the places where it's putting a lot of weight.
>>: There needs to be some threshold. It needs to be modified, right? So --
>> Umar Syed: The threshold comes from the analysis, which says that if the
weights on certain rows are above the threshold, then I know that they should be
colored blue, and if the weight is below a threshold then they should be colored
black.
>>: So that looks -- let me see if I get it right. So you say okay I run the algorithm
and then I set a threshold. And all the -- I make sure that all the rows that are
below the threshold sum up to lambda. So renormalize them --
>> Umar Syed: I identify the rows first, then I start from the beginning. And at
every round I ensure that the black rows are at least lambda and the blue rows
are at most 1 minus lambda.
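A minimal sketch of that differential renormalization, as I read the description: the hard (blue) rows and the remaining (black) rows are normalized separately, to masses 1 - lambda and lambda, and the proportions within each group are preserved. The value of lambda and when exactly to intervene come from the analysis, so the condition below is only a plausible reading:

import numpy as np

def differential_normalize(w, hard_rows, lam):
    w = np.asarray(w, dtype=float).copy()
    hard = np.zeros(len(w), dtype=bool)
    hard[list(hard_rows)] = True
    if w[~hard].sum() < lam:                    # non-hard mass has decayed too far
        w[hard] *= (1.0 - lam) / w[hard].sum()  # blue group sums to 1 - lambda
        w[~hard] *= lam / w[~hard].sum()        # black group sums to lambda
    return w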
>>: You said this threshold comes from your analysis; what does it depend on?
>> Umar Syed: The desired error. And --
>>: And the gaps.
>> Umar Syed: And the gap.
>>: So it looks like you're precisely choosing the active dual variables again, just
like the 2006 paper.
>> Umar Syed: Yes.
>>: Choosing active constraints, choosing them separately from the inactive
constraints?
>> Umar Syed: Yeah, exactly. That's right. Yeah.
>>: So just to make sure that I understand what you said. After -- so you had the
initial step in which you identified the blues and the blacks. And now you will
start from scratch, but after every AdaBoost update, after every multiplicative
update that you do, you redo the --
>> Umar Syed: Normalize. I normalize one group separate from the other.
Within the groups I maintain the proportions. That's super important. That's why
the uniform mixing doesn't work for us. Because if I just uniform mix them the
black group will just get blown away. The proportions within them will be totally
screwed up. So I do that. So what does this get me? So before I can explain
what this gets me just let me give you a little more notation. Remember this
vector of payoff values, this is its definition. And so by the definition of theta, this
is a sorted vector. The first component is smallest and the second and so on.
So let's just identify the first break point in that vector.
So what this is saying is that the first K values of the vector are all equal and then
the K plus first one is a little bit more. K could be one. It could be N. It could be
anything.
So if I do this scheme that I just described, identify the hardest rows and start
from scratch and differentially normalize, then the minimum over the hardest
rows will be V1, sort of the hardest value, the smallest value. And the minimum
over the non-hardest rows will be V star K plus 1, the minimum over those rows.
It's like I was saying earlier I'm sort of running the algorithm in parallel on these
two groups, and they're each computing this value sort of separately.
And the key to making this happen is that that differential normalization that kept
the weights from the nonhard rows from going all the way to 0 this means that
the best response oracle can't ignore those rows, and this is why -- oops -- this is
why I get this property. Otherwise I would get, I would run into the problem I had
with the counter example. And this might be as small as V1.
Okay. So I just showed you how to get basically V1 through VK plus 1. And you
just -- this procedure that I described can be sort of iterated to recover all the
values in the entire vector.
>>: So what you want to do is to first recover the first K plus 1 values and then just
add them to the objective and --
>> Umar Syed: Shift them. Shift the first rows so they're all the same value.
Now they become the hardest rows. Before they were in two groups. Now they'll
be in the same group. That's right. And so after you sort of do all this, then the
theorem that we have is that this algorithm that I just described at a high level
finds this epsilon optimal lexicographic strategy in about this much time. And
which is about quadratically worse than the no regret algorithms. They're usually
squared and this is to the fourth.
>>: Can you say something about how [inaudible] finds a nonstandard min/max
solution? Because it does.
>> Umar Syed: In the same amount of time. That's basically the first -- no, that's
not right. That's not right. No, no, no it's lower. It will be the same rate. It will be
the same rate.
>>: You think that's just inevitable, or could you find the best of both everything?
>> Umar Syed: I wouldn't say it is inevitable.
>>: But just by the sum [inaudible] the spectral gaps and eigengaps and so forth,
does it somehow relate [inaudible] the spectral gap.
>> Umar Syed: I'm not sure --
>>: If you take the matrix and you want to find the first eigenvector, you can just
apply the matrix over and over again, just to a vector, and all the weights decay
exponentially. This is what happens here, right? You get this exponential decay
of weights, and you know there are a lot of properties of matrices that come from the
gap between the first eigenvalue and the second eigenvalue, so how fast it will converge
to the first eigenvector is just the spectral gap. And it seems as if this is exactly
what happened here, right? If the values -- when will it be hard to find, when the
gap is small, right?
>> Umar Syed: Yeah, the key is this gap right here. But I don't know if it's
related to the gap.
>>: Is there any connection, can we tell some unified story about that?
>> Umar Syed: I don't know. I don't know that these values are related to the
eigenvalues in the matrix. It doesn't --
>>: Isn't your analysis -- the difference between these, sort of V star K and V star K plus 1
[phonetic] is in there?
>> Umar Syed: It is.
>>: So it's smaller then you don't care, if it's large then...
>> Umar Syed: But I don't know whether it is or if it is how it's connected to the
eigenvalues, I'm not sure. Okay. And now so that's our algorithm. Now I'll just
briefly describe some problems in reinforcement learning that can be solved
using this kind of algorithm.
Okay. So what's reinforcement learning? So let's step back. So the goal in
reinforcement learning is that there's some agent navigating some unknown
stochastic environment. And you're trying to come up with a control policy for
that agent. Typical applications are like navigating a vehicle, guiding a robot,
something like that, and the only feedback that this agent receives is some
exogenous reward signal that tells it in every state of the environment whether
that state is good or not good.
The goal for the control policy for the agent is to maximize the cumulative reward
that the agent receives over time. And so the value of a policy is defined as the
sort of total cumulative reward that the policy receives over time and that's called
the value function.
I'm going a bit fast because I think maybe you guys are familiar with this. Okay.
Now, there's really sort of two definitions of an optimal policy. And sometimes
the distinction is not made clear in papers that you read.
A weakly optimal policy is one that maximizes the value for some fixed initial
distribution on the states. We can call that a weakly optimal policy. And a
strongly optimal policy is one that maximizes the value of the policy no matter
what the initial state distribution is. So it simultaneously has maximum value at
every state. We'll call that a strongly optimal policy.
So weakly and strongly optimal. Clearly a weakly optimal policy exists, by continuity
or something, but it's not obvious that a strongly optimal policy exists, one that
maximizes the value at every state simultaneously. But there's a classical result
that in a Markov decision process, which is the usual model for reinforcement learning,
there's always a strongly optimal policy. There's a control policy that maximizes
reward no matter what state you start in simultaneously.
But a lot of reinforcement learning algorithms, especially the ones that do not
assume that there's an explicit description of the environment, they only return a
weakly optimal policy. That is, you specify some start state distribution and it will
find the optimal policy with respect to that initial distribution but perhaps not all
states simultaneously.
Those are called model-free algorithms, the ones that don't see the explicit
description of the environment. So this lexicographic optimal strategy finding
algorithm that I just described to you is a way of converting a reinforcement
learning algorithm that returns weakly optimal policies into one that returns
strongly optimal policies in the following way. So I construct the following game
matrix.
The rows of the matrix correspond to states of the environment. And the
columns of the matrix correspond to all possible policies. And an entry in the
matrix is the value. So the IJth entry is the value of the Jth policy when you start
at state I.
And it's easy to show that the lexicographic max/min strategy in this game is a
strongly optimal policy. It's maximizing the value at every state simultaneously.
>>: [inaudible].
>> Umar Syed: [inaudible] well, probably it's just for technical reasons it might be
better to assume that there's not. That implies there might be infinitely many
actions and that gets hairy but there are certainly exponential number of rows.
>>: [inaudible] to make it [inaudible] would those same amounts hold?
>> Umar Syed: If you reduce to the bandit case you would have one arm per
policy. There would be exponentially many arms. Exponential in the number of
states.
So strongly optimal policy is a lexicographic max/min strategy in this game. And
what is the best response oracle for this game? Well, it's an algorithm for
computing a weakly optimal policy. You fix some distribution on the states and
the best response is the optimal policy for that fixed initial distribution.
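A toy sketch of that reduction for a small, explicitly given MDP; the arrays P and r, the discount, and the enumeration of deterministic policies are hypothetical scaffolding of mine, and in a real instance the column set is exactly what you cannot enumerate, which is why a weakly optimal solver is used as the best-response oracle instead:

import itertools
import numpy as np

def policy_value(P, r, pi, gamma=0.9):
    # V^pi = (I - gamma * P_pi)^{-1} r_pi for a deterministic policy pi,
    # where P has shape (states, actions, states) and r has shape (states, actions).
    n = P.shape[0]
    P_pi = np.array([P[s, pi[s]] for s in range(n)])
    r_pi = np.array([r[s, pi[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def build_game_matrix(P, r, gamma=0.9):
    n_states, n_actions = r.shape
    policies = itertools.product(range(n_actions), repeat=n_states)
    # One column per deterministic policy: entry (i, j) is the value of
    # policy j started at state i.  Exponentially many columns in general.
    return np.column_stack([policy_value(P, r, pi, gamma) for pi in policies])

def weakly_optimal_oracle(P, r, gamma=0.9):
    M = build_game_matrix(P, r, gamma)          # only feasible for this toy case
    def best_response(p):
        # Best policy for the fixed start-state distribution p: exactly what a
        # weakly optimal reinforcement learning algorithm would return.
        q = np.zeros(M.shape[1])
        q[np.argmax(p @ M)] = 1.0
        return q
    return best_response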
Okay. So I've given you a reduction. And I'm a little hesitant to say this because
it seems a little surprising, but so far as I know, no one has actually given this
kind of reduction before going from an algorithm that computes weakly to
strongly optimal. But let me add an asterisk to it because it seems kind of too
good to be true.
>>: Where does the lexicographic stuff come in? Why doesn't the standard
boosting do --
>> Umar Syed: What boosting would do, it would return a policy that has the
highest value on the start state that's sort of the worst start state to be in.
>>: I see. Okay.
>> Umar Syed: That's what standard boosting would give you. And this gives
you the highest value on all of them simultaneously. Okay. So that's application
number one. Application number two is perhaps maybe a little bit more obvious.
So in a lot of environments, there's not just one natural value function that tells
you the value of the policy. There are like several. So if you're driving a car it's like
multiple objectives you're trying to satisfy. You're trying to go as fast as you can,
trying to avoid crashes, trying to stay on the road, et cetera, and these folks gave
an algorithm for finding a policy that lexicographically maximizes that whole set of
value functions. But assuming a fixed ordering on the value functions. So they
assume as input to the algorithm that one of the objectives, say speed, is the most
important. And then another objective, like staying on the road, is second most
important, and so on. So you're given that ordering and they find the one that's best
with respect to that ordering,
lexicographically best with respect to that ordering.
If you were to apply our algorithm to this problem, where the game matrix is
defined as follows, again the columns are the policies and the rows are the
different value functions. And the Ith, Jth entry is the value of the Jth policy with
respect to the Ith value function. The lexicographic max/min strategy for this game is a
policy that does the same thing. It finds the lexicographically best policy, but
without having to specify which value functions are more important than the
others.
>>: The value functions would need to be on the same scale?
>> Umar Syed: That's exactly right. That's exactly right.
I was just about to say that. This assumes implicitly that these are on the same
scale. If they're on different scales, then the different scales are telling you
implicitly which are more important than the other, that's totally right.
Okay. And that's it. It was just a very sort of brief tour of the application. So
again I gave you an efficient algorithm for computing these mistake exploiting
strategies in very large games, and a few applications to reinforcement
learning. So thanks a lot for your time.
[applause].
>>: Does anyone have questions?
>>: Have you tried this on real problems?
>> Umar Syed: So we have a sort of toy simulator, a car driving simulator. We
tried it on that. And you can definitely set up a situation where -- and we give it
like multiple objectives. It's quite easy to set up a situation where a traditional
algorithm is not going to be maximizing all objectives simultaneously.
>>: Crash, stay on the road.
>> Umar Syed: Stay on the road, drive as fast as we can.
>>: [inaudible] equally.
>> Umar Syed: They're all on similar scales, that's right.
>>: I have a question I want to address. So I think what Chris said in the
beginning of the talk is interesting about what happens if -- I don't want a strict
lexicographic order. But I'm saying you know I'm willing to sacrifice something
with respect to this max guy. So I mean they won't be optimal with respect to the
best thing, but if I can be much better on the second best and so on.
>> Umar Syed: Yeah.
>>: And I think an equivalent way of saying this is to say that I have some model
of how my [inaudible] chooses his strategy. Perhaps he can sort them from the
best to the worst, but rather than picking the best thing, maybe he has like a
stupidity probability and he flips a coin: am I going to choose the first one, yes or no;
if it's heads I choose that one, if not I move on to the second one. I think that's
maybe equivalent to what I want to say about -- I believe that's the brow or the
comb is no. But it doesn't have to be that. Let's think of the more general thing
that Chris was saying. I just want to be somehow more soft with respect to the
first one. So softer. So it seems that the algorithm from 2006, I forget the
algorithm, the recurring LP one is easily adaptable to that setting whereas maybe
I'm missing something here, but it seems like rather than putting out a set of
equality constraints, the constraints you pass on to the next step will be -- instead
of linear constraints, a linear subspace, you would have some set of -- you'd add a
half-space that says roughly stay in this optimal space but with some epsilon. And
of course you move away from the space and get worse with the first one, but
now forget about that.
>> Umar Syed: That sounds right.
>>: Just the comment about the [inaudible] if you know what the [inaudible] and
tradition over --
>>: For this one I don't know anything. I'm just saying I'm willing to pay epsilon on
the first objective or let's say --
>>: But that implicitly assumes that you have a high level cost structure that you can go
back to defining.
>>: I think it's equivalent to saying the coin flipping with the p-stupid type of
thing. But my question is I think we now understand how to take the simple
recurring LP algorithm and make it softer with respect to more important
decisions, how do you take your algorithm ->> Umar Syed: I don't know. That's an excellent question. The answer is I don't
know if this algorithm is suited for the kind of extension that you're describing.
>>: But why? It seems that ->>: Isn't it like going from we had all these algorithms, [inaudible] space housing
and stuff like that, going from each [inaudible] so basically what you say if I have
a policy now, I measure it with respect to the worst case, and I kind of penalize it
based on how far it is from the best policies as opposed to kind of the spacing
that will cut out everything that's not optimal, you're just going to penalize it in
some way and proceed. So at every step I have some sort of aggregate the
penalties that this policy gets.
But the question that I can see, how -- you need to have some sort of model. So
you propose this, point to us this model. But I think this model it does not
assume anything about the opponent, you don't have to assume that --
>>: I'm assuming that the opponent is P-stupid. I don't want to exploit his
mistakes. I'm saying he'll make mistakes with the probability I know.
>> Umar Syed: So let me try to answer your question more fully. So one thing
you might guess is, so there's this -- I defined this term "lambda," you're
normalizing these two groups of rows differently. One of them is at least lambda
and the other is at most one minus lambda. You might think that to make
things softer you change the value of lambda somehow, right?
>>: Yes.
>> Umar Syed: But the thing is whatever lambda is, if it's more than 0, then if
you run the algorithm forever it will end up in the same place. As long as lambda
is more than 0, it ends up in exactly the same place. It just changes how long it
takes to get there. So the lambda is tuned to make, for speed, to make the
convergence time as good as possible. But it doesn't affect where the algorithm
ends up. So that's an idea to address what you're saying. That won't work, and I
guess I don't have another idea off the top of my head.
>>: Good answer.
>> Ofer Dekel: No more questions? Thank you again.
[applause]