>> Eric Horvitz: Okay, we’ll get started. It’s great having Tuomas here from CMU. Tuomas is a Professor of Computer Science. He has affiliate appointments in the Machine Learning Department at CMU, as well as the Program in Algorithms, Combinatorics, and Optimization. He’s also part of the newish CMU/UPitt Joint Ph.D. Program in Computational Biology.
I’ve known Tuomas for a number of years. He’s been working in areas of interest to both of us, areas that are dear to my heart and soul: bounded rationality, decision theory, game theory, some of the core challenges we face in AI where we have limited resources and information. He developed, per the topic today, some leading algorithms for several general classes of games. These algorithms, which we’ll hear more about today, won one of the most recent world championships in computer Heads-Up No-Limit Texas Hold’em.
They also have lots of interesting implications for other kinds of problem solving in machine intelligence more broadly. Tuomas did his Ph.D. work at UMass Amherst working with our mutual colleague Victor Lesser, before coming out to CMU. He probably remembers, on his way to becoming a professor, stopping off at Microsoft Research, where Jack Breese and I worked very hard to recruit him, including a boat trip in a rainstorm.
[laughter]
Hope it’s a different jacket than you’re wearing today.
>> Tuomas Sandholm: I was thinking it’s exactly the same-looking jacket. But I swear it’s a different jacket.
[laughter]
>> Eric Horvitz: We tried to recruit Tuomas to come to MSR as a full-time researcher back then. I think he actually considered us seriously, rainstorm aside. But literally he was dressed like this in a pouring rainstorm on a boat in the middle of Lake Washington. He was very good spirited about it.
In two thousand three Tuomas won the Computers and Thought Award given out by the IJCAI folks.
He’s a Fellow of ACM, AAAI, and INFORMS. He’s published widely; I’ve often reflected with some colleagues that Tuomas is like a publishing giant in terms of the breadth and depth of his publications in several different areas of work. Beyond that he’s managed to start, run over a number of years, and sell a very interesting startup. He has helped on the boards of other kinds of startups and venture-oriented entrepreneurial projects.
He has algorithms running in the real world including I think a very interesting set of procedures that are
now being used in kidney exchanges. He came here about three or four years ago and just gave a talk
on that work which I think has such incredible societal benefit, so with that come on up Tuomas.
>> Tuomas Sandholm: Okay, thanks a lot for the very kind introduction. Thanks a lot all for coming
here. I was asked to talk about kind of the state of the art in solving poker. I wanted to generalize it a
little bit out into games beyond poker and kind of the technology capability that the community has
built over the last ten years which is a huge leap.
Ten years ago there was this kind of perception that people who work on game theory can only solve toy problems. Well, that’s no longer true. I couldn’t help myself; I put some new content in here as well.
It’s not just an overview talk. If you have heard all of my talks in the past, there’s about thirty percent new material here. This is joint work with a number of Ph.D. students and collaborators. I’ll show those names on the slides as we go through each piece.
The game formalism that we’ll be using is going to be the absolutely standard extensive-form incomplete-information game, which is this. There’s a game tree much like in chess, except that there are these white nodes which represent nature’s moves. You can model nature as a random player that moves stochastically not strategically, and has some probabilities for its actions.
Also there are information sets which represent incomplete information. For example, when the red player is in this state or this state he doesn’t know which one of those two states he’s in. He knows that he’s in one or the other but doesn’t know which. Similarly the blue player doesn’t know which one of these nodes he’s in when it’s his turn to move. Any questions on that formalism?
Okay, what’s the strategy here? Well it’s a mapping from information sets to actions. It would say for
the red player which action will you take here, which action will you take in this information set? Which
actions will you take in this information set and so forth?
The strategy can be probabilistic. It might say okay, here you go forty percent here, sixty percent here.
Clearly the strategy should depend on where you believe you are in the information set. For example the blue guy might want to do a different thing here and here. But he can’t because he doesn’t know where he is. But he can actually derive beliefs as to what’s the probability that he’s here versus here. Once you have the strategies you can just use Bayes’ rule to derive those probabilities.
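As a worked form of that Bayes’ rule step (this is the standard textbook formulation, not something from the slides): if $\pi^{\sigma}(h)$ denotes the probability that node $h$ is reached under the strategy profile $\sigma$, including nature’s moves, then the belief the player should assign to being at node $h$ within information set $I$ is

$$
P(h \mid I) \;=\; \frac{\pi^{\sigma}(h)}{\sum_{h' \in I} \pi^{\sigma}(h')}.
$$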
Okay, come on in there’s plenty of seats, don’t be shy.
[laughter]
>> Eric Horvitz: [indiscernible] don’t go together.
[laughter]
>> Tuomas Sandholm: Okay, so I’m going to be talking about domain-independent techniques. The application area is going to be poker but there’s nothing poker-specific here. Techniques for complete-information games like chess or checkers don’t apply here. You have to have completely different techniques.
Challenges here include unknown state, and uncertainty about what others and nature will do. In my opinion most importantly, interpreting the signals that the other players send: when the other players take actions it signals to me about their private information. Conversely, whenever I take actions it signals to the other guy about my private information. How do I take those into account?
Well, the beauty of Nash equilibrium, which is a solution concept from John Nash from 1950, is that it gives a solid definition of how those signals prescriptively should be taken into account. But of course the Nash equilibrium solution concept is just a definition. To operationalize it you have to have algorithms for computing Nash equilibria or approximations thereof.
Alright, so a Nash equilibrium is a strategy for each player and beliefs for each player, such that no agent benefits from using a different strategy. No agent can unilaterally benefit from deviating. This is the solution concept that we’ll be using throughout the talk. In one place I’m going to talk about a refinement as well and I’ll make that clear.
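In symbols, a standard way to write this, and the epsilon-Nash version that the abstraction bounds later refer to (textbook formulation, not copied from the slides): a strategy profile $\sigma^{*} = (\sigma_1^{*}, \dots, \sigma_n^{*})$ is a Nash equilibrium if for every player $i$ and every alternative strategy $\sigma_i$,

$$
u_i(\sigma_i^{*}, \sigma_{-i}^{*}) \;\ge\; u_i(\sigma_i, \sigma_{-i}^{*}),
$$

and it is an $\epsilon$-Nash equilibrium if no unilateral deviation gains more than $\epsilon$:

$$
u_i(\sigma_i, \sigma_{-i}^{*}) - u_i(\sigma_i^{*}, \sigma_{-i}^{*}) \;\le\; \epsilon \quad \text{for all } i \text{ and } \sigma_i.
$$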
Okay, most real-world games are actually like this. By this I mean incomplete-information games that fit into the general extensive-form game model. Negotiation in various forms, military settings, cyber security; we’ve done some work on wireless jamming games, and I have some ideas for operating system security applications as well. Medical treatment planning, this is something that I’m super excited about. We have a big new grant proposal pending on that, where you think about the world as there being a treater that treats a patient and a disease, a two-player zero-sum game.
For some nodes of the game you have some probabilistic information that’s good. For other nodes you
don’t so you have this kind of mixed stochastic versus game theory situation. Then you can use game
solving techniques and opponent exploitation techniques for making sequential treatment plans.
Biological opponents furthermore have the limitation that they can’t look ahead. They have like one
step look ahead and you can actually exploit that as well.
We’re actually, with a biology collaborator, proposing to use this for steering the adaptation of the T-cell population in an individual, so as to drive their own T-cells to battle cancer or to battle autoimmune hepatitis. If we get the funding we’ll actually be doing wet lab work, in vitro and then in vivo in mice.
Alright, enough of that, oh by the way, thank you Microsoft we got a little seed grant to get that work
started a couple of years ago.
Okay, poker, well that’s a benchmark. I don’t really view it as an application although you can view it as such. But I view it as a benchmark. It’s been a challenge problem in the AI community since about 1992.
There’s hidden information which is the other player’s cards, uncertainty about future events.
Deceptive strategies are needed in a good player. You can’t always play the good hands aggressively
and the bad hands weakly because the opponent will pick up on that, and do very bad things to you.
The game trees are very large. I’ll talk about how large.
Some of the techniques I’m going to be talking about here apply to general-sum multiplayer games. Some apply to just two-player zero-sum games. If something is just for two-player zero-sum I’ll mention that on the slides. I should mention that two-player poker is a zero-sum game. It’s actually a very popular form of poker. It’s not like we just looked at that because that’s the only thing we can handle, although that’s also true. But it’s high stakes. Some of it is on TV but most of it is actually online, so real nosebleed-level high-stakes play, two player, mano a mano. A lot of professional gamblers prefer that form.
It’s very interesting and they are super good, unbelievably good at that. How quickly they can adapt.
How sophisticated their strategies are. Some of these people are without college degrees yet they are
so smart. It’s just unbelievable. It’s kind of humbling.
Alright, so here’s our approach, which is basically now used by all the leading poker research groups. This was foreshadowed of course by other things. The idea of automated abstraction was already there. But then there was custom equilibrium-finding with manual abstraction and so forth.
But here’s the idea: you have the original game. In the case of two-player, or in other words heads-up, no-limit Texas Hold’em, the number of information sets in the game is about ten to the one hundred and sixty-one. That’s bigger than the number of atoms in the universe; even if for every atom in the universe you had a sub-universe and counted those atoms, it would still be less than this number. It’s a big game, you can’t write it down.
We run some automated abstraction algorithm that takes as input some compact representation of the game. Think about a rule sheet printed on one piece of paper, if you will. It produces an abstracted game that is hopefully similar to the original game; we’ll talk about that. Then you run a custom equilibrium-finding algorithm to find a Nash equilibrium of that abstract game. Then you use a reverse model, or reverse mapping, to map it back to an approximate Nash equilibrium of the original game. Any questions on this framework?
>>: Can you explain an abstraction at [indiscernible]?
>> Tuomas Sandholm: No, I can’t explain it a little bit more because I have like thirty slides on it. I’m
going to say a lot more about it, yeah. I’ll talk about it in detail now.
>> Tuomas Sandholm: Okay, so lossless abstraction, which almost sounds like an oxymoron. It’s more like finding isomorphisms, if you will. The observation there is that we can make games smaller by filtering the information a player receives. For example, instead of observing a specific signal exactly, a player instead observes a filtered set of signals.
For example, if the player is receiving an Ace of Hearts we’ll just say okay, he received an Ace. Sometimes some of the other detail doesn’t matter. This form of abstraction is just merging information sets. If there are two information sets that the player can distinguish between, we’re going to say that in the abstract game he can’t. If we do it losslessly we only remove redundant or irrelevant information from the model. Does that answer the question? Yeah?
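A minimal illustration of that information-filtering idea, in code: canonicalizing the suits of a preflop holding so that strategically identical hands land in the same bucket. This is only an illustrative sketch of merging information sets, not the GameShrink algorithm itself, which discovers such merges automatically.

```python
from itertools import permutations

SUITS = "cdhs"

def canonical_preflop(cards):
    """Map a two-card preflop holding to a canonical form so that
    suit-isomorphic holdings (e.g. KhKs and KhKc) land in the same bucket.
    Illustration only; GameShrink finds such lossless merges automatically."""
    best = None
    # Try every relabeling of the suits and keep the lexicographically smallest result.
    for perm in permutations(SUITS):
        relabel = dict(zip(SUITS, perm))
        key = tuple(sorted((rank, relabel[suit]) for rank, suit in cards))
        if best is None or key < best:
            best = key
    return best

# KhKs and KhKc are distinct signals but the same canonical bucket preflop.
print(canonical_preflop([("K", "h"), ("K", "s")]) == canonical_preflop([("K", "h"), ("K", "c")]))  # True
# AhKh (suited) stays distinct from AhKs (offsuit), since suitedness still matters.
print(canonical_preflop([("A", "h"), ("K", "h")]) == canonical_preflop([("A", "h"), ("K", "s")]))  # False
```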
>> Eric Horvitz: Tuomas, has it been helpful, and do we know what kind of abstractions human experts use?
>> Tuomas Sandholm: Okay, so this has been very helpful. I’ll talk about that on the next slide. We
don’t really know what the abstractions humans use.
>>: This suggests you’re not going for a flush, right, so...
>> Tuomas Sandholm: I’m not, this is an example. I’m not saying that you can ever actually pull this abstraction off in a lossless way. This is an example of what this might be. We are not programming in the abstractions. We created this algorithm called GameShrink which automatically identifies all abstractions like this and makes them. We don’t have to know in advance what kind of abstraction you could actually do and still be lossless. The algorithm will smell them out itself.
>>: But in many information sets this abstraction should work, shouldn’t it?
>> Tuomas Sandholm: Yeah, well I’m not going to argue this particular one. But I’ll argue another one
which is easy to understand. Let’s say that you get your first two cards in Texas Hold’em. Whether it’s
an Ace, let’s say you get two Kings, King of Hearts, King of Spades versus King of Hearts, King of Clubs,
same thing at that point. The suits only become important later when the flush consideration is relevant.
That’s captured, so if you bundle something it doesn’t mean that you’re going to bundle it forever.
>>: By that do you mean that you’re encoding some like [indiscernible] information that the game probably like…
>> Tuomas Sandholm: No, you’re not encoding it.
>>: Yeah.
>> Tuomas Sandholm: You’re algorithmically identifying it.
>>: You’re putting it like in your set of candidate abstractions that can be considered as a
homomorphism.
>> Tuomas Sandholm: I’m with you except for the putting it in part. It’s actually considering all of them
itself.
>>: Okay.
>> Tuomas Sandholm: You know we’re not putting in candidates. It’s figuring them all out.
>> Eric Horvitz: He’s got twenty-nine more slides coming so maybe.
>> Tuomas Sandholm: Well, twenty-nine of this. That’s the first topic of many. Okay, so with this we solved a game called Rhode Island Hold’em poker, which was an AI challenge problem introduced by Shi and Littman to be kind of bigger than the Kuhn poker that John Nash solved in the fifties, which was by the way the only game in his thesis.
The connection between poker and game theory actually goes way back. Rhode Island Hold’em is smaller than Texas Hold’em, because it was viewed that Texas Hold’em is so big that we can’t ever get real traction on it. That actually turned out to be wrong.
But anyway, three billion nodes in the game tree, without abstraction the sequence form linear program
which can be used for equilibrium finding has ninety million rows and columns. It’s unsolvable.
GameShrink which is our abstraction algorithm ran in one second and collapsed the game down to one
percent of its original size. Ninety-nine percent of the game was actually irrelevant.
After that the LP had one point two million rows and columns. At that point, with the best LP solver, which was the CPLEX barrier method, it took eight days on a small supercomputer in my lab. We could just crank out the exact answer. We solved for the exact Nash equilibrium with this. That was the largest incomplete-information game solved, by over four orders of magnitude at the time. That really showed
the power of abstraction.
>> Eric Horvitz: Tuomas, when you say solve, you’re solving an abstracted version of the game?
>> Tuomas Sandholm: Lossless, so these Nash…
>> Eric Horvitz: [Indiscernible]…
>> Tuomas Sandholm: Of the original game.
>> Eric Horvitz: You proved losslessness?
>> Tuomas Sandholm: You prove, the GameShrink algorithm proves losslessness. It finds the exact Nash
equilibrium. Not close but all the way to machine precision.
>>: You said the word “the”; of course in general we know there can be multiple…
>> Tuomas Sandholm: Yeah, I meant to say [indiscernible].
[laughter]
That’s right, there could be multiple Nash equilibria, yeah, and this found one.
>>: Of course some would be better than others to have a way to…
>> Tuomas Sandholm: Oh, no, no they couldn’t. In two-player zero-sum games…
>>: [indiscernible]…
>> Tuomas Sandholm: You have a nice swapping property: if you play any one of the Nash equilibrium strategies and the opponent plays any one of his, they all pair up equally well against each other. You’ll get the same values.
>>: Zero-sum?
>> Tuomas Sandholm: Yeah. Okay, so sometimes, even though lossless abstraction gets about ninety-nine percent of the game in poker to go away, if you’re left with one percent of ten to the one sixty-one, that’s still a big number.
[laughter]
In Texas Hold’em you have to do lossy abstraction to get anywhere. Now I’m going to talk about, first, the leading practical approaches and the history of that, then some new theory about how to tie abstraction to the Nash equilibrium quality in the original game.
I’m not going to, in the interest of time, talk about everything. But from two thousand seven to two thousand thirteen the main ideas for practical abstraction were the following. One was integer programming to decide how many children each node at each level gets to have. You don’t want to have a uniform abstraction. Your computing resources for Nash equilibrium finding tell you how large the abstraction can be. You want to use that size smartly where it matters.
Secondly, potential-aware, I’ll talk about that. Then imperfect recall; imperfect recall is the idea that you may want to forget something that you knew in the past, in order to make your abstraction smaller and buy yourself more space to refine the present more finely. That used to be kind of a weird notion in game theory. There were just these obscure things, but now it’s actually a very practical tool in solving games.
Now I’m going to jump to the currently best abstraction algorithm, which combines the ideas of potential-aware and imperfect-recall abstraction, and earth mover’s distance. It obviates the need for integer programming.
Alright, and I’m going to do kind of a progression through how that literature went. First, Expected Hand Strength is the goodness of your hand assuming that nature and the opponent roll out cards uniformly from then on. Early poker abstractions used that as the measure for clustering hands, clustering the information that the players get. But that doesn’t really work that well. Here’s an example where you have Expected Hand Strength being basically equal but the hands are very different in the middle of the game.
Let’s say we started with a pair of fours, or we started with ten-Jack suited. Very different hands; both have Expected Hand Strength point fifty-seven. Your algorithm might bucket those together, but they should be played completely differently.
Why are they different? Well, this hand is pretty much usually mediocre, pretty good, not great. If you get a triple, yeah that’s great. But it has very little mass here and it has a lot of mass here. This one on the other hand has quite a bit of mass here and no mass there. Basically this hand is going to end up really good or terrible.
Okay, so those should be played differently. That can be captured in distribution-aware abstraction. You look at the full distribution of hand strength, basically the histogram on the previous slide. Then you use, for example, earth mover’s distance as a distance metric between the histograms. It turns out earth mover’s distance is much better than L1 or L2.
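A minimal sketch of that distribution-aware clustering idea. This is not the actual abstraction code; the bucket count, the random histograms, and the greedy medoid loop are just illustrative assumptions, but it shows hand-strength histograms being grouped under earth mover’s distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D earth mover's distance

def emd(h1, h2, support):
    """Earth mover's distance between two hand-strength histograms on a common support."""
    return wasserstein_distance(support, support, u_weights=h1, v_weights=h2)

def cluster_hands(histograms, k, iters=20, seed=0):
    """Toy k-medoid-style clustering of hand-strength histograms under EMD."""
    rng = np.random.default_rng(seed)
    support = np.linspace(0.0, 1.0, histograms.shape[1])  # hand-strength bins
    medoids = histograms[rng.choice(len(histograms), size=k, replace=False)]
    for _ in range(iters):
        # Assign each hand to its nearest medoid under EMD.
        labels = np.array([
            np.argmin([emd(h, m, support) for m in medoids]) for h in histograms
        ])
        # Recompute each medoid as the member minimizing total EMD to its cluster.
        for c in range(k):
            members = histograms[labels == c]
            if len(members) == 0:
                continue
            costs = [sum(emd(m, o, support) for o in members) for m in members]
            medoids[c] = members[int(np.argmin(costs))]
    return labels

# Example: 100 random hand-strength histograms over 10 bins, grouped into 5 buckets.
hists = np.random.default_rng(1).dirichlet(np.ones(10), size=100)
print(cluster_hands(hists, k=5)[:10])
```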
Come on in. Don’t be shy. Do you want to open the door I guess they’re being shy. Yeah?
>>: Sorry, [indiscernible] going back to the last slide.
>> Tuomas Sandholm: Yeah.
>>: Yes, thank you. You said that this expected hand strength, I interpret it as some sort of value function, is that right? Or…
>> Tuomas Sandholm: Yeah, I guess you could call it a value function, yeah.
>>: Value…
>> Tuomas Sandholm: But it’s not really a value function because it’s not based on the strategicness or
what the other player, what the player and others are going to do from then on. It’s assuming a uniform
roll out.
>>: Right, yes, but what if they, imagine that your opponents are going to roll out in a strategic and [indiscernible] real way.
>> Tuomas Sandholm: Right.
>>: Would the clustering criterion according to that expected hand strength be perfect or at least [indiscernible]?
>> Tuomas Sandholm: No, I’m saying that even if they don’t do that it’s going to be imperfect. Now, I’m
doing this kind of transformation into better and better things.
Okay, so before the twenty fourteen paper, the prior best approach used this distribution-aware abstraction with imperfect recall. But that doesn’t really take the potential into account.
Potential is something you read about in the poker literature. But nobody’s really been able to define it.
We actually define it operationally using kind of a recursive formulation.
Let me show you an example of this first. Let’s say we have two situations. This is the game with private signal x one. This is the game with private signal x two. Here, with probability one, you get no information in the next step and the resolution comes in the second step. Here I’m showing the resolution in the first step and you get no more information in the second step.
They have the same distribution over the last round but very different potential. What we do then is, instead of thinking of histograms over the last round, we think about histograms over transitions to the next round, where the base of the histogram is the states of the next round, which we have already abstracted by moving bottom-up in the game tree to do the abstractions. Did this make sense? You’re not so sure.
>> Eric Horvitz: Say it one more time.
>> Tuomas Sandholm: Okay, the algorithm starts from the bottom of the game, from the leaves. There’s no potential left there. We can use your favorite metric like expected hand strength to cluster those. Now you have clusters there. At the previous level you can now look at the probability distribution of transitioning to those next-level clusters. Of course the earth mover’s distance now is in a multi-dimensional space. We cluster based on that and that’s how we move up the game. Again, we use imperfect recall throughout.
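Written out, that bottom-up construction looks roughly like this (generic notation, not the exact notation from the paper): at the last round, cluster states with any ground metric; at an earlier round $r$, represent each state $x$ by its transition histogram over the already-computed round-$(r+1)$ clusters $C_1, \dots, C_m$, and cluster round $r$ by earth mover’s distance between those histograms,

$$
d_r(x, y) \;=\; \mathrm{EMD}\bigl(\,(P(x \to C_1), \dots, P(x \to C_m)),\; (P(y \to C_1), \dots, P(y \to C_m))\,\bigr),
$$

where the ground distance between clusters $C_i$ and $C_j$ is the round-$(r+1)$ distance already computed between them.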
>> Eric Horvitz: In the earlier versions it was just the final state only?
>> Tuomas Sandholm: The earlier ones, yeah, these ones kind of look at the transition from a current state to the final-state histogram. Now we’re looking at the transition level by level, step by step in the game. But that also means that you don’t have this type of x-axis. You have this multi-dimensional thing where you have to do the earth mover’s distance.
We do that and we developed a custom algorithm to approximate it, because the normal earth mover algorithms don’t scale, and this led to the best abstractions evaluated experimentally.
>> Eric Horvitz: One more thing, stop me for a second, you’d be losing information with that abstraction step, that particular one?
>> Tuomas Sandholm: Yeah, all of the abstractions I’m now talking about are lossy. You have to lose information.
>> Eric Horvitz: Right.
>> Tuomas Sandholm: Otherwise the game ends up being too big.
>> Eric Horvitz: As you go on you talk about lossy abstraction.
>> Tuomas Sandholm: Yeah, the first one…
>> Eric Horvitz: I can imagine you’ll come back to say there’s no loss later.
>> Tuomas Sandholm: I’m going to. So the thread on this part of the talk is going to be: we started with lossless, which gets rid of ninety-nine percent. That’s not enough; you have to abstract more. Here’s the practical stuff. Then we’re going to talk about theory that actually gives you bounds, so that even if you abstract in this lossy way you’re still bounded with respect to the original game.
>> Eric Horvitz: Yeah.
>> Tuomas Sandholm: Yeah and that’s a thread here.
>> Eric Horvitz: Then…
>> Tuomas Sandholm: I’m not there yet.
>> Eric Horvitz: Michael Bowling’s group also did similar kinds of bounding.
>> Tuomas Sandholm: Yeah, actually I’m talking about not only our work but throughout this. Like here
this was Michael Bowling’s group. This was Alberta group before Michael Bowling joined, Michael
Bowling’s work, Michael Bowling’s works, yeah.
>> Eric Horvitz: Yeah.
>> Tuomas Sandholm: Yeah, so I’m trying to do a little bit of an overview of the field, not just a presentation of our work. Okay, so Tartanian seven was our program at one of the most recent annual computer poker competitions in the no-limit heads-up category. It uses an abstraction similar to what I was just talking about, except computed in a distributed way. We can run on clusters now; we were running on a cache-coherent non-uniform memory access supercomputer for that competition.
The abstraction just uses your favorite abstraction algorithm, like the one from the previous slide, at the top of the game. You can define the top any way you want. We defined it to be the flop round in poker, actually sort of the pre-flop. Then the rest of the game is split into equal-sized disjoint pieces based on public signals. You can put different computers to work on the different pieces. It’s important that you do it based on the public signals because that guarantees that the information sets don’t cut across computers.
Alright, and how do you do that? Well, you have a base abstraction generated with the algorithm on the previous slide and you can look at transitions into that to have a well-defined algorithm. Then for equilibrium finding we used External Sampling Monte Carlo Counterfactual Regret Minimization, or a variant of that, which is from the University of Alberta. It starts at the top of the tree and then, when it gets to the rest of the game, it samples a flop from each public cluster. Then you continue the iteration on a separate blade for each public cluster. Then you return the values.
There are some details as to how you actually make it work in this distributed context, which we could talk about, but you have to handle them, otherwise it won’t converge. Then you can do multiple samples into one of those continuations if you’re worried about the communication overhead; it becomes minor.
Okay, now to Eric’s bound.
>>: Just an idea is how much time does this…
>> Tuomas Sandholm: Oh, as much as you can give it. As many cores as you can give it. As much time
as you can give it.
>>: But what are you…
>> Tuomas Sandholm: We were running, this spring we were running. Oh, sorry this is the previous
spring, about a thousand cores for about three months. We’d like to take it to an order or two more
cores in the future. Yeah?
>>: Maybe a technical detail, but are you maintaining your belief state at each node in the tree? Or do you have some particle-like, sample-based representation perhaps?
>> Tuomas Sandholm: The beliefs are maintained explicitly. For each information set in the abstraction
for each action CFR maintains the probability and one number for the regret. Yeah?
>>: I’m a little confused. You run this thing for three months. Then you have a representation that you
can play any game with no further computation.
>> Tuomas Sandholm: No, no.
>>: This is some particular instance of one game.
>> Tuomas Sandholm: Good question, so this is, if we go back to that framework slide. It starts by
getting the description of a particular game. Then the abstraction algorithm is run on that description,
spits out the abstraction for that game. Then the equilibrium is computed for that game. The algorithm
is general but the run is specific to that game, to that input.
>>: Especially in poker.
>> Tuomas Sandholm: Not even all poker, Heads-up no-limit Texas Hold’em poker. Yeah?
>>: But, so you have a solution for that particular poker game that you can now take to a tournament
and run in real time.
>> Tuomas Sandholm: Yes, right.
>>: How big is that representation?
>> Tuomas Sandholm: I don’t remember how big it was here. For the next program that we developed
this spring which is Claudico it was one point four terabytes.
>>: That’s a big table or…
>> Tuomas Sandholm: Big table, one point four terabyte table of action probabilities.
>> Eric Horvitz: Does it ever make sense as the game evolves to try to get ahead of it with real time
computation?
>> Tuomas Sandholm: Yes, I was going to get to that. Yeah, you’re so smart you know you should tape
your mouth because you’re jumping me ahead.
[laughter]
Okay, good, so lossy game abstraction with bounds. This is actually tricky due to a known fact in games which is called abstraction pathology or abstraction non-monotonicity, again from the University of Alberta here. Basically, in single-agent settings, be it in planning or MDPs or what have you, if you make an abstraction that’s finer grained your solution quality can’t go down.
In games that’s not true. You can come up with a finer-grained abstraction, even a strict refinement of your original abstraction, and your solution quality can actually go down. For a while that kind of threw the whole framework into question. If that’s true, why are we coming up with these finer and finer abstractions? Maybe we’re actually taking steps backward.
But then we started looking at Lossy game abstraction with bounds first for stochastic games. Then for
general extensive-form games and I’ll show you a few results on that in the next couple of slides. The
abstraction is performed in the game tree not in what’s called the game of ordered signals and the
signal representation. These are now general purpose unlike the GameShrink algorithm which was for
that game of ordered signals class of games.
It’s for both action and state abstraction. So far we’ve talked about state abstraction, where you bucket the information that you get. But you can also do action abstraction, which is really important in games with large or continuous action spaces. You choose some of the actions as the action prototypes and pretend that the rest of the actions don’t exist.
We’ll talk about that in detail. Here’s a detail: more general abstraction operations are enabled here by allowing not only many-to-one mappings of the states, but also the other way around, one-to-many mappings, and you can get some leverage from that.
Okay, so here’s the main theorem. This is joint work with my student Christian Kroer. For any Nash
equilibrium in the abstract game any undivided lifted strategy is an epsilon-Nash equilibrium in the
original game. Where epsilon is defined like this. What is an undivided lifted strategy? Well, lifted
strategy just is something that works in the original game in the obvious way you’d think about.
Undivided is a constraint on how we reverse map. It’s not a restriction on games. It’s just a constraint
on how we reverse map the answer back to the original game.
Now what is this? This is kind of where the action is. It’s looking at measurable things in the difference
between the abstraction and the real game and then tying it into the epsilon in the epsilon-Nash
equilibrium which is a game theoretic notion.
Okay, so what is this two times epsilon R? This is a utility error and it’s defined recursively. At the leaves it’s just the error between the model, the abstraction, and the actual game. At interior nodes, if it’s a player node it’s a max over what the players can do; if it’s a nature node it’s just the probability-weighted sum over what nature can do. You maximize over agents and you maximize over information sets and that gives you this value.
Then there’s a maximum over players of the sum, over the heights where it’s player i’s turn to move, of epsilon j zero times this W. That is the nature distribution error at that height: how wrong is your nature model at worst compared to the real game? You add to that the sum, over heights where it’s nature’s turn to move, of two times this epsilon times this W, where epsilon j zero is the nature distribution error at height j, and W is the maximum utility of a player in the abstract game.
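Assembled from that spoken description (so the notation here is a reconstruction, not copied from the slide), the bound has roughly the form

$$
\epsilon \;=\; 2\,\epsilon^{R} \;+\; \max_{i}\Bigl[\; \sum_{j \in H_i} \epsilon_j^{0}\, W \;+\; \sum_{j \in H_0} 2\,\epsilon_j^{0}\, W \Bigr],
$$

where $\epsilon^{R}$ is the recursively defined utility error, $H_i$ is the set of heights at which it is player $i$’s turn to move, $H_0$ the heights at which nature moves, $\epsilon_j^{0}$ the nature distribution error at height $j$, and $W$ the maximum utility of a player in the abstract game.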
That’s, I didn’t expect that we’d walk through the proof or anything. But this gives you a very concrete
thing where you can measure everything on the right hand side by just looking at the abstraction and
looking at the game. Then it ties it to saying okay if you abstract it this way you’d solve the abstraction.
Your error in the original game in Nash equilibrium is at most epsilon. Yeah?
>>: Yeah, so you mentioned the pathologies, that when we get finer and finer abstractions the actual solution can actually get worse, right?
>> Tuomas Sandholm: Yeah, for…
>>: But I guess the hope is that with these transition and reward error bounds you would hope that the
upper bound would behave in a better way. But actually we know that they don’t, right, because when
you…
>> Tuomas Sandholm: Good question, not quite, not quite. But somebody’s being eagle eyed here,
that’s good. How is it possible that I’m giving you a theorem? That says that as I’m getting closer in the
abstraction, finer grain in the abstraction to the original game. My epsilon is going down and I just told
you its well known that there are abstraction pathologies where it actually goes up. What gives?
>>: [indiscernible] bound proof.
>> Tuomas Sandholm: Exactly, this is a bound and it leaves some room for non-monotonicity as we’re approaching the real game with the abstraction. Alright, so the utility error side of the bound is tight. The nature distribution error bound is tight up to a factor of six.
Hardness results: well, determining whether two subtrees are what’s called extensive-form game-tree isomorphic is actually graph isomorphism complete. This is something that you need to check even for lossless abstraction. It is not obvious, because the graphs have special structure, so you might think that this might be easier than graph isomorphism, but it’s not.
Beyond that, computing the minimum-size abstraction given a bound is NP-complete, and the other way around as well: minimizing the bound given a maximum size of the abstraction is NP-complete.
Now you might ask wait a second, as a pre-processor you’re solving some NP-complete problem. Then
you’re doing the equilibrium finding which in two player zero-sum is actually polynomial time.
How does that make sense? But these are of course worst case results. In practice it’s not only helpful
it’s necessary to do the abstraction. You don’t have to do it optimally.
Okay, this is showing an impossibility of level-by-level abstraction that shows that you have to actually
consider the whole tree. Or at least you can’t focus your attention level-by-level as all of the prior
abstraction algorithms have done if you want to have bounds. Even if you want to have a lossless
abstraction you can’t do that. But in the interest of time let me not walk through that example.
Okay, extension to imperfect recall. That theorem was for perfect recall. We have extensions to imperfect recall as well. There’s a paper from Alberta by Lanctot et al. Here we get exponentially stronger bounds than that. We get bounds for a broader class of games, where abstraction can actually introduce nature error as well, which is something that they precluded from consideration.
Furthermore, our theorem is for any, sorry, any equilibrium-finding algorithm. Theirs is just for the counterfactual regret minimization algorithm.
Okay, so, now as I thought about this abstraction theorem. It actually brought up an interesting other
connection which is if you think about modeling. Models are never the real world. Modeling is a form
of abstraction. Typically in game theory when we take the model and solve it we actually take the
answer as if that’s somehow applicable to the real world.
But we had no connection that says that, how does that answer actually relate to the real game? Now
these are the first results that actually tie that gap as well. If you can measure the gap between your
model and the reality, or at least bound it.
Okay, action abstraction typically has been done manually. It’s still often done manually; there’s been some automation. Again, this is from a different group, from the University of Alberta. For stochastic games, the theory that I just talked about applies.
Then with my Ph.D. student Noam Brown we developed the first algorithm for parameter optimization
for one player and two player zero-sum games. Where you can actually have one player control some
parameter like for example the bet size. Then as you change the bet size you don’t have to restart the
whole equilibrium finding. You can warm start with some really clever stuff that Noam did here. That
allows you to move bet sizes. You can actually move multiple bet sizes at once. As long as the payouts
are convex in the bet size vector this is actually guaranteed to converge.
But I’m going to show you something cooler later today. I’m going to skip that part. Alright, so that
gives us, that’s all I was going to say about automated abstraction for now.
Next is custom equilibrium-finding. How do you solve the abstract game? Now I’m going to really look
at two-player zero-sum only. Okay, this is kind of giving a perspective on the field. On the X axis we
have year, and on the y axis we have, on a log scale, the number of nodes in the game tree that have actually been solved or near-optimally solved.
You can see that when the annual computer poker competition was announced around here it really spurred a lot of interest in this. People started building on each other’s work. We saw this super-exponential jump in the technology capability. I’ll talk about the best algorithms in detail in a little bit.
Now, especially when you’re in the CFR family, you want to measure complexity not in the number of nodes, but in the number of information sets. Here’s a newer graph that I made, again with years on the x axis and the number of information sets on a log scale on the y axis. You can see that this exponential growth has continued to this day. You can actually solve games now with about, what is that?
[laughter]
Six, twelve, thirteen, ten to the thirteen, a little bit more than ten to the thirteen nodes in the game
tree. I don’t have the number for this one. But for this one the number of nodes was already five times
ten to the fifteen.
>> Eric Horvitz: That’s interesting. With people I often get questions about advances in AI: how much of our advances are due to the power of machines getting better and memory getting cheaper? I often say, well, it really comes down to innovation in the problem-solving space. It busts out of the kind of constraints that we see from the power of the computation alone. You can imagine plotting Moore’s Law against this graph; it would level out about here. It can go flat while the AI innovations create…
>> Tuomas Sandholm: That’s right. This is almost all algorithmic innovation, right, or AI innovation. I
like that term.
>> Eric Horvitz: No charge for that.
>> Tuomas Sandholm: Yeah.
>>: But, okay, but these dramatic improvements. You’re exploiting the structure of the problem, right. If I come up with a new game like Texas Hold’em and I’m going to add a…
>> Tuomas Sandholm: I wouldn’t say that. I wouldn’t say that.
>>: …an actual integer on each card, I can say that my search tree explodes dramatically because I have more information. But because it’s independent of the game I can just decide to ignore it.
>> Tuomas Sandholm: This is not the size of the original game. This is the size of the abstraction. Now if you go back to this picture, this was ten to the one sixty, oh, ten to the one sixty-one. Here we’re measuring how much comes into here: what is the size of the abstract game that gets fed into the equilibrium-finding algorithm. That’s what’s on the y axis now.
>>: I see, okay.
>> Eric Horvitz: I wonder if you could actually [indiscernible]; a cool graph would also be the bound, or not so much a bound, on the optimality on this graph here.
>> Tuomas Sandholm: Bound on the optimality in the real game?
>> Eric Horvitz: Right, the function of the size and study…
>> Tuomas Sandholm: Yeah, for limit Texas Hold’em there’s been some of that. I actually know the answer for this one. It’s one milli-big-blind per hand; it’s so, so close to optimal that a human playing for a lifetime at human speed could not tell with statistical significance whether they’re winning or losing, even if they’re playing optimally.
>> Eric Horvitz: What was the measure you used, what’s the word again?
>> Tuomas Sandholm: Milli-big-blind per hand, one thousandth of a big blind. For this one it’s not; this is for no-limit, and no bounds are known for no-limit because you cannot even run the best-response computation. You can’t even check ex post how close to optimal you are in no-limit. It’s a whole different beast.
Okay, best equilibrium-finding algorithms counterfactual regret from Alberta…
>> Eric Horvitz: Sorry, Tuomas can we go back to that last slide?
>> Tuomas Sandholm: Yeah.
>> Eric Horvitz: But if you used the same algorithms and went back down in size of the abstraction, it would be interesting to just understand, as a function of the constraints on the richness of [indiscernible], what the error is. Current best algorithms, but just reduce the constraints, so maybe vary the size and richness of the abstraction [inaudible]?
>> Tuomas Sandholm: Yeah, so how this went is that the practice went way ahead of the theory, as usual, many years before the theory. The theory is relatively new and we haven’t actually tried to tie the theory to any one of these things, the abstraction algorithms that were used before you got to these numbers. That hasn’t been done.
But what’s been done in limit Texas Hold’em is you can actually compute a best response ex post and measure how exploitable you are in the original game. I know the answer for that one.
>> Eric Horvitz: Okay.
>> Tuomas Sandholm: Okay, the best algorithms were counterfactual regret from Alberta and Scalable EGT from my group, completely different algorithms. It’s amazing that they are completely different algorithms and they have selective superiority, which is kind of weird. The first is based on no-regret learning. The second is based on Nesterov’s Excessive Gap Technique.
The most powerful innovations here: well, number one is that each information set has its own separate no-regret learner. If you think about doing no-regret learning in the whole strategy space you’re totally dead in the water; it’s way too big. But here you can actually isolate it to each information set separately, which is a brilliant innovation. Then sampling: you can sample the tree on each iteration so you don’t have to walk through the whole tree.
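A minimal sketch of that per-information-set no-regret learner, the regret-matching update used inside CFR. This is illustrative only; a full CFR implementation also needs the tree walk that computes the counterfactual action values fed in here, and the made-up values in the usage example are just placeholders.

```python
import numpy as np

class InfoSetLearner:
    """One regret-matching learner per information set, as in CFR.
    Keeps one cumulative regret and one cumulative strategy weight per action."""

    def __init__(self, num_actions):
        self.cum_regret = np.zeros(num_actions)
        self.cum_strategy = np.zeros(num_actions)

    def current_strategy(self):
        # Regret matching: play in proportion to positive cumulative regret.
        positive = np.maximum(self.cum_regret, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.full(len(self.cum_regret), 1.0 / len(self.cum_regret))

    def update(self, action_values, reach_weight=1.0):
        """action_values[a] = counterfactual value of taking action a here
        (computed by the tree walk); update regrets and the average strategy."""
        strategy = self.current_strategy()
        node_value = float(np.dot(strategy, action_values))
        self.cum_regret += action_values - node_value
        self.cum_strategy += reach_weight * strategy

    def average_strategy(self):
        # The average strategy over iterations is what converges in CFR.
        total = self.cum_strategy.sum()
        if total > 0:
            return self.cum_strategy / total
        return self.current_strategy()

# Tiny usage example with made-up counterfactual values for a 3-action information set.
learner = InfoSetLearner(3)
for values in ([1.0, 0.0, -1.0], [0.5, 0.2, 0.1], [0.0, 1.0, 0.0]):
    learner.update(np.array(values))
print(learner.average_strategy())
```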
Here the most powerful innovations were, first of all, smoothing functions for the Excessive Gap Technique that satisfy the conditions of that technique for sequential games, which enabled this idea to be used for sequential games in the first place. More aggressive smoothing helped by an order of magnitude, and so did balanced smoothing between the primal and the dual, also about an order of magnitude there. Then you can get memory scalability by taking the memory down to the square root of the original, if the actions don’t depend on chance, which is the case in poker, so fortunately this can take your memory to the square root of the original.
This iteration complexity is one over epsilon and each iteration is slow. Here both of these parallelize.
Here the iteration complexity is much worse one over epsilon squared but each iteration is fast. These
are totally different. You think about this doing billions and billions of iterations to solve, each iteration
running in less than a second. Here each iteration is like a day and maybe you do two hundred
iterations, so totally different in that sense as well.
Selective superiority: one can be faster than the other depending on the game and the abstraction. One thing that’s nice about this one is that you can run it on imperfect-recall abstractions. It’s not guaranteed to converge to an equilibrium, but at least you can run it. For a while this one couldn’t even be run on that, so we kind of abandoned it for a while.
Also with some condition numbers on the matrix you can get log one over epsilon which is the best
possible. That’s the same as interior point techniques. But interior point techniques aren’t scalable for
memory.
Alright, one slide on a new paper here: a new prox function for first-order methods such as the Excessive Gap Technique and Mirror Prox. It gives the first explicit convergence-rate bound for general zero-sum extensive-form games without requiring the condition number, the log one over epsilon. Basically you’re getting this complexity but much faster, and for a much more general setting than our original paper.
>> Eric Horvitz: Tuomas, while the slide is up I was going to ask you, in a few seconds on this, how this kind of semi-parallel pursuit at Alberta was influencing your team in terms of learnings or directions, or in contrast to what the Alberta team was doing over the years?
>> Tuomas Sandholm: Okay, great, over the years. Abstraction we’ve certainly been building on each
other’s work a lot. On the equilibrium finding we went into exact opposite directions. This is coming
from kind of a Machine Learning, no-regret learning tradition. We came from the optimization tradition.
These had very little interplay. Except that when imperfect recall became the abstraction of choice we
had abandoned this because it didn’t do imperfect recall. This one at least although it doesn’t have any
guarantees you can run it. You can press the button and see what happens. That actually ended up
being the best approach for awhile.
Then because of that we actually went and said okay can we improve this? For the last couple of years
we’ve mostly been coming up with better and better things here. Now, we’ve been building on that.
Now the next slide I’m going to show is actually an improvement on this thread. Now we’re actually
pursuing parallel threads in my group, this thread and that thread.
Here there’s a lot of interplay and building on each other’s work with the Alberta group and other groups. It’s not just us and Alberta, although they’re the leading groups. But Eric Jackson from California, Team [indiscernible] from California, Oscar [indiscernible] from Finland. There’s a Czech group that’s very strong, a French group that’s very strong. There’s been a lot of building on each other’s work.
Yeah, okay, so a new prox function, better prox function for these optimization based techniques. That
introduces gradient sampling schemes. In particular it enables the first stochastic first-order approach
with convergence guarantees for extensive-form games. Now you can start to do sampling. We did
some game sampling before but now you can actually do gradient sampling in this optimization
framework as well which was one of the big reasons we moved away into the kind of no-regret space.
It introduces the first first-order method for imperfect-recall abstractions, which was the second reason we moved away from that. Now, I would say that both threads are alive again.
Okay, this is kind of a weird post-processing deal; actually let me skip that. Endgame solving, coming back to your real-time question. So far I talked about the game being solved up front: a huge strategy vector and then just lookup at run time. But you can actually do endgame solving. This has been very powerful in complete-information games like chess. In fact, for solving checkers that was the whole thing. There was a big dynamic program that treated the whole game as the endgame. That allowed Jonathan Schaeffer to solve checkers.
In imperfect-information games endgame solving is totally different due to the information sets.
Benefits: first of all, finer-grained information and action abstraction, because you’re in a specific context and you can afford to do finer-grained abstraction. You can dynamically select the coarseness of your action abstraction based on the size of your endgame. This is actually something that threw the humans off really badly in the man/machine Brains versus AI match that I organized this spring. Humans usually think a lot when there’s a lot of money in the pot in no-limit Texas Hold’em.
This does the opposite. If there’s little money in the pot the endgame’s actually bigger because there’s
more raises that are possible still, so it actually has to think more. The smaller the pot was the more the
computer thought. That really rubbed the humans the wrong way.
[laughter]
Anyway, new information abstraction algorithms take into account the relevant distributions, the players’ type distributions entering the endgame. By the time we get to the endgame we can use Bayes’ rule on what’s been played so far to get those distributions. We can now decide where we need more resolution in the abstraction versus not.
We can compute exact equilibrium strategies rather than approximate ones, because now we have a much smaller game. We can use an LP instead of these iterative methods. We can compute equilibrium refinements and solve the off-tree problem.
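As a minimal sketch of what “use an LP” means, here is the generic linear program for a small zero-sum game, solved with scipy. This is not the actual endgame solver, which works on the sequence form of the endgame; it just shows the exact-solution idea on a matrix game.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(payoff):
    """Solve a two-player zero-sum matrix game for the row player via an LP.
    payoff[i, j] is the row player's payoff for row i against column j.
    Variables are the row mixed strategy x and the game value v; we maximize v
    subject to x achieving at least v against every column."""
    m, n = payoff.shape
    # Objective: minimize -v (i.e. maximize v). Variable vector is [x_1..x_m, v].
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i payoff[i, j] * x_i <= 0
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Rock-paper-scissors as a sanity check: the equilibrium is uniform with value 0.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
strategy, value = solve_zero_sum(rps)
print(np.round(strategy, 3), round(value, 3))
```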
If we’ve action abstracted before, we have been rounding what the opponent has done back into the abstraction. Our model, our notion of where we are in the game in terms of pot size, might be off. Now we can start again with the real pot size and fix that problem right on the spot.
>> Eric Horvitz: All the solutions are made to fit within the tractability of what’s considered a normal response time?
>> Tuomas Sandholm: Exactly, exactly, that’s exactly right. You get control of the coarseness of the
abstraction to accomplish exactly that. Alright, so…
>> Eric Horvitz: But I guess comfortably or is it the idea that if you had a little bit more time you could
go, get a sense for the actual profile you’re on…
>> Tuomas Sandholm: Yeah, I don’t have a graph on that but you can develop that sense very easily as to what the tradeoff is. One of the tradeoffs we made in the man/machine match is that we said look, we don’t want to take more than twenty seconds on the last betting round, because that’s quicker than what humans on average do and quicker than the pros did on average. But it’s still kind of annoying if you think about it: each pro gave two days of their life waiting for the computer to respond over two weeks.
You know, that twenty seconds was the right number. If we had had, say, two minutes, we could have had a much more refined endgame abstraction and played better regarding card removal, things like that, that the pros actually picked up on. But it’s the same algorithm, just giving you the finer-grained abstraction.
>>: I wanted to get back to the question earlier and ask if it relates to always having, always starting
every hand with the same pot, with the same chip stacks?
>> Tuomas Sandholm: Yes, so if you wanted to, the game with different chip stacks is a different game,
we’d go through the whole loop again.
>>: Right.
>> Tuomas Sandholm: That’s right.
>>: In that that would be an example of an issue that we raised earlier because you would in essence
have to recompute the entire, you’d have to compute many different games…
>> Tuomas Sandholm: Yeah, and people have done that. If you want to play a poker game where you have different starting stacks in different hands, people have done that. We’ve always focused on the annual computer poker competition style where it’s always the same chip stacks that you start out with. That’s how we played with the humans as well. Yeah, let me leave it at that.
>>: Yeah.
>> Tuomas Sandholm: Okay, it’s not perfect though. Think of rock, paper, scissors. If we are in this yellow endgame, where the first guy has moved one third, one third, one third, here we could conclude that, because he is moving randomly one third, one third, one third, we might as well always move rock. The endgame solver could conclude that always playing rock is just fine. Of course that’s a disaster.
It does have its perils. We have some theory that ties the size of the endgame to the rest of the game. But that’s largely an open research question: how do you tie the endgame solving into the game so that it’s not very exploitable? Alberta has also done some work on that and they have some guarantees. But in practice our method seems to be doing better than theirs. It’s pretty much still open how that can be done in at least a semi-safe way while still practically playing well.
Okay, experimentally it helps. We did a test in twenty twelve with no-limit against all of the top competitors. Tartanian five was our bot and adding the endgame solver improved performance against all of the competitors, including itself. Then you can also look at removing weakly dominated strategies first and looking for an equilibrium on the remaining set, which is a refinement of Nash equilibrium, and that helped even more. This you can solve with an LP. That you can solve with two LPs and you’re done.
Okay, here’s another idea that Sam Ganzfried, my student, came up with. The idea was: what if we knew some domain knowledge about the game that we’re solving? Now we’re, by the way, in limit Texas Hold’em, not no-limit. Maybe we’re right, maybe we’re wrong, but we have some gut feeling that there are some regions in the endgame.
For example, as we go from stronger hands to weaker hands, in this region we should bet-fold, here we should check-call, and so forth. This is what the opponent should do. We have some gut feel that that’s how it’s going to be. We’ve seen humans play like that, but we’re making a guess.
But now we can write an integer program that will actually find an equilibrium that matches this qualitative structure, if one exists. The idea is that basically the integer program is trying to place these thresholds in the right places. The leverage that allows us to do the integer program is that at a threshold I have to be indifferent between doing this and doing that. That’s the short of it.
You can actually make multiple guesses and you can test each one of them. If you’re right you’re getting a Nash equilibrium. That really speeds up endgame solving if you want to use this idea. It also sometimes allows us to prove existence of equilibrium in games where it hadn’t been proven or couldn’t be proven before, and to solve games for which no algorithms existed, including multiplayer games.
Of course you have to have a guess. The good thing is you can be wrong about it. If you’re wrong about it the integer program is going to tell you, you’re an idiot.
>>: When you say [indiscernible] equilibrium what is the setting…
>> Tuomas Sandholm: Let me in the interest of time take that offline. It’s kind of a long story. That’s
not poker. There are weird kind of continuous games that don’t fit Nash’s original theorem.
Okay, so we talked about that, custom equilibrium-finding. Now the reverse model. The reverse model is this problem: let’s say that we have a continuous action space or a large action space and we have the red action prototypes. The opponent could play outside of those actions; what do you do? Of course you yourself decide to play onto those, so you never get yourself off track, but the opponent you can’t control.
Let’s say f of x is the probability we map down to A and one minus f is the probability we map up to B. We came up with this axiomatic approach: what would be the desiderata for f? Of course you might want to have more, but these seem to be at least what you want: if you’re at A map to A, if you’re at B map to B.
Monotonicity as you get closer to B the probability of going to B shouldn’t go down. Scale invariance
whether they’re playing for a dollar or a hundred dollars the reverse mapping should be the same.
Small change in x shouldn’t lead to a large change in f. Small change in A or B shouldn’t lead to a large
change in f.
Here’s the pseudo-harmonic mapping that we put together; it actually satisfies these desiderata. It’s derived from the Nash equilibrium of simplified no-limit poker. It’s much less exploitable than prior mappings in simplified domains where we can evaluate that. It performs well in practice in no-limit Texas Hold’em. In particular it significantly outperforms the randomized geometric mapping, which uses the geometric average and randomizes according to that.
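A small sketch of such a mapping, following the published pseudo-harmonic form as I recall it, so treat the exact formula as an assumption: f(x) is the probability of mapping an observed bet x down to the lower prototype A, with B the upper prototype.

```python
def pseudo_harmonic(x, a, b):
    """Probability of mapping an off-abstraction bet x down to the lower
    prototype a (the remaining probability goes to the upper prototype b).
    Form follows the published pseudo-harmonic mapping; a <= x <= b assumed."""
    return ((b - x) * (1.0 + a)) / ((b - a) * (1.0 + x))

# Boundary checks from the desiderata: map A to A and B to B with certainty.
print(pseudo_harmonic(0.5, 0.5, 2.0))            # 1.0  (x == a)
print(pseudo_harmonic(2.0, 0.5, 2.0))            # 0.0  (x == b)
# A bet in between is randomized between the two prototypes.
print(round(pseudo_harmonic(1.0, 0.5, 2.0), 3))  # 0.5
```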
Alright, now comes something that I couldn’t help putting in because I’m so excited about this. This is
not an overview part. This is kind of the bleeding edge of what we’re doing, or one of the bleeding
edges of what we’re doing.
Joint work with my student Noam Brown. The idea is that the action abstraction we talked about could have a large or infinite branching factor. We pick the prototypes, let’s say those three. But the opponent can move outside of those, so we have to map back.
Alright, the problems. Well, it’s a chicken-and-egg thing, really: how you should abstract depends on the equilibrium, because you should abstract together things that are played similarly, but you can’t start your equilibrium finding before you have the abstraction.
And if the abstraction changes, you have to start the equilibrium finding from scratch. That sucks.
Also, the abstraction size must be tuned to the available run time. If I know that, okay, this spring I have a thousand cores for three months, then based on my practical experience I know roughly what the abstraction size should be.
But what if somebody donated us another month of computing, or another nine months of computing? Then we would have wasted our time running on a coarser abstraction than we could have used. And finer abstractions are not always better, as we talked about, and you cannot feasibly calculate exploitability in the full game for large games like no-limit.
The new idea is this: instead of going through the old sequence, we’re going to collapse all of that into one process. We call it Simultaneous Abstraction and Equilibrium Finding.
Okay, so how it works: let’s say that we have the original game like that; again, blue player, red player. We want to add an extra action for the blue player. The idea is that we assume the action was always there but was just never played by the counterfactual regret algorithm. So this is actually tied to the counterfactual regret (CFR) algorithm that we talked about.
Two challenges. First, what happened in that branch on iterations one through T, which we never actually ran on that branch because it wasn’t really there? Second, this may violate the CFR algorithm, so the regret bounds might not apply.
Now we’re going to solve both. The first thing is, we’re going to fill in the iterations. We generate this thing called the auxiliary game, where we put all of the rest, which we had already computed, into a special node, kind of an outside option that the player can take. Or the player can go into this new piece.
Then we compute CFR in just this game, which is much smaller. Furthermore, you don’t have to compute all T iterations of it, because you wouldn’t actually have reached this node on all iterations of CFR. You can just weight it based on the reach; this kind of fills in what happened in those iterations.
Alright, then we copy the strategy back here and voila, we can continue. One fly in the ointment is that in imperfect-information games an action may originate from multiple information sets. But we can solve that by putting an extra chance node in there, which plays according to the same probabilities with which those information sets were reached over the T iterations of CFR.
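A minimal sketch of the "weight it based on the reach" point, with hypothetical names and a deliberately simplified rule (the paper's exact weighting is not reproduced in the talk):

```python
def auxiliary_game_iterations(reach_probs):
    """reach_probs[t] is the recorded probability that play would have
    reached the newly added node on CFR iteration t (t = 1..T). Rather than
    replaying all T iterations in the new branch, run CFR in the small
    auxiliary game for roughly as many iterations as the node would actually
    have been reached. Illustrative simplification only."""
    return max(1, round(sum(reach_probs)))
```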
Okay, an alternative to the auxiliary game is called regret transfer, something Noam and [indiscernible] put together. It doesn’t always work; it works for special cases where the payoffs are a function of some theta, which parameterizes the action. Poker typically has this flavor, where you can say, okay, I’m going to raise the stakes by a factor of three.
Then this new subgame has a structure identical to another subgame. We store the regret as a function of theta. When adding a new action theta-two, we copy over the regret function and replace theta-one with theta-two. This runs in constant time, so we can add the action in constant time instead of the order-T time that the auxiliary game required.
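A minimal sketch of the regret-transfer bookkeeping, under the simplifying assumption that per-iteration utilities are affine in the bet-size parameter theta; both the names and the affine assumption are illustrative rather than taken from the paper:

```python
class ParametricRegret:
    """Accumulated regret kept as a function of theta rather than a number,
    so a brand-new bet size theta2 can be evaluated in O(1) instead of
    replaying the T iterations the auxiliary game would need."""

    def __init__(self):
        self.const = 0.0   # sum over iterations of the theta-independent part
        self.slope = 0.0   # sum over iterations of the coefficient on theta

    def accumulate(self, c, d):
        # One iteration contributed instantaneous regret c + d * theta.
        self.const += c
        self.slope += d

    def regret_at(self, theta):
        # Constant-time evaluation at any theta, including a newly added one.
        return self.const + self.slope * theta
```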
Okay, now, regret discounting applies to both the auxiliary game and regret transfer. As I mentioned, the second problem was that if the new action was never played, we may have violated the CFR execution and the regret bounds don’t hold. We have to fix this.
We can do that by de-weighting the old iterations: zero means we give them no weight, one means we give them full weight. We have a theorem that says how much weight you can give them and still satisfy the CFR regret bound, and it depends only on measurable quantities that you’ve already computed.
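A minimal sketch of the de-weighting step itself; the admissible weight w comes from the theorem just mentioned (not reproduced here), and this helper only applies a given w. The names are hypothetical:

```python
def deweight_old_iterations(cumulative_regret, cumulative_strategy, w):
    """Scale the quantities accumulated over the pre-addition iterations by
    w in [0, 1] (0 = discard them, 1 = keep them at full weight) before
    continuing CFR with the new action included. Dictionary keys stand for
    information-set/action pairs."""
    assert 0.0 <= w <= 1.0
    return ({k: w * v for k, v in cumulative_regret.items()},
            {k: w * v for k, v in cumulative_strategy.items()})
```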
Eric you look unhappy or puzzled.
>> Eric Horvitz: At best I was thinking about a question, which is this; maybe you can answer it, or maybe you can just say it’s a Nash equilibrium so don’t worry about it. If you had two machines playing each other, doing this, with knowledge of each other, are you still in a world where you can assume... I was trying to sort of work out what that would mean at the level of a reflective policy, knowing about this algorithm, on both sides.
>> Tuomas Sandholm: Okay, so that’s, sorry Eric I’ll have to take that offline. I’m going to think about
that.
>> Eric Horvitz: That goes with my expression.
>> Tuomas Sandholm: I’ll have to think about that. That’s a good question, because of the way I’m thinking about this now. The way Noam and I have been thinking about it, it’s an algorithm that you run before any play happens. But another thing that I want to do is this: when I see the pros, be they human or computer, play some actions that are out of my abstraction, I want to throw those into my abstraction and do this somehow online.
>> Eric Horvitz: I guess I don’t know if all your assumptions hold in that situation?
>> Tuomas Sandholm: I think they do because I’m just ignoring what the other guys are actually
thinking.
>> Eric Horvitz: Okay.
>> Tuomas Sandholm: One thing that I’m thinking here is that I’m going to let them play out of the abstraction and not do this if, according to a best response, they’re actually shooting themselves in the foot. We could actually tell that for the human pros we were playing this spring, some of their manipulations were actually hurting them. Just let those go. Then the ones that actually hurt us, we throw back into the abstraction.
>>: One interesting thing to see, if it’s feasible: not just for the human pros, but give them a copy of this where they can express their beliefs about what hand they think the computer holds. Run the algorithm and ask, if I knew what the computer was holding, according to my beliefs, what would the algorithm do in this situation? How would they change their behavior in response?
In other words, they still don’t know what the computer is doing, but they’re able to simulate the computer given their own beliefs about what the computer holds. I put you on Ace-King, you know, unsuited; what would you play in this situation? I can actually run the algorithm and see what you’d play, and I can adapt to how I think you’d play.
>> Tuomas Sandholm: Okay, interesting, I’ll have to think about that. Okay, good. And it’s not always best to go all the way down to the least weighting. Here’s something that worked well in practice, within the [indiscernible] of the theorem. Again, these are things you’ve already computed, so no extra effort there.
Where and when to add actions? Well, we talked about doing this online, based on what the humans do, but we can also do it automatically offline. You’d want to add to the abstraction actions that are exploitative in the original game. One way to do that is to compute a full-game best response and see which actions exploit.
The idea that we experimented with is that we add in actions when the derivative of the average regret with respect to time is more negative with the action than without it. There’s a formula for it; let’s not go into the details. The key here is that we can add any number of actions at once, and one action can be added at multiple information sets at the same time. You don’t have to add them piece by piece.
Also, you can of course start from some manual abstraction and then add on top of that. You can also use stronger conditions to be more conservative about what you add; the theory still goes through. The claim, which is actually fairly obvious, is that eventually this will add every action that has Omega(T) regret growth, which guarantees convergence to an equilibrium in the original unabstracted game. It also avoids the abstraction pathology.
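A rough sketch of the trigger described above, using a finite-difference estimate of the regret derivative in place of the exact formula from the paper; all names are hypothetical:

```python
def regret_slope(avg_regret_history, window=100):
    """Finite-difference estimate of d(average regret)/dt over a recent window."""
    if len(avg_regret_history) <= window:
        return 0.0
    return (avg_regret_history[-1] - avg_regret_history[-1 - window]) / window

def should_add_action(history_with_action, history_without_action, window=100):
    """Add the candidate action when average regret falls faster (a more
    negative slope) with the action included than without it."""
    return (regret_slope(history_with_action, window)
            < regret_slope(history_without_action, window))
```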
Alright, the full-game best response is kind of a subroutine here; actually, let me skip that because it’s kind of a detail. Removing actions: some actions may be added early but turn out to be useless. We can also remove them so they don’t keep dragging our computation down.
In two-player zero-sum games, if an action is not part of a best response to a Nash equilibrium strategy, then its regret goes down to minus infinity, and the action only needs to be explored in CFR for a constant number of iterations in the beginning. This could be a large constant, but still a constant.
Furthermore, some of these iterations can be skipped. The idea here is that with negative regret we can project: how many times would I have to hit that part of the game in CFR to get back to zero regret? Only then would it start to get revisited. So I can project ahead that I can skip that many iterations, and I don’t have to go into that subtree during those iterations.
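A minimal sketch of that projection, assuming regret matching and a known bound on how much regret can grow per visit; this illustrates the idea rather than the paper's exact pruning rule:

```python
import math

def visits_safe_to_skip(cumulative_regret, payoff_range):
    """Under regret matching, an action with negative cumulative regret is
    played with probability zero, and each visit can raise its regret by at
    most the payoff range. So the subtree below it can be skipped for roughly
    ceil(-R / range) visits before the regret could climb back to zero."""
    if cumulative_regret >= 0:
        return 0
    return math.ceil(-cumulative_regret / payoff_range)
```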
Experiments: we tested on continuous Leduc Hold’em. Leduc Hold’em is one of the standard benchmarks in the field, and continuous just means that we have continuous bet sizes. The initial abstraction contains just the minimum and maximum bet sizes; we’re not putting any handcrafted information in. We are testing against fixed abstractions with branching factors two, three, and five, whose actions are placed not just uniformly but smartly, using the pseudo-harmonic mapping.
Here’s what we have for the benchmarks. If you look here at full-game exploitability, as we compute more, the exploitability goes down. The smaller the abstraction, the quicker it converges in the beginning, but then it ends up higher and doesn’t do so well in the long run, as you would expect.
Now we put our approach on here, and you can see that we actually do better than the small abstractions and better than the large abstractions. It really accomplishes what you wanted: it reaches zero, while any fixed abstraction that’s not the full game will cap out and not reach zero. In fact, it will overfit to the abstraction and start going up in the end.
Okay, so let me just skip that. Now we can talk about two more pieces here. One is opponent exploitation, and then the actual state of poker. I can get both of them in within the ninety minutes, or we can skip one or the other in the interest of time.
>> Eric Horvitz: You might want to leave some time for questions since the session’s so long. Why don’t you just pick.
>> Tuomas Sandholm: Okay, I’m going to try to zoom through both then.
[laughter]
More is more.
>> Eric Horvitz: I thought of it as applying to one or the other. But that’s okay, “or” is also mathematically the inclusive or, so I…
>> Tuomas Sandholm: Okay, well, I’ll skip this part then. I’ll skip this part and just tell you what it is. One is a hybrid between playing equilibrium and using opponent exploitation. The machine learning techniques for opponent exploitation have been tried in poker, and they really can’t hold a candle to the game theory stuff.
But the game theory stuff doesn’t fully exploit the opponents. We have a paper on a hybrid [indiscernible] that starts from the game theory stuff, and as we get more and more evidence that the opponent is making mistakes, we adjust our strategy to exploit.
The second paper I was going to cover is what we call safe exploitation. If you start to deviate from equilibrium, from all equilibria, so you’re playing something that’s not part of any equilibrium, you can exploit the opponent more, but you open yourself up to counter-exploitation. The folk wisdom was that that’s kind of an inherent problem you can’t get around.
But now we can actually ask: is that really true? Or can we exploit more than any game-theoretic strategy would and still be completely safe ourselves? The answer, surprisingly, is that you can. Why?
Well, the high-level idea is this, and the first part of it turns out to be wrong: you would like to bankroll your further exploitation with the winnings you’ve made so far. But you can’t quite do that, because if you risk your upside you still have the full downside, and your expected value drops below your actual game value.
What you do instead is tease out the role of luck from the role of the mistakes the opponent may have made that gave you money. You measure, or at least lower-bound, the part of what your opponent gave you that is due to mistakes. That is the amount with which you can bankroll your exploitation going forward and still be fully safe.
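A minimal sketch of the bookkeeping implied by that idea; the names are hypothetical, and computing the per-hand lower bound on opponent-mistake profit is the hard part that this sketch simply takes as given:

```python
class SafeExploitationBudget:
    """Track only the profit provably attributable to opponent mistakes
    (not to luck), and allow off-equilibrium deviations only when their
    worst-case extra loss is covered by that budget."""

    def __init__(self):
        self.budget = 0.0

    def record_hand(self, mistake_profit_lower_bound):
        # A per-hand lower bound on how much the opponent gave up by mistake;
        # it can be negative if the opponent played well on that hand.
        self.budget = max(0.0, self.budget + mistake_profit_lower_bound)

    def can_deviate(self, worst_case_extra_loss):
        return worst_case_extra_loss <= self.budget
```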
That’s it; that tradeoff doesn’t start at zero. So now let’s move to poker, and this is the fun part; oh, I guess it depends on who you are. How many of you play poker? Okay, great. The state of poker: as I mentioned, Rhode Island Hold’em has three point one billion nodes in the game tree, and it has been exactly solved. The key was a lossless abstraction and a standard LP solver; nothing fancy in the equilibrium finding.
Heads-up limit Texas Hold’em: the bots surpassed the pros in two thousand eight. The University of Alberta organized a man-machine match against two pros in two thousand seven and lost it; in two thousand eight they did it again and won. So in two thousand eight, in limit Texas Hold’em, which has ten to the fourteenth information sets in the game tree, bots surpassed humans. Now it’s what the Alberta guys call essentially solved: solved to within a very close bound of optimal, one milli-big-blind per hand.
What was key there was a new variant of CFR. They used a standard lossless abstraction methodology; they actually hardcoded in a pseudo-isomorphism as a preprocessing step, and then they used a new variant of CFR on that.
Heads-up no-limit has ten to the one hundred sixty-first information sets. Much bigger, a whole other can of worms; you currently can’t even measure exploitability, how close to optimal you are. Tartanian7, our bot, won the Annual Computer Poker Competition.
What we did then: over the next few months we made an even better bot called [indiscernible], sorry, called Claudico. Then I organized this man-machine match, where I got four of the top ten pros in the world in heads-up no-limit Texas Hold’em to come and play eighty thousand hands of heads-up no-limit Texas Hold’em at the casino in Pittsburgh. Each one was playing alone against Claudico.
Here are the four pros: Jason Les, Doug Polk, Bjorn Li, and Dong Kim. Here’s the supercomputer we used to compute the Nash equilibrium; it’s the Blacklight supercomputer. And here’s our team: Noam Brown and Sam Ganzfried, who are my Ph.D. students.
Alright…
>> Eric Horvitz: Can’t get having this…
>> Tuomas Sandholm: What?
>> Eric Horvitz: I like how you displaced it and something’s sitting there, right?
>> Tuomas Sandholm: Say again?
>> Eric Horvitz: It’s like a display sitting there.
>> Tuomas Sandholm: Oh, I know, I know. I thought it was funny that they actually made a custom table edge and all of these things. Then they made like a chair; it had a chair, like there’s something there, but there’s nothing there.
[laughter]
I thought the casino did a fantastic job with the setting; it was just awesome. We were actually playing duplicate poker to reduce the role of variance. Two of them were playing in public with the reversed cards, and the other pair of them were playing in a private room with armed guards upstairs, so there was no cheating.
We tried to really reduce the role of luck here to try to get statistical significance, but we failed: eighty thousand hands, although that’s right at the upper limit of what the humans could do in terms of their time, was still not enough. The humans won more chips than the computer, but it was so close that we couldn’t get ninety-five percent statistical significance to say who’s better. But oh, well.
>>: How do they rate the pros?
>> Tuomas Sandholm: Yeah, these are four of the top ten pros in the world. Doug Polk is considered number one; he came in number two against the computer. Bjorn Li actually came in as number one and won twice as much as Doug against the computer. Dong Kim, oftentimes considered number two or three in the world, barely beat the computer. Jason Les lost to the computer. These guys carried the day.
>> Eric Horvitz: It really is fascinating, looking at these methods, to think about, you know, what on earth humans are doing.
>> Tuomas Sandholm: Oh, my goodness.
[laughter]
Oh my goodness, and these guys don’t have the game theory vocabulary. I mean, one has a math and economics bachelor’s from the University of Chicago, one has a computer science bachelor’s, and one has no college degree, I don’t think. He was drawing graphs for me about the endgame where the y-axis is a defense probability and the x-axis has this other quantity, and he had these curves. It was spot on.
Like, if I had to teach that stuff, I could take this guy’s graphs, and the guy doesn’t have a college degree. It’s so amazing. When we…
>> Eric Horvitz: But even with the idea that the obvious noisy abstractions they must be using are
getting so close to beating your…
>> Tuomas Sandholm: Oh, they are beating, well yeah.
>> Eric Horvitz: Yeah, something you said about whatever these abstractions are and then plus the
amount of noise and all the craziness, without theorems and proofs.
>> Tuomas Sandholm: Unbelievable, unbelievable. And somebody might say, hey, well, they’ve played a lot of poker; they’ve read books and played a lot of poker. Wait a second: we play more poker in self-play on the supercomputer every spring than mankind has ever played.
[laughter]
>>: Okay, anyway it’s just a reflection.
>> Tuomas Sandholm: Yeah, it is very impressive. Also, when we changed our bots, like on some days we were turning the endgame solver on versus off, it would actually have different flavors. We would change the pseudo-harmonic mapping, whether we’d randomize it or not, you know, to try to throw curve balls at these guys. Within a hundred and fifty hands they picked up on everything. It’s just unbelievable.
Anyway this is what it looked like. This is on Twitch and YouTube. You can look at all of the hands. This
is really like a university for poker if you want to study this. You can look at two weeks of poker.
>> Eric Horvitz: You have a comment about Microsoft [indiscernible], how we, why our name is printed there?
>>: Well, in the State of Pennsylvania in order for this to be legal there needs to be real cash prize
money put up. We, MSR provided the cash awards for…
>> Tuomas Sandholm: Yeah, so the pros needed to get paid to do this, and we couldn’t gamble for real money; the Pennsylvania Gaming Board didn’t allow that. In hindsight that’s probably a good thing; CMU would have lost part of its endowment, I guess.
[laughter]
Generously, Microsoft sponsored half of the prize and the Rivers Casino in Pittsburgh sponsored the other half. In addition, the Pittsburgh Supercomputing Center was sponsoring the supercomputing, and the AI Journal was sponsoring the laptops and so forth. Thank you; it wouldn’t have been possible without you.
I mean, quite literally, we were like two weeks out before the commitments were firm. You know, the casino would not have run this event if that hadn’t happened.
>>: For some reason our Purchasing Department had a hard time issuing a purchase order.
[laughter]
It took a while.
>> Tuomas Sandholm: Yeah, thank you. Okay, so these pros took it very seriously: two weeks of poker, one day of break. These are not the cigar-smoking, scotch-drinking, Stetson-hat-wearing, all-American kind of guys. These are international pros who study all the time. They have computational tools. They flew in a guy from Florida to help them do computational analysis during the day, and during the nights they were doing computational analysis…
>> Eric Horvitz: Is that allowed, that’s allowed?
>> Tuomas Sandholm: Well, we allowed them to do anything.
>> Eric Horvitz: But I guess you’re saying that one day there might be human versus machine where
there’s no support, right. But…
>> Tuomas Sandholm: Yeah, we allowed them to have support.
>> Eric Horvitz: Yeah.
>> Tuomas Sandholm: Yeah, well you know we…
>> Eric Horvitz: Except for your computer.
>> Tuomas Sandholm: What?
>> Eric Horvitz: Except access to your computer.
>> Tuomas Sandholm: Except access to ours.
[laughter]
We gave them the logs every night so they could analyze them. They had all of the tools; they could use computers, so they were kind of a human-computer hybrid, if you will. And they took it very seriously. They were stretching here in the morning; they were eating oatmeal at the casino for breakfast. I saw them drink one glass of red wine during those two weeks. I hoped that it would jinx their team, but I guess not.
Okay, multiplayer poker: well, the bots aren’t very strong. In special cases, like programs for jam/fold tournaments, we’ve solved it near-optimally, but by and large it’s not even clear that Nash equilibrium is the right thing to play there. There are some really interesting results from the University of Alberta on three-player Kuhn poker. Even there, whatever strategy I pick, I can’t really help myself or hurt myself, but I can allocate the money between Max and Eric radically differently. Max wants it, yeah.
So it’s not even clear whether Nash is the right thing. Then, what can we learn from the bots? How do humans learn poker? Well, they read books and they play poker. Who wrote the books? Well, humans, so it’s kind of a recursive thing; it kind of folds in on itself. There’s no ground truth there.
In contrast, the bots are working only from the definition of the game and the definition of Nash equilibrium, so they’re sitting on ground truth. The bots actually learn to play very different kinds of strategies than the humans have evolved to play. The problem is that the bot strategy is a big vector of probabilities, one point four terabytes of them in the case of Claudico.
It’s hard for a human to understand anything from that, but I’ll mention a few things that it does differently. First, the action to limp: limping in poker is when it’s your move and you’re the first mover. Typically you want to raise or fold; limping means that you just call, like, okay, I’ll just play along. It’s considered a weak move.
Here is what this poker book says about it: “Limping is for Losers. This is the most important
fundamental in poker-for every game, for every tournament, every stake: If you are the first player to
voluntarily commit chips to the pot, open for a raise. Limping is inevitably a losing play. If you see a
person at the table limping, you can be fairly sure he is a bad player. Bottom line: If your hand is worth
playing, it is worth raising.”
Similarly, Daniel Cates, who’s one of the other top ten players, not one of the ones in the event, verifies that in two-player heads-up no-limit, limping is a bad idea. Well, our bot limps.
[laughter]
It’s not just this bot. Every bot we’ve computed for this game has always limped between eight and
twelve percent of the time. That’s an indication that limping might not be a bad idea. In fact the name
Claudico is Latin for “I limp”.
[laughter]
We named it after its signature move. Alright, the donk bet. A common sequence in the first betting round is that the first mover raises and the second mover calls: the first mover is representing strength, the second mover is not. By the rules, the latter has to move first in the second betting round. If he then bets, that is called a “donk bet”, or donkey bet, a bad-player bet.
It’s like, you represented that you’re weak and now you’re representing that you’re strong; eh, something’s rotten in Denmark, you’re not really that credible. It’s considered a bad move. Our bot donk bets, and with various sizes as well.
>> Eric Horvitz: There’s a Latin word for donkey in there somewhere.
[laughter]
>>: By the way, when watching it, when you’ve witnessed that, as a human you’re sure you’ve misremembered.
>> Tuomas Sandholm: Oh, yeah you can start to doubt yourself.
>>: Yeah.
>> Tuomas Sandholm: As a human you’re like, whoa, that is so weird, I must have misremembered. But there’s actually a string there that encodes the whole hand history, and we gave that to the humans as well. We didn’t try to take advantage of the humans’ bad memory. Not that they have a bad memory; they can remember these hand sequences from days ago in full detail. But if a layman like me plays, it’s nice to have the sequence there. Okay, no, he didn’t misremember; he did make that donk bet.
Okay, using more than one bet size in a given situation risks signaling too much; remember, we talked about the signaling. Most pros use one bet size, some use two, and this is a little bit of an old bullet; nowadays some pros have started to vary the bet size a little more. Our bot uses a wide range of bet sizes and randomizes across them.
What the humans said is that it’s perfectly balanced: it will bluff and it will value bet in the same type of situation with the same types of bet sizes, including huge ones and tiny ones. It will make a ten percent bet on the river to open. Or it will go all in on top of, one fortieth of the pot, sorry, forty times the pot, or thirty-seven times the pot, and so forth.
Alright, conclusions: domain-independent techniques, a combination of abstraction, equilibrium finding, reverse mapping, and then opponent exploitation. In Claudico we turned opponent exploitation off completely, so Claudico never actually saw how the humans played poker. We just did it in a purely game-theoretic way. Let me leave it at that.
>>: Can you say why?
>> Tuomas Sandholm: Why? The opponent exploitation techniques really risk a lot, depending on the technique, because they’re not safe; except for the one that is safe, but that one doesn’t exploit much either. We thought that was a risk, and we thought that there was very little to exploit in these top pros, so we just didn’t go there.
For the next time I have my own ideas as to what we’re going to do. We’re going to do some of that in a
very different way than we’ve done in the literature so far.
>>: That’s an assumption that’s important to test, because, you know, maybe there is something that you can exploit.
>> Tuomas Sandholm: Yeah, maybe there is something. I’m sure that there’s something to exploit; the top players say it themselves. Okay, I believe that there’s a lot to be exploited. It’s just hard to find those exploitations in a reasonable number of hands, before you’ve lost a whole bunch of money trying.
One thing that suggests there is a lot to exploit is that human poker play in no-limit Texas Hold’em has changed quite radically over the last ten years. It was actually kind of a soft game ten years ago; if you put in half a year of study you could actually make a lot of money. That’s not the case anymore.
Nowadays the top pros are very good and they’re randomizing. They’re using these notions of balance, card removal, very sophisticated things. They’re learning from each other; they actually have these schools. Doug Polk is actually the trainer of two of these other pros, Dong and Jason, because he’s so good. People are so scared of him online that he doesn’t get any action; he’s so good nobody wants to play him.
So what he does is take these younger guys he calls students, just like professors call them students, and he trains them. They come in with no name and will play, maybe not quite as well as him, but really well, using his strategies, and they will get action until they become too famous and nobody wants to play them either.
But that is how the ecosystem works. Thank you.
[applause]
>> Eric Horvitz: Since we had the discussion along the way, and it’s already noon, if there’s a burning question we’ll take it; but otherwise, thanks everybody. Great, thanks.
[applause]
>> Tuomas Sandholm: Thank you.