The iterated prisoner's dilemma

Prisoner's dilemma
From Wikipedia, the free encyclopedia
Will the two prisoners cooperate to minimize total loss of liberty or will one of them, trusting the
other to cooperate, betray him so as to go free?
Many points in this article may be difficult to understand without a background in the
elementary concepts of game theory.
In game theory, the prisoner's dilemma is a type of non-zero-sum game in which two players try to
get rewards from a banker by cooperating with or betraying the other player. In this game, as in
many others, it is assumed that the primary concern of each individual player ("prisoner") is self-regarding, i.e., trying to maximise his own advantage, with less concern for the well-being of the
other players.
In the prisoner's dilemma, cooperating is strictly dominated by defecting (i.e., betraying one's
partner), so that the only possible equilibrium for the game is for all players to defect. In simpler
terms, no matter what the other player does, one player will always gain a greater payoff by playing
defect. Since in any situation playing defect is more beneficial than cooperating, all rational players
will play defect.
The unique equilibrium for this game does not lead to a Pareto-optimal solution—that is, two
rational players will both play defect even though the total reward (the sum of the reward received
by the two players) would be greater if they both played cooperate. In equilibrium, each prisoner
chooses to defect even though both would be better off by cooperating, hence the dilemma.
In the iterated prisoner's dilemma the game is played repeatedly. Thus each player has an
opportunity to "punish" the other player for previous non-cooperative play. Cooperation may then
arise as an equilibrium outcome. The incentive to defect may then be overcome by the threat of
punishment, leading to the possibility of a cooperative outcome. As the number of iterations
approaches infinity, the Nash equilibrium tends to the Pareto optimum.
The classical prisoner's dilemma
The Prisoner's dilemma was originally framed by Merrill Flood and Melvin Dresher working at
RAND in 1950. Albert W. Tucker formalized the game with prison sentence payoffs and gave it the
"Prisoner's Dilemma" name.
The classical prisoner's dilemma (PD) is as follows:
Two suspects, A and B, are arrested by the police. The police have insufficient evidence for a
conviction, and, having separated both prisoners, visit each of them to offer the same deal: if
one testifies for the prosecution against the other and the other remains silent, the betrayer
goes free and the silent accomplice receives the full 10-year sentence. If both stay silent, the
police can sentence both prisoners to only six months in jail for a minor charge. If each
betrays the other, each will receive a two-year sentence. Each prisoner must make the choice
of whether to betray the other or to remain silent. However, neither prisoner knows for sure
what choice the other prisoner will make. So the question this dilemma poses is: What will
happen? How will the prisoners act?
The dilemma can be summarised thus:
                          Prisoner B stays silent              Prisoner B betrays
Prisoner A stays silent   Both serve six months                A serves ten years; B goes free
Prisoner A betrays        A goes free; B serves ten years      Both serve two years
The dilemma arises when one assumes that both prisoners only care about minimising their own jail
terms. Each prisoner has two options: to cooperate with his accomplice and stay quiet, or to betray
his accomplice and give evidence. The outcome of each choice depends on the choice of the
accomplice. However, neither prisoner knows the choice of his accomplice. Even if they were able
to talk to each other, neither could be sure that he could trust the other.
Let us assume the protagonist prisoner is working out his best move. If his partner stays quiet, his best move is to betray, as he then walks free instead of receiving the minor sentence. If his partner betrays, his best move is still to betray, as he then receives a lesser sentence than if he stayed silent. At the same time, the other prisoner, reasoning in the same way, would reach the same conclusion and therefore also betray.
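As a quick sanity check of this reasoning, the following sketch (in Python, with illustrative names; not part of the original article) computes each prisoner's best response from the sentence table above:

```python
# Illustrative check (not from the article): betrayal is a best response to either
# choice of the accomplice. Payoffs are jail terms in years, taken from the table
# above; lower is better.
years_served = {
    # (A's choice, B's choice): (A's sentence, B's sentence)
    ("silent", "silent"): (0.5, 0.5),
    ("silent", "betray"): (10, 0),
    ("betray", "silent"): (0, 10),
    ("betray", "betray"): (2, 2),
}

for b_choice in ("silent", "betray"):
    # A's best response minimises A's own sentence, whatever B does.
    best = min(("silent", "betray"), key=lambda a: years_served[(a, b_choice)][0])
    print(f"If B plays {b_choice!r}, A's best response is {best!r}")
```

Running it prints "betray" in both cases, which is exactly the dominance argument above.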
Reasoning from the perspective of the optimal outcome for the group (of two prisoners), the correct choice would be for both prisoners to cooperate with each other, as this would reduce the total jail time served by the group to one year. Any other decision would be worse for the two prisoners considered together. When the prisoners both betray each other, each prisoner achieves a worse outcome than if they had cooperated. This demonstrates very elegantly that in a non-zero-sum game the Pareto optimum and the Nash equilibrium need not coincide.
The two options are sometimes alternatively labelled "confess" and "don't confess."
Generalized form
We can expose the skeleton of the game by stripping it of the Prisoners’ subtext. The generalized
form of the game has been used frequently in experimental economics. The following rules give a
typical realization of the game.
There are two players and a banker. Each player holds a set of two cards: one printed with the word
"Cooperate", the other printed with "Defect" (the standard terminology for the game). Each player
puts one card face-down in front of the banker. By laying them face down, the possibility of a player
knowing the other player's selection in advance is eliminated (although revealing one's move does
not affect the dominance analysis[1]). At the end of the turn, the banker turns over both cards and
gives out the payments accordingly.
If player 1 (red) defects and player 2 (blue) cooperates, player 1 gets the Temptation to Defect payoff of 5 points while player 2 receives the Sucker's payoff of 0 points. If both cooperate they get the Reward for Mutual Cooperation payoff of 3 points each, while if they both defect they get the Punishment for Mutual Defection payoff of 1 point each. The payoff matrix showing these payoffs is given below.
Canonical PD payoff matrix (in each cell, the first number is the row player's payoff, the second the column player's):

             Cooperate    Defect
Cooperate    3, 3         0, 5
Defect       5, 0         1, 1
In "win-lose" terminology the table looks like this:
Cooperate
Defect
Cooperate
win-win
lose much-win much
Defect
win much-lose much
lose-lose
These point assignments are given arbitrarily for illustration. It is possible to generalize them. Let T stand for Temptation to Defect, R for Reward for Mutual Cooperation, P for Punishment for Mutual Defection and S for Sucker's Payoff. The following inequalities must hold:
T > R > P > S
If the game is iterated (played more than once in a row), a second condition must also hold: the total payoff from mutual cooperation must exceed the total payoff the players obtain by taking turns exploiting each other (one collecting the Temptation payoff while the other collects the Sucker's payoff), since otherwise alternating cooperation and defection would be at least as rewarding as sustained mutual cooperation (see the section on the iterated version):
2R > T + S
These rules were established by cognitive scientist Douglas Hofstadter and form the canonical formal description of a typical game of Prisoner's Dilemma.
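For concreteness, here is a tiny illustrative check (in Python; not from the article) of both conditions for the canonical point values used above:

```python
# Illustrative check of the two conditions above, using the canonical point values.
T, R, P, S = 5, 3, 1, 0   # Temptation, Reward, Punishment, Sucker's payoff

assert T > R > P > S       # one-shot condition: defection tempts, and mutual defection still beats being the sucker
assert 2 * R > T + S       # iterated condition: steady mutual cooperation beats taking turns exploiting each other
print("Canonical payoffs satisfy both conditions.")
```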
A similar but different game
Hofstadter[2] once suggested that people often find problems such as the PD problem easier to
understand when it is illustrated in the form of a simple game, or trade-off. One of several examples
he used was "closed bag exchange":
Two people meet and exchange closed bags, with the understanding that one of them
contains money, and the other contains a purchase. Either player can choose to honour the
deal by putting into his bag what he agreed, or he can defect by handing over an empty bag.
In this game, defection is always the best course, implying that rational agents will never play, and
that "closed bag exchange" will be a missing market due to adverse selection.
In a variation, popular among hackers and programmers, each bag-exchanging agent is given a
memory (or access to a collective memory), and many exchanges are repeated over time.
As noted, without this introduction of time and memory, there is not much meaning to this game.
Not much is explained about the behaviour of actual systems and groups of people, except for
describing interactions which don't happen. Yet more complexity is introduced here than might be
expected. The programmer (especially the functional programmer) will pick up right away on the
significance of introducing time and state (memory). But without any background on writing
programs or modelling these kinds of systems, the various choices that one would have to make can
be seen. How big is the memory of each actor? What is the strategy of each actor? How are actors
with various strategies distributed and what determines who interacts with whom and in what order?
One may become frustrated by the complexity involved in creating any model which is meaningful
at all, but some very interesting and worthy technical and philosophical issues are raised.
Some work has been done to model this. Various programmers and mathematicians have claimed that "Tit for tat" is the best general strategy, but no serious academic effort has been made to classify the various kinds and distributions of stateful actors with different strategies.
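As an illustration of the modelling choices discussed above, here is a toy sketch in Python; the class and method names, the bounded memory size, and the example strategy are illustrative assumptions, not taken from the article.

```python
# A toy sketch of the stateful bag-exchange model discussed above. Each agent keeps a
# bounded memory of past exchanges with each partner and decides from that history.
from collections import defaultdict, deque

class Agent:
    def __init__(self, name, memory_size=3):
        self.name = name
        # One bounded history per trading partner: True means the partner honoured the deal.
        self.memory = defaultdict(lambda: deque(maxlen=memory_size))

    def decide(self, partner_name):
        """Return True to honour the deal (fill the bag), False to defect (hand over an empty bag)."""
        history = self.memory[partner_name]
        # Example strategy: honour the deal unless this partner cheated on the most recent exchange.
        return history[-1] if history else True

    def remember(self, partner_name, partner_honoured):
        self.memory[partner_name].append(partner_honoured)

# A single exchange between two agents:
a, b = Agent("a"), Agent("b")
a_honours, b_honours = a.decide("b"), b.decide("a")
a.remember("b", b_honours)
b.remember("a", a_honours)
```

Even this toy model forces answers to the questions above: here memory is bounded per partner and the strategy looks back only one exchange; a richer model would also have to decide how partners are matched and in what order.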
The richness of this problem is suggested by the fact that this discussion has not even mentioned the possibility of the formation (spontaneous or otherwise) of conglomerates of actors, negotiating their bag exchanges collectively. And what about agents who charge a fee for organising these bag exchanges? Or agents (journalists?) who collect and exchange information about the bag exchanges themselves?
Real-life examples
These particular examples, involving prisoners and bag switching and so forth, may seem contrived,
but there are in fact many examples in human interaction as well as interactions in nature that have
the same payoff matrix. The prisoner's dilemma is therefore of interest to the social sciences such as
economics, politics and sociology, as well as to the biological sciences such as ethology and
evolutionary biology. Many natural processes have been abstracted into models in which living
beings are engaged in endless games of Prisoner's Dilemma. This wide applicability of the PD gives
the game its substantial importance.
In political science, for instance, the PD scenario is often used to illustrate the problem of two states
engaged in an arms race. Both will reason that they have two options, either to increase military
expenditure or to make an agreement to reduce weapons. Neither state can be certain that the other
one will keep to such an agreement; therefore, they both incline towards military expansion. The
paradox is that both states are acting "rationally", but producing an apparently "irrational" result.
This could be considered a corollary to deterrence theory.
Another interesting example concerns a well-known concept in cycling races, for instance in the
Tour de France. Consider two cyclists halfway in a race, with the peloton (larger group) at great
distance behind them. The two cyclists often work together (mutual cooperation) by sharing the
tough load of the front position, where there is no shelter from the wind. If neither of the cyclists
makes an effort to stay ahead, the peloton will soon catch up (mutual defection). An often-seen
scenario is one cyclist doing the hard work alone (cooperating), keeping the two ahead of the
peloton. In the end, this will likely lead to a victory for the second cyclist (defecting) who has an
easy ride in the first cyclist's slipstream.
An occurrence of the prisoner’s dilemma in real life can be found in business. Two competing firms
must decide how many resources to devote to advertising. The effectiveness of Firm A’s
advertising is partially determined by the advertising conducted by Firm B. Likewise, the profit
derived from advertising for Firm B is affected by the advertising conducted by Firm A. If both Firm
A and Firm B choose to advertise during a given period the advertising cancels out, receipts remain
constant, and expenses increase due to the cost of advertising. Both firms would benefit from a
reduction in advertising. However, should Firm B choose not to advertise, Firm A could benefit
greatly by advertising. A prisoners' dilemma occurs when both Firm A and Firm B have dominant
strategies, so the outcome is easy to predict. A dominant strategy is an action that will give the best
result no matter what decision the rival firm makes. It is unlikely in a true prisoners' dilemma that
Firm A and Firm B will cooperate because there is too much incentive for both sides to "cheat" in
order to get their best outcome. It is also true that both sides will end up worse off than if they had
cooperated. However, sometimes cooperative behaviours emerge in business situations which are
surprisingly beneficial to the masses [6].
William Poundstone, in a book about the Prisoner's Dilemma (see References below), describes a
situation in New Zealand where newspaper boxes are left unlocked. It is possible for someone to
take a paper without paying (defecting) but very few do, recognising the resultant harm if everybody
stole newspapers (mutual defection). Since the pure PD is simultaneous for all players (with no way
for any player's action to have an effect on another's strategy) this widespread line of reasoning is
called "magical thinking".[3]
The theoretical conclusion of PD is one reason why, in many countries, plea bargaining is forbidden.
Often, precisely the PD scenario applies: it is in the interest of both suspects to confess and testify
against the other prisoner/suspect, even if each is innocent of the alleged crime. Arguably, the worst
case is when only one party is guilty — here, the innocent one is unlikely to confess, while the guilty
one is likely to confess and testify against the innocent.
Many real-life dilemmas involve multiple players. Although metaphorical, Hardin's tragedy of the
commons may be viewed as an example of a multi-player generalisation of the PD: Each villager
makes a choice for personal gain or restraint. The collective reward for unanimous (or even frequent)
defection is very low payoffs (representing the destruction of the "commons"). However, such multi-player PDs are not formal as they can always be decomposed into a set of classical two-player games.
The iterated prisoner's dilemma
In his book The Evolution of Cooperation (1984), Robert Axelrod explored an extension to the
classical PD scenario, which he called the iterated prisoner's dilemma (IPD). In this, participants
have to choose their mutual strategy again and again, and have memory of their previous encounters.
Axelrod invited academic colleagues all over the world to devise computer strategies to compete in
an IPD tournament. The programs that were entered varied widely in algorithmic complexity; initial
hostility; capacity for forgiveness; and so forth.
Axelrod discovered that when these encounters were repeated over a long period of time with many
players, each with different strategies, "greedy" strategies tended to do very poorly in the long run
while more "altruistic" strategies did better, as judged purely by self-interest. He used this to show a
possible mechanism for the evolution of altruistic behaviour from mechanisms that are initially
purely selfish, by natural selection.
The best deterministic strategy was found to be "Tit for Tat", which Anatol Rapoport developed and
entered into the tournament. It was the simplest of any program entered, containing only four lines of
BASIC, and won the contest. The strategy is simply to cooperate on the first iteration of the game;
after that, the player does what his opponent did on the previous move. A slightly better strategy is
"Tit for Tat with forgiveness". When the opponent defects, on the next move, the player sometimes
cooperates anyway, with a small probability (around 1%-5%). This allows for occasional recovery
from getting trapped in a cycle of defections. The exact probability depends on the line-up of
opponents. "Tit for Tat with forgiveness" is best when miscommunication is introduced to the game
— when one's move is incorrectly reported to the opponent.
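A minimal sketch of the "Tit for Tat with forgiveness" rule described above, written in Python; the function name and the 5% forgiveness probability (chosen from the 1%–5% range mentioned) are illustrative.

```python
# Minimal sketch of "Tit for Tat with forgiveness" as described above.
import random

def tit_for_tat_with_forgiveness(opponent_history, forgiveness=0.05):
    """Return 'C' (cooperate) or 'D' (defect), given the list of the opponent's past moves."""
    if not opponent_history:          # cooperate on the first iteration
        return "C"
    if opponent_history[-1] == "D":   # the opponent defected on the previous move...
        if random.random() < forgiveness:
            return "C"                # ...occasionally forgive, to escape cycles of mutual defection
        return "D"                    # ...otherwise retaliate
    return "C"                        # the opponent cooperated, so keep cooperating
```

Setting the forgiveness probability to zero recovers plain Tit for Tat.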
By analysing the top-scoring strategies, Axelrod stated several conditions necessary for a strategy to
be successful.
Nice
The most important condition is that the strategy must be "nice", that is, it will not defect
before its opponent does. Almost all of the top-scoring strategies were nice. Therefore a
purely selfish strategy for purely selfish reasons will never hit its opponent first.
Retaliating
However, Axelrod contended, the successful strategy must not be a blind optimist. It must
always retaliate. An example of a non-retaliating strategy is Always Cooperate. This is a very
bad choice, as "nasty" strategies will ruthlessly exploit such softies.
Forgiving
Another quality of successful strategies is that they must be forgiving. Though they will
retaliate, they will once again fall back to cooperating if the opponent does not continue to defect. This stops long runs of revenge and counter-revenge, maximising points.
Non-envious
The last quality is being non-envious, that is, not striving to score more than the opponent (which is in any case impossible for a 'nice' strategy: a 'nice' strategy can never score more than its opponent).
Therefore, Axelrod reached the Utopian-sounding conclusion that selfish individuals for their own
selfish good will tend to be nice and forgiving and non-envious. One of the most important
conclusions of Axelrod's study of IPDs is that Nice guys can finish first.
Reconsider the arms-race model given in the classical PD section above: It was concluded that the
only rational strategy was to build up the military, even though both nations would rather spend their
GDP on butter than guns. Interestingly, attempts to show that rival states actually compete in this
way (by regressing "high" and "low" military spending between periods under iterated PD
assumptions) often show that the posited arms race is not occurring as expected. (For example Greek
and Turkish military spending does not appear to follow a tit-for-tat iterated-PD arms-race, but is
more likely driven by domestic politics.) This may be an example of rational behaviour differing
between the one-off and iterated forms of the game.
The optimal (points-maximising) strategy for the one-time PD game is simply defection; as
explained above, this is true whatever the composition of opponents may be. However, in the
iterated-PD game the optimal strategy depends upon the strategies of likely opponents, and how they
will react to defections and cooperations. For example, consider a population where everyone
defects every time, except for a single individual following the Tit-for-Tat strategy. That individual
is at a slight disadvantage because of the loss on the first turn. In such a population, the optimal
strategy for that individual is to defect every time. In a population with a certain percentage of
always-defectors and the rest being Tit-for-Tat players, the optimal strategy for an individual
depends on the percentage, and on the length of the game.
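To make the dependence on the population mix and game length concrete, here is an illustrative sketch (in Python) that computes expected per-match scores for a Tit-for-Tat player and an Always-Defect player, assuming the canonical payoffs above, matches of n rounds, and opponents drawn at random from the population; the break-even fraction in the final comment follows from these assumptions and is not a figure quoted in the article.

```python
# Illustrative sketch: expected per-match totals for a Tit-for-Tat player and an
# Always-Defect player when a fraction p_tft of opponents plays Tit-for-Tat and
# each match lasts n rounds, using the canonical payoffs.
T, R, P, S = 5, 3, 1, 0

def expected_scores(p_tft, n):
    tft_vs_tft = n * R                 # mutual cooperation throughout
    tft_vs_alld = S + (n - 1) * P      # loses the first round, then mutual defection
    alld_vs_tft = T + (n - 1) * P      # exploits the first round, then mutual defection
    alld_vs_alld = n * P
    tft_score = p_tft * tft_vs_tft + (1 - p_tft) * tft_vs_alld
    alld_score = p_tft * alld_vs_tft + (1 - p_tft) * alld_vs_alld
    return tft_score, alld_score

for p in (0.01, 0.05, 0.50):
    tft, alld = expected_scores(p, n=10)
    print(f"p_tft={p:.2f}: Tit-for-Tat averages {tft:.2f}, Always-Defect averages {alld:.2f}")
# With these payoffs, Tit-for-Tat overtakes Always-Defect once p_tft exceeds 1 / (2n - 3).
```

In other words, a lone Tit-for-Tat player in a sea of defectors is indeed at a disadvantage, but under these assumptions even a small share of reciprocators is enough to tip the balance once games are long.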
Deriving the optimal strategy is generally done in two ways:
1. Bayesian Nash Equilibrium: If the statistical distribution of opposing strategies can be
determined (e.g. 50% tit-for-tat, 50% always cooperate) an optimal counter-strategy can be
derived mathematically[4].
2. Monte Carlo simulation: populations of players have been simulated, where individuals with low scores die off and those with high scores reproduce (a genetic algorithm for finding an optimal strategy); a sketch of this approach follows the list. The mix of algorithms in the final population generally depends on the mix in the initial population.
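The following is a toy sketch (in Python) of the Monte Carlo / genetic-algorithm idea in point 2, under illustrative assumptions: three simple strategies, round-robin matches of 50 rounds, and a crude "worst five die, best five reproduce" selection rule.

```python
# Toy sketch of the Monte Carlo / genetic-algorithm idea. The strategies, population
# sizes, match length and selection rule are illustrative assumptions.
T, R, P, S = 5, 3, 1, 0
PAYOFF = {("C", "C"): (R, R), ("C", "D"): (S, T), ("D", "C"): (T, S), ("D", "D"): (P, P)}

def tit_for_tat(opponent_history):      return "C" if not opponent_history else opponent_history[-1]
def always_defect(opponent_history):    return "D"
def always_cooperate(opponent_history): return "C"

def match(strategy_a, strategy_b, rounds=50):
    """Play one iterated match and return the two total scores."""
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_b), strategy_b(history_a)  # each sees the other's past moves
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

population = [always_defect] * 10 + [tit_for_tat] * 10 + [always_cooperate] * 10
for generation in range(20):
    scores = [0] * len(population)
    for i in range(len(population)):                 # round-robin tournament
        for j in range(i + 1, len(population)):
            a, b = match(population[i], population[j])
            scores[i] += a
            scores[j] += b
    ranked = [s for _, s in sorted(zip(scores, population), key=lambda pair: -pair[0])]
    population = ranked[:-5] + ranked[:5]            # the worst five die off, the best five reproduce
print({s.__name__: population.count(s) for s in (tit_for_tat, always_defect, always_cooperate)})
```

With these particular settings the Tit-for-Tat players tend to take over the population after a few generations, echoing Axelrod's findings; as the text notes, the final mix generally depends on the initial one.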
Although Tit-for-Tat was long considered to be the most solid basic strategy, a team from
Southampton University in England (led by Professor Nicholas Jennings, and including Rajdeep
Dash, Sarvapali Ramchurn, Alex Rogers and Perukrishnen Vytelingum) introduced a new strategy at
the 20th-anniversary Iterated Prisoner's Dilemma competition, which proved to be more successful
than Tit-for-Tat. This strategy relied on cooperation between programs to achieve the highest
number of points for a single program. The University submitted 60 programs to the competition,
which were designed to recognise each other through a series of five to ten moves at the start. Once
this recognition was made, one program would always cooperate and the other would always defect,
assuring the maximum number of points for the defector. If the program realised that it was playing
a non-Southampton player, it would continuously defect in an attempt to minimise the score of the
competing program. As a result[5], this strategy ended up taking the top three positions in the
competition, as well as a number of positions towards the bottom.
Although this strategy is notable in that it proved more effective than Tit-for-Tat, it takes advantage
of the fact that multiple entries were allowed in this particular competition. In a competition where
one has control of only a single player, Tit-for-Tat is certainly a better strategy. It also relies on
circumventing rules about the prisoner's dilemma in that there is no communication allowed between
the two players. When the Southampton programs engage in an opening "ten move dance" to
recognize one another, this only reinforces just how valuable communication can be in shifting the
balance of the game.
If an iterated PD is going to be played exactly N times, for some known constant N, then there is another interesting fact: the Nash equilibrium is to defect every time. This is easily proved by backward induction. One might as well defect on the last turn, since the opponent will not have a chance to punish the player; therefore, both will defect on the last turn. Given that, the player might as well defect on the second-to-last turn, since the opponent will defect on the last turn no matter what is done, and so on. For cooperation to remain appealing, then, the future must be indeterminate for both players. One solution is to make the total number of turns N random. The shadow of the future must be indeterminately long.
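To make "the shadow of the future" concrete, consider a standard back-of-the-envelope condition (a sketch, not stated in the article): suppose that after each round the game continues with probability $\delta$, and that a defection is answered by permanent defection thereafter (a "grim trigger"). Cooperating forever is then at least as good as defecting once and being punished when

$$\frac{R}{1-\delta} \;\ge\; T + \frac{\delta P}{1-\delta} \quad\Longleftrightarrow\quad \delta \;\ge\; \frac{T-R}{T-P}.$$

With the canonical payoffs (T, R, P, S) = (5, 3, 1, 0), cooperation can be sustained whenever the continuation probability is at least (5 − 3)/(5 − 1) = 1/2.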
Another odd case is "play forever" prisoner's dilemma. The game is repeated infinitely many times,
and the player's score is the average (suitably computed).
The prisoner's dilemma game is fundamental to certain theories of human cooperation and trust. On
the assumption that the PD can model transactions between two people requiring trust, cooperative
behaviour in populations may be modelled by a multi-player, iterated, version of the game. It has,
consequently, fascinated many, many scholars over the years. In 1975, Grofman and Pool estimated
the count of scholarly articles devoted to it at over 2000.
Learning psychology and game theory
Where game players can learn to estimate the likelihood of other players defecting, their own
behaviour is influenced by their experience of that of others. Simple statistics show that
inexperienced players are more likely to have had, overall, atypically good or bad interactions with
other players. If they act on the basis of these experiences (by defecting or cooperating more than
they would otherwise) they are likely to suffer in future transactions. As more experience is accrued
a truer impression of the likelihood of defection is gained and game playing becomes more
successful. The early transactions experienced by immature players are likely to have a greater effect
on their future playing than would such transactions affect mature players. This principle goes part
way towards explaining why the formative experiences of young people are so influential and why
they are particularly vulnerable to bullying, sometimes ending up as bullies themselves.
The likelihood of defection in a population may be reduced by the experience of cooperation in
earlier games allowing trust to build up[6]. Hence self-sacrificing behaviour may, in some instances,
strengthen the moral fibre of a group. If the group is small the positive behaviour is more likely to
feedback in a mutually affirming way encouraging individuals within that group to continue to
cooperate. This is allied to the twin dilemma of encouraging those people whom one would aid to
indulge in behaviour that might put them at risk. Such processes are major concerns within the study
of reciprocal altruism, group selection, kin selection and moral philosophy.
Friend or Foe?
Friend or Foe? is a game show that aired from 2002 to 2005 on the Game Show Network in the
United States. It is an example of the prisoner's dilemma game tested by real people, but in an
artificial setting. On the game show, three pairs of people compete. As each pair is eliminated, they
play a game of Prisoner's Dilemma to determine how their winnings are split. If they both cooperate
("Friend"), they share the winnings 50-50. If one cooperates and the other defects ("Foe"), the
defector gets all the winnings and the cooperator gets nothing. If both defect, both leave with nothing.
Notice that the payoff matrix is slightly different from the standard one given above, as the payouts
for the "both defect" and the "cooperate while the opponent defects" cases are identical. This makes
the "both defect" case a weak equilibrium, compared with being a strict equilibrium in the standard
prisoner's dilemma. If you know your opponent is going to vote "Foe", then your choice does not
affect your winnings. In a certain sense, "Friend or Foe" has a payoff model between "Prisoner's
Dilemma" and "Chicken".
The payoff matrix is:
If both players cooperate, each gets +1.
If both defect, each gets 0.
If A cooperates and B defects, A gets 0 and B gets +2.
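The weak-equilibrium point can be checked directly; the following sketch (in Python; an illustration, not from the show or the article) compares the two choices against each possible opponent move in the payoff model above.

```python
# Illustrative check that "Foe" is only weakly dominant in the Friend or Foe payoffs,
# unlike the strict dominance of defection in the standard prisoner's dilemma.
payoff = {  # (my choice, opponent's choice): my payoff
    ("Friend", "Friend"): 1, ("Friend", "Foe"): 0,
    ("Foe", "Friend"): 2,    ("Foe", "Foe"): 0,
}
for theirs in ("Friend", "Foe"):
    friend, foe = payoff[("Friend", theirs)], payoff[("Foe", theirs)]
    print(f"Opponent plays {theirs}: Friend pays {friend}, Foe pays {foe}")
# Against "Friend", Foe is strictly better (2 > 1); against "Foe", the two choices tie (0 = 0),
# which is why "both defect" is only a weak equilibrium here.
```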
Friend or Foe would be useful for someone who wanted to conduct a real-life analysis of the prisoner's dilemma. Notice that participants only get to play once, so all the issues involving repeated playing
are not present and a "tit for tat" strategy cannot develop.
In Friend or Foe, each player is allowed to make a statement to convince the other of his friendliness
before both make the secret decision to cooperate or defect. One possible way to 'beat the system'
would be for a player to tell his rival, "I am going to choose foe. If you trust me to split the winnings
with you later, choose friend. Otherwise, if you choose foe, we both walk away with nothing." A
greedier version of this would be "I am going to choose foe. I am going to give you X%, and I'll take
(100-X)% of the total prize package. So, take it or leave it, we both get something or we both get
nothing." (As in the Ultimatum game.) Now, the trick is to minimise X such that the other contestant
will still choose friend. Basically, the player has to know the threshold at which the utility his
opponent gets from watching him receive nothing exceeds the utility he gets from the money he
stands to win if he just went along.
This approach was never tried in the game; it's possible that the judges might not allow it, and that
even if they did, inequity aversion would produce a lower expected payoff from using the tactic.
(Ultimatum games in which this approach was attempted have led to rejections of high but unequal
offers – in some cases up to two weeks' wages have been turned down in preference to both players
receiving nothing.)
References

Axelrod, R. (1981). "The Evolution of Cooperation". Science, 211(4489): 1390–1396.
Axelrod, R. (1984). The Evolution of Cooperation. ISBN 0465021212.
Dresher, M. (1961). The Mathematics of Games of Strategy: Theory and Applications. Prentice-Hall, Englewood Cliffs, NJ.
Flood, M.M. (1952). Some Experimental Games. Research memorandum RM-789. RAND Corporation, Santa Monica, CA.
Poundstone, W. (1992). Prisoner's Dilemma. Doubleday, New York, NY.
Greif, A. (2006). Institutions and the Path to the Modern Economy: Lessons from Medieval Trade. Cambridge University Press, Cambridge, UK.
1. ^ A simple "tell" that partially or wholly reveals one player's choice – such as the Red player playing their
Cooperate card face-up – does not change the fact that Defect is the dominant strategy. When one is considering
the game itself, communication has no effect whatsoever. However, when the game is being played in real life
considerations outside of the game itself may cause communication to matter. It is a point of utmost importance to
the full implications of the dilemma that when we do not need to take into account external considerations, singleinstance Prisoner's Dilemma is not affected in any way by communications.
Even in single-instance Prisoner's Dilemma, meaningful prior communication could alter the play environment, by
raising the possibility of enforceable side contracts or credible threats. For example, if the Red player plays their
Cooperate card face-up and simultaneously reveals a binding commitment to blow the jail up if and only if Blue
Defects (with additional payoff -11,-10), then Blue's Cooperation becomes dominant. As a result, players are
screened from each other and prevented from communicating outside of the game.
2. ^ Hofstadter, Douglas R. (1985). Metamagical Themas: Questing for the Essence of Mind and Pattern. Bantam Dell Pub Group. ISBN 0-465-04566-9. See Ch. 29, "The Prisoner's Dilemma Computer Tournaments and the Evolution of Cooperation".
3. ^ As well as being an explanation for the lack of petty theft, magical thinking has been used to explain such
things as voluntary voting (where a non-voter is considered a free rider). Potentially, it might be used to explain
Wikipedia contributions: Text may be added under the assumption that if contributions are not made, then similar
people will also fail to contribute (i.e. arguing from effect to cause). Alternatively, the explanation could depend on
expected future actions (and not require a magical connection). Modelling future interactions requires the addition
of the temporal dimension, as given in the Iterated prisoner’s dilemma section.
4. ^ For example see the 2003 study “Bayesian Nash equilibrium; a statistical test of the hypothesis” for
discussion of the concept and whether it can apply in real economic or strategic situations (from Tel Aviv
University).
5. ^ http://www.prisoners-dilemma.com/results/cec04/ipd_cec04_full_run.h