Concurrent Reachability Games

Mini-course on algorithmic aspects of
stochastic games and related models
Marcin Jurdzinski (University of Warwick)
Peter Bro Miltersen (Aarhus University)
Uri Zwick (Tel Aviv University)
Oct. 31 – Nov. 2, 2011
15-03-2016
1
Day 3
Wednesday, November 2
Peter Bro Miltersen (Aarhus University)
Imperfect Information Stochastic Games
Plan
• Introduction to imperfect information
(concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic
geometry
• Combinatorial algorithms for ”qualitatively”
solving concurrent stochastic games.
Perfect Information (a.k.a. turn-based)
stochastic games
[Figure: a game graph with MAX, min, and RAND positions (RAND branching 1/2, 1/2) and rewards R.]
Objectives:
MAX/min the probability of getting to a GOAL,
MAX/min discounted sum of rewards,
MAX/min limiting average rewards
Imperfect Information (a.k.a. concurrent)
stochastic games
[Figure: the same game graph with MAX, min, and RAND positions (RAND branching 1/2, 1/2) and rewards R.]
Objectives:
MAX/min the probability of getting to a GOAL,
MAX/min discounted sum of rewards,
MAX/min limiting average rewards
Concurrent Reachability Games
GOAL
Dante (Row player) wants to reach GOAL.
Lucifer (Column player) wants to prevent Dante from reaching GOAL.
Concurrent Reachability Game
• Arena:
– Finite directed graph.
– One terminal GOAL node, a terminal trap node, N non-terminal
nodes.
– Each non-terminal node contains an m x m matrix of outgoing
arcs.
• Play:
– A pebble moves from position to position.
– In each step, Dante chooses a row and Lucifer simultaneously
chooses a column of the matrix.
– The pebble moves along the appropriate arc.
– If Dante reaches the GOAL position, he wins.
– If this never happens, Lucifer wins.
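One possible (hypothetical, not from the slides) way to encode such an arena: each non-terminal node stores an m × m matrix whose entries are successor positions, and a play step just indexes into it with the two simultaneous choices.

```python
# toy arena: node 0 is Matching-Pennies-like; matrix entries are successors
# ("GOAL", "TRAP", or a node index) -- a hypothetical encoding
arena = {0: [["GOAL", 0],
             [0, "GOAL"]]}

def step(node, dante_row, lucifer_col):
    # Dante picks a row, Lucifer simultaneously picks a column;
    # the pebble moves along the chosen arc
    return arena[node][dante_row][lucifer_col]

print(step(0, 0, 0))  # GOAL
print(step(0, 0, 1))  # 0 (back to node 0)
```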
Simulation
MAX
Simulation
min
Simulation
1/2
1/2
R
…. It is somewhat more subtle to see that this works!
”Proof” of correctness
• We want values in the CRG to be the same as in
the turn based reachability game.
• In particular, the value of the node simulating a
coin toss should be the average of the values of
the two nodes it points to.
• If these two values are the same, this is ”clearly”
the case.
• If they have different values v1, v2, the simulated coin toss node is a game of Matching Pennies with payoffs v1, v2. This game has value (v1+v2)/2.
Matrix games

Matching Pennies:

                  Hide heads up   Hide tails up
 Guess heads up         1               0
 Guess tails up         0               1

Solving matrix games

Matching Pennies, with optimal mixed strategies in the margins:

                  Hide heads up   Hide tails up
 Guess heads up         1               0          1/2
 Guess tails up         0               1          1/2
                       1/2             1/2

The value of the game is ½ and the stated strategies are optimal in the minimax sense: they assure the best possible expected payoff against a worst-case opponent. Values and optimal strategies are found in polynomial time using linear programming.
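For 2×2 games the linear program has a closed-form solution; the following Python sketch uses it (after first checking for a pure saddle point) and recovers the value ½ and the (½, ½) strategy for Matching Pennies. This is an illustration, not the LP route the slide mentions.

```python
def solve_2x2(a, b, c, d):
    # matrix game [[a, b], [c, d]]; the row player maximizes
    lower = max(min(a, b), min(c, d))   # best pure row guarantee
    upper = min(max(a, c), max(b, d))   # best pure column guarantee
    if lower == upper:                  # saddle point: pure strategies suffice
        return lower, None
    denom = a + d - b - c               # nonzero in the fully mixed case
    value = (a * d - b * c) / denom
    p_row1 = (d - c) / denom            # optimal probability of playing row 1
    return value, p_row1

print(solve_2x2(1, 0, 0, 1))  # (0.5, 0.5): Matching Pennies
```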
Concurrent reachability games
generalize matrix games
GOAL
Variants
(in increasing order of generality)
• Shapley’s model (1953): Rewards on arcs,
payoff is the discounted sum of total rewards,
(or undiscounted sum, with non-zero stopping probability).
• Everett’s model (1957): (weighted)
reachability games.
• Gillette’s model (1957): Rewards on arcs,
limiting average payoffs.
Why are reachability games more
general than discounted payoff
games?
Affinely scale all rewards to make probabilities well-defined.
Works also for the perfect information case.
Why are limiting average payoff
games more general than
reachability games?
Turn-based stochastic games
Values: every vertex i in the game has a value v_i.
Both players have positional optimal strategies: there are strategies that are optimal for every starting position.
(The relevant strategy classes for each player are positional vs. general.)
Discounted stochastic games (Shapley)
Values: every vertex i in the game has a value v_i.
Both players have stationary optimal strategies: there are strategies that are optimal for every starting position.
Stationary: as positional, except that we allow randomization.
(The relevant strategy classes for each player are stationary vs. general.)
Existence of value and optimal
stationary strategies (Shapley)
• The value vector is the unique fixed point of
value iteration.
• The optimal stationary strategies are the
optimal strategies in the matrix games arising
in the natural definition of value iteration.
The value equation

y_i = val[ ( c_ab + Σ_j p_abj · y_j )_{a,b} ]

• When we demand that Σ_j p_abj ≤ λ < 1, this system of equations has a unique solution, as it describes the fixed point of a contraction (called value iteration).
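To make the contraction concrete, here is a minimal one-position sketch in Python (hypothetical numbers: the stage game is Matching Pennies, and with probability λ = 0.9 the pebble returns to the same position, so the fixed point is y = 0.5/(1 − λ) = 5). The 2×2 game value is computed in closed form rather than by linear programming.

```python
def val_2x2(a, b, c, d):
    # value of the matrix game [[a, b], [c, d]] for the row maximizer
    lower = max(min(a, b), min(c, d))   # best pure row guarantee
    upper = min(max(a, c), max(b, d))   # best pure column guarantee
    if lower == upper:                  # saddle point
        return lower
    return (a * d - b * c) / (a + d - b - c)  # fully mixed case

lam = 0.9   # continuation probability; stopping probability 1 - lam
y = 0.0
for _ in range(1000):
    # stage payoffs are Matching Pennies; play continues with prob. lam
    y = val_2x2(1 + lam * y, lam * y, lam * y, 1 + lam * y)

print(y)  # converges to the unique fixed point 0.5 / (1 - lam) = 5
```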
Bad news
• The value of a discounted stochastic game
with rational rewards may be irrational.
• Deciding if the value of a given discounted stochastic game is greater than or equal to 0 is SQRT-SUM hard (exercise)
SQRT-SUM hardness
• SQRT-SUM: Given an expression E which is a weighted (by integers) sum of square roots (of integers), does E evaluate to a positive number?
• Not known to be in P or NP or even the
polynomial hierarchy (open since pointed out by
Johnson in the 70s).
• Etessami and Yannakakis, 2005: Comparing the
value of a Shapley game to a rational number is
hard for SQRT-SUM.
Concurrent Reachability Games
Values: every vertex i in the game has a value v_i, defined with sup rather than max for Dante.
Lucifer has a stationary optimal strategy.
Stationary: as positional, except that we allow randomization.
There are strategies that are good for every starting position.
Why sup instead of max
GOAL
Why sup/inf instead of max/min
• ”Conditionally repeated matching pennies”:
– Lucifer hides a penny
– Dante tries to guess if it is heads up or tails up.
– If Dante guesses correctly, he gets the penny.
– If Dante incorrectly guesses tails, he loses (goes
into trap)
– If Dante incorrectly guesses heads, play repeats.
• What is the value of this game? 1
Almost optimal strategy for Dante
• Guess ”heads” with probability 1-ε and ”tails” with probability ε (every time).
• Guaranteed to win with probability 1-ε.
• But no strategy of Dante wins with probability 1.
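A back-of-the-envelope check of this claim (my derivation, from the round-outcome probabilities of the game above): against a stationary Lucifer who hides heads with probability q, Dante's eventual win probability has a closed form, and its worst case over q is exactly 1-ε.

```python
def dante_win_prob(eps, q):
    # Dante guesses heads w.p. 1 - eps; Lucifer hides heads w.p. q
    win  = q * (1 - eps) + (1 - q) * eps   # correct guess this round
    lose = q * eps                         # incorrect "tails" guess
    # with the remaining probability (1 - q)(1 - eps) the round repeats,
    # so the eventual win probability is win / (win + lose)
    return win / (win + lose)

eps = 0.01
worst = min(dante_win_prob(eps, q / 100) for q in range(101))
print(worst)  # worst case is q = 1 (always hide heads), giving 1 - eps
```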
Weighted Concurrent Reachability Games
Values: every vertex i in the game has a value v_i, defined with sup for Dante and inf for Lucifer.
Stationary: as positional, except that we allow randomization.
There are strategies that are good for every starting position.
Limiting Average Stochastic Games (Gillette)
Values: every vertex i in the game has a value v_i, defined with sup and inf over general strategies.
Lucifer has a stationary optimal strategy.
Stationary: as positional, except that we allow randomization.
There are strategies that are good for every starting position.
Why is this surprising?
• With limiting average payoffs, the finite
sequence of rewards achieved in the past has
absolutely no influence on the final payoff.
• Yet, to play optimally, it is necessary to take
the past into account….(????)
The Big Match
Gillette 1957, Blackwell and Ferguson 1968
• Once a day, Player 2 hides a penny.
• Player 1 has to guess if it is heads up or tails up. If
he guesses correctly, he gets the penny.
• The first time (if ever) Player 1 guesses ”tails up”
Player 2 stops hiding pennies and the following
happens.
– If Player 1 guessed correctly, he gets a penny each day
from now on and forever.
– If Player 1 guessed incorrectly, he never gets a penny
again.
The big match

Payoff matrix (rows: Player 1 guesses heads/tails; columns: Player 2 hides heads/tails; starred entries are absorbing, paid every day from then on):

  1    0
  0*   1*

• Limiting average payoff:
lim inf_{t→∞} (r_1 + r_2 + … + r_t)/t
Analysis
• Can Player 1 ensure a better expected payoff than
½?
– No! Player 2 can prevent Player 1 from such a payoff
by uniform play.
• Can Player 1 ensure payoff close to ½?
– Open between 1957 and 1968….
– Blackwell and Ferguson, 1968. Yes!
Guess tails with probability
p_tails = 1/(#tails seen − #heads seen + M)²
where M is a big number.
Exercise!
Mertens and Neyman 1981
• Every finite stochastic game with limiting
average payoff has a value
• The value is the limit of the values of the corresponding discounted games as the discount factor approaches 1.
• The value is the limit of the value of the
corresponding time bounded game as the
time bound approaches infinity.
Algorithmic problems
• Qualitatively solving a game with 0-1 rewards.
– Determining which nodes have value 1.
• Quantitatively solving a game.
– Deciding if the value of the game is bigger than α
– Approximately computing the values of the nodes.
• Strategically solving a game.
– Computing an ε-optimal stationary strategy for a given ε.
The algorithmic lens:
Polynomial time reductions
[Diagram: a hierarchy of stochastic-game models, with SQRT-SUM hardness above a certain line.]
• Shapley: Payoff is sum of rewards, non-zero stopping probability.
• Everett: Payoffs occur at terminals.
• Gillette: Limiting average payoffs.
• An arrow from A to B means: comparing the value of an instance of A to a given rational number polynomial-time reduces to comparing the value of an instance of B to a given rational number.
• None of these problems are known to be polynomial-time solvable or NP- or PPAD-hard.
• We don’t know how to reverse any arrow.
• Note the curious different order of Shapley and Gillette in the bottom and top of the diagram…..
[Diagram continued: a red line separates the models; SQRT-SUM hardness above the line.]
• Below the red line, variants of Howard’s algorithm (1960) solve the games efficiently in practice, except on very carefully constructed instances.
• Above the red line, no universally applicable, practically efficient algorithm is known. Simple 1-position examples make Howard’s algorithm take exponential time, even to achieve a non-trivial approximation.
• All games can be solved in PSPACE by a reduction to decision procedures for the first-order theory of the real numbers. For games with few positions, this approach can be refined somewhat.
Plan
• Introduction to imperfect information
(concurrent) stochastic games
• Analysis of the complexity of strategy
iteration
• Algorithms based on real and semi-algebraic
geometry
• Combinatorial algorithms for ”qualitatively”
solving concurrent stochastic games.
Solving stochastic games
Howard’s algorithm!
– a.k.a. policy improvement, policy iteration, strategy improvement, strategy iteration.
– Howard is to Shapley’s stochastic games what the simplex algorithm is to
linear programming, ”polynomial time in practice”
– Preferred algorithm for solving parity games is adaptation of Howard due to
Jurdzinski and Voge (2000).
– Conjectured until 2009 to be polynomial time for the perfect information
case. Oliver Friedmann in 2009 found examples showing otherwise.
– The examples were obtained by looking at the parity game case!
Strategy iteration
(Howard’s algorithm)
• Start with an arbitrary stationary strategy for the protagonist
• Solve the resulting one-player game (MDP) for the antagonist.
Compute expected total payoff for each position
• For all positions, if the present strategy is not maximin for the matrix game obtained by replacing pointers with expected total payoffs, switch to a maximin mixed strategy.
• Iterate!
With Uri’s terminology, this is the SWITCH-ALL / GREEDY version of
strategy iteration.
Howard has been adapted to the entire
yellow region…..
Howard’s algorithm for CRGs
Chatterjee, de Alfaro, Henzinger ’06
Solve Markov
Decision Process
Solve matrix game
Properties
• The valuations v_i^t converge to the values v_i (from below).
• The strategies x_t guarantee the valuations v_i^t for Dante.
• What is the number of iterations required to
guarantee a good approximation?
Hansen, Ibsen-Jensen, M., 2011
• Solving concurrent reachability games using strategy iteration has worst-case time complexity doubly exponential in the size of the input.
• This is both an upper and a lower bound. For games with N positions and m actions for each player in each position:
– (1/ε)^(m^(N/4)) iterations are (sometimes) necessary to get an ε-approximation of the value.
– (1/ε)^(m^(O(N))) iterations are always sufficient.
Dante in Purgatory
(Hansen, Koucky, Miltersen, LICS’09)
• Purgatory has 7 terraces. Dante enters Purgatory at terrace 1.
• While in Purgatory, once a second, Dante must play Matching Pennies with Lucifer.
• If Dante wins, he proceeds to the next terrace.
• If Dante wins Matching Pennies at terrace 7, he wins the game of Purgatory.
• If Dante loses Matching Pennies guessing Heads, he goes back to terrace 1.
• If Dante loses Matching Pennies guessing Tails….. he loses the game of Purgatory!!!!
Dante in Purgatory, summary
• Once a second, Lucifer hides
a coin and Dante guesses if
it is heads up or tails up.
• If Dante guesses correctly 7
times in a row, he goes to
Paradise.
• If Dante ever incorrectly guesses tails, he goes to Hell.
Dante in Purgatory
• Is there a strategy for Dante so that he is guaranteed to win the game of Purgatory with probability at least 90%?
– Yes.
A bit surprising – when Dante wins, he has guessed
the parity of the coin correctly seven times in a row!
• How long can Lucifer confine Dante to Purgatory if
Dante plays by such a strategy?
– 10^55 years.
Purgatory
• P(N) = Purgatory with N terraces.
• val(P(N)) = 1.
• For N > 3, let T = 2^(2^(N-1)). Then val(P(N)_T) < 0.68.
• With N = 7 and 1 move per second, T = 2^(2^(N-1)) means….
• 500 billion years.
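For scale, the time bound above is easy to check directly:

```python
N = 7
T = 2 ** (2 ** (N - 1))            # 2^64 moves, at one move per second
years = T / (60 * 60 * 24 * 365)
print(f"{years:.1e} years")        # on the order of hundreds of billions of years
```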
Why does Purgatory have value 1?
Relevance for complexity analysis?
• The difference between val(P(N)T) and
val(P(N)) directly captures how well a certain
algorithm approximates the value of P(N) after
T iterations.
• Which algorithm?
Value iteration!
Value iteration
Value iteration computes the value of the time bounded game,
for larger and larger values of the time bound, by backward induction.
The game is not stopping, so value iteration is not a contraction!
So why correct?
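A concrete stdlib-only sketch of value iteration on Purgatory itself (the encoding is mine, not from the slides): the matrix game at each terrace has entries "correct guess → next terrace", "wrong heads → terrace 1", "wrong tails → trap", and a 2×2 game value has a closed form, so no LP solver is needed.

```python
def val_2x2(a, b, c, d):
    # value of [[a, b], [c, d]] for the row maximizer
    lower = max(min(a, b), min(c, d))
    upper = min(max(a, c), max(b, d))
    if lower == upper:
        return lower
    return (a * d - b * c) / (a + d - b - c)

def purgatory_value_iteration(N, T):
    v = [0.0] * N + [1.0]   # v[i] = valuation of terrace i+1; v[N] is GOAL
    for _ in range(T):
        w = v[:]
        for i in range(N):
            # rows: Dante guesses heads/tails; columns: Lucifer hides heads/tails
            w[i] = val_2x2(v[i + 1], v[0], 0.0, v[i + 1])
        v = w
    return v[0]             # valuation of the lowest terrace after T steps

print(purgatory_value_iteration(7, 100000))  # still far below the true value 1
```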
Mertens and Neyman 1981
• Every finite stochastic game with limiting
average payoff has a value.
• The value is the limit of the values of the corresponding discounted games as the discount factor approaches 1.
• The value is the limit of the value of the
corresponding time bounded game as the
time bound approaches infinity.
Connection to Strategy iteration
• As for the case of MDPs, we can relate the
valuations computed by strategy iteration to
the valuations computed by value iteration.
[Diagram relating the actual values, the valuations computed by value iteration, and the valuations computed by strategy iteration.]
Why is one iteration of (switch-all)
strategy iteration better than one
iteration of value iteration?
• Let X be a positional strategy, guaranteeing a
value vector v.
• Let Y be the strategy obtained after one
iteration of strategy iteration.
• The value vector obtained after applying value
iteration to v is the values guaranteed by
”Apply Y once, then apply X forever”.
Strategy iteration is slow on Purgatory

#iterations      Valuation of lowest terrace
1                0.01347
10               0.03542
100              0.06879
1000             0.10207
10000            0.13396
100000           0.16461
1000000          0.19415
10000000         0.22263
100000000        0.24828
> 2·10^65        0.9
> 10^128         0.99
Main result, recapitulated
• For games with N positions and m actions for each player in each position:
– (1/ε)^(m^(N/4)) iterations are (sometimes) necessary to get an ε-approximation of the value.
– (1/ε)^(m^(O(N))) iterations are always sufficient.
• For the lower bound, we generalize Purgatory
to more than 2 actions.
Generalized Purgatory P(N,m)
• Lucifer repeatedly hides a number between 1
and m.
• Dante must try to guess the number.
• If he guesses correctly N times in a row, he
wins the game.
• If he ever guesses incorrectly by overshooting Lucifer’s number, he loses the game.
Why strategy iteration is slow on Purgatory (sketch!)
• Strategy iteration on Purgatory with n terraces computes the same sequence of strategies for the lowest terrace as strategy iteration on Purgatory with one terrace only.
• Strategy iteration and value iteration are in sync when applied to Purgatory with one terrace.
• We derive a closed-form formula for the strategy and observe that the patience of the strategy computed after few iterations is low.
• Patience = 1/(smallest non-zero probability used).
• When a pure best reply is played against a strategy of low patience, the play terminates quickly.
• We already know that such strategies do not do well for Purgatory, so the strategies computed are not very good.
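For one-terrace Purgatory this can be made completely explicit (my derivation, consistent with the closed form mentioned above): if the current valuation is v, the matrix game with pointers replaced by valuations is [[1, v], [0, 1]] (correct guess → GOAL, wrong heads → back, wrong tails → trap), whose value is 1/(2 − v). Iterating from v₀ = 0 gives v_t = t/(t + 1), so Θ(1/ε) iterations are needed for an ε-approximation.

```python
v = 0.0
for t in range(1000):
    v = 1.0 / (2.0 - v)   # one update: value of the game [[1, v], [0, 1]]
print(v)  # v_t = t / (t + 1); after 1000 updates, still 1/1001 short of 1
```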
Upper bound
• For any CRG with N positions and m actions for each player in each position, (1/ε)^(m^(O(N))) iterations are sufficient to achieve an ε-optimal strategy.
• … to show this, we need a detour.
Plan
• Introduction to imperfect information
(concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic
geometry
• Combinatorial algorithms for ”qualitatively”
solving concurrent stochastic games.
A generic algorithm for determining
values of stochastic games
• The property of being a number larger or
smaller than the value of a CRG can be
expressed by a polynomial length formula in
the existential first order theory of the reals.
• There exists a stationary strategy such that….
A generic algorithm for determining
values
• As a corollary to decision
procedures of semi-algebraic
geometry, comparing the value
of a stochastic game of any of the
kinds we have seen to a rational
number is in PSPACE.
• This is the best known ”complexity
class” upper bound!
• …. can this semi-algebraic approach be refined?
ctic.au.dk
Exact Algorithms for Solving
Stochastic Games
Kristoffer Arnsfelt Hansen
Michal Koucky
Niels Lauritzen
Peter Bro Miltersen
Elias Tsigaridas
Slogan of approach
• Doing numerical analysis/optimization in
dangerous waters using real algebraic
geometry.
• Why are the waters dangerous?
• Small perturbations mean everything!!!
Purgatory
• The value of Purgatory is 1
– Dante can win the game with probability 1-ε for any ε > 0.
• Any strategy that guarantees a win with probability > 0.9 must use probabilities smaller than (1/10)^(2^(7-1)) = 10^(-64).
Solving Stochastic Games
• Input: [a concurrent game, given as a figure]
• Output: 1
HKLMT’11
• Good news: Stochastic games of all kinds with
a constant number of positions can be solved
exactly in polynomial time.
– In contrast, Howard’s algorithm (strategy
iteration) uses exponential time to get rough
approximation even for one-position games.
– ”Generic” PSPACE algorithm is exponential even
for one-position game.
• Bad news: Complexity is something like exp(O(N log N))…..
Recursive Bisection Algorithm

Query: is the value of a chosen position ≥ 0.5?
1. Replace the position with the target value 0.5.
2. Recursively solve the smaller game (yielding values such as 0.8, 0.2, 0.9).
3. Reinstate the position.
4. Replace its outgoing pointers with the computed values.
5. Solve the resulting matrix game (value, e.g., 0.62).
Since 0.62 > 0.5, the answer is: Yes!
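The outer loop is ordinary bisection on the answer; below is a Python sketch in which the hypothetical oracle `value_at_least` stands in for steps 1–5 (here replaced by a toy oracle whose true value is 0.62).

```python
def bisect_value(value_at_least, eps=1e-6):
    # binary search for the value of the position, given an oracle
    # answering "is the value >= t?" (implemented via steps 1-5)
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if value_at_least(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

v = bisect_value(lambda t: t <= 0.62)   # toy stand-in for the recursion
print(round(v, 4))                      # 0.62
```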
What’s the catch?
• We can compare the value of a position in an
N-position game to a given rational number
(and do binary search) if we recursively can
solve an (N-1)-position game exactly!
– 0.5 vs. 0.5000000000000000000000000001
– Will happen on simple examples such as
Purgatory.
• To get implementable recursive algorithm, we
must replace ”exactly” with ”approximately”.
Real algebraic geometry to the rescue
• To resolve 0.5 vs. 0.5000000000000000001
issue, we need separation bounds.
• Separation bound: If games X and Y of certain
parameters have different values, they differ
by at least ε.
• Obtaining separation bounds for values of
stochastic games using real algebraic
geometry is the technical meat of our work.
Isolated root theorem
• Given a polynomial system:
– f1(x1, x2, …, xn) = … = fm(x1, x2, …, xn) = 0
• with each fj in Q[x1, x2, …, xn], of total degree d and with an isolated root x* in R^n.
• Then, the algebraic degree of each x*_i is at most (2d+1)^n.
• Best previously published bound (Basu et al): (O(d))^n.
• Open: Is d^n possible?
End of detour
• Strategy iteration (Howard’s algorithm) finds an ε-approximation to the value of a concurrent reachability game with N positions and m actions per position after (1/ε)^(m^(O(N))) iterations.
– Tight (ish)! [Hansen et al, CSR’11].
– Best previous bound: (1/ε)^(2^(31mN)) iterations [Hansen et al, LICS’09].
– Slogan: Algorithm analysis using real algebraic geometry.
Algorithm analysis by R.A.G.
• Strategy iteration needs at most as many iterations as value iteration.
• The complexity of value iteration on a game is captured by the difference in value between the time-bounded and infinite versions of the game.
• The difference in value between the time-bounded and infinite versions of the game is upper bounded as a function of its patience.
– Patience = 1/(smallest probability used in a near-optimal strategy).
• Thus, to get an upper bound on complexity, we need an upper bound on the patience of near-optimal strategies.
• Everett (1957) exhibits near-optimal strategies that are characterized by certain formulas of first-order logic over the reals. The number of variables can be reduced by applying Cramer’s rule, at the expense of blowing up the size superpolynomially.
• Applying the sampling theorem of Basu et al to these formulas, we get a bound on the algebraic degree of the resulting numbers.
• By applying standard separation bounds on algebraic numbers, we get the bound on patience, leading to the desired bound on the time complexity of strategy iteration.
A non-algebraic approach to an
even better upper bound
• Is Purgatory extremal with respect to patience
among n-node CRGs?
• If yes, this gives a better upper bound on
number of iterations of value iteration for CRGs,
replacing O(m) with m+o(m)!
Compare: Condon’s example. Extremal with respect to, e.g., expected absorption time.
Perspectives….
• Practical algorithm?
– Can be made more practical by ”iterated precision
extension”.
• Better algorithms using more clever numerical ideas? Newton
etc.?
– But very tricky in this domain:
• Precision issues.
• Only ”piecewise” smooth nature of domain.
• Tighter analysis of Howard?
• More ”big-O-less” real algebraic geometry and semi-algebraic
geometry?
• More numerical algorithms in dangerous waters using r.a.g.?
Representing strategies for CRGs
• The fact that the values can be approximated in PSPACE strongly suggests that PSPACE should be enough for “understanding” CRGs. But natural representations of near-optimal strategies require exponential space!
• Is there a “natural” representation of probabilities so that
– ε-optimal strategies of CRGs can be represented succinctly and
– ε-optimal strategies of CRGs can be computed using polynomial space
– or better?
• De Alfaro, Henzinger, Kupferman, FOCS’98: Yes, for the restricted case of CRGs where the values of all positions are 0 or 1.
Plan
• Introduction to imperfect information
(concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic
geometry
• Combinatorial algorithms for ”qualitatively”
solving concurrent stochastic games.
Qualitatively solving Concurrent
Reachability Games
• De Alfaro, Henzinger, Kupferman ’98:
– There is a (combinatorial) polynomial time
algorithm that finds those positions in a
concurrent reachability game that have value 1.
– In other words, those positions can be
”combinatorially characterized”.
LimitEscape states
• Let s ∈ C ⊆ U.
• We say that s ∈ LimitEscape(C,U) if for any number K, there is a strategy σ for Player 1 so that:
inf_π Pr[from s, (σ,π) leaves C in one step] > K · inf_π Pr[from s, (σ,π) leaves U in one step]
• ”Player 1 can leave C in one step with positive probability and also ensure that, given that he does leave C, he stays in U with high probability.”
Algorithm for deciding if s ∈ LimEscape(C,U)
• E1 = {(i,j) | in s, (i,j) leaves C with positive probability}
• E2 = {(i,j) | in s, (i,j) leaves U with positive probability}
• A0 = {i | ∀j: (i,j) ∉ E2}
• B0 = {j | ∃i ∈ A0: (i,j) ∈ E1}
• A1 = {i | ∀j: (i,j) ∉ E2 or j ∈ B0}
• B1 = {j | ∃i ∈ A1: (i,j) ∈ E1}
• A2 = {i | ∀j: (i,j) ∉ E2 or j ∈ B1}
• ….
• Until Bk = Bk+1.
• s ∈ LimEscape(C,U) if and only if Bk contains all actions of Player 2.
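A direct transcription of the fixed-point computation above (Python sketch; E1 and E2 are given as sets of action pairs, and the example inputs are hypothetical):

```python
def lim_escape(E1, E2, rows, cols):
    # E1: pairs (i, j) that leave C with positive probability
    # E2: pairs (i, j) that leave U with positive probability
    B = set()
    while True:
        # rows i whose only U-leaving columns are already in B
        A = {i for i in rows if all((i, j) not in E2 or j in B for j in cols)}
        # columns hit with positive C-leaving probability from such rows
        newB = {j for j in cols if any((i, j) in E1 for i in A)}
        if newB == B:
            break
        B = newB
    # s is in LimEscape(C, U) iff B covers all of Player 2's actions
    return B == set(cols)

rows, cols = {0, 1}, {0, 1}
print(lim_escape({(0, 0), (0, 1)}, set(), rows, cols))  # True
print(lim_escape({(0, 0)}, set(), rows, cols))          # False
```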
LimSafe(W)
• Let W be a set of states containing GOAL.
• LimSafe(W) is the largest subset V of W –
{GOAL} so that no state u in V is in
LimEscape(V,W).
• ”Player 2 can contain play in V with positive
probability”
• V := W – {GOAL}
• Repeat V := {s in V | s not in LimEscape(V,W)}
until stable
Algorithm for computing states of
value 1
• C0 = LimSafe(S)
• U1 = Safe1(S – C0)
• C1 = LimSafe(U1)
• U2 = Safe1(S – C1)
• ….
• Until Uk+1 = Uk.
• Return Uk.
(Safe1(X) = the largest subset of X that Player 1 can stay inside with certainty.)
ε-optimal strategies
• When all positions have value 0 or 1, the
correctness proof of the algorithm constructs
an ε-optimal strategy where each probability
is either of the form ε^r or 1 − ε^(r1) − ε^(r2) − …
• Open: Is it true in general that an ε-optimal
strategy for a concurrent reachability game
can be described as a sparse polynomial in ε?
The limiting average case
• Can value 1 positions in Gillette’s games with
rewards 0 and 1 (e.g., The Big Match) be
combinatorially characterized? And as a
consequence be determined in P?
• Question due to Rasmus Ibsen-Jensen.
A mysterious value 1 game due to Rasmus Ibsen-Jensen
The End!