Mini-course on algorithmic aspects of stochastic games and related models
Marcin Jurdzinski (University of Warwick)
Peter Bro Miltersen (Aarhus University)
Uri Zwick (武熠) (Tel Aviv University)
Oct. 31 – Nov. 2, 2011
15-03-2016

Day 3, Wednesday, November 2
Peter Bro Miltersen (Aarhus University)
Imperfect Information Stochastic Games

Plan
• Introduction to imperfect information (concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic geometry
• Combinatorial algorithms for "qualitatively" solving concurrent stochastic games

Perfect Information (a.k.a. turn-based) stochastic games
[Figure: a game graph with MAX nodes, min nodes, RAND nodes (probabilities 1/2, 1/2), and reward nodes R.]
Objectives: MAX/min the probability of getting to a GOAL; MAX/min the discounted sum of rewards; MAX/min the limiting average reward.

Imperfect Information (a.k.a. concurrent) stochastic games
[Figure: the same kind of game graph, but the two players now move simultaneously.]
Objectives: MAX/min the probability of getting to a GOAL; MAX/min the discounted sum of rewards; MAX/min the limiting average reward.

Concurrent Reachability Games
[Figure: a board with a GOAL position, shown over several animation slides.]
• Dante (row player) wants to reach GOAL.
• Lucifer (column player) wants to prevent Dante from reaching GOAL.

Concurrent Reachability Game
• Arena:
  – A finite directed graph.
  – One terminal GOAL node, a terminal trap node, N non-terminal nodes.
  – Each non-terminal node contains an m × m matrix of outgoing arcs.
• Play:
  – A pebble moves from position to position.
  – In each step, Dante chooses a row and Lucifer simultaneously chooses a column of the matrix.
  – The pebble moves along the appropriate arc.
  – If Dante reaches the GOAL position, he wins.
  – If this never happens, Lucifer wins.

Simulation
[Figures: a MAX node, a min node, and a RAND node (1/2, 1/2) of a turn-based game, each simulated by a concurrent node.]
It is somewhat subtle that this works!

"Proof" of correctness
• We want the values in the CRG to be the same as in the turn-based reachability game.
• In particular, the value of the node simulating a coin toss should be the average of the values of the two nodes it points to.
• If these two values are the same, this is "clearly" the case.
• If they have different values v1, v2, the simulated coin-toss node is a game of Matching Pennies with payoffs v1, v2. This game has value (v1+v2)/2.

Matrix games
Matching Pennies:
                   Hide heads up   Hide tails up
  Guess heads up        1               0
  Guess tails up        0               1

Solving matrix games
Matching Pennies, with mixed strategies (row/column probabilities in the margins):
                   Hide heads up   Hide tails up
  Guess heads up        1               0          1/2
  Guess tails up        0               1          1/2
                       1/2             1/2
The value of the game is 1/2, and the stated strategies are optimal in the minimax sense: they assure the best possible expected payoff against a worst-case opponent. Values and optimal strategies are found in polynomial time using linear programming.

Concurrent reachability games generalize matrix games
[Figure: a matrix game embedded as a one-position concurrent reachability game with arcs to GOAL.]

Variants (in increasing order of generality)
• Shapley's model (1953): rewards on arcs; payoff is the discounted sum of rewards (or the undiscounted sum, with non-zero stopping probability).
• Everett's model (1957): (weighted) reachability games.
• Gillette's model (1957): rewards on arcs; limiting average payoffs.

Why are reachability games more general than discounted payoff games?
Affinely scale all rewards to make probabilities well-defined; in each step the game then stops with probability 1 − λ, and the scaled reward is paid out as a probability of reaching GOAL.
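The linear-programming claim above can be made concrete. For 2×2 games an LP is not even needed: a pure saddle-point check plus the standard closed form suffices. A minimal sketch (the function name and framing are mine, not from the slides):

```python
def solve_2x2(a, b, c, d):
    """Value and an optimal row strategy for the zero-sum matrix game
    [[a, b], [c, d]], row player maximizing.  If there is a pure saddle
    point, pure strategies are optimal; otherwise both players mix and
    the standard closed form applies."""
    maximin = max(min(a, b), min(c, d))   # best guaranteed pure row payoff
    minimax = min(max(a, c), max(b, d))   # best guaranteed pure column payoff
    if maximin == minimax:                # pure saddle point
        row = (1.0, 0.0) if min(a, b) >= min(c, d) else (0.0, 1.0)
        return maximin, row
    denom = (a + d) - (b + c)
    p = (d - c) / denom                   # probability of playing the top row
    return (a * d - b * c) / denom, (p, 1.0 - p)

# Matching Pennies: value 1/2, uniform optimal strategy.
v, (p1, p2) = solve_2x2(1, 0, 0, 1)
```

For larger matrices one would set up the standard LP; the 2×2 closed form is enough for the examples in these slides.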
(The reduction works also for the perfect information case.)

Why are limiting average payoff games more general than reachability games?
[Figure: the reduction.]

Turn-based stochastic games
• Every vertex i in the game has a value v_i, attained both over positional and over general strategies.
• Both players have positional optimal strategies.
• There are strategies that are optimal for every starting position.

Discounted stochastic games (Shapley)
• Every vertex i in the game has a value v_i, attained both over stationary and over general strategies.
• Both players have stationary optimal strategies.
  – Stationary: as positional, except that we allow randomization.
• There are strategies that are optimal for every starting position.

Existence of value and optimal stationary strategies (Shapley)
• The value vector is the unique fixed point of value iteration.
• The optimal stationary strategies are the optimal strategies in the matrix games arising in the natural definition of value iteration.

The value equation
    y_i = val[ (c_{ab} + Σ_j p_{abj} y_j)_{a,b} ]
• When we demand that Σ_j p_{abj} = λ < 1, this system of equations has a unique solution, as it describes the fixed point of a contraction (called value iteration).

Bad news
• The value of a discounted stochastic game with rational rewards may be irrational.
• Deciding if the value of a given discounted stochastic game is greater than or equal to 0 is SQRT-SUM-hard (exercise).

SQRT-SUM hardness
• SQRT-SUM: given an expression E which is a weighted (by integers) sum of square roots (of integers), does E evaluate to a positive number?
• Not known to be in P or NP or even the polynomial hierarchy (open since pointed out by Johnson in the 70s).
• Etessami and Yannakakis, 2005: comparing the value of a Shapley game to a rational number is hard for SQRT-SUM.
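Shapley's value-iteration fixed point can be illustrated on a one-position game of my own: per-step reward matrix [[1, 0], [0, 1]], every arc looping back, and discount λ, so the value equation reads y = val[(c_ab + λy)_{a,b}] = 1/2 + λy, i.e. y = (1/2)/(1 − λ). A sketch under those assumptions (function names are mine):

```python
def matrix_game_value(a, b, c, d):
    """Value of the zero-sum matrix game [[a, b], [c, d]] (row maximizes)."""
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                       # pure saddle point
        return maximin
    return (a * d - b * c) / ((a + d) - (b + c))

def value_iterate(discount, steps):
    """Value iteration for the one-position discounted game: iterate
    y <- val[(c_ab + discount * y)_{a,b}].  The map is a contraction
    with factor `discount`, so y converges to (1/2)/(1 - discount)."""
    y = 0.0
    for _ in range(steps):
        y = matrix_game_value(1 + discount * y, discount * y,
                              discount * y, 1 + discount * y)
    return y

# With discount 1/2 the value equation gives y = (1/2)/(1 - 1/2) = 1.
y = value_iterate(0.5, 60)
```

The geometric convergence seen here is exactly the contraction property the slide appeals to; it fails once the stopping probability is allowed to be zero, which is the concurrent-reachability complication discussed later.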
Concurrent Reachability Games
• Every vertex i in the game has a value v_i, but for Dante only as a sup over stationary (equivalently, general) strategies, not a max.
• Lucifer has a stationary optimal strategy.
  – Stationary: as positional, except that we allow randomization.
• There are strategies that are good for every starting position.

Why sup instead of max
[Figures: a small concurrent reachability game with a GOAL position.]

Why sup/inf instead of max/min
• "Conditionally repeated Matching Pennies":
  – Lucifer hides a penny.
  – Dante tries to guess if it is heads up or tails up.
  – If Dante guesses correctly, he gets the penny.
  – If Dante incorrectly guesses tails, he loses (goes into the trap).
  – If Dante incorrectly guesses heads, play repeats.
• What is the value of this game? 1.

Almost optimal strategy for Dante
• Guess "heads" with probability 1 − ε and "tails" with probability ε (every time).
• Guaranteed to win with probability 1 − ε.
• But no strategy of Dante wins with probability 1.
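The 1 − ε guarantee above can be checked numerically. Against a fixed stationary strategy for Dante, Lucifer faces a Markov decision process, so he has a pure stationary best reply and it suffices to evaluate his two pure replies, "always heads" and "always tails". A sketch (function name mine):

```python
def guarantee(eps, rounds=5000):
    """Winning probability Dante secures in conditionally repeated Matching
    Pennies when he guesses heads w.p. 1-eps and tails w.p. eps, forever.
      vs 'always heads': Dante wins iff he guesses heads  ->  1 - eps.
      vs 'always tails': v solves v = eps + (1 - eps) * v, fixed point 1;
                         we approximate it by iterating from 0."""
    vs_heads = 1.0 - eps
    v = 0.0
    for _ in range(rounds):
        v = eps + (1.0 - eps) * v   # guess tails: win now; guess heads: repeat
    return min(vs_heads, v)

g = guarantee(0.01)   # approaches 0.99, as the slide claims
```

The minimum is always attained by "always heads", so the guarantee is exactly 1 − ε: arbitrarily close to 1, but never 1.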
Weighted Concurrent Reachability Games
• Every vertex i in the game has a value v_i, but only as a sup/inf over stationary (equivalently, general) strategies: neither player need have an optimal strategy.
  – Stationary: as positional, except that we allow randomization.
• There are strategies that are good for every starting position.

Limiting Average Stochastic Games (Gillette)
• Every vertex i in the game has a value v_i, but only as a sup/inf, and only over general (history-dependent) strategies: stationary strategies do not even guarantee the value approximately.
• There are strategies that are good for every starting position.

Why is this surprising?
• With limiting average payoffs, the finite sequence of rewards achieved in the past has absolutely no influence on the final payoff.
• Yet, to play optimally, it is necessary to take the past into account… (?!)

The Big Match (Gillette 1957; Blackwell and Ferguson 1968)
• Once a day, Player 2 hides a penny.
• Player 1 has to guess if it is heads up or tails up. If he guesses correctly, he gets the penny.
• The first time (if ever) Player 1 guesses "tails up", Player 2 stops hiding pennies and the following happens:
  – If Player 1 guessed correctly, he gets a penny each day from now on, forever.
  – If Player 1 guessed incorrectly, he never gets a penny again.

The Big Match
                 hide heads   hide tails
  guess heads        1            0
  guess tails        0            1
(The first "tails" guess is absorbing.)
• Limiting average payoff: lim inf_{t→∞} (r_1 + r_2 + … + r_t)/t

Analysis
• Can Player 1 ensure a better expected payoff than 1/2?
  – No! Player 2 can prevent Player 1 from such a payoff by uniform play.
• Can Player 1 ensure a payoff close to 1/2?
  – Open between 1957 and 1968…
  – Blackwell and Ferguson, 1968: Yes!
Guess tails with probability p_tails = 1/(#tails seen − #heads seen + M)², where M is a big number. Exercise!

Mertens and Neyman 1981
• Every finite stochastic game with limiting average payoff has a value.
• The value is the limit of the values of the corresponding discounted games as the discount factor approaches 1.
• The value is the limit of the values of the corresponding time-bounded games as the time bound approaches infinity.

Algorithmic problems
• Qualitatively solving a game with 0-1 rewards:
  – determining which nodes have value 1.
• Quantitatively solving a game:
  – deciding if the value of the game is bigger than α;
  – approximately computing the values of the nodes.
• Strategically solving a game:
  – computing an ε-optimal stationary strategy for a given ε.

The algorithmic lens: polynomial time reductions
[Figure: a diagram of reductions between the three models; SQRT-SUM hardness holds above a line in the diagram.]
• Shapley: payoff is a sum of rewards, non-zero stopping probability.
• Everett: payoffs occur at terminals.
• Gillette: limiting average payoffs.
• An arrow from A to B means: comparing the value of an instance of A to a given rational number polynomial-time reduces to comparing the value of an instance of B to a given rational number.
• None of these problems are known to be polynomial time solvable, or to be NP- or PPAD-hard.
• We don't know how to reverse any arrow.
• Note the curious different order of Shapley and Gillette at the bottom and top of the diagram…

[Figure: the same diagram with a red line; SQRT-SUM hardness holds above it.]
• Below the red line, variants of Howard's algorithm (1960) solve the games efficiently in practice, except on very carefully constructed instances.
• Above the red line, no universally applicable, practically efficient algorithm is known. Simple 1-position examples make Howard's algorithm take exponential time, even to achieve a non-trivial approximation.
• All games can be solved in PSPACE by a reduction to decision procedures for the first order theory of the real numbers.
For games of few positions, this approach can be refined somewhat.

Plan
• Introduction to imperfect information (concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic geometry
• Combinatorial algorithms for "qualitatively" solving concurrent stochastic games

Solving stochastic games: Howard's algorithm!
– a.k.a. policy improvement, policy iteration, strategy improvement, strategy iteration.
– Howard is to Shapley's stochastic games what the simplex algorithm is to linear programming: "polynomial time in practice".
– The preferred algorithm for solving parity games is an adaptation of Howard due to Vöge and Jurdziński (2000).
– It was conjectured until 2009 to be polynomial time for the perfect information case; Oliver Friedmann then found examples showing otherwise.
– The examples were obtained by looking at the parity game case!

Strategy iteration (Howard's algorithm)
• Start with an arbitrary stationary strategy for the protagonist.
• Solve the resulting one-player game (MDP) for the antagonist; compute the expected total payoff for each position.
• For all positions: if the present strategy is not maximin for the matrix game obtained by replacing pointers with expected total payoffs, switch to a maximin mixed strategy.
• Iterate!
In Uri's terminology, this is the SWITCH-ALL / GREEDY version of strategy iteration.

Howard's algorithm has been adapted to the entire yellow region of the reductions diagram…

Howard's algorithm for CRGs (Chatterjee, de Alfaro, Henzinger '06)
[Figure: the iteration alternates between "solve Markov decision process" and "solve matrix game".]

Properties
• The valuations v^t_i converge to the values v_i (from below).
• The strategies x^t guarantee the valuations v^t_i for Dante.
• What is the number of iterations required to guarantee a good approximation?
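On the one-position "conditionally repeated Matching Pennies" game from earlier, the alternation above can be written out explicitly; its valuations approach the true value 1 only at rate roughly 1 − 1/t. All names below are mine, and the evaluation step uses the game-specific fact that Lucifer's best reply is pure; this is a sketch, not the algorithm of the paper:

```python
def solve_2x2(a, b, c, d):
    """Value and optimal row strategy of the zero-sum game [[a, b], [c, d]]."""
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:               # pure saddle point
        return maximin, ((1.0, 0.0) if min(a, b) >= min(c, d) else (0.0, 1.0))
    denom = (a + d) - (b + c)
    return (a * d - b * c) / denom, ((d - c) / denom, (a - b) / denom)

def evaluate(p_heads):
    """MDP step: Lucifer's best stationary reply to Dante guessing heads
    w.p. p_heads.  Vs 'always heads' Dante wins w.p. p_heads; vs 'always
    tails' his win probability w solves w = (1 - p_heads) + p_heads * w,
    i.e. w = 1 for p_heads < 1.  So 'always heads' is the best reply."""
    return min(p_heads, 1.0)

def strategy_iteration(rounds):
    p_heads = 0.5                        # arbitrary initial stationary strategy
    v = 0.0
    for _ in range(rounds):
        v = evaluate(p_heads)            # valuation of the current strategy
        # switch step: matrix game with continuation value v for "repeat";
        # rows = guess heads/tails, columns = coin hidden heads/tails up
        _, (p_heads, _) = solve_2x2(1.0, v, 0.0, 1.0)
    return v

v100 = strategy_iteration(100)           # 100/101: still 1% short of the value 1
```

The valuations follow v_t = t/(t+1), so even this one-position game needs about 1/ε iterations for an ε-approximation.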
Hansen, Ibsen-Jensen, Miltersen, 2011
• Solving concurrent reachability games using strategy iteration has worst case time complexity doubly exponential in the size of the input.
• This is an upper and a lower bound. For games with N positions and m actions for each player in each position:
  – (1/ε)^(m^(N/4)) iterations are (sometimes) necessary to get an ε-approximation of the value.
  – (1/ε)^(m^(O(N))) iterations are always sufficient.

Dante in Purgatory (Hansen, Koucký, Miltersen, LICS'09)
[Figures: an animation over terraces 7, 6, 5, 4, 3, 2, 1.]
• Purgatory has 7 terraces. Dante enters Purgatory at terrace 1.
• While in Purgatory, once a second, Dante must play Matching Pennies with Lucifer.
• If Dante wins, he proceeds to the next terrace.
• If Dante wins Matching Pennies at terrace 7, he wins the game of Purgatory.
• If Dante loses Matching Pennies guessing heads, he goes back to terrace 1.
• If Dante loses Matching Pennies guessing tails… he loses the game of Purgatory!

Dante in Purgatory, summary
• Once a second, Lucifer hides a coin and Dante guesses if it is heads up or tails up.
• If Dante guesses correctly 7 times in a row, he goes to Paradise.
• If Dante ever incorrectly guesses tails, he goes to Hell.

Dante in Purgatory
• Is there a strategy for Dante so that he is guaranteed to win the game of Purgatory with probability at least 90%?
  – Yes. A bit surprising: when Dante wins, he has guessed the parity of the coin correctly seven times in a row!
• How long can Lucifer confine Dante to Purgatory if Dante plays by such a strategy?
  – 10^55 years.

Purgatory
• P(N) = Purgatory with N terraces.
• val(P(N)) = 1.
• For N > 3, let T = 2^(2^(N−1)). Then val(P(N)_T) < 0.68.
• With N = 7 and 1 move per second, T = 2^64 seconds means… 500 billion years.

Why does Purgatory have value 1?
[Figures: a sequence of slides sketching the argument.]

Relevance for complexity analysis?
• The difference between val(P(N)_T) and val(P(N)) directly captures how well a certain algorithm approximates the value of P(N) after T iterations.
• Which algorithm? Value iteration!

Value iteration
• Value iteration computes the value of the time-bounded game, for larger and larger values of the time bound, by backward induction.
• The game is not stopping, so value iteration is not a contraction! So why is it correct?

Mertens and Neyman 1981
• Every finite stochastic game with limiting average payoff has a value.
• The value is the limit of the values of the corresponding discounted games as the discount factor approaches 1.
• The value is the limit of the values of the corresponding time-bounded games as the time bound approaches infinity.

Connection to strategy iteration
• As for the case of MDPs, we can relate the valuations computed by strategy iteration to the valuations computed by value iteration.
[Figure: the actual values, the valuations computed by strategy iteration, and the valuations computed by value iteration.]

Why is one iteration of (switch-all) strategy iteration better than one iteration of value iteration?
• Let X be a positional strategy, guaranteeing a value vector v.
• Let Y be the strategy obtained after one iteration of strategy iteration.
• The value vector obtained by applying value iteration to v is the vector of values guaranteed by "apply Y once, then apply X forever".
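The time-bounded values val(P(N)_T) discussed above can be computed by explicit backward induction: each terrace is a 2×2 matrix game whose entries are continuation values. This is my own encoding of the terraces, not code from the paper:

```python
def matrix_game_value(a, b, c, d):
    """Value of the zero-sum matrix game [[a, b], [c, d]] (row maximizes)."""
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                     # pure saddle point
        return maximin
    return (a * d - b * c) / ((a + d) - (b + c))

def purgatory_time_bounded(n, t):
    """val(P(n)_T): n-terrace Purgatory truncated after t rounds.
    v[i] = value at terrace i (1-based); v[n + 1] = 1 stands for Paradise.
    At terrace i (rows: Dante guesses heads/tails; cols: coin heads/tails):
      correct guess       -> terrace i + 1
      wrong guess 'heads' -> back to terrace 1
      wrong guess 'tails' -> lose (value 0)"""
    v = [0.0] * (n + 2)
    v[n + 1] = 1.0
    for _ in range(t):
        new = v[:]
        for i in range(1, n + 1):
            new[i] = matrix_game_value(v[i + 1], v[1], 0.0, v[i + 1])
        v = new
    return v[1]

v1000 = purgatory_time_bounded(7, 1000)    # still far below the true value 1
```

After 1000 steps the valuation of the lowest terrace of P(7) is still small, consistent with the slide's claim that even doubly-exponentially many steps leave it below 0.68.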
Strategy iteration is slow on Purgatory
    #iterations      Valuation of lowest terrace
    1                0.01347
    10               0.03542
    100              0.06879
    1000             0.10207
    10000            0.13396
    100000           0.16461
    1000000          0.19415
    10000000         0.22263
    100000000        0.24828
    > 2·10^65        0.9
    > 10^128         0.99

Main result, recapitulated
• For games with N positions and m actions for each player in each position:
  – (1/ε)^(m^(N/4)) iterations are (sometimes) necessary to get an ε-approximation of the value.
  – (1/ε)^(m^(O(N))) iterations are always sufficient.
• For the lower bound, we generalize Purgatory to more than 2 actions.

Generalized Purgatory P(N,m)
• Lucifer repeatedly hides a number between 1 and m.
• Dante must try to guess the number.
• If he guesses correctly N times in a row, he wins the game.
• If he ever guesses incorrectly, overshooting Lucifer's number, he loses the game.

Why strategy iteration is slow on Purgatory (sketch!)
• Strategy iteration on Purgatory with n terraces computes the same sequence of strategies for the lowest terrace as strategy iteration on Purgatory with one terrace only.
• Strategy iteration and value iteration are in synch when applied to Purgatory with one terrace.
• We derive a closed-form formula for the strategy and observe that the patience of the strategy computed after few iterations is low.
  – Patience = 1/(smallest non-zero probability used).
• When a pure best reply is played against a strategy of low patience, the play terminates quickly.
• We already know that such strategies do not do well for Purgatory, so the strategies computed are not very good.

Upper bound
• For any CRG with N positions and m actions for each player in each position, (1/ε)^(m^(O(N))) iterations are sufficient to achieve an ε-optimal strategy.
• … to show this, we need a detour.
Plan
• Introduction to imperfect information (concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic geometry
• Combinatorial algorithms for "qualitatively" solving concurrent stochastic games

A generic algorithm for determining the values of stochastic games
• The property of being a number larger or smaller than the value of a CRG can be expressed by a polynomial-length formula in the existential first order theory of the reals:
• "There exists a stationary strategy such that…"

A generic algorithm for determining values
• As a corollary to decision procedures of semi-algebraic geometry, comparing the value of a stochastic game of any of the kinds we have seen to a rational number is in PSPACE.
• This is the best known "complexity class" upper bound!
• … can this semi-algebraic approach be refined?

Exact Algorithms for Solving Stochastic Games (ctic.au.dk)
Kristoffer Arnsfelt Hansen, Michal Koucký, Niels Lauritzen, Peter Bro Miltersen, Elias Tsigaridas

Slogan of the approach
• Doing numerical analysis/optimization in dangerous waters using real algebraic geometry.
• Why are the waters dangerous? Small perturbations mean everything!

Purgatory
• The value of Purgatory is 1: Dante can win the game with probability 1 − ε for any ε > 0.
• Any strategy that guarantees a win with probability > 0.9 must use probabilities smaller than (1/10)^(2^(7−1)) = 10^(−64).

Solving Stochastic Games
[Figure: input, a game; output, its value.]

HKLMT'11
• Good news: stochastic games of all kinds with a constant number of positions can be solved exactly in polynomial time.
  – In contrast, Howard's algorithm (strategy iteration) uses exponential time to get a rough approximation even for one-position games.
  – The "generic" PSPACE algorithm is exponential even for one-position games.
• Bad news: the complexity is something like exp(O(N log N))…

Recursive Bisection Algorithm
[Figures: an animation of one bisection step on a small game, with example values 0.5, 0.8, 0.2, 0.9, 0.6, 0.4, 0.62. Is the value ≥ 0.5?]
1. Replace the position with the target value (0.5).
2. Recursively solve the smaller game.
3. Reinstate the position.
4. Replace its pointers with the computed values.
5. Solve the resulting matrix game and compare: > 0.5? Yes!

What's the catch?
• We can compare the value of a position in an N-position game to a given rational number (and do binary search) if we can recursively solve an (N−1)-position game exactly!
  – 0.5 vs. 0.5000000000000000000000000001
  – This will happen on simple examples such as Purgatory.
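The five numbered steps can be sketched on a toy two-position game of my own devising: the matrix at position A is [[1, B], [0, 1]] and at position B is [[0.5, A], [0, 1]], where entries are terminal winning probabilities and a position name means "move there"; the exact values are 2/3 and 1/2. The sketch does binary search on A's value and solves the one-position remainder exactly, ignoring the 0.5-vs-0.5000…01 precision issue that motivates separation bounds:

```python
def matrix_game_value(a, b, c, d):
    """Value of the zero-sum matrix game [[a, b], [c, d]] (row maximizes)."""
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:                      # pure saddle point
        return maximin
    return (a * d - b * c) / ((a + d) - (b + c))

def solve_b(t):
    """Steps 1-2: replace position A by the target value t and solve the
    remaining one-position game B exactly (base case of the recursion)."""
    return matrix_game_value(0.5, t, 0.0, 1.0)

def value_of_a(iters=60):
    """Steps 3-5 wrapped in bisection: reinstate A, plug B's computed value
    into A's matrix, solve the matrix game at A and compare with t."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        t = (lo + hi) / 2.0
        vb = solve_b(t)                             # recursively solved game
        va = matrix_game_value(1.0, vb, 0.0, 1.0)   # matrix game at A
        if va > t:
            lo = t                                  # A's value exceeds the target
        else:
            hi = t
    return (lo + hi) / 2.0

va = value_of_a()     # exact value of A is 2/3
vb = solve_b(va)      # exact value of B is 1/2
```

Bisection works here because the map t ↦ val_A is monotone with slope below 1, so the comparison in step 5 brackets the true value.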
• To get an implementable recursive algorithm, we must replace "exactly" with "approximately".

Real algebraic geometry to the rescue
• To resolve the 0.5 vs. 0.5000000000000000001 issue, we need separation bounds.
• Separation bound: if games X and Y of certain parameters have different values, they differ by at least ε.
• Obtaining separation bounds for values of stochastic games using real algebraic geometry is the technical meat of our work.

Isolated root theorem
• Given a polynomial system f_1(x_1, …, x_n) = … = f_m(x_1, …, x_n) = 0, with each f_j in Q[x_1, …, x_n] of total degree d, and with an isolated root x* in R^n.
• Then the algebraic degree of each x*_i is at most (2d+1)^n.
• Best previously published bound (Basu et al.): (O(d))^n.
• Open: is d^n possible?

End of detour
• Strategy iteration (Howard's algorithm) finds an ε-approximation to the value of a concurrent reachability game with N positions and m actions per position after (1/ε)^(m^(O(N))) iterations.
  – Tight(ish)! [Hansen et al., CSR'11]
  – Best previous bound: (1/ε)^(m^(31·N^2)) iterations [Hansen et al., LICS'09].
  – Slogan: algorithm analysis using real algebraic geometry.

Algorithm analysis by R.A.G.
• Strategy iteration needs at most as many iterations as value iteration.
• The complexity of value iteration on a game is captured by the difference in value between the time-bounded and infinite versions of the game.
• The difference in value between the time-bounded and infinite versions of the game is upper bounded as a function of its patience.
  – Patience = 1/(smallest probability used in a near-optimal strategy).
• Thus, to get an upper bound on complexity, we need an upper bound on the patience of near-optimal strategies.
• Everett (1957) exhibits near-optimal strategies that are characterized by certain formulas of first order logic over the reals.
• The number of variables can be reduced by applying Cramer's rule, at the expense of blowing up the size superpolynomially.
• Applying the sampling theorem of Basu et al. to these formulas, we get a bound on the algebraic degree of the resulting numbers.
• By applying standard separation bounds on algebraic numbers, we get the bound on patience, leading to the desired bound on the time complexity of strategy iteration.

A non-algebraic approach to an even better upper bound
• Is Purgatory extremal with respect to patience among n-node CRGs?
• If yes, this gives a better upper bound on the number of iterations of value iteration for CRGs, replacing O(m) with m + o(m)!

Compare
[Figure: Condon's example, extremal with respect to, e.g., expected absorption time.]

Perspectives…
• A practical algorithm?
  – Can be made more practical by "iterated precision extension".
• Better algorithms using more clever numerical ideas, Newton, etc.?
  – But very tricky in this domain:
    • precision issues;
    • the domain is only "piecewise" smooth.
• A tighter analysis of Howard?
• More "big-O-less" real algebraic geometry and semi-algebraic geometry?
• More numerical algorithms in dangerous waters using r.a.g.?

Representing strategies for CRGs
• The fact that the values can be approximated in PSPACE strongly suggests that PSPACE should be enough for "understanding" CRGs. But the natural representation of near-optimal strategies requires exponential space!
• Is there a "natural" representation of probabilities so that
  – ε-optimal strategies of CRGs can be represented succinctly, and
  – ε-optimal strategies of CRGs can be computed using polynomial space, or better?
• De Alfaro, Henzinger, Kupferman, FOCS'98: Yes, for the restricted case of CRGs where the values of all positions are 0 or 1.
Plan
• Introduction to imperfect information (concurrent) stochastic games
• Analysis of the complexity of strategy iteration
• Algorithms based on real and semi-algebraic geometry
• Combinatorial algorithms for "qualitatively" solving concurrent stochastic games

Qualitatively solving Concurrent Reachability Games
• De Alfaro, Henzinger, Kupferman '98:
  – There is a (combinatorial) polynomial time algorithm that finds those positions in a concurrent reachability game that have value 1.
  – In other words, those positions can be "combinatorially characterized".

LimitEscape states
• Let s ∈ C ⊆ U.
• We say that s ∈ LimitEscape(C,U) if for any number K, there is a strategy σ for Player 1 so that:
    inf_π Pr[from s, (σ,π) leaves C in one step] > inf_π Pr[from s, (σ,π) leaves U in one step] · K
• "Player 1 can leave C in one step with positive probability, and also ensure that, given that he does leave C, he stays in U with high probability."

Algorithm for deciding if s ∈ LimEscape(C,U)
• E1 = {(i,j) | in s, (i,j) leaves C with positive probability}
• E2 = {(i,j) | in s, (i,j) leaves U with positive probability}
• A0 = {i | ∀j: (i,j) ∉ E2}
• B0 = {j | ∃i ∈ A0: (i,j) ∈ E1}
• A1 = {i | ∀j: (i,j) ∉ E2 or j ∈ B0}
• B1 = {j | ∃i ∈ A1: (i,j) ∈ E1}
• A2 = {i | ∀j: (i,j) ∉ E2 or j ∈ B1}
• …
• Until B_k = B_{k+1}.
• s ∈ LimEscape(C,U) if and only if B_k contains all actions of Player 2.

LimSafe(W)
• Let W be a set of states containing GOAL.
• LimSafe(W) is the largest subset V of W − {GOAL} so that no state u in V is in LimEscape(V,W).
• "Player 2 can contain play in V with positive probability."
• V := W − {GOAL}
• Repeat V := {s in V | s not in LimEscape(V,W)} until stable.

Algorithm for computing the states of value 1
• C0 = LimSafe(S)
• U1 = Safe1(S − C0)
• C1 = LimSafe(U1)
• U2 = Safe1(S − C1)
• …
• Until U_{k+1} = U_k; return U_k.
(Safe1 computes the largest subset that Player 1 can stay inside with certainty.)

ε-optimal strategies
• When all positions have value 0 or 1, the correctness proof of the algorithm constructs an ε-optimal strategy where each probability is either of the form ε^r or 1 − ε^{r_1} − ε^{r_2} − …
• Open: is it true in general that an ε-optimal strategy for a concurrent reachability game can be described as a sparse polynomial in ε?

The limiting average case
• Can the value-1 positions in Gillette's games with rewards 0 and 1 (e.g., the Big Match) be combinatorially characterized? And, as a consequence, be determined in P?
• Question due to Rasmus Ibsen-Jensen.

A mysterious value-1 game due to Rasmus Ibsen-Jensen
[Figure: the game.]

The End!