Utility Theory & MDPs
Tamara Berg
CS 560 Artificial Intelligence

Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer

Announcements
• HW2 will be online tomorrow
– Due Oct 8 (but make sure to start early!)
• As always, you can work in groups of up to 3 and submit 1 written/coding solution (pairs don't need to be the same as HW1)

AI/Games in the news
Sept 14, 2015

Review from last class

A more abstract game tree
[Figure: a two-ply game shown as a game tree; leaves are labeled with terminal utilities (for MAX), the subtree values 3, 2, and 2 propagate up, and the root's minimax value is 3.]

Computing the minimax value of a node
• Minimax(node) =
– Utility(node), if node is terminal
– max over actions of Minimax(Succ(node, action)), if player = MAX
– min over actions of Minimax(Succ(node, action)), if player = MIN
(a code sketch appears at the end of this review)

Alpha-beta pruning
• It is possible to compute the exact minimax decision without expanding every node in the game tree
[Figure: a sequence of slides steps the pruning through the two-ply example tree (intermediate leaf values 3, 2, 14, 5, 2), skipping branches that cannot affect the decision at the root.]

Resource Limits
• Alpha-beta still has to search all the way to terminal states for portions of the search space.
• Instead we can cut off search earlier and apply a heuristic evaluation function:
– Search a limited depth of the tree
– Replace terminal utilities with an evaluation function for non-terminal positions
– The program's performance depends heavily on its evaluation function

Evaluation function
• Cut off search at a certain depth and compute the value of an evaluation function for a state instead of its minimax value
– The evaluation function may be thought of as the probability of winning from a given state, or the expected value of that state
• A common evaluation function is a weighted sum of features (sketched below):
Eval(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
• Evaluation functions may be learned from game databases or by having the program play many games against itself

Cutting off search
• Horizon effect: you may incorrectly estimate the value of a state by overlooking an event that is just beyond the depth limit
– For example, a damaging move by the opponent that can be delayed but not avoided
• Possible remedies:
– Quiescence search: do not cut off search at positions that are unstable – for example, are you about to lose an important piece?
– Singular extension: a strong move that should be tried when the normal depth limit is reached

Additional techniques
• Transposition table to store previously expanded states
• Forward pruning to avoid considering all possible moves
• Lookup tables for opening moves and endgames

Iterative deepening search
• Use DFS as a subroutine:
1. Check the root
2. Do a DFS with depth limit 1
3. If there is no path of length 1, do a DFS with depth limit 2
4. If there is no path of length 2, do a DFS with depth limit 3
5. And so on…
• Why might this be useful for multi-player games? (see the sketches below)
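To make the minimax recurrence above concrete, here is a minimal Python sketch of depth-limited minimax with alpha-beta pruning. The game interface (is_terminal, evaluate, successors) is a hypothetical stand-in for whatever game representation is used, not something from the slides.

import math

def alphabeta(state, depth, alpha, beta, maximizing, game):
    # Depth-limited minimax with alpha-beta pruning. `game` is a
    # hypothetical interface supplying is_terminal, evaluate, successors.
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)  # exact utility at terminals, heuristic at the cutoff
    if maximizing:  # player = MAX
        value = -math.inf
        for child in game.successors(state):
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False, game))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: MIN already has a better option elsewhere
        return value
    else:  # player = MIN
        value = math.inf
        for child in game.successors(state):
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True, game))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cutoff: MAX already has a better option elsewhere
        return value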
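The weighted-sum evaluation Eval(s) = w1 f1(s) + … + wn fn(s) is just a dot product of weights and feature values. A minimal sketch, where the feature functions (e.g. material balance, mobility) are hypothetical examples:

def eval_state(state, weights, features):
    # Eval(s) = w1*f1(s) + ... + wn*fn(s). `features` is a list of
    # functions mapping a state to a number; `weights` could be
    # hand-tuned or learned from game databases or self-play.
    return sum(w * f(state) for w, f in zip(weights, features))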
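And a sketch of iterative deepening driving the alpha-beta search above (it reuses the hypothetical alphabeta and game interface from the first sketch). Because the tree grows exponentially with depth, the repeated shallow searches cost little, and the search can be stopped whenever the clock runs out with the best move from the deepest completed iteration, which is one answer to the question on the last slide.

def iterative_deepening_move(root, game, max_depth):
    # Run depth-limited searches of increasing depth, keeping the best
    # move found by the deepest search completed so far.
    best_move = None
    for depth in range(1, max_depth + 1):
        best_move = max(
            game.successors(root),  # the root is a MAX node, so its children are MIN nodes
            key=lambda child: alphabeta(child, depth - 1, -math.inf, math.inf, False, game))
    return best_move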
Chess playing systems
• Baseline system: 200 million node evaluations per move (3 min), minimax with a decent evaluation function and quiescence search
– 5-ply ≈ human novice
• Add alpha-beta pruning
– 10-ply ≈ typical PC, experienced player
• Deep Blue: 30 billion evaluations per move, singular extensions, evaluation function with 8,000 features, large databases of opening and endgame moves
– 14-ply ≈ Garry Kasparov
• Recent state of the art (Hydra, ca. 2006): 36 billion evaluations per second, advanced pruning techniques
– 18-ply ≈ better than any human alive?

Games of chance
• How do we incorporate dice throwing into the game tree?

Maximum Expected Utility
• Why should we calculate expected utility?
• Principle of maximum expected utility: an agent should choose the action that maximizes its expected utility, given its knowledge
• General principle for decision making (definition of rationality)

Reminder: Expectations
• The expected value of a function is its average value, weighted by the probability distribution over inputs
• Example: how long to get to the airport?
• Length of driving time as a function of traffic: L(none) = 20, L(light) = 30, L(heavy) = 60
• P(T) = {none: 0.25, light: 0.5, heavy: 0.25}
• What is my expected driving time, E[L(T)]?
• E[L(T)] = L(none)·P(none) + L(light)·P(light) + L(heavy)·P(heavy)
  = (20 × 0.25) + (30 × 0.5) + (60 × 0.25) = 35
(a one-line version appears in the sketch below)

Games of chance
• Expectiminimax: for chance nodes, average values weighted by the probability of each outcome (sketched below)
– Nasty branching factor; defining evaluation functions and pruning algorithms becomes more difficult
• Monte Carlo simulation: when you get to a chance node, simulate a large number of games with random dice rolls and use the win percentage as the evaluation function
– Can work well for games like backgammon
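The airport example above is just a probability-weighted sum; a minimal sketch:

def expected_value(values, probs):
    # E[L(T)] = sum over outcomes t of L(t) * P(t)
    return sum(v * p for v, p in zip(values, probs))

# Driving times and traffic probabilities from the slide
print(expected_value([20, 30, 60], [0.25, 0.5, 0.25]))  # 35.0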
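And a sketch of the expectiminimax recurrence just described: MAX and MIN nodes behave as in minimax, while a chance node returns the probability-weighted average of its children. The game interface (node_type, probabilities, etc.) is hypothetical, as in the earlier sketches.

def expectiminimax(node, game):
    # Minimax extended with chance nodes for dice rolls.
    if game.is_terminal(node):
        return game.utility(node)
    values = [expectiminimax(child, game) for child in game.successors(node)]
    kind = game.node_type(node)  # hypothetical: "max", "min", or "chance"
    if kind == "max":
        return max(values)
    if kind == "min":
        return min(values)
    # Chance node: average over outcomes, weighted by their probabilities
    probs = game.probabilities(node)  # one probability per successor
    return sum(p * v for p, v in zip(probs, values))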
Partially observable games
• Card games like bridge and poker
• Monte Carlo simulation: deal all the cards randomly in the beginning and pretend the game is fully observable
– "Averaging over clairvoyance"
– Problem: this strategy does not account for bluffing, information gathering, etc.

Origins of game playing algorithms
• Minimax algorithm: Ernst Zermelo, 1912; first published in 1928 by John von Neumann
• Chess playing with evaluation function, quiescence search, selective search: Claude Shannon, 1949 (paper)
• Alpha-beta search: John McCarthy, 1956
• Checkers program that learns its own evaluation function by playing against itself: Arthur Samuel, 1956

Game playing algorithms today
• Computers are better than humans:
– Checkers: solved in 2007
– Chess: IBM Deep Blue defeated Kasparov in 1997
• Computers are competitive with top human players:
– Backgammon: the TD-Gammon system used reinforcement learning to learn a good evaluation function
– Bridge: top systems use Monte Carlo simulation and alpha-beta search
• Computers are not competitive:
– Go: branching factor 361; existing systems use Monte Carlo simulation and pattern databases
• http://xkcd.com/1002/ (see also: http://xkcd.com/1263/)

Utility Theory

Maximum Expected Utility
• Principle of maximum expected utility: an agent should choose the action that maximizes its expected utility, given its knowledge
• General principle for decision making (definition of rationality)
• Where do utilities come from?

Why MEU?

Utility Scales
• Normalized utilities: u+ = 1.0, u- = 0.0
• Micromorts: a one-millionth chance of death; useful for paying to reduce product risks, etc.

Human Utilities
• How much do people value their lives?
– How much would you pay to avoid a risk, e.g. Russian roulette with a million-barreled revolver (1 micromort)?
– Driving in a car for 230 miles incurs a risk of 1 micromort.

Measuring Utilities
[Figure: measuring the utility of an outcome by comparing it against the best possible prize and the worst possible catastrophe.]

Markov Decision Processes
Stochastic, sequential environments (Chapter 17)
Image credit: P. Abbeel and D. Klein

Markov Decision Processes
• Components:
– States s, beginning with initial state s0
– Actions a
• Each state s has actions A(s) available from it
– Transition model P(s' | s, a)
• Markov assumption: the probability of going to s' from s depends only on s and a, and not on any other past actions or states
– Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
– The "solution" to an MDP

Overview
• First, we will look at how to "solve" MDPs (find the optimal policy when the transition model and the reward function are known)
• Second, we will consider reinforcement learning, where we don't know the rules of the environment or the consequences of our actions

Game show
• A series of questions with increasing level of difficulty and increasing payoff
• Decision: at each step, take your earnings and quit, or go for the next question
– If you answer wrong, you lose everything
[Figure: a decision tree with questions Q1 ($100), Q2 ($1,000), Q3 ($10,000), and Q4 ($50,000). Quitting pays the earnings so far ($100 after Q1, $1,100 after Q2, $11,100 after Q3); an incorrect answer pays $0; answering Q4 correctly pays $61,100.]

Game show
• Consider the $50,000 question
– Probability of guessing correctly: 1/10
– Quit or go for the question?
• What is the expected payoff for continuing?
0.1 × $61,100 + 0.9 × $0 = $6,110
• What is the optimal decision? (Quit: $11,100 beats the expected $6,110 for continuing.)
[Figure: the same decision tree, now annotated with the probabilities of answering correctly: 9/10 (Q1), 3/4 (Q2), 1/2 (Q3), 1/10 (Q4).]

Game show
• What should we do in Q3?
– Payoff for quitting: $1,100
– Payoff for continuing: 0.5 × $11,100 = $5,550
• What about Q2?
– $100 for quitting vs. 0.75 × $5,550 ≈ $4,162 for continuing
• What about Q1?
– $0 for quitting vs. 0.9 × $4,162.50 ≈ $3,746 for continuing
[Figure: the decision tree labeled with the resulting utilities U(Q4) = $11,100, U(Q3) = $5,550, U(Q2) = $4,162, U(Q1) = $3,746, so the optimal policy is to keep answering through Q3 and quit before Q4. The backward-induction sketch at the end of this section reproduces these numbers.]

Grid world
• Transition model: the agent moves in the intended direction with probability 0.8, and slips to each perpendicular direction with probability 0.1 (encoded in the final sketch below)
• R(s) = -0.04 for every non-terminal state
Source: P. Abbeel and D. Klein

Goal: Policy
Source: P. Abbeel and D. Klein

Grid world
[Figure: the optimal policy when R(s) = -0.04 for every non-terminal state.]

Grid world
• Optimal policies for other values of R(s):
[Figure: how the optimal policy changes for different values of R(s).]
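A minimal backward-induction sketch for the game show example above, reproducing the utilities on the slides (the numbers come from the slides; the code structure is mine):

def game_show_utilities():
    # Work backwards from the final prize: at each question, compare
    # quitting with the current earnings against the expected value of
    # answering (a wrong answer pays $0).
    p_correct = [9/10, 3/4, 1/2, 1/10]     # chance of answering Q1..Q4 correctly
    quit_payoff = [0, 100, 1_100, 11_100]  # earnings held going into Q1..Q4
    value = 61_100                         # payoff for answering Q4 correctly
    for q in range(3, -1, -1):
        value = max(quit_payoff[q], p_correct[q] * value)
        print(f"U(Q{q + 1}) = {value}")
    return value

game_show_utilities()  # U(Q4) = 11100, U(Q3) = 5550.0, U(Q2) = 4162.5, U(Q1) = 3746.25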
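And a sketch encoding the grid world's noisy transition model: the intended move succeeds with probability 0.8, and the agent slips to each perpendicular direction with probability 0.1. The coordinate scheme and helper names are hypothetical; a full implementation would also keep moves that hit a wall in place.

# Hypothetical encoding of the grid world dynamics from the slides.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("W", "E"), "E": ("N", "S"), "W": ("N", "S")}

def step(state, action):
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)  # clamp to the grid in a full version

def transitions(state, action):
    # Returns [(probability, next_state), ...] for P(s' | s, a):
    # 0.8 intended direction, 0.1 to each perpendicular side.
    left, right = PERP[action]
    return [(0.8, step(state, action)),
            (0.1, step(state, left)),
            (0.1, step(state, right))]

def R(state):
    return -0.04  # reward for every non-terminal state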