Utility Theory & MDPs
Tamara Berg
CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan
Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer
Announcements
• HW2 will be online tomorrow
– Due Oct 8 (but make sure to start early!)
• As always, you can work in groups of up to 3
and submit 1 written/coding solution (groups don’t
need to be the same as for HW1)
AI/Games in the news
Sept 14, 2015
Review from last class
A more abstract game tree
(figure: game tree with terminal utilities for MAX)
A two-ply game
(figure: two-ply game tree)
Computing the minimax value of a node
(figure: minimax values propagated up from the terminal nodes)
• Minimax(node) =
– Utility(node) if node is terminal
– max_action Minimax(Succ(node, action)) if player = MAX
– min_action Minimax(Succ(node, action)) if player = MIN
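A direct transcription of this recursion in Python (a sketch: the `game` object and its `is_terminal`, `utility`, `actions`, and `succ` methods are assumptions, not defined on these slides):

```python
def minimax(node, player, game):
    """Minimax value of `node`; `game` is an assumed interface supplying
    is_terminal, utility, actions, and succ."""
    if game.is_terminal(node):
        return game.utility(node)                      # terminal utility (for MAX)
    other = 'MIN' if player == 'MAX' else 'MAX'
    values = [minimax(game.succ(node, a), other, game) for a in game.actions(node)]
    return max(values) if player == 'MAX' else min(values)
```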
Alpha-beta pruning
• It is possible to compute the exact minimax decision
without expanding every node in the game tree
(figure: step-by-step alpha-beta pruning on the example game tree)
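A hedged sketch of the pruned version, using the same assumed `game` interface as the minimax sketch above; `alpha` is the best value MAX can guarantee so far, `beta` the best MIN can guarantee, and a branch is abandoned as soon as it cannot affect the decision:

```python
def alphabeta(node, player, game, alpha=float('-inf'), beta=float('inf')):
    """Minimax value of `node`, skipping branches that cannot change the result."""
    if game.is_terminal(node):
        return game.utility(node)
    if player == 'MAX':
        value = float('-inf')
        for a in game.actions(node):
            value = max(value, alphabeta(game.succ(node, a), 'MIN', game, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:      # MIN already has a better option elsewhere: prune
                break
        return value
    else:
        value = float('inf')
        for a in game.actions(node):
            value = min(value, alphabeta(game.succ(node, a), 'MAX', game, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:      # MAX already has a better option elsewhere: prune
                break
        return value
```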
Resource Limits
• Alpha-beta still has to search all the
way to terminal states for portions of
the search space.
• Instead we can cut off search earlier
and apply a heuristic evaluation
function.
– Search a limited depth of tree
– Replace terminal utilities with evaluation
function for non-terminal positions.
– The program’s performance depends heavily on
its evaluation function.
Evaluation function
• Cut off search at a certain depth and compute the value of an
evaluation function for a state instead of its minimax value
– The evaluation function may be thought of as the probability of winning
from a given state or the expected value of that state
• A common evaluation function is a weighted sum of features:
Eval(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
• Evaluation functions may be learned from game databases or
by having the program play many games against itself
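As an illustration, the weighted sum above might be coded as follows (the feature names in the usage comment are made-up placeholders):

```python
def eval_state(state, features, weights):
    """Weighted-feature evaluation: Eval(s) = w1*f1(s) + w2*f2(s) + ... + wn*fn(s)."""
    return sum(w * f(state) for w, f in zip(weights, features))

# Hypothetical usage with two made-up chess features:
#   features = [material_balance, mobility]   # each maps a board state to a number
#   weights  = [9.0, 0.1]
#   value = eval_state(board, features, weights)
```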
Cutting off search
• Horizon effect: you may incorrectly estimate the
value of a state by overlooking an event that is
just beyond the depth limit
– For example, a damaging move by the opponent that
can be delayed but not avoided
• Possible remedies
– Quiescence search: do not cut off search at
positions that are unstable – for example, are you
about to lose an important piece?
– Singular extension: a strong move that should be
tried when the normal depth limit is reached
Additional techniques
• Transposition table to store previously expanded
states
• Forward pruning to avoid considering all possible
moves
• Lookup tables for opening moves and endgames
Iterative deepening search
• Use DFS as a subroutine
1. Check the root
2. Do a DFS with depth limit 1
3. If there is no path of length 1, do a DFS
search with depth limit 2
4. If there is no path of length 2, do a DFS with
depth limit 3.
5. And so on…
Why might this be useful for multi-player games?
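For game play this amounts to repeated depth-limited minimax; a sketch, again assuming a hypothetical `game` interface, here with an `eval` heuristic applied at cutoff states:

```python
def depth_limited_value(node, depth, player, game):
    """Minimax value with a depth cutoff; applies game.eval at the cutoff."""
    if game.is_terminal(node):
        return game.utility(node)
    if depth == 0:
        return game.eval(node)                 # heuristic evaluation at the cutoff
    other = 'MIN' if player == 'MAX' else 'MAX'
    values = [depth_limited_value(game.succ(node, a), depth - 1, other, game)
              for a in game.actions(node)]
    return max(values) if player == 'MAX' else min(values)

def iterative_deepening_decision(state, game, max_depth=10):
    """Search depth 1, then 2, ..., keeping the move from the deepest completed pass."""
    best_move = None
    for depth in range(1, max_depth + 1):
        best_move = max(game.actions(state),
                        key=lambda a: depth_limited_value(game.succ(state, a),
                                                          depth - 1, 'MIN', game))
    return best_move
```

Because the move from the deepest completed pass is always available, the search can simply be stopped whenever the move clock runs out.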
Chess playing systems
• Baseline system: 200 million node evaluations per move
(3 min), minimax with a decent evaluation function and
quiescence search
– 5-ply ≈ human novice
• Add alpha-beta pruning
– 10-ply ≈ typical PC, experienced player
• Deep Blue: 30 billion evaluations per move, singular
extensions, evaluation function with 8000 features,
large databases of opening and endgame moves
– 14-ply ≈ Garry Kasparov
• Recent state of the art (Hydra, ca. 2006): 36 billion
evaluations per second, advanced pruning techniques
– 18-ply ≈ better than any human alive?
Games of chance
• How to incorporate dice throwing into the
game tree?
Maximum Expected Utility
• Why should we calculate expected utility?
• Principle of maximum expected utility: an
agent should choose the action which
maximizes its expected utility, given its
knowledge
• General principle for decision making
(definition of rationality)
Reminder: Expectations
• The expected value of a function is its average
value, weighted by the probability distribution over
inputs
• Example: How long to get to the airport?
• Length of driving time as a function of traffic: L(none)=20,
L(light)=30, L(heavy)=60
• P(T)={none=0.25, light=0.5, heavy=0.25}
• What is my expected driving time: E[L(T)]?
• E[L(T)] =
L(none)*P(none)+L(light)*P(light)+L(heavy)*P(heavy)
• E[L(T)] = (20*.25) + (30*.5) + (60*0.25) = 35
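The same arithmetic in a few lines of Python, using exactly the numbers above:

```python
# Driving time L(t) as a function of traffic, and the distribution P(T)
L = {'none': 20, 'light': 30, 'heavy': 60}
P = {'none': 0.25, 'light': 0.5, 'heavy': 0.25}

expected_time = sum(P[t] * L[t] for t in P)   # 0.25*20 + 0.5*30 + 0.25*60
print(expected_time)                          # 35.0
```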
Games of chance
• Expectiminimax: for chance nodes, average
values weighted by the probability of each outcome
– Nasty branching factor, defining evaluation functions and
pruning algorithms more difficult
• Monte Carlo simulation: when you get to a
chance node, simulate a large number of games
with random dice rolls and use win percentage as
evaluation function
– Can work well for games like Backgammon
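A sketch of the chance-node case, extending the hypothetical `game` interface used earlier with assumed `to_move` and `outcomes(node)` methods, the latter yielding (probability, successor) pairs at chance nodes:

```python
def expectiminimax(node, game):
    """Minimax extended with chance nodes: average successor values by probability."""
    if game.is_terminal(node):
        return game.utility(node)
    player = game.to_move(node)
    if player == 'CHANCE':
        return sum(p * expectiminimax(child, game)
                   for p, child in game.outcomes(node))   # expected value over dice rolls
    values = [expectiminimax(game.succ(node, a), game) for a in game.actions(node)]
    return max(values) if player == 'MAX' else min(values)
```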
Partially observable games
• Card games like bridge and poker
• Monte Carlo simulation: deal all the cards
randomly in the beginning and pretend the game
is fully observable
– “Averaging over clairvoyance”
– Problem: this strategy does not account for bluffing,
information gathering, etc.
Origins of game playing
algorithms
• Minimax algorithm: Ernst Zermelo, 1912, first
published in 1928 by John von Neumann
• Chess playing with evaluation function,
quiescence search, selective search: Claude
Shannon, 1949 (paper)
• Alpha-beta search: John McCarthy, 1956
• Checkers program that learns its own evaluation
function by playing against itself: Arthur Samuel,
1956
Game playing algorithms today
• Computers are better than humans:
– Checkers: solved in 2007
– Chess: IBM Deep Blue defeated Kasparov in 1997
• Computers are competitive with top human players:
– Backgammon: TD-Gammon system used reinforcement
learning to learn a good evaluation function
– Bridge: top systems use Monte Carlo simulation and
alpha-beta search
• Computers are not competitive:
– Go: branching factor 361. Existing systems use Monte
Carlo simulation and pattern databases
http://xkcd.com/1002/
See also: http://xkcd.com/1263/
Utility Theory
Maximum Expected Utility
• Principle of maximum expected utility: an agent
should choose the action which maximizes its
expected utility, given its knowledge
• General principle for decision making (definition
of rationality)
• Where do utilities come from?
Why MEU?
Utility Scales
• Normalized Utilities: u+=1.0, u-=0.0
• Micromorts: a one-millionth chance of death;
useful for reasoning about how much people will
pay to reduce small risks (product safety, etc.)
Human Utilities
• How much do people value their lives?
– How much would you pay to avoid a risk, e.g.
Russian roulette with a million-barreled
revolver (1 micromort)?
– Driving in a car for 230 miles incurs a risk of 1
micromort.
Measuring Utilities
(figure: standard lottery between the best possible prize and the worst possible catastrophe)
Markov Decision Processes
Stochastic, sequential environments
(Chapter 17)
Image credit: P. Abbeel and D. Klein
Markov Decision Processes
• Components:
– States s, beginning with initial state s0
– Actions a
• Each state s has actions A(s) available from it
– Transition model P(s’ | s, a)
• Markov assumption: the probability of going to s’ from
s depends only on s and a and not on any other past
actions or states
– Reward function R(s)
• Policy π(s): the action that an agent takes in any given state
– The “solution” to an MDP (see the sketch below)
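These components map directly onto a small data structure; a minimal sketch (the field names here are illustrative, not a standard API):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[str]                                   # states s, including s0
    actions: Callable[[str], List[str]]                 # A(s): actions available in s
    transition: Callable[[str, str], Dict[str, float]]  # P(s' | s, a) as {s': prob}
    reward: Callable[[str], float]                      # R(s)

# A policy is then just a mapping from states to actions:
#   policy: Dict[str, str] = {s: some_action for s in mdp.states}
```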
Overview
• First, we will look at how to “solve” MDPs
(find the optimal policy when the transition
model and the reward function are known)
• Second, we will consider reinforcement
learning, where we don’t know the rules
of the environment or the consequences of
our actions
Game show
• A series of questions with increasing level of
difficulty and increasing payoff
• Decision: at each step, take your earnings and
quit, or go for the next question
– If you answer wrong, you lose everything
(figure: game show decision tree — answer Q1 ($100 question), Q2 ($1,000), Q3 ($10,000), Q4 ($50,000) in sequence; a correct answer to Q4 pays $61,100; an incorrect answer at any stage pays $0; before Q2, Q3, or Q4 you may instead quit with $100, $1,100, or $11,100)
Game show
• Consider $50,000 question
– Probability of guessing correctly: 1/10
– Quit or go for the question?
• What is the expected payoff for continuing?
0.1 * 61,100 + 0.9 * 0 = 6,110
• What is the optimal decision?
(figure: the same game show tree with success probabilities — Q1: 9/10, Q2: 3/4, Q3: 1/2, Q4: 1/10)
Game show
• What should we do in Q3?
– Payoff for quitting: $1,100
– Payoff for continuing: 0.5 * $11,100 = $5,550
• What about Q2?
– $100 for quitting vs. $4,162 for continuing
• What about Q1?
(figure: the same tree with backed-up utilities — U(Q1) = $3,746, U(Q2) = $4,162, U(Q3) = $5,550, U(Q4) = $11,100)
Grid world
Transition model: the intended move succeeds with
probability 0.8; with probability 0.1 each, the agent
slips perpendicular to the intended direction
R(s) = -0.04 for every
non-terminal state
Source: P. Abbeel and D. Klein
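A sketch of this transition model in code, under the usual reading of the figure (probability 0.8 for the intended direction, 0.1 for each perpendicular slip); the grid layout and terminal rewards are left as inputs:

```python
# Directions as (row, col) offsets
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

def transition(action):
    """P(actual move | intended action): 0.8 straight, 0.1 to each side."""
    side_a, side_b = PERPENDICULAR[action]
    return {action: 0.8, side_a: 0.1, side_b: 0.1}

def reward(state, terminal_rewards):
    """R(s) = -0.04 for every non-terminal state; terminal states keep their own reward."""
    return terminal_rewards.get(state, -0.04)

# e.g. transition('up') -> {'up': 0.8, 'left': 0.1, 'right': 0.1}
```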
Goal: Policy
Source: P. Abbeel and D. Klein
Grid world
Optimal policy when
R(s) = -0.04 for every
non-terminal state
Grid world
• Optimal policies for other values of R(s):