Week 13 - Utility & MDP

CS188 Discussion – Utility Theory & MDP by Nuttapong Chentanez
Utilities
- A function mapping from outcomes (states of the world) to real numbers
that describes an agent's preferences
Maximum Expected Utility
- An agent should choose the action that maximizes its expected utility
given its knowledge
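As a toy illustration of the MEU principle (the action names and all numbers below are hypothetical), each action induces a lottery over outcomes and the agent picks the action with the largest expected utility:

```python
# Hypothetical: each action maps to a list of (probability, utility) pairs.
action_outcomes = {
    "stay":   [(1.0, 5.0)],                 # certain, modest utility
    "gamble": [(0.7, 10.0), (0.3, -4.0)],   # risky, higher upside
}

def expected_utility(outcomes):
    """Probability-weighted average of the outcome utilities."""
    return sum(p * u for p, u in outcomes)

best = max(action_outcomes, key=lambda a: expected_utility(action_outcomes[a]))
print(best)  # "gamble": 0.7*10 + 0.3*(-4) = 5.8 > 5.0
```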
Markov Decision Process (MDP)
- Consists of a set of states s ∈ S
- A transition model T(s, a, s') = P(s' | s, a)
  - The probability that executing action a in state s leads to s'
- A reward function R(s, a, s')
  - Can also write R_enter(s') for the reward for entering s'
  - Or R_in(s) for the reward for being in s
- A start state (or a distribution over start states)
- There may be terminal states
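A minimal, hypothetical encoding of these ingredients in Python (the state and action names and all numbers are made up for illustration); later sketches in these notes reuse this dictionary style:

```python
# Tiny two-state MDP used only to illustrate the definition.
states = ["A", "B"]
actions = ["stay", "go"]

# Transition model T(s, a, s') = P(s' | s, a), stored as (s, a) -> [(s', prob), ...]
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"):   [("B", 0.9), ("A", 0.1)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"):   [("A", 0.9), ("B", 0.1)],
}

# Reward function R(s, a, s'); here it happens to depend only on the state entered.
def R(s, a, s_next):
    return {"A": 0.0, "B": 1.0}[s_next]

gamma = 0.9        # discount factor (see the Stationarity section below)
start_state = "A"
```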
Expectation
- The expected value of a random variable is its probability-weighted average: E[X] = Σ_x P(X = x) · x
- Expected utility is defined the same way, with the utility of each outcome in place of x
Preferences
An agent chooses among:
- Prizes (outcomes): A, B, etc.
- Lotteries (uncertain prizes): L = [p, A; (1 − p), B]
Constraints on the preferences of a rational agent (the axioms of rationality: orderability, transitivity, continuity, substitutability, monotonicity, decomposability)
Optimum behavior
- Suppose we have a policy π : S → A, which gives an action for each state
- The optimal policy maximizes expected utility (where utility is defined in terms of the
rewards obtained by following the policy)
Stationarity
To formalize utility over reward sequences, we need some assumptions. Typically we assume stationary
preferences:
"If one future, starting tomorrow, is preferred to another future, then starting
the former future today does not change the preference."
With this stationarity assumption, one can show that there are only two ways to
define stationary utilities:
- Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
- Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ·r_1 + γ²·r_2 + ...
If 0 < γ < 1 and rewards are bounded by R_max, the discounted sum is bounded by R_max / (1 − γ), so even infinite reward sequences have finite utility.
Without these constraints, irrational behavior may result.
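A tiny sketch contrasting the two definitions on a made-up finite reward sequence:

```python
# Hypothetical reward sequence r_0, r_1, r_2, r_3 and discount factor.
rewards = [3.0, 2.0, 3.0, 2.0]
gamma = 0.5

additive   = sum(rewards)                                      # 10.0
discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # 3 + 1 + 0.75 + 0.25 = 5.0
print(additive, discounted)
```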
Theorem: Rational preferences imply behavior describable as
maximization of expected utility. The utility function U has the following two
properties:
- U(A) > U(B) exactly when outcome A is preferred to outcome B
- U([p_1, S_1; ...; p_n, S_n]) = Σ_i p_i·U(S_i), i.e. the utility of a lottery is its expected utility
Utility is NOT unique
- Without lotteries, any monotonic transformation of a utility function is a valid
utility function as well.
Utility of a state under a policy π
V^π(s) = expected total (discounted) reward received when starting in s and following π thereafter.
It satisfies the recursive equation
V^π(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ·V^π(s') ]
Two ways to compute V^π (see the sketch below):
- Idea 1: turn the recursive equations into iterative updates
- Idea 2: it is just a linear system in the unknowns V^π(s); solve it directly, e.g. with Matlab
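A minimal sketch of both ideas for a tabular MDP, assuming the dictionary-style T, R, and gamma from the earlier MDP sketch (numpy stands in for Matlab in the linear-system variant):

```python
import numpy as np

def policy_evaluation_iterative(states, T, R, gamma, policy, sweeps=100):
    """Idea 1: repeatedly apply the recursive equation as an update."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                    for s2, p in T[(s, policy[s])])
             for s in states}
    return V

def policy_evaluation_linear(states, T, R, gamma, policy):
    """Idea 2: solve the linear system (I - gamma * T_pi) V = expected reward."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.zeros(len(states))
    for s in states:
        for s2, p in T[(s, policy[s])]:
            A[idx[s], idx[s2]] -= gamma * p
            b[idx[s]] += p * R(s, policy[s], s2)
    return dict(zip(states, np.linalg.solve(A, b)))
```

For the hypothetical two-state MDP above, a fixed policy such as {"A": "go", "B": "stay"} gives (approximately) the same V^π from either function.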
Optimal Utilities
Goal: Calculate the optimal utility of each state
V*(s) = expected total (discounted) reward starting in s and acting optimally thereafter
Why? Given the optimal values V*, we can recover the optimal policy by one-step lookahead (see the sketch below)
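A minimal sketch of that one-step lookahead, assuming the same dictionary-style states, actions, T, R, and gamma as before (V would be the optimal values, however they were computed):

```python
def extract_policy(states, actions, T, R, gamma, V):
    """Greedy policy: in each state, pick the action with the best expected value."""
    def q(s, a):
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```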
Insurance
Consider the lottery [0.5, $1000; 0.5, $0]; its expected monetary value is
$500. Many people nevertheless prefer a certain alternative [1, $400]. The $100 difference
between the expected monetary value and the certain amount the agent will accept in its place
is called the insurance premium; this is how the insurance industry sets premiums.
Bellman's Equation
Idea: the optimal value maximizes over the first action and then follows the
optimal policy:
V*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ·V*(s') ]
Money != Utility
- For most people, [1, $1M] > [0.5, $3M; 0.5, $0]
- Consider A: [0.8, $4k; 0.2, $0], B: [1.0, $3k; 0.0, $0]
  and C: [0.2, $4k; 0.8, $0], D: [0.25, $3k; 0.75, $0]
- For most people B > A and C > D, but (taking U($0) = 0):
  B > A ⇒ U($3k) > 0.8·U($4k)
  C > D ⇒ 0.2·U($4k) > 0.25·U($3k) ⇒ 0.8·U($4k) > U($3k)
  which contradict each other.
- Either humans are irrational, or money != utility, or both.
From V*, the optimal policy is
π*(s) = argmax_a Σ_s' T(s, a, s') [ R(s, a, s') + γ·V*(s') ]
How do we solve the Bellman equation for V* in the first place?
Value Iteration
Start from arbitrary values (e.g. V_0(s) = 0) and repeatedly apply the Bellman equation as an update (see the sketch below):
V_{k+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ·V_k(s') ]
Theorem: the values converge to the unique optimal values V*. The greedy policy may
converge long before the values do.
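A minimal sketch of value iteration, again assuming the dictionary-style states, actions, T, R, and gamma from the earlier MDP sketch; the stopping threshold eps is an assumption for illustration:

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    """Repeatedly apply the Bellman update until the values stop changing much."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                            for s2, p in T[(s, a)])
                        for a in actions)
                 for s in states}
        # Stop once the largest change in any state's value is below eps.
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```

The resulting values can then be turned into a policy with the extract_policy sketch above.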
Convergence
Convergence can be shown with the contraction property of the update:
max_s |U_{t+1}(s) − V_{t+1}(s)| ≤ γ·max_s |U_t(s) − V_t(s)|
This inequality says that after one update, any two value functions are closer
to each other. The update applied to V* does not change its values;
therefore, updating an arbitrary value function brings it closer to V* by a
factor of γ.
The inequality can be shown by expanding max_s [U_{t+1}(s) − V_{t+1}(s)] with the value
iteration update.
To check for convergence in practice, look at the change between successive iterations:
once the change in our approximation is small, it must also be close to
correct (e.g. if max_s |V_{k+1}(s) − V_k(s)| < ε(1 − γ)/γ, then V_{k+1} is within ε of V*).
Policy Iteration
Policy Evaluation: Calculate utilities for a fixed policy
Policy Improvement: Update policy based on resulting converged utility
Repeat until policy does not change
In practice, there is no need to compute the "exact" utility of a policy; a rough
estimate is enough (see the sketch below).
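A minimal, self-contained sketch of policy iteration in the same dictionary style, with a truncated (rough) policy evaluation step:

```python
def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    """Alternate policy evaluation and greedy policy improvement."""
    def q(s, a, V):
        # Expected one-step value of taking action a in state s under values V.
        return sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])

    policy = {s: actions[0] for s in states}   # arbitrary initial policy
    while True:
        # Policy evaluation (truncated: a rough estimate is enough in practice).
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: q(s, policy[s], V) for s in states}
        # Policy improvement: act greedily with respect to V.
        new_policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
        if new_policy == policy:
            return policy, V
        policy = new_policy
```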
Exercise
Suppose an agent inhabits a world with two states, S and ~S, and can do
exactly one of two actions, a and b. Action a does nothing and action b
flips from one state to the other.
Suppose R_in(S) = 3, R_in(~S) = 2, and γ = 0.5. Trace policy iteration
(the policy and the state utilities at each step) to find the optimal policy.
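One possible encoding of the exercise setup in the dictionary style used above (only the problem data; "notS" stands for ~S, and R depends only on the current state, as R_in does), which can be fed to the policy_iteration sketch to check a hand-worked trace:

```python
# Two-state world: action "a" does nothing, action "b" flips the state.
states = ["S", "notS"]
actions = ["a", "b"]
T = {
    ("S", "a"):    [("S", 1.0)],
    ("S", "b"):    [("notS", 1.0)],
    ("notS", "a"): [("notS", 1.0)],
    ("notS", "b"): [("S", 1.0)],
}

# R_in(S) = 3, R_in(~S) = 2: reward for being in the current state s.
def R(s, a, s_next):
    return 3.0 if s == "S" else 2.0

gamma = 0.5
# print(policy_iteration(states, actions, T, R, gamma))
```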