Making complex decisions
Chapter 17
Outline
• Sequential decision problems (Markov
Decision Process)
– Value iteration
– Policy iteration
• Partially Observable MDPs
• Decision theoretic agent
• Dynamic belief networks
Sequential Decisions
• Agent’s utility depends on a sequence of
decisions
Markov Decision Process (MDP)
• Defined as a tuple: <S, A, M, R>
– S: State
– A: Action
– M: Transition function
• Table M, where M(sj | si, a) = P(sj | si, a), the probability of reaching sj
given action a in state si
– R: Reward
• R(si, a) = cost or reward of taking action a in state si
• In our case R = R(si)
• Choose a sequence of actions (not just one decision or one
action)
– Utility based on a sequence of decisions
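As a concrete illustration (not from the slides), the tuple <S, A, M, R> can be held in plain Python data structures; the state names, actions, and numbers below are made up.

# Hypothetical MDP <S, A, M, R> as plain Python data structures.
S = ["s1", "s2"]                                  # states
A = ["a1", "a2"]                                  # actions
M = {                                             # M[(si, a)][sj] = P(sj | si, a)
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s2": 1.0},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.5},
}
R = {"s1": -0.04, "s2": 1.0}                      # reward depends only on the state, R = R(si)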
Generalization
 Inputs:
• Initial state s0
• Action model
• Reward R(si) collected in each state si
 A state is terminal if it has no successor
 Starting at s0, the agent keeps executing actions
until it reaches a terminal state
 Its goal is to maximize the expected sum of rewards
collected (additive rewards)
Additive rewards: U(s0, s1, s2, …) = R(s0) + R(s1) + R(s2) + …
Discounted rewards (we will not use this reward form):
U(s0, s1, s2, …) = R(s0) + γR(s1) + γ²R(s2) + …   (0 ≤ γ ≤ 1)
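A quick numerical illustration (not from the slides) of the two forms, for a made-up reward sequence and an assumed discount factor:

rewards = [-0.04, -0.04, -0.04, 1.0]      # hypothetical R(s0), R(s1), R(s2), R(s3)
gamma = 0.9                                # assumed discount factor
additive   = sum(rewards)                                      # R(s0) + R(s1) + R(s2) + ...
discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # R(s0) + γR(s1) + γ²R(s2) + ...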
Dealing with infinite sequences
of actions
• Use discounted rewards
• The environment contains terminal states
that the agent is guaranteed to reach
eventually
• Infinite sequences are compared by their
average rewards
Utility of a State
Given a state s we can measure the expected utilities obtained by
applying any policy π.
We assume the agent is in state s and define St (a random
variable) as the state reached at step t.
Obviously S0 = s.
The expected utility of a state s given a policy π is
Uπ(s) = E[ Σt γ^t R(St) ]
π*s will be a policy that maximizes the expected utility of
state s.
Utility of a State
The utility of a state s measures its desirability:
• If Si is terminal:
  U(Si) = R(Si)
• If Si is non-terminal:
  U(Si) = R(Si) + γ max_a Σ_Sj P(Sj|Si,a) U(Sj)
[Bellman equation]
[the reward of the state augmented by the expected
sum of discounted rewards collected in future
states]
⇒ dynamic programming
Optimal Policy
• A policy is a function that maps each state s into the
action to execute if s is reached
• The optimal policy π* is the policy that always leads
to maximizing the expected sum of rewards
collected in future states
(Maximum Expected Utility principle)
π*(Si) = argmax_a Σ_Sj P(Sj|Si,a) U(Sj)
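A minimal Python sketch (not from the slides) of reading this greedy policy off a given utility function U; actions(s) and P(s, a) are assumed helpers, with P(s, a) returning pairs (s', P(s'|s,a)).

def greedy_policy(states, actions, P, U):
    """pi*(s) = argmax_a sum_s' P(s'|s,a) U(s') for every non-terminal state."""
    pi = {}
    for s in states:
        if not actions(s):            # terminal states have no action to choose
            continue
        pi[s] = max(actions(s),
                    key=lambda a: sum(p * U[s2] for s2, p in P(s, a)))
    return pi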
Example
• Fully observable environment
• Non-deterministic actions
– intended effect: 0.8
– not the intended effect: 0.2
[Figure: 4×3 grid world. Columns x = 1..4, rows y = 1..3; the agent starts at (1,1); terminal states +1 at (4,3) and -1 at (4,2). Each action moves in the intended direction with probability 0.8 and at right angles to it with probability 0.1 each.]
MDP of example
• S: State of the agent on the 4×3 grid
  – Each cell is denoted by its coordinates (x, y)
• A: Actions of the agent, i.e., N, E, S, W
• M: Transition function
  – E.g., M((4,2) | (3,2), N) = 0.1
  – E.g., M((3,3) | (3,2), N) = 0.8
  – (robot movement, uncertainty of another agent's actions, …)
• R: Reward (more comments on the reward function later)
  – R(1,1) = -1/25, R(1,2) = -1/25, …
  – R(4,3) = +1, R(4,2) = -1
  – γ = 1
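A Python sketch of this transition model (an illustration, not the slides' code). Treating cell (2,2) as an obstacle is an assumption, consistent with the later utility table, which has no value for that cell.

# Hypothetical encoding of the 4x3 grid world's transition model M.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
RIGHT_ANGLES = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}
OBSTACLES = {(2, 2)}            # assumption: blocked cell
TERMINALS = {(4, 3), (4, 2)}

def step(cell, direction):
    """Deterministic move; bumping into the boundary or an obstacle leaves the agent in place."""
    dx, dy = MOVES[direction]
    nxt = (cell[0] + dx, cell[1] + dy)
    if nxt in OBSTACLES or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return cell
    return nxt

def M(cell, action):
    """Return {s': P(s' | cell, action)}: 0.8 intended direction, 0.1 to each side."""
    dist = {}
    for d, p in [(action, 0.8)] + [(d, 0.1) for d in RIGHT_ANGLES[action]]:
        s2 = step(cell, d)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

# Matches the slide: M((4,2) | (3,2), N) = 0.1 and M((3,3) | (3,2), N) = 0.8
assert abs(M((3, 2), "N")[(4, 2)] - 0.1) < 1e-9
assert abs(M((3, 2), "N")[(3, 3)] - 0.8) < 1e-9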
Policy
• Policy is a mapping from states to actions.
• Given a policy, one may calculate the expected utility
from the series of actions produced by the policy.
[Figure: an example policy, shown as an arrow in each non-terminal cell of the 4×3 grid world (+1 and -1 mark the terminal states)]
• The goal: Find an optimal policy π*, one that would
produce maximal expected utility.
Optimal policy and state utilities
for MDP
[Figure: the optimal policy (arrows) and the corresponding state utilities for the 4×3 world]
State utilities:
  y=3:  0.812      0.868   0.912   +1
  y=2:  0.762  (obstacle)  0.660   -1
  y=1:  0.705      0.655   0.611   0.388
        x=1        x=2     x=3     x=4
• What will happen if the cost of a step is very low?
Finding the optimal policy
• Solution must satisfy both equations
– π*(Si) = argmax_a Σ_Sj P(Sj|Si,a) U(Sj)            (1)
– U(Si) = R(Si) + γ max_a Σ_Sj P(Sj|Si,a) U(Sj)      (2)
• Value iteration:
– start with a guess of a utility function and use (2)
to get a better estimate
• Policy iteration:
– start with a fixed policy and solve for the exact
utilities of the states; use Eq. (1) to find an updated
policy.
Value iteration
function Value-Iteration(MDP) returns a utility function
  inputs:
    P(Sj|Si,a), a transition model
    R, a reward function on states
    γ, a discount parameter
  local variables: U, utility function, initially identical to R
                   U', utility function, initially identical to R
  repeat
    U ← U'
    for each state Si do
      U'[Si] ← R[Si] + γ max_a Σ_Sj P(Sj|Si,a) U[Sj]
    end
  until Close-Enough(U, U')
  return U
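A minimal Python sketch of the same algorithm (an illustration, not the slides' code), using the hypothetical helpers assumed earlier: states, actions(s), P(s, a) returning (s', probability) pairs, a reward dictionary R, and a discount gamma.

def value_iteration(states, actions, P, R, gamma=1.0, eps=1e-6):
    """Repeat U'(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s') until it stops changing."""
    U = {s: R[s] for s in states}                 # initially identical to R
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            acts = actions(s)
            if not acts:                          # terminal state: U(s) = R(s)
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for s2, p in P(s, a)) for a in acts)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                           # the Close-Enough test
            return U

On the 4×3 world, actions(s) would return N, E, S, W for ordinary cells and nothing for (4,3) and (4,2), and P(s, a) would simply enumerate M(s, a).items() from the earlier sketch.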
Value iteration - convergence of
utility values
Policy iteration
function Policy-Iteration(MDP) returns a policy
  inputs:
    P(Sj|Si,a), a transition model
    R, a reward function on states
    γ, a discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state Si do
      if max_a Σ_Sj P(Sj|Si,a) U[Sj] > Σ_Sj P(Sj|Si,π(Si)) U[Sj] then
        π(Si) ← argmax_a Σ_Sj P(Sj|Si,a) U[Sj]
        unchanged? ← false
  until unchanged?
  return π
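A corresponding Python sketch (illustrative only, same assumed helpers as before); here Value-Determination is approximated by a fixed number of evaluation sweeps rather than by solving the linear system exactly.

def policy_iteration(states, actions, P, R, gamma=1.0, eval_iters=50):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    U = {s: R[s] for s in states}                      # initially identical to R

    def q(s, a):                                       # expected utility of doing a in s
        return sum(p * U[s2] for s2, p in P(s, a))

    # Initial policy: greedy with respect to the initial U.
    pi = {s: max(actions(s), key=lambda a: q(s, a)) for s in states if actions(s)}

    while True:
        # Policy evaluation: approximate Value-Determination by repeated sweeps.
        for _ in range(eval_iters):
            U = {s: R[s] + (gamma * q(s, pi[s]) if s in pi else 0.0) for s in states}
        # Policy improvement, as in the pseudocode above.
        unchanged = True
        for s in pi:
            best = max(actions(s), key=lambda a: q(s, a))
            if q(s, best) > q(s, pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi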
Decision Theoretic Agent
• An agent that must act in uncertain
environments; its actions may be nondeterministic.
Stationarity and Markov
assumption
• A stationary process is one whose
dynamics, i.e., P(Xt|Xt-1,…,X0) for t>0,
are assumed not to change with time.
• The Markov assumption is that the
current state Xt is dependent only on a
finite history of previous states. In this
class, we will only consider first-order
Markov processes for which
P(Xt|Xt-1,…,X0) = P(Xt|Xt-1)
Transition model and sensor
model
• For first-order Markov processes, the laws
describing how the process state evolves with
time are contained entirely within the
conditional distribution P(Xt|Xt-1), which is
called the transition model for the process.
• We will also assume that the observable state
variables (evidence) Et depend only on the
current state variables Xt. P(Et|Xt) is called the
sensor or observational model.
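A tiny illustration (not from the slides) of these two distributions for a hypothetical two-state weather process observed through a noisy umbrella sensor; the states, evidence values, and probabilities are made up.

# Transition model P(X_t | X_{t-1}) and sensor model P(E_t | X_t)
# for a hypothetical first-order Markov process.
transition_model = {
    "Rain":  {"Rain": 0.7, "Sunny": 0.3},
    "Sunny": {"Rain": 0.3, "Sunny": 0.7},
}
sensor_model = {                  # P(evidence | X_t)
    "Rain":  {"umbrella": 0.9, "no_umbrella": 0.1},
    "Sunny": {"umbrella": 0.2, "no_umbrella": 0.8},
}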
Decision Theoretic Agent implemented
Sensor model – generic
Example – part of steam train control
Sensor model II
Model for the lane-position sensor of an automated vehicle
Dynamic belief network – generic
structure
• Describes the action model, namely the
state evolution model.
State evolution model - example
Dynamic Decision Network
• Can handle uncertainty
• Deal with continuous streams of
evidence
• Can handle unexpected events (they
have no fixed plan)
• Can handle sensor noise and sensor
failure
• Can act to obtain information
• Can handle large state spaces