Making Complex Decisions (Chapter 17)

Outline
• Sequential decision problems (Markov Decision Processes)
  – Value iteration
  – Policy iteration
• Partially observable MDPs
• Decision-theoretic agents
• Dynamic belief networks

Sequential Decisions
• The agent's utility depends on a sequence of decisions.

Markov Decision Process (MDP)
• Defined as a tuple <S, A, M, R>
  – S: states
  – A: actions
  – M: transition function
    • Table M_ij^a = P(sj | si, a), the probability of reaching sj when action a is taken in state si
  – R: reward
    • R(si, a) = cost or reward of taking action a in state si
    • In our case R = R(si)
• Choose a sequence of actions (not just one decision or one action)
  – Utility is based on the sequence of decisions

Generalization
Inputs:
• Initial state s0
• Action (transition) model
• Reward R(si) collected in each state si
A state is terminal if it has no successor. Starting at s0, the agent keeps executing actions until it reaches a terminal state. Its goal is to maximize the expected sum of rewards collected.
• Additive rewards: U(s0, s1, s2, …) = R(s0) + R(s1) + R(s2) + …
• Discounted rewards (we will not use this reward form):
  U(s0, s1, s2, …) = R(s0) + γ R(s1) + γ^2 R(s2) + …, with 0 < γ < 1

Dealing with infinite sequences of actions
• Use discounted rewards
• The environment contains terminal states that the agent is guaranteed to reach eventually
• Infinite sequences are compared by their average rewards

Utility of a State
Given a state s, we can measure the expected utility obtained by executing any policy π. Assume the agent is in state s and define St (a random variable) as the state reached at step t; obviously S0 = s.
The expected utility of state s under policy π is
  U^π(s) = E[ Σ_t γ^t R(St) ]
π*_s is a policy that maximizes the expected utility of state s.

Utility of a State
The utility of a state measures its desirability:
• If Si is terminal: U(Si) = R(Si)
• If Si is non-terminal: U(Si) = R(Si) + max_a Σ_j P(Sj | Si, a) U(Sj)   [Bellman equation]
  [the reward of Si augmented by the expected sum of (discounted) rewards collected in future states; solved by dynamic programming]

Optimal Policy
• A policy π is a function that maps each state s into the action to execute if s is reached.
• The optimal policy π* is the policy that always leads to maximizing the expected sum of rewards collected in future states (Maximum Expected Utility principle):
  π*(Si) = argmax_a Σ_j P(Sj | Si, a) U(Sj)

Example
• Fully observable environment
• Nondeterministic actions
  – intended effect: 0.8
  – not the intended effect: 0.2 (0.1 to each side)
[Figure: 4x3 grid world; the agent starts at (1,1); terminal states are (4,3) with reward +1 and (4,2) with reward -1]

MDP of the example
• S: the agent's cell on the grid, e.g., (4,3)
  – note that a cell is denoted by (x, y)
• A: actions of the agent, i.e., N, E, S, W
• M: transition function
  – e.g., M( (4,2) | (3,2), N ) = 0.1
  – e.g., M( (3,3) | (3,2), N ) = 0.8
  – (robot movement, uncertainty about another agent's actions, …)
• R: reward (more comments on the reward function later)
  – R(1,1) = -1/25, R(1,2) = -1/25, …
  – R(4,3) = +1, R(4,2) = -1
  – γ = 1

Policy
• A policy is a mapping from states to actions.
• Given a policy, one may calculate the expected utility of the series of actions the policy produces.
[Figure: an example policy, drawn as an arrow in each non-terminal cell of the 4x3 grid]
• The goal: find an optimal policy π*, one that produces maximal expected utility.
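To make the 4x3 example concrete, here is a minimal Python sketch, not part of the original slides: the names STATES, TERMINALS, ACTIONS, R, move and P are chosen here for illustration, and cell (2,2) is assumed to be an unreachable wall, as in the standard textbook version of this grid. It encodes the 0.8/0.1/0.1 transition model and the rewards, and ends with a single Bellman backup.

```python
# Minimal sketch of the 4x3 grid-world MDP from the example (names chosen here).
# Cells are (x, y); (4,3) and (4,2) are terminal; (2,2) is assumed to be a wall.
ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

def R(s):
    """Reward collected in state s: -1/25 in every non-terminal state."""
    return TERMINALS.get(s, -1.0 / 25)

def move(s, a):
    """Deterministic effect of action a in s; bumping into a wall or edge = stay."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def P(s, a):
    """Transition model M as (probability, successor) pairs: the intended move
    happens with probability 0.8, each perpendicular move with probability 0.1."""
    left, right = {'N': ('W', 'E'), 'S': ('E', 'W'),
                   'E': ('N', 'S'), 'W': ('S', 'N')}[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]

# Check against the slide: P((3,2), 'N') yields (3,3) with 0.8 and (4,2) with 0.1.
# One Bellman backup for the non-terminal state (3,2), starting from U = R:
U = {s: R(s) for s in STATES}
s = (3, 2)
print(R(s) + max(sum(p * U[s2] for p, s2 in P(s, a)) for a in ACTIONS))
```

Representing the transition model as a function returning (probability, successor) pairs, rather than a full |S| x |S| x |A| table, keeps the sketch small while matching the definition of M above.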
Optimal policy and state utilities for the MDP
[Figure: the optimal policy, drawn as an arrow in each non-terminal cell of the 4x3 grid]
Utilities of the states (R(s) = -1/25 for non-terminal states, γ = 1):
  y=3:  0.812   0.868   0.912    +1
  y=2:  0.762   (wall)  0.660    -1
  y=1:  0.705   0.655   0.611   0.388
        x=1     x=2     x=3     x=4
• What will happen if the cost of a step is very low?

Finding the optimal policy
• The solution must satisfy both equations:
  (1) π*(Si) = argmax_a Σ_j P(Sj | Si, a) U(Sj)
  (2) U(Si) = R(Si) + max_a Σ_j P(Sj | Si, a) U(Sj)
• Value iteration:
  – start with a guess of the utility function and use (2) to get a better estimate
• Policy iteration:
  – start with a fixed policy and solve for the exact utilities of the states; use (1) to find an updated policy

Value iteration
function Value-Iteration(MDP) returns a utility function
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, a discount parameter
  local variables: U, utility function, initially identical to R
                   U', utility function, initially identical to R
  repeat
    U ← U'
    for each state Si do
      U'[Si] ← R[Si] + max_a Σ_j P(Sj | Si, a) U[Sj]
    end
  until Close-Enough(U, U')
  return U

Value iteration – convergence of utility values [figure]

Policy iteration
function Policy-Iteration(MDP) returns a policy
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, a discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state Si do
      if max_a Σ_j P(Sj | Si, a) U[Sj] > Σ_j P(Sj | Si, π(Si)) U[Sj] then
        π(Si) ← argmax_a Σ_j P(Sj | Si, a) U[Sj]
        unchanged? ← false
  until unchanged?
  return π
(A Python sketch of both algorithms appears at the end of this section.)

Decision-Theoretic Agent
• An agent that must act in uncertain environments; its actions may be nondeterministic.

Stationarity and the Markov assumption
• A stationary process is one whose dynamics, i.e., P(Xt | Xt-1, …, X0) for t > 0, are assumed not to change with time.
• The Markov assumption is that the current state Xt depends only on a finite history of previous states. In this class we will only consider first-order Markov processes, for which P(Xt | Xt-1, …, X0) = P(Xt | Xt-1).

Transition model and sensor model
• For first-order Markov processes, the laws describing how the process state evolves with time are contained entirely in the conditional distribution P(Xt | Xt-1), which is called the transition model of the process.
• We will also assume that the observable state variables (evidence) Et depend only on the current state variables Xt. P(Et | Xt) is called the sensor (or observation) model.

Decision-theoretic agent implemented [figure]

Sensor model – generic example (part of steam train control) [figure]

Sensor model II – model of a lane-position sensor for an automated vehicle [figure]

Dynamic belief network – generic structure [figure]
• Describes the action model, namely the state evolution model.

State evolution model – example [figure]

Dynamic Decision Network
• Can handle uncertainty
• Can deal with continuous streams of evidence
• Can handle unexpected events (they have no fixed plan)
• Can handle sensor noise and sensor failure
• Can act to obtain information
• Can handle large state spaces
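As referenced after the pseudocode above, here is a compact Python sketch of value iteration and policy iteration, run on the grid model sketched earlier. It is an illustration only: the function names value_iteration, policy_iteration, value_determination and expected_utility are my own, the discount is implicitly γ = 1 as in the example, and Value-Determination is approximated by iterating the simplified Bellman update for the fixed policy rather than solving the linear system exactly.

```python
# Sketch of value iteration and policy iteration for the grid MDP defined earlier.
# Reuses STATES, TERMINALS, ACTIONS, R and P from that sketch; gamma = 1 here.

def expected_utility(s, a, U):
    """Expected utility of executing action a in state s, given utilities U."""
    return sum(p * U[s2] for p, s2 in P(s, a))

def value_iteration(eps=1e-4):
    """Repeat the Bellman update until no utility changes by more than eps."""
    U1 = {s: R(s) for s in STATES}              # U', initially identical to R
    while True:
        U, delta = dict(U1), 0.0
        for s in STATES:
            if s not in TERMINALS:
                U1[s] = R(s) + max(expected_utility(s, a, U) for a in ACTIONS)
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < eps:                          # Close-Enough(U, U')
            return U1

def value_determination(pi, U, eps=1e-4):
    """Approximate the utilities of the fixed policy pi by iterating the
    simplified Bellman update (no max over actions) until convergence."""
    U = dict(U)
    while True:
        delta = 0.0
        for s in STATES:
            u = R(s) if s in TERMINALS else R(s) + expected_utility(s, pi[s], U)
            delta, U[s] = max(delta, abs(u - U[s])), u
        if delta < eps:
            return U

def policy_iteration():
    """Alternate policy evaluation with greedy policy improvement."""
    U = {s: R(s) for s in STATES}
    pi = {s: 'N' for s in STATES if s not in TERMINALS}   # arbitrary initial policy
    while True:
        U = value_determination(pi, U)
        unchanged = True
        for s in pi:
            best = max(ACTIONS, key=lambda a: expected_utility(s, a, U))
            if expected_utility(s, best, U) > expected_utility(s, pi[s], U):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U

print(round(value_iteration()[(1, 1)], 3))   # should come out near 0.705
print(policy_iteration()[0])                 # greedy policy as a dict of actions
```

With these definitions, value_iteration() should reproduce utilities close to those in the table above (about 0.705 for state (1,1)), and policy_iteration() terminates when the greedy policy stops changing.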