Automated Planning: Planning under Uncertainty
Jonas Kvarnström, Automated Planning Group, Department of Computer and Information Science, Linköping University
jonas.kvarnstrom@liu.se – 2015

Restricted State Transition System

Recall the restricted state transition system Σ = (S, A, γ):
▪ S = { s0, s1, … }: finite set of world states
▪ A = { a0, a1, … }: finite set of actions
▪ γ: S × A → 2^S: state transition function, where |γ(s, a)| ≤ 1
  ▪ If γ(s, a) = {s'}, then whenever you are in state s, you can execute action a and you end up in state s'
  ▪ If γ(s, a) = ∅ (the empty set), then a cannot be executed in s
Example: S = { s0, s1, … }, A = { take1, put1, … }, with γ(s0, take2) = { s1 } and γ(s1, take2) = ∅.
Often we also add a cost function c: S × A → ℝ.
[Diagram: states s0, s1, s2 connected by take2/put2 and take3/put3 transitions]

Classical Planning Problem

Recall the classical planning problem:
▪ Let Σ = (S, A, γ) be a state transition system satisfying the assumptions A0 to A7 (called a restricted state transition system in the book)
▪ Let s0 ∈ S be the initial state
▪ Let Sg ⊆ S be the set of goal states
Then, find a sequence of transitions labeled with actions [a1, a2, …, an] that can be applied starting at s0, resulting in a sequence of states [s1, s2, …, sn] such that sn ∈ Sg.

Planning with Complete Information

Classical planning assumes we know in advance:
▪ The state of the world when plan execution starts
▪ The outcome of any action, given the state where it is executed: state + action → unique resulting state
So if there is a solution, there is an unconditional solution. The model says exactly which state we end up in, so no new information gained during execution can be relevant: just follow the unconditional plan.

Multiple Outcomes

In reality, actions may have multiple outcomes.
Some outcomes can indicate faulty or imperfect execution:
▪ pick-up(object) — intended outcome: carrying(object) is true; unintended outcome: carrying(object) is false
▪ move(100,100) — intended outcome: xpos(robot) = 100; unintended outcome: xpos(robot) ≠ 100
▪ jump-with-parachute — intended outcome: alive is true; unintended outcome: alive is false
Some outcomes are more random, but clearly desirable or undesirable:
▪ Pick a present at random – do I get the one I longed for?
▪ Toss a coin – do I win?
To a planner, there is generally no difference between these cases!
Sometimes we have no clear idea what is desirable: the outcome will affect how we can continue, but in less predictable ways.

Nondeterministic Planning

Nondeterministic planning:
▪ S = { s0, s1, … }: finite set of world states
▪ A = { a0, a1, … }: finite set of actions
▪ γ: S × A → 2^S: state transition function, where |γ(s, a)| is finite
The model says we end up in one of several states. Will we find out more when we execute?

NOND Planning

NOND: Non-Observable Non-Deterministic planning, also called conformant non-deterministic planning.
Only predictions can guide us: we have no sensors to determine what happened.
▪ At planning time, the model says we end up in one of several states
▪ At execution time, we still only know that we are in one of those states

NOND Planning: Examples

NOND plans are still plain sequences.
Trivial example: the VCR domain
▪ Initial state: we don't know the current tape position
▪ Goal: tape at 00:20
▪ Operators: fast-forward n minutes; rewind completely (counter := 0)
▪ Solution: rewind completely; fast-forward 20 minutes
More interesting: the sorting domain (sortnet)
▪ Initial state: we don't know the order of the list
▪ Operators: compare-and-swap orders two elements correctly
▪ Plan: a sequence that can sort any list of the given length
▪ http://ldc.usb.ve/~bonet/ipc5/

A possible plan for sorting four positions is a fixed sequence of six compare-and-swap steps applied to fixed pairs of positions; after the sequence, the last element is in place, then the last two, and finally all elements. There is no need for the planner to know the element values, and no need to know during runtime whether elements were swapped or not.

FOND Planning

FOND: Fully Observable Non-Deterministic planning.
After executing an action, sensors determine exactly which state we are in:
▪ At planning time, the model says we end up in one of several states
▪ At execution time, the sensors say which state we are in

FOND Planning: Plan Structure (1)

Example state transition system:
▪ Initial state s0: at pos1, standing
▪ s1: at pos1, fallen
▪ Goal state s2: at pos2, standing
▪ s3: at pos2, fallen
▪ Actions: wait, stand-up, and move-to(pos2), which has multiple outcomes — we may or may not fall
Intuitive strategy: while (not in s2) { move-to(pos2); if (fallen) stand-up; }
In FOND planning, action choice should depend on actual outcomes (through the current state). There may be no upper bound on how many actions we may have to execute!

FOND Planning: Plan Structure (2)

Examples of formal plan structures:
▪ Conditional plans (with if/then/else statements)
▪ Policies π : S → A, defining, for each state, which action to execute whenever we end up there — or at least for every state that is reachable from the given initial state (a policy can be a partial function):
  π(s0) = move-to(pos2)
  π(s1) = stand-up
  π(s2) = wait
  π(s3) = stand-up

Solution Types 1

Assume our objective remains to reach a state in Sg.
A weak solution: for some outcomes, the goal is reached in a finite number of steps.
▪ π(s0) = move-to(pos2), π(s1) = wait, π(s2) = wait, π(s3) = stand-up

Solution Types 2

A strong solution: for every outcome, the goal is reached in a finite number of steps.
▪ Not possible for this example problem: we could fall every time.

Solution Types 3

A strong cyclic solution will reach a goal state in a finite number of steps given a fairness assumption — informally, "if we can exit a loop, we eventually will".
▪ π(s0) = move-to(pos2), π(s1) = stand-up, π(s2) = wait, π(s3) = stand-up

Solutions and Costs

The cost of a FOND policy is undefined:
▪ We don't know in advance which actions we must execute
▪ We have no estimate of how likely different outcomes are
We will not discuss non-deterministic planning further!
Stochastic Systems

Probabilistic planning uses a stochastic system Σ = (S, A, P):
▪ S = { s0, s1, … }: finite set of world states
▪ A = { a0, a1, … }: finite set of actions
▪ P(s, a, s'): given that we are in s and execute a, the probability of ending up in s' (replaces γ)
For every state s and action a, we have ∑_{s' ∈ S} P(s, a, s') = 1: the world gives us 100% probability of ending up in some state.
The model says we end up in one of several states — each with a given probability.

Stochastic Systems (2)

Example with a "desirable outcome": the action drive-uphill, where an arc in the diagram indicates the outcomes of a single action. The model says there is a 2% risk of slipping and ending up somewhere else:
▪ P(s125203, drive-uphill, s125204) = 0.98 (at location 5 → at location 6)
▪ P(s125203, drive-uphill, s125222) = 0.02 (at location 5 → intermediate location)

Stochastic Systems (3)

An action may have very unlikely outcomes, such as ending up in a "broken" state. A very unlikely outcome may still be important to consider if it has great impact on goal achievement!

Stochastic Systems (4)

An action may also have very many outcomes — for example, when it is uncertain how much fuel will be consumed, there is one outcome state for every possible fuel level. As always, there is one state for every combination of properties.

Stochastic Systems (5)

Like before, there are often many executable actions in every state. For each available action, the probabilities of its outcomes sum to 1; arcs connect the edges belonging to the same action. We choose the action — nature chooses the outcome! Searching the state space therefore yields an AND/OR graph.

Stochastic System Example

Example: a single robot moving between 5 locations.
▪ For simplicity, states correspond directly to locations: s1: at(r1, l1), s2: at(r1, l2), s3: at(r1, l3), s4: at(r1, l4), s5: at(r1, l5)
▪ Actions: wait and various move(li, lj) actions (you can't always move in both directions, e.g. due to terrain gradient)
Some transitions are deterministic, some are stochastic:
▪ Trying to move from l2 to l3: you may end up at l5 instead (20% risk)
▪ Trying to move from l1 to l4: you may stay where you are instead (50% risk)
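To make the definition concrete, here is a minimal sketch of one way to encode a stochastic system in Python. The encoding itself — nested dictionaries keyed by (state, action) — is our own illustrative choice, not something from the book; the probabilities are taken from the robot example above.

# P[(s, a)] is the outcome distribution of executing a in s.
P = {
    ("s2", "move(l2,l3)"): {"s3": 0.8, "s5": 0.2},  # 20% risk of ending up at l5
    ("s1", "move(l1,l4)"): {"s4": 0.5, "s1": 0.5},  # 50% risk of staying at l1
    ("s4", "wait"):        {"s4": 1.0},             # deterministic transition
}

# Every (state, action) pair must define a full probability distribution:
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, f"sum of P({s},{a},s') must be 1"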
Stochastic Shortest Path Problem

Closest to classical planning: the Stochastic Shortest Path Problem.
▪ Let Σ = (S, A, P) be a stochastic system as defined earlier
▪ Let s0 ∈ S be the initial state
▪ Let Sg ⊆ S be the set of goal states
Then, find a policy of minimal expected cost that can be applied starting at s0 and that reaches a state in Sg with probability 1.
▪ Expected cost: a statistical concept (discussed soon)
"Reaches"? Not "ends in"? Technically, a policy never terminates! So, we can make every goal state g absorbing: for every action a, P(g, a, g) = 1 (the action returns to the same state) and c(g, a) = 0 (no more cost accumulates). Then it doesn't matter that the policy continues executing actions infinitely!

Overview

▪ Deterministic (exact outcome known in advance): classical planning, possibly with extensions. The information dimension is irrelevant — we know everything before execution.
▪ Non-deterministic (multiple outcomes, no probabilities), non-observable (no information gained after an action): NOND, conformant planning
▪ Non-deterministic, fully observable (exact outcome known after an action): FOND, conditional / contingent planning
▪ Probabilistic (multiple outcomes with probabilities), non-observable: probabilistic conformant planning (a special case of POMDPs, Partially Observable MDPs)
▪ Probabilistic, fully observable: probabilistic conditional planning — Stochastic Shortest Path Problems and Markov Decision Processes (MDPs). This lecture!

Policy Example 1

π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }
▪ Reaches s4 or s5, then waits there infinitely many times

Policy Example 2

π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
▪ Always reaches state s4, then waits there infinitely many times

Policy Example 3

π3 = { (s1, move(l1,l4)), (s2, move(l2,l1)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
▪ Reaches state s4 with 100% probability "in the limit" (it could happen that you never reach s4, but the probability is 0)

Policies and Histories

Sequential execution results in a state sequence: a history. It is infinite, since policies do not terminate:
h = 〈s0, s1, s2, s3, s4, … 〉
▪ Note: s0 (index zero) is a variable used in histories, while s0 is also used as a concrete state name in diagrams — we may have s0 = s27.
For classical planning, a plan yields a single history (with the last state repeated infinitely).
For probabilistic planning:
▪ Initial states can be probabilistic: for every state s, there is a probability P(s) that we begin in the state s
▪ Actions can have multiple outcomes
▪ So a policy can yield many different histories — which one is only known at execution time!

History Example 1

π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }
Even if we only consider starting in s1, there are two possible histories:
▪ h1 = 〈s1, s2, s3, s4, s4, … 〉 – reaches s4, waits indefinitely
▪ h2 = 〈s1, s2, s5, s5, … 〉 – reaches s5, waits indefinitely
How likely are these histories?
Probabilities: Initial States, Transitions

Each policy induces a probability distribution over histories. Let h = 〈s0, s1, s2, s3, … 〉.
With an unknown initial state:
▪ P(〈s0, s1, s2, s3, … 〉 | π) = P(s0) ∏_{i ≥ 0} P(si, π(si), si+1)
The book assumes you start in a known state s0, so all histories start with the same state:
▪ P(〈s0, s1, s2, s3, … 〉 | π) = ∏_{i ≥ 0} P(si, π(si), si+1) if s0 is the known initial state
▪ P(〈s0, s1, s2, s3, … 〉 | π) = 0 if s0 is any other state

History Example 1

π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }
Two possible histories, if we always start in s1:
▪ h1 = 〈s1, s2, s3, s4, s4, … 〉 – P(h1 | π1) = 1 × 1 × 0.8 × 1 × … = 0.8
▪ h2 = 〈s1, s2, s5, s5, … 〉 – P(h2 | π1) = 1 × 1 × 0.2 × 1 × … = 0.2
▪ P(h | π1) = 0 for all other h

History Example 2

π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
▪ h1 = 〈s1, s2, s3, s4, s4, … 〉 – P(h1 | π2) = 1 × 1 × 0.8 × 1 × … = 0.8
▪ h3 = 〈s1, s2, s5, s4, s4, … 〉 – P(h3 | π2) = 1 × 1 × 0.2 × 1 × … = 0.2
▪ P(h | π2) = 0 for all other h

History Example 3

π3 = { (s1, move(l1,l4)), (s2, move(l2,l1)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
▪ h4 = 〈s1, s4, s4, … 〉 – P(h4 | π3) = 0.5 × 1 × 1 × … = 0.5
▪ h5 = 〈s1, s1, s4, s4, … 〉 – P(h5 | π3) = 0.5 × 0.5 × 1 × … = 0.25
▪ h6 = 〈s1, s1, s1, s4, s4, … 〉 – P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
▪ …
▪ h∞ = 〈s1, s1, s1, s1, … 〉 – P(h∞ | π3) = 0.5 × 0.5 × 0.5 × 0.5 × … = 0
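The product formula is easy to evaluate for a finite prefix of a history; the infinite tail of an absorbing history only contributes factors of 1. A small sketch using the dictionary encoding from earlier (illustrative, not from the book):

P = {
    ("s1", "move(l1,l2)"): {"s2": 1.0},
    ("s2", "move(l2,l3)"): {"s3": 0.8, "s5": 0.2},
    ("s3", "move(l3,l4)"): {"s4": 1.0},
    ("s4", "wait"):        {"s4": 1.0},
    ("s5", "wait"):        {"s5": 1.0},
}

def history_prob(policy, history):
    """P(h | pi) = product of P(s_i, pi(s_i), s_{i+1}) over a history prefix."""
    prob = 1.0
    for s, s_next in zip(history, history[1:]):
        prob *= P[(s, policy[s])].get(s_next, 0.0)
    return prob

pi1 = {"s1": "move(l1,l2)", "s2": "move(l2,l3)", "s3": "move(l3,l4)",
       "s4": "wait", "s5": "wait"}
print(history_prob(pi1, ["s1", "s2", "s3", "s4"]))  # h1 prefix -> 0.8
print(history_prob(pi1, ["s1", "s2", "s5"]))        # h2 prefix -> 0.2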
Generalizing from the SSPP

We have defined the Stochastic Shortest Path Problem: similar to the classical planning problem, but adapted to probabilistic outcomes. Policies, however, allow indefinite execution. Can we exploit this fact to generalize from SSPPs?
Yes – remove the goal states and assume no termination. But without goal states, what is the objective?

Costs and Rewards

An alternative to goal states:
▪ A numeric cost C(s, a) for each state s and action a
▪ A numeric reward R(s) for each state s
Example: Grid World
▪ Actions: North, South, West, East, each associated with a cost
▪ 90% probability of doing what you want, 10% probability of not moving
▪ Rewards in some cells (e.g. +100, +50); danger in some cells — negative rewards (e.g. –100, –200, –80)

States, not Locations

Important: states ≠ locations. Example reward: a person who wants to move is on board the elevator.
▪ Before pickup(p1, floor3): elevator-at(floor3), person-at(p1, floor3), wants-to-move(p1)
▪ After: elevator-at(floor3), person-onboard(p1), wants-to-move(p1)
▪ We can't "cycle" to receive the same reward again: no path leads back to this state
▪ We can't stay and "accumulate rewards": all actions lead to new states

SSPP using Costs and Rewards

To model the SSPP using un-discounted utilities:
▪ Assume every action has positive cost, except in goal states, which are absorbing (action cost 0)
▪ Set all rewards to 0
Example (robot domain):
▪ No rewards; C(s, a) = 1 for each "horizontal" action; C(s, a) = 100 for each "vertical" action; C(s, wait) = 1
▪ The goal state s4 only permits one action, returning to s4 at cost 0

Utility Functions

We want to know how good a policy is, but there are many possible outcomes. Intermediate step: utility functions.
Suppose one particular outcome of a policy results in one particular history (state sequence). How "useful / valuable" is this outcome to us? One possibility, using both costs and rewards for now, is the un-discounted utility of history h = 〈s0, s1, … 〉 given policy π:
V(h | π) = ∑_{i ≥ 0} (R(si) – C(si, π(si)))
— the reward for being in state si, minus the cost of the action chosen in si.

Utility Functions and SSPP

The SSPP using un-discounted utilities: positive action costs except in goal states; all rewards are zero.
▪ If history h visits a goal state: finitely many actions of finite positive cost, followed by infinitely many actions of cost 0 (which we won't actually execute) → finite cost. Any policy of finite cost solves the SSPP.
▪ If history h does not visit a goal state: infinitely many actions of finite positive cost → infinite cost.
But what about the cases where we want to execute infinitely or indefinitely?

Planning and Execution

▪ Infinite horizon solution (infinite execution): considers all possible infinite histories (as defined earlier); execution never ends. Unrealistic!
▪ Indefinite execution: will end, but we don't know when. Realistic…
  ▪ "Goal-based" execution (SSPP): execute until we achieve a goal state; the solution guarantees finitely many actions of cost > 0
  ▪ Truly indefinite execution: no stop criterion; the solution can have infinitely many actions of cost > 0

Indefinite/Infinite Execution

Example: the robot does not "want to reach s4" — it wants to walk around gaining rewards.
▪ Every time step it is in s5: negative reward (maybe it's in our way)
▪ Every time step it is in s4: positive reward (maybe it helps us and "gets a salary")

Utility Functions (continued)

Example: suppose π1 happens to result in h1 = 〈s1, s2, s3, s4, s4, … 〉. Then
V(h1 | π1) = (0 – 100) + (0 – 1) + (0 – 100) + 100 + 100 + …
The robot stays at s4 forever, executing "wait" — an infinite amount of rewards!
This is not the only history that could result from the policy; that's why we specify both the policy and the history to calculate a utility.
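A quick numeric illustration of this unboundedness (the helper below is hypothetical; the step values of h1 are read off the diagram):

def undiscounted_prefix(step_value, horizon):
    """Partial sum of R(s_i) - C(s_i, pi(s_i)) over the first `horizon` steps."""
    return sum(step_value(i) for i in range(horizon))

# h1 = <s1, s2, s3, s4, s4, ...>: three costly steps, then +100 per "wait" in s4.
h1_step = lambda i: [-100, -1, -100][i] if i < 3 else 100

for n in (10, 100, 1000):
    print(n, undiscounted_prefix(h1_step, n))
# -> 10 499, 100 9499, 1000 99499: the partial sums grow without bound,
#    so every history that eventually reaches s4 has utility +infinity.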
What's the problem, given that we "like" being in state s4? We can't distinguish between different ways of getting there!
▪ s1, s2, s3, s4: –201 + ∞ = ∞
▪ s1, s2, s1, s2, s3, s4: –401 + ∞ = ∞
▪ Both appear equally good…

Discounted Utility

Solution: use a discount factor γ with 0 ≤ γ ≤ 1
▪ To avoid infinite utilities V(…)
▪ To model "impatience": rewards and costs far in the future are less important to us
Discounted utility of a history: V(h | π) = ∑_{i ≥ 0} γ^i (R(si) – C(si, π(si)))
▪ Distant rewards/costs have less influence
▪ Convergence (with finite results) is guaranteed if 0 ≤ γ < 1
The examples will use γ = 0.9 — only to simplify the formulas; in reality the factor should be chosen carefully!

Example

π1 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, wait) }, with γ = 0.9.
Given that we start in s1, π1 can lead to only two histories: an 80% chance of h1 and a 20% chance of h2. The discount factors are 1, 0.9, 0.81, 0.729, 0.6561, …
▪ h1 = 〈s1, s2, s3, s4, s4, … 〉: V(h1 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (0 – 100) + 0.9^3 · 100 + 0.9^4 · 100 + … = 547.9
▪ h2 = 〈s1, s2, s5, s5, … 〉: V(h2 | π1) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (–100) + 0.9^3 (–100) + … = –910.1
E(π1) = 0.8 × 547.9 + 0.2 × (–910.1) = 256.3 — we expect a reward of 256.3 on average.

Example (2)

π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }, with γ = 0.9.
Given that we start in s1, there are again two different histories: an 80% chance of h1 and a 20% chance of h3.
▪ h1 = 〈s1, s2, s3, s4, s4, … 〉: V(h1 | π2) = 547.9, as before
▪ h3 = 〈s1, s2, s5, s4, s4, … 〉: V(h3 | π2) = 0.9^0 (0 – 100) + 0.9^1 (0 – 1) + 0.9^2 (–100 – 100) + 0.9^3 · 100 + … = 466.9
E(π2) = 0.8 × 547.9 + 0.2 × 466.9 = 531.7 — expected reward 531.7 (π1 gave 256.3).

Costs and Rewards: More General

Rewards are often generalized — this is more expressive:
▪ A numeric cost C(s, a) for each state s and action a
▪ A numeric reward R(s, a, s') if you are in s, execute a, and reach s'
V(〈s0, s1, … 〉 | π) = ∑_{i ≥ 0} γ^i (R(si, π(si), si+1) – C(si, π(si)))
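Either form is a discounted sum, which can be approximated numerically by truncating at a horizon where γ^i is negligible. A small sketch (hypothetical helper, not from the slides):

def discounted_utility(step_value, gamma=0.9, horizon=500):
    """Approximates V(h | pi) = sum_i gamma^i * (R(s_i) - C(s_i, pi(s_i)))."""
    return sum(gamma**i * step_value(i) for i in range(horizon))

# Waiting in the absorbing state s4 forever yields R - C = 100 every step:
print(round(discounted_utility(lambda i: 100), 1))  # -> 1000.0 = 100 / (1 - 0.9)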
Simplification

An observation: every step yields one reward and one cost, R(si, π(si), si+1) – C(si, π(si)). Any combination of reward and cost that gives us the same sums is equivalent!

Simplification (2)

To simplify, include the cost in the reward: decrease each R(si, π(si), si+1) by C(si, π(si)).
Example: with C(s0, takeoff) = 80, the rewards R(s0, takeoff, s1) = 200 and R(s0, takeoff, s2) = –100 become R(s0, takeoff, s1) = 120 and R(s0, takeoff, s2) = –180.
Then
V(〈s0, s1, … 〉 | π) = ∑_{i ≥ 0} γ^i (R(si, π(si), si+1) – C(si, π(si)))
simplifies to
V(〈s0, s1, … 〉 | π) = ∑_{i ≥ 0} γ^i R(si, π(si), si+1)

Overview: Markov Decision Processes

▪ Underlying world model: a stochastic system
▪ Plan representation: a policy – which action to perform in any state
▪ Goal representation: a utility function defining "solution quality"
▪ Planning problem: optimization – maximize expected utility
Why "Markov"?

Markov Property (1)

If a stochastic process has the Markov property, it is memoryless: the future of the process can be predicted equally well if we use only its current state or if we use its entire history.
This is part of the definition! P(s, a, s') is the probability of ending up in s' when we are in s and execute a — nothing else matters!

Markov Property (2)

We don't need to know the states we visited before. Only the current state and the action are needed to find out where we may end up, and with which probability.

Remembering the Past

Essential distinction:
▪ Previous states in the history sequence cannot affect the transition function
▪ What happened at earlier timepoints can affect the transition function — but it can partly be encoded into the current state
Example: if you have visited the lectures, you are more likely to pass the exam.
▪ Add a predicate visitedLectures, representing in this state what you did in the past: the information is encoded and stored in the current state
▪ The state space doubles in size (and here we often treat every state separately!)
▪ We only have a finite number of states → we can't encode an unbounded history

Expected Utility

Now we know the utility of each history, i.e. of each outcome. But we can only decide a policy, and each outcome has a probability — so we can calculate an expected ("average") utility for the policy: E(π).

Expected Utility 2

A policy selects actions; the world chooses the outcome. So we must consider all possible outcomes/histories, but not all possible choices: when the policy chooses one action, all outcomes of that action must be handled, while the other actions are irrelevant to us. In the next step the policy again makes a choice, and so on.

Expected Utility 4

Calculating expected utility E(π), method 1: "history-based".
▪ Find all possible infinite histories (〈A,B,E,…〉, 〈A,B,F,…〉, 〈A,B,G,…〉, 〈A,C,H,…〉, …)
▪ Calculate probabilities and rewards over each entire history; multiply and sum:
E(π) = ∑_h P(h | π) V(h | π), where V(h | π) = ∑_{i ≥ 0} γ^i R(si, π(si), si+1)
Conceptually simple, but less useful for calculations.

Expected Utility 5

Calculating expected utility, method 2: "step-based". What is the probability of each outcome of the first action? What is the reward for each outcome? And what is the expected utility of continuing from there?
This gives:
E(π) = E(π, s0), where E(π, s) = ∑_{s' ∈ S} P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s'))
▪ E(π) is the expected utility "from the start"; E(π, s) is the expected utility of "continuing after having reached s"

Expected Utility 6: "Step-Based"

If π is a policy, then E(π, s) = ∑_{s' ∈ S} P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s')).
The expected utility of continuing to execute π after having reached s is the sum, over all possible states s' ∈ S that you might end up in, of:
▪ the probability P(s, π(s), s') of actually ending up in that state given the action π(s) chosen by the policy, times
▪ the reward you get for this transition,
▪ plus the discount factor times the expected utility E(π, s') of continuing π from the new state s'.

Example 1

E(π2, s1) — the expected utility of executing π2 = { (s1, move(l1,l2)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) } starting in s1:
▪ The reward of the first action, move(l1,l2)
▪ Plus the discount factor γ times [ending up in s2] 100% × E(π2, s2)

Example 2

E(π2, s2) — the expected utility of executing π2 starting in s2:
▪ The reward of the first action, move(l2,l3) (which has multiple outcomes!)
▪ Plus the discount factor γ times [ending up in s3] 80% × E(π2, s3), plus [ending up in s5] 20% × E(π2, s5)

Not Recursive!

It seems like we could easily calculate this recursively: E(π2, s1) is defined in terms of E(π2, s2), which is defined in terms of E(π2, s3) and E(π2, s5), and so on — just continue until you reach the end! Why doesn't this work?
There isn't always an "end"! A modified example (with a different action in s5) is a valid policy π where:
▪ E(π, s1) is defined in terms of E(π, s2)
▪ E(π, s2) is defined in terms of E(π, s3) and E(π, s5)
▪ E(π, s3) is defined in terms of E(π, s4)
▪ E(π, s5) is defined in terms of E(π, s2)…

Equation System

If π is a policy, then E(π, s) = ∑_{s' ∈ S} P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s')).
This is an equation system: |S| equations, |S| variables! It requires different solution methods…

Repetition: Utility

Let us first revisit the definition of utility.
We can define the actual utility given an outcome, a history:
▪ Given any history 〈s0, s1, … 〉: V(〈s0, s1, … 〉 | π) = ∑_{i ≥ 0} γ^i R(si, π(si), si+1)
We can define the expected utility using the given probability distribution:
▪ Given that we start in state s: E(π, s) = ∑_{h = 〈s0, s1, … 〉 with s0 = s} P(h | s0 = s) · ∑_{i ≥ 0} γ^i R(si, π(si), si+1)
As we saw, we can also rewrite this recursively:
▪ Given that we start in state s: E(π, s) = ∑_{s' ∈ S} P(s, π(s), s') · (R(s, π(s), s') + γ E(π, s'))

Optimality

Policy π* is optimal if it maximizes expected utility, given the specified probabilities and transition rewards: for all states s and policies π, E(π*, s) ≥ E(π, s).
Sometimes a given initial state s0 is assumed. Then:
▪ Optimal = for all policies π, E(π*, s0) ≥ E(π, s0)
▪ Everywhere optimal = for all states s and policies π, E(π*, s) ≥ E(π, s)
When we solve an MDP exactly, a solution is defined as an optimal policy!
How do we find an optimal policy? Let's begin with an important property of optimal policies.

Principle of Optimality

Bellman's Principle of Optimality: an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Example: find a policy that maximizes expected utility given that we start in s1. Suppose that π* is an optimal policy and that π*(s1) = move(l1,l2), so the next state is always s2. Then π* must also maximize expected utility given that we start in s2!

Principle of Optimality (2)

Sounds trivial? It depends on the Markov property! Suppose costs depended on which states you had visited before:
▪ To go s5 → s4 → s1: use move(l5,l4) and move(l4,l1); reward –200 + –400 = –600
▪ To go s4 → s1 without having visited s5: use move(l4,l1), same as above; reward for this step: 99, not –400
The optimal action would then have to take history into account. This can't happen in an MDP: it is Markovian!

Consequences

Suppose I have a non-optimal policy π, select an arbitrary state s, and change π(s) in a way that increases E(π, s). This cannot decrease E(π, s') for any s'!
▪ We can find solutions through local improvements — no need to "think globally"
Suppose a given initial state s0 is assumed, with optimal = for all policies π, E(π*, s0) ≥ E(π, s0). Then for all states s that are reachable from s0, E(π*, s) ≥ E(π, s).

Policy Iteration

First algorithm: policy iteration. The general idea:
▪ Start out with an initial policy, maybe randomly chosen
▪ Calculate the expected utility of executing that policy from each state
▪ Update the policy by making a local decision for each state: "Which action should my improved policy choose in this state, given the expected utilities of the current policy?"
▪ Iterate until convergence (until the policy no longer changes)
Simplification: our rewards actually do not depend on the state we end up in, so define the expected one-step reward
R(s, π(s)) = ∑_{s' ∈ S} P(s, π(s), s') · R(s, π(s), s')
which gives E(π, s) = R(s, π(s)) + γ ∑_{s' ∈ S} P(s, π(s), s') E(π, s').

Preliminaries 1: Single-step policy changes

What if I take a policy, change the decision in the first step, and keep the policy for everything else?
▪ Let Q(π, s, a) be the expected utility of π in a state s if we start by executing the given action a, but use the policy π from then onward.
Expected utility of the original policy π:
▪ E(π, s) = R(s, π(s)) + γ ∑_{s' ∈ S} P(s, π(s), s') E(π, s')
Expected utility given a one-step change — a local change:
▪ Q(π, s, a) = R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') E(π, s')
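Solving the |S|-variable equation system above is just linear algebra. A sketch using numpy (assumed available), evaluating a policy that waits in every state of the robot example — the initial policy used on the following slides:

import numpy as np

states = ["s1", "s2", "s3", "s4", "s5"]
gamma = 0.9

# Under the all-"wait" policy, each action deterministically returns to the
# same state, so the policy's transition matrix is the identity.
P_pi = np.eye(len(states))
# Expected one-step rewards R(s, pi(s)) from the diagram (costs folded in).
R_pi = np.array([-1.0, -1.0, -1.0, 100.0, -100.0])

# E = R_pi + gamma * P_pi @ E  <=>  (I - gamma * P_pi) @ E = R_pi
E = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_pi)
print(dict(zip(states, E.round(1))))
# -> {'s1': -10.0, 's2': -10.0, 's3': -10.0, 's4': 1000.0, 's5': -1000.0}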
Preliminaries 2: Example

▪ E(π, s1): the expected utility of following the current policy, starting in s1, beginning with move(l1,l2)
▪ Q(π, s1, move(l1,l4)): the expected utility of first trying to move from l1 to l4, then following the current policy
  ▪ This does not correspond to any possible policy: if move(l1,l4) returns you to state s1, then the next action is move(l1,l2)!

Preliminaries 3

Suppose you have an (everywhere) optimal policy π*. Then, because of the principle of optimality:
▪ In every state, the local choice made by the policy is locally optimal
▪ For all states s, E(π*, s) = max_{a} Q(π*, s, a)
This yields the modification step we will use. We have a possibly non-optimal policy π and want to create an improved policy π'. For every state s:
▪ Set π'(s) = argmax_{a} Q(π, s, a)

Preliminaries 4

If executing move(l1,l4) first has a greater expected utility, we should modify the current policy: π(s1) := move(l1,l4).

Policy Iteration 2: Initial Policy π1

Policy iteration requires an initial policy. Let's start by choosing "wait" in every state:
π1 = { (s1, wait), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }
Let's set a discount factor γ = 0.9 — easy to use in the calculations on these slides, but in reality we might use a larger factor (we're not that short-sighted!).
We need to know the expected utilities, because we will make changes according to Q(π1, s, a), which depends on ∑_{s' ∈ S} P(s, a, s') E(π1, s').

Policy Iteration 3: Expected Costs for π1

Calculate expected utilities for the current policy π1. This is simple, since the chosen transitions are deterministic and return to the same state. From E(π, s) = R(s, π(s)) + γ ∑_{s' ∈ S} P(s, π(s), s') E(π, s'):
▪ E(π1, s1) = R(s1, wait) + 0.9 E(π1, s1) = –1 + 0.9 E(π1, s1)
▪ E(π1, s2) = R(s2, wait) + 0.9 E(π1, s2) = –1 + 0.9 E(π1, s2)
▪ E(π1, s3) = R(s3, wait) + 0.9 E(π1, s3) = –1 + 0.9 E(π1, s3)
▪ E(π1, s4) = R(s4, wait) + 0.9 E(π1, s4) = +100 + 0.9 E(π1, s4)
▪ E(π1, s5) = R(s5, wait) + 0.9 E(π1, s5) = –100 + 0.9 E(π1, s5)
Simple equations to solve:
▪ 0.1 E(π1, s1) = –1, so E(π1, s1) = –10
▪ 0.1 E(π1, s2) = –1, so E(π1, s2) = –10
▪ 0.1 E(π1, s3) = –1, so E(π1, s3) = –10
▪ 0.1 E(π1, s4) = +100, so E(π1, s4) = +1000
▪ 0.1 E(π1, s5) = –100, so E(π1, s5) = –1000
Given this policy π1: high rewards if we start in s4, high costs if we start in s5.

Policy Iteration 4: Update 1a

What is the best local modification according to the expected utilities of the current policy? For every state s, let π2(s) = argmax_{a ∈ A} Q(π1, s, a) — that is, find the action a that maximizes R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') E(π1, s').
▪ s1: wait: –1 + 0.9 × (–10) = –10; move(l1,l2): –100 + 0.9 × (–10) = –109; move(l1,l4): –1 + 0.9 × (0.5 × –10 + 0.5 × 1000) = +444.5 ← the best improvement
These are not the true expected utilities for starting in state s1!
▪ They are only correct if we locally change the first action to execute and then go on to use the previous policy (in this case, always waiting)!
▪ But they can be proven to yield good guidance, as long as you apply the improvements repeatedly (as policy iteration does)

Policy Iteration 5: Update 1b

▪ s2: wait: –1 + 0.9 × (–10) = –10; move(l2,l1): –100 + 0.9 × (–10) = –109; move(l2,l3): –1 + 0.9 × (0.8 × –10 + 0.2 × –1000) = –188.2

Policy Iteration 6: Update 1c

▪ s3: wait: –1 + 0.9 × (–10) = –10; move(l3,l2): –1 + 0.9 × (–10) = –10; move(l3,l4): –100 + 0.9 × 1000 = +800
▪ s4: wait: +100 + 0.9 × 1000 = +1000; move(l4,l1): +99 + 0.9 × (–10) = +90; …
▪ s5: wait: –100 + 0.9 × (–1000) = –1000; move(l5,l2): –101 + 0.9 × (–10) = –110; move(l5,l4): –200 + 0.9 × 1000 = +700

Policy Iteration 7: Second Policy

This results in a new policy:
▪ π1 = { (s1, wait), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }, with E(π1, s1) = –10, E(π1, s2) = –10, E(π1, s3) = –10, E(π1, s4) = +1000, E(π1, s5) = –1000
▪ π2 = { (s1, move(l1,l4)), (s2, wait), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }, with utilities ≥ +444.5, ≥ –10, ≥ +800, ≥ +1000, ≥ +700
These utilities are based on one modified action followed by π1 (they can't decrease!). S4 seems to be a good state, so we try to go there from s1 / s3 / s5 — no change in s2 yet…
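The improvement step just applied is a one-line argmax per state. A sketch (dictionary encoding as before; the R and P tables are assumed inputs, and the final comment refers to the slides' numbers):

def improve(P, R, E, gamma=0.9):
    """One policy-improvement sweep: pi'(s) = argmax_a Q(pi, s, a), where
    Q(pi, s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * E(pi, s')."""
    new_policy = {}
    for s in {s for (s, _) in P}:
        actions = [a for (s2, a) in P if s2 == s]
        def q(a):
            return R[(s, a)] + gamma * sum(p * E[s2] for s2, p in P[(s, a)].items())
        new_policy[s] = max(actions, key=q)
    return new_policy

# With E(pi1) = {s1: -10, ..., s4: +1000} and the s1 actions from the slides,
# improve(...) picks move(l1,l4) for s1, matching the +444.5 computed above.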
Now we have made use of the earlier indications that s4 seems to be a good state.

Policy Iteration 8: Expected Costs for π2

Calculate the true expected utilities for the new policy π2:
▪ E(π2, s1) = R(s1, move(l1,l4)) + γ (0.5 E(π2, s1) + 0.5 E(π2, s4)) = –1 + 0.9 (0.5 E(π2, s1) + 0.5 E(π2, s4))
▪ E(π2, s2) = R(s2, wait) + γ E(π2, s2) = –1 + 0.9 E(π2, s2)
▪ E(π2, s3) = R(s3, move(l3,l4)) + γ E(π2, s4) = –100 + 0.9 E(π2, s4)
▪ E(π2, s4) = R(s4, wait) + γ E(π2, s4) = +100 + 0.9 E(π2, s4)
▪ E(π2, s5) = R(s5, move(l5,l4)) + γ E(π2, s4) = –200 + 0.9 E(π2, s4)
Equations to solve:
▪ 0.1 E(π2, s2) = –1, so E(π2, s2) = –10
▪ 0.1 E(π2, s4) = +100, so E(π2, s4) = +1000
▪ E(π2, s3) = –100 + 0.9 E(π2, s4) = –100 + 0.9 × 1000 = +800
▪ E(π2, s5) = –200 + 0.9 E(π2, s4) = –200 + 0.9 × 1000 = +700
▪ E(π2, s1) = –1 + 0.45 E(π2, s1) + 0.45 E(π2, s4), so 0.55 E(π2, s1) = –1 + 450 = +449, giving E(π2, s1) = +816.36…

Policy Iteration 9: Second Policy

Now we have the true expected utilities of the second policy:
▪ π1 = { (s1, wait), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }: E(π1, s) = –10, –10, –10, +1000, –1000
▪ π2 = { (s1, move(l1,l4)), (s2, wait), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }: E(π2, s1) = +816.36, E(π2, s2) = –10, E(π2, s3) = +800, E(π2, s4) = +1000, E(π2, s5) = +700 (the one-step estimates were ≥ +444.5, ≥ –10, ≥ +800, ≥ +1000, ≥ +700)
▪ S5 wasn't so bad after all, since you can reach s4 in a single step!
▪ S2 seems much worse in comparison, since the benefits of s4 haven't "propagated" that far.
▪ S1 / s3 are even better.

Policy Iteration 10: Update 2a

What is the best local modification according to the expected utilities of the current policy? For every state s, let π3(s) = argmax_{a ∈ A} Q(π2, s, a) — that is, find the action a that maximizes R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') E(π2, s').
▪ s1: wait: –1 + 0.9 × 816.36 = +733.72; move(l1,l2): –100 + 0.9 × (–10) = –109; move(l1,l4): –1 + 0.9 × (0.5 × 1000 + 0.5 × 816.36) = +816.36 ← seems best – chosen!
▪ s2: wait: –1 + 0.9 × (–10) = –10; move(l2,l1): –100 + 0.9 × 816.36 = +634.72; move(l2,l3): –1 + 0.9 × (0.8 × 800 + 0.2 × 700) = +701
Now we will change the action taken at s2, since the expected utilities for the reachable states s1, s3, s5 have increased.
Policy Iteration 11: Update 2b

▪ s3: wait: –1 + 0.9 × 800 = +719; move(l3,l2): –1 + 0.9 × (–10) = –10; move(l3,l4): –100 + 0.9 × 1000 = +800
▪ s4: wait: +100 + 0.9 × 1000 = +1000; move(l4,l1): +99 + 0.9 × 816.36 = +833.72; …
▪ s5: wait: –100 + 0.9 × 700 = +530; move(l5,l2): –101 + 0.9 × (–10) = –110; move(l5,l4): –200 + 0.9 × 1000 = +700

Policy Iteration 12: Third Policy

This results in a new policy:
π3 = { (s1, move(l1,l4)), (s2, move(l2,l3)), (s3, move(l3,l4)), (s4, wait), (s5, move(l5,l4)) }
The true expected utilities are updated by solving an equation system. The algorithm will iterate once more; no changes will be made to the policy, so it terminates with an optimal policy!

Policy Iteration 13: Algorithm

Policy iteration is a way to find an optimal policy π*. Start with an arbitrary initial policy π1. Then, for i = 1, 2, …:
▪ Compute the expected utilities E(πi, s) for every s by solving a system of equations — the expected utilities of the "current" policy in every state:
  ▪ For all s: E(πi, s) = R(s, πi(s)) + γ ∑_{s' ∈ S} P(s, πi(s), s') E(πi, s')
  ▪ This is not a simple recursive calculation – the state graph is generally cyclic!
▪ Compute an improved policy πi+1 "locally" for every s — the best action in any given state s, given the expected utilities of the old policy πi:
  ▪ πi+1(s) := argmax_{a ∈ A} R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') E(πi, s')
▪ If πi+1 = πi, then exit: no local improvement is possible, so the solution is optimal
▪ Otherwise, πi+1 is a new policy – with new expected utilities! Iterate, calculate those utilities, …

Convergence

Policy iteration converges in a finite number of iterations!
▪ We change which action to execute if this improves the expected (pseudo-)utility for this state
▪ This can sometimes decrease, and never increase, the cost of other states — so costs are monotonically improving, and we only have to consider a finite number of policies
In general, it may take many iterations, and each iteration can be slow — mainly because of the need to solve a large equation system!

An Alternative Policy Iteration

How do we avoid the need to solve an equation system? In PI we have a policy πi and want its expected utilities E(πi, s). A way of approximating the same thing (one step in policy iteration):
▪ E_{i,0}(πi, s) = 0 for all s — the utility for 0 steps
▪ Then iterate for j = 0, 1, 2, … and for all states s — the utility for 1 step, 2 steps, 3 steps, …:
  E_{i,j+1}(πi, s) = R(s, πi(s)) + γ ∑_{s' ∈ S} P(s, πi(s), s') E_{i,j}(πi, s')
▪ Iterate until E_{i,j+1} is sufficiently close to E_{i,j}. This will converge in the limit: γ < 1, so future steps become less and less important.
▪ Finally, the utility function determines the best actions to use — essentially maximizing Q(…):
  πi+1(s) = argmax_{a ∈ A} R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') E_{i,n}(πi, s')

An Alternative Policy Iteration (2)

This replaces the equation system. It is not an exact solution, and it still takes a long time:
▪ Create an arbitrary policy π0
▪ Calculate E_{0,0}, E_{0,1}, E_{0,2}, … — an approximate utility function
▪ Update the policy to π1
▪ Calculate E_{1,0}, E_{1,1}, E_{1,2}, … — an approximate utility function
▪ …
But this gives us a hint for how we can continue…
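Before turning to that continuation, here is a compact runnable sketch of exact policy iteration as summarized above (Python with numpy; the dictionary model encoding is our illustrative convention, not the book's):

import numpy as np

def policy_iteration(states, actions, P, R, gamma=0.9):
    """P[(s,a)] = {s': prob}; R[(s,a)] = expected one-step reward;
    actions[s] = list of actions applicable in s."""
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[s][0] for s in states}          # arbitrary initial policy
    while True:
        # 1. Policy evaluation: solve E = R_pi + gamma * P_pi E exactly.
        P_pi = np.zeros((len(states), len(states)))
        R_pi = np.zeros(len(states))
        for s in states:
            R_pi[idx[s]] = R[(s, pi[s])]
            for s2, p in P[(s, pi[s])].items():
                P_pi[idx[s], idx[s2]] = p
        E = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_pi)
        # 2. Policy improvement: local argmax of Q in every state.
        def q(s, a):
            return R[(s, a)] + gamma * sum(p * E[idx[s2]] for s2, p in P[(s, a)].items())
        new_pi = {}
        for s in states:
            best = max(actions[s], key=lambda a: q(s, a))
            # Keep the current action on ties so the loop is guaranteed to stop.
            new_pi[s] = pi[s] if q(s, pi[s]) >= q(s, best) - 1e-12 else best
        if new_pi == pi:                              # no local improvement: optimal
            return pi, E
        pi = new_pi

On the slides' robot model (when fully encoded in this form), this terminates with the π3 shown above.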
Value Iteration

Another algorithm: value iteration. It does not use a policy during iteration! Instead:
▪ V0(s) = 0 for all s — there is no policy argument; let's call it "V" since it does not correspond to expected utilities!
▪ We used to let a policy select the action:
  E_{i,j+1}(πi, s) = R(s, πi(s)) + γ ∑_{s' ∈ S} P(s, πi(s), s') E_{i,j}(πi, s')
▪ Now we instead iterate for j = 0, 1, 2, … and for all states s, using the action maximizing finite-horizon utility:
  V_{j+1}(s) = max_{a ∈ A} R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') V_j(s')
▪ Finally, after convergence, extract the policy: π(s) = argmax_{a ∈ A} R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') V_n(s')
So we calculate V0, then V1, then V2, then V3, …

Value Iteration (2)

The main difference:
With policy iteration:
▪ Find a policy
▪ Find exact expected utilities for infinite steps using this policy (expensive, but gives the best possible basis for improvement)
▪ Use these to generate a new policy
▪ Throw away the old utilities, find exact expected utilities for infinite steps using the new policy
▪ Use these to generate a new policy, …
With value iteration:
▪ Find the best utilities considering 0 steps; this implicitly defines a policy
▪ Find the best utilities considering 1 step; this implicitly defines a policy
▪ Find the best utilities considering 2 steps; this implicitly defines a policy
▪ …

Value iteration will always converge towards an optimal value function. It will converge faster if V0(s) is close to the true value function, but it will actually converge regardless of the initial value of V0(s) — intuition: as n goes to infinity, the importance of V0(s) goes to zero.

Value Iteration 2: Initial Guess V0

Value iteration requires an initial approximation. Let's start with V0(s) = 0 for each s.
▪ This does not correspond to any actual policy, but it does correspond to the expected reward of executing zero steps…

Value Iteration 3: Update 1a

What is the best local modification according to the current approximation? PI would find the action a maximizing R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') E(π1, s'); VI instead finds the action a that maximizes R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') V0(s').
▪ s1: wait: –1 + 0.9 × 0 = –1; move(l1,l2): –100 + 0.9 × 0 = –100; move(l1,l4): –1 + 0.9 × (0.5 × 0 + 0.5 × 0) = –1
▪ s2: wait: –1 + 0.9 × 0 = –1; move(l2,l1): –100 + 0.9 × 0 = –100; move(l2,l3): –1 + 0.9 × (0.8 × 0 + 0.2 × 0) = –1
Value Iteration 4: Update 1b

▪ s3: wait: –1 + 0.9 × 0 = –1; move(l3,l2): –1 + 0.9 × 0 = –1; move(l3,l4): –100 + 0.9 × 0 = –100
▪ s4: wait: +100 + 0.9 × 0 = +100; move(l4,l1): +99 + 0.9 × 0 = +99; …
▪ s5: wait: –100 + 0.9 × 0 = –100; move(l5,l2): –101 + 0.9 × 0 = –101; move(l5,l4): –200 + 0.9 × 0 = –200

Value Iteration 5: Second Policy

This results in a new approximation of the greatest expected utility:
▪ V1(s1) = –1, V1(s2) = –1, V1(s3) = –1, V1(s4) = +100, V1(s5) = –100
V1 corresponds to one step of many policies, including π1 = { (s1, wait), (s2, wait), (s3, move(l3,l2)), (s4, wait), (s5, wait) }.
Policy iteration would now calculate the true expected utility for a chosen policy. Value iteration instead continues using V1, which is only a calculation guideline, not the true expected utility of any policy. For infinite execution, E(π1, s1) = –10, but this is not calculated…

Value Iteration 6: Update 2a

What is the best local modification according to the current approximation? VI finds the action a that maximizes R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') V_{k–1}(s') (where PI would use E(πk, s')):
▪ s1: wait: –1 + 0.9 × (–1) = –1.9; move(l1,l2): –100 + 0.9 × (–1) = –100.9; move(l1,l4): –1 + 0.9 × (0.5 × –1 + 0.5 × 100) = +43.55
▪ s2: wait: –1 + 0.9 × (–1) = –1.9; move(l2,l1): –100 + 0.9 × (–1) = –100.9; move(l2,l3): –1 + 0.9 × (0.8 × –1 + 0.2 × –1) = –1.9

Value Iteration 7: Update 2b

▪ s3: wait: –1 + 0.9 × (–1) = –1.9; move(l3,l2): –1 + 0.9 × (–1) = –1.9; move(l3,l4): –100 + 0.9 × 100 = –10
▪ s4: wait: +100 + 0.9 × 100 = +190; move(l4,l1): +99 + 0.9 × (–1) = +98.1; …
▪ s5: wait: –100 + 0.9 × (–1) = –100.9; move(l5,l2): –101 + 0.9 × (–1) = –101.9; move(l5,l4): –200 + 0.9 × 100 = –110

Value Iteration 8: Second Policy

This results in another new approximation:
▪ V2(s1) = +43.55, V2(s2) = –1.9, V2(s3) = –1.9, V2(s4) = +190, V2(s5) = –100.9
This corresponds to one step of π2 = { (s1, move(l1,l4)), (s2, wait), (s3, wait), (s4, wait), (s5, wait) }.
Again, V2 doesn't represent the true expected utility of π2: it is the true expected utility of one step of π2, then one step of π1! Nor is it the true expected
utility of executing two steps of π2. (But it will converge towards the true utility…)

Differences

Significant differences from policy iteration:
▪ A less accurate basis for action selection: based on approximate utilities, not true expected utilities
▪ The policy does not necessarily change in each iteration: we may first have to iterate n times, incrementally improving the approximations, before another action suddenly seems better in some state
▪ A larger number of iterations is required — but each iteration is cheaper
▪ We can't terminate just because the policy does not change; we need another termination condition…

Illustration

[Illustration: the table of Q-values over the iterations.] Notice that we already calculated rows 1 and 2; for example, for s1: wait: –1 + 0.9 × (–1) = –1.9; move(l1,l2): –100 + 0.9 × (–1) = –100.9; move(l1,l4): –1 + 0.9 × (0.5 × –1 + 0.5 × 100) = +43.55.
Remember that these are "pseudo-utilities"! For example, 324.109 is the reward of waiting once in s5, then continuing according to the previous 14 policies for 14 steps, then doing nothing (which is impossible according to the model).
The policy implicit in the value function changes incrementally. At some point we reach the final recommendation:
▪ s1: max value for the action moving towards s4 — will never change
▪ s2: max value for the action moving towards s3 — will never change
▪ s3: max value for the action moving towards s4 — will never change
▪ s4: only wait — will never change
▪ s5: max value for the action moving towards s4 — will never change
The optimal policy is found in iteration 4 — but we can't know this: these are not true rewards, and maybe one action will soon "overtake" another!

Different Discount Factors

Suppose the discount factor is 0.99 instead (illustration showing only the best pseudo-utility at each step). Convergence is much slower:
▪ Change at step 20: 5%, compared to 2% with γ = 0.9
▪ Change at step 50: 1.63%, compared to 0.07%
If we care more about the future, we need to consider many more steps!

How Many Iterations?

We can find bounds! Let ε be the greatest change in pseudo-utility between two iterations:
ε = max_{s ∈ S} |V_new(s) – V_old(s)|
Then if we create a policy π according to V_new, we have a bound:
max_{s ∈ S} |E(π, s) – E(π*, s)| < 2εγ / (1 – γ)
▪ For every state, the reward of π is at most 2εγ / (1 – γ) from the reward of an optimal policy.
The bound 2εγ / (1 – γ) for some values of ε and γ:

Maximum difference ε:   0.001   0.01    0.1     1       5       10      100
γ = 0.5:                0.002   0.02    0.2     2       10      20      200
γ = 0.9:                0.018   0.18    1.8     18      90      180     1800
γ = 0.95:               0.038   0.38    3.8     38      190     380     3800
γ = 0.99:               0.198   1.98    19.8    198     990     1980    19800
γ = 0.999:              1.998   19.98   199.8   1998    9990    19980   199800

How Many Iterations? Discount 0.90

Bounds are incrementally tightened!
▪ Quit after 2 iterations: V2(s1) = 43. Guarantee: the corresponding policy gives ≥ 43 – 1620.
▪ Quit after 10 iterations: we know V10(s1) = 467. Guarantee: the new corresponding policy gives ≥ 467 – 697 if we start in s1.
▪ Quit after 50 iterations: we know V50(s1) = 811. New guarantee: the same policy actually gives ≥ 811 – 10 if we start in s1.

How Many Iterations? Discount 0.99

Bounds are incrementally tightened!
▪ Quit after 250 iterations: we know V250(s1) = 8989. Guarantee: the corresponding policy gives ≥ 8989 – 1621.
▪ Quit after 600 iterations: we know V600(s1) = 9775. Guarantee: ≥ 9775 – 48.
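The procedure, with the ε-based cutoff just described, fits in a few lines; a sketch (dictionary model encoding as in the earlier sketches), summarized more formally on the next slide:

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """P[(s,a)] = {s': prob}; R[(s,a)] = expected one-step reward."""
    V = {s: 0.0 for s in states}                     # initial guess V0
    while True:
        def q(s, a):
            return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
        V_new = {s: max(q(s, a) for a in actions[s]) for s in states}
        change = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if change < eps:                             # almost no change!
            # Extract the implicit policy from the final value function.
            return {s: max(actions[s], key=lambda a: q(s, a)) for s in states}, V

By the bound above, the returned policy's expected utility is within 2εγ / (1 – γ) of optimal in every state.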
Value Iteration: Algorithm

Value iteration to find π*. Start with an arbitrary reward V0(s) for each s and an arbitrary ε > 0. For k = 1, 2, …:
▪ for each s in S:
  ▪ for each a in A: Q(s, a) := R(s, a) + γ ∑_{s' ∈ S} P(s, a, s') V_{k–1}(s')   — a modified definition of Q(s, a): here we use the previous V()
  ▪ V_k(s) := max_{a ∈ A} Q(s, a)
  ▪ π(s) := argmax_{a ∈ A} Q(s, a)   — only needed in the final iteration
▪ if max_{s ∈ S} |V_k(s) – V_{k–1}(s)| < ε then exit   — almost no change!
On an acyclic graph, the values converge in finitely many iterations. On a cyclic graph, value convergence can take infinitely many iterations — that's why ε > 0 is needed.

Discussion

Both algorithms converge in a polynomial number of iterations, but the variable in the polynomial is the number of states — and the number of states is usually huge. We need to examine the entire state space in each iteration, so these algorithms take huge amounts of time and space.
Probabilistic set-theoretic planning is EXPTIME-complete — much harder than ordinary set-theoretic planning, which was "only" PSPACE-complete.
▪ Methods exist for reducing the search space, and for approximating optimal solutions — beyond the scope of this course.

Overview

▪ Deterministic, exact outcome known in advance: classical planning (possibly with extensions); the information dimension is meaningless
▪ Non-deterministic, non-observable (no information gained after an action): non-deterministic conformant planning
▪ Non-deterministic, fully observable (exact outcome known after an action): conditional (contingent) planning
▪ Non-deterministic, partially observable (some information gained after an action): no specific name
▪ Probabilistic, non-observable: probabilistic conformant planning (a special case of POMDPs)
▪ Probabilistic, fully observable: probabilistic conditional planning — Markov Decision Processes (MDPs)
▪ Probabilistic, partially observable: Partially Observable MDPs (POMDPs)
In general: full information is the easiest; partial information is the hardest!

Action Representations

The book only deals with the underlying semantics: an "unstructured" probability distribution P(s, a, s'). Several "convenient" representations are possible, such as Bayes networks and probabilistic operators.

Representation Example: PPDDL

Probabilistic PDDL adds new constructs for effects and the initial state:
(probabilistic p1 e1 ... pk ek)
▪ Effect e1 takes place with probability p1, etc.
▪ The sum of the probabilities must be <= 1 (it can be strictly less: an implicit empty effect)

(define (domain bomb-and-toilet)
  (:requirements :conditional-effects :probabilistic-effects)
  (:predicates (bomb-in-package ?pkg) (toilet-clogged) (bomb-defused))
  (:action dunk-package
    :parameters (?pkg)
    ; First, a "standard" conditional effect; then a 5% chance of
    ; toilet-clogged, 95% chance of no effect.
    :effect (and (when (bomb-in-package ?pkg) (bomb-defused))
                 (probabilistic 0.05 (toilet-clogged)))))

(define (problem bomb-and-toilet)
  (:domain bomb-and-toilet)
  (:requirements :negative-preconditions)
  (:objects package1 package2)
  ; Probabilistic initial state:
  (:init (probabilistic 0.5 (bomb-in-package package1)
                        0.5 (bomb-in-package package2)))
  (:goal (and (bomb-defused) (not (toilet-clogged)))))

Ladder 1

;; Authors: Sylvie Thiébaux and Iain Little
You are stuck on a roof because the ladder you climbed up on fell down. There are plenty of people around; if you call out for help, someone will certainly lift the ladder up again. Or you can try to climb down without it. You aren't a very good climber though, so there is a 50-50 chance that you will fall and break your neck if you go it alone.
What do you do?

(define (problem climber-problem)
  (:domain climber)
  (:init (on-roof) (alive) (ladder-on-ground))
  (:goal (and (on-ground) (alive))))

Ladder 2

(define (domain climber)
  (:requirements :typing :strips :probabilistic-effects)
  (:predicates (on-roof) (on-ground) (ladder-raised) (ladder-on-ground) (alive))
  (:action climb-without-ladder
    :parameters ()
    :precondition (and (on-roof) (alive))
    :effect (and (not (on-roof)) (on-ground)
                 (probabilistic 0.4 (not (alive)))))
  (:action climb-with-ladder
    :parameters ()
    :precondition (and (on-roof) (alive) (ladder-raised))
    :effect (and (not (on-roof)) (on-ground)))
  (:action call-for-help
    :parameters ()
    :precondition (and (on-roof) (alive) (ladder-on-ground))
    :effect (and (not (ladder-on-ground)) (ladder-raised))))

Exploding Blocks World

When putting down a block, there is a 30% risk that it explodes, destroying what you placed the block on. Use additional blocks as potential "sacrifices"!

(:action put-down-block-on-table
  :parameters (?b - block)
  :precondition (and (holding ?b) (not (destroyed-table)))
  :effect (and (not (holding ?b)) (on-top-of-table ?b)
               (when (not (detonated ?b))
                 (probabilistic .3 (and (detonated ?b) (destroyed-table))))))

Tire World

A reward/cost-based domain: the tire may go flat, and calling AAA is expensive, so it is a good idea to load a spare.

(:action mov-car
  :parameters (?from - location ?to - location)
  :precondition (and (vehicle-at ?from) (road ?from ?to) (not (flattire)))
  :effect (and (vehicle-at ?to) (not (vehicle-at ?from)) (decrease reward 1)
               (probabilistic .15 (flattire))))
(:action loadtire
  :parameters (?loc - location)
  :precondition (and (vehicle-at ?loc) (hasspare-location ?loc) (not (hasspare-vehicle)))
  :effect (and (hasspare-vehicle) (not (hasspare-location ?loc)) (decrease reward 1)))
(:action changetire
  :precondition (and (hasspare-vehicle) (flattire))
  :effect (and (decrease (reward) 1) (not (hasspare-vehicle)) (not (flattire))))
(:action callAAA
  :precondition (flattire)
  :effect (and (decrease (reward) 100) (not (flattire))))

Representation Example: RDDL

RDDL, the Relational Dynamic Influence Diagram Language, is based on Dynamic Bayesian Networks:

domain prop_dbn {
  requirements = { reward-deterministic };

  // Define the state and action variables (not parameterized here)
  pvariables {
    p : { state-fluent, bool, default = false };
    q : { state-fluent, bool, default = false };
    r : { state-fluent, bool, default = false };
    a : { action-fluent, bool, default = false };
  };

  // Define the conditional probability function for each next
  // state variable in terms of previous state and action
  cpfs {
    p' = if (p ^ r) then Bernoulli(.9) else Bernoulli(.3);
    q' = if (q ^ r) then Bernoulli(.9)
         else if (a) then Bernoulli(.3) else Bernoulli(.8);
    r' = if (~q) then KronDelta(r) else KronDelta(r <=> q);
  };

  // Define the reward function; note that boolean functions are
  // treated as 0/1 integers in arithmetic expressions
  reward = p + q - r;
}
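To make the semantics concrete, here is a small Python sketch (our own illustration, not an official RDDL tool) that samples one step of the prop_dbn domain above:

import random

def bernoulli(p):
    return random.random() < p

def step(state, a):
    """Sample the next state and reward; state maps 'p', 'q', 'r' to booleans."""
    p, q, r = state["p"], state["q"], state["r"]
    next_state = {
        "p": bernoulli(0.9) if (p and r) else bernoulli(0.3),
        "q": bernoulli(0.9) if (q and r) else (bernoulli(0.3) if a else bernoulli(0.8)),
        "r": r if not q else (r == q),   # KronDelta = deterministic outcome
    }
    reward = int(p) + int(q) - int(r)    # booleans as 0/1 integers
    return next_state, reward

s, total = {"p": False, "q": False, "r": False}, 0
for _ in range(10):
    s, rew = step(s, a=True)
    total += rew
print(total)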