Markov Decision Processes (MDPs)
• read Ch 17.1-17.2
• utility-based agents
  – goals encoded in a utility function U(s), i.e. U: S → ℝ
• effects of actions encoded in a state transition function: T: S×A → S
  – or T: S×A → pdf(S) for non-deterministic actions
• rewards/costs encoded in a reward function: R: S×A → ℝ
• Markov property: effects of actions depend only on the current state, not on the previous history
• the goal: maximize reward over time
  – long-term discounted reward
  – keeps the total bounded over an infinite horizon; encourages achieving rewards sooner
• “plans” are encoded in policies
  – mappings from states to actions: π: S → A
• how to compute the optimal policy π* that maximizes long-term discounted reward?
• value function Vπ(s): expected long-term discounted reward from starting in state s and following policy π
• derive a policy from V(s):
  – π(s) = argmax_{a∈A} E[R(s,a) + γ·V(T(s,a))]
  –       = argmax_{a∈A} Σ_{s'} P(s'|s,a)·(R(s,a) + γ·V(s'))
• the optimal policy comes from the optimal value function:
  – π*(s) = argmax_{a∈A} Σ_{s'} P(s'|s,a)·(R(s,a) + γ·V*(s'))

Calculating V*(s)
• Bellman’s equations (eqn 17.5):
  – V*(s) = max_{a∈A} Σ_{s'} P(s'|s,a)·(R(s,a) + γ·V*(s'))
• method 1: linear programming (see the sketch below)
  – n coupled equations, one per state:
    v1 = max(...v2, v3, v4...), v2 = max(...v1, v3, v4...), v3 = max(...v1, v2, v4...)
  – the max makes these nonlinear, but each max can be replaced by one linear inequality per action; minimizing v1 + v2 + v3 + ... then recovers V*
  – solve for {v1, v2, v3, ...} using an LP solver, e.g. the GNU LP kit
• method 2: Value Iteration (see the sketch below)
  – initialize V(s) = 0 for all states
  – iteratively update the value of each state from the values of its neighbors (Bellman backup)
  – ...until convergence
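A minimal sketch of method 1, assuming a small tabular MDP with transition probabilities P (shape S×A×S) and rewards R (shape S×A). The function name solve_mdp_lp, the array layout, and the discount value are illustrative choices, and scipy.optimize.linprog is used here only as a stand-in for an LP solver such as the GNU LP kit mentioned in the notes.

```python
# Sketch of method 1: each max in Bellman's equations becomes one linear
# inequality per (state, action) pair; minimizing sum_s V(s) subject to
#   V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')   for all s, a
# recovers V* at the optimum.
import numpy as np
from scipy.optimize import linprog  # stand-in for an LP solver like the GNU LP kit

def solve_mdp_lp(P, R, gamma=0.9):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards. Returns V*."""
    S, A = R.shape
    c = np.ones(S)                      # objective: minimize sum_s V(s)
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            # rewritten as: gamma * P(.|s,a) . V - V(s) <= -R(s,a)
            row = gamma * P[s, a]
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(None, None))  # V(s) may be negative, so leave it unbounded
    return res.x
```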
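A minimal sketch of method 2, Value Iteration, together with greedy policy extraction, under the same assumed tabular representation. The function name value_iteration, the convergence threshold eps, and the toy two-state example at the end are hypothetical choices for illustration.

```python
# Sketch of method 2: Value Iteration plus greedy policy extraction.
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns (V, pi) where V approximates V* and pi is the greedy policy."""
    S, A = R.shape
    V = np.zeros(S)                          # initialize V(s) = 0 for all states
    while True:
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
        Q = R + gamma * (P @ V)              # shape (S, A)
        V_new = Q.max(axis=1)                # V(s) <- max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < eps:  # ...until convergence
            return V_new, Q.argmax(axis=1)   # pi(s) = argmax_a Q(s,a)
        V = V_new

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
```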