Markov Decision Processes (MDPs)

• read Ch 17.1-17.2
• utility-based agents
  – goals encoded in utility function U(s), or U: S → ℝ
• effects of actions encoded in state transition function: T: S×A → S
  – or T: S×A → pdf(S) for non-deterministic
• rewards/costs encoded in reward function: R: S×A → ℝ
• Markov property: effects of actions depend only on the current state, not on previous history
• the goal: maximize reward over time
  – long-term discounted reward
  – handles infinite horizon; encourages quicker achievement
• "plans" are encoded in policies
  – mappings from states to actions: π: S → A
• how to compute the optimal policy π* that maximizes long-term discounted reward?
• value function V^π(s): expected long-term reward from starting in state s and following policy π
• derive policy from V(s):
  – π(s) = argmax_{a∈A} E[R(s,a) + γ·V(T(s,a))]
  – π(s) = argmax_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V(s'))
• optimal policy comes from the optimal value function:
  – π*(s) = argmax_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V*(s'))

Calculating V*(s)

• Bellman's equations (eqn 17.5)
• method 1: linear programming
  – n coupled equations, one per state (linear apart from the max); schematically:
  – v1 = max(v2,v3,v4...)
  – v2 = max(v1,v3,v4...)
  – v3 = max(v1,v2,v4...)
  – solve for {v1,v2,v3...} using the GNU Linear Programming Kit (GLPK), etc.
• method 2: Value Iteration
  – initialize V(s)=0 for all states
  – iteratively update the value of each state based on the values of its successor states
  – ...until convergence
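Value iteration and the step of deriving a policy from V(s) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the course: the two-state MDP, the dictionary layout of P and R, the discount γ = 0.9, the tolerance, and every function name (q_value, value_iteration, greedy_policy) are assumptions made up for the example.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# Hypothetical data layout (not from the slides):
#   P[(s, a)] is a list of (probability, next_state) pairs, i.e. p(s'|s,a)
#   R[(s, a)] is the immediate reward R(s,a)
Transitions = Dict[Tuple[State, Action], List[Tuple[float, State]]]
Rewards = Dict[Tuple[State, Action], float]


def q_value(s: State, a: Action, V: Dict[State, float],
            P: Transitions, R: Rewards, gamma: float) -> float:
    """Sum over s' of p(s'|s,a) * (R(s,a) + gamma * V(s'))."""
    return sum(p * (R[(s, a)] + gamma * V[s2]) for p, s2 in P[(s, a)])


def value_iteration(states: List[State], actions: List[Action],
                    P: Transitions, R: Rewards, gamma: float = 0.9,
                    tol: float = 1e-8, max_iters: int = 10_000) -> Dict[State, float]:
    """Start from V(s)=0 and repeat the Bellman update until values stop changing."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        # Synchronous update: every new value is computed from the old V.
        new_V = {s: max(q_value(s, a, V, P, R, gamma) for a in actions)
                 for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:  # largest change is below tolerance: treat as converged
            break
    return V


def greedy_policy(states: List[State], actions: List[Action],
                  V: Dict[State, float], P: Transitions, R: Rewards,
                  gamma: float = 0.9) -> Dict[State, Action]:
    """pi(s) = argmax over a of sum_{s'} p(s'|s,a) * (R(s,a) + gamma * V(s'))."""
    return {s: max(actions, key=lambda a: q_value(s, a, V, P, R, gamma))
            for s in states}


# Made-up two-state MDP: staying in B pays 1 per step, everything else pays 0.
states = ["A", "B"]
actions = ["stay", "go"]
P = {
    ("A", "stay"): [(1.0, "A")],
    ("A", "go"): [(0.8, "B"), (0.2, "A")],  # non-deterministic: "go" can fail
    ("B", "stay"): [(1.0, "B")],
    ("B", "go"): [(1.0, "A")],
}
R = {("A", "stay"): 0.0, ("A", "go"): 0.0, ("B", "stay"): 1.0, ("B", "go"): 0.0}

V = value_iteration(states, actions, P, R, gamma=0.9)
policy = greedy_policy(states, actions, V, P, R, gamma=0.9)
print(V)
print(policy)
```

In this toy problem the values can be checked by hand: staying in B forever is worth 1/(1 − 0.9) = 10, and from A the "go" action satisfies V(A) = 0.9·(0.8·10 + 0.2·V(A)), so V(A) = 7.2/0.82 ≈ 8.78. The discount is what makes reaching B sooner worth more than reaching it later.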
