Markov Decision Processes (MDPs)

Markov Decision Processes
• read Ch 17.1-17.2
• utility-based agents
– goals encoded in utility function U(s), or U: S → ℝ
• effects of actions encoded in state transition function:
– T: S×A → S for deterministic, or T: S×A → pdf(S) for non-deterministic
• rewards/costs encoded in reward function: R: S×A → ℝ
• Markov property: effects of actions only depend on
current state, not previous history
• the goal: maximize reward over time
– long-term discounted reward
– handles infinite horizon; encourages quicker achievement
• “plans” are encoded in policies
– mappings from states to actions: π: S → A
• how to compute optimal policy π* that maximizes long-term discounted reward?
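The long-term discounted reward above can be illustrated with a minimal sketch. It assumes a discount factor gamma in [0, 1) and a finite list of per-step rewards; the function name and the reward sequences are hypothetical, chosen only to show why discounting favors quicker achievement and keeps infinite-horizon sums bounded.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical example: the same reward of 10, reached after 2 steps or after 5.
quick = discounted_return([0, 0, 10], gamma=0.9)
slow = discounted_return([0, 0, 0, 0, 0, 10], gamma=0.9)

# A constant reward of 1 per step approaches 1 / (1 - gamma) as the horizon grows.
long_run = discounted_return([1] * 1000, gamma=0.9)
```

Under these assumptions the quick sequence is worth roughly 8.1 and the slow one roughly 5.9, and the long constant-reward run stays close to 10 rather than growing without bound.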
• value function V^π(s): expected long-term
reward from starting in state s and
following policy π
• derive policy from V(s):
• π(s) = argmax_{a∈A} E[R(s,a) + γ·V(T(s,a))]
= argmax_{a∈A} Σ_{s’} p(s’|s,a)·(R(s,a) + γ·V(s’))
• optimal policy comes from optimal value
function: π*(s) = argmax_{a∈A} Σ_{s’} p(s’|s,a)·(R(s,a) + γ·V*(s’))
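Deriving a policy from a value function can be sketched as a one-step lookahead. This is only an illustration: the dictionary layout (P[s][a] as a list of (probability, next_state) pairs, R[s][a] as a number), the two-state toy MDP, and the value estimates in V are all assumptions made for the example, not part of the lecture.

```python
def greedy_policy(states, actions, P, R, V, gamma):
    """For each state, pick the action with the best expected one-step lookahead."""
    policy = {}
    for s in states:
        def q(a):
            # Expected value of taking a in s: sum over s' of p(s'|s,a) * (R + gamma * V(s')).
            return sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])
        policy[s] = max(actions, key=q)
    return policy


# Hypothetical two-state MDP.
states = ["A", "B"]
actions = ["stay", "go"]
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": -1.0},
    "B": {"stay": 1.0, "go": 0.0},
}
V = {"A": 5.0, "B": 10.0}  # assumed value estimates, not computed here

policy = greedy_policy(states, actions, P, R, V, gamma=0.9)
```

With these made-up numbers the lookahead prefers "go" in A (expected 7.1 versus 4.5) and "stay" in B (10.0 versus 4.5).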
Calculating V*(s)
• Bellman’s equations
– V*(s) = max_{a∈A} Σ_{s’} p(s’|s,a)·(R(s,a) + γ·V*(s’)) (cf. eqn 17.5)
• method 1: linear programming
– n coupled equations, one per state (nonlinear as written, because of the max)
– v1 = max(v2,v3,v4...)
– v2 = max(v1,v3,v4...)
– v3 = max(v1,v2,v4...)
– solve for {v1,v2,v3...} using the GNU Linear Programming Kit (GLPK), etc.
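One standard way to turn the coupled max-equations into a linear program is to replace each max with one inequality per action and then minimize the sum of the values. The formulation below is a sketch in the notation of these notes, assuming a discount factor γ < 1 and a finite state and action set; it is not necessarily the exact form used in the lecture.

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{s \in S} v(s) \\
\text{subject to} \quad & v(s) \;\ge\; \sum_{s'} p(s' \mid s, a)\,\bigl(R(s,a) + \gamma\, v(s')\bigr)
                          \qquad \text{for all } s \in S,\ a \in A
\end{aligned}
```

Each state contributes one variable and |A| linear constraints. At the optimum, the constraint for the best action in each state holds with equality, so the solution v coincides with V*.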
• method 2: Value Iteration
– initialize V(s)=0 for all states
– iteratively update value of each state based
on neighbors
– ...until convergence
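The value iteration steps above can be sketched as follows. The MDP layout (P[s][a] as a list of (probability, next_state) pairs, R[s][a] as a number), the two-state toy problem, and the tol and max_iters parameters are assumptions made for illustration. The update applied to each state is V(s) ← max over a of Σ p(s’|s,a)·(R(s,a) + γ·V(s’)).

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman update on every state until the values stop changing."""
    V = {s: 0.0 for s in states}  # initialize V(s) = 0 for all states
    for _ in range(max_iters):
        new_V = {}
        for s in states:
            # Best expected one-step lookahead over all actions, using the old values.
            new_V[s] = max(
                sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])
                for a in actions
            )
        # Largest change in any state's value during this sweep.
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:  # converged
            break
    return V


# Hypothetical two-state MDP.
states = ["A", "B"]
actions = ["stay", "go"]
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": -1.0},
    "B": {"stay": 1.0, "go": 0.0},
}

V_star = value_iteration(states, actions, P, R, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: staying in B forever gives V*(B) = 1 / (1 - 0.9) = 10, and solving V*(A) = -1 + 0.9·(0.8·10 + 0.2·V*(A)) gives V*(A) = 6.2 / 0.82, about 7.56. The sketch builds a fresh dictionary on each sweep; updating V in place also converges and often needs fewer sweeps.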