Reinforcement Learning:
Dynamic Programming
Csaba Szepesvári
University of Alberta
Kioloa, MLSS’08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/
Reinforcement Learning
RL =
“Sampling based methods to solve
optimal control problems”
(Rich Sutton)
Contents
- Defining AI
- Markovian Decision Problems
- Dynamic Programming
- Approximate Dynamic Programming
- Generalizations
Literature
- Books
  - Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998
  - Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming, Athena Scientific, 1996
- Journals
  - JMLR, MLJ, JAIR, AI
- Conferences
  - NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI
Some More Books
- Martin L. Puterman: Markov Decision Processes. Wiley, 1994.
- Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007).
- James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003.
Resources
- RL-Glue: http://rlai.cs.ualberta.ca/RLBB/top.html
- RL-Library: http://rlai.cs.ualberta.ca/RLR/index.html
- The RL Toolbox 2.0: http://www.igi.tugraz.at/riltoolbox/general/overview.html
- OpenDP: http://opendp.sourceforge.net
- RL-Competition (2008)! http://rl-competition.org/
  - June 1st, 2008: Test runs begin!
- Related fields:
  - Operations research (MOR, OR)
  - Control theory (IEEE TAC, Automatica, IEEE CDC, ECC)
  - Simulation optimization (Winter Simulation Conference)
Abstract Control Model
[Diagram: the "perception-action loop" – the controller (= agent) receives sensations (and reward) from the environment and sends back actions.]
Zooming in..
[Diagram: a closer look at the agent, with labels: memory, state, reward, external sensations, internal sensations, actions.]
A Mathematical Model
- Plant (controlled object):
  - x_{t+1} = f(x_t, a_t, v_t)    x_t: state, v_t: noise
  - z_t = g(x_t, w_t)             z_t: sensation/observation, w_t: noise
- State: sufficient statistics for the future
  - Independently of what we measure ("objective state") ..or..
  - Relative to the measurements ("subjective state")
- Controller:
  - a_t = F(z_1, z_2, …, z_t)     a_t: action/control
  => PERCEPTION-ACTION LOOP, "CLOSED-LOOP CONTROL"
- Design problem: F = ?
- Goal: Σ_{t=1}^T r(z_t, a_t) → max
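To make the closed-loop structure concrete, here is a minimal simulation sketch in Python. The scalar linear plant f, observation map g, and purely reactive proportional controller F are made-up illustrative choices, not part of the lecture.

```python
import numpy as np

# Hypothetical scalar plant and controller, only to make the loop
# x_{t+1} = f(x_t, a_t, v_t), z_t = g(x_t, w_t), a_t = F(z_1, ..., z_t) concrete.
rng = np.random.default_rng(0)

def f(x, a, v):          # plant dynamics (assumed)
    return 0.9 * x + a + v

def g(x, w):             # observation model (assumed)
    return x + w

def F(observations):     # purely reactive feedback: act on the latest observation
    return -0.5 * observations[-1]

x, total_reward = 1.0, 0.0
observations = []
for t in range(100):
    z = g(x, 0.01 * rng.standard_normal())    # sense
    observations.append(z)
    a = F(observations)                       # act (closes the loop)
    total_reward += -(z ** 2 + 0.1 * a ** 2)  # r(z_t, a_t): a cost to be maximized
    x = f(x, a, 0.01 * rng.standard_normal()) # plant moves on
print(total_reward)
```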
A Classification of Controllers
- Feedforward:
  - a_1, a_2, … is designed ahead in time
  - ???
- Feedback:
  - Purely reactive systems: a_t = F(z_t)
  - Why is this bad?
- Feedback with memory:
  - m_t = M(m_{t-1}, z_t, a_{t-1})    ~ interpreting sensations
  - a_t = F(m_t)                      decision making: deliberative vs. reactive
Feedback controllers
- Plant:
  - x_{t+1} = f(x_t, a_t, v_t)
  - z_{t+1} = g(x_t, w_t)
- Controller:
  - m_t = M(m_{t-1}, z_t, a_{t-1})
  - a_t = F(m_t)
- m_t ≈ x_t: state estimation, "filtering"
  - difficulties: noise, unmodelled parts
- How do we compute a_t?
  - With a model (f'): model-based control
    - ..assumes (some kind of) state estimation
  - Without a model: model-free control
Markovian Decision Problems
Markovian Decision Problems
(X, A, p, r, γ)
- X – set of states
- A – set of actions (controls)
- p – transition probabilities, p(y|x,a)
- r – rewards: r(x,a,y), or r(x,a), or r(x)
- γ – discount factor, 0 ≤ γ < 1
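For the dynamic-programming examples that follow, a finite MDP like this can be stored directly as arrays. A minimal sketch, with all numbers invented for illustration:

```python
import numpy as np

# A tiny illustrative MDP (X, A, p, r, gamma) with |X| = 3 states, |A| = 2 actions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # P[x, a, y] = p(y | x, a)
r = rng.standard_normal((3, 2, 3))           # r[x, a, y]
gamma = 0.9                                  # discount factor, 0 <= gamma < 1

assert np.allclose(P.sum(axis=-1), 1.0)      # each p(.|x,a) is a distribution
```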
The Process View
- (X_t, A_t, R_t)
  - X_t – state at time t
  - A_t – action at time t
  - R_t – reward at time t
- Laws:
  - X_{t+1} ~ p(·|X_t, A_t)
  - A_t ~ π(·|H_t)
    - π: policy
    - H_t = (X_t, A_{t-1}, R_{t-1}, …, A_1, R_1, X_0) – history
  - R_t = r(X_t, A_t, X_{t+1})
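A short sketch of how the process (X_t, A_t, R_t) can be sampled under a memoryless stochastic policy; the MDP arrays and the uniform policy `pi` are made-up placeholders.

```python
import numpy as np

# Sampling (X_t, A_t, R_t) under a fixed stochastic policy pi(.|x).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # p(y | x, a)
r = rng.standard_normal((3, 2, 3))           # r(x, a, y)
pi = np.full((3, 2), 0.5)                    # pi[x, a] = probability of a in state x

x = 0
for t in range(5):
    a = rng.choice(2, p=pi[x])               # A_t ~ pi(. | X_t)
    y = rng.choice(3, p=P[x, a])             # X_{t+1} ~ p(. | X_t, A_t)
    R = r[x, a, y]                           # R_t = r(X_t, A_t, X_{t+1})
    print(t, x, a, round(R, 3))
    x = y
```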
The Control Problem
- Value functions:
  V_π(x) = E_π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
- Optimal value function:
  V*(x) = max_π V_π(x)
- Optimal policy π*:
  V_{π*}(x) = V*(x)
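Since V_π(x) is a discounted expected return, it can be estimated by Monte-Carlo rollouts, which is handy for checking the dynamic-programming computations later. A sketch, with a made-up MDP, an arbitrary truncation horizon, and an arbitrary rollout count:

```python
import numpy as np

# Monte-Carlo estimate of V_pi(x) = E_pi[ sum_t gamma^t R_t | X_0 = x ]
# using truncated rollouts.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
r = rng.standard_normal((3, 2, 3))
pi = np.full((3, 2), 0.5)
gamma = 0.9

def rollout_return(x, horizon=200):
    G, discount = 0.0, 1.0
    for _ in range(horizon):                 # gamma^horizon is negligible here
        a = rng.choice(2, p=pi[x])
        y = rng.choice(3, p=P[x, a])
        G += discount * r[x, a, y]
        discount *= gamma
        x = y
    return G

V0_estimate = np.mean([rollout_return(0) for _ in range(2000)])
print(V0_estimate)
```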
Applications of MDPs
- Fields: operations research, econometrics, control, statistics, games, AI
- Example applications:
  - Optimal investments
  - Replacement problems
  - Option pricing
  - Logistics, inventory management
  - Active vision
  - Production scheduling
  - Dialogue control
  - Bioreactor control
  - Robotics (Robocup Soccer)
  - Driving
  - Real-time load balancing
  - Design of experiments (medical tests)
Variants of MDPs
- Discounted
- Undiscounted: stochastic shortest path
- Average reward
- Multiple criteria
- Minimax
- Games
MDP Problems
- Planning
  - The MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
- Learning
  - The MDP is unknown. You are allowed to interact with it. Find an optimal policy π*!
- Optimal learning
  - While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning
Solving MDPs – Dimensions
- Which problem? (planning, learning, optimal learning)
- Exact or approximate?
- Uses samples?
- Incremental?
- Uses value functions?
  - Yes: value-function based methods
    - Planning: DP, Random Discretization Method, FVI, …
    - Learning: Q-learning, actor-critic, …
  - No: policy search methods
    - Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus), …
- Representation
  - Structured state: factored states, logical representation, …
  - Structured policy space: hierarchical methods
Dynamic Programming
Richard Bellman (1920-1984)
- Control theory
- Systems analysis
- Dynamic Programming: RAND Corporation, 1949-1955
- Bellman equation
- Bellman-Ford algorithm
- Hamilton-Jacobi-Bellman equation
- "Curse of dimensionality"
- Invariant imbeddings
- Grönwall-Bellman inequality
Bellman Operators
- Let π: X → A be a stationary policy
- B(X) = { V | V: X → ℝ, ||V||_∞ < ∞ }
- T_π: B(X) → B(X),
  (T_π V)(x) = Σ_y p(y|x,π(x)) [ r(x,π(x),y) + γ V(y) ]
- Theorem: T_π V_π = V_π
- Note: this is a linear system of equations: r_π + γ P_π V_π = V_π,
  so V_π = (I - γ P_π)^{-1} r_π
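Because T_π V_π = V_π is a linear system, V_π can be computed with one linear solve, V_π = (I − γ P_π)^{-1} r_π. A sketch for a deterministic policy on a made-up MDP (the array names are mine, not from the slides):

```python
import numpy as np

# Exact policy evaluation: solve T_pi V = V via V_pi = (I - gamma P_pi)^{-1} r_pi
# for a deterministic policy pi: X -> A.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # p(y | x, a)
r = rng.standard_normal((3, 2, 3))           # r(x, a, y)
gamma, pi = 0.9, np.array([0, 1, 0])         # pi[x] = action taken in state x

P_pi = P[np.arange(3), pi]                   # P_pi[x, y] = p(y | x, pi(x))
r_pi = (P_pi * r[np.arange(3), pi]).sum(-1)  # r_pi[x] = sum_y p(y|x,pi(x)) r(x,pi(x),y)

V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V_pi)
```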
Proof of T¼ V¼ = V¼
- What you need to know:
  - Linearity of expectation: E[A+B] = E[A] + E[B]
  - Law of total expectation:
    E[Z] = Σ_x P(X=x) E[Z | X=x], and
    E[Z | U=u] = Σ_x P(X=x|U=u) E[Z | U=u, X=x]
  - Markov property:
    E[f(X_1,X_2,…) | X_1=y, X_0=x] = E[f(X_1,X_2,…) | X_1=y]
- V_π(x) = E_π[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ]
  = Σ_y P(X_1=y | X_0=x) E_π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]
    (by the law of total expectation)
  = Σ_y p(y|x,π(x)) E_π[ Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y ]
    (since X_1 ~ p(·|X_0, π(X_0)))
  = Σ_y p(y|x,π(x)) { E_π[ R_0 | X_0=x, X_1=y ] + γ E_π[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0=x, X_1=y ] }
    (by the linearity of expectation)
  = Σ_y p(y|x,π(x)) { r(x,π(x),y) + γ V_π(y) }
    (using the definition of r, and the Markov property to identify the second expectation with V_π(y))
  = (T_π V_π)(x)
    (using the definition of T_π)
The Banach Fixed-Point Theorem
- B = (B, ||·||) Banach space
- T: B_1 → B_2 is L-Lipschitz (L>0) if for any U, V,
  ||T U - T V|| ≤ L ||U - V||
- T is a contraction if B_1 = B_2 and L < 1; L is then a contraction coefficient of T
- Theorem [Banach]: Let T: B → B be a γ-contraction. Then T has a unique fixed point V and, for all V_0 ∈ B, the iterates V_{k+1} = T V_k satisfy V_k → V and ||V_k - V|| = O(γ^k)
An Algebra for Contractions
- Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
- Def: If T is 1-Lipschitz, T is called a non-expansion.
- Prop: M: B(X × A) → B(X), M(Q)(x) = max_a Q(x,a) is a non-expansion.
- Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
- Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
- Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.
Policy Evaluations are Contractions
- Def: ||V||_∞ = max_x |V(x)|, the supremum norm; written ||·|| below
- Theorem: Let T_π be the policy evaluation operator of some policy π. Then T_π is a γ-contraction.
- Corollary: V_π is the unique fixed point of T_π; V_{k+1} = T_π V_k → V_π for all V_0 ∈ B(X), and ||V_k - V_π|| = O(γ^k).
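The corollary gives the iterative alternative to the direct linear solve: repeatedly apply T_π and watch the error shrink geometrically. A sketch on the same kind of made-up MDP as before:

```python
import numpy as np

# Fixed-point iteration V_{k+1} = T_pi V_k; by the contraction property the
# error to V_pi decays at rate gamma^k.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
r = rng.standard_normal((3, 2, 3))
gamma, pi = 0.9, np.array([0, 1, 0])

P_pi = P[np.arange(3), pi]
r_pi = (P_pi * r[np.arange(3), pi]).sum(-1)
V_ref = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)  # exact V_pi for reference

V = np.zeros(3)
for k in range(30):
    V = r_pi + gamma * P_pi @ V                 # (T_pi V)(x)
    if k % 10 == 0:
        print(k, np.max(np.abs(V - V_ref)))     # decays roughly like gamma^k
```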
The Bellman Optimality Operator
- Let T: B(X) → B(X) be defined by
  (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }
- Def: π is greedy w.r.t. V if T_π V = T V.
- Prop: T is a γ-contraction.
- Theorem (BOE): T V* = V*.
- Proof: Let V be the fixed point of T. T_π ≤ T ⇒ V* ≤ V. Let π be greedy w.r.t. V. Then T_π V = T V. Hence V_π = V ⇒ V ≤ V* ⇒ V = V*.
Value Iteration
- Theorem: For any V_0 ∈ B(X), with V_{k+1} = T V_k, V_k → V*, and in particular ||V_k - V*|| = O(γ^k).
- What happens when we stop "early"?
- Theorem: Let π be greedy w.r.t. V. Then ||V_π - V*|| ≤ 2 ||T V - V|| / (1-γ).
- Proof: ||V_π - V*|| ≤ ||V_π - V|| + ||V - V*|| …
- Corollary: In a finite MDP the number of policies is finite. We can stop when ||V_k - T V_k|| ≤ Δ(1-γ)/2, where Δ = min{ ||V* - V_π|| : V_π ≠ V* }
- Pseudo-polynomial complexity
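A sketch of value iteration with a stopping test on the Bellman residual ||V_k − T V_k||; the tolerance is an arbitrary small number rather than the Δ(1-γ)/2 threshold of the corollary, since Δ is unknown in practice.

```python
import numpy as np

# Value iteration V_{k+1} = T V_k, stopped when the Bellman residual is tiny.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
r = rng.standard_normal((3, 2, 3))
gamma, tol = 0.9, 1e-8

def T(V):
    # (T V)(x) = max_a sum_y p(y|x,a) [ r(x,a,y) + gamma V(y) ]
    return np.max((P * (r + gamma * V)).sum(-1), axis=-1)

V = np.zeros(3)
while True:
    TV = T(V)
    if np.max(np.abs(TV - V)) <= tol:        # stop on small Bellman residual
        break
    V = TV

greedy = np.argmax((P * (r + gamma * V)).sum(-1), axis=-1)  # greedy policy w.r.t. V
print(V, greedy)
```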
Policy Improvement [Howard ’60]
- Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X.
- Def: For U, V ∈ B(X), V > U if V ≥ U and ∃ x ∈ X s.t. V(x) > U(x).
- Theorem (Policy Improvement): Let π' be greedy w.r.t. V_π. Then V_{π'} ≥ V_π. If T V_π > V_π then V_{π'} > V_π.
Policy Iteration
PolicyIteration(π)
- V ← V_π
- Do {improvement}
  - V' ← V
  - Let π be such that T_π V = T V
  - V ← V_π
- While (V > V')
- Return π
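A sketch of the loop above: exact policy evaluation via the linear solve, followed by greedy improvement, repeated until the policy stops changing (made-up MDP, deterministic tie-breaking in argmax).

```python
import numpy as np

# Policy iteration: evaluate the current policy exactly, then improve greedily.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
r = rng.standard_normal((3, 2, 3))
gamma = 0.9

def evaluate(pi):                            # V_pi = (I - gamma P_pi)^{-1} r_pi
    P_pi = P[np.arange(3), pi]
    r_pi = (P_pi * r[np.arange(3), pi]).sum(-1)
    return np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

pi = np.zeros(3, dtype=int)
while True:
    V = evaluate(pi)
    pi_new = np.argmax((P * (r + gamma * V)).sum(-1), axis=-1)  # greedy w.r.t. V_pi
    if np.array_equal(pi_new, pi):           # no change: pi is optimal
        break
    pi = pi_new
print(pi, evaluate(pi))
```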
Policy Iteration Theorem
- Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
- Proof: Follows from the Policy Improvement Theorem.
Linear Programming
- V ≥ T V ⇒ V ≥ V* = T V*.
- Hence, V* is the smallest V that satisfies V ≥ T V.
- V ≥ T V ⟺ (*) V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }  ∀ x, a
- LinProg(V): Σ_x V(x) → min  s.t. V satisfies (*).
- Theorem: LinProg(V) returns the optimal value function V*.
- Corollary: Pseudo-polynomial complexity
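A sketch of this LP using scipy.optimize.linprog: minimize Σ_x V(x) subject to (γ P_a − I) V ≤ −r̄_a for every action a, where r̄_a(x) = Σ_y p(y|x,a) r(x,a,y). The MDP is made up, and the variable bounds must be opened since linprog defaults to nonnegative variables.

```python
import numpy as np
from scipy.optimize import linprog

# Linear-programming solution of a small made-up MDP.
rng = np.random.default_rng(0)
nX, nA = 3, 2
P = rng.dirichlet(np.ones(nX), size=(nX, nA))
r = rng.standard_normal((nX, nA, nX))
gamma = 0.9

rows, rhs = [], []
for a in range(nA):
    P_a = P[:, a]                            # shape (nX, nX): p(y|x,a)
    rows.append(gamma * P_a - np.eye(nX))    # (gamma P_a - I) V <= -rbar_a
    rhs.append(-(P_a * r[:, a]).sum(-1))     # -sum_y p(y|x,a) r(x,a,y)
A_ub, b_ub = np.vstack(rows), np.concatenate(rhs)

res = linprog(c=np.ones(nX), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nX)
print(res.x)                                 # should coincide with V*
```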
Variations of a Theme
Approximate Value Iteration
- AVI: V_{k+1} = T V_k + ε_k
- AVI Theorem: Let ε = max_k ||ε_k||. Then
  limsup_{k→∞} ||V_k - V*|| ≤ 2 γ ε / (1-γ).
- Proof: Let a_k = ||V_k - V*||. Then
  a_{k+1} = ||V_{k+1} - V*|| = ||T V_k - T V* + ε_k|| ≤ γ ||V_k - V*|| + ε = γ a_k + ε.
  Hence, a_k is bounded. Take "limsup" of both sides: a ≤ γ a + ε; reorder. //
  (e.g., [BT96])
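A quick numerical illustration of AVI: inject bounded noise ε_k into each value-iteration step and compare with the noise-free fixed point; the gap settles at the order of ε/(1-γ), consistent with the recursion in the proof. All numbers are made up.

```python
import numpy as np

# Approximate value iteration V_{k+1} = T V_k + eps_k with ||eps_k|| <= eps.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
r = rng.standard_normal((3, 2, 3))
gamma, eps = 0.9, 0.05

def T(V):
    return np.max((P * (r + gamma * V)).sum(-1), axis=-1)

V_star = np.zeros(3)
for _ in range(1000):                        # (near-)exact value iteration for reference
    V_star = T(V_star)

V = np.zeros(3)
for _ in range(1000):
    V = T(V) + rng.uniform(-eps, eps, size=3)    # bounded noise eps_k
print(np.max(np.abs(V - V_star)), eps / (1 - gamma))
```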
Fitted Value Iteration – Non-expansion Operators
- FVI: Let A be a non-expansion, V_{k+1} = A T V_k. Where does this converge to?
- Theorem: Let U, V be such that A T U = U and T V = V. Then
  ||V - U|| ≤ ||A V - V|| / (1-γ).
- Proof: Let U' be the fixed point of T A. Then ||U' - V|| ≤ γ ||A V - V|| / (1-γ).
  Since A U' = A T (A U'), U = A U'. Hence,
  ||U - V|| = ||A U' - V|| ≤ ||A U' - A V|| + ||A V - V|| …
  [Gordon '95]
Application to Aggregation
- Let Π be a partition of X, and S(x) the unique cell that x belongs to.
- Let A: B(X) → B(X) be
  (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x).
- p'(C|B,a) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a),
  r'(B,a,C) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a) r(x,a,y).
- Theorem: Take the aggregated MDP (Π, A, p', r'), let V' be its optimal value function and V'_E(x) = V'(S(x)). Then ||V'_E - V*|| ≤ ||A V* - V*|| / (1-γ).
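A sketch that ties this to the previous slide: fitted value iteration V_{k+1} = A T V_k with A chosen as the aggregation (averaging) operator above, using a uniform μ over each cell. The partition, cell labels, and MDP are invented for illustration.

```python
import numpy as np

# Fitted value iteration V_{k+1} = A T V_k, where A averages V over the cells
# of a partition (A is a non-expansion), giving a piecewise-constant iterate.
rng = np.random.default_rng(0)
nX = 6
P = rng.dirichlet(np.ones(nX), size=(nX, 2))
r = rng.standard_normal((nX, 2, nX))
gamma = 0.9
cells = np.array([0, 0, 1, 1, 2, 2])         # S(x): which cell x belongs to

# A[x, z] = mu(z; S(x)) with mu uniform over the cell of x
A = np.array([(cells == cells[x]) / (cells == cells[x]).sum() for x in range(nX)])

def T(V):
    return np.max((P * (r + gamma * V)).sum(-1), axis=-1)

V = np.zeros(nX)
for _ in range(500):
    V = A @ T(V)                             # constant within each cell
print(V)
```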
Action-Value Functions
- L: B(X) → B(X × A),
  (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }. "One-step lookahead".
- Note: π is greedy w.r.t. V if (L V)(x,π(x)) = max_a (L V)(x,a).
- Def: Q* = L V*.
- Def: Let Max: B(X × A) → B(X), (Max Q)(x) = max_a Q(x,a).
- Note: Max L = T.
- Corollary: Q* = L Max Q*.
- Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
- L Max is a γ-contraction on B(X × A)
- Value iteration, policy iteration, … carry over to action-value functions
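Value iteration carries over to action-values by iterating Q_{k+1} = L Max Q_k, whose fixed point is Q*. A sketch on a made-up MDP:

```python
import numpy as np

# Value iteration on action-values:
# Q(x,a) <- sum_y p(y|x,a) [ r(x,a,y) + gamma * max_b Q(y,b) ]
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
r = rng.standard_normal((3, 2, 3))
gamma = 0.9

Q = np.zeros((3, 2))
for _ in range(500):
    V = Q.max(axis=1)                        # (Max Q)(y) = max_b Q(y, b)
    Q = (P * (r + gamma * V)).sum(-1)        # (L Max Q)(x, a)
print(Q.max(axis=1), Q.argmax(axis=1))       # V* and a greedy optimal policy
```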
Changing Granularity
- Asynchronous Value Iteration:
  - Every time step, update only a few states
  - AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
  - How to use?
- Prioritized Sweeping
  - IPS [McMahan & Gordon '05]:
    - Instead of an update, put the state on a priority queue
    - When picking a state from the queue, update it
    - Put its predecessors on the queue
  - Theorem: Equivalent to Dijkstra's algorithm on shortest path problems, provided that rewards are non-positive
- LRTA* [Korf '90] ~ RTDP [Barto, Bradtke, Singh '95]
  - Focussing on the parts of the state space that matter
  - Constraints:
    - Same problem solved from several initial positions
    - Decisions have to be fast
  - Idea: Update values along the paths
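A sketch of plain asynchronous value iteration (not the prioritized-sweeping variant): each sweep backs up only a couple of randomly chosen states, yet the result matches synchronous value iteration. The state-selection rule and the MDP are arbitrary illustrations.

```python
import numpy as np

# Asynchronous value iteration: per step, update only a few states in place;
# with every state updated infinitely often it still converges to V*.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=(5, 2))
r = rng.standard_normal((5, 2, 5))
gamma = 0.9

def backup(V, x):
    # single-state Bellman backup (T V)(x)
    return np.max((P[x] * (r[x] + gamma * V)).sum(-1))

V = np.zeros(5)
for _ in range(5000):
    for x in rng.choice(5, size=2, replace=False):   # update only two states
        V[x] = backup(V, x)

V_sync = np.zeros(5)
for _ in range(2000):                        # synchronous VI for comparison
    V_sync = np.max((P * (r + gamma * V_sync)).sum(-1), axis=-1)
print(np.max(np.abs(V - V_sync)))
```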
Changing Granularity
- Generalized Policy Iteration:
  - Partial evaluation and partial improvement of policies
  - Multi-step lookahead improvement
- AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird '93]
Variations of a theme
[SzeLi99]
- Game against nature [Heger '94]:
  inf_w Σ_t γ^t R_t(w)  with X_0 = x
- Risk-sensitive criterion:
  log ( E[ exp( Σ_t γ^t R_t ) | X_0 = x ] )
- Stochastic shortest path
- Average reward
- Markov games
  - Simultaneous action choices (rock-paper-scissors)
  - Sequential action choices
  - Zero-sum (or not)
References
- [Howard '60] R.A. Howard: Dynamic Programming and Markov Processes, The MIT Press, Cambridge, MA, 1960.
- [Gordon '95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261-268, 1995.
- [Watkins '90] C.J.C.H. Watkins: Learning from Delayed Rewards, PhD Thesis, 1990.
- [McMahan & Gordon '05] H.B. McMahan and G.J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS, 2005.
- [Korf '90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189-211, 1990.
- [Barto, Bradtke & Singh '95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81-138, 1995.
- [Williams & Baird '93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-93-14, November 1993.
- [SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation 11, 2017-2059, 1999.
- [Heger '94] M. Heger: Consideration of risk in reinforcement learning. ICML, pp. 105-111, 1994.