Overcoming the Curse of Dimensionality with Reinforcement Learning
Rich Sutton, AT&T Labs
With thanks to Doina Precup, Peter Stone, Satinder Singh, David McAllester, Sanjoy Dasgupta
Computers have gotten faster and bigger
• Analytic solutions are less important
• Computer-based approximate solutions
– Neural networks
– Genetic algorithms
• Machines take on more of the work
• More general solutions to more general problems
– Non-linear systems
– Stochastic systems
– Larger systems
• Exponential methods are still exponential…
but compute-intensive methods increasingly winning
New Computers have led to a New Artificial Intelligence
• More general problems and algorithms, automation
– Data-intensive methods
– Learning methods
• Less handcrafted solutions, expert systems
• More probability, numbers
• Less logic, symbols, human understandability
• More real-time decision-making
States, Actions, Goals, Probability
=> Markov Decision Processes
Markov Decision Processes
• State space S (finite)
• Action space A (finite)
• Discrete time t = 0, 1, 2, …
• Episode: $s_0\, a_0\, r_1\, s_1\, a_1\, r_2\, s_2\, a_2 \ldots (r_T\, s_T)$
• Transition probabilities: $p^a_{ss'} = \Pr\{s_{t+1}=s' \mid s_t=s,\, a_t=a\}$
• Expected rewards: $r^a_{ss'} = E\{r_{t+1} \mid s_t=s,\, a_t=a,\, s_{t+1}=s'\}$
• Policy: $\pi : S \times A \to [0,1]$, with $\pi(s,a) = \Pr\{a_t=a \mid s_t=s\}$
• Return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$, with discount rate $\gamma \in [0,1]$
• Value (the PREDICTION problem): $V^\pi(s) = E\{R_t \mid s_t=s\}$
• Optimal policy (the CONTROL problem): $\pi^* = \arg\max_\pi V^\pi$
Key Distinctions
• Control vs Prediction
• Bootstrapping/Truncation vs Full Returns
• Sampling vs Enumeration
• Function approximation vs Table lookup
• Off-policy vs On-policy
(In each pair, the first alternative is harder, more challenging, and more interesting; the second is easier and conceptually simpler.)
Full Depth Search
Computing $\hat V^\pi(s)$ by searching forward from $s$ over actions $a$, rewards $r$, next states $s'$, and so on to full depth.
Full returns: $r + \gamma r' + \gamma^2 r'' + \cdots$
Full-depth search is of exponential complexity $B^D$ (branching factor $B$, depth $D$).
Truncated Search
Computing $\hat V(s)$: the search is truncated after one ply, and approximate values $\hat V(s')$ are used at the stubs.
Truncated returns: $r + \gamma \hat V(s')$
Values are computed from their own estimates -- "bootstrapping".
Dynamic Programming is Bootstrapping
Truncated returns: $E\{r + \gamma \hat V(s') \mid s\}$, backed up over all actions $a$ and next states $s'$.
E.g., DP policy evaluation:
$$\hat V_0(s) = \text{arbitrary}$$
$$\hat V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left[ r^a_{ss'} + \gamma \hat V_k(s') \right] \qquad \forall s \in S$$
$$\lim_{k \to \infty} \hat V_k = V^\pi$$
Bootstrapping seems to Speed Learning
[Figure: performance as a function of λ on four tasks (Mountain Car: steps per episode; Random Walk: RMS error; Puddle World: cost per episode; Cart and Pole: failures per 100,000 steps), for both accumulating and replacing eligibility traces.]
Bootstrapping/Truncation
• Replacing possible futures with estimates of value
• Can reduce computation and variance
• A powerful idea, but it requires stored estimates of value for each state
The Curse of Dimensionality
Bellman, 1961
• The number of states grows exponentially with
dimensionality -- the number of state variables
• Thus, on large problems,
– Can’t complete even one sweep of DP
• Can’t enumerate states, need sampling!
– Can’t store separate values for each state
• Can’t store values in tables, need function approximation!
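For example (an illustrative calculation, not from the slides): a problem described by n binary state variables has 2^n states, about a million at n = 20 but roughly 1.3 × 10^30 at n = 100, far beyond what any sweep or table can cover.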
DP Policy Evaluation
$$\hat V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left[ r^a_{ss'} + \gamma \hat V_k(s') \right] \qquad \forall s \in S$$
Rewritten with a weighting $d(s)$ over states (some distribution, possibly uniform):
$$\hat V_{k+1}(s) = \hat V_k(s) + d(s) \left[ \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma \hat V_k(s') \right) - \hat V_k(s) \right] \qquad \forall s \in S$$
The sums over actions and next states are the terms that can be replaced by sampling.
Tabular TD(0)
Sutton, 1988; Witten, 1974
For each sample transition $s, a \to s', r$:
$$\hat V(s) \leftarrow \hat V(s) + \alpha \left[ r + \gamma \hat V(s') - \hat V(s) \right]$$
$$\lim \hat V(s) = V^\pi(s)$$
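A minimal tabular TD(0) sketch in Python; the Gym-style env.reset()/env.step() interface and the policy callable are assumptions for illustration, not part of the talk.

```python
import numpy as np

def td0_policy_evaluation(env, policy, n_states, gamma=0.9, alpha=0.1, episodes=1000):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    Assumes a Gym-style env (reset/step) and a policy mapping states to actions.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # follow the policy being evaluated
            s_next, r, done, _ = env.step(a)    # one sample transition s,a -> s',r
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])     # bootstrapped TD update
            s = s_next
    return V
```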
Sampling vs Enumeration
Sample returns can also be either full or truncated:
$$\text{full:} \quad r + \gamma r' + \gamma^2 r'' + \cdots \qquad\qquad \text{truncated:} \quad r + \gamma \hat V(s')$$
as in the general TD(λ) algorithm.
Function Approximation
• Store values as a parameterized form
$$\hat V(s) = f(s, \theta)$$
• Update θ, e.g., by gradient descent:
$$\theta_{k+1} = \theta_k + \alpha \sum_s d(s) \left[ \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma \hat V_k(s') \right) - \hat V_k(s) \right] \nabla_\theta \hat V_k(s)$$
cf. DP policy evaluation (rewritten to include a step size α):
$$\hat V_{k+1}(s) = \hat V_k(s) + \alpha\, d(s) \left[ \sum_a \pi(s,a) \sum_{s'} p^a_{ss'} \left( r^a_{ss'} + \gamma \hat V_k(s') \right) - \hat V_k(s) \right] \qquad \forall s \in S$$
Linear Function Approximation
$$\hat V(s) = \theta^{\top} \phi_s \qquad\qquad \nabla_\theta \hat V(s) = \phi_s$$
Each state s is represented by a feature vector $\phi_s$.
Or represent a state-action pair with $\phi_{sa}$ and approximate action values:
$$Q^\pi(s,a) = E\{ R_t \mid s_t = s,\, a_t = a \} \qquad\qquad \hat Q(s,a) = \theta^{\top} \phi_{sa}$$
Linear TD(λ)
After each episode:
$$\theta \leftarrow \theta + \sum_{t=0}^{T-1} \Delta\theta_t \qquad \text{where} \qquad \Delta\theta_t = \alpha \left[ R^\lambda_t - \theta^{\top} \phi_{s_t a_t} \right] \phi_{s_t a_t}$$
"λ-return":
$$R^\lambda_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t$$
"n-step return":
$$R^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n\, \theta^{\top} \phi_{s_{t+n} a_{t+n}}$$
e.g., $R^{(1)}_t = r_{t+1} + \gamma\, \theta^{\top} \phi_{s_{t+1} a_{t+1}}$
Sutton, 1988
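A sketch in Python of the forward-view update above, computing λ-returns backward over a recorded episode; the data layout (phis[t] is the feature vector of the state-action pair taken at time t, rewards[t] stores r_{t+1}) is an assumed convention for illustration.

```python
import numpy as np

def linear_td_lambda_episode(theta, phis, rewards, alpha=0.01, gamma=1.0, lam=0.7):
    """One offline (forward-view) TD(lambda) update after an episode.

    phis[t]    : feature vector of the state-action pair taken at time t (t = 0..T-1)
    rewards[t] : reward r_{t+1} received after that pair
    Returns the updated parameter vector theta.
    """
    T = len(rewards)
    # Compute lambda-returns backward with the standard recursion:
    # R^lam_t = r_{t+1} + gamma * [(1-lam) * theta.phi_{t+1} + lam * R^lam_{t+1}]
    lam_returns = np.zeros(T)
    lam_returns[T - 1] = rewards[T - 1]          # episode terminates; no bootstrap
    for t in range(T - 2, -1, -1):
        bootstrap = theta @ phis[t + 1]
        lam_returns[t] = rewards[t] + gamma * ((1 - lam) * bootstrap + lam * lam_returns[t + 1])
    # Accumulate the corrections [R^lam_t - theta.phi_t] * phi_t, applied with step size alpha.
    delta = np.zeros_like(theta)
    for t in range(T):
        delta += (lam_returns[t] - theta @ phis[t]) * phis[t]
    return theta + alpha * delta
```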
RoboCup
An international AI and Robotics research initiative
• Use soccer as a rich and realistic testbed
• Robotic and simulation leagues
– Open source simulator (Noda)
Research Challenges
• Multiple teammates with a common goal
• Multiple adversaries – not known in advance
• Real-time decision making necessary
• Noisy sensors and actuators
• Enormous state space, more than $2^{310}$ states
RoboCup Feature Vectors
[Figure: the full soccer state is reduced to 13 continuous state variables, which sparse, coarse tile coding turns into a huge binary feature vector $\phi_s$ (about 400 1's and 40,000 0's); a linear map $\theta$ then produces the action values.]
13 Continuous State Variables (for 3 vs 2)
• 11 distances among the players, the ball, and the center of the field
• 2 angles to takers along passing lanes
Sparse, Coarse, Tile Coding (CMACs)
32 tilings per group of state variables
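A minimal sketch of tile coding in Python over one group of state variables, using uniformly offset grid tilings; this illustrates the general idea and is not the CMAC implementation used in the keepaway experiments.

```python
import numpy as np

def tile_code(x, lows, highs, n_tilings=32, tiles_per_dim=8):
    """Return the indices of the active tiles (one per tiling) for input vector x.

    Each tiling is a uniform grid over [lows, highs], offset by a fraction of a
    tile width, so nearby inputs share many (but not all) active tiles.
    """
    x = np.asarray(x, dtype=float)
    lows, highs = np.asarray(lows, dtype=float), np.asarray(highs, dtype=float)
    widths = (highs - lows) / tiles_per_dim            # tile width in each dimension
    active = []
    for k in range(n_tilings):
        offset = (k / n_tilings) * widths              # shift this tiling by a fraction of a tile
        coords = np.floor((x - lows + offset) / widths).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1) # keep indices inside the grid
        idx = k                                        # flatten (tiling, coords) to one index
        for c in coords:
            idx = idx * tiles_per_dim + int(c)
        active.append(idx)
    return active                                      # ~n_tilings ones in a huge binary vector

# Example: two state variables in [0, 1] x [0, 10]
features = tile_code([0.3, 4.2], lows=[0, 0], highs=[1, 10])
```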
Learning Keepaway Results
3v2, handcrafted takers
[Figure: episode duration in seconds vs. hours of training time (bins of 1000 episodes), for multiple independent runs of TD(λ). The learned keepers come to hold the ball longer than the handcoded, random, and always-hold benchmark policies.]
Stone & Sutton, 2001
Key Distinctions
• Control vs Prediction
• Bootstrapping/Truncation vs Full Returns
• Function approximation vs Table lookup
• Sampling vs Enumeration
• Off-policy vs On-policy
– The distribution d(s)
Off-Policy Instability
• Examples of diverging θ_k are known for
– Linear FA
– Bootstrapping
• Even for
– Prediction
– Enumeration
– Uniform d(s)
Baird, 1995
Gordon, 1995
Bertsekas & Tsitsiklis, 1996
• In particular, linear Q-learning can diverge
Baird's Counterexample
• A Markov chain (no actions) of six states with estimated values $\theta_0 + 2\theta_1,\ \theta_0 + 2\theta_2,\ \ldots,\ \theta_0 + 2\theta_5$ and $2\theta_0 + \theta_6$; every transition leads (100%) to the last state, whose outgoing probabilities are $1-\varepsilon$ and $\varepsilon$
• All states updated equally often, synchronously
• Exact solution exists: $\theta = 0$
• Initial parameter vector $\theta = (1,1,1,1,1,10,1)^{\top}$
[Figure: parameter values $\theta_k(i)$ (log scale, broken at ±1) vs. iterations k; the parameter values grow without bound.]
On-Policy Stability
• If d(s) is the stationary distribution of the MDP under policy π (the on-policy distribution)
• Then convergence is guaranteed for
– Linear FA
– Bootstrapping
– Sampling
– Prediction
Tsitsiklis & Van Roy, 1997
Tadic, 2000
• Furthermore, the asymptotic mean squared error is a bounded expansion of the minimal MSE:
$$\mathrm{MSE}(\theta_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, \min_{\theta} \mathrm{MSE}(\theta)$$
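For a rough sense of the bound (an illustrative calculation, assuming the bound as reconstructed above): with γ = 0.9 the expansion factor (1 − γλ)/(1 − γ) is 10 at λ = 0, 1.9 at λ = 0.9, and 1 at λ = 1, so more bootstrapping (smaller λ) loosens the guarantee.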
Value Function Space
[Diagram summary: the true $V^*$ sits inside a region of the best admissible value function, corresponding to the best admissible policy.]
• Sarsa, TD(λ), and other on-policy methods: chattering, without divergence or guaranteed convergence
• Q-learning, DP, and other off-policy methods: divergence possible
• Original naïve hope: guaranteed convergence to a good policy
• Residual gradient et al.: guaranteed convergence to a less desirable policy
There are Two Different Problems:
Chattering
• Is due to Control + FA
• Bootstrapping not involved
• Not necessarily a problem
• Argmax-ing is to blame
• Being addressed with policy-based methods
Instability
• Is due to Bootstrapping + FA + Off-Policy
• Control not involved
• Off-policy is to blame
Yet we need Off-policy Learning
• Off-policy learning is needed in all the frameworks
that have been proposed to raise reinforcement
learning to a higher level
– Macro-actions, options, HAMs, MAXQ
– Temporal abstraction, hierarchy, modularity
– Subgoals, goal-and-action-oriented perception
• The key idea is: We can only follow one policy,
but we would like to learn about many policies,
in parallel
– To do this requires off-policy learning
On-Policy Policy Evaluation Problem
Use data (episodes) generated by π to learn $\hat Q \approx Q^\pi$
Off-Policy Policy Evaluation Problem
Use data (episodes) generated by π′ (the behavior policy) to learn $\hat Q \approx Q^\pi$, where π is the target policy
Naïve Importance-Sampled TD(λ)
$$\Delta\theta_t = \alpha \left[ R^\lambda_t - \theta^{\top} \phi_{s_t a_t} \right] \phi_{s_t a_t}\, \rho_1 \rho_2 \rho_3 \cdots \rho_{T-1}$$
where the product $\rho_1 \rho_2 \cdots \rho_{T-1}$ is the relative probability of the episode under π and π′, and
$$\rho_t = \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)}$$
is the importance sampling correction ratio for time t.
We expect this to have relatively high variance.
Per-Decision Importance-Sampled TD(λ)
$$\Delta\theta_t = \alpha \left[ \bar R^\lambda_t - \theta^{\top} \phi_{s_t a_t} \right] \phi_{s_t a_t}\, \rho_1 \rho_2 \rho_3 \cdots \rho_t \qquad\qquad \rho_t = \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)}$$
$\bar R^\lambda_t$ is like $R^\lambda_t$, except in terms of
$$\bar R^{(n)}_t = r_{t+1} + \gamma\, r_{t+2}\, \rho_{t+1} + \gamma^2\, r_{t+3}\, \rho_{t+1} \rho_{t+2} + \cdots + \gamma^n\, \rho_{t+1} \cdots \rho_{t+n}\, \theta^{\top} \phi_{s_{t+n} a_{t+n}}$$
Per-Decision Theorem
Precup, Sutton & Singh (2000)
E Rt st , at  E Rt st ,at
New Result for Linear PD Algorithm
Precup, Sutton & Dasgupta (2001)
E q s0 ,a0  E q s0 , a0
Total change over episode
for new algorithm
Total change for
conventional TD()
Convergence Theorem
• Under natural assumptions
– S and A are finite
– All s, a are visited under π′
– π and π′ are proper (terminate w.p. 1)
– bounded rewards
– usual stochastic approximation conditions on the step size $\alpha_k$
• And one annoying assumption
$$\mathrm{Var}\left\{ \rho_1 \rho_2 \cdots \rho_{T-1} \right\} < B \qquad \forall\, s_1 \in S$$
(satisfied, e.g., by bounded episode length)
• Then the off-policy linear PD algorithm converges to the same $\theta_\infty$ as on-policy TD(λ)
The variance assumption is restrictive
But it can often be satisfied with “artificial” terminations
• Consider a modified MDP with bounded episode length
– We have data for this MDP
– Our result assures good convergence for this
– This solution can be made close to the solution of the original problem
– By choosing the episode bound long relative to γ or the mixing time
• Consider application to macro-actions
– Here it is the macro-action that terminates
– Termination is artificial; the real process is unaffected
– Yet all results directly apply to learning about macro-actions
– We can choose macro-action termination to satisfy the variance condition
Empirical Illustration
• Agent always starts at S; terminal states are marked G; actions are deterministic
• Behavior policy chooses up and down with probabilities 0.4 and 0.1
• Target policy chooses up and down with probabilities 0.1 and 0.4
• If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one
Trajectories of Two Components of θ
[Figure: components $\theta_{\text{rightmost,down}}$ and $\theta_{\text{leftmost,down}}$ over episodes (×100,000), with λ = 0.9 and α decreased over time; each component approaches its asymptotic value $\theta^*$, positive for the rightmost feature and negative for the leftmost.]
θ appears to converge as advertised.
Comparison of Naïve and PD IS Algorithms
[Figure: root mean squared error after 100,000 episodes, averaged over 50 runs, as a function of $\log_2 \alpha$ (λ = 0.9, α held constant); the per-decision IS algorithm attains lower error than the naïve IS algorithm.]
Precup, Sutton & Dasgupta, 2001
Can Weighted IS help the variance?
Return to the tabular case and consider two estimators, where $R_i$ is the ith return observed following $s,a$ (occurring at time t) and $w_i = \rho_{t+1} \rho_{t+2} \cdots \rho_{T-1}$ is the corresponding importance sampling correction product:
$$Q^{IS}_n(s,a) = \frac{1}{n} \sum_{i=1}^{n} R_i w_i$$
converges with finite variance iff the $w_i$ have finite variance;
$$Q^{ISW}_n(s,a) = \frac{\sum_{i=1}^{n} R_i w_i}{\sum_{i=1}^{n} w_i}$$
converges with finite variance even if the $w_i$ have infinite variance.
Can this be extended to the FA case?
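A small Python sketch contrasting the two estimators; the returns and correction products are assumed to be given (e.g., collected from episodes that passed through s,a under the behavior policy), and the example data are made up.

```python
import numpy as np

def ordinary_is_estimate(returns, weights):
    """Q_n^IS: average of importance-weighted returns (unbiased, can have huge variance)."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return np.mean(returns * weights)

def weighted_is_estimate(returns, weights):
    """Q_n^ISW: weighted average (biased for finite n, but better-behaved variance)."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return np.sum(returns * weights) / np.sum(weights)

# Example with made-up data: a few returns and their IS correction products.
R = [1.0, 0.0, 2.0, 1.0]
w = [0.1, 3.0, 0.2, 8.0]
print(ordinary_is_estimate(R, w), weighted_is_estimate(R, w))
```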
Restarting within an Episode
• We can consider episodes to start at any time
• This alters the weighting of states,
– But we still converge,
– And to near the best answer (for the new weighting)
Incremental Implementation
At the start of each episode:
$$c_0 = g_0 \qquad\qquad e_0 = c_0 \phi_0$$
On each step ($0 \le t < T$), given the transition $s_t, a_t \to r_{t+1}, s_{t+1}, a_{t+1}$:
$$\rho_{t+1} = \frac{\pi(s_{t+1}, a_{t+1})}{\pi'(s_{t+1}, a_{t+1})}$$
$$\delta_t = r_{t+1} + \gamma\, \rho_{t+1}\, \theta^{\top} \phi_{t+1} - \theta^{\top} \phi_t$$
$$\Delta\theta_t = \alpha\, \delta_t\, e_t$$
$$c_{t+1} = \rho_{t+1} c_t + g_{t+1}$$
$$e_{t+1} = \gamma \lambda\, \rho_{t+1} e_t + c_{t+1} \phi_{t+1}$$
(here $g_t$ is a weight on treating time $t$ as a start; cf. the previous slide on restarting within an episode)
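A sketch of these recursions in Python over one recorded episode; the restart weights g (defaulting to a single start at t = 0), the feature list, and the indexing conventions are my reading of the slide, so this is illustrative rather than a faithful reproduction of the original algorithm.

```python
import numpy as np

def per_decision_td_lambda_episode(theta, phis, rewards, rhos, g=None,
                                   alpha=0.01, gamma=1.0, lam=0.7):
    """Incremental per-decision importance-sampled TD(lambda), one episode.

    phis[t]    : feature vector of (s_t, a_t), t = 0..T (last entry may be zeros at termination)
    rewards[t] : r_{t+1}
    rhos[t]    : rho_{t+1} = pi(s_{t+1}, a_{t+1}) / pi'(s_{t+1}, a_{t+1})
    g[t]       : restart weight at time t (default: restart only at t = 0)
    """
    theta = np.asarray(theta, dtype=float).copy()
    T = len(rewards)
    if g is None:
        g = [1.0] + [0.0] * T          # episode treated as starting only at t = 0
    c = g[0]                            # c_0 = g_0
    e = c * phis[0]                     # e_0 = c_0 * phi_0
    for t in range(T):
        rho = rhos[t]
        delta = rewards[t] + gamma * rho * (theta @ phis[t + 1]) - theta @ phis[t]
        theta += alpha * delta * e      # delta_theta_t = alpha * delta_t * e_t
        c = rho * c + g[t + 1]          # c_{t+1} = rho_{t+1} c_t + g_{t+1}
        e = gamma * lam * rho * e + c * phis[t + 1]
    return theta
```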
Key Distinctions
• Control vs Prediction
• Bootstrapping/Truncation vs Full Returns
• Sampling vs Enumeration
• Function approximation vs Table lookup
• Off-policy vs On-policy
(In each pair, the first alternative is harder, more challenging, and more interesting; the second is easier and conceptually simpler.)
Conclusions
• RL is beating the Curse of Dimensionality
– FA and Sampling
• There is a broad frontier, with many open questions
• MDPs (states, decisions, goals, and probability) are a rich area for mathematics and experimentation