Reinforcement Learning: How Far Can It Go?

Rich Sutton
University of Massachusetts
AT&T Research

With thanks to Doina Precup, Satinder Singh, Amy McGovern, B. Ravindran, Ron Parr
Reinforcement Learning
- An active, popular, successful approach to AI
  - 15-50 years old
  - Emphasizes learning from interaction
  - Does not assume complete knowledge of the world
- World-class applications
- Strong theoretical foundations
- Parallels in other fields: operations research, control theory, psychology, neuroscience
- Seeks simple, general principles

How Far Can It Go?
World-Class Applications of RL
- TD-Gammon and Jellyfish (Tesauro, Dahl)
  - World's best backgammon player
- Elevator Control (Crites & Barto)
  - (Probably) world's best down-peak elevator controller
- Job-Shop Scheduling (Zhang & Dietterich)
  - World's best scheduler of space-shuttle payload processing
- Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin)
  - World's best assigner of radio channels to mobile telephone calls
Outline
- RL Past (1950-1985): Trial and Error Learning
- RL Present (1985-2000): Learning and Planning Values
- RL Future (2000 onward): Constructivism
RL began with dissatisfaction with previous learning problems
- Such as:
  - Learning from examples
  - Unsupervised learning
  - Function optimization
- None seemed to be purposive
  - Where is the learning of how to get something?
  - Where is the learning by trial and error?
- Need rewards and penalties, interaction with the world!
Rooms Example
Early learning methods could not learn how to get reward.
The Reward Hypothesis
That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment.
- Is this reasonable?
- Is it demeaning?
- Is there no other choice?
- It seems to be adequate
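As a minimal formal sketch of the hypothesis (standard notation, an assumption rather than something quoted from the slides): the agent's purpose is cast as maximizing the expected return, the cumulative, possibly discounted, sum of future rewards.

```latex
% Return to be maximized; the discount factor \gamma, 0 \le \gamma \le 1, is an assumed convention
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```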
RL Past – Trial and Error Learning
- Learned only a policy (a mapping from states to actions)
- Maximized only
  - Short-term reward (e.g., learning automata)
  - Or delayed reward via simple action traces
- Assumed good/bad rewards were immediately distinguishable
  - E.g., positive is good, negative is bad
  - An implicitly known reinforcement baseline
- Next steps were to learn baselines and internal rewards
- Taking these next steps quickly led to modern value functions and temporal-difference learning
A Policy
Movement is in the wrong direction 1/3 of the time.
Problems with Value-less RL Methods
Outline
- RL Past (1950-1985): Trial and Error Learning
- RL Present (1985-2000): Learning and Planning Values
- RL Future (2000 onward): Constructivism
The Value-Function Hypothesis
- Value functions = measures of expected reward following states:
  V: States → Expected future reward
  or following state-action pairs:
  Q: States × Actions → Expected future reward
- All efficient methods for optimal sequential decision making estimate value functions
- The hypothesis: that the dominant purpose of intelligence is to approximate these value functions
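A minimal formalization of the two value functions (standard notation, assumed rather than quoted from the slides), in terms of the return G_t above and a policy π:

```latex
V^{\pi}(s)    = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\ A_t = a \right]
```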
State-Value Function
RL Present – Learning and Planning Values
- Accepts the reward and value hypotheses
- Many real-world applications, some impressive
- Theory strong and active, yet still with more questions than answers
- Strong links to operations research
- A part of modern AI's interest in uncertainty: MDPs, POMDPs, Bayes nets, connectionism
- Includes deliberative planning
New Applications of RL
- CMUnited Robocup Soccer Team (Stone & Veloso)
  - World's best player of Robocup simulated soccer, 1998
- KnightCap and TDleaf (Baxter, Tridgell & Weaver)
  - Improved chess play from intermediate to master in 300 games
- Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  - 10-15% improvement over industry standard methods
- Walking Robot (Benbrahim & Franklin)
  - Learned critical parameters for bipedal walking
Real-world applications using on-line learning.
RL Present, Part II: The Space of Methods
[Figure: the space of methods spanned by two dimensions of backups (full vs. sample backups, and shallow vs. deep backups, i.e., bootstrapping, λ), with dynamic programming (full, shallow), exhaustive search (full, deep), temporal-difference learning (sample, shallow), and Monte Carlo (sample, deep) at the corners]
Also:
- Function approximation
- Explore/exploit
- Planning/learning
- Action/state values
- Actor-Critic
- ...
The TD Hypothesis
That all value learning is driven by TD errors
- Even "Monte Carlo" methods can benefit
  - TD methods enable them to be done incrementally
- Even planning can benefit
  - Trajectory following improves function approximation and state sampling
  - Sample backups reduce effect of branching factor
- Psychological support
  - TD models of reinforcement, classical conditioning
- Physiological support
  - Reward neurons show TD behavior (Schultz et al.)
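For concreteness, a minimal sketch of the TD error and the tabular TD(0) update it drives (standard notation, assumed here; the step size α and discount γ are conventions, not values from the talk):

```latex
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t
```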
Planning
- Modern RL includes planning
  - As in planning for MDPs
  - A form of state-space planning
  - Still controversial for some
- Planning and learning are nearly identical in RL
  - The same algorithms run on real or imagined experience
  - Same value functions, backups, function approximation
[Figure: the Dyna loop: experience from acting feeds both direct RL and model learning; planning with the model and direct RL both update the value/policy]
Planning with Imagined Experience
[Figure: interaction with the world yields real experience; imagined interaction with the model yields imagined experience; the same RL algorithm processes both to update the value/policy]
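To make "the same algorithm on real or imagined experience" concrete, here is a minimal Dyna-Q-style sketch. The environment interface (reset, step, actions), the table-based representation, and all hyperparameters are illustrative assumptions, not details from the talk.

```python
# Minimal Dyna-Q-style sketch: one TD backup rule applied both to real
# experience (direct RL) and to imagined experience drawn from a learned
# model (planning).  Environment interface and hyperparameters are assumed.
import random
from collections import defaultdict

def dyna_q(env, episodes=50, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)      # Q[(state, action)] -> estimated value
    model = {}                  # model[(state, action)] -> (reward, next_state, done)

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            r, s2, done = env.step(a)

            # direct RL: one-step backup from real experience
            target = r + (0.0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # model learning: remember the observed transition
            model[(s, a)] = (r, s2, done)

            # planning: the same backup applied to imagined experience
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q
```

Only the source of experience differs between the direct-RL backup and the planning backups; the value function and the update rule are shared.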
Outline
- RL Past (1950-1985): Trial and Error Learning
- RL Present (1985-2000): Learning and Planning Values
- RL Future (2000 onward): Constructivism
Constructivism
(Piaget, Drescher)
The active construction of representations and models of the world to facilitate the learning and planning of values.
[Figure: representations and models (great flexibility here) feed the value functions and the policy]
Constructivist Prophecy
- Whereas RL present is about solving an MDP,
- RL future will be about representing the
  - States
  - Actions
  - Transitions
  - Rewards
  - Features
  to construct an MDP (the RL agent as active world modeler)
- Constructing the world to be the way we want it:
  - Markov
  - Reliable
  - Deterministic
  - Linear
  - Small
  - Independent
  - Shallow
  - Additive
  - Low branching
Representing State, Part I: Features and Function Approximation
- Linear-in-the-features methods are state of the art
- Memory-based methods
- Two-stage architecture: State → Features → Values
  - Compute the feature values: a nonlinear, expansive, fixed or slowly changing mapping
  - Map the feature values linearly to the result: a linear, convergent, fast-changing mapping
- Works great if features are appropriate
  - Fast, reliable, local learning; good generalization
- Feature construction best done by hand
  - ...or by methods yet to be found
Constructive Induction
[Figure: good features correspond to regions of similar value; bad features are unrelated to values]
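A minimal sketch of the two-stage architecture with a semi-gradient TD(0) update on the linear weights; the feature map (a fixed random projection standing in for hand-built features), the dimensions, and the step size are illustrative assumptions.

```python
# Two-stage value approximation: a fixed, expansive feature map phi,
# and fast linear learning on top of it.  The random-projection features
# are only a stand-in for hand-built features; all sizes are assumptions.
import numpy as np

def make_features(num_features, state_dim, seed=0):
    proj = np.random.default_rng(seed).normal(size=(num_features, state_dim))
    return lambda state: (proj @ np.asarray(state, dtype=float) > 0.0).astype(float)

phi = make_features(num_features=64, state_dim=2)   # stage 1: state -> features
w = np.zeros(64)                                     # stage 2: linear weights

def value(state):
    """Approximate value: linear in the features."""
    return float(w @ phi(state))

def td0_update(s, r, s_next, done, alpha=0.05, gamma=0.95):
    """Semi-gradient TD(0): only the fast linear stage is updated."""
    global w
    target = r + (0.0 if done else gamma * value(s_next))
    w += alpha * (target - value(s)) * phi(s)
```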
Representing State, Part II: Partial Observability
- When immediate observations do not uniquely identify the current state; non-Markov problems
- Not as big a deal as widely thought
  - A greater problem for theory than for practice
  - Need not use POMDP ideas
- Can treat as a function approximation issue
  - Making do with imperfect observations/features
  - Finding the right memories to add as new features
- The key is to construct state representations that make the world more Markov (McCallum's thesis)
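One simple illustration of "adding memories as features", under assumed interfaces: augment the current observation with the last k observations and actions, so the constructed state is closer to Markov. A fixed window is a crude stand-in for McCallum-style selective memory.

```python
# Construct a state from the current observation plus the last k
# (observation, action) pairs.  The fixed window length k is an assumed
# simplification; methods like McCallum's instead learn which memories matter.
from collections import deque

class HistoryState:
    def __init__(self, k=3):
        self.history = deque(maxlen=k)   # most recent (observation, action) pairs

    def reset(self, obs):
        self.history.clear()
        return self._state(obs)

    def step(self, prev_obs, action, obs):
        self.history.append((prev_obs, action))
        return self._state(obs)

    def _state(self, obs):
        # A hashable tuple usable as a table index or as input to a feature map.
        return (obs,) + tuple(self.history)
```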
Representations of Action
- Nominally, actions in RL are low-level
  - The lowest level at which behavior can vary
- But people work mostly with courses of action
  - We decide among these
  - We make predictions at this level
  - We plan at this level
- Remarkably, all this can be incorporated in RL
  - Course of action = policy + termination condition
  - Almost all RL ideas, algorithms and theory extend
  - Wherever actions are used, courses of action can be substituted
- Parr, Bradtke & Duff, Precup, Singh, Dietterich, Kaelbling, Huber & Grupen, Szepesvari, Dayan, Ryan & Pendrith, Hauskrecht, Lin...
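A minimal sketch of a course of action as the slide defines it (a policy plus a termination condition), with an executor that runs it like a single temporally extended action; the environment interface and names are illustrative assumptions.

```python
# A course of action = a policy plus a termination condition.  Running it
# looks like taking one (temporally extended) action, so it can be
# substituted wherever primitive actions are used.  Interfaces are assumed.
from dataclasses import dataclass
from typing import Any, Callable
import random

@dataclass
class CourseOfAction:
    policy: Callable[[Any], Any]        # state -> primitive action
    terminate: Callable[[Any], float]   # state -> probability of terminating

def run_course(env, state, course, gamma=0.95):
    """Follow the course until its termination condition fires.
    Returns the discounted reward along the way and the resulting state:
    exactly the quantities a model of the course must predict."""
    total, discount, done = 0.0, 1.0, False
    while not done and random.random() >= course.terminate(state):
        reward, state, done = env.step(course.policy(state))
        total += discount * reward
        discount *= gamma
    return total, state, done
```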
Room-to-Room Courses of Action
A course of action for each hallway from each room (2 of 8 shown).
Representing Transitions
- Models can also be learned for courses of action
  - What state will we be in at termination?
  - How much reward will we receive along the way?
- The mathematical form of the models follows from the theory of semi-Markov decision processes
- Permits planning at a higher level
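For concreteness, the standard semi-MDP-style model of a course of action o initiated in state s (notation assumed, in the spirit of the options literature): a reward part and a discounted terminal-state part, which together support planning at the higher level.

```latex
% Reward model: expected discounted reward accumulated while following o from s,
% where k is the (random) number of steps until termination
r(s, o) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{k-1} R_{t+k}
          \,\middle|\, S_t = s,\ o \text{ initiated at } t \right]

% State-transition model: discounted probability of terminating in state s'
p(s' \mid s, o) = \mathbb{E}\left[ \gamma^{k}\, \mathbf{1}\{S_{t+k} = s'\}
          \,\middle|\, S_t = s,\ o \text{ initiated at } t \right]
```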
Planning (Value Iteration) with Courses of Action
[Figure: value iteration in the rooms example with V(goal) = 1, shown at iterations #0, #1, #2, both with cell-to-cell primitive actions and with room-to-room courses of action]
Reconnaissance Example (B. Ravindran, UMass)
- Mission: fly over (observe) the most valuable sites and return to base
- Stochastic weather affects observability (cloudy or clear) of sites
- Limited fuel
- Intractable with classical optimal control methods
  - About 100 decision steps
- Actions:
  - Primitives: which direction to fly
  - Courses: which site to head for
[Figure: map of sites annotated with rewards (e.g., 5, 8, 10, 15, 50, 100), the mean time between weather changes (25), and the base]
- Courses compress space and time
  - Reduce steps from ~600 to ~6
  - Reduce states from ~10^11 to ~10^6
  - Enable finding of best solutions
- Courses of action permit enormous flexibility
Subgoals
- Courses of action are often goal-oriented
  - E.g., drive-to-work, open-the-door
- A course can be learned to achieve its goal
- Many can be learned at once, independently
  - Solves classic problem of subgoal credit assignment
  - Solves psychological puzzle of goal-oriented action
- Goal-oriented courses of action create a better MDP
  - Fewer states, smaller branching factor
  - Compartmentalizes dependencies
  - Their models are also goal-oriented recognizers...
Perception
- Real perception, like real action, is temporally extended
- Features are ability-oriented rather than sensor-oriented
  - What is a chair? Something that can be sat upon
- Consider a goal-oriented course of action, like dock-with-charger
  - Its model gives the probability of successfully docking as a function of state
  - I.e., a feature (detector) for states that afford docking
  - Such features can be learned without supervision
[Figure: a charger and the "dockable region" around it]
This is RL with a totally different feel
- Still one primary policy and set of values
- But many other policies, values, and models are learned not directly in service of reward
- The dominant purpose is discovery, not reward
  - What possibilities does this world afford?
  - How can I control and predict it in a variety of ways?
- In other words, constructing representations to make the world:
  - Markov
  - Reliable
  - Deterministic
  - Linear
  - Small
  - Independent
  - Shallow
  - Additive
  - Low branching
Imagine
- An agent driven primarily by biased curiosity
- To discover how it can predict and control its interaction with the world
  - What courses of action have predictable effects?
  - What salient observables can be controlled?
  - What models are most useful in planning?
- A human coach presenting a series of
  - Problems/tasks
  - Courses of action
  - Highlighting key states, providing subpolicies, termination conditions...
What is New?
- Constructivism itself is not new. But actually doing it would be!
- Does RL really change it, make it easier? That is, do values and policies help?
- Yes! Because so much constructed knowledge is
  - well represented as values and policies
  - in service of approximating values and policies
- RL's goal-orientation is also critical to modeling goal-oriented action and perception
Take Home Messages
- RL Past
  - Let's revisit, but not repeat past work
- RL Present
  - Do you accept that value functions are critical?
  - And that TD methods are the way to find them?
- RL Future
  - It's time to address representation construction
  - Explore/understand the world rather than control it
  - RL/values provide new structure for this
  - May explain goal-oriented action and perception
How far can RL go?
- A simple and general formulation of AI
- Yet there is enough structure to make progress
- While this is true, we should complicate no further, but seek general principles of AI
- They may take us all the way to human-level intelligence