Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning
[A Survey and Comparison of HRL techniques]
Mausam
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
Decision Making
Environment
What action next?
(The agent receives a Percept from the Environment and emits an Action.)
Slide courtesy of Dan Weld
Personal Printerbot
States (S) : {loc, has-robot-printout, user-loc, has-user-printout}, map
Actions (A) : {moven, moves, movee, movew, extend-arm, grab-page, release-pages}
Reward (R) : +20 if h-u-po, else -1
Goal (G) : all states with h-u-po true.
Start state : a state with h-u-po false.
Episodic Markov Decision Process
Episodic MDP ≡ ⟨S, A, P, R, G, s0⟩ : an MDP with absorbing goals.
S : Set of environment states.
A : Set of available actions.
P : Probability transition model, P(s'|s,a)*.
R : Reward model, R(s)*.
G : Absorbing goal states.
s0 : Start state.
γ : Discount factor**.
* Markovian assumption.
** Bounds R for an infinite horizon.
Goal of an Episodic MDP
Find a policy π : S → A which:
maximises expected discounted reward for a fully observable* episodic MDP,
if the agent is allowed to execute for an indefinite horizon.
* Non-noisy, complete-information perceptors.
Solution of an Episodic MDP
Define V*(s) : optimal expected reward starting in state s.
Value Iteration : start with an estimate of V*(s) and successively re-estimate it via the Bellman backup
V(s) ← R(s) + γ max_a Σ_{s'} P(s'|s,a) V(s')
until it converges to a fixed point.
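To make the backup concrete, here is a minimal tabular value-iteration sketch in Python. The dictionary-based P and R, the goal handling, and the termination threshold are illustrative assumptions, not part of the talk.

```python
# Minimal tabular value iteration (a sketch, assuming P and R are given as dicts).
# P[(s, a)] is a list of (next_state, probability) pairs; R[s] is the reward in state s.

def value_iteration(states, actions, P, R, goals, gamma=0.95, eps=1e-6):
    V = {s: 0.0 for s in states}                      # initial estimate of V*
    while True:
        delta = 0.0
        for s in states:
            if s in goals:                            # absorbing goals: value kept fixed (simplification)
                continue
            # Bellman backup: one-step lookahead over all actions
            best = max(sum(p * V[s2] for s2, p in P[(s, a)]) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:                               # (near) fixed point reached
            return V
```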
Complexity of Value Iteration
Each iteration – polynomial in |S|
Number of iterations – polynomial in |S|
Overall – polynomial in |S|
Polynomial in |S|, but |S| is exponential in the number of features in the domain*.
* Bellman's curse of dimensionality.
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
Learning
Environment
•Gain knowledge
•Gain understanding
•Gain skills
•Modification of
behavioural tendency
Data
Decision Making while Learning*
Environment
•Gain knowledge
•Gain understanding
•Gain skills
•Modification of
behavioural tendency
What action next?
(Each Percept from the Environment is a Datum for learning and an input for choosing the next Action.)
* Known as Reinforcement Learning.
Reinforcement Learning
Unknown P and reward R.
Learning Component : Estimate the P and R
values via data observed from the
environment.
Planning Component : Decide which actions
to take that will maximise reward.
Exploration vs. Exploitation
GLIE (Greedy in Limit with
Infinite Exploration)
Learning
Model-based learning
Learn the model, and do planning
Requires less data, more computation
Model-free learning
Plan without learning an explicit model
Requires a lot of data, less computation
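As a small illustration of the model-based route, here is a sketch that estimates P and R by counting observed transitions; the (s, a, r, s') tuple format and the per-(s,a) reward estimate are assumptions made for the example.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate P(s'|s,a) and an average reward per (s,a) from observed (s, a, r, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P = {sa: {s2: n / visits[sa] for s2, n in succ.items()} for sa, succ in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R   # a planner (e.g. value iteration) then runs on the estimated model
```

A model-free learner skips this step entirely and updates value estimates directly from the same experience, as in the Q-Learning update on the next slides.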
Q-Learning
Instead of learning P and R, learn Q* directly.
Q*(s,a) : optimal expected reward starting in s, if the first action is a and the optimal policy is followed afterwards.
Q* directly defines the optimal policy: π*(s) = argmax_a Q*(s,a), i.e. take the action with the maximum Q* value.
Q-Learning
Given an experience tuple ⟨s, a, s', r⟩, blend the old estimate of the Q value with the new estimate:
Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
Under suitable assumptions and GLIE exploration, Q-Learning converges to optimal.
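A minimal tabular Q-Learning loop matching the update above. The environment interface (reset()/step() returning (s', r, done)) is a hypothetical stand-in, and plain ε-greedy is used in place of a full GLIE schedule.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                                # Q[(s, a)], 0.0 for unseen pairs
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                 # epsilon-greedy: simple stand-in for GLIE
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)                     # experience tuple <s, a, s', r>
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target   # blend old and new estimates
            s = s2
    return Q
```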
Semi-MDP: When actions take time.
The Semi-MDP equation:
Q(s,a) = E[ r + γ^N max_{a'} Q(s',a') ]
Semi-MDP Q-Learning update:
Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ^N max_{a'} Q(s',a') ]
where the experience tuple is ⟨s, a, s', r, N⟩ and
r = accumulated discounted reward while action a was executing.
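The only change for temporally extended actions is the γ^N discount. A sketch of the single update, assuming the same tabular Q (a defaultdict) as in the loop above:

```python
def smdp_q_update(Q, actions, s, a, s2, r, N, alpha=0.1, gamma=0.95):
    """Semi-MDP Q-Learning update for experience <s, a, s', r, N>.
    r is the discounted reward accumulated while a executed for N steps."""
    target = r + (gamma ** N) * max(Q[(s2, a_)] for a_ in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q
```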
Printerbot
Paul G. Allen Center has 85000 sq ft space
Each floor ~ 85000/7 ~ 12000 sq ft
Discretise location on a floor: 12000 parts.
State Space (without map) :
2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 states. Very large!
How do humans do the
decision making?
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
1. The Mathematical Perspective
A Structure Paradigm
S : Relational MDP
A : Concurrent MDP
P : Dynamic Bayes Nets
R : Continuous-state MDP
G : Conjunction of state variables
V : Algebraic Decision Diagrams
π : Decision List (RMDP)
2. Modular Decision Making
•Go out of room
•Walk in hallway
•Go in the room
2. Modular Decision Making
Humans plan modularly at different
granularities of understanding.
Going out of one room is similar to going
out of another room.
Navigation steps do not depend on whether
we have the print out or not.
3. Background Knowledge
Classical Planners using additional control
knowledge can scale up to larger problems.
(E.g. : HTN planning, TLPlan)
What forms of control knowledge can we
provide to our Printerbot?
First pick printouts, then deliver them.
Navigation – consider rooms and hallways separately, etc.
A mechanism that exploits all three
avenues : Hierarchies
1. Way to add a special (hierarchical)
structure on different parameters of an
MDP.
2. Draws from the intuition and reasoning in
human decision making.
3. Way to provide additional control
knowledge to the system.
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
Hierarchy
Hierarchy of : Behaviour, Skill, Module,
SubTask, Macro-action, etc.
Examples: picking the pages, collision avoidance, the fetch-pages phase, walking in the hallway.
HRL ≡ RL with temporally extended actions.
Hierarchical Algos ≡ Gating Mechanism
Hierarchical Learning
•Learning the gating function
•Learning the individual behaviours
•Learning both
(In the gating figure, g is a gate and b_i is a behaviour*.)
* Can be a multilevel hierarchy.
Option : Movee until end of hallway
Start : any state in the hallway.
Execute : the policy as shown.
Terminate : when s is the end of the hallway.
Options
[Sutton, Precup, Singh’99]
An option is a well defined behaviour.
o = ⟨ I_o, π_o, β_o ⟩
I_o : set of states (I_o ⊆ S) in which o can be initiated.
π_o(s) : policy (S → A*) followed while o is executing.
β_o(s) : probability that o terminates in s.
* Can be a policy over lower-level options.
Learning
An option is a temporally extended action with a well-defined policy.
The set of options (O) replaces the set of actions (A).
Learning occurs outside options.
Learning over options ≡ Semi-MDP Q-Learning.
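A minimal sketch of an option as a data structure mirroring ⟨I_o, π_o, β_o⟩, plus a routine that executes it to termination so the Semi-MDP update above can be applied. The Option fields, the env interface and execute_option are illustrative names, not from the original papers.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    init_set: Set[Any]                     # I_o : states where the option may be initiated
    policy: Callable[[Any], Any]           # pi_o(s) : action (or lower-level option) to take
    termination: Callable[[Any], float]    # beta_o(s) : probability of terminating in s

def execute_option(env, s, option, gamma=0.95):
    """Run the option until it terminates; return (s', accumulated discounted reward, duration N)."""
    assert s in option.init_set            # an option can only start inside I_o
    total_r, discount, N = 0.0, 1.0, 0
    while True:
        s, r, done = env.step(option.policy(s))
        total_r += discount * r
        discount *= gamma
        N += 1
        if done or random.random() < option.termination(s):
            return s, total_r, N
```

Learning over a set of options then amounts to applying the Semi-MDP update with options in place of primitive actions.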
Machine: Movee + Collision Avoidance
(FSA sketch: execute Movee repeatedly; on Obstacle, a Choose node Calls submachine M1 or M2, each a short detour built from Moven / Moves / Movew that then Returns; at End of hallway, Return.)
Hierarchies of Abstract Machines
[Parr, Russell’97]
A machine is a partial policy represented by
a Finite State Automaton.
Node :
Execute a ground action.
Call a machine as a subroutine.
Choose the next node.
Return to the calling machine.
Learning
Learning occurs within machines, as
machines are only partially defined.
Flatten all machines out and consider states [s,m], where s is a world state and m a machine node ≡ MDP.
reduce(S∘M) : consider only the states whose machine node is a choice node ≡ Semi-MDP.
Learning ≈ Semi-MDP Q-Learning.
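A compact sketch of a HAM as a finite-state machine with the four node types, where choices are the only places the learner intervenes. The machine encoding (dicts keyed by node name, a fixed "start" node) and the choose callback are assumptions for the example, not the original formulation.

```python
# Each machine is a dict: node -> (kind, payload), with kind in {"action", "call", "choice", "stop"}.
# Every machine is assumed to have an entry node named "start".

def run_machine(env, s, machines, name, Q, choose):
    """Execute machine `name` from world state s. `choose` picks a successor at choice nodes,
    which is exactly where Semi-MDP Q-Learning over [s, machine-node] states operates."""
    node = "start"
    while True:
        kind, payload = machines[name][node]
        if kind == "action":                              # execute a ground action
            s, _, _ = env.step(payload["action"])
            node = payload["next"]
        elif kind == "call":                              # call another machine as a subroutine
            s = run_machine(env, s, machines, payload["machine"], Q, choose)
            node = payload["next"]
        elif kind == "choice":                            # learned decision point
            node = choose(Q, (s, name, node), payload["next_nodes"])
        else:                                             # "stop": return to the calling machine
            return s
```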
Task Hierarchy: MAXQ Decomposition
[Dietterich’00]
Root : {Fetch, Deliver}
Fetch : {Navigate(loc), Take}
Deliver : {Navigate(loc), Give}
Take : {Extend-arm, Grab}
Give : {Extend-arm, Release}
Navigate(loc) : {Moven, Moves, Movew, Movee}
Children of a task are unordered.
MAXQ Decomposition
Augment the state s by adding the subtask i : [s,i].
Define C([s,i],j) as the reward received in i after j finishes.
Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))*
(reward received while navigating + reward received after navigation)
Express Q in terms of V and C.
Learn C, instead of learning Q.
* Observe the context-free nature of the Q-value.
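A sketch of the evaluation recursion behind the decomposition, Q([s,i],j) = V([s,j]) + C([s,i],j), over a task hierarchy like the one above. The table layouts (C keyed by (subtask, state, child), a learned value table for primitive actions) are assumptions; this shows the value recursion only, not the full MAXQ-Q learning algorithm.

```python
def maxq_q(V_prim, C, hierarchy, s, i, j):
    """Q([s,i], j): value of invoking child j inside subtask i at state s."""
    return maxq_v(V_prim, C, hierarchy, s, j) + C.get((i, s, j), 0.0)

def maxq_v(V_prim, C, hierarchy, s, i):
    """V([s,i]): primitive tasks use a learned one-step value, composite tasks recurse."""
    children = hierarchy.get(i)
    if not children:                       # primitive action
        return V_prim.get((i, s), 0.0)
    return max(maxq_q(V_prim, C, hierarchy, s, i, j) for j in children)
```

Only C (and the primitive values) are learned; Q is recomputed from the decomposition whenever it is needed.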
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
1. State Abstraction
Abstract state : a state with fewer state variables; different world states map to the same abstract state.
If we can drop some state variables, we can reduce the learning time considerably!
We may use different abstract states for different macro-actions.
State Abstraction in MAXQ
Relevance : only some variables are relevant for the task.
Fetch : user-loc is irrelevant.
Navigate(printer-room) : h-r-po, h-u-po, user-loc are irrelevant.
Fewer parameters for V at the lower levels.
Funnelling : a subtask maps many states to a smaller set of states.
Fetch : all states map to h-r-po = true, loc = printer-room.
Fewer parameters for C at the higher levels.
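A minimal illustration of relevance-based abstraction: each subtask keeps only the variables it depends on, so its value table is indexed by a much smaller key. The RELEVANT table below is a hypothetical assignment in the spirit of the Printerbot example.

```python
# Variables each subtask actually depends on (relevance); everything else is dropped.
RELEVANT = {
    "Fetch":    ("loc", "h-r-po"),         # user-loc is irrelevant for Fetch
    "Navigate": ("loc",),                  # printout flags and user-loc are irrelevant for Navigate
    "Deliver":  ("loc", "h-r-po", "user-loc"),
}

def abstract_state(state, subtask):
    """Project a full state dict onto the variables relevant for `subtask`."""
    return tuple(state[v] for v in RELEVANT[subtask])

# Two different world states collapse to the same abstract state for Navigate.
s1 = {"loc": 17, "h-r-po": True,  "h-u-po": False, "user-loc": 42}
s2 = {"loc": 17, "h-r-po": False, "h-u-po": False, "user-loc": 99}
assert abstract_state(s1, "Navigate") == abstract_state(s2, "Navigate")
```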
State Abstraction in Options, HAM
Options : Learning required only in states
that are terminal states for some option.
HAM : Original work has no abstraction.
Extension: Three-way value decomposition*:
Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
Similar abstractions are employed.
*[Andre,Russell’02]
2. Optimality
Hierarchical Optimality
vs.
Recursive Optimality
Optimality
Options : hierarchical optimality.
Use (A ∪ O) : global optimality**.
Interrupt options.
HAM : hierarchical optimality*.
MAXQ : recursive optimality*.
Interrupt subtasks.
Use pseudo-rewards.
Iterate!
* Can define equations for both optimalities.
** The advantage of using macro-actions may be lost.
3. Language Expressiveness
Option
Can only input a complete policy
HAM
Can input a complete policy.
Can input a task hierarchy.
Can represent “amount of effort”.
Later extended to partial programs.
MAXQ
Cannot input a policy (full/partial)
4. Knowledge Requirements
Options
Requires complete specification of policy.
One could learn option policies – given subtasks.
HAM
Medium requirements
MAXQ
Minimal requirements
5. Models advanced
Options : Concurrency
HAM : Richer representation, Concurrency
MAXQ : Continuous time, state, actions;
Multi-agents, Average-reward.
In general, more researchers have followed MAXQ:
less input knowledge required, and
value decomposition.
6. Structure Paradigm
S : Options, MAXQ
A : All
P : None
R : MAXQ
G : All
V : MAXQ
π : All
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
Directions for Future Research
Bidirectional State Abstractions
Hierarchies over other RL research
Model based methods
Function Approximators
Probabilistic Planning
Hierarchical P and Hierarchical R
Imitation Learning
Directions for Future Research
Theory
Bounds (goodness of hierarchy)
Non-asymptotic analysis
Automated Discovery
Discovery of Hierarchies
Discovery of State Abstraction
Apply…
Applications
Toy Robot
Flight Simulator
AGV Scheduling
Keepaway soccer
(AGV scheduling diagram: parts stations P1..P4, drop-off stations D1..D4, a warehouse, and assemblies.)
Images courtesy of various sources.
Thinking Big…
"... consider maze domains. Reinforcement learning
researchers, including this author, have spent
countless years of research solving a solved
problem! Navigating in grid worlds, even with
stochastic dynamics, has been far from rocket
science since the advent of search techniques
such as A*.”
-- David Andre
Use planners, theorem provers, etc. as
components in big hierarchical solver.
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise
How to choose an appropriate hierarchy
Look at available domain knowledge
If some behaviours are completely specified –
options
If some behaviours are partially specified –
HAM
If less domain knowledge is available – MAXQ
We can use all three to specify different
behaviours in tandem.
Main ideas in HRL community
Hierarchies speed up learning
Value function decomposition
State Abstractions
Greedy non-hierarchical execution
Context-free learning and pseudo-rewards
Policy improvement by re-estimation
and re-learning.