Apprenticeship Learning
Pieter Abbeel
Stanford University
In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley,
Dmitri Dolgov, Sebastian Thrun.
Machine Learning

Large number of success stories:

Handwritten digit recognition

Face detection

Disease diagnosis

…
All learn from examples a direct mapping from
inputs to outputs.

Reinforcement learning / Sequential decision
making:

Humans still greatly outperform machines.
Reinforcement learning
[Diagram] Dynamics Model Psa (probability distribution over next states given current state and action) + Reward Function R (describes the desirability, i.e., how much it costs, to be in a state) → Reinforcement Learning → Controller π (prescribes actions to take).
Apprenticeship learning
[Diagram] Teacher Demonstration (s0, a0, s1, a1, …) + Dynamics Model Psa + Reward Function R → Reinforcement Learning → Controller π.
Example task: driving
Learning from demonstrations

Learn direct mapping from states to actions.
  Assumes controller simplicity.
  E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.

Inverse reinforcement learning. [Ng & Russell, 2000]
  Tries to recover the reward function from demonstrations.
  Inherent ambiguity makes the reward function impossible to recover.

Apprenticeship learning. [Abbeel & Ng, 2004]
  Exploits reward function structure + provides strong guarantees.
  Related work since: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.
Apprenticeship learning

Key desirable properties:

  Returns controller π with a performance guarantee: its expected sum of rewards comes within ε of the teacher's, under the (unknown) true reward function.

  Short running time.

  Small number of demonstrations required.
Apprenticeship learning algorithm

Assume the reward is linear in known features: Rw(s) = w · φ(s).

Initialize: pick some controller π0.

Iterate for i = 1, 2, … :

  Make the current best guess for the reward function. Concretely, find the reward function such that the teacher maximally outperforms all previously found controllers.

  Find the optimal controller πi for the current guess of the reward function Rw.

  If the margin by which the teacher outperforms all controllers found so far falls below ε, exit the algorithm.
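A minimal numpy sketch of this loop for a small tabular MDP, assuming a linear reward Rw(s) = w · φ(s). The helper names (`solve_mdp`, `feature_expectations`) and the simplified reward step below (which compares against only the most recent controller, rather than against all previous controllers via a max-margin QP or the projection method) are illustrative, not the exact algorithm from the talk.

```python
import numpy as np

def solve_mdp(P, phi, w, T):
    """Finite-horizon value iteration for reward R_w(s) = w . phi(s).
    P has shape (nA, nS, nS); returns a (T, nS) array of greedy actions."""
    nA, nS, _ = P.shape
    R = phi @ w                          # per-state reward, shape (nS,)
    V = np.zeros(nS)
    policy = np.zeros((T, nS), dtype=int)
    for t in reversed(range(T)):
        Q = R[None, :] + P @ V           # Q[a, s] = R(s) + sum_s' P(s'|s,a) V(s')
        policy[t] = Q.argmax(axis=0)
        V = Q.max(axis=0)
    return policy

def feature_expectations(P, phi, policy, s0, T):
    """mu(pi) = E[sum_t phi(s_t)] under the time-varying policy, starting from s0."""
    nS, d = phi.shape
    dist = np.zeros(nS)
    dist[s0] = 1.0
    mu = np.zeros(d)
    for t in range(T):
        mu += dist @ phi
        dist = dist @ P[policy[t], np.arange(nS)]   # propagate the state distribution
    return mu

def apprenticeship_learning(P, phi, mu_teacher, s0, T, eps=1e-2, max_iters=50):
    d = phi.shape[1]
    pi = solve_mdp(P, phi, np.zeros(d), T)          # arbitrary initial controller pi_0
    mu = feature_expectations(P, phi, pi, s0, T)
    for _ in range(max_iters):
        w = mu_teacher - mu              # reward guess: direction in which the teacher wins
        if np.linalg.norm(w) <= eps:     # teacher no longer clearly outperforms: done
            break
        w = w / np.linalg.norm(w)
        pi = solve_mdp(P, phi, w, T)     # RL step for the current reward guess
        mu = feature_expectations(P, phi, pi, s0, T)
    return pi, w
```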
Theoretical guarantees
Highway driving
Input: Driving demonstration
Output: Learned behavior
The only input to the learning algorithm was the driving
demonstration (left panel). No reward function was provided.
Parking lot navigation
Reward function trades off: curvature, smoothness,
distance to obstacles, alignment with principal directions.
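For illustration only, a sketch of how such a feature-weighted path reward could look; the feature definitions and the weight vector below are made up, not the learned ones from the experiments.

```python
import numpy as np

def path_reward(path, obstacles, principal_dir, w):
    """path: (N, 2) waypoints; obstacles: (M, 2); w: weights for the four feature terms."""
    d1 = np.diff(path, axis=0)                        # segment directions
    d2 = np.diff(d1, axis=0)
    curvature  = -np.sum(np.linalg.norm(d2, axis=1))  # penalize bending
    smoothness = -np.sum(np.linalg.norm(d1, axis=1) ** 2)  # short, even steps
    clearance  = np.min(np.linalg.norm(path[:, None, :] - obstacles[None, :, :], axis=2))
    alignment  = np.sum(np.abs(d1 @ principal_dir))   # follow the lot's principal directions
    phi = np.array([curvature, smoothness, clearance, alignment])
    return w @ phi                                    # linear trade-off of the features
```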
Quadruped

Reward function trades off 25 features.

Learn on training terrain.

Test on previously unseen terrain.
[NIPS 2008]
Quadruped on test-board
Apprenticeship learning
[Diagram] Teacher's flight (s0, a0, s1, a1, …) → Learn R → Reward Function R; together with Dynamics Model Psa → Reinforcement Learning → Controller π.
Motivating example

How to fly the helicopter for data collection?
How to ensure that the entire flight envelope is covered by the data collection process?

[Diagram] Textbook model / specification → Collect flight data → Learn model from data → Accurate dynamics model Psa.
Learning the dynamics model

State-of-the-art: E3 algorithm, Kearns and Singh (1998, 2002).
(And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)

[Flow chart] Have a good model of the dynamics? YES → "Exploit"; NO → "Explore".
Learning the dynamics model

State-of-the-art: E3 algorithm, Kearns and Singh (2002).
(And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)

[Flow chart] Have a good model of the dynamics? YES → "Exploit"; NO → "Explore".

Exploration policies are impractical: they do not even try to perform well.
Can we avoid explicit exploration and just exploit?
Apprenticeship learning of the model
[Diagram] Teacher's flight (s0, a0, s1, a1, …) → Learn Psa, and Autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics Model Psa; together with Reward Function R → Reinforcement Learning → Controller π.
Theoretical guarantees

Theorem (informal): with high probability, after a polynomial number of teacher demonstrations and autonomous trials, the algorithm returns a controller whose expected sum of rewards is within ε of the teacher's.

Here, polynomial is with respect to 1/ε, 1/δ (the failure probability), the horizon T, the maximum reward R, and the size of the state space.
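Stated schematically (a hedged restatement of the flavor of the guarantee; the slide's exact theorem and constants are not reproduced in this transcript):

```latex
\[
\Pr\Big[\, \mathbb{E}\big[\textstyle\sum_{t=0}^{T} R(s_t) \,\big|\, \pi \big]
   \;\ge\; \mathbb{E}\big[\textstyle\sum_{t=0}^{T} R(s_t) \,\big|\, \pi^{*} \big] - \varepsilon \,\Big] \;\ge\; 1-\delta,
\qquad
N_{\text{trials}} \;=\; \mathrm{poly}\!\big(\tfrac{1}{\varepsilon}, \tfrac{1}{\delta},\, T,\, R_{\max},\, |S|\big).
\]
```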
Model Learning: Proof Idea

From initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s, a) visited by the pilot.

Our model/simulator will correctly predict the helicopter's behavior under the pilot's controller π*.

Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.

Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa.

If, on the actual helicopter, this controller fails to fly the helicopter (despite the model Psa predicting that it should), then it must be visiting parts of the state space that are inaccurately modeled.

Hence, we get useful training data to improve the model. This can happen only a small number of times.
Learning the dynamics model


Exploiting structure from physics

Explicitly encode gravity, inertia.

Estimate remaining dynamics from data.
Lagged learning criterion


Maximize prediction accuracy of the simulator over
time scales relevant for control (vs. digital integration
time scale).
Similar to machine learning: discriminative vs.
generative.
[Abbeel et al. {NIPS 2005, NIPS 2006}]
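A minimal sketch of this idea, assuming a parametric one-step model and recorded (states, actions) flights: score candidate model parameters by H-step simulation error over control-relevant horizons, rather than by one-step (integration-time-scale) error. The `step` function, the data layout, and the optimizer call are illustrative assumptions, not the exact criterion from the papers cited above.

```python
import numpy as np
from scipy.optimize import minimize

def simulate(theta, s0, actions, step):
    """Roll the parametric model forward from s0 under the recorded actions."""
    s, traj = s0, []
    for a in actions:
        s = step(theta, s, a)
        traj.append(s)
    return np.array(traj)

def lagged_loss(theta, trajs, step, H=50):
    """Sum of H-step prediction errors over windows of the recorded trajectories."""
    loss = 0.0
    for states, actions in trajs:            # one recorded flight = (states, actions)
        for t in range(0, len(actions) - H, H):
            pred = simulate(theta, states[t], actions[t:t + H], step)
            loss += np.sum((pred - states[t + 1:t + H + 1]) ** 2)
    return loss

# theta_hat = minimize(lagged_loss, theta0, args=(trajs, step)).x   # e.g. BFGS / Nelder-Mead
```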
Autonomous nose-in funnel
Related work


Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
Maneuvers presented here are significantly more
difficult than those flown by any other autonomous
helicopter.
Apprenticeship learning
[Diagram] Teacher's flight (s0, a0, s1, a1, …) and Autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics Model Psa; Teacher's flight → Learn R → Reward Function R; both → Reinforcement Learning → Controller π.

Model predictive control

Receding horizon differential dynamic programming
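A generic receding-horizon loop of the kind referred to here, with the trajectory optimizer `plan_trajectory` standing in for differential dynamic programming and `read_state` / `apply_control` standing in for the vehicle interface; all names are illustrative, not the actual flight software.

```python
def run_mpc(x0, total_steps, horizon, plan_trajectory, apply_control, read_state):
    """Re-plan a short trajectory from the latest measured state at every time step,
    execute only the first planned control, then shift the horizon forward."""
    x = x0
    for t in range(total_steps):
        controls = plan_trajectory(x, t, horizon)   # local optimization (DDP-style)
        apply_control(controls[0])                  # apply only the first control
        x = read_state()                            # measure, then re-plan from here
    return x
```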
Apprenticeship learning: summary
[Diagram] Teacher's flight (s0, a0, s1, a1, …) and Autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics Model Psa; Teacher's flight → Learn R → Reward Function R; both → Reinforcement Learning → Controller π.
[Figure] Demonstrations vs. learned reward (trajectory).
Current and future work

Applications:

  Autonomous helicopters to assist in wildland fire fighting.

  Fixed-wing formation flight: estimated fuel savings for a three-aircraft formation: 20%.

Learning from demonstrations only scratches the surface of how humans learn (and teach).

  Safe autonomous learning.

  More general advice taking.
Thank you.

Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.
Airshow accuracy
Chaos
Tic-toc
Full Inverse RL Algorithm

Initialize: pick some arbitrary reward weights w.

For i = 1, 2, …

  RL step: compute the optimal controller πi for the current estimate of the reward function Rw.

  Inverse RL step: re-estimate the reward function Rw.

  If the margin by which the teacher outperforms the controllers found so far falls below ε, exit the algorithm.
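One standard way to write the re-estimation step (the max-margin form from Abbeel & Ng, 2004; the slide's own formula is not reproduced in this transcript), with μ(π) denoting the expected accumulated feature values under controller π and μE those of the teacher:

```latex
\[
w^{(i)} \;=\; \arg\max_{\|w\|_2 \le 1}\; \min_{j < i}\; w^{\top}\!\big(\mu_E - \mu(\pi_j)\big),
\qquad
\text{exit when } \min_{j < i} \, {w^{(i)}}^{\top}\!\big(\mu_E - \mu(\pi_j)\big) \le \varepsilon .
\]
```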
Helicopter dynamics model in auto
Parking lot navigation---experiments
Helicopter inverse RL: experiments
Auto-rotation descent
Apprenticeship learning
[Diagram] Teacher's flight (s0, a0, s1, a1, …) and Autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics Model Psa; Teacher's flight → Learn R → Reward Function R; both → Reinforcement Learning → Controller π.
Algorithm Idea

Input to algorithm: approximate model.
Start by computing the optimal controller according to the model.

[Figure: real-life trajectory vs. target trajectory]
Algorithm Idea (2)

Update the model such that it becomes exact for the
current controller.
Performance Guarantees
[Video] First trial (model-based controller) vs. after learning (10 iterations).
Performance guarantee intuition

Intuition by example:

  Let the reward be a weighted sum of two features: R(s) = w1 φ1(s) + w2 φ2(s).

  If the returned controller π accumulates the two features in the same expected amounts as the teacher's controller,

  then no matter what the values of w1 and w2 are, the controller π performs as well as the teacher's controller π*.
Summary

[Diagram] Teacher: human pilot flight (a1, s1, a2, s2, a3, s3, …) and Autonomous flight (a1, s1, a2, s2, a3, s3, …) → Learn Psa → Dynamics Model Psa; Learn R → Reward Function R; Reinforcement Learning → Controller π.

When given a demonstration: automatically learn the reward function, rather than (time-consumingly) hand-engineer it.

Unlike exploration methods, our algorithm concentrates on the task of interest, and always tries to fly as well as possible.

High-performance control, max E[R(s0) + … + R(sT)], with a crude model + a small number of trials.
Reward: Intended trajectory

Perfect demonstrations are extremely hard to obtain.

Multiple trajectory demonstrations:


Every demonstration is a noisy instantiation of the
intended trajectory.
Noise model captures (among others):




Position drift.
Time warping.
If different demonstrations are suboptimal in
different ways, they can capture the “intended”
trajectory implicitly.
[Related work: Atkeson & Schaal, 1997.]
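An illustrative generative sketch of the noise model described above: each demonstration is the hidden intended trajectory corrupted by slow position drift, time warping, and observation noise. The specific drift and warp forms below are assumptions for illustration only, not the model used in the experiments.

```python
import numpy as np

def synthetic_demo(intended, rng, drift_scale=0.02, warp_scale=0.3, obs_noise=0.01):
    """intended: (T, d) hidden trajectory; returns one noisy demonstration of length T."""
    T, d = intended.shape
    drift = np.cumsum(rng.normal(0.0, drift_scale, size=(T, d)), axis=0)  # position drift
    # time warping: a slowly varying, monotone re-indexing of the intended time axis
    increments = np.maximum(1.0 + rng.normal(0.0, warp_scale, size=T), 0.1)
    warp = np.cumsum(increments)
    warp = (T - 1) * (warp - warp[0]) / (warp[-1] - warp[0])
    idx = np.round(warp).astype(int)
    return intended[idx] + drift + rng.normal(0.0, obs_noise, size=(T, d))

# rng = np.random.default_rng(0)
# demos = [synthetic_demo(intended, rng) for _ in range(5)]  # suboptimal in different ways
```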
Outline

Preliminaries: reinforcement learning.

Apprenticeship learning algorithms.

Experimental results on various robotic platforms.
Reinforcement learning (RL)

[Diagram] state s0 → (action a0, system dynamics Psa) → s1 → (a1, Psa) → s2 → … → sT-1 → (aT-1, Psa) → sT, accumulating reward R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT).

Goal: Pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)].

Solution: controller π which specifies an action for each possible state for all times t = 0, 1, …, T-1.
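As a minimal illustration of this objective, a Monte Carlo estimate of the expected score for a given controller; `sample_next_state` (a draw from Psa), `R`, and `pi` are assumed, illustrative interfaces.

```python
import numpy as np

def expected_score(sample_next_state, R, pi, s0, T, n_rollouts=1000, seed=0):
    """Estimate E[R(s0) + ... + R(sT)] under controller pi by simulated rollouts."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, score = s0, 0.0
        for t in range(T):
            score += R(s)
            a = pi(s, t)                         # controller: action for each state and time
            s = sample_next_state(s, a, rng)     # draw s_{t+1} ~ P_sa
        total += score + R(s)                    # include the final reward R(sT)
    return total / n_rollouts
```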
Model-based reinforcement learning
[Diagram] Run a reinforcement learning algorithm in the simulator → controller π.
Probabilistic graphical model for
multiple demonstrations
Apprenticeship learning for the dynamics model

Algorithms such as E3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems.

Our algorithm:
  Initializes the model from a demonstration.
  Repeatedly executes "exploitation policies" that try to maximize rewards.
  Provably achieves near-optimal performance (compared to the teacher).

Machine learning theory:
  Complicated non-IID sample generating process.
  Standard learning theory bounds not applicable.
  Proof uses a martingale construction over relative losses.

[ICML 2005]
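A schematic sketch of this exploration-free loop; `fit_model`, `solve_controller`, `fly_and_record`, and `good_enough` are illustrative stand-ins for the real components (model class, trajectory optimizer, flight hardware, and performance test).

```python
def apprenticeship_model_learning(teacher_data, fit_model, solve_controller,
                                  fly_and_record, good_enough, max_trials=20):
    data = list(teacher_data)            # initialize the model from the demonstration
    pi, model = None, None
    for _ in range(max_trials):
        model = fit_model(data)          # learn P_sa from all data so far
        pi = solve_controller(model)     # "exploitation" policy: optimal w.r.t. the model
        trial = fly_and_record(pi)       # execute on the real system
        if good_enough(trial):           # near-teacher performance reached
            break
        data.append(trial)               # otherwise the trial visited poorly modeled states:
                                         # useful new data, so refit and repeat
    return pi, model
```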
Accuracy
Non-stationary maneuvers

Modeling extremely complex:

  Our dynamics model state: position, orientation, velocity, angular rate.

  True state: air (!), head-speed, servos, deformation, etc.

Key observation:

  In the vicinity of a specific point along a specific trajectory, these unknown state variables tend to take on similar values.

Example: z-acceleration
Local model learning algorithm
1. Time-align trajectories.
2. Learn locally weighted models in the vicinity of the trajectory, with weights W(t′) = exp(−(t − t′)² / (2σ²)).
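A minimal sketch of step 2: a locally weighted linear model of the dynamics around time t along the time-aligned demonstrations, fit by weighted least squares. The data layout (state-action inputs X, next-state targets Y, per-sample times) and the ridge term are assumptions for illustration.

```python
import numpy as np

def local_model(t, times, X, Y, sigma=10.0, ridge=1e-6):
    """Weighted least squares: predict next state Y from state-action X, with
    weights W(t') = exp(-(t - t')^2 / (2 sigma^2)) favoring points near time t."""
    w = np.exp(-(times - t) ** 2 / (2.0 * sigma ** 2))
    Xb = np.hstack([X, np.ones((len(X), 1))])        # affine term
    A = Xb.T @ (w[:, None] * Xb) + ridge * np.eye(Xb.shape[1])
    B = Xb.T @ (w[:, None] * Y)
    theta = np.linalg.solve(A, B)                    # local linear dynamics around time t
    return theta                                     # predict via: np.append(x, 1.0) @ theta
```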
Algorithm Idea w/Teacher

Input to algorithm:
  Teacher demonstration.
  Approximate model.

[Figure: teacher trajectory vs. trajectory predicted by the simulator/model for the same inputs]
[ICML 2006]
Algorithm Idea w/Teacher (2)

Update the model such that it becomes exact for the
demonstration.
Algorithm Idea w/Teacher (2)

The updated model perfectly predicts the state sequence obtained during the demonstration.

We can use the updated model to find a feedback controller.
Algorithm w/Teacher
1. Record the teacher's demonstration s0, s1, …
2. Update the (crude) model/simulator to be exact for the teacher's demonstration by adding appropriate time biases for each time step.
3. Return the policy π that is optimal according to the updated model/simulator.
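A minimal sketch of step 2, assuming a deterministic crude model `f_model(s, a)` and the demonstration stored as arrays; the names are illustrative, but the additive per-time-step bias follows the description above.

```python
import numpy as np

def compute_time_biases(f_model, states, actions):
    """bias[t] = observed next state minus the crude model's prediction at time t."""
    return np.array([states[t + 1] - f_model(states[t], actions[t])
                     for t in range(len(actions))])

def corrected_model(f_model, biases):
    """Time-indexed simulator that is exact along the demonstration by construction."""
    def step(s, a, t):
        return f_model(s, a) + biases[t]
    return step

# A policy optimized against corrected_model(f_model, biases) is then returned (step 3).
```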
Performance guarantees w/Teacher

Theorem.
Algorithm [iterative]
1. Record the teacher's demonstration s0, s1, …
2. Update the (crude) model/simulator to be exact for the teacher's demonstration by adding appropriate time biases for each time step.
3. Find the policy π that is optimal according to the updated model/simulator.
4. Execute the policy π and record the state trajectory.
5. Update the (crude) model/simulator to be exact along the trajectory obtained with the policy π.
6. Go to step 3.

Related work: iterative learning control (ILC).
Algorithm
1. Find the (locally) optimal policy π for the model.
2. Execute the current policy π and record the state trajectory.
3. Update the model such that the new model is exact for the current policy π.
4. Use the new model to compute the policy gradient ∇ and update the policy: θ := θ + α∇.
5. Go back to Step 2.

Notes:
  The step-size parameter α is determined by a line search.
  Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.
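A sketch of this loop: alternate between (a) making the model exact along the trajectory the current policy actually produces and (b) a local policy improvement step computed in that corrected model. All helper names (`run_on_robot`, `policy_gradient`, `line_search`) are illustrative stand-ins.

```python
def iterative_model_policy_improvement(theta0, f_model, run_on_robot,
                                       policy_gradient, line_search, n_iters=10):
    theta = theta0
    for _ in range(n_iters):
        states, actions = run_on_robot(theta)          # step 2: execute and record
        biases = [states[t + 1] - f_model(states[t], actions[t])
                  for t in range(len(actions))]        # step 3: time-indexed corrections
        step = lambda s, a, t: f_model(s, a) + biases[t]   # model exact for current policy
        g = policy_gradient(theta, step)               # step 4: gradient in corrected model
        alpha = line_search(theta, g, step)            # step size via line search
        theta = theta + alpha * g                      # theta := theta + alpha * gradient
    return theta
```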
Future work
Acknowledgments

Adam Coates, Morgan Quigley, Andrew Y. Ng
J. Zico Kolter, Andrew Y. Ng
Andrew Y. Ng
Morgan Quigley, Andrew Y. Ng
RC Car: Circle
RC Car: Figure-8 Maneuver
Teacher demonstration for quadruped


Full teacher demonstration = sequence of
footsteps.
Much simpler to “teach hierarchically”:


Specify a body path.
Specify best footstep in a small area.
Hierarchical inverse RL

Quadratic programming problem (QP):


quadratic objective, linear constraints.
Constraint generation for path constraints.
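A hypothetical sketch of a QP with this shape (quadratic objective in the reward weights, one linear constraint per labeled footstep saying the labeled placement should score at least as well as each alternative), written with cvxpy. It is not the authors' exact formulation and omits the path constraints / constraint generation.

```python
import numpy as np
import cvxpy as cp

def fit_reward_weights(phi_best, phi_alternatives, C=1.0):
    """phi_best: (n, d) features of the n labeled placements;
    phi_alternatives: list of (k_i, d) feature arrays for the alternatives at each label."""
    n, d = phi_best.shape
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)                  # slack for imperfect labels
    constraints = []
    for i, alts in enumerate(phi_alternatives):
        # labeled choice beats every alternative by a margin of 1, up to slack xi[i]
        constraints.append(phi_best[i] @ w >= alts @ w + 1 - xi[i])
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value
```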
Experimental setup

Training:





Have quadruped walk straight across a fairly simple
board with fixed-spaced foot placements.
Around each foot placement: label the best foot
placement. (about 20 labels)
Label the best body-path for the training board.
Use our hierarchical inverse RL algorithm to learn a
reward function from the footstep and path labels.
Test on hold-out terrains:

Plan a path across the test-board.
Helicopter Flight


Task:

Hover at a specific point.

Initial state: tens of meters away from target.
Reward function trades off:

Position accuracy,

Orientation accuracy,

Zero velocity,

Zero angular rate,

… (11 features total)
Learned from “careful” pilot
Learned from “aggressive” pilot
More driving examples

[Videos] Driving demonstration (left sub-panel) and learned behavior (right sub-panel).

In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.