
Apprenticeship learning
for robotic control
Pieter Abbeel
Stanford University
Joint work with Andrew Y. Ng, Adam Coates,
Morgan Quigley.
This talk
[Diagram: a dynamics model P_sa and a reward function R feed into reinforcement learning, max_π E[R(s_0) + … + R(s_T)], which outputs a control policy π.]
Recurring theme: Apprenticeship learning.
Motivation
In practice, reward functions are hard to specify, and people tend to tweak them a lot.
Motivating example: helicopter tasks, e.g. flip.
Another motivating example: Highway driving.
Apprenticeship Learning
• Learning from observing an expert.
• Previous work:
– Learn to predict the expert’s actions as a function of the state.
– Usually lacks strong performance guarantees.
– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …)
• Our approach:
– Based on inverse reinforcement learning (Ng & Russell, 2000).
– Returns a policy whose performance is as good as the expert’s, as measured according to the expert’s unknown reward function.
– [Most closely related work: Ratliff et al. 2005, 2006.]
Algorithm
For t = 1, 2, …
Inverse RL step:
Estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
RL step:
Compute the optimal policy π_t for the estimated reward w.
[Abbeel & Ng, 2004]
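Below is a minimal Python sketch of this alternating loop. It uses the projection variant of the IRL step from the same paper rather than the max-margin program on the next slide, and it assumes two caller-supplied helpers, solve_mdp(w) for the RL step and estimate_mu(policy) for the feature expectations; those names and the stopping tolerance eps are illustrative choices, not part of the slides.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_mu, eps=0.1, max_iter=100):
    """Alternate IRL and RL steps until the learned feature expectations are
    within eps of the expert's (projection variant of Abbeel & Ng 2004).
    solve_mdp(w) and estimate_mu(policy) are assumed, caller-supplied helpers."""
    k = len(mu_expert)
    # Initialize with an arbitrary policy and its feature expectations.
    policy = solve_mdp(np.random.randn(k))
    mu_bar = estimate_mu(policy)
    policies = [policy]
    w = mu_expert - mu_bar
    for _ in range(max_iter):
        # IRL step: reward weights that most separate the expert's feature
        # expectations from those matched so far.
        w = mu_expert - mu_bar
        margin = np.linalg.norm(w)
        if margin <= eps:              # expert matched to within eps: done
            break
        w = w / margin                 # enforce ||w||_2 <= 1
        # RL step: optimal policy for the estimated reward R(s) = w^T phi(s).
        policy = solve_mdp(w)
        policies.append(policy)
        # Projection update of the matched feature expectations.
        mu = estimate_mu(policy)
        d = mu - mu_bar
        denom = float(np.dot(d, d))
        if denom > 0.0:
            step = float(np.dot(d, mu_expert - mu_bar)) / denom
            mu_bar = mu_bar + np.clip(step, 0.0, 1.0) * d
    return policies, w
```

Each pass through the loop adds one policy to the set {π_i}, exactly as in the algorithm above.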
Algorithm: IRL step
Maximize γ over w, ||w||_2 ≤ 1,
s.t. U_w(π_E) ≥ U_w(π_i) + γ for i = 1, …, t-1.
γ = margin of the expert’s performance over the performance of previously found policies.
U_w(π) = E[ Σ_{t=1}^T R(s_t) | π ] = E[ Σ_{t=1}^T w^T φ(s_t) | π ]
       = w^T E[ Σ_{t=1}^T φ(s_t) | π ]
       = w^T μ(π)
μ(π) = E[ Σ_{t=1}^T φ(s_t) | π ] are the “feature expectations”.
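The feature expectations μ(π) are the only policy-dependent quantity the algorithm needs, and they can be estimated by Monte Carlo rollouts. A minimal sketch, assuming hypothetical helpers rollout(T) (one length-T state trajectory sampled under the policy) and phi(s) (the feature map); neither name appears in the slides.

```python
import numpy as np

def feature_expectations(rollout, phi, n_episodes=100, T=50):
    """Monte Carlo estimate of mu(pi) = E[ sum_{t=1}^T phi(s_t) | pi ].
    rollout(T) and phi(s) are assumed helpers, not part of the slides."""
    mu = 0.0
    for _ in range(n_episodes):
        states = rollout(T)   # one trajectory s_1, ..., s_T under the policy
        mu = mu + np.sum([phi(s) for s in states], axis=0)
    return mu / n_episodes
```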
Feature Expectation Closeness and Performance
If we can find a policy π such that
||μ(π_E) - μ(π)||_2 ≤ ε,
then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have
|U_w*(π_E) - U_w*(π)| = |w*^T μ(π_E) - w*^T μ(π)|
                      ≤ ||w*||_2 ||μ(π_E) - μ(π)||_2   (Cauchy–Schwarz)
                      ≤ ε.
Theoretical Results: Convergence
Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ, and the expert’s feature expectations μ(π_E) be given. Then after at most
k T^2 / ε^2
iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e.,
U_w*(π) ≥ U_w*(π_E) - ε.
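To get a feel for the bound, plugging in illustrative values (our choice, not from the slides) of k = 4 features, horizon T = 100, and ε = 0.1 gives:

```latex
% Illustrative arithmetic only; k, T, and \epsilon are hypothetical values.
\frac{k T^2}{\epsilon^2} = \frac{4 \cdot 100^2}{0.1^2} = 4 \times 10^{6} \text{ iterations (worst case)}
```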
Case study: Highway driving
[Video: left panel, input driving demonstration; right panel, output learned behavior.]
The only input to the learning algorithm was the driving
demonstration (left panel). No reward function was provided.
More driving examples
[Two videos, each pairing a driving demonstration with the corresponding learned behavior.]
In each video, the left sub-panel shows a demonstration of a different driving
“style”, and the right sub-panel shows the behavior learned from watching the
demonstration.
Inverse reinforcement learning summary
Our algorithm returns a policy whose performance is as good as the expert’s, as evaluated according to the expert’s unknown reward function.
The algorithm is guaranteed to converge in poly(k, 1/ε) iterations.
The algorithm exploits reward “simplicity” (vs. policy “simplicity” in
previous approaches).
The dynamics model
[Diagram (repeated): a dynamics model P_sa and a reward function R feed into reinforcement learning, max_π E[R(s_0) + … + R(s_T)], which outputs a control policy π.]
Collecting data to learn the dynamics model
Learning the dynamics model P_sa from data
[Diagram: estimate P_sa from data → dynamics model P_sa; together with a reward function R, reinforcement learning, max_π E[R(s_0) + … + R(s_T)], outputs a control policy π.]
For example, in discrete-state problems, estimate P_sa(s’) to be the fraction of times you transitioned to state s’ after taking action a in state s.
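As a concrete sketch of that count-based estimate (the smoothing prior is our addition, used only so that unvisited (s, a) pairs still define a distribution):

```python
import numpy as np

def estimate_transition_model(transitions, n_states, n_actions, smoothing=1e-3):
    """P_sa(s') = empirical fraction of times s' followed action a in state s.
    `transitions` is a list of (s, a, s_next) index triples; `smoothing` is a
    small uniform prior (our assumption, not from the slides)."""
    counts = np.full((n_states, n_actions, n_states), smoothing)
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    # Normalize each (s, a) row into a probability distribution over s'.
    return counts / counts.sum(axis=2, keepdims=True)
```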
Challenge: Collecting enough data to guarantee that you can model the entire flight envelope.
Collecting data to learn a dynamics model
State-of-the-art: E3 algorithm (Kearns and Singh, 2002)
[Flowchart: Have a good model of the dynamics? If YES, “exploit”; if NO, “explore”.]
Aggressive exploration (Manual flight)
Aggressively exploring the edges of the flight envelope
isn’t always a good idea.
Learning the dynamics
[Diagram: trajectories (a_1, s_1, a_2, s_2, a_3, s_3, …) from expert human pilot flight and from autonomous flight are both used to learn P_sa; the resulting dynamics model P_sa, together with a reward function R, feeds reinforcement learning, max_π E[R(s_0) + … + R(s_T)], which outputs a control policy π.]
Apprenticeship learning of model
Theorem. Suppose that we obtain m = O(poly(S, A, T, 1/ε)) examples from a human expert demonstrating the task. Then after a polynomial number k of iterations of testing/re-learning, with high probability, we will obtain a policy π whose performance is comparable to the expert’s:
U(π) ≥ U(π_E) - ε.
Thus, so long as a demonstration is available, it isn’t necessary to explicitly
explore.
In practice, k=1 or 2 is almost always enough.
[Abbeel & Ng, 2005]
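A rough sketch of the testing/re-learning loop the theorem refers to is given below; all of the callables (fit_model, solve_mdp, fly_and_record) and the stopping tolerance are assumed placeholders, not an implementation from the slides.

```python
def apprenticeship_model_learning(expert_data, fit_model, solve_mdp,
                                  fly_and_record, u_expert, eps=0.1, max_iter=10):
    """Test/re-learn loop: fit a dynamics model to all data seen so far,
    plan in the model, try the controller on the real system, and fold the
    newly visited flight data back into the model.  All callables are
    assumed placeholders; none of these names come from the slides."""
    data = list(expert_data)                      # pilot demonstrations (a_t, s_t)
    policy = None
    for _ in range(max_iter):
        model = fit_model(data)                   # learn P_sa from current data
        policy = solve_mdp(model)                 # RL step inside the simulator
        new_data, u_real = fly_and_record(policy) # trial on the real system
        if u_real >= u_expert - eps:              # comparable to the expert: stop
            break
        data.extend(new_data)                     # the model was wrong where this
                                                  # policy flew; cover that region
    return policy
```

The design mirrors the proof idea on the next slide: a failed real-world trial contributes data exactly where the model was wrong.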
Proof idea
• From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the flight envelope (s, a) visited by the pilot.
• Our model/simulator will therefore correctly predict the helicopter’s behavior under the pilot’s policy π_E.
• Consequently, there is at least one policy (namely π_E) that looks like it is able to fly the helicopter in our simulation.
• Thus, each time we solve the MDP using the current simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa.
• If, on the actual helicopter, this policy fails to fly, even though the model P_sa predicts that it should, then it must be visiting parts of the flight envelope that the model fails to capture accurately.
• Hence, this gives useful training data for modeling new parts of the flight envelope.
Configurations flown (exploitation only)
Tail-in funnel
Nose-in funnel
In-place rolls
In-place flips
Acknowledgements
Andrew Ng,
Adam Coates,
Morgan Quigley
Thank You!