Learning Locomotion: Extreme Learning For Extreme Terrain CMU: Chris Atkeson

Drew Bagnell, James Kuffner,
Martin Stolle, Hanns Tappeiner,
Nathan Ratliff, Joel Chestnutt,
Michael Dille, Andrew Maas
CMU Robotics Institute
Test Recap
• Test 1: Everything went well.
• Test 2: The straight approach worked well; the side approach suffered from a bad plan and a bug.
• Test 3: Dependence on the power supply; a big step up.
• Test 4: Need to handle variation between individual dogs.
Test 0: Establish Learning Context
Test 1: 5 trials (x10 speedup)
Reinforcement Learning: Reward
Hierarchical Approach
• Footstep planner
• Body and leg trajectory planner
• Execution
Footstep Planner in Action
[Figure: terrain, cost map, and resulting footstep plan]
Global Footstep Path Planning
• Use A* to plan a safe sequence of footsteps from the current robot configuration to the goal.
• Try to stay as close to that plan as possible during the trial; replan when:
– We measure that we have deviated from the planned path by a certain amount.
– We tip over while taking a step.
A* Details
• The cost for each foot location and orientation is pre-computed at startup (usually while the robot is calibrating).
• Cost includes the angle of the terrain, the flatness of the terrain, the distance to any drop-offs, and a heuristic measure of whether the knee will hit the terrain at that position and orientation.
• The heuristic is currently the Euclidean distance to the goal.
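A minimal Python sketch of this search, assuming hypothetical callbacks: `successors(state)` enumerates candidate next footholds and `cell_cost(state)` returns the precomputed terrain cost. The Euclidean heuristic matches the bullet above.

```python
import heapq
import math

def astar_footsteps(start, goal, successors, cell_cost, tol=0.05):
    """A* over discretized foot locations. States must be hashable
    (e.g., (x, y, theta) tuples). `successors` and `cell_cost` are
    hypothetical stand-ins for the action set and precomputed costs."""
    def h(s):  # Euclidean distance to the goal, as on the slide
        return math.hypot(goal[0] - s[0], goal[1] - s[1])

    open_set = [(h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while open_set:
        f, g, s, path = heapq.heappop(open_set)
        if h(s) < tol:  # close enough to the goal
            return path
        for nxt in successors(s):
            g2 = g + cell_cost(nxt)
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(open_set, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None  # no safe footstep sequence found
```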
Action Model
[Figure: robot body from above, showing the base foot location]
Reference Actions
• A base foot location is based on the body position and orientation.
• From that base location, a reference set of actions is applied.
• 8 actions for each front foot, 1 action for each rear foot.
• The front feet “lead” the robot, with the rear feet just following along.
Adapting the Reference Actions
A local search is performed for each action to find a safe location that is near the reference action and still within the reachability of the robot.
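A minimal sketch of that local search, assuming `cell_cost` returns `float("inf")` for unsafe cells; the grid radius, step, and leg reach (all in meters) are hypothetical values.

```python
import numpy as np

def adapt_action(ref_xy, hip_xy, cell_cost, reach=0.15, radius=0.04, step=0.01):
    """Search a small grid around the reference foothold for the cheapest
    safe cell that stays within the leg's reachable radius of the hip."""
    best, best_c = None, float("inf")
    for dx in np.arange(-radius, radius + step, step):
        for dy in np.arange(-radius, radius + step, step):
            xy = (ref_xy[0] + dx, ref_xy[1] + dy)
            if np.hypot(xy[0] - hip_xy[0], xy[1] - hip_xy[1]) > reach:
                continue  # outside the robot's reachability
            c = cell_cost(xy)
            if c < best_c:
                best, best_c = xy, c
    return best  # None if no reachable cell was safe
```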
Reference Actions
[Figure: the reference action set]
Decreasing Safety Margins
[Figure: footholds with 2 cm vs. 0 cm safety margins]
Effect on Paths
[Figure: planned paths with 2 cm vs. 0 cm safety margins]
Foot and Body Trajectories
• Foot trajectory is based on the convex hull of the intervening terrain.
• Body trajectory is newly created at each step, based on the next two steps in the path, and has two stages:
– Move into the triangle of support before lifting the foot.
– Stay within the polygon of support and move forward while the foot is in flight.
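A sketch of the foot-trajectory idea, assuming terrain heights are sampled at evenly spaced points along the straight line from liftoff to touchdown; the clearance margin is a hypothetical value.

```python
def swing_heights(terrain_z, clearance=0.03):
    """Return swing-foot heights that track the upper convex hull of the
    intervening terrain, plus a clearance margin."""
    n = len(terrain_z)
    hull = []  # (index, height) pairs on the upper convex hull
    for i, z in enumerate(terrain_z):
        while len(hull) >= 2:
            (i1, z1), (i2, z2) = hull[-2], hull[-1]
            # pop the last point if it falls below the chord to the new point
            if (z2 - z1) * (i - i1) <= (z - z1) * (i2 - i1):
                hull.pop()
            else:
                break
        hull.append((i, z))
    # interpolate hull heights back onto every sample
    out, k = [], 0
    for i in range(n):
        while k + 1 < len(hull) and hull[k + 1][0] <= i:
            k += 1
        if k + 1 < len(hull):
            (i1, z1), (i2, z2) = hull[k], hull[k + 1]
            t = (i - i1) / (i2 - i1)
            out.append(z1 + t * (z2 - z1) + clearance)
        else:
            out.append(hull[k][1] + clearance)
    return out
```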
Interface
Foot Contact Detection
• Foot sensor (not reliable).
• Predicted foot below terrain.
• Z velocity approximately zero and Z acceleration positive.
• Compliance?
• IMU signals? Not for us.
• Motor torque?
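A minimal sketch fusing two of these cues; the thresholds are hypothetical guesses, not tuned values, and the unreliable foot sensor could be OR'ed in as a third cue.

```python
def foot_contact(foot_z, terrain_z, vz, az, vz_tol=0.01, az_min=0.5):
    """Declare contact if the predicted foot is below the terrain model,
    or if the foot has settled (z velocity ~0 and z acceleration positive)."""
    below_terrain = foot_z <= terrain_z
    settled = abs(vz) < vz_tol and az > az_min
    return below_terrain or settled
```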
Test 2: 5 trials (x10 speedup)
Why did we fail?
[Plots: IMU rx, ry and joint angles fl_rx, fr_rx, hr_rx; blue = actual, red = desired]
Reinforcement Learning: Punishment
Test 3
• Software was set for external power rather than battery.
• The initial step up was higher than expected (initial board level).
Varying Speed
[Plot: front left hip ry; blue = actual, red = desired]
Saturation
[Plots: front left hip ry; blue = motor command, red = saturation flag; blue = actual, red = desired]
Slow Speed
[Plots: front left hip ry; blue = actual, red = desired; blue = motor command, red = saturation flag]
Fixes
• Manipulate the clock (works for static walking); a sketch follows.
• Bang-bang servo (allows dynamic locomotion).
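A sketch of the clock manipulation, under the assumption (hypothetical, including the slow-down factor) that the fix slows the trajectory clock while any servo is saturated, so the desired trajectory waits for the robot to catch up.

```python
def advance_clock(t, dt, any_saturated, slow_factor=0.2):
    """Advance the trajectory clock; crawl while any motor is saturated."""
    return t + dt * (slow_factor if any_saturated else 1.0)
```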
Power Supply Axed To Avoid Further Errors (Secondary Reinforcer For Dog)
Test 4
Test 4: What we learned
• Need to be robust to vehicle variation:
– Fore/aft effective center of mass (tipping back)
– Side/side effective center of mass (squatting)
– Leg swing
Plans For Future Development
• Learn To Make Better Plans
• Learn To Plan From Observation
• Memory-Based Policy Learning
• Dynamic Locomotion
Planning: What can be learned?
• Better primitives to plan with
• Better robot/environment models
• Planning parameters
• Better models of robot capabilities
• Better terrain and action cost functions
• Better failure models and models of risk
• Learn how to plan: bias the planner to plan better and faster
• How: policy search, parameter optimization, …
Learn To Make Better Plans
• It takes several days to manually tune a planner.
• We will use policy search techniques to automate this tuning process (a sketch follows).
• The challenge is to do it efficiently.
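A minimal stand-in for such policy search: stochastic hill climbing over the planner's parameter vector, where `eval_cost(theta)` is a hypothetical routine that runs the planner on training terrains and returns the average trial cost.

```python
import random

def tune_planner(eval_cost, theta0, iters=50, sigma=0.1):
    """Perturb the planner parameters and keep changes that lower cost."""
    theta, cost = list(theta0), eval_cost(theta0)
    for _ in range(iters):
        cand = [x + random.gauss(0.0, sigma) for x in theta]
        c = eval_cost(cand)
        if c < cost:  # keep the perturbation only if it helps
            theta, cost = cand, c
    return theta, cost
```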
Learn To Plan From Observation
• Key issue: Do we learn cost functions or value functions?
Learn Cost Functions: Maximum Margin Planning (MMP) Algorithm
• Assumption: the cost function is formed as a linear combination of feature maps.
• Training examples: run the current planner through a number of terrains and take the resulting body trajectories as example paths.
Linear combination of features
[Figure: cost map = w1·(tree detector) + w2·(open space) + w3·(slope) + w4·(smoothed trees)]
MMP Algorithm
Until convergence do:
1) Compute cost maps as a linear combination of features.
2) Technical step: slightly increase the cost of cells you want the planner to plan through.
• This makes it more difficult for the planner to be right.
• Training the planner on harder problems ensures good test performance.
3) Plan through these cost maps using D*.
4) Update based on mistakes:
• If the planned path doesn't match the example, then:
i. Raise the cost of features found along the planned path.
ii. Lower the cost of features found along the example path.
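A minimal sketch of the step-4 update, assuming costs are linear in the features; `feature_counts(path)` is a hypothetical helper that sums the feature vectors over the cells a path visits, and the real algorithm would plan against the loss-augmented map from step 2.

```python
import numpy as np

def mmp_update(w, feature_counts, example_path, planned_path, lr=0.01):
    """One subgradient-style step: raise the cost of features on the
    planned path, lower the cost of features on the example path."""
    w = w + lr * (feature_counts(planned_path) - feature_counts(example_path))
    return np.maximum(w, 0.0)  # keep cell costs nonnegative
```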
MMP Algorithm Properties
• The algorithm is equivalent to a convex optimization problem => no local minima.
• Rapid (linear) convergence.
• Maximum margin interpretation (via the "loss-augmentation" step 2).
• Machine learning theoretic guarantees in online and offline settings.
• Can use boosting to solve feature selection.
Learned Cost Maps
Key Issue: Two Approaches to Control
• Learn the value function
• Build a Planner
Why Values?
• Captures the entire cost-to-go: follow the value function with just a one-step lookahead for optimality (no planning necessary); a sketch follows.
• Learnable in principle: use regression techniques to approximate the cost-to-go.
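A one-line sketch of that lookahead; all callbacks here are hypothetical stand-ins.

```python
def greedy_action(state, actions, step_cost, next_state, value):
    """With the true cost-to-go, the greedy one-step-lookahead choice
    is optimal, so no further planning is needed."""
    return min(actions,
               key=lambda a: step_cost(state, a) + value(next_state(state, a)))
```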
Why Plans?
• In practice, it is very hard to learn useful value functions:
– High dimensional: curse of dimensionality.
– Value features don't generalize well to new environments.
– Hard to build in useful domain knowledge.
• Instead, we can take a planning approach:
– Lots of domain knowledge.
– Costs *do* generalize.
– But: computationally hard; the curse of dimensionality strikes back.
[Figure: the space of values is high dimensional]
Hybrid Algorithm
• A new extension of Maximum Margin Planning to do structured regression: predict values with a planner in the loop.
[Figure: a learned linear combination of "value" features feeds the planner; the learned space of costs]
Proto-results
• Demonstrated an earlier algorithm (MMPBoost) on learning a heuristic:
– Provided orders-of-magnitude faster planning.
• Now adapting it to the higher dimensional space of footsteps instead of the heuristic:
– We hope to bridge the gap: reinforcement learning / value function approximation with the key benefits of planning and cost functions.
Memory-Based Policy Learning
• 1. Remember plans (all have the same goal); a lookup sketch follows this list.
• 2. Remember refined plans.
• 3. Remember plans (many goals) – need a planner to create the policy.
• 4. Remember experiences – need a planner to create the policy.
• We are currently investigating option 1. We will explore options 2, 3, and 4.
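A minimal sketch of option 1: a library mapping remembered states to the actions stored plans took there, queried by nearest neighbor. All names are hypothetical.

```python
import math

def lookup_action(state, library):
    """`library` maps remembered states (tuples) to the action a stored
    plan took there; all plans share one goal. Act by copying the action
    of the nearest remembered state."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(library, key=lambda s: dist(s, state))
    return library[nearest]
```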
Plan Libraries
• Combine local plans to create a global policy.
• Local planners: decoupled planner, A*, RRT, DDP, DIRCOL.
• Remember refined plans and experiences.
Forward Planning To Generate Trajectory Library
[Figure: a single A* search vs. the full trajectory library]
A Plan Library For Little Dog
Remembering Refined Plans
[Figure: commands and tracking errors, before vs. after refinement]
Future Tests
Tasks We Have Trouble On
• Not slipping / maintaining footing.
• More terrain tilt / rock-climbing footholds.
• Big step ups and step downs.
• Dynamic maneuvers (jump over a ditch).
• Dynamic locomotion (trot, pace, bound).
Future Tests
• 1) Longer and/or wider course with more choices.
– Could be done at a test facility with a more extensive mocap system.
– Could be done by using onboard vision to detect marked obstacles.
• 2) More trials per test (10?) so learning can be demonstrated during the test. Score only the best 3(?). Cut testing off after an hour with whatever trials have been performed.
• 3) New terrain boards with harder obstacles, and/or obstacles that require dynamic moves: big step ups and step downs.
• 4) Put surprises in the terrain (terrain file errors), such as terrain a little higher or lower than expected. This tests the quality of the control systems.
• 5) Revisit the evaluation function: should falls be penalized more? How much does speed matter relative to robustness? It seems that failing fast is currently a winning strategy.
• 6) One concern is that our movement strategies do not compete with RHex-like approaches, i.e., clever open-loop robustness. We need to demonstrate that a super "cognitive" dog is possible, one that is ALWAYS in control. Need to think more about how to do this.
• 7) Trotting/pacing/bounding on rough terrain would really push our ability to control the dog. It is not clear how a test would encourage that, other than just mandating the gait to be used.
• 8) Simulate perception: provide a point cloud at a fixed radius on each tick, perhaps with distance-weighted random noise (see the sketch after this list).
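A minimal sketch of item 8, with hypothetical radius and noise parameters; `terrain_pts` is an (N, 3) array of x, y, z samples of the terrain model.

```python
import numpy as np

def sense_point_cloud(terrain_pts, robot_xy, radius=1.0, noise_per_m=0.01):
    """Return terrain points within a fixed radius of the robot,
    corrupted by Gaussian noise that grows with distance."""
    d = np.hypot(terrain_pts[:, 0] - robot_xy[0],
                 terrain_pts[:, 1] - robot_xy[1])
    visible = terrain_pts[d <= radius].copy()
    dv = d[d <= radius]
    visible += np.random.normal(0.0, noise_per_m, visible.shape) * dv[:, None]
    return visible
```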
What to learn?
• Plan better.
• Plan faster.
• Robustness.
• Special cases.
• Utilize vehicle dynamics.