Learning Locomotion: Extreme Learning For Extreme Terrain
CMU: Chris Atkeson, Drew Bagnell, James Kuffner, Martin Stolle, Hanns Tappeiner, Nathan Ratliff, Joel Chestnutt, Michael Dille, Andrew Maas
CMU Robotics Institute

Test Recap
• Test 1: Everything went well.
• Test 2: Straight approach worked well. Side approach: bad plan, plus a bug.
• Test 3: Dependence on power supply. Big step up.
• Test 4: Need to handle dog-to-dog variations.

Test 0: Establish Learning Context

Test 1: 5 trials (x10 speedup)

Reinforcement Learning: Reward

Hierarchical Approach
• Footstep planner
• Body and leg trajectory planner
• Execution

Footstep Planner in Action
[Figures: terrain, cost map, footstep plan]

Global Footstep Path Planning
• Use A* to plan a safe sequence of footsteps from the current robot configuration to the goal.
• Try to stay as close to that plan as possible during the trial; replan when:
– We measure that we have deviated from the planned path by a certain amount.
– We tip over while taking a step.

A* Details
• Cost for each foot location and orientation is pre-computed at startup (usually while the robot is calibrating).
• Cost includes the angle of the terrain, flatness of the terrain, distance to any drop-offs, and a heuristic measure of whether the knee will hit the terrain at that position and orientation.
• Heuristic is currently Euclidean distance to the goal.

Action Model
[Figure: robot body (from above), base foot location]

Reference Actions
• A base foot location is computed from the body position and orientation.
• From that base location, a reference set of actions is applied.
• 8 actions for each front foot, 1 action for each rear foot.
• The front feet "lead" the robot, with the rear feet just following along.

Adapting the Reference Actions
• A local search is performed for each action to find a safe location that is near the reference action and still within the reachability of the robot.
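The global footstep search above can be sketched as a standard A* over a precomputed cost map with a Euclidean heuristic. This is a minimal illustration, not the actual planner: the grid cells and unit step cost here are stand-ins for the real foot locations, orientations, and terrain-derived costs.

```python
import heapq
import math

def astar_footsteps(cost_map, start, goal):
    """A* over a grid of precomputed foot-placement costs.

    cost_map: dict (x, y) -> cost of placing a foot there (math.inf = unsafe).
    Heuristic: Euclidean distance to the goal, as in the slides.
    Returns the footstep sequence from start to goal, or None if unreachable.
    """
    def h(cell):
        return math.hypot(goal[0] - cell[0], goal[1] - cell[1])

    open_set = [(h(start), 0.0, start, None)]
    came_from = {}
    best_g = {start: 0.0}
    while open_set:
        _, g, cell, parent = heapq.heappop(open_set)
        if g > best_g.get(cell, math.inf):
            continue  # stale queue entry
        came_from[cell] = parent
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        x, y = cell
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            place_cost = cost_map.get(nxt, math.inf)
            if place_cost == math.inf:
                continue  # off the map or unsafe
            ng = g + 1.0 + place_cost  # unit step cost + placement cost
            if ng < best_g.get(nxt, math.inf):
                best_g[nxt] = ng
                heapq.heappush(open_set, (ng + h(nxt), ng, nxt, cell))
    return None

# Tiny example: a 3x3 map with one unsafe cell the path must route around.
cost_map = {(x, y): 0.0 for x in range(3) for y in range(3)}
cost_map[(1, 1)] = math.inf
path = astar_footsteps(cost_map, (0, 0), (2, 2))
```

In the real system the replanning triggers from the slides (path deviation, tipping over) would simply re-invoke this search from the current configuration.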
Reference Actions: Decreasing Safety Margins
[Figures: 2 cm vs. 0 cm margins]

Effect on Paths
[Figures: 2 cm vs. 0 cm margins]

Foot and Body Trajectories
• Foot trajectory is based on the convex hull of the intervening terrain.
• Body trajectory is newly created at each step, based on the next two steps in the path, and has two stages:
– Move into the triangle of support before lifting the foot.
– Stay within the polygon of support and move forward while the foot is in flight.

Interface

Foot Contact Detection
• Foot sensor (not reliable).
• Predicted foot position below terrain.
• Z velocity approximately zero and Z acceleration positive.
• Compliance? IMU signals? Not for us.
• Motor torque?

Test 2: 5 trials (x10 speedup)

Why did we fail?
[Plots: IMU rx, ry; fl_rx, fr_rx, hr_rx. Blue = Actual, Red = Desired]

Why did we fail?
Reinforcement Learning: Punishment

Test 3
• Software was set for external power rather than battery.
• Initial step up was higher than expected (initial board level).

Varying Speed
[Plot: front left hip ry. Blue = Actual, Red = Desired]

Saturation
[Plots: front left hip ry. Blue = Motor, Red = Is_Saturated; Blue = Actual, Red = Desired]

Slow Speed
[Plots: front left hip ry. Blue = Actual, Red = Desired; Blue = Motor, Red = Saturated?]

Fixes
• Manipulate the clock (works for static walking).
• Bang-bang servo (allows dynamic locomotion).

Power Supply Axed To Avoid Further Errors (Secondary Reinforcer For Dog)

Test 4

Test 4: What we learned
• Need to be robust to vehicle variation:
– Fore/aft effective center of mass (tipping back).
– Side/side effective center of mass (squatting).
– Leg swing.

Plans For Future Development
• Learn To Make Better Plans
• Learn To Plan From Observation
• Memory-Based Policy Learning
• Dynamic Locomotion

Planning: What can be learned?
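The contact-detection cues listed above can be fused with a simple voting rule, so that the unreliable foot sensor never decides alone. This is an illustrative sketch with assumed thresholds and function names, not the deployed detector.

```python
def foot_contact(sensor_hit, foot_z, terrain_z, z_vel, z_acc,
                 vel_eps=0.01, penetration_eps=0.005):
    """Declare contact when at least two of three cues agree.

    Cues from the slides:
      1) the foot sensor fires (noted as unreliable, so it cannot
         trigger contact by itself),
      2) the predicted foot position is at or below the terrain,
      3) vertical velocity is near zero while vertical acceleration is
         positive (the foot has stopped and is being pushed back up).
    Thresholds vel_eps and penetration_eps are illustrative values.
    """
    below_terrain = foot_z <= terrain_z + penetration_eps
    decelerating = abs(z_vel) < vel_eps and z_acc > 0.0
    votes = sum([bool(sensor_hit), below_terrain, decelerating])
    return votes >= 2
```

For example, a foot predicted at terrain height with near-zero vertical velocity and positive acceleration registers contact even if the sensor misses it, while a sensor firing alone in mid-swing does not.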
• Better primitives to plan with
• Better robot/environment models
• Planning parameters
• Better models of robot capabilities
• Better terrain and action cost functions
• Better failure models and models of risk
• Learn how to plan: bias the planner to plan better and faster
• How: policy search, parameter optimization, …

Learn To Make Better Plans
• It takes several days to manually tune a planner.
• We will use policy search techniques to automate this tuning process.
• The challenge is to do it efficiently.

Learn To Plan From Observation
• Key issue: Do we learn cost functions or value functions?

Learn Cost Functions: Maximum Margin Planning (MMP)
• Assumption: the cost function is a linear combination of feature maps.
• Training examples: run the current planner through a number of terrains and take the resulting body trajectories as example paths.

Linear Combination of Features
[Figure: tree detector, open space, slope, smoothed trees, with weights w1, w2, w3, w4]

MMP Algorithm
Until convergence do:
1) Compute cost maps as a linear combination of features.
2) Technical step: slightly increase the cost of cells you want the planner to plan through.
• Makes it more difficult for the planner to be right.
• Training the planner on harder problems ensures good test performance.
3) Plan through these cost maps using D*.
4) Update based on mistakes. If the planned path doesn't match the example:
i. Raise the cost of features found along the planned path.
ii. Lower the cost of features found along the example path.

MMP Algorithm Properties
• The algorithm is equivalent to a convex optimization problem => no local minima.
• Rapid (linear) convergence.
• Maximum margin interpretation (via the "loss-augmentation" step 2).
• Machine learning theoretic guarantees in online and offline settings.
• Can use boosting to solve feature selection.

Learned Cost Maps

Key Issue: Two Approaches to Control
• Learn the value function.
• Build a planner.

Why Values?
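Step 4 of the MMP loop above amounts to a subgradient step on the feature weights: weights of features the wrong planned path used go up, weights of features the example path used go down. A minimal sketch of that update, with toy feature counts (all names and numbers here are illustrative, not the project's implementation):

```python
def mmp_weight_update(weights, f_example, f_planned, lr=0.1):
    """One MMP update step, following the slides: raise the cost of
    features found along the (wrong) planned path and lower the cost of
    features found along the example path.

    weights: per-feature cost weights.
    f_example, f_planned: total feature counts accumulated along the
    example path and the planner's path, respectively.
    """
    return [w + lr * (fp - fe)
            for w, fe, fp in zip(weights, f_example, f_planned)]

def path_cost(weights, features):
    """Cost of a path under the current linear cost function."""
    return sum(w * f for w, f in zip(weights, features))

# Toy case: the planner's path leans on feature 1, the example on feature 0.
w0 = [1.0, 1.0, 1.0]
f_example = [2.0, 0.0, 1.0]
f_planned = [0.0, 3.0, 1.0]
w1 = mmp_weight_update(w0, f_example, f_planned)
```

After one update the example path becomes cheaper and the mistaken path more expensive, which is exactly the direction that drives the planner toward reproducing the demonstrations.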
• Captures the entire cost-to-go: follow the value function with just a one-step lookahead for optimality (no planning necessary).
• Learnable in principle: use regression techniques to approximate the cost-to-go.

Why Plans?
• In practice, it is very hard to learn useful value functions:
– High dimensional: curse of dimensionality.
– Value features don't generalize well to new environments.
– Hard to build in useful domain knowledge.
• Instead, we can take a planning approach:
– Lots of domain knowledge.
– Costs *do* generalize.
– But: computationally hard (the curse of dimensionality strikes back).

[Figure: space of values is high dimensional]

Hybrid Algorithm
• A new extension of Maximum Margin Planning to do structured regression: predict values with a planner in the loop.
[Figure: features -> learned linear combination -> planner -> learned "value"; space of costs]

Proto-results
• Demonstrated an earlier algorithm (MMPBoost) on learning a heuristic, providing orders-of-magnitude faster planning.
• Now adapting it to the higher-dimensional space of footsteps instead of a heuristic.
• Hope to bridge the gap: reinforcement learning / value-function approximation with the key benefits of planning and cost functions.

Memory-Based Policy Learning
1. Remember plans (all have the same goal).
2. Remember refined plans.
3. Remember plans (many goals); need a planner to create the policy.
4. Remember experiences; need a planner to create the policy.
We are currently investigating option 1. We will explore options 2, 3, and 4.

Plan Libraries
• Combine local plans to create a global policy.
• Local planners: decoupled planner, A*, RRT, DDP, DIRCOL.
• Remember refined plans and experiences.

Forward Planning To Generate a Trajectory Library
[Figures: single A* search; trajectory library]

A Plan Library For Little Dog

Remembering Refined Plans
[Figures: commands before and after; errors]

Future Tests

Tasks We Have Trouble On
• Not slipping / maintaining footing.
• More terrain tilt / rock-climbing footholds.
• Big step ups and step downs.
• Dynamic maneuvers (jump over a ditch).
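Option 1 of the memory-based approach above (remember plans that all share one goal) can be sketched as a trajectory library queried by nearest neighbor: at run time, find the stored state closest to the current state and follow its successor. The class and state representation below are illustrative assumptions, not the project's code.

```python
import math

class TrajectoryLibrary:
    """Memory-based policy sketch: store planned trajectories that all
    reach the same goal, and act by jumping to the stored state nearest
    the current state and following its successor."""

    def __init__(self):
        self.transitions = []  # (state, next_state) pairs from stored plans

    def add_trajectory(self, states):
        # Store each consecutive (state, successor) pair of the plan.
        self.transitions.extend(zip(states, states[1:]))

    def next_state(self, state):
        # Brute-force nearest-neighbor lookup; a real library would use
        # a kd-tree or similar index over the stored states.
        s, s_next = min(self.transitions,
                        key=lambda t: math.dist(state, t[0]))
        return s_next

# Toy 2-D example: one stored plan marching along the x-axis to the goal.
lib = TrajectoryLibrary()
lib.add_trajectory([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
```

Because every stored trajectory ends at the same goal, following successors from any nearby stored state yields a global policy from local plans, which is the point of the plan-library idea.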
• Dynamic locomotion (trot, pace, bound).

Future Tests
• 1) Longer and/or wider course with more choices.
– Could be done with a test facility with a more extensive mocap system.
– Could be done by using onboard vision to detect marked obstacles.
• 2) More trials per test (10?) so we can demonstrate learning during the test. Score only the 3(?) best. Cut testing off after an hour with whatever trials have been performed.
• 3) New terrain boards with harder obstacles, and/or obstacles that require dynamic moves: big step ups and step downs.
• 4) Put surprises in the terrain (terrain file errors), such as terrain a little higher or lower than expected, to test the quality of the control systems.
• 5) Revisit the evaluation function: should falls be penalized more? How much does speed matter relative to robustness? It seems that failing fast is currently a winning strategy.
• 6) One concern is that our movement strategies do not compete with RHex-like approaches, i.e., clever open-loop robustness. We need to demonstrate that a super "cognitive" dog is possible that is ALWAYS in control. We need to think more about how to do this.
• 7) Trotting/pacing/bounding on rough terrain would really push our ability to control the dog. It is not clear how a test would encourage that, other than just mandating the gait to be used.
• 8) Simulate perception: provide a point cloud at a fixed radius on each tick, perhaps with distance-weighted random noise.

What to learn?
• Plan better.
• Plan faster.
• Robustness.
• Special cases.
• Utilize vehicle dynamics.