Physics-Based Reinforcement Learning for Mobile Manipulation

Physics-Based Reinforcement Learning for Mobile Manipulation PhD Proposal Jonathan Scholz June 11, 2014 ! Dr. Dr. Dr. Dr. Dr. Committee: Charles Isbell (IC, Georgia Institute of Technology) Andrea Thomaz (IC, Georgia Institute of Technology) Henrik Christensen (IC, Georgia Institute of Technology) Magnus Egerstedt (ECE, Georgia Institute of Technology) Michael Littman (CS, Brown University) What this proposal is about Robots interacting with objects Interested in using robots to perform tasks Themes: Modeling & estimation Planning with uncertain models Task planning with humanoids?? Many reasons to be optimistic! 2 Reasons to be optimistic 3 Big Questions Where is my robot assistant? Where is my robot butler? Why is this hard? Every problem is different Demos are precarious piles of hacks Requires lots of careful engineering 4 Reducing the engineering burden A Reinforcement Learning approach Advantages of RL: The Robot “Programs” itself! Solid decision-theoretic foundation Treats robot as persistent agent, w/ internal state and beliefs Hides engineering from end-user 5 RL+Robotics: Prior work 16 • • Humanoid Walking ! (Peters, Schaal, Vijayakumar 2003)! Acrobatic Helicopter ! (Ng et al. 2003)! Ball-in-Cup! (Kolber & Peters 2009)! • PILCO: Cart-Pole! (Diesenroth et al. 2011)! PILCO: Block-Stacking ! (Diesenroth et al. 2011)!q̇ = h(q d,k • Marc Peter Deisenroth Carl Edward Rasmussen Dept. of Computer Science & Engineering University of Washington Seattle, WA, USA Dept. of Engineering University of Cambridge Cambridge, UK Abstract—Over the last years, there has been substantial progress in robust manipulation in unstructured environments. Fig. 3. Episodic natural actor-critic for learning dynamic movement primitives. (a) Learning curves comparing the episodic Natural Actor Critic to episodic TheREINFORCE. long-term goal of our work is to get away from precise, (b) Humanoid robot DB which was used for this task. Note that the of the butvariance very expensive robotic systems and to develop affordable, episodic Natural Actor Critic learning is significantly lower than the one of episodic potentially imprecise, self-adaptive manipulator systems that can REINFORCE, with about 10 times faster convergence. interactively perform tasks such as playing with children. In this paper, we demonstrate how a low-cost off-the-shelf robotic system can learn closed-loop policies for a stacking task in only a handful 4.2 Example II: Optimizing Nonlinear Motor Primitives for of trials—from scratch. Our manipulator is inaccurate and provides no pose feedback. For learning a controller in the Humanoid Motion Planning work space of a Kinect-style depth camera, we use a model-based While the previous example demonstrated the feasibility and performance of reinforcement learning technique. Our learning method is data the Natural Actor Critic in a classical example of motor control, this section will efficient, reduces model bias, and deals with several noise sources turn towards an application of optimizing nonlinear dynamic motor primitives in a principled way during long-term planning. We present a for a humanoid robot. In [17, 16], a novel form of representing movement plans way of incorporating state-space constraints into the learning (q d , q̇ d ) for the degrees of freedom (DOF) of a robotic system was suggested in terms of the time evolution of the nonlinear dynamical systemsprocess and analyze the learning gain by exploiting the sequential structure of the stacking task. • • Learning to Control a Low-Cost Manipulator us Data-Efficient Reinforcement Learning d,k , z k , gk , ⌧, ✓k ) (30) I. I NTRODUCTION where (qd,k , q̇d,k ) denote the desired position and velocity of a joint, z k the Over the last years, there has been substantial progress in internal state of the dynamic system, gk the goal (or point attractor) state of robust manipulation in unstructured environments. While exeach DOF, ⌧ the movement duration shared by all DOFs, and ✓k the open parameters of the function h. The original work in [17, 16] demonstrated how isting techniques have the potential to solve various household the parameters ✓k can be learned to match a template trajectory by means of manipulation tasks, they typically rely on extremely expensive supervised learning – this scenario is, for instance, useful as the first step of an robot hardware [12]. The long-term goal of our work is to imitation learning system. Here we will add the ability of self-improvement of light-weight manipulator systems that can the movement primitives in Eq.(30) by means of reinforcementdevelop learning,affordable, which is the crucial second step in imitation learning. interactively play with children. A key problem of cheap The system in Eq.(30) is a point-to-point movement, i.e., manipulators, an episodic task however, is their inaccuracy and the limited from the view of reinforcement learning – continuous (e.g., periodic) movement CST: Navigation! (Konidaris et al. 2012) sensor feedback, if any. In this paper, we show how to use a cheap, off-the-shelf robotic manipulator ($370) and a Kinectstyle (http://www.xbox.com/kinect) depth camera (<$120) to learn a block stacking task [2, 1] under state-space constraints. We use data-efficient reinforcement learning (RL) to train a 6 in the work space of the depth camera. controller directly Fully autonomous RL methods typically require many trials Dieter Fox Dept. of Computer Science & Engine University of Washington Seattle, WA, USA Fig. 1. Low-cost robotic arm by Lynxmotion [1] performing a blo task. Since the manipulator does not provide any pose feedback, learns a controller directly in the task space using visual feedb Kinect-style depth camera. a typical problem of model-based methods: P ILCO a flexible probabilistic non-parametric Gaussian proc dynamics model and takes model uncertainty con into account during long-term planning. P ILCO lea controllers from scratch, i.e., with random initializa deep understanding of the system is required. In this p show how obstacle information provided by the dept can be incorporated into PILCO’s planning and lea avoid collisions even during training, and how knowl be efficiently transferred across related tasks. The paper is structured as follows. After discussin work, we describe the task to be solved, the low-cost used, and a basic tracking algorithm in Sec. III. Sec. marizes the PILCO framework and details how we inc collision avoidance into long-term planning under un Reinforcement Learning Method Comparison Policy Model Horizon Problem Space Algorithm Data Ball in Cup DMP None Short Robot-Space EM w/ Natural Gradient Autonomous Humanoid Walking RBF None Short Robot-Space LSTD w/ Natural Gradient Autonomous Helicopter Neural-Net Locally Weighted Regression Short Robot-Space Hill-Climbing w/ Monte-Carlo Eval. LfD PILCO Cart-Pole! RBF Gaussian Process Short Robot-Space CG/L-BFGS Autonomous PILCO BlockStacking Linear Gaussian Process Short Mixed CG/L-BFGS Autonomous Red-Room* Skill Tree Options + Lin. Reg. (Fourier Basis) Long Task-Space CST (trajectory segmentation) LfD Policy Model Horizon Problem Space Algorithm Data Long Task-Space A* Variant N/A Long Task-Space State Machine N/A Robotics Cart Pushing Jacobian PD-Control, Geometric (Projected Primitives 2D) PR2 Towels Scripted Grasp and Manipulation Prim. Implicit Cloth Physics PR2 Socks Scripted Grasp and Manipulation Prim. Implicit Cloth Physics HRP-2 NAMO ZMP Walking, Jacobian PD Grasp PR2 NAMO HRP-2 MacGyver Domain General Physics-Based Long Task-Space State Machine N/A 2D Rigid Body Physics Long Task-Space A* Variant N/A RRT Path Planner, PD Control 2/3D Geometric Simulation Long Mixed Hierarchical Backchaining N/A ZMP Walking, Scripted Primitives 3D Physical Simulation Long Mixed A* Variant N/A 7 Summary of existing work Methods either: a) short-horizon, robot-space, general models b) long-horizon, w/ known model or human demonstrations Proposal goal: Autonomous task-level RL without hand-engineered representations 8 Thesis statement Incorporating physics-based models and planning representations into the Reinforcement Learning framework reduces engineering overhead and increases robot performance in autonomous mobile manipulation tasks. Physics Engines Autonomous Reinforcement + = Mobile Manipulation Learning 9 Two Perspectives • From Robotics: reduce engineering • From RL: improve scalability No more grid worlds x 10 Reinfo rceme s otic nt Lea Rob rning Thesis intuition Representational Gap 11 Outline Physics-based Reinforcement Learning 1 Stochastic Planning for Physics-Based MDPs 1: Published (Scholz et al. 2014)! 2: Proposed (extends Scholz et al. 2010)! 3: Proposed (implements Levihn, Scholz, & Stilman 2012 & 2013)! 4: Proposed 12 2 Application to Golem-Krang Navigation Among Movable Obstacles 3 Cost-Based Furniture Arrangement 4 Outline Physics-based Reinforcement Learning Stochastic Planning for Physics-Based MDPs Application to Golem-Krang Navigation Among Movable Obstacles • Overview! • Model Space! • Data Acquisition! • Inference! Cost-Based Furniture Arrangement • Evaluation 13 Physics-Based Reinforcement Learning Model-Based RL Loop PyMC Idea: Use physics-engine as model representation Technical contribution: formalize model learning problem and develop inference method 14 PBRL Overview f (s, a) ! s0 Physics API is Learning Target f (s, a; ) ⇡ s0 Place prior on API parameters body {! ! ! ! ! 8s, a Estimate posterior from data 0 L( |h) = P (s |s, a; ) mesh! mass! inertia! joint.wheel! | | P ( |h) / P (h| )P ( ) P (s0 |s, a, ) ⇡(s) = arg maxa P (s0 |s, a, 15 PBRL: Model Parameters Three classes of parameters Rigid-body parameters Anisotropic friction constraints 16 Distance constraints ⇡(s) ms PBRL: Rigid body parameters (m, r, µc ) Mass For computing accelerations We assume uniform density f a= i α= Restitution (wx , wy , w✓ , µx , µy ) For computing perpendicular contact forces (ia , ib , afxp,=fanyr, bx , by ) Friction For computing tangential contact forces Proportional to normalforce (Coulomb friction) m := (m, r, µc , wx , wy , w✓ , µx , µy , ia , ib , ax , ay , bx , by ) I-1 τ ff = fn!c 17 v fn EXTRA MATH FOR PROPOSAL ⌧ ↵= =I I ⌧ PBRL: Anisotropic friction JONATHAN SCHOLZ heel 1 wheel constraint (w , w , w✓ , µ , µ ) (1) Compute relative velocity between anchors (wheel x and ground) in world y x frame y  stance ẋ w 1) J= ẏ + ⇥ ✓˙ ⇤ ✓  wx ⇥ R wy =  ẋ ẏ  0 ✓˙ Pose + ✓˙ 0 R  (i , i , a , a wx wy Coefficients ,b ,b ) y frame, x y if we b x to joint (2) rotate body b (wheel) velocity to body (orainconstraint principle Forframe placing The orthogonal ever include rotation) l params on a body b 2) J =R 1w friction components J := (m, r,!µ , w , w , w , µ , µ , i , i , (-0.3, ax , ay ,0,bx0, , by0.1, ) 0.8) i in body frame –yjust c scale x theycomponents x of the y velocity a b ✓ (3) compute friction force from (2) by their respective coefficients wx b 3) 4) ◆ A velocity constraint w θµx f= 0 0 µy b J !x w y (4) rotate this force (impulse) back to the world frame Object-frame w f = R(b f ) (5) add in to the force accumulator ... Constraint force: Wheel constraint as single expression:  ✓ µx 0 ẋ 5) fw = R R 1 0 µy ẏ +  0 ✓˙ ✓˙ 0 R  wx wy ◆ velocity scaled Note: 2-D, matrix the cross-product of a scalaranchor with avelocity vector isindefined as the equivalent 3-D R: bodyIn rotation world frame in object frame ersion about implied z-axis: 2 3 2 3  0 a a 18 6) [x] ⇥ =4 0 5⇥4 b 5 b wheel JONATHAN SCHOLZ PBRL: Distance constraint distance 2) q( ) = ln (P (h| )P ( )) n ⇣ ⌘2 X X 0 , i(p) = st f (st , at ; ˜ ) +(ia P b , ax , ay , bx , by ) 3) All params 4) ndom ⇡(s) ,= a) d (wx , wy , w✓ , µx , µy ) t=1 A position constraint p2 ✓ ◆ q( t ) Paccept ( t | t 1 ) = min 1, := (m, r, µ q(, wt 1,) w Body indices i Indicates two bodies to anchor the constraint c x PositionA ,w ,µ ,µ ,i ,i ,a y x y a ✓ Anchor offset on the first body b x PositionB ,a ,b ,b ) y x y Anchor offset on the second body (m, r, µc ) ⌧ ↵= =I I body a eel 1 ⌧ bx body b by Inverted pendulum l anchor b a x (wx , wy , w✓ , µx , µy ) ay tance tance constraint init anchor (ia , ib , aa x , ay , bx , by )  xainit a yinit xacur a ycur  l= + Constraint: tance constraint params i  + ax ay  ax ay   xacur a ycur  xainit a yinit  ax ay ax ay l=0 := (m, r, µc , wx , wy , w✓ , µx , µy , ia , ib , ax , ay , bx , by ) 19 ) Paccept ( t| t ✓ q( t ) 1 ) = min 1, q( t 1 ) ◆ PBRL: Overall model space Random ⇡(s) (s0 |s, a) 14 physical parameters per object 0) i := (m, r, µc , wx , wy , w✓ , µx , µy , ia , ib , ax , ay , bx , by ) Scenes have many objects φ2 φ3 φ4 φ5 φ1 What if some constraints aren’t necessary? We can negate effects numerically Extending will require inference over variable-sized representations 20 Φ In general thisoutp mod tence and parametrization of physical constraints, such (physical parameters) and in 354 regression model, this ies will have both heels. Like a standard Bayesian notes a deterministic physical simu therefore allow con el includes uncertainty in 355 the process input parameters ˜ , then the˜ core dynamics This can function be accom ysical parameters) and in 356 output noise. If f (·; ) defor represented com s a deterministic physical simulation parameterized by Standard Regression model parameterized by 357 st+1 (st , at ; prior=onfconstraints hen the core dynamicsproperties. function is: physical 358 cases where the eff n ˜ = ( ˜(7) st+1 = f (st359 , at ; ˜ ) +where ✏ )i=1 denotes a full of as a finite number BOOMDP task-space straints cannbe null n ˜ physical parameters for all objec ˜ 360 re = ( )i=1 denotes a full assignment to the relevant force & torque 440for all n objects sical parameters inzero-mean the ✏ isat+1 We Gaussian noise with var satisfy these 361 at scene, and m, r 441 2 meshes -mean Gaussian noise with variance . wheel constraint, µ x , µy 442 362 polygons ˜ For any domain, must contain w , ay a 443 … x , wby y , ax constraints the planning any domain, ˜444must contain of inertial pa363a core set⇡(s) bx , by ⇡(s) wheel is sufficient rameters for each object, as well ˜ w 445 ✓ eters for each object, as 364 well as zero or more conindoors, such iadefine , ib as 446 straints. Inertial parameters nts. Inertial parameters define rigid body behavior in mass 447 st+1 cause they have onl st 365 friction 1. Distr absence of interactions with otherthe objects, fand (s, a; ˜con)of interactionsTable 448 … absence with wheels can be expr 366 449 position & velocity physics new position & velocity nts define the space of possible interactions. straints define the space possible 450 can beof nullified note a matrixby of 367 451 noise he general case inertia requires 10 parameters; 1 for distance constraint 452 In the general case inertia require 2.368 Graphicalof model the online model learning object’s mass, 3453for Figure the location thedepicting center of mass, 21 problem, and the assumptions of PBRL, in terms of states sIn andsummary, our dy PBRL: Graphical Model Example Episodes PBRL: Gathering Data Unconstrained Touch or Grasp Object Locked Caster Apply forces Track object Data = State Trajectory & Applied forces 22 Chair Collision Compute equivalent force in body-frame 0 L( |h) = P (s |s, a; ) PBRL: Prior & Likelihood | | Bayesian Inference: Φ is a generative model of h P ( |h) / P (h| )P ( ) 0 P (s |s, a, ) Likelihood: should prefer accurate predictions Prior: Posterior: 0 0 ⇡(s) = arg max P (s |s, a, )V (s ) a should support only legal values will reflect robot’s updated beliefs 2 s1 after observing h 23 6 s2 6 h=6 . wheel constraint dy b (wheel) velocity to body frame (or in principle to joint frame, if we 529 (1) Compute relative velocity between anchors (wheel and ground) in world frame ude rotation) n Y 530 P (h| , ) = 2 3 531 ✓  ◆ ẋ 0 t=1 frame – just scale the componentswxof the velocity friction force in body w 532 4 ẏ 5 + 4 0 5 ⇥ R ! J = ncoefficients 0 wy2 by their respective Y ˜ ˙ ˙ 1 (s f (s , a ; )) 533 t t ( ✓) t ✓ p exp = 2  2⇡ priors 534 Univariate 2 (2) rotate body t=1 bb (wheel) velocity to body frame (or in principle to joint frame, if we µx 0 b BOOMDP f = by support J 535 ever includechosen rotation) (11) 0 µy 536 Distribution 495 b 1w J = R J isEq. force back to the world 2 r that Log-Normal(µ, ) frame 11(impulse) tellsm,us the likelihood for proposed model pa496 537 2 µ x , µy Truncated-Normal(µ, , 0, 1) 497 (3) compute friction force in body 2 frame – just on scale the components of the velocity rameters on a Gaussian the prexy centered xy 538 wxare , wy , aevaluated , a Truncated-Normal(µ, , a , a ) 498 x y min max w b 2 xy f = R( f ) from (2) by their respective coefficients bxstate , by Truncated-Normal(µ, , bmin , bxy ) 499 dicted next for a generative physics world parametermax (s) 539 w Von-Mises(µ, ) 500 ✓ the force ... ized by ˜accumulator (i.e., known geometry and proposed ia , ib withCategorical(p) 540 501dynamµ 0 x b b int as Due singletoexpression: 502 is obics). Gaussian noise, the log-likelihood for f = J 541 Table 1. Distributions for each physical parameter type 02 3 2 3 0 µy 503 1  by summing 2squared ✓  the504 ◆ ẋ distances 0 between tained observed 542 L1 @ penalty on4 sample µxSimple 0 force wxframeA 4 5 5 (4) rotate this (impulse) back to the world ẏ 0 505 = R and Robserved transitions: + each state ⇥ Rand action: note athe of value state for 0 matrix µy predicted w 543 y trajectories 2 (✓) 3 ✓˙ 506 ˙ 0 s 1 a1 s 1 507 544 n a ⇣ s0 7 w ⌘ earning 6 sX b 2 2 2 2 6 f = R((9)f˜) 508 07 s s and h=6 . 7 . . 545 ln P (h| , ) = 4 .. .. (s.. t 5 f (st , at ; ) 509 (12) (dy... 510 l.(5) ⇡(·) add in to the force accumulator sTt=1 aT s0T 546 511 on. We q( ) = ln (Pas (h|single )P ( expression: )) Wheel constraint 547 5123 should h⇣to update the thein agent’s beliefs about 0 2 3 2 1 n Along We with theuse prior defined Table 1, this provides the ⌘ X X 2 513  ✓  ◆ {wx, 0 and the noise term0 . In a Bayesian approachẋthis is 548 ˜ = s f (s , a ; ) + P (p) Posterior samples µ 0 w t history 5145 necessary components for at Metropolis sampler for 10. x w as the model expressed A 0 Eq. f =t=1 R t xposterior given R 1 @4h:p2ẏ 5 + 4 515 ⇥ R 549 0 µby wy y generated MCMC PBRL: Inference PJ(s=0t |R ,21w, Jst , 3 at ) b ˙ ( ✓) P (h| , )P✓( )P ( ) ◆ P ( , |h) = R (10) oblem, q( ) t P (h| , )P ( )P ( ) MCMC Paccept ( t | t 1 ), = min 1, and q( t 1 ) tion 4. vision ect dyaramend the -series where = { 1 , 2 , . . . , k } is the collection of hidden parameters for the kq( objects ) =in1 the ln domain, (P (h| and)P is( a scalar. )) This expression is obtained from Bayes’ rule, and defines n for ⇣ a PBRL agent. the abstract model inference problem X 0 = s f (st , at ; t The prior P ( ) can be used to encode any prior knowledge about the parameters, and is not assumed t=1 to be of any par- 516 ✓˙ 517 518 519 520 521 ⌘522 X 2 523 ˜ ) 524+ P (p) 525 p224 wy, 0, 0.1, 0.8} ycur (16) 15) ay xacur a ycur ycur  aay  ax (m, r,xµcur, w , w a,xw , µ , µ , i , i , a , a , b , b ) := i x y + ✓ l x= 0 y a b x y x y ac ay ycur ay Defining Physics-Based MDPs Latex Math Defining All params PBRL MDPs i := (m, r, µc , wx , wy , w✓ , µx , µy , ia , ib , ax , ay , bx , by ) • states 16) i := (m, r, µc , wx , wy , w✓ , µx , µy , ia , ib , ax , ay , bx , by ) h i| MDPs ˙ Defining PBRL MDPs (17) s = x1 , y1 , ✓1 , ẋ1 , ẏ1 , ✓˙1 , ..., xn , yn , ✓n , ẋShopping n , ẏn , ✓n Cart MDP States ⇤ • h states Jonathan Scholz i| h i| PROPOSAL EXTRA MATH FOR ˙ ˙ s = (18) x1 , y1 , ✓1 , ẋs 1=, ẏ1x,1✓, y11,,..., , ẋxnn ,, x 17) ✓1 , x ẋ1n,,ẏy1 n , ✓,˙1✓, n..., yẏnin,,,✓y✓nin ,,ẋẋni,,ẏẏni,,✓˙✓n˙i 2 R • actions April 16, 2014 EXTRA MATH FOR PROPOSAL 3 18) , ẋi , ẏi , ✓˙i 2 R ✓i 2 [ ⇡, ⇡] (19)(20) xi , yi , ẋi , ẏi , ✓˙xi i ,2yiR a = [f , f , ⌧, i]| x EXTRA MATH FOR PROPOSAL 19) (20) (21) • actions ✓i 2 [ ⇡, ⇡] Actions a = [fx , fy , ⌧, i]f|x , fy , ⌧ 2 R f| (s, a) ! s0 a = [fx , fy , ⌧, i] • transitions f x , fy , ⌧ 2 R P ( (22) • transitions (23) i2N (23) • rewards (24) 3 ✓i 2 [ ⇡, ⇡] (21) (22) y • rewards f x , fy , ⌧ 2 R i2N Apartment MDP 2 NP (h| )P ( ) |h)i/ Transitions P (s0 |s, a, ) 0 0 Rewards ⇡(s) = arg maxa P (s |s, a, )V (s ) task-dependent (next-section) (24) 25 2 s1 a1 s01 3 PBRL Results: Online Performance PBRL out-performs regression methods −100 Reward −300 −100 TRUE PBRL OOLWR LWR 100 200 300 400 500 0 0 TRUE PBRL OOLWR LWR −300 Reward −100 Step TRUE PBRL OOLWR LWR −40 −120 ≈ ≈≈ 200 ≈ 400 ≈ 600 800 1000 PBR LWR LR Step Lo Ki ge ir ir ir ha C ir ha C ch ha C en un h tc k es ha C t d ou Be C ar e bl Ta e bl Ta ng C ki en h tc oc D R Ki 26 k Step es Step 800 1000 e fe 600 800 1000 −80 Reward Samples Required ≈ D 400 600 TRUE PBRL OOLWR 0 140 120 100 80 60 40 20 0 of 200 400 Step C 0 200 0 0 −300 Reward Reward 0 0 Shopping-Cart Task: Gaussian Noise (σ=1.25) Summary of proposed work Goal: Scalable RL framework for object manipulation domains Contribution: Physics-based model representation and scalable inference method Result: Simulation results suggest favorable performance vs. regression methods 27 Outline Physics-based Reinforcement Learning Stochastic Planning for Physics-Based MDPs Application to Golem-Krang Navigation Among Movable Obstacles • Handling uncertainty! • Handling sequential rewards! • Proposed algorithm Cost-Based Furniture Arrangement 28 (8) t=1 Stochastic Planning for Physics-Based MDPs (9) So far: stochastic (9) physics model Random ⇡(s) 0 P (s |s, a) Paccept ( Next: method for planning with it Random ⇡(s) 0 P (s |s, a) Draws on: Melchior & Simmons (2007) Scholz & Stilman (2010) 29 t| Planning in RL Desirable properties: x Optimality Generality open problem RRT: a spatial search bias Traditional Approaches: Dynamic Programming Monte-Carlo Tree Search too many states weak search bias (value-based) x x 30 Two challenges Handling model uncertainty Handling sequential rewards 31 Model Uncertainty Particle RRT for Path Planning with Uncertainty Simple approach: Monte-Carlo Nik A. Melchior and Reid Simmons Want: Repeat:sample model & (re)plan single tree over entire space Problem: inefficient act— This paper describes a new extension to the –exploring Random Tree (RRT) path planning algoThe Particle RRT algorithm explicitly considers uncerits domain, similar to the operation of a particle filter. tension to the search tree is treated as a stochastic and is simulated multiple times. The behavior of the n be characterized based on the specified uncertainty environment, and guarantees can be made as to the ance under this uncertainty. Extensions to the search d therefore entire paths, may be chosen based on the d probability of successful execution. The benefit of this m is demonstrated in the simulation of a rover operating h terrain with unknown coefficients of friction. I. I NTRODUCTION Rapidly–exploring Random Tree (RRT) algorithm [1] pular technique for path planning with kinodynamic nts. In simple terms, RRT builds a search tree of le states by attempting to apply random actions at reachable states. Unless the action causes the robot contact with an obstacle or violate some dynamics Fig. 1. 32 A pRRT tree with several particles at each node occlusions are possible. In addition, characteristics of the Particle-RRT Particle RRT for Path Planning with Uncertainty Melchior Nik A. Melchior and Reid Simmons (a) Simulation RRT with model uncertainty 1. Sample model beliefs 2. Step particles 3. Cluster particles together & Simmons (2007) (b) Plot Fig. 4. Trajectories Abstract— This paper describes a new extension to the with qualitatively different endpoints Rapidly–exploring Random Tree (RRT) path planning algorithm. The Particle RRT algorithm explicitly considers uncerof points. tainty in its domain, similar to the operation of a particle filter. Each extension to the search tree is treated as a stochastic D. Node and Path Probability process and is simulated multiple times. The behavior of the The particles at each node provide an estimate of the robot can be characterized based on the specified uncertainty distribution of values of F that allow the robot to reach that in the environment, and guarantees can be made as to the state. By combining this with a prior distribution over F, the performance under this uncertainty. Extensions to the search probability of reaching any particular node, or indeed the tree, and therefore entire paths, may be chosen based on the probability of following an entire path, may be calculated. expected probability of successful execution. The benefit of this In this way, the growth of the tree structure can be biased algorithm is demonstrated in the simulation of a rover operating toward generating paths that are more likely to be followed in rough terrain with unknown coefficients of friction. I. I NTRODUCTION The Rapidly–exploring Random Tree (RRT) algorithm [1] by the robot. To bias the search, we adapt Urmson’s hRRT technique for heuristically biasing RRT growth [9]. The heuristic modifies the SELECT function the RRT algorithm. Fig. 1. AEXTENSION pRRT tree with severalofparticles at each node Rather than simply accepting a random point, p, in the state space and its nearest neighbor, q, in the tree, the hRRT technique chooses to extend proportional to the quality of the node q. In our case, the quality of q is defined as: discardofsmall Fig. 5. Dendrogram produced by the hierarchical clustering tree algorithm a popular for path planning with kinodynamic ofintuition: the probability reaching the isgoal fromtechnique this node. This constraints. In ∗ simple terms, RRT builds a search tree of numerical differences would produce a heuristic in the style of A [10] that would reachable states by attempting to apply random actions In this case, the final agglomeration combines particleat 1 with estimate the probability of successfully travelling from start known–reachable states. Unless the action causes the robot a cluster containing all other particles. qprob − m make contactany with estimate an agglomeration obstacle of or violate Although continuessome until dynamics all particles areocclusions are possible. qqualityIn = addition, characteristics (2) of the to goal through any particular node.to However, m linked, we must determine how much aggregation is actuallyterrain such as the coefficient1.0of− friction constraint, the action is considered successful, and the resultcan be estimated they used hierarchicalthe probability–to–go must be optimistic, meaning it each cannot appropriate set of particles. WeAlocate the link in where qprob is the probability of reaching q from the root ing state is added to the for tree of reachable states. simulator only roughly, but these characteristics can have substantial the tree which combines the most dissimilar subtrees by of the tree, and m is the minimum probability of all leaf agglomerativethe clustering underestimate possibility of completing the path from is generally treated as a black box that determines the result rover’stree. behavior, when traversing searching for the largest difference in distances betweenimpact nodeson of the the search A randomparticularly value, r is drawn from a of an action given the robot’s initial state. This allows the rough terrain. Path planning can benefit from explicitly successive any state. Without exploring all options from this agglomerations. state, the This link and all following links uniform distribution between 0 and 1. If qquality > r, the pair algorithm to be to domains where complex system are applied disregarded, while any links made previously are used toconsidering of points pthe and uncertainty q are accepted in andthe an extension is attempted. terrain to generate paths only admissible heuristic is 1.0,dynamics whichmake does notthecontrol provide determine particledifficult clustering. cutoff is represented a new pairofofsuccess. points are chosen. Alternately, we analytic orThis impossible. RRT withOtherwise, high likelihood by the dashed horizontal line in figure 5. In thisvehicase, the might keep p and try the next nearest neighbor in the tree [9]. has been successfully applied to wheeled and legged any additional information. Future work may investigate particles have been split into five clusters. The random valueinr is[2] usedhas to promote the useaof changing extensions enviRelated work considered Example: slippery as well as In underwater robotsthat andthis aircraft. from high quality nodes without excluding the possibility of tests, we found of approach performs betterronment whetherterrain non–admissible heuristicscles, improve the runtime in the form of moving obstacles by predicting However, this decision as to thek–means successalgorithm of an [8]. extending lower quality nodes should the algorithm become than binary the Gaussian clustering of the motion of these obstacles. If the obstacles move in the algorithm without significantlyaction affecting path quality. can limit the application of tree thisallows algorithm in twofewerthe stuck in a situation where reaching the goal at all is unlikely. The hierarchical clustering us to calculate Uncertainty about friction manner, thequality path is simply replanned. since not we do not for needranking to directly estimate thean unexpected The effectiveness of the heuristic might be im- One important ways. First, it does allow or scoring In practice, we found that the probability ofparameters reaching nodes approach for planning in uncertain terrain [3] ensures means and covariance matrices cluster. This is proved if we could calculate not only the probability of that multiple actions which may succeed fromof aeach given initial especially important since we are clustering only a handfulthe reaching fromproduce the root ofthe the tree, butresult also an for estimate planneda node actions same the entire the tree drops quickly with path Thishave causes thecosts state.length. Actions often associated (such as the energy Nodes reflect ofqualitative or time from requirednodes for execution), and planning may produce range of expected values of the unknown conditions. Another algorithm to favor making extensions near the paths from start to goal with varying cumulative application [4] builds a forest of search trees, using a differences inroot, outcome even when reasonably likelyseveral nodes exist closer to the costs. High–cost extensions might therefore be avoided in different value for each tree. goal. In order to encourage more extensions from hopes of finding a better path,nodes but these extensions should not This paper presents an extension to the RRT algorithm be ignored completely, because a better path may not exist. that directly addresses the issue of plans under uncertainty far from the root, the path probability qprob is normalized Most RRT implementations at least consider path length in with a novel approach. The Particle RRT (pRRT) algorithm √ n using the path length. We substitute the quality prob in distance terms of qEuclidean travelled, but other notions of facilitates the creation of an RRT in an uncertain environment pathnode cost should incorporated as well. calculation, where n is the depth of q in bethe tree. This by propagating that uncertainty to the planned path. Each The second important use for a fuzzy notion of action extension to the search tree is attempted several times under improves the runtime since node success quality does not drop so 6. A tree built by pRRT. Spheres mark the planned path. is when knowledge of the obstacles Fig. or dynamic different likely conditions. Nodes in the search tree are quickly as the distance from theproperties root increases. of the environment cannot be precisely known. created by clustering the results from these simulations. The 33 However, The application presented in this paper is a rover navigating likelihood of successfully executing each action is quantified the effect of averaging the path probability over the length Particle-RRT limitations How to handle sequential rewards? Assumes known goal Considered by Task-Space RRT* *Scholz & Stilman, 2010 34 Task-Space RRT Basic idea main loop: sample model to search space sometimes: run gradient optimizer from leaf nodes Result Finds modes in cost function, and actions required to reach them 35 Task-Space RRT limitations No node clustering Only optimizes immediate reward problem: can’t pick best expected action 36 Jailette et al. 2008 Proposed work Particle RRT for Path Planning with Uncertainty Nik A. Melchior and Reid Simmons Abstract— This paper describes a new extension to the Rapidly–exploring Random Tree (RRT) path planning algorithm. The Particle RRT algorithm explicitly considers uncertainty in its domain, similar to the operation of a particle filter. Each extension to the search tree is treated as a stochastic process and is simulated multiple times. The behavior of the robot can be characterized based on the specified uncertainty in the environment, and guarantees can be made as to the performance under this uncertainty. Extensions to the search tree, and therefore entire paths, may be chosen based on the expected probability of successful execution. The benefit of this algorithm is demonstrated in the simulation of a rover operating in rough terrain with unknown coefficients of friction. Particle clustering I. I NTRODUCTION The Rapidly–exploring Random Tree (RRT) algorithm [1] is a popular technique for path planning with kinodynamic constraints. In simple terms, RRT builds a search tree of reachable states by attempting to apply random actions at known–reachable states. Unless the action causes the robot to make contact with an obstacle or violate some dynamics constraint, the action is considered successful, and the resulting state is added to the tree of reachable states. A simulator is generally treated as a black box that determines the result of an action given the robot’s initial state. This allows the algorithm to be applied to domains where complex system dynamics make analytic control difficult or impossible. RRT has been successfully applied to wheeled and legged vehicles, as well as underwater robots and aircraft. However, this binary decision as to the success of an action can limit the application of this algorithm in two important ways. First, it does not allow for ranking or scoring multiple actions which may succeed from a given initial state. Actions often have associated costs (such as the energy or time required for execution), and planning may produce several paths from start to goal with varying cumulative costs. High–cost extensions might therefore be avoided in hopes of finding a better path, but these extensions should not be ignored completely, because a better path may not exist. Most RRT implementations at least consider path length in terms of Euclidean distance travelled, but other notions of path cost should be incorporated as well. The second important use for a fuzzy notion of action success is when knowledge of the obstacles or dynamic properties of the environment cannot be precisely known. The application presented in this paper is a rover navigating through unknown terrain. Stereo vision is used to build a model of the terrain in the immediate area, but the accuracy of this model decreases with distance from the rover, and melchior@cmu.edu, reids@cs.cmu.edu. The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213-3890. Fig. 1. A pRRT tree with several particles at each node LeafOptimization occlusions are possible. In addition, characteristics of the terrain such as the coefficient of friction can be estimated only roughly, but these characteristics can have substantial impact on the rover’s behavior, particularly when traversing rough terrain. Path planning can benefit from explicitly considering the uncertainty in the terrain to generate paths with high likelihood of success. Related work in [2] has considered a changing environment in the form of moving obstacles by predicting the motion of these obstacles. If the obstacles move in an unexpected manner, the path is simply replanned. One approach for planning in uncertain terrain [3] ensures that the planned actions produce the same result for the entire range of expected values of the unknown conditions. Another application [4] builds a forest of search trees, using a different value for each tree. + BellmanValues = Value-based Task-Space Particle-RRT* This paper presents an extension to the RRT algorithm that directly addresses the issue of plans under uncertainty with a novel approach. The Particle RRT (pRRT) algorithm facilitates the creation of an RRT in an uncertain environment by propagating that uncertainty to the planned path. Each extension to the search tree is attempted several times under different likely conditions. Nodes in the search tree are created by clustering the results from these simulations. The likelihood of successfully executing each action is quantified so that the probability of following entire paths may be determined. An example search tree is shown in figure 1. Experimental results from simulations are presented, showing that this approach results in paths that are more robust to uncertainty. 37 *tsp-RRT? ne the methods 4.4 above Proposed Work into a single coherent algorithm: the value-based particleRRT This a straightforward combination thecoherent two methods discussed with one We is propose to combine the above methods into a of single algorithm: the value-basedabove, particleRRT with leaf optimization. Thisextension is a straightforward combination of the two methods discussed above, with onewhich particle RRT, the node heuristic was based only on path probabilities, notable difference. In particle RRT, the node extension heuristic was based only on path probabilities, which he model probability across particles inparticles a node, a path theroot. root. Adopting came from averaging the model probability across in aalong node, along a pathfrom from the Adopting notation q to denote particles a node containing q1n,,q2where , . . . , qn , where is the state of of particle qi and e a the node containing q1 , qparticles sqi issqthe state particle qiFand Fqi q 2, . . . , q the model sample, the node values in the particle-RRT are defined as follows: node values in the particle-RRT are defined as follows: vq = Â P(Fq ) (14) Eq.pRRT 15, obtained directly from the bellman-recursion, appears superficially different from Eq. 14. Hownode vq = Â P(Fqi )q 2q (14) ever, However, for “values” cases in inour which the reward function is not dependent on the action, as with task-space formalism case we are attempting to solve an MDP, and therefore are searching a more general qi 2q The value of values i Introdu i i i in reward [25], and in which model deterministic if the model[27]. parameters themselves are landscape. As the a result, thetransitions node value are should be based on(even the bellman-value To achieve this Eq. 15 simplifies as follows: we to the sparse-sampling which provides a recursive expression for computinganode values: asenot), weturn are attempting to literature, solve an MDP, and therefore are searching more general " " ## a result, the node value should be based0 on the bellman-value [27]. To achieve this 0 a) max Qd d 1 (s 1 0 ,0 a0 ) 0 MCTS node vq = max R(s, a) + g P(s |s, (16) vq = max R(s, a) + g Â P(s |s, a) max QSS (s , a ) (15) Â 0 a a a a 0 ampling literature, a srecursive expression for computing node values: s values which provides 2 3 Z where Qd (s, a) " refers to the expected value of taking action a in state s, # and following the optimal policy 0 0 0 0 d 4R(s) + g Â = max P(s |s, a, f )P(f )df max Q 1 for d 1 subsequent steps (soa QSS0 (s, a) = R(s, 0 a)). d 1 0 0 a0 SS0 vq = max R(s, a) + g Â P(s |s, max QSS0 (s , a ) s0 f a) 2F 0 1 (s0 , a0 )5 (17) (15) a 0 s Considers cart beliefs tsp-RRT node= 1 R(sq ) + g P(F 0 15 0 ) max v , q f (s , a ; F ) (18) qi qi qi Â i Â qi q0 q n and beer quality qi 2q valuesvalue ofqi 2qtaking action to the expected a in state s, and following the optimal policy a Eq. Q 181is (s, obtained by expanding the model transition as a marginal over the unknown model paramteps (so a) = R(s, a)). 0 SS eters and approximating it with a finite sum over particles ( f (·) denotes the PBRL model function from Sec. 3.2). This yields a generalization of the approach taken in [21], in which we replace model averTuesday,the September 6, 2011 aging (Eq. 14) with a particle-based bellman backup. Note that this contribution achieves what the authors Reflects probability suggest regarding a search heuristic:15 Why useful? look-ahead future reward “The effectiveness of the quality heuristic might be improved if we could calculate not only the probability of reaching a node from the root of the tree, but also an estimate of the probability of reaching the goal from this node.” Because reward functions can be arbitrary, this formulation does not guarantee an admissible heuristic. 38 However, for many common cases, such as simple distance-based rewards, we expect this approach to AND Summary of proposed work Goal: Tractable value-based planner for large continuous spaces with stochastic PBRL dynamics Expected Result: Successful online planning for PBRL problems for up to 5-10 objects 39 Outline Physics-based Reinforcement Learning Stochastic Planning for Physics-Based MDPs Application to Golem-Krang Navigation Among Movable Obstacles • Target platform! • NAMO MDP formulation! • Proposed task setting Cost-Based Furniture Arrangement 40 Platform: Golem-Krang Software Architecture Robot: • dynamically-stable humanoid • 7-DOF harmonic drive arms • 6-DOF wrist force/torque sensors *balancing-related modules omitted 41 The NAMO MDP States: free-space regions Actions: connecting regions Rewards: goal region (sparse) Rewards propagate through abstract MDP Transition uncertainty grounded in PBRL beliefs Levihn, Scholz, & Stilman 2012,2013 42 Tag-based Vision Camera Rig Simulation Techway Facility Task Setting 43 Proposed work Goal: Implement NAMO MDP on Golem-Krang Expected Result: First NAMO system to handle non-holonomic objects and dynamics uncertainty 44 Outline Physics-based Reinforcement Learning Stochastic Planning for Physics-Based MDPs Application to Golem-Krang Navigation Among Movable Obstacles • Cost function for meeting configuration! • Proposed task setting Cost-Based Furniture Arrangement 45 Cost-Based Furniture Arrangement Basic idea: Use robot to optimize task parameters Follows [Scholz et al. 2010], with furniture and non-holonomic constraints 46 New cost terms for office applications closed-loop base navigation system be seen in Fig. 11(c). a) Meeting TaskinOptimum (b) can Presentation Task Optimum (c)specified Target Environment Learning. Thus addition to the proposed NAMO task, which assumes a navigation goal from One of the central goals of the proposed research is to increase the autonomy of robots using Reinforcement an end user, we propose to implement a mobile manipulation task based on more abstract criteria, in the Learning. Thus in addition the proposed NAMOArrangement task, which assumes a navigation goal specified from 5.2[25]. Experiment 2: to Cost-Based Furniture tradition of Specifically, we will use the cost-based formulation of task goals fromand [25] the to define a environme e an 12:end Maximal configurations two different cost functions, user, we propose to implement aunder mobile manipulation task based on more abstract criteria, target in the tasktradition in which robot mustgoals arrange furniture in a is room first for presentation, andfrom then[25] for to a meeting. One the central ofwe thethe proposed research to increase theaautonomy robots using Reinforcement of aof [25]. Specifically, will use the cost-based formulation of taskof goals define a iment 2. Thus in taken addition the which NAMOto task, which assumes a we navigation from By task contrast to thea method in to [25], resorted action primitives, willand usegoal thespecified physics-based in Learning. which robot must arrange the proposed furniture in a room first for a presentation, then for a meeting. an end user, we propose to implement a3 mobile manipulation based on more abstract criteria, in the model space and described Chap. and Chap. 4. action task By contrast to planners the method taken inin[25], which resorted to primitives, we will use the physics-based traditionthe of x, [25]. Specifically, we willofuse the cost-based formulation of task goalsof from [25] to define a If c denotes y position the center a and circle of radius r (e.g. the position a table with radius model space and planners described in Chap. 3 Chap. 4. task in which a robot must arrange thebetween furniture in the a room first for a presentation, and then for a meeting. grt and the angle circle and each object: < r),Iffinite-differencing and (p , q ) the position and orientation of object i, then the following terms can bewith defined to i i cBy denotes positiontaken the in center a circle of radius r (e.g. the position table radius contrastthe to x, they method [25], of which resorted to action primitives, we will of useathe physics-based quantify the criteria of the meeting task: rt < r), and (p theplanners position and orientation i , qi )and model space described in Chap. 3 of andobject Chap. i, 4. then the following terms can be defined to tin = atan2(p + cy ,r p(e.g. cxposition ) iy radius ix +the quantify theIf criteria of the the x,meeting task: c denotes y position the center of a circle of of a table with radius 2 i, then the following terms can be defined to 1. linear rt < r), anddistance (pi , qi ) the position of object ⇤orientation clinearand =tÂ (kp ck r) (21) i n sort(t) = quantify criteria of the meeting to the circle clineartask: =i=1Â✓(kp⇤i ck⇤ r)n2 (21) ◆2 n si=1 = [tnpii cti 1 ]cos(q i=1 i) 2 ✓  ◆2 cangular c=linear · (22) Â n= Â (kp(a) ck r) (21) i Meeting Task Optimum (b) Presentation Task Optimum (c) Target Environment sin(q ) kpipi ckc cos(qi i ) 2. angular distance c i=1 i=1 · (22) angular = Â ✓ ◆ ✓  ◆ sin(q ) ckonMaximal 2the configurations Figure two different cost functions, and that the target environment for 2 under i e must to also specify environment constraints poses of these bodies, such the leaf-opt i 12: radial orientation i=1 nkp 2p p c cos(q ) i i 2. ◆· ✓ (22) cspacingcangular = s · s= Â nExperiment (23) 2 sin(q ) ers feasible optima. For example, Eq. 28 limits all objects to the room dimensions: 2p kp ck i n + 1 i cspacing = s · s i=1sorting n and (23) ✓ ◆ finite-differencing the angle between the circle and each object: 2 n + 2p 1 (24) 3. penalty for noncspacing = s · s n (23) n i + cy , pi + cx ) ti = atan2(p (25) (24) n + 1 (x  x  x , y  y  y ) homogeneous spacing min i max min i max ⇤ i=1 = sort(t) (26) These terms capture the error of projecting the pose of each object onto a t circle centered at c, facing in(24) s = [ti⇤ ti⇤ 1 ]ni=1 (27) 3 These terms capture the error of projecting the pose of each object onto a circle centered at c, facing inwards. The vector s in Eq. 24 represents the angular spacing of objects about the circle, and is obtained by 3These ven weights a, bcapture importance of theonto subtasks, overall We mustofalso specify environment constraints on thethe poses of bodies,reward such that the function leaf-optimizer terms of relative projecting the pose each a circle centered atthese facing in-by wards. The vector s, g in indicating Eq.the 24error represents the angular spacing of object objects about the circle, and isc,obtained Cost functions for office configuration s: angular spacing term Meeting configuration y x 3 The 2 in Eq. 24 is to offset recovers feasible optima. For example, Eq. 28 limits allfor objects to the room dimensions: 3the purpose of term n(2p/(n + 1)) the spacing cost by the maximal value n objects, such that wards. The vector s in Eq. 24 represents the angular spacing of objects about the circle, and is obtained bythe ng task be defined as follows: 3 Thecan 2 of in theEq. term 1))optimal in Eq.configuration, 24 is to offset the spacing the(xmaximal final reward purpose computed 29n(2p/(n is 0 for + any such as the cost one by depicted in Fig.value 12(a).for n objects,n such that the min  xi  xmax , ymin  yi  ymax )i=1 circle position The purpose the29 term 1))2 in Eq. 24 is to offsetsuch the spacing costdepicted by the maximal final reward 3computed inofEq. is 0n(2p/(n for any+optimal configuration, as the one in Fig.value 12(a).for n objects, such that the c: r: circle radius (28) Given weights , g indicating relative importance function for the final reward computed in Eq. 29 is 0 for any optimal configuration, such a, asbthe one depicted in Fig. 12(a). of the subtasks, the overall reward pi: object position Overall: Rmeeting = (ameeting clinear +be bdefined cangular task can as follows:+ gcspacing ) Rmeeting = (a clinear + b cangular + gcspacing ) θi: object orientation n: number of objects (29) rid reward function is defined similarly,The but save space. grid omitted reward function to is defined similarly, but omitted to save space. An important property of this problem is that there are many possible goal configurations – the meeting 47 21 n important property of this problem is that there areformany possible configurations the m task is equally satisfied any configuration of chairs ingoal a circular configuration, regardless of– particular 21 Meeting configuration 48 Proposed work Goal: Successful room configuration for 2-5 objects Meeting Task Optimum Expected Result: Implement cost-based furniture method on Golem-Krang Presentation Task Optimum 49 Target Environment Summary of proposal 1. Tractable online model learning Proposed: Bayesian physics-based approach Demonstrated: PBRL offers superior performance vs. regression methods (ICML 2014) 2. Tractable planning in task-space Demonstrated: Task-space RRT planner with leaf optimization Proposed: Stochastic Value-based RRT planner with leaf optimization (Humanoids 2010) 3. Implementation on Golem-Krang Demonstrated: Implementation of hierarchical NAMOMDP formulation in simulation Proposed: Implementation of NAMO-MDP on Krang Proposed: Implementation of Cost-based furniture arrangement problem on Krang 50 (WAFR 2012, ICRA 2013) Timeline Date Work Progress 2010 Task-space RRT published (Humanoids ’10) 2012-2013 Hierarchical NAMO-MDP published (WAFR ’12,ICRA ’13) 2014 Core PBRL Model published (ICML ’14) Sept. 2014 tsp-RRT will submit to ICRA Jan. 2015 Krang Implementation will submit to RSS Feb.-Apr. 2015 Thesis writing May 2015 Thesis defense 51 Special thanks to my committee, and Mike Stilman 52 Backup Slides 53 Eq. 15, obtained directly from the bellman-recursion, appears superficially different from Eq. 14. However, for cases in which the reward function is not dependent on the action, as with task-space formalism in [25], and in which the model transitions are deterministic (even if the model parameters themselves are not), Eq. 15 simplifies as follows: " # Bellman values in MCTS+RRT d vq = max R(s, a) + g Â P(s0 |s, a) max Q 0 a 2 s0 = max 4R(s) + g Â a Z s0 f 2F a 1 (s0 , a0 ) d P(s0 |s, a, f )P(f )df max Q 0 1 = Â R(sqi ) + g Â P(Fqi ) max vq0 , 0 n qi 2q q qi 2q a q0 (16) 1 3 (s0 , a0 )5 f (sqi , aqi ; Fqi ) (17) (18) Eq. 18 is obtained by expanding the model transition as a marginal over the unknown model parameters and approximating it with a finite sum over particles ( f (·) denotes the PBRL model function from Sec. 3.2). This yields a generalization of the approach taken in [21], in which we replace the model aver• 14) Expand stochastic asthismarginal over what the authors aging (Eq. with a particle-based bellmantransition backup. Note that contribution achieves suggest regarding a search heuristic: continuous model parameters “The effectiveness of the quality heuristic might be improved if we could calculate not only the probability of reaching a node from the root of the tree, but also an estimate of the probability Approximate integral with particles of•reaching the goal from this node.” Because reward functions can be arbitrary, this formulation does not guarantee an admissible heuristic. However, for many common cases, such as simple distance-based rewards, we expect this approach to yield notable improvements in runtime. After validating this approach in simulation, we will illustrate its 54 applicability to several mobile manipulation problems, including one based only on abstract criteria about 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 240 246of 479 position an anisotropic constraint. While position offriction an anisotropic friction const notion was central to the OO-MDP (Diuk et al. two-dimensions, which can be represented 480 the is 241 global 247 maximum the unique, likelihood surface can global maximum is unique, likelihoo ˙ 2008), and is explored in greater depth here. In parameters s = {x, y, ✓, ẋ, ẏ, ✓}, with {x, y PBRL: Model Format 481 P ( |h) / P (h| )P ( ) 242 248 be quite convoluted. be quite convoluted. physics-based domains, fully describing object state sponding to 2D position and orientation, an 243 249 482 0 0 both pose and 0 0 velocity parameters requires (the g (s , a ) ⇡ s g (s , a ) ! s + ✏, ✏ ⇠ N( g244(si , ai ) ⇡ 483 stheir g (s , a ) ! s + ✏, ✏ ⇠ N (µ, ⌃) i i i i i derivatives. i i i i 250 0 so-called “phase-space” representation of the system) Want: 245 484 P (s |s, a, ) d ⇠ U (k) do ⇠ U (k) 251 o ofingenerality, Actions this context correspond to in Without loss we consider objects 246 485 252 and torques used to move objects aro two-dimensions, which can be represented with six ⌦, w , w ⇠ U (0, 1) ⌦, w , w ⇠ U (0, 1) 247 µx µy 486 µx µy ˙ with 253 parameters s = {x, y, ✓, ẋ, ẏ, ✓}, {x, y, ✓} corre For: can be represented with three additiona 0 248 ⇡(s) = arg max P (s |s, a, a ˙ d , d , d , d , w , w , ⇠ U (l, u) dx , dy254 , sponding dxo , d487 , w , w , ⇠ U (l, u) x y x x x y x x y to 2D position and orientation, and { ẋ, ẏ, ✓} o o o ters a = {fx , fy , ⌧, d}. Here {fx , fy } 249 488 their derivatives. 1 a torque, and 255 1 a force-vector in 2D, ⌧ 250 w✓ ⇠ von mises(0, ) w mises(0, ) ✓ ⇠ von 489 2 2 tion for which this action is applied. A 251 256 Actions in this context correspond to the forces 490 (s, a, s˙0 ) = (x, y, ✓, ẋ, 0ẏ, ✓, 0 0 f 0 , f 0 , ⌧,0 d,˙ 0x0 , y ˙ (s, a, s2D, ) = (x, y, ✓, ẋ, ẏ, ✓, f , f , ⌧, d, x , y , ✓ ẋ ẏ , ✓ ) So the dynamics model signature is: x y 257 x y vation” is an assignment to the tuple ( 252in and torques used to move objects around, and 491 0 0 0 0 0 0 ˙ ˙ 258 0 0 (x,represented ẋ, a) ẏ, ✓, fsx ,= fythree , ⌧, d, xadditional , y , ✓ , ẋ ,parame ẏ , ✓ ), f (s, !with f253 (s, a)can ! sbe = y, ✓, 492 254 259 tal= dimensionality of m{fx= +0 40 +0 ters a493 {fx , fy , ⌧, d}. Here , fy }6 represents 0 ! 0 ˙(x 0 ,y ,✓ , ˙ 0f, xy,0 ,f✓y ,0 ,⌧,ẋd) ˙ f ((x, y, ✓, ẋ, ẏ, ✓, f255 ((x, y, ✓, ẋ, ẏ, ✓, f , f , ⌧, d) ! (x , ẏ , ✓ ) a dura xmodeling y 260 a force-vector in 2D, ⌧ a torque, and d The problem is to take an h 494 256 261 tion for thisthese actionobservations is applied. An trixwhich D of and“obser fit a 0 55 0 257 262 ˙ vation”f (s, is a) an assignment to the tuple (s, a, s ! s = f ((x, y, ✓, ẋ, ẏ, ✓, f ) , f= 8s, a Latex Math able input-output measurements [15]. Adaptation is typically ⇤ Jonathan done inScholz two stages: (1) estimation of the plant parameters using a Parameter Adaptation Algorithm (PAA) (2) updating parameters based on the current plant parameter Aprilcontroller 16, 2014 estimates. PBR can be understood as a PAA generalized to support non-linear model estimation using Bayesian approximate inference. Alternative approaches to non-linear system identification can be found in [?]. Finally, there are several results in object pushing without explicit physical knowledge [18], [24], however these approaches are restricted to holonomic 0 objects. We are interf (s,environments, a) ! s which can frequently ested in tasks in human include objects with wheels and hinges. f (s, a; ) ⇡ s0 8s, a III. OVERVIEW dimension parameterized by i .2 Defining X̃ := [s, a] for notational convenien predictors can be fit for each output dimension u squares: ˆi =⇤(X̃ T X̃) 1 X̃s0 i Evaluation Method: OO-LWR Jonathan Scholz B. Locally Weighted State-Space Regression 16, 2014Regression is structured sim Locally-Weighted Uses locally-weighted regression to April introduces a query-dependent kernel whose role i model dynamics notion of similarity that allows predictions to be , a; ) • Problem: How to handle collisions work onregressor? data driven dynamics modeling for 0 with robotics aExisting smooth L( |h) = low-level P (s |s,control a; )problem typically focuses on the their nearby training points. The kernel function compute a positive distance wj = k(X̃ ⇤ , X̃ j ) be query point X̃ ⇤ and each element X̃ j of the tr These weights are collected into a diagonal matr used to produce predictions with weighted least-s ⇤ i s0i = ((X̃ T W X̃) = X̃ ⇤T ⇤ i 1 X̃ T W yi 0 (e.g. [21]), and requires learning the dynamics of the robot itself. Non-parametric regression is an appropriate choice In contrast to the parametric approach in Eq. 3, 2 n n | | a , . . . , s , a ] for these situations as model fidelity is critical and robot parameters ⇤ are re-computed for each query. A the regression coefficients are free to vary across mechanisms are generally complex and difficult to model. • .t. Cond(s) = c However, the mobile X manipulation := [s, a] tasks we consider here space, allowing LWR to model nonlinear func are unique in that (a) tasks often involve many objects with least-squares. The main properties of LWR are: highly redundant dynamics [16], (b) many tasks are quasi1 1 2 2 n n 2 allowing 3feedback X :=to[scompensate , a , s ,for a ,minor . . . , model s , a ] 1) The only model parameters are the kern static, c f (X1 ) parameters, and flexibility is achieved by s errors cand7(c) observations are limited and expensive 6 f[12], (X ) 2 As raw data. to 6 acquire [8]. the primary concern shifts from 7 a result, c f (X) := (1) 6 to.. efficiency 7 Xand := [snot Cond(s) = c2) Kernels are typically decreasing, making f i , aclear i ] s.t. i fidelity it is if non-parametric 4 5 . polant. Consequently, coverage of the tra methods arec still the correct choice. We will now briefly f (Xthe over the test set is a key factor in the ac n ) model learning problem and the core concepts summarize 2the model. 3 for each approach. f (X1 , ciss )deferred to query-time. Thi 3) Computation 6negative 7performance for real-tim f (X , c ) A. State-Space Regression impact on 2 s )P ( ) 6 7 56 f (X) := 6 7 . Our goal is to fit a model to the discrete-time state-space Solution: Factor space using object and collision-state variables Samples Required PBRL: Sample Efficiency 140 120 100 80 60 40 20 0 ≈ ≈ ≈≈ ≈ ≈ PBR LWR LR Lo Ki ge un n he tc k es ir ha C 57 ir ha C ir ha C ch ou C d Be t ir ha C e bl Ta ar C ng ki oc n he tc e bl Ta k es e fe of D R Ki C D Number of samples required to achieve R2 ≥ 0.995 on a collection of household objects. Training data corrupted by Gaussian noise (σ2 = 1.25). PBRL results: model learning • 0.4 Fitting dynamics of shopping cart PBR LWR LR 0.0 • R2 0.8 Shopping-Cart Model Fitting PBRL more robust to noise than regression methods 0 0.5 1 1.5 2 Noise − Level(σ2) 58 5000 PBRL results: online performance • OO-LWR viable given sufficient data • −100 PBRL significantly more sample efficient TRUE PBRL OOLWR LWR −300 • Reward 0 Shopping-Cart Task: Noise-free LWR stuck in obstacle 0 100 200 300 Step 59 400 500 PBRL Results: Online Performance OO-LWR intractable −80 • TRUE PBRL OOLWR −120 PBRL capable of fitting multi-body models Reward • −40 0 Apartment Task: Gaussian Noise (σ=1.25) 0 200 400 600 Step 60 800 1000 The problem with other carts • Same idea, but different constraints are different... 14 Tuesday, September 6, 2011 61

Physics-Based Reinforcement Learning for Mobile Manipulation

Related documents

Products

Support

Physics-Based Reinforcement Learning for Mobile Manipulation

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib