Physics-Based Reinforcement Learning for Mobile Manipulation
PhD Dissertation Defense
Jonathan Scholz
August 17, 2015

Committee:
Dr. Charles Isbell (IC, Georgia Institute of Technology)
Dr. Andrea Thomaz (IC, Georgia Institute of Technology)
Dr. Henrik Christensen (IC, Georgia Institute of Technology)
Dr. Magnus Egerstedt (ECE, Georgia Institute of Technology)
Dr. Michael Littman (CS, Brown University)

Thesis Vision
Robots operating in human environments, learning behaviors from experience.
Good Ideas
• Self-guided exploration
• Shaping behavior through feedback

Reinforcement Learning
The agent in state s_t takes action a_t in the world, and receives reward r_t and next state s_{t+1}. (A minimal sketch of this loop appears after the prior-work slide below.)
• Pacman states: all positions of Pacman, ghosts, food, & pellets
• Pacman actions: {N, S, E, W}
• Pacman rewards: -1 per step, +10 food, -500 die, +500 win, ...

RL + Robotics: Prior Work
• Humanoid Walking (Peters, Schaal & Vijayakumar 2003)
• Acrobatic Helicopter (Ng et al. 2003)
• Ball-in-Cup (Kober & Peters 2009)
• PILCO: Cart-Pole (Deisenroth et al. 2011)
• PILCO: Block-Stacking (Deisenroth et al. 2011)
[Slide reproduces figures and abstract excerpts from the cited papers: Peters & Schaal's episodic Natural Actor-Critic results on the humanoid robot DB, and Deisenroth, Rasmussen & Fox, "Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning" (block stacking with a Lynxmotion arm and a Kinect-style depth camera).]
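To make the agent-world loop concrete, here is a minimal sketch in Python; `world` and `agent` are hypothetical stand-ins for illustration, not thesis code.

```python
# A minimal sketch of the RL loop on the "Reinforcement Learning" slide:
# the agent sees state s_t, picks action a_t, and the world returns
# reward r_t and next state s_{t+1}.

def run_episode(world, agent, max_steps=1000):
    s = world.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = agent.act(s)                  # e.g. one of {N, S, E, W}
        s_next, r, done = world.step(a)   # e.g. -1 per step, +10 food, ...
        agent.observe(s, a, r, s_next)    # learning update
        total_reward += r
        s = s_next
        if done:                          # e.g. died (-500) or won (+500)
            break
    return total_reward
```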
Not achieved using RL methods
[Diagram: task-level decisions vs. sensorimotor-level feedback, with the mapping between the two levels unknown (??).]

Sensorimotor Level
• What the robot sees: pixel intensities, a 2,073,618-dimensional time-series!
• What the robot does: joint currents

Physical Object Level
• What the robot sees: square table pose, rectangular table pose, and net applied force-torque (square table), a 9-dimensional time-series (a sketch of this state follows the comparison tables below)

"It is hard to imagine a truly intelligent agent that does not conceive of the world in terms of objects and their properties and relations to other objects." (Leslie Kaelbling, 2001)

Reinforcement Learning Method Comparison

| Domain | Policy | Model | Horizon | Problem Space | Algorithm | Data |
|---|---|---|---|---|---|---|
| Ball in Cup | DMP | None | Short | Robot-Space | EM w/ Natural Gradient | Autonomous |
| Humanoid Walking | RBF | None | Short | Robot-Space | LSTD w/ Natural Gradient | Autonomous |
| Helicopter | Neural-Net | Locally Weighted Regression | Short | Robot-Space | Hill-Climbing w/ Monte-Carlo Eval. | LfD |
| PILCO Cart-Pole | RBF | Gaussian Process | Short | Robot-Space | CG/L-BFGS | Autonomous |
| PILCO Block-Stacking | Linear | Gaussian Process | Short | Mixed | CG/L-BFGS | Autonomous |

Robotics Method Comparison

| Domain | Policy | Model | Horizon | Problem Space | Algorithm | Data |
|---|---|---|---|---|---|---|
| Cart Pushing | Jacobian PD-Control | Geometric Primitives (Projected 2D) | Long | Task-Space | A* Variant | N/A |
| PR2 Towels | Scripted Grasp and Manipulation Prim. | Implicit Cloth Physics | Long | Task-Space | State Machine | N/A |
| PR2 Socks | Scripted Grasp and Manipulation Prim. | Implicit Cloth Physics | Long | Task-Space | State Machine | N/A |
| HRP-2 NAMO | ZMP Walking, Jacobian PD Grasp | 2D Rigid Body Physics | Long | Task-Space | A* Variant | N/A |
| PR2 NAMO | RRT Path Planner, PD Control | 2/3D Geometric Simulation | Long | Mixed | Hierarchical Backchaining | N/A |
| HRP-2 MacGyver | ZMP Walking, Scripted Primitives | 3D Physical Simulation | Long | Mixed | A* Variant | N/A |
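As promised above, a sketch of why the object-level stream is 9-dimensional: two table poses plus a net force-torque. The field names are my reading of the slide, not thesis code.

```python
from dataclasses import dataclass

@dataclass
class ObjectLevelState:
    # Hypothetical object-level state per timestep: 3 + 3 + 3 = 9 numbers,
    # versus millions of pixel intensities at the sensorimotor level.
    square_table_pose: tuple[float, float, float]  # (x, y, theta)
    rect_table_pose: tuple[float, float, float]    # (x, y, theta)
    net_wrench_square: tuple[float, float, float]  # (f_x, f_y, tau)

    def as_vector(self) -> list[float]:
        """Flatten to the 9-dimensional per-timestep feature vector."""
        return [*self.square_table_pose, *self.rect_table_pose,
                *self.net_wrench_square]
```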
Thesis statement
Physics-based Reinforcement Learning is a feasible and efficient method for autonomous robot manipulation, and enables adaptive behavior in natural environments.

Thesis intuition
[Venn diagram: Robotics ∩ Reinforcement Learning]

Thesis Wish List
1. Whole-Body Manipulation
➡ Utilizes full robot capabilities
➡ Challenging domestic robotics tasks
2. Multi-Object Planning
3. Stochastic Planning
➡ Handles model uncertainty
4. Model Learning
➡ Adapt to novel environments
5. Dynamic Constraints
➡ Capture rigid-body behavior

Outline
1. Physics-based Reinforcement Learning
2. Navigation Among Movable Objects with Uncertain Dynamics
3. Online Learning for Navigation Among Movable Objects

Efficiency in Model-Based RL
• Models let you simulate experience
• Faster model learning = faster policy learning
[Plot: online learning curves, reward vs. time]

Object-Oriented MDPs
Diuk et al., "An Object-Oriented Representation for Efficient Reinforcement Learning" (ICML 2007)
State: s = ∪_{i=1}^n o_i.attributes
Condition: c = ∪_{k=1}^R r_k(o_i, o_j) ∀ i ≠ j

Taxi domain:

| Attributes | Classes | Relations |
|---|---|---|
| x | Person | in(person, taxi) |
| y | Taxi | contactN(o1, o2) |
| inTaxi | Wall | contactS(o1, o2) |
| | Destination | contactE(o1, o2) |
| | | contactW(o1, o2) |

Key idea: distinct states can share conditions (s1 ≠ s2 ≠ s3 while c1 ≠ c2 = c3), so experience generalizes across states.

OO-MDP limitations
• Discrete state-space
• Deterministic boolean effects, e.g. go-right(s, c): s.x = s.x + 1 if no wall, else s.x = s.x; do-nothing(s, c): s.x = s.x
• No plant dynamics: when an object experiences forces, the effect (s.x = ?) is undefined
[Figure (Diuk et al.): the taxi domain, the original 5×5 problem and an extended 10×10 version with a different wall distribution and 8 possible passenger locations and destinations.]

Model signature
P(s'|s, a) = P(x'_1, y'_1, θ'_1, ẋ'_1, ẏ'_1, θ̇'_1, ..., x'_n, y'_n, θ'_n, ẋ'_n, ẏ'_n, θ̇'_n | x_1, y_1, θ_1, ẋ_1, ẏ_1, θ̇_1, ..., f_x, f_y, τ, i)

Actions apply a force and torque to a chosen object: a = [f_x, f_y, τ, i].

As a regression problem: overall f : R^{6n+4} → R^{6n}, decomposed into one scalar predictor f : R^{6n+4} → R^1 per output dimension:

[x_1, ..., θ̇_n]_{t+1} = [ f(x_1, ..., f_x, ...; x_1) + ε, ..., f(x_1, ..., f_x, ...; θ̇_n) + ε ]

where the final argument names the output dimension being predicted.

Object-Oriented Regression (ICML 2014)
1) Define oriented collision predicates: contact_{θ=0}(o_i), contact_{θ=20}(o_i), ..., contact_{θ=180}(o_i)
2) Fit one regression model per condition, where a condition is the boolean vector of contact predicates over the scene (a sketch follows below)
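To make "one regression model per condition" concrete, here is a hedged sketch of building the boolean condition vector from oriented contact predicates. The 20-sectors-per-object discretization is chosen only so that n = 12 objects yields the 240-dimensional example cited on the next slide; all names are assumptions.

```python
import numpy as np

N_SECTORS = 20  # oriented contact sectors per object -> 12 objects = 240 dims

def sector(angle_deg: float) -> int:
    """Discretize a contact direction (degrees, object frame) into a sector."""
    return int(angle_deg % 360.0) * N_SECTORS // 360

def condition_vector(contacts, n_objects: int) -> np.ndarray:
    """contacts: iterable of (object_index, contact_angle_deg) pairs.
    Returns the boolean condition C that selects which regressor to use."""
    C = np.zeros(n_objects * N_SECTORS, dtype=bool)
    for i, angle in contacts:
        C[i * N_SECTORS + sector(angle)] = True
    return C

# e.g. object 0 contacted on its +x face, object 3 contacted from behind:
C = condition_vector([(0, 0.0), (3, 180.0)], n_objects=12)  # 240-dim
```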
As a baseline, the controlled dynamical system can be estimated online from input-output measurements with a parametric least-squares model. Defining X̃ := [s, a] for notational convenience, one predictor is fit per output dimension:

β̂_i = (X̃ᵀ X̃)⁻¹ X̃ᵀ s'_i

Locally Weighted Regression (LWR) is structured similarly, but introduces a query-dependent kernel that biases predictions toward nearby training points. The kernel computes a positive weight w_j = k(X̃*, X̃^j) between the query point X̃* and each training point X̃^j; the weights are collected into a diagonal matrix W and used in weighted least squares:

β*_i = (X̃ᵀ W X̃)⁻¹ X̃ᵀ W y_i
s'_i = X̃*ᵀ β*_i

In contrast to the parametric approach, the coefficients β* are recomputed for each query, so they are free to vary across the state space, allowing LWR to model nonlinear functions with least squares. (A NumPy sketch follows at the end of this section.)

Condition vector, applied to the entire scene: C = [001100…01010] (240-dim for n = 12)

OO-LWR limitations
• Must discretize the collision space into sectors θ_1, ..., θ_12 around each object
• |C| = n·2^P (C: number of conditions; n: number of objects; P: number of collision sectors) ➡ an exponential number of predicates is needed!
• Limits the data available per condition

Prior results on object pushing without explicit physical knowledge are restricted to holonomic objects; tasks in human environments frequently involve objects with wheels and hinges.
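A minimal NumPy sketch of the LWR equations above, one output dimension per call; the Gaussian kernel and bandwidth are assumptions, not the thesis implementation.

```python
import numpy as np

def lwr_predict(X, Y, x_query, bandwidth=1.0):
    """X: (N, D) training inputs [s, a]; Y: (N,) one output dimension s'_i.
    Returns the LWR prediction at x_query."""
    # Query-dependent kernel weights w_j = k(x*, x_j)
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    W = np.diag(w)
    # Weighted least squares: beta* = (X^T W X)^-1 X^T W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    # Prediction: s'_i = x*^T beta*
    return x_query @ beta

# Usage: one call per output dimension (6n of them) per query state-action.
```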
Physical Understanding in Humans
• Dynamic physical inferences: Will it fall?
• Static physical inferences: Which color is heavier?

[Figure (Battaglia et al. 2013): everyday scenes, activities, and art that evoke strong physical intuitions. (A) A cluttered workshop that exhibits many nuanced physical properties. (B) A 3D object-based representation of the scene in A that can support physical inferences based on simulation. (C) A precarious stack of dishes looks like an accident waiting to happen. (D) A child exercises his physical reasoning by stacking blocks. (E) Jenga puts players' physical intuitions to the test. (F) "Stone balancing" exploits our powerful physical expectations.]

Battaglia et al.'s intuitive physics engine (IPE) is a general-purpose approximation: it represents objects' geometries and mass distributions, simulates approximate rigid-body dynamics with the Open Dynamics Engine (ODE), and uses a Monte Carlo approach to turn uncertain beliefs into outcome probabilities.

The Intuitive Physicist
• A physics engine with random variables
• Works by sampling model beliefs and simulating to generate predictions
(Battaglia et al. 2013)
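A hedged sketch of "sampling model beliefs and simulating": estimating a physical judgment by Monte Carlo over uncertain parameters. `simulate_fall` and the belief samplers are placeholders, not Battaglia et al.'s code.

```python
import random

def p_fall(scene, belief_samplers, simulate_fall, n_samples=200):
    """Estimate P(stack falls) by Monte Carlo over sampled physics parameters."""
    falls = 0
    for _ in range(n_samples):
        # Draw one hypothesis about each uncertain physical property
        params = {name: draw() for name, draw in belief_samplers.items()}
        falls += simulate_fall(scene, params)  # 1 if the stack topples
    return falls / n_samples

# e.g. uncertain block masses and friction:
belief = {"mass": lambda: random.gauss(1.0, 0.3),
          "friction": lambda: random.uniform(0.2, 0.8)}
```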
Learning Kinematic and Dynamic Constraints

Wish list status: Whole-Body Manipulation: No | Multi-Object Planning: Yes | Stochastic Planning: No | Model Learning: Yes | Dynamic Constraints: Yes

• Domain: planar manipulation of arbitrary objects
• Objective: learn a compact model
• Learning approach:
  • Stochastic physics-engine model representation
  • MCMC estimator for unknown physical parameters

Model-Based RL Loop
[Diagram: model-based RL loop, with PyMC in the model-learning step.]
Idea: use a physics engine as the model representation.
Technical contribution: formalize the model-learning problem and develop an inference method. (ICML 2014)

Physics API is the Learning Target
f(s, a) → s'
f(s, a; Φ) ≈ s'  ∀ s, a
• Place a prior on the API parameters (body { mass, inertia, joint.wheel, joint.hinge })
• Estimate the posterior from data:
L(Φ|h) = P(s'|s, a; Φ)
P(Φ|h) ∝ P(h|Φ) P(Φ)
π(s) = argmax_a P(s'|s, a, Φ) V(s')

Model Parameters
Three classes of parameters:

1. Rigid-body parameters (m, r, μ_c)
• Mass: for computing accelerations (uniform density assumed; rotations via α = I⁻¹τ)
• Restitution: for computing perpendicular contact forces
• Friction: for computing tangential contact forces, proportional to the normal force (Coulomb friction)

2. Anisotropic friction constraints (w_x, w_y, w_θ, μ_x, μ_y)
• Pose parameters (w_x, w_y, w_θ): place the constraint on the body, in the object frame
• Friction coefficients (μ_x, μ_y): the orthogonal friction components
• Example (wheel): (-0.3, 0, 0, 0.1, 0.8)

3. Distance constraints (i_a, i_b, a_x, a_y, b_x, b_y)
• Body indices (i_a, i_b): the two bodies to anchor the constraint
• Anchor offsets (a_x, a_y) and (b_x, b_y) on the first and second body

All parameters: φ_i := (m, r, μ_c, w_x, w_y, w_θ, μ_x, μ_y, i_a, i_b, a_x, a_y, b_x, b_y), i.e. 14 physical parameters per object. Scenes have many objects: Φ = {φ_1, ..., φ_n}.
[Example: inverted pendulum.]

Gathering Data
• Touch or grasp object ➡ apply forces ➡ track object (e.g. unlocked vs. locked caster)
• Data = state trajectory & applied forces; compute the equivalent force in the body frame

Learning Parameters
Bayesian inference: Φ is a generative model of h, where h = [(s_1, a_1, s'_1), ..., (s_T, a_T, s'_T)].
P(Φ|h) ∝ P(h|Φ) P(Φ)
• Posterior: reflects the robot's updated beliefs after observing h
• Likelihood: should prefer accurate predictions
• Prior: should support only legal values

Wheel (anisotropic friction) constraint, applied each timestep: (1) rotate the body velocity into the joint frame; (2) scale its components by their respective coefficients to get the friction force in the body frame; (3) rotate the force (impulse) back to the world frame; (4) add it to the force accumulator. As a single expression:

ʷf = -R [μ_x 0; 0 μ_y] R⁻¹ ( [ẋ, ẏ]ᵀ + θ̇ ẑ × R [w_x, w_y]ᵀ )

Proposed model parameters are evaluated with a Gaussian likelihood centered on the next state predicted by a physics engine parameterized by Φ̃ (known geometry, proposed dynamics). With Gaussian noise, the log-likelihood is a sum of squared distances between observed and predicted next states over the trajectory:

ln P(h|Φ̃, σ) = -(1/2σ²) Σ_{t=1}^n ( s'_t − f(s_t, a_t; Φ̃) )²

Along with the prior over the parameters, this provides the components for a Metropolis sampler. Let

q(Φ) = ln(P(h|Φ) P(Φ)) = -(1/2σ²) Σ_{t=1}^n ( s'_t − f(s_t, a_t; Φ̃) )² + Σ_p ln P(φ_p)

Posterior samples are then generated by MCMC with acceptance probability

P_accept(Φ_t | Φ_{t−1}) = min(1, exp(q(Φ_t) − q(Φ_{t−1})))

Example posterior sample (wheel constraint): {w_x, w_y, 0, 0.1, 0.8}
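A hedged sketch of the Metropolis sampler defined above: propose perturbed physical parameters, score them with the Gaussian trajectory log-likelihood under a physics-engine rollout plus the log-prior, and accept with probability min(1, exp(Δq)). `simulate_step` stands in for the physics engine f(s, a; Φ); names and proposal scale are assumptions.

```python
import math
import random
import numpy as np

def log_posterior(phi, h, simulate_step, log_prior, sigma=0.05):
    """q(phi) = ln P(h|phi) + ln P(phi), Gaussian likelihood around the
    physics engine's predicted next states."""
    sq_err = sum(
        np.sum((np.asarray(s_next) - simulate_step(s, a, phi)) ** 2)
        for s, a, s_next in h
    )
    return -sq_err / (2.0 * sigma**2) + log_prior(phi)

def metropolis(phi0, h, simulate_step, log_prior, n_steps=5000, scale=0.02):
    phi = np.asarray(phi0, dtype=float)   # 14 physical parameters per object
    q = log_posterior(phi, h, simulate_step, log_prior)
    samples = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal around the current parameters
        prop = phi + np.random.normal(0.0, scale, size=phi.shape)
        q_prop = log_posterior(prop, h, simulate_step, log_prior)
        # Accept with probability min(1, exp(q_prop - q))
        if random.random() < math.exp(min(0.0, q_prop - q)):
            phi, q = prop, q_prop
        samples.append(phi.copy())
    return samples  # posterior samples over the physical parameters
```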
Online Performance (ICML 2014)
[Plots: online learning curves on the Shopping Cart MDP and the Apartment MDP; reward vs. step (0 to 1000) for TRUE, PBRL, OO-LWR, and LWR.]

Learning on Steel Cart (ICRA 2014)
Prior prediction ➡ gathers data ➡ updated beliefs

Contributions
• Formalized learning from manipulation data ➡ allows autonomous learning on arbitrary objects
• Compact physics-based model ➡ very data-efficient vs. OO-LWR
• Successfully applied to a real robot ➡ one-shot learning on tables and cart
Limitations
• Not demonstrated in a real online task (up next!)

Outline
1. Physics-based Reinforcement Learning
2. Navigation Among Movable Objects with Uncertain Dynamics
3. Online Learning for Navigation Among Movable Objects

Navigation Among Movable Objects with Uncertain Dynamics
• Objective: create a collision-free path to the goal
• Challenge: which objects are relevant?

A Well-Studied Problem
• M. Stilman and J. Kuffner. "Navigation Among Movable Obstacles: Real-time Reasoning in Complex Environments." International Journal of Humanoid Robotics, 2004.
• M. Stilman, K. Nishiwaki, S. Kagami, and J. Kuffner. "Planning and executing navigation among movable obstacles." IROS, 2006.
• M. Stilman and J. Kuffner. "Planning among movable obstacles with artificial constraints." WAFR, 2006.
• J. van den Berg, M. Stilman, J. Kuffner, M. Lin, and D. Manocha. "Path Planning Among Movable Obstacles: A Probabilistically Complete Approach." WAFR, 2008.
• D. Nieuwenhuisen, A. van der Stappen, and M. Overmars. "An effective framework for path planning amidst movable obstacles." WAFR, 2008.
• …

Curse of Dimensionality
• State space: the configuration of the robot and every object. Assuming resolution n per dimension:
C_r = {x, y, θ} ➡ |C_r| = O(n³)
C_{o_i} = {x, y, θ} ➡ |C_{o_i}| = O(n³)
⟹ C = C_r × ∏_{i=1}^k C_{o_i} = O(n^{3(k+1)})
• Search is exponential in the number of objects: even at a coarse resolution of n = 10 with k = 5 movable objects, |C| is already on the order of 10^18 configurations.

Model Uncertainty
• Low-level uncertainty: how does this object move?
• High-level uncertainty: can I use this object?
NAMO MDP (WAFR 2012, ICRA 2013)
• Natural state abstraction: free-space regions
• Corresponding action abstraction: clear an opening to a neighboring free space

Background: the transition model is T(s, a, s') = P(s'|s, a), a row-stochastic matrix [Pᵃ_ij] per action. The optimal value and policy satisfy

V*(s) = max_{a∈A} Q(s, a)
π*(s) = argmax_{a∈A} Q(s, a)

via the Bellman equation

V(s) = max_a [ R(s, a) + Σ_{s'} P(s'|s, a) V(s') ]

applied as an update rule with t steps to go (value iteration):

V*_{t+1}(s) = max_a [ R(s, a) + Σ_{s'} P(s'|s, a) V*_t(s') ]

Solving a NAMO-MDP
• High level: value iteration over free-space region dynamics
• Low level: physics-based manipulation planning (Monte-Carlo Tree Search)

Scalability
• The algorithm is typically linear in |O|
• The adjacency graph directly controls the complexity of value iteration: free-space regions F0..F7 give only 8 states! (A minimal value-iteration sketch over such a graph appears at the end of this section.)

Highlighted Behavior
• The robot adapts to a stationary object: it initially selects the table, then updates its model and switches to the couch
• Limitations:
  • Manipulation planning in joint space (KD-RRT)
  • Model "learning" is just a thresholded movable flag

Contributions
• Value-based planning ➡ combines reward and model uncertainty
• Compact high-level dynamics ➡ solves the object-relevance problem
Limitations
• Stationary belief distributions ➡ no online model learning
• Implemented in simulation ➡ doesn't use a realistic control stack

Outline
1. Physics-based Reinforcement Learning
2. Navigation Among Movable Objects with Uncertain Dynamics
3. Online Learning for Navigation Among Movable Objects

Platform
Task level: GoToFreeSpace(O_i, F_j) ➡ GetGraspPoint ➡ NavigateToPoint ➡ Grasp ➡ ClearObstacle(v_obj, k)
System control: Body Controller (p_base); Base Controller and Manipulation Controller ((p_obj) × k); Wheels (v_l, v_r); Motor Control ((v_m) × 16)
Sensorimotor modules: IMU, Vision, FT

Whole-Body Manipulation
• Free parameter: v_o (3-dim), the desired object velocity
• The manipulation controller maps v_o to a robot body trajectory (v_b, v_m)

Grasp Sampling
• Free parameter: g_θ (1-dim), with a model-specific distribution
• Sample a grasp angle g_θ ~ P(θ|Φ)
• Compute the projected boundary point
• Reject if not reachable

Manipulation Policies
The planner operates over v_o and g_θ, with policies P(·|Φ) conditioned on the learned constraint type:

| Constraint | Linear Velocity (v_x, v_y) | Angular Velocity (v_θ) | Grasp Points (g_θ) |
|---|---|---|---|
| Static | None | None | N/A |
| Unconstrained | ±v_x and ±v_y | ±v_θ at object center | Uniform |
| Anisotropic | ±v_u along the unconstrained axis, u = argmin_{x,y}(μ_x, μ_y) | ±v_θ at the constraint anchor | Faces perpendicular to the unconstrained axis |
| Fixed-Point | None | ±v_θ at the constraint anchor | Top and bottom faces opposite to the constraint |
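As promised above, a minimal value-iteration sketch over a small free-space-region MDP, in the spirit of the 8-region example: states are regions, and actions attempt to clear an opening into a neighbor. The adjacency graph, success probabilities, rewards, and discount factor are illustrative assumptions, not the thesis domain.

```python
GAMMA = 0.95  # discount added for convergence (an assumption; not on the slide)

def value_iteration(states, actions, P, R, n_iters=100):
    """actions[s]: available actions; P[s][a]: list of (s_next, prob);
    R[s][a]: reward. Implements V(s) = max_a [R(s,a) + γ Σ P(s'|s,a) V(s')]."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                    for a in actions[s])
             for s in states}
    return V

states = ["F0", "F1", "F2"]
actions = {"F0": ["open_F1"], "F1": ["open_F0", "open_F2"], "F2": ["stay"]}
P = {"F0": {"open_F1": [("F1", 0.8), ("F0", 0.2)]},   # couch moves 80% of tries
     "F1": {"open_F0": [("F0", 0.9), ("F1", 0.1)],
            "open_F2": [("F2", 0.6), ("F1", 0.4)]},
     "F2": {"stay": [("F2", 1.0)]}}                   # goal region
R = {"F0": {"open_F1": -1.0},
     "F1": {"open_F0": -1.0, "open_F2": -1.0},
     "F2": {"stay": 10.0}}
print(value_iteration(states, actions, P, R))
```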
NAMO with Static Constraint
NAMO with Non-Static Constraint
Learning Via Contact

Summary and Contributions
Physics Engine + Reinforcement Learning + ε = Adaptive Manipulation
1. Physics-Based Reinforcement Learning
• Data-efficient model learning for multi-body dynamics
2. NAMO-MDP
• Fast decision-theoretic solution for NAMO with uncertain dynamics
3. Online Learning for NAMO
• First NAMO planner which adapts to novel objects
• First collision-learning result in mobile manipulation

Publications
• SCHOLZ, J. and STILMAN, M., "Combining motion planning and optimization for flexible robot manipulation," in Humanoids, 2010.
• SCHOLZ, J., CHITTA, S., MARTHI, B., and LIKHACHEV, M., "Cart pushing with a mobile manipulation system: Towards navigation with moveable objects," in ICRA, 2011.
• LEVIHN, M., SCHOLZ, J., and STILMAN, M., "Hierarchical decision theoretic planning for navigation among movable obstacles," in WAFR, 2012.
• LEVIHN, M., SCHOLZ, J., and STILMAN, M., "Planning with movable obstacles in continuous environments with uncertain dynamics," in ICRA, 2013.
• SCHOLZ, J., LEVIHN, M., and ISBELL, C., "What Does Physics Bias: A Comparison of Model Priors for Robot Manipulation," in RLDM, 2013.
• SCHOLZ, J., LEVIHN, M., ISBELL, C., and WINGATE, D., "A Physics-Based Model Prior for Object-Oriented MDPs," in ICML, 2014.
• SCHOLZ, J., LEVIHN, M., ISBELL, C. L., CHRISTENSEN, H., and STILMAN, M., "Learning nonholonomic object models for mobile manipulation," in ICRA, 2015.

Special thanks to my committee, and Mike Stilman.