Physics-Based Reinforcement Learning for Mobile Manipulation

PhD Proposal
Jonathan Scholz
June 11, 2014
Committee:
Dr. Charles Isbell (IC, Georgia Institute of Technology)
Dr. Andrea Thomaz (IC, Georgia Institute of Technology)
Dr. Henrik Christensen (IC, Georgia Institute of Technology)
Dr. Magnus Egerstedt (ECE, Georgia Institute of Technology)
Dr. Michael Littman (CS, Brown University)
What this proposal is about
Robots interacting with objects
Interested in using robots to perform tasks
Themes:
Modeling & estimation
Planning with uncertain models
Task planning with humanoids??
Many reasons to be optimistic!
2
Reasons to be optimistic
3
Big Questions
Where is my robot assistant?
Where is my robot butler?
Why is this hard?
Every problem is different
Demos are precarious piles of hacks
Requires lots of careful engineering
4
Reducing the engineering burden
A Reinforcement Learning approach
Advantages of RL:
The Robot “Programs” itself!
Solid decision-theoretic foundation
Treats robot as a persistent agent, with internal state and beliefs
Hides engineering from end-user
5
RL+Robotics: Prior work
• Humanoid Walking (Peters, Schaal, Vijayakumar 2003)
• Acrobatic Helicopter (Ng et al. 2003)
• Ball-in-Cup (Kober & Peters 2009)
• PILCO: Cart-Pole (Deisenroth et al. 2011)
• PILCO: Block-Stacking (Deisenroth et al. 2011)
• CST: Navigation (Konidaris et al. 2012)
[Slide shows excerpts from the cited papers, e.g. Deisenroth, Rasmussen & Fox, "Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning," and the episodic Natural Actor-Critic results of Peters et al.; the extracted paper text is omitted here.]
6
Reinforcement Learning Method Comparison

| Method | Policy | Model | Horizon | Problem Space | Algorithm | Data |
| Ball in Cup | DMP | None | Short | Robot-Space | EM w/ Natural Gradient | Autonomous |
| Humanoid Walking | RBF | None | Short | Robot-Space | LSTD w/ Natural Gradient | Autonomous |
| Helicopter | Neural-Net | Locally Weighted Regression | Short | Robot-Space | Hill-Climbing w/ Monte-Carlo Eval. | LfD |
| PILCO Cart-Pole | RBF | Gaussian Process | Short | Robot-Space | CG/L-BFGS | Autonomous |
| PILCO Block-Stacking | Linear | Gaussian Process | Short | Mixed | CG/L-BFGS | Autonomous |
| Red-Room* | Skill Tree | Options + Lin. Reg. (Fourier Basis) | Long | Task-Space | CST (trajectory segmentation) | LfD |

Robotics Method Comparison

| Method | Policy | Model | Horizon | Problem Space | Algorithm | Data |
| Cart Pushing | Jacobian PD-Control | Geometric Primitives (Projected 2D) | Long | Task-Space | A* Variant | N/A |
| PR2 Towels | Scripted Grasp and Manipulation Prim. | Implicit Cloth Physics | Long | Task-Space | State Machine | N/A |
| PR2 Socks | Scripted Grasp and Manipulation Prim. | Implicit Cloth Physics | Long | Task-Space | State Machine | N/A |
| HRP-2 NAMO | ZMP Walking, Jacobian PD Grasp | 2D Rigid Body Physics | Long | Task-Space | A* Variant | N/A |
| PR2 NAMO | RRT Path Planner, PD Control | 2/3D Geometric Simulation | Long | Mixed | Hierarchical Backchaining | N/A |
| HRP-2 MacGyver | ZMP Walking, Scripted Primitives | 3D Physical Simulation | Long | Mixed | A* Variant | N/A |

Domain General Physics-Based
7
Summary of existing work
Methods either:
a) short-horizon, robot-space, general models
b) long-horizon, w/ known model or human demonstrations
Proposal goal:
Autonomous task-level RL without hand-engineered representations
8
Thesis statement
Incorporating physics-based models and planning
representations into the Reinforcement Learning
framework reduces engineering overhead and increases
robot performance in autonomous mobile manipulation
tasks.
Physics Engines + Reinforcement Learning = Autonomous Mobile Manipulation
9
Two Perspectives
• From Robotics: reduce engineering
• From RL: improve scalability (no more grid worlds)
10
Thesis intuition
[Diagram: Robotics and Reinforcement Learning, separated by a Representational Gap]
11
Outline
1. Physics-based Reinforcement Learning
2. Stochastic Planning for Physics-Based MDPs
3. Application to Golem-Krang: Navigation Among Movable Obstacles
4. Cost-Based Furniture Arrangement
1: Published (Scholz et al. 2014)
2: Proposed (extends Scholz et al. 2010)
3: Proposed (implements Levihn, Scholz, & Stilman 2012 & 2013)
4: Proposed
12
Outline
Physics-based Reinforcement Learning
• Overview
• Model Space
• Data Acquisition
• Inference
• Evaluation
Stochastic Planning for Physics-Based MDPs
Application to Golem-Krang: Navigation Among Movable Obstacles
Cost-Based Furniture Arrangement
13
Physics-Based Reinforcement Learning
Model-Based RL Loop
PyMC
Idea: Use a physics engine as the model representation
Technical contribution: formalize the model-learning problem and develop an inference method
14
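To make the loop concrete, here is a minimal Python sketch of the model-based RL cycle described on this slide; `estimate_posterior`, `plan_action`, and `execute` are hypothetical callables standing in for the inference and planning components developed later, not functions from the proposal.

```python
def pbrl_loop(init_state, n_steps, estimate_posterior, plan_action, execute):
    """Model-based RL loop: act, observe, re-estimate the physics posterior, re-plan."""
    history = []                                   # list of (s, a, s') transitions
    posterior = estimate_posterior(history)        # reduces to the prior when history is empty
    s = init_state
    for _ in range(n_steps):
        a = plan_action(s, posterior)              # plan against the current physics beliefs
        s_next = execute(s, a)                     # apply force/torque, track the object
        history.append((s, a, s_next))
        posterior = estimate_posterior(history)    # e.g. MCMC over physics parameters
        s = s_next
    return posterior, history
```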
PBRL Overview
Physics API is the learning target:
   f(s, a) → s'
Place a prior on the API parameters Φ (e.g. body { mesh, mass, inertia, joint.wheel, ... }):
   f(s, a; Φ) ≈ s'   ∀ s, a
Estimate the posterior from data h:
   L(Φ|h) = P(s'|s, a; Φ)
   P(Φ|h) ∝ P(h|Φ) P(Φ)
Plan with the learned model:
   π(s) = argmax_a P(s'|s, a, Φ) V(s')
15
PBRL: Model Parameters
Three classes of parameters:
• Rigid-body parameters
• Anisotropic friction constraints
• Distance constraints
16
PBRL: Rigid body parameters
(m, r, μ_c)
Mass m: for computing accelerations (we assume uniform density): a = f/m, α = I⁻¹ τ
Restitution r: for computing perpendicular contact forces
Friction μ_c: for computing tangential contact forces, proportional to the normal force f_n (Coulomb friction): f_f = μ_c f_n
Full parameter vector per object:
   Φ := (m, r, μ_c, w_x, w_y, w_θ, μ_x, μ_y, i_a, i_b, a_x, a_y, b_x, b_y)
17
PBRL: Anisotropic friction (wheel constraint)
Parameters (w_x, w_y, w_θ, μ_x, μ_y): anchor position, wheel orientation, and orthogonal friction coefficients for a velocity constraint placed on a body b, e.g. (-0.3, 0, 0, 0.1, 0.8). The orthogonal friction components simply scale the components of the anchor velocity in the constraint frame:
1) Compute the relative velocity between the anchors (wheel and ground) in the world frame:
   ʷJ = [ẋ, ẏ]ᵀ + θ̇ × R [w_x, w_y]ᵀ
2) Rotate the body (wheel) velocity into the body frame (or, in principle, the joint frame if we ever include rotation):
   ᵇJ = R⁻¹ ʷJ
3) Compute the friction force in the body frame by scaling the velocity components by their respective coefficients:
   ᵇf = [[μ_x, 0], [0, μ_y]] ᵇJ
4) Rotate this force (impulse) back to the world frame:
   ʷf = R (ᵇf)
5) Add it to the force accumulator. The wheel constraint as a single expression:
   ʷf = R [[μ_x, 0], [0, μ_y]] R⁻¹ ( [ẋ, ẏ]ᵀ + θ̇ × R [w_x, w_y]ᵀ )
6) Note: in 2-D, the cross product of a scalar with a vector is defined as the equivalent 3-D version about the implied z-axis: θ̇ × [a, b]ᵀ = [0, 0, θ̇]ᵀ × [a, b, 0]ᵀ
R: body rotation in the world frame; the velocity is scaled in the object frame at the anchor.
18
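A small numpy sketch of steps (1)-(5), under the 2-D conventions above. The function name is mine, and the negative sign (friction force opposing the anchor velocity) is an assumption about the engine's sign convention rather than something stated on the slide.

```python
import numpy as np

def wheel_friction_force(body_vel, body_angvel, body_angle, anchor, mu_x, mu_y, wheel_angle=0.0):
    """Anisotropic (wheel) friction force, following steps (1)-(5) above.

    body_vel    : (2,) linear velocity [xdot, ydot] in the world frame
    body_angvel : scalar angular velocity thetadot
    body_angle  : scalar body orientation theta in the world frame
    anchor      : (2,) wheel anchor offset [w_x, w_y] in the body frame
    mu_x, mu_y  : orthogonal friction coefficients
    wheel_angle : wheel orientation w_theta relative to the body
    """
    def rot(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s], [s, c]])

    R_wheel = rot(body_angle + wheel_angle)       # constraint frame expressed in world coords
    r_world = rot(body_angle) @ anchor            # anchor offset rotated into the world frame
    # (1) relative velocity of the anchor point; 2-D cross product: w x r = w * [-r_y, r_x]
    v_anchor = body_vel + body_angvel * np.array([-r_world[1], r_world[0]])
    # (2) rotate into the constraint (wheel) frame
    v_local = R_wheel.T @ v_anchor
    # (3) scale each component by its friction coefficient
    #     (sign: force opposes the anchor velocity -- an assumption, see lead-in)
    f_local = -np.diag([mu_x, mu_y]) @ v_local
    # (4) rotate back to the world frame; (5) the caller adds it to the force accumulator
    return R_wheel @ f_local
```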
PBRL: Distance constraint
Parameters (i_a, i_b, a_x, a_y, b_x, b_y):
• Body indices (i_a, i_b): indicate the two bodies to anchor the constraint
• Position A (a_x, a_y): anchor offset on the first body
• Position B (b_x, b_y): anchor offset on the second body
A position constraint (e.g. an inverted pendulum, or a locked caster). The length l is fixed at initialization from the separation of the two anchor points (each given by a body position plus its offset, e.g. x_a^cur + a); the constraint then requires the current anchor separation to remain equal to l, giving a pin joint when l = 0.
Full parameter vector per object:
   Φ := (m, r, μ_c, w_x, w_y, w_θ, μ_x, μ_y, i_a, i_b, a_x, a_y, b_x, b_y)
19
PBRL: Overall model space
14 physical parameters per object:
   Φ_i := (m, r, μ_c, w_x, w_y, w_θ, μ_x, μ_y, i_a, i_b, a_x, a_y, b_x, b_y)
Scenes have many objects: Φ = (Φ_1, Φ_2, ..., Φ_n)
What if some constraints aren't necessary? We can negate their effects numerically.
Extending this will require inference over variable-sized representations.
20
PBRL: Graphical Model
Like a standard Bayesian regression model, this model includes uncertainty both in the process inputs (the physical parameters) and in the output noise. If f(·; Φ̃) denotes a deterministic physical simulation parameterized by Φ̃, then the core dynamics function is:
   s_{t+1} = f(s_t, a_t; Φ̃) + ε    (7)
where Φ̃ = (Φ_i)_{i=1}^n denotes a full assignment to the relevant physical parameters for all n objects in the scene, and ε is zero-mean Gaussian noise with variance σ².
For any domain, Φ̃ must contain a core set of inertial parameters for each object, as well as zero or more constraints. Inertial parameters define rigid-body behavior in the absence of interactions with other objects, and constraints define the space of possible interactions. In the general case inertia requires 10 parameters: 1 for the object's mass, 3 for the location of the center of mass, and 6 for the symmetric inertia tensor. Constraints whose effects are unnecessary can be nullified numerically (e.g. a wheel constraint with zero coefficients).
[Figure: graphical model depicting the online model-learning problem and the assumptions of PBRL: position & velocity plus the applied force pass through the physics function f(s, a; Φ̃), with additive noise, to produce the new position & velocity.]
21
PBRL: Gathering Data
Touch or grasp the object, apply forces, and track the object.
Compute the equivalent force in the body frame.
Data = state trajectory & applied forces
Example episodes: Unconstrained, Locked Caster, Chair Collision
22
PBRL: Prior & Likelihood
Bayesian inference: Φ is a generative model of the history h,
   h = [ (s_1, a_1, s'_1); (s_2, a_2, s'_2); ...; (s_T, a_T, s'_T) ]
Likelihood: should prefer accurate predictions. Under the Gaussian noise model,
   P(h | Φ, σ) = ∏_{t=1}^T N( s'_t ; f(s_t, a_t; Φ̃), σ² )    (11)
   ln P(h | Φ, σ) ∝ − Σ_{t=1}^T ( s'_t − f(s_t, a_t; Φ̃) )²    (12)
i.e. the log-likelihood of proposed model parameters is evaluated by summing squared distances between the observed and predicted next state for each state and action, where the prediction comes from a generative physics world parameterized by Φ̃ (known geometry, proposed dynamics).
Prior: should support only legal values. Univariate priors chosen by support (Table 1):
   m, r                Log-Normal(µ, σ²)
   µ_x, µ_y            Truncated-Normal(µ, σ², 0, 1)
   w_x, w_y, a_x, a_y  Truncated-Normal(µ, σ²_xy, a_min, a_max)
   b_x, b_y            Truncated-Normal(µ, σ²_xy, b_min, b_max)
   w_θ                 Von-Mises(µ, σ)
   i_a, i_b            Categorical(p)
   (Table 1: Distributions for each physical parameter type)
Posterior: will reflect the robot's updated beliefs after observing h,
   P(Φ | h) ∝ P(h | Φ) P(Φ)
   π(s) = argmax_a P(s' | s, a, Φ) V(s')
23

PBRL: Inference
Posterior samples are generated by MCMC:
   P(Φ, σ | h) = P(h | Φ, σ) P(Φ) P(σ) / ∫ P(h | Φ, σ) P(Φ) P(σ)    (10)
where Φ = {Φ_1, Φ_2, ..., Φ_k} is the collection of hidden parameters for the k objects in the domain, and σ is a scalar. This expression is obtained from Bayes' rule, and defines the abstract model inference problem for a PBRL agent. The prior P(Φ) can be used to encode any prior knowledge about the parameters, and is not assumed to be of any particular form.
Along with the priors defined in Table 1, this provides the necessary components for a Metropolis sampler for the model posterior given h:
   q(Φ) = ln( P(h | Φ) P(Φ) ) = − Σ_{t=1}^n ( s'_t − f(s_t, a_t; Φ̃) )² + Σ_p ln P(Φ^(p))
   P_accept(Φ_t | Φ_{t−1}) = min( 1, q(Φ_t) / q(Φ_{t−1}) )
24
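A minimal sketch of the Metropolis sampler implied by the last two slides, assuming user-supplied `simulate(s, a, phi)` (the deterministic physics step f), `log_prior(phi)`, and `propose(phi, rng)` callables; all names are placeholders. It uses the standard log-posterior-difference acceptance rule, which is the usual reading of the acceptance expression above.

```python
import numpy as np

def log_likelihood(history, phi, sigma, simulate):
    """Gaussian transition likelihood: sum over the (s, a, s') triples in the history."""
    sq_err = sum(np.sum((s_next - simulate(s, a, phi)) ** 2) for s, a, s_next in history)
    return -sq_err / (2.0 * sigma ** 2)

def metropolis(history, phi0, sigma, simulate, log_prior, propose, n_samples=1000):
    """Draw posterior samples of the physical parameters phi given the history."""
    rng = np.random.default_rng(0)
    phi = phi0
    log_post = log_likelihood(history, phi, sigma, simulate) + log_prior(phi)
    samples = []
    for _ in range(n_samples):
        cand = propose(phi, rng)                          # e.g. a Gaussian random-walk step
        cand_post = log_likelihood(history, cand, sigma, simulate) + log_prior(cand)
        if np.log(rng.uniform()) < cand_post - log_post:  # accept with prob min(1, ratio)
            phi, log_post = cand, cand_post
        samples.append(phi)
    return samples
```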
Defining Physics-Based MDPs
States:
   s = [x_1, y_1, θ_1, ẋ_1, ẏ_1, θ̇_1, ..., x_n, y_n, θ_n, ẋ_n, ẏ_n, θ̇_n]ᵀ,   x_i, y_i, ẋ_i, ẏ_i, θ̇_i ∈ ℝ,  θ_i ∈ [−π, π]
Actions:
   a = [f_x, f_y, τ, i]ᵀ,   f_x, f_y, τ ∈ ℝ,  i ∈ ℕ
Transitions:
   f(s, a) → s',   P(Φ|h) ∝ P(h|Φ) P(Φ),   P(s'|s, a, Φ),   π(s) = argmax_a P(s'|s, a, Φ) V(s')
Rewards: task-dependent (next section)
Examples: Shopping Cart MDP, Apartment MDP
25
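For illustration only, a tiny sketch of how the state and action vectors defined above could be packed in code; the helper names are mine, not the proposal's.

```python
import numpy as np

def pack_state(poses, velocities):
    """poses, velocities: (n, 3) arrays of [x, y, theta] and [xdot, ydot, thetadot] per object.
    Returns the flat state vector [x1, y1, th1, xd1, yd1, thd1, ..., thd_n]."""
    return np.hstack([poses, velocities]).reshape(-1)

def make_action(fx, fy, tau, obj_index):
    """A planar force (fx, fy), a torque tau, and the index of the object they act on."""
    return np.array([fx, fy, tau, float(obj_index)])
```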
PBRL Results: Online Performance
PBRL out-performs regression methods.
[Plots: reward vs. step on the Shopping-Cart task (noise-free and Gaussian noise, σ = 1.25) and the Apartment task, comparing TRUE, PBRL, OOLWR, and LWR; bar chart of samples required per household object for PBR, LWR, and LR.]
26
Summary of proposed work
Goal: Scalable RL framework for object manipulation domains
Contribution: Physics-based model representation and scalable inference method
Result: Simulation results suggest favorable performance vs. regression methods
27
Outline
Physics-based Reinforcement Learning
Stochastic Planning for Physics-Based MDPs
• Handling uncertainty
• Handling sequential rewards
• Proposed algorithm
Application to Golem-Krang: Navigation Among Movable Obstacles
Cost-Based Furniture Arrangement
28
Stochastic Planning for Physics-Based MDPs
So far: a stochastic physics model P(s'|s, a)
Next: a method for planning with it
Draws on: Melchior & Simmons (2007); Scholz & Stilman (2010)
29
Planning in RL
Desirable properties: optimality, generality (an open problem)
Traditional approaches:
• Dynamic Programming: too many states
• Monte-Carlo Tree Search: weak search bias (value-based)
RRT: a spatial search bias
30
Two challenges
Handling model uncertainty
Handling sequential rewards
31
Model Uncertainty
Want: a single tree over the entire space
Simple approach: Monte-Carlo. Repeat: sample a model & (re)plan.
Problem: inefficient
[Reference: Melchior & Simmons, "Particle RRT for Path Planning with Uncertainty"]
32
Particle-RRT
Melchior & Simmons (2007)
RRT with model uncertainty:
1. Sample model beliefs
2. Step particles
3. Cluster particles together
Example: slippery terrain; uncertainty about friction parameters. Nodes reflect qualitative differences in outcome.
[Figures from the paper: a pRRT tree with several particles at each node; trajectories with qualitatively different endpoints; dendrogram produced by the hierarchical clustering algorithm; a tree built by pRRT with spheres marking the planned path. The extracted paper text is omitted here.]
33
Particle-RRT limitations
Assumes known goal
How to handle sequential rewards?
Considered by Task-Space RRT*
*Scholz & Stilman, 2010
34
Task-Space RRT
Basic idea
main loop: sample model to search space
sometimes: run gradient optimizer from leaf nodes
Result: finds modes in the cost function, and the actions required to reach them
35
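A compressed Python sketch of the loop just described, under my own simplifying assumptions: `sample_state`, `nearest`, `steer`, `cost`, and `local_optimize` are caller-supplied placeholders, nodes are assumed hashable, and the "sometimes" branch is modeled as a fixed probability.

```python
import random

def task_space_rrt(start, sample_state, nearest, steer, cost, local_optimize,
                   n_iters=1000, p_optimize=0.1):
    """Grow an RRT over the task space; occasionally run a gradient optimizer from a leaf."""
    tree = {start: None}                              # node -> parent
    for _ in range(n_iters):
        x_rand = sample_state()                       # sample the search space
        x_near = nearest(tree.keys(), x_rand)         # closest node already in the tree
        x_new = steer(x_near, x_rand)                 # extend toward the sample
        if x_new is not None:
            tree[x_new] = x_near
        if random.random() < p_optimize:              # the "sometimes" branch
            parents = set(tree.values())
            leaves = [n for n in tree if n not in parents]
            best_leaf = min(leaves, key=cost)         # most promising leaf so far
            tree[local_optimize(best_leaf)] = best_leaf   # gradient step toward a cost mode
    return tree
```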
Task-Space RRT limitations
No node clustering
Only optimizes immediate reward
problem: can't pick best expected action
36
Jaillet et al. 2008
Proposed work
Particle clustering + Leaf optimization + Bellman values = Value-based Task-Space Particle-RRT*
*tsp-RRT?
[Background figure: the Particle RRT paper (Melchior & Simmons); extracted text omitted.]
37
We propose to combine the above methods into a single coherent algorithm: the value-based particle-RRT with leaf optimization. This is a straightforward combination of the two methods discussed above, with one notable difference. In particle RRT, the node extension heuristic was based only on path probabilities, obtained by averaging the model probability across the particles in a node along a path from the root. Writing q for a node containing particles q_1, q_2, ..., q_n, where s_{q_i} is the state of particle q_i and Φ_{q_i} its model sample:
pRRT node values (model averaging):
   v_q = Σ_{q_i ∈ q} P(Φ_{q_i})    (14)
MCTS node values (sparse sampling / Bellman recursion), where Q^d(s, a) is the expected value of taking action a in state s and following the optimal policy for d−1 subsequent steps (so Q^1(s, a) = R(s, a)):
   v_q = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q^{d−1}(s', a') ]    (15)
tsp-RRT node values (action-independent rewards, transitions deterministic given Φ):
   v_q = (1/n) Σ_{q_i ∈ q} R(s_{q_i}) + γ Σ_{q_i ∈ q} P(Φ_{q_i}) max_{q'} v_{q'},   with s' = f(s_{q_i}, a_{q_i}; Φ_{q_i})    (18)
Why useful? The node value reflects both the path probability (the model beliefs) AND the look-ahead future reward.
38
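A rough sketch of the particle-based backup in Eq. 18, assuming each tree node stores (state, model-weight) particle pairs and a list of children; the class and function names are mine, and collapsing the child term to the single best child value is a simplification of the max over successor nodes.

```python
import numpy as np

class Node:
    def __init__(self, particles, children=()):
        self.particles = particles      # list of (state, model_weight) pairs
        self.children = list(children)  # child Nodes reached from this node
        self.value = 0.0

def backup(node, reward_fn, gamma=0.95):
    """Particle-based Bellman backup (Eq. 18): immediate reward averaged over the
    particles, plus the discounted, model-weighted value of the best child node."""
    for child in node.children:
        backup(child, reward_fn, gamma)
    states = [s for s, _ in node.particles]
    weights = np.array([w for _, w in node.particles])
    avg_reward = float(np.mean([reward_fn(s) for s in states]))
    best_child = max((c.value for c in node.children), default=0.0)
    node.value = avg_reward + gamma * weights.sum() * best_child
    return node.value
```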
Summary of proposed work
Goal:
Tractable value-based planner for large continuous spaces with stochastic PBRL dynamics
Expected Result:
Successful online planning for PBRL problems for up to 5-10 objects
39
Outline
Physics-based Reinforcement Learning
Stochastic Planning for Physics-Based MDPs
Application to Golem-Krang: Navigation Among Movable Obstacles
• Target platform
• NAMO MDP formulation
• Proposed task setting
Cost-Based Furniture Arrangement
40
Platform: Golem-Krang
Software Architecture
Robot:
• dynamically-stable humanoid
• 7-DOF harmonic drive arms
• 6-DOF wrist force/torque sensors
*balancing-related modules omitted
41
The NAMO MDP
States: free-space regions
Actions: connecting regions
Rewards: goal region (sparse)
Rewards propagate through the abstract MDP
Transition uncertainty grounded in PBRL beliefs
Levihn, Scholz, & Stilman 2012, 2013
42
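To illustrate how the sparse goal reward propagates through the abstract region MDP, here is a small value-iteration sketch over a free-space region graph; the graph, success probabilities, and discount are placeholders standing in for the PBRL-grounded transition beliefs.

```python
def region_value_iteration(regions, edges, goal, gamma=0.95, n_iters=100):
    """regions: iterable of region ids; edges: {(r_from, r_to): p_success} for
    obstacle-clearing actions that connect two free-space regions."""
    V = {r: 0.0 for r in regions}
    for _ in range(n_iters):
        for r in regions:
            if r == goal:
                V[r] = 1.0                      # sparse reward at the goal region
                continue
            candidates = [p * gamma * V[r2] for (r1, r2), p in edges.items() if r1 == r]
            V[r] = max(candidates, default=0.0)
    return V

# Example: clearing a couch connects A to B with prob 0.8, and B to the goal G with prob 0.9.
print(region_value_iteration(["A", "B", "G"], {("A", "B"): 0.8, ("B", "G"): 0.9}, goal="G"))
```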
Task Setting
Tag-based vision, camera rig, simulation; Techway facility
43
Proposed work
Goal:
Implement the NAMO MDP on Golem-Krang
Expected Result:
First NAMO system to handle non-holonomic objects and dynamics uncertainty
44
Outline
Physics-based Reinforcement Learning
Stochastic Planning for Physics-Based MDPs
Application to Golem-Krang: Navigation Among Movable Obstacles
Cost-Based Furniture Arrangement
• Cost function for meeting configuration
• Proposed task setting
45
Cost-Based Furniture Arrangement
Basic idea: use the robot to optimize task parameters
Follows [Scholz et al. 2010], with furniture and non-holonomic constraints
46
Cost functions for office configuration
New cost terms for office applications; closed-loop base navigation system.
One of the central goals of the proposed research is to increase the autonomy of robots using Reinforcement Learning. Thus, in addition to the proposed NAMO task, which assumes a navigation goal specified by an end user, we propose to implement a mobile manipulation task based on more abstract criteria, in the tradition of [25]. Specifically, we will use the cost-based formulation of task goals from [25] to define a task in which a robot must arrange the furniture in a room first for a presentation, and then for a meeting. In contrast to the method taken in [25], which resorted to action primitives, we will use the physics-based model space and planners described in Chap. 3 and Chap. 4.
If c denotes the x, y position of the center of a circle of radius r (e.g. the position of a table with radius r_t < r), and (p_i, θ_i) the position and orientation of object i, then the following terms can be defined to quantify the criteria of the meeting task:
1. linear distance to the circle:
   c_linear = Σ_{i=1}^n ( ||p_i − c|| − r )²    (21)
2. angular distance (radial orientation): penalizes the deviation of each object's heading (sin θ_i, cos θ_i) from the direction of the circle center (p_i − c)/||p_i − c||    (22)
3. penalty for non-homogeneous spacing:
   c_spacing = s · s − n ( 2π / (n+1) )²    (23)
   where t_i = atan2(p_iy − c_y, p_ix − c_x), t* = sort(t), s = [t*_i − t*_{i−1}]_{i=1}^n    (24-27)
These terms capture the error of projecting the pose of each object onto a circle centered at c, facing inwards. The vector s represents the angular spacing of objects about the circle, obtained by sorting and finite-differencing the angles between the circle center and each object. The term n(2π/(n+1))² offsets the spacing cost by its maximal value for n objects, so that the final reward in Eq. 29 is 0 for any optimal configuration, such as the one depicted in Fig. 12(a).
We must also specify environment constraints on the poses of these bodies, such that the leaf-optimizer recovers feasible optima. For example, Eq. 28 limits all objects to the room dimensions:
   ( x_min ≤ x_i ≤ x_max,  y_min ≤ y_i ≤ y_max )_{i=1}^n    (28)
Given weights α, β, γ indicating the relative importance of the subtasks, the overall reward function for the meeting task can be defined as follows:
   R_meeting = −( α c_linear + β c_angular + γ c_spacing )    (29)
The grid reward function is defined similarly, but omitted to save space. An important property of this problem is that there are many possible goal configurations: the meeting task is equally satisfied by any circular arrangement of the chairs, regardless of the particular placement of individual chairs.
Legend: c: circle position; r: circle radius; p_i: object position; θ_i: object orientation; s: angular spacing term; n: number of objects.
Figure 12: Maximal configurations under the two different cost functions, and the target environment: (a) Meeting Task Optimum, (b) Presentation Task Optimum, (c) Target Environment.
47
Meeting configuration
48
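A small numpy sketch of the meeting-task reward, following Eqs. 21, 23, and 29. The angular term is a simplified squared heading-error toward the circle center (my reading of Eq. 22, whose exact form is garbled above), and the spacing vector wraps around the circle, which may differ slightly from the sorted finite differences in Eqs. 24-27.

```python
import numpy as np

def meeting_reward(p, theta, c, r, alpha=1.0, beta=1.0, gamma_w=1.0):
    """p: (n, 2) object positions; theta: (n,) orientations; c: (2,) circle center; r: radius."""
    n = len(p)
    d = p - c                                          # vectors from center to objects
    dist = np.linalg.norm(d, axis=1)
    c_linear = np.sum((dist - r) ** 2)                 # Eq. 21: stay on the circle
    inward = np.arctan2(-d[:, 1], -d[:, 0])            # heading that faces the center
    err = np.angle(np.exp(1j * (theta - inward)))      # wrapped heading error
    c_angular = np.sum(err ** 2)                       # simplified stand-in for Eq. 22
    t = np.sort(np.arctan2(d[:, 1], d[:, 0]))          # angles of objects about the circle
    s = np.diff(np.concatenate([t, [t[0] + 2 * np.pi]]))   # angular spacing (wrapped)
    c_spacing = s @ s - n * (2 * np.pi / (n + 1)) ** 2      # Eq. 23 with the offset term
    return -(alpha * c_linear + beta * c_angular + gamma_w * c_spacing)   # Eq. 29
```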
Proposed work
Goal: Successful room configuration for 2-5 objects
Expected Result: Implement the cost-based furniture method on Golem-Krang
[Images: Meeting Task Optimum, Presentation Task Optimum, Target Environment]
49
Summary of proposal
1. Tractable online model learning
   Proposed: Bayesian physics-based approach
   Demonstrated: PBRL offers superior performance vs. regression methods (ICML 2014)
2. Tractable planning in task-space
   Demonstrated: Task-space RRT planner with leaf optimization (Humanoids 2010)
   Proposed: Stochastic value-based RRT planner with leaf optimization
3. Implementation on Golem-Krang
   Demonstrated: Hierarchical NAMO-MDP formulation in simulation (WAFR 2012, ICRA 2013)
   Proposed: Implementation of the NAMO-MDP on Krang
   Proposed: Implementation of the cost-based furniture arrangement problem on Krang
50
Timeline
Date            Work                       Progress
2010            Task-space RRT             published (Humanoids '10)
2012-2013       Hierarchical NAMO-MDP      published (WAFR '12, ICRA '13)
2014            Core PBRL Model            published (ICML '14)
Sept. 2014      tsp-RRT                    will submit to ICRA
Jan. 2015       Krang Implementation       will submit to RSS
Feb.-Apr. 2015  Thesis writing
May 2015        Thesis defense
51
Special thanks to my committee, and Mike Stilman
52
Backup Slides
53
Bellman values in MCTS+RRT
Eq. 15, obtained directly from the Bellman recursion, appears superficially different from Eq. 14. However, for cases in which the reward function is not dependent on the action, as with the task-space formalism in [25], and in which the model transitions are deterministic (even if the model parameters themselves are not), Eq. 15 simplifies as follows:
   v_q = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q^{d−1}(s', a') ]    (16)
       = max_a [ R(s) + γ Σ_{s'} ( ∫_{φ ∈ F} P(s'|s, a, φ) P(φ) dφ ) max_{a'} Q^{d−1}(s', a') ]    (17)
       = (1/n) Σ_{q_i ∈ q} R(s_{q_i}) + γ Σ_{q_i ∈ q} P(Φ_{q_i}) max_{q'} v_{q'},   with s' = f(s_{q_i}, a_{q_i}; Φ_{q_i})    (18)
• Expand the stochastic transition as a marginal over the continuous model parameters
• Approximate the integral with particles
Eq. 18 is obtained by expanding the model transition as a marginal over the unknown model parameters and approximating it with a finite sum over particles (f(·) denotes the PBRL model function from Sec. 3.2). This yields a generalization of the approach taken in [21], in which we replace the model averaging (Eq. 14) with a particle-based Bellman backup. Note that this contribution achieves what the authors suggest regarding a search heuristic:
"The effectiveness of the quality heuristic might be improved if we could calculate not only the probability of reaching a node from the root of the tree, but also an estimate of the probability of reaching the goal from this node."
Because reward functions can be arbitrary, this formulation does not guarantee an admissible heuristic. However, for many common cases, such as simple distance-based rewards, we expect this approach to yield notable improvements in runtime. After validating this approach in simulation, we will illustrate its applicability to several mobile manipulation problems, including one based only on abstract criteria about the desired room configuration (Sec. 5.2).
54
PBRL: Model Format
Want: g(s_i, a_i) ≈ s'_i, for a model g(s_i, a_i) → s'_i + ε, ε ~ N(µ, Σ)
The object-oriented factorization of state follows the OO-MDP (Diuk et al. 2008), and is explored in greater depth here. In physics-based domains, fully describing object state requires both pose and velocity parameters (the so-called "phase-space" representation of the system).
Without loss of generality, we consider objects in two dimensions, which can be represented with six parameters s = {x, y, θ, ẋ, ẏ, θ̇}, with {x, y, θ} corresponding to 2D position and orientation and {ẋ, ẏ, θ̇} their derivatives. Actions in this context correspond to the forces and torques used to move objects around, and can be represented with additional parameters a = {f_x, f_y, τ, d}: {f_x, f_y} is a force vector in 2D, τ a torque, and d the duration for which the action is applied. An "observation" is an assignment to the tuple (s, a, s'), so the dynamics model signature is:
   f((x, y, θ, ẋ, ẏ, θ̇), f_x, f_y, τ, d) → (x', y', θ', ẋ', ẏ', θ̇')
for a total dimensionality of m = 6 + 4 + 6 = 16. The modeling problem is to take a matrix D of these observations and fit a model. Note that while the global maximum of the likelihood is unique, the likelihood surface can be quite convoluted (e.g. in the position of an anisotropic friction constraint).
55
Evaluation Method: OO-LWR
• Uses locally-weighted regression to model dynamics
• Problem: how to handle collisions with a smooth regressor?
• Solution: factor the space using object and collision-state variables
Background (related work excerpt): adaptation is typically done in two stages: (1) estimation of the plant parameters using a Parameter Adaptation Algorithm (PAA), and (2) updating controller parameters based on the current plant parameter estimates. PBR can be understood as a PAA generalized to support non-linear model estimation using Bayesian approximate inference. There are also several results in object pushing without explicit physical knowledge [18], [24]; however, these approaches are restricted to holonomic objects. We are interested in tasks in human environments, which frequently include objects with wheels and hinges.
State-space regression: our goal is to fit a model to the discrete-time state-space data. Defining X̃ := [s, a] for notational convenience, predictors can be fit for each output dimension i with least squares:
   θ̂_i = (X̃ᵀ X̃)⁻¹ X̃ᵀ s'_i
Locally-Weighted Regression is structured similarly, but introduces a query-dependent kernel whose role is to provide a notion of similarity, so predictions are dominated by nearby training points. The kernel gives a positive weight w_j = k(X̃*, X̃_j) between the query point X̃* and each element X̃_j of the training set; these weights are collected into a diagonal matrix W and used for weighted least squares:
   θ*_i = (X̃ᵀ W X̃)⁻¹ X̃ᵀ W y_i,    s'_i = X̃*ᵀ θ*_i
In contrast to the parametric approach, the coefficients θ* are re-computed for each query, allowing LWR to model nonlinear functions. The main properties of LWR: (1) the only model parameters are the kernel parameters, and flexibility is achieved by storing raw data; (2) kernels are typically decreasing, making LWR an interpolant, so coverage of the training data over the test region is a key factor in accuracy; (3) computation is deferred to query time, which can hurt real-time performance.
Object-oriented LWR (OO-LWR) additionally conditions the training data on a discrete collision-state variable, X := [s_i, a_i] s.t. Cond(s) = c, fitting separate local models f(X, c) for each condition c.
56
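A minimal numpy sketch of the locally-weighted prediction described above. The Gaussian kernel, bias column, and small ridge term are my own choices; the proposal only specifies the weighted least-squares form.

```python
import numpy as np

def lwr_predict(X, Y, x_query, bandwidth=1.0):
    """Locally-weighted linear regression: refit the coefficients for each query.

    X: (N, d) training inputs [s, a];  Y: (N, k) training targets s';  x_query: (d,)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    W = np.diag(w)                                     # query-dependent weights
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias column
    xq = np.append(x_query, 1.0)
    # small ridge term added for numerical stability (assumption, not from the proposal)
    theta = np.linalg.solve(Xb.T @ W @ Xb + 1e-8 * np.eye(Xb.shape[1]), Xb.T @ W @ Y)
    return xq @ theta                                  # predicted next state
```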
PBRL: Sample Efficiency
[Bar chart: number of samples required to achieve R² ≥ 0.995 on a collection of household objects, comparing PBR, LWR, and LR. Training data corrupted by Gaussian noise (σ² = 1.25).]
57
PBRL results: model learning
• Fitting the dynamics of a shopping cart
• PBRL more robust to noise than regression methods
[Plot: Shopping-Cart model fitting; R² vs. noise level σ², comparing PBR, LWR, and LR.]
58
PBRL results: online performance
• OO-LWR viable given sufficient data
• PBRL significantly more sample efficient
• LWR stuck in obstacle
[Plot: Shopping-Cart task, noise-free; reward vs. step for TRUE, PBRL, OOLWR, and LWR.]
59
PBRL Results: Online Performance
• PBRL capable of fitting multi-body models
• OO-LWR intractable
[Plot: Apartment task, Gaussian noise (σ = 1.25); reward vs. step for TRUE, PBRL, and OOLWR.]
60
The problem with other carts
• Same idea, but the constraints are different...
61