>> Dan Bohus: I’m very happy to have Stefanie Tellex visiting us here today and giving a talk. Stefanie has a Ph.D. from the MIT Media Lab. She’s done work at CSAIL also, I think. She’s done fabulous work on grounded language and probabilistic models. Now she’s at Brown, and she’s going to tell us more, I think, about some of her more recent work on human-robot collaboration.
>> Stefanie Tellex: Thanks Dan. Alright, we’re a small audience so if there’s questions please feel free to
stop me, as we go.
Robots are really exciting today because we’re seeing a lot of success where robots are operating in
complex real world environments. This is an example of a robot I worked a lot with at MIT. It’s our
robotic forklift. It can drive around in warehouse environments. It can sense objects like pallets and
trucks, and people. It can autonomously move things around in these environments. It can pick up
pallets and unload trucks, and stuff like that.
We also have robots that can assemble IKEA furniture. This is an example of a team of robots
collaborating to do that with a person, developed by Ross Knepper and his collaborators at MIT. They
can make a plan to assemble the complex thing. Like a table in this case out of smaller parts. Multiple
robots work together in order to carry out this task.
Then we have robots like the PR2. This is my son with the PR2 back when he was littler. Some of you
might have interacted with him at the MSR Workshop. This was right around that time. He’s now about
this tall, two years old.
But we’re also imagining that there’s going to be robots operating in our homes, in our hospitals, in our
offices carrying out complex tasks. Things like helping us cook or fetching objects for us, or maybe
interacting with our kids and taking care of them.
As we have robots that operate in complex real-world environments, it’s really important that we create ways for them to not just interact with people. People talk about human-robot interaction. But like, you could interact with a rock, right?
So I think we should try to go beyond just interaction and think about collaboration. How can the robot and the person collaborate so that the robot can meet the person’s needs? So the robot can do what the person needs it to do, and help them solve problems in these real-world environments.
What I’m going to talk about today is three sort of thrusts that we need to address to achieve this vision of robots operating in our daily lives. You know, Bill Gates said we need a computer on every desk. Then later he said something like robots are going to be in every room, in every house.
Everyone’s going to have a robot in the next twenty years, I think. Just the way our computers have
moved from a big mainframe to a desk, to the cell phones in our pockets. We’re going to see these
platforms get smaller. They’re going to get cheaper. They’re going to get more reliable. They’re going
to be part of our daily lives. I think it’s really important to create mechanisms for them to collaborate
and operate with people.
The first thing to talk about is making robots that can actually operate in these real world environments.
Because we don’t just want them operating in sort of fixed, prepared work spaces. We want them in
our homes and our hospitals so that they can do things to help us.
Then a second challenge is, once you have a platform that’s capable of a large space of possible actions in complex state spaces, there’s a sort of AI planning problem. How do you figure out what to do? If you have, you know, ten things you could do at every time step, and you’re trying to plan over the course of a day the things you’re going to do that help the person, then the planning problem, the branching factor and the search depth, is huge. People effortlessly use language
to command you to do things at all levels of this hierarchy. For example, for the forklift people would
say things like “unload the truck,” which is a very high-level action that might involve, you know, ten or twenty pick-and-place actions. In the next breath they’ll say move backwards about six inches, you know, move out of my way, tilt your forks back a little bit.
These are examples from human interaction. People have no trouble with that. Like they kind of jump
up and jump down. But for a robot or a planning algorithm if you give it that low level state action space
you’ve basically made the high level action like unload the truck impossible to find. Because the search
depth and the branching factor are huge.
Yet if you take out the low-level actions, so that you can plan and operate in some more abstract space, then it’s just not possible to move forward six inches, because there’s no action to do that. We’re trying to create algorithms that can efficiently plan and operate in these types of state-action spaces.
Then finally, once you can operate in real-world environments and you can plan in them, plan complex actions, then you have to be able to figure out what people want. The last thing I’ll talk about is coordinating with people: observing what they’re doing, observing the gestures that they’re making, the words that they’re saying, the actions that they’re taking, and then trying to infer what they want so that the robot can carry out actions that help them. In particular, I’ll talk about our work towards establishing a social feedback loop between the human and the robot.
It’s inevitable, when you’re interacting with a person, that because of perceptual uncertainty and the size of the action space, the robot is going to fail to understand them. What I think we have to do is kind of embrace that, and think of mechanisms for the robot to actively detect these failures to understand and take actions to correct them.
Also to work with the person to convey what they don’t understand. What specific parts of what the person is saying and doing do they get? What parts do they not get? They’re not just saying I don’t understand, I don’t understand, I don’t understand to the person. You’re actively, you know, zeroing in on those parts that the robot didn’t get, so the person can provide targeted information to get the robot unconfused so that it can move on, and go on and do the thing to help the person.
As a sort of motivating example, this is a scene from my parents’ pantry in the kitchen. What I’m thinking a lot about as a sort of base capability for a robot is pick and place: picking something up and moving it somewhere else in the world. You might imagine a robot that’s assisting you at cooking a meal, for example. You’re stirring and you don’t want the onions to burn, so you say, oh, can you please go and get me the pepper. Or you’re making a soufflé and it’s a delicate folding operation and you need the parmesan cheese. You say, please hand me the parmesan cheese, and it goes off and finds it, picks it up, and hands it to you.
Object manipulation in robotics is a very challenging unsolved problem. Robustly knowing about all of the objects in a scene like this, knowing what words people use to describe them, and then more than that, doing pose estimation to figure out where to put your gripper, is a really hard problem to solve in robotics.
To approach this problem, so here’s an example of handing something off, like a tool, like a knife for example. The way that we’re approaching this problem is by taking an approach inspired by robot mapping. The real problem that you’d like to solve is: you bring a robot into an environment that it’s never been in before, some kitchen, have it look around and see stuff, then say hand me the knife or hand me the eggs, or something, and have it work.
But we think that’s too hard for right now. Instead, what we’re doing is saying the robot’s going to come into an environment, it’s going to be handed some objects, and it gets a chance to explore, build maps, and learn about those objects. It can do that for a while. Then, after that exploration and data collection process is complete, we’re going to try to manipulate the objects and interpret commands about these objects.
In order to do this you have to know what the object is. For example, this is a ruler; you might want to know a label like “ruler.” You want to know which object it is so you can load up all the information that you’ve collected about that object. You need to know where it is in the environment. Here we’re going to label it with pixels in the image that are sort of associated with this object.
But you also need to know where it is in the real world, because you need to move your gripper somewhere to try to pick up that object. Then finally you need to know where you’re going to actually put your gripper on the object. In this example, for this ruler, a good place to grab is in the middle because it’s kind of heavy. If you grab it near the end it’s going to slip out of the robot’s grasp.
Existing approaches to this problem fall into sort of two categories. The first is category-based grasping. The idea is that you look at the object with your sensors, then you try to infer what category it is and where you can grab it. There’s lots of papers about this. It’s a very interesting and hard problem. They have tables of results.
But when you actually try to run it on your robot with your objects, well, we do a lot with objects that are used in childcare, like diapers and Vaseline, and nobody’s tested these models on those objects. They don’t work and they need more training data. Then, you know, you’re sort of in this problem of trying to get somebody else’s code.
>>: Do they run into problems on the glasses and dishes that they were trained on?
>> Stefanie Tellex: Yeah, so numbers in the papers range between eighty and ninety percent. The standard right now is you pick your favorite objects and test on them. One paper will evaluate on their favorite objects. Another paper will evaluate on a different set of objects.
That’s just changing, so in the past month or two the community who’s working in this area has created a standard set of objects, which you can order. I’ve ordered them. They offered them to me early and I still haven’t gotten my set. It’s got about a hundred objects, so if we all have the same hundred objects then we can all evaluate on that dataset.
>>: When you evaluate you’re literally picking it up?
>> Stefanie Tellex: Yeah.
>>: You’re not just like showing on different experiences?
>> Stefanie Tellex: Different papers do different things. Sometimes in these papers you’ve got to read between the lines.
>>: But it’s the same object area…
>>: [indiscernible]
>> Stefanie Tellex: But in many of them you’re actually picking things up. In our work we are too.
>>: This area is so sensitive that if you have the exact same set but different brands then nothing works.
>>: Yeah.
>> Stefanie Tellex: Yeah.
>>: It’s same you know diapers that are like, Vaseline.
>> Stefanie Tellex: Yeah.
>>: But you switch…
>> Stefanie Tellex: Yeah.
>>: Companies making them and then…
>> Stefanie Tellex: Yeah.
>>: Alright, sorry it’s not close enough of that.
>> Stefanie Tellex: Yeah, exactly. It’s very frustrating.
>>: Yeah, you have like the diaper faction and the plate faction of research and…
[laughter]
>>: [indiscernible] way about the pant that was supposed to [indiscernible].
>>: Right.
[laughter]
>> Stefanie Tellex: Yeah, yeah, and of course we all want that category capability. Like, I want to just pull out a diaper and it should know it’s a diaper. It should know how to pick it up and even unfold it, and all that stuff. But, okay, so that’s category.
A second approach that people take is instance-based. The idea is that you collect images. You have your diaper, it’s Pampers, and you collect images of it. You take pictures of it or you put it on a turntable. People will do that and then turn it around and maybe build a 3D mesh of the object.
Then sometimes you can use that to propose grasps, if you have an accurate geometric model; you can get some proposed grasps. There’s a lot of work on that. Or you can annotate: you can just say I want you to grasp this object here, that’s a good place, I’ll just tell you. It works: if you collect this data you can detect the object, because you know exactly what it looks like, and if you annotate the grasp you will be able to pick it up.
But the problem is the data collection. You need to collect this data. That’s not something you can really expect an untrained person to do, or even be willing to do. Rethink Robotics makes a robot called Baxter that’s designed to operate on factory floors. They have a way for a person to collect data with the robot, where they push some buttons to say here’s the object I want you to pick up.
Nobody uses it in the real environment, they told me, because they don’t understand why the robot can’t see the object in the first place. Our approach is to take the instance-based approach. We’re going to collect these images. But instead of manually collecting images, we’re going to use a robot to collect them.
The contribution of this first part is an approach for automatically collecting the instance-based training data, so that we can identify the object, localize it, and then pick it up. We’ll actually practice picking it up. We will propose grasps; we have a heuristic method for proposing grasps.
But you can plug in your favorite method for automatically computing grasp points. We will try them out, and if it works, that’s great, we’ll be happy. If it doesn’t work, we will do some math and figure out where the next place to go is, until we find one that works. Then at that point we’ve empirically shown, through practice, that we found a good grasp. Then we can pick up that object.
Okay, so here’s what it looks like. I pushed my button but it didn’t go. I’ll just use my, no, okay, so here’s what the scanning process looks like. This is our robot Baxter. We wanted to use only the sensors that come with Baxter. What you’re seeing is our one-pixel Kinect: this robot has an RGB camera in the hand and also an IR range finder, which gives you one point of depth data. At the beginning it was doing this raster scan in order to get a point cloud of the object. Then it’s moving around to take pictures. Now it’s practicing. This is trying different grasps: servoing in, localizing the object, and then trying different grasps at different locations on the object, and practicing.
Then at the end, let’s see, so here’s what the pictures look like that it collects for the ruler, different crops; we crop out the background. We get this raster IR scan. We only need to scan at training time to propose grasps. We don’t need it at pick time or it’d be way too slow. We use just vision to pick.
After we get this data, then, let’s see, sorry. This is just showing, after the process is complete, picking these two objects. The scan that we do takes about fifteen minutes, mostly because of the silly depth scan that we have to do. But it runs on a stock Baxter, so we don’t need any additional sensing on Baxter in order to do this.
We just bought a depth camera that we’re going to mount on the arm to try to speed that up. We’re also going to try to explore stereo. You can make virtual stereo because you can move the camera, and then try to get point clouds more quickly from a monocular camera. But anyway, for now we’re doing this scanning-based approach.
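As a rough illustration of the scanning routine just described, here is a minimal sketch in Python. The interface names (move_wrist_to, read_ir_depth, capture_rgb) are hypothetical placeholders, not real Baxter driver calls; the point is just the sequence: a slow raster sweep with the one-point IR range finder to get a training-time point cloud, plus RGB views from the wrist camera.

```python
# Hypothetical sketch of the training-time scan, using only a stock Baxter's
# wrist RGB camera and its single-point IR range finder. Interface names
# are placeholders, not real driver calls.

def raster_ir_scan(arm, x_positions, y_positions):
    """Sweep the wrist over the object and build a sparse point cloud,
    one IR depth sample per wrist pose (this is the ~15 minute part)."""
    cloud = []
    for x in x_positions:
        for y in y_positions:
            arm.move_wrist_to(x, y)
            z = arm.read_ir_depth()       # one point of depth data
            cloud.append((x, y, z))
    return cloud

def collect_rgb_views(arm, camera_poses):
    """Move the wrist camera around the object and keep the RGB crops;
    these are all that is needed at pick time."""
    return [arm.capture_rgb(pose) for pose in camera_poses]
```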
This is cool. It works on a lot of objects. You can basically hand the robot an object, we run our scan,
and at the end of it most of the time you can pick something up. Yeah?
>>: What about clothing, or things that change shape all the time?
>> Stefanie Tellex: Yeah, so we’ve tried it with diapers and it does work.
>>: Infants?
>> Stefanie Tellex: What’s that? Infants, yeah.
[laughter]
Yeah.
>>: I think that’s…
>>: [indiscernible]
>> Stefanie Tellex: Yeah, so we are using, one of the advantages of taking this instance-based approach is that you can get away with really simple computer vision. We’re using SIFT bag of words to detect the object. To localize it, we literally, we [indiscernible] in and take a picture of what it looks like when you’re right on top of it. We memorize the pixels. We compute the gradient of that, and then we measure the error between what we see right now and what that picture looks like, pixel by pixel.
It’s nothing fancy, but it works quite well because we have the data, which we were able to collect automatically. The question about clothing, well, maybe. My student keeps wanting to do HOG, Histograms of Oriented Gradients, which will capture the shape properties of objects. But then that probably will not work on something like clothing that’s deformable. But maybe the SIFT features will.
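A minimal sketch of this kind of simple instance-based recognition and localization, assuming OpenCV; this is my reconstruction of the idea (SIFT bag-of-words style matching to identify the object, plus a memorized top-down template compared gradient-wise, pixel by pixel), not the lab's actual code.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def recognize(query_img, known_objects):
    """known_objects: dict mapping object name -> list of SIFT descriptor
    arrays collected during the scan. Returns the best-matching name."""
    gray = cv2.cvtColor(query_img, cv2.COLOR_BGR2GRAY) if query_img.ndim == 3 else query_img
    _, q_desc = sift.detectAndCompute(gray, None)
    scores = {}
    for name, descriptor_sets in known_objects.items():
        good = 0
        for desc in descriptor_sets:
            pairs = matcher.knnMatch(q_desc, desc, k=2)
            good += sum(1 for p in pairs
                        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance)
        scores[name] = good
    return max(scores, key=scores.get)

def template_gradient_error(current_view, stored_top_down_view):
    """Pixel-by-pixel error between image gradients of what the hand camera
    sees now and the memorized picture taken right on top of the object."""
    g0a, g1a = np.gradient(current_view.astype(np.float32))
    g0b, g1b = np.gradient(stored_top_down_view.astype(np.float32))
    return float(np.mean((g0a - g0b) ** 2 + (g1a - g1b) ** 2))
```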
We’ll have to have ways of like deciding which detection and classification algorithms work the best for
individual objects. Okay, so this works on lots of objects. It does not work on all objects. Here’s an
example of, wait I should put this down, I guess the battery died. Here’s an example of that ruler that
I’ve been using.
Here it inferred a grasp based on the geometry of the ruler that it got from the point cloud. What it does is it picks it up near the end, because it just looks like that’s going to work. You know, it’s symmetric, so none of those grasps is particularly better than the others, according to the grasp model that it’s using.
It’s going to do a shake now to make sure it’s got a good grasp. Once it does the shake it pops out, because it wasn’t a very good place to grasp; the friction of the gripper with the weight of the ruler wasn’t very good. Maybe if we had a better grasping system with a better model of physics, a category-based approach could have predicted that. But then there would have been some other problem.
>>: It’s not mechanically if you picked it up from the top...
>> Stefanie Tellex: What’s that?
>>: If the grip had been from the top it would have…
>> Stefanie Tellex: If you go to the center, I’ll show you, it will work with this particular gripper.
>>: Is it a better model of physics? Do you have any model of physics in the [indiscernible]?
>> Stefanie Tellex: We do not have any model of physics right now.
>>: “Better” makes it sound like you have something, but…
>> Stefanie Tellex: Yeah, so if we had a model of physics, maybe we would learn to go to the center. Other things that happen: you have transparent objects, they don’t show up very well in IR, and we hallucinate all these crazy grasps.
>>: Has anybody done physics? Any physics analysis of the…
>> Stefanie Tellex: Yes, yes, so there’s a program called GraspIt! that basically simulates grasps using something about physics, and shapes, and stuff. People are exploring that, but whatever you choose, something’s going to break. This is what’s breaking in the one that we’re using.
>>: I was just wondering how much common sense you’d need in that kind of thing. Imagine representing static coefficients of friction between…
>> Stefanie Tellex: Yeah.
>>: Robot hand and object. You could imagine notions of swivel and center of mass.
>> Stefanie Tellex: Yes.
>>: It would be those two things. I would pull out a deeply physical model but imagine getting rid of a
few things.
>> Stefanie Tellex: Yeah, so Josh Tenenbaum talks about the video game Engine in Your Head.
>>: I know, yeah.
>> Stefanie Tellex: You know, I think that that’s probably the right way to do this. We have some kind of representation of physics in our brains, I think, that we’re using to do this.
This is something much simpler. We have a proposal system which could use physics or not. Then we’re
going to just try different grasps and pick stuff up. Now, yeah?
>>: Sorry, one place where common sense seems important is fragility.
>> Stefanie Tellex: Yes.
>>: Where with a glass you don’t want to squeeze it too tight. But if it’s cloth it doesn’t really matter, probably.
>> Stefanie Tellex: Yes, so we are mostly using objects from my son’s toy chest for a reason.
[laughter]
Because we want the robot to get this through experience and if you give it something fragile then it’s
not so good. Yes?
>>: Does the robot know not to pick up your son?
[laughter]
>> Stefanie Tellex: Not right now. You can see from the construction of the table we sort of keep
people away through the furniture in the room. I’m excited to talk to Dan because we can do person
tracking and stuff. We could be a little smarter about how safe we are. This is not a mobile robot. I
really want a mobile robot. Like my next robotic purchase is going to be a mobile robot. Then we’ll
have to be more worried about…
>>: How much force is the [indiscernible]? Is it constant or is there feedback?
>> Stefanie Tellex: Yeah, we keep the force constant right now. It doesn’t, it’s basically…
>>: Does it have the feedback of [indiscernible] check is giving as you grab it?
>> Stefanie Tellex: We know, yeah, yes, we know the position of the gripper and how much force it’s exerting. We use that right now to detect if we have successfully picked up the object. It knows it’s succeeded if it’s achieved force closure, it shakes it, it still has something in the gripper, then it puts it down, and then nothing’s in the gripper, because some stuff gets stuck in the gripper when it tries to put it down, like if it’s deformable or something it gets stuck.
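A minimal sketch of this success test, under my assumptions about a generic gripper interface (has_force_closure, shake, gap, and so on are placeholders): check force closure after the pick, again after the shake, then confirm the gripper is actually empty after the release.

```python
# Hypothetical success check for one grasp trial: pick, shake, put down.
# The gripper interface names are placeholders, not a real API.

EMPTY_GAP = 0.002  # assumed gap (meters) when the closed gripper holds nothing

def grasp_trial_succeeded(gripper):
    if not gripper.has_force_closure():     # did we actually grab something?
        return False
    gripper.shake()                          # stress-test the grasp
    if not gripper.has_force_closure():      # object popped out during the shake
        return False
    gripper.place_and_release()
    gripper.close()                          # try to close all the way
    return gripper.gap() <= EMPTY_GAP        # empty gripper => clean release
```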
Some objects, one of the ones in our evaluation was a banana, it was actually a Styrofoam banana because we didn’t want to break a real one. If you picked it up in the middle it would always get stuck. But if you did it near the end, where it was a little bit narrower, then it wouldn’t. It would learn to go near the end.
It was basically this long table of stuff, any one of which you could fix, and people work on, and there are papers about, probably. For the IR there’s papers about transparent objects and how to handle them. But our idea is, on top of all of that, we’re just going to try stuff and keep track of what works and what doesn’t work.
For the ruler, when we do this before the training process happens, you drop the ruler. And here’s some examples of objects that aren’t visible in IR: this one is transparent. This is a salt shaker; it’s kind of hard to see in the light. This one is shiny, and that’s a terrible place to grasp because of the reflections, even though in the point cloud it looks awesome. Here the handles are where you want to grasp, but they’re black, so they absorb the IR light and you can’t see them.
We actually learned to move from here to here on this object from experience. To do this we formalize it as a bandit problem in RL, so it’s like an N-armed bandit, where the arms aren’t the robot’s arms; the arms are different grasp points. Each grasp point and orientation of our gripper is an arm with an unknown payout probability. You get a payout if you pick it up, shake it, and put it down successfully.
There’s some discretization. What we’re doing is best-arm identification. We want to find the best arm given a budget of training trials. You get to play with the object for a while, and then at the end of that time I want you to tell me the best arm. Yes?
>>: Does this discretization imply that the arms are sort of spatially related to each other?
>> Stefanie Tellex: Yeah.
>>: I [indiscernible] in the problem and you’re just bandits…
>> Stefanie Tellex: Yeah, so in this, right, there’s one more thing that you want to do. We don’t have time to try a thousand seven hundred arms. That would take a really long time; each grasp right now takes about thirty seconds to do. We have our model that proposes grasps. We interpret that as basically a score on all of those possible grasps. We try them in order. We try one until we’re confident above a threshold that either it’s a good grasp, and then we stop and say we’re done with this object, we have a good grasp; or we’re confident that it’s a bad grasp, and then we move on to the next one in the sorted list. Does that make sense?
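Here is one way that “try proposals in order until a confidence threshold is crossed” loop could look; this is a sketch of the idea using a simple Hoeffding bound, with thresholds I made up, not the published algorithm.

```python
import math

GOOD = 0.8    # assumed success rate above which we call a grasp good
BAD = 0.3     # assumed success rate below which we abandon a proposal
DELTA = 0.05  # confidence parameter for the bound

def radius(n, delta=DELTA):
    # Hoeffding-style confidence radius after n Bernoulli trials
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def pick_best_grasp(proposals, try_grasp, budget=100):
    """proposals: grasp candidates sorted by the heuristic score, best first.
    try_grasp(g) runs one ~30 second pick/shake/place trial, returns bool."""
    for grasp in proposals:
        successes, trials = 0, 0
        while budget > 0:
            successes += int(try_grasp(grasp))
            trials += 1
            budget -= 1
            p_hat = successes / trials
            if p_hat - radius(trials) > GOOD:   # confidently good: stop here
                return grasp
            if p_hat + radius(trials) < BAD:    # confidently bad: next proposal
                break
    return None   # budget exhausted without a confidently good grasp
```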
Because the scoring function takes into account spatial information, the model does. We are talking to
Michael, do you know…
>>: Like the robot, sorry, if your algorithm proposes two grasps that are very close spatially to each other.
>> Stefanie Tellex: Yes.
>>: Are you going, like if I have five grasps that are all close here.
>> Stefanie Tellex: Yeah.
>>: I mean, if I figure out that this one doesn’t work, the other ones are likely not to work, so I should look somewhere else.
>> Stefanie Tellex: Yeah, show…
>>: Is that…
>> Stefanie Tellex: We do not do that. I can give you examples of objects where that intuition is false, basically.
>>: Sure, sure.
>> Stefanie Tellex: Especially, so, I don’t think I have a picture of it, but there’s a vanilla bottle, like little bottles, where it needs to be just right on top.
>>: Unique, yeah.
>> Stefanie Tellex: Like it moves over one cell basically and finds it. That said, there’s definitely spatial information. Michael and [indiscernible] are talking about thinking of this as a continuous problem, where you could basically have some kind of prior, some kind of Gaussian process kind of thing, that would capture some version of that. Because there are definitely examples where it’s true that things change locally with some smooth model, but there’s sometimes a discontinuity and then you’ve got to sort of explore and find them. Yes?
>>: You’re looking at instances. But are you giving any thought to generalizability of features and
geometry or something?
>> Stefanie Tellex: Yeah, so maybe, so wait, this is just showing that after we do the learning we pick up this ruler in the middle, because we tried different grasps and that one worked better. Now it kind of survives the shake. At this point we are aggressively instance-based, because I think that it’s really important to make something that works, to actually be able to pick up these objects.
I was going to show the shake. He’s going to smile, oh; he’s going to smile at me.
[laughter]
He puts it down. He’s like, yes, I got it, the reward function works. The thing is, if we are able to collect this data, and we collected this data autonomously, it gives us the potential to scale up the whole data collection effort. This is sort of showing our evaluation, practicing picking up these different objects.
What we’re doing now is gearing up to collect a lot of data about a lot of objects. This is showing, I’m never sure whether to show the music, the soundtrack to this video is like In the Hall of the Mountain King.
[laughter]
>>: How important is it that this is done in the real world? I mean, can you do this all in CG?
>> Stefanie Tellex: I think it’s very important to do it in the real world if you want it to work. If we had a perfect simulator, then yes, you could do it in simulation. If you had a perfect simulator, the problem would also be solved, because you could put the world in your perfect simulator, try stuff in simulation really, really quickly, and then go do it. You could do lots of learning. You could do deep learning. The problem is getting that perfect simulator.
>>: Yeah.
>> Stefanie Tellex: I think that there’s irreplaceable information in the real world that lets you build the simulator. You know, if we’re all supposed to be doing physics, that’s great; then let’s collect lots of data and assess whether our physical models are good and how we pick the right constants.
How are we doing friction? What is the interaction between the rubber tape that my student likes to use on the gripper to give it more friction and this particular plastic ruler? That’s not something that’s easy to put in a simulator.
>>: [indiscernible]
>> Stefanie Tellex: This is showing, sort of quantitatively, that before we do this learning process, on our particular dataset of objects that we picked, we pick up half of them, and after learning we get seventy-five percent. If you saw the video go by, this is sorted from worst to best.
From about here on we’re at eight out of ten picks or better. Some of these are really hard. It’s kind of bimodal: there are some really, really bad ones and a bunch of really good ones. We know what worked and what didn’t work. So we, yeah?
>>: That’s interesting because it’s not completely obvious to me what is…
>>: What?
>> Stefanie Tellex: Yeah, I can answer questions if you want to know.
[laughter]
This we filled with water.
>>: Okay.
>> Stefanie Tellex: To make it harder, because this one, and this was also partly selected by us to exercise the system. We were trying to get a diverse set of objects, but we wanted it to show that it could improve. This one, it’s a Sippy cup, its first grasp was the lip of the cup and that worked great, so I was like, oh, it’s not going to learn, that sucks. We filled it with water. I hoped it would learn to go to the handle to get a better grasp, but it did not. It slips out a lot when it’s filled with water. Yeah?
>>: You said that the silver object with two handles was very hard. But it was the right hand side. Yes,
this one.
>> Stefanie Tellex: Yeah, so this is one that moved.
>>: It what?
>> Stefanie Tellex: It moved. I should make this picture with the before: it started out hard, and after the learning we learned to go on the handle. Now we can do it pretty well. Yeah?
>>: It has two arms. Do you use both arms in some cases, hands?
>> Stefanie Tellex: You can see in the video it’s like you know it’s practicing with both arms. Right now
we treat them independently. But I’m really interested in learning things like, so there’s our vanilla
bottle. Like unscrewing this bottle so you need to hold it with one arm and unscrew it with the other
arm.
>>: I was thinking with that big plastic thing it would be easier with two hands and what not.
>> Stefanie Tellex: Yes, the big plastic thing was actually a stupid bug. When we test that we’ve released the object, we basically close the gripper and make sure that we can close it all the way. We move up to a particular height and then do that test. We didn’t move up tall enough, so it thought that it couldn’t put it down. It actually picked up that object pretty well off the bat, except that the reward function test wasn’t very good because we didn’t move up high enough. But we didn’t want to take it out, because that didn’t seem right.
Okay, so, in terms of using two arms. What we’re doing now is trying to scale this whole thing up, to collect lots and lots and lots of data. There’s the COCO dataset that you guys have, where people take pictures and then annotate them, and then train computer vision algorithms on it. But that dataset consists of a human photographer selecting the image. Then you get to see each object one time. There are probably exceptions to that, but most of the time you see each object in that one frame and then it’s gone forever and you never get to see it again.
In this dataset, the data that we’re collecting, you get to see each object from multiple views and multiple different angles. You get to pick it up and know empirically, for this robot with this gripper, what its pick success rate was.
Our goal is to scale up the data collection, to collect a corpus that’s like this for a million objects: a computer-vision-scale corpus where we really know a lot about each one of the objects in the corpus. Imagine having access to that dataset and then trying your favorite deep learning algorithm to do some of these tasks, like grasp point prediction, when you have this empirical success data. Or trying to learn a model of physics: you can actually test empirically how well it predicts what you’re observing in the dataset from this experience in the world.
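To make that concrete, here is a hypothetical per-object record for such a corpus; the field names are illustrative only, not a published schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GraspTrial:
    grasp_pose: Tuple[float, float, float, float]  # x, y, z, wrist angle
    success: bool

@dataclass
class ObjectRecord:
    label: str                    # e.g. "ruler"
    robot: str                    # e.g. "Baxter, stock electric gripper"
    rgb_views: List[str]          # paths to images from many angles
    point_cloud_path: str         # the training-time IR raster scan
    trials: List[GraspTrial] = field(default_factory=list)

    def pick_success_rate(self) -> float:
        """Empirical pick success for this robot/gripper on this object."""
        if not self.trials:
            return 0.0
        return sum(t.success for t in self.trials) / len(self.trials)
```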
I think it will give us a new basis from which to approach all of these problems which we all care about, like a trained category-based method that can come in and see a room that you’ve never seen before. In the meantime, by taking an instance-based approach, at the cost of the robot’s time, not your time, you can have some of this robustness now using these instance-based algorithms.
>>: If you wanted to go large scale have you thought of just going to Mechanical Turk with an image and
saying if you were to grasp this object with the two fingers, where would you place your two fingers?
>>: I was just going to say that, to get the language data tied to the where you’re…
>>: Yeah, just [indiscernible]…
>> Stefanie Tellex: Yeah, so people do that. Like [Indiscernible] Nova has a crowd sourcing process
where people actually move the robot’s gripper around and then…
>>: You don’t even have a robot. If you were human and you were to pick up that object with two
fingers. Where would you put your two fingers?
>> Stefanie Tellex: Yeah, where would you pick it? Where would you…
>>: Where you show in the video this successful pick up and ask where, what part of it was being
grasped. Just…
>>: You can do that too. But I’m just saying how do you like a hundred thousand training examples or
something like that? You know like quickly, with different…
>> Stefanie Tellex: That was like grasp points and stuff, yeah.
>>: With a variety of objects.
>> Stefanie Tellex: Yeah, I see, but the question is then, what are the dimensions of the object?
>>: You have the video of that…
>> Stefanie Tellex: Of a hundred thousand objects…
>>: No, no, you could do research for a lot of small objects. You know just you know.
>>: I see you’re talking about just abstract objects…
>>: Say you go to ImageNet and you get all the small objects on ImageNet. Then you send them to
Mechanical Turk and say find the…
>>: Right.
>>: You know what are the two points…
>> Stefanie Tellex: Annotate the grasp points for me.
>>: Yeah.
>> Stefanie Tellex: Yeah and I’m like take some subset…
>>: [indiscernible]…
>>: Different version of the…
>> Stefanie Tellex: Yeah, I think that would be cool, like we should probably do that too.
>>: I have no idea because like…
>> Stefanie Tellex: Yeah, I’m not aware of; the closest thing I’m aware of is somebody who
[indiscernible] Nova at WPI who crowd sourced grasp point annotation. They actually tested it on a real
robot. They had the person drag the gripper in a simulator on Turk. Then the robot went in and tried to
pick it up and grasp it.
>>: [indiscernible] crowd sourcing on folks that have worked with real world simulations. Like how to
pack a truck and things like that that are kind of related with you know…
>> Stefanie Tellex: Yeah.
>>: People at a distance and try to sort of figure it out how to do physical things in coordinated ways.
It’s interesting to think about that, it’s…
>>: Yeah.
>> Stefanie Tellex: Maybe there’s something irreplaceable. I don’t think that’s a bad idea but I think
what we’re trying to do…
>>: Question.
>> Stefanie Tellex: What we’re trying to do is get.
[laughter]
I don’t know like I think, like I was imagining that we would do this.
>>: Well it’s not really a bad idea.
[laughter]
>>: It’s mediocre.
>> Stefanie Tellex: But it doesn’t solve the problem that this dataset would, because you would get the result of actually doing stuff in the world.
>>: Yes.
>> Stefanie Tellex: For example, we could do that. But then you have to somehow sub-sample some of those objects and try those grasps to know how good those annotations were. If you had this dataset then you could…
>>: [indiscernible] then you have a 2D image and then you can predict where the two grasp points are. Then, how well that would work I don’t know.
>>: Okay, maybe that would get you part of the way there. How about we do this: find out where we’re failing.
>>: Exactly.
>>: Then have select learn when we get a human being…
>>: Or not even failing in the real world, not even failing, but when you’ve done it in a weird way. Like, you probably have lots of successful grasps that are not like a human would do at all.
>>: It’s a changing process.
>>: Another thing is humans might give you a lot of different you know possibilities.
>> Stefanie Tellex: Yeah.
>>: Like one human grasps here and another human over there. They’re all different possibilities.
>> Stefanie Tellex: Yeah and so in that perspective you can take that as your grasp proposal. Give us all
of your grasp proposals and we’ll try them all and tell you which ones are good and which ones are bad.
>>: What about the Sippy cup?
>>: What about the Sippy cup?
>>: Yeah, you would never, no human would pick up like a toddler would. But a human generally would
grab the handle at the same…
>>: Yeah at [indiscernible] next year.
>> Stefanie Tellex: What’s that?
>>: We expect an HCOMP with it.
[laughter]
>> Stefanie Tellex: I mean, I think that if you have this dataset where you have the Sippy cup from multiple different angles. That’s the other problem: you only get to see the object from whatever view happens to be in ImageNet, and that’s probably not the view your robot’s seeing, because your robot doesn’t really know how the world works and it’s not very good at taking these pictures. If they say pick up the Sippy cup, I want to know it’s a Sippy cup even if I see it out of the corner of my eye and it’s occluded.
>>: [indiscernible] product shots don’t have the variety but you can put things in that data.
>> Stefanie Tellex: Yeah, that’s right.
>>: Whereas COCO is like real world, dogs doing this.
>>: It could be the objects in the images are at a lower resolution. Then I’d get worried as well.
>>: Dogs…
>>: Really what you want is high resolution images of a lot of random objects at different angles, which, people don’t take photos like that…
>> Stefanie Tellex: Yeah, people don’t take photos like that. But our robot does and can. This is a robot
that is made by Rethink Robotics. There’s three hundred of them in the research community right now.
>>: Is this important for your data set?
>> Stefanie Tellex: Most of them are sitting there doing nothing most of the time, these robots, these Baxters. It’s sad but true. What we’re doing right now is scaling out to collect that dataset using this software infrastructure. The idea is everybody’s robot, at night when you’re not using it, is scanning objects. You hand it a box of stuff and it’s doing this stuff. Maybe a lot of the objects it won’t be able to pick up, but some of them it will.
It’ll collect these images and upload them to our database server. Then, I want it to be the case that you could hit it with crowdsourced data too; you can edit labels, like the parts of the object and stuff.
>>: You said before that CG data is not good because we don’t have a good model of the friction and all that other stuff, right. However, let’s say we have a lot of CG models of small objects. We can render images, show them to humans, and say, where would you pick these objects up? Use those as training examples to then predict, on real images, where those objects should be picked up. As long as the images look similar enough and are rendered at high enough quality, that might actually transfer.
>> Stefanie Tellex: Yeah, people work on SketchUp data and trying to transfer it. I’m skeptical, because, I mean, how can that compare to this, where it’s the real pixels, you know.
>>: Well there…
>> Stefanie Tellex: It’s the stuff that you, you know.
>>: I think it’s here that you’re using the humans to kind of predict the, you’re not doing any physics,
right.
>> Stefanie Tellex: Yeah.
>>: You’re just using it to generate these images that we can’t gather just by, you know, crawling Flickr.
>>: Yeah, it seems to me that what you really want, though, is humans picking up the same objects and narrating what they’re doing. That’s the crowd sourcing that I would want, to compare with what you’ve learned as the right grasp point. See what’s natural for humans. How do they refer to that?
>>: Yeah that would be ten percent more money than that, yeah.
>>: [indiscernible], yeah. But it’s kind of what you want, right, that parallelism.
>> Stefanie Tellex: Yeah.
>>: Have you tried looking for, like, fingerprints on real objects, where people pick them up, or…
[laughter]
>>: I like that, yeah.
>> Stefanie Tellex: Yeah, we haven’t done that.
>>: Do you think there’s also a difference, I don’t know how much difference there is in, you know, our dexterity and your gripper, and whatever…
>> Stefanie Tellex: We’re way better.
>>: To tell you how…
>> Stefanie Tellex: We’re way better.
[laughter]
>>: The question is the trying…
>> Stefanie Tellex: That’s another question about human, like the way that a human would do it is not
necessarily the best way for our robot to do it.
>>: [indiscernible]
>>: Yeah.
>>: But you could always put a camera on a cashier.
>> Stefanie Tellex: Yeah at a grocery store.
>>: Yeah at a grocery store.
>>: Yeah that’s a good idea.
>>: You know or even like at Wal-Mart or whatever.
>> Stefanie Tellex: I was just at Amazon and they have cameras on all of their pickers who are pulling things off the shelves to put in boxes, sort of seeing what they’re doing with the stuff. But I don’t know; as an engineer who wants the robot to work, this is something you could run on your robot and it will work.
You know, there aren’t any of these transfer questions, because you’re looking at your pixels and your motor torques and doing this learning; you can pick and place, right. Then as a side effect, or maybe a target, you can also collect a lot of data. I think it is a unique dataset. In terms of having multiple images of objects from multiple angles, plus grasp success rates, it tells you stuff that I don’t think you’re going to get from SketchUp data.
You don’t know how much things weigh. You don’t know the friction. You don’t…
>>: The squishiness of bananas.
>> Stefanie Tellex: This squishiness of bananas.
>>: Is that where you like…
>> Stefanie Tellex: If you train a deep learning model, you might actually be able to predict: if I pick it up here, will it slip? You’ll have lots of these labeled examples: I tried to pick it up here and it slipped; I tried to pick it up here and it didn’t slip.
You can imagine training some kind of thing that learns that. I don’t think you can do that from
SketchUp data for example. Yeah?
>>: Oh, I was just going to I guess mention more about like this idea of training with humans. I was
thinking that it might be possible to have like children play with toys.
>> Stefanie Tellex: Yeah.
>>: You could do this in lab. Like this looks like, I mean I’ve done work like this too, right. You just go
through your house and you’re like I take this object and that object. But it seems like you could actually
have kids play with these varied toys. The robot can just sit and watch which is a bit creepy. But could
learn a lot about how kids are grabbing.
>> Stefanie Tellex: Yeah.
>>: In a way that might, I mean like around two or three that might translate well to…
>> Stefanie Tellex: Yeah, I mean, so another next step that I want to go to is not just pick and place. But like, this thing is a pepper grinder, it squeezes. This cap unscrews, you know. My son knows all that stuff. If you gave him the vanilla bottle he would try to unscrew it and get vanilla all over everything.
It would be nice if the robot could also collect information and do those actions as well. I can imagine an approach where it somehow has proposals, maybe from this human data, about what the affordances of objects are. Or pick your favorite deep learning model and train on whatever data you want, crowdsourced, or SketchUp, or whatever.
But then let’s try it out at a large scale on lots of real objects and see if we can learn from that. Build lots of maps, like, I did this and guess what, it actually worked; it actually really truly did open this vanilla bottle.
This is a fact about a change in the physical world that I don’t think you’re going to get from just images.
Because you actually put some torques on something and tracked the success. If you imagine that we
scale up this effort over time, with more complex behaviors, and more complex manipulations of
objects. Then I think it would be really cool to learn things like that. You know the way that things move
and the way that things rotate, and tool use, and all that stuff.
>>: You probably don’t need to…
>>: Let you got to the main stages…
>> Stefanie Tellex: Just a few more things.
>>: Yeah.
>>: Sorry [indiscernible].
>> Stefanie Tellex: Okay, so, okay, that was: how can we have robots robustly perform actions in real-world environments? We’re not at this shelf yet. But in my lab, if you bring something in that fits in the robot’s gripper, we can scan it, and in about twenty minutes we can usually pick it up. We have pick and place, where you can pick it up and put it down somewhere else.
The next thing that I’m going to talk about is carrying out complex sequences of actions in very large state-action spaces. For example, something like building IKEA furniture: the robot might want to ask for help, like please hand me the white table leg. Or if you’re a forklift, one of my favorite examples from the Turkers was: how do you pick up a dime with a forklift? A dime. So, anybody want to guess how?
>>: The tires.
>>: Between the two double…
>> Stefanie Tellex: With the forks.
>>: With the tools?
>> Stefanie Tellex: You have the dime on top of the forks just like a pallet.
>>: I seen that that’s really how they did it.
[laughter]
>> Stefanie Tellex: Yep.
>>: Move the two arms to both sides maybe...
>> Stefanie Tellex: You cannot in this forklift, you can’t. You put the forks down on top of the dime and
then you slowly. Here’s the instructions they gave me.
>>: [indiscernible]
>> Stefanie Tellex: You slowly drive backward and it flips to be on top. Raise the forks twelve inches.
Line up either fork in front of the dime. Tilt the forks forward fifteen degrees. Pull the truck forward
until one fork is directly over the dime. Completely lower the forks. Put the truck in reverse and gently
travel backward a foot. The dime will flip backwards…
>>: [indiscernible] forklift training class?
[laughter]
>> Stefanie Tellex: This was from a person on AM…
>>: [indiscernible]
>>: Yeah [indiscernible] forklift.
>> Stefanie Tellex: This is when a person on AMT…
[laughter]
What did you, I…
>>: I said it’s like last day of your training program.
>> Stefanie Tellex: Yeah, that’s right.
>>: It’s like everyone’s, the first day you say and the last day we’ll be doing this, keep your eyes open…
>> Stefanie Tellex: Yeah, your eyes open.
>>: Keep your feet on the pedal.
>> Stefanie Tellex: That’s right. The person who gave me this was a worker on AMT who worked in a
warehouse. He told me that they use this to haze new forklift operators to give them a hard time.
What I want to point out is that you can say this in language. I can tell you in words, and I use some gestures too, but we all instantly understand what motion the forklift makes. If you knew how to drive a forklift, it might take you a few tries, but you could probably figure out how to get it to pick up a dime.
In the next breath I could say unload the truck, and that’s okay too. But if I give the robot this low-level action space, things are going to break. We’re exploring algorithms for trying to handle this, so that we can accept commands at these different levels of abstraction.
To do this I wanted to move to a simulation domain so that we could have very tall trees of actions.
Very, very high level actions along with very, very low level actions, without making a robot that could
do all of those things. The domain that we’re using is one that Microsoft now owns which is Minecraft.
This is a picture from the game Minecraft.
Have you played it? Do people here know?
>>: Played it.
>> Stefanie Tellex: What’s that?
>>: This was like the center piece of my family for like seven years.
[laughter]
>> Stefanie Tellex: Yeah, so you’re, yeah.
>>: I’ve lived it.
>> Stefanie Tellex: Live it. So Minecraft is a game. It’s like Blocks World with zombies, if you do AI. There’s a very tall tech tree. There’s a transistor block called redstone; you can actually build a transistor in the world. A transistor is about as big as you in this world. People have made things like graphing calculators and ALUs, and CPUs in Minecraft.
But the sort of bread and butter is, you’re wandering around a world much bigger than this. This is a ten by ten world. The real world, I don’t even know how big it is; you can’t find the end of it, it’s like the Earth or something.
There’s these zombies, and you can mine iron and make swords and all kinds of stuff, castles. There’s these dungeons, and I like to say it’s AI-complete, because if you can solve Minecraft, if you could actually build a graphing calculator in Minecraft with your AI agent, then you could do everything else too.
What we’re doing is trying to think about how to explore these problems of the state space explosion in
Minecraft, because there’s lots of data in simulation. You can imagine having, there’s lots of language
data about it. It’s a great example of the sort of state space explosions.
Here’s one of the first problems that we looked at. We called it a bridge problem. The idea is you can
do any actions, move, place, destroy, use, jump, rotate, look, craft. You can pick up any of these blocks.
Your goal is to get from here to here. You can’t walk across the trench. You can’t jump far enough to
get across the trench.
What you’re supposed to do is pick up one of these blocks, put it in the trench to make a bridge. Then
walk across the trench. But if you take a sort of breadth first search approach you spend all your time in
these parts of the state space.
This is an even simpler version where we gave the agent two blocks it could place. It’s you know
exploring all of the places those two block could be. Then trying to decide if that would help it get
across the trench. What you really want to do is make a bridge and then be at the goal.
The way we approach this is by formalizing it as a machine learning problem where you sample Markov Decision Processes. But in regular RL, who’s seen Groundhog Day, the movie? Yeah, a lot of the work in regular RL makes assumptions very similar to the movie Groundhog Day, where they basically reset the agent to the same initial state over and over, and over. Then the agent gets to learn.
They learn things like, so Bill Murray learns things like. There’s a scene where he like robs a bank and
you know he knows what car is going to drive by when. They’re going to drop the quarters. Their backs
are turned and he walks up and picks up the money and goes away. He learns all these ridiculous things
about the world.
It’s cool though; they were good assumptions. They led to lots of interesting algorithms. There’s great work in RL under that set of assumptions. But the set of assumptions doesn’t match a lot of problems in robotics and in the real world, because in the real world you never see the same state twice. Every day is a new day. Even if it’s the same room, everything is different.
What you want is for your agent to be plopped down in a state that it’s never seen before and still be able to take intelligent action, do what I tell it to do. We formalize this as a sampling process where we sample MDPs. You sample states from some underlying distribution and sample reward functions. You learn in small MDPs what are good actions to take. Then you try to generalize that to larger problems and see if you can learn what are good parts of the state space to explore.
It’s basically learning to do action pruning. In the bridge problems, we learned that if you’re trying to move somewhere in the environment and you’re not standing near a trench, don’t place blocks; cut out all those branches of the search tree, because that doesn’t help you most of the time.
This is probabilistic, so it’s not, like, this is illustrating it as red and green, but in reality it’s kind of soft probabilities. It basically concentrates the search in the good parts of the state space. It enables our agent to solve these bridge problems, much larger bridge problems than you can solve with even state-of-the-art planning methods that don’t learn things about the environment based on the goals that they’re trying to achieve.
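A small sketch of what this kind of learned, probabilistic action pruning might look like inside a forward-search planner; the usefulness model and threshold are assumptions of mine, not the paper's exact formulation.

```python
import random

def prune_actions(state, goal, actions, usefulness, threshold=0.1):
    """usefulness(state, goal, action) -> estimated probability the action
    is worth expanding, learned from small sampled MDPs. Soft pruning:
    keep high-probability actions, and occasionally sample low-probability
    ones so rare-but-necessary actions (e.g. placing a block right at a
    trench) are not cut out entirely."""
    kept = []
    for action in actions:
        p = usefulness(state, goal, action)
        if p >= threshold or random.random() < p:
            kept.append(action)
    return kept
```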
This is published; it’s going to appear at ICAPS in about a month. This is just our first stab in this area. We’re really excited about hooking this up to language, because one of the motivations is being able to talk to these agents, and also about doing model learning.
We’re about to release a Minecraft mod that lets us run our RL framework inside of Minecraft, inside the Minecraft Java virtual machine, so that you can load the MDP state from the Minecraft chunk that you’re in and see all the block locations and stuff. Then we give a couple of example dungeons that are about this big. You can run value iteration, if you take out the planning actions, and it’ll find, you know, how to walk across; it’ll do all the grid world stuff.
But then you can try any RL algorithm that you want in this framework. It’s going to be like the Atari dataset, if you’re familiar with the RL literature, where DeepMind and all that stuff just [indiscernible] Atari; so this is the next thing. This is Minecraft, where you have this huge, huge world and these blocks that you’re trying to manipulate.
Okay, Minecraft, so I have five more minutes. That was sort of planning in these big state-action spaces and how we’re exploring that. I will just briefly talk about the social feedback loop. Social feedback, so the idea is you have a large space of possible actions. In a typical kitchen there might be a hundred things, a hundred ingredients, that are just around in cupboards and stuff that you might be asked for. The person will say something like hand me the vanilla.
The robot has to figure out what actions to take in order to respond to that natural language request.
To do this you have to map between words in language and stuff that’s out there in the world. This is an example of the symbol grounding problem from the forklift domain.
In my previous work we did this by collecting lots of crowd source data. Learning models of grounded
compositional semantics and showed that we could use this to make robots that interpreted natural
language commands. This is some examples of that. This is a forklift responding to a command, like put
the pallet on the truck and then moving to the environment to carry out that command.
This is the PR2 doing Blocks World manipulation tasks. This is a ground vehicle that is following route direction requests through an outdoor environment. Up here is a helicopter about to take off, following a natural language request to look at stuff in the environment. This was a lot of fun to do, all of these robots.
But the problem with all this work, if you watch every single one of those videos, is that somebody says something, and then a lot of time passes before the robot starts to move. The problem is that, whether or not the robot goes on to do the right thing, you’ve sort of already failed by the time the person has had to wait even a few seconds, because they have no idea if the robot is understanding them. They might try to repeat themselves.
What we want is something more like what Dan does, where the system is always tracking what you’re doing and producing feedback based on what it understands. For example, here’s some commands that the people from Turk gave our forklift. Move the pallet from the truck: but there’s two pallets on the truck, so it’s ambiguous. Even these unambiguous ones here, like offload the metal crate: if your language model is bad, or you couldn’t figure out that metal maps to these perceptual features, that may be ambiguous to the robot. The person, not knowing how the robot works, will have no idea why this command is failing.
What we’re trying to do is make robots that can solve these problems. I mean, we’d like to solve the underlying problems, but we want to make robots that are robust to this type of failure by creating a social feedback loop between the human and the robot. When humans are talking to each other they are constantly tracking what the other person is doing, trying to figure out if they understand; you know, they’re nodding their head, or they’re looking confused. They’ll stop the other person if they don’t understand.
This feedback loop is, I think, one of the main things that causes human-human interaction to be so robust to failures: we are not open loop, we are operating in a closed-loop way. This is a study from Herb Clark on human-human interaction, where one human was instructing another person to assemble structures out of Legos, and they basically turn off feedback at different levels. If the two no longer share a workspace they can still talk to each other, but they’re two times slower at carrying out the task. If they turn off feedback entirely, so they basically give the listener a recorded set of instructions (the listener can stop and play back the recordings, so it’s not that they didn’t hear it, they can go back, but there’s no feedback, they can’t ask questions or look, or things like that), they make eight times as many errors.
This is human-human interaction, where the speech recognition system in the human’s head is pretty good, right; we hope that they’re able to understand the audio and stuff. Yet even so, feedback makes a huge difference. In the case of robots, whose perceptual abilities are so much more limited than people’s, we might hope for an even larger impact from having some kind of feedback mechanism.
To do this we’re working on methods for incrementally interpreting language and gesture from a person at a very high frequency. This is showing the basic setup here. The state, if you think of this from a dialog perspective, is which object the person wants. The belief state is a multinomial [indiscernible], so which of these four objects the person is asking for.
We’re updating the belief state word by word and gesture by gesture, at fourteen hertz, a very high frequency, in order to very quickly and basically continuously publish and subscribe, right. We’re publishing what the robot thinks about the person’s intentions in real time. Yes?
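A minimal sketch of that high-frequency publishing loop, assuming a ROS-style node; the topic name, the message layout, and the update_belief helper are hypothetical stand-ins, not the actual system.

    # Sketch of the 14 Hz publishing loop, assuming a ROS-style node. The topic
    # name, the message layout, and update_belief() are hypothetical stand-ins.
    import rospy
    from std_msgs.msg import Float32MultiArray

    def update_belief(belief):
        # Placeholder: fold in the latest words and gestures here.
        return belief

    if __name__ == "__main__":
        rospy.init_node("intention_belief_publisher")
        pub = rospy.Publisher("/intention_belief", Float32MultiArray, queue_size=1)
        belief = [0.25, 0.25, 0.25, 0.25]  # uniform over four candidate objects
        rate = rospy.Rate(14)              # publish at fourteen hertz
        while not rospy.is_shutdown():
            belief = update_belief(belief)
            pub.publish(Float32MultiArray(data=belief))
            rate.sleep()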
>>: The first thing once you’re…
>> Stefanie Tellex: How much over can I go? Not to three.
>>: [indiscernible]
>>: We’re until three I think.
>>: Yeah, we can have [indiscernible] discussion, you’ve got three o’clock.
>> Stefanie Tellex: Is that okay?
>>: It’s until three o’clock.
>> Stefanie Tellex: Okay.
>>: If a person wasn’t [indiscernible], it can do better than people; it can point a laser. You can say, is it this one, right, even if it’s ten feet away?
>> Stefanie Tellex: Yes, right, so I think that’s one of the other glorious things about social feedback. I mean, we don’t want a robot to be annoying, like, is it this one, is it this one? If it’s too bad it’ll be very frustrating.
>>: Yeah.
>> Stefanie Tellex: But if we can create that feedback loop and successfully recover from these failures, we will create labeled training examples out of that process. If we fail to understand something and then they disambiguate with a gesture, then we get it and we know we got it. We can go back to that language that we initially misunderstood and try to fix it up.
You know, deep learning is a thing, Howard, or use Dan’s system, to retrain your model and improve the performance in these real interactions. Or learn the specific language and gestures that a particular person likes to use in different situations. Yes?
>>: You mentioned gestures in my state but [indiscernible] recognition is absolutely [indiscernible].
[laughter]
>> Stefanie Tellex: Yes, so, I was talking to Dan about this. The results that I have are for language and gesture. We tried face; we have numbers for face and gaze direction, and they suck. They suck because our gaze tracker sucks.
I believed you enough that we actually tried it, but I took the numbers out because it didn’t work, for uninteresting reasons. We are exploring better face trackers, and I have the hypothesis that that will also help. Yes, that’s right.
The model, so this is the model we’re using: a generative model. Jason was explaining to me that the state of the art now is discriminative models for doing this kind of state estimation for dialog. But the state of the art also gets to train and test on existing datasets that were collected by somebody else.
Jason is one of the people organizing this for the Cambridge dataset, where people are asking about restaurant recommendations in Cambridge, England. Before that there was a CMU dataset about bus schedules. People trained discriminative models on these datasets. Of course there are no models for my kinds of tasks.
We started out with a generative model. Jason said, and I’m really excited about this, that the way you collect the data in the first place is you take a model that kind of works and you run it on lots of people. That’s what you use to collect the data to train your discriminative model, which then beats the pants off all of the generative models.
This is our generative model for state estimation. The idea is the hidden state is what’s in the person’s
brain. The observations are language and gesture. You can add face. You can add gaze. You can add
everything else that you want to. Really I believe that discriminative would work better but we don’t
have the data. We’re excited about trying to collect that data.
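Here is a minimal sketch of that kind of generative update, where the hidden state is which object the person wants and each observation (a word or a pointing gesture) multiplies in a likelihood; the numbers are made-up illustrations, not the model from the talk.

    # Sketch of the generative update: the hidden state is which object the
    # person wants; each observation multiplies in a likelihood and we
    # renormalize. All numbers are made-up illustrations.
    OBJECTS = ["plastic bowl", "metal bowl", "plastic spoon", "metal spoon"]

    WORD_LIKELIHOOD = {          # hand-specified P(word | object)
        "bowl":    [0.45, 0.45, 0.05, 0.05],
        "spoon":   [0.05, 0.05, 0.45, 0.45],
        "plastic": [0.45, 0.05, 0.45, 0.05],
        "metal":   [0.05, 0.45, 0.05, 0.45],
    }

    def update(belief, likelihood):
        """One filter step: multiply by the likelihood and renormalize."""
        posterior = [b * l for b, l in zip(belief, likelihood)]
        total = sum(posterior)
        return [p / total for p in posterior]

    belief = [0.25] * 4                               # uniform prior
    belief = update(belief, WORD_LIKELIHOOD["bowl"])  # hears "... a bowl"
    pointing = [0.70, 0.10, 0.10, 0.10]               # gesture toward object 0
    belief = update(belief, pointing)
    print(dict(zip(OBJECTS, [round(b, 2) for b in belief])))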
But even with a generative model you can get quite good results. Here’s an example of what it looks like when it runs. Here he’s pointing; he says please hand me that, then that, with these pointing gestures. What you notice is that the model is updating like an animation, because it’s happening at such a high frequency, so these probabilities are sort of moving up and down in real time, in a lively way, right.
Here, I’ll show you another one. He says I would like a bowl. As soon as it sees bowl, those bowls jump up, and then he points. It saves the probability; it doesn’t start from scratch when he disambiguates with the gesture. We’re combining multimodal information from language and gesture to do this.
You can also get a poor man’s compositional semantics, little bits of compositional semantics, out of this: I would like a spoon, the plastic one. Here each phrase individually is ambiguous. I would like a spoon is ambiguous; I would like the plastic one is ambiguous. But because the model is saving information through time, it’s able to get the plastic spoon. It’s also taking into account that this is happening relatively close in time. If a lot more time had passed, things would have decayed back to neutral, and then it would not have incorporated much information from the past, which is kind of what you want in this.
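A rough sketch of that decay toward neutral, assuming for illustration that the belief is simply interpolated back toward uniform as time passes; the actual system may implement the decay differently, and the rate here is a made-up value.

    # Sketch of decaying the belief back toward uniform as time passes, so old
    # evidence washes out. The per-second decay rate is a made-up value.
    def decay(belief, dt, rate=0.2):
        """Interpolate toward the uniform distribution based on elapsed time."""
        n = len(belief)
        alpha = min(1.0, rate * dt)  # fraction of the way back to uniform
        return [(1 - alpha) * b + alpha / n for b in belief]

    belief = [0.85, 0.05, 0.05, 0.05]
    print(decay(belief, dt=1.0))   # shortly after: still fairly peaked
    print(decay(belief, dt=10.0))  # much later: back to uniform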
Then, if you have an ambiguous gesture, here he’s going to point ambiguously, deliberately, to sort of demonstrate, and then say the plastic one. Again, the gesture and the language were each ambiguous, but you combine the information to get better performance.
I can sort of show you a couple; we’re just starting to explore the feedback now. The idea is you have this bar graph of probabilities, maybe I’ll show the video. You have this bar graph, and maybe we should show the bar graph to people, but it’s hard for an untrained person to interpret a bar graph like that.
Now what we’re trying to do is render this graph on the robot, through the robot’s gaze, facial expressions, pointing gestures, and looking gestures. The idea is to say, well, if I’m confused between this spoon and that spoon, the robot should look back and forth between those two objects. Then the person is going to get some information from that about which two objects the robot is considering, and be able to provide targeted feedback.
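One minimal way to sketch that behavior is to alternate gaze between the two most probable objects whenever the runner-up is close to the leader; the threshold and the object names below are made up, not the actual controller.

    # Sketch of rendering ambiguity as gaze: when the runner-up object is close
    # in probability to the leader, alternate looking between the two. The
    # threshold and the object names are made-up illustration values.
    import itertools

    def gaze_targets(belief, objects, ambiguity_ratio=0.6):
        ranked = sorted(zip(belief, objects), reverse=True)
        (p1, best), (p2, second) = ranked[0], ranked[1]
        if p2 >= ambiguity_ratio * p1:   # runner-up is close: show confusion
            return itertools.cycle([best, second])
        return itertools.cycle([best])   # otherwise just look at the best guess

    looks = gaze_targets([0.40, 0.38, 0.12, 0.10],
                         ["plastic spoon", "metal spoon", "bowl", "brush"])
    print([next(looks) for _ in range(4)])  # alternates between the two spoons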
I can show you some examples. This is our earliest prototype, where it’s basically pointing at the object that it thinks it understands, and we have this fixed face that’s either happy or sad. This is a more recent video where it’s actually going to deliver the object. He’s saying he wants the brush; the robot moves and delivers the brush, and we’re working on more mechanisms to do this now.
I can also show our quantitative results. There are just four objects in this evaluation, where people came in and talked to the robot. We told them to tell the robot that they want a particular object, and we pointed out the object with a laser pointer so we knew which one they were talking about.
We said, tell the robot, the way you would tell another person, with language and gesture, to pick up the object.
Random is twenty-five percent. If we use language only we got forty-six percent right, using Google’s [indiscernible] system; it didn’t have the far-field microphone. If you use gesture only we got eighty percent. This is pointing gestures; it’s like a Gaussian model of pointing with a covariance.
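A rough sketch of a pointing model in that spirit, scoring each object by its distance from the pointing ray under a Gaussian; the full covariance is simplified to a single made-up variance, and the coordinates are invented for illustration.

    # Sketch of a Gaussian pointing model: score each candidate object by its
    # perpendicular distance from the pointing ray, then normalize. A full
    # covariance is simplified here to a single made-up variance, and the
    # object coordinates are made-up values.
    import numpy as np

    def pointing_likelihoods(origin, direction, object_positions, sigma=0.15):
        direction = direction / np.linalg.norm(direction)
        scores = []
        for pos in object_positions:
            to_obj = pos - origin
            along = np.dot(to_obj, direction)       # distance along the ray
            off_ray = to_obj - along * direction    # offset away from the ray
            d2 = np.dot(off_ray, off_ray)
            scores.append(np.exp(-d2 / (2 * sigma ** 2)))
        scores = np.array(scores)
        return scores / scores.sum()

    objects = np.array([[1.0, 0.05, 0.0], [1.0, 0.40, 0.0],
                        [1.2, -0.30, 0.0], [0.8, 0.80, 0.0]])
    print(pointing_likelihoods(np.zeros(3), np.array([1.0, 0.0, 0.0]), objects))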
The really cool thing is that if you combine language and gesture you do even better: you get ninety-one percent with the multimodal information. The language results surprised the speech recognition people at the workshop. Actually, I looked up why; I didn’t have a good answer at the workshop.
The reason is that we told them to talk as if they were talking to a person, so they would often point and say that one. In the language-only condition we were only using the language without the gesture. The speech recognition worked; it wasn’t just speech recognition error, it was that they said something that was ambiguous without the gesture. These results are showing that the model improves from using this multimodal information.
The next step: this is going to be our baseline. Yeah?
>>: [indiscernible], these numbers seem like they would be very different depending on the kinds of
objects, the number of objects.
>> Stefanie Tellex: That’s right.
>>: I could imagine the language would become more and more important if you had a lot of very
similar objects…
>> Stefanie Tellex: That’s right.
>>: Like some…
>> Stefanie Tellex: That’s right. This is just with four objects; what we’re doing now is trying to scale up to like ten or twenty.
>>: Yeah.
>> Stefanie Tellex: I expect the performance to drop and the language to matter more in these situations. Then the goal is, this is the baseline: if we turn on different types of social feedback, how do things change? Because this baseline is too good; we might only get ten points better or something. I want the baseline with no social feedback to be around sixty, so we’re exploring different versions of these tasks, harder versions with more objects.
>>: Right.
>> Stefanie Tellex: That are farther away, so it’s harder to point. Then once we have a task where the baseline is bad, we can try adding social feedback and see if we can get back up to ninety or a hundred. Yes?
>>: [indiscernible] actually learning materials, plastic, metal distinction…
>> Stefanie Tellex: In this task…
>>: Or is it, or is this something that’s…
>> Stefanie Tellex: No, in this task we give it to it; it’s a generative model. I’m going to zip back. We do not need training data, but we do need this distribution: the probability of l given x, where x is the true object, let’s say the plastic bowl, and l is the words that they use. That’s a language model conditioned on the object. Right now we write that down by hand; we say, okay, plastic bowl, its color, and so on. We do a pilot test, and if people say words that we didn’t have, we add them in and collect it that way.
This plugs into the scanning system. To get that distribution you could upload the images to Turk, have people type down descriptions, count up the words, and get a new language model. Or you could collect lots of people interacting with and pointing at these systems. Again, you either use a generative thing to get that, or have a discriminative model that would learn basically the same thing: these are the words people use to describe the objects.
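A minimal sketch of that count-up-the-words idea, estimating P(word | object) from collected descriptions with add-one smoothing; the example descriptions are made up for illustration.

    # Sketch of the count-up-the-words idea: estimate P(word | object) from
    # collected descriptions (for example, typed on Mechanical Turk), with
    # add-one smoothing. The example descriptions are made up.
    from collections import Counter

    descriptions = {
        "plastic bowl": ["white plastic bowl", "small plastic bowl"],
        "metal spoon":  ["shiny metal spoon", "steel spoon"],
    }

    def word_model(descs, alpha=1.0):
        counts = Counter(w for d in descs for w in d.split())
        total = sum(counts.values())
        vocab = len(counts)
        return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

    models = {obj: word_model(d) for obj, d in descriptions.items()}
    print(models["plastic bowl"]("plastic"))  # relatively likely
    print(models["metal spoon"]("plastic"))   # only smoothing mass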
We don’t; it’s instance-based, so we don’t know what plastic means. But again, at scale, if we collected that dataset of a million objects and Turk’d them all, you could then have plastic from many different objects, many different viewpoints, with different reflections and stuff, and try to learn a visual plastic predictor. That might work.
>>: So far material detection doesn’t work super well unless it’s really closely related to like color.
>> Stefanie Tellex: Yeah.
>>: Like wood, we could detect wood because it has to be brown as well.
>>: [indiscernible].
>>: There’s a models list…
>>: Yeah, I don’t think, I don’t really know that plastic detection works very well.
>>: People experiment with more sensors on the gripper, grasper, or whatever.
>> Stefanie Tellex: Yeah.
>>: The pressure sensitivity seems to be…
>> Stefanie Tellex: Yeah, so we have basically like just pressure.
>>: That’s right.
>> Stefanie Tellex: But people put different versions of skin on their grippers. We were collaborating with a group at MIT who has a soft hand made out of silicone; they’re putting more sensors on it. People do that actively to do grasping: even if you can’t see, you can kind of feel, so the robot will feel and predict movements to get a good grasp. Again, if you put that on, and we’re collecting data from them now, that can all be stuff in this dataset.
>>: Dataset, yeah.
>> Stefanie Tellex: You know, it may not have as much with that particular hand, because there’s not much there. One of the things about getting this dataset is the Baxter’s existence; there are three hundred Baxters. The gripper that they have, Rod, the person who runs the company that made this robot, told me he put it on the robot to piss everybody off and make it so that they invent new grippers.
It is definitely not the best gripper in the world; there are a lot of limitations to it, and there are other alternative grippers. But by using their gripper, anybody with a Baxter can just use the software with as low a barrier to entry as possible. That’s what we’re aiming for.
Okay, so back to the videos. Alright and this is just showing the face that we are working on. Smoothly
transitioning between confused and happy, and kind of a sine wave, we set the position. The confusion
level on his face based on the entropy of the underlying distribution, so we are doing something kind of
animated. But like we’re, it’s being driven by something real about what the robot’s perceiving about
the environment.
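A minimal sketch of driving that confusion level from the entropy of the belief, normalized so that a uniform belief reads as fully confused; the face blending itself is not shown.

    # Sketch of driving the confusion level from the entropy of the belief,
    # normalized so a uniform belief maps to 1.0 (fully confused) and a
    # certain belief maps to 0.0. The face blending itself is not shown.
    import math

    def confusion_level(belief):
        n = len(belief)
        entropy = -sum(p * math.log(p) for p in belief if p > 0)
        return entropy / math.log(n)  # divide by the maximum possible entropy

    print(confusion_level([0.25, 0.25, 0.25, 0.25]))  # 1.0: look confused
    print(confusion_level([0.97, 0.01, 0.01, 0.01]))  # near 0: look happy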
>>: Right.
>> Stefanie Tellex: I would really like that; your face was so much better, with the eyebrows and stuff.
>>: It’s best to [indiscernible].
>>: Get a good paper.
>> Stefanie Tellex: What’s that?
>>: I pointing it to another AP paper on this, no.
>> Stefanie Tellex: No.
>>: Okay.
>>: We use HP also too.
>> Stefanie Tellex: Okay.
>>: For the expression control, wasn’t it smoothed over? Also, what relates to ours, we had a project called Deep Listener about ten years ago, where we used HAL’s eye; we scanned in something that looked like glass and the redness. It would [indiscernible] between green for okay, I’m getting you, to yellow…
>> Stefanie Tellex: Awesome.
>>: To saying slow down, to red which is like hang on, let’s back up.
>> Stefanie Tellex: Awesome, awesome.
>>: Very smoothly. That was kind of…
>> Stefanie Tellex: I try to find, like, so we’re collaborating with some people from RISD who are helping us with face designs. I’ve tried to get my head around the literature. I’ve been talking to Bill Gate also. I don’t see my contribution as being, this is the best face we should all be using.
[laughter]
It’s more like, I just wanted some way to do it, something that I could set with entropy.
>>: [indiscernible]
>> Stefanie Tellex: Like the eyebrows thing it was…
>>: It’s part of the paper is I asked you…
>> Stefanie Tellex: You know, okay, great.
>>: That’s really more about, that was really a full face model, with lots of effort to get like a confused human-like face.
>> Stefanie Tellex: Yeah, yes.
>>: There’s also different kinds of confusion.
>>: [indiscernible]
>>: Like confusion on the visual channel versus…
>> Stefanie Tellex: Yes, yes, yes, yes, yes.
>>: In your case it might also have confusion about the…
>> Stefanie Tellex: Yeah.
>>: The gesture.
>>: But what we care about is not the face so much as that it’s using entropy for confusion, and different entropies.
>>: Yeah, makes sense.
>> Stefanie Tellex: Yeah, yeah, I think that’s awesome. This is our first step; entropy seemed like the obvious place to start. But I think looking is also going to be very important, like looking at the objects and gesturing and that stuff.
>>: I think part of the challenge is, once you get to the real, to the more complex tasks we’re talking about, the debug mode. You know, one of our challenges is with more intricate systems.
>> Stefanie Tellex: Yeah.
>>: You have so many uncertainties.
>> Stefanie Tellex: Yeah, yes.
>>: The question is you cannot reflect all of them…
>> Stefanie Tellex: Yes, yeah, you just look like what?
>>: You don’t know which ones; you kind of need a model of how the person works and what they would understand of that, and you know, so.
>> Stefanie Tellex: Yeah, yes.
>>: It gets interesting.
>>: And it goes, so we talk about this in future work. We did a little bit of this. But it’s not just confusion: what does it look like, what does the face, what does the robot express when its confusion is just being resolved?
>> Stefanie Tellex: Yeah.
>>: Or when it’s surprised by itself. It’s not the same as confusion.
>> Stefanie Tellex: Yes, yes.
>>: If it can communicate that to [indiscernible].
>>: [indiscernible], yeah.
>> Stefanie Tellex: Yeah, so my hope is that there’s some kind of [indiscernible] thing where your action transitions something in the person’s mental state about what the person thinks the robot thinks. I think if we do that right, it might tell you something, like that you should look relieved, because then they’re going to know that you’re no longer confused. That’s good; you want them to understand what your mental state is.
>>: It’s also probably, it’s one thing to have a full mental belief model like that; what this does is just assume some independence. You have notions of surprise that simply depend on having two entropies running. But it’s okay, I can see that.
>> Stefanie Tellex: Yeah, my students and I are kind of having that debate right now. I like the [indiscernible] elegance; they’re like, no, no, we need to make heuristic approximations like what you’re talking about in order to make it practical. We’re sort of exploring that [indiscernible].
>>: Have you come across Chernoff faces? They’re a way to plot a lot of numbers at once by drawing faces with all sorts of funny features.
>> Stefanie Tellex: No.
>>: No.
>>: Well, Michael Cohen had something similar; Michael Cohen had his own set of dimensions. I was trying to compare this with the famous ones, you know, who was the statistician who did that?
>>: Jonah, I wouldn’t solve that.
>>: [indiscernible]
[laughter]
>>: Well, [indiscernible] maybe you can write about that.
[laughter]
>>: [indiscernible]
>>: It’s Chernoff faces.
[laughter]
>>: It was Chernoff faces, that’s [indiscernible], right, the two dimensions. But Michael Cohen had kind of his own cool set of dimensions, which is, you know, [indiscernible] nicer in some ways for our application than Chernoff faces.
>> Stefanie Tellex: Yeah, that’s cool.
>>: It’s the same, similar thinking.
>> Stefanie Tellex: Maybe we should talk more about it. We were debating how you control it, like, what are the levers that you want to control. Confusion level is just one dimension, but there’s also confusion and surprise.
>>: Yeah.
>> Stefanie Tellex: I talked to one of my mentors about this, like, confusion is bad, right? He said, no, I’m happy when I’m confused, because I’m a researcher and it means I’m about to learn something.
[laughter]
So it’s like, we want to be able to be confused and happy, and confused and sad, and confused and nervous. You know, what are the levers that our underlying model should support in order to express all of this stuff?
>>: Except that I think is more of a discussion mode and that was it.
>> Stefanie Tellex: Yeah, this is the end: robots that robustly perform actions.
>>: We’ve been working, as part of the work that we’ve done with an intern here two summers ago, on a really rich, if you’re interested, again barring concerns of various kinds, I think it’s a nice avatar with expressions of various kinds. We have a whole authoring toolkit for these. We want to field it out to academic partners.
>> Stefanie Tellex: Okay.
>>: It might be a fun thing to field out to Stefanie’s team.
>>: Yeah, I [indiscernible] think so.
>> Stefanie Tellex: Yeah.
>>: Jane and stuff. We actually built a kind of whole environment that we want to share and…
>> Stefanie Tellex: Okay.
>>: We can play with it.
>>: Shareware and get more…
>> Stefanie Tellex: I have two, one student from RISD, the Rhode Island School of Design, and another undergrad who’s from animation, and a professor from the UK…
>>: [indiscernible]
>> Stefanie Tellex: Who are all helping us with the…
>>: Another [indiscernible] interesting design question is like, where do you start? Do you start with, we did the avatar by my door upstairs.
>> Stefanie Tellex: Yeah.
>>: We had all these, chart called Sketch Version, Line Drawing Version, just a light bulb…
>> Stefanie Tellex: Yes, come to my lab and you’ll see…
>>: Just a circle spinning like the Cortana X.
>> Stefanie Tellex: But what we’re running now is actually much more abstract than that face. Like the
one, the side one so like…
>>: Right, so that’s the thing…
>> Stefanie Tellex: They felt that was less…
>>: Between the, you know, the HAL lens that’s morphing between red light, yellow, and green. It’s like, after a while you sort of map your mind to it.
>> Stefanie Tellex: Yeah.
>>: You’re looking at this thing. You’re talking to it and it’s changing colors.
>>: It starts to see what…
>>: Yeah, eyebrows, looks like it…
>>: Yeah.
>> Stefanie Tellex: Yeah.
>>: Between that and, like, the full thing where we’re sketching a really human-like face and working on these muscles. It’s a kind of interesting world, the unofficial RISD thing, you know.
>> Stefanie Tellex: Yes.
>>: But somewhere maybe more in the Hal side.
>> Stefanie Tellex: Yeah, their approach was to map out the space. They put up these posters in my lab of all the robots they could find and what their faces were like. Then they had us all draw faces on our whiteboard; there was a sign saying draw a face, so everybody who walked by could draw examples of faces. Then they make mockups and put them on the robot, we decide how we like them, and we iterate.
We have a large space of faces around now. The one that they’ve been moving towards is more abstract and cartoony than the one that I showed, but it still has face-like eyes and a nose, and a mouth and stuff. It’s cool; yeah, we’ll see where they go.
Yeah, so that’s the talk: robots that robustly perform actions in real-world environments, carrying out complex sequences of actions with the Minecraft stuff, and then actively coordinating with people to establish a social feedback loop. They’re all kind of separate right now; these two connect together, but my hope is that before I get tenure they all connect together.
[laughter]
You have your complex actions in real-world environments, and you’re giving natural language commands, dialog, question answering, all in collaboration with the person. Thank you.
[applause]