
>> John Platt: So I'm really pleased to present Andrew Ng. He's known to many
of us because he's collaborated with us over the years. He was a Ph.D. student
at the University of California at Berkeley and then since 2002, he's been at Stanford
as a professor working on lots of hard AI problems across multiple fields. So
here's Andrew.
>> Andrew Ng: Thanks, John. So what I wanted to do today was tell you about
some work we've been doing on STAIR, or the Stanford AI Robot project. So the
STAIR project started about three years ago, motivated by the observation that
today the field of AI has fragmented into many subfields, and today each of those
subfields is pretty much a separate research area with entirely separate conferences and
so on, and we wanted to define unified challenge problems that require tying
together these disparate subfields, to pursue the integrated AI dream again. And for those
of you familiar with AI, I think of this as a project very much in the tradition of
Shakey and Flakey, but doing this with, you know, 2008 AI technology rather than
1966 AI technology, as was the case with Shakey.
So these are the long term challenge problems we set for ourselves: to build a single
robot that can do things like tidy a room, use the dishwasher, fetch and deliver items
around the office, assemble furniture, prepare meals. There's a thought that if
you can build a robot that can do all of these things, then maybe that's when it
becomes useful to put a robot in every home.
In the short term, what we want to do is have the robot fetch an item from an
office; in other words, have the robot understand a verbal command like, STAIR,
please fetch the stapler from my office, and have it be able to understand the
command and carry out the task. And so what I'm going to do today is tell you
about the elements we tied together to build that STAIR, please fetch the stapler
from my office application, and the elements listed on this slide are object
recognition, so it can, say, recognize the stapler; mobile manipulation, so it can
navigate indoor spaces and open doors; depth perception, which takes us into a
discussion of estimating distances from a single image; then grasping and
manipulation to let the robot pick up objects; and lastly a spoken dialogue system
to tie the whole system together.
So let's start by talking about object recognition. So I think robotic
vision today is far inferior to human vision, and there are many reasons that
human vision is so much superior to current robotic vision systems.
People often talk about the use of context, or of common sense.
There are many reasons like that. One reason I think has not been exploited in
the literature is just the reason that humans use a fovea to look directly at objects
and therefore obtain higher resolution images of them. And recognizing objects is
just much easier from higher resolution images than from low resolution ones.
So for example, if I show you that image and ask you what it is, how many of
you can tell?
>>: [Inaudible].
>> Andrew Ng: Well, cool. You guys are good. It's actually easier from the back
of the room. And once I show you the high resolution image it's so much easier to
tell what it is. It turns out the picture on the left is what a coffee mug looks like at
five meters' distance from a robot, and so maybe it's no wonder that, you know,
object recognition is so hard to get to work well.
So just to be clear, right, if I'm standing here, if I'm facing you like this, I actually do
not have enough pixels in my eyes to recognize that this is a laptop. For me to
recognize this as a laptop, I need to turn my eyes and look directly at it and get a
higher resolution image of it. I can now recognize this is a laptop, and I can then
look away and continue to track this black blob in my peripheral vision and know
that it is still a laptop.
So it turns out that using off-the-shelf hardware it's fairly straightforward to replicate this
sort of foveal plus peripheral vision system. In particular, you can use a pan-tilt-zoom
camera, that's a camera that can turn and pan and zoom into different parts of
the scene, to simulate the fovea, and you use a fixed wide angle camera
down here to simulate your wide-view, lower resolution peripheral vision system.
And let me just point out that, you know, this is unlike object
recognition on Internet images; if you're a computer vision researcher and if
all you do is download images off the Internet, then, you
know, you're actually not able to work on this problem, right,
because you cannot zoom into images you downloaded off the Internet.
Recognition on Internet images is an important and interesting
problem in its own right, but I think when you work with vision in physical
spaces then alternatives like these become very natural.
Just a little bit more detail. We can learn a foveal control strategy, or learn where
to look next, in order to try to minimize uncertainty, or maximize information gain,
or minimize entropy, where we express the entropy over the uncertainty of the
objects we're tracking and objects we may not yet have found in the scene, and
when you maximize this, what it boils down to is a foveal control strategy that
trades off, almost magically, between the goals of trying to look around to search for
new objects versus occasionally confirming the locations of previously found
objects.
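(To make the gaze-selection idea concrete, here is a minimal sketch, not the actual STAIR code, of an entropy-driven foveation rule: keep a per-region belief that an object is present, and foveate the region whose observation is expected to shrink that uncertainty the most. The grid size and the detector's hit and false-alarm rates below are made-up assumptions.)

    import numpy as np

    def entropy(p):
        """Bernoulli entropy (bits) of each cell's 'object present' belief."""
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def expected_entropy_after_look(p, tpr=0.9, fpr=0.1):
        """Expected posterior entropy of a cell if we foveate it,
        assuming a detector with the given hit/false-alarm rates."""
        p_fire = tpr * p + fpr * (1 - p)                   # prob. detector fires
        post_fire = tpr * p / np.maximum(p_fire, 1e-9)
        post_quiet = (1 - tpr) * p / np.maximum(1 - p_fire, 1e-9)
        return p_fire * entropy(post_fire) + (1 - p_fire) * entropy(post_quiet)

    def choose_gaze(belief):
        """Foveate the cell with the largest expected entropy reduction."""
        gain = entropy(belief) - expected_entropy_after_look(belief)
        return np.unravel_index(np.argmax(gain), belief.shape)

    # toy 'interest belief' over a 4x6 grid of scene regions
    belief = np.random.uniform(0.05, 0.6, size=(4, 6))
    print("foveate cell:", choose_gaze(belief))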
This, by the way, is the only equation I have in this talk, so, you
know, I hope you enjoyed it. Just to show you how it works: in this video, in the
upper right hand corner is the wide angle peripheral vision view, and in the lower
left corner is the high resolution foveal view. On the upper left is what we call the
interest belief state. My laser pointer is running out of battery. It's the interest
[inaudible], which is a learned estimate of how interesting it will be to look at different
parts of the scene, how likely you are to find a new object if you look there.
And as the robot pans its camera around to zoom into different parts of the
scene, you can sort of tell that it is infinitely easier to recognize the objects from
the high resolution foveal view on the lower left than to recognize objects from
the view on the upper right.
And if you evaluate the algorithm more quantitatively, you know, depending on
the experimental setup you get a 71 percent performance improvement. And just to
be clear, if any of you are ever interested in building, you know, some camera
system or some vision system for some physical space -- like, I don't know, if you
want to put a camera system in a retirement home to monitor the retirees, to
ensure their safety or whatever -- I actually think of slapping a fovea on it as low
hanging fruit for suddenly allowing yourself to see the entire world in high
resolution and giving your vision system a significant performance boost. The next
element I'll talk about, in just two slides, is mobile manipulation, having robots
navigate and open doors. So let's talk about that.
So we actually worked on two versions of this system. In the first one -- it turns out
that most office buildings have nearly identical doors -- and I guess, very briefly, what
we did was to develop a representation for office spaces that
allows a robot to reason simultaneously about a very coarse map, a grid map, to
enable a robot to navigate huge building-size spaces, as well as, in the same
probabilistic model, reason about one millimeter models, models of door handles
accurate to about a millimeter, since you need about, you know, three to five millimeter
accuracy in order to manipulate a door handle. So we came up with a model
that's probabilistically coherent despite the very different resolutions of these
two spaces.
And then the more recent version puts this together with the foveal vision
system that I just described, so that you can have a robot that uses vision to
recognize novel door handles, and in some cases elevator buttons as well, so that
you can put the robot in front of a novel door that it's never seen before, and it can
see and recognize the door handle, figure out how to manipulate the door
handle, and also use a motion planner to plan the arm motion needed to
manipulate the door handle.
So this video shows a number of examples of the robot seeing novel doors, test set
doors it has never seen before. We went around looking for doors to test this on.
This is actually the robot trying to go inside the men's room.
Elevator buttons are more of the same: same algorithm, pushing elevator
buttons. And the overall performance of the system on opening novel
doors was 91 percent.
>>: Have people tried this problem before, or --
>> Andrew Ng: So it turns out there's lots of work on opening known doors. I
believe we're the first to have a robot open previously unseen doors. And so that
sequence --
>>: [Inaudible] or could be push ones or --
>> Andrew Ng: I see. Yeah, so, yeah, you're right. So let's see, in the video I
showed, this was restricted to handles, not knobs, and only push doors. I believe
this robot is mechanically not capable of pulling a door shut behind it, for
example, but with the newer robot, that you'll see a picture of, a student,
Ellen, is applying pretty much the same algorithm to that problem. We haven't
done that yet.
>>: [Inaudible].
>> Andrew Ng: Yes?
>>: [Inaudible] using the [inaudible] or does it also have lasers or some other
kind of [inaudible]?
>> Andrew Ng: Yeah, let's see. Boy, we've done many things over
time. We have done this using stereo cameras. We've also
done this using a single camera and a laser. And actually we've done this with
different sets of sensors. The results you saw here I believe were all with a single
camera and the horizontally mounted [inaudible].
Okay. Yeah. Although that was [inaudible] using only vision. So the next thing I want
to tell you about is depth perception. This takes us into a discussion of estimating
depths from a single still image. So let's talk about that. If I show you that
picture and ask you how far things were from the camera when this picture was
taken, you can look at it and maybe sort of guess. Or if I show you that picture and
ask you how far objects were from the camera, you can tell the tree on the left
was probably further from the camera than the tree on the right when this picture
was taken. The problem of estimating distances from a single image, from a
single still image, has traditionally been considered an impossible problem in
computer vision, and in a narrow mathematical sense it is indeed impossible, but
this is a problem you and I solve fairly well, and we'd like our robot to do
the same in order to give it a sense of depth perception.
So it turns out that there is, of course, lots of prior work on depth estimation from
vision. Most of this prior work has focused on approaches like stereo vision,
where you use two cameras and triangulation, and I think that often works poorly in
practice.
There are a number of other approaches that use multiple images, and it turns out to be
very difficult to get them to work on many of these indoor and outdoor scenes,
and I should say there's also some contemporary work done by Derek
Hoiem at CMU. But the question is, given a single image like that, how can
you estimate distances?
So this is the approach that we took. We collected a training set comprising a
large set of pairs of monocular images like these and ground truth depth
maps like the ones on the right. In the ground truth depth maps the different colors
indicate different distances, where yellow is close by, red is further away and
blue is very far away, and these ground truth depth maps are collected using a
laser scanner, where you send pulses of light out into the environment, measure
how long the light takes to go out, hit something and bounce back to your sensor,
and because you know the speed of light, this allows you to directly measure the
distance at every pixel. And then having collected a large training set, we then
learn a function mapping from monocular images like these to what the ground
truth depth maps look like, using supervised learning.
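(As a loose sketch of the kind of supervised setup being described -- not the actual features or learner used on STAIR -- each image patch is paired with the laser-measured depth at its center, and a regressor is fit to that mapping. The patch size, the toy features and the ridge regressor here are all placeholder assumptions.)

    import numpy as np
    from sklearn.linear_model import Ridge

    def patch_features(img, r, c, k=7):
        """Toy per-patch features: mean intensity plus local gradient energy."""
        patch = img[r:r + k, c:c + k].astype(float)
        gy, gx = np.gradient(patch)
        return [patch.mean(), np.abs(gx).mean(), np.abs(gy).mean()]

    def make_training_set(images, depth_maps, k=7, stride=8):
        X, y = [], []
        for img, depth in zip(images, depth_maps):      # depth: laser ground truth
            for r in range(0, img.shape[0] - k, stride):
                for c in range(0, img.shape[1] - k, stride):
                    X.append(patch_features(img, r, c, k))
                    y.append(depth[r + k // 2, c + k // 2])
        return np.array(X), np.array(y)

    # toy stand-ins for (camera image, laser ground-truth depth map) pairs
    imgs   = [np.random.rand(64, 64) for _ in range(3)]
    depths = [np.random.rand(64, 64) * 50 for _ in range(3)]
    X, y = make_training_set(imgs, depths)
    model = Ridge().fit(X, y)          # learned map: image features -> depth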
So in a little bit more detail: in order to construct this learning algorithm, to construct
features for this learning algorithm, we actually first went to the psychology
literature to try to understand what some of the visual cues are that are used by humans,
used by people, to estimate depths. So some of the cues that you and I use
turn out to be things like texture variations and texture gradients; so for
example, those two patches are of the same stuff, it's all grass, but the
textures of these two patches look very different because they are at very different
distances. Other cues that people use include haze, or color, so things that are far
away tend to be hazy and tinged slightly blue because of atmospheric light
scattering. We also use cues like shading, defocus, occlusion and
[inaudible].
So for example, if those two rectangles look like they are similar in size to you,
that's only because your visual system is so good at correcting for distance. In
fact, they are about 15 percent different in size. And if you know people are
roughly five to six feet tall, then by seeing how tall they appear in the image,
you can tell roughly how far away they are.
So we constructed a feature vector to try to capture as many of these cues as we
could. And realistically, I think we do a decent job of capturing the first few on the
list and maybe a less good job of capturing the second half of this list. And then,
given an image, we came up with a probabilistic model of distances.
In detail, given an image we compute image features everywhere in the image
and then we construct a probabilistic model known as a Markov random
field model. What that does is allow us to model the relation between the
depths and the features; in other words, it models how image features may directly
help you estimate the depth at a point. It also models relations between depths at
the same spatial scale, because two adjacent pixels are more likely
to be at similar distances than at very different distances, as well as relations
between depths at multiple spatial scales.
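(A very stripped-down version of the kind of energy such a Markov random field assigns is sketched below; minimizing it trades off agreement with the per-pixel feature-based depth predictions against L1 smoothness between neighbors at the original scale and at one coarser scale. The two weights and the single extra scale are made up for illustration; the model in the talk is richer.)

    import numpy as np

    def mrf_energy(depth, predicted, lam1=1.0, lam2=0.5):
        """L1 data term plus L1 smoothness at two spatial scales."""
        data = np.abs(depth - predicted).sum()        # fit the per-pixel predictions
        smooth1 = (np.abs(np.diff(depth, axis=0)).sum()
                   + np.abs(np.diff(depth, axis=1)).sum())
        h, w = depth.shape
        coarse = depth[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        smooth2 = (np.abs(np.diff(coarse, axis=0)).sum()
                   + np.abs(np.diff(coarse, axis=1)).sum())
        return data + lam1 * smooth1 + lam2 * smooth2

    # toy check: with depth set equal to the predictions, only the smoothness terms remain
    pred = np.random.rand(32, 32) * 10
    print(mrf_energy(pred.copy(), pred))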
When you train this algorithm using, you know, supervised learning, these are
examples of test set results. So in the left-most column is a single monocular
image, in the middle column is the ground truth depth map from the laser scanner,
and in the right-most column is the estimated depth map given only that one image as
input. So the algorithm makes interesting errors. I'll point one out here, which is that
that tree there is actually in the foreground, right? That tree is actually fairly
close to the camera, but the algorithm misses it entirely and, you know, thinks that
tree is much further away. The example below still looks okay. A few more
examples. And I want to point out another interesting error. In this image up
here, this bush here is in the foreground, and that tree there is in the background.
So these are two, you know, physically separate objects, where this bush in the
lower right is significantly closer to the camera than the tree in the background,
but the algorithm misses that as well and sort of ends up blending together the
depths of the tree and the bush, right? But other than that, [inaudible].
>>: [Inaudible].
>> Andrew Ng: Yeah. Yeah.
>>: What's with [inaudible] it was not [inaudible].
>> Andrew Ng: No. So [inaudible].
>>: Edges or have sort of more [inaudible] stuff inside?
>> Andrew Ng: Let's see. So it turns out -- one thing I wanted to do was make
[inaudible] a convex problem, and so this was an MRF where it's e to the minus
L1 functions, or e to the minus the sum of a lot of absolute value terms. And the
absolute value terms essentially, you know, capture these sorts of relations. In more
sophisticated versions of the model, we actually reason explicitly about [inaudible]
and a bunch of other more complicated phenomena. So there
are other cues like that. If you find a long line in an image -- you look at an image
and you find a long line in the picture -- that long line will probably correspond to a
long line in 3D as well.
So there are like three or four types of cues like that that the [inaudible] captures,
and probably the challenge is how to encode all of these things so that it's still
a convex optimization [inaudible].
>>: Is there anything on the bottom layer that was like [inaudible] model or
something, or is it sort of just very simple models mapping the features to that?
>> Andrew Ng: Yeah, boy --
>>: [Inaudible] or was there any [inaudible]?
>> Andrew Ng: I see. There was one other machine learning piece [inaudible],
which is specifically edge detection, so one of the steps we do is look at an image and
for every point in the image try to decide if there's a physical discontinuity there.
So for example, standing here, there's a physical discontinuity between the top of
the laptop and my chest, and so you try to recognize those points and then those
help the [inaudible] do better as well. Yeah. Parts of it are
complicated.
[Inaudible] it's about 30, 35 percent per-pixel error, but let's skip over that. So
more interesting is -- you know, right now I've been showing you depth maps. One
of the other things you can do with this is actually take the models you estimate
and render them as 3D fly-through models, and I'm going to show you an
example of that. For what I'm about to show you, the entirety of the input to
the algorithm was one of these still images, and so let's take a look at the sorts
of 3D models you get from these images.
So here are three examples. The first picture is actually a picture we took
ourselves on the Stanford campus. Given a single image, this is an example of
the 3D fly-through model you get. [Inaudible] I'm actually really bad at
driving this thing. Right. Turn left, look at the 3D shape of the tree. So let's imagine
that we're standing together in front of this house, and I'm going to squat down so
the wall comes up, you can't see the cars anymore, stand up, squat down
[inaudible]. The second and third images are actually [inaudible] images
downloaded off the Internet. Let's fly down the river.
Turn right, look at the trees, turn left, look at the shape of the mountain and so on.
So -- yeah?
>>: [Inaudible].
>> Andrew Ng: Say that again?
>>: Were those [inaudible] used in the [inaudible]?
>> Andrew Ng: So that was a more sophisticated version of the algorithm than
the basic depth map one. The ideas are roughly the same. So I -- boy, I'd have to
go into more detail. I guess one difference is that in the more
sophisticated version, that I didn't talk that much about, we actually first
oversegment the image using a superpixel segmentation algorithm [inaudible],
and then imagine using a pair of scissors to cut this picture up into lots
of small pieces, into superpixels, and then using an inference algorithm
to take each of these pieces and, in [inaudible], paste them into 3D. And when you
do that, that helps you preserve planar surfaces, and lastly we can texture-map the
image back onto this 3D model where we pasted all these pieces somewhere in
3D.
>>: [Inaudible]?
>> Andrew Ng: Say that again.
>>: [Inaudible] I had two of them but [inaudible].
>> Andrew Ng: Yeah. Actually you're right. Well, there's something that I wasn't
going to show, but let me see if I have it. I think I might have a hidden slide that
does that. Yeah. So it turns out that, you know, on many images monocular
does okay; stereo -- this is a [inaudible] vision system -- [inaudible] did not find
correspondences and therefore does not return depths; and if you combine the
monocular cues and the stereo cues, then you get measurably
better results than either mono or stereo only, and by stereo I mean triangulation.
And you can also do things like take a few images and build large scale models,
where parts of the scene are seen by only one camera and parts of the scene are
seen by multiple cameras.
So let me find where I was in the talk. Yeah. Okay. Cool. And so it's October,
which means some of you may recently have gotten back from your summer
holidays or whatever. So this algorithm is actually up on a website, and if any of
you want to take, you know, your own holiday pictures and upload them to the
website, then the algorithm will try to turn your pictures into 3D models so you
can revisit your holiday memories in 3D rather than as flat pictures. So that was
depth perception. And it turns out one of the most interesting applications of
these ideas is to robotic grasping and manipulation, because on the one hand
depth perception gives a robot a sense of the space around it, but on the other
hand with a robot you could also use lasers and whatever to directly measure depth.
But we can apply some of these ideas to robotic manipulation. So let's talk about
that.
So robotics today is in an interesting state. Robots today, as many of you
know, can be scripted to perform amazing tasks in known environments. One of
my favorite examples is this. This was done in Japan 15, 20 years ago. This is
a picture of a robot balancing a spinning top on the edge of a sword. So that red
thing is a spinning top, the long thing the robot is holding is a sword, and the top is
being balanced on the narrow edge of the sword.
So, you know, if this is a solved problem in robotics -- this was done 15, 20
years ago -- you know what's unsolved? Well, it turns out picking up that cup is an
unsolved problem in robotics if you've never seen that cup before. And it's
that latter problem we looked at. So you've never seen this cup before; how do you
pick it up? Well, one thing you can do is use stereo vision to try to build a 3D
model of the cup, and on the STAIR project we've been fortunate to have had, you
know, several companies donate hardware to us, so using a decent
commercial stereo vision system, this is an example of the depth map we get,
where the different shades of gray indicate different distances and black is where
the algorithm did not find correspondences and therefore did not return
distances.
If you zoom into where the cup was, it's just a mess. You can barely tell if the
handle's on the left or the right. So this is my -- I think all of you probably know what
stereo is, but this is my cartoon of what stereo does, right, which is: in stereo
depth perception you have two images, one from the left eye and one from the
right eye, and stereo depth perception has to find correspondences. For a point
in the left eye image, denoted by the cross, you have to find the corresponding
point in the right eye image, and then you, you know, send out two rays from the eyes
through these two points and see where they intersect, and that triangulation
lets you estimate the distance of a 3D point. And you can also do
this for a different point, you know, shown there; you can estimate the distance of
that point as well. And I think the reason that dense stereo is hard is that if
you pick a point like that in the left eye image, it's very difficult to tell which of those
points it corresponds to in the right eye image, and depending on which one of
those you choose you can get very different distances. It's very hard to find out
the 3D position of that point there. And what dense stereo does is try to take
every point of the left eye image and triangulate it to every single point in
the right eye image, and this is very difficult to get to work.
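(To see why a one-pixel correspondence error matters, here is the textbook rectified-stereo relation, depth = focal length times baseline over disparity; the focal length and baseline below are made-up numbers, not the STAIR rig's.)

    def stereo_depth(x_left, x_right, focal_px=700.0, baseline_m=0.12):
        """Triangulate one matched point pair from a rectified stereo pair."""
        disparity = x_left - x_right            # pixels
        if disparity <= 0:
            raise ValueError("no valid correspondence")
        return focal_px * baseline_m / disparity

    print(stereo_depth(320.0, 310.0))   # ~8.4 m
    print(stereo_depth(320.0, 309.0))   # ~7.6 m: same point, match off by one pixel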
But if stereo doesn't work for us, how do you pick this up? Well, we just said
that given just a single still image, you can already get a sense of the 3D depths
and of the 3D space of the scene. So this is what we did using
monocular vision, using these monocular vision cues. We created a
training set comprising five types of objects, and for each of these objects we
labeled it with the, quote, correct place at which to pick up the object. So we labeled
a pencil as saying pick up a pencil by the midpoint, pick up a wine glass by the
stem, pick up a coffee cup by the handle and so on. And then we trained a
learning algorithm to use monocular vision cues so that the algorithm would take
as input an image like this and would try to predict the position of this big red
cross. So given a single image, it uses monocular vision cues to
decide where the grasp point is, the position of the red cross.
When the robot faces a novel object like a novel coffee cup, what it does is it uses
the learned classifier to identify the grasp point in the left eye image, identify the
grasp point in the right eye image, and then you just take these two points and
triangulate them to obtain a single point in 3D, and you then reach out and
try to grasp there. Okay. And just contrast this with dense stereo vision, which
tries to triangulate every single point of both images, which is very hard; in
contrast, this picks one, or sometimes a small number of points, in both
images to triangulate, and that works much better.
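(The pipeline just described can be sketched as follows; the classifier and the triangulation routine are passed in as stand-ins here, since the learned grasp-point detector and the calibrated stereo geometry are not shown in the talk.)

    import numpy as np

    def predict_grasp_pixel(image, classifier):
        """Score every pixel with the learned grasp-point classifier and
        return the (row, col) of the best-scoring location."""
        scores = classifier(image)                      # same shape as the image
        return np.unravel_index(np.argmax(scores), scores.shape)

    def grasp_point_3d(left_img, right_img, classifier, triangulate):
        """Monocular grasp detection in each view, then triangulation of just
        that one pair of points, instead of dense stereo over all pixels."""
        pl = predict_grasp_pixel(left_img, classifier)
        pr = predict_grasp_pixel(right_img, classifier)
        return triangulate(pl, pr)                      # single 3D point to reach for

    # toy stand-ins for the learned classifier and the calibrated triangulation
    toy_classifier  = lambda img: -((np.indices(img.shape) - 16.0) ** 2).sum(axis=0)
    toy_triangulate = lambda pl, pr: np.array([pl[1], pl[0], 1.0])  # placeholder geometry
    print(grasp_point_3d(np.zeros((32, 32)), np.zeros((32, 32)),
                         toy_classifier, toy_triangulate))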
So I'm going to show you a video of this working. This is a video of STAIR
grasping a variety of objects it's seeing for the first time, using -- you know, that ball is a
cheap web cam we bought from an electronics store, and using even, you know,
cheap web cam images, the robot often understands the shape of
these objects well enough to pick them up.
The training set objects were just those five that you saw
earlier; there was no cell phone in the training set, there was
the wine glass, the pencil, the box, the eraser and the
coffee mug. But training on those five objects, it often generalizes to
grasp fairly different objects.
This [inaudible] works 88 percent of the time on a large test set of objects. I
went to the dollar store to buy objects for it to try to pick up -- I actually
have no idea what that is.
>>: You had a coffee -- the coffee pot upside down. Would it work right side up?
>> Andrew Ng: I'm pretty sure it did, yeah. And let's see. It turns out you can
use exactly the same algorithm on objects placed in a dishwasher. So for
these experiments we moved the camera back; so rather than a wrist-mounted
web cam, we actually used a higher quality pair of cameras, a higher quality set of
stereo cameras, mounted to the base of the robot off the left of the screen.
But it's actually the same algorithm: identify grasp points, triangulate and then
reach out there to pick up objects.
>>: So you use [inaudible].
>> Andrew Ng: Yes. So this is stereo here.
>>: [Inaudible].
>> Andrew Ng: No, actually that's the camera. So this one was a single web cam,
but we moved the arm to a couple of places to take a few -- to take two to four
pictures, say. Yeah. But then we used monocular cues in each of these, or used
monocular perception to identify grasp points in each
image separately, and only after that do we triangulate.
>>: [Inaudible].
>> Andrew Ng: Yeah, right. Because from each image you -- because, say, you
take two pictures and in each you say where you think the big cross
should be, then you triangulate the points where you put the big crosses. Yeah?
>>: [Inaudible]
>> Andrew Ng: Oh, when we made this video, the control of the robot arm
had a very low update rate, and so we would command the robot in larger steps
than we would have liked. This is a software bandwidth problem, just how fast
we could send commands to the robot.
>>: I was just wondering if it was reestimating the position of.
>> Andrew Ng: No, it's not.
>>: [Inaudible].
>> Andrew Ng: Yeah. Not -- not in these videos.
>>: [Inaudible]. Do the thing and [inaudible].
>> Andrew Ng: Yeah. So that actually [inaudible] grasping
failures. It turns out, with this algorithm anyway, the majority of the
grasping failures are -- so what the robot does, right, is take two pictures and
then decide, I'm going to reach there. It's as if you close your eyes and wear heavy
gloves so you have no sense of touch, and you reach there and try to grasp. So the
majority of the grasping failures are when you accidentally knock the object slightly
and then you, you know, don't know you did that. So we actually have new
hands with touch perception in the robot's fingers, and that makes this work better.
[Inaudible]. Were there other questions? Okay.
Okay. Cool. So, so far I've been showing you experiments using our first STAIR
platform, STAIR 1. These are pictures of our second STAIR platform. STAIR
2 uses a larger, more mechanically capable arm, and this one was actually built
by one of my colleagues -- [inaudible] say more about this later -- but just to show you
an example of the same algorithm on a different robot: this is using a Barrett
arm, which is a much larger, more mechanically capable arm, capable
of carrying heavier payloads. It's the same algorithm that finds grasp
points; the one modification is that you now need to plan the positions of the
fingers as well, right, you need to decide how to position your fingers
to pick up different objects.
There was a stereo pair of cameras a little bit off the right of the
screen; it takes those images, finds grasp points and then reaches out to pick up
these objects. And they [inaudible]. So that turns out to be a fake rock. There are a lot of
fake rocks in my office for some reason. And for those of you that -- for those
of you that go to the NIPS conference, you know why we have to keep
[inaudible]. So, yeah, it actually seems to work.
And so that was grasping. The last of the elements I'll tell you about is the
spoken dialogue system, which we built using a reinforcement learning algorithm.
So those of you that know me a little bit will know that my students and I have
been heavily invested in applying reinforcement learning algorithms to the control of a
variety of robots. So just for fun, here's a video of [inaudible] helicopter being
flown using a reinforcement learning algorithm. So everything here is computer
controlled flight where, you know, using one of these learning algorithms, it has
learned to control a helicopter. So a split-S is a fast 180 degree turn. [Inaudible]
is another fast 180 degree turn. Do two loops. The second loop -- and
[inaudible] a fast spin at the top, right there. Another stall turn, done in reverse.
Then backwards. [Inaudible] pardon?
>>: [Inaudible].
>> Andrew Ng: [Inaudible] what?
>>: [Inaudible].
>> Andrew Ng: [Inaudible] not that I'm aware of. Oh, a 90 degree horizontal
[inaudible]. Stationary rolls are one of the most difficult [inaudible]; this is another very
difficult maneuver. The tic-toc is like an inverted, you know, grandfather
pendulum clock, right?
>>: [Inaudible] G forces.
>> Andrew Ng: Let's see.
>>: [Inaudible].
>> Andrew Ng: We could do that. Yeah. So, you know, [inaudible] many
fans of machine learning -- it turns out that just in the United States there are maybe
half a dozen groups that work on [inaudible] helicopter controllers. Perhaps the most
[inaudible] is Eric Feron's group that just moved from MIT to Georgia
Tech. But it turns out these are by far the most difficult, most advanced
maneuvers flown on any [inaudible] helicopter, and that's actually a completely
non-controversial statement, [inaudible] algorithms. And in fact, we've actually
more or less run out of things to do. There aren't really other maneuvers we
want to do but [inaudible]. Yes?
>>: Is this something that the helicopter is doing that even a human expert would
find challenging? Not [inaudible] myself, but someone.
>> Andrew Ng: Yeah. So we are fortunate to have one of the best pilots in the
country work with us -- not the very best pilot in the country. And this did learn
from him, but it flies many of the maneuvers even better than he does. I'd just
say it's maybe competitive with the very best pilots in the world. I wouldn't say
this outperforms the very best pilots in the world, but it does outperform our
pilot, who is one of the top 50 pilots in the United States, maybe.
>>: For RC helicopters.
>> Andrew Ng: For RC helicopters, yeah. It turns out you can't do these things on
full size helicopters. [laughter].
>>: [Inaudible] seem like there's somehow does it notice [inaudible] things
[inaudible].
>> Andrew Ng: I see. Yeah, so it turns out the helicopter flies in a big wide empty
space; we just -- no, we are not detecting or avoiding obstacles in the air.
>>: Right [inaudible] kind of just try to do [inaudible].
>>: [Inaudible] model of the ground there.
>> Andrew Ng: We know where the ground is, but we just happen -- we just
happen to, you know, command the entire maneuver far enough away from the
ground that you don't worry about it. It turns out, you know, with modern GPS
systems you can get about -- well, you can get --
>>: What sensors do you have on the [inaudible]?
>> Andrew Ng: Yes. So, right, we have a [inaudible] and gyros and a
compass, a magnetometer, on board the helicopter. For state estimation you can use
either GPS or cameras. These videos were done with cameras on the ground to
estimate the position. But you can also use GPS, which gives you about 2
centimeter error with modern GPS systems. All right.
So that was reinforcement learning, and following in the footsteps of many others,
[inaudible] and so on, we used these sorts of learning
algorithms to develop the spoken dialogue system. But I won't talk about
that here.
So taking the elements I described and putting them together -- object
recognition, mobile manipulation, depth perception especially as applied to robotic
grasping, and the spoken dialogue system -- when you put these things together,
what you can do is build the STAIR, please fetch the stapler from my office
application. So let's see that.
>>: [Inaudible].
>>: [Inaudible].
>> Andrew Ng: So that's the spoken dialogue system kicking off the whole thing,
and it just said Quoc's [phonetic] office; that's one of the Ph.D. students' offices. So
the robot uses [inaudible] robot navigation to navigate to the office. It then uses the
vision system to detect a door handle, drives closer and takes another image, confirms
the location of the door handle, figures out, you know, which door handle to
push on and so on. Again, this is a novel door handle; it has not seen this door or
this handle before. It goes inside to where it knows the student's desk is, uses
foveal vision -- that's the camera on top moving around, and that's the camera view
on the lower left. So the camera [inaudible] takes different images; finally it
zooms in to confirm where it thinks it's found the stapler, identifies the grasp point
using that learning algorithm I described earlier -- the cross here
in this camera view is the location of the estimated grasp point -- and it then reaches
out to pick up the stapler.
It turns out, you know, it picks up many different objects; the stapler is a [inaudible],
it turns out [inaudible] pick up very reliably. And finally it switches back to the indoor
robot navigation algorithm to go back to [inaudible]. And so
there you go. And you know, on the one hand, this was a, quote, demo, but on
the other hand, we've actually done this a few times, fetching
objects from a few different places and so on. And so on the one hand this is a,
quote, demo, but on the other hand, it genuinely integrates all those components I
described earlier, and I think -- I hope this is really the genuine beginnings of robots
that are able to usefully fetch items from around the office.
It turns out that once you have all these components, it becomes relatively
easy to rapidly put them together -- navigation, door opening, vision
and so on -- to build other applications. So what I want to do is very
quickly tell you about a second application that we're working on: having a robot
take inventory. So here's what I mean.
Here's a map of the Stanford computer science building, zooming into those
four offices. Actually I think this one used to be Christina's office. Oh, no, you're
on the second floor. Never mind. This was your office? Maybe. So what I want
is for a robot to be able to go inside these four offices and take inventory. So
imagine after everyone's gone home, the robot goes inside and figures out where
things are -- say, figures out where all the coffee mugs are in these offices.
When we tried to build this application, we found that by far the weakest link was
object recognition. So this was the result of applying, you know, object recognition
to detect coffee mugs, and we're not the best people in the world at building working
object recognition systems, and if some of you say that you know more about vision
and you could get this to work better, I would have absolutely no argument with that.
On the other hand, we were highly motivated to tune a vision system to work as
well as we could make it, and this was actually about the best that, you know, we --
as not the most experienced vision people, but not totally stupid people
either -- were able to do.
And for this piece, when you look at robotic object recognition, what I want to
talk about takes inspiration from the natural world. If you look at
computer vision, I think most computer vision today is based on RGB color -- red,
green, blue color -- or based on gray scale vision, and on one hand this makes sense
because a lot of video or a lot of images are filmed for humans, and if you want to
understand those sorts of images then you have to really
understand RGB images or gray scale images. But on the other hand, if you look
at perception in the natural world, it extends well beyond the human visible
spectrum. So for example, bats and dolphins use sonar to estimate distances
directly. And with this bird on the right -- this is a pretty boring looking bird, it's
just a black colored bird, it's not very interesting to look at -- but it turns out if you
look at this bird in ultraviolet, it's actually -- these birds can see in
ultraviolet, and they appear very colorful to each other, and this is of course rendered
in false color so that you and I can see it, too.
And so we've actually done work on using this sort of depth perception for object
recognition, as well as sort of hyperspectral sensing outside the visible spectrum,
but I'll talk about only the depth perception piece today. And to describe that, let's
revisit stereo vision, which we've heard a lot about already. This is another cartoon
description of stereo vision, all right. So to estimate distance, what stereo does is
it picks a point on, say, the mug and it then, you know, extrapolates rays from two
cameras and then uses triangulation to compute distances. There's an idea called
active stereo, which is when you replace one of the cameras with, say, a laser
pointer, and what you do is shine the laser beam onto the object, and so this
casts a bright spot on the object. The camera then sees the position of this
red dot, or green dot, and you can then use triangulation to estimate distance.
And this picture is exactly analogous to when you had two cameras and two rays
coming out of them.
The difference is that when you had two cameras, it was very hard to tell whether the
two cameras were pointing at the same point -- that's the correspondence
problem. Now that we have a laser pointer and a camera, it's very easy to be sure the
laser pointer and the camera are looking at the same point, because, well, you just
painted that point bright green.
So this idea is called active stereo; it's a completely standard idea, a very old idea.
And it turns out it's also completely standard to take this idea even a little bit
further, which is that instead of casting a single dot into the scene, you can cast a
vertical stripe into the scene and scan the stripe horizontally like that, and this
gives you a direct 3D measurement, a direct distance measurement, of every
single point in the scene.
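(Here is a minimal sketch of the triangulation behind that laser setup, worked in the horizontal plane: the laser fires at a known angle, the camera sees the bright spot at some pixel column, and the two rays are intersected. The baseline, focal length and principal point below are made-up numbers.)

    import math

    def active_stereo_point(u_pixel, laser_angle_rad,
                            baseline_m=0.3, focal_px=600.0, cx=320.0):
        """Triangulate the laser-lit point.  The laser sits at the origin firing at
        laser_angle_rad above the baseline; the camera sits baseline_m along +x,
        looking along +y.  No correspondence search: the bright pixel IS the match."""
        # bearing at the camera between the baseline (back toward the laser)
        # and the ray through the bright pixel
        beta = math.atan2(1.0, -(u_pixel - cx) / focal_px)
        alpha = laser_angle_rad
        rng = baseline_m * math.sin(beta) / math.sin(alpha + beta)  # laser-to-point range
        return rng * math.cos(alpha), rng * math.sin(alpha)         # (x, y) in meters

    # example: laser fired at 80 degrees, bright spot seen at pixel column 250
    print(active_stereo_point(250.0, math.radians(80)))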
So just to show you what that looks like, this is a video of our laser scanner
in operation. As the vertical laser is panned horizontally across the scene, it
is measuring the 3D distance of every point the laser falls on. Okay.
And with that, these are examples of some of the data you get. On the left is
a normal visible image and on the right is a 3D point cloud, you know, of the
same scene. And the point cloud on the right is rendered from a slightly higher
point of view, right, as if you took the camera and moved it up.
And on the one hand this looks like a lot of data, because you have a distance for
every point. On the other hand, there's actually also maybe less information here
than might appear to the human visual system, just because our human visual
system is so good at [inaudible] these scenes.
And so for example, we still do not see the rear halves of these
coffee mugs, right, because [inaudible].
So given [inaudible] you can do things like compute surface normals, and so in this
image the different colors indicate different, you know, orientations of the
surfaces. So purple are horizontal surfaces and green are vertical surfaces, and so
on across the range of orientations.
And the way I think about this is that you can now represent a pixel using this
nine-dimensional vector: for every pixel you know its RGB color, you know its XYZ
position and you also know, you know, N1, N2, N3, its 3D surface normal vector. Now
these nine components are not independent, so you could argue whether it's really
9-dimensional or actually lower dimensional. But the way I think of this is that if you
use a camera then it's as if you only get to observe the first three components of this
vector; with these other sensors you get to observe the full vector, and that lets you
directly measure things like object shape, object shape features, object size. You can
ask questions like, is it sitting on a horizontal surface, because it turns out most
coffee mugs are found on horizontal surfaces like desks and [inaudible] and so on.
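(A tiny sketch of that nine-number-per-pixel representation and of the "is it sitting on a horizontal surface" cue; the 15 degree tolerance and the example values are arbitrary assumptions.)

    import math
    import numpy as np

    def pixel_descriptor(rgb, xyz, normal):
        """Nine numbers per pixel: color, 3D position, surface normal.
        A plain camera only ever lets you observe the first three."""
        return np.concatenate([rgb, xyz, normal])

    def on_horizontal_surface(support_normal, tol_deg=15.0):
        """Crude cue: is the supporting surface's normal roughly vertical?
        (Most coffee mugs sit on desks and other horizontal surfaces.)"""
        n = support_normal / np.linalg.norm(support_normal)
        tilt = math.degrees(math.acos(min(1.0, abs(float(n[2])))))
        return tilt < tol_deg

    desc = pixel_descriptor(np.array([0.6, 0.5, 0.4]),    # R, G, B
                            np.array([1.2, 0.3, 0.8]),    # X, Y, Z from the scanner
                            np.array([0.0, 0.0, 1.0]))    # N1, N2, N3 surface normal
    print(desc.shape)                                     # (9,)
    print(on_horizontal_surface(np.array([0.05, 0.02, 0.99])))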
And when you apply this to object recognition, this is actually a fairly typical result,
where adding the 3D information completely cleans up the result, so you get,
you know, near perfect object recognition.
If you evaluate this more [inaudible], the F1 score for
coffee mugs goes up from 67 to 94 percent. And since this is F1, not error,
if you think of one minus F1 as the error, informally you can think of this as an
80 percent error reduction, and for us, anyway, this was actually the gap between a
vision system that was not useable for an application and a vision system that is
useable for the application.
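(Spelling out the arithmetic behind that "80 percent" figure, treating one minus F1 as an informal error rate as he suggests:)

    err_rgb   = 1 - 0.67       # 0.33
    err_depth = 1 - 0.94       # 0.06
    print((err_rgb - err_depth) / err_rgb)   # ~0.82, i.e. roughly an 80 percent error reduction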
>>: This [inaudible] other objects in the class [inaudible] particular --
>> Andrew Ng: Oh, no, this is object class recognition, training and testing on
different [inaudible]. So this is the take-inventory application, and so this is STAIR
using, you know, the novel-door-opening algorithm, the indoor navigation and
so on. It goes inside these offices and uses the laser scanner to scan the offices.
So there you see it going from desk to desk, and you see that vertical green laser
being used, and so, you know, the robot is building its 3D model; it's getting
these 3D point clouds of all the desks in the office.
So it's done with the first office. Going next door to the second office in the row.
Again [inaudible].
>>: [Inaudible].
>> Andrew Ng: No, more than that. Just maybe 10 X I think.
>>: [Inaudible].
>> Andrew Ng: [Inaudible] coffee mugs. Yeah, not when the robot is moving,
yeah. So let's see. In these experiments I think it took about eight seconds to
scan a desk, so hopefully you aren't moving the coffee mugs during those eight
seconds. Yeah. And I think, yeah. Let's see. And so, results. If you use only
visible light, if you use only, you know, color vision or RGB vision, using our
classifier, the best classifier that we were able to build, these are the results you get,
where every red dot and every black dot is either a false positive or a false
negative.
With the [inaudible] information, these are the [inaudible] you get. So there were
seven coffee mugs in those four offices, left there in their
natural places by the denizens of the offices. We added an additional 22 coffee
mugs, making a total of 29 coffee mugs, and these are actually the results of the
first experiment we ran; the robot actually found 29 out of 29 coffee mugs.
Since then we've repeated the experiment a few more times, and it's a fairly typical
result for the robot to make somewhere between zero and two mistakes on this
sort of problem. And there, in fact, are also the automatically extracted pictures of
all the coffee mugs. And so again, I think, you know, on the one hand this was a,
quote, demo, but on the other hand, I think this is really the genuine beginnings
of robots that are able to go around and usefully take inventory.
So here's the last thing I will tell you about. So, you know, earlier I showed pictures
of the STAIR 2 and future planned STAIR platforms, and this [inaudible] was designed
and built for us by my friend Ken Salisbury, and the last thing I want
to do is tell you a little bit about the personal robotics program, which is slightly
different.
So this isn't only work at Stanford; this is maybe more about work at other
universities even. So I think the PC revolution in the 1970s was enabled by there
being a standardized computing platform for everyone to develop on. This is
the Apple II PC. And it was the fact that it was a standardized computing platform --
it was a very expensive computer at the
time, but it made it possible for someone to, quote, invent the spreadsheet and
for everyone else in the world to then use the same spreadsheet software.
That made it possible for someone to, quote, invent a word processor and then
for everyone else in the world to use the same word processing software. And
obviously Windows [inaudible] a huge role in this, too.
And I think in robotics we lack such a platform. And this means two things. One,
it means there are very high startup costs in robotics, because if you go around the
country, you see that, you know, all these research groups spend all this time
building up their own robotic platforms. That's a high startup cost. And furthermore,
if you go around the country, you also find that, with relatively
few exceptions, almost every research group has a completely unique
robotic platform, both in hardware and software. And that also makes it difficult
for research groups to share ideas or build on each other's inventions, because
your code won't run on my robot, and my code won't run on yours.
So together with a company called Willow Garage, we're working on building about
10 copies of this robot, which we hope to make available to universities and
research labs for free in some way, under some terms. And so what you see
here is a video of this robot being [inaudible] -- there's actually no AI here; all this is
human intelligence, a human using joysticks to control the robot. And you can
sort of tell that this robot is mechanically capable of doing many of the [inaudible]
we'd like it to. I'll let you watch the video.
>>: I'll buy it, I'll buy it.
>> Andrew Ng: I don't remember. This is the only segment that's sped up; all
the others were not. I'm asked about the beer a lot.
So in robotics today, clearly we should keep working on hardware platforms,
improving hardware and so on. But I think this video also shows that
robotic platforms today are maybe, quote, good enough to already carry out many
of the household tasks we'd like robots to, if only we can get the right software
into them to make robots do these things harmlessly.
So, yeah, we hope that these robots will roll off the assembly line within six
months, so you can almost buy one. [Inaudible].
>>: [Inaudible].
>> Andrew Ng: Let's see. So hopefully [inaudible] in some way to some
university research labs. And the company will [inaudible] presell these for a price
comparable to a luxury car, which is sort of a non-answer because luxury cars
[inaudible] and [inaudible].
>>: [Inaudible]. [laughter].
>> Andrew Ng: Yeah. So just to wrap up: [inaudible] a robot platform integrating
all these different areas of AI, and what you heard in this talk was a number of
tools that we put together to develop the stapler-from-the-office application as
well as the take-inventory application, and then you also heard about the [inaudible]
robot platform. And I just want to say out loud the names of the lead Ph.D.
students that made all this work possible. Ellen Klingbeil is the woman
in the video; Steve Gould; [inaudible] led most of the depth perception
and the grasping work, and I think he's actually giving a talk at LiveLab in two
weeks with all the details on the things I did not talk about.
Pieter Abbeel did most of the helicopter work, and Adam Coates is
also involved. Morgan Quigley probably sweat more blood than
anyone else on getting things to work on the STAIR project, and Eric Virgil
[phonetic] is also involved in all this. So thank you very much.
[applause].
>> John Platt: Do we have any more questions for Andrew?
>>: [Inaudible]. Is it based on industry [inaudible]?
>> Andrew Ng: No, it wasn't. Actually, we did a bunch of things [inaudible].
The one published piece was the following, which is, you know, we got a robot
and we have a component that uses speaker ID detection to try to figure
out who you are, and so the robot has this other mode where, if it doesn't know who
you are, it tries to make chitchat with you to try to elicit more words from you so
that it can hopefully recognize who you are, and then when it finally recognizes you,
or thinks it recognizes you, it takes the gamble and says hi, and
hopefully it got your name right. It turns out there are actually studies that if a
robot greets you by name, it generates something emotionally. So it does
that. None of this is illustrated in the integrated [inaudible].
>>: Did you learn -- what was the most [inaudible] you learned by integrating all
the pieces together, other than just [inaudible]?
>> Andrew Ng: So I think two things. One is that, to me, a lot of
the most interesting research has arisen at the traditional boundaries between
traditionally disparate fields of AI.
So for example, the foveal vision piece, where you use a pan-tilt-zoom camera --
if you worked only in vision, I don't know that you'd end up doing that. But once
you think about vision on a physical robot that can move things around, it's just
so natural to do. So that was one example. And combining vision and grasping
and so on. So as we work on this project we often just stumble on
interesting problems, you know, there at the boundaries of traditional [inaudible].
And the other is this interesting intellectual problem that we think a lot about,
which is integrated representations. So when we have all of these different
components -- one that's trying to find a door handle, one that's trying to recognize
objects, one that's trying to navigate without colliding into things -- is there a, you
know, what's the word, common lingo, is there a common language for all of
these very different algorithms to talk to, to interact with each other, so that they
are all representing their own knowledge, representing the things they
figure out about the world, in some sort of common unified representation? That
latter problem has us working a lot on this: is there a common representation, a
common language, for these sorts of algorithms to manipulate or to interact with?
>>: Also [inaudible] I mean, they're pretty impressive, but they're pretty far from
the human range of ability as far as grasping and manipulation.
>> Andrew Ng: Yeah.
>>: I mean, is it just a question of cost at this point, or is there still a lot of room
for improving them?
>> Andrew Ng: Let's see. There's ample room for improvement. There are --
boy, yeah, you know, human manipulation is amazing. But then it turns out that
for many tasks you do not need that. So let's see: tying shoelaces is really difficult
and [inaudible] is really difficult. But on the other hand, if you want to put a robot in
every home, maybe you don't need to tie shoelaces and maybe you don't need to
[inaudible]. And one of my favorite examples is, when [inaudible] they often
ask about Rosie the robot from the Jetsons cartoon, and it turns out, I
think, you know, to build a Rosie robot that would wisecrack and do all those things
and play with the kids and, you know, whatever -- I don't think we'll get
there in the next few years. But on the other hand, it turns out that if you want to
put a useful robot in every home, you know, wisecracking with the kids is not needed,
and so I actually think that [inaudible] we'd like to develop the technology to
put a robot in every home in the next decade, and I think that's feasible.
>>: So given the cost of the pan [inaudible] camera, why not just stitch a whole
bunch of high res cameras together? Is it really cheaper to have a thing physically
moving around and just doing this --
>> Andrew Ng: I see. Yeah. I don't know. It may make sense to just use the --
let's see. So I believe there are 100 megapixel CCD imagers, and I believe a
pan-tilt-zoom camera corresponds to a one gigapixel camera, which you cannot buy.
>>: [Inaudible] pixel images.
>> Andrew Ng: Oh, there are. I didn't know that. Cool. So you could do that.
There's still one other thing, which is the computational requirement. So if you
have a gigapixel image, you can't actually run a sliding [inaudible] over the
entire image, and so you may end up -- we've actually done -- we have
actually done some preliminary things with a higher res camera, [inaudible]
pixels somewhere in between the two. And even there, you know, it is
very hard to take such a huge image and download it from the camera to the
computer, because of FireWire and USB and all that. And even if you could send
down the images that fast, it's still very expensive to run your classifier over every
part of the image. And so even if you do that -- we've actually done just one
experiment -- we've actually used the same foveal idea I described here to look at
the lower res version and pick out the promising regions and then only digitally or
physically zoom into those regions.
>>: [Inaudible] shoot cameras? I mean if only to increase the switching speed?
Because I mean with the [inaudible] restrictions there's a [inaudible] so you can
just have a bunch of cheap cameras and multiplex.
>> Andrew Ng: Yeah, yeah, that totally makes sense too.
>> John Platt: Any other questions? Okay. Let's thank Andrew again.
>> Andrew Ng: Thanks very much.
[applause]