>> John Platt: So I'm really pleased to present Andrew Ng. He's known to many of us because he's collaborated with us over the years. He was a Ph.D. student at the University of California at Berkeley, and since 2002 he's been at Stanford as a professor working on lots of hard AI problems across multiple fields. So here's Andrew.
>> Andrew Ng: Thanks, John. So what I want to do today is tell you about some work we've been doing on STAIR, the Stanford AI Robot project. The STAIR project started about three years ago, motivated by the observation that the field of AI has fragmented into many subfields, and today each of those subfields is essentially a separate research area with entirely separate conferences and so on. What we wanted to do was define unified challenge problems that require tying together these disparate subfields, to pursue the integrated AI dream again. For those of you familiar with AI, I think of this as a project very much in the tradition of Shakey and Flakey, but doing this work with 2008 AI technology rather than the 1966 AI technology of Shakey's day. So these are the long-term challenge problems we set for ourselves: to build a single robot that can do things like tidy a room, use the dishwasher, fetch and deliver items around the office, assemble furniture, and prepare meals. The thought is that if you can build a robot that can do all of these things, then maybe that's when it becomes useful to put a robot in every home. In the short term, what we want to do is have the robot fetch an item from an office. In other words, have the robot understand a verbal command like "STAIR, please fetch the stapler from my office," and have it be able to understand the command and carry out the task. So what I'm going to do today is tell you about the elements we tied together to build that "STAIR, please fetch the stapler from my office" application. The elements, listed on this slide, are object recognition, so it can, say, recognize the stapler; mobile manipulation, so it can navigate indoor spaces and open doors; depth perception, which gets into a discussion of estimating distances from a single image; grasping and manipulation, to let the robot pick up objects; and lastly a spoken dialogue system to tie the whole system together. So let's start by talking about object recognition. I think robotic vision today is far inferior to human vision, and there are many reasons that human vision is so much better than current robotic vision systems. People talk about the use of context, or common sense; there are many reasons like that. One reason that I think has not been exploited in the literature is simply that humans use a fovea to look directly at objects and therefore obtain higher resolution images of them, and recognizing objects is just much easier from high resolution images than from low resolution ones. So for example, if I show you that image and ask you what it is, how many of you can tell?
>>: [Inaudible].
>> Andrew Ng: Well, cool. You guys are good. It's actually easier from the back of the room. And once I show you the high resolution image, it's so much easier to tell what it is. It turns out the picture on the left is what a coffee mug looks like at a five meter distance from the robot, and so maybe it's no wonder that object recognition is so hard to get to work well. So just to be clear: if I'm standing here facing you like this, I actually do not have enough pixels in my eyes to recognize that this is a laptop.
For me to recognize this as a laptop, I need to turn my eyes and look directly at it and get a higher resolution image of it. I can then recognize it as a laptop, look away, continue to track this black blob in my peripheral vision, and know that it's still a laptop. It turns out that using off-the-shelf hardware it's fairly straightforward to replicate this sort of foveal plus peripheral vision system. In particular, you can use a pan-tilt-zoom camera, a camera that can turn and zoom into different parts of the scene, to simulate the fovea, and a fixed wide-angle camera down here to simulate your wide-view, lower resolution peripheral vision system. And let me just point out that this is unlike object recognition on Internet images. If you're a computer vision researcher and all you do is download images off the Internet, then you're actually not allowed to work on this problem, because you cannot zoom into images you downloaded off the Internet. Recognition in Internet images is an important and interesting problem in its own right, but I think when you work with vision in physical spaces, alternatives like these become very natural. Just a little bit more detail. We can learn a foveal control strategy, that is, learn where to look next in order to minimize uncertainty, or maximize information gain and minimize entropy. We express the entropy over the uncertainty about the objects we're tracking and the objects we may not yet have found in the scene, and when you maximize the information gain, what it boils down to is a foveal control strategy that trades off, almost magically, between the goals of looking around to search for new objects versus occasionally confirming the locations of previously found objects. This, by the way, is the only equation I have in this talk, so I hope you enjoyed it. Just to show you how it works: in this video, the upper right hand corner is the wide-angle peripheral view and the lower left corner is the high resolution foveal view. On the upper left is what we call the interest belief state -- my laser pointer is running out of battery -- which is a learned estimate of how interesting it will be to look at different parts of the scene, that is, how likely you are to find a new object if you look there. And as the robot pans its camera around to zoom into different parts of the scene, you can tell that it is far easier to recognize the objects from the high resolution foveal view on the lower left than from the view on the upper right. If you evaluate the algorithm more quantitatively, then depending on the experimental setup you get a 71 percent performance improvement. And just to be clear: if any of you are ever interested in building some camera system or vision system for a physical space -- say a camera system in a retirement home to monitor the retirees and ensure their safety or whatever -- I actually think of slapping a fovea on it as low hanging fruit for suddenly allowing yourself to see the entire world in high resolution and giving your vision system a significant performance boost.
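A minimal sketch of the gaze control idea just described: point the foveal camera wherever an observation is expected to most reduce entropy over where objects might be. This is purely illustrative, not the STAIR code; the grid of belief cells and the detector's hit and false-alarm rates are assumptions made up for the example.

```python
# Information-gain gaze control sketch: maintain a per-cell probability that an
# object is present, and fixate the cell whose observation most reduces entropy.
import numpy as np

def entropy(p):
    """Binary entropy of each cell, in bits."""
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def expected_entropy_after_look(p, hit_rate=0.9, false_alarm=0.05):
    """Expected posterior entropy of a cell after one foveal observation,
    assuming a detector with the given (hypothetical) hit/false-alarm rates."""
    p_detect = hit_rate * p + false_alarm * (1 - p)      # chance detector fires
    post_if_detect = hit_rate * p / p_detect              # Bayes update, "fired"
    post_if_miss = (1 - hit_rate) * p / (1 - p_detect)    # Bayes update, "silent"
    return p_detect * entropy(post_if_detect) + (1 - p_detect) * entropy(post_if_miss)

def choose_next_gaze(belief):
    """Pick the cell with the largest expected information gain."""
    gain = entropy(belief) - expected_entropy_after_look(belief)
    return np.unravel_index(np.argmax(gain), belief.shape)

# Toy "interest belief": mostly unexplored cells plus one previously found object
# whose location is already almost certain.
belief = np.full((4, 6), 0.5)
belief[1, 2] = 0.97
print(choose_next_gaze(belief))   # fixates an unexplored cell first
```

Because the near-certain cell offers little expected gain, a controller like this spends most fixations searching and only occasionally re-confirms known objects, which is the trade-off described above.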
The next topic, which I'll cover in just two slides, is mobile manipulation: having robots navigate and open doors. So let's talk about that. We actually built two versions of this system. For the first one, it turns out that most office buildings have nearly identical doors, and very briefly, what we did was develop a representation for office spaces that allows a robot to reason simultaneously about a very coarse grid map, which enables the robot to navigate huge building-size spaces, and, in the same probabilistic model, about door handle models with roughly one millimeter resolution, since you need about three to five millimeter accuracy in order to manipulate a door handle. So we came up with a model that is probabilistically coherent despite the very different resolutions of these two spaces. The more recent version puts this together with the foveal vision system I just described, so that the robot uses vision to recognize novel door handles, and in some cases elevator buttons as well. You can put the robot in front of a novel door it has never seen before, and it will see and recognize the door handle, figure out how to manipulate it, and use a motion planner to plan the arm motion needed to manipulate the door handle. This video shows a number of examples of the robot seeing novel doors, test set doors it has never seen before. We were going around looking for doors to test this on; this is actually the robot trying to go inside the men's room. Elevator buttons are more elementary; this is the same algorithm pushing elevator buttons. And the overall performance of the system on opening novel doors was 91 percent.
>>: Have people tried this problem before, or --
>> Andrew Ng: So it turns out there's lots of work on opening known doors. I believe we're the first to have a robot open previously unseen doors. And so that sequence --
>>: [Inaudible] or could be push ones or --
>> Andrew Ng: I see. Yeah, you're right. In the video I showed, this was restricted to handles, not knobs, and only push doors. I believe this robot is mechanically not capable of pulling a door shut behind it, for example, but with the newer robot, which you'll see a picture of, my student Ellen is applying pretty much the same algorithm to that problem. We haven't done that yet.
>>: [Inaudible].
>> Andrew Ng: Yes?
>>: [Inaudible] using the [inaudible] or does it also have lasers or some other kind of [inaudible]?
>> Andrew Ng: Yeah, let's see. Boy, we've done many things over time. We have done this using stereo cameras. We've also done this using a single camera and a laser. And we've done this with different sets of sensors. The results you saw here, I believe, were all with a single camera and the horizontally mounted laser. Okay. Yeah. Although we have also done this [inaudible] using only vision. So the next thing I want to tell you about is depth perception, which takes us into a discussion of estimating depth from a single still image. So let's talk about that. If I show you that picture and ask you how far things were from the camera when the picture was taken, you can look at it and maybe sort of guess. Or if I show you that picture and ask how far objects were from the camera, you can tell the tree on the left was probably further from the camera than the tree on the right when this picture was taken.
The problem of estimating distances from a single image, a single still image, has traditionally been considered an impossible problem in computer vision, and in a narrow mathematical sense it is indeed impossible, but this is a problem you and I solve fairly well, and we would like our robot to do the same in order to give it a sense of depth perception. It turns out that there is, of course, lots of prior work on depth estimation from vision. Most of this prior work has focused on approaches like stereo vision, where you use two cameras and triangulation, and I think that often works poorly in practice. There are a number of other approaches that use multiple images, and it turns out to be very difficult to get them to work on many of these indoor and outdoor scenes. I should also say there's some contemporary work by Derik Home [phonetic] at CMU. But the question is: given a single image like that, how can you estimate distances? So this is the approach we took. We collected a training set comprising a large number of pairs of monocular images like these and ground truth depth maps like the ones on the right. In the ground truth depth maps, the different colors indicate different distances, where yellow is close by, red is further away and blue is very far away. These ground truth depth maps were collected using a laser scanner, where you send pulses of light out into the environment and measure how long the light takes to go out, hit something and bounce back to your sensor; because you know the speed of light, this allows you to directly measure the distance of every pixel. Then, having collected a large training set, we learn a function mapping from monocular images like these to what the ground truth depth maps look like, using supervised learning. A little bit more detail. In order to construct the features for this learning algorithm, we first went to the psychology literature to try to understand the visual cues used by humans, by people, to estimate depths. Some of the cues that you and I use turn out to be things like texture variations and texture gradients. For example, those two patches are made of the same stuff -- both are grass -- but the textures of these two patches look very different because they are at very different distances. Other cues that people use include haze or color: things that are far away tend to be hazy and tinged slightly blue because of atmospheric scattering. We also use cues like shading, defocus, occlusion and [inaudible]. For example, if those two look similar in size to you, that's only because your visual system is so good at correcting for distance; in fact, they are about 15 percent different in size, and so if you know that people are roughly five to six feet tall, then by seeing how tall someone appears in the image you can tell roughly how far away they are. So we constructed a feature vector that tries to capture as many of these cues as we could. Realistically, I think we do a decent job of capturing the first few cues on this list and a less good job of capturing the second half of the list. Then, given an image, we came up with a probabilistic model of distances. In detail, given an image we compute image features everywhere in the image, and then we construct a probabilistic model known as a Markov random field model.
What that does is allow us to model the relation between depths and image features; in other words, it models how image features directly help you estimate the depth at a point. It also models the relation between depths at the same spatial scale, because two adjacent pixels are more likely to be at similar distances than at very different distances, as well as the relations between depths at multiple spatial scales. When you train this algorithm using supervised learning, these are examples of test set results. The leftmost column is the single monocular image, the middle column is the ground truth depth map from the laser scanner, and the rightmost column is the estimated depth map given only that one image as input. The algorithm makes interesting errors. I'll point one out here: that tree there is actually in the foreground, right? That tree is fairly close to the camera, but the algorithm misses it entirely and thinks the tree is much further away. The example below it still looks okay. A few more examples. And I want to point out another interesting error. In this image up here, this bush is in the foreground and that tree is in the background. So these are two physically separate objects, where the bush in the lower right is significantly closer to the camera than the tree in the background, but the algorithm misses that as well and ends up blending together the depths of the tree and the bush. But other than that, [inaudible].
>>: [Inaudible].
>> Andrew Ng: Yeah. Yeah.
>>: What's with [inaudible] it was not [inaudible].
>> Andrew Ng: No. So [inaudible].
>>: Edges, or does it have sort of more [inaudible] stuff inside?
>> Andrew Ng: Let's see. So it turns out one thing we wanted to do was make this a convex problem, and so this is an MRF whose potentials are exponentials of L1 terms, that is, of sums of a lot of absolute value terms. And the absolute value terms essentially capture these sorts of relations. In newer versions of the model, we actually reason explicitly about [inaudible] and a bunch of other more complicated phenomena. There are other cues like this: if you find a long line in an image, that long line will probably correspond to a long straight line in 3D as well. So there are three or four types of cues like that that the model captures, and part of the challenge is how to encode all of these things so that it's still a convex optimization problem.
>>: Is there anything on the bottom layer that was like a [inaudible] model or something, or is it just very simple models hooking the features up to that?
>> Andrew Ng: Yeah, boy --
>>: [Inaudible] or was there any [inaudible]?
>> Andrew Ng: I see. There is one other machine learning piece I should mention, which is specifically edge detection. One of the steps we do is look at an image and, for every point in the image, try to decide if there's a physical depth discontinuity there. For example, standing here, there's a physical discontinuity between the top of the laptop and my chest, and so you try to recognize those points, and then those help the MRF do better as well. Yeah. Parts of it are complicated. [Inaudible] it's about 30 to 35 percent per-pixel error, but let's skip over that.
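As a rough illustration of the convex, absolute-value-based MRF objective just described -- a toy version only, since the real model uses learned feature predictions, multiple spatial scales, and edge cues -- here is a small sketch using the cvxpy modeling library. The feature-based depth predictions and the smoothness weight below are made-up stand-ins.

```python
# Toy L1 MRF for depth: per-pixel terms tie each depth to a feature-based
# prediction, and L1 smoothness terms tie neighboring depths together.
import numpy as np
import cvxpy as cp

H, W = 8, 10
rng = np.random.default_rng(0)
# Hypothetical per-pixel depth predictions from local image features, in meters.
feature_prediction = np.linspace(2.0, 10.0, W)[None, :] + 0.5 * rng.standard_normal((H, W))

d = cp.Variable((H, W))                            # depths to be inferred
lam = 1.0                                          # smoothness weight (assumed)
data_term = cp.sum(cp.abs(d - feature_prediction))
smooth_term = (cp.sum(cp.abs(d[1:, :] - d[:-1, :])) +   # vertical neighbors
               cp.sum(cp.abs(d[:, 1:] - d[:, :-1])))    # horizontal neighbors
problem = cp.Problem(cp.Minimize(data_term + lam * smooth_term))
problem.solve()
print(np.round(d.value, 2))                        # smoothed depth map
```

Because every term is an absolute value of an affine expression, the whole objective is convex and the solver finds its global optimum, which is the property Andrew points to above.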
More interesting: so far I've been showing you depth maps. One of the other things you can do is take the models you estimate and render them as 3D fly-through models, and I'm going to show you an example of that. For what I'm about to show you, the entirety of the input to the algorithm was one of these three images, so let's take a look at the sort of 3D models you get from these images. So first, three examples. The first picture is one we took ourselves on the Stanford campus. Given that single image, this is an example of the 3D fly-through model you get. [Inaudible] I'm actually really bad at driving this thing. Right. Turn left, look at the 3D shape of the tree. So let's imagine that we're standing together in front of this house, and I'm going to squat down, so the wall comes up and you can't see the cars anymore; stand up; squat down [inaudible]. The second and third images are [inaudible] images; they were downloaded off the Internet. Let's fly down the river. Turn right, look at the trees; turn left, look at the shape of the mountain, and so on. So -- yeah?
>>: [Inaudible].
>> Andrew Ng: Say that again?
>>: Were those [inaudible] used in the [inaudible]?
>> Andrew Ng: So that was a more sophisticated version of the algorithm than the basic depth map one. The ideas are roughly the same. Boy, I'd have to go into more detail. I guess one difference is that in the more sophisticated version, which I didn't talk that much about, we first oversegment the image using a superpixel segmentation algorithm. Imagine using a pair of scissors to cut the picture up into lots of small pieces, into superpixels, and then using an inference algorithm to take each of these pieces and position and orient them in 3D. When you do that, it helps you preserve planar surfaces, and lastly we texture map the image back onto this 3D model where we've placed all these pieces in 3D.
>>: [Inaudible]?
>> Andrew Ng: Say that again.
>>: [Inaudible] I had two of them but [inaudible].
>> Andrew Ng: Yeah. Actually you're right. Well, there's something I wasn't going to show, but let me see if I have it. I think I might have a hidden slide that does that. Yeah. So it turns out that on many images monocular does okay; stereo -- this is a stereo vision system -- did not find correspondences in places and therefore did not return depths there; and if you combine the monocular cues and the stereo cues, then you get measurably better results than either monocular or stereo alone, and by stereo I mean triangulation. And you can also do things like take a few images and build large scale models, where parts of the images are seen by only one camera and parts are seen by multiple cameras. So, if I could find where I was in the talk. Yeah. Okay. Cool. And it's October, which means some of you may recently have gotten back from your summer holidays. This algorithm is actually up on the website, and if any of you want to take your own holiday pictures and upload them to the website, the algorithm will turn your pictures into 3D models so you can revisit your holiday memories in 3D rather than as flat pictures. So that was depth perception.
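A toy sketch of combining monocular and stereo depth estimates, as mentioned a moment ago. The actual system combines the cues inside the probabilistic model; this simplified version just takes a precision-weighted average wherever stereo found correspondences, with made-up variances.

```python
# Assumption for the example: stereo returns a depth only where it found
# correspondences (NaN elsewhere), and each source has a rough known variance.
import numpy as np

def fuse_depths(mono, stereo, mono_var=4.0, stereo_var=0.25):
    """Precision-weighted average where stereo is available, monocular elsewhere."""
    fused = mono.copy()
    have_stereo = ~np.isnan(stereo)
    w_mono, w_stereo = 1.0 / mono_var, 1.0 / stereo_var
    fused[have_stereo] = (w_mono * mono[have_stereo] +
                          w_stereo * stereo[have_stereo]) / (w_mono + w_stereo)
    return fused

mono = np.array([[5.0, 6.0], [7.0, 8.0]])
stereo = np.array([[4.5, np.nan], [np.nan, 7.5]])   # gaps where matching failed
print(fuse_depths(mono, stereo))
```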
It turns out one of the most interesting applications of these ideas is robotic grasping and manipulation, because depth perception gives a robot a sense of the space around it -- although, on the other hand, with a robot you can also use lasers and the like to directly measure depth. But we applied some of these ideas to robotic manipulation, so let's talk about that. Robotics today is in an interesting state. Robots today, as many of you know, can be scripted to perform amazing tasks in known environments. One of my favorite examples is this; it was done in Japan 15 or 20 years ago. This is a picture of a robot balancing a spinning top on the edge of a sword. That red thing is a spinning top, the robot is holding a sword, and the top is being balanced on the narrow edge of the sword. So if this is a solved problem in robotics, something done 15 or 20 years ago, what's unsolved? Well, it turns out picking up that cup is an unsolved problem in robotics if you've never seen that cup before. So let's look at that problem: you've never seen this cup before, so how do you pick it up? Well, one thing you can do is use stereo vision to try to build a 3D model of the cup. On the STAIR project we've been fortunate to have had several companies donate hardware to us, so using a decent commercial stereo vision system, this is an example of the depth map we get, where the different shades of gray indicate different distances and black is where the algorithm did not find correspondences and therefore did not return distances. If you zoom into where the cup is, it's just a mess; you can barely tell if the handle is on the left or the right. So this is my cartoon of what stereo does -- I think all of you probably know what stereo is. In stereo depth perception you have two images, one from the left eye and one from the right eye, and stereo depth perception has to find correspondences. You pick a point in the left eye image, denoted by the cross; you then have to find the corresponding point in the right eye image; and then you send out two rays from the eyes through these two points and see where they intersect, and that triangulation lets you estimate the distance of a 3D point. You can also do this for a different point, shown there, and estimate the distance of that point as well. I think the reason that dense stereo is hard is that if you pick a point like that in the left eye image, it is very difficult to tell which of those points it corresponds to in the right eye image, and depending on which one you choose you get very different distances; it's very hard to pin down the 3D position of that point. What dense stereo does is try to take every point in the left eye image and triangulate it against every single point in the right eye image, and this is very difficult to get to work. But if stereo doesn't work for us, how do you pick this up? Well, we just said that given a single still image, you can already get a sense of the 3D depths and the 3D space of the scene. So this is what we did using monocular vision, using these monocular vision cues. We created a training set comprising five types of objects, and for each of these objects we labeled it with the, quote, correct place at which to pick up the object.
So we labeled a pencil as: pick it up by the midpoint; pick up a wine glass by the stem; pick up a coffee cup by the handle; and so on. Then we trained a learning algorithm, using monocular vision cues, so that it would take as input an image like this and try to predict the position of this big red cross. So given a single image, it uses monocular vision cues to decide where the grasp point -- the position of the red cross -- is. When the robot faces a novel object, like a novel coffee cup, what it does is use the learned classifier to identify the grasp point in the left eye image and the grasp point in the right eye image; you then take these two points and triangulate them to obtain a single point in 3D, and you reach out and grasp there. Okay. Contrast this with dense stereo vision, which tries to triangulate every single point in both images and is very hard; in contrast, this picks one, or sometimes a small number, of points in both images to triangulate, and that works much better. So I'm going to show you a video of this working. This is a video of the STAIR robot grasping a variety of objects for the first time; that's a cheap web cam we bought from an electronics store, and even using cheap web cam images, the robot often understands the shape of these objects well enough to pick them up. The training set objects were just those five that you saw earlier -- there was no cell phone in the training set; there was the wine glass, the pencil, the box, the eraser and the coffee mug. But training on those five objects, it often generalizes to grasp fairly different objects. This works 88 percent of the time on a large test set of objects. I went to the dollar store to buy objects for it to try to pick up -- I actually have no idea what that is.
>>: You had a coffee -- the coffee pot upside down. Would it work right side up?
>> Andrew Ng: I'm pretty sure it did, yeah. And let's see. It turns out you can use exactly the same algorithm on objects placed in a dishwasher. For these experiments we moved the camera back: rather than a wrist mounted web cam, we used a higher quality pair of stereo cameras mounted to the base of the robot, off the left of the screen. But it's the same algorithm: identify grasp points, triangulate, and then reach out to pick up objects.
>>: So you use [inaudible].
>> Andrew Ng: Yes. So this is stereo here.
>>: [Inaudible].
>> Andrew Ng: No. Actually that's the camera. So this one was a single web cam, but we moved the arm to a couple of places to take a few pictures, say two to four. But then we used monocular cues, monocular perception, to identify grasp points in each image separately, and only after that do we triangulate.
>>: [Inaudible].
>> Andrew Ng: Yeah, right. Because from each image -- say you take two pictures -- each says where it thinks the big cross should be, and then you triangulate the points where you put the big crosses.
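A minimal sketch of the sparse two-view triangulation just described. A classifier (not shown) is assumed to have already returned one grasp-point pixel per image, and the camera projection matrices are assumed known; standard linear (DLT) triangulation then recovers the single 3D point to reach for. The intrinsics, baseline, and pixel coordinates below are invented for the example.

```python
import numpy as np

def triangulate(P_left, P_right, px_left, px_right):
    """Linear (DLT) triangulation of one 3D point from two pixel observations."""
    def rows(P, pix):
        u, v = pix
        return np.array([u * P[2] - P[0], v * P[2] - P[1]])
    A = np.vstack([rows(P_left, px_left), rows(P_right, px_right)])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                       # null vector = homogeneous 3D point
    return X[:3] / X[3]

# Hypothetical calibrated cameras: identical intrinsics, 10 cm horizontal baseline.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.10], [0.0], [0.0]])])

grasp_left = (352.0, 260.0)          # pretend classifier outputs (pixels)
grasp_right = (302.0, 260.0)
print(triangulate(P_left, P_right, grasp_left, grasp_right))  # 3D point to reach for
```

Triangulating just this one corresponded point (or a handful of them) sidesteps the dense-correspondence problem described earlier.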
Yeah?
>>: [Inaudible].
>> Andrew Ng: Oh, when we made this video, the control of the robot arm had a very low update rate, so we would command the robot in larger steps than we would have liked. It was a software bandwidth problem, just how fast we could send commands to the robot.
>>: I was just wondering if it was re-estimating the position of --
>> Andrew Ng: No, it's not.
>>: [Inaudible].
>> Andrew Ng: Yeah. Not in these videos.
>>: [Inaudible]. Do the thing and [inaudible].
>> Andrew Ng: Yeah. So that actually causes grasping failures. It turns out that with this algorithm, anyway, the majority of the grasping failures are -- so what the robot does, right, is take two pictures and then decide: I'm going to reach there. It's like closing your eyes and wearing heavy gloves, so you have no sense of touch, and then reaching. So the majority of the grasping failures happen when you accidentally knock the object slightly and you don't know you did that. We now have new hands with touch perception in the robot's fingers, and that makes this work better. [Inaudible]. Were there other questions? Okay. Okay. Cool. So far I've been showing you experiments on our first platform, STAIR 1. These are the future planned STAIR platforms. STAIR 2 uses a larger, more mechanically capable arm, and this one was actually built by one of my colleagues; I'll say more about that later. But just to show you an example of the same algorithm on a different robot: this is using a Barot [phonetic] arm, which is a much larger, more mechanically capable arm, capable of carrying heavier payloads. It's the same algorithm that finds grasp points; the one modification is that you now need to plan the positions of the fingers as well, that is, decide how to position your fingers to pick up different objects. There's a stereo pair of cameras a little bit off the right of the screen; it takes those images, finds grasp points, and then reaches out to pick up these objects. That turns out to be a fake rock. There are a lot of fake rocks in my office for some reason. And for those of you that go to the NIPS conference, you know why we have to keep [inaudible]. So, yeah, it actually seems to work. And that was grasping. The last of the elements I'll tell you about is the spoken dialogue system, which was built using a reinforcement learning algorithm. Those of you that know me a little will know that my students and I have been heavily invested in using reinforcement learning algorithms to control a variety of robots. So just for fun, here's a video of our helicopter being flown using a reinforcement learning algorithm. Everything here is computer controlled flight, where, using one of these learning algorithms, it has learned to control the helicopter. So a split S is a fast 180 degree turn. [Inaudible] is another fast 180 degree turn. Two loops; on the second loop, [inaudible] a fast spin at the top, right there. A stall turn, then one done in reverse. Then backwards. [Inaudible] pardon?
>>: [Inaudible].
>> Andrew Ng: [Inaudible] what?
>>: [Inaudible].
>> Andrew Ng: [Inaudible] not that I'm aware of. Oh, a 90 degree horizontal [inaudible]. Stationary rolls are one of the most difficult maneuvers; [inaudible] is another very difficult one. The tic-toc is like an inverted grandfather clock pendulum, right?
>>: [Inaudible] G forces.
>> Andrew Ng: Let's see.
>>: [Inaudible].
>> Andrew Ng: We could do that. Yeah.
So, you know, for the many fans of machine learning here: it turns out that just in the United States there are maybe a dozen groups that work on [inaudible] controllers. Perhaps the most [inaudible] is Eric Feran's [phonetic] group, which just moved from MIT to Georgia Tech. But it turns out these are by far the most difficult, most advanced maneuvers flown by any [inaudible] helicopter, and that's actually a completely non-controversial statement, and [inaudible] algorithms. In fact, we've more or less run out of things to do. There are actually other maneuvers we want to do, but [inaudible]. Yes?
>>: Is this something that the helicopter is doing that even a human expert would find challenging? [Inaudible] myself, but someone.
>> Andrew Ng: Yeah. So we are fortunate to have one of the best pilots in the country work with us -- not the very best pilot in the country. And this did learn from him, but it flies many of the maneuvers even better than he does. I'd say it's maybe competitive with the very best pilots in the world. I wouldn't say it outperforms the very best pilots in the world, but it does outperform our pilot, who is one of the top 50 pilots in the United States, maybe.
>>: For RC helicopters.
>> Andrew Ng: For RC helicopters, yeah. It turns out you can't do these things on full size helicopters. [laughter].
>>: [Inaudible] seem like there's somehow does it notice [inaudible] things [inaudible].
>> Andrew Ng: I see. Yeah, so it turns out the helicopter flies in a big, wide, empty space; we are just not detecting or avoiding obstacles in the air.
>>: Right, [inaudible] kind of just try to do [inaudible].
>>: [Inaudible] model of the ground there.
>> Andrew Ng: We know where the ground is, but we just happen to command the entire maneuver far enough away from the ground that you don't worry about it. It turns out, you know, with modern GPS systems you can get about -- well, you can get --
>>: What sensors do you have on the [inaudible]?
>> Andrew Ng: Yes. So, right, we have [inaudible], gyros, and a compass, a magnetometer, on board the helicopter. For position estimation you can use either GPS or cameras. These videos were done with cameras on the ground to estimate the position, but you can also use GPS, which gives you about 2 centimeters of error with modern GPS systems. All right. So that was reinforcement learning, and, following in the footsteps of many others [inaudible], we used these sorts of learning algorithms to develop a spoken dialogue system. But I won't talk more about that. Taking the elements I described and putting them together -- object recognition, mobile manipulation, depth perception applied especially to robotic grasping, and the spoken dialogue system -- what you can do is build the "STAIR, please fetch the stapler from my office" application. So let's see that.
>>: [Inaudible].
>>: [Inaudible].
>> Andrew Ng: So that's the spoken dialogue system kicking off the whole thing, and it was just told Quak's [phonetic] office, which is one of the Ph.D. students' offices. The robot uses [inaudible] robot navigation to navigate to the office. It then uses the vision system to detect a door handle, drives closer and takes another image, confirms the location of the door handle, picks out which door handle to push on, and so on. Again, this is a novel door handle; it has not seen this door or this handle before.
It goes inside to where it knows the student's desk is and uses foveal vision -- that's the camera on top moving around, and that's the camera view on the lower left. The camera [inaudible] takes different images, and finally it zooms in to confirm where it thinks it has found the stapler. It identifies the grasp point using the learning algorithm I described earlier; the cross positioned here in this camera view is the location of the estimated grasp point. It then reaches out to pick up the stapler. It turns out the robot picks up many different objects; staplers, it turns out, [inaudible] pick up very reliably. And finally it switches back to the indoor navigation algorithm to go back to [inaudible]. And so there you go. On the one hand this was a, quote, demo, but on the other hand we've actually done this a few times, fetching objects from a few different places and so on. So on the one hand this is a, quote, demo, but on the other hand it genuinely integrates all of the components I described earlier, and I hope this is the genuine beginning of robots that are able to usefully fetch items from around the office. It turns out that once you have all these components -- navigation, door opening, vision and so on -- it becomes relatively easy to rapidly put them together into other applications. So what I want to do is very quickly tell you about a second application we're working on: having a robot take inventory. Here's what I mean. Here's a map of the Stanford computer science building, zooming in on those four offices. Actually, I think this one used to be Christina's office. Oh, no, you were on the second floor. Never mind. Was this your office? Maybe. So what I want is for a robot to be able to go inside these four offices and take inventory. Imagine that after everyone's gone home, the robot goes inside, figures out where things are, and, say, figures out where all the coffee mugs are in these offices. When we tried to build this application, we found that by far the weakest link was object recognition. This was the result of applying object recognition to detect coffee mugs, and we're not the best people in the world at building working object recognition systems; if some of you say that you know about vision and could get this to work better, I would have absolutely no argument with that. On the other hand, we were highly motivated to tune a vision system to work as well as we could make it, and this was actually about the best that we -- not the most experienced vision people, but not totally stupid either -- were able to do. So for this piece, when you look at robot object recognition, what I want to talk about takes inspiration from the natural world. If you look at computer vision, I think most of it today is based on RGB color -- red, green, blue -- or on gray scale images, and in one sense this makes sense, because a lot of video and a lot of images are filmed for humans, and if you want to understand those sorts of images then you really have to understand RGB images or gray scale images.
But on the other hand, if you look at perception in the natural world, it extends well beyond the human visible spectrum. For example, bats and dolphins use sonar to estimate distances directly. And this bird on the right is a pretty boring looking bird -- just a black colored bird, not very interesting to look at -- but it turns out that if you look at this bird in ultraviolet, and these birds can see in ultraviolet, they appear very colorful to each other; this is of course rendered in false color so that you and I can see it too. So we've done work on using this sort of depth perception for object recognition, as well as hyperspectral sensing outside the visible spectrum, but I'll talk about only the depth perception piece today. To describe that, let's revisit stereo vision, which we've heard a lot about already. This is another cartoon description of stereo vision. To estimate distance, stereo picks a point on, say, the mug, extrapolates rays from the two cameras, and then uses triangulation to compute distances. There is an idea called active stereo, in which you replace one of the cameras with, say, a laser pointer. What you do is shine the laser beam onto the object, so this casts a bright spot on the object. The camera then sees the position of this dot, the green dot, and you can use triangulation to estimate distance. This picture is exactly analogous to when you had two cameras and two rays coming out of them. The difference is that when you had two cameras, it was very hard to tell whether the two cameras were pointing at the same point -- you had a correspondence problem. Now, with a laser pointer and a camera, it is very easy to be sure the laser pointer and the camera are looking at the same point, because, well, you just painted that point bright green. So this idea is called active stereo; it's a completely standard, very old idea. It also turns out to be completely standard to take this idea a little bit further: instead of casting a single dot into the scene, you can cast a vertical stripe into the scene and scan the stripe horizontally, like that, and this gives you a direct 3D distance measurement of every single point in the scene. Just to show you what that looks like, this is a video of our laser scanner in operation: as the vertical laser is panned horizontally across the scene, it is measuring the 3D distance of every point the laser falls on. Okay. And with that, these are examples of some of the data you get. On the left is your normal visible image, and on the right is a 3D point cloud of the same scene, rendered from a slightly higher point of view, as if you took the camera and moved it up. On the one hand, this looks like a lot of data, because you have a distance for every point. On the other hand, there's actually also maybe less information here than might appear to the human visual system, just because our visual system is so good at interpreting these scenes. For example, we still do not see the rear halves of these coffee mugs, because [inaudible], and that's the way I think about it.
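A toy, two-dimensional sketch of the active-stereo geometry described above: the camera ray through the observed laser dot is intersected with the known laser beam, so there is no correspondence search. The focal length, baseline, and beam angle below are made-up example values.

```python
import math

def active_stereo_depth(pixel_u, focal_px, cx, baseline_m, beam_angle_rad):
    """Intersect the camera ray through the observed laser dot with the laser beam.

    Camera at the origin looking along +z; laser emitter at x = baseline_m.
    Camera ray: x = z * tan(camera_angle).  Laser line: x = baseline + z * tan(beam_angle).
    """
    camera_angle = math.atan2(pixel_u - cx, focal_px)
    z = baseline_m / (math.tan(camera_angle) - math.tan(beam_angle_rad))
    x = z * math.tan(camera_angle)
    return x, z          # point where the ray and the beam meet, in meters

# Example: 500 px focal length, principal point at u = 320, laser mounted 20 cm
# to the right of the camera, beam angled 14 degrees back toward the camera axis.
print(active_stereo_depth(pixel_u=360.0, focal_px=500.0, cx=320.0,
                          baseline_m=0.20, beam_angle_rad=math.radians(-14.0)))
```

Sweeping the beam angle, as the vertical-stripe scanner does, repeats this intersection for every pan position and yields a depth for every point the stripe crosses.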
So given these point clouds, you can do things like compute surface normals, and in this image the different colors indicate the different orientations of the surfaces: purple marks horizontal surfaces and green marks vertical surfaces, with a range of orientations in between. The way I think about this is that you can now represent each pixel using a nine-component vector: for every pixel you know its RGB color, you know its XYZ position, and you also know N1, N2, N3, its 3D surface normal. These nine components are not independent, so you can argue about whether it's really nine dimensional or actually lower dimensional. But the way I think of it is that if you use a camera, it's as if you only get to observe the first three components of this vector; with these other sensors you get to observe the full vector, and that lets you directly measure things like object shape, object shape features, and object size. You can ask questions like: is it sitting on a horizontal surface? Because it turns out most coffee mugs are found on horizontal surfaces like desks and [inaudible] and so on. And when you apply this to object recognition, this is actually a fairly typical result, where adding the 3D information completely cleans up the result, so you get near perfect object recognition. If you evaluate this more quantitatively, the F-score for coffee mugs goes up from 67 to 94 percent. And since this is F-score, not error, if you think of one minus the F-score as, informally, the error, you can think of this as roughly an 80 percent error reduction. For us, anyway, this was the gap between a vision system that was not usable for an application and a vision system that is usable for the application.
>>: This [inaudible] other objects in the class [inaudible] particular --
>> Andrew Ng: Oh, no, this is object class recognition -- training and testing on different [inaudible]. So this is the inventory application, and this is STAIR using that novel door opening algorithm, the indoor navigation, and so on, to go inside these offices and use the laser scanner to scan the offices. There you see it going from desk to desk, and you see that vertical green laser being used, and the robot is building 3D point clouds of all the desks in the office. So it's done with the first office and going next door to the second office in the row. Again [inaudible].
>>: [Inaudible].
>> Andrew Ng: No, more than that. Maybe 10X, I think.
>>: [Inaudible].
>> Andrew Ng: [Inaudible] coffee mugs. Yeah, not while the robot is moving, yeah. Let's see, in these experiments I think it took about eight seconds to scan a desk, so hopefully you aren't moving the coffee mugs during those eight seconds. Yeah. And I think, yeah, let's see. And so, results. If you use only visible light -- only color or RGB vision -- using the best classifier we were able to build, these are the results you get, where every red dot and every black dot is either a false positive or a false negative. With the depth information, these are the results you get. There were seven coffee mugs in those four offices, left there in their natural places by the denizens of the offices. We added an additional 22 coffee mugs, making a total of 29, and these are actually the results of the first experiment we ran: the robot found 29 out of 29 coffee mugs. Since then we've repeated the experiment a few more times, and it's fairly typical for the robot to make somewhere between zero and two mistakes on this sort of problem. And there, in fact, are the automatically extracted pictures of all the coffee mugs.
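A rough sketch of the nine-component per-pixel representation described above (RGB color, XYZ position, 3D surface normal), plus one derived shape cue, fed to an off-the-shelf classifier. The data here are synthetic and the "horizontal surface" threshold is an assumption; this only illustrates the feature layout, not the STAIR classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pixel_feature(rgb, xyz, normal):
    """Concatenate color, 3D position, and unit surface normal into one vector."""
    normal = np.asarray(normal, dtype=float)
    normal /= np.linalg.norm(normal)
    horizontal_surface = float(abs(normal[2]) > 0.9)   # surface here is roughly horizontal
    return np.concatenate([rgb, xyz, normal, [horizontal_surface]])

rng = np.random.default_rng(1)
# Fake "mug-side" pixels: vertical surfaces near desk height; fake background pixels.
mug = [pixel_feature(rng.uniform(0, 1, 3),
                     [rng.uniform(-1, 1), rng.uniform(0.5, 1), 0.75],
                     [rng.uniform(0.7, 1), rng.uniform(-0.3, 0.3), rng.uniform(-0.1, 0.1)])
       for _ in range(50)]
bg = [pixel_feature(rng.uniform(0, 1, 3), rng.uniform(-2, 2, 3),
                    [rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2), 1.0])
      for _ in range(50)]
X = np.vstack(mug + bg)
y = np.array([1] * 50 + [0] * 50)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([pixel_feature([0.6, 0.2, 0.2], [0.3, 0.8, 0.74], [0.95, 0.1, 0.05])]))
```

A camera-only system would see just the first three components; the point made above is that the remaining geometric components are what carry the shape and size cues.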
So again, I think, on the one hand this was a, quote, demo, but on the other hand I think this really is the genuine beginning of robots that are able to go around and usefully take inventory. So this is the last thing I will tell you about. Earlier I showed pictures of the STAIR 2 and future planned STAIR platforms, designed and built for us by my friend Ken Salsbury [phonetic], and the last thing I want to do is tell you a little bit about the personal robotics program, which is slightly different. This is not only work at Stanford; this is maybe even more about work at other universities. I think the PC revolution in the 1970s was enabled by there being a standardized computing platform for everyone to develop on. This is the Apple II. The fact that it was a standardized computing platform made it possible for someone to buy what was a very expensive computer at the time, for someone to, quote, invent the spreadsheet, and for everyone else in the world to then use the same spreadsheet software; it made it possible for someone to, quote, invent the word processor and for everyone else in the world to use the same word processing software. And obviously Windows [inaudible] had a huge run on this, too. I think robotics lacks such a platform, and this means two things. One, it means there are very high start-up costs in robotics, because if you go around the country, you see all these research groups spending all this time building up their own robotic platforms. That's a high start-up cost. Furthermore, if you go around the country, you also find that, with relatively few exceptions, almost every research group has a completely unique robotic platform, both in hardware and in software. That also makes it difficult for research groups to share ideas or build on each other's inventions, because your code won't run on my robot and my code won't run on yours. So together with a company called Willow Garage, we're working on building about 10 copies of this robot, which we hope to make available to universities and research labs for free, in some way, under some terms. What you see here is a video of this robot being teleoperated; there's actually no AI here, it's all human intelligence, a human using joysticks to control the robot. And you can sort of tell that this robot is mechanically capable of doing many of the things we'd like it to. I'll let you watch the video.
>>: I'll buy it, I'll buy it.
>> Andrew Ng: I don't remember. This is the only segment that's sped up; all the others were not. I get asked about the beer a lot. So in robotics today, clearly we should keep working on hardware platforms, improving hardware and so on. But I think this video also shows that robotic platforms today are maybe, quote, good enough to already carry out many of the household tasks we'd like robots to do, if only we can get the right software into them to make robots do these things autonomously. So, yeah, we hope that these robots will roll off the assembly line within six months, so you can almost buy one. [Inaudible].
>>: [Inaudible].
>> Andrew Ng: Let's see. So hopefully [inaudible] in some way to some university research labs. And the company will [inaudible] pre-sell these for a price comparable to a luxury car, which is sort of a non-answer because luxury cars [inaudible] and [inaudible].
>>: [Inaudible].
[laughter].
>> Andrew Ng: Yeah. So just to wrap up: the STAIR robot platform integrates all of these different areas of AI, and what you heard in this talk was a number of the tools we put together to develop the "fetch the stapler from the office" application as well as the inventory application, and you also heard about the personal robotics platform. I just want to say out loud the names of the lead Ph.D. students that made all this work possible. Ellen Clingbell [phonetic] is the woman in the video; Steve Gools [phonetic] and [inaudible] led most of the depth perception and the grasping work, and I think he's actually giving a talk in LiveLab in two weeks with all the details that I did not talk about; Peter Bouled [phonetic] did most of the helicopter work, and Allen Coze [phonetic] was also involved; Morgan Quiggly [phonetic] probably sweated more blood than anyone else on getting things to work on the STAIR project, and Eric Virgil [phonetic] was also involved in all this. So thank you very much. [applause].
>> John Platt: Do we have any more questions for Andrew?
>>: [Inaudible]. Is it based on industry [inaudible]?
>> Andrew Ng: No, it wasn't. Actually, we did a bunch of things [inaudible]. The one published piece was the following: we have a component that uses speaker identification to try to figure out who you are, and the robot has this other mode where, if it doesn't know who you are, it tries to make chitchat with you to elicit more words from you so that it can hopefully recognize who you are; then, when it finally thinks it recognizes you, it takes the gamble and says hi, and hopefully it got your name right. It turns out there are actually studies showing that if a robot greets you by name, it generates a different emotional response, so it does that. None of this is in the integrated [inaudible].
>>: What was the most [inaudible] you learned by integrating all the pieces together, other than just [inaudible]?
>> Andrew Ng: So I think two things. One is that a lot of the most interesting research, for me, has arisen at the boundaries between traditionally disparate areas of AI. For example, the foveal vision piece, where you use a pan-tilt-zoom camera: if you work only in vision, I don't know that you'd end up doing that, but once you think about vision on a physical robot that can move things around, it's just such a natural thing to do. So that was one example, and combining vision and grasping is another. As we work on this project we often just stumble on interesting problems at the boundaries of the traditional [inaudible]. The other is an interesting intellectual problem that we think a lot about, which is integrated representations. When we have all of these different components -- one that's trying to find grasp points, one that's trying to recognize objects, one that's trying to navigate without colliding into things -- is there, what's the word, a common lingo, a common language for all of these very different algorithms to use to interact with each other, so that they are all representing their own knowledge, the things they figure out about the world, in some common, unified representation? That latter problem has us working a lot on this.
Is there a common representation, a common language, for these sorts of algorithms to manipulate and interact with?
>>: Also, [inaudible] I mean, they're pretty impressive, but they're pretty far from the human range of ability as far as grasping and manipulation.
>> Andrew Ng: Yeah.
>>: I mean, is it just a question of cost at this point, or is there still a lot of room for improving them?
>> Andrew Ng: Let's see. There's ample room for improvement. There are -- boy, yeah, you know, human manipulation is amazing. But then it turns out that for many tasks you do not need -- so, let's see, tying shoelaces is really difficult and [inaudible] is really difficult, but on the other hand, if you want to put a robot in every home, maybe you don't need to tie shoelaces and maybe you don't need to [inaudible]. One of my favorite examples is that when [inaudible], they often ask about Rosie the robot from the Jetsons cartoon, and it turns out that to build a Rosie robot that would wisecrack and play with the kids and whatever -- I don't think we'll get there in the next few years. But on the other hand, if you want to put a useful robot in every home, wisecracking with the kids is not needed, and so I actually think we'd like to develop the technology to put a robot in every home in the next decade, and I think that's feasible.
>>: So, about the cost of the pan-tilt-zoom camera: why not just stitch a whole bunch of high-res cameras together? Is it really cheaper to have a thing physically moving around and doing this?
>> Andrew Ng: I see. Yeah. I don't know. It may make sense to just use the -- let's see. So I believe there are 100 megapixel CCD imagers, and I believe a pan-tilt-zoom camera corresponds to a one gigapixel camera, which you cannot buy.
>>: [Inaudible] pixel images.
>> Andrew Ng: Oh, there are? I didn't know that. Cool. So you could do that. There's still one other thing, which is the computational requirement. If you have a gigapixel image, you can't actually run the sliding [inaudible] over the entire image, and so you may end up -- we have actually done some related things with a higher res camera, with [inaudible] pixels somewhere in between the two. And even there, it's very hard to take such a huge image and download it from the camera to the computer, because of FireWire and USB and all that. And even if you could send down the images that fast, it's still very expensive to run your classifier over every region in the image. So even there -- and we've actually done just one experiment on this -- we've used the same foveal idea I described here to look at the lower res version, pick out the promising regions, and then only digitally or physically zoom into those regions.
>>: [Inaudible] shoot cameras? I mean, if only to increase the switching speed? Because with the [inaudible] restrictions there's a [inaudible], so you can just have a bunch of cheap cameras and multiplex.
>> Andrew Ng: Yeah, yeah, that totally makes sense too.
>> John Platt: Any other questions? Okay. Let's thank Andrew again.
>> Andrew Ng: Thanks very much. [applause]