>>: Good morning everyone. It’s my great pleasure to welcome all of you to the second in our state-of-the-art lecture series on computer vision. The first lecture—in case you missed it—was given by Jian Sun on face recognition. It’s on ResNet, you can find it. The series is being organized by The Virtual Center on Computer Vision. The Virtual Center sounds like a big name but it’s basically a bunch of us scattered across different MSR labs including, like, Baining Guo in Asia, and Andrew Fitzgibbon, and I, and a few others—Matt and Zhenyu—and we’re trying to just coordinate various research going on and product transfers in computer vision. We’re sponsoring this series, we’re working on a web page where you can find more information about various vision activities. And with that background, let me say a little bit about Larry Zitnick, our speaker for this morning. Larry got his PhD from Carnegie Mellon, 2003; he’s been with us since then and worked on a variety of different projects. Some wonderful work on three-D video, stereo matching, and he’s done some computational photography work as well, but for the last many years he’s been concentrating on object recognition, visual understanding, scene recognition, he’s had some wonderful work on understanding and modeling context using clip art animations, and most recently co-created the COCO database—that is the latest and greatest, some of you may have heard of ImageNet and deep learning, well COCO is much better and he might tell you why in this talk so, thank you, Larry. >> Larry Zitnick: Alright. So I recognize many of you, some of you I don’t recognize. I just want to give some background as to what I want this talk to be about. So probably, as Rick said, many of you have heard about deep learning and—you know—there’s been a lot of hoopla lately about deep learning and how it’s solved computer vision, how AI… the next coming of AI is gonna—you know—truly bring AI. And what I want to do is I want to give everybody some context for these big changes that have been happening because it’s hard to… it’s hard—you know—for… if you’re not a computer vision expert—and really this talk is geared toward people who aren’t computer vision experts—it’s hard to judge whether these advances are advances where you go, “Wow, computer vision actually really works now, like, I can actually recognize objects,” or it’s just a bunch of academics saying, “Wow, we’ve failed miserably before, and now we’re failing just a little bit less miserably and we’re really excited and now we’re publishing articles about it.” [laughter] Right? So obviously the truth lies somewhere in between, but I think it’s hard to kind of just like jump in, right, and for me just to say this is what deep learning is and this is what’s been solved and kind of get a sense for—you know—why is there so much excitement? So what I want to do today is I want to spend… I want to talk about how did we actually get here—you know—how did we get to this deep learning craze? And then, where are we now? So basically I’m gonna spend about forty-five minutes talking about the last twenty or thirty years in object recognition to give everybody some context for the big advances that happened in the field, why they happened, why they failed, why they were good for certain things. And then that will give you a sense for why everything… everybody was so excited about deep learning in the last three or four years.
And then we’ll also talk about—you know—is deep learning… is it going to solve it all—you know—is object recognition solved now? Can you trust the news articles that are coming out? So, we’ll get to that. Alright, so let’s begin with our history lesson. So in 1966 you can say computer vision kind of started around there. You have Marvin Minsky, a big AI guy, who went to one of his students one summer and said, “Hook a television camera up to the computer and get it to tell us—you know—make it tell us what it sees.” Right? Simple enough. [laughter] Now, as many of you—you know—probably could guess, it failed miserably. [laughs] You know, and it wasn’t many… there wasn’t much success in computer vision at this time. But to really kind of understand, it’s hard to put yourself in the place of somebody in the 1960’s or the 1970’s, right? But let’s just try to do that for a second. So imagine you have this image here and you want to detect the faces in the image. How would you detect faces? Now you… no computer vision—you know—answers, no knowledge of what’s already been done. If you were a—you know—a naïve, non-computer vision expert and you just wanted to detect faces, how would you do it? Any ideas? >>: Color. >> Larry Zitnick: Color? Color’s a good one. Any other ideas? Unfortunately color was not available back in 1970. [laughter] Any other ideas? >>: The eyes, the nose, and the mouth. >> Larry Zitnick: Yeah, you find the nose, right? So you find the nose—good. And you find the eyes, right? And maybe if the eyes are in the right location relative to the nose—voila—you have a face. This is exactly the same intuition that people in the 1970s had. I mean this is what you’d… if you ask a kid, “How do you recognize a face?” This is exactly the answer you get, and this is exactly what they tried. So you have something called a constellation model or pictorial structures which is basically saying you detect a nose, you detect a mouth, you detect the hair, and they should all be in the same kind of rough position relative to each other which is represented by these springs. So these springs kind of want everything to kind of rest at these certain locations, right? Now, they did excellent work and they got—you know—beautiful accuracy on the data set of, I think, about sixteen faces. And this is some—you know—captions… or pictures from the actual paper, so you can see the type of images they worked with. So, it’s fantastic work. [Laughter] And in case you’re skeptical, they even added noise to the images and it still worked. So, this is fantastic work. So face recognition was solved in ’73, kind of. Alright. So unfortunately—you know—in the late seventies, as anybody who’s—you know—known much about AI knows, what happened was everybody promised the world in the seventies and then suddenly nothing actually happened and you had this thing called the AI winter. And the AI winter also affected computer vision because computer vision is—you know—a subset of AI. So computer vision research kind of fell off a little bit as well and what we found was that nothing was really working so we kind of went back to basics. We said, “Well, let’s not try to detect people or animals or chairs or those sort of things which were big in the seventies. Let’s just try to detect an edge,” you know. So there’s a ton of papers in the eighties looking at edges, looking at optical flow, looking at these really kind of low level problems, ‘cause we could actually make a little bit of progress in these fields.
And at the same time there were other people saying, you know, “I could take these edges,” and “Let’s look at the properties of edges; some edges are parallel; some are like this;” and “How do we group edges together?” Looking at some of the more scientific questions. And there’s also—you know—fabulous debates about how do you actually represent objects? Do you represent objects using a three-D form? Do you recognize… represent objects using—you know—two-D representations and you just have a two-D representation for different poses? And there’s good debates about this and plenty of fine papers—you know— discussing this back and forth, and blah, blah, blah, but—you know—none of this actually helped in real recognition algorithms. So come around the late eighties something happened which was we kind of forgot about the science. We stopped thinking about the science behind computer vision and this is kind of the turning point when computer vision, I think, really switched from being more of a science to being more of an engineering discipline. And there’s a very simple reason for this which is: something actually worked. So, you have Yann Lecun, you have something called a convolutional neural network— which I’ll describe in just a little bit—and he was able to show that you could train a vision algorithm to recognize handwritten digits. And this is the first time that we had an algorithm that could kind of work out in the wild. You know, we had algorithms that could take—you know—parts going down a machine assembly line and kind of line them up roughly, or look for very specialized applications, or that the face stuff I showed you earlier, but again, that’s not going to work in the wild. But here… this actually worked. You could take zip codes off of real letters and actually recognize them, right? Alright, so how did this actually work? Well, so one of the things that I’m gonna be talking about a lot today is something called a filter. Probably most of you know what filters are, but for those of you who might not know what a filter is, a filter is basically a local operation you apply across the entire image. So if you have an image like this one, you can apply a filter where you take the pixels to the right and you subtract them from the pixels to the left and you get a response that kind of looks like this, so you can detect vertical edges. Or you can do the same thing with just subtracting the pixels from below you with the pixels that are above you and you basically detect horizontal edges, alright? So you can imagine that these sort of filters are good at finding—you know—edges in image or other sort of patterns that you might imagine. And what Yann Lecun and his coauthors realized was you don’t need to learn separate filters for this part of the image than you do for this part of the image. You can actually run the exact same filter over the entire image and it works just as well—which seems kind of obvious, and it is kind of obvious, but it actually made a big deal. So you could take a single filter, you run it over the whole image, and then you get a response—it looks like this. And using the neural network you can learn all sorts of different filters. And the cool thing here is before what they were doing is they were trying to learn different filters for every single position, but the problem was is number of parameters you needed to learn was huge. 
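To make that filter idea a little more concrete, here is a minimal sketch of running the same small edge filter over an entire image; it assumes NumPy and SciPy and a grayscale image stored as a 2-D float array, and the kernels are just the simple left-versus-right and above-versus-below differences described above, not anything from the actual systems discussed:

    import numpy as np
    from scipy.signal import convolve2d

    # Difference between pixels to the left and to the right -> responds to vertical edges.
    vertical_edge_filter = np.array([[-1.0, 0.0, 1.0],
                                     [-1.0, 0.0, 1.0],
                                     [-1.0, 0.0, 1.0]])

    # Difference between pixels above and below -> responds to horizontal edges.
    horizontal_edge_filter = vertical_edge_filter.T

    def filter_response(image, kernel):
        # The same kernel is applied at every position ("weight sharing");
        # mode="same" keeps the response map the size of the input image.
        return convolve2d(image, kernel, mode="same", boundary="symm")

    # Toy example: an image with a single vertical step edge.
    image = np.zeros((8, 8))
    image[:, 4:] = 1.0
    v = filter_response(image, vertical_edge_filter)    # large responses along the step
    h = filter_response(image, horizontal_edge_filter)  # roughly zero everywhere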
But if you can use the same filters across the entire image, suddenly the number of parameters goes from—you know—bazillions down to something very manageable, and you can actually learn it. And it’s because of this—the convolution of these filters—that the neural networks worked. And then what they would do is they take this filter response—we don’t really care about exact locations—so you take the filter response and you just shrink it, you just make it smaller. You can do this using a max operation, an average operation… there’s a lot of different ways of doing it, but you basically make it smaller and then, basically, you just keep repeating that process. So you have your input image, you run your filters, you make them smaller, and then you run another set of filters on their responses and make them smaller, and then you can basically do your fully connected neural network, alright? And this would be an excellent handwritten digit recognizer. Now you can apply… Yep? >>: Why the stages for creating… connecting the image to the neural network? >> Larry Zitnick: Where’s that? >>: So why were you doing that [indiscernible]? >> Larry Zitnick: Oh, going back and forth. >>: Yeah. >> Larry Zitnick: So what happens here is you have your image here, you run some filters on it and then you shrink it down, right? And what it does is when you do that pooling operation, you shrink it, basically you lose a little bit of spatial location information, right? And then you apply another set of filters, alright, and then you shrink them down and it loses a little bit of spatial information. So what you’re doing is you’re essentially saying, “These here are very simple filters which are—you know—kind of, like, edges, that sort of thing.” Now, these guys are basically gonna be taking combinations of filters which have a little bit of invariance in them, so they can represent, like, larger things. So instead of recognizing just this part of an eight it’s gonna represent the entire top of the eight, let’s say, and it gives you a little bit more invariance every time you do it. Does that…? ‘Kay. And you can do… we’ll talk a little bit more about this later ‘cause obviously this comes back to life—you know—twenty years later. So—you know—and then they do the same thing with faces. So you can take a box within an image and you can try to detect faces in an image and you basically take your box, and then you do the same thing with convolutions, and then you spit out some output. There’s a bunch of good work doing neural networks with faces. One little side note: today in my talk I’m going to be talking a lot about detection and classification. To, like, the layperson detection and classification are exactly the same thing but there are actually two distinct meanings within the vision community. Classification is… think about the handwritten digits. You’re given an image and you just have to say, “Is this a five? Is this a three? Is this a two? Does this image have a dog in it? Yes or No,” right? That’s classification. Now detection… you not only have to say, “Is there a dog in the image?” But you also have to draw a bounding box or some sort of localization of where that dog is in the image as well, so you’re detecting where in the image that object is. And if there’s two objects you want to detect both, alright?
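And as a tiny illustration of the shrinking (pooling) step described a little while back, here is a sketch that halves a filter-response map by keeping the max (or average) of each 2x2 block; this is only meant to show the operation, with NumPy assumed and the 2x2 block size being an arbitrary choice:

    import numpy as np

    def pool2x2(response, mode="max"):
        # Crop to an even size so the map tiles exactly into 2x2 blocks.
        h, w = response.shape
        response = response[: h - h % 2, : w - w % 2]
        blocks = response.reshape(response.shape[0] // 2, 2, response.shape[1] // 2, 2)
        # Keeping only one number per block throws away a little spatial precision,
        # which is what buys the small shift invariance described above.
        if mode == "max":
            return blocks.max(axis=(1, 3))
        return blocks.mean(axis=(1, 3))

Stacking the two ideas, filter, pool, filter, pool, then a fully connected classifier, is in essence the convolutional network being described.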
So, these early neural networks—you know—numbers work great, faces work great and they tried to do it with other categories, but they weren’t getting much traction—you know—they didn’t solve object recognition at this point. It worked for a lot of other things too; we had self-driving cars back then that—you know—only had four neurons in them, I think, or five neurons in them, and they were able to drive a car for hundreds of miles. You know, it was amazing what they could do, but they weren’t solving everything. And the reasons for this failure, I’m not going to go into now because we’ll get into that when we get to the year 2012, alright? But since we’re talking about faces, let’s talk about faces a little bit ‘cause they’re really interesting ‘cause this is something that, very early on, everybody wanted to detect faces; we’re humans and we want to be able to detect each other so faces obviously play a big role. And one of the very early face detectors that really just made people go, “Wow,” was the Viola-Jones face detector. And this used something called boosting plus... a boosted… cascaded boosting classifier which, given an image, you’d basically slide a sliding window along it to do your detection, and it can very quickly throw out bad detections, very quickly. And then when it found a detection it thought could be good, it could perf… it could then apply more powerful machinery in that one location. And what this made possible was you could actually do face detection in real time. So Paul Viola when he got up on stage to demo this, he would actually take a camera out and wave it in front of the crowd and you could see the faces detected, which today would be like, “That’s so lame, because, like, we can do that on our cell phone’s camera.” But back in the day, it just, like, was a, “Whoa,” it just blew people away, right, ‘cause nothing worked real time. And then it actually, well—A: it worked, which is pretty cool; and B: it worked real time which is just, like, amazing, like, that just didn’t happen back then. This is in 2000, where—you know—digital cameras are still kind of novel at that point or fairly novel at that point. But you know what’s interesting about this? It’s not just that they did the cascaded classifier which made it fast, it was actually the features ‘cause the features were also very fast. But it wasn’t the fastness that I think is interesting, it’s actually the features themselves. So let’s look at the features that the classifiers used to actually recognize the faces. So, let’s say we have a face like this: what does the classifier use? There’s something called a Haar wavelet and a Haar wavelet is simply… it takes the average of all the pixels within one box and subtracts it from a whole bunch of pixels in another box—and that’s your feature—and it just gives you a value and then you put that into a classifier and then using that, it can recognize a face. So just the two big boxes, subtract the values and that’s your feature, which seems incredibly naïve, incredibly simple, right? What’s cool about these is you can compute them very fast, but it seems like simp… it just seems too simple to work.
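For reference, here is a sketch of that two-box feature: sum the pixels in one rectangle, sum the pixels in another, and subtract. The summed-area (integral) image is the standard trick that makes each box sum cost only a few lookups, which is part of why these features are so fast; the rectangles at the bottom are made-up coordinates purely for illustration, not the actual boxes the detector learned:

    import numpy as np

    def integral_image(image):
        # ii[y, x] = sum of image[0:y+1, 0:x+1]
        return np.cumsum(np.cumsum(image, axis=0), axis=1)

    def box_sum(ii, top, left, bottom, right):
        # Sum of image[top:bottom, left:right] from at most four integral-image lookups.
        total = ii[bottom - 1, right - 1]
        if top > 0:
            total -= ii[top - 1, right - 1]
        if left > 0:
            total -= ii[bottom - 1, left - 1]
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]
        return total

    def haar_two_box(image, box_a, box_b):
        ii = integral_image(image)
        return box_sum(ii, *box_a) - box_sum(ii, *box_b)

    # e.g. compare an upper band of a 24x24 patch to the band below it
    # (roughly the "forehead brighter than eyes" idea).
    patch = np.random.rand(24, 24)
    feature = haar_two_box(patch, (0, 0, 8, 24), (8, 0, 16, 24))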
You know, if I went back to, let’s say, 1970, right, and I went to my… and I’m a student, I went to my advisor and he said, you know, “Solve this face detection problem.” And I said, “Okay, I want to take a bunch of rectangles and just subtract their values—you know—over the face and I bet you that’ll be good enough to do classification.” They probably would have laughed at you and, like, “No, no, no, no, you’ve got to do some… you’ve got to detect noses and the eyes and—you know—all that other stuff. You just can’t throw a bunch of random boxes all over the image and just hope that it’ll detect a face.” Nobody would have ever thought that would work. And that’s not an obvious way of solving the problem, yet it works. So why does it work? Any guesses? Alright, well probably vision people have a good guess, but… the reason why it works is a couple of reasons. One is, our faces are three-D, right? So they have a certain shape to them. Another thing is our faces have a tendency to always be upright. You know, I’m not sitting there looking at you guys like this very often. And then the third thing is the lights are almost always above us, which means that my forehead is lit more than underneath, my cheeks are always bright, my nose is bright, my eyebrows are dark, right? And then you get this kind of same shading. The other thing is that the thing that makes a face a face, the details that make a face a face is not the exterior, it’s not the bounding, it’s not, like, the oval of my face, it’s actually the interior features. You know, it’s the eyes, it’s the nose, it’s the mouth. And if we take a bunch of faces and average them together, you’ll see that this part of the face actually averages together really well. It’s very consistent with what this part of the face looks like. And this part of the face actually isn’t repeatable much at all because people have hair and they have different—you know—things going on there. So if you look at these Haar wavelets, the two most useful ones were these—were basically the dark area—which is like this area of your head is dark, this is light, which is basically your cheeks down here. Makes total sense. Here again, the bridge of your nose is brighter, your eyes are darker. So it’s just taking advantage of these simple properties of faces and the fact that lighting is fairly consistent, our faces are usually upright and it just works because of that. It’s almost like evolution made our faces easy to detect, right? Alright, so this worked great for faces, but why didn’t it work for anything else? And there’s a simple reason for this as well, which is… let’s take a profile of a face, right? We always… these face detectors don’t work nearly as well when I’m in profile than when I’m looking front on. And why is that? Well, if you look at a profile of a face, what makes that unique is actually that kind of the bridge of your nose and your mouth. This part right here, right? That’s what makes a face unique. But the problem is that this part right here, that line that you see, is partially created by the background and is partially created by the face. Which means if the background’s light then it will be brighter than the face; if the background’s dark it will be darker than the face. You put those boxes over that area and it’s gonna be completely random whether one is brighter than the other, right? Same thing for objects like mugs.
We don’t detect a mug by the interior of the mug, I mean, the… you know, mugs look different, every single one, right? What makes a mug a mug is actually the profile, right? And again, since a profile’s gonna be on the background there’s not gonna be—any of these features that we’re using— they’re not gonna be repeatable, they’re not gonna be useful at all for detecting, let’s say, this mug here, alright? So the great news was… is computer vision had a stake in the ground, we had something that worked; the bad news was it didn’t work for anything other than faces. But at least we had something that worked—we could justify our jobs which is always good. Back in the 2000’s we needed some way to justify our jobs, even though, I was still in school then, so it was okay. Alright, so next on. Let’s go back in time just a couple years. Let’s talk about SIFT. And I have an asterisk on 1999 there because SIFT was published in 1999, but it was first submitted to a conference in 1998 and it was actually rejected the first time, which by itself would not be that surprising. Our papers get rejected all the time, it’s part of being an academic, being a student. You just get used to it; your papers get rejected—even though it still kind of hurts. But this paper was exceptionally notable because it was rejected and it is probably the most referenced computer vision paper ever. You know, it’s got… I don’t even… over ten thousand, maybe twenty thousand referen… I mean, it’s just got a ludicrous number of references and spawned, like, entire indust… it’s amazing how influential this paper was, and it was rejected. Why was it rejected? Because it was just a list of beautiful hacks that worked really well. [laughs] But—you know—sometimes hacks are good. Alright, so SIFT had two things in it. One was: no more sliding window. So what I was mentioning before—you know—we do face detection where you have this sliding window going the whole way across the image. Now if you’re doing a sliding window you have to evaluate at every single point which is expensive, alright? But if you had a way to only evaluate at certain… at smaller number of points, suddenly it would be less expensive which means that you can use a descriptor which is more expensive for the same CPU—you know—power. So basically we could use better features because you’re gonna be evaluating fewer places in the image. So let me talk about what I mean by interest points. And there’s many types of interest points; I’m going to be talking about difference of Gaussian here, there’s also corner interest points which is how they detect corners in images. But this is one called difference of Gaussians or Laplacian. And the idea is you have an image, like this beautiful cat image here and you just blur it. You blur it a little bit more and a little bit more and a little bit more. And then you just subtract neighboring images and you get these responses. So this is just this image minus this image, this image minus this image, you get these guys. And all you do is you look for peaks in these responses—peaks both in x and y and also peaks in scale here, so it’s a three dimensional peak. You find a peak, that’s your interest point. And the thinking here is, is what this does, when you think about a blurred image and you subtract off this blurred image from this blurred image what it does is it basically finds blobs of bright areas or blobs of dark areas. 
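A rough sketch of that difference-of-Gaussians idea is below: blur by progressively larger amounts, subtract neighbouring blur levels, and keep points that are local peaks across position and scale. The sigma values and threshold are arbitrary illustrative choices (not Lowe's exact parameters), and a full implementation would also keep minima for dark blobs and refine the peak locations:

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter

    def dog_interest_points(image, sigmas=(1.0, 1.6, 2.6, 4.2, 6.7), threshold=0.02):
        blurred = [gaussian_filter(image, s) for s in sigmas]
        # Each difference-of-Gaussians layer responds to blobs of a particular size.
        dogs = np.stack([blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)])
        # Keep a point if it is the maximum of its 3x3x3 neighbourhood in
        # (scale, y, x) and its response is large enough.
        local_max = maximum_filter(dogs, size=3) == dogs
        keep = local_max & (dogs > threshold)
        scale_idx, ys, xs = np.nonzero(keep)
        return list(zip(xs, ys, scale_idx))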
So if you look at the image, we look at the interest points returned, you can see you have kind of a blob of dark area, blob of lighter area, dark area—you know—lighter area, lighter area, all across the image. And the cool thing here is if you took this image and you kind of moved it like this or the cats moved just a little bit, these points in the image would still fire. And if I actually had a video of this and the cats are moving, the first thing you’d notice is that these interest points are blinking all over the place. But when I look at this, I kind of call this approach a shotgun approach because you don’t expect every interest point to be repeatable, but out of the, let’s say, the five hundred interest points that you see in this image, yeah, maybe fifty of them will be repeatable, if you’re lucky—probably not even fifty. So let’s talk about the descriptor. So now that we have an interest point—you can see the interest point here—it has a position, it has a scale shown by the size of the circle, so let’s say we want to extract out, now, a descriptor around that interest point. The first thing we do is we take that patch and we compute a bunch of gradients, a bunch of those filters that I was talking about earlier, and we do it in a bunch of different orientations, typically about eight. And then what we would do is—imagine this is the patch here—every single pixel has a gradient and an orientation, and then we just pool them together, we just take the sum of all the gradients here which have an orientation in this direction, and we put it here. And we take all the gradients here which have an orientation in this direction and we add them here. And if you remember from the neural network we had that pooling operation which basically gave us a little bit of spatial invariance, exactly the same thing here. Basically taking the gradients, or basically averaging or pooling them together to give us a little bit of invariance. You don’t care if this edge is here, if it’s over here a little bit, essentially. And then you basically have all these gradients, and in this case you’d have eight, we have four different cells and eight different gradients so you’d have a histogram of dimension thirty-two. And if you want it to be brightness and offset invariant—it’s already offset invariant because these are gradients—but if you want to make it gain or brightness invariant, you then just normalize this histogram, make sure it sums to one, that sort of thing. So a nice… it’s a bit more complex than the Haar wavelets we talked about, captures a lot of the local gradient information. And what this was beautiful at doing was instance recognition. So you have an object like these—kind of planar objects, a lot of texture in them—and you have an image like this and it can very quickly match up these little patches from here with the patches in here. And for those of you who… you were following computer vision, let’s say, ten years ago in the early 2000’s, it was… everybody was really excited—you know—about recognizing cd covers. David Nister got on stage at one point and was showing his computer cd covers in real time and it was recognizing… I think he had fifty thousand cd covers he could recognize and he could recognize any of them and he’s doing, like, air-guitar on stage. Unfortunately, cd’s… nobody ever bought them again after, I think, a couple years after that so nobody cared any more.
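Going back to the descriptor itself for a moment, here is a rough sketch of the pooled gradient-orientation histogram just described; it captures the spirit of the SIFT descriptor rather than Lowe's exact implementation (no interpolation, no orientation assignment, and a 2x2 cell grid as in the example above rather than the 4x4 used in practice):

    import numpy as np

    def patch_descriptor(patch, cells=2, bins=8):
        gy, gx = np.gradient(patch.astype(float))
        magnitude = np.hypot(gx, gy)
        orientation = np.arctan2(gy, gx)                      # range -pi..pi
        bin_idx = ((orientation + np.pi) / (2 * np.pi) * bins).astype(int) % bins
        h, w = patch.shape
        ch, cw = h // cells, w // cells
        hist = np.zeros((cells, cells, bins))
        for i in range(cells):
            for j in range(cells):
                m = magnitude[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
                b = bin_idx[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
                # Pooling: sum gradient magnitude per orientation bin within the cell.
                hist[i, j] = np.bincount(b.ravel(), weights=m.ravel(), minlength=bins)
        descriptor = hist.ravel()               # 2x2 cells x 8 bins = 32 numbers
        total = descriptor.sum()
        # Normalizing gives the gain/brightness invariance mentioned above.
        return descriptor / total if total > 0 else descriptor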
And then they’d say, “Well we can use it for books,” and then nobody buys books anymore, so obviously the technology is not that useful. But what it was useful for is things like Photosynth, for those of you who… Photosynth is still around, yes, for Photosynth. The other thing that it’s useful for that we use… that we probably use fairly often is panorama stitching. I mean, this is something that, I think, all of us take for granted now—you know—it runs on our cameras, it runs on our cell phones. Panorama stitching seems so lame now because it’s so ubiquitous, but—you know—in 2003, 2002 it was like, “Whoa, this is awesome—we can do this,” you know, ‘cause before you actually had to manually align the photos and it was just like… nobody did it. But then around 2003, Matt Brown and David Lowe—you know—and Rick and many others they all created these—Matt with the ICE application—created these great apps to do panorama stitching and really some great visualizations. Alright, so basically now we have a method that can recognize objects using much more complex descriptors because we have these interest points. So why didn’t it work for everything again? It works great for these kind of cd covers, book covers, it’s great for panorama stitching, right? But why doesn’t it work for other objects? Why doesn’t it work for motorcycles, let’s say? Well, let’s look at that. So there was a paper in 2003 which said exactly that: what we have is interest points, interest points are awesome. We can now do more complex things because we only have to examine certain points in the image. Can we come up with a kind of category object detector for this to just recognize generic motorcycles? And what they said is—you know—back… is basically motorcycles are made of parts. You know, all of them have wheels, all of them have headlights, all of them have handlebars, etc… And if you think about it, all these things can be put together by springs a la the constellation model—we’re back to 1973 again. People loved this idea, and you can see—you know—this is the front wheel and this is the front wheel, this is the—you know—the back wheel—you know—and you can see the different parts that it found and—you know—it really had these springs everywhere. It’s very similar to the 1973 model. And the way it found the parts… found the part candidates, because if you have all these parts and you need to evaluate every combination of parts, you can’t do this for every single position in the image, it’s way too expensive, right? So what they did is they limited the parts again to interest points. You’d think this would work beautifully. And it worked beautifully in the paper, but why does this fail in general? And this is an example from the actual paper itself. So the reason why it failed is interest points actually are not… they’re not repeatable for object categories. They’re repeatable for object instances like a cd cover and another cd cover, right? But if you look at these two bikes, the interest points that are returned… yeah, okay, there’s a circular one here and a circular one here, but they’re not the same scale. If you try to find an interest point that is in the same position on this bike here as on this bike here, it’s nearly impossible. And a lot of this is for the same reasons. One bike has very different—you know—colors—you know—maybe this part’s darker, this part’s lighter than on another bike, right?
And also bicycles… motorcycles—you know—like, this area here—you know—there’s background, and the background’s gonna be darker or lighter and that’s gonna affect the interest points. So there’s all sorts of different things that can affect these interest points and make them not repeatable. And in general, I mean, something like a chair or motorcycle or people even, they just don’t repeat at all. So really the only things they work well for are objects which have a lot of texture internal to them, very much like faces where you don’t have pollution from the background and the objects are kind of… are basically repeatable from one object to the next—they don’t change a lot. And the other thing too which actually hurt this algorithm a bunch, is that they had a spring between every single part. And because of that you couldn’t… you really had to use interest points, and you could only have a small number of parts because computationally this is very expensive to compute and to evaluate. But the reason it worked is if you actually looked at the datasets they tested on. So this is the Caltech4 dataset and this is—you know—this is in ’03 and people were actually impressed by this result—this paper won best paper award—alright, and this is the dataset. So it basically had a bunch of motorcycles in it all facing the same direction—a lot of the motorcycles with a nice white background. You had airplanes all facing the same direction. You had a bunch of faces, which might look impressive at first except for you realize that this face dataset was literally created by a student walking around with the same exact camera, with a flash on, taking pictures of students and faculty around Caltech—yes, around Caltech. So basically they all… the lighting is all very similar, again, because it’s driven by a flash, and many of the faces are the same… is the same person. And then the back of cars—again, you know, the dataset was incredibly simple. This is a data set created by the people who wrote the paper, which creates a bias because you want to make sure your paper actually has good results, right? So, you might have had other categories as well and you just kind of threw them away ‘cause they didn’t work. Alright, so that was in ’03. Now let’s go forward a little bit and let’s go to ’05 and something called the histogram of oriented gradients, so… or HOG, many of you have probably heard of HOG before. And what they wanted to do here… so we had a student, [indiscernible], and Bill Triggs, who was his advisor—I’m making this up—but I’m assuming he said to—you know—this student, “Let’s detect pedestrians.” Pedestrians are something that are important to detect—you know—‘cause we want to have autonomous cars at some point and—you know—people are—you know—we always like to detect people. Pedestrians are a little bit more… easier to detect because pedestrians always look like this or they look like this—you know—they don’t really vary much. So you have a student and he wants to detect pedestrians like this, so what does this student do? Oh, let’s just talk about pedestrians a little bit. So the interesting thing about pedestrians is it’s not like faces—you know—faces have the texture—you know—which is interior to the object, but what makes a pedestrian a pedestrian really is the contour. Because—you know—the shirt that I’m wearing versus the shirt you’re wearing or the pants I’m wearing versus the pants you’re wearing—you know—they all look very different. So you can’t rely on our internal textures, right?
So you really have to rely on this boundary which makes it a more challenging problem, ‘cause like we were talking about, this boundary has the background and the foreground object in it, and there are cluttered backgrounds and then, again, significant variance. Interest points, as we learned, don’t work for these sort of scenarios of objects in which you have to detect—you know—them by the boundaries because interest points don’t fire reliably when you have this variance. So we have to go back to a sliding window. Thankfully, at this point, computers are a little bit faster, so the sliding window kind of made some sense. So, if you’re [indiscernible], a student and you’re thinking, “Okay, I’m gonna do a sliding window detector and I want to be able to detect a pedestrian,” and you look at—you know—what the current state-of-the-art is doing, you say to yourself, “Hmm, what should I do, what should I do?” And you’d say, “Oh, I know what I’ll do, I’ll just make a really big SIFT descriptor and slap it on top of the pedestrian and do a sliding window on that.” Seems like a great idea, right? And that’s essentially, exactly what he did. [Laughs] So basically he takes SIFT—exact same thing I talked about before, right, with the gradients—and… but now we’re gonna have more bins because pedestrians are bigger—instead of the… I showed a two by two grid but in practice people use a four by four grid, and here you have a much larger grid that you use. But there’s one critical distinction and this is what kind of made the paper I believe, is that before with the SIFT descriptor, if you wanted to take care of gain offsets you had to divide the whole descriptor and make sure it all summed to one, right? Now imagine what you would… imagine what would happen if you took, like, this cell and you want to normalize the descriptor so you basically divide by the magnitude of the entire histogram. Now, imagine if you do this for a pedestrian, I… let’s say my legs have very high contrast and my shirt has very low contrast, what’s going to happen is the gradients in my legs are gonna be really large and the gradients on my shirt are gonna be really small and not very informative; they’re gonna look like noise. Or you could have the opposite. You could have a lot of… depending on the background—you know—the background could be white behind me, so this has a lot of texture and maybe I’m wearing white pants and this has none, right? So if we divide… if we normalize by the entire histogram, it’s not gonna actually perform that well. So they had this brilliant idea which was, “Well let’s not normalize by the entire patch, let’s just normalize by a local window.” And what they did is they took this patch and normalized by these four, these four, those four, and those four, and basically made a descriptor four times as big, normalizing locally. And the cool thing about that is you’re just normalizing based upon this local area right here, right? So that way if there is an edge here it’s gonna be [indiscernible], you’re gonna see that edge, and you have an edge here, you’ll see that edge, but if the contrast is different between them it doesn’t really matter, because at the end of the day, you don’t really care about the magnitude of the edge. You don’t care that there’s a white to black area or a light gray to—you know—slightly lighter gray area, right? All you care about is, is there actually an edge there, yes or no? And that’s what this descriptor basically did.
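Here is a small sketch of that local normalization step; `cells` is assumed to be a (rows, cols, orientation_bins) array of per-cell gradient histograms, for instance built the way the descriptor sketch earlier builds them, and the overlapping 2x2 grouping is the “these four, those four” idea described above rather than the paper's exact block scheme:

    import numpy as np

    def hog_block_normalize(cells, eps=1e-6):
        rows, cols, bins = cells.shape          # assumes at least a 2x2 grid of cells
        blocks = []
        for i in range(rows - 1):
            for j in range(cols - 1):
                block = cells[i:i + 2, j:j + 2, :].ravel()   # one overlapping 2x2 block
                blocks.append(block / (np.linalg.norm(block) + eps))
        # Each cell lands in up to four blocks, which is why the final descriptor
        # ends up roughly four times larger than the raw cell histograms.
        return np.concatenate(blocks)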
It said, “Is there an edge, yes or no?” Presence matters more than magnitude. This is more informative than magnitude. And if you look at… if you just train an SVM classifier on top of that, you basically get a… you can see these… this is the HOG descriptor visualization of it and these are the positive weights and the negative weights of the SVM classifier. You can see it really does pick up on the contour of the pedestrian. And then you basically have this pipeline. You compute HOG features, you train a linear classifier, perform some sort of non-maximal suppression—non-maximal suppression just means if I detect a person here I shouldn’t detect a person right here; you just pick the one that has a higher response and remove the other one. And this pipeline reigned for quite… I mean it was, like, the pipeline for a long time. Alright, so why did this work? A couple of reasons we already talked about. So we can finally detect objects based upon their boundaries. This is really the first time that boundaries really worked. There was a bunch of people who tried to detect objects based upon edges… you literally detect edges first, these discrete edges, and then you try to detect objects, but the problem is edges are not very repeatable, like, sometimes you would detect them and sometimes you wouldn’t, but this is the first kind of repeatable method to detect objects based upon their edges. They also had hard negative mining. Now this actually is kind of hidden in the paper a little bit but is incredibly important, which meant that if you want to find negative examples of humans, you don’t want to pick just any random patch in any image, you want to detect… you want to take the examples that are actually close to being correctly… incorrectly classified as humans and add them to the negative set. So you run your classifier once over the training data set, you find all the ones that kind of were classified incorrectly as humans and you add them to your negatives and you kind of rerun it. And this helped a lot just in accuracy numbers. And finally again, this is a dense sliding window approach, computers are fast enough. Okay, why it failed. So this is… if you look at the average gradient mask of a pedestrian this is what it’ll look like. Again, pedestrians look like this, right? Well—you know—it works great for pedestrians but people… we have all different sorts of poses, right? How do we actually go about…? So this was, essentially, the next big challenge. How do we actually recognize objects which don’t have—you know—very—you know—obvious contours that we can detect? What if they’re more deformable? But before we get to the solution to this problem I just want to take a little bit of a side detour and let’s talk about data sets really quick. So…’cause data sets actually drive the field a lot and it’s… data sets actually dictate the type of research that people are doing. So 2004, 2006 the reigning data sets were Caltech 101, Caltech 256 and they had a lot of images that looked like these, which are fairly challenging and if you took all the different categories and you average the images together you get these sort of average images. And what you’ll notice is that a lot of the categories have things that are very consistent and a lot of categories have things that are kind of random, so they kind of varied in hardness. But what was unfortunate about this data set, and maybe it’s partly due to the fact that it wasn’t super large, is how people evaluated on this data set.
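And going back for a second to the non-maximal suppression step in that HOG detection pipeline, here is a minimal sketch of the idea: among overlapping detections, keep the highest-scoring one and drop the rest. Boxes are (x1, y1, x2, y2) and the 0.5 overlap threshold is just a typical choice, not a number from the talk:

    import numpy as np

    def iou(a, b):
        # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def non_max_suppression(boxes, scores, overlap_threshold=0.5):
        order = np.argsort(scores)[::-1]        # highest-scoring detections first
        keep = []
        for idx in order:
            if all(iou(boxes[idx], boxes[k]) < overlap_threshold for k in keep):
                keep.append(idx)
        return keep                             # indices of the detections to keep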
They were given fifteen or thirty images for training and then you had to evaluate on the rest. So basically I give you thirty images per category and I say, “You must learn this category from the thirty images.” That’s a really small amount of images. You think about how complex an image is and you only have thirty of them. So basically this limited the research, I mean, like, you couldn’t publish unless you had that result and since you only had a small number of training examples that really limited the algorithms that you could actually run on this. As we’ll find out later some of the most—you know—well performing algorithms require infinitely more data than just thirty examples. Alright? Then there’s PASCAL VOC. This really led the drive towards object detection, so this data set had much more realistic images in it and then it had bounding boxes drawn around all the different objects. And a lot of people talk, like, this is the gold standard for object detection for about five or six years. We had the Caltech pedestrian dataset, what was interesting about this data set was we just had a large number of people. This is one of the first data sets that actually had a hundred thousand training examples in it. Again, but these are pedestrians, so, you know. And then, again, ImageNet, which a lot of you probably heard of which has millions of images which is really cool and they’re all in these kind of range of this fine-grained category so we had, I think, like, five hundred different types of dogs and a bunch of different types… you know, basically it’s like all these kind of fine-grained details and it had—you know—maybe a thousand examples for every single one of these. And what’s interesting about this—and we’ll see how this is useful later—is this is the first data set where we actually had huge amounts of training data, we actually had millions of images and this will be incredibly important later on. Alright. Oh, and Sun. With Sun basically you had these kind of perfect segmentations but the data set was much smaller and this is kind of nice for localization. And I’ll talk about COCO which is a new data set we’ve developed at the end of the talk. >> Answer a question? >> Yep? >>: When you say cat… go back to a couple slides categories, twenty-two thousand categories, are… is this—you know—using the human example—is this… is it just category human or is it human as pedestrian, human jumping, human…? >> Larry Zitnick: It’s not human jumping, it’s teacher, it’s professional, it’s firemen, it’s that sort of thing. >>: Great, thanks. >> Larry Zitnick: Yeah, and then, you know. >>: Yes. >> Larry Zitnick: Alright, I’m gonna skip that for a time. Alright, so, back… we were talking about why it failed and we can’t recognize people that look like this, right? Alright, so let’s say you’re in the year 2008, HOG works really well for pedestrians and you want to do more deformable objects, what would you do if you’re designing an algorithm? You’d say, “Hmm, let’s see, if I want to detect a person and I can’t just do a whole template, let’s see I could detect, let’s see, I could detect the feet and then I could detect the hands and I can…,” you know, so basically the DPM, deformable part model, was exactly what you’d think it would be, which is now instead of detecting one big template, you’re basically gonna have a bunch of these different HOG descriptors and you’re gonna detect the feet and you’re gonna detect the legs, and you’re gonna detect the head, and you’re gonna detect the arms, right? 
And you’re gonna attach them all using springs. [Laughs] So, again, we’re back. But these people were smart. They didn’t take every single one of these parts and attach springs everywhere—no, that’d be naïve. They took the springs and attached them to a root node and then had one spring coming out from each, creating a star model. And the beauty of this was you could actually compute it efficiently, so you didn’t have to use interest points, you could literally run these part detections across the entire image and find the optimal human in a very quick non-exponential amount of time using distance transforms. It’s pretty cool actually that it works. So using this model, you could actually do it. So you could do dense part-based detection. And the other thing that they did was they said, “Well, okay let’s say… let’s take bikes for example.” Bikes… they have all these different parts they can move around, but a bike from the side looks very different from a bike front on. So instead of trying to create a deformable part model which could warp, let’s say, a side-facing bike to a front-facing bike they said, “Ah, don’t worry about that. Let’s just create a different model for each one of those. So you create a bike detector which is a front-facing bike and you create a bike detector which is a side-facing bike.” So anytime you have a lot of deformations you just have multiple components and multiple models. Alright, so why did this work? Multiple components, deformable parts? I have a question mark here ‘cause there’s a lot of debate in the community. You can imagine instead of having deformable parts you could just have more components. And the balance you have between those—you know—you can kind of get really good results just using multiple, like, more components and having fewer deformations. So there’s kind of a trade-off between the two. Again, hard negative mining’s always important. And good balance; and what I mean by good balance is this worked really well on the PASCAL data set. And the PASCAL data set has, let’s say, I think—you know—four hundred to maybe a thousand objects per category. Again, not a huge number of objects, but this algorithm was trained, like… they just tweaked it to work really well on PASCAL because if you took the classifier and gave it too much freedom it would overfit, you didn’t have enough training data. So basically you had to take your classifier and make it just strong enough to take advantage of the data that’s there in PASCAL, but not too strong, right? So, as we’ll see later, this kind of got overfit to the PASCAL data, but people had a really hard time beating it. Alright, so why did it fail? So, why did it fail? So to understand this, let me just do a simple little demonstration. So, look at this patch here. What is that? >>: Person riding a horse. >> Larry Zitnick: Yeah, okay. [Laughs] It’s a head, it’s a head of a person, right? So let’s say I wanna… you could say, “Which part of a person is it?” Right? So it’s a head. What part of a person is that? Leg, right? What about that? >>: Shadow? Blood. >> Larry Zitnick: Yeah, it’s nothing, right? And we can even do this in really low res, so, what’s that? >>: Lower leg. >> Larry Zitnick: Yeah. Well, feet, leg—you know—head, ‘cept… So it’s amazing. We can look at… as humans we can look at a very small patch of an image and say whether there’s a head, feet, arms, legs. Even in low resolution images we can do this, right?
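As a toy illustration of the star-model scoring mentioned above: the score at a root location is the root filter response plus, for each part, the best trade-off between that part's response and a quadratic spring cost for drifting from its anchor. A real DPM computes the inner maximization with a generalized distance transform in linear time; the brute-force loop here only shows what is being maximized, and the response maps, anchors, and spring weights are assumed inputs:

    import numpy as np

    def part_contribution(part_response, anchor, spring_weight, search=8):
        # Best (response - deformation cost) over a small window around the anchor.
        ay, ax = anchor
        h, w = part_response.shape
        best = -np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = ay + dy, ax + dx
                if 0 <= y < h and 0 <= x < w:
                    deformation = spring_weight * (dy * dy + dx * dx)
                    best = max(best, part_response[y, x] - deformation)
        return best

    def star_model_score(root_response, root_loc, parts):
        # parts: list of (part_response_map, anchor_offset_from_root, spring_weight)
        ry, rx = root_loc
        score = root_response[ry, rx]
        for part_response, (offy, offx), spring_weight in parts:
            score += part_contribution(part_response, (ry + offy, rx + offx), spring_weight)
        return score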
So what we did in an experiment, is we actually used Mechanical Turk and we asked turkers to basically label every single little image patch containing one of these parts of a human. And these are re… for these two images these are the responses that we got. You can see… here’s the legend up there. So the humans very reliably were able to find—you know—the head, the—you know—hands, the feet, the legs, etc…, and they’re very nice. Now if we look at the automatic algorithms, the deformable part model which I was just discussing, and you look at the responses for that, you get this. Not nearly as good. Now, the deformable part model generally did fairly well on the head ‘cause the head has kind of this unique structure, but as you can imagine, legs… legs, there’s nothing unique about legs. They’re just, like, kind of vertical lines, right? Same thing with arms—they can move around a lot. So arms and legs—you know—that sort of thing, was really hard for the deformable part model to detect and it was really relying on the head detection a lot. If you look at more examples, this one’s a little bit more challenging, you can see the humans mess up… if you just show a small patch here around the horse’s foot, we think that’s a human foot, right? Not too surprisingly. But the machine just fails miserably again. And here’s another example you can see—you know—it doesn’t even get, like, a lot of the head detections here, the DPM… and the humans just do a lot better job looking at these small patches. And if you take the DPM model and just replace its responses with the human responses for whether they think feet or heads or hands are there, you would get this huge boost in performance over the machine detected—you know—hands, feet, and legs. So it’s really kind of… you know, you can see how, at this kind of low level, this kind of just being able to detect these small little parts, humans can do it just so much better. Obviously HOG followed by an SVM classifier is not doing it—it’s not detecting these parts reliably. And a little side note, you can… this kind of explains one of the reasons why Kinect works so well. So we have Kinect, it can… the input is something that looks like this, right? Now of everything I have talked about today, what’s unique about this image versus an RGB image? We can actually detect the profile of the human really easily, right? It just stands out. I mean, you get this profile of the human [snaps] without even trying. It’s gonna be har… it’s gonna be noise free, essentially. And because of that, that’s why Kinect works really well—is we don’t have to worry about detecting these object boundaries any more—they’re basically given to us. And then, once you have the object boundaries, especially with a human—‘cause you know we’re flying our arms around—you know, detecting the hands or the head becomes a lot easier if you have these really nice boundaries, right? So we have this, basically, this really nice input and because the input is so good then we can train classifiers which are essentially… use features which are essentially a lot easier. It goes back to the Haar wavelets that I was talking about earlier—it’s even simpler than Haar wavelets—and you can still detect the humans. But still this doesn’t solve the RGB problem. Now you think that the deformable part model… what if you gave it more data, right, to detect these arms, these hands and arms and do it better? What we find out is it doesn’t help.
Again, this gets back to what I was talking about before: they basically tweaked it. They used a linear SVM and a linear SVM only has so much capacity. You can’t learn a lot—you know—a linear SVM can only recognize so much, can only do so much. So if you give it more training data, it’s not really gonna learn anything more. But you can’t give it… you can’t use a more complex classifier because then, again, it starts falling apart because then you overfit. So basically we had this problem. More training data didn’t help, the classifier had to be restricted… awww. And at this point, this is a really dark time in computer vision research and object recognition. All of us are just, like, “Oh this is so painful. All we get is all these algorithms making these really small incremental improvements. We’re improving things by one percent a year.” And this happened… this lasted for—you know—three, four years—these kind of incremental improvements. Nobody was making any headway, right? And the results kind of look bad, I mean, for computer vision researchers you look at ‘em and, like, “Okay, that’s okay.” I showed ‘em to my mom and she’d slap me, right? So we’re all competing on this kind of—you know—small little bit and, like, basically the mood even in the PASCAL data set, they decided to stop updating the PASCAL data set because all it was was small improvements on this model and we weren’t seeing any improvement really at all, like, the life kind of went out of it because nobody could think of anything else. Alright, so let’s look at this DPM model. We basically have an image, we compute a HOG descriptor, we have an SVM classifier, we do some sort of pooling, right? That’s basically the HOG model. Now, we have low level features, we have a limited capacity classifier—a linear SVM. Intuitively you think, “Well, these HOG features—you know—they’re not just looking at the raw pixels, but they’re not looking—you know—they’re just looking at gradients, pooled gradients, so that’s not that abstract, right?” And we have this classifier which is pretty limited. So we’re taking these kind of low level features and feeding them into a classifier which isn’t that powerful. You know, it seems like we need something else here, something that is a better abstraction that could feed into a classifier and actually learn something. We need something that’s more abstract than HOG. The problem is, HOG was handcrafted, right, and what do you do after HOG? How do you combine features in a way? You can’t… you think about it intuitively and it’s like, “I don’t know.” You could come up with all these hacks but there’s no… we don’t have any good introspection on how to do this. It’s hard to hand-design these things. Alright, 2009 our data sets had about thirty thousand images in them. 2012, ImageNet, fourteen million, huge increase. 2009, Caltech 256 had two fifty-six categories in it. 2012 ImageNet had twenty-two thousand categories in it. 2009, we had algorithms like these, right, the deformable part model. In 2012, for some reason, the convolutional deep neural network started catching on again. So you have an image—up to there—you know—a pretty weak classifier up here, but then you have all these extra layers to learn something new in here.
So what happens if we take all this additional data, we take the deep learning work that we just had and learn it using GPUs, and it’s interesting… there’s actually a whole interesting back story on when people even started using GPUs to do deep learning and it wasn’t… people didn’t just go with that one day and say, “Oh, you know what? Let’s just revisit deep learning and… you know, I think that it was just the fact that it wasn’t running fast enough. Let’s just apply the GPUs and see what happens.” It was actually much more interesting of a story for why that came about and a couple dead-ends that the research community kind of went into before we actually got to this point, but I don’t have enough time to talk about that today, so I won’t—plus deep learning. So what did this buy us? So 2012, ImageNet challenge. These are kind of DPMs—standard computer vision models. Geoff Hinton and company submitted their SuperVision algorithm based upon deep learning to a very, very, very skeptical computer vision community and the results were this. This is error, so lower is better. This blew people away, this really woke up a lot of people. You can see, I mean, the amount of error drop there was stunning. What’s even more stunning is if you look at state-of-the-art results right now—Google, this year—the bar’s down to here, alright? And you look at… it’s amazing. And if you look at the deep neural networks, all it is is you have an input image, you have a series of filters, you have some sort of pooling, you apply other filters, blah, blah, blah, blah… these are the dense layers—sorry—ninety percent of the parameters are actually in the dense layers ‘cause of the convolution that I was talking about earlier. But if you look at this and you look at this, they’re essentially exactly the same thing. They literally are essentially… I mean this… they really are the same thing. We have a couple of more layers here—we have five layers instead of two—but essentially the network, the [indiscernible], everything is essentially exactly the same. So what happened? Well, we got GPUs, because in 1990 we would train these deep neural networks and we’d be like, “Okay, nothing’s happening, they’re not working,” because—you know—the… you just sit there and you wait a month or two and it doesn’t do that good of a job. Well, if you take a GPU which runs thousands of times faster and you let it run for an entire week, suddenly then it works, right? And you can’t train neural networks without a lot of data ‘cause a bazillion… there’s billions of parameters, right? And if you have billions of parameters you need a lot of data to learn these parameters. So you need lots of data and back in 1990 you’re not gonna have lots of data ‘cause how do you get the gen… nobody even… images were scarce then. You know, now we suddenly have billions of images and we can get them off the web—you know. It wasn’t until we had a lot of data and a lot of processing power that the—you know—power of these guys really showed itself. And also rectified activations and dropout helped that a lot too, or a little bit too. If you’re curious about the details on this, Ross is giving a talk next month, specifically just on the deep learning aspects of this, so I’m not going to go into a lot of detail here other than just to give you some intuition for how it works or why it works and where it doesn’t work. But if you want to see more details on this, come back in a month.
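For anyone who wants to see the shape of the network being described, here is a minimal sketch in PyTorch: stacked convolution, ReLU, and pooling layers followed by dense layers with dropout. The layer sizes are small arbitrary numbers for illustration; this is not the actual SuperVision/AlexNet configuration:

    import torch
    import torch.nn as nn

    toy_convnet = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=5, padding=2), nn.ReLU(),   # filters
        nn.MaxPool2d(2),                                         # pooling / shrinking
        nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),  # more filters
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 56 * 56, 256), nn.ReLU(),                 # the dense layers,
        nn.Dropout(0.5),                                         # where most parameters live
        nn.Linear(256, 1000),                                    # one score per category
    )

    scores = toy_convnet(torch.randn(1, 3, 224, 224))            # e.g. one 224x224 RGB image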
Alright, so at this point, as a computer vision researcher, you can imagine, we're very depressed. Basically, people just told us, "You wasted the last twenty years of your life doing nothing. You should have just sat back, relaxed, waited for GPUs to get better," [laughter] "…waited for more data, and you could have done just as well. That's great, you published a lot of papers, you have a lot of citations, congratulations, it paid the bills, but—you know—your work is basically pointless." And at this point I was like, "No, no, no, no, no, wait, wait, you just solved the classification problem"—this is, you know, we talked about classification versus detection—"you haven't solved the detection problem. Detection is much harder." And there was debate for a year or two about whether detection… 'cause they tried it on the PASCAL data set and it didn't work that well, right? They tried it on ImageNet and it worked really well, but PASCAL they weren't able to beat yet. So we were like, "Whew, at least we still have PASCAL that we're winning on." Alright, so then in comes this student who was one of the creators of the DPM model, Ross Girshick, who is at MSR, he's giving a talk in a month, and he said, "Well, you know, my old model is being beaten. I want to be at the forefront of deep learning; I want to see if I can make deep learning work well on PASCAL." So one of the first things you have to solve is, again, the sliding window issue we talked about: doing that with a deep neural network is actually really expensive, 'cause when you do image classification you just run it on the whole image at once, which is not that expensive. But if you do a sliding window you gotta run it on all these different boxes, which is tough, and then you also have to do it at all different aspect ratios, 'cause some objects are like this and some objects are like this. So it's really expensive, right? So how do you solve this problem? How can you detect boxes of all different sizes in an image? Well, at the same time in the literature there were a whole bunch of algorithms coming out which looked at object proposals, and essentially the idea here was, instead of doing a sliding-window approach, we'll come up with some bottom-up method to give you bounding boxes which we think objects exist inside of. And one of the more popular methods was: you basically just segment the image in many different ways, you merge your segments together, and you draw a bounding box around those segments, and that's your object proposal. What this does is it takes the number of bounding boxes you need to evaluate from millions down to thousands, right? So then you could apply, again, same story, these more complex descriptors, more complex features, to a smaller number of areas and see if we get improvements. Alright, so: input image, proposals, and for every proposal we basically warp it so it fits within a perfect rectangle, like in the classification task, you run it through your deep neural network, and then you classify that patch. Alright, so these are the results up until Ross's paper. So you can see we have DPM, DPM, DPM plus-plus, plus-plus, and this is—you know—that's a big improvement. We're like, "Whoo," and this is a paper—you know—going from there to there. >>: Up is good now? >> Larry Zitnick: Good is up, yes. So this is average precision. I won't describe what average precision is, but the higher the better. 
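Here is what that proposal-based recipe looks like as a minimal PyTorch sketch: take a few candidate boxes, crop each one out of the image, warp it to a fixed square, and classify the warped patch with a CNN. The fake image, the hand-picked boxes, and the ResNet-18 stand-in classifier are assumptions for illustration only; the actual system used bottom-up proposals like the segmentation-merging method just described and its own network.

    import torch
    import torch.nn.functional as F
    import torchvision

    cnn = torchvision.models.resnet18(weights=None)      # stand-in for the detection CNN
    cnn.eval()

    image = torch.rand(3, 480, 640)                      # fake RGB image (channels, height, width)
    proposals = [(50, 80, 250, 300), (10, 10, 120, 90)]  # (x0, y0, x1, y1) candidate boxes

    scores = []
    with torch.no_grad():
        for (x0, y0, x1, y1) in proposals:
            patch = image[:, y0:y1, x0:x1]               # crop the proposed region
            patch = F.interpolate(patch.unsqueeze(0), size=(224, 224),
                                  mode="bilinear", align_corners=False)  # warp to a fixed square
            scores.append(cnn(patch).softmax(dim=1))     # classify the warped patch

    # Each entry in 'scores' is a class distribution for one proposal; a few thousand
    # proposals replace the millions of windows a sliding-window detector would score.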
And you can see, like, this is basically that period where everything basically stalled out, everybody's kind of down, right? You throw in the deep neural networks and you get that. So if you look at this gap, that's huge, and we were seeing basically all the results plateauing. And when we saw this result, all of us were like… well, some of us were like, "Damn, why did it have to do well on object detection as well?" But at the same time you're like, "This is great. We finally have some improvement again. The field's not dead," you know. And you can see the improvement here. And now, this number here, Ross will get into this a month from now, but he's gotten results that are in the sixties, like mid-sixties, so it's even going up higher. Alright, so just to give you some intuition as to where this works and where it doesn't work, let's just look at some results, alright? So, here's an image: it detects a train beautifully. Good. Person, boat—I think a very impressive result. Bottles here, you can see it kind of misses some of these bottles. That's kind of a tough problem. Here, there's chairs, there's all sorts of things it should recognize—it doesn't recognize any of them. Cat, good, it can recognize cats. [laughter] Train, you see it gets a train, but it kind of says train here too—maybe that's forgivable. Bird: this is an interesting case because it recognizes birds all over the place here, but if you look—you know—yeah, that's a bird, but it doesn't really get the head; none of the bounding boxes really get the entire bird. Again, whew, we detect cat faces, cat faces are very cool, and sometimes, no, we can't detect cat faces, so here it just fails. Sometimes we hallucinate cats [laughter] in different places. You know, here, again, this is a challenging scene. Now again, this is where—you know—fact from fiction… when I look at this image I'm like, "Wow, that's pretty impressive," as a computer vision person, right? But again, if I show this to my mom she's like, "Eh, you know, you could probably do better." [laughter] You know, like, "That's not a horse. Come on, this is garbage," you know, but everybody has different expectations. Here again, a bowl, a hamburger… I'm just blown away by this—you know—I mean, just, no way. This algorithm just… you would never get that before. But then you look at here and you're like, "What is going on? You got this dumbbell but not that one?" [laughter] You know, so yeah, exactly. I mean, and it's not only that: this dumbbell, it got it with really… I mean, it was really confident about that dumbbell, and this one, that's definitely not a dumbbell. [Laughter] So—you know—exactly. I mean, look at the person: he's, like, deformed and in a different position, and it still got it—it's amazing. Again, but then my mom would be like, "Oh, it missed the dumbbell, Larry, it's not that good." [laughter] Alright, so that kind of gives you a sense for the quality of the results. And for classification it can do much better, like I was talking about with ImageNet classification; it can do very well with classification. Detection is a much more challenging problem 'cause you have to localize it as well. So, failures. This is actually an interesting plot. This is looking at vehicles, furniture, and animals and the different reasons for the errors. 
So we have confused with the background, which is yellow; confused with other objects, which is orange; confused with similar objects, like a cat versus a dog, in purple; and got the location wrong, in blue. So you see a lot of the errors are due to location errors. For furniture it gets confused with a lot of other objects, not too surprising. Chairs kind of look like a lot of other things. And also animals—you know—it's really easy to confuse one animal for another, because a lot of animals look similar. What's nice here is it's not confusing things with the background as much. If you looked at DPM, this yellow region would be a lot larger, so we're not just hallucinating—you know—horses in the background as much anymore. And these are different object categories and their accuracies. If you scan through this, what's interesting is that the large objects it kind of does well on, while small objects like a power drill it does really poorly on. Horizontal bar, I mean—you know—yeah, exactly. What's another low one… backpack: generally you don't have an image which is entirely a backpack; the backpack is small in it. Whereas an image of an iPod, in that data set, would basically be a blow-up of the iPod, so the iPod actually does well. You know, hammer: usually it's not a big picture of a hammer, it's a picture of a person holding a hammer or something like that. A hammer's small, so you can see it doesn't do as well in these categories. The other thing is, all these methods, they look for an object, right, and they kind of look at a bounding box and they say, "Is that object within this bounding box?" Right? And for a lot of objects, if you just look at the bounding box that surrounds that object, it's really hard to tell what that object actually is. You actually need to look at information outside of the bounding box. So if you look at these images here, what's interesting about them is that all of them have the exact same pixels in different positions, different orientations, and how we interpret these pixels changes dramatically based upon the context around them. So these pixels can be a plate of food, they could be a pedestrian, they could be a car… and it totally depends on the context of the object relative to the other objects. And you can imagine, for a lot of the things I showed earlier, these smaller objects, you don't recognize the smaller objects from the pixels themselves. A lot of times you recognize them from the context. If I go like this, whatever I'm holding here you're gonna think is a cell phone, regardless of the pixels that the cell phone actually contains. And these models are not capturing that yet. We can look at an image like this as humans and we know something's wrong. These models don't really know something's wrong. They wouldn't realize that, "Oh, wait… are there three faces, are there two faces? I don't know." It doesn't get that contradiction, 'cause it doesn't really fully understand the extent of what the objects are. Remember when I was looking at the birds? These models, what they're really good at is finding kind of descriptive little patches in the image which are very informative of said object—like the face of a cat or the—you know—the head of a person—and that's where you get a large firing. So you can look here. 
What it does is it kind of finds a part that it thinks is really discriminative—this is kind of a heat map of where it fires—and it basically finds the informative bits of those objects. So you see here, like for the boat example here—you know—this is kind of the part that it finds that's discriminative for a boat, but this whole other part's a boat too. Whereas here it finds all these different airplanes, but—you know—is that one airplane or multiple airplanes, you know? And this concept that—you know—my feet are part of me is kind of lost on it, 'cause that is not discriminative for the algorithm, right? So that's one of the challenges, especially for the detection task. Another thing that's interesting is there's this paper that recently came out where you take this image, which is correctly classified, or this image, which is correctly classified, you manipulate it a little bit to get this image, and suddenly it's incorrectly classified, alright? So for humans, these images look exactly the same to us, right? But to the neural network they look completely different. I think these are ostriches now. [laughter] So… I think that's right. Even this, you know. So there's something funny going on. I mean, they are still very sensitive. You get this signal which kind of propagates, and as you go up the neural network it gets amplified and can do unsightly things like this. So that can be a problem. The other thing is these deep neural networks are being applied in a lot more domains, and one of the ways that they're being applied is, before you actually run your deep neural network, you basically stabilize the image with respect to something. So you can predict attributes like whether somebody's wearing pants or shorts, but first, what you do is you don't just give it the whole image: you basically find the legs and then you just feed the leg portion in. Or if you're trying to do face recognition, what you do is you first run a face detector, you take the face, you align it with a three-D model, you warp it so it's a frontal-looking face, and then you feed this into the deep neural network classifier to say whether that's Sylvester Stallone or not, right? So what we haven't figured out yet is: we can learn these very complex models, but we don't really understand the relationship between a person that looks like this and then moves over like this, or how things change in three-D, right? So those haven't been worked in. So we still need these kind of… you could think of it as a hacky layer on top to really get good performance when you start thinking about three-D. And this can also impede the generalization ability of these algorithms. So if I take a picture of my dog and I want to recognize my dog from another angle, it's not really learning the relationship between a dog at this angle and a dog at this angle. It just knows that dogs from this angle look like dogs and dogs at that angle look like dogs, but it doesn't know how to relate the features from one to the other. >>: Does any of this apply to looking at moving pictures or is it all static imagery? >> Larry Zitnick: So yeah, there has been some research on moving pictures, but almost everything is static images. The reason for that is it's so much harder to deal with… I mean, we're already, like, maxing out our GPUs and… >>: [indiscernible] >> Larry Zitnick: Exactly, we already have a lot of data. 
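Going back to those manipulated school-bus and ostrich images for a moment: the basic recipe is to nudge the pixels in whatever direction most increases the network's loss, so the picture looks unchanged to a person but the prediction flips. Here is a minimal gradient-sign sketch of that idea in PyTorch; the paper Larry mentions used a different optimization, and the untrained ResNet-18 and fake image here are stand-ins purely for illustration.

    import torch
    import torch.nn.functional as F
    import torchvision

    model = torchvision.models.resnet18(weights=None)        # in practice this would be a trained network
    model.eval()

    image = torch.rand(1, 3, 224, 224, requires_grad=True)   # fake input image
    label = torch.tensor([0])                                 # the class it is currently assigned

    loss = F.cross_entropy(model(image), label)
    loss.backward()                                           # gradient of the loss w.r.t. the pixels

    epsilon = 0.01                                            # an imperceptibly small step
    adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
    # 'adversarial' looks identical to 'image' to a human; on a trained network a
    # perturbation like this is often enough to change the predicted class.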
That attribute and alignment work is done by keypoint annotation, actually, so we take all the human data, you click on all the hands and the feet and the arms, and you use that for training, to then be able to locate those parts. So that kind of helps relate it, but you have to give it that information; it doesn't learn that automatically. >>: Wouldn't your mom just say do it on moving pictures? >> Larry Zitnick: We keep bringing that up. It's like, "We should do this on moving pictures." That's how—you know—kids do it, right? But yeah, it's tough. The other thing is, back to here: these algorithms, again, find these object proposals, but some objects aren't gonna be found by these object proposals using just low-level cues, right? So we're gonna be missing some objects because we never even evaluated our detector on them in the first place. So basically we're back to the interest point problem that we talked about before. Yes, this is more repeatable than the interest points that we looked at—you know—back in 2000 or 2002, but it's still a problem. So I think the field going forward will begin to look at how we take these complex descriptors and start applying them densely across the entire image again. And this is already being worked on. Alright, so in the last couple minutes I just want to conclude. So, things that we've seen that deep neural networks have problems with, right? Detection, understanding the full extent of the objects, and being able to relate one object to another. And also—you know—PASCAL: the reason why it took people so long to do well on PASCAL is because PASCAL didn't have enough training data to train these deep neural networks. And the only way they got it to work was they trained it first on ImageNet to get the neural network in a good spot, and then they just kind of tweaked it a little bit on PASCAL. So you basically have to use ImageNet to bootstrap it. So in comes Microsoft COCO. Since they decided to discontinue PASCAL several years ago, several people within MSR and within the academic community decided to get together to create a new object detection data set. And with this data set, we want to find images where we have objects in context—so that's the cell phone example I was mentioning earlier. We want segmentations, not bounding boxes; we really want to understand the true extent of the objects and not just the bounding boxes that they have. And we also want to have non-iconic instances of the objects. And let me explain really quickly what that means. These are iconic images. This is what a lot of images in ImageNet, let's say… not all, I mean, but a bunch of them look like this, where you basically have these big objects in them, right? You can also have iconic scene images, where you have images that look like these. So—you know—generally what you find is bathroom scenes which are staged because you're selling your home, that sort of thing. They don't have humans, they don't look realistic, right? The type of images we wanted to gather were images like these, where you're using, let's say, a toothbrush in a real situation. You know, you have cluttered scenes and that sort of thing. You want to be able to recognize the objects in these sorts of scenes where you have contextual information and realistic actions going on. So, relative to other data sets, we don't label as many categories, but we have a lot more instances per category. 
So we have ten thousand instances at minimum per category. For humans, we're gonna have four hundred thousand humans in our data set, each one segmented, so this is gonna be a lot of data for doing deep learning. And then we also have more objects per image, so blue is us and these are the other data sets. PASCAL and ImageNet only had a small number of objects per image—I see one or two for most—whereas we have a lot more objects per image, which we think is important for that contextual reasoning. And segmentation, as I was mentioning earlier: all of this is done on Mechanical Turk with the generous funding of Microsoft. Microsoft's been really nice in giving us some money to be able to do this, and we've already run seventy-seven thousand hours of working time on Mechanical Turk; it's the equivalent of somebody working eight years nonstop on this problem, segmenting out objects, and at about twenty thousand a day you max out Mechanical Turk. And these are just some of the things. Also, another very interesting thing for future researchers: every single one of these images has sentences with it, so one of the very interesting future areas of research, which I haven't gone into today, is actually taking the image and then describing it using natural language. And this is a really hot area right now; there are a lot of people moving into this space, but due to time, I decided not to talk about that today. Again, go to Ross's talk next month; it's gonna have a lot of great stuff about deep learning. You'll get more details on that, and I'll stop there. [applause] >>: Do you all have any questions for Larry? >> Larry Zitnick: Yeah. >>: Can you go back one image? The segmentation you have, like, the couch in that third one: do you label that whole kind of background as couch, or is it two different pieces of the couch? Two different segments? >> Larry Zitnick: Okay, so the question is: for this couch segmentation, how is that segmentation created, essentially? This segmentation is actually two different pieces, so we basically had the Turkers only label the visible pixels of the objects. So—you know—if I'm standing in front of the couch, we don't, like, label the part of the couch that's hidden behind me. >>: But you do want kind of that context to be… to understand, like, this is a whole couch. >> Larry Zitnick: With humans, I mean, these are results from humans—you know—human results. >>: But does the database say there are two couches? >> Larry Zitnick: No, no, the database would say that this is a single couch. >>: Ah. >> Larry Zitnick: Yes. >>: That's what you want. >> Larry Zitnick: Yeah, it says it's a single couch. So we want to be able to reason just like humans would. Yeah? >>: So you seem to be aspiring to human-level perception quality, but the human model is trained on three-D data, >> Larry Zitnick: Yes. >>: …and then applying it to two-D detection, so are we pursuing a harder problem than we need to? >> Larry Zitnick: Computer vision researchers really do like to tie our hands behind our backs. We don't like to cheat and use depth data or use video data or anything else. We know humans can do it from two-D images, therefore we should be able to do it from two-D images. So that's the cynical answer. The other answer is yes, we would love to use depth data. I mean, Kinect proved that point really well: if you have depth data, everything is infinitely easier. 
You can get the boundaries much more easily. Video data would be great. Why aren't we doing that? Essentially, because that data's a lot harder to collect. You know, there are depth data sets out there right now, but relative to, like, these data sets, we're talking thousands of images instead of millions of images, right? So if we could find a way for—you know—the entire world to take one picture with depth, we could probably get a much bigger data set, and that would be very useful, but we just don't have the data to play with—you know—and same thing with the video data. We can go to YouTube, but—you know—how do you label it then too? And that's another interesting area for research: unsupervised learning—you know—can we learn about the visual world without actually having explicit labels? Because labeling things like this is a huge effort. It's taken us over a year to create this data set. It's been a lot of work, so—you know—it's a good area for research as well. But yeah, it's just getting the data. >>: On that issue with the video, you could do a variant, which is: you collect your images, rather than from static images on the internet, you collect them from video clips, and you just have the Turker label the middle frame. It's the same amount of labeling effort, but then the vision algorithm has access to things like motion segmentation… >> Larry Zitnick: That's one of the things that people have been asking us, two things. One, can you literally do what you just said, which is: you take a video, extract out a single image from that video, and just label that. Or we could take… people have, like, photo streams, so—you know—it's a bunch of images that are very similar, and you just label one of those. So people have been asking us about that; we didn't do it. You know, we got all of our images off of Flickr. It's harder—I mean, it's just the YouTube problem—it's just another level of complexity, so… I agree, I think it would be a great idea. It's just getting somebody to do it, and it's hard to convince a graduate student to bite on something like this. >>: So are you… oh, sorry. >> Larry Zitnick: Go ahead. >>: Are you also looking at the problem of, like, for this object: a dog is here in the image, but also it's facing this way, for example… >> Larry Zitnick: Yes, so there's a lot of… another thing that's popular is attributes, so it's like: which way is the dog facing? So one thing that we're doing here is we're putting keypoints in, so this is like labeling where people's hands, feet, and head are. So you can label the parts of the object as well. Now for dogs, we probably won't add the keypoints, but you can imagine that. Another thing we want to do is, for all the people, we want to say man or woman, how old are they, are they wearing jeans—you know—all these different other things that you could possibly label, which would be very interesting, I think, for practical applications and very doable. Yeah? >>: So do you have a sense of: are we back on a generally upward trajectory, or are we…? >> Larry Zitnick: Right now, we haven't maxed out. 
I mean, we keep thinking that, like, the numbers on PASCAL—you know—the improvement in detection is kind of maxed out and that, knowing that, we'll have to do something else, but by tweaking these deep neural networks, and—like I said—Ross will get into it more next month, we keep seeing fairly significant improvements. So I don't think we've seen the end of it. It'll probably be another year—you know—so… I still think a lot of these segmentation problems, or a lot of the problems that I mentioned, are still going to be there—they still need a big leap to kind of go beyond that—but yeah, the rate of improvement's pretty fast right now. Yeah? >>: So there was this whole body of work on cascaded classifiers and how solving one particular vision task can actually help you solve another vision task. >> Larry Zitnick: Yeah. >>: Has that been distilled out with deep learning as yet? >> Larry Zitnick: Well, I mean, that's one of the powers of deep learning: basically all the features are shared for all the object categories up until, like, the last layer or two. Because the last layer is basically a thousand different classes for the classification task, and they basically share all the other features. So all the features that it learns in the middle are shared by all the object categories. And that basically gives all the object categories additional information—there are more images per category. >>: Well, so I was actually thinking, when you're down in the lowest level, solving—let's say—a problem for classifying cars, but you also have a separate deep learning network you've learned for detecting cars or maybe segmenting cars… >> Larry Zitnick: Yeah. >>: The output of that can actually feed back as additional features at higher levels… >> Larry Zitnick: Yes, yes, yes, yes. Yeah, so that's something… again, this is part of an academic bias, people like to start from the pixels—you know—'cause then the paper doesn't seem as hacky, but yeah, I think, from a practical standpoint, there are a lot of other things you can start feeding into it—you know—as additional information that it could use. Now, the question is whether some of that's redundant or not—you know—it'd really have to be complementary in order to see a win. >>: This gentleman here brought up the point of the goal being for the technology to see the image as well as a human can see it, but you had one image back here where it was imperceptible what the changes were, the pheasant and the school bus and… >> Larry Zitnick: Yes. >>: …why is the algorithm failing on the second one when it looks the same to us? >> Larry Zitnick: That is a great question. I mean, that's one of the things… we keep saying—you know—these deep neural networks are doing amazing things, right? >>: [indiscernible] back to the photo? 'Cause the middle image, actually, it's an interesting… there is a… >> Larry Zitnick: This one? >>: Yes, that one, yeah. >>: …they cheated in a very specific way, and the middle image shows exactly how they cheated to make this happen. >>: Okay. >> Larry Zitnick: Yeah, this is a deformation that you can see. It's a kind of… >>: You have the middle image [indiscernible] >> Larry Zitnick: Yeah, yeah. 
Basically, this is how you kind of change the pixels… >>: The way I would explain it is, if you—let's say—build an intelligent agent by having a rule book for answering twenty questions, right? And then you give someone else the rule book, they can look at it long enough and figure out how to make it answer the questions wrong, right? Because if you have a rule set… and so with the neural network, you can analyze it and say, "Over here, it really wants to see this thing." And so it's… in some ways it's… >>: It's a cheat. >>: It's a cheat. It's as if you looked at the software code and you were able to figure out where the bugs were, so you could give it the test case that breaks it, which is what hackers do, right? >> Larry Zitnick: And the important thing here is—you know—people want to know when it will fail, right? We had this great mechanism, right? But there's not as much science into when they fail, or why they really do work—you know—so it's examples like these that make everybody kind of cringe a little bit, because you can see that sometimes they do just fail for seemingly random reasons. >>: It's the man behind the curtain. >> Larry Zitnick: Yeah. >>: So based on what you said, it's not possible to do this without knowing what the neural network is doing? >>: Yes, you'd have to be able to pull apart the network and put in probes. It's as if—you know—you put electrodes into the brain to figure out how to fool someone's brain. >>: Is that what they did here? >>: [indiscernible] this paper… >> Larry Zitnick: Uh… I mean, they definitely looked at the outputs, and you can basically tweak the inputs, but—you know—I bet you if you just randomly moved pixels around, you could find similar things. >>: Find something similar [indiscernible]. >> Larry Zitnick: Yeah, and it would still, to a human… I could take that bus image, right? And I could warp it in random ways that, to us, would still look like a bus… >>: This is true of any machine learning algorithm, right? If you get to vary the input enough, you will find, accidentally, something that looks close but that forces a mistake [indiscernible]. So in some ways it's not surprising, but it is… it's sort of… >> Larry Zitnick: Well, I mean, it is true… this is really an example to show that—you know—we're not… this isn't equal to humans. Or, the other thing is: what this is doing is not exactly the same as what humans are doing; there is a difference there, you know? And I think this really clearly shows that. Do you have a question? >>: I have a quick question concerning these local descriptors. So—you know—affine invariance helps us in some… it's not only used for recognition, but for three-D too, for wide-baseline tracking… >> Larry Zitnick: Yes, yes, yes, yes. >>: …where you have—you know—a moving-camera video—you know—segments, basically. But, for instance, for what's called survey photos—for instance, in the movie business—with a widely changing camera… >> Larry Zitnick: Yeah. >>: …you need to have two-D correspondences. When I tried to use it, then unfortunately these algorithms—you know—choose whatever they want to track, not what I want to track… >> Larry Zitnick: Yes, yes. >>: …in segments, they choose [indiscernible] corners of the building and so on. Here, they choose whatever it was. 
So when I looked at it… 'cause I want to ask you: do you know of a paper where you, for instance, track whatever the affine-invariant ellipses take you to from one image to another, and then, knowing where the point you're interested in is [indiscernible] the coordinates, say, of this ellipse, find this point as a [indiscernible] from the [indiscernible] correspondence? For instance, I'm interested in—you know—specific features which are [indiscernible] in a sense of—you know—coordinate things… >> Larry Zitnick: Mmhmm, mmhmm, mmhmm. Yeah, so I mean, there's a… >>: …But I couldn't track them; they could track something which was around it, so it was something nearby. I didn't have time to finish this, but [indiscernible] a practical thing. >> Larry Zitnick: I mean, this actually brings up a couple of good points. One thing here is—you know—I was disparaging of SIFT and those things, about what they were, but there's a lot of great applications that they work on. One of them is three-D reconstruction, and this tracking… since we're talking about object recognition, I didn't go into that, but yeah, there's a lot of other applications that a lot of these technologies—you know—gave birth to. As for this tracking question, there's a lot of literature; it'd probably be better to… I mean, Rick here, and there's several people here who are, you know, experts… yeah, exactly. As far as any exact paper, I'd probably have to know a little bit more about the exact application, 'cause there's just so many papers in this space. There are some good survey papers; I think Rick wrote a good survey paper. >>: We should take this offline, because it's a different part of computer vision than recognition. >>: Okay. >> Larry Zitnick: Yeah. Any other questions? Alright, cool. >>: Thank you. [applause]