
>>: Good morning everyone. It's my great pleasure to welcome all of you to the second in our state-of-the-art lecture series on computer vision. The first lecture—in case you missed it—was given by Jian
Sun on face recognition. It’s on ResNet, you can find it. The series is being organized by The Virtual
Center on Computer Vision. The Virtual Center sounds like a big name but it’s basically a bunch of us
scattered across different MSR labs including, like, Baining Guo in Asia, and Andrew Fitzgibbon, and I,
and a few others—Matt and Zhenyu—and we’re trying to just coordinate various research going on and
product transfers in computer vision. We’re sponsoring this series, we’re working on a web page where
you can find more information about various vision activities. And with that background, let me say a
little bit about Larry Zitnick, our speaker for this morning. Larry got his PhD from Carnegie Mellon, 2003;
he’s been with us since then and worked on a variety of different projects. Some wonderful work on
three-D video, stereo matching, and he’s done some computational photography work as well, but for
the last many years he’s been concentrating on object recognition, visual understanding, scene
recognition, he’s had some wonderful work on understanding and modeling context using clip art
animations, and most recently co-created the COCO database—that is the latest and greatest, some of
you may have heard of ImageNet and deep learning, well COCO is much better and he might tell you
why in this talk. So, thank you, Larry.
>> Larry Zitnick: Alright. So I recognize many of you, some of you I don’t recognize. I just want to give
some background as to what I want this talk to be about. So probably, as Rick said, many of you have
heard about deep learning and—you know—there’s been a lot of hoopla lately about deep learning and
how it's solved computer vision, how AI… the next coming of AI is gonna—you know—truly bring AI. And
what I want to do is I want to give everybody some context for these big changes that have been
happening because it’s hard to… it’s hard—you know—for… if you’re not a computer vision expert—and
really this talk is geared toward people who aren’t computer vision experts—it’s hard to judge whether
these advances are advances that go to, “Wow, computer vision actually really works now, like, I can
actually recognize objects,” or it’s just a bunch of academics saying, “Wow, we’ve failed miserably
before, and now we’re failing just a little bit less miserably and we’re really excited and now we’re
publishing articles about it.” [laughter] Right? So obviously the truth lies somewhere in between, but I
think it’s hard to kind of just like jump in, right, and for me just to say this is what deep learning is and
this is what’s been solved and kind of get a sense for—you know—why is there so much excitement? So
what I want to do today is I want to spend… I want to talk about how did we actually get here—you
know—how did we get to this deep learning craze? And then, where are we now? So basically I’m
gonna spend about forty-five minutes talking about the last twenty or thirty years in object recognition
to give everybody some context for the big advances that happened in the field, why they happened,
why they failed, why they were good for certain things. And then that will give you a sense for why
everything… everybody was so excited about deep learning in the last three or four years. And then
we’ll also talk about—you know—is deep learning… is it going to solve it all—you know—is object
recognition solved now? Can you trust the news articles that are coming out? So, we’ll get to that.
Alright, so let’s begin with our history lesson. So in 1966 you can say computer vision kind of started
around there. You have Marvin Minsky, a big AI guy, went to one of his students one summer and said,
“Hook a television camera up to the computer and get it to tell us—you know—make it tell us what it
sees." Right? Simple enough. [laughter] Now, as many of you—you know—probably could guess,
it failed miserably. [laughs] You know, and it wasn’t many… there wasn’t much success in computer
vision at this time. But to really kind of understand, it’s hard to put yourself in the place of somebody in
the 1960’s or the 1970’s, right? But let’s just try to do that for a second. So imagine you have this image
here and you want to detect the faces in the image. How would you detect faces? Now you… no
computer vision—you know—answers, no knowledge of what's already been done. If you were a—you
know—a naïve, non-computer vision expert and you just wanted to detect faces, how would you do it?
Any ideas?
>>: Color.
>> Larry Zitnick: Color? Color’s a good one. Any other ideas? Unfortunately color was not available
back in 1970. [laughter] Any other ideas?
>>: The eyes, the nose, and the mouth.
>> Larry Zitnick: Yeah, you find the nose, right? So you find the nose—good. And you find the eyes, right? And
maybe if the eyes are in the right location relative to the nose—voila—you have a face. This is exactly
the same intuition that people in the 1970s had. I mean this is what you’d… if you ask a kid, “How do
you recognize a face?” This is exactly the answer you get, and this is exactly what they tried. So you
have something called a constellation model or pictorial structures which is basically saying you detect a
nose, you detect a mouth, you detect the hair, and they should all be in the same kind of rough position
relative to each other which is represented by these springs. So these springs kind of want everything to
kind of rest at these certain locations, right? Now, they did excellent work and they got—you know—
beautiful accuracy on the data set of, I think, about sixteen faces. And this is some—you know—
captioned… or pictures from the actual paper, so you can see the type of images they worked with. So,
it’s fantastic work. [Laughter] And in case you’re skeptical, they even added noise to the images and it
still worked. So, this is fantastic work. So face recognition was solved in ’73, kind of.
Alright. So unfortunately—you know—in the late seventies, as anybody who's—you know—known much about AI knows, what happened was everybody promised the world in the seventies and then suddenly
nothing actually happened and you had this thing called the AI winter. And the AI winter also affected
computer vision because computer vision is—you know—a subset of AI. So computer vision research
kind of fell off a little bit as well and what we found was that nothing was really working so we kind of
went back to basics. We said, “Well, let’s not try to detect people or animals or chairs or those sort of
things which were big in the seventies. Let’s just try to detect an edge,” you know. So there’s a ton of
papers in the eighties looking at edges, looking at optical flow, looking at these really kind of low level
problems, ‘cause we could actually make a little bit of progress in these fields. And at the same time
there were other people saying, you know, “I could take these edges,” and “Let’s look at the properties
of edges; some edges are parallel; some are like this;” and “How do we group edges together?” Looking
at some of the more scientific questions. And there’s also—you know—fabulous debates about how do
you actually represent objects? Do you represent objects using a three-D form? Do you recognize…
represent objects using—you know—two-D representations and you just have a two-D representation
for different poses? And there’s good debates about this and plenty of fine papers—you know—
discussing this back and forth, and blah, blah, blah, but—you know—none of this actually helped in real
recognition algorithms. So come around the late eighties something happened which was we kind of
forgot about the science. We stopped thinking about the science behind computer vision and this is
kind of the turning point when computer vision, I think, really switched from being more of a science to
being more of an engineering discipline. And there’s a very simple reason for this which is: something
actually worked. So, you have Yann Lecun, you have something called a convolutional neural network—
which I’ll describe in just a little bit—and he was able to show that you could train a vision algorithm to
recognize handwritten digits. And this is the first time that we had an algorithm that could kind of work
out in the wild. You know, we had algorithms that could take—you know—parts going down a machine
assembly line and kind of line them up roughly, or look for very specialized applications, or the face
stuff I showed you earlier, but again, that’s not going to work in the wild. But here… this actually
worked. You could take zip codes off of real letters and actually recognize them, right?
Alright, so how did this actually work? Well, so one of the things that I’m gonna be talking about a lot
today is something called a filter. Probably most of you know what filters are, but for those of you who
might not know what a filter is, a filter is basically a local operation you apply across the entire image.
So if you have an image like this one, you can apply a filter where you take the pixels to the right and
you subtract them from the pixels to the left and you get a response that kind of looks like this, so you
can detect vertical edges. Or you can do the same thing with just subtracting the pixels from below you
with the pixels that are above you and you basically detect horizontal edges, alright? So you can
imagine that these sorts of filters are good at finding—you know—edges in an image or other sorts of
patterns that you might imagine. And what Yann Lecun and his coauthors realized was you don’t need
to learn separate filters for this part of the image than you do for this part of the image. You can
actually run the exact same filter over the entire image and it works just as well—which seems kind of
obvious, and it is kind of obvious, but it actually made a big difference. So you could take a single filter, you
run it over the whole image, and then you get a response—it looks like this. And using the neural
network you can learn all sorts of different filters. And the cool thing here is before what they were
doing is they were trying to learn different filters for every single position, but the problem was the
number of parameters you needed to learn was huge. But if you can use the same filters across the
entire image, suddenly the number of parameters goes from—you know—bazillions down to something
very manageable, and you can actually learn it. And it's because of this—the convolution of these
filters—that the neural networks worked. And then what they would do is they take this filter
response—we don’t really care about exact locations—so you take the filter response and you just
shrink it, you just make it smaller. You can do this using a max operation, an average operation…
there’s a lot of different ways of doing it, but you basically make it smaller and then, basically, you just
keep repeating that process. So you have your input image, you run your filters, you make them
smaller, and then you run another set of filters on their responses and make them smaller, and then you
can basically do your fully connected neural network, alright? And this would be an excellent handwritten
digit recognizer. Now you can apply… Yep?
>>: Why the stages for creating… connecting the image to the neural network?
>> Larry Zitnick: Where’s that?
>>: So why were you doing that [indiscernible]?
>> Larry Zitnick: Oh, going back and forth.
>>: Yeah.
>> Larry Zitnick: So what happens here is you have your image here, you run some filters on it and then
you shrink it down, right? And what it does is when you do that pooling operation, you shrink it,
basically you lose a little bit of spatial location information, right? And then you apply another set of
filters, alright, and then you shrink them down and it loses a little bit of spatial information. So what
you're doing is you're essentially saying, "These here are very simple filters which are—you know—kind of,
like, edges, that sort of thing.” Now, these guys are basically gonna be taking combinations of filters
which have a little bit of invariance in them, so they can represent, like, larger things. So instead of
recognizing just this part of an eight it’s gonna represent the entire top of the eight, let’s say, and it gives
you a little bit more invariance every time you do it. Does that…? ‘Kay. And you can do… we’ll talk a
little bit more about this later ‘cause obviously this comes back to life—you know—twenty years later.
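To make the filter-plus-pooling idea concrete, here is a minimal NumPy sketch (not code from the talk; the 2x2 edge filter and the pooling size are just illustrative assumptions) of sliding one shared filter over an image and then shrinking the response:

    import numpy as np

    def convolve2d(image, kernel):
        """Slide the same small filter over every position of the image and
        record its response at each location (valid mode, no padding)."""
        kh, kw = kernel.shape
        ih, iw = image.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
        return out

    def max_pool(response, size=2):
        """Shrink the response map, keeping only the strongest value in each
        size-by-size block; exact spatial position is deliberately thrown away."""
        h, w = response.shape
        h, w = h - h % size, w - w % size
        r = response[:h, :w].reshape(h // size, size, w // size, size)
        return r.max(axis=(1, 3))

    # A vertical-edge filter: subtract pixels on the left from pixels on the right.
    vertical_edge = np.array([[-1.0, 1.0],
                              [-1.0, 1.0]])

    image = np.random.rand(28, 28)       # stand-in for a handwritten digit
    response = convolve2d(image, vertical_edge)
    pooled = max_pool(response)          # smaller map, a little spatial invariance

Stacking a few of these convolution-and-pool stages and putting a small fully connected layer on top is essentially the LeNet-style digit recognizer being described here.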
So—you know—and then they do the same thing with faces. So you can take a box within an image and
you can try to detect faces in an image and you basically take your box, and then you do the same thing
with convolutions, and then you spit out some output. There's a bunch of good work doing neural networks
with faces. One little side note: today in my talk I’m going to be talking a lot about detection and
classification. To, like, the layperson detection and classification are exactly the same thing but there’s
actually two distinct meanings within the vision community. Classification
is… think about the handwritten digits. You’re given an image and you just had to say, “Is this a five? Is
this a three? Is this a two? Does this image have a dog in it? Yes or No,” right? That’s classification.
Now detection… you not only had to say, “Is there a dog in the image?” But you also have to draw a
bounding box or some sort of localization of where that dog is in the image as well, so you’re detecting
where in the image that object is. And if there’s two objects you want to detect both, alright?
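As a rough sketch of that distinction (the type and function names here are made up for illustration, not from any particular library), classification returns labels for the whole image while detection returns a label plus a localization for every object:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Detection:
        label: str          # e.g. "dog"
        box: tuple          # (x, y, width, height) in pixels
        score: float        # confidence

    def classify(image) -> List[str]:
        """Classification: answer 'what is in this image?' with labels only."""
        raise NotImplementedError   # a real model goes here

    def detect(image) -> List[Detection]:
        """Detection: answer 'what is where?': one labeled box per object,
        so two dogs in the image give two Detection entries."""
        raise NotImplementedError   # a real model goes here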
So, these early neural networks—you know—numbers worked great, faces worked great, and they tried to
do it with other categories, but they weren’t getting much traction—you know—they didn’t solve object
recognition at this point. It worked for a lot of other things too; we had self-driving cars back then
that—you know—only had four neurons in them, I think, or five neurons in them, and they were able to
drive a car for hundreds of miles. You know, it was amazing what they could do, but they weren’t
solving everything. And the reasons for this failure, I’m not going to go into now because we’ll get into
that when we get to the year 2012, alright?
But since we’re talking about faces, let’s talk about faces a little bit ‘cause they’re really interesting
‘cause this is something that, very early on, everybody wanted to detect faces; we’re humans and we
want to be able to detect each other so faces obviously play a big role. And one of the very early face
detectors that really just made people go, “Wow,” was the Viola-Jones face detector. And this used
something called a boosted cascade classifier which, given an image, you'd
basically slide a sliding window along it to do your detection, and it can very quickly throw out bad
detections, very quickly. And then when it found a detection it thought could be good, it could then apply more powerful machinery in that one location. And what this made possible was you
could actually do face detection in real time. So Paul Viola when he got up on stage to demo this, he
would actually take a camera out and wave it in front of the crowd and you could see the faces
detected, which today would be like, "That's so lame, because, like, we can do that on our cell phones' cameras." But back in the day, it just, like, was a, "Whoa," it just blew people away, right, 'cause
nothing worked real time. And then it actually, well—A: it worked, which is pretty cool; and B: it worked
real time which is just, like, amazing, like, that just didn't happen back then. This is in 2000, where—
you know—digital cameras are still kind of novel at that point or fairly novel at that point. But you know
what's interesting about this? It's not just that the cascaded classifier made it fast, it was
actually the features ‘cause the features were also very fast. But it wasn’t the fastness that I think is
interesting, it’s actually the features themselves. So let’s look at the features that the classifiers used to
actually recognize the faces. So, let's say we have a face like this: what does the classifier use? There's
something called a Haar wavelet and a Haar wavelet is simply… it takes the average of all the pixels
within one box and subtracts it from a whole bunch of pixels in another box—and that’s your feature—
and it just gives you a value and then you put that into a classifier and then using that, it can recognize a
face. So just the two big boxes subtract the values and that’s your feature, which seems incredibly
naïve, incredibly simple, right? What’s cool about these, you can compute them very fast, but it seems
like simp… it just seems too simple to work. You know, if I went back to, let’s say, 1970, right, and I went
to my… and I'm a student, I went to my advisor and he said, you know, "Solve this face detection
problem.” And I said, “Okay, I want to take a bunch of rectangles and just subtract their values—you
know—over the face and I bet you that’ll be good enough to do classification.” They probably would
have laughed at you and, like, “No, no, no, no, you’ve got to do some… you’ve got to detect noses and
the eyes and—you know— all that other stuff. You just can’t throw a bunch of random boxes all over
the image and just hope that it'll detect a face." Nobody would have ever thought that would work. And
that’s not an obvious way of solving the problem, yet it works. So why does it work? Any guesses?
Alright, well probably vision people have a good guess, but… the reason why it works is a couple of
reasons. One is, our faces are three-D, right? So they have a certain shape to them. Another thing is
our faces have a tendency to always be upright. You know, I’m not sitting there looking at you guys like
this very often. And then the third thing is the lights are almost always above us, which means that my
forehead is lit more than underneath, my cheeks are always bright, my nose is bright, my eyebrows are
dark, right? And so you get kind of this same shading. The second thing is that the thing that makes a face a face, the details that make a face a face, is not the exterior, it's not the boundary, it's not,
like, the oval of my face, it’s actually the interior features. You know, it’s the eyes, it’s the nose, it’s the
mouth. And if you take a bunch of faces and average them together, you'll see that this part of the face actually averages together really well. It's very consistent with what this part of the face looks like.
And this part of the face actually isn’t repeatable much at all because people have hair and they have
different—you know—things going on there. So if you look at these Haar wavelets, the two most useful
ones were these—were basically the dark area—which is like this area of your head is dark, this is light,
which is basically your cheeks down here. Makes total sense. Here again, the bridge of your nose is
brighter, your eyes are darker. So it’s just taking advantage of these simple properties of faces and the
fact that lighting is fairly consistent, our faces are usually upright and it just works because of that. It’s
almost like evolution made our faces easy to detect, right?
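Here is a minimal sketch of that box-difference feature, assuming an integral image so each box sum costs only four lookups (the box positions and window size are illustrative, not the actual Viola-Jones configuration):

    import numpy as np

    def integral_image(gray):
        """Cumulative sums so any box sum can be read off in four lookups."""
        return gray.cumsum(axis=0).cumsum(axis=1)

    def box_sum(ii, x, y, w, h):
        """Sum of pixels in the box with top-left corner (x, y) and size w x h."""
        a = ii[y + h - 1, x + w - 1]
        b = ii[y - 1, x + w - 1] if y > 0 else 0
        c = ii[y + h - 1, x - 1] if x > 0 else 0
        d = ii[y - 1, x - 1] if x > 0 and y > 0 else 0
        return a - b - c + d

    def haar_two_box(ii, x, y, w, h):
        """One Haar-like feature: the brighter lower box (nose/cheeks) minus the
        darker upper box (eyes/brow), a single number fed to the classifier."""
        top = box_sum(ii, x, y, w, h)
        bottom = box_sum(ii, x, y + h, w, h)
        return bottom - top

    gray = np.random.rand(24, 24)            # one candidate face window
    ii = integral_image(gray)
    feature = haar_two_box(ii, x=4, y=6, w=16, h=6)

The boosted cascade then chains many of these one-number features, rejecting most windows after evaluating only a handful of them, which is where the real-time speed comes from.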
Alright, so this worked great for faces, but why didn’t it work for anything else? And there’s a simple
reason for this as well, which is… let's take a profile of a face, right? These face detectors don't work nearly as well when I'm in profile than when I'm looking front on. And why is that? Well, if you look at a profile of a face, what makes it unique is actually kind of the bridge
of your nose and your mouth. This part right here, right? That’s what makes a face unique. But the
problem is that this part right here, that line that you see, is partially created by the background and is
partially created by the face. Which means if the background’s light then it will be brighter than the
face; if the background’s dark it will be darker than the face. You put those boxes over that area and it’s
gonna be completely random whether one is brighter than the other, right? Same thing for objects like
mugs. We don’t detect a mug by the interior of the mug, I mean, the… you know, mugs look different,
every single one, right? What makes a mug a mug is actually the profile, right? And again, since a
profile's gonna be against the background, any of these features that we're using—
they’re not gonna be repeatable, they’re not gonna be useful at all for detecting, let’s say, this mug
here, alright? So the great news was computer vision had a stake in the ground, we had something
that worked; the bad news was it didn’t work for anything other than faces. But at least we had
something that worked—we could justify our jobs which is always good. Back in the 2000’s we needed
some way to justify our jobs, even though, I was still in school then, so it was okay.
Alright, so moving on. Let's go back in time just a couple of years. Let's talk about SIFT. And I have an
asterisk on 1999 there because SIFT was published in 1999, but it was first submitted to a conference in
1998 and it was actually rejected the first time, which by itself would not be that surprising. Our papers
get rejected all the time, it’s part of being an academic, being a student. You just get used to it; your
papers get rejected—even though it still kind of hurts. But this paper was exceptionally notable because
it was rejected and it is probably the most referenced computer vision paper ever. You know, it’s got… I
don’t even… over ten thousand, maybe twenty thousand referen… I mean, it’s just got a ludicrous
number of references and spawned, like, entire indust… it’s amazing how influential this paper was, and
it was rejected. Why was it rejected? Because it was just a list of beautiful hacks that worked really
well. [laughs] But—you know—sometimes hacks are good. Alright, so SIFT had two things in it. One
was: no more sliding window. So what I was mentioning before—you know—we do face detection
where you have this sliding window going the whole way across the image. Now if you’re doing a sliding
window you have to evaluate at every single point which is expensive, alright? But if you had a way to
only evaluate at certain… at a smaller number of points, suddenly it would be less expensive which means
that you can use a descriptor which is more expensive for the same CPU—you know—power. So
basically we could use better features because you’re gonna be evaluating fewer places in the image. So
let me talk about what I mean by interest points. And there’s many types of interest points; I’m going to
be talking about difference of Gaussian here, there’s also corner interest points which is how they
detect corners in images. But this is one called difference of Gaussians or Laplacian. And the idea is you
have an image, like this beautiful cat image here and you just blur it. You blur it a little bit more and a
little bit more and a little bit more. And then you just subtract neighboring images and you get these
responses. So this is just this image minus this image, this image minus this image, you get these guys.
And all you do is you look for peaks in these responses—peaks both in x and y and also peaks in scale
here, so it’s a three dimensional peak. You find a peak, that’s your interest point. And the thinking here
is, what this does, when you subtract one blurred image from another blurred image, is it basically finds blobs of bright areas or blobs of dark areas. So if you
look at the image and we look at the interest points returned, you can see you have kind of a blob of dark area, a blob of lighter area, a dark area—you know—lighter area, lighter area, all across the image. And the cool
thing here is if you took this image and you kind of moved it like this, or the cats moved just a little bit, these points on the image would still fire. And if I actually had a video of this and the cats are moving, the first thing you'd notice is that these interest points are blinking all over the place.
But I look at this, I kind of call this approach a shotgun approach because you don't expect every interest point to be repeatable, but out of the, let's say, the five hundred interest points that you see in this image, yeah, maybe fifty of them will be repeatable, if you're lucky—probably not even fifty.
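A minimal sketch of the difference-of-Gaussians idea, assuming SciPy for the blurring (the sigma values and the threshold are illustrative):

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

    def dog_interest_points(gray, sigmas=(1.0, 1.6, 2.6, 4.1, 6.6), thresh=0.02):
        """Blur the image more and more, subtract neighboring blur levels, and
        keep points that are peaks in x, y, and scale (bright or dark blobs)."""
        blurred = [gaussian_filter(gray, s) for s in sigmas]
        dogs = np.stack([blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)])
        maxima = (dogs == maximum_filter(dogs, size=3)) & (dogs > thresh)
        minima = (dogs == minimum_filter(dogs, size=3)) & (dogs < -thresh)
        scale_idx, ys, xs = np.nonzero(maxima | minima)
        return list(zip(xs, ys, scale_idx))   # (x, y, which DoG level)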
So let's talk about the descriptor. So now that we have an interest point—you saw it has a position, it has a scale shown by the size of the circle—let's say we want to extract out, now, a descriptor around that interest point. The first thing we do is we
take that patch and we compute a bunch of gradients, a bunch of those filters that I was talking about
earlier, and we do it in a bunch of different orientations, typically about eight. And then what we do
is—imagine this is the patch here—every single pixel has a gradient and an orientation, and then we just
pool them together, we just take the sum of all the gradients here which have an orientation in this
direction, and we put it here. And we take all the gradients here which have an orientation in this
direction and we add them here. And if you remember from the neural network we had that pooling
operation which basically gave us a little bit of spatial invariance, exactly the same thing here. Basically
taking the gradients, or basically averaging or pooling them together to give us a little bit of invariance.
You don’t care if this edge is here, if it’s over here a little bit, essentially. And then you basically have all
these gradients, and in this case you’d have eight, we have four different cells and eight different
gradients so you’d have a histogram of dimension thirty-two. And if you want it to be brightness and
offset invariant—it's already offset invariant because these are gradients—but if you want to make it
gain or brightness invariant, you then just normalize this histogram, make sure it sums to one, that sort
of thing. So it's a nice descriptor… it's a bit more complex than the Haar wavelets we talked about, and it captures a lot of the local gradient information.
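Here is a simplified sketch of that kind of descriptor: pooled gradient-orientation histograms over a small grid of cells, normalized at the end, using a two-by-two grid and eight orientations as in the example above. This is not the full SIFT implementation, which also handles orientation assignment, trilinear interpolation, and so on:

    import numpy as np

    def grad_histogram_descriptor(patch, cells=2, bins=8):
        """Pool gradient orientations into a small grid of histograms, then
        normalize so the descriptor is roughly gain/brightness invariant."""
        gy, gx = np.gradient(patch.astype(float))
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        h, w = patch.shape
        cy, cx = h // cells, w // cells
        desc = np.zeros((cells, cells, bins))
        for i in range(cells):
            for j in range(cells):
                m = mag[i * cy:(i + 1) * cy, j * cx:(j + 1) * cx]
                a = ang[i * cy:(i + 1) * cy, j * cx:(j + 1) * cx]
                idx = (a / (2 * np.pi) * bins).astype(int) % bins
                for b in range(bins):
                    desc[i, j, b] = m[idx == b].sum()   # pool gradients per orientation
        desc = desc.ravel()                 # 2 x 2 cells x 8 bins = 32 dimensions
        return desc / (desc.sum() + 1e-8)   # normalize so it sums to one

    patch = np.random.rand(16, 16)          # patch cut out around an interest point
    descriptor = grad_histogram_descriptor(patch)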
And what this was beautiful at doing was instance recognition. So you have an object like these—kind
of planar objects, a lot of texture in them—and you have an image like this and it can very quickly
match up these little patches from here with the patches in here. And for those of you who… you were
following computer vision, let’s say, ten years ago in the early 2000’s, it was… everybody was really
excited—you know—about recognizing CD covers. David Nister got on stage at one point and was showing his computer CD covers in real time and it was recognizing them… I think he had fifty thousand CD covers he could recognize and he could recognize any of them, and he's doing, like, air-guitar on stage. Unfortunately, CDs… nobody ever bought them again, I think, a couple years after that, so nobody
cared any more. And then they’d say, “Well we can use it for books,” and then nobody buys books
anymore, so obviously the technology is not that useful. But what it was useful for is things like
Photosynth, for those of you who… Photosynth is still around, yes. The other thing that
it’s useful for that we use… that we probably use fairly often is panorama stitching. I mean, this is
something that, I think, all of us take for granted now—you know—it runs on our cameras, it runs on our
cell phones. Panorama stitching seems so lame now because it’s so ubiquitous, but—you know—in
2003, 2002 it was like, “Whoa, this is awesome—we can do this,” you know, ‘cause before you actually
had to manually align the photos and it was just like… nobody did it. But then around 2003, Matt Brown
and David Lowe—you know—and Rick and many others all created—Matt with the ICE application—these great apps to do panorama stitching and really some great visualizations.
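Instance recognition and panorama stitching both start from the same step: matching descriptors between two images. A minimal brute-force sketch (real systems use approximate nearest-neighbor search, and then a geometric verification step such as RANSAC on the matched point pairs):

    import numpy as np

    def match_descriptors(desc_a, desc_b, max_dist=0.4):
        """Brute-force matching: for each descriptor in image A, find the closest
        descriptor in image B and keep the pair if it is close enough.
        desc_a is an (Na, D) array, desc_b an (Nb, D) array."""
        matches = []
        for i, d in enumerate(desc_a):
            dists = np.linalg.norm(desc_b - d, axis=1)
            j = int(np.argmin(dists))
            if dists[j] < max_dist:
                matches.append((i, j))
        return matches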
Alright, so basically now we have a method that can recognize objects using much more complex
descriptors because we have these interest points. So why didn’t it work for everything again? It works
great for these kinds of CD covers, book covers, it's great for panorama stitching, right? But why doesn't
it work for other objects? Why doesn’t it work for motorcycles, let’s say? Well, let’s look at that. So
there was a paper in 2003 that said exactly that: what we have is interest points, interest points are
awesome. We can now do more complex things because we only have to examine certain points in the
image. Can we come up with a kind of category object detector for this to just recognize generic
motorcycles? And what they said is—you know—basically, motorcycles are made of parts. You
know, all of them have wheels, all of them have headlights, all of them have handlebars, etc… And if
you think about it, all these things can be put together by springs à la the constellation model—we're back to 1973 again. People loved this idea, and you can see—you know—this is the front wheel and this is the
front wheel, this is the—you know—the back wheel—you know—and you can see the different parts
that it found and—you know—it really had these springs everywhere. It’s very similar to the 1973
model. And the way it found the parts… found the part candidates—because if you have all these parts and you need to evaluate every combination of parts, you can't do this for every single position in the image, it's way too expensive, right?—so what they did is they limited the part candidates to interest points.
You’d think this would work beautifully. And it worked beautifully in the paper, but why does this fail in
general? And this is an example from the actual paper itself. So the reason why it failed is interest
points actually are not… they’re not repeatable for object categories. They’re repeatable for object
instances, like one CD cover and another CD cover, right? But if you look at these two bikes, the interest points that are returned… yeah, okay, there's a circular one here and a circular one here, but
they’re not the same scale. If you try to find an interest point that is the same position here as this bike
here, it's nearly impossible. And a lot of this is for the same reasons. One bike has very different—you know—colors—you know—maybe this part's darker, this part's lighter than on another bike, right? And
also bicycles… motorcycles—you know—like, this area here—you know—there’s background, and the
background’s gonna be darker or lighter and that’s gonna affect the interest points. So there’s all sorts
of different things that can affect these interest points and make them not repeatable. And in general, I mean, something like a chair or a motorcycle or people even, they just don't repeat at all. So
really the only thing they work for well are objects which have a lot of texture internal to them, very
much like faces where you don’t have pollution from the background and the objects are kind of… are
basically repeatable from one object to the next—they don’t change a lot. And the other thing too
which actually hurt this algorithm a bunch, is that they had a spring between every single part. And
because of that you couldn’t… you really had to use interest points, and you could only have a small
number of parts because computationally this is very expensive to compute and to evaluate.
But the reason it worked is if you actually look at the datasets they tested on. So this is the Caltech4 dataset and this is—you know—this is in '03 and people were actually impressed by this result—this paper
won best paper award—alright, and this is the dataset. So basically it had a bunch of motorcycles in it, all facing the same direction—a lot of the motorcycles with a nice white background. You had airplanes all
facing the same direction. You had a bunch of faces, which might look impressive at first except you realize that this face dataset was literally created by a student walking around with the same exact camera, with a flash on, taking pictures of students and faculty around Caltech—yes, around Caltech.
So basically the lighting is all very similar, again, because it's driven by a flash, and many of the faces are of the same person. And then the back of cars—again, you know, the dataset was
incredibly simple. This is a data set created by the people who wrote the paper, which creates a bias
because you want to make sure your paper actually has good results, right? So, you might have had
other categories as well and you just kind of threw them away 'cause they didn't work.
Alright, so that was in ’03. Now let’s go forward a little bit and let’s go to ’05 and something called the
histogram of oriented gradients, or HOG; many of you have probably heard of HOG before.
And what they wanted to do here… so we had a student, [indiscernible], and Bill Triggs, who was his advisor—I'm making this up—but I'm assuming he said to—you know—this student, "Let's detect
pedestrians.” Pedestrians are something that are important to detect—you know—‘cause we want to
have autonomous cars at some point and—you know—people are—you know—we always like to detect
people. Pedestrians are a little bit easier to detect because pedestrians always look like this or
they look like this—you know—they don’t really vary much. So you have a student and he wants to
detect pedestrians like this, so what does this student do? Oh, let’s just talk about pedestrians a little
bit. So the interesting thing about pedestrians is it’s not like faces—you know—faces have the texture—
you know—which is interior to the object, but what makes a pedestrian a pedestrian really is the
contour. Because—you know—the shirt that I’m wearing versus the shirt you’re wearing or the pants
I’m wearing versus the pants you’re wearing—you know—they all look very different. So you can’t rely
on our internal textures, right? So you really have to rely on this boundary which makes it a more
challenging problem, ‘cause like we were talking about, this boundary has the background and the
foreground object in it, and there are cluttered backgrounds and then, again, significant variance. Interest points, as we learned, don't work for these sorts of scenarios—objects which you have to detect by their boundaries—because interest points don't fire reliably when you have this variance.
So we have to go back to a sliding window. Thankfully, at this point, computers are a little bit faster, so
the sliding window kind of made some sense. So, if you’re [indiscernible], a student and you’re thinking,
“Okay, I’m gonna do a sliding window detector and I want to be able to detect a pedestrian,” and you
look at—you know—what the current state-of-the-art is doing, you say to yourself, “Hmm, what should I
do, what should I do?” And you’d say, “Oh, I know what I’ll do, I’ll just make a really big SIFT descriptor
and slap it on top of the pedestrian and do a sliding window on that.” Seems like a great idea, right?
And that’s essentially, exactly what he did. [Laughs] So basically he takes SIFT—exact same thing I talked
about before, right, with the gradients—but now we're gonna have more bins because pedestrians are bigger. Instead of the four by four… I showed a two by two grid but in practice people use a four by four grid, and here you have a much larger grid that you use. But there's
one critical distinction and this is what kind of made the paper I believe, is that before with the SIFT
descriptor, if you wanted to take care of gain offsets you had to divide the whole descriptor and make
sure it all summed to one, right? Now imagine what you would… imagine what would happen if you
took, like, this cell and you want to normalize the descriptor so you basically divide by the magnitude of
the entire histogram. Now, imagine if you do this for a pedestrian, I… let’s say my legs have very high
contrast and my shirt has very low contrast, what’s going to happen is the gradients in my legs are
gonna be really large and the gradients on my shirt are gonna be really small and not very informative;
they’re gonna look like noise. Or you could have the opposite. You could have a lot of… depending on
the background—you know—the background could be white behind me, so this has a lot of texture
and maybe I’m wearing white pants and this has none, right? So if we divide… if we normalize by the
entire histogram, it’s not gonna actually perform that well. So they had this brilliant idea which was,
"Well, let's not normalize by the entire patch, let's just normalize by a local window." And what they did
is they took this patch and normalized by these four, these four, those four, and those four, and basically
made a descriptor four times as big and normalizing locally. And the cool thing about that is you’re just
normalizing based upon this local area right here, right? So that way if there is an edge here it’s gonna
be [indiscernible], you’re gonna see that edge, and you have an edge here, you’ll see that edge, but if
the contrast is different between them it doesn’t really matter, because at the end of the day, you don’t
really care about the magnitude of the edge. You don’t care that there’s a white to black area or a light
gray to—you know—slightly lighter gray area, right? All you care about is, is there actually an edge
there, yes or no? And that’s what this descriptor basically did. It said, “Is there an edge, yes or no?”
Presence, not magnitude—presence is more informative than magnitude. And if you look at… if you
just train an SVM classifier on top of that, you basically get a… you can see these… this is the HOG
descriptor visualization of it and these are the positive weights and the negative weights of the SVM
classifier. You can see it really does pick up on the contour of the pedestrian. And then you basically
have this pipeline. You compute HOG features, you train a linear classifier, perform some sort of non-maximal suppression—non-maximal suppression just means if I detect a person here I shouldn't detect a person right here; you just pick the one that has the higher response and remove the other one. And
this pipeline reigned for quite… I mean it was, like, the pipeline for a long time.
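Here is a sketch of the two pieces just described: local block normalization of the cell histograms, and non-maximal suppression of overlapping detections. It is a simplified illustration, not the exact Dalal-Triggs implementation:

    import numpy as np

    def block_normalize(cell_hists, eps=1e-6):
        """HOG-style local normalization: instead of dividing the whole window by
        one global magnitude, normalize each 2x2 block of cells by that block's
        own magnitude, so 'is there an edge here?' matters more than how strong
        the local contrast happens to be. cell_hists is (rows, cols, bins)."""
        rows, cols, _ = cell_hists.shape
        blocks = []
        for i in range(rows - 1):
            for j in range(cols - 1):
                block = cell_hists[i:i + 2, j:j + 2, :].ravel()
                blocks.append(block / np.sqrt(np.sum(block ** 2) + eps))
        return np.concatenate(blocks)        # roughly 4x bigger, locally normalized

    def non_max_suppression(boxes, scores, overlap_thresh=0.5):
        """Keep the highest-scoring detection and drop overlapping boxes so one
        person is not reported twice. boxes is (N, 4) as (x1, y1, x2, y2)."""
        order = np.argsort(scores)[::-1]
        keep = []
        while len(order) > 0:
            i = order[0]
            keep.append(i)
            xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                     (boxes[order[1:], 3] - boxes[order[1:], 1])
            iou = inter / (area_i + area_r - inter + 1e-9)
            order = order[1:][iou < overlap_thresh]
        return keep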
Alright, so why did this work? A couple of reasons we already talked about. So we can finally detect
objects based upon their boundaries. This is really the first time that boundaries really worked. There
was a bunch of people who tried to detect objects based upon edges… you literally detect edges first,
these discreet edges and then you try to detect objects, but the problem is edges are not very
repeatable, like, sometimes you detect them and sometimes you wouldn’t, but this is the first kind of
repeatable method to detect objects based upon their edges. They also had hard negative mining. Now
this actually is kind of hidden in the paper a little bit but is incredibly important, which meant that if you
want to find negative examples of humans, you don’t want to pick just any random patch in any image,
you want to take the examples that are actually close to being incorrectly classified as humans and add them to the negative set. So you run your classifier once over the training data set, you find all the ones that kind of were classified incorrectly as humans and you
add them to your negatives and you kind of rerun it. And this helped a lot just in accuracy numbers.
And finally again, this is a dense sliding window approach, computers are fast enough.
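A sketch of that hard negative mining loop; the classifier interface and the helper callables here are assumptions for illustration, not a real API:

    def hard_negative_mining(classifier, train_set, sample_windows,
                             overlaps_person, rounds=3):
        """Sketch of hard negative mining: after each round of training, rerun
        the detector on the training images and add its confident false
        positives to the negatives.

        classifier      -- object with fit(positives, negatives) and score(crop)
        train_set       -- list of (image, ground_truth_boxes) pairs
        sample_windows  -- yields (box, crop) candidate windows from an image
        overlaps_person -- True if a box overlaps any ground-truth person box
        """
        positives = [crop for image, boxes in train_set
                     for box, crop in sample_windows(image) if overlaps_person(box, boxes)]
        negatives = [crop for image, boxes in train_set
                     for box, crop in sample_windows(image) if not overlaps_person(box, boxes)]
        for _ in range(rounds):
            classifier.fit(positives, negatives)
            hard = [crop for image, boxes in train_set
                    for box, crop in sample_windows(image)
                    if classifier.score(crop) > 0 and not overlaps_person(box, boxes)]
            negatives = negatives + hard     # the close calls that fooled the detector
        return classifier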
Okay, why it failed. So this is… if you look at the average gradient mask of a pedestrian this is what it’ll
look like. Again, pedestrians look like this, right? Well—you know—it works great for pedestrians but
people… we have all different sorts of poses, right? How do we actually go about…? So this was,
essentially, the next big challenge. How do we actually recognize objects which don't have—you know—
very—you know—obvious contours that we can detect? What if they’re more deformable? But before
we get to the solution to this problem I just want to take a little bit of a side detour and let’s talk about
data sets really quick. So…’cause data sets actually drive the field a lot and it’s… data sets actually
dictate the type of research that people are doing. So 2004, 2006 the reigning data sets were Caltech
101, Caltech 256 and they had a lot of images that looked like these, which are fairly challenging and if
you took all the different categories and you average the images together you get these sort of average
images. And what you’ll notice is that a lot of the categories have things that are very consistent and a
lot of categories have things that are kind of random, so they kind of varied in hardness. But what was
unfortunate about this data set, and maybe it's partly due to the fact that it wasn't super large, is how people evaluated on this data set. You were given fifteen or thirty images for training and then you
had to evaluate on the rest. So basically I give you thirty images per category and I say, “You must learn
this category from the thirty images.” That’s a really small amount of images. You think about how
complex an image is and you only have thirty of them. So basically this limited the research, I mean,
like, you couldn’t publish unless you had that result and since you only had a small number of training
examples that really limited the algorithms that you could actually run on this. As we’ll find out later
some of the most—you know—well performing algorithms require infinitely more data than just thirty
examples. Alright?
Then there’s PASCAL VOC. This really led the drive towards object detection, so this data set had much
more realistic images in it and then it had bounding boxes drawn around all the different objects. And a
lot of people would say this was the gold standard for object detection for about five or six years. We had
the Caltech pedestrian dataset; what was interesting about this data set was it just had a large number of people. This is one of the first data sets that actually had a hundred thousand training examples in it.
Again, but these are pedestrians, so, you know.
And then, again, ImageNet, which a lot of you probably heard of which has millions of images which is
really cool and they're all in this kind of range of fine-grained categories, so we had, I think, like, five hundred different types of dogs and a bunch of different types… you know, basically it's like all these
kind of fine-grained details and it had—you know—maybe a thousand examples for every single one of
these. And what’s interesting about this—and we’ll see how this is useful later—is this is the first data
set where we actually had huge amounts of training data, we actually had millions of images and this
will be incredibly important later on. Alright. Oh, and SUN. With SUN basically you had these kind of
perfect segmentations but the data set was much smaller and this is kind of nice for localization. And I’ll
talk about COCO which is a new data set we’ve developed at the end of the talk.
>>: Can you answer a question?
>> Larry Zitnick: Yep?
>>: When you say cat… go back to a couple slides categories, twenty-two thousand categories, are… is
this—you know—using the human example—is this… is it just category human or is it human as
pedestrian, human jumping, human…?
>> Larry Zitnick: It’s not human jumping, it’s teacher, it’s professional, it’s firemen, it’s that sort of thing.
>>: Great, thanks.
>> Larry Zitnick: Yeah, and then, you know.
>>: Yes.
>> Larry Zitnick: Alright, I’m gonna skip that for a time. Alright, so, back… we were talking about why it
failed and we can’t recognize people that look like this, right? Alright, so let’s say you’re in the year
2008, HOG works really well for pedestrians and you want to do more deformable objects, what would
you do if you’re designing an algorithm? You’d say, “Hmm, let’s see, if I want to detect a person and I
can’t just do a whole template, let’s see I could detect, let’s see, I could detect the feet and then I could
detect the hands and I can…,” you know, so basically the DPM, deformable part model, was exactly what
you’d think it would be, which is now instead of detecting one big template, you’re basically gonna have
a bunch of these different HOG descriptors and you’re gonna detect the feet and you’re gonna detect
the legs, and you’re gonna detect the head, and you’re gonna detect the arms, right? And you’re gonna
attach them all using springs. [Laughs] So, again, we’re back. But these people were smart. They didn’t
take every single one of these parts and attach springs everywhere—no, that'd be naïve. They took the parts and attached them to a root node, with one spring coming out from each, creating a star
model. And the beauty of this was you could actually compute it efficiently, so you didn’t have to use
interest points, you could literally run these part detections across the entire image and find the optimal
human in a very quick, non-exponential amount of time using distance transforms. It's pretty cool actually that it works. So using this model, you could actually do dense part-based detection.
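A very simplified sketch of the star-model scoring idea: every part contributes its best nearby response minus a quadratic spring penalty for drifting from its anchor. The real DPM computes this exactly and efficiently with generalized distance transforms; this brute-force version, with made-up spring and radius values, is just to show the structure:

    import numpy as np

    def star_model_score(root_response, part_responses, anchors, spring=0.05, radius=8):
        """Simplified star-model scoring: at every root location, each part adds
        its best response in a window around its anchor offset, minus a quadratic
        'spring' penalty for moving away from that anchor. (Brute force here; the
        real DPM uses generalized distance transforms and no wraparound.)

        root_response  -- (H, W) response of the whole-object HOG filter
        part_responses -- list of (H, W) responses of the part filters
        anchors        -- list of (dy, dx) ideal part offsets from the root
        """
        H, W = root_response.shape
        total = root_response.copy()
        for resp, (ay, ax) in zip(part_responses, anchors):
            best = np.full((H, W), -np.inf)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    shifted = np.roll(resp, (-(ay + dy), -(ax + dx)), axis=(0, 1))
                    cost = spring * (dy * dy + dx * dx)   # spring stretches -> pay a penalty
                    best = np.maximum(best, shifted - cost)
            total += best
        return total    # high values mark likely object roots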
And the other thing that they did was they said, “Well, okay let’s say… let’s take bikes for example.”
Bikes have all these different parts that can move around, but a bike from the side looks very
different from a bike front on. So instead of trying to create a deformable part model which could warp,
let’s say a side-facing bike to a front-facing bike they said, “Ah, don’t worry about that. Let’s just create
a different model for each one of those." So you create a bike detector which is a front-facing bike and
you create a bike detector which is a side-facing bike. So anytime you have a lot of deformations you
just have multiple components and multiple models.
Alright, so why did this work? Multiple components, deformable parts? I have a question mark here
‘cause there’s a lot of debate in the community. You can imagine instead of having deformable parts
you could just have more components. And the balance you have between those—you know—you can
kind of get really good results just using more components and having fewer deformations.
So there’s kind of a trade-off between the two. Again, hard negative mining’s always important. And
good balance; and what I mean by good balance is this worked really well on the PASCAL data set. And
the PASCAL data set has, let’s say, I think—you know—four hundred to maybe a thousand objects per
category. Again, not a huge number of objects, but this algorithm was trained, like… they just tweaked
it to work really well on PASCAL, because if you took the classifier and gave it too much freedom it would overfit; you didn't have enough training data. So basically you had to take your classifier and make it
just strong enough to take advantage of the data that’s there in PASCAL, but not too strong, right? So,
as we'll see later, this kind of got overfit to the PASCAL data, but people had a really hard time beating it.
Alright, so why did it fail? So, why did it fail? So to understand this, let me just do a simple little
demonstration. So, look at this patch here. What is that?
>>: Person riding a horse.
>> Larry Zitnick: Yeah, okay. [Laughs] It’s a head, it’s a head of a person, right? So let’s say I wanna…
you could say, “Which part of a person is it?” Right? So it’s a head. What part of a person is that? Leg,
right? What about that?
>>: Shadow? Blood.
>> Larry Zitnick: Yeah, it’s nothing, right? And we can even do this in really low res, so, what’s that?
>>: Lower leg.
>> Larry Zitnick: Yeah. Well, feet, leg—you know—head, ‘cept… So it’s amazing. We can look at… as
humans we can look at a very small patch of an image and say whether there’s a head, feet, arms, legs.
Even in low resolution images we can do this, right? So what we did in an experiment, is we actually
used Mechanical Turk and we asked turkers to basically label every single little image patch containing one of these parts of a human. And for these two images these are the responses that we
got. You can see… here’s the legend up there. So the humans very reliably were able to find—you
know—the head, the—you know—hands, the feet, the legs, etc., and they're very nice. Now if we look
at the automatic algorithms, the deformable part model which I was just discussing, and you look at the
responses for that, you get this. Not nearly as good. Now, the deformable part model generally did
fairly well on the head ‘cause the head has a kind of this unique structure, but as you can imagine, legs…
legs, there’s nothing unique about legs. They’re just, like, kind of vertical lines, right? Same thing with
arms—they can move around a lot. So arms and legs—you know—that sort of thing, was really hard for
the deformable part model to detect and it was really relying on the head detection a lot. If you look at
more examples, this one's a little bit more challenging, you can see the humans mess up… if you just show a small patch here around the horse's foot, we think that's a human foot, right? Not too
surprisingly. But the machine just fails miserably again. And here’s another example you can see—you
know—it doesn’t even get, like, a lot of the head detections here, the DPM… and the humans just do a
lot better job looking at these small patches. And if you take the DPM model and just substitute in the human responses for whether they think feet or heads or hands are here, you would get this huge boost in performance over the machine-detected—you know—hands, feet, and legs. So it really kind of shows,
you know… you can see how these kind of low level, this kind of just being able to detect these small
little parts, humans can do them just so much better. Obviously HOG followed by an SVM
classifier is not doing it—it’s not detecting these parts reliably. And a little side note, you can… this kind
of explains one of the reasons why Kinect works so well. So we have Kinect, it can… the input is
something that looks like this, right? Now of everything I have talked about today, what’s unique about
this image versus an RGB image? We can actually detect the profile of the human really easily, right? It
just stands out. I mean, you get this profile of the human [snaps] without even trying. It’s gonna be
har… it’s gonna be noise free, essentially. And because of that, that’s why Kinect works really well—is
we don't have to worry about detecting these object boundaries any more—they're basically given to
us. And then, once you have the object boundaries, especially with a human—‘cause you know we’re
flailing our arms around—you know, detecting the hands or the head becomes a lot easier if you have
these really nice boundaries, right? So we have this, basically, this really nice input and because the
input is so good we can train classifiers which use features which are essentially a
lot easier. It goes back to the Haar wavelets that I was talking about earlier—it’s even simpler than Haar
wavelets—and you can still detect the humans. But still this doesn’t solve the RGB problem. Now you
think that the deformable part model… what if you gave it more data, right, to detect these arms, these
hands and arms and do it better? What we find out is it doesn't help. Again, this gets back to what I was
talking about before is they basically tweaked it. They used a linear SVM and a linear SVM only has so
much capacity. You can’t learn a lot—you know—a linear SVM can only recognize so much, can only do
so much. So if you give it more training data, it’s not really gonna learn anything more. But you can’t
give it… you can’t use a more complex classifier because then, again, it starts falling apart because then
you overfit. So basically we had this problem. More training data didn’t help, the classifier had to be
restricted… awww. And at this point, this is a really dark time in computer vision research and object
recognition. All of us are just, like, “Oh this is so painful. All we get is all these algorithms making these
really small incremental improvements. We’re improving things by one percent a year.” And this
happened… this lasted for—you know—three, four years—these kind of incremental improvements.
Nobody was making any headway, right? And the results kind of look bad, I mean, for computer vision
researchers you look at 'em and, like, "Okay, that's okay." If I showed 'em to my mom she'd slap me,
right? So we’re all competing on this kind of—you know—small little bit and, like, basically the mood
even in the PASCAL data set: they decided to stop updating the PASCAL data set because all there was were small improvements on this model and we weren't seeing any improvement really at all, like, the life kind of
went out of it because nobody could think of anything else.
Alright, so let's look at this DPM model. We basically have an image, we compute a HOG descriptor,
we have an SVM classifier, we do some sort of pooling, right? That’s basically the HOG model. Now, we
have low level features, we have a limited-capacity classifier—a linear SVM. Intuitively you think, "Well,
these HOG features—you know—they're not just looking at the raw pixels, but they're just looking—you know—at gradients, pooled gradients, so that's not that abstract, right?" And we have
this classifier which is pretty limited. So we’re taking these kind of low level features and feeding them
into a classifier which isn’t that powerful. You know, it seems like we need something else here,
something that is a better abstraction that could feed into a classifier and actually learn something. We
need something that's more abstract than HOG. The problem is HOG was handcrafted, right, and
what do you do after HOG? How do you combine features in a way? You can’t… you think about it
intuitively it’s like, “I don’t know.” You could come up with all these hacks but there’s no… we don’t
have any good introspection on how to do this. It’s hard to hand-design these things.
Alright, 2009 our data sets had about thirty thousand images in them. 2012, ImageNet, fourteen million,
huge increase. 2009, Caltech 256 had two fifty-six categories in it. 2012 ImageNet had twenty-two
thousand categories in it. 2009, we had algorithms like these, right, the deformable part model. In 2012, for
some reason, the convolutional deep neural network started catching on again. So you have image—
uuuup—up to there—you know—a pretty weak classifier up here, but then you have all these extra
layers to learn something new in here. So what happens if we take all this additional data, we take
the deep learning work that we just had and learn it using GPUs? And it's interesting… there's actually a
whole interesting back story on when people even started using GPUs to do deep learning and it
wasn’t… people didn’t just go with that one day and say, “Oh, you know what? Let’s just revisit deep
learning and… you know, I think that it was just the fact that it wasn’t running fast enough. Let’s just
apply the GPUs and see what happens.” It was actually much more interesting of a story for why that
came about, and a couple of dead-ends that the research community kind of went into before we actually got
to this point, but I don’t have enough time to talk about that today, so I won’t—plus deep learning. So
what did this buy us? So 2012, ImageNet challenge. These are kind of DPM-style standard computer vision models. Geoff Hinton and company submitted their SuperVision algorithm based upon deep learning to a very, very, very skeptical computer vision community and the results were this. This is error, so lower is
better. This blew people away, this really woke up a lot of people. You can see, I mean, the amount of
error drop there was stunning. What's even more stunning is if you look at state-of-the-art results right
now—Google, this year—the bar’s down to here, alright? And you look at… it’s amazing. And if you
look at the deep neural networks, all it is is you have an input image, you have a series of filters, you have some sort of pooling, you apply other filters, blah, blah, blah, blah… then a dense layer—sorry, ninety percent of the parameters are actually in the dense layers 'cause of the convolution that I was talking about
earlier. But if you look at this and you look at this, they’re essentially exactly the same thing. They
literally are essentially… I mean this…they really are the same thing. We have a couple of more layers
here—we have five layers instead of two—but essentially the network, the [indiscernible], everything is
essentially exactly the same. So what happened? Well, we got GPUs, because in 1990 as we would
train these deep neural networks and we’d be like, “Okay, nothing’s happening, they’re not working,”
because—you know—you just sit there and you wait a month or two and it doesn’t do that good of a
job. Well, if you take a GPU, which runs thousands of times faster, and you let it run for an entire week,
suddenly it works, right? And you can’t train neural networks without a lot of data, ‘cause there are
billions of parameters, right? And if you have billions of parameters you need a lot of data to learn
those parameters. So you need lots of data, and back in 1990 you weren’t gonna have lots of data;
images were scarce then. You know, now we suddenly have billions of images and we can get them off
the web. It wasn’t until we had a lot of data and a lot of processing power that the—you know—power
of these models really showed itself. And also rectified linear activations (ReLU) and dropout helped a
lot too, or a little bit too. If you’re curious about the details on this, Ross is giving a talk next month
specifically on the deep learning aspects of this, so I’m not going to go into a lot of detail here, other
than just to give you some intuition for how it works, why it works, and where it doesn’t work. But if
you want to see more details on this, come back in a month.
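For intuition only, here is a rough sketch of that kind of network written in a modern framework (PyTorch, which is not what was used in 2012): stacks of convolutional filters with rectified activations and pooling, then dropout and the dense layers that hold most of the parameters. The layer sizes are only roughly AlexNet-shaped and are illustrative, not the actual SuperVision numbers.

```python
import torch.nn as nn

# Expects a 3x227x227 input; sizes are illustrative, roughly AlexNet-shaped.
model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # filters + rectified activation
    nn.MaxPool2d(3, stride=2),                                  # pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # dense layers hold most of the parameters
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                       # 1000-way ImageNet classifier
)
```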
Alright, so at this point, as a computer vision researcher, you can imagine, we’re very depressed.
Basically, people just told us, “You wasted the last twenty years of your life doing nothing. You should
have just sat back, relaxed, waited for GPUs to get better,” [laughter] “…waited for more data, and you
could have done just as well. That’s great, you published a lot of papers, you have a lot of citations,
congratulations, and you kept your jobs, but—you know—your work is basically pointless.” And at this
point I was like, “No, no, no, no, wait, wait, you’ve just solved the classification problem,” this is—you
know—we talked about classification versus detection… “you haven’t solved the detection problem.
Detection is much harder.” And there was debate for a year or two about whether detection would fall
too, ‘cause they tried it on the PASCAL data set and it didn’t work that well, right? They tried it on
ImageNet and it worked really well, but PASCAL they weren’t able to beat yet. So we were like, “Whew,
at least we still have PASCAL that we’re winning on.” Alright, so then along comes one of the creators
of the DPM model, Ross Girshick, who is now at MSR, he’s gonna give a talk in a month, and he said,
“Well, you know, my old model is being beaten. I want to be at the forefront of deep learning; I want to
see if I can make deep learning work well on PASCAL.” So, one of the first things you have to solve:
again, we talked about sliding windows, and if you do that with a deep neural network it’s actually
really expensive, because when you do image classification you just run it on the whole image at once,
which is not that expensive. But if you do a sliding window you’ve gotta run it on all these different
boxes, which is tough, and then you also have to do it at all different aspect ratios, ‘cause some objects
are shaped like this and some objects are shaped like this. So it’s really expensive, right? So how do you
solve this problem?
How can you—you know—detect boxes of all different sizes in an image? Well, at the same time, in the
literature there were a whole bunch of algorithms coming out which looked at object proposals.
Essentially the idea here was: instead of doing a sliding-window approach, we’ll come up with some
bottom-up method to give you bounding boxes which we think objects exist inside of. And one of the
more popular methods was: you basically just segment the image in many different ways, you merge
segments together, and you draw a bounding box around those segments, and that’s your object
proposal. What this does is it takes the number of bounding boxes you need to evaluate from millions
down to thousands, right? So then you can apply, again, same story, these more complex descriptors,
these more complex features, to a smaller number of areas and see if we get improvements. Alright,
so: input image, proposals, and for every proposal we basically skew it so it fits within a perfect
rectangle, like the classification task, you run it through your deep neural network, and then you
classify that patch.
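A minimal sketch of that proposal-based pipeline: generate proposals, warp each one to a fixed-size square, run it through the network, and classify the patch. Here propose_boxes, cnn, and classifier are assumed stand-ins (for a bottom-up proposal method, a pretrained network, and a per-class scorer); none of these names come from the actual system.

```python
import torch
import torch.nn.functional as F

def detect(image, propose_boxes, cnn, classifier, input_size=224):
    """image: float tensor of shape (3, H, W) in [0, 1]; boxes are integer (x0, y0, x1, y1)."""
    detections = []
    for (x0, y0, x1, y1) in propose_boxes(image):      # ~thousands of boxes, not millions
        patch = image[:, y0:y1, x0:x1]
        # Warp ("skew") the proposal to a fixed square, like the classification input.
        patch = F.interpolate(patch[None], size=(input_size, input_size),
                              mode='bilinear', align_corners=False)
        features = cnn(patch)            # deep features for this patch
        scores = classifier(features)    # e.g. per-class scores for the patch
        detections.append(((x0, y0, x1, y1), scores))
    return detections
```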
Alright, so these are the results up until Ross’s paper. So you can see we have DPM, DPM, DPM,
plus-plus, plus-plus, and this is—you know—that’s a big improvement. We’re like, “Whoo,” and this is a
paper—you know—going from there to there.
>>: Up now.
>> Larry Zitnick: Good is up, yes. So this is average precision. I won’t describe what average precision
is, but the higher the better. And you can see, like, this is basically that period where, I mean,
everything basically stalled out, everybody’s kind of down, right? You throw in the deep neural
networks and you get that. So if you look at this gap, that’s huge, and we were seeing basically all the
results plateauing. And when we saw this result, all of us were like… well, some of us were like, “Damn,
why did it have to do well on object detection as well?” But at the same time you’re like, “This is great.
We finally have some improvement again. The field’s not dead,” you know. And you can see the
improvement here. And now, this number here, Ross will get into this a month from now, but he’s
gotten results that are in the sixties, like mid-sixties, so it’s even going up higher.
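For reference, average precision is roughly the area under the precision-recall curve for one category, where each detection is first marked as a true or false positive by matching it against the ground truth. A minimal sketch of the area computation, leaving out the box-matching step and the exact PASCAL interpolation:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """scores: detector confidences; is_true_positive: 1/0 per detection (from ground-truth matching)."""
    order = np.argsort(-np.asarray(scores, dtype=float))    # rank detections by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    precision_at_k = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    # Non-interpolated AP: average the precision at the rank of each correct detection.
    return float(np.sum(precision_at_k * tp) / num_ground_truth)
```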
Alright, so just to give you some intuition as to where this works and where it doesn’t work, let’s just
look at some results, alright? So, here’s an image: it detects the train beautifully. Good. Person, boat—I
think a very impressive result. Bottles here, you can see it kind of misses some of these bottles. That’s
kind of a tough problem. Here, there’s chairs, there’s all sorts of things it should recognize—it doesn’t
recognize any of them. Cat, good, it can recognize cats. [laughter] Train, you see it gets the train, but it
kind of says train over here too—maybe that’s forgivable. Bird; this is an interesting case because it
recognizes birds all over the place here, but if you look—you know—yeah, that’s a bird, but it doesn’t
really get the head; none of the bounding boxes really get the entire bird. Again, whew, it detects cat
faces, cat faces are very cool, and sometimes, no, we can’t detect cat faces, so here it just fails.
Sometimes we hallucinate cats [laughter] in different places. You know, here, again, this is a challenging
scene. Now again, this is where you separate fact from fiction… when I look at this image I’m like,
“Wow, that’s pretty impressive,” as a computer vision person, right? But again, if I show this to my mom
she’s like, “Eh, you know, you could probably do better.” [laughter] You know, like, “That’s not a horse.
Come on, this is garbage,” you know, but everybody has different expectations. Here again, a bowl, a
hamburger… I was just blown away by this—you know—I mean, just, no way. This algorithm just… you
would never get that before. But then you look over here and you’re like, “What is going on? You got
this dumbbell but not that one?” [laughter] You know, so yeah, exactly. I mean, and it’s not only that: it
got this dumbbell with really high confidence, I mean it was really confident about that dumbbell, and
this one, that’s definitely not a dumbbell. [laughter] So—you know—exactly. I mean, look at the person:
he’s, like, deformed and in a different position, and it still got him—it’s amazing. Again, but then my
mom would be like, “Oh, it missed the dumbbell, Larry, it’s not that good.” [laughter]
Alright, so that kind of gives you a sense for the quality of the results. For classification it can do much
better; like I was talking about with ImageNet classification, it can do very well on classification.
Detection is a much more challenging problem ‘cause you have to localize the object as well.
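Concretely, localization is usually scored by how much the predicted box overlaps the ground-truth box (intersection over union); PASCAL, for example, counts a detection as correct only if that overlap is at least 0.5. A minimal version:

```python
def intersection_over_union(box_a, box_b):
    """Boxes are (x0, y0, x1, y1). Returns overlap in [0, 1]."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A box that only covers a bird's body, not its head, may fall below the 0.5 threshold
# even though the detector clearly "found" the bird.
```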
So, failures. This is actually an interesting plot. This is looking at vehicles, furniture, and animals and
the different reasons for the errors. So we have: confused with the background, which is yellow;
confused with other objects, which is orange; confused with similar objects, like a cat versus a dog, in
purple; and got the location wrong, in blue. So you see a lot of the errors are due to location errors. For
furniture it gets confused with a lot of other objects, not too surprising; chairs kind of look like a lot of
other things. And also animals—you know—it’s really easy to confuse one animal for another, because
a lot of animals look similar. What’s nice here is it’s not confusing things with the background as much.
If you looked at DPM, this yellow region would be a lot larger, so we’re not just hallucinating—you
know—horses in the background as much anymore. And these are different object categories and their
accuracies. If you scan this, what’s interesting is that it kind of does well on the large objects, while
small objects like a power drill do really poorly. Horizontal bar, I mean—you know—yeah, exactly.
What‘s another low one… backpack: generally you don’t have an image which is entirely a backpack,
the backpack is just small in it, whereas an image of an iPod in that data set would basically be a
blow-up of the iPod, so iPod actually does well. You know, hammer: usually it’s not a big picture of a
hammer, it’s a picture of a person holding a hammer or something like that. A hammer’s small, so you
can see it doesn’t do as well on these categories. The
other thing is, all these methods, they look for an object, right, and they kind of look at a bounding box
and they say, “Is that object within this bounding box?” Right? And for a lot of objects, if you just look
at the bounding box that surrounds that object, it’s really hard to tell what that object actually is. You
actually need to look at information outside of the bounding box. So if you look at these images here,
what’s interesting about them is that all of them have these exact same pixels, just in different positions
and different orientations, and how we interpret those pixels changes dramatically based upon the
context around them. So these pixels can be a plate of food, they could be a pedestrian, they could be a
car… it totally depends on the context of the object relative to the other objects. And you can imagine,
for a lot of the things I showed earlier, these smaller objects, you don’t recognize the smaller objects
from the pixels themselves. A lot of times you recognize them from the context. If I go like this,
whatever I’m holding here you’re gonna think is a cell phone, regardless of the pixels that the cell
phone actually contains. And these models are not capturing that yet. We can look at an image like this
as humans and we know something’s wrong. These models don’t really know something’s wrong. They
wouldn’t realize, “Oh, wait… are there three faces, are there two faces? I don’t know.” It doesn’t get
that contradiction—you know—‘cause it doesn’t really fully understand the extent of what the objects
are. Remember when I was looking at the birds? You know, what these models are really good at is
finding kind of descriptive little patches in the image which are very informative of said object—like the
face of a cat or the—you know—the head of a person—and that’s where you get a large firing. So you
can look here. What it does is it kind of finds a part that it thinks is really informative—this is kind of a
heat map of where it fires—and it basically finds the informative bits of those objects. So you see here,
like for the boat example—you know—this is kind of the part that it finds that’s discriminative for a
boat, but this whole other part’s a boat too. Whereas here—you know—it finds all these different
airplanes, but—you know—is that one airplane or multiple airplanes, you know? And this concept
that—you know—my feet are part of me is kind of lost on it, ‘cause that is not discriminative for the
algorithm, right? So that’s one of the challenges, especially for the detection task.
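One simple way to let a classifier peek at the surroundings, offered purely as an illustration of the idea rather than anything described in the talk, is to expand each proposal box before cropping so the patch includes some context:

```python
def crop_with_context(image, box, context=0.5):
    """image: (H, W, C) array; box: integer (x0, y0, x1, y1). Expands the box by `context` per side."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    pad_x = int((x1 - x0) * context)
    pad_y = int((y1 - y0) * context)
    x0, y0 = max(0, x0 - pad_x), max(0, y0 - pad_y)
    x1, y1 = min(w, x1 + pad_x), min(h, y1 + pad_y)
    # The classifier now sees the object plus some of the scene around it.
    return image[y0:y1, x0:x1]
```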
Another thing that’s interesting: there’s this paper that recently came out where you take this image,
which is correctly classified, or this image, which is correctly classified, you manipulate it a little bit to
get this image, and suddenly it’s incorrectly classified, alright? To humans, these images look exactly the
same, right? But to the neural network they look completely different. I think these are ostriches
now. [laughter] So… I think that’s right. Even this one, you know. So there’s something funny going on. I
mean, these networks are still very sensitive: you get this small signal which kind of propagates, and as
you go up the neural network it gets amplified and can do—you know—unsightly things like this. So
that can be a problem.
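The flavor of these manipulations can be sketched with a simple gradient-based perturbation (the fast-gradient-sign style of attack, which is a later and simpler method than the one in the paper he is referring to): nudge every pixel a tiny amount in the direction that most increases the network's loss.

```python
import torch

def adversarial_example(model, image, label, epsilon=0.007):
    """image: (1, 3, H, W) tensor in [0, 1]; label: (1,) tensor with the correct class index."""
    image = image.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # Tiny, visually imperceptible step in the direction that most increases the loss.
    perturbed = (image + epsilon * image.grad.sign()).clamp(0, 1)
    return perturbed.detach()
```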
The other thing is, these deep neural networks are being applied in a lot more domains, and one of the
ways they’re being applied is: before you actually run your deep neural network, you basically stabilize
the image with respect to something. So say you want to predict attributes, like whether somebody’s
wearing pants or shorts; first, you don’t just give the network the whole image, you basically find the
legs and then you just feed the leg portion in. Or if you’re trying to do face recognition, what you do is
you first run a face detector, you take the face, you align it with a three-D model, you warp it so it’s a
frontal-looking face, and then you feed this into the deep neural network classifier to say whether
that’s Sylvester Stallone or not, right? So what we haven’t figured out yet is: we can learn these very
complex models, but we don’t really understand the relationship between a person that looks like this
and then moves over like this, or how things change in three-D, right? Those haven’t been worked in.
So we still need these kind of… you could think of them as hacky layers on top to really get good
performance when you start thinking about three-D. And also this can impede the generalization ability
of these algorithms. If I take a picture of my dog and I want to recognize my dog from another angle,
it’s not really learning the relationship between a dog at this angle and a dog at that angle. It just knows
that dogs from this angle look like dogs and dogs from that angle look like dogs, but it doesn’t know
how to relate the features from one to the other.
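As a toy stand-in for that stabilize-first step, here is a two-dimensional similarity warp from two assumed eye landmarks, rather than the full three-D alignment he describes; the landmark positions and output size are made up for illustration.

```python
import numpy as np
from skimage import transform

def align_face(image, eye_left, eye_right, out_size=160):
    """Warp the image so the eyes land at fixed positions, then crop; landmarks are (x, y)."""
    src = np.array([eye_left, eye_right], dtype=float)
    dst = np.array([[0.3 * out_size, 0.4 * out_size],
                    [0.7 * out_size, 0.4 * out_size]])
    tform = transform.SimilarityTransform()
    tform.estimate(src, dst)
    # The classifier then sees a roughly canonical, frontal-ish face crop.
    return transform.warp(image, tform.inverse, output_shape=(out_size, out_size))
```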
>>: Does any of this apply to looking at moving pictures or is it all static imagery?
>> Larry Zitnick: So yeah, there has been some research on moving pictures, but most… almost
everything is static images. The reason for that is it’s so much harder to deal with… I mean, we’ve
already, like, maxed out our GPUs and…
>>: [indiscernible]
>> Larry Zitnick: Exactly, we already have a lot of data. This work is actually done with keypoint
annotation, so we take all the human data and you actually click on all the hands and the feet and the
arms, and use that for training to then be able to localize those parts. So that kind of helps relate it, but
you have to give it that information; it doesn’t learn that automatically.
>>: Wouldn’t your mom just say do it on moving pictures?
>> Larry Zitnick: We keep bringing that up. It’s like, “We should do this on moving pictures.” That’s
how—you know—kids do it, right? But yeah, it’s tough. The other thing is, back to here, you know,
these algorithms again—you know—we find these object proposals, but some objects aren’t gonna be
found by these object proposals just using low-level cues, right? So we’re gonna be missing some
objects because we never even evaluated our detector on them in the first place. So basically we’re
back to our interest point problem that we talked about before. Yes, this is more repeatable than the
interest points that we looked at—you know—back in 2002 or 2000, but it’s still a problem. So I think
the field going forward will begin to look at how we take these complex descriptors and start
applying them densely across the entire image again. And this has already been worked on.
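The apply-it-densely idea can be sketched as a tiny fully convolutional scorer: with no dense layers at the end, the same network produces a score at every spatial position of the whole image instead of one score per cropped proposal. This is an illustrative toy, not any particular published system.

```python
import torch.nn as nn

# A tiny fully convolutional scorer: because there are no dense layers,
# it produces a score at every spatial position of the input image.
dense_scorer = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),   # 1x1 conv acts as a per-location classifier
)

# score_map = dense_scorer(image[None])   # shape (1, 1, H/2, W/2): one score per location
```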
Alright, so in the last couple minutes I just want to conclude. So, things that we’ve seen that deep neural
networks have problems with, right? Detection, understanding the full extent of the objects, and being
able to relate one object to another. And also—you know—PASCAL: the reason why it took people so
long to do well on PASCAL is because PASCAL didn’t have enough training data to train these deep
neural networks. The only way they got it to work was they trained it first on ImageNet to get the
neural network in a good spot and then they just kind of tweaked it a little bit on PASCAL. So you
basically have to use ImageNet to bootstrap it. So in comes Microsoft COCO. Since PASCAL was
discontinued several years ago, several people within MSR and within the academic community decided
to get together to create a new object detection data set. And with this data set, we want to find
images where we have objects in context, so think back to that cell phone example I was mentioning
earlier. We want segmentations, not bounding boxes; we really want to understand the true extent of
the objects and not just the bounding boxes they have. And we also want to have non-iconic instances
of the objects. Let me explain really quickly what that means. These are iconic images. This is what a lot
of images in ImageNet, let’s say… not all, I mean, but a bunch of them look like this, where you basically
have these big objects in them, right? You can also have iconic scene images, where you have images
that look like these. So—you know—generally what you find is bathroom scenes which are staged
for—you know—selling your home and that sort of thing. They don’t have humans, they don’t look
realistic, right? The type of images we wanted to gather were images like these, where you’re using,
let’s say, a toothbrush in a real situation. You know, you have cluttered scenes and that sort of thing.
You want to be able to recognize the objects in these sorts of scenes, where you have contextual
information and you have realistic actions going on. So, relative to other data sets, we don’t label as
many categories, but we have a lot more instances per category. We have ten thousand instances at
minimum per category. For humans, we’re gonna have four hundred thousand humans in our data set,
each one segmented, so this is gonna be a lot of data for doing deep learning. And then we also have
more objects per image, so blue is us and these are the other data sets; PASCAL and ImageNet only
have a small number of objects per image, you see one or two for most, whereas we have a lot more
objects per image, which we think is important for that contextual reasoning. And segmentation, as I
was mentioning earlier: all this is done on Mechanical Turk with the generous funding of Microsoft.
Microsoft’s really nice in giving us some money to be able to do this, and we’ve already run seventy-seven
thousand hours of working time on Mechanical Turk; it’s the equivalent of somebody working eight years
nonstop on this problem, segmenting out objects, and at about twenty thousand a day you max out
Mechanical Turk. And these are just some of the things. And also, another very interesting thing for
future researchers: every single one of these images has sentences with it, so one of the very interesting
future areas of research, which I haven’t gone into today, is actually taking the image and then
describing it using natural language. And this is a really hot area right now. There’s a lot of people who
are moving into this space, but due to time, I decided not to talk about that today. Again, go to Ross’s
talk next month; it’s gonna have a lot of great stuff about deep learning. You’ll get more details on that,
and I’ll stop there. [applause]
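For anyone who wants to poke at the data he just described, the dataset is distributed with a small Python API (pycocotools); a minimal sketch, with the annotation file name assumed. The sentences he mentions live in a separate captions annotation file.

```python
from pycocotools.coco import COCO

coco = COCO('annotations/instances_train2014.json')   # file name is an assumption

# All images containing a person, with their segmentation masks.
person_id = coco.getCatIds(catNms=['person'])
image_ids = coco.getImgIds(catIds=person_id)
annotations = coco.loadAnns(coco.getAnnIds(imgIds=image_ids[0], catIds=person_id))
mask = coco.annToMask(annotations[0])                  # binary mask of the visible pixels
file_name = coco.loadImgs(image_ids[0])[0]['file_name']
```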
>>: If you all have any questions for Larry?
>> Larry Zitnick: Yeah.
>>: Can you go back one image? The segmentation you have, like, the couch in that third one: do you
label that whole kind of background region as couch, or is it two different pieces of the couch? Two
different segments?
>> Larry Zitnick: This is… okay so, the question is: for this couch segmentation, is that… how’s that
segmentation created, essentially? This segmentation is actually two different pieces, so we basically
had the turkers only label the visible pixels of the objects. So—you know—if I am standing in front of
something I don’t, like, call the couch that’s behind me part of the couch.
>>: But you do want kind of that context to be… understand, like, this is a whole couch.
>> Larry Zitnick: Well, these are human results, I mean, these segmentations come from humans;
they’re the—you know—human annotations.
>>: But does the database say there are two couches?
>> Larry Zitnick: No, no, the database would say that this is a single couch.
>>: Ah.
>> Larry Zitnick: Yes.
>>: That’s what you want.
>> Larry Zitnick: Yeah, says it’s a single couch. So we want to be able to reason just like humans would.
Yeah?
>>: So you seem to be aspiring to human-level perception quality, but the human models are
metric models that trained on three-D data,
>> Larry Zitnick: Yes.
>>: … and then they’re applying it to two-D recognition, so are we pursuing a harder problem than the
one humans actually solve?
>> Larry Zitnick: Computer vision researchers really do like to tie our hands behind our backs. We don’t
like to cheat and use depth data or use video data or anything else. We know humans can do it from
two-D images, therefore we should be able to do it from two-D images. So that’s the cynical answer.
The other answer is yes, we would love to use depth data. I mean, Kinect proved that point really well:
once you have depth data, everything is infinitely easier. You can get the boundaries much more easily.
Video data would be great. Why aren’t we doing that? Essentially, because that data’s a lot harder to
collect. You know, there are depth data sets out there right now, but relative to, like, these data sets,
we’re talking thousands of images instead of millions of images, right? So if we could find a way
for—you know—the entire world to take one picture with depth—you know—we could probably get a
much bigger data set, and—you know—that would be very useful, but we just don’t have the data to
play with—you know—and same thing with the video data. We can go to YouTube, but—you know—
how do you label it then, too? And that’s another interesting area for research: unsupervised
learning—you know—can we learn about the visual world without actually having explicit labels?
Because labeling things like this is a huge effort. It’s taken us over a year to create this data set. It’s
been a lot of work, so—you know—it’s a good area for research as well. But yeah, it’s just—you
know—getting the data.
>>: On that issue with the video, you could do a variant, which is: rather than collecting static images
off the internet, you collect them from video clips and you just have the turker label the middle frame.
It’s the same amount of labeling effort, but then the vision algorithm has access to things like motion
segmentation…
>> Larry Zitnick: That’s one of the things people have been asking us about, two things actually. One is
literally what you just said: you take a video, extract a single frame from it, and just label that. Or
people have, like, photo streams, so—you know—a bunch of images that are very similar, and you just
label one of those. So people have been asking us about that; we didn’t do it. You know, we got all of
our images off of Flickr—you know—it’s harder—I mean, the YouTube problem is just another level of
complexity. I agree, I think it would be a great idea; it’s just getting somebody to do it—you know—and
it’s hard to convince a graduate student to bite on something like this.
>>: So are you… oh, sorry.
>> Larry Zitnick: Go ahead.
>>: Are you, like, also looking at the problem of, like, for this object: it’s a dog is here in the image, but
also it’s facing this way, for example…
>> Larry Zitnick: Yes, so there’s a lot of… another thing that’s popular is attributes, so it’s like which way
is the dog facing? So one thing that we’re doing here is we’re putting keypoints in, so this is like labeling
where people’s hands, feet, head are. So you can label the parts of the object as well. Now for dogs,
we’re… we probably won’t add the keypoints for dogs, but you can imagine that. Another thing we
want to do is for all the people, we want to say man or woman, how old are they? You know, are they
wearing jeans? Or—you know—all these different other things that you could possibly label, which
would be very interesting, I think, for practical applications and very doable. Yeah?
>>: So do you have a sense of are we back on an up… generally upward trajectory or are we…?
>> Larry Zitnick: Right now, we haven’t maxed out. I mean, we keep thinking that the numbers on
PASCAL, the—you know—the improvement in detection, have kind of maxed out and that we’d have to
do something else, but by tweaking these deep neural networks, and—like I said—Ross will get into it
more next month, we keep seeing fairly significant improvement. So I don’t think we’ve seen the end of
it. It’ll probably be another year—you know. I still think a lot of these segmentation problems, or a lot
of the problems that I mentioned, are still going to be there; they still need a big leap to go beyond
that. But yeah, I think the rate of improvement’s pretty fast right now. Yeah?
>>: So there was this whole body of work on cascaded classifiers and how one… solving one particular
vision task can actually help you solve another vision task.
>> Larry Zitnick: Yeah.
>>: Has that been distilled out with deep learning as yet?
>> Larry Zitnick: Well, I mean, that’s one of the powers of deep learning: basically all the features are
shared for all the object categories up until, like, the last layer or two. Because the last layer is basically
a thousand different classes for the classification task, and they basically share all the other features.
So all the features that it learns in the middle are shared by all the object categories. And that basically
gives all the object categories additional information; there are more images per category.
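What that sharing looks like, as a rough sketch with assumed names: one trunk of features shared by every category, with only the final linear layer being category-specific, which is also why you can bootstrap from ImageNet and then just swap or tweak that last layer for a new task.

```python
import torch.nn as nn

class SharedFeatureClassifier(nn.Module):
    def __init__(self, trunk, feature_dim=4096, num_classes=1000):
        super().__init__()
        self.trunk = trunk                                 # layers shared by every category
        self.head = nn.Linear(feature_dim, num_classes)    # only this layer is per-category

    def forward(self, x):
        return self.head(self.trunk(x))

# Fine-tuning for a new task mostly means replacing the head, e.g.:
# model.head = nn.Linear(4096, 20)   # the 20 PASCAL categories
```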
>>: Well, so I was actually thinking when you have… when you’re down in the lowest level, solving—
let’s say—a problem for classifying cars, but you also have a separate neural net… or separate deep
learning network you’ve learned for detecting cars or maybe segmenting cars…
>> Larry Zitnick: Yeah.
>>: The output of that can actually feed back as additional features at higher levels…
>> Larry Zitnick: Yes, yes, yes, yes. Yeah, so that… and that’s something, again, this is part of an
academic bias, people like to start from the pixels—you know—‘cause then the paper doesn’t seem as
hacky, but yeah, I mean, there’s a lot of—I think, from a practical standpoint—a lot of other things you
can start feeding into it—you know—as additional information that it could use. Now, the question is
whether some of that’s redundant or not—you know—it’d really have to be complementary in order to
see a win.
>>: This gentleman here brought up the point of the goal being it… for the technology to see the image
as well as a human can see it, but you had one image back here that it was imperceptible what the
changes were, the pheasant and the school bus and…
>> Larry Zitnick: Yes.
>>: … why is the algorithm failing on the second one when it looks the same to us?
>> Larry Zitnick: That is a great question. I mean, that’s one of the things… when I’m… and—you
know—we’re… we keep saying—you know—and these deep neural networks are doing amazing things,
right?
>>: [indiscernible] back the photo? ‘Cause the middle image, actually, it’s an interesting… there is a…
>> Larry Zitnick: This one?
>>: Yes, that one, yeah.
>>: … they cheated in a very specific way, and the middle image says exactly how they cheated to make
this happen.
>>: Okay.
>> Larry Zitnick: Yeah, this is a deformation that you can see. It’s a kind of…
>>: You have the middle image [indiscernible]
>> Larry Zitnick: Yeah, yeah. Basically, this is how you kind of change the pixels…
>>: The way I would explain it is: if you, let’s say, build an intelligent agent by handing it a rule book for
answering twenty questions, right? And then you give someone else the rule book, they can look at it
long enough and figure out how to make it answer the questions wrong, right? Because you have a rule
set… and so with the neural network, you can analyze it and say, “Over here, it really wants to see this
thing.” And so it’s… in some ways it’s…
>>: It’s a cheat.
>>: It’s a cheat. It’s as if you looked at the software code, and you were able to figure out where the
bugs were, so you could give it the test case that breaks it, which is what hackers do, right?
>> Larry Zitnick: And the important thing here is—you know—people want to know when it will fail,
right? We have this great mechanism, right? But there hasn’t been as much science into when they fail,
or why they really do work—you know—so it’s examples like these that make everybody kind of cringe
a little bit, because you can see that—you know—sometimes they do just fail for seemingly random
reasons, where…
>>: It’s the man behind the curtain.
>> Larry Zitnick: Yeah.
>>: So based on what you said, it’s not possible to do this without knowing what the neural network is
doing?
>>: Yes, you’d have to be able to pull apart the network and put in probes. It’s as if—you know—you
put electrodes into the brain to figure out how to fool someone’s brain.
>>: Is that what they did here?
>>: [indiscernible] this paper…
>> Larry Zitnick: Uh… I mean, they definitely looked at the outputs, and you can basically tweak the
inputs, but—you know—I bet you if you just randomly—you know—moved pixels around, that you
could get… you could find similar things.
>>: Find something similar [indiscernible].
>> Larry Zitnick: Yeah, and it would—still to a human—it would look… and I could take that bus image,
right? And I could warp it in random ways that, to us, would still look like a bus…
>>: This is true of any machine learning algorithm, right? You… if you get to vary the input enough, you
will find, accidentally, something that looks close, but that forces a mistake [indiscernible]. So in some
ways, it’s not surprising, but it is… it’s sort of…
>> Larry Zitnick: Well, I mean, it is true… this is really an example to show that—you know—we’re not…
this isn’t equal to humans—you know—or the other thing is: what this is doing is not exactly the same as
what humans are doing; there is a difference there, you know? And I think this really clearly shows that.
Do you have a question?
>>: I have a quick question concerning these local descriptors, so—you know—affine invariance helps
us in some… it’s not only used for recognition, but for three-D too, for wide-baseline tracking…
>> Larry Zitnick: Yes, yes, yes, yes.
>>: …where you have… so that problem where you have—you know—a moving-camera video—you
know—segments, basically so, but for instance, for what’s called survey photos—for instance, in movie
business—with widely—you know—changing camera…
>> Larry Zitnick: Yeah.
>>: …then you need to have two-D correspondences. When I tried to use it, unfortunately, these
algorithms—you know—choose whatever they want to track, not what I want to track…
>> Larry Zitnick: Yes, yes.
>>: …in segments, they choose [indiscernible] corners of the building and so on. Here, choose whatever
it was. So when I looked at the… ‘cause I want to ask you: do you know the paper where you, for
instance, track whatever the low weight or affine—you know—invariant ellipses will take to pick one
image to another, and then knowing where the point you’re interested in is [indiscernible] the
coordinates, say, of this ellipse. Find this point as a [indiscernible] from the [indiscernible]
correspondence. For instance, I’m interested in—you know—specific features—you know—which are
[indiscernible] in a sense of—you know—coordinate things…
>> Larry Zitnick: Mmhmm, mmhmm, mmhmm, Yeah, so I mean, there’s a…
>>: …But I couldn’t track them, but they could track something which was around, so it was something
public. I didn’t have time to finish this, but [indiscernible] a practical thing.
>> Larry Zitnick: I mean, this actually brings up a couple of good points. One thing here is—you know—I
was a bit disparaging of SIFT and those things earlier, about what they were, but there are a lot of great
applications that they work well on. One of them is three-D reconstruction, and this tracking… since
we’re talking about object recognition, I didn’t go into that, but yeah, there are a lot of other
applications that a lot of these technologies—you know—gave birth to. So for this tracking problem,
there’s a lot of literature; it’d probably be better to… I mean, Rick here, and there are several people
here who are, you know, experts… yeah, exactly. As far as any exact paper, I’d probably have to know a
little bit more about the exact application, ‘cause there are just so many papers in this space. There are
some good survey papers; I think, like, Rick wrote a good survey paper.
>>: We should take this offline, because it’s a different part of computer vision than recognition.
>>: Okay.
>> Larry Zitnick: Yeah. Any other questions? Alright, cool.
>>: Thank you. [applause]