>> Larry Zitnick: Today it's my pleasure to welcome Piotr Dollar here to join us to
give a talk on pedestrian detection. He recently graduated from UCSD and is
currently doing a post-doc at Caltech, and he'll be describing his work that he's
presenting at BMVC today.
>> Piotr Dollar: And a general review.
>> Larry Zitnick: Okay. Great.
>> Piotr Dollar: Cool. Thank you for the introduction. And it's great to be here.
So today I'm going to be talking about pedestrian detection, the state of the art.
So I'm going to give sort of an overview of what the state of the art is and then
talk about our own work and our own detection system and sort of the advances
we made there recently.
So first of all, let me say why am I studying pedestrian detection, why am I
interested in this problem? I mean, as many of you know here, it just has a large
number of really compelling applications, robotics, image indexing, HCI,
technology for visually impaired, surveillance, automotive safety. And actually
the last one is the one that sort of motivated us to work on this in the first place.
But the other reason for really studying this problem is you really get to look at
some fundamental problems in computer vision and machine learning. You get
to really think about object detection in general, feature representations, learning
paradigms, sort of specific things that actually occur when you're doing object
detection like scale, occlusion, pose, role of context. So it's just a great problem
in general.
So this is a problem I became interested in when I got to Caltech about three
years ago. At the time we were funded by an automotive company, and since
then that's no longer the case, but it's a problem I've been sort of continuously
interested in. And my work has been in sort of three areas. One is sort of the
benchmarking and figuring out actually what works, what are the common
elements, sort of evaluating the state of the art. There have been a lot of claims
people make -- we have this amazing detector -- and really seeing how those
claims hold up and under what conditions.
And so we learned a lot through that. And so the first part of the talk I hope to be
able to communicate some of that knowledge to you.
In the second part of the talk, I'm going to be talking about our own detector, and
so one of the first things we did after sort of doing this evaluation is we really had
a good sense of what the key components of a detection system were. And so
we were able to put together basically a very simple system that sort of really
boiled down to what's required to have a good detector. And it actually
performs really well. At the time that we published it, it was
something like a 10 fold decrease in false positives over any competing method.
And since then others have caught up. It's still one of the best systems.
And actually I'm going to give a fairly brief overview of the system and then really
dive into sort of a recent insight we've had that's allowed us to speed up the
detection by a factor of 10 over our previous method and 10 to 100 over
competing methods. And it's really something that's pretty general and should be
applicable to anyone sort of doing multi-scale detection. So that will be sort of
the main technical component of the talk.
I've also been very interested in sort of the learning paradigms you use for
detection so part based methods and setting that up as a learning problem where
you don't have to have full supervision. But I won't actually talk about that too
much today for lack of time. But if anyone's interested, that's something I would
be very happy to talk about afterwards.
So the first part I'm going to do this overview. What is the state of the art in
pedestrian detection? So what we did, one of the first things we did at Caltech,
and this is joint work with Christian Wojek and Bernt Schiele, was we went out
and we gathered this huge dataset of pedestrians.
You know, [inaudible] setup where we had a camera hooked up to a vehicle, a 640
by 480 VGA camera, and we had somebody drive around that wasn't a computer
vision researcher, so hopefully they weren't biased in any way, had them drive
around and record video from the car. And then we went ahead and we labeled a
very large quantity of video. So we have something like a third of a million
bounding boxes labeled. So that's really a big number. I mean, think about that
number, 300,000 people with boxes around them.
Now, of course many times it's the same person; there are only about 2,300
unique individuals that are labeled. But they're still labeled for a long time. And
of course we didn't do this ourselves, we hired some people to do it. Most of
them would flake out after maybe doing this task for four or five hours. But then
we had the wife of a post-doc in our lab who really stuck with this, and I think she
did something like 350 hours of labeling. And she'd be -- she said she couldn't
sit in a car and look out the window because she would be drawing boxes around
people in her head. So, yeah, not a fun task. I wouldn't recommend it.
But, yeah, it did give us this really incredible dataset. And then I should just say
and so the datasets available online. I sort of had this link up here the whole
time, and this is sort of the dataset. And all this stuff I'm going to be talking about
is available online, evaluation code and so on.
Including the -- so we had this annotation tool. So still, even if we hire somebody
for 400 hours, if we were just to label individual frames one at a time, it
wouldn't really be a feasible way of labeling. So we have this tool where we
could label the person in one frame and then label them 50 frames later, and it tries to do
some interpolation. It doesn't use any kind of -- any of the image data to do
interpolation. We found -- we tried to use a tracker and other techniques for
doing the interpolation. We found that from a user's perspective it's actually
better to do something that's very predictable. So it would end up being just a
very simple cubic interpolation. And then the user goes in and fixes those where
it's still making a mistake or reinterpolates and you do this a couple times until
you have good tracks.
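(As a rough illustration of that keyframe-plus-interpolation workflow, here is a minimal sketch; the scipy cubic-spline choice, the function names, and the sample boxes are illustrative assumptions, not the actual annotation tool.)

```python
# Illustrative sketch (not the actual labeling tool): fill in bounding boxes
# between sparsely labeled keyframes with a purely geometric spline, i.e. no
# image data or tracking, so the behavior stays predictable for the annotator.
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_boxes(keyframes, boxes, query_frames):
    """keyframes: sorted frame indices; boxes: matching list of [x, y, w, h];
    query_frames: frames to fill in. Returns an (M, 4) array of boxes."""
    boxes = np.asarray(boxes, dtype=float)
    splines = [CubicSpline(keyframes, boxes[:, c]) for c in range(4)]
    return np.stack([s(query_frames) for s in splines], axis=1)

# Label frames 0, 50 and 100, fill in everything between; the annotator then
# corrects frames where the interpolation is still wrong and re-runs.
filled = interpolate_boxes([0, 50, 100],
                           [[100, 80, 30, 60], [140, 82, 34, 68], [185, 85, 40, 80]],
                           np.arange(0, 101))
```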
So this sort of makes it feasible to label large quantities. With video, of course, the
benefit is that you're labeling a huge number of frames, but the drawback is
that frame to frame a person may not change that much in visual appearance. So
that's sort of the tradeoff.
So I want to say a little bit about the sort of the statistics and what we learned
about the -- about pedestrian detection in the real world. So we did some
statistics of the height of people. And so here on the -- so on this graph on the
left on the X axis is the pedestrian height in pixels. This is in log scale, so they're
log size bins. This is sort of the distribution of the size of people.
So again, this was a 640 by 480 camera, so this is going to be very application
dependent. But nevertheless we found that -- and that was sort of the -- that was
the setting that the automobile company we were working with was pushing us
to use. They really didn't want us to use higher resolution cameras, which I
actually -- I disagree with that approach. But that was the constraint they gave
us. And anyway, under these settings where you have the 640 by 480 VGA
camera, most people are between 32 and 64 pixels high. And then sometimes
you get bigger people. But of course if there's a person right in front of your car,
maybe that's not ideal. So hopefully you don't have people right on your
windshield.
And then below 32 pixels there's actually not a lot of people either because that's
sort of the resolution limit in this quality of video below which you really can't see
people.
And so the other thing here is from a safety perspective let's say you're driving
around 55 kilometers per hour, so maybe a little fast for city driving, but
reasonable, and you make some assumptions about the height of people, but the
size of a person is inversely proportional to the distance from the vehicle, and so
here you have the distance and here you have the height in pixels. So, yeah, as
the distance increases this blue curve, this is the perceived height.
And so basically what happens is if a person's 60 meters from the camera, you
have about four seconds, you have time to react if you want to let's say alert the
driver in whatever way. And so really that's when you want to start detecting
people -- when they're about 30 pixels high. By the time they're 80 pixels high
you have about a second and a half to react and you know, bye-bye pedestrian.
So you really want to warn the driver sort of when the pedestrian is just a little
further away. And this -- I'm just using this to emphasize that really for these type
of applications you really want to be able to do detection of smaller scale
pedestrians. And I should say a lot of the work, including our own work -- we
had some part-based work -- was focused much more on this realm of higher
resolution pedestrians. And so again, this is going to be very application
dependent, and I'll talk a little bit more about that. I think actually the type of
techniques one would use for low resolution and high resolution vary a lot, and
so I'll discuss that. But to do well on the type of datasets that are sort of out
there, you really need to do better on the low res pedestrians.
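(As a back-of-the-envelope check on those numbers, here is a small sketch; the 1.8 m pedestrian height and the pinhole focal length are assumed values chosen only to land in the regime just described, roughly 30 pixels at 60 meters, and are not the actual camera calibration.)

```python
# Rough sanity check of the distance / pixel-height / reaction-time tradeoff.
# Assumptions: 1.8 m tall pedestrian, simple pinhole model, and a focal length
# picked so the numbers land near those quoted in the talk (illustrative only).
SPEED_KMH = 55.0
SPEED_MS = SPEED_KMH * 1000.0 / 3600.0       # ~15.3 m/s
PERSON_HEIGHT_M = 1.8
FOCAL_LENGTH_PX = 1000.0                     # assumed effective focal length

for distance_m in (20, 40, 60, 80):
    height_px = FOCAL_LENGTH_PX * PERSON_HEIGHT_M / distance_m   # pinhole model
    time_to_reach_s = distance_m / SPEED_MS
    print(f"{distance_m:3d} m -> ~{height_px:5.1f} px tall, "
          f"~{time_to_reach_s:.1f} s before the vehicle reaches them")
```

At 60 meters this gives roughly a 30 pixel tall person and about four seconds, matching the curve being described.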
So the other interesting thing is -- I guess I didn't mention this before, but when
we were labeling the data, we would actually label both a box for the full extent
of the pedestrian, where the person would be, and then another box for the visible
region of the person. So every person's labeled with two boxes if they're occluded.
So let's say you're looking at me and you're labeling me, you label one box that
goes all the way to the ground and one box that just contains my torso.
And so any time the person was occluded, we would have that information. And
so basically what you see is this is the fraction of time that a person's occluded
as they're walking so let's say you see them for about an average about 150
frames, so about five seconds. There's some people about 20 percent of people
are always occluded. In fact, there's only about 30 percent that are never
occluded. So really if you want to do a good job, you really have to deal with
occlusion. And I think this is something that, for example, when one is
using the INRIA pedestrian dataset, which is sort of the dataset established by
Dalal and Triggs and has been used heavily for pedestrian detection, there's
really very little occlusion or no occlusion in the data they labeled. They selected
it that way, and that's really not realistic.
On the other hand, occlusion isn't random. So what we did here on the right is
we took -- so we have all these sort of masks, right, we have the box containing
the pedestrian, we have another box containing the visible region, so you can
kind of create a binary mask, standardize the size, and you could average all
those together. So you get a heat map of which portion of a person is typically
visible. Okay?
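(A minimal sketch of that averaging, assuming each occluded pedestrian comes as a pair of boxes, the full extent and the visible region; the 64 by 128 canonical size and the helper name are illustrative.)

```python
# Sketch of the occlusion heat map: for each pedestrian, build a standardized
# binary mask (1 = occluded) from the full box and the visible-region box,
# then average the masks. High values = body parts that are often occluded.
import numpy as np

def occlusion_heatmap(full_boxes, visible_boxes, out_h=128, out_w=64):
    """Boxes are (x, y, w, h); the two lists are paired per pedestrian."""
    heat = np.zeros((out_h, out_w))
    for (fx, fy, fw, fh), (vx, vy, vw, vh) in zip(full_boxes, visible_boxes):
        mask = np.ones((out_h, out_w))                 # occluded by default
        # Map the visible sub-box into the standardized full-box frame.
        x0 = int(round((vx - fx) / fw * out_w)); x1 = int(round((vx + vw - fx) / fw * out_w))
        y0 = int(round((vy - fy) / fh * out_h)); y1 = int(round((vy + vh - fy) / fh * out_h))
        mask[max(y0, 0):min(y1, out_h), max(x0, 0):min(x1, out_w)] = 0.0
        heat += mask
    return heat / max(len(full_boxes), 1)              # fraction of time occluded
```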
And so basically what you get actually is that -- so this is the head region, the
blue indicates that it's rarely occluded, red means it's often occluded. And so
what you see is that in the real world actually you'll oftentimes see my head and
torso and not my legs. And in fact, that sort of makes sense if you think about how
occlusions typically arise in the real world.
But we took this a little step further actually, and it turns out that if you cluster -- if
you look a little more deeply at what are the types of occlusions, so
you cluster these masks, and it's not so important exactly how that clustering is
done, but what happens is actually -- so these numbers I guess are a little hard to
see, but this occlusion mask occurs 65 percent of the time. So 65 percent of the
time the occlusion is very simple. It's just the bottom part of a person.
And then these other few: this one, occlusion from the left or right, occurs about
five percent of the time, bottom and left another five percent of the time. But
again, the lesson here is that the occlusion isn't really random and you could
really try to exploit this kind of information. It's not uniform, which I don't know
if people have really exploited. Yes?
>>: Is this a fact about occlusion or is this a fact about the people you manage to
[inaudible]?
>> Piotr Dollar: Yes.
>>: If somebody had his head hidden away when you detect it --
>> Piotr Dollar: Yes. So that's a really interesting subtlety. Right. So the question
is, you know, so you're looking at me and you see my torso so you detect me, but
let's say, you know, I am hiding and all you see is my legs, would you still detect
me? And that could be -- that's absolutely right. That could be a bias. On the
other hand, we actually -- so we're not doing detection just from one frame, right?
We track the person over -- over let's say 50 frames. And so odds are that -- you
know, there are some people that are always occluded 20 percent of the time,
but we still have lots and lots of people that are just occluded part of the time and
it's rarely the case that those people are ever occluded from above. Oh, yeah,
that could be a bias.
So, yeah, so online again. I keep flashing this URL here. But we have
evaluation of lots and lots and lots of algorithms. I think at the time this slide has
12. I know there's at least two ECCV papers that are using this dataset. There's
two more algorithms that we're evaluating at this time. And so sort of interesting.
So this is log log scale. And so you can kind of see that, you know, there's some
order here, roughly when these were published. And so there is some
progress. These curves are sort of kind of moving down slowly.
But this is a log log scale. So actually so you know from Viola Jones to -- well,
that's even hard to say, but for a lot of these, you know, a shift of sort of one grid
line over here is a 10 fold reduction in false positives. So even though these curves
may not look like much, like let's say we can look at HOG here which is the
orange curve and these algorithms over here, that's a 10 fold reduction in false
positives. So progress is being made. Yes?
>>: Is it misprint or [inaudible] because it looks to me the red line never sees any
pedestrians.
>> Piotr Dollar: Yeah. No, this red line is -- it's a -- I mean, the Viola Jones
system worked incredibly well but without some modifications it didn't work that
well for pedestrians. So, yeah, that's absolutely right, the Viola Jones at most
gets about 80 percent detection.
>>: [inaudible].
>> Piotr Dollar: Sorry, gets at most 20 percent detection. Yes, thank you.
Thanks for keeping me honest. Yeah, and so basically what you get is -- I mean,
this is one particular curve. We break this down further, this is sort of the overall
result. You get that about one false positive per image, you detect about 70
percent of people for the state of the art.
Now, but with the caveat that it very much depends on the resolution of the
people, whether or not you include occluded people, and so we have a further
breakdown on the website. I won't go into too many details here, but
if you're interested all that information is online. And I'll say here's our own
detector. So it's the second best here. And in fact, the only one that's better also
uses motion information. And so we're about at least at the time -- right now,
before ECCV, we're about 10 times better using just static features than anyone
else. Actually Deva Ramanan [phonetic] has a paper at ECCV which actually
beats this. But it's unpublished. So I'm not including it here. And so I'll talk
about that in sort of the second part of the talk.
So what are sort of -- what do all these detectors have in common? What
are sort of the key ingredients that make these detectors effective for pedestrian
detection? And so one of the -- all of these algorithms are pretty much sliding
window algorithms. There are other approaches like Hough-based approaches.
But they tend to not work unless you have much higher resolution data.
All these methods use gradient histograms in some form or another. Not
necessarily HOG but in some form. The best algorithms also integrate more
feature types. So one interesting thing here is so when I was a grad student I
was very much into machine learning and I still am. Well, one interesting thing
here is what role does the classifier play versus the feature which is something
some people here have explored. But it turns out, and like you guys found,
that the classifier itself you use once you have a feature representation plays
very little role. So whether you use boosting or SVM, or within boosting, you
know, there's AdaBoost, GentleBoost, LogitBoost, RealBoost, SavageBoost,
BrownBoost -- I could go on, right?
And, you know, we've tried about three or four, keeping the rest of the
system fixed. And it really makes very little difference.
Now, having said that, for all those classifiers that I just listed, and SVMs, whether
you change the kernel a little bit will make a difference, but not a huge one. All of those
are using the same framework -- a framework for binary classification. But
I think really the role for machine learning to sort of advance, advance detection
is where you set up the problem in a more interesting way, like you have less
supervision during training so you allow for some slop. Or you have part based
methods or some sort of latent variable that allows your classifier to be more
powerful, for example, that for a single visual category you learn a mixture model
of classifiers. So I think that's really the place where we're seeing -- where we're
seeing machine learning being able to play a role to advance us and not
necessarily swapping algorithm X for algorithm Y within the same paradigm.
Oh, the last thing I wanted to say is all these detectors, they typically all behave
the same as you change the conditions. So if you -- if you increase occlusion, all
the algorithms work, both the part-based ones and the monolithic ones. And
some -- the only setting under which there seem to be some differences
between algorithms is low versus high resolution. And I'll talk a little bit more
about that, but basically you have less room to be clever in terms of part-based
methods and what not at low res.
So anyways, there's benchmarks up there. We evaluate all these different
settings. In fact, we've taken a lot of other pedestrian detection benchmarks and
ran all these algorithms on all these other ones. So you really get to -- if you're
interested, you really can go see how well, you know, the various methods in the
literature are performing.
So sort of the open problems in pedestrian detection. So things work reasonably
well for large scale, unoccluded pedestrians. Definitely not how we
would want them, but maybe okay. Where things really break down is when you
have lower res or occlusion. Now, of course there are additional sources of
information you can use: context, spatial, temporal, whole system performance,
motion. And people are using those. And I have some students working on that
this summer. There's been a decent amount published.
Say, for example, if you know you're in a vehicle, you can exploit that information. For
higher resolution pedestrians, I think that's actually maybe an area -- if not for
the dataset I've been talking about here -- where there's a lot of room
for improving the detection itself. But I'll talk a little bit more about that later.
One final thing I wanted to say about -- this is sort of a -- maybe a little bit boring
but when you're doing evaluation, doing evaluation right is actually really
important. So there's two ways of sort of evaluating detectors, and one is the
Pascal criteria type of thing where your detector outputs bounding boxes and you
want to do sort of the overlap with the ground truth bounding box and measure it
that way, and I'll refer to that as a per-image evaluation criterion.
The other one is per window. So that's sort of the -- if you're training your
machine learning algorithm you have a bunch of windows that are negative and a
bunch of windows that are positive, you measure on that. It turns out that
oftentimes per window error can be very misleading both because it's really easy
to make a mistake during the evaluation. And because sometimes it didn't -- it's
not really measuring the thing you want it to be measuring.
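(For concreteness, here is a sketch of the PASCAL-style overlap test being referred to; it is a generic intersection-over-union check, not the benchmark's actual evaluation code.)

```python
# Per-image (PASCAL-style) matching: a detection counts as correct if its
# intersection-over-union with a ground truth box exceeds a threshold (0.5).
def iou(box_a, box_b):
    """Boxes are (x, y, w, h). Returns intersection-over-union in [0, 1]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix0, iy0 = max(ax, bx), max(ay, by)
    ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def matches(detection, ground_truth, threshold=0.5):
    return iou(detection, ground_truth) > threshold
```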
And so this is just -- so these curves on the left here were taken from the
literature. And so basically you had Viola Jones and then you had HOG which is
this dotted orange curve and then you had these algorithms -- again, this is a log
scale -- that are just absolutely destroying HOG, like by a factor of, you know,
hundreds in performance. And it turns out that we went and we got these
algorithms from the various authors, and it turns out that actually we ran them and
they're really not -- you know, even though the one here, this blue one, that's
using histogram intersection kernels, it's a little bit better than HOG, all
these other algorithms actually were not. And that was a really disappointing thing for
me that I would go into the literature, find these great algorithms, and it turns out
that they're really not that great.
And so I think -- I think that's happening a little less now that sort of we have
these bigger datasets and people are using those. But just something to be
aware of, that a lot of results that have been published recently, you know -- sometimes
they just look really good on paper and, you know, if we want to
understand the progress being made we really have to do a careful job of
evaluation. So a little bit of a mundane point but important to keep in mind. Don't
use this measure if at all avoidable. It's at this point not very convincing.
So yes, so that -- yes?
>>: [inaudible] did you retrain Viola Jones using your features extracted for the
various [inaudible].
>> Piotr Dollar: Yes. Yes. Not for faces.
>>: So my question is like did you ever run the [inaudible] face detector on your
dataset and see what it does?
>> Piotr Dollar: Oh, I see. But the problem is about 50 to 100 pixels that -- I mean, it --
>>: Just [inaudible] because you say most of the occlusions have the --
>> Piotr Dollar: Oh, I see. But you also have to remember that it's not just frontal
faces.
[brief talking over].
>>: [inaudible]. Face, profile faces and --
>> Piotr Dollar: In an ideal world, yes. But no, we did not run that. And I think --
>>: [inaudible] state of the art detectors [inaudible].
>> Piotr Dollar: Yes. So for Viola Jones we actually used OpenCV and then
we had our own implementation.
>>: [inaudible].
>> Piotr Dollar: Sure. I mean, so there is a question sort of are we actually able
to test every detector in the literature and the answer is no.
>>: [inaudible].
[brief talking over].
>> Piotr Dollar: That's an interesting question and, you know, we could talk
about it a little more offline. But it's a reasonable question. Yes?
>>: This may be off subject but you go to the -- consider the [inaudible].
>> Piotr Dollar: Yeah. I have not. You know, yes?
>>: It seems like you could combine any based on [inaudible] heat -- seeking
infrared camera.
>> Piotr Dollar: Yes. So there is a bit of work on [inaudible] so all I'll be talking
about is monocular pedestrian detection. And of course the two other things that
people use is stereo and motion of course. Some detectors that I was talking
about use motion.
But yeah, actually it -- what actually ends up happening is it's not as trivial to use
those other sources of information as one may think but there's definitely work on
that. Yes, like [inaudible] has a bunch of -- he had some students working on
using infrared, and sometimes you can do ROI detection very easily. But again,
it's not a panacea as one may expect. When you get these other sensors, it
makes some things easier but, you know, it doesn't solve the problem for sure.
Okay. So that's sort of the state of the art in a very brief overview. And I'm sure I
didn't do it complete justice but take a look at the website, take a look at our
benchmarking paper. And we have -- we're preparing a journal version of the
benchmarking paper which will go into a little more depth.
But so now in the second part of this talk, I want to sort of say -- so you've kind of gotten
an overview of the methods out there. I wanted to talk about our own method.
And the first part of it, this is -- this is sort of from three papers, an older CVPR
paper and some recent BMVC papers, one that's not published yet, I just found
out a couple days ago that it got accepted. But at the beginning, when I tell
you about this system, everything should be sort of clear and be like, oh, this is
very natural, this is very simple. And that's sort of the goal. We really tried to
distill the system. But again it works quite well.
So what we basically did is we wanted to integrate multiple feature types in a
very clean uniform manner. And so there's this long history of using basically
image channels in computer vision. So what is an image channel? You basically
take an image and compute some kind of linear or non-linear transformation of
the image where the output has essentially the same dimension as the input
image. And so those could be as simple as, say, the color channels, linear filter
outputs, in this case some Gabor filters or difference of Gaussian filters or offset
Gaussian filters, something like the gradient magnitude, edge detection, gradient
histograms, which are very similar to kind of Gabor filter outputs but basically
in each channel you compute the gradient and you leave the gradient at a given
orientation in that channel. So it's just a way of doing integral histograms.
But you can kind of think of all types of channels of this nature. And the idea was
to basically take the -- in a nutshell, to take the Viola Jones system but run it,
instead of just on gray scale images, where you compute Haar features over gray
scale images -- and a Haar feature is just a sum over a rectangle or sums of
multiple rectangles computed quickly using integral images -- the idea was
instead of just doing this over the original gray scale image, do it over all these
different channels.
And it's really a very simple idea, a very natural extension. But now you can sort
of -- I mean, any one of these channels that I kind of drew up here, you can kind
of, you know, write that in 10, 20 lines of MATLAB code or whatever your favorite
image processing toolbox is.
And so basically if you have, you know, Viola Jones implemented then it's pretty
easy to integrate that -- those features into an overall system.
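(As a hedged sketch of that recipe -- compute a few channels, take an integral image per channel, and read off rectangle sums -- the particular channels, the bin count, and the helper names below are illustrative simplifications, not the actual system.)

```python
# Minimal sketch of integral channel features: compute per-pixel channels,
# build one integral image per channel, then any rectangular sum is O(1).
import numpy as np

def compute_channels(rgb):
    """Color channels, gradient magnitude, and gradients split into a few
    orientation bins (a crude stand-in for gradient histogram channels)."""
    gray = rgb.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % np.pi                   # orientation in [0, pi)
    channels = [rgb[:, :, c] for c in range(3)] + [mag]
    n_bins = 6
    for b in range(n_bins):
        lo, hi = b * np.pi / n_bins, (b + 1) * np.pi / n_bins
        channels.append(mag * ((ori >= lo) & (ori < hi)))
    return channels

def integral(channel):
    return channel.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of a channel over rectangle (x, y, w, h) from its integral image."""
    a = ii[y + h - 1, x + w - 1]
    b = ii[y - 1, x + w - 1] if y > 0 else 0.0
    c = ii[y + h - 1, x - 1] if x > 0 else 0.0
    d = ii[y - 1, x - 1] if y > 0 and x > 0 else 0.0
    return a - b - c + d
```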
And so like I said, there are very few parameters. It gives you very good
performance and it's fast to compute. And the key thing here is this integration of
heterogenous information. Let me say a few more details about it. So initially, for
the features, you use these Haar features -- so Viola Jones had these features sort of
approximating gradients, and those may make sense over the gray scale image
but may not make as much sense over these other channels. So at some point we
were exploring using gradient descent or various search strategies to
basically find the optimal combination of rectangles to be a useful feature. And
so what we actually found -- I mean, that does help, but what we actually found --
and these are some of the combinations found for faces -- but what
we found actually is that, you know, using random features works quite well, and in
fact instead of having these complex features which are these combinations of
rectangles, single-rectangle random features work great.
And we use these soft cascades developed here at MSR actually for the training.
There's a misconception about boosting that it takes weeks to train; our system
takes about five, ten minutes to train, say, on pedestrian detection. So I think
initially the Viola Jones system was a bit slow. And part of that was because
they were using a huge, huge number of features. And we find that if you just
use random features and do some caching -- like I said, training can be five, 10
minutes for one of these detectors.
For the weak classifiers, we just use simple little decision trees. That's sort of
necessary because your features are so simple. So exactly what weak learner you
use here is not so important. But like -- so this is just a very -- I gave kind of a
whirlwind overview of the system. But the point is it actually performs really,
really well. It performs a lot better than any other detector that only relies on
features computed from a single frame without motion.
So this very, very simple distilled system can actually do quite well.
>>: Question.
>> Piotr Dollar: Yes.
>>: [inaudible].
>> Piotr Dollar: Yeah. So originally -- so you could imagine -- so Viola Jones,
let's say a typical feature they use was two pairs of rectangles next to each other,
one with a plus one, one with a minus one and you compute the sum of the
pixels in the two and do the subtraction. But when we were doing random features
and we wanted to have combination of features, we also let the weights be
random and how you weight it. So you maybe have this rectangle times
two-thirds minus this rectangle times one-fourth plus this rectangle times two.
So it's just a way of creating a richer feature representation. But, like I say, in the
end we ended up just using random features and each feature is just a single
rectangle, in which case you don't need weights, right, because it's just one
rectangle anyway and boosting then assigns the weight.
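(A small sketch of generating such random single-rectangle candidates for boosting to select from; the window size, minimum rectangle size, and candidate count are made-up illustration values.)

```python
# Sketch: each candidate feature is just the sum of one channel over one random
# rectangle inside the model window; boosting later picks the useful ones and
# assigns their weights. All constants here are illustrative.
import random

def random_features(n_candidates, n_channels, win_w=64, win_h=128,
                    min_size=4, seed=0):
    rng = random.Random(seed)
    feats = []
    for _ in range(n_candidates):
        w = rng.randint(min_size, win_w)
        h = rng.randint(min_size, win_h)
        x = rng.randint(0, win_w - w)
        y = rng.randint(0, win_h - h)
        ch = rng.randrange(n_channels)
        feats.append((ch, x, y, w, h))    # later evaluated as a rectangle sum
    return feats
```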
So, yeah, but I think that so far what I've talked about is a system that I think
anyone that, you know, has worked with Viola Jones could take and really extend
Viola Jones to give you state of the art performance.
>>: [inaudible].
>> Piotr Dollar: Yeah. No, absolutely. I mean that's absolutely the key. If you
were to use truly random features without any kind of selection, in this case it
wouldn't work well. Yeah.
>>: Did you see [inaudible] did you see any preference to certain types of
features --
>> Piotr Dollar: Like medium sized?
>>: No, no, no, I mean like he discovered certain features all together.
>> Piotr Dollar: Yeah, I see. So -- so we ended up using just three channel
types which is these quantized gradient, gradient magnitude and color channels.
And this is sort of -- I don't want to go into too much detail, but this sort of
visualization of where the features were picked. And so for example for these
gradient magnitude -- so this is let's say the vertical gradient magnitude, and it likes to
have a lot of vertical gradient magnitude in these regions. For let's say the LUV
color channels, so for some reason LUV worked really well and, in fact, it really
likes the features around the head, so I guess it really is picking up on skin color.
So this isn't exactly a face detector, but it actually implicitly is capturing that type
of information. And we've done some studies actually similar to what was done
in Norman's group where we looked at sort of the size of the features. And it
turns out that basically we found a very similar thing that if you discard very large
features you're fine, if you discard very small features you're fine, but you do
need the sort of medium sized features. So very similar in that regard. That was
just the results on the INRIA dataset.
This is a high res dataset. We still get -- we're sort of tied for the best
performance, or we're doing an epsilon better. But that might be just noise
with the latent SVM stuff.
So typically the system I've described, which doesn't have any notion of parts
here, does better on lower res. And on higher res it basically does as well as the
state of the art. Yes?
>>: [inaudible] get into this, but I was just wondering if there's some like key
intuition as to why the boosting features works better than the others? Is it like
the others were overengineered features?
>> Piotr Dollar: I mean I wouldn't say it's just a boosting of random features. I
mean, I think it's -- I mean, at some point it does come down to just good
engineering. But, I mean one advantage that we have is we can integrate the
very different feature types in a uniform way. So, you know, a lot of systems that
say just use HOG or variant of HOG aren't using any color. And for us using
color helped quite a bit.
Of course, you know, how do you integrate color and gradients? They're sort of
different animals. That can be sort of tricky, but in this system it's so sort of
uniform you don't have to think about it. Yeah, and actually it's interesting
because right now we have, you know, these three feature types basically. We
tried, you know, lots and lots of others because you can kind of dream some stuff
up and add it.
And actually not that much stuff makes that much of a difference. So these seem
to capture a good chunk of the information we'd want to get. This is Pietro's
vacation photo dataset. I call it a dataset kind of jokingly. So you can kind of use
these photos to see how our detector's performing or see pictures of Pietro's
family.
So these are just typical results. Pietro really likes sending me these images
where he thinks the detector will fail. That's sort of a game we play. He finds
images where he'll trip me up. He did a really good job with these windows here
in this building in Venice. So here are some of Pietro's kids playing and, you
know, these are sort of typical false positives. Anyway, this is sort of to give you
a sense of what it's doing.
So but, yeah, any more questions about the -- yes?
>>: Just a real application scenario. When you're capturing the [inaudible], you
know, moving car, you know, won't they be blurred?
>> Piotr Dollar: Yeah. I mean, there is definitely a little bit of blurring. I mean,
this dataset does have that. That's one of the reasons it's -- the absolute
performance numbers on this are much lower than they are on the INRIA
dataset, which is from still images which do have a higher quality. So that
definitely degrades your performance a bit. Yeah.
So that's the overall system. And again, the key was they are sort of boiling it
down to sort of the key elements.
What I wanted to talk about sort of for the rest of this talk, this sort of the main
kind of meat of the talk is a recent insight we had that basically allowed us to
really speed these things up. And so it's actually quite interesting and I think it's
fairly general.
So basically the original Viola Jones system, the reason it was so fast is you took
your image, you computed your integral image and then when you ran multiscale
classification, you could test every window and you could test every window at
every scale without doing any feature recomputation. And so that was really,
really fast because in some ways the feature computation stage, which really
was just integral images you did once and then you did multiscale detection.
With sort of modern detection systems you can no longer do that. Because let's
say you're computing gradients. What you do is you train a model of a fixed
scale, and then you shrink your image, and you perform detection on the shrunken
image, and you shrink your image again and you perform detection.
And so every time you shrink your image when you're looking for the same size
detections you're obviously looking for bigger pedestrians in the original image,
right?
This becomes pretty slow because basically all the stuff I was talking about,
whether it's in our system or any other of these detector systems, you have to
recompute all the features for every one of these images in the pyramid. Okay?
And so basically -- so again, I mean, the advantage of course of this is you can
use much richer features. Because Viola Jones the original system was limited
to just gray scale features. So can you sort of combine the advantage of both?
And we basically had an insight that you can. And it's based on some sort of
interesting work done in the '90s about the fractal structure of the visual world.
Meaning that if you take a photograph of the scene, the statistics on average of
that photograph will be independent of the physical area subsumed by a single
pixel. So whether I take a picture of this room or a pixel may correspond to, you
know, a certain dimension or I zoom in on someone near by like I zoom in on
Larry and took a picture of his face, on average the statistics of images taken at
those two different scales will be the same. And that -- that actually the sort of
fractal nature will allow us to approximate features at other scales. So it will take
me a little while to get there, but bear with me.
So normally the reason you can do sliding window at a single scale very quickly
is that your features -- omega here denotes the feature computation, in
this case gradient magnitude -- are invariant to shift. So whether you translate your
image and compute the features, in this case gradient magnitude, or compute the
features and then shift the result, you'll get the same result. And that's, you
know, ignoring sort of the boundary effects here. But this holds.
So you can do these operations in either order, which means that when you're
doing detection, you don't actually have to crop every single window and
recompute the features in that window. You could just compute the features
once for the whole image and then crop out windows from the feature map. And
this wouldn't be true, for example, if you had smoothing of the image where you
wanted to smooth the window with more smoothing say towards the edges of the
window than the center. Then you'd actually have to crop and recompute the
features.
But typically the features we work with in computer vision tend to be shift
invariant. So that's why you can do single scale sliding window without
recomputing features. This does not hold for scale.
So let's say you have an image, I shrink it and I compute say the gradient
magnitude, that's not the same as computing the gradient magnitude and
shrinking that result. If you think about it, I mean, a very simple example, let's
say you have a checkerboard, very high frequency white block pixels alternating,
you compute the gradient magnitude it has lots and lots of gradient energy, you
shrink that, it still has lots of gradient energy. But if you shrink the checkerboard,
it's white and black pixels, you basically lose that frequency, it becomes gray, it
has no gradient. So you can't reverse the order of these operations.
So that's why you have to recompute features as you rescale.
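(The checkerboard argument can be checked numerically; here is a sketch, using simple forward-difference gradients and 2x2 block averaging as stand-ins for the real feature and resampling code.)

```python
# Downsampling then computing gradients is not the same as computing gradients
# then downsampling: the checkerboard's energy disappears in the first case.
import numpy as np

checker = (np.indices((64, 64)).sum(axis=0) % 2) * 1.0   # 1-pixel checkerboard

def grad_mag(img):
    gx = np.diff(img, axis=1, append=img[:, -1:])        # forward differences
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return np.hypot(gx, gy)

def downsample2(img):
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

print(grad_mag(downsample2(checker)).sum())   # ~0: the high frequency is gone
print(downsample2(grad_mag(checker)).sum())   # large: gradient energy survives
```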
Okay. Okay. I don't need to say this. So this will be my only equation heavy
slide, I promise. But it's super simple. Basically I denotes the image, D denotes
the gradients; the gradient magnitude at a pixel (i,j) is just the square root of
the sum of the squares of the gradients. The angle, the orientation, is just an
arctan. And then we quantize the orientation to Q bins. And a gradient histogram
is the gradient magnitude summed over some rectangular region at some
quantized orientation.
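(Written out, the definitions being read off the slide are roughly the following; the symbol names are chosen here for clarity.)

```latex
% Gradient magnitude, quantized orientation, and one gradient histogram bin.
% I is the image, I_x and I_y its gradients, Q the number of orientation bins,
% R the rectangular region being summed over.
\begin{align*}
M(i,j) &= \sqrt{\, I_x(i,j)^2 + I_y(i,j)^2 \,} \\
O(i,j) &= \arctan\!\big( I_y(i,j) / I_x(i,j) \big), \quad \text{quantized into } Q \text{ bins} \\
h_q    &= \sum_{(i,j) \in R} M(i,j)\, \mathbf{1}\big[\, O(i,j) = q \,\big]
\end{align*}
```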
So super whirlwind introduction to gradient histograms but hopefully this is sort of
familiar to everyone. And I should say, by the way, for sort of the purposes of
explaining these ideas I'm going to be talking about gradients and gradient
histograms, but these concepts will actually hold for almost any feature type.
And I'll try to make that a little bit more precise later on.
And okay, one other thing. So this is a gradient histogram. So this is the sum of
the magnitude at a given orientation over a region. This is a bin of a gradient
histogram.
Sometimes, just for simplicity, I'll actually be talking about the sum of the gradient
magnitude over a region at all orientations. Okay. So hopefully this is okay.
So I'm going to try to make this a little bit interactive, so please, please answer
some of these questions if you can. So -- and we'll see how this goes. So we
have -- let's say we have a pair of images, or an image and an upsampled
version of that image by a factor of two, and we compute the gradient magnitude
for each of those. And then we sum all the pixels in the gradient magnitude here
and we sum all the pixels in the upsampled image.
What's the relation between this scalar and this scalar?
>>: [inaudible].
>> Piotr Dollar: I'm sorry?
>>: [inaudible].
>> Piotr Dollar: Close so let me give a hint. So when you upsample an image,
the gradient actually decreases, right? So if you have a sharp edge, you
increase the scale, the magnitude of the [inaudible] actually decreases so by a
factor of a half.
>>: [inaudible].
>> Piotr Dollar: All good guesses. All close. Nobody has hit it yet. It's just a
factor of two. So the gradient magnitude at each point is just one-half, but you
have four times as many points. Okay? And so.
>>: You're assuming just -- you're talking about the original image that has more
high frequency content?
>> Piotr Dollar: Yes. So that will be the more interesting case that I'll get to in a
second.
So in this case sort of nothing magical is happening. When you upsample, you're
not creating information. And so it makes perfect sense that you don't actually
have to upsample and compute gradients to predict the gradient content of the
up sampled image. Nothing really special so far. But we went ahead and -- so
this experiment isn't going to be that interesting, but I'll lay the framework for sort
of the experiments coming up. We went ahead and gathered a dataset of images,
computed the gradient magnitude sum, upsampled, performed the same
operation, looked at the ratio and then we plotted the histogram.
So as expected, for most images when you upsample by two there's some noise
because of, say, the interpolation procedure. And we tried a few different
interpolation procedures. Typically there's always a little bit of noise.
So this is basically plotting this ratio. So in fact if you take a dataset of images and
do this, it's usually around two with some noise. And there's two datasets of
images I'm looking at here. One is the dataset of windows containing
pedestrians and the other is the dataset of just natural images. So random
windows in the world, with the caveat that they're actually random
windows selected from images that could contain pedestrians. So sort of
the negatives if you think of it in a detection setting. But it seems to hold for both.
There's a little more variance for natural images than for pedestrians, which I'll explain a little bit.
Okay.
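(A sketch of that measurement; the bilinear upsampling and the simple gradient magnitude are assumptions, and the image loading is left as a placeholder.)

```python
# For each image, compare the summed gradient magnitude after 2x upsampling to
# the original sum; over a dataset the ratio should concentrate around 2, with
# some spread from the interpolation. load_gray / image_paths are placeholders.
import numpy as np
from scipy.ndimage import zoom

def grad_mag_sum(img):
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy).sum()

def upsample_ratio(img):
    up = zoom(img, 2.0, order=1)          # bilinear 2x upsampling (assumed)
    return grad_mag_sum(up) / grad_mag_sum(img)

# ratios = [upsample_ratio(load_gray(p)) for p in image_paths]
# A histogram of `ratios` should peak near 2.
```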
So far nothing interesting. Now let's say I downsample an image by a half. So
this is the interesting case because now I'm actually losing frequencies. So first
of all, let me ask the question. Now, suppose that the large image actually was
very smooth, so it didn't have any high frequencies. If I downsampled, what
would the relationship be between H and H prime? Just from --
>>: [inaudible].
>> Piotr Dollar: One over two. Right. Very simple. I mean, it's just the reverse of
the previous thing. But of course now we're actually going to be losing -- we're going
to be smoothing out information. So can we say anything about this relationship?
Well, it turns out that it's actually .32. Why? And so I observed this fact
empirically, and I went ahead and I took the whole collection of images and it was
in fact .32 for a lot, a lot of images, with small variance. This is a little bit of a
surprising fact. Why is it .32, why not some other number? Where is this coming
from?
And sort of this goes back to sort of this -- what I was talking about this fractal
nature of images. So basically back in the '90s people did a bunch of work where
they took -- would take an ensemble of images at different scales, and they found
that the statistics of those images were invariant to the scale at which they
were taken. And this is what I was saying at the beginning.
And, in fact, that's not exactly true. The precise statement here is, okay, so let
this just denote the gradient histogram over an image after it's been
downsampled by some factor. So that's just what we were looking at. So
really the precise statement here is that the amount of energy lost in the
downsampled image only depends on the relative scale of the pair of images and
not the absolute scale. That's another way of saying that sort of what's
going on in an image is independent of the scale at which it was taken. Right? And
so basically what you get is -- I'm just formalizing what I wrote up here, and again,
this is all work done sort of in the '90s and I'm just borrowing it here -- is that if you
take the gradient histogram at a scale beta plus S, so downsampled by beta
plus S, versus the image downsampled by beta, the ratio will be some
function that only depends on S. It will not depend on beta. Okay? So it only
depends on the relative scale of the images and not the absolute scale. And this
only holds on average over an ensemble. It may not hold for an individual image.
And once you have the ratio in this form, well, it actually has to
have an exponential form in this case. And that's just some algebra to see that.
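(In symbols, the argument just sketched is roughly the following, with notation chosen here; the expectation is over an ensemble of natural images, not any single image.)

```latex
% h(I_s): channel sum (e.g. gradient histogram) of image I downsampled by s
% octaves. Scale invariance says the expected ratio depends only on the
% relative scale s, not on the absolute scale beta, which forces an exponential.
\begin{align*}
\mathbb{E}\!\left[ \frac{h(I_{\beta+s})}{h(I_{\beta})} \right] &= f(s)
  \quad \text{(independent of } \beta\text{)} \\
f(s_1 + s_2) = f(s_1)\, f(s_2)
  \;\;&\Longrightarrow\;\; f(s) = e^{-\lambda s}
\end{align*}
```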
So what we get is using sort of this very old -- well, not very old, it's from the '90s,
I shouldn't say that's very old, this fairly old knowledge of sort of the natural
image statistics, to make a statement about how gradients will behave when we
downsample an image. Now, there's this parameter lambda, and so that's
something we have to estimate experimentally. But we expect it to follow this law
and we expect it to be able to find this one value of lambda and be able to then
predict how the features will change at a different scale.
So we performed this experiment. So this is the same experiment I showed a few
slides ago, this experiment where we have .32, but repeated for lots
and lots of resampling values. So not just a value of one-half, but also -- so this
is a lot of scales. So not just a value of two to the minus first, where we have .33,
but also all these other values. And we plotted those points, these are the
means, and lo and behold, a linear fit works wonderfully on this data. It's really,
really close. So we get some lambda out and, in fact, now we have the ability to
predict how gradients will look in a downsampled image.
Now, so this gets us most of where we want to be. There's going to be some
error. The further away we look, the bigger that error will be. And so I'm not
going to explain exactly what this number is, but basically the error of this
approximation will get worse and worse. But now we have a way of performing
this approximation.
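(A sketch of how one might estimate lambda from such measurements: a straight-line fit of the log ratios against the number of octaves of downsampling. The ratio values below are hypothetical placeholders, chosen only to be consistent with the roughly .33 per-octave figure mentioned; they are not the actual data.)

```python
# log r(s) should be approximately -lambda * s, so a least-squares line fit in
# log space gives lambda. The "measurements" here are hypothetical placeholders.
import numpy as np

scales = np.array([0.5, 1.0, 1.5, 2.0])             # downsampling, in octaves
mean_ratios = np.array([0.57, 0.33, 0.19, 0.11])    # hypothetical measured means

lam = -np.polyfit(scales, np.log(mean_ratios), deg=1)[0]
predicted = np.exp(-lam * scales)                    # predicted feature falloff
print(f"estimated lambda ~ {lam:.2f}")
```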
So let me show some concrete examples of that. So here are some typical
computer vision images that we may work with, especially if we're computer
vision researchers publishing papers. So this is Lena, I think. So
this is the original image. On the right is the gradient magnitude, below is
the gradient orientation. This is the upsampled version of Lena. Here's her
gradient magnitude.
And this is the downsampled version. So we have these three scales. This is
the original one. And basically what we then did is we computed a gradient
histogram. And that's where you're seeing these bars here. So just looking at
the light blue, these are different orientations, so this is one orientation, second
orientation, third orientation and so on.
And this is computing the sum of the gradient magnitude at each orientation.
And what I did basically is I computed -- I did the same thing at the higher, the
upsampled, scale, but I corrected -- I knew I was supposed to have twice the
gradient, so I just divided by two and I plotted this histogram. And I do the
same thing for the downsampled one. And so what you see is basically the fit is
extremely close. All three of these histograms are very, very similar.
So basically what this is saying is, instead of downsampling for
this particular image and recomputing the histogram, we could have just used the
histogram computed at this scale and, you know, we would have some little errors
here. Okay?
So this is -- so so far all the statistics I was talking about previously were for an
ensemble of images. This is an actual demonstration of how it works for an
individual image.
Here's another example. Here's a Colt 45, same thing. Here is a highly textured
image. This is a Brodatz texture, which has lots and lots of high frequencies.
This is not an example of an image typically found in the real world.
And in fact, here when you downsample it, you lose a ton. Because you're losing
those frequencies. And in fact, the yellow bar here, which is the histogram
computed at the lower scale corrected for the scale change, is completely off. So in
this image, if we were to apply this rule of predicting what the gradients would
look like at a lower scale, we would completely fail. Okay? The assumption is
that these type of images are not that common.
But the real proof of course is when we use this inside of a detection system,
how does that affect performance? So, you know, for these images sort of
previously, like this one, the approximation holds quite well. For this one it
doesn't. Hopefully most of the images we'll be encountering when we
perform detection are of this form.
So one thing -- when I did that whole derivation, I was showing you all those
gradients and whatnot, but actually nowhere in that derivation did I use the fact that
they were gradients. This whole idea that the image statistics are independent
of scale -- they performed those experiments back in the '90s, Ruderman
[phonetic] and Field and others. They performed them for lots of different statistics.
So it actually should hold for pretty much any image feature you may want to use.
Now, that's not -- you can't analytically prove that statement or limit
the type of features analytically that it may be applicable to. But in practice, any
feature that people tested back in the '90s and that we tested seems to
follow this law. The only thing that changes is this lambda. So we used this on
some local standard deviation features. And again, the lambda changed a little
bit, but again this beautiful linear fit.
So normalized gradients -- so typically people don't use raw gradients but do some
nonlinear normalization in a neighborhood. Same thing. This is actually HOG --
I got some of the actual HOG code by Dalal and Triggs and ran it,
and again, it holds. So, you know, you could imagine testing this for other
feature types, and, you know, one doesn't know for sure if the law will hold,
but we have no reason to believe that it wouldn't. And so again, I'm using this
observation inside of our own system to perform fast multi-scale detection, but it
should really be applicable to sort of any features you may compute and wish to
approximate at a nearby scale.
So we used this in our multi-scale detector. So instead of creating an image
pyramid -- so instead of actually having to downscale the image and recompute the
gradients to perform detection of the bigger box, we can actually predict what the
gradients would look like in a downsampled image. So we can do this all just by
using the original scale image with the features computed over that.
So that may be a little bit fuzzy right now. I don't want to go into all the sort
of technical algebraic details, but hopefully you can imagine how that process
would be performed even if you don't know the details. So that's what we did.
And basically what we found -- and so again, this approximation degrades the
bigger your scale step is. Oh, actually I should say something. So because it
degrades when you take a really big scale step, we actually ended up doing a
hybrid approach. Instead of an image pyramid that's finely sampled, which is
what you normally have, we had an image pyramid that's very coarsely sampled.
So every time you downsample, say, by a factor of two, or downsample by a
factor of four, you do the detection for the scales near that, so you can decouple
the image pyramid from the pyramid at which you do the detections.
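(A hedged sketch of the resulting approximation at the feature level: the ensemble law is applied locally to rectangle sums, so a feature at a coarser scale is read off the original-scale channels with an enlarged rectangle and a predicted falloff factor. The lambda value and the function are illustrative; as just described, the real system also keeps a coarse, octave-spaced real pyramid and only approximates the nearby scales.)

```python
# To evaluate a rectangle-sum feature as it would appear in an image
# downsampled by ds octaves, enlarge the rectangle by 2**ds on a channel of the
# ORIGINAL image and scale by the predicted falloff, instead of rebuilding the
# image and recomputing the channel. Constants here are illustrative.
import numpy as np

LAMBDA = 1.1   # empirically fitted falloff per octave (illustrative value)

def approx_feature(channel, x, y, w, h, ds):
    """Approximate the sum of `channel` over the model-scale rectangle
    (x, y, w, h) as it would be after downsampling the image by ds octaves,
    using only the channel computed at the original scale."""
    s = 2.0 ** ds
    xs, ys, ws, hs = (int(round(v * s)) for v in (x, y, w, h))
    return np.exp(-LAMBDA * ds) * channel[ys:ys + hs, xs:xs + ws].sum()
```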
So that's what we did. So basically -- I won't show too many curves here.
But here we have -- this is the actual scale step. So if it's not
decoupled, so if your scale step is the same, so you do a two to the one-eighth
scale step for both the image pyramid and the detector pyramid -- so this is the old way
of doing it -- this is the performance you get, with the scale step on the X axis, and
there's three different curves at different false positive rates.
So at this false positive rate, I mean, you know, maybe you get about 90 percent
detection rate on whatever particular measure this is. And then if you decouple
the process of the -- of having the image pyramid and the classifier pyramid, so
you have an image pyramid that you downsample by a factor of two to the first,
so by a factor of two every time, you actually lose almost no performance.
Not until you go to a downsampling factor of about 16 does it start to perform
poorly. So you can really -- so in practice, we use a scale step of one. So in
practice we create this image pyramid with a factor of two between scales. But very,
very little loss of performance.
There's a lambda that we estimated empirically. And when we do the correction,
we can use other lambdas. And it turns out that the lambda estimated
empirically from the previous experiments is very near the optimal lambda as one
would hope or otherwise there's something not quite right with our theory. And
so at the time of publishing we called this method the Fastest Pedestrian
Detector in the West. You might have noticed a theme throughout some
of those slides. But anyway, on these benchmarks, channel features is
sort of our baseline method, and this is the Fastest Pedestrian Detector in the West.
And they're sorted by performance. And so what you see basically is that our
Fastest Pedestrian Detector in the West is about one percent less accurate than
the full-on image pyramid. And that's sort of where the proof lies, that this method
of approximating features at other scales is really reasonable. And so we did
this, this is the INRIA data, this is the -- this is the USA data, this is some other
datasets and it's very similar story. Usually our approximation is very, very close
to the original detector.
So how does this speed things up? So this is a speed versus performance
curve. So speed on the X axis. So you want to be sort of as fast as possible and
miss rate on the Y axis you want to be as low as possible. So this is on the
Caltech dataset with 100 pixel and up. And so here is our baseline method. It
worked at about two frames per second, and the new one goes about four to five
frames per second, depending on settings. So it's about a speedup of 10.
And oh, I should say the Viola Jones method here actually is a slow
implementation because it actually doesn't use the classifier pyramid. So Viola
Jones would be faster. I mean, it's much simpler. It also is orders of
magnitude less accurate, of course. But, you know, there's lots and lots of
methods that work at 20, 40 seconds. We're at five frames per second.
Similar story for when you actually go to detect smaller scale pedestrians. We
really -- we lose a tiny bit of performance but really move up in the speed
category. So, yeah, so that's -- so that's sort of our latest insight/contribution to
sort of our detection system. And I'll be posting that paper on my website. And I
need [inaudible] in two weeks, so I won't drag my feet.
And yeah, so is there any questions on that aspect? I'm almost done with the
talk, don't worry.
So, yeah, like I said I also have am very interested in sort of the learning aspect
of this and -- yes?
>>: Just a general question. So the data, all of the images that you've
shown, the daytime images -- what happens -- so your mission was to
detect pedestrians from a car. What happens at nighttime when --
>> Piotr Dollar: Oh, yeah. I think right now we're just hoping to do something in
the daytime. Nighttime, or when it's raining, that's a little bit of a different story,
yeah. Not quite there yet. Yes?
>>: [inaudible] pedestrians coming towards you?
>> Piotr Dollar: I'm sorry?
>>: [inaudible] the camera [inaudible].
>> Piotr Dollar: Oh, well, we're actually not using any motion; we're just using
the appearance. So it wouldn't help or hurt us on any assumptions of that nature.
But of course, if you're putting this into an overall system, that's the kind of thing
you would want to be aware of. Yes?
>>: So the speedup you mention, it's basically in the time taken to compute the
features?
>> Piotr Dollar: Yes.
>>: Would the number of boxes that are being evaluated --
>> Piotr Dollar: It does not change at all.
>>: So you mean to say it's just [inaudible] feature extraction. [inaudible].
>> Piotr Dollar: Yeah.
>>: [inaudible].
>> Piotr Dollar: It is reasonably fast, single scale.
>>: [inaudible].
>> Piotr Dollar: Yeah. So basically -- okay. So that is reasonably fast, so you
can do realtime single-scale detection with that code. But when you create the
image pyramid and compute the features at every scale of a fine image pyramid
-- let's say you have eight scales per octave -- that's when it gets quite slow.
So that's what we're saving. It's the same code. Let's say you're using HOG
features -- we're not using HOG features, we have our own -- but instead of
having to compute them at every scale, like eight times per octave, you only
compute them once. So that's the speedup.
And, in fact, because we're using sort of this boosting with cascades and all that,
the detector itself -- the evaluation of each window -- was quite fast. The
bottleneck was the computation of the features, and that's what we've alleviated.
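As a rough illustration of where the savings come from, here is a hedged Python sketch of the pyramid construction just described: the expensive feature computation happens only once per octave, and the intermediate scales within each octave are filled in by resampling that octave's channel with a power-law correction. compute_channels is a stand-in for whatever per-image feature computation is used, and the lambda value is illustrative; none of these names come from the actual code.

from scipy.ndimage import zoom

def build_feature_pyramid(image, compute_channels, n_octaves=4,
                          scales_per_octave=8, lam=0.11):
    """Sketch of an approximate feature pyramid.

    image            : 2D grayscale array (kept 2D for simplicity)
    compute_channels : placeholder for the expensive per-image feature
                       computation, assumed to return a 2D array
    lam              : illustrative power-law exponent, not a value from the talk

    Features are computed for real only once per octave; the scales inside
    each octave are approximated from that octave's channel, which is the
    source of the speedup over computing features at every level of a fine
    image pyramid.
    """
    pyramid = []
    for o in range(n_octaves):
        octave_scale = 2.0 ** (-o)
        octave_image = zoom(image, octave_scale, order=1) if o > 0 else image
        base = compute_channels(octave_image)        # expensive step, once per octave
        for k in range(scales_per_octave):
            ratio = 2.0 ** (-k / scales_per_octave)  # intra-octave ratio in (0.5, 1]
            approx = base if k == 0 else zoom(base, ratio, order=1) * (ratio ** -lam)
            pyramid.append((octave_scale * ratio, approx))
    return pyramid

Just to exercise the structure, a trivial gradient-magnitude channel built from numpy's gradient and hypot could stand in for compute_channels, although the real system aggregates several channel types.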
>>: [inaudible] a complementary set of work says feature extraction is sparse but
the number of windows [inaudible] is huge -- branch and bound and other work?
>> Piotr Dollar: Right. I mean, these are orthogonal issues. This doesn't address
the number of windows at all. But one thing is, if you have boosting with
cascades, some of those claims about the evaluation of the windows being really
slow may not hold as much as if, say, you're using a kernel RBF or something
like that, where the evaluation of each window is quite slow even after you have
the features.
>>: So I have a question.
>> Piotr Dollar: Yes?
>>: [inaudible] modeling or anything [inaudible].
>> Piotr Dollar: Right.
>>: I'm really surprised that you did [inaudible] because [inaudible] equivalent for
like [inaudible].
>> Piotr Dollar: Yeah.
>>: [inaudible].
>> Piotr Dollar: Yeah. The problem is that the current part based models don't
explicitly model occlusion. So if you just have a missing part, it's a lower score.
>>: There's also some other work where you can reason about the missing parts.
>> Piotr Dollar: Yeah. So I think -- I think methods like that would definitely be
the right way to go. I think there just wasn't a compelling dataset before to really
-- well, you could simulate datasets like that. There's an ICCV paper from last
year with a HOG-LBP approach where they tried to estimate which parts of the
pedestrians are occluded and all that. But they had to simulate the data, and so
it's just not as convincing as if you actually have a real dataset, which we do now
-- but we actually haven't performed any experiments along those lines.
But no, that's exactly the right type of approach I'd advocate. So let me say a
little bit more about what I think the future work is in pedestrian detection, or
where I see the field going. The way I see it is that there are really two separate
problems: the low resolution domain and the high resolution domain. So in the
low resolution domain -- this is an image from Antonio Torralba at MIT, and a lot
of you have probably seen it -- what does this look like? If you've seen this demo
before, don't say a car. What does this look like? A man. Antonio actually took
these, and the pixel values underneath are identical. And so the only reason
this one is a car and that one is a person is context, because if you were to look
at a local window, they're exactly the same.
And so I think right now a lot of the work in pedestrian detection is saying there's
only so much we're going to extract from the pixels, and is using the surrounding
information instead. So context, temporal consistency -- that kind of thing carries
system performance. If you know you're in a car, you can exploit that type of
information.
And, you know, this helps a lot in the low resolution domain. It also helps a lot in
the high resolution domain, just because our detectors aren't that good. So even
if you can clearly tell it's a person just from the pixels, our detectors can't do that.
So in some sense all of pedestrian detection is in this domain right now, and
probably all object detection, just because our object detectors are not that great.
But really, what I would argue is this: in some ways the whole Caltech pedestrian
dataset, and that whole application with cameras at that resolution, argues that
the low resolution domain is where you want to be. But I'd say there's this other
domain which is just as interesting, which is the high resolution domain, where if
I took a Porsche, let's say, and put it here covering up this guy at this resolution,
you would not think that's a person just because of the context. You'd know
that's a car right there, right?
So I would argue that in this high resolution domain you can really just take the
local window and extract information from that and try to do a lot better. And
that's where I think sort of the part-based models come in, and you could use
segmentation features -- you really have a lot more pixel data coming directly
from that region.
And so much of my work over the past year or two, which has been motivated by
the Caltech dataset, has been sort of confined to the low resolution domain. We
weren't using context -- I have some students working on that this summer -- but
it limited the type of approaches that were really successful in that domain. And
I think this is sort of where -- and I don't know if there are any ideal datasets for
this -- but this is a very interesting problem, and it's where I want to move next,
where you really get to do more interesting things in terms of extracting
information from the local window on which you're trying to make a decision.
Yeah. So that concludes my talk, if there are any questions.
[applause].
>> Piotr Dollar: Yes?
>>: Looking at the picture, does whether the detection goes well or badly depend
on the local de facto dress code? Like the man on the left with his multicolored
clothing versus [inaudible] uniform.
>> Piotr Dollar: Right. Well, I would say that for the current detectors it probably
wouldn't mess things up, because they would downsample this first and you
would discard those details that might confuse you. But I think if you actually
wanted to do detection at this resolution, that's the kind of thing you would start
thinking about. Yes?
>>: [inaudible].
>> Piotr Dollar: Yes. So one place where sort of all the detectors have a hard
time -- and these are all sliding window, so they don't have any reasoning about,
if you have a couple of people very close to each other, which parts come from
which person or anything of that nature; they're just these monolithic classifiers
-- so that's definitely an area where things go bad.
And then, you know, we have these heuristics afterwards of non-maximum
suppression, where you have all these detections and you're trying to actually
figure out how many people there are, and that's definitely pretty challenging. I
mean, at this separation you're still okay, but once people start overlapping,
basically your detectors tell you that there are people there and it's firing
everywhere, but you have no idea, you know, what's actually going on in the
image.
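Since the non-maximum suppression heuristic comes up here, the following is a minimal Python sketch of the standard greedy, overlap-based NMS being referred to. The intersection-over-union criterion and the 0.5 threshold are common defaults I'm assuming, not necessarily the exact scheme this system uses.

def greedy_nms(boxes, scores, overlap_thresh=0.5):
    """Greedy non-maximum suppression over detection windows.

    boxes  : list of (x1, y1, x2, y2) tuples
    scores : list of detector confidences, same length as boxes
    Keeps the highest-scoring box, discards boxes whose intersection-over-union
    with it exceeds overlap_thresh, and repeats on what remains.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        x1, y1, x2, y2 = boxes[i]
        survivors = []
        for j in order:
            u1, v1, u2, v2 = boxes[j]
            iw = max(0.0, min(x2, u2) - max(x1, u1))
            ih = max(0.0, min(y2, v2) - max(y1, v1))
            inter = iw * ih
            union = (x2 - x1) * (y2 - y1) + (u2 - u1) * (v2 - v1) - inter
            if union <= 0 or inter / union <= overlap_thresh:
                survivors.append(j)
        order = survivors
    return keep  # indices of retained detections

Once detections from people standing close together overlap heavily, this greedy merging is exactly where the "firing everywhere" confusion described above shows up.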
Okay. Cool. Thank you.
[applause]