>> Larry Zitnick: Today it's my pleasure to welcome Piotr Dollar here to join us to give a talk on pedestrian detection. He recently graduated from UCSD and is currently doing a post-doc at Caltech, and he'll be describing his work that he's presenting at BMVC today. >> Piotr Dollar: And a general review. >> Larry Zitnick: Okay. Great. >> Piotr Dollar: Cool. Thank you for the introduction. And it's great to be here. So today I'm going to be talking about pedestrian detection, the state of the art. So I'm going to give sort of an overview of what the state of the art is and then talk about our own work and our own detection system and the advances we've made there recently. So first of all, let me say why am I studying pedestrian detection, why am I interested in this problem? I mean, as many of you know, it just has a large number of really compelling applications: robotics, image indexing, HCI, technology for the visually impaired, surveillance, automotive safety. And actually the last one is the one that motivated us to work on this in the first place. But the other reason for studying this problem is you really get to look at some fundamental problems in computer vision and machine learning. You get to really think about object detection in general, feature representations, learning paradigms, and specific things that actually occur when you're doing object detection like scale, occlusion, pose, the role of context. So it's just a great problem in general. So this is a problem I became interested in when I got to Caltech about three years ago. At the time we were funded by an automotive company, and since then that's no longer the case, but it's a problem I've been continuously interested in. And my work has been in three areas. One is the benchmarking and figuring out what actually works, what the common elements are, evaluating the state of the art. There have been a lot of claims people make -- we have this amazing detector -- and really seeing how those claims hold up and under what conditions. And so we learned a lot through that. And so in the first part of the talk I hope to be able to communicate some of that knowledge to you. In the second part of the talk, I'm going to be talking about our own detector. One of the first things we did after doing this evaluation, once we really had a good sense of what the key components of a detection system were, was to put together basically a very simple system that really boiled down what's required to have a good detector. And it actually performs really well. At the time we published it, it was something like a 10-fold decrease in false positives over any competing method. And since then others have caught up, but it's still one of the best systems. And actually I'm going to give a fairly brief overview of the system and then really dive into a recent insight we've had that's allowed us to speed up the detection by a factor of 10 over our previous method and 10 to 100 over competing methods. And it's really something that's pretty general and should be applicable to anyone doing multi-scale detection. So that will be the main technical component of the talk. I've also been very interested in the learning paradigms you use for detection, so part-based methods and setting that up as a learning problem where you don't have to have full supervision. But I won't actually talk about that too much today for lack of time.
But if anyone's interested, that's something I would be very happy to talk about afterwards. So in the first part I'm going to do this overview: what is the state of the art in pedestrian detection? So what we did, one of the first things we did at Caltech, and this is joint work with Christian Wojek and Bernt Schiele, was we went out and we gathered this huge dataset of pedestrians. You know, [inaudible] setup where we had a camera hooked up to a vehicle, a 640 by 480 VGA camera, and we had somebody drive around who wasn't a computer vision researcher, so hopefully they weren't biased in any way, and record video from the car. And then we went ahead and we labeled a very large quantity of video. So we have something like a third of a million bounding boxes labeled. So that's really a big number. I mean, think about that number, 300,000 people with boxes around them. Now, of course many times it's the same person; there are only about 2,300 unique individuals that are labeled. But they're still labeled for a long time. And of course we didn't do this ourselves, we hired some people to do it. Most of them would flake out after maybe doing this task for four or five hours. But then we had the wife of a post-doc in our lab who really stuck with this, and I think she did something like 350 hours of labeling. And she said she couldn't sit in a car and look out the window because she would be drawing boxes around people in her head. So, yeah, not a fun task. I wouldn't recommend it. But, yeah, it did give us this really incredible dataset. And I should just say the dataset is available online. I've sort of had this link up here the whole time, and this is the dataset. And all this stuff I'm going to be talking about is available online, evaluation code and so on. Including the annotation tool. So still, even if we hire somebody for 400 hours, if we were just to label individual frames one at a time, it wouldn't really be a feasible way of labeling. So we have this tool where we can label the person in one frame and then label them again 50 frames later and it tries to do some interpolation. It doesn't use any of the image data to do the interpolation. We tried to use a tracker and other techniques for doing the interpolation, but we found that from a user's perspective it's actually better to do something that's very predictable. So it ended up being just a very simple cubic interpolation. And then the user goes in and fixes it where it's still making a mistake, or reinterpolates, and you do this a couple of times until you have good tracks. So this sort of makes it feasible to label large quantities. With video, of course, the advantage is that you can label a huge number of frames, but the drawback is that from frame to frame a person's visual appearance may not change that much. So that's sort of the tradeoff. So I want to say a little bit about the statistics and what we learned about pedestrian detection in the real world. So we did some statistics of the height of people. On this graph on the left, on the X axis is the pedestrian height in pixels. This is in log scale, so they're log size bins. This is the distribution of the size of people. So again, this was a 640 by 480 camera, so this is going to be very application dependent.
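Going back to that labeling tool for a second: just to make the interpolation idea concrete, here's a rough sketch of the kind of thing I mean. This is a simplified stand-in, not our actual annotation code -- I'm just interpolating box corners linearly between two keyframes, where the real tool used a cubic scheme over several keyframes, and the boxes and frame numbers are made up.

```python
# Simplified sketch of keyframe interpolation for bounding box labels.
# A box is (x, y, w, h); the real tool used cubic interpolation over
# several keyframes, here we just interpolate linearly between two.

def interpolate_boxes(box_a, box_b, frame_a, frame_b):
    """Return a dict frame -> interpolated box for frames between two keyframes."""
    boxes = {}
    for f in range(frame_a, frame_b + 1):
        t = (f - frame_a) / float(frame_b - frame_a)          # 0..1 along the track
        boxes[f] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    return boxes

# Label a person at frame 0 and again at frame 50; fill in the rest,
# then the annotator only fixes the frames where the guess is wrong.
track = interpolate_boxes((100, 120, 30, 80), (160, 118, 34, 88), 0, 50)
print(track[25])   # box roughly halfway between the two annotations
```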
But about that camera setting: that was the setting that the automotive company we were working with was pushing us to use. They really didn't want us to use higher resolution cameras, and I actually disagree with that approach, but that was the constraint they gave us. And anyway, under these settings where you have the 640 by 480 VGA camera, most people are between 32 and 64 pixels high. And then sometimes you get bigger people. But of course if there's a person right in front of your car, maybe that's not ideal. So hopefully you don't have people right on your windshield. And then below 32 pixels there's actually not a lot of people either, because that's sort of the resolution limit in this quality of video below which you really can't see people. And the other thing here is, from a safety perspective, let's say you're driving around 55 kilometers per hour, so maybe a little fast for city driving, but reasonable, and you make some assumptions about the height of people. The size of a person in pixels is inversely proportional to the distance from the vehicle, and so here you have the distance and here you have the height in pixels. So, yeah, as the distance increases, this blue curve is the perceived height. And so basically what happens is if a person's 60 meters from the camera, you have about four seconds, you have time to react if you want to, let's say, alert the driver in whatever way. And so really that's when you want to start detecting people -- and they're about 30 pixels high. By the time they're 80 pixels high you have about a second and a half to react and, you know, bye-bye pedestrian. So you really want to warn the driver when the pedestrian is just a little further away. And I'm just using this to emphasize that for these types of applications you really want to be able to do detection of smaller scale pedestrians. And I should say a lot of the work, including our own work -- we had some part-based work -- was focused much more on this realm of higher resolution pedestrians. And so again, this is going to be very application dependent, and I'll talk a little bit more about that. I think actually the types of techniques one would use for low resolution and high resolution vary a lot, and so I'll discuss that. But to do well on the type of datasets that are out there, you really need to do better on the low res pedestrians. So the other interesting thing is -- I guess I didn't mention this before, but when we were labeling the data, we would actually label both a box for the full extent of the pedestrian, where the person would be, and then another box for the visible region of the person. So every person's labeled with two boxes if they're occluded. So let's say you're looking at me and you're labeling me: you label one box for my full extent down to the ground and one box that just contains my torso. And so any time the person was occluded, we would have that information. And so basically what you see is -- this is the fraction of time that a person's occluded as they're walking, so let's say you see them on average for about 150 frames, so about five seconds. About 20 percent of people are always occluded. In fact, there's only about 30 percent that are never occluded. So really if you want to do a good job, you really have to deal with occlusion.
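And just to put rough numbers on that reaction-time argument from a minute ago -- this is back-of-the-envelope only; the focal length and the pedestrian height here are assumed values I picked to be plausible, not the calibration of our actual camera:

```python
# Back-of-the-envelope: pixel height and time-to-reach versus distance,
# assuming a pinhole camera. Focal length (in pixels) and person height
# are made-up but plausible numbers, not our actual calibration.

FOCAL_PX = 1000.0      # assumed focal length in pixels
PERSON_M = 1.8         # assumed pedestrian height in meters
SPEED_MS = 55 / 3.6    # 55 km/h in meters per second

for dist_m in (20, 40, 60, 80):
    height_px = FOCAL_PX * PERSON_M / dist_m   # perceived height ~ 1/distance
    time_s = dist_m / SPEED_MS                 # time until the car reaches the person
    print(f"{dist_m:3d} m away: ~{height_px:4.0f} px tall, ~{time_s:.1f} s to react")
```

With these assumed numbers a person 60 meters out is about 30 pixels tall and gives you about four seconds, and by the time they're around 80 pixels tall you're down to roughly a second and a half, which is the point I was making.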
And on the occlusion point, I think this is something where, for example, the INRIA pedestrian dataset -- which is the dataset established by Dalal and Triggs and has been used heavily for pedestrian detection -- really has very little or no occlusion in the data they labeled. They selected it that way, and that's really not realistic. On the other hand, occlusion isn't random. So what we did here on the right is we took all these masks -- right, we have the box containing the pedestrian, we have another box containing the visible region, so you can create a binary mask, standardize the size, and you can average all of those together. So you get a heat map of which portion of a person is typically visible. Okay? And so basically what you get is -- so this is the head region, the blue indicates that it's rarely occluded, red means it's often occluded. And so what you see is that in the real world you'll oftentimes see my head and torso and not my legs. And in fact, that sort of makes sense if you think about how occlusions arise in the real world. But we took this a little step further, and it turns out that if you look a little more deeply at what the types of occlusions are -- so you cluster these masks, and it's not so important exactly how that clustering is done -- what happens is, so these numbers I guess are a little hard to see, but this occlusion mask occurs 65 percent of the time. So 65 percent of the time the occlusion is very simple. It's just the bottom part of a person. And then these other few, like this one for occlusion from the left, occur about five percent of the time, bottom and left another five percent of the time. But again, the lesson here is that the occlusion isn't really random, it's not uniform, and you could really try to exploit this kind of information, which I don't know if people have. Yes? >>: Is this a fact about occlusion or is this a fact about the people you manage to [inaudible]? >> Piotr Dollar: Yes. >>: If somebody had his head hidden away when you detect it -- >> Piotr Dollar: Yes. So that's a really interesting subtlety. Right. So the question is, you're looking at me and you see my torso so you detect me, but let's say I am hiding and all you see is my legs, would you still detect me? And that's absolutely right. That could be a bias. On the other hand, we're not doing detection just from one frame, right? We track the person over, let's say, 50 frames. And so there are some people, 20 percent, that are always occluded, but we still have lots and lots of people that are just occluded part of the time, and it's rarely the case that those people are ever occluded from above. But yeah, that could be a bias. So, yeah, so online again -- I keep flashing this URL here -- we have an evaluation of lots and lots and lots of algorithms. I think at the time of this slide it was 12. I know there are at least two ECCV papers that are using this dataset. There are two more algorithms that we're evaluating at this time. And so it's sort of interesting. So this is log-log scale. And you can kind of see that there's some order here, roughly the order in which these were published. And so there is some progress. These curves are sort of moving down slowly. But this is a log-log scale.
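By the way, that occlusion heat map is really nothing more than rasterizing each visible-region box relative to its full-body box onto a common grid and averaging. A minimal sketch, with made-up boxes rather than our actual annotations, would be:

```python
import numpy as np

# Minimal sketch of the occlusion heat map: for each occluded person we have
# a full-body box and a visible-region box; rasterize the visible region into
# a standardized grid and average over all people. Boxes here are made up.

GRID = 32  # standardized mask size

def visibility_mask(full_box, visible_box, grid=GRID):
    """Binary grid: 1 where the person is visible, relative to the full box."""
    fx, fy, fw, fh = full_box
    vx, vy, vw, vh = visible_box
    mask = np.zeros((grid, grid))
    x0 = int((vx - fx) / fw * grid); x1 = int((vx + vw - fx) / fw * grid)
    y0 = int((vy - fy) / fh * grid); y1 = int((vy + vh - fy) / fh * grid)
    mask[max(y0, 0):min(y1, grid), max(x0, 0):min(x1, grid)] = 1.0
    return mask

annotations = [  # (full box, visible box), both (x, y, w, h) -- illustrative only
    ((0, 0, 40, 100), (0, 0, 40, 60)),    # lower half occluded
    ((0, 0, 40, 100), (0, 0, 40, 70)),
    ((0, 0, 40, 100), (20, 0, 20, 100)),  # left half occluded
]
heat = np.mean([visibility_mask(f, v) for f, v in annotations], axis=0)
print(heat.mean(axis=1))  # average visibility per row: head rows high, leg rows lower
```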
So coming back to these curves: from Viola Jones to -- well, that's even hard to say, but for a lot of these, a shift of one grid line here is a 10-fold reduction in false positives. So even though these curves may not look like much -- let's say we look at HOG here, which is the orange curve, and these algorithms over here -- that's a 10-fold reduction in false positives. So progress is being made. Yes? >>: Is it a misprint or [inaudible], because it looks to me like the red line never sees any pedestrians. >> Piotr Dollar: Yeah. No, this red line is -- I mean, the Viola Jones system worked incredibly well for faces, but without some modifications it didn't work that well for pedestrians. So, yeah, that's absolutely right, Viola Jones at most gets about 80 percent detection. >>: [inaudible]. >> Piotr Dollar: Sorry, gets at most 20 percent detection. Yes, thank you. Thanks for keeping me honest. Yeah, and so basically what you get is -- I mean, this is one particular curve, we break this down further, this is sort of the overall result -- you get that at about one false positive per image, you detect about 70 percent of people for the state of the art. Now, with the caveat that it very much depends on the resolution of the people, whether you include occluded people or not, and so we have a further breakdown on the website. I won't go into too many details here, but if you're interested all that information is online. And I'll say here's our own detector. So it's the second best here. And in fact, the only one that's better also uses motion information. And so we're -- at least at the time, right now, before ECCV -- about 10 times better using just static features than anyone else. Actually Deb Aravadad [phonetic] has a paper at ECCV which actually beats this. But it's unpublished, so I'm not including it here. And I'll talk about that in the second part of the talk. So what do all these detectors have in common? What are the key ingredients that make these detectors effective for pedestrian detection? So all of these algorithms are pretty much sliding window algorithms. There are other approaches, like Hough-based approaches, but they tend to not work unless you have much higher resolution data. All these methods use histograms of gradients in some form or another -- not necessarily HOG, but in some form. The best algorithms also integrate more feature types. So one interesting thing here is, when I was a grad student I was very much into machine learning, and I still am. One interesting thing here is what role does the classifier play versus the features, which is something some people here have explored. But it turns out, like you guys found, that the classifier itself, once you have a feature representation, plays very little role. So whether you use boosting or SVMs -- and within boosting, you know, there's AdaBoost, GentleBoost, LogitBoost, RealBoost, SavageBoost, BrownBoost -- I could go on, right? And we've tried about three or four, keeping the rest of the system fixed, and it really makes very little difference. Now, having said that, all those classifiers that I just listed, and SVMs where you change the kernel a little bit, will make a difference but not a huge one. All of those are using the same framework, a framework for binary classification.
But I think really the place for machine learning to advance detection is where you set up the problem in a more interesting way, like you have less supervision during training, or you have part-based methods or some sort of latent variable that allows your classifier to be more powerful -- for example, for a single visual category you learn a mixture model of classifiers. So I think that's really the place where we're seeing machine learning being able to play a role to advance us, and not necessarily swapping algorithm X for algorithm Y within the same paradigm. Oh, the last thing I wanted to say is all these detectors typically all behave the same as you change the conditions. So if you increase occlusion, all the algorithms degrade, both the part-based ones and the monolithic ones. The only setting under which there seem to be some differences between algorithms is low versus high resolution. And I'll talk a little bit more about that, but basically you have less room to be clever in terms of part-based methods and whatnot at low res. So anyway, the benchmark is up there. We evaluate all these different settings. In fact, we've taken a lot of other pedestrian detection benchmarks and run all these algorithms on those as well. So if you're interested, you really can go see how well the various methods in the literature are performing. So, the open problems in pedestrian detection. Things work reasonably well at large scales and on unoccluded pedestrians -- definitely not as well as we would want, but maybe okay. Where things really break down is when you have lower res or occlusion. Now, of course there are additional sources of information you can use: context, spatial, temporal, whole-system performance, motion. And people are using those, and I have some students working on that this summer. There's been a decent amount published. Say, for example, if you know you're in a vehicle, you can exploit that information. For higher resolution pedestrians -- maybe not for the dataset I've been talking about here -- I think that's an area where there's a lot of room for improving the detection itself. But I'll talk a little bit more about that later. One final thing I wanted to say -- this is maybe a little bit boring, but when you're doing evaluation, doing evaluation right is actually really important. So there are two ways of evaluating detectors. One is the PASCAL criterion type of thing, where your detector outputs bounding boxes and you measure the overlap with the ground truth bounding boxes; I'll refer to that as the per-image evaluation criterion. The other one is per window. So that's sort of the -- if you're training your machine learning algorithm you have a bunch of windows that are negative and a bunch of windows that are positive, and you measure on those. It turns out that oftentimes per-window error can be very misleading, both because it's really easy to make a mistake during the evaluation and because sometimes it's not really measuring the thing you want it to be measuring. And so these curves on the left here were taken from the literature.
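Just so the per-image criterion is concrete: a detection is matched to a ground truth box by area of overlap. This is the standard PASCAL-style intersection-over-union rule, nothing specific to our benchmark code; a minimal version of the check looks something like this:

```python
# PASCAL-style overlap test: a detection counts as correct if the
# intersection-over-union with a ground truth box exceeds a threshold
# (0.5 is the usual choice). Boxes are (x, y, w, h).

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

detection    = (100, 100, 40, 90)
ground_truth = (110, 105, 40, 90)
print(iou(detection, ground_truth) > 0.5)   # True -> counts as a detection
```

Okay, so with that in mind, back to these curves from the literature.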
And so basically you had Viola Jones, and then you had HOG, which is this dotted orange curve, and then you had these algorithms -- again, this is a log scale -- that are just absolutely destroying HOG, by a factor of, you know, hundreds in performance. And it turns out that we went and got these algorithms from the various authors, and we ran them, and they're really not -- you know, even though the one here, this blue one, that's using histogram intersection kernels, is a little bit better than HOG, all these other algorithms actually were not. And that was a really disappointing thing for me, that I would go into the literature, find these great algorithms, and it turns out that they're really not that great. And I think that's happening a little less now that we have these bigger datasets and people are using those. But it's just something to be aware of, that a lot of results that have been published recently look really good -- sometimes they only look really good on paper -- and if we want to understand the progress being made we really have to do a careful job of evaluation. So a little bit of a mundane point, but important to keep in mind. Don't use the per-window measure if at all avoidable. It's at this point not very convincing. So yes, so that -- yes? >>: [inaudible] did you retrain Viola Jones using your features extracted for the various [inaudible]. >> Piotr Dollar: Yes. Yes. Not for faces. >>: So my question is, did you ever run the [inaudible] face detector on your dataset and see what it does? >> Piotr Dollar: Oh, I see. But the problem is people are about 50 to 100 pixels, so -- I mean, it -- >>: Just [inaudible] because you say most of the occlusions have the -- >> Piotr Dollar: Oh, I see. But you also have to remember that it's not just frontal faces. [brief talking over]. >>: [inaudible]. Face, profile faces and -- >> Piotr Dollar: In an ideal world, yes. But no, we did not run that. And I think -- >>: [inaudible] state of the art detectors [inaudible]. >> Piotr Dollar: Yes. So for Viola Jones we actually used the OpenCV one and then we had our own implementation. >>: [inaudible]. >> Piotr Dollar: Sure. I mean, so there is a question of are we actually able to test every detector in the literature, and the answer is no. >>: [inaudible]. [brief talking over]. >> Piotr Dollar: That's an interesting question and, you know, we could talk about it a little more offline. But it's a reasonable question. Yes? >>: This may be off subject, but did you consider the [inaudible]. >> Piotr Dollar: Yeah. I have not. You know -- yes? >>: It seems like you could combine any based on [inaudible] heat-seeking infrared camera. >> Piotr Dollar: Yes. So there is a bit of work on [inaudible], so all I'll be talking about is monocular pedestrian detection. And of course the two other things that people use are stereo and motion. Some detectors that I was talking about use motion. But what actually ends up happening is it's not as trivial to use those other sources of information as one may think, though there's definitely work on that. Yes, like [inaudible] has a bunch of -- he had some students working on using infrared, and sometimes you can do ROI detection very easily. But again, it's not a panacea as one may expect. When you get these other sensors, it makes some things easier, but it doesn't solve the problem for sure. Okay. So that's the state of the art in a very brief overview.
And I'm sure I didn't do the state of the art complete justice, but take a look at the website, take a look at our benchmarking paper. And we're preparing a journal version of the benchmarking paper which will go into a little more depth. But now, in the second part of this talk, having given you an overview of the methods out there, I want to talk about our own method. And the first part of it -- this is from three papers, an older CVPR paper and some recent BMVC papers, one that's not published yet; I just found out a couple of days ago that it got accepted -- but at the beginning when I tell you about this system everything should be sort of clear and you should be like, oh, this is very natural, this is very simple. And that's sort of the goal. We really tried to distill the system. But again, it works quite well. So what we basically did is we wanted to integrate multiple feature types in a very clean, uniform manner. And there's this long history of using image channels in computer vision. So what is an image channel? You basically take an image and compute some kind of linear or non-linear transformation of the image where the output has essentially the same dimensions as the input image. And so those could be as simple as, say, the color channels; linear filter outputs, in this case some Gabor filters or difference of Gaussian filters or offset Gaussian filters; something like the gradient magnitude or edge detection; gradient histograms, which are very similar to Gabor filter outputs, but basically in each channel you compute the gradient and you keep the gradient at a given orientation in that channel. So it's just a way of doing integral histograms. But you can think of all kinds of channels of this nature. And the idea was basically, in a nutshell, to take the Viola Jones system but run it on more than just gray scale images -- there you compute Haar features over gray scale images, and a Haar feature is just a sum over a rectangle, or sums of multiple rectangles, computed quickly using integral images. The idea was, instead of just doing this over the original gray scale image, do it over all these different channels. And it's really a very simple idea, a very natural extension. But now you can sort of -- I mean, any one of these channels that I drew up here, you can write in 10, 20 lines of MATLAB code or whatever your favorite image processing toolbox is. And so basically if you have Viola Jones implemented, then it's pretty easy to integrate those features into an overall system. And like I said, there are very few parameters, it gives you very good performance, and it's fast to compute. And the key thing here is this integration of heterogeneous information. Let me say a few more details about it. So initially, for the features you use these Haar features -- so Viola Jones had these features that sort of approximate gradients, and those may make sense over the gray scale image, but may not make as much sense over these other channels. So at some point we were exploring using gradient descent or various search strategies to basically find the optimal combination of rectangles to be a useful feature. And what we actually found -- I mean, that does help -- and these are some of the combinations found for faces.
But what we actually found is that, you know, using random features works quite well, and in fact instead of having these complex features which are combinations of rectangles, single-rectangle random features work great. And we use these soft cascades, developed here at MSR actually, for the training. There's a misconception about boosting that it takes weeks to train; our system takes about five, ten minutes to train, say, on pedestrian detection. So I think initially the Viola Jones system was a bit slow, and part of that was because they were using a huge, huge number of features. And we find that if you just use random features and do some caching, like I said, training can be five, ten minutes for one of these detectors. For the weak classifiers, we just use simple little decision trees. That's sort of necessary because your features are so simple. But exactly which weak learner you choose is not so important. So that was a very whirlwind overview of the system. But the point is it actually performs really, really well. It performs a lot better than any other detector that only relies on features computed from a single frame without motion. So this very, very simple distilled system can actually do quite well. >>: Question. >> Piotr Dollar: Yes. >>: [inaudible]. >> Piotr Dollar: Yeah. So originally -- so you could imagine -- so with Viola Jones, let's say a typical feature they use was a pair of rectangles next to each other, one with a plus one, one with a minus one, and you compute the sum of the pixels in the two and do the subtraction. But when we were doing random features and we wanted to have combinations of rectangles, we also let the weights be random, how you weight them. So you might have this rectangle times two-thirds minus this rectangle times one-fourth plus this rectangle times two. So it's just a way of creating a richer feature representation. But, like I say, in the end we ended up just using random features where each feature is just a single rectangle, in which case you don't need weights, right, because it's just one rectangle anyway and boosting then assigns the weight. So, yeah, I think that so far what I've talked about is a system that anyone who has worked with Viola Jones could take and really extend Viola Jones to give you state of the art performance. >>: [inaudible]. >> Piotr Dollar: Yeah. No, absolutely. I mean, that's absolutely the key. If you were to use truly random features without any kind of selection, it wouldn't work well. Yeah. >>: Did you see [inaudible] did you see any preference to certain types of features -- >> Piotr Dollar: Like medium sized? >>: No, no, no, I mean like did it discover certain features altogether. >> Piotr Dollar: Yeah, I see. So we ended up using just three channel types, which are these quantized gradients, gradient magnitude and color channels. And this is sort of -- I don't want to go into too much detail, but this is a visualization of where the features were picked. And so for example for the gradient magnitude channels -- so this is, let's say, the vertical gradient magnitude, and it likes to have a lot of vertical gradient magnitude in these regions. For, let's say, the LUV color channels -- for some reason LUV worked really well and, in fact, it really likes the features around the head, so I guess it really is picking up on skin color. So this isn't exactly a face detector, but it actually implicitly is capturing that type of information.
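So just to make the whole channel-features recipe concrete, here's roughly what the computation looks like in code. This is a simplified sketch of the general idea, not our actual implementation -- the channel choices, the orientation binning, and the rectangle I evaluate at the end are all just illustrative:

```python
import numpy as np

# Sketch of "channel features": compute a few image channels, then any feature
# is just the sum of one channel over a rectangle, evaluated in O(1) with an
# integral image. Simplified and illustrative, not the actual implementation.

def channels(img_rgb, n_orient=6):
    gray = img_rgb.mean(axis=2)
    gy, gx = np.gradient(gray)
    mag = np.sqrt(gx**2 + gy**2)
    ang = np.mod(np.arctan2(gy, gx), np.pi)            # orientation in [0, pi)
    chans = [img_rgb[:, :, c] for c in range(3)]       # color channels
    chans.append(mag)                                  # gradient magnitude
    for k in range(n_orient):                          # quantized gradient histogram
        lo, hi = k * np.pi / n_orient, (k + 1) * np.pi / n_orient
        chans.append(mag * ((ang >= lo) & (ang < hi)))
    return [np.cumsum(np.cumsum(c, 0), 1) for c in chans]   # integral images

def rect_sum(ii, x, y, w, h):
    """Sum of one channel over a rectangle, via its integral image."""
    A = ii[y - 1, x - 1] if x > 0 and y > 0 else 0.0
    B = ii[y - 1, x + w - 1] if y > 0 else 0.0
    C = ii[y + h - 1, x - 1] if x > 0 else 0.0
    return ii[y + h - 1, x + w - 1] - B - C + A

img = np.random.rand(128, 64, 3)                         # stand-in detection window
ii_chans = channels(img)
feature = rect_sum(ii_chans[3], x=16, y=32, w=24, h=24)  # e.g. gradient energy in a patch
print(feature)
```

The point is that once everything is a channel, boosting just picks rectangles over whichever channels it likes, so integrating color and gradients comes for free.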
And we've done some studies, actually similar to what was done in Norman's group, where we looked at the size of the features. And it turns out that basically we found a very similar thing: if you discard very large features you're fine, if you discard very small features you're fine, but you do need the medium sized features. So very similar in that regard. These are the results on the INRIA dataset. This is a high res dataset. We're sort of tied with the latent SVM stuff for the best performance -- we're doing an epsilon better, but that might just be noise. So typically the system I've described, which doesn't have any notion of parts, does better on lower res, and on higher res it basically does as well as the state of the art. Yes? >>: [inaudible] get into this, but I was just wondering if there's some key intuition as to why the boosting of random features works better than the others? Is it like the others were overengineered features? >> Piotr Dollar: I mean, I wouldn't say it's just the boosting of random features. At some point it does come down to just good engineering. But one advantage that we have is we can integrate very different feature types in a uniform way. So, you know, a lot of systems that, say, just use HOG or a variant of HOG aren't using any color. And for us using color helped quite a bit. Of course, how do you integrate color and gradients? They're sort of different animals. That can be tricky, but in this system it's so uniform you don't have to think about it. Yeah, and actually it's interesting, because right now we have these three feature types basically. We tried lots and lots of others, because you can kind of dream some stuff up and add it. And actually not that much stuff makes that much of a difference. So these seem to capture a good chunk of the information we'd want to get. This is Pietro's vacation photo dataset. I call it a dataset kind of jokingly. So you can use these photos to see how our detector's performing, or see pictures of Pietro's family. So these are just typical results. Pietro really likes sending me images where he thinks the detector will fail. That's sort of a game we play. He finds images that will trip me up. He did a really good job with these windows here in this building in Venice. So here are some of Pietro's kids playing and, you know, these are sort of typical false positives. Anyway, this is just to give you a sense of what it's doing. So, yeah, any more questions about this? Yes? >>: Just a real application scenario. When you're capturing the [inaudible] from a moving car, you know, won't the images be blurred? >> Piotr Dollar: Yeah. I mean, there is definitely a little bit of blurring. This dataset does have that. That's one of the reasons the absolute performance numbers on this are much lower than they are on the INRIA dataset, which is made from still images that have a higher quality. So that definitely degrades your performance a bit. Yeah. So that's the overall system. And again, the key was boiling it down to the key elements. What I wanted to talk about for the rest of this talk, and this is the main kind of meat of the talk, is a recent insight we had that basically allowed us to really speed these things up. And it's actually quite interesting and I think it's fairly general.
So basically with the original Viola Jones system, the reason it was so fast is you took your image, you computed your integral image, and then when you ran multiscale classification, you could test every window at every scale without doing any feature recomputation. And so that was really, really fast, because in some ways the feature computation stage, which really was just integral images, you did once, and then you did multiscale detection. With modern detection systems you can no longer do that. Because let's say you're computing gradients. What you do is you train a model at a fixed scale, and then you shrink your image and you perform detection on the shrunken image, and you shrink your image again and you perform detection. And every time you shrink your image, when you're looking for the same size detections you're obviously looking for bigger pedestrians in the original image, right? This becomes pretty slow, because basically for all the stuff I was talking about, whether it's in our system or any other of these detection systems, you have to recompute all the features for every one of these images in the pyramid. Okay? And so basically -- so again, the advantage of course of this is you can use much richer features, because the original Viola Jones system was limited to just gray scale features. So can you combine the advantages of both? And we basically had an insight that you can. And it's based on some interesting work done in the '90s about the fractal structure of the visual world. Meaning that if you take a photograph of a scene, the statistics on average of that photograph will be independent of the physical area subsumed by a single pixel. So whether I take a picture of this room, where a pixel may correspond to a certain physical dimension, or I zoom in on someone nearby -- like I zoom in on Larry and take a picture of his face -- on average the statistics of images taken at those two different scales will be the same. And that fractal nature will allow us to approximate features at other scales. So it will take me a little while to get there, but bear with me. So normally the reason you can do sliding window at a single scale very quickly is that your features -- omega here denotes the feature computation, in this case gradient magnitude -- are invariant to shift. So whether you translate your image and then compute the features, in this case gradient magnitude, or compute the features and then shift the result, you'll get the same thing. And that's ignoring boundary effects here, but this holds. So you can do these operations in either order, which means that when you're doing detection, you don't have to crop every single window and recompute the features in that window. You can just compute the features once for the whole image and then crop out windows from the feature map. And this wouldn't be true, for example, if you had smoothing of the image where you wanted to smooth the window with more smoothing towards the edges of the window than the center. Then you'd actually have to crop and recompute the features. But typically the features we work with in computer vision tend to be shift invariant. So that's why you can do single scale sliding window without recomputing features. This does not hold for scale.
So let's say you have an image, I shrink it and I compute, say, the gradient magnitude; that's not the same as computing the gradient magnitude and shrinking that result. If you think about it -- a very simple example -- let's say you have a checkerboard, very high frequency white and black pixels alternating. You compute the gradient magnitude, it has lots and lots of gradient energy; you shrink that, it still has lots of gradient energy. But if you shrink the checkerboard of white and black pixels, you basically lose that frequency, it becomes gray, it has no gradient. So you can't reverse the order of these operations. So that's why you have to recompute features at each scale. Okay. Okay. I don't need to say this. So this will be my only equation-heavy slide, I promise. But it's super simple. Basically I denotes the image, D denotes the gradient, and the gradient magnitude at a pixel i, j is just the square root of the sum of the squares of the gradient components. The orientation angle is just an arctan. And then we quantize the orientation into Q bins. And a gradient histogram is the gradient magnitude summed over some rectangular region at some quantized orientation. So that's a super whirlwind introduction to gradient histograms, but hopefully this is familiar to everyone. And I should say, by the way, for the purposes of explaining these ideas I'm going to be talking about gradients and gradient histograms, but these concepts actually hold for almost any feature type, and I'll try to make that a little bit more precise later on. And okay, one other thing. So this is a gradient histogram -- the sum of the magnitude at a given orientation over a region; this is one bin of a gradient histogram. Sometimes, just for simplicity, I'll be talking about the sum of the gradient magnitude over a region at all orientations. Okay. So hopefully this is okay. So I'm going to try to make this a little bit interactive, so please, please answer some of these questions if you can. And we'll see how this goes. So let's say we have a pair of images, an image and an upsampled version of that image by a factor of two, and we compute the gradient magnitude for each of those. And then we sum all the pixels in the gradient magnitude here, and we sum all the pixels in the upsampled image. What's the relation between this scalar and this scalar? >>: [inaudible]. >> Piotr Dollar: I'm sorry? >>: [inaudible]. >> Piotr Dollar: Close, so let me give a hint. So when you upsample an image, the gradient actually decreases, right? So if you have a sharp edge and you increase the scale, the magnitude of the [inaudible] actually decreases, by a factor of a half. >>: [inaudible]. >> Piotr Dollar: All good guesses. All close. Nobody has hit it yet. It's just a factor of two. So the gradient magnitude at each point is just one-half, but you have four times as many points. Okay? And so -- >>: You're assuming just -- you're talking about the original image that has more high frequency content? >> Piotr Dollar: Yes. So that will be the more interesting case that I'll get to in a second. So in this case sort of nothing magical is happening. When you upsample, you're not creating information. And so it makes perfect sense that you don't actually have to upsample and compute gradients to predict the gradient content of the upsampled image. Nothing really special so far.
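If you want to convince yourself of that factor of two, it's a two-minute experiment -- something like the following, where the exact number you get will wobble a little depending on the image and the interpolation; the synthetic test image here is just so the snippet is self-contained:

```python
import numpy as np
from scipy.ndimage import zoom

# Quick check of the upsampling claim: the total gradient magnitude of a 2x
# upsampled image is roughly twice that of the original (per-pixel gradients
# halve, but there are four times as many pixels). The test image here is a
# smooth synthetic one; on real photos you get about 2 with some noise.

def grad_mag_sum(img):
    gy, gx = np.gradient(img)
    return np.sqrt(gx**2 + gy**2).sum()

rng = np.random.default_rng(0)
img = zoom(rng.random((32, 32)), 8, order=1)     # smooth-ish 256x256 test image
up = zoom(img, 2, order=1)                        # bilinear 2x upsampling

print(grad_mag_sum(up) / grad_mag_sum(img))       # should come out close to 2
```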
So we went ahead -- this experiment isn't going to be that interesting, but it lays the framework for the experiments coming up. We went ahead and gathered a dataset of images, computed the gradient magnitude sum, upsampled, performed the same operation, looked at the ratio, and then we plotted the histogram of that ratio. So as expected, for most images when you upsample by two you get around two, with some noise because of, say, the interpolation procedure -- we tried a few different interpolation procedures; typically there's always a little bit of noise. So this is basically plotting this ratio, and in fact if you take a dataset of images and do this, it's usually around two with some noise. And there are two datasets of images I'm looking at here. One is a dataset of windows containing pedestrians and the other is a dataset of just natural images -- random windows in the world, with the caveat that they're actually random windows selected from images that could contain pedestrians, so sort of the negatives if you think of it in a detection setting. But it seems to hold for both. There's a little more variance for natural images, which I'll explain a little bit. Okay. So far nothing interesting. Now let's say I downsample an image by a half. So this is the interesting case, because now I'm actually losing frequencies. So first of all, let me ask the question. Suppose that the large image actually was very smooth, so it didn't have any high frequencies. If I downsampled, what would the relationship be between H and H prime? Just from -- >>: [inaudible]. >> Piotr Dollar: One over two. Right. Very simple. I mean, it's just the reverse of the previous thing. But of course now we're actually going to be losing -- we're going to be smoothing out information. So can we say anything about this relationship? Well, it turns out that it's actually .32. Why? So I observed this fact empirically, and I went ahead and took the whole collection of images, and it was in fact .32 for a lot, a lot of images, with small variance. This is a little bit of a surprising fact. Why is it .32, why not some other number? Where is this coming from? And this goes back to what I was talking about, this fractal nature of images. So basically back in the '90s people did a bunch of work where they would take an ensemble of images at different scales, and they found that the statistics of those images were invariant to the scale at which they were taken. And this is what I was saying at the beginning. And, in fact, that's not exactly the statement we need. The precise statement here is: okay, so let h sub s just denote the gradient histogram over an image after it's been downsampled by a factor of two to the s. So that's just what we were looking at. So really the precise statement here is that the amount of energy lost in the downsampled image only depends on the relative scale of the pair of images and not the absolute scale. That's another way of saying that what's going on in an image is independent of the scale at which it was taken. Right? And so basically what you get -- I'm just formalizing what I wrote up here, and again, this is all work done in the '90s and I'm just borrowing it -- is that if you take the gradient histogram at a scale beta plus s, so downsampled by beta plus s, versus the image downsampled by beta, the ratio will be some function that only depends on s. It will not depend on beta. Okay? So it only depends on the relative scale of the images and not the absolute scale.
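And as an experiment, estimating that exponent is straightforward -- here's a sketch of the kind of fit I'm describing. The image set here is a synthetic stand-in just so the snippet runs on its own, so the value it spits out is not the lambda we actually report; on real photographs you would fit the line the same way:

```python
import numpy as np
from scipy.ndimage import zoom

# Sketch of estimating the scaling exponent: for each image, compare the total
# gradient magnitude after downsampling by 2**s to the original, average the
# log2 ratios over the image set, and fit a line through the means. The
# "dataset" here is random smooth images just so the code runs; you'd use
# real photographs.

def grad_mag_sum(img):
    gy, gx = np.gradient(img)
    return np.sqrt(gx**2 + gy**2).sum()

rng = np.random.default_rng(1)
images = [zoom(rng.random((64, 64)), 4, order=1) for _ in range(20)]  # stand-ins

scales = np.arange(0.5, 3.5, 0.5)                  # downsampling by 2**s
mean_log_ratio = []
for s in scales:
    ratios = [grad_mag_sum(zoom(im, 2.0**-s, order=1)) / grad_mag_sum(im)
              for im in images]
    mean_log_ratio.append(np.mean(np.log2(ratios)))

# On natural images the mean log2 ratio comes out close to linear in s;
# the negative slope of a straight-line fit is the exponent of the power law
# for the total gradient energy.
lam = -np.polyfit(scales, mean_log_ratio, 1)[0]
print(lam)
```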
And I should say this only holds on average over an ensemble; it may not hold for an individual image. And once you have h in this form, well, h actually has to have an exponential form in this case. And that's just some algebra to see. So what we get, using this fairly old knowledge -- well, not very old, it's from the '90s, I shouldn't say very old -- of natural image statistics, is a statement about how gradients will behave when we downsample an image. Now, there's this parameter lambda, and that's something we have to estimate experimentally. But we expect it to follow this law, and we expect to be able to find this one value of lambda and then be able to predict how the features will change at a different scale. So we performed this experiment. This is the same experiment I showed a few slides ago, the one with the .32, but repeated for lots and lots of resampling values. So not just a value of two to the minus first, where we have .33, but also all these other values. And we plotted those points, or rather the means, and lo and behold, a linear fit works wonderfully on this data. It's really, really close. So we get some lambda out and, in fact, now we have the ability to predict how the gradients will look in a downsampled image. Now, this gets us most of the way to where we want to be. There's going to be some error. The further away we look, the bigger that error will be. And I'm not going to explain exactly what this number is, but basically the error of this approximation will get worse and worse. But now we have a way of performing this approximation. So let me show some concrete examples of that. So here are some typical computer vision images that we may work with, especially if we're computer vision researchers publishing papers. So this is Lena, I think. So this is the original image. On the right is the gradient magnitude, below is the gradient orientation. This is the upsampled version of Lena, and here's its gradient magnitude. And this is the downsampled version. So we have these three scales; this is the original one. And basically what we then did is we computed a gradient histogram, and that's what you're seeing in these bars here. So just looking at the light blue, these are different orientations -- so this is one orientation, second orientation, third orientation and so on -- and this is computing the sum of the gradient magnitude at each orientation. And what I did basically is I did the same thing at the upsampled scale, but I corrected for it -- I knew I was supposed to have twice the gradient, so I just divided by two -- and I plotted this histogram. And I did the same thing for the downsampled one. And what you see is that the fit is extremely close. All three of these histograms are very, very similar. So basically what this is saying is that for this particular image, instead of downsampling and recomputing the histogram, we could have just used the histogram computed at this scale and, you know, we would have some little errors here. Okay? So all the statistics I was talking about previously were for an ensemble of images; this is an actual demonstration of how it works for an individual image. Here's another example, here's a Colt 45, same thing. Here is a highly textured image.
This is a Brodatz texture, which has lots and lots of high frequencies. This is not an example of an image typically found in the real world. And in fact, here when you downsample it, you lose a ton, because you're losing those frequencies. And in fact, the yellow bar here, which is the histogram computed at the other scale and corrected for the scale change, is completely off. So on this image, if we were to apply this rule of predicting what the gradients would look like at a lower scale, we would completely fail. Okay? The assumption is that these types of images are not that common. But the real proof of course is when we use this inside of a detection system, how does it affect performance? So, you know, for these images previously, like this one, the approximation holds quite well; for this one it doesn't. Hopefully most of the images we'll be encountering when we perform detection are of the first form. So one thing -- when I did that whole derivation, I was showing you all those gradients and whatnot, but actually nowhere in that derivation did I use the fact that these were gradients. This whole idea that the image statistics are independent of scale -- they performed those experiments back in the '90s, Ruderman and Field and others, for lots of different statistics. So it actually should hold for pretty much any image feature you may want to use. Now, you can't analytically prove that statement or analytically limit the type of features it may be applicable to. But in practice, any feature that people tested back in the '90s and that we tested seems to follow this law. The only thing that changes is this lambda. So we used this on some local standard deviation features, and again, the lambda changed a little bit, but again this beautiful linear fit. Normalized gradients -- so typically people don't use raw gradients but do some nonlinear normalization over a neighborhood -- same thing. This is actually HOG -- I got some of the actual HOG code by Dalal and Triggs and ran the experiment on that -- and again, it holds. So, you know, you could imagine testing this for other feature types, and of course you don't know for sure if the law will hold, but we have no reason to believe that it wouldn't. And so again, I'm using this observation inside of our own system to perform fast multi-scale detection, but it should really be applicable to any features you may compute and wish to approximate at a nearby scale. So we used this in our multi-scale detector. So instead of creating an image pyramid -- instead of actually having to downscale the image and recompute the gradients to perform detection for the bigger box, we can actually predict what the gradients would look like in a downsampled image. So we can do all of this just by using the original scale image with the features computed over that. I'm keeping this a little bit fuzzy right now -- I don't want to go into all the technical algebraic details -- but hopefully you can imagine how that process would be performed even if you don't know the details. So that's what we did. And basically what we found -- so again, this approximation degrades the bigger your scale step is. Oh, actually I should say something. Because it degrades when you take a really big scale step, we actually ended up doing a hybrid approach.
Instead of the normal finely-sampled image pyramid, we had an image pyramid that's very coarsely sampled. So you downsample, say, by a factor of two, or downsample by a factor of four, and then you do the detection for the scales nearby that. So you can decouple the image pyramid from the pyramid at which you do the detections. So that's what we did. I won't show too many curves here, but here we have the scale step. So if it's not decoupled -- if your scale step is the same two to the one-eighth for both the image pyramid and the detector pyramid, which is the old way of doing it -- this is the performance you get. So this is plotting performance against the scale step, and there are three different curves at different false positive rates. So at this false positive rate, maybe you get about a 90 percent detection rate on whatever particular measure this is. And then if you decouple the process of having the image pyramid and the classifier pyramid -- so you have an image pyramid that you downsample by a factor of two to the first, so by a factor of two every time -- you actually lose almost no performance. Not until you go to a downsampling factor of about 16 does it start to perform poorly. So in practice, we use a scale step of one octave; we create this image pyramid with a factor of two between levels. But very, very little loss of performance. There's a lambda that we estimated empirically, and when we do the correction we can use other lambdas, and it turns out that the lambda estimated empirically from the previous experiments is very near the optimal lambda, as one would hope -- or otherwise there would be something not quite right with our theory. And so we called this method, at the time of publishing, the Fastest Pedestrian Detector in the West. You might have noticed a theme throughout some of those slides. But anyway, so on these benchmarks, channel features is sort of our baseline method, and then there's the Fastest Pedestrian Detector in the West, and they're sorted by performance. And what you see basically is that our Fastest Pedestrian Detector in the West is about one percent less accurate than the full-on image pyramid. And that's sort of where the proof lies, that this method of approximating features at other scales is really reasonable. And we did this on the INRIA data, the USA data, and some other datasets, and it's a very similar story. Usually our approximation is very, very close to the original detector. So how does this speed things up? So this is a speed versus performance curve. Speed is on the X axis, so you want to be as fast as possible, and miss rate is on the Y axis, so you want it to be as low as possible. This is on the Caltech dataset with 100-pixel-and-taller pedestrians. And so here is our baseline method. It worked at about two frames per second, and the new one goes at about four to five frames per second, depending on settings. So it's about a speedup of 10. And oh, I should say the Viola Jones method here is actually a slow implementation because it doesn't use the classifier pyramid; a proper Viola Jones would be faster -- I mean, it's much simpler. It's also orders of magnitude less accurate, of course. But, you know, there are lots and lots of methods that take 20, 40 seconds per frame, and we're at five frames per second. Similar story when you actually go to detect smaller scale pedestrians.
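And just to sketch what that decoupled pyramid looks like in code: you only actually build the channels at octave scales, and for the scales in between you resample the nearest octave's channels and apply a power-law correction. This is a simplified sketch, not our real implementation, and the exponent here is an illustrative value rather than the one we actually estimated:

```python
import numpy as np
from scipy.ndimage import zoom

# Sketch of the decoupled pyramid: compute channels only at octave scales, and
# approximate each intermediate scale from the nearest octave by resampling the
# channel and multiplying by (scale ratio)**(-LAMBDA). LAMBDA is illustrative:
# 0.32/0.25 = 1.28, which is about 2**0.36 per octave for per-pixel gradient
# magnitude; in practice you estimate it per channel type from data.

LAMBDA = 0.36

def grad_mag(img):
    gy, gx = np.gradient(img)
    return np.sqrt(gx**2 + gy**2)

def channel_pyramid(img, scales_per_octave=8, n_octaves=3):
    """Approximate gradient-magnitude channels at a fine set of scales."""
    octaves = {o: grad_mag(zoom(img, 2.0**-o, order=1)) for o in range(n_octaves)}
    pyramid = {}
    for i in range(n_octaves * scales_per_octave):
        s = i / scales_per_octave                 # target scale in octaves: 0, 1/8, ...
        o = min(int(round(s)), n_octaves - 1)     # nearest octave actually computed
        ratio = 2.0 ** (o - s)                    # resample from octave o to scale s
        pyramid[s] = zoom(octaves[o], ratio, order=1) * ratio ** (-LAMBDA)
    return pyramid

img = zoom(np.random.default_rng(2).random((60, 80)), 8, order=1)  # stand-in image
pyr = channel_pyramid(img)
print(sorted(pyr)[:4], pyr[0.375].shape)
```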
And again, for the smaller scale pedestrians we lose a tiny bit of performance but really move up in the speed category. So, yeah, that's sort of our latest insight and contribution to our detection system. And I'll be posting that paper on my website -- I need to [inaudible] in two weeks, so I won't drag my feet. And yeah, are there any questions on that aspect? I'm almost done with the talk, don't worry. So, yeah, like I said I'm also very interested in the learning aspect of this and -- yes? >>: Just a general question. So all of the images that you've shown, the data and the images -- so your mission was to detect pedestrians from a car. What happens at nighttime when -- >> Piotr Dollar: Oh, yeah. I think right now we're just hoping to do something in the daytime; nighttime, or when it's raining, that's a little bit of a different story, yeah. Not quite there yet. Yes? >>: [inaudible] pedestrians coming towards you? >> Piotr Dollar: I'm sorry? >>: [inaudible] the camera [inaudible]. >> Piotr Dollar: Oh, well, so we're actually not using any motion. We're just using the appearance. So it wouldn't help or hurt us on any assumptions of that nature. But of course if you're putting this into an overall system, that's the kind of thing you would want to be aware of. Yes? >>: So the speedup you mention, it's basically in the time taken to compute the features. >> Piotr Dollar: Yes. >>: Would the number of boxes that are being evaluated -- >> Piotr Dollar: It does not change at all. >>: So you mean to say it's just [inaudible] feature extraction. [inaudible]. >> Piotr Dollar: Yeah. >>: [inaudible]. >> Piotr Dollar: It is reasonably fast at a single scale. >>: [inaudible]. >> Piotr Dollar: Yeah. So basically now -- okay. So that is reasonably fast, so you can do realtime single scale detection with that code. But when you create the image pyramid and compute the features at every level of a fine image pyramid -- because let's say you have eight scales per octave -- that's when it gets quite slow. So that's what we're saving. It's the same code. Let's say you're using HOG features -- we're not using HOG features, we have our own -- but instead of having to compute them at every scale, like eight times per octave, you only compute them once per octave. So that's the speedup. And, in fact, because we're using this boosting with cascades and all that, the detector itself, the evaluation of a window, was quite fast, and the bottleneck was the computation of the features. And that's what we've alleviated. >>: [inaudible] complementary set of work says feature extraction is sparse but the number of windows [inaudible] is huge -- branch and bound and other works? >> Piotr Dollar: Right. I mean, those are orthogonal issues. So this doesn't address the number of windows at all. One of the things is, if you have boosting with cascades, some of those claims about the evaluation of the windows being really slow may not hold as much as if, say, you're using an RBF kernel or something like that, where the evaluation of each window, even after you have the features, is quite slow. >>: So I have a question. >> Piotr Dollar: Yes? >>: [inaudible] modeling or anything [inaudible]. >> Piotr Dollar: Right. >>: I'm really surprised that you did [inaudible] because [inaudible] equivalent for like [inaudible]. >> Piotr Dollar: Yeah. >>: [inaudible]. >> Piotr Dollar: Yeah.
>>: So I have a question. >> Piotr Dollar: Yes? >>: [inaudible] modeling or anything [inaudible]. >> Piotr Dollar: Right. >>: I'm really surprised that you did [inaudible] because [inaudible] equivalent for like [inaudible]. >> Piotr Dollar: Yeah. >>: [inaudible]. >> Piotr Dollar: Yeah. The problem is that the current part-based models don't explicitly model occlusion. So if you just have a missing part, it's a lower score. >>: There's also some other work where you can reason about the missing parts. >> Piotr Dollar: Yeah. So I think methods like that would definitely be the right way to go. I think there just wasn't a compelling dataset before, so you could only simulate datasets like that. Like there's an ICCV paper from last year where they had this HOG-LBP approach where they tried to estimate which parts of the pedestrians are occluded and all that. But they had to simulate the data, and so it's just not as convincing as if you actually have a real dataset, which we do now, but we actually haven't performed any experiments along those lines. But no, that's exactly the right type of approach I'd advocate. So let me say a little bit more about what I think the future work is in pedestrian detection, or where I see the field going. And the way I see it is that there are really two separate problems: there's the low resolution domain and then there's the high resolution domain. So in the low resolution domain, this is an image from Antonio Torralba at MIT. A lot of you have probably seen this. So what does this look like? If you've seen this demo before, don't say a car. What does this look like? A man. Antonio actually made these so that the pixel values underneath are identical. And so the only reason this is a car versus a person is context, because if you were to look in a local window, they're exactly the same. And so I think right now a lot of the work in pedestrian detection is saying there's only so much we're going to extract from the pixels, and is using sort of the surrounding information -- context, temporal consistency -- to hold up system performance. If you know you're in a car, you can exploit that type of information. And, you know, this helps a lot in the low resolution domain. It also helps a lot in the high resolution domain, just because our detectors aren't that good. So even if you can clearly tell it's a person just from the pixels, our detectors can't do that. So in some sense, all of pedestrian detection is in this domain right now, and probably all object detection, just because our object detectors are not that great. And in some ways the whole Caltech pedestrian dataset, and that whole application with cameras at that resolution, really argues that this low resolution domain is where you want to be. But I'd say that there's this other domain which is just as interesting, which is the high resolution domain, where if I took a Porsche, let's say, and put it here covering up, let's say, this guy at this resolution, you would not think that's a person just because of the context. You'd know that's a car right there, right? So I would argue that at this high resolution you can really just take the local window and extract information from that and really try to do a lot better. And that's where I think sort of the part-based models come in; you could use segmentation features; you really have a lot more pixel data coming directly from that region. And so much of my work over the past year or two, which has been motivated by the Caltech dataset, has been sort of simplified by looking at the low resolution domain. We weren't using context -- I have some students working on that this summer.
But it limited the type of approaches that were really successful in that domain. And I don't know if there are any ideal datasets for this, but this is a very interesting problem, and it's sort of where I want to move next, where you really get to do more interesting things in terms of extracting information from the local window on which you're trying to make a decision. Yeah. So that concludes my talk, if there's any questions. [applause]. >> Piotr Dollar: Yes? >>: Look at the picture below. Whether the detection goes bad or good depends on the local de facto dress code. Like the man on the left with his multicolored stuff is [inaudible] uniform. >> Piotr Dollar: Right. Well, I would say that for the current detectors it probably wouldn't mess them up, because they would downsample this first, so you would discard those details that might confuse you. But I think if you actually wanted to do detection at this resolution, that's the kind of thing you would start thinking about. Yes? >>: [inaudible]. >> Piotr Dollar: Yes. So one place where sort of all the detectors have a hard time -- and these are all sliding window detectors -- is that they don't have any reasoning about, if you have a couple of people very close to each other, which parts come from which person, or anything of that nature. I mean, they're just these monolithic classifiers. So that's definitely an area where things go bad. And then, you know, we have these heuristics afterwards of non-maximum suppression, where you have all these detections and you're trying to actually see how many people there are, and that's definitely pretty challenging. I mean, at this separation you're still okay, but once people start overlapping, basically your detector tells you that there are people there and it's firing everywhere, but you have no idea what's actually going on in the image. Okay. Cool. Thank you. [applause]
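As a footnote to that last answer, here is a minimal sketch of the kind of greedy, overlap-based non-maximum suppression heuristic being referred to; the 0.5 overlap threshold and the (x1, y1, x2, y2, score) box format are assumptions for illustration, not the exact suppression scheme used in the system described in the talk.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def greedy_nms(detections, overlap_thresh=0.5):
    """Keep the highest-scoring detections, suppressing any box that overlaps
    an already-kept box by more than overlap_thresh. detections is a list of
    (x1, y1, x2, y2, score) tuples."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det[:4], k[:4]) <= overlap_thresh for k in kept):
            kept.append(det)
    return kept

# Two heavily overlapping firings on one person plus one separate person.
dets = [(10, 10, 60, 130, 0.9), (14, 12, 64, 128, 0.7), (200, 20, 250, 140, 0.8)]
print(greedy_nms(dets))  # the second, lower-scoring overlapping box is suppressed
```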