>> Larry Zitnick: All right. It's my pleasure to welcome Piotr Dollar here today to give us a talk on
object recognition. He graduated from the University of California, San Diego, under the advising of Serge,
right?
And the last few years has been at Cal Tech studying some biologically-inspired works as well as
computer vision. And today he's going to be talking about more of the computer vision aspects, I
believe.
>> Piotr Dollar: Thank you for the introduction. And thank you for having me here. So it's a
pleasure today to tell you about my work in object-centered visual recognition.
And so the goal of my work is to take an image and discover and find all the objects in the scene.
But not just to find objects, but to actually characterize their properties in some more detail.
And so first I'll tell you about my work in detection, just finding and naming the objects.
And then in recovering their pose, the geometric configuration of the objects; tracking the
objects to determine their motion, not just the motion of their center of mass, but the
changes in their geometric configuration; and then characterizing the behavior of the objects
themselves.
So my talk -- I'll tell you about each of these, my work in each of these areas with a specific focus
on the detection, and so I'll more briefly talk about the latter parts. We'll see how we're doing on
time in the latter part of the talk.
So throughout, all the methods I'm going to talk about have sort of three properties that I think are
crucial if we want computer vision algorithms to be applicable in the real world.
They're data-driven, they're robust, and they're efficient. Data-driven really means
that it's easy to adapt to a different domain; to adapt to a different domain you just get a
new dataset.
They're robust, meaning they work well, but they also work under adverse conditions, not just
conditions in the lab where we gather some nice, clean images. And they're efficient. So
computationally fairly quick, and this is important both if you have applications where you simply
need real time feedback, for example, if you have a vehicle and you want the vehicle, an
automatic system to warn the driver if, say, there's pedestrians in the road, you need the
feedback to be very fast.
But also if you have sort of low power devices like a cell phone, you can't have algorithms that on
a standard computer take minutes to run. So all of these are aspects I think are crucial for
successful algorithms. And they're something I'll focus on throughout all the different methods I'll
talk about.
So I'm going to start by talking about detection. I'll tell you first a little bit about some of the
challenges here. And this is actually -- I gave a talk here in June, some of you might have seen
this.
I'll discuss it very briefly. Then I'll talk at some length about our detection framework that has
evolved over a number of years. I think what we've ended up with is a very simple system.
We've sort of pruned a lot of the complexities out, but at the same time it's very effective.
I'll tell you sort of what goes into that. Then I'll tell you about a recent effort, recent insight, that
allowed all of this to be made much, much faster.
So, first of all, one of the things we noticed when we were working on this is that the existing
datasets in object recognition didn't really capture what we wanted them to capture when we
wanted to really look at what's working and what's not.
So there's been an emphasis on going to a large number of categories. So a continuous increase in
the number of categories, culminating in Fei-Fei's recent effort on ImageNet, but I think there
hasn't been enough of a focus on going to a large number of images per category, where you can
really see for that one category at least what's really the difficulty there.
Sure, here the difficulty's if you need to distinguish a thousand different classes, but for this one
class, what are the challenges there?
The other thing with a lot of these datasets for object recognition is that they have some fairly
strong level of bias, in the sense that the images were hand-picked.
Somebody went through their vacation photos and said this is a good image to use, or they went
through a search engine or a tagging site, say Flickr, where people labeled an image as having a pigeon in it.
The pigeon was very central. But it might not be representative if you wanted to detect all
pigeons in all images, including small ones in the background.
So we introduced the Caltech Pedestrian Dataset, which I think addresses these issues. I'll tell
you briefly about that. Essentially the main focus was to really drill down on one
category: a lot of images, no bias.
And so this is the dataset. It's available online. And since we introduced it, it's
started to get more widely adopted. The way we gathered it is we had a camera hooked up
in a vehicle. And, really, if you're driving around in a vehicle, these are the types of images you'll
see.
So there's -- in that sense it's very unbiased. We had a labeling tool and hired a bunch of
annotators, had a huge amount of data. It's something like a quarter of a million labeled frames
with 350,000 bounding boxes of pedestrians. And we labeled them in some detail, including
occlusion mask.
So, again, I'm just going to briefly touch upon this. But one of the things we found is that you see
sort of where human perception is. So these are images that are downsampled and upsampled,
so they have an effective resolution. They're all actually at the same resolution, but they have an
effective resolution that's different.
And here, at sort of 32 or 64 pixels, you can really distinguish the pedestrian in this case fairly
accurately. Even at 32 maybe there's some ambiguity; at 64 you definitely distinguish it; lower,
probably not.
This is where humans are at. Algorithms are down here right now. And I think we really have
sort of a way to go, and I think this has sort of practical implications, but it also is showing that
maybe perhaps the representation we have of the image for our computer algorithms isn't really
doing an effective job of extracting all the information out of the image that we'd want.
And, for example, in this dataset it turned out that most of the pedestrians were actually in this
height range, somewhere in here. And the fact that algorithms don't do well there is a big
challenge.
The other thing we looked at was occlusion. I think occlusion is one of those things we all
know is an important thing to deal with for object recognition, but yet we're not dealing with it. In
fact, there hasn't even been a dataset that has occlusions labeled.
So we did this for people. And it turned out that most of the time, so without going sort of through
all the details here, most of the time most people are in fact occluded.
But there is some hope. If you look at the pattern of occlusion, this is sort of the average visible
region. So blue here indicates regions that are visible.
The heads of the people tend to be visible, where the lower part of the body isn't. It's very
nonuniform. It's not like every part of the person is equally likely to be occluded. Moreover, if we
sort of cluster the different occlusions, these seven occlusion masks here account for something
like 98 percent of the occlusion. So 98 percent of the time, conditioned on the fact that a person's
occluded, this is the portion that will be visible, one of these. And not say, well, their middle
region is occluded or the head's occluded or something else. But this is really information that we
really want to be able to exploit in the long run.
So we also went and gathered a very large number of detectors. So each one of these -- exactly whose detectors these are is not important, but each one of these is a detector from a
leading computer vision group. If you're interested, it's on the Web site or in my papers.
Here's, for example, HOG and a number of others. What I'm showing here is false positives per
image on the X axis and the miss rate on the Y axis.
You want low false positives per image and low miss rates; you want to be sort of in the lower
left corner. You see there's sort of a spread, and things are getting better.
This is a log-log scale, so a shift is actually a pretty big deal. But we still have quite a way to go:
at one false positive per image, we're detecting about 60 percent of the pedestrians.
So it's something, but it's really not quite there. So the systems I'm going to tell you about today,
these are our two algorithms.
So this is all without using any kind of context, without any motion. This is just classification of
pedestrian or not, done independently for every window.
These are our two detectors I'll be telling you about today. This is our base detector, this is a
sped up one, you can see where they perform.
One of the things we did was we asked sort of what the progress has been over the last decade
and where we're going. It's always a little dangerous to extrapolate.
But this is around, circa 2000. This is, by the way, on the INRIA pedestrian dataset.
Slightly easier data. People are larger. There are no occlusions.
And there are just fewer people; the pedestrian takes up a larger fraction of the image. Back in around
2000, if we take Viola and Jones's work as where things were at, to get 80 percent detection we were at about one
false positive.
Five years later, Dalal and Triggs introduced -- I'm showing a very selective number of
papers, obviously, from the literature.
Five years later Dalal and Triggs introduced the HOG features. And this is the curve you get. You
may say, well, that's not that big of a gap. But this is a log-log scale. So going from here to
here, one grid cell, is a reduction of a factor of ten in false positives.
It really actually is something. This is where we're at today. This is our own method. You can
see there's been pretty dramatic progress: for a given detection rate, we have in just ten
years reduced the false positives by about 100-fold.
So that's really, really something. At the same time, where we're at today is still not where
we want to be, which is more like one false positive per 10 images at an 80 percent detection rate. Maybe we could argue
about exactly where we want to be, but really it would be more over here.
So, like I said, it's very dangerous to extrapolate, because there's no physical law, no Moore's
law, that says progress has to continue at a certain rate.
But I do think that this is somewhere we'll get within about five to ten years.
>>: Humans are --
>> Piotr Dollar: That's an excellent question. That's one of the things we want to do. We don't
have that data point. And it's a little bit challenging to get that data point.
So if you take the whole image and you have people label, so they get to use context and
everything else, then they're actually doing quite, quite well.
Especially on this dataset. One of the things about this dataset is that it was hand-selected
images, hand-selected so there's little ambiguity.
I would say humans here are actually close to perfect. One of the things we really wanted to do --
a lot of Larry's work has addressed this to some extent, in terms of what is human performance
and where do humans work and algorithms don't -- was really a
context-less study, where we present people with windows and have them make a decision.
So we're still trying to figure out how to set up that experiment properly, because the difficulty with
it is, first of all, there's a million windows we'd have to show the people.
But let's say we have a budget and we have Mechanical Turk. So maybe that's okay. But then the
problem becomes that actually having a pedestrian is a rare event.
Most of the windows don't have one. And if you farm out to Mechanical Turk this kind of task where you
have rare events, it really biases people's responses. So even if they could detect them, maybe they
start hitting no, no, no, no. And basically, to do that experiment, one of
the possibilities, if we can get a detector that has 100 percent recall, at a fairly low false positives
per image, like say a thousand false positive per image but 100 percent recall, we could use that
as a first filter and then perform a human study.
Using sort of a detector that only gets 90 percent performance to do that isn't as compelling.
So anyway, the short story is we really want to have that data point but we haven't figured out
quite the right way to get it yet.
So there's been some progress. So this is unoccluded, fairly large pedestrians. This is going back
to the Caltech pedestrian dataset, and these are difficult cases.
This is low resolution and occluded. And basically what you see is here at this point, this is where
performance is essentially abysmal right now.
No matter, even if you go to lots and lots of false positive per image, you're only detecting maybe
20 percent of the people. Going back to your question of where human
performance would be: it wouldn't be perfect, absolutely not. When we had humans
label this, we used motion. They can go back and forth in the video. They can identify a large
pedestrian and go back in time and get the smaller one. So certainly here for static, and certainly
for context-less detection, humans would not be perfect. But the algorithms are
definitely quite poor.
>>: But resolution is not as important, is it? You can get almost any resolution you want, so why
do you focus on resolution?
>> Piotr Dollar: That's true. I think there are sort of two reasons. One is a practical one, and one is
sort of a scientific one. The practical one depends on the application, and for some, absolutely:
cell phones, the resolution goes up every year, even if the quality of the
image doesn't necessarily.
It's funny, because in the application we're working on, which is vehicles, it turns out that if a
vehicle is, say, $30,000, only about 10 percent of that cost is actual parts,
and everything else is labor, marketing, sales,
shipping, et cetera.
And so for them, when you talk about, let's say, a high resolution camera, which also
requires more computational capacity, a slightly faster processor, say that's an extra $50.
That's $50 not on top of $30,000 but on top of $3,000. And their margins are so small
that they're very, very against that. It's funny, because you'd say, spend another 50 bucks and
we'll have the resolution.
But I agree. In some applications, absolutely, for that practical reason it's not so important. Cell
phones and cameras are increasing their resolution, so why worry about it? I agree with that. It will
depend on the application. But from a scientific point of view, going back here, when I look
at these images, or when people look at these images, clearly you can do detection at a relatively
low resolution.
Now, just because we can compensate with hardware, why not compensate with hardware? But
I think the fact that humans can do it while we can't properly extract information from
these images is actually telling us we're not extracting information very well.
So I think by focusing on lower resolution data, we're forced to say how can we get all the data
out of these pixels.
So we're forcing ourselves to solve that problem. And when we solve that problem, or we
address that problem, we'll actually help with the whole gamut of resolutions.
>>: The human resolution and processing power is fixed. So you cannot change the [inaudible],
cannot change --
>> Piotr Dollar: Yeah, but there's focus.
>>: So for machines this is like a trade-off. You can increase resolution or increase processing
power. So it seems you should have a combined axis, because --
>> Piotr Dollar: Yeah, so we have looked at -- we've looked a lot, for example, at performance
versus resolution. I think, as you said, that will depend on the application. Some applications just
won't make sense to worry about resolution for; you just make up for it with hardware. So I absolutely
agree with that.
Let me tell you about my own detection framework. So we use a sliding window paradigm. This
is actually the dominant paradigm for this type of detection task, not for any other reason than
that this is just what seems to work well: all the top performing methods, for example, on
pedestrian detection or the PASCAL dataset, tend to be sliding window. And in this paradigm, there are two
steps.
You're classifying every single window in an image; then you down-sample the image and
classify every window again. And you can think of it as having two stages. There's extraction
of features, and then there's some kind of classification, supervised, semi-supervised, et cetera. But
there are two stages: features,
and learning and classification. And I'm going to tell you about our work in both. On
our feature end, basically what we did -- there have been a lot of specialized features
introduced over the years, let's say HOG or SIFT, each presented as the best
feature yet.
And we wanted to take a little bit of a different approach. We said we really want to use all kinds of
different complementary information, but we don't want to have to go and actually encode
different features -- here's a color feature, now go code that up or go get somebody's code for
color features.
We wanted to have a really homogeneous approach to all these feature types. What we basically
did -- there's a ton of work in the computer vision literature over the last many, many years
about computing different, what I call, channels of information, which basically highlight
different aspects of the information in the image; obviously you always start with the same image.
So, for example, you could do edge detection on the
image, Canny edges. You can run some kind of linear filter, some kind of nonlinear operation.
Obviously the information in each of these views is in some sense less than the information in the
original image, because you've done some processing and thrown out some information.
But at the same time, from the perspective that you're going to be putting learning on top of this,
different elements of information get highlighted in these different channels. That's the idea,
basically can we leverage this, can we take an image compute lots and lots of different channels.
This is a very simple thing to do from sort of a systems perspective, because there exists code for,
say, computing Canny edge detection, [inaudible], and there's fast code for that.
No matter what sort of library you're using, that exists. So can we leverage this and compute
all these channels, so we have a very homogeneous representation? And in some ways it becomes
irrelevant what the underlying information being emphasized here is.
And then basically what we can do is take and compute integral images over those. And this is
a nice little math trick from Paul Viola and others, where you can then
take the sum of pixels in a rectangular region very, very quickly using a fixed number of floating
point operations.
So the details of that aren't so important. But the point is we can then compute lots and lots of
sort of sums of rectangles or sums of combination of rectangles over these and these sums, once
we do preprocessing, are essentially free or a fixed cost.
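As a minimal sketch of the integral image trick (not the actual system's code; NumPy is assumed, and the gradient-magnitude channel here is just an example):

    import numpy as np

    def integral_image(channel):
        # Cumulative sums along both axes, padded with a leading row/column of
        # zeros so any rectangle sum can be read off with four lookups.
        ii = np.cumsum(np.cumsum(channel, axis=0), axis=1)
        return np.pad(ii, ((1, 0), (1, 0)), mode='constant')

    def rect_sum(ii, r0, c0, r1, c1):
        # Sum of channel[r0:r1, c0:c1] in a fixed number of operations.
        return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

    # Example channel: gradient magnitude of a random image.
    img = np.random.rand(480, 640)
    gy, gx = np.gradient(img)
    grad_mag = np.hypot(gx, gy)

    ii = integral_image(grad_mag)
    print(rect_sum(ii, 10, 10, 30, 30))     # sum over a 20x20 rectangle
    print(grad_mag[10:30, 10:30].sum())     # same value, computed directly

The same precomputation works for any channel, which is what makes the large feature pool below cheap to evaluate.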
And so this gives us an extremely large feature pool to work with. Again, the idea
is lots of heterogeneous feature types in a homogeneous framework. If we go to a different domain, and
maybe color is more important or something else is more important,
we can add that sort of with no effort.
So on top of this we had a very standard learning framework, maybe with some tweaks. At the
beginning we tried to explore gradient descent to find optimal
combinations of rectangles to give us the most informative features. We
ended up just using random features, because it didn't really hurt performance very much, and
threw this into a boosting framework.
So this is actually from some folks here at MSR. Soft Cascade is just a very nice variant of
Cascades, and some decision trees and sort of all of this optimized over the years, but no huge
fundamental changes from what was there 10 years ago.
But it got to the point where it takes five, ten minutes to train one of these and there's very few
parameters and it works well.
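As a hypothetical sketch of that learning side: boosted shallow trees over randomly generated rectangle-sum features. The data, window size, channel count, and parameters are all made up, scikit-learn's gradient boosting stands in for the actual boosting variant, and the soft cascade is omitted:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)
    WIN_H, WIN_W, N_CH = 64, 32, 10              # assumed window size and channel count

    def random_rects(n_feats):
        # Each candidate feature is a (channel, r0, c0, r1, c1) rectangle drawn at random.
        rects = []
        for _ in range(n_feats):
            ch = int(rng.integers(N_CH))
            r0 = int(rng.integers(0, WIN_H - 4)); r1 = int(rng.integers(r0 + 4, WIN_H + 1))
            c0 = int(rng.integers(0, WIN_W - 4)); c1 = int(rng.integers(c0 + 4, WIN_W + 1))
            rects.append((ch, r0, c0, r1, c1))
        return rects

    def extract(windows, rects):
        # windows: (n, N_CH, WIN_H, WIN_W) channel stacks for candidate windows.
        # In the real system each sum would be an O(1) integral image lookup.
        return np.array([[w[ch, r0:r1, c0:c1].sum() for ch, r0, c0, r1, c1 in rects]
                         for w in windows])

    X_win = rng.random((200, N_CH, WIN_H, WIN_W))   # placeholder labeled windows
    y = rng.integers(0, 2, size=200)

    rects = random_rects(500)                       # large pool of random candidates
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=2)
    clf.fit(extract(X_win, rects), y)               # boosting picks the informative rectangles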
And so here, for example, are the learned channels for pedestrian detection. We use gradient
histogram channels, and those are sort of the most important ones. Those are the ones people
are using.
Then we threw in color channels and then it turned out that in LUV color space, the skin color,
actually independent of race, pops out as pretty particular. And you may have known that ahead
of time; you may not have.
But the algorithm can discover that. You give it enough different features, the algorithm will
discover the most informative ones. For example, it discovers that, discovers the shoulders, the
head. So exactly what I'm showing here isn't so important.
But the point is those kinds of things you don't have to program in, and that's really what the
algorithm is going to be discovering.
So we had an extension of this to have some notion of articulation. So have some notion that the
features don't have to appear in a fixed location relative to the window you're testing, but they get
to move around a little bit, move around in groups. It's the notion when you're doing object
detection there's some level of articulation.
So we took basically our feature representation, added a notion of part based and we came up
with something called multiple component learning, basically grouped features into sets and
those can move around.
And so what we actually started with was this approach called multiple instance
learning, an approach for supervised learning. Normally in standard
supervised learning, in the simplest case, you have your positive examples and your negative
examples, and you learn a decision boundary between those.
And what multiple instance learning lets you do, instead of your examples being singletons, they
actually come in groups or sets or bags, where now you're trying to make a decision for an entire
bag and you basically say the bag is positive if it contains at least one positive instance.
But in training, this really lets you add a little bit of, a little bit of weak supervision in the sense that
you don't have to exactly identify your positive example, you just have to say this region contains
your positive example.
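As a minimal sketch of that bag assumption (not the actual MIL training algorithm), a bag's score can be taken as a noisy-OR over its instances, a softened version of "positive if at least one instance is positive":

    import numpy as np

    def bag_probability(instance_probs):
        # MIL assumption: a bag is positive if it contains at least one positive
        # instance.  Noisy-OR aggregation: p(bag) = 1 - prod(1 - p_i).
        p = np.asarray(instance_probs, dtype=float)
        return 1.0 - np.prod(1.0 - p)

    # A bag of candidate windows around a roughly labeled object location:
    # only one window needs to score high for the bag to count as positive.
    print(bag_probability([0.05, 0.10, 0.92, 0.20]))   # ~0.94
    print(bag_probability([0.05, 0.10, 0.08, 0.20]))   # ~0.37

In MIL-style training it is this bag probability, rather than any single instance's probability, that enters the loss, which is how the slop in the labeling gets absorbed.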
So what this paper from MSR actually did was use this to say, let's say
you're trying to learn a head detector but you don't know exactly what sized windows to use, you
don't have exact labeling. So you let the windows move around, you say there's a head somewhere here, and you
have the algorithm figure out which window to use from each training example.
So basically what we said is you can actually use this to learn
parts. They were using it to learn an object classifier where you had slop in the labeling of the
object.
You can instead say, let's train an object as a collection of windows, and try to find the
window appearance within the positive examples that distinguishes them
from the negatives.
And this lets you learn a notion of parts. Of course, an image of a person
can have lots of different parts; which one do you focus on?
What we did is essentially divide the region into lots of overlapping regions. So you can generate
these regions, and there can be fairly dense overlap.
For each one we learned a part detector using MIL, so each one has some notion of slop. We
get a separate part detector for each region, so we
get huge numbers of part detectors, with a lot of redundancy.
But those now become weak classifiers inside of a boosting framework. Sort of two layers of
learning. First you learn the parts and you feed those into yet another learning step.
And that second learning step is the one that picks which are the informative parts and how to
combine them in their spatial reasoning.
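A hypothetical two-layer sketch of that structure (logistic regression stands in for the MIL part detectors, the data is random, and everything here is a placeholder rather than the published method):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(1)

    # Placeholder data: features for each of several overlapping regions per example.
    n, n_regions, d = 300, 12, 20
    X_regions = rng.random((n, n_regions, d))
    y = rng.integers(0, 2, size=n)

    # Layer 1: one part detector per overlapping region (in the actual method these
    # are MIL detectors with spatial slop; plain logistic regression stands in).
    part_scores = np.column_stack([
        LogisticRegression(max_iter=200).fit(X_regions[:, r, :], y)
                                        .decision_function(X_regions[:, r, :])
        for r in range(n_regions)
    ])

    # Layer 2: boosting over the part responses picks the informative parts and
    # learns how to combine them spatially.
    combiner = AdaBoostClassifier(n_estimators=50).fit(part_scores, y)

In practice the second layer would be trained on part scores from held-out data so it doesn't overfit to the first layer's training responses.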
And so this kind of thing actually works well when you have a decent amount of resolution and it
turns out that when you have relatively low resolution, having a notion of parts doesn't buy you
much.
So that's our basic framework. Earlier I showed you this sort of slide with the
ROC curves of how things were performing. But really when it comes down to it, when I want to -- when
I want to show Pietro that something works, he sends me his vacation photos and I run them on
that. These are his kids skiing and the detector fires on those, there's the number of those,
there's the confidence and you can convert it to a probability.
And here's his kids playing soccer, and in this case the algorithm hallucinates a little man on the
shoulder here.
But when I show Pietro images like this, he says maybe this is actually working, maybe you have
something. Here's some hiking in the Alps. Here's an image, kind of thing, this is more
challenging image, more stuff going on here.
This gives you a sense of the type of errors the algorithm is making, and you sort of get false
positive hallucinations. If you were to use scene geometry, which we're not, if you knew
where the ground plane was, you could probably eliminate just with that a lot of these.
But you kind of see without that just, per window, what the algorithm's hallucinating. And some of
them are somewhat reasonable. Some aren't. Here's some more hallucinations. Here you could
say these hallucinations are pretty reasonable, some of these are not. A human would never
hallucinate a pedestrian there.
So one of the things that Pietro is very proud of is that he discovered that the algorithm has a lot
of false positives in Venice. It thinks the big arched windows in Venice are people.
That gives you a sense that maybe the algorithm still isn't really where we want it to be. Even
looking locally at that information without the context, it's very clear that's a window.
Anyway, he discovered that in Padova, where he's from, you don't get the false positives. But in
Venice, something about the arches, something about the distribution of those features, you do.
Here's that example image I was using at the beginning. And you kind of see, again, it's
doing a decent job on the upright people. It's not doing a good job on people in different poses,
and this is, again, why we want to introduce the notion of pose, which I'll talk about later.
So you get some sense of how it's working. I showed Pietro this image. He sent me this image.
I said, look, it's not working. It detects Venus, so maybe you're okay. I guess it depends on what
your goals are.
So now I'm going to tell you about how we took this baseline system and how we made it fast.
And so basically before sort of one of the top performing methods was running at something like
five minutes per image.
Let me repeat that. Five minutes for a single image to get your detection results. And that's on
like a relatively reasonable computer. And so that's really not, if we want to make these things
practical, that's not where we want to be.
And our stuff was running for, depending on the resolution and depending on everything, but sort
of, if you wanted to detect all the small people in a VGA image, it's something like a few seconds
per image. A little more reasonable but still not where you want to be.
The sort of insight I'm going to talk to you about now sped it up by a factor of 10. Now if we want
to detect even the small pedestrians, we can do that at a few frames a second. If we're not so
concerned about all the small pedestrians, you can actually bump that up to real time.
And so this is sort of the main technical part of the talk. So what's the idea here? The idea is,
normally if you do multi-scale detection, if you have some kind of feature, like let's say gradients,
you basically have to recompute that feature at every scale. And I will explain in detail why that is
in a second.
But basically if you down-sample an image and recompute the gradients, it's not the same as
taking the gradients of the image and down-sampling those. So you basically recompute, let's say you have HOG
features or whatever they are, and you build this dense image pyramid where you shrink the
image by 10 percent every time, or five percent every time.
And you do sliding window detection at each scale. If you think about it, it's very wasteful
because you're recomputing whatever features they are, and those can be pretty rich nowadays, at
every scale of detection.
This just doesn't seem the right way to go. So actually when Viola and Jones did their detection
framework over a decade ago, they got away without doing this.
Since all their features were basically sums of pixels, rectangle sums, to detect
a larger pedestrian they didn't have to down-sample the image and run the same size model;
they could actually change the size of their model. They could say, instead of doing a sum
of a 10-by-10 pixel rectangle, I'm going to do a sum of a 20-by-20 pixel rectangle, and I know if I
divide that by four, that's the same as if I shrunk the image and did 10 by 10.
That was one of the reasons Viola Jones was so fast over a decade ago. But now we
want to do kind of rich features; they're necessary to get good performance, and Viola Jones just
doesn't work that well. Its only view of the image is sums of grayscale values in
rectangular regions, which captures some notion of gradient but really is not rich enough to capture
the appearance of most objects. So things got slow.
And so is there anything we can do here? And intuitively you'd think, well, really is the gradient
going to change that much if I shrink the image by five percent? Is it really going to change
completely? Do we have to really recalculate everything? And the answer is no.
And so if we can approximate features at nearby scales, that's really what this is going to be
about, then we don't have to do this, we don't actually have to resample the image all the way.
We can actually approximate features at nearby scales. So we take our original classifier,
enlarge it a little bit by approximating what those features would have been at a larger scale, or
shrink our detector a little bit, and use a very, very sparse image pyramid.
If you think about it, if you think of feature computation as proportional to the number of pixels,
then in Viola Jones, if the number of pixels is N, the amount of computation was
on the order of N.
Here, with a dense image pyramid, it depends on the density, but it could easily be something
like 10 times N. If you do this kind of thing instead, well, it's N plus one-fourth N plus one-sixteenth N, et
cetera, and that comes out to be about four-thirds N if you do that infinite sum.
So basically what we'll be getting is a computational cost as if we were
working on the original image at one scale, but accuracy as if we
had the full rich features, whatever they are, at every scale.
And so how does that work? Well, first of all, most of the
features used in computer vision are shift invariant. This arrow here is a feature
computation.
And for the sake of this talk, I'm going to be mostly talking about gradient magnitude, but it will hold for
lots of features. If we compute a feature and translate it, that's the same as translating the image
and computing the feature. Not all features would necessarily have this
property, but most of the practical ones we would use do.
So this makes sense. But when you go to scale, this no longer holds. If you down-sample the
image and compute the gradient magnitude, that's not the same as computing the gradient
magnitude and then down-sampling it -- the order here matters a lot.
That's why we actually have to resample the image before we recompute the features. It's not
enough to just compute the features once and resample those.
That sort of intuitively makes sense. Again, there's just the effect basically that when you
down-sample, you're sort of blurring away information.
So if you have, say, high frequency signal in the image that goes away, so when you compute
your feature, which is depending on that high frequency signal, it's just not there.
Okay. So is there anything we can do? Because the information is actually lost. So at first
glance it seems like there's nothing we can do. But there actually is. And so let me tell you about
that.
First of all, for the sake of discussion, I'll talk about the gradient magnitude. I'll talk
about a scalar feature over an image: computing the gradient magnitude at each pixel and summing
those across the whole image. You get a single number out, the sum of
the gradient magnitudes of the image, some scalar H.
So when you upsample an image -- so you have an image, you compute the gradient magnitude,
compute the sum, you get H; you upsample it, and you do the same operation, you get H prime.
When you're upsampling an image, you're not actually creating any information. You're also not
losing any information.
So it seems like we should be able to predict the relationship between H and H prime
without actually having to upsample the image to get that value. And so what is that relation?
Well, it's not that hard to figure out. The gradient in the upsampled image is always half the
gradient in the original image. You're just stretching everything out: a high slope
in the original image becomes half the slope in the upsampled image.
So the gradient magnitude is half at the corresponding location, but there are four times as
many locations.
So we predict that the ratio should be approximately two. I say approximately because there are
some discretization effects, and therefore these relations only hold approximately.
This makes sense, and we went out and actually performed experiments to test it. And
these experiments don't show us anything surprising; it's what we'd expect. We went
and took a dataset of images, computed H and H prime -- so computed this scalar on the original
and on the upsampled version -- and histogrammed that ratio.
This is, for something like a thousand images, the distribution of those ratios. And in fact the
mean is around two, as we'd expect, and there's some variance because the relation is approximate.
And this is for different datasets of images: this is a dataset of just images containing people,
and this is images not containing people.
So these would be the positives and negatives, say, when you're training a classifier. And you in
fact get this relation that the mean ratio of H prime over H is two.
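A small sketch of that measurement (SciPy and NumPy are assumed; the images here are random placeholders, whereas the talk's experiment uses large sets of natural images, so the exact numbers will differ):

    import numpy as np
    from scipy.ndimage import zoom

    def grad_mag_sum(img):
        gy, gx = np.gradient(img)
        return np.hypot(gx, gy).sum()

    def upsample_ratio(img, factor=2):
        # H' / H: total gradient magnitude after upsampling vs. the original.
        up = zoom(img, factor, order=1)          # bilinear upsampling
        return grad_mag_sum(up) / grad_mag_sum(img)

    # Over an ensemble of images the ratio concentrates around 2 for a 2x
    # upsampling: roughly half the per-pixel gradient, four times as many pixels,
    # with some spread due to discretization and interpolation.
    imgs = [np.random.rand(96, 96) for _ in range(50)]
    ratios = [upsample_ratio(im) for im in imgs]
    print(np.mean(ratios), np.std(ratios))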
Okay. So far I haven't told you anything surprising. But now what happens when we down-sample
an image? Now we are throwing away information. First of all, what would happen
if the image were smooth? If the image were smooth, then we'd just have the previous case
with the arrows reversed: if we down-sample a smooth image, the ratio between H prime and H
would be one-half.
Okay, sure. But it turns out that when we actually did this, the ratio turns out to be about .32.
And this connects to work on natural image statistics: the statistics of an ensemble of images
and of a zoomed-in ensemble will only depend on the relative scale change, and not on the
absolute scales at which those two ensembles of images were taken.
And so this is work done in the '90s, and they were only looking at distributions of basically the
gradient. But really you would expect that if the world does in fact have this fractal structure --
that as you zoom in, the statistics of those scenes stay the same -- it shouldn't really matter what
type of feature you're measuring, whether it's actually gradients or anything else.
So if that's the case -- first of all, let me actually state the mathematical expression. If
you have some kind of feature you're computing over an image at scale S, the ratio of this for an
ensemble of images should only depend on the relative scale of S1 and S2. So it should only
depend on that ratio.
And this should hold on average: the expectation of the ratio of the feature at S1 to the feature at S2
should be just a function of S1 over S2.
As soon as you have something in this form, it's just a matter of algebra to show that the feature must
follow a power law. So the feature computed at scale S is going to be
approximately the feature computed at the original scale times the change in
scale raised to some power.
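A rough sketch of fitting that power law and using it to predict a feature at a new scale (again NumPy/SciPy, random placeholder images, and gradient magnitude as the example channel; on natural images the fitted values correspond to the ratios discussed in the talk, while on these placeholders they will not):

    import numpy as np
    from scipy.ndimage import zoom

    def grad_mag_sum(img):
        gy, gx = np.gradient(img)
        return np.hypot(gx, gy).sum()

    def mean_ratio(imgs, scale):
        # E[ f(downsampled by 'scale') / f(original) ] over an ensemble of images.
        return np.mean([grad_mag_sum(zoom(im, 1.0 / scale, order=1)) / grad_mag_sum(im)
                        for im in imgs])

    imgs = [np.random.rand(128, 128) for _ in range(50)]
    scales = np.array([1.25, 1.5, 2.0, 3.0, 4.0])
    ratios = np.array([mean_ratio(imgs, s) for s in scales])

    # If the ratio follows scale ** (-lam), the points fall on a line in log-log space.
    lam = -np.polyfit(np.log(scales), np.log(ratios), 1)[0]
    print("estimated exponent:", lam)

    # Predict the feature at a 2x downsampled scale from the original scale only.
    predicted = grad_mag_sum(imgs[0]) * 2.0 ** (-lam)
    actual = grad_mag_sum(zoom(imgs[0], 0.5, order=1))
    print(predicted, actual)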
>>: Would the effectiveness of this observation depend upon the feature? Because some
features, texture-type features, will obey this very well, right? Whereas other
features which might be dependent on step edges might not agree with this. Step edges won't be
0.34.
>> Piotr Dollar: That's an excellent question. So, first of all, there's nothing analytical that says
this has to hold. It's just what we would expect if images have sort of this fractal structure.
But everything we've measured it for has held. Maybe I can actually answer that on a slide
where I show this for a couple of different things. But we never actually tested something like Canny edges.
That would be a fun one to test, for exactly the reason you're saying.
So let me actually show that this holds, and I'll get back to your question. This is a log-log
scale, so if something follows a power law it shows up as a line.
This is the scaling factor. The experiment we performed where we actually did this
down-sampling and measured that the mean was .32 -- we can perform it for lots of different scale
changes, and each becomes a data point.
So for down-sampling by a factor of two we had a data point at .32. We did this for
lots and lots of data points, and it turns out that a line is a really beautiful fit to this data.
So it really does follow a power law. And we can look not just at the mean but also at the
spread of the distribution, and the spread gives you a sense of the error for an individual image.
And you can imagine that the further away in scale you're looking, the bigger that
spread will become, so the more error will be introduced. That's exactly what we see:
at a factor of two there's some spread, and it's different for these two datasets,
and as you increase the scaling, that spread increases. So this is the error you get for
individual images. Everything I've said holds for an ensemble of images, and
a particular image might fall on either side of that. So this is the error.
So this tells you basically the further away you're predicting the more error you're going to
introduce. And I'll come back to this. I wanted to go to this, because this is Larry's question.
So different feature types. These are three different feature types we
tried, and we've always seen this linear fit. The only thing that changes is the slope of the line;
the power law is always obeyed as far as we've seen. So that works. But then for
individual images, you can see that the scale of the error varies: this goes up to .5, this
goes up to .8, that one goes up to .6.
So the rate at which the error increases does vary with the feature type.
One question, like Larry is suggesting, is whether there are certain feature types for which it increases
much, much faster.
And that could be the case. That's a really good question. So here is sort of a visualization of
this. This is computing histograms of oriented gradients.
For every orientation you want to compute the amount of gradient magnitude at that orientation:
for all the locations in the image with gradient at a certain angle, you take those, you sum the
gradient magnitudes, and you build a histogram.
We can do that for the original image and upsampled version of the image and down sampled
version of those images.
And we can correct for scale -- the power law tells us that, for example, when you double the scale of
the image we'd expect there to be twice as
much gradient, and when you halve the scale there would be .32 as much gradient, so we can correct
for that.
So you can remove that effect, and that's what we've done. So this is a histogram. These are the
orientation bins, and the colors correspond to the original, upsampled, or
downsampled image.
After correcting, you basically see that the histograms are the same.
So we could have computed the gradient magnitude at any one of these scales, corrected for
this effect we now know, this power law, and there would be some error, but not much.
So for this type of image, it really holds. And here are two other images where the predicted
bars are in all cases relatively similar.
So for these three images it holds. Here's an example of an image where this completely fails.
So this is a highly textured image. And so when we downsample that image we basically throw
away almost all the gradient information. And so even though we know we're going to lose some
gradient when you go downsampled we wouldn't have predicted we would lose that much.
So in this case the approximation completely fails. And so if images in the natural world looked
like this, this whole method wouldn't work.
The whole reason the method works is because natural images follow natural
image statistics. This isn't a natural image, basically. It's a blown-up texture, a zoomed-in
texture. In fact, in a room like this I can imagine zooming in on lots of regions where it's just a
uniform texture, and that really has very different statistics.
It really doesn't have this property. If I take a picture of this room or of somebody's face, there are
objects at different scales. There's lots of stuff going on.
If I zoom in on a texture, say, or a smooth region, it really has almost nothing going on. It loses
that sort of fractal nature that we're depending on for this to work.
Okay. So basically, without going into further details, we have a way, if you compute features at
one scale -- and again I've shown a lot of this for gradient magnitude, but it holds for all kinds of
features -- to predict what those feature responses are going to look like at nearby scales.
We don't want to predict too far away; let's say we don't want to predict more
than a factor of two away, or even a factor of square root of two. So we can compute this very
sparse image pyramid and then just upsample or downsample our classifier very slightly.
And that allows us to basically get rid of this whole step of recomputing features at every
scale. And this really should be applicable to any detector, because really the core idea is that
you don't have to recompute your features.
And so this is really a universal finding that we exploited here for our detector, but it's really
applicable anywhere.
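A sketch of how such a sparse feature pyramid might be put together (this is an illustration, not the actual implementation; the channel, the exponent lam, and the correction used here are stated assumptions):

    import numpy as np
    from scipy.ndimage import zoom

    def channel(img):
        # Example channel: per-pixel gradient magnitude (any channel type would do).
        gy, gx = np.gradient(img)
        return np.hypot(gx, gy)

    def sparse_feature_pyramid(img, scales_per_octave=4, n_octaves=3, lam=1.64):
        # Compute channels from a resampled image only once per octave ("for real");
        # approximate the scales in between by resampling the channel itself and
        # applying a power-law correction.  'lam' is the exponent measured offline
        # for this channel type (the value here is an assumption).
        pyramid = {}
        for o in range(n_octaves):
            base = 2.0 ** o
            real = channel(zoom(img, 1.0 / base, order=1))
            for k in range(scales_per_octave):
                r = 2.0 ** (k / scales_per_octave)   # extra downsampling within the octave
                # Resampling the channel by 1/r shrinks its total by ~r**2; the power
                # law says the true total at this scale is ~r**(-lam) of the base,
                # hence the r**(2 - lam) correction on the resampled values.
                approx = zoom(real, 1.0 / r, order=1) * r ** (2.0 - lam)
                pyramid[round(base * r, 3)] = approx
        return pyramid

    pyr = sparse_feature_pyramid(np.random.rand(240, 320))
    print(sorted(pyr))   # one approximated channel image per scale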
So here's a different way of visualizing those same results I showed in the ROC plots earlier. On the Y
axis is performance. On the X axis is speed.
And so lower here is better and to right is better. So this is a previous detector from -- this is the
one I described originally, and this is the sped-up version.
And so you see that there's a big difference in speed and almost no difference in performance.
And this is for one particular dataset, one particular setting but we've run this on six different
datasets, and always the sped-up version is within 1 percent accuracy of the original one. So this
prediction rule, coupled with learning methods that are robust to a little noise in
the features, really lets you save computational time without losing accuracy.
And you can see where sort of the other methods fall. This is another method that gets
slightly different performance, exploits motion information, but it takes -- I should say it takes 256
seconds for one image, and really that's not where we want to be.
And for a slight hit in performance, this is where we're at. And you can see
where the other methods fall; these are all methods from the literature.
So we really managed to speed things up while still maintaining top accuracy.
Okay. How am I doing on time here? I've been talking for about 40 minutes.
I might actually skip over a little bit of pose and focus on tracking and behavior.
But basically I'll just say what the method is, without going into too much technical detail,
and what the key property of the method for pose estimation is. This is where you
want to estimate the geometric configuration.
So detection is the first step where you find the objects. But now we want to extract more
information about those objects.
And so pose is the first thing we'd want to extract, but again we want an approach that's basically
data-driven, accurate and fast. So data-driven in the sense that somebody gives us labeled data
and we learn from that so we don't have something specialized to a particular domain.
Accurate for obvious reasons, and fast: in this case the method takes about one to five
milliseconds to do an estimation for a single location.
Which is fast enough -- I'll show some applications combining detection and pose estimation -- it's fast
enough to do real time pose tracking.
The key thing behind this method, like I say I won't go into too much detail. So pose estimation is
essentially a regression problem. Where you have your data and you're trying to get some kind
of continuous variables out which describe your data.
So you could set it up as a learning problem. You have your training examples and you have
your output pose. And it's just a regression. It's just a function from your input to some
continuous multivariate space.
The problem with setting it up like that is that pose is really something much more than an arbitrary thing
to regress to. Pose is a geometric configuration: you can go back to the image and
plop down your pose estimate, and for a given pose estimate
you can take new measurements, conditioned on that pose, from the image that tell you basically
whether your estimate of the pose is good or not.
So it's very different than, let's say, trying to regress and predict the age of a person. You
can do that, it's a learning problem, and in some ways it's the same learning problem, but in
some ways age is an arbitrary number: once you get an age estimate for
someone, there's nothing obvious you can go back and measure as new features. Maybe there is, but at least it's not as
obvious. With pose, a geometric configuration, it's very obvious how to go back to the data and
get new features.
And by setting up the problem like this -- instead of having a generic regression
problem, having this special problem where you actually know it's pose -- we can get good
results using something like 50 training examples instead of a thousand.
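A toy illustration of the pose-indexed-feature idea (this is not the actual method or its features: the "image" is a synthetic blob, the pose is just its center, ridge regression stands in for the learned regressors, and all parameters are made up):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    def render(center, size=32):
        # Toy "image": a bright blob whose center plays the role of the pose.
        yy, xx = np.mgrid[0:size, 0:size]
        return np.exp(-((yy - center[0]) ** 2 + (xx - center[1]) ** 2) / 20.0)

    def pose_indexed_features(img, pose, offsets):
        # Sample the image at fixed offsets *relative to the current pose estimate*;
        # the features are recomputed every time the estimate changes.
        pts = np.clip(np.round(pose + offsets).astype(int), 0, img.shape[0] - 1)
        return img[pts[:, 0], pts[:, 1]]

    offsets = rng.integers(-8, 9, size=(40, 2))
    images = [render(rng.uniform(10, 22, size=2)) for _ in range(200)]
    truths = np.array([np.unravel_index(im.argmax(), im.shape) for im in images], float)

    # Train a short cascade: each stage regresses the remaining pose error from
    # features indexed by the current estimate, then the estimate is updated.
    poses = np.full((200, 2), 16.0)          # crude initial guess: the window center
    cascade = []
    for _ in range(5):
        F = np.array([pose_indexed_features(im, p, offsets)
                      for im, p in zip(images, poses)])
        stage = Ridge(alpha=1.0).fit(F, truths - poses)
        poses = poses + stage.predict(F)
        cascade.append(stage)

    # Test-time refinement on a new image follows the same loop.
    test = render(np.array([13.0, 20.0]))
    p = np.array([16.0, 16.0])
    for stage in cascade:
        p = p + stage.predict(pose_indexed_features(test, p, offsets)[None])[0]
    print(p)   # should land near (13, 20)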
And we can get very good results. So I won't, again, talk about sort of the details of that work. I'll
say a little bit about it in the context of applications to track where we're actually tracking the pose
over time.
So one of the things I've been involved in at Cal Tech is collaborations with groups and biology
that want to do animal tracking.
So looking here, these are just different domains that we've looked at. So these are all
collaborations with different groups at Cal Tech.
The video seems to be stuttering a little bit. Let me see if I can reset it. You see in some ways
the domains are easy, because we know it's not arbitrary images in the world. There are specific
sets of images.
On the other hand, there's a very large diversity in type of setups we wanted to work with, and
this is one of the things where having a very data-driven approach helps us because we can
basically go into any domain, train our algorithms for that domain and we could be sort of
agnostic as to exactly what that domain is.
And so it lets us produce very effective tools. So basically what we would do is have
users -- sorry about that -- have users go and label data for us. Right? No problem. That's easy
for biologists to do.
We'd then go and train classifiers and pose estimators for these objects. These are mice. We
learn the standard, is it a mouse or not a mouse. If it's a mouse, then learn the pose estimation
for it.
And then basically what we do is run that independently on every frame of the video,
exploiting this notion of tracking as repeated detection.
So we run detection on the whole video, and for every detection we run the pose
estimation, and then we try to find a sequence of poses that explains away all of these
observations. What that lets us do is deal with the fact that there are going to be hallucinated detections
sometimes.
Our detectors aren't perfect. And there's going to be misdetection. So this mouse in the corner is
barely visible. But using temporal information, we can make up for a lot of those mistakes.
And so the idea -- I'll just go through this briefly -- is that you have a bunch of observations at each
point in time T, and you maybe have some kind of score for each observation, and you want to
solve for a track, so at every point in time T you want to assign the pose of the object.
Here the pose is just represented as a one-dimensional number, but the
pose can be some arbitrarily complex pose estimate, as long as you have some notion of
distance between poses.
And basically you can set it up as a big optimization problem where the goal is to find a
solution that explains most of your observations, and then you can hallucinate
missing detections and get rid of the false positives.
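A minimal sketch of that kind of optimization (a plain Viterbi-style dynamic program over per-frame candidates; the real system also handles missed detections, multiple animals, and richer poses, none of which are shown):

    import numpy as np

    def best_track(candidates, scores, smooth=1.0):
        # candidates[t]: candidate poses at frame t (1-D here for simplicity);
        # scores[t]: detector confidence for each candidate.
        # Dynamic programming picks one candidate per frame, maximizing total
        # detection score minus a smoothness penalty on pose changes.
        T = len(candidates)
        cost, back = [None] * T, [None] * T
        cost[0] = -np.asarray(scores[0], float)
        for t in range(1, T):
            trans = smooth * np.abs(candidates[t][None, :] - candidates[t - 1][:, None])
            total = cost[t - 1][:, None] + trans        # previous -> current candidate
            back[t] = total.argmin(axis=0)
            cost[t] = total.min(axis=0) - np.asarray(scores[t], float)
        track = [int(cost[-1].argmin())]
        for t in range(T - 1, 0, -1):
            track.append(int(back[t][track[-1]]))
        track.reverse()
        return [candidates[t][i] for t, i in enumerate(track)]

    # Three frames of noisy detections; the spurious detections near 9 and the
    # low-scoring one at 5 are dropped because they don't form a consistent track.
    cands  = [np.array([1.0, 9.0]), np.array([1.2, 8.8, 5.0]), np.array([1.4])]
    scores = [np.array([0.9, 0.8]), np.array([0.7, 0.9, 0.2]), np.array([0.95])]
    print(best_track(cands, scores))   # -> [1.0, 1.2, 1.4]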
And one of the key things that actually let us make this work well: normally in object
detection, if you have a sliding window detector, what happens when you have an object and
a lot of windows nearby that contain that object? The detector fires in every one of
those windows.
So then you have some kind of post-processing heuristic, called non-maximal suppression, where
you go and say, well, all of these probably belong to one object.
The really nice thing here is we don't have to make that decision ahead of time. The
non-maximal suppression can be folded into the tracking, so we don't
have to decide ahead of time how to group those detections. Let's say we have
a bunch of detections close together which were actually due to two objects; we'd lose one of them if we
just applied some non-maximal suppression heuristic. By folding it into the optimization of the
tracking, it solves both problems at once and gets much better results.
That was one of the key observations. And we've run this; running it on, say, a 24-hour
video takes about 24 hours. It's a batch process, about real time, it just
runs offline.
And we've tested this over hundreds and hundreds of hours of video. And in fact we've had
two publications with biologists, a recent Nature paper and another work, where we've really been
able to do some nice biology, some nice science, with this. I won't tell you too much about that.
I'll just show you some sample results. This is tracking the center position. And, again, we run this
over hundreds and hundreds of hours, so it has to be very robust. If you lose the object you have
to regain it, and we've tested it under these scenarios and it really works quite well.
So these are mice. This was for the Nature study. Here's another example. Actually, this is
one of my favorites. Sorry about that. This is one of my favorites, in the corner here.
So this is actually tracking pose, in which case it's just an ellipse with an orientation.
And you can see the biologists reach in with their hand in the middle of the video and put that
plate in, and the lighting changes, everything changes. And maybe we'd be tempted to tell
our users, the biologists in this case, you can't do that, our algorithm won't work if you
stick your hand into the cage.
But really it's our job to make that work. And I really like the video because it just happens:
the hand reaches in, the lighting completely changes, and the algorithm is completely unfazed.
And I think that's really -- those are the techniques you need to push for in vision.
Here's another example. This is some work of Gathal [phonetic] on tracking pose of fish.
So I just wanted to very quickly tell you about my work in behavior recognition before wrapping up
here.
So again extracting a bit more information about what the objects are doing. So this is -- faced
with a series of domains like this, you can ask, well, so here's some maybe two training examples
and one testing example on the right. Again, it's stuttering.
There we go.
And so you can ask, well, what's similar about the video in the right
column and the two in the left column? And the way we approached this, and something that's proven very,
very effective and has become fairly widely adopted in the community, is basically by saying, well,
for object detection, one of the things people do is treat an image as a bag of
words, a bag of patches. We said, let's treat video as a bag of spatio-temporal words.
You take a video and you run separable linear filters: Gabor filters in time
and Gaussian derivatives in space.
It's a linear operation over the video,
and you look for maxima of the response of these linear filters.
So it's something equivalent to, say, SIFT, where you're trying to find interest points in the
spatial domain, but now you're doing it over space and time.
And so now you have a video and you can represent it by space-time interest points.
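A rough sketch of that kind of space-time interest point detector (the filter shapes, frequencies, and threshold here are choices made for illustration, not the published settings; spatial Gaussian smoothing stands in for the spatial filtering):

    import numpy as np
    from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

    def response(video, sigma=2.0, tau=2.0):
        # video: (T, H, W) grayscale frames.  Spatial Gaussian smoothing per frame,
        # then a quadrature pair of 1-D temporal Gabor filters; the two squared
        # responses are summed into a single response map R(t, y, x).
        v = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
        t = np.arange(-int(2 * tau), int(2 * tau) + 1)
        omega = 1.0 / (2 * tau)                  # temporal frequency (a chosen value)
        env = np.exp(-t ** 2 / tau ** 2)
        r_ev = convolve1d(v, np.cos(2 * np.pi * omega * t) * env, axis=0)
        r_od = convolve1d(v, np.sin(2 * np.pi * omega * t) * env, axis=0)
        return r_ev ** 2 + r_od ** 2

    def interest_points(resp, size=5, thresh=None):
        # Space-time interest points are local maxima of the response map.
        if thresh is None:
            thresh = resp.mean() + 3 * resp.std()
        peaks = (resp == maximum_filter(resp, size=size)) & (resp > thresh)
        return np.argwhere(peaks)                # rows of (t, y, x)

    video = np.random.rand(30, 64, 64)           # placeholder clip
    pts = interest_points(response(video))
    print(pts.shape)    # cuboids would be cut out around these points, described,
                        # and quantized into a vocabulary of spatio-temporal words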
And so now you can do kind of a bag of words approach where you have video. You detect
these interest points. You can then form clusters of these. I mean, very analogous to bag of
words.
Look at sort of the histogram -- and the video is lagging again. Come on, video.
Well, it's going to be chunky, I guess -- and you can characterize a new behavior based on the
presence of certain words and the absence of others.
So this is work we published in 2005. And since then this kind of bag approach to video, bag of
words approach to video, has become very widely adopted. So this kind of thing is used with sort
of a lot of things on top. And a lot of details have changed in how people do it.
But this sort of framework has become one of the de facto ways of doing behavior recognition
now.
And, in fact, this work was published in 2005, so I could show you some results from 2005. Instead I
took a paper from last year, from '09, from a separate group that did an evaluation. They
refer to our method as cuboids. And mind you, this is five years later: they were
taking our method from five years ago, code we had online, and comparing it to lots of different methods.
Turns out we're not doing that bad. Our base algorithm of five years ago,
with some minor tweaks in terms of the descriptor, is still performing well. These are
three different datasets, and it's still up there in performance.
This actually surprised me, because I thought in five years we would have been blown out
of the water by now. We introduced this idea of using bag of words for video, for recognition, but we
kind of thought that by now so many things would have changed.
Turns out it's still working reasonably well. Okay. So that sums up the technical content. I won't
keep you guys here much longer but I sort of wanted to say what are sort of some of the next
steps here that I plan on pursuing.
So one of the things is I've talked about sort of these different modules a bit in isolation. So I've
talked about detection. I've talked about pose estimation and then tracking, which takes the
poses and the detections and then does the tracking and the behavior.
And I was talking about them very independently. This is all information
we're extracting about an object, but it's sort of a linear, sequential pipeline.
And really I don't think that's the right way of doing it. For example, when I was showing you
the results on detection, it wasn't correctly getting this guy, because he's in a weird pose. So
doing detection and then pose estimation, having the arrow go only this way, I don't think is good
enough. I think when we do detection, we want to have a notion of pose. So we really want to
have one feed into the other, combine those more closely.
So have features that are conditioned on the pose, which is what we're doing for the pose
estimation, but go back and actually have those help us with the detection.
Likewise, for example, the tracking: we're using the detection and the pose to feed into
tracking. So we do independent detection and independent pose estimation every frame,
and then we basically merge that over time to give us a track. But really, if you have video,
you should be able to do so much more in terms of actually going from the tracks and then
helping your detector.
You can imagine different ways of doing that. You can imagine it as sort of a semi-supervised
form of learning, where if you have a video and you have a detector that sort of works, you can
specialize it to that video or use it to generate more training data.
There's lots of things you can do. And again, behavior and tracking: right now, those
results on behavior recognition I showed assumed that you basically had one object in the
scene, and it was moving, but there wasn't much other stuff moving and there weren't multiple
things interacting.
And really that's too naive. You really want the tracking to feed into that.
You really want all of these to be merged more closely together. So, for example, like the
detection and the pose is something I want to pursue very short term in terms of getting the pose
to help you with the detection. I think that would really help. Maybe not so much on the upright
pedestrians when they're small resolution, but when there's sort of more articulation and high
resolution.
The other thing is, this is such a rich problem area. I've presented four different types of
information you might want to extract. But really there's so much information to capture and so
much information we want to extract about objects in a scene. There's so much we can do with
this, and that's really the direction I want to go: to be able to capture a lot more about the
objects. Not just finding the objects and saying some basic information about them -- there's so much rich information in the visual world -- but really trying to capture more of that.
So before I conclude, I want to say I've been involved in a lot of other projects in
a lot of different areas, and if anybody's interested in discussing these offline or in
one-on-one meetings: I've done a lot of work applying learning to problems where traditionally
more hand-designed algorithms have been used, proposing machine learning instead, for example for edge
detection and learned descriptors. I've also been involved a lot in manifold learning; we had
some work on that. And I've been very interested in extensions of learning
beyond just here are your positive and negative examples -- for example,
multiple instance learning, or cases where your data, instead of being independent
instances or instances in a bag, actually falls on a manifold.
If any of these things interest people, I'd be very happy to talk about these one-on-one. All right.
Thank you.
[applause]
So questions?
>> Larry Zitnick: All right. Thank you.
>> Piotr Dollar: All right. Thank you.