>> Rick Szeliski: Good afternoon, everyone. It's a pleasure for me to introduce
Pascal Fua. Pascal is a professor at the Ecole Polytechnique in Switzerland.
And we go back a long way. Pascal was working at SRI International in Menlo
Park when I briefly worked there in 1989. And he's one of the pioneers in
multi-view stereo. He's done some of the most interesting work in person
tracking. And over the last decade he's done a tremendous amount of work on fast and
efficient feature descriptors, which is what he's going to talk about today.
Another interesting topic is that his work on deformable surface modeling led to
his being hired as a consultant by the Swiss National Racing Team for sailboat
racing so he has written software to spy on other people's sailboats and figure
out why their sails are working so well. It's a great pleasure to introduce Pascal.
>> Pascal Fua: Thank you. Hi. So today I'd like to talk about our work on
binary descriptors. And to give you some context, to tell you how we got
interested in this, let me tell you something you probably already know, which
is that one of the things we are all interested in doing is being able to walk
around a city, take pictures of landmarks with our cell phones, point at a
particular statue, in this case this one, and have some information pop up
about the statue. Which means, of course, that we need two things.
We need a model of, in this case, the cathedral. It has to be annotated. And it
has to be precisely registered with the image we just took, so that the phone
will know that when you point it at that particular pixel, you are talking about
this particular statue.
So there are two main components in this. One is the 3-D models. And
preferably they have to be large scale because in the end we want to do a whole
city, a whole country, the whole world.
Eventually it's going to become very, very big. And once you have that, you want
to be able to take the image you just took and match it against these potentially
extremely large models.
So first let me say a few words about the models themselves and how we build
them. And in a sense this is a follow-up of things that have been done here.
It's Photosynth taken maybe to the next level. And the problem when you try
to do this is -- here are a few pictures of Lausanne, where we are; a number of
graduate students went around with digital cameras and took pictures of the
city.
Typically what happens when you do that is you get pictures of landmarks like
the cathedral taken from very different angles; they don't look the same at all.
And it's not trivial to match them.
And to produce 3-D point models like those. And I know that there's this very
famous paper about reconstructing Rome in a day, but it didn't really reconstruct
Rome. It reconstructed three very important landmarks in Rome, which is slightly
different.
And that's actually fairly typical of state-of-the-art approaches, which tend to
reconstruct disconnected clusters, because mostly you have lots and lots of
pictures of a few specific locations, but in between those locations you have
very few pictures, it's very sparse, and most bundlers tend to disconnect them.
Then, of course, there's the problem that if you're going to do bundle adjustment
at very large scale, it tends to explode unless you're very careful about it.
And finally a lot of the 3-D reconstruction techniques tend to choke if you give them
too many images over too wide a range. So these are things we've been trying
to work on. And one of the things we've paid much attention to is being able to
register images like this which are taken from very, very different viewpoints.
And to do this, what we've done is taken advantage of the fact that most images
you can find have some form of geotag, GPS data, and also that for most cities
in the world you have a cadastral map, a map of the buildings. It may or may not
be accurate. But it gives you a pretty good idea of what's there.
And so the system we've developed essentially takes all the images that you
have -- of Lausanne, for example, we have approximately 20,000 -- and, like most
of the other state-of-the-art systems, essentially groups them into clusters. But
then, when we do the bundle adjustment, we use the GPS data and the cadastral
information to make sure that the clusters align with each other and with the
cadastral map, so that we can build a large-scale, consistent model of the whole
city.
And, of course, it has to take into account the fact that sometimes they're not
there and sometimes they're wrong. But when you do vision, that's normal;
dealing with outliers is normal.
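Just to make the kind of alignment step being described concrete, here is a minimal sketch of fitting a similarity transform between a cluster's reconstructed camera centers and their geotags with a least-squares (Umeyama-style) fit. This is not the actual system; the function names and data layout are assumptions, and in practice such a fit would be wrapped in a robust estimator to cope with the missing or wrong tags just mentioned.

    import numpy as np

    def fit_similarity(src, dst):
        """Least-squares similarity (scale s, rotation R, translation t)
        mapping src points onto dst points; both are N x 3 arrays."""
        mu_s, mu_d = src.mean(0), dst.mean(0)
        X, Y = src - mu_s, dst - mu_d
        U, S, Vt = np.linalg.svd(Y.T @ X / len(src))
        D = np.eye(3)
        if np.linalg.det(U @ Vt) < 0:      # avoid reflections
            D[-1, -1] = -1
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / X.var(0).sum()
        t = mu_d - s * R @ mu_s
        return s, R, t

    # camera_centers: positions of one cluster in its arbitrary SfM frame
    # geotags:        the corresponding GPS positions (noisy, sometimes absent)
    # s, R, t = fit_similarity(camera_centers, geotags)
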
And so we can build reconstructions like this one. What's interesting about this
is that we've used very different kinds of images from different sources. Some
from drones, some ground-level images taken with one particular camera, and some
images where we zoomed in on the cathedral to produce the high-res parts.
And actually here we need some help from the graphics folks. You probably saw
it when we switched resolution, there was a jump, and really there shouldn't be a
jump. But that's us poor computer vision folks; we don't know what to do with
this.
Anyway, so the point of this is we build these models. This one is not particularly
big, actually, but you can build much bigger ones with thousands and thousands of
images.
And you can do it. But now, if you actually want to use it, what's going to be
needed is a way to take a new image and register it against this model. So what
is in this model? It has 3-D points, and for all the 3-D points it's going to
contain descriptors. Typically they could be SIFT descriptors, which is still more
or less the state of the art.
And that's fine if you have unlimited amounts of memory. But typically if you're
on a cell phone you don't have unlimited amounts of memory. And if the model
is really, really big, with millions of descriptors, you end up with a problem.
You need to be able to have smaller descriptors so you use less memory and
also so that you can do the matching faster. And that's actually where the binary
descriptors come in. What I'm going to argue is that if you replace the floating
point descriptors by binary descriptors, you gain in memory size. You gain in
speed. And you don't really lose in accuracy. So they are a good thing.
So the program is to try to show that binary descriptors are much more
compact, that they lead to faster matching, because essentially the
Hamming distance is faster to compute than the Euclidean distance, and that if
you compute them correctly, you don't really lose with respect to the more
traditional ones like SIFT or SURF.
So this is actually something we've been looking at for a long time. And this is
now a fairly old video, where what we used was something called FERNS, where
we had actually trained a classifier to recognize interest points on the car. So
when the car is hidden, it doesn't find it, but every time the car comes back
into view, it does the matching very, very fast and redetects the car.
And in this particular video there is no temporal consistency imposed.
Detection is done at every frame independently.
And the principle behind this was a classification-based approach to
matching, where if you take a key point here, the corner of the M, for example,
you know that if the surface is locally planar, all possible appearances of that M
or that corner are going to be this patch up to some [inaudible]. So what we did
was to produce a training database of all the possible views of that key point.
I mean, not all possible views, a representative sample. And then we trained a
decision-tree-based classifier to recognize those key points, so that you would
drop a patch at the top of a tree and then you would ask questions like: Is this
pixel brighter than this one? If yes, you go left; if no, you go right.
And we trained the classifier to do this, and the point is that doing this
classification is extremely fast. In a sense it is already a binary descriptor,
because it's all based on these simple yes/no questions.
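To make the yes/no tests concrete, here is a minimal sketch of a single fern-style classifier of the kind being described. It is not the actual FERNS implementation; the patch size, the number of tests, and the way the answers are turned into an index are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    PATCH = 32                                 # assumed patch size
    N_TESTS = 10                               # tests per fern (hypothetical)
    tests = rng.integers(0, PATCH, size=(N_TESTS, 4))   # (y1, x1, y2, x2)

    def fern_index(patch):
        """Ask 'is pixel A brighter than pixel B?' for each test and pack
        the yes/no answers into one integer."""
        idx = 0
        for y1, x1, y2, x2 in tests:
            idx = (idx << 1) | int(patch[y1, x1] > patch[y2, x2])
        return idx

    # Training: drop many warped views of each key point through the fern and
    # histogram the indices per class; at run time the class with the highest
    # posterior for the observed index wins.
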
And so we could train it either, if the object was truly planar like the
cover of the book, by just synthesizing homographies of the key points, or, if the
object was nonplanar like the car -- yes, this is a video -- by essentially
showing the system a video of the object, a video in which the motion was really
nice and slow so that tracking was easy, so that we could generate our training
database in this way.
So that worked nicely. And some of our colleagues in Graz a while ago
implemented it on a cell phone, and we got real-time performance on that
phone.
But the problem with this is that it requires a lot of training. So the runtime
performance was good, but training took a long time. So there was no real way,
with that particular algorithm, to just show it an image, very quickly say learn
it and use it, and then, having trained the trees, try to find it again. Which is
why more recently we moved to something that's simpler than this, which is the
true binary descriptor we play with, something called BRIEF. And BRIEF is really,
really simple. What you do to describe a patch is: you take the patch, you smooth
it.
And then, on this smoothed patch, you take pairs of points, and for each pair you
ask: Is this one brighter than that one? If the answer is yes, it's a one; if the
answer is no, it's a zero. And typically you choose about 256 of these pairs,
which means you boil your patch down to a 256-bit vector.
So this is really simple. And the surprising thing is that it works very well.
It's a pretty powerful descriptor -- well, I'll show you -- and you can use 128,
256 or 512 of these tests, and that's basically enough.
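As a rough sketch of the idea -- the patch size, smoothing, and sampling pattern below are assumptions, not the published settings:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    rng = np.random.default_rng(0)
    PATCH, N_BITS = 48, 256                   # assumed patch size, 256 tests
    # random test locations, slightly biased towards the patch center
    pts = np.clip(rng.normal(PATCH / 2, PATCH / 5, size=(N_BITS, 4)),
                  0, PATCH - 1).astype(int)

    def brief(patch):
        """256-bit BRIEF-style descriptor: smooth, then compare pixel pairs."""
        smoothed = gaussian_filter(patch.astype(float), sigma=2.0)
        y1, x1, y2, x2 = pts.T
        return (smoothed[y1, x1] < smoothed[y2, x2]).astype(np.uint8)

Matching then amounts to counting the bits in which two such vectors differ, which is the Hamming distance discussed below.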
And another surprising thing is how you choose the tests. We tried many,
many different strategies. In the end we were not able to truly improve upon
doing it just randomly.
Just pick a bunch of random tests. We got a smidgen better by biasing the
probability a little so that the tests are more towards the center, but the results
vary only slightly and not terribly significantly.
>>: Did you try favoring middle-sized segments over really short ones or really
long ones? Because it seems like one-pixel-long segments don't have much
descriptive power, and for very long ones maybe the patch is too small.
>> Pascal Fua: We could, but in fact we didn't. Again, we could try, but I
suspect it might yield a very small improvement, nothing terribly significant.
Essentially what this is computing are derivatives. You can think of these as
gradients.
And the long ones are gradients over longer distances -- which is why we need the
smoothing, by the way; otherwise they wouldn't be meaningful gradients. And so we
first tested that on some of the standard benchmark datasets, with some mostly
planar structures and some that are not planar, for which we have LIDAR data.
And so if you compare it to SURF, it does better. I won't go into the detail of all
the curves, but typically we get recognition rates that tend to be better. In the
test we take 512 points in one image, 512 in the other, and we measure how
many of the correspondences established using these descriptors are the right
ones.
So the percentage is somewhat higher. But something you should note is that what
we are comparing against mostly is something called U-SURF. SURF is designed to
be orientation invariant; BRIEF is definitely not orientation invariant. What we
compare against is U-SURF, which is SURF without the orientation invariance, and
which actually does better than SURF when orientation invariance is not needed.
Important detail: you pay a price for orientation independence.
So in applications where you don't need it, don't use it. Typically, if you have a
cell phone where you know the orientation because you have an accelerometer,
you should use a non-orientation-invariant descriptor.
So we tend to do better on these benchmarks. And, of course, SURF is essentially
a fast version of SIFT with some loss of performance. So on some of the
benchmarks we actually still do better than SIFT, but not on others; on others
SIFT is still more powerful than BRIEF. So it's a very hand-wavy statement, but
roughly, based on these tests, the recognition accuracy of BRIEF is somewhere
between that of SIFT and SURF.
The other thing, especially for all the students here: be careful with
benchmarks. Depending on how you present these graphs, I could have argued
almost anything, just by choosing the right graph to show you.
But that's actually not the point of BRIEF. The point of BRIEF is not to be more
accurate than SIFT, it's to be much faster. And that it is.
And here are some computation times. This was done, I think, on a
Mac like this one. And here is the kind of time it takes. So if you use SURF, it
takes a certain amount of time.
And if you use BRIEF, it's, of course, much, much shorter. Now, there's a version
of SURF on the GPU, which is much faster than the version of SURF without the
GPU. But the point here is that we can do better than SURF on the GPU without a
GPU. This is still a CPU.
And one of the things that speeds it up a lot is that on the new CPUs you can
compute the Hamming distance, which is what you use, extremely fast.
It's one instruction. So that's one of the reasons why you want to use binary
descriptors.
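In other words, once the descriptors are packed into machine words, a distance is just an XOR followed by a bit count (the POPCNT instruction). A minimal emulation in Python, with a lookup table standing in for the hardware instruction:

    import numpy as np

    _POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

    def hamming(a_packed, b_packed):
        """Hamming distance between two descriptors stored as byte arrays
        (e.g. np.packbits output, 256 bits -> 32 bytes). On a real CPU the
        XOR plus bit count is a handful of POPCNT instructions."""
        return int(_POPCOUNT[np.bitwise_xor(a_packed, b_packed)].sum())

    # a = np.packbits(brief_bits_a); b = np.packbits(brief_bits_b)
    # d = hamming(a, b)
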
So one more word maybe about scale and orientation invariance. Definitely BRIEF
is neither scale nor rotation invariant, but it's very fast. If you need scale and
rotation invariance, what you can do is, for each key point, learn rotated
versions of it.
So, of course, you pay a price. But it's an acceptable price. And this is the
kind of thing it will do -- a very simple demo where you just show a thing to the
computer, you say here is my area of interest, and then immediately, because
there is no learning, you can start tracking it.
And, as I told you, don't believe benchmarks; try the code. And actually we
have heard some of you already have. So that is what I just described.
And that was a paper we published at ECCV, and we sent it for journal
publication, and the reviewers told us, that's all very nice, but why don't you try
this on a more challenging and newer benchmark dataset, like the Liberty
dataset, which actually has interesting properties: small images of the Statue of
Liberty, and it's 3-D, and it's true that it's more complicated than the ones I've
shown before. In particular it has many more images and many more key points.
So we did. And here are the results we get when we use more and more
key points. The previous examples were done with 512 key points in each
image. So it's kind of taking a cut through these slides: we have SIFT here,
we have SURF here, and we have BRIEF here in the middle.
And what happens is that as you increase the number of key points, essentially
there is a fairly big difference between SIFT and BRIEF.
And actually there's something we didn't like too much, but that's life: there is
a crossover. We can still beat SURF, but we have to use more bits, which, after
all, is not completely unreasonable.
But the problem is that if we're going towards more key points -- and I didn't
plot what happens there -- the descriptive power of BRIEF is not sufficient.
Because in the end the problem I'm trying to address is not matching a few
hundred key points. It's going into a database where I'm going to have millions
and millions of them. So in that case, to search through these large databases,
we're going to need something more powerful.
And the way we tried to get that is by going back to our Lausanne model, back
to what I discussed earlier. We have built a model of Lausanne with 4,500 images,
a million or so 3-D points, and ten million feature points, because a lot of the
3-D points are visible in many images.
So this is actually a pretty good database, and it has an important feature that I
don't think many people have exploited, but it's an important one: if you are
reasoning in terms of 3-D points, you know that this point and that point and that
point are the same 3-D point, which means that the corresponding descriptors --
this, that, and that -- should match even though they may have very different
appearances; in terms of the SIFT descriptor they could be very different because
they're seen from very different perspectives.
So another way to look at this is the Venice dataset. You can register it in the
same way. And what you get -- it's hard to see on this slide, but you can see
that you have the same key point on San Marco that's seen in many, many images.
So what we did is we took, I think, the --
>>: Who produced this Venice dataset?
>> Pascal Fua: The Venice dataset, I'm not sure, isn't it you? Doesn't it come
from here?
>>: It might be. Simon is not here anymore.
>> Pascal Fua: I think it's Simon's.
>>: I don't think it's Simon. Simon had Notre Dame [inaudible] and someplace
else. But I don't remember.
>>: Venice and San Marco is one of the classic scenes reconstructed by Photo
Tourism.
>> Pascal Fua: I think it started here. And it may have grown since.
>>: [inaudible].
>>: Venice has a lot of different datasets from different sources, because each
dataset is going to be biased, which is what I think you're getting at. Simon's are
sort of biased because of the SIFT detector.
>> Pascal Fua: That's a good point. But let's go back to that. Lausanne will play
a key role in that, and we can discuss after that whether it's biased or not.
So we have this large dataset with many images of Venice. And one thing we
did is we took, I think, 24 feature points that are found in the longest tracks we
could find -- the 24 longest tracks we could find. So each track contains feature
points seen in many images. For each of them we can compute the SIFT descriptor.
And we can plot a confusion matrix.
So each block here corresponds to the SIFT descriptors for the same 3-D point
seen in many images. So ideally, in this image, you should have zeros in the
diagonal blocks and large values everywhere else. That is what an ideal
descriptor would do.
So SIFT doesn't quite do that. It does something reasonable, and it works. But it
doesn't quite do that.
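For what it's worth, the kind of matrix being shown can be reproduced in a few lines; this is only an illustrative sketch, with the track data structure assumed rather than taken from the actual code:

    import numpy as np

    def confusion_matrix(tracks):
        """tracks: list of arrays, each holding the SIFT descriptors (n_i x 128)
        of one 3-D point as seen in different images. Returns the pairwise
        Euclidean distance matrix with descriptors ordered track by track, so
        same-point comparisons form the diagonal blocks."""
        D = np.vstack(tracks)
        sq = (D ** 2).sum(1)
        return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * D @ D.T, 0))
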
So what I would like to talk about now is a descriptor that we call LDAHash,
which is going to be a binary descriptor such that the confusion matrix in
Hamming space is better behaved than this one. In this one you can see that
you have blue on the diagonal and everywhere else it's kind of reddish -- I
mean, the distance is large everywhere else -- which is not quite true here; here
you have some blue even off the diagonal, which ideally you shouldn't have.
>>: Couldn't you just rescale that one to make it more red? Because it looks like
it was very, very blue on the diagonal and then sort of green and red. If you
just boosted all the numbers up --
>> Pascal Fua: You think that would be --
>>: Everything would become sort of reddish orange and the very blues would
become a pale blue.
>> Pascal Fua: So we could try that. We could try that. But I'm going to show
you, when we do the real test, that I think not. But we could -- okay. So what
we're trying to achieve, to ideally get this matrix to look the way we want, is to
start from our descriptors, our floating point descriptors, which in this case are
going to be SIFT descriptors, and binarize them -- that's how we get the b -- in
such a way that the distances between those that correspond to the same points,
the same 3-D points, are minimized (and here, I'm sorry, it should be a minus
sign), and the distances between those that belong to different points are
maximized, so that the whole energy is minimized if you put a minus there.
And in this case b -- there are very many ways to binarize something, but the
simple one is you just do a projection and you threshold. You can't get much
simpler than that.
And one of the reasons for doing it this way is that finding P, the projection
matrix that minimizes this criterion, can be done in closed form.
Basically you compute the appropriate matrices, you do an SVD, you look for the
smallest eigenvalues, and you've got it.
And, similarly, computing the thresholds can be done as a second step, dimension
by dimension, by doing a simple 1D line search, which means that computing this
can essentially be done in closed form. This is in contrast to many other
techniques around, where you typically have a greedy search, or a search
dimension by dimension, with no explicit guarantee of finding a global
minimum. In this case we do.
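As a rough sketch of this kind of optimization -- the exact objective, normalization, and threshold search in the published method differ in details, so treat the following, and in particular the use of a median as the per-dimension threshold, as an illustration rather than the actual algorithm:

    import numpy as np

    def learn_projection(pos_a, pos_b, neg_a, neg_b, alpha=10.0, n_bits=128):
        """pos_a/pos_b: SIFT descriptors (N x 128) of matching 3-D points;
        neg_a/neg_b: descriptors of non-matching points. The rows of P are the
        eigenvectors with the smallest eigenvalues of the weighted difference of
        covariances of positive and negative descriptor differences (alpha
        weighs the positive term; alpha -> infinity ignores the negatives)."""
        dp, dn = pos_a - pos_b, neg_a - neg_b
        C = alpha * dp.T @ dp / len(dp) - dn.T @ dn / len(dn)
        w, V = np.linalg.eigh(C)          # eigenvalues in ascending order
        return V[:, :n_bits].T            # n_bits x 128 projection matrix

    def learn_thresholds(P, descriptors):
        """One threshold per projected dimension; a per-dimension 1D search
        over the projected values would go here, the median is a stand-in."""
        return np.median(P @ descriptors.T, axis=1)

    def binarize(P, t, d):
        return (P @ d > t).astype(np.uint8)   # the binary descriptor b
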
And there are of course a couple of parameters in this. Actually, there are two.
One is this alpha parameter -- how do you weigh the positive examples against the
negative ones -- and the other is the dimensionality of your binary descriptor,
so the size, the number of rows in your P matrix.
And here are some curves. The tests I'm going to present are all done this
way: we take a descriptor and we plot the false positive rate against the true
positive rate as a function of the threshold we put on the distance.
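A minimal sketch of that evaluation protocol, assuming the matching and non-matching distances have already been computed:

    import numpy as np

    def roc_points(dist_pos, dist_neg, thresholds):
        """dist_pos: distances between descriptors of the same 3-D point;
        dist_neg: distances between descriptors of different points.
        For each threshold, report the fraction of each set falling below it
        (true positive rate and false positive rate, respectively)."""
        tpr = np.array([(dist_pos < t).mean() for t in thresholds])
        fpr = np.array([(dist_neg < t).mean() for t in thresholds])
        return fpr, tpr
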
And so here are the curves. So this is alpha. Infinity means you're not using
the negative examples; you're only using the positive examples.
And for one thing, if you look at the curves for 128 bits -- that's the size of
the descriptor -- it's not incredibly sensitive to the value of alpha you choose.
So in practice we choose something around 10. And for the dimensionality, of
course you do better with 128 than with 64. 64 is not quite enough, but 128 does
a pretty good job.
And now we can do some comparisons. We do this test using SIFT, or using our
LDAHash descriptor.
And fairly consistently, in this very low false positive rate region, we do
better. So it's actually interesting: at least in that range, LDAHash seems to do
better than SIFT even though it's a binary vector and it has fewer bits. What it
does have, because of this supervised learning technique, is that it has
apparently learned something that seems to carry over. Remember, we forced it. I
think that's key.
We forced it to produce similar binary vectors for all descriptors corresponding
to the same 3-D point, even though the appearance might have been quite
different because of changes in perspective.
>>: So in your previous graph it looked like your infinity ones were pretty close to
the 1,000, ten and one?
>> Pascal Fua: Yes.
>>: With infinity, there's not really much learning going on in that case, right?
>> Pascal Fua: No, there still is. Infinity just means that this term goes away.
Alpha controls the weight of the positive ones against the negative ones. So in
practice, in the code, when we say infinity, we mean make it 1 and forget about
this one.
>>: And what you're testing -- when you did the learning, was it on --
>> Pascal Fua: It was on Lausanne.
>>: Sorry, what was that? Were the descriptors from a different dataset?
>> Pascal Fua: No, that's something that gets back to the point you made
earlier. The training was done on the Lausanne dataset. We don't retrain.
>>: They're done on the same dataset?
>> Pascal Fua: No, they're not. But training is done on Lausanne. So essentially,
in the end, what we've learned is one matrix P and one threshold vector t, from
Lausanne. And then we use these to binarize on Dresden, Venice and all the
others.
So that's actually a little surprising. Lausanne seems to capture -- at least
I don't exactly know how broad it is, but it's broad enough to work on all the
other datasets.
So we have curves that are significantly above what you can do with SIFT.
Again, in the very low false positive range. And for this particular
application, that's where we want to work.
Because if you have millions and millions of points in the database, if you don't
have a very low false positive rate, you're going to be overwhelmed by the number
of points you get back and your computation will become very slow.
>>: Is this with the approximate nearest neighbors that you're --
>> Pascal Fua: Very good point. This is exact nearest neighbors. This is
just nearest neighbor. We don't worry about time. It's very slow.
>>: [inaudible].
>> Pascal Fua: But it's a big issue. I'm going to talk about it. But for the
purpose of drawing that graph, it's just nearest neighbors.
>>: So for these tests, what is the false positive rate? Does that mean that
you queried --
>> Pascal Fua: So, how many points are within the threshold: for a very low
threshold, only the descriptor that corresponds to the same 3-D point will be
found. It's a threshold on the distance.
>>: I guess what I'm asking is how the false positive rate is computed. You query
with -- what are the inputs?
>> Pascal Fua: It's a patch. And among those you find, it's how many patches
are not the right one.
>>: Okay. So the difference between .001 and .01 is whether 1/10th of a percent
or one percent of the results would be false positive?
>> Pascal Fua: Okay. I must confess I don't have the formula in my head. Can I
answer that after the talk?
>>: Sure.
>> Pascal Fua: Okay. So Prague, EPFL -- it's all kind of the same. So I think
that's sort of interesting. It looks like, by training on this one dataset,
this one Lausanne dataset, we've essentially learned something about the
redundancies of SIFT in general, and it carries over to all these other datasets.
Now, back to the message. I've just said that essentially LDAHash does better
than SIFT, at least on this particular task, with that particular way of computing
the false positive rates that we should go into afterward. But if you remember
another graph I showed before, I showed you a graph where SIFT was doing better
than LDAHash. So what's happening? That is actually interesting, because this is
one tiny paper, this is a different tiny paper, and I put together this talk and I
looked at these graphs and I said, huh? What's happening?
And what's happening, I think -- and that's something we might discuss -- is
that it's not the same task. So when you say something is better than something
else, you have to say: better for what?
And in fact what's happening is that this is about looking for a point in a very
large database while having very few false positives, whereas this is about
matching all points in essentially a small set. And what I believe -- and I would
have to check this -- is that what's happening here corresponds to what is
happening at this end of the curve.
So the order I'm showing you is for the very low false positive rates. If you
looked at the order at the very high false positive rates, it would not be the
same.
So, again, beware of benchmarks. And what I would advise, if you are interested,
is: try it. The Lausanne dataset is on the web. The code for LDAHash is on the
web, too. And the code in this case is almost nothing -- it's a header: a matrix
and a bunch of thresholds.
Okay. So in summary, what I've shown you is that you can take SIFT and binarize
it, and actually, for a class of tasks, you don't lose any accuracy; you even gain
some.
>>: When you do your projections, do you do it by random projections, standard
hashing technique?
>> Pascal Fua: Yes, we tried. The performance is lower. Actually, what you
could do is an intermediate thing where you have a random projection
matrix, but you choose your thresholds carefully.
I think we haven't explicitly done that, but I think you would do okay. In some
sense there are this P matrix and this t vector, and the t vector is more
important than the P matrix.
Okay. So now we have all these binary vectors. You can compute them
using the technique I just described, but maybe you have a better technique. In
all cases it gets back to what you mentioned: now that you have binary vectors,
you can do the tests using linear nearest neighbor search, which is fine but slow.
Again, in the end, if you have millions and millions of points, it's not going to
be practical.
So you have to go to approximate nearest neighbor search. And the problem --
something that surprised us -- is that I'm not aware of many algorithms in the
literature designed for this. There are lots and lots of ANN algorithms for
floating point descriptors, but if you try them on binary vectors -- you can
always treat a binary vector as a floating point vector if you want -- you lose a
lot in performance. They don't work very well. Except one, which is hierarchical
k-means, if you use real-valued centroids. So you take your binary vectors and
you take averages of them. But if you take averages of them, they become floating
point vectors and you lose the advantage, because now you have to compute
Euclidean distances again.
That's a somewhat surprising finding, but if you think about it, it sort of makes
sense, because Hamming spaces are different from Euclidean spaces. They are
different in the sense that the boundary between two points -- the set of points
that are equidistant from two points -- is thick. So what I'm plotting here is, in
Hamming space, the boundary between the points that are closer to this one and
those that are closer to that one, to V; it is thick. There are lots of them.
Whereas in Euclidean space it's of measure zero. In fact, you can compute these
things, and you can show with very simple formulas that the number of points
that are equidistant from two arbitrary points in your space is very significant.
It's definitely not of measure zero. And that actually confuses most standard
ANN algorithms, which essentially assume that the boundary is of measure zero.
And I think that is the explanation for why the performance of those standard
algorithms degrades.
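One way to make the "thick boundary" point concrete: for two binary strings of length n that differ in d bits, a third string is exactly equidistant from both precisely when it flips d/2 of the differing bits (possible only for even d), whatever it does on the rest. A small sketch of that count, which is a simple consequence of the definition rather than a formula from the talk:

    from math import comb

    def equidistant_count(n_bits, d):
        """Number of binary strings of length n_bits exactly equidistant, in
        Hamming distance, from two fixed strings that differ in d bits."""
        if d % 2:
            return 0            # odd d: no exactly equidistant strings
        return comb(d, d // 2) * 2 ** (n_bits - d)

    # For 32-bit vectors that differ in 8 bits, the equidistant set is
    # comb(8, 4) / 2**8, i.e. about 27% of the whole space -- far from a set
    # of measure zero.
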
So that's actually something we've just begun working on: how can you design
ANN algorithms whose performance does not degrade, and which still work
only on the binary vectors so as to remain fast?
So one thing we did, which worked pretty well, was something we call a PARC
tree, for partitioning around random centers, which is essentially the same thing
as a hierarchical k-means tree except that, instead of taking the center to be the
average when you subdivide, you take one point at random.
And you do that several times. So again it's a very simple algorithm, and you rely
on randomness -- because you have several of these trees -- to get the right
answer.
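A very rough sketch of that idea -- partitioning around randomly chosen data points so that everything stays binary. The branching factor, leaf size, and greedy single-tree search below are illustrative assumptions, not the actual implementation:

    import numpy as np

    _POP = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

    def hamming_to_all(q, data):
        """Hamming distances from a packed query (32 bytes) to N packed rows."""
        return _POP[np.bitwise_xor(data, q)].sum(axis=1)

    def build_tree(data, idx, branching=8, leaf_size=64,
                   rng=np.random.default_rng(0)):
        """Recursively partition the descriptors around `branching` centers
        picked at random from the data itself (so the centers stay binary)."""
        if len(idx) <= leaf_size:
            return ("leaf", idx)
        centers = idx[rng.choice(len(idx), branching, replace=False)]
        d = np.stack([hamming_to_all(data[c], data[idx]) for c in centers])
        owner = d.argmin(axis=0)
        if (owner == owner[0]).all():          # degenerate split, e.g. duplicates
            return ("leaf", idx)
        children = [("leaf", centers[k:k + 1]) if not (owner == k).any()
                    else build_tree(data, idx[owner == k], branching, leaf_size, rng)
                    for k in range(branching)]
        return ("node", centers, children)

    def search(tree, data, q):
        """Greedy descent in one tree; several such trees would be queried and
        their candidate leaves merged, relying on randomness for recall."""
        while tree[0] == "node":
            _, centers, children = tree
            tree = children[int(np.argmin(hamming_to_all(q, data[centers])))]
        cand = tree[1]
        return cand[int(np.argmin(hamming_to_all(q, data[cand])))]
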
Or something that's even simpler, which is a form of LSH on the binary vectors.
And this goes as follows: you have your binary vectors, you select a random
subset of the bits, you produce a hash key with those, and you put your vectors
in the corresponding bins. You do that several times. Again, you rely on
randomness. And that actually works well.
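A minimal sketch of that scheme; the number of tables and the key length are illustrative choices, not the values used in the actual experiments:

    import numpy as np

    class BinaryLSH:
        """Several hash tables, each keyed on a random subset of the bits."""
        def __init__(self, n_bits=128, key_bits=20, n_tables=10, seed=0):
            rng = np.random.default_rng(seed)
            self.subsets = [rng.choice(n_bits, key_bits, replace=False)
                            for _ in range(n_tables)]
            self.tables = [dict() for _ in range(n_tables)]

        def add(self, bits, item_id):
            for subset, table in zip(self.subsets, self.tables):
                table.setdefault(tuple(bits[subset]), []).append(item_id)

        def candidates(self, bits):
            out = set()
            for subset, table in zip(self.subsets, self.tables):
                out.update(table.get(tuple(bits[subset]), []))
            return out    # then rank the candidates by exact Hamming distance
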
And what we get with these algorithms -- here we tried that on databases of
500K, 900K and 1.5 million descriptors -- is something that actually runs much,
much faster for the same accuracy as the state-of-the-art methods, because
everything remains binary. We never compute a Euclidean distance, and we can
exploit the fact that on recent CPUs the Hamming distance is very fast.
And there really are cases where that helps. So, for example, here's a
practical example of something we use this for, which is the case of aerial
triangulation; it's completely standard. You overfly an area, you take lots of
images -- lots of very big images -- and you want to register them to produce
orthophotos and 3-D models.
In this case, for this small example, what we had was 25 relatively big images of
a town in the south of France, so about 400K feature points per image, and to do
this registration you need to do nearest neighbor search.
And this faster ANN approach gives you a 20-fold speedup over using the floating
point vectors.
And this actually matters because we're trying to follow the great
American tradition of the start-up. One of the games we are playing is trying to
develop a product that essentially can be used in conjunction with these small
drones, like this one, where you just launch it by hand and it will land by
itself. In this case it collects images of the EPFL campus, and eventually what it
comes back with is an orthophoto and/or a 3-D model of the whole thing.
And of course none of this is incredibly new; these are techniques that have
been known for a long time. But if you are trying to put it into a product, it has
to be robust. It has to be totally automatic -- nobody wants to touch anything.
And it has to be robust.
So in those cases the 20 times speedup is actually not negligible,
because the guy who wants the photo doesn't want to wait. And so that's
hopefully --
>>: [inaudible] in the middle?
>> Pascal Fua: So this is -- let me stop this.
Let me go to the end. Okay. So this is -- the official name is the
Rolex Learning Center, also known informally as the learning cheese, because of
the Swiss-cheese holes in the top of it. It's an interesting building because
there's nothing planar there; it's just rolling forms. So the architects had fun,
and it's good for us because, in terms of architectural reconstruction, it's more
fun to reconstruct that than a bunch of boxes.
Okay. So to conclude, I have presented two kinds of binary descriptors. One,
BRIEF, is really as simple as it gets: it's just based on doing these binary
tests, and it's extremely fast to compute.
And it's quite effective for matching images against each other as long as you
don't have too many -- for relatively small databases it works very well -- and
it's well adapted to cases where you have very limited CPU. The other, which is
more sophisticated, is more computationally intensive, because you first have to
compute SIFT and then binarize it, and it is adapted to finding points -- to doing
searches -- in extremely large databases.
And the interesting part is that, at least for this particular task, you really do
not lose any accuracy over something like SIFT; you even gain some, because
you've trained the thing appropriately. And now what's needed to really use this
in applications is techniques to do ANN search on truly large databases quickly.
One of the things we're looking into is exploiting the fact that we are not
doing an arbitrary ANN search; we're doing an ANN search for architectural-style
things, where location matters. So once you've found one descriptor, once you've
done one match, you can expect that the next one is going to be a point in the
vicinity -- in the geographic vicinity -- and you should build that into your
algorithms. And that's one of the things we are going to keep looking at in the
future.
That's it. Thank you.
[applause]
>> Rick Szeliski: Questions?
>>: So on this binary side, of the two things that you presented, one binarizes
the pixel intensities -- it works directly on the patch, coming up with a binary
vector. The other one really requires computing SIFT -- you have to compute SIFT
or some kind of complicated, heavyweight descriptor -- and then you binarize. So
why do you think the first one can't really scale to large images, or is there a
better way to do it?
>> Pascal Fua: That could be -- that would be good. So you're suggesting a way --
SIFT has been carefully thought out to capture all these small-scale and
large-scale effects, and we are capitalizing on that.
So maybe, as you suggested, if you had a BRIEF-style thing with the tests
organized in the right way, maybe you would be able to capture some of those
same things. Because what happens with SIFT is that it seems to be redundant in
many ways, since you can decimate it and still not lose its discriminative power.
I think it's intriguing, and I think there's something to be looked at there.
>>: Have you tried to visualize --
>> Pascal Fua: The weights?
>>: The weighting of which features are important in the LDAHash?
>> Pascal Fua: No, we haven't. We should.
>>: It would be interesting to visualize that to see what's kicking in.
>> Pascal Fua: Yeah.
>>: So I remember with the old FERNS stuff -- actually even the trees before
the FERNS -- you trained them to maximize the discriminative power. Have you
tried to do any of that training?
>> Pascal Fua: That's the point of BRIEF -- to completely get rid of the
training. What it's saying, in some sense, is that the training that was included
in the FERNS is not all that necessary; you can do without it.
>>: But, I mean, just in the way you choose the tests, did you investigate
various different strategies for generating them randomly?
>> Pascal Fua: That's true. The complete randomness of the tests is the
same in the FERNS and in BRIEF. So we've not -- again, this goes back to what
we just discussed. You would think that there should be an optimal way to
select the tests, but we haven't found it.
>>: So for the nearest neighbor part, you showed you could make it much faster
and achieve the same precision. But can it also -- the chart of the precision you
showed for the K trees, for instance, was 4.75. How much better than that can
you get?
>> Pascal Fua: You can never -- I mean, in this particular setting you cannot
beat the linear search -- it depends how much time you're willing to spend.
That's why it's plotted as time per query.
>>: I'm assuming, where is the curve of the new stuff?
>> Pascal Fua: The new stuff is here.
>>: Okay. And the bottom axis --
>> Pascal Fua: The bottom axis -- the others are various parameters, and HKM is
the hierarchical k-means.
>>: Do you know, in the case of binary vectors, what portion of using
the KD tree is slow? Is it the Euclidean distance?
>> Pascal Fua: It's the Euclidean distance that slows it down.
>>: As opposed to the thick boundaries you were talking about.
>> Pascal Fua: Well, the KD trees do not have the -- they're slow because, in
this case, since you compute the means, you don't have thick boundaries
anymore. It becomes Euclidean again. So the thick boundary problem goes away.
But the price you pay is that now you have to do these floating point
computations.
>>: Are these per dimension? Is the KD tree the standard setup where you
split on one dimension at a time?
>> Pascal Fua: Yeah, the KD tree -- but the one I described is a hierarchical
k-means. The last one, which works best for this problem, is the hierarchical
k-means.
>>: And the examples you showed in the air were matching aerial photos to each
other. Have you done it with ground-level photos?
>> Pascal Fua: Yes. So for the ANN, have we done this? Yes, we have,
because some of these models -- the Lausanne model I showed you, the
video showed some of the ground-level images -- so, yes.
>>: Do you have a system now where somebody could walk up, take a picture
somewhere in Lausanne, and it will tell you where you're standing by matching
against all of the photos?
>> Pascal Fua: We don't have one we could -- in the sense of something we'd
put on the Web, no.
>>: But if you were to deploy such a system would it successfully match the
image or are there too many images that come up for retrieval? Because that's a
large number of images, probably more than the number of drone images.
>> Pascal Fua: Right. Actually, we haven't really tested it.
>>: At the beginning of the talk you showed us the photos, the whole Lausanne
reconstruction. I thought you were heading towards a system which would take
an arbitrary new photo and tell you where you're standing.
>> Pascal Fua: We're headed in that direction but we're not quite there yet.
>>: You said you can store multiple rotated versions of descriptors to handle
rotation. For this Lausanne thing, how many such descriptors?
>> Pascal Fua: That was for BRIEF. I said that about BRIEF, to compensate for
the lack of orientation invariance.
BRIEF we really used on these indoor applications, so we didn't try it there --
it's more for these augmented reality things on your cell phone that we've been
playing with. So that's what I've been trying to explain: it's adapted for that,
but it's truly not adapted for the large-scale stuff.
>>: For large scale stuff, what do you use instead?
>> Pascal Fua: Initially we used SIFT like everyone else. And now that we've
trained LDAHash, we use LDAHash.
>> Rick Szeliski: Any other questions? Okay. Well, thank you once again.
[applause]