>> Rick Szeliski: Good afternoon, everyone. It's a pleasure for me to introduce Pascal Fua. Pascal is a professor at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland, and we go back a long way. Pascal was working at SRI International in Menlo Park when I briefly worked there in 1989. He's one of the pioneers in multi-view stereo, he's done some of the most interesting work in person tracking, and over the last decade he's done a tremendous amount of work on fast and efficient feature descriptors, which is what he's going to talk about today. Another interesting note is that his work on deformable surface modeling led to his being hired as a consultant by the Swiss national sailboat racing team, so he has written software to spy on other people's sailboats and figure out why their sails work so well. It's a great pleasure to introduce Pascal.

>> Pascal Fua: Thank you. Hi. So today I'd like to talk about our work on binary descriptors. And to give you some context, to tell you how we got interested in this, let me start with something you probably already know. One of the things we are all interested in doing is being able to walk around a city, take pictures of landmarks with our cell phones, point at a particular statue, in this case this one, and have some information pop up about that statue. Which means, of course, that we need two things. We need a model of, in this case, the cathedral; it has to be annotated; and it has to be precisely registered with the image we just took, so that the phone knows that when you point at that particular pixel you are talking about that particular statue.

So there are two main components in this. One is the 3-D models, and preferably they have to be large scale, because in the end we want to do a whole city, a whole country, the whole world; eventually it's going to become very, very big. And once you have that, you want to be able to take the image you just took and match it against these potentially extremely large models.

So first let me say a few words about the models themselves and how we build them. In a sense this is a follow-up of things that have been done here; it's Photosynth taken maybe to the next level. And the problem when you try to do this -- here are a few pictures of Lausanne, where we are based, and a number of graduate students went around with digital cameras and took pictures of the city -- is that typically you get pictures of landmarks like the cathedral taken from very different angles; they don't look the same at all, and it's not trivial to match them and to produce 3-D point models like those. And I know there's this very famous paper about reconstructing Rome in a day, but it didn't really reconstruct Rome; it reconstructed three very important landmarks in Rome, which is slightly different. That's actually fairly typical of state-of-the-art approaches, which tend to reconstruct disconnected clusters, because you have lots and lots of pictures of specific locations, but in between those locations you have very few pictures, it's very sparse, and most bundlers tend to disconnect them. Then there's the problem that if you do bundle adjustment at a very large scale it tends to explode unless you're very careful about it. And finally, a lot of 3-D reconstruction techniques tend to choke if you give them too many images over too wide a range.
So these are things we've been trying to work on. And one of the things we've paid a lot of attention to is being able to register images like these, which are taken from very, very different viewpoints. To do this, we've taken advantage of the fact that most images you can find have some form of geotag, GPS data, and also that for most cities -- I mean it's not [inaudible] -- for most cities in the world you have a cadastral map, a map of the buildings. It may or may not be accurate, but it gives you a pretty good idea of what's there. So the system we've developed essentially takes all the images that you have -- for Lausanne, for example, we have approximately 20,000 -- and, like most of the other state-of-the-art systems, essentially groups them into clusters; but then, when we do the bundle adjustment, we use the GPS data and the cadastral information to make sure that the clusters align with each other and with the cadastral map, so that we can build a large-scale, consistent model of the whole city. And of course it has to take into account the fact that sometimes these data are not there and sometimes they're wrong. But when you do vision that's normal; dealing with outliers is normal.

So we can build reconstructions like this one, and what's interesting about this is that we've used very different kinds of images from different sources: some from drones, some ground-level images taken with one particular camera, and some images where we zoomed in on the cathedral to produce the high-resolution parts. And actually here we need some help from the graphics folks: you probably saw that when we switched resolution there was a jump, and really there shouldn't be a jump. But we poor computer vision folks don't know what to do about that. Anyway, the point is that we build these models. This one is not particularly big, actually, but you can build much bigger ones with thousands and thousands of images, and you can add to it.

But now, if you actually want to use it, what's needed is a way to take a new image and register it against this model. So what is in this model? It has 3-D points, and for all the 3-D points it's going to contain descriptors. Typically these could be SIFT descriptors, which are still more or less the state of the art. And that's fine if you have unlimited amounts of memory. But typically if you're on a cell phone you don't have unlimited amounts of memory, and if the model is really, really big, with millions of descriptors, you end up with a problem. You need smaller descriptors so that you use less memory and also so that you can do the matching faster. And that's where the binary descriptors come in.

What I'm going to argue is that if you replace the floating point descriptors by binary descriptors, you gain in memory size, you gain in speed, and you don't really lose in accuracy. So they are a good thing. That's the program: to show that binary descriptors are much more compact, that they lead to faster matching -- because essentially the Hamming distance is faster to compute than the Euclidean distance -- and that if you compute them correctly, you don't really lose with respect to the more traditional ones like SIFT or SURF.

This is actually something we've been looking at for a long time. This is now a fairly old video, where what we used was something called Ferns, where we had actually trained a classifier to recognize interest points on the car.
So every time the car comes into view -- when it's hidden, it doesn't find it, but when it comes into view, it does the matching very, very fast and redetects the car. In this particular video there's no temporal consistency imposed; detection is done at every frame independently. The principle behind this was a classification-based approach to matching: if you take a key point here, the corner of the M for example, you know that if the surface is locally planar, all possible appearances of that corner are going to be this patch up to some [inaudible]. So what we did was to produce a training database of all the possible views of that key point -- I mean, not all possible views, a representative sample -- and then we trained a decision-tree-based classifier to recognize those key points. You would drop a patch at the top of a tree and then you would ask questions like: is this pixel brighter than that one? If yes, you go left; if no, you go right. We trained the classifier to do this, and the point is that doing this classification is extremely fast. It is essentially a binary descriptor because it's all based on these simple yes/no questions. And we could train it either, if the object was truly planar like the cover of a book, by just synthesizing homographies of the key points, or, if the object was nonplanar like the car, by essentially showing the system a video of the object -- a video in which the motion was nice and slow so that tracking was easy -- and generating our training database that way.

So that worked nicely, and some of our colleagues and graduate students implemented it on a cell phone a while ago and got real-time performance on the phone. But the problem is that it requires a lot of training. The runtime performance was good, but training took a long time. So there was no real way, with that particular algorithm, to just show it an image, very quickly say "learn it and use it," and then, having trained the trees, try to find it again. Which is why more recently we moved to something that's simpler than this, the true binary descriptor we play with, which is called BRIEF.

BRIEF is really, really simple. Here is how you describe a patch. You take the patch, you smooth it, and then on this smoothed patch you take pairs of points, and for each pair you ask: is this one brighter than that one? If the answer is yes, it's a one; if the answer is no, it's a zero. Typically you choose about 256 of these pairs, which means you boil your patch down to a 256-bit vector. So this is really simple, and the surprising thing is that it works very well. It's a pretty powerful descriptor -- I'll show you -- and you can use 128, 256 or 512 of these tests, and that's basically enough. Another surprising thing is how you choose the tests. We tried many, many different strategies, and in the end we were not able to truly improve upon just picking a bunch of random tests. We get a smidgen better by biasing the probabilities a little bit so that the points are more concentrated in the center, but the difference is slight and not terribly significant.

>>: Did you try favoring middle-sized segments over really short ones or really long ones?
Because it seems like one-pixel-long segments don't carry much descriptive power, and for very long ones maybe the patch is too small.

>> Pascal Fua: We could, but in fact we didn't. Again, we could try, but I suspect it might yield a very small improvement, nothing terribly significant. Essentially what this is computing are derivatives; you can think of these tests as gradients, and the long ones are gradients over larger distances -- which is why we need the smoothing, by the way, otherwise they wouldn't be meaningful gradients.

So we first tested this on some of the standard benchmark datasets, with some mostly planar structures and some that are not planar, for which we have LIDAR data. If you compare it to SURF, it does better. I won't go into the detail of all the curves, but typically we get recognition rates that tend to be better. The test we ran takes 512 points in one image, 512 in the other, and measures how many of the correspondences established using these descriptors are the right ones. The percentage is somewhat higher. Something you should note is that what we are mostly comparing against is something called U-SURF: SURF is designed to be orientation invariant, and BRIEF is definitely not orientation invariant, so what we compare against is U-SURF, which is SURF without the orientation invariance. U-SURF actually does better than SURF when orientation invariance is not needed -- an important detail. You pay a price for orientation independence, so in applications where you don't need it, don't use it. Typically, if you have a cell phone where you know the orientation because you have an accelerometer, you should use a non-orientation-invariant descriptor.

So we tend to do better on these benchmarks. Of course, SURF is essentially a fast version of SIFT with some loss of performance. On some of the benchmarks we still do better than SIFT, but not on others; on others, SIFT is still more powerful than BRIEF. So this is a very hand-wavy statement, but roughly, based on these tests, the recognition accuracy of BRIEF is somewhere between that of SIFT and SURF. The other thing, especially for all the students here: be careful with benchmarks. Depending on how you present these graphs, I could have argued almost anything by choosing the right graph to show you.

But that's actually not the point of BRIEF. The point of BRIEF is not to be more accurate than SIFT; it's to be much faster. And that it is. Here are some computation times -- this was done, I think, on a Mac like this one. If you use SURF, it takes a certain amount of time, and if you use BRIEF, it's of course much, much shorter. There is a version of SURF on the GPU, which is much faster than the version without the GPU, but the point here is that we can do better than SURF on the GPU without a GPU; this is still on the CPU. One of the things that speeds it up a lot is that the new CPUs let you compute the Hamming distance, which is what you use, extremely fast -- it's essentially one instruction. That's one of the reasons why you want to use binary descriptors.

One more word maybe about scale and orientation invariance. BRIEF is definitely neither scale nor rotation invariant, but it's very fast. If you need scale and rotation invariance, what you can do is, for each key point, learn rotated versions of it. So of course you pay a price.
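To make the construction concrete, here is a minimal sketch of a BRIEF-style descriptor and of Hamming-distance matching, under my own assumptions: randomly drawn test pairs, SciPy's Gaussian filter for the smoothing, and illustrative parameter values. This is not the reference implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# A BRIEF-style binary descriptor: smooth the patch, then compare the
# intensities of randomly chosen pixel pairs. 256 pairs -> 256-bit vector.
PATCH, N_TESTS = 32, 256
rng = np.random.default_rng(0)
pairs = rng.integers(0, PATCH, size=(N_TESTS, 4))    # (y1, x1, y2, x2) per test

def brief(patch, sigma=2.0):
    smoothed = gaussian_filter(patch.astype(float), sigma)
    y1, x1, y2, x2 = pairs.T
    return smoothed[y1, x1] < smoothed[y2, x2]        # boolean vector of test outcomes

def hamming(a, b):
    # On packed 64-bit words this boils down to an XOR plus a population-count
    # instruction, which is why matching binary descriptors is so fast.
    return np.count_nonzero(a != b)

p1 = rng.random((PATCH, PATCH))
p2 = p1 + 0.01 * rng.standard_normal((PATCH, PATCH))  # slightly perturbed copy
print(hamming(brief(p1), brief(p2)))                  # small distance expected
```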
But it's an acceptable price. So here is the kind of thing it can do -- a very simple demo where you just show a thing to the computer, you say here is my area of interest, and then immediately, because there is no learning, you can start tracking it. And, as I told you, don't believe benchmarks: try the code. Actually, we have heard that some of you already have.

So we did what I just described, and that was a paper we published at ECCV, and we sent it for journal publication, and the reviewers told us: that's all very nice, but why don't you try this on a more challenging and newer benchmark, like the Liberty dataset. That dataset actually has interesting properties: small images of the Statue of Liberty, it's 3-D, and it's true that it's more complicated than the ones I've shown before. In particular, it has many more images and many more key points. So we did, and here are the results we get when we use more and more key points; the previous examples were done with 512 key points in each image. So I'm kind of walking you through one graph across these slides. We have SIFT here, SURF here, and BRIEF here in the middle. What happens as you increase the number of key points is that there is a fairly big difference between SIFT and BRIEF. And there's something we didn't like too much, but that's life: there is a crossover. We can still beat SURF, but we have to use more bits, which, after all, is not completely unreasonable. But the problem is that if we're going towards more key points -- and I didn't plot what happens beyond this -- the descriptive power of BRIEF is not sufficient. Because in the end the problem I'm trying to address is not matching a few hundred key points; it's going into a database where I'm going to have millions and millions of them. In that case, to search through these large databases, we're going to need something more powerful.

The way we try to get that is by going back to our Lausanne model, back to what I discussed earlier. We have built a model of Lausanne with 4,500 images, a million or so 3-D points, and ten million feature points, because a lot of the 3-D points are visible in many images. So this is actually a pretty good database, and it has an important feature that I don't think many people have explored: if you are reasoning in terms of 3-D points, you know that this point and that point and that point are the same 3-D point, which means that their descriptors -- this one, that one, and that one -- should match, even though they may have very different appearances; in terms of the SIFT descriptor they could be very different because they're seen from very different perspectives.

Another way to look at this is the [inaudible] dataset. You can register it in the same way, and what you get -- it's hard to see on this slide -- is that you have the same key point on San Marco that's seen in many, many images. So what we did is we took, I think, the --

>>: Who produced this Venice dataset?

>> Pascal Fua: The Venice dataset, I'm not sure -- isn't it you? Doesn't it come from here?

>>: It might be. Simon is not here anymore.

>> Pascal Fua: I think it's Simon's.

>>: I don't think it's Simon's. Simon had Notre Dame [inaudible] and someplace else. But I don't remember.

>>: Venice and San Marco is one of the classic scenes reconstructed by Photo Tourism.
>> Pascal Fua: I think it started here. And it may have grown since.

>>: [inaudible].

>>: Venice has a lot of different datasets from different sources, because each dataset is going to be biased, which is what I think you're getting at. Simon's are sort of biased because of the SIFT detector.

>> Pascal Fua: That's a good point, but let's come back to that. Lausanne will play a key role in what follows, and we can discuss afterwards whether it's biased or not. So we have this large dataset with many images of Venice. One thing we did is take, I think, 24 feature points found in the 24 longest tracks we could find. Each track contains feature points seen in many images. For each of them we can compute the SIFT vector, and we can plot a confusion matrix. Each block here corresponds to the SIFT descriptors for the same 3-D point seen in many images. Ideally, in this image you should have zeros on the diagonal blocks and large values everywhere else; that's what an ideal descriptor would do. SIFT doesn't quite do that. It does something reasonable, and it works, but it doesn't quite do that.

So what I would like to talk about now is a descriptor that we call LDAHash, which is going to be a binary descriptor such that the confusion matrix in Hamming space is better behaved than this one. Here you can see that you have blue on the diagonal and everywhere else it's kind of reddish -- a large distance everywhere -- which is not quite true there: there you have some blue even off the diagonal, which ideally you shouldn't have.

>>: Couldn't you just rescale that one to make it more red? Because it looks like it was very, very blue on the diagonal and then sort of green and red. If you just boosted all the numbers up --

>> Pascal Fua: You think that would be --

>>: Everything would become sort of reddish orange and the very blues would become a pale blue.

>> Pascal Fua: So we could try that. We could try that. But I'm going to show you, when we do the real test, that I think not. But we could -- okay. So what we are trying to achieve, to ideally get this matrix to be the way we want it, is to start from our floating point descriptors, which in this case are going to be SIFT descriptors, and binarize them -- to get the B -- in such a way that the distances between descriptors that correspond to the same 3-D points are minimized, and the distances between those that belong to different points are maximized; and here, I'm sorry, it should be a minus sign, so the whole energy is minimized if you put a minus here. There are very many ways to binarize something, but the simple one is: you just do a projection and you threshold. You can't get much simpler than that. And one of the reasons for doing it this way is that finding the projection matrix P that minimizes this criterion can be dealt with in closed form: basically you compute the appropriate matrices, you do an SVD, you look for the smallest eigenvalues, and you've got it.
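For reference, here is a hedged reconstruction in formulas of the criterion being described; the notation is mine, not the slide's: d is a SIFT descriptor, b(d) its binary code, P the projection, t the thresholds, and alpha the relative weight discussed just below.

```latex
% Hedged reconstruction of the binarization and the objective described above.
\begin{align}
  b(d) &= \operatorname{sign}\!\bigl(P\,d + t\bigr), \\
  \min_{P,\,t}\; L &=
      \alpha\,\mathbb{E}\!\left[\,\bigl\|b(d_i)-b(d_j)\bigr\| \;\middle|\; \text{same 3-D point}\,\right]
      \;-\;
      \mathbb{E}\!\left[\,\bigl\|b(d_i)-b(d_j)\bigr\| \;\middle|\; \text{different 3-D points}\,\right].
\end{align}
```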
And, similarly, computing the thresholds can be done as a second step, dimension by dimension, with a simple 1-D line search. So computing this can be done essentially in closed form, in contrast to many other techniques around, where you typically have a greedy search or a search dimension by dimension with no explicit guarantee of finding a global minimum. In this case we do.

There are, of course, a couple of parameters in this. Actually, there are two: one is this alpha parameter -- how do you weigh the positive examples against the negative ones -- and the other is the dimensionality of your binary descriptor, that is, the number of rows in your P matrix. Here are some curves. The tests I'm going to present are all done this way: we take a descriptor and we plot the false positive rate against the true positive rate as a function of the threshold on the distance. So here are the curves. This is alpha; infinity means you're not using the negative examples, you're only using the positive ones. For one thing, if you look at the curves for 128 bits -- that's the size of the descriptor -- it's not incredibly sensitive to the value of alpha you choose; in practice we choose something around 10. As for the dimensionality, of course it matters whether you have 128 or 64 bits: 64 is not quite enough, but 128 does a pretty good job.

And now we can do some comparisons: we run this test using SIFT, or using our LDAHash descriptor, and fairly consistently, in this very low false positive rate region, we do better. So that's actually interesting. At least in that range, LDAHash seems to do better than SIFT even though it's a binary vector and has fewer bits. What it does have is this supervised learning: it apparently has learned something that seems to carry over. Remember, we forced it -- I think that's key -- we forced it to produce similar binary vectors for all descriptors corresponding to the same 3-D point, even though the appearance might have been quite different because of changes in perspective.

>>: So in your previous graph it looked like your infinity curves were pretty close to the 1,000, ten and one?

>> Pascal Fua: Yes.

>>: With infinity there's not really much learning going on in that case, right?

>> Pascal Fua: No, there still is. Infinity just means that this term goes away; alpha controls the weight of the positive examples against the negative ones. So infinity just means -- in practice, in the code, when we say infinity, we mean: make this weight one and forget about the other term.

>>: And when you did the learning, was it on --

>> Pascal Fua: It was on Lausanne.

>>: Sorry, what was that? Were the descriptors from a different dataset?

>> Pascal Fua: That gets back to the point you made earlier. The training was done on the Lausanne dataset. We don't retrain.

>>: Training and testing are done on the same dataset?

>> Pascal Fua: No, they're not. But training is done on Lausanne. So essentially, in the end, what we've learned is one matrix P and one threshold vector t, from Lausanne, and then we use these to binarize on Dresden, Venice and all the others. So it seems -- and that's actually a little surprising -- that Lausanne captures something; at least I don't exactly know how broad it is.
But broad enough, at least, to work on all the other datasets, where we get curves that are significantly above what you can do with SIFT -- again, in the very low false positive range. And for this particular application, that's where we want to work, because if you have millions and millions of points in the database and you don't have a very low false positive rate, you're going to be overwhelmed by the number of matches and your computation will become very slow.

>>: Is this with the approximate nearest neighbors that you're --

>> Pascal Fua: Very good point. This is exact nearest neighbors. This is just nearest neighbor; we don't worry about time. It's very slow.

>>: [inaudible].

>> Pascal Fua: It is a big issue, and I'm going to talk about it. But for the purpose of drawing that graph, it's just nearest neighbors.

>>: So for these tests, what is the false positive rate? Does that mean that you had, you queried --

>> Pascal Fua: It's how many points fall within the threshold: for a very low threshold, only the descriptor that corresponds to the same 3-D point will be found. It's a threshold on the distance.

>>: I guess what I'm asking is how the false positive rate is computed. You query with -- what are the inputs?

>> Pascal Fua: The input is a patch, and the false positives are how many of the patches you find that are not the right one.

>>: Okay. So the difference between .001 and .01 is whether a tenth of a percent or one percent of the results would be false positives?

>> Pascal Fua: Okay, I must confess I don't have the formula in my head. Can I answer that after the talk?

>>: Sure.

>> Pascal Fua: Okay. So Prague, EPFL -- it's all kind of the same. I think that's sort of interesting: it looks like by training on this one Lausanne dataset, we've essentially learned something about the redundancies of SIFT in general, and it carries over to all these other datasets.

Now, back to the message. I've just said that LDAHash essentially does better than SIFT, at least on this particular task, with that particular way of computing the false positive rate that we should go into afterwards. But if you remember another graph I showed before -- this is LDAHash -- there SIFT was doing better than LDAHash. So what's happening? That's actually interesting, because this is one tiny paper and that is a different tiny paper, and I put together this talk and I looked at these graphs and said, huh? What's happening, I think -- and that's something we might discuss -- is that it's not the same task. When you say something is better than something else, you have to say better for what. What's happening is that this is about looking for a point in a very large database while having very few false positives, and that is about matching all points in an essentially small set. And what I believe -- and I have to check this -- is that what's happening there corresponds to what happens at this end of the curve. The ordering I'm showing you is for the very low false positive rates; if you looked at the ordering at the very high false positive rates, it would not be the same. So, again, beware of benchmarks. What I would advise, if you are interested, is: try it. The Lausanne dataset is on the Web, and the code for LDAHash is on the Web, too. The inside of the code in this case is nothing -- it's a header; it's a matrix and a bunch of thresholds. Okay.
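To make that last remark concrete, here is a minimal sketch of the "matrix and a bunch of thresholds" binarization, with random placeholder values standing in for the learned P and t; the names and shapes are illustrative assumptions, not the released code.

```python
import numpy as np

# Projection-plus-threshold binarization as described above. In practice P and
# t come from training; here they are random placeholders so the example runs.
rng = np.random.default_rng(0)
P = rng.standard_normal((128, 128))   # placeholder projection (n_bits x descriptor_dim)
t = rng.standard_normal(128)          # placeholder per-dimension thresholds

def binarize(sift_descriptor):
    """Map a float descriptor to a binary code: b = (P @ d + t) > 0."""
    return (P @ sift_descriptor + t) > 0

def hamming(a, b):
    """Hamming distance between two boolean codes; with packed bits this is
    essentially an XOR followed by a population count."""
    return np.count_nonzero(a != b)

d1 = rng.random(128)                  # stand-ins for two SIFT descriptors
d2 = rng.random(128)
print(hamming(binarize(d1), binarize(d2)))
```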
So, in summary, what I've shown you is that you can take SIFT and binarize it, and for a class of tasks you don't actually lose any accuracy; you even gain some.

>>: When you do your projections, did you try random projections, the standard hashing technique?

>> Pascal Fua: Yes, we tried. The performance is lower. Actually, what you could do is an intermediate thing, where you have a random projection matrix but you choose your thresholds carefully. I don't think we have explicitly done that, but I think you would do okay. In some sense, of the P matrix and the t vector, the t vector is more important than the P matrix.

Okay. So now we have all these binary vectors. You can compute them using the technique I just described, but maybe you have a better technique. In all cases it gets back to what you mentioned: now that you have binary vectors, you can do the matching using linear nearest neighbor search, which is fine but slow. Again, if you have millions and millions of points, it's not going to be practical, so you have to go to approximate nearest neighbor search. And the problem -- something that surprised us -- is that I'm not aware of many algorithms in the literature designed for this. There are lots and lots of ANN algorithms for floating point descriptors, but if you try them on binary vectors -- you can always treat a binary vector as a floating point vector if you want -- you lose a lot of performance. They don't work very well. Except one, which is hierarchical K-means, if you use real-valued centroids: you take your binary vectors and you take averages of them. But if you take averages of them they become floating point vectors, and you lose the advantage, because now you have to compute Euclidean distances again.

That's a somewhat surprising finding, but if you think about it, it sort of makes sense, because Hamming spaces are different from Euclidean spaces. They are different in the sense that the boundary between two points -- the set of points that are equidistant from two points -- is thick. What I'm plotting here is, in Hamming space, the boundary between the points that are closer to this one or closer to that one, to V: it is thick; there are lots of them. Whereas in Euclidean space it's of measure zero. In fact, you can compute these things, and there are very simple formulas showing that the number of points that are equidistant from two arbitrary points in your space is very significant -- definitely not of measure zero. And that actually confuses most standard ANN algorithms, which essentially assume that the boundary is of measure zero. I think that is the explanation for why the performance of these standard algorithms degrades.

So that's something we have just begun working on: how can you design ANN algorithms whose performance does not degrade, and which still work only on the binary vectors so as to remain fast? One thing we did, which worked pretty well, was something we call a PARC tree, for partitioning around random centers. It's essentially the same thing as hierarchical [inaudible] trees, except that instead of taking the center, when you subdivide, to be the average, you take one point at random. And you do that several times -- you build several of these trees -- so again it's a very simple algorithm, and you rely on randomness to get the right answer.
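Here is a minimal sketch of that random-center idea, written from the description above rather than from the authors' code; the branching factor, leaf size and function names are made-up illustration parameters. The key point it demonstrates is that every comparison stays a Hamming distance between actual binary codes, with no averaging.

```python
import numpy as np

def hamming(a, b):
    # Hamming distance, broadcasting over rows of a boolean/0-1 array.
    return np.count_nonzero(a != b, axis=-1)

def build_tree(codes, indices, branching=4, leaf_size=16, rng=None):
    """Recursively split binary codes around `branching` randomly chosen member
    codes (no centroids, so everything remains binary)."""
    rng = np.random.default_rng() if rng is None else rng
    if len(indices) <= leaf_size:
        return ("leaf", indices)
    centers = codes[rng.choice(indices, size=branching, replace=False)]
    dists = np.stack([hamming(codes[indices], c) for c in centers], axis=1)
    assign = dists.argmin(axis=1)
    if len(np.unique(assign)) == 1:          # degenerate split: stop here
        return ("leaf", indices)
    kept, children = [], []
    for k in range(branching):
        members = indices[assign == k]
        if len(members) > 0:
            kept.append(centers[k])
            children.append(build_tree(codes, members, branching, leaf_size, rng))
    return ("node", np.array(kept), children)

def search(tree, codes, query):
    # Greedy descent to one leaf, then a linear Hamming scan of that leaf.
    # A real implementation would build several trees and/or backtrack.
    while tree[0] == "node":
        _, centers, children = tree
        tree = children[int(hamming(centers, query).argmin())]
    leaf = tree[1]
    return leaf[int(hamming(codes[leaf], query).argmin())]

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(1000, 128), dtype=np.uint8)
tree = build_tree(codes, np.arange(1000), rng=rng)
print(search(tree, codes, codes[42]))        # should usually return 42
```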
Or something that's even simpler, which is a form of LSH on the binary vectors. It goes as follows: you have your binary vectors, you select a random subset of bits, you produce a hash key from those, and you put your vectors in the corresponding bins. You do that several times -- again, you rely on randomness -- and that actually works well. [A small sketch of this scheme appears at the end of this passage.] With these algorithms, on relatively large databases -- we tried 500 K, 900 K and 1.5 million descriptors -- we get something that runs much, much faster for the same accuracy as the state-of-the-art methods, because everything remains binary: we never compute a Euclidean distance, and we can exploit the fact that on current CPUs the Hamming distance is very fast.

And there are real cases where this helps. Here is a practical example of something we use this for, which is aerial triangulation; it's completely standard. You overfly an area, you take lots of very big images, and you want to register them to produce orthophotos and 3-D models. In this small example we had 25 relatively big images of a town in the south of France, about 400 K feature points per image, and to do this registration you need to do nearest neighbor search. With this faster ANN approach we get a 20-fold speedup over using the floating point vectors. And this actually matters, because we're trying to follow the great American tradition of the start-up, and one of the games we are playing is trying to develop a product that can be used in conjunction with small drones like this one: you just launch it by hand, it lands by itself, in this case it collects images of the EPFL campus, and eventually it comes back with an orthophoto and/or a 3-D model of the whole thing. Of course, none of this is incredibly new; these are techniques that have been known for a long time. But if you are trying to put it into a product, it has to be robust and it has to be totally automatic -- nobody wants to touch anything. In those cases a 20-times speedup is not negligible, because the person who wants the photo doesn't want to wait. And so that's hopefully --

>>: [inaudible] in the middle?

>> Pascal Fua: So this is -- let me stop this and go to the end. Okay. The official name of this is the Rolex Learning Center, also known informally as the learning cheese because of the Swiss holes on top of it. It's an interesting building because there's nothing planar there; it's just rolling forms. So the architects had fun, and it's good for us because, in terms of architectural reconstruction, it's more fun to reconstruct that than a bunch of boxes.

Okay. So to conclude, I have presented two kinds of binary descriptors. One, BRIEF, is really as simple as it gets: it's just based on doing these binary tests, it's extremely fast to compute, and it's quite effective for matching images against each other as long as you don't have too many -- for relatively small databases it works very well -- and it's well adapted to cases where you have very limited CPU. The other is more sophisticated and more computationally intensive, because you first have to compute SIFT and then binarize it, and it is adapted to searching for points in extremely large databases.
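Here is the promised sketch of the bit-sampling LSH scheme described a few paragraphs above; this is my own toy rendering under assumed parameters (number of tables, bits per key), not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

# Multi-table LSH on binary codes: each table hashes a descriptor by a random
# subset of its bits; a query is compared only against vectors sharing a bucket.
N_TABLES, BITS_PER_KEY, CODE_LEN = 8, 16, 128
rng = np.random.default_rng(0)
subsets = [rng.choice(CODE_LEN, size=BITS_PER_KEY, replace=False)
           for _ in range(N_TABLES)]

def key(code, subset):
    # Pack the selected bits into an int usable as a dictionary key.
    return int("".join("1" if b else "0" for b in code[subset]), 2)

def build_index(codes):
    tables = [defaultdict(list) for _ in range(N_TABLES)]
    for i, code in enumerate(codes):
        for table, subset in zip(tables, subsets):
            table[key(code, subset)].append(i)
    return tables

def query(tables, codes, q):
    # Union of candidate buckets, then exact Hamming distance on candidates only.
    candidates = set()
    for table, subset in zip(tables, subsets):
        candidates.update(table.get(key(q, subset), []))
    if not candidates:
        return None
    cand = np.fromiter(candidates, dtype=int)
    dists = np.count_nonzero(codes[cand] != q, axis=1)
    return int(cand[dists.argmin()])

codes = rng.integers(0, 2, size=(10000, CODE_LEN), dtype=np.uint8)
tables = build_index(codes)
print(query(tables, codes, codes[123]))   # should return 123
```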
And the interesting part about this second descriptor is that, at least for this particular task, you really do not lose any accuracy compared to something like SIFT; you even gain some, because you've trained the thing appropriately. What's needed now, to really use this in applications, is techniques to do ANN search on truly large databases quickly, and one of the things we're looking into is exploiting the fact that we are not doing arbitrary ANN search; we're doing ANN search on architectural-style scenes, where location matters. Once you've found one match, you can expect that the next one is going to be a point in its vicinity, its geographic vicinity, and you should build that into your algorithms. That's one of the things we are going to keep looking at in the future. That's it. Thank you.

[applause]

>> Rick Szeliski: Questions?

>>: So, of the two binarization schemes you presented, one binarizes the pixel intensities -- it works directly on the patch, coming up with a binary vector. The other one really requires computing SIFT, or some kind of complicated heavyweight descriptor, and then you binarize. So why do you think the first can't really scale to large databases, or is there a better way to do it?

>> Pascal Fua: That could be -- that would be good. So you're suggesting a way -- SIFT has been carefully thought out to capture all these small-scale and large-scale effects, and we are capitalizing on that. So maybe, as you suggest, if you had a BRIEF-style thing with the tests organized in the right way, you should be able to capture some of those same things. Because what happens with SIFT is that it seems to be redundant in many ways, since you can decimate it and still not lose its discriminative power. I think it's intriguing, and I think there's something to be looked at there.

>>: Have you tried to visualize --

>> Pascal Fua: The weights?

>>: The weighting of which features are important in the LDAHash?

>> Pascal Fua: No, we haven't. We should.

>>: It would be interesting to visualize that to see what's kicking in.

>> Pascal Fua: Yeah.

>>: So I remember with the old Ferns stuff -- actually even the trees before the Ferns -- you trained them to maximize the discriminative power. Have you tried to do any of that training?

>> Pascal Fua: The point of BRIEF is to completely get rid of the training. What it's saying, in some sense, is that the training that was included in the Ferns is not all that necessary; you can do without it.

>>: But, I mean, just in the way you choose the tests, did you investigate various different random generation strategies --

>> Pascal Fua: That's true. The complete randomness of the tests is the same in the Ferns and in BRIEF. So we've not -- again, this goes back to what we just discussed. You would think that there should be an optimal way to select the tests, but we haven't found it.

>>: So for the nearest neighbor part, you showed you could make it much faster at the same precision, but can you also do better on precision? The precision you showed for the K-means trees, for instance, was about 0.75. How much better than that can you get?

>> Pascal Fua: You can never -- I mean, in this particular setting you cannot beat the linear search -- it depends how much time you're willing to spend. That's why it's plotted as time per query.

>>: I'm assuming -- where is the curve of the new stuff?

>> Pascal Fua: The new stuff is here.
>>: Okay. And the bottom axis --

>> Pascal Fua: The bottom axis -- the others are various parameter settings, and HKM is the hierarchical K-means.

>>: Do you know, in the case of binary vectors, what portion of using the KD tree is slow? Is it the Euclidean distance?

>> Pascal Fua: It's the Euclidean distance that slows it down.

>>: As opposed to the thick boundaries you were talking about?

>> Pascal Fua: Well, in this case they're slow because, since you compute the means, you don't have thick boundaries anymore; it becomes Euclidean again. So the thick boundary problem goes away, but the price you pay is that now you have to do these floating point computations.

>>: Is this per dimension, as in the standard KD tree setup where you split on one dimension and threshold?

>> Pascal Fua: Yeah, the KD tree -- the one I described is a hierarchical K-means. The one that works best for this problem is the hierarchical K-means.

>>: And the examples you showed in the air were matching aerial photos to each other. Have you done it with ground-level photos?

>> Pascal Fua: For the ANN, have we done this? Yes, we have, because some of these models -- the Lausanne model I showed you, in the video -- included some of the ground-level images, so yes.

>>: Do you have a system now where somebody could walk up, take a picture somewhere in Lausanne, and it will tell you where you're standing by matching against all of the photos?

>> Pascal Fua: We don't have one in the sense of something we'd put on the Web, no.

>>: But if you were to deploy such a system, would it successfully match the image, or are there too many images that come up in retrieval? Because that's a large number of images, probably more than the number of drone images.

>> Pascal Fua: Right. Actually, we haven't really tested that.

>>: At the beginning of the talk you showed us the photos, the whole Lausanne reconstruction. I thought you were heading towards a system which would take an arbitrary new photo and tell you where you're standing.

>> Pascal Fua: We're headed in that direction, but we're not quite there yet.

>>: You said you can store multiple rotated versions of the descriptors to handle rotation. For this Lausanne thing, how many such descriptors?

>> Pascal Fua: That was for BRIEF. I said that about BRIEF, to compensate for the lack of orientation invariance. BRIEF we really used on these indoor applications -- it's more for these augmented-reality-on-your-cell-phone kinds of things we've been playing with. That's what I've been trying to explain: it's adapted for that, but it's truly not adapted for the large-scale stuff.

>>: For the large-scale stuff, what do you use instead?

>> Pascal Fua: Initially we used SIFT like everyone else, and now that we've trained LDAHash, we use LDAHash.

>> Rick Szeliski: Any other questions? Okay. Well, thank you once again.

[applause]