>> Konstantin Makarychev: So it's a great pleasure to introduce today's speaker, Ilya Razenshteyn, a graduate student at MIT working under the supervision of Piotr Indyk. He has done a lot of work on nearest neighbor search, and today he will tell us about some of his recent work on this topic.

>> Ilya Razenshteyn: Thanks for the introduction. So I'll talk today about two recent papers of ours. These papers are joint with Alex Andoni from Columbia, Piotr Indyk, who is my advisor at MIT, Thijs Laarhoven, who is a graduate student at Eindhoven, and Ludwig Schmidt, who is another graduate student at MIT. So this is my talk outline. Essentially I'll talk about these two results eventually, but both of them have a pretty large common prefix, which I'll start with. So I'll start with defining the problem that we'll be solving, and the problem is actually very easy to formulate. It's called near neighbor search. It has different names, but the basic idea is the following. You have a data set, which is n points in R^d, and you have a certain distance threshold R. What you want is to preprocess your data set so that, given a query, you can find a data point within distance R from your query. The parameters that we will mostly care about are the space that your data structure occupies and the query time. Another parameter that we would naturally care about is preprocessing time, but usually, if you can bring the space down to something reasonable, then the preprocessing can usually be made relatively fast as well, so let's not worry about it for now. So at least in some scenarios, there are great data structures for this problem. A model case would be something like this: if all of your points are on the plane and the distance is Euclidean, then what you can do is build the Voronoi diagram and then, given a query, just perform a point location query in this diagram, and that will give you the nearest neighbor. And so, using more or less textbook algorithms and data structures, you can actually get logarithmic query time for this case. Unfortunately, approaches like these are completely infeasible in high dimensions. The problem is, if you do something based on Voronoi diagrams, or more generally, whatever data structures we know for the high-dimensional case, they require space that is exponential in the dimension. And that's definitely not that great. At the same time, I would argue that all the fun in some sense happens in high dimensions. Many applications that are interesting are definitely not on the plane, so it would be nice to do something about them. And what we will do is, well, as good theoreticians, we will change the problem that we are solving, and so instead of exact nearest neighbors, we will be happy with approximate nearest neighbors. The formal definition is like this. In addition to the data set and the distance threshold, we have an approximation factor C, which is some real number larger than one. And now we have the following question. Suppose that I give you a query with the promise that there will be at least one data point within distance R. Then I want you to return any data point within distance CR from the query. So basically, there are two balls, and I know that there is at least one data point within the small ball, but I would be happy with any data point within the large ball. So is the definition clear? Good. So, yeah.
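[Aside: a minimal brute-force sketch of the (c, R)-approximate near neighbor guarantee just defined, assuming Euclidean distance and NumPy; the function and variable names are illustrative, not from the talk.]

```python
import numpy as np

def brute_force_ann(data, query, R, c):
    """Return the index of any point within distance c*R of `query`.
    Under the promise that some data point lies within distance R,
    any such point is an acceptable answer; otherwise may return None."""
    dists = np.linalg.norm(data - query, axis=1)   # distances to all n points
    ok = np.flatnonzero(dists <= c * R)            # anything in the large ball qualifies
    return int(ok[0]) if ok.size else None
```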
>> Ilya Razenshteyn: So near neighbor search, or similarity search, whatever you call it, has quite a few applications. The most obvious applications are similarity search for all sorts of different data: images, [indiscernible], text, biological data, and so on. But there are a couple of applications of a different sort, so let me just briefly mention them. There is a recent one in cryptanalysis. Namely, it turns out that nearest neighbor search can be applied in practice to solving the shortest vector problem in lattices, and that can give you a pretty good speedup. And another application is in optimization. Nearest neighbor search was used to speed up different methods for optimization such as coordinate descent and stochastic gradient descent. So [indiscernible] results. And actually, in this talk, we'll be mostly looking at a specific case of nearest neighbor search -- not for the whole talk, we will consider the general case as well -- but an important special case is when all of your points and queries lie on the unit sphere in R^d. For some reason, it will be convenient to look at this case, and it's actually relevant for both theoretical and practical reasons. In theory, it turns out that we can reduce the general case to the spherical case, and I'll show this reduction -- well, at least I'll allude to this reduction later in the talk. And in practice, Euclidean distance on the sphere corresponds to cosine similarity, which is widely used by itself; but also, even if you don't want cosine similarity and you want genuine Euclidean distance, sometimes you can pretend that your data set lies on the sphere and you wouldn't lose much by doing that. Yeah. So even more specifically, a special case that is good to have in mind would be the spherical random case. What's the setup here? My data set is not only points on the sphere but random points on the sphere, just n points chosen uniformly at random. And I generate queries as follows: I take a random data point and plant a query within 45 degrees, say, from that data point, at random. What it looks like is something on this picture. Basically, if I have a query, then I'll have a near neighbor within 45 degrees, just because I generated it like this, but all the other data points will be tightly concentrated around 90 degrees, just because if you sample n points on the sphere, they will be pairwise almost orthogonal, unless you have too many of them. Just keep this case in mind. It will be nice to illustrate our algorithms on this case, and in a certain sense, it will be the core case, as we'll see later. And again, in practice, this concentration of angles around 90 degrees is not uncommon to see. So this case is also relevant for practice somehow. Any questions about it?

>> [Indiscernible].

>> Ilya Razenshteyn: No. N is much larger than d, but let's say it's not exponential. Let's say it is subexponential in d, say, two to the square root of d, something like this. Of course, if you have lots of points, then you will not have that good concentration, but, yeah. Okay. So that's it for the problem definition. Now let me introduce locality-sensitive hashing. So if you have any questions, maybe you should ask them now. Okay. So what is locality-sensitive hashing? This is a technique introduced by Indyk and Motwani in 1998, and it is a way to solve the near neighbor problem in high dimensions. And the basic idea is pretty intuitive.
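[Aside: before the LSH definition that follows, here is a small sketch of the random spherical instance described above -- uniform points on the sphere with a query planted at 45 degrees -- so the concentration of the remaining angles around 90 degrees can be checked empirically. Illustrative NumPy only; names and sizes are assumptions.]

```python
import numpy as np

def random_sphere_points(n, d, rng):
    """n uniformly random unit vectors in R^d (normalized Gaussians)."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def plant_query(p, angle, rng):
    """A random unit vector at the given angle from the data point p."""
    r = rng.standard_normal(p.shape)
    r -= (r @ p) * p                       # component orthogonal to p
    r /= np.linalg.norm(r)
    return np.cos(angle) * p + np.sin(angle) * r

rng = np.random.default_rng(0)
data = random_sphere_points(10_000, 128, rng)
q = plant_query(data[0], np.pi / 4, rng)
angles = np.degrees(np.arccos(np.clip(data @ q, -1.0, 1.0)))
# angles[0] is 45 degrees; the rest concentrate tightly around 90 degrees.
```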
>> Ilya Razenshteyn: So what you want is a random space partition of R^d such that close pairs of points collide more often than far ones -- something in the spirit of this partition into random pieces. The formal definition is the following. I require my random partition to have two properties. If my points are close, then they should collide with decent probability over the random partition, say with probability at least P1. And if they are far, then they should collide not too often, with probability at most P2. And these R and CR are just the thresholds from the definition of approximate nearest neighbor: the close pairs and the far pairs are exactly at the distance thresholds we care about. A useful way of thinking about it: any random space partition would most likely have some dependence of the probability of collision on the distance, and these inequalities just tell you something about two specific points on this plot. So actually, now let me demonstrate one example of LSH, so as not to think of it abstractly but to have a concrete example in mind. It's actually a very useful family, useful for both theory and practice. It was introduced by Charikar in 2002, and it was actually inspired by certain approximation algorithms by Goemans and Williamson, for those who understand. The hashing family looks very simple. It works only for the sphere. If I am on the sphere, I can do the following. I sample a random unit vector uniformly; let's call it r. Then I hash my point into the sign of the dot product of my point and r. Basically, another way of saying it is that we take the sphere and cut it into two equal pieces by a random hyperplane. And it's very easy to compute exactly the probability of collision for two points. If the angle between my two points is alpha, then the probability of collision is just one minus alpha over pi. Why? Because for your hyperplane to separate these two points, the hyperplane needs to pass through the angle between P and Q, and the probability of that is alpha over pi. With the remaining probability they actually collide. So that's why we have this expression; it's an exact formula. And on the plot, it would look something like this. Remember that we care about the random case, right, with 45 degrees. So if the points are at 45 degrees, then the probability of collision is three-quarters, and at 90 degrees it is one-half. That's the typical case we will think about. Okay. So we have this nice simple family in mind. Let's now see how to use it to do similarity search in high dimensions. The first idea you would think about is to just take our hashing family and hash the points using it -- so basically compute the hashes and then, I don't know, look up points with the same hash. But of course that wouldn't work. For instance, for the hyperplane it wouldn't work because you have only two values of the hash. So your points, in the best case, would be split evenly, and in each bucket you would have n over two points, right? So you would need to enumerate n over two points, and that's too much. So a natural extension of this idea: instead of one hash function, use K independent hash functions from my family, and --

>> So what do you think of [indiscernible] query time?
>> Ilya Razenshteyn: Query time, yeah. Let's not worry too much about --

>> And space, but space [indiscernible]. So [indiscernible]?

>> Ilya Razenshteyn: Definitely we want sublinear query time, right? If you just use one hash function, the query time is linear. Of course, I would want it as good as possible. So let's see what we get if we use K hash functions simultaneously instead of one. For one hash function, the probability of collision, as I said before, is just a straight line. But when we start increasing K, it goes down. This is more or less obvious: the probability just gets raised to the power K, because everything is independent. But what's crucial here is that for far points, this probability goes down much faster than for close pairs. And that's exactly the crucial thing. And that's actually it. So what we do is we choose K appropriately -- I'll say in a second how to choose it exactly -- and then we hash our points using tuples of K hash functions simultaneously, and then just enumerate all the points in the same bucket. That's the whole reduction. So let's see what parameters we need to choose and what we get for them. It turns out that the optimal choice of K is such that the collision probability for far pairs becomes of order one over the number of points. Why? Because in this case, for a query, if you look at its bucket, then the number of outliers in this bucket, namely the far points that we don't really care about, is constant on average, just by linearity of expectation: it's n times one over n, which is one, right? So the query time will actually be constant in this case -- well, proportional to the dimension, but I think of the dimension as something small for the sake of this talk -- so we will enumerate a constant number of points on average. And are we done? No. Because we also need to care about the probability for close pairs. We want to find at least one close point with decent probability. And if we just do this whole thing once, then the probability that it collides is three-quarters to the K, which is exactly this point. And if you do the math -- take K from here and put it here -- we get that the probability of success is something like one over N to the 0.42 or something like this. And in order to boost the probability of success to, say, 99 percent, we would need to repeat the whole thing N to the 0.42 times. So then we have L hash tables.

>> Here you want the exact nearest neighbor and you're not --

>> Ilya Razenshteyn: No, no, no. I'm happy with any point within this 45-degree range, let's say. But for the random case, it would be the exact nearest neighbor, yeah. In general, not necessarily. Question? So the overall scheme is like this. We have K times L hyperplanes: in each hash table we have K of them, and we have L hash tables overall. The overall space is something like N to the 1.42 and the query time N to the 0.42. So that's exactly the kind of query time that will be typical for this talk, sort of.

>> [Indiscernible].

>> Ilya Razenshteyn: Kind of, yeah. Some polynomial of them. So are there any questions about this? If that's unclear, it's better to spend some time on it.

>> So one thing I'm not really clear about is that you are doing worst case, right? You can't do [indiscernible].

>> Ilya Razenshteyn: Yeah, yeah, yeah. So it's like worst case over queries.
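[Aside: a compact sketch of the scheme just described -- K random hyperplanes per table, L independent tables, and the candidates are the points that agree with the query on all K signs in some table. The class and any concrete K, L are illustrative; in the talk K is chosen so that far points collide with probability about 1/n per table, and L is about n^0.42 for the 45-degree random instance.]

```python
import numpy as np

class HyperplaneLSH:
    """L hash tables, each keyed by the signs of K random hyperplanes."""

    def __init__(self, data, K, L, rng):
        self.data = data
        self.planes = rng.standard_normal((L, K, data.shape[1]))
        self.tables = []
        for l in range(L):
            keys = (data @ self.planes[l].T > 0).astype(np.uint8)  # n x K sign patterns
            table = {}
            for i, key in enumerate(keys):
                table.setdefault(bytes(key), []).append(i)
            self.tables.append(table)

    def query(self, q):
        best, best_dist = None, np.inf
        for l, table in enumerate(self.tables):
            key = bytes((q @ self.planes[l].T > 0).astype(np.uint8))
            for i in table.get(key, []):            # enumerate the bucket's candidates
                dist = np.linalg.norm(self.data[i] - q)
                if dist < best_dist:
                    best, best_dist = i, dist
        return best
```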
>> Ilya Razenshteyn: So for each query, we succeed with this probability.

>> But not -- I'm wondering, like, why is that important? [indiscernible]. Like maybe --

>> Ilya Razenshteyn: Amortized over what?

>> Like queries. A more amortized thing.

>> Ilya Razenshteyn: So you want to average over queries.

>> Yeah.

>> Ilya Razenshteyn: So to [indiscernible] we don't know any better. In practice, maybe sort of, but in theory --

>> [Indiscernible].

>> Ilya Razenshteyn: Sorry?

>> Something like [indiscernible].

>> Ilya Razenshteyn: Well, yeah.

>> And you are saying that's what, instead of the [indiscernible].

>> Ilya Razenshteyn: Yeah. Yeah. So in practice, succeeding only for good queries kind of makes sense. But in theory, we don't really know any better. So we don't know how to average over queries in some sense. Okay. Good. So this is a pretty simple argument that actually appears in the same paper that introduced LSH. For the random case I showed you concrete numbers, but in the general case, what you can show is that you can always choose the number of tables and the number of hash functions per table so as to get space N to the one plus rho and query time N to the rho, where rho measures the gap between the probabilities of collision for close pairs and for far pairs (rho = log(1/P1) / log(1/P2)). And the proof is exactly the same; just instead of concrete numbers we get this formula. Yeah. Okay. So that's it for the definition of LSH. Now let me show you the optimal LSH construction for the sphere. So can we do better than hyperplanes? A question you could ask for this specific random instance, where I plant queries within 45 degrees: can we beat these bounds, which are roughly square root of N query time and N to the 1.42 space? Can we do better or not? It turns out that we can. We observed this in our previous paper and used it in one of the papers I'm going to talk about, and actually you can do much better: you can improve the query time more than quadratically. You can get query time N to the 0.18 and space N to the 1.18. And this is actually optimal. So this new bound is optimal, unlike the old one. I'll say in a second how it works. But let me just say for now that for the spherical case, we understand the best possible bounds exactly. And of course, I'm again telling you the numbers for the random 45-degree case, but it works for the general case on the sphere as well; the formulas will just be slightly more complicated, which is why I'm not showing them here. But let me show the construction. The construction is actually pretty simple and clean. And again, like the hyperplane family, this is also inspired by certain approximation algorithms, this time by a result of [indiscernible] who used a somewhat similar space partition to round SDPs for coloring. Let me call it Voronoi LSH; I'll explain in a second why we call it that. The construction is fairly simple. We want to hash our points on the sphere. For this, let me sample a certain number of standard d-dimensional Gaussian vectors -- how many to choose is a separate question and not entirely trivial. Each g_i is a d-dimensional vector with i.i.d. N(0, 1) coordinates. And then the hash of my point is the index of the Gaussian whose dot product with my point is the maximum. So pictorially, what happens is something like this. I sample a bunch of Gaussians.
They don't have to lie on the sphere, but their lengths will be approximately equal, so let's think of them as uniform vectors from the sphere; it's not going to matter for this discussion. So I basically sample a bunch of random points on the sphere, and then my sphere gets partitioned according to which Gaussian each point correlates with best. Something along these lines. And that's exactly my space partition. Is the construction clear? Yeah. And let me just observe that if I sample only two Gaussians, it's exactly the hyperplane LSH. Why? Because if I sample two Gaussians, then my partition is just the hyperplane that lies in the middle of these two vectors. So it's exactly equivalent. So this is a natural generalization: instead of just two regions, we might have more than two regions, and we'll see that it will be beneficial for us. Okay. Let's actually compare it with hyperplanes. As I said, one hyperplane is exactly the same as Voronoi LSH with two Gaussians, just because it's exactly the same partition. And it turns out that the right way of comparing them is to compare K hyperplanes -- remember that we basically partition with respect to K independent hyperplanes -- with Voronoi LSH with two to the K Gaussians. Why is that? Well, it turns out to be the right comparison; we'll see in a second why. For now, let me just say that in both cases we have two to the K regions, so at least we can meaningfully compare these things. So for one hyperplane and two Gaussians, there is no difference. But when we start increasing these things, things start getting interesting. Even for two hyperplanes versus four Gaussians, this point is exactly the same on both of these plots -- and that has to do with the fact that in both cases there are two to the K regions -- but at this other point, we start seeing a difference. It turns out that for Voronoi LSH, we get a slightly higher probability of collision for small distances. So it starts kicking in. And then when we increase further, the gap actually widens. When we go to, say, six hyperplanes versus 64 Gaussians, the gap is already pretty non-trivial. And you can do the formal analysis and show that as the number of Gaussians grows, the gap between hyperplane LSH and Voronoi LSH increases, and the exponent that we get approaches that value .18 that I promised you.

>> [Indiscernible] when you do hyperplanes [indiscernible].

>> Ilya Razenshteyn: That will be the problem that I'll cover in the second half of the talk. And yes, that's a problem. What Sergei said is that for K hyperplanes, we can essentially decide where our point lies in roughly K operations, while for the Gaussians we would need two to the K operations. Yeah. So the improved exponent comes with a cost. But let me just say that for the sake of theory, it doesn't matter; we will choose parameters so that it wouldn't matter much. Yeah. So okay. So that's actually it. This is the optimal LSH construction for the sphere. And now let me tell you how to use it to get the state-of-the-art algorithm for --

>> One question. So there is computing [indiscernible]?

>> Ilya Razenshteyn: Yeah. Yeah. So think of K as being log N, something like this.

>> [Indiscernible] dimension, okay.

>> Ilya Razenshteyn: Yeah. So it's like proportional to log N. What I'm mostly worried about in this talk is factors like N to the epsilon.
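[Aside: a minimal sketch of the Voronoi LSH hash evaluation just discussed -- sample a set of Gaussian vectors and hash a point to the index of the Gaussian with the largest dot product. With two Gaussians this coincides with a single hyperplane; one evaluation costs about the dimension times the number of Gaussians, which is exactly the issue raised in the question above. Names and sizes are illustrative.]

```python
import numpy as np

def voronoi_hash(point, gaussians):
    """Hash = index of the Gaussian with the largest dot product.
    `gaussians` has shape (T, d), so one evaluation costs O(d * T)."""
    return int(np.argmax(gaussians @ point))

rng = np.random.default_rng(1)
d, T = 128, 64                      # T = 2 recovers plain hyperplane LSH
gaussians = rng.standard_normal((T, d))
p = rng.standard_normal(d)
p /= np.linalg.norm(p)
print(voronoi_hash(p, gaussians))
```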
>> Ilya Razenshteyn: So if something is subpolynomial, it is a constant for the sake of this talk. In practice, it of course might [indiscernible]. I'll talk about it. Yeah.

>> Thank you.

>> I want to try and understand one thing. So you're talking about this uniform random case, and then you also made a comment about the general case.

>> Ilya Razenshteyn: Yes.

>> And you said that this would work there as well.

>> Ilya Razenshteyn: Yes.

>> But it's only optimal in -- the optimality proof is in the uniform random case?

>> Ilya Razenshteyn: So the optimality proof is for the general case, but for the distance thresholds that correspond to the random case -- namely, square root of two versus square root of two over the approximation factor.

>> What is square root of two?

>> Ilya Razenshteyn: Square root of two is the typical distance between two random points on the sphere.

>> Okay.

>> Ilya Razenshteyn: So you can think of our optimality proof as showing that the construction is optimal for the random case, if you want. And it immediately implies that it's optimal for the arbitrary case if your distance thresholds correspond to the random case. But whether this construction is optimal for two arbitrary distance thresholds has not yet been proved, although I conjecture that this is the case. So that's the exact state of things.

>> And you still draw the Gaussians uniformly.

>> Ilya Razenshteyn: Yes.

>> Even if the data is somehow skewed.

>> Ilya Razenshteyn: Even if the data is somehow skewed -- but this is optimality for LSH. Actually, I'll cover in a second how to do better when your data is skewed, but you would need to do something else. We will see. But yeah, it's a very good point. It's exactly what I'm going to talk about. So now let me tell you about our first result, which appeared earlier this year. So far I talked about the sphere; now I'm going to slightly switch gears and talk about the whole of R^d, which is the more general case. You can ask: what are the best bounds on LSH that you can get? It turns out that for Euclidean distance -- and also for Hamming distance, which I'm not going to talk much about, but it's still very interesting to see what happens there -- we know exact bounds on the exponent that you can get. For Euclidean distance, the right bound is one over C squared, where C is my approximation factor, remember, and for Hamming distance, it's one over C. So in particular, for approximation two, we get query time something like N to the one-fourth for Euclidean distance and square root of N for Hamming distance. And that was established over a sequence of works spanning quite some time, actually. So we know exactly the best bounds for LSH for L2 and L1. Yeah. And just let me briefly recall that one-half here means space N to the three-halves and query time square root of N. So can we do better than LSH? Yes, we can, and that's exactly the main point of what I'm going to talk about. How can we do better than LSH? The basic idea is again to use space partitions, but the crucial idea is to use space partitions that depend on your data. Remember that my definition of LSH was actually pretty strong: I required these two conditions for every pair of points. For every P and Q, if the distance is small, then they should collide with probability at least P1; if the distance is large, then with probability at most P2, right? But actually, we don't need that.
So for the reduction, what we need is to make sure that these conditions hold when one of the points is a data point. And that gives us the flexibility. So maybe we can look at our data set before building the hash family and just cook up some nice hash family that works well for this data set, right? And that's exactly what we do. But interestingly enough, not only does it work for a nice data set, it actually gives an improvement for every data set. So you can say that every data set has some structure to exploit, informally speaking. And now let me tell you our results. Basically, we get optimal data-dependent space partitions -- optimal after proper formalization; it's a little bit subtle, but let's not worry about that for now. And what we get is an almost quadratic improvement. For Euclidean distance, we get, say for approximation two, one over seven instead of one over four. And for Hamming distance, we get one-third instead of one-half. And let me say again that these bounds are optimal for data-dependent LSH, if you formalize it properly. So what's the main idea? Basically, our algorithm consists of two steps. First, I'll show you how to handle random data sets -- random in the sense I described. The spoiler is that for a random data set, Voronoi LSH works well and gives the better bounds. This step is completely data-independent: if you know that your data set is random, you just use Voronoi LSH and apply the standard reduction. The second part, which is more interesting and is really the main point, is how to take any worst-case data set, which may not look random at all, and partition it into parts that are, for the purposes of our algorithm, essentially random. That step is data-dependent, and it will exactly address your question about skewed data. But okay, let's first look at the random case. So what it means is that we have a sphere and our points and queries are random. We use the fact that distances are concentrated: if you have a sphere of radius R, then distances are concentrated around square root of two times R, and Voronoi LSH gives you the right exponent. It gives you exactly the improved bound, one over two C squared minus one, which is what we eventually want to get for every data set. But if your data set doesn't look random, then Voronoi LSH is actually suboptimal. A good example: if your points are, say, clustered and lie in a small region of the sphere, then Voronoi LSH doesn't do that great and doesn't give good results, and we need to do something about it. And that's exactly where the second part comes in: how to reduce the general case to the randomly-looking case. If something doesn't look random, let's make it look random forcefully. Basically, we need to remove structure, and what I mean by structure here is low-radius clusters that contain lots of points. I'll say in a second what it means to be low radius, but at least conceptually it should be pretty clear. So if we have any low-radius dense clusters, we just take them away, something like this. I'll show how to deal with them on the next slide. Of course, we need to do something about them, because what if our near neighbor is one of those points, right? But for now, let's not worry about it. Let's say that we just removed everything.
And the crucial thing is that the remainder pretty much looks like a random set. We know that there are no dense areas anymore and the points are kind of spread out, so we can apply Voronoi LSH to it. And then recurse. By recurse, I mean that for each region of the partition we do the same thing. Dense clusters can appear again, because the definition was relative: since each part now has way fewer points, we can again potentially have dense clusters; we take them away, and again recurse. So now, before I tell you what to do with the clusters, let me tell you how we process queries. For queries, we do the following. We first query every single cluster. There will not be many of them -- we choose parameters so that there is a relatively small number of them, so we can afford to query every single one. And for the Voronoi LSH partition, we query the one part where our point lies -- for example, this one -- and recurse into that part. So that's the whole thing. It remains to say what we actually do with the clusters. I haven't said yet, and that's actually very crucial. For the clusters, we observe the following. Now it's time to say what exactly it means to be low radius. By low radius, I mean something that is slightly smaller than half of the sphere: I declare a cluster to be low radius if it fits into a spherical cap of radius square root of two minus epsilon, times R. So it's slightly non-trivial, but slightly smaller than half of the sphere. And the crucial thing is that we can enclose such a cap in a somewhat smaller ball, smaller by a factor of one minus epsilon squared. And that's great. Why? Because we can recurse with the reduced radius, and as I'll explain, we actually make progress by doing that. So let me state the overall algorithm again. For a cluster, we reduce the radius, and after several such reductions the problem essentially becomes trivial, for certain reasons. And for the random remainder, Voronoi LSH works well. That's conceptually how we handle the different cases. And at some level, what we get can be seen as a decision tree. We start with a root. Then we take out dense clusters. Then we have the random remainder, which we partition using Voronoi LSH, and then we recursively do the same thing for everything. When we query, we can potentially go into several branches: we query all the clusters and one part of the partition, and it continues branching. And the parameters we get are the following. One can show that the tree occupies nearly linear space, and the branching during a query can be bounded by some subpolynomial function. And that's great. Of course, as before, one tree would not be enough, because it would give you only a polynomially small probability of success, so we need many of them to succeed with probability 99 percent. So that's actually it -- any questions about it? Yeah. So the one-line summary is that Voronoi LSH works great for random data sets, and if something doesn't look random enough, we just make it random. Okay. So now let me tell you a little bit about our second result, which is how to make Voronoi LSH practical. That has to do with Sergei's question, and that's our NIPS paper. Is Voronoi LSH practical? No. Why? Because, on the one hand, convergence to the optimal exponent is very slow, so we need lots of Gaussians to make it [indiscernible] good.
And at the same time, the evaluation time is roughly the dimension times the number of Gaussians. And that's bad. Even, say, 64 Gaussians is already pretty much impractical, and that wouldn't bring us even close to the optimal exponent. So can we do anything about it? It would be nice to do something, because hyperplane LSH is actually used quite a bit, in various forms, in practice, and it would be nice to use the theory to get some practical improvements. Can we do something? Yes. That's exactly the point of the second part. So let's make Voronoi LSH practical step by step. The first step is to make our set of vectors a little bit more structured. Voronoi LSH samples a bunch of random vectors, and that is not necessarily great, because that is what makes the evaluation slow. Let's make it less random, less arbitrary in some sense. There was a very nice paper by [indiscernible] who proposed such a scheme. They didn't analyze it, but at least they proposed a possible improvement to Voronoi LSH, and what they proposed is, instead of random vectors, to use the plus-minus basis vectors. To hash your point on the sphere, you perform a random rotation, and then, after the random rotation, you find the closest plus or minus basis vector. So, for example, in dimension two, we partition everything into four parts. And in general, in high dimensions, we will have the cross-polytope: for dimension d, we have 2d parts. Yeah. So in this paper, we actually analyze this scheme for the first time and show that it gives almost the same quality as Voronoi LSH with 2d Gaussians -- again, 2d Gaussians because it's the same number of parts. So essentially we show that moving from Gaussians to this structured set of vectors, if you do a random rotation first, gives almost the same result. And in a way, you can think of it as [indiscernible], actually. The exponent improves as the dimension grows, because the number of Gaussians effectively grows, right? And it's still not that great, because the random rotation is expensive: applying a random rotation takes D squared time, and storing it also takes D squared reals. So it seems that we haven't made much progress, but in fact, we have -- wait for the next slide. The way we made progress is that at least the second step, finding the closest plus-minus basis vector, is now cheap: you can do it in one pass over the coordinates, so it takes D time instead of D squared. That's exactly the progress. So the second idea is to use pseudo-random rotations. As I said, the bottleneck is storing and applying random rotations, and that is expensive. So instead, we use pseudo-random rotations, which were introduced in a paper by Ailon and Chazelle, and since then they have been used in many other places, in both theory and applied papers. It's a very beautiful idea. If you want to learn one idea from this talk, I want it to be --

>> [Indiscernible].

>> Ilya Razenshteyn: Sorry? No, no, no, it's like -- you'll see. It's really nice. So we want to do something that serves roughly as a random rotation, but without doing the whole random rotation. What we do is the following. Instead of a random rotation, let's do the Hadamard transform. What is the Hadamard transform? It's a certain orthogonal map that preserves Euclidean norms, so it is a certain rotation. And it has two properties.
One property is that it mixes well -- I'll say in a second what that actually means -- but what's more crucial is that it's fast: we can compute it in time D log D. So what is the Hadamard transform? It's a recursively defined matrix: the order-zero Hadamard transform is just the number one, and then it gets replicated four times, with the signs flipped in one of the four blocks. So it's basically a plus-minus-one matrix with pairwise orthogonal rows and columns. This is nice, but of course it's a deterministic map, and we want to inject some randomness. The crucial idea in the Ailon and Chazelle paper was to flip signs at random before applying the Hadamard transform. Basically, for every coordinate we toss a coin; for, say, heads, we flip the sign, and for tails, we don't do anything. Then we apply the Hadamard transform. For their application, that was enough, but for our application, it's not quite enough. Why? Because, for example, suppose that we started with a one-sparse vector -- a vector with only one non-zero coordinate, say the first basis vector. Then if I flip the signs, it is still plus or minus the first basis vector, and even after applying the Hadamard transform it is one of only two possible vectors, and that's not good enough for us. But what is good enough for us is to repeat this whole thing a couple of times. And then it works -- with the caveat that we don't know how to prove that it works, but it works empirically, and I conjecture that it actually works in theory as well; I just don't know how to prove it. And that is pretty much it. The overall hashing scheme is to perform two or three rounds of sign flips plus Hadamard, and then find the closest vector among the plus-minus basis vectors, which essentially boils down to finding the maximum coordinate or something like this. And the evaluation time becomes D log D instead of D squared, and that's exactly where we save a lot. And again, this is the statement I don't yet know how to prove, but empirically it seems pretty much equivalent to cross-polytope LSH with truly random rotations, which is rigorously equivalent to Voronoi LSH with 2d Gaussians. So is it clear? Yeah.

>> Just trying to understand the end-to-end thing. So you have these transformations, these two or three rounds of flipping and Hadamard. And then -- so you have a query point, you run it through this, and then what do you get? You get the closest vector. Then do you -- the data is already [indiscernible] in a way that you are [indiscernible] a set of points and one of them is close.

>> Ilya Razenshteyn: The good way of thinking about it is that we just plug it into the reduction that I described. For every hash table, you would have several of those things. So think of it as a hash: it takes a point and gives you a number -- from one to 2d in this case -- namely which plus or minus basis vector is the closest. And then you do the same as I described before: you take several of these hashes, compute all of them for the query, look up the corresponding bucket, and retrieve all the data points from there.

>> So I have the union of many buckets?

>> Ilya Razenshteyn: Within one table it's an intersection, because you want to collide on all the K hash functions. And then across tables you do the union. And then the overall thing you repeat many times to boost the probability of success to 99 percent, whatever you want.
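[Aside: a sketch of the fast cross-polytope hash just described -- two or three rounds of random sign flips followed by a Hadamard transform, then the index (with sign) of the largest coordinate. The fast Walsh-Hadamard routine below is a standard textbook implementation shown for illustration, and it assumes the dimension is a power of two (pad otherwise); it is not the code from the paper.]

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of two."""
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def crosspolytope_hash(point, sign_flips):
    """A few rounds of (random sign flips, Hadamard), then the closest
    +/- basis vector, i.e. the signed index of the largest coordinate."""
    x = point.astype(float)
    for s in sign_flips:                 # each s is a +/-1 vector, one per round
        x = fwht(x * s)
    i = int(np.argmax(np.abs(x)))
    return i if x[i] > 0 else i + len(x)     # 2d possible hash values

rng = np.random.default_rng(2)
d = 128                                  # assumed to be a power of two
sign_flips = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
p = rng.standard_normal(d)
p /= np.linalg.norm(p)
print(crosspolytope_hash(p, sign_flips))
```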
>> And then in this union, you just --

>> Ilya Razenshteyn: You try all the points.

>> You try all of them and just find the closest one.

>> Ilya Razenshteyn: Yeah.

>> How many points are in that union?

>> Ilya Razenshteyn: We set parameters such that in one hash table you have one, or five, or some constant number of far points, and everything else is close, so we're happy with those. And for the union, the number of bad points is roughly the same as the number of hash tables, so it's N to some power.

>> How many hash tables?

>> Ilya Razenshteyn: Let me actually go back to the slide where I show it for the hyperplanes; for this LSH it's the same.

>> [Indiscernible]?

>> Ilya Razenshteyn: No. We compute, for every hash table, three times K Hadamard transforms. So in total, for a query, we have three times K times L Hadamard transforms.

>> You recompute it.

>> Ilya Razenshteyn: Yeah. You need, right, you need to recompute your [indiscernible]. If you [indiscernible], that would be very nice; that would save a lot. Yeah. So this slide. Basically, now think of it as using our Hadamard cross-polytope hash instead of the hyperplane hash. We hash our point using K independent hash functions from our family -- this is the parameter K -- and we also have the parameter L, which is how many tables we have. So in total we have K times L functions and L hash tables. For the hyperplanes, we had N to the 0.42 tables. For the cross-polytope, we would have had N to the 0.18, some relatively small polynomial. And K is something like log N, okay? So it's exactly the same reduction; just instead of hyperplanes we use the cross-polytope, and it works much better. Okay, let's go back. I have ten minutes, right? Okay. So now let me turn to what is actually quite a big issue: memory consumption. There are lots of papers that do similarity search in high dimensions in practice, and many of them use LSH as a baseline. And in many papers, you see statements like this: LSH is terrible because it consumes lots of memory. Let's try to figure out whether that's true. We can actually do the math and compute exactly how many tables we would need, for example, for hyperplane LSH on that random instance I told you about with 45 degrees. If you have a million points and your queries are within 45 degrees at random, then to succeed with probability 0.9 you would need 725 tables. That's terrible, actually. It's just a million points. Come on, that's nothing, right? Now we care more about billions of points, if not more. Replicating the whole thing 725 times is not what we want. But there is a very nice solution for this. It's called multiprobe LSH, and it was introduced in a very nice VLDB 2007 paper. Basically, the idea is this -- I'm not going to tell you [indiscernible], but the idea is that in each table, we can query more than one bucket. Of course, the best bucket to query is the one that collides with all our hashes. But intuitively, we might also want to look at buckets that almost collide -- say, they collide on all coordinates except one, or something like this. They have a very nice heuristic way of doing it.
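[Aside: a toy illustration of the multiprobe idea just mentioned, shown for hyperplane LSH rather than the cross-polytope scheme developed in the paper: besides the query's own bucket, also probe buckets whose key differs in the signs the query is least sure about, i.e. the hyperplanes with the smallest |dot product|. Illustrative only.]

```python
import numpy as np

def multiprobe_keys(q, planes, extra_probes):
    """Buckets to probe for query q: its own key, plus keys obtained by flipping
    the signs the query is least confident about (smallest |dot product|)."""
    dots = planes @ q                          # K dot products with the hyperplanes
    key = (dots > 0).astype(np.uint8)
    probes = [bytes(key)]
    for j in np.argsort(np.abs(dots))[:extra_probes]:
        flipped = key.copy()
        flipped[j] ^= 1                        # flip one near-boundary sign
        probes.append(bytes(flipped))
    return probes
```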
>> Ilya Razenshteyn: Eventually there appeared theory papers that analyze something similar to this multiprobe LSH, but what they analyze is way less practical; in practice, you would want to use multiprobe LSH. And one of the contributions of this paper is a similar scheme for the cross-polytope. It's a little bit more tricky. The main source of trickiness is that for hyperplanes, you have only two regions, so you just need to decide whether to flip or not for each hash. Here, we need to decide by how much [indiscernible]. It can be done. So we did quite a few experiments; let me just show one of them. It's on a data set of features for -- I think it's called ImageNet or something, I'm not sure. It's basically a certain data set of images, and from that data set features were computed. The parameters are: we have a million points and dimension 128. The linear scan takes 38 milliseconds. It's actually not that great a data set, because the linear scan is already pretty fast. But nevertheless, compared to eight milliseconds, this whole thing improves things quite a bit. So with hyperplane LSH and multiprobe, you can improve by a factor of two, and cross-polytope improves things even further. This is not the biggest gap between hyperplane and cross-polytope that I could show you, but even here, it already works better. And in practice, you would not just take your data set and apply, say, cross-polytope LSH directly. In fact, it turns out that for this specific data set, you can look at it, stare at it a little bit, cluster it a little bit, recenter, and then actually improve the results for both hyperplane LSH and cross-polytope LSH, and here the gap is actually a little bit wider: cross-polytope benefits a little bit more from it. And we have other experimental results; if you are interested in applying something like this to your applications, read our paper. So these are just some of the big numbers.

>> [Indiscernible]?

>> Ilya Razenshteyn: Oh, yeah. It's a great question. We require it to use the same amount of memory as the data set itself, so it's roughly double the size of the data set, which is actually pretty reasonable. So you wouldn't use 725 tables or anything like that. If you use more memory, actually both hyperplane and cross-polytope benefit from it. Yeah. So okay. So that's pretty much it. There are actually a lot of open questions here. Some of them are hard, some of them are easy, some of them are meaningful, some not so much. But what I showed in this talk -- essentially I showed you two results: optimal data-dependent hashing for the whole of L2, which is the theory result, and practical and optimal LSH for the spherical case, which is more applied. And I'd say the main open question, which I really like and have no idea how to approach, is to have a practical version of our worst-case-to-random reduction. That would essentially make the first bullet point practical. Whether it is possible to do, I don't know, but it would be very nice. Yeah.

>> So all of this is again finding one point that's close enough. What if I want to find all the close points?

>> Ilya Razenshteyn: You want everything within a certain distance threshold?

>> Up to approximation, whatever is reasonable. So I'm willing to accept some approximation, but I want to approximately find all of the nearest neighbors.

>> Ilya Razenshteyn: Good.
So here, the analysis shows that every point from the set that you care about is recovered with probability 99 percent. So it means that, on average, we'll recover 99 percent of all the close points.

>> So the set that you get at the end of the day, it's already pretty much all the members.

>> Ilya Razenshteyn: Pretty much. It is 99 percent of them. And you can essentially repeat this many times and push this to whatever you want.

>> The hash --

>> Ilya Razenshteyn: Of course, if you do something like this, your running time will depend on how large this set is. But you can do better than just the size of the set: you would get something like N to the rho, plus the size of the set times something. Any more questions?

>> Again, maybe this is the same as my previous question, maybe not. So what if I want to find the single closest point?

>> Ilya Razenshteyn: Like the absolute closest?

>> The absolute closest, with high probability. 99 percent of the close ones are nearby -- then I understand, then --

>> Ilya Razenshteyn: Yeah. But you don't necessarily find the closest point, because -- again, this is how the analysis goes: you say that in your bucket there are essentially no far points. So it means that you will very quickly find some close point, but not necessarily the closest one. Actually, finding the exact closest neighbor in theory is very hard. I wouldn't expect it to be possible in high dimensions in strongly sublinear time, unless your space is huge. But what you can often do in practice is to say: look, my data set is such that when I have a query, I have an exact closest neighbor, no approximation, and then there are not that many points which are not much further than that. And that often happens, actually. For instance, in this experiment -- oh, yeah, I should have told you that here, the times are for finding, with probability 0.9, the exact nearest neighbor, no approximation. And this data set has the property that there are not that many approximately closest points; everything else is concentrated much further away. So in this case, that's actually your best bet, and that is something you can often see in practice. But other than that, not really, unless you accept the dependence on the dimension.

>> So there are lower bounds for exact --

>> Ilya Razenshteyn: So you can, for instance, show that if you could do polynomial preprocessing and strongly sublinear query time, you could do better than [indiscernible]. It's very easy to show. So there are lower bounds. They are not that [indiscernible], because the assumption that you can do better than two [indiscernible] is strong, but we don't know how to do it, at least. So I wouldn't expect it to be possible, at least [indiscernible].

[Applause]