>> Konstantin Makarychev: So it's a great pleasure to introduce today's
speaker, Ilya Razenshteyn, a graduate student at MIT working under the
supervision of Piotr Indyk. He has done a lot of work on nearest neighbor search, and today he will tell us about some of his recent work on this topic.
>> Ilya Razenshteyn: Thanks for the introduction. So I'll talk today about two recent papers of ours. These papers are joint with Alex Andoni from Columbia, Piotr Indyk, who is my advisor at MIT, Thijs Laarhoven, who is a graduate student at Eindhoven, and Ludwig Schmidt, who is another graduate student at MIT. So this is my talk outline. Essentially I'll talk about these two results eventually, but both of them have a pretty large common prefix, which I'll start with. So I'll start with defining the problem that we'll be solving, and the problem is actually very easy to formulate. It's called near neighbor search. It has different names, but the basic idea is the following. You have a data set, which is n points in R^d, and you have a certain distance threshold R. What you want is to preprocess your data set so that, given a query, you can find a data point within distance R from the query. The parameters we will mostly care about are the space that the data structure occupies and the query time. Another parameter we would naturally care about is preprocessing time, but usually, if you can bring the space down to something reasonable, then preprocessing can usually be made relatively fast as well, so let's not worry about it for now. So at least in some scenarios, there are great data structures for this problem. A model case would be something like this: if all of your points are on the plane and the distance is Euclidean, then what you can do is build a Voronoi diagram and then, given a query, just perform a point location query in this diagram, and that will give you the nearest neighbor. And using more or less textbook algorithms and data structures, you can actually get logarithmic query time for this case.
So unfortunately, approaches like these are completely infeasible in high dimensions. The problem is that if you do something based on Voronoi diagrams -- and more generally, with whatever data structures we know for the high-dimensional case -- they require space that is exponential in the dimension, and that's definitely not that great. At the same time, I would argue that all the fun, in some sense, happens in high dimensions. Many applications that are interesting are definitely not on the plane, so it would be nice to do something about them. And what we will do, as good theoreticians, is change the problem that we are solving: instead of exact nearest neighbors, we will be happy with approximate nearest neighbors. The formal definition is like this. In addition to the data set and the distance threshold, we have an approximation factor C, which is some real number larger than one. And now we have the following question. Suppose that I give you the query with the promise that there will be at least one data point within distance R. Then I want you to return any data point within distance CR from the query. So basically, there are two balls, and I know that there is at least one data point within the small ball, but I would be happy with any data point within the large ball. So is the definition clear? Good. So, yeah. So near neighbor search, or similarity search, whatever you call it, has quite a few applications. The most obvious applications are similarity search for all sorts of different data: images, text, biological data and so on. But there are a couple of applications of a different sort, so let me just briefly mention them. There is a recent one in cryptanalysis: namely, it turns out that nearest neighbor search can be applied in practice for solving the shortest vector problem in lattices, and that can give you pretty good speedups. And another application is in optimization: nearest neighbor search was used to speed up different methods for optimization such as coordinate descent and stochastic gradient descent. So there are results along these lines. And actually, in this talk, we'll be
mostly looking at a specific case of nearest neighbor search -- not for the whole talk, we will consider the general case as well -- but an important special case is when all of your points and queries lie on the unit sphere in R^d. For some reason, it will be convenient to look at this case, and it's actually relevant for both theoretical and practical reasons. In theory, it turns out that we can reduce the general case to the spherical case, and I'll show this reduction -- well, at least I'll rely on this reduction later in the talk. And in practice, Euclidean distance on the sphere corresponds to cosine similarity, which is widely used by itself; but even if you don't want cosine similarity and you want genuine Euclidean distance, sometimes you can pretend that your data set lies on the sphere and you wouldn't lose much by doing that. Yeah. So even more specifically, a special case that
is good to have in mind is the spherical random case. So what's the setup here? My data set is not just points on the sphere but random points on the sphere, chosen uniformly at random, n times. And I generate queries as follows: I take a random data point and plant a query within 45 degrees, say, of that data point, at random. And what it looks like is something like this picture. So basically, if I have a query, then I have a near neighbor within 45 degrees just because I generated it like that, but all the other data points will be tightly concentrated around 90 degrees, just because if you sample n points on the sphere, they will be pairwise almost orthogonal, unless you have too many of them. Just keep this case in mind. It will be nice to illustrate our algorithms on this case, and in a certain sense, it will be the core case, as we'll see later. And again, this concentration of angles around 90 degrees is not uncommon to see in practice, so this case is also relevant for practice. So any questions about it?
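[Editor's note: a minimal sketch, not from the talk, of this random spherical instance -- n uniformly random unit vectors in R^d, plus one query planted at exactly 45 degrees from a randomly chosen data point. All names and parameter values are illustrative.]

```python
import numpy as np

def random_unit_vectors(n, d, rng):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def planted_query(data, angle_deg, rng):
    p = data[rng.integers(len(data))]        # pick a random data point
    r = rng.standard_normal(p.shape)
    r -= r.dot(p) * p                        # component orthogonal to p
    r /= np.linalg.norm(r)
    a = np.deg2rad(angle_deg)
    return np.cos(a) * p + np.sin(a) * r     # exactly angle_deg away from p

rng = np.random.default_rng(0)
data = random_unit_vectors(100_000, 128, rng)
q = planted_query(data, 45.0, rng)
angles = np.degrees(np.arccos(np.clip(data @ q, -1.0, 1.0)))
# One angle is 45 degrees; the rest concentrate tightly around 90 degrees.
```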
>> [Indiscernible].
>> Ilya Razenshteyn: No. n is much larger than d, but let's say it's not exponential -- let's say it's subexponential in d, say two to the square root of d, something like this. Of course, if you have lots of points, then you will not have that good a concentration, but, yeah. Okay. So that's it for the problem definition. Now let me introduce locality-sensitive hashing. So if you have any questions, maybe you should ask them now. Okay.
So what is locality-sensitive hashing? This is a technique introduced by Indyk and Motwani in 1998, and it is a way to solve the near neighbor problem in high dimensions. The basic idea is pretty intuitive. What you want is a random space partition of R^d such that close pairs of points collide more often than far ones -- something in the spirit of this partition into random cells. The formal definition is the following. I require my random partition to have the following two properties. If my points are close, then they should collide with decent probability over the random partition, say with probability at least P1. And if they are far, then they should not collide too often -- with probability less than P2. And these R and CR are just the distance thresholds from the definition of approximate near neighbor: they say exactly which pairs count as close and which count as far. A useful way of thinking about it: if you have some random space partition, it will most likely have some dependence of the probability of collision on the distance, and then these inequalities just tell you something about two specific points on this plot. So now let me demonstrate one example of LSH, so as not to think of it abstractly but to have a concrete example in mind. It's actually a very useful family, for both theory and practice. It was introduced by Charikar in 2002, and it was inspired by certain approximation algorithms by Goemans and Williamson, for those who understand. The hashing family looks very simple. It works only for the sphere. If I am on the sphere, what I can do is the following. I sample a random unit vector uniformly -- let's call it r -- and then I hash my point into the sign of the dot product of my point and r. Another way of saying it: we take the sphere and cut it into two equal pieces by a random hyperplane. And it's very easy to compute exactly the probability of collision for two points. If the angle between my two points is alpha, then the probability of collision is just one minus alpha over pi. Why? Because for your hyperplane to separate these two points, it needs to pass through the angle between p and q, and the probability of that is alpha over pi. With the remaining probability, they collide. So that's why we have this expression; it's an exact formula. And on the plot, it looks something like this. So remember that we care about the random case, right? With 45 degrees. If the points are at 45 degrees, then the probability of collision is three-quarters, and for 90 degrees it is one-half. So that's the typical case we will think about.
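[Editor's note: a minimal sketch, not from the talk, of Charikar's hyperplane hash (SimHash) for unit vectors: one random direction gives a single sign bit, and two points at angle alpha collide with probability 1 - alpha/pi.]

```python
import numpy as np

class HyperplaneHash:
    def __init__(self, d, rng):
        self.r = rng.standard_normal(d)        # random direction r

    def __call__(self, p):
        return int(np.dot(self.r, p) >= 0)     # sign bit of <r, p>

# For alpha = 45 degrees the collision probability is 1 - 1/4 = 3/4,
# and for alpha = 90 degrees it is 1 - 1/2 = 1/2, matching the plot.
```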
Okay. So we have this nice simple family in mind. Let's now see how to use it to do similarity search in high dimensions. The first idea you would think about is to just take our hashing family and hash the points using it -- so basically compute the hashes and then just, I don't know, look up points with the same hash. But of course that wouldn't work. For instance, for the hyperplane it wouldn't work because you have only two hash values. So in the best case, your points would be split evenly, and in each bucket you would have n over two points, right? So you would need to enumerate n over two points, and that's too much. So a natural extension of this idea is, instead of one hash function, to use K independent hash functions from my family, and --
>> So what do you think of as a good query time?
>> Ilya Razenshteyn: Query time, yeah. Let's not worry too much about --
>> And space, but space [indiscernible]. So [indiscernible]?
>> Ilya Razenshteyn: Definitely strongly sublinear query time, right? If you just use one hash function, you get linear query time. Of course I would want it to be as good as possible.
So let's see what we get if we use K hash functions simultaneously instead of one. For one hash function, the probability of collision, as I said before, is just a straight line. But when we start increasing K, it goes down. This is more or less obvious: the probability just gets raised to the power K, right, because everything is independent. But what's crucial here is that for far points, this probability goes down much faster than for close pairs. And that's exactly the crucial thing here. And that's actually it. What we do is we choose K appropriately -- I'll say in a second how to choose it exactly -- and then we hash our points using tuples of K hash functions simultaneously and then just enumerate all the points in the same bucket. So that's the whole reduction. Let's see what parameters we need to choose and what we get for them. It turns out that the optimal choice of K is such that this point, for far pairs, becomes smaller than one over the number of points, or on the order of one over the number of points. Why? Because in this case, for a query, if you look at its bucket, then the number of outliers in this bucket -- namely far points that we don't really care about -- is constant on average, just by linearity of expectation. It's n times one over n, which is one, right? So the query time will actually be constant in this case. Well, proportional to the dimension, but I think of the dimension as something small for the sake of this talk, so we will enumerate
a constant number of points on average. And are we done? No. Because actually, we also need to care about the probabilities for close pairs. We want to find at least one close point with decent probability. And if we just do this whole thing once, then the probability that it collides is three-quarters to the K, so it's exactly this point. And if you do the math -- so we take K from here and put it here -- we get that the probability of success is something like one over n to the 0.42 or something like this. And in order to boost the probability of success to, say, 99 percent, we would need to repeat this whole thing n to the 0.42 times. So then we have L hash tables.
>> Here you want the exact nearest neighbor and you're not --
>> Ilya Razenshteyn: No, no, no. I'm happy with any point within this 45-degree range, let's say. But for the random case, it would be the exact nearest neighbor, yeah. But in general, not necessarily. Question? So the overall scheme is like this. We have K times L hyperplanes: in each hash table we have K of them, and we have L hash tables overall. The overall space is something like n to the 1.42, and the query time is n to the 0.42. So that's exactly the query time that will be typical for this talk, sort of.
>> [Indiscernible].
>> Ilya Razenshteyn: Kind of, yeah. Some polynomial of n. So are there any questions about it? Like, if that's unclear, it's better to spend some time.
>> So one thing I'm not really clear on is that you are doing worst case, right? You can't do [indiscernible].
>> Ilya Razenshteyn: Yeah, yeah, yeah. So it's like worst case over queries. So for each query, we succeed with this probability.
>> But not -- I'm wondering like why is that important? [Indiscernible].
>> Like maybe --
>> Ilya Razenshteyn: Amortized over what?
>> Like queries.
>> Ilya Razenshteyn: So what --
>> A new amortized thing.
>> Ilya Razenshteyn: Like -- so you want to be averaged over queries.
>> Yeah.
>> Ilya Razenshteyn: So to [indiscernible], we don't know any better. In practice, maybe sort of, but in theory --
>> [Indiscernible].
>> Ilya Razenshteyn: Sorry?
>> Something like spirit of ten here.
>> Ilya Razenshteyn: Well --
>> Yeah. And you are saying that's what instead of the [indiscernible].
>> Ilya Razenshteyn: Yeah. Yeah. So in practice, succeeding only for, like, good queries kind of makes sense. But in theory, we don't really know any better; we don't know how to average over queries in some sense. Okay. Good. So this is a pretty simple argument that actually appears in the same paper that introduced LSH. For the general case -- I showed you concrete numbers, but in the general case, what you can show is that you can always choose the number of tables and the number of hash functions per table so as to get space that is n to the one plus rho and query time n to the rho, where rho is log of one over P1 divided by log of one over P2, which measures the gap between the probabilities of collision for close pairs and for far pairs. And the proof is exactly the same; it's just that instead of concrete numbers, we get this formula. Yeah.
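[Editor's note: a quick sanity check, not from the talk, of the exponent rho = log(1/P1) / log(1/P2) for the 45/90-degree random instance with the hyperplane family, where P1 = 3/4 and P2 = 1/2.]

```python
import math

p1, p2 = 3 / 4, 1 / 2                      # collision probabilities at 45 and 90 degrees
rho = math.log(1 / p1) / math.log(1 / p2)  # gap between close and far collision probabilities
print(rho)                                 # ~0.415: query time ~ n^0.42, space ~ n^1.42
```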
Okay. So that's it for the definition of LSH. Now let me show you the optimal LSH construction for the sphere. So can we do better than hyperplanes? A question you could ask for this specific random instance, where I plant queries within 45 degrees: can I improve these bounds, which are roughly square root of n query time and n to the 1.42 space? Can we do better or not? It turns out that we can. We observed this in our previous paper and used it in one of the papers I'm going to talk about, and actually you can do much better. For the query time, you can improve more than quadratically: you can get n to the 0.18 query time and space n to the 1.18. And this is actually optimal. So this new bound is optimal, unlike the old one. I'll say in a second how it works. But let me just say for now that for the spherical case we understand the best bounds exactly. And of course, again I'm telling you the numbers for the random 45-degree case, but it works for the general case on the sphere; it's just that the formulas will be slightly more complicated, so that's why I'm not showing them here. But let me show the construction. The construction is actually pretty simple and clean. And again, as with the hyperplane, this is also inspired by certain approximation algorithms, this time by a result of [indiscernible] who used a somewhat similar space partition to round SDPs for coloring. Let me call it Voronoi LSH; I'll explain in a second why we call it like
this. And the construction is actually fairly simple. We want to hash our points on the sphere. For this, let me sample -- let me choose the number of Gaussians somehow; how to choose it is a separate question and not entirely trivial. But let me sample standard d-dimensional Gaussians, so each g_i is a d-dimensional i.i.d. N(0, 1) vector. And then the hash of my point is the index of the Gaussian whose dot product with my point is the maximum possible. Pictorially, what happens is something like this. I sample a bunch of Gaussians. They don't have to lie on the sphere, but their lengths will be approximately equal, so let's think of them as uniform vectors from the sphere; it's not going to matter for this discussion. So I basically sample a bunch of random points on the sphere, and then my sphere gets partitioned according to which Gaussian each point correlates with best. Something along these lines. And that's exactly my space partition. Is the construction clear? Yeah.
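[Editor's note: a minimal sketch, not from the talk, of the Voronoi LSH construction just described; the parameter name T for the number of Gaussians is ours.]

```python
import numpy as np

class VoronoiLSH:
    """Hash a unit vector to the index of the Gaussian it correlates with best."""
    def __init__(self, d, T, rng):
        self.g = rng.standard_normal((T, d))   # T i.i.d. standard d-dimensional Gaussians

    def __call__(self, p):
        return int(np.argmax(self.g @ p))      # index of the maximum dot product

# With T = 2 this is equivalent to the hyperplane hash (up to relabeling), since
# the two Voronoi cells are the two sides of the bisecting hyperplane.
```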
And let me just observe that if I sample only two Gaussians, it's exactly the hyperplane LSH. Why? Because if I sample two Gaussians, then my partition is just the hyperplane that lies in the middle of these two vectors. So it's exactly equivalent. And this is a natural generalization: instead of just two regions, we may have more than two regions, and we'll see that this is beneficial for us. Okay. Let's actually compare it with hyperplanes. So as I said, one hyperplane is exactly the same as Voronoi LSH with two Gaussians, right? Just because it's exactly the same. And it turns out that the right way of comparing them is to compare K hyperplanes -- remember that we basically hash with respect to K independent hyperplanes -- and Voronoi LSH with two to the K Gaussians. Why is that? Well, because it actually turns out to be a good comparison; we'll see in a second why. But for now, let me just say that in both cases we have two to the K regions, so at least we can meaningfully compare these things. For one hyperplane and two Gaussians, there is no difference. But when we start increasing these things, things start getting interesting. Even for two hyperplanes and four Gaussians -- this point is exactly the same on both of these plots, which has to do with the fact that in both cases there are two to the K regions -- but for this point, we start seeing a difference. It turns out that for Voronoi LSH, we get a slightly better, slightly higher probability of collision for small distances. So it starts kicking in. And when we increase further, the gap actually widens. When we go to, say, six hyperplanes versus 64 Gaussians, the gap is already pretty non-trivial. And you can do the formal analysis and show that as the number of Gaussians grows, the gap between hyperplane LSH and Voronoi LSH increases, and the exponent we get approaches that value 0.18 that I promised you.
>> [Indiscernible] when you do hyperplanes [indiscernible].
>> Ilya Razenshteyn: That will be the problem that I'll cover in the second half of the talk. And yes, that's a problem. So what Sergei said is that for K hyperplanes, we can essentially decide where our point lies in like K steps, roughly speaking, and for the Gaussians we would need to do two to the K operations. Yeah. So it comes at the cost of the improved exponent. But actually, let me just say that for the sake of theory, it doesn't matter: we will choose parameters such that it doesn't matter much. Yeah. So okay. So that's actually it. This is the optimal LSH construction for the sphere. And now let me tell you how to use it to get the state-of-the-art algorithm for --
>> One question. So there is computing [indiscernible]?
>> Ilya Razenshteyn: Yeah. Yeah. So think of the [indiscernible] as being log n, something like this.
>> Yeah. [Indiscernible] dimension, okay.
>> Ilya Razenshteyn: Yeah. So it's like proportional to log n. What I'm mostly worried about in this talk is factors like n to the epsilon. So anything subpolynomial is a constant for the sake of this talk. In practice, it of course might matter. I'll talk about it. Yeah.
>> Thank you.
>> I want to try and understand one thing. So you're talking about this uniform random case, and then you also made a comment about the general case.
>> Ilya Razenshteyn: Yes.
>> And you said that this would work there as well.
>> Ilya Razenshteyn: Yes.
>> But it's only optimal in -- the optimality proof is in the uniform random case?
>> Ilya Razenshteyn: So the optimality proof is for the general case, but when the distance thresholds that you care about correspond to the random case. So namely, square root of two versus square root of two over the approximation factor.
>> What is square root of two?
>> Ilya Razenshteyn: Square root of two is the typical distance between two random points on the sphere.
>> Okay.
>> Ilya Razenshteyn: So our optimality proof -- in a way, you can think of the optimality proof as saying that it's optimal for the random case, if you want, yes. And it immediately implies that it's optimal for the arbitrary case if your distance thresholds correspond to the random case. But whether this construction is optimal or not for two arbitrary distance thresholds is not yet proved, although I conjecture that this is the case. So that's the exact state of things.
>> And you still draw the Gaussians uniformly.
>> Ilya Razenshteyn: Yes.
>> Even if the data is somehow skewed.
>> Ilya Razenshteyn: Even if the data -- even if the data is somehow skewed. But that's optimality for LSH. So actually, I'll cover in a second how to do better for the case when your data is skewed, but you would need to do something else. We will see. But yeah, it's a very good point. It's exactly what I'm going to talk about.
So now let me tell you about our first result, and that appeared in this year's STOC. So now I'm going to slightly switch gears: I talked about the sphere, and now I'll talk about the whole of R^d. It's a more general case, so let's talk about it. So you can ask, what are the best bounds for LSH you can get? It turns out that for Euclidean distance -- and I'm not going to talk much about it, but still, it's also very interesting to see what happens for Hamming distance -- we actually know exact bounds on the exponent that you can get. For Euclidean distance, the right bound is one over C squared, where C is my approximation factor, remember, and for Hamming distance, it's one over C. So in particular, for approximation two, what we get is query time something like n to the one-fourth for Euclidean and square root of n for Hamming distance. And that was established in a sequence of works over quite some time actually. So we know exactly the best bounds for LSH, for L2 and L1. Yeah. And let me just briefly recall that one-half here means that the space is n to the three halves and the query time is square root of n. So can we do better than LSH? Yes, we can, and that's exactly the main point of what I'm going to talk about. How can we do better than LSH? The basic idea is to again do space partitions, but the crucial idea is to do space partitions that depend on your data. So remember that my definition of LSH was actually pretty strong: I required these two conditions for every pair of points. For every p and q, I want that if the distance is small, then they collide often, and if the distance is large, then something else, right? But actually, we don't need that. For the reduction, what we need is to make sure that these conditions hold when one of the two points is a data point. And that gives us the full flexibility. So maybe we can look at our data set before building the hash family and just cook up some nice hash family that works nicely for this data set, right? And that's exactly what we do. But interestingly enough, not only does it work for a good data set, it actually gives an improvement for every data set. So you can say that every data set has some structure to exploit, informally speaking. And now let me tell you our results. Basically, we get optimal data-dependent space partitions -- optimal after proper formalization; it's a little bit subtle, but again, let's not worry about it for now. And quantitatively, what we get is an almost quadratic improvement. For Euclidean distance, we get, say for approximation two, one over seven instead of one over four. And for Hamming distance, we get one-third instead of one-half. And let me say again that these bounds are optimal for data-
dependent LSH, if you formalize it properly. So what's the main idea? Basically, our algorithm consists of two steps. First, I'll show you how to handle random data sets -- random in the sense I described. And actually, for the random data set, the spoiler is that Voronoi LSH works well and gives the better bound. And this step is completely data-independent: if you know that your data set is random, you just use Voronoi LSH and apply the standard reduction. The second part, which is more interesting and is really the main point, is how to take any worst-case data set, which may not necessarily look random, and partition it into parts that are, for the sake of our algorithm, essentially random. That step is data-dependent, and it exactly addresses your question about skewed data. But, okay, let's first look at the random case. So actually, yeah. What it exactly means is that we have a sphere and our points and queries are random, and we use the fact that distances are concentrated: if you have a sphere of radius R, then distances are concentrated around square root of two times R, and Voronoi LSH gives you the right exponent. It gives you exactly the improved bound, one over two C squared minus one, which eventually we want to get for every data set. But if your data set doesn't look random, then Voronoi LSH is actually suboptimal. A good example is if your points are, for example, clustered and lie in a small region of your sphere; then Voronoi LSH doesn't work that great and doesn't give good results, and we need to do something about it. And that's exactly where the second part comes in: how to reduce from the general case to the randomly-looking case.
So if something doesn't look random, let's make it look random forcefully. Basically, we need to remove structure. What I mean by structure here is basically low-radius clusters that contain lots of points. I'll say in a second what it means to be low radius, but at least conceptually, it should be pretty clear. So if we have any low-radius dense clusters, we just take them away -- something like this -- and I will show how to deal with them on the next slide. Of course we need to do something about them, because what if our near neighbor is one of these points, right? But for now, let's not worry about it; let's say that we just removed everything. And the crucial thing is that the remainder pretty much looks like a random set. We know that we have no dense areas anymore and it's kind of spread out, so we can apply Voronoi LSH. And recurse. So after we recurse -- and by recursing, I mean that for each region we do the same -- dense clusters can appear again, because the definition was relative. Since each part has way fewer points, we again can potentially have dense clusters to take away, and again we recurse. So now, yeah. Before I tell you what to do with the clusters, let me tell you how we process queries. For queries, we actually do the following. We first query every single cluster. There will not be many of them: we'll choose parameters so that there is a relatively small number of them, so we can afford to query every single one. And for the Voronoi LSH part, we query the one part where our point lies -- for example, this one -- and recurse inside that part. So that's the whole thing. So it remains to tell you what we actually do
with the clusters. I didn't tell you, and actually that's very crucial. For the clusters, we observe the following thing. Actually, now it's time to say what exactly it means to be low radius. By low radius, I mean something that is slightly smaller than half of the sphere. Basically, I declare a cluster to be low radius if it fits in a spherical cap of radius square root of two minus epsilon, times R. So it's slightly non-trivial, but slightly smaller than half of the sphere. And the crucial thing is that we can actually enclose such a cap in a somewhat smaller ball, smaller by a factor of one minus epsilon squared. And that's great. Why? Because we can recurse with the reduced radius, and as I'll explain, we actually make progress by doing that. So let me state the overall algorithm again. Basically, for clusters we reduce the radius, and after several such reductions the problem essentially becomes trivial for certain reasons; and for the random remainder, Voronoi LSH works well. So that's exactly, conceptually, how
we handle the different cases. And at some level, what we get can be seen as a decision tree. We start with the root. Then we take out dense clusters. Then we have the random remainder, which we partition using Voronoi LSH, and then we recursively do the same thing for everything. And when we query, we can potentially go to several children: we query all the clusters and one part, and it continues branching. And the parameters we get are the following. You can control things so that the tree occupies nearly linear space, and the query time can be bounded by some subpolynomial function. And that's great. And of course, as before, one tree would not be enough, because it would give you only a polynomially small probability of success, so we need many of them to succeed with probability 99 percent. So that's actually it. Any questions about it? Yeah. So the one-line summary is that we observe that Voronoi LSH works great for random data sets, and if something doesn't look random enough, we just make it random.
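[Editor's note: a heavily simplified, runnable toy sketch of the idea just described -- peel off dense low-radius clusters, recurse on them with a reduced radius, and partition the pseudo-random remainder with Voronoi LSH. All constants here (epsilon, cluster size threshold, number of Gaussians, depth cap) are illustrative and are not the parameters from the paper.]

```python
import numpy as np

EPS = 0.1   # illustrative only

def find_dense_cluster(points, radius, min_size):
    # Brute force: return the indices of some ball of the given radius that
    # contains at least min_size points, or None if there is no such ball.
    for i in range(len(points)):
        members = np.nonzero(np.linalg.norm(points - points[i], axis=1) <= radius)[0]
        if len(members) >= min_size:
            return members
    return None

def build_tree(points, R, rng, depth=0):
    node = {"clusters": [], "parts": {}, "points": points}
    if len(points) <= 10 or depth >= 3:
        return node
    # 1. Take away dense clusters of radius slightly below sqrt(2) * R; the real
    #    algorithm encloses each one in a smaller ball, so recurse with the
    #    radius reduced by a factor of (1 - EPS^2).
    while True:
        c = find_dense_cluster(points, (np.sqrt(2) - EPS) * R,
                               max(2, len(points) // 10))
        if c is None:
            break
        node["clusters"].append(build_tree(points[c], (1 - EPS ** 2) * R, rng, depth + 1))
        points = np.delete(points, c, axis=0)
    # 2. The remainder looks pseudo-random: partition it with Voronoi LSH
    #    and recurse inside each part.
    g = rng.standard_normal((16, points.shape[1]))
    node["gaussians"] = g
    if len(points):
        labels = np.argmax(points @ g.T, axis=1)
        for part in np.unique(labels):
            node["parts"][int(part)] = build_tree(points[labels == part], R, rng, depth + 1)
    return node

# At query time, one would probe every cluster child plus the single Voronoi
# part containing the query, recursing down the tree as described in the talk.
```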
Okay. So now let me tell you a little bit about our second result. The second result is about how to make Voronoi LSH practical -- that has to do with Sergei's question -- and that's our NIPS paper. Is Voronoi LSH practical? No. Why? Because, on the one hand, convergence to the optimal exponent is very slow, so we need lots of Gaussians to make the exponent good, and at the same time the evaluation time is roughly the dimension times the number of Gaussians. And that's bad. Even, say, 64 Gaussians is already pretty much impractical, and that wouldn't bring us even close to the optimal exponent. So can we do anything about it? It would be nice to do something, because hyperplane LSH is actually used quite a bit in various forms in practice, and it would be nice to use theory to get some practical improvements. Can we do something? Yes. That's exactly the point of the second part.
So let's make Voronoi LSH practical step by step. The first step is to make our set of vectors a little bit more structured. Voronoi LSH samples a bunch of random vectors, and that might not necessarily be that great, because that is what makes the evaluation slow. Let's make it less random, less arbitrary in some sense. There was a very nice paper by [indiscernible] who proposed such a scheme. They didn't analyze it, but at least they proposed a possible improvement to Voronoi LSH, and what they proposed is: instead of random vectors, use plus-minus basis vectors. If you want to hash your point on the sphere, you perform a random rotation, and then, after the random rotation, you find the closest plus-minus basis vector. So, for example, for the case of dimension two, we partition everything into four parts, and in general high dimensions we get a cross-polytope: for dimension d, we have 2d parts. Yeah. So in this paper, we actually analyze this scheme for the first time and show that it gives almost the same quality as Voronoi LSH with 2d Gaussians -- again, 2d Gaussians because it's the same number of parts. So essentially we show that by moving from Gaussians to this structured set of vectors, if you do a random rotation first, then you get almost the same result. And in a way, you can think of it as placing the [indiscernible], actually. So the exponent improves as the dimension grows, because the number of Gaussians effectively grows, right? But it's still not that great, because the random rotation is expensive: applying a random rotation takes d squared time, and storing it also takes d squared reals. So it seems that we didn't make that much progress, but in fact, we did -- wait for the next slide. And the way we did make progress is that at least the second step, finding the closest plus-minus basis vector, is now cheap: you can do it in one pass over your coordinates, so it takes d time instead of d squared. So that's exactly the progress.
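[Editor's note: a minimal sketch, not from the talk, of the cross-polytope hash with a genuinely random rotation: rotate the point, then return the index and sign of its largest-magnitude coordinate, giving 2d possible hash values in dimension d.]

```python
import numpy as np

class CrossPolytopeHash:
    def __init__(self, d, rng):
        # A random rotation: the orthogonal factor of a QR decomposition of a
        # Gaussian matrix. Storing it takes d^2 reals, applying it d^2 time.
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        self.rotation = q

    def __call__(self, p):
        x = self.rotation @ p
        i = int(np.argmax(np.abs(x)))           # closest +/- basis vector: one pass, O(d)
        return 2 * i + (1 if x[i] >= 0 else 0)  # encode index and sign: 2d buckets
```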
So the second idea is to use pseudo-random rotations. As I said, the bottleneck is storing and applying random rotations, and that is expensive. So instead, we use pseudo-random rotations, which were introduced in the paper by Ailon and Chazelle. And since then, they have been used in many other places, in both theory and applied papers and so on. It's a very beautiful idea. If you want to learn one idea from this talk, I want you to --
>> [Indiscernible].
>> Ilya Razenshteyn: Sorry? No, no, no, it's like -- you'll see. It's
really nice. So we want to do something that would serve roughly as a random rotation, but without doing a full random rotation. So what we do is the following. Instead of a random rotation, let's do a Hadamard transform. What is a Hadamard transform? It's a certain orthogonal map that preserves Euclidean norms -- so a certain rotation -- and it has two properties. One property is that it mixes well; I'll say in a second what that actually means. But what's more crucial is that it's fast: we can compute it in time d log d. So what is a Hadamard transform? It's a recursively defined matrix: the zeroth Hadamard transform is just one, and then it gets replicated four times, and one of the copies flips all the signs. So it's basically a plus-minus-one matrix with mutually orthogonal rows and columns. So this is nice. But of course it's a deterministic map, and we want to inject some randomness. The crucial idea in the Ailon and Chazelle paper was to flip signs at random before applying the Hadamard transform. So basically, for every coordinate we toss a coin, and for, say, heads we change the sign, and for tails we don't do anything; and then we apply the Hadamard transform. And for their application, that was enough. But for our application, it's not quite enough. Why? Because suppose, for example, that we started with a one-sparse vector -- a vector with only one non-zero, say the first basis vector. Then if I flip the signs, it can only become plus or minus the first basis vector, and even after I apply the Hadamard transform, it will be one of two vectors, and that's not good enough for us. But what is good enough for us is to repeat this whole thing a couple of times, and then it works -- with the caveat that we don't know how to prove that it works, but it works well empirically. And I conjecture that it actually works well in theory; it's just that I don't know how to prove it. And that is actually pretty much it.
It's just I don't know how to prove it. And that is actually pretty much it.
So the overall hashing scheme is to perform 2 or 3 rounds of flip signs
Hadamard and then find the closest vector from plus minus basis vectors,
which essentially boils down to like finding maximum coordinate or something
like this. And the evaluate times becomes D log D instead of D squared. And
that's exactly where we save a lot. And again, this is the statement I don't
yet now how to prove but empirically it seems that it's pretty much
equivalent to the cross-polytope LSH with truly random rotations which are
rigorously equivalent to Voronoi LSH with 2d Gaussians. So is it clear?
Yeah.
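[Editor's note: a minimal sketch, not from the talk, of the pseudo-random rotation just described: a few rounds of random sign flips followed by a fast Walsh-Hadamard transform (the dimension is assumed to be a power of two), then the same closest-signed-basis-vector hash as before; the cost per point is O(d log d).]

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of two.
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))                  # normalize so the map is orthogonal

class PseudoRotationCrossPolytope:
    def __init__(self, d, rounds, rng):
        # One vector of random +/- 1 signs per round (2-3 rounds in practice).
        self.signs = rng.choice([-1.0, 1.0], size=(rounds, d))

    def __call__(self, p):
        x = np.asarray(p, dtype=float)
        for s in self.signs:                    # flip signs, then Hadamard, repeatedly
            x = fwht(s * x)
        i = int(np.argmax(np.abs(x)))
        return 2 * i + (1 if x[i] >= 0 else 0)  # closest +/- basis vector
```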
>> Just trying to understand the end-to-end thing. So you have these transformations -- the 2, 3 rounds of flipping and Hadamard. And then so you start -- you have a query point, you run it through this, and then what do you get? You get the close -- the closest vector. Then do you -- the data is already [indiscernible] in a way that you are [indiscernible] a set of points and one of them is close.
>> Ilya Razenshteyn: So the good way of thinking about it: we just use it as a hash function and plug it into the reduction that I described. So then, for every hash table, you would have several of those things. So think of it as a hash: it takes a point and gives you a number from one to d -- or from one to 2d in this case -- namely, which signed basis vector is the closest. And then you do the same as I described before. You compute these several hashes for the query, look up the corresponding bucket, and retrieve all the data points from there.
>> So I have the union of many buckets?
>> Ilya Razenshteyn: Intersection first, because you want to collide on all the K hash functions, and then you take the union over tables. And the overall thing you repeat many times to boost the probability of success to 99 percent, whatever you want.
>> And then in this union, you just --
>> Ilya Razenshteyn: You try all the points.
>> You try all of them and just find the closest one.
>> Ilya Razenshteyn: Yeah.
>> How many points are in that union?
>> Ilya Razenshteyn: So we set the parameters such that in one hash table, you would have one, or five, or some constant number of far points, and everything else is close, so we're happy with those. And for the union, that gives you that essentially the number of bad points is roughly the same as the number of hash tables. So it's something like n to some power.
>> How many hash tables?
>> Ilya Razenshteyn: So let me actually go back to the slide where I showed it for the hyperplanes; for this LSH it's the same.
>> [Indiscernible]?
>> Ilya Razenshteyn: No. We compute -- so for every hash table, we compute three times K Hadamard transforms. So essentially, in total for a query, we have like three times K times L Hadamard transforms.
>> You recompute it.
>> Ilya Razenshteyn: Yeah. You need, right, you need to recompute your [indiscernible]. If you [indiscernible] to do it, that would be very nice. That would save a lot. Yeah. So this slide. So basically, now think of it as: we use our Hadamard -- I'm not sure what to call it -- cross-polytope hash instead of the hyperplane hash. We hash our point using K independent hash functions from our family -- this is the parameter K -- and we also have the parameter L, which is how many tables we have. So in total we have K times L hash functions and L hash tables. And for the hyperplanes, we would have n to the 0.42 tables; for the cross-polytope, we would have n to the 0.18. So some relatively small polynomial. And K is something like [indiscernible]. Okay? So it's exactly the same reduction; just instead of the set of hyperplanes you use the cross-polytope, and it works much better. Okay. Let's go back. I have ten minutes, right? Okay.
So now, let me turn to an actually quite big issue: memory consumption. If you look, there are lots of papers that do similarity search in high dimensions, say, in practice, and many of them use LSH as a baseline. And in many papers, you see statements like this: LSH is terrible because it consumes lots of memory. Let's try to figure out if that's true or not. So we can actually do the math and compute exactly how many tables we would need, for example for hyperplane LSH, for that random instance that I told you about with 45 degrees. If you have a million points and your queries are within 45 degrees at random, then to succeed with probability 0.9, you would need 725 tables. So that's terrible actually. It's like a million points -- come on, that's nothing, right? Now we care more about billions of points, if not more, right? So replicating the whole thing 725 times, that's not what we want, right? But there is a very nice solution for this. It's called multiprobe LSH, and it was introduced in a very nice VLDB 2007 paper. And basically, the idea is like this -- I'm not going to tell you the details -- but the idea is that in each table, we can query more than one bucket. Of course, the best bucket to query is the one which collides with all our hashes, right? But intuitively, we might also want to look at the buckets that almost collide -- say they collide on all coordinates except one, or something like this. So they have a very nice heuristic way to do it. Eventually there were theory papers that analyze something similar to this multiprobe LSH, but they analyze something way less practical. So in practice, you would want to use multiprobe LSH. And one of the contributions of this paper is that we have a similar scheme for the cross-polytope. It's a little bit more tricky. The main source of trickiness is that for hyperplanes, you have only two regions, so you just need to decide whether to flip or not in each coordinate; here, we need to decide by how much [indiscernible]. It can be done.
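[Editor's note: a toy illustration, not the scheme from the paper, of the multiprobe idea for the hyperplane hash: besides the query's own bucket, also probe the buckets obtained by flipping the bits whose hyperplane margins are smallest. The cross-polytope multiprobe scheme in the paper is more involved.]

```python
import numpy as np

def multiprobe_keys(planes, q, extra_probes):
    # planes: K x d array of hyperplane normals for one hash table.
    dots = planes @ q                                   # K margins, one per hyperplane
    key = (dots >= 0).astype(np.uint8)
    keys = [key.tobytes()]                              # the query's own bucket first
    for j in np.argsort(np.abs(dots))[:extra_probes]:   # least-confident bits first
        flipped = key.copy()
        flipped[j] ^= 1
        keys.append(flipped.tobytes())
    return keys                                         # probe all of these buckets
```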
So we did quite a few experiments. Let me just show one of them. It's on a data set of features for -- I think it's called ImageNet or something, I'm not sure. So it's a certain data set of images, and for those images we received computed features, and the parameters are: we have a million points and dimension 128, right? So a linear scan takes 38 milliseconds. It's actually not that great a data set for us, because the linear scan is already pretty fast. But nevertheless, compared to it, this whole thing improves things quite a bit. With hyperplane and multiprobe, you can improve by a factor of two, and the cross-polytope improves things even further. This is not the biggest gap between hyperplane and cross-polytope that I can show you, but even here it already works better. And in practice, you would not want to just take your data set and apply, say, cross-polytope LSH. In fact, it turns out that for this specific data set, you can look at it, stare at it a little bit, cluster it a little bit, recenter, and then actually improve the results for both hyperplane LSH and cross-polytope LSH, and here the gap is a little bit wider actually. So the cross-polytope benefits a little bit more from it. And we have other experimental results; if you are interested in applying something like this to your applications, read our paper. So these are just some kind of big numbers.
>> [Indiscernible]?
>> Ilya Razenshteyn: Oh. So yeah. It's a great question. So we require it to use the same amount of memory as the data set itself. So it's roughly double the size of the data set, which is actually pretty reasonable. You wouldn't use 725 tables or anything like that. If you use more memory, actually both hyperplane and cross-polytope benefit from it. Yeah. So okay. So that's pretty much it. There are actually a lot of open questions here. Some of them are hard, some of them are easy, some of them are meaningful, some of them not so much. But what I showed in this talk -- essentially I showed you two results: optimal data-dependent hashing for the whole of L2, which is the theory result, and practical and optimal LSH for the spherical case, which is more applied. And I'd say the main open question, which I really like and have no idea how to approach, is to have a practical version of our worst-case-to-random reduction. That would essentially make the first bullet point practical. Whether it is possible to do, I don't know. But it would be very nice. Yeah.
>> So all of this is again finding one point that's close enough. What if I want to find all the close points?
>> Ilya Razenshteyn: You want everything within a certain distance threshold?
>> Up to approximation, however is reasonable. So I'm willing to accept a notion of approximation, but I want to find all of the appropriate -- approximately find all the nearest neighbors.
>> Ilya Razenshteyn: Good. So here, the analysis shows that for every point from the set that you care about, you succeed with probability 99 percent. So it means that on average, we'll recover 99 percent of all the points -- of all the close points.
>> So again, the set that at the end of the day [indiscernible], it's already pretty much all the members.
>> Ilya Razenshteyn: Pretty much. So it is 99 percent of them. So you can just essentially repeat this many times and push this to whatever you want.
>> The hash --
>> Ilya Razenshteyn: Of course, if you do something like this, your running time will depend on how large this set is. But you can do better than the size of this list: you would get something like n to the rho plus the size of the set, times something. Any more questions?
>> Again, maybe this is the same as my previous question, maybe not, but still. So if I want to find the single closest point --
>> Ilya Razenshteyn: Like the absolutely closest?
>> The absolutely closest, with high probability. Then I understand that 99 percent of the close ones are nearby.
>> Ilya Razenshteyn: Yeah. But you don't necessarily find this closest point, because -- basically, again, here is how the analysis goes. You say that in your bucket, there are essentially no far points. So it means that you will very quickly find some close point, but not necessarily the closest. So actually, finding the exact closest neighbor in theory is very hard. I wouldn't expect it to be possible to do in high dimensions in strongly sublinear time, unless your space is huge. But what you can often do in practice is to say: look, my data set is such that when I have a query, I have an exact closest neighbor -- no approximation -- and then there are not that many points which are not much further than that. And that often happens actually. For instance, in this data set -- oh, yeah, I should have told you that here, the times are for finding, with probability 0.9, the exact nearest neighbor, no approximation. But this data set has the property that there are not that many approximately closest points; everything else is concentrated much further. So in that case, this is actually your best bet, and that is something you can often see in practice. But other than that, not really, unless you accept that dependence on the dimension.
>> So there are lower bounds for exact --
>> Ilya Razenshteyn: So you can, for instance, show that if you could do polynomial preprocessing and strongly sublinear query time, you could do better than [indiscernible]. It's very easy to show. So there are lower bounds. They are not that [indiscernible], because this assumption -- that you can do better than two [indiscernible] -- is strong, but we don't know how to do it at least. So I wouldn't expect it to be possible. At least [indiscernible].
[Applause]