>> Ping Li: It's so very nice to be back here, to see so many friends again. So
okay. So just thanks for coming right before Thanksgiving shopping tomorrow.
So I'm in the Department of Statistical Science, which is under computing and
information science. And the faculty of computing and information science is larger
than CS. So for a long time I thought I was actually in CS.
So at Cornell the Department of Statistical Science has a very unique position.
People are all from different departments, like OR, math, and computer science -- and
CIS, which is me. Yeah.
Okay. So as a statistician, this is how I understand how you use statistics.
Well, we need a model, we need variables, we need observations. And this is
what many people consider good practice; that is, we want a simple model, and
we want lots of features and lots of data.
This is because the people who use the models want the model to be simple,
and the people who use the model know how to play with the data. So this
seems to be a good combination.
So conceptually we can consider a dataset as a matrix of size N times D. And
the number of observations now is really huge. A million is not big anymore,
and a billion is not rare. And if you consider click-through data, the number of
observations can be essentially infinite.
It is also common to use high-dimensional data in text, image, and
biological data. I have seen people use a billion dimensions, and I have seen people
use a trillion dimensions. And in the context of search, 2^64 seems to
be the standard.
And so in a sense, the dimensionality can be arbitrarily high. You just consider
all pairwise, three-way or higher interactions.
So this is a simple example of big data learning. Suppose X is your data and
Y is your response. X has 2^64 dimensions, and of course it is usually sparse --
it has to be sparse. So let's only consider the simplest models, logistic
regression or linear SVM. And we want to minimize the loss function to
find the weight vector W.
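[Editor's note: the loss function on the slide is not reproduced in the transcript. A standard form of the L2-regularized linear objective being described, with C as the regularization parameter, is roughly:

$$\min_{\mathbf{w}} \ \frac{1}{2}\mathbf{w}^\top\mathbf{w} \;+\; C\sum_{i=1}^{N} \ell\!\left(y_i\,\mathbf{w}^\top\mathbf{x}_i\right), \qquad \ell(z)=\max(0,\,1-z)\ \text{(linear SVM)} \quad\text{or}\quad \ell(z)=\log\!\left(1+e^{-z}\right)\ \text{(logistic regression)}.$$
]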
So C is the regularization parameter here. Even for such a simple model,
once you have lots of data, there are plenty of challenges we have to face.
For example, the data may not fit in memory. That's the first thing. And
data loading, which includes data transmission over the network, may take
too long.
And training can still be expensive, even for simple linear models. And testing
can be relatively expensive. Why is testing expensive? Because once
you fit the W, you have to compute an inner product.
And if this is very high dimensional, that takes time, which may be too expensive
for search or high-speed trading, et cetera.
And the model itself may be too large. This is an interesting issue. Because if
you talk about data in 2^64 dimensions, normally the data has to be sparse;
otherwise, you know, there's no such thing as dense data in 2^64 dimensions.
But the model, the W, if you are not careful enough, is usually dense,
unless you do something else. So a model in 2^64 dimensions you cannot
really store. So that's actually an interesting issue.
And there's also the issue of near-neighbor search: you want to find similar
documents or images among billions or more web pages or images without scanning
them all. So those are more interesting issues.
So this motivates the need for dimensionality reduction. But more
importantly, for many cases what we need is actually data reduction, meaning reducing the
number of nonzeros. This is because for modern learning algorithms the cost --
storage cost, transmission cost, computation cost -- is mainly determined by the
number of nonzeros, not so much by the dimensionality, unless the dimension is
very high; then that will cause a problem.
So PCA, for such large data, is usually infeasible, at least not in real time.
And updating PCA is non-trivial. And if you need to index the data, PCA is
usually not a good indexing scheme. And even if you can do it, PCA
usually does not give you good performance for very
high-dimensional sparse data. So that's my recent experience.
Okay. So now how do we deal with the data if it's such huge data? So
suppose this is the original data matrix A, which is too big and very high dimensional. We can
multiply it by a matrix R, and we get another matrix B. So B is a matrix
that has K columns, and K is small.
This is called random projections. Why is it called random projection?
Because the entries of R are sampled at random, for example from a normal(0,1)
distribution. So R is a matrix of i.i.d. N(0,1) samples.
Then if you look at B times B transpose, which is A times R times R
transpose times A transpose, and you take the expectation, it becomes A times A transpose,
because R is normal.
So this is cool, right? Because in expectation you get a good answer. Of
course, this is only in expectation, and the error could be very large, so we have
to analyze the variance. But in expectation we get an unbiased estimate of the
inner product.
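[Editor's note: a minimal Python sketch of the idea just described, B = A R with Gaussian R, and the resulting unbiased inner-product estimate. The sizes and the toy data below are made up for illustration; they are not the speaker's.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 200                 # n observations, d original dims, k projections

A = rng.random((n, d)) * (rng.random((n, d)) < 0.01)   # sparse-ish toy data matrix
R = rng.standard_normal((d, k))            # i.i.d. N(0,1) projection matrix

B = A @ R                                  # projected data, n x k; A can be discarded

est = (B @ B.T) / k                        # E[(1/k) B B^T] = A A^T, so this is unbiased
exact = A @ A.T
print(np.abs(est - exact).mean())          # error shrinks roughly like 1/sqrt(k)
```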
>>: So what kind of applications do you find this kind of preservation of the
distance useful for, or otherwise?
>> Ping Li: Yeah. I will continue. I will keep that in mind, yeah.
So now suppose you want to build a linear model for A. Well, because
we cannot do that, let's build a model for B. Basically this is
analogous to PCA: you fit the model on the PCA projections, except
now you do something much cheaper than PCA, and it becomes B.
And B actually preserves the inner products, so all the
algorithms still work if you build a model on B directly. That means you can
actually discard A once you get B. Questions? Okay.
So now, one concern people often have: well, this looks nice, but it's
kind of cheating in a way, because A times R could be very expensive. It's a
huge matrix multiplication.
So to address that concern, in 2006 we wrote a paper called Very Sparse
Random Projections. Basically, instead of sampling from a normal, you can
sample from a very sparse distribution. If this S parameter is 100,
that means 99 percent of the entries on average are zeros. If S is 10,000, then
99.99 percent are zeros.
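[Editor's note: a sketch of one common parameterization of the very sparse projection matrix, where entries take values sqrt(S)·{+1, −1, 0} with probabilities 1/(2S), 1/(2S), 1 − 1/S. The function name and defaults below are illustrative, not from the talk.]

```python
import numpy as np

def very_sparse_projection(d, k, s=100, seed=0):
    """Projection matrix whose entries are sqrt(s) * {+1, -1, 0} with
    probabilities 1/(2s), 1/(2s), 1 - 1/s.  With s = 100, about 99% of the
    entries are zero; with s = 10,000, about 99.99% are zero."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([1.0, -1.0], size=(d, k))
    nonzero = rng.random((d, k)) < 1.0 / s
    return np.sqrt(s) * signs * nonzero

# Used exactly like the dense Gaussian R above: B = A @ very_sparse_projection(d, k)
```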
And the magic thing is that this also works -- well, it works in a certain way. Let me
show how it works. So this is the 3-gram test data: the 3,000 -- actually 3,500 --
text samples, 16 million dimensions, about 4,000 nonzeros on average per observation.
So it's 16 million dimensions but only 4,000 nonzeros, so it's very sparse. But 4,000
is not a small number; 4,000 is actually fairly typical. I heard that
usually people see the main applications range from 1,000 to 10,000
nonzeros.
But the dimension could be really large. Yeah.
So this task is binary classification, spam versus non-spam. So now let's
apply a linear SVM. The red curves are the results for linear SVM. This
is accuracy; higher is better. And remember SVM has a parameter C, the
regularization parameter. So the red curves are the original results on A. Now, the
other curves are the results on B, where we use random projections.
Now we need to determine K, the number of projections. This is
interesting, because if you use 4,000 you still don't get a good result. I
mean, if you really care about this point, it's still not good enough. In general
you need 10,000 projections. So for random projections you need 10,000; I'll
explain why you need 10,000.
But the original data has only 4,000 nonzeros. So it reduces the dimensionality, but does
not really help with data reduction in this case. Another interesting thing
is that the projection matrix can be very sparse. For example,
with S equal to 1,000 you still get almost the same results here.
So the two interesting things here with random projections are that you need a large
number of projections, but the projection matrix can be sparse. Yeah. Okay. So
that's a thousand; actually, if you look at S equal to 10,000, you still see the same thing.
And in the summer I gave a talk at the Stanford Massive Data Workshop, where I included a
lot more results for classification, clustering, and regression with very sparse
random projections. So if you just search for MMDS, you're going to find it.
I like very sparse random projections a lot, because when I was looking for
a job in 2007 I think they helped me get the job. But now five years have passed, so we
improve, right? We want to do something better and more practical.
So this is what we are going to criticize about random projections and their variants: they are
inaccurate. You need 10,000 projections, especially on binary data.
Random projection actually doesn't really care whether the data is binary or not; it
doesn't even care whether it's sparse or not.
But in practice, if you really have binary sparse data, you want to do
something else. So that's the point. Now let's understand why
it's not accurate. Look at the first two rows of A, called U1 and U2. The first
two rows of B are V1 and V2. So U1 and U2 are in very high dimension, and V1
and V2 are in K dimensions. K is small -- well, small like 10,000.
So in expectation the inner product is preserved. I always use a for the inner product.
However, if you look at the variance, the variance of the inner product estimate
is actually dominated by the marginal L2 norms.
That's interesting, because a squared is always less than m1
times m2, right, by the Cauchy-Schwarz inequality. So that means the variance of random
projection is always dominated by something that nobody really cares about.
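[Editor's note: the variance expression being referenced, as it appears in the random projections literature for the simple inner-product estimator with a Gaussian R (here m1 = ||u1||², m2 = ||u2||², a = ⟨u1, u2⟩):

$$\hat{a} \;=\; \frac{1}{k}\sum_{j=1}^{k} v_{1,j}\,v_{2,j}, \qquad \mathrm{E}[\hat{a}] = a, \qquad \mathrm{Var}(\hat{a}) \;=\; \frac{1}{k}\left(m_1 m_2 + a^2\right),$$

and since a² ≤ m1·m2 by Cauchy-Schwarz, the m1·m2 term always dominates, even when a ≈ 0.]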
You care about a, but its variance is actually dominated by something else. What's
really worse is that, for a given dataset, if you
look at the pairwise inner products, most of the pairs are more or less orthogonal.
Of course, a fraction of the pairs are very similar, but most
of them are more or less orthogonal. So that means you are actually in a very bad
situation: a is roughly zero, but the variance is still very large. So how do you
reduce the variance? You increase K.
So if you make K like 10,000, then for the standard deviation you take the square
root, right, so you get roughly one percent error. That's how you get accurate results,
and that's why 10,000 is a good number. Yeah. Question?
>>: [inaudible].
>> Ping Li: It's all relative. So normalizing doesn't really help. It helps in
some way if the data is not even -- I mean, if m1 and m2 are very different,
normalization actually helps in that aspect. But because it's all
relative, right -- it's all relative -- normalization in general
doesn't help.
>>: [inaudible].
>> Ping Li: Yeah. So each row. Each row. Every row, yeah. And the norm,
yeah. So if you want to verify this formula, think about U1 and U2 being identical;
then you actually get a chi-square, and in statistics you know the chi-square
variance is 2 times something squared. So that's how you can verify the formula quickly. Yeah. Okay.
>>: All right. So I'm going to say [inaudible] because that always seems to be
the answer. Is that the answer?
>> Ping Li: Yeah.
>>: Oh, okay. [laughter].
>> Ping Li: Yeah. Because I cannot think of reason why not because you can
always say [inaudible] [laughter]. Yeah. That's exactly right.
The reason -- now, actually in this case what's important is the nonzeros. So
focus on the nonzeros. Yeah. Wow. That's very clever. Yeah.
So what we do is actually focus on the important things, like the nonzeros. For
example, if the data is binary, you only care about the 1s. So we focus on the 1s. We
call it b-bit minwise hashing -- so we do b-bit minwise hashing instead of random projections.
And surprisingly, it's very simple and surprisingly much more accurate than
random projections for inner product estimation. And minwise hashing --
since many of you are from search, we know it's a standard algorithm in the
search industry -- and b-bit minwise hashing requires much smaller
space.
And random projection is only applicable to pairwise similarities, but with minwise
hashing or b-bit minwise hashing you can do 3-way. For people from databases, you
want to go beyond pairwise -- like [inaudible] work, you want to go
beyond pairwise. Actually we can do that.
And we also develop methods so you can do large-scale linear learning. And, of
course, you can also do kernel learning.
But you can also use this for near-neighbor search. This is because the
algorithm directly provides an indexing scheme. My experience is that -- we have the
slides -- my experience is that I'm never going to be able to cover that. But
I'm going to leave the slides, then.
So the major drawback of minwise hashing and b-bit minwise hashing:
if I had given this talk before the summer, I would say, well, the preprocessing is very
expensive. Of course, when people said, well, it's expensive, at that
time I would say, well, it's only preprocessing. But now the problem is solved.
>>: But this only reduce the dimensionality of the data, not the data [inaudible].
>> Ping Li: And also data size.
>>: [inaudible] as well?
>> Ping Li: Yeah. So I will -- yeah, I will cover that.
So the question is, why do we care about binary data? Well, it looks like for text, for
multiple datasets, the binary representation seems to be very common.
For very high-dimensional data, it's often the case that, even for very
high-dimensional data like the webspam dataset I just showed you, if you binarize it,
it doesn't really matter for the accuracy. That seems to be true for many very
high-dimensional datasets.
And if you consider pairwise or higher-order interactions: you start with a
dataset that is reasonably sparse, but if you do pairwise then it becomes sparser
and sparser, because zero multiplied by anything becomes zero.
So usually very high-dimensional data is generated
because of interactions, and that's why very high-dimensional data is usually very
sparse.
And once the data is sparse and binary -- people in statistics think
about the data matrix -- for binary sparse data, what we really need is to store
the nonzeros. For example, S1 is a binary vector, but we only need to
store 1, 4, 5, 8, meaning the locations of the nonzeros, which is the inverted index, of
course.
So this is the classic example. How do you get massive binary
high-dimensional data? How do you represent text? Well, the bag-of-words
model: for "today is a nice day", you can just represent it as a set of
four elements. Of course, we know that it doesn't work too well, because the
order does not matter, and that means maybe it's not a very meaningful
representation of the sentence.
So to overcome that, the next trick is to use contiguous words. We could use
every two contiguous words, every three contiguous words, and we could just
continue this business until maybe five; I heard people use 11. Yeah.
So this generates very high-dimensional data. This is because -- okay,
right here. So if you build a dictionary -- this is document 1 -- a few common
words are usually there, but otherwise it's mostly just zeros. So suppose this
is 10^5 common English words. If you consider pairwise, it
becomes 10^5 squared, et cetera. So the vector
becomes longer and longer, and sparser and sparser. Yeah.
And the interesting thing is that if W is equal to 5, you get 2^83
dimensions. Of course, unless you have lots of documents, most of the columns
will be zero, so you can remove those columns. So usually people use 2^64
as a convenient upper limit. Yeah. Questions?
>>: No, I just [inaudible].
>> Ping Li: Okay. So that's where this 2^64 story comes from.
>>: [inaudible] different name [inaudible] normally it's tri-gram, n-gram.
>> Ping Li: I think John once told me that his understanding is that when it says
n-grams it means n-grams, but when it says shingles it means you actually apply
hashing right after. Yeah. I think that's a very clever explanation. Yeah.
Basically, 2^64 is not something you want to use directly anyway,
but when you apply hashing, it's convenient. Yeah.
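[Editor's note: a small sketch of turning text into hashed w-shingles as just described. The specific hash function (blake2b truncated to 64 bits) is an arbitrary choice for illustration, not the one used in the speaker's systems.]

```python
import hashlib

def hashed_shingles(text, w=3):
    """All w-contiguous-word shingles of a document, each hashed to a 64-bit ID,
    so the implicit feature space is the convenient 2^64 upper limit."""
    words = text.lower().split()
    out = set()
    for i in range(len(words) - w + 1):
        gram = " ".join(words[i:i + w])
        h = hashlib.blake2b(gram.encode(), digest_size=8)   # 8 bytes = 64 bits
        out.add(int.from_bytes(h.digest(), "big"))
    return out

print(hashed_shingles("today is a nice day", w=2))
```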
So now let's introduce notation. You now have a set -- binary vectors are
sets. So f1, f2 are just the sizes of the two sets, and a is the size of the intersection. And
a good similarity measure is the resemblance: the size of the intersection over the size
of the union, R = a / (f1 + f2 - a). And for binary data, I think it's more natural than the
correlation coefficient.
So that's a notation. So this is a trick that the search industry has been using for
many years. So basically suppose we're able to do a permutation on the space.
We apply the same permutation to two sets, and we only look at the -- only store
the minimum. And the chance that the two minimums are equal is exactly the
resemblance. I will give you an example.
So suppose the space size is 5, these two sets have
20 percent similarity, and this is one realization of the permutation: 0 becomes 3, 1
becomes 2, 4 becomes 1, et cetera.
Now let's do the permutation on S1. So pi of S1 means 0 becomes 3, okay,
and 3 becomes 4, okay. The minimum is 1, so you store the minimum and
discard the rest -- you only store the minimum. And you apply the
permutation to the second set; its
minimum is 0. This time they're not equal.
They're not equal because you only have a 1 in 5 chance of being equal. So how
do you make sure they are equal often enough? Well, just repeat this
K times. If we repeat this K times, you can treat the matches as
binomial, so the variance is just (1/K) R (1 - R), where R is
just the binomial probability.
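[Editor's note: a toy Python sketch of the scheme just described: K random permutations, keep only the minimum of each permuted set, and estimate the resemblance by the fraction of matching minimums. Real systems hash into a 2^64 space instead of materializing permutations; the sets and sizes here are made up.]

```python
import numpy as np

def minwise_hash(s, k, d, seed=0):
    """k minimums of a set s under k random permutations of {0, ..., d-1}.
    The same seed must be used for every set so the permutations match."""
    rng = np.random.default_rng(seed)
    idx = np.fromiter(s, dtype=int)
    return np.array([rng.permutation(d)[idx].min() for _ in range(k)])

S1, S2 = {0, 3, 4}, {1, 3, 4, 5}          # resemblance = |{3,4}| / |{0,1,3,4,5}| = 0.4
k, d = 500, 10
h1, h2 = minwise_hash(S1, k, d), minwise_hash(S2, k, d)
print(np.mean(h1 == h2))                   # unbiased estimate, variance R(1-R)/k
```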
So in terms of the data matrix -- machine learning and statistics people like the
data matrix.
What minwise hashing is basically doing is: suppose you can permute the
columns. These are the original columns, and after permutation this is your new data
matrix. Then you look from left to right,
and you stop at the first location where there is a nonzero, and you store that
location. So 2, 0, and 0.
Then you discard the rest. Yeah. So that's what minwise hashing is doing
conceptually in terms of the matrix. Questions?
>>: Is this the same size for all the rows?
>> Ping Li: Yes, the same. The pi basically means you permute the columns.
So it has to be the same. Yeah. Very good.
So usually when I show this slide, people will say, well, what about the maximum?
Can't we just store the maximum? The answer is yes, of course. The maximum
contains essentially the same amount of information as the minimum. But we choose the
minimum. Yeah. Okay. Question? Okay.
So this is what minwise hashing is doing. We have to repeat K times. Now, remember
with random projections we would do K random projections,
and K has to be 10,000. Here K only needs to be 200, or 500 at most. So
it's better.
So now the immediate question is, how do you actually store these minimums?
The minimum location can actually fall anywhere
from the beginning to almost the end, because the data is
very sparse, right? It can really be anywhere,
essentially anywhere in this matrix.
So that means the minimum value is often a very big number if you
have a sparse matrix. And to
make sure you get the right answer when the dimensionality is 2^64,
you have to use 64 bits to store the minimum values. Yeah. Okay. Because the
minimum can happen towards the end. Yeah. Questions? Okay.
So 64 bits is actually a big number. Each number is 64 bits, but then you
have to multiply by K, the number of permutations, and then multiply by the
number of documents or images. Yeah. So the storage is big, and
the computation is accordingly also very expensive.
And later we're going to show that this also corresponds to very high-dimensional
data.
So how do we solve all these problems? The solution is simple. Suppose
instead of storing each value using 64 bits, we only store it using 1 or 2 bits; then
everything becomes easy. Now, the question is, will that still work?
So this is the intuition. Suppose you do the hashing and you have to store it.
Instead of storing the hash value using 64 bits, let's only store, for example,
1 bit or 2 bits. The intuition is that if the sets are identical, then the hashed values will
be identical, and then any bits will be identical, right? Because you apply the same
operation.
So the question is, if they're similar, will the lowest bits also be similar? Well,
that needs a proof. And in practice, since we often care about the pairs
of sets with high similarity, like larger than .5, hopefully we do not need that
many bits.
So now we need to introduce more notation. Originally we
use Z1, Z2 to denote the two minimum values. And now suppose you want to
take b bits. For example, suppose Z1, the minimum hash value, is 7.
If you only take 1 bit it's 1; if you take 2 bits,
it's 3.
Basically, if you only
take the lowest b bits -- say two bits -- then the number will cycle through 0, 1, 2, 3.
So that's what we mean by taking the bits.
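[Editor's note: "taking the lowest b bits" is just a bit mask; a one-line sketch reproducing the example from the talk (Z1 = 7 gives 1 with one bit and 3 with two bits).]

```python
def lowest_bits(z, b):
    """Keep only the lowest b bits of a minwise hash value."""
    return z & ((1 << b) - 1)

print(lowest_bits(7, 1), lowest_bits(7, 2))   # -> 1 3
```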
So now we have to solve a probability problem. Originally the collision
probability -- that the two minimum values are equal -- is R. But the new collision
probability is actually C1 + (1 - C2) R.
How do we derive this? Well, it's just a probability exercise. And if you care
about what C1 and C2 are, there is a surprisingly
elegant formula, which is all related to the relative sizes of the sets:
f1 is the size of the set and D is the space size. It all comes
from there.
And to derive the formula we actually assume D is very large, but it
turns out the formula is remarkably accurate. So let me try
to convince you it is very accurate.
So how do we compute the exact probability? If you care, the exact value can be
computed as follows. Suppose you
only care about one bit. You have two numbers, and you look at
whether each is even or odd. Then you want to check if the two numbers are equal
or not, and whether they are both even or both odd. So basically you just have to
keep track of when both are even, when both are odd, and when they
are both equal, and then add everything together to get the answer.
Except that now it's a very complicated formula. But at least you can compute
it exhaustively if D is small enough. And if you do that, we can verify
the approximate formula: if you take the approximate
formula we derived and subtract the exact value, the answer is that it's very,
very accurate, even for small D. So all this is trying to
show that it's a very accurate formula.
And the Communications of the ACM seemed to be very interested in this work. I
guess at that time they expected there would be a lot of new applications. I think
they had good vision. So I'm going to show the applications of this b-bit
minwise hashing.
But first, we haven't fully answered the
question: if you use only one bit, or b bits, how much do you lose in
terms of accuracy? With a smaller number of bits you need less space, but you get a
larger variance, because you only use a part of the information.
But if you look at space times variance, the relative improvement is
64R/(1 + R). What does that mean? It basically means that if originally you
use 64 bits and K permutations, and now you use one bit --
sorry, let me draw it. If originally you use 64 bits and K
permutations, and now you use one bit and 3K permutations, you get the same
answer in terms of variance, if the similarity is .5. So that's what I mean. Yeah.
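[Editor's note: spelling out the arithmetic behind the claim. One bit uses 1/64 of the space but has a larger variance; at similarity R the space-times-variance improvement quoted above is

$$\frac{64\,R}{1+R}, \qquad R = 0.5 \ \Rightarrow\ \frac{64 \times 0.5}{1.5} \approx 21.3,$$

so matching the 64-bit variance requires about (1 + R)/R = 3 times as many permutations at R = 0.5, i.e. 3K one-bit samples in place of K 64-bit samples, while still using roughly 1/21 of the storage.]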
So let's see how it really works. Christian and I did the experiments with
Microsoft data. The experiments retrieve the document pairs whose
similarity is larger than a threshold R0. Let's look at the most similar, like R0 = .8. So we
want to find documents which overlap 80 percent. Yeah. So if you
want to find who copied your paper, 80 percent. So that's the [inaudible] you use.
So basically, using one bit you get almost the same
answer as using all the bits, if the similarity level you want is 80 percent. But this
changes at 50 percent: if you only want to detect 50 percent
similarities, then with one bit you don't get the same answer as using all the bits,
but if you increase the number of permutations by a small
factor, for example three -- so instead of 100 you use 300 -- you'll get the same answer.
So that's what the theory predicts. Yeah.
Okay.
>>: Just curious. Where was the original like random projection's technique
beyond this graph?
>> Ping Li: I don't have random projection plots for this one. But I have random
projection comparisons later, yeah.
>>: Oh, okay.
>> Ping Li: Yeah. So this is at 80 percent. Then people say, well, what about
90 percent? Better, right? Well, 99 percent is even better. So maybe we can
do better than one bit. So that's how we got the half-bit idea.
Half a bit is basically cheating, because there's no such thing as half a bit. So
what we did is: suppose you do two permutations, you take one bit from each, then you
combine them. You still get one bit. But I needed to give it a new
name, so I call it half a bit. People would be curious what half a bit means --
then it's, oh, that's what you mean.
So that's what I mean. You can still do the probability calculation and you
can compute the variance. And the encouraging result is that when R goes to 1,
the variance is reduced by a factor of 2 compared to keeping the two bits separately; so it
behaves like half a bit. This is very encouraging. Then I can continue doing this business, right? I
can use quarter bits, eighth bits -- this number just keeps shrinking,
right? And of course that limit makes sense, because if two sets are identical, no matter
how you do it, the results will be identical, right?
But the trouble is that it's going to hurt if the similarity is
lower than a certain threshold. So it helps for high similarity, but it hurts below that. So
you probably don't want to push this too much. Yeah.
So -- questions? Okay. The next thing I'd like to show is 3-way similarity.
Suppose you care about 3-way similarity: you care about the
relative size of this three-set intersection here. Yeah.
You can still compute these probabilities. The probabilities are just very, very
complicated, and I don't expect you to read them very carefully. But the
take-home message is that if you care about 3-way similarities,
you can still do b-bit minwise hashing, and compared to using 64 bits you still
get substantial improvements. And one interesting thing is that you have to use at least 2
bits to estimate 3-way similarities.
Yeah. That's an interesting phenomenon. Yeah.
>>: Question?
>> Ping Li: Uh-huh.
>>: So your -- the variance is used to bound this on the -- the similarity metric?
[inaudible] all represent similarities?
>> Ping Li: Uh-huh.
>>: Whereas -- so my question is, if I want to make sure that if, say, for
one query I want all of the other items in my dataset that are within 80 percent
similarity, right, 80 percent cosine similarity or Jaccard similarity,
then I get a lot of noise from the [inaudible] similarity items that can actually
bleed into my retrieved set.
So even though for the items that are truly very similar, I will get them, I could
also have noisy ones that come from items that are not actually very similar but
become very similar after projection and compression. Does that [inaudible].
>> Ping Li: Well, the variance is proportional to, for example, R(1 - R). So if
R is low, then the variance is going to stay low. That means
that if you start with .1, this quantity is going to be smaller than .1 anyway, and then
you take a square root, so the standard deviation is going to be a small number. So the variance is also
proportional to the original value, if R is .1.
So this guy is still going to be a small number, yeah.
>>: Isn't that just one of the [inaudible] you're dividing by K. Okay.
>> Ping Li: Yeah. So it's still going to be a small number. What you said
will actually happen at certain boundaries; it's going to mess up things a little
bit, yeah. But if it's something with really low similarity, it's not going to
really be a factor.
On the other hand, because you really care about the highly similar things, you
only care about the highly similar things, so as long as the estimates for the low
similarity pairs are not going to become high, you'll be fine, right? So --
>>: [inaudible].
>> Ping Li: Yeah. And it's not going to be very high. Because of this variance.
Yeah. Yeah. Of course, if this number is not .1, it's .3, then it's -- at certain point,
it's going to affect, yeah. Okay.
So now the next question I'd like to cover is, how do we use this for learning? We
can use this for retrieval, but how do we use this to build models? With
random projections it's easy, right? You just do the random projection, discard A,
and use B. That's easy. You build a model on B.
But now with b-bit minwise hashing it was not clear at the beginning how you do
that. Because with minwise hashing, for permutations 1, 2, 3, ..., K, every permutation gives
you a number: 375, 200-something -- I don't know -- 75, and 1,049, et cetera. And that is one set.
Then for another set you get 300-something -- with the same permutations -- 1,000, et cetera. So
these are the hash values you get each time. Yeah. But how do we build a linear
model from this data? That was actually not clear at the beginning. But it turns
out it's very easy. It's just a trick.
It's not intuitive, because first of all, we want to make sure:
is resemblance something good to use? Well, it turns out resemblance is
something very nice, because if you have N data points and you build the
resemblance matrix, we have a simple proof that it's
positive definite. So resemblance is a good kernel.
So basically, if you want to build a linear model using
a minwise hashing type of technique, first you want to make sure about the
resemblance itself. Because we're approximating the resemblance, you want to
make sure the resemblance is a good representation.
It turns out it is: the resemblance matrix is positive definite.
And the proof comes from the fact that the minwise hashing matrix is
also positive definite, and the b-bit minwise hashing matrix is also positive definite.
And the proof is nearly trivial. I don't see why we get a trivial
proof for such interesting results.
So this is minwise hashing data. And every time we use it, we
want to compute the indicator function: whether two hash values are equal or not,
right? But the indicator function does not look like an inner product at first.
If you look carefully, though, the indicator function is indeed an inner product, in
a non-trivial way. Basically you can just expand each value into a vector with exactly
one 1 and all the rest 0; then it becomes an inner product. So let me give you a trivial
example.
So basically, suppose the data only ranges from 0 to 4, with D equal to 5.
Then you can expand the data:
0 becomes 0,0,0,0,1, and 4 becomes 1,0,0,0,0. And now we have the indicator function
to check if two values are equal. If they're not equal, their
corresponding expanded vectors do not overlap -- I mean, the inner product is zero.
So there's a one-to-one correspondence between the inner product and the
indicator function. Once you see this, it's actually kind of trivial, but
it's very important, because it allows us to use a linear
model on this data -- we just expand it. So how do you expand it? Well, 375,
right, that's easy: we get a vector indexed from 0 to 2^64 minus 1, position 375 is 1, and all
the rest are zeros. That kind of vector, yeah.
So every number becomes a vector like this, and then we just concatenate
all the vectors. And people say you're crazy, because that's 2^64
dimensions, right? This is 2^64 dimensions times K. So
conceptually you can do it, but you cannot really do it in practice.
But now we don't do 64 bits; we use b bits. So that actually
naturally solves the problem. Okay. So does everybody get the trick? This is an easy
trick, and I want to make sure everybody understands it.
Okay. Good? Okay. Now, exactly how do we do it with b-bit minwise
hashing? Let me show you the procedure. So basically,
suppose in this example we use K = 3, so three
permutations. Every permutation gives me a number, so for each set I get
three numbers.
I did three permutations, I got three numbers, and then I look at
their binary representations. Oh, that's too many bits -- let's use only
two bits. So for these numbers I want to store 01, 00, and 11. So I do three
permutations, I get three numbers for each set -- and I have other sets, of
course -- but for each number I only store the lowest b bits, two bits in this example.
So the numbers become 1, 0, and 3. But in order to use
the expansion trick, we have to expand each into 2^b dimensions. In this example b
is not 64, which is good; b is only 2, so I get 4 dimensions. So 1
becomes 0010, 0 becomes 0001, and 3 is 1000. Now we just concatenate
the 3 short vectors to make a vector of length 12. That will be the
new vector input to the support vector machine.
Then I do the same thing for all the other sets. Yeah. Well, it doesn't seem like I
really did anything after a couple of years of work. All I did is really play this
trick. So -- yeah.
This is actually very simple, yeah. You start with binary data
with many nonzeros, but now you end up with only three nonzeros, three ones.
And instead of 2^64 dimensions, now you get only 12
dimensions, for example. Okay.
Yeah. So every set gives you a 12-dimensional vector, another set gives another
12-dimensional vector, and you can use that directly with a linear classifier. This
is because when you take the inner product of these vectors, it is
exactly the value of the indicator functions. That's
why this thing works.
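[Editor's note: a minimal sketch of the expansion trick just described, using the K = 3, b = 2 example from the talk. The exact position of the 1 within each length-2^b block is an arbitrary convention; only the inner products matter.]

```python
import numpy as np

def bbit_features(hash_values, b):
    """Expand K b-bit minwise hash values into a length K * 2^b binary vector:
    each value becomes a one-hot block of length 2^b, and the blocks are
    concatenated.  The inner product of two such vectors counts how many of
    the K hash values match, i.e. the sum of the indicator functions."""
    width = 1 << b
    x = np.zeros(len(hash_values) * width)
    for j, z in enumerate(hash_values):
        x[j * width + (z & (width - 1))] = 1.0
    return x

x1 = bbit_features([1, 0, 3], b=2)     # 12-dimensional, exactly 3 ones
x2 = bbit_features([1, 2, 3], b=2)
print(int(x1 @ x2))                    # 2 of the 3 hash values match
```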
I haven't really shown how well it works yet. So now let's do experiments --
the same Webspam dataset again, which is 16 million dimensions. And we
also processed another dataset to make it a billion dimensions, just for testing. This
billion-dimensional data has about 10,000 nonzeros per example. And
then what? Well, then just do it.
So we do the hashing --
>>: [inaudible].
>> Ping Li: Okay.
>>: What does 1/30 3-way features mean?
>> Ping Li: 1/30 --
>>: So you took RCV, you formed unigrams, bigrams, and then I see 1/30 3-way
features.
>> Ping Li: Well, no -- it's pairwise, not bigrams. Pairwise.
>>: [inaudible] more than bigrams.
>> Ping Li: Yeah. So pairwise, yeah. I think bigrams is --
>>: You discarded the sequence information too.
>> Ping Li: That's because we don't have that information.
>>: Oh, okay.
>> Ping Li: The datasets, we don't have that information yet.
>>: [inaudible].
>> Ping Li: Sorry?
We --
>>: It's any two words anywhere in the document?
>> Ping Li: Yeah.
>>: Okay.
>> Ping Li: So this is actually my favorite example.
>>: What does 1/30 and 3-way mean?
>> Ping Li: That means I only did a fraction of the 3-way features, because the full set is too big.
I wanted to keep the number of dimensions within a billion, so you have to discard --
>>: So you randomly --
>> Ping Li: Yeah.
>>: -- 3-way 29, 30.
>> Ping Li: Yeah. Yeah.
>>: Okay. That's fine.
>> Ping Li: So, okay. Actually, I still want to show my
favorite example. Suppose you want to do digit recognition. If
you're only using the single pixels, it doesn't work that well. But you can do
pairwise features, which is good. You can also do shingles, meaning you use
all the two-by-two grids of pixels. That's like a bigram, right? But bigrams are local
expansions. If you do pairwise, it means every pair of pixels gets used.
For example, for a two, maybe this pixel and this pixel together is a good representation. So I
think pairwise and n-grams can be used together, yeah.
>>: [inaudible].
>> Ping Li: They're not binary. But once you do the pairwise expansion, binary or not
doesn't matter. So that's actually another example of why binary is good
for high-dimensional data. Yeah. The Webspam data is not binary either, but you
get almost the same results if you binarize it.
>>: [inaudible].
>> Ping Li: It's available. You can actually download from the LIBLINEAR
[inaudible] site.
>>: [inaudible].
>> Ping Li: Yeah. Okay. Good. So let's do experiments. Now,
remember the experiments with random projections, right? We needed K
equal to about 10,000 random projections. Now this is
b-bit minwise hashing, and b-bit minwise hashing has two parameters:
one is the number of permutations, and we also need to choose the number of
bits.
So for this example, with 200 permutations and about 8 bits you get the
same answer as using the original data. And 200 permutations with 8 bits -- I
usually recommend 8 bits because it doesn't really make sense to use 6 or 5
bits -- that's like 70 megabytes, so it's actually very small data. Yeah.
So you get the same answer with 70 megabytes of data compared to the
original data. Questions on the experiments?
>>: [inaudible] use 8 bits and 200 hashes or permutations. Like is it -- and then
you look at the representation that you induce to feed to the linear classifier. Is it
about half the features are on and half are off?
>> Ping Li: You're talking about the coefficients being 0 or close to 0?
>>: Right. Right.
>> Ping Li: I didn't look at that carefully. Here I always use L2 regularization. We
did try L1, and it seemed to work the same way, but you get much sparser
results. Yeah. We didn't write a paper on that, but we did do the experiments. So I
don't know how to answer your question, but at least a significant fraction of the
weights can be zero.
>>: Yeah. Just like, you know, sort of like when I think about bloom filters and
like, oh, when you get the density right about half the bits are on, and so it's like
oh, if you got the features space that you induced right maybe each feature is
roughly like 50 percent likely to be present. I don't know.
>> Ping Li: Yeah. That might be interesting future research. Because
right now -- I think the good thing about hashing is that it allows you to do
more n-grams. So hopefully with more
n-grams you get better results, with maybe more computation. But exactly how
it works -- there is some more interesting research that can
be done there, yeah. Okay.
Yeah. If Microsoft is interested in funding that research.
Okay. So we showed this for different K, from 50 to 500. We see that the
training results are similar, except that with 500 permutations even one bit can do pretty well.
Yeah. But that's expected, yeah.
And the training time is reduced from about 100 to 800
seconds to about 3 to 7 seconds. And this
assumes the original data, which is only about 24 gigabytes, can
be put in memory. Yeah. Because now
I think 48 gigabytes is probably a standard memory size. Before I
came to the talk, the system administrator in my department told me
that he has set up the machine for me, which I purchased; the machine
has 250 gigabytes of memory. So it's pretty good.
So 24 gigabytes you can put in memory. But even when you fit
below the memory limit, it still takes about 100 to 500 seconds. Now
with the hashing we only use like 3 to 7. Okay. And for testing, it's
reduced from 20 seconds to about two seconds.
>>: Then can you [inaudible] -- you mentioned that the preprocessing was
expensive?
>> Ping Li: Yes, preprocessing is expensive. I'm glad you asked, yeah. So here
I assume the data has already been processed, so the [inaudible] time does not
include the preprocessing time.
>>: [inaudible] preprocessing?
>> Ping Li: Yes, preprocessing where you can [inaudible] you can do GPUs and
all those things.
>>: What do you account as preprocessing?
>> Ping Li: Sorry?
>>: What are the operations that you do?
>> Ping Li: So here the numbers do not include the
preprocessing.
>>: Right. I'm wondering what -- I was just --
>> Ping Li: Wondering what the numbers look like?
>>: I'm also wondering what kind of preprocessing it is.
>> Ping Li: Just permutation, right? Permutation. So actually this -- I will go
there, yeah. I'll get there. Okay.
So okay. So this is the training time. But I agree that it cheats a little bit, because
I did not include the preprocessing time. And this is all linear SVM. But if you
want to do logistic regression, the same story still
holds: for this task, the training time is reduced from a thousand seconds to
about 10 seconds. So if you change your SVM to logistic regression, you get a
similar kind of story, so I'm not going to repeat it.
And if you care about the random projection results, how do they compare? So these are the
random projection results,
and these are the minwise hashing results in terms of K. Here
K for random projection means the number of projections, and for minwise hashing it
means the number of permutations. So our result is right here, but if you do
random projections, the result is like this. Basically you need 10^4
random projections to get similar results to minwise hashing with 200
permutations. Yeah.
>>: So with this kind of data compression, does it make sense to do nonlinear
SVM?
>> Ping Li: Probably not. For example -- let me give you an
example since I have [inaudible] here. If you take this dataset,
which is 760 dimensions, and you do linear SVM, it's 85 percent
accuracy, but with a nonlinear SVM it's 98.5, right? And if
you do linear plus pairwise interactions, it's almost 98
percent. So there's actually an interesting issue: do we really need a nonlinear
classifier? Because pairwise is in a sense basically a second
order approximation of the Gaussian kernel.
So even with the pairwise features, linear SVM is still much faster than kernels.
Yeah. So that's actually interesting. Sorry?
>>: Question. What is the difference between the random projection and the
minwise hashing?
>> Ping Li: With random projection you do a matrix multiplication; random
permutation only permutes the columns.
And as we showed, the variance of random projection is dominated by the
marginals, but for minwise hashing the variance is actually proportional to the
similarity.
>>: [inaudible] preprocessing?
>> Ping Li: I didn't do -- I didn't count any preprocessing in the plots, yeah. So
the computational cost, yeah. So preprocessing is preprocessing. So it can be
important, but sometimes often not important. Yeah.
>>: How expensive it is?
>> Ping Li: It can be very expensive. But it's a one-time cost. It can
be done offline, but it can still be expensive. And in some
cases it matters, because for testing, when a
new image comes in, you have to preprocess it again. So you can use GPUs, and
we actually wrote a paper --
>>: So what's involved in preprocessing, I think [inaudible].
>> Ping Li: Okay. The preprocessing involves -- I'm going to skip all of this. So
the preprocessing involves this: basically you do the permutation of the
columns. Conceptually you always work with the
nonzeros. You basically scan all the nonzeros, you compute the
hash value, and you only store the minimums.
>>: And this also does -- I guess part of preprocessing has to do with actually
coming up with the index or the columns, right?
>> Ping Li: Yeah.
>>: Original data is words or pixels or pairs of pixels, then you have to map it to
this linear index of which column does it fall under?
>> Ping Li: Yes. So usually the hashing can be combined with data
collection. When a document comes in, you linearly scan all the words --
you just parse the documents -- and you can use
a big hash table with 2^64 cells, so
that it almost has no collisions. And then it can be done in one shot,
I mean one pass over the documents.
>>: So [inaudible] dimension or over the data samples?
>> Ping Li: It's over the feature dimension, but for all the data.
>>: I see. But for the test data how do you do that?
>> Ping Li: You just have one row, right? Test data only has one row. Yeah.
Okay. Yeah.
So there is a lot of parallelization that can be done, right? It looks like
almost everything is parallelizable. Yeah.
So I'm going to come back to the preprocessing a little bit more. But let me summarize
K-permutation hashing first. We're now in the big data era. With
big, high-dimensional data we can do random projections, but random
projections require lots of projections, like 10,000. Or we can do minwise
hashing, which is a standard procedure in the context of search. And with b-bit
minwise hashing we can improve, for example, by a factor of 24 in space.
So 24 does not seem to be a very big number, but it's actually big, because
improving something that's already been used for many
years is actually something very substantial.
But more interestingly, the improvement in terms of dimensionality is
not 24, it's 2^24, because we reduce the 2^64 dimensions to 2^b
dimensions, if we consider the data expansion.
And I will leave the slides on hash table building. If you're interested, you can
look at the slides on hash table building for sub-linear time near-neighbor search.
And compared with random projections, it seems to be substantially more
accurate.
And people are very curious about the drawbacks. Well, the first drawback is that
this version is only for binary data. And the preprocessing is
expensive, because you have to repeat. Repeating one time is okay, because you
always have to touch the data one time anyway, right? That's not avoidable.
But if you have to touch it multiple times, like 500 times, it's expensive. However,
each repetition can be done in parallel, so we can use, for example, GPUs. So
speed-wise it's not much of a problem, because it can always be parallelized; 500
permutations can be done in parallel.
However, in terms of energy consumption, that's an actual issue. Speed is
not a problem, but you still burn the electricity. And for a search
engine, we have to do this at massive scale. So that's a concern.
On the other hand, the practice that people have been using for
maybe 20 years -- we do the permutation, we keep the minimum, then we
discard the rest -- this process has to be very wasteful. First of all, we
only look at the minimum. But what about the maximum? The amount of
information should be the same, right? Because you do the permutation, the
amount of information should be equivalent if you only look at the maximum.
So this motivates us -- I should tell you why I developed this thing
called one permutation hashing. I spent a few months last year, around
this time, just running all the experiments. But the preprocessing really takes
too long, so I was tired of it. And when I looked at this, I thought,
why do we have to do this 500 times? We don't have to. So let's
just do it one time.
So this is the trick. Basically, instead of doing 500 permutations, you just break
the space into 500 pieces, and you only look at the smallest number in
each piece.
Why should this work? It works because after the permutation you break
the space into K bins, and every bin is statistically equivalent. Before, we only used the
information in the first bin, essentially. But we can use the
information in any bin. Right?
>>: So let me get this straight. Before when you would get 64 bits in your hash
function, you throw 63 away?
>> Ping Li: Uh-huh. You throw --
>>: Yeah. You should like keep all 64 bits if they're all good, like, right?
>> Ping Li: Uh-huh. Well, you have to store them, right? What's your --
>>: Well, I guess I'm saying, like, the number of hashes -- a function like
MurmurHash claims all the bits are good. So if you're going to do like 200 hashes,
you should really be doing like 200 divided by 64 hashes, right?
>> Ping Li: I will need to think about that. But here I only compare
against the standard minwise hashing. Okay. Question?
>>: [inaudible] zeros before you [inaudible].
>> Ping Li: Yeah. The empty bins. Yeah. So empty bins are actually the only
caveat: you may get an empty bin. If you don't have empty bins, then -- well, before,
you only looked at the first bin, essentially, and the first bin is never empty,
because the smallest value lands in the first bin.
But now if there is an empty bin, that might be an issue.
>>: [inaudible].
>> Ping Li: Sorry?
>>: [inaudible] will be higher --
>> Ping Li: It should be lower. Let me tell you why it's lower. Suppose K is the same as
D, the dimension. Then you get back the original data, right? But if you do the original
minwise hashing, you don't get that. Right?
>>: Yeah, but that's only [inaudible].
>> Ping Li: Yeah. So I -- we have proof.
>>: But [inaudible].
>> Ping Li: Yeah. So actually maybe you can guess. Now, how do we
estimate the similarity? Before, we just counted the
number of matches divided by K. If there are no empty bins, that's still
the same answer: you still count the number of matches divided by K.
That would be the estimate.
And now with empty bins, the answer is the number of matches divided by K
minus the number of jointly empty bins. We have a paper where you
can read the proofs, but basically let me just show you the result.
So the estimator is: you count the number of matches and divide by
K minus the number of empty bins. You can prove it is actually
unbiased, and the variance is actually smaller than the original variance of minwise
hashing.
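[Editor's note: a toy sketch of one permutation hashing and the empty-bin-corrected estimator Nmat / (K − Nemp) just described; real implementations store b-bit offsets within each bin rather than the raw values, and the binning details here are simplified.]

```python
import numpy as np

def one_perm_hash(s, k, d, seed=0):
    """One permutation of {0, ..., d-1}, the space split into k equal bins;
    keep the smallest permuted value in each bin (-1 marks an empty bin).
    The same seed must be used for every set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(d)
    vals = np.sort(perm[np.fromiter(s, dtype=int)])
    bins = np.full(k, -1)
    width = d // k
    for v in vals:                        # vals ascending, so the first hit is the bin minimum
        j = min(v // width, k - 1)
        if bins[j] == -1:
            bins[j] = v
    return bins

def estimate_resemblance(b1, b2):
    """Number of matches divided by (K minus the number of jointly empty bins)."""
    both_empty = (b1 == -1) & (b2 == -1)
    n_mat = ((b1 == b2) & ~both_empty).sum()
    return n_mat / (len(b1) - both_empty.sum())
```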
In a crude sense, to understand why it's better, think about sampling with
replacement versus without replacement; without replacement should be better.
And there's one interesting issue: suppose originally a vector has 100 nonzeros and
you do 500 permutations. You start with 100 nonzeros, but you
actually end up with 500 ones. So that's actually a huge distortion of the
data.
But this scheme doesn't distort the data as much. You only sparsify the
data; you don't destroy the data. Questions?
>>: [inaudible].
>> Ping Li: Both are empty.
>>: Oh, okay.
>> Ping Li: Yeah. That's not a caveat. Both are empty. Yeah.
But, after all, the chance of an empty bin is small. So this depends on the joint
size: if the number of nonzeros divided by K is 5, the chance of an empty bin is
only, like, less than one percent. So it's small. And if you want to use
hashing to reduce the data, this is what you expect anyway, right?
You expect K to be much smaller than the number of nonzeros. That's what
you expect. Of course, if the number of nonzeros is roughly equal to the number
of bins, then you get a significant chance of empty bins, which you have
to consider. The reason we have to consider it is that, even though the
majority of the vectors satisfy this condition, there is always some fraction of
them that are very sparse. So that's why we have to consider that. Yeah.
But this formula is a rigorous formula; if you don't believe it, read the
proof. Yeah. So it's actually unbiased. It's a bit surprising why it's
unbiased, because it is a ratio estimator. If you look at statistics books,
with a ratio estimator you normally don't get unbiasedness. But this
actually is strictly unbiased. Yeah. Okay. But now how do we deal with the
empty bins? Here's the trick. Originally we did three permutations, we
took the bits, we expanded the data, and we concatenated the features.
But now suppose we do one permutation and divide the space into 4 bins, and
suppose one bin is empty. So what do we do? Well, before, like last year
when we did the encoding, we encoded 0 as 0001. We did that not on purpose, but
somehow it leaves a spare code, because 0000 is never used. So for an empty bin we just
put 0000.
And another interesting thing is that we have to normalize, right? Once you
normalize, you actually have to subtract the number of empty bins -- remember
the formula. Yeah. And it actually all works out very well, because the
normalization roughly corresponds to that operation.
Because when you take the inner product, the square root disappears, so you get --
>>: So this actually reduces the permutation cost for preprocessing --
>> Ping Li: You only do one permutation, yeah. So it's always low cost.
Yeah.
>>: So do you lose accuracy when you do --
>> Ping Li: Oh, it's even better. First of all, if we look at this, the variance
is better. But also, now we do SVM:
before we did K permutations this way, and now we do one permutation and we
encode the empty bins with zeros this way. Look at the results -- even better. So with one
permutation, this is the result, and it is actually even better than K
permutations. So the faster method is more accurate.
>>: And so for [inaudible].
>> Ping Li: Yeah.
>>: So I wonder whether it's true that to achieve the same variance you need
more space. I mean, the preprocessing time is [inaudible], but to achieve the
same variance overall, do you need more space?
>> Ping Li: You don't need more space, because everything is
online. Yeah. This is actually in the paper. So everything is online. It's actually
not 12 anymore, it's 0 -- you can view this as 0, because you can start
with 0 again, right? So you don't need more space. Because everything is
online, you can actually reindex. Yeah.
And -- okay. Reindexing is easy, right? It's just a bit subtraction, a bit-shifting
thing.
>>: [inaudible].
>> Ping Li: Yeah. We considered that [inaudible] in the paper. Yeah. So I think that's
almost done. So this is for this data. Okay. So you actually do
better, and you see similar things for SVM and logistic regression. This is
logistic regression as K increases: with one permutation you get better
results, significantly better than K permutations. Because as K
increases -- think about it: if some of the vectors only have 100 nonzeros, then with 512
permutations you artificially add lots of ones. So that's why
those results are actually not as good. Yeah.
>>: So I guess there must be some catch, or otherwise you wouldn't spend all the
time talking about having to use GPUs --
>> Ping Li: Well, that's all sequential, right? And so the GPU work, we
did that last year.
>>: So the GPU is parallel --
>> Ping Li: So the GPU parallelizes the preprocessing, yeah.
>>: Yeah. K times [inaudible].
>> Ping Li: Yeah. And so the GPU accomplishes this, but it does not really save energy.
Yeah. For me the energy doesn't really matter, but maybe for you it matters.
>>: But the result is [inaudible] just do one permutation [inaudible].
>> Ping Li: Yeah. There's no point of talking about -- yeah. Yeah.
>>: So what's the catch? There's no catch at all?
>> Ping Li: There's -- okay. Okay. So you're curious. Let me give you
another example. This dataset has 4,000 nonzeros, right? So let's consider a
dataset with only 500 nonzeros,
which means a significant portion of the vectors have about 100 nonzeros.
And let's use 4,000 permutations, or one permutation with 4,000 bins.
So that means most of the bins are empty, and that's where the zero coding
really works well. You can see the difference: with one permutation you
get much better results than K permutations with K equal to 256.
And with K equal to 4,000, with one permutation you get much,
much better results now.
So K permutations are actually not a very good idea if the original data is too sparse,
because you add lots of ones, right? Yeah.
Because every permutation you add a 1. So that's actually not a good idea. Of
course this is a contrived example.
In practice we expect you have 1,000 nonzeros, so that's probably fine. So this
is my last slide. You can accomplish the same thing as the standard K permutations
which are being used in practice, but with one permutation, and it
actually does better, yeah. That's the end of the talk. Questions?
[applause].
>>: [inaudible] guess again is like if you think statistical example in the
high-dimensional space has 200 nonzero entries in it, then maybe you should
choose 400 partitions or something like this.
Because you said, well, basically like each time you have -- I mean, in your
original min hashings you added a 1. And so like once you start adding more
than 200, that's a bad idea, right? So then you're going to do one. And then the
question is, like, can you get a reduced space for still, like, about 200 [inaudible]
so that the products would be minimally distorted? So I guess I'm just guessing
that if they're distributed -- well, they won't be distributed randomly [inaudible].
But still I would expect like the right K or, you know, the right K, the cheapest K,
something like the order of the number of nonzeros.
>> Ping Li: I think that's about right. Because even though the
variance looks like it's only proportional to R, my
experience is that with more nonzeros you really need a larger
K, yeah. So I think that -- yeah. That has to do with that.
On the other hand, this formula, like R(1 - R), looks good, but it's
actually not sufficient, in the sense that you
basically have to compare this with R itself --
so when R goes to zero, it already cancels. The variance alone is not sufficient. So
you actually need a larger K to compensate for that. So I think the
bloom filter intuition you mentioned probably comes from something related to
that, yeah. Yeah.
So -- yeah. That's actually -- okay.
>>: Yeah. So most of the things I heard here is about reduction of the
dimensionality. But earlier you mentioned that you can also [inaudible] to reduce
the data problems. So which part -- talk about that [inaudible]. I may have
missed that earlier.
>> Ping Li: No, you didn't miss that. I skipped it.
>>: Oh, I see.
>> Ping Li: So basically, with the hashing -- the
hashing provides a partition of the space. Suppose
you use two bits and you do two permutations. Then you can
partition the space according to the
hash values. Suppose you have a table from 0000 to 1111;
those are all the possible hash values, right, because you only use two
bits and two permutations. So that's how you partition the
space.
If you partition the space into 16 bins, then every data point can be put in one of
those bins. Then when you search for nearest neighbors, you only search
that bin, right? So that actually avoids scanning
everything. Yeah.
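[Editor's note: a sketch of the table-building idea just described: concatenate a few b-bit hash values per item into a bucket key, so a query only scans its own bucket instead of the whole collection. The function names and parameters are illustrative.]

```python
from collections import defaultdict

def build_table(codes, b, m):
    """codes[i] is the list of minwise hash values for item i; the lowest b bits
    of the first m values are concatenated into one key.  With b = 2 and m = 2
    this gives the 16 buckets (0000 to 1111) mentioned above."""
    table = defaultdict(list)
    for item_id, c in enumerate(codes):
        key = 0
        for z in c[:m]:
            key = (key << b) | (z & ((1 << b) - 1))
        table[key].append(item_id)
    return table

# At query time, compute the same key for the query and compare it only against
# the items in table[key] (optionally repeating with several independent tables).
```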
So it works better than sign random projections, which is another method.
Yeah.
>>: So another question I have is, when you compare the computational
efficiency of this with respect to the random projections earlier on, I just wonder
whether -- so one problem with random projections is that you still have to
multiply this huge matrix, despite the fact that you [inaudible] zeros.
>> Ping Li: Yeah.
>>: [inaudible] to reduce the permutation once enough [inaudible] matrix is really
sparse, just do the permutation for nonzero.
>> Ping Li: Yeah. First of all, you don't do a multiplication
anymore, because the matrix is sparse -- with S as big as 10,000 -- and it's just plus one, minus
one, and zero most of the time.
>>: Oh, I see. So you [inaudible].
>> Ping Li: Actually -- first of all, there's no multiplication, right? And
mostly you just divide the entries into two groups and then subtract to take the
difference.
>>: [inaudible].
>> Ping Li: Yeah. That's an issue. So I --
>>: [inaudible].
>> Ping Li: And that could be, yeah. Because the index space is actually D
times K. That's still, yeah. So that --
>>: So that is probably overcome by the hashing.
>> Ping Li: Yeah.
>>: Okay.
>> Ping Li: Yeah. I'm very impressed you caught that. That's actually
hidden there in sparse random projections, yeah, it's hidden there. You still have
to tell which entries are zero. So okay. Okay.
[applause]