>> Ping Li: It's very nice to be back here and to see so many friends again. Thanks for coming right before the Thanksgiving shopping tomorrow. I'm in the Department of Statistical Science, which is under Computing and Information Science. The faculty of Computing and Information Science is larger than CS, so for a long time I thought I was actually in CS. At Cornell the Department of Statistical Science has a very unique position: the people are all from different departments, like OR, math, and CIS, which is me. Okay. So as a statistician, this is how I understand how you use statistics. We need a model, we need variables, we need observations. And this is what many people consider good practice: we want a simple model, and we want lots of features and lots of data. This is because the people who use the models want the model to be simple, and the people who use the model know how to play with the data. So this seems to be a good combination. Conceptually we can consider a dataset as a matrix of size N times D. The number of observations is now really huge: a million is not big anymore, and a billion is not rare. If you consider click-through data, the number of observations can be essentially infinite. It is also common to use high-dimensional data in text, image, and biological applications. I have seen people use a billion dimensions, and I have seen people use a trillion dimensions. In the context of search, 2 to the 64 seems to be the standard. In a sense the dimensionality can be arbitrarily high: you just consider all pairwise, three-way, or higher-order interactions. So here is a simple example of big data learning. Suppose X is your data and Y is your response, and X has 2 to the 64 dimensions. Of course the data is usually sparse -- it has to be sparse. Let's only consider the simplest models, logistic regression and linear SVM. We want to minimize the loss function to find the weight vector, and C is the regularization parameter. Even for such a simple model, once you have lots of data there are plenty of challenges. For example, the data may not fit in memory -- that's the first thing. Data loading, which includes data transmission over the network, may take too long. Training can still be expensive, even for simple linear models. Testing can be relatively expensive -- why is testing expensive? Because once you have fitted the weights W, you have to compute an inner product, and in very high dimensions that takes time, which may be too much for search or high-speed trading, et cetera. And the model itself may be too large. This is an interesting issue: if your data has 2 to the 64 dimensions, the data has to be sparse, otherwise there is no such data. But the model, the W, is usually dense unless you do something about it, and a dense model in 2 to the 64 dimensions cannot really be stored. So that's actually an interesting issue. There is also the issue of near-neighbor search: you want to find similar documents or images among billions or more web pages or images without scanning them all. So this motivates the need for dimensionality reduction, but more importantly, for many cases, data reduction, which means reducing the number of nonzeros.
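To make the model being discussed concrete, here is the objective written out -- a sketch following the common LIBLINEAR-style convention, not necessarily the exact form on the speaker's slide. Here (x_i, y_i), with y_i in {-1, +1}, are the n training examples, w is the weight vector, and C is the regularization parameter mentioned above.

```latex
% l2-regularized linear SVM (hinge loss)
\min_{w}\; \tfrac{1}{2} w^\top w \;+\; C \sum_{i=1}^{n} \max\!\bigl(0,\; 1 - y_i\, w^\top x_i\bigr)

% l2-regularized logistic regression
\min_{w}\; \tfrac{1}{2} w^\top w \;+\; C \sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i\, w^\top x_i}\bigr)
```

Every term involves an inner product w^T x_i, which is why the storage and computation costs above scale with the number of nonzeros in the data, and why a dense w in 2 to the 64 dimensions is unmanageable.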
This is because for modern learning algorithms the costs -- storage cost, transmission cost, computation cost -- are mainly determined by the number of nonzeros, not so much by the dimensionality, unless the dimensionality is extremely high; then that also becomes a problem. PCA for such large data is usually infeasible, at least not in real time, and updating PCA is non-trivial. If you need to index the data, PCA is usually not a good indexing scheme. And even if you can do it, PCA usually does not give you good performance for very high-dimensional sparse data; that's my recent experience. Okay. So how do we deal with such huge data? Suppose the original data matrix A is too big and very high-dimensional. We can multiply it by a matrix R and get another, skinnier matrix B. B has k columns, and k is small. This is called random projections. Why is it called random projection? Because the entries of R are sampled at random, for example from normal(0,1). If R has i.i.d. normal(0,1) entries, then if you look at B times B transpose, which is A times R times R transpose times A transpose, and take the expectation, it becomes A times A transpose, because R is normal. So this is cool, right? In expectation you get the right answer. Of course, this is only in expectation, and the error could be very large, so we have to analyze the variance. But in expectation we get an unbiased estimate of the inner product. >>: So what kind of applications do you find this kind of preservation of the distance useful for, or otherwise? >> Ping Li: Yeah, I will continue -- I will keep that in mind. So now suppose you want to build a linear model for A. Well, we cannot do it on A, so let's build a model on B instead. This is analogous to doing PCA and fitting the model on the PCA output, except now you do something much cheaper than PCA. B is also an inner product space, so all the algorithms still work if you build a model on B directly. That means you can actually discard A once you have B. Questions? Okay. Now, one concern people often have is that this looks nice but is kind of cheating in a way, because computing A times R could be very expensive -- it's a huge matrix multiplication. To address that concern, in 2006 we wrote a paper called Very Sparse Random Projections. Basically, instead of sampling from a normal distribution, you can sample from a very sparse distribution. If the sparsity parameter S is 100, that means 99 percent of the entries are zero on average; if S is 10,000, then 99.99 percent are zero. And the magic is that this also works. So now let me show how well it works. This is the webspam 3-gram data: 350,000 text samples, 16 million dimensions, about 4,000 nonzeros on average per observation. So it's 16 million dimensions but only 4,000 nonzeros -- very sparse. But 4,000 is not a small number; it's actually fairly typical. I have heard that in the main applications people usually see 1,000 to 10,000 nonzeros, but the dimensionality can be really large. The task is binary classification, spam versus non-spam. So let's apply a linear SVM. The red curves are the results for linear SVM on the original data. This is accuracy; higher is better.
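Before the classification results, here is a minimal NumPy sketch of the projection step just described, including the very sparse variant. It is an illustration under my own naming, using the usual plus-or-minus sqrt(s) construction for the sparse entries; it is not the authors' code.

```python
import numpy as np

def sparse_projection_matrix(D, k, s=100, seed=0):
    """Very sparse projection matrix: entries are +sqrt(s) or -sqrt(s) with
    probability 1/(2s) each, and 0 otherwise (so about 1/s of entries are nonzero)."""
    rng = np.random.default_rng(seed)
    u = rng.random((D, k))
    R = np.zeros((D, k))
    R[u < 0.5 / s] = np.sqrt(s)
    R[u > 1.0 - 0.5 / s] = -np.sqrt(s)
    return R

def project(A, R):
    """B = A R / sqrt(k); each row of B is a k-dimensional summary of a row of A."""
    return A @ R / np.sqrt(R.shape[1])

# toy example: two sparse binary rows in D = 10,000 dimensions
rng = np.random.default_rng(1)
D, k = 10_000, 1_000
A = (rng.random((2, D)) < 0.05).astype(float)
B = project(A, sparse_projection_matrix(D, k, s=100))
print("true inner product:", A[0] @ A[1])
print("estimate from B   :", round(B[0] @ B[1], 1))   # unbiased, but noisy
```

With s = 100 only about one percent of R is nonzero, yet B[0] @ B[1] is still an unbiased, if noisy, estimate of the original inner product -- which is the point being made above.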
And remember SVM has a parameter C, the regularization parameter. The red curves are the original results on A. Now we apply random projections, and the other curves are the results on B. We need to determine k, the number of projections. This is interesting, because even with 4,000 projections you still don't get a good result -- if you really care about accuracy, it's still not good enough. In general you need about 10,000 projections; I'll explain why. But the original data has only about 4,000 nonzeros, so random projection reduces the dimensionality but does not really help with data reduction in this case. Another interesting thing is that the projection matrix can be very sparse: with S equal to 1,000 you still get almost the same results, and if you look at S equal to 10,000, you still see the same thing. So the two interesting things with random projections are that you need a large number of projections, but the projection matrix can be sparse. In the summer I gave a talk at the Stanford MMDS (Massive Data) Workshop with a lot more results for classification, clustering, and regression with very sparse random projections; if you just search for MMDS you will find it. I like very sparse random projections a lot, because when I was looking for a job in 2007 I think it helped me get the job. But now five years have passed, so we want to improve, right? We want to do something better and more practical. So here is what we are going to criticize about random projections and their variants: they are inaccurate. You need 10,000 projections, especially on binary data. Random projection doesn't really care whether the data is binary or not, or even whether it's sparse or not, but in practice, if you really have binary sparse data, you want to do something else. So let's understand why it's not accurate. Look at the first two rows of A, call them u1 and u2, and the first two rows of B, v1 and v2. u1 and u2 are in very high dimensions, and v1 and v2 are in k dimensions, where k is small -- small like 10,000. In expectation the inner product is preserved; I use a to denote the inner product throughout. However, if you look at the variance of the inner product estimate, it is essentially (m1 times m2 plus a squared) over k, so it is dominated by the marginal l2 norms m1 and m2. That's interesting, because a squared is always less than m1 times m2 by [inaudible] inequality. So the variance of random projections is always dominated by something nobody really cares about: you care about a, but the variance is dominated by something else. What's really worse is that, for a given dataset, if you look at the pairwise inner products, most of the pairs are more or less orthogonal. Of course a fraction of the pairs are very similar, but most are more or less orthogonal. So you are in a very bad situation: a is roughly zero, but the variance is still very large. How do you reduce the variance? You increase k. If you make k like 10,000, then you take the square root of the variance to get the standard deviation, and you get roughly one percent error. That's how you get accurate results, and that's why 10,000 is the right ballpark. Yeah. Question? >>: [inaudible].
>> Ping Li: It's all relative, so normalizing doesn't really help. It helps in some ways if the marginals are very uneven -- if m1 and m2 are very different, normalization helps in that respect. But because everything is relative, normalization in general doesn't help. >>: [inaudible]. >> Ping Li: Yeah, each row. Every row, and its norm. If you want to verify this variance formula, think about the case where u1 and u2 are identical: then you actually get a chi-square, and in statistics you know the variance of a chi-square is 2 times something squared. That's how you can verify the formula quickly. Okay. >>: All right. So I'm going to say [inaudible] because that always seems to be the answer. Is that the answer? >> Ping Li: Yeah. >>: Oh, okay. [laughter]. >> Ping Li: Yeah, because I cannot think of a reason why not, because you can always say [inaudible] [laughter]. That's exactly right. Now, actually in this case what's important is the nonzeros, so focus on the nonzeros. That's very clever. So what we do is focus on the important things, the nonzeros. For example, if the data is binary, you only care about the 1s. So we focus on the 1s, and we do b-bit minwise hashing instead of random projections. Surprisingly, it's very simple and surprisingly much more accurate than random projections for inner product estimation. Since many of you are from search, you know minwise hashing is a standard algorithm in the search industry, and b-bit minwise hashing requires much smaller space. Random projection is only applicable to pairwise similarities, but with minwise hashing or b-bit minwise hashing you can do 3-way similarities. People from databases want to go beyond pairwise, like [inaudible] work -- actually we can do that. We also developed methods for large-scale linear learning, and of course you can also do kernel learning. And you can use this for near neighbor search, because the algorithm directly provides an indexing scheme. My experience is that I'm never going to be able to cover all of that, so I'm going to leave the slides. The major drawback of minwise hashing and b-bit minwise hashing -- if I had given this talk before the summer, I would have said it's the very expensive preprocessing. Of course, when people said it's expensive, at that time I would say, well, it's just preprocessing. But now the problem is solved. >>: But this only reduces the dimensionality of the data, not the data [inaudible]. >> Ping Li: Also the data size. >>: [inaudible] as well? >> Ping Li: Yeah, I will cover that. So the question is, why do we care about binary data? Well, for text and many other datasets, binary representations seem to be very common. And for very high-dimensional data -- like the webspam dataset I just showed you -- if you binarize it, it doesn't really change the accuracy. That seems to be true for many very high-dimensional datasets. And if you consider pairwise or higher-order interactions, you may start with a dataset that is reasonably sparse, but once you do pairwise it becomes sparser and sparser, because zero multiplied by anything is zero.
So this is how very high-dimensional data is usually generated -- through interactions -- and that's why very high-dimensional data is usually very sparse. And once the data is sparse and binary -- in statistics we think in terms of the data matrix -- for binary sparse data what we really need is to store the nonzeros. For example, S1 is a binary vector, but we only need to store 1, 4, 5, 8, the locations of the nonzeros, which is of course the inverted index. So here is the classic example of how you get massive binary high-dimensional data: how do you represent text? Well, there is the bag-of-words model: a sentence like "today is a nice day" can be represented as a set of four elements. Of course we know that doesn't work too well, because the order does not matter, which means it may not be a very meaningful sentence. To overcome that, the next trick is to use contiguous words: every two contiguous words, every three contiguous words, and you can continue this business up to maybe five -- I have heard people use even 11. This generates very high-dimensional data. So if you build a dictionary, this is document 1: common words are usually there, but otherwise it's mostly just zeros. Suppose there are 10 to the 5 common English words. If you consider pairwise combinations, it becomes 10 to the 5 squared, et cetera. The vector becomes longer and longer, and sparser and sparser. The interesting thing is that if W is equal to 5, you get 2 to the 83 dimensions. Of course, unless you have lots of documents, most of the columns will be zero, so you can remove those columns. Usually people use 2 to the 64 as a convenient upper limit. Questions? >>: No, I just [inaudible]. >> Ping Li: Okay. So that's where this 2 to the 64 story comes from. >>: [inaudible] different name [inaudible] normally it's tri-gram, n-gram. >> Ping Li: I think John once told me that his understanding is that when it says n-grams it means n-grams, but when it says shingles it means you actually apply hashing right after. I think that's a very clever explanation -- basically because 2 to the 64 is not something you want to use directly anyway, but when you apply hashing it's convenient. So now let's introduce notation. Each binary vector is a set. f1 and f2 are the sizes of the two sets, a is the size of the intersection, and a good similarity measure is the resemblance: the size of the intersection over the size of the union. For binary data I think it's more natural than the correlation coefficient. So that's the notation. Now, here is the trick the search industry has been using for many years. Suppose we can do a permutation of the space. We apply the same permutation to the two sets, and we only store the minimum. The chance that the two minimums are equal is exactly the resemblance. Let me give you an example. Suppose the space size is 5, for these two sets the similarity is 20 percent, and this is one realization of the permutation: 0 becomes 3, 1 becomes 2, 4 becomes 1, et cetera. Now let's do the permutation of S1: 0 becomes 3, and 3 becomes 4, so the minimum is 1. You store the minimum and discard the rest; you only store the minimum.
Then you apply the permutation to the second set, and the minimum is 0. This time they're not equal -- they're not equal because you only have a 1 over 5 chance of being equal. So how do you make sure they are equal enough times? You just repeat this K times. If you repeat K times, you can estimate the resemblance from binomial probabilities, and the variance is just 1 over K times R times 1 minus R, since R is just a binomial probability. In terms of the data matrix -- machine learning and statistics people like data matrices -- what minwise hashing is basically doing is permuting the columns. This is the original matrix, and after the permutation this is your new data matrix. Then you look from left to right and stop at the first location where there is a nonzero, and you store that location -- so 2, 0, and 0 -- and discard the rest. That's what minwise hashing is doing, conceptually, in terms of the matrix. Questions? >>: Is this the same size for all the rows? >> Ping Li: Yes, the same. The pi basically means you permute the columns, so it has to be the same for all rows. Very good. Usually when I show this slide people ask, what about the maximum? Can't we just store the maximum? The answer is yes, of course: the maximum contains essentially the same amount of information as the minimum. But we choose the minimum. Okay, question? Okay. So this is what minwise hashing is doing, and we have to repeat it K times. Now, remember with random projections K had to be about 10,000. Here K only needs to be about 200, or 500 at most. So it's better. The immediate question is how do you actually store these minimums. The minimum location can actually fall anywhere from the beginning to almost the end, because the data is very sparse -- it can really be anywhere in the matrix. That means the minimum value is often a very big number for a sparse matrix. And if the dimensionality is 2 to the 64, to make sure you get the right answer you have to use 64 bits to store the minimum values, because the minimum can occur towards the end. Questions? Okay. So 64 bits is a big number: each hashed value takes 64 bits, then you multiply by K, the number of permutations, then multiply by the number of documents or images. So the storage is big, and the computation is also very expensive. And later we will show that this also corresponds to very high-dimensional data. So how do we solve all these problems? The solution is simple: suppose instead of storing each value using 64 bits, we store only 1 or 2 bits -- then everything becomes easy. Now, the question is, will that still work? Here is the intuition. Suppose you do the hashing and you have to store the result. Instead of storing the hash value using 64 bits, let's store only, for example, 1 bit or 2 bits. The intuition is that if the sets are identical, the hashed values will be identical, and then any bits will be identical, because you applied the same operation. The question is whether, if the sets are similar, the lowest bits will also be similar. Well, that needs proof.
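Before going on to the b-bit analysis, here is a small NumPy sketch of plain K-permutation minwise hashing as just described: store the minimum permuted location for each of K permutations, and estimate the resemblance by the collision rate. It is toy code under my own naming, not from the talk.

```python
import numpy as np

def minwise_hash(nonzero_indices, permutations):
    """K-permutation minwise hashing: for each permutation of the columns,
    keep only the smallest permuted location of the nonzeros."""
    return np.array([perm[nonzero_indices].min() for perm in permutations])

def estimate_resemblance(h1, h2):
    """Fraction of permutations on which the two stored minimums collide."""
    return np.mean(h1 == h2)

# toy example: two overlapping sparse binary vectors in D = 10,000 dimensions
rng = np.random.default_rng(0)
D, K = 10_000, 500
S1 = np.sort(rng.choice(D, size=300, replace=False))
S2 = np.unique(np.concatenate([S1[:200], rng.choice(D, size=100, replace=False)]))
perms = [rng.permutation(D) for _ in range(K)]
h1, h2 = minwise_hash(S1, perms), minwise_hash(S2, perms)
true_R = len(np.intersect1d(S1, S2)) / len(np.union1d(S1, S2))
print("true resemblance :", round(true_R, 3))
print("estimate (K=%d) :" % K, round(estimate_resemblance(h1, h2), 3))
```

The estimator is binomial, as noted above, so its variance is R(1 - R)/K.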
And in practice, since we often care about pairs of sets with high similarity, say larger than 0.5, hopefully we do not need that many bits. Now we need more notation. Originally we use z1 and z2 to denote the two minimum values, and now suppose you only keep the lowest b bits. For example, if z1, the minimum hash value, is 7, then taking 1 bit gives 1 and taking 2 bits gives 3. Basically, if you take only the lowest b bits of the hash value -- say two bits -- the stored number just cycles through 0, 1, 2, 3. That's what we mean by taking the bits. Now we have to solve a probability problem. Originally the collision probability -- the probability that the two minimum values are equal -- is R. The new collision probability is C1 plus (1 minus C2) times R. How do we derive this? It's just a probability exercise. And if you care what C1 and C2 are, there is a surprisingly elegant formula: they are determined by the relative sizes of the sets, r1 = f1/D and r2 = f2/D, where f1 is the size of the set and D is the size of the space. The formula is derived assuming D is very large, but it turns out to be remarkably accurate. Let me try to convince you of that. How do we compute the exact probability? Suppose you only care about one bit: you take the two minimum values and check whether they are even or odd, and you want the probability that they match -- both even or both odd. So you have to keep track of when both are even, both are odd, or both are equal, and add the terms together to get the answer. The exact formula is very complicated, but at least you can compute it exhaustively if D is small enough. If you do that, you can verify the approximate formula: take the approximate formula we derived and subtract the exact formula, and the difference is very, very small, even for small D. So it's a very accurate formula -- that's all this is trying to show. And the Communications of the ACM seemed to be very interested in this work; I guess at that time they expected there would be a lot of new applications, and I think they had good vision. So I'm going to show the applications of b-bit minwise hashing. But first, we haven't fully answered the question: if you use only 1 bit, or b bits, how much do you lose in terms of accuracy? With a smaller number of bits you need less space, but you get a larger variance, because you only use part of the information. But if you look at space times variance, the relative improvement is about 64 R over (1 plus R). What does that mean? It basically means that if originally you used 64 bits with K permutations, and now you use one bit with 3K permutations, you get the same answer in terms of variance, if the similarity is 0.5 -- and you still store far less. That's what I mean. So let's see how it really works.
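Before the experimental results, a rough illustration of the b-bit estimator. The exact C1, C2 constants are in the paper; this sketch uses only the sparse-data limit (r1, r2 going to 0), where the collision probability of the lowest b bits is approximately 1/2^b + (1 - 1/2^b) R, so R can be recovered from the observed collision rate. The simulation of the stored minimums is deliberately simplified, and the names are mine.

```python
import numpy as np

def lowest_b_bits(h, b):
    """Keep only the lowest b bits of each stored minimum."""
    return h & ((1 << b) - 1)

def estimate_resemblance_bbit(h1, h2, b):
    """b-bit estimator in the sparse-data limit (r1, r2 -> 0):
    Pr[lowest b bits collide] ~ 1/2^b + (1 - 1/2^b) R, so invert that.
    The exact C1, C2 corrections from the paper are omitted here."""
    p_hat = np.mean(lowest_b_bits(h1, b) == lowest_b_bits(h2, b))
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)

# simplified simulation of K stored minimums for a pair with resemblance R = 0.6:
# with probability R the two minimums coincide, otherwise they are independent.
rng = np.random.default_rng(0)
K, R = 2000, 0.6
same = rng.random(K) < R
h1 = rng.integers(0, 2**32, size=K)
h2 = np.where(same, h1, rng.integers(0, 2**32, size=K))
for b in (1, 2, 8):
    print("b =", b, "-> estimated R:", round(estimate_resemblance_bbit(h1, h2, b), 3))
```

Smaller b gives a noisier estimate for the same K, which is exactly the space-versus-variance trade-off quantified above.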
So Christian and I did the experiments with Microsoft data. The task is to retrieve the document pairs whose similarity is larger than some threshold R0. Let's look at the most similar pairs first, say R0 = 0.8: we want to find documents which overlap 80 percent -- if you want to find who copied your paper, say. Basically, using one bit you get almost the same answer as using all the bits, if the similarity level you care about is 80 percent. This other result is for 50 percent: if you only want to detect 50 percent similarity, then with one bit you don't get the same answer as using all the bits, but if you increase the number of permutations by a small factor -- for example three, so 300 instead of 100 -- you get the same answer. That's exactly what the theory predicts. Okay. >>: Just curious. Where was the original, like, random projection technique on this graph? >> Ping Li: I don't have random projection plots for this one, but I have random projection comparisons later. >>: Oh, okay. >> Ping Li: Yeah. So this is at 80 percent. Then people say, well, what about 90 percent? Better, right? Well, 99 percent is even better. So maybe we can do better than one bit. That's how we got the half-bit idea. Half a bit is basically cheating, because there is no such thing as half a bit. What we did is: you do two permutations, you take one bit from each, and then you XOR them. You still get one bit, but I needed to give it a new name, so I call it half a bit -- people get curious about what half a bit means, and then, oh, that's what you mean. You can still do the probability calculation and compute the variance, and the encouraging result is that when R goes to 1, the variance is reduced by a factor of 2 compared with storing the two bits. This is very encouraging. Then I could continue this business, right? Quarter bits, eighth bits, and so on. Of course that is true in the limit, because if two sets are identical, no matter what you do the results will be identical. But the trouble is that it hurts if the similarity is lower than a certain threshold. It helps for high similarity, but it hurts otherwise, so you probably don't want to push this too far. Questions? Okay. The next thing I'd like to show is the 3-way case. Suppose you care about 3-way similarity: the relative size of the 3-way intersection. You can still compute the collision probabilities; they are just very, very complicated, and I don't expect you to read them carefully. But the take-home message is that if you care about 3-way similarities, you can still do b-bit minwise hashing, and compared to using 64 bits you still get substantial improvements. One interesting thing is that you have to use at least 2 bits to estimate 3-way similarities, which is an interesting phenomenon. Yeah. >>: Question? >> Ping Li: Uh-huh. >>: So the variance is used to bound this on the similarity metric? [inaudible] all represent similarities? >> Ping Li: Uh-huh.
>>: Whereas -- so my question is, say for one query I want all of the other items in my dataset that are within 80 percent similarity -- 80 percent cosine similarity or Jaccard similarity. I could get a lot of noise from the low-similarity items that bleeds into my retrieved set. Even though I will get the items that are truly very similar, I could also get noisy ones that come from items that are not actually very similar but become very similar after projection and compression. Does that [inaudible]. >> Ping Li: Well, the variance is proportional to, for example, R times 1 minus R. So if the similarity is low, the variance stays low. That means if you start with 0.1, this term is going to be smaller than 0.1 anyway, and you take a square root, so the standard deviation is a small number. The variance is also proportional to the original value: if R is 0.1, the standard deviation is still going to be small. >>: Isn't that just one of the [inaudible] you're dividing by K. Okay. >> Ping Li: Yeah, so it's still going to be a small number. What you describe can happen at certain boundaries -- it will mess things up a little bit there. But for things with really low similarity it's not going to be a factor. On the other hand, because you really only care about the highly similar things, as long as the estimates for the low-similarity pairs don't become high, you'll be fine, right? >>: [inaudible]. >> Ping Li: Yeah, and they are not going to become very high, because of this variance. Of course, if the similarity is not 0.1 but 0.3, then at a certain point it's going to matter. Okay. So the next question I'd like to cover is how we use this for learning. We can use this for retrieval, but how do we use it to build models? With random projections it's easy, right? You just discard A after the random projection and use B -- you build the model on B. But with b-bit minwise hashing it was not clear at the beginning how to do that. Because with minwise hashing you get numbers: for permutations 1, 2, 3, up to K, every permutation gives you a number -- 375, 1049, et cetera, for one set, and another list of numbers for another set. These are the hash values you get each time. But how do we build a linear model from this data? That was actually not clear at the beginning. It turns out it's very easy -- it's just a trick. But first of all, we want to make sure the resemblance is something good to use. It turns out the resemblance is very nice, because if you have n data points and you build the resemblance matrix, we have a simple proof that it is positive definite. So resemblance is a good kernel. Basically, if you want to build a linear model using a minwise hashing type of technique, since we are approximating the resemblance, you first want to make sure the resemblance itself is a good representation -- and it is: the resemblance matrix is positive definite.
And the proof comes from the fact that the minwise hashing matrix is also positive definite, and the b-bit minwise hashing matrix is also positive definite. The proofs are nearly trivial -- I don't see why we get trivial proofs for interesting results. So this is minwise hashed data, and every time we use it we want to compute the indicator function of whether two hash values are equal or not, right? The indicator function does not look like an inner product at first. But if you look carefully, the indicator function is indeed an inner product, in a non-trivial way: you can expand each value as a vector with exactly one 1 and all the rest 0, and then it becomes an inner product. Let me give you a trivial example. Suppose the data only ranges from 0 to 4, so D is 5. Then you can expand each value: 0 becomes 0,0,0,0,1 and 4 becomes 1,0,0,0,0. Now take the indicator function that checks whether, say, a 2 and a 3 are equal: they are not equal, and correspondingly their inner product is zero. So there is a one-to-one correspondence between the inner product and the indicator function. Once you see this it's kind of trivial, but it's actually very important, because it allows us to use linear methods on the expanded data. So how do you expand this? Well, 375: you get a vector indexed from 0 to 2 to the 64 minus 1, the entry at position 375 is 1, and all the rest are zeros. Every hashed number becomes a vector like this, and you just concatenate all the vectors. And people say you're crazy, because that's 2 to the 64 dimensions times K. Conceptually you can do it, but you cannot really do it in practice. But now we don't use 64 bits, we use b bits, and that naturally solves the problem. Okay, does everybody get the trick? It's an easy trick, and I want to make sure everybody understands it. Okay, good. Now, exactly how do we do it with b-bit minwise hashing? Let me show you the procedure. In this example K is 3, so three permutations. For each set I get three numbers, one per permutation. Then I look at their binary representations -- that's too many bits, so I use only two bits, and the numbers I want to store become 01, 00, 11. So I did three permutations, got three numbers for each set, and from each number I store only the lowest b bits, two bits in this example, so the numbers become 1, 0, and 3. Now, in order to use the trick, we have to expand each of them into 2 to the b dimensions. In this example b is not 64, it's only 2, so I get 4 dimensions: 1 becomes 0010, 0 becomes 0001, and 3 becomes 1000. Then we just concatenate the three short vectors to make a vector of length 12. That will be the new input vector to the support vector machine, and I do the same thing for all the other sets. Well, it doesn't seem like I really did anything after a couple of years of work -- all I did was play this trick. It's actually very simple.
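Here is a tiny sketch of that expansion trick, using the K = 3, b = 2 example above. The helper name and the bit-to-position convention are mine; a real implementation would build a sparse matrix and hand it to a linear solver such as LIBLINEAR, but the idea is the same.

```python
import numpy as np

def expand_bbit_features(bbit_values, b):
    """Turn the K b-bit hashed values of one data point into the concatenated
    one-hot vector described above (length K * 2**b, with exactly K ones).
    The mapping of a value to a position within its block is arbitrary."""
    K, width = len(bbit_values), 1 << b
    x = np.zeros(K * width)
    for j, v in enumerate(bbit_values):
        x[j * width + int(v)] = 1.0          # one 1 per permutation
    return x

# the example from the talk: K = 3 permutations, b = 2 bits, values 1, 0, 3
x1 = expand_bbit_features([1, 0, 3], b=2)
x2 = expand_bbit_features([1, 2, 3], b=2)
print(x1)                              # 12-dimensional, three ones
print("matches / K:", x1 @ x2 / 3)     # inner product counts bit-collisions
```

The inner product of two expanded vectors counts how many of the K permutations collide on their lowest b bits, which is exactly the sum of indicator functions the linear model needs.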
So you start with binary data with many nonzeros, but you end up with only three nonzeros, three ones, and instead of 2 to the 64 dimensions you get only 12 dimensions, for example. Every set gives you a 12-dimensional vector, and you can use that directly in a linear classifier. This works because the inner product of these vectors is exactly the value of the indicator functions. I haven't really shown how well it works yet, so now let's do experiments. The same webspam dataset again, which is 16 million dimensions. We also processed another dataset to make it a billion dimensions just for testing, with about 10,000 nonzeros. And then what? Well, then we just do the hashing -- >>: [inaudible]. >> Ping Li: Okay. >>: What does 1/30 3-way features mean? >> Ping Li: 1/30 -- >>: So you took Rcv1, you formed unigrams, bigrams, and then I see 1/30 3-way features. >> Ping Li: Well, it's pairwise, not bigrams. Pairwise. >>: [inaudible] more than bigrams. >> Ping Li: Yeah, pairwise. I think bigrams is -- >>: You discarded the sequence information too. >> Ping Li: That's because we don't have that information. >>: Oh, okay. >> Ping Li: The dataset -- we don't have that information yet. >>: [inaudible]. >> Ping Li: Sorry? We -- >>: It's any two words anywhere in the document? >> Ping Li: Yeah. >>: Okay. >> Ping Li: So this is actually my favorite example. >>: What does 1/30 and 3-way mean? >> Ping Li: That means I only kept a fraction of the 3-way features, because the full set is too big and I wanted to keep the number within a billion, so you have to subsample -- >>: So you randomly -- >> Ping Li: Yeah. >>: -- kept 1 in 30 of the 3-way features. >> Ping Li: Yeah. >>: Okay. That's fine. >> Ping Li: Okay. Actually, I still want to show my favorite example. Suppose you want to do digit recognition. If you only use the single pixels, it doesn't work that well. But you can do pairwise features, which is good. You can also do shingles, meaning you use all the two-by-two grids of pixels -- that's like bigrams. But bigrams are local expansions; if you do pairwise, every pair of pixels gets used -- maybe this pixel together with that pixel is a good representation. So I think pairwise features and n-grams can be used together. >>: [inaudible]. >> Ping Li: They're not binary. But once you do the pairwise expansion, binary or not doesn't really matter. That's another example of why binary is actually fine for high-dimensional data. The webspam data is not binary either, but you get almost the same results if you binarize it. >>: [inaudible]. >> Ping Li: It's available; you can download it from the LIBLINEAR [inaudible] site. >>: [inaudible]. >> Ping Li: Yeah. Okay, good. So let's do the experiments. Remember the experiments with random projections: we needed K equal to about 10,000 random projections. Now this is b-bit minwise hashing, which has two parameters: the number of permutations and the number of bits. For this example, with 200 permutations and about 8 bits you get the same answer as using the original data.
With 200 permutations and 8 bits -- I usually recommend 8 bits, because it doesn't really make much sense to use 6 or 5 bits -- that's about 70 megabytes. So it's very small: you get the same answer with 70 megabytes of hashed data as with the original data. Questions on the experiments? >>: [inaudible] use 8 bits and 200 hashes or permutations. And then you look at the representation that you induce to feed to the linear classifier. Are about half the features on and half off? >> Ping Li: You're talking about the coefficients being zero or close to zero? >>: Right. Right. >> Ping Li: I didn't look at that carefully. Here I used L2 regularization. We did try L1, and it seemed to work the same way, but you get much sparser results. We didn't write a paper on that, but we did do the experiments. So I don't know exactly how to answer your question, but at least a significant fraction of the weights can be zero. >>: Yeah. Just like, you know, when I think about bloom filters -- when you get the density right, about half the bits are on. So if you got the induced feature space right, maybe each feature is roughly 50 percent likely to be present. I don't know. >> Ping Li: Yeah, that might be interesting future research. I think the good thing hashing allows you to do is more n-grams, and hopefully with more n-grams you get better results -- at more computation, maybe. But exactly how that works -- there is more interesting research to be done there. If Microsoft is interested in funding that research... Okay. So we show this for a different case, 50 to 500 permutations. The training behaves similarly, except that with 500 permutations even one bit does pretty well, which is expected. And the training time is reduced from about 100 to 800 seconds down to about 3 to 7 seconds. This assumes the original data, which is roughly 24 gigabytes, can be put in memory -- I think 48 gigabytes is probably a standard memory size now. Before I came to give this talk, the system administrator in my department told me he had set up the machine I purchased, which has 250 gigabytes of memory. So it's pretty good. So 24 gigabytes you can put in memory. But even when the data fits in memory, training still takes about 100 to 500 seconds, whereas with the hashing we only need about 3 to 7. And testing is reduced from 20 seconds to about 2 seconds. >>: Can you [inaudible] -- you mentioned that the preprocessing was expensive? >> Ping Li: Yes, preprocessing is expensive -- I'm glad you asked. Here I assume the data has already been processed, so the reported time does not include the preprocessing time. >>: [inaudible] preprocessing? >> Ping Li: Yes, for preprocessing you can [inaudible] -- you can do GPUs and all those things. >>: What do you account as preprocessing? >> Ping Li: Sorry? >>: What are the operations that you do? >> Ping Li: The numbers here do not include preprocessing. >>: Right. I'm wondering what -- I was just -- >> Ping Li: Wondering what the numbers look like? >>: I'm also wondering what kind of preprocessing it is.
>> Ping Li: It's just permutations, right? Permutations. I will get there. Okay. So this is the training time, though I agree it understates things a little because I did not include the preprocessing time. This was all linear SVM, but if you do logistic regression the same story essentially holds: the training time is reduced from about a thousand seconds to about 10 seconds. So if you change the SVM to logistic regression you get a similar story, and I'm not going to repeat it. And if you care about how this compares with random projections: this is the random projection result and this is the minwise hashing result in terms of K -- for random projections K means the number of projections, and for minwise hashing it means the number of permutations. Our result is right here, and random projections look like this: you basically need 10 to the 4 random projections to get results similar to minwise hashing with 200 permutations. Yeah. >>: So with this kind of data compression does it make sense to do nonlinear SVM? >> Ping Li: Probably not. Let me give you an example since I have it here. On this dataset, which is 760 dimensions, linear SVM gets 85 percent accuracy, but kernel SVM gets 98.5 percent. And if you do linear plus pairwise interactions, it's almost 98 percent. So there is actually an interesting question: do we really need a nonlinear classifier? Pairwise is basically like a second-order approximation of the Gaussian kernel, and even linear SVM with pairwise features is still much faster than kernels. So that's actually interesting. Sorry? >>: Question. What is the difference between the random projection and the minwise hashing? >> Ping Li: Random projection does a matrix multiplication; random permutation only permutes the columns. And as we showed, the variance of random projections is dominated by the marginals, but for minwise hashing the variance is actually proportional to the similarity. >>: [inaudible] preprocessing? >> Ping Li: I didn't count any preprocessing in the plots -- just the computational cost. Preprocessing is preprocessing: it can be important, but often it is not. >>: How expensive is it? >> Ping Li: It can be very expensive, but it's a one-time cost, and it can be done offline -- although it can still be expensive. In some cases it matters, because for testing, when a new image comes in, you have to preprocess it again. You can use GPUs -- we actually wrote a paper -- >>: So what's involved in the preprocessing I think [inaudible]. >> Ping Li: Okay. The preprocessing involves this -- I'm going to skip the details. Basically you do the permutation of the columns. Conceptually you always work with the nonzeros: you scan all the nonzeros, compute the hash values, and only store the minimums. >>: And I guess part of the preprocessing has to do with actually coming up with the index for the columns, right? >> Ping Li: Yeah.
>>: The original data is words or pixels or pairs of pixels, so you have to map it to this linear index -- which column does it fall under? >> Ping Li: Yes. Usually the hashing time can be combined with data collection. When a document comes in, you linearly scan all the words -- you just parse the document -- and you can use a big hash table with 2 to the 64 cells so that there are almost no collisions. Then it can be done in one shot, one pass over the documents. >>: So [inaudible] over the dimensions or over the data samples? >> Ping Li: It's over the feature dimensions, but for all the data. >>: I see. But for the test data, how do you do that? >> Ping Li: The test data has just one row, right? Yeah. Okay. And there is a lot of parallelization that can be done; almost everything seems to be parallelizable. I'm going to come back to the preprocessing a little more, but let me summarize K-permutation hashing first. We are now in the big data era, with high-dimensional data. We can do random projections, but random projections require lots of projections, on the order of 10,000. We can do minwise hashing, which is a standard procedure in the context of search, and with b-bit minwise hashing we can improve the space by a factor of, for example, 24. That does not sound like a huge number, but it's actually big, because improving something that has already been used in practice for many years is substantial. More interestingly, the improvement in terms of dimensionality is not 24 but more like 2 to the 24, because we reduce the 2 to the 64 dimensions down to 2 to the b dimensions if we consider the data expansion. I will leave the slides on hash table building for sub-linear time near neighbor search; if you're interested, you can look at them. Compared with random projections, this seems to be substantially more accurate. People are always curious about the drawbacks. The first drawback is that this version is only for binary data. The second is that the preprocessing is expensive, because you have to repeat it. Repeating one time is okay -- you always have to touch the data once anyway, that's unavoidable -- but if you have to touch it multiple times, like 500 times, it's expensive. However, each permutation can be done in parallel, for example on GPUs, so speed-wise it's not much of a problem: 500 permutations can be parallelized. In terms of energy consumption, though, that is a real issue: speed is not the problem, but you still burn the electricity, and a search engine has to do this at massive scale. So that's a concern. On the other hand, the practice people have been using for maybe 20 years -- for each permutation we keep only the minimum and discard the rest -- has to be very wasteful. We only look at the minimum; what about the maximum? The amount of information should be equivalent. So this is what motivated us to develop what we call one permutation hashing -- let me tell you why I developed it.
Because I spent a few months last year, around this time, just running all the experiments, and the preprocessing really took too long. I was tired of it, so I started thinking: why do we have to do the permutation 500 times? We don't have to -- let's just do it one time. So this is the trick: instead of doing 500 permutations, you do one permutation and break the space into 500 pieces, and in each piece you only keep the smallest number. Why should this work? It works because after the permutation, if you break the space into K bins, every bin is statistically equivalent. Before, we essentially only used the information in the first bin; but we can use the information in any bin, right? >>: So let me get this straight. Before, when you would get 64 bits from your hash function, you threw 63 away? >> Ping Li: Uh-huh. >>: You should keep all 64 bits if they're all good, right? >> Ping Li: Well, you have to store them, right? What's your -- >>: Well, I guess I'm saying, a function like murmur claims all the bits are good. So if you're going to do 200 hashes, you should really be doing 200 divided by 64 hashes, right? >> Ping Li: I will need to think about that. Here I only compare with standard minwise hashing. Okay, question. >>: [inaudible] zeros before you [inaudible]. >> Ping Li: Yeah, the empty bins. The empty bins are actually the only caveat: you may get empty bins. Before, we effectively only used the first bin and there was no empty-bin problem; but now, if a bin is empty, that might be an issue. >>: [inaudible]. >> Ping Li: Sorry? >>: [inaudible] will be higher -- >> Ping Li: It should be lower. Let me tell you why. Suppose K is the same as D, the number of dimensions: then you get back the original data exactly, right? But if you do the original minwise hashing, you don't get that. >>: Yeah, but that's only [inaudible]. >> Ping Li: Yeah. We have a proof. >>: But [inaudible]. >> Ping Li: So maybe you can guess: how do we estimate the similarity now? Before, we just computed the number of matches divided by K. If there are no empty bins, that's still the answer: count the number of matches and divide by K. With empty bins, the answer is the number of matches divided by K minus the number of empty bins. We have a paper where you can read the proofs, but let me just show you the result: the estimator is the number of matches divided by K minus the number of jointly empty bins, you can prove it is actually unbiased, and its variance is smaller than the variance of the original minwise hashing. A crude way to understand why it's better is to think about sampling with replacement versus without replacement -- this is without replacement, so it should be better. And there is another interesting issue: suppose originally you have 100 nonzeros and you do 500 permutations. You start with 100 nonzeros, but you end up with 500 ones -- that's actually a huge distortion of the data. With one permutation you don't distort the data that much: you only sparsify the data, you don't destroy it. Questions?
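Here is a minimal sketch of the scheme just described -- one permutation, K bins, keep the smallest permuted location in each bin -- together with the matches over (K minus jointly empty bins) estimator. It is toy code under my own naming, assuming for simplicity that K divides D; the real implementation works on the inverted index directly.

```python
import numpy as np

def one_permutation_hash(nonzero_indices, permutation, K):
    """One permutation hashing: permute the columns once, split the space
    into K equal bins, and keep the smallest permuted location in each bin
    (None marks an empty bin)."""
    D = len(permutation)
    bin_size = D // K                      # assumes K divides D, for simplicity
    bins = [None] * K
    for p in np.sort(permutation[nonzero_indices]):
        j = p // bin_size                  # sorted order: first hit per bin is its min
        if bins[j] is None:
            bins[j] = p
    return bins

def estimate_resemblance_oph(bins1, bins2):
    """Estimator from the talk: matches / (K - number of jointly empty bins)."""
    matches = sum(1 for a, b in zip(bins1, bins2) if a is not None and a == b)
    jointly_empty = sum(1 for a, b in zip(bins1, bins2) if a is None and b is None)
    return matches / (len(bins1) - jointly_empty)

# toy example: the same kind of overlapping sparse sets as before
rng = np.random.default_rng(0)
D, K = 10_000, 500
S1 = rng.choice(D, size=300, replace=False)
S2 = np.unique(np.concatenate([S1[:200], rng.choice(D, size=100, replace=False)]))
perm = rng.permutation(D)
b1, b2 = one_permutation_hash(S1, perm, K), one_permutation_hash(S2, perm, K)
print("true R     :", round(len(np.intersect1d(S1, S2)) / len(np.union1d(S1, S2)), 3))
print("estimated R:", round(estimate_resemblance_oph(b1, b2), 3))
```

With 300 nonzeros and 500 bins, most bins are empty, and the correction in the denominator is what keeps the estimate unbiased.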
>>: [inaudible]. >> Ping Li: Both are empty. >>: Oh, okay. >> Ping Li: Yeah, it only counts when both are empty, so that's not really a caveat. And after all, the chance of an empty bin is small: if the number of nonzeros divided by K is 5, the chance of an empty bin is less than one percent. And if you want to use hashing to reduce the data, this is what you expect anyway: you expect K to be much smaller than the number of nonzeros. Of course, if the number of nonzeros is roughly equal to the number of bins, then you get a significant chance of empty bins. The reason we have to consider that case is that, even though the majority of pairs satisfy this condition, there is always some fraction of very sparse vectors. But this formula is a rigorous formula -- if you don't believe it, read the proof. It's actually unbiased, which is a bit surprising, because it's a ratio estimator, and if you look at statistics books, a ratio estimator is normally not unbiased. But this one is strictly unbiased. Okay. Now, how do we deal with the empty bins in the learning setup? Here's the trick. Originally we did three permutations, took the bits, expanded the data, and concatenated the features. Now suppose we do one permutation divided into 4 bins, and one bin is empty. What do we do? Well, last year when we designed the encoding, we encoded 0 as 0001 -- not on purpose -- but that leaves a spare code, because 0000 is never used. So for an empty bin we just put 0000. Another interesting thing is that we have to normalize: once you normalize, you effectively subtract the number of empty bins -- remember the formula. It all works out very nicely, because the normalization roughly corresponds to that operation: when you take the inner product, the square roots disappear, so you get the right denominator. >>: So this actually reduces the permutation cost for preprocessing -- >> Ping Li: You only do one permutation, so it's always low cost. >>: So do you lose accuracy when you do -- >> Ping Li: Oh, it's even better. First of all, the variance is better. And if you look at the SVM results: before, we did K permutations this way, and now we do one permutation and encode the empty bins this way -- and the results are even better. You do one permutation and this is the result, you do K permutations and this is the result: one permutation is actually even better. So the faster method is also more accurate. >>: And so for [inaudible]. >> Ping Li: Yeah. >>: So I wonder whether, to achieve the same variance, you need more space. The preprocessing time is [inaudible], but to achieve the same variance overall, do you need more space? >> Ping Li: You don't need more space, because everything is online -- this is actually in the paper. It's actually not 12 anymore; you can view it as starting from 0 again, so you don't need more space. Because everything is online you can re-index, and re-indexing is easy -- it's just a bit-shifting thing. >>: [inaudible]. >> Ping Li: Yeah.
We considered that [inaudible] in the paper. Yeah. So that's almost done. This is for this data: you actually do better, and you see similar things for SVM and logistic regression. For logistic regression, as K increases, one permutation gives better results, significantly better than K permutations. Because with K permutations, as K increases -- think about it: if some of the vectors only have 100 nonzeros and you do 512 permutations, you artificially add lots of ones. That's why those results are not as good. >>: So I guess there must be some catch, or otherwise you wouldn't have spent all that time talking about having to use GPUs -- >> Ping Li: Well, the K-permutation scheme is all sequential, right? So we did the GPU work last year. >>: So GPU is parallel -- >> Ping Li: The GPU parallelizes the preprocessing. >>: Yeah. K times [inaudible]. >> Ping Li: Yeah. So the GPU solves the speed problem, but it does not really save energy. For me the energy doesn't really matter, but maybe for you it matters. >>: But the result is [inaudible] just do one permutation [inaudible]. >> Ping Li: Yeah, there's no point in doing K permutations anymore. >>: So what's the catch? There's no catch at all? >> Ping Li: Okay, so you're curious. Let me give you another example. The webspam data has about 4,000 nonzeros per vector. Now consider a dataset with only 500 nonzeros on average, which means a significant portion of the vectors have only about 100 nonzeros, and use 4,000 permutations, or one permutation with 4,000 bins. That means most of the bins are empty, so the zero coding really matters, and you can see the difference. With K equal to 256, one permutation already gives much better results than K permutations; and with K equal to 4,000, one permutation gives much, much better results. So K-permutation minwise hashing is actually not a very good idea if the original data is too sparse, because you add lots of ones -- every permutation adds a 1. Of course this is a contrived example; in practice we expect something like 1,000 nonzeros, so it's probably fine. So this is my last slide: you can accomplish the same thing as the standard K permutations which are used in practice, but you can do it with one permutation, and you actually do better. That's the end of the talk. Questions? [applause]. >>: [inaudible] I guess my intuition is, if a typical example in the high-dimensional space has 200 nonzero entries, then maybe you should choose around 400 partitions or something like that. Because, as you said, in your original min hashing each permutation added a 1, so once you start adding more than 200, that's a bad idea, right? So then you do one permutation. And then the question is, can you get a reduced space of still about 200 [inaudible] so that the inner products would be minimally distorted? I'm just guessing that if they're distributed -- well, they won't be distributed randomly [inaudible]. But still I would expect the right K, the cheapest K, to be something on the order of the number of nonzeros. >> Ping Li: I think that's about right.
Because my experience is that, even though the variance looks like it's only proportional to R, with more nonzeros you really do need a larger K. So I think that has to do with it. On the other hand, this R times 1 minus R formula looks good but is actually not sufficient, in the sense that you have to compare the standard deviation with R itself: when R goes to zero, the variance alone is not the right measure, and you actually need a larger K to compensate. So I think the bloom filter intuition probably comes from something related to that. Okay. >>: Yeah. So most of the things I heard here are about reduction of the dimensionality, but earlier you mentioned that you can also [inaudible] to reduce the data problems. So which part -- talk about that [inaudible]. I may have missed that earlier. >> Ping Li: No, you didn't miss it; I skipped it. >>: Oh, I see. >> Ping Li: So basically, with the hashing, the hash values provide a partition of the space. Suppose you use two bits and two permutations. Then you can partition the space according to the hash values: you have a table from 0000 to 1111, and those are all the possible hash values, because you only use two bits and two permutations. So you partition the space into 16 buckets, and every data point can be put into one of those buckets. Then when you search for nearest neighbors, you only search within that bucket, right? So you avoid scanning everything. And it works better than sign random projections, which is another method. >>: So another question I have is, when you compared the computational efficiency of this with random projections earlier on, I just wonder -- one problem with random projection is that you still have to multiply this huge matrix, despite the fact that most of it is zeros. >> Ping Li: Yeah. >>: [inaudible] to reduce the permutation once enough [inaudible] matrix is really sparse, just do the permutation for the nonzeros. >> Ping Li: Yeah. First of all, you don't do multiplications anymore: because the matrix is sparse, with S at 10,000, the entries are just plus one, minus one, or zero most of the time. >>: Oh, I see. So you [inaudible]. >> Ping Li: There's no multiplication: you basically divide the entries into two groups, sum each group, and take the difference. >>: [inaudible]. >> Ping Li: Yeah, that's an issue. So I -- >>: [inaudible]. >> Ping Li: And that could be, yeah, because the index space is actually D times K. That's still -- yeah. >>: So that is probably overcome by the hashing. >> Ping Li: Yeah. >>: Okay. >> Ping Li: Yeah. I'm very impressed you caught that -- it's actually hidden there in sparse random projections. You still have to tell which entries are zero. Okay. [applause]