BigData: Efficient Search and Learning in High Dimensions with Sparse Random Projections and Hashing

Ping Li
Department of Statistics & Biostatistics
Department of Computer Science
Rutgers University, Piscataway, NJ 08854

November 19, 2013, NYU-ML Seminar


Practice of Statistical Analysis and Data Mining

Key Components: Models (Methods) + Variables (Features) + Observations (Examples)

A Popular Practice: Simple Models + Lots of Features + Lots of Data, or simply Simple Methods + BigData


BigData Everywhere

Conceptually, consider a dataset as a matrix of size n × D.

In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare, for example, images, documents, spams, search click data.

High-dimensional (image, text, biological) data are common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), or even D = 2^64.


Examples of BigData Challenges: Linear Learning

Binary classification: dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear logistic regression:

    min_w  (1/2) w^T w + C Σ_{i=1}^n log(1 + e^{−y_i w^T x_i}),

or the L2-regularized linear SVM:

    min_w  (1/2) w^T w + C Σ_{i=1}^n max(1 − y_i w^T x_i, 0),

where C > 0 is the penalty (regularization) parameter.


Challenges of Learning with Massive High-dimensional Data

• The data often cannot fit in memory (even when the data are sparse).
• Data loading (or transmission over a network) takes too long.
• Training can be expensive, even for simple linear models.
• Testing may be too slow to meet the demand, which is especially crucial for applications in search, high-speed trading, or interactive data visual analytics.
• Near neighbor search, for example, finding the most similar document among billions of Web pages without scanning them all.


Dimensionality Reduction and Data Reduction

Dimensionality reduction: reducing D, for example, from 2^64 to 10^5.

Data reduction: reducing # nonzeros is often more important. With modern linear learning algorithms, the cost (for storage, transmission, computation) is mainly determined by # nonzeros, not much by the dimensionality.
———————
In practice, where do high-dimensional data come from?


High-Dimensional Data

• Certain datasets (e.g., genomics) are naturally high-dimensional.
• For some applications, one can choose either
  – Low-dimensional representations + complex & slow methods, or
  – High-dimensional representations + simple & fast methods
• Two popular feature expansion methods (a small sketch of the local one follows below)
  – Global expansion: e.g., pairwise (or higher-order) interactions
  – Local expansion: e.g., word/character n-grams using contiguous words/characters
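A minimal sketch of local expansion via character n-grams, assuming a toy document; mapping each n-gram string to a column index via Python's built-in hash folded into a fixed dimension D is an illustrative choice, not the exact pipeline used for the datasets in this talk.

# Local feature expansion: character 3-grams of a document,
# mapped into a sparse count vector of (huge) dimension D.
from collections import Counter

def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def expand(text, n=3, D=2**20):
    """Map a document to a sparse {column index: count} representation.
    Hashing the n-gram string to a column is one common way to index
    the enormous space of possible n-grams."""
    counts = Counter(char_ngrams(text, n))
    return {hash(g) % D: c for g, c in counts.items()}

doc = "today is a nice day"
sparse_vec = expand(doc, n=3)
print(len(sparse_vec), "nonzeros out of D =", 2**20)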
Webspam: Text Data and Local Expansion

              Dim.         Time          Accuracy
  1-gram      254          20 sec        93.30%
  3-gram      16,609,143   200 sec       99.6%
  Kernel      16,609,143   about a week  99.6%

Classification experiments were based on linear SVM unless marked as "kernel".
—————
(Character) 1-gram: frequencies of occurrences of single characters.
(Character) 3-gram: frequencies of occurrences of 3 contiguous characters.


Webspam: Binary Quantized Data

                     Dim.         Time          Accuracy
  1-gram             254          20 sec        93.30%
  Binary 1-gram      254          20 sec        87.78%
  3-gram             16,609,143   200 sec       99.6%
  Binary 3-gram      16,609,143   200 sec       99.6%
  Kernel             16,609,143   about a week  99.6%

With high-dim representations, often only presence/absence information matters.


Major Issues with High-Dim Representations

• High dimensionality
• High storage cost
• Relatively high training/testing cost

A popular industrial practice: high-dimensional data + linear algorithms + hashing
——————
Random projection is a standard hashing method. Several well-known hashing algorithms are equivalent to random projections.


A Popular Solution Based on Normal Random Projections

Random projections: replace the original data matrix A by B = A × R.

R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = A E(RR^T) A^T = k · AA^T; after the usual 1/k normalization, inner products are preserved in expectation.

Therefore, we can simply feed B into (e.g.) SVM or logistic regression solvers.


Very Sparse Random Projections

The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

    r_ij =  −1   with prob. 1/(2s)
             0   with prob. 1 − 1/s
             1   with prob. 1/(2s)

If s = 100, then on average 99% of the entries are zero.
If s = 1000, then on average 99.9% of the entries are zero.
——————
Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD'06.
Ref: Li, Very Sparse Stable Random Projections, KDD'07.


A Running Example Using (Small) Webspam Data

Dataset: 350K text samples, 16 million dimensions, about 4000 nonzeros on average, 24GB disk space.

[Figure: histogram of the number of nonzeros per sample in the Webspam data.]

Task: binary classification for spam vs. non-spam.
——————–
Data were generated using character 3-grams, i.e., every 3 contiguous characters.


Very Sparse Projections + Linear SVM on Webspam Data

[Figure: classification accuracy (%) vs. C for linear SVM with s = 1 and s = 1000, k = 32 to 4096; red dashed curves show results based on the original data.]

Observations:
• We need a large number of projections (e.g., k > 4096) for high accuracy.
• The sparsity parameter s matters little, i.e., the projection matrix can be very sparse.
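A small sketch of very sparse random projections; the √s scaling of the nonzero entries follows the KDD'06 paper (the slide's ±1 version differs only by this constant factor), so that a simple 1/k normalization preserves inner products in expectation. The data here are synthetic.

import numpy as np

def very_sparse_R(D, k, s=100, seed=0):
    """Entries are +sqrt(s), 0, -sqrt(s) with probs 1/(2s), 1-1/s, 1/(2s)."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(D, k),
                       p=[1/(2*s), 1 - 1/s, 1/(2*s)])
    return np.sqrt(s) * signs

D, k, s = 10_000, 1024, 100
rng = np.random.default_rng(1)
A = (rng.random((2, D)) < 0.05).astype(float)   # two sparse 0/1 rows
R = very_sparse_R(D, k, s)
B = A @ R / np.sqrt(k)                          # projected rows, length k
print("estimated inner product:", B[0] @ B[1])
print("true inner product     :", A[0] @ A[1])
# note: for sparse/binary data the estimate is quite noisy --
# the variance analysis below explains why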
Disadvantages of Random Projections (and Variants)

Inaccurate, especially on binary data.


Variance Analysis for Inner Product Estimates

A × R = B.

First two rows in A: u_1, u_2 ∈ R^D (D is very large):

    u_1 = {u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}}
    u_2 = {u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D}}

First two rows in B: v_1, v_2 ∈ R^k (k is small):

    v_1 = {v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k}}
    v_2 = {v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k}}

    E(v_1^T v_2) = u_1^T u_2 = a

The estimator

    â = (1/k) Σ_{j=1}^k v_{1,j} v_{2,j}    (which is also an inner product)

satisfies

    E(â) = a,    Var(â) = (1/k) (m_1 m_2 + a^2),    where m_1 = Σ_{i=1}^D |u_{1,i}|^2,  m_2 = Σ_{i=1}^D |u_{2,i}|^2.

———
The variance is dominated by the marginal l2 norms m_1 m_2. For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).
——————
Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD'06.
Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.
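A quick Monte Carlo check of the variance formula above, using dense Gaussian projections for simplicity (for which the formula is exact); the dimensions and sparsity level are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
D, k, trials = 1000, 64, 2000

u1 = (rng.random(D) < 0.05).astype(float)   # sparse 0/1 vectors
u2 = (rng.random(D) < 0.05).astype(float)
a = u1 @ u2
m1, m2 = u1 @ u1, u2 @ u2

est = np.empty(trials)
for t in range(trials):
    R = rng.standard_normal((D, k))
    v1, v2 = u1 @ R, u2 @ R
    est[t] = (v1 @ v2) / k                  # a_hat for this projection

print("E(a_hat)   empirical %.3f   theory %.3f" % (est.mean(), a))
print("Var(a_hat) empirical %.3f   theory %.3f" % (est.var(), (m1 * m2 + a**2) / k))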
The New Approach

• Use random permutations instead of random projections.
• Use only a small k, e.g., 200, as opposed to (e.g.) 10^4.
• The work is inspired by minwise hashing for binary data.
—————
This talk will focus on binary data.


Minwise Hashing

• In information retrieval and databases, efficiently computing similarities is often crucial and challenging, for example, duplicate detection of web pages.
• The method of minwise hashing is still the standard algorithm for estimating set similarity in industry, since the 1997 seminal work by Broder et al.
• Minwise hashing has been used for numerous applications, for example: content matching for online advertising, detection of large-scale redundancy in enterprise file systems, syntactic similarity algorithms for enterprise information management, compressing social networks, advertising diversification, community extraction and classification in the Web graph, graph sampling, wireless sensor networks, Web spam, Web graph compression, text reuse in the Web, and many more.


Binary Data Vectors and Sets

A binary (0/1) vector in D dimensions can be viewed as a set S ⊆ Ω = {0, 1, ..., D − 1}.

Example: S_1, S_2, S_3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16).

          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
   S_1:   0 1 0 0 1 1 0 0 1 0 0  0  0  0  0  0
   S_2:   0 0 0 0 0 0 0 0 1 0 1  0  1  0  1  0
   S_3:   0 0 0 1 0 0 1 1 0 0 0  0  0  0  1  0

S_1 = {1, 4, 5, 8},   S_2 = {8, 10, 12, 14},   S_3 = {3, 6, 7, 14}


Binary (0/1) Massive Text Data by n-Grams (Local Expansion)

Each document (Web page) can be viewed as a set of n-grams. For example, after parsing, the sentence "today is a nice day" becomes

• n = 1: {"today", "is", "a", "nice", "day"}
• n = 2: {"today is", "is a", "a nice", "nice day"}
• n = 3: {"today is a", "is a nice", "a nice day"}

It is common to use n ≥ 5. Using n-grams generates extremely high dimensional vectors, e.g., D = (10^5)^n. (10^5)^5 = 10^25 ≈ 2^83, although in current practice, it seems D = 2^64 suffices.


Notation

A binary (0/1) vector ⇐⇒ a set (the locations of its nonzeros).

Consider two sets S_1, S_2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64):

    f_1 = |S_1|,   f_2 = |S_2|,   a = |S_1 ∩ S_2|.

The resemblance R is a popular measure of set similarity:

    R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = a / (f_1 + f_2 − a).

(Is it more rational than a / √(f_1 f_2)?)


Minwise Hashing: Standard Algorithm in the Context of Search

The standard practice in the search industry: suppose a random permutation π is performed on Ω, i.e.,

    π : Ω −→ Ω,    where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

    Pr( min(π(S_1)) = min(π(S_2)) ) = |S_1 ∩ S_2| / |S_1 ∪ S_2| = R.


An Example

D = 5.  S_1 = {0, 3, 4},  S_2 = {1, 2, 3},  R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = 1/5.

One realization of the permutation π can be

    0 ⇒ 3,   1 ⇒ 2,   2 ⇒ 0,   3 ⇒ 4,   4 ⇒ 1

π(S_1) = {3, 4, 1} = {1, 3, 4},   π(S_2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S_1)) ≠ min(π(S_2)).


Minwise Hashing in a 0/1 Data Matrix

Original data matrix:

          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
   S_1:   0 1 0 0 1 1 0 0 1 0 0  0  0  0  0  0
   S_2:   0 0 0 0 0 0 0 0 1 0 1  0  1  0  1  0
   S_3:   0 0 0 1 0 0 1 1 0 0 0  0  0  0  1  0

Permuted data matrix:

             0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
   π(S_1):   0 0 1 0 1 0 0 1 0 0 0  0  0  1  0  0
   π(S_2):   1 0 0 1 0 0 1 0 0 0 0  0  0  1  0  0
   π(S_3):   1 1 0 0 0 0 0 0 0 0 1  0  1  0  0  0

min(π(S_1)) = 2,   min(π(S_2)) = 0,   min(π(S_3)) = 0


An Example with k = 3 Permutations

Input: sets S_1, S_2, ...

    Hashed values for S_1:   113     264     1091
    Hashed values for S_2:   2049    103     1091
    Hashed values for S_3:   ...     ...     ...


Minwise Hashing Estimator

After k permutations, π_1, π_2, ..., π_k, one can estimate R without bias:

    R̂_M = (1/k) Σ_{j=1}^k 1{min(π_j(S_1)) = min(π_j(S_2))},

    Var(R̂_M) = (1/k) R(1 − R).

—————————-
One major problem: we need to use 64 bits to store each hashed value.
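A small sketch of the estimator above: apply k independent random permutations of Ω, keep the minimum of each permuted set, and count matches. Explicit permutation arrays are fine for small D; real systems use hash functions instead, which this sketch does not attempt.

import numpy as np

def minhash(S, perms):
    """Return the k minwise hashed values of a set S of indices."""
    S = np.asarray(list(S))
    return np.array([perm[S].min() for perm in perms])

def estimate_resemblance(h1, h2):
    """R_hat = fraction of permutations whose minima collide."""
    return np.mean(h1 == h2)

rng = np.random.default_rng(0)
D, k = 10_000, 200
perms = [rng.permutation(D) for _ in range(k)]

S1 = set(rng.choice(D, 400, replace=False).tolist())
S2 = set(list(S1)[:200]) | set(rng.choice(D, 200, replace=False).tolist())
R_true = len(S1 & S2) / len(S1 | S2)

h1, h2 = minhash(S1, perms), minhash(S2, perms)
print("R_hat = %.3f   true R = %.3f" % (estimate_resemblance(h1, h2), R_true))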
Issues with Minwise Hashing and Our Solutions

1. Expensive storage (and computation): in the standard practice, each hashed value was stored using 64 bits.
   Our solution: b-bit minwise hashing, which uses only the lowest b bits.

2. Linear kernel learning: minwise hashing was not used for supervised learning, especially linear kernel learning.
   Our solution: we prove that (b-bit) minwise hashing results in a positive definite (PD) linear kernel matrix. The data dimensionality is reduced from 2^64 to 2^b.

3. Expensive and energy-consuming (pre)processing for k permutations: the industry had been using this expensive procedure since 1997 (or earlier).
   Our solution: one permutation hashing, which is even more accurate.

All these are accomplished by only doing basic statistics/probability.


b-Bit Minwise Hashing: the Intuition

Basic idea: only store the lowest b bits of each hashed value, for small b.

Intuition:
• If two sets are identical, then the lowest b bits of their hashed values are equal.
• If two sets are similar, then the lowest b bits of their hashed values "should be" also similar (True?? Need a proof).
————–
For example, if b = 1, then we only store whether a number is even or odd.

Lowest b = 2 bits:

    Decimal   Binary (8 bits)   Lowest 2 bits   Corresp. decimal
    0         00000000          00              0
    1         00000001          01              1
    2         00000010          10              2
    3         00000011          11              3
    4         00000100          00              0
    5         00000101          01              1
    6         00000110          10              2
    7         00000111          11              3
    8         00001000          00              0
    9         00001001          01              1


The New Probability Problem

Define the minimum values under π : Ω → Ω to be z_1 and z_2:

    z_1 = min(π(S_1)),    z_2 = min(π(S_2)),

and their b-bit versions

    z_1^(b) = the lowest b bits of z_1,    z_2^(b) = the lowest b bits of z_2.

Example: if z_1 = 7 (= 111 in binary), then z_1^(1) = 1 and z_1^(2) = 3.

We need:  Pr( z_1^(b) = z_2^(b) ).


Collision probability: assume D is large (which is virtually always true):

    P_b = Pr( z_1^(b) = z_2^(b) ) ≈ C_{1,b} + (1 − C_{2,b}) R = P_{b,approx}

——————–
Recall (assuming infinite precision, or as many digits as needed) we have Pr(z_1 = z_2) = R.


Collision probability: assume D is large. With

    r_1 = f_1/D,   r_2 = f_2/D,   f_1 = |S_1|,   f_2 = |S_2|,

    P_{b,approx} = C_{1,b} + (1 − C_{2,b}) R,

    C_{1,b} = A_{1,b} · r_2/(r_1 + r_2) + A_{2,b} · r_1/(r_1 + r_2),
    C_{2,b} = A_{1,b} · r_1/(r_1 + r_2) + A_{2,b} · r_2/(r_1 + r_2),

    A_{1,b} = r_1 (1 − r_1)^{2^b − 1} / (1 − (1 − r_1)^{2^b}),
    A_{2,b} = r_2 (1 − r_2)^{2^b − 1} / (1 − (1 − r_2)^{2^b}).

——————–
Ref: Li and König, b-Bit Minwise Hashing, WWW'10.


The Exact Collision Probability for b = 1

    P_{1,exact} = Pr( z_1^(1) = z_2^(1) ) = Pr(both even) + Pr(both odd)
                = Pr(z_1 = z_2) + Pr(z_1 ≠ z_2, both even) + Pr(z_1 ≠ z_2, both odd)
                = R + Σ_{i=0,2,4,...} Σ_{j≠i, j=0,2,4,...} Pr(z_1 = i, z_2 = j)
                    + Σ_{i=1,3,5,...} Σ_{j≠i, j=1,3,5,...} Pr(z_1 = i, z_2 = j),

where, for i < j,

    Pr(z_1 = i, z_2 = j) = (f_1 − a)/(D − i) · f_2/(D − j)
                           · Π_{t=0}^{i−1} (D − f_1 − f_2 + a − t)/(D − t)
                           · Π_{t=0}^{j−i−2} (D − f_2 − i − 1 − t)/(D − i − 1 − t),

and the case i > j follows by symmetry (swap the roles of the two sets).


Remarkably Accurate Approximate Formula

[Figure: approximation error P_{b,approx} − P_{b,exact} vs. a/f_2, for D = 20 (f_1 = 4, b = 1, f_2 = 2 or 4) and D = 500 (f_1 = 50, b = 1, f_2 = 25 or 50); real D can be 2^64.]
—————–
Ref: Li and König, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.
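A small helper implementing the approximate collision probability above, plus the unbiasing step it suggests (solving P_b = C_{1,b} + (1 − C_{2,b}) R for R); a sketch of the formulas as stated, not the authors' code.

def collision_prob_approx(r1, r2, R, b):
    """P_{b,approx} = C_{1,b} + (1 - C_{2,b}) R for large D."""
    A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
    A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1 + (1 - C2) * R, (C1, C2)

def unbias(Pb_hat, C1, C2):
    """Invert P_b = C1 + (1 - C2) R to estimate R from an observed collision rate."""
    return (Pb_hat - C1) / (1 - C2)

# sparse data: r1, r2 -> 0, so C1, C2 -> 1/2^b and P_b -> 1/2^b + (1 - 1/2^b) R
Pb, (C1, C2) = collision_prob_approx(r1=1e-6, r2=1e-6, R=0.5, b=1)
print(Pb)                   # about 0.75
print(unbias(Pb, C1, C2))   # recovers R = 0.5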
The Variance–Space Trade-off

Smaller b ⇒ smaller space for each sample. However, smaller b ⇒ larger variance.

The "rule of thumb": 64 bits + k permutations ⇐⇒ 1 bit + 3k permutations.

The variance–space trade-off is characterized as

    64 (i.e., space) × variance    vs.    b (i.e., space) × variance.

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.


Experiments: Duplicate Detection on Microsoft News Articles

Goal: retrieve document pairs with resemblance R > R_0.

[Figure: precision vs. sample size k for b = 1, 2, 4 and the 64-bit estimator M, at thresholds R_0 = 0.5 and R_0 = 0.8.]

Can we push the b-bit idea further, for R_0 close to 1?


1/2 Bit Suffices for High Similarity

XOR two bits from two permutations into just one bit:

    1 XOR 1 = 0,   0 XOR 0 = 0,   1 XOR 0 = 1,   0 XOR 1 = 1.

The new estimator is denoted by R̂_{1/2}. Compared to the 1-bit estimator R̂_1:

• A good idea for highly similar data:  lim_{R→1} Var(R̂_1) / Var(R̂_{1/2}) = 2.
• Not so good an idea when the data are not very similar:  Var(R̂_1) < Var(R̂_{1/2}) if R < 0.5774, assuming sparse data.
—————
Ref: Li and König, b-Bit Minwise Hashing, WWW'10.


b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections:

    R_{123} = |S_1 ∩ S_2 ∩ S_3| / |S_1 ∪ S_2 ∪ S_3|

[Figure: Venn diagram of the three sets S_1, S_2, S_3 with sizes f_1, f_2, f_3, pairwise intersections a_{12}, a_{13}, a_{23}, and 3-way intersection a.]

Applications: databases (query planning), information retrieval.

3-way collision probability: assume D is large. Then

    Pr(lowest b bits of all 3 hashed values are equal) = R_{123} + Z/u,

where u = r_1 + r_2 + r_3 − s_{12} − s_{13} − s_{23} + s,  r_i = |S_i|/D,  s_{ij} = |S_i ∩ S_j|/D,  s = |S_1 ∩ S_2 ∩ S_3|/D, and

    Z = (s_{12} − s) A_{3,b} + (r_3 − s_{13} − s_{23} + s)/(r_1 + r_2 − s_{12}) · s_{12} G_{12,b}
      + (s_{13} − s) A_{2,b} + (r_2 − s_{12} − s_{23} + s)/(r_1 + r_3 − s_{13}) · s_{13} G_{13,b}
      + (s_{23} − s) A_{1,b} + (r_1 − s_{12} − s_{13} + s)/(r_2 + r_3 − s_{23}) · s_{23} G_{23,b}
      + [ (r_2 − s_{23}) A_{3,b} + (r_3 − s_{23}) A_{2,b} ] · (r_1 − s_{12} − s_{13} + s)/(r_2 + r_3 − s_{23}) · G_{23,b}
      + [ (r_1 − s_{13}) A_{3,b} + (r_3 − s_{13}) A_{1,b} ] · (r_2 − s_{12} − s_{23} + s)/(r_1 + r_3 − s_{13}) · G_{13,b}
      + [ (r_1 − s_{12}) A_{2,b} + (r_2 − s_{12}) A_{1,b} ] · (r_3 − s_{13} − s_{23} + s)/(r_1 + r_2 − s_{12}) · G_{12,b},

    A_{j,b} = r_j (1 − r_j)^{2^b − 1} / (1 − (1 − r_j)^{2^b}),

    G_{ij,b} = (r_i + r_j − s_{ij}) (1 − r_i − r_j + s_{ij})^{2^b − 1} / (1 − (1 − r_i − r_j + s_{ij})^{2^b}),   i, j ∈ {1, 2, 3}, i ≠ j.

—————
Ref: Li, König, Gui, b-Bit Minwise Hashing for Estimating Three-Way Similarities, NIPS'10.

Useful messages:
1. Substantial improvement over using 64 bits, just like in the 2-way case.
2. Must use b ≥ 2 bits for estimating 3-way similarities.
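For intuition, a quick simulation of the 3-way analogue of the basic collision identity: with full (un-truncated) hashed values, the probability that all three minima agree equals R_123, so the empirical collision rate over k permutations estimates it directly. The b-bit version would additionally need the correction above (and b ≥ 2). The set sizes and overlaps here are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
D, k = 5000, 2000

# three sets sharing a common core, so that R_123 > 0
core = set(rng.choice(D, 150, replace=False).tolist())
sets = [core | set(rng.choice(D, 250, replace=False).tolist()) for _ in range(3)]
S1, S2, S3 = sets
R123 = len(S1 & S2 & S3) / len(S1 | S2 | S3)

collisions = 0
for _ in range(k):
    perm = rng.permutation(D)
    z = [perm[list(S)].min() for S in sets]    # the three minwise hashed values
    if z[0] == z[1] == z[2]:
        collisions += 1

print("empirical 3-way collision rate: %.3f" % (collisions / k))
print("true R_123                    : %.3f" % R123)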
Next question: how to use b-bit minwise hashing for linear learning?


Using b-Bit Minwise Hashing for Learning

Random projection for linear learning is straightforward (B = A × R) because the projected data are naturally in an inner product space (linear kernel).

Natural questions:
1. Is the matrix of resemblances positive definite (PD)?
2. Is the matrix of estimated resemblances PD?

We look for linear kernels to take advantage of modern linear learning algorithms.


Kernels from Minwise Hashing and b-Bit Minwise Hashing

Definition: a symmetric n × n matrix K satisfying Σ_{ij} c_i c_j K_{ij} ≥ 0 for all real vectors c is called positive definite (PD).

Result: apply a permutation π on n sets S_1, ..., S_n ⊆ Ω = {0, 1, ..., D − 1} and define z_i = min{π(S_i)}. The following two matrices are both PD:

1. The resemblance matrix R ∈ R^{n×n}:  R_{ij} = |S_i ∩ S_j| / |S_i ∪ S_j| = |S_i ∩ S_j| / (|S_i| + |S_j| − |S_i ∩ S_j|).
2. The minwise hashing matrix M ∈ R^{n×n}:  M_{ij} = 1{z_i = z_j}.


b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products. The estimator of minwise hashing

    R̂_M = (1/k) Σ_{j=1}^k 1{min(π_j(S_1)) = min(π_j(S_2))}
         = (1/k) Σ_{j=1}^k Σ_{i=0}^{D−1} 1{min(π_j(S_1)) = i} × 1{min(π_j(S_2)) = i}

is indeed an inner product of two (extremely sparse) vectors in D × k dimensions.
———–
Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS'11.


Consider D = 5. We can expand numbers into vectors of length 5:

    0 ⇒ [1, 0, 0, 0, 0],   1 ⇒ [0, 1, 0, 0, 0],   2 ⇒ [0, 0, 1, 0, 0],
    3 ⇒ [0, 0, 0, 1, 0],   4 ⇒ [0, 0, 0, 0, 1].
———————
If z_1 = 2, z_2 = 3, then 0 = 1{z_1 = z_2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 0, 1, 0].
If z_1 = 2, z_2 = 2, then 1 = 1{z_1 = z_2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].


Integrating b-Bit Minwise Hashing for (Linear) Learning

Very simple:
1. Apply k independent random permutations on each (binary) feature vector x_i and store the lowest b bits of each hashed value. The storage costs nbk bits.
2. At run-time, expand a hashed data point into a vector of length 2^b × k, i.e., concatenate k vectors of length 2^b. The new feature vector has exactly k 1's.


An Example with k = 3 Permutations

Input: sets S_1, S_2, ...

    Hashed values for S_1:   113     264     1091
    Hashed values for S_2:   2049    103     1091
    Hashed values for S_3:   ...     ...     ...


An Example with k = 3 Permutations and b = 2 Bits

For set (vector) S_1 (the original high-dimensional binary feature vector):

    Hashed values        :  113        264         1091
    Binary               :  1110001    100001000   10001000011
    Lowest b = 2 bits    :  01         00          11
    Decimal values       :  1          0           3
    Expansions (2^b)     :  0100       1000        0001

    New binary feature vector: [0, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1] × 1/√3

Same procedures on sets S_2, S_3, ...


Experiments on Webspam Data: Testing Accuracy

[Figure: SVM test accuracy (%) vs. C on Webspam, k = 200, for b = 1, 2, 4, 6, 8, 10, 16.]

• Dashed: using the original data (24GB disk space).
• Solid: b-bit hashing.

Using b = 8 and k = 200 achieves about the same test accuracies as using the original data. Space: 70MB (350,000 × 200).
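A sketch of steps 1–2 above: take the k minwise hashed values of a set, keep the lowest b bits, and expand into the concatenated one-hot vector of length 2^b × k, normalized by 1/√k (mirroring the 1/√3 in the example); the one-hot layout within each block is an arbitrary but fixed convention.

import numpy as np

def bbit_expand(hashed_values, b):
    """Map k hashed values to a (2^b * k)-dim vector with exactly k ones."""
    k = len(hashed_values)
    out = np.zeros(2**b * k)
    for j, h in enumerate(hashed_values):
        low = int(h) & (2**b - 1)            # lowest b bits of the j-th hash
        out[j * 2**b + low] = 1.0
    return out / np.sqrt(k)

x1 = bbit_expand([113, 264, 1091], b=2)      # set S_1 from the example
x2 = bbit_expand([2049, 103, 1091], b=2)     # set S_2
# the inner product equals (# of b-bit collisions)/k; feed such vectors to a linear SVM
print(x1 @ x2)   # = 2/3: hashes 113 vs 2049 and 1091 vs 1091 collide in their lowest 2 bits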
A Good Practice

1. Engineering: Webspam, unigram, 254 dim ⇒ 3-gram, 16M dim, 4000 nonzeros per doc. Accuracy: 93.3% ⇒ 99.6%.
2. Probability/Statistics: 16M dim, 4000 nonzeros ⇒ k = 200 nonzeros, 2^b × k = 51,200 dim. Accuracy: 99.6% ⇒ 99.6%.


Training Time

[Figure: SVM training time (sec) vs. C on Webspam, k = 200, for b up to 16.]

• The timings do not include data loading time (which is small for b-bit hashing).
• The original training time is about 200 seconds.
• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).


Testing Time

[Figure: SVM testing time (sec) vs. C on Webspam, k = 200.]

However, here we assume the test data have already been processed.


The Problem of Expensive Preprocessing

200 or 500 permutations on the entire data can be very expensive. This is a particularly serious issue when the new testing data have not been processed.

Two solutions:
1. Parallel solution by GPUs: achieved up to 100-fold improvement in speed.
   Ref: Li, Shrivastava, König, GPU-Based Minwise Hashing, WWW'12 (poster).
2. Statistical solution: only one permutation is needed, and it is even more accurate.
   Ref: Li, Owen, Zhang, One Permutation Hashing, NIPS'12 (although the problem was not fully solved due to the existence of empty bins).


Intuition: Minwise Hashing Ought to Be Wasteful

(Same original and permuted data matrices as shown earlier.) We only store the minimums, yet repeat the whole permutation process k (e.g., 500) times.


One Permutation Hashing

S_1, S_2, S_3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). After one permutation:

    π(S_1) = {2, 4, 7, 13},   π(S_2) = {0, 3, 6, 13},   π(S_3) = {0, 1, 10, 12}

              bin 1       bin 2       bin 3          bin 4
              0 1 2 3     4 5 6 7     8 9 10 11      12 13 14 15
   π(S_1):    0 0 1 0     1 0 0 1     0 0 0  0       0  1  0  0
   π(S_2):    1 0 0 1     0 0 1 0     0 0 0  0       0  1  0  0
   π(S_3):    1 1 0 0     0 0 0 0     0 0 1  0       1  0  0  0

One permutation hashing: divide the space Ω evenly into k = 4 bins and select the smallest nonzero location in each bin.

Ref: P. Li, A. Owen, C.-H. Zhang, One Permutation Hashing, NIPS'12.


An Unbiased Estimator of Resemblance R

    Estimator:    R̂_mat = N_mat / (k − N_emp)
    Expectation:  E(R̂_mat) = R
    Variance:     Var(R̂_mat) < (1/k) R(1 − R)

One permutation hashing has smaller variance than k-permutation hashing.
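A minimal sketch of one permutation hashing: permute Ω once, split it into k equal bins, keep the smallest permuted location in each bin (or a marker for an empty bin), and estimate R by matches over jointly non-empty bins, as in R̂_mat above.

import numpy as np

EMPTY = -1

def one_perm_hash(S, perm, k, D):
    """k-bin sketch from a single permutation; EMPTY marks bins with no element."""
    bin_size = D // k
    sketch = np.full(k, EMPTY)
    for pos in np.sort(perm[np.fromiter(S, dtype=int)]):
        b = pos // bin_size
        if sketch[b] == EMPTY:          # first (smallest) location landing in bin b
            sketch[b] = pos
    return sketch

def estimate_R(sk1, sk2):
    k = len(sk1)
    n_emp = np.sum((sk1 == EMPTY) & (sk2 == EMPTY))             # jointly empty bins
    n_mat = np.sum((sk1 != EMPTY) & (sk2 != EMPTY) & (sk1 == sk2))
    return n_mat / (k - n_emp)

rng = np.random.default_rng(0)
D, k = 2**14, 64
perm = rng.permutation(D)
S1 = set(rng.choice(D, 800, replace=False).tolist())
S2 = set(list(S1)[:500]) | set(rng.choice(D, 300, replace=False).tolist())
R = len(S1 & S2) / len(S1 | S2)
print("R_hat = %.3f   true R = %.3f" % (
    estimate_R(one_perm_hash(S1, perm, k, D), one_perm_hash(S2, perm, k, D)), R))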
Experimental Results on Webspam Data

[Figure: SVM test accuracy (%) vs. C on Webspam for k = 256 and k = 512, b = 1, 2, 4, 6, 8, comparing the original data, one permutation hashing, and k-permutation hashing.]

One permutation hashing (zero coding) is even slightly more accurate than k-permutation hashing (at merely 1/k of the original preprocessing cost).


Summary of Contributions

1. b-bit minwise hashing: use only the lowest b bits of each hashed value, as opposed to 64 bits in the standard industry practice.
2. Linear kernel learning: b-bit minwise hashing results in a positive definite linear kernel. The data dimensionality is reduced from 2^64 to 2^b.
3. One permutation hashing: it reduces the expensive and energy-consuming process of (e.g.) 500 permutations to merely one permutation. The results are even more accurate.
4. Others: for example, b-bit minwise hashing naturally leads to a space-partition scheme for sub-linear time near-neighbor search.
————
All these are accomplished by only doing basic statistics/probability.


Dense or Non-binary Data? Back to Random Projections

Ref: Ping Li, Gennady Samorodnitsky, John Hopcroft, Sign Cauchy Projections and Chi-Square Kernel, NIPS'13.
Ref: Ping Li, Michael Mitzenmacher, Anshumali Shrivastava, Coding for Random Projections, arXiv:1308.2218, 2013.
Ref: Ping Li, Cun-Hui Zhang, Tong Zhang, Compressed Counting Meets Compressed Sensing, arXiv:1310.1076, 2013.


Compressed Sensing with α-Stable Random Projections

• In the mainstream literature, sparse signal recovery (compressed sensing) uses a Gaussian design matrix. Gaussian is α-stable with α = 2.
• We recently discovered that it is a good idea to use small α.
• With a small α, the decoding algorithm becomes extremely simple, very fast, and robust (against measurement noise): no L1, no RIP, no phase transition.


Simulations with M = (1/5) K log N Measurements

[Figure: recovered signal values vs. coordinates, N = 10^5. Left panel: our method (Min+Gap(3)) using an α = 0.03 design matrix, K = 30, time = 0.31 sec. Right panel: classical linear programming (LP) using a Gaussian design matrix, K = 30, ζ = 5, time = 137.73 sec.]


Comparisons in Decoding Times

[Figure: median decoding time (sec) vs. ζ, where # measurements = M_0/ζ = (1/ζ) × K log N, for LP, OMP, and our methods (1), (2), (3); Gaussian signal, N = 100,000, K = 100.]
An Example of a "Single-Pixel Camera" Application

The task is to take a picture using linear combinations of measurements. When the scene is sparse, our method can efficiently recover the nonzero components.

[Figure: original test image.]

In reality, natural pictures are often not as sparse, but in important scenarios, for example, the differences between consecutive frames of surveillance cameras are usually very sparse because the background remains still.

[Figure: the original image and reconstructions by Min+Gap(3) with ζ = 1 (time 11.63), ζ = 3 (4.61), ζ = 5 (4.86), ζ = 10 (5.45), and ζ = 15 (4.94).]

When ζ = 15, M ≈ K in this case, but the reconstructed image by our method could still be informative.

What we have seen so far is merely the tip of the iceberg in this line of work.


Conclusion

• The "Big Data" time: researchers & practitioners often don't have enough knowledge of the fundamental data generation process, which is fortunately often compensated by the capability of quickly collecting lots of data.
• Approximate answers often suffice for many statistical problems.
• Hashing becomes crucial in important applications, for example, search engines, as they have lots of data and need the answers quickly & cheaply.
• Binary data are special and crucial. In most applications, one can expand the features to extremely high-dimensional binary vectors.
• Probabilistic hashing is basically clever sampling.
• Numerous interesting research problems. Statisticians can make great contributions in this area; we just need to treat them as statistical problems.


MNIST: Digit Image Data and Global Expansion

                      Dim.      Time        Accuracy
  Original            768       80 sec      92.11%
  Binary original     768       30 sec      91.25%
  Pairwise            295,296   400 sec     98.33%
  Binary pairwise     295,296   400 sec     97.88%
  Kernel              768       many hours  98.6%

(It is not easy to achieve the well-known 98.6% kernel accuracy; one must use very precise kernel parameters.)
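A sketch of the global (pairwise) expansion behind the MNIST rows above, assuming the "Pairwise" representation consists of all products x_i x_j with i ≤ j, which matches the listed dimensionality 768 · 769 / 2 = 295,296.

import numpy as np

def pairwise_expand(x):
    """All products x_i * x_j with i <= j: d(d+1)/2 features."""
    d = len(x)
    iu, ju = np.triu_indices(d)        # upper triangle, including the diagonal
    return x[iu] * x[ju]

d = 768
x = np.random.default_rng(0).random(d)     # stand-in for one image vector
z = pairwise_expand(x)
print(z.shape)                             # (295296,) -- the "Pairwise" dimensionality
# binarizing x first (x > 0) would give the "Binary pairwise" representation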
Experimental Results on L2-Regularized Logistic Regression

Testing accuracy:

[Figure: logistic regression test accuracy (%) vs. C on Webspam for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]

Training Time

[Figure: logistic regression training time (sec) vs. C on Webspam for the same k and b.]


Testing the Robustness of Zero Coding on 20Newsgroups Data

For the webspam data, the # empty bins may be too small to make the algorithm fail. The 20Newsgroups (News20) dataset has merely 20,000 samples in about one million dimensions, with on average about 500 nonzeros per sample. Therefore, News20 is merely a toy example to test the robustness of our algorithm. In fact, we let k be as large as 4096, i.e., most of the bins are empty.
20Newsgroups: One Permutation vs. k-Permutation (SVM)

[Figure: SVM test accuracy (%) vs. C on News20 for k = 8 up to 4096 and b = 1 to 8, comparing the original data, one permutation hashing, and k-permutation hashing.]

One permutation hashing with zero coding can reach the original accuracy, 98%. The original k-permutation scheme can only reach 97.5%, even with k = 4096.


20Newsgroups: One Permutation vs. k-Permutation (Logistic Regression)

[Figure: the corresponding logistic regression accuracy curves on News20, for k = 8 up to 4096 and b = 1 to 8.]


Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09; not the VW online learning platform), which has the same variance as random projections and is not limited to binary data.

Consider two sets S_1, S_2, with f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|, and R = |S_1 ∩ S_2| / |S_1 ∪ S_2| = a/(f_1 + f_2 − a). (a = 0 ⇔ R = 0.) Then, from their variances

    Var(â_VW) ≈ (1/k) (f_1 f_2 + a^2),
    Var(R̂_MINWISE) = (1/k) R(1 − R),

we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).
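A quick numerical illustration of the two variance formulas above on a hypothetical sparse pair (the numbers are made up); relative standard errors are compared, since â estimates a while R̂ estimates R.

import numpy as np

f1, f2, a, k = 4000, 4000, 400, 200          # a hypothetical sparse, mildly similar pair
R = a / (f1 + f2 - a)

var_vw = (f1 * f2 + a**2) / k                # variance of the VW / projection estimate of a
var_mh = R * (1 - R) / k                     # variance of the minwise estimate of R

print("relative std. error of a_hat (VW)     : %.2f" % (np.sqrt(var_vw) / a))
print("relative std. error of R_hat (minwise): %.2f" % (np.sqrt(var_mh) / R))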
Comparing b-Bit Minwise Hashing with VW

[Figure: test accuracy (%) vs. k on Webspam, VW vs. b = 8 hashing, for SVM and logistic regression, C = 0.01 to 100.]

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 ∼ 10^6.

8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: training time (sec) vs. k on Webspam, VW vs. b = 8 hashing, for SVM and logistic regression, C = 0.01 to 100.]


Advantages of One Permutation Hashing

[Figure: the k = 4 bin example of one permutation hashing, as shown earlier.]

• Computationally much more efficient and energy-efficient.
• Testing unprocessed data is much faster, which is crucial for user-facing applications.
• Implementation is easier, from the perspective of random number generation; e.g., storing a permutation vector of length D = 10^9 (4GB) is not a big deal.
• One permutation hashing is a much better matrix sparsification scheme.


Definitions of N_emp and N_mat

    N_emp = Σ_{j=1}^k I_emp,j,    N_mat = Σ_{j=1}^k I_mat,j,

where I_emp,j and I_mat,j are defined for the j-th bin as

    I_emp,j = 1 if both π(S_1) and π(S_2) are empty in the j-th bin, and 0 otherwise;
    I_mat,j = 1 if both π(S_1) and π(S_2) are non-empty in the j-th bin and the smallest element of π(S_1) matches the smallest element of π(S_2), and 0 otherwise.


Number of Jointly Empty Bins N_emp

Notation: f_1 = |S_1|, f_2 = |S_2|, a = |S_1 ∩ S_2|, f = |S_1 ∪ S_2| = f_1 + f_2 − a.

Distribution:

    Pr(N_emp = j) = Σ_{s=0}^{k−j} (−1)^s · k! / (j! s! (k − j − s)!) · Π_{t=0}^{f−1} [D(1 − (j+s)/k) − t] / (D − t)

Expectation:

    E(N_emp)/k = Π_{j=0}^{f−1} [D(1 − 1/k) − j] / (D − j) ≤ (1 − 1/k)^f

    f/k = 5  ⇒  E(N_emp)/k ≈ 0.0067 (negligible).


Empirical Validation

Empirical MSE (= bias^2 + variance) of R̂_mat, for 2 pairs of binary data vectors.

[Figure: empirical and theoretical MSE(R̂_mat), and the minwise hashing MSE, vs. k, for the word-vector pairs OF–AND and RIGHTS–RESERVED.]


One Permutation Hashing for Linear Learning

Almost the same as k-permutation hashing, but we must deal with empty bins.

Our solution: encode empty bins as (e.g.) 00000000 in the expanded space.


Zero Coding Example for One Permutation with k = 4 Bins

For set (vector) S_1 (the original high-dimensional binary feature vector):

    Hashed values        :  113        264         1091          ∗ (empty bin)
    Binary               :  1110001    100001000   10001000011   ∗
    Lowest b = 2 bits    :  01         00          11            ∗
    Decimal values       :  1          0           3             ∗
    Expansions (2^b)     :  0100       1000        0001          0000

    New feature vector: [0, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 1,  0, 0, 0, 0] × 1/√(4 − 1)

Same procedures on sets S_2, S_3, ...
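A sketch combining one permutation hashing with the zero-coding expansion above: each of the k bins contributes a 2^b block that is one-hot for a non-empty bin and all zeros for an empty bin, and the whole vector is scaled by 1/√(k − N_emp), where N_emp is this vector's own number of empty bins.

import numpy as np

EMPTY = -1

def zero_code_expand(sketch, b):
    """sketch: length-k array of smallest permuted locations, EMPTY for empty bins."""
    k = len(sketch)
    out = np.zeros(2**b * k)
    n_emp = 0
    for j, h in enumerate(sketch):
        if h == EMPTY:
            n_emp += 1                                   # empty bin -> all-zero block
        else:
            out[j * 2**b + (int(h) & (2**b - 1))] = 1.0  # one-hot on the lowest b bits
    return out / np.sqrt(k - n_emp)

# the k = 4 example above: three non-empty bins and one empty bin
x1 = zero_code_expand(np.array([113, 264, 1091, EMPTY]), b=2)
print(np.nonzero(x1)[0])      # positions 1, 4, 11, as in the slide
print(1 / np.sqrt(4 - 1))     # the scale factor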
b-Bit Minwise Hashing for Efficient Near Neighbor Search

Near neighbor search is a much more frequent operation than training an SVM. The bits from b-bit minwise hashing can be directly used to build hash tables to enable sub-linear time near neighbor search. This is an instance of the general family of locality-sensitive hashing (LSH).


An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^{2×2} = 16.

  Table 1                             Table 2
  Index    Data points                Index    Data points
  00 00    8, 13, 251                 00 00    2, 19, 83
  00 01    5, 14, 19, 29              00 01    17, 36, 129
  00 10    (empty)                    00 10    4, 34, 52, 796
  ...                                 ...
  11 01    7, 24, 156                 11 01    7, 198
  11 10    33, 174, 3153              11 10    56, 989
  11 11    61, 342                    11 11    8, 9, 156, 879

The data points are placed in the buckets according to their hashed values. Look for near neighbors in the bucket which matches the hash value of the query. Replicate the hash table (twice in this case) to recover good near neighbors that a single table would miss. The final retrieved data points are the union of the buckets.


Experiments on the Webspam Dataset

B = b × k: table size (in bits).  L: number of tables.

Fractions of retrieved data points:

[Figure: fraction of data points evaluated vs. B (bits), for L = 25, 50, 100, comparing SRP with b-bit hashing (b = 1, 2, 4).]

SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar numbers of data points.
————————
Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary Data, ECML'12.

[Figure: precision-recall curves (the higher the better) for retrieving the top-T data points (T = 5, 10, 50), for table sizes B = 16 and B = 24 with L = 100, comparing SRP with b-bit hashing (b = 1, 4).]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections), for table sizes B = 24 and B = 16.
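A minimal sketch of the table construction above: concatenate the lowest b bits from k permutations into a B = b × k-bit bucket index, build L independent tables, and answer a query with the union of its L buckets. The hash-function details (reusing the explicit-permutation minhash from earlier) are illustrative only.

import numpy as np
from collections import defaultdict

def bucket_index(S, perms, b):
    """Concatenate the lowest b bits of k minwise hashes into one integer."""
    idx = 0
    for perm in perms:
        idx = (idx << b) | (int(perm[list(S)].min()) & (2**b - 1))
    return idx

rng = np.random.default_rng(0)
D, b, k, L = 4096, 2, 2, 10
datasets = [set(rng.choice(D, 100, replace=False).tolist()) for _ in range(500)]

tables, table_perms = [], []
for _ in range(L):
    perms = [rng.permutation(D) for _ in range(k)]
    table = defaultdict(list)
    for i, S in enumerate(datasets):
        table[bucket_index(S, perms, b)].append(i)   # place point i in its bucket
    tables.append(table)
    table_perms.append(perms)

query = datasets[0]
candidates = set()
for table, perms in zip(tables, table_perms):
    candidates.update(table[bucket_index(query, perms, b)])   # union over L buckets
print(len(candidates), "candidates retrieved out of", len(datasets))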
Experimental Results on Rcv1 (200GB)

Test accuracy using linear SVM (one cannot train on the original data):

[Figure: SVM test accuracy (%) vs. C on Rcv1 for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]

Training time using linear SVM:

[Figure: SVM training time (sec) vs. C on Rcv1 for the same k and b.]

Test accuracy using logistic regression:

[Figure: logistic regression test accuracy (%) vs. C on Rcv1 for the same k and b.]

Training time using logistic regression:

[Figure: logistic regression training time (sec) vs. C on Rcv1 for the same k and b.]
Comparisons with VW on Rcv1

Test accuracy using linear SVM:

[Figure: SVM test accuracy (%) vs. k on Rcv1, VW vs. b-bit hashing for b = 1, 2, 4, 8, 12, 16, C = 0.01 to 10.]

Test accuracy using logistic regression:

[Figure: the corresponding logistic regression accuracy curves on Rcv1.]

Training time:

[Figure: training time (sec) vs. k on Rcv1, VW vs. 8-bit hashing, for SVM and logistic regression, C = 0.01 to 100.]