BigData: Efficient Search and Learning in High Dimensions
with Sparse Random Projections and Hashing

Ping Li
Department of Statistics & Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854

NYU-ML Seminar, November 19, 2013
Practice of Statistical Analysis and Data Mining
Key Components:
Models (Methods) + Variables (Features) + Observations (Examples)
A Popular Practice:
Simple Models + Lots of Features + Lots of Data
or simply
Simple Methods + BigData
BigData Everywhere
Conceptually, consider a dataset as a matrix of size n × D.
In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare, for example, images, documents, spams, search click data.
High-dimensional (image, text, biological) data are common:
D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), or even D = 2^64.
Examples of BigData Challenges: Linear Learning
Binary classification: Dataset {(x_i, y_i)}_{i=1}^{n}, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear logistic regression:

    \min_w \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right),

or the L2-regularized linear SVM:

    \min_w \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \max\left(1 - y_i w^T x_i,\; 0\right),

where C > 0 is the penalty (regularization) parameter.
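As a side illustration (my own sketch, not from the slides), both objectives can be fit with scikit-learn; X and y below are synthetic placeholders, and the solver's C parameter plays exactly the role of the penalty parameter above:

```python
# A minimal sketch (assumes scikit-learn / scipy are available; X, y are toy data).
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = sp.random(1000, 10000, density=0.001, format="csr", random_state=0)  # n x D sparse
y = rng.choice([-1, 1], size=1000)                                       # labels in {-1, +1}

logit = LogisticRegression(C=1.0, solver="liblinear").fit(X, y)  # L2-regularized logistic loss
svm = LinearSVC(C=1.0, loss="hinge").fit(X, y)                   # L2-regularized hinge loss (SVM)
```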
Challenges of Learning with Massive High-dimensional Data
• The data often cannot fit in memory (even when the data are sparse).
• Data loading (or transmission over the network) takes too long.
• Training can be expensive, even for simple linear models.
• Testing may be too slow to meet the demand; this is especially crucial for applications in search, high-speed trading, or interactive visual data analytics.
• Near neighbor search, for example, finding the most similar document among billions of Web pages without scanning them all.
Dimensionality Reduction and Data Reduction
Dimensionality reduction: Reducing D, for example, from 2^64 to 10^5.
Data reduction: Reducing the # of nonzeros is often more important. With modern linear learning algorithms, the cost (for storage, transmission, computation) is mainly determined by the # of nonzeros, not so much by the dimensionality.
———————
In practice, where do high-dimensional data come from?
High-Dimensional Data
• Certain datasets (e.g., genomics) are naturally high-dimensional.
• For some applications, one can choose either
– Low-dimensional representations + complex & slow methods
or
– High-dimensional representations + simple & fast methods
• Two popular feature expansion methods
– Global expansion: e.g., pairwise (or higher-order) interactions
– Local expansion: e.g., word/character n-grams using contiguous words/characters.
Webspam: Text Data and Local Expansion

          Dim.          Time           Accuracy
1-gram    254           20 sec         93.30%
3-gram    16,609,143    200 sec        99.6%
Kernel    16,609,143    About a Week   99.6%

Classification experiments were based on linear SVM unless marked as “kernel”.
—————
(Character) 1-gram: frequencies of occurrences of single characters.
(Character) 3-gram: frequencies of occurrences of 3 contiguous characters.
Webspam: Binary Quantized Data

                Dim.          Time           Accuracy
1-gram          254           20 sec         93.30%
Binary 1-gram   254           20 sec         87.78%
3-gram          16,609,143    200 sec        99.6%
Binary 3-gram   16,609,143    200 sec        99.6%
Kernel          16,609,143    About a Week   99.6%
With high-dim representations, often only presence/absence information matters.
Major Issues with High-Dim Representations
• High-dimensionality
• High storage cost
• Relatively high training/testing cost
A popular industrial practice:
high-dimensional data + linear algorithms + hashing
——————
Random projection is a standard hashing method.
Several well-known hashing algorithms are equivalent to random projections.
A Popular Solution Based on Normal Random Projections
Random Projections: Replace the original data matrix A by B = A × R.

R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A.
In particular, E(BB^T) = A E(RR^T) A^T = k AA^T, so (1/k) BB^T is an unbiased estimate of AA^T.
Therefore, we can simply feed B into (e.g.) SVM or logistic regression solvers.
Very Sparse Random Projections
The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

    r_ij =  −1   with prob. 1/(2s)
             0   with prob. 1 − 1/s
            +1   with prob. 1/(2s)

If s = 100, then on average, 99% of the entries are zero.
If s = 1000, then on average, 99.9% of the entries are zero.
——————Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.
Ref: Li, Very Sparse Stable Random Projections, KDD’07.
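A minimal numpy sketch (my own illustration, not from the slides or papers) of generating the sparse matrix R defined above and forming B = A × R; A, k, s are placeholder names:

```python
import numpy as np

def very_sparse_projection(A, k, s, seed=0):
    """Project an n x D matrix A down to n x k columns. Each entry of R is
    -1 or +1 with probability 1/(2s) each, and 0 with probability 1 - 1/s."""
    rng = np.random.default_rng(seed)
    D = A.shape[1]
    R = rng.choice([-1.0, 0.0, 1.0], size=(D, k),
                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return A @ R   # B = A x R; feed B into a linear SVM / logistic regression

# s = 100: on average ~99% of the entries of R are zero.
A = np.random.default_rng(1).random((5, 10000))
B = very_sparse_projection(A, k=256, s=100)
print(B.shape)   # (5, 256)
```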
A Running Example Using (Small) Webspam Data
Dataset: 350K text samples, 16 million dimensions, about 4000 nonzeros on average, 24GB disk space.

[Figure: histogram of the # of nonzeros per document (roughly 0 to 10,000) for the Webspam data.]
Task: Binary classification for spam vs. non-spam.
——————–
Data were generated using character 3-grams, i.e., every 3 contiguous characters.
Very Sparse Projections + Linear SVM on Webspam Data
Red dashed curves: results based on the original data.

[Figure: classification accuracy (%) vs. C for linear SVM on Webspam using very sparse projections, for s = 1 (left) and s = 1000 (right), with k = 32, 64, 128, 256, 512, 1024, 4096.]

Observations:
• We need a large number of projections (e.g., k > 4096) for high accuracy.
• The sparsity parameter s matters little, i.e., the projection matrix can be very sparse.
Disadvantages of Random Projections (and Variants)
Inaccurate, especially on binary data.
Variance Analysis for Inner Product Estimates

    A × R = B

First two rows of A: u_1, u_2 ∈ R^D (D is very large):
    u_1 = (u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D})
    u_2 = (u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D})

First two rows of B: v_1, v_2 ∈ R^k (k is small):
    v_1 = (v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k})
    v_2 = (v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k})

    E\left( \frac{1}{k} v_1^T v_2 \right) = u_1^T u_2 = a
    â = \frac{1}{k} \sum_{j=1}^{k} v_{1,j} v_{2,j}   (which is also an inner product)

    E(â) = a

    Var(â) = \frac{1}{k} \left( m_1 m_2 + a^2 \right),   where  m_1 = \sum_{i=1}^{D} |u_{1,i}|^2,   m_2 = \sum_{i=1}^{D} |u_{2,i}|^2

———
The variance is dominated by the product of the marginal l2 norms, m_1 m_2.
For real-world datasets, most pairs are often close to being orthogonal (a ≈ 0).
——————Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD’06.
Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS’11
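A quick Monte Carlo sanity check of these two formulas (my own sketch; the particular D, k, and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, trials = 1000, 50, 2000
u1, u2 = rng.random(D), rng.random(D)
a, m1, m2 = u1 @ u2, u1 @ u1, u2 @ u2

est = np.empty(trials)
for t in range(trials):
    R = rng.standard_normal((D, k))          # normal random projection matrix
    est[t] = (u1 @ R) @ (u2 @ R) / k         # the estimator a-hat

print(est.mean(), a)                         # unbiased: ~equal
print(est.var(), (m1 * m2 + a**2) / k)       # matches Var(a-hat) = (m1*m2 + a^2)/k
```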
The New Approach
• Use random permutations instead of random projections.
• Use only a small “k” (e.g., 200) as opposed to (e.g.) 10^4.
• The work is inspired by minwise hashing for binary data.
—————
This talk will focus on binary data.
Minwise Hashing
• In information retrieval and databases, efficiently computing similarities is
often crucial and challenging, for example, duplicate detection of web pages.
• The method of minwise hashing is still the standard algorithm for estimating
set similarity in industry, since the seminal 1997 work by Broder et al.
• Minwise hashing has been used for numerous applications, for example:
content matching for online advertising, detection of large-scale redundancy
in enterprise file systems, syntactic similarity algorithms for enterprise
information management, compressing social networks, advertising
diversification, community extraction and classification in the Web graph,
graph sampling, wireless sensor networks, Web spam, Web graph
compression, text reuse in the Web, and many more.
Binary Data Vectors and Sets
A binary (0/1) vector in D-dim can be viewed as a set S ⊆ Ω = {0, 1, ..., D − 1}.

Example: S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16).

        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
S1 :    0  1  0  0  1  1  0  0  1  0  0  0  0  0  0  0
S2 :    0  0  0  0  0  0  0  0  1  0  1  0  1  0  1  0
S3 :    0  0  0  1  0  0  1  1  0  0  0  0  0  0  1  0

S1 = {1, 4, 5, 8},   S2 = {8, 10, 12, 14},   S3 = {3, 6, 7, 14}
Binary (0/1) Massive Text Data by n-Grams (Local Expansion)
Each document (Web page) can be viewed as a set of n-grams.
For example, after parsing, a sentence “today is a nice day” becomes
• n = 1: {“today”, “is”, “a”, “nice”, “day”}
• n = 2: {“today is”, “is a”, “a nice”, “nice day”}
• n = 3: {“today is a”, “is a nice”, “a nice day”}
It is common to use n ≥ 5.
Using n-grams generates extremely high-dimensional vectors, e.g., D = (10^5)^n.
(10^5)^5 = 10^25 ≈ 2^83, although in current practice, it seems D = 2^64 suffices.
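A tiny sketch (mine, not from the slides) of generating the word- and character-level n-gram sets for a document:

```python
def word_ngrams(text, n):
    """Set of all n contiguous words (word n-grams)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def char_ngrams(text, n):
    """Set of all n contiguous characters (character n-grams)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(word_ngrams("today is a nice day", 2))
# {'today is', 'is a', 'a nice', 'nice day'}
```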
Notation
A binary (0/1) vector ⇐⇒ a set (the locations of its nonzeros).

Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64), with

    f1 = |S1|,   f2 = |S2|,   a = |S1 ∩ S2|.

[Figure: Venn diagram of S1 and S2 with overlap a.]

The resemblance R is a popular measure of set similarity:

    R = \frac{|S1 ∩ S2|}{|S1 ∪ S2|} = \frac{a}{f1 + f2 − a}.

(Is it more rational than a / \sqrt{f1 f2}?)
Minwise Hashing: Standard Algorithm in the Context of Search
The standard practice in the search industry:
Suppose a random permutation π is performed on Ω, i.e.,

    π : Ω −→ Ω,   where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

    Pr( min(π(S1)) = min(π(S2)) ) = \frac{|S1 ∩ S2|}{|S1 ∪ S2|} = R.
An Example
D = 5.  S1 = {0, 3, 4},  S2 = {1, 2, 3},  R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.

One realization of the permutation π can be

    0 ⇒ 3,  1 ⇒ 2,  2 ⇒ 0,  3 ⇒ 4,  4 ⇒ 1

    π(S1) = {3, 4, 1} = {1, 3, 4},   π(S2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S1)) ≠ min(π(S2)).
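A small sketch (my own, not the paper's code) that estimates R with k explicit random permutations of Ω = {0, ..., D−1}; for the sets above it converges to R = 1/5:

```python
import numpy as np

def minwise_estimate(S1, S2, D, k, seed=0):
    """Estimate R = |S1 n S2| / |S1 u S2| from k independent permutations."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.array(sorted(S1)), np.array(sorted(S2))
    matches = 0
    for _ in range(k):
        pi = rng.permutation(D)                    # a random permutation of Omega
        matches += pi[s1].min() == pi[s2].min()    # the minimums collide w.p. R
    return matches / k

print(minwise_estimate({0, 3, 4}, {1, 2, 3}, D=5, k=100000))   # ~ 0.2
```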
Minwise Hashing in 0/1 Data Matrix

Original Data Matrix
          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
S1 :      0  1  0  0  1  1  0  0  1  0  0  0  0  0  0  0
S2 :      0  0  0  0  0  0  0  0  1  0  1  0  1  0  1  0
S3 :      0  0  0  1  0  0  1  1  0  0  0  0  0  0  1  0

Permuted Data Matrix
          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
π(S1):    0  0  1  0  1  0  0  1  0  0  0  0  0  1  0  0
π(S2):    1  0  0  1  0  0  1  0  0  0  0  0  0  1  0  0
π(S3):    1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0

min(π(S1)) = 2,  min(π(S2)) = 0,  min(π(S3)) = 0
An Example with k = 3 Permutations
Input: sets S1, S2, ...

Hashed values for S1 :   113     264     1091
Hashed values for S2 :   2049    103     1091
Hashed values for S3 :   ...
Minwise Hashing Estimator

After k permutations, π_1, π_2, ..., π_k, one can estimate R without bias:

    R̂_M = \frac{1}{k} \sum_{j=1}^{k} 1\{ \min(π_j(S1)) = \min(π_j(S2)) \},

    Var(R̂_M) = \frac{1}{k} R(1 − R).

—————————-
One major problem: Need to use 64 bits to store each hashed value.
Issues with Minwise Hashing and Our Solutions
1. Expensive storage (and computation): In the standard practice, each
hashed value was stored using 64 bits.
Our solution: b-bit minwise hashing by using only the lowest b bits.
2. Linear kernel learning: Minwise hashing was not used for supervised
learning, especially linear kernel learning.
Our solution: We prove that (b-bit) minwise hashing results in a positive definite
(PD) linear kernel matrix. The data dimensionality is reduced from 2^64 to 2^b.
3. Expensive and energy-consuming (pre)processing for k permutations:
The industry had been using the expensive procedure since 1997 (or earlier)
Our solution: One permutation hashing, which is even more accurate.
All these are accomplished by only doing basic statistics/probability
b-Bit Minwise Hashing: the Intuition
Basic idea: Only store the lowest b-bits of each hashed value, for small b.
Intuition:
• If two sets are identical, then the lowest b-bits of hashed values are equal.
• If two sets are similar, then their lowest b-bits of the hashed values “should
be” also similar (True?? Need a proof).
————–
For example, if
b = 1, then we only store whether a number is even or odd.
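Keeping the lowest b bits is a single bit-mask; a one-line sketch (mine):

```python
def lowest_b_bits(z, b):
    """Keep only the lowest b bits of the hashed (minimum) value z."""
    return z & ((1 << b) - 1)

print(lowest_b_bits(7, 1), lowest_b_bits(7, 2))   # 1, 3   (7 = 111 in binary)
```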
Lowest b = 2 bits

Decimal   Binary (8 bits)   Lowest b = 2 bits   Corresp. decimal
0         00000000          00                  0
1         00000001          01                  1
2         00000010          10                  2
3         00000011          11                  3
4         00000100          00                  0
5         00000101          01                  1
6         00000110          10                  2
7         00000111          11                  3
8         00001000          00                  0
9         00001001          01                  1
The New Probability Problem

Define the minimum values under π : Ω → Ω to be z_1 and z_2:

    z_1 = min(π(S_1)),   z_2 = min(π(S_2)),

and their b-bit versions:

    z_1^{(b)} = the lowest b bits of z_1,   z_2^{(b)} = the lowest b bits of z_2.

Example: if z_1 = 7 (= 111 in binary), then z_1^{(1)} = 1 and z_1^{(2)} = 3.

We need:   Pr( z_1^{(b)} = z_2^{(b)} ).
Collision probability: Assume D is large (which is virtually always true).

    P_b = Pr( z_1^{(b)} = z_2^{(b)} ) ≈ C_{1,b} + (1 − C_{2,b}) R = P_{b,approx}

——————–
Recall (assuming infinite precision, or as many digits as needed) that

    Pr( z_1 = z_2 ) = R.
Collision probability: Assume D is large (which is virtually always true).

    P_{b,approx} = C_{1,b} + (1 − C_{2,b}) R

where r_1 = f_1 / D, r_2 = f_2 / D, f_1 = |S_1|, f_2 = |S_2|, and

    C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2},
    C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2},

    A_{1,b} = \frac{r_1 [1 − r_1]^{2^b − 1}}{1 − [1 − r_1]^{2^b}},
    A_{2,b} = \frac{r_2 [1 − r_2]^{2^b − 1}}{1 − [1 − r_2]^{2^b}}.

——————–
Ref: Li and König, b-Bit Minwise Hashing, WWW’10.
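A direct transcription of the approximate formula (my own sketch, following the reconstruction above); f1, f2, a are the set sizes and intersection defined earlier:

```python
def p_b_approx(f1, f2, a, D, b):
    """P_{b,approx} = C_{1,b} + (1 - C_{2,b}) R for b-bit minwise hashing."""
    r1, r2 = f1 / D, f2 / D
    R = a / (f1 + f2 - a)
    A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
    A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    return C1 + (1 - C2) * R

print(p_b_approx(f1=4, f2=2, a=1, D=20, b=1))
```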
The Exact Collision Probability for b = 1

    P_{1,exact} = Pr( z_1^{(1)} = z_2^{(1)} ) = Pr(both even) + Pr(both odd)
                = Pr(z_1 = z_2) + Pr(z_1 ≠ z_2, both even) + Pr(z_1 ≠ z_2, both odd)
                = R + \sum_{i=0,2,4,...} \; \sum_{j \neq i,\, j=0,2,4,...} Pr(z_1 = i, z_2 = j)
                    + \sum_{i=1,3,5,...} \; \sum_{j \neq i,\, j=1,3,5,...} Pr(z_1 = i, z_2 = j),

where, for i < j,

    Pr(z_1 = i, z_2 = j) = \frac{f_2}{D} \, \frac{f_1 − a}{D − 1} \prod_{t=0}^{j−i−2} \frac{D − f_2 − i − 1 − t}{D − 2 − t} \prod_{t=0}^{i−1} \frac{D − f_1 − f_2 + a − t}{D + i − j − 1 − t}
Remarkably Accurate Approximate Formula

P_{b,approx} − P_{b,exact} for D = 20 and D = 500 (real D can be 2^64).

[Figure: approximation error vs. a/f2, for (left) D = 20, f1 = 4, b = 1 with f2 = 2, 4, and (right) D = 500, f1 = 50, b = 1 with f2 = 25, 50; the error is on the order of 10^{-2} for D = 20 and 10^{-4} for D = 500.]

—————–
Ref: Li and König, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.
The Variance-Space Trade-off

Smaller b ⇒ smaller space for each sample. However,
smaller b ⇒ larger variance.

The “rule of thumb”:

    64 bits + k permutations ⇐⇒ 1 bit + 3k permutations

The variance-space trade-off is characterized by the ratio

    \frac{64 \text{ (i.e., space)} \times \text{variance}}{b \text{ (i.e., space)} \times \text{variance}}

If R = 0.5, then the improvement will be 64/3 = 21.3-fold.
Experiments: Duplicate Detection on Microsoft News Articles

Goal: Retrieve document pairs with resemblance R > R0.

[Figure: precision vs. sample size k, for b = 1, 2, 4 and the 64-bit estimator (M), with R0 = 0.5 (left) and R0 = 0.8 (right).]

Can we push the b-bit idea further, for R0 close to 1?
1/2 Bit Suffices for High Similarity

XOR two bits from two permutations into just one bit:

    1 XOR 1 = 0,  0 XOR 0 = 0,  1 XOR 0 = 1,  0 XOR 1 = 1

The new estimator is denoted by R̂_{1/2}. Compared to the 1-bit estimator R̂_1:

• A good idea for highly similar data:

    \lim_{R \to 1} \frac{Var(R̂_1)}{Var(R̂_{1/2})} = 2.

• Not such a good idea when the data are not very similar:

    Var(R̂_1) < Var(R̂_{1/2}),   if R < 0.5774 (assuming sparse data).

—————
Ref: Li and König, b-Bit Minwise Hashing, WWW’10.
b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections:

    R_{123} = \frac{|S_1 ∩ S_2 ∩ S_3|}{|S_1 ∪ S_2 ∪ S_3|}

[Figure: Venn diagram of three sets with f_1, f_2, f_3, pairwise intersections a_{12}, a_{13}, a_{23}, and the normalized quantities r_i, s_{ij}, s.]

Applications: databases (query planning), information retrieval.

3-Way collision probability: Assume D is large.

    Pr( lowest b bits of 3 hashed values are equal ) = R_{123} + \frac{Z}{u}

where u = r_1 + r_2 + r_3 − s_{12} − s_{13} − s_{23} + s_{123},  r_i = |S_i| / D,  s_{ij} = |S_i ∩ S_j| / D, and ...
    Z = (s_{12} − s) A_{3,b} + \frac{r_3 − s_{13} − s_{23} + s}{r_1 + r_2 − s_{12}} s_{12} G_{12,b}
      + (s_{13} − s) A_{2,b} + \frac{r_2 − s_{12} − s_{23} + s}{r_1 + r_3 − s_{13}} s_{13} G_{13,b}
      + (s_{23} − s) A_{1,b} + \frac{r_1 − s_{12} − s_{13} + s}{r_2 + r_3 − s_{23}} s_{23} G_{23,b}
      + \left[ (r_2 − s_{23}) A_{3,b} + (r_3 − s_{23}) A_{2,b} \right] \frac{r_1 − s_{12} − s_{13} + s}{r_2 + r_3 − s_{23}} G_{23,b}
      + \left[ (r_1 − s_{13}) A_{3,b} + (r_3 − s_{13}) A_{1,b} \right] \frac{r_2 − s_{12} − s_{23} + s}{r_1 + r_3 − s_{13}} G_{13,b}
      + \left[ (r_1 − s_{12}) A_{2,b} + (r_2 − s_{12}) A_{1,b} \right] \frac{r_3 − s_{13} − s_{23} + s}{r_1 + r_2 − s_{12}} G_{12,b},

where s ≡ s_{123}, and

    A_{j,b} = \frac{r_j (1 − r_j)^{2^b − 1}}{1 − (1 − r_j)^{2^b}},
    G_{ij,b} = \frac{(r_i + r_j − s_{ij}) (1 − r_i − r_j + s_{ij})^{2^b − 1}}{1 − (1 − r_i − r_j + s_{ij})^{2^b}},   i, j ∈ {1, 2, 3},  i ≠ j.

—————
Ref: Li, König, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities, NIPS’10.
Useful messages:
1. Substantial improvement over using 64 bits, just like in the 2-way case.
2. Must use b ≥ 2 bits for estimating 3-way similarities.
Next question: How to use b-bit minwise hashing for linear learning?
Using b-Bit Minwise Hashing for Learning
Random projection for linear learning is straightforward:
    A × R = B

because the projected data are naturally in an inner product space (linear kernel).
Natural questions
1. Is the matrix of resemblances positive definite (PD)?
2. Is the matrix of estimated resemblances PD?
We look for linear kernels to take advantage of modern linear learning algorithms.
Kernels from Minwise Hashing and b-Bit Minwise Hashing
Definition: A symmetric n × n matrix K satisfying \sum_{ij} c_i c_j K_{ij} ≥ 0, for all real vectors c, is called positive definite (PD).

Result: Apply a permutation π on n sets S_1, ..., S_n ⊆ Ω = {0, 1, ..., D − 1}.
Define z_i = min{π(S_i)}. The following two matrices are both PD:

1. The resemblance matrix R ∈ R^{n×n}:  R_{ij} = \frac{|S_i ∩ S_j|}{|S_i ∪ S_j|} = \frac{|S_i ∩ S_j|}{|S_i| + |S_j| − |S_i ∩ S_j|}

2. The minwise hashing matrix M ∈ R^{n×n}:  M_{ij} = 1\{ z_i = z_j \}
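A quick empirical check (my own sketch) that the resemblance matrix of a handful of random sets has no negative eigenvalues, as the result above asserts:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 200, 8
sets = [set(rng.choice(D, size=int(rng.integers(5, 50)), replace=False))
        for _ in range(n)]

# Pairwise resemblances |Si & Sj| / |Si | Sj|
R = np.array([[len(si & sj) / len(si | sj) for sj in sets] for si in sets])
print(np.linalg.eigvalsh(R).min())   # >= 0 up to rounding: R is positive (semi-)definite
```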
b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products.

The estimator of minwise hashing

    R̂_M = \frac{1}{k} \sum_{j=1}^{k} 1\{ \min(π_j(S_1)) = \min(π_j(S_2)) \}
        = \frac{1}{k} \sum_{j=1}^{k} \sum_{i=0}^{D−1} 1\{ \min(π_j(S_1)) = i \} \times 1\{ \min(π_j(S_2)) = i \}

is indeed an inner product of two (extremely sparse) vectors in D × k dimensions.

———–
Ref: Li, Shrivastava, Moore, König, Hashing Algorithms for Large-Scale Learning, NIPS’11.
Consider D = 5. We can expand numbers into vectors of length 5:

    0 ⇒ [1, 0, 0, 0, 0],   1 ⇒ [0, 1, 0, 0, 0],   2 ⇒ [0, 0, 1, 0, 0],
    3 ⇒ [0, 0, 0, 1, 0],   4 ⇒ [0, 0, 0, 0, 1].

———————
If z1 = 2, z2 = 3, then
    0 = 1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 0, 1, 0].
If z1 = 2, z2 = 2, then
    1 = 1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].
Integrating b-Bit Minwise Hashing for (Linear) Learning
Very simple:
1. Apply k independent random permutations on each (binary) feature vector xi
and store the lowest b bits of each hashed value. The storage costs nbk bits.
2. At run-time, expand a hashed data point into a 2^b × k-length vector, i.e.,
concatenate k vectors of length 2^b. The new feature vector has exactly k 1’s.
(See the worked example and the code sketch below.)
An Example with k = 3 Permutations

Input: sets S1, S2, ...

Hashed values for S1 :   113     264     1091
Hashed values for S2 :   2049    103     1091
Hashed values for S3 :   ...
An Example with k = 3 Permutations and b = 2 Bits

For set (vector) S1 (the original high-dimensional binary feature vector):

Hashed values       :  113        264          1091
Binary              :  1110001    100001000    10001000011
Lowest b = 2 bits   :  01         00           11
Decimal values      :  1          0            3
Expansions (2^b)    :  0100       1000         0001

New binary feature vector: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1] × 1/√3
Same procedures on sets S2 , S3 , ...
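Putting the steps together, a sketch (mine, not the authors' implementation) that maps one set to its 2^b × k expanded feature vector; permutations are materialized explicitly, which is only feasible for small D:

```python
import numpy as np

def bbit_minwise_features(S, D, k, b, seed=0):
    """k permutations -> lowest b bits of each minimum -> concatenation of k
    one-hot blocks of length 2^b, scaled by 1/sqrt(k) as on the slide."""
    rng = np.random.default_rng(seed)
    idx = np.array(sorted(S))
    x = np.zeros(k * 2**b)
    for j in range(k):
        pi = rng.permutation(D)
        z = int(pi[idx].min()) & ((1 << b) - 1)   # lowest b bits of min(pi(S))
        x[j * 2**b + z] = 1.0                     # one-hot within the j-th block
    return x / np.sqrt(k)

x1 = bbit_minwise_features({1, 4, 5, 8}, D=16, k=3, b=2)
print(np.flatnonzero(x1))   # exactly k = 3 nonzero positions
```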
Experiments on Webspam Data: Testing Accuracy

[Figure: test accuracy (%) vs. C for linear SVM with k = 200 and b = 1, 2, 4, 6, 8, 10, 16 on the Webspam data.]

• Dashed: using the original data (24GB disk space).
• Solid: b-bit hashing. Using b = 8 and k = 200 achieves about the same test accuracies as using the original data. Space: 70MB (350,000 × 200).
A Good Practice

1. Engineering:
   Webspam, unigram, 254 dim  ⇒  3-gram, 16M dim, 4000 nonzeros per doc
   Accuracy: 93.3% ⇒ 99.6%

2. Probability/Statistics:
   16M dim, 4000 nonzeros  ⇒  k = 200 nonzeros, 2^b × k = 51,200 dim
   Accuracy: 99.6% ⇒ 99.6%
Training Time

[Figure: training time (sec) vs. C for linear SVM with k = 200 on Webspam; b-bit hashed data vs. the original data.]
• The reported times do not include data loading time (which is small for b-bit hashing).
• The original training time is about 200 seconds.
• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).
Testing Time

[Figure: testing time (sec) vs. C for linear SVM with k = 200 on Webspam.]
However, here we assume the test data have already been processed.
The Problem of Expensive Preprocessing
200 or 500 permutations on the entire data can be very expensive.
Particularly a serious issue when the new testing data have not been processed.
Two solutions:
1. Parallel solution by GPUs: Achieved up to 100-fold improvement in speed.
Ref: Li, Shrivastava, König, GPU-Based Minwise Hashing, WWW’12 (poster)
2. Statistical solution: Only one permutation is needed. Even more accurate.
Ref: Li, Owen, Zhang, One Permutation Hashing, NIPS’12
(although the problem was not fully solved due to the existence of empty bins)
Intuition: Minwise Hashing Ought to Be Wasteful

Original Data Matrix
          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
S1 :      0  1  0  0  1  1  0  0  1  0  0  0  0  0  0  0
S2 :      0  0  0  0  0  0  0  0  1  0  1  0  1  0  1  0
S3 :      0  0  0  1  0  0  1  1  0  0  0  0  0  0  1  0

Permuted Data Matrix
          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
π(S1):    0  0  1  0  1  0  0  1  0  0  0  0  0  1  0  0
π(S2):    1  0  0  1  0  0  1  0  0  0  0  0  0  1  0  0
π(S3):    1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0

Only store the minimums and repeat the process k (e.g., 500) times.
One Permutation Hashing

S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). After one permutation:

    π(S1) = {2, 4, 7, 13},   π(S2) = {0, 3, 6, 13},   π(S3) = {0, 1, 10, 12}

Bin:        1           2           3           4
         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
π(S1):   0  0  1  0  1  0  0  1  0  0  0  0  0  1  0  0
π(S2):   1  0  0  1  0  0  1  0  0  0  0  0  0  1  0  0
π(S3):   1  1  0  0  0  0  0  0  0  0  1  0  1  0  0  0

One permutation hashing: divide the space Ω evenly into k = 4 bins and select the smallest nonzero (location) in each bin.

Ref: P. Li, A. Owen, C-H Zhang, One Permutation Hashing, NIPS’12
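A sketch (mine, not the paper's code) of the binning step: one permutation, k equal-width bins over Ω, and the smallest permuted location per bin (None marks an empty bin):

```python
import numpy as np

def one_permutation_hash(S, D, k, seed=0):
    """Return k values: the smallest permuted location falling into each of the
    k bins of width D/k, or None for an empty bin (assumes k divides D)."""
    rng = np.random.default_rng(seed)
    pi = rng.permutation(D)
    locs = np.sort(pi[np.array(sorted(S))])
    width = D // k
    out = [None] * k
    for v in locs:                      # locs is sorted, so the first hit per
        j = int(v) // width             # bin is the smallest element in it
        if out[j] is None:
            out[j] = int(v)
    return out

print(one_permutation_hash({1, 4, 5, 8}, D=16, k=4))
```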
An Unbiased Estimator of Resemblance R

    Estimator:     R̂_mat = \frac{N_{mat}}{k − N_{emp}}

    Expectation:   E(R̂_mat) = R

    Variance:      Var(R̂_mat) < \frac{R(1 − R)}{k}

One-permutation hashing has smaller variance than k-permutation hashing.
Experimental Results on Webspam Data

[Figure: test accuracy (%) vs. C for linear SVM with k = 256 (left) and k = 512 (right), b = 1, 2, 4, 6, 8, comparing the original data, one permutation hashing, and k-permutation hashing on Webspam.]

One permutation hashing (zero coding) is even slightly more accurate than k-permutation hashing (at merely 1/k of the original preprocessing cost).
Summary of Contributions
1. b-bit minwise hashing: Use only the lowest b bits of hashed value, as
opposed to 64 bits in the standard industry practice.
2. Linear kernel learning: b-bit minwise hashing results in a positive definite
linear kernel. The data dimensionality is reduced from 2^64 to 2^b.
3. One permutation hashing: It reduces the expensive and energy-consuming
process of (e.g.,) 500 permutations to merely one permutation. The results
are even more accurate.
4. Others: For example, b-bit minwise hashing naturally leads to a
space-partition scheme for sub-linear time near-neighbor search.
————
All these are accomplished by only doing basic statistics/probability
Dense or Non-binary data? Back to Random Projections
Ref: Ping Li, Gennady Samorodnitsky, John Hopcroft, Sign Cauchy Projections
and Chi-Square Kernel, NIPS’13.
Ref: Ping Li, Michael Mitzenmacher, Anshumali Shrivastava, Coding for Random
Projections, arXiv:1308.2218, 2013
Ref: Ping Li, Cun-Hui Zhang, Tong Zhang, Compressed Counting Meets
Compressed Sensing, arXiv:1310.1076, 2013.
Compressed Sensing with α-Stable Random Projections
• In the mainstream literature, sparse signal recovery (compressed sensing) uses a Gaussian design matrix. The Gaussian is α-stable with α = 2.
• We recently discovered that it is a good idea to use small α.
• With a small α, the decoding algorithm becomes extremely simple, very fast,
and robust (against measurement noise), no L1 , no RIP, no phase transition.
Simulations with M = (1/5) K log N Measurements

[Figure: reconstructed signal values vs. coordinates. Left: Min+Gap(3), K = 30, time = 0.31 sec. Right: LP decoding, K = 30, ζ = 5, time = 137.73 sec.]

• Left panel: our method using an α = 0.03 design matrix.
• Right panel: classical linear programming using a Gaussian design matrix.
Comparisons in Decoding Times

[Figure: median decoding time (sec) vs. ζ (# measurements = M0/ζ), for LP, OMP, and our methods (1), (2), (3); Gaussian signal, N = 100,000, K = 100. Number of measurements = (1/ζ) × K log N.]
An Example of a “Single-Pixel Camera” Application

The task is to take a picture using linear combinations of measurements. When the scene is sparse, our method can efficiently recover the nonzero components.

[Figure: the original test image.]

In reality, natural pictures are often not as sparse, but in important scenarios, for example, the differences between consecutive frames of surveillance cameras are usually very sparse because the background remains still.
[Figure: the original image and reconstructions by Min+Gap(3) with ζ = 1 (time 11.63 sec), ζ = 3 (time 4.61 sec), and ζ = 5 (time 4.86 sec).]
[Figure: reconstructions by Min+Gap(3) with ζ = 10 (time 5.45 sec) and ζ = 15 (time 4.94 sec).]

When ζ = 15, M ≈ K in this case, but the reconstructed image by our method could still be informative.

What we have seen so far is merely the tip of the iceberg in this line of work.
Conclusion
• We live in the “Big Data” era. Researchers & practitioners often don’t have enough knowledge of the fundamental data generation process, which fortunately is often compensated by the capability of quickly collecting lots of data.
• Approximate answers often suffice, for many statistical problems.
• Hashing becomes crucial in important applications, for example, search
engines, as they have lots of data and need the answers quickly & cheaply.
• Binary data are special and crucial. In most applications, one can expand
the features to extremely high-dimensional binary vectors.
• Probabilistic hashing is basically clever sampling.
• Numerous interesting research problems. Statisticians can make great
contributions in this area. We just need to treat them as statistical problems.
MNIST: Digit Image Data and Global Expansion

                  Dim.      Time          Accuracy
Original          768       80 sec        92.11%
Binary original   768       30 sec        91.25%
Pairwise          295,296   400 sec       98.33%
Binary pairwise   295,296   400 sec       97.88%
Kernel            768       Many Hours    98.6%

(It is not easy to achieve the well-known 98.6% accuracy; one must use very precise kernel parameters.)
Experimental Results on L2-Regularized Logistic Regression

Testing Accuracy

[Figure: test accuracy (%) vs. C for L2-regularized logistic regression on Webspam, six panels with k = 50, 100, 150, 200, 300, 500 and b = 1, 2, 4, 6, 8, 10, 16.]
Training Time

[Figure: training time (sec) vs. C for L2-regularized logistic regression on Webspam, six panels with k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]
Testing the Robustness of Zero Coding on the 20Newsgroups Data

For the Webspam data, the # of empty bins may be too small to make the algorithm fail.

The 20Newsgroups (News20) dataset has merely 20,000 samples in about one million dimensions, with on average about 500 nonzeros per sample.

Therefore, News20 is merely a toy example to test the robustness of our algorithm. In fact, we let k be as large as 4096, i.e., most of the bins are empty.
20Newsgroups: One Permutation vs. k-Permutation (SVM)

[Figure: test accuracy (%) vs. C for linear SVM on News20, panels with k = 8, 16, 32, 64, 128, 256 and b = 1, 2, 4, 6, 8, comparing the original data, one permutation hashing, and k-permutation hashing.]
[Figure: test accuracy (%) vs. C for linear SVM on News20, panels with k = 512, 1024, 2048, 4096 and b = 1, 2, 4, 6, 8, comparing the original data, one permutation hashing, and k-permutation hashing.]

One permutation hashing with zero coding can reach the original accuracy of 98%.
The original k-permutation hashing can only reach 97.5%, even with k = 4096.
20Newsgroups: One Permutation vs. k-Permutation (Logistic Regression)

[Figure: test accuracy (%) vs. C for logistic regression on News20, panels with k = 32, 64, 128, 256, 512, 1024, 2048, 4096 and b = 1, 2, 4, 6, 8, comparing the original data, one permutation hashing, and k-permutation hashing.]
Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML’09; not the VW online learning platform), which has the same variance as random projections and is not limited to binary data.

Consider two sets S1, S2, with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and
R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a).  (a = 0 ⇔ R = 0.)  Then, from their variances

    Var(â_{VW}) ≈ \frac{1}{k} \left( f1 f2 + a^2 \right),

    Var(R̂_{MINWISE}) = \frac{1}{k} R(1 − R),

we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).
Comparing b-Bit Minwise Hashing with VW

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 ∼ 10^6.

[Figure: test accuracy (%) vs. k for VW vs. b = 8 hashing on Webspam, for linear SVM (left) and logistic regression (right), with C = 0.01, 0.1, 1, 10, 100.]
8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: training time (sec) vs. k for VW vs. b = 8 hashing on Webspam, for linear SVM (left) and logistic regression (right), with C = 0.01 to 100.]
Advantages of One Permutation Hashing

[Figure: the permuted 0/1 data matrix for π(S1), π(S2), π(S3), divided into k = 4 bins.]
• Computationally much more efficient and energy-efficient.
• Testing unprocessed data is much faster, crucial for user-facing applications.
• Implementation is easier, from the perspective of random number generation; e.g., storing a permutation vector of length D = 10^9 (4GB) is not a big deal.
• One permutation hashing is a much better matrix sparsification scheme.
Definitions of N_emp and N_mat

    N_{emp} = \sum_{j=1}^{k} I_{emp,j},       N_{mat} = \sum_{j=1}^{k} I_{mat,j}

where I_{emp,j} and I_{mat,j} are defined for the j-th bin as

    I_{emp,j} = 1 if both π(S1) and π(S2) are empty in the j-th bin; 0 otherwise.

    I_{mat,j} = 1 if both π(S1) and π(S2) are not empty and the smallest element of π(S1) matches the smallest element of π(S2), in the j-th bin; 0 otherwise.
Number of Jointly Empty Bins N_emp

Notation: f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, f = |S1 ∪ S2| = f1 + f2 − a.

Distribution:

    Pr(N_{emp} = j) = \sum_{s=0}^{k−j} (−1)^s \frac{k!}{j!\, s!\, (k − j − s)!} \prod_{t=0}^{f−1} \frac{D\left(1 − \frac{j+s}{k}\right) − t}{D − t}

Expectation:

    \frac{E(N_{emp})}{k} = \prod_{j=0}^{f−1} \frac{D\left(1 − \frac{1}{k}\right) − j}{D − j} \le \left(1 − \frac{1}{k}\right)^{f}

f/k = 5  ⇒  E(N_{emp})/k ≈ 0.0067 (negligible).
Empirical Validation

Empirical MSE (= bias^2 + variance) for 2 pairs of binary data vectors (the word pairs “OF–AND” and “RIGHTS–RESERVED”).

[Figure: MSE(R̂_mat) vs. k, empirical vs. theoretical, compared with k-permutation minwise hashing.]
One Permutation Hashing for Linear Learning
Almost the same as k -permutation hashing, but we must deal with empty bins *.
Our solution: Encode empty bins as (e.g.,) 00000000 in the expanded space.
Zero Coding Example for One Permutation with k = 4 Bins

For set (vector) S1 (the original high-dimensional binary feature vector):

Hashed values       :  113        264          1091          ∗
Binary              :  1110001    100001000    10001000011   ∗
Lowest b = 2 bits   :  01         00           11            ∗
Decimal values      :  1          0            3             ∗
Expansions (2^b)    :  0100       1000         0001          ∗

(∗ denotes an empty bin, encoded as all zeros in the expanded space.)

New feature vector: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] × 1/√(4 − 1)
Same procedures on sets S2 , S3 , ...
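Continuing the one permutation hashing sketch from earlier (my own code, not the authors'): empty bins become all-zero blocks and the expanded vector is rescaled by 1/sqrt(k − N_emp):

```python
import numpy as np

def zero_coded_features(oph_values, b):
    """Expand one-permutation hashed values (None = empty bin) into a vector of
    length 2^b * k; empty bins stay all-zero; scale by 1/sqrt(k - N_emp)."""
    k = len(oph_values)
    x = np.zeros(k * 2**b)
    n_emp = 0
    for j, z in enumerate(oph_values):
        if z is None:
            n_emp += 1                                    # zero coding of an empty bin
        else:
            x[j * 2**b + (int(z) & ((1 << b) - 1))] = 1.0
    return x / np.sqrt(k - n_emp)

# Three nonempty bins out of k = 4, as in the example above:
print(zero_coded_features([1, 0, 3, None], b=2))
```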
b-Bit Minwise Hashing for Efficient Near Neighbor Search
Near neighbor search is a much more frequent operation than training an SVM.
The bits from b-bit minwise hashing can be directly used to build hash tables to
enable sub-linear time near neighbor search.
This is an instance of the general family of locality-sensitive hashing (LSH).
An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^{2×2} = 16.

Table 1:                            Table 2:
Index    Data Points                Index    Data Points
00 00    8, 13, 251                 00 00    2, 19, 83
00 01    5, 14, 19, 29              00 01    17, 36, 129
00 10    (empty)                    00 10    4, 34, 52, 796
...                                 ...
11 01    7, 24, 156                 11 01    7, 198
11 10    33, 174, 3153              11 10    56, 989
11 11    61, 342                    11 11    8, 9, 156, 879

Then, the data points are placed in the buckets according to their hashed values.
Look for near neighbors in the bucket which matches the hash value of the query.
Replicate the hash table (twice in this case) so that good near neighbors are not missed.
The final retrieved data points are the union of the buckets.
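A sketch (mine, not the paper's code) of building L such hash tables and querying them; each bucket index concatenates k b-bit values into B = b × k bits:

```python
from collections import defaultdict

def bucket_index(codes, b):
    """Concatenate k b-bit hash values into one (b*k)-bit table index."""
    index = 0
    for h in codes:
        index = (index << b) | (h & ((1 << b) - 1))
    return index

def build_tables(hash_codes, b, L):
    """hash_codes[i][l]: list of k b-bit values of data point i for table l."""
    tables = [defaultdict(list) for _ in range(L)]
    for i, per_table in enumerate(hash_codes):
        for l in range(L):
            tables[l][bucket_index(per_table[l], b)].append(i)
    return tables

def query(tables, query_codes, b):
    """Candidates = union of the buckets matching the query's hash values."""
    candidates = set()
    for l, table in enumerate(tables):
        candidates.update(table.get(bucket_index(query_codes[l], b), []))
    return candidates
```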
Experiments on the Webspam Dataset

B = b × k: table size.    L: number of tables.

Fractions of retrieved data points:

[Figure: fraction of data points evaluated vs. B (bits), for L = 25, 50, 100 tables, comparing SRP and b-bit hashing with b = 1, 2, 4.]

SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar numbers of data points.
————————
Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary
Data, ECML’12
Precision-recall curves (the higher the better) for retrieving the top-T data points.

[Figure: precision (%) vs. recall (%) on Webspam for T = 5, 10, 50, with B = 16 and B = 24, L = 100, comparing SRP with b-bit hashing (b = 1, 4).]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections), for table sizes B = 24 and B = 16.
Experimental Results on Rcv1 (200GB)

Test accuracy using linear SVM (cannot train on the original data):

[Figure: test accuracy (%) vs. C for linear SVM on rcv1, panels with k = 50, 100, 150, 200, 300, 500 and b = 1, 2, 4, 8, 12, 16.]
Training time using linear SVM:

[Figure: training time (sec) vs. C for linear SVM on rcv1, panels with k = 50, 100, 150, 200, 300, 500 and b = 8, 12, 16.]
Test accuracy using logistic regression:

[Figure: test accuracy (%) vs. C for logistic regression on rcv1, panels with k = 50, 100, 150, 200, 300, 500 and b = 1, 2, 4, 8, 12, 16.]
Training time using logistic regression:

[Figure: training time (sec) vs. C for logistic regression on rcv1, panels with k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]
Comparisons with VW on Rcv1

Test accuracy using linear SVM:

[Figure: test accuracy (%) vs. k for VW vs. b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, linear SVM, with C = 0.01, 0.1, 1, 10.]
Test accuracy using logistic regression:

[Figure: test accuracy (%) vs. k for VW vs. b-bit hashing (b = 1, 2, 4, 8, 12, 16) on rcv1, logistic regression, with C = 0.01, 0.1, 1, 10.]
Training time:

[Figure: training time (sec) vs. k for VW vs. 8-bit hashing on rcv1, linear SVM (left) and logistic regression (right), with C = 0.01 to 100.]
96