CS 361A Lecture 18 (Nov 30, 2005): Fingerprints, Min-Hashing, and Document Similarity

CS 361A
(Advanced Data Structures and Algorithms)
Lecture 18
(Nov 30, 2005)
Fingerprints, Min-Hashing, and
Document Similarity
CS 361A
Rajeev Motwani
Game Plan for Week
Fingerprints
Document Similarity
Shingling
Min-Hashing
Min-Wise Independent Permutations
Fingerprints
 W – set of large objects (e.g., URLs)
Goal
avoid storing large objects explicitly
quick-and-dirty equality-testing
Fingerprints?
Short tags for objects
Distinct fingerprints ⇒ distinct objects
Distinct objects ⇒ probably distinct fingerprints
Formalization
Fingerprint length $k$ ⇒ fingerprint space size $N = 2^k$
Fingerprint function family
$F = \{ f : W \to \{0,1\}^k \}$
Random $f \in_R F$
$f(A) \ne f(B) \Rightarrow A \ne B$
Collisions: $P[ f(A) = f(B) \mid A \ne B ]$ is small (ideally $2^{O(-k)}$)
Typical Application
Adversarial object-set $S$ with $|S| = n \ll 2^k$
Goal – $|f(S)| = |S|$ with high probability
$n^2$ pairwise collisions possible ⇒ need $2^k > n^2$
(to avoid Birthday Paradox)
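The Birthday Paradox requirement can be checked numerically. The sketch below assumes idealized fingerprints that are independent and uniform over $2^k$ values; the function names are hypothetical, chosen for illustration.

```python
from fractions import Fraction

def union_bound(n: int, k: int) -> Fraction:
    """Upper bound on P[some pair collides]: (n choose 2) / 2^k."""
    return Fraction(n * (n - 1) // 2, 2 ** k)

def exact_collision_probability(n: int, k: int) -> float:
    """P[some pair collides] for n independent, uniform k-bit fingerprints
    (an idealized model, not a property of any particular scheme)."""
    space = 2 ** k
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (space - i) / space
    return 1.0 - p_distinct

# With n = 2^16 objects and only k = 32 bits (so 2^k = n^2),
# collisions are still quite likely.
print(float(union_bound(2 ** 16, 32)))
print(exact_collision_probability(2 ** 16, 32))
```

With $2^k = n^2$ the union bound is just under 1/2, which is why fingerprint lengths are usually chosen with $2^k$ comfortably larger than $n^2$.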
Example – URL Fingerprints
 Search Engines
Manage large numbers of URL strings
Long, variable strings (embedded objects/database-queries)
 Desiderata
small/fixed-length encodings – hopefully, unique
Some scenarios
o Exact string irrelevant
o Only need ability to distinguish distinct URLs
Even otherwise, unique IDs useful for indexing
 Numbers?
4 billion webpages ⇒ $n = 2^{32}$
$N \ge n^2$ ⇒ $k = 64$
Fingerprints ⇒ 8-byte representation
Fingerprinting vs Hashing
Hashing $h : W \to \{0,1\}^k$
Set Membership testing for set S of size n
Desire uniform distribution over bin addresses $\{0,1\}^k$
Minimize collisions per bin – reduce lookup time
Minimize hash table size ⇒ $n \approx N = 2^k$
Fingerprinting $f : W \to \{0,1\}^k$
Object Equality testing over set S of size n
Distribution over $\{0,1\}^k$ is irrelevant
Avoid collisions altogether
Tolerate larger k – typically $N > n^2$
Fingerprinting Strings
 Typical Application – but techniques extend to
combinatorial objects (database tuples, trees/graphs)
 Obvious techniques
Checksum – no worst-case collision probability guarantees
MD5 – cryptographically-secure string hashes
o relatively slow
o avoids leaking information about original string
 Rabin’s Scheme
Algebraic technique – polynomial arithmetic
Efficient – need (1 table lookup + 1 xor + 1 shift) per byte
other nice properties…
Rabin Fingerprints
 Consider – m-bit string $A = a_1 a_2 \ldots a_m$
 Assume – $a_1 = 1$ and fixed-length strings (wlog)
 Encoding Strings
Polynomials of degree $m-1$ over $Z_2$
$A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
 Fingerprints
$P(x)$: random, irreducible degree-$k$ polynomial over $Z_2$
(easy to sample such polynomials)
irreducible = cannot be factored: $x^2+x+1$ is irreducible, whereas $x^2+1 = (x+1)^2$ factors
$f(A) = A(x) \bmod P(x)$
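A minimal sketch of the scheme above, not a production implementation. It assumes polynomials over $Z_2$ are stored as Python integers (bit i is the coefficient of $x^i$) and samples $P(x)$ by rejection, using Ben-Or's irreducibility test as one possible way to do so. All function names are hypothetical, and the bit-by-bit reduction ignores the table-lookup speedup mentioned earlier.

```python
import random

def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (p must be nonzero)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)  # cancel the leading term of a
    return a

def poly_mulmod(a: int, b: int, p: int) -> int:
    """a(x) * b(x) mod p(x) over Z_2, by shift-and-add."""
    a = poly_mod(a, p)
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a = poly_mod(a << 1, p)
    return result

def poly_gcd(a: int, b: int) -> int:
    """Greatest common divisor of two polynomials over Z_2."""
    while b:
        a, b = b, poly_mod(a, b)
    return a

def is_irreducible(p: int) -> bool:
    """Ben-Or test: p of degree k is irreducible iff
    gcd(x^(2^i) - x, p) = 1 for every 1 <= i <= k/2."""
    k = p.bit_length() - 1
    if k < 1:
        return False
    x = 0b10
    t = x
    for _ in range(k // 2):
        t = poly_mulmod(t, t, p)  # t = x^(2^i) mod p
        if poly_gcd(p, t ^ x) != 1:
            return False
    return True

def random_irreducible(k: int, rng: random.Random) -> int:
    """Rejection-sample a degree-k irreducible polynomial (constant term 1)."""
    while True:
        p = (1 << k) | rng.getrandbits(k) | 1
        if is_irreducible(p):
            return p

def rabin_fingerprint(data: bytes, p: int) -> int:
    """f(A) = A(x) mod P(x); a leading 1 bit is prepended so that a_1 = 1."""
    a = (1 << (8 * len(data))) | int.from_bytes(data, "big")
    return poly_mod(a, p)

P = random_irreducible(16, random.Random(0))  # k = 16 only to keep the demo small
print(hex(rabin_fingerprint(b"http://example.com/", P)))
```

Roughly one in k random degree-k polynomials is irreducible, so the rejection loop is expected to finish after a modest number of tries.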
Analysis
 Fix $S$ – $n$ strings of length $m$
 Consider $Q_S(x) = \prod_{A \ne B \in S} (A(x) - B(x))$
Collision $f(A) = f(B)$ ⇔ $A(x) \equiv B(x) \pmod{P(x)}$ ⇒ $Q_S(x) \equiv 0 \pmod{P(x)}$
Therefore – $P(x)$ is a factor of $Q_S(x)$
 Collision Probability?
$\deg(Q_S) < n^2 m$
number of irreducible degree-$k$ factors of $Q_S(x)$ is $< n^2 m / k$
Fact: number of irreducible degree-$k$ polynomials $> (2^k - 2^{k/2+1})/k$
Prob[random $P(x)$ divides $Q_S(x)$] $< n^2 m / (2^k - 2^{k/2+1}) \approx n^2 m / 2^k$
 Prob[fingerprints not distinct] $\lesssim \frac{n^2 m}{2^k}$
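To get a feel for the bound $n^2 m / 2^k$ from the analysis above, here is a small arithmetic sketch. The sample values ($n = 2^{20}$ strings of $m = 2^{13}$ bits, $k = 64$) are illustrative assumptions, not figures from the lecture, and the function names are hypothetical.

```python
from fractions import Fraction

def rabin_collision_bound(n: int, m: int, k: int) -> Fraction:
    """The bound n^2 * m / 2^k on P[some two of the n fingerprints coincide]."""
    return Fraction(n * n * m, 2 ** k)

def min_fingerprint_bits(n: int, m: int, eps: Fraction) -> int:
    """Smallest k for which the bound n^2 * m / 2^k is at most eps."""
    k = 0
    while rabin_collision_bound(n, m, k) > eps:
        k += 1
    return k

# Illustrative values only: 2^20 strings of 2^13 bits each, 64-bit fingerprints.
print(rabin_collision_bound(2 ** 20, 2 ** 13, 64))
print(min_fingerprint_bits(2 ** 20, 2 ** 13, Fraction(1, 10 ** 6)))
```

Each extra fingerprint bit halves the bound, so the required k grows only logarithmically in n, m, and the target failure probability.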
Beneficial Properties
Hardware-level implementation
$Z_2$-polynomials same as bit strings
simple shift-register operations
Distributivity – $f(A+B) = f(A) + f(B)$ over $Z_2$
Let $\circ$ = concatenation
$f(A \circ B) = f(f(A) \circ B)$
$f(A \circ B) = A(x) \cdot x^{m} + B(x) \bmod P(x)$, where $m$ is the length of $B$
Fingerprint sliding windows over strings –
low incremental cost
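The properties above can be checked on small inputs. This sketch assumes bit strings stored as Python integers and uses the fixed polynomial $x^8 + x^4 + x^3 + x + 1$ (irreducible over $Z_2$) purely for illustration, where the scheme itself calls for a random $P(x)$. The helper names are hypothetical, and the leading-1 convention is dropped for simplicity.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (bit i = coefficient of x^i)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a

P = 0x11B  # x^8 + x^4 + x^3 + x + 1; a fixed illustrative choice, not a random P(x)

def f(a: int) -> int:
    """Fingerprint of the bit string a, read as a polynomial over Z_2."""
    return poly_mod(a, P)

def concat(a: int, b: int, b_len: int) -> int:
    """Bit string a followed by the b_len-bit string b."""
    return (a << b_len) | b

def slide(fp: int, out_bit: int, in_bit: int, w: int) -> int:
    """Fingerprint of a w-bit window after dropping its first bit (out_bit)
    and appending in_bit, computed from the old fingerprint fp."""
    fp ^= poly_mod(out_bit << (w - 1), P)   # remove out_bit * x^(w-1)
    return poly_mod((fp << 1) | in_bit, P)  # multiply by x, add the new bit

a, b = 0b1011001110001111, 0b0110101011110000
print(f(a ^ b) == f(a) ^ f(b))                         # addition over Z_2 is XOR
print(f(concat(a, b, 16)) == f(concat(f(a), b, 16)))   # concatenation property
```

The `slide` step costs a constant number of shifts and XORs per position, which is what makes fingerprinting every window of a long string cheap.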
Duplicate Document Detection
 Problem
Given – large collection of arbitrary documents
Identify – near-duplicate documents
 Web search engines
Proliferation of near-duplicate documents
o Legitimate – mirrors, local copies, updates, …
o Malicious – spam, spider-traps, dynamic URLs, …
o Mistaken – spider errors
30% of web-pages are near-duplicates [Broder et al 1997]
Cost – RAM/disk, search quality, unhappy users
Enterprise search – even larger amount of duplication
SCAM – plagiarism detection [Shivakumar et al 1998]
Natural Approaches
 Fingerprinting?
only works for exact matches
here – must identify even near-duplicates
 Random Sampling?
sample substrings (phrases, sentences, etc)
hope: similar documents ⇒ similar samples
No – even samples of same document will differ
 Edit-distance?
metric for approximate string-matching
expensive – even for one pair of strings
impossible – for $10^{32}$ web documents
Desiderata
Storage
only small sketches of each document.
Computation
O(n log n) time on n documents
Stream Processing
once sketch computed, source is unavailable
Error Guarantees
problem scale ⇒ small biases have large impact
need formal guarantees – heuristics will not do
Basic Idea [Broder 1997]
Shingling
dissect document into q-grams (shingles)
represent documents by shingle-sets
near-duplicates ⇔ intersection of shingle-sets is large
reduce problem to set intersection
Set Intersection
fingerprints of shingles
min-hash to estimate intersection sizes
Shingling
 Shingle – q contiguous tokens/words (q-gram)
 Consider following “document”
a rose is a rose is a rose
 Choose q=4 ⇒ get multi-set of shingles
a rose is a
rose is a rose
is a rose is
a rose is a
rose is a rose
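The shingle list above can be reproduced with a few lines of code. This is a sketch that assumes whitespace tokenization and represents the multi-set as a `Counter`; the function name `shingles` is hypothetical.

```python
from collections import Counter

def shingles(text: str, q: int) -> Counter:
    """Multi-set of q-grams (q contiguous whitespace-separated tokens)."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1))

for shingle, count in shingles("a rose is a rose is a rose", 4).items():
    print(count, shingle)
```

Taking `set(shingles(...))` instead discards multiplicities and yields the plain shingle-set.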
Documents → Sets of 64-bit fingerprints
Doc → [shingling] → Multiset of Shingles → [fingerprint] → Multiset of Fingerprints
Fingerprints?
• Use Rabin fingerprints
• Fingerprint space U = [0, …, N-1]
• In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$
• Result – uniformity in length of strings
Similarity of Documents
Doc A → $S_A$, Doc B → $S_B$
• Jaccard measure – similarity of $S_A, S_B \subseteq U = [0 \ldots N-1]$
$\mathrm{sim}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$
• Claim: A & B are near-duplicates if $\mathrm{sim}(S_A, S_B)$ is high
• Claim: A is contained in B if $\mathrm{con}(S_A, S_B)$ is high
$\mathrm{con}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A|}$
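Both measures are one-liners on sets. The sketch below ignores multiplicities, and the values returned for empty inputs are a convention assumed here, not something the lecture specifies.

```python
def sim(sa: set, sb: set) -> float:
    """Jaccard similarity |SA ∩ SB| / |SA ∪ SB| (assumed 1.0 if both are empty)."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def con(sa: set, sb: set) -> float:
    """Containment |SA ∩ SB| / |SA| of A in B (assumed 1.0 if SA is empty)."""
    return len(sa & sb) / len(sa) if sa else 1.0

small, large = {1, 2}, {1, 2, 3, 4}
print(sim(small, large), con(small, large), con(large, small))
```

Containment is asymmetric: a short document copied into a long one has high containment in it, while their similarity stays moderate.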
Remarks
 Multiplicities of q-grams – could retain or ignore
trade-off efficiency with precision
 Shingle Size $q \in [3 \ldots 10]$
Short shingles ⇒ increase similarity of unrelated documents
o With q=1, $\mathrm{sim}(S_A,S_B) = 1$ ⇔ A is permutation of B
o Need larger q to sensitize to permutation changes
Long shingles ⇒ small random changes have larger impact
 Similarity Measure
Similarity is non-transitive, non-metric
But – dissimilarity $1 - \mathrm{sim}(S_A,S_B)$ is a metric [Charikar 02]
 [Ukkonen 92] – relate q-gram & edit-distance
Example
 A = “a rose is a rose is a rose”
 B = “a rose is a flower which is a rose”
 Preserving multiplicity
q=1 ⇒ $\mathrm{sim}(S_A,S_B) = 0.7$
o $S_A$ = {a, a, a, is, is, rose, rose, rose}
o $S_B$ = {a, a, a, is, is, rose, rose, flower, which}
q=2 ⇒ $\mathrm{sim}(S_A,S_B) = 0.5$
q=3 ⇒ $\mathrm{sim}(S_A,S_B) = 0.3$
 Disregarding multiplicity
q=1 ⇒ $\mathrm{sim}(S_A,S_B) = 0.6$
q=2 ⇒ $\mathrm{sim}(S_A,S_B) = 0.5$
q=3 ⇒ $\mathrm{sim}(S_A,S_B) = 3/7 \approx 0.4286$
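The six values in this example can be recomputed directly. The sketch assumes whitespace tokenization and uses multiset intersection and union (minimum and maximum of counts) when multiplicity is preserved; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def shingle_counts(text: str, q: int) -> Counter:
    """Multi-set of q-grams over whitespace-separated tokens."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1))

def multiset_sim(ca: Counter, cb: Counter) -> Fraction:
    """Jaccard similarity with multiplicities (& takes min counts, | takes max)."""
    return Fraction(sum((ca & cb).values()), sum((ca | cb).values()))

def set_sim(ca: Counter, cb: Counter) -> Fraction:
    """Jaccard similarity of the distinct shingles only."""
    sa, sb = set(ca), set(cb)
    return Fraction(len(sa & sb), len(sa | sb))

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
for q in (1, 2, 3):
    ca, cb = shingle_counts(A, q), shingle_counts(B, q)
    print(q, multiset_sim(ca, cb), set_sim(ca, cb))
```

Exact fractions are used so that the q=3 set value appears as 3/7 and is not rounded.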
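Min-Hashing
 Consider
$S_A, S_B \subseteq U$
Pick – random permutation π of U
Define $\alpha = \pi^{-1}(\min\{\pi(S_A)\})$ and $\beta = \pi^{-1}(\min\{\pi(S_B)\})$
Meaning? – minimal element under permutation π
 Lemma: $P[\alpha = \beta] = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$
Let $\delta = \min\{\pi(S_A \cup S_B)\}$
Claim: $\alpha = \beta \iff \pi^{-1}(\delta) \in S_A \cap S_B$
Clearly $P[\pi^{-1}(\delta) \in S_A \cap S_B] = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$, since $\pi^{-1}(\delta)$ is uniformly distributed over $S_A \cup S_B$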
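For a tiny universe the lemma can be verified exactly by enumerating every permutation. This is an illustrative sketch: the universe size 6 and the two sets are arbitrary choices, and a permutation is represented as a tuple with `pi[x]` the image of `x`.

```python
from fractions import Fraction
from itertools import permutations

def min_hash(s, pi):
    """Element of s whose image under the permutation pi is smallest."""
    return min(s, key=lambda x: pi[x])

def collision_probability(sa, sb, universe_size: int) -> Fraction:
    """Exact P[min-hash of sa = min-hash of sb] over all permutations of the universe."""
    perms = list(permutations(range(universe_size)))
    hits = sum(min_hash(sa, pi) == min_hash(sb, pi) for pi in perms)
    return Fraction(hits, len(perms))

SA, SB = {0, 1, 2, 3}, {2, 3, 4}
print(collision_probability(SA, SB, 6))     # compare with the Jaccard ratio below
print(Fraction(len(SA & SB), len(SA | SB)))
```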
Min-Hashing
 Similarity Sketches
Succinct representation of fingerprint sets SA
Allows efficient estimation of sim(SA,SB)
Basic idea – use min-hash of fingerprints
 $sk(A)$ = $k$ minimal elements under $\pi(S_A)$
 Claim: $E[ \mathrm{sim}(sk(A), sk(B)) ] = \mathrm{sim}(S_A, S_B)$
For each $\alpha \in sk(A) \cup sk(B)$
$P[\alpha \in sk(A) \cap sk(B)] = P[\pi^{-1}(\alpha) \in S_A \cap S_B] = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$
 Observe
sketch-similarity is unbiased estimator of similarity
reducing variance – use larger k
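One way to turn the k-minimal-element sketches into an estimate is sketched below. It assumes the variant that looks only at the k smallest values of the merged sketches, as in Broder's resemblance paper listed in the references; the slide states the idea more tersely. The permutation is an explicit shuffled list over a small universe, which is feasible only for a demonstration, and the function names are hypothetical.

```python
import random

def make_permutation(universe_size: int, seed: int) -> list:
    """A random permutation pi of [0, universe_size): pi[x] is the image of x."""
    pi = list(range(universe_size))
    random.Random(seed).shuffle(pi)
    return pi

def sketch(s, pi, k: int) -> list:
    """sk(S): the k smallest values of pi(S)."""
    return sorted(pi[x] for x in s)[:k]

def estimate_sim(sk_a, sk_b, k: int) -> float:
    """Among the k smallest values of sk_a ∪ sk_b, the fraction present in both."""
    a, b = set(sk_a), set(sk_b)
    smallest = sorted(a | b)[:k]
    if not smallest:
        return 1.0  # assumed convention for two empty sketches
    return sum(v in a and v in b for v in smallest) / len(smallest)

pi = make_permutation(2000, seed=0)
A, B = set(range(0, 1000)), set(range(500, 1500))  # true similarity is 1/3
k = 200
print(estimate_sim(sketch(A, pi, k), sketch(B, pi, k), k))
```

The k smallest merged values are exactly the k smallest values of $\pi(S_A \cup S_B)$, so each one is a sample from the union that either does or does not fall in the intersection.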
Remarks
Implementation
shingle/fingerprint/sketch document in streams
Issue – cost of pairwise comparison of sketches?
o cluster sketch-streams [Broder et al, Guha et al]
o Open? – hashing sketches to identify similarity
[Broder-Mitzenmacher 99] – Min-Hash is the only unbiased estimator
[Indyk-Motwani 99] – Locality-Sensitive Hash
collisions more likely for similar items
Min-Hash is special case
Multiple Permutations
Better Variance Reduction
Instead of larger k, stick with k=1
Multiple, independent permutations
Sketch Construction
Pick p random permutations of U – $\pi_1, \pi_2, \ldots, \pi_p$
$sk(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
Claim: $E[ \mathrm{sim}(sk(A), sk(B)) ] = \mathrm{sim}(S_A, S_B)$
Earlier lemma ⇒ true for p=1
Linearity of expectations
Variance reduction – independence of $\pi_1, \ldots, \pi_p$
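A sketch of the p-permutation construction, under two assumptions: sketch similarity is taken to be the fraction of coordinates on which the two sketches agree, and the permutations are explicit shuffled lists over a small universe (not feasible for $N = 2^{64}$). Function names are hypothetical.

```python
import random

def random_permutations(universe_size: int, p: int, seed: int = 0) -> list:
    """p independent random permutations, each a list with pi[x] = image of x."""
    rng = random.Random(seed)
    perms = []
    for _ in range(p):
        pi = list(range(universe_size))
        rng.shuffle(pi)
        perms.append(pi)
    return perms

def signature(s, perms) -> list:
    """sk(S): the minimum of pi_i(S) for each of the p permutations."""
    return [min(pi[x] for x in s) for pi in perms]

def estimate_sim(sig_a, sig_b) -> float:
    """Fraction of permutations on which the two minima coincide."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

perms = random_permutations(300, p=400, seed=0)
A, B = set(range(0, 200)), set(range(100, 300))  # true similarity is 1/3
print(estimate_sim(signature(A, perms), signature(B, perms)))
```

Each coordinate is an independent trial that succeeds with probability $\mathrm{sim}(S_A, S_B)$ by the earlier lemma, so the standard deviation of the estimate shrinks like $1/\sqrt{p}$.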
Min-Wise Indep Permutations
Problem
Truly-random π over U = [0 … N-1] is infeasible
But – do we really need true randomness?
Solution
Poly-size family of permutations $F \subseteq S_N$ over U
Choosing/representing random $\pi \in F$ is easy
Min-Wise Independence (MWI) Property:
For all sets $X \subseteq U$, for all $x \in X$,
$P_{\pi \in F}[\min\{\pi(X)\} = \pi(x)] = \frac{1}{|X|}$
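The MWI property can be tested by brute force on a very small universe. The sketch below checks two illustrative families over $N = 4$: all 24 permutations, which satisfy the property, and the 4 cyclic shifts, which do not. The choice of families and the function name are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations, permutations

def is_min_wise_independent(family, universe_size: int) -> bool:
    """Check P[min pi(X) = pi(x)] = 1/|X| for every nonempty X and every x in X,
    with pi drawn uniformly from `family` (tuples with pi[x] = image of x)."""
    for size in range(1, universe_size + 1):
        for X in combinations(range(universe_size), size):
            for x in X:
                hits = sum(min(X, key=lambda y: pi[y]) == x for pi in family)
                if Fraction(hits, len(family)) != Fraction(1, size):
                    return False
    return True

N = 4
all_perms = list(permutations(range(N)))
cyclic_shifts = [tuple((x + c) % N for x in range(N)) for c in range(N)]
print(is_min_wise_independent(all_perms, N))
print(is_min_wise_independent(cyclic_shifts, N))
```

For the cyclic shifts and X = {0, 1}, element 0 is the minimum under three of the four shifts, so its probability is 3/4, not 1/2.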
Minimum-Size MWI Families
[Broder et al 98]
Upper/lower bounds of $\mathrm{lcm}(1, 2, \ldots, N)$
Problem – exponential in N
Approximate MWI Families
Relax to $P_{\pi \in F}[\min\{\pi(X)\} = \pi(x)] = \frac{1 \pm \varepsilon}{|X|}$
Non-constructive – polynomial-size
Constructive – size $N^{O(\log 1/\varepsilon)}$ [Indyk 99]
In practice – 2-universal hashes work well!
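As a sketch of the practical shortcut, the permutations can be replaced by random linear hashes $x \mapsto (a x + b) \bmod q$ for a prime $q$. The prime $2^{61} - 1$, the requirement that elements be smaller than it, and the function names are assumptions made for this example. Such hashes are 2-universal but not exactly min-wise independent, so the estimate carries no formal guarantee.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime; assumes every element is below this value

def random_linear_hashes(p: int, seed: int = 0) -> list:
    """p random pairs (a, b) defining x -> (a*x + b) mod PRIME, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(p)]

def signature(s, hashes) -> list:
    """Minimum hash value of the set under each linear hash."""
    return [min((a * x + b) % PRIME for x in s) for a, b in hashes]

def estimate_sim(sig_a, sig_b) -> float:
    """Fraction of hash functions whose minima coincide."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

hashes = random_linear_hashes(400, seed=1)
pool = random.Random(2).sample(range(10 ** 9), 300)
A, B = set(pool[:200]), set(pool[100:])  # true similarity is 100/300
print(estimate_sim(signature(A, hashes), signature(B, hashes)))
```

Storing two integers per hash function replaces storing a full permutation of U, which is what makes the approach usable when $N = 2^{64}$.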
References I
 Fingerprinting by Random Polynomials. M. Rabin. Technical Report TR-15-81, Harvard University (1981).
 Some Applications of Rabin's Fingerprinting Method. A. Broder. Sequences II (1993).
 On the Resemblance and Containment of Documents. A. Broder. SEQUENCES (1997).
 Syntactic Clustering of the Web. A. Broder, S. Glassman, M. Manasse, and G. Zweig. WWW (1997).
 Finding Near-Replicas of Documents on the Web. N. Shivakumar and H. Garcia-Molina. WebDB (1998).
 Identifying and Filtering Near-Duplicate Documents. A. Broder. CPM (2000).
References II
 Approximate String Matching with q-grams and Maximal Matches. E. Ukkonen. Theoretical Computer Science (1992).
 Completeness and Robustness Properties of Min-Wise Independent Permutations. A. Broder and M. Mitzenmacher.
 Min-Wise Independent Permutations. A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. JCSS (2000).
 A Small Approximately Min-Wise Independent Family of Hash Functions. P. Indyk. SODA (1999).
 Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. P. Indyk and R. Motwani. STOC (1998).
 Similarity Search in High Dimensions via Hashing. A. Gionis, P. Indyk, and R. Motwani. VLDB (1999).
 Similarity Estimation Techniques from Rounding Algorithms. M. Charikar. STOC (2002).