CS 361A (Advanced Data Structures and Algorithms)
Lecture 18 (Nov 30, 2005)
Fingerprints, Min-Hashing, and Document Similarity
Rajeev Motwani

Game Plan for Week
  • Fingerprints
  • Document Similarity
  • Shingling
  • Min-Hashing
  • Min-Wise Independent Permutations

Fingerprints
  $W$: set of large objects (e.g., URLs)
  Goal
    • avoid storing large objects explicitly
    • quick-and-dirty equality-testing
  Fingerprints?
    • Short tags for objects
    • Distinct fingerprints $\Rightarrow$ distinct objects
    • Distinct objects $\Rightarrow$ probably distinct fingerprints

Formalization
  Fingerprint length $k$ $\Rightarrow$ fingerprint space size $N = 2^k$
  Fingerprint function family $F = \{ f : W \to \{0,1\}^k \}$
  Random $f \in_R F$
    • $f(A) \ne f(B) \Rightarrow A \ne B$
    • Collisions: $P[\, f(A) = f(B) \mid A \ne B \,]$ should be small (ideally exponentially small in $k$, i.e. $2^{-\Omega(k)}$)
  Typical Application
    • Adversarial object-set $S$ with $|S| = n \ll 2^k$
    • Goal: $|f(S)| = |S|$ with high probability
    • about $n^2$ pairwise collisions possible $\Rightarrow$ need $2^k > n^2$ (to avoid the Birthday Paradox)

Example: URL Fingerprints
  Search Engines
    • Manage large numbers of URL strings
    • Long, variable strings (embedded objects/database-queries)
  Desiderata: small/fixed-length encodings, hopefully unique
    • Some scenarios
      o Exact string irrelevant
      o Only need ability to distinguish distinct URLs
    • Even otherwise, unique IDs useful for indexing
  Numbers?
    • 4 billion webpages $\Rightarrow n = 2^{32}$
    • $N \ge n^2 \Rightarrow k = 64$
    • Fingerprints $\Rightarrow$ 8-byte representation

Fingerprinting vs Hashing
  Hashing $h : W \to \{0,1\}^k$
    • Set membership testing for set $S$ of size $n$
    • Desire uniform distribution over bin addresses $\{0,1\}^k$
    • Minimize collisions per bin: reduces lookup time
    • Minimize hash table size $\Rightarrow n \approx N = 2^k$
  Fingerprinting $f : W \to \{0,1\}^k$
    • Object equality testing over set $S$ of size $n$
    • Distribution over $\{0,1\}^k$ is irrelevant
    • Avoid collisions altogether
    • Tolerate larger $k$: typically $N > n^2$

Fingerprinting Strings
  Typical application, but techniques extend to combinatorial objects (database tuples, trees/graphs)
  Obvious techniques
    • Checksum: no worst-case collision probability guarantees
    • MD5: cryptographic string hash (no longer considered collision-resistant)
      o relatively slow
      o avoids leaking information about original string
  Rabin's Scheme
    • Algebraic technique: polynomial arithmetic
    • Efficient: needs (1 table lookup + 1 xor + 1 shift) per byte
    • other nice properties…

Rabin Fingerprints
  Consider an $m$-bit string $A = a_1 a_2 \ldots a_m$
  Assume $a_1 = 1$ and fixed-length strings (wlog)
  Encoding
    • Strings $\Rightarrow$ polynomials of degree $m-1$ over $\mathbb{Z}_2$
    • $A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
  Fingerprints
    • $P(x)$: random, irreducible degree-$k$ polynomial over $\mathbb{Z}_2$ (easy to sample such polynomials)
    • irreducible: cannot be factored; e.g. $x^2 + x + 1$ is irreducible, whereas $x^2 + 1 = (x+1)^2$ factors
    • $f(A) = A(x) \bmod P(x)$

Analysis
  Fix $S$: $n$ strings of length $m$
  Consider $Q_S(x) = \prod_{A \ne B \in S} \big( A(x) - B(x) \big)$
  Collision $f(A) = f(B)$
    • $\Leftrightarrow A(x) \equiv B(x) \pmod{P(x)}$
    • $\Rightarrow Q_S(x) \equiv 0 \pmod{P(x)}$
    • Therefore $P(x)$ is a factor of $Q_S(x)$
  Collision Probability?
    • $\deg(Q_S) \le n^2 m$ $\Rightarrow$ number of irreducible degree-$k$ factors of $Q_S(x)$ is $< n^2 m / k$
    • Fact: the number of irreducible degree-$k$ polynomials over $\mathbb{Z}_2$ is $> (2^k - 2^{k/2+1})/k$, i.e. roughly $2^k / k$
    • $\Rightarrow$ Prob[random $P(x)$ divides $Q_S(x)$] $\lesssim n^2 m / 2^k$
  Result: Prob[fingerprints not distinct] $\lesssim \dfrac{n^2 m}{2^k}$

Beneficial Properties
  Hardware-level implementation
    • $\mathbb{Z}_2$-polynomials are the same as bit strings $\Rightarrow$ simple shift-register operations
  Distributivity: $f(A + B) = f(A) + f(B)$ over $\mathbb{Z}_2$
  Let $\circ$ denote concatenation
    • $f(A \circ B) = f(f(A) \circ B)$
    • $f(A \circ B) = \big( A(x) \cdot x^m + B(x) \big) \bmod P(x)$, where $m$ is the length of $B$
  Fingerprint sliding windows over strings: low incremental cost

Duplicate Document Detection
  Problem
    • Given: large collection of arbitrary documents
    • Identify: near-duplicate documents
  Web search engines
    • Proliferation of near-duplicate documents
      o Legitimate: mirrors, local copies, updates, …
      o Malicious: spam, spider-traps, dynamic URLs, …
      o Mistaken: spider errors
    • 30% of web-pages are near-duplicates [Broder et al 1997]
    • Cost: RAM/disk, search quality, unhappy users
  Enterprise search: even larger amount of duplication
  SCAM: plagiarism detection [Shivakumar et al 1998]

Natural Approaches
  Fingerprinting?
    • only works for exact matches
    • here we must identify even near-duplicates
  Random Sampling?
    • sample substrings (phrases, sentences, etc)
    • hope: similar documents $\Rightarrow$ similar samples
    • No: even two samples of the same document will differ
  Edit-distance?
    • metric for approximate string-matching
    • expensive, even for one pair of strings
    • impossible for $10^{32}$ web documents

Desiderata
  Storage: only small sketches of each document
  Computation: $O(n \log n)$ time on $n$ documents
  Stream Processing: once sketch computed, source is unavailable
  Error Guarantees
    • problem scale $\Rightarrow$ small biases have large impact
    • need formal guarantees: heuristics will not do

Basic Idea [Broder 1997]
  Shingling
    • dissect document into q-grams (shingles)
    • represent documents by shingle-sets
    • near-duplicates $\Rightarrow$ shingle-set intersection is large
    • reduce problem to set intersection
  Set Intersection
    • fingerprints of shingles
    • min-hash to estimate intersection sizes

Shingling
  Shingle: q contiguous tokens/words (q-gram)
  Consider the following "document"
    a rose is a rose is a rose
  Choose q=4 to get the multi-set of shingles
    a rose is a
    rose is a rose
    is a rose is
    a rose is a
    rose is a rose

Documents → Sets of 64-bit fingerprints
  Doc → (shingling) → Multiset of Shingles → (fingerprint) → Multiset of Fingerprints
  Fingerprints?
    • Use Rabin fingerprints
    • Fingerprint space $U = [0, \ldots, N-1]$
    • In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$
    • Result: uniformity in length of strings

Similarity of Documents
  Doc A → $S_A$, Doc B → $S_B$, with $S_A, S_B \subseteq U = [0 \ldots N-1]$
    • Jaccard measure, similarity of $S_A$, $S_B$:
      $\mathrm{sim}(S_A, S_B) = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
    • Claim: A & B are near-duplicates if $\mathrm{sim}(S_A, S_B)$ is high
    • Claim: A is contained in B if $\mathrm{con}(S_A, S_B)$ is high, where
      $\mathrm{con}(S_A, S_B) = \dfrac{|S_A \cap S_B|}{|S_A|}$

Remarks
  Multiplicities of q-grams: could retain or ignore
    • trade-off efficiency with precision
  Shingle Size $q \in [3 \ldots 10]$
    • Short shingles $\Rightarrow$ increase similarity of unrelated documents
      o With q=1, $\mathrm{sim}(S_A, S_B) = 1 \Leftrightarrow$ A is a permutation of B
      o Need larger q to sensitize to permutation changes
    • Long shingles $\Rightarrow$ small random changes have larger impact
  Similarity Measure
    • Similarity is non-transitive, non-metric
    • But dissimilarity $1 - \mathrm{sim}(S_A, S_B)$ is a metric [Charikar 02]
  [Ukkonen 92]: relates q-gram & edit-distance

Example
  A = "a rose is a rose is a rose"
  B = "a rose is a flower which is a rose"
  Preserving multiplicity
    • q=1: $\mathrm{sim}(S_A, S_B) = 0.7$
      o $S_A$ = {a, a, a, is, is, rose, rose, rose}
      o $S_B$ = {a, a, a,
              is, is, rose, rose, flower, which}
    • q=2: $\mathrm{sim}(S_A, S_B) = 0.5$
    • q=3: $\mathrm{sim}(S_A, S_B) = 0.3$
  Disregarding multiplicity
    • q=1: $\mathrm{sim}(S_A, S_B) = 0.6$
    • q=2: $\mathrm{sim}(S_A, S_B) = 0.5$
    • q=3: $\mathrm{sim}(S_A, S_B) = 3/7 \approx 0.4286$

Min-Hashing
  Consider $S_A, S_B \subseteq U$
  Pick a random permutation $\pi$ of $U$
  Define $\alpha = \pi^{-1}(\min\{\pi(S_A)\})$ and $\beta = \pi^{-1}(\min\{\pi(S_B)\})$
    • Meaning? The minimal element of each set under permutation $\pi$
  Lemma: $P[\alpha = \beta] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
  Proof
    • Let $\delta = \min\{\pi(S_A \cup S_B)\}$
    • Claim: $\alpha = \beta \Leftrightarrow \pi^{-1}(\delta) \in S_A \cap S_B$
    • Clearly $P[\pi^{-1}(\delta) \in S_A \cap S_B] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$, since every element of $S_A \cup S_B$ is equally likely to be the minimum

Min-Hashing Similarity Sketches
  Succinct representation of fingerprint sets $S_A$
  Allows efficient estimation of $\mathrm{sim}(S_A, S_B)$
  Basic idea: use min-hash of fingerprints
    • $sk(A)$ = $k$ minimal elements under $\pi(S_A)$
  Claim: $E[\mathrm{sim}(sk(A), sk(B))] = \mathrm{sim}(S_A, S_B)$, with the sketch similarity evaluated over the $k$ smallest elements of $sk(A) \cup sk(B)$
    • These are exactly the $k$ smallest elements of $\pi(S_A \cup S_B)$
    • For each such $\alpha$: $P[\alpha \in sk(A) \cap sk(B)] = P[\pi^{-1}(\alpha) \in S_A \cap S_B] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
  Observe
    • sketch-similarity is an unbiased estimator of similarity
    • reducing variance: use larger $k$

Remarks
  Implementation
    • shingle/fingerprint/sketch documents in streams
    • Issue: cost of pairwise comparison of sketches?
      o cluster sketch-streams [Broder et al, Guha et al]
      o Open? Hashing sketches to identify similarity
  [Broder-Mitzenmacher 99]: Min-Hash is the only unbiased estimator
  [Indyk-Motwani 99]: Locality-Sensitive Hash
    • collisions more likely for similar items
    • Min-Hash is a special case

Multiple Permutations
  Better Variance Reduction
    • Instead of larger $k$, stick with $k=1$
    • Multiple, independent permutations
  Sketch Construction
    • Pick $p$ random permutations of $U$: $\pi_1, \pi_2, \ldots, \pi_p$
    • $sk(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
  Claim: $E[\mathrm{sim}(sk(A), sk(B))] = \mathrm{sim}(S_A, S_B)$, where the sketch similarity is the fraction of the $p$ coordinates on which the sketches agree
    • Earlier lemma $\Rightarrow$ true for $p=1$
    • Linearity of expectations
    • Variance reduction: independence of $\pi_1, \ldots, \pi_p$

Min-Wise Indep Permutations
  Problem
    • Truly-random $\pi$ over $U = [0 \ldots N-1]$ is infeasible
    • But do we really need true randomness?
  Solution
    • Poly-size family of permutations $F \subseteq S_N$ over $U$
    • Choosing/representing a random $\pi \in F$ is easy
  Min-Wise Independence (MWI) Property:
    For all sets $X \subseteq U$ and all $x \in X$,
    $P_{\pi \in F}[\min \pi(X) = \pi(x)] = \dfrac{1}{|X|}$

Minimum-Size MWI Families [Broder et al 98]
  Upper/lower bounds of $\mathrm{lcm}(1, 2, \ldots, N)$
  Problem: exponential in $N$
  Approximate MWI Families
    • Relax to $P_{\pi \in F}[\min \pi(X) = \pi(x)] = \dfrac{1 \pm \varepsilon}{|X|}$
    • Non-constructive: polynomial-size
    • Constructive: size $N^{O(\log 1/\varepsilon)}$ [Indyk 99]
  In practice, 2-universal hashes work well!

References I
  • M. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University (1981).
  • A. Broder. Some applications of Rabin's fingerprinting method. Sequences II (1993).
  • A. Broder. On the Resemblance and Containment of Documents. SEQUENCES 1997.
  • A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. WWW 1997.
  • N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. WebDB 1998.
  • A. Broder. Identifying and Filtering Near-Duplicate Documents. CPM 2000.

References II
  • E. Ukkonen. Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science (1992).
  • A. Broder and M. Mitzenmacher. Completeness and Robustness Properties of Min-Wise Independent Permutations.
  • A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. JCSS (2000).
  • P. Indyk. A Small Approximately min-wise Independent Family of Hash Functions. SODA 1999.
  • P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
  • A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
  • M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. STOC 2002.
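
Worked sketch (not part of the lecture)
  The shingling, Jaccard, and multiple-permutation min-hash slides can be tried out with a small Python sketch. Everything here is an illustrative assumption rather than the lecture's implementation: the function names are hypothetical, BLAKE2b stands in for 64-bit Rabin fingerprints, and a linear hash family modulo $2^{61}-1$ stands in for random permutations, in the spirit of the remark that 2-universal hashes work well in practice.

```python
import hashlib
import random
from collections import Counter

# Modulus for the illustrative linear hash family (an assumption, not from the slides).
MERSENNE_PRIME = (1 << 61) - 1


def shingles(text, q):
    """Return the list (multiset) of q-grams over whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]


def jaccard(a, b, keep_multiplicity=False):
    """Jaccard similarity |A n B| / |A u B| for sets or multisets of shingles."""
    if keep_multiplicity:
        ca, cb = Counter(a), Counter(b)
        inter = sum((ca & cb).values())  # multiset intersection: min of counts
        union = sum((ca | cb).values())  # multiset union: max of counts
    else:
        sa, sb = set(a), set(b)
        inter, union = len(sa & sb), len(sa | sb)
    return inter / union if union else 1.0


def fingerprint(shingle):
    """64-bit fingerprint of a shingle (BLAKE2b stand-in, not a Rabin fingerprint)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hashes(p, seed=0):
    """Draw p pairs (a, b), each defining x -> (a*x + b) mod MERSENNE_PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(p)]


def minhash_sketch(shingle_list, hashes):
    """One minimum per hash function (k=1, p functions); needs a non-empty input."""
    prints = {fingerprint(s) for s in shingle_list}
    return [min((a * x + b) % MERSENNE_PRIME for x in prints) for a, b in hashes]


def estimate_sim(sketch_a, sketch_b):
    """Fraction of coordinates on which two equal-length sketches agree."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


if __name__ == "__main__":
    doc_a = "a rose is a rose is a rose"
    doc_b = "a rose is a flower which is a rose"
    for q in (1, 2, 3):
        sa, sb = shingles(doc_a, q), shingles(doc_b, q)
        print(q, jaccard(sa, sb, keep_multiplicity=True), jaccard(sa, sb))
    hashes = make_hashes(256)
    est = estimate_sim(minhash_sketch(shingles(doc_a, 3), hashes),
                       minhash_sketch(shingles(doc_b, 3), hashes))
    print("min-hash estimate for q=3 (ignoring multiplicity):", est)
```

  The exact Jaccard values reproduce the figures on the Example slide. The min-hash estimate only approximates the multiplicity-free value 3/7, and a linear hash family is at best approximately min-wise independent, so the estimate should be read as indicative rather than as an unbiased sample.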