CS 361A Lecture 18 (Nov 30, 2005) Fingerprints, Min-Hashing, and

CS 361A (Advanced Data Structures and Algorithms)
Lecture 18 (Nov 30, 2005)
Fingerprints, Min-Hashing, and Document Similarity
Rajeev Motwani

Game Plan for Week
  • Fingerprints
  • Document Similarity
  • Shingling
  • Min-Hashing
  • Min-Wise Independent Permutations

Fingerprints
  • $\Omega$: set of large objects (e.g., URLs)
  • Goal
      avoid storing large objects explicitly
      quick-and-dirty equality testing
  • Fingerprints?
      Short tags for objects
      Distinct fingerprints $\Rightarrow$ distinct objects
      Distinct objects $\Rightarrow$ probably distinct fingerprints

Formalization
  • Fingerprint length $k$ $\Rightarrow$ fingerprint space size $N = 2^k$
  • Fingerprint function family $F = \{\, f : \Omega \to \{0,1\}^k \,\}$
  • Random $f \in_R F$
      $f(A) \ne f(B) \Rightarrow A \ne B$
      Collisions: $P[\, f(A) = f(B) \mid A \ne B \,]$ should be small (ideally exponentially small in $k$)
  • Typical application
      Adversarial object set $S$ with $|S| = n \ll 2^k$
      Goal: $|f(S)| = |S|$ with high probability
      $n^2$ pairwise collisions possible $\Rightarrow$ need $2^k > n^2$ (to avoid the Birthday Paradox)

Example: URL Fingerprints
  • Search engines
      Manage large numbers of URL strings
      Long, variable-length strings (embedded objects/database queries)
  • Desiderata
      Small, fixed-length encodings that are, hopefully, unique
      Some scenarios
        o Exact string irrelevant
        o Only need ability to distinguish distinct URLs
      Even otherwise, unique IDs are useful for indexing
  • Numbers?
      4 billion webpages $\Rightarrow$ $n = 2^{32}$
      $N \ge n^2$ $\Rightarrow$ $k = 64$
      Fingerprints $\Rightarrow$ 8-byte representation

Fingerprinting vs Hashing
  • Hashing $h : \Omega \to \{0,1\}^k$
      Set-membership testing for set $S$ of size $n$
      Desire uniform distribution over bin addresses $\{0,1\}^k$
      Minimize collisions per bin to reduce lookup time
      Minimize hash table size: $n \approx N = 2^k$
  • Fingerprinting $f : \Omega \to \{0,1\}^k$
      Object-equality testing over set $S$ of size $n$
      Distribution over $\{0,1\}^k$ is irrelevant
      Avoid collisions altogether
      Tolerate larger $k$: typically $N > n^2$

Fingerprinting Strings
  • Typical application, but the techniques extend to combinatorial objects (database tuples, trees/graphs)
  • Obvious techniques
      Checksum: no worst-case collision probability guarantees
      MD5: cryptographically secure string hashes
        o relatively slow
        o avoids leaking information about the original string
  • Rabin's Scheme
      Algebraic technique: polynomial arithmetic
      Efficient: needs (1 table lookup + 1 xor + 1 shift) per byte
      other nice properties…

Rabin Fingerprints
  • Consider: $m$-bit string $A = a_1 a_2 \ldots a_m$
  • Assume: $a_1 = 1$ and fixed-length strings (wlog)
  • Encoding strings
      Polynomials of degree $m-1$ over $Z_2$
      $A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
  • Fingerprints
      $P(x)$: random, irreducible degree-$k$ polynomial over $Z_2$ (easy to sample such polynomials)
      Irreducible: $x^2 + x + 1$ cannot be factored over $Z_2$, whereas $x^2 + 1 = (x+1)^2$ can
      $f(A) = A(x) \bmod P(x)$

Analysis
  • Fix $S$: $n$ strings of length $m$
  • Consider $Q_S(x) = \prod_{A \ne B \in S} \big( A(x) - B(x) \big)$
      Collision $f(A) = f(B)$ $\Leftrightarrow$ $A(x) \equiv B(x) \pmod{P(x)}$ $\Rightarrow$ $Q_S(x) \equiv 0 \pmod{P(x)}$
      Therefore: $P(x)$ is a factor of $Q_S(x)$
  • Collision Probability?
      $\deg(Q_S) \le n^2 m$
      Number of irreducible degree-$k$ factors of $Q_S(x)$ is $< n^2 m / k$
      Fact: the number of irreducible degree-$k$ polynomials is $> (2^k - 2^{k/2})/k$
      $\mathrm{Prob}[\,\text{random } P(x) \text{ divides } Q_S(x)\,] < n^2 m / 2^k$
  • $\mathrm{Prob}[\,\text{fingerprints not distinct}\,] < \dfrac{n^2 m}{2^k}$

Beneficial Properties
  • Hardware-level implementation
      $Z_2$-polynomials are the same as bit strings
      simple shift-register operations
  • Distributivity: $f(A + B) = f(A) + f(B)$ over $Z_2$
  • Let $\circ$ denote concatenation
      $f(A \circ B) = f(f(A) \circ B)$
      $f(A \circ B) = A(x) \cdot x^{m} + B(x) \bmod P(x)$, for an $m$-bit string $B$
  • Fingerprint sliding windows over strings at low incremental cost

Duplicate Document Detection
  • Problem
      Given: large collection of arbitrary documents
      Identify: near-duplicate documents
  • Web search engines
      Proliferation of near-duplicate documents
        o Legitimate: mirrors, local copies, updates, …
        o Malicious: spam, spider-traps, dynamic URLs, …
        o Mistaken: spider errors
      30% of web pages are near-duplicates [Broder et al 1997]
      Cost: RAM/disk, search quality, unhappy users
  • Enterprise search: even larger amount of duplication
  • SCAM: plagiarism detection [Shivakumar et al 1998]

Natural Approaches
  • Fingerprinting?
      only works for exact matches
      here we must identify even near-duplicates
  • Random sampling?
      sample substrings (phrases, sentences, etc.)
      hope: similar documents $\Rightarrow$ similar samples
      No: even two samples of the same document will differ
  • Edit distance?
      metric for approximate string matching
      expensive, even for one pair of strings
      impossible for $10^{32}$ web documents

Desiderata
  • Storage: only small sketches of each document
  • Computation: $O(n \log n)$ time on $n$ documents
  • Stream processing: once the sketch is computed, the source is unavailable
  • Error guarantees
      problem scale $\Rightarrow$ small biases have large impact
      need formal guarantees; heuristics will not do

Basic Idea [Broder 1997]
  • Shingling
      dissect document into q-grams (shingles)
      represent documents by shingle-sets
      near-duplicates: intersection of shingle-sets is large
      reduce problem to set intersection
  • Set intersection
      fingerprints of shingles
      min-hash to estimate intersection sizes

Shingling
  • Shingle: $q$ contiguous tokens/words (q-gram)
  • Consider the following "document"
      a rose is a rose is a rose
  • Choose $q = 4$ $\Rightarrow$ get multiset of shingles
      a rose is a
      rose is a rose
      is a rose is
      a rose is a
      rose is a rose

Documents $\to$ Sets of 64-bit fingerprints
  Doc $\to$ (shingling) $\to$ Multiset of Shingles $\to$ (fingerprint) $\to$ Multiset of Fingerprints
  Fingerprints?
  • Use Rabin fingerprints
  • Fingerprint space $U = [0, \ldots, N-1]$
  • In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$
  • Result: uniformity in length of strings

Similarity of Documents
  Doc A $\to$ $S_A$, Doc B $\to$ $S_B$
  • Jaccard measure: similarity of $S_A, S_B \subseteq U = [0 \ldots N-1]$
      $\mathrm{sim}(S_A, S_B) = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
  • Claim: A and B are near-duplicates if $\mathrm{sim}(S_A, S_B)$ is high
  • Claim: A is contained in B if $\mathrm{con}(S_A, S_B)$ is high
      $\mathrm{con}(S_A, S_B) = \dfrac{|S_A \cap S_B|}{|S_A|}$

Remarks
  • Multiplicities of q-grams: could retain or ignore
      trade off efficiency against precision
  • Shingle size $q \in [3 \ldots 10]$
      Short shingles $\Rightarrow$ increase similarity of unrelated documents
        o With $q = 1$, $\mathrm{sim}(S_A, S_B) = 1$ $\Leftrightarrow$ A is a permutation of B
        o Need larger $q$ to sensitize to permutation changes
      Long shingles $\Rightarrow$ small random changes have larger impact
  • Similarity measure
      Similarity is non-transitive, non-metric
      But dissimilarity $1 - \mathrm{sim}(S_A, S_B)$ is a metric [Charikar 02]
  • [Ukkonen 92]: relates q-gram and edit distance

Example
  • A = "a rose is a rose is a rose"
  • B = "a rose is a flower which is a rose"
  • Preserving multiplicity
      $q = 1$ $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.7$
        o $S_A$ = {a, a, a, is, is, rose, rose, rose}
        o $S_B$ = {a, a, a, is, is, rose, rose, flower, which}
      $q = 2$ $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.5$
      $q = 3$ $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.3$
  • Disregarding multiplicity
      $q = 1$ $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.6$
      $q = 2$ $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.5$
      $q = 3$ $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.4285$

Min-Hashing
  • Consider $S_A, S_B \subseteq U$
      Pick a random permutation $\pi$ of $U$
      Define $\alpha = \pi^{-1}(\min\{\pi(S_A)\})$ and $\beta = \pi^{-1}(\min\{\pi(S_B)\})$
      Meaning? The minimal element under permutation $\pi$
  • Lemma: $P[\alpha = \beta] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
      Let $\delta = \min\{\pi(S_A \cup S_B)\}$
      Claim: $\alpha = \beta$ $\Leftrightarrow$ $\pi^{-1}(\delta) \in S_A \cap S_B$
      Clearly $P[\pi^{-1}(\delta) \in S_A \cap S_B] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$

Min-Hashing $\Rightarrow$ Similarity Sketches
  • Similarity sketches
      Succinct representation of fingerprint sets $S_A$
      Allow efficient estimation of $\mathrm{sim}(S_A, S_B)$
      Basic idea: use min-hash of fingerprints
  • $sk(A)$ = $k$ minimal elements under $\pi(S_A)$
  • Claim: $E[\,\mathrm{sim}(sk(A), sk(B))\,] = \mathrm{sim}(S_A, S_B)$
      For each $\alpha \in sk(A) \cup sk(B)$:
      $P[\alpha \in sk(A) \cap sk(B)] = P[\pi^{-1}(\alpha) \in S_A \cap S_B] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
  • Observe
      sketch-similarity is an unbiased estimator of similarity
      to reduce variance, use larger $k$

Remarks
  • Implementation
      shingle/fingerprint/sketch documents in streams
      Issue: cost of pairwise comparison of sketches?
        o cluster sketch-streams [Broder et al, Guha et al]
        o Open? hashing sketches to identify similarity
  • [Broder-Mitzenmacher 99]: Min-Hash is the only unbiased estimator
  • [Indyk-Motwani 99]: Locality-Sensitive Hashing
      collisions more likely for similar items
      Min-Hash is a special case

Multiple Permutations
  • Better variance reduction
      Instead of larger $k$, stick with $k = 1$
      Multiple, independent permutations
  • Sketch construction
      Pick $p$ random permutations of $U$: $\pi_1, \pi_2, \ldots, \pi_p$
      $sk(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
  • Claim: $E[\,\mathrm{sim}(sk(A), sk(B))\,] = \mathrm{sim}(S_A, S_B)$
      Earlier lemma $\Rightarrow$ true for $p = 1$
      Linearity of expectation
      Variance reduction: independence of $\pi_1, \ldots, \pi_p$

Min-Wise Independent Permutations
  • Problem
      Truly random $\pi$ over $U = [0 \ldots N-1]$ is infeasible
      But do we really need true randomness?
  • Solution
      Poly-size family of permutations $F \subseteq S_N$ over $U$
      Choosing/representing a random $\pi \in F$ is easy
  • Min-Wise Independence (MWI) property: for all sets $X \subseteq U$ and all $x \in X$,
      $P_{\pi \in F}[\, \min\{\pi(X)\} = \pi(x) \,] = \dfrac{1}{|X|}$

Minimum-Size MWI Families [Broder et al 98]
  • Upper/lower bounds of $\mathrm{lcm}(1, 2, \ldots, n)$
  • Problem: exponential in $N$
  • Approximate MWI families
      Relax to $P_{\pi \in F}[\, \min\{\pi(X)\} = \pi(x) \,] = \dfrac{1 \pm \varepsilon}{|X|}$
      Non-constructive: polynomial size
      Constructive: size $N^{O(\log 1/\varepsilon)}$ [Indyk 99]
  • In practice, 2-universal hashes work well!

References I
  • M. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University (1981).
  • A. Broder. Some applications of Rabin's fingerprinting method. Sequences II (1993).
  • A. Broder. On the Resemblance and Containment of Documents. SEQUENCES 1997.
  • A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. WWW 1997.
  • N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. WebDB 1998.
  • A. Broder. Identifying and Filtering Near-Duplicate Documents. CPM 2000.

References II
  • E. Ukkonen. Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science (1992).
  • A. Broder and M. Mitzenmacher. Completeness and Robustness Properties of Min-Wise Independent Permutations.
  • A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. JCSS (2000).
  • P. Indyk. A Small Approximately min-wise Independent Family of Hash Functions. SODA 1999.
  • P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
  • A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
  • M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. STOC 2002.
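Worked sketch for the Rabin Fingerprints and Beneficial Properties slides. This is a minimal illustration under stated assumptions, not the scheme's reference implementation: polynomials over Z_2 are stored as Python integers (bit i is the coefficient of x^i), the helper names poly_mod, is_irreducible, random_irreducible, and fingerprint are hypothetical, and irreducibility is checked by brute force, which is only practical for a small degree such as k = 16. The slides say sampling irreducible polynomials is easy but do not give a procedure, so rejection sampling is assumed here. The bit-by-bit reduction is also not the table-driven per-byte method the slides mention.

```python
import random


def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2.

    Assumed encoding: bit i of the integer is the coefficient of x^i.
    """
    dp = p.bit_length()
    while a.bit_length() >= dp:
        # Over Z_2 subtraction is XOR; a shifted copy of p cancels the leading term.
        a ^= p << (a.bit_length() - dp)
    return a


def is_irreducible(p: int) -> bool:
    """Brute force: p has no divisor of degree 1..deg(p)//2 (small degrees only)."""
    deg = p.bit_length() - 1
    return all(poly_mod(p, d) != 0 for d in range(2, 1 << (deg // 2 + 1)))


def random_irreducible(k: int, rng: random.Random) -> int:
    """Rejection-sample an irreducible polynomial of degree exactly k (assumed method)."""
    while True:
        # Force the x^k term and the constant term; the rest is random.
        p = (1 << k) | rng.getrandbits(k) | 1
        if is_irreducible(p):
            return p


def fingerprint(data: bytes, p: int) -> int:
    """f(A) = A(x) mod P(x); a 0x01 byte is prepended so the leading coefficient is 1."""
    return poly_mod(int.from_bytes(b"\x01" + data, "big"), p)


rng = random.Random(2005)          # fixed seed only to make the sketch repeatable
P = random_irreducible(16, rng)    # k = 16 keeps the brute-force test cheap
fa = fingerprint(b"http://example.com/a", P)
fb = fingerprint(b"http://example.com/b", P)
print(hex(P), hex(fa), hex(fb))
```

The two example URLs differ only in their last byte, so their difference polynomial has degree 1 and cannot be divisible by a degree-16 P; the two fingerprints are therefore guaranteed to differ. Because poly_mod is linear over Z_2, poly_mod(a ^ b, P) equals poly_mod(a, P) ^ poly_mod(b, P), which is the distributivity property from the slides.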

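Worked sketch for the Shingling, Example, Min-Hashing, and Multiple Permutations slides. It recomputes the similarity values from the "a rose is a rose" example and then estimates the Jaccard similarity with p independent random permutations. The function names are hypothetical, shingles are compared as strings rather than as 64-bit fingerprints, and each permutation is drawn over S_A ∪ S_B only. That restriction is an assumption made to keep the example small: a uniform permutation of U induces a uniform order on the union, so the probability in the lemma is unchanged.

```python
import random
from collections import Counter


def shingles(text: str, q: int) -> list:
    """Multiset (as a list) of q-word shingles of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + q]) for i in range(len(words) - q + 1)]


def sim_multiset(a: list, b: list) -> float:
    """Jaccard similarity preserving multiplicities (min / max of counts)."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / sum((ca | cb).values())


def sim_set(a: list, b: list) -> float:
    """Jaccard similarity disregarding multiplicities."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def minhash_estimate(sa: set, sb: set, p: int, seed: int = 0) -> float:
    """Fraction of p random permutations under which both sets share the same
    minimal element (k = 1 per permutation, as on the Multiple Permutations slide)."""
    rng = random.Random(seed)
    universe = sorted(sa | sb)  # sorted only so a fixed seed gives repeatable runs
    hits = 0
    for _ in range(p):
        perm = universe[:]
        rng.shuffle(perm)
        rank = {x: i for i, x in enumerate(perm)}  # rank[x] plays the role of pi(x)
        alpha = min(sa, key=rank.__getitem__)
        beta = min(sb, key=rank.__getitem__)
        hits += alpha == beta
    return hits / p


DOC_A = "a rose is a rose is a rose"
DOC_B = "a rose is a flower which is a rose"

for q in (1, 2, 3):
    a, b = shingles(DOC_A, q), shingles(DOC_B, q)
    print(q, sim_multiset(a, b), sim_set(a, b))

sa, sb = set(shingles(DOC_A, 3)), set(shingles(DOC_B, 3))
print(minhash_estimate(sa, sb, p=2000, seed=1), "vs exact", sim_set(list(sa), list(sb)))
```

The estimate fluctuates around the exact value 3/7 for q = 3; increasing p narrows the spread, which is the variance reduction the slides attribute to independent permutations.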