Scalable Techniques for Clustering the Web

Taher H. Haveliwala, Aristides Gionis, Piotr Indyk
Stanford University
{taherh,gionis,indyk}@cs.stanford.edu

Project Goals
  - Generate fine-grained clustering of web based on topic
  - Similarity search (“What’s Related?”)
  - Two major issues:
      - Develop appropriate notion of similarity
      - Scale up to millions of documents

Prior Work
  - Offline: detecting replicas
      [Broder-Glassman-Manasse-Zweig’97]
      [Shivakumar-G. Molina’98]
  - Online: finding/grouping related pages
      [Zamir-Etzioni’98]
      [Manjara]
  - Link based methods
      [Dean-Henzinger’99, Clever]

Prior Work: Online, Link
  - Online: cluster results of search queries
      - does not work for clustering entire web offline
  - Link based approaches are limited
      - What about relatively new pages?
      - What about less popular pages?

Prior Work: Copy detection
  - Designed to detect duplicates/near-replicas
  - Do not scale when notion of similarity is modified to ‘topical’ similarity
  - Creation of document-document similarity matrix is the core challenge: join bottleneck

Pairwise similarity
  - Consider relation Docs(id, sentence)
  - Must compute:
      SELECT   D1.id, D2.id
      FROM     Docs D1, Docs D2
      WHERE    D1.sentence = D2.sentence
      GROUP BY D1.id, D2.id
      HAVING   count(*) > <threshold>
  - What if we change ‘sentence’ to ‘word’?

Pairwise similarity
  - Relation Docs(id, word)
  - Compute:
      SELECT   D1.id, D2.id
      FROM     Docs D1, Docs D2
      WHERE    D1.word = D2.word
      GROUP BY D1.id, D2.id
      HAVING   count(*) > <threshold>
  - For 25M urls, could take months to compute!

Overview
  - Choose document representation
  - Choose similarity metric
  - Compute pairwise document similarities
  - Generate clusters

Document representation
  - Bag of words model
  - Bag for each page p consists of
      - Title of p
      - Anchor text of all pages pointing to p
      - (Also include window of words around anchors)

Bag Generation
  [Figure: three example pages]
  http://www.foobar.com/
      ...click here for a great music page...
      ...click here for great sports page...
  http://www.music.com/
      MusicWorld
      Enter our site
  http://www.baz.com/
      ...what I had for lunch...
      ...this music is great...
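The bag construction in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pages` record format, the `tokenize` and `build_bags` helpers, the empty titles for foobar.com and baz.com, and the assumption that the baz.com snippet surrounds a link to music.com are all hypothetical and not taken from the slides. Stopword removal, stemming, and TFIDF scaling are omitted.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_bags(pages):
    """Build one bag (word -> count) per url.

    `pages` is a hypothetical crawl format: a list of dicts with keys
    'url', 'title', and 'anchors', where 'anchors' is a list of
    (target_url, window_text) tuples found on that page.
    """
    bags = defaultdict(lambda: defaultdict(int))
    for page in pages:
        # The title of a page goes into the page's own bag.
        for word in tokenize(page["title"]):
            bags[page["url"]][word] += 1
        # Each anchor window goes into the bag of the page it points to.
        for target, window in page["anchors"]:
            for word in tokenize(window):
                bags[target][word] += 1
    return {url: dict(bag) for url, bag in bags.items()}

# Toy input modeled on the figure (titles and link targets are assumed).
pages = [
    {"url": "http://www.foobar.com/", "title": "",
     "anchors": [("http://www.music.com/", "click here for a great music page")]},
    {"url": "http://www.music.com/", "title": "MusicWorld", "anchors": []},
    {"url": "http://www.baz.com/", "title": "",
     "anchors": [("http://www.music.com/", "this music is great")]},
]
bags = build_bags(pages)
```

In this toy input only music.com receives a bag, because it is the only page with a title or an incoming anchor window.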

Bag Generation
  - Union of ‘anchor windows’ is a concise description of a page.
  - Note that using anchor windows, we can cluster more documents than we’ve crawled:
      - In general, a set of N documents refers to cN urls

Standard IR
  - Remove stopwords (~ 750)
  - Remove high frequency & low frequency terms
  - Use stemming
  - Apply TFIDF scaling

Overview
  - Choose document representation
  - Choose similarity metric
  - Compute pairwise document similarities
  - Generate clusters

Similarity
  - Similarity metric for pages U1, U2, that were assigned bags B1, B2, respectively
      sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|
  - Threshold is set to 20%

Reality Check
  www.foodchannel.com:
      www.epicurious.com/a_home/a00_home/home.html          .37
      www.gourmetworld.com                                  .36
      www.foodwine.com                                      .325
      www.cuisinenet.com                                    .3125
      www.kitchenlink.com                                   .3125
      www.yumyum.com                                        .3
      www.menusonline.com                                   .3
      www.snap.com/directory/category/0,16,-324,00.html     .2875
      www.ichef.com                                         .2875
      www.home-canning.com                                  .275

Overview
  - Choose document representation
  - Choose similarity metric
  - Compute pairwise document similarities
  - Generate clusters

Pair Generation
  - Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%
  - Ignore all url pairs with sim < 20%
  - How do we avoid the join bottleneck?
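To make the similarity definition and the cost of naive pair generation concrete, here is a minimal Python sketch. It treats bags as plain sets of words, which is a simplifying assumption, and the example urls and the `jaccard` and `similar_pairs` helpers are hypothetical. The nested loop compares every pair of urls, which is the quadratic work the join bottleneck refers to.

```python
def jaccard(bag1, bag2):
    """Return |B1 ∩ B2| / |B1 ∪ B2| for two bags treated as sets."""
    s1, s2 = set(bag1), set(bag2)
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)

def similar_pairs(bags, threshold=0.2):
    """Brute-force pair generation: compare every pair of urls.

    This is quadratic in the number of urls, so it is only usable on
    tiny inputs.
    """
    urls = sorted(bags)
    pairs = []
    for i, u in enumerate(urls):
        for v in urls[i + 1:]:
            s = jaccard(bags[u], bags[v])
            if s >= threshold:
                pairs.append((u, v, s))
    return pairs

# Hypothetical toy bags.
bags = {
    "a.example": {"music", "sound", "play", "opera"},
    "b.example": {"music", "sound", "note"},
    "c.example": {"sport", "team", "win"},
}
pairs = similar_pairs(bags)
```

Here a.example and b.example share 2 of 5 distinct words, so they pass the 20% threshold, while c.example shares no words with either and is dropped.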

Locality Sensitive Hashing
  - Idea: use special kind of hashing
  - Locality Sensitive Hashing (LSH) provides a solution:
      - Min-wise hash functions [Broder’98]
      - LSH [Indyk, Motwani’98], [Cohen et al’2000]
  - Properties:
      - Similar urls are hashed together w.h.p
      - Dissimilar urls are not hashed together

Locality Sensitive Hashing
  [Figure: example urls sports.com, golf.com, music.com, opera.com, sing.com]

Hashing
  - Two steps
      - Min-hash (MH): a way to consistently sample words from bags
      - Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not

Step 1: Min-hash
  - Step 1: Generate m min-hash signatures for each url (m = 80)
      For i = 1...m
          Generate a random order h_i on words
          mh_i(u) = argmin { h_i(w) | w ∈ B_u }
  - Pr(mh_i(u) = mh_i(v)) = sim(u, v)

Step 1: Min-hash
  Round 1: ordering = [cat, dog, mouse, banana]
      Set A: {mouse, dog}     MH-signature = dog
      Set B: {cat, mouse}     MH-signature = cat

Step 1: Min-hash
  Round 2: ordering = [banana, mouse, cat, dog]
      Set A: {mouse, dog}     MH-signature = mouse
      Set B: {cat, mouse}     MH-signature = mouse

Step 2: LSH
  - Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)
      For i = 1...l
          Randomly select k min-hash indices and concatenate them to form i’th LSH signature

Step 2: LSH
  - Generate candidate pair if u and v have an LSH signature in common in any round
  - Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k

Step 2: LSH
  Set A: {mouse, dog, horse, ant}
      MH1 = horse    MH2 = mouse    MH3 = ant    MH4 = dog
      LSH134 = horse-ant-dog
      LSH234 = mouse-ant-dog
  Set B: {cat, ice, shoe, mouse}
      MH1 = cat      MH2 = mouse    MH3 = ice    MH4 = shoe
      LSH134 = cat-ice-shoe
      LSH234 = mouse-ice-shoe

Step 2: LSH
  Bottom line - probability of collision:
      10% similarity: 0.1%
      1% similarity: 0.0001%

Step 2: LSH
  Round 1
      sports.com, golf.com, party.com -> sportteamwin
      ...
      music.com, opera.com -> musicsoundplay
      ...
      sing.com -> singmusicear

Step 2: LSH
  Round 2
      sports.com, golf.com -> gameteamscore
      ...
      music.com, sing.com -> audiomusicnote
      ...
      opera.com -> theaterlucianosing

Sort & Filter
  - Using all buckets from all LSH rounds, generate candidate pairs
  - Sort candidate pairs on first field
  - Filter candidate pairs: keep pair (u, v), only if u and v agree on 20% of MH-signatures
  - Ready for “What’s Related?” queries...

Overview
  - Choose document representation
  - Choose similarity metric
  - Compute pairwise document similarities
  - Generate clusters

Clustering
  - The set of document pairs represents the document-document similarity matrix with 20% similarity threshold
  - Clustering algorithms
      - S-Link: connected components
      - C-Link: maximal cliques
      - Center: approximation to C-Link

Center
  - Scan through pairs (they are sorted on first component)
  - For each run [(u, v1), ... , (u, vn)]
      if u is not marked
          cluster = u + unmarked neighbors of u
          mark u and all neighbors of u

Center Results
  20 Million urls on Pentium-II 450

      Algorithm Step      Running Time (hours)
      Bag generation      23
      Bag sorting         4.7
      Min-hash            26
      LSH                 16
      Filtering           83
      Sorting             107
      CENTER              18

Sample Cluster
  feynman.princeton.edu/~sondhi/205main.html
  hep.physics.wisc.edu/wsmith/p202/p202syl.html
  hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
  pdg.lbl.gov/mc_particle_id_contents.html
  physics.ucsc.edu/courses/10.html
  town.hall.org/places/SciTech/qmachine
  www.as.ua.edu/physics/hetheory.html
  www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
  www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
  www.phy.duke.edu/Courses/271/Synopsis.html
  . . . (total of 27 urls) . . .

Ongoing/Future Work
  - Tune anchor-window length
  - Develop system to measure quality
      - What is ground truth?
      - How do you judge clustering of millions of pages?
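As a recap of the pipeline covered in these slides (min-hash, LSH, filtering, Center), the following Python sketch runs the four steps in memory on a toy input. It is not the authors' implementation. The function names, the fixed seeds, the use of explicit shuffled word orderings as the random orders, and the symmetric neighbor map in `center` (which assumes each pair is available in both orientations) are all assumptions made for illustration. Bags are assumed to be non-empty sets of words.

```python
import random
from collections import defaultdict

def minhash_signatures(bags, m, seed=0):
    """Return {url: [m min-hash values]} using m random word orderings."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*bags.values()))
    sigs = {url: [] for url in bags}
    for _ in range(m):
        order = vocab[:]
        rng.shuffle(order)
        rank = {w: r for r, w in enumerate(order)}  # random order on words
        for url, bag in bags.items():
            # The min-hash is the word of the bag that comes first in the order.
            sigs[url].append(min(bag, key=rank.__getitem__))
    return sigs

def lsh_candidates(sigs, l, k, seed=1):
    """Return url pairs that share an LSH signature in at least one round."""
    rng = random.Random(seed)
    m = len(next(iter(sigs.values())))
    candidates = set()
    for _ in range(l):
        idx = rng.sample(range(m), k)  # k randomly chosen min-hash indices
        buckets = defaultdict(list)
        for url, sig in sigs.items():
            buckets[tuple(sig[i] for i in idx)].append(url)
        for urls in buckets.values():
            urls.sort()
            for a in range(len(urls)):
                for b in range(a + 1, len(urls)):
                    candidates.add((urls[a], urls[b]))
    return candidates

def filter_pairs(candidates, sigs, threshold=0.2):
    """Keep pairs whose min-hash signatures agree on at least `threshold`."""
    kept = []
    for u, v in sorted(candidates):
        agree = sum(x == y for x, y in zip(sigs[u], sigs[v]))
        if agree / len(sigs[u]) >= threshold:
            kept.append((u, v))
    return kept

def center(pairs):
    """Center clustering: each unmarked url becomes a center and absorbs
    its unmarked neighbors; the center and all its neighbors are marked."""
    neighbors = defaultdict(set)
    for u, v in pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)
    marked, clusters = set(), []
    for u in sorted(neighbors):
        if u in marked:
            continue
        clusters.append([u] + sorted(n for n in neighbors[u] if n not in marked))
        marked.add(u)
        marked.update(neighbors[u])
    return clusters

# Toy input: x and y have identical bags, z shares no words with them.
bags = {
    "x": {"music", "sound", "play"},
    "y": {"music", "sound", "play"},
    "z": {"sport", "team", "win"},
}
sigs = minhash_signatures(bags, m=20)
candidates = lsh_candidates(sigs, l=5, k=3)
clusters = center(filter_pairs(candidates, sigs))
```

The toy values m = 20, l = 5, k = 3 are much smaller than the m = 80, l = 125, k = 3 quoted in the slides. Because x and y have identical bags they collide in every round, while z can never share a min-hash value with them, so the outcome on this input does not depend on the random seeds.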