Scalable Techniques for Clustering the Web

advertisement
Scalable Techniques for
Clustering the Web
Taher H. Haveliwala
Aristides Gionis
Piotr Indyk
Stanford University
{taherh,gionis,indyk}@cs.stanford.edu
Project Goals
Generate fine-grained clustering of web
based on topic
Similarity search (“What’s Related?”)
Two major issues:
Develop appropriate notion of similarity
Scale up to millions of documents
Prior Work
Offline: detecting replicas
[Broder-Glassman-Manasse-Zweig’97]
[Shivakumar-G. Molina’98]
Online: finding/grouping related pages
[Zamir-Etzioni’98]
[Manjara]
Link based methods
[Dean-Henzinger’99, Clever]
Prior Work: Online, Link
Online: cluster results of search queries
does not work for clustering entire web
offline
Link based approaches are limited
What about relatively new pages?
What about less popular pages?
Prior Work: Copy detection
Designed to detect duplicates/nearreplicas
Do not scale when notion of similarity is
modified to ‘topical’ similarity
Creation of document-document similarity
matrix is the core challenge: join
bottleneck
Pairwise similarity
Consider relation Docs(id, sentence)
Must compute:
SELECT
FROM
WHERE
GROUP BY
HAVING
D1.id, D2.id
Docs D1, Docs D2
D1.sentence = D2.sentence
D1.id, D2.id
count(*) > 
What if we change ‘sentence’ to ‘word’?
Pairwise similarity
Relation Docs(id, word)
Compute:
SELECT
FROM
WHERE
GROUP BY
HAVING
D1.id, D2.id
Docs D1, Docs D2
D1.word = D2.word
D1.id, D2.id
count(*) > 
For 25M urls, could take months to compute!
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Document representation
Bag of words model
Bag for each page p consists of
Title of p
Anchor text of all pages pointing to p
(Also include window of words around
anchors)
Bag Generation
http://www.foobar.com/
...click here for a
great music page...
http://www.music.com/
MusicWorld
...click here for great
sports page...
Enter our site
http://www.baz.com/
...what I had for
lunch...
...this music is great...
Bag Generation
Union of ‘anchor windows’ is a concise
description of a page.
Note that using anchor windows, we can
cluster more documents than we’ve
crawled:
In general, a set of N documents refers to cN
urls
Standard IR
Remove stopwords (~ 750)
Remove high frequency & low frequency
terms
Use stemming
Apply TFIDF scaling
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Similarity
Similarity metric for pages U1, U2, that
were assigned bags B1, B2, respectively
sim(U1, U2) = |B1  B2| / |B1  B2|
Threshold is set to 20%
Reality Check
www.foodchannel.com:
www.epicurious.com/a_home/a00_home/home.html
www.gourmetworld.com
www.foodwine.com
www.cuisinenet.com
www.kitchenlink.com
www.yumyum.com
www.menusonline.com
www.snap.com/directory/category/0,16,-324,00.html
www.ichef.com
www.home-canning.com
.37
.36
.325
.3125
.3125
.3
.3
.2875
.2875
.275
Overview
Choose document representation
Choose similarity metric
Compute pairwise document
similarities
Generate clusters
Pair Generation
Find all pairs of pages (U1, U2) satisfying
sim(U1, U2)  20%
Ignore all url pairs with sim < 20%
How do we avoid the join bottleneck?
Locality Sensitive Hashing
Idea: use special kind of hashing
Locality Sensitive Hashing (LSH) provides
a solution:
Min-wise hash functions [Broder’98]
LSH [Indyk, Motwani’98], [Cohen et al’2000]
Properties:
Similar urls are hashed together w.h.p
Dissimilar urls are not hashed together
Locality Sensitive Hashing
sports.com
golf.com
music.com
opera.com
sing.com
Hashing
Two steps
Min-hash (MH): a way to consistently sample
words from bags
Locality sensitive hashing (LSH): similar
pages get hashed to the same bucket while
dissimilar ones do not
Step 1: Min-hash
Step 1: Generate m min-hash signatures
for each url (m = 80)
For i = 1...m
Generate a random order hi on words
mhi(u) = argmin {hi(w) | w  Bu}
Pr(mhi(u) = mhi(v)) = sim(u, v)
Step 1: Min-hash
Round 1:
ordering = [cat, dog, mouse, banana]
Set A:
{mouse, dog}
MH-signature = dog
Set B:
{cat, mouse}
MH-signature = cat
Step 1: Min-hash
Round 2:
ordering = [banana, mouse, cat, dog]
Set A:
{mouse, dog}
MH-signature = mouse
Set B:
{cat, mouse}
MH-signature = mouse
Step 2: LSH
Step 2: Generate l LSH signatures for
each url, using k of the min-hash values
(l = 125, k = 3)
For i = 1...l
Randomly select k min-hash indices and
concatenate them to form i’th LSH
signature
Step 2: LSH
Generate candidate pair if u and v have
an LSH signature in common in any round
Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))k
Step 2: LSH
Set A:
{mouse, dog, horse, ant}
MH1
MH2
MH3
MH4
=
=
=
=
horse
mouse
ant
dog
LSH134 = horse-ant-dog
LSH234 = mouse-ant-dog
Set B:
{cat, ice, shoe, mouse}
MH1
MH2
MH3
MH4
=
=
=
=
cat
mouse
ice
shoe
LSH134 = cat-ice-shoe
LSH234 = mouse-ice-shoe
Step 2: LSH
Bottom line - probability of collision:
10% similarity  0.1%
1% similarity  0.0001%
Step 2: LSH
Round 1
sports.com
golf.com
party.com
sportteamwin
...
music.com
opera.com
musicsoundplay
...
sing.com
singmusicear
Step 2: LSH
Round 2
sports.com
golf.com
gameteamscore
...
music.com
sing.com
audiomusicnote
...
opera.com
theaterlucianosing
Sort & Filter
Using all buckets from all LSH rounds,
generate candidate pairs
Sort candidate pairs on first field
Filter candidate pairs: keep pair (u, v),
only if u and v agree on 20% of MHsignatures
Ready for “What’s Related?” queries...
Overview
Choose document representation
Choose similarity metric
Compute pairwise document similarities
Generate clusters
Clustering
The set of document pairs represents the
document-document similarity matrix with
20% similarity threshold
Clustering algorithms
S-Link: connected components
C-Link: maximal cliques
Center: approximation to C-Link
Center
Scan through pairs (they are sorted on
first component)
For each run [(u, v1), ... , (u, vn)]
if u is not marked
cluster = u + unmarked neighbors of u
mark u and all neighbors of u
Center
Results
20 Million urls on Pentium-II 450
Algorithm Step
Bag generation
Bag sorting
Min-hash
LSH
Filtering
Sorting
CENTER
Running Time
(hours)
23
4.7
26
16
83
107
18
Sample Cluster
feynman.princeton.edu/~sondhi/205main.html
hep.physics.wisc.edu/wsmith/p202/p202syl.html
hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
pdg.lbl.gov/mc_particle_id_contents.html
physics.ucsc.edu/courses/10.html
town.hall.org/places/SciTech/qmachine
www.as.ua.edu/physics/hetheory.html
www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
www.phy.duke.edu/Courses/271/Synopsis.html
. . . (total of 27 urls) . . .
Ongoing/Future Work
Tune anchor-window length
Develop system to measure quality
What is ground truth?
How do you judge clustering of millions of
pages?
Download