Detecting Near-Duplicates for Web Crawling

Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola
Outline

- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions

De-duplication

- The process of eliminating near-duplicate web documents in a generic crawl
- The challenge of near-duplicates:
  - Identifying exact duplicates is easy: use checksums
  - How to identify near-duplicates?
    - Near-duplicates are identical in content but differ in small areas, e.g. ads, counters, and timestamps

Goal of the Paper

- Present a near-duplicate detection system which improves web crawling
- The near-duplicate detection system includes:
  - The simhash technique
    - Used to transform a web-page into an f-bit fingerprint
  - A solution to the Hamming Distance Problem
    - Given an f-bit fingerprint, find all fingerprints in a given collection which differ from it in at most k bit positions

Why is De-duplication Important?

- Elimination of near-duplicates:
  - Saves network bandwidth
    - Content similar to previously crawled content does not have to be crawled
  - Reduces storage cost
    - Content similar to previously crawled content does not have to be stored in the local repository
  - Improves the quality of search indexes
    - The local repository used for building search indexes is not polluted by near-duplicates

Algorithm: Simhash Technique

- Convert the web-page into a set of features
  - Using Information Retrieval techniques, e.g. tokenization, phrase detection
- Give a weight to each feature
- Hash each feature into an f-bit value
- Maintain an f-dimensional vector
  - Dimension values start at 0
- Update the f-dimensional vector with the weight of each feature
  - If the i-th bit of the hash value is one -> add the weight of the feature to the i-th vector value
  - If the i-th bit of the hash value is zero -> subtract the weight of the feature from the i-th vector value
- The vector will have positive and negative components
- The sign (+/-) of each component gives one bit of the fingerprint (a sketch of the whole procedure follows below)

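A minimal Python sketch of the procedure, assuming token/weight pairs as features and an MD5-derived f-bit hash (neither the hash function nor the weighting scheme is specified on these slides):

    import hashlib

    def simhash(features, f=64):
        # features: iterable of (token, weight) pairs
        v = [0] * f
        for token, weight in features:
            # Derive an f-bit hash of the feature; MD5 is an illustrative
            # choice, not the hash function used in the paper.
            h = int.from_bytes(hashlib.md5(token.encode()).digest(), "big") % (1 << f)
            for i in range(f):
                if (h >> i) & 1:
                    v[i] += weight   # i-th bit is one: add the weight
                else:
                    v[i] -= weight   # i-th bit is zero: subtract the weight
        # Positive components become 1-bits, the rest become 0-bits.
        fingerprint = 0
        for i in range(f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    print(simhash([("Simhash", 2), ("Technique", 4)]))

Near-duplicate pages yield fingerprints that differ in only a few bit positions, which is what the Hamming Distance machinery later in the deck exploits.
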
Algorithm: Simhash Technique (cont.)

- Very simple example
  - One web-page
  - Web-page text: "Simhash Technique"
  - Reduced to two features:
    - "Simhash" -> weight = 2
    - "Technique" -> weight = 4
  - Hash features to 4 bits:
    - "Simhash" -> 1101
    - "Technique" -> 0110

Algorithm: Simhash Technique (cont.)

- Start with a vector of all zeroes: (0, 0, 0, 0)

Algorithm: Simhash Technique (cont.)

- Apply the "Simhash" feature (hash 1101, weight = 2) to the vector (0, 0, 0, 0):

  feature's f-bit value | calculation | new vector value
  1                     | 0 + 2       |  2
  1                     | 0 + 2       |  2
  0                     | 0 - 2       | -2
  1                     | 0 + 2       |  2

Algorithm: Simhash Technique (cont.)

- Apply the "Technique" feature (hash 0110, weight = 4) to the vector (2, 2, -2, 2):

  feature's f-bit value | calculation | new vector value
  0                     | 2 - 4       | -2
  1                     | 2 + 4       |  6
  1                     | -2 + 4      |  2
  0                     | 2 - 4       | -2

Algorithm: Simhash Technique (cont.)

- Final vector: (-2, 6, 2, -2)
- The signs of the vector values are -, +, +, -
- Final 4-bit fingerprint = 0110

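A short check of the arithmetic above, with the example's 4-bit hashes hard-coded as given on the slides:

    def simhash_4bit(features):
        # features: iterable of (4-bit hash, weight) pairs
        v = [0] * 4
        for h, weight in features:
            for i in range(4):
                # Read bits left to right, matching the tables above.
                if (h >> (3 - i)) & 1:
                    v[i] += weight
                else:
                    v[i] -= weight
        return v, "".join("1" if x > 0 else "0" for x in v)

    # "Simhash" -> 1101 with weight 2; "Technique" -> 0110 with weight 4
    print(simhash_4bit([(0b1101, 2), (0b0110, 4)]))
    # ([-2, 6, 2, -2], '0110')
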
Algorithm: Solution to the Hamming Distance Problem

- Problem: Given an f-bit fingerprint (F), find all fingerprints in a given collection which differ from F in at most k bit positions
- Solution:
  - Create tables containing the fingerprints
    - Each table has a permutation (π) and a small integer (p) associated with it
    - Apply the permutation associated with each table to its fingerprints
    - Sort the tables
    - Store the tables in the main memory of a set of machines
  - Iterate through the tables in parallel (a code sketch follows the worked example below)
    - Find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
    - For the fingerprints that matched, check if they differ from π_i(F) in at most k bits

Algorithm: Solution to the Hamming Distance Problem (cont.)

- Simple example
  - F = 0100 1101
  - k = 3
  - Have a collection of 8 fingerprints
  - Create two tables

  Fingerprints:
  1100 0101
  1111 1111
  0101 1100
  0111 1110
  1111 1110
  0010 0001
  1111 0101
  1101 0010

Algorithm: Solution to the Hamming Distance Problem (cont.)

- In this example, the first four fingerprints go into table 1 and the last four into table 2; each table's permutation is then applied to its fingerprints

  Table 1 (p = 3; π = swap last four bits with first four bits):
  0101 1100
  1111 1111
  1100 0101
  1110 0111

  Table 2 (p = 3; π = move last two bits to the front):
  1011 1111
  0100 1000
  0111 1101
  1011 0100

Algorithm: Solution to the Hamming Distance Problem (cont.)

- Sort each table

  Table 1 (p = 3; π = swap last four bits with first four bits), sorted:
  0101 1100
  1100 0101
  1110 0111
  1111 1111

  Table 2 (p = 3; π = move last two bits to the front), sorted:
  0100 1000
  0111 1101
  1011 0100
  1011 1111

Algorithm: Solution to the Hamming Distance Problem (cont.)

- F = 0100 1101
- π1(F) = 1101 0100 (top p = 3 bits: 110)
- π2(F) = 0101 0011 (top p = 3 bits: 010)

  Table 1, sorted:
  0101 1100
  1100 0101   <- top 3 bits match those of π1(F): Match!
  1110 0111
  1111 1111

  Table 2, sorted:
  0100 1000   <- top 3 bits match those of π2(F): Match!
  0111 1101
  1011 0100
  1011 1111

Algorithm: Solution to the Hamming Distance Problem (cont.)

- With k = 3, only the matching fingerprint in the first table is a near-duplicate of the F fingerprint:

  Table 1: π1(F) = 1101 0100 vs. candidate 1100 0101
  -> they differ in 2 bit positions (2 <= 3), so the candidate is a near-duplicate

  Table 2: π2(F) = 0101 0011 vs. candidate 0100 1000
  -> they differ in 4 bit positions (4 > 3), so the candidate is not a near-duplicate

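A minimal Python sketch of this permute-sort-probe scheme, using the example's two 8-bit permutations (the helper functions and the split of fingerprints across the two tables follow the example above, not the paper's exact layout):

    from bisect import bisect_left

    F_BITS = 8  # fingerprint width in this example

    def swap_halves(x):
        # pi for table 1: swap the last four bits with the first four
        return ((x & 0x0F) << 4) | (x >> 4)

    def last_two_to_front(x):
        # pi for table 2: move the last two bits to the front
        return ((x & 0b11) << 6) | (x >> 2)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def probe(table, pi, p, F, k):
        # table: fingerprints already permuted by pi and sorted
        pf = pi(F)
        top = pf >> (F_BITS - p)                      # top p bits of pi(F)
        lo = bisect_left(table, top << (F_BITS - p))  # start of the matching run
        matches = []
        for g in table[lo:]:
            if g >> (F_BITS - p) != top:              # left the matching run
                break
            if hamming(g, pf) <= k:                   # full check on candidates
                matches.append(g)
        return matches

    fps = [0b11000101, 0b11111111, 0b01011100, 0b01111110,
           0b11111110, 0b00100001, 0b11110101, 0b11010010]
    table1 = sorted(swap_halves(x) for x in fps[:4])
    table2 = sorted(last_two_to_front(x) for x in fps[4:])

    F = 0b01001101
    print(probe(table1, swap_halves, 3, F, 3))        # [197] = 1100 0101
    print(probe(table2, last_two_to_front, 3, F, 3))  # []
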
Algorithm: Compression of Tables

1. Store the first fingerprint in a block (1024 bytes)
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the most significant 1-bit
4. Append to the block the bits after the most significant 1-bit
5. Repeat steps 2-4 until the block is full

- Comparing to the query fingerprint:
  - Use the last fingerprint in each block as its key and perform interpolation search over the keys to find and decompress the appropriate block (a sketch of the encoding follows below)

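A rough sketch of the XOR-delta encoding in steps 2-4. For brevity it writes the bit position at a fixed 6-bit width instead of the Huffman code the paper calls for, and it assumes the fingerprints are sorted and distinct:

    def delta_encode(sorted_fps, f=64):
        # Step 1: the first fingerprint is stored verbatim.
        bits = format(sorted_fps[0], f"0{f}b")
        for prev, cur in zip(sorted_fps, sorted_fps[1:]):
            d = prev ^ cur                 # step 2: XOR with the previous fingerprint
            msb = d.bit_length() - 1       # position of the most significant 1-bit
            bits += format(msb, "06b")     # step 3: fixed-width stand-in for Huffman
            if msb > 0:
                # Step 4: append the bits after the most significant 1-bit.
                bits += format(d & ((1 << msb) - 1), f"0{msb}b")
        return bits

Because the fingerprints are sorted, consecutive XOR deltas have few significant bits, which is what makes the blocks compress well.
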
Algorithm: Extending to Batch Queries

- Problem: Want to get near-duplicates for a batch of query fingerprints, not just one
- Solution:
  - Use the Google File System (GFS) and MapReduce
  - Create two files:
    - File F has the collection of fingerprints
    - File Q has the query fingerprints
  - Store the files in GFS
    - GFS breaks the files up into chunks
  - Use MapReduce to solve the Hamming Distance Problem for each chunk of F for all queries in Q (see the schematic below)
    - MapReduce allows a task to be created per chunk
    - The tasks iterate through the chunks in parallel
    - Each task produces an output of the near-duplicates it found
  - Produce a sorted file from the outputs of the tasks
    - Remove duplicates if necessary

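A schematic of the per-chunk work in plain Python; the chunking, task scheduling, and file handling that GFS and MapReduce provide are only simulated here, and a brute-force scan stands in for the table-probing algorithm each task would actually run:

    def map_chunk(chunk, queries, k=3):
        # One task: scan a chunk of file F against every query in Q.
        out = []
        for fp in chunk:
            for q in queries:
                if bin(fp ^ q).count("1") <= k:
                    out.append((q, fp))
        return out

    def reduce_outputs(task_outputs):
        # Merge per-task outputs into one sorted, de-duplicated result.
        return sorted(set(pair for out in task_outputs for pair in out))

    # Chunks of F would come from GFS; here they are just lists.
    chunks = [[0b11000101, 0b11111111], [0b01011100, 0b11010010]]
    queries = [0b01001101]
    print(reduce_outputs(map_chunk(c, queries) for c in chunks))
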
Experiment: Parameters

- 8 billion web pages used
- k = 1 ... 10
- Manually tagged pairs as follows:
  - True positives: pairs that differ slightly
  - False positives: radically different pairs
  - Unknown: pairs that could not be evaluated

Experiment: Results

- Accuracy
  - Low k value -> many false negatives
  - High k value -> many false positives
  - Best value -> k = 3
    - 75% of near-duplicates reported
    - 75% of reported cases are true positives
- Running Time
  - Solution to the Hamming Distance Problem: O(log(p))
  - Batch query + compression:
    - 32 GB file & 200 tasks -> runs in under 100 seconds

Related Work

- Clustering related documents
  - Detect near-duplicates to show related pages
- Data extraction
  - Determine the schema of similar pages to obtain information
- Plagiarism
  - Detect pages that have borrowed from each other
- Spam
  - Detect spam before the user receives it

Tying it Back to Lecture

- Similarities
  - Indicated the importance of de-duplication for saving crawler resources
  - Briefly summarized several uses for near-duplicate detection
- Differences
  - Lecture focus: breadth-first look at algorithms for near-duplicate detection
  - Paper focus: in-depth look at simhash and the Hamming Distance algorithm
    - Includes how to implement them and how effective they are

Paper Evaluation: Pros

- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Included a brief description of how to improve the simhash + Hamming Distance algorithm
  - Categorize web-pages before running simhash, create an algorithm to remove ads or timestamps, etc.

Paper Evaluation: Cons

- No comparison
  - How much more effective or faster is it than other algorithms?
  - By how much did it improve the crawler?
- Batch queries are limited to a specific technology
  - The implementation required the use of GFS
  - An approach not restricted to a certain technology might be more widely applicable

Any Questions?