DETECTING NEAR-DUPLICATES FOR WEB CRAWLING
Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola

Outline
- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions

De-duplication
- The process of eliminating near-duplicate web documents in a generic crawl.
- The challenge is near-duplicates: identifying exact duplicates is easy (use checksums), but how do we identify near-duplicates?
- Near-duplicates are identical in content but differ in small areas: ads, counters, and timestamps.

Goal of the Paper
- Present a near-duplicate detection system that improves web crawling.
- The near-duplicate detection system includes:
  - The simhash technique: transforms a web page into an f-bit fingerprint.
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions.

Why is De-duplication Important?
Elimination of near-duplicates:
- Saves network bandwidth: no need to crawl content that is similar to previously crawled content.
- Reduces storage cost: no need to store content in the local repository if it is similar to previously crawled content.
- Improves the quality of search indexes: the local repository used for building search indexes is not polluted by near-duplicates.

Algorithm: Simhash Technique
- Convert the web page to a set of features using information-retrieval techniques (e.g., tokenization, phrase detection).
- Give a weight to each feature.
- Hash each feature into an f-bit value.
- Maintain an f-dimensional vector, with every dimension starting at 0, and update it with each feature's weight:
  - If the i-th bit of the feature's hash value is one, add the feature's weight to the i-th vector component.
  - If the i-th bit of the hash value is zero, subtract the feature's weight from the i-th vector component.
- The resulting vector has positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint.

Algorithm: Simhash Technique (cont.)
A very simple example: one web page with the text "Simhash Technique", reduced to two features:
- "Simhash" -> weight = 2
- "Technique" -> weight = 4
Hash the features to 4 bits:
- "Simhash" -> 1101
- "Technique" -> 0110

Start the vector with all zeroes: (0, 0, 0, 0)

Apply the "Simhash" feature (hash 1101, weight = 2):

  hash bit | calculation | component
  1        | 0 + 2       | 2
  1        | 0 + 2       | 2
  0        | 0 - 2       | -2
  1        | 0 + 2       | 2

Apply the "Technique" feature (hash 0110, weight = 4):

  hash bit | calculation | component
  0        | 2 - 4       | -2
  1        | 2 + 4       | 6
  1        | -2 + 4      | 2
  0        | 2 - 4       | -2

Final vector: (-2, 6, 2, -2). The signs of the components are -, +, +, -, so the final 4-bit fingerprint is 0110.
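To make the procedure concrete, here is a minimal Python sketch of the simhash computation described above. The feature extraction and weighting are simplified stand-ins (plain tokenization, with token counts as weights), md5 is used only as a convenient stand-in hash, and none of these choices are the paper's exact ones, so the toy numbers from the slides will not be reproduced.

```python
import hashlib
from collections import Counter

def simhash(text: str, f: int = 64) -> int:
    """Compute an f-bit simhash fingerprint of `text`.

    Features are plain tokens and weights are token counts --
    simplified stand-ins for the IR techniques the paper assumes.
    """
    v = [0] * f  # f-dimensional vector, all components start at 0
    for token, weight in Counter(text.lower().split()).items():
        # Hash the feature into an f-bit value (md5 as a stand-in hash).
        h = int.from_bytes(hashlib.md5(token.encode()).digest(), "big") & ((1 << f) - 1)
        for i in range(f):
            # i-th bit is one -> add the weight; zero -> subtract it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # The sign of each component gives one bit of the fingerprint.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Near-duplicate pages then map to fingerprints that differ in only a few bit positions, which is what the Hamming distance machinery on the next slides exploits.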
Algorithm: Solution to the Hamming Distance Problem
Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions.
Solution:
- Create tables containing the fingerprints. Each table has a permutation (pi) and a small integer (p) associated with it.
- Apply the permutation associated with each table to its fingerprints.
- Sort the tables.
- Store the tables in the main memory of a set of machines.
- Probe the tables in parallel: in table i, find all permuted fingerprints whose top p_i bits match the top p_i bits of pi_i(F); for each fingerprint that matches, check whether it differs from pi_i(F) in at most k bits.

Algorithm: Solution to the Hamming Distance Problem (cont.)
A simple example: F = 0100 1101, k = 3, and a collection of 8 fingerprints. Create two tables, each holding four of the fingerprints:

Table 1 (p = 3; pi = swap the last four bits with the first four):
  1100 0101, 1111 1111, 0101 1100, 0111 1110
Table 2 (p = 3; pi = move the last two bits to the front):
  1111 1110, 0010 0001, 1111 0101, 1101 0010

Apply each table's permutation to its fingerprints:

Table 1: 0101 1100, 1111 1111, 1100 0101, 1110 0111
Table 2: 1011 1111, 0100 1000, 0111 1101, 1011 0100

Sort the tables:

Table 1: 0101 1100, 1100 0101, 1110 0111, 1111 1111
Table 2: 0100 1000, 0111 1101, 1011 0100, 1011 1111

Probe with F = 0100 1101:
- pi_1(F) = 1101 0100; its top 3 bits (110) match 1100 0101 in Table 1.
- pi_2(F) = 0101 0011; its top 3 bits (010) match 0100 1000 in Table 2.

With k = 3, only the candidate in the first table is a near-duplicate of F:
- 1100 0101 differs from pi_1(F) = 1101 0100 in 2 bit positions (<= 3): a near-duplicate.
- 0100 1000 differs from pi_2(F) = 0101 0011 in 4 bit positions (> 3): not a near-duplicate.
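Below is a minimal Python sketch of this table-probe idea, assuming 8-bit fingerprints and the two example permutations. Two simplifications are worth flagging: each table here holds a permuted copy of the entire collection (as the paper's design does, whereas the toy slides split the collection to keep the figures small), and the probe scans linearly where a real system would binary-search the sorted table. The function names are illustrative, not from the paper.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def swap_halves(x: int) -> int:
    """pi_1: swap the last four bits with the first four (8-bit values)."""
    return ((x & 0x0F) << 4) | (x >> 4)

def rotate2(x: int) -> int:
    """pi_2: move the last two bits to the front (8-bit values)."""
    return ((x & 0x03) << 6) | (x >> 2)

def near_duplicates(F, fingerprints, k=3, p=3, f=8):
    """Probe permuted, sorted tables for fingerprints within distance k of F."""
    results = set()
    for perm in (swap_halves, rotate2):
        # Each table stores (permuted fingerprint, original) pairs, sorted.
        table = sorted((perm(g), g) for g in fingerprints)
        top = perm(F) >> (f - p)              # top p bits of pi(F)
        for pg, g in table:                   # a real system binary-searches here
            if pg >> (f - p) == top and hamming(pg, perm(F)) <= k:
                results.add(g)
    return results

collection = [0b11000101, 0b11111111, 0b01011100, 0b01111110,
              0b11111110, 0b00100001, 0b11110101, 0b11010010]
print([f"{g:08b}" for g in near_duplicates(0b01001101, collection)])
# -> ['01011100'], matching the worked example above
```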
Algorithm: Compression of Tables
1. Store the first fingerprint of a block verbatim (blocks are 1024 bytes).
2. XOR the current fingerprint with the previous one.
3. Append to the block the Huffman code for the position of the most-significant 1 bit.
4. Append to the block the bits after the most-significant 1 bit.
5. Repeat steps 2-4 until the block is full.
To compare against a query fingerprint, use the last fingerprint in each block as its key, interpolation-search over the keys, and decompress only the appropriate block. (A code sketch of this scheme follows the next slide.)

Algorithm: Extending to Batch Queries
Problem: we want near-duplicates for a batch of query fingerprints, not just one.
Solution: use the Google File System (GFS) and MapReduce (sketched below).
- Create two files: file F holds the collection of fingerprints; file Q holds the query fingerprints.
- Store the files in GFS, which breaks them into chunks.
- Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q: MapReduce creates a task per chunk, and the tasks work through the chunks in parallel.
- Each task outputs the near-duplicates it found; the per-task outputs are combined into one sorted file, removing duplicates if necessary.
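Returning to the compression scheme above, here is a minimal sketch of the delta-encoding step. It assumes distinct fingerprints sorted ascending, and it replaces the paper's Huffman code for the MSB position with a fixed 6-bit field to stay short; the 1024-byte block bookkeeping and the interpolation search over block keys are omitted.

```python
def compress_block(sorted_fps, f=64):
    """Delta-compress a sorted run of distinct f-bit fingerprints into a bit string.

    The first fingerprint is stored verbatim.  Each later fingerprint is XORed
    with its predecessor; we store the position of the most-significant 1 bit
    of that delta, then the bits after it.  (The paper Huffman-codes the
    position; a fixed 6-bit field keeps this sketch simple.)
    """
    bits = f"{sorted_fps[0]:0{f}b}"
    for prev, cur in zip(sorted_fps, sorted_fps[1:]):
        delta = prev ^ cur
        msb = delta.bit_length() - 1          # position of the most-significant 1 bit
        bits += f"{msb:06b}"
        if msb:
            bits += f"{delta & ((1 << msb) - 1):0{msb}b}"   # the bits after the MSB
    return bits

def decompress_block(bits, f=64):
    """Invert compress_block, recovering the original fingerprints."""
    fps = [int(bits[:f], 2)]
    i = f
    while i < len(bits):
        msb = int(bits[i:i + 6], 2); i += 6
        rest = int(bits[i:i + msb], 2) if msb else 0; i += msb
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps
```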
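GFS and MapReduce are Google-internal systems, so the batch pattern can only be imitated here: split F into chunks, run one task per chunk against all of Q, and merge the outputs. The sketch below uses Python's multiprocessing pool as a stand-in scheduler and brute-force comparison instead of the table probe, purely to stay self-contained; `probe_chunk` and `batch_near_duplicates` are hypothetical names.

```python
from multiprocessing import Pool

def hamming(a, b):
    return bin(a ^ b).count("1")

def probe_chunk(args):
    """One 'task': compare one chunk of file F against every query in Q."""
    chunk, queries, k = args
    return {(q, g) for q in queries for g in chunk if hamming(q, g) <= k}

def batch_near_duplicates(F_fps, Q_fps, k=3, chunk_size=1000):
    # GFS would break file F into chunks; here we just slice the list.
    chunks = [F_fps[i:i + chunk_size] for i in range(0, len(F_fps), chunk_size)]
    # MapReduce would schedule one task per chunk; Pool.map plays that role.
    # (Call this under `if __name__ == "__main__":` on spawn-based platforms.)
    with Pool() as pool:
        parts = pool.map(probe_chunk, [(c, Q_fps, k) for c in chunks])
    # Merge the per-task outputs into one sorted, de-duplicated result.
    return sorted(set().union(*parts))
```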
Experiment: Parameters
- 8 billion web pages used.
- k = 1..10.
- Pairs were manually tagged as follows:
  - True positives: pairs that differ only slightly.
  - False positives: radically different pairs.
  - Unknown: pairs that could not be evaluated.

Experiment: Results
- Accuracy:
  - A low k value yields many false negatives; a high k value yields many false positives.
  - The best value is k = 3: 75% of near-duplicates are reported, and 75% of reported cases are true positives.
- Running time:
  - Solution to the Hamming Distance Problem: O(log(p)).
  - Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds.

Related Work
- Clustering related documents: detect near-duplicates to show related pages.
- Data extraction: determine the schema of similar pages to obtain information.
- Plagiarism: detect pages that have borrowed from each other.
- Spam: detect spam before the user receives it.

Tying it Back to Lecture
- Similarities:
  - Both indicated the importance of de-duplication for saving crawler resources.
  - Both briefly summarized several uses for near-duplicate detection.
- Differences:
  - Lecture focus: a breadth-first look at algorithms for near-duplicate detection.
  - Paper focus: an in-depth look at simhash and the Hamming distance algorithm, including how to implement them and how effective they are.

Paper Evaluation: Pros
- Thorough step-by-step explanation of the algorithm implementation.
- Thorough explanation of how the conclusions were reached.
- Includes a brief description of how to improve the simhash + Hamming distance algorithm: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.

Paper Evaluation: Cons
- No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
- Batch queries are limited to a specific technology: the implementation required GFS; an approach not restricted to one technology might be more widely applicable.

Any Questions?