DETECTING NEAR-DUPLICATES FOR WEB CRAWLING
Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola

Outline
- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions

De-duplication
- The process of eliminating near-duplicate web documents in a generic crawl.
- The challenge is near-duplicates: identifying exact duplicates is easy (use checksums), but how do we identify near-duplicates?
- Near-duplicates are identical in content but differ in small areas: ads, counters, and timestamps.

Goal of the Paper
- Present a near-duplicate detection system that improves web crawling.
- The near-duplicate detection system includes:
  - The simhash technique: transforms a web page into an f-bit fingerprint.
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions.

Why is De-duplication Important?
Elimination of near-duplicates:
- Saves network bandwidth: no need to crawl content that is similar to previously crawled content.
- Reduces storage cost: no need to store content in the local repository if it is similar to previously crawled content.
- Improves the quality of search indexes: the local repository used for building search indexes is not polluted by near-duplicates.

Algorithm: Simhash Technique
- Convert the web page to a set of features using information-retrieval techniques (e.g., tokenization, phrase detection).
- Give a weight to each feature.
- Hash each feature into an f-bit value.
- Maintain an f-dimensional vector, with every dimension starting at 0, and update it with each feature's weight:
  - If the i-th bit of the feature's hash value is one, add the feature's weight to the i-th vector component.
  - If the i-th bit of the hash value is zero, subtract the feature's weight from the i-th vector component.
- The resulting vector has positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint.

Algorithm: Simhash Technique (cont.)
A very simple example: one web page with the text "Simhash Technique", reduced to two features:
- "Simhash" -> weight = 2
- "Technique" -> weight = 4
Hash the features to 4 bits:
- "Simhash" -> 1101
- "Technique" -> 0110

Start the vector with all zeroes: (0, 0, 0, 0)

Apply the "Simhash" feature (hash 1101, weight = 2):

  hash bit | calculation | component
  1        | 0 + 2       | 2
  1        | 0 + 2       | 2
  0        | 0 - 2       | -2
  1        | 0 + 2       | 2

Apply the "Technique" feature (hash 0110, weight = 4):

  hash bit | calculation | component
  0        | 2 - 4       | -2
  1        | 2 + 4       | 6
  1        | -2 + 4      | 2
  0        | 2 - 4       | -2

Final vector: (-2, 6, 2, -2). The signs of the components are -, +, +, -, so the final 4-bit fingerprint is 0110.
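To make the procedure concrete, here is a minimal Python sketch of the simhash computation described above. The feature extraction and weighting are simplified stand-ins (plain tokenization, with token counts as weights), md5 is used only as a convenient stand-in hash, and none of these choices are the paper's exact ones, so the toy numbers from the slides will not be reproduced.

```python
import hashlib
from collections import Counter

def simhash(text: str, f: int = 64) -> int:
    """Compute an f-bit simhash fingerprint of `text`.

    Features are plain tokens and weights are token counts --
    simplified stand-ins for the IR techniques the paper assumes.
    """
    v = [0] * f  # f-dimensional vector, all components start at 0
    for token, weight in Counter(text.lower().split()).items():
        # Hash the feature into an f-bit value (md5 as a stand-in hash).
        h = int.from_bytes(hashlib.md5(token.encode()).digest(), "big") & ((1 << f) - 1)
        for i in range(f):
            # i-th bit is one -> add the weight; zero -> subtract it.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # The sign of each component gives one bit of the fingerprint.
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Near-duplicate pages then map to fingerprints that differ in only a few bit positions, which is what the Hamming distance machinery on the next slides exploits.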
Algorithm: Solution to the Hamming Distance Problem
Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions.
Solution:
- Create tables containing the fingerprints. Each table has a permutation (pi) and a small integer (p) associated with it.
- Apply the permutation associated with each table to its fingerprints.
- Sort the tables.
- Store the tables in the main memory of a set of machines.
- Probe the tables in parallel: in table i, find all permuted fingerprints whose top p_i bits match the top p_i bits of pi_i(F); for each fingerprint that matches, check whether it differs from pi_i(F) in at most k bits.

Algorithm: Solution to the Hamming Distance Problem (cont.)
A simple example: F = 0100 1101, k = 3, and a collection of 8 fingerprints. Create two tables, each holding four of the fingerprints:

Table 1 (p = 3; pi = swap the last four bits with the first four):
  1100 0101, 1111 1111, 0101 1100, 0111 1110
Table 2 (p = 3; pi = move the last two bits to the front):
  1111 1110, 0010 0001, 1111 0101, 1101 0010

Apply each table's permutation to its fingerprints:

Table 1: 0101 1100, 1111 1111, 1100 0101, 1110 0111
Table 2: 1011 1111, 0100 1000, 0111 1101, 1011 0100

Sort the tables:

Table 1: 0101 1100, 1100 0101, 1110 0111, 1111 1111
Table 2: 0100 1000, 0111 1101, 1011 0100, 1011 1111

Probe with F = 0100 1101:
- pi_1(F) = 1101 0100; its top 3 bits (110) match 1100 0101 in Table 1.
- pi_2(F) = 0101 0011; its top 3 bits (010) match 0100 1000 in Table 2.

With k = 3, only the candidate in the first table is a near-duplicate of F:
- 1100 0101 differs from pi_1(F) = 1101 0100 in 2 bit positions (<= 3): a near-duplicate.
- 0100 1000 differs from pi_2(F) = 0101 0011 in 4 bit positions (> 3): not a near-duplicate.
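Below is a minimal Python sketch of this table-probe idea, assuming 8-bit fingerprints and the two example permutations. Two simplifications are worth flagging: each table here holds a permuted copy of the entire collection (as the paper's design does, whereas the toy slides split the collection to keep the figures small), and the probe scans linearly where a real system would binary-search the sorted table. The function names are illustrative, not from the paper.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def swap_halves(x: int) -> int:
    """pi_1: swap the last four bits with the first four (8-bit values)."""
    return ((x & 0x0F) << 4) | (x >> 4)

def rotate2(x: int) -> int:
    """pi_2: move the last two bits to the front (8-bit values)."""
    return ((x & 0x03) << 6) | (x >> 2)

def near_duplicates(F, fingerprints, k=3, p=3, f=8):
    """Probe permuted, sorted tables for fingerprints within distance k of F."""
    results = set()
    for perm in (swap_halves, rotate2):
        # Each table stores (permuted fingerprint, original) pairs, sorted.
        table = sorted((perm(g), g) for g in fingerprints)
        top = perm(F) >> (f - p)              # top p bits of pi(F)
        for pg, g in table:                   # a real system binary-searches here
            if pg >> (f - p) == top and hamming(pg, perm(F)) <= k:
                results.add(g)
    return results

collection = [0b11000101, 0b11111111, 0b01011100, 0b01111110,
              0b11111110, 0b00100001, 0b11110101, 0b11010010]
print([f"{g:08b}" for g in near_duplicates(0b01001101, collection)])
# -> ['01011100'], matching the worked example above
```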
Algorithm: Compression of Tables
1. Store the first fingerprint of a block verbatim (blocks are 1024 bytes).
2. XOR the current fingerprint with the previous one.
3. Append to the block the Huffman code for the position of the most-significant 1 bit.
4. Append to the block the bits after the most-significant 1 bit.
5. Repeat steps 2-4 until the block is full.
To compare against a query fingerprint, use the last fingerprint in each block as its key, interpolation-search over the keys, and decompress only the appropriate block. (A code sketch of this scheme follows the next slide.)

Algorithm: Extending to Batch Queries
Problem: we want near-duplicates for a batch of query fingerprints, not just one.
Solution: use the Google File System (GFS) and MapReduce (sketched below).
- Create two files: file F holds the collection of fingerprints; file Q holds the query fingerprints.
- Store the files in GFS, which breaks them into chunks.
- Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q: MapReduce creates a task per chunk, and the tasks work through the chunks in parallel.
- Each task outputs the near-duplicates it found; the per-task outputs are combined into one sorted file, removing duplicates if necessary.
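Returning to the compression scheme above, here is a minimal sketch of the delta-encoding step. It assumes distinct fingerprints sorted ascending, and it replaces the paper's Huffman code for the MSB position with a fixed 6-bit field to stay short; the 1024-byte block bookkeeping and the interpolation search over block keys are omitted.

```python
def compress_block(sorted_fps, f=64):
    """Delta-compress a sorted run of distinct f-bit fingerprints into a bit string.

    The first fingerprint is stored verbatim.  Each later fingerprint is XORed
    with its predecessor; we store the position of the most-significant 1 bit
    of that delta, then the bits after it.  (The paper Huffman-codes the
    position; a fixed 6-bit field keeps this sketch simple.)
    """
    bits = f"{sorted_fps[0]:0{f}b}"
    for prev, cur in zip(sorted_fps, sorted_fps[1:]):
        delta = prev ^ cur
        msb = delta.bit_length() - 1          # position of the most-significant 1 bit
        bits += f"{msb:06b}"
        if msb:
            bits += f"{delta & ((1 << msb) - 1):0{msb}b}"   # the bits after the MSB
    return bits

def decompress_block(bits, f=64):
    """Invert compress_block, recovering the original fingerprints."""
    fps = [int(bits[:f], 2)]
    i = f
    while i < len(bits):
        msb = int(bits[i:i + 6], 2); i += 6
        rest = int(bits[i:i + msb], 2) if msb else 0; i += msb
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps
```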
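GFS and MapReduce are Google-internal systems, so the batch pattern can only be imitated here: split F into chunks, run one task per chunk against all of Q, and merge the outputs. The sketch below uses Python's multiprocessing pool as a stand-in scheduler and brute-force comparison instead of the table probe, purely to stay self-contained; `probe_chunk` and `batch_near_duplicates` are hypothetical names.

```python
from multiprocessing import Pool

def hamming(a, b):
    return bin(a ^ b).count("1")

def probe_chunk(args):
    """One 'task': compare one chunk of file F against every query in Q."""
    chunk, queries, k = args
    return {(q, g) for q in queries for g in chunk if hamming(q, g) <= k}

def batch_near_duplicates(F_fps, Q_fps, k=3, chunk_size=1000):
    # GFS would break file F into chunks; here we just slice the list.
    chunks = [F_fps[i:i + chunk_size] for i in range(0, len(F_fps), chunk_size)]
    # MapReduce would schedule one task per chunk; Pool.map plays that role.
    # (Call this under `if __name__ == "__main__":` on spawn-based platforms.)
    with Pool() as pool:
        parts = pool.map(probe_chunk, [(c, Q_fps, k) for c in chunks])
    # Merge the per-task outputs into one sorted, de-duplicated result.
    return sorted(set().union(*parts))
```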
Experiment: Parameters
- 8 billion web pages used.
- k = 1..10.
- Pairs were manually tagged as follows:
  - True positives: pairs that differ only slightly.
  - False positives: radically different pairs.
  - Unknown: pairs that could not be evaluated.

Experiment: Results
- Accuracy:
  - A low k value yields many false negatives; a high k value yields many false positives.
  - The best value is k = 3: 75% of near-duplicates are reported, and 75% of reported cases are true positives.
- Running time:
  - Solution to the Hamming Distance Problem: O(log(p)).
  - Batch query + compression: a 32 GB file with 200 tasks runs in under 100 seconds.

Related Work
- Clustering related documents: detect near-duplicates to show related pages.
- Data extraction: determine the schema of similar pages to obtain information.
- Plagiarism: detect pages that have borrowed from each other.
- Spam: detect spam before the user receives it.

Tying it Back to Lecture
- Similarities:
  - Both indicated the importance of de-duplication for saving crawler resources.
  - Both briefly summarized several uses for near-duplicate detection.
- Differences:
  - Lecture focus: a breadth-first look at algorithms for near-duplicate detection.
  - Paper focus: an in-depth look at simhash and the Hamming distance algorithm, including how to implement them and how effective they are.

Paper Evaluation: Pros
- Thorough step-by-step explanation of the algorithm implementation.
- Thorough explanation of how the conclusions were reached.
- Includes a brief description of how to improve the simhash + Hamming distance algorithm: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.

Paper Evaluation: Cons
- No comparison: how much more effective or faster is it than other algorithms? By how much did it improve the crawler?
- Batch queries are limited to a specific technology: the implementation required GFS; an approach not restricted to one technology might be more widely applicable.

Any Questions?