Crawling the Web for a Search Engine, or: Why Crawling is Cool

Talk Outline
- What is a crawler?
- Some of the interesting problems
- RankMass Crawler
- As time permits: refresh policies, duplicate detection

What is a Crawler?
[Diagram: the basic crawl loop. Initialize with seed URLs, repeatedly take the next URL from the to-visit queue, fetch the page from the web, extract its URLs, add unseen ones to the to-visit queue, and record visited URLs and downloaded pages.]

Applications
- Internet search engines: Google, Yahoo, MSN, Ask
- Comparison shopping services
- Data mining: Stanford WebBase, IBM WebFountain

Is that it? Not quite.

Crawling: The Big Picture
- Duplicate pages: mirror sites, identifying similar pages, templates
- Deep Web
- When to stop? Incremental crawlers, refresh policies, evolution of the Web
- Crawling the "good" pages first, focused crawling
- Distributed crawlers
- Crawler-friendly web servers

Today's Focus
- A crawler that guarantees coverage of the Web
- As time permits: refresh policies, duplicate detection techniques

RankMass Crawler: A Crawler with a High Personalized PageRank Coverage Guarantee

Motivation
- It is impossible to download the entire web (example: the endless pages generated by a single online calendar)
- When can we stop?
- How do we gain the most benefit from the pages we download?

Main Issues
- Crawler guarantee: a guarantee on how much of the "important" part of the Web is "covered" when the crawler stops. But if we never see a page, how do we know how important it is?
- Crawler efficiency: download "important" pages early in the crawl, and reach the coverage target with a minimum number of downloads.

Outline
- Formalize the coverage metric
- L-Neighbor: crawling with a RankMass guarantee
- RankMass: crawling to achieve high RankMass
- Windowed RankMass: how greedy do you want to be?
- Experimental results

Web Coverage Problem
- D: the potentially infinite set of documents on the web
- D_C: the finite set of documents in our document collection
- Assign an importance weight to each page

Web Coverage Problem: What Weights?
- Per query? Topic? Font? PageRank?

Why PageRank?
- Useful as an importance measure (the random-surfer model)
- Effective for ranking

PageRank: A Short Review
  r_i = d \sum_{p_j \in I(p_i)} r_j / c_j + (1 - d) / |D|
where I(p_i) is the set of pages linking to p_i and c_j is the number of out-links of p_j.
[Figure: a small four-page link graph (p1-p4) illustrating the recursion.]

Now It's Personal
- The jump vector can be personalized, TrustRank-style, or general: when interrupted, the surfer jumps according to a trust vector t rather than uniformly.
  r_i = d \sum_{p_j \in I(p_i)} r_j / c_j + (1 - d) t_i
[Figure: the same link graph, with the random jump directed to the trusted pages t_1, t_2 in a trusted set T.]

RankMass Defined
- Using personalized PageRank, formally define the RankMass of D_C:
  RM(D_C) = \sum_{p_i \in D_C} r_i
- Coverage guarantee: we seek a crawler such that, for a given \epsilon, the downloaded pages D_C at the time it stops satisfy
  RM(D_C) = \sum_{p_i \in D_C} r_i \ge 1 - \epsilon
- Efficient crawling: we seek a crawler that, for a given N, downloads |D_C| = N pages such that RM(D_C) is greater than or equal to that of any other D'_C \subseteq D with |D'_C| = N.

How to Calculate RankMass
- How do you compute RM(D_C) without downloading the entire web?
- We cannot compute it exactly, but we can lower-bound it.
- Start with a simple case.

Single Trusted Page
- T(1): t_1 = 1 and t_i = 0 for all i != 1 (the surfer always jumps to p_1 when bored)
- We can lower-bound the probability of being within L links of p_1
- N_L(p_1): the pages reachable from p_1 in at most L links

Lower Bound Guarantee: Single Trusted Page
- Theorem 1: Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p_1 is at least d^{L+1} close to 1. That is:
  \sum_{p_i \in N_L(p_1)} r_i \ge 1 - d^{L+1}

Lower Bound Guarantee: General Case
- Theorem 2: The RankMass of the L-neighbors of the set of all trusted pages G, N_L(G), is at least d^{L+1} close to 1. That is:
  \sum_{p_i \in N_L(G)} r_i \ge 1 - d^{L+1}

RankMass Lower Bound
- Lower bound for a single trusted page: \sum_{p_i \in N_L(p_1)} r_i \ge 1 - d^{L+1}
- Extension to a set of trusted pages G: \sum_{p_i \in N_L(G)} r_i \ge 1 - d^{L+1}
- This is the basis of a crawling algorithm with a coverage guarantee.

The L-Neighbor Crawler
1. L := 0
2. N[0] := {p_i | t_i > 0}   // start with the trusted pages
3. While (\epsilon < d^{L+1}):
   1. Download all uncrawled pages in N[L]
   2. N[L+1] := {all pages linked to by a page in N[L]}
   3. L := L + 1
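Read literally, the L-Neighbor pseudocode is a breadth-first crawl outward from the trusted pages that stops once d^{L+1} drops below \epsilon. Below is a minimal Python sketch of that loop; fetch(url) and extract_links(page) are hypothetical helpers supplied by the caller, and the code is an illustration of the pseudocode above, not the authors' implementation.

```python
# Minimal sketch of the L-Neighbor crawler from the slides.
# fetch(url) and extract_links(page) are hypothetical helpers, not part of the talk.

def l_neighbor_crawl(trusted_urls, epsilon, fetch, extract_links, d=0.85):
    """Breadth-first crawl from the trusted pages; the loop condition mirrors
    the slides' pseudocode (continue while epsilon < d^(L+1))."""
    crawled = {}                       # url -> page content
    frontier = set(trusted_urls)       # N[L]: the current neighborhood level
    level = 0
    while epsilon < d ** (level + 1):
        next_frontier = set()
        for url in frontier:
            if url not in crawled:     # download all uncrawled pages in N[L]
                page = fetch(url)
                crawled[url] = page
                next_frontier.update(extract_links(page))
        # N[L+1]: pages linked to by a page in N[L]; pages already crawled
        # were expanded earlier, so only new URLs stay in the frontier.
        frontier = next_frontier - set(crawled)
        level += 1
    return crawled
```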
But What About Efficiency?
- L-Neighbor is essentially a BFS: simple and efficient
- We may wish to prioritize further: visit certain neighborhoods first
- This leads to page-level prioritization
[Figure: two trusted pages with t_0 = 0.99 and t_1 = 0.01; their neighborhoods deserve different priority.]

Page-Level Prioritizing
- We want a more fine-grained, page-level priority
- The idea: estimate PageRank per page, and give high priority to pages with a high PageRank estimate
- We cannot calculate exact PageRank, so we calculate a PageRank lower bound for undownloaded pages... but how?

Probability of Being at Page P
[Figure: the random-surfer model. At each step the surfer either clicks a link on the current page or is "interrupted" and jumps to a trusted page.]

Calculating the PageRank Lower Bound
- PageRank(p) = the probability that the random surfer is at p
- Break each surfer path down by "interrupts", i.e. jumps to a trusted page
- Sum over all paths that start with an interrupt and end at p
[Figure: a path starts with an interrupt (probability 1 - d), lands on a trusted page p_j (probability t_j), then follows links, each click contributing a factor of d times 1/out-degree (e.g. d * 1/3, d * 1/5), until the path reaches p_i.]

RankMass: Basic Idea
[Figure: animation of the crawl order. Pages are annotated with evolving PageRank lower bounds (e.g. p1: 0.99, p2: 0.01, p3-p5: 0.25, p6-p7: 0.09), and the crawler repeatedly downloads the page with the highest bound.]

RankMass Crawler: High Level
- That sounds complicated, but luckily we do not need all of it
- The idea: dynamically update the lower bound on PageRank, update the total RankMass, and download the page with the highest lower bound

RankMass Crawler (Shorter)
Variables:
- CRM: RankMass lower bound of the crawled pages
- rm_i: lower bound on the PageRank of p_i

RankMassCrawl()
  CRM = 0
  rm_i = (1 - d) t_i for each t_i > 0
  While (CRM < 1 - \epsilon):
    Pick the p_i with the largest rm_i
    Download p_i if not downloaded yet
    CRM = CRM + rm_i
    For each p_j linked to by p_i:
      rm_j = rm_j + (d / c_i) rm_i
    rm_i = 0

Greedy vs. Simple
- L-Neighbor is simple
- RankMass is very greedy, and its updates are expensive: random access to the web graph
- Compromise? Batch the downloads together and the updates together

Windowed RankMass
Variables:
- CRM: RankMass lower bound of the crawled pages
- rm_i: lower bound on the PageRank of p_i

Crawl()
  rm_i = (1 - d) t_i for each t_i > 0
  While (CRM < 1 - \epsilon):
    Download the top window% of pages according to rm_i
    For each page p_i in D_C:
      CRM = CRM + rm_i
      For each p_j linked to by p_i:
        rm_j = rm_j + (d / c_i) rm_i
      rm_i = 0

Experimental Setup
- HTML files only; algorithms simulated over a web graph crawled between December 2003 and January 2004
- 141 million URLs spanning 6.9 million host names and 233 top-level domains

Metrics of Evaluation
1. How much RankMass is collected during the crawl
2. How much RankMass is "known" to have been collected during the crawl
3. How much computational and performance overhead the algorithm introduces
[Plots: RankMass collected vs. number of downloads for L-Neighbor, RankMass, and Windowed RankMass, and the effect of the window size.]

Algorithm Efficiency
Algorithm           Downloads (0.98 guaranteed RankMass)   Downloads (0.98 actual RankMass)
L-Neighbor          7 million                              65,000
RankMass            131,072                                27,939
Windowed RankMass   217,918                                30,826
Optimal             27,101                                 27,101

Algorithm Running Time
Window         Hours   Number of Iterations   Number of Documents
L-Neighbor     1:27    13                     83,638,834
20% Windowed   4:39    44                     80,622,045
10% Windowed   10:27   85                     80,291,078
5% Windowed    17:52   167                    80,139,289
RankMass       25:39   Not comparable         10,350,000

Refresh Policies

Refresh Policy: Problem Definition
- You have N URLs you want to keep fresh
- Limited resources: f documents per second
- Choose a download order that maximizes average freshness
- What do you do?
- Note: you cannot always know how the live page really looks

The Optimal Solution
- Depends on the definition of freshness
- Freshness is boolean: a page is either fresh or not; one small change makes it unfresh

Understanding Freshness Better
- A two-page database: P_d changes daily, P_w changes once a week
- We can refresh one page per week
- How should we visit the pages?
  - Uniform: alternate between the two pages (P_d, P_w, P_d, P_w, ...)
  - Proportional: visit in proportion to the change rate (P_d, P_d, P_d, P_d, P_d, P_d, P_w, ...)
  - Other?
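Before the next slide gives the answer, the two-page example can be checked numerically. The short simulation below is a sketch under stated assumptions that are not part of the talk: P_d changes at one uniformly random hour each day, P_w at one random hour each week, freshness is boolean, and exactly one page is refreshed per week.

```python
# Sketch (assumptions stated above, not taken from the slides): simulate
# boolean freshness for the two-page example under different refresh policies.
import random

HOURS_PER_WEEK = 7 * 24

def simulate(policy, weeks=5000):
    """policy is a list of page names cycled week by week, e.g. ["Pd", "Pw"]."""
    random.seed(0)
    fresh_hours = {"Pd": 0, "Pw": 0}
    stale = {"Pd": False, "Pw": False}
    for week in range(weeks):
        refresh_page = policy[week % len(policy)]
        refresh_hour = random.randrange(HOURS_PER_WEEK)
        change_hours = {                      # hours at which each live page changes
            "Pd": {day * 24 + random.randrange(24) for day in range(7)},
            "Pw": {random.randrange(HOURS_PER_WEEK)},
        }
        for hour in range(HOURS_PER_WEEK):
            for page in ("Pd", "Pw"):
                if page == refresh_page and hour == refresh_hour:
                    stale[page] = False       # refresh resynchronizes our copy
                if not stale[page]:
                    fresh_hours[page] += 1
                if hour in change_hours[page]:
                    stale[page] = True        # the live page changed; copy is stale
    total = weeks * HOURS_PER_WEEK
    return {page: hours / total for page, hours in fresh_hours.items()}

# Uniform alternates between the pages; proportional mimics the
# Pd,Pd,Pd,Pd,Pd,Pd,Pw schedule from the slide.
for name, policy in [("uniform", ["Pd", "Pw"]),
                     ("proportional", ["Pd"] * 6 + ["Pw"])]:
    freshness = simulate(policy)
    avg = sum(freshness.values()) / len(freshness)
    print(f"{name:12s} Pd={freshness['Pd']:.3f} Pw={freshness['Pw']:.3f} "
          f"average={avg:.3f}")
```

Under these assumptions the uniform schedule ends up with a higher average freshness than the proportional one, which is exactly the point the next slide makes.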
Proportional Is Often Not Good!
- Visiting the fast-changing P_d buys at most about half a day of freshness
- Visiting the slow-changing P_w buys about half a week of freshness
- Visiting P_w is the better deal!

Optimal Refresh Frequency
- Problem: given change rates \lambda_1, \lambda_2, ..., \lambda_N, find refresh frequencies f_1, f_2, ..., f_N that maximize the average freshness
  \bar{F}(S) = (1/N) \sum_{i=1}^{N} \bar{F}(e_i)
  subject to the resource constraint
  (1/N) \sum_{i=1}^{N} f_i = f

Optimal Refresh Frequency (continued)
- The shape of the optimal-frequency curve is the same in all cases
- The result holds for any change-frequency distribution

Do Not Crawl in the DUST: Different URLs Similar Text
Ziv Bar-Yossef (Technion and Google), Idit Keidar (Technion), Uri Schonfeld (UCLA)

Even the WWW Gets Dusty
- DUST: Different URLs with Similar Text. Examples:
  - Default directory files: "/index.html" -> "/"
  - Domain names and virtual hosts: "news.google.com" -> "google.com/news"
  - Aliases and symbolic links: "~shuri" -> "/people/shuri"
  - Parameters with little effect on content: ?Print=1
  - URL transformations: "/story_<num>" -> "story?id=<num>"

Why Care About DUST?
- Reduce crawling and indexing: avoid fetching the same document more than once
- Canonization for better ranking: references to a document may be split among its aliases
- Avoid returning duplicate results
- Many algorithms that use URLs as unique IDs will benefit

Related Work
- Similarity detection via document sketches [Broder et al., Hoad-Zobel, Shivakumar et al., Di Iorio et al., Brin et al., Garcia-Molina et al.]: requires fetching all document duplicates; cannot be used to find DUST rules
- Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]: not suitable for finding site-specific DUST rules
- Mining association rules [Agrawal and Srikant]: a technically different problem

So What Are We Looking For?

Our Contributions
- DustBuster, an algorithm that discovers site-specific DUST rules from a URL list without examining page content, and requires only a small number of page fetches to validate the rules
- A site-specific URL canonization algorithm (a toy sketch of applying such rules appears after these notes)
- Experiments on real data from both web access logs and crawl logs

DUST Rules
- A valid DUST rule is a mapping \Psi that maps each valid URL u to a valid URL \Psi(u) with similar content:
  - "/index.html" -> "/"
  - "news.google.com" -> "google.com/news"
  - "/story_<num>" -> "story?id=<num>"
- Invalid DUST rules either do not preserve similarity or do not produce valid URLs

Types of DUST Rules
- Substring substitution DUST (the focus of this talk):
  - "story_1259" -> "story?id=1259"
  - "news.google.com" -> "google.com/news"
  - "/index.html" -> ""
- Parameter DUST: removing a parameter, or replacing its value with a default value
  - "Color=pink" -> "Color=black"

Basic Detection Framework
- Input: a list of URLs from a site (a crawl log or web access log)
- Detect likely DUST rules (no fetching at this stage)
- Eliminate redundant rules
- Validate the DUST rules using samples
- Pipeline: Detect -> Eliminate -> Validate

Example: Instances and Support
[Figure: example URL pairs that are instances of a rule, and the rule's support in the URL list.]
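The slides describe substring-substitution DUST rules but do not show how a crawler would apply them. The toy sketch below, which is not DustBuster and not the paper's canonization algorithm, illustrates the end product: once a set of rules has been validated, URLs can be rewritten to a canonical form before crawling or indexing. The rule list mirrors the examples on the slides; everything else is invented for illustration.

```python
# Toy sketch (not DustBuster): canonize URLs with already-validated
# substring-substitution DUST rules of the form "replace substring A with B".
# The rule list and function name are illustrative only.

DUST_RULES = [
    ("/index.html", "/"),                    # default directory files
    ("news.google.com", "google.com/news"),  # virtual host alias
    ("/story_", "/story?id="),               # "/story_<num>" -> "story?id=<num>"
]

def canonize(url, rules=DUST_RULES):
    """Apply matching rules until the URL stops changing.

    A real canonizer must ensure the rule set cannot cycle; these example
    rules always remove the pattern they match, so the loop terminates."""
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            if pattern in url:
                new_url = url.replace(pattern, replacement, 1)
                if new_url != url:
                    url = new_url
                    changed = True
    return url

print(canonize("http://news.google.com/story_1259"))
# -> http://google.com/news/story?id=1259
print(canonize("http://example.com/a/index.html"))
# -> http://example.com/a/
```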