Optimal Crawling Strategies for Web Search Engines
Wolf, Sethuraman, Ozsen
Presented by Rajat Teotia

Overview:
• Why do we care?
• Purpose of the paper
• Proposed solution for optimal crawling
• Pros
• Cons
• Current trends

Why Do We Care?
• Search engines use crawlers in an automated manner to build local repositories of web pages.
• These local copies are used for later processing, such as building the index and running ranking algorithms.
• Because the web is dynamic, pages are updated frequently.
• Maintaining fresh copies of these pages requires an efficient crawling mechanism.

Purpose of the Paper
• The paper provides efficient solutions to:
1. The optimal crawling frequency problem.
2. The crawl scheduling problem.
3. Minimizing the average staleness over all web pages.
4. Minimizing the search engine embarrassment level metric.
5. Achieving an optimal crawling mechanism through efficient resource allocation algorithms.

Solution: Minimize Staleness over All Web Pages
• The size of the web is estimated at 10+ billion pages.
• Studies estimate that around 25-30% of web pages change daily.
• Maintaining a fresh web page repository therefore requires an efficient crawling algorithm.
• An efficient crawling algorithm has two main aspects:
1) Optimal frequency: the number of crawls for each web page over a fixed period of time, and the ideal times between those crawls.
2) Efficient scheduling of the crawl tasks.
• Update patterns vary: some pages are updated in a quasi-deterministic manner, while others tend to be updated in a Poisson manner.

Solution: Optimal Frequency Problem
• Compute a probability function capturing whether the search engine holds a stale copy of web page i at an arbitrary time t in the interval [0, T].
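As a rough numerical illustration of this staleness computation (a sketch of my own, not code from the paper): assuming Poisson updates with rate λ, a copy fetched at crawl time c is stale at time t ≥ c with probability 1 − e^(−λ(t − c)), and averaging that probability over [0, T] gives the time-average staleness.

```python
import math

def time_average_staleness(lam, crawl_times, T, steps=10_000):
    """Numerically average P(copy is stale at time t) over [0, T].

    Assumes Poisson updates with rate `lam`: a copy fetched at crawl
    time c is stale at t >= c with probability 1 - exp(-lam * (t - c)).
    (Names and midpoint discretization are illustrative, not the paper's.)
    """
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * T / steps  # midpoint of each small sub-interval
        # most recent crawl at or before t (assume an initial crawl at time 0)
        last = max((c for c in crawl_times if c <= t), default=0.0)
        total += 1.0 - math.exp(-lam * (t - last))
    return total / steps

# More crawls over [0, T] should lower the time-average staleness:
one_crawl = time_average_staleness(1.0, [0.0], 10.0)
four_crawls = time_average_staleness(1.0, [0.0, 2.5, 5.0, 7.5], 10.0)
```

With λT = 10 and a single crawl at time 0 the copy is stale most of the time, while four evenly spaced crawls shorten each exposure window; the paper derives the crawl counts and times that minimize this quantity (weighted by page importance) rather than evaluating fixed schedules as done here.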
• From this, compute the time-average staleness estimate by averaging the probability function over all t in [0, T].
• Choose crawl times that minimize this time-average staleness estimate.
• Determine the importance (weight) of each web page, used to order the results of a search query; this is captured by the search engine embarrassment metric.

Search Engine Embarrassment Level Metric
• The frequency with which a client makes a query and finds that a returned page is inconsistent.
• Case 1 (lucky): a stale page is not returned to the user.
• Case 2 (unlucky): a stale page is returned but not clicked by the user.
• Case 3: a stale page is returned and the user clicks it, but it still answers the query.
• Case 4: the returned page is inconsistent with respect to the query (the embarrassing case).

Solution: Greedy Approach for Resource Allocation
• The probability that a user clicks a page depends on its result position; this gives the page's weight.
• For the quasi-deterministic update case, crawls should be made at the potential update times.
• To find the optimal crawl times, the authors solve the resource allocation problem using dynamic programming and greedy algorithms.
• The optimal crawl interval is searched for between a minimum and a maximum bound.

Solution: Optimum Scheduling Problem
• Given the number of crawls needed to keep a page fresh over a time period T, the problem is to decide the optimal times for those crawls.
• In most cases, scheduling a crawl a bit early or a bit late does not affect performance much.
• For the quasi-deterministic process, however, being late is acceptable, but being early is not useful (an early crawl misses the update).
• This scheduling problem can be posed and solved as a transportation / network-flow problem.
• A bipartite network graph with one-sided flow depicts the problem.

Solution: Optimum Scheduling Problem (contd.)
• Let C be the total number of crawlers and S the number of crawl tasks in time period T.
• Each task node has a supply of 1 unit, and there is one demand node per (time slot, crawler) pair.
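As a toy version of this formulation (my own sketch; the paper solves the exact min-cost transportation problem with network-flow techniques, whereas this brute-forces tiny instances): each crawl task has an ideal time, each (crawler, time-slot) pair is a demand node of capacity 1, and the asymmetric cost encodes that a late crawl only adds staleness while an early crawl can miss the update entirely.

```python
from itertools import permutations

def quasi_det_cost(ideal, actual, early_penalty=10.0):
    """Asymmetric deviation cost: an early crawl may miss the update
    entirely, so it is penalized far more heavily than a late one.
    (The penalty factor is an illustrative choice, not from the paper.)"""
    if actual < ideal:
        return (ideal - actual) * early_penalty
    return actual - ideal

def schedule_crawls(ideal_times, slot_times, cost=quasi_det_cost):
    """Assign each crawl task to a distinct (crawler, time-slot) demand
    node, minimizing total cost. Brute force over assignments, fine only
    for toy sizes; the paper solves the same bipartite structure as a
    transportation/network-flow problem."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(slot_times)), len(ideal_times)):
        c = sum(cost(ideal_times[i], slot_times[j]) for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = list(perm), c
    return best, best_cost

# Two tasks with ideal times 1.0 and 3.0; three available (crawler, slot) times:
assignment, total = schedule_crawls([1.0, 3.0], [0.0, 1.0, 3.5])
```

Here the task with ideal time 1.0 takes the slot at 1.0, and the task with ideal time 3.0 takes the slightly late slot at 3.5, since any early slot is heavily penalized.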
• Tasks are indexed by 1 ≤ l ≤ S and crawlers by 1 ≤ k ≤ C, where k identifies an individual crawler and l a crawl task.
• The solution of this transportation problem guarantees the existence of an optimal schedule as an optimal flow.

Parameterization of the Update Process
• The last crawl time alone tells us nothing about updates that have occurred since that crawl.
• Crawl times, observed update patterns, and page data can be combined to estimate the statistical properties of the update process.
• This information can then be used to build a probability distribution of the inter-update times for any page.

Pros:
• Precisely describes an optimal crawling process that reduces the staleness of web pages.
• Provides a good introduction to the search engine embarrassment metric.
• Provides schemes for the optimal number of crawls for a dynamic page using dynamic programming.
• Gives a clear picture of the optimal crawling schedule.

Cons:
• The research data is quite outdated, and many advances have been made since then.
• No strategy is proposed for handling content replication.
• The rise of blogs, forums, and social networking sites has changed how page weights are calculated.

Conclusion:
• The crawling process can improve the quality of service provided by a search engine.
• The optimal crawling process and scheduling algorithm play a vital role in determining the quality and freshness of web pages.
• The overall objective is to reduce the search engine embarrassment metric and provide the best possible search results.

Further Research:
• Event-driven web page crawlers able to fetch AJAX-based content.
• Adaptive, model-based crawling strategies; fixed-order vs. random-order crawling.
• Implementing ranking-based crawling strategies.
• Crawling strategies that account for page replication, to reduce the crawling load.

Current Trends:
• Building adaptive, model-based web crawlers.
• Using separate crawling strategies for finding fresh pages and for deep crawls (e.g.
Google's organic crawl): a fresh bot fetches fresh pages, while a deep-crawl bot indexes all web pages.
• Duplicate-content-aware crawling to reduce the crawling load.

Current Trends (contd.):
• URL ordering and queuing based on priority.
• Context-focused crawling for better results.
• Distributed crawling and multi-threaded crawlers.
• Crawling and real-time web search.

References
• J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, L. Ozsen (2002). "Optimal crawling strategies for web search engines". In Proceedings of the 11th International Conference on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511465
• Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". In Proceedings of the Tenth Conference on World Wide Web, Hong Kong: Elsevier Science, pp. 106-113. doi:10.1145/371920.371960
• Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). "Focused crawling using context graphs". In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pp. 527-534, Cairo, Egypt.
• Pant, Gautam; Srinivasan, Padmini; Menczer, Filippo (2004). "Crawling the Web". In Levene, Mark; Poulovassilis, Alexandra (eds.), Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer, pp. 153-178. ISBN 9783540406761.
• Articles from Search Engine Journal, Search Engine Roundtable, Wikipedia, …

Q&A