Optimal Crawling Strategies for Web Search Engines

Wolf, Sethuraman, Ozsen
Presented By
Rajat Teotia
Overview:
• Why do we care?
• Purpose of the paper
• Proposed solution for optimal crawling
• Pros
• Cons
• Current trends
Why Do We Care?
• Search engines use crawlers in an automated manner to
build local repositories containing web pages.
• These local copies of web pages are used for later
processing, such as creating the index and running ranking
algorithms.
• Because websites are dynamic, web pages are updated
frequently.
• To keep the local copies of these pages fresh, we need an
efficient crawling mechanism.
Purpose of the Paper
• This paper provides efficient solutions to:
1. The optimal crawling frequency problem.
2. The crawl scheduling problem.
3. Minimizing the average level of staleness over all
web pages.
4. Minimizing the search engine embarrassment level metric.
5. Using efficient resource allocation algorithms to
achieve an optimal crawling mechanism.
Solution: Minimize staleness over all web pages
• The size of the web is estimated to be 10+ billion pages.
• According to the study, around 25-30% of web pages change
daily.
• To maintain a fresh web page repository, an efficient crawling
algorithm should be used.
• The two main aspects of an efficient crawling algorithm are:
1) Optimal frequency: the number of crawls for each web page over a
fixed period of time, and the ideal crawl times within that period.
2) Efficient scheduling of these crawls.
• The algorithm must handle different update patterns: some pages are
updated in a quasi-deterministic manner, while others tend to be updated
in a Poisson manner.
Solution: Optimal frequency problem
• Compute a probability function that captures whether the
search engine has a stale copy of web page i at an
arbitrary time t in the interval [0, T].
• From this, compute the time-average staleness estimate
by averaging this probability function over all t within [0, T].
• Find the crawl times that minimize the time-average staleness
estimate.
• Determine the importance (weight) of each web page with respect to
the results returned for search queries. The search engine
embarrassment metric captures this weighting.
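The averaging step above can be made concrete for the Poisson case. The sketch below assumes that page updates follow a Poisson process with rate `rate` and that the crawls are equally spaced over [0, T]. The function names and the convention that a never-crawled page counts as always stale are illustrative assumptions, not taken from the paper.

```python
import math


def stale_probability(rate: float, elapsed: float) -> float:
    """Probability that a page has changed `elapsed` time units after a crawl,
    assuming Poisson updates with the given rate."""
    return 1.0 - math.exp(-rate * elapsed)


def average_staleness(rate: float, crawls: int, horizon: float) -> float:
    """Time-average of stale_probability over [0, horizon] when the page is
    crawled `crawls` times at equal spacing (an assumption for illustration)."""
    if rate == 0.0:
        return 0.0  # a page that never changes is never stale
    if crawls == 0:
        return 1.0  # assumed convention: never crawled counts as always stale
    gap = horizon / crawls
    # (1 / gap) * integral from 0 to gap of (1 - exp(-rate * t)) dt
    return 1.0 - (1.0 - math.exp(-rate * gap)) / (rate * gap)


if __name__ == "__main__":
    # More crawls over the same period give a lower average staleness.
    for crawls in (1, 2, 4, 8):
        print(crawls, average_staleness(rate=1.0, crawls=crawls, horizon=1.0))
```

Each extra crawl helps less than the previous one, which is what makes an incremental allocation of crawls across pages reasonable.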
Search engine embarrassment level metrics
• The frequency with which a client
makes a query and finds that the
resulting page is inconsistent with the query.
• Case 1: lucky case, the stale page is not
returned to the user.
• Case 2: the stale page is returned to
the user but the user does not click
on it.
• Case 3: the stale page is returned and the
user clicks it, but the page is still
correct with respect to the query.
• Case 4: the stale page is returned, clicked,
and inconsistent with the query.
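The four cases suggest that a stale page only causes embarrassment when it is returned, clicked, and wrong for the query. A minimal sketch of a per-page weight built from those three factors follows; the function name, the parameter names, and the exact way the factors are combined are assumptions made for illustration, not the paper's definition.

```python
def embarrassment_weight(position_probs, click_probs, wrong_if_stale):
    """Hypothetical per-page weight.

    position_probs[j]: probability the page is returned at result position j
    click_probs[j]:    probability a user clicks the result at position j
    wrong_if_stale:    probability a stale copy is inconsistent with the query
    """
    if len(position_probs) != len(click_probs):
        raise ValueError("one click probability is needed per position")
    # Probability the page is both returned and clicked (rules out cases 1 and 2).
    returned_and_clicked = sum(p * c for p, c in zip(position_probs, click_probs))
    # Only clicks that land on an inconsistent page count (case 4, not case 3).
    return returned_and_clicked * wrong_if_stale


if __name__ == "__main__":
    print(embarrassment_weight([0.5, 0.25], [0.4, 0.2], wrong_if_stale=0.5))
```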
Solution: Greedy approach for resource allocation
• The probability of a page being clicked
depends on its position in the results,
which feeds into the weight of the web page.
• In the quasi-deterministic case for
page updates, crawls should be
done at the potential update times.
• To solve the resource allocation
problem and find the
optimal crawl times, the authors
use dynamic programming and
greedy algorithms.
• The optimal time interval is sought
between a minimum and a maximum
bound.
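A greedy allocation of this kind can be sketched as follows. The sketch assumes Poisson updates with equally spaced crawls, so that each page's average staleness is a convex, decreasing function of its crawl count, and it gives each successive crawl to the page with the largest weighted reduction in staleness. The function names and the `(weight, rate)` page representation are hypothetical, and the minimum and maximum bounds are left out for brevity.

```python
import heapq
import math


def average_staleness(rate, crawls, horizon):
    """Average staleness of a Poisson-updated page crawled at equal spacing."""
    if rate == 0.0:
        return 0.0
    if crawls == 0:
        return 1.0  # assumed convention: never crawled counts as always stale
    gap = horizon / crawls
    return 1.0 - (1.0 - math.exp(-rate * gap)) / (rate * gap)


def greedy_allocate(pages, budget, horizon):
    """pages: list of (weight, update_rate). Returns crawls assigned per page."""
    alloc = [0] * len(pages)

    def gain(i):
        # Weighted staleness reduction from giving page i one more crawl.
        weight, rate = pages[i]
        before = average_staleness(rate, alloc[i], horizon)
        after = average_staleness(rate, alloc[i] + 1, horizon)
        return weight * (before - after)

    # Max-heap on marginal gain, stored as negated values for heapq.
    heap = [(-gain(i), i) for i in range(len(pages))]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return alloc


if __name__ == "__main__":
    # Hypothetical pages: (embarrassment weight, updates per day).
    print(greedy_allocate([(3.0, 2.0), (1.0, 0.5), (2.0, 5.0)], budget=10, horizon=1.0))
```

Because the marginal gains of each page shrink as it receives more crawls, always picking the current largest gain does not miss a better allocation under these assumptions.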
Solution: Optimal scheduling problem
• Given the number of crawls needed to obtain fresh copies of a page over a
time period T, the problem is to decide the optimal time
intervals between these crawls.
• In most cases, scheduling a crawl a bit early or a
bit late does not affect performance much.
• For the quasi-deterministic process, however, being late is
acceptable but being early is not useful.
• This scheduling problem can be posed and solved as a
transportation problem, which is a network flow problem.
• A bipartite network with flow in one direction models this
problem.
Solution: Optimal scheduling problem (continued)
• Let C be the total number of crawlers and S the number of
crawl tasks in time T.
• Each supply node has a supply of 1 unit, and
there is one demand node per time
slot and crawler pair.
• The demand nodes are indexed by 1 ≤ l ≤ S and
1 ≤ k ≤ C,
• where k identifies an individual crawler and S is
the number of tasks.
• The transportation problem formulation
ensures the existence of an
optimal solution with an optimal flow.
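The bipartite formulation can be sketched as a small min-cost flow: crawl tasks are supply nodes with one unit each, every (time slot, crawler) pair is a demand node that accepts one unit, and the edge cost expresses how undesirable that pairing is. The solver below is a generic successive-shortest-path routine written for illustration, not the paper's algorithm; the cost values, the integer-cost restriction, and the function name are assumptions.

```python
from collections import deque


def schedule_crawls(cost):
    """Assign each crawl task to one (time slot, crawler) node at minimum cost.

    cost[i][j] is the integer cost of running task i in demand node j, or
    None if that pairing is not allowed. Returns (total_cost, assignment),
    where assignment[i] is the node chosen for task i (None if unassigned).
    """
    n_tasks, n_nodes = len(cost), len(cost[0])
    source, sink = n_tasks + n_nodes, n_tasks + n_nodes + 1
    size = n_tasks + n_nodes + 2
    adj = [[] for _ in range(size)]
    head, cap, wt = [], [], []  # edges e and e ^ 1 form a forward/reverse pair

    def add_edge(u, v, capacity, weight):
        adj[u].append(len(head))
        head.append(v); cap.append(capacity); wt.append(weight)
        adj[v].append(len(head))
        head.append(u); cap.append(0); wt.append(-weight)

    for i in range(n_tasks):
        add_edge(source, i, 1, 0)  # each task supplies one unit
        for j in range(n_nodes):
            if cost[i][j] is not None:
                add_edge(i, n_tasks + j, 1, cost[i][j])
    for j in range(n_nodes):
        add_edge(n_tasks + j, sink, 1, 0)  # each node runs at most one task

    total = 0
    while True:
        # Queue-based Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [float("inf")] * size
        via = [-1] * size
        queued = [False] * size
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            queued[u] = False
            for e in adj[u]:
                v = head[e]
                if cap[e] > 0 and dist[u] + wt[e] < dist[v]:
                    dist[v] = dist[u] + wt[e]
                    via[v] = e
                    if not queued[v]:
                        queued[v] = True
                        queue.append(v)
        if via[sink] == -1:
            break  # no further task can be placed
        v = sink
        while v != source:  # push one unit of flow along the path
            e = via[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = head[e ^ 1]
        total += dist[sink]

    assignment = [None] * n_tasks
    for i in range(n_tasks):
        for e in adj[i]:
            if e % 2 == 0 and cap[e] == 0:  # saturated forward edge
                assignment[i] = head[e] - n_tasks
    return total, assignment


if __name__ == "__main__":
    # Two time slots x two crawlers: nodes 0-1 are slot 0, nodes 2-3 are slot 1.
    # All three tasks prefer slot 0; the costs of being late are hypothetical.
    costs = [[0, 0, 5, 5], [0, 0, 1, 1], [0, 0, 3, 3]]
    print(schedule_crawls(costs))
```

Marking a pairing as `None` is one way to express that crawling a quasi-deterministic page before its potential update time is not useful.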
Parameterization issue about update process:
• Information about the last crawl time does not tell us
anything about other updates that occurred since the last
crawl.
• Crawl times, update patterns and data can be used
together to estimate the statistical properties of the
update process.
• This information can then be used to build a
probability distribution for the interupdate times of any
page.
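As one illustration of turning crawl observations into a distribution parameter, the sketch below estimates a Poisson update rate from regularly spaced crawls that only reveal whether a page changed since the previous crawl. It uses the textbook maximum-likelihood estimate under that assumption; it is not presented as the paper's estimator, and the function and parameter names are hypothetical.

```python
import math


def estimate_update_rate(changes_seen: int, crawls: int, interval: float) -> float:
    """Maximum-likelihood Poisson rate, assuming `crawls` crawls spaced
    `interval` apart, of which `changes_seen` found the page changed."""
    if crawls <= 0 or interval <= 0:
        raise ValueError("crawls and interval must be positive")
    if not 0 <= changes_seen <= crawls:
        raise ValueError("changes_seen must be between 0 and crawls")
    if changes_seen == crawls:
        return math.inf  # changed every time: the rate is not identifiable
    # P(change detected in one interval) = 1 - exp(-rate * interval);
    # setting this equal to changes_seen / crawls and solving for the rate:
    return math.log(crawls / (crawls - changes_seen)) / interval


if __name__ == "__main__":
    # Hypothetical page crawled daily for 30 days and found changed 12 times.
    print(estimate_update_rate(changes_seen=12, crawls=30, interval=1.0))
```

The estimate is higher than the naive `changes_seen / (crawls * interval)` because several updates between two crawls are observed as a single change.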
Pros:
• Precisely describes the optimal crawling
process to reduce the staleness of web pages.
• Provides a good introduction to search
engine embarrassment metrics.
• Provides schemes for optimizing the number of
crawls for a dynamic page using dynamic
programming.
• Gives a clear idea of the optimal
crawling schedule.
Cons:
• The research data is quite outdated, and many
advances have been made since then.
• No strategy is proposed for
handling content replication.
• The introduction of blogs, forums and social
networking sites has changed the way
page weights are calculated.
Conclusion:
• The crawling process can improve the quality of
service provided by a search engine.
• The optimal crawling process and the scheduling
algorithm play a vital role in determining the
quality and freshness of web pages.
• The overall objective is to reduce the search engine
embarrassment metric and to provide the best
possible search results.
Further Research:
• Event-driven web page crawlers that can fetch Ajax-based
data.
• Adaptive model-based crawling strategies; fixed-order vs.
random-order crawling.
• Implementing ranking-based crawling strategies.
• Formulating crawling strategies that take page replication
into account, to reduce the crawling workload to some extent.
Current Trends:
• Building adaptive model-based web crawlers.
• Using separate crawling strategies for finding fresh pages
and for deep crawls (e.g. Google's organic crawl):
a fresh bot fetches fresh pages, and a deep crawl bot indexes
all the web pages.
• Duplicate-content-aware crawling to
reduce the crawling load.
Current Trends (continued)
• URL ordering and queuing based on priority.
• Context-focused crawling for better results.
• Distributed crawling and multi-threaded crawlers.
• Crawling and real-time web search.
References
• Wolf, J. L., Squillante, M. S., Yu, P. S., Sethuraman, J., and Ozsen, L. (2002). "Optimal crawling strategies for web search engines". In Proceedings of the 11th International Conference on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511465
• Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". In Proceedings of the Tenth Conference on World Wide Web (Hong Kong: Elsevier Science): 106-113. doi:10.1145/371920.371960
• Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). "Focused crawling using context graphs". In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.
• Pant, G., Srinivasan, P., and Menczer, F. (2004). "Crawling the Web". In Levene, M. and Poulovassilis, A. (eds.), Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer. pp. 153-178. ISBN 9783540406761.
• Articles from Search Engine Journal, Search Engine Roundtable, and Wikipedia.
Q&A