Crawling the Web

Crawling the Web Web pages •Few thousand characters long •Served through the internet using the hypertext transport protocol (HTTP) •Viewed at client end using `browsers’ Crawler •To fetch the pages to the computer •At the computer Automatic documents programs can analyze hypertext HTML  HyperText Markup Language  Lets the author • specify layout and typeface • embed diagrams • create hyperlinks.  expressed as an anchor tag with a HREF attribute  HREF names another page using a Uniform Resource Locator (URL), • URL =  protocol field (“HTTP”) +  a server hostname (“www.cse.iitb.ac.in”) +  file path (/, the `root' of the published file system). Mining the Web Chakrabarti and Ramakrishnan 2 HTTP(hypertext transport protocol)  Built on top of the Transport Control Protocol (TCP)  Steps(from client end) • resolve the server host name to an Internet address (IP)   Use Domain Name Server (DNS) DNS is a distributed database of name-to-IP mappings maintained at a set of known servers • contact the server using TCP    connect to default HTTP port (80) on the server. Enter the HTTP requests header (E.g.: GET) Fetch the response header – MIME (Multipurpose Internet Mail Extensions) – A meta-data standard for email and Web content transfer Mining the Web  Chakrabarti and Ramakrishnan Fetch the HTML page 3 Crawl “all” Web pages?  Problem: no catalog of all accessible URLs on the Web.  Solution: • start from a given set of URLs • Progressively fetch and scan them for new • • • outlinking URLs fetch these pages in turn….. Submit the text in page to a text indexing system and so on………. Mining the Web Chakrabarti and Ramakrishnan 4 Crawling procedure  Simple • Great deal of engineering goes into industry- • strength crawlers Industry crawlers crawl a substantial fraction of the Web E.g.: Alta Vista, Northern Lights, Inktomi •  No guarantee that all accessible Web pages will be located in this fashion  Crawler may never halt ……. 
  – Pages will be added continually even as it is running

Crawling overheads
• Delays are involved in
  – Resolving the host name in the URL to an IP address using DNS
  – Connecting a socket to the server and sending the request
  – Receiving the requested page in response
• Solution: overlap the above delays by fetching many pages at the same time

Anatomy of a crawler
• Page-fetching threads
  – Start with DNS resolution
  – Finish when the entire page has been fetched
• Each page is
  – stored in compressed form to disk/tape
  – scanned for outlinks
• Work pool of outlinks
  – Maintain network utilization without overloading it
  – Dealt with by the load manager
• Continue until the crawler has collected a …

Figure: Typical anatomy of a large-scale crawler.

Large-scale crawlers: performance and reliability considerations
• Need to fetch many pages at the same time
  – to utilize the network bandwidth
  – a single page fetch may involve several seconds of network latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
  – Explicit encoding of the state of a fetch context in a data structure
  – Polling sockets to check for completion of network transfers
  – Multi-processing or multi-threading: impractical
• Care in URL extraction
  – Eliminating duplicates to reduce redundant fetches
  – Avoiding "spider traps"

DNS caching, pre-fetching and resolution
• A customized DNS component with
  1. Custom client for address resolution
  2. Caching server
  3.
Prefetching client

Custom client for address resolution
• Tailored for concurrent handling of multiple outstanding requests
• Allows issuing many resolution requests together
  – polling at a later time for completion of individual requests
• Facilitates load distribution among many DNS servers

Caching server
• Has a large cache, persistent across DNS restarts
• Resides largely in memory if possible

Prefetching client
• Steps
  1. Parse a page that has just been fetched
  2. Extract host names from HREF targets
  3. Make DNS resolution requests to the caching server
• Usually implemented using UDP (User Datagram Protocol)
  – a connectionless, packet-based communication protocol
  – does not guarantee packet delivery
• Does not wait for resolution to be completed

Multiple concurrent fetches
• Managing multiple concurrent connections
  – A single download may take several seconds
  – Open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines are not useful
  – crawling performance is limited by network and disk
• Two approaches
  1. using multi-threading
  2. using non-blocking sockets with event handlers

Multi-threading
• Logical threads are either
  – physical threads of control provided by the operating system (e.g., pthreads), or
  – concurrent processes
• A fixed number of threads is allocated in advance
• Programming paradigm
  – create a client socket
  – connect the socket to the HTTP service on a server
  – send the HTTP request header
  – read the socket (recv) until no more characters are available
  – close the socket
• Uses blocking system calls

Multi-threading: problems
• Performance penalty
  – mutual exclusion
  – concurrent access to data structures
• Slow disk seeks
  – A great deal of interleaved, random input-output on disk
  – Due to concurrent modification of the document repository by multiple threads

Non-blocking sockets and event handlers
• Non-blocking sockets
  – A connect, send or recv call returns immediately without waiting for the network operation to complete
  – The status of the network operation is polled separately
• "select" system call
  – lets the application suspend until more data can be read from or written to a socket
  – times out after a pre-specified deadline
  – monitors and polls several sockets at the same time
• More efficient memory management
  – code that completes processing is not interrupted by other completions
  – no need for locks and semaphores on the pool
  – only complete pages are appended to the log

Link extraction and normalization
• Goal: obtaining a canonical form of each URL
• URL processing and filtering
  – Avoid multiple fetches of pages known by different URLs
• One host name may map to many IP addresses
  – for load balancing on large sites
  – mirrored contents/contents on the same file system
• "Proxy pass"
  – mapping of different host names to a single IP address
  – needed to publish many logical sites
• Relative URLs
  – need to be interpreted with respect to a base URL

Canonical URL
• Formed by
  – using a standard string for the protocol
  – canonicalizing the host name
  – adding an explicit port number
  – normalizing and cleaning up the path

Robot exclusion
• Check whether the server prohibits crawling a normalized URL
  – in the robots.txt file in the HTTP root directory of the server
  – it specifies a list of path prefixes which crawlers should not attempt to fetch
• Meant for crawlers only

Eliminating already-visited URLs
• Checking whether a URL has already been fetched
  – before adding a new URL to the work pool
  – needs to be very quick
  – Achieved by computing an MD5 hash function on the URL
• Exploiting spatio-temporal locality of access
• Two-level hash function
  – most significant bits (say, 24) derived by hashing the host name plus port
  – lower-order bits (say, 40) derived by hashing the path
• Concatenated bits are used as a key in a B-tree
• Qualifying URLs are added to the frontier of the crawl
• Hash values are added to the B-tree

Spider traps
• Protecting the crawler from crashing on
  – Ill-formed HTML
    – e.g., a page with 68 kB of null characters
  – Misleading sites
    – an indefinite number of pages dynamically generated by CGI scripts
    – paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server

Spider traps: solutions
• No automatic technique can be foolproof
• Check for URL length
• Guards
  – Preparing regular crawl statistics
  – Adding dominating sites to the guard module
  – Disabling crawling of active content such as CGI form queries
  – Eliminating URLs with non-textual data types

Avoiding repeated expansion of links on duplicate pages
• Reduce redundancy in crawls
• Duplicate detection
  – Mirrored Web pages and sites
• Detecting exact duplicates
  – Checking against MD5 digests of stored URLs
  – Representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
• Detecting near-duplicates
  – Even a single altered character will completely change the digest
    – e.g., the date of update, or the name and email of the site administrator
  – Solution: shingling

Load monitor
• Keeps track of various system statistics
  – Recent performance of the wide area network (WAN) connection
    – e.g., latency and bandwidth estimates
  – Operator-provided/estimated upper bound on open sockets for a crawler
  – Current number of active sockets
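
The canonical-URL steps and the two-level hash from the slides "Canonical URL" and "Eliminating already-visited URLs" can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `canonicalize`, `url_key`, and `add_if_new` are hypothetical, truncated MD5 digests stand in for the two hash functions, and a Python `set` stands in for the B-tree.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit

# Assumed default ports used when the URL names none.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Standard protocol string, lower-case host, explicit port, cleaned path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)
    # normpath collapses "." and ".." segments and repeated slashes.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops a trailing slash; restore it so directory URLs stay distinct.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{scheme}://{host}:{port}{path}{query}"


def url_key(url: str) -> int:
    """64-bit key: 24 bits from host:port, 40 bits from the path (and query)."""
    parts = urlsplit(canonicalize(url))
    host_port = f"{parts.hostname}:{parts.port}".encode()
    path = (parts.path + (f"?{parts.query}" if parts.query else "")).encode()
    high = int.from_bytes(hashlib.md5(host_port).digest()[:3], "big")  # 24 bits
    low = int.from_bytes(hashlib.md5(path).digest()[:5], "big")        # 40 bits
    return (high << 40) | low


def add_if_new(url: str, seen: set) -> bool:
    """Return True and record the key if the URL has not been seen before."""
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True


if __name__ == "__main__":
    seen = set()
    print(canonicalize("HTTP://WWW.CSE.IITB.AC.IN/a/./b/../c.html"))
    print(add_if_new("http://www.cse.iitb.ac.in/", seen))
    print(add_if_new("HTTP://www.cse.iitb.ac.in:80/./", seen))
```

Because the high-order bits depend only on the host name and port, keys for pages on the same server share a prefix and sort next to each other, which is what lets an ordered index such as a B-tree benefit from the locality of access mentioned in the slides.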
Thread manager
• Responsible for
  – choosing units of work from the frontier
  – scheduling the issue of network resources
  – distributing these requests over multiple ISPs if appropriate
• Uses statistics from the load monitor

Per-server work queues
• Servers guard against denial of service (DoS) attacks
  – they limit the speed or frequency of responses to any fixed client IP address
• To avoid looking like a DoS attack, the crawler should
  – limit the number of active requests to a given server IP address at any time
  – maintain a queue of requests for each server
  – use the HTTP/1.1 persistent socket capability
  – distribute attention relatively evenly among a large number of sites
• Access locality vs. politeness dilemma

Text repository
• Crawler's last task
  – Dumping fetched pages into a repository
• Decoupling the crawler from other functions is preferred for efficiency and reliability
• Page-related information is stored in two parts
  – metadata
  – page contents

Storage of page-related information
• Metadata
  – relational in nature
  – usually managed by custom software to avoid relational database system overheads
  – the text index involves bulk updates
  – includes fields like content-type, last-modified date, content-length, HTTP status code, etc.

Page contents storage
• A typical HTML Web page compresses to 2-4 kB (using zlib)
• File systems have a 4-8 kB file block size
  – Too large!
• Page storage is managed by a custom storage manager
  – simple access methods for
    – the crawler to add pages
    – subsequent programs (indexer, etc.) to retrieve documents

Page storage: small-scale systems
• Repository fits within the disks of a single machine
• Use of a storage manager (e.g., Berkeley DB)
  – manages disk-based databases within a single file
  – configured as a hash table/B-tree with the URL as access key
    – to handle ordered access of pages
  – or configured as a sequential log of page records
    – since the indexer can handle pages in any order

Page storage: large-scale systems
• Repository distributed over a number of storage servers
• Storage servers
  – connected to the crawler through a fast local network (e.g., Ethernet)
  – hashed by URLs
• "T3"-grade leased lines
  – to handle 10 million pages (40 GB) per hour

Figure: Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.

Refreshing crawled pages
• A search engine's index should be fresh
• A Web-scale crawler never "completes" its job
• High variance in the rate of page changes
• "If-modified-since" request header in the HTTP protocol
  – impractical for a crawler
• Solution
  – at the commencement of a new crawling round, estimate which pages have changed

Determining page changes
• "Expires" HTTP response header
  – for pages that come with an expiry date
• Otherwise, need to guess whether revisiting a page will yield a modified version
  – a score reflects the probability of the page being modified
  – the crawler fetches URLs in decreasing order of score
  – Assumption: the recent past predicts the future

Estimating page change rates
• Brewington and Cybenko, and Cho
  – algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
• Prerequisite
  – the average interval at which the crawler checks for changes is smaller than the inter-modification times of a page
• Small-scale intermediate crawler runs
  – to monitor fast-changing sites
  – e.g., current news, weather, etc.
• Intermediate indices are patched into the master index

Putting together a crawler
• Reference implementation of the HTTP client protocol
  – World-wide Web Consortium (http://www.w3c.org/)
  – w3c-libwww package

Design of the core components: Crawler class
• Purpose: to copy bytes from network sockets to storage media
• Three methods express the Crawler's contract with the user
  – a method for pushing a URL to be fetched to the Crawler (fetchPush)
  – a termination callback handler (fetchDone), called with the same URL
  – a method (start) which starts the Crawler's event loop
• Implementation of the Crawler class
  – needs two helper classes called DNS and Fetch
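
The three-method contract of the Crawler class, combined with the non-blocking sockets and select-style polling described earlier, might look roughly like the Python sketch below. It is an illustration under assumptions and not the w3c-libwww-based design the slides refer to: `fetch_push`, `fetch_done`, and `start` mirror fetchPush, fetchDone, and start; the `Fetch` class here is a hypothetical stand-in for the slide's Fetch helper; and a blocking `socket.getaddrinfo` call replaces the asynchronous DNS helper for brevity.

```python
import selectors
import socket
from urllib.parse import urlsplit


class Fetch:
    """Explicit state of one in-progress fetch (hypothetical helper)."""

    def __init__(self, url: str):
        self.url = url
        parts = urlsplit(url)
        self.host = parts.hostname
        self.port = parts.port or 80
        path = parts.path or "/"
        # HTTP/1.0 request, so the server closes the socket after the page.
        self.outbuf = f"GET {path} HTTP/1.0\r\nHost: {self.host}\r\n\r\n".encode()
        self.inbuf = bytearray()
        self.sock = None


class Crawler:
    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.pending = 0

    def fetch_push(self, url: str) -> None:
        """Queue a URL: resolve the host, then begin a non-blocking connect."""
        fetch = Fetch(url)
        # Simplification: blocking lookup instead of an asynchronous DNS client.
        family, kind, proto, _, addr = socket.getaddrinfo(
            fetch.host, fetch.port, type=socket.SOCK_STREAM)[0]
        fetch.sock = socket.socket(family, kind, proto)
        fetch.sock.setblocking(False)
        fetch.sock.connect_ex(addr)  # returns at once; writability signals completion
        self.selector.register(fetch.sock, selectors.EVENT_WRITE, fetch)
        self.pending += 1

    def fetch_done(self, url: str, data: bytes) -> None:
        """Termination callback; override to store or parse the response."""
        print(url, len(data), "bytes")

    def start(self, timeout: float = 5.0) -> None:
        """Event loop: poll all sockets until every pushed fetch has finished."""
        while self.pending:
            events = self.selector.select(timeout)
            if not events:
                # Deadline passed with no socket activity: give up on the rest.
                for key in list(self.selector.get_map().values()):
                    self._finish(key.data)
                break
            for key, mask in events:
                self._advance(key.data, mask)

    def _advance(self, fetch: Fetch, mask: int) -> None:
        try:
            if mask & selectors.EVENT_WRITE:
                sent = fetch.sock.send(fetch.outbuf)
                fetch.outbuf = fetch.outbuf[sent:]
                if not fetch.outbuf:  # request fully sent; now wait for data
                    self.selector.modify(fetch.sock, selectors.EVENT_READ, fetch)
                return
            chunk = fetch.sock.recv(65536)
            if chunk:
                fetch.inbuf += chunk
                return
        except BlockingIOError:
            return  # nothing to do yet; poll again later
        except OSError:
            pass  # connection failed or was reset; report what was received
        self._finish(fetch)

    def _finish(self, fetch: Fetch) -> None:
        self.selector.unregister(fetch.sock)
        fetch.sock.close()
        self.pending -= 1
        self.fetch_done(fetch.url, bytes(fetch.inbuf))
```

A caller would subclass `Crawler`, override `fetch_done` to write the bytes to the repository, call `fetch_push` for each seed URL, and then call `start`. Since all completions are handled in one loop, no locks are needed around the shared state, which matches the argument made on the non-blocking sockets slide.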
