Scalability Analysis of PeerCrawl
AIAD Project Report
Dulloor Rao, Nirmal Thacker

1. Introduction

PeerCrawl is a P2P crawler built on the Gnutella protocol, which forms an unstructured overlay network. This decentralized design is intended to provide fast crawling with built-in scalability and fault tolerance. The architecture also implements an enhanced link-based ranking algorithm to overcome problems in traditional link-based ranking (e.g., Google's PageRank), such as slow update times, web spam and related security issues, and a flat, page-based view of the Web.

The core architecture of PeerCrawl (Figure 1) is not very different from that of a traditional crawler, except that it additionally contains an interface to connect to the other peers in its system. It is nevertheless explained briefly here to keep the reader informed.

Figure 1

The core architecture of PeerCrawl contains a seed list, which holds the initial URLs at which the crawl begins. The seed list is processed into a URL queue. A backup thread in PeerCrawl keeps an additional copy of the URL queue and of the URL Bloom Filter. The URL Bloom Filter is necessary to create the hash for the Division of Labor, which is explained below. The URL Fetcher has a multithreaded architecture; it performs DNS lookups and fetches pages from the World Wide Web. The fetched documents are passed to a URL extractor, which extracts URLs from the pages and passes them to a filter that removes malformed URLs and converts relative URLs to absolute ones.

A URL Seen Test makes sure that a URL is not fetched again if it has already been fetched once. In other words, it avoids archiving duplicate URLs on disk. This has two advantages. First, it increases the efficiency of the crawling process, since a unique URL is fetched each time. Second, it keeps the crawler from crawling the web in cycles, which would be unproductive.
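The URL Seen Test described above can be sketched as a small Bloom filter. This is only an illustration under stated assumptions, not PeerCrawl's actual code: the class name UrlSeenFilter, the use of MD5 with double hashing, and the choice of three hash functions are all hypothetical. The size of 32768 bits is the default BLOOM_FILTER_SIZE listed in the sample configuration in Section 3.1.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

/** Illustrative URL Seen Test backed by a Bloom filter; not PeerCrawl's actual class. */
final class UrlSeenFilter {
    private final BitSet bits;
    private final int sizeInBits;
    private final int numHashes; // number of bit positions per URL (an assumption)

    UrlSeenFilter(int sizeInBits, int numHashes) {
        this.bits = new BitSet(sizeInBits);
        this.sizeInBits = sizeInBits;
        this.numHashes = numHashes;
    }

    /**
     * Returns true if the URL was probably seen before (false positives are
     * possible); otherwise records the URL and returns false.
     */
    boolean checkAndMark(String url) {
        byte[] digest = md5(url);
        int h1 = toInt(digest, 0);
        int h2 = toInt(digest, 4);
        boolean seen = true;
        for (int k = 0; k < numHashes; k++) {
            // Double hashing: derive the k-th bit position from two base hashes.
            int index = Math.floorMod(h1 + k * h2, sizeInBits);
            if (!bits.get(index)) {
                seen = false;
                bits.set(index);
            }
        }
        return seen;
    }

    private static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Reads four bytes of the digest as a big-endian int. */
    private static int toInt(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 24) | ((b[offset + 1] & 0xff) << 16)
                | ((b[offset + 2] & 0xff) << 8) | (b[offset + 3] & 0xff);
    }

    public static void main(String[] args) {
        UrlSeenFilter filter = new UrlSeenFilter(32768, 3);
        System.out.println(filter.checkAndMark("http://www.gatech.edu/")); // first visit: false
        System.out.println(filter.checkAndMark("http://www.gatech.edu/")); // repeat visit: true
    }
}
```

A Bloom filter never reports a seen URL as unseen, so no page is fetched twice; the price is that a small fraction of new URLs may be skipped as false positives, and that fraction grows as the filter fills up.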
PeerCrawl's goal as a fast, scalable P2P crawler involves a network of peers located in geographically different regions of the world, each crawling a portion of the Web. Peers exchange this information with each other when required or requested; hence a peer can function as a router, a client or a server. Peers cooperate on the task of crawling through a principle known as "Division of Labor".

Division of Labor is necessary for a P2P crawler such as PeerCrawl. Just as the URL Seen Test makes sure that duplicate URLs are not crawled by a single peer instance, the Division of Labor ensures that different subsets of the WWW are allocated to different peers, depending on their proximity to that subset of pages or domain. Each peer is thus assigned a domain to crawl, and these domains do not overlap among peers, so the performance of the crawler is expected to increase linearly as the number of peers increases. Under the Division of Labor, peers dynamically adjust their crawling range as peers join or leave the overlay.

Specifically, a URL Distribution Function determines the domains to be crawled by a peer from a hash of the peers' IP addresses and of the seed URLs. A "Range" threshold is established from this hash, and a peer crawls the pages within its range. The Range is calculated from the hash together with a system-dependent value such as the number of peers, which is what makes the division of labor dynamic. URLs that do not belong to a peer under the Division of Labor are passed to the peers to whom they are relevant. This is done through a Phex client, which performs the P2P functions of the crawler by broadcasting the data across the overlay.

As shown in Figure 2, when Peer C is introduced the labor is divided into the range X-Y for Peer A, Y-ZZ for Peer B and ZZ-X for Peer C.
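The range-based division of labor described above can be sketched as a hash ring. The report does not give the exact URL Distribution Function, so everything here is an assumption made for illustration: the class name PeerRing, the use of the first four bytes of an MD5 digest as the hash, and the rule that a peer owns the range from its own hash up to the next peer's hash (wrapping around at the end, like the ZZ-X range of Peer C).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative division of labor over a hash ring; not PeerCrawl's actual function. */
final class PeerRing {
    // Maps each peer's hash (its position on the ring) to the peer's IP address.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addPeer(String ip) {
        ring.put(hash(ip), ip);
    }

    void removePeer(String ip) {
        ring.remove(hash(ip));
    }

    /** Returns the IP of the peer whose range contains the hash of the given domain. */
    String ownerOf(String domain) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no peers in the overlay");
        }
        // The owner is the peer with the largest hash not exceeding the domain's hash;
        // domains that hash below every peer wrap around to the last peer on the ring.
        Map.Entry<Integer, String> entry = ring.floorEntry(hash(domain));
        return (entry != null ? entry : ring.lastEntry()).getValue();
    }

    /** Hypothetical hash: the first four bytes of the MD5 digest as a big-endian int. */
    static int hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        PeerRing overlay = new PeerRing();
        overlay.addPeer("10.0.0.1"); // Peer A
        overlay.addPeer("10.0.0.2"); // Peer B
        System.out.println("before: " + overlay.ownerOf("www.gatech.edu"));
        overlay.addPeer("10.0.0.3"); // Peer C joins and takes over part of one range
        System.out.println("after:  " + overlay.ownerOf("www.gatech.edu"));
    }
}
```

With this rule, a newly joined peer only takes over domains from the single peer whose range it splits; every other domain keeps its owner, which is one way to obtain the dynamic re-division of ranges that the text describes. A peer that extracts a URL whose owner is another peer would forward it instead of crawling it.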
The initial aim of this project was to obtain a PeerCrawl build, benchmark it on 16 or more nodes, and objectively quantify its performance against well-known crawlers such as Apoidea and Mercator. However, we found that PeerCrawl needs to be significantly modified before it can distribute its peer instances across a network, initialize their parameters, and gather statistics on a single node efficiently. Since such a platform does not yet exist, it becomes imperative to build one in order to benchmark PeerCrawl efficiently.

Figure 2

This report contains the detailed proposal to build such a platform. Section 2 contains the motivation for this project, Section 3 describes the architecture, Section 4 contains the future work and Section 5 contains information for the user.

2. Motivation

While PeerCrawl was built to use alternative crawling ideas and technologies that have theoretically been shown to be better than traditional crawling, it lacks an interface to efficiently test and benchmark a sufficiently large network installation. PeerCrawl is built in Java and has a simple point-and-click GUI that instantiates a single PeerCrawl instance. The instance can be configured either as a "Root Node" or to connect to the IP address of a "Root Node". There is also a list of URLs to be crawled, traditionally known as the seed list.

To test PeerCrawl's performance, the current architecture already supports statistics collection at each local node, such as the list of URLs crawled, the IP address of the Root Node, the total number of URLs crawled per minute, the total number of seen URLs per minute, the number of cached URLs, the crawl job queue size, the growth of this queue per minute and so on.
Ideally, to test PeerCrawl one would instantiate several instances of the crawler across the network, run them without human intervention, and obtain the overall statistics of the run. Our architecture does exactly this, and is detailed below.

At present, to test the PeerCrawl architecture one must first configure a Root Node and note down its details. The tester must then move on to each individual peer node, run an instance of the crawler and make sure that it connects to the Root Node. Additionally, if the tester wishes to tweak parameters of the PeerCrawl architecture such as the Bloom Filter size, the stats directory or the socket timeouts, the source code of each peer's instance must first be modified, since these values are mostly hard-coded in the existing code. The tester must also make sure that these changes do not affect the overall working of PeerCrawl, and in particular do not affect the division of labor as implemented by the existing design.

Apart from being a cumbersome and daunting task, this approach also requires the tester to make sure that each peer enters the system at around the same time. To test PeerCrawl effectively for x peers, each of the x nodes must begin crawling at about the same time, if not at exactly the same instant. With the manual procedure above, however, the peers connect to the system sequentially, depending on when the tester starts each of them by hand on each system.

3. Architecture

Our PeerCrawl testing platform's architecture can be divided into two parts:

1. Deploying and creating instances of the crawler
2. Obtaining statistics over the network

3.1 Deploying and creating multiple instances of PeerCrawl

To deploy the crawler we have a deploying system at a manager node.
The manager node in our architecture is a regular login node, except that it does not participate in the crawling process. The manager node initiates a Root Node and several peer nodes. It also obtains statistics from all peer nodes and the Root Node and coalesces them into a single statistics file, which a tester can use to view the overall results.

To automate the entire process, we have modified PeerCrawl so that a crawler can be instantiated at a remote node without a GUI and can accept parameters that ensure its connection to the overlay as a peer. This was achieved by replacing the GUI classes with a simpler class that reads its values from a Java Properties file, which we call the configuration file. The configuration file is simply passed as an argument along with the JAR file of the crawler. We provide one configuration file each for a Peer Node, a Root Node and a Statistics Node. The Statistics Node does not participate in the crawling process; instead it is set up to receive statistics from all other nodes and merge them into a single file.

The deploying system, when executed, reads the hostnames to connect to and starts an instance of the crawler on each. Before an instance is started, the configuration file for that node must be copied over, and the IP of the overlay's Root Node as well as the IP of the statistics-collecting node must be specified in it. Similarly, a Root Node is identified by a Boolean TRUE in its own configuration file, and likewise for a Statistics Node. We anticipate that a normal run of PeerCrawl, without modifying any of its other parameters, needs three configurations: one each for a Peer Node, a Root Node and a Statistics Node.
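Reading such a configuration file can be sketched with java.util.Properties as follows. This is a minimal illustration rather than the class used in our build: the name CrawlerConfig and its fields are hypothetical, and the rule that a plain peer must supply ROOT_NODE_IP is an assumption. The key names and defaults (ROOT_NODE, STATS_NODE, ROOT_NODE_IP, BLOOM_FILTER_SIZE, SOCKET_TIMEOUT) are taken from the sample configuration in this section.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative reader for a PeerCrawl-style configuration file (hypothetical class). */
final class CrawlerConfig {
    final boolean rootNode;
    final boolean statsNode;
    final String rootNodeIp;    // may be null on a root or statistics node
    final int bloomFilterSize;  // in bits
    final int socketTimeoutMs;

    private CrawlerConfig(Properties p) {
        rootNode = flag(p, "ROOT_NODE");
        statsNode = flag(p, "STATS_NODE");
        String ip = p.getProperty("ROOT_NODE_IP");
        rootNodeIp = (ip == null) ? null : ip.trim();
        // Assumption: only a plain peer needs the root node's address to join the overlay.
        if (!rootNode && !statsNode && rootNodeIp == null) {
            throw new IllegalArgumentException("ROOT_NODE_IP is required for a peer node");
        }
        bloomFilterSize = number(p, "BLOOM_FILTER_SIZE", 4096 * 8);
        socketTimeoutMs = number(p, "SOCKET_TIMEOUT", 4 * 1000);
    }

    /** Loads a configuration; lines starting with '#' are treated as comments by Properties. */
    static CrawlerConfig load(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new CrawlerConfig(p);
    }

    private static boolean flag(Properties p, String key) {
        // parseBoolean ignores case, so TRUE and true are both accepted.
        return Boolean.parseBoolean(p.getProperty(key, "FALSE").trim());
    }

    private static int number(Properties p, String key, int defaultValue) {
        String value = p.getProperty(key);
        return (value == null) ? defaultValue : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        // A real node would open the file named on the command line instead of this string.
        String sample = "ROOT_NODE = FALSE\nROOT_NODE_IP = 10.0.0.1\n#SOCKET_TIMEOUT = 4000\n";
        CrawlerConfig cfg = load(new StringReader(sample));
        System.out.println(cfg.rootNodeIp + " " + cfg.bloomFilterSize + " " + cfg.socketTimeoutMs);
    }
}
```

Because commented-out keys fall back to their defaults, a tester only needs to uncomment the parameters that a particular experiment changes.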
Below is a sample PeerCrawl configuration file for a Peer Node. It is simply a Java Properties file that the final JAR build of PeerCrawl reads at startup. As can be seen, our architecture allows the tester to specify several parameters that were otherwise hardcoded in PeerCrawl's existing architecture. This allows the tester to vary his tests more widely.

    # Sample PeerCrawl configuration

    # General or focused crawl
    #FOCUSED_CRAWL = FALSE

    # Crawl at root node starts from this URL
    CRAWL_DOMAIN = www.gatech.edu

    # The default number of fetch and process threads to be started
    #MAX_THREADS = 10

    # Max crawl threads that a system can handle
    #MAX_CRAWL_THREADS = 25

    # The length of hash (in bits)
    #HASH_LENGTH = 128

    # Number of junk lines allowed before <html tag appears in HTTP Response
    #MAX_NON_HTML_LINES = 50

    # Depth
    #MAX_DEPTH = 25

    # Bloom Filter Capacity - default is (4096 * 8)
    #BLOOM_FILTER_SIZE = 32768

    # root node
    ROOT_NODE = FALSE
    # root node IP - used only if ROOT_NODE is false
    ROOT_NODE_IP = 10.0.0.1

    # If caching on secondary storage is enabled
    #CACHE_DOCS = FALSE

    # If PDF type documents need to be crawled
    #CRAWL_PDF = FALSE

    # LOG files
    #STATS_DIR= .stats
    #NETWORK_STATS_FILE= network.csv
    #CRAWL_URLS_STATS_FILE= urls.csv
    #PerSecond_STATS_FILE= second_data.csv
    #PerMinute_STATS_FILE= minute_data.csv
    #HTTP_STATS_FILE= http_data.csv

    # global binary stats
    BINARY_STATS = TRUE

    # stats server
    STATS_NODE = FALSE
    #STATS_SERVER_IP = 127.0.0.1
    #STATS_SERVER_PORT = 7897
    #GLOBAL_STATS_FILE = global_data.csv

    # Interval used by stats server to accumulate global stats
    #SECS_PER_STATS_SAMPLE = 5

    # Bounds on various queues
    #CRAWL_MAXJOB_LIMIT = 10000
    #DISPATCH_THRESHOLD = 20

    # Filename for backing up statistics
    #CRAWL_JOB_PATH = crawlJobsSaved
    #FETCH_BUFFER_PATH = fetchBufferSaved
    #URL_INFO_PATH = URLInfoSaved
    #ROBOTS_PATH = robotsSaved

    # Timeout for getting pages
    # Socket timeout - default is (4 * 1000)
    #SOCKET_TIMEOUT = 4000

    # Time to Backup data structures
    # Backup time - default is (1000*60*5)
    #BACKUP_TIME = 300000

    # Time to sleep between two runs of statistics collection
    #STATS_SLEEP_TIME = 1000

    # Time for checking peers
    #NETCONN_SLEEP_TIME = 5000

    # Maximum number of hops allowed for entering back into domain
    #MAX_HOP_COUNT = 0

    # Maximum number of inactive minutes before we kill the crawler
    #MAX_INACTIVE_MIN = 10

The reader can use the above configuration file as a template to build his own. The deployment architecture also ensures that all clients are started within a short span of time, so that the variation in client start times stays small and roughly constant from run to run.

3.2 Obtaining Statistics over the Network

The manager node in our architecture is configured to run as a statistics node as well. For compatibility purposes, the statistics node only requires a Boolean value in its configuration file; when this value is set, the node does not run a PeerCrawl instance but instead sets up a socket interface to collect results from the Peer and Root Nodes. The statistics node is made efficient by allowing the peers to send their statistics as a binary stream rather than as clear text. The statistics node also merges the results it receives into a single comma-separated values (CSV) file. The details of the statistics node, such as its IP and port, are specified in the configuration file of each peer.

The overall architecture of the deployment and statistics collection is shown below:

[Figure: overall architecture of deployment and statistics collection]

4. Future Work

As part of the future work, we propose the following:

This architecture can be extended to benchmark PeerCrawl effectively and to compare it with well-known crawlers such as Apoidea and Mercator.
It would certainly be interesting to note the pros and cons of a decentralized P2P crawler such as PeerCrawl relative to its counterparts.

This architecture can also be extended to provide the tester with a complete suite of graphs, a visual representation of the topology, and benchmark tests.

From a research perspective, we feel that PeerCrawl can be used to study several interesting concepts. PeerCrawl is heavy on most system resources: it makes extensive use of the CPU, physical memory and the network, and performs a fair amount of disk I/O. We have tested PeerCrawl on virtual machines and found it to be a very good workload for analyzing resource usage on such systems.

Moreover, this work can be extended to study the statistics aggregator developed in our system on a larger scale. Specifically, the statistics aggregator is bound to become a bottleneck or to restrict crawling performance as the number of peers in the overlay increases. PeerCrawl could thus be modified to include a protocol that distributes statistics aggregation across multiple statistics nodes, and this could be used to study physical resource sharing as the number of peers in the system increases.

5. Information to the User

This section contains details for the user of our system. Anyone wishing to use our system to test, benchmark or extend PeerCrawl will find this information useful.

The JAR file of our modified PeerCrawl includes three Java Properties files, which are our default configurations for a Statistics Node, a Root Node and a Peer Node. The configuration template is provided in the Architecture section of this report and can be customized to the user's preference. A node is started by simply passing the JAR and the configuration file to the Java runtime.
For example, on Linux the command would be:

    java -jar <JAR NAME> <CONF FILE>

We advise the user to run our builds on Linux machines, because the system can then be started rapidly without any human intervention. Instructions for doing so are given below:

1. Create three files on your Linux system called peers, root_node and stat_node, containing the hostnames of the respective nodes, one hostname per line.

2. Set up password-less SSH login from your login node to each of your test nodes. This can be done by generating a key pair on your system and placing the public key in the authorized_keys file on each of the nodes, using the command

    ssh-keygen -t dsa    # generates a DSA private and public key pair in $HOME/.ssh

Now manually append $HOME/.ssh/id_dsa.pub to each node's authorized_keys file in its $HOME/.ssh directory. If the file does not exist, create it.

3. You should now be able to log in to your test nodes without being asked for a password.

4. You can then start the system by executing the following commands in order:

    cat stat_node | xargs -I {} ssh {} 'java -jar <jar name> <stat node configuration>'
    cat root_node | xargs -I {} ssh {} 'java -jar <jar name> <root node configuration>'
    cat peers | xargs -I {} ssh {} 'java -jar <jar name> <peer node configuration>'

We assume that the JAR and the respective configuration files have been installed on the nodes beforehand. If an alternative method is used, care should be taken that the order of execution remains as given above.

5. Note that the configuration files have to be edited by hand to contain the IPs of the statistics node and the Root Node.

6. Finally, the statistics can be found in a file called global_data.csv in the stats directory of the statistics node.

6. References

1. "DSphere: A Source-Centric Approach to Crawling, Indexing and Searching the World Wide Web"; Bhuvan Bamba, Ling Liu, James Caverlee, Vaibhav Padliya, Mudhakar Srivatsa, Tushar Bansal, Mahesh Palekar, Joseph Patrao, Suiyang Li and Aameek Singh; Proceedings of the International Conference on Data Engineering, 2007.

2. "Link-Based Ranking of the Web with Source-Centric Collaboration"; J. Caverlee, L. Liu and W. B. Rouse; 2nd International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), Atlanta, 2006.

3. "Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web"; Aameek Singh, Mudhakar Srivatsa, Ling Liu and Todd Miller; Proceedings of the SIGIR 2003 Workshop on Distributed Information Retrieval, Lecture Notes in Computer Science, Volume 2924.

4. "Mercator: A Scalable, Extensible Web Crawler"; Allan Heydon and Marc Najork; 1999.