Scalability Analysis of PeerCrawl
AIAD Project Proposal
Dulloor Rao, Nirmal Thacker

1. Introduction

PeerCrawl is a P2P crawler built on the Gnutella protocol over an unstructured network overlay. Such a decentralized design is intended to provide fast crawling with built-in scalability and fault tolerance. Additionally, the architecture implements an enhanced link-based ranking algorithm to overcome problems in traditional link-based ranking (e.g., Google's PageRank), such as slow update times, web-spam vulnerabilities, and a flat page-based view.

PeerCrawl's goal as a fast, scalable P2P crawler is a network of peers, located in geographically different regions of the world, each crawling a portion of the Web. Peers exchange this information with one another when required or requested; hence a peer may function as a router, a client, or a server. Peers are made to cooperate on the crawling task through a principle known as "Division of Labor."

Division of Labor is essential to a P2P crawler such as PeerCrawl. While the URL-seen test ensures that duplicate URLs are not crawled by a peer instance, Division of Labor ensures that different subsets of the Web are allocated to different peers, depending on their proximity to that subset of pages or domains. In this way each peer is assigned a domain to crawl, these domains do not overlap among peers, and the crawler's performance therefore increases linearly as the number of peers grows.
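To make the idea concrete, below is a minimal sketch of such a non-overlapping assignment. It assumes a simple hash-based mapping from a URL's host to a peer; PeerCrawl's actual proximity-based scheme may differ, and the class and method names here are ours.

```java
import java.net.URI;
import java.util.List;

// Sketch of a Division-of-Labor check: a peer claims a URL only if the
// URL's host hashes to that peer, so domain assignments never overlap.
public class DivisionOfLabor {
    private final List<String> peerIds; // identifiers of all known live peers
    private final String selfId;        // this peer's identifier

    public DivisionOfLabor(List<String> peerIds, String selfId) {
        this.peerIds = peerIds;
        this.selfId = selfId;
    }

    /** Returns true if this peer is responsible for crawling the given URL. */
    public boolean isMine(String url) throws Exception {
        String host = new URI(url).getHost();
        // Stable, non-negative hash of the domain, mapped onto the peer set.
        int bucket = Math.floorMod(host.hashCode(), peerIds.size());
        return peerIds.get(bucket).equals(selfId);
    }
}
```

Under such a scheme a peer applies isMine() to every extracted link and forwards URLs it does not own to the responsible peer over the overlay.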
The initial aim of this project was to obtain a PeerCrawl build, benchmark it over 16+ nodes, and objectively quantify its performance against well-known crawlers such as Apoidea and Mercator. However, we found that PeerCrawl must be significantly modified before one can distribute its peer instances across a network, initialize their parameters, and collect statistics at a single node efficiently. Since such a platform does not yet exist, it becomes imperative to build one in order to benchmark PeerCrawl effectively. This proposal contains our motivation and a list of tasks for modifying PeerCrawl so that it can be benchmarked.

2. Motivation

PeerCrawl is written in Java and provides a simple point-and-click GUI that instantiates a single PeerCrawl instance. The instance offers a simple choice: configure the crawler as a "Root Node" or connect to a Root Node's IP address. There is also a list of URLs to be crawled, traditionally known as the seed list.

To support performance testing, the current architecture already collects statistics at each local node, such as the list of URLs crawled, the IP address of the Root Node, the number of URLs crawled per minute, the number of seen URLs per minute, the number of cached URLs, the crawl-job queue size, the growth of this queue per minute, and so on.

At present, to test the PeerCrawl architecture one must first configure a Root Node and note down its details. The tester must then move to each individual peer node, run an instance of the crawler, and make sure it connects to the Root Node. Moreover, if the tester wishes to tweak or add parameters of the PeerCrawl architecture, such as the Bloom filter size, the stats directory, or the socket timeouts, he must modify the source code of each peer's instance, since these values are mostly hard-coded. He must also make sure that these changes do not affect the overall working of the PeerCrawl architecture, and in particular that they do not disturb the existing Division of Labor.

Apart from this being a cumbersome and daunting task, the tester must also ensure that all peers enter the system at nearly the same instant. To test PeerCrawl effectively with x peers, each of the x nodes must begin crawling at about the same time, if not at exactly the same instant. With the above technique, however, the peers connect to the system sequentially, depending on when the tester hand-runs each instance on each system.

3. Tasks

Our work to turn PeerCrawl into an effective testing platform can briefly be divided into two parts:

1. Deployment and creation of crawler instances: Essentially, PeerCrawl would be modified to run as an instance without human intervention. This implies that parameters can be read in automatically and that several instances of the crawler can be launched from a single node at a single instant. As we anticipate it, this requires heavily modifying PeerCrawl to eliminate its Java windowing system and replace it with a headless execution point that reads a configuration file of parameters for a run (see the first sketch after this list).

2. Obtaining statistics over the network: The current architecture supports the collection of statistics, but this occurs at each node, and the user must gather and coalesce the information manually. We propose to add a statistics aggregator: a separate node that reads these statistics from every node and collates them in a single location. This would involve a small distributed networking architecture (see the second sketch after this list).
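For the first task, here is a minimal sketch of what the headless execution point might look like. The property names and the startup hand-off are illustrative assumptions, not PeerCrawl's actual identifiers; only the parameters named above (Bloom filter size, stats directory, socket timeouts) come from the existing design.

```java
import java.io.FileInputStream;
import java.util.Properties;

// Hypothetical headless entry point replacing PeerCrawl's GUI: all run
// parameters come from a properties file instead of hard-coded constants.
public class HeadlessLauncher {
    public static void main(String[] args) throws Exception {
        Properties cfg = new Properties();
        try (FileInputStream in = new FileInputStream(args[0])) {
            cfg.load(in); // e.g. java HeadlessLauncher peercrawl.properties
        }

        boolean isRoot      = Boolean.parseBoolean(cfg.getProperty("node.isRoot", "false"));
        String rootAddress  = cfg.getProperty("node.rootAddress", "127.0.0.1");
        int bloomFilterBits = Integer.parseInt(cfg.getProperty("bloom.sizeBits", "1048576"));
        int socketTimeoutMs = Integer.parseInt(cfg.getProperty("socket.timeoutMs", "5000"));
        String statsDir     = cfg.getProperty("stats.dir", "./stats");

        System.out.printf("Starting peer: root=%b, rootAddress=%s, bloomBits=%d, timeout=%dms, stats=%s%n",
                isRoot, rootAddress, bloomFilterBits, socketTimeoutMs, statsDir);
        // Here the launcher would hand these values to PeerCrawl's crawl
        // threads in place of the current hard-coded constants.
    }
}
```

Because nothing here requires a display, a script can launch many such instances across machines at nearly the same instant, addressing the synchronization problem described above.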
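For the second task, here is a minimal sketch of the proposed statistics aggregator. It assumes each peer pushes newline-delimited stats records over TCP; the port number and record format are placeholders, not part of the current PeerCrawl code.

```java
import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of the aggregator node: peers connect over TCP and push
// newline-delimited stats records, which are collated into one file.
public class StatsAggregator {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9090);
             PrintWriter out = new PrintWriter(new FileWriter("all-stats.log"), true)) {
            while (true) {
                Socket peer = server.accept(); // one connection per reporting peer
                new Thread(() -> collect(peer, out)).start();
            }
        }
    }

    private static void collect(Socket peer, PrintWriter out) {
        String peerAddr = peer.getInetAddress().getHostAddress();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(peer.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                synchronized (out) { // interleave records from all peers safely
                    out.println(peerAddr + "\t" + line);
                }
            }
        } catch (Exception e) {
            System.err.println("Lost connection to " + peerAddr);
        }
    }
}
```

Each peer's existing local stats thread would then need only a small change: write each record to this socket in addition to (or instead of) its local stats directory.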