Scalability Analysis of PeerCrawl
AIAD Project Report
Dulloor Rao
Nirmal Thacker
1. Introduction
PeerCrawl is a P2P crawler built on the Gnutella protocol over an unstructured network overlay.
Such a decentralized design is intended to provide fast crawling with built-in scalability and fault
tolerance. Additionally, the architecture implements an enhanced link-based ranking algorithm
to overcome problems in traditional link-based ranking (e.g., Google's PageRank) such as slow
update times, web spam security issues, and a flat view of pages.
The core architecture of PeerCrawl (Figure 1) is not very different from that of a traditional
crawler, except that it additionally contains an interface to connect to other peers in its system.
The architecture is nevertheless briefly explained to keep the reader informed.
Figure 1
The core architecture of PeerCrawl contains a seed list, which holds the initial URLs at which to
begin the crawl. The seed list is processed into a URL queue. A backup thread in PeerCrawl keeps
an additional URL queue and the URL Bloom Filter. The URL Bloom Filter is necessary to create
the hash for the Division of Labor, which is explained below. The URL Fetcher has a
multithreaded architecture which performs DNS lookups and fetches pages from the World
Wide Web. These documents are passed to a URL extractor, which extracts URLs from the pages
and passes them to a filter that discards malformed URLs and converts relative URLs to absolute
ones. A URL Seen Test ensures that URLs are not fetched again if they have already been fetched
once. In other words, it avoids creating archives of duplicate URLs on disk, which has two
advantages: it increases the efficiency of the crawling process by fetching a unique URL each
time, and it helps the crawler avoid crawling the web in cycles and hence being unproductive.
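The URL Seen Test can be sketched with a small Bloom-filter-style structure. The sketch below is a minimal illustration, assuming two hash positions over a fixed-size bit set; the class and method names are ours and not PeerCrawl's actual API.

```java
import java.util.BitSet;

// Minimal sketch of a URL-seen test backed by a Bloom-filter-like bit set.
// Illustrative only: PeerCrawl's real filter and hashing may differ.
public class UrlSeenTest {
    private final BitSet bits;
    private final int size;

    public UrlSeenTest(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash positions derived from the URL string.
    private int h1(String url) { return Math.floorMod(url.hashCode(), size); }
    private int h2(String url) { return Math.floorMod(url.hashCode() * 31 + url.length(), size); }

    // Returns true if the URL was possibly seen before, then marks it as seen.
    public boolean seenBefore(String url) {
        boolean seen = bits.get(h1(url)) && bits.get(h2(url));
        bits.set(h1(url));
        bits.set(h2(url));
        return seen;
    }

    public static void main(String[] args) {
        UrlSeenTest seen = new UrlSeenTest(32768);
        System.out.println(seen.seenBefore("http://www.gatech.edu/")); // false on first encounter
        System.out.println(seen.seenBefore("http://www.gatech.edu/")); // true on repeat
    }
}
```

As with any Bloom filter, false positives are possible but false negatives are not, which is acceptable for a crawler: a duplicate is never refetched, at the cost of occasionally skipping an unseen URL.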
PeerCrawl's design as a fast, scalable P2P crawler involves a network of peers located in
geographically different regions of the world, each crawling a portion of the Web. Peers
exchange this information with each other when required or requested; hence a peer can
function as a router, a client, or a server. Peers are made to cooperate in the task of crawling by a
principle known as "Division of Labor". The Division of Labor is necessary for a P2P crawler
such as PeerCrawl. Just as the URL Seen Test makes sure that duplicate URLs are not crawled by a
peer instance, the Division of Labor ensures that different subsets of the WWW are allocated to
different peers, depending on their proximity to that subset of pages or domain. In this way each
peer is assigned a domain to crawl, and these domains do not overlap among different peers,
so the performance of the crawler increases linearly as the number of peers increases.
Under the Division of Labor, peers dynamically modify their range of crawling as peers join or
leave the overlay. Specifically, a URL Distribution Function determines the domains to be
crawled by a peer by obtaining a hash of the peer's IP and the seed URLs. A "Range"
threshold is established using this hash, and a peer crawls pages within its range. The
range is calculated by considering the hash as well as a system-dependent constant such as the
number of peers, which introduces the dynamics of the Division of Labor. URLs which do not
belong to a peer under the Division of Labor are passed to the peers to whom they
are relevant. This is done through a Phex client, which performs the P2P functions of the
crawler by broadcasting the data across the overlay.
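The range check described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the MD5-based hash, the mapping onto the unit interval, and the method names are ours, not PeerCrawl's exact URL Distribution Function.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the URL distribution idea: hash a URL's domain and check whether
// it falls inside this peer's range. Illustrative assumptions throughout.
public class UrlDistribution {
    // Hash a domain to a value in [0, 1) using the first bytes of its MD5 digest.
    static double hashToUnit(String domain) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(domain.getBytes(StandardCharsets.UTF_8));
        long v = 0;
        for (int i = 0; i < 7; i++) v = (v << 8) | (d[i] & 0xFF);
        return (double) v / (1L << 56);
    }

    // A peer owns a domain if its hash lies in [lower, upper); in PeerCrawl the
    // bounds would shrink or grow as peers join or leave the overlay.
    static boolean inRange(String domain, double lower, double upper) throws Exception {
        double h = hashToUnit(domain);
        return h >= lower && h < upper;
    }

    public static void main(String[] args) throws Exception {
        // With two peers splitting [0, 1) evenly, every domain maps to exactly one peer.
        boolean peerA = inRange("www.gatech.edu", 0.0, 0.5);
        boolean peerB = inRange("www.gatech.edu", 0.5, 1.0);
        System.out.println(peerA ^ peerB); // prints true: exactly one peer owns the domain
    }
}
```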
As shown in Figure 2, by the Division of Labor, on
introducing Peer C the labor is divided into the range
X-Y for Peer A, Y-ZZ for Peer B, and ZZ-X for Peer C.
The initial aim of this project was to obtain a PeerCrawl
build, benchmark it over 16+ nodes, and objectively
quantify its performance against well-known crawlers
such as Apoidea and Mercator.
However, we found that PeerCrawl needs to be
significantly modified to distribute its peer instances
across a network, initialize their parameters,
and efficiently obtain statistics on a single node. Since
such a platform does not yet exist, it becomes
imperative to build one in order to efficiently
benchmark PeerCrawl.
Figure 2
This report contains the detailed proposal to build such a platform. Section 2 contains the
motivation for this project, Section 3 includes the architecture details, Section 4 contains the
future work, and Section 5 contains information for the user.
2. Motivation
While PeerCrawl was built to use alternative crawling ideas and technologies which have
theoretically proven better than traditional crawling, it lacks an interface to
efficiently test and benchmark a large enough network installation.
PeerCrawl is built in Java and provides a simple point-and-click GUI which
instantiates a single PeerCrawl instance. The instance has a simple choice of being configured
as a "Root Node" or connecting to a "Root Node" IP address. There is also a list of URLs to
be crawled, traditionally known as the seed list.
To test PeerCrawl's performance, the current architecture already supports stats collection at
each local node, such as the list of URLs crawled, the IP address of the Root Node, the total
number of URLs crawled per minute, the total number of seen URLs per minute, the number
of cached URLs, the crawl job queue size, the growth of this queue per minute, and so on.
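As a rough illustration, one per-minute sample of such per-node statistics can be flattened into a CSV row as sketched below. The field set and order are assumptions based on the metrics listed above, not PeerCrawl's exact stats-file layout.

```java
// Sketch of one per-minute statistics sample flattened to a CSV row.
// Field names mirror the metrics listed above; PeerCrawl's actual
// minute_data.csv layout may differ.
public class MinuteStats {
    final long urlsCrawledPerMin;
    final long seenUrlsPerMin;
    final long cachedUrls;
    final long crawlJobQueueSize;
    final long queueGrowthPerMin;

    MinuteStats(long crawled, long seen, long cached, long queue, long growth) {
        this.urlsCrawledPerMin = crawled;
        this.seenUrlsPerMin = seen;
        this.cachedUrls = cached;
        this.crawlJobQueueSize = queue;
        this.queueGrowthPerMin = growth;
    }

    // One CSV row, suitable for appending to a per-minute stats file.
    String toCsvRow() {
        return urlsCrawledPerMin + "," + seenUrlsPerMin + "," + cachedUrls
                + "," + crawlJobQueueSize + "," + queueGrowthPerMin;
    }

    public static void main(String[] args) {
        System.out.println(new MinuteStats(120, 450, 300, 8000, 35).toCsvRow());
        // prints: 120,450,300,8000,35
    }
}
```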
Ideally, to test PeerCrawl one would need to instantiate several instances of the crawler
across the network, run them without human intervention, and obtain the overall statistics of a
run. Our architecture does exactly this, and is detailed below.
At present, if the PeerCrawl architecture must be tested, one must first configure a Root Node
and note down its details. After that, the tester must move on to each individual peer node,
run an instance of the crawler, and make sure that it connects to the Root Node. Additionally, if
the tester wishes to tweak or add parameters related to the PeerCrawl architecture,
such as the Bloom Filter size, the stats directory, or the socket timeouts, he must first modify the
source code of each peer's instance and add these variables, since they are mostly hard-coded into
the existing code. He must also make sure that these variables do not in any way affect the
overall working of the PeerCrawl architecture, and in particular that they do not affect the
Division of Labor as performed by the existing implementation.
Apart from this being a cumbersome and daunting task, the tester must also make
sure that each peer enters the system at almost the same instant of time. To
effectively test PeerCrawl for x peers, each of the x nodes must begin crawling at
about the same time, if not exactly the same instant. However, if a tester uses the above
technique, it becomes apparent that the peers would each connect to the system sequentially,
depending on when the tester hand-runs each of them on each system.
3. Architecture
Our PeerCrawl testing platform’s architecture can be briefly divided into two parts:
1. Deployment and Creating instances of the Crawler
2. Obtaining statistics over the network.
3.1 Deploying and creating multiple instances of PeerCrawl
To deploy the crawler we have a deploying system at a manager node. The manager node in
our architecture is a regular login node, except that it does not participate in the crawling
process. The manager node initiates a Root Node and several peer nodes. It also
obtains statistics from all peer and root nodes and coalesces them into a
single statistics file, which a tester can easily use to view the overall results.
To automate the entire process, we have modified the PeerCrawl architecture to instantiate a
crawler at a remote node, without requiring a GUI, and to accept parameters which ensure its
connection into the overlay as a peer. This was achieved by replacing the GUI classes in the
architecture with a simpler class which reads values from a Java Properties file, which we
call the configuration file. The configuration file is simply passed as an argument along with the
JAR file of the crawler. We have provided configuration files, one each for a Peer Node, a Root
Node, and a Statistics Node. The Statistics Node does not participate in the crawling process,
but instead sets up the node to receive statistics from all other nodes and merge them together
into a single file.
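The headless start-up path can be sketched as below. The property names follow the sample configuration shown later in this section; the class name and the defaults applied for commented-out keys are illustrative assumptions.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch of reading the configuration file passed on the command line instead
// of showing a GUI. Property keys follow the sample configuration in this
// report; the class name and defaults are illustrative.
public class HeadlessConfig {
    static Properties load(String path) throws IOException {
        Properties conf = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            conf.load(in);
        }
        return conf;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) return; // expects the configuration file as argument
        Properties conf = load(args[0]);
        // Defaults apply when a key is commented out of the configuration file.
        boolean rootNode = Boolean.parseBoolean(conf.getProperty("ROOT_NODE", "FALSE"));
        String rootNodeIp = conf.getProperty("ROOT_NODE_IP", "127.0.0.1");
        int bloomFilterSize = Integer.parseInt(conf.getProperty("BLOOM_FILTER_SIZE", "32768"));
        System.out.println("ROOT_NODE=" + rootNode
                + " ROOT_NODE_IP=" + rootNodeIp
                + " BLOOM_FILTER_SIZE=" + bloomFilterSize);
    }
}
```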
On execution, the deploying system reads the hostnames to connect to and begins an instance of
the crawler on each. Other necessary steps at this point are to copy over the configuration file for
each node before starting its instance of the crawler, and to make sure that the Root Node IP of
the overlay as well as the Statistics Collecting Node IP are specified. Similarly, a Root Node has
to be identified using a Boolean TRUE in its own configuration file, and likewise for a
Statistics Node. We have anticipated that the possible configurations necessary for a
normal run of PeerCrawl, without modifying any of its other parameters, are three in number:
one each for a Peer Node, a Master Node, and a Statistics Collector. Below is an example of
a PeerCrawl Peer Configuration file, which is simply a Java Properties file that the final JAR build
of PeerCrawl reads before starting. As can be seen, our architecture allows the tester to specify
several parameters which were otherwise hard-coded into PeerCrawl's existing architecture. This
allows the tester to increase the variability of his tests.
# Sample PeerCrawl configuration
# General or focused crawl
# FOCUSED_CRAWL = FALSE
# Crawl at root node starts from this URL
CRAWL_DOMAIN = www.gatech.edu
# The default number of fetch and process threads to be started
#MAX_THREADS = 10
# Max crawl threads that a system can handle
#MAX_CRAWL_THREADS = 25
# The length of hash (in bits)
#HASH_LENGTH = 128
# Number of junk lines allowed before <html tag appears in HTTP Response
#MAX_NON_HTML_LINES = 50
# Depth
#MAX_DEPTH = 25
# Bloom Filter Capacity - default is (4096 * 8)
#BLOOM_FILTER_SIZE = 32768
# root node
ROOT_NODE = FALSE
# root node IP - used only if ROOT_NODE is false
ROOT_NODE_IP = 10.0.0.1
# If caching on secondary storage is enabled
#CACHE_DOCS = FALSE
# If PDF type documents need to be crawled
#CRAWL_PDF = FALSE
# LOG files
#STATS_DIR= .stats
#NETWORK_STATS_FILE= network.csv
#CRAWL_URLS_STATS_FILE= urls.csv
#PerSecond_STATS_FILE= second_data.csv
#PerMinute_STATS_FILE= minute_data.csv
#HTTP_STATS_FILE= http_data.csv
# global binary stats
BINARY_STATS = TRUE
# stats server
STATS_NODE = FALSE
#STATS_SERVER_IP = 127.0.0.1
#STATS_SERVER_PORT = 7897
#GLOBAL_STATS_FILE = global_data.csv
# Interval used by stats server to accumulate global stats
#SECS_PER_STATS_SAMPLE = 5
# Bounds on various queues
#CRAWL_MAXJOB_LIMIT = 10000
#DISPATCH_THRESHOLD = 20
# Filename for backing up statistics
#CRAWL_JOB_PATH = crawlJobsSaved
#FETCH_BUFFER_PATH = fetchBufferSaved
#URL_INFO_PATH = URLInfoSaved
#ROBOTS_PATH = robotsSaved
# Timeout for getting pages
# Socket timeout - default is (4 * 1000)
#SOCKET_TIMEOUT = 4000
# Time to Backup data structures
# Backup time - default is (1000*60*5)
#BACKUP_TIME = 300000
# Time to sleep between two runs of statistics collection
#STATS_SLEEP_TIME = 1000
# Time for checking peers
#NETCONN_SLEEP_TIME = 5000
# Maximum number of hops allowed for entering back into domain
#MAX_HOP_COUNT = 0
# Maximum number of inactive minutes before we kill the crawler
#MAX_INACTIVE_MIN = 10
The reader can use the above configuration file as a template to build his own.
The deployment architecture also ensures that all clients are initiated within a short span of
time, keeping the variability in client start times consistently small.
3.2 Obtaining Statistics over the Network
The Manager node in our architecture is configured to run as a statistics node as well. For
compatibility purposes, the statistics node only requires a Boolean value specified in its
configuration file, upon which it will not run a PeerCrawl instance but rather
will set up a socket interface to collect results from the Peer and Root Nodes. The statistics
node is made efficient by allowing the peers to send their stats as a binary stream rather than
clear text. The statistics node obtains these results and merges them all into a single
comma-separated values file. The details of the statistics node, such as its IP and port, are
specified in the configuration file of each peer.
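A minimal sketch of the binary stats exchange is shown below: a peer packs one sample with DataOutputStream, and the statistics node unpacks it into one CSV row. The field layout and names are our assumptions, not PeerCrawl's actual wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the binary stats exchange between a peer and the statistics node.
// Illustrative field layout; PeerCrawl's real record format may differ.
public class BinaryStats {
    // Peer side: encode one stats sample as a compact binary record.
    static byte[] encode(String peerId, long urlsCrawled, int queueSize) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(peerId);
        out.writeLong(urlsCrawled);
        out.writeInt(queueSize);
        return buf.toByteArray();
    }

    // Statistics-node side: decode the record into one CSV row for merging.
    static String decodeToCsv(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        return in.readUTF() + "," + in.readLong() + "," + in.readInt();
    }

    public static void main(String[] args) throws IOException {
        byte[] record = encode("peer-01", 1500L, 42);
        System.out.println(decodeToCsv(record)); // prints: peer-01,1500,42
    }
}
```

In a real deployment, the peer would write such records to a socket connected to the statistics node rather than an in-memory buffer; the compact binary form is what keeps the collection cheap compared to clear text.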
The overall architecture of the deploying and statistics collection is shown below:
4. Future Work
As part of the future work, we propose the following:
This architecture can be extended to benchmark PeerCrawl effectively and compare it
with well-known crawlers such as Apoidea and Mercator. It would certainly be interesting to
note the pros and cons of a decentralized P2P crawler such as PeerCrawl against its counterparts.
This architecture can also be extended to provide the tester with a complete suite of graphs,
visual representations of a topology, and benchmark tests.
From a more research-oriented perspective, we feel that PeerCrawl can be used to explore several
interesting concepts. PeerCrawl is heavy on most system resources: it makes extensive use of the
CPU, physical memory, and network, along with a fair amount of disk I/O. We have tested
PeerCrawl on virtual machines and found it to be an excellent workload for analyzing
resource usage on such systems. Moreover, this can be extended to study the statistics
aggregator developed in our system at a larger scale. Specifically, the statistics aggregator is
bound to become a bottleneck or restrict crawling performance as the number of peers in the
overlay increases. PeerCrawl could thus be modified to include a protocol that distributes
statistics aggregation across multiple statistics nodes, and this could be used to study physical
resource sharing as the number of peers in the system increases.
5. Information to the User
This section contains details for the user of our system. Anyone wishing to use our system to test,
benchmark, or extend PeerCrawl will find this information useful.

The JAR file of our modified PeerCrawl includes three Java Properties files, which are essentially
our default configurations for a Statistics Node, a Root Node, and a Peer Node. The configuration
template is provided in the Architecture section of this report and can be customized to the
user's preference.

A node is started by simply passing the configuration file along with the JAR to the Java runtime.
For example, on Linux the command would be:
java -jar <JAR NAME> <CONF FILE>
We advise the user to run our builds on Linux machines, because there the system
can be started rapidly without any human intervention. We provide instructions to do that
below:
1. Create 3 files on your Linux System called peers, root_node, stat_node which
includes hostnames of each of the respective nodes in the files, each hostname on a
newline.
2. Set up password-less SSH login on each of your test nodes from your login node. This
can be done by generating a public key on your system and placing it in the
authorized_keys file on each of the nodes, using the commands:
ssh-keygen -t dsa
#Will generate a DSA private and public key pair in $HOME/.ssh
Now manually append $HOME/.ssh/id_dsa.pub to each node's authorized_keys file in
its respective $HOME/.ssh directory. If the file does not exist, you can create it.
3. Now you should be able to log in to your test nodes without any password.
4. You can begin running the system by executing the following commands in order:
cat stat_node | xargs -i ssh {} 'java -jar <jar name> <stat node configuration>'
cat root_node | xargs -i ssh {} 'java -jar <jar name> <root node configuration>'
cat peers | xargs -i ssh {} 'java -jar <jar name> <peer node configuration>'
Obviously, we assume that you have the JAR and the respective configuration files installed
on the nodes beforehand.
If an alternative method is used, care should be taken to ensure that the order of
execution remains as instructed above.
5. Users must also note that they have to hand-edit each configuration file with the IPs of
the respective Stats Node and Root Node.
6. Finally, the statistics can be found in a file called global_data.csv present in the stats
directory of the statistics node.
6. References
1. "DSphere: A Source-Centric Approach to Crawling, Indexing and Searching the World Wide
Web"; Bhuvan Bamba, Ling Liu, James Caverlee, Vaibhav Padliya, Mudhakar Srivatsa, Tushar
Bansal, Mahesh Palekar, Joseph Patrao, Suiyang Li and Aameek Singh; In Proceedings of the
International Conference on Data Engineering, 2007.
2. "Link-Based Ranking of the Web with Source-Centric Collaboration"; J. Caverlee, L. Liu, and W.
B. Rouse; 2nd International Conference on Collaborative Computing: Networking, Applications
and Worksharing (CollaborateCom), Atlanta, 2006.
3. "Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web";
Aameek Singh, Mudhakar Srivatsa, Ling Liu, Todd Miller; Proceedings of the SIGIR 2003
Workshop on Distributed Information Retrieval, Lecture Notes in Computer Science, Volume
2924.
4. "Mercator: A Scalable, Extensible Web Crawler"; Allan Heydon, Marc Najork; 1999.