Parallel PageRank Computation on a Gigabit PC Cluster

Bundit Manaskasemsak, Arnon Rungsawang
Massive Information & Knowledge Engineering
Department of Computer Engineering, Faculty of Engineering
Kasetsart University, Bangkok 10900, Thailand
{un,arnon}

Abstract

Efficiently computing PageRank scores for a large web graph is one of the active issues in the Web-IR community. Recent research proposes to accelerate the computation in both algorithmic and architectural ways. Here we focus on a parallel PageRank computational architecture on a cluster of Opteron PCs networked via Gigabit Ethernet. We propose both an efficient parallel algorithm for the standard PageRank computation and a simple pairwise communication model for synchronizing local PageRank scores between processors. Our experimental results, obtained on a large web graph of over 1.5 billion links synthesized from a real set of web pages crawled in the TH domain, are quite promising. The current implementation takes less than 15 seconds per iteration.

1. Introduction

Fast computation of PageRank scores for web graphs of a billion nodes is indispensable for most Web-IR research today. PageRank is used to influence the relevance of search results, in addition to the query keywords found in the web pages [2]. It is also used to guide a web spider in refreshing and discovering interesting web pages: pages with high scores tend to be very important and should therefore be fetched first and refreshed often [4]. The PageRank score of a web page is calculated from the scores of the other web pages whose hyperlinks point to it. This leads to an iterative computation between a large matrix representing the connectivity of the nodes in a web graph and a rank vector representing the estimated PageRank scores of the web pages.
The main issue in this context is the sheer size of the web (currently exceeding 3 billion pages [1]), which makes the computation too time-consuming for Web-IR researchers to tolerate. Many speed-up techniques have thus been proposed, both algorithmic (e.g., extrapolation and adaptive methods [6,7], I/O-efficient techniques [3,5]) and architectural (e.g., PC cluster and P2P architectures [8,9]).

In this paper, we focus on speeding up the PageRank computation using a cluster of Opteron PCs networked via Gigabit Ethernet. We first employ a link structure file specially designed to represent the connectivity of a web graph. We then partition the link structure file into several smaller ones and allot each of them to a machine. Local PageRank scores on each machine are computed in parallel and, at an appropriate interval (e.g., every five iterations), are sent to the other machines for synchronization. The computation repeats until the PageRank scores converge to a set of acceptable values.

We devised our parallel PageRank algorithm and propose a simple pairwise communication model for synchronizing local PageRank scores. We implemented the algorithm using the standard MPICH library and ran several experiments. Results obtained from a web graph synthesized from real web pages crawled within the TH domain are quite promising. Using eight machines, one iteration on a web graph of 1.5 billion links requires only 14.6 seconds.

This paper is organized as follows. Section 2 gives a short overview of the standard PageRank algorithm and defines the binary link structure file. Section 3 details our parallel algorithm and our simple pairwise communication model. Section 4 describes how we test the preliminary implementation in a series of experiments and discusses the results. Finally, Section 5 concludes the paper.

2. Background

2.1. Review of PageRank

The intuition behind PageRank is that if a page v of interest is pointed to by many other pages u with high PageRank scores, then the authors of pages u are implicitly conferring some importance on page v. The importance that pages u confer on page v can be described as follows. Let $N_u$ be the number of pages that page u points to (hereafter its “out-degree”), and let $\mathrm{Rank}(u)$ be the rank score of page u. Then a hyperlink $u \rightarrow v$ confers $\mathrm{Rank}(u)/N_u$ units of rank on page v. To compute the rank vector for all pages of a web graph, we simply perform the following fixed-point computation iteratively. If T is the number of pages in the underlying web graph, we first assign the initial score $1/T$ to all pages. Let $S_v$ be the set of pages pointing to page v. In each iteration, the new rank score of every page is propagated from the previously computed rank scores of the pages pointing to it:

$$\forall v:\quad \mathrm{Rank}_{i+1}(v) = \sum_{u \in S_v} \frac{\mathrm{Rank}_i(u)}{N_u} \qquad (1)$$

Equation (1) ignores some important details. In general, the web graph is not strongly connected, and this may cause the PageRank computation for some pages to be trapped in a small isolated cluster of the graph. This problem is usually resolved by pruning nodes with zero out-degree and by adding random jumps to the random surfer process underlying PageRank [2]. This modifies Equation (1) to:

$$\forall v:\quad \mathrm{Rank}_{i+1}(v) = (1-\alpha) + \alpha \sum_{u \in S_v} \frac{\mathrm{Rank}_i(u)}{N_u} \qquad (2)$$

where $\alpha$, called the “damping factor”, is the value used to modify the transition probabilities of the random surfer model on the underlying web graph. In the remainder of this paper, PageRank scores are computed with the iterative process of Equation (2).

2.2. Binary link structure file

In general, the input web graph exists as a text file of two columns: the URL u in the first column has a hyperlink to the URL v in the second.
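Before turning to the file layout, the fixed-point iteration of Section 2.1 can be sketched directly on such a two-column list of (source, destination) pairs. This is a minimal illustration, not the implementation described in this paper. It reads Equation (2) as Rank(v) = (1 - α) + α Σ Rank(u)/N_u and assumes every page has at least one out-link (zero out-degree nodes already pruned). The function name `pagerank`, the toy graph, and the fixed iteration count are illustrative assumptions.

```python
from collections import defaultdict


def pagerank(edges, alpha=0.85, iterations=50):
    """Iterate Equation (2) over (source, destination) hyperlink pairs.

    Assumes every page has at least one out-link (no zero out-degree nodes).
    """
    pages = sorted({page for edge in edges for page in edge})
    out_degree = defaultdict(int)   # N_u: number of pages that u points to
    in_links = defaultdict(list)    # S_v: pages that point to v
    for u, v in edges:
        out_degree[u] += 1
        in_links[v].append(u)

    total = len(pages)                            # T
    rank = {page: 1.0 / total for page in pages}  # initial score 1/T
    for _ in range(iterations):
        # The new vector is built from the previous one only (Rank_i -> Rank_{i+1}).
        rank = {
            v: (1 - alpha)
            + alpha * sum(rank[u] / out_degree[u] for u in in_links[v])
            for v in pages
        }
    return rank


# Hypothetical toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
for page, score in pagerank(toy_edges).items():
    print(page, round(score, 4))
```

Under this reading of Equation (2), and with no zero out-degree pages, the scores sum to T at the fixed point rather than to 1. Dividing the (1 - α) term by T would give the probability-normalized variant.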
To ease processing, we transform the input web graph into two files. The first consists of all URLs sorted in alphabetical order. The second, called the “out-link file”, is equivalent to the underlying input web graph; however, each line contains a pair of integers (the source id and the destination id) that refer to line numbers in the first file. An example of these two files is depicted in Figure 1.

    source-id   dest-id
    1           2
    1           3
    1           4
    3           1
    3           4
    4           2

Figure 1. The input web graph files.

However, our PageRank algorithm does not directly use the input web graph files depicted in Figure 1. Instead, we convert the out-link file into a binary “link structure file” M, illustrated textually in Figure 2. Each record contains three fields: the first, stored as a 4-byte integer, refers to a destination URL; the second, also stored as a 4-byte integer, is its in-degree (i.e., the number of URLs pointing to it); and the last, stored as a list of 4-byte integers, holds the ids of the source URLs. For example, the second row of Figure 2 says that the destination URL with id 2 is pointed to by two source URLs, those with ids 1 and 4.

    dest-id     in-degree   source-id
    (4 bytes)   (4 bytes)   (4 bytes each)
    1           1           3
    2           2           1 4
    3           1           1
    4           2           1 3

Figure 2. The binary link structure file.

3. Parallel PageRank computation

3.1. Distributed binary link structure file and parallel algorithm

To accelerate the PageRank computation using a cluster of $\beta$ processors (i.e., machines), we first partition the binary link structure file M into $\beta$ chunks, $M_0, M_1, \ldots, M_{\beta-1}$, such that each $M_i$ contains only the records for destination URLs from the $(iT/\beta + 1)$-th to the $((i+1)T/\beta)$-th, where T is the total number of URLs in the web graph. Figure 3 illustrates textually an example of the partitioned binary link structure file $M_i$.

    link structure file $M_0$ ($1 \le \text{dest-id} \le T/\beta$)
    dest-id     in-degree   source-id
    (4 bytes)   (4 bytes)   (4 bytes each)
    1           1           3
    2           2           1 4
    3           1           1
    4           2           1 3

    link structure file $M_1$ ($T/\beta + 1 \le \text{dest-id} \le 2T/\beta$)
    dest-id     in-degree   source-id
    (4 bytes)   (4 bytes)   (4 bytes each)
    1001        2           11 440
    1002        3           23 36 40
    1003        1           1021
    1004        2           132 2354

    link structure file $M_2$ ($2T/\beta + 1 \le \text{dest-id} \le 3T/\beta$)
    dest-id     in-degree   source-id
    (4 bytes)   (4 bytes)   (4 bytes each)
    2001        3           2 15 221
    2002        1           124
    2003        3           1 35 112
    2004        2           1 239

Figure 3. The partitioned binary link structure file used in the parallel PageRank algorithm.

Algorithm 1: Parallel Algorithm

     1: assign an MPI process identifier to each machine, $P_i = 0, 1, 2, \ldots$
     2: $B \leftarrow \frac{T}{\beta} P_i + 1$;  $E \leftarrow \frac{T}{\beta}(P_i + 1)$;
     3: $\forall t \in 1..T:\ \mathrm{Rank}_{src,P_i}[t] \leftarrow \frac{1}{T}$;  $\alpha \leftarrow 0.85$;
     4: for round = 1 to 50 do
     5:     while $M_{P_i}$ is not at end of file do
     6:         $t \leftarrow M_{P_i}.\text{dest-id}$;
     7:         $\mathrm{Rank}_{dest,P_i}[t-B] \leftarrow (1-\alpha) + \alpha \sum_{\forall M_{P_i}.\text{in-degree}} \frac{\mathrm{Rank}_{src,P_i}[M_{P_i}.\text{source-id}]}{N_{M_{P_i}.\text{source-id}}}$;
     8:     end while
     9:     each process has to synchronize all local $\mathrm{Rank}_{dest,P_i}$
    10:     $\forall t \in B..E:\ \mathrm{Rank}_{src,P_i}[t] \leftarrow \mathrm{Rank}_{dest,P_i}[t-B]$;
    11: end for

During the parallel iterative computation of PageRank scores for a large web graph, every processor participating in the pool of tasks is assigned a process identifier $P_i = 0, 1, 2, \ldots, \beta-1$. Each processor $P_i$ allocates for itself two separate arrays of floating-point numbers: $rank_{src,P_i}$, with T entries, holds all source rank scores of iteration j, and $rank_{dest,P_i}$, with $T/\beta$ entries, holds the local destination rank scores of iteration j+1. It also allocates an integer array $outarry_{P_i}$, with T entries, holding the out-degree of each URL. The proposed parallel PageRank computation is given in Algorithm 1.

According to Algorithm 1, we first assign an MPI process identifier to each machine, calculate the corresponding starting URL (B) and ending URL (E), initialize each source rank score (i.e., $rank_{src,P_i}$) to $1/T$, and set the damping factor to 0.85. The iterative process (lines 4-11 of Algorithm 1) repeats until all final PageRank scores converge. In our experiments, the scores converge with an L1 norm of residual errors [6] below 0.025 when the iterative process is repeated at least 50 times. In the experimental results that follow, we therefore set the number of iterations to 50.

3.2. Pairwise communication

Algorithm 1 exposes one handicap: local PageRank scores have to be synchronized between processors to obtain the final scores used in the next iteration step (line 9). We therefore propose a simple but efficient communication model to synchronize those scores during the iteration runs. Figure 4 illustrates the communication model for eight processors collaborating in the pool of tasks. In this model, process $P_i$ first requests process $P_j$ to send its local block of rank scores (i.e., $B_j$), and in the next consecutive step $P_i$ sends its own block to $P_j$ (steps 1-2). Then, in the same way, $P_i$ communicates with $P_k$ so that they exchange their joined local blocks of rank scores with each other (steps 3-6). If $\beta$ is the number of processors, all processors will thus have synchronized their local blocks of PageRank scores after $2\log_2\beta$ communication steps.

Figure 4. Pairwise communications during the synchronization of local PageRank scores. (Diagram: blocks B0-B7 held by processes P0-P7 over Steps 1-6.)

For simplicity, and as in our prior work [8], let $t_{startup}$ be the time needed to start up the connection between any pair of processes, and $t_{comm}$ the time needed to synchronize one block of rank scores (i.e., any $B_i$) between a pair. The total cost $T_{total}$ needed to completely synchronize the $\beta$ blocks of rank scores can be written as Equation (3):

$$T_{total} = 2\log_2\beta \cdot t_{startup} + 2\sum_{i=1}^{\log_2\beta} 2^{i-1}\, t_{comm} = 2\log_2\beta \cdot t_{startup} + \sum_{i=1}^{\log_2\beta} 2^{i}\, t_{comm} \qquad (3)$$

3.3. Reducing communication cost

The total cost of Algorithm 1 is the computational cost (mainly lines 5-8 and 10) plus the extra communication cost needed to synchronize all local PageRank scores between processors (line 9). To reduce that extra communication cost, we can let the local blocks of rank scores be synchronized only after every x consecutive iteration steps. Figure 5 plots the L1 norm of residual errors, calculated from the final PageRank scores of a web graph in the TH domain, versus the number x of consecutive iteration steps before a local PageRank synchronization occurs.

Figure 5. L1 norm of residual errors versus local PageRank synchronization interval.

As Figure 5 shows, the L1 norm of residual errors increases as the synchronization interval increases and as the number of processors participating in the computation increases. For 2, 4 and 8 processors, the average residual error when the local rank scores are synchronized after every 5 consecutive steps is less than 0.025. This value is acceptable for our research purposes. We thus set the synchronization interval to 5 for the experimental results reported below.

4. Experimental results and discussion

4.1. Experimental setup

Machine Setup: We ran our experiments on a new PC cluster of eight Opteron 240 machines, networked via Gigabit Ethernet and running the Linux operating system, at the KU office of computing service. Each machine is equipped with 3 GB of main memory and a UW-SCSI hard disk. The proposed algorithm was written in C using the standard message-passing library MPICH, version 1.2.5. We arranged the computing environment so that no process other than ours was running on any machine during the experiments.

Input Data Set: We use a web graph derived from a crawl within the TH domain during January 2003. It contains around 10.9 million web pages and 97 million links. The web graph was first parsed and converted into the input web graph files depicted in Figure 1. We hereafter call this base graph “1DB”. We then create additional artificial web graphs of larger size by concatenating several copies of 1DB and connecting those copies by rerouting some of the links. The resulting data are large artificial web graphs of 2, 4, 8, and 16 times the size of the 1DB base graph, respectively. Before computing the PageRank scores using Algorithm 1, we preprocess each artificial web graph and convert it into the corresponding binary link structure file, as described in Sections 2.2 and 3.1.

4.2. Results and discussion

For each test, we ran the experiment at least 3 times and averaged the results. The runs using 1 processor (the base case for the speedup calculation), 2, 4, 6, and 8 processors on the 1DB, 2DB, 4DB, 8DB, and 16DB data sets are depicted in Figure 6. The average wall-clock times per iteration for the 16DB data set (roughly 174.4 million web pages and 1.55 billion links) using 1, 2, 4, and 8 processors are 91.03, 47.04, 24.50, and 14.16 seconds, respectively. In Figure 6, we also draw an ideal speedup curve for reference.

Figure 6. Speedup curves concluded from the experiments.

Note that the slope of the speedup curves gets closer to the ideal one as the size of the virtual data becomes larger, but each curve tends to drift away from the ideal line as the number of processors used increases and the size of the data decreases. We look forward to investigating the upper limit of our proposed algorithm with larger virtual web data. Unfortunately, we have run out of the allowed disk space at the moment.

5. Conclusion

In this paper, we have proposed another parallel algorithm to accelerate the PageRank computation using a PC cluster. We also presented a simple pairwise communication model to synchronize the local PageRank scores between processors. We implemented and tested our algorithm on a cluster of eight Opteron machines networked via Gigabit Ethernet. To study its efficiency, we performed several experiments using artificial web graphs synthesized from real web data. The results are quite encouraging. To study the limitations of the proposed algorithm, we look forward to running more experiments with the larger web graphs that are currently in our web spidering queue.

6. Acknowledgement

We would like to thank the KU office of computing service for letting us perform many experiments on the Opteron PC cluster. This research is part of the MIKE-Web-Explorer project, funded by KURDI, Thailand.

7. References

[1]
[2] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proc. of the 7th WWW Conf., 1998.
[3] Y. Chen, Q. Gan and T. Suel, I/O Efficient Techniques for Computing Pagerank, Proc. of the 11th ACM CIKM Conf., 2002.
[4] J. Cho, H. Garcia-Molina and L. Page, Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7): 161-172, 1998.
[5] T. Haveliwala, Efficient Computation of PageRank, Technical Report 1999-31, Stanford Digital Library Project, 1999.
[6] S. Kamvar, T. Haveliwala, C. Manning and G. Golub, Extrapolation Methods for Accelerating PageRank Computations, Proc. of the 12th WWW Conf., 2003.
[7] S. Kamvar, T. Haveliwala and G. Golub, Adaptive Methods for the Computation of PageRank, Linear Algebra and its Applications, Spec. Issue on the Numerical Sol. of Markov Chains, November 2003.
[8] A. Rungsawang and B. Manaskasemsak, PageRank Computation using PC Cluster, Proc. of the 10th European PVM/MPI User’s Group Meeting, 2003.
[9] K. Sankaralingam, S. Sethumadhavan and J.C. Browne, Distributed Pagerank for P2P Systems, Proc. of the 12th IEEE HPDC’03 Conf., 2003.