Parallel PageRank Computation on a Gigabit PC Cluster
Bundit Manaskasemsak, Arnon Rungsawang
Massive Information & Knowledge Engineering
Department of Computer Engineering
Faculty of Engineering
Kasetsart University, Bangkok 10900, Thailand.
{un,arnon}@mikelab.net
Abstract
Efficiently computing PageRank scores for a large web graph is currently one of the hot issues in the Web-IR community. Recent research proposes to accelerate the computation in both algorithmic and architectural ways. We here focus on a parallel PageRank computation architecture on a cluster of Opteron PCs networked via Gigabit Ethernet. We propose both an efficient parallel algorithm for the standard PageRank computation and a simple pairwise communication model needed to synchronize local PageRank scores between processors. Our experimental results, conducted on a large web graph of over 1.5 billion links synthesized from a real set of crawled web pages in the TH domain, are quite promising. The current implementation takes less than 15 seconds per iteration run.
1. Introduction
Fast computation of PageRank scores for web graphs of a billion nodes is indispensable for most Web-IR research today. PageRank is used to influence the relevance of search results, in addition to the query keywords found in the web pages [2]. It is also used as a technique for guiding a web spider to refresh and discover interesting web pages: high-score pages tend to be very important; therefore, they should be fetched first and refreshed quite often [4]. The PageRank score of a web page is calculated from the scores of the other web pages whose hyperlinks point to it. This leads to an iterative computation between a large matrix representing the connectivity of the nodes within a web graph and a rank vector representing the estimated PageRank scores of the web pages.
The main issue in this context is the sheer size of the web (now exceeding 3 billion pages [1]), which makes the computation too time-consuming for Web-IR researchers to bear. Many speed-up techniques have thus been proposed, both algorithmic (e.g., extrapolation and adaptive methods [6,7], I/O-efficient techniques [3,5]) and architectural (e.g., PC cluster and P2P architectures [8,9]). In this paper, we focus on speeding up the PageRank computation using a cluster of Opteron PC machines networked via Gigabit Ethernet. We first employ a link structure file specially designed to represent the connectivity of a web graph. We then partition the link structure file into several smaller ones and allot each of them to a machine. Local PageRank scores in each machine are computed in parallel until, at an appropriate interval (e.g., every five iterations), they are sent to the other machines for synchronization. The computational process repeats until the PageRank scores converge to a set of acceptable values.
We devised our parallel PageRank algorithm, as well as a simple pairwise communication model for local PageRank score synchronization. We implemented the algorithm using the standard MPICH library and ran several experiments. Results obtained from a web graph synthesized from the real web pages crawled within the TH domain are quite promising. Using eight machines, an iteration run on a web graph of 1.5 billion links requires only 14.6 seconds.
This paper is organized as follows. Section 2 provides a short overview of the standard PageRank algorithm and the definition of the binary link structure file. Section 3 details our parallel algorithm, as well as our simple pairwise communication model. Section 4 describes how we test the preliminary implementation with a series of experiments, and then discusses the results. Finally, Section 5 concludes the paper.
2. Background
2.1. Review of PageRank
The intuitive description of PageRank is based on the idea that if a page v of interest has many other pages u with high PageRank scores pointing to it, then the authors of those pages u are implicitly conferring some importance to page v. The importance that pages u confer to page v can be described as follows. Let N_u be the number of pages that page u points to (later called its "out-degree"), and let Rank(u) represent the rank score of page u; then a hyperlink u → v confers Rank(u)/N_u units of rank to page v.
To compute the rank vector for all pages of a web graph, we then simply perform the following fixed-point computation iteratively. If T is the number of pages in the underlying web graph, we first assign the initial score 1/T to all pages. Let S_v represent the set of pages pointing to page v; in each iteration, the successive rank scores of the pages are recursively propagated from the previously computed rank scores of all other pages pointing to them:

  Rank_i(v) = \sum_{u \in S_v} Rank_{i-1}(u) / N_u        (1)
The above PageRank computation equation ignores some important details. In general, the web graph is not strongly connected, and this may cause the PageRank computation of some pages to be trapped in a small isolated cluster of the graph. This problem is usually resolved by pruning nodes with zero out-degree and by adding random jumps to the random surfer process underlying PageRank [2]. This leads to the following modification of Equation (1):

  Rank_i(v) = (1 - \alpha) + \alpha \sum_{u \in S_v} Rank_{i-1}(u) / N_u        (2)

where α, called the "damping factor", is the value that we use to modify the transition probability of the random surfer model on the underlying web graph. In the remainder of this paper, we refer to the iterative process stated in Equation (2) when computing the PageRank scores.
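To make the update rule concrete, the following minimal C sketch applies Equation (2) once to every page. It is our own illustration rather than the paper's code: the in-link lists, out-degrees, and 0-based page ids are assumed to be already available in memory.

/* One application of Equation (2): Rank_i(v) = (1 - alpha) + alpha * sum.
 * T          : number of pages (ids 0..T-1 here)
 * in_start   : in_start[v] .. in_start[v+1]-1 index the in-links of page v
 * in_src     : source page id u of each in-link (u -> v)
 * out_degree : N_u, the out-degree of page u
 * rank_old   : Rank_{i-1};  rank_new : Rank_i (both of length T)           */
void pagerank_iteration(int T, const int *in_start, const int *in_src,
                        const int *out_degree,
                        const double *rank_old, double *rank_new)
{
    const double alpha = 0.85;            /* damping factor used in the paper */
    for (int v = 0; v < T; v++) {
        double sum = 0.0;
        for (int k = in_start[v]; k < in_start[v + 1]; k++) {
            int u = in_src[k];            /* a page pointing to v */
            sum += rank_old[u] / (double)out_degree[u];
        }
        rank_new[v] = (1.0 - alpha) + alpha * sum;
    }
}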
2.2. Binary link structure file
In general, the input web graph exists in the form of a text file with two columns: a URL u in the former that has a hyperlink to the URL v in the latter. For easier processing, we transform the input web graph into two files. The first one consists of all URLs sorted in alphabetical order. The second one, called the "out-link file", is equivalent to the underlying input web graph; however, each line contains a pair of integers referring to the line numbers (i.e., source and destination id) in the first file. An example of these two files is depicted in Figure 1.
  URL file (sorted):                   Out-link file (source-id, dest-id):
  1  argo.ku.ac.th/index.html          1  2
  2  cpc.ku.ac.th/index.html           1  3
  3  cpe.ku.ac.th/home.html            1  4
  4  eng.ku.ac.th/index.html           3  1
  5  rdi.ku.ac.th/research.html        3  4
  6  www.ku.ac.th/index.html           4  2

Figure 1. The input web graph files.
However, our PageRank computation algorithm does not directly employ the input web graph files depicted in Figure 1; we rather convert the out-link file into a binary "link structure file" M, as illustrated textually in Figure 2. Each record contains three fields: the first, stored as a 4-byte integer, refers to a destination URL; the second, also stored as a 4-byte integer, represents its in-degree (i.e., the number of URLs pointing to it); and the last, stored as a list of 4-byte integers, holds the ids of the source URLs. For example, reading the second record of Figure 2, we find that the destination URL "http://cpc.ku.ac.th/index.html" is pointed to by the two source URLs "http://argo.ku.ac.th/index.html" and "http://eng.ku.ac.th/index.html", respectively.
  dest-id    in-degree   source-ids
  (4 bytes)  (4 bytes)   (4 bytes each)
  1          1           3
  2          2           1 4
  3          1           1
  4          2           1 3

Figure 2. The binary link structure file.
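To illustrate this record layout, a hedged C sketch of reading one such record follows; the function name and error handling are ours, and only the field widths come from the description above.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Read one record of the binary link structure file M:
 * [dest-id : 4 bytes][in-degree : 4 bytes][source-ids : 4 bytes each].
 * Returns 1 on success and fills *sources (freed by the caller);
 * returns 0 at end of file or on a truncated record.                      */
int read_link_record(FILE *m, int32_t *dest_id, int32_t *in_degree,
                     int32_t **sources)
{
    if (fread(dest_id, sizeof(int32_t), 1, m) != 1 ||
        fread(in_degree, sizeof(int32_t), 1, m) != 1)
        return 0;
    *sources = malloc((size_t)*in_degree * sizeof(int32_t));
    if (*sources == NULL ||
        fread(*sources, sizeof(int32_t), (size_t)*in_degree, m)
            != (size_t)*in_degree) {
        free(*sources);
        return 0;
    }
    return 1;
}

Because the source-id list is variable-length, the file is simply a concatenation of such records and is scanned sequentially in every iteration.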
3. Parallel PageRank computation
3.1. Distributed binary link structure file and
parallel algorithm
To accelerate the PageRank computation using a cluster of β processors (i.e., β machines), we first partition the binary link structure file M into β chunks M0, M1, ..., Mβ-1, such that each Mi contains only the records referring to the destination URLs from the (iT/β + 1)th to the ((i+1)T/β)th, where T is the total number of URLs in the web graph. Figure 3 illustrates textually an example of the partitioned binary link structure files Mi.

  dest-id    in-degree   source-ids
  (4 bytes)  (4 bytes)   (4 bytes each)
  1          1           3
  2          2           1 4
  3          1           1
  4          2           1 3
  link structure file M0   (1 <= dest-id <= T/β)

  dest-id    in-degree   source-ids
  (4 bytes)  (4 bytes)   (4 bytes each)
  1001       2           11 440
  1002       3           23 36 40
  1003       1           1021
  1004       2           132 2354
  link structure file M1   (T/β + 1 <= dest-id <= 2T/β)

  dest-id    in-degree   source-ids
  (4 bytes)  (4 bytes)   (4 bytes each)
  2001       3           2 15 221
  2002       1           124
  2003       3           1 35 112
  2004       2           1 239
  link structure file M2   (2T/β + 1 <= dest-id <= 3T/β)

Figure 3. The partitioned binary link structure file used in the parallel PageRank algorithm.

Algorithm 1 : Parallel Algorithm
 1 : assign an MPI process identifier to each machine, Pi ∈ {0, 1, 2, ..., β-1}
 2 : B ← (T/β)·Pi + 1;  E ← (T/β)·(Pi + 1);
 3 : ∀t ∈ 1..T  Rank_src,Pi[t] ← 1/T;  α ← 0.85;
 4 : for round ← 1 to 50 do
 5 :   while (M_Pi is not end of file) do
 6 :     t ← M_Pi.dest-id;
 7 :     Rank_dest,Pi[t - B] ← (1 - α) + α · Σ_{j=1..M_Pi.in-degree} Rank_src,Pi[M_Pi.source-id_j] / N_{M_Pi.source-id_j};
 8 :   end while
 9 :   each process has to synchronize all local Rank_dest,Pi
10 :   ∀t ∈ B..E  Rank_src,Pi[t] ← Rank_dest,Pi[t - B];
11 : end for
During the iterative computation of PageRank scores for a large web graph in parallel, every processor participating in the pool of tasks is assigned a process identifier Pi = 0, 1, 2, ..., β-1. Each processor Pi has to allocate, for its own use, two separate arrays of floating-point numbers: rank_src,Pi, having T entries, which records all source rank scores of iteration j, and rank_dest,Pi, having T/β entries, which records the local destination rank scores of iteration (j+1); and an array of integers, outarry_Pi, having T entries, which records the out-degree of each URL. The proposed parallel PageRank computation is illustrated in Algorithm 1.
According to Algorithm 1, we first assign an MPI process identifier to each machine, calculate the corresponding starting URL (B) and ending URL (E), initialize each source rank score (i.e., rank_src,Pi) with 1/T, and set the damping factor α to 0.85. The iterative process (see lines 4-11 in Algorithm 1) repeats until all final PageRank scores converge. In our experiments, those scores converge with an L1 norm of the residual error [6] below 0.025 when the iterative process is repeated at least 50 times. In the following experimental results, we therefore report with the number of iteration passes set to 50.
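As a concrete reading of lines 2-8 and 10 of Algorithm 1, the following C sketch shows the per-iteration work of one process Pi. It is our own hedged illustration (0-based ids, our variable names), not the authors' implementation; the out-degree array corresponds to outarry_Pi and the two rank arrays to rank_src,Pi and rank_dest,Pi.

#include <stdio.h>
#include <stdint.h>

/* One pass of lines 5-8 and 10 of Algorithm 1 for process Pi.
 * m          : this process's chunk M_Pi of the binary link structure file
 * out_degree : out-degree N_u of every URL u (T entries)
 * rank_src   : source scores of the previous iteration (T entries)
 * rank_dest  : local destination scores of this iteration (T/beta entries) */
void local_iteration(int Pi, int T, int beta, FILE *m,
                     const int32_t *out_degree,
                     double *rank_src, double *rank_dest)
{
    const double alpha = 0.85;
    const int B = (T / beta) * Pi;            /* first local dest-id (0-based) */
    int32_t dest, indeg, src;

    rewind(m);                                /* re-scan the chunk every round */
    while (fread(&dest, sizeof dest, 1, m) == 1 &&
           fread(&indeg, sizeof indeg, 1, m) == 1) {
        double sum = 0.0;
        for (int32_t j = 0; j < indeg; j++) {
            if (fread(&src, sizeof src, 1, m) != 1)
                return;                       /* truncated record: stop early */
            sum += rank_src[src] / (double)out_degree[src];
        }
        rank_dest[dest - B] = (1.0 - alpha) + alpha * sum;   /* line 7 */
    }
    for (int t = 0; t < T / beta; t++)        /* line 10: copy local block back */
        rank_src[B + t] = rank_dest[t];
}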
3.2. Pairwise communication
We can see that Algorithm 1 exposes one drawback: the local PageRank scores have to be synchronized between processors to obtain the final scores used in the next iteration step (see line 9). We therefore propose a simple but efficient communication model to synchronize those scores during the iteration runs. Figure 4 illustrates an example of the communication model for eight processors collaborating in the pool of tasks. In this model, process Pi first requests process Pj to send its local block of rank scores (i.e., Bj), and in the next consecutive step Pi sends its own block to Pj (see steps 1-2). Next, in the same way, Pi communicates with Pk to exchange the joined local blocks of rank scores accumulated so far (see steps 3-6). If we let β be the number of processors, all β processors will thus be able to synchronize their local blocks of PageRank scores in 2·log2(β) communication steps.
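The following C/MPI sketch is one possible realization of this pattern (our own code, under the assumption that β is a power of two; a single MPI_Sendrecv call carries out the two directions that the model counts as two separate steps). The exchanged region doubles in every round, as in steps 1-2, 3-4, and 5-6 of Figure 4.

#include <mpi.h>

/* Recursive-doubling exchange of local rank blocks.
 * rank_src : array of T doubles; on entry, the slice owned by this process
 *            (block_len = T/beta entries starting at Pi*block_len) is valid;
 *            on exit, every process holds all T scores.                     */
void synchronize_ranks(double *rank_src, int T, int beta, int Pi)
{
    int block_len = T / beta;
    int group = 1;                            /* blocks already gathered      */
    for (int step = 1; step < beta; step <<= 1) {
        int partner       = Pi ^ step;        /* pairwise partner this round  */
        int my_start      = (Pi / group) * group * block_len;
        int partner_start = (partner / group) * group * block_len;
        int count         = group * block_len;
        /* the two messages of each pairwise exchange                         */
        MPI_Sendrecv(rank_src + my_start, count, MPI_DOUBLE, partner, 0,
                     rank_src + partner_start, count, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        group *= 2;
    }
}

For β = 8 this performs log2(8) = 3 pairwise exchanges per process, corresponding to the six communication steps counted in the model.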
Figure 4. Pairwise communications during the synchronization of local PageRank scores (communication steps 1-6 among eight processors P0-P7 exchanging local blocks B0-B7).
Following our prior work [8], for simplicity let t_startup be the time expense needed to start up the connection between any pair of processes, and t_comm be the time expense for transferring one block of rank scores (i.e., any Bi) between a pair. The total cost T_total needed to completely synchronize the β blocks of rank scores can be written as in Equation (3):

  T_total = 2 ( \log_2\beta \cdot t_startup + \sum_{i=1}^{\log_2\beta} 2^{i-1} t_comm )
          = 2 \log_2\beta \cdot t_startup + \sum_{i=1}^{\log_2\beta} 2^{i} t_comm        (3)
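As a quick check of Equation (3) (our own worked instance): with β = 8 processors, a synchronization takes 2·log2(8) = 6 communication steps, so T_total = 6·t_startup + (2 + 4 + 8)·t_comm = 6·t_startup + 14·t_comm; that is, under this model each process pays the startup cost six times and the equivalent of 14 single-block transfers.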
3.3. Reducing communication cost
As we can see, the total expense of Algorithm 1 comes from the computational cost (i.e., mainly lines 5-8 and 10) plus the extra communication cost needed to synchronize all local PageRank scores between processors (i.e., line 9). To reduce that extra communication cost, we can let the local blocks of rank scores be synchronized only after every x consecutive iteration steps. The graph in Figure 5 plots the L1 norm of the residual errors, calculated from the final PageRank scores of a web graph in the TH domain, versus the number x of consecutive iteration steps before a local PageRank synchronization occurs.

Figure 5. L1 norm of residual errors versus local PageRank synchronization interval.
As seen from Figure 5, the L1 norm of the residual errors increases as the synchronization interval increases and as the number of processors participating in the computation increases. For 2, 4 and 8 processors, the average residual error obtained by synchronizing the local rank scores after every 5 consecutive steps is less than 0.025. This value is acceptable for our research purpose. We thus set the synchronization interval to 5 in the experimental results reported below.
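In code, this reduction amounts to guarding the synchronization step of Algorithm 1 with the chosen interval. The fragment below is a hedged sketch in C that reuses the local_iteration and synchronize_ranks sketches given earlier (all names are ours; sync_interval = 5 reflects the setting adopted above), assumed to live inside the driver after the data structures have been set up.

/* Driver loop: line 9 of Algorithm 1 is executed only every sync_interval
 * rounds (and once after the final round) instead of every round.          */
const int num_rounds    = 50;    /* iteration passes, as in Algorithm 1 */
const int sync_interval = 5;     /* x, the synchronization interval     */

for (int round = 1; round <= num_rounds; round++) {
    local_iteration(Pi, T, beta, m, out_degree, rank_src, rank_dest);
    if (round % sync_interval == 0 || round == num_rounds)
        synchronize_ranks(rank_src, T, beta, Pi);    /* pairwise exchange */
}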
4. Experimental results and discussion
4.1. Experimental setup
Machine Setup: We had the opportunity to run our experiments on a new PC cluster of eight Opteron 240 machines, networked via Gigabit Ethernet and running the Linux operating system, at the KU office of computing service. Each machine is equipped with 3 GB of main memory and a UW-SCSI hard disk. The proposed algorithm was written in the C language using the standard message passing library MPICH version 1.2.5. We arranged the computing environment so that no other processes, except ours, were running on the machines during the experiments.
Input Data Set: We use a web graph derived from a crawl performed during January 2003 within the TH domain. It contains around 10.9 million web pages and 97 million links. The web graph was first parsed and converted into the input web graph files depicted in Figure 1. We hereafter call this data set the base graph "1DB". We then create additional artificial web graphs of larger size by concatenating several copies of the 1DB and connecting those copies by rerouting some of the links. The resulting data represents large artificial web graphs of 2, 4, 8, and 16 times the size of the 1DB base graph, respectively. Before computing the PageRank scores using Algorithm 1, we preprocess and convert each artificial web graph into the corresponding binary link structure file, as described in Sections 2.2 and 3.1.
4.2. Results and discussion
For each test, we ran the experiments at least 3 times and averaged the results. The resulting runs using 1 processor (i.e., the base case for the speedup calculation), 2, 4, 6, and 8 processors on the 1DB, 2DB, 4DB, 8DB, and 16DB data sets are depicted in Figure 6. The average wall-clock times needed for the 16DB data set (i.e., roughly 174.4 million web pages and 1.55 billion links) using 1, 2, 4, and 8 processors are 91.03, 47.04, 24.50, and 14.16 seconds per iteration, respectively. In Figure 6, we also draw an ideal speedup curve for reference.

Figure 6. Speedup curves concluded from the experiments.

Note that the slope of the speedup curves gets closer to the ideal one as the size of the virtual data becomes larger, but each curve tends to deviate from the ideal line as the number of processors increases and the size of the data set decreases. We look forward to investigating the upper limit of our proposed algorithm with larger virtual web data; unfortunately, we have run out of the allowed disk space at the moment.
5. Conclusion
In this paper, we have proposed another parallel algorithm to accelerate the PageRank computation on a PC cluster. We also presented a simple pairwise communication model to synchronize the local PageRank scores between processors. We implemented and tested our algorithm on a cluster of eight Opteron machines networked via Gigabit Ethernet. To study its efficiency, we performed several experiments using artificial web graphs synthesized from real web data. The results are quite encouraging. To study the limitations of the proposed algorithm, we look forward to performing more experiments with larger web graphs, which are indeed in our web spidering queue.
6. Acknowledgement
We would like to thank the KU office of computing service for letting us perform many experiments on the Opteron PC cluster. This research is part of the MIKE-Web-Explorer project granted by KURDI, Thailand.
7. References
[1] http://www.google.com
[2] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proc. of the 7th WWW Conf., 1998.
[3] Y. Chen, Q. Gan and T. Suel, I/O Efficient Techniques for Computing PageRank, Proc. of the 11th ACM CIKM Conf., 2002.
[4] J. Cho, H. Garcia-Molina and L. Page, Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7): 161-172, 1998.
[5] T. Haveliwala, Efficient Computation of PageRank, Technical Report 1999-31, Stanford Digital Library Project, 1999.
[6] S. Kamvar, T. Haveliwala, C. Manning and G. Golub, Extrapolation Methods for Accelerating PageRank Computations, Proc. of the 12th WWW Conf., 2003.
[7] S. Kamvar, T. Haveliwala and G. Golub, Adaptive Methods for the Computation of PageRank, Linear Algebra and its Applications, Special Issue on the Numerical Solution of Markov Chains, November 2003.
[8] A. Rungsawang and B. Manaskasemsak, PageRank Computation using PC Cluster, Proc. of the 10th European PVM/MPI User's Group Meeting, 2003.
[9] K. Sankaralingam, S. Sethumadhavan and J.C. Browne, Distributed PageRank for P2P Systems, Proc. of the 12th IEEE HPDC Conf., 2003.