SAMI 2015 • IEEE 13th International Symposium on Applied Machine Intelligence and Informatics • January 22-24, 2015 • Herl’any, Slovakia
Graph relationship discovery using Pregel
computing model
Michal Laclavík
Institute of Informatics, SAS,
Bratislava, Slovakia
michal.laclavik@savba.sk
Ján Mojžiš
Institute of Informatics, SAS,
Bratislava, Slovakia
upsyjamo@savba.sk, janmojzisx@gmail.com
In this paper we present PCMARS, a novel graph
relationship discovery algorithm intended for the Pregel
computing model. It is specifically designed to be minimally
sensitive to edge directions, which is useful in breadth-first
graph traversals and in graph relationship discovery
tasks in general.
The rest of this paper is organized as follows. The Related
Work section briefly describes the technological background
of selected distributed parallel computing models that are
currently widely used or promising. The Algorithm
section defines and describes our proposed algorithm.
The Experiment section describes the experimental setting,
including the data and the environment. The Results section
discusses the results of the experiment, and the
Conclusion summarizes our work and suggests
future improvements and usage.
In the Experiment section we evaluate PCMARS on two
small test-case graphs and on the Freebase dataset, for
which distributed and parallel computing is a necessity.
The Pregel framework is therefore well suited to our
graph relationship discovery goal.
Abstract: Distributed computing is widely used
nowadays. Its computational power and memory resources
are vital for computations on large-scale datasets that
cannot be handled by a stand-alone system. Pregel is a
novel distributed graph computing model featuring
vertex-centric computing divided into a sequence of
supersteps. In this paper we propose a new algorithm called
Pregel Computing Model Algorithm for Relationship Search
(PCMARS). Our algorithm is well suited to both directed and
undirected graphs, whose edges can be weighted and typed.
Typed edges are used in the RDF graph model in the Big Data
context. PCMARS can simulate traversal in the opposite
direction of an edge; it neither uses additional indexing
nor changes the graph structure. We demonstrate PCMARS
in an example scenario using the Freebase dataset
and several test-case graphs.
Keywords: graph relationship discovery; Pregel; distributed computing
I. INTRODUCTION
Relationships play a key role in our daily work, from
simple planning that depends on time and location, such as
bus arrivals, departures and route crossings, to the design
of network traffic or workflow plans. We often use
tables to record important entities, aligning them in
columns and rows to show relations. We use
diagrams and schemas to give a quick overview of the
data, to depict information, and to show some kind
of relationship or ordering among the displayed entities or
elements. One such diagram is a graph. We can create
chronologically ordered workflows as directed (weighted)
graphs. In social networks, we can record relations between
people, firms or other entities. A graph usually consists of
vertices (for entities) and edges (for relations). Edges can
be directed and weighted: in traffic, for example, an edge
gives the direction of a bus and its weight can represent
the price or duration of the trip.
Distributed and parallel computing comes into play
when dealing with quantities of data too large for a single
stand-alone machine, in terms of either computing power
or memory resources.
Relationship discovery is an important field today,
and graphs are widely used to create overviews
and show relations between entities. Social networks
of people and firms [foaf.sk], multidisciplinary data
[freebase.org], citation graphs (ACM RKBExplorer) and the
KEGG genome database
(http://www.genome.jp/kegg/kegg3a.html) are just a few
examples from various use cases.
II. RELATED WORK
MapReduce is a well-known programming model for
processing and generating large data sets with a parallel,
distributed algorithm on a cluster. MapReduce is
composed of two procedures, Map() and Reduce(). Map()
performs filtering and sorting, while Reduce() summarizes
the results (for example, by counting the number of records).
Further details can be found in [2]. A popular open-source
implementation of MapReduce is available in Apache
Hadoop [3], for Apache Hadoop clusters. Although
MapReduce is considered fault-tolerant and
redundancy-tolerant, the communication between the various
parts of the system can in particular cases lead to
suboptimal performance [1].
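The division of labor between the two procedures can be illustrated with a minimal single-process sketch. It is not Hadoop code: the names map_fn, reduce_fn and run_mapreduce are illustrative assumptions, and the grouping step stands in for the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict


def map_fn(record):
    """Map: emit one (key, 1) pair per word found in the record."""
    for word in record.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: summarize all values emitted for one key (here, a count)."""
    return key, sum(values)


def run_mapreduce(records):
    # Shuffle phase: group the intermediate pairs by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Keys are visited in sorted order, as a reducer would see them.
    return dict(reduce_fn(key, groups[key]) for key in sorted(groups))


counts = run_mapreduce(["Pregel graph", "graph model", "graph"])
```

Here counts maps every distinct word to the number of records in which it was emitted.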
Pregel is a novel vertex-centric distributed graph
computing model. In this model every vertex in a graph is
a separate computing unit, which can send messages to and
receive messages from other vertices. Compute() is the main
function of each vertex, and all results and information are
shared by sending messages. Pregel addresses and reduces
network communication overhead through synchronization
built on the message passing concept. The model requires one
master machine, which acts as the environment or medium for
distributing data and messages and is responsible for
superstep synchronization, and several worker machines,
which carry out the computation.
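The superstep mechanics can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: run_pregel and max_value are hypothetical names, the loop over vertices stands in for the parallel workers, and the maximum-value propagation is the familiar textbook example rather than anything specific to this paper.

```python
def run_pregel(edges, values, compute, max_supersteps=30):
    """Call compute() for every vertex in lock-step supersteps."""
    inbox = {v: [] for v in values}
    for superstep in range(1, max_supersteps + 1):
        outbox = {v: [] for v in values}
        active = False
        for v in values:  # real workers would process vertices in parallel
            sent = compute(v, values, edges.get(v, []), inbox[v], superstep)
            for target, message in sent:
                outbox[target].append(message)
                active = True
        inbox = outbox  # barrier: messages sent in superstep i arrive in i+1
        if not active:
            return superstep
    return max_supersteps


def max_value(v, values, neighbors, messages, superstep):
    """Each vertex adopts the largest value it has heard of and re-broadcasts on change."""
    new_value = max([values[v]] + messages)
    changed = new_value > values[v]
    values[v] = new_value
    if superstep == 1 or changed:
        return [(n, new_value) for n in neighbors]
    return []


edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
values = {"a": 3, "b": 6, "c": 1}
supersteps = run_pregel(edges, values, max_value)
```

The run stops in the first superstep in which no vertex sends a message, which mirrors how a Pregel job ends once all vertices are idle.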
978-1-4799-8221-9/15/$31.00 ©2015 IEEE
J. Mojžiš and M. Laclavík • Graph Relationship Discovery using Pregel Computing Model
III. ALGORITHM
PCMARS is designed for the Pregel computing model, in
which the vertex-centric approach is the key element. The
compute() functions of all vertices run synchronized in
supersteps. Messages sent in superstep i are received
in superstep i+1. Although computation across
machines is performed in parallel, the compute() function is
sequential for every vertex. Messages received by a vertex
and handled in its compute() function are therefore
placed on a stack and removed and handled one by one
until no messages remain for this vertex
and superstep. The basic idea of PCMARS is that we can
navigate and search for relationships in both directed and
undirected graphs. PCMARS resembles breadth-first
search, except that it is designed for the Pregel
computing model and can traverse edges in the opposite
direction (without additional indexing or changes to the
graph structure). Messaging is the core of communication
between vertices in the Pregel computing model. We
therefore define several messages in Def. 2 for use in the
PCMARS algorithm.
[Flowchart: Start; has messages (+/-); handle messages; step = 1 (+): send interesting; step = 2 (+): send ping; End]
Def. 1. Let there be a graph G = {E, V} and two subsets of
V, Pv and Iv, for which Pv ∪ Iv = V. In the RDF graph model,
the edges in E are typed, and E covers all edge types
defined for the given RDF graph.
Figure 1. PCMARS Base scheme.
Fig. 2 provides details of the handle messages function. It
consists of two main parts: handling store messages and
handling all other messages. Store messages are messages
that a vertex sent to itself in the previous superstep. They
protect the system from message overflow, which can cause
PCMARS to crash. When a vertex needs to generate
too many messages, it generates a store message and
sends it to itself; the message is handled in the next
superstep, where generation continues from the point at
which it stopped. Keeping the complexity of PCMARS under
control is critical to prevent the algorithm from failing,
and possibly crashing, when the memory limit is exceeded.
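The store-message idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name send_with_limit, the (target, type, payload) tuple layout and the use of a list offset as the payload of the stored_* message are all assumptions made for the example.

```python
def send_with_limit(vertex, neighbors, kind, limit, start=0):
    """Emit at most `limit` messages, plus a self-sent stored_* message if work remains."""
    batch = neighbors[start:start + limit]
    out = [(target, kind, None) for target in batch]
    next_start = start + len(batch)
    if next_start < len(neighbors):
        # Store message: tells this vertex where to resume in the next superstep.
        out.append((vertex, "stored_" + kind, next_start))
    return out


# Drive one vertex until all of its neighbors have been served.
neighbors = ["n1", "n2", "n3", "n4", "n5"]
sent, start, rounds = [], 0, 0
while start is not None:
    out = send_with_limit("v", neighbors, "ping", limit=2, start=start)
    rounds += 1
    start = None
    for target, kind, payload in out:
        if kind.startswith("stored_"):
            start = payload  # resume point carried by the store message
        else:
            sent.append(target)
```

With a limit of 2 the five pings are spread over three rounds, which is the trade of extra supersteps for a bounded number of messages per superstep.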
In the first superstep, vertex v sends interesting
messages to all of its neighbors found in its adjacency list
A, but only if v ∊ Iv. The complexity IC1 for vertex v in the
first superstep is given by (1).
Def.2. The message set M for PCMARS is defined as M
= {ping, interesting, interesting_reply, stored_ping,
stored_interesting, stored_interesting_reply}.
A relation is found if any Pv vertex receives at least two
messages from two different Iv vertices. The store messages
from Def. 2 represent original messages (such as ping or
interesting) but are intended for self-sending in order to
save the job for the next superstep. Being able to
store jobs for future supersteps is important, because the
memory limit and the complexity of PCMARS (which depends
heavily on the neighbor count and the messages received)
must be taken into account for PCMARS to be stable and
reliable.
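The relation condition can be written down directly. In the sketch below, related_pairs is a hypothetical helper and the input dictionary is an assumed bookkeeping structure that records, for each Pv vertex, the Iv vertices whose interesting or interesting_reply messages reached it.

```python
from itertools import combinations


def related_pairs(received):
    """Map each Pv vertex to the pairs of distinct Iv vertices it connects.

    `received` maps a Pv vertex to the Iv sources of the interesting or
    interesting_reply messages it has seen (duplicates are allowed).
    """
    result = {}
    for vertex, sources in received.items():
        distinct = sorted(set(sources))
        if len(distinct) >= 2:  # at least two different Iv vertices
            result[vertex] = list(combinations(distinct, 2))
    return result


received = {7: [4, 5, 4], 12: [9], 13: [9, 10]}
pairs = related_pairs(received)
```

Vertex 12 heard from a single Iv vertex only, so it reports no relation, while repeated messages from the same source (vertex 7) are counted once.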
Freebase (http://www.freebase.org) is an RDF graph
database covering many fields of interest (music, sport,
film, etc.). It is notable for being very large, mainly
because of the density of its vertex and edge connections.
As in breadth-first search, vertices look for relations by
sending messages and distributing them further to the
neighbors in their adjacency lists. Distribution is limited:
a message is not distributed further if it would pass
through more vertices than a given limit. This
limit is called max_path_length, and in the PCMARS
evaluations we set it to 2. The limit also affects
complexity, because complexity grows with
message distribution.
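One possible reading of the max_path_length rule is sketched below. It assumes that each message carries the list of vertices it has passed and that the current vertex counts toward the limit; the function name forward and the rule of not sending a message back to a vertex already on its path are assumptions added for the illustration.

```python
MAX_PATH_LENGTH = 2  # value used in the PCMARS evaluations


def forward(message_path, vertex, neighbors, max_path_length=MAX_PATH_LENGTH):
    """Extend the path by `vertex`; return copies for the neighbors, or [] at the limit."""
    path = message_path + [vertex]
    if len(path) > max_path_length:
        return []  # the message would pass more vertices than allowed
    # Assumption: a message is never sent back to a vertex it already passed.
    return [(n, path) for n in neighbors if n not in path]


first_hop = forward([], "a", ["b", "c"])
second_hop = forward(["a"], "b", ["a", "d"])
third_hop = forward(["a", "b"], "d", ["e"])
```

With the limit set to 2, the message is dropped at the third vertex it reaches.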
Fig. 1 shows the base scheme of the PCMARS algorithm. The
key decision is has messages. It is tested in all
supersteps, whereas send interesting and send ping are
generated only in the first two supersteps. Vertices from
the Iv set can send interesting messages to the neighbors
in their adjacency list A only in the first superstep.
Vertices from the Pv set send ping messages to all
neighbors in their list A, but only in the second
superstep. Pv vertices can, however, distribute
ping, interesting and interesting_reply messages in all
supersteps from the second one onward (inclusive). The
handle messages function is complex and handles all messages
received by a given vertex.
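The decisions of the base scheme can be summarized in a short sketch. It is a simplification under assumptions: compute and its parameters are hypothetical names, handle_messages is passed in as a placeholder because its internals are described separately, and Pv is taken to be the complement of Iv, which Def. 1 does not strictly require.

```python
def compute(vertex, superstep, messages, adjacency, interesting_set, handle_messages):
    """Return the (target, message_type, source) tuples a vertex sends in one superstep."""
    out = []
    if messages:  # "has messages" is tested in every superstep
        out.extend(handle_messages(vertex, messages))
    if superstep == 1 and vertex in interesting_set:
        # Iv vertices announce themselves to all neighbors in the first superstep.
        out.extend((n, "interesting", vertex) for n in adjacency)
    if superstep == 2 and vertex not in interesting_set:
        # Pv vertices ping all neighbors in the second superstep.
        out.extend((n, "ping", vertex) for n in adjacency)
    return out


def no_handling(vertex, messages):
    """Placeholder for the handle messages function."""
    return []


iv = {4}
step1_iv = compute(4, 1, [], [1, 2], iv, no_handling)
step1_pv = compute(1, 1, [], [7], iv, no_handling)
step2_pv = compute(1, 2, [("interesting", 4)], [7], iv, no_handling)
step2_iv = compute(4, 2, [], [1, 2], iv, no_handling)
```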
$$IC_1 = \begin{cases} |A|, & v \in I_v \\ 0, & v \notin I_v \end{cases} \qquad (1)$$
[Flowchart: Start; has stored messages (+/-); handle stored messages; handle all other messages; End]
Figure 2. Handle messages function of PCMARS.
For the n-th superstep, n > 1, the complexity ICn of v,
v ∊ Iv, is ICn = |Pn|, where Pn is the set of ping
messages received in the n-th superstep.
Vertices from the Pv set start sending messages in the
second superstep. In the second superstep, every vertex
from the Pv set needs to send ping messages to all neighbors
found in its adjacency list A. Therefore, the complexity
PC2 of v, v ∊ Pv, in the second superstep is PC2 = |A|. In
addition, starting in the second superstep, the vertex may
have received messages (interesting, generated in the first
superstep, or ping and interesting_reply if the superstep
is > 2). Every vertex from the Pv set can distribute ping,
interesting and interesting_reply messages. If it has already
received interesting or interesting_reply messages, the
vertex can respond to ping messages with interesting_reply.
The complexity of v, v ∊ Pv, for the n-th superstep, n > 1,
is PCn = |A| if v has not yet received interesting or
interesting_reply and the message is a ping or a first
interesting or interesting_reply. If v receives a ping and
has already obtained interesting or interesting_reply, the
complexity is PCn = |A| × |I| + |I|. The set I contains the
vertices from which this vertex obtained interesting or
interesting_reply messages. The vertex has to send
interesting_reply to all |A| neighbors and include all
vertices from the set I for each of them. The final + |I| is
the complexity of the interesting_reply response, which is
the reaction to the received ping. It is sent only once and
only to the source vertex of the ping message. The
complexity for Iv vertices and Pv vertices is given in (2)
and (3).
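$$IC = |A| + \sum_{i=2}^{n} |P_i|, \quad v \in I_v \qquad (2)$$
$$PC_i = \begin{cases} |A|, & \text{no interesting or interesting\_reply received yet} \\ |A| \times |I| + |I|, & \text{ping received after interesting or interesting\_reply} \end{cases} \quad v \in P_v,\ 2 \le i \le n \qquad (3)$$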
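The two ping-handling cases of a Pv vertex, costing |A| messages or |A| × |I| + |I| messages, can be sketched as below. The function name handle_ping and the (target, type, payload) tuples are assumptions for illustration. The last group of messages goes back to the source of the ping, which is how a reply can travel against the direction of the edge that delivered the ping.

```python
def handle_ping(vertex, ping_source, adjacency, known_interesting):
    """Messages a Pv vertex generates for one received ping."""
    if not known_interesting:
        # No interesting vertex known yet: distribute the ping, |A| messages.
        return [(n, "ping", ping_source) for n in adjacency]
    known = sorted(known_interesting)
    # |A| x |I|: every neighbor is told about every known interesting vertex.
    out = [(n, "interesting_reply", i) for n in adjacency for i in known]
    # + |I|: the response to the ping, sent only to the ping's source vertex.
    out += [(ping_source, "interesting_reply", i) for i in known]
    return out


plain = handle_ping(1, 9, [2, 3], set())
reply = handle_ping(1, 9, [2, 3], {4, 5})
```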
Splitting messages that exceed the limit into later
supersteps comes at the cost of more supersteps and a longer
run time, but PCMARS remains stable.
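The message-count estimate can be checked numerically. The sketch below reproduces the arithmetic of the worked example in this section (10 Iv vertices, average |A| = 1000, average |P| = 20, |Pv| = 1,000,000, average |I| = 5, 10 supersteps) exactly as the example combines its operands; the function names iv_messages and pv_messages are hypothetical.

```python
def iv_messages(num_iv, avg_adjacency, avg_pings, supersteps):
    """Iv side: |A| per Iv vertex in superstep 1, then |P| per later superstep."""
    return num_iv * avg_adjacency + avg_pings * (supersteps - 1)


def pv_messages(num_pv, avg_pings, avg_adjacency, avg_interesting, supersteps):
    """Pv side, with the same operands as the worked example."""
    first = num_pv * avg_pings                      # PC2 term of the example
    plain = avg_adjacency * (supersteps - 1)        # |A| per later superstep
    reply = (avg_adjacency * avg_interesting + avg_interesting) * (supersteps - 1)
    return first + plain + reply


ic10 = iv_messages(num_iv=10, avg_adjacency=1000, avg_pings=20, supersteps=10)
pc10 = pv_messages(num_pv=1_000_000, avg_pings=20, avg_adjacency=1000,
                   avg_interesting=5, supersteps=10)
total = ic10 + pc10
```

The three values match the figures quoted in the example: 10,180 for the Iv vertices, 20,054,045 for the Pv vertices and 20,064,225 in total.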
IV. EXPERIMENT
We use three test cases to evaluate PCMARS: two
small artificial graphs and one large graph built from the
Freebase dataset. The Iv vertices for the Freebase graph are
listed in Table 1. Note that Freebase identifiers (MIDs) are
not numeric, while Sedge works with numerical identifiers
only, for efficiency and memory savings. A mapping
between MIDs and Sedge IDs therefore had to be performed.
After the results were obtained, the Sedge IDs were mapped
back to Freebase MIDs and the value of Name was obtained.
Table 2 and Table 3 list the Iv vertices for the second and
third test graphs (the two artificial graphs), respectively.
Figures 3 and 4 visualize the two artificial test graphs.
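The round trip between Freebase MIDs and numeric identifiers can be sketched with two dictionaries. The paper does not describe how the mapping was actually built, so the consecutive numbering and the name build_id_maps are assumptions; the MIDs in the example are taken from Table I.

```python
def build_id_maps(mids):
    """Assign consecutive integers to MIDs and keep the reverse mapping."""
    mid_to_id = {}
    for mid in mids:
        # A repeated MID keeps the number it was given the first time.
        mid_to_id.setdefault(mid, len(mid_to_id) + 1)
    id_to_mid = {number: mid for mid, number in mid_to_id.items()}
    return mid_to_id, id_to_mid


mid_to_id, id_to_mid = build_id_maps(["m.06mt91", "m.09889g", "m.06mt91"])
numeric = mid_to_id["m.09889g"]   # identifier handed to the graph engine
restored = id_to_mid[numeric]     # mapped back after the results are obtained
```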
TABLE I. Iv VERTICES FOR FREEBASE GRAPH
Number  Name
1       Rihanna
2       Anastacia
3       Will Smith
4       Shakira
5       Michael Jackson
6       Alicia Keys
7       Sue Ann
8       Tasmin Archer
9       Vanessa Carlton
10      Aaliyah
11      Quincy Jones
12      Madonna
13      Freddie Mercury
14      James Brown
15      The Cranberries
16      Avril Lavigne
17      Kylie Minogue
18      The Corrs
19      DMX
20      Chi McBride
21      Eminem
22      Shaggy
23      Paul McCartney
24      Goo Goo Dolls
25      Aerosmith
26      Dara Rolins
27      Richard Muller
28      Sony Music Entertainment
29      Sony BMG
30      BMG Ariola
31      Warner Music Group
Here, Pi is the set of ping messages received in
superstep i, starting from 2, and n is the number of
supersteps. The final complexity of PCMARS can be
approximated from IC, PC and |Pv| as C = IC + |Pv| × PC.
For example, if |Iv| = 10 and the average |A| is 1000, the
complexity IC1 for the 10 Iv vertices is 10 × 1000 = 10,000.
This is only for the first superstep. Up to the 10th
superstep, if the average |P| is 20, IC10 = 10,000 +
20 × (10-1) = 10,180. The PC2 complexity for the vertices
from the Pv set is |Pv| × |A|. If the average number of ping
messages is 20 and |Pv| = 1,000,000, the complexity PC2 for
the second superstep is 1,000,000 × 20 = 20,000,000.
For 10 supersteps, if the average |I| is 5 (interesting
vertices), PC10 = 20,000,000 + 1000 × (10-1) + (1000 × 5
+ 5) × (10-1) = 20,000,000 + 9,000 + 45,045 = 20,054,045
total messages for Pv vertices by the 10th superstep, plus
an additional 10,180 from the Iv vertex set, resulting in
20,064,225 messages in total. Note that message
distribution is not taken into account here; it would
increase the complexity further. Also, to
keep the complexity per superstep at a reasonable scale,
there is a limit on the number of messages distributed per
superstep. For a graph as large and dense as Freebase,
on 5 worker machines with 30 GB of main memory, we
recommend setting it to 17. Splitting messages that exceed
the limit into further supersteps reduces memory
requirements.
Freebase MID (Table I):
m.06mt91, m.01wb9b8, m.0147dk, m.01wj18h, m.09889g, m.0g824,
m.01tvdr_, m.03pbg_, m.0x3n, m.01wd9lv, m.01vs_v8, m.01vn0t_,
m.0407f, m.01467z, m.0161c2, m.049qx, m.029dy9, m.01vvzb1,
m.05jpsx, m.01vsgrn, m.01wgysj, m.03j24kf, m.01jmj8, m.0134pk,
m.03f4fmx, m.07xwy1, m.043g7l, m.03mp8k, m.0dmxvkj, m.02bh8z
Number  Name
32      Warner Bros Records
33      Polygram
34      Universal Music
35      Mariah Carey
36      Grammy Awards
37      Linkin Park
In the second artificial graph (Fig. 4), the following
vertices received interesting or interesting_reply
messages: 4 (from 1, 10), 9 (from 1, 10, 15) and 13 (from 15).
In the Freebase graph, we found the following relations.
Eminem + Will Smith: both have USA nationality and, although
both do rap, the only music genre they share exactly is
hip hop; they both publish their music under Interscope
Records. Anastacia + Chi McBride: both born in Chicago.
Anastacia + Alicia Keys: both lived in New York City.
Alicia Keys + Will Smith: published under RCA Records.
Aaliyah + Alicia Keys: both nominated for the Grammy Award
for Best R&B Album. Rihanna + Michael Jackson: both received
the Grammy Award for Album of the Year. Alicia Keys + Will
Smith: both do pop and hip hop music, publish music under
the Columbia label and RCA Records, are record producers,
and appeared in the TV program America: A Tribute to Heroes.
Will Smith + DMX: both have USA nationality, and both
are actors and film producers. Madonna + DMX: publish
music under the Warner Bros. Records label.
Freebase MID (Table I, continued):
m.03rhqg, m.026s90, m.01dtcb, m.04xrx, m.0c4ys, m.04qmr
TABLE II. Iv VERTICES FOR SEDGE FIRST ARTIFICIAL GRAPH
Number  Sedge ID
1       1
2       10
3       15
I.
TABLE III. Iv VERTICES FOR SEDGE SECOND ARTIFICIAL GRAPH
Number  Sedge ID
1       4
2       5
3       6
4       9
5       10
[Graph drawing: vertices 1 through 13]
Figure 3. First artificial graph.
ACKNOWLEDGMENT
This work was supported by the Slovak Research and
Development Agency, project CLAN, number
APVV-0809-11, and by the Scientific Grant Agency of the
Ministry of Education, Science, Research and Sport of the
Slovak Republic and the Slovak Academy of Sciences,
project VEGA, number 2/0185/13.
[Graph drawing: vertices 1, 4, 9, 10, 13 and 15]
Figure 4. Second artificial graph.
CONCLUSION
In this work we propose PCMARS, a new graph relationship
discovery algorithm. It is designed for the Pregel
computing model, and its special property is the ability to
search in the opposite direction of an edge. We
have evaluated the algorithm on several test-case graphs:
two artificial graphs (Fig. 3 and 4) and one real Freebase
graph (Table 1). The algorithm was tested and modified to
fulfill the purpose of navigation and relationship
discovery. The results from the artificial test-case graphs
were carefully evaluated, and PCMARS was then run on the
Freebase graph data on a computer cluster of 5 worker
machines. Some relations were expected (e.g. Will Smith and
DMX are both actors), but some details were new to us
(Anastacia and Chi McBride were both born in Chicago).
Our PCMARS algorithm inherits the advantages of the
Pregel computing model and is therefore well suited to
large-scale relationship discovery on graph-like data
structures.
Future work on PCMARS will explore its
use in other domains, preferably bioinformatics and
biological systems, where relations can also be
represented in graph structures.
REFERENCES
[1]
Second artificial graph.
For the first artificial graph (Fig. 3), the following
vertices received interesting or interesting_reply messages:
1 (from 4, 6), 2 (from 4, 5), 3 (from 4, 6), 7 (from 4, 5, 6, 10),
8 (from 4, 5, 10), 11 (from 5, 10), 12 (from 9) and 13 (from 9, 10).
Following
vertices
received
interesting
or
[1] Malewicz, G. et al.: Pregel: a system for large-scale graph
processing. Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data.
[2] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data
processing on large clusters. Communications of the ACM, 51(1),
107-113.
[3] Apache Hadoop, http://en.wikipedia.org/wiki/Apache_Hadoop,
retrieved 8 Nov. 2014.