SAMI 2015 • IEEE 13th International Symposium on Applied Machine Intelligence and Informatics • January 22-24, 2015 • Herl’any, Slovakia

Graph relationship discovery using Pregel computing model

Michal Laclavík
Institute of Informatics, SAS, Bratislava, Slovakia
michal.laclavik@savba.sk

Ján Mojžiš
Institute of Informatics, SAS, Bratislava, Slovakia
upsyjamo@savba.sk, janmojzisx@gmail.com

In this paper we present PCMARS, a novel graph relationship discovery algorithm intended for the Pregel computing model. It is specifically designed to be minimally sensitive to edge directions, which is useful in breadth-first graph traversals and in graph relationship discovery tasks in general. In the Experiment section we evaluate PCMARS on the Freebase dataset, for which distributed and parallel computing is a necessity, and on two small test-case graphs. The Pregel framework is therefore well suited to our graph relationship discovery goal.

The rest of this paper is organized as follows. The Related Work section briefly describes the technological background of selected distributed parallel computing models that are currently widely used or promising. The Algorithm section defines our proposed algorithm and describes it in detail. The Experiment section describes the experimental setting, including the data and the environment. The Results section further discusses the results found in the previous section, and the Conclusion summarizes this paper and our work and suggests future improvements and usage.

Abstract—Distributed computing is widely used nowadays. Its computational power and memory resources are vital for computations on large-scale datasets that cannot be handled by a stand-alone system. Pregel is a novel distributed graph computing model featuring vertex-centric computation divided into a sequence of supersteps. In this paper we propose a new algorithm called Pregel Computing Model Algorithm for Relationship Search (PCMARS).
Our algorithm is well suited for both directed and undirected graphs, where edges can be weighted and typed. Typed edges are used in the RDF graph model in the Big Data context. PCMARS is able to simulate traversal in the direction opposite to an edge without using additional indexing and without changing the graph structure. We demonstrate PCMARS in an example scenario using the Freebase dataset and several test-case graphs.

Keywords—graph distributed computing; relationship discovery; Pregel

I. INTRODUCTION

Relationships play a key role in our daily work, from simple planning, such as bus arrivals, departures and connections that depend on time and location, to network traffic and workflow plan design. We often use tables to record important entities, aligning them in columns and rows to show relations. We use diagrams and schemas to give a quick overview of the data, to depict information, and to show some kind of relationship or ordering among the displayed entities or elements. One such diagram is a graph. We can create chronologically ordered workflows as directed (weighted) graphs. In social networks, we can record relations between people, firms or other entities. A graph usually consists of vertices (entities) and edges (relations). Edges can be directed and weighted; in traffic, for example, an edge is the direction of the bus and its weight can represent the price or the duration of the journey.

Distributed and parallel computing comes into play when dealing with large quantities of data that would not fit on one stand-alone machine, in terms of either computing power or memory resources.

Relationship discovery is an important field today, and graphs are widely used to create overviews and to show relations between entities. Social networks of people and firms [foaf.sk], multidisciplinary data [freebase.org], citation graphs (ACM RKBExplorer) and the genome database KEGG (http://www.genome.jp/kegg/kegg3a.html) are just a few examples from various use cases.
II. RELATED WORK

MapReduce is a well-known programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce is composed of two procedures, Map() and Reduce(). Map() performs filtering and sorting, while Reduce() summarizes the results (for example, by counting the number of records). Further details can be found in [2]. A popular open-source implementation of MapReduce is available in Apache Hadoop [3] for Apache Hadoop clusters. Although MapReduce is considered fault- and redundancy-tolerant, the communication between the various parts of the system can in particular cases lead to suboptimal performance [1].

Pregel is a novel vertex-centric distributed graph computing model. In this model every vertex of a graph is a separate computing unit, which can send messages to and receive messages from other vertices. Compute() is the main function of each vertex, and results and all other information are shared by sending messages. Pregel addresses and reduces network communication overhead through synchronization combined with message passing. The model requires one master machine, which acts as the environment or medium for distributing data and messages and is responsible for superstep synchronization, and several worker machines, which compute the job.

978-1-4799-8221-9/15/$31.00 ©2015 IEEE

J. Mojžiš and M. Laclavík • Graph Relationship Discovery using Pregel Computing Model

III. ALGORITHM

PCMARS is crafted for the Pregel computing model, in which the vertex-centric approach is a key element. The compute() functions of all vertices run synchronized in supersteps. Messages sent in superstep i are received in the next superstep, i+1. Although computation across machines is performed in parallel, the compute() function is sequential for every vertex.
Therefore, the messages received by a vertex, which are handled in the compute() function, are placed on a stack and are sequentially removed and handled until there are no more messages for this vertex and superstep.

The basic idea of PCMARS is that we are able to navigate and search for relationships in either a directed or an undirected graph. PCMARS resembles a breadth-first algorithm, except that it is designed for the Pregel computing model and is capable of searching along edges in the opposite direction (without additional indexing or changes to the graph structure).

Messaging is the core of communication between vertices in the Pregel computing model. Therefore, we define several messages in Def. 2 for use in the PCMARS algorithm.

Def. 1. Let G = {E, V} be a graph, and let Pv and Iv be two subsets of V for which Pv ∪ Iv = V. In the RDF graph model, the edges in the set E are typed, and E contains all edge types defined for the given RDF graph.

[Figure 1: flowchart with the nodes Start; decision "has messages" (+: handle messages); decision "step = 1" (+: send interesting); decision "step = 2" (+: send ping); End.]
Figure 1. PCMARS base scheme.

Fig. 2 provides details on the handle messages function. It consists of two main parts: handle stored messages and handle all other messages. Store messages are self-sent messages from the previous superstep. They protect the system from message overflow, which can cause PCMARS to crash. When a vertex needs to generate too many messages, a store message is generated and self-sent, so that it is handled in the next superstep and message generation continues where it previously ended.

Knowing the complexity of PCMARS is critical for preventing algorithm failure and a possible crash caused by exceeding the memory limit. In the first superstep, a vertex v sends interesting messages to all of the neighbors in its adjacency list A, but only if v ∊ Iv. The complexity IC1 for vertex v in the first superstep is given by (1).

Def. 2. The message set M for PCMARS is defined as M = {ping, interesting, interesting_reply, stored_ping, stored_interesting, stored_interesting_reply}.
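The store-message mechanism and the message set of Def. 2 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names Message, STORED_FORM and emit_with_limit are hypothetical, and the idea that a store message carries a resume offset into the adjacency list is an assumption, since the text does not specify the payload.

```python
from enum import Enum


class Message(Enum):
    """The six message types listed in Def. 2."""
    PING = "ping"
    INTERESTING = "interesting"
    INTERESTING_REPLY = "interesting_reply"
    STORED_PING = "stored_ping"
    STORED_INTERESTING = "stored_interesting"
    STORED_INTERESTING_REPLY = "stored_interesting_reply"


# Each original message type and its self-sent "stored" counterpart.
STORED_FORM = {
    Message.PING: Message.STORED_PING,
    Message.INTERESTING: Message.STORED_INTERESTING,
    Message.INTERESTING_REPLY: Message.STORED_INTERESTING_REPLY,
}


def emit_with_limit(vertex, kind, neighbors, limit, offset=0):
    """Send at most `limit` messages now; defer the rest via a store message.

    Returns (outgoing, stored). `stored` is None when nothing is left over,
    otherwise a self-addressed tuple (vertex, stored_kind, resume_offset).
    The resume offset is an assumed payload, not taken from the paper.
    """
    batch = neighbors[offset:offset + limit]
    outgoing = [(target, kind, vertex) for target in batch]
    resume_at = offset + len(batch)
    stored = None
    if resume_at < len(neighbors):
        stored = (vertex, STORED_FORM[kind], resume_at)
    return outgoing, stored


neighbors = ["a", "b", "c", "d", "e"]
first, stored_1 = emit_with_limit("v", Message.PING, neighbors, limit=2)
second, stored_2 = emit_with_limit("v", Message.PING, neighbors, limit=2, offset=stored_1[2])
third, stored_3 = emit_with_limit("v", Message.PING, neighbors, limit=2, offset=stored_2[2])
```

With five neighbors and a limit of two, the work is spread over three supersteps, and no store message is produced once the adjacency list is exhausted.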
A relation is found if any Pv vertex receives at least 2 messages from two different Iv vertices. The store messages from Def. 2 represent the original messages (such as ping or interesting) but are intended for self-sending in order to save the job for the next superstep. Being able to store jobs for future supersteps is important, because the memory limit and the complexity of PCMARS (which depends heavily on the neighbor count and the number of messages received) must be taken into account for PCMARS to be stable and reliable.

Freebase (http://www.freebase.org) is an RDF graph database covering many fields of interest (music, sport, film, etc.), and it is notable for being very large, mainly because of the density of its vertex and edge connections.

As in breadth-first search, vertices searching for relations send messages and distribute them further to the neighbors in their adjacency lists. Message distribution is limited: a message cannot be distributed further if it would pass through more vertices than the limit allows. This limit is called max_path_length, and in the PCMARS evaluations we set it to 2. Distribution has an additional impact on complexity, which grows as messages are distributed.

Fig. 1 shows the base scheme of the PCMARS algorithm. The key decision is has messages. It is tested in all supersteps, whereas send interesting and send ping are generated only in the first two supersteps. Vertices from the Iv set can send interesting messages to the neighbors in their adjacency list A only in the first superstep. Vertices from the Pv set send ping messages to all neighbors in their list A, but only in the second superstep. Pv vertices can, however, distribute ping, interesting and interesting_reply messages in all supersteps from the second onward (inclusive). The handle messages function is complex and handles all messages received by a given vertex.

IC_1 = \begin{cases} |A|, & v \in I_v \\ 0, & \text{otherwise} \end{cases} \qquad (1)

[Figure 2: flowchart with the nodes Start; decision "has stored messages" (+: handle stored messages); handle all other messages; End.]
Figure 2. Handle messages function of PCMARS.
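The control flow of Fig. 1 and Fig. 2 can be illustrated with a toy, single-process simulation of supersteps in Python. This is a simplified sketch rather than the authors' implementation: it omits store messages, message forwarding and max_path_length, and it assumes that an Iv vertex answers a ping with interesting_reply, which the text does not state explicitly. The function name pcmars_sketch and the dictionary-based graph representation are hypothetical.

```python
from collections import defaultdict


def pcmars_sketch(adjacency, interesting, supersteps=4):
    """Return {Pv vertex: set of Iv vertices it heard from} for relations found.

    adjacency   -- dict mapping each vertex to a list of out-neighbors
    interesting -- the set Iv; every other vertex is treated as a Pv vertex
    """
    vertices = set(adjacency) | {n for ns in adjacency.values() for n in ns}
    inbox = defaultdict(list)   # messages delivered in the current superstep
    seen = defaultdict(set)     # Pv vertex -> Iv vertices it received from

    for step in range(1, supersteps + 1):
        outbox = defaultdict(list)  # delivered in the next superstep
        for v in vertices:
            # "has messages" -> "handle messages"
            for kind, source in inbox[v]:
                if kind == "ping" and v in interesting:
                    # Assumed behaviour: answer the ping so that the edge
                    # is effectively traversed in the opposite direction.
                    outbox[source].append(("interesting_reply", v))
                elif kind in ("interesting", "interesting_reply") and v not in interesting:
                    seen[v].add(source)
            neighbors = adjacency.get(v, ())
            if step == 1 and v in interesting:      # "send interesting"
                for n in neighbors:
                    outbox[n].append(("interesting", v))
            if step == 2 and v not in interesting:  # "send ping"
                for n in neighbors:
                    outbox[n].append(("ping", v))
        inbox = outbox

    # A relation is found at a Pv vertex reached from two different Iv vertices.
    return {v: sources for v, sources in seen.items() if len(sources) >= 2}


# Edges a -> p and p -> b: b has no edge toward p, yet p still learns about b.
graph = {"a": ["p"], "p": ["b"], "b": []}
relations = pcmars_sketch(graph, interesting={"a", "b"})
```

In this example vertex p receives interesting from a along the edge direction and interesting_reply from b in response to its ping, so the relation between a and b is discovered although the edge p -> b points away from p.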
For the n-th superstep, n > 1, the complexity ICn of a vertex v ∊ Iv is ICn = |Pn|, where Pn is the set of ping messages received in the n-th superstep.

Vertices from the Pv set start sending messages in the second superstep. In the second superstep, every vertex from the Pv set needs to send ping messages to all neighbors in its adjacency list A. Therefore, the complexity PC2 of a vertex v ∊ Pv in the second superstep is PC2 = |A|. In addition, starting with the second superstep, the vertex may have received messages (interesting, generated in the first superstep, or ping and interesting_reply if the superstep is > 2). Every vertex from the Pv set can distribute ping, interesting and interesting_reply messages. If it has already received interesting or interesting_reply messages, the vertex can respond to ping messages with interesting_reply. For a vertex v ∊ Pv in the n-th superstep, n > 1, the complexity is PCn = |A| if v has not yet received interesting or interesting_reply and the message is a ping or a first-time interesting or interesting_reply. If v receives a ping and has already obtained interesting or interesting_reply, the complexity is PCn = |A| × |I| + |I|. The set I contains the vertices from which this vertex obtained interesting or interesting_reply messages. The vertex has to send interesting_reply to all |A| neighbors and include all vertices from the set I in each of them. The final + |I| term is the complexity of the interesting_reply response to the received ping; it is sent only once and only to the source vertex of the ping message. The complexity for Iv vertices and Pv vertices is given in (2) and (3).

Splitting messages over further supersteps comes at the cost of more supersteps and a longer run time, but PCMARS will stay stable.

IV. EXPERIMENT

We use three test cases for the evaluation of PCMARS: two small artificial graphs and one large Freebase dataset graph.
The Iv vertices for the Freebase graph can be found in Table 1. Note that Freebase identifiers (MIDs) are not numeric. Sedge works with numeric identifiers only, for efficiency and memory savings, so a mapping between MIDs and Sedge IDs had to be performed. After the results were obtained, the Sedge IDs were mapped back to Freebase MIDs and the value of Name was obtained. Table 2 and Table 3 list the Iv vertices for the two artificial test graphs. Figures 3 and 4 are visualizations of the two artificial test graphs.

TABLE I. IV VERTICES FOR FREEBASE GRAPH
Name (rows 1 to 31): Rihanna, Anastacia, Will Smith, Shakira, Michael Jackson, Alicia Keys, Sue Ann, Tasmin Archer, Vanessa Carlton, Aaliyah, Quincy Jones, Madonna, Freddie Mercury, James Brown, The Cranberries, Avril Lavigne, Kylie Minogue, The Corrs, DMX, Chi McBride, Eminem, Shaggy, Paul McCartney, Goo Goo Dolls, Aerosmith, Dara Rolins, Richard Muller, Sony Music Entertainment, Sony BMG, BMG Ariola, Warner Music Group

IC = |A| + \sum_{i=2}^{n} |P_i| \qquad (2)

PC = \sum_{i=2}^{n} PC_i, \quad PC_2 = |A| \qquad (3)

where P_i is the set of ping messages received in superstep i, starting from 2, PC_i is the per-superstep complexity of a Pv vertex, and n is the number of supersteps.

The final complexity of PCMARS can be approximated from IC, PC and |Pv| as C = IC + |Pv| × PC. For example, if |Iv| = 10 and the average |A| is 1000, the complexity IC1 for the 10 Iv vertices is 10 × 1000 = 10,000. This is only for the first superstep. Up to the 10th superstep, if the average |P| is 20, IC10 = 10,000 + 20 × (10-1) = 10,180. The PC2 complexity for the vertices from the Pv set is |Pv| × |A|. If the average number of ping messages is 20 and |Pv| = 1,000,000, the complexity PC2 for the second superstep is 1,000,000 × 20 = 20,000,000. For 10 supersteps, if the average |I| is 5 (interesting vertices), PC10 = 20,000,000 + 1000 × (10-1) + (1000 × 5 + 5) × (10-1) = 20,000,000 + 9,000 + 45,045 = 20,054,045 messages in total for the Pv vertices up to the 10th superstep, plus an additional 10,180 from the Iv vertex set, resulting in 20,064,225 messages in total.
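The arithmetic of the example above can be replayed with a few lines of Python. This is only a sketch that mirrors the sums exactly as they are written in the text; the function and parameter names are illustrative and are not part of PCMARS.

```python
def iv_message_estimate(iv_count, avg_adjacency, avg_pings, supersteps):
    """Estimate for the Iv vertices, following the IC10 sum in the text."""
    first_superstep = iv_count * avg_adjacency        # 10 x 1000
    later_supersteps = avg_pings * (supersteps - 1)   # 20 x (10 - 1)
    return first_superstep + later_supersteps


def pv_message_estimate(pv_count, avg_pings, avg_adjacency, avg_interesting, supersteps):
    """Estimate for the Pv vertices, following the three terms of the PC10 sum."""
    second_superstep = pv_count * avg_pings                    # 1,000,000 x 20
    adjacency_term = avg_adjacency * (supersteps - 1)          # 1000 x (10 - 1)
    reply_term = (avg_adjacency * avg_interesting + avg_interesting) * (supersteps - 1)
    return second_superstep + adjacency_term + reply_term


ic10 = iv_message_estimate(10, 1000, 20, 10)
pc10 = pv_message_estimate(1_000_000, 20, 1000, 5, 10)
total = ic10 + pc10
print(ic10, pc10, total)  # 10180 20054045 20064225
```

Changing the arguments gives a quick way to see how strongly the estimate is dominated by the |Pv| term of the second superstep.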
Note that message distribution is not taken into account here; it would increase the complexity further. Also, to keep the complexity per superstep at a reasonable scale, there is a limit on the number of messages distributed per superstep. For graphs as large and dense as Freebase, and with 5 worker machines with 30 GB of main memory, we recommend setting it to 17. Splitting messages that exceed the limit over further supersteps reduces memory requirements.

Freebase MID (Table I, continued): m.06mt91, m.01wb9b8, m.0147dk, m.01wj18h, m.09889g, m.0g824, m.01tvdr_, m.03pbg_, m.0x3n, m.01wd9lv, m.01vs_v8, m.01vn0t_, m.0407f, m.01467z, m.0161c2, m.049qx, m.029dy9, m.01vvzb1, m.05jpsx, m.01vsgrn, m.01wgysj, m.03j24kf, m.01jmj8, m.0134pk, m.03f4fmx, m.07xwy1, m.043g7l, m.03mp8k, m.0dmxvkj, m.02bh8z

Name (Table I, rows 32 to 37): Warner Bros Records, Polygram, Universal Music, Mariah Carey, Grammy Awards, Linkin Park

In the second artificial graph (Fig. 4), the following vertices received interesting or interesting_reply messages: 4 (from 1, 10), 9 (from 1, 10, 15) and 13 (from 15).

In the Freebase graph, we found the following relations:
Eminem + Will Smith: both have USA nationality; although both perform rap, the only music genre they share exactly is hip-hop. Both publish their music under Interscope Records.
Anastacia + Chi McBride: both born in Chicago.
Anastacia + Alicia Keys: both lived in New York City.
Alicia Keys + Will Smith: both published under RCA Records.
Aaliyah + Alicia Keys: both nominated for the Grammy Award for Best R&B Album.
Rihanna + Michael Jackson: both received the Grammy Award for Album of the Year.
Alicia Keys + Will Smith: both perform pop and hip-hop music, publish music under the Columbia and RCA Records labels, are record producers, and appeared in the TV program America: A Tribute to Heroes.
Will Smith + DMX: both have USA nationality, and both are actors and film producers.
Madonna + DMX: both publish music under the Warner Bros. Records label.
Freebase MID (Table I, continued): m.03rhqg, m.026s90, m.01dtcb, m.04xrx, m.0c4ys, m.04qmr

TABLE II. IV VERTICES FOR SEDGE FIRST ARTIFICIAL GRAPH
Number    Sedge ID
1         1
2         10
3         15

TABLE III. IV VERTICES FOR SEDGE SECOND ARTIFICIAL GRAPH
Number    Sedge ID
1         4
2         5
3         6
4         9
5         10

[Figure 3: graph drawing with vertices 1 to 13.]
Figure 3. First artificial graph.

[Figure 4: graph drawing with vertices 1, 4, 9, 10, 13 and 15.]
Figure 4. Second artificial graph.

CONCLUSION

In this work we propose PCMARS, a new graph relationship discovery algorithm. It is designed for the Pregel computing model, and its special property is the ability to search along edges in the opposite direction. We evaluated the algorithm on several test-case graphs: two artificial graphs (Fig. 3 and 4) and one real Freebase graph (Table 1). The algorithm was tested and modified to fulfill the purpose of navigation and relationship discovery. Results from the test-case graphs were carefully evaluated, and PCMARS was then run on Freebase graph data in a computer cluster of 5 worker machines. Some relations were expected (e.g. Will Smith and DMX are both actors), but some details were new to us (Anastacia and Chi McBride were both born in Chicago). PCMARS inherits the advantages of the Pregel computing model and is therefore well suited for large-scale relationship discovery on graph-like data structures. Future work on PCMARS will explore its use in other domains, preferably bioinformatics and biological systems, where relations can also be represented as graph structures.

ACKNOWLEDGMENT

This work was supported by the Slovak Research and Development Agency, project CLAN, number APVV-0809-11, and by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic and the Slovak Academy of Sciences, project VEGA, number 2/0185/13.

REFERENCES

For the first artificial graph (Fig.
3) the following vertices received interesting or interesting_reply: 1 (from 4, 6), 2 (from 4, 5), 3 (from 4, 6), 7 (from 4, 5, 6, 10), 8 (from 4, 5, 10), 11 (from 5, 10), 12 (from 9) and 13 (from 9, 10).

[1] Malewicz, G. et al.: Pregel: a system for large-scale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data.
[2] Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113, 2008.
[3] Apache Hadoop, http://en.wikipedia.org/wiki/Apache_Hadoop, retrieved 8 Nov. 2014.