Vertexica: your relational friend for graph analytics! The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, and Mike Stonebraker. 2014. Vertexica: your relational friend for graph analytics!. Proc. VLDB Endow. 7, 13 (August 2014), 1669-1672. As Published http://dx.doi.org/10.14778/2733004.2733057 Publisher Association for Computing Machinery (ACM) Version Author's final manuscript Accessed Mon May 23 10:53:09 EDT 2016 Citable Link http://hdl.handle.net/1721.1/100912 Terms of Use Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported licence Detailed Terms http://creativecommons.org/licenses/by-nc-nd/3.0/ Vertexica: Your Relational Friend for Graph Analytics! Alekh Jindal∗ Samuel Madden∗ Praynaa Rawlani∗ Amol Deshpande⋆ ∗ CSAIL, MIT ABSTRACT In this paper, we present Vertexica, a graph analytics tools on top of a relational database, which is user friendly and yet highly efficient. Instead of constraining programmers to SQL, Vertexica offers a popular vertex-centric query interface, which is more natural for analysts to express many graph queries. The programmers simply provide their vertex-compute functions and Vertexica takes care of efficiently executing them in the standard SQL engine. The advantage of using Vertexica is its ability to leverage the relational features and enable much more sophisticated graph analysis. These include expressing graph algorithms which are difficult in vertexcentric but straightforward in SQL and the ability to compose endto-end data processing pipelines, including pre- and post- processing of graphs as well as combining multiple algorithms for deeper insights. Vertexica has a graphical user interface and we outline several demonstration scenarios including, interactive graph analysis, complex graph analysis, and continuous and time series analysis. 1. INTRODUCTION Graph analytics is getting increasingly popular with several new application domains such as social networks, transportation networks, ad networks, e-commerce, and web search. These graph analytics workloads are seen as quite different from traditional database analytics, largely due to the iterative nature of many of these computations, and the perceived awkwardness of expressing graph analytics as SQL queries (which typically involves multiple self-joins on tables of nodes and edges). As a result, many new storage and querying systems optimized for graph algorithms have been proposed. In particular, a number of so-called “vertex-centric” systems (e.g., Pregel [5], Giraph [3], GraphLab [4], GPS [6], and Trinity [7]) have been proposed in recent years. Relational databases, on the other hand, are completely missing from this wave of new graph processing systems. Note that in many real-world scenarios, data is collected and it resides in a relational database in the first place, e.g. [1, 2]. Using a different system for graph analytics would mean that the users now need to dump their data from the relational database to a graph database. This involves stitching different systems together and is highly undesir- This work is licensed under the Creative Commons AttributionNonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. Proceedings of the VLDB Endowment, Vol. 7, No. 13 Copyright 2014 VLDB Endowment 2150-8097/14/08. Eugene Wu∗ Mike Stonebraker∗ ⋆ University of Maryland able. Furthermore, users might find several features from relational databases, such as transactions, checkpointing and recovery, fault tolerance, durability, integrity constraints, etc., hard to forego. Interestingly, relational databases have been highly optimized for relational data analytics in recent years, in particular column-oriented data storage and query processing has emerged as an attractive technology. A natural question, therefore, is whether these techniques can be leveraged for efficient graph analytics as well? In this paper, we present Vertexica, a relational database system which is friendly yet efficient over graph analytics. There have been some early efforts to implement graph queries in relational databases [8]. However, these works constrain the graph analysts to SQL, the de facto query interface in relational databases. Unfortunately, implementing graph queries in SQL can be tricky and time consuming. This is because with SQL, the programmer must manipulate relational tables rather than graphs. In contrast, Vertexica provides a vertex-centric query interface, wherein the programmers writes his graph queries as vertex-compute functions and system takes care of running it on the underlying relational engine. As a result, Vertexica allows programmers to easily express several graph analytic queries such as PageRank, shortest path, connected components, stochastic gradient descent, random walk with restart, and other message passing algorithms. Apart from efficiently executing vertex-centric programs, Vertexica also allows users to leverage and combine the traditional strengths of the relational database. For example, Vertexica allows ad-hoc mutations to the graph as well as the associated metadata, which is simply impossible to do in many new graph processing systems such as Giraph. Likewise, Vertexica allows users to easily combine graph algorithms with relational operators, thereby facilitating more advanced graph queries e.g. localized PageRank. Overall, our goal in this paper is to demonstrate how Vertexica enables efficient, easy-to-use, and rich graph analytics right on top of a relational engine. In the following, we first show the details of Vertexica. Thereafter, we discuss the use-cases, for which Vertexica is a useful tool (Section 3). Finally, we sketch our demonstration proposal describing the setup and demonstration scenarios, and highlight how the users can interact and play with Vertexica (Section 4). 2. Vertexica The core idea of Vertexica is to bring user-friendly and highperformance graph analytics capability to a relational database. Vertexica does this by injecting data storage, query processing, and query interface support to enable efficient vertex-centric graph analytics. Currently, Vertexica sits on top of an industry strength column-oriented database system. However, it could be extended to other relational databases as well. Below, we discuss the vertexcentric interface, the implementation details, and the optimization techniques employed in Vertexica. 1669 Vertex Program Stored Procedure Physical Storage Vertex Program Vertex Program Vertex API Worker1 UDF … Vertex Program Worker2 … Workern Superstep Coordinator Data Access Vertex Edge Relational Database Graph Query Message Figure 1: Vertexica Architecture 2.1 Vertex-centric Interface Relational databases have SQL query interface and although one can imagine implementing graph queries in SQL, crafting and tuning the SQL can be tedious and time consuming. This is because manipulating relational tables is not really intuitive for expressing graph queries. In contrast, the concept of vertex-centric programming allows programmers to think in terms of graphs. This interface allows programmers to specify their computations as UDFs, which are executed as supersteps on the graph vertices. The UDFs communicate by sharing messages with neighboring nodes. In most implementations, this sharing is done after a synchronization barrier at the end of every superstep. As a result of the vertex-centric interface, programmers do not have to worry about details such as partitioning the graph, distributing the computing across multiple machines, and coordinating the message passing. This is analogous to MapReduce where the programmers simply provide the map and reduce functions and the framework takes care of the system details. Vertexica provides a vertex-centric interface on top of SQL. The interface exposes the same API as in Pregel and manages parallelizing the vertex compute function, the passing of messages, and synchronizing the supersteps. As in Pregel, programmers simply provide their vertex compute function, and Vertexica takes care of running it as standard SQL (with UDFs) in an unmodified relational database. Thus, with the vertex-centric interface, Vertexica allows programmers to actually think in terms of a graph while still running their queries in a relational engine. 2.2 Implementation Details Let us now look at the implementation details of Vertexica. Figure 1 shows the architecture of Vertexica. It has four major components: (1) the physical graph storage, (2) the coordinator to drive the vertex-computations, (3) the workers which run the vertex-computations, and (4) the actual compute function provided by the user. We describe each of these below. Physical Storage. Vertexica stores all data in three relational tables: (1) the vertex table to store the vertex id, the vertex value, and the vertex state, (2) the edge table to store the edge source and the edge destination, and (3) the message table to store the sender vertex, the receiver vertex, and the message value. Coordinator. The coordinator is the driver program that manages the supersteps. It is responsible for running the vertex computations in parallel and passing messages from one superstep to the next superstep. We implement the coordinator as a stored procedure; it runs as long as there is any message for the next superstep. As shown in Figure 1, in each superstep, the coordinator invokes a set of worker UDFs to run the vertex program. Worker. The worker is the container for the vertex-compute function. It is responsible for sending and receiving messages with the coordinator. Additionally, the worker exposes similar API methods as in Pregel, e.g. getVertexValue(), getMessages(), getOutEdges(), modifyVertexValue(), sendMessage(), and voteToHalt(). The workers run as database UDFs and typically there are as many parallel workers as the number of cores on the machine or nodes in the cluster. Conceptually, the worker is analogous to mapper, which is the container for map function in MapReduce. As shown in Figure 1, the workers invoke the actual vertex program provided by the user as well as propagate the output of the vertex program, such as new vertex values and the messages, to the coordinator. Vertex Computation. Finally, the vertex computation is the user provided graph query that runs once per superstep for every vertex that has at least one incoming message. Typically, the vertexcompute function gets a set of incoming messages, performs some computation, modifies the vertex state, and outputs a set of outgoing messages. This is a simple yet powerful model for graph analytics. 2.3 Optimizations Vertexica applies several optimizations to make the vertexcentric computations efficient. Let us look at some of these below. Table Unions. In order to run the vertex computations, Vertexica needs to read all vertices having incoming messages, the corresponding outgoing edges, and the messages themselves. Traditional database wisdom will tell us to simply join the vertex, edge, and message tables. However, for large graphs and/or for large number of messages (every vertex could send a message to every other vertex in the worst case), this three-way join could be very expensive and kill the performance. Since we only need to extend the message table with the vertex values and edges of every receiving vertex, we can union the three tables instead of joining them. These three inputs are renamed to a common schema and the union is then fed to the workers, which are responsible for parsing and identifying the message, vertex, and edge tuples from each other. Parallel Workers. Vertexica exploits multiple cores or multiple machines by running multiple instances of the worker in parallel and speed-up the computation. In practice, we have as many workers as the number of cores. After each superstep, Vertexica synchronizes the workers during the synchronization barrier. Vertex Batching. The extreme case could be to run each active vertex in a different worker. However, this leads to many UDFs calls, which are relatively expensive in most database systems. Therefore, Vertexica batches several vertices together and runs them serially on a worker, i.e. it tries to balance serial and parallel execution of vertex-programs. To do this, Vertexica hash partitions the table union on the vertex id into a fixed number of partitions. Each partition is sorted on the vertex id. The worker is responsible for identifying different vertices in the partition and executing the vertex program on each of them serially. Update Vs Replace. Vertexica involves two kinds of updates: (1) updating vertices, e.g. vertex value, and (2) updating messages, e.g. deleting old messages and inserting new ones from the current superstep. However, vertex-centric computations could easily incur large number of updates and can slow down the performance significantly. To handle this, instead of updating the vertices and messages in the existing tables, Vertexica creates new vertex and message tables. This is done by left joining the old vertex/message table with the new vertex/message values from the current superstep and replacing the old table with the new one. Such modifications via replace are much faster. Still, if the number of updated tuples is below a fixed threshold, then Vertexica updates the existing tables. Let us now compare the performance of Vertexica with Apache Giraph [3], a popular vertex-centric graph processing system, and a transactional graph database system. Figure 2 shows the performance of graph database, Giraph, and Vertexica for two popular 1670 Dataset Neo4j (vertex) Vertica (SQL) Dataset Neo4jGiraph Giraph Vertica Vertica (vertex) Vertica (SQL) Dataset Neo4j (vertex) Vertica (SQL) Dataset Neo4jGiraph Giraph Vertica Vertica (vertex) Vertica (SQL) Twitter (76K,2.28M) 46.95666666666667 10.92925766666667 3.31333333333333 Twitter (76K,2.28M) 589.033 589.033 46.95666666666667 10.92925766666667 3.31333333333333Twitter (76K,2.28M) 43.66666666666667 10.646985 Twitter (76K,2.28M) 395.609 395.609 43.66666666666667 10.646985 2.95635033333333 2.95635033333333 GPlus (102K,30.23M) 53.52333333333333 47.66651333333333 4.21433333333333 GPlus (102K,30.23M) 53.52333333333333 47.66651333333333 4.21433333333333GPlus (102K,30.23M) 50.78 23.837291 GPlus (102K,30.23M) 50.78 23.837291 3.91646133333333 3.91646133333333 LiveJournal (4.8M,68.9M) 190.41666666666667 321.65009966666667 29.41666666666667 LiveJournal (4.8M,68.9M) 190.41666666666667 321.65009966666667 29.41666666666667 LiveJournal (4.8M,68.9M) 115.46333333333333 146.296315 LiveJournal (4.8M,68.9M) 115.46333333333333 146.296315 54.424825 54.424825 1000 1000 1000 1000 Graph Database Graph Database Apache Giraph Apache Giraph Vertexica Vertexica Vertexica (SQL) Vertexica (SQL) 100100 Time (seconds) Time (seconds) Time (seconds) Time (seconds) 100100 Graph Database Graph Database Apache Giraph Apache Giraph Vertexica Vertexica Vertexica (SQL) Vertexica (SQL) 10 10 10 10 Strong Overlap: find pairs of nodes having strong overlap between them. Overlap could be defined as number of common neighbors. Weak Ties: find nodes which act as bridges between otherwise disconnected pair of nodes. Such analysis is very difficult or even not possible on traditional graph processing systems. 3.3 1 1 1 Twitter Twitter 1 GPlus GPlus LiveJournal LiveJournal Twitter Twitter GPlus GPlus LiveJournal LiveJournal (a) PageRank (b) Shortest Paths Figure 2: Vertex-centric Graph Algorithm Performance. graph algorithms, PageRank and Shortest Paths, over three different graphs, Twitter (81K nodes, 1.7M edges), GPlus (107K nodes, 13.6M edges), and LiveJournal (4.8M nodes, 68M edges). We can see from Figure 2 that the graph database runs only for the smallest graph and both Giraph and Vertexica outperform it significantly. Furthermore, Vertexica outperforms Giraph by more than 4 times on the small graph and has very similar performance as Giraph on larger graphs. Figure 2 also shows the performance of Vertexica(SQL), the hand-coded and meticulously optimized SQL implementations of graph algorithms in Vertexica. We can see that the SQL implementation in Vertexica significantly outperforms all other approaches. Thus, Vertexica combines the best of both worlds, an easy-to-use vertex-centric interface having comparable performance as Giraph as well a high performance SQL interface which significantly outperforms all other approaches. Vertexica facilitates analysts to perform a large variety of graph analytics. We outline several such use cases below. Dynamic Graph Analyses Graphs are not static in nature. Over time, we need to add, remove, or modify the nodes and edges in the graph. Similarly, we may need to update the metadata associated with the nodes and edges. Vertexica is naturally suited to handle updates and therefore allows for dynamic graph analysis. However, graph processing systems, such as Giraph, have no clear method of updating the graphs it analyzes. One might think of using HBase for updates and Giraph for analytics, but then we have the additional overhead of stitching two systems together and coordinating between them. Using Vertexica, an analyst can easily do graph mutations, metadata updates, as well as perform temporal analysis. For example, the analyst may be interested in knowing the nodes whose PageRanks have changed over last one year, or to see all node-pairs whose shortest paths have decreased by at least a threshold. Such dynamic graph analysis allows Vertexica users to treat graph analytics as a continuos process rather than an offline one-time activity. 3.4 Richer Graph Analytics Vertexica supports several vertex-centric graph analysis algorithms, including: (i) PageRank – a ranking algorithm to compute the relative importance of every vertex, (ii) Single Source Shortest Path – a reachability query to compute the shortest path from a given vertex to every other vertex, (iii) Connected Components – also a reachability query to find subgraphs in which any two vertices are connected to each other, and (iv) Collaborative Filtering – a recommendation technique to predict the edge weights in a bipartite graph. In general, Vertexica supports any graph algorithm which can be expressed as message-passing iterative vertex-computations. Real-world graphs have vertices and edges accompanied by rich metadata, e.g. vertices may describe each person in the social network and edges may be of types family, friends, or classmates. Given such metadata, an analyst would typically do some ad-hoc relational analysis in addition to the graph analysis. For instance, the analyst may want to select a subset of the graph before running the actual graph algorithm. Similarly, he may want to compute aggregates or build histograms over the output of the graph analysis. Furthermore, in many cases, the graphs may be implicit in the relational data and need to be extracted in the first place. Above pre- and post- processing of graph data would typically require relational operators such as selection, projection, aggregation, join, something which Vertexica inherits by default. As a result, Vertexica allows analysts to easily combine graph analysis with relational analysis and produce more complete data processing pipelines. Thus, graph analytics on Vertexica is not just running a particular graph algorithm on the bare graph skeleton, rather it includes the end-to-end data processing, starting from raw data and right up to deriving meaningful insights. 3.2 4. 3. USE CASES Let us now look at the use-cases supported by Vertexica. 3.1 Vertex-centric Graph Queries Hybrid Graph Queries While the vertex-centric computations works well for graph queries accessing local neighborhood, they do not work very well, if at all, for queries which involve 1-hop neighborhood. This is because the vertex-centric approaches needs to first collect the 1-hop neighborhood, which is expensive both in terms of time and memory, before doing the actual analysis. Vertexica allows analysts to easily combine vertex-centric analysis with 1-hop analysis. For example, the analyst may want to find all nodes which act as ties between otherwise disconnected nodes and have PageRank greater than a threshold, i.e. find sufficiently important nodes which act as bridges. Similarly, the analyst may want to compute the single source shortest path with the source node being the node with the maximum local clustering coefficient, i.e. find the distance from the most clustered node to every other node. Vertexica supports several 1-hop algorithms for such analysis: Triangle Counting: count the number (total or per-node) of triangles. This could be used for computing clustering coefficients. DEMONSTRATION In this section, we describe our demonstration proposal and show how the audience can play with Vertexica. Hardware. The demonstration runs on 4 machines, each with 2GHz Xeon with 2-socket, 12-core, 24-threads, with Hyper-threading, 48GB of memory, 1.4T disk, and running on RHEL Santiago 6.4. Datasets. We use several social network graphs of different sizes and both directed as well as undirected from http://snap. stanford.edu/data/. These include graphs from Twitter, GPlus, LiveJournal, Amazon, and Youtube, with sizes varying from a few million edges to a billion edges. Metadata. We also extend the graph datasets with the following metadata. For each node, we added 24 uniformly distributed integer attributes with cardinality varying from 2 to 109 , 8 skewed (zipfian distribution) integer attributes with varying skewness, 18 floating point attributes with varying value ranges, and 10 string attributes with varying size and cardinality. For each edge, we added three additional attributes: the weight, the creation timestamp, and an edge 1671 Page1 and the time monitor show running results. 4.2 Demonstration Scenarios We invite the audience to play with several demonstrations scenarios of Vertexica. We describe some of these below. 4.2.1 Figure 3: Vertexica Graphical User Interface. type (friend, family, or classmate), chosen uniformly at random. 4.1 Audience Interaction In our demonstration, we let the audience interact with Vertexica via a graphical user interface. Figure 3 shows a snapshot of the interface. It has six main components: Graph Visualization. The interface allows users to load any raw graph data and visualize the nodes and edges in the graph. Users can interact with the visualization by clicking on the nodes and edges and performing actions such as viewing and/or modifying the properties, e.g. node value, edge weight, etc., of a node or edge. Users can also customize the visualization to show one or more properties for all nodes and edges by default. Scope of Analysis. Apart from visualizing, users can also select portions of the graph for analysis, i.e. they can interactively define the scope of their graph analysis. This includes visual selection by clicking on one or more nodes or by drawing a minimum bounding rectangle. Alternatively, users can also apply filters based on node/edge metadata, e.g. select all edges of type “Family”. By defining the scope of analysis, users can easily specify their regions of interest. Toolbar. We allow the users to play with several graph algorithms and operators. The toolbar provides four vertex-centric algorithms (PageRank, shortest paths, connected components, collaborative filtering), five SQL graph algorithms (PageRank, shortest paths, triangle counting, strong overlap, weak ties), and four relational operators (selection, projection, aggregation, and join). These algorithms and operators allow for a very rich graph analysis. Graph processing pipelines. Users can not only run a standalone algorithm, but they can also compose a set of algorithms and operators into a graph processing pipelines. The Dataflow panel in Figure 3 is used for designing the graph pipelines. Users can drag and drop the algorithms/operators, chain and combine them, and even inspect and modify the code of each of them. Running mode. After selecting the graph and picking the algorithm (or pipeline), users can run their analysis as: (i) a single run, (iii) time series run to see how the results change over time, and (ii) continuous run to monitor how the analysis changes in response to changes such as graph mutations. Also, the users can compare the query runtimes with Giraph, in case of vertex-centric algorithm. Output Display. Finally, the interface produces two outputs from the user interactions. First, we show the actual output of the graph analysis, e.g. outputting the shortest path, triangle counts, etc., on the console. Second, we plot the time taken to run the analysis in the time monitor. In case of continuous analysis, both the console Interactive Graph Analysis Since Vertexica has very good runtime performance, it allows the audience to perform interactive graph analysis on small to medium-sized graphs, i.e. around 10s of millions of edges. The audience can interactively select the regions of interest in the graph by click on nodes or by drawing a rectangle. They can also set the scope of their analysis by considering only those nodes and edges that have specific properties (e.g. edge types). Once a region or a subgraph is selected, users can interactively run the graph algorithms as well. For example, users can click on a node and ask for its PageRank, or the number of triangles that the node participates in. Internally, this triggers a PageRank or a triangle counting query over the region of interest and produces the result in the console. Similarly, users can click on two nodes and ask for the shortest path between them or check if the node-pairs have strong overlap between them. 4.2.2 Complex Graph Analysis In addition to running the graph algorithms, the audience can pre- or post- process the graphs, such as applying selection predicates to select ad-hoc subgraphs, assigning weights and/or labels to nodes and edges, and computing statistics on the output of graph algorithm. For example, the users might be interested in looking at the distribution of PageRank values. Additionally, the audience can combine multiple graph analyses. For example, users might combine PageRank and shortest path to emit nodes which are either very near (path distance less than a threshold) or are relatively very important (PageRank greater than a threshold). Similarly, the users can ask for global clustering coefficient (combining triangle counting with weak ties). By combining and composing different graph algorithms and operators, Vertexica allows users to play with much more sophisticated graph analysis. 4.2.3 Continuous & Time Series Analysis We invite the audience to describe a graph analysis task and run it in a continuous mode. Thereafter, the users can change the scope of the graph analysis, e.g. change the edge filter from “Family” to “Classmates”, and observe how runtimes and the console output changes. Furthermore, users can also click and modify nodes and edges and observe the impact of change on the graph analysis. Similarly, we invite the audience to compare how the graph analysis has evolved over time. For example, users might ask how the PageRank of a given node has changed in the last 5 years. Or, which nodes have come closer (smaller path distance) in the last one year. Such analysis will internally trigger graph algorithms on different versions of nodes and edges. Vertexica makes these possible. 5. REFERENCES [1] N. Bronson and al. TAO: Facebooks Distributed Data Store for the Social Graph. USENIX ATC, 2013. [2] FlockDB, http://github.com/twitter/flockdb. [3] Apache Giraph, http://giraph.apache.org. [4] Y. Low and al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 2012. [5] G. Malewicz and al. Pregel: A System for Large-Scale Graph Processing. SIGMOD, 2010. [6] S. Salihoglu et al. GPS: A Graph Processing System. SSDBM, 2013. [7] B. Shao and al. Trinity: A Distributed Graph Engine on a Memory Cloud. SIGMOD, 2013. [8] A. Welc and al. Graph Analysis Do We Have to Reinvent the Wheel? GRADES, 2013. 1672