Oracle Labs Graph Analytics Research
Hassan Chafi, Sr. Research Manager, Oracle Labs
Graph-TA, 2/21/2014

The following is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.

Copyright © 2013-2014, Oracle and/or its affiliates. All rights reserved. Confidential – Oracle Restricted

Green-Marl
A DSL for Graph Analysis

Green-Marl
– A DSL for graph algorithms
– Started as a Stanford project (2011)

Approach
– Users program graph algorithms in an intuitive way (productivity)
– The compiler creates an efficient implementation (performance)
– For multiple different environments (portability)

Technical Challenges for Graph Processing
Different Challenges for Different Concerns

Flexibility
– How to specify graph algorithms?
– How to visualize data and results?
– How to specify pattern-matching queries?

Execution
– How to run algorithms fast?
– How to handle very large graphs?
– How to find patterns efficiently?

Data Management
– Which graph data model to use?
– How to persist the graph data?
– How to construct the graph representation?

[Diagram labels: User, Raw Data]

Competitive Landscape
– Oracle already has expertise in this area
– But the product group (OSG) is trying to enter these sectors as well

[Diagram labels: Flexibility, Execution, Data Management; Hadoop/Giraph; + other PG DBs]
[Diagram labels: HDFS; + other RDF DBs]

Property Graph Database
– Recent camp of so-called graph databases
– Adopt the property graph data model
– Major focus on data management

(Distributed) Analytic Engines
– Engines (only) built for execution of graph algorithms
– May consider distributed execution for very large graphs
– Programming can be challenging

RDF Database (Pattern Matching)
– RDF: more traditional, standardized graph data model
– One big focus on pattern-matching applications

Our Approach
We provide powerful graph analytic engines that are integrated with existing or developing Oracle technologies.

Flexibility
– Green-Marl DSL + Compiler: DSL that generates programs for both environments
– GMQL: pattern-matching QL for PG

Execution
– Distributed Graph Analysis: distributed graph analytic engine for Oracle PG
– In-Memory Graph Analysis: in-memory graph analytic engine for Oracle PG
– In-Memory Pattern Matching: in-memory pattern-matching accelerator

Data Management
– BDA: Big Graph Analysis
– PG: Property Graph Database
– RDF: RDF Database (Pattern Matching)

Major Milestones Achieved So Far
Timeline: CY2011 (3Q), CY2012, CY2013

Tracks
– Green-Marl DSL (started as a university project)
– In-Memory Graph Analytic Engine (PG)
– Distributed Graph Analytic (BDA)
– In-Memory Pattern Matching (RDF)

Milestones and findings
– Spec & initial compiler
– Parallel C++ runtime (standalone)
– An open-source distributed engine (Giraph)
– Compiler for multiple back-ends
– Showed: we can compile into very different environments
– Showed: our in-memory analysis runs 10~100x faster than a popular PG database (Neo4J)
– Handles: multiple clients, snapshot consistency, sharing instances …
– Showed: Giraph has critical, innate performance and compatibility issues
– Showed: we can exploit network BW very efficiently
– Can we apply the same parallel, in-memory approach to pattern matching?
– Enables: 30+ built-in algorithms for the analytic engine
– Language extension, algorithm implementation, compilation to Java, compiler optimization
– Integration with Oracle PG Database
– Tech-transfer planned (OSG)
– In-memory engine design, basic feature implementation
– Tech-transfer discussion (BDA)
– Design exploration, basic feature implementation (on-going)
– Algorithm exploration, initial implementation
– Tech-transfer discussion (OSG/RDF)
– Will be a part of Oracle Property Graph Option
– *First target is to use the in-memory engine
– Showed: x200 faster than SQL-based solutions

Algorithm Implementation

Detecting Components and Communities
– Tarjan's, Kosaraju's, Weakly Connected Components, Label Propagation (w/ variants), Soman and Narang's

Ranking and Walking
– Pagerank, Personalized Pagerank, Betweenness Centrality (w/ variants), Closeness Centrality, Degree Centrality, Eigenvector Centrality, HITS, Random walking and sampling (w/ variants)

Path-Finding
– Hop-Distance (BFS), Dijkstra's, Bi-directional Dijkstra's, Bellman-Ford's

Evaluating Community Structures
– Conductance, Modularity, Clustering Coefficient (Triangle Counting)

Link Prediction
– Adamic-Adar, SALSA (Twitter's Who-to-follow)
Other Classics
– Vertex Cover, Minimum Spanning-Tree (Prim's)

Algorithm: Triangle Counting
Experimental Results

Systems compared
– Ours (x86): our implementation running on 1 machine
– Ours (SPARC, T4-4): our implementation running on 1 machine
– GraphLab (x86 x 31): GraphLab's implementation running on 31 machines
– Hadoop (x86 x 1000+): Hadoop implementation running on 1000+ machines

Observations
– Hadoop takes a lot of execution time
– Our single-machine implementation outperforms other distributed systems
– SPARC provides additional performance benefits

[Figure: execution time (secs, log scale from 0.1 to 100000) for each graph instance: Patents, LiveJournal, Wikipedia, Twitter-2010, UK-2006]
Hadoop numbers are excerpted from WWW'11 paper
*preprocessing time included

Subgraph Isomorphism Problem
[Figure: a data graph G and a query graph Q with vertices labeled A, B, C, X, Y, Z. The task is to find the subgraphs of G that are isomorphic to Q.]

Experimental Results: Comparison against DB
LUBM 8K and 25K on x86 and SPARC

[Figure: three panels (LUBM 8K on x86, LUBM 25K on x86, LUBM 8K on SPARC) comparing GMX and SQL. Axes: Time (s), log scale, for queries q2, q6, q9, q14. Speedup annotations recovered from the charts, in extraction order (assignment to panel and query is not recoverable from the extracted text): 221x, 139x, 206x, 174x; 219x, 266x, 309x, 100x; 162x, 92x, 80x, 103x.]
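The Subgraph Isomorphism Problem slide above can be illustrated with a minimal backtracking sketch. This is not the in-memory pattern-matching engine from the slides, which do not describe its algorithm. It is a textbook-style illustration under stated assumptions: graphs are undirected, every vertex carries one label, and a match is an injective, label-preserving mapping under which every query edge exists in the data graph (extra data edges are allowed). All names (find_matches, data_adj, query_adj and so on) are hypothetical.

```python
from typing import Dict, Hashable, Iterator, Set

Adjacency = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets
Labels = Dict[Hashable, str]               # vertex -> label


def find_matches(data_adj: Adjacency, data_label: Labels,
                 query_adj: Adjacency, query_label: Labels
                 ) -> Iterator[Dict[Hashable, Hashable]]:
    """Yield each injective mapping from query vertices to data vertices
    that preserves vertex labels and maps every query edge onto a data edge."""
    order = list(query_adj)  # fixed matching order (assumption: no reordering heuristic)

    def extend(mapping: Dict[Hashable, Hashable]):
        if len(mapping) == len(order):
            yield dict(mapping)  # copy, because mapping is mutated while backtracking
            return
        q = order[len(mapping)]
        used = set(mapping.values())
        for d in data_adj:
            if d in used or data_label[d] != query_label[q]:
                continue
            # Every already-matched neighbour of q must map to a neighbour of d.
            if all(mapping[n] in data_adj[d] for n in query_adj[q] if n in mapping):
                mapping[q] = d
                yield from extend(mapping)
                del mapping[q]

    yield from extend({})


# Hypothetical example data: a 4-vertex data graph and a labeled triangle query.
data_adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
data_label = {1: "A", 2: "B", 3: "C", 4: "B"}
query_adj = {"qa": {"qb", "qc"}, "qb": {"qa", "qc"}, "qc": {"qa", "qb"}}
query_label = {"qa": "A", "qb": "B", "qc": "C"}

matches = list(find_matches(data_adj, data_label, query_adj, query_label))
```

In this example only vertices 1, 2 and 3 form an A-B-C triangle, so matches holds a single mapping. Vertex 4 also has label B but is not adjacent to the C vertex, so the search backtracks from it. The sketch explores candidates sequentially and makes no attempt at the parallelism the slides target.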