Graph Analytics Research at Oracle Labs - DAMA-UPC

Oracle Labs Graph Analytics Research
Hassan Chafi
Sr. Research Manager
Oracle Labs
Graph-TA
2/21/2014
• The following is intended to provide some insight into a line of
research in Oracle Labs. It is intended for information purposes only,
and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and
should not be relied upon in making purchasing decisions. The
development, release, and timing of any features or functionality
described in connection with any Oracle product or service remains
at the sole discretion of Oracle. Any views expressed in this
presentation are my own and do not necessarily reflect the views of
Oracle.
Copyright © 2013-2014, Oracle and/or its affiliates. All rights reserved.
Confidential – Oracle Restricted
Green-Marl
A DSL for Graph Analysis
• Green-Marl
– A DSL for graph algorithms
– Started as a Stanford project (2011)
• Approach
– Users program graph algorithms in an intuitive way (productivity)
– The compiler creates an efficient implementation (performance)
– For multiple different environments (portability)
Technical Challenges for Graph Processing
Different Challenges for Different Concerns
User
Flexibility
– How to specify graph algorithms?
– How to visualize data and results?
– How to specify pattern-matching queries?
Execution
– How to run algorithms fast?
– How to handle very large graphs?
– How to find patterns efficiently?
Data Management
– Which graph data model to use?
– How to persist the graph data?
– How to construct the graph representation?
Raw Data
Competitive Landscape
Layers: Flexibility, Execution, Data Management
Property Graph Database (+ other PG DBs)
– Recent camp of so-called graph databases
– Adopt the property graph data model
– Major focus on data management
(Distributed) Analytic Engines (execution: Hadoop/Giraph; data management: HDFS)
– Engines built (only) for execution of graph algorithms
– May consider distributed execution for very large graphs
– Programming can be challenging
RDF Database (Pattern Matching) (+ other RDF DBs)
– RDF: a more traditional, standardized graph data model
– One big focus on pattern-matching applications
Callouts
– Oracle already has expertise in this area
– But the product group (OSG) is trying to enter these sectors as well
Our Approach
• We provide powerful graph analytic engines that are integrated with existing or developing Oracle technologies
Flexibility
– Green-Marl DSL + Compiler: DSL that generates programs for both environments
– GMQL: Pattern-Matching QL for PG
Execution
– Distributed Graph Analysis: distributed graph analytic engine for Oracle PG
– In-Memory Graph Analysis: in-memory graph analytic engine for Oracle PG
– In-Memory Pattern Matching: in-memory pattern-matching accelerator
Data Management
– BDA: Big Graph Analysis
– PG: Property Graph Database
– RDF: RDF Database (Pattern Matching)
Major Milestones Achieved So Far
Timeline: CY2011 (3Q), CY2012, CY2013
Green-Marl DSL
– Started as a university project
– Spec & initial compiler
– Compiler for multiple back-ends
– Back-ends: parallel C++ runtime (standalone); an open-source distributed engine (Giraph)
– Showed: we can compile into very different environments
– Language extension, algorithm implementation, compilation to Java, compiler optimization
– Enables: 30+ built-in algorithms for the analytic engine
In-Memory Graph Analytic Engine (PG)
– Showed: our in-memory analysis runs 10~100x faster than a popular PG database (Neo4j)
– In-memory engine design, basic feature implementation
– Handles: multiple clients, snapshot consistency, sharing instances …
– Integration with Oracle PG Database
– Tech-transfer planned (OSG): will be a part of the Oracle Property Graph Option
Distributed Graph Analytic (BDA)
– Showed: Giraph has critical, innate performance and compatibility issues
– Design exploration, basic feature implementation (on-going)
– Showed: we can exploit network BW very efficiently
– Tech-transfer discussion (BDA)
In-Memory Pattern Matching (RDF)
– Can we apply the same parallel, in-memory approach to pattern matching?
– Algorithm exploration, initial implementation
– Showed: 200x faster than SQL-based solutions
– Tech-transfer discussion (OSG/RDF)
*First target is to use the in-memory engine
Algorithm Implementation
Detecting Components and Communities
– Tarjan’s, Kosaraju’s
– Weakly Connected Components
– Label Propagation (w/ variants)
– Soman and Narang’s
Ranking and Walking
– PageRank, Personalized PageRank
– Betweenness Centrality (w/ variants)
– Closeness Centrality, Degree Centrality
– Eigenvector Centrality, HITS
– Random walking and sampling (w/ variants)
Evaluating Community Structures
– Conductance, Modularity
– Clustering Coefficient (Triangle Counting)
Path-Finding
– Hop-Distance (BFS)
– Dijkstra’s, Bi-directional Dijkstra’s
– Bellman-Ford’s
Link Prediction
– Adamic-Adar
– SALSA (Twitter’s Who-to-follow)
Other Classics
– Vertex Cover
– Minimum Spanning-Tree (Prim’s)
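To make one of the listed measures concrete, conductance can be sketched in a few lines of Python. This is a minimal sketch based on the standard textbook definition (edges leaving the community divided by the smaller of the two degree volumes), not the Green-Marl or Oracle implementation. The function name `conductance`, the adjacency-dict representation, and the `two_triangles` example graph are assumptions made for illustration.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours (assumed format).
    community: iterable of vertices forming the candidate community S.
    """
    inside = set(community)
    cut = 0      # edges with exactly one endpoint in S
    vol_in = 0   # sum of degrees of vertices in S
    vol_out = 0  # sum of degrees of vertices outside S
    for v, nbrs in adj.items():
        if v in inside:
            vol_in += len(nbrs)
            cut += sum(1 for u in nbrs if u not in inside)
        else:
            vol_out += len(nbrs)
    denom = min(vol_in, vol_out)
    # Convention chosen here: return 0.0 when one side has no edges at all.
    return cut / denom if denom else 0.0


# Hypothetical example: two triangles joined by the single edge 2-3.
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(two_triangles, {0, 1, 2}))
```

A low value indicates a community with few edges leaving it relative to its total degree, which is why the measure is used to evaluate community structure.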
Algorithm: Triangle Counting
Experimental Results
[Figure: bar chart of Execution Time (secs), log scale from 0.1 to 100000, over the Graph Instances Patents, LiveJournal, Wikipedia, Twitter-2010, UK-2006]
Series
– Ours (x86): our implementation running on 1 machine
– Ours (SPARC, T4-4): our implementation running on 1 machine
– GraphLab (x86 x 31): GraphLab’s implementation running on 31 machines
– Hadoop (x86 x 1000+): Hadoop implementation running on 1000+ machines
Observations
– Hadoop takes a lot of execution time
– Our single-machine implementation outperforms the distributed systems
– SPARC provides additional performance benefits
Hadoop numbers are excerpted from the WWW’11 paper
*preprocessing time included
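For readers unfamiliar with the algorithm being benchmarked, the following is a minimal sequential sketch of triangle counting in Python. It is not the parallel implementation measured on this slide. It uses a common degree-ordering technique (orient each edge toward the higher-ranked endpoint, then intersect out-neighbour sets) so that every triangle is counted exactly once. The function name `count_triangles`, the edge-list input format, and the `k4_edges` example are assumptions for illustration, and vertex ids are assumed to be mutually comparable (for example, integers).

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Rank vertices by (degree, id); ties are broken by the vertex id.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}

    # Keep only edges pointing from lower rank to higher rank.
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}

    # For each oriented edge (u, v), common out-neighbours close a triangle.
    return sum(len(out[u] & out[v]) for u in adj for v in out[u])


# Hypothetical example: the complete graph on 4 vertices.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4_edges))
```

Orienting edges by degree keeps the out-neighbour sets of high-degree vertices small, which matters on skewed graphs such as the social-network instances listed above.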
Subgraph Isomorphism Problem
[Figure: Data Graph G and Query Graph Q, with vertices labeled A, B, C, X, Y, Z]
Subgraphs of G that are isomorphic to Q
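The problem stated on this slide can be illustrated with a small backtracking search in Python. This is a minimal sketch, not the pattern-matching engine discussed in this talk. It assumes undirected graphs with one label per vertex and returns every injective, label-preserving mapping of query vertices to data vertices under which each query edge is also a data edge (the non-induced variant common in graph querying). The function name `find_matches` and the tiny example graphs are hypothetical and do not reproduce the graphs drawn on the slide.

```python
def find_matches(q_labels, q_edges, g_labels, g_edges):
    """Return all embeddings of query graph Q into data graph G.

    q_labels / g_labels: dict mapping vertex -> label.
    q_edges / g_edges: iterables of undirected (u, v) pairs.
    """
    def adjacency(labels, edges):
        adj = {v: set() for v in labels}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        return adj

    g_adj = adjacency(g_labels, g_edges)
    q_adj = adjacency(q_labels, q_edges)
    order = list(q_labels)  # fixed order in which query vertices are matched
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        qv = order[len(mapping)]
        for gv in g_labels:
            # Labels must agree and a data vertex may be used only once.
            if g_labels[gv] != q_labels[qv] or gv in mapping.values():
                continue
            # Every already-matched query neighbour must map to a neighbour of gv.
            if all(mapping[qn] in g_adj[gv] for qn in q_adj[qv] if qn in mapping):
                mapping[qv] = gv
                extend(mapping)
                del mapping[qv]

    extend({})
    return matches


# Hypothetical example: query is a single A-B edge.
query_labels = {"a": "A", "b": "B"}
query_edges = [("a", "b")]
data_labels = {1: "A", 2: "B", 3: "B", 4: "A"}
data_edges = [(1, 2), (1, 3), (3, 4)]
for match in find_matches(query_labels, query_edges, data_labels, data_edges):
    print(match)
```

The search is exponential in the worst case; practical engines rely on candidate filtering and match-order heuristics, which this sketch deliberately omits.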
Experimental Results: Comparison against DB
LUBM 8K and 25K on x86 and SPARC
[Figure: three bar charts of Time (s), log scale, comparing GMX and SQL on queries q2, q6, q9, q14]
– Panels: LUBM 8K on x86, LUBM 8K on SPARC, LUBM 25K on x86
– Speedup labels shown in the charts: 221x, 139x, 206x, 174x, 219x, 266x, 309x, 162x, 92x, 80x, 103x, 100x