Benchmarking traversal operations over graph databases
Marek Ciglan1, Alex Averbuch2 and Ladislav Hluchý1
1 Institute of Informatics, Slovak Academy of Sciences, Bratislava
2 Swedish Institute of Computer Science, Stockholm, Sweden
Overview
• Graph data management
• Graph databases
– Characteristics
– Unique features
– Challenges
• GDB Benchmarking
– Motivation
– Related work
• Graph traversal benchmark
– Goals
– Design
• Preliminary results
21 November 2011
Graph data management
• Booming area of R&D in recent years
• Reasons:
– Increased availability and importance of graph data
– Natural way of modelling various real-world phenomena (networks: social, information, communication)
• Two dominant data management directions:
– Distributed graph processing frameworks
• Mining/processing of large graphs: Pregel and clones (Golden Orb, Giraph)
– Graph databases
• Persistent management of graph data: Neo4J, OrientDB, Dex
Graph databases
• Property graph data model
– Graph structure
– Elements have properties
[Figure: example property graph with four nodes (K1 to K4), each carrying attributes I1 to I3 with values, and labels L1 to L3]
• Unique feature
– Graph topology capturing the relations of objects
– A graph database should
• Be efficient in exploiting topology
• Allow fast traversal
• Challenges
– Traditionally, graph processing/traversal is done in memory
– Reasons:
• Data driven computation
• Random access pattern for data access
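A minimal sketch of the property graph model described on this slide, in Python. The class and method names (PropertyGraph, add_node, add_edge, neighbours) are illustrative assumptions and are not taken from any of the systems discussed in this talk.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy in-memory property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.node_props = {}                # node id -> property dict
        self.out_edges = defaultdict(list)  # node id -> [(label, target, props)]
        self.in_edges = defaultdict(list)   # node id -> [(label, source, props)]

    def add_node(self, node_id, **props):
        self.node_props[node_id] = dict(props)

    def add_edge(self, source, label, target, **props):
        # Assumed convention: both endpoints must exist before the edge is added.
        if source not in self.node_props or target not in self.node_props:
            raise KeyError("both endpoints must be added before the edge")
        self.out_edges[source].append((label, target, dict(props)))
        self.in_edges[target].append((label, source, dict(props)))

    def neighbours(self, node_id, label=None):
        """Targets of outgoing edges, optionally filtered by edge label."""
        return [target
                for (edge_label, target, _) in self.out_edges.get(node_id, [])
                if label is None or edge_label == label]


# Hypothetical data loosely following the figure's naming (K = node, L = label).
g = PropertyGraph()
g.add_node("K1", I1="val")
g.add_node("K2", I1="val")
g.add_node("K3", I1="val")
g.add_edge("K1", "L1", "K2", weight=1)
g.add_edge("K1", "L2", "K3")
```

Traversal here is a lookup in an adjacency list held in memory, which is the access pattern that becomes expensive once the data no longer fits in memory.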
Graph database benchmarking
• Motivation
– A number of graph data management solutions are emerging.
– Which is the right one for a specific problem?
– Fair measurement of performance for distinct use cases.
– Identify limits: which use cases perform well.
• Related work
– Only a few works directly address graph databases
• D. Dominguez-Sal et al.:
– Adoption of an HPC benchmark for graph data processing
– Design of a benchmark suitable for graph database systems
• GraphBench - basic benchmarking framework implementation
• Traversal operation benchmarking
– Graph topology is the unique feature of graph databases
– Test the ability to do:
• Local traversals (exploring the k-hop neighbourhood)
• Global traversals (traversals of the whole graph)
– Perform traversals in a memory-constrained environment
• (can we deal efficiently with data sets exceeding the physical memory?)
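To make the local/global distinction concrete, here is a minimal Python sketch over a plain adjacency dictionary. It is not the benchmark's implementation; the function names and the toy graph are assumptions made for illustration, and the component search assumes an undirected (symmetric) adjacency.

```python
from collections import deque


def k_hop_neighbourhood(adj, start, k):
    """Local traversal: nodes reachable from start in at most k hops (BFS)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for nxt in adj.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth) - {start}


def connected_components(adj):
    """Global traversal: visits every node and groups nodes into components.

    Assumes adj is symmetric, i.e. the graph is undirected.
    """
    seen = set()
    components = []
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        component = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Hypothetical toy graph: a path a-b-c-d plus a separate edge x-y.
adj = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"],
    "x": ["y"], "y": ["x"],
}
```

The local traversal touches only the nodes within k hops of the start node, whereas the global one has to read every node and edge at least once, which is where a memory limit is likely to matter most.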
Benchmark design
• Fairness
– Blueprints API: an effort to provide a common API
• https://github.com/tinkerpop/blueprints/wiki/
– Using Blueprints: one implementation of the benchmark for all the benchmarked systems
• Avoids the bias of different benchmark implementations for different systems
– Execution of the same sequence of operations on the same data
• Log operations and their parameters in the first run over the defined data
• Logs are persistent, so benchmarks can be rerun on different versions of a product and the change in performance measured
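The record-and-replay idea can be sketched as follows. This is a hedged illustration in Python, not the benchmark's own code (which sits on top of the Blueprints API): the JSON-lines log format, the function names, and the dictionary standing in for a database are all assumptions.

```python
import io
import json
import time


def record_operations(operations, log_file):
    """First run: write each operation and its parameters as one JSON line."""
    for name, params in operations:
        log_file.write(json.dumps({"op": name, "params": params}) + "\n")


def replay_operations(log_file, handlers):
    """Later runs: execute the logged sequence against one system and time it.

    handlers maps an operation name to a callable taking the logged params.
    """
    timings = []
    for line in log_file:
        entry = json.loads(line)
        started = time.perf_counter()
        handlers[entry["op"]](**entry["params"])
        timings.append((entry["op"], time.perf_counter() - started))
    return timings


# Hypothetical usage: an in-memory "log" and a dict standing in for a database.
log = io.StringIO()
record_operations(
    [("add_node", {"node_id": 1}),
     ("add_node", {"node_id": 2}),
     ("add_edge", {"source": 1, "target": 2})],
    log,
)

store = {"nodes": set(), "edges": []}
handlers = {
    "add_node": lambda node_id: store["nodes"].add(node_id),
    "add_edge": lambda source, target: store["edges"].append((source, target)),
}
log.seek(0)
timings = replay_operations(log, handlers)
```

Because the log fixes both the order and the parameters of the operations, every system (or every version of one system) is given exactly the same workload, and only the handlers differ.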
Benchmark design
• Data
– Different data properties / distributions affect benchmark results
• E.g. dense vs. sparse graphs
– Ideally, data set properties are similar to those of real-world data sets
– Use: scale-free networks with small-world properties
• Social networks, the Internet, traffic networks, biological networks, and term co-occurrence networks
• LFR-Benchmark generator: networks with power-law degree distribution and implanted communities within the network
Benchmark design
• Traversal operations
– Local traversals
• Compute local clustering coefficient (2-hop breadth-first traversal)
• 3-hop breadth-first traversal
– Global traversals
• Compute connected components
– Incoming / outgoing edges
• k iterations of the HITS algorithm
• Memory-constrained environment
• Intermediate results for global traversal operations:
– Kept in memory
– Kept as properties on nodes
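Two of the operations listed above, sketched in Python under stated assumptions: the local clustering coefficient is computed on an undirected graph given as a dictionary of neighbour sets, and HITS runs on a directed graph given as a dictionary of successor lists. Scores are kept in plain dictionaries, which corresponds to the "kept in memory" variant; the function names and toy graphs are illustrative and are not the benchmark's code.

```python
import math


def local_clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of node that are themselves connected.

    Reads the neighbours of node and their neighbours (a 2-hop exploration).
    Assumes an undirected graph given as node -> set of neighbours.
    """
    neighbours = adj[node] - {node}
    d = len(neighbours)
    if d < 2:
        return 0.0
    links = sum(1 for u in neighbours
                for w in adj[u] if w in neighbours and w != u)
    return links / (d * (d - 1))  # each link was counted once per endpoint


def hits(out_adj, k):
    """k iterations of HITS on a directed graph (node -> list of successors)."""
    nodes = set(out_adj)
    for targets in out_adj.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(k):
        # Authority update: sum of hub scores over incoming edges.
        auth = {n: 0.0 for n in nodes}
        for src, targets in out_adj.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum of the new authority scores over outgoing edges.
        hub = {n: sum(auth[dst] for dst in out_adj.get(n, ())) for n in nodes}
        # Normalise both score vectors (Euclidean norm) so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical toy graphs: a triangle a-b-c with a pendant node d,
# and two hub pages pointing at two authority pages.
undirected = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
directed = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(directed, 5)
```

Each HITS iteration sweeps over all edges of the graph, so k iterations amount to k global traversals; storing hub and auth as node properties instead of dictionaries would trade memory for extra reads and writes to the database.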
Benchmark implementation
• Implemented on top of Blueprints API
• Tests performed on:
– Neo4J
– DEX
– OrientDB
– Native RDF repository (NativeSail)
– SGDB (research prototype)
• Challenge: dealing with differences in the underlying systems, e.g.:
– Triple stores: naming constraints
– Some implementations do not support properties on some elements
– Some implementations do not support iteration over nodes/edges
– Node ID generation: user-provided vs. autogenerated
– Transaction support vs. no transactions
Benchmark Runs
• Performed on older hardware:
– 2 GB of memory
• Data set sizes:
– 1K, 10K, 40K, 50K, 100K, 200K, 400K, 800K, 1M
– Most systems were not able to load networks with 400K+ edges
• (constraint: load 10K edges in less than 60 sec.)
Graph loading – element insertion
Local traversal – BFS 3 hops
Global traversals – connected components
Conclusion
• Extending work on benchmarking graph databases
• Focusing on graph traversal operations
• Local/Global traversals
• Preliminary results:
– Even loading larger data sets into GDBs is a problem
– Stable performance for local traversals with 2-3 hops
• Suitable for most ego-centric analyses of node properties
– Poor performance for global traversal operations on larger networks
Thank you for your attention.
http://ups.savba.sk/~marek/gbench.html
SemSets – activation spreading over network