Benchmarking traversal operations over graph databases
Marek Ciglan (1), Alex Averbuch (2) and Ladislav Hluchý (1)
(1) Institute of Informatics, Slovak Academy of Sciences, Bratislava
(2) Swedish Institute of Computer Science, Stockholm, Sweden
21 November 2011

Overview
• Graph data management
• Graph databases
  – Characteristics
  – Unique features
  – Challenges
• GDB benchmarking
  – Motivation
  – Related work
• Graph traversal benchmark
  – Goals
  – Design
• Preliminary results

Graph data management
• Booming area of R&D in recent years
• Reasons:
  – Increased availability and importance of graph data
  – Natural way of modelling various real-world phenomena (networks: social, information, communication)
• Two dominant data management directions:
  – Distributed graph processing frameworks
    • Mining/processing of large graphs
    • Pregel and its clones (GoldenOrb, Giraph)
  – Graph databases
    • Persistent management of graph data
    • Neo4J, OrientDB, DEX

Graph databases
• Property graph data model
  – Graph structure
  – Elements have properties
  [Figure: example property graph – nodes K1–K4 with attribute/value properties (Attr I1–I3), connected by edges labelled L1–L3]
• Unique feature
  – Graph topology captures the relations between objects
  – A graph database should
    • be efficient in exploiting topology
    • allow for fast traversals
• Challenges
  – Traditionally, graph processing/traversal is done in memory
  – Reasons:
    • Data-driven computation
    • Random access pattern for data access

Graph database benchmarking
• Motivation
  – A number of emerging graph data management solutions.
  – Which is the right one for a specific problem?
  – Fair measurement of performance for distinct use cases.
  – Identify limits – which use cases have good performance.
• Related work
  – Only a few works address graph databases directly
    • D. Dominguez-Sal et al.:
      – Adoption of an HPC benchmark for graph data processing
      – Design of a benchmark suitable for graph database systems
    • GraphBench – a basic benchmarking framework implementation
• Traversal operation benchmarking
  – Graph topology – the unique feature of graph databases
  – Test the ability to do:
    • Local traversals (exploring the k-hop neighbourhood) – see the sketch below
    • Global traversals (traversals of the whole graph)
  – Perform traversals in a memory-constrained environment
    • (can we deal efficiently with data sets exceeding the physical memory?)
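The local traversal operation above (exploring the k-hop neighbourhood of a start vertex) can be illustrated as a short breadth-first search over the Blueprints property graph API used later in the deck. This is only a minimal sketch, assuming the Blueprints 2.x-style Vertex.getVertices(Direction, ...) call (method names differ between Blueprints versions); the class and method names here are illustrative and not part of the benchmark code.

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Vertex;

    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;

    public class LocalTraversal {

        // Breadth-first exploration of the k-hop neighbourhood of a start vertex,
        // following edges in both directions; returns the ids of all visited vertices.
        public static Set<Object> kHopNeighbourhood(Vertex start, int hops) {
            Set<Object> visited = new HashSet<Object>();
            Queue<Vertex> frontier = new LinkedList<Vertex>();
            visited.add(start.getId());
            frontier.add(start);
            for (int depth = 0; depth < hops; depth++) {
                Queue<Vertex> next = new LinkedList<Vertex>();
                for (Vertex v : frontier) {
                    for (Vertex n : v.getVertices(Direction.BOTH)) {
                        if (visited.add(n.getId())) {   // true only for newly discovered vertices
                            next.add(n);
                        }
                    }
                }
                frontier = next;                        // expand one hop further
            }
            return visited;
        }
    }

Because the benchmark replays the same operation sequence against every system through this common API, a traversal kernel like this has to be written only once.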
Benchmark design
• Fairness
  – Blueprints API – an effort to provide a common API
    • https://github.com/tinkerpop/blueprints/wiki/
  – Using Blueprints – one implementation of the benchmark for all benchmarked systems
    • Avoids the bias of having a different benchmark implementation for each system
  – Execution of the same sequence of operations on the same data
    • Operations and their parameters are logged in the first run over the defined data
    • Logs are persistent, so benchmarks can be rerun on different versions of a product and the change in performance measured
• Data
  – Different data properties / distributions affect benchmark results
    • E.g. dense vs. sparse graphs
  – Ideally, the data sets' properties should be similar to those of real-world data sets
  – Used: scale-free networks with small-world properties
    • social networks, the Internet, traffic networks, biological networks, term co-occurrence networks
    • LFR-Benchmark generator – networks with a power-law degree distribution and communities implanted within the network
• Traversal operations
  – Local traversals
    • Compute the local clustering coefficient (2-hop breadth-first traversal)
    • 3-hop breadth-first traversal
  – Global traversals
    • Compute connected components
      – Incoming / outgoing edges
    • k iterations of the HITS algorithm
  – Memory-constrained environment
    • Intermediate results of global traversal operations are either:
      – Kept in memory
      – Kept as properties on nodes (see the sketch at the end of this deck)

Benchmark implementation
• Implemented on top of the Blueprints API
• Tests performed on:
  – Neo4J
  – DEX
  – OrientDB
  – Native RDF repository (NativeSail)
  – SGDB (research prototype)
• Challenge: dealing with differences between the underlying systems, e.g.:
  – Triple stores – naming constraints
  – Some implementations do not support properties on some elements
  – Some implementations do not support iteration over nodes/edges
  – Node ID generation – user-provided vs. auto-generated
  – Transaction support / no transactions

Benchmark runs
• Performed on older hardware:
  – 2 GB of memory
• Data set sizes:
  – 1K, 10K, 40K, 50K, 100K, 200K, 400K, 800K, 1M
  – Most systems were not able to load networks with 400K+ edges
    • (constraint: load 10K edges in less than 60 sec.)

Graph loading – element insertion
[Results chart]

Local traversal – BFS, 3 hops
[Results chart]

Global traversals – connected components
[Results chart]

Conclusion
• Extending prior work on benchmarking graph databases
• Focusing on graph traversal operations
  – Local/global traversals
• Preliminary results:
  – It is a problem just to load larger data sets into GDBs
  – Stable performance for local traversals with 2-3 hops
    • Suitable for most ego-centric node property analyses
  – Poor performance for global traversal operations on larger networks

Thank you for your attention.
http://ups.savba.sk/~marek/gbench.html

SemSets – activation spreading over network
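For the memory-constrained variant mentioned under "Traversal operations", where intermediate results of a global traversal are kept as properties on nodes rather than in memory, a connected-components pass over the Blueprints API could look as follows. This is a minimal sketch, assuming the Blueprints 2.x Graph/Vertex interfaces (getVertices(), getProperty()/setProperty(), Direction.BOTH); the class name and the "component" property key are illustrative assumptions, not part of the benchmark code.

    import com.tinkerpop.blueprints.Direction;
    import com.tinkerpop.blueprints.Graph;
    import com.tinkerpop.blueprints.Vertex;

    import java.util.LinkedList;
    import java.util.Queue;

    public class ConnectedComponents {

        // Labels every vertex with a "component" property via repeated BFS.
        // The intermediate state (the component assignment) is stored on the
        // vertices themselves, not in an in-memory map; only the BFS frontier
        // has to stay on the heap.
        public static int label(Graph g) {
            int componentId = 0;
            for (Vertex seed : g.getVertices()) {
                if (seed.getProperty("component") != null) {
                    continue;                            // already reached by an earlier BFS
                }
                componentId++;
                Queue<Vertex> frontier = new LinkedList<Vertex>();
                seed.setProperty("component", componentId);
                frontier.add(seed);
                while (!frontier.isEmpty()) {
                    Vertex v = frontier.poll();
                    for (Vertex n : v.getVertices(Direction.BOTH)) {  // undirected components
                        if (n.getProperty("component") == null) {
                            n.setProperty("component", componentId);
                            frontier.add(n);
                        }
                    }
                }
            }
            return componentId;                          // number of components found
        }
    }

Pushing the component ids into node properties keeps the heap footprint small even when the graph exceeds physical memory; the in-memory variant of the benchmark would instead keep a vertex-id-to-component map on the heap.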