G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs

Sherif Sakr, NICTA & UNSW, Sydney, Australia
Sameh Elnikety, Microsoft Research, Redmond, WA
Yuxiong He, Microsoft Research, Redmond, WA
CIKM 2012

Example 1: Social Network
[Figure: social network graph linking people (Hillary, Alice, Chris, David, Bob, Ed, France, George) and photos (Photo1 to Photo8)]

Example 2: Bibliographical Network
[Figure: bibliographic network with attributed nodes (Paper 1, Paper 2, VLDB’12, John, Alice, Smith, UNSW, Microsoft) and attributed edges (attributes such as order, month, title)]

Contributions
1. G-SPARQL language
   – Pattern matching
   – Reachability
2. Hybrid execution engine
   – Graph topology in main memory
   – Graph data in relational database
3. Algebraic transformation
   – Operators
   – Optimizations
4. Experimental evaluation

1. G-SPARQL Query Language
• Extends a subset of SPARQL
  – Based on triple pattern: (subject, predicate, object)
• Sub-graph matching patterns on
  – Graph structure
  – Node attribute
  – Edge attribute
• Reachability patterns on
  – Path
  – Shortest path

G-SPARQL Syntax
[Figure: G-SPARQL syntax]

G-SPARQL Pattern Matching
• Node attribute
  – ?Person @officeNumber "518" (matches a node with officeNumber = 518)
• Edge attribute
  – ?E @Role "Programmer" (e.g. the edge from Alice to Microsoft with Role = Programmer)
• Structural
  – ?Person worksAt Microsoft
  – ?Person ?E(worksAt) Microsoft

G-SPARQL Reachability
• Path
  – Subject ??PathVar Object
• Shortest path
  – Subject ?*PathVar Object
• Path filters
  – Path length
  – All edges
  – All nodes

Example: G-SPARQL Query
    SELECT ?L1 ?L2
    WHERE {
      ?X ??P ?Y.
      ?X @Label ?L1.
      ?X @Age ?Age1.
      ?X Affiliated UNSW.
      ?X LivesIn Sydney.
      ?Y @Label ?L2.
      ?Y @Age ?Age2.
      ?Y ?E(Affiliated) Microsoft.
      ?E @Title "Researcher".
      FILTER(?Age1 >= 40).
      FILTER(?Age2 >= 40).
      FILTERPATH( Length( ??P, <= 3) ).
    }

Outline
1. G-SPARQL language
   – Pattern matching
   – Reachability
2.
Hybrid execution engine
   – Graph topology in main memory
   – Graph data in relational database
3. Algebraic transformation
   – Operators
   – Optimizations
4. Experimental evaluation

2. Hybrid Execution Engine
• Reachability queries
  – Main memory algorithms
  – Example: BFS and Dijkstra’s algorithm
• Pattern matching queries
  – Relational database
  – Indexing
    » Example: B-tree
  – Query optimizations
    » Example: selectivity estimation and join ordering
  – Recursive queries
    » Not efficient: large intermediate results and multiple joins

Graph Representation
Node attribute tables (ID, Value)
  Label:       (1, John), (2, Paper 2), (3, Alice), (4, Microsoft), (5, VLDB’12), (6, Paper 1), (7, UNSW), (8, Smith)
  age:         (1, 45), (3, 42), (8, 28)
  office:      (8, 518)
  location:    (3, Sydney), (5, Istanbul)
  keyword:     (2, XML), (6, graph)
  type:        (2, Demo)
  established: (4, 1975), (7, 1949)
  country:     (4, USA), (7, Australia)
Edge tables (eID, sID, dID)
  authorOf:    (1, 1, 2), (5, 3, 2), (6, 3, 6), (11, 8, 6)
  know:        (2, 1, 3)
  affiliated:  (3, 1, 4), (8, 3, 7), (12, 8, 7)
  published:   (4, 2, 5), (10, 6, 5)
  citedBy:     (9, 6, 2)
  supervise:   (7, 3, 8)
Edge attribute tables (ID, Value)
  order:       (1, 2), (5, 1), (6, 2), (11, 1)
  month:       (4, 3), (10, 1)
  title:       (3, Senior Researcher), (8, Professor)

Hybrid Execution Engine: interfaces
[Figure: G-SPARQL query submitted to the hybrid engine]

3.
Intermediate Language & Compilation
[Figure: G-SPARQL query -> front-end compilation (Step 1) -> algebraic query plan -> back-end compilation (Step 2) -> physical execution plan]

Intermediate Language
• Objective
  – Generate query plan and chop it
    » Reachability part -> main-memory algorithms on topology
    » Pattern matching part -> relational database
  – Optimizations
• Features
  – Independent of execution engine and graph representation
  – Algebraic query plan

G-SPARQL Algebra
• Variant of “Tuple Algebra”
• Algebra details
  – Data: tuples
    » Sets of nodes, edges, paths
  – Operators
    » Relational: select, project, join
    » Graph specific: node and edge attributes, adjacency
    » Path operators
[Figure: algebra operators, labeled Relational and NOT Relational]

Front-end Compilation (Step 1)
• Input
  – G-SPARQL query
• Output
  – Algebraic query plan
• Technique
  – Map from triple patterns to G-SPARQL operators
  – Use inference rules

Front-end Compilation: Inference Rules
[Figure: inference rules]

Front-end Compilation: Optimizations
• Objective
  – Delay execution of traversal operations
• Technique
  – Order triple patterns, based on restrictiveness
• Heuristics
  – Triple pattern P1 is more restrictive than P2
    1. P1 has fewer path variables than P2
    2. P1 has fewer variables than P2
    3.
P1’s variables have more filter statements than P2’s variables

Back-end Compilation (Step 2)
• Input
  – G-SPARQL algebraic plan
• Output
  – SQL commands
  – Traversal operations
• Technique
  – Substitute G-SPARQL relational operators with SPJ
  – Traverse
    » Bottom up
    » Stop when reaching root or reaching non-relational operator
    » Transform relational algebra to SQL commands
  – Send non-relational commands to main memory algorithms

Back-end Compilation: Optimizations
• Optimize a fragment of query plan
  – Before generating SQL command
• All operators are Select/Project/Join
• Apply standard techniques
  – For example, pushing selections down

Example: G-SPARQL Query
    SELECT ?L1 ?L2
    WHERE {
      ?X ??P ?Y.
      ?X @label ?L1.
      ?X @age ?Age1.
      ?X affiliated UNSW.
      ?X livesIn Sydney.
      FILTER(?Age1 >= 40).
      ?Y @label ?L2.
      ?Y @age ?Age2.
      ?Y ?E(affiliated) Microsoft.
      ?E @title "Researcher".
      FILTER(?Age2 >= 40).
    }

Example: Query Plan
[Figure: query plan]

4. Experimental Evaluation
• Objective
  – Show that the hybrid approach is a good idea
  – Good performance from DBMS and main memory topology
• Data sets
  – Real ACM bibliographic network
  – Synthetic graphs
    » See technical report

Experimental Environment
• Workload
  – Created Q1 … Q12
• Process
  – Compare to Neo4j (non-optimized, optimized)
• Environment
  – Implementation
    » Main memory algorithms in C++
    » IBM DB2
  – PC Server

Results on Real Dataset
[Figure]

Response time on ACM Bibliographic Network
[Figure]

Conclusions
• G-SPARQL Language
  – Expresses pattern matching and reachability queries on attributed graphs
• Hybrid engine
  – Graph topology in main memory
  – Graph data in database
• Compilation into algebraic plan
  – Operators and optimizations
• Evaluation
  – Real and synthetic datasets
  – Good performance
    » Leveraging database engine and main memory topology
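
Sketch: hybrid evaluation of a path query

The split described in the talk (attribute predicates answered from tables, reachability answered by a main-memory traversal of the topology) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the toy graph, the dictionaries standing in for relational tables, and the function names within_hops and query are hypothetical and are not taken from the G-SPARQL implementation, which the slides say uses C++ and IBM DB2. The sketch answers a simplified version of the example query: pairs of labels (?L1, ?L2) where both nodes have age >= 40 and a directed path of length <= 3 connects them.

```python
from collections import deque

# Hypothetical toy data, not the graph from the slides.
# Topology kept in memory as a directed adjacency list: node ID -> neighbour IDs.
adjacency = {1: [2, 4], 2: [3], 3: [5], 4: [5], 5: []}

# Dictionaries standing in for relational attribute tables (ID -> Value).
age = {1: 45, 3: 42, 5: 28}
label = {1: "John", 2: "Paper 2", 3: "Alice", 4: "Microsoft", 5: "Smith"}


def within_hops(graph, source, max_hops):
    """Breadth-first search: map each node reachable from source in
    1..max_hops edges to its hop count."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the length bound
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[source]  # the source itself is not a result
    return dist


def query(min_age, max_hops):
    """Attribute filter first (the 'relational' part), traversal second."""
    # Selection on the age table, the restrictive part of the query.
    candidates = sorted(n for n, a in age.items() if a >= min_age)
    results = []
    for x in candidates:
        # Traversal runs only for nodes that survived the selection.
        reachable = within_hops(adjacency, x, max_hops)
        for y in candidates:
            if y in reachable:
                results.append((label[x], label[y]))
    return results


print(query(min_age=40, max_hops=3))
```

Running the attribute filter before the traversal mirrors the stated optimization objective of delaying traversal operations: the breadth-first search is started only from nodes that already satisfy the filters.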