G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs
Sherif Sakr (NICTA & UNSW, Sydney, Australia)
Sameh Elnikety (Microsoft Research, Redmond, WA)
Yuxiong He (Microsoft Research, Redmond, WA)
CIKM 2012
Example 1: Social Network
[Figure: social network graph with nodes Hillary, Alice, Chris, David, Bob, Ed, George, France, and Photo1 to Photo8]
Example 2: Bibliographical Network
[Figure: bibliographical network. Author nodes John (age 45), Alice (age 42, location Sydney), and Smith (age 28, office 518); paper nodes Paper 1 (keyword graph) and Paper 2 (keyword XML, type Demo), linked by a citedBy edge; venue node VLDB 12 (location Istanbul); organization nodes UNSW (country Australia, established 1949) and Microsoft (country USA, established 1975). Edges carry attributes: order (1 or 2), month (1 or 3), and title (Senior Researcher, Professor).]
Contributions
1. G-SPARQL language
– Pattern matching
– Reachability
2. Hybrid execution engine
– Graph topology in main memory
– Graph data in relational database
3. Algebraic transformation
– Operators
– Optimizations
4. Experimental evaluation
1. G-SPARQL Query Language
• Extends a subset of SPARQL
– Based on triple pattern: (subject, predicate, object)
• Sub-graph matching patterns on
– Graph structure
– Node attribute
– Edge attribute
• Reachability patterns on
– Path
– Shortest path
G-SPARQL Syntax
G-SPARQL Pattern Matching
• Node attribute
– ?Person @officeNumber “518”
[Figure: node with officeNumber = 518]
• Edge attribute
– ?E @Role “Programmer”
[Figure: edge between Alice and Microsoft with Role = Programmer]
• Structural
– ?Person worksAt Microsoft
– ?Person ?E(worksAt) Microsoft
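The three pattern kinds above can be illustrated with a minimal Python sketch. The `Edge` class, the `match_*` helper names, and the sample data (including Bob and his attribute values) are assumptions made for illustration; they are not the engine's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    src: str
    label: str
    dst: str
    attrs: dict = field(default_factory=dict)


# Hypothetical toy graph: node attributes plus attributed edges.
node_attrs = {
    "Alice": {"officeNumber": "518"},
    "Bob": {"officeNumber": "320"},
    "Microsoft": {},
}
edges = [
    Edge("Alice", "worksAt", "Microsoft", {"Role": "Programmer"}),
    Edge("Bob", "worksAt", "Microsoft", {"Role": "Tester"}),
]


def match_node_attr(name, value):
    """?Person @name value: nodes whose attribute equals the value."""
    return [n for n, a in node_attrs.items() if a.get(name) == value]


def match_structural(label, dst):
    """?Person ?E(label) dst: (source, edge) bindings for matching edges."""
    return [(e.src, e) for e in edges if e.label == label and e.dst == dst]


def match_edge_attr(bindings, name, value):
    """?E @name value: keep bindings whose edge attribute equals the value."""
    return [(s, e) for s, e in bindings if e.attrs.get(name) == value]


people_in_518 = match_node_attr("officeNumber", "518")
workers = match_structural("worksAt", "Microsoft")
programmers = [s for s, _ in match_edge_attr(workers, "Role", "Programmer")]
```

Binding the edge to a variable (`?E`) is what lets a later pattern constrain the edge's own attributes, which is why `match_structural` returns the edge alongside the source node.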
G-SPARQL Reachability
• Path
– Subject ??PathVar Object
• Shortest path
– Subject ?*PathVar Object
• Path filters
– Path length
– All edges
– All nodes
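A path pattern with a length filter can be evaluated by a bounded breadth-first search over the topology. The sketch below is one possible reading, not the engine's implementation; the adjacency list, the function name `shortest_path`, and the `max_length` parameter are hypothetical. Because BFS finds a shortest path first, checking that path against the bound also answers whether any path within the bound exists.

```python
from collections import deque

# Hypothetical adjacency list (topology only, no attributes).
adjacency = {
    "Alice": ["Bob", "Chris"],
    "Bob": ["David"],
    "Chris": ["David"],
    "David": ["Ed"],
    "Ed": ["George"],
    "George": [],
}


def shortest_path(graph, source, target, max_length=None):
    """BFS over unweighted edges; returns a shortest path as a node list, or None.

    max_length bounds the number of edges, mimicking a path-length filter.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        if max_length is not None and len(path) - 1 >= max_length:
            continue  # extending this path would exceed the length bound
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

For example, `shortest_path(adjacency, "Alice", "George", max_length=3)` finds no match in this toy graph, because the only route to George needs four edges.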
Example: G-SPARQL Query
SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @Label ?L1.
  ?X @Age ?Age1.
  ?X Affiliated UNSW.
  ?X LivesIn Sydney.
  ?Y @Label ?L2.
  ?Y @Age ?Age2.
  ?Y ?E(Affiliated) Microsoft.
  ?E @Title "Researcher".
  FILTER(?Age1 >= 40).
  FILTER(?Age2 >= 40).
  FILTERPATH( Length( ??P, <= 3) ).
}
Outline
1. G-SPARQL language
– Pattern matching
– Reachability
2. Hybrid execution engine
– Graph topology in main memory
– Graph data in relational database
3. Algebraic transformation
– Operators
– Optimizations
4. Experimental evaluation
2. Hybrid Execution Engine
• Reachability queries
– Main memory algorithms
– Example: BFS and Dijkstra’s algorithm
[Figure: social network graph from Example 1]
• Pattern matching queries
– Relational database
– Indexing
» Example: B-tree
– Query optimizations
» Example: selectivity estimation and join ordering
– Recursive queries
» Not efficient: large intermediate results and multiple joins
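To make the last point concrete, the sketch below answers the same reachability question twice: once with a recursive SQL query and once with an in-memory BFS. It uses Python's built-in `sqlite3` module purely as a stand-in relational engine; the `edge` table, its columns, the function names, and the sample rows are hypothetical. On a toy graph both return the same set. The slide's argument is that on large graphs the recursive query pays for repeated joins and large intermediate results, while the traversal only walks the topology.

```python
import sqlite3
from collections import deque

# Hypothetical edge list used by both approaches.
edge_rows = [("Alice", "Bob"), ("Bob", "David"), ("David", "Ed"), ("Chris", "David")]


def reachable_sql(rows, source):
    """Reachability as a recursive query: each step joins the frontier with edge."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", rows)
    query = """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach
    """
    result = {row[0] for row in conn.execute(query, (source,))}
    conn.close()
    return result


def reachable_bfs(rows, source):
    """Reachability as a main-memory traversal over an adjacency list."""
    adjacency = {}
    for src, dst in rows:
        adjacency.setdefault(src, []).append(dst)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```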
Graph Representation
Node attribute tables (ID, Value):
Node Label: (1, John), (2, Paper 2), (3, Alice), (4, Microsoft), (5, VLDB’12), (6, Paper 1), (7, UNSW), (8, Smith)
age: (1, 45), (3, 42), (8, 28)
office: (8, 518)
location: (3, Sydney), (5, Istanbul)
keyword: (2, XML), (6, graph)
type: (2, Demo)
established: (4, 1975), (7, 1949)
country: (4, USA), (7, Australia)

Edge tables (eID, sID, dID):
authorOf: (1, 1, 2), (5, 3, 2), (6, 3, 6), (11, 8, 6)
know: (2, 1, 3)
affiliated: (3, 1, 4), (8, 3, 7), (12, 8, 7)
published: (4, 2, 5), (10, 6, 5)
citedBy: (9, 6, 2)
supervise: (7, 3, 8)

Edge attribute tables (ID, Value):
order: (1, 2), (5, 1), (6, 2), (11, 1)
month: (4, 3), (10, 1)
title: (3, Senior Researcher), (8, Professor)
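A minimal sketch of this layout, using Python's built-in `sqlite3` as a stand-in relational store (the evaluation later in the talk uses IBM DB2). It loads a few rows from the Node Label, age, and affiliated tables on this slide and answers a structural-plus-attribute pattern with ordinary joins. The table and column spellings (`node_label`, `sid`, `did`) and the SQL text are illustrative assumptions, not SQL generated by the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node_label (id INTEGER, value TEXT);
    CREATE TABLE age        (id INTEGER, value INTEGER);
    CREATE TABLE affiliated (eid INTEGER, sid INTEGER, did INTEGER);
""")
# One table per attribute, one table per relationship (rows taken from the slide).
conn.executemany(
    "INSERT INTO node_label VALUES (?, ?)",
    [(1, "John"), (3, "Alice"), (4, "Microsoft"), (7, "UNSW"), (8, "Smith")],
)
conn.executemany("INSERT INTO age VALUES (?, ?)", [(1, 45), (3, 42), (8, 28)])
conn.executemany(
    "INSERT INTO affiliated VALUES (?, ?, ?)",
    [(3, 1, 4), (8, 3, 7), (12, 8, 7)],
)

# Pattern: ?X affiliated UNSW. ?X @age ?A. FILTER(?A >= 40).
rows = conn.execute("""
    SELECT person.value, age.value
    FROM affiliated
    JOIN node_label AS org    ON org.id = affiliated.did
    JOIN node_label AS person ON person.id = affiliated.sid
    JOIN age                  ON age.id = affiliated.sid
    WHERE org.value = 'UNSW' AND age.value >= 40
    ORDER BY person.value
""").fetchall()
```

Each triple pattern becomes a join against one narrow table, which is the kind of query a relational optimizer can index and reorder.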
Hybrid Execution Engine: interfaces
[Figure: interfaces of the hybrid engine, with a G-SPARQL query as input and the social network graph from Example 1]
3. Intermediate Language & Compilation
[Figure: compilation pipeline: G-SPARQL query -> front-end compilation (Step 1) -> algebraic query plan -> back-end compilation (Step 2) -> physical execution plan; shown next to the social network graph from Example 1]
Intermediate Language
• Objective
– Generate query plan and chop it
» Reachability part -> main-memory algorithms on topology
» Pattern matching part -> relational database
– Optimizations
• Features
– Independent of execution engine and graph representation
– Algebraic query plan
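The split described under Objective can be sketched at the level of triple patterns. This is a simplification: the engine chops an algebraic plan, not raw patterns, and the function name `split_patterns` and the regular expression are assumptions for illustration. Patterns whose predicate is a path variable (`??P`) or shortest-path variable (`?*P`) go to the traversal side; everything else goes to the relational side.

```python
import re

# A triple pattern is (subject, predicate, object), written in the G-SPARQL
# syntax shown earlier: ??P is a path variable, ?*P a shortest-path variable.
PATH_VARIABLE = re.compile(r"^\?(\?|\*)\w+$")


def split_patterns(patterns):
    """Partition triple patterns into (traversal part, relational part)."""
    traversal, relational = [], []
    for pattern in patterns:
        _, predicate, _ = pattern
        if PATH_VARIABLE.match(predicate):
            traversal.append(pattern)
        else:
            relational.append(pattern)
    return traversal, relational


patterns = [
    ("?X", "??P", "?Y"),
    ("?X", "@Age", "?Age1"),
    ("?X", "Affiliated", "UNSW"),
    ("?Y", "?E(Affiliated)", "Microsoft"),
]
traversal, relational = split_patterns(patterns)
```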
G-SPARQL Algebra
• Variant of “Tuple Algebra”
• Algebra details
– Data: tuples
» Sets of nodes, edges, paths.
– Operators
» Relational: select, project, join
» Graph specific: node and edge attributes, adjacency
» Path operators
[Figures: G-SPARQL algebra operator tables, with operators marked as relational or not relational]
Front-end Compilation (Step 1)
• Input
– G-SPARQL query
• Output
– Algebraic query plan
• Technique
– Map from triple patterns to G-SPARQL operators
– Use inference rules
Front-end Compilation: Inference Rules
Front-end Compilation: Optimizations
• Objective
– Delay execution of traversal operations
• Technique
– Order triple patterns, based on restrictiveness
• Heuristics
– Triple pattern P1 is more restrictive than P2 if:
1. P1 has fewer path variables than P2
2. P1 has fewer variables than P2
3. P1’s variables have more filter statements than P2’s variables
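One way to read these heuristics is as a sort key, with each rule breaking ties left by the one before it. That lexicographic reading is an assumption, as are the helper names and the `filter_counts` mapping (variable name to number of filter statements on it). A minimal sketch:

```python
import re

# Matches ?X, ??P, ?*P, and the ?E inside ?E(worksAt).
VARIABLE = re.compile(r"\?[?*]?\w+")


def variables(pattern):
    """All variables appearing in a (subject, predicate, object) pattern."""
    return [v for term in pattern for v in VARIABLE.findall(term)]


def restrictiveness_key(pattern, filter_counts):
    vars_ = variables(pattern)
    path_vars = [v for v in vars_ if v.startswith(("??", "?*"))]
    filters = sum(filter_counts.get(v, 0) for v in vars_)
    # Smaller key = more restrictive = evaluated earlier.
    return (len(path_vars), len(vars_), -filters)


def order_patterns(patterns, filter_counts):
    return sorted(patterns, key=lambda p: restrictiveness_key(p, filter_counts))


patterns = [
    ("?X", "??P", "?Y"),
    ("?X", "@Label", "?L1"),
    ("?X", "@Age", "?Age1"),
    ("?X", "Affiliated", "UNSW"),
]
filter_counts = {"?Age1": 1, "??P": 1}
ordered = order_patterns(patterns, filter_counts)
```

With this key the pattern containing the path variable sorts last, which matches the stated objective of delaying traversal operations.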
Back-end Compilation (Step 2)
• Input
– G-SPARQL algebraic plan
• Output
– SQL commands
– Traversal operations
• Technique
– Substitute G-SPARQL relational operators with SPJ
– Traverse
» Bottom up
» Stop when reaching root or reaching non-relational operator
» Transform relational algebra to SQL commands
– Send non-relational commands to main memory algorithms
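The traversal step can be sketched as a walk over a plan tree that emits one SQL command per maximal relational subtree and hands every other operator to the main-memory side. The `PlanNode` class, the operator names (`scan`, `select`, `project`, `join`, `path`), and the bracketed rendering used in place of real SQL text are all hypothetical; this is a sketch of the idea, not the engine's compiler.

```python
from dataclasses import dataclass, field

RELATIONAL_OPS = {"scan", "select", "project", "join"}


@dataclass
class PlanNode:
    op: str
    arg: str = ""
    children: list = field(default_factory=list)


def is_relational_subtree(node):
    return node.op in RELATIONAL_OPS and all(
        is_relational_subtree(c) for c in node.children
    )


def render(node):
    """Render a relational subtree as nested algebra (placeholder for SQL text)."""
    inner = ", ".join(render(c) for c in node.children)
    return f"{node.op}[{node.arg}]({inner})"


def compile_plan(node, commands=None):
    """Emit commands children-first, so inputs are produced before their consumers."""
    if commands is None:
        commands = []
    if is_relational_subtree(node):
        commands.append(("SQL", render(node)))  # whole fragment becomes one command
        return commands
    for child in node.children:
        compile_plan(child, commands)
    # Assumption: a relational operator sitting above a traversal is emitted on
    # its own; the slide does not say how the engine handles that case.
    kind = "SQL" if node.op in RELATIONAL_OPS else "TRAVERSAL"
    commands.append((kind, f"{node.op}[{node.arg}]"))
    return commands


plan = PlanNode("path", "?X ??P ?Y", [
    PlanNode("select", "age >= 40", [
        PlanNode("join", "affiliated", [
            PlanNode("scan", "age"),
            PlanNode("scan", "affiliated"),
        ]),
    ]),
    PlanNode("scan", "node_label"),
])
commands = compile_plan(plan)
```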
Back-end Compilation: Optimizations
• Optimize a fragment of query plan
– Before generating SQL command
• All operators are Select/Project/Join
• Apply standard techniques
– For example, pushing down selections
Example: G-SPARQL Query
SELECT ?L1 ?L2
WHERE {
  ?X ??P ?Y.
  ?X @label ?L1.
  ?X @age ?Age1.
  ?X affiliated UNSW.
  ?X livesIn Sydney.
  FILTER(?Age1 >= 40).
  ?Y @label ?L2.
  ?Y @age ?Age2.
  ?Y ?E(affiliated) Microsoft.
  ?E @title "Researcher".
  FILTER(?Age2 >= 40).
}
Example: Query Plan
4. Experimental Evaluation
• Objective
– Show that the hybrid approach is a good idea
– Show good performance from combining the DBMS and main-memory topology
• Data sets
– Real ACM bibliographic network
– Synthetic graphs
» See technical report
Experimental Environment
• Workload
– Created Q1 … Q12
• Process
– Compare to Neo4j (non-optimized, optimized)
• Environment
– Implementation
» Main memory algorithms in C++
» IBM DB2
– PC Server
Results on Real Dataset
Response time on ACM Bibliographic Network
Conclusions
• G-SPARQL Language
– Expresses pattern matching and reachability queries on attributed graphs
• Hybrid engine
– Graph topology in main memory
– Graph data in database
• Compilation into algebraic plan
– Operators and optimizations
• Evaluation
– Real and synthetic datasets
– Good performance
» Leveraging database engine and main memory topology