gStore: Answering SPARQL Queries via Subgraph Matching

Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao
{zoulei,mojinghui,zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca
Agenda
• Introduction
• Preliminaries
• Overview of gStore
• Storage Scheme and Encoding Technique
• Indexing Structure and Query Algorithm
• Optimized methods
• Experiments and their results
• Conclusions
Introduction - 1/4
• What is RDF?
– Building block of semantic web
– Represented as a collection of triples : (Subject,Property,Object)
Prefix: y=http://en.wikipedia.org/wiki/
Subject | Property | Object
y:Abraham Lincoln | hasName | Abraham Lincoln
y:Abraham Lincoln | BornOnDate | 1809-02-12
y:Abraham Lincoln | DiedOnDate | 1865-04-15
y:Abraham Lincoln | DiedIn | y:Washington_D.C
y:Washington_D.C | hasName | “Washington D.C”
y:Washington_D.C | FoundYear | 1790
y:Washington_D.C | rdf:type | y:city
y:United_States | hasName | “United States”
y:United_States | hasCapital | y:Washington_D.C
y:United_States | rdf:type | Country
y:Reese_Witherspoon | rdf:type | y:Actor
y:Reese_Witherspoon | BornOnDate | “1976-03-22”
y:Reese_Witherspoon | BornIn | y:New_Orleans_Louisiana
y:Reese_Witherspoon | hasName | “Reese Witherspoon”
y:New_Orleans_Louisiana | FoundYear | 1718
y:New_Orleans_Louisiana | rdf:type | y:city
y:New_Orleans_Louisiana | locatedIn | y:United_States
Introduction - 2/4: RDF Graph
Introduction - 3/4
• What is SPARQL?
• Sample query:
Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15" }
• Query with wildcards:
Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> ?bd. ?m <DiedOnDate> ?dd. FILTER (regex(str(?bd), "02-12") && regex(str(?dd), "04-15")) }
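To make the two sample queries concrete, here is a minimal Python sketch that evaluates them by brute force over a handful of triples from the table. It is only an illustration of what the queries ask for, not how gStore or any SPARQL engine evaluates them; the helper names `objects_of` and `select_names` are made up for this example.

```python
import re

# A few triples from the table above, as (subject, property, object) tuples.
TRIPLES = [
    ("y:Abraham Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham Lincoln", "BornOnDate", "1809-02-12"),
    ("y:Abraham Lincoln", "DiedOnDate", "1865-04-15"),
    ("y:Reese_Witherspoon", "hasName", "Reese Witherspoon"),
    ("y:Reese_Witherspoon", "BornOnDate", "1976-03-22"),
]


def objects_of(subject, prop):
    """All objects o such that (subject, prop, o) is a stored triple."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def select_names(born_ok, died_ok):
    """Return ?name for every ?m whose birth and death dates pass both tests."""
    names = []
    for m in sorted({s for s, _, _ in TRIPLES}):
        born = any(born_ok(d) for d in objects_of(m, "BornOnDate"))
        died = any(died_ok(d) for d in objects_of(m, "DiedOnDate"))
        if born and died:
            names.extend(objects_of(m, "hasName"))
    return names


# Exact query: both dates must be equal to the given literals.
exact = select_names(lambda d: d == "1809-02-12", lambda d: d == "1865-04-15")

# Wildcard query: the FILTER regex only has to match somewhere in the literal.
wildcard = select_names(
    lambda d: re.search("02-12", d) is not None,
    lambda d: re.search("04-15", d) is not None,
)
print(exact, wildcard)
```

The scan touches every triple for every subject, which is exactly the cost that an index is meant to avoid.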
Introduction - 4/4
• Problems with existing solutions:
– they cannot answer SPARQL queries with
wildcards in a scalable manner
– they cannot handle frequent updates in RDF
repositories
• Answering with subgraph matching
– Modeling RDF data and Query as two graphs
– Cannot use regular graph pattern matching
– Answering SPARQL query ≈ subgraph matching
Preliminaries
• RDF graph G is denoted as G = (V, LV, E, LE): vertices V, vertex labels LV, directed edges E, and edge labels (properties) LE
• Query graph Q is denoted in the same way as Q = (V, LV, E, LE)
Preliminaries Cont’d
• G(u1, u2, …, un) is a match of Q(v1, v2, …, vn) if:
– if vi is a literal vertex, vi and ui have the same literal value
– if vi is a class/entity vertex, vi and ui have the same URI
– if vi is a parameter vertex, there is no constraint over ui
– if vi is a wildcard vertex, vi is a substring of ui and ui is a literal value
– if there is an edge from vi to vj in Q with the property p, there is also an edge from ui to uj in G with the same property p
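The match conditions above can be checked mechanically. The following Python sketch is one possible reading of them, under assumptions that are not in the slides: each vertex is a `(kind, value)` pair, edges are `(source, property, target)` index triples, and `assignment[i]` gives the index of the data vertex ui chosen for query vertex vi.

```python
def vertex_ok(v, u):
    """Check the per-vertex condition for query vertex v and data vertex u."""
    kind, value = v
    u_kind, u_value = u
    if kind == "literal":      # same literal value
        return u_kind == "literal" and u_value == value
    if kind == "entity":       # class/entity vertex: same URI
        return u_kind == "entity" and u_value == value
    if kind == "parameter":    # no constraint over u
        return True
    if kind == "wildcard":     # v is a substring of u, and u is a literal
        return u_kind == "literal" and value in u_value
    raise ValueError(f"unknown vertex kind: {kind}")


def is_match(q_vertices, q_edges, g_vertices, g_edges, assignment):
    """True if the assigned data vertices form a match of the query graph."""
    for i, v in enumerate(q_vertices):
        if not vertex_ok(v, g_vertices[assignment[i]]):
            return False
    # Every query edge (vi, p, vj) needs a data edge (ui, p, uj).
    return all((assignment[i], p, assignment[j]) in g_edges
               for i, p, j in q_edges)


# Data graph: the Abraham Lincoln entity and three of its literals.
g_vertices = [
    ("entity", "y:Abraham Lincoln"),
    ("literal", "Abraham Lincoln"),
    ("literal", "1809-02-12"),
    ("literal", "1865-04-15"),
]
g_edges = {(0, "hasName", 1), (0, "BornOnDate", 2), (0, "DiedOnDate", 3)}

# Query graph for the wildcard query: ?m, ?name, "02-12", "04-15".
q_vertices = [
    ("parameter", "?m"),
    ("parameter", "?name"),
    ("wildcard", "02-12"),
    ("wildcard", "04-15"),
]
q_edges = [(0, "hasName", 1), (0, "BornOnDate", 2), (0, "DiedOnDate", 3)]

good = is_match(q_vertices, q_edges, g_vertices, g_edges, [0, 1, 2, 3])
bad = is_match(q_vertices, q_edges, g_vertices, g_edges, [0, 1, 3, 2])
```

Swapping the two date literals in the second assignment breaks both the wildcard condition and the edge condition, so it is rejected.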
Overview of gStore
• Work directly on RDF graph and SPARQL Query
graph
• Use a signature-based encoding of each entity
and class vertex to speed up matching
• Filter and evaluate
– Use a filter that may admit false positives to prune vertices and obtain a set of candidates; then verify each candidate
• Use an index (VS*-tree) over the data signature graph for efficient pruning; the index has a light maintenance load
Storage Scheme & Encoding Technique
• Storage Scheme
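• Encoding technique: encoding the edge (hasName, “Abraham Lincoln”)
– The property hasName is hashed to a 12-bit string: 0100 0000 0000
– Each 3-gram of the literal, such as “bra”, sets one bit in a 16-bit string:
0000 0100 0000 0000
1000 0000 0000 0000
0000 0000 0100 0000
– The bitwise OR of these strings is the literal's signature: 1000 0100 0100 0000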
• Encoding technique: the edge signature is the property bits followed by the literal bits
(hasName, “Abraham Lincoln”) → 0100 0000 0000 | 1000 0100 0100 0000
• Encoding technique: the vertex signature is the bitwise OR of the signatures of its edges
Edge | Property bits | Object bits
(hasName, “Abraham Lincoln”) | 0010 0000 0000 | 1000 0100 0100 0000
(BornOnDate, "1809-02-12") | 0100 0000 0000 | 0100 0010 0100 1000
(DiedOnDate, "1865-04-15") | 0000 1000 0000 | 0000 0010 0100 0000
(DiedIn, y:Washington_D.C) | 0000 0010 0000 | 1000 0010 0100 0001
OR | 0110 1010 0000 | 1100 0110 0100 1001
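The encoding steps on these slides can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not gStore's implementation: the MD5-based `_one_bit` hash is a stand-in chosen only because it is deterministic, the 12-bit and 16-bit widths are taken from the slide example, and entity objects are hashed by 3-grams just like literals.

```python
import hashlib

PROP_BITS = 12  # width of the property part, as in the slide example
OBJ_BITS = 16   # width of the object part, as in the slide example


def _one_bit(text, width):
    """Stand-in hash (an assumption): map text to a string with one bit set."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest, "big") % width)


def property_signature(prop):
    return _one_bit(prop, PROP_BITS)


def object_signature(text, n=3):
    """OR together the one-bit hashes of every n-gram of the text."""
    sig = 0
    for i in range(len(text) - n + 1):
        sig |= _one_bit(text[i:i + n], OBJ_BITS)
    return sig


def edge_signature(prop, obj):
    """Property bits followed by object bits."""
    return (property_signature(prop) << OBJ_BITS) | object_signature(obj)


def vertex_signature(edges):
    """Bitwise OR of the signatures of all edges leaving the vertex."""
    sig = 0
    for prop, obj in edges:
        sig |= edge_signature(prop, obj)
    return sig


def may_match(query_sig, data_sig):
    """Filter test: every bit set in the query is also set in the data."""
    return (query_sig & data_sig) == query_sig


lincoln = vertex_signature([
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.C"),
])
# A wildcard query edge: born on some date containing "02-12".
query = edge_signature("BornOnDate", "02-12")
print(format(lincoln, "028b"), may_match(query, lincoln))
```

Because every 3-gram of "02-12" also occurs in "1809-02-12", the query bits are guaranteed to be a subset of the vertex bits. The converse does not hold: a vertex can pass `may_match` without being a real answer, which is why a verification step follows the filter.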
Indexing Structure and Query Algorithm
• Data signature graph G*
• Converting Q to Q*
• Filter and evaluate
– Find matches of Q* over G* to obtain the candidate list (CL)
– Verify each match in CL against the RDF graph G to obtain the results (RS)
Generating Candidate List(CL)
• Two step process:
– for each vertex vi ∈ V(Q*), find the list Ri = {ui1, ui2, ..., uin} of vertices uij ∈ V(G*) whose signatures satisfy vi & uij = vi
– do a multi-way join to get the candidate list
• Use S-trees
– Height-balanced tree over signatures
– Does not support the second step, so the join remains expensive
• VS-tree and VS*-tree
– Multi-resolution summary graph based on S-tree
– Supports both steps efficiently
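The two-step process can be written down directly without any index. The Python sketch below is a brute-force illustration under assumptions: signatures are plain integers, the toy data graph and its bit patterns are invented for the example, and `candidates` and `candidate_list` are hypothetical names. Only the query (vertex signatures 0000 1000 and 1000 0000, edge signature 10000) is taken from the slides.

```python
from itertools import product


def candidates(query_sig, data_sigs):
    """Step 1: data vertices u with v & u = v for the query signature v."""
    return [u for u, sig in data_sigs.items() if (query_sig & sig) == query_sig]


def candidate_list(q_sigs, q_edges, d_sigs, d_edges):
    """Step 2: multi-way join of the per-vertex lists over the query edges."""
    lists = [candidates(sig, d_sigs) for sig in q_sigs]
    result = []
    for combo in product(*lists):
        ok = all(
            (combo[i], combo[j]) in d_edges
            and (label & d_edges[(combo[i], combo[j])]) == label
            for (i, j), label in q_edges.items()
        )
        if ok:
            result.append(combo)
    return result


# Toy data signature graph (made-up values).
d_sigs = {
    "001": 0b0000_1001,
    "002": 0b1000_0100,
    "003": 0b0100_0100,
    "004": 0b1000_1000,
}
d_edges = {
    ("001", "002"): 0b10010,
    ("001", "003"): 0b10000,
    ("004", "002"): 0b00010,
}

# Query signature graph from the slides.
q_sigs = [0b0000_1000, 0b1000_0000]
q_edges = {(0, 1): 0b10000}

r0 = candidates(q_sigs[0], d_sigs)
r1 = candidates(q_sigs[1], d_sigs)
cl = candidate_list(q_sigs, q_edges, d_sigs, d_edges)
```

The join enumerates the full cross product of the lists, which is the expensive part that a plain S-tree does not help with.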
S-tree Solution
[Figure, shown in several animation steps: an S-tree built over the 8-bit signatures of data vertices 001-008, with internal nodes labeled d; the query signatures 0000 1000 and 1000 0000 (joined by an edge labeled 10000) are compared with node signatures using bitwise AND (&)]
VS-tree Solution
[Figure: VS-tree over data vertices 001-008; the nodes at each level carry vertex signatures and are linked by edges labeled with 5-bit edge signatures, forming a summary graph per level]
• Query Q*: vertex signatures 0000 1000 and 1000 0000, joined by an edge with signature 10000
• Summary matches found level by level, starting from the root:
– d11 × d11
– d12 × d12
– d13 × d23
– 001 × 002
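The level-by-level search for summary matches can be sketched as follows. This is a simplified illustration under several assumptions, not the VS-tree algorithm itself: the tree is a small hand-built balanced tree, the node names `d11`, `d21`, `d22` and all data signatures are invented for the example, and the helpers `inner`, `build_super_edges`, and `summary_matches` are hypothetical. An inner node's signature is the OR of its children's, and an edge between two inner nodes carries the OR of the labels of the edges between their children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    sig: int = 0
    children: list = field(default_factory=list)


def inner(name, children):
    """Inner node whose signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.sig
    return Node(name, sig, children)


def build_super_edges(root, leaf_edges):
    """Lift leaf edges to every ancestor level, OR-ing the labels."""
    parent, stack = {}, [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            parent[child.name] = node.name
            stack.append(child)
    edges, current = dict(leaf_edges), dict(leaf_edges)
    while current:
        lifted = {}
        for (a, b), label in current.items():
            if a in parent and b in parent:
                key = (parent[a], parent[b])
                lifted[key] = lifted.get(key, 0) | label
        edges.update(lifted)
        current = lifted
    return edges


def summary_matches(root, edges, q1, q2, q_label):
    """Return the surviving (node, node) pairs at each level, root first."""
    def ok(a, b):
        label = edges.get((a.name, b.name), 0)
        return ((q1 & a.sig) == q1 and (q2 & b.sig) == q2
                and (q_label & label) == q_label)

    level = [(root, root)] if ok(root, root) else []
    trace = [[(a.name, b.name) for a, b in level]]
    while level and level[0][0].children:
        level = [(ca, cb) for a, b in level
                 for ca in a.children for cb in b.children if ok(ca, cb)]
        trace.append([(a.name, b.name) for a, b in level])
    return trace


# Toy tree: two inner nodes over four data vertices (made-up signatures).
v1, v2 = Node("001", 0b0000_1001), Node("002", 0b1000_0100)
v3, v4 = Node("003", 0b0100_0100), Node("004", 0b1000_1000)
root = inner("d11", [inner("d21", [v1, v3]), inner("d22", [v2, v4])])
leaf_edges = {("001", "002"): 0b10010, ("001", "003"): 0b10000,
              ("004", "002"): 0b00010}
edges = build_super_edges(root, leaf_edges)
trace = summary_matches(root, edges, 0b0000_1000, 0b1000_0000, 0b10000)
```

Each level only expands the children of pairs that survived the level above, so whole subtrees are skipped as soon as a node signature or an edge label fails the AND test.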
VS-tree Solution: limitations
• Each level is processed step by step
• If a level is dense, it produces many summary matches, which enlarges the search space
Possible Optimization Methods
• “Magically” know which level to begin with, to minimize the number of summary matches
• Use DFS (depth-first search) to find the valid child nodes
• While inserting vertices, consider not only the Hamming distance but also the number of super edges introduced
Optimization example
Experimental results - Exact queries
Yago network (20 million triples, 3.1 GB)
[Figure: per-query comparison of gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM and GRIN]
Experimental results - Wildcard queries
[Figure: per-query comparison of gStore, x-RDF-3x, RDF-3x, BigOWLIM, SW-Store and GRIN]
Conclusion
• This approach:
– Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing
– Solves the two problems with existing solutions:
• answers SPARQL queries with wildcards in a scalable manner
• handles frequent and online updates in RDF repositories
Questions?