gStore: Answering SPARQL Queries via Subgraph Matching

Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao
{zoulei,mojinghui,zdy}@icst.pku.edu.cn, leichen@cse.ust.hk, tamer.ozsu@uwaterloo.ca

Agenda
• Introduction
• Preliminaries
• Overview of gStore
• Storage Scheme and Encoding Technique
• Indexing Structure and Query Algorithm
• Optimized methods
• Experiments and their results
• Conclusions

Introduction - 1/4
• What is RDF?
  – Building block of the semantic web
  – Represented as a collection of triples: (Subject, Property, Object)

Prefix: y=http://en.wikipedia.org/wiki/

Subject                    Property     Object
y:Abraham_Lincoln          hasName      "Abraham Lincoln"
y:Abraham_Lincoln          BornOnDate   "1809-02-12"
y:Abraham_Lincoln          DiedOnDate   "1865-04-15"
y:Abraham_Lincoln          DiedIn       y:Washington_D.C
y:Washington_D.C           hasName      "Washington D.C"
y:Washington_D.C           FoundYear    "1790"
y:Washington_D.C           rdf:type     y:city
y:United_States            hasName      "United States"
y:United_States            hasCapital   y:Washington_D.C
y:United_States            rdf:type     Country
y:Reese_Witherspoon        rdf:type     y:Actor
y:Reese_Witherspoon        BornOnDate   "1976-03-22"
y:Reese_Witherspoon        BornIn       y:New_Orleans_Louisiana
y:Reese_Witherspoon        hasName      "Reese Witherspoon"
y:New_Orleans_Louisiana    FoundYear    "1718"
y:New_Orleans_Louisiana    rdf:type     y:city
y:New_Orleans_Louisiana    locatedIn    y:United_States

Introduction - 2/4: RDF Graph
[Figure: RDF graph]

Introduction - 3/4
• What is SPARQL?
• Sample query:
    Select ?name Where {
      ?m <hasName> ?name.
      ?m <BornOnDate> "1809-02-12".
      ?m <DiedOnDate> "1865-04-15".
    }
• Query with wildcards:
    Select ?name Where {
      ?m <hasName> ?name.
      ?m <BornOnDate> ?bd.
      ?m <DiedOnDate> ?dd.
      FILTER (regex(str(?bd), "02-12") && regex(str(?dd), "04-15"))
    }

Introduction - 4/4
• Problems with existing solutions:
  – They cannot answer SPARQL queries with wildcards in a scalable manner
  – They cannot handle frequent updates in RDF repositories
• Answering with subgraph matching
  – Model the RDF data and the query as two graphs
  – Cannot use regular graph pattern matching
  – Answering a SPARQL query ≈ subgraph matching

Preliminaries
• RDF graph, G, is denoted as G=(V, LV, E, LE)
• Query graph, Q, is denoted as Q=(V, LV, E, LE)

Preliminaries Cont’d
• G(u1, u2, …, un) is a match of Q(v1, v2, …, vn) if:
  – if vi is a literal vertex, vi and ui have the same literal value
  – if vi is a class/entity vertex, vi and ui have the same URI
  – if vi is a parameter vertex, there is no constraint over ui
  – if vi is a wildcard vertex, vi is a substring of ui and ui is a literal value
  – if there is an edge from vi to vj in Q with the property p, there is also an edge from ui to uj in G with the same property p

Overview of gStore
• Work directly on the RDF graph and the SPARQL query graph
• Use a signature-based encoding of each entity and class vertex to speed up matching
• Filter and evaluate
  – Use a false-positive algorithm (one that may admit false positives) to prune nodes and obtain a set of candidates; then verify each candidate
• Use an index (VS*-tree) over the data signature graph for efficient pruning; the index has a light maintenance load

Storage Scheme & Encoding Technique
• Storage Scheme
[Figure: storage scheme]

Storage Scheme & Encoding Technique
• Encoding technique
[Figure: animation encoding the edge (hasName, "Abraham Lincoln"). The property hasName is mapped to a bit string; substrings of the literal such as "bra" are each mapped to a bit string; these are ORed into the literal's code, 1000 0100 0100 0000 in the example.]
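The encoding technique on these slides can be illustrated with a minimal Python sketch. It is not the gStore implementation: the bit widths (12 for the property, 16 for the literal) are read off the slide example, while the hash (crc32), the one-bit-per-token rule, the 3-character substrings, and all function names are assumptions made for illustration only.

```python
import zlib
from typing import Iterable, List, Tuple

PROP_BITS = 12     # assumed width of the property code (the slide example shows 12 bits)
LITERAL_BITS = 16  # assumed width of the literal code (the slide example shows 16 bits)


def ngrams(text: str, n: int = 3) -> List[str]:
    """All length-n substrings of text, e.g. "bra" for "Abraham"."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def one_bit(token: str, width: int) -> int:
    """Map a token to a bit string with exactly one bit set (crc32 is a stand-in hash)."""
    return 1 << (zlib.crc32(token.encode("utf-8")) % width)


def encode_literal(text: str) -> int:
    """OR together the bit strings of every substring of the literal."""
    sig = 0
    for gram in ngrams(text):
        sig |= one_bit(gram, LITERAL_BITS)
    return sig


def encode_edge(prop: str, value: str) -> int:
    """Concatenate the property code and the literal code into one edge signature."""
    return (one_bit(prop, PROP_BITS) << LITERAL_BITS) | encode_literal(value)


def encode_vertex(edges: Iterable[Tuple[str, str]]) -> int:
    """OR the signatures of all adjacent edges into the vertex signature."""
    sig = 0
    for prop, value in edges:
        sig |= encode_edge(prop, value)
    return sig


# Edges of y:Abraham_Lincoln from the triple table.
# Assumption: the URI object of DiedIn is hashed like a literal, for simplicity.
lincoln = encode_vertex([
    ("hasName", "Abraham Lincoln"),
    ("BornOnDate", "1809-02-12"),
    ("DiedOnDate", "1865-04-15"),
    ("DiedIn", "y:Washington_D.C"),
])

# A wildcard edge from the query: BornOnDate containing "02-12".
query_edge = encode_edge("BornOnDate", "02-12")
covered = (query_edge & lincoln) == query_edge

print(format(lincoln, "028b"), covered)
```

Because every 3-character substring of "02-12" is also a substring of "1809-02-12", the query edge can only set bits that the data vertex also sets, so the wildcard passes the signature test. The converse does not hold: unrelated strings may collide on the same bits, which is why candidates still need verification.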
Storage Scheme & Encoding Technique Cont’d
• Encoding technique
[Figure: each adjacent edge (hasName, "Abraham Lincoln"), (BornOnDate, "1809-02-12"), (DiedOnDate, "1865-04-15"), (DiedIn, y:Washington_D.C) is encoded as a bit string; the edge bit strings are ORed to form the vertex signature.]

Indexing Structure and Query Algorithm
• Data Signature Graph G*
• Converting Q to Q*
• Filter and Evaluate
  – Find matches of Q* over G* (CL)
  – Verify each match in RDF against G (RS)

Generating Candidate List (CL)
• Two-step process:
  – For each vertex vi ∈ V(Q*), find a list Ri = {ui1, ui2, ..., uin}, where every uij ∈ Ri satisfies uij ∈ V(G*) and vi & uij = vi
  – Do a multi-way join over the lists Ri to get the candidate list
• Use S-trees
  – Height-balanced tree over signatures
  – Does not support the second step, which stays expensive
• VS-tree and VS*-tree
  – Multi-resolution summary graph based on the S-tree
  – Supports both steps efficiently

S-tree Solution
[Figure: S-tree over the signatures of data vertices 001 to 008; animation in which the query signature is ANDed (&) with node signatures level by level to collect the matching vertices.]
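The first step of candidate generation, the test vi & ui = vi applied over an S-tree, can be sketched as follows. This is a toy under stated assumptions rather than the authors' index: the Node class, the helper names, the four leaf signatures, and the tree shape are all hypothetical. The only property relied on is that an inner node's signature is the OR of its children's signatures, so a subtree can be skipped as soon as its root fails the test.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """S-tree node: a leaf carries a vertex id, an inner node carries children."""
    signature: int
    vertex_id: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf(vertex_id: str, bits: str) -> Node:
    """Build a leaf from a bit string such as "0010 1000" (spaces are ignored)."""
    return Node(signature=int(bits.replace(" ", ""), 2), vertex_id=vertex_id)


def inner(*children: Node) -> Node:
    """Build an inner node whose signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.signature
    return Node(signature=sig, children=list(children))


def covers(data_sig: int, query_sig: int) -> bool:
    """The filter condition from the slides: query & data == query."""
    return query_sig & data_sig == query_sig


def search(node: Node, query_sig: int) -> List[str]:
    """Collect the vertex ids whose signature covers query_sig, pruning failed subtrees."""
    if not covers(node.signature, query_sig):
        return []  # no descendant can match, because the node ORs all of them
    if node.vertex_id is not None:
        return [node.vertex_id]
    found: List[str] = []
    for child in node.children:
        found.extend(search(child, query_sig))
    return found


# Hypothetical 8-bit signatures, chosen only for illustration.
tree = inner(
    inner(leaf("001", "0010 1000"), leaf("002", "1000 0100")),
    inner(leaf("003", "1001 1000"), leaf("004", "0001 0100")),
)

query = int("10000000", 2)
candidates = search(tree, query)
print(candidates)  # ['002', '003']
```

The result is one list Ri per query vertex. The multi-way join over those lists is not covered by this sketch; that second step is the one the plain S-tree does not support.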
VS-tree Solution
[Figure: VS-tree built on the S-tree nodes; animation of the summary matches found level by level: d11 × d11, d12 × d12, d13 × d23, 001 × 002.]

VS-tree Solution: limitations
• Each level is processed step by step
• If a level is dense, there are many summary matches => larger search space

Possible Optimization Methods
• “Magically” know which level to begin with, to minimize the number of summary matches
• Use DFS (depth-first search) to find the valid child nodes
• While inserting vertices, consider not only the Hamming distance but also the number of super edges introduced

Optimization example
[Figure: optimization example]

Experimental results - Exact queries
• Yago network (20 million triples, size 3.1 GB)
[Figure: results by query (axis: Queries) for gStore, RDF-3x, SW-Store, x-RDF-3x, BigOWLIM, GRIN]

Experimental results - Wildcard queries
[Figure: results by query (axis: Queries) for gStore, x-RDF-3x, RDF-3x, BigOWLIM, SW-Store, GRIN]

Conclusion
• This approach:
  – Uses two novel indexes, VS-tree and VS*-tree, to speed up query processing
  – Was also able to solve the two problems with existing solutions:
    • answers SPARQL queries with wildcards in a scalable manner
    • handles frequent and online updates in RDF repositories

Questions?