Storing, Indexing and Querying Large Provenance
Data Sets as RDF Graphs in Apache HBase
Artem Chebotko
Joint work with
John Abraham and Pearl Brazier
University of Texas – Pan American
Anthony Piazza
Piazza Consulting
Andrey Kashlev and Shiyong Lu
Wayne State University
7th IEEE International Workshop on Scientific Workflows, July 2, 2013
1
Provenance in eScience
 Metadata that captures history of an experiment
 Problem diagnosis
 Result interpretation
 Experiment reproducibility
 Scientific Workflow Community Provenance Challenges
 2006: understanding and sharing information about
provenance representations and capabilities
 2006: interoperability of different provenance systems
 2009: evaluating various aspects of OPM
 2010: showcase OPM in the context of novel applications
 Open Provenance Model (2007 - 2010)
 PROV-DM: The PROV Data Model (W3C
Recommendation 30 April 2013)
2
SWFMS and Provenance
 Support provenance collection
 Use proprietary or third-party systems to manage
provenance
 Differ in provenance models, provenance vocabularies,
inference support, and query languages.
 May eventually converge to W3C PROV specifications
 Taverna
 Galaxy
 Kepler
 Triana
 VIEW
 OPMProv
 VisTrails
 Karma
 Pegasus
 RDFProv
 Swift
 etc.
3
Sample OPM Provenance Graph
 Nodes:
 artifacts
 processes
 agents
 Edges:
 used
 wasGeneratedBy
 wasControlledBy
 wasTriggeredBy
 wasDerivedFrom
[Figure: sample OPM provenance graph. Node labels: Create Table SQL Statements, Create Index SQL Statements, Create Trigger SQL Statements, Create Database Schema, Schema, Dataset, Load Data, Instance]
4
Sample Graph Serialization: OPMV
and Terse RDF Triple Language
utpb:schema   rdf:type opmv:Artifact .
utpb:instance rdf:type opmv:Artifact .
utpb:dataset  rdf:type opmv:Artifact .
utpb:loadData rdf:type opmv:Process .
utpb:loadData opmv:used utpb:schema,
                        utpb:dataset .
utpb:instance opmv:wasGeneratedBy utpb:loadData .
utpb:instance opmv:wasDerivedFrom utpb:schema,
                                  utpb:dataset .
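The serialization on this slide amounts to nine triples once the comma-separated object lists are expanded. As a minimal sketch (not the authors' code), the same graph can be held as Python tuples and matched against simple triple patterns. The `match` helper and the variable names are hypothetical and only illustrate how a triple pattern selects triples.

```python
# The sample graph as (subject, predicate, object) tuples.
# Object lists written with "," in Turtle are expanded into separate triples.
TRIPLES = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:dataset"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# All artifacts in the graph, and everything utpb:instance was derived from.
artifacts = sorted(t[0] for t in match(TRIPLES, p="rdf:type", o="opmv:Artifact"))
sources = sorted(t[2] for t in match(TRIPLES, s="utpb:instance", p="opmv:wasDerivedFrom"))
print(artifacts)
print(sources)
```

A linear scan like this is fine for one small graph; the indexing scheme later in the talk targets the case where scanning is too slow.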
5
Provenance Serialization and
Querying
 Both OPM and PROV-DM can be serialized in RDF
 Queried in SPARQL
Find all artifacts and their values, if any, in a provenance graph
with identifier http://cs.panam.edu/utpb#opmGraph
6
This Work - Motivation
 Single provenance graph as an RDF graph
 In general, readily manageable in main memory of a single
machine
 Hundreds of thousands or even millions of provenance
graphs as a provenance (RDF) dataset
 Challenging to manage
 Our Focus/Problem: Efficient and scalable storage and
querying of large collections of provenance graphs
serialized as RDF graphs (in an Apache HBase database)
7
This Work - Contributions
 Novel storage and indexing schemes for RDF data in
HBase that are suitable for provenance datasets
 Novel and efficient querying algorithms to evaluate
SPARQL queries in HBase that are optimized to make use
of bitmap indices and numeric values instead of triples
 Empirical evaluation of our approach using provenance
graphs and test queries of the University of Texas
Provenance Benchmark (UTPB)
8
Talk Outline
 RDF Data and Queries
 Indexing Scheme
 Storage Scheme
 Query Processing
 Performance Study
 Related Work
 Summary and Future work
9
RDF Data and Queries
10
Indexing Scheme
 Selection Indices: Is, Ip, Io
 Find a triple with known s, p and o:
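The lookup expression from this slide did not survive extraction. One plausible reading, offered here only as an assumption, is that each selection index maps a term to a bitmap over triple positions, and that a triple with known s, p and o is found by intersecting three bitmaps. The sketch below uses Python integers as bitmaps; the function names and the sample data are hypothetical.

```python
# Sample triples; position i in this list is bit i in every bitmap.
TRIPLES = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:dataset"),
]


def build_selection_indices(triples):
    """Build Is, Ip, Io (assumed layout): term -> bitmap of triple positions."""
    i_s, i_p, i_o = {}, {}, {}
    for pos, (s, p, o) in enumerate(triples):
        bit = 1 << pos
        i_s[s] = i_s.get(s, 0) | bit
        i_p[p] = i_p.get(p, 0) | bit
        i_o[o] = i_o.get(o, 0) | bit
    return i_s, i_p, i_o


def positions(bitmap):
    """List the positions of the bits set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def select(indices, n, s=None, p=None, o=None):
    """Intersect the bitmaps of the known terms; unknown terms are skipped."""
    i_s, i_p, i_o = indices
    result = (1 << n) - 1  # start with all n triples as candidates
    if s is not None:
        result &= i_s.get(s, 0)
    if p is not None:
        result &= i_p.get(p, 0)
    if o is not None:
        result &= i_o.get(o, 0)
    return positions(result)


INDICES = build_selection_indices(TRIPLES)
exact = select(INDICES, len(TRIPLES), "utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData")
used = select(INDICES, len(TRIPLES), p="opmv:used")
print(exact, used)
```

The same intersection handles patterns with one or two known terms, which is why a single `select` covers all three indices.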
12
Indexing Scheme
 Join Indices: Iss, Iso, Ios, Ioo
 Find triples with the same object as subject in triple at
position i:
Iso(i)
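The slide defines only Iso(i). The sketch below extends that definition to the other three join indices by analogy, which is an assumption: the first letter names the component taken from the triple at position i, and the second names the component it is compared with in every other triple. Bitmaps are plain Python integers, a triple may match itself in Iss and Ioo, and all helper names are hypothetical.

```python
TRIPLES = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:dataset"),
]

S, O = 0, 2  # tuple offsets of subject and object


def build_join_index(triples, own, other):
    """For each position i, a bitmap of triples whose `other` component
    equals the `own` component of the triple at position i."""
    by_value = {}
    for j, t in enumerate(triples):
        by_value[t[other]] = by_value.get(t[other], 0) | (1 << j)
    return [by_value.get(t[own], 0) for t in triples]


def positions(bitmap):
    """List the positions of the bits set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


I_SS = build_join_index(TRIPLES, S, S)
I_SO = build_join_index(TRIPLES, S, O)  # same object as the subject of triple i
I_OS = build_join_index(TRIPLES, O, S)
I_OO = build_join_index(TRIPLES, O, O)

# Triple 4 has subject utpb:loadData; triple 6 is the only one with that object.
print(positions(I_SO[4]))
# Triple 6 has object utpb:loadData; triples 3, 4 and 5 have that subject.
print(positions(I_OS[6]))
```

Because the index is precomputed per position, a join between two triple patterns reduces to a bitmap lookup followed by an intersection.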
13
Storage Scheme
 One table with two column families for data and indices
 Each row stores one complete provenance graph
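To make the layout concrete, here is a minimal in-memory sketch of one row per provenance graph with two column families. Only the "one table, two column families, one graph per row" structure comes from the slide. The family names `data` and `index`, the column qualifiers, the use of the graph identifier as row key, and the dictionary encoding of terms as integers are all assumptions made for illustration, and no HBase API is used.

```python
def encode_graph(triples):
    """Dictionary-encode terms as integers (assumed encoding)."""
    terms, ids, encoded = [], {}, []
    for triple in triples:
        row = []
        for term in triple:
            if term not in ids:
                ids[term] = len(terms)
                terms.append(term)
            row.append(ids[term])
        encoded.append(tuple(row))
    return terms, encoded


def graph_to_row(graph_id, triples):
    """Build (row key, row) with a 'data' and an 'index' column family."""
    terms, encoded = encode_graph(triples)
    i_p = {}  # predicate id -> bitmap of triple positions
    for pos, (_, p, _) in enumerate(encoded):
        i_p[p] = i_p.get(p, 0) | (1 << pos)
    row = {
        "data": {"terms": terms, "triples": encoded},
        "index": {"Ip": i_p},
    }
    return graph_id, row


GRAPH_ID = "http://cs.panam.edu/utpb#opmGraph"
SAMPLE = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
]

# The "table" is a dict keyed by row key; each value is one complete graph.
table = dict([graph_to_row(GRAPH_ID, SAMPLE)])
print(table[GRAPH_ID]["data"]["triples"])
```

Keeping a graph and its indices in the same row means a query over one graph needs a single row read, which matches the motivation of minimizing data transfers.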
14
Query Processing
 Four efficient algorithms/functions:
 application of selection indices
 application of join indices
 handling of special cases not supported by the indices
 basic graph pattern evaluation
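The algorithms themselves are not reproduced in this transcript. As a rough sketch of how the first two steps could combine, the code below evaluates one fixed basic graph pattern, `{ ?x rdf:type opmv:Artifact . ?x opmv:wasDerivedFrom ?y }`, by selecting candidates with bitmaps and joining them on a shared subject. The index layout, the function names, and the choice of pattern are assumptions, not the authors' method.

```python
TRIPLES = [
    ("utpb:schema", "rdf:type", "opmv:Artifact"),
    ("utpb:instance", "rdf:type", "opmv:Artifact"),
    ("utpb:dataset", "rdf:type", "opmv:Artifact"),
    ("utpb:loadData", "rdf:type", "opmv:Process"),
    ("utpb:loadData", "opmv:used", "utpb:schema"),
    ("utpb:loadData", "opmv:used", "utpb:dataset"),
    ("utpb:instance", "opmv:wasGeneratedBy", "utpb:loadData"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:schema"),
    ("utpb:instance", "opmv:wasDerivedFrom", "utpb:dataset"),
]


def term_bitmaps(triples, component):
    """Selection index for one component: term -> bitmap of positions."""
    index = {}
    for pos, t in enumerate(triples):
        index[t[component]] = index.get(t[component], 0) | (1 << pos)
    return index


def positions(bitmap):
    """List the positions of the bits set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


I_S, I_P, I_O = (term_bitmaps(TRIPLES, c) for c in range(3))
# Subject-subject join index: for triple i, the triples sharing its subject.
I_SS = [I_S[t[0]] for t in TRIPLES]


def select(p, o=None):
    """Apply selection indices for a pattern with known predicate (and object)."""
    bitmap = I_P.get(p, 0)
    if o is not None:
        bitmap &= I_O.get(o, 0)
    return bitmap


def artifacts_with_sources():
    """Evaluate { ?x rdf:type opmv:Artifact . ?x opmv:wasDerivedFrom ?y }."""
    left = select("rdf:type", "opmv:Artifact")
    right = select("opmv:wasDerivedFrom")
    results = []
    for i in positions(left):
        # Join step: keep right-hand candidates that share the subject of triple i.
        for j in positions(I_SS[i] & right):
            results.append((TRIPLES[i][0], TRIPLES[j][2]))
    return results


bindings = artifacts_with_sources()
print(bindings)
```

The inner loop touches only positions that survive the bitmap intersection, so triples are materialized only for actual results.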
15
Performance Study
 Implementation
 Java, Hadoop 1.0.0, HBase 0.94
 Cluster setup
 One HBase Master
 Eight HBase Region Servers
 All commodity machines
 Benchmark – UTPB (5 datasets, 11 queries)
22
Performance Study
 Q1 – simplest, yet most expensive query due to a large
result set
 Q1. Find all provenance graph identifiers.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT * WHERE { ?graph rdf:type owl:Thing . }
23
Performance Study
 Q2 – Q11 – different complexity, yet similar performance
 Example: Q8. Find all artifacts and their values, if any, in a
particular provenance graph.
PREFIX opmv: <http://purl.org/net/opmv/ns#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX opmo: <http://openprovenance.org/model/opmo#>
PREFIX utpb: <http://cs.panam.edu/utpb#>
SELECT ?artifact ?value
FROM NAMED <http://cs.panam.edu/utpb#opmGraph>
WHERE {
  GRAPH utpb:opmGraph {
    ?artifact rdf:type opmv:Artifact .
    OPTIONAL { ?artifact opmo:annotation ?annotation .
               ?annotation opmo:property ?property .
               ?property opmo:value ?value . } .
    OPTIONAL { ?artifact opmo:avalue ?artifactValue .
               ?artifactValue opmo:content ?value . } .
  }
}
24
Performance Study
 Please see the paper for the other queries: all are efficient and
scalable (query times stay nearly constant as the dataset grows, due to
minimal data transfers and fast index-based join processing)
25
Related Work
 HBase, BigTable, Cassandra
 Hadoop, Hive, Pig, CouchDB, MongoDB, etc.
 NoSQL solutions to RDF data management
 Provenance management systems
 RDF data indexing
26
Summary and Future Work
 Designed novel storage and indexing schemes for RDF
data in HBase that are suitable for provenance datasets
 Empirical evaluation results are promising
 Future work
 Compare, compare, compare
 More experiments with multi-user workloads
 More optimizations
 PROV-DM benchmark anyone?
27
THANK YOU! Questions?
 My contact information:
 Artem Chebotko, Department of Computer Science,
University of Texas – Pan American
 http://www.cs.panam.edu/~artem
28