Xiaocheng Huang*, Zhuowei Bao, Susan B. Davidson, Tova Milo

advertisement
Answering Regular Path Queries on Workflow Provenance
Xiaocheng Huang*, Zhuowei Bao, Susan B. Davidson, Tova Milo, Xiaojie Yuan
Genome Institute of Singapore, Facebook Inc., U of Pennsylvania, Tel Aviv U, Nankai U
Abstract
Example
This paper proposes a novel approach for efficiently evaluating regular path queries over
provenance graphs of workflows that may include recursion.
The approach assumes that an execution g of a workflow G is labeled with query-agnostic
reachability labels using an existing technique. At query time, given g, G and a regular path
query R, the approach decomposes R into a set of subqueries R1, ..., Rk that are safe for G. For
each safe subquery Ri, G is rewritten so that, using the reachability labels of nodes in g, whether
or not there is a path which matches Ri between two nodes can be decided in constant time.
The results of each safe subquery are then composed, possibly with some small unsafe
remainder, to produce an answer to R.
The approach results in an algorithm that significantly reduces the number of subqueries k over
existing techniques by increasing their size and complexity, and that evaluates each subquery in
time bounded by its input and output size. Experimental results demonstrate the benefit of this
approach.
Example: Consider the following provenance graph g derived from workflow G, and query
R=(_)*e(_)*. GR is the intersection between G and R. One can check that R is safe w.r.t. G, so
we can draw internal edges for composite modules of GR (in capital letters). R(c:1, b:1) can be
decided easily by looking at W1’ in GR. Noted that things become more complex when
recursions are involved, which is omitted here.
Workflow Specification G
GR: Intersection of G and R
Introduction
What and why is querying workflow provenance?
Workflow (Specification) ----- Context-free Graph Grammar
Genetic Information
r0
(Alternation + Recursion)
Genetic Disorder Risk
G:Determine Genetic Susceptibility
r1
Expand SNPs
r2
Consult DB
r3
23AndMe
Expand SNPs
Hapmap
Expand SNPs
Run2:
Run3:
23AndMe
23AndMe
d21
HapMap
d31
HapMap
d22
Consult DB
Check Results
d32
d23
d3n
….
Provenance graph g
Experimental Results
Data: two realistic workflows BioAID and QBLast
Queries:
--IFQs: R=(_)*a1(_)*a2…ak(_)* , which favors existing approaches
--Kleene Star: R=a*, which favors our approach
--General queries: combination of the above and edge labels
Check Results
Workflow Executions/Provenance
d11
d12
Run1:
23AndMe
Check Results
RRQ R=(_)*e(_)* and its DFA
Consult DB
Consult DB
Figure 1. A realistic workflow and its executions/runs/provenance
• After observing erroneous data when the workflow is running, answer questions like
•“Is the workflow design wrong?” or
• “Are all the data produced through M1(M2|M3)(_)*M49 wrong? (Mi is a module)“
• i.e. answering regular path queries
Exp-1: Time overhead (of checking query safety)
Result: Negligible (<200 ms)
Exp-2: Query time of pairwise safe RPQs
Result: Constant w.r.t. run size
Exp-3: Query time of all-pairs safe RPQs
Result: 1) We outperform baseline when queries are not too selective
2) We achieve a major gain for Kleene Stars
Problem Statement
Consider a graph g derived from an edge-labeled context-free graph grammar G, and regular
path query R.
(1) Pairwise regular path query
decide whether there exists a path between two nodes u, v in g such that the concatenation the
edge labels along the path satisfies query R, denoted by R(u,v)=true.
(2) All-pairs regular path query
find all node pairs {(u,v)} in two given node list l1 and l2 such that R(u,v)=true.
Approach Overview
Query Time: RPQ R
R1
R2
vn
…
ln
Rk
GR2
GR1
G
…
GRk
Workflow Execution Time
v1 l1
l2
Query Time: Reachability Query
V1---> V2 ?
V2---> V3 ?
V3---> V1 ?
…
Vi –Rk--> Vj?
Vi –R2--> Vj?
Vi –R1--> Vj?
Compose result
Vi –R--> Vj? …
Theoretical Results
For a subclass of queries, namely safe queries, the time complexity is:
(1) Pairwise safe regular path queries: O(1) (assuming |G|<<|g|)
(2) All-pairs safe regular path queries: O(|l1|+|l2|+N) where N is the number of reachable node
pairs in l1 × l2
RESEARCH POSTER PRESENTATION DESIGN © 2012
www.PosterPresentations.com
Exp-4: Query time of all-pairs general RPQs
Result: 1) For BioAID: we achieve improvement of 75% (31/40) of the unsafe queries (shown
below); significant improvement (> 40%) of 60% (19/31) of these queries.
2) For QBLast: improvement over 25% (13/40) of the unsafe queries (shown below)
Decompose R into safe Ri
Prior Work VLDB2012:
Reachability labeling
v2
Figure 2a. Query Time of All-pairs Safe IFQs Figure 2b. Query Time of All-pairs Safe Kleene Stars
Figure 3. Query Time of All-pairs General Queries (on BioAID and QBLast)
Conclusions and Future Work
Our proposed technique is especially useful for queries that generate a lot of intermediate
results and thus could be a very useful component in a cost-based query optimizer that uses
statistical information to choose the right query plan and would significantly reduce the
evaluation cost of lowly selective subqueries.
Future work includes building a cost-model to predicate intermediate result size and query
rewriting when taking workflow specification into account.
*The work was done when the author was visiting UPenn
Contact: huangxc@gis.a-star.edu.sg
Download