Answering Regular Path Queries on Workflow Provenance Xiaocheng Huang*, Zhuowei Bao, Susan B. Davidson, Tova Milo, Xiaojie Yuan Genome Institute of Singapore, Facebook Inc., U of Pennsylvania, Tel Aviv U, Nankai U Abstract Example This paper proposes a novel approach for efficiently evaluating regular path queries over provenance graphs of workflows that may include recursion. The approach assumes that an execution g of a workflow G is labeled with query-agnostic reachability labels using an existing technique. At query time, given g, G and a regular path query R, the approach decomposes R into a set of subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is rewritten so that, using the reachability labels of nodes in g, whether or not there is a path which matches Ri between two nodes can be decided in constant time. The results of each safe subquery are then composed, possibly with some small unsafe remainder, to produce an answer to R. The approach results in an algorithm that significantly reduces the number of subqueries k over existing techniques by increasing their size and complexity, and that evaluates each subquery in time bounded by its input and output size. Experimental results demonstrate the benefit of this approach. Example: Consider the following provenance graph g derived from workflow G, and query R=(_)*e(_)*. GR is the intersection between G and R. One can check that R is safe w.r.t. G, so we can draw internal edges for composite modules of GR (in capital letters). R(c:1, b:1) can be decided easily by looking at W1’ in GR. Noted that things become more complex when recursions are involved, which is omitted here. Workflow Specification G GR: Intersection of G and R Introduction What and why is querying workflow provenance? Workflow (Specification) ----- Context-free Graph Grammar Genetic Information r0 (Alternation + Recursion) Genetic Disorder Risk G:Determine Genetic Susceptibility r1 Expand SNPs r2 Consult DB r3 23AndMe Expand SNPs Hapmap Expand SNPs Run2: Run3: 23AndMe 23AndMe d21 HapMap d31 HapMap d22 Consult DB Check Results d32 d23 d3n …. Provenance graph g Experimental Results Data: two realistic workflows BioAID and QBLast Queries: --IFQs: R=(_)*a1(_)*a2…ak(_)* , which favors existing approaches --Kleene Star: R=a*, which favors our approach --General queries: combination of the above and edge labels Check Results Workflow Executions/Provenance d11 d12 Run1: 23AndMe Check Results RRQ R=(_)*e(_)* and its DFA Consult DB Consult DB Figure 1. A realistic workflow and its executions/runs/provenance • After observing erroneous data when the workflow is running, answer questions like •“Is the workflow design wrong?” or • “Are all the data produced through M1(M2|M3)(_)*M49 wrong? (Mi is a module)“ • i.e. answering regular path queries Exp-1: Time overhead (of checking query safety) Result: Negligible (<200 ms) Exp-2: Query time of pairwise safe RPQs Result: Constant w.r.t. run size Exp-3: Query time of all-pairs safe RPQs Result: 1) We outperform baseline when queries are not too selective 2) We achieve a major gain for Kleene Stars Problem Statement Consider a graph g derived from an edge-labeled context-free graph grammar G, and regular path query R. (1) Pairwise regular path query decide whether there exists a path between two nodes u, v in g such that the concatenation the edge labels along the path satisfies query R, denoted by R(u,v)=true. (2) All-pairs regular path query find all node pairs {(u,v)} in two given node list l1 and l2 such that R(u,v)=true. Approach Overview Query Time: RPQ R R1 R2 vn … ln Rk GR2 GR1 G … GRk Workflow Execution Time v1 l1 l2 Query Time: Reachability Query V1---> V2 ? V2---> V3 ? V3---> V1 ? … Vi –Rk--> Vj? Vi –R2--> Vj? Vi –R1--> Vj? Compose result Vi –R--> Vj? … Theoretical Results For a subclass of queries, namely safe queries, the time complexity is: (1) Pairwise safe regular path queries: O(1) (assuming |G|<<|g|) (2) All-pairs safe regular path queries: O(|l1|+|l2|+N) where N is the number of reachable node pairs in l1 × l2 RESEARCH POSTER PRESENTATION DESIGN © 2012 www.PosterPresentations.com Exp-4: Query time of all-pairs general RPQs Result: 1) For BioAID: we achieve improvement of 75% (31/40) of the unsafe queries (shown below); significant improvement (> 40%) of 60% (19/31) of these queries. 2) For QBLast: improvement over 25% (13/40) of the unsafe queries (shown below) Decompose R into safe Ri Prior Work VLDB2012: Reachability labeling v2 Figure 2a. Query Time of All-pairs Safe IFQs Figure 2b. Query Time of All-pairs Safe Kleene Stars Figure 3. Query Time of All-pairs General Queries (on BioAID and QBLast) Conclusions and Future Work Our proposed technique is especially useful for queries that generate a lot of intermediate results and thus could be a very useful component in a cost-based query optimizer that uses statistical information to choose the right query plan and would significantly reduce the evaluation cost of lowly selective subqueries. Future work includes building a cost-model to predicate intermediate result size and query rewriting when taking workflow specification into account. *The work was done when the author was visiting UPenn Contact: huangxc@gis.a-star.edu.sg