Provenance Management in a Collection-Oriented Scientific Workflow Framework
aka Kepler/DAKS
(for Luc’s collection: before: “We do provenance!”; now: “ … and it almost killed us!”)

Shawn Bowers, Timothy McPhillips, Bertram Ludaescher
in collaboration with Ilkay Altintas, Norbert Podhorszki

Provenance Challenge @ GGF18 Kepler/COW+RWS, Bowers, McPhillips et al.

Goals for the Provenance Challenge

Implement an RWS-style provenance model for Collection-Oriented Scientific Workflows
• Take advantage of Collection-Oriented SWFs to
  – Automatically infer state-reset events
  – Reduce the number of provenance-relevant events that need to be recorded (keep it minimal)
  – Simplify association of traces and provenance into one self-contained “trace” file for input, output, and dependencies
• Support science-oriented provenance and queries
  – Emphasize data dependencies (lineage) as well as process details
• Decouple provenance representation from particular scientific workflow technology (Kepler)

Collection-Oriented Workflows

Generic support for workflows that operate over nested data collections (trees)
• Abstract Model
  – Actors receive input trees, read contents of subtrees matching some criteria (scope), and optionally add or delete subtree nodes
  – Each scope instance corresponds to one actor invocation

[Figure: AlignWarp actor with Scope = AnatomyImage, applied to AnatomyImage collections containing Image, ReferenceImage, and WarpParamSet nodes]

Collection-Oriented Workflows

• Kepler Implementation
  – Collections are serialized within heterogeneous token streams
  – Actor execution is pipelined based on each actor’s scope
  – Enables concurrent processing of nested data collections
  – Collections can contain data, metadata, actor parameters, and other collections
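The abstract model above (an actor reads subtrees matching its scope, is invoked once per scope instance, and may add nodes) can be illustrated with a small sketch. This is not the Kepler implementation: the names `Node`, `scope_instances`, `run_actor`, and `align_warp` are hypothetical, the tree is held in memory rather than streamed as tokens, and the assumption that AlignWarp adds a WarpParamSet node is read off the figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """A node in a nested collection tree (a collection or a data item)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def scope_instances(root: Node, scope: str) -> Iterator[Node]:
    """Yield every subtree whose root label matches the actor's scope."""
    if root.label == scope:
        yield root
        return  # assumption: a matched scope is not searched for nested matches
    for child in root.children:
        yield from scope_instances(child, scope)


def run_actor(root: Node, scope: str, invoke: Callable[[Node, int], None]) -> int:
    """Invoke the actor once per scope instance; return the invocation count."""
    count = 0
    # Materialize the matches first so the actor can safely modify the tree.
    for instance in list(scope_instances(root, scope)):
        count += 1
        invoke(instance, count)
    return count


def align_warp(anatomy_image: Node, invocation: int) -> None:
    """Toy stand-in for AlignWarp: reads Image and ReferenceImage, adds WarpParamSet."""
    labels = [child.label for child in anatomy_image.children]
    if "Image" in labels and "ReferenceImage" in labels:
        anatomy_image.children.append(Node("WarpParamSet"))


# Hypothetical input: one collection holding two AnatomyImage scope instances.
tree = Node("ImageCollection", [
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
])
invocations = run_actor(tree, "AnatomyImage", align_warp)
```

Here `run_actor` performs two invocations, one per AnatomyImage subtree, and each subtree ends up with an added WarpParamSet child. A pipelined implementation would process scope instances as they arrive in the token stream instead of walking a complete tree.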
Collection-Oriented Provenance Challenge SWF

• Input data is read by collection reader
  – Execution driven by number and size of anatomy image sets specified by XML file
• Slicer configured on the fly via parameter tokens
  – E.g. to create the 3 slices required for each image set
• Output trace serialized into XML by collection writer
  – Trace implicitly contains input data, output data, and lineage

Collection-Oriented Provenance

• Data Dependencies
  – Insertion and deletion events capture actor, invocation count, and direct data dependencies
• Process Dependencies
  – Invocation dependencies record which steps created data or modified collections used by another actor invocation
• Embedded Provenance Tokens
  – Data and invocation dependencies stored as tokens within the stream
  – Actor API for declaring data dependencies
  – Invocation dependencies added automatically

[Figure: AnatomyImage collection containing Image, ReferenceImage, and WarpParamSet, annotated with insertion dependencies]

Minimal Provenance Information

[Figure: the same trace shown without provenance and with provenance]

Querying Collection-Oriented Provenance

Execution traces imply provenance graphs

[Figure: provenance graphs over Image (311), Header (312), AtlasImage (308), and AtlasSlice (337), with edges labeled Slicer : 1; panels: Data/Collection creation lineage, Collection “last version” lineage]

Graph edges encode data lineage and process relations
    Lineage(Trace, Node, DependentNode, Actor, InvocCount)

Provenance operations work over traces and graphs:
    Input(Trace, Node)
    Output(Trace, Node)
    Param(Trace, Name, Value, Actor, InvocCount)
    Metadata(Trace, Key, Value, Node)
    etc.
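To make the idea of lineage edges over a trace concrete, here is a minimal sketch of a backward lineage traversal. It is an illustration only: the `Edge` record, its field names and direction (`node` was derived from `source`), the function `lineage_edges`, the node names, and the actor `ToyConvert` are all assumptions, not the representation used in the actual trace files.

```python
from collections import namedtuple
from typing import Dict, Iterable, List, Set

# One direct dependency: `node` was derived from `source` by invocation
# `invoc` of `actor`. Field names and direction are assumptions.
Edge = namedtuple("Edge", ["node", "source", "actor", "invoc"])


def lineage_edges(trace: Iterable[Edge], targets: Iterable[str]) -> Set[Edge]:
    """Collect every dependency edge reachable backwards from the targets."""
    by_node: Dict[str, List[Edge]] = {}
    for edge in trace:
        by_node.setdefault(edge.node, []).append(edge)

    found: Set[Edge] = set()
    pending: List[str] = list(targets)
    visited: Set[str] = set()
    while pending:
        node = pending.pop()
        if node in visited:
            continue
        visited.add(node)
        for edge in by_node.get(node, []):
            found.add(edge)
            pending.append(edge.source)
    return found


# Hypothetical toy trace with two independent derivation chains.
trace = [
    Edge("slice_a", "image_a", "Slicer", 1),
    Edge("slice_a", "header_a", "Slicer", 1),
    Edge("graphic_a", "slice_a", "ToyConvert", 1),
    Edge("slice_b", "image_b", "Slicer", 2),
    Edge("graphic_b", "slice_b", "ToyConvert", 2),
]

edges_a = lineage_edges(trace, ["graphic_a"])
# Nodes that appear only as sources are the original inputs of graphic_a.
inputs_a = {e.source for e in edges_a} - {e.node for e in edges_a}
```

In this toy trace, the lineage of `graphic_a` contains three edges and reaches only `image_a` and `header_a`, so nothing from the second chain is included. Passing several target nodes at once yields the union of their lineage edges.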
Challenge Results

We used two different runs
  – Each run has embedded metadata and parameter settings
  – First run equivalent to challenge workflow
  – Second run contains three image collections with different numbers of images

[Figure: input to first run: WorkflowInput containing one ImageCollection of four AnatomyImage collections, each holding an image and its header (Image1/Header1 through Image4/Header4) plus a Reference Image and its Header]

[Figure: input to second run: WorkflowInput containing three ImageCollection nodes, each holding a different number of AnatomyImage collections]

Challenge Results (Trace 1)

Full Data Dependencies

Query:
    ?- trace(1, T),
       nodeId(T, 341, N1), nodeId(T, 349, N2), nodeId(T, 357, N3),
       lineageEdges(T, [N1, N2, N3], Edges),
       drawEdges(Edges).

Challenge Results (Trace 1)

• Question 1: Process that led to Atlas X Graphic

Returns subset of lineage edges

Query:
    ?- trace(1, T),
       nodeId(T, 341, N),
       lineageEdges(T, N, Edges),
       drawEdges(Edges).

Challenge Results (Trace 2)

• Question 1: Process that led to Atlas X Graphic

Single workflow run in which not all outputs depend on all inputs.

Query:
    ?- trace(2, T),
       nodeId(T, 973, N1), nodeId(T, 1093, N2), nodeId(T, 1193, N3),
       lineageEdges(T, [N1, N2, N3], Edges),
       drawEdges(Edges).
Summary

Benefits of our approach
  – Provenance support for Collection-Oriented SWFs
  – Minimal provenance information stored in self-contained trace file
  – Provenance automatically embedded within data stream, simple actor provenance API
  – Able to answer provenance challenge queries using simple operations (see Wiki entry)
  – Note that we ignored Question 7

Suggestion for Future Provenance Challenge
  – More complex/realistic workflows (e.g., from Bioinformatics)
    • Loops, nesting, partial dependencies, concurrency
  – More “scientist-oriented” provenance queries
    • Explicit queries for data dependencies (e.g., see Wiki entry)
    • Assume user doesn’t know the structure of the trace (Queries 5)

References

• Timothy McPhillips and Shawn Bowers. An Approach for Pipelining Nested Collections in Scientific Workflows. SIGMOD Record 34, 12-17, 2005.
• Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, Shirley Cohen, Susan B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW'06), 2006.
• Timothy McPhillips, Shawn Bowers, Bertram Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. 3rd International Workshop on Data Integration in the Life Sciences (DILS'06), 2006.