Collection-Oriented Workflows

Provenance Management in a COllection-oriented
Scientific Workflow Framework
aka Kepler/DAKS
(for Luc’s collection:
before: “We do provenance!”;
now: “ … and it almost killed us!”)
Shawn Bowers
Timothy McPhillips
Bertram Ludaescher
in collaboration with
Ilkay Altintas
Norbert Podhorszki
Provenance Challenge @ GGF18
Kepler/COW+RWS, Bowers, McPhillips et al.
Goals for the Provenance Challenge
Implement an RWS-style provenance model for Collection-Oriented Scientific Workflows
• Take advantage of Collection-Oriented SWFs to
– Automatically infer state-reset events
– Reduce the number of provenance-relevant events that need to be
recorded (keep it minimal)
– Simplify association of traces and provenance into one self-contained “trace” file for input, output, and dependencies
• Support science-oriented provenance and queries
– Emphasize data dependencies (lineage) as well as process details
• Decouple provenance representation from particular
scientific workflow technology (Kepler)
Collection-Oriented Workflows
Generic support for workflows that
operate over nested data collections (trees)
• Abstract Model
– Actors receive input trees, read contents of subtrees
matching some criteria (scope), and optionally add or
delete subtree nodes
– Each scope instance corresponds to one actor
invocation
[Figure: AlignWarp actor with Scope = AnatomyImage. Each AnatomyImage collection holds an Image and a ReferenceImage; after the actor runs it also holds a WarpParamSet.]
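The abstract model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Kepler/COW implementation: the `Node` class, `find_scopes`, `run_actor`, and the toy `align_warp` actor are hypothetical names introduced here. The sketch shows one actor invocation per subtree matching the scope, with the actor adding a node to the subtree it receives.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A labeled tree node; collections have children, data items do not."""
    label: str
    children: List["Node"] = field(default_factory=list)


def find_scopes(root: Node, scope_label: str) -> List[Node]:
    """Return every subtree whose root label matches the scope, in document order."""
    matches = [root] if root.label == scope_label else []
    for child in root.children:
        matches.extend(find_scopes(child, scope_label))
    return matches


def run_actor(root: Node, scope_label: str,
              actor: Callable[[Node, int], None]) -> int:
    """Invoke the actor once per scope instance; return the invocation count."""
    count = 0
    for subtree in find_scopes(root, scope_label):
        count += 1
        actor(subtree, count)
    return count


def align_warp(anatomy_image: Node, invocation: int) -> None:
    """Hypothetical actor: reads the scope's contents and adds one new node."""
    labels = [child.label for child in anatomy_image.children]
    if "Image" in labels and "ReferenceImage" in labels:
        anatomy_image.children.append(Node("WarpParamSet"))


# Hypothetical input: one collection holding two AnatomyImage scopes.
workflow_input = Node("ImageCollection", [
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
    Node("AnatomyImage", [Node("Image"), Node("ReferenceImage")]),
])
invocations = run_actor(workflow_input, "AnatomyImage", align_warp)
```

Nodes outside any matching scope are never handed to the actor, which is what lets the framework pass them downstream untouched.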
Collection-Oriented Workflows
• Kepler Implementation
– Collections are serialized within
heterogeneous token streams
– Actor execution is pipelined based on each actor’s
scope
– Enables concurrent processing of nested data
collections
– Collections can contain data, metadata, actor
parameters, and other collections
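One way to picture the serialization of nested collections into a token stream is with explicit open and close delimiter tokens. The sketch below assumes that encoding; the token kinds `"open"`, `"close"`, and `"data"` and the functions `to_tokens` and `from_tokens` are illustrative names, not the Kepler token classes.

```python
from typing import Iterator, List, Tuple, Union

# A collection is (label, [children]); any plain string is a data item.
Tree = Union[Tuple[str, list], str]
Token = Tuple[str, str]


def to_tokens(tree: Tree) -> Iterator[Token]:
    """Flatten a nested collection into a stream with open/close delimiters."""
    if isinstance(tree, tuple):
        label, children = tree
        yield ("open", label)
        for child in children:
            yield from to_tokens(child)
        yield ("close", label)
    else:
        yield ("data", tree)


def from_tokens(tokens: List[Token]) -> Tree:
    """Rebuild the nested collection from a well-formed token stream."""
    stack = [("<root>", [])]
    for kind, value in tokens:
        if kind == "open":
            stack.append((value, []))
        elif kind == "close":
            finished = stack.pop()
            stack[-1][1].append(finished)
        else:
            stack[-1][1].append(value)
    return stack[0][1][0]


# Hypothetical collection used only to exercise the round trip.
collection = ("ImageCollection", [
    ("AnatomyImage", ["Image", "ReferenceImage"]),
    ("AnatomyImage", ["Image"]),
])
stream = list(to_tokens(collection))
```

Because `to_tokens` is a generator, a downstream consumer can start on the first inner collection before the outer one has been closed, which is the property that makes pipelined execution possible.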
Collection-Oriented Provenance Challenge SWF
• Input data is read by collection reader
– Execution driven by number and size of anatomy image sets
specified by XML file
• Slicer configured on the fly via parameter tokens
– E.g. to create the 3 slices required for each image set
• Output trace serialized into XML by collection writer
– Trace implicitly contains input data, output data, and lineage
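A self-contained XML trace of this kind might look roughly like the output of the sketch below. The element names, the `id`, `actor`, and `dependsOn` attributes, and the dictionary layout are all assumptions made for illustration; the actual trace schema is not given on the slides.

```python
import xml.etree.ElementTree as ET


def node_to_xml(node: dict) -> ET.Element:
    """Serialize one trace node, and recursively its children, to XML."""
    elem = ET.Element(node["label"], {"id": str(node["id"])})
    if "created_by" in node:
        # Lineage: the actor invocation that inserted this node, and its inputs.
        elem.set("actor", node["created_by"])
        elem.set("dependsOn", " ".join(str(d) for d in node["depends_on"]))
    for child in node.get("children", []):
        elem.append(node_to_xml(child))
    return elem


# Hypothetical trace: two input nodes plus one node inserted by an actor.
trace = {
    "id": 1, "label": "AnatomyImage", "children": [
        {"id": 2, "label": "Image"},
        {"id": 3, "label": "ReferenceImage"},
        {"id": 4, "label": "WarpParamSet",
         "created_by": "AlignWarp:1", "depends_on": [2, 3]},
    ],
}
xml_text = ET.tostring(node_to_xml(trace), encoding="unicode")
```

In this layout, nodes without an `actor` attribute are workflow input, nodes with one are output, and the `dependsOn` attribute carries the lineage, so a single file covers all three.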
Collection-Oriented Provenance
Data Dependencies
– Insertion and deletion events
capture actor, invocation count,
and direct data dependencies
[Figure: insertion dependencies in an AnatomyImage collection containing Image, ReferenceImage, and WarpParamSet.]
Process Dependencies
– Invocation dependencies record
which steps created data or
modified collections used by
another actor invocation
Embedded Provenance Tokens
– Data and invocation
dependencies stored as tokens
within the stream
– Actor API for declaring data
dependencies
– Invocation dependencies added
automatically
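A minimal sketch of what an actor-side API for declaring data dependencies could look like follows. Everything named here (`DataToken`, `InsertionToken`, `ActorContext`, `insert`) is hypothetical; the sketch only illustrates the idea that each inserted node is followed in the stream by a provenance token recording the actor, the invocation count, and the direct data dependencies. The automatic invocation dependencies mentioned above are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataToken:
    node_id: int
    label: str


@dataclass(frozen=True)
class InsertionToken:
    """Provenance token: which invocation inserted a node, and from what."""
    node_id: int
    actor: str
    invocation: int
    depends_on: Tuple[int, ...]


class ActorContext:
    """Hypothetical actor-side API for emitting data with declared dependencies."""

    def __init__(self, actor: str, first_free_id: int):
        self.actor = actor
        self.invocation = 0
        self.next_id = first_free_id
        self.out: List[object] = []  # the outgoing token stream

    def begin_invocation(self) -> None:
        self.invocation += 1

    def insert(self, label: str, depends_on: List[DataToken]) -> DataToken:
        """Emit a new data token followed by its embedded provenance token."""
        token = DataToken(self.next_id, label)
        self.next_id += 1
        self.out.append(token)
        self.out.append(InsertionToken(
            token.node_id, self.actor, self.invocation,
            tuple(d.node_id for d in depends_on)))
        return token


# Hypothetical usage: one AlignWarp invocation over two input tokens.
image = DataToken(1, "Image")
reference = DataToken(2, "ReferenceImage")
ctx = ActorContext("AlignWarp", first_free_id=3)
ctx.begin_invocation()
warp = ctx.insert("WarpParamSet", depends_on=[image, reference])
```

Keeping the provenance token adjacent to the data token means a downstream writer can record lineage without any side channel.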
Minimal Provenance Information
[Figure: two panels, “Without Provenance” and “With Provenance”.]
Querying Collection-Oriented Provenance
Execution traces imply provenance graphs
[Figure: provenance graphs over the nodes Image (311), Header (312), AtlasImage (308), and AtlasSlice (337), with edges labeled “Slicer : 1”. Panels: “Data/Collection creation lineage” and “Collection “last version” lineage”.]
Graph edges encode data lineage and process relations
Lineage(Trace, Node, DependentNode, Actor, InvocCount)
Provenance operations work over traces and graphs:
Input(Trace, Node)
Output(Trace, Node)
Param(Trace, Name, Value, Actor, InvocCount)
Metadata(Trace, Key, Value, Node)
etc.
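The Lineage relation and a lineage query over it can be sketched as a backward graph traversal. This is an illustration under assumptions rather than the authors' implementation: the argument order is read here as "Node was derived from DependentNode", and the node ids, actor names, and the `lineage_edges` function are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Lineage(NamedTuple):
    node: int         # node that was produced
    source: int       # node it was directly derived from (assumed reading)
    actor: str
    invoc_count: int


def lineage_edges(trace: Iterable[Lineage], targets: Iterable[int]) -> List[Lineage]:
    """Collect every lineage edge reachable backwards from the target nodes."""
    by_node: Dict[int, List[Lineage]] = {}
    for edge in trace:
        by_node.setdefault(edge.node, []).append(edge)

    seen: Set[int] = set()
    result: List[Lineage] = []
    frontier = list(targets)
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        for edge in by_node.get(node, []):
            result.append(edge)
            frontier.append(edge.source)
    return result


# Hypothetical trace with two independent derivation chains.
trace = [
    Lineage(3, 1, "AlignWarp", 1), Lineage(3, 2, "AlignWarp", 1),
    Lineage(4, 3, "Reslice", 1),
    Lineage(13, 11, "AlignWarp", 2), Lineage(13, 12, "AlignWarp", 2),
]
edges_for_4 = lineage_edges(trace, [4])
edges_for_both = lineage_edges(trace, [4, 13])
```

Querying for node 4 alone returns only the first chain, which is how a single trace can show that not every output depends on every input.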
Challenge Results
We used two different runs
– Each run has embedded metadata and parameter settings
– First run is equivalent to the challenge workflow
– Second run contains three image collections, each with a different
number of images
[Figure: input to first run. WorkflowInput holds one ImageCollection of four AnatomyImage collections; each contains an image and its header (Image1/Header1 through Image4/Header4) plus a Reference image and header.]
[Figure: input to second run. WorkflowInput holds three ImageCollections containing four, two, and three AnatomyImage collections respectively.]
Challenge Results (Trace 1)
Full Data Dependencies
Query:
?- trace(1, T),
   nodeId(T, 341, N1),
   nodeId(T, 349, N2),
   nodeId(T, 357, N3),
   lineageEdges(T, [N1, N2, N3], Edges),
   drawEdges(Edges).
Challenge Results (Trace 1)
• Question 1: Process that led to Atlas X Graphic
Returns subset of lineage edges
Query:
?- trace(1, T),
   nodeId(T, 341, N),
   lineageEdges(T, N, Edges),
   drawEdges(Edges).
Challenge Results (Trace 2)
• Question 1: Process that led to Atlas X Graphic
Single workflow run where
not all output depends on
all input.
Query:
?- trace(2, T),
   nodeId(T, 973, N1),
   nodeId(T, 1093, N2),
   nodeId(T, 1193, N3),
   lineageEdges(T, [N1, N2, N3], Edges),
   drawEdges(Edges).
Summary
Benefits of our approach
– Provenance support for Collection-Oriented SWFs
– Minimal provenance information stored in self-contained trace file
– Provenance automatically embedded within data stream, simple
actor provenance API
– Able to answer provenance challenge queries using simple
operations (see wiki entry); note that we ignored question 7
Suggestions for Future Provenance Challenges
– More complex/realistic workflows (e.g., from Bioinformatics)
• Loops, nesting, partial dependencies, concurrency
– More “scientist-oriented” provenance queries
• Explicit queries for data dependencies (e.g., see Wiki entry)
• Assume the user doesn’t know the structure of the trace (Query 5)