Provenance Analytics Using Provenance to Streamline Data Exploration through Workflows Juliana Freire

advertisement
Provenance Analytics
Using Provenance to Streamline Data
Exploration through Workflows
Juliana Freire
University of Utah
http://www.cs.utah.edu/~juliana
Provenance in Science
!!
!!
!!
!!
!!
!!
Interpret and reproduce results
Understand the experiment and chain of reasoning
that was used in the production of a result
Verify that an experiment was performed according
to acceptable procedures
Identify the inputs to an experiment were and
where they came from
Assess data quality
Track who performed an experiment and who is
responsible for its results
Provenance is as (or more!)
important as the results
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
2
Provenance in Science: Not a new issue
When
Lab notebooks have
been used for a long time
!!
–! Reproduce results
–! Evidence in patent
disputes
Annotation
Observed
data
DNA recombination
By Lederberg
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
3
Provenance in Science: What’s new?
!!
Scale and complexity
–! Large volumes of data
–! Complex analyses and data processing
!!
Writing notes is no longer an option
–! Need systematic means to capture provenance
!!
Workflows
–! Used to describe processes formally
–! Automatic provenance capture---most workflow systems
capture some form of provenance
!!
We now have lots of workflows and lots of
provenance…
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
4
Provenance Analytics: Opportunities
!!
!!
Benefit from scale!
Workflow and provenance repositories
–! E.g., Yahoo! Pipes, myExperiment, Indiana repository
!!
!!
Opportunity for knowledge discovery, sharing and
re-use
Query information
–! Understand processes and find useful workflows
–! Given a piece of data or task, which workflow should we
run? E.g., How do I generate a volume rendering of a CT
scan?
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
5
Provenance Analytics: Opportunities (cont.)
!!
Mine information
–! Discover interesting patters (e.g., common workflow
patterns) ! recommendation system, discover analogies
–! Identify homogeneous workflow groups by clustering !
organize collections
–! Infer workflow specification from execution log [Aalst et al.,
TKDE 2004]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
6
Provenance Analytics: Challenges
!!
What can we learn from workflow provenance?
–! New applications!
!!
How to model provenance and workflows?
–! Many features to consider: graph structure (modules +
connections), parameters (module signatures), values,
annotations, …
–! Are traditional graph mining algorithms effective? Do they
scale?
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
7
Outline
!!
!!
!!
!!
!!
!!
What is a workflow—a semi-formal definition "
Provenance of workflow evolution
Clustering workflows
Provenance and teaching
VisComplete: A workflow a recommendation system
Emerging applications
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
8
Scientific Workflows = Dataflows
!!
Directed graphs describing a computational task
–! Vertices = modules = processing steps + parameters
–! Edges = connections between output and input ports
–! Execution order determined by flow of data from output to
input ports
Input:!
Head.120.iso!
Isosurface Output:!
value=57
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
9
Scientific Workflows and Dataflows
!!
A directed graph describing a computational task
–! Vertices = modules = processing steps + parameters
–! Edges = connections between output and input ports
–! Execution order determined by flow of data from output to
input ports
!!
No state or side effects: Outputs are a function of the
inputs
I1
O3 = vtkDataSetMapper(input=O2)
O2 = vtkContourFilter(value=57,input=O1)
O1 = vtkStructuredReader(input=I1)
[Lee and Parks, IEEE 1995]
O1
O2
O3
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
10
Scientific Workflows and Dataflows
!!
A directed graph describing a computational task
–! Vertices = modules = processing steps + parameters
–! Edges = connections between output and input ports
–! Execution order determined by flow of data from output to
input ports
!!
!!
No state or side effects: Outputs are a function of the
inputs
I1
Simple programming model
–! Good match for visual programming
interfaces
–! Widely used: adopted by most scientific
workflow and visualization systems
–! Easy to optimize and parallelize
!!
A workflow run is also a DAG
Provenance Analytics – Provenance in Workflows Symposium, 2008
O1
O2
O3
Freire
11
Dataflow Example: VisTrails
Interactive visualization of a volumetric
data set
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
12
Dataflow Example: Taverna
Kegg pathways relevant to a gene
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
13
Dataflow Example: Yahoo! Pipes
Yahoo! Local results for
Napa Wineries with
images from Flickr taken
nearby
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
14
Workflows and Computer Programs
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
15
Workflows and Computer Programs
Program
Workflow
Document
Database
<Book>
<Title>The Advanced Html Companion</Title>
<Author> Keith Schengili-Roberts </Author>
<Author> Kim Silk-Copeland</Author>
<Price> 35.96</Price>…
</Book>
A program is to a workflow what an unstructured
document is to a (structured) database.
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
16
Workflows and Computer Programs
Program
Workflow
The Beauty of Structure
Unstructured
Document
Database
Structured
<Book>
<Title>The Advanced Html Companion</Title>
<Author> Keith Schengili-Roberts </Author>
<Author> Kim Silk-Copeland</Author>
<Price> 35.96</Price>…
</Book>
A program is to a workflow what an unstructured
document is to a (structured) database.
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
17
The Beauty of Structure
Parameter exploration
QBE
Analogies
Visual diff
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
18
Workflows and Computer Programs
!!
!!
!!
Programs are created by programmers
Scientific workflows are created by scientists, who
do not necessarily have programming expertise
Usability:
–! Simpler programming paradigms and visual programming
interfaces
–! Different levels of abstraction
Databases External Schema
Semantic WF
Logical Schema
Abstract WF
Physical Schema
Executable WF
Provenance Analytics – Provenance in Workflows Symposium, 2008
Workflows
Freire
19
Provenance of Workflow Evolution
Keeping Exploration Trails
Trail
Workflows
Provenance Analytics – Provenance in Workflows Symposium, 2008
Data Products
Freire
21
Change-Based Provenance
!!
!!
Records user actions
Provenance = changes to
computational tasks
–! Add a module, add a connection,
change a parameter value
!!
spx ! countour
Extensible change algebra
addModule
deleteConnection
addConnection
addConnection
setParameter
[Freire et al., IPAW2006]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
23
Change-Based Provenance
!!
!!
Records user actions
Provenance = changes to
computational tasks
–! Add a module, add a connection,
change a parameter value
Extensible change algebra
!! A vistrail node vt corresponds to
the workflow that is constructed
by the sequence of actions from
the root to vt
vt = xn ! xn-1 ! … ! x1 ! Ø
!!
vistrail
x1
x2
x3
[Freire et al, IPAW 2006]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
24
Exploring the Change Space
!!
!!
Scripting workflows: Parameter explorations are
simple to specify and apply
Exploration of parameter space for a workflow vt
(setParameter(idn,valuen) ! … ! (setParameter(id1,value1) ! vt )
!!
Exploration of multiple workflow specifications
(addModule(idi,…) ! (deleteModule(idi) ! v1 )
…
(addModule(idi,…) ! (deleteModule(idi) ! vn )
!!
!!
Results can be conveniently compared in the
VisTrails spreadsheet or as an animation
Caching to avoid redundant computations [Bavoil et
al., IEEE Vis 2005]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
25
Computing Workflow Differences
No need to compute subgraph
isomorphism!
!! A vistrail is a rooted tree: all
nodes have a common
ancestor—diffs are well
-defined and simple to
compute
vt1 = xi ! xi-1 ! … ! x1 ! Ø
vt2 = xj ! xj-1 ! … ! x1 ! Ø
vt1-vt2 = {xi, xi-1, …, x1, Ø} –
{xj, xj-1, …,x1 , Ø}
!! Different semantics:
!!
–! Exact, based on ids
–! Approximate, based on module
signatures
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
26
Change-Based Provenance: Summary
!!
!!
!!
General: Works with any system that has undo/redo!
Concise representation of provenance
Uniformly captures data and workflow provenance
–! Data provenance: where does a specific data product come
from?
–! Workflow evolution: how has workflow structure changed over
time?
!!
!!
!!
Results can be reproduced
Detailed information about the exploration process
Provenance beyond reproducibility:
–! Scientists can return to any point in the exploration space
–! Scalable exploration of the parameter space—results can be
compared side-by-side in the spreadsheet
–! Support for collaboration
–! Understand problem-solving strategies—knowledge re-use
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
27
VisTrails: Managing Exploration
!!
Comprehensive provenance infrastructure for
computational tasks
–! Data + workflow provenance
–! Treat workflow as a 1st-class data product
!!
Support for exploratory tasks such as visualization and
data mining
–! Task specification iteratively refined as users generate and
test hypotheses
!!
Focus on usability—build tools for scientists
http://www.vistrails.org
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
28
Provenance in Teaching
What can you learn from provenance?
Provenance and Teaching (1)
!!
Leverage provenance to improve the way we teach
CS and Science
–! http://www.vistrails.org/index.php/SciVisFall2008"
!!
Lecture provenance: student can reproduce results"
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
30
Provenance and Teaching (2)
!!
Homework provenance: Workflow evolution provenance
provides insights regarding"
–! Task complexity and nature: number of actions; structural vs.
parameter changes; task duration"
–! Student confusion: large branching factor=lots of trial and error
steps"
!!
Very detailed (and honest!) feedback: instructors can
leverage this information"
[Lins et al., SSDBM 2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
31
Organizing Workflow and
Provenance Collections
Organizing Provenance and Workflows
!!
Simplify the process of searching for useful data
–! Cluster search results, e.g, grokker.com
–! Create a browseable directory, e.g., DMOZ
A first study:
!! How to model workflows?
!! Which distance measures make sense?
!! Which clustering algorithms are effective?
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
33
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
34
Workflow Representations
!!
Bag of words vs. graph representation
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
35
Measuring Similarity
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
36
Experimental Evaluation
!!
!!
!!
Compare different approaches to modeling workflows: graphs
vs. bag of words
Experiment with different clustering algorithms
Dataset: 1031 distinct workflows, (semi) manually tagged
[Santos et al., IPAW 2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
37
Experimental Results: VS vs. MCIS
Results are similar
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
38
Experimental Results (cont.)
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
39
Experimental Results: Summary
!!
!!
!!
Vector-space based representation produced results
correlated to results obtained using a more
expensive structural representation
Although there is a large number of module pairs,
but very few are compatible and can be directly
connected
Future work: use other workflow collections;
consider additional features; alternative distance
metrics; …
[Santos et al., IPAW 2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
40
VisComplete:
Auto-Completion for Workflows
Motivation
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
42
VisComplete: A Workflow
Recommendation System
!!
!!
Identify graph fragments that co-occur in a
collection of workflows
Predict sets of likely workflow additions to a given
partial workflow
[Koop et al., IEEE Vis 2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
43
VisComplete: A Workflow
Recommendation System
!!
!!
Similar to a Web browser suggesting URL
completions
Idea applicable to integration queries [Sarah Cohen
-Boulakia et a., JBCB 2006; Talukdar et al., VLDB 2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
44
Results
Cut the number of actions
significantly, and therefore save
time
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
45
VisComplete: Demo
[Koop et al., IEEE Vis2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
46
Emerging Applications and
Research Directions
The Provenance-Enabled Paper
!!
!!
Bridge the gap between the scientific process and
publications
Results that can be reproduced and validated
–! Papers with deep captions
–! Encouraged by ACM SIGMOD and a number of journals
!!
!!
Describe more of the discovery process: people only
describe successes, can we learn from mistakes?
Dynamic (interactive) publications
–! Evolve over time
–! Blog/wiki like=> Science 2.0
–! E.g., http://project.liquidpub.org
!!
Need tools to support this!
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
48
Towards Provenance-Enabled Documents
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
49
Conclusions
!!
Provenance is crucial in many applications
–! There are even legal requirements, e.g., Sarbanes-Oxley
!!
Provenance support is appearing in many tools
–! Oracle Total Recall: transparent solution for long-term
storage and audit of historical data
–! Adobe Buzzword: multi-user online word processor controls
versions and keeps track of changes
!!
The ability to share provenance at a large scale can
expose scientists to different techniques and tools
that they would not otherwise have access to
–! scientists can learn by example; expedite their scientific
training; and potentially reduce their time to insight
[Freire and Silva, CHI SDA, 2008]
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
50
Acknowledgments: Funding
!!
This work is partially supported by the National
Science Foundation, the Department of Energy, an
IBM Faculty Award, and a University of Utah Seed
Grant.
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
51
Acknowledgments: People
!!
VisTrails Group
–!
–!
–!
–!
–!
–!
–!
–!
–!
–!
–!
Claudio Silva
Erik Anderson
Jason Callahan
Steven P. Callahan
Lorena Carlo
David Koop
Lauro Lins
Emanuele Santos
Carlos E. Scheidegger
Huy T. Vo
Geoff Draper
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
52
More info about VisTrails
google vistrails
Or
http://www.vistrails.org
Provenance Analytics – Provenance in Workflows Symposium, 2008
Freire
53
Download