Managing Provenance for Reproducibility and Beyond

advertisement
Managing Provenance for
Reproducibility and Beyond
Juliana Freire
Web and Databases Lab
Scientific Computing and Imaging Institute
School of Computing
University of Utah
What is Provenance?
 
 
Oxford English Dictionary: (i) the fact of coming
from some particular source or quarter; origin,
derivation. (ii) the history or pedigree of a work of
art, manuscript, rare book, etc.; concretely, a
record of the ultimate derivation and passage of an
item through its various owners.
Apple Dictionary: Origin, source, place of origin;
birthplace, fount, roots, pedigree, derivation, root,
etymology; formal radix.
Provenance for Reproducibility and Beyond
Juliana Freire
2
Provenance in Art
Rembrandt van Rijn
Self-Portrait, 1659
Andrew W. Mellon Collection
1937.1.72
Provenance for Reproducibility and Beyond
Juliana Freire
3
Provenance in Science
Provenance is as (or
more!) important as
the result
  Lab notebooks have
been used for a long time
  What is new?
When
 
–  Large volumes of data
–  Complex analyses—
computational processes
 
Writing notes is no
longer an option…
Annotation
Observed
data
DNA recombination
By Lederberg
Provenance for Reproducibility and Beyond
The Need for Provenance Management
Emp
John
Susan
Provenance for Reproducibility and Beyond
Dept
D01
D02
Mgr
Mary
Ken
Juliana Freire
5
Uses of Computational Provenance
anon4876_zspace_20060331.jpg
anon4877_zspace_20060331.jpg
anon4877_lesion_20060401.jpg
Reproducibility
Data quality
Attribution
Informational
How were these images created?
Was any pre-processing applied to the raw data?
What’s the difference?
Who created them?
Are they really from the
same patient?
Provenance for Reproducibility and Beyond
Juliana Freire
6
Vision: Provenance-Rich Data
Provenance for Reproducibility and Beyond
Juliana Freire
7
Provenance Management: Desiderata
Sensors
User studies
Simulations
Systematic capture
Integrate/
Obtain
  Provenance model
Organize
Data
  Efficient storage
and querying
  Usable tools
 
Web
Analyze/
Visualize
Databases
Provenance for Reproducibility and Beyond
Juliana Freire
8
The VisTrails System
 
 
 
Comprehensive provenance infrastructure for
computational tasks
Focus on exploratory tasks such as simulation,
visualization, and data analysis
Transparently tracks provenance of the discovery
process---from data acquisition to visualization
–  The trail followed as users generate and test hypotheses
 
 
 
Leverage provenance to streamline exploration
Focus on usability—build tools for scientists
Featured as an NSF discovery
Provenance for Reproducibility and Beyond
Juliana Freire
9
The VisTrails System
 
 
VisTrails is open source: http://www.vistrails.org
Multi-platform: Linux, Mac, Windows
–  Written in Python + Qt
 
Ove
 
Ma
• Visualizing
• Simulation f
(Galileo Netw
• Quantum ph
• Climate ana
• Habitat mod
• Open Wildland Fire Modeling (U. Colorado, NCAR)
• High-energy physics (LEPP, Cornell)
• Cosmology simulations (LANL)
Provenance for Reproducibility and Beyond
s
 
memory
 
 
 
 
ndiana
University)
• Linköping University (Sweden)
• University of North Carolina, Chapel Hill
• UTEP
Juliana Freire
10
Climate Data Analysis
[CDAT Project, Lawrence Livermore National Lab]
Provenance for Reproducibility and Beyond
Juliana Freire
11 11
Quantum Lattice Models
[ALPS Project, ETH-Zurich]
Provenance for Reproducibility and Beyond
Juliana Freire
12
Coastal Margin Observation & Prediction
[NSF Science & Technology Center for Coastal Margin Observation & Prediction]
Provenance for Reproducibility and Beyond
Juliana Freire
13
Comparing Cosmological Simulations
[Cosmic Code Comparison Project, Los Alamos National Lab]
Provenance for Reproducibility and Beyond
Juliana Freire
14
Wildfire Prediction
[WRF-Fire Project, University of Colorado-Denver]
Provenance for Reproducibility and Beyond
Juliana Freire
15
Studying the Effects of TMS on Memory
[Psychiatry-U of Utah]
Provenance for Reproducibility and Beyond
Juliana Freire
16
Provenance and
Data Exploration
Data Exploration and Workflows
 
 
Workflows have been traditionally used to automate
repetitive tasks
In exploratory tasks, change is the norm!
–  Data analysis and exploration are iterative processes
Data
Workflow
Data
Product
Specification
Data
Manipulation
Perception &
Cognition
Knowledge
Exploration
User
Figure modified from J. van Wijk, IEEE Vis 2005
Provenance for Reproducibility and Beyond
Juliana Freire
18
Exploration and Creativity Support
Reflective reasoning is key in the exploratory
processes
“Reflective reasoning requires the ability to store
temporary results, to make inferences from stored
knowledge, and to follow chains of reasoning
backward and forward, sometimes backtracking
when a promising line of thought proves to be
unfruitful. …the process is slow and laborious”
Donald A. Norman
  Need external aids—tools to facilitate this process
 
–  Creativity support tools
 
[Shneiderman, CACM 2002]
Need aid from people—collaboration
Provenance for Reproducibility and Beyond
Juliana Freire
19
Data Exploration and Workflows
raw data:CT scan
workflow
Files (workflow specifications)
anon4877_voxel_scale_1_zspace_20060331.srn
anon4877_textureshading_20060331.srn
anon4877_textureshading_plane0_20060331.srn
anon4877_goodxferfunction_20060331.srn
anon4877_lesion_20060331.srn
Provenance for Reproducibility and Beyond
Notes
Initial
visualization
wi Added texture
an Added plane to
vi
li
i Found good
t
s
Identified
f
lesion tissue
Juliana Freire
20
Data Exploration and Workflows: Issues
 
 
Hard to assemble and refine workflows
Data provenance is maintained manually through filenaming conventions and detailed notes
–  A time-consuming process
 
 
 
Hard to understand the exploratory process and
relationships among workflows
Hard to further explore the data, e.g., locate relevant
data products/workflows and modify them
Hard to collaborate, and work is likely to be lost if
creator leaves
The generation and maintenance of workflows is a
major bottleneck in the scientific process
Provenance for Reproducibility and Beyond
Juliana Freire
21
Change-Based Provenance
 
 
Treat workflow as a first-class data item
Provenance = changes to computational tasks
–  Add a module, add a connection, change a parameter value
[Freire et al., IPAW 2006]
Provenance for Reproducibility and Beyond
Juliana Freire
22
Change-Based Provenance
 
 
Treat workflow as a first-class
data item
Provenance = changes to
computational tasks
–  Add a module, add a connection,
change a parameter value
addModule
deleteConnection
addConnection
addConnection
setParameter
Provenance for Reproducibility and Beyond
Juliana Freire
23
Change-Based Provenance
 
 
Treat workflow as a first-class
data item
Provenance = changes to
computational tasks
–  Add a module, add a connection,
change a parameter value
 
 
A vistrail node vt corresponds to
the workflow that is constructed
by the sequence of actions from
the root to vt
vt = xn ◦ xn-1 ◦ … ◦ x1 ◦ Ø
Extensible change algebra
vistrail
x1
x2
x3
[Freire et al, IPAW 2006]
Provenance for Reproducibility and Beyond
Juliana Freire
24
Provenance Beyond Reproducibility
 
 
Support for reflective reasoning
Ability to compare data products
[Freire et al., IPAW 2006]
Provenance for Reproducibility and Beyond
Juliana Freire
25
Computing Workflow Differences
No need to compute subgraph
isomorphism!
  A vistrail is a rooted tree: all
nodes have a common
ancestor—diffs are welldefined and simple to
compute
vt1 = xi ◦ xi-1 ◦ … ◦ x1 ◦ Ø
vt2 = xj ◦ xj-1 ◦ … ◦ x1 ◦ Ø
vt1-vt2 = {xi, xi-1, …, x1, Ø} –
{xj, xj-1, …,x1 , Ø}
  Different semantics:
 
–  Exact, based on ids
–  Approximate, based on module
signatures
Provenance for Reproducibility and Beyond
Juliana Freire
26
Provenance Beyond Reproducibility
 
 
 
Support for reflective reasoning
Ability to compare data products
Explore parameter spaces and compare results
[Freire et al., IPAW 2006]
Provenance for Reproducibility and Beyond
Juliana Freire
27
Exploring the Change Space
 
 
Scripting workflows: Parameter explorations are
simple to specify and apply
Exploration of parameter space for a workflow vt
(setParameter(idn,valuen) ◦ … ◦ (setParameter(id1,value1) ◦ vt )
 
Exploration of multiple workflow specifications
(addModule(idi,…) ◦ (deleteModule(idi) ◦ v1 )
…
(addModule(idi,…) ◦ (deleteModule(idi) ◦ vn )
 
 
 
Results can be conveniently compared in the
VisTrails spreadsheet
Can create animations too!
Caching to avoid redundant computations [Bavoil et
al., IEEE Vis 2005]
Provenance for Reproducibility and Beyond
Juliana Freire
28
Provenance Beyond Reproducibility
 
 
 
 
Support for reflective reasoning
Ability to compare data products
Explore parameter spaces and compare results
Support for collaboration
[Ellkvist et al., IPAW 2008]
Provenance for Reproducibility and Beyond
Juliana Freire
29
Collaborative Exploration
 
Collaboration is key to data exploration
–  Translational, integrative approaches to science
 
 
Store provenance information in a database
Synchronize concurrent updates through locking
–  Real-time collaboration [Ellkvist et al., IPAW 2008]
 
Asynchronous access: similar to version control
systems
–  Check out, work offline, synchronize
–  Users exchange patches
 
No need for a central repository—support for
distributed collaboration
–  For details see Callahan et al., SCI Institute Technical
Report, No. UUSCI-2006-016 2006
Provenance for Reproducibility and Beyond
Juliana Freire
30
Vistrail Synchronization
 
Version tree is monotonic
–  Actions are always added, never deleted
 
Merging two vistrails is simple
+
Provenance for Reproducibility and Beyond
=
Juliana Freire
31
Change-Based Provenance: Summary
 
General: Works with any system that has undo/redo!
Provenance for Reproducibility and Beyond
Juliana Freire
32
Provenance Enabling 3rd-Party Tools
Autodesk Maya
ParaView
VisIt
ImageVis3d
[Callahan et al., IPAW 2008]
Provenance for Reproducibility and Beyond
Juliana Freire
33
Provenance Plugin for ParaView
http://www.cs.utah.edu/~juliana/videos/paraview_plugin.avi
Provenance for Reproducibility and Beyond
Juliana Freire
34
Change-Based Provenance: Summary
 
 
 
General: Works with any system that has undo/redo!
Concise representation
Uniformly captures data and workflow provenance
–  Data provenance: where does a specific data product come
from?
–  Workflow evolution: how has workflow structure changed over
time?
 
 
 
Results can be reproduced
Detailed information about the exploration process
Provenance beyond reproducibility:
–  Scientists can return to any point in the exploration space
–  Scalable exploration of the parameter space—results can be
compared side-by-side in the spreadsheet
–  Support for collaboration
–  Understand problem-solving strategies—knowledge re-use
Provenance for Reproducibility and Beyond
Juliana Freire
35
Querying and Re-Using Provenance
Querying Provenance
head.120.vtk
 
Graph traversal
–  Derivation lineage
–  Data dependencies
Find the process that led to
resampled-head-vis.png
Which data sets contributed
to resampled-head-vis.png
 
resampled-head-vis.png
Provenance for Reproducibility and Beyond
Graph patterns
Find all invocations of
vtkContourFilter with
isosurface value = 57 that
are preceded by resampling
Juliana Freire
37
Querying Provenance by Example
 
 
Provenance is represented as graphs: hard to specify
queries using text!
Querying workflows by example [Scheidegger et al., TVCG
2007; Beeri et al., VLDB 2006; Beeri et al. VLDB 2007]
–  WYSIWYQ -- What You See Is What You Query
–  Interface to create workflow is same as to query
Provenance for Reproducibility and Beyond
Juliana Freire
38
Creating Workflows
 
Complex workflows are hard to assemble
–  Programming expertise
–  Domain knowledge
–  Familiarity with different tools
Provenance for Reproducibility and Beyond
Steep learning curve
Juliana Freire
39
Refining Analyses by Analogy
 
Leverage the wisdom of the crowds in shared
provenance
–  Some refinements are common, e.g., change the
rendering technique, publish image on the Web
 
Apply refinements by analogy, automatically
Provenance for Reproducibility and Beyond
Juliana Freire
40
Generating Visualizations by Analogy
[Scheidegger et al, IEEE TVCG 2007]
Provenance for Reproducibility and Beyond
http://www.cs.utah.edu/~juliana/videos/Analogies.m4v
Juliana Freire
41
Creating Workflows by Analogy
A
as
1. 
Compute difference: ∆(A,B)
–  Just like a patch!
–  But…
D = ∆(A,B) ◦ C may not be a valid
workflow
2. 
B
is to
C
is to
A
D
C
Find correspondences between A
and C: map(A,C)
–  Diffuse similarity scores across the
product graph AxC using Eigenvalue
decompositions
3. 
4. 
Compute mapped difference
∆AC(A,B) =map(A,C) ∆(A,B)
D = ∆AC(A,B) ◦ C
[Scheidegger et al, IEEE TVCG 2007]
Provenance for Reproducibility and Beyond
Juliana Freire
42
The Need for Guidance in
Workflow Design
Provenance for Reproducibility and Beyond
Juliana Freire
43
VisComplete: A Workflow
Recommendation System
 
 
 
Mine provenance collection: Identify graph
fragments that co-occur in a collection of workflows
Predict sets of likely workflow additions to a given
partial workflow
Similar to a Web browser suggesting URL
completions
[Koop et al., IEEE Vis 2008]
Provenance for Reproducibility and Beyond
Juliana Freire
44
VisComplete: A Workflow
Recommendation System
 
 
 
Mine provenance collection: Identify graph
fragments that co-occur in a collection of workflows
Predict sets of likely workflow additions to a given
partial workflow
Similar to a Web browser suggesting URL
completions
Provenance for Reproducibility and Beyond
Juliana Freire
45
VisComplete: Demo
http://www.cs.utah.edu/~juliana/videos/viscomplete_h_264.mov
Provenance for Reproducibility and Beyond
Juliana Freire
46
Emerging Applications
Scientific Publications and Provenance
to
(e
ed
rch
y weigh rat i. , W/kg) w en cyc ing at a give VO2
5.0 /min or ! % O max) Remark b y this in i idual
ph
gy
l
e
t
Fig. 1. Mechanical efficiency when bicycling expressed as “gross efficiency”
ced efficiency”
ca ce . The
oachperiod
of thi
st dy
will be
to World
repo
and “delta
overapp
the 7-yr
in this
individual.
WC,
resu ts
from
sta Championships,
dardized la or
tory4thtesting
on this individu
Bicycle
Road
Racing
1st and
place, respectively.
Tour de
France
Champion
of the Tour de
at fi 1st,
e t Grand
me points
correspondin
toFrance
ages 2in 1999
1 2–2004.
5
METHODS
General testing sequence. On reporting to the laboratory, training,
racing, and medical histories were obtained, body weight was measured ("0.1 kg), and the following tests were performed after informed consent was obtained, with procedures approved by the
Internal Review Board of The University of Texas at Austin. Mechanical efficiency and the blood lactate threshold (LT) were determined as the subject bicycled a stationary ergometer for 25 min, with
work rate increasing progressively every 5 min over a range of 50, 60,
70, 80, and 90% V̇O2 max. After a 10- to 20-min period of active
recovery, V̇O2 max when cycling was measured. Thereafter, body
composition was determined by hydrostatic weighing and/or analysis
of skin-fold thickness (34, 35).
n Reproducibility
muscula ef c ency and
s th seBeyond
have e
Provenance for
an t e
Juliana
Freire
l
48
Scientific Publications and Provenance
to
(e
ed
rch
y weigh rat i. , W/kg) w en cyc ing at a give VO2
5.0 /min or ! % O max) Remark b y this in i idual
"raw data from the January 1993 test that revealed several
additional deviations from the published methodology. Coyle
used a 20-min ergometer protocol (not 25 min), including 2and 3-min stages where respiratory exchange ratios (RER)
o
hy
y f use of
l e the
t Lusk
exceeded 1.00. An RER >1.00ecedinvalidates
ca ce . The app oach of thi st dy will be to repo
resu s from sta ard zed laboratory testing on this individu
equations (5) to estimate energy
expenditure.”
at fi e t me
points correspon in to ages 2 1 2 5
Fig. 1. Mechanical efficiency when bicycling expressed as “gross efficiency”
and “delta efficiency” over the 7 yr period in this individual. WC, World
Bicycle Road Racing Championships, 1st and 4th place, respectively. Tour de
France 1st, Grand Champion of the Tour de France in 1999 –2004.
METHODS
General testing sequence
On reporting are
to the laboratory,
training,
”…all of the published delta efficiency
values
wrong.
…
racing, and medical histories were obtained, body weight was measured ("0.1 kg), and he following tests were performed after inthere exists no credible evidence
to was
support
Coyle's
formed consent
obtained, with procedures
approved by the
Internal Review Board of The University of Texas at Austin. Mechanical efficiencyefficiency
and the blood lactate threshold
(LT) were deterconclusion that Armstrong's muscle
improved."
mined as the subject bicycled a stationary ergometer for 25 min, with
work rate increasing progressively every 5 min over a range of 50, 60,
70, 80, and 90% V̇O2 max. After a 10- to 20-min period of active
recovery, V̇O2 max when cycling was measured. Thereafter, body
composi ion was determined by hydrostatic weighing and/or analysis
of skin-fold thickness (34, 35).
http://jap.physiology.org/cgi/content/full/105/3/1020
n Reproducibility
muscula ef c ency and
s th seBeyond
have e
Provenance for
an t e
Juliana
Freire
l
49
Provenance-Rich Publications
 
 
Bridge the gap between the scientific process and
publications
Results that can be reproduced and validated
–  Papers with deep captions
–  Encouraged by ACM SIGMOD and a number of journals
 
 
Describe more of the discovery process: people only
describe successes, can we learn from mistakes?
Support dynamic (interactive) publications
–  Evolve over time
–  Blog/wiki like=> Science 2.0
–  E.g., http://project.liquidpub.org
Provenance for Reproducibility and Beyond
Juliana Freire
50
Provenance-Rich Documents
Provenance for Reproducibility and Beyond
Juliana Freire
51
Provenance and Teaching (1)
 
Leverage provenance to improve the way we teach
CS and Science [Silva et al., Eurographics 2010]
–  http://www.vistrails.org/index.php/SciVisFall2008
–  Also used at UNC, Linkoping, UTEP
–  Lecture provenance: student can reproduce results
Provenance for Reproducibility and Beyond
Juliana Freire
52
Provenance and Teaching (2)
 
Homework provenance provides insights regarding
–  Task complexity and nature: number of actions; structural vs.
parameter changes; task duration
–  Student confusion: large branching factor=lots of trial and error
steps
 
[Lins et al., SSDBM 2008]
Provenance for Reproducibility and Beyond
Juliana Freire
53
Provenance and Teaching (3)
 
Homework provenance helps students and instructors to
collaborate
–  Student is stuck, sends his provenance
–  Instructor understands studentʼs problem, provides hints---student
can see what instructor did!
–  They can also collaborate in real time [Ellkvist et al., IPAW 2008]
Provenance for Reproducibility and Beyond
Juliana Freire
54
Using Provenance to
Teach Electronic Media
[Langefeld and Kessler,
2009]
“[...] The students have gotten to the point where
they demand the VisTrails files for every
demonstration just after I complete [it]”
“[...] students used [a vistrail instead of a reference
model] 62% of the time”
Provenance for Reproducibility and Beyond
Juliana Freire
55
Using Provenance to
Teach Electronic Media
http://www.cs.utah.edu/~juliana/videos/maya_playback_slow.mov
Provenance for Reproducibility and Beyond
Juliana Freire
56
Using Provenance to
Teach Electronic Media
[Langefeld and Kessler,2009]
“[...] The students have gotten to the point where
they demand the VisTrails files for every
demonstration just after I complete [it]”
“[...] students used [a vistrail instead of a reference
model] 62% of the time”
“Students who used provenance produced higherquality models”
Provenance for Reproducibility and Beyond
Juliana Freire
57
Science 2.0
 
 
 
Web 2.0 technologies has opened up new opportunities to
improve collaboration and information sharing in science
[Shneiderman, Science 2008; Waldrop, Scientific American
2008]
Need provenance: determine authorship, enforce intellectual
property rights, validate the integrity of artifacts and assess
their quality, and to reproduce the artifact
Social Data Analysis: Share data, processes and provenance
Provenance for Reproducibility and Beyond
Juliana Freire
58
Conclusions and Future Work
 
 
 
 
 
Provenance management is crucial in many
applications
Change-based provenance model
New algorithms and usable tools for querying and
re-using provenance information
The open-source VisTrails system: provides unique
capabilities essential for cyberinfrastructure
Sharing provenance creates new opportunities [Freire
and Silva, CHI SDA, 2008]
–  Expose users to different techniques and tools
–  Users can learn by example; expedite their training; and
potentially reduce their time to insight
 
Many challenges and several open computer science
questions
Provenance for Reproducibility and Beyond
Juliana Freire
59
Acknowledgments
 
 
Thanks to VisTrails and Web&DB groups
This work is partially supported by the National
Science Foundation, the Department of Energy, an
IBM Faculty Award, and a University of Utah Seed
Grant.
Provenance for Reproducibility and Beyond
Juliana Freire
60
Thank you
Download