Managing Provenance for Reproducibility and Beyond Juliana Freire Web and Databases Lab Scientific Computing and Imaging Institute School of Computing University of Utah What is Provenance? Oxford English Dictionary: (i) the fact of coming from some particular source or quarter; origin, derivation. (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an item through its various owners. Apple Dictionary: Origin, source, place of origin; birthplace, fount, roots, pedigree, derivation, root, etymology; formal radix. Provenance for Reproducibility and Beyond Juliana Freire 2 Provenance in Art Rembrandt van Rijn Self-Portrait, 1659 Andrew W. Mellon Collection 1937.1.72 Provenance for Reproducibility and Beyond Juliana Freire 3 Provenance in Science Provenance is as (or more!) important as the result Lab notebooks have been used for a long time What is new? When – Large volumes of data – Complex analyses— computational processes Writing notes is no longer an option… Annotation Observed data DNA recombination By Lederberg Provenance for Reproducibility and Beyond The Need for Provenance Management Emp John Susan Provenance for Reproducibility and Beyond Dept D01 D02 Mgr Mary Ken Juliana Freire 5 Uses of Computational Provenance anon4876_zspace_20060331.jpg anon4877_zspace_20060331.jpg anon4877_lesion_20060401.jpg Reproducibility Data quality Attribution Informational How were these images created? Was any pre-processing applied to the raw data? What’s the difference? Who created them? Are they really from the same patient? Provenance for Reproducibility and Beyond Juliana Freire 6 Vision: Provenance-Rich Data Provenance for Reproducibility and Beyond Juliana Freire 7 Provenance Management: Desiderata Sensors User studies Simulations Systematic capture Integrate/ Obtain Provenance model Organize Data Efficient storage and querying Usable tools Web Analyze/ Visualize Databases Provenance for Reproducibility and Beyond Juliana Freire 8 The VisTrails System Comprehensive provenance infrastructure for computational tasks Focus on exploratory tasks such as simulation, visualization, and data analysis Transparently tracks provenance of the discovery process---from data acquisition to visualization – The trail followed as users generate and test hypotheses Leverage provenance to streamline exploration Focus on usability—build tools for scientists Featured as an NSF discovery Provenance for Reproducibility and Beyond Juliana Freire 9 The VisTrails System VisTrails is open source: http://www.vistrails.org Multi-platform: Linux, Mac, Windows – Written in Python + Qt Ove Ma • Visualizing • Simulation f (Galileo Netw • Quantum ph • Climate ana • Habitat mod • Open Wildland Fire Modeling (U. Colorado, NCAR) • High-energy physics (LEPP, Cornell) • Cosmology simulations (LANL) Provenance for Reproducibility and Beyond s memory ndiana University) • Linköping University (Sweden) • University of North Carolina, Chapel Hill • UTEP Juliana Freire 10 Climate Data Analysis [CDAT Project, Lawrence Livermore National Lab] Provenance for Reproducibility and Beyond Juliana Freire 11 11 Quantum Lattice Models [ALPS Project, ETH-Zurich] Provenance for Reproducibility and Beyond Juliana Freire 12 Coastal Margin Observation & Prediction [NSF Science & Technology Center for Coastal Margin Observation & Prediction] Provenance for Reproducibility and Beyond Juliana Freire 13 Comparing Cosmological Simulations [Cosmic Code Comparison Project, Los Alamos National Lab] Provenance for Reproducibility and Beyond Juliana Freire 14 Wildfire Prediction [WRF-Fire Project, University of Colorado-Denver] Provenance for Reproducibility and Beyond Juliana Freire 15 Studying the Effects of TMS on Memory [Psychiatry-U of Utah] Provenance for Reproducibility and Beyond Juliana Freire 16 Provenance and Data Exploration Data Exploration and Workflows Workflows have been traditionally used to automate repetitive tasks In exploratory tasks, change is the norm! – Data analysis and exploration are iterative processes Data Workflow Data Product Specification Data Manipulation Perception & Cognition Knowledge Exploration User Figure modified from J. van Wijk, IEEE Vis 2005 Provenance for Reproducibility and Beyond Juliana Freire 18 Exploration and Creativity Support Reflective reasoning is key in the exploratory processes “Reflective reasoning requires the ability to store temporary results, to make inferences from stored knowledge, and to follow chains of reasoning backward and forward, sometimes backtracking when a promising line of thought proves to be unfruitful. …the process is slow and laborious” Donald A. Norman Need external aids—tools to facilitate this process – Creativity support tools [Shneiderman, CACM 2002] Need aid from people—collaboration Provenance for Reproducibility and Beyond Juliana Freire 19 Data Exploration and Workflows raw data:CT scan workflow Files (workflow specifications) anon4877_voxel_scale_1_zspace_20060331.srn anon4877_textureshading_20060331.srn anon4877_textureshading_plane0_20060331.srn anon4877_goodxferfunction_20060331.srn anon4877_lesion_20060331.srn Provenance for Reproducibility and Beyond Notes Initial visualization wi Added texture an Added plane to vi li i Found good t s Identified f lesion tissue Juliana Freire 20 Data Exploration and Workflows: Issues Hard to assemble and refine workflows Data provenance is maintained manually through filenaming conventions and detailed notes – A time-consuming process Hard to understand the exploratory process and relationships among workflows Hard to further explore the data, e.g., locate relevant data products/workflows and modify them Hard to collaborate, and work is likely to be lost if creator leaves The generation and maintenance of workflows is a major bottleneck in the scientific process Provenance for Reproducibility and Beyond Juliana Freire 21 Change-Based Provenance Treat workflow as a first-class data item Provenance = changes to computational tasks – Add a module, add a connection, change a parameter value [Freire et al., IPAW 2006] Provenance for Reproducibility and Beyond Juliana Freire 22 Change-Based Provenance Treat workflow as a first-class data item Provenance = changes to computational tasks – Add a module, add a connection, change a parameter value addModule deleteConnection addConnection addConnection setParameter Provenance for Reproducibility and Beyond Juliana Freire 23 Change-Based Provenance Treat workflow as a first-class data item Provenance = changes to computational tasks – Add a module, add a connection, change a parameter value A vistrail node vt corresponds to the workflow that is constructed by the sequence of actions from the root to vt vt = xn ◦ xn-1 ◦ … ◦ x1 ◦ Ø Extensible change algebra vistrail x1 x2 x3 [Freire et al, IPAW 2006] Provenance for Reproducibility and Beyond Juliana Freire 24 Provenance Beyond Reproducibility Support for reflective reasoning Ability to compare data products [Freire et al., IPAW 2006] Provenance for Reproducibility and Beyond Juliana Freire 25 Computing Workflow Differences No need to compute subgraph isomorphism! A vistrail is a rooted tree: all nodes have a common ancestor—diffs are welldefined and simple to compute vt1 = xi ◦ xi-1 ◦ … ◦ x1 ◦ Ø vt2 = xj ◦ xj-1 ◦ … ◦ x1 ◦ Ø vt1-vt2 = {xi, xi-1, …, x1, Ø} – {xj, xj-1, …,x1 , Ø} Different semantics: – Exact, based on ids – Approximate, based on module signatures Provenance for Reproducibility and Beyond Juliana Freire 26 Provenance Beyond Reproducibility Support for reflective reasoning Ability to compare data products Explore parameter spaces and compare results [Freire et al., IPAW 2006] Provenance for Reproducibility and Beyond Juliana Freire 27 Exploring the Change Space Scripting workflows: Parameter explorations are simple to specify and apply Exploration of parameter space for a workflow vt (setParameter(idn,valuen) ◦ … ◦ (setParameter(id1,value1) ◦ vt ) Exploration of multiple workflow specifications (addModule(idi,…) ◦ (deleteModule(idi) ◦ v1 ) … (addModule(idi,…) ◦ (deleteModule(idi) ◦ vn ) Results can be conveniently compared in the VisTrails spreadsheet Can create animations too! Caching to avoid redundant computations [Bavoil et al., IEEE Vis 2005] Provenance for Reproducibility and Beyond Juliana Freire 28 Provenance Beyond Reproducibility Support for reflective reasoning Ability to compare data products Explore parameter spaces and compare results Support for collaboration [Ellkvist et al., IPAW 2008] Provenance for Reproducibility and Beyond Juliana Freire 29 Collaborative Exploration Collaboration is key to data exploration – Translational, integrative approaches to science Store provenance information in a database Synchronize concurrent updates through locking – Real-time collaboration [Ellkvist et al., IPAW 2008] Asynchronous access: similar to version control systems – Check out, work offline, synchronize – Users exchange patches No need for a central repository—support for distributed collaboration – For details see Callahan et al., SCI Institute Technical Report, No. UUSCI-2006-016 2006 Provenance for Reproducibility and Beyond Juliana Freire 30 Vistrail Synchronization Version tree is monotonic – Actions are always added, never deleted Merging two vistrails is simple + Provenance for Reproducibility and Beyond = Juliana Freire 31 Change-Based Provenance: Summary General: Works with any system that has undo/redo! Provenance for Reproducibility and Beyond Juliana Freire 32 Provenance Enabling 3rd-Party Tools Autodesk Maya ParaView VisIt ImageVis3d [Callahan et al., IPAW 2008] Provenance for Reproducibility and Beyond Juliana Freire 33 Provenance Plugin for ParaView http://www.cs.utah.edu/~juliana/videos/paraview_plugin.avi Provenance for Reproducibility and Beyond Juliana Freire 34 Change-Based Provenance: Summary General: Works with any system that has undo/redo! Concise representation Uniformly captures data and workflow provenance – Data provenance: where does a specific data product come from? – Workflow evolution: how has workflow structure changed over time? Results can be reproduced Detailed information about the exploration process Provenance beyond reproducibility: – Scientists can return to any point in the exploration space – Scalable exploration of the parameter space—results can be compared side-by-side in the spreadsheet – Support for collaboration – Understand problem-solving strategies—knowledge re-use Provenance for Reproducibility and Beyond Juliana Freire 35 Querying and Re-Using Provenance Querying Provenance head.120.vtk Graph traversal – Derivation lineage – Data dependencies Find the process that led to resampled-head-vis.png Which data sets contributed to resampled-head-vis.png resampled-head-vis.png Provenance for Reproducibility and Beyond Graph patterns Find all invocations of vtkContourFilter with isosurface value = 57 that are preceded by resampling Juliana Freire 37 Querying Provenance by Example Provenance is represented as graphs: hard to specify queries using text! Querying workflows by example [Scheidegger et al., TVCG 2007; Beeri et al., VLDB 2006; Beeri et al. VLDB 2007] – WYSIWYQ -- What You See Is What You Query – Interface to create workflow is same as to query Provenance for Reproducibility and Beyond Juliana Freire 38 Creating Workflows Complex workflows are hard to assemble – Programming expertise – Domain knowledge – Familiarity with different tools Provenance for Reproducibility and Beyond Steep learning curve Juliana Freire 39 Refining Analyses by Analogy Leverage the wisdom of the crowds in shared provenance – Some refinements are common, e.g., change the rendering technique, publish image on the Web Apply refinements by analogy, automatically Provenance for Reproducibility and Beyond Juliana Freire 40 Generating Visualizations by Analogy [Scheidegger et al, IEEE TVCG 2007] Provenance for Reproducibility and Beyond http://www.cs.utah.edu/~juliana/videos/Analogies.m4v Juliana Freire 41 Creating Workflows by Analogy A as 1. Compute difference: ∆(A,B) – Just like a patch! – But… D = ∆(A,B) ◦ C may not be a valid workflow 2. B is to C is to A D C Find correspondences between A and C: map(A,C) – Diffuse similarity scores across the product graph AxC using Eigenvalue decompositions 3. 4. Compute mapped difference ∆AC(A,B) =map(A,C) ∆(A,B) D = ∆AC(A,B) ◦ C [Scheidegger et al, IEEE TVCG 2007] Provenance for Reproducibility and Beyond Juliana Freire 42 The Need for Guidance in Workflow Design Provenance for Reproducibility and Beyond Juliana Freire 43 VisComplete: A Workflow Recommendation System Mine provenance collection: Identify graph fragments that co-occur in a collection of workflows Predict sets of likely workflow additions to a given partial workflow Similar to a Web browser suggesting URL completions [Koop et al., IEEE Vis 2008] Provenance for Reproducibility and Beyond Juliana Freire 44 VisComplete: A Workflow Recommendation System Mine provenance collection: Identify graph fragments that co-occur in a collection of workflows Predict sets of likely workflow additions to a given partial workflow Similar to a Web browser suggesting URL completions Provenance for Reproducibility and Beyond Juliana Freire 45 VisComplete: Demo http://www.cs.utah.edu/~juliana/videos/viscomplete_h_264.mov Provenance for Reproducibility and Beyond Juliana Freire 46 Emerging Applications Scientific Publications and Provenance to (e ed rch y weigh rat i. , W/kg) w en cyc ing at a give VO2 5.0 /min or ! % O max) Remark b y this in i idual ph gy l e t Fig. 1. Mechanical efficiency when bicycling expressed as “gross efficiency” ced efficiency” ca ce . The oachperiod of thi st dy will be to World repo and “delta overapp the 7-yr in this individual. WC, resu ts from sta Championships, dardized la or tory4thtesting on this individu Bicycle Road Racing 1st and place, respectively. Tour de France Champion of the Tour de at fi 1st, e t Grand me points correspondin toFrance ages 2in 1999 1 2–2004. 5 METHODS General testing sequence. On reporting to the laboratory, training, racing, and medical histories were obtained, body weight was measured ("0.1 kg), and the following tests were performed after informed consent was obtained, with procedures approved by the Internal Review Board of The University of Texas at Austin. Mechanical efficiency and the blood lactate threshold (LT) were determined as the subject bicycled a stationary ergometer for 25 min, with work rate increasing progressively every 5 min over a range of 50, 60, 70, 80, and 90% V̇O2 max. After a 10- to 20-min period of active recovery, V̇O2 max when cycling was measured. Thereafter, body composition was determined by hydrostatic weighing and/or analysis of skin-fold thickness (34, 35). n Reproducibility muscula ef c ency and s th seBeyond have e Provenance for an t e Juliana Freire l 48 Scientific Publications and Provenance to (e ed rch y weigh rat i. , W/kg) w en cyc ing at a give VO2 5.0 /min or ! % O max) Remark b y this in i idual "raw data from the January 1993 test that revealed several additional deviations from the published methodology. Coyle used a 20-min ergometer protocol (not 25 min), including 2and 3-min stages where respiratory exchange ratios (RER) o hy y f use of l e the t Lusk exceeded 1.00. An RER >1.00ecedinvalidates ca ce . The app oach of thi st dy will be to repo resu s from sta ard zed laboratory testing on this individu equations (5) to estimate energy expenditure.” at fi e t me points correspon in to ages 2 1 2 5 Fig. 1. Mechanical efficiency when bicycling expressed as “gross efficiency” and “delta efficiency” over the 7 yr period in this individual. WC, World Bicycle Road Racing Championships, 1st and 4th place, respectively. Tour de France 1st, Grand Champion of the Tour de France in 1999 –2004. METHODS General testing sequence On reporting are to the laboratory, training, ”…all of the published delta efficiency values wrong. … racing, and medical histories were obtained, body weight was measured ("0.1 kg), and he following tests were performed after inthere exists no credible evidence to was support Coyle's formed consent obtained, with procedures approved by the Internal Review Board of The University of Texas at Austin. Mechanical efficiencyefficiency and the blood lactate threshold (LT) were deterconclusion that Armstrong's muscle improved." mined as the subject bicycled a stationary ergometer for 25 min, with work rate increasing progressively every 5 min over a range of 50, 60, 70, 80, and 90% V̇O2 max. After a 10- to 20-min period of active recovery, V̇O2 max when cycling was measured. Thereafter, body composi ion was determined by hydrostatic weighing and/or analysis of skin-fold thickness (34, 35). http://jap.physiology.org/cgi/content/full/105/3/1020 n Reproducibility muscula ef c ency and s th seBeyond have e Provenance for an t e Juliana Freire l 49 Provenance-Rich Publications Bridge the gap between the scientific process and publications Results that can be reproduced and validated – Papers with deep captions – Encouraged by ACM SIGMOD and a number of journals Describe more of the discovery process: people only describe successes, can we learn from mistakes? Support dynamic (interactive) publications – Evolve over time – Blog/wiki like=> Science 2.0 – E.g., http://project.liquidpub.org Provenance for Reproducibility and Beyond Juliana Freire 50 Provenance-Rich Documents Provenance for Reproducibility and Beyond Juliana Freire 51 Provenance and Teaching (1) Leverage provenance to improve the way we teach CS and Science [Silva et al., Eurographics 2010] – http://www.vistrails.org/index.php/SciVisFall2008 – Also used at UNC, Linkoping, UTEP – Lecture provenance: student can reproduce results Provenance for Reproducibility and Beyond Juliana Freire 52 Provenance and Teaching (2) Homework provenance provides insights regarding – Task complexity and nature: number of actions; structural vs. parameter changes; task duration – Student confusion: large branching factor=lots of trial and error steps [Lins et al., SSDBM 2008] Provenance for Reproducibility and Beyond Juliana Freire 53 Provenance and Teaching (3) Homework provenance helps students and instructors to collaborate – Student is stuck, sends his provenance – Instructor understands studentʼs problem, provides hints---student can see what instructor did! – They can also collaborate in real time [Ellkvist et al., IPAW 2008] Provenance for Reproducibility and Beyond Juliana Freire 54 Using Provenance to Teach Electronic Media [Langefeld and Kessler, 2009] “[...] The students have gotten to the point where they demand the VisTrails files for every demonstration just after I complete [it]” “[...] students used [a vistrail instead of a reference model] 62% of the time” Provenance for Reproducibility and Beyond Juliana Freire 55 Using Provenance to Teach Electronic Media http://www.cs.utah.edu/~juliana/videos/maya_playback_slow.mov Provenance for Reproducibility and Beyond Juliana Freire 56 Using Provenance to Teach Electronic Media [Langefeld and Kessler,2009] “[...] The students have gotten to the point where they demand the VisTrails files for every demonstration just after I complete [it]” “[...] students used [a vistrail instead of a reference model] 62% of the time” “Students who used provenance produced higherquality models” Provenance for Reproducibility and Beyond Juliana Freire 57 Science 2.0 Web 2.0 technologies has opened up new opportunities to improve collaboration and information sharing in science [Shneiderman, Science 2008; Waldrop, Scientific American 2008] Need provenance: determine authorship, enforce intellectual property rights, validate the integrity of artifacts and assess their quality, and to reproduce the artifact Social Data Analysis: Share data, processes and provenance Provenance for Reproducibility and Beyond Juliana Freire 58 Conclusions and Future Work Provenance management is crucial in many applications Change-based provenance model New algorithms and usable tools for querying and re-using provenance information The open-source VisTrails system: provides unique capabilities essential for cyberinfrastructure Sharing provenance creates new opportunities [Freire and Silva, CHI SDA, 2008] – Expose users to different techniques and tools – Users can learn by example; expedite their training; and potentially reduce their time to insight Many challenges and several open computer science questions Provenance for Reproducibility and Beyond Juliana Freire 59 Acknowledgments Thanks to VisTrails and Web&DB groups This work is partially supported by the National Science Foundation, the Department of Energy, an IBM Faculty Award, and a University of Utah Seed Grant. Provenance for Reproducibility and Beyond Juliana Freire 60 Thank you