Provenance Analytics Using Provenance to Streamline Data Exploration through Workflows Juliana Freire University of Utah http://www.cs.utah.edu/~juliana Provenance in Science !! !! !! !! !! !! Interpret and reproduce results Understand the experiment and chain of reasoning that was used in the production of a result Verify that an experiment was performed according to acceptable procedures Identify the inputs to an experiment were and where they came from Assess data quality Track who performed an experiment and who is responsible for its results Provenance is as (or more!) important as the results Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 2 Provenance in Science: Not a new issue When Lab notebooks have been used for a long time !! –! Reproduce results –! Evidence in patent disputes Annotation Observed data DNA recombination By Lederberg Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 3 Provenance in Science: What’s new? !! Scale and complexity –! Large volumes of data –! Complex analyses and data processing !! Writing notes is no longer an option –! Need systematic means to capture provenance !! Workflows –! Used to describe processes formally –! Automatic provenance capture---most workflow systems capture some form of provenance !! We now have lots of workflows and lots of provenance… Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 4 Provenance Analytics: Opportunities !! !! Benefit from scale! Workflow and provenance repositories –! E.g., Yahoo! Pipes, myExperiment, Indiana repository !! !! Opportunity for knowledge discovery, sharing and re-use Query information –! Understand processes and find useful workflows –! Given a piece of data or task, which workflow should we run? E.g., How do I generate a volume rendering of a CT scan? Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 5 Provenance Analytics: Opportunities (cont.) !! Mine information –! Discover interesting patters (e.g., common workflow patterns) ! recommendation system, discover analogies –! Identify homogeneous workflow groups by clustering ! organize collections –! Infer workflow specification from execution log [Aalst et al., TKDE 2004] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 6 Provenance Analytics: Challenges !! What can we learn from workflow provenance? –! New applications! !! How to model provenance and workflows? –! Many features to consider: graph structure (modules + connections), parameters (module signatures), values, annotations, … –! Are traditional graph mining algorithms effective? Do they scale? Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 7 Outline !! !! !! !! !! !! What is a workflow—a semi-formal definition " Provenance of workflow evolution Clustering workflows Provenance and teaching VisComplete: A workflow a recommendation system Emerging applications Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 8 Scientific Workflows = Dataflows !! Directed graphs describing a computational task –! Vertices = modules = processing steps + parameters –! Edges = connections between output and input ports –! Execution order determined by flow of data from output to input ports Input:! Head.120.iso! Isosurface Output:! value=57 Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 9 Scientific Workflows and Dataflows !! A directed graph describing a computational task –! Vertices = modules = processing steps + parameters –! Edges = connections between output and input ports –! Execution order determined by flow of data from output to input ports !! No state or side effects: Outputs are a function of the inputs I1 O3 = vtkDataSetMapper(input=O2) O2 = vtkContourFilter(value=57,input=O1) O1 = vtkStructuredReader(input=I1) [Lee and Parks, IEEE 1995] O1 O2 O3 Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 10 Scientific Workflows and Dataflows !! A directed graph describing a computational task –! Vertices = modules = processing steps + parameters –! Edges = connections between output and input ports –! Execution order determined by flow of data from output to input ports !! !! No state or side effects: Outputs are a function of the inputs I1 Simple programming model –! Good match for visual programming interfaces –! Widely used: adopted by most scientific workflow and visualization systems –! Easy to optimize and parallelize !! A workflow run is also a DAG Provenance Analytics – Provenance in Workflows Symposium, 2008 O1 O2 O3 Freire 11 Dataflow Example: VisTrails Interactive visualization of a volumetric data set Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 12 Dataflow Example: Taverna Kegg pathways relevant to a gene Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 13 Dataflow Example: Yahoo! Pipes Yahoo! Local results for Napa Wineries with images from Flickr taken nearby Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 14 Workflows and Computer Programs Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 15 Workflows and Computer Programs Program Workflow Document Database <Book> <Title>The Advanced Html Companion</Title> <Author> Keith Schengili-Roberts </Author> <Author> Kim Silk-Copeland</Author> <Price> 35.96</Price>… </Book> A program is to a workflow what an unstructured document is to a (structured) database. Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 16 Workflows and Computer Programs Program Workflow The Beauty of Structure Unstructured Document Database Structured <Book> <Title>The Advanced Html Companion</Title> <Author> Keith Schengili-Roberts </Author> <Author> Kim Silk-Copeland</Author> <Price> 35.96</Price>… </Book> A program is to a workflow what an unstructured document is to a (structured) database. Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 17 The Beauty of Structure Parameter exploration QBE Analogies Visual diff Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 18 Workflows and Computer Programs !! !! !! Programs are created by programmers Scientific workflows are created by scientists, who do not necessarily have programming expertise Usability: –! Simpler programming paradigms and visual programming interfaces –! Different levels of abstraction Databases External Schema Semantic WF Logical Schema Abstract WF Physical Schema Executable WF Provenance Analytics – Provenance in Workflows Symposium, 2008 Workflows Freire 19 Provenance of Workflow Evolution Keeping Exploration Trails Trail Workflows Provenance Analytics – Provenance in Workflows Symposium, 2008 Data Products Freire 21 Change-Based Provenance !! !! Records user actions Provenance = changes to computational tasks –! Add a module, add a connection, change a parameter value !! spx ! countour Extensible change algebra addModule deleteConnection addConnection addConnection setParameter [Freire et al., IPAW2006] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 23 Change-Based Provenance !! !! Records user actions Provenance = changes to computational tasks –! Add a module, add a connection, change a parameter value Extensible change algebra !! A vistrail node vt corresponds to the workflow that is constructed by the sequence of actions from the root to vt vt = xn ! xn-1 ! … ! x1 ! Ø !! vistrail x1 x2 x3 [Freire et al, IPAW 2006] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 24 Exploring the Change Space !! !! Scripting workflows: Parameter explorations are simple to specify and apply Exploration of parameter space for a workflow vt (setParameter(idn,valuen) ! … ! (setParameter(id1,value1) ! vt ) !! Exploration of multiple workflow specifications (addModule(idi,…) ! (deleteModule(idi) ! v1 ) … (addModule(idi,…) ! (deleteModule(idi) ! vn ) !! !! Results can be conveniently compared in the VisTrails spreadsheet or as an animation Caching to avoid redundant computations [Bavoil et al., IEEE Vis 2005] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 25 Computing Workflow Differences No need to compute subgraph isomorphism! !! A vistrail is a rooted tree: all nodes have a common ancestor—diffs are well -defined and simple to compute vt1 = xi ! xi-1 ! … ! x1 ! Ø vt2 = xj ! xj-1 ! … ! x1 ! Ø vt1-vt2 = {xi, xi-1, …, x1, Ø} – {xj, xj-1, …,x1 , Ø} !! Different semantics: !! –! Exact, based on ids –! Approximate, based on module signatures Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 26 Change-Based Provenance: Summary !! !! !! General: Works with any system that has undo/redo! Concise representation of provenance Uniformly captures data and workflow provenance –! Data provenance: where does a specific data product come from? –! Workflow evolution: how has workflow structure changed over time? !! !! !! Results can be reproduced Detailed information about the exploration process Provenance beyond reproducibility: –! Scientists can return to any point in the exploration space –! Scalable exploration of the parameter space—results can be compared side-by-side in the spreadsheet –! Support for collaboration –! Understand problem-solving strategies—knowledge re-use Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 27 VisTrails: Managing Exploration !! Comprehensive provenance infrastructure for computational tasks –! Data + workflow provenance –! Treat workflow as a 1st-class data product !! Support for exploratory tasks such as visualization and data mining –! Task specification iteratively refined as users generate and test hypotheses !! Focus on usability—build tools for scientists http://www.vistrails.org Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 28 Provenance in Teaching What can you learn from provenance? Provenance and Teaching (1) !! Leverage provenance to improve the way we teach CS and Science –! http://www.vistrails.org/index.php/SciVisFall2008" !! Lecture provenance: student can reproduce results" Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 30 Provenance and Teaching (2) !! Homework provenance: Workflow evolution provenance provides insights regarding" –! Task complexity and nature: number of actions; structural vs. parameter changes; task duration" –! Student confusion: large branching factor=lots of trial and error steps" !! Very detailed (and honest!) feedback: instructors can leverage this information" [Lins et al., SSDBM 2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 31 Organizing Workflow and Provenance Collections Organizing Provenance and Workflows !! Simplify the process of searching for useful data –! Cluster search results, e.g, grokker.com –! Create a browseable directory, e.g., DMOZ A first study: !! How to model workflows? !! Which distance measures make sense? !! Which clustering algorithms are effective? Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 33 Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 34 Workflow Representations !! Bag of words vs. graph representation Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 35 Measuring Similarity Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 36 Experimental Evaluation !! !! !! Compare different approaches to modeling workflows: graphs vs. bag of words Experiment with different clustering algorithms Dataset: 1031 distinct workflows, (semi) manually tagged [Santos et al., IPAW 2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 37 Experimental Results: VS vs. MCIS Results are similar Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 38 Experimental Results (cont.) Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 39 Experimental Results: Summary !! !! !! Vector-space based representation produced results correlated to results obtained using a more expensive structural representation Although there is a large number of module pairs, but very few are compatible and can be directly connected Future work: use other workflow collections; consider additional features; alternative distance metrics; … [Santos et al., IPAW 2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 40 VisComplete: Auto-Completion for Workflows Motivation Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 42 VisComplete: A Workflow Recommendation System !! !! Identify graph fragments that co-occur in a collection of workflows Predict sets of likely workflow additions to a given partial workflow [Koop et al., IEEE Vis 2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 43 VisComplete: A Workflow Recommendation System !! !! Similar to a Web browser suggesting URL completions Idea applicable to integration queries [Sarah Cohen -Boulakia et a., JBCB 2006; Talukdar et al., VLDB 2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 44 Results Cut the number of actions significantly, and therefore save time Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 45 VisComplete: Demo [Koop et al., IEEE Vis2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 46 Emerging Applications and Research Directions The Provenance-Enabled Paper !! !! Bridge the gap between the scientific process and publications Results that can be reproduced and validated –! Papers with deep captions –! Encouraged by ACM SIGMOD and a number of journals !! !! Describe more of the discovery process: people only describe successes, can we learn from mistakes? Dynamic (interactive) publications –! Evolve over time –! Blog/wiki like=> Science 2.0 –! E.g., http://project.liquidpub.org !! Need tools to support this! Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 48 Towards Provenance-Enabled Documents Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 49 Conclusions !! Provenance is crucial in many applications –! There are even legal requirements, e.g., Sarbanes-Oxley !! Provenance support is appearing in many tools –! Oracle Total Recall: transparent solution for long-term storage and audit of historical data –! Adobe Buzzword: multi-user online word processor controls versions and keeps track of changes !! The ability to share provenance at a large scale can expose scientists to different techniques and tools that they would not otherwise have access to –! scientists can learn by example; expedite their scientific training; and potentially reduce their time to insight [Freire and Silva, CHI SDA, 2008] Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 50 Acknowledgments: Funding !! This work is partially supported by the National Science Foundation, the Department of Energy, an IBM Faculty Award, and a University of Utah Seed Grant. Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 51 Acknowledgments: People !! VisTrails Group –! –! –! –! –! –! –! –! –! –! –! Claudio Silva Erik Anderson Jason Callahan Steven P. Callahan Lorena Carlo David Koop Lauro Lins Emanuele Santos Carlos E. Scheidegger Huy T. Vo Geoff Draper Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 52 More info about VisTrails google vistrails Or http://www.vistrails.org Provenance Analytics – Provenance in Workflows Symposium, 2008 Freire 53