Querying and Re-Using Workflows with VisTrails

Carlos E. Scheidegger, Huy T. Vo, David Koop, Juliana Freire, Cláudio T. Silva
SCI Institute and School of Computing – University of Utah
{cscheid, hvo, dakoop, juliana, csilva}@cs.utah.edu

ABSTRACT
We show how workflow systems can be augmented to leverage provenance information to enhance usability. In particular, we will demonstrate new mechanisms and intuitive user interfaces designed to allow users to query workflows by example and to refine workflows by analogies. These techniques are implemented in VisTrails, an open-source provenance-enabled scientific workflow system that can be combined with a wide range of tools, libraries, and visualization systems. We will show different scenarios where these techniques can be used to simplify the notoriously hard tasks of creating and refining workflows.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: General

General Terms
Algorithms, Human Factors, Management

Keywords
scientific workflows, visualization, query-by-example, provenance, analogy

1. INTRODUCTION
Computing has been an enormous accelerator to science and has led to an information explosion in many different fields. To analyze and understand scientific data, complex computational processes must be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, and grid and Web services. These processes may generate even more final and intermediate data products, adding to the overflow of information scientists need to deal with. Ad-hoc approaches to data exploration (e.g., Perl scripts) have been widely used in the scientific community, but have serious limitations.
In particular, scientists and engineers need to expend substantial effort managing data (e.g., scripts that encode computational tasks, raw data, data products, and notes) and recording provenance information so that basic questions can be answered, such as: Who created this data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? Not only is this process time-consuming, it is also error-prone. Workflow systems have therefore grown in popularity within the scientific community (see, e.g., [3, 8, 9]). Not only do they support the automation of repetitive tasks, but they can also capture complex analysis processes at various levels of detail and systematically capture provenance information for the derived data products.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada. Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.

While significant progress has been made in unifying computations under the workflow umbrella, workflow systems are notoriously hard to use. They have a steep learning curve: users need to learn programming languages, programming environments, specialized libraries, and best practices for constructing workflows. Often, workflow engineers who have programming expertise construct workflows at the request of scientists. Whereas this modus operandi is acceptable for tasks that require few workflows that will be run many times, the same cannot be said of tasks that are exploratory in nature.
For the latter, a workflow (or a set of workflows) must be iteratively refined as a scientist formulates and tests hypotheses. For example, while mining or creating visualizations of a dataset, a user needs to experiment with different parameter values as well as with different techniques. This process requires domain expertise that a workflow engineer does not have, and it requires expertise in building workflows that a domain scientist seldom has. To bridge this gap, we need systems that facilitate the process of constructing and refining workflows. Usability and the ability to cater to a broad set of users with varying levels of experience are of utmost importance for workflow systems.

The VisTrails system [1, 7] represents our initial attempt to provide support for tasks that involve data exploration through workflows. VisTrails is an open-source provenance-enabled scientific workflow middleware that can be combined with a wide range of tools, libraries, and visualization systems. A new concept we introduced with VisTrails is the notion of provenance of workflows [5]. In contrast to previous workflow systems, which maintained provenance only for the data products generated by workflows, VisTrails treats the workflows themselves as first-class data items and keeps their provenance. We have shown that the provenance of how workflows evolve over time enables a series of operations that simplify exploratory processes. For example, scientists can easily navigate through the space of workflows created for a given exploration task, visually compare workflows and their results, and explore large parameter spaces.

In this demonstration, we show how VisTrails unobtrusively tracks detailed provenance information for exploratory tasks and leverages this data to enhance usability. In particular, we will demonstrate new mechanisms and intuitive user interfaces we designed that allow users to query workflows by example and to refine workflows by analogy [7].
Query-by-example simplifies the task of locating existing workflows with certain structures or parameter settings by allowing the user to specify a query exactly as she would design that feature in a workflow. Using the analogy mechanism, a user can modify a workflow without having to directly edit the workflow specification: she can specify a change by selecting two workflows that exhibit the “before” and “after” and apply a similar change to a third workflow. These interfaces form the basis of an infrastructure that makes it possible for scientists to learn by example, expedites their scientific training, and potentially reduces their time to insight.

2. THE VISTRAILS SYSTEM
VisTrails combines features of both workflow and visualization systems. Like workflow systems, it enables loosely-coupled resources, such as specialized libraries and grid and Web services, to be used in concert. It parallels some visualization systems by providing mechanisms to perform parameter explorations and visually compare multiple results. But unlike these systems, VisTrails was designed to manage exploratory activities, where computational tasks are iteratively refined as users formulate and test hypotheses.

A distinguishing feature of VisTrails is a comprehensive provenance infrastructure. The system maintains detailed history information about the steps followed and data derived in the course of an exploratory task, and it provides novel operations and user interfaces for users to explore and re-use this information. The change-based provenance model adopted by VisTrails maintains the sequence of actions that are applied to workflows (e.g., the addition of a module or the modification of a parameter), akin to a database transaction log. These changes are sufficient to determine the provenance of data products, and they also contain information about how the workflows evolve over time.
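To make the change-based model concrete, the following is a minimal sketch of an action log in which every version records only the change applied to its parent, and a workflow is rebuilt by replaying the actions on the path from the root. The `Action` record, the three action kinds, and the module names (`Reader`, `Isosurface`) are assumptions made for illustration; they are not the VisTrails data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One change to a workflow (hypothetical record, not the VisTrails schema)."""
    version: int      # id of the version this action creates
    parent: int       # version the action was applied to; 0 is the empty root
    kind: str         # "add_module", "set_param", or "delete_module"
    module: str
    param: str = ""
    value: str = ""


def materialize(log, version):
    """Rebuild a workflow by replaying the actions on the path root -> version."""
    by_version = {a.version: a for a in log}
    path = []
    while version != 0:
        action = by_version[version]
        path.append(action)
        version = action.parent
    workflow = {}  # module name -> {parameter name: value}
    for a in reversed(path):
        if a.kind == "add_module":
            workflow[a.module] = {}
        elif a.kind == "set_param":
            workflow[a.module][a.param] = a.value
        elif a.kind == "delete_module":
            workflow.pop(a.module, None)
    return workflow


# Versions 3 and 4 both branch from version 2, so the log forms a tree.
log = [
    Action(1, 0, "add_module", "Reader"),
    Action(2, 1, "add_module", "Isosurface"),
    Action(3, 2, "set_param", "Isosurface", "level", "0.5"),
    Action(4, 2, "set_param", "Isosurface", "level", "0.8"),
]
v2 = materialize(log, 2)
v3 = materialize(log, 3)
v4 = materialize(log, 4)
```

In this sketch the two sibling versions share the actions up to version 2 and differ only in the last recorded change, which is what lets the log serve both as a version history and as a record of how each result was produced.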
The change-based model is both simple and compact: it uses substantially less space than the alternative of storing multiple versions of a workflow. The model is also extensible. The underlying algebra of actions can be customized to support provenance capture at different granularities.

We refer to the detailed provenance of the workflow evolution as a visual trail, or a vistrail. A tree-based view of a vistrail allows a scientist to return to a previous version in an intuitive way, to undo bad changes, to compare different workflows, and to be reminded of the actions that led to a particular result (see Figure 1). This, combined with a caching strategy that eliminates redundant computations, allows the scientist to efficiently explore a large number of related workflows and their results [1].

Figure 1: The version tree stores the complete evolution of a collection of workflows. Each node corresponds to a workflow, and the edges show how the workflows are derived.

3. EXPLORING WORKFLOWS
Below we give an overview of features of VisTrails that allow users to query and re-use provenance information. A more detailed description is given in [7].

3.1 Query by Example
A major problem with many computational tasks is that code is written once and rarely re-used. Often, code is tailored to a specific problem, and in trying to solve a new problem, it is hard to locate existing code that is relevant. Workflows alleviate this problem in part by promoting the adoption of a service-oriented architecture, which leads to re-use. But to enable users to find existing code, a query interface is needed. Workflows are represented as graphs: modules connected by input/output ports, which carry the data type and meaning. Consequently, using text-based languages (e.g., SQL) is not desirable, because it effectively requires that a subgraph query be encoded as text.
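As an illustration of why a graph-shaped query is natural here, the sketch below stores a workflow as typed modules plus connections and expresses a query as a smaller graph of the same shape; matching is then a search for occurrences of the query inside the workflow. The dictionary layout, the module types, and the backtracking search are assumptions chosen for brevity (ports and parameter filters are omitted), not the VisTrails implementation.

```python
def find_matches(workflow, query):
    """Return every mapping of query modules to workflow modules that
    preserves module types and connections (simple backtracking search)."""
    q_ids = list(query["modules"])
    results = []

    def extend(mapping):
        if len(mapping) == len(q_ids):
            results.append(dict(mapping))
            return
        q = q_ids[len(mapping)]
        for w, w_type in workflow["modules"].items():
            # A workflow module can be used once and must have the same type.
            if w in mapping.values() or w_type != query["modules"][q]:
                continue
            mapping[q] = w
            # Every query connection with both ends mapped must exist.
            ok = all(
                (mapping[s], mapping[t]) in workflow["connections"]
                for s, t in query["connections"]
                if s in mapping and t in mapping
            )
            if ok:
                extend(mapping)
            del mapping[q]

    extend({})
    return results


# Hypothetical workflow: one reader feeding two isosurface modules,
# only one of which is connected to a renderer.
workflow = {
    "modules": {"m1": "FileReader", "m2": "Isosurface",
                "m3": "Renderer", "m4": "Isosurface"},
    "connections": {("m1", "m2"), ("m2", "m3"), ("m1", "m4")},
}
# Query drawn like a workflow fragment: an Isosurface feeding a Renderer.
query = {
    "modules": {"q1": "Isosurface", "q2": "Renderer"},
    "connections": {("q1", "q2")},
}
matches = find_matches(workflow, query)
```

Each returned mapping identifies which part of the workflow satisfied the query, which is the information a visual interface would need in order to highlight the matching portion.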
The VisTrails query-by-example mechanism eliminates the need to learn a new query language or decompose workflow graphs into SQL syntax: users build queries exactly as they would build pieces of a workflow. In addition to being able to define the structure of the workflow, users can choose to search specific parameters with a set of filters. Finally, the results are displayed visually: each workflow version that matches is highlighted along with the corresponding portion of each matching workflow.

3.2 Workflow Analogies
In exploratory tasks, it is important to understand what the differences between workflows are, especially if multiple people are collaboratively exploring data. Computing the differences between two workflows by considering only their underlying graph structure turns out to be impractical (the problem is closely related to subgraph isomorphism, which is NP-complete). However, using the change-based provenance model, this problem becomes much easier and can be solved in linear time [5]. VisTrails provides a visual difference mechanism that allows users to compare two workflows by coloring modules and connections according to which workflow they belong to. If modules occur in both, any parameter differences between them can be displayed. This difference is extremely useful for users because it offers a template for applying a similar set of changes to another workflow.

Workflow analogies automate these changes by flexibly updating workflows according to a change-based template. Just as in the vocabulary analogy “dog is to puppy as cat is to ?”, workflow analogies discover the relationship between the first two entities and construct an answer by applying this difference to the third entity. As illustrated in Figure 2, we accomplish this by: determining the changes that were made from the linked pair; performing a match between the two “starting” workflows; mapping the differences through the derived match; and applying the new set of changes to the third workflow to produce a new workflow.

Figure 2: Given a workflow (A) that renders a protein, another (B) that generates an HTML report for a protein, and a third (C) that dynamically obtains protein data and produces an improved protein rendering, we generate a new workflow (D) that produces a web report using improved rendering via a workflow analogy.

Workflow Differences. The first ingredient in computing analogies is a method for computing differences between a pair of workflows. As described earlier, VisTrails can efficiently compute the difference between the two linked workflows by determining the sequence of changes from one workflow to another.

Workflow Matching. In addition to finding the changes we wish to apply, we need to identify a correspondence between the two starting workflows. Because in VisTrails workflows are directed acyclic graphs, this problem is equivalent to graph matching. Unfortunately, this problem is NP-complete and cannot be efficiently approximated within a subpolynomial factor [2]. However, because modules have well-defined semantics, we can model this in a probabilistic manner with a good likelihood of success. We need to balance local compatibility between modules with the similarity of the global topologies. In order to do so, we first score each pair of modules based on how well they match, and then diffuse these compatibility measures through the product graph of the two workflows. The score of a pair of modules depends on the inputs these modules accept and the outputs they produce. We diffuse this score using an algorithm reminiscent of PageRank [4]. In our algorithm, this diffusion is performed on the product graph, similar to the similarity flooding strategy used to match database schemas [6].

Applying the Analogy.
After computing the difference ∆(A, B) and the matching M(A, C), we need to translate the difference according to the matching M. To do so, we translate each individual change through the derived matching. Specifically, if one change is to add a connection between modules a and b, and the matching specifies that a and b in workflow A are equivalent to modules c and d in workflow C, the change becomes adding a connection between modules c and d. The translated changes are then applied to C to create a new workflow D. Figure 2 shows an example of the entire process.

Of course, there are cases where analogies do not make sense. Some workflows cannot be matched because there is not enough information or because they are too different. In addition, some actions will not make sense after translation; such changes are discarded. However, even when they are not perfect, analogies can provide a useful starting point for users trying to incorporate new techniques.

4. DEMONSTRATION OVERVIEW
In this demonstration, we will show the power of using analogies and query by example in workflow systems by presenting a set of examples that emphasize the usability of the techniques and how they enable knowledge re-use in the composition of complex workflows. We will use scenarios from real applications in cosmology, environmental observation systems, bioinformatics, and radiation oncology treatment planning.

One of the key applications of the interfaces described above is the semi-automatic addition of new functionality to an existing workflow. We will demonstrate how this can be achieved by querying a database of existing workflows and selecting a pair of workflows whose difference encompasses the new feature to be added to a third workflow. We will also show how complex workflows can be constructed by a sequence of analogy steps. Finally, we will discuss the robustness and applicability of analogies in real applications.

VisTrails can be downloaded from http://www.vistrails.org.
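The analogy step used in these scenarios (translate each change from A to B through a matching between A and C, discard what cannot be translated, and apply the rest to C) can be sketched as follows. This is a simplified illustration under stated assumptions: changes are limited to adding or deleting connections between modules that already exist in A, the matching is given as a plain dictionary rather than computed, and all names are hypothetical rather than taken from VisTrails.

```python
def translate(change, matching):
    """Rewrite one change from workflow A's module ids to workflow C's.
    Returns None when a referenced module has no counterpart in C."""
    kind, *modules = change
    try:
        return (kind, *[matching[m] for m in modules])
    except KeyError:
        return None


def apply_analogy(delta_ab, matching_ac, connections_c):
    """Apply the translated changes to C's connections and return D's."""
    connections_d = set(connections_c)
    for change in delta_ab:
        translated = translate(change, matching_ac)
        if translated is None:
            continue  # the change does not make sense in C, so it is discarded
        kind, src, dst = translated
        if kind == "add_connection":
            connections_d.add((src, dst))
        elif kind == "delete_connection":
            connections_d.discard((src, dst))
    return connections_d


# Difference between A and B: connect a -> b, and drop a -> x.
delta_ab = [("add_connection", "a", "b"), ("delete_connection", "a", "x")]
# Matching between A and C: a ~ c and b ~ d; module x has no counterpart.
matching_ac = {"a": "c", "b": "d"}
connections_c = {("c", "e")}

workflow_d = apply_analogy(delta_ab, matching_ac, connections_c)
```

Here the first change becomes a connection from c to d in the new workflow, while the second is dropped because module x has no match in C.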
A video demonstrating the analogies and query-by-example mechanisms is available at http://www.cs.utah.edu/ juliana/videos/vistrails-analogies.mov.

5. REFERENCES
[1] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C. Silva, and H. Vo. VisTrails: Enabling interactive multiple-view visualizations. In Proceedings of IEEE Visualization, pages 135–142, 2005.
[2] J. Hastad. Clique is hard to approximate within n^{1−ε}. Acta Mathematica, 182:105–142, 1999.
[3] The Kepler Project. http://kepler-project.org.
[4] A. N. Langville and C. D. Meyer. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
[5] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, pages 10–18, 2006. Invited paper.
[6] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.
[7] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and C. T. Silva. Querying and creating visualizations by analogy. IEEE Transactions on Visualization and Computer Graphics, 13(6):1560–1567, 2007. Papers from the IEEE Information Visualization Conference 2007.
[8] The Taverna Project. http://taverna.sourceforge.net.
[9] The VisTrails Project. http://www.vistrails.org.