Querying and Re-Using Workflows with VisTrails
Carlos E. Scheidegger
Huy T. Vo
David Koop
Juliana Freire
Cláudio T. Silva
SCI Institute and School of Computing – University of Utah
{cscheid, hvo, dakoop, juliana, csilva}@cs.utah.edu
ABSTRACT
We show how workflow systems can be augmented to leverage provenance information to enhance usability. In particular, we will demonstrate new mechanisms and intuitive
user interfaces designed to allow users to query workflows
by example and to refine workflows by analogies. These
techniques are implemented in VisTrails, an open-source
provenance-enabled scientific workflow system that can be
combined with a wide range of tools, libraries, and visualization systems. We will show different scenarios where these
techniques can be used to simplify the notoriously hard tasks
of creating and refining workflows.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: General
General Terms
Algorithms, Human Factors, Management
Keywords
scientific workflows, visualization, query-by-example, provenance, analogy
1. INTRODUCTION
Computing has been an enormous accelerator to science
and has led to an information explosion in many different
fields. To analyze and understand scientific data, complex
computational processes must be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, and grid and Web services. These processes
may generate even more final and intermediate data products, adding to the overflow of information scientists need to
deal with. Ad-hoc approaches to data exploration (e.g., Perl
scripts) have been widely used in the scientific community,
but have serious limitations. In particular, scientists and
engineers need to expend substantial effort managing data
(e.g., scripts that encode computational tasks, raw data,
data products, and notes) and recording provenance information so that basic questions can be answered, such as:
Who created this data product and when? When was it
modified and by whom? What was the process used to create the data product? Were two data products derived from
the same raw data? This process is not only time-consuming
but also error-prone.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.
Workflow systems have therefore grown in popularity within the scientific community (see e.g., [3, 8, 9]). Not only do
they support the automation of repetitive tasks, but they
can also capture complex analysis processes at various levels
of detail and systematically capture provenance information
for the derived data products.
While significant progress has been made in unifying computations under the workflow umbrella, workflow systems
are notoriously hard to use. They require a steep learning
curve: users need to learn programming languages, programming environments, specialized libraries, and best practices
for constructing workflows. Often, workflow engineers who
have programming expertise construct workflows at the request of scientists. Whereas this modus operandi is acceptable for tasks that require few workflows that will be run
many times, the same cannot be said of tasks that are exploratory in nature. For the latter, a workflow (or a set of
workflows) must be iteratively refined as a scientist formulates and tests hypotheses. For example, while mining or
creating visualizations of a dataset, a user needs to experiment with different parameter values as well as with different
techniques. This process requires domain expertise that a
workflow engineer does not have, and it requires expertise in
building workflows which a domain scientist seldom has. To
bridge this gap, we need systems that facilitate the process
of constructing and refining workflows. Usability and the
ability to cater to a broad set of users with varying levels of
experience are of utmost importance for workflow systems.
The VisTrails system [1, 7] represents our initial attempt
to provide support for tasks that involve data exploration
through workflows. VisTrails is an open-source provenance-enabled scientific workflow middleware which can be combined with a wide range of tools, libraries and visualization
systems. A new concept we introduced with VisTrails is the
notion of provenance of workflows [5]. In contrast to previous workflow systems which maintained provenance only for
the data products generated by workflows, VisTrails treats
the workflows themselves as first-class data items and keeps
their provenance. We have shown that the provenance of
how workflows evolve over time enables a series of operations
which simplify exploratory processes, for example: scientists
can easily navigate through the space of workflows created
for a given exploration task; visually compare workflows and
their results; and explore large parameter spaces.
In this demonstration, we show how VisTrails unobtrusively tracks detailed provenance information for exploratory
tasks and leverages this data to enhance usability. In particular, we will demonstrate new mechanisms and intuitive user
interfaces we designed that allow users to query workflows
by example and to refine workflows by analogy [7]. Query-by-example simplifies the task of locating existing workflows
with certain structures or parameter settings by allowing the
user to specify a query exactly as she would design that feature in a workflow. Using the analogy mechanism, a user
can modify a workflow without having to directly edit the
workflow specification: she can specify a change by selecting
two workflows that exhibit the “before” and “after” and apply a similar change to a third workflow. These interfaces
form the basis of an infrastructure that makes it possible
for scientists to learn by example; expedites their scientific
training; and potentially reduces their time to insight.
2. THE VISTRAILS SYSTEM
VisTrails combines features of both workflow and visualization systems. Like workflow systems, it enables
loosely-coupled resources, such as specialized libraries and grid
and Web services, to be used in concert. It parallels some
visualization systems by providing mechanisms to perform
parameter explorations and visually compare multiple results. But unlike these systems, VisTrails was designed to
manage exploratory activities, where computational tasks
are iteratively refined as users formulate and test hypotheses. A distinguishing feature of VisTrails is a comprehensive
provenance infrastructure. The system maintains detailed
history information about the steps followed and data derived in the course of an exploratory task, and it provides novel
operations and user interfaces for users to explore and re-use
this information.
The change-based provenance model adopted by VisTrails
maintains the sequence of actions that are applied to workflows (e.g., the addition of a module or the modification of a
parameter), akin to a database transaction log. These
changes are sufficient to determine the provenance of data
products, and they also contain information about how the
workflows evolve over time. The change-based model is both
simple and compact—it uses substantially less space than
the alternative of storing multiple versions of a workflow.
The model is also extensible. The underlying algebra of actions can be customized to support provenance capture at
different granularities. We refer to the detailed provenance
of the workflow evolution as a visual trail, or a vistrail.
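The change-based model can be illustrated with a minimal sketch. The class names, the tuple encoding of actions, and the four action kinds below are assumptions made for illustration; they are not the VisTrails implementation or its API. Each version stores only one action and a pointer to its parent, and a workflow is rebuilt by replaying the log from the root.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Version:
    """One node of the version tree: a single action applied to its parent."""
    parent: Optional[int]
    action: tuple  # hypothetical encoding, e.g. ("add_module", "reader")


@dataclass
class Vistrail:
    versions: dict = field(default_factory=lambda: {0: Version(None, ("root",))})

    def commit(self, parent, action):
        """Record one action on top of `parent` and return the new version id."""
        vid = len(self.versions)
        self.versions[vid] = Version(parent, action)
        return vid

    def actions_to(self, vid):
        """Actions on the path from the root to `vid`, oldest first."""
        path = []
        while vid is not None and vid != 0:
            path.append(self.versions[vid].action)
            vid = self.versions[vid].parent
        return path[::-1]

    def materialize(self, vid):
        """Replay the log to rebuild the workflow at version `vid`."""
        modules, params, connections = set(), {}, set()
        for kind, *args in self.actions_to(vid):
            if kind == "add_module":
                modules.add(args[0])
            elif kind == "delete_module":
                modules.discard(args[0])
                connections = {c for c in connections if args[0] not in c}
                params = {k: v for k, v in params.items() if k[0] != args[0]}
            elif kind == "add_connection":
                connections.add((args[0], args[1]))
            elif kind == "set_parameter":
                params[(args[0], args[1])] = args[2]
        return modules, params, connections


vt = Vistrail()
v1 = vt.commit(0, ("add_module", "reader"))
v2 = vt.commit(v1, ("add_module", "renderer"))
v3 = vt.commit(v2, ("add_connection", "reader", "renderer"))
v4 = vt.commit(v3, ("set_parameter", "renderer", "opacity", 0.5))
v5 = vt.commit(v3, ("set_parameter", "renderer", "opacity", 0.9))  # sibling branch
```

Versions `v4` and `v5` share the first three actions and differ in one stored action each, which is the source of the space savings over storing every workflow in full.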
A tree-based view of a vistrail allows a scientist to return to a previous version in an intuitive way, to undo
bad changes, to compare different workflows, and to be reminded of the actions that led to a particular result (see
Figure 1). This, combined with a caching strategy that
eliminates redundant computations, allows the scientist to
efficiently explore a large number of related workflows and
their results [1].
3. EXPLORING WORKFLOWS
Below we give an overview of features of VisTrails that
allow users to query and re-use provenance information. A
more detailed description is given in [7].
Figure 1: The version tree stores the complete evolution of a collection of workflows. Each node corresponds to a workflow; the edges show how the workflows are derived.
3.1 Query by Example
A major problem with many computational tasks is that
code is written once and rarely re-used. Often, code is tailored to a specific problem, and in trying to solve a new
problem, it is hard to locate existing code that is relevant.
Workflows alleviate this problem in part by promoting the
adoption of a service-oriented architecture which leads to
re-use. But to enable users to find existing code, a query
interface is needed.
Workflows are represented as graphs: modules connected
by input/output ports, which carry the data type and meaning. Consequently, using text-based languages (e.g., SQL)
is not desirable, because it effectively requires that a subgraph query be encoded as text. The VisTrails query-by-example mechanism eliminates the need to learn a new query
language or decompose workflow graphs into SQL syntax:
Users build queries exactly as they would build pieces of a
workflow. In addition to being able to define the structure of
the workflow, users can choose to search specific parameters
with a set of filters. Finally, the results are displayed visually: each workflow version that matches is highlighted along
with the corresponding portion of each matching workflow.
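To make the idea concrete, the following sketch treats a query as a small workflow fragment whose modules may carry parameter filters, and searches each stored workflow for a matching subgraph by backtracking. The dictionary layout, the module type names, and the function `match_query` are hypothetical and chosen only for illustration; the matching procedure used by VisTrails is described in [7].

```python
def match_query(query, workflow):
    """Return one mapping {query module id -> workflow module id}, or None."""
    q_ids = list(query["modules"])

    def compatible(q, w):
        # Same module type, and every parameter filter of the query holds.
        q_type, q_filters = query["modules"][q]
        w_type, w_params = workflow["modules"][w]
        if q_type != w_type:
            return False
        return all(name in w_params and test(w_params[name])
                   for name, test in q_filters.items())

    def extend(mapping):
        if len(mapping) == len(q_ids):
            return dict(mapping)
        q = q_ids[len(mapping)]
        for w in workflow["modules"]:
            if w in mapping.values() or not compatible(q, w):
                continue
            mapping[q] = w
            # Every query connection with both endpoints mapped must exist.
            ok = all((mapping[a], mapping[b]) in workflow["connections"]
                     for a, b in query["connections"]
                     if a in mapping and b in mapping)
            if ok:
                found = extend(mapping)
                if found is not None:
                    return found
            del mapping[q]
        return None

    return extend({})


workflows = {
    "v12": {"modules": {1: ("FileReader", {"path": "a.vtk"}),
                        2: ("Isosurface", {"value": 57.0}),
                        3: ("Renderer", {})},
            "connections": {(1, 2), (2, 3)}},
    "v13": {"modules": {1: ("FileReader", {"path": "a.vtk"}),
                        2: ("VolumeRender", {}),
                        3: ("Renderer", {})},
            "connections": {(1, 2), (2, 3)}},
}

# The query is built like a workflow: an Isosurface feeding a Renderer,
# with a filter on the isosurface value.
query = {"modules": {"q1": ("Isosurface", {"value": lambda v: v > 50}),
                     "q2": ("Renderer", {})},
         "connections": {("q1", "q2")}}

hits = {name: m for name, wf in workflows.items()
        if (m := match_query(query, wf)) is not None}
```

The returned mapping identifies which portion of each matching workflow corresponds to the query, which is the information a user interface would need in order to highlight it.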
3.2 Workflow Analogies
In exploratory tasks, it is important to understand what
the differences between workflows are, especially if multiple
people are collaboratively exploring data. Computing the
differences between two workflows by considering their underlying graph structure turns out to be impractical (the
problem can be reduced to subgraph isomorphism). However, using the change-based provenance model, this problem becomes much easier and can be solved in linear time [5].
VisTrails provides a visual difference mechanism that allows
users to compare two workflows by coloring modules and
connections according to which workflow they belong to. If
modules occur in both, any parameter differences between
them can be displayed. This difference is extremely useful
for users because it offers a template for applying a similar
set of changes to another workflow.
Workflow analogies automate these changes by flexibly
updating workflows according to a change-based template.
Just as in the vocabulary analogy “dog is to puppy as cat is
to ?”, workflow analogies discover the relationship between
the first two entities and construct an answer by applying
this difference to the third entity. As illustrated in Figure 2,
we accomplish this by: determining the changes that were
made from the linked pair; performing a match between the
two “starting” workflows; mapping the differences through
the derived match; and applying the new set of changes to
the third workflow to produce a new workflow.
Figure 2: Given a workflow (A) that renders a protein, another (B) that generates an HTML report for a
protein, and a third (C) that dynamically obtains protein data and produces an improved protein rendering,
we generate a new workflow (D) that produces a web report using improved rendering via a workflow analogy.
Workflow Differences. The first ingredient in computing
analogies is a method for computing differences between a
pair of workflows. As described earlier, VisTrails can efficiently compute the difference between the two linked workflows by determining the sequence of changes from one workflow to another.
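One way to read this step is sketched below, under the assumption that both workflows are versions in the same version tree: walk from each version toward the root, find the nearest common ancestor, and collect the actions on the two branches. The names `parents`, `actions`, and `difference` are illustrative and not part of VisTrails. The cost is linear in the length of the two paths.

```python
def path_to_root(parents, v):
    """Version ids from `v` up to the root, `v` first."""
    path = [v]
    while parents[v] is not None:
        v = parents[v]
        path.append(v)
    return path


def difference(parents, actions, a, b):
    """Return (ancestor, undo, redo): actions to undo going up from `a`
    to the common ancestor, then actions to redo going down to `b`."""
    up = path_to_root(parents, a)
    down = path_to_root(parents, b)
    on_b_path = set(down)
    ancestor = next(v for v in up if v in on_b_path)
    undo = [actions[v] for v in up[:up.index(ancestor)]]
    redo = [actions[v] for v in reversed(down[:down.index(ancestor)])]
    return ancestor, undo, redo


# A small version tree: 0 is the root; versions 3 and 4 both derive from 2.
parents = {0: None, 1: 0, 2: 1, 3: 2, 4: 2}
actions = {1: ("add_module", "reader"),
           2: ("add_module", "renderer"),
           3: ("set_parameter", "renderer", "opacity", 0.5),
           4: ("add_module", "isosurface")}

ancestor, undo, redo = difference(parents, actions, 3, 4)
```

Here the difference between versions 3 and 4 consists of one parameter change to undo and one module addition to redo; no graph comparison is involved.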
Workflow Matching. In addition to finding the changes
we wish to apply, we need to identify a correspondence
between the two starting workflows. Because in VisTrails
workflows are directed acyclic graphs, this problem is equivalent to graph matching. Unfortunately, this problem is NP-complete and cannot be efficiently approximated within a
subpolynomial factor [2]. However, because modules have
well-defined semantics, we can model this in a probabilistic manner with a good likelihood for success. We need to
balance local compatibility between modules with the similarity of the global topologies. In order to do so, we first
score each pair of modules based on how well they match,
and then diffuse these compatibility measures through the
product graph of the two workflows. The score of a pair of
modules depends on the inputs these modules accept and
the outputs they produce. We diffuse this score using an algorithm reminiscent of PageRank [4]. In our algorithm, this
diffusion is performed on the product graph, similar to the
similarity flooding strategy to match database schemas [6].
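The following is a simplified sketch of this idea, not the algorithm of [7]. It assumes that local compatibility is the Jaccard overlap of port signatures, that diffusion is a damped average over neighbors in the product graph, and that the final correspondence is chosen greedily; the damping factor `alpha`, the iteration count, and all module names are illustrative.

```python
def match_workflows(A, C, alpha=0.5, iterations=20):
    """Return (matching, scores) for the modules of workflows A and C."""

    def local_score(a, c):
        # Jaccard overlap of port signatures: an assumed compatibility measure.
        pa, pc = A["modules"][a], C["modules"][c]
        union = pa | pc
        return len(pa & pc) / len(union) if union else 0.0

    pairs = [(a, c) for a in A["modules"] for c in C["modules"]]
    base = {p: local_score(*p) for p in pairs}

    # Product graph: (a1, c1) and (a2, c2) are neighbors when a1 -> a2 is a
    # connection in A and c1 -> c2 is a connection in C.
    neighbors = {p: [] for p in pairs}
    for a1, a2 in A["connections"]:
        for c1, c2 in C["connections"]:
            neighbors[(a1, c1)].append((a2, c2))
            neighbors[(a2, c2)].append((a1, c1))

    # Diffuse: mix each pair's local score with the mean score of its neighbors.
    score = dict(base)
    for _ in range(iterations):
        score = {
            p: (1 - alpha) * base[p]
            + alpha * (sum(score[n] for n in neighbors[p]) / len(neighbors[p])
                       if neighbors[p] else 0.0)
            for p in pairs
        }

    # Greedy one-to-one assignment, best-scoring pairs first.
    matching, used = {}, set()
    for a, c in sorted(pairs, key=score.get, reverse=True):
        if a not in matching and c not in used:
            matching[a] = c
            used.add(c)
    return matching, score


source, filt, sink = {"out:Mesh"}, {"in:Mesh", "out:Mesh"}, {"in:Mesh"}
A = {"modules": {"reader": source, "smooth": filt, "clip": filt, "renderer": sink},
     "connections": [("reader", "smooth"), ("smooth", "clip"), ("clip", "renderer")]}
C = {"modules": {"fetch": source, "denoise": filt, "cut": filt, "view": sink},
     "connections": [("fetch", "denoise"), ("denoise", "cut"), ("cut", "view")]}

matching, score = match_workflows(A, C)
```

In this example `smooth` and `clip` have identical ports, so local scores alone cannot distinguish `denoise` from `cut`. After diffusion, the pairs that agree with the surrounding topology score higher, which is the balance between local compatibility and global structure described above.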
Applying the Analogy. After computing the difference
Δ(A, B) and the matching M(A, C), we need to translate the
difference according to the matching M. To do so, we
need to translate each individual change through the derived
matching. Specifically, if one change is to add a connection
between modules a and b, and the matching specifies that
a and b in workflow A are equivalent to modules c and d
in workflow C, the change becomes adding a connection between modules c and d. The translated changes are then
applied to C to create a new pipeline D. Figure 2 shows
an example of the entire process. Of course, there are cases
where analogies do not make sense. Some workflows cannot
be matched, either because there is not enough information or because
they are too different. In addition, some actions will not
make sense after translation; such changes are discarded.
However, even when they are not perfect, analogies can provide a useful starting point for users trying to incorporate
new techniques.
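The translation step can be sketched as follows, loosely mirroring the scenario of Figure 2. The action encoding, the module names, and the function `apply_analogy` are assumptions made for this example; in particular, the rule that newly added modules are carried over under their own names is a simplification.

```python
def apply_analogy(delta, matching, target):
    """Translate each action in `delta` through `matching` and apply it to
    `target`. Returns the new workflow and the list of discarded actions."""
    modules = set(target["modules"])
    connections = set(target["connections"])
    name_of = dict(matching)  # modules of A -> modules of C
    discarded = []
    for action in delta:
        kind, *args = action
        if kind == "add_module":
            # A new module has no counterpart in C; carry it over as-is.
            modules.add(args[0])
            name_of[args[0]] = args[0]
        elif kind == "add_connection" and all(m in name_of for m in args):
            connections.add((name_of[args[0]], name_of[args[1]]))
        elif kind == "delete_connection" and all(m in name_of for m in args):
            pair = (name_of[args[0]], name_of[args[1]])
            if pair in connections:
                connections.remove(pair)
            else:
                discarded.append(action)
        else:
            # The action refers to something that has no translation.
            discarded.append(action)
    return {"modules": modules, "connections": connections}, discarded


# Delta(A, B): B adds an HTML report to the rendering workflow A.
delta = [("add_module", "html_report"),
         ("add_connection", "renderer", "html_report"),
         ("delete_connection", "reader", "legend")]  # `legend` is unmatched

# M(A, C): correspondence between modules of A and modules of C.
matching = {"reader": "fetch", "renderer": "improved_renderer"}

C = {"modules": {"fetch", "improved_renderer"},
     "connections": {("fetch", "improved_renderer")}}

D, discarded = apply_analogy(delta, matching, C)
```

The connection added between `renderer` and `html_report` in A becomes a connection between `improved_renderer` and `html_report` in D, while the action that refers to the unmatched module `legend` is discarded.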
4. DEMONSTRATION OVERVIEW
In this demonstration, we will show the power of using
analogies and query by example in workflow systems by
presenting a set of examples that emphasize the usability
of the techniques, and how they enable knowledge re-use
in the composition of complex workflows. We will use scenarios from real applications in cosmology, environmental
observation systems, bioinformatics, and radiation oncology
treatment planning.
One of the key applications of the interfaces described
above is the semi-automatic addition of new functionality to
an existing workflow. We will demonstrate how this can be
achieved by querying a database of existing workflows and
selecting a pair of workflows whose difference encompasses
the new feature to be added to a third workflow. We will
also show how complex workflows can be constructed by
a sequence of analogy steps. Finally, we will discuss the
robustness and applicability of analogies in real applications.
VisTrails can be downloaded from http://www.vistrails.org.
A video demonstrating the analogies and query-by-example
mechanisms is available at
http://www.cs.utah.edu/~juliana/videos/vistrails-analogies.mov.
5. REFERENCES
[1] L. Bavoil, S. Callahan, P. Crossno, J. Freire,
C. Scheidegger, C. Silva, and H. Vo. VisTrails: Enabling
interactive multiple-view visualizations. In Proceedings
of IEEE Visualization, pages 135–142, 2005.
[2] J. Håstad. Clique is hard to approximate within $n^{1-\varepsilon}$.
Acta Mathematica, 182:105–142, 1999.
[3] The Kepler Project. http://kepler-project.org.
[4] A. N. Langville and C. D. Meyer. Google’s PageRank
and Beyond: The Science of Search Engine Rankings.
Princeton University Press, 2006.
[5] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E.
Scheidegger, and H. T. Vo. Managing rapidly-evolving
scientific workflows. In International Provenance and
Annotation Workshop (IPAW), LNCS 4145, pages
10–18, 2006. Invited paper.
[6] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity
flooding: A versatile graph matching algorithm and its
application to schema matching. In ICDE, pages
117–128, 2002.
[7] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and
C. T. Silva. Querying and creating visualizations by
analogy. IEEE Transactions on Visualization and
Computer Graphics, 13(6):1560–1567, 2007. Papers
from the IEEE Information Visualization Conference
2007.
[8] The Taverna Project. http://taverna.sourceforge.net.
[9] The VisTrails Project. http://www.vistrails.org.