VisTrails: Enabling Interactive Multiple-View Visualizations

Louis Bavoil 1, Steven P. Callahan 1, Patricia J. Crossno 3, Juliana Freire 2, Carlos E. Scheidegger 1, Cláudio T. Silva 1,2, Huy T. Vo 1

1 Scientific Computing and Imaging Institute, University of Utah
2 School of Computing, University of Utah
3 Sandia National Laboratories

Figure 1: VisTrails Visualization Spreadsheet. This ensemble shows the surface salinity variation at the mouth of the Columbia River over the period of a day. The green regions represent the fresh-water discharge of the river into the ocean. A single vistrail specification is used to construct this ensemble. Each cell corresponds to an instance of this specification executed using a different timestamp value.

ABSTRACT

VisTrails is a new system that enables interactive multiple-view visualizations by simplifying the creation and maintenance of visualization pipelines, and by optimizing their execution. It provides a general infrastructure that can be combined with existing visualization systems and libraries. A key component of VisTrails is the visualization trail (vistrail), a formal specification of a pipeline. Unlike existing dataflow-based systems, in VisTrails there is a clear separation between the specification of a pipeline and its execution instances. This separation enables powerful scripting capabilities and provides a scalable mechanism for generating a large number of visualizations. VisTrails also leverages the vistrail specification to identify and avoid redundant operations. This optimization is especially useful while exploring multiple visualizations. When variations of the same pipeline need to be executed, substantial speedups can be obtained by caching the results of overlapping subsequences of the pipelines. In this paper, we describe the design and implementation of VisTrails, and show its effectiveness in different application scenarios.

CR Categories: H.5.2 [User Interfaces]: Graphical user interfaces (GUI); I.3.4 [Graphics Utilities]; I.3.8 [Applications]; H.2.8 [Database Applications]: Scientific databases

Keywords: interrogative visualization, dataflow, caching, coordinated views

1 INTRODUCTION

In recent years, with the explosion in the volume of scientific data, we have observed a paradigm shift in how scientists use visualization. Projects such as CORIE, an environmental observation and forecasting system for the Columbia River (http://www.ccalmr.ogi.edu/CORIE), generate and publish on the Web thousands of new images daily which depict river circulation forecasts and hindcasts, as well as real-time sensor data. The growing demand for visualization has led to the development of new and freely available systems [14, 20, 24], which, due to increased computational power, wide availability of inexpensive Graphics Processing Units (GPUs), and more efficient visualization algorithms, allow users to generate and interactively explore complex visualizations. Although these new systems represent a significant improvement in raw efficiency compared to first-generation tools [11, 27], they have important limitations. In particular, they lack the infrastructure to properly manage the pipelines, and they do not exploit optimization opportunities during pipeline execution. As a result, the creation, maintenance, and exploration of visualization data products are major bottlenecks in the scientific process, limiting the scientists' ability to fully exploit their data.
Exploring data through visualization requires scientists to assemble dataflows that apply sequences of operations over a set of data products. Often, insight comes from comparing the results of a variety of visualizations [23]. For example, Figure 1 shows a set of CORIE images used to study the salinity variation at the mouth of the Columbia River. Unfortunately, today the process to create these complex ensembles is cumbersome and time-consuming. Although individual dataflows can be constructed using point-and-click interfaces in systems such as IBM Data Explorer (DX) [11], Advanced Visual Systems' AVS [27], SCIRun [20], and ParaView [14], executing variations of a given pipeline with different parameters requires users to manually modify the parameters through the user interface. This mechanism is clearly not scalable for generating more than a few visualizations. Moreover, as the parameters are modified, the previous values are forgotten. This places the burden on the scientist to first construct the visualizations and then to remember which values led to a particular image.

Another source of complexity in this process comes from the visualization operations themselves. These operations are often computationally and memory-intensive. As a result, visualization pipelines may comprise long sequences of expensive steps with large temporary results, limiting a user's ability to interactively explore their data. This problem is compounded in a multi-view scenario. Although current systems use dataflows to represent pipelines, they do not fully exploit key features of the dataflow model. Notably, they may recompute the same results over and over again if so defined in the dataflow.

VisTrails enables interactive multiple-view visualizations by simplifying the creation and maintenance of visualization pipelines, and by optimizing their execution. Our design goals included:

• creating an infrastructure that maintains the provenance of a large number of visualization data products;
• providing a general framework for efficiently executing a large number of potentially complex visualization pipelines;
• providing a user interface that simplifies multiple-view comparative visualization.

Instead of creating yet another visualization system, our goal was to build an extensible and flexible infrastructure that can leverage (and be combined with) existing systems (e.g., [11, 14, 20, 27]). Here, we describe our initial prototype that was built on top of Kitware's Visualization Toolkit (VTK) [24].

In our system, the visualization trail (or vistrail) is a formal specification of a pipeline. It provides not only a record of the provenance for the generated image (or images), but it can also be automatically executed. Unlike existing dataflow-based systems, in VisTrails there is a clear separation between the specification of a pipeline and its execution instances. This separation enables powerful scripting capabilities and provides a scalable mechanism for generating a large number of visualizations – the specification serves as a template and users can execute it using different parameters. For example, Figure 1 illustrates a multi-view visualization built using a single vistrail specification executed over distinct timestamp values. VisTrails also leverages the vistrail specification to identify and avoid redundant operations.
This optimization is especially useful for multiple-view visualizations, where vistrails may contain overlapping subsequences (e.g., when variations of the same trail are executed). By reusing results computed for overlapping subsequences of different pipelines, VisTrails provides improved interactivity.

We represent vistrail specifications using XML. As we discuss in Section 3.1, this has many benefits. In particular, it allows users to query the specifications and their instances, and it greatly simplifies the sharing of visualization pipelines.

In Section 2 we review related work. In Section 3, we describe the design and implementation of VisTrails. In Section 4, we illustrate the benefits and improved performance achieved by VisTrails in three different application scenarios. We close in Section 5, where we outline directions for future work.

2 RELATED WORK

Several systems are available for creating and executing visualization pipelines [11, 14, 20, 24, 27]. Most of these systems use a dataflow model, where a visualization is produced by assembling visualization pipelines out of basic modules. They typically provide easy-to-use interfaces to create the pipelines. However, as discussed above, these systems lack the infrastructure to properly manage a large number of pipelines, and they often apply naïve execution models that do not exploit optimization opportunities.

Figure 2: VisTrails Architecture. Users create and edit visualization pipelines using the Vistrail Builder user interface. The vistrail specifications are saved in the Vistrail Repository. Users may also interact with saved vistrails by importing them into the Visualization Spreadsheet. Each cell in the spreadsheet represents a view that corresponds to a vistrail instance; users can modify the parameters of a vistrail as well as synchronize parameters across different cells. Vistrail executions are controlled by the Vistrail Cache Manager, which keeps track of operations that are invoked and their respective parameters. Only new combinations of operations and parameters are requested from the Vistrail Player, which executes the operations.

A formal model for capturing the visualization exploration process was proposed by Jankun-Kelly et al. [13]. The vistrail model shares many of the features and benefits of this model, including: the ability to record visualization sessions at a higher level than simple logs; the use of an XML dialect to represent a pipeline, which enables easy sharing of results and process information; and the maintenance of visualization provenance. However, there are important differences between our approaches. Whereas the focus of Jankun-Kelly et al. was to design a model for the visualization process, the model is just one component of VisTrails (albeit a very important one). We instead focus on how to effectively use the model both to optimize pipeline execution and to simplify the creation of large visualization ensembles. Currently, in VisTrails, session history can be reconstructed from information stored in a set of vistrail instances.

In the recent literature, several works have addressed different aspects of the management of visualization processes, from maintaining detailed provenance of data exploration processes [15] to grid-enabling visualization systems [3]. These works are complementary, and their ideas can be integrated into VisTrails. Kreuseler et al. [15] proposed a history mechanism for exploratory data mining, but their ideas are also applicable to exploratory visualization.
They use a tree structure to represent the change history, and describe how undo and redo operations can be calculated in this tree structure. This mechanism provides a detailed provenance of the exploratory process. We intend to add a more expressive history mechanism to VisTrails in future work. Since we store vistrail specifications in an XML format, an XML-specific versioning scheme such as the time-stamp-based approach proposed by Buneman et al. [4] is an attractive alternative.

Brodlie et al. [3] describe sgViz, a system that extends IRIS Explorer (http://www.nag.co.uk/Welcome_IEC.html). Their system allows the distributed execution of a visualization pipeline over a set of Grid resources. An important contribution is a layered architecture that allows different "bindings" for a given pipeline specification in each layer. With this architecture, different visualization systems can be easily integrated (in the logical layer) and different execution strategies can be implemented which bind individual operations to places in the grid (in the physical layer). Similar to sgViz, VisTrails can be used as a middleware over different visualization systems, and it also uses an XML dialect to represent visualization pipelines. sgViz's dialect differs from ours in one significant aspect: the vistrail model clearly separates the visualization modules from the flow of operations (the topology of the dataflow network). By separating the connections from the modules we can more easily support different execution models, as well as new control constructs.

type Vistrail    = vistrail [ @id[ String ], name[ String ], Module*, Connect*,
                              annotation[ String ]? ]
type Module      = module [ @id[ String ], name[ String ],
                            function[ @name, @returnType, param* ]+,
                            cacheable[ boolean ]? ]
type Connect     = connect [ @id[ String ], (FilterInput | ObjectInput)* ]
type FilterInput = filterInput [ @id[ String ], name[ String ], function[ String ],
                                 iport[ Integer ], oport[ Integer ] ]
type ObjectInput = objectInput [ @id[ String ], name[ String ] ]

Figure 3: Excerpt of the XML schema for vistrails. This schema captures all information required to re-execute a vistrail. The connect elements represent data dependencies among modules, which are key to automatic caching. The use of XML allows the reuse of standard XML technology, including validating parsers and query languages. It also simplifies the sharing of visualization pipelines.

In the current system, our goal was to optimize the sequential execution of a pipeline (or set of pipelines) by avoiding unnecessary recomputations. Parallel execution and distribution of vistrail operations on the Grid are issues we intend to pursue in future work. It is worth noting that our representation of the trails as a tabled logic program (see Section 3.2) lends itself naturally to automatic parallelization [9].

VisTrails provides an infrastructure that enables the effective use of several visualization techniques which help users explore and understand the underlying information.
In particular, it supports the three primary techniques described by Roberts [22]: multiform visualization, the ability to display the same information in different ways, can be achieved by appropriately setting parameters or modifying a trail (e.g., the top row of Figure 8 shows the Visible Human dataset rendered using isosurfaces, whereas the bottom row uses volume rendering); abstract views, which apply the same visualization algorithm using different simplification criteria, can be constructed by refining and modifying a trail specification; and direct manipulation, which allows objects to be directly interrogated, scaled, and positioned, can be achieved both through the Vistrail Builder and through the Vistrail Spreadsheet – in the latter, different views can be coordinated (see Section 3.4).

The use of spreadsheets for displaying multiple images was proposed in previous work. Levoy's Spreadsheet for Images (SI) [16] is an alternative to the flow-chart-style layout employed by many earlier systems using the dataflow model. SI devotes its screen real estate to viewing data by using a tabular layout and hiding the specification of operations in interactively programmable cell formulas. The 2D nature of the spreadsheet encourages the application of multiple operations to multiple data sets through row- or column-based operations. Chi [5] applies the spreadsheet paradigm to information visualization in his Spreadsheet for Information Visualization (SIV). Linking between cells is done at multiple levels, ranging from object interactions at the geometric level to arithmetic operations at the pixel level. The difficulty with both SI and SIV is that they fail to capture the history of the exploration process, since the spreadsheet only represents the latest state of the system.

The Vistrail Spreadsheet supports concurrent exploration of multiple visualizations. The interface is similar to the one proposed by Jankun-Kelly and Ma [12], and it provides a natural way to explore a multi-dimensional parameter space. The separation between parameters and the actual network makes the vistrail model especially suitable for use in such an interface. Users can change any of the parameters present in a vistrail and create new vistrail instances; they can also synchronize different views over a set of vistrail parameters – changes to this parameter set are reflected in related vistrails shown in different cells of the spreadsheet. In addition, other coordinated visualization mechanisms are possible using the spreadsheet, such as reflection or nested views.

Figure 4: A sample vistrail. This snapshot of the Vistrail Builder consists of a sequence of VTK steps that read a file, compute an isosurface, and render the resulting image. Users create vistrails using the point-and-click interface of the Vistrail Builder – they can add modules, and connect the different modules and ports.

The management of scientific data and processes has attracted a lot of attention in the recent literature (see, e.g., [17, 25]). Although our effort was motivated by visualization, the VisTrails framework is extensible and can also be used for general scientific workflows.

3 SYSTEM ARCHITECTURE AND IMPLEMENTATION

The high-level architecture of VisTrails is shown in Figure 2. The different components of the system are described in detail below.

3.1 Vistrail Data Model

A vistrail specification consists of a sequence of operations used to generate a visualization.
It records the visualization provenance and it can also be used to automatically re-generate the images. A vistrail is stored as an XML document. Figure 3 shows an excerpt of the XML schema [26] for a vistrail. (For simplicity, the figure uses the type syntax of the XML Query Algebra, which is less verbose and more readable than XML Schema.) A vistrail element contains a set of modules, connections between these modules, and an optional annotation element. Each module may contain one or more functions. A function has a returnType and a set of parameters consisting of attribute-value pairs. The optional boolean attribute cacheable indicates whether the module is eligible for caching (see Section 3.2).

An actual vistrail instance is depicted in Figure 4. Each node in the graph corresponds to a module element in the schema, and an edge between two nodes corresponds to a connect element. The connections capture the data dependencies among the different modules. The data generated by a module is output through its oport and can be connected to the iport of a different module. As we discuss below, these dependencies are used by the Cache Manager to avoid redundant computations, as well as to ensure that, when changes are made to a vistrail, only the steps affected by the change are re-executed.

EXEC-VISTRAIL(vistrail)
 1  ms ← SINKS(vistrail)
 2  for m ∈ ms
 3      do EXEC-SUBNETWORK-UP-TO(m)

EXEC-SUBNETWORK-UP-TO(m)
 1  sn ← SUBNETWORK-UP-TO(m)
 2  id ← UNIQUE-ID(sn)
 3  entry ← LOOKUP-CACHE(id)
 4  if entry = NULL
 5      then entries ← {}
 6           ms ← CACHEABLE-MODULES-LEADING-TO(m)
 7           for m ∈ ms
 8               do ne ← EXEC-SUBNETWORK-UP-TO(m)
 9                  ADD-TO-LIST(entries, ne)
10           nn ← TRANSFORM-NETWORK(sn, ms, entries)
11           entry ← PLAYER(nn)
12           ADD-TO-CACHE(id, entry)
13  return entry

Figure 5: Pseudocode for the vistrail execution algorithm with cache management. When a module is to be executed, the VCM first checks its cache. If an entry is found for the module (with the requested parameters), the cached result is returned. Otherwise, the VCM invokes the Vistrail Player (line 11), which executes the appropriate API call and returns the results to the VCM.

The XML representation for vistrails allows the reuse of standard XML tools and technologies. In particular, we use Xerces [29] to parse and validate the vistrail specification, and XML query languages (e.g., XPath [30] and XQuery [2]) to query vistrail specifications and instances. Using an XML query language, a user can query a set of saved vistrails to locate a suitable one for the current task; query saved vistrail instances to locate anomalies documented in annotations of previously generated visualizations; locate data products and visualizations based on the operations applied in a pipeline; cluster vistrails based on different criteria; etc. For example, the XQuery query below lists all vistrails (together with their names, annotations, and functions) that use an isosurface computation and that have been found to contain anomalies:

for $vt in document("vt-repository.xml")/vistrail,
    $module in $vt/module[contains(./name, "vtkContourFilter")]
where $vt[contains(./annotation, "anomal")]
return
  <anomaly>
    {$vt/name}
    {$vt/annotation}
    {$module/function}
  </anomaly>

Another important benefit of using an open, self-describing specification is the ability to share (and publish) vistrails. For example, an image can be published together with its associated vistrail, allowing others to easily reproduce the results.
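Such queries can also be expressed with commodity XML tooling rather than a dedicated XQuery engine. The Python sketch below mirrors the XQuery above over a directory of vistrail documents laid out according to the schema of Figure 3; the repository path and function name are hypothetical, and the sketch is only meant to illustrate how little machinery this kind of query requires.

import glob
import xml.etree.ElementTree as ET

def find_anomalous_isosurface_vistrails(repository_glob="vt-repository/*.xml"):
    """Return vistrails that use vtkContourFilter and whose annotation mentions an anomaly."""
    matches = []
    for path in glob.glob(repository_glob):
        root = ET.parse(path).getroot()                # the <vistrail> element
        annotation = root.findtext("annotation", default="")
        if "anomal" not in annotation:
            continue
        for module in root.findall("module"):
            if "vtkContourFilter" in (module.findtext("name") or ""):
                matches.append({
                    "vistrail": root.findtext("name"),
                    "annotation": annotation,
                    # per the schema, each function carries a name attribute
                    "functions": [f.get("name") for f in module.findall("function")],
                })
    return matches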
The module element in the vistrail schema is general and captures a wide range of applications, from simple scripts to VTK modules and Web services [28]. Note that for applications that do not fit this structure, it is easy to add new kinds of module elements – as long as they fit the dataflow model and appropriate functions are provided to execute these modules (see Section 3.3). This makes VisTrails extensible and, in particular, also suitable for executing general scientific workflows [17].

3.2 The Vistrail Cache Manager

The Vistrail Cache Manager (VCM) schedules the execution of modules in vistrails. As vistrail steps are executed, the VCM stores their results, i.e., a signature for the module (which includes its name and parameter values) together with a handle to the output results. When the VCM identifies previously computed subnetworks in a vistrail, it locally transforms a possibly expensive computation into a constant-time cache lookup. To maximize the utility of the cache, the cached results are shared among vistrails. This allows the transparent elimination of redundant computation in overlapping sequences of different vistrails. Figure 7 illustrates the actions of the cache manager for two vistrails.

UNIQUE-ID(subnetwork)
 1  if IS-ATOMIC(subnetwork)
 2      then return UNIQUE-ID-FROM-ATOM(subnetwork)
 3      else module-ids ← {}
 4           for module ∈ MODULES(subnetwork)
 5               do id ← UNIQUE-ID-FROM-ATOM(module)
 6                  ADD-TO-LIST(module-ids, id)
 7           SORT(module-ids)
 8           conn-ids ← {}
 9           for conn ∈ CONNECTIONS(subnetwork)
10               do omodule ← OUTPUT-PORT-ID(conn)
11                  ufrom ← INDEX-OF-ID(module-ids, omodule)
12                  imodule ← INPUT-PORT-ID(conn)
13                  uto ← INDEX-OF-ID(module-ids, imodule)
14                  id ← UNIQUE-ID-FROM-CONN(ufrom, uto)
15                  ADD-TO-LIST(conn-ids, id)
16           SORT(conn-ids)
17           return (module-ids, conn-ids)

Figure 6: Generation of unique keys for subnetworks. The two sorts (lines 7 and 16) ensure that reordering in the module and connection specifications does not affect the unique ID. The procedures UNIQUE-ID-FROM-ATOM and UNIQUE-ID-FROM-CONN only use information that is necessary to distinguish between dataflows to create the IDs. Lines 10–14 ensure that the connection IDs are computed from the unique module IDs, and not from the specific IDs given in the vistrail. This guarantees that all networks with the same modules and connectivity get the same ID.

The VCM takes as input a vistrail instance (an XML file). To ensure that only new subsequences are executed, the cache manager analyzes the vistrail specification to identify data dependencies between cacheable modules – the dependency information is captured by the connect element in the specification (see Figure 3). Note that we distinguish between cacheable and non-cacheable modules. For some applications, it may not make sense to cache results from all of the modules. When new modules are added to VisTrails, an additional flag may be passed to indicate caching eligibility. For example, VTK classes such as vtkProperty have no output; instead, the entire object is passed to another class.

The dependency computation is done concurrently with the vistrail execution. The pseudocode for the execution procedure is given in Figure 5. When a vistrail is invoked (e.g., it is loaded in a cell of the Visualization Spreadsheet), a depth-first search (DFS) is started from the sinks of the vistrail graph, traversing the graph upstream (in the opposite direction of the connections).
Each step of the DFS collects a subnetwork that is bounded by cacheable modules. The algorithm then calls itself on these modules to ensure their results are in the cache. If values are already present in the cache, their results are immediately returned. When the recursive calls return, the entries list will contain a set of objects from the cache. The function TRANSFORM-NETWORK replaces the entire subnetwork that ends at a cacheable module with a special module that looks up the result in the cache. This result will always be in the cache, since TRANSFORM-NETWORK is always called after the appropriate EXEC-SUBNETWORK-UP-TO calls are issued. This algorithm finds all the needed dependencies and only computes the needed modules. (This can be proven by structural induction on the topological sort of the network.)

Figure 7: In order to execute Vistrail 1, the cache manager first determines the data dependencies among its modules. It then computes a series of subnetworks that generate the intermediate results for this pipeline (steps 1–5). Each intermediate result is associated with a unique identifier in the cache. Gray nodes represent non-cacheable modules; yellow nodes indicate cacheable modules; and red nodes indicate vistrail modules that are replaced with cache lookups. Ghosted modules are not present in the subnetworks, but they contribute to the construction of subnetwork cache keys. When Vistrail 2 is scheduled for execution (step 6), the results for the Reader-Isosurface subnetwork previously computed for Vistrail 1 (in step 4) are reused. Thus, Vistrail 2 requires no expensive computations.

An important issue is how to identify entries in the cache. We associate each entry with a key that encodes all the information necessary to compute the associated cached result. If we store too much information (like node identifiers particular to a certain vistrail), we will unnecessarily restrict data sharing and substantially reduce the utility of the cache. On the other hand, if we fail to store information that influences the computation result, the cache lookup may return incorrect results. We use a hashing scheme that strips unnecessary information from objects and generates unique keys for subnetworks. The pseudocode for the key generation algorithm is shown in Figure 6. Note that this scheme also ensures that when changes are applied to a vistrail, only steps that are affected by the changes are re-executed.

The unique ID is structural in a very specific sense: all subnetworks sharing the same modules, parameters, and connectivity will have the same ID. It encodes all the structural information required to identify redundant operations. Note that there may be structurally different subnetworks that produce the same result. We do not attempt to identify these for two reasons. First, this requires knowledge of the internals of the modules. Second, and most importantly, if we assume that the modules are sufficiently general, the problem becomes undecidable [7]. In fact, treating modules as black boxes is essential to the extensibility of VisTrails. The only requirement is that each module must be stateless: its behavior must be fully captured by its name, input parameters, and description of the upstream modules. (This is precisely what the UNIQUE-ID algorithm captures.) We believe such an assumption is reasonable: mutable state inside modules destroys any guarantees of reproducibility, which are essential for scientific tasks.
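As a minimal sketch of this structural-key idea (the system itself implements the Figure 6 logic in Prolog, as described below; the data layout and function names here are hypothetical), a key can be derived from the sorted module signatures plus the connections expressed as positions in that sorted list, so that vistrail-specific node identifiers never influence the key:

import hashlib

def atom_id(module):
    # module: {"name": str, "params": {str: str}} -- only the name and parameter values matter
    sig = module["name"] + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(module["params"].items()))
    return hashlib.sha1(sig.encode()).hexdigest()

def unique_id(modules, connections):
    # modules: {node_id: module dict}; connections: [(output_node_id, input_node_id), ...]
    # Duplicate modules are ignored here for brevity.
    module_ids = sorted(atom_id(m) for m in modules.values())
    index_of = {nid: module_ids.index(atom_id(m)) for nid, m in modules.items()}
    conn_ids = sorted(f"{index_of[src]}->{index_of[dst]}" for src, dst in connections)
    key = ";".join(module_ids) + "#" + ";".join(conn_ids)
    return hashlib.sha1(key.encode()).hexdigest()

# Two structurally identical subnetworks receive the same key even if their internal
# node identifiers differ, so one can be served from the other's cache entry.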
The dataflow-oriented execution model of a vistrail can be naturally represented as a Datalog program [19]. In our prototype, vistrails are translated into equivalent Datalog programs which are evaluated by the XSB Prolog/deductive database system [21]. The algorithms described in Figures 5 and 6 were implemented in Prolog; and we use XSB's foreign language interface to invoke the Vistrail Player.

3.3 Vistrail Player

The Vistrail Player (VP) receives as input an XML file for a vistrail instance and executes it using the underlying visualization API. Currently, the VP supports VTK classes. It is a very simple interpreter. First, it directly translates the vistrail modules into VTK classes and sets their connections. Then, it sets the correct parameters for the modules according to the parameter values in the vistrail instance. Finally, the resulting network is executed by calling update methods on the sink nodes. The semantics of each particular execution are defined by the underlying API.

The VP is unaware of caching. To accommodate caching in the player, we added a new vtkDataObjectPipe class to VTK that pipes data from the cache to the rest of the network. The VCM is responsible for replacing a complex subnetwork that has been previously executed with an appropriate vtkDataObjectPipe (the red Cache Lookup modules in Figure 7). After a vistrail instance is executed, its outputs are stored in new cache entries.

The VP needs the ability to create and execute arbitrary VTK modules from a vistrail. This requires mapping VTK descriptions, such as class and method names, to the appropriate module elements in the vistrail schema. Instead of manually constructing a complex mapping table, we use the VTK automatic wrapping mechanism. An automated program generates all required bindings directly from the VTK headers. This allows our system to expose the functionality of any VTK-compliant class without any additional coding, including user-contributed classes. The special vtkDataObjectPipe class itself was introduced in this fashion.

3.4 Creating and Interacting with Vistrails

The Vistrail Builder (VB) provides a graphical user interface for creating and editing vistrails. It writes (and also reads) vistrails in the same XML format as the rest of the system. It shares the familiar modules-and-connections paradigm with dataflow systems. In order to generate the visual representation of the modules, it reads the same data structure generated by the VP VTK wrapping process (Section 3.3). Like the VP, the VB requires no change to support additional modules. The VB is implemented in lefty, as an extension to dotty, a graph editor that is part of the graphviz software package [8]. We made several modifications to both lefty and dotty, such as adding the ability to read and write XML. Figure 4 shows a vistrail built using the VB.

To allow users to compare the results of multiple vistrails, we built a Visualization Spreadsheet (VS). The VS provides the user a set of separate visualization windows arranged in a tabular view. This layout makes efficient use of screen space, and the row/column groupings can conceptually help the user explore the visualization parameter space. The cells may execute different vistrails and they may also use different parameters for the same vistrail specification (see Figure 8). To ensure efficient execution, all views share the same cache. Users can also synchronize different views through the vistrails' parameters.
For example, a set of cells in the VS can be set to share the same camera, while varying the other vistrail parameters. The user interface actions have been kept as simple as possible. For instance, to change a parameter, the user selects the desired views and clicks on "Set Parameters". Then, they can double-click on a function in the tree view and set the parameters of the function in the dialog window. The contents of a spreadsheet can be saved as an XML file that has sections describing the GUI layout as well as the vistrail instances in each cell – the instances saved in a spreadsheet conform to the vistrail schema (Section 3.1).

The spreadsheet was implemented using vtkQt (http://www.matthias-koenig.net/vtkqt). Because it uses Qt, the VS runs on multiple architectures. We have tested it on Mac, Linux, and Windows.

Figure 8: Vistrail Spreadsheet. Parameters for a vistrail loaded in a spreadsheet cell can be interactively modified by clicking on the cell. Cameras as well as other vistrail parameters for different cells can be synchronized. This spreadsheet shows different visualizations of the Visible Human dataset using different algorithms. The top row shows the rendering of two iso values, whereas in the second row, images are created using volume rendering. The cameras for the cells in the rightmost column of the spreadsheet are synchronized.

4 CASE STUDIES

VisTrails is useful for a variety of visualization applications. It can be used as a means to explore the parameters of a visualization, to repeat visualizations across different data, and to compare different visualization techniques. We have selected a few case studies that show different applications of VisTrails and the advantages of using the system over traditional approaches.

4.1 Time-Varying Visualization

The CORIE project described in Section 1 generates enormous amounts of measurement and simulation data each day. Visualizing this data is critical to understanding the environmental issues that the project targets. Due to the complexity and size of the data to be visualized, approaches for 3D visualization have not been automated like the 2D plots and images available from the project's Web site. And even for the 2D plots, the automation is done by hand-coding scripts – new scripts need to be created any time a new visualization is required.

Current workflows for 3D visualizations of salinity involve loading a data file from the filesystem that includes the time, temperature, and salinity for a number of time steps. This data is used on top of the existing geometry (i.e., the river bathymetry) to find an isosurface of interest, which is then animated using a subset of the time steps. Today, this task is completely manual and is repeated any time a new data set is visualized.

VisTrails provides the necessary infrastructure to automate this process to a great extent. We created a reader for the CORIE data and incorporated it as a vistrail module. The Vistrail Builder is used to create a vistrail that reads the CORIE data and applies transparent isosurfaces to the data. This vistrail is then loaded into the Visualization Spreadsheet, where the isosurface parameters are easily manipulated until the desired image is achieved. As the user finds useful visualizations, the results are copied into the next window and the user continues to adjust parameters on one or more windows. In the example shown in Figure 1, multiple time steps are explored in the spreadsheet windows.
Once suitable parameters are found, the user has the option of creating another row of windows in the spreadsheet and using the same parameters on another salinity data set. This allows the user to easily compare the results of a forecast at multiple time steps with the actual measured data for the same time steps. Previously, this task would have been accomplished by manually adjusting parameters and saving images for each data set, which would then have been montaged using Microsoft PowerPoint.

4.2 Scalar Visualization

Visualizing scalar data generated from CT and MR devices has been the subject of much research in the volume visualization community. Two techniques that are common for visualizing this type of data are isosurfacing and direct volume rendering. Many existing tools provide interfaces to easily adjust the contour value needed to create an isosurface or to adjust the transfer function for volume rendering. However, to visualize multiple views of a dataset simultaneously, a user is required to save the images to separate files and view them together using an image processing tool. This prevents the user from directly manipulating the images, since they are static and need to be re-generated if parameters change.

With VisTrails, a user can easily visualize multiple views of the same dataset. By taking advantage of the caching capabilities of VisTrails, in many scenarios the exploration of multiple visualizations requires little more computation than visualizing a single image. Furthermore, the user has the ability to directly compare the results of different parameters or visualization methods. For example, Figure 8 shows multiple views of a head scan using isosurfacing and volume rendering.

A user could also reuse the results of previous computations simply by querying the Vistrail Repository. An example of this would be a user who is given a new scalar data set to visualize. Instead of using traditional tools to visualize the data from scratch, the scientist has the option to query previous visualizations to find vistrails that perform a similar task. After finding the desired vistrail in the Vistrail Repository, the user can easily visualize the new data using the same pipeline by simply modifying the data reader in the vistrail. Besides reducing the number of repetitive tasks the user has to perform, VisTrails also promotes knowledge sharing.

Figure 9: VisTrails Spreadsheet showing the results of multiple visualizations of diffusion tensor data. The rows explore different colormapping schemes, while the columns use different isosurfaces.

4.3 Diffusion Tensor Visualization

A final example of an application of the VisTrails system is in visualizing diffusion tensor data. A common task in visualization is to present the user with multiple, sometimes repetitive, visual cues that attempt to quantitatively describe an aspect of the data. One derived quantity of diffusion tensor data is anisotropy, which may help a scientist isolate areas of interest. In our example, we visualize the right half of a brain and use the anisotropy to encode information about the structure of the brain. Finding useful visualizations that show the data together with the derived quantity can be a difficult task. Figure 9 shows the results of several visualizations of the brain using isosurfaces that are colormapped with anisotropy. To better visualize this data, we used several univariate colormaps to find one that best describes the data.
Thus, for a given isosurface, vistrails were modified to include univariate colormaps that vary in hue, saturation, and value. Once a variety of visualizations were created, we replicated these results using another isosurface and compared the results with the original. These operations take advantage of the caching capabilities of VisTrails because the isosurface is computed only once (for the first vistrail that is executed) and is reused by all other cells – only the colormaps change across cells. This happens completely automatically, even when multiple different pipelines are used. Without VisTrails, this task would require opening several windows in a visualization tool and either recomputing the same isosurfaces for each of them or manually creating one pipeline that shares certain components. VisTrails provides the user with a fast and automated technique for comparatively building visualizations and saving workflows for future reuse.

4.4 Performance Issues

An important consideration in any interactive visualization system is performance, in particular responsiveness to changes triggered by direct manipulation of images. Since visualization pipelines can be long and complex, achieving interactive speeds is a challenging problem, and it becomes even more challenging in a multiple-view system. There are two broad classes of operations in a visualization pipeline: operations that perform data filtering (or preparation) and operations that perform real-time rendering. The VCM optimizes the data filtering operations.

The performance gains obtained by caching are heavily dependent on the set of vistrails used in a given session. For instance, for both the brain and Visible Human case studies, where there is substantial overlap, caching leads to speedups that vary between 2 and 2.5 times. For the CORIE case study, on the other hand, there is no overlap among the vistrails. In this case, the caching mechanism actually incurs overhead. However, the overhead is very small – under 1 percent. By representing vistrails in XML, we incur a slight overhead for parsing. However, even a large vistrail is small in comparison to the data being visualized, so the performance cost of parsing the XML is negligible in the overall visualization.

Changes to class design and algorithms have the potential to improve the efficiency of the VCM. In particular, some techniques require substantial preprocessing steps to enable faster update times, e.g., the out-of-core isosurface technique of Chiang et al. [6]. In order to leverage these techniques with the conventional VTK pipeline, it is necessary to explicitly enforce the availability of the required index structures for subsequent operations – and this requires programming. In contrast, the VCM achieves the same behavior in a transparent (and automatic) fashion.

Increasing the number of synchronized views has a direct effect on the rendering frame rate. The reason is obvious: all views share a single GPU, and the overall frame rates depend heavily on the combined complexity of the models being shown. In practice, if the frame rate is too slow to interact with a given group of views, we can disable synchronization and interact with a single view until the desired viewpoint and/or values are determined. At that time, we turn synchronization back on. A comprehensive solution to this problem requires the availability of time-critical, dynamic level-of-detail rendering algorithms for each of the types of data supported by the system.
The development of these renderers is an active area of research [18].

5 DISCUSSION AND FUTURE WORK

VisTrails streamlines the creation and execution of complicated visualization pipelines. It provides infrastructure that gives users a dramatically improved and simplified process to analyze and visualize large ensembles of complex visualizations. Using a vistrail, users are able to resume their explorations where they left off; to easily apply identical operations to a new data set without having to redo a long sequence of operations; to share their findings with colleagues; and to test new codes by comparing visualizations of the results generated by different versions of a simulation. The amount of data a user can explore depends on the time it takes until the analysis results can be viewed. By caching intermediate results and sharing computations, VisTrails greatly reduces the response time, allowing users to efficiently explore large volumes of data.

The separation of the pipeline specification from its instances has several benefits. The correspondence between parameter values and the resulting visualizations is naturally maintained in the system; and by combining this with a mechanism that controls the versions of the pipeline specifications (e.g., [4]), a complete history of the pipelines can be maintained – a detailed (and persistent) record of the provenance of visualizations. This separation also enables powerful scripting capabilities. By viewing a pipeline specification as a template and allowing users to specify sets of inputs for a given pipeline, VisTrails provides a scalable mechanism for generating a large number of visualizations.

Pipeline specifications can be analyzed and optimized. VisTrails identifies and avoids redundant computation within or across pipelines (Section 3.2). This feature is especially useful for exploring multiple visualizations in a multi-view environment, where sharing computations and reusing results are key to achieving interactivity. As we discuss in Section 4, this optimization leads to substantial performance improvements, enabling interactive exploration of a large number of visualizations.

Like any caching system, VisTrails needs to apply a cache replacement policy. Since entries in the cache have different memory footprints and modules may have widely different execution times, simple techniques such as LRU are clearly suboptimal. We are currently investigating specialized policies that are more suitable for our problem.

Another direction we plan to follow in future work is to Grid-enable vistrails, allowing them to be executed in a distributed fashion over different processing nodes in the Grid. This will require more sophisticated evaluation algorithms (akin to adaptive query processing techniques, such as [1]) that are able to adapt to variability in network delays and resource unavailability.

A vistrail instance stores information about the provenance of derived images. This information includes the name and location of input (raw) data files. To ensure reproducibility of trail executions, our system can keep copies of the raw data. Although we use a database to store the meta-data (i.e., the trail specification and instances), the raw data is stored in the file system. One potential problem with this simple approach is that the raw data files can be moved, deleted, or modified. More sophisticated mechanisms can be used to maintain the raw data [10, 25].
For example, the RHESSI Experimental Data Center [25] uses a mapping scheme to locate and retrieve data items, and it ensures data consistency by only allowing raw data to be accessed through the meta-data available in the database.

The beta version of VisTrails (including the GUIs) runs on multiple platforms. It has been tested on Linux, Mac, and Windows. Over the next year, we intend to start a beta testing program in preparation for a future public release.

Acknowledgments. VisTrails was originally motivated by the CORIE project. António Baptista has provided valuable input for the system design as well as several CORIE data sets for our experiments. We thank Gordon Kindlmann for the brain data set and the Visible Human Project for the head. This work was supported by the Department of Energy under the VIEWS program and the Mathematics, Information, and Computer Science Office; the National Science Foundation under grants IIS-0513692, CCF-0401498, EIA-0323604, CNS-0541560, and OISE-0405402; and a University of Utah Seed Grant. Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under Contract DE-AC04-94AL85000.

REFERENCES

[1] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMOD, pages 261–272, 2000.
[2] S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. XQuery 1.0: An XML query language. W3C Working Draft, June 2001.
[3] K. Brodlie, D. Duce, J. Gallop, M. Sagar, J. Walton, and J. Wood. Visualization in grid computing environments. In IEEE Visualization, pages 155–162, 2004.
[4] P. Buneman, S. Khanna, K. Tajima, and W. C. Tan. Archiving scientific data. ACM Transactions on Database Systems, 29(1):2–42, 2004.
[5] E. H. Chi, P. Barry, J. Riedl, and J. Konstan. Principles for information visualization spreadsheets. IEEE Computer Graphics and Applications, 18(4):30–38, 1998.
[6] Y.-J. Chiang, C. T. Silva, and W. J. Schroeder. Interactive out-of-core isosurface extraction. In IEEE Visualization, pages 167–174, 1998.
[7] M. D. Davis, R. Sigal, and E. Weyuker. Computability, Complexity, and Languages. Academic Press, 1994.
[8] J. Ellson, E. R. Gansner, L. Koutsofios, S. North, and G. Woodhull. Graphviz – open source graph drawing tools. In Graph Drawing, pages 483–485, 2001.
[9] J. Freire, R. Hu, T. Swift, and D. S. Warren. Exploiting parallelism in tabled evaluations. In PLILP, pages 115–132, 1995.
[10] J. Frew and R. Bose. Earth System Science Workbench: A data management infrastructure for earth science products. In SSDBM, pages 180–189, 2001.
[11] IBM. OpenDX. http://www.research.ibm.com/dx.
[12] T. Jankun-Kelly and K. Ma. Visualization exploration and encapsulation via a spreadsheet-like interface. IEEE Transactions on Visualization and Computer Graphics, 7(3):275–287, 2001.
[13] T. Jankun-Kelly, K. Ma, and M. Gertz. A model for the visualization exploration process. In IEEE Visualization, 2002.
[14] Kitware. ParaView. http://www.paraview.org.
[15] M. Kreuseler, T. Nocke, and H. Schumann. A history mechanism for visual data mining. In IEEE Information Visualization Symposium, pages 49–56, 2004.
[16] M. Levoy. Spreadsheet for images. In SIGGRAPH, pages 139–146, 1994.
[17] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system.
    Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, 2005.
[18] D. Luebke et al. Level of Detail for 3D Graphics. Morgan Kaufmann Publishers, 2002.
[19] D. Maier and D. S. Warren. Computing With Logic: Logic Programming With Prolog. Addison-Wesley, 1988.
[20] S. G. Parker and C. R. Johnson. SCIRun: A scientific programming environment for computational steering. In Supercomputing, 1995.
[21] P. Rao, K. Sagonas, T. Swift, D. Warren, and J. Freire. XSB: A system for efficiently computing well-founded semantics. In LPNMR, 1997.
[22] J. C. Roberts. Waltz: An exploratory visualization tool for volume data, using multiform abstract displays. In SPIE, volume 3298, pages 112–122, 1998.
[23] J. C. Roberts. Multiple-view and multiform visualization. In SPIE, pages 176–185, 2000.
[24] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit: An Object-Oriented Approach to 3D Graphics. Kitware, 2003.
[25] E. Stolte, C. von Praun, G. Alonso, and T. R. Gross. Scientific data repositories: Designing for a moving target. In ACM SIGMOD, pages 349–360, 2003.
[26] H. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures. W3C Working Draft, Feb. 2000.
[27] C. Upson et al. The Application Visualization System: A computational environment for scientific visualization. IEEE Computer Graphics and Applications, 9(4):30–42, 1989.
[28] W3C. Web services activity. http://www.w3.org/2002/ws.
[29] Xerces Java parser 1.4.3. http://xml.apache.org/xerces-j.
[30] XML Path Language (XPath) 2.0. http://www.w3.org/TR/xpath20.