Kepler: Towards a Grid-Enabled System for Scientific Workflows

Ilkay Altintas (1), Chad Berkley (2), Efrat Jaeger (1), Matthew Jones (2), Bertram Ludaescher (1), Steve Mock (1)

(1) San Diego Supercomputer Center (SDSC), University of California San Diego
(2) National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara

{berkley, jones}@nceas.ucsb.edu, {altintas, efrat, ludaescher, mock}@sdsc.edu

Abstract

We present the Kepler scientific workflow management system, which allows scientists to design, execute, and deploy workflows using a number of technologies, including Web and Grid services, RDBMS, and local applications implemented in various programming languages.

1. Scientific Workflows

Progress in science depends on the quantitative and repeatable analysis of data from a variety of sources. Most scientists conduct analyses and run models in several different software and hardware environments, mentally coordinating the export and import of data from one environment to another. Scientific workflows are an attempt to formalize this ad hoc process so that scientists can design, execute, and communicate analytical procedures repeatedly and with minimal effort.

Scientific workflows are superficially similar to business process workflows, but they pose several demanding challenges not present in the business workflow scenario [6]. In particular, scientific workflows tend to operate on large, complex, and heterogeneous data sources that need to be integrated before computations can occur. Scientific workflows are also often computationally intensive and produce complex derived data products that are archived for use in other workflows.

With the growth of the Internet, the number of ways in which scientists fetch and manipulate data has increased, and various advances have allowed scientists to run analytical and transformational processes remotely (i.e., over the Web). This growing richness of information-processing resources creates a need for systems and tools that support the discovery, efficient use, and deployment of these resources. To fulfill this need, the Kepler project [8] (Figure 1) is building on a mature software application called Ptolemy II [10] to produce a robust workflow system that caters specifically to domain scientists.

Figure 1. Workflow editor in the Kepler Framework for Scientific Workflows. The base application is the Vergil editor from the Ptolemy II project.

Ptolemy II, and thus Kepler, is a set of Java packages supporting heterogeneous, concurrent modeling, design, and execution [13]. Kepler's strengths include:

1) precisely defined models of computation, including the dataflow-oriented "Process Networks" model;
2) a modular, activity-oriented programming environment that lends itself to the design of reusable components;
3) an intuitive programming GUI that allows the user to easily compose complex workflows.

Since Kepler extends an already stable application, development has focused on the extensions needed to create workflows that are meaningful to domain scientists, and on adding forms of user interaction that the base application did not previously need.

In this paper we describe several core capabilities of Kepler designed to improve the effectiveness and efficiency of scientific research:

- capturing scientific workflows,
- accessing heterogeneous data, and
- executing scientific workflows.

2. Capturing Workflows

Using Kepler, scientists capture workflow information in a formal format that can easily be changed, archived, versioned, and executed.

Kepler contains a library of reusable processing steps (called actors) that perform computations such as signal processing, statistical operations, and Boolean logic operations. Each actor defines zero or more typed input and output ports that can be linked into a directed graph, allowing data to flow between actors. Kepler performs both design-time and run-time type checking on the workflow and data. Kepler also allows scientists to prototype a workflow before implementing the actors it needs.
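To make the actor model concrete, the following minimal sketch shows what a hand-written actor can look like against the Ptolemy II actor API on which Kepler builds; the Doubler actor and its behavior are illustrative examples, not part of the Kepler library.

    import ptolemy.actor.TypedAtomicActor;
    import ptolemy.actor.TypedIOPort;
    import ptolemy.data.DoubleToken;
    import ptolemy.data.type.BaseType;
    import ptolemy.kernel.CompositeEntity;
    import ptolemy.kernel.util.IllegalActionException;
    import ptolemy.kernel.util.NameDuplicationException;

    // Illustrative actor: doubles each value arriving on its input port.
    public class Doubler extends TypedAtomicActor {
        public TypedIOPort input;
        public TypedIOPort output;

        public Doubler(CompositeEntity container, String name)
                throws IllegalActionException, NameDuplicationException {
            super(container, name);
            // Declare typed ports; Kepler uses the declared types for
            // design-time and run-time type checking.
            input = new TypedIOPort(this, "input", true, false);
            input.setTypeEquals(BaseType.DOUBLE);
            output = new TypedIOPort(this, "output", false, true);
            output.setTypeEquals(BaseType.DOUBLE);
        }

        // fire() is invoked by the director according to the workflow's
        // model of computation.
        public void fire() throws IllegalActionException {
            super.fire();
            if (input.hasToken(0)) {
                double value = ((DoubleToken) input.get(0)).doubleValue();
                output.send(0, new DoubleToken(2.0 * value));
            }
        }
    }

Once compiled, such a class can be added to the actor library and composed with other actors on the workflow canvas.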
Actors – Kepler has an extensible library of actors. In-house or third-party software can be added to this library by the scientist. Web and Grid services can also be added to the library as actors, making it possible to invoke jobs on the Web and Grid from within Kepler. This is done using the generic Web and Grid service actors: each one wraps a single operation of a given Web Service Description Language (WSDL) [18] or Grid Web Service Description Language (GWSDL) [12] file, presenting the operation's messages as input and output ports. Kepler also contains a tool to harvest a group of Web service descriptions from a repository and save them to the actor library for later use in workflows. Most actors are Java processes that run locally on a single machine. Some, however, call external native applications such as Matlab, and still others access arbitrary Web services that execute a process remotely and return a handle to the results.

Prototyping actors – The actor library may not contain all of the actors needed for a particular scientific computation, so Kepler provides an actor prototyping tool (Figure 2). This tool prompts the scientist for critical information about an actor, including its name, icon, and input/output ports; each port has a name and a data type. Once the user has defined the actor, a stub is compiled and added to the actor library. The user can then use this stub on the workflow canvas to prototype a workflow. The ports can be connected to other actors (stubs or not), and the typing system will validate the connections. However, since these actors are stubs that do not implement the intended computations, they simply open a dialog indicating that the workflow implementation is incomplete. The stubs must later be implemented by writing Java methods for the prefire, fire, and postfire stages of workflow execution. The intended algorithm can be implemented within the fire method, or that method can call an external program or service to run the algorithm. The intent of this tool is to allow a scientist to quickly assemble a workflow without having to implement the code for every individual piece at design time. We hope this feature will encourage scientists to create workflows that document their projects, rather than mental workflows that are inaccessible to others.

Figure 2. The actor prototype tool creates a stub actor class, compiles it, and then adds it to the actor library, where it can be dragged onto the workspace.

Serialization, documentation, and provenance – Workflows within Kepler are serialized in an XML dialect called the Modeling Markup Language (MoML) [9]. Because of this XML serialization, the workflow itself can be used as documentation (metadata) for the research project. The workflow also provides the provenance of derived data products, allowing researchers to return to previous states of the data as needed. The workflow can easily be versioned and archived in any XML storage facility and can be indexed for easy querying and access.
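As an illustration of this serialization, the following sketch (again assuming the Ptolemy II API; Const and Display are standard Ptolemy library actors, and the workflow itself is a contrived example) assembles a two-actor workflow programmatically and prints its MoML representation:

    import ptolemy.actor.TypedCompositeActor;
    import ptolemy.actor.lib.Const;
    import ptolemy.actor.lib.gui.Display;
    import ptolemy.domains.sdf.kernel.SDFDirector;

    public class MomlExportSketch {
        public static void main(String[] args) throws Exception {
            // Assemble a trivial two-actor workflow in code.
            TypedCompositeActor workflow = new TypedCompositeActor();
            workflow.setName("TrivialWorkflow");
            new SDFDirector(workflow, "SDF Director");
            Const source = new Const(workflow, "Source");
            Display display = new Display(workflow, "Display");
            workflow.connect(source.output, display.input);
            // Every Ptolemy model object can serialize itself as MoML XML.
            System.out.println(workflow.exportMoML());
        }
    }

The printed MoML document records the actors, their parameters, and their connections, and can be archived, versioned, or reloaded into the editor.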
3. Accessing heterogeneous data

Kepler has several data access actors, including a relational database access actor (DatabaseQuery) and a metadata-based data ingestion actor for handling heterogeneous data (EMLDataSource).

Database access – Scientists often need access to data in a relational database from within a workflow. Kepler includes two actors that allow generic, efficient database access from within a workflow. The OpenDBConnection actor returns a reference to a database connection, given the relevant JDBC connection information (driver name, database URL, user name, and password). This reference can then be passed to other actors in the workflow that require a database connection, which increases efficiency since the system creates the database connection only once per workflow execution. The DatabaseQuery actor takes as input the reference provided by the OpenDBConnection actor and a Structured Query Language (SQL) [14] query string, and outputs the results of the query as a record, an eXtensible Markup Language (XML) [19] stream, or a string. The user can also select whether to return all records at once or one row at a time. Future plans for the DatabaseQuery actor include a GUI-based query form that does not require the user to know SQL, and a schema parser that exposes the individual attributes of the record as ports.

EMLDataSource actor – Ecological Metadata Language (EML) [7] is an XML-based metadata specification for describing ecological and biological datasets. EML contains both physical and logical information about datasets. The EMLDataSource actor (Figure 3) uses EML to ingest heterogeneous datasets into Kepler, parsing the physical and logical metadata to learn how to process the data source.

Figure 3. The EMLDataSource actor automatically configures itself with one output port for each logical attribute in the dataset.

Once the EMLDataSource actor parses the EML metadata, it uses the physical information to read the data file in its native format. It then uses the logical information to create one output port for each attribute (column) in the data file. The ports are typed like other ports in Kepler, using type information from the EML metadata. At execution time, one record is read on each clock cycle and its values are sent over the ports in parallel. This actor allows Kepler to ingest a multitude of heterogeneous data (as long as it is described in EML), giving the system the flexibility it needs to serve domain scientists who use many different data sources.

Data transformation – Because actors and Web services are generally designed in isolation, input/output incompatibilities are common, and data transformation is needed for integration. Extensible Stylesheet Language Transformations (XSLT) [21] is designed for transforming XML documents. XQuery [20], although designed for querying XML, can also be used to transform data in XML format. Using widely available tools for these two languages, we designed two actors that provide a Kepler interface to XSLT and XQuery. These actors transform XML and HTML data for use inside Kepler or outside it (e.g., in browsers). For actors that exchange data in XML format, the XSLT transformation actor presents an easy mechanism for integrating diverse computational actors and heterogeneous data sources. For non-XML data sources, standard data processing actors in Kepler can be used to integrate components in the workflow.
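As a sketch of the underlying operation that the XSLT actor wraps, the following method applies a stylesheet to an XML document using the JDK's standard javax.xml.transform API; it illustrates the transformation step only and is not Kepler's actual actor code.

    import java.io.StringReader;
    import java.io.StringWriter;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class XsltTransformSketch {
        // Apply an XSLT stylesheet to an XML document, producing, for
        // example, HTML for display in a browser or XML for another actor.
        public static String transform(String xml, String stylesheet)
                throws Exception {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter result = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(result));
            return result.toString();
        }
    }

An XQuery-based actor plays the analogous role when a query, rather than a stylesheet, drives the transformation.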
4. Executing workflows

Kepler's programming environment supports the varying models of computation that domain scientists may want to use for their models. Kepler can execute processes locally, either within the Kepler environment (Java) or in a native environment (compiled native code, or code interpreted by another environment such as Perl). In addition, processes can be executed in a distributed fashion using Web and Grid services. Remotely executed processes behave as a single step in the model of computation, regardless of their complexity.

Distributed computation – Kepler's Web and Grid service actors allow scientists to utilize computational resources on the network in a distributed scientific workflow. Invocation of each distributed service is controlled by the model of computation in use. The generic WebService actor provides the user with an interface to connect to a Web service defined by a WSDL URL. To customize a WebService actor, the user provides the URL of the WSDL and selects one of the service's operations. The actor then automatically configures its ports to match the inputs and outputs of that operation (Figure 4) and acts as a proxy to the Web service when executed. A generic GridService actor operates similarly for a given GWSDL URL.

Figure 4. Customizing a Web Service actor.

While the WebService actor imports a single Web service operation as a Kepler actor, the Web Service Harvester imports all of the operations of a specific Web service; it can also harvest all of the Web services in a Universal Description, Discovery and Integration (UDDI) repository [17]. Given a Web service description (WSDL) or a repository address, the harvester creates one actor per operation. This feature makes it simple for scientists to locate relevant computational Web services and integrate them into their workflows.

In addition to the generic Web and Grid service actors, Kepler includes actors for specific Grid-based services, including certificate-based authentication (ProxyInit), Grid job submission (GlobusGridJob), and Grid-based data access (DataAccessWizard). Each of these actors accesses a specific Grid-based service using Open Grid Services Architecture (OGSA) interfaces [2,12].

External computing environments – Most existing actors are implemented in Java, but local execution of processes outside the Java environment (for example, Perl scripts) greatly enhances the flexibility of the system. Supporting external execution lets users reuse existing analysis components and target the most appropriate computational environments. For example, some complex statistical procedures are particularly suited to the SAS or R environments, and Kepler would allow the scientist to wrap SAS code in an actor and insert it into a Kepler workflow. Moreover, scientists have large inventories of scripts targeted at existing environments and are more likely to adopt Kepler if adapting existing code is easy. We therefore plan a set of actors that execute native code and scripts in external environments. Ptolemy II (and hence Kepler) already includes a Matlab actor that can execute Matlab code and a Python actor that executes Python code; other environments slated for inclusion are SAS, Perl, C++ (via native interface calls), and R (S+).
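The following sketch illustrates one way such an external-environment actor might invoke an interpreter and capture its output; the choice of Perl, the command line, and the error handling are illustrative assumptions, not Kepler's implementation.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class ExternalScriptSketch {
        // Run an external interpreter (here Perl) on a script and capture
        // its standard output for use inside the workflow.
        public static String run(String scriptPath, String arg)
                throws Exception {
            Process p = new ProcessBuilder("perl", scriptPath, arg)
                    .redirectErrorStream(true)
                    .start();
            StringBuilder output = new StringBuilder();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
            if (p.waitFor() != 0) {
                throw new RuntimeException("external script failed");
            }
            return output.toString();
        }
    }

Wrapped in an actor's fire method, the captured output can then be converted to a token and sent on an output port.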
5. Example Domain Applications

Kepler supports multiple domains of science and must be flexible enough to run workflows from users with diverse interests, data, and processing needs. The system has been used for applications in molecular biology, geoscience, ecology, biodiversity, and electrical engineering, and we believe it is extensible enough to be used in many more scientific domains.

The first workflow created with Kepler, a molecular biology application integrating various online tools and databases (including a Genbank Web service query actor and a BLAST actor) [1,4], brought scientists great benefits in terms of reduced design time and reduced complexity. Kepler automatically handles the data flow between workflow actors during execution, so scientists neither have to wait for intermediate tasks to finish nor orchestrate the data flow manually.

Our experience with metadata-driven data ingestion using the EMLDataSource actor has shown that generic analytical pipelines can easily be reapplied to heterogeneous data sources without manual data massaging by the scientist. This allows rapid exploratory data visualization for scientists unfamiliar with the data sources, and it enables the use of previously inaccessible data sources in complex analytical workflows. For example, the Genetic Algorithm for Rule-set Production (GARP) native species prediction model can now use both traditional museum occurrence data and ecological species distribution data as inputs to predict species range distributions based on niche modeling theory [15]. Incorporating it into Kepler has broadened the data available, made archiving model predictions easier, and enabled the substitution of competing prediction algorithms in the workflow, all contributing to our ability to predict the effects of climate change and invasive species on the environment.

The Mineral Classifier workflow, a geosciences workflow for the modal classification of igneous rocks, names igneous rock samples using a point-in-polygon algorithm.

6. Current Work

We are currently moving from a loosely coupled, non-data-intensive Web service environment to a system in which data-intensive local and remote applications can be chained together seamlessly. Our current work focuses on efficiency, scheduling, and optimization of workflow computation. This will involve optimizing the trade-off between network and compute resources in a distributed framework, using performance and resource estimation and planning tools. The Kepler team would like to cooperate with other scientific workflow projects, such as Chimera [3], Pegasus [22], Triana [16], and Taverna/Freefluo, on scientifically relevant workflow specifications that would advance interoperability among these systems.

7. Acknowledgments

Kepler includes contributors from SEEK [11], SDM Center-SPA [13], Ptolemy II [10], and GEON [5]. This material is based upon work supported by the National Science Foundation under awards 0225676 for SEEK and 0225673 (AWSFL008-DS3) for GEON, by the Department of Energy under Contract No. DE-FC02-01ER25486 for SciDAC/SDM, and by DARPA under Contract No. F33615-00-C-1703 for Ptolemy.

8. References

[1] BLAST: Basic Local Alignment Search Tool, http://www.ncbi.nlm.nih.gov/BLAST/
[2] I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002.
[3] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A Virtual Data System for Representing, Querying and Automating Data Derivation. In Proceedings of the 14th Conference on Scientific and Statistical Database Management, Edinburgh, Scotland, July 2002.
[4] Genbank: National Institutes of Health Genetic Sequence Database, http://www.ncbi.nlm.nih.gov/Genbank/
[5] GEON: Cyberinfrastructure for the Geosciences, http://www.geongrid.org
[6] M. Greenwood, C. Wroe, R. Stevens, C. Goble, and M. Addis. Are bioinformaticians doing e-Business? In B. Matthews, B. Hopgood, and M. Wilson (eds.), Proceedings of Euroweb 2002: The Web and the GRID - from e-science to e-business. ISBN 1-902505-50-6.
[7] M.B. Jones, C. Berkley, J. Bojilova, and M. Schildhauer. Managing Scientific Metadata. IEEE Internet Computing 5(5): 59-68, 2001.
[8] Kepler Framework for Scientific Workflows, http://kepler.ecoinformatics.org
[9] E.A. Lee and S. Neuendorffer. "MoML - A Modeling Markup Language in XML, Version 0.4." Technical Memorandum UCB/ERL M00/12, University of California, Berkeley, CA 94720, March 14, 2000. http://ptolemy.eecs.berkeley.edu/publications/papers/00/moml/
[10] The Ptolemy Project, http://ptolemy.eecs.berkeley.edu
[11] SEEK: Science Environment for Ecological Knowledge, http://seek.ecoinformatics.org
[12] B. Sotomayor. The Globus Toolkit 3 Programmer's Tutorial, 2003. http://www.casa-sotomayor.net/gt3-tutorial/index.html
[13] SPA: Scientific Process Automation, http://kepler.ecoinformatics.org/spa.html
[14] SQL: Structured Query Language. D.D. Chamberlin et al., "SEQUEL 2: a unified approach to data definition, manipulation and control," IBM Journal of Research and Development 20(6): 560-575, 1976.
[15] D.R.B. Stockwell and D. Peters. The GARP Modeling System: Problems and Solutions to Automated Spatial Prediction. International Journal of Geographical Information Science 13(2): 143-158, 1999.
[16] I. Taylor, M. Shields, I. Wang, and R. Philp. Distributed P2P Computing within Triana: A Galaxy Visualization Test Case. In Proceedings of IPDPS 2003, April 2003. http://www.gridlab.org/Resources/Papers/ipdsp_trianagalaxy_2003.pdf
[17] UDDI: Universal Description, Discovery and Integration of Web Services, http://www.uddi.org
[18] WSDL: Web Services Description Language, http://www.w3.org/TR/wsdl
[19] XML: Extensible Markup Language, http://www.w3.org/XML/
[20] XQuery: An XML Query Language, http://www.w3.org/TR/xquery/
[21] XSLT: XSL Transformations, http://www.w3.org/TR/xslt
[22] E. Deelman et al. Mapping Abstract Complex Workflows onto Grid Environments. Journal of Grid Computing 1(1): 25-39, 2003.