
Kepler: Towards a Grid-Enabled System for Scientific Workflows
Ilkay Altintas (1), Chad Berkley (2), Efrat Jaeger (1), Matthew Jones (2), Bertram Ludaescher (1), Steve Mock (1)
(1) San Diego Supercomputer Center (SDSC), University of California San Diego
(2) National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara
{altintas, efrat, ludaescher, mock}@sdsc.edu, {berkley, jones}@nceas.ucsb.edu
Abstract
We present the Kepler scientific workflow
management system, which allows scientists to design,
execute and deploy workflows using a number of
technologies including Web and Grid services, RDBMS,
and local applications implemented in various
programming languages.
1. Scientific Workflows
Progress in science depends on the quantitative and
repeatable analysis of data from a variety of sources. Most
scientists conduct analyses and run models in several
different software and hardware environments, mentally
coordinating the export and import of data from one
environment to another. Scientific workflows are an
attempt to formalize this ad-hoc process so that scientists
can design, execute, and communicate analytical
procedures repeatedly and with minimal effort.
Scientific workflows are superficially similar to
business process workflows, but they pose several demanding
challenges that do not arise in the business setting [6]. In
particular, scientific workflows tend to
operate on large, complex, and heterogeneous data
sources that need to be integrated before computations
can occur.
Scientific workflows are often
computationally intensive and produce complex derived
data products that are archived for use in other workflows.
With the growth of the Internet, the number of ways
scientists can fetch and manipulate data has multiplied.
Various advances now allow scientists to run analytical
and transformational processes remotely (i.e., over the
Web). This growing richness of information processing
resources creates a need for systems and tools that
support the discovery, efficient use, and deployment of
these resources.
To fulfill this need, the Kepler project [8] (Figure 1)
is building on a mature software application, Ptolemy II
[10], to produce a robust workflow system that caters
specifically to domain scientists.
Figure 1. Workflow editor in the Kepler Framework for
Scientific Workflows. The base application is the Vergil
editor from the Ptolemy II project.
Ptolemy II, and thus Kepler, is a set of Java packages
supporting heterogeneous, concurrent modeling, design,
and execution [13]. Kepler’s strengths include:
1) Precisely defined models of computation, including
the dataflow-oriented Process Networks model (sketched
in code below),
2) A modular, activity-oriented programming
environment that lends itself to the design of
reusable components, and
3) An intuitive programming GUI that allows the
user to easily compose complex workflows.
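As a rough illustration of the Process Networks idea, the sketch below connects two concurrent components with a blocking FIFO channel: each component runs independently and fires as tokens become available. This is plain Java, not Kepler or Ptolemy II code, and all names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Conceptual sketch of a process network: independent actors
// communicate only through an order-preserving FIFO channel,
// blocking when no token is available. Not Kepler/Ptolemy code.
public class PnSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(16);

        // Source actor: emits tokens 0..9 on its output channel.
        Thread source = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) channel.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Sink actor: fires whenever a token arrives on its input.
        Thread sink = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    System.out.println("received " + channel.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        sink.start();
        source.join();
        sink.join();
    }
}
```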
Since Kepler extends an already stable application,
development has focused on the extensions needed to create
workflows that are meaningful to domain scientists, as
well as on adding forms of user interaction that the base
application did not previously need.
In this paper we describe several core capabilities
from Kepler designed to improve the effectiveness and
efficiency of scientific research:
- Capturing scientific workflows
- Accessing heterogeneous data
- Executing scientific workflows
2. Capturing Workflows
Using Kepler, scientists capture workflow
information in a formal format that can easily be changed,
archived, versioned, and executed. Kepler contains a
library of reusable processing steps (called actors) that
perform computations such as signal processing,
statistical operations, and Boolean logic operations. Each
actor defines zero or more typed input and output ports
that can be linked into a directed graph to allow data to
flow between actors. Kepler performs both design-time
and run-time type checking on the workflow and data.
Kepler also allows scientists to prototype a workflow
before implementing actors needed for the workflow.
Prototyping actors – The actor library may not contain
all of the actors needed to complete a particular
scientific computation, so Kepler provides an actor
prototyping tool (Figure 2). This tool prompts the
scientist for critical information about an actor,
including its name, icon, and input/output ports. Each
port has a name and a data type. Once the user has
defined the actor, a stub is compiled and added to the
actor library.
Figure 2. The actor prototype tool creates a stub actor
class, compiles it, and then adds it to the actor library,
where it can be dragged onto the workspace.
The user can then place this stub on the workflow
canvas to prototype a workflow. Its ports can be
connected to other actors (stubs or not), and the typing
system will validate the connections. However, since
these actors are stubs that do not implement the intended
computations, they simply open a dialog indicating that
the workflow implementation is incomplete. The stubs
are completed by writing Java methods for the prefire,
fire, and postfire stages of workflow execution. The
intended algorithm can be implemented within the fire
method, or the fire method can call an external program
or service to run the algorithm.
The intent of this tool is to allow a scientist to quickly
assemble a workflow without needing to implement the
code for every individual piece of the workflow at design
time. We hope this feature will encourage scientists to
create workflows that document their projects instead of
relying on mental workflows that are inaccessible to
others.
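To make the stub completion step concrete, the sketch below shows what a finished actor might look like, loosely following the Ptolemy II actor API [10]. The class, ports, and computation are invented for illustration, and API details may differ between versions; prefire() and postfire() keep their inherited defaults here.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.DoubleToken;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

// Illustrative actor that doubles each incoming value. Loosely
// follows the Ptolemy II actor API; not actual Kepler code.
public class Doubler extends TypedAtomicActor {
    public TypedIOPort input;
    public TypedIOPort output;

    public Doubler(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        // Typed ports enable the design-time and run-time type
        // checking described above.
        input = new TypedIOPort(this, "input", true, false);
        output = new TypedIOPort(this, "output", false, true);
        input.setTypeEquals(BaseType.DOUBLE);
        output.setTypeEquals(BaseType.DOUBLE);
    }

    // fire() holds the actor's computation; prefire()/postfire()
    // (inherited defaults here) bracket each firing.
    public void fire() throws IllegalActionException {
        super.fire();
        if (input.hasToken(0)) {
            double v = ((DoubleToken) input.get(0)).doubleValue();
            output.send(0, new DoubleToken(2.0 * v));
        }
    }
}
```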
Serialization, documentation and provenance –
Workflows within Kepler are serialized in an XML
dialect called Modeling Markup Language (MoML) [9].
Because of this XML serialization, the workflow itself
can be used as documentation (metadata) for the research
project. The workflow also provides the provenance for
derived data products, allowing researchers to return to
previous states of the data as needed. The workflow can
easily be versioned and archived in any XML storage
facility and can be indexed for easy querying and access.
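For concreteness, a serialized workflow is an XML tree of entities, relations, and links, roughly along these lines (a schematic fragment in the spirit of MoML [9]; the model name and actor choices are invented for illustration):

```xml
<?xml version="1.0"?>
<entity name="exampleWorkflow" class="ptolemy.actor.TypedCompositeActor">
  <!-- Two actors (entities) connected by one channel (relation). -->
  <entity name="source" class="ptolemy.actor.lib.Ramp"/>
  <entity name="display" class="ptolemy.actor.lib.gui.Display"/>
  <relation name="r1" class="ptolemy.actor.TypedIORelation"/>
  <link port="source.output" relation="r1"/>
  <link port="display.input" relation="r1"/>
</entity>
```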
3. Accessing heterogeneous data
Actors – Kepler has an extensible library of actors.
Scientists can add in-house or third-party software to
this library. Web and Grid services can also be used as
actors and added to the library to invoke jobs on the Web
and Grid from within Kepler. This is done using the
generic Web and Grid service actors. These actors expose
one operation in a given Web Service Description
Language (WSDL) [18] or Grid Web Service Description
Language (GWSDL) [12] file by exposing the operation's
messages as input and output ports. Kepler also contains
a tool to harvest a group of Web Service descriptions
from a repository and save them to the actor library for
later use in workflows.
Most actors are Java processes that run locally on a
single machine. However, some call external native
applications such as Matlab, and still others access
arbitrary web services that execute a process remotely
and return a handle to the results.
Kepler has several data access actors, including a
relational database access actor (DatabaseQuery) and a
metadata-based data ingestion actor for handling
heterogeneous data (EMLDataSource).
Database Access – Often, scientists need access to
the data in a relational database from within the
workflow. Kepler includes two actors to allow generic,
efficient database access from within a workflow.
The OpenDBConnection actor returns a reference to a
database connection, given relevant JDBC connection
information (driver name, database URL, user name and
password). This reference can then be passed to other
actors in the workflow that require a database connection.
This increases efficiency since the system creates the
database connection only once per workflow execution.
The DatabaseQuery actor takes as input the reference
provided by the OpenDBConnection actor and a
Structured Query Language (SQL) [14] query string and
outputs the results of the query as a record, an eXtensible
Markup Language (XML) [19] stream, or a string. The
user can also select whether to return all records or only
one row at a time. Future plans for the DatabaseQuery
actor include a GUI-based query form that does not
require the user to know SQL, and a schema parser that
exposes the individual attributes of the record as ports.
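The division of labor between these two actors mirrors ordinary JDBC usage: open one connection, then reuse it for queries, consuming results row by row or all at once. A minimal sketch of that pattern in plain JDBC (not Kepler code; the driver, URL, credentials, and table are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Plain-JDBC sketch of the pattern behind OpenDBConnection and
// DatabaseQuery: one shared connection, many queries. The driver
// class, URL, credentials, and table are placeholders.
public class DbSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver"); // "driver name" parameter
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/science", "user", "secret");
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT site, rain_mm FROM rainfall")) {
            // One row at a time, as DatabaseQuery optionally does.
            while (rs.next()) {
                System.out.println(rs.getString("site") + " "
                        + rs.getDouble("rain_mm"));
            }
        }
        conn.close();
    }
}
```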
EMLDataSource Actor – Ecological Metadata
Language (EML) [7] is an XML-based metadata
specification for describing ecological and biological
datasets.
EML contains both physical and logical
information about datasets. The EMLDataSource actor
(Figure 3) uses EML to ingest heterogeneous datasets into
Kepler by parsing the physical and logical metadata to
learn how to process the data source.
Figure 3. The EMLDataSource Actor automatically
configures itself with one output port for each logical
attribute in the dataset.
Once the EMLDataSource actor parses the EML
metadata, it extracts the physical information to read the
data file from its native format. It then uses the logical
information to create one output port for each attribute
(column) in the data file. The ports are typed like other
ports in Kepler with information from the EML metadata.
At execution time, one record is read for each clock cycle
and the data is sent over the ports in parallel.
This actor allows Kepler to ingest a multitude of
heterogeneous datasets (as long as they are described in
EML), giving it the flexibility needed to serve domain
scientists who draw on many different data sources.
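As a rough sketch of the metadata this actor consumes, an EML document describes a data table's physical format and its logical attributes, each of which becomes a typed output port. The fragment below is schematic and heavily abbreviated, not schema-valid EML; the file and attribute names are invented for illustration.

```xml
<!-- Schematic, abbreviated EML-style fragment (not schema-valid). -->
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.0"
         packageId="example.1.1" system="knb">
  <dataset>
    <title>Example rainfall measurements</title>
    <dataTable>
      <physical>
        <!-- Physical metadata: how to read the raw file. -->
        <objectName>rainfall.csv</objectName>
      </physical>
      <attributeList>
        <!-- Logical metadata: one output port per attribute. -->
        <attribute><attributeName>site</attributeName></attribute>
        <attribute><attributeName>date</attributeName></attribute>
        <attribute><attributeName>rain_mm</attributeName></attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml:eml>
```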
Data Transformation – Because actors and web
services are generally designed in isolation, input/output
incompatibilities are common and data transformation is
needed for integration. Extensible Stylesheet Language
Transformations (XSLT) is designed for transforming
XML documents [21]. XQuery, although designed for
querying XML, can also be used to transform data in
XML format [20]. Using
widely available tools for these two languages, we
designed two actors to provide a Kepler interface to
XSLT and XQuery. These actors transform XML and
HTML data for use in Kepler or outside of Kepler (e.g.,
browsers). For actors that exchange data in XML format,
the XSLT transformation actor presents an easy
mechanism for integrating diverse computational actors
and heterogeneous data sources. For non-XML data
sources, standard data processing actors in Kepler can be
used to integrate components in the workflow.
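Under the hood, an XSLT transformation amounts to applying a stylesheet with a standard engine. The sketch below uses the standard Java JAXP API to illustrate roughly what such an actor wraps (illustrative only; the file names are placeholders, and this is not Kepler's actor code):

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Applies an XSLT stylesheet to an XML document via JAXP --
// roughly what an XSLT transformation actor encapsulates.
// The file names are placeholders.
public class XsltSketch {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("to-html.xsl"));
        t.transform(new StreamSource("results.xml"),
                    new StreamResult("results.html"));
    }
}
```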
4. Executing workflows
Kepler’s powerful programming environment supports
the varying models of computation that domain scientists
may want to use for their models. Kepler can execute
processes locally either within the Kepler environment
(Java) or within a native environment (compiled native
code, or code interpreted by another environment such as
Perl). In addition, processes can be executed in a
distributed way, using web and grid services. Remotely
executed processes behave as a single step in the model of
computation regardless of their complexity.
Distributed computation – Kepler's web and grid
service actors allow scientists to utilize computational
resources on the network in a distributed scientific
workflow. Invocation of each of the distributed services
is controlled by the current model of computation in use.
The generic WebService actor provides the user with
an interface to connect to a Web Service defined by a
WSDL URL. To customize a Web Service actor, the
user provides the URL for the WSDL and selects an
operation of the Web Service. The actor automatically
customizes its ports with the correct inputs and outputs of
the Web Service (Figure 4), and will act as a proxy to the
Web Service when executed. A generic GridService actor
also operates similarly for a given GWSDL URL.
Figure 4. Customizing a Web Service actor
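Conceptually, the proxy performs a dynamic, WSDL-driven invocation of the selected operation. A minimal sketch using the Apache Axis dynamic Call interface, one widely used toolkit of the time (illustrative only; the endpoint, namespace, operation, and argument are placeholders, not Kepler's implementation):

```java
import java.net.URL;
import javax.xml.namespace.QName;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;

// Dynamically invokes a single web-service operation, roughly
// what the generic WebService actor does when acting as a proxy.
// Endpoint, namespace, operation, and argument are placeholders.
public class WsSketch {
    public static void main(String[] args) throws Exception {
        Call call = (Call) new Service().createCall();
        call.setTargetEndpointAddress(
                new URL("http://example.org/axis/Blast"));
        call.setOperationName(new QName("urn:example", "runBlast"));
        Object result = call.invoke(new Object[] { "ACGTACGT" });
        System.out.println(result);
    }
}
```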
Although the Web Service actor allows the user to
import just one Web Service operation as a Kepler actor,
the Web Service Harvester is used for importing all the
operations of a specific Web Service. It can also be used
to harvest all of the Web Services in a Universal
Description, Discovery and Integration (UDDI) repository
[17]. The harvester creates actors for each operation
given a Web Service description (WSDL) or a repository
address. This feature makes it simple for scientists to
locate computational web services of relevance and
integrate them into their computational workflows.
In addition to generic web and grid services, Kepler
includes actors to use Grid based services, including
actors for certificate-based authentication (ProxyInit), grid
job submission (GlobusGridJob), and Grid-based data
access (DataAccessWizard). Each of these actors accesses
specific Grid-based services using Open Grid Services
Architecture (OGSA) interfaces [2,12].
External Computing Environments – Most existing
actors are implemented in Java, but local execution of a
process outside of the Java environment (for example, the
execution of a Perl script) would greatly enhance the
flexibility of the system. Supporting external execution
gives the user flexibility to reuse existing analysis
components and targets appropriate computational
environments. For example, some complex statistical
procedures may be particularly suited to the SAS or R
environments, and Kepler would allow the scientist to
wrap SAS code in an actor and insert it into the Kepler
workflow. In addition, scientists have a large inventory
of scripts targeted at existing environments and are more
likely to use Kepler if adapting existing code is easy.
We plan a set of actors that execute native code and
scripts in external environments. Ptolemy II (and hence
Kepler) includes a Matlab actor that can execute Matlab
code and a Python actor that executes Python code. Other
environments slated for inclusion are SAS, Perl, C++
(native interface calls) and R (S+).
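The basic mechanism such actors would wrap is ordinary external process execution: launch the interpreter, run the script, and capture its output. A minimal sketch in plain Java (the interpreter and script name are placeholders; not Kepler code):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Runs an external interpreter (here, a Perl script) and reads
// its standard output -- the basic mechanism an external-execution
// actor would wrap. The interpreter and script are placeholders.
public class ExecSketch {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("perl", "analysis.pl");
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process p = pb.start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
        System.out.println("exit code: " + p.waitFor());
    }
}
```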
5. Example Domain Applications
Kepler supports multiple domains of science and
must be flexible enough to run workflows from users with
diverse interests, data, and processing needs. The system
has been used for applications in molecular biology,
geoscience, ecology, biodiversity, and electrical
engineering. We believe Kepler is extensible enough to
be used by many more science domains.
The first workflow created with Kepler, a molecular
biology application integrating various online tools and
databases (including the Genbank web service query actor
and the BLAST actor) [1,4], resulted in great benefits for
scientists in terms of reduced design time and reduced
complexity. Kepler automatically handles data flow
between the workflow actors during execution so
scientists do not have to wait for intermediate tasks to
finish nor orchestrate data flow manually.
Our experience with metadata-driven data ingestion
using the EMLDataSource actor has shown that generic
analytical pipelines can be easily reapplied to
heterogeneous data sources without manual data
massaging by the scientist. This allows rapid exploratory
data visualization for scientists unfamiliar with data
sources. It also enables the use of previously inaccessible
data sources in complex analytical workflows. For
example, the Genetic Algorithm for Rule set Production
(GARP) Native Species Prediction Model can now use both
traditional museum occurrence data and ecological
species distribution data as inputs to predict species range
distributions based on niche modeling theory [15].
Incorporating it in Kepler has broadened the data
available, made archiving model predictions easier, and
enabled substitution of competing prediction algorithms
in the workflow, all contributing to our ability to predict
the effects of climate change and invasive species on the
environment.
The “Mineral Classifier” workflow for modal
classification of igneous rocks is a geosciences
workflow that names igneous rock samples using a
point-in-polygon algorithm.
6. Current Work
We are currently moving from a loosely coupled,
non-data-intensive web-service environment to a system
where data-intensive local and remote applications can be
chained together seamlessly. Our current work focuses
on efficiency, scheduling, and optimization of workflow
computation. This will involve optimizing the tradeoff
between network and compute resources in a distributed
framework using performance and resource estimation
and planning tools. The Kepler team would like to
cooperate with other scientific workflow projects such as
Chimera, Pegasus, Triana, and Taverna/Freefluo [3,22,16]
to work on scientifically relevant workflow specifications
that would advance interoperability among these systems.
7. Acknowledgments
Kepler includes contributors from SEEK [11], SDM
Center-SPA [13], Ptolemy II [10], and GEON [5]. This
material is based upon work supported by the National
Science Foundation under awards 0225676 for SEEK and
0225673 (AWSFL008-DS3) for GEON, by the
Department of Energy under Contract No.
DE-FC02-01ER25486 for SciDAC/SDM, and by DARPA under
Contract No. F33615-00-C-1703 for Ptolemy.
8. References
[1] BLAST: Basic local alignment search tool,
http://www.ncbi.nlm.nih.gov/BLAST/
[2] I. Foster, C. Kesselman, J. Nick, S. Tuecke. 2002. The
Physiology of the Grid: An Open Grid Services Architecture for
Distributed Systems Integration. Open Grid Service
Infrastructure WG, Global Grid Forum, June 22, 2002.
[3] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. 2002.
Chimera: A Virtual Data System for Representing, Querying
and Automating Data Derivation. Proceedings of the 14th
Conference on Scientific and Statistical Database Management,
Edinburgh, Scotland, July 2002.
[4] Genbank: National Institutes of Health Genetic Sequence
Database, http://www.ncbi.nlm.nih.gov/Genbank/
[5] GEON: Cyberinfrastructure for the Geosciences,
http://www.geongrid.org
[6] M. Greenwood, C. Wroe, R. Stevens, C. Goble, and M.
Addis. 2002. Are bioinformaticians doing e-Business? In
Proceedings of Euroweb 2002: The Web and the GRID - from
e-science to e-business, edited by Brian Matthews, Bob
Hopgood, and Michael Wilson. ISBN 1-902505-50-6.
[7] Jones, M.B., C. Berkley, J. Bojilova, and M. Schildhauer,
2001. Managing Scientific Metadata, IEEE Internet Computing
5(5): 59-68.
[8] Kepler Framework for Scientific Workflows,
http://kepler.ecoinformatics.org
[9] E. A. Lee and S. Neuendorffer. 2000. "MoML - A Modeling
Markup Language in XML, Version 0.4," Technical
Memorandum UCB/ERL M00/12, University of California,
Berkeley, CA 94720, March 14, 2000.
http://ptolemy.eecs.berkeley.edu/publications/papers/00/moml/
[10] The Ptolemy Project, http://ptolemy.eecs.berkeley.edu
[11] SEEK: Science Environment for Ecological Knowledge,
http://seek.ecoinformatics.org
[12] B. Sotomayor. 2003. The Globus Toolkit 3 Programmer's
Tutorial, http://www.casa-sotomayor.net/gt3-tutorial/index.html.
[13] SPA: http://kepler.ecoinformatics.org/spa.html
[14] SQL: Structured Query Language. Chamberlin, D.D., et
al., “SEQUEL 2: a unified approach to data definition,
manipulation and control,” IBM Journal of Research and
Development 20:6, pp. 560-575, 1976.
[15] Stockwell D.R.B. and D. Peters 1999. The GARP Modeling
System: problems and solutions to automated spatial prediction.
International Journal of Geographical Information Science 13
(2): 143-158.
[16] I. Taylor, M. Shields, I. Wang and R. Philp. 2003.
Distributed P2P Computing within Triana: A Galaxy
Visualization Test Case. To be published in the IPDPS 2003
Conference, April 2003.
http://www.gridlab.org/Resources/Papers/ipdsp_trianagalaxy_2003.pdf
[17] Universal Description, Discovery and Integration of Web
Services: http://www.uddi.org
[18] Web Service Description Language (WSDL),
http://www.w3.org/TR/wsdl
[19] Extensible Markup Language, http://www.w3.org/XML/
[20] An XML Query Language, http://www.w3.org/TR/xquery/
[21] XSL Transformations, http://www.w3.org/TR/xslt
[22] E. Deelman et al. "Mapping Abstract Complex
Workflows onto Grid Environments," Journal of Grid
Computing, Vol. 1, No. 1, 2003, pp. 25-39.