Panoply of Utilities in Taverna

advertisement
Panoply of Utilities in Taverna
1
2
1
3
1
1
4
1
K. Wolstencroft , T. Oinn , C. Goble , J. Ferris , C. Wroe , P. Lord , K. Glover , R. Stevens
1School of Computer Science, Kilburn Building, University of Manchester, Manchester
2EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge
3IT Innovation Centre, University of Southampton, Southampton
4School of Computer Science and Information Technology, University of Nottingham, Nottingham
Abstract
The Taverna e-Science Workbench is a central component of myGrid, a loosely coupled suite of
middleware services designed to support in silico experiments in biology. Taverna enables the
construction and enactment of complex workflows over resources on local and remote machines,
allowing the automation of labour-intensive multi-step bioinformatics tasks. As the Taverna user
community has grown, the demand for new features and additions has also grown. This paper outlines
the functional requirements that have become apparent over the last year of working with domain
scientists, along with the solutions implemented in both the Taverna workbench and the Freefluo
enactment engine to address concerns relating to workflow construction and enactment, respectively.
1. Introduction
Early myGrid use cases were based upon the
bioinformatics analyses of genetic data, for
example, the mapping of gapped regions
associated with Williams-Beuren Syndrome
(WBS) [1] and the identification of genes
involved in Graves Disease (GD) [2]. These
studies quickly revealed that e-Science
workflows could be complex artefacts. Not only
did each study use a large number of services in
their workflows, but the process flows required
complex invocation patterns.
Both of these factors presented challenges
during the development of Taverna. Iteration
and recursion support was required for
designing and invoking complex workflows,
along with support for interaction with diverse
resource types. Also, as the number of available
services increased from several dozen to over
1000, a mechanism for discovering the most
appropriate service for a given task became
increasingly important.
This paper describes recent developments and
innovations in Taverna that have evolved to
meet user requirements. Collaboration with
biologists and bioinformaticians in the WBS
and GD projects provided unique insights into
the user requirements for workflow systems and
enabled us to adopt a user-driven approach to
development.
The user community has now grown and
diversified to include researchers from
chemoinformatics, health informatics and
systems biology, presenting new opportunities
for increased functionality and continued
development in the future.
2. Supporting Service Description
and Discovery
myGrid provides a service-oriented architecture
to access and integrate heterogeneous biological
data resources. There are now over 1000
services available in Taverna and this number is
continuing to grow. Whilst this provides a rich,
integrated resource for the user, discovering
appropriate services has become increasingly
difficult. In addition, the majority of services
are owned and developed by third-parties, so
there are no standards enforced regarding input
parameter formats and there is variability in
user documentation.
To address this issue, a plug-in, known as Feta,
has been developed that supports the discovery
of a service based on a description of the
function the service performs.
Feta uses the myGrid service ontology [3],
which provides descriptions of bioinformatics
tasks and data types, to formulate descriptions
of bioinformatics services and their properties,
and a service registry to be searched for
components matching these descriptions.
Suitable services can then be added to a
workflow with the standard drag and drop
metaphor used elsewhere in the workbench.
The semantic discovery tool enables biologists
with little bioinformatics experience to find
services and use them in workflow design. It
also provides a way for the more experienced
bioinformatician to discover alternative,
equivalent services to the ones they normally
use. In the event of a service failure, Taverna
provides a method for specifying which
Figure 1: The Feta semantic discovery tool showing the identification of alternative services for the ‘emma’ multiple alignment
service. The query was for services that perform alignment tasks. Discovered services are presented with their operation definitions
from the myGrid services ontology along with the location and provider of the service and an optional free-text description.
alternative services should be substituted.
Figure 1 demonstrates the use of Feta to
discover alternative sequence alignment
services.
Feta can be used to discover services from any
source. However, certain libraries of services
provide service metadata that is useful to the
discovery process. Several such libraries,
including Soaplab and BioMoby, can be
interrogated directly by Taverna to support
effective service discovery with minimal
configuration costs.
3. Supporting Complex Invocation
Patterns
A major requirement for Taverna was to model
in workflows how users naturally deal with
web-based resources. For example, invoking
services, possibly repeatedly, trying alternative
services, or ignoring unavailable services.
Workflows are a series of inter-related
components linked together by data (the output
of one component forming the input for the
next) or control (the link setting the conditions
of execution for the next component).
Commonly encountered process flows often
required constructs which would manifest in
traditional imperative languages as ‘if … else’
or ‘while…’ constructs. When transformed into
the functional view used by most workflow
systems these constructs can produce
counterintuitive behaviour. Rather than
introduce a set of ‘if’ ‘endif’ and merge
operations within the workflow language,
additional invocation semantics have been
defined over each atomic process within the
workflow.
3.1 Iterative Semantics
Freefluo includes a mechanism to introspect at
runtime over input data to a particular operation.
From this introspection, and a comparison of the
data types expected by that operation with those
actually supplied from the upstream process, the
workflow engine is able to split the incoming
data sets into a queue of data items that match
those the operation has declared it is capable of
consuming. This queue is then used as a job
queue, with a user definable number of jobs
being launched in parallel at any given time,
and progress reporting based on progress
through the queue. An inverse of the function
derived to split the inputs is then used to
construct a corresponding collection structure to
be populated by the results of each job in the
queue. This has been shown to be intuitive to
our users, resulting in the expected behaviour
with no configuration in most cases. Where an
explicit override of the split function is required,
Taverna’s workbench provides a graphical
editor for the ‘iteration strategy’ that drives this
process.
3.2 Recursive Semantics
Freefluo supports repeated invocation of a
single operation with optional feedback and
bailout. This allows structures such as ‘call
service until result is true’, or more
sophisticated ones corresponding to parameter
sweeps or recursive functions.
Figure 2: The Taverna state machine governing the processes of fault tolerance and iteration
Both iteration and recursion may be applied to
the same operation, with each operation within
an iteration being subject to the recursive
behaviour. As Taverna includes the ability to
define operations from existing or in-lined
workflows these facilities provide a superset of
the expressive power of standard looping
constructs in languages such as Java and PERL.
3.3 Fault Tolerance
Every processor in a Taverna workflow has
fault tolerance settings. In the event of a service
failure, the user can specify the number of times
the service should be retried, the length of time
between the retries, or an alternative service to
be substituted. The user can also specify the
criticality of individual services. If a process
persistently fails, after all retires and substituted
services, and it is marked as critical, the entire
workflow is aborted. If it is not marked as
critical, the workflow will continue to run, but
downstream processes will not be invoked.
Figure 2 shows Taverna’s state machine, which
controls the iteration, recursion and fault
tolerance processes.
4. Supporting Interaction With
Diverse Resource Types
Although much effort is being invested in
standard SOAP based web services within
workflows, we have found that our users have
requirements that do not fit well with the web
service model. This is due to several reasons:
1. Technical limitations with the specification
2. The available toolkits
3. A lack of control over the required resources
Large scale streamed data transfers are an
example of the first case. Call-by-value
approaches, even where technically possible,
are often inefficient for large data sets. As one
potential solution Taverna and Freefluo now
implement a plugin based on the Sytx Grid
Service protocol [4], allowing subgraphs of the
workflow where data is streamed directly
between services rather than round-tripping via
the enactment engine. This has particular
performance benefits when several linked
services exist in close network proximity or
where a downstream service is capable of
operating on partial data.
Complex object based models are often
challenging to expose as web services – the
document based approach mandated by the WS
specification sits poorly with the complex
interlinking structures of such models. Taverna
now includes a tool allowing the direct
invocation of arbitrary methods within such
models as part of the workflow; this has allowed
transparent integration of sizable third party
toolsets such as the JUMBO chemoinformatics
library [5] and the caBIO cancer object model.
The Biomart data warehouse system provides
access through direct JDBC connections back to
a central database. Taverna includes a plugin
which can extract the database metadata from
these data stores and present a derived query
generator user interface. The Freefluo enactor
has similarly been extended with code to send
the configured query and parse the result
stream. This has given our bioinformatics users
access to several critical genomic / proteomic
data resources including Ensembl [6] and the
UniProt [7] data sets.
5. Supporting Collaboration and
Reuse
The development of scientific workflows is
often a time consuming process involving many
iterations as understanding grows of how best to
address a task using the services available.
As such, individual workflows or components
of workflows are potentially valuable artefacts
that can usefully be shared among users. To
support this, the Taverna workbench can now
import workflows as service libraries. This
allows sophisticated polymorphic services to be
transparently shared amongst a collaborative
group – an example of this is the Biomart
system where services based on Biomart require
extensive configuration before use. With this
enhancement users can locate, describe and
reuse pre-configured services from other
workflows. This is particularly important for the
various scripting engines, allowing in-lined
scripts such as simple data transformation and
other such Shim services to be shared.
6. Discussion
Taverna is now a well established tool for escience experiments. The early success of the
WBS and GD projects clearly demonstrated its
use in bioinformatics analyses, and the
subsequent growth in the user community has
produced many more success stories – new escience projects, such as, ISPIDER and
PsyGrid, are adopting Taverna and other
myGrid components as a platform for
development. Existing myGrid workflows are
also being re-used. For example, components of
the WBS workflow are now being used in a
trypanosomiasis study.
The ease of use of Taverna is one of its greatest
advantages, and the development of new
features must always take account of this. The
user is shielded from the complexities of
invoking services and dealing with data flow by
the high level conceptual design process in
building the workflows and is aided in
discovering suitable services through Feta.
Using the experience and feedback of the
growing Taverna user community, in
bioinformatics,
health
informatics,
chemoinformatics and systems biology, has
proven to be an effective way of driving
ongoing development.
Acknowledgements
The authors would like to acknowledge the
assistance of the whole myGrid consortium. This
work is supported by the UK e-Science
programme EPSRC grant GR/R67743.
References
[1]. R. Stevens, H.J. Tipney, C. Wroe, T. Oinn, M.
Senger, P. Lord, C.A. Goble, A. Brass and M.
Tassabehji
Exploring
Williams-Beuren
Syndrome Using myGrid in Proceedings of 12th
International Conference on Intelligent Systems
in Molecular Biology, 31st Jul-4th Aug 2004,
Glasgow, UK, published Bioinformatics Vol. 20
Suppl. 1 2004, i303-i310
[2]. Oinn T, Addis M, Ferris M, Marvin D, Senger
M, Greenwood M, Carver T, Glover K, Pocock
M, Wipat A and Li P. Taverna: A tool for the
composition and enactment of bioinformatics
workflows Bioinformatics Vol. 20(17) pp 30453054, 2004, doi:10.1093/bioinformatics/bth361
[3]. Chris Wroe, Robert Stevens, Carole Goble,
Angus Roberts, and Mark Greenwood. Asuite of
DAML+OIL
Ontologies
to
Describe
Bioinformatics Web Services and Data. The
International
Journal
of
Cooperative
Information Systems, 12(2):597-624, 2003
[4]. Rob Pike and Dennis M. Ritchie. The StyxArchitecture for distributed systems. Bell Labs
Technical Journal, 4(2):146{152, 1999.
[5]. Yong Zhang, Peter Murray-Rust, Martin T
Dove, Robert C Glen, Henry S Rzepa, Joe A
Townsend, Simon Tyrrell, Jon Wakelin, Egon L
Willighagen. JUMBO – An XML Infrastructure
for e-science, All Hands Meeting, Nottingham,
2004
[6]. E Hubbard et al Ensembl 2005. Nucleic Acids
Res. 2005 Jan 1;33(Database issue):D447-53
[7]. Bairoch A, Apweiler R, Wu CH, Barker WC,
Boeckmann B, Ferro S, Gasteiger E, Huang H,
Lopez R, Magrane M, Martin MJ, Natale DA,
O'Donovan C, Redaschi N, Yeh LS. The
Universal Protein Resource (UniProt). Nucleic
Acids Res. 2005 Jan 1; 33(Database issue):D1549.
Download