Panoply of Utilities in Taverna 1 2 1 3 1 1 4 1 K. Wolstencroft , T. Oinn , C. Goble , J. Ferris , C. Wroe , P. Lord , K. Glover , R. Stevens 1School of Computer Science, Kilburn Building, University of Manchester, Manchester 2EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge 3IT Innovation Centre, University of Southampton, Southampton 4School of Computer Science and Information Technology, University of Nottingham, Nottingham Abstract The Taverna e-Science Workbench is a central component of myGrid, a loosely coupled suite of middleware services designed to support in silico experiments in biology. Taverna enables the construction and enactment of complex workflows over resources on local and remote machines, allowing the automation of labour-intensive multi-step bioinformatics tasks. As the Taverna user community has grown, the demand for new features and additions has also grown. This paper outlines the functional requirements that have become apparent over the last year of working with domain scientists, along with the solutions implemented in both the Taverna workbench and the Freefluo enactment engine to address concerns relating to workflow construction and enactment, respectively. 1. Introduction Early myGrid use cases were based upon the bioinformatics analyses of genetic data, for example, the mapping of gapped regions associated with Williams-Beuren Syndrome (WBS) [1] and the identification of genes involved in Graves Disease (GD) [2]. These studies quickly revealed that e-Science workflows could be complex artefacts. Not only did each study use a large number of services in their workflows, but the process flows required complex invocation patterns. Both of these factors presented challenges during the development of Taverna. Iteration and recursion support was required for designing and invoking complex workflows, along with support for interaction with diverse resource types. Also, as the number of available services increased from several dozen to over 1000, a mechanism for discovering the most appropriate service for a given task became increasingly important. This paper describes recent developments and innovations in Taverna that have evolved to meet user requirements. Collaboration with biologists and bioinformaticians in the WBS and GD projects provided unique insights into the user requirements for workflow systems and enabled us to adopt a user-driven approach to development. The user community has now grown and diversified to include researchers from chemoinformatics, health informatics and systems biology, presenting new opportunities for increased functionality and continued development in the future. 2. Supporting Service Description and Discovery myGrid provides a service-oriented architecture to access and integrate heterogeneous biological data resources. There are now over 1000 services available in Taverna and this number is continuing to grow. Whilst this provides a rich, integrated resource for the user, discovering appropriate services has become increasingly difficult. In addition, the majority of services are owned and developed by third-parties, so there are no standards enforced regarding input parameter formats and there is variability in user documentation. To address this issue, a plug-in, known as Feta, has been developed that supports the discovery of a service based on a description of the function the service performs. Feta uses the myGrid service ontology [3], which provides descriptions of bioinformatics tasks and data types, to formulate descriptions of bioinformatics services and their properties, and a service registry to be searched for components matching these descriptions. Suitable services can then be added to a workflow with the standard drag and drop metaphor used elsewhere in the workbench. The semantic discovery tool enables biologists with little bioinformatics experience to find services and use them in workflow design. It also provides a way for the more experienced bioinformatician to discover alternative, equivalent services to the ones they normally use. In the event of a service failure, Taverna provides a method for specifying which Figure 1: The Feta semantic discovery tool showing the identification of alternative services for the ‘emma’ multiple alignment service. The query was for services that perform alignment tasks. Discovered services are presented with their operation definitions from the myGrid services ontology along with the location and provider of the service and an optional free-text description. alternative services should be substituted. Figure 1 demonstrates the use of Feta to discover alternative sequence alignment services. Feta can be used to discover services from any source. However, certain libraries of services provide service metadata that is useful to the discovery process. Several such libraries, including Soaplab and BioMoby, can be interrogated directly by Taverna to support effective service discovery with minimal configuration costs. 3. Supporting Complex Invocation Patterns A major requirement for Taverna was to model in workflows how users naturally deal with web-based resources. For example, invoking services, possibly repeatedly, trying alternative services, or ignoring unavailable services. Workflows are a series of inter-related components linked together by data (the output of one component forming the input for the next) or control (the link setting the conditions of execution for the next component). Commonly encountered process flows often required constructs which would manifest in traditional imperative languages as ‘if … else’ or ‘while…’ constructs. When transformed into the functional view used by most workflow systems these constructs can produce counterintuitive behaviour. Rather than introduce a set of ‘if’ ‘endif’ and merge operations within the workflow language, additional invocation semantics have been defined over each atomic process within the workflow. 3.1 Iterative Semantics Freefluo includes a mechanism to introspect at runtime over input data to a particular operation. From this introspection, and a comparison of the data types expected by that operation with those actually supplied from the upstream process, the workflow engine is able to split the incoming data sets into a queue of data items that match those the operation has declared it is capable of consuming. This queue is then used as a job queue, with a user definable number of jobs being launched in parallel at any given time, and progress reporting based on progress through the queue. An inverse of the function derived to split the inputs is then used to construct a corresponding collection structure to be populated by the results of each job in the queue. This has been shown to be intuitive to our users, resulting in the expected behaviour with no configuration in most cases. Where an explicit override of the split function is required, Taverna’s workbench provides a graphical editor for the ‘iteration strategy’ that drives this process. 3.2 Recursive Semantics Freefluo supports repeated invocation of a single operation with optional feedback and bailout. This allows structures such as ‘call service until result is true’, or more sophisticated ones corresponding to parameter sweeps or recursive functions. Figure 2: The Taverna state machine governing the processes of fault tolerance and iteration Both iteration and recursion may be applied to the same operation, with each operation within an iteration being subject to the recursive behaviour. As Taverna includes the ability to define operations from existing or in-lined workflows these facilities provide a superset of the expressive power of standard looping constructs in languages such as Java and PERL. 3.3 Fault Tolerance Every processor in a Taverna workflow has fault tolerance settings. In the event of a service failure, the user can specify the number of times the service should be retried, the length of time between the retries, or an alternative service to be substituted. The user can also specify the criticality of individual services. If a process persistently fails, after all retires and substituted services, and it is marked as critical, the entire workflow is aborted. If it is not marked as critical, the workflow will continue to run, but downstream processes will not be invoked. Figure 2 shows Taverna’s state machine, which controls the iteration, recursion and fault tolerance processes. 4. Supporting Interaction With Diverse Resource Types Although much effort is being invested in standard SOAP based web services within workflows, we have found that our users have requirements that do not fit well with the web service model. This is due to several reasons: 1. Technical limitations with the specification 2. The available toolkits 3. A lack of control over the required resources Large scale streamed data transfers are an example of the first case. Call-by-value approaches, even where technically possible, are often inefficient for large data sets. As one potential solution Taverna and Freefluo now implement a plugin based on the Sytx Grid Service protocol [4], allowing subgraphs of the workflow where data is streamed directly between services rather than round-tripping via the enactment engine. This has particular performance benefits when several linked services exist in close network proximity or where a downstream service is capable of operating on partial data. Complex object based models are often challenging to expose as web services – the document based approach mandated by the WS specification sits poorly with the complex interlinking structures of such models. Taverna now includes a tool allowing the direct invocation of arbitrary methods within such models as part of the workflow; this has allowed transparent integration of sizable third party toolsets such as the JUMBO chemoinformatics library [5] and the caBIO cancer object model. The Biomart data warehouse system provides access through direct JDBC connections back to a central database. Taverna includes a plugin which can extract the database metadata from these data stores and present a derived query generator user interface. The Freefluo enactor has similarly been extended with code to send the configured query and parse the result stream. This has given our bioinformatics users access to several critical genomic / proteomic data resources including Ensembl [6] and the UniProt [7] data sets. 5. Supporting Collaboration and Reuse The development of scientific workflows is often a time consuming process involving many iterations as understanding grows of how best to address a task using the services available. As such, individual workflows or components of workflows are potentially valuable artefacts that can usefully be shared among users. To support this, the Taverna workbench can now import workflows as service libraries. This allows sophisticated polymorphic services to be transparently shared amongst a collaborative group – an example of this is the Biomart system where services based on Biomart require extensive configuration before use. With this enhancement users can locate, describe and reuse pre-configured services from other workflows. This is particularly important for the various scripting engines, allowing in-lined scripts such as simple data transformation and other such Shim services to be shared. 6. Discussion Taverna is now a well established tool for escience experiments. The early success of the WBS and GD projects clearly demonstrated its use in bioinformatics analyses, and the subsequent growth in the user community has produced many more success stories – new escience projects, such as, ISPIDER and PsyGrid, are adopting Taverna and other myGrid components as a platform for development. Existing myGrid workflows are also being re-used. For example, components of the WBS workflow are now being used in a trypanosomiasis study. The ease of use of Taverna is one of its greatest advantages, and the development of new features must always take account of this. The user is shielded from the complexities of invoking services and dealing with data flow by the high level conceptual design process in building the workflows and is aided in discovering suitable services through Feta. Using the experience and feedback of the growing Taverna user community, in bioinformatics, health informatics, chemoinformatics and systems biology, has proven to be an effective way of driving ongoing development. Acknowledgements The authors would like to acknowledge the assistance of the whole myGrid consortium. This work is supported by the UK e-Science programme EPSRC grant GR/R67743. References [1]. R. Stevens, H.J. Tipney, C. Wroe, T. Oinn, M. Senger, P. Lord, C.A. Goble, A. Brass and M. Tassabehji Exploring Williams-Beuren Syndrome Using myGrid in Proceedings of 12th International Conference on Intelligent Systems in Molecular Biology, 31st Jul-4th Aug 2004, Glasgow, UK, published Bioinformatics Vol. 20 Suppl. 1 2004, i303-i310 [2]. Oinn T, Addis M, Ferris M, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock M, Wipat A and Li P. Taverna: A tool for the composition and enactment of bioinformatics workflows Bioinformatics Vol. 20(17) pp 30453054, 2004, doi:10.1093/bioinformatics/bth361 [3]. Chris Wroe, Robert Stevens, Carole Goble, Angus Roberts, and Mark Greenwood. Asuite of DAML+OIL Ontologies to Describe Bioinformatics Web Services and Data. The International Journal of Cooperative Information Systems, 12(2):597-624, 2003 [4]. Rob Pike and Dennis M. Ritchie. The StyxArchitecture for distributed systems. Bell Labs Technical Journal, 4(2):146{152, 1999. [5]. Yong Zhang, Peter Murray-Rust, Martin T Dove, Robert C Glen, Henry S Rzepa, Joe A Townsend, Simon Tyrrell, Jon Wakelin, Egon L Willighagen. JUMBO – An XML Infrastructure for e-science, All Hands Meeting, Nottingham, 2004 [6]. E Hubbard et al Ensembl 2005. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D447-53 [7]. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D1549.