BiodiversityWorld: An Architecture for an Extensible Virtual Laboratory for Analysing Biodiversity Patterns

A.C. Jones1, R.J. White2, N. Pittas1, W.A. Gray1, T. Sutton3, X. Xu1, O. Bromley2, N. Caithness3, F.A. Bisby3, N.J. Fiddian1, M. Scoble4, A. Culham3 and P. Williams4

1 Cardiff University, School of Computer Science, PO Box 916, Cardiff CF24 3XF
2 School of Biological Sciences, University of Southampton, Southampton SO16 7PX
3 Centre for Plant Diversity & Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS
4 The Natural History Museum, London SW7 5BD

Abstract

In this paper we discuss the BiodiversityWorld project, concentrating particularly upon the architecture that we are adopting in order to achieve our aims. BiodiversityWorld is an e-Science project, in which we are seeking to make heterogeneous data sources and analytic tools of relevance to biodiversity available in a GRID setting. We are developing a problem-solving environment that will enable scientists to use these resources in complex scenarios to solve problems in biodiversity. Our architecture is designed to insulate the core BiodiversityWorld system from heterogeneity of resource implementation language, etc., by wrapping the resources to a defined standard, while retaining the flexibility to discover and use the various operations and data types supported by each resource. At the same time, these wrappers are insulated from the GRID in order to support migration to future GRID technology. In this paper we discuss the design of the BiodiversityWorld system, including its GRID interface, the use of metadata and ontologies, and the use of workflows within the system.

1. Introduction

BiodiversityWorld is a three-year e-Science project, funded by the UK BBSRC, in which we are exploring how a problem-solving environment (PSE) can be designed and developed for biodiversity informatics in a GRID (1) environment.
It is a biology-led project, motivated by three exemplars in the field of biodiversity informatics, but we also intend the system to be extensible so that it is of general use within the discipline. Our aim is to provide scientists with tools with which they can readily access resources that were originally designed for use in isolation and compose these resources into complex workflows, and to make it as straightforward as we can for new resources to be created and introduced to the system. In the present paper we concentrate primarily upon the computing aspects of this project and, in particular, on the essential features of the system architecture. Another paper in the present proceedings provides examples illustrating how the system can be used in practice (2).

In the remainder of this paper we first provide the background to BiodiversityWorld, outlining relevant aspects of selected earlier projects, and then discuss software requirements for the present project. We then present the architecture that we have adopted, explaining how it accommodates heterogeneity and change, and describe the planned evolution of the system. We explain the roles that metadata (which will eventually evolve into an ontology) and workflows are to play in supporting the PSE. The final section discusses the current implementation status, planned future work and some of the new experimentation (such as use of OGSA-DAI) that we hope to pursue once the three chosen exemplars are in place.

2. Project background

The field of biodiversity informatics, at least outside molecular bioinformatics and genomics, is characterised by individual scientists and institutions creating their own resources, often for a narrow range of uses.
For example, a scientist may create a database to record the locations where specimens of particular kinds of plants have been found, or to hold a taxonomic checklist, comprising accepted names and synonyms for species in a taxonomic group that he or she has been studying. Interoperation among such resources can be challenging, as they were not originally designed to be used as part of a larger system and often do not conform to any recognised standard. Similarly, analytic tools to perform specific tasks have often been developed as stand-alone executables, so that if facilities from more than one tool are required, a user will frequently have to resort to transferring data between tools manually: in some cases it may even be necessary to re-enter data by hand.

Certain projects have previously addressed specific interoperability problems within biodiversity informatics. For example, some of the authors were previously involved in the SPICE for Species 2000 project (http://www.sp2000.org) and the LITCHI project. The SPICE project (3) addressed the problem of building a scalable federation of heterogeneous species databases in order to comprise a catalogue of life, whereas the LITCHI project (4) addressed the problem of detecting and resolving taxonomic conflicts that may occur within or between checklists of scientific names and synonyms. Such projects are an important contribution to biodiversity informatics, because they provide a basis for accurate retrieval of species-related information, but clearly they do not contribute a general solution to the problem of interoperation among relevant resources.

For example, suppose a scientist wishes to investigate where a particular species might be expected to occur, given estimated past or predicted future climatic conditions.
To investigate this bioclimatic modelling problem, he or she needs access to species distribution data; to tools that can model the climates characterising the locations where the species is to be found; to data pertaining to the climate at the time of interest; and to map images onto which the predicted distribution can be projected.

In the GRAB project (5) some of the present authors developed a proof-of-concept demonstrator to illustrate how a fixed sequence of operations relevant to bioclimatic modelling can be orchestrated on the GRID. Features of the Globus Toolkit then available (e.g. Global Access to Secondary Storage (GASS); Metacomputing Directory Service (MDS)) were employed, and a Web-based user interface was provided. The system used data from the International Legume Database & Information Service (ILDIS, http://www.ildis.org), FISHBASE (http://www.fishbase.org), the US National Climatic Data Center (NCDC, http://lwf.ncdc.noaa.gov) and SPICE for Species 2000 (http://www.sp2000.org). But this project provided no significant flexibility in the choice of the kinds of resources to be used or the sequence of operations to be performed.

The BiodiversityWorld project aims to remove such restrictions, to allow scientists to exploit a wide range of analytic tools and data sources, including the catalogue of life described above, but also including many other kinds of resources. Three exemplar tasks have been chosen:
• Bioclimatic modelling & climate change;
• Biodiversity richness analysis & conservation evaluation, and
• Phylogenetic analysis & biogeography.

The first of these tasks has been mentioned above; this and the other tasks are described in more detail in the companion paper in the present proceedings. For the purposes of the current paper, though, we concentrate in the next section on the general requirements that these tasks present us with, and then we describe a PSE design intended to satisfy these requirements.

3. Software requirements for BiodiversityWorld

As a GRID-based project, BiodiversityWorld has distinctive characteristics. High-performance computing resources are only of use for a small range of self-contained tasks in biodiversity informatics; but a wide range of types of data is to be used, many of these types having their own complex structures. On the other hand, a limited range of operations on these data sources is typically required: for example, the SPICE system provides six operations, of which two (search for scientific name; retrieve species information) suffice for the tasks we currently envisage. In addition to data sources, we need to make it possible to access a range of individual analytic tools, each having its own implementational idiosyncrasies. To make access to these data sources and tools possible from BiodiversityWorld, a generic access mechanism must be provided, while retaining flexibility as to the operations to be performed by each resource and the data types they can use.

[Figure 1: Full BiodiversityWorld architecture. Components shown: user interface; presentation; workflow enactment engine; native BiodiversityWorld resources; metadata repository; wrapped resources; BGI API; BiodiversityWorld-GRID Interface (BGI); the GRID.]

These heterogeneous resources must therefore be wrapped, and metadata must be available to indicate how they are to be used. Further metadata is needed to enable selection of resources meeting appropriate criteria (e.g. perhaps a specimen distribution map for a given species is required: a map for a wider range of species would also be acceptable, as long as the map distinguishes between the species it contains). These resources must be discovered and brought together by a scientist in a flexible manner into an interactive workflow.
Appropriate representation of these workflows, and tools accessible to non-computer scientists for resource discovery and for workflow design and enactment, are therefore required. Some workflows will be of general use, and so a repository of such workflows is also needed.

Typically, in the scenarios we envisage, an individual scientist will not be collaborating interactively with other users of the system. This, and the kinds of operations required for each resource, implies that a service-orientated architecture is appropriate. However, an important additional requirement is that users should be able to manipulate certain kinds of data interactively, e.g. for selection among a set of trees generated as a result of phylogenetic analysis. How this modifies our basic service-orientated architecture is discussed in the next section.

One final, but important, requirement is that the software architecture should not be bound too closely to a particular GRID infrastructure. GRID software is rapidly evolving (for example, there are major differences between Globus Toolkit versions 2 and 3) and it is desirable, as far as possible, for migration to a new infrastructure to have minimal impact on the resources. An interoperation framework that is not tied to a specific GRID infrastructure is therefore required.

4. The BiodiversityWorld architecture

In this section we present the main elements of our architecture, and then provide more detail regarding the wrapper API we have developed, followed by some discussion of metadata and ontologies, and of workflows. It should be noted that although a first prototype of the core architecture has been implemented, with some resources connected to the system, it is intended to develop more sophisticated versions of each aspect of the system as the project progresses.
[Figure 2: Prototype BiodiversityWorld architecture. Components shown: legacy user interfaces; user interface; presentation; workflow enactment engine; native BiodiversityWorld resources; metadata repository; wrapped resources; other tools; BGI API; BiodiversityWorld-GRID Interface (BGI); the GRID.]

4.1. System overview

The intended architecture of the BiodiversityWorld system is illustrated in Figure 1. A layer of abstraction, which we refer to as the BiodiversityWorld-GRID Interface (BGI), is placed between BiodiversityWorld (BDW) components and the GRID. This means that if the GRID infrastructure changes, only the BGI needs to be re-implemented: it is intended that other components will remain unchanged. Further details of the BGI API are given in Section 4.2.

Analytic tools and data sources conforming to the BGI are controlled from the workflow enactment engine. Some of these resources will be constructed directly for BDW; other, legacy resources are being wrapped to conform to the BGI. A metadata repository, eventually including a full ontology, is accessible via a special version of the BGI, which is currently under development: this metadata-specific API is intended to provide efficient access to appropriate metadata operations.

In this architecture we include a presentation layer: eventually we anticipate that the user interface will be extensible with local viewers, etc., that will make interactive data manipulation possible. In our initial prototype design, illustrated in Figure 2, only very simple “presentation” is possible, in the sense that a small number of data types, such as bit-maps or simply integers, can be presented to the user. Interactive manipulation of data is at present via legacy user interfaces associated with existing tools. In practice, some of these existing tools may also have wrappers associated with them that allow some of their operations to be controlled from BiodiversityWorld.
The user interface itself is to provide features for workflow design, including discovery of appropriate resources using their published metadata; for workflow control, and for presentation of results.

4.2. BiodiversityWorld-GRID Interface (BGI)

In order to insulate the BiodiversityWorld system from the resources’ heterogeneity, we wrap these resources and provide an invocation mechanism that allows any operation to be invoked in a standard manner. The available operations are specified by metadata associated with these resources. Moreover, to support data types specific to each resource, we provide a standard data type BdwRemoteData in which either a URI, a file name or the data itself (represented as a byte sequence) is passed to and from resources. This structure allows BdwRemoteData to act as a proxy where appropriate. When the data itself is held, it will often be an XML document, but it can also be in any other format supported by the tools.

But to insulate these wrappers from changing GRID infrastructure, they themselves need to be wrapped by wrappers providing an interface to the GRID infrastructure. This is implemented as illustrated in the UML diagram given in Figure 3. Subclasses of BdwAbstractWrapper wrap individual resources: the most important method provided is invokeOperation(), which is passed an operation name and parameters (as BdwRemoteData objects) and invokes the specified operation on a resource.

[Figure 3: BGI architecture (UML). Elements shown: the interface BgiWrapperInterface, implemented by BgiImplementation_1, BgiImplementation_2, ...; the abstract class BdwAbstractWrapper, linked to the BGI implementations with multiplicity 1; and its subclasses ConcreteWrapper_1, ConcreteWrapper_2, ...]

BgiWrapperInterface is an interface, not an abstract class, because this allows BGI implementation classes to implement other interfaces as well and to inherit from a class specific to the chosen GRID infrastructure, if this is necessary as part of the implementation.

The above approach has been implemented in Java, but we are not restricted to the use of this language. Some options for departure from Java are:
• to implement a concrete wrapper that uses the Java Native Interface to bridge between Java and the chosen alternative language;
• to make resources available by replacing the BgiWrapperInterface/BdwAbstractWrapper combination entirely with software implementing the BGI for a specific GRID infrastructure, in the language of choice, and
• to implement clients in languages other than Java.

The latter two options are only appropriate where a language-independent infrastructure is being used (e.g. GRID services), of course.

There are obvious inefficiencies in this architecture (in comparison with direct remote procedure calls, for example), but the generic invocation mechanism offers significant interoperability benefits. Moreover, the BGI layer makes resource wrapper maintenance a smaller task than would be the case if this layer were absent.

4.3. Metadata and ontologies

Initially metadata is being used only to provide the most essential information such as:
• operations supported by a resource;
• resource type, and
• resource name.

A simple proprietary structure is being used to represent this information, supported by appropriate operations to access this metadata. It is anticipated that this will soon be replaced by a triple-based representation (akin to Object-Attribute-Value triples sometimes used in expert systems) which, though still fairly simple, will be more generally useful.

Later in the project, more sophisticated metadata will be required, supported by an ontology that allows reasoning regarding relationships between terms, etc. For example, it would be desirable to be able to infer that a particular tool is a special case of a more general kind of tool.
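As an illustration of the triple-based representation anticipated above, the following minimal Java sketch stores Object-Attribute-Value triples and follows a chain of links to decide whether one kind of tool is a special case of another. It is not the project's repository: the classes Triple, TripleStore and MetadataDemo, the attribute names (resourceName, resourceType, operation, subTypeOf) and the sample resource and type names are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** One Object-Attribute-Value statement (hypothetical class, for illustration). */
final class Triple {
    final String object;
    final String attribute;
    final String value;

    Triple(String object, String attribute, String value) {
        this.object = object;
        this.attribute = attribute;
        this.value = value;
    }
}

/** Hypothetical in-memory triple store; not the project's metadata repository. */
class TripleStore {
    private final List<Triple> triples = new ArrayList<>();

    void add(String object, String attribute, String value) {
        triples.add(new Triple(object, attribute, value));
    }

    /** Returns every value recorded for the given object and attribute, in insertion order. */
    List<String> values(String object, String attribute) {
        List<String> result = new ArrayList<>();
        for (Triple t : triples) {
            if (t.object.equals(object) && t.attribute.equals(attribute)) {
                result.add(t.value);
            }
        }
        return result;
    }

    /** True if type equals ancestor or reaches it by following subTypeOf triples. */
    boolean isKindOf(String type, String ancestor) {
        return isKindOf(type, ancestor, new HashSet<String>());
    }

    private boolean isKindOf(String type, String ancestor, Set<String> visited) {
        if (type.equals(ancestor)) {
            return true;
        }
        if (!visited.add(type)) {
            return false; // already explored: guards against cyclic metadata
        }
        for (String parent : values(type, "subTypeOf")) {
            if (isKindOf(parent, ancestor, visited)) {
                return true;
            }
        }
        return false;
    }
}

class MetadataDemo {
    /** Builds sample metadata; every identifier below is invented for this example. */
    static TripleStore sample() {
        TripleStore store = new TripleStore();
        store.add("spiceWrapper", "resourceName", "SPICE for Species 2000");
        store.add("spiceWrapper", "resourceType", "SpeciesCatalogue");
        store.add("spiceWrapper", "operation", "searchScientificName");
        store.add("spiceWrapper", "operation", "retrieveSpeciesInformation");
        store.add("BioclimaticModellingTool", "subTypeOf", "ModellingTool");
        store.add("ModellingTool", "subTypeOf", "AnalyticTool");
        return store;
    }

    public static void main(String[] args) {
        TripleStore store = sample();
        System.out.println(store.values("spiceWrapper", "operation"));
        System.out.println(store.isKindOf("BioclimaticModellingTool", "AnalyticTool"));
    }
}
```

A full ontology would support far richer reasoning than this transitive lookup; the sketch only shows that even a simple triple structure can answer the "special case of" question.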
The precise nature of this more sophisticated repository, and how information is to be represented within it, is a subject of on-going research.

4.4. Workflows

In the present version of our prototype we are using an elementary XML workflow representation that can be interpreted by a simple BGI client, but which can only be altered by hand. However, we regard the ability of scientists such as biologists to create their own workflows, without the need for regular assistance from computer scientists, as an essential part of the BiodiversityWorld system. On-going research is exploring and experimenting with existing workflow systems and standards to determine whether one of these can be adopted by the project. Our tentative conclusions are that for workflow enactment there are a number of possible candidates, but that we shall need to develop our own user interface if it is to be appropriate for the intended users.

5. Current status, future work and conclusions

A simple prototype, not using the architecture presented above, was initially developed. The purpose of this prototype was to provide the project members with experience of wrapping resources to a defined standard and setting up inter-site communications, addressing firewall issues, etc. This prototype implements a climate space modelling application, which provides some of the intended features of our bioclimatic modelling & climate change example.

As an initial test scenario for the architecture presented in this paper we are first reimplementing the above application. At the time of writing, a prototype BGI-wrapper framework has been implemented and some of the resources have been re-wrapped to conform to the BGI API. We are currently implementing a basic workflow enactment engine, metadata repository and user interface: some details of these are given in the previous section. It is intended that this new prototype will become available in September 2003.
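To give a rough picture of the BGI-wrapper framework mentioned above, the Java sketch below follows the structure described in Section 4.2: a resource wrapper exposes a generic invokeOperation() method, and a separate BGI implementation sits between it and the client. The names BdwRemoteData, BgiWrapperInterface, BdwAbstractWrapper and invokeOperation() come from Section 4.2, but their signatures here are assumptions, as are the classes ChecklistWrapper, LocalBgi and BgiDemo, the operation name searchScientificName and the made-up species names. LocalBgi simply forwards calls in-process; it stands in for a real RMI or Globus implementation and is not the project's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Carries a URI, a file name, or the data itself as a byte sequence. */
final class BdwRemoteData {
    enum Kind { URI, FILE_NAME, BYTES }

    private final Kind kind;
    private final String reference; // URI or file name; null when kind is BYTES
    private final byte[] bytes;     // inline data; null unless kind is BYTES

    private BdwRemoteData(Kind kind, String reference, byte[] bytes) {
        this.kind = kind;
        this.reference = reference;
        this.bytes = bytes;
    }

    static BdwRemoteData ofUri(String uri) { return new BdwRemoteData(Kind.URI, uri, null); }
    static BdwRemoteData ofFileName(String name) { return new BdwRemoteData(Kind.FILE_NAME, name, null); }
    static BdwRemoteData ofBytes(byte[] data) { return new BdwRemoteData(Kind.BYTES, null, data.clone()); }
    static BdwRemoteData ofText(String text) { return ofBytes(text.getBytes(StandardCharsets.UTF_8)); }

    Kind kind() { return kind; }
    String reference() { return reference; }

    /** Decodes inline data as UTF-8 text; only valid when the data itself is held. */
    String asText() {
        if (kind != Kind.BYTES) {
            throw new IllegalStateException("No inline data: this object is a proxy for " + reference);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

/** GRID-facing side: one implementation per GRID infrastructure. */
interface BgiWrapperInterface {
    BdwRemoteData invokeOperation(String operation, BdwRemoteData... parameters);
}

/** Resource-facing side: subclasses wrap individual resources. */
abstract class BdwAbstractWrapper {
    abstract BdwRemoteData invokeOperation(String operation, BdwRemoteData... parameters);
}

/** Hypothetical concrete wrapper around a tiny in-memory checklist. */
class ChecklistWrapper extends BdwAbstractWrapper {
    private final Map<String, String> acceptedNameBySynonym = new HashMap<>();

    ChecklistWrapper() {
        acceptedNameBySynonym.put("Exampleus synonymus", "Exampleus acceptus"); // made-up names
    }

    @Override
    BdwRemoteData invokeOperation(String operation, BdwRemoteData... parameters) {
        if ("searchScientificName".equals(operation) && parameters.length == 1) {
            String query = parameters[0].asText();
            // Unknown names are returned unchanged in this toy checklist.
            return BdwRemoteData.ofText(acceptedNameBySynonym.getOrDefault(query, query));
        }
        throw new IllegalArgumentException("Unsupported operation: " + operation);
    }
}

/** Stand-in BGI implementation that forwards calls in-process (no GRID, no RMI). */
class LocalBgi implements BgiWrapperInterface {
    private final BdwAbstractWrapper wrapper;

    LocalBgi(BdwAbstractWrapper wrapper) { this.wrapper = wrapper; }

    @Override
    public BdwRemoteData invokeOperation(String operation, BdwRemoteData... parameters) {
        return wrapper.invokeOperation(operation, parameters);
    }
}

class BgiDemo {
    public static void main(String[] args) {
        BgiWrapperInterface bgi = new LocalBgi(new ChecklistWrapper());
        BdwRemoteData result = bgi.invokeOperation(
                "searchScientificName", BdwRemoteData.ofText("Exampleus synonymus"));
        System.out.println(result.asText());
    }
}
```

Under this sketch, replacing LocalBgi with another implementation of BgiWrapperInterface would leave ChecklistWrapper untouched, which is the property the BGI layer is intended to provide.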
An important point to note is that, although we intend to migrate to Globus in due course, the current prototype implements the BGI using Java Remote Method Invocation (RMI). This was chosen for the initial BGI implementation for a number of reasons:
• members of the project team have a good knowledge of RMI, and RMI provides the essential mechanisms necessary for a project of this type, the most important being remote method invocation and registration of remote objects;
• Globus Toolkit version 2 provides only very basic facilities for remote job execution, file transfer, etc.: although we believe we could implement the BGI so as to use this version of Globus, it seemed an unnecessary effort given the new, more appropriate GRID services facilities defined for Globus Toolkit version 3;
• Globus Toolkit version 3 is currently in the early stages of release, and we have chosen to wait until it has stabilised somewhat before attempting to make use of it in the BiodiversityWorld system, and
• using a technology such as RMI and then migrating to Globus gives us an opportunity to test the genericity of the BGI: it is intended that the resource wrappers will not need to be changed in the process of this migration.

Additional biologists are being appointed for the second and third years of the BiodiversityWorld project, with the intention that they will use BiodiversityWorld in the three chosen exemplar areas. A first priority, therefore, is to develop prototype examples for each of these areas, wrapping relevant resources. Other priorities are:
• implementation of a Globus Toolkit version 3, GRID services-based version of the BGI;
• selection of suitable workflow tools, developing our own if necessary;
• development of a suitable ontology, and
• specification and prototyping of suitable user interfaces and presentation tools.

We also hope to be able to achieve interoperation with other bioinformatics e-Science projects via the BGI.
Another issue of interest to us is the OGSA-DAI project (http://www.ogsadai.org.uk/), which is providing GRID database access facilities. Although, as we indicated earlier, many of the data sources that we are using only need to be accessed via a small, well-defined set of operations for our purposes, we are exploring the possibility of exploiting the OGSA-DAI facilities as our project progresses.

In conclusion, we have presented the BiodiversityWorld project and explained the main elements of the architecture that we have developed in order to provide a suitable interoperation framework for biodiversity informatics on the GRID. An initial prototype using this architecture is currently being assembled; it is intended to extend this to provide a flexible problem-solving environment in which scientists can explore problems relating to biodiversity.

6. Acknowledgements

This project is funded by a research grant from the UK Biotechnology and Biological Sciences Research Council (BBSRC). We are grateful to a good number of collaborators who will be making data and tools available to the project. In particular, we are grateful to Species 2000, the International Legume Database & Information Service (ILDIS) and the Hadley Centre for Climate Prediction and Research for access to data that we have used in BiodiversityWorld prototypes thus far. We are also grateful to Mr Peter Brewer for assistance and advice provided.

References

1. I. Foster and C. Kesselman, editors. The GRID: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, CA, 1999.
2. R.J. White, F.A. Bisby, N. Caithness, T. Sutton, P. Brewer, P. Williams, A. Culham, M. Scoble, A.C. Jones, W.A. Gray, N.J. Fiddian, N. Pittas, X. Xu, O. Bromley, and P. Valdes. The BiodiversityWorld Environment as an Extensible Virtual Laboratory for Analysing Biodiversity Patterns. In Proc. 2nd e-Science All-Hands Meeting, Nottingham, 2003. (To appear)
3. A.C. Jones, X. Xu, N. Pittas, W.A. Gray, N.J. Fiddian, R.J. White, J.S. Robinson, F.A. Bisby, and S.M. Brandt. SPICE: a Flexible Architecture for Integrating Autonomous Databases to Comprise a Distributed Catalogue of Life. In Proc. 11th International Conference on Database and Expert Systems Applications (LNCS 1873), pages 981-992, Springer-Verlag, 2000.
4. A.C. Jones, I. Sutherland, S.M. Embury, W.A. Gray, R.J. White, J.S. Robinson, F.A. Bisby, and S.M. Brandt. Techniques for Effective Integration, Maintenance and Evolution of Species Databases. In Proc. 12th International Conference on Scientific and Statistical Databases, pages 3-13, IEEE Computer Society Press, 2000.
5. A.C. Jones, W.A. Gray, J.P. Giddy, and N.J. Fiddian. Linking Heterogeneous Biodiversity Information Systems on the GRID: the GRAB demonstrator. Computing and Informatics 21:383-398, 2002.