BiodiversityWorld: An Architecture for an Extensible Virtual
Laboratory for Analysing Biodiversity Patterns
A.C. Jones¹, R.J. White², N. Pittas¹, W.A. Gray¹, T. Sutton³, X. Xu¹, O. Bromley², N. Caithness³,
F.A. Bisby³, N.J. Fiddian¹, M. Scoble⁴, A. Culham³ and P. Williams⁴
¹ Cardiff University, School of Computer Science, PO Box 916, Cardiff CF24 3XF
² School of Biological Sciences, University of Southampton, Southampton SO16 7PX
³ Centre for Plant Diversity & Systematics, School of Plant Sciences, The University of Reading, Reading RG6 6AS
⁴ The Natural History Museum, London SW7 5BD
Abstract
In this paper we discuss the BiodiversityWorld project, concentrating particularly upon
the architecture that we are adopting in order to achieve our aims. BiodiversityWorld is
an e-Science project, in which we are seeking to make heterogeneous data sources and
analytic tools of relevance to biodiversity available in a GRID setting. We are
developing a problem-solving environment that will enable scientists to use these
resources in complex scenarios to solve problems in biodiversity. Our architecture is
designed to insulate the core BiodiversityWorld system from heterogeneity of resource
implementation language, etc., by wrapping the resources to a defined standard, while
retaining the flexibility to discover and use the various operations and data types
supported by each resource. At the same time, these wrappers are insulated from the
GRID in order to support migration to future GRID technology. In this paper we
discuss the design of the BiodiversityWorld system, including its GRID interface, the
use of metadata and ontologies, and the use of workflows within the system.
1. Introduction
BiodiversityWorld is a three-year e-Science
project, funded by the UK BBSRC, in which we
are exploring how a problem-solving
environment (PSE) can be designed and
developed for biodiversity informatics in a
GRID (1) environment. It is a biology-led
project, motivated by three exemplars in the
field of biodiversity informatics, but we are
concerned to make this system extensible to be
of general use within the discipline. Our aim is
to provide scientists with tools with which they
can readily access resources that were originally
designed for use in isolation, composing these
resources into complex workflows, and to make
it as straightforward as we can for new
resources to be created and introduced to the
system.
In the present paper we concentrate primarily
upon the computing aspects of this project and,
in particular, on the essential features of the
system architecture. Another paper in the
present proceedings provides examples
illustrating how the system can be used in
practice (2).
In the remainder of this paper we first
provide the background to BiodiversityWorld,
outlining relevant aspects of selected earlier
projects, and then discuss software requirements
for the present project. We then present the
architecture that we have adopted, explaining
how it accommodates heterogeneity and change,
and describe the planned evolution of the
system. We explain the roles that metadata —
which will eventually evolve into an ontology
— and workflows are to play in supporting the
PSE. The final section discusses the current
implementation status, planned future work and
some of the new experimentation (such as use
of OGSA-DAI) that we hope to pursue, once the
three chosen exemplars are in place.
2. Project background
The field of biodiversity informatics, at least
outside molecular bioinformatics and genomics,
is characterised by individual scientists and
institutions creating their own resources, often
for a narrow range of uses. For example, a
scientist may create a database to record the
locations where specimens of particular kinds of
plants have been found, or to hold a taxonomic
checklist, comprising accepted names and
synonyms for species in a taxonomic group that
he or she has been studying. Interoperation
among such resources can sometimes be
challenging, as they were not originally
designed to be used as part of a larger system
and often do not conform to any recognised
standard. Similarly, analytic tools to perform
specific tasks have often been developed as
stand-alone executables, so that if facilities from
more than one tool are required, a user will
frequently have to resort to transferring data
between tools manually: in some cases it may
even be necessary to re-enter data by hand.
Certain projects have previously addressed
specific interoperability problems within
Biodiversity Informatics. For example, some of
the authors were previously involved in the
SPICE for Species 2000¹ project and the
LITCHI project. The SPICE project (3)
addressed the problem of building a scalable
federation of heterogeneous species databases in
order to comprise a catalogue of life; whereas
the LITCHI project (4) addressed the problem
of detecting and resolving taxonomic conflicts
that may occur within or between checklists of
scientific names and synonyms.
Such projects are an important contribution
to biodiversity informatics, because they
provide a basis for accurate retrieval of species-related information, but clearly they do not
contribute a general solution to the problem of
interoperation among relevant resources. For
example, suppose a scientist wishes to
investigate where a particular species might be
expected to occur, given estimated past or
predicted future climatic conditions. To
investigate this bioclimatic modelling problem,
he or she needs access to species distribution
data; to tools that can model the climates
characterising the locations where the species is
to be found; to data pertaining to the climate at
the time of interest, and to map images onto
which the predicted distribution can be
projected. In the GRAB project (5) some of the
present authors developed a proof-of-concept
demonstrator to illustrate how a fixed sequence
of operations relevant to bioclimatic modelling
can be orchestrated on the GRID. Features of
the Globus Toolkit then available (e.g. Globus
Access to Secondary Service (GASS);
Metacomputing Directory Service (MDS)) were
employed, and a Web-based user interface was
provided. The system used data from the
International Legume Database & Information
Service (ILDIS),² FISHBASE,³ the US National
Climate Data Centre (NCDC)⁴ and SPICE for
Species 2000. But this project provided no
significant flexibility in the choice of the kinds
of resources to be used or the sequence of
operations to be performed.
¹ http://www.sp2000.org
² http://www.ildis.org
The BiodiversityWorld project aims to
remove such restrictions, to allow scientists to
exploit a wide range of analytic tools and data
sources, including the catalogue of life
described above, but also including many other
kinds of resources. Three exemplar tasks have
been chosen:
• Bioclimatic modelling & climate change;
• Biodiversity richness analysis & conservation evaluation, and
• Phylogenetic analysis & biogeography.
The first of these tasks has been mentioned
above; this and the other tasks are described in
more detail in the companion paper in the
present proceedings. For the purposes of the
current paper, though, we concentrate in the
next section on the general requirements that
these tasks present us with, and then we
describe a PSE design intended to satisfy these
requirements.
3. Software requirements for BiodiversityWorld
As a GRID-based project, BiodiversityWorld
has distinctive characteristics. High-performance
computing resources are only of
use for a small range of self-contained tasks in
biodiversity informatics; but a wide range of
types of data is to be used, many of these types
having their own complex structures. On the
other hand, a limited range of operations on
these data sources is typically required — for
example, the SPICE system provides six
operations, of which two (search for scientific
name; retrieve species information) suffice for
the tasks we currently envisage.
In addition to data sources, we need to make
it possible to access a range of individual
analytic tools, each having its own
implementational idiosyncrasies. To make
these data sources and tools accessible from
BiodiversityWorld, a generic access mechanism
must be provided, while retaining flexibility as
to the operations each resource performs and
the data types it can use.
³ http://www.fishbase.org
⁴ http://lwf.ncdc.noaa.gov
[Figure 1: Full BiodiversityWorld architecture. Labelled components: user interface; presentation; workflow enactment engine; native BiodiversityWorld resources; metadata repository; wrapped resources; BGI API; metadata BGI API; metadata BGI; BiodiversityWorld-GRID Interface (BGI); the GRID.]
These heterogeneous resources must therefore
be wrapped, and metadata must be available to
indicate how they are to be used. Further
metadata is needed to enable selection of
resources meeting appropriate criteria. For
example, if a specimen distribution map for a
given species is required, a map covering a
wider range of species would also be
acceptable, as long as it distinguishes between
the species it contains.
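As an illustration only, the selection criterion just described could be expressed as a simple predicate over resource metadata. The following Java sketch is not part of the BiodiversityWorld code: the type MapResourceInfo, its fields and the helper names are hypothetical, and real metadata would be richer than this.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical metadata describing one distribution-map resource. */
record MapResourceInfo(String name, Set<String> speciesCovered, boolean distinguishesSpecies) {}

final class ResourceSelectionSketch {
    /**
     * A map is acceptable if it covers the requested species and, when it
     * also covers other species, distinguishes between them.
     */
    static boolean isAcceptable(MapResourceInfo map, String species) {
        if (!map.speciesCovered().contains(species)) {
            return false;
        }
        return map.speciesCovered().size() == 1 || map.distinguishesSpecies();
    }

    /** Returns the names of all acceptable candidates, in their original order. */
    static List<String> acceptableNames(List<MapResourceInfo> candidates, String species) {
        return candidates.stream()
                .filter(m -> isAcceptable(m, species))
                .map(MapResourceInfo::name)
                .collect(Collectors.toList());
    }
}
```

A map covering several species without distinguishing them is rejected, even though it contains the species of interest.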
These resources must be discovered and
brought together by a scientist in a flexible
manner into an interactive workflow.
Appropriate representation of these workflows,
and tools accessible to non-computer scientists
for resource discovery and for workflow design
and enactment, are therefore required. Some
workflows will be of general use, and so a
repository of such workflows is also needed.
Typically, in the scenarios we envisage, an
individual scientist will not be collaborating
interactively with other users of the system.
This, and the kinds of operations required for
each resource, imply that a service-orientated
architecture is appropriate. However, an
important additional requirement is that users
should be able to manipulate certain kinds of
data interactively — e.g. for selection among a
set of trees generated as a result of phylogenetic
analysis. How this modifies our basic service-orientated architecture is discussed in the next
section.
One final, but important, requirement is that
the software architecture should not be bound
too closely to a particular GRID infrastructure.
GRID software is rapidly evolving — for
example, there are major differences between
Globus Toolkit versions 2 and 3 — and it is
desirable, as far as possible, for migration to a
new infrastructure to have minimal impact on
the resources. An interoperation framework that
is not tied to a specific GRID infrastructure is
therefore required.
4. The BiodiversityWorld architecture
In this section we shall present the main
elements of our architecture, and then provide
more detail regarding the wrapper API we have
developed, followed by some discussion
regarding metadata and ontologies, and also
regarding workflows. It should be noted that
although a first prototype of the core
architecture has been implemented, with some
resources connected to the system, it is intended
to develop more sophisticated versions of each
aspect of the system as the project progresses.
[Figure 2: Prototype BiodiversityWorld architecture. Labelled components: legacy user interfaces; user interface; presentation; workflow enactment engine; native BiodiversityWorld resources; metadata repository; wrapped resources; other tools; BGI API; metadata BGI API; metadata BGI; BiodiversityWorld-GRID Interface; the GRID.]
4.1. System overview
The intended architecture of the
BiodiversityWorld system is illustrated in
Figure 1. A layer of abstraction is placed
between BiodiversityWorld (BDW) components
and the GRID, which we refer to as the
BiodiversityWorld-GRID Interface (BGI). This
means that if the GRID infrastructure changes,
only the BGI needs to be re-implemented: it is
intended that other components will remain
unchanged. Further details of the BGI API are
given in Section 4.2.
Analytic tools and data sources conforming
to the BGI are controlled from the workflow
enactment engine. Some of these resources will
be constructed directly for BDW; other, legacy
resources are being wrapped to conform to the
BGI. A metadata repository, eventually
including a full ontology, is accessible via a
special version of the BGI, which is currently
under development: this metadata-specific API
is intended to provide efficient access to
appropriate metadata operations.
In this architecture we include a presentation
layer: eventually we anticipate that the user
interface will be extensible with local viewers,
etc., that will make interactive data
manipulation possible. In our initial prototype
design, illustrated in Figure 2, only very simple
“presentation” is possible, in the sense that a
small number of data types, such as bit-maps or
simply integers, can be presented to the user.
Interactive manipulation of data is at present via
legacy user interfaces associated with existing
tools. In practice, some of these existing tools
may also have wrappers associated with them
that allow some of their operations to be
controlled from BiodiversityWorld.
The user interface itself is to provide features
for workflow design, including discovery of
appropriate resources using their published
metadata; for workflow control, and for
presentation of results.
4.2. BiodiversityWorld-GRID Interface (BGI)
In order to insulate the BiodiversityWorld
system from the resources’ heterogeneity, we
wrap these resources and provide an invocation
mechanism that allows any operation to be
invoked in a standard manner. The available
operations are specified by metadata associated
with these resources. Moreover, to support data
types specific to each resource, we provide a
standard data type BdwRemoteData in which
either a URI, a file name or the data itself —
represented as a byte sequence — is passed to
and from resources. This structure allows
<<interface>>
BgiWrapperInterface
Bgi
Implementation_1
1
Bgi
Implementation_2
<<abstract>>
BdwAbstractWrapper
1
...
Concrete
Wrapper_1
Concrete
Wrapper_2
...
Figure 3: BGI architecture
BdwRemoteData to act as a proxy where
appropriate. When the data itself is held, it will
often be an XML document, but it can also be in
any other format supported by the tools.
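To make the generic invocation mechanism and the BdwRemoteData type more concrete, a minimal Java sketch follows. The names BdwRemoteData, BdwAbstractWrapper and invokeOperation() are taken from this paper, but every signature, field and factory method shown here is an assumption, and UpperCaseWrapper is a toy resource invented purely for illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

/** Sketch of BdwRemoteData: carries a URI, a file name, or the data itself as bytes. */
final class BdwRemoteData {
    enum Kind { URI_REFERENCE, FILE_NAME, INLINE_BYTES }

    private final Kind kind;
    private final String reference; // URI or file name; null for inline data
    private final byte[] bytes;     // inline data; null when acting as a proxy

    private BdwRemoteData(Kind kind, String reference, byte[] bytes) {
        this.kind = kind;
        this.reference = reference;
        this.bytes = bytes;
    }

    static BdwRemoteData ofUri(URI uri) {
        return new BdwRemoteData(Kind.URI_REFERENCE, uri.toString(), null);
    }

    static BdwRemoteData ofFileName(String fileName) {
        return new BdwRemoteData(Kind.FILE_NAME, fileName, null);
    }

    static BdwRemoteData ofBytes(byte[] data) {
        return new BdwRemoteData(Kind.INLINE_BYTES, null, data.clone());
    }

    Kind kind() { return kind; }

    /** True when only a reference is carried, i.e. the object acts as a proxy. */
    boolean isProxy() { return kind != Kind.INLINE_BYTES; }

    String reference() { return reference; }

    byte[] bytes() { return bytes == null ? null : bytes.clone(); }
}

/** Sketch of the wrapper base class: one generic entry point for all operations. */
abstract class BdwAbstractWrapper {
    abstract BdwRemoteData invokeOperation(String operationName, List<BdwRemoteData> parameters);
}

/** Toy resource (hypothetical): upper-cases inline UTF-8 text. */
final class UpperCaseWrapper extends BdwAbstractWrapper {
    @Override
    BdwRemoteData invokeOperation(String operationName, List<BdwRemoteData> parameters) {
        if (!"toUpperCase".equals(operationName)) {
            throw new IllegalArgumentException("Unsupported operation: " + operationName);
        }
        BdwRemoteData input = parameters.get(0);
        if (input.isProxy()) {
            throw new IllegalArgumentException("This toy wrapper only accepts inline data");
        }
        String text = new String(input.bytes(), StandardCharsets.UTF_8);
        return BdwRemoteData.ofBytes(text.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
    }
}
```

Because the operation is selected by name at run time, a client needs no compile-time knowledge of a resource beyond its published metadata, at the cost of losing static type checking of operation names and parameters.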
But to insulate these wrappers from changing
GRID infrastructure, they need themselves to be
wrapped by wrappers providing an interface to
the GRID infrastructure. This is implemented as
illustrated in the UML diagram given in Figure
3. Subclasses of BdwAbstractWrapper
wrap individual resources: the most important
method provided is invokeOperation(),
which is passed an operation name and
parameters (as BdwRemoteData objects) and
invokes the specified operation on a resource.
BgiWrapperInterface is an interface,
not an abstract class, because this allows BGI
implementation classes to implement other
interfaces as well and to inherit from a class
specific to the chosen GRID infrastructure, if
this is necessary as part of the implementation.
The above approach has been implemented
in Java, but we are not restricted to the use of
this language. Some options for departure from
Java are:
• to implement a concrete wrapper that
  uses the Java Native Interface to bridge
  between Java and the chosen
  alternative language;
• to make resources available by
  replacing the BgiWrapperInterface/BdwAbstractWrapper
  combination entirely with
  software implementing the BGI for a
  specific GRID infrastructure, in the
  language of choice, and
• to implement clients in languages other
  than Java.
The latter two options are only appropriate
where a language-independent infrastructure is
being used (e.g. GRID services), of course.
There are obvious inefficiencies in this
architecture (in comparison with direct remote
procedure calls, for example), but the generic
invocation mechanism offers significant
interoperability benefits. Moreover, the BGI
layer makes resource wrapper maintenance a
smaller task than would be the case if this layer
were absent.
4.3. Metadata and ontologies
Initially metadata is being used only to provide
the most essential information such as:
• operations supported by a resource;
• resource type, and
• resource name.
A simple proprietary structure is being used to
represent this information, supported by
appropriate operations to access this metadata.
It is anticipated that this will soon be replaced
by a triple-based representation (akin to
Object-Attribute-Value triples sometimes used in
expert systems) which, though still fairly simple,
will be more generally useful.
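The essential metadata listed above, held in an Object-Attribute-Value style, might look like the following Java sketch. It is a sketch under stated assumptions rather than the project's repository: the classes MetadataTriple and MetadataRepositorySketch, the attribute names and the operation identifiers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** One Object-Attribute-Value statement about a resource (hypothetical structure). */
record MetadataTriple(String object, String attribute, String value) {}

/** In-memory stand-in for a metadata repository; not the project's actual API. */
final class MetadataRepositorySketch {
    private final List<MetadataTriple> triples = new ArrayList<>();

    void add(String object, String attribute, String value) {
        triples.add(new MetadataTriple(object, attribute, value));
    }

    /** Returns every value recorded for the given object and attribute. */
    List<String> values(String object, String attribute) {
        List<String> result = new ArrayList<>();
        for (MetadataTriple t : triples) {
            if (t.object().equals(object) && t.attribute().equals(attribute)) {
                result.add(t.value());
            }
        }
        return result;
    }

    /** Returns the objects (resources) for which the attribute has the given value. */
    List<String> objectsWith(String attribute, String value) {
        List<String> result = new ArrayList<>();
        for (MetadataTriple t : triples) {
            if (t.attribute().equals(attribute) && t.value().equals(value)
                    && !result.contains(t.object())) {
                result.add(t.object());
            }
        }
        return result;
    }
}
```

For example, a resource could be described by the triples (spice, resourceName, SPICE), (spice, resourceType, dataSource) and one (spice, operation, ...) triple per supported operation; a client would then call values("spice", "operation") to discover what it may invoke.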
Later in the project, more sophisticated
metadata will be required, supported by an
ontology that allows reasoning regarding
relationships between terms, etc. For example, it
would be desirable to be able to infer that a
particular tool is a special case of a more
general kind of tool. The precise nature of this
more sophisticated repository, and how
information is to be represented within it, is a
subject of on-going research.
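The kind of inference mentioned above, that one tool is a special case of a more general kind of tool, can be illustrated with a very small hierarchy lookup. This Java sketch is only an assumption about what such reasoning could look like; the class ToolHierarchySketch and the tool names in the usage note are hypothetical, and a full ontology would support far more than single-parent subsumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Minimal "is a special case of" hierarchy with a single parent per term. */
final class ToolHierarchySketch {
    private final Map<String, String> parentOf = new HashMap<>();

    /** Records that the specific term is a special case of the general term. */
    void declareSpecialCase(String specific, String general) {
        parentOf.put(specific, general);
    }

    /**
     * Follows parent links upwards from the tool. A term counts as a special
     * case of itself; the visited set guards against accidental cycles.
     */
    boolean isSpecialCaseOf(String tool, String general) {
        Set<String> seen = new HashSet<>();
        String current = tool;
        while (current != null && seen.add(current)) {
            if (current.equals(general)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }
}
```

With declareSpecialCase("BioclimaticModel", "ModellingTool") and declareSpecialCase("ModellingTool", "AnalyticTool"), the query isSpecialCaseOf("BioclimaticModel", "AnalyticTool") succeeds by transitivity, so a request for any analytic tool could be matched by the more specific resource.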
4.4. Workflows
In the present version of our prototype we are
using an elementary XML workflow
representation that can be interpreted by a
simple BGI client, but which can only be altered
by hand. However, we regard the ability of
scientists such as biologists to create their own
workflows, without regular assistance from
computer scientists, as an essential part of the
BiodiversityWorld system. On-going research is
exploring and
experimenting with existing workflow systems
and standards to determine whether one of these
can be adopted by the project. Our tentative
conclusions are that for workflow enactment
there are a number of possible candidates, but
that we shall need to develop our own user
interface if it is to be appropriate for the
intended users.
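The paper does not give the format of the elementary XML workflow representation, so the following Java sketch invents one purely for illustration: a workflow element containing step elements, each with hypothetical resource and operation attributes. It shows how a simple client might read such a document into an ordered list of steps before invoking each operation in turn.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** One step of a workflow: which operation to invoke on which resource. */
record WorkflowStep(String resource, String operation) {}

final class WorkflowReaderSketch {
    /**
     * Parses a workflow document in the hypothetical format
     * <workflow><step resource="..." operation="..."/>...</workflow>
     * and returns its steps in document order.
     */
    static List<WorkflowStep> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("step");
            List<WorkflowStep> steps = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element step = (Element) nodes.item(i);
                steps.add(new WorkflowStep(
                        step.getAttribute("resource"),
                        step.getAttribute("operation")));
            }
            return steps;
        } catch (Exception ex) {
            // Parser configuration, I/O and syntax errors are all reported the same way here.
            throw new IllegalArgumentException("Malformed workflow document", ex);
        }
    }
}
```

A hand-edited representation of this kind is easy to interpret but offers no branching, data flow between steps or validation, which is part of why existing workflow systems and standards are being evaluated.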
5. Current status, future work and conclusions
A simple prototype, not using the architecture
presented above, was initially developed. The
purpose of this prototype was to provide the
project members with experience of wrapping
resources to a defined standard and setting up
inter-site communications, addressing firewall
issues, etc. This prototype implements a climate
space modelling application, which provides
some of the intended features of our bioclimatic
modelling & climate change example.
As an initial test scenario for the architecture
presented in this paper we are first re-implementing the above application. At the time
of writing, a prototype BGI-wrapper framework
has been implemented and some of the
resources have been re-wrapped to conform to
the BGI API. We are currently implementing a
basic workflow enactment engine, metadata
repository and user interface: some details of
these are given in the previous section. It is
intended that this new prototype will become
available in September 2003.
An important point to note is that, although
we intend to migrate to Globus in due course,
the current prototype implements the BGI using
Java Remote Method Invocation (RMI). This
was chosen for the initial BGI implementation
for a number of reasons:
• members of the project team have a
  good knowledge of RMI, and RMI
  provides the essential mechanisms
  necessary for a project of this type,
  the most important being remote
  method invocation and registration of
  remote objects;
• Globus toolkit version 2 provides only
  very basic facilities for remote job
  execution, file transfer, etc.: although
  we believe we could implement the
  BGI so as to use this version of Globus,
  it seemed an unnecessary effort given
  the new, more appropriate Grid
  services facilities defined for Globus
  toolkit version 3;
• Globus toolkit version 3 is currently in
  the early stages of release, and we have
  chosen to wait until it has stabilised
  somewhat before attempting to make
  use of it in the BiodiversityWorld
  system, and
• using a technology such as RMI and
  then migrating to Globus gives us an
  opportunity to test the genericity of the
  BGI: it is intended that the resource
  wrappers will not need to be changed
  in the process of this migration.
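How an RMI-based BGI implementation can leave the resource wrappers untouched may be sketched as follows. The names BgiWrapperInterface and BdwAbstractWrapper come from this paper; the byte-array signature, the separate remote interface RmiBgiRemote, the class RmiBgiImplementation and the publish() helper are assumptions made for this illustration and are not the project's code. Only standard java.rmi calls are used.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

/** GRID-neutral wrapper base class; the byte-array signature is an assumption. */
abstract class BdwAbstractWrapper {
    abstract byte[] invokeOperation(String operationName, byte[][] parameters);
}

/** GRID-neutral view of the BGI, free of any RMI types. */
interface BgiWrapperInterface {
    byte[] invokeOperation(String operationName, byte[][] parameters);
}

/** RMI-specific remote view (hypothetical); remote methods must declare RemoteException. */
interface RmiBgiRemote extends Remote {
    byte[] invokeOperation(String operationName, byte[][] parameters) throws RemoteException;
}

/**
 * RMI flavour of the BGI. It implements both interfaces and only delegates,
 * so a concrete wrapper never refers to RMI and need not change if this
 * class is later replaced by an implementation for another infrastructure.
 */
final class RmiBgiImplementation implements BgiWrapperInterface, RmiBgiRemote {
    private final BdwAbstractWrapper wrapper;

    RmiBgiImplementation(BdwAbstractWrapper wrapper) {
        this.wrapper = wrapper;
    }

    @Override
    public byte[] invokeOperation(String operationName, byte[][] parameters) {
        return wrapper.invokeOperation(operationName, parameters);
    }

    /** Exports the object and registers it under a name in a new local registry. */
    static Registry publish(String name, RmiBgiRemote implementation, int registryPort)
            throws RemoteException {
        Registry registry = LocateRegistry.createRegistry(registryPort);
        Remote stub = UnicastRemoteObject.exportObject(implementation, 0);
        registry.rebind(name, stub);
        return registry;
    }
}
```

Implementing an interface rather than extending an abstract class is what lets RmiBgiImplementation also implement the RMI-specific remote interface, which is the design rationale given for BgiWrapperInterface in Section 4.2.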
Additional biologists are being appointed for the
second and third years of the BiodiversityWorld
project, with the intention that they will use
BiodiversityWorld in the three chosen exemplar
areas. A first priority, therefore, is to develop
prototype examples for each of these areas,
wrapping relevant resources. Other priorities are:
• implementation of a Globus toolkit
  version 3, GRID services-based
  version of the BGI;
• selection of suitable workflow tools,
  developing our own if necessary;
• development of a suitable ontology,
  and
• specification and prototyping of
  suitable user interfaces and
  presentation tools.
We also hope to be able to achieve
interoperation with other Bioinformatics e-Science projects via the BGI. Another issue of
interest to us is the OGSA-DAI project,⁵ which
is providing GRID database access facilities.
Although, as we indicated earlier, many of the
data sources that we are using only need to be
accessed via a small, well-defined set of
operations for our purposes, we are exploring
the possibility of exploiting the OGSA-DAI
facilities as our project progresses.
⁵ http://www.ogsadai.org.uk/
In conclusion, we have presented the
BiodiversityWorld project and explained the
main elements of the architecture that we have
developed in order to provide a suitable
interoperation framework for biodiversity
informatics on the GRID. An initial prototype
using this architecture is currently being
assembled; it is intended to extend this to
provide a flexible problem-solving environment
in which scientists can explore problems
relating to biodiversity.
6. Acknowledgements
This project is funded by a research grant from
the UK Biotechnology and Biological Sciences
Research Council (BBSRC). We are grateful to
a good number of collaborators who will be
making data and tools available to the project.
In particular, we are grateful to Species 2000,
the International Legume Database &
Information Service (ILDIS) and the Hadley
Centre for Climate Prediction and Research for
access to data that we have used in
BiodiversityWorld prototypes thus far. We are
also grateful to Mr Peter Brewer for assistance
and advice provided.
References
1. I. Foster and C. Kesselman, editors. The
GRID: Blueprint for a New Computing
Infrastructure. Morgan Kaufmann, San
Francisco, CA, 1999.
2. R.J. White, F.A. Bisby, N. Caithness, T.
Sutton, P. Brewer, P. Williams, A. Culham,
M. Scoble, A.C. Jones, W.A. Gray, N.J.
Fiddian, N. Pittas, X. Xu, O. Bromley, and P.
Valdes. The BiodiversityWorld Environment
as an Extensible Virtual Laboratory for
Analysing Biodiversity Patterns. In Proc. 2nd
e-Science All-Hands Meeting, Nottingham,
2003. (To appear)
3. A.C. Jones, X. Xu, N. Pittas, W.A. Gray, N.J.
Fiddian, R.J. White, J.S. Robinson, F.A.
Bisby, and S.M. Brandt. SPICE: a Flexible
Architecture for Integrating Autonomous
Databases to Comprise a Distributed
Catalogue of Life. In Proc. 11th
International Conference on Database and
Expert Systems Applications (LNCS 1873),
pages 981-992, Springer Verlag, 2000.
4. A.C. Jones, I. Sutherland, S.M. Embury,
W.A. Gray, R.J. White, J.S. Robinson, F.A.
Bisby, and S.M. Brandt. Techniques for
Effective Integration, Maintenance and
Evolution of Species Databases. In Proc.
12th International Conference on Scientific
and Statistical Databases, pages 3-13, IEEE
Computer Society Press, 2000.
5. A.C. Jones, W.A. Gray, J.P. Giddy, and N.J.
Fiddian. Linking Heterogeneous Biodiversity
Information Systems on the GRID: the
GRAB demonstrator. Computing and
Informatics 21:383-398, 2002.