Bioinformatics and eScience

advertisement
Bioinformatics and eScience
W.A. Gray1 and C. Thompson2
1
Cardiff University, School of Computer Science, PO Box 916, Cardiff CF24 3XF,
2
BBSRC, Polaris House, North Star Avenue, Swindon SN2 1UH
Abstract
This paper gives a brief overview of the diverse field of bioinformatics, identifying
research themes. It then introduces the BBSRC funded eScience pilot projects against
this overview by presenting their biological aims. Areas of the eScience programme
addressed by these projects are identified to determine the contribution they are
expected to make to eScience. This covers contributions to the developing Grid
middleware and associated eScience standards as well as their bioinformatics goals.
Acknowledgement
The authors thank the staff and PIs of the six projects who contributed material on
which this paper is based. These are:
1) (e-Protein) A distributed pipeline for structural-based proteome annotation using
GRID technology – Prof MJE Sternberg (PI) – Imperial College, University
College London, European Bioinformatics Institute [www.e-protein.org]
2) (BioSimGRID) A GRID database for biomolecular simulations – Prof MSP
Sansom (PI) – Oxford University, Southampton University, Birkbeck College,
York
University,
Nottingham
University,
Birmingham
University
[www.biosimgrid.org]
3) (e-HTPX) An eScience resource for high throughput protein crystallography – Dr
C Nave (PI) – CLRC Daresbury Laboratory, Cambridge University, Cardiff
University, European Bioinformatics Institute, York University, Oxford University
[www.e-htpx.org]
4) (BDWorld) A problem solving environment for global biodiversity: prototype and
demonstrator – Prof FA Bisby (PI) – Reading University, Cardiff University,
Southampton University, Natural History Museum [www.bdworld.org]
5) GRID-enabled modelling tools and databases for neuroinformatics – Dr N
Goddard (PI) – Edinburgh University {jointly funded with MRC}
[www.axiope.org and www.anc.ed.ac.uk]
6) (BASIS) Biology of ageing eScience integration and simulation system – Prof
TBL Kirkwood (PI) – Newcastle University [www.basis.ncl.ac.uk]
1. Introduction
In May the BBSRC held a meeting of its
eScience pilot projects. It was agreed at the
meeting, that it would be a good idea to present
a paper at this All Hands Meeting, which
covered their six pilot projects presenting their
bioinformatics goals and how they will
contribute to the aims of the UK eScience
programme. The intention is to show the types
of bioinformatics research enabled by an
eScience approach and how these projects will
drive and contribute to the eScience
developments occurring in parallel in other
research disciplines across the UK eScience
programme.
Bioinformatics can be described as the
derivation of knowledge from computer
analysis of biological data. This is a simplistic
view as it is a discipline which includes a wide
range of scientific investigation. It is not a
homogeneous domain. This means that there is
a wide range of opinions and views as to what it
comprises. In its broadest sense it is the
application of informatics techniques to
biological data in a research or application
environment. These informatics techniques can
come from a number of disciplines including
computer science, statistics and mathematics.
This discipline list is by no means exhaustive as
researchers utilise techniques from engineering,
physics, and other scientific disciplines.
In their Strategic Plan 2003-2008 [1] the
BBSRC recognise the growing importance of
bioinformatics in their domain when they state:
“Genome sequencing and post-genomic
technologies provide researchers with massive
amounts of data. As a consequence biology is
becoming
more
quantitative.
Large
experimental data sets will increasingly allow
computer (in silico) simulation of biological
systems.”
This recognises the growing importance of
bioinformatics in the next generation of research
in bioscience.
The BBSRC identify in [1] the following areas
as important to their research agenda in nonclinical bioscience:
Integrative biology,
bioscience research also needs access to the data
collections made over the centuries by Institutes
such as the Natural History Museum. These
collections must be prepared in machine
readable formats that allow investigations to be
made that shed new light and understanding on
biodiversity and the effects on it of changes in
climate, agricultural policy and government
policy. It is important that the UK research in
bioinformatics links with other efforts at
National and International levels eg the GBIF
(Global Biodiversity Information Facility)
initiative in biodiversity.
2. The Pilot Projects
When the pilot project proposals were submitted
the BBSRC had identified four theme areas for
these projects:
Genomics,
Structural studies,
Cellular processes, and
Biodiversity.
The successful projects covered topics within
these areas, although there was no specific pilot
project in the genomics area.
Sustainable agriculture,
The Healthy organism, and
Bioscience for industry.
It is recognised that there are different levels of
biological structure within these areas from
individual molecules, cells, tissues/organs
through populations to microbes, plants and
animals. There is a need to do research in a
number of ways within these levels and across
the levels. This research underpins the growth
in bioinformatics due to its generation of large
amounts of data, which need to be analysed,
integrated and used in simulations to test new
ideas.
In the second strategic objective in [1], it is
recognised that bioscience is increasingly
dependent on the development and use of etools due to the data being collected at the
‘omics’ level of research which should lead to
more productive in silico research and the
development of new bioinformatics tools.
These tools will be needed in areas such as data
mining, pattern recognition, model building,
data sharing across the vertical and horizontal
biological
structure
levels.
Traditional
The BASIS project at Newcastle University [2]
is concerned with utilising emerging Grid
middleware technology to develop a system
supporting a research community investigating
quantitatively the biology of ageing – at the cell,
tissue and organism levels. It is creating new
tools such as SBML (Systems Biology Markup
Language) which will help a researcher build a
computer based model of ageing processes and
test them. At the tissue level these will be based
on fibroblasts, gut and brain with experiments
being conducted to determine the effect of
random cell death in a tissue. Users will be able
to set up experiments involving the creation of
new models or modification of existing models
which let them gain a deeper understanding of
the effects of tissue ageing on organisms. It is
intended that this facility will be available to a
distributed community of researchers who will
share their results, models and analytic tools.
Thus this project aims to create a facility which
allows simulation by system modelling within a
structure level and across levels. The
development team is working closely with
another local team who are developing Grid
middleware in the OGSA DAI project for
accessing data held in databases. Thus their
requirements are informing the development of
this middleware and they are utilising it in the
development of the system.
e-Protein aims to provide a structure-based
annotation of proteins in major genomes, which
can be disseminated to other researchers. It is
intended that alternative annotation approaches
will be investigated to identify improvements in
the methods of annotation. This will be used to
build local databases holding structural and
functional annotation of sequence data, which
can be linked with relevant bioinformatics data
resources at other sites. The improvement of
protein modelling is a prime aim of the project
so that better function predictions can be made.
They intend that the system will be available to
a research community collaborating in their
investigations so that they can investigate and
test alternative structure models. This system
will have a workflow based interface which
makes use of the ICENI middleware and its rich
metadata structure for describing software tools.
This middleware is being developed in a related
Grid project which involves several of the eProtein investigators. As in BASIS this project
will be informing the development of the Grid
middleware it requires.
e-HTPX is addressing the problem of unifying
the
procedures
of
protein
structure
determination so that they can be accessed
through a single interface which allows
structural biologists to create models from the
data generated by high throughput protein
crystallography. This will involve creating new
structure determination software which can take
advantage of HPC computing facilities so that
the results of the structure determination can be
delivered on the same time scale as data
collection. Data generated in these experiments
will be stored at the EBI as an available
resource for the research community. It will
involve giving users access to instruments, data
collections and analytic tools. This project is
primarily concerned with building structural
models within a structure level. As the system
is expected to have a number of industrial users,
an important concern is the authentication of
users to protect the system against unauthorised
use. The development team will be consulting
potential industrial users to determine their
requirements in this important area. This will
be used to see whether this can be supported by
Grid facilities.
BioSimGrid aims to allow comparisons to be
made of the results of multiple biomolecular
simulations so that the structure of proteins and
nucleic acids can be better understood. Its users
will have access to large quantities of
simulation data, which will be integrated in
further simulation experiments testing theories
in structural biology within structure levels with
the capability to reuse this data in cross level
modelling of structures. This data will require
curation. These secondary analysis experiments
will need data mining services to locate relevant
data within these databases as well as data
analysis tools. Some of the research community
using this system will be working in
commercial organisations in the pharmaceutical
industry. This introduces the need for user
authentication before access is allowed to some
of the data and tools as commercial
confidentiality will need to be protected. It also
means they have an interest in investigating
mechanisms for distributed authorisation and
accounting. There is also a requirement to link
the simulation data with other biological and
structural data held in National repositories to
allow development of richer, more sophisticated
models.
BDWorld [3] is creating a problem solving
environment in which researchers can locate
appropriate analytic tools and data resources
held at different sites in the environment. These
tools and data can then be linked in a work flow
which produces results relevant to an
investigation into a biodiversity problem at the
species level. This may be a question such as
what will be the effect on a species’ distribution
if global warming occurs, or could this plant
become invasive if it is introduced as an
agricultural crop in a region. The system utilises
a partial catalogue of life and other biotic and
abiotic data, such as climate envelopes, in three
exemplar studies to prove the concept –
biodiversity richness analysis; bioclimatic
modelling and climate change; and phylogenetic
analysis and biogeographic. This will involve
the system linking heterogeneous legacy and
current data collections so that it can
interoperate on this data using a variety of
software tools with different data format
expectations. It is intended that this work will
link with the GBIF system being developed in
an international effort, as its resources will
complement GBIFs. This system must be able
to evolve by adding new data collections and
software tools to its distributed resources. This
requires wrapping of legacy resources as they
join so that they are consistent with the
standards for data within BDWorld. This
system is primarily concerned with the microbe,
plant, animal level of biology although it will be
able to support some lower level analyses. At
the moment a basic BDWorld system is being
built but its design is such that it will be able to
evolve: by incorporating new tools and data
resources; by adding ontologies which will help
users discover the resources they need; by
incorporating more sophisticated display tools.
The designers are aware that Grid middleware is
being created in parallel with their development
of BDWorld and they are ensuring that it will be
able to take advantage of appropriate Grid
middleware when it has reached a suitable stage
in its development.
The neuroinformatics project [4] is investigating
how the brain functions. It is intended that the
developed system will allow neuroscientists to
work collaboratively sharing their data and
software tools. Research in this area needs to be
undertaken at different biological levels and
across the levels. The challenge is to allow the
researchers
to
create
their
models
collaboratively and conduct experiments on
them. This involves being able to locate
appropriate data so that it can be linked in the
models or utilised by the models. This data and
data produced by the models must be available
for use in future experiments. This research is
being undertaken in collaboration with scientists
at the Newcastle eScience centre who are
looking after the database and Grid middleware
aspects of this project in a separate project. The
prime concern of this project at the moment is
creating data models and software tools that
enable heterogeneous data to be easily shared
and analysed by its user community. The
design team recognise that there will be a need
in the next phase of its development for the
system to provide sophisticated visualisation
tools and ontologies.
Thus they are
concentrating on creating a basic system
environment that can evolve by adding such
tools in the future.
3. Pilot Project e-Science themes
These pilot projects display a number of
eScience themes.
They are all aiming to support collaborative
working within a research community who need
to share data, results and software tools. This
means they will need to support discovery of
relevant data sources, data and their descriptions
so that the different tools can analyse and share
data. They will need to overcome heterogeneity
in data representation when it is prepared over
time for different purposes when accessing
legacy data and develop new extended standards
for the metadata describing this data which
allows its provenance to be established and
stored. Many of the projects are creating results
which have to be stored so that other analytic
tools can use this data in the future. This implies
that the data will need to be curated with
provenance showing how it was created and the
tools creating it. There will also be a need to use
and store descriptive data so that representation
of the data is understood by researchers and can
be interpreted by software tools.
There is some need for High Performance
Computing (HPC) but it is not a major
requirement of the projects. It occurs when
complex models are being built and the results
of analysing and using the models are needed in
real time for further analysis. E-HTPX is the
only project seeing this as a prime requirement,
although the others may need some access to
these facilities in the future.
All of the projects will be creating new software
tools which need to be made available within
the system for other users. These tools must be
engineered so that they can link with existing
tools and utilise the data available in the grid
system. This means that there must be
descriptions of these tools which enable them to
be linked in analytic chains which can be
executed by work flow engines. Several of the
projects need work flow engines for their user
interface to allow a user to create and execute
work flows which perform the required
analysis.
Most of the projects are aiming to support the
building of structural models at different
biological levels, which can be used to
determine the functioning of a biological system
and the effect of change on the system. It is
clear that as bioinformatics expands through the
Grid this will become a growing area as
researchers create more and more complex
models that are not limited to one biological
level but interact across the levels to determine
the effect of substructure change on the higher
level structures. This growth in model
complexity will also be reflected in a growing
level of diversity in the data sources used in the
models as researchers investigate more fully the
causes of change and evolution.
There is little emphasis at the moment on the
need for sophisticated presentation tools which
allow users to present information in different
and more imaginative ways. This is probably
due to modelling being mainly done at a single
level at the moment. Another reason for this
could be that the structural modelling of
biological systems is a relatively new technique,
and as it matures more sophisticated displays of
the results from these complex models will be
required by the modellers to make it easier for
users to understand the outcomes. It is also a
feature of the current state of development of
the systems where these facilities are seen as the
second stage of the development and an
unnecessary luxury until the basic systems are
working.
Two projects are investigating the use of the
Grid authentication techniques. This is due to
the nature of the projects which have industrial
links at the moment, rather than it not being of
concern in the area. Again as the field matures
and this type of analysis becomes more
accepted there will be an increase in this
requirement.
4. Expected effect on eScience
It is clear that the pilot projects have fairly
ambitious bioinformatics goals and that they do
not see themselves developing middleware per
se for the Grid, but co-operating with the
projects that are developing the middleware.
The e-Protein and neuroinformatics projects are
working closely with research groups that are
developing middleware in separate projects and
will utilise this software as it becomes available.
The e-Protein team are working closely with the
team developing the ICENI middleware at
Imperial College and the neuroinformatics have
a close link with the team developing the OGSA
DAI middleware at Newcastle University.
These projects will have a direct influence on
the development of these pieces of eScience
middleware. The other projects are keeping
themselves
informed
of
middleware
developments and will utilise appropriate
middleware, when it is in a stable enough form,
until then they are likely to use alternative nongeneric, limited capability software that is
available or they developed themselves to meet
their needs. However they all intend to take full
advantage of Grid middleware when it is stable.
All the projects have major data handling
challenges and one of the major contributions
from this research programme should be insight
into the future metadata requirements and
standards in Grid environments. This covers the
description of data and software as well as
provenance and curation of data. A major issue
through all the projects is the interoperation of
data held in different data collections.
Considerable insight should be gained from
these pilots as to how to describe and hold data
so that this task is facilitated, especially with
respect to the wrapping of legacy systems so
that they can enter new environments easily.
Although it is not a major feature at the
moment, these systems will need metadata
repositories and ontologies to help users identify
the resources held in the environments that they
require. These will be needed to help the
researchers create the workflows that will do
their analyses of the data. At the moment some
of the projects are investigating or creating
embryonic workflow engines which will be
used to execute the analytic chains of software
tools which are created by users to identify the
required analysis. This work should further
inform the development of these workflow
engines.
These projects all intend to develop basic
systems which can evolve to meet future as yet
unknown requirements.
This will be an
important feature of their system architectures
and the development of these pilots should give
us more insight into the best ways of building
systems with this capability. This will be
important in the development of the
sophisticated Grid systems as we will not be
able to afford to recreate such systems from
scratch.
5. Conclusions
The bioinformatics pilot projects will make
meaningful contributions to the eScience
programme in the areas of creating the metadata
standards required for bioinformatics data and
tools. They will contribute to the definition of
data curation and provenance standards. They
do not intend to make a direct contribution to
the development of the middleware required for
the Grid but their use of it as it evolves will
inform the development of this middleware.
They will also identify new middleware
requirements. It is clear that in the future this
research will inform the development of the
next generation of ontologies and data/resource
discovery tools and the more sophisticated
presentation tools such as result visualisation.
However the major contribution of these pilots
will be as catalysts which encourage more
bioinformatics research by demonstrating what
can be achieved by collaborative in silico data
experimentation and analysis in bioscience.
The pilots will also create the basic systems
which will allow the next generation of
researchers to fully exploit this capability. This
will enable the field to grow and support the
collaborative working needed to build and
exploit the next generation of systems biology
models.
References
1. World Class Bioscience, Strategic Plan 20032008, BBSRC, Swindon (2003)
2. Kirkwood TBL et al: Towards an e-biology
of ageing: integrating theory and data, Nature
Reviews Molecular Cell Biology 4, 243-49
(2003)
3. Bisby FA: Biodiversity Informatics, in
Business (quarterly magazine of the BBSRC),
24-25, July 2003
4. Goddard N, Cannon R and Howell F: Axiope
Tools for Data Management and Data Sharing,
accepted by J Neuroinformatics to be published
(2003)
Download