Grid Based Virtual Bioinformatics Laboratory

Paul Donachy
Terrence J Harmer
Ron H Perrott
Belfast e-Science Centre
www.qub.ac.uk/escience
Jim Johnston
Alan McBride
Michael Townsley
Fusion Antibodies Ltd
www.fusionantibodies.com
Shane McKee
Amtec Medical Limited
www.amtec-medical.com
Abstract
Biotechnologies such as genomics, gene sequencing and high-throughput
screening are creating massive volumes and multiple sources
of biological and chemical data. However, the volumes of data and the
processing power required to analyse them are threatening to create a
bottleneck that might hamper the growth of biotechnology itself. To
date, the HPC resources required to store, manage and analyse such
volumes of data have been at the disposal only of large companies and
research institutes.
However, with the emergence of Grid Technology, the whole area of
bioinformatics is an ideal candidate to leverage the benefits of secure,
reliable and scalable high-bandwidth access to distributed data sources
across various administrative domains. In effect, this will give
geographically remote researchers with limited internal resources
access to a wealth of biological datasets and HPC resources.
This paper presents, from an industrial perspective, the business drivers
that acted as the catalyst in creating the industrial e-Science project
GeneGrid. It also presents the architecture and roadmap for a Grid-based
Virtual Bioinformatics Laboratory.
1 Introduction
Whole genome expression monitoring
will have extraordinary impact on clinical
diagnosis and therapy and bring new
power to clinical medicine. As the field
progresses we will identify new probes
for cancer, infectious disease and
inherited disease and understand how
genetic damage occurs and how genes
alter response to drug therapies. Equally
important will be new therapeutic tools in
the form of recombinant gene products,
novel drug targets, rational drug design,
and gene therapy.
Next-generation efforts will allow us to
link gene expression patterns with formal
characteristics of disease models,
including pathological and clinical
descriptions.
It has been more than a year since the
human genome was mapped, in what is
considered one of the most gargantuan
scientific endeavors ever undertaken. The
DNA-sequencing data from the Human
Genome Project (HGP) contain much
untapped data that need to be converted
into meaningful information. At present,
the human genome database holds
approximately six terabytes of data. This
data set is expected to double every six
months.
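To make this growth rate concrete, the short Python sketch below projects the size of the data set from the figures quoted above (about six terabytes, doubling every six months). The function name `projected_size_tb` and the assumption of smooth exponential growth are illustrative only and do not come from the project itself.

```python
def projected_size_tb(initial_tb: float, doubling_months: float,
                      elapsed_months: float) -> float:
    """Return the projected size in TB, assuming steady exponential doubling."""
    return initial_tb * 2 ** (elapsed_months / doubling_months)


if __name__ == "__main__":
    # Figures from the text: about 6 TB today, doubling every 6 months.
    for months in (6, 12, 24, 36):
        size = projected_size_tb(6.0, 6.0, months)
        print(f"after {months:2d} months: {size:,.0f} TB")
```

Under these assumptions the data set would grow from 6 TB to 96 TB within two years, which indicates the scale of storage and processing that a distributed solution would have to absorb.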
At present there is a vital need to develop
distributed solutions to capture, analyse,
manage, mine and disseminate these vast
amounts of genomic data, in order to
develop actual diagnostic and therapeutic
strategies. With the emergence of Grid
Technology, the area of bioinformatics is
an ideal candidate to leverage the benefits
of secure, reliable and scalable
high-bandwidth access to distributed data
sources across various administrative
domains.
This paper presents the background
and motivation for the industrial
e-Science project GeneGrid. GeneGrid was
conceived from work related to the
activities of a number of biotech
companies based in Northern Ireland with
extensive international collaborative
relationships in North America, Europe
and Brazil. GeneGrid aims to combine
the skills and experience of the
stakeholders with the collaborative sharing
and coordinated use of their distributed
resources to create a “virtual
Bioinformatics laboratory” using the
Grid. This will allow all relevant
organizations, partners and customers to
access their collective skills, experience
and results in a secure, reliable and
scalable manner.
2 Business Drivers
At present the stakeholders make only
limited efforts to collaborate, share data
and identify information that could be of
overall assistance. It is evident that, while
the different companies may have
different commercial or academic
objectives, the potential to share data,
information and available resources in a
“virtual Bioinformatics Laboratory” has
overwhelming economic advantages. This
is clearly demonstrated by Fusion, which
is endeavoring to find antibody targets
from important surface proteins generated
by genes and to work with others to seek
genetic disease markers for diagnostics.
Data generated collaboratively will have
considerably greater long-term value than
data from individual efforts. This has a
further multiplier effect when combined
with other associated academic efforts.
The project aims to build upon existing
genomic and proteomic programmes
including existing microarray and
sequencing technology and the immense
volumes of data generated through
screening services. At present such
technology is used to identify alterations
in gene expression between two samples
(normal versus disease tissue). These
altered expression patterns can then be
used as a molecular fingerprint.
Furthermore these molecular profiles can
be used to classify different tumor types
and ultimately to predict at a molecular
level which patients are likely to respond
to specific anticancer therapies.
However, the main drawback with
existing array technologies is that they are
generic, i.e. they represent a collection of
genes with no information regarding the
tissue types in which they normally
function. Since the human body is made
up of multiple tissue types, only a small
percentage of the genes on such a generic
array will actually be involved in a
specific disease type.
At present the individual companies do
not have any dedicated in-house
bioinformatics specialisms or HPC
capability. They generate large amounts
of data, but relating these to the global
environment is problematic. The low
speed of data transfer between parties,
the lack of high-performance computing
power and the lack of encompassing
security mechanisms across the disparate
administrative boundaries and
organizations block rapid advancement
in this important area of science and
research.
3 Grid Based Architecture
The Grid-based architecture presented
here is based on the Open Grid Services
Architecture (OGSA) model [1], derived
from the Open Grid Services
Infrastructure specification defined by
the OGSI Working Group within the GGF
[2].
The Open Grid Services Architecture
(OGSA) represents an evolution towards
a Grid architecture based on Web services
concepts and technologies. It describes
and defines a service-oriented
architecture (SOA) composed of a set of
interfaces and their corresponding
behaviors to facilitate distributed resource
sharing and access in heterogeneous
dynamic environments.
[Diagram labels: Service Provider, Service Directory and Service Requestor; operations PUBLISH, FIND and BIND; Transport Medium]
Figure 1
Figure 1 shows the individual
components of the service-oriented
architecture (SOA). The service directory
is the location where information about
all available grid services is
maintained. A service provider that wants
to offer services publishes them by
putting appropriate entries into the service
directory. A service requestor uses the
service directory to find an appropriate
service that matches its requirements.
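The three roles and operations described above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the GeneGrid implementation: the class names `ServiceEntry` and `ServiceDirectory`, the category string and the `alignment_stub` function are all hypothetical, and a real deployment would use remote grid services rather than local callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ServiceEntry:
    """One directory entry; all field names here are hypothetical."""
    provider: str
    category: str                      # what kind of service this is
    invoke: Callable[[str], str]       # stands in for a remote endpoint


@dataclass
class ServiceDirectory:
    entries: List[ServiceEntry] = field(default_factory=list)

    def publish(self, entry: ServiceEntry) -> None:
        """PUBLISH: a provider registers a service."""
        self.entries.append(entry)

    def find(self, category: str) -> List[ServiceEntry]:
        """FIND: a requestor looks up services by category."""
        return [e for e in self.entries if e.category == category]


def alignment_stub(sequence: str) -> str:
    """Placeholder for a real alignment service."""
    return f"aligned:{sequence}"


directory = ServiceDirectory()
directory.publish(ServiceEntry("Provider A", "sequence-alignment", alignment_stub))

matches = directory.find("sequence-alignment")
service = matches[0].invoke            # BIND: obtain a callable for the provider
result = service("ACGT")
print(result)
```

The point of the sketch is the separation of concerns: the requestor never refers to a provider directly, only to what the directory returns.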
An example of such a requirement is the
maximum time a service requestor is
willing to accept for a protein sequence
alignment service or the need to retrieve
specific gene information from a
biological database query service. The
service directory will thus include not
only taxonomies that facilitate the search,
but also information such as maximum
calculation time, QoS details or the cost
associated with a service. When a service
requestor locates a suitable service, it
binds to the service provider, using
binding information maintained in the
service directory. The binding
information contains the specification of
the protocol that the service requestor
must use as well as the structure of the
request messages and the resulting
responses. The communication between
the various agents occurs via an
appropriate transport mechanism [3][4].
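As an illustration of a directory that records quality-of-service details alongside binding information, the following Python sketch selects a service by taxonomy and maximum calculation time. The record layout (`DirectoryEntry`, `Binding`), the sample providers and the rule of preferring the cheapest qualifying service are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    """Binding information a requestor needs (hypothetical fields)."""
    transport: str          # e.g. "HTTP" or "SMTP"
    request_format: str     # structure of the request messages
    response_format: str    # structure of the resulting responses


@dataclass(frozen=True)
class DirectoryEntry:
    provider: str
    taxonomy: str           # category used to narrow the search
    max_seconds: float      # advertised maximum calculation time
    cost: float             # advertised cost per request
    binding: Binding


def select_service(entries: List[DirectoryEntry], taxonomy: str,
                   time_limit: float) -> Optional[DirectoryEntry]:
    """Return the cheapest entry in the category that meets the time limit."""
    candidates = [e for e in entries
                  if e.taxonomy == taxonomy and e.max_seconds <= time_limit]
    return min(candidates, key=lambda e: e.cost, default=None)


XML = "XML"
CATEGORY = "protein-sequence-alignment"
entries = [
    DirectoryEntry("Provider A", CATEGORY, 120.0, 1.0, Binding("SMTP", XML, XML)),
    DirectoryEntry("Provider B", CATEGORY, 20.0, 5.0, Binding("HTTP", XML, XML)),
    DirectoryEntry("Provider C", CATEGORY, 25.0, 3.0, Binding("HTTP", XML, XML)),
]

# The requestor accepts at most 30 seconds for an alignment.
chosen = select_service(entries, CATEGORY, time_limit=30.0)
```

Here Provider A is excluded by the time limit, and of the two remaining candidates the cheaper one is returned together with the binding the requestor must use.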
This architecture is based on a view of
service collaboration that is independent
of specific programming languages or
operating systems. Instead, it relies on
already-existing transport technologies
(such as HTTP or SMTP) and
industry-standard data encoding
techniques (such as XML).
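A small sketch shows why an XML encoding keeps the request independent of programming language and operating system: the message is plain text that any platform can produce and parse. The element and attribute names (`request`, `param`, `service`, `name`) are invented for this example and are not taken from any GeneGrid or OGSA schema; the code uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def encode_request(service: str, params: Dict[str, object]) -> bytes:
    """Serialise a service request as a UTF-8 encoded XML document."""
    root = ET.Element("request", {"service": service})
    for name, value in params.items():
        child = ET.SubElement(root, "param", {"name": name})
        child.text = str(value)
    return ET.tostring(root, encoding="utf-8")


def decode_request(payload: bytes) -> Tuple[str, Dict[str, str]]:
    """Parse the XML document back into a service name and parameters."""
    root = ET.fromstring(payload)
    params = {p.get("name"): p.text for p in root.findall("param")}
    return root.get("service"), params


payload = encode_request("sequence-alignment",
                         {"sequence": "ACGT", "max_seconds": 30})
service, params = decode_request(payload)
```

Note that parameter values come back as strings, so a receiving service would need an agreed schema to recover their types.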
[Diagram for Figure 2: Service Provider A (Biological Databases), Service Provider B (HPC Resource) and Service Provider C (Analysis/Visualisation Resource), each with a PUBLISH operation]
4 Virtual Bioinformatics Lab
This paper proposes a service-oriented
middleware framework targeted at the
domain of Bioinformatics. The proposed
architecture will allow automated,
wide-scale data mining of publications
and public genomic databases, with the
objective of establishing correlations
between gene sets. The public data will
be complemented by targeted sequencing
related to specific cancers. The data sets
will be quantified and then examined for
diagnostic and therapeutic potential.
In such an environment, the first step for
each service provider that wishes to offer
services is to publish them via
appropriate entries in the Service
Directory (see Figure 2). These entries
include those from service providers
offering services such as biological
databases, HPC resources, and analysis
and visualization resources.
Figure 2: Service providers publishing to the Service Directory
Next, the client queries the Service
Directory to find the services needed to
fulfil a GeneGrid service. These
may be found via a portal user interface
or dynamically from within a client
application. An example of such a request
would be “find me services that retrieve
all gene sequence data in format X
and take less than 30 seconds” (see
Figure 3).
[Diagram labels: Client A; “Search for gene sequence data in format X”; GSH; Service Directory, which contains information on Service Provider A, Service Provider B and Service Provider C]
Figure 3
Once the services are located, the client
binds to each service using the binding
information detailed in the service
directory. In the example above, this
may involve specifying the protocol that
the client must use to interact with the
database service and the transport
mechanism to be used, such as JMS or
SMTP. See Figure 4.
[Diagram labels: Client A; BIND; Grid Service Instance; Biological Database; HPC Resource; Analysis/Visualisation Resource]
Figure 4
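The bind step can be pictured as choosing a transport from the binding information and then sending requests through it. The Python sketch below is illustrative only: the dictionary keys `address` and `transport`, the address string and the stub senders are hypothetical, and the stubs merely label the payload instead of contacting a real HTTP, SMTP or JMS endpoint.

```python
from typing import Callable, Dict

# A sender takes (address, payload) and returns the provider's reply.
Sender = Callable[[str, bytes], bytes]


def _stub_sender(transport: str) -> Sender:
    """Build a fake sender that only records which transport was used."""
    def send(address: str, payload: bytes) -> bytes:
        return f"{transport}->{address}:".encode() + payload
    return send


# Transports named in the text; real senders would replace these stubs.
TRANSPORTS: Dict[str, Sender] = {
    name: _stub_sender(name) for name in ("HTTP", "SMTP", "JMS")
}


class BoundService:
    """A client-side handle tied to one provider and one transport."""

    def __init__(self, address: str, transport: str) -> None:
        if transport not in TRANSPORTS:
            raise ValueError(f"unsupported transport: {transport}")
        self.address = address
        self._send = TRANSPORTS[transport]

    def call(self, payload: bytes) -> bytes:
        return self._send(self.address, payload)


def bind(binding: Dict[str, str]) -> BoundService:
    """BIND: turn directory binding information into a usable handle."""
    return BoundService(binding["address"], binding["transport"])


database = bind({"address": "provider-a/biological-database",
                 "transport": "JMS"})
reply = database.call(b"<query/>")
```

Keeping the transport choice inside the handle means client code calls the database service in the same way regardless of which mechanism the directory specified.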
5 Summary
Grid computing technology presents an
architectural framework that aims to
provide access to heterogeneous
resources in a secure, reliable and
scalable manner across various
administrative boundaries. The domain of
Bioinformatics is an ideal candidate to
exploit the benefits of such a framework.
The development activities of the
GeneGrid project are due to start in
September 2003 and are expected to
deliver prototype grid services by Q2
2004. These prototype grid services will
initially be deployed and tested within
both local area network and wide area
network environments. The project
expects to gain valuable results from such
prototype services.
In addition to increased access to, and
reduced cost of, HPC resources, another
expected benefit of the GeneGrid
architecture is the creation of an
extensible integration fabric. Integration
of such remote, heterogeneous resources
is a major bottleneck in any enterprise
and the realm of major Enterprise
Application Integration (EAI) activities.
Here we have presented a Grid-based
framework that could provide the basis
for a reference integration architecture for
the stakeholders involved.
However, before widespread adoption
happens within this sector, a number of
fundamental areas will need to be
addressed:
Advanced Service Discovery: A typical
bioinformatics experiment involves a
complex sequence of human and
automated operations. In a grid services
environment, describing, managing and
discovering such knowledge-rich
resources will require middleware
extensions or additions to ensure that
applications can adequately define their
requirements for service discovery.
Security: Security, as in various other
knowledge-based domains, will be a
primary concern and requirement. As
Grid technology looks to share resources
both within and across organisations, the
security and integrity of information are
not only important but critical to gene and
patient confidentiality.
Standards: The areas of bioinformatics
and Grid computing are already heavily
loaded with a mixture of competing
formats and standards. The addition of a
further set of domain-specific proprietary
grid computing standards will not assist
the adoption and uptake of such
technology. The project aims to leverage
emerging Grid computing standards and
tools, e.g. the Globus Toolkit and OGSA,
and to be involved in GGF standards
activity, including the newly formed
Life Sciences Grid (LSG) Research
Group within the GGF.
References
[1] OGSA, http://www.globus.org/ogsa/
[2] OGSI, http://www.gridforum.org/ogsi-wg/
[3] S. Burbeck, “The Tao of e-Business Services,” IBM Corporation (2000); see
http://www4.ibm.com/software/developer/library/wstao/index.html.