Resource Monitoring and Service Discovery in GeneGrid

Sachin Wasnik, Mark Prentice, Noel Kelly, P.V. Jithesh, Paul Donachy, Terence Harmer, Ron Perrott
Belfast e-Science Centre, Queen’s University of Belfast
{s.wasnik, mprentice, n.kelly, p.jithesh, p.donachy, t.harmer, r.perrott}@qub.ac.uk
Mark McCurley, Michael Townsley, Jim Johnston
Fusion Antibodies Ltd
{mark.mccurley, michael.townsley, jim.johnston}@fusionantibodies.com
Shane McKee
Amtec Medical Ltd
[email protected]
Abstract
GeneGrid is a Grid-based bioinformatics project run by the Belfast e-Science Centre (BeSC) under the UK e-Science
programme and supported by the UK Department of Trade & Industry (DTI). It is a combined effort with commercial
partners Fusion Antibodies and Amtec Medical. GeneGrid provides seamless integration of a myriad of heterogeneous
applications and data sets that span multiple administrative domains and locations across the globe, and presents
them to scientists through a simple, user-friendly interface.
GeneGrid enables large-scale sharing of resources within a formal consortium of institutes and commercial partners
(a virtual organisation). In such an infrastructure, service providers need a publication mechanism so that they can
provide information about the grid services and computational resources they offer and make changes as their
services evolve. This paper describes the design, development, implementation and performance of one such
mechanism, the GeneGrid Resource Manager (GRM), built upon the Belfast e-Science Grid Manager project.
1. Introduction
With the advent of cost-effective computing power and
network bandwidth, large amounts of data are available to
the scientific community regardless of location, and
powerful analytical programs have been developed to
exploit them. As a result, the computer science
community has increasingly been involved in research
that aims at harnessing these data and programs with
distributed systems technology to ease their exploitation,
leading to a new discipline termed e-Science [2].
GeneGrid is one of the UK e-Science industrial projects,
with the involvement of companies interested in antibody
and drug development. It aims to provide a platform for
scientists to access collective skills, experiences and
results in a secure, reliable and scalable manner through
the creation of a ‘Virtual Bioinformatics Laboratory’ [3].
GeneGrid [3] is a novel and pragmatic solution to address
the above problems. It accomplishes the seamless
integration of a myriad of heterogeneous resources that
span multiple administrative domains and locations. It
provides the scientist with an integrated environment for
collaborative discovery and development via streamlined
access to a suite of bioinformatics and other accessory
applications through a user-friendly interface. GeneGrid is
built upon state-of-the-art technology in distributed
computing, namely grid computing, which coordinates
resource sharing and problem solving in dynamic
multi-institutional virtual organisations [1]. In this setting,
users typically have little or no knowledge of the
resources contributed by participants in the “virtual
organisation” (VO), which poses a significant obstacle to
their use. For this reason, monitoring the existence
and characteristics of resources, services, computations
and other entities is a vital part of GeneGrid.
Resources in GeneGrid span various
administrative domains such as Queen’s University
Belfast, the San Diego Supercomputer Centre and the
University of Melbourne, Australia. At the heart of
GeneGrid is the ability to discover, allocate and negotiate
the use of network-accessible capabilities, be they
computational services offered by a computer, application
services offered by a piece of software, bandwidth
delivered on a network, or storage space provided by a
storage system. We use the term resource management to
describe all aspects of the process: locating a capability,
arranging for its use, utilising it and monitoring its state.
This paper presents the architecture, functionality and
performance evaluation of the GeneGrid Resource Manager
(GRM), one of the components of GeneGrid. The rest of
the paper is organised as follows. Section 2 gives an
overview of the GeneGrid architecture, the functionality of
each component and their interaction with the other
components. Section 3 focuses on the GeneGrid Resource
Manager architecture, implementation and performance
evaluation. Section 4 presents conclusions and a roadmap
for future work.
2. GeneGrid Component Architecture
GeneGrid consists of five cooperating components, each of
which independently addresses a subset of the main
requirements of the project. Together they provide
scientists with an integrated environment for streamlined
access to a number of bioinformatics applications and
databases through a simple interface. These five components,
namely the GeneGrid Data Manager, the GeneGrid
Application Manager, the GeneGrid Workflow Manager,
the GeneGrid Resource Manager and the
GeneGrid Portal, are discussed individually in more
detail below.
We also describe how these five
components are integrated to form the GeneGrid
Environment.
2.1 GeneGrid Application Manager (GAM)
Access to the bioinformatics applications available on
various resources is provided by the GeneGrid
Application Manager (GAM) [3, 4]. GAM achieves this
integration through two types of OGSA-based grid
services: the GeneGrid Application Manager Service Factory
(GAMSF) and the GeneGrid Application Manager Service (GAMS).
GAMSF is a persistent service which extends the
standard interfaces, or Port Types, such as GridServiceFactory
of the Open Grid Services Infrastructure (OGSI) [5], to
integrate one or more bioinformatics applications into the
grid and expose them to the rest of GeneGrid. The
primary function of GAMSF is to create transient
service instances called GeneGrid Application Manager
Services (GAMS), which allow clients to interface with
the applications.
Any client wishing to execute a supported application
will first connect to the GAMSF and create an instance of
the GAMS. This newly created GAMS then exposes to
the client the operations which allow the client to execute
the supported application, as an extension to the
operations provided by the OGSA Grid Service interface.
Each GAMS is created by a client with the intention of
executing a given application, and after completion of this
task the GAMS is destroyed. Currently GeneGrid
integrates a number of bioinformatics applications
including BLAST [7], TMHMM [11], SignalP [10],
BL2SEQ [7], GeneWise [13], ClustalW [16] and
HMMER [17]. In addition, GAM also integrates a number
of custom programs developed to link the tasks in a
workflow. Figure 1 gives an overview of the components
that provide the GAM functionality.
A client in the BeSC domain executing a BLAST job on a
GeneGrid resource in the San Diego Supercomputer Centre
(SDSC). The client initially connects to the GAMSF which
integrates applications upon that resource, creating a
GAMS for accessing BLAST.
Figure 1. GAM Overview
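The factory and transient-instance pattern described in this section can be illustrated with a minimal Python sketch. It is not the GeneGrid implementation (which is built on OGSI grid services); the class names, method names and handle format below are hypothetical and chosen only to mirror the roles of GAMSF and GAMS.

```python
import itertools


class ApplicationService:
    """Transient service bound to one application (plays the role of a GAMS)."""

    def __init__(self, handle, application):
        self.handle = handle
        self.application = application
        self.destroyed = False

    def execute(self, task):
        if self.destroyed:
            raise RuntimeError("service instance has been destroyed")
        # A real service would launch the application; this sketch only echoes the request.
        return f"{self.application} executed: {task}"

    def destroy(self):
        self.destroyed = True


class ApplicationServiceFactory:
    """Persistent factory (plays the role of a GAMSF) for a set of applications."""

    def __init__(self, applications):
        self.applications = set(applications)
        self._ids = itertools.count(1)
        self.live = {}  # handle -> transient service instance

    def create_service(self, application):
        if application not in self.applications:
            raise ValueError(f"unsupported application: {application}")
        handle = f"gams-{next(self._ids)}"  # hypothetical handle format
        service = ApplicationService(handle, application)
        self.live[handle] = service
        return service

    def release(self, service):
        # The transient instance is destroyed once its task has completed.
        service.destroy()
        self.live.pop(service.handle, None)


factory = ApplicationServiceFactory(["BLAST", "TMHMM"])
gams = factory.create_service("BLAST")
result = gams.execute("query.fasta")
factory.release(gams)
```

The factory outlives every instance it creates, while each instance exists only for the single task it was created for, which is the lifetime relationship the text describes.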
2.2 GeneGrid Data Manager (GDM)
The GeneGrid Data Manager (GDM) is responsible for
the integration of, and access to, a number of disparate and
heterogeneous biological datasets, as well as for providing
a data warehousing facility within GeneGrid for
experiment data such as results [8]. The data integrated
by the GDM falls into two categories.
1) Biological data, consisting of datasets available in the
public domain, e.g. SwissProt [9], EMBL [14] and
Ensembl [12], and proprietary biological data
private to the companies.
2) GeneGrid data, consisting of data either required by
or created by GeneGrid, such as workflow definitions or
results information.
GDM uses OGSA-DAI (www.ogsadai.org) as the
basis of its framework, enhancing and adapting it as
required, for example to provide access to flat file databases.
GDM consists of two types of services, replicating those
found in OGSA-DAI. The GeneGrid Data Manager Service
Factory (GDMSF) is a persistent service configured to
support a single data set. The main role of GDMSF is to
create, upon request by a client, transient GeneGrid Data
Manager Services (GDMS) which facilitate interaction
between a client and the data set, as shown in figure 2.
A client at Fusion executing a SwissProt query on a
GeneGrid resource in the BeSC. The client initially
connects to the GDMSF which integrates SwissProt upon
that resource, creating a GDMS for executing the query.
Figure 2. GDM Overview
2.3 GeneGrid Workflow Manager (GWM)
GeneGrid Workflow Manager (GWM) is the component
of the system responsible for the processing of all
submitted experiments, or workflows, within GeneGrid
(Figure 3). As in the case of GAM, there are two types of
services in the GWM. The first, the GeneGrid Workflow
Manager Service Factory (GWMSF) is a persistent
OGSA-based grid service. The main role of the GWMSF
is to create GeneGrid Workflow Manager Services
(GWMS), which will process and execute a submitted
workflow across the resources available. Each GWMS is
a transient grid service which is active for the lifetime of
the workflow it is created to manage. The main roles of
this service are to select the appropriate resources on
which to run elements of the workflow, as well as to
update the GeneGrid Status Tracking and Result & Input
Parameters (GSTRIP) Database with all status changes.
GWMS gets information on resources, databases, GDM
services and GAM services through the GeneGrid
Application & Resources Registry (GARR).
Submission and execution of a workflow containing a
BLAST task. The client will first connect to the GWMSF to
create a GWMS instance, before forwarding the workflow
XML to the newly created GWMS. The GWMS identifies the
BLAST task within the workflow, and queries the GARR for
the location of all suitable GAMSF. The GWMS will then
submit the task XML to the most appropriate GAM for
execution. The GWMS will also find the location of the
GDMSF serving the GSTRIP database from the GARR in
order to submit updates as to the status of this BLAST task.
Figure 3. Management of Workflow in GeneGrid
2.4 GeneGrid Resource Manager
The GeneGrid Resource Manager (GRM) is responsible for resource
monitoring and service discovery. It consists of the GeneGrid
Application and Resource Registry (GARR) service and
lightweight adapters present on each Node, called
GeneGrid Node Monitors (GNM).
GARR is the central service in GeneGrid that mediates
service discovery by publishing information about the various
services available in GeneGrid. All available nodes
register with the GARR service and, through the GNM,
update the GARR with the status of their resources, such
as load average and available memory.
GARR provides an interface to query the current state of the
resources registered with it. GARR can be queried on
various types of parameters, such as the type or
the name of a resource. Other components of GeneGrid,
such as GWMS or GAMS, can query the GARR service to
identify the resources whose characteristics and state
match those desired for execution of a task. A more
detailed description of the GRM component is given in
section 3.
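The register, update and query cycle described above can be sketched as a small in-memory registry. This is a minimal sketch under stated assumptions: the class `ResourceRegistry`, its method names, the field names and the example handles are all hypothetical and are not the GARR interface.

```python
class ResourceRegistry:
    """In-memory stand-in for a registry such as the GARR (hypothetical API)."""

    def __init__(self):
        self._entries = {}  # keyed by (node, resource name)

    def register(self, node, name, resource_type, handle):
        self._entries[(node, name)] = {
            "node": node,
            "name": name,
            "type": resource_type,
            "handle": handle,
            "status": {},
        }

    def update_status(self, node, **status):
        # A node monitor reports node-level status; apply it to every resource on that node.
        for (entry_node, _), entry in self._entries.items():
            if entry_node == node:
                entry["status"].update(status)

    def query(self, name=None, resource_type=None):
        # Filters that are left as None are ignored.
        return [
            entry
            for entry in self._entries.values()
            if (name is None or entry["name"] == name)
            and (resource_type is None or entry["type"] == resource_type)
        ]


registry = ResourceRegistry()
registry.register("node-a", "BLAST", "application", "http://node-a.example.org/GAMSF")
registry.register("node-b", "SwissProt", "database", "http://node-b.example.org/GDMSF")
registry.update_status("node-a", load_average=0.4, free_memory_mb=2048)

applications = registry.query(resource_type="application")
swissprot = registry.query(name="SwissProt")
```

A consumer such as a workflow manager would then inspect the `status` of each returned entry to decide where to run a task.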
2.5 GeneGrid Portal
The GeneGrid Portal provides a secure central access
point to GeneGrid for all users and is based upon the
GridSphere product [6]. It also serves to conceal the
complexity of interacting with many different Grid
resource types and applications from the end users,
providing a user-friendly interface similar to
those with which our user community is already familiar.
This drastically reduces the learning curve that
scientists face in exploiting grid technology. Figure 4
shows the interaction of the Portal with the other components.
By allowing users to access a GeneGrid Environment (GE,
described in section 2.6), we create a Virtual
Organisation [1] (VO), and hence each GE may be
considered as a single installation of GeneGrid.
The Portal is a secure, centrally hosted single access point to
the project. Users may access the Portal from any internet-ready
computer. In order to access the above services, the
Portal is configured with the location of the GARR, from
which it finds all the other services.
Figure 4. GeneGrid Portal in relation to other
components of GeneGrid.
2.6 GeneGrid Environment
The GeneGrid Environment (GE) is the collective name
for the core distributed elements of the GeneGrid project,
which allow the creation, processing and tracking of
workflows. The GE contains at least one
GeneGrid Portal, at least one deployment of both the
GARR and the GWMSF, an implementation of each of
the GWDD and the GSTRIP databases, as well as at least
one GDMSF configured for each of these databases
(Figure 5). All instances of any factory services
mentioned above may also be considered elements of the
GE.
GNM on all GE nodes registering with the GARR.
Figure 5. GeneGrid Environment.
The GWMSF and both GDMSFs in the GE register their
existence with the GARR via GNM deployed on their
hosting nodes. The GWMSF and the GeneGrid Portal are
both configured with the location of the GARR service in
order for them to discover all available GeneGrid
services. Upon start up, the Portal will connect to the
GARR to discover the location of the GDMSF for both
the GWDD and the GSTRIP databases. The Portal
processes the Master Workflow Definition Document
from the GWDD, allowing authenticated users to create,
submit and track workflows.
3. GeneGrid Resource Manager
Section 2 presented each GeneGrid component
individually. In this section we focus on the GeneGrid
Resource Manager (GRM). As already stated, GRM
consists of two major components: the GeneGrid
Node Monitors (GNM) and the GeneGrid Application and
Resource Registry (GARR) service. We also describe the
Grid Manager Portal, developed as part of the Grid Manager
project, which will become part of the GeneGrid Portal in
the next release, and show how GRM enables GeneGrid to
share resources across different domains.
3.1 GeneGrid Node Monitors
All computational Nodes available to GeneGrid
run a lightweight agent which collects resource
information relating to the node on which it is
deployed. These agents, called GeneGrid
Node Monitors (GNM), transmit the information they
have collected regarding the computational resources
and GeneGrid services available on that node. The
GNM is a lightweight C program which
communicates with the GARR service by sending XML
messages, using the gSOAP library. In the latest
release of GeneGrid, GNMs for various platforms such
as Solaris and Linux are available to support the
heterogeneous environment required, as shown in figure 6.
GNM on multiple resources across administrative domains
registering resource information securely to a GARR.
Resource Information includes details of the applications
and databases supported as well as the location of the
Figure 6. GNM sending information to GARR
service.
3.2 GeneGrid Application and Resource Registry
The GeneGrid Application and Resource Registry
(GARR) service captures the information sent by
the GNM about the grid services deployed on the node,
along with the information it forwards about the Node
itself. The information sent by GNM can be broadly
classified as System Information and Resource Data.
The System Information consists of
1) Hardware address
2) System time
3) IP address
4) CPU speed
5) CPU load
6) Total memory
7) Free memory
8) Operating system name
9) Operating system version
10) Uptime
11) Hostname
12) System architecture
13) Number of processors
14) Load average for the last 1 minute, 5 minutes and 15 minutes
15) Custom data
The Resource Data consists of the name, type and Grid
Service Handle (GSH) of the resources available on the
Node, which can be any application or database that the
Node's services facilitate access to. The GARR service
stores this information in a MySQL database to provide
persistence of resource information. This information can
also be viewed through the Grid Manager Portal as shown
in figure 7.
Figure 7. Grid Manager Portal
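To make the two categories of information concrete, the sketch below collects a few System Information fields and combines them with Resource Data in a single XML message. It is only an illustration in Python using the standard library: the actual GNM is a C program using gSOAP, and the element names, attribute names and example GSH used here are assumptions, not the GeneGrid message format. Only a subset of the fifteen System Information fields is gathered.

```python
import os
import platform
import socket
import time
import xml.etree.ElementTree as ET


def collect_system_information():
    """Gather a subset of the System Information fields from the local node."""
    info = {
        "hostname": socket.gethostname(),
        "system_time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "os_name": platform.system(),
        "os_version": platform.release(),
        "architecture": platform.machine(),
        "processors": str(os.cpu_count() or 1),
    }
    # Load averages are only available on Unix-like platforms.
    if hasattr(os, "getloadavg"):
        try:
            one, five, fifteen = os.getloadavg()
            info["load_average"] = f"{one:.2f} {five:.2f} {fifteen:.2f}"
        except OSError:
            pass
    return info


def build_message(system_info, resources):
    """Serialise System Information and Resource Data (hypothetical schema)."""
    root = ET.Element("nodeReport")
    system = ET.SubElement(root, "systemInformation")
    for key, value in system_info.items():
        ET.SubElement(system, key).text = value
    data = ET.SubElement(root, "resourceData")
    for res in resources:
        ET.SubElement(data, "resource", name=res["name"], type=res["type"], gsh=res["gsh"])
    return ET.tostring(root, encoding="unicode")


message = build_message(
    collect_system_information(),
    [{"name": "BLAST", "type": "application",
      "gsh": "http://node.example.org/ogsa/services/GAMSF"}],  # placeholder GSH
)
```

A registry receiving such a message could store the System Information fields against the node and one row per `resource` element.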
The GARR service is a persistent, OGSA-compliant grid
service based on GT3 (Globus Toolkit 3), which provides an
interface to query the GARR and retrieve information such as
a resource’s name, cpu_load etc.
Clients such as the GeneGrid Workflow Manager (GWM),
which is responsible for processing all experiments,
contact the GARR whenever they wish to discover the
location of a specific GeneGrid service. A client can
also query the GARR to find all services supporting, say,
the BLAST application, in which case the GARR will
return all available GeneGrid Application Manager
Service Factories (GAMSF) that support BLAST, as well as
any resource information associated with the hosting
nodes.
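The BLAST lookup described above, followed by a choice among the returned factories, might look like the following sketch. The record layout and function names are hypothetical, and the selection rule (lowest reported cpu_load) is an assumption made for illustration; the text does not specify how the most appropriate factory is chosen.

```python
def find_factories(entries, application):
    """Return all GAMSF entries that support the given application."""
    return [
        entry
        for entry in entries
        if entry["type"] == "GAMSF" and application in entry["applications"]
    ]


def choose_factory(candidates):
    """Pick the candidate with the lowest cpu_load (assumed criterion), or None."""
    if not candidates:
        return None
    return min(candidates, key=lambda entry: entry["cpu_load"])


# Hypothetical registry contents: two application factories and one data factory.
entries = [
    {"type": "GAMSF", "handle": "gamsf-besc", "applications": {"BLAST", "TMHMM"}, "cpu_load": 0.9},
    {"type": "GAMSF", "handle": "gamsf-sdsc", "applications": {"BLAST"}, "cpu_load": 0.2},
    {"type": "GDMSF", "handle": "gdmsf-besc", "applications": set(), "cpu_load": 0.1},
]

blast_factories = find_factories(entries, "BLAST")
chosen = choose_factory(blast_factories)
```

Returning the full candidate list together with node information, as the GARR does, leaves the selection policy to the client, so a different policy can replace `choose_factory` without changing the registry.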
3.3 GeneGrid Shared Resources
Bioinformatics applications and datasets are exposed to
the GeneGrid Environment by GAMSF and GDMSF
respectively. These GAM and GDM services make up the
GeneGrid Shared Resources. Each GAMSF and GDMSF
advertises its existence and capabilities to a GE via the GNM
on its hosting node, which registers with the GARR.
A GNM can register with many GARR
services across multiple GEs, allowing resources to be
shared between multiple organisations. Organisations
therefore have complete control over which resources,
if any, they wish to share with other GeneGrid
organisations, forming dynamic virtual organisations as
shown in figure 8.
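The sharing model can be sketched as a node monitor that announces its resource to a configurable list of registries, one per virtual organisation. The classes and names below are hypothetical and only mirror the roles of GNM and GARR; the example uses two resources and two VOs.

```python
class Registry:
    """Stand-in for the GARR of one virtual organisation (hypothetical)."""

    def __init__(self, vo_name):
        self.vo_name = vo_name
        self.resources = set()

    def register(self, resource):
        self.resources.add(resource)


class NodeMonitor:
    """Stand-in for a GNM configured with the registries it should report to."""

    def __init__(self, resource, registries):
        self.resource = resource
        self.registries = list(registries)

    def announce(self):
        # The resource owner decides which VOs see the resource by choosing the registries.
        for registry in self.registries:
            registry.register(self.resource)


vo1 = Registry("VO 1")
vo2 = Registry("VO 2")

NodeMonitor("Resource A", [vo1, vo2]).announce()  # shared with both organisations
NodeMonitor("Resource B", [vo1]).announce()       # shared with VO 1 only
```

Because visibility is decided on the resource side, an organisation can withdraw a resource from a VO by removing that VO's registry from the monitor's configuration.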
3.4 Performance Evaluation
GeneGrid depends upon GRM when deciding which of the
various distributed, heterogeneous resources to allocate
to a particular task. Hence it is essential
to study the behaviour of the GRM under different
circumstances in order to understand any performance
limitations. We carried out two sets of experiments on
GRM under different conditions to study its performance
and scalability with respect to the number of clients and
the number of resources.
In the first set of experiments, we used a testbed of one
GARR service deployed on machine blade3a and five
clients running on machines c00, c01, c05, c06 and c07
respectively, all in the same network domain. Fifty different
resources were registered with the GARR. We conducted a
series of experiments starting with only one client and
adding one client each time, up to five clients accessing
the GARR service concurrently. For each experiment we
recorded the response time, i.e. the time taken by a client
to fetch the data from the GARR service. The results of the
experiments are depicted in figure 9.
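The measurement procedure (one to five clients querying a registry holding fifty entries at the same time, with the response time recorded per client) can be sketched as a small harness. This is an assumption-laden illustration: `query_registry` is a local stand-in for the remote GARR call, the clients are threads in one process rather than separate machines, and the timings it produces say nothing about the results reported in this paper.

```python
import statistics
import threading
import time


def query_registry(entries):
    """Local stand-in for fetching all entries from a remote registry."""
    return [dict(entry) for entry in entries]


def timed_clients(n_clients, entries):
    """Run n_clients concurrent queries and return each response time in ms."""
    timings = [None] * n_clients
    barrier = threading.Barrier(n_clients)  # release all clients together

    def client(index):
        barrier.wait()
        start = time.perf_counter()
        query_registry(entries)
        timings[index] = (time.perf_counter() - start) * 1000.0

    threads = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return timings


# Fifty registered resources, then 1 to 5 concurrent clients, as in the first experiment.
entries = [{"name": f"resource-{i}", "cpu_load": 0.0} for i in range(50)]
results = {n: timed_clients(n, entries) for n in range(1, 6)}
mean_ms = {n: statistics.mean(times) for n, times in results.items()}
```

Replacing `query_registry` with a real client call and running each client on its own machine would bring the harness closer to the testbed described above.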
Resource A registers with both VO 1 and VO 2.
Resource B registers with VO 1 only.
Therefore, users of VO 1 may access both resources,
while users of VO 2 may access only Resource A.
Figure 8. Shared Resources in GeneGrid.
Response time (GARR with 50 entries): time in milliseconds
plotted against the number of concurrent clients (1 to 5),
with one series per client machine.
Figure 9. First set of Experiments
As shown in the graph, we found very little variation in
the response time as we increased the number of clients
accessing the GARR concurrently. We therefore ran another
set of experiments using the same five clients, running
on machines c00, c01, c05, c06 and c07 respectively, but
accessing a GARR service deployed on machine
bescds01 in a different network domain. This reflects the
case where GeneGrid users want to share
resources between two organisations in
different network domains. We also increased the
number of resources registered with the GARR service to
100. The results obtained in this experiment are depicted
in figure 10.
Response time (GARR with 100 entries): time in milliseconds
plotted against the number of concurrent clients (1 to 5),
with one series per client machine.
Figure 10. Second set of Experiments
We observed a delay when 4 clients accessed the GARR
concurrently under these circumstances, but the response
times for 5 clients accessing the GARR concurrently were
again broadly similar. One possible reason for the delay
when 4 clients accessed the remote GARR concurrently is
network latency.
Our analysis of both sets of experiments suggests that GRM is
scalable with respect to the number of clients accessing
the GARR concurrently and the number of resources registered
with it. In future we would like to carry out more exhaustive
performance evaluation and scalability testing using
tools such as DiPerF [18].
4. Conclusions and Future Work
We have described the GeneGrid Resource Manager, which
is still under development. It is currently used by GeneGrid
and is deployed in a number of different organisations as
part of the GeneGrid 0.5 release.
We have also described an analysis of its performance.
This quantitative study of GRM can aid in understanding
its performance limitations, inform the development of the
monitoring system and help evaluate future development
work.
We are currently working to incorporate a number of new
features in GRM:
1) GRM is intended to expand its functionality by
predicting the performance of GeneGrid
resources based on the data collected in the GARR
database. The predictions generated in this
manner will help in effective application
scheduling and also in fault detection.
2) In future GRM will capture network
information in order to utilise resources
effectively when large file transfers requiring high
bandwidth are to be performed.
3) We also plan to add metadata
about the services registered with GRM so that
services can be used in an automated manner.
4) We are planning to add more interfaces to the
GARR service which will return the most suitable
resources for particular tasks based on
various parameters such as cpu_load and network
bandwidth.
References
[1] I. Foster, C. Kesselman, S. Tuecke, “The Anatomy of
the Grid: Enabling Scalable Virtual Organisations”,
International J. Supercomputer Applications (2003),
15(3)
[2] F. Berman and T. Hey. “The Scientific Imperative”
in I. Foster and C. Kesselman, editors, The Grid:
Blueprint of a New Computing Infrastructure,
Morgan Kaufmann(2004), 13-24
[3] P. Donachy, T.J. Harmer, R.H. Perrott et al, “Grid
Based Virtual Bioinformatics Laboratory”,
Proceedings of the UK e-Science All Hands Meeting
(2003), 111-116
[4] P.V. Jithesh, N. Kelly, D.R. Simpson, et al
“Bioinformatics Application Integration and
Management in GeneGrid: Experiments and
Experiences”, Proceedings of UK e-Science All
Hands Meeting (2004), 563-570
[5] S. Tuecke, K. Czajkowski, I. Foster et al., Open Grid
Services Infrastructure (OGSI) Version 1.0. Global
Grid Forum Draft Recommendation, (6/27/2003).
[6] J. Novotny, M. Russell, O. Wehrens, “GridSphere:
An Advanced Portal Framework”, Proceedings of
EuroMicro Conference (2004), 412-419
[7] S.F. Altschul, et al, "Gapped BLAST and PSI-BLAST:
a new generation of protein database search
programs," Nucleic Acids Res., vol. 25, pp. 3389-3402,
Sep 1. 1997.
[8] N.Kelly, P.V.Jithesh, D.R. Simpson et al,
"Bioinformatics Data and the Grid : The GeneGrid
Data Manager", Proceedings of UK e-Science All
Hands Meeting (2004), 571-578
[9] R. Apweiler, A. Bairoch, C.H. Wu et al, "UniProt:
the Universal Protein knowledgebase," Nucleic Acids
Res., vol. 32, pp. D115-9, Jan 1. 2004.
[10] J.D. Bendtsen, H. Nielsen, G. von Heijne and S.
Brunak, "Improved prediction of signal peptides:
SignalP 3.0," J.Mol.Biol., vol. 340, pp. 783-795, Jul
16. 2004.
[11] A. Krogh, et al. Predicting transmembrane protein
topology with a hidden Markov model: Application to
complete genomes. J.Mol.Biol, 305(3):567-580,
January 2001.
[12] E. Birney, D. Andrews, P. Bevan et al, "Ensembl
2004," Nucleic Acids Res., vol. 32, pp. D468-70, Jan
1. 2004.
[13] E. Birney, M. Clamp and R. Durbin, "GeneWise and
Genomewise," Genome Res., vol. 14, pp. 988-995,
May. 2004.
[14] C. Kanz, P. Aldebert, N. Althorpe et al, "The EMBL
Nucleotide Sequence Database," Nucleic Acids Res.,
vol. 33 Database Issue, pp. D29-33, Jan 1. 2005.
[15] R.D. Stevens, H.J. Tipney, C.J. Wroe, T.M. Oinn, M.
Senger, P.W. Lord, C.A. Goble, A. Brass and M.
Tassabehji, "Exploring Williams-Beuren syndrome
using myGrid," Bioinformatics, vol. 20 Suppl 1, pp.
I303-I310, Aug 4. 2004.
[16] J.D. Thompson, D.G. Higgins and T.J. Gibson,
"CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through
sequence weighting, position-specific gap penalties
and weight matrix choice," Nucleic Acids Res., vol.
22, pp. 4673-4680, Nov 11. 1994.
[17] S.R. Eddy, “Profile hidden Markov Models,”
Bioinformatics, 14, 755-763 (1998)
[18] Catalin Dumitrescu, Ioan Raicu, Matei Ripeanu and
Ian Foster “Diperf : an automated Distributed
PERformance testing Framework” Proceedings of the
Fifth IEEE/ACM International Workshop on Grid
Computing (GRID’04) (2004) 289-296.