Resource Monitoring and Service Discovery in GeneGrid

Sachin Wasnik, Mark Prentice, Noel Kelly, P.V. Jithesh, Paul Donachy, Terence Harmer, Ron Perrott
Belfast e-Science Centre, Queen's University of Belfast
{s.wasnik, mprentice, n.kelly, p.jithesh, p.donachy, t.harmer, r.perrott}@qub.ac.uk

Mark McCurley, Michael Townsley, Jim Johnston
Fusion Antibodies Ltd
{mark.mccurley, michael.townsley, jim.johnston}@fusionantibodies.com

Shane McKee
Amtec Medical Ltd
shanemckee@doctors.org.uk

Abstract

GeneGrid is a Grid-based bioinformatics project run by the Belfast e-Science Centre (BeSC) under the UK e-Science programme and supported by the UK Department of Trade & Industry (DTI). It is a combined effort with the commercial partners Fusion Antibodies and Amtec Medical. GeneGrid provides seamless integration of a myriad of heterogeneous applications and data sets that span multiple administrative domains and locations across the globe, and presents these to scientists through a simple, user-friendly interface. GeneGrid enables large-scale sharing of resources within a formal consortium of institutes and commercial partners (a virtual organisation). In such an infrastructure, service providers need a publication mechanism so that they can provide information about the grid services and computational resources they offer and make changes as their services evolve. This paper describes the design, development, implementation and performance of one such mechanism, the GeneGrid Resource Manager (GRM), built upon the Belfast e-Science Grid Manager project.

1. Introduction

With the advent of cost-effective computer power and network bandwidth, large amounts of data are available to the scientific community regardless of location, and powerful analytical programs have been developed to exploit them. As a result, the computer science community has increasingly been involved in research that aims at harnessing these data and programs with distributed systems technology to ease their exploitation, leading to a new discipline termed e-Science [2]. GeneGrid is one of the UK e-Science industrial projects, with the involvement of companies interested in antibody and drug development, and aims to provide a platform for scientists to access collective skills, experience and results in a secure, reliable and scalable manner through the creation of a 'Virtual Bioinformatics Laboratory' [3].

GeneGrid [3] is a novel and pragmatic solution to these challenges. It accomplishes the seamless integration of a myriad of heterogeneous resources that span multiple administrative domains and locations. It provides the scientist with an integrated environment for collaborative discovery and development via streamlined access to a suite of bioinformatics and other accessory applications through a user-friendly interface. GeneGrid is built upon state-of-the-art technology in distributed computing, namely grid computing, which coordinates resource sharing and problem solving in dynamic, multi-institutional virtual organisations [1]. Here, the fact that users typically have little or no knowledge of the resources contributed by participants in the "virtual organisation" (VO) poses a significant obstacle to their use. For this reason, monitoring of the existence and characteristics of resources, services, computations and other entities is a vital part of GeneGrid. Resources in GeneGrid span several administrative domains, including Queen's University Belfast, the San Diego Supercomputer Center and the University of Melbourne, Australia.
At the heart of GeneGrid is the ability to discover, allocate and negotiate the use of network-accessible capabilities, be they computational services offered by a computer, application services offered by a piece of software, bandwidth delivered on a network, or storage space provided by a storage system. We use the term resource management to describe all aspects of this process: locating a capability, arranging for its use, utilising it and monitoring its state. This paper presents the architecture, functionality and performance evaluation of the GeneGrid Resource Manager (GRM), one of the components of GeneGrid. The rest of the paper is organised as follows: an overview of the GeneGrid architecture and the functionality of each component, as well as their interactions, is given in Section 2. Section 3 focuses on the GeneGrid Resource Manager architecture, implementation and performance evaluation, followed by conclusions and a future roadmap in Section 4.

2. BeSC GeneGrid Component Architecture

GeneGrid consists of five cooperating components which independently address a subset of the main requirements of the project and, by cooperating, provide scientists with an integrated environment for streamlined access to a number of bioinformatics applications and databases through a simple interface. These five components, namely the GeneGrid Data Manager, the GeneGrid Application Manager, the GeneGrid Workflow Manager, the GeneGrid Resource Manager and the GeneGrid Portal, are discussed individually in more detail below. We also describe how these five components are integrated to form the GeneGrid Environment.

2.1 GeneGrid Application Manager (GAM)

Access to the bioinformatics applications available on various resources is provided by the GeneGrid Application Manager (GAM) [3, 4]. GAM achieves this integration through two types of OGSA-based grid services: the GeneGrid Application Manager Service Factory (GAMSF) and the GeneGrid Application Manager Service (GAMS). The GAMSF is a persistent service which extends the standard interfaces, or port types, such as GridServiceFactory of the Open Grid Services Infrastructure (OGSI) [5], to integrate one or more bioinformatics applications into the grid and expose them to the rest of GeneGrid. The primary function of the GAMSF is to create transient instances called GeneGrid Application Manager Services (GAMS), which allow clients to interface with the applications. Any client wishing to execute a supported application first connects to the GAMSF and creates a GAMS instance. The newly created GAMS then exposes to the client the operations which allow it to execute the supported application, as an extension to the operations provided by the OGSA Grid Service interface. Each GAMS is created by a client with the intention of executing a given application, and after completion of this task the GAMS is destroyed. Currently GeneGrid integrates a number of bioinformatics applications including BLAST [7], TMHMM [11], SignalP [10], BL2SEQ [7], GeneWise [13], ClustalW [16] and HMMER [17]. In addition, GAM also integrates a number of custom programs developed to link the tasks in a workflow. Figure 1 gives an overview of the components that provide the GAM functionality.

Figure 1. GAM Overview. A client in the BeSC domain executing a BLAST job on a GeneGrid resource in the San Diego Supercomputer Center (SDSC). The client initially connects to the GAMSF which integrates applications on that resource, creating a GAMS for accessing BLAST.
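To make the GAMSF/GAMS interaction pattern concrete, the following is a minimal sketch of how a client might drive a factory and the transient service it creates. It uses plain Java with hypothetical type and method names (GamsFactory, GamsInstance, executeApplication); the real GeneGrid services are OGSI/GT3 grid services, so this illustrates only the interaction pattern, not the actual API.

// Illustrative sketch only: all type and method names are hypothetical,
// not the actual GT3/OGSI API used by GeneGrid.
public final class GamClientExample {

    /** Persistent factory service (GAMSF): creates transient GAMS instances. */
    interface GamsFactory {
        GamsInstance createService();
    }

    /** Transient per-task service (GAMS): runs one supported application. */
    interface GamsInstance {
        String executeApplication(String applicationName, String inputXml);
        void destroy(); // a GAMS is discarded once its task completes
    }

    static String runBlast(GamsFactory blastFactory, String queryXml) {
        // 1. Ask the persistent GAMSF for a transient GAMS instance.
        GamsInstance gams = blastFactory.createService();
        try {
            // 2. Invoke the application-specific operation exposed by the GAMS.
            return gams.executeApplication("BLAST", queryXml);
        } finally {
            // 3. The GAMS exists only for the lifetime of this single task.
            gams.destroy();
        }
    }
}

The same factory/instance pattern recurs throughout GeneGrid: a persistent factory advertises a capability, while short-lived service instances carry out individual tasks.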
2.2 GeneGrid Data Manager (GDM)

The GeneGrid Data Manager (GDM) is responsible for the integration and access of a number of disparate and heterogeneous biological datasets, as well as for providing a data warehousing facility within GeneGrid for experiment data such as results [8]. The data integrated by the GDM falls into two categories: 1) biological data, consisting of datasets available in the public domain, e.g. SwissProt [9], EMBL [14] and Ensembl [12], together with proprietary biological data private to the companies; and 2) GeneGrid data, consisting of data either required or created by GeneGrid, such as workflow definitions and results information. The GDM uses OGSA-DAI (www.ogsadai.org) as the basis of its framework, enhancing and adapting it as required, for example to provide access to flat-file databases. The GDM consists of two types of services, replicating those found in OGSA-DAI. The GeneGrid Data Manager Service Factory (GDMSF) is a persistent service configured to support a single data set. The main role of the GDMSF is to create, upon request by a client, transient GeneGrid Data Manager Services (GDMS) which facilitate interaction between a client and the data set, as shown in Figure 2.

Figure 2. GDM Overview. A client at Fusion Antibodies executing a SwissProt query on a GeneGrid resource in the BeSC. The client initially connects to the GDMSF which integrates SwissProt on that resource, creating a GDMS for executing the query.

2.3 GeneGrid Workflow Manager (GWM)

The GeneGrid Workflow Manager (GWM) is the component of the system responsible for the processing of all submitted experiments, or workflows, within GeneGrid (Figure 3). As in the case of the GAM, there are two types of services in the GWM. The first, the GeneGrid Workflow Manager Service Factory (GWMSF), is a persistent OGSA-based grid service. The main role of the GWMSF is to create GeneGrid Workflow Manager Services (GWMS), which process and execute a submitted workflow across the available resources. Each GWMS is a transient grid service which is active for the lifetime of the workflow it is created to manage. The main roles of this service are to select the appropriate resources on which to run elements of the workflow, and to update the GeneGrid Status Tracking and Result & Input Parameters (GSTRIP) database with all status changes. The GWMS obtains information on resources, databases, GDM services and GAM services through the GeneGrid Application and Resource Registry (GARR).

Figure 3. Management of a workflow in GeneGrid. Submission and execution of a workflow containing a BLAST task. The client first connects to the GWMSF to create a GWMS instance, before forwarding the workflow XML to the newly created GWMS. The GWMS identifies the BLAST task within the workflow and queries the GARR for the locations of all suitable GAMSFs. The GWMS then submits the task XML to the most appropriate GAM for execution. The GWMS also obtains from the GARR the location of the GDMSF serving the GSTRIP database in order to submit updates on the status of this BLAST task.
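As a rough illustration of the task-routing step just described, the sketch below shows a GWMS-style client querying a registry for factories that support a given application and choosing the least-loaded candidate. The RegistryClient and ServiceRecord types and the least-loaded selection rule are assumptions standing in for the GARR query interface described in Section 3, not the actual GWMS logic.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch of GWMS-style resource selection; RegistryClient and
// ServiceRecord are hypothetical stand-ins for the GARR query interface.
public final class TaskRouter {

    /** One registry entry: a factory location plus resource state from its GNM. */
    public record ServiceRecord(String factoryHandle, String application, double cpuLoad) {}

    public interface RegistryClient {
        // Query the registry for all factories that support a given application.
        List<ServiceRecord> findFactories(String application);
    }

    /** Pick the least-loaded factory advertising the requested application. */
    public static Optional<ServiceRecord> selectTarget(RegistryClient garr, String application) {
        return garr.findFactories(application).stream()
                .min(Comparator.comparingDouble(ServiceRecord::cpuLoad));
    }
}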
2.4 GeneGrid Resource Manager

The GeneGrid Resource Manager is responsible for resource monitoring and service discovery. It consists of the GeneGrid Application and Resource Registry (GARR) service and a lightweight agent present on each node, called the GeneGrid Node Monitor (GNM). The GARR is the central service in GeneGrid that mediates service discovery by publishing information about the various services available in GeneGrid. All available nodes register with the GARR service and, through the GNM, update it with the status of their resources, such as load average and available memory. The GARR provides an interface to query the current state of the resources registered with it, and can be queried on various parameters such as the type or the name of a resource. Other components of GeneGrid, such as the GWMS or GAMS, can query the GARR service to identify a resource whose characteristics and state match those required for the execution of a task. A more detailed description of the GRM is given in Section 3.

2.5 GeneGrid Portal

The GeneGrid Portal provides a secure central access point to GeneGrid for all users and is based upon the GridSphere product [6]. It also serves to conceal from the end users the complexity of interacting with many different grid resource types and applications, providing a user-friendly interface similar to those with which our user community is already familiar. This results in a drastically reduced learning curve for the scientists wishing to exploit grid technology. Figure 4 shows the interaction of the Portal with the other components. By allowing users to access a GeneGrid Environment (GE, described in Section 2.6), we create a Virtual Organisation (VO) [1], and hence each GE may be considered a single installation of GeneGrid.

Figure 4. The GeneGrid Portal in relation to the other components of GeneGrid. GNMs on all GE nodes register with the GARR. The Portal is a secure, centrally hosted single access point to the project; users may access it from any internet-ready computer. In order to access the above services, the Portal is configured with the location of the GARR, from which it finds all the other services.

2.6 GeneGrid Environment

The GeneGrid Environment (GE) is the collective name for the core distributed elements of the GeneGrid project, which allow the creation, processing and tracking of workflows. Contained within the GE is at least one GeneGrid Portal, at least one deployment of both the GARR and the GWMSF, an implementation of each of the GWDD and GSTRIP databases, and at least one GDMSF configured for each of these databases (Figure 5). All instances of the factory services mentioned above may also be considered elements of the GE.

Figure 5. GeneGrid Environment.

The GWMSF and both GDMSFs in the GE register their existence with the GARR via the GNMs deployed on their hosting nodes. The GWMSF and the GeneGrid Portal are both configured with the location of the GARR service in order to discover all available GeneGrid services. Upon start-up, the Portal connects to the GARR to discover the location of the GDMSF for both the GWDD and the GSTRIP databases. The Portal processes the Master Workflow Definition Document from the GWDD, allowing authenticated users to create, submit and track workflows.
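The bootstrap sequence just described can be summarised in a short sketch: the Portal is configured only with the GARR location and resolves everything else from it. The RegistryClient interface and lookupDataServiceFactory method are hypothetical placeholders for the real GARR query operations.

// Sketch of the Portal start-up discovery sequence described above.
// RegistryClient and its method are hypothetical placeholders.
public final class PortalBootstrap {

    public interface RegistryClient {
        /** Resolve the handle of the GDMSF configured for a named GeneGrid database. */
        String lookupDataServiceFactory(String databaseName);
    }

    public static void start(RegistryClient garr) {
        // The Portal knows only the GARR location; every other service
        // location is discovered from the registry at start-up.
        String gwddFactory   = garr.lookupDataServiceFactory("GWDD");
        String gstripFactory = garr.lookupDataServiceFactory("GSTRIP");

        // With the GWDD factory resolved, the Portal can fetch the Master
        // Workflow Definition Document and present workflows to users.
        System.out.println("GWDD GDMSF:   " + gwddFactory);
        System.out.println("GSTRIP GDMSF: " + gstripFactory);
    }
}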
3. GeneGrid Resource Manager

Section 2 presented each GeneGrid component individually. In this section we focus on the GeneGrid Resource Manager (GRM). As already stated, the GRM consists of two major components: the GeneGrid Node Monitors (GNM) and the GeneGrid Application and Resource Registry (GARR) service. We also describe the Grid Manager Portal, developed as part of the Grid Manager project, which will become part of the GeneGrid Portal in the next release, and show how the GRM enables GeneGrid to share resources across different domains.

3.1 GeneGrid Node Monitors

All computational nodes available to GeneGrid run a lightweight agent which collects resource information relating to the node on which it is deployed. These agents, called GeneGrid Node Monitors (GNM), transmit the information they collect about the computational resources and GeneGrid services available on that node. The GNM is a lightweight C program which communicates with the GARR service by sending XML messages using the gSOAP library. In the latest release of GeneGrid, GNMs for various platforms, such as Solaris and Linux, are available to support the heterogeneous environment required, as shown in Figure 6.

Figure 6. GNMs sending information to the GARR service. GNMs on multiple resources across administrative domains register resource information securely with a GARR. This resource information includes details of the applications and databases supported, as well as the location of the corresponding services.

3.2 GeneGrid Application and Resource Registry

The GeneGrid Application and Resource Registry (GARR) service captures the information sent by the GNM about the grid services deployed on a node, along with information about the node itself. The information sent by the GNM can be broadly classified as System Information and Resource Data. The System Information consists of the hardware address, system time, IP address, CPU speed, CPU load, total memory, free memory, operating system name and version, uptime, hostname, system architecture, number of processors, load averages for the last 1, 5 and 15 minutes, and custom data. The Resource Data consists of the name, type and Grid Service Handle (GSH) of the resources available on the node, which can be any application or database to which they facilitate access. The GARR service stores this information in a MySQL database to provide persistence of resource information. This information can also be viewed through the Grid Manager Portal, as shown in Figure 7.

Figure 7. Grid Manager Portal.

The GARR service is a GT3-based, persistent, OGSA-compliant grid service which provides an interface to query the GARR and retrieve information such as a resource's name, CPU load and so on. Clients such as the GeneGrid Workflow Manager (GWM), which is responsible for processing all experiments, contact the GARR to discover the location of any specific GeneGrid service. A client can also query the GARR to find all services supporting, say, the BLAST application, in which case the GARR returns all available GeneGrid Application Manager Service Factories (GAMSF) supporting BLAST, as well as any resource information associated with their hosting nodes.
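The registration and persistence path described above can be pictured with a small sketch: a GARR-style service receives a node update and writes it to MySQL via JDBC. The NodeUpdate fields, table layout and connection details are assumptions for illustration; only the broad design (the GNM sends XML, the GARR persists it in a MySQL database) comes from the text above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch of the GARR persistence step: store a parsed node update in MySQL.
// Field names, table layout and credentials are illustrative assumptions.
public final class GarrStore {

    /** A parsed GNM update (subset of the System Information fields). */
    public record NodeUpdate(String hostname, String ipAddress, String osName,
                             double loadAvg1Min, long freeMemoryKb) {}

    public static void persist(NodeUpdate u) throws Exception {
        String url = "jdbc:mysql://localhost:3306/garr"; // hypothetical database
        try (Connection con = DriverManager.getConnection(url, "garr", "secret");
             PreparedStatement ps = con.prepareStatement(
                     "REPLACE INTO node_status " +
                     "(hostname, ip_address, os_name, load_avg_1min, free_memory_kb) " +
                     "VALUES (?, ?, ?, ?, ?)")) {
            ps.setString(1, u.hostname());
            ps.setString(2, u.ipAddress());
            ps.setString(3, u.osName());
            ps.setDouble(4, u.loadAvg1Min());
            ps.setLong(5, u.freeMemoryKb());
            ps.executeUpdate(); // one row per node, overwritten on each update
        }
    }
}

Keeping a single, regularly overwritten row per node keeps registry queries cheap, at the cost of not retaining a history of past load values.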
3.3 GeneGrid Shared Resources

Bioinformatics applications and datasets are exposed to the GeneGrid Environment by GAMSFs and GDMSFs respectively. These GAM and GDM services make up the GeneGrid Shared Resources. Each GAMSF and GDMSF advertises its existence and capabilities to a GE via the GNM on its hosting node registering with the GARR. A GNM can register with many GARR services across multiple GEs, allowing resources to be shared between multiple organisations. Organisations therefore have complete control over which resources, if any, they share with other GeneGrid organisations, forming dynamic virtual organisations as shown in Figure 8.

Figure 8. Shared resources in GeneGrid. Resource A registers with both VO 1 and VO 2; Resource B registers with VO 1 only. Therefore, users of VO 1 may access both resources, while users of VO 2 may access only Resource A.

3.4 Performance Evaluation

GeneGrid depends on the GRM to decide which of the various distributed, heterogeneous resources to allocate to a particular task. Hence it is essential to study the behaviour of the GRM under different circumstances in order to understand any performance limitations. We carried out two sets of experiments on the GRM to study its performance and scalability with respect to the number of clients and the number of registered resources.

In the first set of experiments we used a testbed of one GARR service deployed on the machine blade3a and five clients running on the machines c00, c01, c05, c06 and c07 respectively, all in the same network domain. Fifty different resources were registered with the GARR. We conducted a series of experiments starting with a single client and adding one client at a time, up to five clients accessing the GARR service concurrently. For each experiment we recorded the response time, i.e. the time taken by a client to fetch the data from the GARR service. The results are depicted in Figure 9.

Figure 9. First set of experiments: response time in milliseconds against the number of concurrent clients (c00, c01, c05, c06, c07) for a GARR with 50 registered entries.

As shown in the graph, we found very little variation in the response time as the number of clients accessing the GARR concurrently increased. We therefore ran a second set of experiments using the same five clients on c00, c01, c05, c06 and c07, but accessing a GARR service deployed on the machine bescds01 in a different network domain, as would be the case when GeneGrid users share resources between two organisations with different network domains. We also increased the number of resources registered with the GARR service to 100. The results obtained in this experiment are depicted in Figure 10.

Figure 10. Second set of experiments: response time in milliseconds against the number of concurrent clients (c00, c01, c05, c06, c07) for a GARR with 100 registered entries.

We observed a delay when four clients accessed the GARR concurrently under these circumstances, but the response time with five concurrent clients was almost the same as in the first set. One possible reason for the delay with four clients accessing a remote GARR concurrently is network latency. Our analysis from both sets of experiments is that the GRM is scalable with respect to the number of clients accessing the GARR concurrently and the number of resources registered with it. In future we would like to carry out a more exhaustive performance evaluation and scalability testing using tools such as DiPerF [18].
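For readers who wish to reproduce this kind of measurement, the following is a minimal sketch of a concurrent timing harness: N client threads each time one query against the registry and report the elapsed milliseconds. The GarrClient interface and queryGarr call are hypothetical placeholders for whatever client stub is used to contact the GARR; this is not the harness used in the experiments above.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of a timing harness for experiments of the kind above:
// N concurrent clients each measure the time to fetch data from the registry.
public final class GarrLoadTest {

    interface GarrClient {
        String queryGarr(); // placeholder for "fetch all registered resources"
    }

    static List<Long> measure(GarrClient client, int concurrentClients) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrentClients);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < concurrentClients; i++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    client.queryGarr();                            // one registry fetch
                    return (System.nanoTime() - start) / 1_000_000; // milliseconds
                });
            }
            List<Long> responseTimesMs = new ArrayList<>();
            for (Future<Long> f : pool.invokeAll(tasks)) {
                responseTimesMs.add(f.get());
            }
            return responseTimesMs;
        } finally {
            pool.shutdown();
        }
    }
}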
4. Conclusions and Future Work

We have described the GeneGrid Resource Manager, which is still part-way through its development. The current version is widely deployed in a number of different organisations as part of the GeneGrid 0.5 release. We have also described an analysis of its performance; this quantitative study of the GRM can aid in understanding its performance limitations, inform the development of the monitoring system and help evaluate future development work.

We are currently working to incorporate a number of new features in the GRM:

1) The GRM will expand its functionality by predicting the performance of GeneGrid resources based on the data collected in the GARR database. The predictions generated in this manner will help in effective application scheduling and in fault detection.

2) The GRM will capture network information in order to use resources effectively when large file transfers requiring high bandwidth are to be performed.

3) We also plan to add metadata about the services registered with the GRM so that the services can be used in an automated manner.

4) We plan to add more interfaces to the GARR service which will return optimised resources for particular tasks based on parameters such as CPU load and network bandwidth (one possible shape for such an interface is sketched below).
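As a purely illustrative sketch of the interface envisaged in item 4, the snippet below ranks registered resources by a weighted combination of CPU load and available bandwidth. The weights, field names and RankedResource type are all assumptions for illustration and are not part of the current GRM.

import java.util.Comparator;
import java.util.List;

// Purely illustrative sketch of the "optimised resource" interface envisaged
// in future-work item 4; weights, field names and types are assumptions.
public final class ResourceRanker {

    public record RankedResource(String handle, double cpuLoad, double bandwidthMbps) {}

    /** Lower CPU load and higher bandwidth give a better (lower) score. */
    static double score(RankedResource r) {
        double loadWeight = 0.7, bandwidthWeight = 0.3;            // arbitrary weights
        return loadWeight * r.cpuLoad() - bandwidthWeight * r.bandwidthMbps();
    }

    /** Return candidates ordered from most to least suitable. */
    public static List<RankedResource> rank(List<RankedResource> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(ResourceRanker::score))
                .toList();
    }
}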
References

[1] I. Foster, C. Kesselman and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of Supercomputer Applications, 15(3), 2001.
[2] F. Berman and T. Hey, "The Scientific Imperative", in I. Foster and C. Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann (2004), 13-24.
[3] P. Donachy, T.J. Harmer, R.H. Perrott et al., "Grid Based Virtual Bioinformatics Laboratory", Proceedings of the UK e-Science All Hands Meeting (2003), 111-116.
[4] P.V. Jithesh, N. Kelly, D.R. Simpson et al., "Bioinformatics Application Integration and Management in GeneGrid: Experiments and Experiences", Proceedings of the UK e-Science All Hands Meeting (2004), 563-570.
[5] S. Tuecke, K. Czajkowski, I. Foster et al., Open Grid Services Infrastructure (OGSI) Version 1.0, Global Grid Forum Draft Recommendation, 27 June 2003.
[6] J. Novotny, M. Russell and O. Wehrens, "GridSphere: An Advanced Portal Framework", Proceedings of the EuroMicro Conference (2004), 412-419.
[7] S.F. Altschul et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., vol. 25, pp. 3389-3402, 1997.
[8] N. Kelly, P.V. Jithesh, D.R. Simpson et al., "Bioinformatics Data and the Grid: The GeneGrid Data Manager", Proceedings of the UK e-Science All Hands Meeting (2004), 571-578.
[9] R. Apweiler, A. Bairoch, C.H. Wu et al., "UniProt: the Universal Protein knowledgebase", Nucleic Acids Res., vol. 32, pp. D115-D119, 2004.
[10] J.D. Bendtsen, H. Nielsen, G. von Heijne and S. Brunak, "Improved prediction of signal peptides: SignalP 3.0", J. Mol. Biol., vol. 340, pp. 783-795, 2004.
[11] A. Krogh et al., "Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes", J. Mol. Biol., vol. 305, no. 3, pp. 567-580, 2001.
[12] E. Birney, D. Andrews, P. Bevan et al., "Ensembl 2004", Nucleic Acids Res., vol. 32, pp. D468-D470, 2004.
[13] E. Birney, M. Clamp and R. Durbin, "GeneWise and Genomewise", Genome Res., vol. 14, pp. 988-995, 2004.
[14] C. Kanz, P. Aldebert, N. Althorpe et al., "The EMBL Nucleotide Sequence Database", Nucleic Acids Res., vol. 33 (Database Issue), pp. D29-D33, 2005.
[15] R.D. Stevens, H.J. Tipney, C.J. Wroe, T.M. Oinn, M. Senger, P.W. Lord, C.A. Goble, A. Brass and M. Tassabehji, "Exploring Williams-Beuren syndrome using myGrid", Bioinformatics, vol. 20, Suppl. 1, pp. i303-i310, 2004.
[16] J.D. Thompson, D.G. Higgins and T.J. Gibson, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice", Nucleic Acids Res., vol. 22, pp. 4673-4680, 1994.
[17] S.R. Eddy, "Profile hidden Markov models", Bioinformatics, vol. 14, pp. 755-763, 1998.
[18] C. Dumitrescu, I. Raicu, M. Ripeanu and I. Foster, "DiPerF: an automated DIstributed PERformance testing Framework", Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing (GRID '04) (2004), 289-296.