Grid Based Virtual Bioinformatics Laboratory

Paul Donachy, Terrence J Harmer, Ron H Perrott
Belfast e-Science Centre
www.qub.ac.uk/escience

Jim Johnston, Alan McBride, Michael Townsley
Fusion Antibodies Ltd
www.fusionantibodies.com

Shane McKee
Amtec Medical Limited
www.amtec-medical.com

Abstract

Biotechnologies such as genomics, gene sequencing and high-throughput screening are creating massive volumes and multiple sources of biological and chemical data. However, the volume of data and the processing power required to analyse it are threatening to create a bottleneck that might hamper the growth of biotechnology itself. To date, the HPC resources required to store, manage and analyse such volumes of data have been at the disposal only of large companies and research institutes. However, with the emergence of Grid technology, bioinformatics is an ideal candidate to leverage the benefits of secure, reliable and scalable high-bandwidth access to distributed data sources across various administrative domains. This will in effect allow geographically remote researchers with limited internal resources to access a wealth of biological datasets and HPC resources. This paper presents, from an industrial perspective, the business drivers that acted as the catalyst in creating the industrial e-Science project GeneGrid. The architecture and roadmap for a Grid based Virtual Bioinformatics Laboratory are also presented.

1 Introduction

Whole genome expression monitoring will have extraordinary impact on clinical diagnosis and therapy and bring new power to clinical medicine. As the field progresses we will identify new probes for cancer, infectious disease and inherited disease, and understand how genetic damage occurs and how genes alter response to drug therapies. Equally important will be new therapeutic tools in the form of recombinant gene products, novel drug targets, rational drug design and gene therapy.
Next-generation efforts will allow us to link gene expression patterns with formal characteristics of disease models, including pathological and clinical descriptions.

It has been more than a year since the human genome was mapped, an effort considered one of the most gargantuan scientific endeavors ever undertaken. The DNA sequencing data from the Human Genome Project (HGP) contains much untapped data that needs to be converted to meaningful information. At present, the human genome database has approximately six terabytes of data. This data set is expected to double every six months.

At present there is a vital need to develop distributed solutions to capture, analyse, manage, mine and disseminate these vast amounts of genomic data, in order to develop actual diagnostic and therapeutic strategies. With the emergence of Grid technology, bioinformatics is an ideal candidate to leverage the benefits of secure, reliable and scalable high-bandwidth access to distributed data sources across various administrative domains.

This paper presents the background and motivation for the industrial e-Science project GeneGrid. GeneGrid was conceived from work related to the activities of a number of biotech companies based in Northern Ireland with extensive international collaborative relationships in North America, Europe and Brazil. GeneGrid aims to combine the skills and experience of the stakeholders with the collaborative sharing and coordinated use of their distributed resources to create a “virtual Bioinformatics laboratory” using the Grid. This will allow all relevant organizations, partners and customers to access their collective skills, experience and results in a secure, reliable and scalable manner.

2 Business Drivers

At present the stakeholders make only limited efforts to collaborate, share data and identify information that could be of overall assistance.
It is evident that while the different companies may have different commercial or academic objectives, the potential to share data, information and available resources in a “virtual Bioinformatics Laboratory” has overwhelming economic advantages. This is clearly demonstrated by Fusion, which is endeavoring to find antibody targets from important surface proteins generated by genes and to work with others to seek genetic disease markers for diagnostics. The data that can be generated collaboratively will have considerably greater long-term value than data from individual efforts. This has a further multiplier effect when combined with other associated academic efforts.

The project aims to build upon existing genomic and proteomic programmes, including existing microarray and sequencing technology and the immense volumes of data generated through screening services. At present such technology is used to identify alterations in gene expression between two samples (normal versus disease tissue). These altered expression patterns can then be used as a molecular fingerprint. Furthermore, these molecular profiles can be used to classify different tumor types and ultimately to predict at a molecular level which patients are likely to respond to specific anticancer therapies.

However, the main drawback of existing array technologies is that they are generic, i.e. they represent a collection of genes with no information regarding the tissue types in which they normally function. Given that the human body is made up of multiple tissue types, it becomes clear that only a small percentage of the genes on such a generic array will actually be involved in a specific disease type.

At present the individual companies do not have any dedicated in-house bioinformatics specialisms or HPC capability. They generate large amounts of data, but relating this to the global environment is problematic.
The low speed of data transfer between parties, the lack of high performance computing power and the lack of encompassing security mechanisms across the disparate administrative boundaries and organizations are a blockage to rapid advancement of this important area of science and research.

3 Grid Based Architecture

The Grid based architecture presented here is based on the Open Grid Services Architecture (OGSA) model [1], derived from the Open Grid Services Infrastructure specification defined by the OGSI Working Group within the GGF [2]. OGSA represents an evolution towards a Grid architecture based on Web services concepts and technologies. It describes and defines a service-oriented architecture (SOA) composed of a set of interfaces and their corresponding behaviors to facilitate distributed resource sharing and access in heterogeneous dynamic environments.

[Figure 1: Service-oriented architecture. Labels: Service Requestor, Service Provider, Service Directory; operations PUBLISH, FIND, BIND; Transport Medium.]

Figure 1 shows the individual components of the service-oriented architecture (SOA). The service directory is the location where information about all available grid services is maintained. A service provider that wants to offer services publishes them by putting appropriate entries into the service directory. A service requestor uses the service directory to find an appropriate service that matches its requirements. Examples of such requirements are the maximum time a service requestor is willing to accept for a protein sequence alignment service, or the need to retrieve specific gene information from a biological database query service. The service directory will thus include not only taxonomies that facilitate the search, but also information such as maximum calculation time, QoS details or the cost associated with a service. When a service requestor locates a suitable service, it binds to the service provider, using binding information maintained in the service directory.
The binding information contains the specification of the protocol that the service requestor must use, as well as the structure of the request messages and the resulting responses. The communication between the various agents occurs via an appropriate transport mechanism [3][4]. This architecture is based on a view of service collaboration that is independent of specific programming languages or operating systems. Instead, it relies on already-existing transport technologies (such as HTTP or SMTP) and industry-standard data encoding techniques (such as XML).

4 Virtual Bioinformatic Lab

This paper proposes to develop a service oriented middleware framework targeted towards the domain of Bioinformatics. It proposes to develop an architecture to allow automated wide-scale data mining of publications and public genomic databases, with the objective of establishing correlations of gene sets. The public data will be complemented by targeted sequencing related to specific cancers. The data sets will be quantified and then examined for diagnostic and therapeutic potential.

In such an environment the first step for every service provider that wishes to offer services is to publish its services via appropriate entries in the Service Directory (see Figure 2). These entries include those from service providers offering services such as biological databases, HPC resources, and analysis and visualization resources.

[Figure 2: Service Provider A, Service Provider B and Service Provider C PUBLISH to the Service Directory. Labels: Biological Databases, HPC Resource, Analysis/Visualisation Resource.]

Next the client queries the Service Directory to find the services needed to fulfil a GeneGrid service. These may be found via a portal user interface or dynamically from within a client application. An example of such a request would be “find me services that retrieve all gene sequence data in format X and take less than 30 sec” (see Figure 3).
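The publish, find and bind steps described above can be illustrated with a small in-memory sketch. This is not the GeneGrid implementation or any OGSA/OGSI API. The class names, field names, capability strings, timings and endpoints below are all hypothetical, chosen only to mirror the example request for gene sequence data in format X that takes less than 30 seconds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServiceEntry:
    """One directory entry (every field name here is a hypothetical choice)."""
    provider: str
    capability: str      # what the service does, e.g. "gene-sequence-retrieval"
    data_format: str     # format of the returned data, e.g. "X"
    max_seconds: float   # advertised upper bound on response time
    protocol: str        # binding information: protocol the client must use
    endpoint: str        # binding information: where requests are sent


class ServiceDirectory:
    """Toy stand-in for the Service Directory: providers publish, clients find."""

    def __init__(self) -> None:
        self._entries = []

    def publish(self, entry: ServiceEntry) -> None:
        # PUBLISH: a provider registers a description of its service.
        self._entries.append(entry)

    def find(self, capability: str, data_format: str,
             within_seconds: float) -> List[ServiceEntry]:
        # FIND: keep entries that match the request and whose advertised
        # time bound is strictly below the limit, fastest first.
        matches = [
            e for e in self._entries
            if e.capability == capability
            and e.data_format == data_format
            and e.max_seconds < within_seconds
        ]
        return sorted(matches, key=lambda e: e.max_seconds)


def bind(entry: ServiceEntry) -> Dict[str, str]:
    # BIND: extract the binding information the client needs from the entry.
    return {"protocol": entry.protocol, "endpoint": entry.endpoint}


# Illustrative data only: the providers' capabilities and timings are invented.
directory = ServiceDirectory()
directory.publish(ServiceEntry("Service Provider A", "gene-sequence-retrieval",
                               "X", 12.0, "http",
                               "provider-a.example.org/sequences"))
directory.publish(ServiceEntry("Service Provider B", "gene-sequence-retrieval",
                               "X", 45.0, "http",
                               "provider-b.example.org/sequences"))
directory.publish(ServiceEntry("Service Provider C", "sequence-alignment",
                               "X", 20.0, "smtp",
                               "align@provider-c.example.org"))

matches = directory.find("gene-sequence-retrieval", "X", within_seconds=30)
binding = bind(matches[0])
```

In this sketch the time limit is treated as a hard filter on an advertised bound. A real directory would more likely treat it as one QoS attribute among several, alongside cost and the taxonomies mentioned in Section 3.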
[Figure 3: Client A: search for gene sequence data in format X. GSH. Service Directory: contains information on Service Provider A, Service Provider B and Service Provider C.]

When the services are located, the client binds to the service using binding information detailed in the service directory. In the example above, this may involve specifying the protocol that the client must use to interact with the database service and the transport mechanism to be used, such as JMS or SMTP (see Figure 4).

[Figure 4: Client A BINDs to a Grid Service Instance. Labels: Biological Database, HPC Resource, Analysis/Visualisation Resource.]

5 Summary

Grid computing technology presents an architectural framework that aims to provide access to heterogeneous resources in a secure, reliable and scalable manner across various administrative boundaries. The domain of Bioinformatics is an ideal candidate to exploit the benefits of such a framework.

The development activities of the GeneGrid project are due to start in Sept 2003 and are expected to deliver prototype grid services by Q2 2004. These prototype grid services will initially be deployed and tested within both a local area network and a wide area network environment. The project expects to gain valuable results from such prototype services.

In addition to increased access to HPC resources and reduced cost, another expected benefit of the GeneGrid architecture is the creation of an extensible integration fabric. Integration of such remote, heterogeneous resources in any enterprise is the major bottleneck and the realm of major Enterprise Application Integration (EAI) activities. Here we have presented a Grid based framework that could provide the basis for a reference integration architecture for the stakeholders involved.
However, before widespread adoption happens within this sector, a number of fundamental areas will need to be addressed:

Advanced Service Discovery: A typical bioinformatics experiment involves a complex sequence of human and automated operations. In a grid services environment, describing, managing and discovering such knowledge-rich resources will require middleware extensions or additions to ensure applications can adequately define their requirements for service discovery.

Security: Security, as in various other knowledge-based domains, will be a primary concern and requirement. As Grid technology looks to share resources both internally and externally within organisations, security and integrity of information are not only important but critical to gene and patient confidentiality.

Standards: The area of bioinformatics and Grid computing is already heavily loaded with a mixture of competing formats and standards. The addition of a further set of domain-specific proprietary grid computing standards will not assist in the adoption and uptake of such technology. The project aims to leverage emerging Grid computing standards, e.g. the Globus Toolkit and OGSA, and involvement in GGF standards activities, including the newly formed Life Sciences Grid (LSG) Research Group within the GGF.

References

[1] OGSA http://www.globus.org/ogsa/
[2] OGSI http://www.gridforum.org/ogsi-wg/
[3] S. Burbeck, “The Tao of e-Business Services,” IBM Corporation (2000); see http://www4.ibm.com/software/developer/library/wstao/index.html.