The Integrative Biology Grid – Building on e-Science Components

Lakshmi Sastry*, Srikanth Nagella, Ronald Fowler, John Taylor, Richard Wong, Anjan Pakhira, and Deniz Turan
Applications Group, e-Science Centre
Science and Technology Facilities Council
Rutherford Appleton Laboratory
Chilton, Didcot, OX11 0QX

Abstract

The primary aim of the Integrative Biology (IB) project is to develop a second generation "Hybrid Grid" to support post-genomic research in integrative biology. The requirements for the grid and applications middleware were determined by considering the needs of two vitally important clinical areas, cardiovascular disease and cancer. Component services of the IB grid are already in use by the scientific users, who provide the vital feedback necessary to complete the final phase. This is a report on the work to date and a review of the experience of the computer scientists, in particular those addressing the interactive visualization and steering requirements.

1. Introduction

Full understanding of biological function is feasible only when biologists are able to integrate all the available information to recreate the non-linear, dynamic interaction at multiple levels of the system. For instance, the beating of a heart involves a chain of processes that starts with the regulation of ion concentrations within the cardiac cells and the correct transport of the ions across the cell membrane, which leads to the propagation of the action potential that rhythmically contracts and expands the heart muscle fibres. Advances in biotechnology, underpinned by the massive leap in computational resources, provide an opportunity to recreate biological function through mathematical models. An iterative approach between experiment and modelling can lead to a more accurate determination of biological function with predictive power, leading to novel drugs and treatment.

The goal of the EPSRC funded Integrative Biology (IB) project is to build a grid framework that supports physiologists, clinicians and computational biologists in linking experiment and modelling seamlessly to direct biology research. The needs of those undertaking research in cardiovascular disease and cancer were at once sufficiently similar and sufficiently diverse to give the developers the requirements needed to build such a framework. The scientific agenda addressed through exploitation of the Integrative Biology grid comprises:
• the development of integrated whole organ models of biological systems, primarily in clinical science;
• using these models to begin to study the development cycle of cardiac disease and cancer tumours;
• bringing together clinical and laboratory data from many sources to evaluate and improve the accuracy of the models;
• understanding the fundamental causes of these life-threatening conditions and how to reduce their likelihood of occurrence;
• identification of opportunities for intervention at the molecular and cellular level using customised drugs and novel treatment regimes.

The e-Science challenges for Integrative Biology include:
• the provision of transparent, co-scheduled access to appropriate combinations of the distributed HPC and database resources needed to run coupled multi-scale whole organ simulations;
• the efficient exploitation of these resources through the application, wherever possible, of computational steering, workflow, visualisation and other techniques developed in earlier e-Science projects;
• enabling globally distributed biomedical researchers to collaboratively control, analyse and visualise simulation results in order to progress the scientific agenda of the project;
• maintenance of a secure environment for the resources used and information generated by the project without inhibiting scientific collaboration.

The ambition is that the tools developed by the project will improve the productivity of clinical and physiological researchers in academia and in the pharmaceutical and biotechnology sectors. The UK e-Science community will benefit from access to new tools developed by the project and from the example of an integrated computational framework that the project will develop.

Section 2 of this paper reports the user requirements. Section 3 describes the component parts of the IB grid that emerged from the requirements, the basic architecture of the IB grid framework and its service layers. Section 4 describes a couple of applications built around the IB grid. The Conclusion gives the status of the IB framework together with observations on the developers' experience.

2. User requirements

Gathering user requirements is a skill in which the requirements gatherer deploys whichever of a number of strategies suits the user and the purpose for which the requirements are being gathered. Within IB, the users fell into two highly distinctive categories. The first were technology-aware users who had existing, often deficient, systems and could articulate what they wanted from a system in language understood by its developers. At the opposite end were mathematicians and experimental biologists who had no existing systems on which to base their needs for improvement and were less aware of what technology could do for them. However, they were often able to grasp some of the concepts quickly and, more surprisingly, could articulate what they did not want rather more precisely.

The requirements gathering was undertaken through a variety of methods. One to one, face to face meetings were the primary method used. Existing applications were shown as exemplars to describe functionality during such meetings.
In addition, the developers shadowed individual users to understand their everyday scientific activity, whether composing a set of parameter sweep input configurations for simulations or preparing in vitro experiments and collecting image data. A third method was to distribute a questionnaire prior to the face to face meetings to elicit information on generic issues such as where the data are stored, the backup and security mechanisms used, the data formats, the simulation details, who the data are shared with, source authorization and authentication, and whether any legal or ethical issues are involved in handling and sharing the data. The users were also asked to review current limitations and to describe what they would like to have, without restricting their ideas to what seemed feasible [1].

All the information gathered in these ways was then analysed to arrive at a set of generic as well as specific requirements. These were further partitioned to identify the application specific functionality that needed to be implemented to maintain the users’ enthusiasm for and commitment to the project, and thereby to secure continuous feedback on the evolving framework. This also gave the software architects and developers time to design the detailed services based architecture, the layers of services and interfaces, and the application utilities that need to be supported within the framework and within the toolkits that the users are familiar with. It also provided the opportunity to evaluate output from first generation e-Science projects and decide which modules could be adapted within the IB framework.

The user requirements gathered can be broadly categorized as follows:
• Secure access to high performance compute resources, together with middleware infrastructure to submit simulations.
With some of the user communities, especially the cancer modellers, the techniques of high performance computing and even state of the art computational techniques were not widely known.
• In the case of empirical tumour modelling simulations, computational steering was recognised as a potential method to aid exploration of the parameter space of models.
• Structured, secure data management beyond the individual user’s desktop was uniformly and urgently required.
• Advanced higher and multi dimensional visualisation techniques were deemed desirable.
• Visualisation requirements were made more complex by the need to have the techniques interfaced through proprietary desktop problem solving environments and application specific visualisation toolkits.
• To avoid downloading vast amounts of experimental and simulation data to the users’ desktop machines, it is desirable to perform analysis and visualisation tasks on the grid, close to the data sources.
• Advanced interaction techniques, such as a cutting plane in a user selected orientation, were requested for application specific visualisation toolkits.

The requirements exercise produced some unexpected and interesting points. The first was that, within a small and highly specialized community of researchers and institutions, competition is intense; collaboration meant, especially in heart modelling, that researchers knew each other’s work but did not actively share models or data. Another interesting finding was that the end users were highly focused on their science and publications, and technologies that were not perceived to provide immediate benefits were quickly ignored, even if their potential longer term benefits were obvious. For instance, the effort required to standardize metadata and to use it to query and access data was given low priority.
Strong emphasis was placed on getting hard coded, non generic software modules built to help individual research groups, especially for visualisation and interaction, around favourite tools that were not available to the wider community.

Without exception, the end user scientists had to be supported through the process of obtaining an account on the National Grid Service (NGS), from the application to using the certificate to submit a job to the NGS, even though a step by step guide had been written for the purpose. Biologists sometimes faced formidable problems in completing these simple steps, either because of the system and network configuration at their institution or because their understanding of the process was limited by the new concepts and grid terminology they had to negotiate. In addition, scientists were, and continue to be, reluctant to allocate the time this process takes. However, providing one to one support unearthed a wealth of requirements, not only to simplify the NGS Certification Authority portal but also to alter the content and the language in which the process is expressed.

3. IB Grid architecture

In a typical simplified user scenario, a computational biologist creates one or more detailed input scripts with parameter ranges, indicates a data file containing the details of the computational mesh and the simulation executable to run them with, and stores the resulting output. She may then use one or more proprietary toolkits to visualise and analyse the data. This basic scenario can be made more complex by the need to monitor the simulation as it executes, or to assimilate in vitro or clinical data for comparison. In addition, a cardiac arrhythmia simulation may have been enabled for computational steering, so that a researcher can introduce a stimulant, observe its impact on the development of the action potential and visualise the re-entry pattern [2].
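
The first step of this scenario, composing input configurations from parameter ranges, can be sketched as follows. This is a minimal illustration only and not the IB project's own tooling: the function names, the parameter names and the plain-text `key = value` script format are all assumptions made for the example.

```python
from itertools import product


def sweep_configurations(parameter_ranges):
    """Return one configuration dict per combination of parameter values."""
    names = sorted(parameter_ranges)
    value_lists = [parameter_ranges[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def render_input_script(config, mesh_file, executable):
    """Render one configuration as a plain-text input script (hypothetical format)."""
    lines = [f"executable = {executable}", f"mesh = {mesh_file}"]
    lines += [f"{name} = {value}" for name, value in sorted(config.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical parameter names and values, chosen only for illustration.
    ranges = {"stimulus_current": [0.5, 1.0], "time_step": [0.01, 0.02, 0.05]}
    configs = sweep_configurations(ranges)
    print(len(configs), "configurations")
    print(render_input_script(configs[0], "heart.mesh", "cardiac_sim"))
```

Each rendered script would then be submitted as a separate job, which is why the number of configurations grows as the product of the lengths of the individual parameter ranges.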

The IB grid architecture is a multi layered, service oriented architecture of the kind prevalent in other application domains. Figure 1 below gives an overview of the services layer.

Figure 1: Overview of IB Services

The front end services are based on Secure Web Services with standard WSDL descriptions. These run on IB servers distributed across the project partner institutions. Any IB client can call the services using SOAP messages. The Globus toolkit API is used for the grid communication protocol between the front and back end services. Open standards and software stack are used throughout, both to build the IB Services and for the communication protocols.

Detailed architecture

The IB framework is built on the UK National Grid Service but is designed to be adaptable to other Grid services such as TeraGrid or even local clusters. It supports the Grid Security Infrastructure model based on X.509 certificates. Data storage and management are handled through the Storage Resource Broker (SRB) [3]. At the user level, a variety of desktop interfaces can be realised, depending on the requirements of the user's data analysis task. Other overarching services are accessed through the IB Interface (IBI). Figure 2 below provides an overview of the detailed architecture of the IB framework that supports the user requirements. The architecture is designed so that it can be extended to collaborative working when the community evolves to make use of the full extent of the framework.

Figure 2: Overview of detailed architecture

The architecture is best described using a use case scenario. User A invokes the IBI on her desktop, selects the simulation she wishes to run and a visualisation service or toolkit that can read her data, interpret and process it and generate the image (an encoder). She also indicates the input, the data files and the output directory in which to store the data.
The communication backend of the IBI first initialises the server-side visualisation toolkit and opens the ports needed to receive connections and data from the simulation. Part of this initialisation phase is to realise the control panel of the chosen visualisation toolkit on the client side, so that the user can interact with the displayed results from the simulation. The second task of the client-side backend is to pass all the information to the IB front-end Services Manager. The Services Manager invokes the simulation with the necessary input data and instructs it to stream data to the visualisation toolkit. In this architecture, the data from the simulation or the SRB are processed on the server side and images are sent to the IBI. The user interacts with the displayed visualisation using the control panel. Pick and selection actions are passed to the visualisation toolkit, which interprets them (decoder role). The interactive visualisation toolkit is thus reduced to the role of an encoder and decoder on the server side, which avoids the need to download large amounts of data and to write them to and read them from local disk. The original data and any intermediate data specified by the user are all stored in the SRB.

The IBI provides the user not only with an interface to manipulate their SRB data store but also with the means to back up data on the Atlas Data Store [4] through the SRB interface for long term archiving. The IBI also provides an interactive interface to manage the grid certificate using a MyProxy server [5]. The job submission service allows the user to submit jobs to NGS compute clusters as well as to the national high performance cluster HPCx. Job monitoring information and other house-keeping metadata are generated automatically and stored on an Oracle database service attached to the NGS. Hibernate libraries [6] and the STFC metadata schema [7] are used to create and manage the house-keeping database.
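
The encoder and decoder roles described in the use case above can be sketched as follows. This is a toy illustration under stated assumptions, not the IB implementation: the class and method names are hypothetical, the "image" is a nested list of greyscale values, and the network transport between server and IBI client is omitted entirely.

```python
class ServerSideVisualisation:
    """Hypothetical stand-in for a visualisation toolkit running next to the data."""

    def __init__(self, width, height):
        self.width = width      # image width in pixels
        self.height = height    # image height in pixels
        self.field = None       # latest data frame streamed from the simulation

    def receive(self, field):
        """Accept one frame of simulation data (a 2D list of floats)."""
        self.field = field

    def encode(self):
        """Encoder role: reduce the data to a small greyscale image for the client."""
        rows, cols = len(self.field), len(self.field[0])
        lo = min(min(r) for r in self.field)
        hi = max(max(r) for r in self.field)
        span = (hi - lo) or 1.0  # avoid division by zero for a constant field
        image = []
        for y in range(self.height):
            src_row = self.field[y * rows // self.height]
            image.append([
                int(255 * (src_row[x * cols // self.width] - lo) / span)
                for x in range(self.width)
            ])
        return image

    def decode_pick(self, px, py):
        """Decoder role: map a pixel picked on the client back to a data cell."""
        rows, cols = len(self.field), len(self.field[0])
        row = py * rows // self.height
        col = px * cols // self.width
        return row, col, self.field[row][col]


if __name__ == "__main__":
    vis = ServerSideVisualisation(width=2, height=2)
    vis.receive([[float(r * 4 + c) for c in range(4)] for r in range(4)])
    print(vis.encode())           # only this image would be sent to the client
    print(vis.decode_pick(1, 1))  # a client pick resolved against the full data
```

The design point is that only the rendered image and the pick coordinates cross the network, while the full-resolution data stays on the server side.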

The gViz computational steering library [8] is used in the IB framework to support computational steering. If there is a second user B, the embedded CoVisa module can be used to provide collaboration. Steering communications are handled using the built in gViz communication calls. It is possible to include a further user C who has restricted, view only access to the simulation session. This is typically a scenario in which two or more participating academic research partners wish to demonstrate progress and discuss details of the research with a commercial partner.

Image based Interactors: A novel feature of the IB infrastructure is its Interactor library. Image based steering is used to enhance the quality of human computer interaction in steering environments [9]. The Interactor is an icon representing the data type of a parameter that the user may wish to steer. At set up time, if the simulation is steering enabled, the user can select the parameters to steer. The initialisation process creates instances of the appropriate interactors, places them inside the graphics window and binds each parameter to its interactor. From then on the user can manipulate the interactor directly to convey parameter values for steering. The IB interactors are OpenGL based visual objects with encoded behaviours, and are reusable components within any OpenGL based toolkit.

4. Applications

The IB Grid is used to interact with applications in heart and tumour modelling. Figures 3a and 3b below show a three dimensional carcinoma model with embedded computational steering control to study the effect of nutrient on cell growth and concentration.
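
The binding between an interactor and a steered parameter, as described for the Interactor library in Section 3, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses neither gViz nor OpenGL, and the class names, the slider behaviour and the parameter names are hypothetical.

```python
class SteeredSimulation:
    """Hypothetical stand-in for a steering-enabled simulation."""

    def __init__(self, **parameters):
        self.parameters = dict(parameters)

    def set_parameter(self, name, value):
        if name not in self.parameters:
            raise KeyError(f"{name!r} is not a steerable parameter")
        self.parameters[name] = value


class SliderInteractor:
    """A slider-style interactor bound to one scalar parameter."""

    def __init__(self, simulation, name, minimum, maximum):
        self.simulation = simulation
        self.name = name
        self.minimum = minimum
        self.maximum = maximum

    def drag_to(self, fraction):
        """Convert a slider position in [0, 1] to a parameter value and send it."""
        fraction = min(1.0, max(0.0, fraction))  # clamp out-of-range drags
        value = self.minimum + fraction * (self.maximum - self.minimum)
        self.simulation.set_parameter(self.name, value)
        return value


def create_interactors(simulation, selected):
    """Create one interactor per parameter the user selected at set up time."""
    return {
        name: SliderInteractor(simulation, name, lo, hi)
        for name, (lo, hi) in selected.items()
    }


if __name__ == "__main__":
    # Hypothetical parameters of a tumour growth model.
    sim = SteeredSimulation(nutrient=1.0, growth_rate=0.2)
    interactors = create_interactors(sim, {"nutrient": (0.0, 4.0)})
    interactors["nutrient"].drag_to(0.5)
    print(sim.parameters)
```

Only the parameters chosen at set up time receive an interactor, so unselected parameters such as `growth_rate` here cannot be changed through the graphics window.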

Figures 3a & 3b: In situ ductal carcinoma simulation with steering

Figures 4a and 4b below show an image processing application for vascular cancer tumours, in which the image processing steps are services within the grid that identify blood vessels and tumour cells so that enumeration and other statistical information can be automated. The image in this case can be displayed using one of two techniques. The first, called ‘identify’, shows the complete structure of the blood vessels, the cancer cells or both, either in their original colour or highlighted. The second, called ‘edge’, shows the edges of the blood vessels, the cancer cells or both. These are the edges that will be used to join this 2D image to the next to form the 3D geometry.

Figures 4a & 4b: Stage 3 visualization of vascular tumour - the original and processed images with edges.

5. Conclusion

The IBI and IB grid services are available to users to integrate with their applications. User trials are underway to make the system robust. Building the Integrative Biology grid on results and software from previous generation e-Science projects has been both educational and challenging. Access to the developers of such modules has proven to be the single most significant factor in the speed up and ease of use of such tools. Working with scientists has also been a challenging experience. They tended to be highly focussed on their immediate needs, which were often to enable their applications to gain high performance, parallelization, a friendly user interface or advanced visualization and interaction capability. Such needs had to be catered for to keep the users engaged with the project and providing feedback. However, it proved useful to have done these tasks, as the requirements formed the basis for understanding and architecting how the IB Grid should address the requirements for interactivity, how to get a handle on data management issues and, most significantly, how to design the server side visualization utilities.
The IB grid has extensibility and flexibility built into it and can form the basis for other projects.

Acknowledgement

The authors wish to acknowledge the financial support of the EPSRC (ref no: GR/S72023/01).

6. References

1. Lloyd, S., Gavaghan, D., Whiteley, J., Pitt-Francis, J., Slaymaker, M., Boyd, D.R.S., Mac Randal, D.F., Kleese van Dam, K., Sastry, L. Gathering Requirements for the Integrative Biology project. eScience All Hands Meeting, September 2004.
2. Handley, J., Clayton, R., Wood, J., Holden, A.V., Brodlie, K. Interaction with Cardiac Virtual Tissues on the Grid: The gViz Library. Accepted for publication in Proceedings of FIMH'05, 2005.
3. Storage Resource Broker – http://www.sdsc.edu/srb/index.php/Main_Page
4. http://www.escience.clrc.ac.uk/curation/
5. MyProxy http://grid.ncsa.uiuc.edu/myproxy/license.html
6. http://www.hibernate.org/
7. http://epubs.cclrc.ac.uk/workdetails?w=30324
8. http://www.comp.leeds.ac.uk/vvr/gViz/research_gViz_library.html
9. Sastry, L., and Wright, H. Image based computational steering for Integrative Biology, CompuSteer Workshop, Hull, 2006. http://compusteer.dcs.hull.ac.uk/IB.pdf