The White Rose Grid: practice and experience

P M Dew (dew@comp.leeds.ac.uk), University of Leeds; J G Schmidt (j.g.schmidt@leeds.ac.uk), University of Leeds; M Thompson (mart@comp.leeds.ac.uk), University of Leeds; P Morris (philip.morris@cs.york.ac.uk), University of York

Abstract

The White Rose Grid (WRG) initiative is establishing a production Grid that underpins a broad range of e-Science projects (e.g. DAME, gViz, and the MRI Data Analysis System) involving the three Universities (Leeds, Sheffield and York) and our industrial partners. The White Rose Universities already operate as a Virtual Organisation (VO), and a key part of this project is to understand and demonstrate the value of the Grid in this environment. The paper reports our experience of building and delivering the WRG service. It raises practical issues pertaining to a Grid built by a joint team of academics and Computing Services staff at the three Universities. The paper shares our approach to working together within the WRG and describes some aspects of the technical experience gained while building our Grid. It outlines our organisational structure, the technical configuration of the Grid and user management issues. Finally the business use of the Grid is discussed.

Introduction

The three Yorkshire Universities – Leeds, York and Sheffield – have formed the White Rose University Consortium (WRUC) to undertake larger-scale projects than can be achieved by any one University. At the institutional level, the Consortium, which has recently been featured in the HEFCE White Paper The Future of Higher Education as an exemplar of university collaboration, functions as a Virtual Organisation (VO), employing complementary skills bases from the three Universities to tackle major projects. Many of these projects could take advantage of a Grid infrastructure to support scientific collaborations across the White Rose Universities and their external partners. Thus the White Rose Grid (WRG) [1] offers an ideal environment for studying the issues involved in establishing and running a Grid service between the three sites. It serves as a test-bed Grid environment exposed to a large variety of problems and issues, including the key sociological constraints (for example human interaction, trust and ownership) that are also reflected in global Grids.

Further to this, the WRG provides the three Universities with the capability to implement resource optimisation at the inter-enterprise level by rationalising access to compute and data resources (see the architecture in Figure 1) in order to improve the efficiency of their use and enhance business opportunities. More importantly, it provides an enabling infrastructure for promoting scientific collaboration between members of the three Universities and their industrial partners. It offers the opportunity to develop new e-Science projects and so attract additional funding, and it is used to support the broad research agenda of the White Rose University Consortium. The WRG facilities underpin a number of multi-million pound project proposals recently submitted by the WRUC, an example of which is the Yorkshire bid to host the European Spallation Source (ESS) project at Burn airfield.
Figure 1: The WRG architecture (four nodes: General Purpose HPC, Computer Science, CFD, and Engineering Application Packages)

The WRG works with prestigious companies and organisations (for example Rolls-Royce and Shell UK) in collaborative R&D projects, with the aim of assessing the impact of this new technology.

The business rationale for building the WRG is as follows:
• to strengthen, in partnership with industry, e-Science research with the focus on the WRG core areas: decision support, diagnostics and problem solving environments, building on the experience gained with e-Science projects such as DAME [2], Hydra [3] and gViz [4]
• to assess, in collaboration with Yorkshire Forward and our IT partners, the regional demand for Grid technology
• to support and enlarge new scientific communities including bio-technology, aerospace, tissue engineering, and healthcare

In addition, initial studies are taking place to extend the White Rose Grid by providing a Grid infrastructure for the Worldwide Universities Network (WUN) [5]. For a number of logistic reasons the WRG has been developed in parallel with the UK e-Science Grid [6], and our plan is for the two Grids to interoperate seamlessly with each other.

Access to WRG resources is achieved via Grid portals or Grid-enabled applications, which hide much of the Grid's complexity from the user. WRG portal developments include both generic and application-specific Grid portals. Within the DAME project [2], Grid portals have been developed that provide access to OGSA Grid Services deployed across the WRG. In terms of Grid-enabled applications, for example, the gViz project has developed Grid-enabled IRIS Explorer modules [4].

This paper reports on progress with building the WRG, recounts lessons learnt, and outlines some problems encountered whilst building our Grid. It comprises five main sections. The first section contains a description of the WRG organisational structure; this is followed by observations from the process of acquiring and setting up the Grid. The major section, about WRG technologies, outlines our software components and mentions technical problems and experience gained. The last two sections contain a summary of user management issues followed by our business approach. The lessons learnt, and covered here, add to the comprehensive experience presented by W E Johnston from building the IPG and DOE Science Grids [7].

The WRG organisation

The WRG adopts a VO model, but with explicit staff resources allocated from the three Universities. Its organisational structure is shown in Figure 2. It is driven by the Executive, with participating members from the three Universities and our commercial partner, Esteem Systems, representing our other IT partners, Sun Microsystems and Streamline Computing. It encompasses staff from two Computing Services as well as Engineering and Computer Science departments. This mixture of Service and Research staff provides a complementary combination of skills which we see as essential for the development, implementation and support of our Grid. The Grid technology research element is led by Computer Scientists, whereas the necessary operational skills are drawn from the Computing Service pool of expertise required to support a day-to-day service.
Figure 2: The White Rose & the White Rose Grid. The organisational chart shows the White Rose University Consortium Board (the three Universities' VCs, four PVCs and the Chief Executive), the White Rose Grid Executive, and the White Rose Grid staff, with Training & Education, Technical, Architecture, User Management and Business Development teams drawing on Leeds Computing Service and Computer Science staff, Sheffield Computing Service staff, and York Computer Science staff.

The WRG staff responsibilities include business and regional academic outreach, Grid training, coordination, and liaison with the UK e-Science community, as well as technical developments, whilst local support staff are involved in the operation of the computer systems and in user support. The support infrastructure, which is dispersed across the three institutions, is large and complex and works under different management models. Effective communication remains a crucial issue in assuring the continuing success of the WRG, and many matters are positively resolved by the Executive, who define WRG policies and steer the work of their staff.

Delivering the WRG

The equipment procurement, its installation, and the setting up of the new Grid-enabled systems were carried out jointly by the three Universities. This consultative approach notably complicated these processes through the involvement of large working groups spread across distributed geographical locations. During this time questions of ownership and trust were constantly posed and resolved where possible. At every level, trust has been engendered by working in close unison. The positive outcomes also include the development of close collaboration at the technical and purchasing levels as well as at the management level.

Video and multi-site telephone conferencing facilities were used but did not always work well. Face-to-face meetings were always found to be more effective than virtual interactions despite the added travel time. The Consortium is now in the process of installing Access Grid nodes to increase the efficiency of meetings, but from our past experience it is clear that, while this might help, it is not going to be effective for all meetings.

A key technical/service objective of the White Rose Grid is to increase significantly the computational power available to users across the three Universities, and to provide them with an easy means of access to WRG facilities via a Grid portal. Figure 1 presents a schematic diagram of the heterogeneous WRG architecture, which comprises purposely acquired computer systems (funded with over £3M in grants) offering both a local high performance computing (HPC) service (with a 75% resource allocation) and the Grid infrastructure (utilising 25% of resources). There are four WRG nodes comprising three clusters of high performance machines from Sun Microsystems and two Intel processor-based Beowulf systems (the larger one with 256 processors) from Streamline Computing, in total delivering over 450 CPUs with a large filestore as integrated computational facilities.

The WRG system nodes have been selected in consultation with the three Universities' users to satisfy their diverse computational needs. Each of the four nodes specialises in the provision of a distinct service (see Figure 1). The justification for this approach was to rationalise access to compute and data resources in order to improve the efficiency of their use and of the support needed. The lack of a central Help Desk recording users' problems and their resolution may have occasionally hampered service delivery.
Both the unavailability of information about the way support is offered and the distributed nature of that support cause delays in responding to users' queries, and this may have caused difficulties for some of our users. To promote WRG usage, a set of training courses for both scientific and business users is being developed. The initial focus is on basic HPC techniques and the associated software products (for example OpenMP and MPI). Now that the Globus 2.4 installation has been stabilised, the development of introductory courses on Grid computing, including Globus, has started.

Offering the Grid service to users requires the full support of the Computing Services, using their well-developed procedures and existing support infrastructure. It was vital to build the White Rose Grid quickly using, where possible, well-developed software components (e.g. a powerful job manager) that address the most important user requirements for such a facility. At the same time all three sites needed to train support staff in installing and maintaining Grid middleware and related software products. This was a difficult process, as the issues encountered were not always well documented. A prototype service was offered early, but subsequent changes to the Grid fabric had to be introduced to achieve greater stability and correct functionality.

Basic software stack

The WRG software stack is composed largely of open source software. The core software components deployed across the four WRG nodes include two basic building blocks: Sun Grid Engine Enterprise Edition (SGEEE) and the Globus Toolkit v2.4 (GT2.4). These were selected to ensure the following:
• compatibility with other UK e-Science Centres' offerings (via Globus)
• the ability to manage workload, resources and policies set by our user communities (via SGEEE)

This software stack has been customised in response to WRG users' requirements. Our solution provides a Grid that responds to the WRG business requirements and demonstrates the following benefits:
• offers access to distributed resources across WRG systems
• allows batch and interactive jobs to be executed on all individual computers
• enables real-time machine status updates
• monitors usage throughout the WRG and gathers accounting and utilisation statistics in order to deliver uniform and regular usage accounting reports
• implements for individual systems the agreed resource management policies, based on shares and past usage
• offers the capacity to develop and implement a utility computing model with the provision of an on-demand service

New application costing models based on metered use are being considered.

Sun Grid Engine Enterprise Edition

The Sun Grid Engine Enterprise Edition (SGEEE) resource management software is installed on each of the four WRG nodes to control resource allocation across the machines within each node. It enforces the site-specific resource management policy, which is implemented as an agreed SGEEE share-based policy. This policy (requested by our users) assigns the level of service according to: the share owned by individual users; their past usage of this share; and their currently intended use of the systems. Authorisation for access to resources is effected by assigning users to the relevant projects managed by SGEEE. This also supports the production (via SGEEE) of consistent usage accounting reports for all WRG users.
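To illustrate the spirit of such a share-based policy, the following is a deliberately simplified sketch (not SGEEE's actual scheduling algorithm) of how a project's entitlement can be traded off against its exponentially decayed past usage; the project names, half-life and usage figures are invented for the example.

# Illustrative sketch only: a toy share-based priority calculation in the
# spirit of the SGEEE share tree policy, where each project's entitlement is
# weighed against its decayed past usage. This is NOT SGEEE's real algorithm;
# all names and numbers below are invented for illustration.

HALF_LIFE_HOURS = 168.0  # assumed decay half-life for past usage (one week)

def decayed_usage(usage_records):
    """Sum past CPU usage (cpu_hours, age_hours), halving its weight every HALF_LIFE_HOURS."""
    return sum(cpu_hours * 0.5 ** (age_hours / HALF_LIFE_HOURS)
               for cpu_hours, age_hours in usage_records)

def share_priority(entitlement, usage_records):
    """Higher value means the project is further below its agreed share."""
    return entitlement / (1.0 + decayed_usage(usage_records))

# Example: part of the 25% WRG share on one node, split between two projects.
projects = {
    "wrg_dame": {"entitlement": 0.15, "usage": [(120.0, 24.0), (300.0, 200.0)]},
    "wrg_gviz": {"entitlement": 0.10, "usage": [(10.0, 48.0)]},
}
for name, p in sorted(projects.items()):
    print(name, round(share_priority(p["entitlement"], p["usage"]), 4))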
The job manager handles batch jobs as well as interactive jobs, and it allows for the implementation of a chargeback mechanism for the computing power used. This enables the development of a utility computing model, which will shape the business process while giving better control of distributed resources.

Globus

Job submission

Working closely with the early adopters of Grid technology around the WRG made it possible to undertake a simple analysis of the types of Grid jobs that must be supported. These job types range from basic serial and parallel jobs to more complex scripted job sequences and computational steering applications. All jobs must be routed through SGEEE for resource allocation on compute nodes within a particular cluster. To achieve this integration between Globus and SGEEE, the WRG has employed the Globus SGE job manager and information provider packages from Imperial College, London [8]. These are Perl modules and shell scripts that perform the necessary translation of generic Globus-level requests into the equivalent resource scheduler commands understood by SGE. Each of the local SGE job managers has been tuned to match the local configuration, for example to use the correct values for the LD_LIBRARY_PATH variable or to work with the default values for queue parameters and parallel environments, which vary between sites.

With a suitably configured SGE job manager it is possible to support basic serial and parallel jobs, but further measures are required to support scripted jobs and computational steering applications. Scripted jobs involve a sequence of small tasks (possibly pre-, main and post-processing) defined within a single script. Computational steering applications require a mechanism for clients to communicate with active jobs on remote resources and subsequently influence the runtime behaviour of the job. Both scripted jobs and computational steering applications pose challenges when the jobs execute on compute nodes on a private network within a cluster, as is the case on WRG resources. Supporting scripted jobs has proved easier to resolve than finding a simple, secure and lightweight solution to the computational steering problem.

The globus-sh-exec tool offers a platform-independent solution for scripted jobs. Scripts interpreted by globus-sh-exec can reference commonly used system tools (e.g. tar and gzip) via variables defined by Globus rather than hard-coded pathnames. Consequently, users are not required to know where these tools are installed on each resource. Typically, Globus is only installed on the head node of a cluster, but if the Globus installation is mounted on all compute nodes then the globus-sh-exec tool will be available to any scripted job submitted to the Globus SGE job manager. However, other Globus tools which access resources beyond the internal nodes of a cluster, such as globus-url-copy, require special treatment. Such tools must be wrapped so that the real invocation of the tool actually occurs on the head node of the cluster, from where the rest of the WRG resources are accessible.

Information Services

The Information Services pillar of the Globus Toolkit is a critical component within the WRG. Several services are totally reliant on the information provided by the Grid Index Information Service (GIIS), which is in turn reliant upon the Grid Resource Information Service (GRIS) on each compute resource.
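As an illustration of the kind of query the GIIS and GRIS must answer promptly, the sketch below (not a WRG script) performs an anonymous LDAP search against a GRIS with an explicit time limit, roughly as a portal or broker would. The hostname is a placeholder; port 2135 and the base DN shown are the customary MDS-2 defaults.

# Minimal sketch: query a GRIS/GIIS over LDAP with an explicit timeout.
# The hostname is a placeholder; port 2135 and the "local" base DN are the
# customary MDS-2 defaults.
import subprocess

def query_gris(host, filt="(objectclass=*)", timeout_s=15):
    """Return raw LDIF output from the GRIS, or None if it did not answer in time."""
    cmd = [
        "ldapsearch", "-x",                # anonymous simple bind
        "-H", f"ldap://{host}:2135",       # default MDS-2 port
        "-b", "mds-vo-name=local,o=grid",  # default GRIS namespace
        "-l", str(timeout_s),              # server-side time limit (seconds)
        filt,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s + 5)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout if result.returncode == 0 else None

if __name__ == "__main__":
    ldif = query_gris("head-node.wrg.example")  # placeholder hostname
    print("GRIS responded" if ldif else "GRIS timed out or failed")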
The WRG portals and brokers require a dynamic view of available CPUs, memory, disk space and job queues; without a prompt and reliable response from the GIIS it is impossible for the WRG portals and brokers to work effectively. The key issue is to ensure that each GRIS is capable of producing a swift response to queries. Two cases leading to poor response times from the GRIS have been observed. Firstly, given that Globus is installed on the head node, which is also used by users for the development of their codes, the GRIS daemons occasionally compete with users' code compilation for CPU time and consequently fail to provide a swift response. Secondly, publishing job queue information through the GRIS can increase response times considerably. There are four WRG nodes, publishing queue information from SGEEE on five different clusters, and the number of queues per cluster varies from 20 to over 100. The consequence is that the default timeout limits for various GRIS information providers were often reached, resulting in no response at all. This problem could only worsen as we introduce more information providers into the GRIS for publishing installed software and licensing details. The problem has not been solved completely, but the situation has been improved. The GRIS daemons now run on the head node of each cluster with a higher priority than users' jobs. In addition, careful pruning of the published information down to the essentials, and tuning of GRIS parameters such as the cache time, have improved performance.

Troubleshooting

The WRG is a relatively complex aggregate of Grid technologies with dependencies between individual components and also between components and the underlying systems. Our users appreciate the experimental nature of the Grid software, but rapid discovery and remedy of Grid problems are very important. Our original organisational structure for maintaining Globus installations around the WRG was based on local system administrators at each site and a single Globus expert responsible for all three sites. The advantage of this model is that it is easy to maintain a certain level of consistency across the various Globus installations, because a single person is responsible for them. However, there are at least two serious disadvantages. Firstly, a single Globus expert represents a major vulnerability and bottleneck in the organisational model. Secondly, if local system administrators are not sufficiently familiar with Globus then it is possible to introduce incompatibility problems into the Globus installation while carrying out the essential maintenance of local software, operating systems and SGEEE. For example, if changes are made to the configuration of SGEEE then it is likely that they will impact the configuration of the Globus SGE job manager. Over time, this Grid support model has evolved into a less centralised approach in which the local system administrators are more involved with their own Grid installations. Transfer of expertise from Grid research groups to local system administrators has been vital to the WRG in terms of the successful deployment of Globus 2.4. This process will continue as experience with Globus 3.0 gained within the DAME project is also propagated through to the local system administrators.

Portals

Grid access through the Globus Toolkit still involves a relatively steep learning curve for many users. To simplify this access, the WRG has undertaken the development of portals that provide a single gateway to all resources from the user's desktop.
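The learning curve arises because, without a portal, a user must drive the Grid through a sequence of command-line tools. The sketch below (with placeholder hostnames, paths and lifetimes, not real WRG endpoints) indicates the kind of proxy creation, job submission, polling and file-transfer steps that the WRG portals wrap behind a web interface.

# Sketch of the command-line steps a portal hides from its users: create a
# proxy credential, submit a job to a node's Globus/SGE job manager, poll its
# status and fetch results with GridFTP. Hostnames and paths are placeholders.
import subprocess, time

CONTACT = "head-node.wrg.example/jobmanager-sge"  # placeholder GRAM contact

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout.strip()

# 1. Create a short-lived proxy from the user's X.509 certificate
#    (prompts for the private-key passphrase).
subprocess.run(["grid-proxy-init", "-hours", "12"], check=True)

# 2. Submit a batch job through the Globus GRAM/SGE job manager.
job_contact = run(["globus-job-submit", CONTACT, "/bin/hostname"])

# 3. Poll until the job finishes.
while run(["globus-job-status", job_contact]) not in ("DONE", "FAILED"):
    time.sleep(30)

# 4. Copy an output file back with GridFTP (paths are illustrative).
run(["globus-url-copy",
     "gsiftp://head-node.wrg.example/home/wrluser/results.dat",
     "file:///tmp/results.dat"])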
Various prototype portals, based on J2EE technology, have been developed for specific applications. The first portal to be developed was DAME-XTO [9], an early demonstrator. It enables aeronautical engineers to identify abnormal behaviour in aircraft engines by performing DSP analyses of vibration data collected from onboard sensors. It offers users secure web access supported by a MyProxy server [10], selection of engine datasets from a database catalogue, submission and monitoring of jobs, and the visualisation of results through a graph plotting service implemented with jCharts. Another portal, currently under development, enables bioinformaticians to search through databases of protein structures using parallelised versions of novel search algorithms. Essentially, both portals follow the same basic model: select a dataset; select an application; submit the dataset and application to a Grid resource; visualise the results. However, typical usage requires the submission of many independent jobs, each processing a different dataset. The portals make launching, monitoring and managing the results of these jobs a much easier task. These portals have been built using Apache Tomcat and the Grid Portal Development Kit (GPDK) [11], although it was necessary to reengineer GPDK to work with Globus 2.4 since it was originally developed for Globus 1.1.4.

Another portal has been developed for the DAME project, based on the Struts web application framework [12] and the Java CoG Kit [13]. Within the DAME project each member (York, Leeds, Sheffield and Oxford) is responsible for developing a set of tools that aid the diagnosis of aircraft engine faults. These tools have been developed using Globus 3.0 Grid Services and are deployed at each of the member sites. For example, York has pattern matching Grid Services, Sheffield has case-based reasoning Grid Services, Leeds has DSP analysis Grid Services and Oxford provides access to engine datasets through Grid Services. A workflow manager component coordinates the use of these Grid Services in sensible combinations and enables access to their results. The portal represents a presentation layer on top of the workflow manager.

The experience gained through developing these prototype portals has helped us recognise various requirements and limitations that will influence our future portal developments. In many research groups around the WRG there is an obvious demand for access to computational resources through a Grid portal. Typically, these research groups make heavy use of stable home-grown or commercial codes, have significant computational requirements and wish to share their code and results with collaborators, but they lack sufficient in-house expertise to build their own Grid portal. Generic Grid portals have been considered, but their generality is both a strength and a weakness. A generic portal is relatively easy to develop and maintain, but its ease of use and functionality are limited by the fact that it must be the lowest common denominator for all users and all applications. An application-specific portal is far more appealing to users because the user interface can be tailored to meet their needs precisely, but it places a heavy burden on the portal development team, which becomes responsible for developing and maintaining many different portals.
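All of these portals depend on the user's Grid credentials being available to the portal back end; DAME-XTO, for example, relies on a MyProxy server [10] for this. The following is a minimal sketch of that delegation pattern with a placeholder server name, account and lifetimes; it is not the DAME-XTO implementation, which is written in Java.

# Sketch of MyProxy-based credential handling for a portal: the user first
# delegates a proxy to a MyProxy server, and the portal later retrieves a
# short-lived proxy on the user's behalf before submitting jobs. The server
# name, username and lifetimes are placeholders.
import subprocess

MYPROXY_SERVER = "myproxy.wrg.example"  # placeholder

def delegate_credential(username, days=7):
    """Run once by the user on a machine holding their certificate."""
    subprocess.run(["myproxy-init", "-s", MYPROXY_SERVER, "-l", username,
                    "-c", str(days * 24)], check=True)  # lifetime in hours

def portal_retrieve_proxy(username, hours=2):
    """Run by the portal back end when the user logs in over HTTPS."""
    subprocess.run(["myproxy-logon", "-s", MYPROXY_SERVER, "-l", username,
                    "-t", str(hours)], check=True)

if __name__ == "__main__":
    delegate_credential("wry001")      # user-side step (prompts for passphrases)
    portal_retrieve_proxy("wry001")    # portal-side step (prompts for the MyProxy passphrase)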
Having recognised this development bottleneck and its consequences for a scalable solution to deploying a variety of portals around the WRG, experiments with the concept of customisable portals are being carried out. The intention is that each research group will start with a basic portal that contains the Grid functionality common to all Grid portals and will use tools, available through the portal interface itself, to upload new applications, define access to datasets, create simple workflows, and customise the interface. Relevant technology that will help us deliver this vision includes the component-based user interface features of JetSpeed [14] and, more importantly, GridSphere [15]. In terms of automatically extending the functionality of a Grid portal based on application descriptors we take inspiration from the Ninf project [16].

User management

An essential element of the WRG infrastructure is an effective user management scheme for members of the WRG community, which spans the three Universities and includes industrial partners and academics from other universities. Authorisation for access to WRG resources, as well as the gathering of usage accounting statistics under the SGEEE product, necessitated the introduction of a new username scheme and a user authorisation process. These have been developed taking into consideration the following:
• the distributed nature of the WRG
• the cultural differences in registering and managing users at the three sites
• dependence on the existing user registration software at the local institutions
• the existence of two different classes of users, local and White Rose Grid, as well as the inclusion of other academic and commercial partners

WRG users have the option of accessing systems through Globus or using the traditional means of access (via ssh and sftp). At remote sites they are allocated a common WRG identity beginning with the letters WR, followed by the letter Y, S or L indicating the user's home university, whilst locally they are assigned a local username. The WRG implements the Grid Security Infrastructure (GSI), which provides secure authentication through a Public Key Infrastructure (PKI) mechanism with digital certificates. Users authenticate to the Grid using their personal X.509 v3 digital certificates issued by the UK Core Programme Certification Authority (CA) or the WRG CA. The resulting single sign-on capability across all WRG resources is welcomed by users.

Working within the VO, it would have been preferable to appoint a single Registration Authority (RA) for all WRG users. However, this was unacceptable to the UK e-Science Core Programme Certification Authority, as their policy is that the RA should be local to the users whose requests it is to verify. As a result, each University has established its own UK CA RA to verify its users' identities so that the UK CA can issue a personal digital certificate to each applicant.

A common Application Form for WRG Resources is used across the WRG to authorise users for access to WRG resources at the three sites. Authorised users may then use all available computer systems across the three sites under the WRG share (25% resource allocation). Access to individual software products is being considered and resolved on a case-by-case basis. This significantly complicates matters, and it seems unacceptable that application software suppliers have not resolved the issue of licensing their products for use on the Grid. The Grid is based on the principle of resource sharing, and it is therefore essential that software products and datasets can be licensed for use over the Grid.
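Under GSI, the certificate-based and username-based views of a user are tied together on each resource by mapping certificate subjects to local accounts, conventionally via a grid-mapfile. The sketch below illustrates that mapping for the WRG username scheme described above; the DN format, numeric suffix and account names are assumptions for the example, not real WRG data.

# Illustrative sketch only: generate grid-mapfile style entries mapping
# certificate subject DNs to WRG usernames of the form WR + home-site letter
# (L, S or Y). The DNs, numeric suffix scheme and file name are assumptions.
SITE_LETTER = {"leeds": "l", "sheffield": "s", "york": "y"}

def wrg_username(site, index):
    """e.g. ('york', 7) -> 'wry007' (suffix scheme assumed for illustration)."""
    return f"wr{SITE_LETTER[site]}{index:03d}"

def gridmap_entry(subject_dn, site, index):
    # Standard grid-mapfile syntax: "<certificate subject>" <local account>
    return f'"{subject_dn}" {wrg_username(site, index)}'

users = [
    ("/C=UK/O=eScience/OU=York/L=CS/CN=jane example", "york", 1),
    ("/C=UK/O=eScience/OU=Leeds/L=ISS/CN=joe example", "leeds", 2),
]
with open("grid-mapfile.example", "w") as f:
    for dn, site, idx in users:
        f.write(gridmap_entry(dn, site, idx) + "\n")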
Business benefits

Although Grid technology has, so far, been mainly the preserve of the scientific community, it also offers significant benefits to other communities in the public and private sectors. Evidence to support this can be derived from the myriad reports that signal the profound impact Grid technology could have on services and products (see, for example, [17]). As part of a move towards assessing the value of a regional Grid infrastructure through the provision of services to both private and public spheres, the White Rose Grid has embarked on a two-year outreach project funded by the Yorkshire and Humber Regional Development Agency, Yorkshire Forward.

To support this project's objectives three interrelated activities have been identified: assessing regional interest, developing a business plan, and providing a trial infrastructure for company incubators at the Innovation Centres/Science Parks of the three Universities. These activities are further supported through the provision of demonstrators, each of which serves both to promote understanding of the three objectives and to act as a catalyst for identifying new issues. These include, for example, the e-Science funded DAME project demonstrator, a Grid-enabled MRI demonstrator to support and aid clinicians in the diagnosis of cancer using scanner data, and a workbench (being built with Shell Research) for visualising the computational modelling of lubricants.

Assessment of regional interest

To assess and foster regional interest in Grid technology, a series of dissemination activities has been carried out, including presentations to large national and international conferences (see the WRG conference [18]), to representatives from vertical market sectors, and to single companies. These presentations have so far generated significant interest in Grid technology in various areas: medicine, disease spread, military aircraft maintenance, libraries, arts & heritage, agrichemicals, tool cutting and digital media. The final part of the assessment stage is a series of coordinated meetings and workshops with interested parties to clarify the role of the Grid in their organisations and how they would want to manage it.

Developing the business plan

To generate a business plan for the regional Grid infrastructure, a number of preliminary activities have been defined and initiated. The business landscape defines the areas in which Grid companies could operate and locates each company within this landscape. This process entails identifying the types of service that could be provided through Grid technology (a platform for collaborative R&D, provision of compute cycles or data storage, etc.), the markets in which they might operate (one particular company, a virtual organisation, a horizontal market sector such as aerospace, etc.) and the roles that a regional Grid infrastructure could play within these business sectors and services (e.g. advise, enable, broker, develop). The business landscape plays a further role in supporting the business development plan as well as helping in the identification of charge and business models. Work in this area is being supported by the accountants Deloitte & Touche and the solicitors Hammonds. Strongly allied to the business and charge models are issues concerning Quality of Service and Service Level Agreements, which are being studied by the WRG.
Trial regional infrastructure

The final stage in the White Rose Grid outreach programme is to provide access to the WRG, through a portal, to companies working within the Innovation Centres/Science Parks at the three institutions, in order to support technology evaluation as well as the testing of both the business and charge models.

Concluding remarks

The WRG has demonstrated the value of a production Grid service. The task now is to build the WRG e-Science user community further, to enlarge the portfolio of Grid-enabled applications, and to increase the number of e-Science projects undertaken. Further to this, recent and future technical developments resulting in Grid portals that offer an easy-to-use web interface to all WRG resources should encourage users to make full use of all WRG computational assets.

Issues of effective communication, trust and ownership arise constantly. They often outweigh technical problems and severely constrain both the deployment of the Grid and collaboration across it. Furthermore, the delivery of the WRG service has exposed complexities arising from its innovative nature and its large, distributed support teams. However, these problems have been successfully resolved through an effective, often dynamic organisational structure, the involvement of the Computing Services from the start of the project, and the motivation provided by the challenge of implementing an emerging, innovative technology.

References

[1] http://www.wrgrid.org.uk
[2] http://www.cs.york.ac.uk/dame
[3] http://www.informatics.leeds.ac.uk/hydra
[4] http://www.visualization.leeds.ac.uk/gViz
[5] http://www.wun.ac.uk
[6] http://www.rcuk.ac.uk/escience
[7] W E Johnston, Implementing Production Grids, in Grid Computing: Making the Global Infrastructure a Reality, edited by F Berman, A Hey and G Fox, John Wiley & Sons, 2003
[8] http://www.doc.ic.ac.uk/~marko/ETF/sge.html
[9] http://iri02.leeds.ac.uk:8080/damexto/damexto
[10] http://grid.ncsa.uiuc.edu/myproxy
[11] http://doesciencegrid.org/projects/GPDK/
[12] http://jakarta.apache.org/struts
[13] http://www.globus.org/cog/java
[14] http://jakarta.apache.org/jetspeed
[15] http://www.gridsphere.org
[16] T. Suzumura, H. Nakada, M. Saito, S. Matsuoka, Y. Tanaka and S. Sekiguchi, The Ninf Portal: An Automatic Generation Tool for Grid Portals, in Proc. of the 2002 Joint ACM-ISCOPE Conference on Java Grande, pages 1-7, 2002
[17] PricewaterhouseCoopers Technology Forecast 2002-2004
[18] http://www.wrgrid.org.uk/conference2003_slides.html

Acknowledgements

The authors are grateful for the funding received from EPSRC and the UK e-Science Core Programme, Yorkshire Forward, and Esteem Systems.