OxGrid, a campus grid for the University of Oxford

David C. H. Wallom, Anne E. Trefethen
Oxford e-Research Centre, University of Oxford, 7 Keble Road, Oxford OX2 6NN
david.wallom@oerc.ox.ac.uk

Abstract

The volume of computationally and data intensive research in a leading university can only increase. The same, though, cannot be said of funding, so it is essential that every penny of useful work is extracted from existing systems. The University of Oxford has invested in creating a campus wide grid. This connects all large scale computational resources as well as providing a uniform access method for 'external' resources such as the National Grid Service (NGS) [1] and the Oxford Supercomputing Centre (OSC) [2]. The backbone of the campus grid is built using standard middleware and tools, but the value-added services, including resource brokering, user management and accounting, have been provided by software designed in-house. Since the system was first started in November 2005 we have attracted six significant users from five different departments. These users run a mix of bespoke and licensed software and had run ~6300 jobs by the end of July 2006. Currently all users have access to ~1000 CPUs, including NGS and OSC resources. With approximately two new users a week approaching the e-Research Centre, the current limitation on the rate of uptake is the amount of time that must be spent with each user to make their interactions as successful as possible.

1. Introduction

Within a leading university such as Oxford there could be expected to be as many as 30 separate clustered systems. These will have been purchased through a variety of sources such as grant income, donations and central university funding, and it is becoming increasingly important that full use is made of them. As well as these specialist systems, the same is true of all ICT resources throughout an organisation, including shared desktop computers within teaching laboratories and personal systems on staff desks.

There are also a significant number of resources available either nationally or internationally, and it is therefore very important that the interfaces defined by these projects are supported within the infrastructure. This places a significant steer on the middleware chosen as the basis for the project and, it should be noted, is true not only for computation but for data as well.

The other large impediment that can occur for a project of this type is the social interaction between departments, which may jealously guard their own resources, and the users from different groups that could make best use of them. This is especially true in a collegiate university, where possible resources are located within separate colleges as well as the academic departments. This has therefore led to a large outreach effort through talks and seminars given to a university wide audience, as well as contacting all serial users of the OSC.

The design of each of the components of the campus grid will be discussed, showing their functional range as well as future plans to make further use of GGF standards.

2. Requirements

Before embarking on an exercise such as the construction of a campus wide infrastructure, it is important that an initial set of minimum user requirements is considered. The most important requirement is that the users' current methods of working must be affected as little as possible, i.e. they should be able to switch from working on their current systems to the campus grid with a seamless transition.
The inherent system design should be such that its configuration can be dynamically altered without interrupting users. This includes interruption of service not only to the particular clusters that make up nodes on the grid but also to the central services. In such a case the user may not be able to submit more tasks, but should be safe in the knowledge that the tasks they have already submitted will run uninterrupted. The final core requirement involves monitoring, since it is essential that once a user has submitted a job or stored a piece of data, it is monitored until its lifetime has expired.

2.1 Data provision as well as computation

The provision of data services will become increasingly important in the coming years. This will be especially true for the move by the arts, humanities and social sciences into e-Science, as these subjects include studies that make extensive use of data mining as a primary tool for research. The data system must be able to take the following factors into account:
• a steep increase in the volume of data as studies progress, including the new class of research data generated by computational grid work;
• metadata to describe the properties of the data stored, the volume of which will be directly proportional to its quality;
• as the class of data stored changes from final, post-analysis research data towards the raw data on which many different studies can be based, the need for replication and guaranteed quality of storage will increase.
Therefore a system is needed which allows a range of physical storage media to be added to a common core presenting a uniform user interface. The actual storage must also be location independent, spread across as many different physical locations as possible.

3. The OxGrid System

The solution as finally designed is one where individual users interact with all connected resources through a central management system, as shown in Figure 1. This has been configured such that, as usage of the system increases, each component can be upgraded so as to allow organic growth of the available resources.

Figure 1: Schematic of the OxGrid system

3.1 Authorisation and Authentication

The projected users of the system can be split into two distinct groups: those that want to use only resources that exist within the university, and those that want access to external systems such as the National Grid Service. For those users requiring external system access, we are required to use standard UK e-Science digital certificates. This is a restriction brought about by the owners of those systems. For those users that only access university resources, a Kerberos Certificate Authority [3] system connected to the central university authentication system has been used. This has been taken up by central computing services and will therefore become a university wide service once there is a significant enough user base.

3.2 Connected Systems

Each resource that is connected to the system has to have a minimum software stack installed. The middleware installed is the Virtual Data Toolkit [4]. This includes the Globus 2.4 middleware [5] with the various patches that have been applied through the developments of the European Data Grid project [6].

3.3 Central Services

The central services of the campus grid may be broken down as follows:
• Information Server
• Resource Broker
• Virtual Organisation Management
• Data Vault
These will all be described separately.
3.3.1 Information Server

The Information Server forms the basis of the campus grid, with all information about resources registered to it. It uses the Globus MDS 2.x [7] system to provide details of the type of each resource, including full system information, together with details of the installed scheduler and its associated queue system. Additional information is added using the GLUE schema [8].

3.3.2 Resource Broker

The Resource Broker (RB) is the core component of the system and the one with which users have the most interaction. Current designs for a resource broker are either very heavyweight, with added functionality which is unlikely to be needed by the users, or have a significant number of interdependencies which could make the system difficult to maintain. The basic required functionality of a resource broker is actually very simple and is listed below:
• submit tasks to all connected resources,
• automatically decide the most appropriate resource to distribute a task to,
• depending on the requirements of the task, submit only to those resources which fulfil them,
• depending on the user's list of registered systems, distribute tasks appropriately.

Using the Condor-G [9] system as a basis, an additional layer was built to interface the Condor Matchmaking system with MDS. This involved interactively querying the Grid Index Information Service to retrieve the information stored in the system-wide LDAP database, as well as extracting the user registration information and installed software from the Virtual Organisation Management (VOM) system. Additional information, such as the systems to which a particular user is allowed to submit tasks, is also held in the VOM system, which must therefore be queried whenever a task is submitted. The list of installed software is also retrieved at this point from the Virtual Organisation Manager database, where it was recorded when the resources were added. The job advertisement generated for each resource is input into the Condor Matchmaking [10] system once every 5 minutes. The operation of the resource broker is shown in Figure 2.

Figure 2: Resource Broker operation

3.3.2.1 The Generated Class Advertisement

The information passed into the resource class advertisement may be classified in three ways: that which would be included in a standard Condor machine class-ad, additional items that are needed for Condor-G grid operation, and items which we have added to give extra functionality. This third set of information is described here.

Requirements = (CurMatches < 20) && (TARGET.JobUniverse == 9)

This ensures that no more than a maximum number of submitted jobs can be matched to the resource at any one time, and that the resource will only accept Globus universe jobs.

CurMatches = 0

This is the number of currently matched jobs, as determined by the advertisement generator from MDS and Condor queue information every 5 minutes.

OpSys = "LINUX"
Arch = "INTEL"
Memory = 501

This describes the type of resource, as determined from the MDS information. It is important to note that a current limitation is that this information covers the head node only, not the workers, so heterogeneous clusters cannot at present be used on the system.

MPI = False
INTEL_COMPILER = True
GCC3 = True

Special capabilities of the resource also need to be defined. In this case the cluster resource does not have MPI installed and so cannot accept parallel jobs.
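As an illustration of the above, the following is a minimal sketch, in Python rather than the production advertisement generator, of how a class advertisement of this form might be assembled from attributes gathered from MDS and the VOM database. The attribute names mirror the example above; the ResourceInfo structure, its field names and the example values are assumptions made purely for illustration.

# Minimal sketch (not the production code) of generating a resource
# class advertisement of the kind shown above.  ResourceInfo and its
# field names are illustrative assumptions, as is the way the MDS and
# VOM values would be obtained.
from dataclasses import dataclass

@dataclass
class ResourceInfo:
    name: str          # fully qualified domain name registered in the VOM
    max_jobs: int      # 'Maximum number of submitted jobs' from the VOM
    cur_matches: int   # currently matched jobs, from MDS/Condor queue data
    op_sys: str        # head-node details taken from MDS
    arch: str
    memory_mb: int
    software: dict     # e.g. {"MPI": False, "INTEL_COMPILER": True, "GCC3": True}

def make_classad(r: ResourceInfo) -> str:
    """Render the extra-functionality part of the resource class-ad."""
    lines = [
        f'Name = "{r.name}"',
        f'Requirements = (CurMatches < {r.max_jobs}) && (TARGET.JobUniverse == 9)',
        f'CurMatches = {r.cur_matches}',
        f'OpSys = "{r.op_sys}"',
        f'Arch = "{r.arch}"',
        f'Memory = {r.memory_mb}',
    ]
    for flag, present in sorted(r.software.items()):
        lines.append(f'{flag} = {"True" if present else "False"}')
    return "\n".join(lines)

if __name__ == "__main__":
    cluster = ResourceInfo("cluster.example.ox.ac.uk", 20, 0,
                           "LINUX", "INTEL", 501,
                           {"MPI": False, "INTEL_COMPILER": True, "GCC3": True})
    print(make_classad(cluster))

An advertisement of this form would then be refreshed and passed to the matchmaker (for example with condor_advertise) on the same five minute cycle described above.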
3.3.2.2 Job Submission Script

As the standard user of the campus grid will normally only have submitted jobs to a simple scheduler, it is important that users are abstracted from the underlying grid system and resource broker. The Condor-G system uses non-standard job description mechanisms, and so a simpler, more efficient method has been implemented. It was decided that most users were experienced with a command line executable that takes arguments, rather than designing a script-based submission system. It is intended, though, to alter this in version 2 so that users may also use the GGF JSDL [11] standard to describe their tasks. The functionality of the system must be as follows:
• the user must be able to specify the name of the executable and whether this should be staged from the submit host or not,
• any arguments that must be passed to the executable on the execution host,
• any input files that are needed to run and so must be copied onto the execution host,
• any output files that are generated and so must be copied back to the submission host,
• any special requirements on the execution host, such as MPI or memory limits,
• optionally, so as to override the resource broker when necessary, the gatekeeper URL of the resource the user specifically wants to run on; this is useful for testing etc.

It is also important that when a job is submitted the user's task is not allocated by the resource broker onto a system to which they do not have access. The job submission script therefore accesses the VOM system to obtain the list of allowed systems for that user, passing the user's DN to the VOM, retrieving a comma-separated list of systems and reformatting this into the format accepted by the resource broker. An example invocation is:

job-submission-script -n 1 -e /usr/local/bin/rung03 -a test.com -i test.com -o test.log -r GAUSSIAN03 -g maxwalltime=10

This example runs a Gaussian job which, in the current configuration of the grid, the resource broker will only place on the Rutherford NGS node. This was a specific case developed to test capability matching.
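As an illustration of what such a wrapper has to produce, the sketch below maps options of the kind shown in the example command onto a Condor-G (Globus universe) submit description. It is a sketch only: the interpretation of each option, the default values and the helper function itself are assumptions rather than the actual OxGrid job submission script.

# Sketch of the kind of Condor-G submit description the job submission
# script must generate on the user's behalf.  The meaning attached to
# each option is inferred from the example above and is an assumption;
# this is not the actual OxGrid wrapper.
def build_submit_description(executable, arguments="", input_files=(),
                             output_files=(), software_tag=None,
                             globus_rsl=None, gatekeeper=None, count=1):
    lines = [
        "universe = globus",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
    ]
    if input_files:
        lines.append("transfer_input_files = " + ",".join(input_files))
    if output_files:
        lines.append("transfer_output_files = " + ",".join(output_files))
    if gatekeeper:
        # the user has overridden the broker and named a gatekeeper explicitly
        lines.append(f"globusscheduler = {gatekeeper}")
    elif software_tag:
        # otherwise advertise the requirement so it can be matched against
        # the software flags in the resource class-ads; in broker-driven
        # operation the gatekeeper of the matched resource is supplied by
        # the matchmaking layer rather than by the user
        lines.append(f"requirements = ({software_tag} == True)")
    if globus_rsl:
        lines.append(f"globusrsl = ({globus_rsl})")
    lines += ["output = job.out", "error = job.err", "log = job.log",
              f"queue {count}"]
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_submit_description("/usr/local/bin/rung03",
                                   arguments="test.com",
                                   input_files=["test.com"],
                                   output_files=["test.log"],
                                   software_tag="GAUSSIAN03",
                                   globus_rsl="maxwalltime=10"))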
3.3.3 Virtual Organisation Manager

This is another example of a new solution designed in-house, because the solutions currently available are over-complicated. The functionality required is as follows:
• add/remove a system to/from the list of available systems,
• list the available systems,
• add/remove users to/from the list of users able to access general systems,
• list the users currently on the system,
• add a user to the SRB system.
When removing users, though, it is important that their record is marked as invalid rather than simply removing the entries, so that system statistics are not disrupted. So that attached resources can retrieve the list of stored distinguished names, we have also created an LDAP database from which the list can be retrieved via the standard EDG MakeGridmap scripts distributed in VDT.

3.3.3.1 Storing the Data

The underlying functionality has been provided using a relational database. This has allowed the alteration of tables as and when additional functionality has become necessary. The decision was made to use the PostgreSQL relational database [12] due to its known performance and zero cost. It was also important at this stage to build in the extra functionality to support more than one virtual organisation. The design of the database is shown in Appendix 1.

3.3.3.2 Add, Remove and List Functions

Administration of the VOM system is through a secure web interface, with each available management function as a separate web page. The underlying mechanisms for the addition, removal and list functions are basically the same for both systems and users. The information required for an individual user is:
• Name: the real name of the user.
• Distinguished Name (DN): this can either be a complete DN string as per a standard X.509 digital certificate or, if the Oxford Kerberos CA is used, just their Oxford username; the DN for this type of user is constructed automatically.
• Type: within the VOM a user may be either an administrator or a user. This defines the level of control they have over the VOM system, i.e. whether they can alter its contents. In this way the addition of new system administrators to the system is automated.
• Registered systems: which registered systems the user can use and their local username on each; this can be either a pool account or a real username.
An example of the interface is shown in Figure 3.

Figure 3: The interface to add a new user to the VOM

When a user is added into the VOM system this gives them access to the main computational resources through the insertion of their DN into the LDAP and relational databases. Each user is also added into the Storage Resource Broker [13] system for the storage of their large output datasets.

To register a system with the VOM the following information is needed:
• Name: the fully qualified domain name.
• Type: either a cluster or a central system; users will only be able to see clusters.
• Administrator e-mail: for support queries.
• Installed scheduler: such as PBS, LSF, SGE or Condor.
• Maximum number of submitted jobs: gives the resource broker the value for 'CurMatches'.
• Installed software: the list of installed licensed software, again passed on to the resource broker.
• Names of allowed users and their local usernames.
An example of the interface is shown in Figure 4.

Figure 4: The interface for system registration into the VOM

3.4 Resource Usage Service

Within a distributed system where different components are owned by different organisations it is becoming increasingly important that tasks run on the system are accounted for properly.

3.4.1 Information Presented

As a base set of information the following should be recorded:
• Start time
• End time
• Name of the resource the job ran on
• Local job ID
• Grid user identification
• Local user name
• Executable run
• Arguments passed to the executable
• Wall time
• CPU time
• Memory used
As well as these basic variables, an additional attribute has been set to account for the differing cost of a resource:
• Local resource cost
This is particularly useful for systems that have special software or hardware installed, and can also form the basis of a charging model for the system as a whole.

3.4.2 Recording Usage

There are two parts to the system: a client that can be installed onto each of the attached resources, and a server that records the information for presentation to system administrators and users. It was decided that it would be best to present the accounting information to the server on a task-by-task basis. This results in instantaneously correct statistics should it be necessary to apply limits and quotas. The easiest way to achieve this is through a set of Perl library functions that attach to the standard job-managers distributed within VDT. These collate all of the information to be presented to the server and then call a 'cgi' script through a web interface on the server. The server database is part of the VOM system described in section 3.3.3.1; an extra table has been added in which each attribute corresponds to a column and each recorded task is a new row.
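The production client is a set of Perl functions hooked into the VDT job-managers; the Python sketch below is only an illustration of the task-by-task reporting pattern described above, namely collecting the per-job attributes of section 3.4.1 and handing them to a CGI script on the accounting server. The server URL, the parameter names and the example values are assumptions.

# Illustrative sketch only of the per-task usage reporting described in
# section 3.4.2; the real client is written in Perl and attached to the
# VDT job-managers.  The URL and field names here are assumptions.
from urllib.parse import urlencode
from urllib.request import urlopen

ACCOUNTING_CGI = "https://oxgrid.example.ox.ac.uk/cgi-bin/record_usage"  # assumed

def report_usage(record: dict) -> int:
    """Send one completed task's accounting record to the server."""
    data = urlencode(record).encode("ascii")
    with urlopen(ACCOUNTING_CGI, data=data) as response:
        return response.status

if __name__ == "__main__":
    status = report_usage({
        "start_time": "2006-07-01T09:00:00", "end_time": "2006-07-01T10:30:00",
        "resource": "cluster.example.ox.ac.uk", "local_job_id": "4321.pbs",
        "grid_user": "/C=UK/O=eScience/OU=Oxford/CN=example user",
        "local_user": "gridpool01", "executable": "/usr/local/bin/rung03",
        "arguments": "test.com", "wall_time": 5400, "cpu_time": 5100,
        "memory_mb": 480, "cost": 1.0,
    })
    print("server responded", status)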
3.4.3 Displaying Usage

There are three different methods of displaying the information that is stored about tasks run on OxGrid. The overall number of tasks per month is shown in Figure 5. This can then be split into totals per individual user and per individual connected system, as shown in Figure 6 and Figure 7.

Figure 5: Total number of run tasks per month on all connected resources owned by Oxford, i.e. NOT including remote NGS core nodes or partners
Figure 6: Total number of submitted tasks per user
Figure 7: Total number of jobs as run on each connected system

3.5 Data Storage Vault

It was decided that the best method to create an interoperable data vault system would be to leverage work already undertaken within the UK e-Science community, and in particular by the NGS. The requirement is for a location-independent virtual filesystem, which can make use of the spare disk space that is inherent in modern systems. Initially, though, as a carrot to attract users, we have added a 1 TB RAID system onto which user data can be stored. The storage system uses the Storage Resource Broker (SRB) [13] from SDSC. This has the added advantage of not only fulfilling the location-independence requirement but also allowing metadata to be added to annotate stored data for improved data mining capability. The SRB system is also able to interface not only to plain text files stored within normal attached filesystems but also to relational databases. This will allow large data users within the university to make use of the system, as well as to install the interfaces necessary to attach their own databases with minimal additional work.
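Users interact with the vault through the standard SDSC SRB client tools. The sketch below, which assumes the SRB 'Scommands' client is installed and that the user's ~/.srb configuration already points at the OxGrid vault, shows how output data might be pushed into and listed from the store; the collection path is purely illustrative.

# Sketch of pushing results into the SRB data vault using the standard
# SDSC 'Scommands' client tools.  Assumes the Scommands are installed and
# configured for the OxGrid vault; the collection name is illustrative.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def store_results(local_file: str, collection: str) -> None:
    run(["Sinit"])                            # start an SRB session
    try:
        run(["Sput", local_file, collection])  # copy the file into the vault
        run(["Sls", collection])               # confirm it arrived
    finally:
        run(["Sexit"])                         # close the session

if __name__ == "__main__":
    store_results("test.log", "/home/oxgrid.vault/results")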
3.6 Attached Resources

When constructing the OxGrid system the greatest problems encountered were with the different resources that have been connected to it. These fell into the separate classes described below.

3.6.1 Teaching Lab Condor Pools

The largest single donation of systems to OxGrid has come from the Computing Services teaching lab systems. These are used during the day as Windows systems and have a dual-boot setup with a minimal Linux 2.4.x installation on them. The systems are rebooted into Linux at 2100 each night and then run as part of the pool until 0700, when they are restarted back into Windows. Problems have been encountered with the system imaging and control software used by OUCS and its Linux support. This has led to a significant reduction in the available capacity of this system until the problem is rectified. Since this system is also configured without a shared filesystem, several changes have been made to the standard Globus jobmanager scripts to ensure that Condor file transfer is used within the pool. A second problem has been found recently with the discovery of a large number of systems within a department that use the Ubuntu Linux distribution, which is currently not supported by Condor. This has resulted in having to distribute by hand a set of base C libraries that resolve the issues with running the Condor installation.

3.6.2 Clustered Systems

Since there is significant experience within the OeRC with clustered systems, we have been asked on several occasions to assist with cluster upgrades before these resources could be added into the campus grid. This has illustrated the significant problems with the Beowulf approach to clustering, especially if very cheap hardware has been purchased, and has led on several occasions to the need to spend a significant amount of time installing operating systems, which in reality has little to do with the construction of a campus grid.

3.6.3 Sociological issues

The biggest issue when approaching resource owners is always the initial reluctance to allow anyone but themselves to use the resources they own. This is a general problem with academics all over the world and as such can only be addressed with careful consideration of their concerns and specific answers to the questions they have. This has resulted in a set of documentation that can be given to the owners of clusters and Condor systems so that they can make an informed choice on whether they want to donate resources or not. The other issue that has had to be handled is communication with the staff responsible for departmental security. To counteract this we have produced a standard set of firewall requirements, which are generally well received, since communication from departmental equipment is restricted to a single system for the submission of tasks.

3.7 User Tools

In this section various tools are described that have been implemented to make user interaction with the campus grid easier.

3.7.1 oxgrid_certificate_import

One of the key problems within the UK e-Science community has always been that users find interaction with their digital certificates difficult, certainly when dealing with the format required by the GSI infrastructure. It was therefore decided to produce an automated script. This is used to translate the certificate as it is retrieved from the UK e-Science CA, automatically save it in the correct location for GSI, and set the permissions and the passphrase used to verify the creation of proxy identities. This has been found to reduce the number of support calls about certificates, as these can cause problems which the average new user would be unable to diagnose. To ensure that the operation has completed successfully it also checks the creation of a proxy and prints its contents.

3.7.2 oxgrid_q and oxgrid_status

So that a user can check their individually submitted tasks, we developed scripts that sit on top of the standard Condor commands for showing the job queue and the registered systems. By default both of these commands show only the jobs the user has submitted or the systems they are allowed to run jobs on, though both also have global arguments to allow all jobs and all registered systems to be viewed.

3.7.3 oxgrid_cleanup

When a user has submitted many jobs and then discovers that they have made an error in configuration etc., it is important for a large distributed system such as this that a tool exists for the easy removal of not just the underlying submitted tasks but also the wrapper submitter script. This tool has therefore been created to remove the mother process as well as all of its submitted children.
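As an illustration of the wrapper style used for these tools, the sketch below shows an oxgrid_q-like command built on top of condor_q, following the behaviour described in section 3.7.2: by default the invoking user's name is passed through so that only their jobs are listed, while a --all flag drops the filter. This is a sketch of the approach, not the actual OxGrid script.

# Sketch of an oxgrid_q-style wrapper around condor_q, following the
# behaviour described in section 3.7.2: show only the calling user's
# jobs by default, or every job when --all is given.  This is an
# illustration, not the actual OxGrid tool.
import argparse
import getpass
import subprocess

def main() -> None:
    parser = argparse.ArgumentParser(description="Show OxGrid jobs")
    parser.add_argument("--all", action="store_true",
                        help="show jobs belonging to all users")
    args = parser.parse_args()

    cmd = ["condor_q"]
    if not args.all:
        # condor_q restricts its output to the named owner's jobs
        cmd.append(getpass.getuser())
    subprocess.run(cmd, check=False)

if __name__ == "__main__":
    main()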
4. Users and Statistics

Currently there are 25 registered users of the system. They range from a set of students who are using the submission capabilities to the National Grid Service, to users that only want to use the local Oxford resources. They have collectively run ~6300 tasks over the last six months, using all of the available systems within the university (including the OSC), though not the rest of the NGS. The user base has so far been drawn from those who were already registered with the OSC or similar organisations. Through outreach efforts this is being extended into the more data intensive social sciences as their e-Science projects progress. Several whole projects have also registered their developers to make use of the large data vault capability, including the Integrative Biology Virtual Research Environment. We have asked for user input on several occasions and a sample is presented here.

"My work is the simulation of the quantum dynamics of correlated electrons in a laser field. OxGrid made serious computational power easily available and was crucial for making the simulating algorithm work." Dr Dmitrii Shalashilin (Theoretical Chemistry)

"The work I have done on OxGrid is on molecular evolution of a large antigen gene family in African trypanosomes. OeRC/OxGrid has been key to my research and has allowed me to complete within a few weeks calculations which would have taken months to run on my desktop." Dr Jay Taylor (Statistics)

5. Costs

There are two ways in which the campus grid can be valued: in comparison with a similarly sized cluster, or in the increased throughput of the supercomputer system. The 250 systems of the teaching cluster can be taken as the basis for costs. The electricity cost for these systems would be £7000 if they were left turned on 24 hours a day, yet they produce 1.25M CPU hours of processing, so the value produced by the resource is very high. Since the introduction of the campus grid it is considered that the utilisation of the supercomputing centre for MPI tasks has increased by ~10%.

6. Conclusion

Since November 2005 the campus grid has connected systems from four different departments, including Physics, Chemistry, Biochemistry and the University Computing Services. The resources of the National Grid Service and the OSC have also been connected for registered users. User interaction with these physically dislocated resources is seamless when using the resource broker, and for those resources under the control of the OeRC accounting information has been saved for each task run. This has shown that ~6300 jobs have been run on the Oxford based components of the system (i.e. not including the wider NGS core nodes or partners). An added advantage is that significant numbers of serial users of the Oxford Supercomputing Centre have moved to the campus grid, so that its performance has also increased.

7. Acknowledgments

I would like to thank Prof. Paul Jeffreys and Dr Anne Trefethen for their continued support in the startup and construction of the campus grid. I would also like to thank Dr Jon Wakelin for his assistance in the design and implementation of some aspects of Version 1 of the underlying software, when the author and he were both staff members at the Centre for e-Research Bristol.

8. References

1. National Grid Service, http://www.ngs.ac.uk
2. Oxford Supercomputing Centre, http://www.osc.ox.ac.uk
3. Kerberos CA and kx509, http://www.citi.umich.edu/projects/kerb_pki/
4. Virtual Data Toolkit, http://www.cs.wisc.edu/vdt
5. Globus Toolkit, The Globus Alliance, http://www.globus.org
6. European Data Grid, http://eu-datagrid.web.cern.ch/eu-datagrid/
7. Czajkowski K., Fitzgerald S., Foster I. and Kesselman C., Grid Information Services for Distributed Resource Sharing, Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
8. GLUE schema, The Grid Laboratory Uniform Environment (GLUE), http://www.hicb.org/glue/glue.htm
9. Frey J., Tannenbaum T., Livny M., Foster I. and Tuecke S., Condor-G: A Computation Management Agent for Multi-Institutional Grids, Cluster Computing, Volume 5, Issue 3, July 2002, pages 237-246.
10. Raman R., Livny M. and Solomon M., Matchmaking: Distributed Resource Management for High Throughput Computing, Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC-7), 1998, p. 140.
11. Job Submission Description Language (JSDL) Specification, Version 1.0, GGF, https://forge.gridforum.org/projects/jsdlwg/document/draft-ggf-jsdl-spec/en/21
12. PostgreSQL, http://www.postgresql.org
13. Baru C., Moore R., Rajasekar A. and Wan M., The SDSC Storage Resource Broker, Proceedings of the 1998 Conference of the Centre for Advanced Studies on Collaborative Research (CASCON '98), Toronto, Ontario, Canada, 30 November - 3 December 1998.
14. LHC Computing Grid Project, http://lcg.web.cern.ch/LCG/

Appendix 1: Table structure of the VOM database system

VO_VIRTUALORGANISATION
Key columns: VO_ID (also unique)
Data columns: NAME, DESCRIPTION

VO_USER
Key columns: VO_ID, USER_ID (USER_ID also unique)
Data columns: REAL_USER_NAME, DN, USER_TYPE, VALID

VO_RESOURCE
Key columns: VO_ID, RESOURCE_ID (RESOURCE_ID also unique)
Data columns: TYPE, JOBMANAGER, SUPPORT, VALID, MAXJOBS, SOFTWARE

VO_USER_RESOURCE_STATUS
Key columns: VO_ID, USER_ID, RESOURCE_ID
Data columns: UNIX_USER_NAME

VO_USER_RESOURCE_USAGE
Key columns: VO_ID, JOB_NUMBER
Data columns: USER_ID, JOB_ID, START_TIME, END_TIME, RESOURCE_ID, JOBMANAGER, EXECUTABLE, NUMNODES, CPU_TIME, WALL_TIME, MEMORY, VMEMORY, COST
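To make the appendix concrete, the sketch below creates two of the tables in SQLite as a stand-in for the PostgreSQL database described in section 3.3.3.1. The column names follow Appendix 1; the column types, and the use of SQLite rather than PostgreSQL, are assumptions made purely for illustration.

# Illustration of two of the Appendix 1 tables, using SQLite as a
# stand-in for the PostgreSQL database of section 3.3.3.1.  Column names
# follow the appendix; the types chosen here are assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS vo_user (
    vo_id          INTEGER NOT NULL,
    user_id        INTEGER NOT NULL UNIQUE,
    real_user_name TEXT,
    dn             TEXT,
    user_type      TEXT,          -- 'administrator' or 'user'
    valid          INTEGER,       -- records are invalidated, never deleted
    PRIMARY KEY (vo_id, user_id)
);
CREATE TABLE IF NOT EXISTS vo_resource (
    vo_id       INTEGER NOT NULL,
    resource_id INTEGER NOT NULL UNIQUE,
    type        TEXT,             -- 'cluster' or 'central'
    jobmanager  TEXT,             -- PBS, LSF, SGE or Condor
    support     TEXT,             -- administrator e-mail
    valid       INTEGER,
    maxjobs     INTEGER,          -- feeds the broker's CurMatches limit
    software    TEXT,             -- installed licensed software
    PRIMARY KEY (vo_id, resource_id)
);
"""

if __name__ == "__main__":
    with sqlite3.connect("vom_sketch.db") as conn:
        conn.executescript(SCHEMA)
    print("created vo_user and vo_resource")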