Sun Data and Compute Grids

T. M. Sloan1,2, R. Abrol1,2, G. Cawood1,2, T. Seed1,2, F. Ferstl3

1 EPCC, The University of Edinburgh, James Clerk Maxwell Building, King's Buildings, Mayfield Road, Edinburgh, EH9 3JZ, UK
2 National e-Science Centre, e-Science Institute, 15 South College Street, Edinburgh, EH8 9AA, UK
3 Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7, D-93049 Regensburg, Germany

Abstract

The Sun Data and Compute Grids (SunDCG) project [1] aims to develop an industry-strength, fully Globus-enabled compute and data scheduler based around Grid Engine [2] and Globus [3], together with a wide variety of data technologies. The project started in February 2002 and will run until January 2004. The partners are the National e-Science Centre [4], represented in this project by EPCC [5], and Sun Microsystems [6]. This paper describes the project and its current status as of August 2003. The project is funded as part of the UK e-Science Core Programme [7].

Introduction

According to [8], Grid computing can be classified at three levels of deployment, as illustrated in Figure 1:

• Cluster Grid - a single team or project and their associated resources.
• Enterprise Grid - multiple teams and projects within a single organisation, facilitating collaboration of resources across the enterprise.
• Global Grid - linked Enterprise and Cluster Grids, providing collaboration amongst organisations.

Grid Engine [2] is a distributed resource management system that allows the efficient use of compute resources within an organisation. Grid Engine meets the first two levels, Cluster and Enterprise, by allowing a user to make transparent use of any number of compute resources within an organisation. However, Grid Engine alone does not yet meet the third level, the Global Grid. The Globus Toolkit is essentially a Grid API for connecting distributed compute and instrument resources via the internet. Integration with Globus allows Grid Engine to meet this global level.
That is, it allows collaboration amongst enterprises.

The Sun Data and Compute Grids (SunDCG) project aims to develop a scheduler based on Grid Engine that allows user jobs to be scheduled across a global grid and gives these jobs access to their necessary data sources. Globus will be used as the Grid API to provide secure communications.

As a first step, the project has developed a global compute grid scheduler. This integrates Grid Engine V5.3 and Globus Toolkit V2.2.x to allow access to remote resources, using the Transfer-queue Over Globus (TOG) software developed by the project [11]. Following the development of TOG, the project has investigated the integration of access to data sources via data grid technologies such as OGSA-DAI, GridFTP and SRB. The next step for the project team is to develop a hierarchical scheduler that scales better in a grid environment and enables access to remote data sources via data grid technologies.

Figure 1: Three levels of grid computing: cluster, enterprise and global grids (taken from [8]).

This paper describes how TOG can be used to create a global grid, allowing Grid Engine to schedule jobs for execution on that grid. The paper also outlines the progress being made in developing a hierarchical scheduler solution that integrates access to data sources across a global grid.

The TOG software has been used to create a global compute grid between the universities of Glasgow and Edinburgh. Researchers at the Glasgow site of the National e-Science Centre have been able to access compute resources at EPCC using a Grid Engine installation configured with the TOG software [9].
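In outline, a TOG-configured queue forwards a job by staging its files and executable to the remote site, running it there through Globus, and retrieving the results. The following sketch is illustrative only and is not the actual TOG implementation: the gatekeeper contact, file names and URLs are hypothetical, and the use of the GT2 commands globus-url-copy and globus-job-run is an assumption made here for concreteness.

```python
# Illustrative sketch of what a TOG-style transfer queue does when the local
# Grid Engine hands it a job: stage inputs over, run remotely via Globus,
# stage results back. Commands are built but not executed here.

def stage_command(src, dst):
    """GridFTP-style copy (globus-url-copy src dst)."""
    return ["globus-url-copy", src, dst]

def run_command(contact, executable, *args):
    """Remote execution through a GT2 gatekeeper (globus-job-run)."""
    return ["globus-job-run", contact, executable, *args]

def transfer_queue_plan(job):
    """Build the ordered list of commands for one forwarded job."""
    plan = []
    for f in job["inputs"]:                  # 1. push inputs to the remote site
        plan.append(stage_command(f, job["remote_url"] + f))
    plan.append(run_command(job["contact"], job["executable"]))  # 2. run remotely
    for f in job["outputs"]:                 # 3. pull results back
        plan.append(stage_command(job["remote_url"] + f, f))
    return plan

# Hypothetical job description; names are for illustration only.
job = {
    "contact": "gatekeeper.example.ac.uk",
    "remote_url": "gsiftp://gatekeeper.example.ac.uk/scratch/",
    "executable": "/scratch/sim.sh",
    "inputs": ["sim.sh", "params.dat"],
    "outputs": ["results.dat"],
}
for cmd in transfer_queue_plan(job):
    print(" ".join(cmd))
```

The point of the sketch is that the remote site needs no Grid Engine awareness of the originating enterprise: everything crosses the site boundary as Globus operations.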
TOG is also being used to set up a biomedical e-Science demonstration over the new SRIF network linking three sites within the University of Edinburgh: EPCC, the Scottish Centre for Genomic Technology and Informatics (GTI) at the New Royal Infirmary of Edinburgh, and the MRC Human Genetics Unit (HGU) at the Western General Hospital [10].

Building a Global Compute Grid – the Transfer-queue Over Globus (TOG)

Figure 2 illustrates how an enterprise can access remote compute resources at a collaborating enterprise and thus create a global compute grid. This is achieved by configuring a queue on the local Grid Engine to use TOG, which provides secure job submission and control functionality between the enterprises. In Figure 2, queue e at Enterprise A acts as a proxy for a queue at Enterprise B. This 'proxy queue' is configured to use TOG to run jobs on the queue at B. In Grid Engine, a queue that passes a job to a third party is known as a 'transfer queue', and TOG employs a similar mechanism to that used by transfer queues [12]: execution methods are used to provide the additional functionality to Grid Engine so that jobs can be run elsewhere.

TOG enables an enterprise to schedule jobs for execution on remote resources when local resources are busy. Data and executables can be transferred over to the remote resource, with subsequent transfer of results back to the local installation. In this way a global compute grid can be driven through the familiar Grid Engine interface for job scheduling, submission and control, while remote system administrators retain full control of their resources.

The TOG software and documentation are available for download from the open source Grid Engine site at http://gridengine.sunsource.net/project/gridengine/tog.html.

JOSH – A Hierarchical Scheduling System

Following the development and release of TOG, the SunDCG project is now developing a hierarchical job scheduling system referred to as JOSH (Job Scheduling Hierarchically). This system will match a user's job requirements against Grid Engine instances at available compute sites. A job can then be sent to the chosen compute site for local scheduling and execution.
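The matching step can be sketched as follows. This is a simplified illustration, not the actual JOSH design: the site records, the resource vocabulary and the selection policy (capability first, then lowest load) are assumptions made here for the example.

```python
# Illustrative sketch of hierarchical matching in the style of JOSH: a
# top-level scheduler queries each site's Grid Engine for its capabilities
# and load, then forwards the job to the most suitable capable site.

def capable(site, job):
    """A site can run the job if it offers every resource the job asks for."""
    return all(site["resources"].get(k, 0) >= v
               for k, v in job["requirements"].items())

def choose_site(sites, job):
    """Among capable sites, prefer the one with the lowest current load."""
    candidates = [s for s in sites if capable(s, job)]
    if not candidates:
        return None                      # no site can run this job
    return min(candidates, key=lambda s: s["load"])

# Hypothetical site descriptions; names and numbers are illustrative only.
sites = [
    {"name": "edinburgh", "resources": {"cpus": 64, "mem_gb": 2}, "load": 0.8},
    {"name": "glasgow",   "resources": {"cpus": 16, "mem_gb": 4}, "load": 0.2},
]
job = {"requirements": {"cpus": 8, "mem_gb": 1}}
print(choose_site(sites, job)["name"])   # glasgow: capable and least loaded
```

Here both sites are capable of running the job, so the less loaded one wins; a job asking for more CPUs than any site offers would match no site at all.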
Figure 2: By configuring queue e to use the Transfer-queue Over Globus (TOG) software, Enterprise A can access resources at Enterprise B. Similarly, Enterprise B can access resources at Enterprise A by configuring queue d to use the TOG software.

Before execution, any input files will be pulled to the compute site from their data sources (notably GridFTP servers). Similarly, output files will be pushed to their target data repositories after the job has completed.

Figure 3 illustrates how JOSH will query child Grid Engine installations at collaborating sites to determine whether they are able to run a user's job. JOSH will then place the user's job at the site that best matches the following criteria:

1. It is capable of running the job.
2. It has the lowest load of the available sites.
3. It has the best access to the required data sources.

For those user jobs that are not data grid aware, a data component will handle the transfer of data between sites.

A middleware layer will handle secure communications and data transfer between JOSH, Grid Engine and any remote data sources. The OGSA-compliant Globus Toolkit version 3 will form this layer. User interface software will be developed to allow job submission and monitoring from the user's site.

By scaling better in a grid environment and enabling access to remote data sources via data grid technologies, JOSH will improve upon the recognised limitations of TOG in these areas.

Figure 3: Using a hierarchical scheduler to provide an extensible, scalable global grid.

Further Information

For more information on the project and its deliverables, please visit the project web site at http://www.epcc.ed.ac.uk/sungrid/.

References

[1] Sun Data and Compute Grids project home page, http://www.epcc.ed.ac.uk/sungrid/
[2] Grid Engine home page, http://gridengine.sunsource.net/
[3] Globus home page, http://www.globus.org/
[4] National e-Science Centre home page, http://www.nesc.ac.uk/
[5] EPCC home page, http://www.epcc.ac.uk/
[6] Sun Microsystems home page, http://www.sun.com/
[7] UK e-Science Core Programme, http://www.escience-grid.org.uk/
[8] "Sun Cluster Grid Architecture – A technical white paper describing the foundation of Sun Grid Computing", May 2002, http://wwws.sun.com/software/grid/SunClusterGridArchitecture.pdf
[9] T. Sloan, "Going global with Globus and Grid Engine", EPCC News, Issue 48, Spring 2003, http://www.epcc.ed.ac.uk/overview/publications/newsletters/EPCCnews/EPCCNews48.pdf
[10] R. Baxter, "ODDGenes: Development Plan", EPCC Internal Document, May 2003.
[11] T. Seed, "Transfer-queue Over Globus (TOG): How-To", July 2003, http://gridengine.sunsource.net/download/TOG/tog-howto.pdf
[12] C. Chaubal, "A Prototype of a Multi-Clustering Implementation using Transfer Queues", http://gridengine.sunsource.net/project/gridengine/howto/TransferQueues/transferqueues.html