Open Science Grid and Cloud Computing
Sebastien Goasguen, August 2010

1. Vision

Cloud computing has emerged as a new paradigm for computing. The characteristic features of clouds are elasticity, on-demand provisioning and a utility principle exhibited through a pay-as-you-go model. The Google Trends data shown in Fig. 1 illustrates the emergence of this new term. For several years the de facto example of cloud computing has been the set of products developed by Amazon Web Services, such as the Elastic Compute Cloud (EC2) and the Simple Storage Service (S3). Clearly defining cloud computing can be challenging [1], but in the context of the Open Science Grid (OSG), the lowest layer of the cloud paradigm, Infrastructure as a Service (IaaS), is the most relevant. EGEE has noted that clouds could offer new features to grid users [2] and that the on-demand and elastic properties in particular might be incorporated into today's grid infrastructures.

Fig. 1 Google Trends for Grid computing and Cloud computing. Cloud computing emerged in the last quarter of 2007 and has shown an impressive climb since then.

The challenge that many operational grid infrastructures therefore face is to investigate cloud computing, determine whether users indeed need a new computing model that addresses current shortcomings, and finally adapt the OSG infrastructure and its middleware to support such a cloud model. To that end, OSG has encouraged and started grassroots efforts that call on the community to research, test and deploy cloud prototypes and to measure their costs and benefits. Section 2 describes a few potential architecture changes to the OSG site deployment and middleware, and Section 3 describes the main grassroots efforts currently underway.

2. Potential architecture

2.1 The principal architectural change in cloud computing stems from the need to support requests for machines rather than requests to execute jobs. Virtualization is the enabling technology that transforms a site from a batch-processing farm into an infrastructure provider [3]. Virtual machines can be created on worker nodes that support a hypervisor (a special kernel or kernel modules). A request (e.g., an EC2 request) is transformed into the instantiation of a virtual machine on a node that runs a hypervisor. To receive these requests, a cloud interface needs to be present at the site. Additionally, the local resource manager (LRM), which traditionally takes a job request and transforms it into a request for a number of nodes, will now take the cloud request, assign a node to it and start the requested cloud instance. The principal changes needed to turn a grid site into a cloud provider are therefore to:

- support a hypervisor on the worker nodes,
- deploy a cloud interface on top of the Compute Element (CE), and
- extend the CE to transform the cloud request into a proper LRM request (a minimal sketch of this translation is given below).

Fig. 2 Site architecture showing a CE with an additional grid-facing interface: a cloud interface. The same LRM is used to schedule jobs and virtual machines (VMs); the worker node can act both as a regular worker node and as a hypervisor. Cloud requests may get queued, and some hypervisor technologies may not be usable in this setup.

It is possible to have an intermediate architecture without the cloud interface. Jobs arriving at a grid site would then be executed within virtual machines instantiated on-demand [4]. This approach, while not entirely matching the cloud computing model, represents a good trade-off that adds capability to the grid sites.
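To make the CE extension in Section 2.1 concrete, the following sketch shows one way a cloud request could be mapped onto a regular LRM submission. It is a minimal illustration, not an existing OSG component: the CloudRequest structure, the PBS-style qsub submission and the qemu-kvm command line are assumptions, and a production cloud interface would also need authentication, image staging and instance lifecycle tracking.

```python
#!/usr/bin/env python
# Minimal sketch (hypothetical, not an existing OSG component): translate an
# EC2-style "run instance" request into a regular LRM batch job that boots a
# KVM guest. Assumptions: a PBS-like LRM reachable via `qsub`, qemu-kvm on the
# worker nodes, and VM images already staged on a shared filesystem.

import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class CloudRequest:
    """Stripped-down stand-in for an EC2 RunInstances call."""
    image_path: str          # pre-staged disk image on shared storage
    vcpus: int = 1
    memory_mb: int = 1024
    walltime: str = "24:00:00"


def to_lrm_script(req: CloudRequest) -> str:
    """Write a batch script whose payload is the VM itself (a user-land process)."""
    script = f"""#!/bin/bash
#PBS -l nodes=1:ppn={req.vcpus},walltime={req.walltime}
# The VM runs as an ordinary process on the worker node (cf. KVM in Sec. 2.2).
exec qemu-kvm -m {req.memory_mb} -smp {req.vcpus} -hda {req.image_path} -nographic
"""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        return f.name


def submit(req: CloudRequest) -> str:
    """Submit the translated request to the LRM and return the job identifier."""
    out = subprocess.run(["qsub", to_lrm_script(req)],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


if __name__ == "__main__":
    print(submit(CloudRequest(image_path="/osg/images/star-sl5.img")))
```

Either the full cloud interface of Fig. 2 or the intermediate, interface-less variant could reuse such a translation step; only the entry point that receives the request differs.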
With either approach, VOs would be able to run their jobs within their own customized virtual machines. An image distribution model ensuring that VO images are present at the sites is out of the scope of this paper; it is envisioned that current OSG data transfer mechanisms could be leveraged to distribute the images. A major drawback of this high-level architecture is that cloud requests would be queued and mixed with regular job requests. The risk is that the on-demand characteristic of clouds would not be met. The architecture in Section 2.2 tries to address this issue.

Fig. 3 Site architecture to support interactive cloud resources. The Compute Element uses a Cloud Resource Manager (CRM) decoupled from the LRM. The CRM can support on-demand scheduling of cloud instances. The worker node can also support a hypervisor that needs special privileges to start VMs, such as Xen.

2.2 In this architecture the CE splits the cloud and job requests between two resource managers: the standard LRM and a new Cloud Resource Manager (CRM), which can be implemented with a virtual machine provisioning system. The CRM is able to over-provision worker nodes in order to provide cloud instances on-demand. The CRM also provides the elastic capability of requesting resources from other grid sites or cloud providers during peak demand. Fig. 3 depicts this variation on the proposed architecture. The worker nodes in the farm could be split between regular worker nodes and cloud worker nodes. The choice of hypervisor may affect the management of these nodes and the possible implementation of the architecture. For instance, KVM runs virtual machines as user-land processes; cloud requests can therefore use the LRM and create the cloud instances as regular user processes.

3. Current work

Several efforts are currently ongoing to evaluate the use of cloud computing with OSG. Reports on these efforts have been disseminated during the 2009 [5] and 2010 [6] OSG All Hands meetings. Short summaries are presented below, with a focus on the STAR VO, which has been the main driver of virtualization and cloud computing adoption within OSG.

3.1 STAR and Nimbus: The Nimbus software [7] is a toolkit for cloud computing. It enables the on-demand provisioning of virtualized resources and their automatic configuration. Nimbus presents an EC2 API as a cloud interface and can provision machines on EC2 as well as on Science Clouds. STAR has successfully used Nimbus to deploy a full-fledged grid site on EC2 and run batch processing jobs on it. In this experiment, the compute element and the worker nodes were virtualized on EC2, and the standard job submission techniques were used to send jobs to the CE running in EC2.

3.2 STAR and Condor VM: Condor, which is the most widely used LRM on OSG, has recently added a virtual machine universe that offers VMware and KVM support. There, the virtual machine is a regular Condor job, and matchmaking is used to target a machine with a hypervisor. The image is transferred to the hypervisor before instantiation and back once the VM is shut down, leading to a potentially high cost in data transfer. Originally the job was pre-staged in the VM and executed via a startup script run at boot time; this had the downside that retrieving job output was cumbersome. STAR tested the Condor VM universe on the GLOW resources at the University of Wisconsin. A special job broker was set up to be able to send jobs into the instantiated VMs.
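To make the VM universe discussion above more concrete, the sketch below shows a minimal submit description, embedded in a small Python helper that calls condor_submit. It is an illustration only: the attribute names (universe = vm, vm_type, vm_memory, vm_disk) and the HasVM requirement follow the Condor VM universe but may differ between Condor versions, and the image name, memory size and log file are placeholders.

```python
#!/usr/bin/env python
# Sketch of submitting a VM-universe job to Condor (Sec. 3.2).
# Assumptions: Condor with the VM universe and KVM support is installed, and
# disk.img is a bootable image that already carries the job payload.
# Attribute names are indicative and may vary between Condor versions.

import subprocess
import tempfile

SUBMIT_DESCRIPTION = """
universe     = vm
vm_type      = kvm
vm_memory    = 1024
vm_disk      = disk.img:hda:w
# Matchmaking targets only machines advertising hypervisor support.
requirements = (HasVM =?= True)
log          = star_vm.log
queue
"""

def submit_vm_job() -> str:
    """Write the submit description to a file and hand it to condor_submit."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_DESCRIPTION)
        subfile = f.name
    result = subprocess.run(["condor_submit", subfile],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_vm_job())
```

The requirements expression is where matchmaking steers the VM job to a machine that advertises a hypervisor, as described above.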
3.3 STAR and Clemson University: At Clemson, a slightly different model was used. The main motivation was to virtualize the worker nodes of a regular OSG site in a manner totally transparent to the users. The CE was modified to automatically instantiate VMs based on job requests, as well as to target the VMs of specific VOs. The job manager was extended to add a job attribute that matches a job to the VMs of the VO that submitted it. This technique has the advantage that users keep using their usual workflow while being guaranteed execution in their own virtual machines. STAR successfully tested this mode of operation [8].

3.4 Magellan and FutureGrid: While the previous three projects have focused on a single VO investigating different cloud models within OSG, it is worth noting that large cloud resources and testbeds are being deployed in the US. These resources and testbeds are available to the OSG community and will further inform any architecture changes needed to support a cloud model. Magellan [9] is a Department of Energy project that is deploying two large cloud resources, at NERSC and ANL. Each with approximately 4,000 cores, these resources will test cloud systems such as Nimbus and Eucalyptus, with the aim of studying the scheduling of resources for on-demand and elastic use. FutureGrid [10], on the other hand, is an experimental testbed integrated with the TeraGrid but open to all, including OSG. FutureGrid aims to explore new computing models beyond clouds, the main driver being on-demand service composition and provisioning, where services are not only compute resources but also software, data and visualization tools. As such it extends the concept of IaaS.

Fig. 4 OSG use of cloud resources via the Engage VO. The OSG matchmaker is being extended to lease resources via cloud interfaces from providers outside the regular OSG consortium.

OSG sees both Magellan and FutureGrid as partners in the exploration of cloud computing and as a means to inform the future architecture of OSG. Fig. 4 shows how the Engage VO is expanding its job submission framework to interoperate with cloud providers such as Magellan, FutureGrid and future campus clouds.

3.5 ATLAS/CMS Tier-3: Finally, the addition of Tier-3 sites within OSG has stimulated interest in using virtualization to provide sites that are easy to deploy and manage. Consolidating Tier-3 services onto pre-configured virtual machines is seen as a key enabler for these sites, reducing the time needed to deploy services and automating the creation of a site. Several prototypes are currently under investigation. One of them uses Eucalyptus [11] to offer a service that instantiates complete Tier-3 sites, while another uses the Condor Job Router and Condor Hawkeye to dynamically create worker nodes on EC2. Of interest to both VOs is the CernVM [12] project, which creates appliances for multiple hypervisors and self-configures the appropriate experiment software.

4. References

[1] L. Vaquero, L. Rodero-Merino, J. Caceres and M. Lindner, "A break in the clouds: towards a cloud definition", ACM SIGCOMM Computer Communication Review, Vol. 39, pp. 50-55, January 2009.
[2] M.-E. Bégin, "An EGEE comparative study: Grids and Clouds", January 2008.
[3] R. J. Figueiredo, P. A. Dinda and J. A. B. Fortes, "A Case for Grid Computing on Virtual Machines", 23rd International Conference on Distributed Computing Systems (ICDCS 2003), Providence, Rhode Island, May 19-22, 2003, pp. 550-559.
[4] M. A. Murphy, L. Abraham, M. Fenn and S. Goasguen, "Autonomic Clouds on the Grid", Journal of Grid Computing, Vol. 8, No. 1, March 2010, pp. 1-18.
[5] [On-line] http://indico.fnal.gov/conferenceDisplay.py?confId=2012
[6] [On-line] http://indico.fnal.gov/conferenceDisplay.py?confId=2871
[7] [On-line] http://www.nimbusproject.org
[8] M. Fenn, J. Lauret and S. Goasguen, "Contextualization in Practice: The Clemson Experience", ACAT, Jaipur, India, February 2010.
[9] [On-line] http://www.nersc.gov/nusers/systems/magellan/
[10] [On-line] http://futuregrid.org/
[11] [On-line] http://open.eucalyptus.com
[12] [On-line] http://cernvm.cern.ch