OSG Clouds - Thoughts from Engage

Open Science Grid and Cloud Computing
Sebastien Goasguen
August 2010
1. Vision
Cloud computing has emerged as a new paradigm for computing. The
characteristic features of clouds are elasticity, on-demand provisioning and a
utility principle exhibited via a pay-as-you-go model. The Google Trends data
shown in Fig. 1 demonstrate the emergence of this new term. The de-facto example
of Cloud computing has for several years been the products developed by
Amazon Web Services, such as the Elastic Compute Cloud (EC2) and the Simple
Storage Service (S3). Clearly defining Cloud computing can be challenging
[1], but in the context of the Open Science Grid, the lowest layer of the Cloud
paradigm, Infrastructure as a Service (IaaS), is the most relevant. EGEE has noted
that Clouds could offer some new features to Grid users [2] and that the on-demand
and elastic properties especially might be incorporated into today's
grid infrastructures.
Fig. 1 Google Trends comparison of Grid computing and Cloud computing. Clouds emerged in the
last quarter of 2007 and have shown an impressive climb since then.
The challenge that many operational grid infrastructures therefore face is to
investigate Cloud computing, determine whether users indeed need a new computing
model that addresses current shortcomings, and finally adapt the OSG
infrastructure and its middleware to support such a cloud model. To that end,
OSG has encouraged and started grassroots efforts that call on the community to
research, test and deploy cloud prototypes and measure their costs and benefits.
Section 2 describes a few potential architecture changes to the OSG site
deployment and middleware, and Section 3 describes the main grassroots efforts
currently underway.
2. Potential architecture
2.1 The principal architectural change in cloud computing stems from the
need to support requests for machines rather than requests to execute jobs.
Virtualization is the enabling technology that transforms a site from a
batch processing farm into an infrastructure provider [3]. Virtual machines
can be created on worker nodes that support a hypervisor (a special kernel
or kernel modules). A request (e.g. an EC2 request) is transformed into the
instantiation of a virtual machine on a node that runs a hypervisor. To
receive these requests, a cloud interface needs to be present at a site.
Additionally, the local resource manager (LRM), which traditionally takes the job
request and transforms it into a request for a number of nodes, will now take
the cloud request, assign a node to it and start the cloud instance
needed. Therefore the principal changes that need to occur to make a grid
site into a cloud provider are to:
- Support a hypervisor on the worker nodes
- Deploy a Cloud interface on top of the Compute Element (CE)
- Extend the CE to transform the cloud request into a proper LRM request
  (a sketch of this step follows Fig. 2)
Fig. 2 Site architecture showing a CE with an additional grid-facing interface: a Cloud interface. The
same LRM is used to schedule jobs and Virtual Machines (VMs); the worker node can act both as a
regular worker node and as a hypervisor. Cloud requests may get queued, and some hypervisor
technologies may not be usable in this setup.
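A minimal sketch of the third change listed above, assuming a Condor-based LRM, is given below. The request fields, the start_vm.sh wrapper and the submit-description details are illustrative assumptions rather than OSG-prescribed interfaces; a real CE extension would also have to track the instance and report back through the cloud interface.

#!/usr/bin/env python
# Hypothetical sketch: translate an EC2-style "run instances" request into an
# LRM submission on the CE. The field names, the start_vm.sh wrapper and the
# submit keywords are illustrative assumptions, not an OSG interface.
import subprocess
import tempfile

def cloud_request_to_lrm(request):
    # Build one LRM job per requested instance; the job payload boots the VM.
    submit_text = (
        "universe       = vanilla\n"
        "executable     = /usr/local/bin/start_vm.sh\n"   # hypothetical VM boot wrapper
        "arguments      = %(image_id)s %(memory_mb)d\n"
        "request_memory = %(memory_mb)d\n"
        "queue %(count)d\n"
    ) % request
    with tempfile.NamedTemporaryFile(mode="w", suffix=".sub", delete=False) as f:
        f.write(submit_text)
        submit_file = f.name
    # Hand the translated request over to the local resource manager.
    subprocess.check_call(["condor_submit", submit_file])

if __name__ == "__main__":
    # Example cloud request as it might arrive from the site's cloud interface.
    cloud_request_to_lrm({"image_id": "vo-image-01", "memory_mb": 2048, "count": 4})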
It is possible to have an intermediate architecture without the cloud interface. The
jobs arriving at a grid site would then be executed within virtual machines
instantiated on-demand [4]. This approach, while not entirely matching the cloud
computing model, represents a good trade-off to add capability to the grid sites. VOs
would be able to run their jobs within their customized virtual machines. An image
distribution model ensuring that the VOs' images are present at the sites is out of
the scope of this paper; it is envisioned that current OSG data transfer mechanisms
could be leveraged to distribute the images.
A major drawback of this high-level architecture is the fact that cloud requests would
be queued and mixed with the regular job requests. The risk is that the on-demand
characteristic of clouds would not be met. The architecture in Section 2.2 tries to
address this issue.
Fig. 3 Site architecture to support interactive cloud resources. The Compute Element uses a Cloud
Resource Manager (CRM) decoupled from the LRM. The CRM can support on-demand scheduling of
cloud instances. The worker node can also support a hypervisor, such as Xen, that needs special
privileges to start VMs.
2.2 In this architecture the CE splits the cloud and job requests between two resource
managers: the standard LRM and a new Cloud Resource Manager (CRM) that can be
implemented with a virtual machine provisioning system. The CRM is able to
over-provision worker nodes in order to provide cloud instances on-demand. The CRM
also provides the elastic capability: requesting resources on other grid sites or cloud
providers during peak demand. Fig. 3 depicts this variation on the proposed
architecture. The worker nodes in the farm could be split between regular worker
nodes and cloud worker nodes. The choice of hypervisor may affect the
management of these nodes and the possible implementation of the architecture.
For instance, KVM runs virtual machines as user-land processes; therefore cloud
requests can use the LRM and create the cloud instances as regular user processes.
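As a minimal sketch of this last point, assuming a qcow2 image already staged on the worker node, the payload of an ordinary LRM job could simply boot the guest as an unprivileged process; the image path, memory and CPU figures below are placeholders.

#!/usr/bin/env python
# Hypothetical payload of a regular LRM job: boot a KVM guest as a user-land
# process on the worker node. Image path and sizing are placeholders.
import subprocess

IMAGE = "/scratch/vo-images/star-worker.qcow2"   # placeholder, pre-staged image

cmd = [
    "qemu-system-x86_64",
    "-enable-kvm",                 # hardware virtualization via the kvm kernel module
    "-m", "2048",                  # MB of RAM for the guest
    "-smp", "2",                   # number of virtual CPUs
    "-drive", "file=%s" % IMAGE,   # the VO's root disk image
    "-nographic",                  # headless; the guest is reached over the network
]

# The VM lives exactly as long as the LRM job slot that hosts it.
subprocess.check_call(cmd)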
3. Current work
Several efforts are currently ongoing to evaluate the use of cloud computing with
OSG. Reports on these efforts have been disseminated during the 2009 [5] and 2010
[6] OSG All Hands Meetings. Short summaries are presented below, with a focus on
the STAR VO, which has been the main driver of virtualization and cloud computing
adoption within OSG.
3.1 STAR and Nimbus:
The Nimbus software [7] is a toolkit for cloud computing. It enables the on-demand
provisioning of virtualized resources and their automatic configuration. Nimbus
presents an EC2 API as a cloud interface and can provision machines on EC2 as well as
on Science Clouds. STAR has successfully used Nimbus to deploy a full-fledged grid
site on EC2 and run batch processing jobs on it. In this experiment, the compute
element and the worker nodes were virtualized on EC2 and the standard job
submission techniques were used to send jobs to the CE running in EC2.
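For illustration, a client can talk to such an EC2-compatible front end with boto; in the sketch below the endpoint, port, credentials and image identifier are placeholders to be replaced by the values published by the target Science Cloud.

#!/usr/bin/env python
# Hypothetical boto client against an EC2-compatible endpoint (Amazon EC2 or a
# Nimbus Science Cloud). Endpoint, port, credentials and image id are placeholders.
from boto.ec2.connection import EC2Connection
from boto.ec2.regioninfo import RegionInfo

# Point boto at the cloud's EC2-compatible front end instead of Amazon.
region = RegionInfo(name="sciencecloud", endpoint="cloud.example.org")
conn = EC2Connection(aws_access_key_id="ACCESS_KEY",
                     aws_secret_access_key="SECRET_KEY",
                     is_secure=False,
                     port=8444,            # placeholder port for the cloud front end
                     region=region)

# Ask for one instance of a pre-registered VO image (placeholder id).
reservation = conn.run_instances("ami-00000001", min_count=1, max_count=1,
                                 instance_type="m1.small")
for instance in reservation.instances:
    print(instance.id, instance.state)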
3.2 STAR and Condor VM:
Condor, the most widely used LRM on OSG, has recently added a virtual
machine universe that offers VMware and KVM support. In this universe the virtual
machine is a regular Condor job. Matchmaking is used to target a machine with a
hypervisor. The image is transferred to the hypervisor before instantiation and back
once the VM is shut down, leading to a potentially high cost in data transfer.
Originally the job was pre-staged in the VM and executed via a startup script run at
boot time; this had the downside that retrieving job output was cumbersome.
STAR tested the Condor VM universe on the GLOW resources at the University of
Wisconsin. A special job broker was set up to be able to send the jobs within the
instantiated VMs.
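A hedged sketch of such a VM-universe submission, written here as a small Python helper, is shown below; the image name and memory figure are placeholders, and the exact vm-universe keywords vary between Condor versions, so the condor_submit manual should be consulted.

#!/usr/bin/env python
# Hypothetical helper that submits a VM as a regular Condor job through the
# vm universe. Image name and memory are placeholders; exact vm-universe
# keywords differ between Condor versions (see the condor_submit manual).
# Matchmaking then targets a hypervisor-enabled machine of the requested type.
import subprocess

submit_text = """\
universe      = vm
vm_type       = kvm
vm_memory     = 1024
vm_networking = true
vm_disk       = star-worker.qcow2:vda:w
queue
"""

with open("star_vm.sub", "w") as f:
    f.write(submit_text)

subprocess.check_call(["condor_submit", "star_vm.sub"])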
3.3 STAR and Clemson University:
At Clemson, a slightly different model was used. The main motivation was to
virtualize the worker nodes of a regular OSG site and to do so in a manner totally
transparent to the users. The CE was modified to automatically instantiate
VMs based on job requests, as well as to target the VMs of specific VOs.
The job manager was extended to add a job attribute that matches the job to the
VMs of the VO that submitted it. This technique has the advantage that users keep
using their usual workflow while being guaranteed execution in their VO's virtual
machine. STAR successfully tested this mode of operation [8].
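A minimal sketch of this attribute-based matching, with hypothetical attribute names (SubmitterVO, VMImageVO) rather than the ones actually used at Clemson, could look like this:

#!/usr/bin/env python
# Hypothetical sketch of the job-manager extension: add a custom attribute and a
# requirements expression so a job only matches slots backed by its own VO's VM.
# The attribute names SubmitterVO and VMImageVO are illustrative, not the actual
# names used at Clemson.

def add_vo_matching(submit_lines, vo_name):
    submit_lines.append('+SubmitterVO = "%s"' % vo_name)
    submit_lines.append('requirements = (TARGET.VMImageVO == "%s")' % vo_name)
    return submit_lines

# A STAR job is thereby pinned to worker-node slots advertising VMImageVO = "star".
print("\n".join(add_vo_matching(["universe = vanilla",
                                 "executable = analysis.sh"], "star")))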
3.4 Magellan and Future Grid:
While the last three projects have focused on a single VO investigating different cloud
models within OSG, it is worth noting that large Cloud resources and testbeds are
being deployed in the US. These resources and testbeds are available to the OSG
community and will further inform any architecture changes needed to support a
cloud model.
Magellan [9] is a Department of Energy project that is deploying two large
cloud resources, one at NERSC and one at ANL. With approximately 4,000 cores each,
these resources will test cloud systems such as Nimbus and Eucalyptus, with the aim
of studying the scheduling of resources for on-demand and elastic use.
FutureGrid [10], on the other hand, is an experimental testbed integrated with the
TeraGrid but open to all, including OSG. FutureGrid aims to explore new computing
models beyond Clouds; the main driver is on-demand service composition
and provisioning, where services are not only compute resources but also software,
data and visualization tools. As such it extends the concept of IaaS.
Fig. 4 OSG use of Cloud resources via the Engage VO. The OSG matchmaker is being extended to lease
resources via cloud interfaces from providers outside the regular OSG consortium.
OSG sees both Magellan and FutureGrid as partners in the exploration of Cloud
computing and as a means to inform the future architecture of OSG. Fig. 4
shows how the Engage VO is expanding its job submission framework to
interoperate with cloud providers such as Magellan, FutureGrid and future campus
clouds.
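As an illustration of the elastic leasing path in Fig. 4, a submission framework could watch its idle-job count and lease extra worker nodes through an EC2-style interface; the thresholds, credentials and image identifier below are placeholders, and the actual Engage matchmaker extension may be structured differently.

#!/usr/bin/env python
# Hypothetical leasing loop: when too many jobs sit idle, lease additional worker
# nodes from a cloud provider through its EC2-style interface. Thresholds,
# credentials and image id are placeholders; the real Engage extension may differ.
import time
import boto

IDLE_THRESHOLD = 50      # lease when more jobs than this are idle (placeholder policy)
MAX_LEASED = 20          # upper bound on leased instances (placeholder policy)

def count_idle_jobs():
    # Placeholder: in practice this would query the matchmaker or schedd.
    return 0

conn = boto.connect_ec2("ACCESS_KEY", "SECRET_KEY")   # or an EC2-compatible endpoint
leased = []

while True:
    if count_idle_jobs() > IDLE_THRESHOLD and len(leased) < MAX_LEASED:
        reservation = conn.run_instances("ami-00000001", instance_type="m1.small")
        leased.extend(reservation.instances)   # the new VM joins the pool as a worker node
    time.sleep(60)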
3.5 ATLAS/CMS Tier 3:
Finally, the addition of Tier-3 sites within OSG has stimulated interest in using
virtualization to provide sites that are easy to deploy and manage. Consolidating the
services of Tier-3 sites onto pre-configured virtual machines is seen as a key enabler
for these sites, reducing the time to deploy services and automating the creation of a site.
Several prototypes are currently under investigation. One of them uses Eucalyptus
[11] to offer a service that instantiates complete Tier-3s, while another uses the
Condor Job Router and Condor Hawkeye to dynamically create worker nodes on
EC2. Of interest to both VOs is the CernVM [12] project, which creates appliances for
multiple hypervisors and self-configures the appropriate experiment software.
4. References
[1] L. Vaquero, L. Rodero-Merino, J. Caceres and M. Lindner, "A Break in the Clouds:
Towards a Cloud Definition," ACM SIGCOMM Computer Communication Review, Vol. 39,
pp. 50-55, January 2009.
[2] M.-E. Bégin, "An EGEE Comparative Study: Grids and Clouds," January 2008.
[3] R. J. Figueiredo, P. A. Dinda, and J. A. B Fortes, “A Case for Grid Computing on
Virtual Machines,” 23rd International Conference on Distributed Computing
Systems (ICDCS 2003), Providence, Rhode Island, May 19-22, 2003, pp. 550-559.
[4] M. A. Murphy, L. Abraham, M. Fenn and S. Goasguen, "Autonomic Clouds on the
Grid," Journal of Grid Computing, Vol. 8, No. 1, pp. 1-18, March 2010.
[5] [On-line] http://indico.fnal.gov/conferenceDisplay.py?confId=2012
[6] [On-line] http://indico.fnal.gov/conferenceDisplay.py?confId=2871
[7] [On-line] http://www.nimbusproject.org
[8] M. Fenn, J. Lauret and S. Goasguen, "Contextualization in Practice: The Clemson
Experience," ACAT 2010, Jaipur, India, February 2010.
[9] [On-line] http://www.nersc.gov/nusers/systems/magellan/
[10] [On-line] http://futuregrid.org/
[11] [On-line] http://open.eucalyptus.com
[12] [On-line] http://cernvm.cern.ch