Cloud Computing: Clouds for Science, Virtualization
Dan C. Marinescu
Computer Science Division, EECS Department, University of Central Florida
Email: dcm@cs.ucf.edu

Contents

Last time
- Service Level Agreements
- Software licensing
- Basic architecture of cloud platforms
- Open-source platforms for cloud computing: Eucalyptus, Nebula, Nimbus
- Cloud applications: existing and new applications; coordination and ZooKeeper
- The Map-Reduce programming model; the GrepTheWeb application

Today
- Clouds for science and engineering
- Layering and virtualization
- Virtual machines
- Virtual machine monitors
- Performance isolation; security isolation
- Full virtualization and paravirtualization

Clouds for science and engineering

In virtually all areas of science one has to: manage very large volumes of data; build and execute models; integrate data and the literature; document the experiments; share the data with others; and preserve it for a long time.

A typical problem is data discovery in large scientific data sets. Examples: biomedical and genomic data at NCBI (National Center for Biotechnology Information), astrophysics data from NASA, atmospheric data from NOAA and NCAR.

Online data discovery can be viewed as an ensemble of several phases: (i) recognition of the information problem; (ii) generation of search queries using one or more search engines; (iii) evaluation of the search results; (iv) evaluation of the Web documents; and (v) comparison of information from different sources.

Comparative benchmark: EC2 versus supercomputers

NERSC (National Energy Research Scientific Computing Center), located at Lawrence Berkeley National Laboratory, has some 3,000 users working on 400 projects and using some 600 codes.

Some of the codes used:
1. CAM (Community Atmosphere Model), the atmospheric component of CCSM (Community Climate System Model), used for weather and climate modeling. The code uses two two-dimensional domain decompositions, one for the dynamics and the other for re-mapping (a sketch of such a block decomposition follows this list). It is communication-intensive.
2. GAMESS, used for ab initio quantum chemistry calculations. On the Cray XT4 it uses MPI, and only one half of the processors compute while the other half are data movers. The program is memory- and communication-intensive.
3. GTC, used for fusion research; a non-spectral Poisson solver, memory-intensive.
4. IMPACT-T, used for the prediction and performance enhancement of particle accelerators; sensitive to memory bandwidth and to MPI collective performance.
5. MAESTRO, a low Mach number hydrodynamics code for simulating astrophysical flows. Parallelization is via a three-dimensional domain decomposition using a coarse-grained distribution strategy to balance the load and minimize communication costs.
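To make the notion of a block domain decomposition concrete, here is a minimal C sketch (written for this lecture, not taken from any of the codes above; the grid size, the process-grid shape, and all names are illustrative assumptions) that computes the rectangular sub-domain owned by each rank in a PX x PY process grid.

```c
#include <stdio.h>

/* Illustrative 2D block decomposition: an NX x NY grid is split
 * among PX x PY processes; each rank owns one rectangular block.
 * All sizes and names are made up for the example. */
enum { NX = 1024, NY = 768, PX = 4, PY = 3 };

static void block_range(int n, int p, int r, int *lo, int *hi) {
    /* Split n points among p parts; parts 0..(n%p - 1) get one extra. */
    int base = n / p, extra = n % p;
    *lo = r * base + (r < extra ? r : extra);
    *hi = *lo + base + (r < extra ? 1 : 0);   /* half-open interval */
}

int main(void) {
    for (int rank = 0; rank < PX * PY; rank++) {
        int rx = rank % PX, ry = rank / PX;   /* position in process grid */
        int x0, x1, y0, y1;
        block_range(NX, PX, rx, &x0, &x1);
        block_range(NY, PY, ry, &y0, &y1);
        printf("rank %2d owns x[%4d,%4d) x y[%4d,%4d)\n",
               rank, x0, x1, y0, y1);
    }
    return 0;
}
```

Codes parallelized this way exchange halo data along the edges of their blocks on every step, which is one reason codes such as CAM are communication-intensive and therefore sensitive to interconnect quality on a cloud platform.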
Codes (cont'd)
1. MILC (MImd Lattice Computation) is a QCD (Quantum Chromodynamics) code used to study the ``strong'' interactions binding quarks into protons and neutrons and holding them together in the nucleus. The CG (Conjugate Gradient) method is used to solve a sparse, nearly-singular matrix problem. Many CG iteration steps are required for convergence; the inversion translates into three-dimensional complex matrix-vector multiplications. Each multiplication requires a dot product of three pairs of three-dimensional complex vectors; a dot product consists of five multiply-add operations and one multiply. The code is highly memory- and compute-intensive and is heavily dependent on pre-fetching.
2. PARATEC (PARAllel Total Energy Code) is a quantum mechanics code; it performs ab initio total energy calculations using pseudo-potentials, a plane wave basis set, and an all-band (unconstrained) conjugate gradient (CG) approach. Parallel three-dimensional FFTs transform the wave functions between real and Fourier space. The FFT dominates the runtime; the code uses MPI and is communication-intensive.

The systems used for comparison with EC2
1. Carver - a 400-node IBM iDataPlex cluster with quad-core Intel Nehalem processors at 2.67 GHz and with 24 GB of RAM (3 GB/core). Each node has two sockets; a single Quad Data Rate (QDR) InfiniBand link connects each node to a network that is locally a fat tree with a global two-dimensional mesh. The codes were compiled with the Portland Group suite version 10.0 and Open MPI version 1.4.1.
2. Franklin - a 9,660-node Cray XT4; each node has a single quad-core 2.3 GHz AMD Opteron ``Budapest'' processor with 8 GB of RAM (2 GB/core). Each processor is connected through a 6.4 GB/s bidirectional HyperTransport interface to the interconnect via a Cray SeaStar-2 ASIC. The SeaStar routing chips are interconnected in a three-dimensional torus topology in which each node has a direct link to its six nearest neighbors. Codes were compiled with the Pathscale compilers or with the Portland Group suite version 9.0.4.
3. Lawrencium - a 198-node (1,584-core) Linux cluster; a compute node is a Dell PowerEdge 1950 server with two quad-core 64-bit 2.66 GHz Intel Xeon Harpertown processors and 16 GB of RAM (2 GB/core). A compute node is connected to a Dual Data Rate (DDR) InfiniBand network configured as a fat tree with a 3:1 blocking factor. Codes were compiled using Intel 10.0.018 and Open MPI 1.3.3.

The benchmarks used for comparison
1. DGEMM - measures the floating-point performance of a processor/core; the memory bandwidth does little to affect the results, as the code is cache-friendly. The results of the benchmark are close to the theoretical peak performance of the processor.
2. STREAM - measures the memory bandwidth (a minimal version of its triad kernel is sketched below).
3. The network latency benchmark - measures the latency of communication between nodes.
4. HPL - a software package that solves a (random) dense linear system in double precision arithmetic on distributed-memory computers; it is a portable and freely available implementation of the High Performance Computing Linpack Benchmark.
5. FFTE - measures the floating-point rate of execution of a double precision complex one-dimensional DFT (Discrete Fourier Transform).
6. PTRANS - parallel matrix transpose; it exercises the communication pattern in which pairs of processors communicate with each other simultaneously. It is a useful test of the total communication capacity of the network.
7. RandomAccess - measures the rate of integer random updates of memory (GUPS, giga updates per second).

The benchmark results
[Figure: measured results of the benchmarks above on EC2 and on Carver, Franklin, and Lawrencium.]
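As an illustration of what a memory-bandwidth benchmark measures, here is a minimal C sketch in the spirit of the STREAM triad kernel. This is a simplification written for the lecture, not the official STREAM code; the array size and the scalar are arbitrary assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* A STREAM-triad-like kernel: a[i] = b[i] + s * c[i].
 * Each iteration moves 3 doubles (24 bytes), so the achieved
 * bandwidth is roughly bytes moved / elapsed time.
 * N is chosen to be much larger than the caches. */
enum { N = 1 << 24 };            /* 16M doubles per array, ~134 MB each */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double s = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];                /* the triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbs = 3.0 * N * sizeof(double) / sec / 1e9;
    printf("triad: %.3f s, ~%.2f GB/s (a[0]=%g)\n", sec, gbs, a[0]);
    free(a); free(b); free(c);
    return 0;
}
```

Compiled with optimization (e.g., cc -O2) and run on an otherwise idle node, such a kernel approaches the memory bandwidth of the socket; the official STREAM benchmark adds repetitions, validation, and more careful timing.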
Why is virtualization important?

Resource management for a community of users with a wide range of applications running under different operating systems is very difficult. Resource management becomes even more complex when resources are oversubscribed and users are uncooperative.

The obvious solution for an organization providing utility computing is to install standard operating systems on individual systems and to rely on conventional OS techniques to ensure resource sharing, application protection, and performance isolation. In this setup, however, system administration, accounting, and security are very challenging for the providers of the service, while application development and performance optimization are equally challenging for the users.

The alternative is resource virtualization.

Resource virtualization

Virtualization simulates the interface to a physical object by:
1. Multiplexing - create multiple virtual objects from one instance of a physical object; for example, a processor is multiplexed among a number of threads.
2. Aggregation - create one virtual object from multiple physical objects; for example, a number of physical disks are aggregated into a RAID disk (a sketch follows below).
3. Emulation - construct a virtual object from a different type of physical object; for example, a physical disk emulates a Random Access Memory.
4. Multiplexing and emulation combined. Examples: virtual memory with paging multiplexes real memory and disk, and a virtual address emulates a real address; the TCP protocol emulates a reliable bit pipe and multiplexes a physical communication channel and a processor.

Virtualization abstracts the underlying resources and simplifies their use, isolates users from one another, and supports replication which, in turn, increases the elasticity of the system. Virtualization is a critical aspect of cloud computing, equally important for the providers and for the consumers of cloud services.
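To make the aggregation idea concrete, the following minimal C sketch shows the address arithmetic behind RAID-0 striping: one virtual disk is assembled from several physical disks by mapping each logical block to a (disk, block) pair. The disk count and the stripe layout are illustrative assumptions, not a description of any particular RAID implementation.

```c
#include <stdio.h>

/* RAID-0 aggregation: NDISKS physical disks are presented as one
 * virtual disk.  Logical block b lives on disk b % NDISKS at
 * physical block b / NDISKS.  (Real arrays stripe in multi-block
 * chunks and may add parity; this shows only the core mapping.) */
enum { NDISKS = 4 };

struct location { int disk; long block; };

static struct location map_block(long logical)
{
    struct location loc = { (int)(logical % NDISKS), logical / NDISKS };
    return loc;
}

int main(void)
{
    for (long b = 0; b < 10; b++) {
        struct location loc = map_block(b);
        printf("logical block %2ld -> disk %d, physical block %ld\n",
               b, loc.disk, loc.block);
    }
    return 0;
}
```

Consecutive logical blocks land on different disks, so large sequential transfers proceed on all disks in parallel; that is what makes the aggregated virtual disk faster than any of its physical components.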
Why is virtualization important for cloud computing?

- System security - virtualization allows isolation of services running on the same hardware.
- Performance and reliability - it allows applications to migrate from one platform to another.
- Lower costs - it simplifies the development and management of the services offered by a provider.
- Performance isolation - it allows the isolation of applications from one another and thus more predictable performance.
- User convenience - a major advantage of a virtual machine architecture over a traditional operating system. For example, a user of Amazon Web Services (AWS) can submit an Amazon Machine Image (AMI) containing the applications, libraries, data, and associated configuration settings, choose the operating system for the application, and then start, terminate, and monitor as many instances of the AMI as needed, using the Web service APIs and the performance monitoring and management tools provided by AWS.

Side effects of virtualization

- Performance penalty - all privileged operations of a virtual machine must be trapped and validated by the Virtual Machine Monitor which, ultimately, controls the behavior of the system; the increased overhead has a negative impact on performance.
- Increased hardware costs - the hardware for a virtual machine costs more than that for a system running a traditional operating system, because the physical hardware is shared among a set of guest operating systems and is typically configured with faster and/or multi-core processors, more memory, larger disks, and additional network interfaces.
- Architectural support for multi-level control - resource sharing in a virtual machine environment requires not only ample hardware support, in particular powerful processors, but also architectural support for multi-level control. Indeed, resources such as CPU cycles, memory, secondary storage, and I/O and communication bandwidth are shared among several virtual machines, and within each virtual machine resources must be shared among multiple instances of an application.

Layering and interfaces between layers in a computer system

The software components include applications, libraries, and the operating system; they interact with the hardware via several interfaces:
- Application Programming Interface (API) - gives an application access to the hardware at the level of a high-level language; it includes HLL library calls, which often invoke system calls.
- Application Binary Interface (ABI) - allows the ensemble consisting of the application and the library modules to access the hardware; the ABI does not include privileged system instructions; instead, it invokes system calls.
- Instruction Set Architecture (ISA) - defines the set of instructions the hardware was designed to execute.
The hardware consists of one or more multicore processors, a system interconnect (e.g., one or more busses), a memory translation unit, the main memory, and I/O devices, including one or more networking interfaces.

Portability

The binaries created by a compiler for a specific ISA and operating system are not portable; they cannot run on a computer with a different ISA, or on a computer with the same ISA but a different operating system. It is possible, though, to compile an HLL program for a virtual machine (VM) environment: portable code is produced and distributed, and then converted by binary translators to the ISA of the host system. Dynamic binary translation converts blocks of guest instructions from the portable code to host instructions and leads to a significant performance improvement, as such blocks are cached and reused (a sketch of such a translation cache follows below).

Two strategies to create memory images from High-Level Language (HLL) code:
1. HLL code can be translated for a specific architecture and operating system.
2. HLL code can be compiled into portable code, and the portable code then translated for systems with different ISAs.
The code shared is the object code in the first case and the portable code in the second case.
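The performance benefit of dynamic binary translation comes from caching translated blocks. The following minimal C sketch, invented for this lecture, fakes the "translation" as a string and uses a deliberately simple hash scheme; only the lookup-translate-cache-reuse loop is the point.

```c
#include <stdio.h>

/* Toy dynamic-binary-translation cache: blocks of "guest" code,
 * identified by their guest program counter, are translated once,
 * cached, and reused on every later visit. */
enum { CACHE_SLOTS = 64 };

struct tblock {
    unsigned long guest_pc;   /* key: address of the guest block   */
    int valid;
    char host_code[32];       /* stand-in for translated host code */
};

static struct tblock cache[CACHE_SLOTS];
static int translations;      /* how many times we actually translated */

static struct tblock *lookup_or_translate(unsigned long pc)
{
    struct tblock *tb = &cache[pc % CACHE_SLOTS];
    if (!tb->valid || tb->guest_pc != pc) {       /* cache miss */
        tb->guest_pc = pc;
        tb->valid = 1;
        snprintf(tb->host_code, sizeof tb->host_code,
                 "host-block@%#lx", pc);          /* "translate" */
        translations++;
    }
    return tb;
}

int main(void)
{
    /* A guest loop revisits the same three blocks many times. */
    unsigned long trace[] = { 0x400, 0x404, 0x408 };
    for (int iter = 0; iter < 1000; iter++)
        for (int i = 0; i < 3; i++)
            (void)lookup_or_translate(trace[i])->host_code;

    printf("3000 block executions, %d translations\n", translations);
    return 0;
}
```

Each block is translated on its first visit; the 2,997 subsequent executions hit the cache, which is where the speedup of dynamic translation over pure interpretation comes from.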
Virtual machines

A Virtual Machine (VM) is an isolated environment that appears to be a whole computer but actually only has access to a portion of the computer's resources. Each virtual machine appears to be running on the bare hardware, giving the appearance of multiple instances of the same computer, though all are supported by a single physical system.

Modern operating systems such as Linux Vserver, OpenVZ (Open VirtualiZation), FreeBSD Jails, and Solaris Zones, based on Linux, Linux, FreeBSD, and Solaris, respectively, implement operating-system-level virtualization technologies. They allow a physical server to run multiple isolated operating system instances, known as containers, Virtual Private Servers (VPSs), or Virtual Environments (VEs). These systems claim performance advantages over systems based on a Virtual Machine Monitor such as Xen or VMware; it is reported that there is only a 1% - 3% performance penalty for OpenVZ relative to a standalone Linux server. OpenVZ is licensed under version 2 of the GPL.

The performance of applications under virtual machines

Generally, virtualization adds some level of overhead and negatively affects performance. In some cases, however, an application running under a virtual machine performs better than one running under a classical OS; this is the case under a policy called cache isolation. The cache is generally not partitioned equally among processes running under a classical OS, as one process may use the cache space better than another; for example, given two processes, one write-intensive and the other read-intensive, the cache may be aggressively filled by the first. Under the cache isolation policy the cache is divided between the VMs, and it is beneficial to run workloads competing for cache in two different VMs.

The I/O performance of an application running under a virtual machine depends on factors such as the disk partition used by the VM, the CPU utilization, the I/O performance of the competing VMs, and the I/O block size. On the Xen platform, discrepancies between the optimal choice and the default are as high as 8% to 35%.

Process, system, and application VMs

- Process VM - a virtual platform created for an individual process and destroyed once the process terminates. Virtually all operating systems provide a process VM for each one of the applications they run, but the more interesting process VMs are those which support binaries compiled for a different instruction set (a toy emulator loop illustrating this idea follows below).
- System VM - provides a complete system; each VM can run its own OS, which in turn can run multiple applications.
- Application VM - the VM runs under the control of a normal OS and provides a platform-independent host for a single application, e.g., the Java Virtual Machine (JVM).
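A process VM that runs binaries compiled for a different instruction set is, at its simplest, an interpreter: a fetch-decode-execute loop over the guest's instructions. The sketch below is a minimal C emulator for a made-up three-instruction accumulator ISA; the opcodes and the sample program are inventions for this lecture, not any real instruction set.

```c
#include <stdio.h>

/* A toy "guest ISA": one accumulator, three opcodes.
 * Each instruction is an opcode byte followed by an operand byte. */
enum { OP_LOADI = 0, OP_ADDI = 1, OP_HALT = 2 };

static int run(const unsigned char *prog)
{
    int acc = 0;                        /* the guest accumulator  */
    int pc  = 0;                        /* the guest program counter */
    for (;;) {
        unsigned char op  = prog[pc++];        /* fetch           */
        unsigned char arg = prog[pc++];
        switch (op) {                          /* decode + execute */
        case OP_LOADI: acc  = arg; break;      /* acc = imm        */
        case OP_ADDI:  acc += arg; break;      /* acc = acc + imm  */
        case OP_HALT:  return acc;
        default:
            fprintf(stderr, "illegal opcode %d at %d\n", op, pc - 2);
            return -1;
        }
    }
}

int main(void)
{
    /* Guest program: acc = 40; acc += 2; halt. */
    const unsigned char guest[] = { OP_LOADI, 40, OP_ADDI, 2, OP_HALT, 0 };
    printf("guest returned %d\n", run(guest));   /* prints 42 */
    return 0;
}
```

A dynamic binary translator such as the one sketched earlier replaces this per-instruction dispatch with cached blocks of host code, which is why it is so much faster than pure interpretation.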
Virtual machine monitors (VMMs)

A Virtual Machine Monitor is the control software allowing several virtual machines to share a system. Several approaches are possible:
- Traditional VM - a thin software layer that runs directly on the host machine hardware; its main advantage is performance. Examples: VMware ESX and ESXi Servers, Xen, OS/370, and Denali.
- Hybrid - the VMM shares the hardware with an existing OS. Example: VMware Workstation.
- Hosted - the VMM runs on top of an existing OS. Example: User-mode Linux. Its main advantage is that it is easier to build and install; in addition, the VMM can use several components of the host OS, such as the scheduler, the pager, and the I/O drivers, rather than providing its own. The price to pay for this simplicity is increased overhead and the associated performance penalty: I/O operations, page faults, and scheduling requests from a guest OS are not handled directly by the VMM; instead, they are passed to the host OS. The performance penalty, as well as the challenge of supporting complete isolation of VMs, makes this solution less attractive for servers in a cloud computing environment.

[Figure: (a) A taxonomy of process and system VMs for the same and for different Instruction Set Architectures (ISAs); Traditional, Hybrid, and Hosted are the three classes of VMs for systems with the same ISA. (b) Traditional VM: the VMM supports multiple virtual machines and runs directly on the hardware. (c) Hybrid VM: the VMM shares the hardware with a host operating system and supports multiple virtual machines. (d) Hosted VM: the VMM runs under a host operating system.]

The interaction of the VMM with the guest OS

The VMM (hypervisor) is software that securely partitions the resources of a computer system into one or more virtual machines. It runs in kernel mode and allows:
- multiple services to share the same platform;
- live migration, the movement of a server from one platform to another;
- system modification while maintaining backward compatibility with the original system.

A guest operating system is an OS that runs under the control of a VMM rather than directly on the hardware; it runs in user mode.

The functions of a VMM:
- it monitors system performance and takes corrective actions to avoid performance degradation;
- when a guest OS attempts to execute a privileged instruction, it traps the operation and enforces its correctness and safety;
- it guarantees the isolation of the individual virtual machines and thus ensures security and encapsulation, a major concern in cloud computing.

A VMM virtualizes the CPU and the memory

The VMM traps interrupts and dispatches them to the individual guest operating systems; if a guest OS disables interrupts, the VMM buffers such interrupts until the guest OS enables them.

The VMM maintains a shadow page table for each guest OS and replicates any modification made by the guest OS in its own shadow page table; the shadow page table points to the actual page frames and is used by the hardware component called the Memory Management Unit (MMU) for dynamic address translation. VMMs control the virtual memory management and decide what pages to swap out; memory virtualization has important implications for performance. VMMs use a range of optimization techniques; for example, VMware systems avoid page duplication among virtual machines, maintaining only one copy of a shared page and using copy-on-write policies, while Xen imposes total isolation of the VMs and does not allow page sharing.
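The following minimal C sketch simulates the shadow page table bookkeeping described above. It is a simulation written for this lecture, not code from any real VMM: the page-table formats, sizes, and function names are all invented. When the "guest OS" updates its own page table, the "VMM" intercepts the write and mirrors it in the shadow table, after mapping the guest's notion of a physical frame to a real machine frame.

```c
#include <stdio.h>

/* Toy shadow paging: guest "physical" frames are an illusion; the
 * VMM maps them to machine frames and keeps a shadow page table,
 * which is what the (simulated) MMU actually walks. */
enum { NPAGES = 8, NFRAMES = 8 };

static int guest_pt[NPAGES];    /* guest page -> guest frame    */
static int shadow_pt[NPAGES];   /* guest page -> machine frame  */
static int g2m[NFRAMES];        /* guest frame -> machine frame */

/* The VMM traps guest page-table writes and mirrors them. */
static void vmm_trap_pt_write(int page, int gframe)
{
    guest_pt[page]  = gframe;          /* what the guest believes */
    shadow_pt[page] = g2m[gframe];     /* what the MMU will use   */
}

static int mmu_translate(int page) { return shadow_pt[page]; }

int main(void)
{
    /* The VMM happens to scatter guest frames across machine memory. */
    for (int f = 0; f < NFRAMES; f++)
        g2m[f] = (f * 5 + 3) % NFRAMES;

    vmm_trap_pt_write(0, 1);   /* guest maps page 0 -> its frame 1 */
    vmm_trap_pt_write(1, 4);   /* guest maps page 1 -> its frame 4 */

    for (int p = 0; p < 2; p++)
        printf("page %d: guest frame %d, machine frame %d\n",
               p, guest_pt[p], mmu_translate(p));
    return 0;
}
```

The guest never learns which machine frames it really owns; keeping the two views consistent on every guest page-table write is exactly the overhead that hardware-assisted and paravirtualized memory management try to reduce.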
[Figure: the steps in handling a page fault. The Virtual Memory Manager translates the PC of application thread 1 into a (page #, displacement) pair; on a page fault the Exception Handler saves the PC and the Multi-Level Memory Manager brings in the missing page, first writing the displaced block back to secondary storage if its dirty bit is set. The Scheduler runs application thread 2 while thread 1 waits; when the I/O operation completes, thread 1 resumes.]

Performance and security isolation

Performance isolation is a critical condition for the Quality of Service guarantees in shared environments: if the run-time behavior of an application is affected by other applications running concurrently and competing with it for CPU cycles, cache, main memory, and disk and network access, it is rather difficult to predict its completion time. It is equally difficult to optimize the application.

Operating systems use the process abstraction not only for resource sharing but also to support isolation; unfortunately, once a process is compromised, it is rather easy for an attacker to penetrate the entire system. The software running on a virtual machine, by contrast, can only access the virtual devices emulated by the software, so a virtual machine has the potential to provide a level of security isolation nearly equivalent to the isolation offered by two different physical systems. Thus virtualization can be used to improve security.

The security vulnerability of VMMs is considerably reduced because a VMM is a much simpler and better-specified system than a traditional operating system; for example, the Xen VMM has approximately 60,000 lines of code, while the Denali VMM has only about half of that, some 30,000.

Architectural support for virtualization

In 1974 Gerald J. Popek and Robert P. Goldberg gave a set of sufficient conditions for a computer architecture to support virtualization and allow a Virtual Machine Monitor to operate efficiently:
1. A program running under the VMM should exhibit a behavior essentially identical to that demonstrated when running directly on an equivalent machine.
2. The VMM should be in complete control of the virtualized resources.
3. A statistically significant fraction of machine instructions must be executed without the intervention of the VMM.

Another criterion allows us to identify an architecture suitable for a virtual machine by distinguishing two classes of machine instructions:
- sensitive instructions, of two types: control-sensitive, instructions that attempt to change the memory allocation or the privileged mode, and mode-sensitive, instructions whose behavior differs in the privileged mode;
- innocuous instructions, those that are not sensitive.

Full virtualization versus paravirtualization

Full virtualization - each virtual machine runs on an exact copy of the actual hardware; example: the VMware VMMs. This execution mode is efficient because the hardware is fully exposed to the guest OS, which runs unchanged.

Paravirtualization - each virtual machine runs on a slightly modified copy of the actual hardware; examples: Xen and Denali. The guest OS must be modified to run under the VMM, and the guest OS code must be ported for individual hardware platforms. The reasons why paravirtualization is often adopted: some aspects of the hardware cannot be virtualized; it improves performance; it presents a simpler interface. (A sketch contrasting the two approaches follows below.)

[Figure: (a) Full virtualization requires the hardware abstraction layer of the guest OS to have some knowledge about the hardware. (b) Paravirtualization avoids this requirement and allows full compatibility at the Application Binary Interface (ABI).]
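The operational difference between the two approaches can be caricatured in a few lines of C. This is a schematic simulation written for this lecture; it does not use any real VMM or Xen API, and the cycle counts are pretend numbers chosen only to make the difference visible. Under full virtualization the unmodified guest issues a privileged operation, which faults and is emulated by the VMM; under paravirtualization the modified guest calls the hypervisor explicitly, avoiding the cost of the trap.

```c
#include <stdio.h>

/* Schematic comparison of full virtualization and paravirtualization.
 * "Privileged state" is a single variable; the costs are invented. */
static int privileged_state;
static long cycles;

static void vmm_emulate(int v)       /* VMM work common to both paths */
{
    privileged_state = v;
    cycles += 100;                   /* validating + emulating         */
}

/* Full virtualization: the guest executes the instruction as if on
 * bare hardware; it faults, and the VMM decodes and emulates it.    */
static void guest_full(int v)
{
    cycles += 2000;                  /* world switch caused by the trap */
    vmm_emulate(v);                  /* VMM decodes the faulting insn   */
}

/* Paravirtualization: the guest OS was modified to call the
 * hypervisor directly -- a "hypercall" -- so no fault is taken.      */
static void guest_para(int v)
{
    cycles += 300;                   /* cheaper explicit transition     */
    vmm_emulate(v);
}

int main(void)
{
    cycles = 0;
    for (int i = 0; i < 1000; i++) guest_full(i);
    printf("full virtualization : %ld pretend cycles\n", cycles);

    cycles = 0;
    for (int i = 0; i < 1000; i++) guest_para(i);
    printf("paravirtualization  : %ld pretend cycles\n", cycles);
    return 0;
}
```

On the pre-hardware-assist x86, moreover, some sensitive instructions do not trap at all when executed in user mode, so the full-virtualization path cannot even be implemented cleanly; this is precisely why the designers of Xen, discussed next, chose paravirtualization.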
Xen - a VMM based on paravirtualization

Xen is a Virtual Machine Monitor (VMM) developed at the University of Cambridge, UK, in the early 2000s. The goal was to design a VMM capable of scaling to about 100 virtual machines running standard applications and services without any modifications to the Application Binary Interface (ABI). Fully aware that the x86 architecture does not support full virtualization efficiently, the designers of Xen opted for paravirtualization.

Several operating systems, including Linux, Minix, NetBSD, FreeBSD, NetWare, and OZONE, can operate as paravirtualized Xen guest operating systems running on IA-32, x86-64, Itanium, and ARM architectures. Since 2010 Xen has been free software, developed by the community of users and licensed under the GNU General Public License (GPLv2).

[Figure: Xen for the x86 architecture; in the original Xen implementation a guest OS could be XenoLinux, XenoBSD, or XenoXP.]

[Figure: paravirtualization strategies for virtual memory management, CPU multiplexing, and I/O devices in the original x86 Xen implementation.]