Duke Systems
Intro to Clouds
Jeff Chase
Dept. of Computer Science, Duke University

Part 1 VIRTUAL MACHINES

The story so far: OS platforms
• OS platforms let us run programs in contexts.
• Contexts are protected/isolated to varying degrees.
• The OS platform TCB offers APIs to create and manipulate protected contexts.
– It enforces isolation of contexts for running programs.
– It governs access to hardware resources.
• Classical example:
– Unix context: process
– Unix TCB: kernel
– Unix kernel API: syscalls

The story so far: layered platforms
• We can layer "new" platforms on "old" ones.
– The outer layer hides the inner layer,
– covering the inner APIs and abstractions, and
– replacing them with the model of the new platform.
• Example: Android over Linux
[figure: the Android platform (AMS, JVM+lib) layered over the Linux kernel]

Native virtual machines (VMs)
• Slide a hypervisor underneath the kernel.
– New OS/TCB layer: virtual machine monitor (VMM).
• Kernel and processes run in a virtual machine (VM).
– The VM "looks the same" to the OS as a physical machine.
– The VM is a sandboxed/isolated context for an entire OS.
• A VMM can run multiple VMs on a shared computer.
[figure: a host running a hypervisor/VMM with three guest (tenant) VM contexts, each with its own OS kernel and processes, e.g., guest VM1 with OS kernel 1 and process P1A]

What is a "program" for a VM?
The VMM/hypervisor is a new layer of OS platform, with a new kind of protected context. What kind of program do we launch into a VM context? It is called a virtual appliance or VM image. A VM is called an instance of the image. A virtual appliance contains a complete OS system image, with file tree and apps.
[figure: apps and a guest kernel stacked above the hypervisor/VMM; graphics are from rPath Inc. and VMware Inc.]

Thank you, VMware

Motivation: support multiple OS

When virtual is better than real: everyone plays nicely together
[image from virtualbox.org]

The story so far: protected CPU mode
Any kind of machine exception transfers control to a registered (trusted) kernel handler running in a protected CPU mode.
[figure: user-mode execution (u-start ... u-return) interrupted by syscall traps, faults, and clock interrupts, which enter the kernel "top half" (trap/fault handlers) or kernel "bottom half" (interrupt handlers) in kernel mode, then return to user mode]
The kernel handler manipulates the CPU register context to return to a selected user context.

A closer look
[figure: the same transitions in more detail, showing the user stacks, the per-process kernel stacks, the syscall handler dispatch table, and the boot path]

IA/x86 Protection Rings (CPL)
• Modern CPUs have multiple protected modes: CPU Privilege Level (CPL).
• History: IA/x86 rings (CPL)
– Has built-in security levels (Rings 0, 1, 2, 3)
– Ring 0 – "Kernel mode" (most privileged)
– Ring 3 – "User mode"
• Unix uses only two modes:
– user: untrusted execution (Ring 3)
– kernel: trusted execution (Ring 0)
[figure: Rings 0–3, with privilege level increasing toward Ring 0; from Fischbach]

Protection Rings
• New Intel VT and AMD SVM CPUs introduce new protected modes for VMM hypervisors.
• We can think of it as a new inner ring: one ring to bind them all.
• Warning: this is an oversimplification: the actual architecture is more complex for backward compatibility.
[figure: a hypervisor ring inside the kernel and user rings; the hypervisor hosts guest kernels and their user rings]

Protection Rings
• Computer scientists have drawn these rings since the 1960s.
• They represent layering: the outer ring "hides" the interface of the inner ring.
• The machine defines the events (exceptions) that transition to higher privilege (inner ring).
• Inner rings register handlers to intercept selected events.
• But the picture is misleading….
[figure: Rings 0–3 with increasing privilege level toward Ring 0; from Fischbach]

Protection Rings
• We might just as soon draw it "inside out".
• Now the ring represents power: what the code at that ring can access or modify.
• Bigger rings have more power.
• Inclusion: bigger rings can see or do anything that the smaller rings can do.
• And they can manipulate the state of the rings they contain.
• But still misleading: there are multiple 'instances' of the weaker rings.
[figure: user inside guest inside hypervisor]

Maybe a better picture…
There are multiple 'instances' of the weaker rings. And powers are nested: an outer ring limits the "sandbox" or scope of the rings it contains.

Post-note
• The remaining slides in the section are just more slides to reinforce these concepts.
• We didn't see them in class.
• There is more detail in the reading…

Kernel Mode
CPU mode (a field in some status register) indicates whether a machine CPU (core) is running in a user program or in the protected kernel. Some instructions or register accesses are legal only when the CPU (core) is executing in kernel mode. CPU mode transitions to kernel mode only on machine exception events (trap, fault, interrupt), which transfer control to a handler registered by the kernel with the machine at boot time. So only the kernel program chooses what code ever runs in kernel mode (or so we hope and intend). A kernel handler can read the user register values at the time of the event, and modify them arbitrarily before (optionally) returning to user mode.
[figure: a CPU core with its registers (R0…Rn, PC) and a U/K mode bit]

Exceptions: trap, fault, interrupt
Synchronous exceptions are caused by an instruction; asynchronous exceptions are caused by some other event. Intentional exceptions happen every time the triggering instruction runs; unintentional ones depend on contributing factors.
• trap (synchronous, intentional): system call — open, close, read, write, fork, exec, exit, wait, kill, etc.
• fault (synchronous, unintentional): invalid or protected address or opcode, page fault, overflow, etc.
• "software interrupt" (asynchronous, intentional): software requests an interrupt to be delivered at a later time.
• interrupt (asynchronous, unintentional): caused by an external event: I/O op completed, clock tick, power fail, etc.

Kernel Stacks and Trap/Fault Handling
Processes execute user code on a user stack in the user virtual memory in the process virtual address space. Each process has a second kernel stack in kernel space (VM accessible only to the kernel). System calls and faults run in kernel mode on the process kernel stack. Kernel code running in P's process context (i.e., on its kstack) has access to P's virtual memory. The syscall handler makes an indirect call through the system call dispatch table to the handler registered for the specific system call.
[figure: process address spaces with data segments, user stacks, per-process kernel stacks, and the syscall dispatch table]

More on VMs
Recent CPUs support additional protected mode(s) for hypervisors. When the hypervisor initializes, it selects some set of event types to intercept, and registers handlers for them. Selected machine events occurring in user mode or kernel mode transfer control to a hypervisor handler. For example, a guest OS kernel accessing device registers may cause the physical machine to invoke the hypervisor to intervene. In addition, the VM architecture has another level of indirection in the MMU page tables: the hypervisor can specify and restrict what parts of physical memory are visible to each guest VM. A guest VM kernel can map to or address a physical memory frame or command device DMA I/O to/from a physical frame if and only if the hypervisor permits it.
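To make that extra level of indirection concrete, here is a minimal sketch (not any real hypervisor's code; the names and table layout are hypothetical) of two-stage address translation. The guest kernel controls the guest-virtual to guest-physical mapping, while only the hypervisor controls the guest-physical to host-physical mapping, so a guest can never reach a host frame the hypervisor has not granted.

# Illustrative sketch of nested (two-stage) address translation with 4 KB pages.
# Names and structures are hypothetical; real MMUs use multi-level page tables.
PAGE = 4096

# Stage 1: guest page table, maintained by the guest OS kernel
# (guest virtual page number -> guest physical page number).
guest_page_table = {0: 7, 1: 3}

# Stage 2: hypervisor-controlled table (the EPT/NPT role)
# (guest physical page number -> host physical page number).
ept = {3: 42, 7: 99}   # only frames the hypervisor has granted to this VM

def translate(gva: int) -> int:
    """Translate a guest virtual address to a host physical address."""
    gvpn, offset = divmod(gva, PAGE)
    gppn = guest_page_table.get(gvpn)
    if gppn is None:
        raise Exception("guest page fault: handled by the guest kernel")
    hppn = ept.get(gppn)
    if hppn is None:
        # The guest touched guest-physical memory it was never granted:
        # control returns to the hypervisor (a "VM exit" / EPT violation).
        raise Exception("EPT violation: handled by the hypervisor")
    return hppn * PAGE + offset

print(hex(translate(0x1004)))   # guest page 1 -> guest frame 3 -> host frame 42

The only point of the sketch is the ordering: the hypervisor's table is consulted last, so its permissions always win.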
If any guest VM tries to do anything weird, then the hypervisor regains control and can see or do anything to any part of the physical or virtual machine state before (optionally) restarting the guest VM.

If you are interested…
2.1 The Intel VT-x Extension
In order to improve virtualization performance and simplify VMM implementation, Intel has developed VT-x [37], a virtualization extension to the x86 ISA. AMD also provides a similar extension with a different hardware interface called SVM [3]. The simplest method of adapting hardware to support virtualization is to introduce a mechanism for trapping each instruction that accesses privileged state so that emulation can be performed by a VMM. VT-x embraces a more sophisticated approach, inspired by IBM's interpretive execution architecture [31], where as many instructions as possible, including most that access privileged state, are executed directly in hardware without any intervention from the VMM. This is possible because hardware maintains a "shadow copy" of privileged state. The motivation for this approach is to increase performance, as traps can be a significant source of overhead.

VT-x adopts a design where the CPU is split into two operating modes: VMX root and VMX non-root mode. VMX root mode is generally used to run the VMM and does not change CPU behavior, except to enable access to new instructions for managing VT-x. VMX non-root mode, on the other hand, restricts CPU behavior and is intended for running virtualized guest OSes. Transitions between VMX modes are managed by hardware. When the VMM executes the VMLAUNCH or VMRESUME instruction, hardware performs a VM entry, placing the CPU in VMX non-root mode and executing the guest. Then, when action is required from the VMM, hardware performs a VM exit, placing the CPU back in VMX root mode and jumping to a VMM entry point. Hardware automatically saves and restores most architectural state during both types of transitions. This is accomplished by using buffers in a memory-resident data structure called the VM control structure (VMCS). In addition to storing architectural state, the VMCS contains a myriad of configuration parameters that allow the VMM to control execution and specify which types of events should generate VM exits. This gives the VMM considerable flexibility in determining which hardware is exposed to the guest. For example, a VMM could configure the VMCS so that the HLT instruction causes a VM exit or it could allow the guest to halt the CPU. However, some hardware interfaces, such as the interrupt descriptor table (IDT) and privilege modes, are exposed implicitly in VMX non-root mode and never generate VM exits when accessed. Moreover, a guest can manually request a VM exit by using the VMCALL instruction.

Virtual memory is perhaps the most difficult hardware feature for a VMM to expose safely. A straw man solution would be to configure the VMCS so that the guest has access to the page table root register, %CR3. However, this would place complete trust in the guest because it would be possible for it to configure the page table to access any physical memory address, including memory that belongs to the VMM. Fortunately, VT-x includes a dedicated hardware mechanism, called the extended page table (EPT), that can enforce memory isolation on guests with direct access to virtual memory. It works by applying a second, underlying, layer of address translation that can only be configured by the VMM.
AMD's SVM includes a similar mechanism to the EPT, referred to as a nested page table (NPT).
From Dune: Safe User-level Access to Privileged CPU Features, Belay et al. (Stanford), OSDI, October 2012.

VT in a Nutshell
• New VM mode bit
– Orthogonal to kernel/user mode or rings (CPL)
• If VM mode is off
– Machine looks just like it always did
• If VM bit is on
– Machine is running a guest VM: "VMX non-root operation"
– Various events cause gated entry into the hypervisor: a "virtualization intercept"
– Hypervisor can control which events cause intercepts
– Hypervisor can examine/manipulate guest VM state

There is another motivation for VMs and hypervisors. Application services and computational jobs need access to computing power "on tap". Virtualization allows the owner of a server to "slice and dice" server resources and allocate the virtual slices out to customers as VMs. The customers can install and manage their own software their own way in their own VMs. That is cloud hosting.

Part 2 SERVICES

Services
[figure: clients invoking services over the network via RPC and HTTP GET]

End-to-end application delivery
Where is your application? Where is your data? Where is your OS?
Cloud and Software-as-a-Service (SaaS): rapid evolution, no user upgrade, no user data management. Agile/elastic deployment on virtual infrastructure.

Networking
Some IPC mechanisms allow communication across a network. E.g.: sockets using Internet communication protocols (TCP/IP). Each endpoint on a node (host) has a port number. Each node has one or more interfaces, each on at most one network. Each interface may be reachable on its network by one or more names, e.g., an IP address and an (optional) DNS name.
[figure: endpoint/port operations — advertise (bind), listen, connect (bind), close — establishing a channel (connection) binding between node A and node B, which then write/send and read/receive]

SaaS platform elements
[figure: browser clients talking to a server-side container running on a "classical OS"; from wiki.eeng.dcu.ie]

Motivation: "Success disaster"
[Graphic from Amazon: Mike Culver, Web Scale Computing]

"Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
- US National Institute for Standards and Technology
http://www.csrc.nist.gov/groups/SNS/cloud-computing/

Part 3 VIRTUAL CLOUD HOSTING

Cloud > server-based computing
• Client/server model (1980s - )
• Now called Software-as-a-Service (SaaS)
[figure: client and server(s)]

Host/guest model
• Service is hosted by a third party.
– flexible programming model
– cloud APIs for service to allocate/link resources
– on-demand: pay as you grow
[figure: a client using a service that runs as a guest on a cloud provider host]

IaaS: infrastructure services
Hosting performance and isolation are determined by the virtualization layer. Virtual machines: VMware, KVM, etc. Deployment of private clouds is growing rapidly w/ open IaaS cloud software.
[figure: client → service → platform → OS → VMM → physical machine]

PaaS: platform services
PaaS cloud services define the high-level programming models, e.g., for clusters or specific application classes. Hadoop, grids, batch job services, etc. can also be viewed as part of the PaaS category. Note: they can be deployed over IaaS.
[figure: client → service → platform → OS → VMM (optional) → physical machine]
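To make the IaaS hosting model concrete: the tenant's "program" is a VM image, and the cloud API launches instances of it on demand. Below is a minimal, illustrative sketch using the boto3 Python client for Amazon EC2 (EC2 is introduced on the following slides); the image ID, key pair name, region, and instance type are placeholders, and credential setup and error handling are omitted.

# Minimal IaaS sketch (Python + boto3 for EC2): launch one instance of a VM image,
# then release it. The AMI ID and key pair name below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one instance of a virtual appliance (AMI) -- the "program" for a VM context.
resp = ec2.run_instances(
    ImageId="ami-00000000",      # placeholder image (virtual appliance) ID
    InstanceType="t2.micro",     # size of the virtual slice
    KeyName="my-ssh-key",        # SSH public key registered with the provider
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
print("launched", instance_id)

# Pay as you grow: release the resources when the tenant is done with them.
ec2.terminate_instances(InstanceIds=[instance_id])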
Elastic provisioning
• Varying workload + fixed system → varying performance.
• Varying workload + varying system → fixed performance: the "Elastic Cloud".
• Varying workload + varying system + resource control → target performance.
Managing Energy and Server Resources in Hosting Centers, SOSP, October 2001.

EC2: the canonical public cloud
[figure: launching instances of a virtual appliance image on EC2]

OpenStack, the Cloud Operating System
Management layer that adds automation & control. [Anthony Young @ Rackspace]

IaaS Cloud APIs (OpenStack, EC2)
• Query of availability zones (i.e. clusters in Eucalyptus)
• SSH public key management (add, list, delete)
• VM management (start, list, stop, reboot, get console output)
• Security group management
• Volume and snapshot management (attach, list, detach, create, bundle, delete)
• Image management (bundle, upload, register, list, deregister)
• IP address management (allocate, associate, list, release)

Adding storage

Competing Cloud Models: PaaS vs. IaaS
• Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
• Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).

Examples: Amazon Elastic Compute Cloud (EC2), Eucalyptus, OpenNebula.

Post-note
• The remaining slides weren't discussed.
• Some give more info on the various forms of cloud computing following the NIST model. Just understand the IaaS and PaaS hosting models.
• The "Adaptation" slides deal with resource management: what assurances does the holder of virtual infrastructure have about how much resource it will receive, and how good its performance will (therefore) be? We'll discuss this more later.
• The last slide refers to an advanced cloud project at Duke and RENCI.org, partially funded by the NSF Global Environment for Network Innovations (geni.net).

Managing images
• "Let a thousand flowers bloom."
• Curated image collections are needed!
• "Virtual appliance marketplace"

Infrastructure as a Service (IaaS)
"Consumers of IaaS have access to virtual computers, network-accessible storage, network infrastructure components, and other fundamental computing resources…and are billed according to the amount or duration of the resources consumed."

Cloud Models
• Cloud Software as a Service (SaaS)
– Use provider's applications over a network
• Cloud Platform as a Service (PaaS)
– Deploy customer-created applications to a cloud
• Cloud Infrastructure as a Service (IaaS)
– Rent processing, storage, network capacity, and other fundamental computing resources

NIST Cloud Definition Framework
Deployment models: Public Cloud, Private Cloud, Community Cloud, Hybrid Clouds.
Service models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS).
Essential characteristics: On-Demand Self-Service, Broad Network Access, Rapid Elasticity, Resource Pooling, Measured Service.
Common characteristics: Massive Scale, Resilient Computing, Homogeneity, Geographic Distribution, Virtualization, Service Orientation, Low Cost Software, Advanced Security.

Adaptations: Describing IaaS Services
[figure: a computer's resources (CPU, memory, disk, BW) described as shares; a plot of CPU shares vs. memory shares (scale of 16) with three example resource vectors: ra=(8,4), rb=(4,8), rc=(4,4)]

Adaptations: service classes
• Must adaptations promise performance isolation?
• There is a wide range of possible service classes…to the extent that we can reason about them.
[figure: a continuum of service classes, from weak effort and best effort (available surplus; reflects load factor or overbooking degree), through proportional share and elastic reservation, to hard reservation (reflects priority)]

Constructing "slices"
• I like to use TinkerToys as a metaphor for creating a slice in the GENI federated cloud.
• The parts are virtual infrastructure resources: compute, networking, storage, etc.
• Parts come in many types, shapes, sizes.
• Parts interconnect in various ways.
• We combine them to create useful built-to-order assemblies.
• Some parts are programmable.
• Where do the parts come from?
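As a footnote to the "Adaptations: Describing IaaS Services" slide above, here is a small illustrative sketch (not from the slides; the admission check is an assumption) of describing a host's capacity and tenant requests as resource share vectors, and checking whether a set of hard reservations fits on the host.

# Illustrative sketch: IaaS allocations as resource share vectors.
# Numbers follow the "Describing IaaS Services" example: a host with 16 shares of
# each resource, and requests ra=(8,4), rb=(4,8), rc=(4,4) over (CPU shares, memory shares).
capacity = (16, 16)
requests = {"a": (8, 4), "b": (4, 8), "c": (4, 4)}

def fits(capacity, requests):
    """True if all requests can be granted as hard reservations on this host."""
    totals = [sum(r[i] for r in requests.values()) for i in range(len(capacity))]
    return all(t <= c for t, c in zip(totals, capacity))

print(fits(capacity, requests))   # totals (16, 16) vs. capacity (16, 16) -> True

A weaker service class from the continuum above (e.g., overbooked best effort) would admit requests even when the totals exceed capacity, relying on statistical multiplexing rather than reserved shares.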