Cellular Disco: Resource Management Using Virtual Clusters on Shared-Memory Multiprocessors
Published by ACM, 1999. Authors: K. Govil, D. Teodosiu, Y. Huang, M. Rosenblum.
Presenter: Soumya Eachempati

Motivation
• Large-scale shared-memory multiprocessors
  – Large number of CPUs (32-128)
  – NUMA architectures
• Off-the-shelf OSes do not scale
  – Cannot handle a large number of resources
  – Memory management not optimized for NUMA
  – No fault containment

Existing Solutions
• Hardware partitioning
  – Provides fault containment
  – Rigid resource allocation
  – Low resource utilization
  – Cannot dynamically adapt to the workload
• New operating system
  – Provides flexibility and efficient resource management
  – Requires considerable effort and time

Goal: exploit the hardware resources to the fullest with minimal effort, while improving flexibility and fault tolerance.

Solution: Disco (VMM)
• A virtual machine monitor
• Addresses NUMA awareness and scalability
• Issues not dealt with by Disco:
  – Hardware fault tolerance/containment
  – Resource management policies

Cellular Disco
• Approach: convert the multiprocessor machine into a virtual cluster
• Advantages:
  – Inherits the benefits of Disco
  – Can support legacy OSes transparently
  – Combines the strengths of hardware partitioning and a new OS
  – Provides fault containment
  – Fine-grained resource sharing
  – Less effort than developing a new OS

Cellular Disco
• Internally structured into semi-independent cells
• Much less development effort compared to Hive
• No performance loss from adding fault containment
• Warranted design decision: the code of Cellular Disco is assumed to be correct (trusted)

Cellular Disco Architecture

Resource Management
• Over-commits resources
• Gives the flexibility to adjust the fraction of resources assigned to each VM
• Fault containment places restrictions on resource allocation
• Both CPU and memory load balancing, under constraints:
  – Scalability
  – Fault containment
  – Avoiding contention
• First-touch allocation, dynamic migration, and replication of hot memory pages

Hardware Virtualization
• The VM's interface mimics the underlying hardware
• Virtual machine resources (user-defined): VCPUs, memory, I/O devices (physical)
• Physical vs. machine resources (allocated dynamically, based on VM priority)
  – VCPUs run on CPUs
  – Physical pages are backed by machine pages
• The VMM intercepts privileged instructions
  – Three modes: user and supervisor (guest OS), kernel (VMM)
  – In supervisor mode, all memory accesses are mapped
• Allocates machine memory to back the physical memory
• Pmap and memmap data structures
• Second-level software TLB (L2TLB); a translation sketch follows the workload slide below

Hardware fault containment
• The VMM provides software fault containment
• Cell: the unit of fault containment
• Inter-cell communication
  – Inter-processor RPC
  – Messages: no locking needed since they are serialized
  – Shared memory for some data structures (pmap, memmap)
  – Low latency, exactly-once semantics
• A trusted system software layer enables the use of shared memory

Implementation 1: MIPS R10000
• 32-processor SGI Origin 2000
• Piggybacked on IRIX 6.4 (host OS)
• Guest OS: IRIX 6.2
• Spawns Cellular Disco (CD) as a multithreaded kernel process
  – Additional overhead < 2% (time spent in the host IRIX)
  – No fault isolation: the IRIX kernel is monolithic
  – Solution: some host OS support is needed; one copy of the host OS per cell

I/O Request execution
• Cellular Disco piggybacked on the IRIX kernel (32 MIPS R10000 processors)

Characteristics of workloads
• Database: decision-support workload
• Pmake: I/O-intensive workload
• Raytrace: CPU-intensive workload
• Web: kernel-intensive web-server workload
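Translation sketch (referenced from the Hardware Virtualization slide). This is a minimal C illustration, not the monitor's actual code: it assumes a per-VM pmap array mapping guest physical pages to machine pages and a small per-VCPU software L2TLB caching recent translations; the names (vcpu_t, translate_phys_page) and the direct-mapped L2TLB layout are hypothetical.

```c
/*
 * Sketch of the physical-to-machine translation path: the monitor keeps
 * a per-VM pmap (physical page -> machine page) and a per-VCPU software
 * second-level TLB (L2TLB) that caches recent translations so most
 * guest TLB misses avoid the pmap. Layouts and names are illustrative.
 */
#include <stdint.h>
#include <stdio.h>

#define L2TLB_ENTRIES 16          /* illustrative size */
#define INVALID_PAGE  UINT32_MAX

/* Per-VM pmap entry: indexed by guest "physical" page number. */
typedef struct {
    uint32_t machine_page;        /* backing machine page, or INVALID_PAGE */
} pmap_entry_t;

/* Per-VCPU software L2TLB entry: caches a physical->machine mapping. */
typedef struct {
    uint32_t phys_page;
    uint32_t machine_page;
    int      valid;
} l2tlb_entry_t;

typedef struct {
    pmap_entry_t  *pmap;          /* one entry per guest physical page */
    uint32_t       pmap_size;
    l2tlb_entry_t  l2tlb[L2TLB_ENTRIES];
} vcpu_t;

/*
 * On a guest TLB miss, translate the guest physical page to a machine
 * page: first consult the software L2TLB, then fall back to the pmap
 * and refill the L2TLB. The real monitor would then insert the
 * (virtual, machine) entry into the hardware TLB; that step is omitted.
 */
static uint32_t translate_phys_page(vcpu_t *vcpu, uint32_t phys_page)
{
    l2tlb_entry_t *slot = &vcpu->l2tlb[phys_page % L2TLB_ENTRIES];

    if (slot->valid && slot->phys_page == phys_page)
        return slot->machine_page;               /* fast path: L2TLB hit */

    if (phys_page >= vcpu->pmap_size ||
        vcpu->pmap[phys_page].machine_page == INVALID_PAGE)
        return INVALID_PAGE;                     /* not backed yet */

    slot->phys_page    = phys_page;              /* refill the L2TLB */
    slot->machine_page = vcpu->pmap[phys_page].machine_page;
    slot->valid        = 1;
    return slot->machine_page;
}

int main(void)
{
    pmap_entry_t pmap[4] = { {7}, {INVALID_PAGE}, {42}, {INVALID_PAGE} };
    vcpu_t vcpu = { .pmap = pmap, .pmap_size = 4, .l2tlb = {{0}} };

    printf("phys 2 -> machine %u\n", translate_phys_page(&vcpu, 2));
    printf("phys 1 -> machine %u (unbacked)\n", translate_phys_page(&vcpu, 1));
    return 0;
}
```

The real monitor would also have to invalidate L2TLB entries when pages are migrated or replicated; that path is omitted here.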
Virtualization Overheads

Fault-containment Overheads
• Left bar: single-cell configuration; right bar: 8-cell system

CPU Management
• Load-balancing mechanisms:
  – Three types of VCPU migration: intra-node, inter-node, and inter-cell
  – Intra-node: loss of CPU cache affinity
  – Inter-node: cost of copying the L2TLB, plus a higher long-term cost
  – Inter-cell: loss of both cache and node affinity; increases fault vulnerability
• The migration penalty is alleviated by replicating pages
• Load-balancing policies: an idle balancer (local load stealer) and a periodic balancer (global redistribution)
• Each CPU has a local run queue of VCPUs
• Gang scheduling
  – Run all VCPUs of a VM simultaneously

Load Balancing
• Low-contention distributed data structure: the load tree
• Potential contention on the higher-level tree nodes
• Each VCPU tracks the list of cells it is vulnerable to
• Under heavy load, the idle balancer alone is not enough
• A periodic balancer runs locally over each 8-CPU region

CPU Scheduling and Results
• Scheduling: pick the highest-priority gang-runnable VCPU that has been waiting; send out RPCs to schedule the rest of the gang
• Three configurations on 32 processors:
  a) One VM with 8 VCPUs running an 8-process Raytrace
  b) 4 VMs
  c) 8 VMs (64 VCPUs total)
• The pmap is migrated only when all VCPUs have migrated out of a cell
• Data pages are also migrated, for independence

Memory Management
• Each cell has its own freelists of pages, indexed by home node
• Page allocation requests (see the allocation sketch at the end of these notes):
  – Satisfied from the local node
  – Else satisfied from another node in the same cell
  – Else borrowed from another cell
• Memory balancing
  – A low-memory threshold triggers borrowing and lending
  – Each VM has a priority list of lender cells

Memory Paging
• Page replacement
  – Second-chance FIFO
• Avoids double-paging overheads
• Tracking used pages
  – Uses annotated OS routines
• Page sharing
  – Explicit marking of shared pages
• Redundant paging
  – Avoided by trapping every access to the virtual paging disk

Implementation 2: FLASH Simulation
• FLASH has hardware fault-recovery support
• Simulation of the FLASH architecture on SimOS
• Uses a fault injector
  – Power failure
  – Link failure
  – Firmware failure (?)
• Results: 100% fault containment

Fault Recovery
• Hardware support is needed to
  – Determine which resources are operational
  – Reconfigure the machine to use the good resources
• Cellular Disco recovery (see the recovery sketch at the end of these notes)
  – Step 1: all cells agree on a liveset of nodes
  – Step 2: abort RPCs/messages to dead cells
  – Step 3: kill the VMs dependent on failed cells

Fault-recovery Times
• Recovery times are higher for larger memories
  – Recovery requires scanning memory for fault detection

Summary
• Virtual machine monitor
  – Flexible resource management
  – Legacy OS support
• Cellular Disco
  – Cells provide fault containment
  – Creates a virtual cluster
  – Needs hardware support
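Allocation sketch (referenced from the Memory Management slide). A minimal C illustration of the three-step fallback described there: local node first, then another node in the same cell, then borrowing from another cell following the VM's lender-priority list. The structures and names (cell_t, alloc_page, take_from_node) are assumptions made for illustration, not the real freelist implementation.

```c
/*
 * Each cell keeps free-page counts indexed by home node; a request is
 * served from the requesting node, then from any node in the same cell,
 * and only then borrowed from another cell. Borrowing widens the VM's
 * fault-containment exposure, so it is the last resort.
 */
#include <stdint.h>
#include <stdio.h>

#define NODES_PER_CELL 4
#define NUM_CELLS      2
#define NO_PAGE        UINT32_MAX

typedef struct {
    uint32_t free_pages[NODES_PER_CELL];   /* free-page counts per home node */
} cell_t;

static cell_t cells[NUM_CELLS];

/* Take one page from the given node's freelist, if any. */
static uint32_t take_from_node(cell_t *cell, int node)
{
    if (cell->free_pages[node] == 0)
        return NO_PAGE;
    cell->free_pages[node]--;
    return 1;   /* stand-in for a real machine page number */
}

/* Allocate a page for a VM running on (home_cell, home_node). */
static uint32_t alloc_page(int home_cell, int home_node,
                           const int *lender_prefs, int num_lenders)
{
    uint32_t page;

    /* 1. Local node first: preserves NUMA locality. */
    page = take_from_node(&cells[home_cell], home_node);
    if (page != NO_PAGE)
        return page;

    /* 2. Any node in the same cell: still inside the fault boundary. */
    for (int n = 0; n < NODES_PER_CELL; n++)
        if ((page = take_from_node(&cells[home_cell], n)) != NO_PAGE)
            return page;

    /* 3. Borrow from another cell, following the VM's lender priority list. */
    for (int i = 0; i < num_lenders; i++)
        for (int n = 0; n < NODES_PER_CELL; n++)
            if ((page = take_from_node(&cells[lender_prefs[i]], n)) != NO_PAGE)
                return page;

    return NO_PAGE;   /* out of memory: caller would trigger paging */
}

int main(void)
{
    cells[0].free_pages[0] = 0;       /* local node exhausted */
    cells[0].free_pages[1] = 1;
    cells[1].free_pages[2] = 5;

    int lenders[] = { 1 };
    printf("got page? %s\n",
           alloc_page(0, 0, lenders, 1) != NO_PAGE ? "yes" : "no");
    return 0;
}
```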
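Recovery sketch (referenced from the Fault Recovery slide). A minimal C illustration of steps 2 and 3, assuming step 1 has already produced an agreed liveset bitmask; the liveset agreement protocol itself and all names (pending_rpc_t, vm_t, kill_dependent_vms) are simplified placeholders, not the monitor's actual recovery code.

```c
/*
 * Once the hardware reports which nodes survived, the cells (1) agree
 * on a common liveset, (2) abort RPCs/messages addressed to cells
 * outside the liveset, and (3) kill the VMs that depended on resources
 * owned by the failed cells. Step 1 is reduced here to a precomputed
 * bitmask; everything else is illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int  dest_cell;
    bool aborted;
} pending_rpc_t;

typedef struct {
    const char *name;
    uint8_t     depends_on;   /* bitmask of cells this VM uses resources from */
    bool        killed;
} vm_t;

/* Step 2: abort in-flight RPCs/messages whose destination cell is dead. */
static void abort_dead_rpcs(pending_rpc_t *rpcs, int n, uint8_t liveset)
{
    for (int i = 0; i < n; i++)
        if (!(liveset & (1u << rpcs[i].dest_cell)))
            rpcs[i].aborted = true;
}

/* Step 3: kill every VM that depends on a cell outside the liveset. */
static void kill_dependent_vms(vm_t *vms, int n, uint8_t liveset)
{
    for (int i = 0; i < n; i++)
        if (vms[i].depends_on & ~liveset)
            vms[i].killed = true;
}

int main(void)
{
    /* Step 1 (assumed done): cells agreed that cell 2 is dead. */
    uint8_t liveset = 0x0B;                      /* cells 0, 1, 3 alive */

    pending_rpc_t rpcs[] = { {1, false}, {2, false} };
    vm_t vms[] = {
        { "vm-a", 0x01, false },                 /* uses cell 0 only */
        { "vm-b", 0x05, false },                 /* uses cells 0 and 2 */
    };

    abort_dead_rpcs(rpcs, 2, liveset);
    kill_dependent_vms(vms, 2, liveset);

    for (int i = 0; i < 2; i++)
        printf("%s %s\n", vms[i].name, vms[i].killed ? "killed" : "survives");
    return 0;
}
```

The containment property the slides report (100% in the FLASH fault-injection experiments) comes from confining each VM's dependencies to as few cells as possible, so step 3 only affects VMs that actually touched the failed cells.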