Kit Cischke
09/09/08
CS 5090

DISCO: RUNNING COMMODITY OPERATING SYSTEMS ON SCALABLE MULTIPROCESSORS

Overview
• Background – What are we doing here?
• A Return to Virtual Machine Monitors – What does Disco do?
• Disco: A Return to VMMs – How does Disco do it?
• Experimental Results – How well does Disco dance?

The Basic Problem
• With the explosion of multiprocessor machines, especially of the NUMA variety, the problem of using these machines effectively becomes more pressing.
• NUMA = Non-Uniform Memory Access – shows up a lot in clusters.
• The authors point out that the problem applies to any major hardware innovation, not just multiprocessors.

Potential Solution
• Solution: rewrite the operating system to address fault tolerance and scalability.
• Flaws:
  • Rewriting will introduce bugs, and bugs can disrupt the system or the applications.
  • Instability is usually tolerated poorly on these kinds of systems because of their application space.
  • You may not even have access to the OS source.

Not So Good
• Okay, so that wasn't so good. What else do we have?
• How about virtual machine monitors? A new twist on an old idea, one that may work better now that we have faster processors.

Enter Disco
• Disco is a system VMM that presents the same fundamental machine to each of the various OSes that might be running on the hardware.
• These can be commodity OSes – uniprocessor or multiprocessor – or specialty systems.

Disco VMM
• Fundamentally, the hardware is a cluster, but Disco adds global policies to manage all of the resources, which makes for better use of the hardware.
• We use commodity operating systems and write only the VMM: rather than millions of lines of code, we write a few thousand.
• But what if an application's resource needs exceed what a single commodity OS can manage?

Scalability
• Very simple changes to the commodity OS (perhaps at the driver level or as a kernel extension) can allow virtual machines to share resources. E.g., a parallel database could keep a cache in shared memory while its virtual processors run on separate virtual machines.
• There is also support for specialized OSes that need the power of multiple processors but not all of the features offered by a commodity OS.

Further Benefits
• Multiple copies of an OS naturally address scalability and fault containment.
  • Need greater scaling? Add a VM. Only the monitor and the system protocols (NFS, etc.) need to scale.
  • OS or application crashes? No problem: the rest of the system is isolated.
• NUMA memory management issues are addressed.
• Multiple versions of different OSes provide legacy support and convenient upgrade paths.

Not All Sunshine & Roses
• VMM overhead: virtualizing the hardware costs additional exception processing, instruction execution, and memory.
  • Privileged instructions aren't executed directly on the hardware, so we need to fake it (see the sketch after this slide).
  • I/O requests need to be intercepted and remapped.
  • Memory overhead is rough too: consider having 6 copies of Vista in memory simultaneously.
• Resource management: the VMM can't make intelligent decisions about code streams without information from the OS.
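To make the "fake it" bullet concrete, here is a minimal trap-and-emulate sketch in C. Everything in it is illustrative rather than Disco's actual code: the vcpu structure and function names are invented, and only the standard MIPS32 MFC0/MTC0 encodings are decoded. The point is the general pattern: when the guest executes a privileged instruction without kernel privilege, the hardware traps into the VMM, which applies the instruction's effect to a software copy of the privileged state instead of the real registers.

```c
/* Hypothetical trap-and-emulate handler for privileged MIPS instructions.
 * All names and structures are illustrative, not Disco's real code. */

#include <stdint.h>
#include <stdio.h>

struct vcpu {
    uint32_t gpr[32];   /* guest general-purpose registers            */
    uint32_t cp0[32];   /* software copy of the guest's CP0 registers */
    uint32_t pc;
};

/* Invoked by the VMM's trap handler when the guest, running without
 * kernel privilege, executes a coprocessor-0 (privileged) instruction. */
void emulate_privileged(struct vcpu *v, uint32_t insn)
{
    uint32_t op = insn >> 26;           /* primary opcode               */
    uint32_t rs = (insn >> 21) & 0x1f;  /* selects MFC0 vs. MTC0        */
    uint32_t rt = (insn >> 16) & 0x1f;  /* guest GPR involved           */
    uint32_t rd = (insn >> 11) & 0x1f;  /* privileged register number   */

    if (op == 0x10 && rs == 0x04) {         /* MTC0: guest writes CP0   */
        v->cp0[rd] = v->gpr[rt];            /* update the virtual copy  */
    } else if (op == 0x10 && rs == 0x00) {  /* MFC0: guest reads CP0    */
        v->gpr[rt] = v->cp0[rd];
    } else {
        /* TLB writes, ERET, etc. would be emulated here the same way:
         * against the virtual CPU's state, never the real hardware.   */
    }
    v->pc += 4;                             /* step past the instruction */
}

int main(void)
{
    struct vcpu v = { .pc = 0x80001000 };
    v.gpr[8] = 0xff01;                      /* pretend $t0 holds a value */
    /* Encoding of "mtc0 $t0, $12" (a write to the Status register). */
    uint32_t mtc0 = (0x10u << 26) | (0x04u << 21) | (8u << 16) | (12u << 11);
    emulate_privileged(&v, mtc0);
    printf("virtual Status register is now 0x%x\n", v.cp0[12]);
    return 0;
}
```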
One Last Disadvantage
• Communication: sometimes resources simply can't be shared the way we want.
• Most of these problems can be mitigated, though. For example, most operating systems have good NFS support, so use it.
• But… we can make it even better! (Details forthcoming.)

Introducing Disco
• A VMM designed for the FLASH multiprocessor.
• FLASH is an academic machine designed at Stanford University. It is a collection of nodes, each containing a processor, memory, and I/O, and it uses directory-based cache coherence, which makes it look like a CC-NUMA machine.
• Disco has also been ported to a number of other machines.

Disco's Interface
• The virtual CPU of Disco is an abstraction of a MIPS R10000. It not only emulates the processor but extends it (e.g., it reduces some kernel operations to simple load/store instructions).
• An abstraction of contiguous physical memory starting at address 0 (zero).
• I/O devices: disks, network interfaces, interrupts, clocks, etc., with special interfaces for the network and disks.

Disco's Implementation
• Implemented as a multi-threaded shared-memory program, with careful attention paid to memory placement, cache-aware data structures, and processor communication patterns.
• Disco is only about 13,000 lines of code. For comparison:
  • Windows Server 2003: ~50,000,000
  • Red Hat 7.1: ~30,000,000
  • Mac OS X 10.4: ~86,000,000

Disco's Implementation: Virtual CPUs
• The execution of a virtual processor is mapped one-for-one onto a real processor: at each context switch, the state of the real processor is set to that of the virtual processor.
• On MIPS, Disco runs in kernel mode and puts the processor in the appropriate mode for whatever is being run: supervisor mode for the guest OS, user mode for applications.
• A simple scheduler allows virtual processors to be time-shared across the physical processors.

Disco's Implementation: Virtual Physical Memory
• This discussion goes on for 1.5 pages. To sum up: the OS makes requests to physical addresses, and Disco translates them to machine addresses.
• Disco uses the hardware TLB for this. Switching a different virtual processor onto a real processor requires a TLB flush, so Disco maintains a second-level software TLB to offset the performance hit.
• There's a technical issue with the TLB, kernel address space, and the MIPS processor that threw the authors for a loop: MIPS kernels normally live in an unmapped segment that bypasses the TLB entirely, so the guest kernel has to be relinked into mapped address space.

NUMA Memory Management
• To mitigate the non-uniform memory effects of a NUMA machine, Disco does a bunch of stuff:
  • It allocates memory so that, as much as possible, it has "affinity" to the processor that uses it.
  • It migrates or replicates pages to reduce long-distance memory accesses, transparently to the virtual machines.

Virtual I/O Devices
• Obviously, Disco needs to intercept I/O requests and direct them to the actual devices. This is handled primarily by installing drivers for Disco I/O in the guest OS.
• DMA provides an interesting challenge, in that DMA addresses need the same physical-to-machine translation as regular accesses. However, we can do some especially cool things with DMA requests to disk.

Copy-on-Write Disks
• All disk DMA requests are caught and analyzed. If the data is already in memory, we don't have to go to disk for it; if the request is for a full page, we just update a pointer in the requesting virtual machine.
• So what? Multiple VMs can share data without being aware of it, and only modifying the data causes a copy to be made.
• This is awesome for scaling up applications by running multiple copies of an OS: only one copy of the OS kernel, libraries, etc. really needs to be in memory.

My Favorite – Networking
• The copy-on-write disk trick is great for non-persistent disks, but what about persistent ones? Let's just use NFS.
• But here's a dumb thing: a VM has a copy of data it wants to send to another VM on the same physical machine. In a naïve approach, we'd let that data be duplicated, taking up extra memory pointlessly.
• So, let's use copy-on-write for our network interface too!

Virtual Network Interface
• Disco provides a virtual subnet for VMs to talk to each other. The virtual device is Ethernet-like, but with no maximum transfer size.
• Transfers are accomplished by updating pointers rather than actually copying data (until absolutely necessary); the OS sends the requests out as NFS requests. (A sketch of the copy-on-write idea follows this slide.)
• "Ah," you say, "but what about data locality as a VM starts accessing those files and memory?" Page replication and migration!
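As a sketch of the pointer-remapping idea behind both the copy-on-write disks and the virtual network device, here is a small C example. The global block cache, the per-VM pmap array, and the function names are all invented for illustration and are not Disco's real data structures. The idea is simply that a full-page request whose data is already resident is satisfied by mapping the existing machine page read-only into the requesting VM, and the first write to a shared page triggers a private copy.

```c
/* Hypothetical copy-on-write sharing of disk pages between VMs.
 * All structures and names are invented for illustration. */

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NBLOCKS   65536

struct machine_page {
    uint8_t data[PAGE_SIZE];
    int     refcount;                  /* how many VMs map this page     */
};

struct mapping {
    struct machine_page *page;
    int                  writable;
};

struct vm {
    struct mapping pmap[1024];         /* guest-physical page -> machine */
};

/* Toy global cache: disk block number -> page already in machine memory. */
static struct machine_page *block_cache[NBLOCKS];

static struct machine_page *read_block_from_disk(unsigned block)
{
    (void)block;                                /* stand-in for real I/O  */
    return calloc(1, sizeof(struct machine_page));
}

/* Guest issues a full-page DMA read of `block` into its guest page `ppn`. */
void dma_read_page(struct vm *vm, unsigned block, unsigned ppn)
{
    struct machine_page *p = block_cache[block % NBLOCKS];
    if (!p) {                                   /* not resident: read it  */
        p = read_block_from_disk(block);
        block_cache[block % NBLOCKS] = p;
    }
    /* Already (or now) in memory: no copy, just remap, read-only so a
     * later write can be caught and turned into a private copy.          */
    p->refcount++;
    vm->pmap[ppn] = (struct mapping){ .page = p, .writable = 0 };
}

/* Write fault on a read-only page: give the faulting VM its own copy.    */
void cow_write_fault(struct vm *vm, unsigned ppn)
{
    struct mapping *m = &vm->pmap[ppn];
    if (m->page->refcount > 1) {                /* still shared: copy it  */
        struct machine_page *copy = malloc(sizeof(*copy));
        memcpy(copy->data, m->page->data, PAGE_SIZE);
        copy->refcount = 1;
        m->page->refcount--;
        m->page = copy;
    }
    m->writable = 1;                            /* private now: allow it  */
}

int main(void)
{
    static struct vm a, b;
    dma_read_page(&a, 42, 0);                   /* both VMs read block 42 */
    dma_read_page(&b, 42, 0);
    printf("shared before write: %s\n",
           a.pmap[0].page == b.pmap[0].page ? "yes" : "no");
    cow_write_fault(&a, 0);                     /* VM a writes the page   */
    printf("shared after write:  %s\n",
           a.pmap[0].page == b.pmap[0].page ? "yes" : "no");
    return 0;
}
```

In the toy main, two VMs that read the same disk block end up pointing at the same machine page, which mirrors how multiple IRIX instances can share a single resident copy of kernel text and libraries until one of them writes to it.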
About those Commodity OSes
• So what do we really need to do to get these commodity operating systems running on Disco? Surprisingly, both a lot and a little.
• Minor changes were needed to IRIX's hardware abstraction layer (HAL), amounting to 2 header files and 15 lines of assembly code. This did require a full kernel recompile, though.
• Disco needs device drivers. Let's just steal them from IRIX!
• Don't trap on every privileged register access: convert those accesses into normal loads and stores to a special address space that is linked to the privileged registers.

More Patching
• "Hinting" was added to the HAL to help the VMM not do dumb things (or at least do fewer dumb things).
• When the OS goes idle, a MIPS processor (usually) drops into a low-power mode. Disco instead just stops scheduling that virtual machine until something interesting happens (a toy version is sketched below).
• Other minor things were done as well, but those required patching the kernel.
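To illustrate the idle hint from the previous slide, here is a toy scheduler fragment in C. The vcpu structure, the wakeup hooks, and the round-robin pick_next function are all invented for illustration and are not Disco's scheduler; the policy is the only point: a virtual CPU whose guest has gone idle is skipped entirely until an interrupt or I/O completion marks it runnable again, rather than letting it burn a physical processor in the guest's idle loop.

```c
/* Toy illustration of descheduling an idle virtual CPU.
 * Names and structures are invented; this is not Disco's scheduler. */

#include <stdio.h>
#include <stddef.h>

enum vcpu_state { VCPU_RUNNABLE, VCPU_IDLE };

#define NVCPUS 8

struct vcpu {
    enum vcpu_state state;
    /* ... saved guest register state, owning VM, etc. ... */
};

static struct vcpu vcpus[NVCPUS];

/* The guest executed its low-power "wait" instruction: instead of letting
 * it spin in its idle loop, simply stop scheduling this virtual CPU.     */
void vcpu_went_idle(struct vcpu *v) { v->state = VCPU_IDLE; }

/* A virtual interrupt, timer tick, or I/O completion arrived for an idle
 * virtual CPU: it has something to do again, so make it schedulable.     */
void vcpu_wakeup(struct vcpu *v) { v->state = VCPU_RUNNABLE; }

/* Round-robin choice of the next virtual CPU for this physical processor,
 * skipping idle ones; NULL means the physical CPU itself can halt.       */
struct vcpu *pick_next(int last)
{
    for (int i = 1; i <= NVCPUS; i++) {
        struct vcpu *v = &vcpus[(last + i) % NVCPUS];
        if (v->state == VCPU_RUNNABLE)
            return v;
    }
    return NULL;
}

int main(void)
{
    for (int i = 0; i < NVCPUS; i++)
        vcpu_went_idle(&vcpus[i]);      /* every guest has gone idle       */
    vcpu_wakeup(&vcpus[5]);             /* an interrupt arrives for VCPU 5 */
    struct vcpu *next = pick_next(0);
    printf("next to run: vcpu %ld\n", next ? (long)(next - vcpus) : -1L);
    return 0;
}
```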
SPLASHOS
• Some high-performance applications might need most or all of the machine.
• The authors wrote a "thin" operating system just to run SPLASH-2 applications. It is mostly a proof of concept.

Experimental Results
• Bad idea: target your software at a machine that doesn't physically exist. Like, I don't know, FLASH?
• Disco was validated using two alternatives: SimOS, and an SGI Origin2000 board that will form the basis of FLASH.

Experimental Design
• Four representative parallel workloads were used:
  • Software development (Pmake of a large application)
  • Hardware development (a Verilog simulator)
  • Scientific computing (raytracing and a sorting algorithm)
  • Commercial database (Sybase)
• Not only are they representative, but each has characteristics that are interesting to study. For example, Pmake is multiprogrammed, full of short-lived processes, and OS- and I/O-intensive.

Simplest Results Graph
• The overhead of Disco is pretty modest compared to the uniprocessor results: Raytrace is the lowest, at only 3%, and Pmake is the highest, at 16%.
• The main hits come from additional traps and TLB misses (from all the flushing Disco does).
• Interestingly, less time is spent in the kernel in Raytrace, Engineering, and Database.
• Running a 64-bit system mitigates the impact of TLB misses.

Memory Utilization
• The key thing here is that 8 VMs do not require 8x the memory of 1 VM.
• Interestingly, we have 8 copies of IRIX running in less than 256 MB of physical RAM!

Scalability
• Page migration and replication were disabled for these runs; all runs use 8 processors and 256 MB of memory.
• IRIX has a terrible bottleneck in synchronizing the system's memory management code, and it also has a "lazy" evaluation policy in the virtual memory system that drags "normal" RADIX down.
• Overall, though, check out those performance gains!

Page Migration Benefits
• The 100% UMA results represent the best case, bounding the gains available from page migration and replication.
• In short, the policies work great.

Real Hardware
• Experience on the real SGI hardware pretty much confirms the simulations, at least at the uniprocessor level.
• Overheads tend to be in the range of 3-8% on Pmake and the Engineering workload.

Summing Up
• Disco works pretty well: memory usage scales well, processor utilization scales well, and performance overheads are relatively small for most workloads.
• There were lots of engineering challenges, but most seem to have been overcome.

Final Thoughts
• Everything in this paper seems, in retrospect, to be totally obvious. However, the combination of all of these factors seems like it would have taken just a ton of work, and I don't think I could have done it half as well, to be honest.
• Targeting a non-existent machine seems a little silly.
• Overall, an interesting paper.