Disco: Running Commodity Operating Systems on Scalable Multiprocessors
Edouard Bugnion, Scott Devine, Mendel Rosenblum
Stanford University, 1997
Presented by Divya Parekh
1
Outline
- Virtualization
- Disco description
- Disco performance
- Discussion
2
Virtualization
- "A technique for hiding the physical characteristics of computing resources from the way in which other systems, applications, or end users interact with those resources. This includes making a single physical resource appear to function as multiple logical resources; or it can include making multiple physical resources appear as a single logical resource."
3
Old idea from the 1960s
- IBM VM/370 – a VMM for IBM mainframes
  - Multiple OS environments on expensive hardware
  - Desirable when machines were scarce
- Popular research idea in the 1960s and 1970s
  - Entire conferences on virtual machine monitors
  - Hardware, VMM, and OS designed together
- Interest died out in the 1980s and 1990s
  - Hardware became cheaper
  - Operating systems became more powerful (e.g. multi-user)
4
A Return to Virtual Machines
- Disco: Stanford research project (SOSP '97)
  - Run commodity OSes on scalable multiprocessors
  - Focus on the high end: NUMA, MIPS, IRIX
- Commercial virtual machines for the x86 architecture
  - VMware Workstation (now EMC) (1999-)
  - Connectix VirtualPC (now Microsoft)
- Research virtual machines for the x86 architecture
  - Xen (SOSP '03)
  - plex86
- OS-level virtualization
  - FreeBSD Jails, User-mode Linux, UMLinux
5
Overview
- Virtual Machine
  - "A fully protected and isolated copy of the underlying physical machine's hardware" (definition by IBM)
- Virtual Machine Monitor
  - A thin layer of software between the hardware and the operating system, virtualizing and managing all hardware resources
  - Also known as a "hypervisor"
6
Classification of Virtual Machines
7
Classification of Virtual Machines
- Type I
  - The VMM is implemented directly on the physical hardware.
  - The VMM performs the scheduling and allocation of the system's resources.
  - Examples: IBM VM/370, Disco, VMware ESX Server, Xen
- Type II
  - The VMM is built completely on top of a host OS.
  - The host OS provides resource allocation and a standard execution environment to each guest OS.
  - Examples: User-mode Linux (UML), UMLinux
8
Non-Virtualizable Architectures
- According to Popek and Goldberg, "an architecture is virtualizable if the set of sensitive instructions is a subset of the set of privileged instructions."
- x86
  - Several instructions can read system state at CPL 3 (user mode) without trapping
- MIPS
  - KSEG0 bypasses the TLB and reads physical memory directly
9
Type I (contd.)
- Hardware support for virtualization
  - Figure: the hardware-assisted approach to x86 virtualization
  - E.g. Intel Vanderpool/VT and AMD-V/SVM
10
Type I (contd.)
- Full virtualization
  - Figure: the binary translation approach to x86 virtualization
  - E.g. VMware ESX Server
11
Type I (contd.)
- Paravirtualization
  - Figure: the paravirtualization approach to x86 virtualization
  - E.g. Xen
12
Type II
- Hosted VM architecture
  - E.g. VMware Workstation, Connectix VirtualPC
13
Disco: VMM Prototype
- Goals
  - Extend modern OSes to run efficiently on shared-memory multiprocessors without large changes to the OS
  - A VMM built to run multiple copies of the Silicon Graphics IRIX operating system on the Stanford FLASH shared-memory multiprocessor
14
Problem Description
- Multiprocessors on the market (1990s)
  - Innovative hardware
  - Hardware advancing faster than system software
  - Customized OSes are late, incompatible, and possibly buggy
- Commodity OSes are not suited to multiprocessors
  - Do not scale because of lock contention and memory architecture
  - Do not isolate/contain faults
  - More processors → more failures
15
Solutions to the Problem
- Resource-intensive modification of the OS (hard and time consuming, increases size, etc.)
- Insert a virtual machine monitor (software) between the OS and the hardware to resolve the problem
16
Two Opposite Ways for System Software
- Address these challenges in the operating system: OS-intensive
  - Hive, Hurricane, Cellular IRIX, etc.
  - Innovative, single system image
  - But a large effort
- Hard-partition the machine into independent failure units: OS-light
  - Sun Enterprise 10000 machine
  - Partial single system image
  - Cannot dynamically adapt the partitioning
17
Return to Virtual Machine Monitors
- A compromise between OS-intensive and OS-light: the VMM
- Virtual machine monitors, in combination with commodity and specialized operating systems, form a flexible system software solution for these machines
- Disco was introduced to allow a trade-off between performance cost and development cost
18
Architecture of Disco
19
Advantages of This Approach
- Scalability
- Flexibility
- Hides NUMA effects
- Fault containment
- Compatibility with legacy applications
20
Challenges Facing Virtual Machines
- Overheads
  - Trapping and emulating privileged instructions of the guest OS
  - Access to I/O devices
  - Replication of memory in each VM
- Resource management
  - Lack of information to make good policy decisions
- Communication and sharing
  - Stand-alone VMs cannot communicate
21
Disco's Interface
- Processors
  - MIPS R10000 processor
  - Emulates all instructions, the MMU, and the trap architecture
  - Extensions to support common processor operations (enabling/disabling interrupts, accessing privileged registers)
- Physical memory
  - Contiguous, starting at address 0
- I/O devices
  - Devices such as disks and network interfaces are virtualized and appear exclusive to each VM
  - Physical devices are multiplexed by Disco
  - Special abstractions for SCSI disks and network interfaces
    - Virtual disks for VMs
    - Virtual subnet across all virtual machines
22
Disco Implementation
- Multi-threaded shared-memory program
- Attention to NUMA memory placement, cache-aware data structures, and interprocessor communication patterns
- The code segment of Disco is copied to the memory of each FLASH node for data locality
- Communication uses shared memory
23
Virtual CPUs
- Direct execution
  - Execution of the virtual CPU on the real CPU
  - Sets the real machine's registers to the virtual CPU's
  - Jumps to the current PC of the virtual CPU and executes directly on the real CPU
- Challenges
  - Detection and fast emulation of operations that cannot be safely exported to the virtual machine: privileged instructions such as TLB modification, and direct access to physical memory and I/O devices
- Maintains a data structure for each virtual CPU for trap emulation (see the sketch below)
- The scheduler multiplexes virtual CPUs on the real processors
24
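The following is a minimal C sketch of the trap-and-emulate idea described above: the monitor keeps the privileged state of each virtual CPU in software and updates it when a guest's privileged instruction traps. The names (vcpu_state, emulate_privileged) and the set of operations are hypothetical, not Disco's actual code, which decodes real MIPS instructions.

```c
/* Hypothetical trap-and-emulate dispatch for one virtual CPU.
 * Names and operations are illustrative, not Disco's real code. */
#include <stdint.h>

enum priv_op { OP_DISABLE_INTR, OP_ENABLE_INTR, OP_READ_PRIV_REG, OP_TLB_WRITE };

struct vcpu_state {
    uint64_t gpr[32];        /* guest general-purpose registers        */
    uint64_t priv_reg[32];   /* privileged registers kept in software  */
    int      intr_enabled;   /* virtual interrupt-enable flag          */
    uint64_t pc;             /* guest program counter                  */
};

/* Called when the guest kernel, running without real privileges,
 * executes a privileged instruction and the real CPU traps into the monitor. */
static void emulate_privileged(struct vcpu_state *vcpu, enum priv_op op, int reg)
{
    switch (op) {
    case OP_DISABLE_INTR:
        vcpu->intr_enabled = 0;               /* only the virtual CPU's flag   */
        break;
    case OP_ENABLE_INTR:
        vcpu->intr_enabled = 1;
        break;
    case OP_READ_PRIV_REG:
        vcpu->gpr[reg] = vcpu->priv_reg[reg]; /* serve from the shadow copy    */
        break;
    case OP_TLB_WRITE:
        /* A real monitor would translate the guest's physical address to a
         * machine address here before touching the hardware TLB. */
        break;
    }
    vcpu->pc += 4;   /* skip the emulated instruction and resume the guest */
}
```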
Virtual Physical Memory
- Adds a level of address translation and maintains physical-to-machine address (40-bit) mappings
- Virtual machines use physical addresses
- Uses the software-reloaded translation-lookaside buffer (TLB) of the MIPS processor
- Maintains a pmap data structure for each VM, with one entry per physical page holding its physical-to-machine mapping (see the sketch below)
- Each pmap entry also has a back pointer to its virtual address to help invalidate mappings in the TLB
25
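A small C sketch of the extra translation level: on a software TLB refill, the guest's page table yields a "physical" address, and the monitor then consults a pmap-like table to find the backing machine page. The structure layout and names are assumptions for illustration, not the real pmap.

```c
/* Illustrative guest-physical to machine-address translation.
 * Layout and names are assumptions, not Disco's real pmap. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12

struct pmap_entry {
    uint64_t machine_pfn;   /* machine page backing this guest-physical page */
    uint64_t guest_vaddr;   /* back pointer used to invalidate TLB mappings  */
    int      valid;
};

struct vm {
    struct pmap_entry *pmap;   /* one entry per guest-physical page */
    size_t             npages;
};

/* On a TLB miss, the guest OS's page table gives a "physical" address;
 * the monitor maps it to a machine address before writing the hardware TLB. */
static int translate_phys_to_machine(const struct vm *vm, uint64_t guest_paddr,
                                     uint64_t *machine_addr)
{
    uint64_t pfn = guest_paddr >> PAGE_SHIFT;
    if (pfn >= vm->npages || !vm->pmap[pfn].valid)
        return -1;                                  /* unmapped: fault into the monitor */
    *machine_addr = (vm->pmap[pfn].machine_pfn << PAGE_SHIFT)
                  | (guest_paddr & ((1u << PAGE_SHIFT) - 1));
    return 0;
}
```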
Virtual Physical Memory (contd.)
- Kernel-mode references on MIPS processors access memory and I/O directly (unmapped segment), so OS code and data must be relinked into a mapped address space
- MIPS tags each TLB entry with an address space identifier (ASID)
- ASIDs are not virtualized, so the TLB must be flushed on VM context switches
- Increased TLB misses in workloads
  - Additional operating system references
  - VM context switches
- TLB misses are expensive, so Disco adds a second-level software TLB, an idea similar to a cache (see the sketch below)
26
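Below is a possible shape for a second-level software TLB: a small direct-mapped cache of recent virtual-to-machine translations consulted on a hardware TLB miss before falling back to the slower page-table and pmap walk. The table size, indexing, and field names are assumptions.

```c
/* Hypothetical second-level software TLB consulted on a hardware TLB miss. */
#include <stdint.h>
#include <stdbool.h>

#define L2TLB_ENTRIES 4096              /* size is an assumption */

struct l2tlb_entry {
    uint64_t vpage;                     /* guest virtual page number */
    uint64_t machine_pfn;               /* cached translation        */
    uint8_t  asid;                      /* guest address-space id    */
    bool     valid;
};

static struct l2tlb_entry l2tlb[L2TLB_ENTRIES];

static inline unsigned l2tlb_index(uint64_t vpage)
{
    return (unsigned)(vpage & (L2TLB_ENTRIES - 1));
}

static bool l2tlb_lookup(uint64_t vpage, uint8_t asid, uint64_t *machine_pfn)
{
    struct l2tlb_entry *e = &l2tlb[l2tlb_index(vpage)];
    if (e->valid && e->vpage == vpage && e->asid == asid) {
        *machine_pfn = e->machine_pfn;  /* fast path: refill the hardware TLB   */
        return true;
    }
    return false;                       /* slow path: walk guest page table + pmap */
}

static void l2tlb_insert(uint64_t vpage, uint8_t asid, uint64_t machine_pfn)
{
    struct l2tlb_entry *e = &l2tlb[l2tlb_index(vpage)];
    e->vpage = vpage;
    e->asid = asid;
    e->machine_pfn = machine_pfn;
    e->valid = true;
}
```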
NUMA Memory Management
- Cache misses should be satisfied from local memory (fast) rather than remote memory (slow)
- Dynamic page migration and replication
  - Pages frequently accessed by one node are migrated to that node
  - Read-shared pages are replicated among the nodes that use them
  - Write-shared pages are not moved, since maintaining consistency requires remote accesses anyway
  - The migration and replication policy is driven by the cache-miss-counting facility provided by the FLASH hardware (a policy sketch follows)
27
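A rough C sketch of such a policy, deciding between migrating and replicating a page from per-node miss counters. The threshold, counter layout, and the helper functions migrate_page/replicate_page are assumptions, not the FLASH hardware's or Disco's actual interfaces.

```c
/* Illustrative migrate-or-replicate policy driven by per-node miss counters. */
#include <stdint.h>

#define MAX_NODES      64
#define HOT_THRESHOLD 128          /* miss count that makes a page "hot" (assumed) */

struct page_stats {
    uint32_t misses[MAX_NODES];    /* per-node cache-miss counters         */
    int      writable;             /* does any writable mapping exist?     */
    int      home_node;            /* node whose memory currently holds it */
};

static void migrate_page(uint64_t pfn, int node)   { (void)pfn; (void)node; /* stub */ }
static void replicate_page(uint64_t pfn, int node) { (void)pfn; (void)node; /* stub */ }

static void numa_policy(uint64_t pfn, struct page_stats *st, int nnodes)
{
    int hot_node = -1, hot_nodes = 0;
    for (int n = 0; n < nnodes; n++) {
        if (st->misses[n] >= HOT_THRESHOLD) {
            hot_node = n;
            hot_nodes++;
        }
    }
    if (hot_nodes == 0)
        return;                                     /* not worth moving */
    if (hot_nodes == 1) {
        if (hot_node != st->home_node)
            migrate_page(pfn, hot_node);            /* mostly used by one remote node */
    } else if (!st->writable) {
        for (int n = 0; n < nnodes; n++)
            if (st->misses[n] >= HOT_THRESHOLD && n != st->home_node)
                replicate_page(pfn, n);             /* read-shared: replicate */
    }
    /* Write-shared pages hot on several nodes stay put, since keeping
     * replicas consistent would require remote accesses anyway. */
}
```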
Transparent Page Replication
1. Two different virtual processors of the same virtual machine logically read-share the same physical page, but each virtual processor accesses a local copy.
2. memmap tracks which virtual pages reference each machine page; it is used during TLB shootdown (see the sketch below).
28
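One plausible memmap-style structure: for each machine page, keep back pointers to the virtual CPUs and guest virtual addresses mapping it, so every stale TLB entry can be found and invalidated when the page is migrated or replicated. Field names, the bounded list, and the invalidation helper are assumptions for illustration.

```c
/* Sketch of per-machine-page back pointers used for TLB shootdown. */
#include <stdint.h>

#define MAX_MAPPERS 8                  /* assumption: bounded back-pointer list */

struct backmap {
    int      vcpu;                     /* virtual CPU holding a TLB entry     */
    uint64_t guest_vaddr;              /* guest virtual address of that entry */
};

struct memmap_entry {
    struct backmap mappers[MAX_MAPPERS];
    int            nmappers;
};

static void invalidate_tlb_entry(int vcpu, uint64_t guest_vaddr)
{
    (void)vcpu; (void)guest_vaddr;     /* stub: would clear the hardware/software TLB */
}

/* Remove every TLB entry that still points at this machine page. */
static void tlb_shootdown(struct memmap_entry *mm)
{
    for (int i = 0; i < mm->nmappers; i++)
        invalidate_tlb_entry(mm->mappers[i].vcpu, mm->mappers[i].guest_vaddr);
    mm->nmappers = 0;
}
```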
Disco Memory Management
29
Virtual I/O Devices
- Disco intercepts all device accesses from the virtual machine and forwards them to the physical devices
- Special device drivers are added to the guest OS
- Disco devices provide a monitor-call interface that passes all the arguments of a request in a single trap (see the sketch below)
- A single VM with exclusive access to a device does not require virtualizing the I/O; Disco only needs to ensure exclusivity
30
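A sketch of how a paravirtual guest disk driver could batch a request into one monitor call instead of many trapped device-register accesses. The call number, request layout, and monitor_call stub are hypothetical; on MIPS the call would be a trap instruction into the monitor.

```c
/* Hypothetical single-trap monitor-call interface for a guest disk driver. */
#include <stdint.h>
#include <stddef.h>

#define MCALL_DISK_RW 0x10             /* illustrative monitor-call number */

struct disk_request {
    uint64_t sector;                   /* starting disk sector              */
    uint64_t guest_paddr;              /* guest-physical DMA buffer address */
    uint32_t nbytes;
    uint32_t write;                    /* 0 = read, 1 = write               */
};

/* In a real guest this would be a trap into the monitor; stubbed here. */
static long monitor_call(int callno, void *arg, size_t len)
{
    (void)callno; (void)arg; (void)len;
    return 0;
}

/* One trap carries the whole request. */
static long disco_disk_rw(uint64_t sector, void *buf, uint32_t nbytes, int write)
{
    struct disk_request req = {
        .sector      = sector,
        .guest_paddr = (uint64_t)(uintptr_t)buf,   /* simplification: identity-mapped buffer */
        .nbytes      = nbytes,
        .write       = (uint32_t)write,
    };
    return monitor_call(MCALL_DISK_RW, &req, sizeof req);
}
```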
Copy-on-write Disks
- Disco intercepts DMA requests and translates the physical addresses into machine addresses
- If the data is already in memory, the machine page is mapped read-only into the DMA destination address → machine memory is shared
- Attempts to modify a shared page result in a copy-on-write fault handled internally by the monitor (see the sketch below)
  - A log of modifications is maintained for each VM
  - Modifications are made in main memory only
- Non-persistent disks are copy-on-write shared
  - E.g. kernel text and the buffer cache
  - E.g. file system root disks
31
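An illustrative copy-on-write fault handler for pages shared between virtual machines (for example, kernel text or buffer-cache pages backed by a shared copy-on-write disk). The structure names and helpers are assumptions, not Disco's real code.

```c
/* Sketch: give the faulting VM a private, writable copy of a shared page. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct machine_page {
    uint32_t refcount;                 /* how many VMs map this machine page */
    uint8_t  data[PAGE_SIZE];
};

static struct machine_page page_pool[1];          /* stand-in allocator state */
static struct machine_page *alloc_machine_page(void) { return &page_pool[0]; } /* stub */
static void map_page(int vm_id, uint64_t guest_pfn,
                     struct machine_page *mp, int writable)
{ (void)vm_id; (void)guest_pfn; (void)mp; (void)writable; /* stub */ }

/* Called when a VM writes to a read-only shared page. */
static void cow_fault(int vm_id, uint64_t guest_pfn, struct machine_page *shared)
{
    if (shared->refcount == 1) {
        map_page(vm_id, guest_pfn, shared, 1);    /* last user: just make it writable */
        return;
    }
    struct machine_page *copy = alloc_machine_page();
    memcpy(copy->data, shared->data, PAGE_SIZE);  /* private copy for this VM */
    copy->refcount = 1;
    shared->refcount--;
    map_page(vm_id, guest_pfn, copy, 1);          /* remap the faulting page writable */
}
```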
Transparent Sharing of Pages
- Creates a global buffer cache shared across VMs and reduces the memory footprint of the system
32
Virtual Network Interface
- The virtual subnet and network interface use copy-on-write mappings to share read-only pages
- Persistent disks can be accessed using a standard distributed file system protocol (NFS)
- Provides a global buffer cache that is transparently shared by independent VMs
33
Transparent Sharing of Pages over NFS
1. The monitor's networking device remaps the data page from the source's machine address space to the destination's.
2. The monitor remaps the data page from the driver's mbuf to the client's buffer cache (see the sketch below).
34
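A small sketch of the zero-copy idea on the virtual subnet: instead of copying the payload, the monitor makes the sender's page copy-on-write and maps the same machine page read-only into the receiver. All names are illustrative assumptions.

```c
/* Sketch: transfer a page over the virtual subnet by remapping, not copying. */
#include <stdint.h>

static void mark_copy_on_write(int vm_id, uint64_t guest_pfn)
{ (void)vm_id; (void)guest_pfn; /* stub: write-protect the sender's mapping */ }

static void map_page_readonly(int vm_id, uint64_t guest_pfn, uint64_t machine_pfn)
{ (void)vm_id; (void)guest_pfn; (void)machine_pfn; /* stub: install receiver mapping */ }

static void vnet_transfer_page(int src_vm, uint64_t src_pfn,
                               int dst_vm, uint64_t dst_pfn,
                               uint64_t machine_pfn)
{
    /* Sender keeps the data but loses write permission... */
    mark_copy_on_write(src_vm, src_pfn);
    /* ...and the receiver's buffer page now names the same machine page. */
    map_page_readonly(dst_vm, dst_pfn, machine_pfn);
}
```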
Modifications to the IRIX 5.3 OS
- Minor changes to the kernel code and data segment, specific to MIPS
  - Relocate the unmapped segment of the virtual machine into the mapped supervisor segment of the processor (kernel relocation)
  - Disco drivers are the same as the original IRIX device drivers
  - Patched the HAL to use memory loads/stores instead of privileged instructions
35
Modifications to the IRIX 5.3 OS (contd.)
- Added code to the HAL to pass hints to the monitor for resource management
- New monitor calls to request zeroed pages and to reclaim unused memory
- Changed mbuf management to be page-aligned
- Changed bcopy to use remap (with copy-on-write)
36
SPLASHOS: A Specialized OS
- A thin, specialized library OS supported directly by Disco
- No need for a virtual memory subsystem, since the applications and the OS share a single address space
- Used for parallel scientific applications that can span the entire machine
37
Disco: Performance
- Experimental setup
  - Disco targets the FLASH machine, which was not available at the time
  - Used SimOS, a machine simulator that models the hardware of MIPS-based multiprocessors, to run the Disco monitor
  - The simulator was too slow to allow long workloads to be studied
38
Disco: Performance
- Workloads
39
Disco: Performance
- Execution overhead
  - Pmake overhead is due to I/O virtualization; the rest is due to TLB mapping
  - Reduction of kernel time
  - On average, a virtualization overhead of 3% to 16%
40
Disco: Performance
- Memory overheads
  - V: Pmake memory used if there is no sharing
  - M: Pmake memory used if there is sharing
41
Disco: Performance
- Scalability
  - Partitioning the problem into different VMs increases scalability
  - Kernel synchronization time becomes smaller
42
Disco: Performance
- Dynamic page migration and replication
43
Conclusion
- The Disco VMM hides NUMA-ness from a non-NUMA-aware OS
- The Disco VMM is a low(er)-effort approach
- Moderate overhead due to virtualization
44
Discussion
- Was the Disco VMM done right?
- Virtual physical memory on architectures other than MIPS
  - The MIPS TLB is software-managed
- Not sure how well other OSes would perform on Disco, since IRIX was designed for MIPS
- Not sure how Hive or Hurricane would perform in comparison
- Performance of long workloads on the system
- Performance of heterogeneous VMs, e.g. the Pmake case
45
Discussion
- Are VMMs microkernels done right?
46