Supporting Multi-Processors
Bernard Wong
February 17, 2003
Uni-processor systems
 Began with uni-processor systems
 Simple to implement a uni-processor OS; allows for many assumptions
• UMA, efficient locks (small impact on throughput), straightforward cache coherency
 Hard to make faster
Small SMP systems
 Multiple symmetric processors
 Requires some modifications to the OS
 Still allows for UMA
 System/memory bus becomes a contended resource
 Locks have a larger impact on throughput
• e.g., a lock held by one process can block another process (running on another processor) from making progress
• Must introduce finer-grained locks to improve scalability (see the sketch below)
 System bus limits system size
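
As a rough illustration of finer-grained locking (not code from the slides), this C sketch gives a hash table one lock per bucket rather than one lock for the whole table, so threads on different processors contend only when they touch the same bucket:

    #include <pthread.h>

    #define NBUCKETS 256

    struct node { int key, value; struct node *next; };

    struct table {
        /* Fine-grained: one lock per bucket instead of one per table.
         * Assume each lock is initialised with pthread_mutex_init. */
        pthread_mutex_t lock[NBUCKETS];
        struct node    *head[NBUCKETS];
    };

    void table_insert(struct table *t, struct node *n)
    {
        unsigned b = (unsigned)n->key % NBUCKETS;
        pthread_mutex_lock(&t->lock[b]);   /* contends only within this bucket */
        n->next = t->head[b];
        t->head[b] = n;
        pthread_mutex_unlock(&t->lock[b]);
    }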
Large Shared Memory Multi-processor
 Consists of many nodes, each of which may be a uni-processor or an SMP
 Access to memory is often NUMA; some machines do not even provide cache coherency
 Performance is very poor if used with an off-the-shelf SMP OS
 Requirements for good performance:
• Locality of service to request
• Independence between services
DISCO
 Uses virtual machine monitors to run multiple commodity OSes on a scalable multi-processor
Virtual Machine Monitor
 Additional layer between the OS and the hardware
 Virtualizes processor, memory, and I/O
 OS unaware of virtualization (ideally)
 Exports a simple, general interface to the commodity OS
DISCO Architecture
[Figure: commodity OSes (several per-VM OSes, an SMP-OS, and a thin OS) run on top of DISCO, which runs directly on the processing elements (PEs) of a ccNUMA multiprocessor joined by an interconnect]
Implementation Details
 Virtual CPUs
 Uses direct execution on the real CPU
• Fast: most instructions run at native speed
 Must detect and emulate operations that cannot safely be exported to the VM
• Primarily privileged instructions: TLB modification, direct physical memory or I/O operations
 Must also keep a data structure to save registers and other state (see the sketch below)
• For when the virtual CPU is not scheduled on a real CPU
 Virtual CPUs use affinity scheduling to maintain cache locality
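
As a hedged illustration of the saved-state requirement, the C sketch below shows one possible per-virtual-CPU structure and save routine; struct vcpu_state, vcpu_save, and the field names are invented for this example, not DISCO's:

    #include <stdint.h>
    #include <string.h>

    #define NREGS 32

    struct vcpu_state {
        uint64_t regs[NREGS];   /* general-purpose registers               */
        uint64_t pc;            /* program counter                         */
        uint64_t status;        /* privileged status register (guest view) */
        int      last_cpu;      /* remembered for affinity scheduling      */
    };

    /* Called when a virtual CPU is descheduled from a real CPU: its live
     * register state is copied into the vcpu structure so it can later be
     * resumed, preferably on last_cpu to preserve cache locality. */
    void vcpu_save(struct vcpu_state *v, const uint64_t live[NREGS],
                   uint64_t pc, uint64_t status, int cpu)
    {
        memcpy(v->regs, live, sizeof v->regs);
        v->pc       = pc;
        v->status   = status;
        v->last_cpu = cpu;
    }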
Implementation Details
 Virtual Physical Memory
 Adds a level of address translation
 Maintains physical-to-machine address mappings
• Because each VM uses physical addresses that start at 0 and continue for the size of the VM's memory
 Performed by emulating TLB instructions (see the sketch below)
• When the OS tries to insert an entry into the TLB, DISCO intercepts it and inserts the translated version
 TLB flushed on virtual CPU switches
• TLB lookups also more expensive due to the required trap
• A second-level software TLB added to improve performance
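
The interception might look roughly like this hedged C sketch, where struct vm, pmap, and emulate_tlb_insert are invented names standing in for DISCO's internals:

    #include <stdint.h>

    #define GUEST_PAGES 4096              /* pages of guest "physical" memory */

    struct vm {
        uint64_t pmap[GUEST_PAGES];       /* guest physical page -> machine page */
    };

    /* Real hardware TLB write; stub supplied by the platform. */
    void hw_tlb_insert(uint64_t vpage, uint64_t mpage);

    /* Trap handler for the guest's TLB-write instruction: install a
     * virtual -> machine translation instead of virtual -> physical. */
    void emulate_tlb_insert(struct vm *vm, uint64_t vpage, uint64_t gppage)
    {
        uint64_t mpage = vm->pmap[gppage];    /* physical -> machine lookup */
        hw_tlb_insert(vpage, mpage);
    }

The second-level software TLB mentioned above would sit in front of the guest's miss handler, caching recent virtual-to-machine translations so many hardware TLB misses avoid re-entering the guest kernel.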
Implementation Details
 Virtual I/O
 Intercepts all device accesses from the VM through special OS device drivers
 Virtualizes both disk and network I/O
 DISCO provides both persistent and non-persistent disks
• Persistent disks cannot be shared
• Non-persistent disks implemented via copy-on-write
Why use a VMM?
 DISCO is aware of NUMA-ness
 Hides NUMA-ness from the commodity OS
 Requires less work than engineering a NUMA-aware OS
 Performs better than a NUMA-unaware OS
 A good middle ground
 How?
• Dynamic page migration and page replication
• Maintains locality between a virtual CPU's cache misses and the memory pages on which those misses occur
Memory Management
 Pages heavily accessed by only one node are migrated to that node (see the sketch below):
• Change the physical-to-machine address mapping
• Invalidate TLB entries that point to the old location
• Copy the page to the local node
 Pages that are heavily read-shared are replicated to the nodes heavily accessing them:
• Downgrade TLB entries pointing to the page to read-only
• Copy the page
• Update the relevant TLB entries to the local machine copy and remove read-only
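
The migration steps above might look roughly like the following hedged C sketch (all names are invented, and a real monitor must also synchronize against concurrent accesses and faults):

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    extern uint64_t pmap[];                  /* guest physical -> machine page */

    /* Platform stubs, assumed to exist. */
    void     tlb_shootdown(uint64_t mpage);  /* invalidate all entries for page */
    void    *machine_page_addr(uint64_t mpage);
    uint64_t alloc_page_on_node(int node);

    void migrate_page(uint64_t gppage, int hot_node)
    {
        uint64_t old_mpage = pmap[gppage];
        uint64_t new_mpage = alloc_page_on_node(hot_node);

        tlb_shootdown(old_mpage);                        /* invalidate old TLB entries */
        memcpy(machine_page_addr(new_mpage),
               machine_page_addr(old_mpage), PAGE_SIZE); /* copy to the local node */
        pmap[gppage] = new_mpage;                        /* change phys -> machine map */
    }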
Page Replication
Aren't VMs memory inefficient?
 Traditionally, VMs tend to replicate the memory used for each system image
 Additionally, structures such as the disk cache are not shared
 DISCO uses the notion of a global buffer cache to reduce the memory footprint
Page sharing
 DISCO keeps a data structure that maps disk sectors to memory addresses (see the sketch below)
 If two VMs request the same disk sector, both are assigned the same read-only buffer page
 Modifications to pages are performed via copy-on-write
• Only works for non-persistent copy-on-write disks
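
A hypothetical C sketch of the sector-to-page map follows; the flat array and all names are illustrative simplifications, not DISCO's actual structure:

    #include <stdint.h>

    #define NSECTORS (1u << 20)
    #define NO_PAGE  ((uint64_t)-1)

    struct vm;                               /* opaque per-VM handle */

    /* Global map from disk sector to machine page; assume every entry
     * is initialised to NO_PAGE at boot. */
    static uint64_t sector_to_page[NSECTORS];

    /* Platform stubs, assumed to exist. */
    uint64_t read_sector_into_new_page(uint64_t sector);
    void     map_readonly(struct vm *vm, uint64_t gppage, uint64_t mpage);

    void vm_disk_read(struct vm *vm, uint64_t sector, uint64_t gppage)
    {
        if (sector_to_page[sector] == NO_PAGE)
            sector_to_page[sector] = read_sector_into_new_page(sector);

        /* Every VM reading this sector shares the same read-only machine
         * page; a later write faults and is handled copy-on-write. */
        map_readonly(vm, gppage, sector_to_page[sector]);
    }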
Page sharing
 Sharing remains effective even for data transferred between VMs via NFS packets
Virtualization overhead
Data sharing
Workload scalability
Performance Benefits of Page Migration/Replication
Tornado
 An OS designed to take advantage of shared-memory multi-processors
 Object-oriented structure
 Every virtual and physical resource represented by an independent object
 Ensures natural locality and independence
• Resource lock and data structures stored on the same node as the resource
• Resources managed independently and at a fine grain
• No global source of contention
OO structure
 Example: page fault
 A separate File Cache Manager (FCM) object for each region of memory
 COR -> Cached Object Representative
 All objects are specific to either the faulting process or the file(s) backing the process
 Problem: hard to make global policies
Clustered objects
 Even with OO, widely shared objects can be expensive due to contention
 Need replication, distribution, and partitioning to reduce contention
 Clustered objects are a systematic way to do this
 Gives the illusion of a single object, but is actually composed of multiple component (rep) objects
• Each component handles a subset of the processors
• Must handle consistency across reps
Clustered object implementation
 Per-processor translation table (see the sketch below)
• Contains a pointer to the local rep of each clustered object
• Created on demand via a combination of a global miss-handling object and a clustered-object-specific miss-handling object
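
A hedged C sketch of the table and its miss path, with invented names (xlat, resolve, miss_handler) standing in for Tornado's internals:

    #include <stddef.h>

    #define MAX_OBJS 1024
    #define MAX_CPUS 64

    struct rep;                               /* per-processor representative */

    struct clustered_obj {
        /* Object-specific miss handler: create or locate this processor's
         * rep (possibly sharing one rep among several processors). */
        struct rep *(*miss_handler)(struct clustered_obj *self, int cpu);
    };

    /* One translation table per processor: entry i caches a pointer to
     * this processor's rep for clustered object i, NULL until first use. */
    static struct rep *xlat[MAX_CPUS][MAX_OBJS];

    struct rep *resolve(struct clustered_obj *objs[], int obj_id, int cpu)
    {
        struct rep *r = xlat[cpu][obj_id];
        if (r == NULL) {                      /* miss: fill entry on demand */
            r = objs[obj_id]->miss_handler(objs[obj_id], cpu);
            xlat[cpu][obj_id] = r;
        }
        return r;
    }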
Memory Allocation
 Need an efficient, highly concurrent allocator that maximizes locality
 Use local pools of memory
 However, for small-block allocation, there is still the problem of false sharing
 An additional small pool of strictly local memory is used (see the sketch below)
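
A minimal, hypothetical C sketch of the strictly local pool: a per-processor bump allocator that rounds every allocation up to a full cache line, so no two processors' data ever share one:

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE 64
    #define MAX_CPUS   64
    #define POOL_SIZE  (64 * 1024)

    struct pool {
        _Alignas(CACHE_LINE) uint8_t mem[POOL_SIZE];
        size_t next;
    };

    static struct pool local_pool[MAX_CPUS];   /* one pool per processor */

    /* Bump allocation rounded up to a whole cache line, so no two
     * allocations (hence no two processors) ever share a line. */
    void *alloc_local(int cpu, size_t size)
    {
        struct pool *p = &local_pool[cpu];
        size = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
        if (p->next + size > POOL_SIZE)
            return NULL;                       /* pool exhausted (sketch only) */
        void *m = &p->mem[p->next];
        p->next += size;
        return m;
    }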
Synchronization
 The use of objects, and additionally of clustered objects, reduces the scope of each lock and limits lock contention to that of a single rep
 Existence guarantees are hard
• A thread must determine whether an object is currently being de-allocated by another thread
• Often requires a lock hierarchy whose root is a global lock
 Tornado uses a semi-automatic garbage collector (see the sketch below)
• Threads never need to test for existence, so no locking is required
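
One common way to provide such existence guarantees without per-access locks is per-processor in-flight counters; the C sketch below is an illustrative approximation of that idea, not Tornado's exact scheme:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_CPUS 64

    struct obj {
        atomic_int  inflight[MAX_CPUS];  /* calls currently inside the object */
        atomic_bool dying;               /* set once unlinked from all lists  */
    };

    /* Wrapped around every method call; no lock is taken on the hot path. */
    bool obj_enter(struct obj *o, int cpu)
    {
        atomic_fetch_add(&o->inflight[cpu], 1);
        if (atomic_load(&o->dying)) {            /* lost a race with destruction */
            atomic_fetch_sub(&o->inflight[cpu], 1);
            return false;
        }
        return true;
    }

    void obj_exit(struct obj *o, int cpu)
    {
        atomic_fetch_sub(&o->inflight[cpu], 1);
    }

    /* The collector frees the object only after 'dying' is set and every
     * processor's inflight count has drained to zero. */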
Protected Procedure Calls
 Since Tornado is a microkernel, IPC traffic is significant
 Need a fast IPC mechanism that maintains locality
 Protected Procedure Calls (PPCs) maintain locality by:
• Spawning a new server thread on the same processor as the client to service the client's request (see the sketch below)
• Keeping all client-specific data in data structures stored on the client
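
A hedged C sketch of the dispatch idea, with invented names (ppc_call, run_on_cpu): the request is serviced by a thread on the client's own processor, so neither the call nor the client state crosses a memory-node boundary:

    #include <stddef.h>

    struct client_state;                 /* lives on the client's processor */

    struct ppc_request {
        int   client_cpu;                /* processor the client runs on */
        void (*service)(struct client_state *);
        struct client_state *state;      /* all client-specific data     */
    };

    /* Platform stub: run fn(arg) on a server thread bound to 'cpu'. */
    void run_on_cpu(int cpu, void (*fn)(struct client_state *),
                    struct client_state *arg);

    void ppc_call(struct ppc_request *req)
    {
        /* Dispatch to the caller's own processor, preserving locality. */
        run_on_cpu(req->client_cpu, req->service, req->state);
    }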
Performance
 Comparison to other large shared-memory multi-processors
Performance (n threads in 1 process)
Performance (n threads in n processes)
Conclusion
 Illustrated two different approaches to making efficient use of shared-memory multi-processors
 DISCO adds an extra layer between the hardware and the OS
• Less engineering effort, more overhead
 Tornado redesigns the OS to take advantage of locality and independence
• More engineering effort and less overhead, but local and independent algorithms may work poorly with real-world loads