CS 422/522 Design & Implementation of Operating Systems
Zhong Shao
Dept. of Computer Science
Yale University
Acknowledgement: some slides are taken from previous versions of the CS422/522 lectures taught by Prof. Bryan Ford and Dr. David Wolinsky, and also from the official set of slides accompanying the OSPP textbook by Anderson and Dahlin.
◆
◆
◆
◆
◆
Process isolation
–
Keep a process from touching anyone else’s memory, or the kernel’s
Efficient interprocess communication
–
Shared regions of memory between processes
Shared code segments
–
E.g., common libraries used by many different programs
Program initialization
–
Start running a program before it is entirely in memory
Dynamic memory allocation
–
Allocate and initialize stack/heap pages on demand
12/1/15
1
◆
◆
◆
◆
Program debugging
–
Data breakpoints when address is accessed
Zero-copy I/O
–
Directly from I/O device into/out of user memory
Memory mapped files
–
Access file data using load/store instructions
Demand-paged virtual memory
–
Illusion of near-infinite memory, backed by disk or memory on other machines
◆
◆
◆
◆
◆
Checkpoint/restart
–
Transparently save a copy of a process, without stopping the program while the save happens
Persistent data structures
–
Implement data structures that can survive system reboots
Process migration
–
Transparently move processes between machines
Information flow control
–
Track what data is being shared externally
Distributed shared memory
–
Illusion of memory that is shared between machines
12/1/15
2
Server
Request
Buffer
1.
Network
Socket Read
3.
Kernel Copy
4.
Parse Request
Reply
Buffer
5.
File Read
8.
Kernel Copy
Kernel
2.
Copy Arriving
Packet (DMA)
Hardware
Network Interface
6.
Disk Request
7.
Disk Data
(DMA)
Disk Interface
9.
Format Reply
10.
Write and Copy to Kernel Buffer
12.
Format Outgoing
Packet and DMA
block aligned Read/Write system calls
Before Zero Copy
Empty Buffer
User Page Table
Full Kernel Buffer
After Zero Copy
Free Page
User Page Table
Full User and
Kernel Buffer
12/1/15
3
12/1/15
◆
◆
◆
◆
Qemu
Windows games on Linux and Mac (virtualize DirectX and WinAPI)
Android/iOS simulators (emulators) for app development
The cloud
Goals:
◆
Complete (machine-level) isolation (security)
◆
◆
Consolidation
Multipe Oses
◆
◆
Kernel development
….
4
◆
Instructions that violate requirements
–
Privileged – user mode trap into kernel
–
Trap and emulate
*
Execute code as normal
*
Execute privileged instruction
*
Triggers trap to hypervisor
*
Hypervisor handles and returns to user
◆ x86 hardware support
–
Instantiates different machine states: vmcs (virtual machine control structures):
–
Vmxon /vmxoff – turn on / off support
–
Vm enter and vm exit --- run / stop a VM
Guest Virtual
Address Space
Guest Virtual
Address
Guest Page Table
Guest Physical
Address
Guest Physical
Address Space
Guest Physical
Address
Host Page Table
Host Physical
Address
Host Physical
Memory
12/1/15
5
Guest Virtual
Address Space
Guest Virtual
Address
Guest Page Table
Guest Physical
Address
Guest Physical
Address Space
Guest Physical
Address
Host Page Table
Host Physical
Address
Host Physical
Memory
Shadow Page Table
Host Physical
Address
◆ x86 recently added hardware support for running virtual machines at user level
◆
Operating system kernel initializes two sets of translation tables
–
One for the guest OS
–
One for the host OS
◆
Hardware translates address in two steps
–
First using guest OS tables, then host OS tables
–
TLB holds composition
12/1/15
6
Intel® VT-x
Processor
Intel® VT-x
Hardware assists for robust virtualizaBon
Intel® VT FlexMigraBon – Flexible live migraBon
Intel® VT FlexPriority – Interrupt acceleraBon
Intel® EPT – Memory VirtualizaBon
Intel® VT-d
Chipset
Intel® VT-c
Network
Intel® VT for Directed I/O
Reliability and Security through device IsolaBon
I/O performance with direct assignment
Intel® VT for ConnecBvity
NIC Enhancement with VMDq
Single Root IOV support
Network Performance and reduced CPU uBlizaBon
Intel® I/OAT for virtualizaBon
Lower CPU Overhead and Data AcceleraBon
So$ware & Services Group
• Reduce overheads from virtualiza6on
• Intel® VT-x Latency Reduc6ons
• Extended Page Tables
• Virtual Processor IDs
• APIC Virtualiza6on (Flex Priority)
• I/O Assignment via DMA Remapping
• Introduce capabili6es that increase scaling out across VMs
• Intel® Hyper-Threading Technology
• PAUSE-loop Exi6ng
• Network Virtualiza6on
2.
… and scale out across VMs
1.
Reduce overhead within each VM…
OS
VMM
OS
OS
Host
14
Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
So$ware & Services Group
12/1/15
7
What makes VM Context Switching expensive
Saving-Loading privileged state
• Compounded by consistency checking
Addressing Context changes
• Transla6on Lookaside Buffer (TLB) flushes
15
Virtual Machine Ctl Structure (VMCS)
–
Maintains Guest & Host reg. state
–
“Backed” by host physical memory
–
Accessed via architectural VMREAD / VMWRITE
–
Enables caching of VMCS state on-die
Virtual Processor IDs (VPIDs)
–
Tag µarch structures (TLBs)
–
Removes need to flush TLBs
So$ware & Services Group
®
• VT-x® provides architected assists to allow guest OSes to run directly on hardware
• On Nehalem and Westmere VT-x is extended with:
Extended Page Tables (EPT)
Eliminates VM exits to the VMM for shadow pagetable maintenance
Virtual Processor IDs (VPID)
Guest Preemption Timer -- lets a VMM preempt a guest OS
Avoid flushes on VM transitions to give a lower-cost
VM transition time
Aids VMM vendors in flexibility and Quality of
Service (QoS)
Descriptor Table Exiting –Traps on modifications of guest DTs Allows VMM to protect a guest from internal attack
Transition Latency reductions
Continuing improvements in microarchitectural handling of VMM round trips
So$ware & Services Group
12/1/15
8
Address Transla6on
• Guest OS expects con6guous, zerobased physical memory
• VMM must preserve this illusion
Page-table Shadowing
• VMM intercepts paging opera6ons
• Constructs copy of page tables
Overheads
• VM exits add to execu6on 6me
• Shadow page tables consume significant host memory
VM
0
VM n
Induced
VM Exits
Guest
Page Tables
Remap
Guest
Page Tables
VMM
Shadow
Page Tables
TLB
CPU
0
Memory
17
So$ware & Services Group
Extended Page Tables (EPT)
• Map guest physical to host address
• New hardware page-table walker
Performance Benefit
• A guest OS can modify its own page tables freely and without VM exits
Memory Savings
• A single EPT supports en6re VM: instead of a shadow pagetable per guest process
18
Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
So$ware & Services Group
12/1/15
9
19
Linear Addr Extended Page Table (EPT) Walk
CR3
EPTP L4
PML4
TLB EPTP L4
PDP
EPTP L4
PDE
EPTP L4
PTE
EPTP L4
Physical Addr
2-level TLB reduces page-table walks
• VPID tags help to retain TLB entries
Paging-structure caches
• Cache intermediate steps in walk
• Result: Reduce length of walks
• Common case: Much beier than 24 steps
L3
L3
L3
L3
L3
L2
L2
L2
L2
L2
L1
L1
L1
L1
L1
So$ware & Services Group
Virtual Device Interface
• Traps device commands
• Translates DMA opera6ons
• Injects virtual interrupts
Solware Methods
• I/O Device Emula6on
• Paravirtualize Device Interface
Challenges
• Controlling DMA and interrupts
• Overheads of copying I/O buffers
20
1
I/O Device
Emula6on
VM
0
Guest Device
Driver
Para- 2 virtualiza6on
VM n
Guest Device
Driver
Device Model Device Model
Physical Device Driver
VMM
Memory
Storage
Network
So$ware & Services Group
12/1/15
10
Fault Genera9on
Bus 255
Bus N
Bus 0
Dev 31, Func 7
Dev P, Func 2
Dev P, Func 1
Dev 0, Func 0
DMA Remapping
Engine
Transla9on Cache
Context Cache
Memory Access with Host Physical Address
4KB Page
Tables
Device D1 Address Transla6on
Structures
Device
Assignment
Structures
Device D2
Address Transla6on
Structures
Memory-resident ParBBoning &
TranslaBon Structures
4KB
Page
Frame
21
So$ware & Services Group
Copy of Process Copy of Process
Process
Checkpoint Restore
Process
Execute
Instructions
X
Failure
Time
Execute
Instructions
12/1/15
11
◆
At what point can we resume the execution of a checkpointed program?
–
When the checkpoint starts?
–
When the checkpoint is entirely on disk?
12/1/15
Checkpoint 1
(Full)
C
D
A
B
E
Checkpoint 2 Checkpoint 3
P
Q
R
S
Restore
(Full)
R
Q
A
P
S
12
◆
◆
Can we precisely replay the execution of a multithreaded process?
–
If process does not have a memory race
From a checkpoint, record:
–
All inputs and return values from system calls
–
All scheduling decisions
–
All synchronization operations
*
Ex: which thread acquired lock in which order
◆
What if we checkpoint a process and then restart it on a different machine?
–
Process migration: move a process from one machine to another
–
Special handling needed if any system calls are in progress
*
Where does the system call return to?
12/1/15
13
◆
Can we demand page to memory on a different machine?
–
Remote memory over LAN much faster than disk
–
On page fault, look in remote memory first before fetching from disk
12/1/15
◆
◆
◆
Can we make a network of computers appear to be a shared-memory multiprocessor?
–
Read-write: if page is cached only on one machine
–
Read-only: if page is cached on several machines
–
Invalid: if page is cached read-write on a different machine
On read page fault:
–
Change remote copy to read-only
–
Copy remote version to local machine
On write page fault (if cached):
–
Change remote copy to invalid
–
Change local copy to read-write
14
◆
Data structures that survive failures
–
Want a consistent version of the data structure
–
User marks region of code as needing to be atomic
*
Begin transaction, end transaction
–
If crash, restore state before or after transaction
12/1/15
◆
◆
◆
◆
On begin transaction:
–
Snapshot data structure to disk
–
Change page table permission to read-only
On page fault:
–
Mark page as modified by transaction
–
Change page table permission to read-write
On end transaction:
–
Log changed pages to disk
–
Commit transaction when all mods are on disk
Recovery:
–
Read last snapshot + logged changes, if committed
15