CS533 Concepts of Operating Systems
The Performance of MicroKernel Based Systems
Micro-kernels and Binary Compatibility

Emulation libraries
  o Trampoline mechanism
Single server architecture
Multi-server architecture
  o IPC overhead proportional to number of servers (independent protection domains)
Micro-kernels must optimize IPC

Liedtke argues Mach’s overhead is due to poor implementation!
Optimized IPC implementation in L3
  o Architectural level
      • System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
  o Algorithmic level
      • Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
  o Interface level
      • Unnecessary Copies, Parameter passing.
  o Coding level
      • Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.
L3 IPC Performance vs Mach IPC
But Is That Enough?


What is the impact on overall system performance?
Haertig et al explore performance and extensibility
of L4-based Linux OS vs Mach-based Linux and
native Linux
L4Linux – a micro-kernel based Linux



Fully binary compliant with Linux/X86
No changes to the architecture-independent parts
of Linux
No Linux-specific modifications to the L4 kernel
L4Linux – Design & Implementation





Linux implemented as a single Linux server in a μ-kernel task
μ-kernel tasks used for Linux user processes
A single L4 thread in the Linux server handles system calls and
page faults. This thread is multiplexed (treated as a virtual CPU)
On booting, the Linux server requests memory from its pager,
which maps physical memory into the server’s address space
The Linux server then acts as the pager for the user processes it creates
  o L4 converts user-process page faults into an RPC to the Linux server, which maps pages from its address space to the user process
Interrupt Handling

Linux top halves are implemented as one server thread per interrupt source
  o L4 converts a hardware interrupt to a message to the appropriate thread
Linux bottom halves all execute in a single high priority thread
Linux interrupt threads have a higher priority than the main thread
  o avoids concurrent execution of Linux code on a uniprocessor
System Calls
System calls implemented as IPC between user process and the
Linux server
Modified libc.so or libc.a avoid trap instructions and use L4 IPC
instead to call the Linux server
User-level exception handler (trampoline) emulates the native system call ‘trap’ instruction for binary compatibility
  o L4 redirects the trap to the emulation library, which then uses L4 IPC to call the Linux server
Signals



Each user process has a separate signal-handler thread
The Linux server delivers a signal by sending a message to the user process’s signal-handler thread
The signal handler causes the user process’s main thread to save its state and enter Linux by manipulating the main thread’s SP and PC
Scheduling


All thread scheduling is done by the L4 kernel
The Linux server’s schedule() routine is only used for
multiplexing the Linux server’s Main thread across concurrent
Linux system calls
Experiment

What is the penalty of using L4Linux?
  o Compare L4Linux to native Linux
Does the performance of the underlying microkernel matter?
  o Compare L4Linux to MkLinux
Does co-location improve performance?
  o Compare L4Linux to an in-kernel version of MkLinux
Microbenchmarks

measured system call overhead on shortest system
call “getpid()”
System                  getpid() cost
Linux                   223 cycles
L4Linux                 526 cycles
L4Linux (trampoline)    753 cycles
MkLinux (in kernel)     2050 cycles
MkLinux (user)          14710 cycles
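The cycle counts above can be turned into slowdown factors relative to native Linux. The following is a minimal Python sketch of that arithmetic; the dictionary and function names are illustrative and do not come from the slides.

```python
# getpid() cost in cycles, copied from the figures listed on this slide.
GETPID_CYCLES = {
    "Linux": 223,
    "L4Linux": 526,
    "L4Linux (trampoline)": 753,
    "MkLinux (in kernel)": 2050,
    "MkLinux (user)": 14710,
}


def slowdown_vs_native(cycles, baseline="Linux"):
    """Return each system's cost as a multiple of the baseline system's cost."""
    base = cycles[baseline]
    return {name: count / base for name, count in cycles.items()}


if __name__ == "__main__":
    for name, factor in slowdown_vs_native(GETPID_CYCLES).items():
        print(f"{name:22s} {GETPID_CYCLES[name]:6d} cycles  {factor:5.1f}x")
```

On these numbers the L4Linux library path costs a little over twice the native call, while user-level MkLinux costs roughly 66 times as much.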
Microbenchmarks (cont.)

Measures specific system calls to determine basic performance.
Macrobenchmarks

measured time to recompile Linux server
Macrobenchmarks (cont.)

Next use a commercial test suite to simulate a system under full
load.
Performance Analysis



L4Linux is, on average, 8.3% slower than native Linux.
Only 6.8% slower at maximum load.
MkLinux: 49% slower on average, 60% at maximum.
Co-located MkLinux: 29% slower on average, 37% at maximum.
Conclusion?


Can hardware-based protection be made to work
efficiently enough?
Did these experiments explore the cost of “fine
grained” protection?
Spare Slides
The IPC Dilemma

IPC is very important in μ-kernel design
  o Increases modularity, flexibility, security and scalability.
Past implementations have been inefficient.
  o Message transfer takes 50 - 500 μs.
The L3 (μ-kernel based) OS

A task consists of:
  o Threads
      • Communicate via messages that consist of strings and/or memory objects.
  o Dataspaces
      • Memory objects.
  o Address space
      • Where dataspaces are mapped.
Redesign Principles







IPC performance is the Master.
All design decisions require a performance discussion.
If something performs poorly, look for new techniques.
Synergetic effects have to be taken into consideration.
The design has to cover all levels from architecture down to coding.
The design has to be made on a concrete basis.
The design has to aim at a concrete performance goal.
Achievable Performance

A simple scenario
  o Thread A sends a null message to thread B
  o Minimum of 172 cycles
Will aim at 350 cycles (7 μs)
  o Will actually achieve 250 cycles (5 μs)
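The cycle and microsecond figures above imply a clock rate, which in turn gives the minimum cost in microseconds. A small Python sketch of that conversion follows; the function names are illustrative, and the 50 MHz rate is inferred from "350 cycles (7 μs)" rather than stated on the slide.

```python
def clock_mhz(cycles, microseconds):
    """Clock rate in MHz implied by a (cycles, microseconds) pair."""
    return cycles / microseconds


def cycles_to_us(cycles, mhz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / mhz


# Inferred from the slide's "350 cycles (7 μs)"; an assumption, not a quoted spec.
MHZ = clock_mhz(350, 7)

if __name__ == "__main__":
    for label, cycles in [("minimum", 172), ("goal", 350), ("achieved", 250)]:
        print(f"{label:9s} {cycles:4d} cycles = {cycles_to_us(cycles, MHZ):.2f} us")
```

At that rate the 172-cycle minimum corresponds to about 3.4 μs, so the achieved 5 μs is within a factor of 1.5 of the lower bound.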
Levels of the redesign

Architectural
  o System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
Algorithmic
  o Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
Interface
  o Unnecessary Copies, Parameter passing.
Coding
  o Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.
Architectural Level

System Calls
  o Expensive! So, require as few as possible.
  o Implement two calls:
      • Call
      • Reply & Receive Next
  o Combines sending an outgoing message with waiting for an incoming message.
      • Schedulers can handle replies the same as requests.
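To see why combining send and receive matters, the following Python sketch counts kernel entries per client/server round trip under the two designs. It is a toy model: `CountingKernel` and both helper functions are hypothetical names for illustration, not L3 interfaces.

```python
class CountingKernel:
    """Toy kernel that only counts how many times it is entered."""

    def __init__(self):
        self.entries = 0

    def syscall(self, name):
        # A real kernel would dispatch on `name`; here we only count entries.
        self.entries += 1


def rpc_with_separate_primitives(kernel, rounds):
    """Each round trip uses separate send and receive calls on both sides."""
    for _ in range(rounds):
        kernel.syscall("send")      # client sends the request
        kernel.syscall("receive")   # server receives the request
        kernel.syscall("send")      # server sends the reply
        kernel.syscall("receive")   # client receives the reply
    return kernel.entries


def rpc_with_combined_primitives(kernel, rounds):
    """Each round trip uses Call (client) and Reply & Receive Next (server)."""
    for _ in range(rounds):
        kernel.syscall("call")               # client: send, then wait for reply
        kernel.syscall("reply_and_receive")  # server: reply, then wait for next
    return kernel.entries
```

In this model ten round trips cost 40 kernel entries with separate primitives and 20 with the combined ones.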
Messages

Complex Messages:
  o Direct String, Indirect Strings (optional)
  o Memory Objects
[Figure: A Complex Message]
Used to combine sends if no reply is needed.
Can transfer values directly from sender’s variables to receiver’s variables.
Direct Transfer
[Figure: User A, Kernel, User B]
Each address space has a fixed kernel accessible part.
  o Messages transferred via the kernel part
  o User A space -> Kernel -> User B space
  o Requires 2 copies.
  o Larger Messages lead to higher costs

Shared User Level memory (LRPC, SRC RPC)
  o Security can be penetrated.
  o Cannot check message’s legality.
  o Long messages -> address space becoming a critical resource.
  o Explicit opening of communication channels.
  o Not application friendly.
Temporary Mapping
[Figure: User A, Kernel, User B]
L3 uses a Communication Window
  o Only kernel accessible, and exists per address space.
  o Target region is temporarily mapped there.
  o Then the message is copied to the communication window and ends up in the correct place in the target address space.
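The difference between copying through a kernel buffer and copying through a temporarily mapped window can be sketched by counting bytes moved. This Python model is only an analogy: a `memoryview` over the receiver's buffer stands in for the communication window, and all names are hypothetical.

```python
class CopyCounter:
    """Performs copies between buffers and records how many bytes were moved."""

    def __init__(self):
        self.bytes_copied = 0

    def copy(self, dst, src):
        dst[:len(src)] = src
        self.bytes_copied += len(src)


def transfer_via_kernel_buffer(msg, receiver_buf, counter):
    """Two copies: sender -> kernel buffer, then kernel buffer -> receiver."""
    kernel_buf = bytearray(len(msg))
    counter.copy(kernel_buf, msg)
    counter.copy(receiver_buf, kernel_buf)


def transfer_via_window(msg, receiver_buf, counter):
    """One copy: the window aliases the receiver's memory, so the data lands there."""
    window = memoryview(receiver_buf)  # stands in for the temporary mapping
    counter.copy(window, msg)
    window.release()                   # stands in for removing the mapping
```

For a message of n bytes the first path moves 2n bytes and the second moves n, which is why the saving grows with message size.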
Temporary Mapping



Must be fast!
2 level page table only requires one word to be copied.
  o pdir A -> pdir B
TLB must be clean of entries relating to the use of the communication window by other operations.
  o One thread
      • TLB is always “window clean”.
  o Multiple threads
      • Interrupts – TLB is flushed
      • Thread switch – Invalidate Communication window entries.
Strict Process Orientation



Kernel mode handled in same way as User mode
One kernel stack per thread
May lead to a large number of stacks
  o Minor problem if stacks are objects in virtual memory
Thread Control Blocks (tcb’s)


Hold kernel, hardware, and thread-specific data.
Stored in a virtual array in shared kernel space.
[Figure labels: User area, Kernel area, tcb, Kernel stack]
Tcb Benefits





Fast tcb access
Saves 3 TLB misses per IPC
Threads can be locked by unmapping the tcb
Helps make thread persistent
IPC independent from memory management
Algorithmic Level

Thread IDs
  o L3 uses a 64 bit unique identifier (uid) containing the thread number.
  o Tcb address is easily obtained
      • By ANDing the lower 32 bits with a bit mask and adding the tcb base address.
Virtual Queues
  o Busy queue, present queue, polling-me queue.
  o Unmapping the tcb includes removal from queues
      • Prevents page faults from parsing/adding/deleting from the queues.
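The tcb address computation described under Thread IDs (mask the lower 32 bits of the uid, then add the tcb base) can be sketched as follows. Every constant and the uid layout in `make_uid` are assumptions chosen for illustration; the slides do not give the actual values or bit layout.

```python
TCB_BASE = 0xE000_0000   # hypothetical start of the virtual tcb array
TCB_SIZE = 0x800         # hypothetical tcb size; must be a power of two
MAX_THREADS = 1 << 16    # hypothetical limit on thread numbers

# Keeps only the bits that encode "thread number * TCB_SIZE".
TCB_MASK = (MAX_THREADS - 1) * TCB_SIZE


def make_uid(thread_no, generation):
    """Build a 64-bit uid (assumed layout: generation high, scaled thread number low)."""
    low = thread_no * TCB_SIZE
    return (generation << 32) | low


def tcb_address(uid):
    """AND the lower 32 bits with the mask and add the tcb base address."""
    low32 = uid & 0xFFFF_FFFF
    return (low32 & TCB_MASK) + TCB_BASE
```

Because the thread number is stored pre-scaled in this layout, the lookup needs no multiplication or table access, only one AND and one ADD.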
Algorithmic Level

Timeouts and Wakeups
  o Operation fails if message transfer has not started t ms after invoking it.
  o Kept in n unordered wakeup lists.
      • A new thread’s tcb is linked into the list τ mod n.
  o Threads with wakeups far away are kept in a long time wakeup list and reinserted into the normal lists when time approaches.
  o Scheduler will only have to check k/n entries per clock interrupt.
  o Usually costs less than 4% of ipc time.
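The n unordered wakeup lists can be sketched as an array of buckets indexed by wakeup time modulo n. The Python class below is a simplified illustration with hypothetical names; it assumes `expire` is called once per clock tick and leaves out the separate long time wakeup list.

```python
class WakeupLists:
    """n unordered lists; a thread with wakeup time tau goes into list tau mod n."""

    def __init__(self, n):
        self.n = n
        self.lists = [[] for _ in range(n)]

    def insert(self, thread, wakeup_tick):
        # Unordered insert: no sorting, just append to the right bucket.
        self.lists[wakeup_tick % self.n].append((wakeup_tick, thread))

    def expire(self, now):
        """Scan only list (now mod n) and return the threads whose time has come."""
        bucket = self.lists[now % self.n]
        due = [thread for (tick, thread) in bucket if tick <= now]
        bucket[:] = [(tick, thread) for (tick, thread) in bucket if tick > now]
        return due
```

With k pending wakeups spread over n lists, each tick inspects about k/n entries instead of all k.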
Algorithmic Level

Lazy Scheduling
  o Only a thread state variable is changed (ready/waiting).
  o Deletion from queues happens when queues are parsed.
      • Reduces delete operations.
      • Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet.
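A minimal Python sketch of lazy scheduling follows. The class and its counters are hypothetical; it models a single ready queue in which blocking only flips a state flag, and stale entries are discarded when the queue is next parsed.

```python
from collections import deque


class LazyReadyQueue:
    """Ready queue where blocking only changes thread state, not the queue."""

    def __init__(self):
        self.queue = deque()
        self.state = {}      # thread -> "ready" or "waiting"
        self.queued = set()  # threads physically present in the queue
        self.queue_ops = 0   # number of real insert/delete operations

    def make_ready(self, thread):
        self.state[thread] = "ready"
        if thread not in self.queued:  # a stale entry makes the insert unnecessary
            self.queue.append(thread)
            self.queued.add(thread)
            self.queue_ops += 1

    def make_waiting(self, thread):
        # Lazy part: only the state variable changes; the queue entry stays put.
        self.state[thread] = "waiting"

    def pick_next(self):
        """Parse the queue, dropping stale entries, and dequeue the next ready thread."""
        while self.queue:
            thread = self.queue.popleft()
            self.queued.discard(thread)
            self.queue_ops += 1
            if self.state.get(thread) == "ready":
                return thread
        return None
```

A thread that blocks and becomes ready again before the queue is parsed costs one queue operation in this model rather than three.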
Algorithmic Level

Short messages via registers
  o Register transfers are fast
  o 50-80% of messages ≤ 8 bytes
  o Up to 8 byte messages can be transferred by registers with a decent performance gain.
  o May not pay off for other processors.
Interface Level

Unnecessary Copies
  o Message objects grouped by types
  o Send/receive buffers structured in the same way
  o Use same variable for sending and receiving
      • Avoid unnecessary copies
Parameter Passing
  o Use registers whenever possible.
      • Far more efficient
      • Give compilers better opportunities to optimize code.
Coding Level

Cache Misses
  o Cache line fill sequence should match the usual data access sequence.
TLB Misses
  o Try and pack in one page:
      • Ipc related kernel code
      • Processor internal tables
      • Start/end of larger tables
      • Most heavily used entries
Coding Level

Registers
  o Segment register loading is expensive.
  o One flat segment covering the complete address space.
      • On entry, kernel checks if registers contain the flat descriptor.
      • Guarantees they contain it when returning to user level.
Jumps and Checks
  o Basic code blocks should be arranged so that as few jumps are taken as possible.
Process switch
  o Save/restore of stack pointer and address space only invoked when really necessary.
L4 Slides
Introduction



μ-kernels have a reputation for being too slow and inflexible
Can a 2nd generation μ-kernel (L4) overcome these limitations?
Experiment:
  o Port Linux to run on L4
  o Compare to native Linux and MkLinux (Linux on a 1st generation, Mach 3.0-derived μ-kernel)
Introduction (cont.)


Test speed of standard OS personality on top of fast μ-kernel: Linux implemented on L4
Test extensibility of system:
  o pipe-based communication implemented directly on μ-kernel
  o mapping-related OS extensions implemented as user tasks
  o user-level real-time memory management implemented
Test if L4 abstractions are independent of platform
L4 Essentials


Based on threads and address spaces
Recursive construction of address spaces by user-level servers
  o Initial address space σ0 represents physical memory
  o Basic operations: granting, mapping, and unmapping.
Owner of address space can grant or map page to another address space
All address spaces maintained by user-level servers (pagers)
L4Linux – Design & Implementation



Fully binary compliant with Linux/X86
Restricted modifications to architecture-dependent
part of Linux
No Linux-specific modifications to L4 kernel
L4Linux – Design & Implementation

Address Spaces
  o Initial address space σ0 represents physical memory
  o Basic operations: granting, mapping, and unmapping.
  o L4 uses “flexpages”: logical memory ranging from one physical page up to a complete address space.
  o An invoker can only map and unmap pages that have been mapped into its own address space
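The recursive construction of address spaces can be sketched with a small Python model of map and unmap. The `AddressSpace` class and its fields are hypothetical, granting and flexpage sizes are left out, and pages are modelled as single integers.

```python
class AddressSpace:
    """Toy address space: maps virtual page numbers to physical frame numbers."""

    def __init__(self, name):
        self.name = name
        self.pages = {}    # own virtual page -> physical frame
        self.derived = {}  # own virtual page -> [(space, page)] mapped from it

    def map(self, page, dest, dest_page):
        """Map one of our pages into dest; only pages we hold can be mapped."""
        if page not in self.pages:
            raise KeyError(f"{self.name} does not hold page {page}")
        dest.pages[dest_page] = self.pages[page]
        self.derived.setdefault(page, []).append((dest, dest_page))

    def unmap(self, page):
        """Revoke the page from every space that got it from us, recursively.

        The page stays mapped in our own address space.
        """
        for space, their_page in self.derived.pop(page, []):
            space.unmap(their_page)
            space.pages.pop(their_page, None)
```

In this model σ0 can map a frame to the Linux server, which maps it on to a user process; an unmap issued by σ0 then removes it from both.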
L4Linux – Design & Implementation
L4Linux – Design & Implementation

Address Spaces (cont.)
  o I/O ports are parts of address spaces.
  o Hardware interrupts are handled by user-level processes. The L4 kernel will send a message via IPC.
L4Linux – Design & Implementation

The Linux server
  o L4Linux will use a single-server approach.
  o A single Linux server will run on top of L4, multiplexing a single thread for system calls and page faults.
  o The Linux server maps physical memory into its address space, and acts as the pager for any user processes it creates.
  o The Server cannot directly access the hardware page tables, and must maintain logical pages in its own address space.
L4Linux – Design & Implementation

Interrupt Handling
  o All interrupt handlers are mapped to messages.
  o The Linux server contains threads that do nothing but wait for interrupt messages.
  o Interrupt threads have a higher priority than the main thread.
L4Linux – Design & Implementation

User Processes
  o Each different user process is implemented as a different L4 task: Has its own address space and threads.
  o The Linux Server is the pager for these processes. Any fault by the user-level processes is sent by RPC from the L4 kernel to the Server.
L4Linux – Design & Implementation

System Calls
  o Three system call interfaces:
      • A modified version of libc.so that uses L4 primitives.
      • A modified version of libc.a
      • A user-level exception handler (trampoline) calls the corresponding routine in the modified shared library.
  o The first two options are the fastest. The third is maintained for compatibility.
L4Linux – Design & Implementation

Signalling
  o Each user-level process has an additional thread for signal handling.
  o Main server thread sends a message to the signal handling thread, telling the user thread to save its state and enter Linux
L4Linux – Design & Implementation

Scheduling
  o All thread scheduling is done by the L4 kernel
  o The Linux server’s schedule() routine is only used for multiplexing its single thread.
  o After each system call, if no other system call is pending, it simply resumes the user process thread and sleeps.
L4Linux – Design & Implementation

Tagged TLB & Small Space.
  o In order to reduce TLB conflicts, L4Linux has a special library to customize code and data for communicating with the Linux Server
  o The emulation library and signal thread are mapped close to the application, instead of default high-memory area.
Performance

What is the penalty of using L4Linux?
Compare L4Linux to native Linux

Does the performance of the underlying microkernel matter?
Compare L4Linux to MkLinux

Does co-location improve performance?
Compare L4Linux to an in-kernel version of MkLinux
Microbenchmarks

measured system call
overhead on shortest
system call “getpid()”
Microbenchmarks (cont.)

Measures specific system calls to determine basic performance.
Macrobenchmarks

measured time to
recompile Linux server
Macrobenchmarks (cont.)

Next use a commercial test suite to simulate a system under full
load.
Performance Analysis



L4Linux is, on average, 8.3% slower than native Linux.
Only 6.8% slower at maximum load.
MkLinux: 49% slower on average, 60% at maximum.
Co-located MkLinux: 29% slower on average, 37% at maximum.
Extensibility Performance



A micro-kernel must provide more than just the
features of the OS running on top of it.
Specialization – improved implementation of OS
functionality
Extensibility – permits implementation of new
services that cannot be easily added to a
conventional OS.
Pipes and RPC
(1) The first five use the standard pipe mechanism of the Linux kernel.
(2) Is asynchronous and uses only L4 IPC primitives. Emulates POSIX standard pipes, without signalling. Added thread for buffering and cross-address-space communication.
(3) Is synchronous and uses blocking IPC without buffering data.
(4) Maps pages into the receiver’s address space.
Virtual Memory Operations




The “Fault” operation is an example of extensibility – measures the time to resolve a
page fault by a user-defined pager in a separate address space.
“Trap” – Latency between a write operation to a protected page, and the invocation of
related exception handler.
“Appel1” – Time to access a random protected page. The fault handler unprotects the
page, protects some other page, and resumes.
“Appel2” – Time to access a random protected page where the fault handler only
unprotects the page and resumes.
Conclusion


Using the L4 micro-kernel imposes a 5-10% slowdown relative to native Linux. Much faster than previous micro-kernels.
Further optimizations, such as co-locating the Linux Server and providing extensibility, could improve L4Linux even further.