On µ-Kernel Construction

Jochen Liedtke*
GMD — German National Research Center for Information Technology
jochen.liedtke@gmd.de

*GMD SET–RS, 53754 Sankt Augustin, Germany

Abstract

From a software-technology point of view, the µ-kernel concept is superior to large integrated kernels. On the other hand, it is widely believed that (a) µ-kernel based systems are inherently inefficient and (b) they are not sufficiently flexible. Contradictory to this belief, we show and support by documentary evidence that inefficiency and inflexibility of current µ-kernels is not inherited from the basic idea but mostly from overloading the kernel and/or from improper implementation.

Based on functional reasons, we describe some concepts which must be implemented by a µ-kernel and illustrate their flexibility. Then, we analyze the performance critical points. We show what performance is achievable, that the efficiency is sufficient with respect to macro-kernels and why some published contradictory measurements are not evident. Furthermore, we describe some implementation techniques and illustrate why µ-kernels are inherently not portable, although they improve the portability of the whole system.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. SIGOPS '95 12/95 CO, USA. © 1995 ACM 0-89791-715-4/95/0012...$3.50

1 Rationale

µ-kernel based systems have been built long before the term itself was introduced, e.g. by Brinch Hansen [1970] and Wulf et al. [1974]. Traditionally, the word 'kernel' is used to denote the part of the operating system that is mandatory and common to all other software. The basic idea of the µ-kernel approach is to minimize this part, i.e. to implement outside the kernel whatever possible.

The software technological advantages of this approach are obvious: (a) A clear µ-kernel interface enforces a more modular system structure.¹ (b) Servers can use the mechanisms provided by the µ-kernel like any other user program. Server malfunction is as isolated as any other user program's malfunction. (c) The system is more flexible and tailorable: different strategies and APIs, implemented by different servers, can coexist in the system.

¹Although many macro-kernels tend to be less modular, there are exceptions from this rule, e.g. Chorus [Rozier et al. 1988] and Peace [Schröder-Preikschat 1994].
Although much effort has been invested in µ-kernel construction, the approach is not (yet) generally accepted. This is due to the fact that most existing µ-kernels do not perform sufficiently well. Lack of efficiency also heavily restricts flexibility, since important mechanisms and principles cannot be used in practice due to poor performance. In some cases, the µ-kernel interface has been weakened and special servers have been re-integrated into the kernel to regain efficiency.

It is widely believed that this inefficiency (and thus inflexibility) is inherent to the µ-kernel approach. Folklore holds that the increased user-kernel mode and address-space switches are responsible. At a first glance, published performance measurements seem to support this view.

In fact, the cited performance studies measured the performance of particular µ-kernel based systems, not the limits of the approach as such: we can only guess whether the measured degradations are caused by the concepts implemented by the particular µ-kernel or by its implementation. Since it is known that conventional IPC, one of the traditional µ-kernel bottlenecks, can be implemented an order of magnitude faster² than believed before, the question is still open. It might be possible that we are still not applying the appropriate construction techniques.

²Short user-to-user cross-address space IPC in L3 [Liedtke 1993] is 22 times faster than in Mach, both running on a 486. On the R2000, the specialized Exo-tlrpc [Engler et al. 1995] is 30 times faster than Mach's general RPC.

For the above reasons, we feel that a conceptual analysis is needed which derives µ-kernel concepts from pure functionality requirements (section 2) and that discusses achievable performance (section 4) and flexibility (section 3). Further sections discuss portability (section 5) and the chances of some new developments (section 6).

2 Some µ-Kernel Concepts

In this section, we reason about the minimal concepts or "primitives" that a µ-kernel should implement.³ The determining criterion used is functionality, not performance. More precisely, a concept is tolerated inside the µ-kernel only if moving it outside the kernel, i.e. permitting competing implementations, would prevent the implementation of the system's required functionality.

³Proving minimality, necessity and completeness would be nice, but is impossible, since there is no agreed-upon metric and all is Turing-equivalent.

We assume that the target system has to support interactive and/or not completely trustworthy applications, i.e. it has to deal with protection. We further assume that the hardware implements page-based virtual memory.

One inevitable requirement for such a system is that a programmer must be able to implement an arbitrary subsystem S in such a way that it cannot be disturbed or corrupted by other subsystems S'. This is the principle of independence: S can give guarantees independent of S'. The second requirement is that other subsystems must be able to rely on these guarantees. This is the principle of integrity: there must be a way for S1 to address S2 and to establish a communication channel which can neither be corrupted nor eavesdropped by S'.

For illustration: by magic, a key server should deliver public-secret RSA key pairs on demand. It should guarantee that each pair has the desired RSA property and that each pair is delivered only once and only to the demander. The key server can only be realized if there are mechanisms which (a) protect its code and data, (b) ensure that nobody else reads or modifies the key and (c) enable the demander to check whether the key comes from the key server. Finding the key server can be done by means of a name server and checked by public key based authentication. Provided the above requirements are fulfilled, further security services, like those described by Gasser et al. [1989], can be implemented by servers. Their integrity can be ensured by system administration or by user-level boot servers.
2.1 Address Spaces

At the hardware level, an address space is a mapping which associates each virtual page to a physical page frame or marks it 'non-accessible'. For the sake of simplicity, we omit access attributes like read-only and read/write. The mapping is implemented by TLB hardware and page tables.

The µ-kernel, the mandatory layer common to all subsystems, has to hide the hardware concept of address spaces, since otherwise, implementing protection would be impossible. The µ-kernel concept of address spaces must be tamed, but must permit the implementation of arbitrary protection (and non-protection) schemes on top of the µ-kernel. It should be simple and similar to the hardware concept.

The basic idea is to support recursive construction of address spaces outside the kernel. There is one address space σ0 which essentially represents the physical memory and is controlled by the first subsystem S0. At system start time, all other address spaces are empty. For constructing and maintaining further address spaces on top of σ0, the µ-kernel provides three operations:

Grant. The owner of an address space can grant any of its pages to another space, provided the recipient agrees. The granted page is removed from the granter's address space and included into the grantee's address space. The important restriction is that, instead of physical page frames, the granter can only grant pages which are already accessible to itself.

Map. The owner of an address space can map any of its pages into another address space, provided the recipient agrees. Afterwards, the page can be accessed in both address spaces. In contrast to granting, the page is not removed from the mapper's address space. Comparable to the granting case, the mapper can only map pages which itself already can access.

Flush. The owner of an address space can flush any of its pages. The flushed page remains accessible in the flusher's address space, but is removed from all other address spaces which had received the page directly or indirectly from the flusher. Although explicit consent of the affected address-space owners is not required, the operation is safe, since it is restricted to own pages. The users of these pages already agreed to accept a potential flushing, when they received the pages by mapping or granting.

Appendix A contains a more precise definition of address spaces and the above three operations.
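For illustration, the three primitives could appear to a user-level server roughly as follows. This is a minimal sketch; the names, types and signatures are our own assumptions for illustration, not the actual L3/L4 system-call interface, and the recipient's agreement is in reality expressed through the underlying IPC.

    /* Hypothetical user-level view of the three address-space primitives. */
    typedef unsigned long vpage_t;   /* virtual page number                */
    typedef unsigned long space_t;   /* identifier of the receiving space  */

    /* Transfer page v into space 'dest' at v_dest; the page disappears
       from the caller's own address space.                                */
    int grant(space_t dest, vpage_t v, vpage_t v_dest);

    /* Make page v additionally accessible in 'dest' at v_dest; the
       caller keeps its own mapping.                                       */
    int map(space_t dest, vpage_t v, vpage_t v_dest);

    /* Withdraw page v from all spaces that received it directly or
       indirectly from the caller; no agreement is needed, since the
       recipients accepted this possibility together with the mapping.     */
    int flush(vpage_t v);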
Reasoning. The described address-space concept leaves memory management and paging outside the µ-kernel; only the grant, map and flush operations are retained inside the kernel. Mapping and flushing are required to implement memory managers and pagers on top of the µ-kernel.

The grant operation is required only in very special situations: consider a pager F which combines two underlying file systems (implemented as pagers f1 and f2, operating on top of the standard pager) into one unified file system (see figure 1). In this example, f1 maps one of its pages to F, which grants the received page on to user A. By granting, the page disappears from F, so that it is then available only in f1 and user A; the resulting mappings are denoted by the thin line: the page is mapped in user A, f1 and the standard pager.

[Figure 1: A Granting Example. Users A and X run on top of the combining pager F and the file-system pagers f1 and f2, which operate on top of the standard pager and the disk.]

F is not affected when the page is later flushed, whether by f1 or by the standard pager, since F used the page only transiently. Furthermore, granting avoids a potential address-space overflow of F: if F had used mapping instead of granting, it would have needed to replicate most of the bookkeeping that is already done in f1 and f2. In general, grant is used when page mappings should be passed through a controlling subsystem without burdening the controller's address space by all pages mapped through it.

The model can easily be extended to access rights: mapping and granting copy the source page's access right or a subset of them, i.e., they can restrict the access but not widen it. Special flushing operations remove only specified access rights.

I/O. An address space is the natural abstraction for incorporating device ports. This is obvious for memory-mapped I/O, but I/O ports can also be included. The granularity of control depends on the given processor. The 386 and its successors permit control per port (one very small page per port) but no mapping of port addresses (the hardware enforces mappings with v = v'); the PowerPC uses memory-mapped I/O, i.e., device ports are mapped and controlled with 4K granularity. Controlling I/O rights and device drivers is thus also done by memory managers and pagers on top of the µ-kernel.

2.2 Threads and IPC

A thread is an activity executing inside an address space. A thread τ is characterized by a set of registers, including at least an instruction pointer, a stack pointer and a state information. A thread's state also includes the address space σ(τ) in which τ currently executes. This dynamic or static association of threads to address spaces is the decisive reason for including the thread concept (or something equivalent) in the µ-kernel: to prevent corruption of address spaces, all changes to a thread's address space (σ(τ) := σ') must be controlled by the kernel. This implies that the µ-kernel includes the notion of an activity τ. Note that choosing a concrete thread concept remains subject to further OS-specific design decisions, e.g. concerning preemption.

Consequently, cross-address-space communication, also called inter-process communication (IPC), must be supported by the µ-kernel. The classical method is transferring messages between threads by the µ-kernel.

IPC always enforces a certain agreement between both parties of a communication: the sender decides to send information and determines its contents; the receiver determines whether it is willing to receive information and is free to interpret the received message. Therefore, IPC is not only the basic concept for communication between subsystems but also, together with address spaces, the foundation of independence.

Other forms of communication, remote procedure call (RPC) or controlled thread migration between address spaces, can be constructed from message-transfer based IPC. Note that the grant and map operations (section 2.1) need IPC, since they require an agreement between granter/mapper and recipient of the mapping.
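As a concrete illustration of the first point, an RPC is simply a send followed by a receive. The sketch below assumes hypothetical ipc_send/ipc_receive kernel entry points and a thread-id type (section 2.3); it is not the actual L3 interface.

    /* Sketch: synchronous RPC built from two message-transfer IPCs. */
    typedef unsigned long long tid_t;   /* unique thread id, see 2.3 */

    extern int ipc_send(tid_t to, const void *msg, unsigned len);
    extern int ipc_receive(tid_t from, void *msg, unsigned maxlen);

    /* One request, one reply: a round-trip RPC is exactly the two
       IPC operations whose costs are measured in table 2.          */
    int rpc(tid_t server, void *msg, unsigned len, unsigned maxlen)
    {
        int e = ipc_send(server, msg, len);       /* request */
        if (e != 0)
            return e;
        return ipc_receive(server, msg, maxlen);  /* reply   */
    }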
Supervising IPC. Architectures like those described by Yokote [1993] and Kühnhauser [1995] need not only supervise the memory of subjects but also their communication. This can be done by introducing either communication channels or Clans [Liedtke 1992], which allow supervision of IPC by user-defined servers. Such concepts are not discussed here, since they do not belong to the minimal set of concepts. We only remark that Clans do not burden the µ-kernel: their base cost is 2 cycles per IPC.

Interrupts. The natural abstraction for hardware interrupts is the IPC message. The hardware is regarded as a set of threads which have special thread ids and send empty messages (only consisting of the sender id) to associated software threads. A receiving thread concludes from the message source id whether the message comes from a hardware interrupt and from which interrupt:

    driver thread:
      do
        wait for (msg, sender) ;
        if sender = my hardware interrupt
          then read/write io ports ;
               reset hardware interrupt
          else ...
        fi
      od .

Transforming the interrupts into messages must be done by the kernel, but the µ-kernel is not involved in device-specific interrupt handling. In particular, it does not know anything about the interrupt semantics. On some processors, resetting the interrupt is a device-specific action which can be handled by drivers at user level. However, if a privileged operation is required for releasing the interrupt, the kernel executes this action implicitly when the driver issues the next IPC operation. The iret-instruction is then used solely for popping status information from the stack and/or switching back to user mode, and can be hidden by the kernel.
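The same loop, rendered in C under stated assumptions: ipc_wait (receive from anyone), the interrupt-id constant and the handler functions are hypothetical names, not part of any real kernel interface; only the structure mirrors the pseudocode above.

    /* C sketch of the interrupt-as-IPC driver loop. */
    typedef unsigned long long tid_t;

    #define MY_IRQ_ID ((tid_t)0x10)          /* assumed id of "my" interrupt */

    extern void ipc_wait(tid_t *sender);      /* empty message: id only      */
    extern void handle_io_ports(void);        /* device-specific work        */
    extern void handle_request(tid_t client); /* ordinary client IPC         */

    void driver_thread(void)
    {
        tid_t sender;
        for (;;) {
            ipc_wait(&sender);
            if (sender == MY_IRQ_ID) {
                handle_io_ports();
                /* if releasing the interrupt needs a privileged operation,
                   the kernel performs it implicitly on our next ipc_wait() */
            } else {
                handle_request(sender);
            }
        }
    }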
2.3 Unique Identifiers

A µ-kernel must supply unique identifiers (uid) for something, either for threads or tasks or communication channels. Uids are required for reliable and efficient local communication. If S1 wants to send a message to S2, it needs to specify the destination S2 (or some channel leading to S2). Therefore, the µ-kernel must know which uid relates to S2. On the other hand, the receiver S2 wants to be sure that the message comes from S1. Therefore the identifier must be unique, both in space and time.

In theory, cryptography could also be used. In practice, however, enciphering messages for local communication is far too expensive and the kernel must be trusted anyway. S2 can also not rely on purely user-supplied capabilities, since S1 or some other instance could duplicate and pass them to untrusted subsystems without control of S2.
3 Flexibility

To illustrate the flexibility of the basic concepts, we sketch some applications which typically belong to the basic operating system but can easily be implemented on top of the µ-kernel. In this section, we show the principal flexibility of a µ-kernel. Whether it is really as flexible in practice strongly depends on the achieved efficiency; this topic is discussed in section 4.

Memory Manager. A server managing the initial address space σ0 is a classical main memory manager, but outside the µ-kernel. Memory managers can easily be stacked: M0 maps or grants parts of the physical memory (σ0) to σ1, controlled by M1, other parts to σ2, controlled by M2. Now we have two coexisting main memory managers.

Pager. A pager may be integrated with a memory manager or use a memory managing server. Pagers use the µ-kernel's grant, map and flush primitives. The remaining interfaces, pager – client, pager – memory server and pager – device driver, are completely based on IPC and are user-level defined.

Pagers can be used to implement traditional paged virtual memory and file/database mapping into user address spaces, as well as unpaged resident memory for device drivers and/or real-time systems. Stacked pagers, i.e. multiple layers of pagers, can be used for combining access control with existing pagers or for combining various pagers (e.g. one per disk) into one composed object. User-supplied paging strategies [Lee et al. 1994; Cao et al. 1994] are handled at the user level and are in no way restricted by the µ-kernel. Stacked file systems [Khalidi and Nelson 1993] can be realized accordingly.
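The skeleton of such a user-level pager might look as follows. This is a sketch under the same assumptions as before (hypothetical IPC entry points and message layout; how the kernel converts a page fault into an IPC is system specific); the reply carries the map operation of section 2.1.

    /* Sketch: user-level pager serving page faults via IPC. */
    typedef unsigned long long tid_t;
    typedef unsigned long vpage_t;

    struct fault_msg { unsigned long addr; int is_write; };

    extern void ipc_wait_fault(tid_t *client, struct fault_msg *f);
    extern vpage_t backing_page(unsigned long addr);   /* e.g. read from disk */
    extern int map_page(tid_t dest, vpage_t v, vpage_t v_dest);

    void pager_thread(void)
    {
        tid_t client;
        struct fault_msg f;
        for (;;) {
            ipc_wait_fault(&client, &f);          /* fault arrives as IPC  */
            vpage_t v = backing_page(f.addr);     /* find/load the backing */
            map_page(client, v, f.addr >> 12);    /* reply: map the page   */
        }
    }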
Multimedia Resource Allocation. Multimedia and other real-time applications require memory resources to be allocated in a way that allows predictable execution times. The above mentioned user-level memory managers and pagers permit e.g. fixed allocation of physical memory for specific data or locking data in memory for a given time. Note that resource allocators for multimedia and for timesharing can coexist; managing allocation conflicts is part of the servers' jobs.

Device Driver. A device driver is a process which directly accesses hardware I/O ports mapped into its address space and receives messages from the hardware (interrupts) through the standard IPC mechanism. Device-specific memory, e.g. a screen, is handled by means of appropriate memory managers. Compared to other user-level processes, there is nothing special about a device driver. No device driver has to be integrated into the µ-kernel.⁴

⁴In general, there is no reason for integrating boot drivers into the kernel. The booter, e.g. located in ROM, simply loads a bit image into memory that contains the micro-kernel and perhaps some set of initial pagers and drivers (running at user mode, not linked but simply appended to the kernel). Afterwards, the boot drivers are no longer used.
Second Level Cache and TLB. Improving the hit rates of a secondary cache by means of page allocation or reallocation [Kessler and Hill 1992; Romer et al. 1994] can be implemented by a pager which applies some cache-dependent policy when allocating pages in physical memory, hopefully reducing cache conflicts. In theory, even a second-level TLB could be handled like this, i.e. by a user-level server. In practice, handling first-level TLB misses, e.g. by means of a hashed page table, will remain implemented in hardware or in the µ-kernel.

Remote Communication. Remote IPC is implemented by communication servers which translate local messages to external communication protocols and vice versa. The communication hardware is accessed by device drivers. If special sharing of communication buffers and user address space is required, the communication server will also act as a special pager for its clients. The µ-kernel is not involved.

Unix Server. Unix⁵ system calls are implemented by IPC. The Unix server can act as a pager for its clients and can also use memory sharing for communicating with its clients. The Unix server itself can be pageable or resident.

⁵Unix is a registered trademark of UNIX System Laboratories.

Conclusion. A small set of µ-kernel concepts lead to abstractions which stress flexibility, provided they perform well enough. The only thing which cannot be implemented on top of these abstractions is the processor architecture itself: registers, first-level caches and first-level TLBs.
4 Performance, Facts & Rumors

4.1 Switching Overhead

It is widely believed that switching between kernel and user mode, between address spaces and between threads is inherently expensive. Some measurements seem to support this belief.

4.1.1 Kernel–User Switches

Ousterhout [1990] measured the costs for executing the "null" kernel call getpid. Since the real getpid operation consists only of a few loads and stores, this method measures the basic costs of a kernel call. Normalized to a hypothetical machine with a 10 MIPS rating (10× VAX 11/780), most machines need 20–30 µs per getpid, some even 63 µs. In fact, the measured kernel-call costs are high. In contrast to the worst-case processors, we use a best-case measurement for discussion: 18 µs per kernel call for Mach⁶ on a 486/50.

⁶Mach 3.0, NORMA MK 13.

The measured costs per kernel call are 18 × 50 = 900 cycles. The bare machine instruction for entering kernel mode costs 71 cycles, followed by an additional 36 cycles for returning to user mode. These two instructions switch between the user and kernel stack and push/pop flag register and instruction pointer. The remaining 800 or more cycles are pure kernel overhead. By this term, we denote all cycles which are solely due to the construction of the kernel, nevermind whether they are spent in executing instructions (800 cycles ≈ 500 instructions) or in cache and TLB misses (800 cycles ≈ 270 primary cache misses ≈ 90 TLB misses). We have to conclude that the measured kernels do a lot of work when entering and exiting the kernel. Note that this work by definition has no net effect.

Is an 800 cycle kernel overhead really necessary? The answer is no. Empirical proof: the L3 kernel [Liedtke 1993] has a minimal kernel overhead of 15 cycles. If the µ-kernel call is executed infrequently enough, it may increase by up to 57 additional cycles (3 TLB misses, 10 cache misses). The complete L3 kernel-call costs are thus 123 to 180 cycles, mostly less than 3 µs. L3 is process oriented, uses a kernel stack per thread and supports persistent user processes; even a process that actually resides in kernel mode can be exchanged by the system without affecting the rest of the system. Therefore, it should be possible for any other µ-kernel to achieve comparably low kernel-call overhead on the same hardware.

Other processors may require a slightly higher overhead, but they offer substantially cheaper basic operations for entering and leaving kernel mode. From an architectural point of view, calling the kernel from user mode is simply an indirect call, complemented by a stack switch and setting the internal 'kernel'-bit to permit privileged operations. Accordingly, returning to user mode is complemented by switching back to the user stack and resetting the 'kernel'-bit. If the processor has different stack pointer registers for user and kernel stack, the stack-switching costs can be hidden. Conceptually, entering and leaving kernel mode can thus perform exactly like a normal indirect call and return instruction (which do not rely on branch prediction). Ideally, this means 2+2 = 4 cycles on a 1-issue processor.

Conclusion. Compared to the theoretical minimum, kernel–user mode switches are costly on some processors. Compared to existing kernels however, they can be improved 6 to 10 times by appropriate µ-kernel construction. Kernel–user mode switches are not a serious conceptual problem but an implementational one.
4.1.2 Address Space Switches

Folklore also considers address-space switches as costly. All measurements known to the author and related to this topic deal with combined thread and address-space switch costs. Therefore, in this section, we analyze only the architectural processor costs for pure address-space switching; the combined costs are discussed together with thread switching in the next section.

Most modern processors use a physically indexed primary cache which is not affected by address-space switching. Switching the page table is usually very cheap: 1 to 10 cycles. The real costs are determined by the TLB architecture.

Some processors (e.g. Mips R4000) use tagged TLBs, where each entry does not only contain the virtual page address but also the address-space id. Switching the address space is then transparent to the TLB and costs no additional cycles. However, sharing pages between address spaces may induce indirect costs, since each shared page then occupies one TLB entry per address space, so that the TLB is effectively smaller. Provided that the µ-kernel (shared by all address spaces) has a small working set and that there are enough TLB entries, the problem should not be serious. However, we cannot support this empirically, since we do not know an appropriate µ-kernel running on such a processor.

Most current processors (e.g. 486, Pentium, PowerPC and Alpha) include untagged TLBs. An address-space switch on such a processor thus requires a TLB flush. The real costs are determined by the TLB load operations which are needed to re-establish the current working set later: if the working set consists of n pages, the TLB is fully associative, has s entries and a TLB miss costs m cycles, at most min(n, s) × m cycles are required in total. Apparently, larger untagged TLBs lead to a performance problem. For example, completely reloading the Pentium's data and code TLBs requires at least (32 + 64) × 9 = 864 cycles. Therefore, intercepting a program every 100 µs could imply an overhead of up to 9%. Although using the complete TLB is unrealistic⁷, untagged TLBs of increasing size are a potential performance bottleneck.

⁷Both TLBs are 4-way set-associative. Working sets which are not compact in the virtual address space usually imply conflicts, so that only about half of the TLB entries can be used simultaneously. Furthermore, a working set of 64 data pages will most likely lead to cache thrashing: the 8K data cache then supports on average only 4 × 32 bytes per page and, since it is only 2-way set-associative, in many cases only 1 or 2 cache entries can be used per page.

Fortunately, this is not a real problem, since on PowerPC and Pentium, address-space switches can be handled differently. The PowerPC architecture includes segment registers which can be controlled by the µ-kernel and offer an additional address-translation facility from the local 2³²-byte address space into a global 2⁵²-byte space. If we regard the global space as a set of one million local spaces, address-space switches can be implemented by reloading the segment registers instead of switching the page table. With 29 cycles for 3.5 GB or 12 cycles for 1 GB, segment switching is cheap, and the overhead of the no longer required TLB flush disappears. Conceptually, we get a tagged TLB.

Things are not quite as easy on the Pentium or the 486. Since segments are mapped into a 2³²-byte space, mapping multiple user address spaces into the one linear space must be handled dynamically and depends on the actually used sizes of the user address spaces. For example, we can multiplex one 3 GB user address space with 8 user spaces of 64 MB and additionally 128 user spaces of 1 MB. The trick is to share the smaller spaces with all large 3 GB spaces. Then any switch from a large space to a medium or small space is always fast. Switching between two large address spaces is uncritical anyway, since switching between two large working sets implies TLB and cache miss costs nevermind whether the two programs execute in the same or in different address spaces. For understanding that the restriction of a 2³²-byte global space is not crucial to performance, one has to mention that actively used user address spaces are very small in most cases, say 1 MB or less for a device driver, and that the multiplexing is transparent to the user. The pure address-space switch overhead then is 15 cycles on the Pentium and 39 cycles on the 486.
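The decision logic of this multiplexing could be sketched as follows; the threshold, function names and data layout are illustrative assumptions, not the actual kernel code.

    /* Sketch: dynamic choice between segment switch and page-table
       switch, following the multiplexing scheme described above.   */
    struct space {
        unsigned long size;      /* actually used size                  */
        unsigned long seg_base;  /* position in the shared linear space,
                                    only valid for small/medium spaces  */
        void *pdir;              /* hardware page directory             */
    };

    #define LARGE_SPACE (3UL << 30)   /* 3 GB */

    extern void load_segment(unsigned long base, unsigned long size);
    extern void load_page_table(void *pdir);

    void switch_space(struct space *to)
    {
        if (to->size < LARGE_SPACE)
            load_segment(to->seg_base, to->size); /* no TLB flush */
        else
            load_page_table(to->pdir);            /* flushes TLB  */
    }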
Table 1 shows the page-table switch and segment switch overheads for several processors. For a TLB miss, minimal and maximal cycles are given (provided no referenced or modified bits need updating); in the case of 486, Pentium and PowerPC, the actual value depends on whether the corresponding page-table entry is found in the cache or not. As a minimal working set, we assume 4 pages. For the maximum case, we exclude 4 pages from the address-space overhead costs, because at most 4 pages are required by the µ-kernel itself and would thus occupy TLB entries even if the address space were not switched.

                   TLB       TLB miss     page-table       segment
                   entries   cycles       switch cycles    switch cycles
    486            32        9...13       36...364         39
    Pentium        96        9...13       36...1196        15
    PowerPC 601    256       ?            ?                29
    Alpha 21064    40        20...50 (a)  80...1800        n/a
    Mips R4000     48        20...50 (a)  0 (b)            n/a

    (a) Alpha and Mips TLB misses are handled by software.
    (b) The R4000 has a tagged TLB.

    Table 1: Address-Space Switch Overhead.
The effect of using segment switches instead of page-table switches on the Pentium is shown in figure 2. One long-running application with a stable working set (2 to 64 data pages) executes a short RPC to a server with a small working set (2 pages). After the RPC, the application re-accesses all its pages. Measurement is done by 100,000 repetitions, comparing each run against running the application (100,000 times accessing all pages) without RPC. The given times are round-trip RPC times, user to user, plus the required time for re-establishing the application's working set.

[Figure 2: 1-byte-RPC by segment switch versus by page-table switch on a Pentium, 90 MHz: round-trip RPC time plus working-set re-establishment (µs, roughly 3 to 13 µs) plotted over the application's working-set size (2 to 64 pages); the segment-switch curve stays below the page-table-switch curve throughout.]

Conclusion. Properly constructed address-space switches are not very expensive: less than 50 cycles on modern processors. On a 100 MHz processor, they can be ignored for up to roughly 100,000 switches per second. Special optimizations, like executing dedicated servers in kernel space, are superfluous. Expensive context switching in some existing µ-kernels is due to implementation and is not caused by inherent problems with the concept.
4.1.3 Thread Switches and IPC

Ousterhout [1990] also measured context switching in some Unix systems by echoing one byte back and forth through pipes between two processes. Again normalized to a 10 Mips machine, most results are between 400 and 800 µs per ping-pong; one was 1450 µs. All existing µ-kernels are at least 2 times faster, but it is proved that 10 µs, i.e. a 40 to 80 times faster RPC, is achievable. Table 2 gives the costs of echoing one byte by a round-trip RPC, i.e. two IPC operations.⁸

⁸The respective data is taken from [Liedtke 1993; Hildebrand 1992; Schroeder and Burroughs 1989; Draves et al. 1991; van Renesse et al. 1988; Liedtke et al. 1995; Hamilton and Kougiouris 1993; Bershad et al. 1995; Bryce and Muller 1995; Engler et al. 1995; Bershad et al. 1989].

    System     CPU, MHz          RPC time      cycles/IPC
                                 (round trip)  (one way)
    L3         486, 50           10 µs         250     (full IPC semantics)
    QNX        486, 33           76 µs         1254
    Mach       R2000, 16.7       190 µs        1584
    SRC RPC    CVAX, 12.5        464 µs        2900
    Mach       486, 50           230 µs        5750
    Amoeba     68020, 15         800 µs        6000
    Spin       Alpha 21064, 133  102 µs        6783
    Mach       Alpha 21064, 133  104 µs        6916

    Table 2: 1-byte-RPC performance.

All times are user to user, cross-address space; they include system call, argument copy, stack and address-space switch costs. Exokernel, Spring and L3 show that communication can be made fast and that the costs are heavily influenced by the processor architecture: Spring on Sparc has to deal with register windows, whereas L3 is burdened by the fact that a 486 trap is 100 cycles more expensive than a Sparc trap.
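As a reading aid (our arithmetic, not part of the original table), the cycles/IPC column follows directly from the measured times: cycles per one-way IPC = round-trip RPC time × clock rate / 2. For example, L3 on the 486/50: 10 µs × 50 MHz / 2 = 250 cycles; Mach on the same processor: 230 µs × 50 MHz / 2 = 5750 cycles.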
Conclusion. IPC can be implemented fast enough to handle also hardware interrupts by this mechanism.

4.2 Memory Effects

Chen and Bershad [1993] compared the memory system behaviour of Ultrix, a large monolithic Unix system, with that of the Mach µ-kernel complemented by a Unix server. They measured the memory cycle overhead per instruction (MCPI) and found that programs running under Mach + Unix server had a substantially higher MCPI than the same programs running under Ultrix. For some programs, the differences were up to 0.25 cycles per instruction, averaged over the total program (user + system). A similar memory system degradation of Mach versus Ultrix is noticed by others [Nagle et al. 1994]. The crucial point is whether this problem is due to the way that Mach is constructed, or whether it is caused by the µ-kernel approach. Chen and Bershad [1993, p. 125] state:

    "This suggests that microkernel optimizations focussing exclusively on IPC [...], without considering other sources of system overhead such as MCPI, will have a limited impact on overall system performance."

Although one might suppose a principal impact of the OS architecture, the mentioned paper presents the facts only "as is" about the specific implementations, without analyzing whether architecture or implementation is the more important reason for the measured system degradation. Careful analysis of the results is thus required.

According to the original paper, we comprise under 'system' either all Ultrix activities or the joined activities of the Mach µ-kernel, the Unix emulation library and the Unix server. The Ultrix case is denoted by U, the Mach case by M. We restrict our analysis to the samples that show a significant MCPI difference for both systems: sed, egrep, yacc, gcc, compress, espresso and the andrew benchmark ab. In figure 3, we present the results of Chen's figure 2-1 in a slightly reordered way: we have colored black the MCPI fractions that are caused by system cache misses, and white all other causes, i.e. system write-buffer stalls, system uncached reads and user cache misses.

[Figure 3: Baseline MCPI for Ultrix (U) and Mach (M), split into system cache misses (black) and all other causes (white).]

It is easy to see that the white fractions differ only slightly between the corresponding U and M bars, whereas the black ones differ substantially: the increased system MCPI of Mach is essentially caused by system i-cache or d-cache misses.

Such misses are either conflict or capacity misses. Chen and Bershad measured cache conflicts by comparing the direct-mapped cache with a simulated 2-way cache⁹ and show that the ratio of conflict to capacity misses in Mach is even lower than in Ultrix. Figure 4 shows the conflict (black) and capacity (white) system cache misses, both on an absolute scale (left) and as a ratio (right). From this we can deduce that the increased system cache misses are not an associativity artifact but are caused by a larger cache working set, i.e. by the higher cache consumption of the Mach system (µ-kernel + emulation library + Unix server).

⁹Conflict misses are determined here in the sense of Hill and Smith [1989]. Since the caches are physically tagged, this method is only a first-level approximation, but it suffices for our purpose.¹⁰

¹⁰We can also exclude explicit cache flushes as a cause: the measured hardware was a uniprocessor whose caches do not even require explicit flushes for DMA.

[Figure 4: System MCPI caused by cache misses: conflict (black) versus capacity (white), absolute scale (left) and as a ratio (right).]

The next task is to find the component which is responsible for the higher cache consumption. We assume that the used Unix server behaves comparably to the corresponding part of the Ultrix kernel; this is supported by the fact that the samples spent even fewer instructions in Mach's Unix server than in the corresponding Ultrix routines. We can also exclude Mach's emulation library, since Chen and Bershad report that only 3% or less of system time is spent there, so that only a small fraction of the cache consumption can be caused by it. What remains is Mach itself, including trap handling, IPC and memory management, which therefore must induce nearly all of the additional cache misses.
Therefore, the mentioned measurements suggest that the memory system degradation is caused solely by the high cache consumption of the µ-kernel. Or, in other words: drastically reducing the cache working set of the µ-kernel will solve the problem. A high µ-kernel cache consumption can only be explained by a large set of frequently used kernel procedures or by high cache working sets of a few frequently used operations. According to section 2, the first case has to be considered as a conceptual mistake. Large cache working sets are also not an inherent feature of µ-kernels: for example, L3 requires less than 1K for short IPC. (Recall: voluminous communication is done by dynamic or static mapping of pages, not by copying very long messages.)

A µ-kernel based system executes context switches more frequently, but we showed (sections 4.1.2 and 4.1.3) that properly implemented switches are not expensive, and a small kernel working set ensures that the cache is not flooded when entering and leaving the kernel. What remains is the increase in cache misses caused by switching between applications with large working sets (Mogul and Borg [1991] report on this effect for preemptively-scheduled context switches). This, however, depends mostly on the application load and the requirement for interleaved execution (timesharing); the type of kernel is almost irrelevant, in the sense that it makes not much difference whether the interleaved execution happens in one address space (application + macro-kernel) or in multiple address spaces (application + servers).

Conclusion. The hypothesis that µ-kernel architectures inherently lead to memory system degradation is not substantiated. On the contrary, the quoted measurements support the hypothesis that properly constructed µ-kernels will automatically avoid the measured memory system degradation.
5 Non-Portability

Older µ-kernels were built machine-independently on top of a small hardware-dependent layer. This approach has strong advantages from the software technological point of view: programmers did not need to know very much about processors, and the resulting µ-kernels could easily be ported to new machines. Unfortunately, this approach prevented these µ-kernels from achieving the necessary performance, and thus flexibility.

In retrospective, we should not be surprised, since building a µ-kernel on top of abstract hardware has serious implications:

• Such a µ-kernel cannot take advantage of specific hardware.

• It cannot take precautions to circumvent or avoid performance problems of specific hardware.

• The additional layer per se costs performance.

µ-kernels form the lowest layer of operating systems beyond the hardware. Therefore, we should accept that they are as hardware dependent as optimizing code generators. We have learned that not only the coding but even the algorithms used inside a µ-kernel and its internal concepts are extremely processor dependent.

5.1 Compatible Processors

For illustration, we briefly describe how a µ-kernel has to be modified even when it is "ported" from the 486 to the Pentium, i.e. to a binary compatible processor. Although the Pentium is compatible to the 486, there are some differences in the internal hardware architecture (see table 3) which influence the µ-kernel implementation.

                           486                Pentium
    cache                  8K unified         8K (i) + 8K (d)
                           4-way, 16B lines   2-way, 32B lines
                           write through      write back
    simple instructions    1 cycle            0.5–1 cycle
    segment register load  9 cycles           3 cycles
    trap                   107 cycles         69 cycles

    Table 3: 486 / Pentium internal architecture.

User-address-space implementation. As mentioned in section 4.1.2, a Pentium µ-kernel should use segment registers for implementing user address spaces, so that each 2³²-byte hardware address space shares all small and one large user address space. Recall that this multiplexing is transparent to the user. Ford [1993] proposed a similar technique for Mach, and table 1 suggests it even for the 486. Nevertheless, the conventional hardware address-space switch is preferable on the 486: its expensive segment register loads, together with the additional checks and operations in the kernel, sum up to roughly 130 cycles required in addition per switch.

Now look at the relevant situation: an address-space switch from a large space to a small one and back to the large one. Assuming cache hits, the costs of the segment register model would be (130 + 39) × 2 = 338 cycles on the 486, whereas the conventional address-space model requires 28 × 9 + 36 = 288 cycles in the theoretical case of 100% TLB use, only 14 × 9 + 36 = 162 cycles for the more probable case that the large address space uses 50% of the TLB, and 72 cycles in the best case. In total, the conventional method wins on the 486.

On the Pentium however, the segment register method pays. The reasons are several: (a) Segment register loads are faster. (b) Fast instructions are cheaper, whereas the trap and TLB-miss overheads remain nearly constant relative to instruction performance. (c) Conflict cache misses (which are anyway more expensive) become more likely because of the reduced associativity; avoiding TLB misses therefore also reduces cache conflicts. (d) Due to the three times larger TLB, the flush costs can increase substantially. As a result, on the Pentium, the segment register method always pays (see figure 2).
As a consequence, we have to implement an additional user-address-space multiplexer, and we have to modify the address-space switch routines, the handling of user-supplied addresses, thread control blocks, task control blocks, the IPC implementation and the address-space structure as seen by the kernel. In total, the mentioned changes affect algorithms in about half of all µ-kernel modules.

IPC implementation. Due to the reduced associativity, the Pentium caches tend to exhibit more conflict misses. In the 486 kernel, thread control blocks (including kernel stacks) were page aligned, so that the cache hardware maps the corresponding parts of any two control blocks to identical cache addresses. Since the 486 cache is 4-way set-associative, this could be ignored there (see [Liedtke 1993] for details). The Pentium's data cache, however, is only 2-way set-associative, and an IPC operation always accesses two control blocks and kernel stacks simultaneously, so that page-aligned control blocks would compete in the cache. A nice optimization is to align thread control blocks no longer on 4K but on 1K boundaries (1K is the lower bound, due to internal kernel-stack requirements). Then there is a 75% chance that two randomly selected control blocks do not compete in the cache. Surprisingly, this change affects the internal bit-structure of the unique thread identifiers, since the kernel derives control-block addresses from uids; it has no effect on the 486 kernel and can be implemented transparently to the user. Finally, the cache behaviour during IPC can be further improved by restructuring the thread control block data such that it profits from the doubled cache line size.
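To make the uid/alignment interaction concrete: if the kernel computes the control-block address directly from bits of the uid, changing the alignment changes which bits those are. The layout below is purely illustrative (we do not know the actual L3 bit-structure); only the dependency matters.

    /* Sketch: deriving a thread control block (TCB) address from a uid.
       TCB_AREA, the field width and TCB_ALIGN are assumptions.         */
    struct tcb;                               /* opaque here            */

    #define TCB_AREA   0xE0000000UL           /* assumed TCB region     */
    #define TCB_ALIGN  1024UL                 /* 1K instead of 4K       */

    struct tcb *tcb_of(unsigned long uid)
    {
        unsigned long thread_no = uid & 0x3FFFFUL;   /* assumed field   */
        return (struct tcb *)(TCB_AREA + thread_no * TCB_ALIGN);
    }

    /* With 1K alignment, two random TCBs fall into the same sets of an
       8K 2-way data cache only with probability 1/4, i.e. the 75%
       chance mentioned above.                                          */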
We compare Exokernel’s
protected
control transfer
(PCT) with L3’s IPC. Exo-PCT
on the R2000 requires
about 35 cycles, whereas L3 takes 250 cycles on a 486
processor for an 8-byte message transfer.
If this dif-
misses.
One simple way to improve
cache behaviour
during IPC is by restructuring
the thread control block
data such that it profits from the doubled cache line
size. This can be adopted
on a 486. Al-
understanding
the performance-determining
factors
weighting
the differences in processor architecture.
associativity,
increased
on an R2000 and L3 running
not taken)
gives 1.6 for well-optimized
code.
Thus we
estimate that the Exo-IPC would cost up to approx. 100
486-cycles being definitely
less than L3’s 250 cycles.
This substantial
difference in timing indicates an isolated
dz~erence
between both processor architectures
that strongly
influences
IPC (and perhaps other pkernel
mechanisms),
In fact,
but not average programs.
the 486 processor
imposes
a high penalty
entering/exiting
the kernel and requires
per IPC due to its untagged TLB. This
107 + 49 = 156 cycles. On the other hand. the R2000
has a tagged TLB, i.e. avoids the TLB flush, and needs
less than 20 cycles for entering and exiting the kernel.
the Powthat
on
a TLB flush
costs at least
the
246
From the above example,
●
For well-engineered
p-kernels on different processor
architectures,
in particular
with different
memory
systems,
●
we learn two lessons:
we should
expect
ences that
are not related
formance.
Different
architectures
optimization
p-kernel
isolated
to overall
require
techniques
that
timing
differ-
processor
per-
processor-specific
even affect the global
structure.
To understand
the second point,
tory
486-TLB
flush requires
ber
of subsequent
TLB
recall that
minimization
misses.
The
the mandaof the num-
relevant
tech-
niques [Liedtke 1993, pp. 179,182-183] are mostly based
on proper address space construction:
concentrating
processor-internal
tables and heavily
one page (there
implementing
objects,
11 TLB
is no unmapped
control
used kernel data in
memory
on then 486),
blocks and kernel stacks as virtual
lazy scheduling.
In toto, these techniques save
misses, i.e. at least 99 cycles on the 486 and are
thus inevitable.
Due to its unmapped
memory
TLB,
the mentioned
constraint
facility
and tagged
disappears
on the
R2000.
Consequently,
the internal
structure
(address
space structure,
page fault handling,
perhaps control
block
access and scheduling)
can substantially
tors also imply
objects,
from
implementing
of a corresponding
a 486-kernel.
control
even the uids will differ
x pointer
swe
differ
fac-
blocks as physical
between
size + z) and 486 kernel
kernel
If other
(no
the R2000 (no
x controi
block
+x).
Conclusion.
mal “p’’-set
p-kernels form the link between a miniof abstractions
and the bare processor. The
performance
demands
microprogramming.
are comparable
As a consequence,
to those of earlier
p-kernels
are in-
herently not portable.
Instead, they are the processor
dependent basis for portable operating
systems.
6 Synthesis, Panda, Spin, Cache-Kernel, DP-Mach and Exokernel

Synthesis. Henry Massalin's Synthesis operating system [Pu et al. 1988] is another example of a high performing (and non-portable) kernel. Its distinguishing feature was a kernel-integrated "compiler" which generated kernel code at runtime. For example, when a read-pipe system call was issued, the Synthesis kernel generated specialized code for reading out of this pipe and modified the respective invocation. This technique was highly successful on the 68030. However (a good example for non-portability), it would most probably no longer pay on modern processors, because (a) code inflation will degrade cache performance and (b) the frequent generation of small code chunks pollutes the instruction cache.

Spin. Spin [Bershad et al. 1994; Bershad et al. 1995] is a new development which tries to extend the Synthesis idea: user-supplied algorithms are translated by a kernel compiler and added to the kernel, i.e. the user may write new system calls. By controlling branches and memory references, the compiler ensures that the newly-generated code does not violate kernel or user integrity. This approach reduces kernel–user mode switches and sometimes address-space switches.

Spin is based on Mach and may thus inherit many of its inefficiencies, which makes it difficult to evaluate performance results; rescaling them to an efficient µ-kernel with fast kernel–user mode switches and fast IPC is needed. The most crucial problem, however, is the estimation of how an optimized µ-kernel architecture and the requirements coming from a kernel compiler interfere with each other. Kernel architecture and performance might e.g. be affected by the requirement for larger kernel stacks (a pure µ-kernel needs only a few hundred bytes per kernel stack). Furthermore, the costs of the safety-guaranteeing code have to be related to the µ-kernel overhead and to optimal user-level code.

The first published results [Bershad et al. 1995] cannot answer these questions. On an Alpha 21064, 133 MHz, a Spin system call needs nearly twice as many cycles (1600, 12 µs) as the already expensive Mach system call (900, 7 µs). The measurements show that the Mach kernel can be substantially improved by using a kernel compiler; however, it remains open whether this technique can reach or outperform a pure µ-kernel approach like the one described here. For example, a simple user-level page-fault handler (1100 µs under Mach) executes in 17 µs under Spin. However, we must take into consideration that in a traditional µ-kernel, the kernel is invoked and left only twice: page fault (enter), message to pager (exit), reply map message (enter + exit). The Spin technique can save only one system call, which on this processor should cost less than 1 µs, i.e. with 12 µs the actual Spin overhead is far beyond the ideal of 1+1 µs for a traditional µ-kernel. From our experience, we expect a notable gain if a kernel compiler eliminates nested IPC redirection, e.g. when deep hierarchies of Clans or Custodians [Härtig et al. 1993] are used. Efficient integration of the kernel-compiler technique and appropriate µ-kernel design might be a promising research direction.

Utah-Mach. Ford and Lepreau [1994] changed Mach IPC semantics to migrating RPC, which is based on thread migration between address spaces, similar to the Clouds model [Bernabeu-Auban et al. 1988]. A substantial performance gain was achieved, a factor of 3 to 4.

DP-Mach. DP-Mach [Bryce and Muller 1995] implements multiple domains of protection within one user address space and offers a protected inter-domain call. The performance results (see table 2) are encouraging. However, although this inter-domain call is highly specialized, it is twice as slow as achievable by a general RPC mechanism. In fact, an inter-domain call needs two kernel calls and two address-space switches; a general RPC requires two additional thread switches and argument transfer.¹² Apparently, the kernel call and address-space switch costs dominate. To analyze whether such a kernel enrichment by inter-domain calls pays, we would probably need a Pentium implementation and then compare it with a general RPC based on segment switching. Bryce and Muller presented an interesting optimization for small inter-domain calls: when switching back from a very small domain, the TLB is only selectively flushed. Although the effects are rather limited on their host machine (a 486 with only 32 TLB entries), this might become more relevant on processors with larger TLBs.

¹²Sometimes, the argument transfer can be omitted. For implementing inter-domain calls, a pager can be used which shares the address spaces of caller and callee such that the trusted callee can access the parameters in the caller space. E.g. LRPC [Bershad et al. 1989] and NetWare [Major et al. 1994] use a similar technique.
It might
To analyze
tions
pare it with
probably
Panda. The Panda [Assenmacher et al. 1993] µ-kernel is a further example of a small kernel which delegates as much as possible to user space. Besides its two basic concepts, protection domain and virtual processor, the Panda kernel handles only interrupts and exceptions.

Cache-Kernel. The Cache-kernel [Cheriton and Duda 1994] is also a small and hardware-dependent kernel. In contrast to Spin and the Exokernel, it relies on a small fixed (non extensible) set of abstractions: kernels, threads, address spaces and mappings. It caches kernel data; 'caching' refers to the fact that the µ-kernel never handles the complete set of e.g. all address spaces, but only a dynamically selected subset. It was hoped that this technique would lead to a smaller µ-kernel interface and also to less µ-kernel code, since the kernel no longer has to deal with special but infrequent cases. In fact, such caching could be done as well on top of a pure µ-kernel by means of according pagers. (Kernel data structures, e.g. thread control blocks, could be held in virtual memory in the same way as other data.)

Exokernel. The Exokernel [Engler et al. 1994; Engler et al. 1995] is a small and hardware-dependent µ-kernel. In accordance with our processor-dependency thesis, the Exokernel is tailored to the R2000 and gets excellent performance values for its primitives. In contrast to our approach, it is based on the philosophy that a kernel should not provide abstractions but only a minimal set of primitives; accordingly, its interface is architecture dependent, dedicated in particular to software-controlled TLBs. A further difference to our driver-less µ-kernel approach is that the Exokernel appears to partially integrate device drivers, in particular for disks, networks and frame buffers.

We believe that dropping the abstractional approach could only be justified by substantial performance gains. Whether these can be achieved remains open (see the discussion in section 5.2) until we have well-engineered, comparable µ-kernels and Exokernels on the same hardware platform. We should try to decide these questions by constructing two such reference kernels. It might then turn out that securely multiplexing pure hardware primitives is superior or, on the other hand, that abstractions are even more efficient than securely multiplexing hardware primitives. Such a co-construction will also lead to new insights for both approaches.

7 Conclusions

A µ-kernel can provide higher layers with a minimal set of appropriate abstractions that are flexible enough to allow the implementation of arbitrary operating systems and to allow the exploitation of a wide range of hardware. The presented mechanisms (address spaces with map, flush and grant operation, threads with IPC and unique identifiers) form such a basis. Multi-level-security systems may additionally need clans or a similar reference monitor concept. Choosing the right abstractions is crucial for both flexibility and performance; some existing µ-kernels chose inappropriate abstractions, or too specialized or too many ones, and are therefore too inflexible.

Similar to optimizing code generators, µ-kernels must be constructed per processor and are inherently not portable. Basic implementation decisions, most algorithms and data structures inside a µ-kernel are processor dependent; their design must be guided by performance prediction and analysis. Besides inappropriate abstractions, most mistakes of existing µ-kernels come from insufficient understanding of the combined hardware-software system or from inefficient implementation.

The presented design shows that it is possible to achieve well performing µ-kernels through processor-specific implementations of processor-independent abstractions.
Availability

The source code of the L4 µ-kernel, a successor of the L3 µ-kernel, is available for examination and experimentation through the web: http://borneo.gmd.de/RS/L4.
Acknowledgements

Many thanks to Hermann Härtig for discussion and to Rich Uhlig for proofreading and stylistic help. Further thanks for reviewing remarks to Dejan Milojicic and some anonymous referees, and to Sacha Krakowiak for shepherding.
Appendix: Address Spaces

An Abstract Model of Address Spaces

We describe address spaces as mappings. The initial address space is

    σ0 : V → R ∪ {∅} ,

where V is the set of virtual pages, R the set of available (real) physical pages and ∅ the nilpage which cannot be accessed. Further address spaces are defined recursively as mappings

    σ : V → (Σ × V) ∪ {∅} ,

where Σ is the set of address spaces. It is convenient to regard each mapping as a one-column table which contains σ(v) for all v ∈ V and can be indexed by v. We denote the elements of this table by σ_v.

All modifications of address spaces are based on the replacement operation

    σ_v := x .

A page potentially mapped at v in σ is flushed, and the new value x is copied into σ_v. This operation is internal to the µ-kernel; we use it only for describing the three exported operations.

A subsystem S with address space σ can grant any of its pages v to a subsystem S′ with address space σ′, provided S′ agrees:

    σ′_v′ := σ_v ;  σ_v := ∅ .

The granted page is transferred to σ′ and removed from σ. S determines which of its pages should be granted, whereas S′ determines at which virtual address v′ the granted page should be mapped in the receiving address space.

A subsystem S with address space σ can map any of its pages v to a subsystem S′ with address space σ′, provided S′ agrees:

    σ′_v′ := (σ, v) .

In contrast to grant, the mapped page remains in the mapper's address space σ; a link (σ, v) is stored at v′ in the mapped address space instead of transferring the existing link. Note that the evaluation of σ′(v′) then includes the evaluation of σ(v). This operation permits constructing address spaces recursively, i.e. new spaces based on existing ones.

S can flush any of its pages v:

    ∀ σ′, v′ :  σ′_v′ = (σ, v)  ⟹  flush (σ′, v′) , σ′_v′ := ∅ .

Flushing, the reverse operation of map, can be executed without explicit agreement of the mappees, since they agreed implicitly when accepting the prior map operation. Since flush is applied recursively, flushing (σ, v) affects not only the pages mapped directly from (σ, v) but also all mappings indirectly derived from them. No cycles can be established by these three operations, since the replacement operation flushes the destination prior to copying.

At a first glance, deriving the physical address of a page v in address space σ seems to be rather complicated and expensive:

    σ(v) = σ′(v′)   if σ_v = (σ′, v′) ,
    σ(v) = σ_v      if σ_v ∈ R ∪ {∅} .

Fortunately, a recursive evaluation of σ(v) is never required. For the implementation, we complement each σ by an additional table P, where P_v corresponds to σ_v and holds either the physical address of v or ∅. Mapping and granting then include P′_v′ := P_v, and each replacement σ_v := ∅ invoked by a flush operation includes P_v := ∅. The according P-entry can always be used instead of evaluating σ(v). In fact, P is equivalent to a hardware page table: µ-kernel address spaces can be implemented straightforwardly by means of the hardware address-translation facilities.
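To make the model operational, the following minimal, self-contained C sketch implements σ-tables with grant, map and recursive flush, including the auxiliary table P (here phys[]). It is our illustration of the abstract model above, not code from L3 or L4; all names and sizes are assumptions.

    #include <stdio.h>

    #define NPAGES  4   /* |V|, illustrative */
    #define NSPACES 3   /* enough spaces for the demo */

    typedef struct space space_t;

    /* sigma_v is either nil, a real frame, or a link (space, page) */
    typedef struct {
        enum { NIL, REAL, LINK } tag;
        unsigned frame;   /* valid if tag == REAL */
        space_t *space;   /* valid if tag == LINK */
        unsigned page;    /* valid if tag == LINK */
    } entry_t;

    struct space {
        entry_t  sigma[NPAGES];  /* the one-column table sigma        */
        unsigned phys[NPAGES];   /* P: cached physical frame, 0 = nil */
    };

    static space_t spaces[NSPACES];  /* stands in for the set Sigma */

    /* flush(s, v): erase every entry directly or indirectly
     * derived from (s, v), following the recursive definition */
    static void flush(space_t *s, unsigned v)
    {
        for (int i = 0; i < NSPACES; i++)
            for (unsigned w = 0; w < NPAGES; w++) {
                entry_t *e = &spaces[i].sigma[w];
                if (e->tag == LINK && e->space == s && e->page == v) {
                    flush(&spaces[i], w);     /* recurse first   */
                    e->tag = NIL;             /* sigma'_w := nil */
                    spaces[i].phys[w] = 0;    /* P'_w := nil     */
                }
            }
    }

    /* map: sigma'_v' := (sigma, v); replacement flushes v' first */
    static void map(space_t *s, unsigned v, space_t *s2, unsigned v2)
    {
        flush(s2, v2);
        s2->sigma[v2] = (entry_t){ LINK, 0, s, v };
        s2->phys[v2] = s->phys[v];            /* P'_v' := P_v */
    }

    /* grant: sigma'_v' := sigma_v ; sigma_v := nil */
    static void grant(space_t *s, unsigned v, space_t *s2, unsigned v2)
    {
        flush(s2, v2);
        s2->sigma[v2] = s->sigma[v];
        s2->phys[v2] = s->phys[v];
        flush(s, v);          /* sigma_v := nil is a replacement too */
        s->sigma[v] = (entry_t){ NIL, 0, 0, 0 };
        s->phys[v] = 0;
    }

    /* recursive evaluation of sigma(v); never needed in practice,
     * since phys[] (the table P) caches the same information */
    static unsigned eval(space_t *s, unsigned v)
    {
        entry_t e = s->sigma[v];
        if (e.tag == LINK) return eval(e.space, e.page);
        return e.tag == REAL ? e.frame : 0;   /* 0 stands for nil */
    }

    int main(void)
    {
        space_t *s0 = &spaces[0], *a = &spaces[1], *b = &spaces[2];
        s0->sigma[0] = (entry_t){ REAL, 42, 0, 0 };  /* page 0 -> frame 42 */
        s0->phys[0] = 42;

        map(s0, 0, a, 1);    /* sigma0 maps page 0 to a at 1 */
        map(a, 1, b, 2);     /* a maps it on to b at 2       */
        printf("b(2) = %u, P_b[2] = %u\n", eval(b, 2), b->phys[2]);

        flush(s0, 0);        /* both derived mappings vanish */
        printf("b(2) = %u after flush\n", eval(b, 2));

        grant(s0, 0, a, 3);  /* page moves: s0 loses it, a owns it */
        printf("a(3) = %u, s0(0) = %u after grant\n", eval(a, 3), eval(s0, 0));
        return 0;
    }

Note that this naive flush scans all spaces for links to (s, v) on every recursion step; the tree-based implementation described next exists precisely to avoid such a search.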
Implementing Address Spaces

The recommended implementation of σ is to use one tree per physical page frame which describes all actual mappings of the frame. Each node contains (P, v), where v is the according virtual page in the address space which is implemented by the page table P. Assume that a grant-, map- or flush-operation deals with a page v in an address space σ to which the page table P is associated. In a first step, the operation selects the according tree by P_v, the physical page. In the next step, it selects the node of the tree that contains (P, v). (This selection can be done by parsing the tree, or in a single step if P_v is extended by a link to the node.) Granting then simply replaces the values stored in the node, and map creates a new child node for storing (P′, v′). Flush leaves the selected node unaffected but parses and erases the complete subtree, where P′_v′ := ∅ is executed for each node (P′, v′) in the subtree.
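The following sketch illustrates this mapping database: one tree per frame, nodes holding a pointer to their page-table slot P_v, map creating child nodes, grant replacing a node's values while keeping its subtree, and flush erasing the complete subtree. It is our reading of the scheme just described; the identifiers are invented for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* One tree per physical page frame describes all mappings of
     * the frame.  Each node holds (P, v), modelled here simply as
     * a pointer to the page-table slot P_v (0 = nil). */
    typedef struct node {
        struct node *child;    /* first mapping derived from this one */
        struct node *sibling;  /* next mapping of the same parent     */
        unsigned    *pte;      /* the slot P_v                        */
    } node_t;

    /* map: create a child node for (P', v'), install the frame there */
    static node_t *map_page(node_t *parent, unsigned *pte2)
    {
        node_t *n = malloc(sizeof *n);
        n->child = NULL;
        n->pte = pte2;
        *pte2 = *parent->pte;        /* P'_v' := P_v */
        n->sibling = parent->child;  /* hang below the mapper's node */
        parent->child = n;
        return n;
    }

    /* grant: the node keeps its subtree, only its (P, v) changes */
    static void grant_page(node_t *n, unsigned *pte2)
    {
        *pte2 = *n->pte;   /* P'_v' := P_v */
        *n->pte = 0;       /* P_v := nil: the page leaves the granter */
        n->pte = pte2;
    }

    /* flush: the selected node stays; its complete subtree is
     * erased, executing P'_v' := nil for every subtree node */
    static void flush_page(node_t *n)
    {
        for (node_t *c = n->child; c; ) {
            node_t *next = c->sibling;
            flush_page(c);
            *c->pte = 0;
            free(c);
            c = next;
        }
        n->child = NULL;
    }

    int main(void)
    {
        unsigned pt0[4] = {0}, ptA[4] = {0}, ptB[4] = {0};

        /* root mapping: frame 42 at page 1 of the initial space */
        node_t root = { NULL, NULL, &pt0[1] };
        pt0[1] = 42;

        node_t *a = map_page(&root, &ptA[2]);  /* sigma0 maps to A */
        map_page(a, &ptB[3]);                  /* A maps on to B   */
        printf("before flush: A=%u B=%u\n", ptA[2], ptB[3]);

        flush_page(a);                         /* A flushes its page */
        printf("after flush:  A=%u B=%u\n", ptA[2], ptB[3]);

        grant_page(a, &ptB[0]);                /* A grants it away */
        printf("after grant:  A=%u B0=%u\n", ptA[2], ptB[0]);
        return 0;
    }

Because flush only walks the subtree below the selected node, its cost is proportional to the number of derived mappings, independent of how many address spaces exist in the system.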
References

Assenmacher, H., Breitbach, T., Buhler, P., Hübsch, V., and Schwarz, R. 1993. The Panda system architecture – a pico-kernel approach. In 4th Workshop on Future Trends of Distributed Computing Systems, Lisboa, Portugal, pp. 470–476.

Bernabeu-Auban, J. M., Hutto, P. W., and Khalidi, Y. A. 1988. The architecture of the Ra kernel. Tech. Rep. GIT-ICS-87/35 (Jan.), Georgia Institute of Technology, Atlanta, GA.

Bershad, B. N., Anderson, T. E., Lazowska, E. D., and Levy, H. M. 1989. Lightweight remote procedure call. In 12th ACM Symposium on Operating System Principles (SOSP), Litchfield Park, AZ, pp. 102–113.

Bershad, B. N., Chambers, C., Eggers, S., Maeda, C., McNamee, D., Pardyak, P., Savage, S., and Sirer, E. G. 1994. Spin – an extensible microkernel for application-specific operating system services. In 6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp. 68–71.

Bershad, B. N., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M., Becker, D., Chambers, C., and Eggers, S. 1995. Extensibility, safety and performance in the Spin operating system. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain Resort, CO, pp. xx-xx.

Brinch Hansen, P. 1970. The nucleus of a multiprogramming system. Commun. ACM 13, 4 (April), 238–241.

Bryce, C. and Muller, G. 1995. Matching micro-kernels to modern applications using fine-grained memory protection. In IEEE Symposium on Parallel and Distributed Systems, San Antonio, TX.

Cao, P., Felten, E. W., and Li, K. 1994. Implementation and performance of application-controlled file caching. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 165–178.

Chen, J. B. and Bershad, B. N. 1993. The impact of operating system structure on memory system performance. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 120–133.

Cheriton, D. R. and Duda, K. J. 1994. A caching model of operating system kernel functionality. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 179–194.

Digital Equipment Corp. 1992. DECchip 21064-AA Microprocessor. Data Sheet. Digital Equipment Corp.

Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. 1991. Using continuations to implement thread management and communication in operating systems. In 13th ACM Symposium on Operating System Principles (SOSP), Pacific Grove, CA, pp. 122–136.

Engler, D., Kaashoek, M. F., and O'Toole, J. 1994. The operating system kernel as a secure programmable machine. In 6th SIGOPS European Workshop, Schloß Dagstuhl, Germany, pp. 62–67.

Engler, D. R., Kaashoek, M. F., and O'Toole, J. 1995. Exokernel, an operating system architecture for application-level resource management. In 15th ACM Symposium on Operating System Principles (SOSP), Copper Mountain Resort, CO, pp. xx-xx.

Ford, B. and Lepreau, J. 1994. Evolving Mach 3.0 to a migrating thread model. In Usenix Winter Conference, San Francisco, CA, pp. 97–114.

Gasser, M., Goldstein, A., Kaufmann, C., and Lampson, B. 1989. The Digital distributed system security architecture. In 12th National Computer Security Conference (NIST/NCSC), Baltimore, MD, pp. 305–319.

Hamilton, G. and Kougiouris, P. 1993. The Spring nucleus: a microkernel for objects. In Summer Usenix Conference, Cincinnati, OH, pp. 147–160.

Härtig, H., Kowalski, O., and Kühnhauser, W. 1993. The BirliX security architecture. Journal of Computer Security 2, 1, 5–21.

Hildebrand, D. 1992. An architectural overview of QNX. In 1st Usenix Workshop on Micro-kernels and Other Kernel Architectures, Seattle, WA, pp. 113–126.

Hill, M. D. and Smith, A. J. 1989. Evaluating associativity in CPU caches. IEEE Transactions on Computers 38, 12 (Dec.), 1612–1630.

Intel Corp. 1990. i486 Microprocessor Programmer's Reference Manual. Intel Corp.

Intel Corp. 1993. Pentium Processor User's Manual, Volume 3: Architecture and Programming Manual. Intel Corp.

Kane, G. and Heinrich, J. 1992. MIPS Risc Architecture. Prentice Hall.

Kessler, R. E. and Hill, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems 10, 4 (Nov.), 338–359.

Khalidi, Y. A. and Nelson, M. N. 1993. Extensible file systems in Spring. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 1–14.

Lee, C. H., Chen, M. C., and Chang, R. C. 1994. HiPEC: high performance external virtual memory caching. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 153–164.

Liedtke, J. 1992. Clans & chiefs. In 12. GI/ITG-Fachtagung Architektur von Rechensystemen, Kiel, pp. 294–305. Springer.

Liedtke, J. 1993. Improving IPC by kernel design. In 14th ACM Symposium on Operating System Principles (SOSP), Asheville, NC, pp. 175–188.

Liedtke, J. 1995. Improved address-space switching on Pentium processors by transparently multiplexing user address spaces. Arbeitspapiere der GMD No. 933, GMD – German National Research Center for Information Technology, Sankt Augustin.

Major, D., Minshall, G., and Powell, K. 1994. An overview of the NetWare operating system. In Usenix Winter Conference, San Francisco, CA.

Mogul, J. C. and Borg, A. 1991. The effect of context switches on cache performance. In 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Santa Clara, CA, pp. 75–84.

Motorola Inc. 1993. PowerPC 601, RISC Microprocessor User's Manual. Motorola Inc.

Nagle, D., Uhlig, R., Mudge, T., and Sechrest, S. 1994. Optimal allocation of on-chip memory for multiple-API operating systems. In 21st International Symposium on Computer Architecture (ISCA), Chicago, IL, pp. 358–369.

Ousterhout, J. 1990. Why aren't operating systems getting faster as fast as hardware? In Usenix Summer Conference, Anaheim, CA, pp. 247–256.

Pu, C., Massalin, H., and Ioannidis, J. 1988. The Synthesis kernel. Computing Systems 1, 1 (Jan.), 11–32.

Romer, T. H., Lee, D., Bershad, B. N., and Chen, J. B. 1994. Dynamic page mapping policies for cache conflict resolution on standard hardware. In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), Monterey, CA, pp. 255–266.

Rozier, M., Abrossimov, V., Armand, F., Boule, I., Gien, M., Guillemont, M., Herrmann, F., Kaiser, C., Langlois, S., Leonard, P., and Neuhauser, W. 1988. Chorus distributed operating systems. Computing Systems 1, 4, 305–370.

Schroeder, M. D. and Burrows, M. 1989. Performance of Firefly RPC. In 12th ACM Symposium on Operating System Principles (SOSP), Litchfield Park, AZ, pp. 83–90.

Schröder-Preikschat, W. 1994. The Logical Design of Parallel Operating Systems. Prentice Hall.

van Renesse, R., van Staveren, H., and Tanenbaum, A. S. 1988. Performance of the world's fastest distributed operating system. Operating Systems Review 22, 4 (Oct.), 25–34.

Wulf, W., Cohen, E., Corwin, W., Jones, A., Levin, R., Pierson, C., and Pollack, F. 1974. Hydra: the kernel of a multiprocessor operating system. Commun. ACM 17, 6 (June), 337–345.

Yokote, Y. 1993. Kernel-structuring for object-oriented operating systems: the Apertos approach. In International Symposium on Object Technologies for Advanced Software. Springer.