Lker-Level Interprocess Communication for Shared Memory

advertisement
Lker-Level Interprocess Communication
for Shared Memory Multiprocessors
BRIAN N. BERSHAD
Carnegie
Mellon
University
THOMAS
E. ANDERSON,
University
of Washington
Interprocess
(between
communication
address
operating
IPC
applications
encounter
level
and
communication
facilities
These
when
at the user
level,
with
thread
Terms—thread,
local
to the design
of the
kernel,
is architecturally
from
one
problems
them
since
between
cornmzmzcation
of contemporary
but
limited
address
space
kernel-based
by the
to another.
cost of
Second,
their
own thread
management
the interaction
between
kernel-
User-Level
can be solved
at the user
kernel
address
structure
allows
the
sees threads
on the
same
Call
shared
kernel
within
the communica-
each address
and processor
Procedure
using
by moving
level
invocation
spaces
Remote
protocol
The programmer
unconventional
towards
central
M. LEVY
management.
these
is improved
This
oriented
threads
and must provide
problems
stemming
from
and supporting
motivated
and HENRY
the responsibility
a processor
space communication
space communication.
Index
kernel
communicating
observations
though
been
multiprocessor,
performance
cross-address
managed
IPC
has become
its performance
reallocating
and user-level
memory
out of the
be avoided
particular
First,
that
need inexpensive
functional
and performance
Communication
fast
in
has traditionally
problems
kernel
On a shared
tion
IPC
inherent
the
(IPC),
D. LAZOWSKA,
spaces on the same machine),
systems.
has two
invoking
EDWARD
can
machine.
(URPC)
memory
URPC
with
through
combines
lightweight
to be bypassed
and RPC
space.
reallocation
during
a
threads
cross-address
a conventional
interface,
performance,
multiprocessor,
operating
system,
parallel
programming,
performance,
communication
Categories
and
Subject
Descriptors:
D. 3.3
[Programming
Languages]:
Language
Constructs
[Operating Systems]: Process Management— concurrency, multiprocessing
/ multiprograrnmmg;
D.4.4 [Operating
Systems]: Communications Manfl!ow
agement; D.4. 6 [Operating Systems]: Security and Protection— accesscontrols, information
controls;
D.4. 7 [Operating
systems]: Organization and Desig~ D.4.8 [Operating Systems]:
Performance— measurements
and
Features
— modules,
This
material
is based
8619663,
and
on work
CCR-87OO1O6,
Digital
Program).
supported
CCFL8703049,
Equipment
Bershad
D.4. 1
packages;
Corporation
was
(the
supported
by the
and
National
CCR-897666),
Systems
by an AT&T
Science
the
Research
Ph.D.
Center
Fellowship.
Authors’
Pittsburgh,
addresses:
B. N. Bershad,
School of Computer
Science,
PA 15213;
T. E. Anderson,
E. D. Lazowska,
and
Science
Permission
not made
of the
and Engineering,
to copy without
or distributed
publication
Association
and
specific
permission.
@ 1991
ACM
fee all or part
for direct
its
for Computing
FR-35,
date
of this
commercial
appear,
Machinery.
0734-2071/91/0500-0175
ACM
University
Transactions
and
material
is given
To copy otherwise,
the
External
Anderson
CCRCenter,
Research
by an IBM
Carnegie
Mellon
University,
H. M
Levy,
Department
is granted
the ACM
that
(grants
Technology
and
of Washington,
advantage,
notice
and
Scholarship,
Graduate
Computer
Foundation
Washington
Seattle,
provided
copyright
copying
or to republish,
WA
that
notice
the copies
are
and the title
is by permission
requires
of
98195.
of the
a fee and/or
$01.50
on Computer
Systems,
Vol
9, No. 2, May 1991, Pages 175-198
176
.
General
B. N
Terms:
Addltlonal
Bershad
Design,
Key
et al
Performance,
Words
and
Phrases:
Measurement
Modularity,
remote
procedure
call,
small-kernel
operating
system
1. INTRODUCTION
Efficient
interprocess
communication
is central
to the design
of contemporary
operating
systems
[16, 231. An efficient
communication
facility
encourages
system decomposition
across address space boundaries.
Decomposed
systems
have several advantages
over more monolithic
ones, including
failure
isolation (address space boundaries
prevent
a fault in one module from “leaking”
into another),
extensibility
(new modules
can be added to the system without
having
to modify
mechanism
existing
rather
Although
than
address
which they
primitives.
ones),
and
modularity
(interfaces
are enforced
by
by convention).
spaces
can be a useful
structuring
device,
the
extent
to
can be used depends on the performance
of the communication
If cross-address
space communication
is slow, the structuring
benefits
that come from decomposition
are difficult
to justify
to end users,
whose primary
concern
is system
performance,
and who treat
the entire
operating
system as a “black
box” [181 regardless
of its internal
structure.
Consequently,
designers
are forced to coalesce weakly
related
subsystems
into
the same address
modularity
Interprocess
operating
problems:
space, trading
away
failure
isolation,
extensibility,
and
for performance.
communication
system
—Architectural
has traditionally
kernel.
However,
performance
barriers.
been the responsibility
kernel-based
The
communication
performance
of the
has
of kernel-based
two
syn-
chronous
communication
is architecturally
limited
by the cost of invoking
the kernel
and reallocating
a processor from one address space to another.
In our earlier
work on Lightweight
Remote Procedure
Call (LRPC) [10], we
show that
it is possible
to reduce
the overhead
of a kernel-mediated
cross-address
space call to nearly
the limit
possible
on
processor
architecture:
the time to perform
a cross-address
slightly
greater
than
that
required
to twice
invoke
a conventional
LRPC is only
the kernel
and have
it
reallocate
a processor from one address space to another.
The efficiency
of
LRPC comes from taking
a “common
case” approach
to communication,
thereby
avoiding
unnecessary
synchronization,
kernel-level
thread
management,
and data copying
for calls between
address spaces on the same
machine.
The majority
of LRPC’S overhead
(70 percent)
can be attributed
directly
to the fact that the kernel
mediates
every cross-address
space call.
—Interaction
between
kernel-based
communication
and
high-performance
The performance
of a parallel
application
running
on a
multiprocessor
can strongly
depend on the efficiency
of thread
management operations.
Medium
and fine-grained
parallel
applications
must use
a thread
management
system
implemented
at the user level to obtain
user-level
ACM
TransactIons
threads.
on Computer
Systems,
Vol
9, No
2, May
1991
User-Level Interprocess
satkfactory
have
Communication
performance
strong
for Shared Memory Multiprocessors
[6, 36]. Communication
interdependencies,
though,
and thread
and the
across
protection
boundaries
(kernel-level
level thread
management)
is high in terms
.
177
management
cost of partitioning
communication
of performance
them
and
userand system
complexity.
On
a shared
eliminate
Because
memory
multiprocessor,
these
problems
the kernel
from the path of cross-address
address spaces can share memory
directly,
have
space
shared
a solution:
communication.
memory
can be
used as the data transfer
channel.
Because a shared memory
multiprocessor
has more than one processor,
processor
reallocation
can often be avoided by
taking
advantage
of a processor
already
active
without
involving
the kernel.
User-level
management
of cross-address
improved
performance
over kernel-mediated
–Messages
kernel.
are sent
between
–Unnecessary
processor
reducing
call overhead,
contexts
across
inherent
exploited,
the
same
address
and
spaces,
processor
call
Procedure
directly,
results
without
invoking
in
the
without
components
management,
which
reallocation.
(URPC),
the
communication
kernel
can be
of a message
can be
mediation.
system
described
between
URPC
address
isolates
of interprocess
communication:
and data transfer.
Control
is the
is implemented
and receiving
its overhead
performance.
Call
safe and efficient
machine
programmer,
in the sending
improving
Remote
other the three
cation,
thread
spaces
space
reallocation
between
address spaces is eliminated,
and helping
to preserve
valuable
cache and TLB
parallelism
provides
address
space communication
approaches
because
reallocation
does prove to be necessary,
several independent
calls.
thereby
User-Level
paper,
target
calls.
–When
processor
amortized
over
—The
address
in the
communication
through
Only
processor
volvement;
thread
management
and
ment and interprocess
communication
rather
than by the kernel.
from
presented
of thread
reallocation
one an-
processor reallotransfer
between
abstraction
a combination
in this
spaces on
to the
management
requires
data transfer
do not. Thread
are done by application-level
kernel
in-
managelibraries,
URPC’S approach
stands in contrast
to conventional
“small
kernel”
systems in which the kernel
is responsible
for address spaces, thread
management, and interprocess
communication.
Moving
communication
and thread
management
to the user level leaves the kernel
responsible
only for the
mechanisms
that allocate
processors to address spaces. For reasons of performance and flexibility,
this is an appropriate
division
of responsibility
for
operating
systems
of shared
memory
multiprocessors.
(URPC
can also be
appropriate
for uniprocessors
running
multithreaded
applications.
)
The latency of a simple cross-address
space procedure
call is 93 psecs using
a URPC implementation
running
on the Firefly,
DEC SRC’S multiprocessor
workstation
[37]. Operating
as a pipeline,
two processors
can complete
one
ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May 1991.
178
.
call
every
B. N. Bershad et al
53 psecs.
that involve
support
for
kernel-based
user-level
On the
same
multiprocessor
hardware,
RPC
threads
that
accompany
URPC
takes
43
psecs,
while
operation
for the kernel-level
threads
that
accompany
LRPC
millisecond.
To put these figures
in perspective,
a same-address
dure
facilities
the kernel
are markedly
slower,
without
providing
adequate
user-level
thread
management.
LRPC,
a high
performanceimplementation,
takes 157 psecs. The Fork operation
for the
call takes
7 psecs on the Firefly,
and a protected
kernel
the
Fork
take over a
space proce-
invocation
(trap)
takes 20 psecs.
We describe the mechanics
of URPC in more detail in the next section. In
Section 3 we discuss the design rationale
behind
URPC.
We discuss performance
in Section
we present
4. In Section
2. A USER-LEVEL
In
many
5 we survey
related
work.
REMOTE PROCEDURE
contemporary
uniprocessor
applications
communicate
sages
narrow
across
operations
(create,
between
programs
a control
languages
memory
tween
in Section
6
with
channels
and
CALL FACILITY
multiprocessor
operating
system
that
or ports
support
and
that
for
by
only
a small
systems,
sending
mes-
number
of
send, receive,
destroy).
Messages
permit
communication
on the same machine,
separated
by address space bound-
data-structuring
support
device
synchronous
communication
address
operating
services
aries, or between
programs
on different
machines,
boundaries.
Although
messages
are a powerful
communication
sent
Finally,
our conclusions.
within
spaces is with
foreign
procedure
an
untyped,
primitive,
to traditional
call,
address
separated
data
space.
asynchronous
typing,
by
physical
they
repre-
Algol-like
and shared
Communication
messages.
be-
Programmers
of message-passing
systems who must use one of the many popular
Algol-like
languages
as a systems-building
platform
must think,
program,
and structure according
to two quite different
programming
paradigms.
Consequently,
almost every mature
message-based
operating
system also includes
support
for Remote
Procedure
Call (RPC) [111, enabling messages to serve as the
underlying
transport
mechanism
beneath
a procedure
call interface.
Nelson defines RPC as the synchronous
language-level
transfer
of control
between
programs
in disjoint
address spaces whose primary
communication
medium
is a narrow
channel
[301. The definition
of RPC is silent about the
operation
of that narrow
channel
and how the processor scheduling
(reallocation) mechanisms
interact
with data transfer.
IJRPC exploits
this silence in
two ways:
-Messages
are passed
between
address
spaces through
logical
channels
kept
in memory
that is pair-wise
shared between
the client
and server.
The
channels
are created and mapped once for every client/server
pairing,
and
are used for all subsequent
communication
between
the two address spaces
so that several interfaces
can be multiplexed
over the same channel.
The
integrity
of data passed through
the shared memory
channel
is ensured by
ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May
1991
User-Level Interprocess
a combination
statically)
correctness
Communication
for Shared Memory Multiprocessors
of the pair-wise
mapping
and the URPC runtime
is verified
dynamically).
(message
system
authentication
in each address
—Thread
management
is implemented
entirely
at the
integrated
with the user-level
machinery
that manages
nels.
A user-level
directly
thread
enqueuing
channel.
the
No kernel
Although
sends
a message
message
calls
on the
are necessary
a cross-address
space
server,
it blocks
waiting
to send a call
procedure
for the reply
link
call
signifying
179
is implied
space (message
user level
and is
the message chan-
to another
outgoing
perspective
of the programmer,
it is asynchronous
thread
management.
When a thread
in a client
.
address
of the
space
by
appropriate
or reply
message.
is synchronous
from
the
at and beneath the level of
invokes
a procedure
in a
the procedure’s
return;
while
blocked,
its processor
can run another
ready thread
in the same address
space. In our earlier
system, LRPC, the blocked thread
and the ready thread
were really
the same; the thread just crosses an address space boundary.
In
contrast,
URPC
address
that
always
tries
space on the client
can be handled
When
the reply
entirely
arrives,
to
schedule
thread’s
another
processor.
by the
This
user-level
the blocked
client
thread
thread
thread
from
the
is a scheduling
same
operation
management
system.
can be rescheduled
on any
of the processors allocated
to the client’s
address space, again without
kernel
intervention.
Similarly,
execution
of the call on the server side can be done
by a processor
already
and need not occur
By preferentially
takes
advantage
in switching
call
this
address
requires
executing
in the context
of the fact that
a processor
context
of the server’s
address
synchronously
with the call.
scheduling
threads within
the same address
there
to another
than
switching)
is significantly
thread
in
reallocating
space, URPC
less overhead
in the same address
space,
involved
space (we will
it to a thread
in
another
reallocation).
Processor reallocation
space (we will call this processor
changing
the mapping
registers
that
define
the virtual
address
space context
architectures,
in privileged
Several
in which
these
kernel
components
tion.
There
are
processor should
a processor
mapping
mode.
is executing.
registers
contribute
are protected
to the
high
On conventional
and can only
overhead
processor
be accessed
of processor
realloca-
scheduling
costs to decide the address
space to which
a
be reallocated;
immediate
costs to update the virtual
mem-
ory mapping
registers
and to transfer
the processor between
and long-term
costs due to the diminished
performance
of
translation
lookaside
buffer
(TLB) that occurs whefi locality
address space to another
[31. Although
there is a long-term
with context
switching
within
when processors are frequently
To demonstrate
the relative
the same address space, that cost is less than
reallocated
between
address spaces [201.
overhead
involved
in switching
contexts
be-
tween two threads
in the
between
address
spaces,
same-address
space
same-address
space
same address space
we introduce
the
On the
context
switch.
context
switch
requires
ACM
address spaces;
the cache and
shifts from one
cost associated
TransactIons
versus reallocating
processors
latency
concept
of a minimal
C-VAX,
the minimal
latency
saving
and then restoring
the
on Computer
Systems,
Vol
9, No
2, May 1991.
180
B. N Bershad et al
.
machine’s
general-purpose
registers,
reallocating
the
takes
55 psecs without
about
Because
reallocation,
switching
processor
from
and
takes
one address
including
about
15 psecs.
space to another
the long-term
C-VAX
cost.
of the cost difference
between
context
switching
URPC
strives
to avoid processor
reallocation,
whenever
In contrast,
on the
and processor
instead
context
possible.
Processor
reallocation
is sometimes
necessary
with
URPC,
though.
If a
client
invokes
a procedure
in a server that has an insufficient
number
of
processors
to handle
the call (e. g., the server’s
processors
are busy doing
other
reply.
with
work),
the client’s
processors
may idle for some time
awaiting
the
This is a load-balancing
problem.
URPC considers
an address space
pending
incoming
messages
on its
channel
to be “underpowered.”
An
idle processor in the client address space can balance the load by reallocating
itself to a server that has pending
(unprocessed)
messages from that client.
Processor
reallocation
requires
that
the kernel
be invoked
to change
the
processor’s
virtual
memory
context
to that
of the underpowered
address
space. That done, the kernel
upcalls
into a server routine
that handles
the
client’s
outstanding
requests.
After these have been processed, the processor
is returned
to the client address space via the kernel.
The responsibility
for detecting
incoming
messages and scheduling
threads
to handle
them belongs
to special,
low-priority
threads
that are part of a
URPC runtime
library
linked
incoming
messages only when
Figure
proceeding
1 illustrates
downward.
manager
(WinMgr)
a separate
address
procedures
into
they
each address space. Processors
would otherwise
be idle.
a sample
execution
timeline
One client,
an editor,
and
and a file cache manager
space. Two threads
in
in the window
manager
for URPC
two servers,
scan
for
with
time
a window
(FCMW),
are shown. Each is in
T1 and
T2, invoke
the editor,
and the file
cache manager.
Initially,
the
editor has one processor allocated
to it, the window
manager
has one, and the
space call into
file cache manager
has none. T1 first makes a cross-address
the window
manager
by sending
and then attempting
to receive a message.
The window
manager
receives
the message
and begins processing.
In the
meantime,
the thread
scheduler
in the editor has context
switched
from TI
(which is blocked waiting
T2 initiates
a procedure
to receive a message) to another
thread
call to the file cache manager
by sending
Thread
a message
T2.
T1 can be unblocked
because the window
and waiting
to receive
a reply.
T1 then calls
manager
has sent a reply message back to the editor.
Thread
into the file cache manager
and blocks. The file cache manager
now has two
pending
messages from the editor
but no processors
with which
to handle
T1 and T2 are blocked
in the editor,
which
has no other
them.
Threads
threads
waiting
to run. The editor’s
thread
management
system detects the
outgoing
pending
messages
and donates
a processor
to the file cache manager, which can then receive, process, and reply to the editor’s
two incoming
calls before returning
the processor back to the editor.
At this point,
incoming
reply messages from the file cache manager
can be handled.
Tz each terminate
ACM
TransactIons
when
on Computer
they
receive
Systems,
Vol
their
9, No
replies.
2, May
1991
the two
7’1 and
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
(has no processors)
(scanning channels)
m
181
FCMgr
WinMgr
Editor
.
m
Send WinMgr
TI
Recv & Process
Reply to T1’s Call
Call
[
Recv WinMgr
/
Context Switch
T2
Send FCMgr
Call
[
Recv FCMgr
Context
Switch
Send
TI
Call
[
FCMgr
Recv
FCMgr
Reallocate
Processor
Recv
to
FCMgr
Reply
Recv
& Process
to T2’s Call
& Process
Reply
to T1’s Call
Return
Processor
to Editor
Context
Terminate
Context
Terminate
Switch
T2
Switch
T1
Time
Fig.
1.
URPC
timeline
URPC’S treatment
of pending
messages is analogous
to the treatment
of
runnable
threads
waiting
on a ready queue. Each represents
an execution
context in need of attention
from a physical
processor;
otherwise
idle processors poll the queues looking
for work in the form of these execution
contexts.
There are two main differences
between
the operation
of message channels
and thread
ready queues.
First,
URPC enables
on address space to spawn
work in another
address space by enqueuing
a message (an execution
context
consisting
of some control
information
and arguments
or results).
Second,
ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May 1991
182
B. N Bershad et al.
.
Server
Client
Application
Application
Stubs
Message
URPC
❑
Stubs
Channel
mm
L
URPC
Fa~tThreads
FastThreads
/
\
/
~eallocate~roces~F-<
Fig,2
URPC
enables
another
Thesoftware
processors
through
components
in one address
an explicit
reallocation
of URPC
space to execute
in the
if the
between
workload
context
of
them
is
imbalance.
2.1
The User View
It is important
to emphasize
that all of the URPC mechanisms
described
in
this section exist “under
the covers” of quite ordinary
looking
Modula2
+ [34]
interfaces.
The RPC paradigm
provides
the freedom to implement
the control
and data transfer
mechanisms
in ways that are best matched
by the underlying hardware.
URPC
consists
of two
code. One package,
called
software
are managed
level
at the
user
packages
FczstThreacis
and
[5],
used by stubs and application
provides
lightweight
threads
that
scheduled
on top of middleweight
kernel
implements
the channel
managethreads.
The second package, called URPC,
package
ment and message primitives
described
in this section. The URPC
FastThreads,
as
lies directly
beneath
the stubs and closely interacts
with
shown
in Figure
3. RATIONALE
2.
FOR THE URPC DESIGN
In this section we discuss the design rationale
behind
URPC.
rationale
is based on the observation
that there
are several
components
to a cross-address
space call and each can be
separately.
The main
components
In brief, this
independent
implemented
are the following:
management:
blocking
the caller’s thread, running
a thread through
– ‘Thread
the procedure
in the server’s
address
space, and resuming
the caller’s
thread on return,
ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May
1991
User-Level Interprocess
—Data
moving
transfer:
spaces,
arguments
between
the
client
and
.
server
183
address
and
–Processor
ensuring
that there is a physical
processor
in the server and the server’s reply in the client.
to handle
reallocation:
the client’s
In
Communicahon for Shared Memory Mulhprocessors
call
conventional
message
systems,
these
three
components
are
combined
beneath
a kernel interface,
leaving
the kernel responsible
for each. However,
thread management
and data transfer
do not require
kernel
assistance,
only
processor
how
reallocation
the
another,
3.1
does. In the three
components
and the benefits
Processor
attempts
occurs
URPC
through
the
optimistically
—The
client
address
—The
to reduce
arise
from
that
call
follow
we describe
can be isolated
from
one
such a separation.
has
serious
the
frequency
with
use of an optimistic
assumes the following:
other
messages,
not have
that
space
Reallocation
URPC
incoming
subsections
of a cross-address
work
and
effect
to
do,
in
a potential
which
processor
reallocation
the
delay
form
of runnable
in the
on the performance
reallocation
At
policy.
processing
of other
threads
call
time,
threads
of a call
or
will
in the client’s
space.
server
has, or will
soon have,
a processor
with
which
it can service
a
message.
In
terms
several
context
of performance,
ways. The first
switch
between
URPC’S
assumption
user-level
optimistic
makes
threads
assumptions
can pay
off in
it possible
to do an inexpensive
during
the blocking
phase of a
cross-address
space call. The second assumption
enables a URPC to execute
in parallel
with
threads
in the client’s
address
space while
avoiding
a
processor
reallocation.
An implication
of both
assumptions
is that
it is
possible
to amortize
the
cost of a single
outstanding
calls between
the
threads in the same client make
a single
processor
reallocation
processor
reallocation
across
several
same address
spaces. For example,
if two
calls into the same server in succession, then
can handle
both.
A user-level
approach
to communication
and thread management
is appropriate for shared memory
multiprocessors
where applications
rely on aggressive multithreading
to exploit parallelism
while at the same time compensating for multiprogramming
effects and memory
latencies
[2], where a few key
operating
system services
are the target
of the majority
calls [8], and where operating
system functions
are affixed
ing nodes for the sake of locality
[31].
In contrast
well
suited
–Kernel-level
to URPC,
contemporary
for use on shared
memory
communication
and
uniprocessor
structures
are not
multiprocessors:
thread
pessimistic
processor reallocation
policies,
rency within
an application
to reduce
scheduling
[13] underscores
this
Handoff
tion blocks the client
and reallocates
its
ACM
kernel
of all application
to specific process-
TransactIons
management
facilities
rely
on
and are unable to exploit concurthe overhead
of communication.
pessimism:
a single kernel operaprocessor
directly
to the server.
on Computer
Systems,
Vol. 9, No
2, May 1991
184
B. N. Bershad et al
.
Although
handoff
is limited
by the cost of kernel
–In
a traditional
scheduling
operating
shared
performance,
invocation
and processor
system
running
on a multiprocessor,
distributed
over the many
large-scale
does improve
memory
kernel
designed
for
the
improvement
reallocation.
a uniprocessor,
but
kernel resources are logically
centralized,
processors
in the system.
For a mediummultiprocessor
such as the Butterfly
but
to
[71, Alewife
[41, or DASH [251, URPC’S user-level
orientation
to operating
system design
localizes
system resources
to those processors
where the resources
are in
use, relaxing
the performance
bottleneck
that
comes from
relying
on
centralized
kernel
data structures.
This bottleneck
is due to the contention
for logical
resources
(locks) and physical
resources
(memory
and interconnect bandwidth)
that results
when a few data structures,
such as thread
run
queues
and message
By distributing
processors.
functions
to
directly,
the
address
contention
URPC
can
channels,
are shared
by a large
number
of
the communication
and thread
management
also
spaces
(and
hence
processors)
that
use
be appropriate
on uniprocessors
applications
where inexpensive
threads
physical
concurrency
within
a problem.
running
multithreaded
can be used to express
Low overhead
threads
the logical and
and communi-
cation make it possible to overlap
even small amounts
of external
tion. Further,
multithreaded
applications
that are able to benefit
layed
reallocation
tion
protocols
uniprocessor,
long
them
is reduced.
can do so without
[191. Although
having
reallocation
it can be delayed
to develop
will
by scheduling
their
eventually
within
computafrom de-
own communicabe necessary
an address
on a
space for as
as possible.
3.1.1
The
Optimistic
Assumptions
Won’
t Alulays
In cases where
Hold.
the
optimistic
assumptions
do not hold, it is necessary
to invoke
the kernel
to
force a processor reallocation
from one address space to another.
Examples
of
where it is inappropriate
to rely on URPC’S optimistic
processor reallocation
policy
are single-threaded
latency
must be bounded),
initiate
the
and priority
call
is of high
1/0
operation
invocations
applications,
high-latency
early
(where
priority).
since
real-time
applications
1/0 operations
(where
it will
the thread
To handle
these
take
a long
executing
situations,
time
(where
call
it is best to
to complete),
the cross-address
URPC
address space to force a processor
reallocation
to the
there might
still be runnable
threads
in the client’s.
allows
server’s,
space
the client’s
even
though
3.1.2 The Kernel
Handles
Processor
Reallocation.
The kernel implements
the mechanism
that support
processor
reallocation.
When an idle processor
discovers
an underpowered
address
space, it invokes
a procedure
called
Processor.
Donate,
passing
in the identity
of the address space to which the
processor
(on which
the invoking
thread
is running)
should be reallocated.
Donate
transfers
control down through
the kernel
and then up to a
prespecified
address
in the receiving
space. The identity
of the donating
Processor.
address
ACM
space is made
TransactIons
known
on Computer
to the receiver
Systems,
Vol
9, No
2, May
by the kernel.
1991
User-Level Interprocess
3.1.3
Voluntary
interface
URPC,
Return
defines
as with
Communication
of
a contract
for Shared Memory Multiprocessors
Processors
Cannot
between
traditional
RPC
a client
systems,
Be
and
in the
185
A service
Guaranteed.
a server.
implicit
.
In
the
contract
case of
is that
the
server obey the policies
that determine
when a processor
is to be returned
back to the client.
The URPC communication
library
implements
the following policy
in the server:
upon receipt
of a processor
from a client
address
space, return
the processor
when all
have generated
replies,
or when the
become
“underpowered”
(there
outstanding
messages from
server determines
that the
are outstanding
messages
and one of the server’s processors
is idle).
Although
URPC’S runtime
libraries
implement
no way
to enforce
that
kernel
transfers
control
guarantee
returning
protocol.
of
Just
the
as with
processor
back
a specific
kernel-based
to
an
to the client,
protocol,
there
systems,
application,
there
a fair
available
deals
processing
power.
of load balancing
URPC’s
between
direct
reallocation
applications
that
clients,
is
once the
is
that
the application
will
voluntarily
return
the processor
from the procedure
that the client
invoked.
The server could,
example,
use the processor to handle requests from other
this was not what the client had intended.
It is necessary
to ensure
that
applications
receive
problem
the client
client
has
no
by
for
even though
share
only
of the
with
are communicating
the
with
one another.
Independent
of communication,
a multiprogrammed
system
requires
policies
and mechanisms
that balance the load between
noncommunicating
(or noncooperative)
address spaces. Applications,
for example,
must
not be able to starve
able
to delay
clients
one another
out for processors,
indefinitely
by
not
returning
and servers
processors.
policies,
which
forcibly
reallocate
processors
from one address
other, are therefore
necessary
to ensure that applications
make
A processor
reallocation
policy
should
be work
conserving,
high-priority
thread waits for a processor while a lower priority
and,
by implication,
that
anywhere
in the system,
specifics of how to enforce
no processor
address
be easily
3.2
when
there
is work
not be
Preemptive
space to anprogress.
in that
no
thread runs,
for
it to do
even if the work is in another
address space. The
this constraint
in a system with user-level
threads
are beyond the scope of this paper,
The direct processor
reallocation
optimization
space of a
ing to that
standpoint
reason for
processor
idles
must
but are discussed
done in URPC
by Anderson
et al. in [61.
can be thought
of as an
of a work-conserving
policy.
A processor
idling
in the address
URPC client can determine
which address spaces are not respondclient’s
calls, and therefore
which
address spaces are, from the
of the client, the most eligible
to receive a processor.
There is no
the client to first voluntarily
relinquish
the processor to a global
allocator,
only to have the allocator
then
determine
to which
space the processor
made
by the client
should
be reallocated.
This
is a decision
that
can
itself.
Data Transfer Using Shared Memory
The requirements
of a communication
system can be relaxed
and its efficiency improved
when it is layered beneath
a procedure
call interface
rather
than exposed to programmers
directly.
In particular,
arguments
that are part
ACM
TransactIons
on Computer
Systems,
Vol. 9, No
2, May 1991.
186
B. N. Bershad et al
.
of a cross-address
space procedure
call
while still guaranteeing
safe interaction
can be passed using shared
between mutually
suspicious
memory
subsys-
tems.
Shared memory
message
of client-server
interactions.
still overload
one another,
channels
do not increase the “abusability
factor”
As with traditional
RPC, clients and servers can
deny service,
provide
bogus results,
and violate
communication
protocols
(e. g., fail
nel data structures).
And, as with
to release
traditional
protocols
abuses
to ensure
that
lower
level
a well-defined
manner
(e. g., by raising
down the channel).
In URPC,
the safety of communication
responsibility
of the stubs.
channel
RPC,
filter
locks, or corrupt
it is up to higher
up to the application
exception
a call-faded
based
The arguments
on shared
of a URPC!
layer
memory
are passed
is implemented
one address
space to another.
application
programmers
deal
necessary
nor sufficient
by having
Such
directly
form of messages. But, when standard
runtime
as is the case with
URPC,
having
the kernel
spaces is neither
is the
in message
phase
server.
unmarshal
the message’s
data into
copying
and checking
is necessary
to
ensure the application’s
safety.
Traditionally,
safe communication
copy data from
necessary
when
in
or by closing
buffers
that are allocated
and pair-wise
mapped
during
the binding
that precedes the first cross-address
space call between
a client and
On receipt
of a message,
the stubs
procedure
parameters
doing whatever
chanlevel
the
kernel
an implementation
is
with raw data in the
facilities
and stubs are used,
copy data between
address
to guarantee
safety.
Copying
in the kernel
is not necessary
because programming
languages
are implemented
to pass parameters
between
procedures
on the stack, the
heap, or in registers.
When data is passed between
address spaces, none of
these
storage
areas
can,
in general,
be used
directly
by both
the
client
and
the server.
Therefore,
the stubs must
copy the data into and out of the
memory
used to move arguments
between
address
spaces. Safety
is not
increased
by first doing an extra kernel-level
copy of the data.
Copying
in the kernel
is not sufficient
for ensuring
safety when using
type-safe
languages
such as Modula2
+ or Ada [1] since each actual parameter must be checked by the stubs for conformity
with the type of its corresponding
formal.
Without
such checking,
for example,
a client
could crash
the server by passing in an illegal
(e. g., out of range) value for a parameter.
These
points
motivate
the use of pair-wise
shared
memory
for cross-address
space communication.
Pair-wise
shared memory
can be used to transfer
between
address spaces more efficiently,
but just as safely, as messages
are copied by the kernel between
address spaces.
3.2.1
Controlling
different
address
test-and-set
locks
nitely
on message
is simply
if the
spin-wait.
The
ACM
TransactIons
data
that
Channel
Access.
Data flows between
URPC packages in
spaces over a bidirectional
shared
memory
queue with
on either
end. To prevent
processors
from waiting
indefichannels,
the locks are nonspinning;
i.e., the lock protocol
lock
is free,
rationale
on Computer
here
acquire
is that
Systems,
Vol
it,
or
else go on
the receiver
9, No
2, May
1991
to something
of a message
else– never
should
never
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
187
.
have to delay while the sender holds the lock. If the sender stands to benefit
from the receipt of a message (as on call), then it is up to the sender to ensure
that locks are not held indefinitely;
if the receiver
stands to benefit
(as on
return),
realized
then the sender could just as easily prevent
the benefits
by not sending the message in the first place.
Cross-Address
3.3
Space Procedure
The calling
semantics
normal
procedure
call,
performing
software
address
a server
starts
from
space procedure
call, like those
with
respect to the thread
that
the call. Consequently,
there
system
that
manages
threads
the
server.
thread.
When
the
The
server
communication
thread
client
client
thread
finishes,
—High
and thread
performance
grained
parallel
management.
thread
it
stops,
sends
waiting
on a receive
a response
back
to the
we argue
facilities
are
the following:
necessary
for
fine-
programs.
— While thread management
facilities
at the user level, high-performance
be provided
(send
Finally,
the
to be started.
for the next message.
between
cross-address
space
In brief,
management
function
management
synchronization
sends
a message to the server,
client,
which
causes the client’s
waiting
thread
stops again
on a receive,
waiting
server’s thread
In this subsection
we explore the relationship
communication
of
is
is a strong interaction
between
the
and the one that
manages
cross-
Each underlying
and receive) involves
a corresponding
function
(start and stop). On call, the
which
being
Call and Thread Management
of cross-address
are synchronous
space communication.
from
at the user
can be provided
either in the kernel or
thread
management
facilities
can only
level.
—The close interaction
between
communication
and thread management
can
be exploited
to achieve extremely
good performance
for both, when both are
implemented
3.3.1
threads
Concurrent
within
activities.
parallel
together
at the user
Programming
an
address
Concurrency
programs
space
needs
and
Thread
simplify
the
can be entirely
for
access to a thread
or
it
to the
can
some amount
address space.
management
system
that
continuum
major
middleweight,
are three
and lightweight,
for which
on the order of thousands,
hundreds,
Kernels
supporting
heavyweight
points
thread
application,
as with
as with
an
of its own computation
with
In either
case, the program-
of control.
can be described
there
of concurrent
be external,
create, schedule,
and destroy threads
Support
for thread
management
on which
Multiple
Management.
management
internal
multiprocessors,
application
that needs to overlap
activity
on its behalf in another
mer
level.
makes
in
it possible
terms
of reference:
management
of
to
a cost
heavyweight,
overheads
are
and tens of psecs, respectively.
threads
[26, 32, 35] make no distinction
between
a thread,
the dynamic
component
of a program,
and its address
space, the static component.
Threads
and address spaces are created,
scheduled, and destroyed
as single entities.
Because an address space has a large
amount
of baggage,
such as open file descriptors,
page tables, and accounting
ACM
Transactions
on Computer
Systems,
Vol
9, No
2, May 1991
188
B. N Bershad et al.
.
state that
must
be manipulated
during
thread
management
operations,
operations
on heavyweight
threads
are costly, taking
tens of milliseconds.
Many contemporary
operating
system kernels
provide
support
for middleweight,
address
threads
[12, 361. In contrast
to heavyweight
middleweight
threads
are decoupled,
so that
or kernel-level
spaces and
address space can have many middleweight
threads.
As with
threads,
though,
the kernel
is responsible
for implementing
management
functions,
and directly
schedules
an application’s
the
available
defines
physical
a general
processors.
programming
With
middleweight
interface
for
heavyweight
the thread
threads
on
threads,
use by all
threads,
a single
the
concurrent
kernel
applica-
tions.
Unfortunately,
the
grained
parallel
slicing,
or
between
saving
threads
applications,
cost of this
applications.
and
restoring
contexts
traps
are required
kernel
must treat
affects
example,
floating
can be sources
even those for which
thread
management
malfeasant
programs.
generality
For
the
simple
point
performance
features
registers
of performance
the features
overhead
are unnecessary.
of kernel-level
for all parallel
weight
for
all
A kernel-level
thread
management
operation,
and the
suspiciously,
checking
arguments
and
from
the effect
of thread
to the high
cost of
but more pervasive
management
on program
policy
There are a large number
of parallel
programming
models, and
a wide variety
of scheduling
disciplines
that are most appropri-
ate (for performance)
to a given
strongly
influenced
by the choice
rithms
used to implement
threads,
In response
to lightweight
switching
system must also protect
itself from malfunctioning
or
Expensive
(relative
to procedure
call) protected
kernel
to invoke
any
each invocation
stemming
when
as time
degradation
access privileges
to ensure legitimacy.
In addition
to the direct
overheads
that
contribute
kernel-level
thread
management,
there
is an indirect
performance.
within
these,
of fine-
such
thread
is unlikely
programs.
model
[25, 33]. Performance,
though,
is
of interfaces,
data structures,
and algoso a single model represented
by one style
to have
an implementation
to the costs of kernel-level
threads,
threads
that execute in the context
threads
provided
by the kernel,
but
that
is efficient
programmers
have turned
of middleweight
or heavy-
are managed
at the user level
by
a library
linked
in with each application
[9, 38]. Lightweight
thread managetwo-level
scheduling,
since the application
library
schedules
ment implies
lightweight
threads
on top of weightier
threads,
which are themselves
being
scheduled
by the kernel.
Two-level
schedulers
attack
both the direct
and indirect
costs of kernel
threads.
Directly,
it is possible
to implement
efficient
user-level
thread
management
functions
because they are accessible
through
a simple procedure call rather
than a trap, need not be bulletproofed
against
application
errors, and can be customized
to provide
only the level of support needed by a
given application.
The cost of thread
management
operations
in a user-level
scheduler
3.3.2
for two
ACM
can be orders
The
Problem
reasons:
fiansactlons
of magnitude
with
Having
(1) to synchronize
on Computer
Systems,
Vol
less than
Functions
their
9, No
for kernel-level
at
activities
2, May 1991
Two
Levels.
within
threads.
Threads
an address
block
space
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
(e.g., while controlling
access to critical
sections],
events in other address spaces (e. g., while reading
manager).
pensive
kernel,
If threads
are implemented
at the
and flexible
concurrency),
but
then the performance
benefits
switching
are lost when
scheduler
accurately
reflects
In contrast,
when
efficient
user-level
(necessary
in separate
address
of the application,
awaiting
and thread
and
management
scheduling
and then
kernel-mediated
com-
are moved
together
at user level, the
system
can control
threads
synchronization
for inex-
is implemented
in the
scheduling
and context
activities
state
applications
notified.
communication
of the kernel
and implemented
needed by the communication
level
operation
must be implemented
and
the application,
so that the user-level
the scheduling
again (2) in the kernel,
so that
munication
activity
are properly
user
between
189
and (2) to wait for external
keystrokes
from a window
communication
of inexpensive
synchronizing
spaces. In effect, each synchronizing
executed
at two levels:
(1) once in
.
out
synchronization
with
the same
primitives
that
are used
by applications.
3.4 Summary of Rationale
This
section
has
described
the
design
advantages
of a user-level
approach
made without
invoking
the kernel
processors
between
processor reallocation
over multiple
calls.
grated
with
spaces.
do turn
Finally,
out to be necessary,
their
user-level
communication
thread
programs,
as within,
4. THE PERFORMANCE
Further,
as
measurements
megabytes
4.1
have
kernel
processor
described
many
as
dedicated
main
invocation
and
cost can be amortized
can be cleanly
intenecessary
can inexpensively
in
this
of URPC
by DEC’s
six
to
C-VAX
1/0.
paper
for
fine-
synchronize
The
had
on the Firefly,
an experiSystems Research Center.
microprocessors,
Firefly
four
used
C-VAX
to
plus
one
collect
the
processors
and
32
of memory.
User-Level Thread Management
Performance
In this section we compare the performance
of thread management
when implemented
at the user level and in the kernel.
The
threads
are those provided
by Taos, the Firefly’s
native
operating
Table
mented
invokes
The
calls can be
reallocating
OF URPC
A
can
URPC.
facilities,
programs
spaces.
the performance
workstation
built
Firefly
when
management
so that
address
This section describes
mental
multiprocessor
MicroVAX-11
behind
address
user-level
grained
parallel
between,
as well
rationale
to communication
are that
and without
unnecessarily
primitives
kernel-level
system.
I shows the cost of several
thread
management
operations
impleat the user level
and at the kernel
level.
PROCEDURE
CALL
the null procedure
and provides
a baseline
value for the machine’s
performance.
FORK creates, starts, and terminates
a thread.
FORK;JOIN
is
like FORK, except that the thread which creates the new thread blocks until
the forked
thread
completes.
YIELD
forces a thread
to make a full trip
through
the scheduling
machinery;
when a thread
yields the processor,
it is
blocked (state saved), made ready (enqueued
on the ready queue), scheduled
ACM
Transactions
on Computer
Systems,
Vol
9, No. 2, May 1991
190
.
B. N Bershad et al
Table
1.
Comparative
Performance
of Thread
URPC
Test
II
Component
(ysecs)
7
43
102
37
27
53
7
1192
1574
57
27
271
(dequeued),
Server
(ysecs)
18
13
poll
6
6
receme
10
20
25
Total
54
53
unblocked
and forth on a condition
means of paired
signal
Each
(state
measured
in
the
was
elapsed
table.
processor.
variable,
and wait
and unblocks
test
The
run
restored),
was divided
using
the
continued.
ACQUIRE;
in succession.
number
FastThreads
tests
and
a blocking
(nonspinning)
lock for
has two threads
“pingpong”
back
blocking
and unblocking
one another
by
operations.
Each pingpong
cycle blocks,
two threads
a large
time
The
9
dispatch
RELEASE
acquires
and then
releases
which there is no contention.
PING-PONG
schedules,
of a URPC
Client
(~secs)
send
again
Taos Threads
FastThreads
Breakdown
Component
Operations
(psecs)
PROCEDURE
CALL
FORK
FORK;JOIN
YIELD
ACQUIRE, RELEASE
PINGPONG
Table
Management
of times
as part
by the loop limit
tests
were
kernel-level
of a loop,
to determine
constrained
to run
threads
no such
had
and
the
the values
on a single
constraint;
Taos caches recently
executed
kernel-level
threads
on idle processors
to
reduce wakeup
latency.
Because
the tests were run on an otherwise
idle
machine,
the caching optimization
worked well and we saw little
variance
in
the measurements.
The table
demonstrates
the performance
advantage
of
implementing
thread
management
at the user-level,
where
the operations
have an overhead
hundred,
as when
4.2
on the
threads
URPC Component
order of a few procedure
calls,
are provided
by the kernel.
rather
than
a few
Breakdown
In the case when an address space has enough processing
power to handle all
incoming
messages without
requiring
a processor reallocation,
the work done
send (enqueuing
a
during
a URPC can be broken down into four components:
poll
(detecting
the channel
on which
a
message on an outgoing
channel);
receive
(dequeueing
the message);
and dispatch
(dismessage has arrived);
patching
the appropriate
thread
to handle
an incoming
message).
Table II
shows the time taken by each component.
The total
processing
times
are almost
the same for the client
and the
server. Nearly
half of that time goes towards
thread management
(dispatch),
ACM TransactIons
on Computer
Systems,
Vol
9, No
2, May 1991
User-Level Interprocess
demonstrating
the
munication
The
influence
for Shared Memory Mulhprocessors
that
thread
management
.
overhead
has
191
on
com-
performance.
send
and
marshaling
receive
that
spaces.
On
buffers
adds
passed
in
average
Table
Communication
is
the
Firefly,
about
most
do
cross-address
and
not
the
cost
of
psec
space
is
reflect
cost
are
data
of
data
passed
passed
of latency.
in
calls
the
influenced
by
is
and
address
shared
the
memory
amount
small
the
copying
between
Because
procedure
most
of data
the
parameters
word
additional
performance
not
when
each
one
call
II,
times
incurred
[8,
of data
16,
components
17,
24],
shown
in
transfer.
4.3 Call Latency and Throughput
The
figures
there
in
Table
is no need
latency
and
call
throughput
server
II
for
depend
space,
on
for
are
the
S,
and
of client
reallocation.
load
number
the
Two
and
dependent,
of
client
number
of
server
other
load,
important
though.
Both
processors,
runnable
C,
call
latency
and
the
threads
provided
metrics,
number
in
the
of
client’s
T.
To evaluate
values
independent
throughput,
processors,
address
are
processor
T,
latency
and
C,
S
and
throughput,
in
we ran
which
threads to make 100,000 “Null”
loop. The Null
procedure
call
we
timed
a series
how
of tests
long
procedure
calls into
takes no arguments,
it
using
took
different
the
the server
computes
client’s
T
inside a tight
nothing,
and
returns
no results;
it exposes the performance
characteristics
of a round-trip
cross-address
space control
transfer.
Figures
3 and 4 graph call latency and throughput
as a function
of T. Each
line on the graphs represents
a different
combination
of S and C. Latency
is
measured
as the time from when a thread
calls into the Null stub to when
control
returns
from the stub, and depends
on the speed of the message
primitives
and the length of the thread ready queues in the client and server.
When
(T>
the
number
processor
of caller
call
C + S),
latency
in the client
exceeds
since
or the server
to T. Except
is less sensitive
call processing
threads
increases
times
total
call
or both.
for the special
in the client
the
each
number
must
When
the
number
of calling
on the other
case where
S = 0, as long
and server
message
wasted
transfer.
processing
threads,
are roughly
client
equal
T >
hand,
as the
(see Table
II),
At that
no further
C + S.
can
processors,
a free
and
be
server
proces-
latency
is 93 ~secs. This is “pure”
latency,
in
basic processing
required
to do a round-trip
1 Unfortunately,
time
for
Throughput,
until
then throughput,
as a function
of T, improves
point,
processors
are kept
completely
busy,
so there
T.
improvement
in throughput
by increasing
sors is 1 (T = C = S = 1), call
the sense that it reflects
the
of processors
wait
the
due to idling
latency
in the
includes
client
and
a large
server
amount
while
of
waiting
1 The 14 psec discrepancy between the obseved latency and the expected latency (54 + 53, cf.,
Table II) is due to a low-level scheduling optimization
that allows the caller thread’s context to
remain loaded on the processor if the ready queue is empty and there is no other work to be done
anywhere in the system. In this case, the idle loop executes in the context of the caller thread,
enabling a fast wakeup when the reply message arrives. At 2’ = C = S, the optimization
is
enabled. The optimization
is also responsible for the small “spike”
in the throughput
curves
shown in Figure 4.
ACM
Transactions
on Computer
Systems,
Vol. 9, No
2, May 1991.
192
.
B. N. Bershad et al.
1500–
C=l,
s=o
C=l,
S=2
s=l
14001300120011001000 –
900Call
800-
Latency
700-
(psecs)
/c=l,
600500400 -
/<,..””
~=*~5~2s==2
‘-- ‘
/’
300 200 100 -
k’~”;”;
‘
“
,---”-’”
o~
12
8
Number
Fig
for the next
call
or reply
of Client
3
Threads
9
10
{T)
URPC call latency
message.
At
T =
2, C = 1, S = 1, latency
increases
by only 20 percent
(to 112 Wsecs), but throughput
increases
by nearly
75
percent
(from 10,300 to 17,850 calls per second). This increase
reflects
the
fact that message processing
can be done in parallel
between
the client and
server.
Operating
as a two-stage
pipeline,
the client
and server
complete
call every 53 psecs.
For larger
values of S and C, the same analysis
holds. As long
sufficiently
large to keep the processors
in the client
and server
spaces busy, throughput
Note that throughput
is maximized.
for S = C = 2 is not twice
as large
one
as T is
address
as for S = C = 1
(23,800 vs. 17,850 calls per second, a 33 percent
improvement).
Contention
for the critical
sections that manage access to the thread and message queues
in each address space limits
throughput
over a single channel.
Of the 148
instructions
required
to do a round-trip
message transfer,
approximately
half
are part of critical
sections
guarded
by two separate
locks kept on a perchannel
basis. With four processors,
the critical
sections are therefore
nearly
always
occupied,
so there is slowdown
due to queueing
at the critical
sections. This factor does not constrain
the aggregate
multiple
clients and servers since they use different
When
reallocate
round-trip
sor
ACM
S = O throughput
and latency
are at their
processors frequently
between
the client
latency
is 375
reallocations.
At this
Transactions
on Computer
psecs.
point,
Systems,
Every
URPC
Vol
call
requires
performs
9, No
worse
2, May 1991
rate of URPCS
channels.
between
worst due to the need to
and server. When T = 1,
two
traps
than
and
LRPC
two
proces-
(157
psecs).
User-Level Interprocess
Communication
for Shared Memory Multiprocessors
.
193
25000
---1
—C=2’S’2
A — — —
/
/
20000 4
/–––__–___c=2,
~<:.— ——.— ——————
~
-’-----------------------
15000
s=l
—_ —c=l
S=l
~=1:’=2
///
Calls
$
f
per
Second
10000
1
I
C=l,
s=o
5000
I
o~
2345678
1
Number
Fig. 4.
9
of Client
Threads
(T)
URPC call throughput
We consider
this comparison
further
in Section
5.5.
mance is worse, URPC’S thread management
primitives
performance
overheads
proves
be amortized
steadily,
and
has been absorbed,
4.4
I). As T increases,
(see Table
can
Performance
over
latency,
the
once the
worsens
10
only
though,
the trap
outstanding
initial
because
Although
call perforretain their superior
calls.
shock
and reallocation
Throughput
of processor
of scheduling
im-
reallocation
delays.
Summary
We have
described
the performance
of the communication
and thread
management
primitives
for an implementation
of URPC
on the Firefly
multiprocessor.
The performance
figures demonstrate
the advantages
of moving
traditional
operating
them at the user level.
Our approach
stands
ogy.
Normally,
system
in contrast
operating
system
order to improve
performance
we are pushing
functionality
and
functions
out
to traditional
functionality
of kernel
and
operating
implementing
system
is pushed
into
methodol-
the
kernel
in
at the cost of reduced flexibility.
With URPC,
out of the kernel
to improve
both performance
flexibility.
5. WORK RELATED TO URPC
In this
of
section
URPC.
The
we examine
common
several
theme
other
communication
underlying
ACM Transactions
these
on Computer
systems
other
Systems,
in the light
systems
is
that
Vol. 9, No, 2, May 1991
194
.
B N Bershad et al
they
reduce
the
kernel’s
role
in
communication
in
an
effort
to
improve
performance.
5.1 Sylvan
Sylvan
[141 relies
on architectural
message-passing
primitives.
Special
support
to implement
coprocessor
instructions
the Thoth
[151
are used to ac-
cess these primitives,
so that messages can be passed between
Thoth tasks
without
kernel
intervention.
A pair of 68020 processors,
using a coprocessor
to implement
the communication
primitives,
can execute a send-receiue-reply
cycle on a zero-byte
message (i. e., the Null call) in 48 ~secs.
Although
ing
the
Sylvan
outperforms
URPC,
use of special-purpose
the
hardware.
enqueueing
and dequeueing
messages
and
when a thread becomes runnable
or blocked.
handles
thread
scheduling
and context
improvement
Sylvan’s
is small
coprocessor
updating
Software
switching.
consider-
takes
care
of
thread
control
blocks
on the main processor
As Table
II shows,
though,
thread
management
can be responsible
for a large portion
of the round-trip
processing
time.
Consequently,
a special-purpose
coprocessor
can offer only
limited
performance
gains.
5.2
Yackos
Yackos
[22] is similar
communication
Messages
and
to URPC
in that
is intended
are passed
between
for
it strives
to bypass
use on shared
address
memory
spaces by means
the kernel
during
multiprocessors.
of a special
message-
passing process
is no automatic
that shares memory
with all communicating
processes. There
load balancing
among the processors cooperating
to provide
a
Yackos
Address
service.
spaces are single
threaded,
and clients
communicate
with a process running
on a single processor.
That process is responsible
for
forwarding
the message to a less busy process if necessary.
In URPC, address
spaces are multithreaded,
so a message
sent from
one address
space to
another
can be fielded
by any processor
allocated
to the receiving
address
space.
Further,
URPC
on
a multiprocessor
in
the
best
case
(no
kernel
involvement)
outperforms
Yackos by a factor of two for a round-trip
call (93
VS. 200 psecs).2 The difference
in communication
performance
is due mostly
to the indirection
through
the interposing
Yackos message-passing
process.
This
indirection
results
in increased
latency
and data copying.
By taking
advantage
in an RPC relationship,
URPC
avoids
due to message
queueing,
polling,
of the client/server
pairing
implicit
the indirection
through
the use of
pair-wise
shared message queues. In the pessimistic
case (synchronous
cessor allocation
on every message send), URPC is over thirteen
times
efficient
(375 vs. 5000 psecs) than Yackos.
5.3 Promises
Argus [28] provides the programmer with a mechanism called a Promise that can be used to make an asynchronous cross-address space procedure call.
² Yackos runs on the Sequent Symmetry [27], which uses 80386 processors that are somewhat faster than the C-VAX processors used in the Firefly.
The caller thread continues to run, possibly in parallel with the callee, until the time that the call's results are needed. If a thread attempts to "collect" a Promise that has not yet been fulfilled (i.e., the cross-address space call has not yet completed), the collecting thread blocks. A Promise is used to compensate for the high cost of thread management in the Argus system.
Although an asynchronous call can be simulated by forking a thread and having the forked thread do the cross-address space call synchronously, the cost of thread creation in Argus [29] (around 20 procedure call times) precludes this approach. In URPC, the cost of thread creation is on the order of six procedure call times, and so the issue of whether to provide explicit language support for asynchronous cross-address space calls becomes one largely of preference and style, rather than performance.
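The fork-and-collect pattern is easy to express directly once thread creation is cheap. The sketch below, written against POSIX threads rather than URPC's own thread package, shows an asynchronous call simulated by a forked thread that performs the call synchronously; promise_t, promise_fork, and remote_call are illustrative names, not part of Argus or URPC.

    #include <pthread.h>
    #include <stdio.h>

    /* A hand-rolled "promise": a forked worker runs the call; collect joins. */
    typedef struct {
        pthread_t worker;
        int arg;
        int result;
    } promise_t;

    /* Stand-in for a synchronous cross-address space call. */
    static int remote_call(int arg) { return arg * 2; }

    static void *promise_body(void *p)
    {
        promise_t *pr = p;
        pr->result = remote_call(pr->arg);  /* forked thread calls synchronously */
        return NULL;
    }

    /* "Fork": start the call; the caller continues, possibly in parallel
     * with the callee. */
    static void promise_fork(promise_t *pr, int arg)
    {
        pr->arg = arg;
        pthread_create(&pr->worker, NULL, promise_body, pr);
    }

    /* "Collect": blocks only if the call has not yet completed. */
    static int promise_collect(promise_t *pr)
    {
        pthread_join(pr->worker, NULL);
        return pr->result;
    }

    int main(void)
    {
        promise_t pr;
        promise_fork(&pr, 21);
        /* ... the caller overlaps other work here ... */
        printf("collected %d\n", promise_collect(&pr));
        return 0;
    }

When thread creation costs only a few procedure call times, as in URPC, this pattern is cheap enough that a dedicated Promise construct buys convenience rather than performance.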
5.4 Camelot
Camelot [19] is a high-performance distributed transaction processing system. Key to the system's performance is its use of recoverable virtual memory and write-ahead logging. Because of the cost of Mach's communication primitives [18], data servers use shared memory queues to communicate virtual memory control information and log records to the disk manager on each machine. When the queues become full, data servers use Mach's kernel-level RPC system to force a processor reallocation to the disk manager so that it can process the pending messages in the queue.
URPC generalizes the ad hoc message-passing approach used by Camelot. Camelot's shared memory queues can only be used between the data servers and the disk manager. URPC's channels are implemented beneath the stub layer, allowing any client and any server to communicate through a standard RPC interface. Camelot's thread management primitives are not integrated with those that access the shared memory queues. Finally, Camelot's shared memory queues support only unidirectional communication, whereas communication in URPC is bidirectional, supporting request-reply interaction.

At the lowest level, URPC is simply a protocol by which threads in separate address spaces access shared memory in a disciplined fashion. Programmers have often used shared memory for high-bandwidth cross-address space communication, but have had to implement special-purpose interfaces. URPC formalizes this interface by presenting it in the framework of a standard programming abstraction based on procedure call and inexpensive threads.
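As an illustration of what "beneath the stub layer" means, the following sketch shows a client-side stub that hides a pair of the shared queues from the earlier sketch behind an ordinary procedure signature. The message layouts and the busy-wait reply loop are our simplifications; a real URPC stub would block the calling user-level thread rather than spin.

    /* Illustrative request and reply formats for one procedure. */
    struct add_args  { int a, b; };
    struct add_reply { int sum; };

    /* Client-side stub: the caller sees a procedure call; the bidirectional
     * channel (one request queue, one reply queue) is invisible above this
     * layer. queue_send/queue_recv are the pair-wise queues sketched above. */
    static int add_stub(struct urpc_queue *request_q, struct urpc_queue *reply_q,
                        int a, int b)
    {
        struct add_args args = { a, b };
        while (!queue_send(request_q, &args, sizeof args))
            ;                     /* queue full: poll (a real stub would yield) */

        struct add_reply reply;
        while (!queue_recv(reply_q, &reply, sizeof reply))
            ;                     /* poll the reply queue for the server's answer */
        return reply.sum;
    }

Because the queues sit beneath the stubs, any client and any server that share a channel get this interface for free, which is exactly the generalization over Camelot's special-purpose queues that the text describes.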
5.5 LRPC
LRPC demonstrates that it is possible to communicate between address spaces by way of the kernel at "hardware speed." Like URPC, LRPC passes parameters in shared memory that is pair-wise mapped between the client and server. Unlike URPC, though, LRPC uses kernel-level threads that pass between address spaces on each call and return. LRPC greatly reduces the amount of thread management that must be done on each call, but is still performance limited by kernel mediation.
In the worst case, when there are no processors allocated to the client, and there is one client thread, URPC takes 375 μsecs for the Null call. On the same hardware, LRPC takes 157 μsecs. For both URPC and LRPC, each call involves two kernel invocations and two processor reallocations (one to the server, and one back to the client). There are two reasons why LRPC outperforms URPC in this case; the first is an artifact of URPC's implementation, while the second is inherent in URPC's approach:
—Processor reallocation in URPC is implemented using LRPC, which is a general cross-address space procedure call facility for kernel-level threads. We used LRPC because it was available and integrated into the Firefly operating system. The overhead of LRPC-based processor reallocation is 180 μsecs, but we estimate that a special-purpose processor reallocation mechanism would be about 30 percent faster.

—URPC is integrated with two-level scheduling. Processor reallocation occurs after two distinct scheduling decisions are made: (1) Is there an idle processor, and (2) is there an underpowered address space to which it can be reallocated? In contrast, a call using LRPC makes no scheduling decisions (one of the main reasons for its speed). Making these two decisions is more expensive than not making them, but is necessary to take advantage of the inexpensive synchronization and scheduling functions made possible by user-level thread management. The overhead of these two scheduling decisions is about 100 μsecs per round-trip call (only in the case where processor reallocation is required, of course).

It should not be surprising that indirecting through a second-level scheduler, which manages threads, increases the cost of accessing the scheduler at the first level, which manages processors. The trick is to interact with the first-level scheduler infrequently. When the first-level scheduler must be used, the overhead of having to pass through the second-level mechanism will inevitably degrade performance relative to a system with only a single level of scheduling.
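The two decisions can be made concrete with a short sketch. Under the assumption of a per-address-space count of runnable threads and allocated processors (our bookkeeping, not URPC's actual data structures), the reallocation test looks roughly like this:

    #include <stdbool.h>
    #include <stddef.h>

    struct addr_space {
        int runnable_threads;   /* threads ready to run in this address space */
        int processors;         /* processors currently allocated to it       */
    };

    /* Decision 2: an address space is underpowered when it has more ready
     * work than processors to run it on. */
    static bool underpowered(const struct addr_space *as)
    {
        return as->runnable_threads > as->processors;
    }

    /* Returns the address space an idle processor should be donated to, or
     * NULL if no reallocation (and hence no kernel invocation) is needed. */
    static struct addr_space *
    reallocation_target(struct addr_space *spaces[], size_t n,
                        bool have_idle_processor)
    {
        if (!have_idle_processor)       /* decision 1: is a processor idle?      */
            return NULL;
        for (size_t i = 0; i < n; i++)  /* decision 2: is some space underpowered? */
            if (underpowered(spaces[i]))
                return spaces[i];
        return NULL;                    /* nobody needs it: stay out of the kernel */
    }

Only when both decisions come back positive does URPC pay for a kernel-mediated reallocation; otherwise the call completes entirely at user level.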
6. CONCLUSIONS
This paper has described the motivation, design, implementation, and performance of URPC, a new approach that addresses the problems of kernel-based communication by moving traditional operating system functionality out of the kernel and up to the user level.
We believe that URPC represents the appropriate division of responsibility for the operating system kernels of shared memory multiprocessors. While it is a straightforward task to port a uniprocessor operating system to a multiprocessor by adding kernel support for threads and synchronization, it is another thing entirely to design facilities for a multiprocessor that will enable programmers to fully exploit the processing power of the machine. URPC demonstrates that one way in which a multiprocessor's performance potential can be greatly increased is by designing system facilities for a multiprocessor in the first place, thereby making a distinction between a multiprocessor operating system and a uniprocessor operating system that happens to run on a multiprocessor.
REFERENCES
1. UNITED STATES DEPARTMENT OF DEFENSE. Reference Manual for the Ada Programming Language. July 1980.
2. AGARWAL, A. Performance tradeoffs in multithreaded processors. Tech. Rep. MIT VLSI Memo 89-566, Laboratory for Computer Science, Massachusetts Institute of Technology, Sept. 1989.
3. AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. 6, 4 (Nov. 1988), 393-431.
4. AGARWAL, A., LIM, B.-H., KRANZ, D., AND KUBIATOWICZ, J. APRIL: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 104-114.
5. ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared memory multiprocessors. IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644.
6. ANDERSON, T. E., BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. Efficient user-level thread management for shared memory multiprocessors. Tech. Rep. 90-04-02, Dept. of Computer Science and Engineering, Univ. of Washington, April 1990. To appear in Proceedings of the 13th ACM Symposium on Operating Systems Principles.
7. BBN LABORATORIES. Butterfly Parallel Processor Overview. BBN Labs., Cambridge, Mass., 1985.
8. BERSHAD, B. N. High performance cross-address space communication. Ph.D. dissertation, Tech. Rep. 90-06-02, Dept. of Computer Science and Engineering, Univ. of Washington, June 1990.
9. BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. PRESTO: A system for object-oriented parallel programming. Softw. Pract. Exper. 18, 8 (Aug. 1988), 713-732.
10. BERSHAD, B. N., ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. Lightweight remote procedure call. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 37-55. Also appeared in Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989).
11. BIRRELL, A. D. An introduction to programming with threads. Tech. Rep. 35, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1989.
12. BIRRELL, A. D. AND NELSON, B. J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.
13. BLACK, D. L. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer 23, 5 (May 1990), 35-43.
14. BURKOWSKI, F. J., CORMACK, G., AND DUECK, G. Architectural support for synchronous task communication. In Proceedings of the 3rd Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 40-53.
15. CHERITON, D. R. Thoth: A portable real-time operating system. Commun. ACM 22, 2 (Feb. 1979), 105-115.
16. CHERITON, D. R. The V distributed system. Commun. ACM 31, 3 (Mar. 1988), 314-333.
17. COOK, D. The evaluation of a protection system. Ph.D. dissertation, Computer Lab., Cambridge Univ., April 1978.
18. DUCHAMP, D. Analysis of transaction management performance. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 177-190.
19. EPPINGER, J. L. Virtual memory management for transaction processing systems. Ph.D. dissertation, Dept. of Computer Science, Carnegie Mellon Univ., Feb. 1989.
20. GUPTA, A., TUCKER, A., AND URUSHIBARA, S. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1991).
21. HALSTEAD, R. Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4 (Oct. 1985), 501-538.
22. HENSGEN, D., AND FINKEL, R. Dynamic server squads in Yackos. Tech. Rep. 138-89, Dept. of Computer Science, Univ. of Kentucky, 1989.
23. JONES, M. B. AND RASHID, R. F. Mach and Matchmaker: Kernel and language support for object-oriented distributed systems. In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '86) (Sept. 1986), 67-77.
24. KARGER, P. A. Using registers to optimize cross-domain call performance. In Proceedings of the 3rd Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 194-204.
25. LENOSKI, D., LAUDON, J., GHARACHORLOO, K., GUPTA, A., AND HENNESSY, J. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 148-159.
26. LISKOV, B. AND SHRIRA, L. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (June 1988), 260-267.
27. LEVY, H. M. AND ECKHOUSE, R. H. Computer Programming and Architecture: The VAX-11. 2nd ed. Digital Press, Bedford, Mass., 1989.
28. LISKOV, B. Distributed programming in Argus. Commun. ACM 31, 3 (Mar. 1988), 300-312.
29. LISKOV, B., CURTIS, D., JOHNSON, P., AND SCHEIFLER, R. Implementation of Argus. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Nov. 1987), 111-122.
30. NELSON, B. J. Remote procedure call. Ph.D. dissertation, Dept. of Computer Science, Carnegie Mellon Univ., May 1981.
31. OUSTERHOUT, J. K. Partitioning and cooperation in a distributed operating system. Ph.D. dissertation, Dept. of Computer Science, Carnegie-Mellon Univ., April 1980.
32. RITCHIE, D. AND THOMPSON, K. The Unix time-sharing system. Commun. ACM 17, 7 (July 1974), 365-375.
33. ROBERTS, E. S. AND VANDEVOORDE, M. T. WorkCrews: An abstraction for controlling parallelism. Tech. Rep. 42, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., April 1989.
34. ROVNER, P., LEVIN, R., AND WICK, J. On extending Modula-2 for building large, integrated systems. Tech. Rep. 3, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1985.
35. SEQUENT COMPUTER SYSTEMS, INC. Symmetry Technical Summary. Sequent Computer Systems, Inc.
36. TEVANIAN, A., RASHID, R. F., GOLUB, D. B., BLACK, D. L., COOPER, E., AND YOUNG, M. W. Mach threads and the Unix kernel: The battle for control. In Proceedings of the 1987 USENIX Summer Conference (1987), 185-197.
37. THACKER, C. P., STEWART, L. C., AND SATTERTHWAITE, JR., E. H. Firefly: A multiprocessor workstation. IEEE Trans. Comput. 37, 8 (Aug. 1988), 909-920.
38. WEISER, M., DEMERS, A., AND HAUSER, C. The portable common runtime approach to interoperability. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 114-122.

Received July 1990; revised February 1991; accepted February 1991