User-Level Interprocess Communication for Shared Memory Multiprocessors

BRIAN N. BERSHAD
Carnegie Mellon University

THOMAS E. ANDERSON, EDWARD D. LAZOWSKA, and HENRY M. LEVY
University of Washington

Interprocess communication (IPC) has become central to the design of contemporary operating systems. IPC (communication between address spaces on the same machine) has traditionally been the responsibility of the kernel, though kernel-based communication has two inherent problems. First, its performance is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. Second, applications that need fast communication and inexpensive thread management at the user level must provide their own, and encounter functional and performance problems stemming from the interaction between kernel-level communication and user-level thread management. These problems can be solved by moving communication out of the kernel to the user level. On a shared memory multiprocessor, the kernel can be bypassed during cross-address space communication: shared memory serves as the data transfer channel, and processor reallocation can often be avoided. These observations motivated User-Level Remote Procedure Call (URPC), which combines fast cross-address space communication with lightweight threads managed within each address space. URPC supports an unconventional operating system structure in which communication and thread management are the responsibility of the application rather than the kernel; the programmer, though, sees a conventional Remote Procedure Call interface, and performance is improved.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—modules, packages; D.4.1 [Operating Systems]: Process Management—concurrency, multiprocessing/multiprogramming; D.4.4 [Operating Systems]: Communications Management; D.4.6 [Operating Systems]: Security and Protection—access controls, information flow controls; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance—measurements

This material is based on work supported by the National Science Foundation (grants CCR-8619663, CCR-8700106, CCR-8703049, and CCR-897666), the Washington Technology Center, and Digital Equipment Corporation (the Systems Research Center and the External Research Program). Anderson was supported by an IBM Graduate Fellowship, and Bershad by an AT&T Ph.D. Scholarship.

Authors' addresses: B. N. Bershad, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; T. E. Anderson, E. D. Lazowska, and H. M. Levy, Department of Computer Science and Engineering, FR-35, University of Washington, Seattle, WA 98195.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0734-2071/91/0500-0175 $01.50

ACM Transactions on Computer Systems, Vol. 9, No. 2, May 1991, Pages 175-198.
General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Local remote procedure call, modularity, multiprocessor, operating system, parallel programming, performance, remote procedure call, small-kernel operating system, thread

1. INTRODUCTION

Efficient interprocess communication is central to the design of contemporary operating systems [16, 23]. An efficient communication facility encourages system decomposition across address space boundaries. Decomposed systems have several advantages over more monolithic ones, including failure isolation (address space boundaries prevent a fault in one module from "leaking" into another), extensibility (new modules can be added to the system without having to modify existing ones), and modularity (interfaces are enforced by mechanism rather than by convention).

Although address spaces can be a useful structuring device, the extent to which they can be used depends on the performance of the communication primitives. If cross-address space communication is slow, the structuring benefits that come from decomposition are difficult to justify to end users, whose primary concern is system performance, and who treat the entire operating system as a "black box" [18] regardless of its internal structure. Consequently, designers are forced to coalesce weakly related subsystems into the same address space, trading away failure isolation, extensibility, and modularity for performance.

Interprocess communication has traditionally been the responsibility of the operating system kernel. However, kernel-based communication has two problems:

—Architectural performance barriers. The performance of kernel-based synchronous communication is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. In our earlier work on Lightweight Remote Procedure Call (LRPC) [10], we show that it is possible to reduce the overhead of a kernel-mediated cross-address space call to nearly the limit possible on conventional processor architectures: the time to perform a cross-address space LRPC is only slightly greater than that required to twice invoke the kernel and have it reallocate a processor from one address space to another. The efficiency of LRPC comes from taking a "common case" approach to communication, thereby avoiding unnecessary synchronization, kernel-level thread management, and data copying for calls between address spaces on the same machine. The majority of LRPC's overhead (70 percent) can be attributed directly to the fact that the kernel mediates every cross-address space call.

—Interaction between kernel-based communication and high-performance user-level threads. The performance of a parallel application running on a multiprocessor can strongly depend on the efficiency of thread management operations. Medium- and fine-grained parallel applications must use a thread management system implemented at the user level to obtain satisfactory performance [6, 36]. Communication and thread management have strong interdependencies, though, and the cost of partitioning them across protection boundaries (kernel-level communication and user-level thread management) is high in terms of performance and system complexity.

On a shared memory multiprocessor, these problems have a solution: eliminate the kernel from the path of cross-address space communication. Because address spaces can share memory directly, shared memory can be used as the data transfer channel.
Because a shared memory multiprocessor has more than one processor, processor reallocation can often be avoided by taking advantage of a processor already active in the target address space, without involving the kernel. User-level management of cross-address space communication results in improved performance over kernel-mediated approaches because:

—Messages are sent directly between address spaces, without invoking the kernel, reducing call overhead.

—Unnecessary processor reallocation between address spaces is eliminated, improving call performance and helping to preserve valuable cache and TLB contexts across calls.

—When processor reallocation does prove to be necessary, its overhead can be amortized over several independent calls.

—The inherent parallelism in the sending and receiving of a message can be exploited, further improving performance.

User-Level Remote Procedure Call (URPC), the communication system described in this paper, isolates the three components of interprocess communication: processor reallocation, thread management, and data transfer. Only processor reallocation requires kernel involvement; thread management and data transfer do not. Thread management and interprocess communication are done by application-level libraries, rather than by the kernel, and URPC presents the programmer with a safe and efficient communication abstraction for address spaces on the same machine.

URPC's approach stands in contrast to conventional "small kernel" systems in which the kernel is responsible for address spaces, thread management, and interprocess communication. Moving communication and thread management to the user level leaves the kernel responsible only for the mechanisms that allocate processors to address spaces. For reasons of performance and flexibility, this is an appropriate division of responsibility for operating systems of shared memory multiprocessors. (URPC can also be appropriate for uniprocessors running multithreaded applications.)

The latency of a simple cross-address space procedure call is 93 μsecs using a URPC implementation running on the Firefly, DEC SRC's multiprocessor workstation [37]. Operating as a pipeline, two processors can complete one call every 53 μsecs. On the same multiprocessor hardware, kernel-based RPC facilities are markedly slower, without providing adequate support for user-level thread management. LRPC, a high performance kernel-based implementation, takes 157 μsecs. The Fork operation for the user-level threads that accompany URPC takes 43 μsecs, while the Fork operation for the kernel-level threads that accompany LRPC takes over a millisecond. To put these figures in perspective, a same-address space procedure call takes 7 μsecs on the Firefly, and a protected kernel invocation (trap) takes 20 μsecs.

We describe the mechanics of URPC in more detail in the next section. In Section 3 we discuss the design rationale behind URPC. We discuss performance in Section 4. In Section 5 we survey related work. Finally, in Section 6 we present our conclusions.
2. A USER-LEVEL REMOTE PROCEDURE CALL FACILITY

In many contemporary uniprocessor and multiprocessor operating systems, applications communicate by sending messages across narrow channels or ports that support only a small number of operations (create, send, receive, destroy). Messages permit communication between programs on the same machine, separated by address space boundaries, or between programs on different machines, separated by physical boundaries. Although messages are a powerful communication primitive, they represent a control and data-structuring device foreign to traditional Algol-like languages: messages sent between address spaces are untyped and asynchronous, while communication within an address space is with synchronous procedure call, data typing, and shared memory. Programmers of message-passing systems who must use one of the many popular Algol-like languages as a systems-building platform must think, program, and structure according to two quite different programming paradigms. Consequently, almost every mature message-based operating system also includes support for Remote Procedure Call (RPC) [11], enabling messages to serve as the underlying transport mechanism beneath a procedure call interface.

Nelson defines RPC as the synchronous language-level transfer of control between programs in disjoint address spaces whose primary communication medium is a narrow channel [30]. The definition of RPC is silent about the operation of that narrow channel and about how the processor scheduling (reallocation) mechanisms interact with data transfer. URPC exploits this silence in two ways:

—Messages are passed between address spaces through logical channels kept in memory that is pair-wise shared between the client and server. The channels are created and mapped once for every client/server pairing, and are used for all subsequent communication between the two address spaces, so that several interfaces can be multiplexed over the same channel. The integrity of data passed through the shared memory channel is ensured by a combination of the pair-wise mapping (message authentication is implied statically) and the URPC runtime system in each address space (message correctness is verified dynamically).

—Thread management is implemented entirely at the user level and is integrated with the user-level machinery that manages the message channels. A user-level thread sends a message to another address space by directly enqueuing the message on the appropriate outgoing channel link. No kernel calls are necessary to send a call or reply message.
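A minimal C sketch suggests how such a channel and its kernel-free send path might look. The layout, names, and fixed-size slots are assumptions of this sketch (the actual channels are Modula-2+ structures negotiated at bind time), and C11 atomics stand in for whatever memory-consistency discipline the real implementation enforced:

```c
/* Sketch of a pair-wise shared URPC channel: one queue per direction,
 * fixed-size message slots.  All names here are hypothetical. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SLOTS    16
#define MSG_SIZE 128

typedef struct {
    atomic_uint head;                  /* next slot the sender fills      */
    atomic_uint tail;                  /* next slot the receiver empties  */
    char        data[SLOTS][MSG_SIZE];
} msg_queue;

typedef struct {          /* mapped once into both spaces at bind time */
    msg_queue call;       /* client-to-server direction                */
    msg_queue reply;      /* server-to-client direction                */
} urpc_channel;

/* Sending a call or reply is pure user-level work: copy the marshaled
 * message (len <= MSG_SIZE) into the next free slot and publish it.
 * No kernel invocation anywhere.  Returns 0 if the queue is full. */
int channel_send(msg_queue *q, const void *msg, size_t len)
{
    unsigned head = atomic_load(&q->head);
    if (head - atomic_load(&q->tail) == SLOTS)
        return 0;                      /* full: caller retries later */
    memcpy(q->data[head % SLOTS], msg, len);
    atomic_store(&q->head, head + 1);  /* makes the message visible  */
    return 1;
}
```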
Although a cross-address space procedure call is synchronous from the perspective of the programmer, it is asynchronous at and beneath the level of thread management. When a thread in a client invokes a procedure in a server, it blocks waiting for the reply message signifying the procedure's return; while blocked, its processor can run another ready thread in the same address space. In our earlier system, LRPC, the blocked thread and the ready thread were really the same; the thread just crosses an address space boundary. In contrast, URPC always tries to schedule another thread from the same address space on the client thread's processor. This is a scheduling operation that can be handled entirely by the user-level thread management system. When the reply arrives, the blocked client thread can be rescheduled on any of the processors allocated to the client's address space, again without kernel intervention. Similarly, execution of the call on the server side can be done by a processor already executing in the context of the server's address space, and need not occur synchronously with the call.

By preferentially scheduling threads within the same address space, URPC takes advantage of the fact that there is significantly less overhead involved in switching a processor to another thread in the same address space (we will call this context switching) than in reallocating it to a thread in another address space (we will call this processor reallocation). Processor reallocation requires changing the virtual memory mapping registers that define the address space context in which a processor is executing. On conventional architectures, these mapping registers are protected and can only be accessed in privileged kernel mode. Several components contribute to the high overhead of processor reallocation: there are scheduling costs to decide the address space to which a processor should be reallocated; immediate costs to update the virtual memory mapping registers and to transfer the processor between address spaces; and long-term costs due to the diminished performance of the cache and translation lookaside buffer (TLB) that occurs when locality shifts from one address space to another [3]. Although there is a long-term cost associated with context switching within the same address space, that cost is less than when processors are frequently reallocated between address spaces [20].
The responsibility for detecting incoming messages and scheduling threads to handle them belongs to special, low-priority threads that are part of a URPC runtime library linked incoming messages only when Figure proceeding 1 illustrates downward. manager (WinMgr) a separate address procedures into they each address space. Processors would otherwise be idle. a sample execution timeline One client, an editor, and and a file cache manager space. Two threads in in the window manager for URPC two servers, scan for with time a window (FCMW), are shown. Each is in T1 and T2, invoke the editor, and the file cache manager. Initially, the editor has one processor allocated to it, the window manager has one, and the space call into file cache manager has none. T1 first makes a cross-address the window manager by sending and then attempting to receive a message. The window manager receives the message and begins processing. In the meantime, the thread scheduler in the editor has context switched from TI (which is blocked waiting T2 initiates a procedure to receive a message) to another thread call to the file cache manager by sending Thread a message T2. T1 can be unblocked because the window and waiting to receive a reply. T1 then calls manager has sent a reply message back to the editor. Thread into the file cache manager and blocks. The file cache manager now has two pending messages from the editor but no processors with which to handle T1 and T2 are blocked in the editor, which has no other them. Threads threads waiting to run. The editor’s thread management system detects the outgoing pending messages and donates a processor to the file cache manager, which can then receive, process, and reply to the editor’s two incoming calls before returning the processor back to the editor. At this point, incoming reply messages from the file cache manager can be handled. Tz each terminate ACM TransactIons when on Computer they receive Systems, Vol their 9, No replies. 2, May 1991 the two 7’1 and User-Level Interprocess Communication for Shared Memory Multiprocessors (has no processors) (scanning channels) m 181 FCMgr WinMgr Editor . m Send WinMgr TI Recv & Process Reply to T1’s Call Call [ Recv WinMgr / Context Switch T2 Send FCMgr Call [ Recv FCMgr Context Switch Send TI Call [ FCMgr Recv FCMgr Reallocate Processor Recv to FCMgr Reply Recv & Process to T2’s Call & Process Reply to T1’s Call Return Processor to Editor Context Terminate Context Terminate Switch T2 Switch T1 Time Fig. 1. URPC timeline URPC’S treatment of pending messages is analogous to the treatment of runnable threads waiting on a ready queue. Each represents an execution context in need of attention from a physical processor; otherwise idle processors poll the queues looking for work in the form of these execution contexts. There are two main differences between the operation of message channels and thread ready queues. First, URPC enables on address space to spawn work in another address space by enqueuing a message (an execution context consisting of some control information and arguments or results). Second, ACM Transactions on Computer Systems, Vol 9, No 2, May 1991 182 B. N Bershad et al. . Server Client Application Application Stubs Message URPC ❑ Stubs Channel mm L URPC Fa~tThreads FastThreads / \ / ~eallocate~roces~F-< Fig,2 URPC enables another Thesoftware processors through components in one address an explicit reallocation of URPC space to execute in the if the between workload context of them is imbalance. 
2.1 The User View

It is important to emphasize that all of the URPC mechanisms described in this section exist "under the covers" of quite ordinary looking Modula-2+ [34] interfaces. The RPC paradigm provides the freedom to implement the control and data transfer mechanisms in ways that are best matched to the underlying hardware.

URPC consists of two packages of software that are managed at the user level. One package, called FastThreads [5], used by stubs and application code, provides lightweight threads that are scheduled on top of middleweight kernel threads. The second package, called URPC, implements the channel management and message primitives described in this section. The URPC package lies directly beneath the stubs and closely interacts with FastThreads, as shown in Figure 2.

3. RATIONALE FOR THE URPC DESIGN

In this section we discuss the design rationale behind URPC. In brief, this rationale is based on the observation that there are several independent components to a cross-address space call and each can be implemented separately. The main components are the following:

—Thread management: blocking the caller's thread, running a thread through the procedure in the server's address space, and resuming the caller's thread on return;

—Data transfer: moving arguments between the client and server address spaces; and

—Processor reallocation: ensuring that there is a physical processor to handle the client's call in the server and the server's reply in the client.

In conventional message systems, these three components are combined beneath a kernel interface, leaving the kernel responsible for each. However, thread management and data transfer do not require kernel assistance; only processor reallocation does. In the subsections that follow we describe how the three components of a cross-address space call can be isolated from one another, and the benefits that arise from such a separation.

3.1 Processor Reallocation

URPC attempts to reduce the frequency with which processor reallocation occurs through the use of an optimistic reallocation policy. At call time, URPC optimistically assumes the following:

—The client has other work to do, in the form of runnable threads or incoming messages, so that a potential delay in the processing of a call will not have a serious effect on the performance of other threads in the client's address space.

—The server has, or will soon have, a processor with which it can service a message.

In terms of performance, URPC's optimistic assumptions can pay off in several ways. The first assumption makes it possible to do an inexpensive context switch between user-level threads during the blocking phase of a cross-address space call. The second assumption enables a URPC to execute in parallel with threads in the client's address space while avoiding a processor reallocation. An implication of both assumptions is that it is possible to amortize the cost of a single processor reallocation across several outstanding calls between the same address spaces. For example, if two threads in the same client make calls into the same server in succession, then a single processor reallocation can handle both.
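Under these assumptions, the common-case client call path involves no kernel interaction at all. A hedged sketch, reusing the names introduced earlier; the thread operations are FastThreads-style stand-ins, and error handling is omitted:

```c
/* Sketch of the optimistic call path.  channel_send() is the user-level
 * enqueue from Section 2; the thread primitives are assumed names. */
void  thread_yield(void);
void  thread_block_on(urpc_channel *ch);   /* run another ready thread */
void *channel_receive(msg_queue *q);       /* dequeue a message        */

void *urpc_call(urpc_channel *ch, const void *args, size_t len)
{
    while (!channel_send(&ch->call, args, len))
        thread_yield();                /* queue full: let it drain */

    /* Optimistic assumption 1: rather than reallocating this processor
     * to the server, block the caller and context switch to another
     * ready thread in this same address space (cheap, user-level). */
    thread_block_on(ch);

    /* Optimistic assumption 2: the reply was produced by a processor
     * already active in the server, in parallel with whatever this
     * client ran meanwhile.  Only if neither assumption held did the
     * idle scanner fall back to an explicit processor donation. */
    return channel_receive(&ch->reply);
}
```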
A user-level approach to communication and thread management is appropriate for shared memory multiprocessors where applications rely on aggressive multithreading to exploit parallelism while at the same time compensating for multiprogramming effects and memory latencies [2], where a few key operating system services are the target of the majority of all application calls [8], and where operating system functions are affixed to specific processing nodes for the sake of locality [31]. In contrast to URPC, contemporary uniprocessor kernel structures are not well suited for use on shared memory multiprocessors:

—Kernel-level communication and thread management facilities rely on pessimistic processor reallocation policies, and are unable to exploit concurrency within an application to reduce the overhead of communication. Handoff scheduling [13] underscores this pessimism: a single kernel operation blocks the client and reallocates its processor directly to the server. Although handoff scheduling does improve performance, the improvement is limited by the cost of kernel invocation and processor reallocation.

—In a traditional operating system running on a uniprocessor, kernel resources are logically centralized. On a multiprocessor, a kernel designed for a uniprocessor keeps its resources logically centralized, but distributed over the many processors in the system. For a medium- to large-scale shared memory multiprocessor such as the Butterfly [7], Alewife [4], or DASH [25], URPC's user-level orientation to operating system design localizes system resources to those processors where the resources are in use, relaxing the performance bottleneck that comes from relying on centralized kernel data structures. This bottleneck is due to the contention for logical resources (locks) and physical resources (memory and interconnect bandwidth) that results when a few data structures, such as thread run queues and message channels, are shared by a large number of processors. By distributing the communication and thread management functions directly to the address spaces (and hence processors) that use them, contention is reduced.

URPC can also be appropriate on uniprocessors running multithreaded applications where inexpensive threads can be used to express the logical and physical concurrency within a problem. Low overhead threads and communication make it possible to overlap even small amounts of external computation. Further, multithreaded applications that are able to benefit from delayed reallocation can do so without having to develop their own communication protocols [19]. Although reallocation will eventually be necessary on a uniprocessor, it can be delayed by scheduling within an address space for as long as possible.
3.1.1 The Optimistic Assumptions Won't Always Hold. In cases where the optimistic assumptions do not hold, it is necessary to invoke the kernel to force a processor reallocation from one address space to another. Examples of where it is inappropriate to rely on URPC's optimistic processor reallocation policy are single-threaded applications (since the thread executing the cross-address space call is the only one in the client), real-time applications (where latency must be bounded), high-latency I/O operations (where it is best to initiate the operation early, since it will take a long time to complete), and priority invocations (where the call is of high priority). To handle these situations, URPC allows the client's address space to force a processor reallocation to the server's, even though there might still be runnable threads in the client's.

3.1.2 The Kernel Handles Processor Reallocation. The kernel implements the mechanisms that support processor reallocation. When an idle processor discovers an underpowered address space, it invokes a procedure called Processor.Donate, passing in the identity of the address space to which the processor (on which the invoking thread is running) should be reallocated. Processor.Donate transfers control down through the kernel and then up to a prespecified address in the receiving space. The identity of the donating address space is made known to the receiver by the kernel.

3.1.3 Voluntary Return of Processors Cannot Be Guaranteed. A service interface defines a contract between a client and a server. In the case of URPC, as with traditional RPC systems, implicit in the contract is that the server obey the policies that determine when a processor is to be returned back to the client. The URPC communication library implements the following policy in the server: upon receipt of a processor from a client address space, return the processor when all outstanding messages from the client have generated replies, or when the server determines that the client has become "underpowered" (there are outstanding messages and one of the server's processors is idle).

Although URPC's runtime libraries implement this protocol, there is no way to enforce it. Just as with kernel-based systems, once the kernel transfers control to a specific application, there is no guarantee that the application will voluntarily return the processor to the client once it returns from the procedure that the client invoked. The server could, for example, use the processor to handle requests from other clients, even though this was not what the client had intended.
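Rendered as a sketch (with invented names), the donation upcall and the voluntary-return policy might look like this; note that nothing but convention compels the final return:

```c
/* Kernel primitive (Section 3.1.2): switch this processor's virtual
 * memory context to `space` and upcall at that space's prespecified
 * entry, identifying the donor.  Signature is hypothetical. */
extern void Processor_Donate(int space);

int  calls_outstanding_from(int donor);  /* unanswered calls remain?  */
int  client_underpowered(int donor);     /* donor now needs it more?  */
void serve_one_call(int donor);          /* run one request to reply  */

/* Server-library routine reached by the upcall.  It serves the
 * donor's calls, then voluntarily hands the processor back. */
void donation_entry(int donor)
{
    while (calls_outstanding_from(donor) && !client_underpowered(donor))
        serve_one_call(donor);
    Processor_Donate(donor);   /* voluntary return: convention, not mandate */
}
```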
URPC's direct reallocation deals only with the problem of load balancing between clients and servers that are communicating with one another; it is also necessary to ensure that applications receive a fair share of the available processing power. Independent of communication, a multiprogrammed system requires policies and mechanisms that balance the load between noncommunicating (or noncooperative) address spaces. Applications, for example, must not be able to starve one another out for processors, and servers must not be able to delay clients indefinitely by not returning processors. Preemptive policies, which forcibly reallocate processors from one address space to another, are therefore necessary to ensure that applications make progress. A processor reallocation policy should be work conserving, in that no high-priority thread waits for a processor while a lower priority thread runs, and, by implication, that no processor idles when there is work for it to do anywhere in the system, even if the work is in another address space. The specifics of how to enforce this constraint in a system with user-level threads are beyond the scope of this paper, but are discussed by Anderson et al. in [6].

The direct processor reallocation done in URPC can be thought of as an optimization of a work-conserving policy. A processor idling in the address space of a URPC client can determine which address spaces are not responding to that client's calls, and therefore which address spaces are, from the standpoint of the client, the most eligible to receive a processor. There is no reason for the client to first voluntarily relinquish the processor to a global allocator, only to have the allocator then determine to which space the processor should be reallocated. This is a decision that can easily be made by the client itself.

3.2 Data Transfer Using Shared Memory

The requirements of a communication system can be relaxed and its efficiency improved when it is layered beneath a procedure call interface rather than exposed to programmers directly. In particular, arguments that are part of a cross-address space procedure call can be passed using shared memory while still guaranteeing safe interaction between mutually suspicious subsystems.

Shared memory message channels do not increase the "abusability factor" of client-server interactions. As with traditional RPC, clients and servers can still overload one another, deny service, provide bogus results, and violate communication protocols (e.g., fail to release channel locks, or corrupt channel data structures). And, as with traditional RPC, it is up to higher level protocols to ensure that lower level abuses filter up to the application in a well-defined manner (e.g., by raising a call-failed exception or by closing down the channel). In URPC, the safety of communication is the responsibility of the stubs.

The arguments of a URPC are passed in message buffers that are allocated and pair-wise mapped during the binding phase that precedes the first cross-address space call between a client and server. On receipt of a message, the stubs unmarshal the message's data into procedure parameters, doing whatever copying and checking is necessary to ensure the application's safety. Traditionally, safe communication is implemented by having the kernel copy data from one address space to another. Such copying is necessary when application programmers deal directly with raw data in the form of messages. But, when standard runtime facilities and stubs are used, as is the case with URPC, having the kernel copy data between address spaces is neither necessary nor sufficient to guarantee safety.

Copying in the kernel is not necessary because programming languages are implemented to pass parameters between procedures on the stack, the heap, or in registers. When data is passed between address spaces, none of these storage areas can, in general, be used directly by both the client and the server. Therefore, the stubs must copy the data into and out of the memory used to move arguments between address spaces; safety is not increased by first doing an extra kernel-level copy of the data. Copying in the kernel is not sufficient for ensuring safety when using type-safe languages such as Modula-2+ or Ada [1], since each actual parameter must be checked by the stubs for conformity with the type of its corresponding formal. Without such checking, for example, a client could crash the server by passing in an illegal (e.g., out of range) value for a parameter.
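A sketch of what generated stubs might do for an invented interface procedure illustrates both halves of the argument: the copy into the shared buffer and the copy-out-then-check on receipt are all the safety machinery needed, and a kernel copy in between would add cost without adding safety. The procedure, its bounds, and every helper name here are hypothetical:

```c
/* Hypothetical interface procedure:  void SetWidth(int width);
 * with the formal constrained to 0 <= width <= 4096. */
#define OP_SETWIDTH 7
typedef struct { int op; int width; } setwidth_msg;

void raise_call_failed(void);          /* propagate a call-failed exception */
void SetWidth_impl(int width);         /* the real server procedure         */

void SetWidth_client_stub(urpc_channel *ch, int width)
{
    setwidth_msg m = { OP_SETWIDTH, width };
    urpc_call(ch, &m, sizeof m);       /* marshal: copy the actual from the
                                        * stack into the shared buffer     */
}

void SetWidth_server_stub(const setwidth_msg *m)
{
    int width = m->width;              /* unmarshal: copy out first, since
                                        * the client could rewrite the
                                        * buffer after an in-place check   */
    if (width < 0 || width > 4096)
        raise_call_failed();           /* illegal actual for the formal    */
    else
        SetWidth_impl(width);
}
```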
These points motivate the use of pair-wise shared memory for cross-address space communication. Pair-wise shared memory can be used to transfer data between address spaces more efficiently than, but just as safely as, messages that are copied by the kernel between address spaces.

3.2.1 Controlling Channel Access. Data flows between URPC packages in different address spaces over a bidirectional shared memory queue with test-and-set locks on either end. To prevent processors from waiting indefinitely on message channels, the locks are nonspinning; i.e., the lock protocol is simply: if the lock is free, acquire it, or else go on to something else; never spin-wait. The rationale here is that the receiver of a message should never have to delay while the sender holds the lock. If the sender stands to benefit from the receipt of a message (as on call), then it is up to the sender to ensure that locks are not held indefinitely; if the receiver stands to benefit (as on return), then the sender could just as easily prevent the benefits from being realized by not sending the message in the first place.
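The nonspinning discipline is just a try-lock. A minimal sketch using C11 atomics; the pairing of one lock per channel endpoint follows the text, and the names are invented:

```c
#include <stdatomic.h>

typedef atomic_flag channel_lock;      /* one per channel endpoint */

/* One atomic test-and-set; nonzero only if the lock was free. */
static inline int channel_trylock(channel_lock *l)
{
    return !atomic_flag_test_and_set(l);
}

static inline void channel_unlock(channel_lock *l)
{
    atomic_flag_clear(l);
}

void drain_incoming(void *ch);         /* process queued messages */

/* If the lock is free, take it and drain the queue; otherwise go do
 * something else.  There is no spin loop anywhere. */
void poll_one_channel(channel_lock *l, void *ch)
{
    if (!channel_trylock(l))
        return;                        /* holder is making progress */
    drain_incoming(ch);
    channel_unlock(l);
}
```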
management internal multiprocessors, application that needs to overlap activity on its behalf in another mer level. makes in it possible terms of reference: management of to a cost heavyweight, overheads are and tens of psecs, respectively. threads [26, 32, 35] make no distinction between a thread, the dynamic component of a program, and its address space, the static component. Threads and address spaces are created, scheduled, and destroyed as single entities. Because an address space has a large amount of baggage, such as open file descriptors, page tables, and accounting ACM Transactions on Computer Systems, Vol 9, No 2, May 1991 188 B. N Bershad et al. . state that must be manipulated during thread management operations, operations on heavyweight threads are costly, taking tens of milliseconds. Many contemporary operating system kernels provide support for middleweight, address threads [12, 361. In contrast to heavyweight middleweight threads are decoupled, so that or kernel-level spaces and address space can have many middleweight threads. As with threads, though, the kernel is responsible for implementing management functions, and directly schedules an application’s the available defines physical a general processors. programming With middleweight interface for heavyweight the thread threads on threads, use by all threads, a single the concurrent kernel applica- tions. Unfortunately, the grained parallel slicing, or between saving threads applications, cost of this applications. and restoring contexts traps are required kernel must treat affects example, floating can be sources even those for which thread management malfeasant programs. generality For the simple point performance features registers of performance the features overhead are unnecessary. of kernel-level for all parallel weight for all A kernel-level thread management operation, and the suspiciously, checking arguments and from the effect of thread to the high cost of but more pervasive management on program policy There are a large number of parallel programming models, and a wide variety of scheduling disciplines that are most appropri- ate (for performance) to a given strongly influenced by the choice rithms used to implement threads, In response to lightweight switching system must also protect itself from malfunctioning or Expensive (relative to procedure call) protected kernel to invoke any each invocation stemming when as time degradation access privileges to ensure legitimacy. In addition to the direct overheads that contribute kernel-level thread management, there is an indirect performance. within these, of fine- such thread is unlikely programs. model [25, 33]. Performance, though, is of interfaces, data structures, and algoso a single model represented by one style to have an implementation to the costs of kernel-level threads, threads that execute in the context threads provided by the kernel, but that is efficient programmers have turned of middleweight or heavy- are managed at the user level by a library linked in with each application [9, 38]. Lightweight thread managetwo-level scheduling, since the application library schedules ment implies lightweight threads on top of weightier threads, which are themselves being scheduled by the kernel. Two-level schedulers attack both the direct and indirect costs of kernel threads. 
Directly, it is possible to implement efficient user-level thread management functions because they are accessible through a simple procedure call rather than a trap, need not be bulletproofed against application errors, and can be customized to provide only the level of support needed by a given application. The cost of thread management operations in a user-level scheduler 3.3.2 for two ACM can be orders The Problem reasons: fiansactlons of magnitude with Having (1) to synchronize on Computer Systems, Vol less than Functions their 9, No for kernel-level at activities 2, May 1991 Two Levels. within threads. Threads an address block space User-Level Interprocess Communication for Shared Memory Multiprocessors (e.g., while controlling access to critical sections], events in other address spaces (e. g., while reading manager). pensive kernel, If threads are implemented at the and flexible concurrency), but then the performance benefits switching are lost when scheduler accurately reflects In contrast, when efficient user-level (necessary in separate address of the application, awaiting and thread and management scheduling and then kernel-mediated com- are moved together at user level, the system can control threads synchronization for inex- is implemented in the scheduling and context activities state applications notified. communication of the kernel and implemented needed by the communication level operation must be implemented and the application, so that the user-level the scheduling again (2) in the kernel, so that munication activity are properly user between 189 and (2) to wait for external keystrokes from a window communication of inexpensive synchronizing spaces. In effect, each synchronizing executed at two levels: (1) once in . out synchronization with the same primitives that are used by applications. 3.4 Summary of Rationale This section has described the design advantages of a user-level approach made without invoking the kernel processors between processor reallocation over multiple calls. grated with spaces. do turn Finally, out to be necessary, their user-level communication thread programs, as within, 4. THE PERFORMANCE Further, as measurements megabytes 4.1 have kernel processor described many as dedicated main invocation and cost can be amortized can be cleanly intenecessary can inexpensively in this of URPC by DEC’s six to C-VAX 1/0. paper for fine- synchronize The had on the Firefly, an experiSystems Research Center. microprocessors, Firefly four used C-VAX to plus one collect the processors and 32 of memory. User-Level Thread Management Performance In this section we compare the performance of thread management when implemented at the user level and in the kernel. The threads are those provided by Taos, the Firefly’s native operating Table mented invokes The calls can be reallocating OF URPC A can URPC. facilities, programs spaces. the performance workstation built Firefly when management so that address This section describes mental multiprocessor MicroVAX-11 behind address user-level grained parallel between, as well rationale to communication are that and without unnecessarily primitives kernel-level system. I shows the cost of several thread management operations impleat the user level and at the kernel level. PROCEDURE CALL the null procedure and provides a baseline value for the machine’s performance. FORK creates, starts, and terminates a thread. FORK;JOIN is like FORK, except that the thread which creates the new thread blocks until the forked thread completes. 
YIELD forces a thread to make a full trip through the scheduling machinery: when a thread yields the processor, it is blocked (state saved), made ready (enqueued on the ready queue), scheduled (dequeued), unblocked (state restored), and continued. ACQUIRE;RELEASE acquires and then releases a blocking (nonspinning) lock for which there is no contention. PINGPONG has two threads "pingpong" back and forth on a condition variable, blocking and unblocking one another by means of paired signal and wait operations. Each pingpong cycle schedules, blocks, and unblocks two threads in succession.

Table I. Comparative Performance of Thread Management Operations (μsecs)

Test             | FastThreads | Taos Threads
PROCEDURE CALL   |     7       |     7
FORK             |    43       |  1192
FORK;JOIN        |   102       |  1574
YIELD            |    37       |    57
ACQUIRE;RELEASE  |    27       |    27
PINGPONG         |    53       |   271

Each test was run a large number of times as part of a loop; the elapsed time was divided by the loop limit to determine the values measured in the table. The FastThreads tests were constrained to run on a single processor; the kernel-level threads had no such constraint. Taos caches recently executed kernel-level threads on idle processors to reduce wakeup latency. Because the tests were run on an otherwise idle machine, the caching optimization worked well and we saw little variance in the measurements.

The table demonstrates the performance advantage of implementing thread management at the user level, where the operations have an overhead on the order of a few procedure calls, rather than a few hundred, as when the threads are provided by the kernel.

4.2 URPC Component Breakdown

In the case when an address space has enough processing power to handle all incoming messages without requiring a processor reallocation, the work done during a URPC can be broken down into four components: send (enqueuing a message on an outgoing channel); poll (detecting the channel on which a message has arrived); receive (dequeueing the message); and dispatch (dispatching the appropriate thread to handle an incoming message). Table II shows the time taken by each component.

Table II. Breakdown of a URPC (μsecs)

Component | Client | Server
send      |   18   |   13
poll      |    6   |    6
receive   |   10   |    9
dispatch  |   20   |   25
Total     |   54   |   53
Both processors, runnable C, call latency and the threads provided metrics, number in the of client’s T. To evaluate values independent throughput, processors, address are processor T, latency and C, S and throughput, in we ran which threads to make 100,000 “Null” loop. The Null procedure call we timed a series how of tests long procedure calls into takes no arguments, it using took different the the server computes client’s T inside a tight nothing, and returns no results; it exposes the performance characteristics of a round-trip cross-address space control transfer. Figures 3 and 4 graph call latency and throughput as a function of T. Each line on the graphs represents a different combination of S and C. Latency is measured as the time from when a thread calls into the Null stub to when control returns from the stub, and depends on the speed of the message primitives and the length of the thread ready queues in the client and server. When (T> the number processor of caller call C + S), latency in the client exceeds since or the server to T. Except is less sensitive call processing threads increases times total call or both. for the special in the client the each number must When the number of calling on the other case where S = 0, as long and server message wasted transfer. processing threads, are roughly client equal T > hand, as the (see Table II), At that no further C + S. can processors, a free and be server proces- latency is 93 ~secs. This is “pure” latency, in basic processing required to do a round-trip 1 Unfortunately, time for Throughput, until then throughput, as a function of T, improves point, processors are kept completely busy, so there T. improvement in throughput by increasing sors is 1 (T = C = S = 1), call the sense that it reflects the of processors wait the due to idling latency in the includes client and a large server amount while of waiting 1 The 14 psec discrepancy between the obseved latency and the expected latency (54 + 53, cf., Table II) is due to a low-level scheduling optimization that allows the caller thread’s context to remain loaded on the processor if the ready queue is empty and there is no other work to be done anywhere in the system. In this case, the idle loop executes in the context of the caller thread, enabling a fast wakeup when the reply message arrives. At 2’ = C = S, the optimization is enabled. The optimization is also responsible for the small “spike” in the throughput curves shown in Figure 4. ACM Transactions on Computer Systems, Vol. 9, No 2, May 1991. 192 . B. N. Bershad et al. 1500– C=l, s=o C=l, S=2 s=l 14001300120011001000 – 900Call 800- Latency 700- (psecs) /c=l, 600500400 - /<,..”” ~=*~5~2s==2 ‘-- ‘ /’ 300 200 100 - k’~”;”; ‘ “ ,---”-’” o~ 12 8 Number Fig for the next call or reply of Client 3 Threads 9 10 {T) URPC call latency message. At T = 2, C = 1, S = 1, latency increases by only 20 percent (to 112 Wsecs), but throughput increases by nearly 75 percent (from 10,300 to 17,850 calls per second). This increase reflects the fact that message processing can be done in parallel between the client and server. Operating as a two-stage pipeline, the client and server complete call every 53 psecs. For larger values of S and C, the same analysis holds. As long sufficiently large to keep the processors in the client and server spaces busy, throughput Note that throughput is maximized. for S = C = 2 is not twice as large one as T is address as for S = C = 1 (23,800 vs. 17,850 calls per second, a 33 percent improvement). 
Contention for the critical sections that manage access to the thread and message queues in each address space limits throughput over a single channel. Of the 148 instructions required to do a round-trip message transfer, approximately half are part of critical sections guarded by two separate locks kept on a per-channel basis. With four processors, the critical sections are therefore nearly always occupied, so there is slowdown due to queueing at the critical sections. This factor does not constrain the aggregate rate of URPCs between multiple clients and servers, since they use different channels.

Fig. 4. URPC call throughput (calls per second) as a function of the number of client threads T, for various combinations of C and S.

When S = 0, throughput and latency are at their worst due to the need to reallocate processors frequently between the client and server. When T = 1, round-trip latency is 375 μsecs. Every call requires two traps and two processor reallocations. At this point, URPC performs worse than LRPC (157 μsecs); we consider this comparison further in Section 5.5. Although call performance is worse, URPC's thread management primitives retain their superior performance (see Table I). As T increases, though, the trap and reallocation overheads can be amortized over the outstanding calls. Throughput improves steadily, and latency, once the initial shock of processor reallocation has been absorbed, worsens only because of scheduling delays.

4.4 Summary

We have described the performance of the communication and thread management primitives for an implementation of URPC on the Firefly multiprocessor. The performance figures demonstrate the advantages of moving traditional operating system functionality out of the kernel and implementing it at the user level. Our approach stands in contrast to traditional operating system methodology. Normally, functionality is pushed into the kernel in order to improve performance, but at the cost of reduced flexibility. With URPC, we are pushing functionality out of the kernel to improve both performance and flexibility.

5. WORK RELATED TO URPC

In this section we examine several other communication systems in the light of URPC. The common theme underlying these other systems is that
consider- takes care of thread control blocks on the main processor As Table II shows, though, thread management can be responsible for a large portion of the round-trip processing time. Consequently, a special-purpose coprocessor can offer only limited performance gains. 5.2 Yackos Yackos [22] is similar communication Messages and to URPC in that is intended are passed between for it strives to bypass use on shared address memory spaces by means the kernel during multiprocessors. of a special message- passing process is no automatic that shares memory with all communicating processes. There load balancing among the processors cooperating to provide a Yackos Address service. spaces are single threaded, and clients communicate with a process running on a single processor. That process is responsible for forwarding the message to a less busy process if necessary. In URPC, address spaces are multithreaded, so a message sent from one address space to another can be fielded by any processor allocated to the receiving address space. Further, URPC on a multiprocessor in the best case (no kernel involvement) outperforms Yackos by a factor of two for a round-trip call (93 VS. 200 psecs).2 The difference in communication performance is due mostly to the indirection through the interposing Yackos message-passing process. This indirection results in increased latency and data copying. By taking advantage in an RPC relationship, URPC avoids due to message queueing, polling, of the client/server pairing implicit the indirection through the use of pair-wise shared message queues. In the pessimistic case (synchronous cessor allocation on every message send), URPC is over thirteen times efficient (375 vs. 5000 psecs) than Yackos. 5.3 promore Promises Argus [28] provides the programmer with that can be used to make an asynchronous a mechanism cross-address called a Promise space procedure call. 80386 that faster 2 Yackos runs than the C-VAX the ACM TransactIons Sequent processors Symmetry, used on Computer which uses processors in the Firefly. Systems, Vol 9, No 2, May 1991 are somewhat [271 User-Level Interprocess Communication for Shared Memory Multiprocessors . The caller thread continues to run, possibly the time that the call’s results are needed. in parallel If a thread on a Promise (i. e., the cross-address that has not yet been fulfilled has not yet completed), A Promise the collecting is used to compensate thread 195 with the callee, until attempts to “collect” space call blocks. for the high cost of thread management in the Argus system. Although an asynchronous call can be simulated by forking a thread and having the forked thread do the cross-address space call synchronously, the cost of thread creation in Argus [29] (around 20 procedure call times) the precludes order explicit this approach. of six procedure language one largely call support In URPC, times, for asynchronous of preference and style, the cost of thread and so the issue cross-address rather than creation of whether is on to provide space calls becomes performance. Camelot 5.4 Camelot [19] is a high-performance distributed transaction processing sys- tem. Key to the system’s performance is its use of recoverable virtual memory and write-ahead logging. Because of the cost of Mach’s communication primitives [18], data servers use shared memory queues to communicate virtual memory control information and log records to the disk manager on each machine. 
5.4 Camelot

Camelot [19] is a high-performance distributed transaction processing system. Key to the system's performance is its use of recoverable virtual memory and write-ahead logging. Because of the cost of Mach's communication primitives [18], data servers use shared memory queues to communicate virtual memory control information and log records to the disk manager on each machine. When the queues become full, data servers use Mach's kernel-level RPC system to force a processor reallocation to the disk manager so that it can process the pending messages in the queue.

URPC generalizes the ad hoc message-passing approach used by Camelot. Camelot's shared memory queues can only be used between the data servers and the disk manager. URPC's channels are implemented beneath the stub layer, allowing any client and any server to communicate through a standard RPC interface. Camelot's thread management primitives are not integrated with those that access the shared memory queues. Finally, Camelot's shared memory queues support only unidirectional communication, whereas communication in URPC is bidirectional, supporting request-reply interaction.

At the lowest level, URPC is simply a protocol by which threads in separate address spaces access shared memory in a disciplined fashion. Programmers have often used shared memory for high-bandwidth cross-address space communication, but have had to implement special-purpose interfaces. URPC formalizes this interface by presenting it in the framework of a standard programming abstraction based on procedure call and inexpensive threads.
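The Camelot pattern that URPC generalizes, writing into a shared queue and paying for kernel mediation only when the queue fills, can be sketched as follows. This is an illustrative reconstruction, not Camelot code: log_append and kernel_rpc_kick are hypothetical names, the queue is single-producer/single-consumer, and the synchronization a real shared queue needs (memory barriers, a consumer-side protocol) is elided.

```c
/* Sketch of Camelot-style shared-queue logging with a kernel fallback
 * (hypothetical names; synchronization elided for clarity). */
#include <stddef.h>

#define LOG_QSIZE 128

typedef struct { char bytes[256]; } log_record_t;

typedef struct {
    size_t head, tail;               /* consumer advances head, producer tail */
    log_record_t slots[LOG_QSIZE];
} shared_log_queue_t;

/* Stand-in for the kernel-level RPC that forces a processor reallocation
 * to the disk manager; this expensive path is taken only when full. */
extern void kernel_rpc_kick(void);

void log_append(shared_log_queue_t *q, const log_record_t *rec) {
    while (q->tail - q->head == LOG_QSIZE)
        kernel_rpc_kick();           /* queue full: give the consumer a processor */
    q->slots[q->tail % LOG_QSIZE] = *rec;
    q->tail++;                       /* publish the record (barrier needed in real code) */
}
```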
5.5 LRPC

LRPC demonstrates that it is possible to communicate between address spaces by way of the kernel at "hardware speed." Like URPC, LRPC passes parameters in shared memory that is pair-wise mapped between the client and server. Unlike URPC, though, LRPC uses kernel-level threads that pass between address spaces on each call and return. LRPC greatly reduces the amount of thread management that must be done on each call, but is still performance limited by kernel mediation.

In the worst case, when there are no processors allocated to the server and there is one client thread, the latency for the Null call in URPC is 375 μsecs. On the same hardware, LRPC takes 157 μsecs. For both URPC and LRPC, each call involves two kernel invocations and two processor reallocations. There are two reasons why LRPC outperforms URPC in this case; the first is an artifact of URPC's implementation, while the second is inherent in URPC's approach:

—Processor reallocation. URPC implements processor reallocation using LRPC, which is a general cross-address space procedure call facility for kernel-level threads. We used LRPC because it was available and integrated into the Firefly operating system. The overhead of LRPC-based processor reallocation is 180 μsecs, but we estimate that a special-purpose processor reallocation mechanism would be about 30 percent faster.

—URPC is integrated with two-level scheduling. Processor reallocation occurs after two distinct scheduling decisions are made: (1) Is there an idle processor? (2) Is there an underpowered address space to which it can be reallocated? In contrast, a call using LRPC makes no scheduling decisions (one of the main reasons for its speed). Making these two decisions is more expensive than not making them, but is necessary to take advantage of the inexpensive synchronization and scheduling functions made possible by user-level thread management. The overhead of these two scheduling decisions is about 100 μsecs per round-trip call (only in the case where processor reallocation is required, of course).

It should not be surprising that indirecting through a second-level scheduler, which manages threads, increases the cost of accessing the scheduler at the first level, which manages processors. The trick is to interact with the first-level scheduler infrequently. When the first-level scheduler must be used, the overhead of having to pass through the second-level mechanism will inevitably degrade performance relative to a system with only a single level of scheduling.
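The two scheduling decisions described above can be sketched as follows. The helper names (idle_processor_exists, most_underpowered_space, reallocate_to) are hypothetical rather than URPC's actual interface, and the cost figures in the comments come from the text.

```c
/* Sketch of the two decisions the second-level scheduler makes before
 * reallocating a processor (hypothetical names, not the URPC interface). */
#include <stdbool.h>
#include <stddef.h>

struct address_space;

extern bool idle_processor_exists(void);
extern struct address_space *most_underpowered_space(void); /* NULL if none */
extern void reallocate_to(struct address_space *space);     /* trap: ~180 usecs via LRPC */

/* Invoked when a message is pending for a space that may lack processors.
 * The two tests below are the ~100 usec per-call overhead the text
 * attributes to two-level scheduling; LRPC skips both. */
void maybe_reallocate(void) {
    /* Decision 1: is there an idle processor? If so, it will pick up
     * the pending message by polling; no reallocation is needed. */
    if (idle_processor_exists())
        return;

    /* Decision 2: is there an underpowered address space to which this
     * processor can be reallocated? If so, donate it via the kernel. */
    struct address_space *needy = most_underpowered_space();
    if (needy != NULL)
        reallocate_to(needy);
}
```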
6. CONCLUSIONS

This paper has described the motivation, design, implementation, and performance of URPC, a new approach that addresses the problems of kernel-based communication by moving traditional operating system functionality out of the kernel and up to the user level. We believe that URPC represents the appropriate division of responsibility for the operating system kernels of shared memory multiprocessors. While it is a straightforward task to port a uniprocessor operating system to a multiprocessor by adding kernel support for threads and synchronization, it is another thing entirely to design facilities for a multiprocessor that will enable programmers to fully exploit the processing power of the machine. URPC demonstrates that one way in which a multiprocessor's performance potential can be greatly increased is by designing system facilities for a multiprocessor in the first place, thereby making a distinction between a multiprocessor operating system and a uniprocessor operating system that happens to run on a multiprocessor.

REFERENCES

1. UNITED STATES DEPARTMENT OF DEFENSE. Reference Manual for the Ada Programming Language. July 1980.
2. AGARWAL, A. Performance tradeoffs in multithreaded processors. Tech. Rep. MIT VLSI Memo 89-566, Laboratory for Computer Science, Massachusetts Institute of Technology, Sept. 1989.
3. AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. 6, 4 (Nov. 1988), 393-431.
4. AGARWAL, A., LIM, B.-H., KRANZ, D., AND KUBIATOWICZ, J. APRIL: A processor architecture for multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 104-114.
5. ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. The performance implications of thread management alternatives for shared memory multiprocessors. IEEE Trans. Comput. 38, 12 (Dec. 1989), 1631-1644.
6. ANDERSON, T. E., BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. Efficient user-level thread management for shared memory multiprocessors. Tech. Rep. 90-04-02, Dept. of Computer Science and Engineering, Univ. of Washington, April 1990. To appear in Proceedings of the 13th ACM Symposium on Operating Systems Principles.
7. BBN LABORATORIES. Butterfly Parallel Processor Overview. BBN Labs., Cambridge, Mass., 1985.
8. BERSHAD, B. N. High performance cross-address space communication. Ph.D. dissertation, Tech. Rep. 90-06-02, Dept. of Computer Science and Engineering, Univ. of Washington, June 1990.
9. BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. PRESTO: A system for object-oriented parallel programming. Softw. Pract. Exper. 18, 8 (Aug. 1988), 713-732.
10. BERSHAD, B. N., ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. Lightweight remote procedure call. ACM Trans. Comput. Syst. 8, 1 (Feb. 1990), 37-55. Also appeared in Proceedings of the 12th ACM Symposium on Operating Systems Principles, Dec. 1989.
11. BIRRELL, A. D. An introduction to programming with threads. Tech. Rep. 35, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1989.
12. BIRRELL, A. D., AND NELSON, B. J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.
13. BLACK, D. L. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Comput. Mag. 23, 5 (May 1990), 35-43.
14. BURKOWSKI, F. J., CORMACK, G., AND DUECK, G. Architectural support for synchronous task communication. In Proceedings of the 3rd Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 40-53.
15. CHERITON, D. R. Thoth: A portable real-time operating system. Commun. ACM 22, 2 (Feb. 1979), 105-115.
16. CHERITON, D. R. The V distributed system. Commun. ACM 31, 3 (Mar. 1988), 314-333.
17. COOK, D. The evaluation of a protection system. Ph.D. dissertation, Computer Lab., Cambridge Univ., April 1978.
18. DUCHAMP, D. Analysis of transaction management performance. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 177-190.
19. EPPINGER, J. L. Virtual memory management for transaction processing systems. Ph.D. dissertation, Dept. of Computer Science, Carnegie Mellon Univ., Feb. 1989.
20. GUPTA, A., TUCKER, A., AND URUSHIBARA, S. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (May 1991).
21. HALSTEAD, R. H. Multilisp: A language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst. 7, 4 (Oct. 1985), 501-538.
22. HENSGEN, D., AND FINKEL, R. Dynamic server squads in Yackos. Tech. Rep. 138-89, Dept. of Computer Science, Univ. of Kentucky, 1989.
23. JONES, M. B., AND RASHID, R. F. Mach and Matchmaker: Kernel and language support for object-oriented distributed systems. In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '86) (Sept. 1986), 67-77.
24. KARGER, P. A. Using registers to optimize cross-domain call performance. In Proceedings of the 3rd Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), 194-204.
25. LENOSKI, D., LAUDON, J., GHARACHORLOO, K., GUPTA, A., AND HENNESSY, J. The directory-based cache coherence protocol in the DASH multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture (May 1990), 148-159.
26. LEVY, H. M., AND ECKHOUSE, R. H., JR. Computer Programming and Architecture: The VAX-11. 2nd Ed. Digital Press, Bedford, Mass., 1989.
27. LISKOV, B., CURTIS, D., JOHNSON, P., AND SCHEIFLER, R. Implementation of Argus. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (Nov. 1987), 111-122.
28. LISKOV, B., AND SHRIRA, L. Promises: Linguistic support for efficient asynchronous procedure calls in distributed systems. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation (June 1988), 260-267.
29. LISKOV, B. Distributed programming in Argus. Commun. ACM 31, 3 (Mar. 1988), 300-312.
30. NELSON, B. J. Remote procedure call. Ph.D. dissertation, Dept. of Computer Science, Carnegie-Mellon Univ., May 1981.
31. OUSTERHOUT, J. K. Partitioning and cooperation in a distributed multiprocessor operating system. Ph.D. dissertation, Dept. of Computer Science, Carnegie-Mellon Univ., April 1980.
32. RITCHIE, D., AND THOMPSON, K. The Unix time-sharing system. Commun. ACM 17, 7 (July 1974), 365-375.
33. ROBERTS, E. S., AND VANDEVOORDE, M. T. WorkCrews: An abstraction for controlling parallelism. Tech. Rep. 42, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., April 1989.
34. ROVNER, P., LEVIN, R., AND WICK, J. On extending Modula-2 for building large, integrated systems. Tech. Rep. 3, Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., Jan. 1985.
35. SEQUENT COMPUTER SYSTEMS, INC. Symmetry Technical Summary.
36. TEVANIAN, A., RASHID, R. F., GOLUB, D. B., BLACK, D. L., COOPER, E., AND YOUNG, M. W. Mach threads and the Unix kernel: The battle for control. In Proceedings of the 1987 USENIX Summer Conference (1987), 185-197.
37. THACKER, C. P., STEWART, L. C., AND SATTERTHWAITE, E. H., JR. Firefly: A multiprocessor workstation. IEEE Trans. Comput. 37, 8 (Aug. 1988), 909-920.
38. WEISER, M., DEMERS, A., AND HAUSER, C. The portable common runtime approach to interoperability. In Proceedings of the 12th ACM Symposium on Operating Systems Principles (Dec. 1989), 114-122.

Received July 1990; revised February 1991; accepted February 1991