Lightweight Remote Procedure Call (Bershad et al.)
Andy Jost
CS 533, Winter 2012

Introduction
• Preliminary definitions
  – Monolithic kernels and microkernels
  – Capability systems
  – Remote procedure calls (RPC)
• Motivation
  – Analysis of the common case
  – Performance of RPC
  – Sources of overhead
• Lightweight RPC (LRPC)
• Performance

OS Kernel Paradigms
• Monolithic OS
  – All (or nearly all) services built into the kernel
  – One level of protection, but typically no internal firewalls
  – E.g., BSD UNIX (millions of LOC)
  – HW is exposed to a great deal of complex software
  – Hard to debug and extend
• Microkernel (or “small-kernel”) OS
  – The kernel provides only the minimum services necessary to support independent application programs (address space management, thread management, IPC)
  – Additional services are provided by user-space daemons
  – Certain daemons are imbued with special permission (e.g., HW access) by the kernel
  – Service requests utilize IPC

Separate address spaces are used to establish protection domains:
1. No cross-domain read/write
2. The kernel mediates IPC
3. RPC can be used to give a procedural interface to IPC
(Figure: http://en.wikipedia.org/wiki/File:OS-structure.svg)

Capability Systems
• A capability system is a security model
  – A security model specifies and enforces a security policy
• Provides a fine-grained protection model
• A capability is a communicable, unforgeable token of authority representing an access right
• In a capability system, the exchange of capabilities among mutually untrusting entities is used to manage privileged access throughout the system

One Possibility vs. a Capability System
• One possibility (ACL check): the application calls write(“/etc/passwd”); on whose authority do we write /etc/passwd? The kernel must consult an access control list
• Capability system: the application calls fd = open(“/etc/passwd”, O_RDWR) and later write(fd); the open file descriptor proves that write access was previously granted

Remote Procedure Call
• An IPC mechanism that allows a program to invoke a subroutine in another address space
  – The receiver might reside on the same physical system or on another machine across a network
• Provides a large-grained protection model
• The call semantics make it appear as though only a normal procedure call was performed
  – Stubs interface to a runtime environment, which handles data marshalling; the OS handles low-level IPC (see the stub sketch below)
  – Protection domain boundaries are hidden by stubs

Steps in a Traditional RPC
• Sending path: Client Application → Client Stub → Client Runtime Library → Client Kernel → transport layer → Server Kernel → Server Runtime Library → Server Stub → Server Application
• The return path retraces the same layers in reverse
• The kernels and transport layer are potentially shared in a single-system RPC
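To make the stub layer shown above concrete, here is a rough, hypothetical sketch (in C) of what a generated client stub for a trivial Add procedure might look like. The message layout, the procedure number, and the ipc_send_receive() helper are invented for illustration (they are not the paper's code or any real RPC runtime's API), and the server side is simulated in-process so the example is self-contained.

    /* Hypothetical sketch of a generated RPC client stub for
     *     int Add(int a, int b);
     * The wire format and ipc_send_receive() are stand-ins for the real
     * runtime, which would trap into the kernel to carry the message to
     * the server's domain and back. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct message {            /* assumed wire format                    */
        uint32_t proc_id;       /* which exported procedure to invoke     */
        int32_t  args[2];       /* marshalled arguments                   */
        int32_t  result;        /* filled in along the return path        */
    };

    /* Stand-in for kernel-mediated IPC: in a real system this is where
     * the traps, message copies, and context switch would occur.         */
    static void ipc_send_receive(struct message *m)
    {
        if (m->proc_id == 2)              /* server-side dispatch         */
            m->result = m->args[0] + m->args[1];
    }

    /* The generated client stub: marshal, "send", unmarshal.             */
    static int Add(int a, int b)
    {
        struct message m;
        memset(&m, 0, sizeof m);
        m.proc_id = 2;                    /* hypothetical procedure id    */
        m.args[0] = a;                    /* marshal arguments            */
        m.args[1] = b;
        ipc_send_receive(&m);             /* domain crossing hidden here  */
        return m.result;                  /* unmarshal the reply          */
    }

    int main(void)
    {
        printf("Add(2, 3) = %d\n", Add(2, 3));   /* looks like a local call */
        return 0;
    }

The point of the sketch is only that the caller sees an ordinary procedure call while the stub does the marshalling; in a real system each call also pays for traps, message copies, scheduling, and a context switch, which is the overhead tallied on the slides that follow.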
The Use of RPC in Microkernel Systems
• Small-kernel systems can and do use RPC to borrow its large-grained protection model
  – Separate components are placed in disjoint address spaces (protection domains)
  – Communication between components is mediated by RPC, using messages
  – Advantages include modularity, design simplification, failure isolation, and transparency (of network services)
• But this approach simultaneously borrows the control-transfer facilities of RPC
  – Those are not optimized for same-machine control transfer
  – This leads to an unnecessary loss of efficiency

The Use of RPC Systems (I)
Bershad argues that the common case for RPC:
  – is cross-domain (not cross-machine)
  – involves relatively simple parameters
  – can be optimized

1. Frequency of Cross-Machine Activity

Frequency of Remote Activity
Operating System | Percentage of operations that cross machine boundaries
V                | 3.0
Taos             | 5.3
Sun UNIX+NFS     | 0.6

The Use of RPC Systems (II)
2. Parameter Size and Complexity
  – 1,487,105 cross-domain procedure calls were observed during one four-day period
  – 95% were to 10 procedures; 75% were to 3 procedures
  – None of them involved complex arguments
  – Furthermore, most RPCs involve a relatively small amount of data transfer

The Use of RPC Systems (III)
3. The Performance of Cross-Domain RPC
  – The theoretical minimum time for a null cross-domain operation includes time for
    • two procedure calls
    • two traps
    • two virtual memory context switches
  – The measured cross-domain performance of the Null RPC, across six systems, ranges from over 300% to over 800% of this theoretical minimum

Sources of Overhead in Cross-Domain RPC
• Stub overhead: stubs are general enough for cross-machine RPC, but inefficient for the common case of local, cross-domain calls
• Message buffer overhead: a message may be copied client/kernel, kernel/server, server/kernel, and kernel/client
• Access validation: the kernel must validate the message sender on call and again on return
• Message transfer: messages are enqueued by the sender and dequeued by the receiver
• Scheduling: separate, concrete threads run in the client and server domains
• Context switch: a virtual memory context switch is needed in going from the client's domain to the server's
• Dispatch: the server must receive and interpret the message

Lightweight RPC (LRPC)
• LRPC aims to improve the performance of cross-domain communication relative to RPC
• The execution model is borrowed from the protected procedure call
  – Control transfer proceeds by way of a kernel trap; the kernel validates the call and establishes a linkage
  – The client provides an argument stack and its own concrete thread of execution
• The programming semantics and large-grained protection model are borrowed from RPC
  – Servers execute in private protection domains
  – Each one exports a specific set of interfaces to which clients may bind
  – By allowing a binding, the server authorizes a client to access its procedures

LRPC High-Level Design
[Figure: control flow along the sending and return paths; the client's kernel thread crosses into the server, where it runs on an E-stack, while the A-stack is mapped into both the client's and the server's virtual memory and backed by the same physical memory]

Implementation Details
• Execution of the server procedure is made by way of a kernel trap (see the sketch after this slide)
• The client provides the server with an argument stack and its own concrete thread of execution
• The argument stacks (A-stacks) are shared between client and server; the execution stacks (E-stacks) belong exclusively to the server domain
  – A-stacks and E-stacks are associated at call time
  – Each A-stack queue is guarded by a single lock
• The client must bind to an LRPC interface before using it; binding:
  – establishes shared segments between client and server
  – allocates bookkeeping structures in the kernel
  – returns a non-forgeable binding object to the client, which serves as the key for accessing the server (recall capability systems)
• On multiprocessors, domains are cached on idle processors (to reduce latency)
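As a rough illustration of the call path just described, the sketch below (in C) simulates a bind-then-call sequence in a single process. The type and function names (struct binding, struct astack, kernel_trap_lrpc, server_add) are hypothetical, not the paper's interfaces; in the real system the kernel validates the binding object, pairs the shared A-stack with an E-stack in the server's domain, and switches the client's own thread into the server, whereas here the trap is just a function call.

    /* Minimal, hypothetical sketch of an LRPC-style call.  All names are
     * invented for illustration; the real kernel work (validating the
     * binding object, choosing an E-stack, switching the client's thread
     * into the server's domain) is reduced here to a direct call. */
    #include <stdint.h>
    #include <stdio.h>

    /* Shared argument stack (A-stack): mapped into both domains. */
    struct astack {
        int32_t args[2];
        int32_t result;
    };

    /* Non-forgeable binding object returned to the client at bind time;
     * like a capability, possession of it implies the right to call. */
    struct binding {
        void (*server_proc)(struct astack *);   /* exported procedure     */
        struct astack *astack;                   /* A-stack shared at bind */
    };

    /* Server-domain procedure: runs on its own E-stack and reads its
     * arguments directly out of the shared A-stack. */
    static void server_add(struct astack *as)
    {
        as->result = as->args[0] + as->args[1];
    }

    /* Stand-in for the kernel trap on the call path. */
    static void kernel_trap_lrpc(struct binding *b)
    {
        /* Real kernel: verify the binding object, associate an E-stack,
         * then upcall into the server on the client's own thread. */
        b->server_proc(b->astack);
    }

    int main(void)
    {
        static struct astack as;                 /* the shared segment   */
        struct binding b = { server_add, &as };  /* result of binding    */

        as.args[0] = 40;                         /* client writes args   */
        as.args[1] = 2;                          /* into the A-stack     */
        kernel_trap_lrpc(&b);                    /* cross into server    */
        printf("LRPC add: %d\n", (int)as.result); /* read reply in place */
        return 0;
    }

Note the contrast with the traditional stub sketch earlier: the arguments are written once into the shared A-stack and read in place by the server, so no separate message buffers or copies are needed, and the binding object plays the role of a capability.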
Performance
• The measurements below were taken across 100,000 cross-domain calls in a tight loop
• LRPC/MP uses the domain-caching optimization for multiprocessors
• LRPC performs a context switch on each call

Table IV. LRPC Performance of Four Tests (in microseconds)
Test     | Description                                                                | LRPC/MP | LRPC | Taos
Null     | The Null cross-domain call                                                 | 125     | 157  | 464
Add      | A procedure taking two 4-byte arguments and returning one 4-byte argument | 130     | 164  | 480
BigIn    | A procedure taking one 200-byte argument                                   | 173     | 192  | 539
BigInOut | A procedure taking and returning one 200-byte argument                     | 219     | 227  | 636

Discussion Items
• When the client thread is executing an LRPC, does the scheduler know it has changed context?
• Who is the parent of the server process? What is its main thread doing?