Remote Procedure Call: An Effective Primitive for Distributed Computing
Seth James Nielson

What is RPC?
– Procedure calls transfer control within local memory
– RPCs transfer control to remote machines
(Diagram: a local address space containing Main, Proc A, Proc B, and unused memory)

Why RPC?
RPC is an effective primitive for distributed systems because of
– Clean/simple semantics
– Communication efficiency
– Generality

How it Works (Idealized Example)
The client executes c = encrypt(msg) exactly as it would a local call. The implementation sends a request to a server equipped with specialized hardware and an encryption key, waits while the server computes encrypt(msg), and returns the result in the response.

Early History of RPC
– 1976: early reference in the literature
– 1976–1984: few full implementations
– Feb 1984: Cedar RPC – A. Birrell and B. Nelson at Xerox, "Implementing Remote Procedure Calls"

Imagine our Surprise…
"In practice, … several areas [of RPC] were inadequately understood"

RPC Design Issues
1. Machine/communication failures
2. Address-containing arguments
3. Integration into existing systems
4. Binding
5. Suitable protocols
6. Data integrity/security

Birrell and Nelson Aims
Primary aim
– Easy distributed computation
Secondary aims
– Efficient (with powerful semantics)
– Secure

Fundamental Decisions
1. No shared address space among computers
2. Semantics of remote procedure calls should be as close as possible to local procedure calls
Note that the first decision partially violates the second…

Binding
– Binds an importer to an exporter
– Interface name: type/instance
– Uses the Grapevine database to locate an appropriate exporter
– Bindings (based on a unique ID) break if the exporter crashes and restarts

Unique ID
– At binding, the importer learns the exported interface's unique ID (UID)
– The UID is initialized from a real-time clock at system start-up
– If the system crashes and restarts, the UID will be a new unique number
– The change in UID breaks existing connections

How Cedar RPC Works
(Sequence diagram: the server exports its interface through the server stub and RPCRuntime, which updates Grapevine via setConnect and addmember. The caller imports the interface; its RPCRuntime looks up the exporter in Grapevine and binds, each side recording the binding under a connection ID. A call x = F(y) then flows user → user stub → RPCRuntime → network → RPCRuntime → server stub → server, and the result returns along the reverse path.)

Packet-Level Transport Protocol
– Primary goal: minimize the time between initiating a call and getting results
– NOT general – designed specifically for RPC
– Why? A possible 10X performance gain
– No upper bound on waiting for results
– Error semantics: the user does not know whether the machine crashed or the network failed

Creating RPC-enabled Software
(Diagram: Lupine generates the user stub and server stub from the developer's interface modules. The client program on the client machine comprises user code, user stub, and RPCRuntime; the server program on the server machine comprises server code, server stub, and RPCRuntime. A sketch of this stub structure appears after THE NEED FOR SPEED below.)

Making it Faster
– Simple calls (the common case): all of the arguments fit in a single packet
– A server's reply and the client's next RPC each operate as an implicit ACK (see the second sketch after THE NEED FOR SPEED below)
– Explicit ACKs are required only if a call runs long or there is a long interval between calls

Simple Calls
(Diagram: the client sends Call; the server's Response doubles as the ACK; the client's next Call doubles as the ACK of that response; and so on.)

Complex Calls
(Diagram: the client sends Call in packet 0, which the server ACKs; the client sends Data packet 1, which the server ACKs; the client sends Data packet 2; the server's Response doubles as that packet's ACK and is itself acknowledged by an explicit ACK or a new call.)

Keeping it Light
– A connection is just shared state
– Reduce process creation/swapping
  – Maintain idle server processes
  – Each packet carries a process identifier to reduce swaps
  – The full scheme results in no processes created and four process swaps per call
– RPC runs directly on top of Ethernet

Performance
Number of Args/Results    Elapsed Time
0                         1097 µs
100                       1278 µs
100-word array            2926 µs

THE NEED FOR SPEED
RPC performance cost is a barrier: Cedar RPC requires about 1.1 ms (1097 µs) for a 0-arg call! Peregrine RPC (about nine years later) manages a 0-arg call in 573 µs.
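
The following minimal sketch, in C, makes the stub mechanism above concrete. It is an illustration under stated assumptions, not Cedar's code: the names (CallPacket, encrypt_impl, server_dispatch) are invented, XOR stands in for real encryption, and the network transport is simulated by a direct function call where RPCRuntime would actually transmit and retransmit packets.

    /* Hypothetical sketch of a Lupine-style stub pair; illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    /* Wire format for a "simple call": everything fits in one packet.   */
    typedef struct {
        uint32_t call_id;  /* matches a response to its request           */
        uint32_t proc;     /* which exported procedure to invoke          */
        int32_t  arg;      /* single-packet calls carry arguments inline  */
    } CallPacket;

    /* The real procedure, on the "server with specialized hardware"
     * from the idealized example (XOR stands in for encryption).        */
    static int32_t encrypt_impl(int32_t msg) { return msg ^ 0x5EC2E7; }

    /* Server stub: unmarshal the packet, dispatch, return the result.   */
    static int32_t server_dispatch(const CallPacket *req) {
        switch (req->proc) {
        case 1:  return encrypt_impl(req->arg);
        default: return -1;  /* procedure not in this interface          */
        }
    }

    /* User stub: looks exactly like a local call to the caller, but
     * marshals the argument, "transmits", and blocks for the reply.     */
    static int32_t encrypt(int32_t msg) {
        static uint32_t next_id = 0;
        CallPacket req = { ++next_id, 1, msg };    /* marshal             */
        /* Transport elided: RPCRuntime would retransmit the packet
         * until the response arrives, with no upper bound on waiting.   */
        return server_dispatch(&req);              /* unmarshal result    */
    }

    int main(void) {
        int32_t c = encrypt(42);   /* reads like c = encrypt(msg) above   */
        printf("c = %d\n", (int)c);
        return 0;
    }

Lupine's contribution was generating exactly this boilerplate automatically from the interface modules, so neither the user code nor the server code ever sees a packet.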
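
A second sketch, also hypothetical, shows the server-side bookkeeping behind the implicit-ACK rule from the Making it Faster and Simple Calls slides: the server retains its last response for possible retransmission, and a later call from the same client implicitly acknowledges it. The Activity structure and function names are invented for illustration.

    /* Hypothetical sketch of implicit-ACK bookkeeping for simple calls. */
    #include <stdio.h>
    #include <stdint.h>

    /* Server-side state for one client "activity": the last response
     * is retained for retransmission until something acknowledges it.  */
    typedef struct {
        uint32_t last_call_id;   /* highest call seen from this client   */
        int      response_held;  /* still awaiting an ACK for it?        */
    } Activity;

    /* A new call implicitly ACKs the previous response: the client
     * could not have issued call N+1 without receiving the result of
     * call N, so the retained copy can be discarded.                    */
    static void on_call(Activity *a, uint32_t call_id) {
        if (call_id > a->last_call_id && a->response_held) {
            printf("call %u implicitly ACKs response %u\n",
                   (unsigned)call_id, (unsigned)a->last_call_id);
            a->response_held = 0;   /* free the retained response        */
        }
        a->last_call_id  = call_id;
        a->response_held = 1;       /* hold this call's response in turn */
    }

    /* An explicit ACK is only needed when no further call arrives soon. */
    static void on_explicit_ack(Activity *a, uint32_t call_id) {
        if (a->response_held && call_id == a->last_call_id) {
            printf("explicit ACK for response %u\n", (unsigned)call_id);
            a->response_held = 0;
        }
    }

    int main(void) {
        Activity a = {0, 0};
        on_call(&a, 1);           /* first call; response 1 held          */
        on_call(&a, 2);           /* implicitly ACKs response 1           */
        on_explicit_ack(&a, 2);   /* idle client must ACK explicitly      */
        return 0;
    }

The payoff, per the Simple Calls diagram, is that in the common case a call costs only two packets on the wire: the call and the response.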
A Few Definitions
– Hardware latency – the sum of the call and result network penalties
– Network penalty – the time to transmit (greater than the network transmission time)
– Network transmission time – transmission at raw network speed
– Network RPC – RPC between two machines
– Local RPC – RPC between separate threads on the same machine

Peregrine RPC
– Supports the full functionality of RPC
– Network RPC performance close to the hardware latency
– Also supports efficient local RPC

Messing with the Guts
– Three general optimizations
– Three RPC-specific optimizations

General Optimizations
1. Transmitted arguments avoid copies
2. No data-representation conversion for a client and server with the same representation
3. Packet header templates avoid recomputing headers on every call (see the sketch at the end of these notes)

RPC-Specific Optimizations
1. No thread-specific state is saved between calls in the server
2. Server arguments are mapped (not copied)
3. No copying in the critical path of multi-packet arguments

I think this is COOL
– To avoid copying the arguments of a single-packet RPC, Peregrine instead uses the packet buffer itself as the server thread's stack
– Any pointers are replaced with server-appropriate pointers (Cedar RPC didn't support this…)

This is cool too
– Multi-packet RPCs use a blast protocol (selective retransmission)
– Data is transmitted in parallel with data copying
– The last packet is mapped into place

Fast Multi-Packet Receive
(Diagram: packet 0 is sent last; at the server its data is remapped into place on a page boundary, while the data from packets 1–3 is copied into the buffer as it arrives.)

Peregrine 0-Arg Performance
System             Latency     Throughput
Cedar              1097 µs     2.0 Mbps
Amoeba**           1100 µs     6.4 Mbps
x-kernel           1730 µs     7.1 Mbps
V-System           2540 µs     4.4 Mbps
Firefly (5 CPU)    2660 µs     4.6 Mbps
Sprite             2800 µs     5.7 Mbps
Firefly (1 CPU)    4800 µs     2.5 Mbps
SunRPC**           6700 µs     2.7 Mbps
Peregrine           573 µs     8.9 Mbps

Peregrine Multi-Packet Performance
Procedure                Network Penalty (ms)    Latency (ms)    Throughput (Mbps)
3000-byte in RPC          2.71                    3.20            7.50
3000-byte in-out RPC      5.16                    6.04            7.95
48000-byte in RPC        40.96                   43.33            8.86
48000-byte in-out RPC    81.66                   86.29            8.90

Cedar RPC Summary
– Cedar RPC introduced practical RPC
– Demonstrated easy semantics
– Identified the major design issues
– Established RPC as an effective primitive

Peregrine RPC Summary
– Same RPC semantics (with the addition of pointers)
– Significantly faster than Cedar RPC and others
– General optimizations (e.g., precomputed headers)
– RPC-specific optimizations (e.g., no copying in the multi-packet critical path)

Observations
– RPC is a very "transparent" mechanism – it acts like a local call
– However, RPC requires a deep understanding of the hardware to tune
– In short, RPC requires sophistication in its presentation as well as its operation to be viable
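
To close, here is a small C sketch of the header-template optimization named in the General Optimizations list. The layout and field names are invented for illustration; Peregrine's real headers are network protocol headers built inside the kernel. The point is only the technique: fill in every constant field once at bind time, so per-call work shrinks to a block copy plus patching the few fields that change.

    /* Hypothetical sketch of precomputed packet header templates.       */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Invented header layout standing in for a real network header.     */
    typedef struct {
        uint8_t  dst[6], src[6];   /* constant for a given binding        */
        uint16_t ethertype;        /* constant                            */
        uint32_t call_id;          /* varies per call                     */
        uint16_t payload_len;      /* varies per call                     */
    } Header;

    static Header template;        /* computed once, at bind time         */

    /* Bind time: fill in every field that will not change between calls. */
    static void init_template(const uint8_t dst[6], const uint8_t src[6]) {
        memset(&template, 0, sizeof template);
        memcpy(template.dst, dst, 6);
        memcpy(template.src, src, 6);
        template.ethertype = 0x0800;   /* stand-in protocol number        */
    }

    /* Call time: one struct copy plus two stores, instead of rebuilding
     * the whole header from scratch on every call.                      */
    static void make_header(Header *out, uint32_t call_id, uint16_t len) {
        *out = template;
        out->call_id     = call_id;
        out->payload_len = len;
    }

    int main(void) {
        const uint8_t dst[6] = {0x02,0,0,0,0,1}, src[6] = {0x02,0,0,0,0,2};
        init_template(dst, src);
        Header h;
        make_header(&h, 1, 128);   /* per-call cost is now tiny           */
        printf("call %u carries %u bytes\n",
               (unsigned)h.call_id, (unsigned)h.payload_len);
        return 0;
    }

Individually this saves little, but it is exactly the kind of per-call shaving that, combined with the other five optimizations, lets Peregrine approach the hardware latency.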