RPC and Client-Server Computing
Basic concepts
Implementation issues, usual optimizations
Where are the costs?
Firefly RPC, Lightweight RPC, Winsock Direct, and VIA
Reliability and consistency
Multithreading debate
Introduced by Birrell and Nelson in 1984
Pre-RPC: Most applications were built directly over the Internet primitives
Their idea: mask the distributed computing system behind a “transparent” abstraction
Looks like normal procedure call
Hides all aspects of distributed interaction
Supports an easy programming model
Today, RPC is the core of many distributed systems
Early focus was on RPC “environments”
Culminated in DCE (the Distributed Computing Environment), which standardized many aspects of RPC
Then emphasis shifted to performance; many systems improved by a factor of 10 to 20
Today, RPC is often used from object-oriented systems employing CORBA or COM standards. Reliability issues are more evident than in the past.
[Diagram: the client “binds” to the server (which has registered with the name service), then prepares and sends the request; the server receives the request, invokes the handler, and sends the reply; the client unpacks the reply.]
Server defines and “exports” a header file giving the interfaces it supports and the arguments expected; uses an “interface definition language” (IDL)
Client includes this information
Client invokes server procedures through “stubs”
Provides an interface identical to the server version
Responsible for building the request messages and interpreting the reply messages
Passes arguments by value, never by reference
May limit the total size of arguments, in bytes
Occurs when the client and server programs first start execution
Server registers its network address with name directory, perhaps with other information
Client scans directory to find appropriate server
Depending on how RPC protocol is implemented, may make a “connection” to the server, but this is not mandatory
We say that data is “marshalled” into a message and “demarshalled” from it
Representation needs to deal with byte-ordering issues (big-endian versus little-endian), strings (some CPUs require padding), alignment, etc.
Goal is to be as fast as possible on the most common architectures, yet must also be very general
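As a minimal sketch of the byte-ordering concern, Python's struct module can marshal arguments into a canonical big-endian (“network order”) form that decodes identically on any host; the field layout here is invented for illustration, not taken from any real RPC standard.

```python
import struct
import sys

# Hypothetical on-the-wire layout for one request: a 32-bit procedure
# number, a 32-bit integer argument, and a length-prefixed UTF-8 string.
# "!" forces big-endian ("network order"), so the bytes are identical
# whether the host CPU is little-endian (x86) or big-endian.
def marshal_request(proc_id: int, count: int, name: str) -> bytes:
    data = name.encode("utf-8")
    return struct.pack("!IIH", proc_id, count, len(data)) + data

def demarshal_request(msg: bytes):
    proc_id, count, length = struct.unpack_from("!IIH", msg, 0)
    name = msg[10:10 + length].decode("utf-8")   # header is 4+4+2 = 10 bytes
    return proc_id, count, name

if __name__ == "__main__":
    wire = marshal_request(7, 42, "hello")
    print(sys.byteorder, wire.hex())   # same hex dump on any host
    print(demarshal_request(wire))     # (7, 42, 'hello')
```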
Client builds a message containing arguments, indicates what procedure to invoke
Due to the need for generality, data representation is a potentially costly issue!
Performs a send I/O operation to send the message
Performs a receive I/O operation to accept the reply
Unpacks the reply from the reply message
Returns result to the client program
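These steps could look roughly like the hand-written stub below, assuming UDP transport, a hypothetical add procedure, and a matching server at an invented address; a real IDL compiler would generate the equivalent code.

```python
import socket
import struct

SERVER_ADDR = ("127.0.0.1", 9999)   # hypothetical server address
PROC_ADD = 1                        # hypothetical procedure number

def add(x: int, y: int) -> int:
    """Client stub: looks like a local call, but is actually an RPC."""
    # 1. Marshal the arguments (by value) into a request message.
    request = struct.pack("!Iii", PROC_ADD, x, y)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)
    try:
        # 2. Send the request (one system call) ...
        sock.sendto(request, SERVER_ADDR)
        # 3. ... and block in receive for the reply (a second system call).
        reply, _ = sock.recvfrom(1024)
    finally:
        sock.close()
    # 4. Demarshal the reply and return it as if the call were local.
    # (Assumes the hypothetical server replies with one 4-byte big-endian int.)
    (result,) = struct.unpack("!i", reply)
    return result
```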
Allocation and marshalling data into message
(can reduce costs if you are certain the client and server have identical data representations)
Two system calls, one to send, one to receive, hence context switching
Much copying all through the O/S: application to UDP, UDP to IP, IP to ethernet interface, and back up to application
Studied RPC performance in O/S kernel
Suggested a series of major optimizations
Resulted in performance improvements of about 10-fold for the DEC SRC Firefly workstation (from 10 ms to below 1 ms)
Compile the stub “inline” to put arguments directly into message
Two versions of stub; if (at bind time) the sender and destination are found to have the same data representation, use the host-specific representation
Use a special “send, then receive” system call (requires an O/S extension)
Optimize the O/S kernel path itself to eliminate copying – treat RPC as the most important task the kernel will do
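The two-stub idea could be sketched as follows: at bind time the peers are assumed to exchange their native byte order, and the stub picks the cheap host-specific packing only when they match. The formats and the negotiation step are invented for illustration.

```python
import struct
import sys

# At bind time the client and server (hypothetically) exchange their native
# byte order. If they match, the stub uses the faster host-specific
# representation and skips byte swapping; otherwise it falls back to the
# canonical big-endian form that every host can decode.
def choose_packer(peer_byteorder: str):
    if peer_byteorder == sys.byteorder:
        fmt = "=ii"      # native representation, no conversion cost
    else:
        fmt = "!ii"      # canonical network order, always understood
    return lambda x, y: struct.pack(fmt, x, y)

pack_args = choose_packer("little")   # value learned during binding
message = pack_args(3, 4)
```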
RPC is transparent for simple calls with a small amount of data passed
“Transparent” in the sense that the interface to the procedure is unchanged
But exceptions thrown will include new exceptions associated with network
What about complex structures, pointers, big arrays?
These will be very costly, and perhaps impractical to pass as arguments
Most implementations limit the size and types of RPC arguments. Very general systems are less limited, but much more costly.
[Diagram: the client sends the request and waits; on each timeout it retransmits. The server acks the request, ignores the duplicate request, then sends the reply; the client acks the reply.]
Acks are expensive. Try to avoid them, e.g., if the reply will be sent quickly, suppress the initial ack
Retransmission is costly. Try to tune the delay to be “optimal”
For big messages, send packets in bursts and ack a burst at a time, not one by one
[Diagram: the client sends the request as a burst of packets; the server acks the entire burst, then sends the reply; the client acks the reply.]
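A client-side sketch of the timeout/retransmission logic under these assumptions: UDP transport, invented timeout and retry values, and a server that answers quickly enough that the reply itself serves as the ack (the explicit request-ack path is omitted).

```python
import socket

def call_with_retransmit(sock, request: bytes, server, retries=5, timeout=0.5):
    """Client side: retransmit on timeout; the reply doubles as the ack."""
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(request, server)       # (re)send the request
        try:
            reply, _ = sock.recvfrom(2048)
            return reply                   # reply received: no explicit ack needed
        except socket.timeout:
            continue                       # timed out: retransmit
    raise TimeoutError("no reply after retransmissions")
```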
At most once: request is processed 0 or 1 times
Exactly once: request is always processed 1 time
At least once: request processed 1 or more times
... but exactly once is impossible because we can’t distinguish packet loss from true failures!
In both cases, RPC protocol simply times out.
Use a timer (clock) value and a unique id, plus sender address
Server remembers recent ids and replies with the same data if a request is repeated
Also uses id to identify duplicates and reject them
Very old requests detected and ignored by checking time
Assumes that the clocks are working
In particular, requires “synchronized” clocks
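A server-side sketch of this duplicate-suppression scheme, with an invented message layout (sender, request id, timestamp) and an illustrative 60-second staleness window standing in for the clock assumptions above.

```python
import time

MAX_AGE = 60.0   # requests older than this are treated as stale (illustrative value)

class AtMostOnceServer:
    def __init__(self):
        self.replies = {}   # (sender, request_id) -> cached reply

    def handle(self, sender, request_id, timestamp, body, do_work):
        # Very old requests are detected and ignored (assumes the client and
        # server clocks are roughly synchronized).
        if time.time() - timestamp > MAX_AGE:
            return None
        key = (sender, request_id)
        # A repeated id from the same sender is a duplicate: resend the
        # original reply instead of executing the handler a second time.
        if key in self.replies:
            return self.replies[key]
        reply = do_work(body)
        self.replies[key] = reply
        # (a real server would also prune entries older than MAX_AGE)
        return reply
```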
Restrictions on argument sizes and types
New error cases:
Bind operation failed
Request timed out
Argument “too large” can occur if, e.g., a table grows
Costs may be very high
... so RPC is actually not very transparent!
Often, the destination is right on the caller’s machine!
Caller builds message
Issues send system call, blocks, context switch
Message copied into kernel, then out to dest.
Dest is blocked... wake it up, context switch
Dest computes result
Entire sequence repeated in reverse direction
If scheduler is a process, context switch 6 times!
[Diagram (destination on the same site): the source does xyz(a, b, c) while the destination and the O/S are blocked; source and destination both block while the O/S runs its scheduler and copies the message from the source’s out-queue to the destination’s in-queue; the destination then runs and copies in the message; the same sequence is needed to return the results.]
Lightweight RPC (LRPC): for the case where sender and destination are on the same machine (Bershad et al.)
Uses memory mapping to pass data
Reuses same kernel thread to reduce context switching costs (user suspends and server wakes up on same kernel thread or “stack”)
Single system call: send_rcv or rcv_send
[Diagram: the O/S and the destination are initially idle; when the source does xyz(a, b, c), control passes directly to the destination, with the arguments directly visible through the remapped memory.]
On same platform, offers about a 10-fold improvement over a hand-optimized RPC implementation
Does two memory remappings, no context switch
Runs about 50 times faster than the standard RPC from the same vendor (at the time of the research)
Semantics stronger: easy to ensure exactly once
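Only the shared-memory argument passing can be imitated from user space; this Unix-only sketch maps one anonymous region before forking so the “server” process reads the arguments in place, with pipes standing in for the kernel’s thread hand-off (the single send_rcv trap has no user-level equivalent). The layout and offsets are invented.

```python
import mmap
import os
import struct

# A shared anonymous region mapped before fork(): both processes see the
# same pages, so arguments never need to be copied into a message.
region = mmap.mmap(-1, 4096)
req_r, req_w = os.pipe()   # client -> server: "request ready"
rep_r, rep_w = os.pipe()   # server -> client: "reply ready"

pid = os.fork()
if pid == 0:                                    # child acts as the "server"
    os.read(req_r, 1)                           # wait for the request signal
    x, y = struct.unpack_from("=ii", region, 0) # args read in place, no copy
    struct.pack_into("=i", region, 8, x + y)    # result written in place
    os.write(rep_w, b"d")                       # signal completion
    os._exit(0)
else:                                           # parent acts as the "client"
    struct.pack_into("=ii", region, 0, 3, 4)    # place arguments in the region
    os.write(req_w, b"r")                       # "call" the server
    os.read(rep_r, 1)                           # wait for the reply signal
    (result,) = struct.unpack_from("=i", region, 8)
    print(result)                               # 7
    os.waitpid(pid, 0)
```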
Peterson: tool for speeding up layered protocols
Observation: buffer management is a major source of overhead in layered protocols (ISO style)
Solution: uses memory management, protection to “cache” buffers on frequently used paths
Stack layers effectively share memory
Tremendous performance improvement seen
[Diagram: control flows through a stack of layers, or a pipeline of processes. Without the optimization, data is copied from each “out” buffer to the next “in” buffer. With it, data is placed into an “out” buffer whose pages are mapped into both address spaces but protected against access; the buffer is then remapped to eliminate the copy, and the “in” buffer is reused as the next “out” buffer.]
Although this specific system is not widely used:
Most kernels use similar ideas to reduce costs of in-kernel layering
And many application-layer libraries use the same sorts of tricks to achieve clean structure without excessive overheads from layer crossing
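The buffer-sharing idea can be imitated at application level with Python memoryview: each layer hands the next a view into the same buffer instead of copying the payload. The two layers and header size below are invented.

```python
# Each protocol layer strips its header by narrowing a memoryview rather
# than copying the payload into a fresh "in" buffer for the next layer.
def transport_layer(packet: memoryview) -> memoryview:
    # pretend the first 8 bytes are this layer's header
    return packet[8:]          # a view into the same buffer, not a copy

def application_layer(payload: memoryview) -> bytes:
    return bytes(payload)      # only the final consumer materialises the data

raw = bytearray(b"HDRHDR00" + b"actual application data")
view = memoryview(raw)
data = application_layer(transport_layer(view))
print(data)                    # b'actual application data'
```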
Concept developed by Culler and von Eicken for parallel machines
Assumes the sender knows all about the dest, including memory layout, data formats
Message header gives address of handler
Applications copy directly into and out of the network interface
Even with optimizations, standard RPC requires about 1000 instructions to send a null message
Active messages: as few as 6 instructions!
One-way latency as low as 35 µs
But model works only if “same program” runs on all nodes and if application has direct control over communication hardware
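A toy sketch of the active-message idea: the message header carries the index of a handler in a table known to every node, and the receiver jumps straight to it with no protocol-stack demultiplexing. The handler table, message format, and in-memory “network” are invented.

```python
import struct

# Handler table agreed on by all nodes ("the same program runs everywhere").
def handle_put(payload: bytes):   print("put", payload)
def handle_get(payload: bytes):   print("get", payload)

HANDLERS = [handle_put, handle_get]

def send(dest_queue, handler_index: int, payload: bytes):
    # The header is just the handler's index in the shared table.
    dest_queue.append(struct.pack("!I", handler_index) + payload)

def receive(queue):
    msg = queue.pop(0)
    (index,) = struct.unpack_from("!I", msg, 0)
    HANDLERS[index](msg[4:])      # invoke the named handler directly

node_queue = []                   # stands in for the network interface
send(node_queue, 0, b"x=1")
receive(node_queue)               # prints: put b'x=1'
```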
Low-latency, high-performance communication for ATM on normal UNIX machines, later extended to fast Ethernet
Developed by von Eicken, Vogels, and others at Cornell (1995)
Idea is that application and ATM controller share memory-mapped region. I/O done by adding messages to queue or reading from queue
Latency reduced 50-fold relative to UNIX, throughput 10-fold better for small messages!
Normally, data flows through the O/S to the driver, then is handed to the device controller
In U-Net the device controller sees the data directly in the shared memory region
Normal architecture gets protection from trust in kernel
U/Net gets protection using a form of cooperation between controller and device driver
Reprogram ATM controller to understand special data structures in memory-mapped region
Rebuild ATM device driver to match this model
Pin shared memory pages, leave them mapped into the I/O DMA map
Disable memory caching for these pages (else changes won’t be visible to ATM)
User’s address space has a direct-mapped communication region
ATM device controller sees whole region and can transfer directly in and out of it
... organized as an in-queue, an out-queue, and a free list
No user can see contents of any other user’s mapped I/O region (U-Net controller sees whole region but not the user programs)
Driver mediates to create “channels”, user can only communicate over channels it owns
U-Net controller uses channel code on incoming/outgoing packets to rapidly find the region in which to store them
With space available, has the same properties as the underlying ATM (which should be nearly 100% reliable)
When queues fill up, will lose packets
Also loses packets if the channel information is corrupted, etc
Build message in a preallocated buffer in the shared region
Enqueue descriptor on “out queue”
ATM immediately notices and sends it
Remote machine was polling the “in queue”
ATM builds descriptor for incoming message
Application sees it immediately: 35 µs latency
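A rough user-level imitation of the shared communication region, with invented names and sizes: preallocated buffers, a free list, an out-queue of descriptors that the controller would drain, and an in-queue the application polls. The real mechanism lives in the reprogrammed ATM controller, not in application code.

```python
from collections import deque

BUF_SIZE = 2048
NUM_BUFS = 8

class Endpoint:
    """User-visible communication region: preallocated buffers + three queues."""
    def __init__(self):
        self.buffers = [bytearray(BUF_SIZE) for _ in range(NUM_BUFS)]
        self.freelist = deque(range(NUM_BUFS))   # indices of unused buffers
        self.out_queue = deque()                 # descriptors awaiting transmit
        self.in_queue = deque()                  # descriptors received

    def send(self, data: bytes):
        i = self.freelist.popleft()              # build the message in a
        self.buffers[i][:len(data)] = data       # preallocated buffer
        self.out_queue.append((i, len(data)))    # enqueue a descriptor; the
                                                 # controller would notice it

    def poll(self):
        # The application polls the in-queue; no system call is involved.
        if self.in_queue:
            i, length = self.in_queue.popleft()
            data = bytes(self.buffers[i][:length])
            self.freelist.append(i)              # recycle the buffer
            return data
        return None
```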
von Eicken and Vogels support IP, UDP, and TCP over U-Net
These versions run the TCP stack in user space!
Later in the course we will look at other complex protocols over U-Net
A Windows consortium (Microsoft, Intel, and others) commercialized U-Net as the Virtual Interface Architecture (VIA)
Runs in NT Clusters
But most applications run over UNIX-style sockets (“Winsock” interface in NT)
Winsock Direct automatically senses and uses VIA where available
Today it is widely used on clusters and may be a key reason that they have been successful
RPC is not very transparent
Failure handling is not evident at all: if an RPC times out, what should the developer do?
Reissuing the request only makes sense if there is another server available
Anyhow, what if the request was finished but the reply was lost? Do it twice? Try to duplicate the lost reply?
Performance work is producing enormous gains: from the old 75 ms RPC to RPC over U-Net with a 75 µs round-trip time: a factor of 1000!
Standards for data representation
Stub compilers, IDL databases
Services to manage server directory, clock synchronization
Tools for visualizing system state and managing servers and applications
Multithreading is a common performance-enhancing technique
Idea is that server is often idle while doing I/O for one client, so use extra threads to allow concurrent request processing
In the limit, leads to database transactional concurrency model, but many nontransactional servers use threads for enhanced performance
Three major options:
Single-threaded server: only does one thing at a time, uses send/recv system calls and blocks while waiting
Multi-threaded server: internally concurrent, each request spawns a new thread to handle it
Upcalls: an event dispatch loop does a procedure call for each incoming event, as in X11 or PCs running Windows
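A minimal sketch of the second option, a thread-per-request server over TCP with an invented port and trivial echo handler; the single-threaded variant would simply call handle() inline in the accept loop.

```python
import socket
import threading

def handle(conn, addr):
    """Runs in its own thread, so a slow client does not block the others."""
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)          # echo the request back as the "reply"

def serve(port=9999):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", port))
        srv.listen()
        while True:
            conn, addr = srv.accept()
            # Multi-threaded option: each request spawns a new thread.
            threading.Thread(target=handle, args=(conn, addr), daemon=True).start()
```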
Applications can deadlock if a request cycle forms:
I’m waiting for you and you send me a request, which I can’t handle
Much of system may be idle waiting for replies to pending requests
Harder to implement RPC protocol itself (need to use a timer interrupt to trigger acks, retransmission, which is awkward)
Idea is to support internal concurrency as if each process was really multiple processes that share one address space
Thread scheduler uses timer interrupts and context switching to mimic a physical multiprocessor using the smaller number of CPUs actually available
Each incoming request is handled by spawning a new thread
Designer must implement appropriate mutual exclusion to guard against “race conditions” and other concurrency problems
Ideally, the server is more active because it can process new requests while waiting for its own RPCs to complete on other pending requests
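The mutual-exclusion point in concrete form, using an invented shared counter: without the lock, two concurrent handlers can interleave the read-modify-write and lose an update.

```python
import threading

request_count = 0
count_lock = threading.Lock()

def handler(body):
    global request_count
    # Without the lock, two threads could both read the same value,
    # both add one, and one increment would be lost (a race condition).
    with count_lock:
        request_count += 1
    return body
```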
Users may have little experience with concurrency and will then make mistakes
Concurrency bugs are very hard to find due to nonreproducible scheduling orders
Reentrancy can come as an undesired surprise
Threads need stacks, hence memory consumption can be very high
Deadlock remains a risk, now associated with concurrency control
Stacks for threads must be finite and can overflow, corrupting the address space
[Diagram: each SCHED event causes a new thread to be spawned, which then blocks; as SCHED events keep arriving, the application eventually becomes bloated and begins to thrash. Performance drops and clients may think the server has failed.]
Common in windowing systems
Each incoming “event” is encoded as a small descriptive data structure
User registers event handling procedures
Dispatch loop calls the procedures as new events arrive, waits for the call to finish, then dispatches a new event
Perhaps the best model for RPC programming
Each handler can be tagged: needs a thread, or can be executed “unthreaded”
Developer must still be very careful where threads are used
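A sketch of such a dispatch loop under these assumptions: handlers are registered per event type with a tag, untagged (cheap) handlers run inline in the loop, and tagged ones get their own thread. Event types and the queue are invented.

```python
import queue
import threading

events = queue.Queue()
handlers = {}                      # event type -> (function, needs_thread)

def register(event_type, fn, needs_thread=False):
    handlers[event_type] = (fn, needs_thread)

def dispatch_loop():
    while True:
        event_type, data = events.get()
        fn, needs_thread = handlers[event_type]
        if needs_thread:
            # Long-running handler: give it its own thread.
            threading.Thread(target=fn, args=(data,), daemon=True).start()
        else:
            # Cheap handler: run "unthreaded", directly in the dispatch loop.
            fn(data)
```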
RPC was once touted as the transparent answer to distributed computing
Today the protocol is very widely used
... but it isn’t very transparent, and reliability issues can be a major problem
Today the strongest interest is in Web Services and CORBA, which use RPC as the mechanism to implement object invocation