User-Level Interprocess Communication for Shared Memory Multiprocessors
Bershad, B. N., Anderson, T. E., Lazowska, E. D., and Levy, H. M.
Presented by: SHILPI AGARWAL

OUTLINE
– InterProcess Communication
– URPC
– Its Components
  – Processor Reallocation
  – Data Transfer
  – Thread Management
– Performance
  – Its problems
  – Latency
  – Throughput
– Conclusion

IPC: INTERPROCESS COMMUNICATION
– Central to the design of operating systems.
– Communication between different address spaces on the same machine.
– Allows system decomposition across address space boundaries, giving failure isolation, extensibility, and modularity.
– The usability of separate address spaces depends on the performance of the communication primitives.

Problems:
– IPC is traditionally the responsibility of the kernel.
– Switching from one address space to another on the calling processor in order to run the receiving thread there, and then returning to the caller thread, requires kernel intervention.
– Invoking the kernel and reallocating a processor to a different address space is expensive.
– LRPC measurements indicate that 70% of the overhead can be attributed to kernel mediation.
– Performance degrades and complexity grows when user-level threads communicate across address space boundaries.

Solution: URPC for shared memory multiprocessors.
– The user-level thread package in each address space can efficiently switch to a different thread whenever the caller or callee thread blocks; the kernel can thus be eliminated from the path of cross-address space communication.
– Use shared memory to send messages directly between address spaces.
– Avoid processor reallocation (use a processor already active in the target address space).

URPC:
– A client thread invokes a procedure at the server and blocks, waiting for the reply.
– While it is blocked, another ready thread in the same address space can run on its processor.
– When the reply arrives, the blocked thread can be rescheduled on any processor allocated to its address space.
– On the server side, the call can be executed by a processor already active in the server's address space.
In LRPC: the blocked thread and the ready thread are effectively the same thread, continuing in a different address space.
In URPC: another thread from the same address space is scheduled on the client's processor.
Advantage: a same-address-space context switch has far less overhead than a processor reallocation.

URPC Division of Responsibilities:
– Processor reallocation
– Thread management
– Data transfer
Only processor reallocation requires the kernel; thread management and data transfer move to user level.

Components of URPC (figure)

Processor Reallocation: why should it be avoided?
– Deciding on and transferring a processor between threads of different address spaces:
  – requires privileged kernel mode to access protected mapping registers;
  – diminishes cache and TLB effectiveness (a long-term cost).
– A minimal-latency same-address-space context switch takes about 15 microseconds on the C-VAX, while a cross-address-space processor reallocation takes 55 microseconds (and that figure does not even include the long-term costs!).
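To make the common-case call path concrete, the following C sketch shows how a client-side URPC call might avoid the kernel, assuming a user-level thread package and a pair-wise shared-memory channel. All names here (urpc_call, pick_ready_thread, kernel_processor_donate, etc.) are illustrative assumptions, not the paper's actual interface.

```c
/*
 * Hypothetical sketch of the URPC client-side call path.  Assumes a
 * user-level thread package; names are illustrative, not the paper's API.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct urpc_channel urpc_channel_t;   /* pair-wise shared-memory channel */
typedef struct thread       thread_t;

/* Assumed user-level thread-package primitives (not from the paper). */
extern void      enqueue_message(urpc_channel_t *ch, const void *args, size_t len);
extern bool      reply_ready(urpc_channel_t *ch, thread_t *caller);
extern void      mark_blocked(thread_t *t);
extern thread_t *pick_ready_thread(void);           /* another thread, same space */
extern void      context_switch(thread_t *from, thread_t *to);  /* ~15 us class */
extern bool      server_is_underpowered(urpc_channel_t *ch);
extern void      kernel_processor_donate(urpc_channel_t *ch);   /* the only trap */
extern thread_t *current_thread;

/* A cross-address-space call: no kernel trap on the common path. */
void urpc_call(urpc_channel_t *ch, const void *args, size_t len)
{
    thread_t *caller = current_thread;

    /* 1. Put the marshaled arguments on the shared-memory channel. */
    enqueue_message(ch, args, len);

    /* 2. Optimistic policy: assume a server processor will soon pick the
     *    message up.  Only if the server looks underpowered do we pay for
     *    a kernel-mediated processor reallocation. */
    if (server_is_underpowered(ch))
        kernel_processor_donate(ch);

    /* 3. Block the caller at user level and run another ready thread in
     *    the SAME address space: a cheap context switch instead of a
     *    cross-address-space processor reallocation. */
    mark_blocked(caller);
    while (!reply_ready(ch, caller)) {
        thread_t *next = pick_ready_thread();
        if (next != NULL)
            context_switch(caller, next);
        /* When the reply arrives, the thread package reschedules the
         * caller on any processor allocated to its address space. */
    }
}
```

The only kernel involvement on this path is the optional processor donation; everything else is a user-level context switch, which is why the common-case call avoids the 55-microsecond reallocation cost entirely.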
URPC: Optimistic Reallocation Policy
Assumptions:
– The client has other work to do.
– The server will soon have a processor with which to service a message.
The policy does not perform well in all situations:
– uniprocessors;
– real-time applications;
– high-latency I/O operations (which should be initiated early);
– priority invocations.
URPC allows forced processor reallocation to handle some of these problems.

Advantages over handoff scheduling:
– In handoff scheduling, a single kernel operation blocks the client and reallocates its processor directly to the server.
– The kernel's centralized data structures (thread run queues and message channels) then become a performance bottleneck because of lock contention.

When needed, processor reallocation is done via the kernel:
– It is needed for load balancing: an idle processor on the client side can donate itself to an underpowered address space.
– The kernel is required to change the processor's virtual memory context to the underpowered address space.
– The identity of the donating processor is made known to the receiver.

Voluntary Return of Processors
– A processor should be returned to the client:
  – when all outstanding messages from the client have generated replies, or
  – when the client itself has become "underpowered."
– Voluntary return of processors cannot be enforced.
– URPC handles load balancing only among communicating applications.
– Preemptive policies, which forcibly reallocate processors from one address space to another, are required to avoid starvation.
– No global processor allocator is needed (the decision can be made by the client itself).

Sample execution
– Client: an editor.
– Two servers: a window manager and a file cache manager.
– Two threads: T1 and T2.

Data Transfer Using Shared Memory
In traditional RPC:
– The kernel copies the data between address spaces.
In URPC:
– Logical channels of pair-wise shared memory carry the data.
– Channels are created and mapped once for every client/server pairing; argument buffers are allocated and pair-wise mapped during binding.
– Applications access URPC procedures through a stub layer; the stubs copy data in and out, so applications make no direct use of the shared memory.
– Data queues are monitored by the application-level thread management.
– A bidirectional shared-memory queue with test-and-set locks is used for data flow (a sketch appears at the end of this section).
– Clients and servers can abuse each other (deny service, fail to release channel locks, provide bogus results); it is up to higher-level protocols to filter such abuses at the application level.

Thread Management
– There is a strong interaction between thread management (start, stop) and cross-address-space communication (send, receive).
– This close interaction can be exploited to achieve very good performance for both when they are implemented together at user level.
– Thread management facilities can be provided at either kernel or user level, but high performance requires user level.
– Thread overhead can be judged against three points of reference:
  – Heavyweight: no distinction between a thread and its address space.
  – Middleweight: threads and address spaces are decoupled, but threads are managed by the kernel.
  – Lightweight: threads are managed by user-level libraries; this implies two-level scheduling (lightweight threads on top of weightier kernel threads).
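As a rough illustration of the data-transfer path described above, the following C sketch shows what a pair-wise mapped, bidirectional message queue guarded by test-and-set locks could look like. The struct layout, slot counts, and function names (urpc_channel_t, queue_put, queue_get) are assumptions for illustration, not the paper's actual data structures.

```c
/*
 * Minimal sketch of a pair-wise mapped message channel guarded by
 * test-and-set locks, in the spirit of URPC's data-transfer path.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define QUEUE_SLOTS 64
#define MSG_BYTES   256

typedef struct {
    atomic_flag lock;                 /* test-and-set lock, no kernel involved */
    unsigned    head, tail;
    char        msgs[QUEUE_SLOTS][MSG_BYTES];
} msg_queue_t;

/* One channel per client/server pairing, mapped into both address
 * spaces at bind time: one queue per direction. */
typedef struct {
    msg_queue_t request;              /* client -> server */
    msg_queue_t reply;                /* server -> client */
} urpc_channel_t;

/* Called once when the channel memory is first mapped (assumption). */
static void queue_init(msg_queue_t *q)
{
    q->head = q->tail = 0;
    atomic_flag_clear(&q->lock);      /* leave the lock free */
}

/* Non-blocking enqueue; returns false if the queue is full.  The stubs
 * copy arguments in and out, so applications never touch the channel. */
static bool queue_put(msg_queue_t *q, const void *msg, size_t len)
{
    bool ok = false;
    while (atomic_flag_test_and_set(&q->lock))   /* spin: critical section is short */
        ;
    if (q->tail - q->head < QUEUE_SLOTS && len <= MSG_BYTES) {
        memcpy(q->msgs[q->tail % QUEUE_SLOTS], msg, len);
        q->tail++;
        ok = true;
    }
    atomic_flag_clear(&q->lock);
    return ok;
}

/* Polled by the user-level thread package on the receiving side. */
static bool queue_get(msg_queue_t *q, void *msg, size_t len)
{
    bool ok = false;
    while (atomic_flag_test_and_set(&q->lock))
        ;
    if (q->head != q->tail && len <= MSG_BYTES) {
        memcpy(msg, q->msgs[q->head % QUEUE_SLOTS], len);
        q->head++;
        ok = true;
    }
    atomic_flag_clear(&q->lock);
    return ok;
}
```

Because both queues live in memory mapped into the two address spaces and the locks are plain test-and-set words, neither enqueue nor dequeue ever enters the kernel; a misbehaving peer can harm only the channel it shares, not the rest of the system.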
Performance of URPC

Call Latency and Throughput
– T = number of runnable client threads, C = processors in the client's address space, S = processors in the server's address space.
– With T = C = S = 1, call latency is 93 microseconds.
– Latency is proportional to the number of threads per processor, and it increases once T > C + S.
– C = 1, S = 0 gives the worst performance (processors must be reallocated frequently).
– For both latency and throughput, C = 2, S = 2 yields the best performance.

Problems with URPC:
– In the worst case (T = 1 with no processor in the server's address space), latency is 373 microseconds: every call requires two traps and two processor reallocations. At this point URPC performs worse than LRPC (157 microseconds). Why?
  1. Processor reallocation in URPC is based on the LRPC mechanism.
  2. URPC is integrated with two-level scheduling, so each reallocation must also answer: is there an idle processor, and is there an underpowered address space to which it can be reallocated?
– Two processors are devoted to a single computation, yet only one is active at a time (because of the synchronous nature of RPC).
– Not ideal for all application types: single-threaded applications, high-latency I/O.

Conclusion
– Better performance and flexibility result from moving traditional operating system functions out of the kernel.
– URPC defines an appropriate division of responsibility between user level and kernel.
– URPC demonstrates a design specific to a multiprocessor, not just a uniprocessor design that happens to run on multiprocessor hardware.