U-Net: A User-Level Network Interface for Parallel and Distributed Computing
Vidhyashankar Venkataraman – CS614 Presentation

Background – Fast Computing
- Emergence of MPPs (Massively Parallel Processors) in the early 90s
  - Repackage hardware components into dense configurations of very large parallel computing systems
  - But they require custom software
- Alternative: NOW (Berkeley) – Network of Workstations
  - Inexpensive, low-latency, high-bandwidth, scalable interconnection networks of workstations
  - Interconnected through fast switches
- Challenge: build a scalable system that can use the aggregate resources in the network to execute parallel programs efficiently

Issues
- Problem with traditional networking architectures:
  - The software path through the kernel involves several copies – processing overhead
  - On faster networks, applications may not see speed-ups commensurate with network performance
- Observations:
  - For small messages, processing overhead dominates network latency
  - Most applications use small messages
    - e.g., in a UCB NFS trace, 50% of the bits sent were in messages of 200 bytes or less

Issues (contd.)
- Flexibility concerns with protocol processing in the kernel:
  - Greater flexibility if application-specific information is integrated into protocol processing
  - The protocol can then be tuned to the application's needs
    - e.g., customized retransmission of video frames

U-Net Philosophy
- Achieve flexibility and performance by:
  - Removing the kernel from the critical path
  - Placing the entire protocol stack at user level
  - Allowing protected user-level access to the network
  - Supplying full bandwidth even to small messages
  - Supporting both novel and legacy protocols

Do MPPs do this?
- Parallel machines like the Meiko CS-2 and Thinking Machines CM-5 have tried to solve the problem of providing user-level access to the network
- But they use custom networks and network interfaces – no flexibility
- U-Net targets applications on standard workstations built from off-the-shelf components

Basic U-Net Architecture
- Virtualize the network device so that each process has the illusion of owning the network interface (NI)
- A mux/demux layer virtualizes the NI – and offers protection!
- The kernel is removed from the critical path; it is involved only in setup

The U-Net Architecture – Building Blocks
- Application endpoints
- Communication segment (CS): a region of memory that holds message data
- Message queues (send, receive, and free queues)
- Sending:
  - Assemble the message in the CS
  - Enqueue a message descriptor
- Receiving (poll-driven or event-driven):
  - Dequeue a message descriptor
  - Consume the message
  - Enqueue the buffer on the free queue
  - (A code sketch of these pieces follows below.)

U-Net Architecture (contd.)
- More on event handling (upcalls):
  - An upcall can be a UNIX signal handler or a user-level interrupt handler
  - The cost of upcalls is amortized by batching message receptions
- Mux/demux:
  - Each endpoint is uniquely identified by a tag (e.g., a VCI in ATM)
  - The OS performs the initial route setup and security tests, then registers a tag in U-Net for that application
  - The message tag is mapped to a communication channel
- Observation: buffers must be preallocated – memory overhead!
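To make the building blocks concrete, here is a minimal C sketch of an application's view of an endpoint: a communication segment plus send, receive, and free descriptor queues. All names (desc_t, endpoint_t, ep_send, ep_poll) and layouts are hypothetical – a sketch of the idea, not the actual U-Net API.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical sketch of an application's view of a U-Net endpoint.
     * Names and layouts are illustrative, not the actual U-Net API. */

    #define QLEN    64
    #define CS_SIZE 65536

    typedef struct {          /* message descriptor */
        uint32_t off, len;    /* where the message lives inside the CS */
        uint32_t tag;         /* channel tag (e.g., maps to an ATM VCI) */
    } desc_t;

    typedef struct {          /* simple descriptor ring */
        desc_t   d[QLEN];
        unsigned head, tail;
    } queue_t;

    static int q_put(queue_t *q, desc_t d) {
        if (q->tail - q->head == QLEN) return 0;    /* queue full */
        q->d[q->tail++ % QLEN] = d;
        return 1;
    }

    static int q_get(queue_t *q, desc_t *d) {
        if (q->tail == q->head) return 0;           /* queue empty */
        *d = q->d[q->head++ % QLEN];
        return 1;
    }

    typedef struct {          /* per-process endpoint */
        uint8_t cs[CS_SIZE];  /* communication segment (pinned in practice) */
        queue_t send_q, recv_q, free_q;
    } endpoint_t;

    /* Send: assemble the message in the CS, then enqueue a descriptor.
     * The NI polls send_q and DMAs the data onto the wire. */
    static void ep_send(endpoint_t *ep, uint32_t tag,
                        const void *msg, uint32_t len, uint32_t off) {
        memcpy(ep->cs + off, msg, len);
        q_put(&ep->send_q, (desc_t){ off, len, tag });
    }

    /* Poll-driven receive: dequeue a descriptor, consume the message,
     * then return the buffer to the free queue for the NI to reuse. */
    static void ep_poll(endpoint_t *ep) {
        desc_t d;
        while (q_get(&ep->recv_q, &d)) {
            printf("tag %u: %u-byte message\n", d.tag, d.len);
            q_put(&ep->free_q, d);
        }
    }

Note that ep_send copies the message into the CS first; that copy is exactly what separates the base-level ("zero-copy") architecture from the direct-access (true zero-copy) architecture discussed next.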
Protection
- Protected user-level access to the NI is ensured by demarcating protection boundaries, defined by endpoints and communication channels
- Applications cannot interfere with each other because:
  - Endpoints, communication segments, and message queues are owned by a single user process
  - Outgoing messages are tagged with the originating endpoint's address
  - Incoming messages are demultiplexed by U-Net and delivered only to the correct endpoint

Zero-Copy and True Zero-Copy
- Two levels of sophistication, depending on whether a copy is made at the CS
- Base-level architecture – "zero-copy":
  - Actually one copy: data is staged through an intermediate buffer in the CS
  - CSes are allocated, aligned, and pinned to physical memory
  - Optimized for small messages
- Direct-access architecture – true zero-copy:
  - Data is sent directly out of the application's data structures
  - The sender also specifies the offset at which the data is to be deposited at the receiver
  - The CS spans the entire process address space
  - Limitations in I/O addressing force implementations to fall back on the base-level (zero-copy) architecture

Kernel-Emulated Endpoint
- Communication segments and message queues are scarce resources
- Optimization: provide a single kernel-emulated endpoint, multiplexed among applications that do not need peak performance
- Cost: performance overhead

U-Net Implementation
- U-Net was implemented on two systems, using the Fore Systems SBA-100 and SBA-200 ATM network interfaces
  - But why ATM?
- Setup: SPARCstations 10 and 20 running SunOS 4.1.3, connected by a Fore ASX-200 ATM switch over 140 Mbps fiber links
- SBA-200 firmware:
  - 25 MHz on-board i960 processor, 256 KB RAM, DMA capabilities
  - The firmware was completely redesigned
- Device driver:
  - Protection is offered through the VM system (the CSes)
  - And through <VCI, communication channel> mappings

U-Net Performance
- Round-trip time (RTT) and bandwidth measurements
- Small messages: 65 μs RTT (with an optimization for single cells)
- The fiber is saturated at a message size of about 800 bytes

U-Net Active Messages (UAM) Layer
- Active Messages: a communication primitive – essentially a restricted, lightweight form of RPC – that can be implemented efficiently on a wide range of hardware
- A basic communication primitive in NOW
- Allows communication to be overlapped with computation
- A message contains the data plus a pointer to a handler; message delivery is reliable
- The handler moves the data into the data structures of some ongoing computation (see the sketch below)
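The slide's key idea – a message that names its own handler – can be made concrete with a short C sketch. Everything here (am_msg, am_dispatch, add_handler) is invented for illustration and is not the actual UAM interface; sending a raw code pointer is only meaningful because SPMD programs run the same binary on every node.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative Active Messages sketch; these names are invented,
     * not the actual U-Net Active Messages (UAM) interface. */

    typedef void (*am_handler)(void *dst, const void *data, uint32_t len);

    typedef struct {
        am_handler handler;   /* code pointer: valid across nodes because
                               * SPMD programs run one binary everywhere */
        void      *dst;       /* e.g., a slot in an ongoing computation */
        uint32_t   len;
        uint8_t    data[32];  /* small payload, roughly one ATM cell */
    } am_msg;

    /* Receiver: invoke the handler named in the message, which moves
     * the data straight into the computation's data structures --
     * no intermediate buffering, no separate receive call. */
    static void am_dispatch(const am_msg *m) {
        m->handler(m->dst, m->data, m->len);
    }

    /* Example handler: fold a remote partial sum into a local total. */
    static void add_handler(void *dst, const void *data, uint32_t len) {
        double v;
        (void)len;
        memcpy(&v, data, sizeof v);
        *(double *)dst += v;
    }

Because the handler runs immediately on arrival and deposits data where the computation needs it, communication overlaps with computation instead of blocking it.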
AM Micro-Benchmarks
- Single-cell RTT: ~71 μs for a 0–32 byte message
  - An overhead of 6 μs over raw U-Net – why?
- Block store bandwidth: 80% of the maximum with 2 KB blocks; almost saturated at 4 KB
- Good performance!

Split-C Application Benchmarks
- Split-C is a parallel extension to C, implemented here on top of UAM
- Tested on 8 processors
- The ATM cluster performs close to the Meiko CS-2

TCP/IP and UDP/IP over U-Net
- Good performance here is necessary to demonstrate flexibility
- Traditional IP-over-ATM shows very poor performance
  - e.g., TCP achieves only 55% of the maximum bandwidth
- TCP and UDP over U-Net show much better performance, primarily because of tighter application–network coupling
- IP-over-U-Net does not correspond exactly to IP-over-ATM: demultiplexing within the same VCI is not possible

Performance Graphs
- UDP performance: saw-tooth behavior for Fore UDP (graphs not reproduced)
- TCP performance (graphs not reproduced)

Conclusion (U-Net)
- U-Net provides a virtual view of the network interface, enabling user-level access to high-speed communication devices
- The two main goals were performance and flexibility, pursued by keeping the kernel out of the critical path
- Achieved? See the results table in the paper (not reproduced here)

Lightweight Remote Procedure Calls

Motivation
- Small-kernel OSes implement most services as separate user-level processes
- Separate, communicating user processes:
  - Improve modular structure
  - Provide more protection
  - Ease system design and maintenance
- But cross-domain and cross-machine communication are treated as equals – the problem?
  - This fails to isolate the common case, hurting both performance and simplicity

Measurements
- Measurements show that cross-domain calls predominate:
  - V System: 97%
  - Taos (Firefly): 94%
  - Sun UNIX+NFS, diskless: 99.4%
- But how expensive are these RPCs? On Taos, a Null() cross-domain call takes 464 μs against a theoretical minimum of 109 μs – roughly 3.5x overhead
- Most interactions are simple, with small numbers of arguments
  - This can be exploited for optimization

Overheads in Cross-Domain Calls
- Stub overhead: an additional execution path
- Message buffer overhead: a cross-domain call can involve four copy operations per RPC
- Context switch: a VM context switch from the client's domain to the server's, and back again on return
- Scheduling: abstract vs. concrete threads

Available Solutions?
- Eliminating kernel copies (the DASH system)
- Handoff scheduling (Mach and Taos)
- In SRC RPC, message buffers are globally shared – trading safety for performance

Solution Proposed: LRPC
- Written for the Firefly system
- A mechanism for communication between protection domains on the same machine
- Motto: strive for performance without forgoing safety
- Basic idea: similar to RPC, but do not context switch to a server thread – instead, change the context of the client thread, reducing overhead

Overview of the LRPC Design
- The client calls the server through a kernel trap
- The kernel validates the caller
- The kernel dispatches the client thread directly into the server's domain
- The client provides the server with a shared argument stack and its own thread
- The call returns through the kernel to the caller

Implementation – Binding
(The original slide is a diagram of the client, server, and kernel interacting.)
- The server exports its interface, registers it with the name server, and waits
- The client traps to the kernel to import the interface; the kernel notifies the server's clerk
- The clerk sends the Procedure Descriptor List (PDL) to the kernel
- The kernel allocates A-stacks and linkage records, and sends the client a Binding Object (BO)

Data Structures Used and Created
- The kernel receives the Procedure Descriptor List (PDL) from the clerk
  - It contains a Procedure Descriptor (PD) for each procedure, holding the entry address among other information
- For each PD, the kernel allocates argument stacks (A-stacks), shared by the client and server domains
- It allocates a linkage record for each A-stack, to record the caller's return address
- It allocates the Binding Object – the client's key for accessing the server's interface
- (A sketch of these structures follows below.)
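Below is one way these bind-time structures might look in C. All layouts and field names are hypothetical illustrations; the paper does not define them in this form.

    #include <stdint.h>

    /* Illustrative layouts for the LRPC binding-time data structures.
     * Field names and sizes are hypothetical, not taken from the paper. */

    #define ASTACK_SIZE 1024

    typedef struct {               /* Procedure Descriptor (PD) */
        void    *entry_addr;       /* entry address in the server's domain */
        uint32_t astack_size;      /* bounds the A-stack this procedure needs */
        uint32_t num_astacks;      /* how many A-stacks to allocate for it */
    } pd_t;

    typedef struct {               /* Procedure Descriptor List (PDL) */
        uint32_t count;            /* one PD per procedure in the interface */
        pd_t     pds[16];          /* fixed size here; variable in practice */
    } pdl_t;

    typedef struct {               /* argument stack (A-stack) */
        uint8_t data[ASTACK_SIZE]; /* mapped read-write into BOTH the
                                      client and server domains */
    } astack_t;

    typedef struct {               /* linkage record: one per A-stack */
        void *caller_ret_addr;     /* where the call must return to */
    } linkage_t;

    typedef struct {               /* Binding Object (BO): the client's */
        uint64_t   capability;     /* unforgeable key to the server's   */
        pdl_t     *pdl;            /* interface, validated on every call */
        astack_t  *astacks;        /* A-stacks shared with the server   */
        linkage_t *linkages;       /* linkage records, one per A-stack  */
    } binding_t;

On every call the client presents the BO and a procedure index; the kernel validates the capability, picks a free A-stack/linkage pair, and dispatches – as the next slide describes.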
Calling
- The client stub traps to the kernel after:
  - Pushing the arguments onto the A-stack
  - Storing the BO, a procedure identifier, and the A-stack address in registers
- The kernel then:
  - Validates the client, verifies the A-stack, and locates the PD and the linkage record
  - Stores the return address in the linkage record and pushes the record onto the thread's stack of linkages
  - Switches the client thread's context into the server by running it on a new execution stack (E-stack) from the server's domain
  - Calls the server stub corresponding to the PD

Server
- The client thread runs in the server's domain using the E-stack
- It can access the parameters on the A-stack; return values go back in the A-stack
- It returns to the kernel through the server stub

Stub Generation
- LRPC stubs are generated automatically, in assembly language for simple execution paths
  - This sacrifices portability for performance
- Local and remote stubs are both maintained; the first instruction of the local stub is a branch statement

What Is Optimized Here?
- Using the same thread in different domains reduces overhead:
  - It avoids scheduling decisions
  - It saves the cost of saving and restoring thread state
- Pairwise A-stack allocation guarantees protection from third-party domains
  - But within the pair? What about asynchronous updates?
- The client is validated using the BO – providing security
- Redundant copies are eliminated through the A-stack!

Argument Copy
- One copy (client stub into the A-stack) against four in a traditional cross-domain RPC
  - Sometimes two? Optimizations apply
- But is it really good enough?
  - LRPC trades memory-management cost for reduced call overhead: A-stacks must be allocated at bind time
  - But A-stack sizes are generally small
- Will LRPC still work if a server migrates from a remote machine to the local machine?

Other Issues – Domain Termination
- An LRPC in a terminated server domain should be returned to the client
- An LRPC should not return to the caller if the caller has terminated
- Binding objects handle both cases:
  - Revoke the terminated domain's binding objects
  - For threads running LRPCs in the terminated domain, restart a new thread in the corresponding caller
  - Invalidate active linkage records – each thread is returned to the first domain up the call chain with an active linkage record, and is otherwise destroyed

Multiprocessor Issues
- LRPC minimizes the use of shared data structures on the critical path
  - Guaranteed by the pairwise allocation of A-stacks
- Caching domain contexts on idle processors:
  - Threads idle in the server's context on otherwise-idle processors
  - When a client thread makes an LRPC to that server, the kernel simply swaps the two processors
  - This reduces context-switch overhead

Evaluation of LRPC
- Performance of four test programs, in μs, run on a C-VAX Firefly and averaged over 100,000 calls (table not reproduced; e.g., the Null call takes 157 μs under LRPC versus 464 μs under Taos RPC)
- Cost breakdown for the Null LRPC:
  - "Minimum" refers to the inherent minimum overhead
  - 18 μs are spent in the client stub and 3 μs in the server stub
  - 25% of the time goes to TLB misses
- Throughput on a multiprocessor:
  - Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor
  - Speedup of 3.7 with 4 processors over 1 processor
  - Speedup of 4.3 with 5 processors
  - SRC RPC scales worse because a global lock is held during the critical transfer path

Conclusion (LRPC)
- LRPC combines:
  - The control-transfer and communication model of capability systems
  - The programming semantics and large-grained protection model of RPC
- It enhances performance by isolating the common case

NOW
- We will see NOW later, in one of the subsequent 614 presentations