ATM and Fast Ethernet Network Interfaces for User-Level Communication

Matt Welsh
Department of Computer Science
Cornell University
mdw@cs.cornell.edu, http://www.cs.cornell.edu/Info/Projects/U-Net/
Joint work with Anindya Basu and Thorsten von Eicken
HPCA-3, San Antonio, 5 February 1997


User-level Network Interfaces: Motivation

Motivation 1: Performance
• Utilize high-speed nets
• Support parallel processing on NoW's
• Finer comm granularity
⇒ That means: Low latency and high bandwidth

Motivation 2: Flexibility
• Variants of standard protocols
• New communication semantics
• That means: Abandon in-kernel protocol stacks

Proposed solution: User-level network access
• Provide minimal interface enabling communication
• Application implements protocols directly
• Need to ensure protection between processes


U-Net: Basic Idea

Generic communication architecture
- Can be implemented in hardware, software, or both
Traditional:
- All communication via kernel
U-Net:
- Applications send/recv directly via simple MUX in NI
- Kernel only involved in connection set-up/shut-down
Invariants:
- Off-the-shelf hardware and software
- No compromise on protection: A cannot inspect or corrupt B's messages, A cannot impersonate B


Inspiration: MPP Systems

Key idea: User-level access to NI
• Examples: TMC CM-5, IBM SP-2, Meiko CS-2
Advantages
• Bypasses the kernel for send/recv
• No copy: DMA direct to/from user memory
• Application-specific protocols
Shortcomings
• Custom network and NI
• Assumes homogeneous nodes
• Restricts multi-user capabilities


Overview

Summary: Explore design space of user-level NI's
- What is the hardware/software tradeoff?
- Does user-level communication require expensive/complex NI's?
Our method:
Comparing implementations of U-Net
- NIs with and without a programmable co-processor
- Explore Fast Ethernet as an alternative to ATM for commodity interconnect
Detailed performance analysis
- Careful instrumentation of U-Net/FE implementation
- Micro-benchmarks for latency/bandwidth performance
Application performance
- Set of parallel benchmarks measured over FE and ATM workstation clusters


The U-Net Interface

U-Net Endpoint: Virtual device interface
- Message buffers and send/recv/free queues
Transmit operation:
- User constructs msg in buffer area, pushes Tx descriptor onto send queue
Receive operation:
- Msg arrives, data in buffer from free queue, Rx descriptor pushed onto recv queue


U-Net ATM Implementation

• Original implementation of U-Net
• Programmable co-processor, ATM as "obvious choice" for interconnect
FORE Systems PCA-200
- PCI bus 155 Mbps OC-3 ATM NI
- 25 MHz i960, 256K SRAM
- Pentium 133 WS, Linux 1.3.97
• U-Net implemented on i960
• Tx/Free rings mapped from i960 RAM
• Buffers, Rx ring in pinned memory segments ... always DMA-able by the i960
• No O/S, CPU intervention in Tx/Rx
[Diagram: U-Net Endpoint in user address space; FORE PCA-200 ATM Interface with 25 MHz i960 and 256K SRAM]


U-Net Fast Ethernet Implementation

DECchip 21140 FE controller
- 100 Mbps UTP5 or fiber
- PCI busmastering interface
- But, not programmable
- Low cost: $150/board
- Pentium 133 WS, Linux 1.3.97
• Single, shared Tx and Rx rings, buffer pool
• Assumes single O/S agent to mux the queues
• U-Net implemented in kernel trap and interrupt routines


Transmit Operation

U-Net/ATM
1. User constructs data in buffer region
2. User pushes Tx descr into Tx Ring
3. i960 polls Tx rings, fetches descriptor
4. i960 initiates DMA to fiber output
5. i960 sets Tx descr done flag
U-Net/Fast Ethernet
1. User constructs data
2. User pushes Tx descr
3. User calls trap
4. Trap pushes descr to device Tx Ring
5. On Tx done, trap sets Tx descr done flag
[Diagram: U-Net Endpoint in user address space; FORE PCA-200 ATM Interface with 25 MHz i960 and 256K SRAM]


Receive Operation

U-Net/ATM
1. AAL5 PDU cells arrive at fiber input
2. i960 fetches free buffer descr
3. i960 initiates DMA to free buffer
4. At End-of-PDU, i960 writes Rx descr
5. User polls Rx FIFO, or upcall
U-Net/Fast Ethernet
1. FE packet arrives, interrupt raised
2. Intr fetches free buffer descr
3. Intr copies from device buffer to user buffer
4. Intr writes Rx descr into Rx FIFO
5. User polls Rx FIFO, or upcall
[Diagram: U-Net Endpoint in user address space; FORE PCA-200 ATM Interface with 25 MHz i960 and 256K SRAM]


U-Net/FE Transmit operation

Fast trap to start transmit, 4.2 usec any size packet
- Null trap: 1 usec
- PCI access time dominates
- Trap semantics are 'service U-Net Tx queue'
- Trap seen as 'protected co-routine'
- Take (small) slice of main CPU time to mux U-Net


U-Net/FE Receive operation

Interrupt handler on Rx, copy time dominates
- Msg arrives in fixed buffer pool in kernel, copy to user
Mux/Demux:
- U-Net 'protocol ID' in Ethernet header, plus 'channel number' and length field
- Need to integrate with IP/packet filtering


Performance: Bandwidth

ATM
- 120 Mbps, TAXI used as receiver
Fast Ethernet
- 90 Mbps+ with 500-byte messages ... but switch shaves off some b/w?
[Graph: Bandwidth, Mbps (0 to 120) vs. Message size, bytes (0 to 1500); series: ATM; FE, Hub; FE, Bay Networks 28115]


Performance: Latency

FE has lower latency than ATM! ... for small messages, anyway
• FE switches add 17 usec one-way
[Graph: Round-trip time, usec (0 to 800) vs. Message size, bytes (0 to 1500); series: ATM; FE, Hub; FE, Bay Networks 28115; FE, Cabletron FN100]


Split-C Benchmarks: ATM vs. FE

Split-C
• Novel parallel language based on C
• Use of 'global pointers' to access other proc addr space
Benchmarks
- mm: Matrix Multiply
- ssort: Sample sort
- rsort: Radix sort
SparcStations used in ATM cluster
• ATM faster for large msgs
• FE faster for small msgs
• SPARC fp faster than Pentium
• Pentium int ops faster than SPARC
[Chart: bars split into net and cpu for mm16x16, mm128x128, ssortsm512K, ssortlg512K, rsortsm512K, rsortlg512K; configurations atm2, fe2, atm4, fe4, atm8, fe8; vertical scale 0 to 2]


Current work: Memory Management

Pinned buffers and queues
- Locked into physical memory for lifetime of process
- Required to allow direct DMA to/from user space
Paging Endpoints
- On-demand paging of U-Net buffers
- Uses software TLB to cache page mappings
- TLB miss causes kernel interrupt to fetch page
- Pages discarded on TLB capacity miss
Issues
- Writeable pages are easy to get
- What about swapped-out read page? (Can't swap in interrupt...)
- Lazy read-page retrieval: Tell the NI to try again
Implementations for PCA-200 & Linux, DC21140, Zeitnet & Windows NT


Summary

U-Net extended to non-programmable NICs
- Implementation using DC21140 FE interface
- Hardware requires kernel trap and copy on receive
U-Net extended to Fast Ethernet
- Round-trip latency starts at 57 usec, 40 byte ping-pong
- Lower latency than OC-3 ATM (120 usec, 40 byte ping-pong)
- Bandwidth reaches > 90 Mbps with 500-byte messages
Conclusions
- U-Net model extends to other networks and NI architectures
- Split-C benchmarks demonstrate comparable app performance
- Fast Ethernet is an excellent price-performance point for workstation clusters
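

Illustration: endpoint queues

The endpoint model from "The U-Net Interface" slide (message buffers plus send/recv/free queues) can be pictured with a small simulation. This is only a sketch under stated assumptions: the names Endpoint, Descriptor, ni_transmit and ni_deliver are hypothetical and are not U-Net's API, and a real endpoint is a set of memory-mapped regions shared with the NI or kernel rather than Python objects.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    """Hypothetical descriptor: which buffer holds the message, and its length."""
    buf_index: int
    length: int
    channel: int = 0


class Endpoint:
    """Toy model of an endpoint: a buffer area plus send, recv and free queues."""

    def __init__(self, num_buffers=4, buf_size=64):
        self.buf_size = buf_size
        self.buffers = [bytearray(buf_size) for _ in range(num_buffers)]
        self.send_queue = deque()   # Tx descriptors pushed by the user
        self.recv_queue = deque()   # Rx descriptors pushed by the NI side
        self.free_queue = deque()   # indices of buffers the NI side may fill

    def post_free(self, buf_index):
        """User side: hand an empty buffer to the NI side for future receives."""
        self.free_queue.append(buf_index)

    def send(self, buf_index, payload, channel=0):
        """User side: construct the message in a buffer, then push a Tx descriptor."""
        if len(payload) > self.buf_size:
            raise ValueError("payload larger than buffer")
        self.buffers[buf_index][:len(payload)] = payload
        self.send_queue.append(Descriptor(buf_index, len(payload), channel))

    def poll_recv(self):
        """User side: poll the recv queue; returns the message bytes or None."""
        if not self.recv_queue:
            return None
        d = self.recv_queue.popleft()
        data = bytes(self.buffers[d.buf_index][:d.length])
        self.free_queue.append(d.buf_index)  # recycle the buffer
        return data


def ni_transmit(src):
    """NI side: fetch the next Tx descriptor and return (channel, payload)."""
    if not src.send_queue:
        return None
    d = src.send_queue.popleft()
    return d.channel, bytes(src.buffers[d.buf_index][:d.length])


def ni_deliver(dst, payload, channel=0):
    """NI side: place data in a buffer from the free queue, push an Rx descriptor."""
    if not dst.free_queue or len(payload) > dst.buf_size:
        return False  # no free buffer available: the message is dropped
    idx = dst.free_queue.popleft()
    dst.buffers[idx][:len(payload)] = payload
    dst.recv_queue.append(Descriptor(idx, len(payload), channel))
    return True
```

In this model a sender calls send(), the NI side moves the message with ni_transmit() and ni_deliver(), and the receiver picks it up with poll_recv(). The receiver must have posted at least one buffer with post_free() beforehand, otherwise the delivery fails.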
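

Illustration: Fast Ethernet mux/demux

The "U-Net/FE Receive operation" slide says demultiplexing uses a U-Net 'protocol ID' in the Ethernet header plus a 'channel number' and a length field. The slides do not give the actual layout, so the following Python sketch assumes one: a standard 14-byte Ethernet header, then a 1-byte channel number and a 2-byte length. The EtherType value is a placeholder taken from the IEEE local experimental range, not U-Net's real protocol ID, and build_frame and demux are hypothetical names.

```python
import struct

# Placeholder protocol ID (IEEE local experimental EtherType), NOT U-Net's real value.
UNET_ETHERTYPE = 0x88B5

# Assumed layout: dst MAC, src MAC, EtherType, channel (1 byte), length (2 bytes).
HEADER = struct.Struct("!6s6sHBH")


def build_frame(dst, src, channel, payload):
    """Prepend the assumed U-Net/FE header to a payload."""
    return HEADER.pack(dst, src, UNET_ETHERTYPE, channel, len(payload)) + payload


def demux(frame):
    """Return (channel, payload) for a U-Net frame, or None for anything else.

    A None result stands for "not ours": in a real system such a frame
    would be handed to the regular in-kernel protocol stack.
    """
    if len(frame) < HEADER.size:
        return None
    _dst, _src, ethertype, channel, length = HEADER.unpack_from(frame)
    if ethertype != UNET_ETHERTYPE:
        return None
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        return None  # truncated frame
    return channel, payload
```

One plausible reason for an explicit length field is that Ethernet pads short frames up to its minimum size, so the receiver needs the length to strip that padding. In this sketch demux() ignores any trailing bytes beyond the stated length.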
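

Illustration: software TLB for paging endpoints

The "Paging Endpoints" items on the memory management slide (software TLB caching page mappings, a miss triggering a kernel fetch, pages discarded on a capacity miss, and "tell the NI to try again") can be sketched as a small cache. The slides do not state the replacement policy, so least-recently-used eviction is an assumption here. SoftwareTLB and fetch_mapping are hypothetical names, with fetch_mapping standing in for the kernel interrupt that looks up a page.

```python
from collections import OrderedDict


class SoftwareTLB:
    """Toy software TLB mapping virtual page numbers to physical frames."""

    def __init__(self, capacity, fetch_mapping):
        self.capacity = capacity
        self.fetch_mapping = fetch_mapping  # stand-in for the kernel interrupt
        self.entries = OrderedDict()        # vpage -> frame, oldest use first
        self.misses = 0

    def translate(self, vpage):
        """Return the frame for vpage, or None if the caller should retry later."""
        if vpage in self.entries:
            self.entries.move_to_end(vpage)  # mark as most recently used
            return self.entries[vpage]
        self.misses += 1
        frame = self.fetch_mapping(vpage)    # TLB miss: ask the kernel
        if frame is None:
            return None                      # page not resident: try again later
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # capacity miss: discard oldest entry
        self.entries[vpage] = frame
        return frame
```

For example, with a dictionary as the page table, SoftwareTLB(2, page_table.get) holds at most two mappings; translating a third page discards the least recently used one, and translating a page missing from the table returns None so the caller can retry.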