An Efficient Programmable 10 Gigabit Ethernet Network Interface Card
Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai

Designing a 10 Gigabit NIC
- Programmability for performance: offloading improves performance, but solutions should be efficient
- Computation: NICs have power and area concerns
- Architecture: above all, must support 10 Gb/s links
- What are the computation and memory requirements?
- What architecture efficiently meets them?
- What firmware organization should be used?

Mechanisms for an Efficient Programmable 10 Gb/s NIC
- A partitioned memory system: low-latency access to control structures; high-bandwidth, high-capacity access to frame data
- A distributed task-queue firmware: uses frame-level parallelism to scale across many simple, low-frequency processors
- New RMW instructions: reduce firmware frame-ordering overheads by 50% and reduce the clock frequency requirement by 17%

Outline
- Motivation
- How programmable NICs work
- Architecture: requirements, design
- Frame-parallel firmware
- Evaluation

How Programmable NICs Work
[Block diagram: PCI Interface, Processor(s), and Ethernet Interface connected by a memory bus, sitting between the host PCI bus and the Ethernet link]

Per-frame Requirements

              Instructions   Data Accesses
  TX Frame        281            101
  RX Frame        253             85

Processing and control data requirements per frame, as determined by dynamic traces of the relevant NIC functions.

Aggregate Requirements (10 Gb/s, maximum-sized frames)

              Instruction    Control Data    Frame Data
              Throughput     Bandwidth       Bandwidth
  TX Frame     229 MIPS       2.6 Gb/s       19.75 Gb/s
  RX Frame     206 MIPS       2.2 Gb/s       19.75 Gb/s
  Total        435 MIPS       4.8 Gb/s       39.5  Gb/s

1514-byte frames at 10 Gb/s arrive at 812,744 frames/s, so the per-frame instruction and data-access counts above translate directly into these aggregate rates.

Meeting 10 Gb/s Requirements with Hardware
- Processor architecture: at least 435 MIPS within an embedded device; does NIC firmware have ILP?
- Memory architecture: low-latency control data and high-bandwidth, high-capacity frame data; how can one memory system provide both?

ILP Processors for NIC Firmware?
- ILP is limited by data and control dependences
- Analysis of a dynamic trace reveals those dependences

  IPC            In-order 1   In-order 2   In-order 4   Out-of-order 1   Out-of-order 2   Out-of-order 4
  Perfect BP        0.87         1.19         1.34           1.00             1.96             2.65
  Perfect 1BP       0.87         1.19         1.33           1.00             1.74             2.00
  No BP             0.87         1.13         1.17           0.88             1.21             1.29

Processors: 1-Wide, In-order
- 2x performance is costly: a 1-wide in-order core reaches 0.87 IPC, while a 2-wide out-of-order core reaches only 1.74 with perfect branch prediction (and 1.21 with none)
- Branch prediction, reorder buffer, renaming logic, and wakeup logic translate to more than 2x core power and area costs
- Great for a general-purpose processor; not for an embedded device
- Other opportunities for parallelism? Yes:
  - Many steps to process a frame: run them simultaneously
  - Many frames need processing: process them simultaneously
- Use parallel single-issue cores

Memory Architecture
- Competing demands:
  - Frame data: high bandwidth and high capacity for many offload mechanisms
  - Control data: low latency; coherence among the processors, PCI interface, and Ethernet interface
- The traditional solution, caches:
  - Advantages: low latency, transparent to the programmer
  - Disadvantages: hardware costs (tag arrays, coherence)
  - In many applications, the advantages outweigh the costs

Are Caches Effective?
[Plot: hit ratio (percent) versus cache size, 16 B to 32 KB, for the 6-processor hit ratio]
SMPCache trace analysis of a 6-processor NIC architecture.

Choosing a Better Organization
[Diagrams contrasting a cache hierarchy with a partitioned organization]

Putting it All Together
[Block diagram: an instruction memory feeding per-processor instruction caches (I-Cache 0 through I-Cache P-1) and CPUs (CPU 0 through CPU P-1); a (P+4) x S 32-bit crossbar connecting them to S scratchpads (Scratchpad 0 through S-pad S-1); plus the PCI interface on the PCI bus, an external memory interface to off-chip DRAM, and the Ethernet interface]
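The deck shows only the hardware diagram for this organization; as a rough illustration of how firmware might use it, here is a minimal C sketch. All names and sizes (scratchpad, frame_dram, frame_desc_t, record_rx_frame, the 16 KB and 2 MB capacities) are assumptions invented for the example, not taken from the actual firmware: per-frame control descriptors sit in a small, explicitly managed scratchpad, while payloads occupy a large off-chip DRAM region that DMA engines fill and drain.

/*
 * Minimal sketch of the partitioned data memory idea.  Names and sizes are
 * assumed for illustration; this is not the actual firmware's layout.
 * Control data -> small on-chip scratchpad (low latency, shared via crossbar).
 * Frame data   -> large off-chip DRAM (high bandwidth, high capacity, DMA-fed).
 */
#include <stddef.h>
#include <stdint.h>

#define SCRATCHPAD_BYTES  (16 * 1024)        /* small on-chip bank (assumed size) */
#define FRAME_DRAM_BYTES  (2 * 1024 * 1024)  /* large off-chip buffer (assumed)   */
#define BUFFER_BYTES      2048               /* one maximum-sized frame fits      */

/* Per-frame control descriptor: a few words that every processing step
 * reads or writes, so it belongs in the low-latency scratchpad. */
typedef struct {
    uint32_t dram_offset;  /* where this frame's payload sits in frame DRAM */
    uint16_t length;       /* frame length in bytes                         */
    uint8_t  ready;        /* set once the payload DMA has completed        */
    uint8_t  owner;        /* processor currently handling the frame        */
} frame_desc_t;

/* On real hardware these are distinct physical memories reached through the
 * crossbar and memory interface; plain arrays stand in for them here so the
 * sketch compiles on a host. */
static uint8_t scratchpad[SCRATCHPAD_BYTES];
static uint8_t frame_dram[FRAME_DRAM_BYTES];

/* The descriptor ring is carved out of the scratchpad, not cached. */
static frame_desc_t *const desc_ring = (frame_desc_t *)scratchpad;

/* Map a descriptor slot to its payload buffer in off-chip DRAM. */
static uint8_t *frame_payload(uint32_t slot)
{
    return &frame_dram[(size_t)slot * BUFFER_BYTES];
}

/* Example: record where a newly received frame was DMA'd. */
static void record_rx_frame(uint32_t slot, uint16_t length)
{
    desc_ring[slot].dram_offset = slot * BUFFER_BYTES;
    desc_ring[slot].length      = length;
    desc_ring[slot].ready       = 1;
}

The point of the split is that latency-sensitive control accesses never compete with bulk frame traffic for bandwidth, and, because each structure has a single well-defined home, no coherence hardware is needed.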
Parallel Firmware
- NIC processing steps are already well defined
- Previous Gigabit NIC firmware divides the steps between 2 processors
- ...but does this mechanism scale?

Task Assignment with an Event Register
- The event register holds one bit per event type (PCI read bit, SW event bit, and other bits)
- The PCI interface finishes its work and sets the corresponding bit
- Processors inspect the register to find pending work and enqueue transactions
- Processors pass TX data to the Ethernet interface

Task-level Parallel Firmware
[Timeline: PCI read hardware status and PCI read bit shown over time]
- Proc 0: Transfer DMAs 0-4, idle, Transfer DMAs 5-9, idle
- Proc 1: idle, Process DMAs 0-4, idle, Process DMAs 5-9
- Because each processor owns one task type, both spend much of the timeline idle

Frame-level Parallel Firmware
[Timeline: PCI read hardware status shown over time]
- Proc 0: Transfer DMAs 0-4, Build Event, Process DMAs 0-4
- Proc 1: idle, Transfer DMAs 5-9, Build Event, Process DMAs 5-9
- Each processor carries its own group of frames through every step, reducing idle time

Evaluation Methodology
- Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
- Idea: model everything inside the NIC
  - Memory latency, bandwidth, and contention modeled precisely
  - Processors modeled in detail
  - NIC I/O (PCI and Ethernet interfaces) modeled in detail
- Verified by modeling the Tigon 2 Gigabit NIC (LCTES 2004)
- Gather performance and trace data

Scaling in Two Dimensions
[Plot: throughput (Gb/s) versus core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit shown]

Processor Performance

  IPC Component                IPC
  Execution                    0.72
  Miss Stalls                  0.01
  Load Stalls                  0.12
  Scratchpad Conflict Stalls   0.05
  Pipeline Stalls              0.10
  Total                        1.00

- Achieves 83% of theoretical peak IPC (0.72 of the 0.87 achievable by a single-issue in-order core)
- Small I-caches work
- Sensitive to memory stalls: half of the loads are part of a load-to-use sequence
- Conflict stalls could be reduced with more ports and more banks

Reducing Frame Ordering Overheads
- Firmware ordering is costly: 30% of execution
- Synchronization and bitwise checks/updates occupy the processors and memory
- Solution: atomic bitwise operations that also update a pointer according to the last set location

Maintaining Frame Ordering
- A frame status array holds one bit per frame (index 0, 1, 2, ...)
- CPU A and CPU B prepare frames and set their status bits; CPU C detects completed frames
- Software-only sequence: LOCK, iterate over the status array, notify the hardware (Ethernet interface), UNLOCK

RMW Instructions Reduce Clock Frequency
- Performance: 6 x 166 MHz with RMW instructions equals 6 x 200 MHz without them
- Dynamically tasked firmware balances the benefit: performance is equivalent at all frame sizes
- 17% reduction in the frequency requirement
- Send cycles reduced by 28.4%; receive cycles reduced by 4.7%

Conclusions
A programmable 10 Gb/s NIC. This NIC architecture relies on:
- Data memory system: a partitioned organization, not coherent caches
- Processor architecture: parallel scalar processors
- Firmware: a frame-level parallel organization
- RMW instructions: reduce ordering overheads
A programmable NIC is a substrate for offload services.

Comparing Frame Ordering Methods
[Plot: full-duplex throughput (Gb/s) versus UDP datagram size (0-1400 bytes) for the Ethernet limit (duplex), the 6 x 200 MHz software-only configuration, and the 6 x 166 MHz RMW-enhanced configuration]
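To make the ordering overhead concrete, below is a minimal sketch of the software-only path described on the "Maintaining Frame Ordering" slide. It is an illustration under assumptions: the names frame_done, next_to_notify, notify_hw, lock, and unlock are invented, and the real firmware's data structures and ISA extension are not reproduced; the closing comment notes where the proposed RMW instruction would remove work.

/*
 * Sketch of frame-ordering bookkeeping on a multiprocessor NIC.
 * All names are hypothetical stand-ins, not the actual firmware API.
 */
#include <stdint.h>

#define MAX_FRAMES 64

static volatile uint8_t frame_done[MAX_FRAMES]; /* status array: 1 = frame completed  */
static uint32_t         next_to_notify;         /* oldest frame not yet handed to HW  */

/* Firmware primitives, stubbed so the sketch is self-contained. */
static void lock(void)   { /* would acquire a shared spin-lock */ }
static void unlock(void) { /* would release it                 */ }
static void notify_hw(uint32_t first, uint32_t count)
{
    (void)first; (void)count; /* would tell the Ethernet interface to send these frames */
}

/*
 * Software-only path: whichever processor finishes frame 'idx' takes the
 * lock, marks the frame, and iterates over the status array looking for a
 * contiguous run of completed frames starting at next_to_notify, so frames
 * reach the hardware in order.  This lock-and-iterate work is what the
 * slides report as roughly 30% of execution time.
 */
void frame_completed_sw(uint32_t idx)
{
    lock();
    frame_done[idx] = 1;

    uint32_t first = next_to_notify;
    uint32_t count = 0;
    while (frame_done[next_to_notify]) {
        frame_done[next_to_notify] = 0;
        next_to_notify = (next_to_notify + 1) % MAX_FRAMES;
        count++;
    }
    if (count > 0)
        notify_hw(first, count);
    unlock();
}

/*
 * The proposed RMW instruction folds "set my bit and advance the pointer
 * past the completed run" into one atomic hardware operation, so the
 * explicit lock and scan above drop out of the critical path.  That change
 * is a hardware feature and is only described here, not modeled in C.
 */

The scan must run under a lock because several processors complete frames concurrently; collapsing it into a single atomic read-modify-write both shortens the critical section and removes instructions, which is consistent with the reported 28.4% reduction in send cycles.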