An Efficient Programmable 10 Gigabit Ethernet Network Interface Card
Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai

Designing a 10 Gigabit NIC
- Programmability for performance: offloading improves performance, but solutions should be efficient
- Computation: NICs have power and area concerns
- Architecture: above all, must support 10 Gb/s links
- What are the computation and memory requirements?
- What architecture efficiently meets them?
- What firmware organization should be used?

Mechanisms for an Efficient Programmable 10 Gb/s NIC
- A partitioned memory system: low-latency access to control structures; high-bandwidth, high-capacity access to frame data
- A distributed task-queue firmware: uses frame-level parallelism to scale across many simple, low-frequency processors
- New RMW instructions: reduce firmware frame-ordering overheads by 50% and reduce the clock frequency requirement by 17%

Outline
- Motivation
- How programmable NICs work
- Architecture: requirements, design
- Frame-parallel firmware
- Evaluation

How Programmable NICs Work
[Block diagram: PCI Interface, Processor(s), and Ethernet Interface connected by a memory bus, sitting between the host PCI bus and the Ethernet link]

Per-frame Requirements

              Instructions   Data Accesses
  TX Frame        281            101
  RX Frame        253             85

Processing and control data requirements per frame, as determined by dynamic traces of the relevant NIC functions.

Aggregate Requirements (10 Gb/s, maximum-sized frames)

              Instruction    Control Data    Frame Data
              Throughput     Bandwidth       Bandwidth
  TX Frame     229 MIPS       2.6 Gb/s       19.75 Gb/s
  RX Frame     206 MIPS       2.2 Gb/s       19.75 Gb/s
  Total        435 MIPS       4.8 Gb/s       39.5  Gb/s

1514-byte frames at 10 Gb/s arrive at 812,744 frames/s, so the per-frame instruction and data-access counts above translate directly into these aggregate rates.

Meeting 10 Gb/s Requirements with Hardware
- Processor architecture: at least 435 MIPS within an embedded device; does NIC firmware have ILP?
- Memory architecture: low-latency control data and high-bandwidth, high-capacity frame data; how can one memory system provide both?

ILP Processors for NIC Firmware?
- ILP is limited by data and control dependences
- Analysis of a dynamic trace reveals those dependences

  IPC            In-order 1   In-order 2   In-order 4   Out-of-order 1   Out-of-order 2   Out-of-order 4
  Perfect BP        0.87         1.19         1.34           1.00             1.96             2.65
  Perfect 1BP       0.87         1.19         1.33           1.00             1.74             2.00
  No BP             0.87         1.13         1.17           0.88             1.21             1.29

Processors: 1-Wide, In-order
- 2x performance is costly: a 1-wide in-order core reaches 0.87 IPC, while a 2-wide out-of-order core reaches only 1.74 with perfect branch prediction (and 1.21 with none)
- Branch prediction, reorder buffer, renaming logic, and wakeup logic translate to more than 2x core power and area costs
- Great for a general-purpose processor; not for an embedded device
- Other opportunities for parallelism? Yes:
  - Many steps to process a frame: run them simultaneously
  - Many frames need processing: process them simultaneously
- Use parallel single-issue cores

Memory Architecture
- Competing demands:
  - Frame data: high bandwidth and high capacity for many offload mechanisms
  - Control data: low latency; coherence among the processors, PCI interface, and Ethernet interface
- The traditional solution, caches:
  - Advantages: low latency, transparent to the programmer
  - Disadvantages: hardware costs (tag arrays, coherence)
  - In many applications, the advantages outweigh the costs

Are Caches Effective?
[Plot: hit ratio (percent) versus cache size, 16 B to 32 KB, for the 6-processor hit ratio]
SMPCache trace analysis of a 6-processor NIC architecture.

Choosing a Better Organization
[Diagrams contrasting a cache hierarchy with a partitioned organization]

Putting it All Together
[Block diagram: an instruction memory feeding per-processor instruction caches (I-Cache 0 through I-Cache P-1) and CPUs (CPU 0 through CPU P-1); a (P+4) x S 32-bit crossbar connecting them to S scratchpads (Scratchpad 0 through S-pad S-1); plus the PCI interface on the PCI bus, an external memory interface to off-chip DRAM, and the Ethernet interface]
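The deck shows only the hardware diagram for this organization; as a rough illustration of how firmware might use it, here is a minimal C sketch. All names and sizes (scratchpad, frame_dram, frame_desc_t, record_rx_frame, the 16 KB and 2 MB capacities) are assumptions invented for the example, not taken from the actual firmware: per-frame control descriptors sit in a small, explicitly managed scratchpad, while payloads occupy a large off-chip DRAM region that DMA engines fill and drain.

/*
 * Minimal sketch of the partitioned data memory idea.  Names and sizes are
 * assumed for illustration; this is not the actual firmware's layout.
 * Control data -> small on-chip scratchpad (low latency, shared via crossbar).
 * Frame data   -> large off-chip DRAM (high bandwidth, high capacity, DMA-fed).
 */
#include <stddef.h>
#include <stdint.h>

#define SCRATCHPAD_BYTES  (16 * 1024)        /* small on-chip bank (assumed size) */
#define FRAME_DRAM_BYTES  (2 * 1024 * 1024)  /* large off-chip buffer (assumed)   */
#define BUFFER_BYTES      2048               /* one maximum-sized frame fits      */

/* Per-frame control descriptor: a few words that every processing step
 * reads or writes, so it belongs in the low-latency scratchpad. */
typedef struct {
    uint32_t dram_offset;  /* where this frame's payload sits in frame DRAM */
    uint16_t length;       /* frame length in bytes                         */
    uint8_t  ready;        /* set once the payload DMA has completed        */
    uint8_t  owner;        /* processor currently handling the frame        */
} frame_desc_t;

/* On real hardware these are distinct physical memories reached through the
 * crossbar and memory interface; plain arrays stand in for them here so the
 * sketch compiles on a host. */
static uint8_t scratchpad[SCRATCHPAD_BYTES];
static uint8_t frame_dram[FRAME_DRAM_BYTES];

/* The descriptor ring is carved out of the scratchpad, not cached. */
static frame_desc_t *const desc_ring = (frame_desc_t *)scratchpad;

/* Map a descriptor slot to its payload buffer in off-chip DRAM. */
static uint8_t *frame_payload(uint32_t slot)
{
    return &frame_dram[(size_t)slot * BUFFER_BYTES];
}

/* Example: record where a newly received frame was DMA'd. */
static void record_rx_frame(uint32_t slot, uint16_t length)
{
    desc_ring[slot].dram_offset = slot * BUFFER_BYTES;
    desc_ring[slot].length      = length;
    desc_ring[slot].ready       = 1;
}

The point of the split is that latency-sensitive control accesses never compete with bulk frame traffic for bandwidth, and, because each structure has a single well-defined home, no coherence hardware is needed.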
Parallel Firmware
- NIC processing steps are already well defined
- Previous Gigabit NIC firmware divides the steps between 2 processors
- ...but does this mechanism scale?

Task Assignment with an Event Register
- The event register holds one bit per event type (PCI read bit, SW event bit, and other bits)
- The PCI interface finishes its work and sets the corresponding bit
- Processors inspect the register to find pending work and enqueue transactions
- Processors pass TX data to the Ethernet interface

Task-level Parallel Firmware
[Timeline: PCI read hardware status and PCI read bit shown over time]
- Proc 0: Transfer DMAs 0-4, idle, Transfer DMAs 5-9, idle
- Proc 1: idle, Process DMAs 0-4, idle, Process DMAs 5-9
- Because each processor owns one task type, both spend much of the timeline idle

Frame-level Parallel Firmware
[Timeline: PCI read hardware status shown over time]
- Proc 0: Transfer DMAs 0-4, Build Event, Process DMAs 0-4
- Proc 1: idle, Transfer DMAs 5-9, Build Event, Process DMAs 5-9
- Each processor carries its own group of frames through every step, reducing idle time

Evaluation Methodology
- Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
- Idea: model everything inside the NIC
  - Memory latency, bandwidth, and contention modeled precisely
  - Processors modeled in detail
  - NIC I/O (PCI and Ethernet interfaces) modeled in detail
- Verified by modeling the Tigon 2 Gigabit NIC (LCTES 2004)
- Gather performance and trace data

Scaling in Two Dimensions
[Plot: throughput (Gb/s) versus core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit shown]

Processor Performance

  IPC Component                IPC
  Execution                    0.72
  Miss Stalls                  0.01
  Load Stalls                  0.12
  Scratchpad Conflict Stalls   0.05
  Pipeline Stalls              0.10
  Total                        1.00

- Achieves 83% of theoretical peak IPC (0.72 of the 0.87 achievable by a single-issue in-order core)
- Small I-caches work
- Sensitive to memory stalls: half of the loads are part of a load-to-use sequence
- Conflict stalls could be reduced with more ports and more banks

Reducing Frame Ordering Overheads
- Firmware ordering is costly: 30% of execution
- Synchronization and bitwise checks/updates occupy the processors and memory
- Solution: atomic bitwise operations that also update a pointer according to the last set location

Maintaining Frame Ordering
- A frame status array holds one bit per frame (index 0, 1, 2, ...)
- CPU A and CPU B prepare frames and set their status bits; CPU C detects completed frames
- Software-only sequence: LOCK, iterate over the status array, notify the hardware (Ethernet interface), UNLOCK

RMW Instructions Reduce Clock Frequency
- Performance: 6 x 166 MHz with RMW instructions equals 6 x 200 MHz without them
- Dynamically tasked firmware balances the benefit: performance is equivalent at all frame sizes
- 17% reduction in the frequency requirement
- Send cycles reduced by 28.4%; receive cycles reduced by 4.7%

Conclusions
A programmable 10 Gb/s NIC. This NIC architecture relies on:
- Data memory system: a partitioned organization, not coherent caches
- Processor architecture: parallel scalar processors
- Firmware: a frame-level parallel organization
- RMW instructions: reduce ordering overheads
A programmable NIC is a substrate for offload services.

Comparing Frame Ordering Methods
[Plot: full-duplex throughput (Gb/s) versus UDP datagram size (0-1400 bytes) for the Ethernet limit (duplex), the 6 x 200 MHz software-only configuration, and the 6 x 166 MHz RMW-enhanced configuration]
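To make the ordering overhead concrete, below is a minimal sketch of the software-only path described on the "Maintaining Frame Ordering" slide. It is an illustration under assumptions: the names frame_done, next_to_notify, notify_hw, lock, and unlock are invented, and the real firmware's data structures and ISA extension are not reproduced; the closing comment notes where the proposed RMW instruction would remove work.

/*
 * Sketch of frame-ordering bookkeeping on a multiprocessor NIC.
 * All names are hypothetical stand-ins, not the actual firmware API.
 */
#include <stdint.h>

#define MAX_FRAMES 64

static volatile uint8_t frame_done[MAX_FRAMES]; /* status array: 1 = frame completed  */
static uint32_t         next_to_notify;         /* oldest frame not yet handed to HW  */

/* Firmware primitives, stubbed so the sketch is self-contained. */
static void lock(void)   { /* would acquire a shared spin-lock */ }
static void unlock(void) { /* would release it                 */ }
static void notify_hw(uint32_t first, uint32_t count)
{
    (void)first; (void)count; /* would tell the Ethernet interface to send these frames */
}

/*
 * Software-only path: whichever processor finishes frame 'idx' takes the
 * lock, marks the frame, and iterates over the status array looking for a
 * contiguous run of completed frames starting at next_to_notify, so frames
 * reach the hardware in order.  This lock-and-iterate work is what the
 * slides report as roughly 30% of execution time.
 */
void frame_completed_sw(uint32_t idx)
{
    lock();
    frame_done[idx] = 1;

    uint32_t first = next_to_notify;
    uint32_t count = 0;
    while (frame_done[next_to_notify]) {
        frame_done[next_to_notify] = 0;
        next_to_notify = (next_to_notify + 1) % MAX_FRAMES;
        count++;
    }
    if (count > 0)
        notify_hw(first, count);
    unlock();
}

/*
 * The proposed RMW instruction folds "set my bit and advance the pointer
 * past the completed run" into one atomic hardware operation, so the
 * explicit lock and scan above drop out of the critical path.  That change
 * is a hardware feature and is only described here, not modeled in C.
 */

The scan must run under a lock because several processors complete frames concurrently; collapsing it into a single atomic read-modify-write both shortens the critical section and removes instructions, which is consistent with the reported 28.4% reduction in send cycles.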