Exploiting Task-level Concurrency in a Programmable Network Interface
Hyong-youb Kim, Vijay S. Pai, and Scott Rixner
Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/
June 11, 2003

Why a Programmable Network Interface?
More complex functionality on the network interface
– TCP offload, iSCSI, etc.
Easy maintenance
– Bug fixes, upgrades, customization, etc.
Performance?
– 51% less web server throughput than an ASIC NIC
– A big problem

Improving Performance
Increase clock speed and/or complexity
– Typical solutions for general-purpose processors
– Do not work for embedded processors
Design constraints: limited power and area
– Power is proportional to C·V²·f
– Higher f requires higher V, so power is roughly proportional to f³
– Added complexity increases C for only marginal gains
Implication: a simple, low-frequency processor

Use Parallel Programming
Use multiple programmable cores
– Increase computational capacity
– Achieve performance within the power limit
• Consume far less power than a higher-frequency core
Improvements with two cores over a single core
– 65-157% for bidirectional traffic
– 27-51% for web server workloads
– Web server throughput comparable to an ASIC NIC

Outline
Background
– Tigon programmable Gigabit Ethernet controller
– Network interface processing: send/receive
Parallelization of Firmware
Experimental Results
Conclusion

Tigon Gigabit Ethernet Controller
Two programmable cores
– MIPS-based, running at 88 MHz
– Small on-chip memory (scratch pad) per core
Shared off-chip SRAM
Supports event-driven firmware, with no interrupts
– Event handlers run to completion
– Handlers on the same core require no synchronization
Released firmware fully utilizes only one core
No previous Ethernet firmware utilizes both cores

Send Processing
Tigon events: Send Data Ready, Mailbox, DMA Read Complete, Send Buffer Descriptor Ready, Update Send Consumer, DMA Write Complete
[Diagram: the host CPU, main memory, and bridge sit on one side of the PCI bus, the network interface card on the other; the host uses memory-mapped I/O, the NIC uses DMA and raises interrupts.]
1. Host creates a buffer descriptor
2. Alert: buffer descriptor produced (mailbox)
3. NIC fetches the buffer descriptor
4. NIC transfers the packet (DMA)
5. NIC transmits the packet
6. Alert: buffer descriptor consumed (interrupt)

Receive Processing: Pre-allocation
Tigon events: Receive Buffer Descriptor Ready, Mailbox, DMA Read Complete
1. Host allocates a receive buffer
2. Host creates a buffer descriptor
3. Alert: buffer descriptor produced (mailbox)
4. NIC fetches the buffer descriptor

Receive Processing: Actual Receive
Tigon events: Receive Complete, Update Receive Return Producer, DMA Write Complete
1. NIC stores the packet arriving from the network
2. NIC creates a buffer descriptor
3. NIC transfers the packet (DMA)
4. NIC transfers the buffer descriptor (DMA)
5. Alert: buffer descriptor produced (interrupt)

Tigon Uniprocessor Performance
Decreasing maximum UDP throughput due to network headers and per-frame overhead
[Chart: the Intel PRO/1000 MT achieves 100% more throughput than the Tigon with uniprocessor firmware; the Netgear GA622T is also shown.]

Outline
Background
Parallelization of Firmware
– Principles
– Resource Sharing Patterns
– Partitioning Process
Experimental Results
Conclusion

Principles
Identify the unit of concurrency
– The event handler
Analyze resource sharing patterns
Profile the uniprocessor firmware
Partition event handlers so as to
– Balance load
– Minimize synchronization
– Maximize on-chip memory utilization

Resource Sharing Patterns
[Diagram: the nine event handlers — Mailbox, Receive Buffer Descriptor Ready, DMA Write Complete, Send Buffer Descriptor Ready, DMA Read Complete, Update Receive Return Producer, Receive Complete, Update Send Consumer, and Send Data Ready — and the resources they share: shared data objects, the shared DMA read channel, and the shared DMA write channel.]

Partitioning Process
Handler load profile on one core:
– Mailbox: 4%
– Receive Buffer Descriptor Ready: 6%
– DMA Write Complete: 6%
– DMA Read Complete: 14%
– Send Buffer Descriptor Ready: 3%
– Update Receive Return Producer: 5%
– Receive Complete: 30%
– Send Data Ready: 31%
– Update Send Consumer: 1%
Resulting split: CPU A 47%, CPU B 53%

Final Partition
CPU A (47%): Receive Complete, DMA Write Complete, Receive Buffer Descriptor Ready, Update Receive Return Producer
CPU B (53%): Send Data Ready, DMA Read Complete, Mailbox, Send Buffer Descriptor Ready, Update Send Consumer
Shared: data objects, DMA read channel, DMA write channel

Outline
Background
Parallelization of Firmware
Experimental Results
– Improved Maximum Throughput
– Improved Web Server Throughput
Conclusion

Experimental Setup
Network interface card
– 3Com 710024 Gigabit Ethernet interface card, based on the Tigon
Firmware versions
– Uniprocessor firmware: 12.4.13 from the original manufacturer
– Parallel firmware: modified version of 12.4.13
Benchmarks
– UDP bidirectional, unidirectional, and ping traffic
– Web server (thttpd) and software router (Click)
Testbed
– PCs with an AMD Athlon 2600+ CPU and 2GB RAM
– FreeBSD 4.7

Overall Improvements
[Chart: 65-157% improvement over the uniprocessor firmware.]

Sources of Improvements
[Chart: 37% improvement due to using two processors; 70% improvement due to the scratch pads.]

Comparison to ASIC NICs
[Chart: with parallel firmware, the Intel PRO/1000 MT (Intel, 2002) leads the 3Com 710024 (Tigon, 1997) by only 21%; the Netgear GA622T (Nat. Semi., 2001) and the uniprocessor 3Com 710024 (Tigon, 1997) are shown for comparison.]

Impact on Web Server Throughput
Overall 27-51% improvement
Comparable to ASIC NICs

Parallelization Makes Programmability Viable
Programmability is useful for complex functions
Clock speed is limited for an embedded processor
– Limited uniprocessor performance
Use multiple cores to improve performance
Two cores vs. a single core:
– 65% increase in maximum throughput
– 51% increase in web server throughput
– Web server throughput comparable to ASIC NICs

Backup Slides
– UDP Send: Overall Improvements
– UDP Send: Sources of Improvements
– UDP Receive: Overall Improvements
– UDP Receive: Sources of Improvements
– UDP Ping: Overall Improvements
– UDP Ping: Sources of Improvements
– Impact on Routing Throughput
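The run-to-completion, statically partitioned event model described in the deck can be sketched as follows. This is an illustrative model only, not the actual Tigon firmware: the nine event names and the CPU A/CPU B assignment follow the slides, but the bitmask encoding, the `dispatch` loop, and the trivial handler bodies are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The nine Tigon events named on the slides; the encoding is hypothetical. */
enum { EV_MAILBOX, EV_RECV_BD_READY, EV_DMA_WRITE_DONE, EV_SEND_BD_READY,
       EV_DMA_READ_DONE, EV_UPDATE_RECV_RET_PROD, EV_RECV_COMPLETE,
       EV_UPDATE_SEND_CONS, EV_SEND_DATA_READY, EV_COUNT };

/* Static handler-to-core assignment from the "Final Partition" slide
 * (0 = CPU A at 47% load, 1 = CPU B at 53% load). */
static const int cpu_of[EV_COUNT] = {
    [EV_RECV_COMPLETE] = 0, [EV_DMA_WRITE_DONE] = 0,
    [EV_RECV_BD_READY] = 0, [EV_UPDATE_RECV_RET_PROD] = 0,
    [EV_SEND_DATA_READY] = 1, [EV_DMA_READ_DONE] = 1,
    [EV_MAILBOX] = 1, [EV_SEND_BD_READY] = 1, [EV_UPDATE_SEND_CONS] = 1,
};

static int handled[EV_COUNT];   /* per-event invocation counts */

/* A handler runs to completion; here it merely records that it ran. */
static void run_handler(int ev) { handled[ev]++; }

/* One scheduling pass for one core: run every pending handler assigned to
 * this core and clear its bit in the event mask. Because handlers on the
 * same core never preempt one another, they need no locks to protect
 * state shared only within that core. */
static uint32_t dispatch(int cpu, uint32_t pending)
{
    for (int ev = 0; ev < EV_COUNT; ev++)
        if ((pending & (1u << ev)) && cpu_of[ev] == cpu) {
            run_handler(ev);
            pending &= ~(1u << ev);
        }
    return pending;
}
```

With all nine events pending, a `dispatch(0, pending)` pass followed by `dispatch(1, pending)` drains the mask, with each core touching only its own handlers; cross-core state (the shared data objects and DMA channels on the slides) would still require explicit synchronization.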