Increasing Web Server Throughput with Network Interface Data Caching

Hyong-youb Kim, Vijay S. Pai, and Scott Rixner
Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/
October 9, 2002

Anatomy of a Web Request
[Diagram: a static-content web server. A request arrives from the network; file data and headers cross the local interconnect from main memory to the network interface. Interconnect utilization: 95%.]

Problem
Inefficient use of the local interconnect
– Repeated transfers
– Every bit of data sent to the network crosses the interconnect
Local interconnect becomes the bottleneck
Transfer overhead exacerbates the inefficiency
– Overhead reduces available bandwidth
– E.g., Peripheral Component Interconnect (PCI): 30% transfer overhead

Solution
Network interface data caching
– Cache data in the network interface
– Reduces interconnect traffic
– Software-controlled cache
– Minimal changes to the operating system
Prototype web server
– Up to 57% reduction in PCI traffic
– Up to 31% increase in server performance
– Peak of 1571 Mb/s of content throughput
• Breaks the PCI bottleneck

Outline
Background
Network Interface Data Caching Implementation
Experimental Prototype / Results
Summary

Network Interface Data Cache
[Diagram: a software-controlled cache added to the network interface. Headers and uncached file data still cross the interconnect; cached file data is sent directly from the network interface cache, bypassing the interconnect.]

Web Traces
Five web traces with realistic working sets and file distributions:
– Berkeley computer science department
– IBM
– NASA Kennedy Space Center
– Rice computer science department
– 1998 World Cup

Content Locality
Block cache with 4 KB block size
8-16 MB caches capture the locality

Outline
Background
Network Interface Data Caching Implementation
– OS modification / NIC API
Experimental Prototype / Results
Summary

Unmodified Operating System
Transmit file data flow:
1. Identify pages
2. Protocol processing: break the data into packets
3. Inform the network interface
[Diagram: file pages flow through the network stack, which produces packets handed to the device driver.]

Modified Operating System
The OS completely controls the network interface data cache, with minimal changes to the OS:
1. Identify pages (unmodified)
2. Annotate (new step)
3. Protocol processing: break into packets (unmodified)
4. Query directory (new step)
5. Inform the network interface (unmodified)
[Diagram: same transmit path as before, with a cache directory added in the device driver.]

Operating System Modification
Device driver
– Completely controls the cache
– Makes allocation/use/replacement decisions
Cache directory (in the device driver)
– An entry is a tuple of:
• file identifier
• offset within the file
• file revision number
• flags
– Sufficient to maintain cache coherence

Network Interface API
Initialize
Insert data into the cache
Append data to a packet
Append cached data to a packet
[Diagram: appends copy data from main memory across the interconnect into the TX buffer; cached appends copy from the on-NIC cache directly into the TX buffer.]

Outline
Background
Network Interface Data Caching Implementation
Experimental Prototype / Results
Summary

Prototype Server
Athlon 2200+ processor, 2 GB RAM
64-bit, 33 MHz PCI bus (2 Gb/s)
Two Gigabit Ethernet NICs (4 Gb/s)
– Based on the programmable Tigon 2 controller
– Firmware implements the new API
FreeBSD 4.6
– 850 lines of new code / 150 lines of kernel changes
thttpd web server
– High-performance, lightweight web server
– Supports zero-copy sendfile

Results: PCI Traffic
PCI saturated: ~30% transfer overhead, ~60% content traffic
1198 Mb/s of HTTP content; ~1260 Mb/s is the limit!
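The cache directory entry described above (file identifier, offset, revision number, flags) can be sketched in C. This is a minimal, hypothetical illustration, not the prototype's actual FreeBSD driver code: the names (nic_cache_entry, nic_cache_lookup, nic_cache_insert), the linear scan, and the 16 MB / 4 KB sizing are assumptions for clarity.

```c
#include <stdint.h>

/* One directory entry: identifies a cached 4 KB block by file and offset.
 * The revision number lets the driver invalidate stale copies when a
 * file changes, which is what keeps the cache coherent. */
struct nic_cache_entry {
    uint64_t file_id;   /* file identifier */
    uint64_t offset;    /* offset of the block within the file */
    uint32_t revision;  /* file revision number */
    uint32_t flags;     /* e.g. valid bit */
};

#define NIC_CACHE_VALID  0x1u
#define NIC_CACHE_BLOCKS 4096    /* e.g. 16 MB cache / 4 KB blocks */

static struct nic_cache_entry directory[NIC_CACHE_BLOCKS];

/* Returns the cache block index if (file, offset, revision) is cached,
 * or -1 on a miss. A real driver would hash rather than scan. */
int nic_cache_lookup(uint64_t file_id, uint64_t offset, uint32_t revision)
{
    for (int i = 0; i < NIC_CACHE_BLOCKS; i++) {
        const struct nic_cache_entry *e = &directory[i];
        if ((e->flags & NIC_CACHE_VALID) &&
            e->file_id == file_id && e->offset == offset &&
            e->revision == revision)
            return i;
    }
    return -1;
}

/* Install a block in a slot, evicting whatever occupied it. The driver
 * makes this allocation/replacement decision, not the NIC. */
void nic_cache_insert(int slot, uint64_t file_id, uint64_t offset,
                      uint32_t revision)
{
    directory[slot] = (struct nic_cache_entry){
        .file_id = file_id, .offset = offset,
        .revision = revision, .flags = NIC_CACHE_VALID,
    };
}
```

On a hit, the driver annotates the packet so the NIC appends the block from its on-board cache; on a miss, the driver picks a slot, records it in the directory, and tells the NIC to insert the data as it transfers it. A lookup keyed on the revision number means a stale block simply misses and gets refetched, so no explicit invalidation message is needed.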
Results: PCI Traffic Reduction
36-57% reduction with four traces: good temporal reuse
The remaining trace has low temporal reuse: CPU bottleneck, low PCI utilization

Results: World Cup
Temporal reuse: 84%; PCI utilization: 69%
57% traffic reduction; 7% throughput increase
794 Mb/s without caching → 849 Mb/s with caching
CPU bottleneck

Results: Rice
Temporal reuse: 40%; PCI utilization: 91%
40% traffic reduction; 17% throughput increase
1126 Mb/s without caching → 1322 Mb/s with caching
Breaks the PCI bottleneck

Results: NASA
Temporal reuse: 71%; PCI utilization: 95%
54% traffic reduction; 31% throughput increase
1198 Mb/s without caching → 1571 Mb/s with caching
Breaks the PCI bottleneck

Summary
Network interface data caching
– Exploits web request locality
– Network protocol independent
– Interconnect architecture independent
– Minimal changes to the OS
36-57% reduction in PCI traffic
7-31% increase in server performance
Peak of 1571 Mb/s of content throughput
– Surpasses the PCI bottleneck
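The per-trace percentage gains follow directly from the measured throughputs; a quick check of the arithmetic (the helper function is illustrative, not from the prototype):

```c
/* Percentage throughput increase, rounded to the nearest whole percent.
 * Inputs are throughputs in Mb/s without and with NIC data caching. */
int pct_increase(int without_mbps, int with_mbps)
{
    int diff = with_mbps - without_mbps;
    return (diff * 100 + without_mbps / 2) / without_mbps;
}
```

For example, World Cup: (849 - 794) / 794 ≈ 7%; Rice: (1322 - 1126) / 1126 ≈ 17%; NASA: (1571 - 1198) / 1198 ≈ 31%, matching the reported 7-31% range. Note the pattern: the gain tracks how PCI-bound the workload was (NASA at 95% PCI utilization gains the most; World Cup, CPU-bound at 69% utilization, gains the least despite the largest traffic reduction).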