Locality-Aware Request Distribution in Cluster-based Network Servers
Presented by: Kevin Boos
Authors: Vivek S. Pai, Mohit Aron, et al., Rice University
ASPLOS 1998
*** Figures adapted from original presentation ***

Time Warp to 1998
- Rapid Internet growth
- Bandwidth limitations
- "Cheap" PCs and "fast" LANs
- Need for increased throughput

Clustered Servers
[Figure: clients connecting to a cluster of back-end nodes]

Weighted Round Robin (WRR)
[Figure: front-end node dealing requests for targets A, B, and C across the back-end nodes in rotation, regardless of content]

Pure Locality-Based Distribution
[Figure: front-end node sending every request for a given target (A, B, or C) to the same back-end node, regardless of load]

Motivation for Change
Weighted Round Robin:
- Disregards the content cached on back-end nodes
- Many cache misses
- Limited by disk performance
Pure Locality-Based Distribution:
- Disregards the current load on back-end nodes
- Uneven load distribution
- Inefficient use of resources

LARD Concepts
Locality-Aware Request Distribution
Goal: improve performance
- Higher throughput
- Higher cache hit rates
- Reduced disk access
- Even load distribution + content-based distribution
The best of both algorithms

Outline
- Basic LARD Algorithm
- Improvements to LARD
- TCP Handoff Protocol
- Simulation and Results
- Prototype Implementation and Testing

Basic LARD Algorithm
- Front-end maps target content to back-end nodes (a 1-to-1 mapping)
- The first request for each target is assigned to the least-loaded back-end node
- Subsequent requests for that target go to the same back-end node, per the mapping
- Unless that node is overloaded, in which case the target is re-assigned to a new back-end node

Flow of Basic LARD
[Figure: a client request flowing through the front-end to the back-end node mapped to its target]

Determining Load in Basic LARD
- Ask the server? That introduces unnecessary communication
- Instead, current load = number of open connections, tracked at the front-end node
- Use thresholds to determine when to re-balance: Tlow, Thigh, and Tlimit
- Re-balance when (load > Tlimit), or (load > Thigh and there is a "free" node with load < Tlow)

LARD Needs Improvement
- Only one back-end node serves each target, so a target's working set is a single node
- The front-end must limit total connections
- Throughput still needs to increase, and one node per content type is unrealistic
- ...add more back-end nodes per target?

LARD/R
LARD with Replication
- Maps target content to a set of back-end nodes
- A target's working set is several nodes with similar cache contents
- Sends each new request to the least-loaded node in the set
- Moves nodes to/from sets based on load imbalance: idle nodes in a low-load set are moved to a higher-load set
(A sketch of this dispatch logic follows below.)

Flow of LARD/R
[Figure: a client request flowing through the front-end to the least-loaded node in its target's server set]
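To make the dispatch logic of the preceding slides concrete, here is a minimal Python sketch of the basic LARD loop with its threshold-based re-balancing rule, plus the LARD/R "least-loaded node in the set" selection. The class names, method names, and threshold values are illustrative assumptions, not the authors' code; the real front-end tracks open connections inside the kernel.

```python
# Minimal sketch of LARD dispatch (assumed names and threshold values).
T_LOW, T_HIGH, T_LIMIT = 25, 65, 130  # re-balance thresholds (illustrative)

class Node:
    def __init__(self, name):
        self.name = name
        self.load = 0  # load = number of open connections, tracked at the front-end

class LardFrontEnd:
    def __init__(self, nodes):
        self.nodes = nodes
        self.server_for = {}  # target content -> assigned back-end node

    def least_loaded(self):
        return min(self.nodes, key=lambda n: n.load)

    def dispatch(self, target):
        node = self.server_for.get(target)
        if node is None:
            # First request for this target: assign the least-loaded node.
            node = self.least_loaded()
        elif node.load > T_LIMIT or (node.load > T_HIGH
                                     and self.least_loaded().load < T_LOW):
            # Assigned node is overloaded (load > Tlimit), or it is busy
            # while a "free" node exists: re-assign the target.
            node = self.least_loaded()
        self.server_for[target] = node
        node.load += 1  # connection opens; decremented elsewhere when it closes
        return node

class LardRFrontEnd(LardFrontEnd):
    """LARD/R sketch: map each target to a *set* of nodes and pick the
    least-loaded member; growing/shrinking the set on load imbalance
    is omitted here."""
    def dispatch(self, target):
        server_set = self.server_for.setdefault(target, {self.least_loaded()})
        node = min(server_set, key=lambda n: n.load)
        node.load += 1
        return node
```

For example, front_end.dispatch("/index.html") returns the node that should serve the request: the first call assigns the least-loaded node, and later calls stick to it unless the thresholds trigger a re-assignment.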
Determining Content Type
- How do we determine content in the front-end? The front-end must see the network traffic
- Standard TCP assumptions: requests are small and light; responses are big and heavy
- How do we forward requests?

Potential TCP Solutions
Simple TCP Proxy:
- Everything must flow through the front-end node
- The front-end can inspect all incoming content
- The back-end cannot respond directly to the client, but the front-end can also inspect all outgoing content
- Better for persistent connections
(A sketch of the simple proxy appears at the end of these notes.)

TCP Connection Handoff
- The front-end accepts the client connection and inspects the content
- It then hands the request off to a back-end node
- The response is returned directly to the client from the back-end node

Evaluation Goals
- Throughput: requests/second served by the entire cluster
- Hit rate: (requests that hit the memory cache) / (total requests)
- Underutilization time: time during which a node's load is ≤ 40% of Tlow

Simulation Model
- 300MHz Pentium II, 32MB memory (cache)
- 100Mbps Ethernet
- Traces from web servers at Rice and IBM

Simulation Results – Prior Work
Weighted Round Robin:
- Lowest throughput and highest cache miss ratio
- But lowest idle time
Pure Locality-Based:
- Increasing the number of nodes decreases the cache miss ratio
- But idle time increases (unbalanced load)
- Only a minor improvement over WRR

Simulation Results – LARD & LARD/R
- Throughput ~4x better (at 8 nodes); WRR would need nodes with a 10x larger cache to match
- CPU-bound after 8 nodes
- Cache miss rate decreases
- Only 1% idle time on average

Simulation Results – Throughput
[Graph: cluster throughput vs. number of nodes]

Simulation Results – Cache Misses
[Graph: cache miss ratio vs. number of nodes]

Simulation Results – Idle Time
[Graph: node idle time vs. number of nodes]

What Affects Performance?
- WRR is disk-bound; LARD/R is CPU-bound
- Increasing CPU speed improves LARD/R, not WRR
- Adding more disks improves WRR, not LARD/R (LARD/R shows no improvement once a node has more than 2 disks)
- WRR is not scalable

Prototype Implementation
- One front-end PC: 300MHz Pentium II, 128MB RAM
- 6 back-end PCs
- 7 client PCs: 166MHz Pentium Pro, 64MB RAM
- 100Mbps Ethernet, 24-port switch

Prototype Testing Results
[Graph: prototype throughput results]

Evaluation Shortcomings
- Which influences the results more: the LARD/R distribution strategy, or the TCP handoff protocol?

Conclusion
- LARD and LARD/R are significantly better than WRR: higher throughput, better CPU utilization, more frequent cache hits, and reduced disk access
- Combines the benefits of locality-based and load-balanced distribution
- Scalable at low cost
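For reference, here is a minimal Python sketch of the "simple TCP proxy" alternative from the Potential TCP Solutions slide, showing why every byte of the large response must cross the front-end on its way back to the client. The address, port, and one-request-per-connection handling are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the "simple TCP proxy" approach: the front-end relays both the
# request and the entire response, so back-ends cannot reply to clients
# directly. Addresses, ports, and single-shot handling are assumptions.
import socket

BACKEND = ("backend-1.example.com", 8080)  # hypothetical back-end address

def proxy_one_connection(client_sock):
    request = client_sock.recv(4096)        # requests are small ("light")
    with socket.create_connection(BACKEND) as backend_sock:
        backend_sock.sendall(request)
        while True:
            chunk = backend_sock.recv(4096)  # responses are big ("heavy"):
            if not chunk:                    # every chunk crosses the
                break                        # front-end on its way back
            client_sock.sendall(chunk)
    client_sock.close()

def run_front_end(port=80):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))
        srv.listen(128)
        while True:
            client, _addr = srv.accept()
            proxy_one_connection(client)
```

TCP connection handoff avoids exactly this relaying: the front-end inspects only the small request and hands the established connection to the chosen back-end, whose response then flows straight to the client.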