
Locality-Aware Request Distribution in Cluster-based Network Servers
Presented by: Kevin Boos
Authors: Vivek S. Pai, Mohit Aron, et al.
Rice University
ASPLOS 1998
*** Figures adapted from original presentation ***
Time Warp to 1998
• Rapid Internet growth
• Bandwidth limitations
• “Cheap” PCs and “fast” LANs
• Need for increased throughput
2
Clustered Servers
[Figure: multiple clients sending requests to a cluster of back-end nodes]
3
Weighted Round Robin (WRR)
[Figure: front-end node distributing requests for targets A, B, and C across the back-end nodes in round-robin order, so every node ends up caching a mix of all targets]
4
Pure Locality-Based Distribution
[Figure: front-end node routing each target to a fixed back-end node, so all requests for a given target go to the same node regardless of its load]
5
Motivation for Change
• Weighted Round Robin
  – Disregards content on back-end nodes
  – Many cache misses
  – Limited by disk performance
• Pure Locality-Based Distribution
  – Disregards current load on back-end nodes
  – Uneven load distribution
  – Inefficient use of resources
6
LARD Concepts
• Locality-Aware Request Distribution
• Goal: improve performance
  – Higher throughput
  – Higher cache hit rates
  – Reduced disk access
• Even load distribution + content-based distribution
  – The best of both algorithms
7
Outline
• Basic LARD Algorithm
• Improvements to LARD
• TCP Handoff Protocol
• Simulation and Results
• Prototype Implementation and Testing
8
Outline
• Basic LARD Algorithm
• Improvements to LARD
• TCP Handoff Protocol
• Simulation and Results
• Prototype Implementation and Testing
9
Basic LARD Algorithm
• Front-end maps target content to back-end nodes
  – 1-to-1 mapping
• First request for each target is assigned to the least-loaded back-end node
• Subsequent requests are distributed to the same back-end node based on the target-content mapping
  – Unless overloaded…
  – Re-assigns the target content to a new back-end node (see the sketch after this slide)
10
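
Below is a minimal Python sketch of this dispatch policy, written from the slide's description rather than the paper's pseudocode; the names (server, load, dispatch) and the threshold values are illustrative assumptions.

    # Sketch of basic LARD: one back-end node per target.
    T_LOW, T_HIGH, T_LIMIT = 25, 65, 130    # illustrative thresholds

    server = {}                             # target -> assigned back-end node
    load = {"n1": 0, "n2": 0, "n3": 0}      # node -> open connection count

    def least_loaded(nodes):
        return min(nodes, key=lambda n: load[n])

    def dispatch(target, nodes):
        node = server.get(target)
        if node is None:
            # First request for this target: least-loaded node wins.
            node = least_loaded(nodes)
            server[target] = node
        elif load[node] > T_LIMIT or (
                load[node] > T_HIGH and load[least_loaded(nodes)] < T_LOW):
            # Assigned node is overloaded while a lightly loaded node
            # exists: re-assign the target, accepting one round of
            # cache misses on the new node in exchange for balance.
            node = least_loaded(nodes)
            server[target] = node
        load[node] += 1                     # decremented when the connection closes
        return node

    # Example use:
    node = dispatch("/index.html", list(load))   # -> "n1", "n2", or "n3"
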
Flow of Basic LARD
[Figure: request flow from client through the front-end to the mapped back-end node]
11
Determining Load in Basic LARD
• Ask the server?
  – Introduces unnecessary communication
• Current load = number of open connections
  – Tracked in the front-end node
• Use thresholds to determine when to re-balance
  – Low, High, and Limit
• Re-balance when (load > T_limit) or (load > T_high and there is a “free” node with load < T_low)
  – This bookkeeping is sketched after this slide
12
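
A small sketch of how the front-end can track load without polling the back-ends, again with hypothetical names: it simply counts connections it has handed off and not yet seen closed.

    from collections import defaultdict

    T_LOW, T_HIGH, T_LIMIT = 25, 65, 130   # illustrative values

    open_conns = defaultdict(int)          # node -> open connection count

    def on_handoff(node):
        open_conns[node] += 1              # request handed to a back-end

    def on_close(node):
        open_conns[node] -= 1              # back-end finished its response

    def should_rebalance(node, all_nodes):
        lightest = min(open_conns[n] for n in all_nodes)
        return open_conns[node] > T_LIMIT or (
            open_conns[node] > T_HIGH and lightest < T_LOW)
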
Outline
• Basic LARD Algorithm
• Improvements to LARD
• TCP Handoff Protocol
• Simulation and Results
• Prototype Implementation and Testing
13
LARD Needs Improvement
• Only one back-end node per target content
  – Working set is a single node
• Front-end must limit total connections
• Still need to increase throughput
  – One node per content type is unrealistic
  – …add more back-end nodes?
14
LARD/R
• LARD with Replication
• Maps target content to a set of back-end nodes
  – Working set is several nodes with similar cache content
• Sends new requests to the least-loaded node in the set
• Moves nodes to/from sets based on load imbalance
  – Idle nodes in a low-load set are moved to a higher-load set (sketched after this slide)
15
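
A hedged Python sketch of LARD/R dispatch under the same illustrative thresholds; the grow/shrink policy is simplified from the paper, and the names and the K timeout are assumptions.

    import time

    T_LOW, T_HIGH, T_LIMIT = 25, 65, 130   # illustrative thresholds
    K = 20.0                               # assumed seconds before shrinking a set

    server_set = {}                        # target -> set of back-end nodes
    last_change = {}                       # target -> time the set last changed
    load = {"n1": 0, "n2": 0, "n3": 0}     # node -> open connection count

    def least_loaded(nodes):
        return min(nodes, key=lambda n: load[n])

    def dispatch(target, all_nodes):
        nodes = server_set.setdefault(target, {least_loaded(all_nodes)})
        last_change.setdefault(target, time.monotonic())
        node = least_loaded(nodes)
        others = set(all_nodes) - nodes
        if others and (load[node] > T_LIMIT or (
                load[node] > T_HIGH and load[least_loaded(others)] < T_LOW)):
            # Imbalance: replicate the target onto a lightly loaded node.
            nodes.add(least_loaded(others))
            last_change[target] = time.monotonic()
        elif len(nodes) > 1 and time.monotonic() - last_change[target] > K:
            # Set has been stable for a while: drop one replica so its
            # cache space can serve other targets.
            nodes.remove(least_loaded(nodes))
            last_change[target] = time.monotonic()
        node = least_loaded(nodes)
        load[node] += 1                    # decremented on connection close
        return node
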
Flow of LARD/R
[Figure: request flow with a set of back-end nodes per target]
16
LARD Outline
• Basic LARD Algorithm
• Improvements to LARD
• TCP Handoff Protocol
• Simulation and Results
• Prototype Implementation and Testing
17
Determining Content Type
• How do we determine content in the front-end?
  – Front-end must see the network traffic
• Standard TCP assumptions
  – Requests are small and light
  – Responses are big and heavy
• How do we forward requests?
18
Potential TCP Solutions
• Simple TCP proxy
  – Everything must flow through the front-end node
  – Can inspect all incoming content
  – Cannot respond directly from back-end to client
  – But the front-end can also inspect all outgoing content
    • Better for persistent connections
19
TCP Connection Handoff
• Front-end connects to client
• Inspects content
• Forwards request to back-end node
• Response returned directly to the client from the back-end node (see the sketch after this slide)
20
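
A rough sketch of the front-end side of this flow. Real TCP handoff migrates kernel connection state to the back-end, which plain sockets cannot express, so hand_off() below is a stub; pick_backend and all other names here are hypothetical.

    import socket

    def hand_off(conn, backend, request):
        # Stand-in for the kernel-level handoff protocol: the front-end
        # would transfer the established TCP state plus the buffered
        # request to the back-end, then forward only the client's ACKs.
        raise NotImplementedError("needs kernel support for TCP handoff")

    def front_end_loop(listen_port, pick_backend):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("", listen_port))
        srv.listen(128)
        while True:
            conn, _ = srv.accept()          # front-end accepts the connection
            request = conn.recv(4096)       # requests are small: one read suffices
            target = request.split()[1]     # crude URL parse of "GET /path HTTP/1.0"
            backend = pick_backend(target)  # LARD/LARD-R chooses the node
            hand_off(conn, backend, request)
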
LARD Outline
• Basic LARD Algorithm
• Improvements to LARD
• TCP Handoff Protocol
• Simulation and Results
• Prototype Implementation and Testing
21
Evaluation Goals
• Throughput
  – Requests/second served by the entire cluster
• Hit rate
  – (requests that hit the memory cache) / (total requests)
• Underutilization time
  – Time during which a node’s load is ≤ 40% of T_low
(the three metrics are restated in code after this slide)
22
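
As a compact restatement of the metrics, assuming hypothetical per-request records with a cache_hit flag:

    def metrics(requests, run_seconds, underutilized_seconds):
        throughput = len(requests) / run_seconds                 # req/s, whole cluster
        hit_rate = sum(r.cache_hit for r in requests) / len(requests)
        underutilization = underutilized_seconds / run_seconds   # load <= 0.4 * T_low
        return throughput, hit_rate, underutilization
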
Simulation Model
• 300MHz Pentium II
• 32MB memory (cache)
• 100Mbps Ethernet
• Traces from web servers at Rice and IBM
23
Simulation Results – Prior Work
• Weighted Round Robin
  – Lowest throughput
  – Highest cache miss ratio
  – But lowest idle time
• Pure Locality-Based
  – An increase in nodes → a decrease in cache miss ratio
  – But idle time increases (unbalanced load)
  – Only a minor improvement over WRR
24
Simulation Results – LARD & LARD/R
• Throughput ~4x better (8 nodes)
  – WRR would need nodes with a 10x larger cache size
• CPU-bound after 8 nodes
• Cache miss rate decreases
• Only 1% idle time on average
25
Simulation Results – Throughput
[Figure omitted]
26
Simulation Results – Cache Misses
[Figure omitted]
27
Simulation Results – Idle Time
[Figure omitted]
28
What Affects Performance?
• WRR is disk-bound; LARD/R is CPU-bound
  – Increasing CPU speed improves LARD/R, not WRR
  – Adding more disks improves WRR, not LARD/R
  – LARD/R shows no improvement if a node has > 2 disks
• WRR is not scalable
29
LARD Outline
• Basic LARD Algorithm
• Improvements to LARD
• TCP Handoff Protocol
• Simulation and Results
• Prototype Implementation and Testing
30
Prototype Implementation
• One front-end PC
  – 300MHz Pentium II, 128MB RAM
• 6 back-end PCs
• 7 client PCs
  – 166MHz Pentium Pro, 64MB RAM
• 100Mbps Ethernet, 24-port switch
31
Prototype Testing Results
[Figure omitted]
32
Evaluation Shortcomings
• What influences the results more?
  – The LARD/R strategy?
  – The TCP handoff protocol?
33
Conclusion
• LARD and LARD/R significantly better than WRR
  – Higher throughput
  – Better CPU utilization
  – More frequent cache hits
  – Reduced disk access
• Combines the benefits of locality-based and load-balanced distribution
• Scalable at low cost
34