Resource Management in the Cloud

Managing Cloud Resources: Distributed Rate Limiting
Alex C. Snoeren
Kevin Webb, Bhanu Chandra Vattikonda, Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, and Kenneth Yocum
Building and Programming the Cloud Workshop, 13 January 2010

Centralized Internet services

• Hosting with a single physical presence
• However, clients are across the Internet

Cloud-based services

• Resources and clients distributed across the world
• Often incorporates resources from multiple providers (e.g., Windows Live)

Resources in the Cloud

• Distributed resource consumption
  » Clients consume resources at multiple sites
  » Metered billing is state-of-the-art
  » Service "punished" for popularity: those unable to pay are disconnected
• No control of resources used to serve increased demand
  » Overprovision and pray
• Application designers typically cannot describe needs
  » Individual service bottlenecks are varied but severe: IOps, network bandwidth, CPU, RAM, etc.
  » Need a way to balance resource demand

Two lynchpins for success

• Need a way to control and manage distributed resources as if they were centralized
  » All current models from the OS scheduling and provisioning literature assume full knowledge and absolute control
  » (This talk focuses specifically on network bandwidth)
• Must be able to efficiently support rapidly evolving application demand
  » Balance resource needs against the hardware realization automatically, without application-designer input
  » (Another talk if you're interested)

Ideal: Emulate a single limiter

• Make distributed feel centralized
• Packets should experience the same limiter behavior

[Figure: sources (S) and destinations (D) communicating through a set of distributed limiters; ideally the limiters behave as a single limiter 0 ms apart]

Engineering tradeoffs

Accuracy (how close to K Mbps is delivered; flow-rate fairness)
+
Responsiveness (how quickly demand shifts are accommodated)
vs.
Communication Efficiency (how much and how often rate limiters must communicate)

An initial architecture

[Figure: four limiters exchanging gossip messages. At each limiter, packet arrivals feed an estimate of local demand; on an estimate-interval timer, the limiter gossips its estimate to the others, aggregates global demand, sets its local allocation, and enforces the limit.]

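A minimal sketch of this estimate/gossip/allocate loop in Python (class and method names are illustrative assumptions, and the proportional-share policy in on_estimate_timer is just one possible allocation rule; the GTB, GRD, and FPS slides that follow refine that step):

```python
class Limiter:
    """One node in the gossip architecture (illustrative sketch)."""

    def __init__(self, limiter_id, global_limit_bps, peers, estimate_interval=0.5):
        self.limiter_id = limiter_id
        self.global_limit_bps = global_limit_bps
        self.peers = peers                      # limiters we gossip with
        self.estimate_interval = estimate_interval
        self.bytes_this_interval = 0
        self.demands = {limiter_id: 0.0}        # limiter id -> demand (bits/sec)
        self.local_rate_bps = global_limit_bps / (len(peers) + 1)

    def on_packet(self, packet_len_bytes):
        """Track local demand as packets arrive."""
        self.bytes_this_interval += packet_len_bytes

    def on_estimate_timer(self):
        """Every estimate interval: estimate, gossip, aggregate, allocate."""
        local_demand = 8 * self.bytes_this_interval / self.estimate_interval
        self.bytes_this_interval = 0
        self.demands[self.limiter_id] = local_demand
        for peer in self.peers:                 # gossip our estimate
            peer.on_gossip(self.limiter_id, local_demand)
        global_demand = sum(self.demands.values())
        if global_demand > 0:
            # One simple policy: allocate in proportion to local demand.
            self.local_rate_bps = (local_demand / global_demand) * self.global_limit_bps

    def on_gossip(self, sender_id, demand_bps):
        """Record a peer's latest demand estimate."""
        self.demands[sender_id] = demand_bps
```
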
Token bucket limiters

[Figure: arriving packets drain a token bucket whose fill rate is K Mbps]

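For reference, a textbook token bucket in Python (a generic sketch, not the authors' code; parameter names are mine):

```python
import time

class TokenBucket:
    """Classic token bucket: tokens accrue at rate_bps up to burst_bits."""

    def __init__(self, rate_bps, burst_bits):
        self.rate_bps = rate_bps
        self.burst_bits = burst_bits
        self.tokens = burst_bits
        self.last = time.monotonic()

    def admit(self, packet_bits):
        """Forward the packet if enough tokens are available."""
        now = time.monotonic()
        # Refill for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst_bits,
                          self.tokens + (now - self.last) * self.rate_bps)
        self.last = now
        if self.tokens >= packet_bits:
            self.tokens -= packet_bits
            return True     # forward
        return False        # drop (or queue)
```
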
A global token bucket (GTB)?

[Figure: Limiter 1 and Limiter 2 exchange demand info (bytes/sec) in order to emulate one shared token bucket]

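One way to read the figure, as a sketch building on the TokenBucket above (the drain-by-gossip step is my interpretation, not a confirmed implementation detail): each limiter keeps a local replica of the single global bucket and also drains it by the arrivals its peers report.

```python
class GlobalTokenBucket(TokenBucket):
    """GTB sketch: a local replica of one shared, global bucket."""

    def on_gossip(self, remote_bytes_per_sec, interval_sec):
        # Drain tokens for traffic forwarded at other limiters.
        # If these reports are stale, the replica diverges from the
        # true global bucket, the failure mode shown two slides on.
        self.tokens -= 8 * remote_bytes_per_sec * interval_sec
```
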
A baseline experiment

[Figure: 10 TCP flows split across two limiters, 3 flows at Limiter 1 and 7 flows at Limiter 2, compared against a single token bucket carrying all 10 flows]

GTB performance

[Figure: delivered rates of the 3-flow and 7-flow aggregates under a single token bucket vs. the global token bucket]

• Problem: GTB requires near-instantaneous arrival info

Take 2: Global Random Drop

• Limiters send their rate info to, and collect it from, the other limiters
• Case 1: global arrival rate (4 Mbps) is below the limit (5 Mbps): forward the packet

Global Random Drop (GRD)

• Case 2: global arrival rate (6 Mbps) is above the limit (5 Mbps): drop with probability excess / global arrival rate = (6 - 5) / 6 = 1/6
• The drop probability is the same at all limiters

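The drop rule fits in a few lines (a sketch; the function name is mine):

```python
import random

def grd_admit(global_rate_bps, limit_bps):
    """Global Random Drop: forward below the limit; above it, drop
    with probability excess / global arrival rate, the same at all
    limiters."""
    if global_rate_bps <= limit_bps:
        return True                        # Case 1: below the limit
    drop_prob = (global_rate_bps - limit_bps) / global_rate_bps
    return random.random() >= drop_prob    # e.g., 6 vs. 5 Mbps -> p = 1/6
```
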
GRD baseline performance

[Figure: delivered rates of the 3-flow and 7-flow aggregates under a single token bucket vs. GRD]

• Delivers flow behavior similar to a central limiter

GRD under dynamic arrivals

[Figure: GRD tracking flows as they arrive and depart, with a 50-ms estimate interval]

Returning to our baseline

[Figure: the baseline topology again, 3 TCP flows at Limiter 1 and 7 TCP flows at Limiter 2]

Basic idea: flow counting

• Goal: provide inter-flow fairness for TCP flows

[Figure: Limiter 1 reports "3 flows" and Limiter 2 reports "7 flows"; each enforces its share with a local token bucket]

Estimating TCP demand

[Figure: one limiter with a local token rate (limit) of 10 Mbps carrying two TCP flows, Flow A = 5 Mbps and Flow B = 5 Mbps]

• Flow count = 2 flows

FPS under dynamic arrivals

[Figure: Flow Proportional Share (FPS) tracking flows as they arrive and depart, with a 500-ms estimate interval]

Comparing FPS to GRD

[Figure: FPS (500-ms estimate interval) vs. GRD (50-ms estimate interval)]

• Both are responsive and provide similar utilization
• But GRD requires accurate estimates of the global rate at all limiters

Estimating skewed demand

[Figure: Limiter 1 carries two individual TCP flows; Limiter 2 carries 3 TCP flows]

Estimating skewed demand

[Figure: local token rate (limit) = 10 Mbps; Flow A = 8 Mbps, Flow B = 2 Mbps and bottlenecked elsewhere]

• Flow count ≠ demand
• Key insight: use a TCP flow's rate to infer demand

Estimating skewed demand

[Figure: local token rate (limit) = 10 Mbps; Flow A = 8 Mbps, Flow B = 2 Mbps and bottlenecked elsewhere]

Effective flow count = Local limit / Largest flow's rate = 10 / 8 = 1.25 flows

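In code, the estimate is a single division (a sketch; the name fps_weight is hypothetical):

```python
def fps_weight(local_limit_bps, largest_flow_rate_bps):
    """Effective (fractional) flow count at one limiter."""
    return local_limit_bps / largest_flow_rate_bps

# The slide's numbers: 10 Mbps limit, 8 Mbps largest flow -> 1.25 flows.
assert fps_weight(10e6, 8e6) == 1.25
```
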
FPS example

• Global limit = 10 Mbps
• Limiter 1: 1.25 flows; Limiter 2: 3 flows

Set local token rate = Global limit × local flow count / Total flow count
                     = 10 Mbps × 1.25 / (1.25 + 3)
                     = 2.94 Mbps

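And the allocation step (again a sketch with hypothetical names):

```python
def fps_allocate(global_limit_bps, local_weight, all_weights):
    """Local token rate = global limit * local flow count / total flow count."""
    return global_limit_bps * local_weight / sum(all_weights)

# The slide's numbers: 10 Mbps * 1.25 / (1.25 + 3) = ~2.94 Mbps.
print(fps_allocate(10e6, 1.25, [1.25, 3.0]))  # ~2.94e6
```
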
FPS bottleneck example

• Initially a 3:7 split between 10 un-bottlenecked flows
• At 25 s, the 7-flow aggregate is bottlenecked to 2 Mbps
• At 45 s, an un-bottlenecked flow arrives: a 3:1 split of the remaining 8 Mbps

Real world constraints

• Resources spent tracking usage are pure overhead
  » Efficient implementation (<3% CPU, sample and hold)
  » Modest communication budget (<1% bandwidth)
• Control channel is slow and lossy
  » Need to extend gossip protocols to tolerate loss
  » An interesting research problem on its own…
• The nodes themselves may fail or partition
  » In an asynchronous system, you cannot tell the difference
  » Need a mechanism that deals gracefully with both

Robust control communication

• 7 limiters enforcing a 10-Mbps limit
• Demand fluctuates every 5 seconds between 1 and 100 flows
• Varying loss on the control channel

Handling partitions

• Failsafe operation: each disconnected group of k limiters enforces k/n of the global limit
• Ideally: the Bank-o-mat problem (a credit/debit scheme)
• Challenge: group membership with asymmetric partitions

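The failsafe itself is simple enough to state as code (a sketch; as the slide notes, the hard part is agreeing on group membership):

```python
def failsafe_limit(global_limit_bps, group_size, total_limiters):
    """Under partition, a disconnected group of k limiters enforces
    k/n of the global limit, so the groups together never exceed it."""
    return global_limit_bps * group_size / total_limiters
```
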
Following PlanetLab demand

• Apache Web servers on 10 PlanetLab nodes
• 5-Mbps aggregate limit
• Shift load over time from 10 nodes to 4

[Figure: aggregate traffic held at the 5-Mbps limit]

Current limiting options

[Figure: demands at 10 Apache servers on PlanetLab under static per-node limits; capacity is wasted when demand shifts to just 4 nodes]

Applying FPS on PlanetLab

[Figure: FPS reallocating the 5-Mbps aggregate limit as demand shifts]

Hierarchical limiting

[Figure: limiters arranged hierarchically]

A sample use case

• T 0: A: 5 flows at L1
• T 55: A: 5 flows at L2
• T 110: B: 5 flows at L1
• T 165: B: 5 flows at L2

Worldwide flow join

• 8 nodes split between UCSD and Polish Telecom
• 5-Mbps aggregate limit
• A new flow arrives at each limiter every 10 seconds

Worldwide demand shift

• Same demand-shift experiment as before
• At 50 seconds, Polish Telecom demand disappears; it reappears at 90 seconds

Where to go from here

• Need to "let go" of full control and make decisions with only a "cloudy" view of actual resource consumption
  » Distinguish between what you know and what you don't know
  » Operate efficiently when you know you know
  » Have failsafe options when you know you don't
• Moreover, we cannot rely upon application/service designers to understand their resource demands
  » The system needs to dynamically adjust to shifts
  » We've started to manage the demand equation
  » We're now focusing on the supply side: custom-tailored resource provisioning