Public Clouds (EC2, Azure, Rackspace, …)

advertisement
Public Clouds (EC2, Azure, Rackspace, …)
VM
Multi-tenancy
Different customers’
virtual machines (VMs)
share same server
VM
VM
VM
VM
VM
VM
Provider: Why multi-tenancy?
• Improved resource utilization
• Benefits of economies of scale
Tenant: Why Cloud?
• Pay-as-you-go
• Infinite Resources
• Cheaper Resources
1
Implications of Multi-tenancy
– CPU, cache, memory, disk, network, etc.
• Virtual Machine Managers (VMM)
– Goal: Provide Isolation
VMM
• VMs share many resources
VM
VM
• Deployed VMMs don’t
perfectly isolate VMs
– Side-channels [Ristenpart et al. ’09, Zhang et al. ’12]
2
Lies to by the Cloud
• Infinite resources
• All VMs are created equally
• Perfect isolation
3
This Talk
Taking control of where your instances run
• Are all VMs created equally?
• How much variation exists and why?
• Can we take advantage of the variation to improve
performance?
Gaining performance at any cost
• Can users impact each other’s performance?
• Is there a way to maliciously steal another user’s
resource?
• Is tehre
Heterogeneity in EC2
• Cause of heterogeneity:
– Contention for resources: you are sharing!
– CPU Variation:
• Upgrades over time
• Replacement of failed machined
– Network Variation:
• Different path lengths
• Different levels of oversubscription
5
Are All VMs Created Equally?
• Inter-architecture:
– Is there differences between architectures
– Can this be used to predict perform aprior?
• Intra-architecture:
– Within an architecture
– If large, then you can’t predict performance
• Temporal
– On the same VM over time?
– There is no hope!
6
Benchmark Suite & Methodoloy
• Methodology:
– 6 Workloads
– 20 VMs (small instances) for 1 week
– Each run micro-benchmarks every hour
7
Inter-Architecture
8
Intra-Architecture
CPU is predictable – les than 15%
Storage is unpredictable --- as high as 250%
9
Temporal
10
Overall
CPU type can only be used to predict CPU
performance
For Mem/IO bound jobs need to empirically
learn how good an instance is
11
What Can We Do about it?
• Goal: Run VM on best instances
• Constraints:
– Can control placement – can’t control which instance
the cloud gives us
– Can’t migrate
• Placement gaming:
– Try and find the best instances simply by starting and
stopping VMs
12
Measurement Methodology
• Deploy on Amazon EC2
– A=10 instances
– 12 hours
• Compare against no strategy:
– Run initial machines with no strategy
• Baseline varies for each run
– Re-use machines for strategy
EC2 results
16 migrations
Baseline
100
Strategy
Baseline
Strategy
80
11
MB/sec
Records/sec
12
10
9
60
40
20
8
0
1
2
NER Runs
3
1
2
Apache Runs
3
Placement Gaming
• Approach:
– Start a bunch of extra instances
– Rank them based on performance
– Kill the under performing instances
• Performing poorer than average
– Start new instances.
• Interesting Questions:
– How many instances should be killed in each round?
– How frequently should you evaluate performance of
instances.
15
Resource-Freeing Attacks:
Improve Your Cloud Performance
(at Your Neighbor's Expense)
(Venkat)anathan Varadarajan,
Thawan Kooburat,
Benjamin Farley,
Thomas Ristenpart,
and Michael Swift
DEPARTMENT OF COMPUTER SCIENCES
16
Contention in Xen
• Same Core
– Same core & same L1 Cache & Same memory
• Same Package
– Diff core but share L1 Cache and memory
• Different Package
– Diff core & diff Cache but share Memory
17
I/O
contends
with self
Increase
in Run-times
• VMs contend for the same resource
– Network with Network:
• More VMs  Fair share is smaller
– Disk I/O with Disk I/O:
• More disk access  longer seek times
• Xen does N/W batching to give better
performances
– BUT: this adds jitter and delay
– ALSO: you can get more than your fairshare
because of the batch
10
18
I/O contends with self
• VMs contend for the same resource
– Network with Network:
• More VMs  Fair share is smaller
– Disk I/O with Disk I/O:
• More disk access  longer seek times
• Xen does N/W batching to give better
performances
– BUT: this adds jitter and delay
– ALSO: you can get more than your fairshare
because of the batch
19
Everyone Contends with Cache
• No contention on same core
– VMs run in serial so access to cache is serial
• No contention on diff package
– VMs use different cache
• Lots of contention when same package
– VMs run in parallel but share same cache
20
Contention in Xen
Performance Degradation (%)
3x-6x Performance loss  Higher cost
600
500
Work-conserving
scheduling
VM
VM
400
300
Local Xen Testbed
200
Machine
Intel Xeon E5430,
2.66 Ghz
CPU
2 packages each
with 2 cores
Cache Size
6MB per package
100
0
CPU
Net
Non-work-conserving
CPU scheduling
Disk
Cache
21
What can a tenant do?
Ask provider for better isolation
… requires overhaul of the cloud
VM
Pack up VM and move
(See our SOCC 2012 paper)
… but, not all workloads cheap
to move
VM
This work: Greedy customer can recover
performance by interfering with other tenants
Resource-Freeing Attack
22
Resource-freeing attacks (RFAs)
• What is an RFA?
• RFA case studies
1. Two highly loaded web server VMs
2. Last Level Cache (LLC) bound VM and
highly loaded webserver VM
• Demonstration on Amazon EC2
23
The Setting
Victim:
– One or more VMs
– Public interface (eg, http)
Beneficiary:
– VM whose performance we want
to improve
Helper:
Victim
VM
VM
Beneficiary
– Mounts the attack
Beneficiary and victim fighting
over a target resource
Helper
24
Example: Network Contention
• Beneficiary & Victim
– Apache webservers hosting static and
dynamic (CGI) web pages.
– ℎ𝑎𝑙𝑓 the network bandwidth
What can you do?
Local Xen Test bed
• Target Resource: Network Bandwidth
Beneficiary
• Work-conserving scheduler
Victim
Clients
Net
25
Recipe for a Successful RFA
Proportion of CPU usage
Push towards CPU bottleneck
Shift resource away from the target resource
towards the bottleneck resource
CPU intensive dynamic pages
Shift resource usage
via public interface
Limits
Static pages
Proportion of Network usage
Reduce target resource usage
26
An RFA in Our Example
Result in our testbed:
Increases beneficiary’s share of
bandwidth
CPU Utilization
Clients
No RFA: 1800 page requests/sec
W/ RFA: 3026 page requests/sec
CGI Request
50% 85%
share of
bandwidth
Net
Helper
29
Resource-freeing attacks
1) Send targeted requests to victim
2) Shift resources use from target to a bottleneck
Can we mount RFAs when target
resource is CPU cache?
Shared CPU Cache:
– Ubiquitous: Almost all workloads need cache
– Hardware controlled: Not easily isolated via
software
– Performance Sensitive: High performance cost!
30
Cache Performance Degradation (%)
Cache Contention
250
RFA Goal
200
150
100
50
0
1000
2000
Webserver Request Rate
3000
31
Case Study: Cache vs. Network
– ~3x slower when sharing
cache with webserver
Local Xen Test bed
• Victim : Apache webserver hosting static
and dynamic (CGI) web pages
• Beneficiary: Synthetic cache bound
workload (LLCProbe)
Beneficiary Victim
• Target Resource: Cache
$$$
• No cache isolation:
Core
Clients
Core
Net
Cache
32
Cache vs. Network
Victim webserver frequently
interrupts, pollutes the cache
– Reason: Xen gives higher
priority to VM consuming
less CPU time
$$$
Core
Clients
Core
Net
Cache
Beneficiary starts to run
decreased cache efficiency
cache state
Webserver
receives a
request
Cache state time line
Heavily loaded
web server
33
Cache vs. Network w/ RFA
RFA helps in two ways:
1. Webserver loses its
priority.
2. Reducing the capacity
of webserver.
$$$
Core
Clients
Core
Net
Cache
cache state
Webserver
Heavily loaded
Heavily
loadedawebserver
requests
receives
web
server under RFA
request
CGI Request
Beneficiary starts to run
Cache state time line
Helper
34
RFA: Performance Improvement
RFA intensities – time in ms per second
60%
Performance
Improvement
196% slowdown
86% slowdown
35
Discussion: Practical Aspects
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
36
Limitations
• Experiments setup:
– Only 1 VMs in each experiment
Bobtail:)Avoiding)Long)Tails)in)
– Don’t vary the number of each type of job
the)Cloud))
Yunjing Xu, Zachary Musgrave,
Brian Noble, Michael Bailey
University)of)Michigan)
)
371)
Discussion:
Practical Aspects
How)bad)is)the)latency)tail?)
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
99.9th)
internal IP addresses)
~30ms)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
387)
Discussion:
Practical Aspects
How)bad)is)the)latency)tail?)
Over)10X)larger)
RFA case studies used
CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
99.9th)
internal IP Median)
addresses)
~30ms)
~0.6ms)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
398)
Discussion: Practical Aspects
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
40
Discussion: Practical Aspects
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
41
Discussion: Practical Aspects
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
42
Discussion: Practical Aspects
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
43
Discussion: Practical Aspects
RFA case studies used CPU intensive
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying co-resident victims
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No public interface?
– Paper discusses possibilities for RFAs
44
Discussion: Practical Aspects
ParHHonIAggregaHon)Workflows)
RFA
case studies used CPU intensive
10)servers)
CGI requests
– Alternative: DoS vulnerabilities
(Eg. hash-collision attacks)
Identifying
co-resident victims
20)servers)
– Easy on most clouds
(Co-resident VMs have predictable
internal IP addresses)
VM
VM
No40)servers)
public interface?
– Paper discusses possibilities for RFAs
45
24)
Download