Public Clouds (EC2, Azure, Rackspace, …)

Multi-tenancy: different customers' virtual machines (VMs) share the same server
[Figure: several VMs sharing one server]

Provider: Why multi-tenancy?
• Improved resource utilization
• Benefits of economies of scale

Tenant: Why Cloud?
• Pay-as-you-go
• Infinite resources
• Cheaper resources

Implications of Multi-tenancy
• VMs share many resources
  – CPU, cache, memory, disk, network, etc.
• Virtual Machine Managers (VMM)
  – Goal: Provide isolation
• Deployed VMMs don't perfectly isolate VMs
  – Side-channels [Ristenpart et al. '09, Zhang et al. '12]

Lies Told by the Cloud
• Infinite resources
• All VMs are created equally
• Perfect isolation

This Talk
Taking control of where your instances run
• Are all VMs created equally?
• How much variation exists and why?
• Can we take advantage of the variation to improve performance?
Gaining performance at any cost
• Can users impact each other's performance?
• Is there a way to maliciously steal another user's resource?
• Is there …

Heterogeneity in EC2
• Causes of heterogeneity:
  – Contention for resources: you are sharing!
  – CPU variation:
    • Upgrades over time
    • Replacement of failed machines
  – Network variation:
    • Different path lengths
    • Different levels of oversubscription

Are All VMs Created Equally?
• Inter-architecture:
  – Are there differences between architectures?
  – Can this be used to predict performance a priori?
• Intra-architecture:
  – Is there variation within an architecture?
  – If it is large, then you can't predict performance
• Temporal:
  – Does performance vary on the same VM over time?
  – If so, there is no hope!

Benchmark Suite & Methodology
• Methodology:
  – 6 workloads
  – 20 VMs (small instances) for 1 week
  – Each VM runs micro-benchmarks every hour

Inter-Architecture
[Figure: inter-architecture results]

Intra-Architecture
• CPU is predictable: less than 15% variation
• Storage is unpredictable: as high as 250%

Temporal
[Figure: temporal results]

Overall
• CPU type can only be used to predict CPU performance
• For memory- or I/O-bound jobs, you need to learn empirically how good an instance is

What Can We Do about it?
• Goal: Run VMs on the best instances
• Constraints:
  – Can't control placement: we can't choose which instance the cloud gives us
  – Can't migrate
• Placement gaming:
  – Try to find the best instances simply by starting and stopping VMs

Measurement Methodology
• Deploy on Amazon EC2
  – A=10 instances
  – 12 hours
• Compare against no strategy:
  – Run initial machines with no strategy
    • Baseline varies for each run
  – Re-use machines for strategy

EC2 results
• 16 migrations
[Figure: two bar charts comparing Baseline and Strategy over runs 1-3; panels: NER Runs, Apache Runs; axes: Records/sec, MB/sec]

Placement Gaming
• Approach:
  – Start a bunch of extra instances
  – Rank them based on performance
  – Kill the underperforming instances
    • Those performing worse than average
  – Start new instances
• Interesting questions:
  – How many instances should be killed in each round?
  – How frequently should you evaluate the performance of instances?

Resource-Freeing Attacks: Improve Your Cloud Performance (at Your Neighbor's Expense)
(Venkat)anathan Varadarajan, Thawan Kooburat, Benjamin Farley, Thomas Ristenpart, and Michael Swift
Department of Computer Sciences

Contention in Xen
• Same core
  – Same core, same L1 cache, same memory
• Same package
  – Different cores, but share the last-level cache (LLC) and memory
• Different package
  – Different cores and different caches, but share memory

I/O contends with self
• VMs contend for the same resource
  – Network with network:
    • More VMs → fair share is smaller
  – Disk I/O with disk I/O:
    • More disk accesses → longer seek times
• Xen batches network I/O to give better performance
  – BUT: this adds jitter and delay
  – ALSO: you can get more than your fair share because of the batching
[Figure: Increase in Run-times]

Everyone Contends with Cache
• No contention on the same core
  – VMs run in serial, so access to the cache is serial
• No contention on different packages
  – VMs use different caches
• Lots of contention on the same package
  – VMs run in parallel but share the same cache

Contention in Xen
• 3x-6x performance loss → higher cost
• Local Xen Testbed:
  – Machine: Intel Xeon E5430, 2.66 GHz
  – CPU: 2 packages, each with 2 cores
  – Cache size: 6MB per package
[Figure: Performance Degradation (%) for CPU, Net, Disk, and Cache; annotations: non-work-conserving CPU scheduling, work-conserving scheduling]

What can a tenant do?
• Ask provider for better isolation
  … requires overhaul of the cloud
• Pack up VM and move (see our SOCC 2012 paper)
  … but not all workloads are cheap to move
• This work: a greedy customer can recover performance by interfering with other tenants
  – Resource-Freeing Attack

Resource-freeing attacks (RFAs)
• What is an RFA?
• RFA case studies
  1. Two highly loaded web server VMs
  2. Last Level Cache (LLC) bound VM and highly loaded webserver VM
• Demonstration on Amazon EC2

The Setting
Victim:
  – One or more VMs
  – Public interface (e.g., HTTP)
Beneficiary:
  – VM whose performance we want to improve
Helper:
  – Mounts the attack
Beneficiary and victim are fighting over a target resource
[Figure: victim VM and beneficiary VM on one host, with the helper outside]

Example: Network Contention
• Beneficiary & Victim
  – Apache webservers hosting static and dynamic (CGI) web pages
  – Each gets half the network bandwidth
What can you do?
• Target Resource: Network Bandwidth
• Work-conserving scheduler
[Figure: Local Xen testbed; clients reach the beneficiary and victim over the shared network (Net)]

Recipe for a Successful RFA
• Shift resource usage away from the target resource towards the bottleneck resource
• Shift resource usage via the public interface
[Figure: victim's proportion of CPU usage vs. proportion of network usage, bounded by limits; static pages use mostly network, CPU-intensive dynamic pages push towards the CPU bottleneck and reduce target resource usage]

An RFA in Our Example
• Helper sends CGI requests to the victim
• Result in our testbed: increases beneficiary's share of bandwidth
  – 50% → 85% share of bandwidth
  – No RFA: 1800 page requests/sec
  – W/ RFA: 3026 page requests/sec
[Figure: clients, helper, CPU utilization, and network (Net) in the testbed]

Resource-freeing attacks
1) Send targeted requests to victim
2) Shift resource use from the target to a bottleneck
Can we mount RFAs when the target resource is the CPU cache?
Shared CPU Cache:
  – Ubiquitous: almost all workloads need cache
  – Hardware controlled: not easily isolated via software
  – Performance sensitive: high performance cost!

Cache Contention
[Figure: Cache Performance Degradation (%) vs. Webserver Request Rate; annotation: RFA Goal]

Case Study: Cache vs. Network
Local Xen testbed
• Victim: Apache webserver hosting static and dynamic (CGI) web pages
• Beneficiary: Synthetic cache-bound workload (LLCProbe)
• Target Resource: Cache
• No cache isolation:
  – ~3x slower when sharing cache with webserver
[Figure: beneficiary and victim on separate cores sharing one cache; clients reach the victim over the network (Net)]

Cache vs. Network
Victim webserver frequently interrupts and pollutes the cache
  – Reason: Xen gives higher priority to the VM consuming less CPU time
[Figure: cache state timeline for a heavily loaded web server; events: webserver receives a request, beneficiary starts to run, decreased cache efficiency]

Cache vs. Network w/ RFA
RFA helps in two ways:
1. Webserver loses its priority.
2. Webserver's capacity is reduced.
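The two-step recipe (send targeted requests to the victim, shift its resource use from the target to a bottleneck) can be illustrated for the earlier network example with a toy model. This is a minimal sketch under stated assumptions, not the testbed's scheduler: every constant and function name below is hypothetical, the numbers are made up rather than measured, and the model ignores the bandwidth used by the CGI responses themselves. It assumes a work-conserving max-min fair split of the link and a victim whose static-page rate is capped by whatever CPU the helper's CGI requests leave over.

```python
# Toy model of a network resource-freeing attack (RFA).
# All parameters are hypothetical; they are not measurements from the talk.

LINK_MB_PER_SEC = 100.0      # shared link capacity (assumed)
CPU_MS_PER_SEC = 1000.0      # victim's CPU budget: 1000 ms of CPU per second
STATIC_CPU_MS = 0.125        # CPU cost of one static page (assumed)
CGI_CPU_MS = 10.0            # CPU cost of one CGI request (assumed)
MB_PER_STATIC_PAGE = 0.0625  # network cost of one static page (assumed)


def max_min_share(capacity, demands):
    """Work-conserving max-min fair split of one resource among demands."""
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 0:
        share = remaining / len(active)
        done = [i for i in active if demands[i] - alloc[i] <= share]
        if not done:
            # Nobody can be fully satisfied: split what is left evenly.
            for i in active:
                alloc[i] += share
            break
        for i in done:
            # Satisfy small demands; their unused share is redistributed.
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active = [i for i in active if i not in done]
    return alloc


def victim_network_demand(offered_static_per_sec, cgi_per_sec):
    """Network demand (MB/s) of the victim after CGI work has used its CPU."""
    cpu_left = max(0.0, CPU_MS_PER_SEC - cgi_per_sec * CGI_CPU_MS)
    static_per_sec = min(offered_static_per_sec, cpu_left / STATIC_CPU_MS)
    return static_per_sec * MB_PER_STATIC_PAGE


def bandwidth_shares(cgi_per_sec, offered_static_per_sec=4000.0):
    """Return (beneficiary, victim) bandwidth in MB/s for a helper CGI rate."""
    beneficiary_demand = LINK_MB_PER_SEC  # beneficiary can fill the link
    victim_demand = victim_network_demand(offered_static_per_sec, cgi_per_sec)
    beneficiary, victim = max_min_share(
        LINK_MB_PER_SEC, [beneficiary_demand, victim_demand])
    return beneficiary, victim


if __name__ == "__main__":
    for rate in (0, 48, 96):
        b, v = bandwidth_shares(rate)
        print(f"helper CGI rate {rate:3d}/s: "
              f"beneficiary {b:.1f} MB/s, victim {v:.1f} MB/s")
```

In this model a weak attack changes nothing: the beneficiary's share only grows once the helper's CGI load pushes the victim into its CPU bottleneck, at which point the work-conserving scheduler hands the freed bandwidth to the beneficiary.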
[Figure: cache state timeline for the heavily loaded web server under RFA, with the helper sending CGI requests; events: webserver receives a request, beneficiary starts to run]

RFA: Performance Improvement
• 60% performance improvement
• Slowdown drops from 196% to 86%
[Figure: performance at different RFA intensities (time in ms per second)]

Discussion: Practical Aspects
• RFA case studies used CPU-intensive CGI requests
  – Alternative: DoS vulnerabilities (e.g., hash-collision attacks)
• Identifying co-resident victims
  – Easy on most clouds (co-resident VMs have predictable internal IP addresses)
• No public interface?
  – Paper discusses possibilities for RFAs

Limitations
• Experiment setup:
  – Only 1 VM in each experiment
  – Don't vary the number of each type of job

Bobtail: Avoiding Long Tails in the Cloud
Yunjing Xu, Zachary Musgrave, Brian Noble, Michael Bailey
University of Michigan

How bad is the latency tail?
• Median: ~0.6ms
• 99.9th percentile: ~30ms
• Over 10X larger

Partition-Aggregation Workflows
[Figure: series for 10 servers, 20 servers, and 40 servers]
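The latency-tail and partition-aggregation slides connect through a short calculation: if a request fans out to n servers and must wait for all of them, the chance that at least one reply falls beyond the per-server 99.9th percentile is 1 - 0.999^n. The sketch below works this out for the 10, 20, and 40 server cases named on the slide. It assumes independent per-server latencies, which is an assumption made here rather than a claim from the talk, and the function name is hypothetical.

```python
def prob_request_hits_tail(num_servers, percentile=99.9):
    """Chance that at least one of num_servers replies is slower than the
    given per-server latency percentile, assuming independent servers."""
    p_within = percentile / 100.0
    return 1.0 - p_within ** num_servers


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"{n} servers: {prob_request_hits_tail(n):.2%} "
              f"of requests wait on a tail reply")
```

Under that independence assumption, roughly 1%, 2%, and 4% of requests see at least one tail-latency reply at 10, 20, and 40 servers, so a rare per-server event becomes a common per-request one as fan-out grows.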