Xen Virtual Machine Monitor: Performance Isolation
E0397 Lecture, 17/8/2010
Many slides based verbatim on the "Xen Credit Scheduler" wiki

Recall: Xen Architecture
Hypervisor core functions:
– Scheduling domains
– Allocating memory
Driver Domain (Domain 0) – all access to devices goes through this domain

CPU Sharing Between Domains: Credit Scheduler
Proportional fair-share CPU scheduler – work-conserving on SMP hosts
Definitions:
– Each domain (including the host OS) is assigned a weight and a cap.
– Weight: a domain with a weight of 512 will get twice as much CPU as a domain with a weight of 256 on a contended host. Legal weights range from 1 to 65535; the default is 256.
– Cap: optionally fixes the maximum amount of CPU a domain will be able to consume, even if the host system has idle CPU cycles. The cap is expressed as a percentage of one physical CPU: 100 is 1 physical CPU, 50 is half a CPU, 400 is 4 CPUs, etc. The default, 0, means there is no upper cap.
– VCPUs: the number of virtual CPUs given to a domain – exactly replaces the concept of the number of CPUs in a physical machine.
  More VCPUs allow threads to run in parallel (only if the number of physical CPUs > 1).
  The number of VCPUs should equal the number of physical CPUs; fewer VCPUs will limit performance.

…Credit Scheduler: SMP Load Balancing
The credit scheduler automatically load-balances guest VCPUs across all available physical CPUs on an SMP host. The administrator does not need to manually pin VCPUs to load-balance the system. However, she can restrict which CPUs a particular VCPU may run on using the generic vcpu-pin interface.

Usage
The xm sched-credit command may be used to tune the per-VM guest scheduler parameters:
  xm sched-credit -d <domain>              lists weight and cap
  xm sched-credit -d <domain> -w <weight>  sets the weight
  xm sched-credit -d <domain> -c <cap>     sets the cap

Credit Scheduler Algorithm
Each CPU manages a local run queue of runnable VCPUs.
– The queue is sorted by VCPU priority.
– A VCPU's priority can be one of two values: over or under.
  Over: exceeded its fair share of CPU resources in the ongoing accounting period.
  Under: not exceeded.
– When inserting a VCPU onto a run queue, it is put after all other VCPUs of equal priority.
As a VCPU runs, it consumes credits.
– Every so often (10 ms), 100 credits are deducted.
– Unless its credits are negative, a VCPU gets three rounds to run (30 ms).
– Negative credits imply a priority of over. Until a VCPU consumes its allotted credits, its priority is under.
– Credits are refreshed periodically (every 30 ms).
Active vs. inactive VMs
– Credit reduction/refreshing happens only for active VMs (those that keep using their credits).
– Those not using their credits are marked "inactive".

…Credit Scheduler Algorithm
On each CPU, at every scheduling decision (when a VCPU blocks, yields, completes its time slice, or is awakened), the next VCPU to run is picked off the head of the run queue (UNDER).
When a CPU doesn't find a VCPU of priority under on its local run queue, it will look on other CPUs for one. This load balancing guarantees that each VM receives its fair share of CPU resources system-wide.
Before a CPU goes idle, it will look on other CPUs to find any runnable VCPU. This guarantees that no CPU idles when there is runnable work in the system.

…Credit Scheduler
[Figure: run queue on CPU1 holding V1–V5. At the end of its time slice, or when the domain blocks on I/O, V5's credits are recalculated: with positive credits V5 re-enters the queue with priority UNDER; with negative credits it re-enters at the tail with priority OVER.]

…Credit Scheduler: SMP Load Balancing
Find the next runnable VCPU in the following order:
– UNDER domain in the local run queue
– UNDER domain in the run queue of another CPU
– OVER domain in the local run queue
– OVER domain in the run queue of another CPU

Performance Studies
CPU sharing is predictable; I/O sharing is not.
See: Study 1, the Cherkasova paper.
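Before turning to the studies, the credit accounting and SMP steal order described above can be sketched in Python. This is an illustrative model only, not Xen's actual code: the class and function names (Vcpu, tick, refresh, insert, pick_next) and the per-refresh credit total are assumptions; the 10 ms / 100-credit debit, the weight-proportional refresh, the insert-after-equal-priority rule, and the UNDER-local / UNDER-remote / OVER-local / OVER-remote pick order follow the text.

```python
# Minimal sketch of credit-scheduler accounting (illustrative, not Xen source).

UNDER, OVER = 0, 1  # priority values; UNDER sorts ahead of OVER

class Vcpu:
    def __init__(self, name, weight=256):
        self.name = name
        self.weight = weight
        self.credits = 0

    @property
    def priority(self):
        # Negative credits mean the VCPU exceeded its fair share: OVER.
        return OVER if self.credits < 0 else UNDER

def tick(running):
    """Called every 10 ms of execution: debit the running VCPU by 100 credits."""
    running.credits -= 100

def refresh(vcpus, total_credits=300):
    """Every 30 ms, hand out credits in proportion to weight (active VMs only).
    The total of 300 credits per refresh is an assumed illustrative value."""
    total_weight = sum(v.weight for v in vcpus)
    for v in vcpus:
        v.credits += total_credits * v.weight // total_weight

def insert(runqueue, vcpu):
    """Insert after all other VCPUs of equal priority (round-robin in class)."""
    i = 0
    while i < len(runqueue) and runqueue[i].priority <= vcpu.priority:
        i += 1
    runqueue.insert(i, vcpu)

def pick_next(local, remote_queues):
    """Steal order: UNDER local, UNDER remote, OVER local, OVER remote."""
    for wanted in (UNDER, OVER):
        if local and local[0].priority == wanted:
            return local.pop(0)
        for rq in remote_queues:
            for i, v in enumerate(rq):
                if v.priority == wanted:
                    return rq.pop(i)
    return None  # no runnable VCPU anywhere: the CPU idles
```

With two domains of weights 256 and 512, refresh() gives the second twice the credits of the first, matching the weight semantics above; a VCPU that has run its credits negative is reinserted behind all UNDER VCPUs and is only picked once no UNDER VCPU exists on any queue.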
Applications running in Domain 1 (one of):
– Web server: requests for 10 KB files. Measure: requests/sec.
– iperf: measures maximum achievable network throughput, in Mbps.
– dd: reads 1000 1 KB blocks. Measure: disk throughput in MB/s.
Schedulers with three workloads
• Performance varies between schedulers.
• Performance is sensitive to the Dom0 weight.
• Disk reads are least sensitive to both the scheduler and the Dom0 weight.

Study 2: Ongaro paper
Experiment 3: burn x 7, ping x 1
• Processor sharing is fair for all schedulers.
• SEDF provides low latency.
• Boost, sort, and tickling together work best.

Study 3: Sujesha et al. (IITB)
Impact of resource allocation on application performance: response time vs. resource (CPU) allocated, for different loads.
Desired operation: at the "knee" of the curve.

Study 4 (Gundecha/Apte)
Performance of multiple domains vs. scheduler parameters.
Is resource management fair to every domain, irrespective of
– the type of load,
– the scheduling parameters of the domain?
Scheduler used: the default credit scheduler.
Two virtual machines: one with a CPU-intensive and the other with an I/O-intensive workload (Apache webserver).
Results

Experiment 1: one VM, with webserver running
  VM    Weight  Cap  Load             %CPU usage  Req/sec  Time per req  Transfer rate (KB/s)
  vm1   256     400  -                -           -        -             -
  vm2   256     400  webserver (180)  NA          797.61   2.507         1035.17

Experiment 2: mixed load – one VM with CPU load, one VM with webserver
  VM    Weight  Cap  Load             %CPU usage  Req/sec  Time per req  Transfer rate (KB/s)
  vm1   256     400  CPU              100         -        -             -
  vm2   256     400  webserver (180)  NA          607.43   3.293         788.35

Table 1: Webserver performance stats with and without CPU load on the other VM

Study 4: Conclusions
Performance isolation is imperfect when high-level performance measures (application throughput) are considered.

Live Migration in Xen VMM

Live Migration: Advantages
Avoids the difficulties of process migration:
– "Residual dependencies" – the original host has to remain available and network-connected to service some calls from the migrated OS.*
In-memory state can be transferred consistently, e.g. TCP state:
– "this means that we can migrate an on-line game server or streaming media server without requiring clients to reconnect: ..."
– "Allows a separation of concerns between the users and operator of a data center or cluster. Users need not provide the operator with any OS-level access at all (e.g. a root login to quiesce processes or I/O prior to migration)."

*Residual dependencies
"The problem of the residual dependencies that a migrated process retains on the machine from which it migrated. Examples of residual dependencies include open file descriptors, shared memory segments, and other local resources. These are undesirable because the original machine must remain available, and because they usually negatively impact the performance of migrated processes."

In summary: live migration is an extremely powerful tool for cluster administrators,
– allowing a separation of hardware and software considerations,
– consolidating clustered hardware into a single coherent management domain.
– If a host needs to be removed from service: move its guest domains, then shut the machine down.
– Relieve bottlenecks: if a host is overloaded, move guest domains to idle hosts.
"virtualization + live migration = improved manageability"

Goals of Live Migration
– Reduce the time during which services are totally unavailable.
– Reduce the total migration time – the time during which synchronization is happening (which might be unreliable).
– Migration should not create unnecessary resource contention (CPU, disk, network, etc.).

Xen Live Migration: Highlights
– SPECweb benchmark migrated with 210 ms unavailability.
– Quake 3 server migrated with 60 ms downtime.
– Can maintain network connections and application state – seamless migration.

Xen Live Migration: Key Ideas
Pre-copy approach:
– Pages of memory are iteratively copied without stopping the VM.
– Page-level protection hardware is used to ensure a consistent snapshot.
– A rate-adaptive algorithm controls the impact of migration traffic.
– The VM pauses only in the final copy phase.

Details: Time Definitions
Downtime: the period during which the service is unavailable because no instance of the VM is currently executing; this period is directly visible to clients of the VM as a service interruption.
Total migration time: the duration between when migration is initiated and when the original VM may finally be discarded – and hence the source host may be taken down for maintenance, upgrade, or repair.

Memory Transfer Phases
– Push: the VM keeps sending pages from source to destination while running (modified pages have to be re-sent).
– Stop-and-copy: halt the VM, copy the entire image, start the VM on the destination.
– Pull: start the VM on the destination; pages not found locally are retrieved from the source.

Migration Approaches: Combinations of Phases
– Pure stop-and-copy: downtime and total migration time are proportional to physical memory size (we want downtime to be smaller).
– Pure demand-migration: move some minimal required data, start the VM on the destination.
  The VM will page-fault a lot initially – total migration time will be long, and performance may be unacceptable.
– This paper: a "bounded push phase followed by stop-and-copy".

Xen "Pre-copy": Key Ideas
– Pre-copying (push) happens in rounds: in round n, the pages that changed in round n-1 are copied.
– Knowledge of pages that are written to frequently – poor candidates for pre-copy: "writable working set" (WWS) analysis for server workloads.
– Short stop-and-copy phase.
– Awareness of the impact on actual services: control the resources used by migration (e.g. network bandwidth, CPU).

Decoupling Local Resources
Network: assume the source and destination VMs are on the same LAN.
– The migrating VM moves with its TCP/IP state and IP address.
– Generate an unsolicited ARP reply; the reply advertises the IP address's mapping to the (new) MAC address, and all receiving hosts update their mapping.
– A few packets in transit may be lost; very minimal impact is expected.

…Decoupling Local Resources
Network – problem: routers may not accept broadcast ARP replies (note: ARP requests are expected to be broadcast – not replies).
– Send ARP replies to the addresses in its own ARP cache.
– On a switched network, the migrating OS keeps its original Ethernet MAC and depends on the network switch to detect that the MAC has moved to a new port.

…Decoupling Local Resources
Disk:
– Assume network-attached storage, with the source and destination machines on the same storage network.
– The problem is not directly addressed.

Migration Steps
– Failure management: if there are not enough resources, no migration takes place.
– At the end of this phase, two consistent copies of the VM exist.
– OK message from B to A and ack from A to B; the VM on A can stop.
– One consistent VM is active at all times: if the "commit" step does not happen, migration is aborted and the original VM continues.

Writable Working Sets
Copying VM memory is the largest migration overhead.
Stop-and-copy is the simplest approach, but its downtime is unacceptable.
Pre-copy migration may reduce downtime, but:
– What about pages that are dirtied while copying?
– What if the rate of dirtying > the rate of copying?
Key observation: in most VMs, large parts of memory are not modified; only a small part is (called the WWS).

WWS Measurements
– The WWS differs across workloads.

Estimated Effect on Downtime
– Each successive line (top to bottom) shows an increasing number of pre-copy rounds.
– Pre-copy reduces downtime (confirmed by many experiments).

Details: Managed Migration (Migration by Domain 0)
– First round: all pages are copied.
– Subsequent rounds: only dirtied pages are copied (a dirty bitmap is maintained). How:
  – Insert shadow page tables, populated by translating sections of the guest OS page tables.
  – Page table entries are initially read-only mappings; if the guest tries to modify a page, a fault is created and trapped by Xen.
  – If write access is permitted by the original page table, permit it here as well – and set the page's bit in the dirty bitmap.
– At the start of each pre-copy round, the dirty bitmap is given to the control software, Xen's bitmap is cleared, and the shadow page tables are destroyed and recreated – all write permissions are lost.
– When pre-copy is to stop, a message is sent to the guest OS to suspend itself; the dirty bitmap is checked and the remaining pages are synchronized.

Self-Migration
Major difficulty: the OS itself has to run to transfer itself – what is the correct state to transfer?
Solution:
– Suspend all activities except those related to migration.
– Scan for dirty pages; copy them to a "shadow buffer"; transfer the shadow buffer.
– Page dirtying during this time is ignored.

Dynamic Rate-Limiting Algorithm
– Select minimum and maximum bandwidth limits.
– The first pre-copy round transfers pages at the minimum bandwidth.
– The dirty page rate is calculated (number of dirtied pages / duration of the round).
– The bandwidth limit of the next round = dirty rate + 50 Mbit/sec (higher bandwidth if the dirty rate is high).
– Terminate when the calculated rate exceeds the maximum, or when less than 256 KB remains to be sent.
– The stop-and-copy phase is done at the maximum bandwidth.

Live Migration: Some Results
– Downtime: 210 ms.
– Total migration time: 71 secs.
– No perceptible impact on performance during uptime.
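The dynamic rate-limiting loop described above can be sketched in Python. This is a sketch under simplifying assumptions, not the paper's implementation: the function name, the 4 KB page size, and the constant page-dirtying model are illustrative; the rules it encodes (first round at the minimum bandwidth, next round's limit = observed dirty rate + 50 Mbit/s, terminate when the limit would exceed the maximum or under 256 KB remains, stop-and-copy at the maximum bandwidth) follow the text.

```python
# Sketch of pre-copy migration with dynamic rate-limiting (illustrative).

PAGE_SIZE = 4096  # bytes; typical x86 page size, assumed here

def precopy_rounds(num_pages, dirtied_per_round, min_mbit=100, max_mbit=500):
    """Return the per-round bandwidth limits (Mbit/s) and the pages left
    for the final stop-and-copy phase, which runs at max_mbit.

    num_pages:         pages transferred in the first round (whole image)
    dirtied_per_round: pages dirtied while each round runs (simplified to
                       a constant; real workloads vary per round)
    """
    limit = min_mbit       # first pre-copy round at the minimum bandwidth
    to_send = num_pages    # round 1 copies every page
    rounds = []
    while True:
        rounds.append(limit)
        # How long this round takes at the current bandwidth limit.
        duration = to_send * PAGE_SIZE * 8 / (limit * 1e6)
        # Pages dirtied during the round must be re-sent next round.
        to_send = dirtied_per_round
        dirty_rate_mbit = dirtied_per_round * PAGE_SIZE * 8 / (duration * 1e6)
        # Next round's limit: dirty rate plus 50 Mbit/s headroom.
        limit = dirty_rate_mbit + 50
        # Terminate pre-copy: limit above the maximum, or < 256 KB remain.
        if limit > max_mbit or to_send * PAGE_SIZE < 256 * 1024:
            break
    return rounds, to_send
```

With a fast-dirtying workload the computed limit climbs by roughly 50 Mbit/s per round until it crosses the maximum, at which point pre-copy gives up and the remaining (essentially WWS-sized) set of pages is sent in the stop-and-copy phase, which is the behavior the termination rule is designed to produce.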