
Xen Virtual Machine Monitor
Performance Isolation
E0397 Lecture
17/8/2010
Many slides based verbatim on “Xen
Credit Scheduler Wiki”
Recall: Xen Architecture
Hypervisor Core Functions
 Scheduling Domains
 Allocating Memory
 Driver Domain (Domain 0)
– All access to devices needs to go through this
domain
CPU Sharing between domains
Credit Scheduler


Proportional fair share CPU scheduler - work conserving on SMP hosts
Definitions:
– Each domain (including Host OS) is assigned a weight and a cap.
– Weight
– A domain with a weight of 512 will get twice as much CPU as a domain with a weight
of 256 on a contended host. Legal weights range from 1 to 65535 and the default is
256.
– Cap
– The cap optionally fixes the maximum amount of CPU a domain will be able to
consume, even if the host system has idle CPU cycles. The cap is expressed in
percentage of one physical CPU: 100 is 1 physical CPU, 50 is half a CPU, 400 is 4
CPUs, etc... The default, 0, means there is no upper cap.
– VCPUs: Number of virtual CPUs given to a domain – exactly replaces the
concept of the number of CPUs in a physical machine
 More VCPUs – threads can run in parallel (only if physical CPUs > 1)
 Number of VCPUs should be = number of physical CPUs
 Fewer will limit performance
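The weight and cap rules above can be sketched numerically. This is an illustrative calculation, not Xen code; the function name `cpu_shares` is invented, and the sketch deliberately ignores that the real scheduler is work conserving (CPU time left idle by a capped domain would actually be redistributed to the others).

```python
# Illustrative sketch (not Xen code): how credit-scheduler weights and caps
# translate into CPU share on a contended host.

def cpu_shares(weights, caps, num_cpus=1):
    """Return each domain's CPU share, in % of one physical CPU.

    weights: dict domain -> weight (legal range 1..65535, default 256)
    caps:    dict domain -> cap in % of one CPU (0 or absent = no cap)
    """
    total_cpu = 100.0 * num_cpus
    total_weight = sum(weights.values())
    shares = {}
    for dom, w in weights.items():
        share = total_cpu * w / total_weight   # proportional fair share
        cap = caps.get(dom, 0)
        if cap:                                # cap bounds the maximum
            share = min(share, cap)
        shares[dom] = share
    return shares

# A weight-512 domain gets twice the CPU of a weight-256 domain:
print(cpu_shares({"dom1": 512, "dom2": 256}, {}))
```

Note that the cap is absolute (percent of one physical CPU), while the weight only matters relative to the other domains' weights.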
….Credit Scheduler
– SMP load balancing
 The credit scheduler automatically load balances
guest VCPUs across all available physical CPUs on
an SMP host. The administrator does not need to
manually pin VCPUs to load balance the system.
However, she can restrict which CPUs a particular
VCPU may run on using the generic vcpu-pin
interface.
Usage
The xm sched-credit command may be used to tune the per VM guest
scheduler parameters.
xm sched-credit -d <domain>
lists weight and cap
xm sched-credit -d <domain> -w <weight>
sets the weight
xm sched-credit -d <domain> -c <cap>
sets the cap
Credit Scheduler Algorithm

Each CPU manages a local run queue of runnable VCPUs.
– queue is sorted by VCPU priority.
– A VCPU's priority can be one of two values: over or under
 Over: exceeded its fair share of CPU resource in the ongoing accounting period.
 Under: not exceeded
– When inserting a VCPU onto a run queue, it is put after all other VCPUs of equal
priority to it.

As a VCPU runs, it consumes credits.
– Every tick (10 ms), 100 credits are debited from the running VCPU
– Unless its credits go negative, a VCPU gets three ticks to run (a 30 ms slice)
– Negative credits imply a priority of over. Until a VCPU consumes its allotted credits,
its priority is under.
– Credits refreshed periodically (every 30 ms)
 Active vs Inactive VM –
– Credits reduction/refreshing happens only for active VMs (those that keep
using their credits)
– Those not using their credits are marked “inactive”
…Credit Scheduler Algorithm
 On each CPU, at every scheduling decision (when a VCPU
blocks, yields, completes its time slice, or is awakened), the
next VCPU to run is picked off the head of the run queue
(UNDER).
 When a CPU doesn't find a VCPU of priority under on its
local run queue, it will look on other CPUs for one. This
load balancing guarantees each VM receives its fair share
of CPU resources system-wide.
 Before a CPU goes idle, it will look on other CPUs to find
any runnable VCPU. This guarantees that no CPU idles
when there is runnable work in the system.
…Credit Scheduler
[Figure: run queue on CPU1 holding VCPUs V1–V5. V5 ends its time slice (or its
domain blocks on I/O) and its credits are recalculated. With positive credits,
V5 re-enters the queue as UNDER, behind the other UNDER VCPUs; with negative
credits, V5 is marked OVER and is queued behind all UNDER VCPUs.]
…Credit Scheduler
 SMP load balancing – find the next runnable VCPU in the
following order:
– UNDER domain in local run queue
– UNDER domain in run queue of other CPUs
– OVER domain in local run queue
– OVER domain in run queue of other CPUs
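The four-step search order can be sketched directly. The queue representation here (lists of `(vcpu, priority)` pairs) is invented for illustration; Xen actually keeps a sorted run queue per physical CPU.

```python
# Sketch of the credit scheduler's SMP VCPU selection order (illustrative
# data structures, not Xen's).

def pick_next(local_queue, remote_queues):
    """Each queue is a list of (vcpu, priority), priority 'under' or 'over'."""
    for wanted in ("under", "over"):
        for vcpu, prio in local_queue:      # steps 1 and 3: local queue first
            if prio == wanted:
                return vcpu
        for q in remote_queues:             # steps 2 and 4: steal from peers
            for vcpu, prio in q:
                if prio == wanted:
                    return vcpu
    return None                             # nothing runnable anywhere: idle

local = [("v1", "over")]
remote = [[("v2", "under")]]
print(pick_next(local, remote))   # v2: a remote UNDER beats a local OVER
```

This ordering is what gives each VM its fair share system-wide: UNDER VCPUs anywhere in the system run before any OVER VCPU does.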
Performance Studies
 CPU Sharing is predictable
 I/O sharing is not
 See:
Study 1 Cherkasova paper.
Applications running in Domain 1: (one of)
 Web server: requests for 10KB sized files
– Measure: requests/sec
 iperf: measures maximum achievable network throughput,
– Mbps
 dd: reads 1000 1KB blocks
– Disk throughput: MB/s
Schedulers with three workloads
• Performance varies between schedulers
• Sensitive to Dom0 weight
• Disk reads are least sensitive to both scheduler and Dom0 weight
Study 2: Ongaro paper:
Experiment 3: Burn x 7, ping x 1
• Processor sharing is fair for all schedulers
• SEDF provides low latency
• Boost, sort and ticking together work best
Study 3: Sujesha et al (IITB)
Impact of Resource Allocation on
App performance
Response time vs resource (CPU) allocated,
for different loads
Desired operation: at the
“knee” of the curve.
Study 4 (Gundecha/Apte):
Performance of multiple domains
vs scheduler parameters
 Is resource management fair to every domain
irrespective of
– type of load
– scheduling parameters of the domain
 Scheduler used – default credit scheduler
 Two virtual machines, one with a CPU-intensive and the
other with an IO-intensive workload (Apache
webserver).
Results

Experiment 1 : one VM with webserver running

VM    weight  cap  load             %CPU  req/sec  time/req (ms)  transfer rate (KB/s)
vm1   256     400  -                -     -        -              -
vm2   256     400  webserver (180)  NA    797.61   2.507          1035.17

Experiment 2 : Mixed Load - one VM with CPU load, one VM with webserver

VM    weight  cap  load             %CPU  req/sec  time/req (ms)  transfer rate (KB/s)
vm1   256     400  CPU              100   -        -              -
vm2   256     400  webserver (180)  NA    607.43   3.293          788.35

Table 1 : Webserver performance stats with and without CPU load on other VM
Study 4: Conclusions
 Performance isolation is imperfect, when
high-level performance measures are
considered (application throughput)
Live Migration in Xen VMM
Live Migration: Advantages
 Avoids difficulties of process migration
– “Residual dependencies” – original host has to remain available
and network-connected to service some calls from migrated OS*
 In-memory state can be transferred consistently
– TCP state
– “this means that we can migrate an on-line game server or
streaming media server without requiring clients to reconnect: ....”
– “Allows a separation of concerns between the users and operator of
a data center or cluster. Users need not provide the operator with
any OS-level access at all (e.g. a root login to quiesce processes or
I/O prior to migration). “
*Residual dependencies
“The problem of the residual dependencies
that a migrated process retains on the
machine from which it migrated. Examples
of residual dependencies include open file
descriptors, shared memory segments, and
other local resources. These are
undesirable because the original machine
must remain available, and because they
usually negatively impact the performance
of migrated processes.”
In summary: Live migration is
 an extremely powerful tool for cluster administrators
– allowing separation of hardware and software
considerations
– consolidating clustered hardware into a single coherent
management domain.
– If host needs to be removed from service
 Move guest domains, shutdown machine
– Relieve bottlenecks
 If host is overloaded, move guest domains to idle hosts
“virtualization + live migration = improved manageability”
Goals of live migration
 Reduce time during which services are
totally unavailable
 Reduce total migration time – the time during
which synchronization is happening (which might be
unreliable)
 Migration should not create unnecessary
resource contention (CPU, disk, network,
etc)
Xen live migration highlights
 SPECweb benchmark migrated with 210ms
unavailability
 Quake 3 server migrated with 60ms
downtime
 Can maintain network connections and
application state
 Seamless migration
Xen live migration: Key ideas
 Pre-copy approach
– Pages of memory iteratively copied without
stopping the VM
– Page-level protection hardware used to ensure
consistent snapshot
– Rate adaptive algorithm to control impact of
migration traffic
– VM pauses only in final copy phase
Details: Time definitions
 Downtime: period during which the service is
unavailable due to there being no currently
executing instance of the VM; this period will be
directly visible to clients of the VM as service
interruption.
 Total migration time: duration between when
migration is initiated and when the original VM
may be finally discarded and hence, the source
host may potentially be taken down for
maintenance, upgrade or repair.
Memory transfer phases
 Push
– VM keeps sending pages from source to dest
while running (modified pages have to be
resent)
 Stop-and-Copy
– Halt the VM, copy the entire image, start the VM on dest
 Pull
– Start VM on dest. Pages that are not found
locally are retrieved from source
Migration approaches
 Combinations of phases
– Pure stop-and-copy:
 Downtime & total migration time proportional to physical
memory size (we want downtime to be much smaller)
– Pure demand-migration
 Move some minimal data required, start VM on dest.
 VM will page-fault a lot initially – total migration time
will be long, and performance may be unacceptable
– This paper: “Bounded push phase followed by
stop-and-copy”
Xen “pre-copy” key ideas
 Pre-copying (push) happens in rounds
– In round n pages that changed in round n-1 are copied
– Knowledge of pages which are written to frequently –
poor candidates for pre-copy
 “Writable working set” (WWS) analysis for server workloads
 Short stop-and-copy phase
 Awareness of impact on actual services
– Control resources used by migration (e.g. network
bandwidth used, CPU used etc.)
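The iterative pre-copy idea can be illustrated with a toy simulation. Everything here is an assumption for illustration only: the page count, the uniform per-round dirtying probability, and the stop threshold. Real dirtying is workload dependent, which is exactly why the WWS matters.

```python
import random

# Toy simulation of iterative pre-copy: in round n we resend the pages
# dirtied during round n-1; the VM pauses only for a final stop-and-copy
# of whatever remains dirty. All parameters are illustrative assumptions.

def precopy(total_pages, dirty_prob, max_rounds=10, stop_threshold=64):
    random.seed(0)                       # deterministic toy run
    to_send = set(range(total_pages))    # round 0: copy everything
    sent = 0
    for _ in range(max_rounds):
        sent += len(to_send)
        # Pages written while this round was being copied must be resent.
        dirtied = {p for p in range(total_pages) if random.random() < dirty_prob}
        to_send = dirtied
        if len(to_send) <= stop_threshold:   # small enough: stop-and-copy
            break
    return sent, len(to_send)   # (pages pushed live, pages copied in downtime)

pushed, final = precopy(total_pages=10_000, dirty_prob=0.02)
```

Frequently written pages keep reappearing in the dirty set every round, which is why they are poor pre-copy candidates and are best left for the short stop-and-copy phase.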
Decoupling local resources
 Network: Assume source, dest VM on same
LAN
– Migrating VM will move with TCP/IP state and
IP address
– Generate unsolicited ARP reply – ARP reply
includes IP address mapping to MAC address.
All receiving hosts will update their mapping
 A few packets in transit may be lost; very minimal
impact expected
… Decoupling local resources
 Network…Problem: Routers may not accept
broadcast ARP replies (note: ARP requests
are expected to be broadcast – not replies)
– Send ARP replies to addresses in its own ARP
cache
– On a switched network the migrating OS keeps its
original Ethernet MAC – depends on the network
switch to detect that the MAC moved to a new port
… Decoupling local resources
 Disk
– Assume Network Attached Storage
– Source/Dest machine on the same storage
network
– Problem not directly addressed
Migration Steps
Failure
management:
If not enough
resources, no
migration
At end of this
phase, two
consistent
copies of VM
OK Message
from BA
and ack from
AB. VM on
A can stop
One
consistent
VM active at
all times.
If “commit”
step does not
happen,
migration
aborted and
original VM
continues
Writable Working Sets
 Copying VM memory the largest migration
overhead
 Stop-and-copy simplest approach
– Downtime unacceptable
 Pre-copy migration may reduce downtime
– What about pages that are dirtied while copying?
– What if rate of dirtying > rate of copying?
 Key observation: in most VMs, large parts of memory
are not modified; only a small part is
(called the WWS)
WWS Measurements
[Figure: writable working set size over time – different for different workloads]
Estimated Effect on Downtime
[Figure: expected downtime vs bandwidth; each successive line (top to bottom)
shows increasing pre-copy rounds]
Pre-copy reduces downtime (confirmed by many experiments)
Details: Managed Migration
(Migration by Domain 0)
 First round: All pages copied
 Subsequent rounds – only dirtied pages copied. (maintain
dirty bitmap). How:
– Insert shadow page tables
– Populated by translating sections of guest OS page tables
 Page table entries are initially read-only mappings
 If the guest tries to modify a page, a fault is created and trapped by Xen
 If write access is permitted by the original page table, do the same here – and
set the bit in the dirty bitmap
– At the start of each pre-copy round, the dirty bitmap is given to the control
software, Xen's bitmap is cleared, shadow page tables are destroyed and
recreated, and all write permissions are lost
 When pre-copy is to stop, a message is sent to the guest OS to
suspend itself; the dirty bitmap is checked and pages are synched.
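The read-only-mapping trick above can be sketched in miniature. The `DirtyTracker` class is invented for illustration; Xen implements this with shadow page tables and real hardware page faults.

```python
# Miniature sketch of dirty-page tracking via write protection (illustrative
# class; Xen uses shadow page tables and hardware faults for this).

class DirtyTracker:
    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.writable = [False] * num_pages   # all mappings start read-only
        self.dirty = [False] * num_pages

    def write(self, page):
        # A write to a read-only page "faults"; the hypervisor grants write
        # access and records the page in the dirty bitmap.
        if not self.writable[page]:
            self.writable[page] = True
            self.dirty[page] = True

    def start_round(self):
        # Hand the bitmap to the control software, clear it, and revoke all
        # write permissions so that new writes fault (and are tracked) again.
        snapshot = [i for i, d in enumerate(self.dirty) if d]
        self.dirty = [False] * self.num_pages
        self.writable = [False] * self.num_pages
        return snapshot

t = DirtyTracker(8)
t.write(3); t.write(5)
print(t.start_round())   # [3, 5]
```

Revoking write permissions at the start of every round is the key step: it is what lets each pre-copy round see exactly the pages dirtied since the previous round.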
Self-Migration
 Major difficulty: the OS itself has to run to
transfer itself – what is the correct state to
transfer? Solution:
– Suspend all activities except migration-related
– Dirty page scan
– Copy dirty pages to “shadow buffer”
– Transfer shadow buffer
 Page dirtying during this time is ignored.
Dynamic Rate-Limiting Algorithm
 Select minimum and maximum bandwidth limits
 First pre-copy round transfers pages at the minimum bandwidth
 Dirty page rate is calculated (number of pages dirtied /
duration of round)
 Bandwidth limit of next round = dirty rate + 50 Mbit/sec
(higher bandwidth if dirty rate is high)
 Terminate when the calculated rate > max, or less than
256KB remains to be sent
 Stop-and-copy done at maximum bandwidth
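The rate-limiting rules can be written out as a small calculation. The 50 Mbit/s increment and the max-rate / 256 KB termination tests come from the text above; the specific min/max limits, page size, and unit conversions are illustrative assumptions.

```python
# Sketch of the dynamic rate-limiting rule (constants from the text; the
# min/max limits, page size, and units are illustrative assumptions).

MIN_BW = 100      # Mbit/s, administrator-chosen minimum (assumed value)
MAX_BW = 500      # Mbit/s, administrator-chosen maximum (assumed value)
INCREMENT = 50    # Mbit/s added on top of the measured dirty rate

def next_round_bandwidth(pages_dirtied, round_duration_s, page_kb=4):
    """Bandwidth limit for the next pre-copy round, in Mbit/s."""
    dirty_rate = pages_dirtied * page_kb * 8 / 1024 / round_duration_s
    return dirty_rate + INCREMENT

def should_stop(bw, remaining_kb):
    # Terminate pre-copy when the computed rate exceeds the maximum,
    # or fewer than 256 KB of dirty memory remain; then do the final
    # stop-and-copy at MAX_BW.
    return bw > MAX_BW or remaining_kb < 256

# 2000 pages dirtied in a 1 s round -> 62.5 Mbit/s dirty rate + 50 Mbit/s:
bw = max(MIN_BW, next_round_bandwidth(pages_dirtied=2000, round_duration_s=1.0))
```

Tying the next round's bandwidth to the measured dirty rate keeps migration traffic just fast enough to outrun page dirtying, without saturating the network.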
Live Migration: Some results
…Results
 Downtime: 210 ms
 Total migration time: 71 secs
 No perceptible impact on performance
during uptime