NORMAN BOBROFF, ANDRZEJ KOCHUT, KIRK BEATY
SOME SLIDE CONTENT ADAPTED FROM ALEXANDER NUS
PRESENTED BY JON LOGAN
Virtual machines are becoming increasingly common throughout our datacenters
Servers use electricity
Electricity can be expensive!
How do we minimize the number of utilized machines, while meeting our SLA obligations?
Usage patterns of machines are NOT static, and generally change dynamically
Maximize utilization of active machines
Minimize Service Level Agreement (SLA) violations
Minimize number of active machines
Power off unused machines to conserve cost
(electricity)
Essentially, minimize cost while meeting SLA guarantees
All machines are taken offline, and historical usage is used to determine ideal placement
Happens very infrequently (~weeks or months)
Must interrupt service to relocate
Utilization is not consistent in many cases! Demand may vary significantly within the period between allocations
VMs are seamlessly migrated between machines based on predicted demand
Is done rather frequently (~minutes, hours)
Live migration
Minimal (~ms) service disruptions during migration
Allows for allocations to more closely follow demand
Moves a VM image between machines without service interruption
The paper cites a ~45 second transition time
VM must be serialized and transferred over the network
Artificially limits our reallocation period
Can’t reallocate faster than we can migrate!
Essentially a contract between the provider and the customer stating that resources R will be available X% of the time
Violations cost money!
X is usually high (ex. 95%)
VMs do not necessarily use this entire resource allocation at all times, but it must be available should they choose to use it
Ex. a VM may be doing batch processing, and only does substantial work between 12:00AM and 1:00AM
Workloads are not static!
Try to predict the usage of the VM over a time interval T
Reallocate machines to be able to meet that predicted usage
Need to be within a certain percentile to meet SLA requirements
Capacity savings is simply:
Static Allocation - (Predicted Usage + Error Factor) (see the sketch below)
Repeat this process every interval T
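A minimal numeric sketch of this calculation (the specific values, including the 0.5-core error factor, are made-up illustrations, not figures from the paper):

```python
# Hypothetical per-VM example of the capacity-savings calculation for one interval T.
static_allocation = 4.0    # CPU cores reserved under static placement (assumed)
predicted_usage = 1.5      # forecast demand for the next interval (assumed)
error_factor = 0.5         # headroom covering forecast error at the SLA percentile (assumed)

capacity_savings = static_allocation - (predicted_usage + error_factor)
print(capacity_savings)    # 2.0 cores freed for other VMs during this interval
```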
Not all Workloads are created equal
Some tend to be better than others
Constant workloads = bad!
A workload is an ideal candidate for dynamic allocation if
It has strong variability AND
It has strong autocorrelation combined with periodic behavior
Essentially, the workload needs a decent degree of variability and reasonably predictable usage (see the sketch below)
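A rough screen for candidacy, as a sketch only; the coefficient-of-variation and autocorrelation thresholds are assumptions chosen for illustration, not criteria from the paper:

```python
import numpy as np

def is_dynamic_candidate(demand, period, cv_threshold=0.3, acf_threshold=0.6):
    """Screen a demand trace for dynamic-placement candidacy.

    demand: 1-D array of historical utilization samples
    period: suspected seasonal lag in samples (e.g., one day)
    Thresholds are illustrative assumptions.
    """
    demand = np.asarray(demand, dtype=float)
    variability = demand.std() / demand.mean()              # coefficient of variation
    centered = demand - demand.mean()
    lag_acf = (centered[:-period] * centered[period:]).sum() / (centered ** 2).sum()
    return variability >= cv_threshold and lag_acf >= acf_threshold
```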
Strongly variable – good
Autocorrelation ~0.8 – good
Weak periodic behavior – bad
Verdict – Good
Large variability offers significant potential for optimization
Strong autocorrelation makes it possible to obtain a low-error prediction
Weakly variable - bad
Decaying autocorrelation - bad
Weak periodic behavior – bad
Verdict – Bad
Low variability makes potential gain low
Weak autocorrelation and no periodic component make it difficult to predict demand
Strongly variable – good
Strong Autocorrelation– good
Strong periodic behavior – good
Verdict – Very Good
An ideal case for dynamic allocation
Determine the periods in demand using ‘common sense’ aided by a periodogram (e.g., time of day, day of week, …)
Decompose the process into a deterministic periodic component D_i and a residual component r_i
Estimate the deterministic part using averaging of multiple smoothed historical periods
Fit Auto Regressive Moving Average (ARMA) model to the residual process
Use the combined components for demand prediction
U_i = D_i + r_i
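A sketch of that pipeline in Python, assuming statsmodels is available; the ARMA(1,1) order and the plain per-phase averaging are simplifications for illustration, not the paper's exact estimator:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA  # assumes statsmodels is installed

def forecast_demand(history, period, steps=1):
    """Forecast U_i = D_i + r_i from a 1-D demand history.

    period: seasonal lag in samples; the ARMA(1,1) order is an assumption.
    """
    history = np.asarray(history, dtype=float)
    n_periods = len(history) // period
    trimmed = history[-n_periods * period:]
    # Deterministic component D_i: average each phase across historical periods
    deterministic = trimmed.reshape(n_periods, period).mean(axis=0)
    residual = trimmed - np.tile(deterministic, n_periods)
    # Residual component r_i: fit ARMA(1,1), forecast `steps` intervals ahead
    r_hat = ARIMA(residual, order=(1, 0, 1)).fit().forecast(steps=steps)
    d_hat = deterministic[np.arange(steps) % period]  # trimmed ends on a period boundary
    return d_hat + r_hat
```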
Goal is to minimize time averaged number of active servers without violating the SLA agreement
Machines that are not utilized to handle VMs are powered off or put in a low power state
Will be reactivated if/when required (minimally, the next period)
The time to power on & migrate must be less than the period T
Responsible for the actual migration of VMs between machines
Placing VMs is essentially a version of the bin packing problem
NP hard!
We use a first-fit approximation (sketched below)
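A minimal first-fit sketch, assuming a single resource dimension and identical PM capacities; this illustrates the heuristic rather than the paper's exact packing routine:

```python
def first_fit(vm_demands, pm_capacity):
    """Assign each VM to the first PM with enough spare capacity,
    powering on a new PM when none fits."""
    free = []         # remaining capacity of each active PM
    placement = []    # placement[i] = index of the PM hosting VM i
    for demand in vm_demands:
        for pm, spare in enumerate(free):
            if demand <= spare:
                free[pm] -= demand
                placement.append(pm)
                break
        else:
            free.append(pm_capacity - demand)   # open (power on) a new PM
            placement.append(len(free) - 1)
    return placement, len(free)                 # mapping and number of active PMs
```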
Measure – Measure usage
Forecast – Predict usage for the next window
Remap – Relocate VMs if necessary
Perform this (MFR) loop at regular intervals (see the sketch below)
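Tying the steps together, a hypothetical MFR loop; measure_usage and migrate are placeholder functions, forecast_demand and first_fit are the sketches above, and the period of 96 samples assumes the 15-minute sampling mentioned later (one day):

```python
import time

def mfr_loop(vms, pm_capacity, interval_seconds):
    """Hypothetical Measure-Forecast-Remap control loop."""
    # measure_usage and migrate are assumed placeholder functions, not from the paper.
    while True:
        histories = [measure_usage(vm) for vm in vms]                       # Measure
        predicted = [forecast_demand(h, period=96)[0] for h in histories]   # Forecast
        placement, active_pms = first_fit(predicted, pm_capacity)           # Remap
        migrate(vms, placement)       # relocate only the VMs whose placement changed
        time.sleep(interval_seconds)  # wait for the next remapping interval R
```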
Designed to try to predict the “best we can do”
N – virtual machines
M – physical machines
C_m – maximum capacity of physical machine m
f_{n,i+k} – forecast value for resource demand of VM n at interval i+k
R – migration interval
C_p(μ, σ²) – the (1−p)-percentile of a Gaussian distribution with mean μ and variance σ²
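A one-line illustration of the C_p(μ, σ²) term, assuming SciPy; the example numbers are invented:

```python
from scipy.stats import norm  # assumes SciPy is available

def required_capacity(mu, sigma2, p=0.05):
    """C_p(mu, sigma^2): the (1 - p)-percentile of the Gaussian demand forecast.
    Reserving this much capacity keeps the violation probability near p."""
    return norm.ppf(1.0 - p, loc=mu, scale=sigma2 ** 0.5)

# Example (invented numbers): mean forecast 2.0, variance 0.25, p = 0.05 -> ~2.82
```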
Simulated using traces gathered from hundreds of production servers using various applications
Traces contain CPU, memory, storage, and network
We are only focusing on CPU usage
Samples were collected every 15 minutes
The simulation study
Verifies that the MFR meets SLA targets
Quantifies the reduction of SLA violations
Quantifies the number of saved machines
Explores the relationship between the remapping interval and the gain from dynamic management
Performs measurements to determine properties of a practical infrastructure with respect to migration of VMs
Significantly reduces the number of active machines
Performance degrades as the migration interval increases
Essentially, the prediction must cover the maximum usage expected within the interval
The paper only looks at one resource utilization
In this case, CPU utilization
In the real world, allocations must account for numerous resources
Memory, CPU, IO, Network, etc.
Assumes bandwidth between machines is free & unrestricted
In some cases, relocating a VM may not be worth the cost of transferring its image
Their study size is small
Only 6 physical machines
What if different VMs have different SLA requirements?
What if your PMs had differing hardware?
Based on the simulated data, dynamic placement significantly reduces the cost of running virtual machines
Relies on an ideal case of VMs
Predictable and volatile usage
Algorithm could be optimized to reduce the number of VM relocations, or to more optimally schedule
Simulation is too small
The paper claims a 44% average savings in the number of active PMs