Dynamic Placement of Virtual Machines for

advertisement

Dynamic Placement of Virtual

Machines for Managing SLA

Violations

NORMAN BOBROFF, ANDRZEJ KOCHUT, KIRK BEATY

SOME SLIDE CONTENT ADAPTED FROM ALEXANDER NUS

PRESENTED BY JON LOGAN

Motivation

Virtual machines are becoming more and more popular throughout our datacenters

Servers use electricity

Electricity can be expensive!

How do we minimize the number of utilized machines, while meeting our SLA obligations?

Usage patterns of machines are NOT static, and generally change dynamically

Goals

Maximize utilization of active machines

Minimize Service Level Agreement (SLA) violations

Minimize number of active machines

Power off unused machines to conserve cost

(electricity)

Essentially, minimize cost while meeting SLA guarantees

Static Allocation

All machines are taken offline, and historical usage is used to determine ideal placement

Happens very infrequently (~weeks or months)

Must interrupt service to relocate

Utilization is not consistent in many cases! Demand may vary significantly within the period between allocations

Dynamic Allocation

VMs are seamlessly migrated between machines based on predicted demand

Is done rather frequently (~minutes, hours)

Live migration

Minimal (~ms) service disruptions during migration

Allows for allocations to more closely follow demand

Live Migration

Moves a VM image between machines without service interruption

The paper cites a ~45 second transition time

VM must be serialized and transferred over the network

Artificially limits our reallocation period

Can’t reallocate faster than we can migrate!

Service Level Agreement

Essentially is a contract between the provider and the customer that states that resources R will be available X% of the time

Violations cost money!

X is usually high (ex. 95%)

VMs do not necessarily use this entire resource allocation at all times, but it must be available should they choose to use it

Ex. VM may be doing batch processing, and only do substantial work between 12:00AM and 1:00AM

Static vs Dynamic Usages

Workloads are not static!

Try to predict the usage of the VM in a time

T

Reallocate machines to be able to meet that predicted usage

Need to be within a certain percentile to meet SLA requirements

Capacity savings is simply

Static Allocation - (Predicted Usage + Error

Factor)

Repeat this process every time T

What Workloads Are Best For Dynamic Allocation?

Not all Workloads are created equal

Some tend to be better than others

Constant workloads = bad!

A workload is an ideal candidate for dynamic allocation if

It has strong variability AND

It has strong autocorrelation combined with periodic behavior

Essentially, you need to have a decent degree of variability, and be able to reasonably predict its usage

Workload 3a

Strongly variable – good

Autocorrelation ~0.8 – good

Weak periodic behavior – bad

Verdict – Good

Large variability offers significant potential for optimization

Strong autocorrelation makes it possible to obtain a low-error predication

Workload 3b

Weakly variable - bad

Decaying autocorrelation - bad

Weak periodic behavior – bad

Verdict – Bad

Low variability makes potential gain low

Weak autocorrelation and no periodic component make it difficult to predict demand

Workload 3c

Strongly variable – good

Strong Autocorrelation– good

Strong periodic behavior – good

Verdict – Very Good

An ideal case for dynamic allocation

Potential Gain

Demand forecast algorithm

Determine the periods in demand using ‘common sense’ aided by periodogram (e.g.time-of-day,day of week,…)

Decompose the process into deterministic periodic and residual components D i

+ r i

Estimate the deterministic part using averaging of multiple smoothed historical periods

Fit Auto Regressive Moving Average (ARMA) model to the residual process

Use the combined components for demand prediction

U i

= D i

+ r i

Management Algorithm

Goal is to minimize time averaged number of active servers without violating the SLA agreement

Machines that are not utilized to handle VMs are powered off or put in a low power state

Will be reactivated if/when required (minimally, the next period)

The time to power on & migrate must be less than the period T

Responsible for actual migrations of machines

Placing of VMs is essentially a version of the bin packing problem

NP hard!

We use an approximation, using first-fit

Management Algorithm

Measure – Measure usage

Forecast – Predict usage for the next window

Remap – Relocate machines if necessary

Preform this (MFR) at regular intervals

Designed to try to predict the “best we can do”

Management Algorithm

Overview

Key Terms

N – virtual machines

M – physical machines

C m – Maximum capacity of physical machine f n i, k i+k

– forcast value for resource demand of VM n at interval

R – migration interval

C p

(u, o 2 ) – (1-p)-percentile of Gaussian distribution with mean u and variance o 2

Management Algorithm

Management Algorithm (2)

Management Algorithm (3)

Management Algorithm (4)

Simulations

Simulated using traces gathered from hundreds of production servers using various applications

Traces contain CPU, memory, storage, and network

We are only focusing on CPU usage

Samples were collected every 15 minutes

The simulated study

Verifies that the MFR meets SLA targets

Quantifies the reduction of SLA violations

Quantifies the number of saved machines

Explores the relationship between the remapping interval and the gain from dynamic management

Performs measurements to determine properties of a practical infrastructure with respect to migration of VMs

Overflows vs Number of PMs

Number of Machines vs Overflow Desired

Significantly reduces number of machines active

Performance degrades as the migration interval increases

Essentially, the prediction is the max usage predicted within the range

Limitations

The paper only looks at one resource utilization

In this case, CPU utilization

In the real world, you have numerous resources to handle allocations for

Memory, CPU, IO, Network, etc.

Assumes bandwidth between machines is free & unrestricted

Relocating some VMs in some cases may not be worth the cost of relocating the image

Their study size is small

Only 6 physical machines

What if different VMs have different SLA requirements?

What if your PMs had differing hardware?

Conclusion

Based on the simulated data, it significantly reduces cost to execute virtual machines

Relies on an ideal case of VMs

Predictable and volatile usage

Algorithm could be optimized to reduce the number of VM relocations, or to more optimally schedule

Simulation is too small

The paper claims a 44% average savings in the number of active

PMs

Download