VirtualKnotter: Online Virtual Machine Shuffling for Congestion Resolving in Virtualized Datacenter

Xitao Wen, Kai Chen, Yan Chen, Yongqiang Liu, Yong Xia, Chengchen Hu

Datacenter as Infrastructure


Congestion in Datacenter

[Figure: three-tier datacenter topology (core, aggregation, and edge layers over Pods 0–3), with typical oversubscription ratios of 2:1~10:1 and 10:1~100:1 between layers. Oversubscription leads to packet loss, queuing delay, and degraded throughput.]

Outline

• Congestion in the Wild
• General Approaches
• Problem Formulation
• Main Design
• Evaluation

Spatial Pattern

• Unbalanced utilization
– Hotspot: hot links account for <10% of core links [IMC10]
– Utilization is spatially unbalanced across the network

Temporal Pattern

• Long congestion events
– Last for 10s of minutes
– Individual events have a clear spatial pattern

Traffic Stability

• Bursty at fine granularity
– Not predictable at 10s or 100s of milliseconds [IMC10][SIGCOMM09]

• Predictable at a timescale of 10s of minutes
– 40% to 70% of pairwise traffic can be expected to be stable
– 90%+ of the traffic aggregated at core links is predictable


General Approaches

• Network Layer
– Increase network bandwidth (Fat-tree, BCube, OSA…)
• Expensive; requires upgrading the entire DC network
– Optimize flow routing (Hedera, MicroTE)
• Not scalable; requires hardware support; depends on rich path diversity

• Application Layer
– Optimize VM placement
• Scalable; lightweight deployment; suitable for existing oversubscribed networks

Background on Virtualized DC

• Virtualization Layer

• VM Live Migration
– Keeps service continuous while migrating
– Transfers 1.1x to 1.4x of the VM's memory (the major cost!)

[Figure: VMs hosted on servers interconnected by the DC network]
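A quick back-of-envelope sketch of the migration cost quoted above, using the 1.1x–1.4x memory-transfer range; the function name and units are illustrative, not from the paper:

```python
def migration_traffic_gb(vm_memory_gb, overhead=(1.1, 1.4)):
    """Estimate the network traffic generated by one live migration,
    using the 1.1x-1.4x memory-transfer range quoted above."""
    lo, hi = overhead
    return vm_memory_gb * lo, vm_memory_gb * hi

# e.g., migrating a 4 GB VM moves roughly 4.4-5.6 GB over the network
```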

Optimize VM Placement

• Offload traffic from congested links by migrating VMs

[Figure: example migration; legend: active VM, idle VM]


Design Goal

• Mitigate congestion
– Objective: minimize Maximum Link Utilization (MLU)

• Controllable migration traffic (i.e., the traffic from moving VMs)
– Must stay below the traffic reduction it achieves

• Reasonable runtime overhead
– Runtime far below the target timescale (10s of minutes)

Problem Statement

• Input
– Topology and routing of physical servers
– Traffic matrix among VMs
– Current placement

• Variable & Output
– Optimized placement

• NP-hardness
– Proof: reduction from the Quadratic Bottleneck Assignment Problem
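Given these inputs, the MLU objective can be sketched as follows. This is an illustrative computation, not the paper's implementation; the function name and data layout are assumptions:

```python
def max_link_utilization(traffic, placement, routes, capacity):
    """traffic[(u, v)]: traffic rate from VM u to VM v.
    placement[vm]: the server currently hosting vm.
    routes[(s, t)]: links on the routing path from server s to server t.
    capacity[link]: capacity of each physical link."""
    load = {}
    for (u, v), rate in traffic.items():
        s, t = placement[u], placement[v]
        if s == t:          # co-located VMs bypass the physical network
            continue
        for link in routes[(s, t)]:
            load[link] = load.get(link, 0.0) + rate
    return max((load[l] / capacity[l] for l in load), default=0.0)
```

Re-evaluating this function on a candidate placement is what lets a search compare placements without touching the network.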

Related Work

• Optimize VM placement

– Server consolidation [SOSP’07]

– Fault tolerance [ICS’07]

– Network scalability [INFOCOM’10]



Inspiration

Untangle each knot gently, carefully reeving the end out of the knot.

Or stretch the knot forcefully, making it loose and less tangled.

Two-step Algorithm

Inputs: topology & routing, traffic matrix, current VM placement

Step 1: Multiway θ-Kernighan-Lin
• Fast and greedy
• Searches to localize overall traffic
• May get stuck in a local minimum

Step 2: Simulated Annealing
• Fine-grained and randomized
• Searches to mitigate traffic on the most congested links
• Helps avoid local minima

Output: optimized VM placement

Multiway Θ-Kernighan-Lin (KL)

• Top-down graph-cut improvement
• Introduce Θ to limit the number of moves
• Complexity: O(n² log n)

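A minimal sketch of a Θ-limited KL-style pass on a two-way cut. The paper's algorithm is multiway and runs in O(n² log n); this naive illustration recomputes the cut weight per candidate swap and is only meant to show how Θ bounds the number of moves (and hence the migration traffic):

```python
import itertools

def cut_weight(traffic, part_a):
    """Total traffic crossing the cut defined by VM group part_a."""
    return sum(r for (u, v), r in traffic.items()
               if (u in part_a) != (v in part_a))

def theta_kl_pass(traffic, part_a, part_b, theta):
    """Greedily swap VM pairs across the cut to shrink cross-cut
    traffic, stopping after at most theta swaps."""
    part_a, part_b = set(part_a), set(part_b)
    for _ in range(theta):
        base = cut_weight(traffic, part_a)
        best = None
        for u, v in itertools.product(part_a, part_b):
            new_a = (part_a - {u}) | {v}   # tentative swap u <-> v
            gain = base - cut_weight(traffic, new_a)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, u, v)
        if best is None:                   # no improving swap left
            break
        _, u, v = best
        part_a = (part_a - {u}) | {v}
        part_b = (part_b - {v}) | {u}
    return part_a, part_b
```

Localizing heavy VM pairs on the same side of the cut is what offloads the oversubscribed links between the two sides.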

Simulated Annealing Searching (SA)

• Randomized global search
• Terminates when a satisfactory solution is obtained, or when the predefined maximum search depth is reached
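A hedged sketch of this annealing step; the move set (relocate one random VM), cooling schedule, and parameter names are illustrative assumptions, not the paper's exact procedure:

```python
import math
import random

def anneal_placement(vms, servers, cost, max_steps=1000, target=0.0,
                     t0=1.0, cooling=0.99, seed=0):
    """Randomized global search over VM placements.  `cost(placement)`
    returns the objective (e.g. MLU) to minimize; the search stops
    early once `target` is reached or after `max_steps` moves."""
    rng = random.Random(seed)
    placement = {vm: servers[i % len(servers)] for i, vm in enumerate(vms)}
    cur_cost = cost(placement)
    best, best_cost = dict(placement), cur_cost
    temp = t0
    for _ in range(max_steps):
        if best_cost <= target:        # satisfactory solution found
            break
        candidate = dict(placement)
        candidate[rng.choice(vms)] = rng.choice(servers)
        cand_cost = cost(candidate)
        # always accept improvements; accept uphill moves with a
        # temperature-dependent probability to escape local minima
        if (cand_cost <= cur_cost
                or rng.random() < math.exp((cur_cost - cand_cost) / temp)):
            placement, cur_cost = candidate, cand_cost
            if cur_cost < best_cost:
                best, best_cost = dict(placement), cur_cost
        temp *= cooling
    return best, best_cost
```

The occasional uphill acceptance is what lets this step escape the local minima the greedy KL pass can get stuck in.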


Methodology

• Baseline Algorithm

– Clustering-based algorithm

– Pro: best-known static optimality

– Con: high runtime and migration overhead

• Metrics

– MLU reduction without migration overhead

– Overhead

• Migration traffic

• Runtime overhead

– Simulation results


MLU Reduction without Overhead

VirtualKnotter achieves static MLU reduction comparable to that of Clustering.

Migration Traffic

VirtualKnotter incurs significantly less migration traffic than Clustering.

Runtime Overhead

VirtualKnotter demonstrates reasonable runtime overhead.

Simulation Results

• 53% less congestion in simulation

Altogether, VirtualKnotter achieves a significant gain in resolving congestion.

Conclusions

• Collaborative VM migration can substantially resolve long-term congestion in datacenters

• Trading off optimality against migration traffic is essential to harvest the benefit

DC networking projects of Northwestern LIST: http://list.cs.northwestern.edu/dcn

Thank you!

Backup

General Approaches

Approach              | Cost | Hardware Support | Scalability | Other Dependency
Increase Bandwidth    | High | Yes              | Varies      | –
Optimize Routing      | Low  | Yes              | Low         | Rich path diversity
Optimize VM Placement | Low  | No               | High        | VM deployment

Problem Statement

• Objective
– Minimize Maximum Link Utilization (MLU)
– "Cool down the hottest spot"

• Constraints
– Migration traffic
– Server hardware capacity
– Inseparable VMs

• NP-hardness
– Proof: reduction from the Quadratic Bottleneck Assignment Problem
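With notation of my own choosing (not the paper's: π maps VMs to servers, T_{uv} is the VM-to-VM traffic rate, P(s,t) the set of links on the route between servers s and t, and c_l the capacity of link l), the objective above can be written as:

```latex
\min_{\pi}\ \max_{l}\ \frac{1}{c_l}\sum_{(u,v)} T_{uv}\,
  \mathbf{1}\bigl[\, l \in P\bigl(\pi(u),\pi(v)\bigr) \bigr]
```

subject to bounded migration traffic, server hardware capacity, and VM inseparability.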

Observation Summary

• Spatially unbalanced congestion (spatial)
• Long-lasting congestion events (temporal)
• Predictable at the 10s-of-minutes timescale (stability)

Two-step Algorithm

Multiway Θ-Kernighan-Lin Algorithm (KL)
• Fast search for an approximate solution

Simulated Annealing Searching (SA)
• Fine search for a better solution
