Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement

Authors: Xiaoqiao Meng, Vasileios Pappas, and Li Zhang
Presented by: Jinpeng Liu
Introduction
• Modern virtualized data centers host a wide spectrum of applications.
• Bandwidth usage between VMs is rapidly growing
• Scalability of data center networks becomes a concern.
• Techniques suggested:
• Rich connectivity at the edge of the network
• Dynamic routing protocols
•…
Introduction
• This paper tackles the issue from a different perspective.
• Current VM placement strategies have issues:
• Placement is decided by capacity-planning tools that consider:
• CPU
• Memory
• Power consumption
• …but ignore the consumption of network resources
• VM pairs with heavy mutual traffic can end up on host pairs with a large network cost between them.
How often does this pattern happen in practice?
Background
DC traffic pattern
Data collected:
• A data warehouse (DWH) hosted by IBM Global Services: server resource
utilization from hundreds of server farms.
• A server cluster with hundreds of VMs: aggregate traffic of 68 VMs.
• Traces were collected for 10 days.
Background
DC traffic pattern
1. Uneven distribution of traffic volumes from VMs
While 80% of VMs have an average rate below 800 KBytes/min, 4% of
them have a rate roughly 10x higher.
Background
DC traffic pattern
2. Stable per-VM traffic at large timescale
(Figure: traffic is stable at large timescales for 82% of the VMs.)
Background
DC traffic pattern
3. Weak correlation between traffic rate and latency
Background
Tree architecture
Cons:
• Topology scaling limitations
• Scaling relies on scaling up each individual switch
• The core tier accommodates only 8 switches
• Address space limitations
• Higher server over-subscription
Cisco Data Center Infrastructure 2.5 Design Guide
Background
VL2 Architecture
• Shares many features with the tree architecture
• The core and aggregation tiers form a Clos topology (a complete bipartite graph)
• Valiant load balancing
• A randomly selected core switch serves as the intermediate destination
• Location independence
(Figure: VL2 topology with an example sender-to-receiver path.)
Background
PortLand Architecture
(Fat-Tree)
• Requires all switches to be identical
• i.e., the same number of ports
• Built around the concept of pods
• A collection of access and aggregation switches forms a Clos topology
• Pods and core switches form a second Clos topology
• Up-links are evenly distributed
• Pros:
• Full bisection bandwidth: 1:1 over-subscription ratio
• Low cost: commodity switches, low power/cooling
• Cons:
• Scalability: network size depends on ports per switch; with 48 ports => at most 27,648 hosts
Background
Modular Data Center (MDC)
• Thousands of servers interconnected by switches
• Packed into a shipping container
• Offered by Sun, HP, IBM, Dell, …
• Higher degree of mobility
• Higher system and power density
• Lower cost (cooling and manufacturing)
Background
BCube Architecture
Purpose:
• Data-intensive computing
• Bandwidth-intensive communication among MDC servers
• Low-end COTS mini-switches
• Graceful performance degradation
BCube:
• Server-centric: servers are part of the network
• Servers are not directly connected to one another
• Defined recursively
Background
BCube:
• At level 0, a BCube_0 consists of n servers connected by one n-port switch
• A BCube_k is constructed from n BCube_{k-1}s and n^k n-port switches
(Figure, k = 1, n = 4: a BCube_0 is 4 servers on one 4-port switch; a BCube_1 joins 4 BCube_0s with four 4-port switches. A sketch of the resulting sizes follows.)
Background
BCube:
• Server labels are based on the servers' locations in the BCube structure
• Two servers attach to a common level-i switch iff their labels differ only in the level-i digit
(Figure: labeled n = 4, k = 1 example; e.g., servers 2.4 and 1.4 differ only in the level-1 digit and share a level-1 switch, while 2.4 and 1.3 differ in both digits and are not directly connected. A sketch of this rule follows.)
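A tiny sketch of that connectivity rule, assuming labels are digit tuples (a_k, …, a_0) as in the figure (the helper name is my own):

```python
def shared_switch_level(a, b):
    """Return the level of the switch directly connecting servers a and b,
    or None when the labels differ in more than one digit (no direct link).

    Labels are tuples (a_k, ..., a_0); servers share a level-i switch
    iff their labels differ only in the digit at level i.
    """
    k = len(a) - 1
    diffs = [k - pos for pos in range(len(a)) if a[pos] != b[pos]]
    return diffs[0] if len(diffs) == 1 else None

print(shared_switch_level((2, 4), (1, 4)))  # 1: joined by a level-1 switch
print(shared_switch_level((2, 4), (1, 3)))  # None: not directly connected
```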
Virtual Machine Placement Problem
How do we place VMs on a set of physical hosts?
Assumptions:
• Existing CPU/memory-based capacity tools have already decided the number of VMs
that a host can accommodate.
• A slot refers to one CPU/memory allocation on a host.
• A host can have multiple slots.
• A VM can take any unoccupied slot.
• Static, single-path routing.
• All external traffic is routed through a common gateway switch.
Scenario:
• Place n VMs in n slots.
Traffic-aware VM Placement Problem (TVMPP)
๐ถ๐‘–๐‘— : fixed communication cost from slot i to j.
๐ท๐‘–๐‘— : traffic rate from VM i to j.
๐‘’๐‘– : external traffic rate for ๐‘‰๐‘€๐‘– .
๐‘”๐‘– : communication cost between ๐‘‰๐‘€๐‘– and the gateway
๐œ‹: 1, … , ๐‘› ๏‚ฎ[1, … , ๐‘›] : permutation function for assigning n VMs to n
slots
The TVMPP is defined as finding a ๐… to minimize (1)
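A minimal sketch of objective (1) in Python (my own notation, not code from the paper; `perm[i]` is the slot assigned to VM i):

```python
import numpy as np

def tvmpp_objective(D, C, e, g, perm):
    """Objective (1): sum_ij D[i, j] * C[perm[i], perm[j]]
    plus the external-traffic term sum_i e[i] * g[perm[i]]."""
    p = np.asarray(perm)
    internal = (np.asarray(D) * np.asarray(C)[np.ix_(p, p)]).sum()
    external = (np.asarray(e) * np.asarray(g)[p]).sum()
    return internal + external
```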
Traffic-aware VM Placement Problem (TVMPP)
The meaning of the objective depends on the definition of C_ij.
• If C_ij is defined as the number of switches on the routing path from slot i to slot j:
• (1) is the sum of the traffic rates perceived by all switches.
• Normalized by the total VM-to-VM bandwidth demand, (1) equals the average number of switches that a data unit traverses.
• Assuming equal delay at every switch, (1) can be interpreted as the average latency a data unit experiences while traversing the network.
Optimizing the TVMPP is then equivalent to minimizing traffic latency …
Traffic-aware VM Placement Problem (TVMPP)
What if the number of slots exceeds the number of VMs?
• Add dummy VMs with no traffic
• They do not affect the placement of the real VMs
The TVMPP can be simplified by ignoring the external-traffic term Σ_i e_i g_{π(i)}, which is relatively constant:
• The cost between every host and the gateway is assumed to be the same (why is that justified?)
Traffic-aware VM Placement Problem (TVMPP)
• Offline Mode
• Data center operators estimate traffic matrix
• Collect network topology
• Solve the TVMPP to decide which host(s) the VMs are created on
• Online Mode
• Re-solve TVMPP periodically
• Reshuffle the VM placement when needed
TVMPP Complexity
Matrix notation for (1), ignoring the gateway term:
• D: the traffic-rate matrix
• C: the communication-cost matrix
• Π: the set of permutation matrices
    min_{X ∈ Π} tr(D X C Xᵀ)    (assuming C is symmetric)
This is a Quadratic Assignment Problem (QAP), which is NP-hard:
There is a set of n facilities and a set of n locations. For each pair of locations,
a distance is specified, and for each pair of facilities a weight or flow is specified (e.g., the
amount of supplies transported between the two facilities). The problem is to assign all
facilities to different locations with the goal of minimizing the sum of the distances
multiplied by the corresponding flows.
TVMPP Complexity
• The TVMPP belongs to the general class of QAP problems.
• No existing exact solution scales to the size of
current data centers (see the brute-force sketch below for why).
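To see why: an exact solver must in general search all n! placements. A brute-force sketch (illustrative only, usable up to roughly 10 VMs):

```python
from itertools import permutations
import numpy as np

def tvmpp_exact(D, C):
    """Exhaustive TVMPP (gateway term dropped): O(n! * n^2) time."""
    n = len(D)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        p = np.array(perm)
        cost = (D * C[np.ix_(p, p)]).sum()
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm
```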
Algorithms: Cluster-and-Cut
Proposition 1 (the rearrangement inequality): Suppose 0 ≤ a_1 ≤ a_2 ≤ … ≤ a_n and 0 ≤ b_1 ≤ b_2 ≤ … ≤ b_n.
Then for any permutation π of [1, …, n]:
    Σ_i a_i b_{n−i+1}  ≤  Σ_i a_i b_{π(i)}  ≤  Σ_i a_i b_i
Design principle 1:
Solving the TVMPP is equivalent to finding a
mapping of VMs to slots such that VM pairs with heavy mutual traffic
are assigned to slot pairs with low-cost connections (checked numerically below).
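A quick numerical check of Proposition 1, which is what justifies pairing heavy-traffic VM pairs with cheap slot pairs (pure illustration):

```python
import random

n = 8
a = sorted(random.random() for _ in range(n))       # 0 <= a1 <= ... <= an
b = sorted(random.random() for _ in range(n))       # 0 <= b1 <= ... <= bn

lower = sum(x * y for x, y in zip(a, reversed(b)))  # opposite order: minimum
upper = sum(x * y for x, y in zip(a, b))            # same order: maximum
pi = random.sample(range(n), n)                     # an arbitrary permutation
middle = sum(a[i] * b[pi[i]] for i in range(n))

assert lower <= middle <= upper  # Proposition 1 holds
```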
Algorithms: Cluster-and-Cut
Design principle 2:
Divide and conquer
1. Partition the VMs into VM-clusters and the slots into slot-clusters
2. Map each VM-cluster to a slot-cluster by solving a smaller TVMPP
3. Map the VMs to slots within each matched VM-cluster/slot-cluster pair,
again via TVMPP
Algorithms: Cluster-and-Cut
Design principle 2, step 1a: cluster the VMs
• Classical minimum k-cut graph algorithm
• VM pairs with a high mutual traffic rate fall within the same VM-cluster
• Consistent with the earlier finding that traffic generated by a small
group of VMs comprises a large fraction of the total traffic
• Approximation ratio: (k−1)n/k
Algorithms: Cluster-and-Cut
Design principle 2, step 1b: cluster the slots
• Classical clustering algorithm
• Slot pairs with low-cost connections belong to the same slot-cluster
• Exploits the fact that networks contain many groups of densely connected end hosts
• Approximation ratio: 2
A skeleton of the overall divide-and-conquer procedure is sketched below.
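A skeleton of the divide-and-conquer flow under simplifying assumptions: the real algorithm clusters VMs with a minimum k-cut and slots by connection cost, while this sketch substitutes naive equal-size splits (and assumes k divides n) just to show how the two levels compose:

```python
from itertools import permutations
import numpy as np

def exact_map(D, C):
    """Tiny exhaustive TVMPP used for the small subproblems."""
    idx = lambda p: np.ix_(np.array(p), np.array(p))
    return min(permutations(range(len(D))),
               key=lambda p: (D * C[idx(p)]).sum())

def cluster_and_cut_sketch(D, C, k):
    n = len(D)
    size = n // k  # assumes k divides n
    # Step 1 (stand-in): naive equal-size clusters instead of min-k-cut
    # for VMs and cost-based clustering for slots.
    vm_clusters = [list(range(i * size, (i + 1) * size)) for i in range(k)]
    slot_clusters = [list(range(i * size, (i + 1) * size)) for i in range(k)]
    # Step 2: map VM-clusters to slot-clusters on aggregated matrices.
    Dc = np.array([[D[np.ix_(u, v)].sum() for v in vm_clusters] for u in vm_clusters])
    Cc = np.array([[C[np.ix_(s, t)].mean() for t in slot_clusters] for s in slot_clusters])
    outer = exact_map(Dc, Cc)
    # Step 3: solve the TVMPP inside each matched cluster pair.
    placement = {}
    for ci, vms in enumerate(vm_clusters):
        slots = slot_clusters[outer[ci]]
        inner = exact_map(D[np.ix_(vms, vms)], C[np.ix_(slots, slots)])
        for v_local, s_local in enumerate(inner):
            placement[vms[v_local]] = slots[s_local]
    return placement  # VM index -> slot index
```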
Algorithms: Cluster-and-Cut
(Figure: Cluster-and-Cut pseudocode; the annotated step complexities are O(nk) and O(n⁴).)
Impact of Network Architectures & Traffic Patterns
Performance gains are affected by:
1. Cost matrices (C):
i. Tree
ii. VL2
iii. Fat-Tree
iv. BCube
2. Traffic matrices (D):
i. Global traffic model
ii. Partitioned traffic model
Impact of Network Architectures & Traffic Patterns
Define the tree cost matrix C (n × n):
• c_0: the fan-out of the access switches
• c_1: the fan-out of the aggregation switches
(Figure: example entries of C, counting switches on the path: 0 for slots on the same host, 1 under the same access switch, 3 under the same aggregation switch, 5 across the core. A sketch that builds such a matrix follows.)
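A sketch that materializes the tree cost matrix under the slide's switch-count convention (the parameter names and the slots-per-host knob are my own; this is not the paper's code):

```python
import numpy as np

def tree_cost_matrix(num_slots, slots_per_host, c0, c1):
    """C[i, j] = number of switches on the path between slots i and j:
    0 same host, 1 same access switch, 3 same aggregation switch,
    5 across the core. c0 = hosts per access switch, c1 = access
    switches per aggregation switch."""
    host = np.arange(num_slots) // slots_per_host
    access = host // c0
    agg = access // c1
    C = np.full((num_slots, num_slots), 5)
    C[np.equal.outer(agg, agg)] = 3        # same aggregation switch
    C[np.equal.outer(access, access)] = 1  # same access switch (overrides)
    C[np.equal.outer(host, host)] = 0      # same host
    return C

C = tree_cost_matrix(num_slots=32, slots_per_host=2, c0=4, c1=2)
```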
Impact of Network Architectures & Traffic Patterns
Define the VL2 cost matrix C (n × n):
• c_0: the fan-out of the access switches
• c_1: the fan-out of the aggregation switches
(Figure: in VL2, slot pairs under different access switches have cost 5.)
Impact of Network Architectures & Traffic Patterns
Define the Fat-Tree cost matrix C (n × n):
• k: the total number of ports on each switch
(Figure: example entries: cost 3 for slot pairs within the same pod, 5 for pairs crossing the core.)
Impact of Network Architectures & Traffic Patterns
Define the BCube cost matrix C (n × n):
• Cost = the Hamming distance between the two server addresses (see the sketch below)
(Figure: examples of server pairs at Hamming distance 1 and distance 2.)
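A sketch of the BCube cost matrix as a plain Hamming-distance computation over all server labels (illustrative code, my own helper names):

```python
from itertools import product
import numpy as np

def bcube_cost_matrix(n, k):
    """C[i, j] = Hamming distance between server labels i and j,
    where labels are all tuples (a_k, ..., a_0) with digits in 0..n-1."""
    labels = list(product(range(n), repeat=k + 1))  # n^(k+1) servers
    size = len(labels)
    C = np.zeros((size, size), dtype=int)
    for i, a in enumerate(labels):
        for j, b in enumerate(labels):
            C[i, j] = sum(x != y for x, y in zip(a, b))
    return C

C = bcube_cost_matrix(4, 1)  # 16 x 16 matrix for the n = 4, k = 1 example
```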
Impact of Network Architectures & Traffic Patterns
Global traffic model:
• Each VM communicates with every other VM at a constant rate
• Complexity O(n³): solved with the Hungarian algorithm (sketch below)
• S_opt: the optimized objective value
• S_rand: the objective value under random placement
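For the linear-assignment step, SciPy ships a Hungarian-style solver; a sketch assuming the uniform-traffic case has already been linearized into a VM-to-slot cost matrix (the random matrix below is a stand-in for that input):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((1024, 1024))  # stand-in: cost[i, s] = price of VM i in slot s

vms, slots = linear_sum_assignment(cost)  # Hungarian-style, O(n^3)
S_opt = cost[vms, slots].sum()            # optimized objective value
```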
Impact of Network Architectures & Traffic Patterns
Global traffic model (results for 1024 VMs):
• At zero traffic variance, S_opt = S_rand
• E.g., map-reduce-type workloads
• Otherwise S_opt ≤ S_rand
• E.g., multi-tiered web applications
• Gaps indicate the improvement space over random placement
• BCube has the largest improvement space, a benefit in terms of scalability
• VL2 has the smallest improvement space
Impact of Network Architectures & Traffic Patterns
Partitioned traffic model:
• VMs form isolated partitions, and only VMs within the same partition
communicate with each other
• Pairwise traffic rates follow a normal distribution
• GLB: a lower bound on S_opt
• Gaps indicate the performance-improvement potential
• S_rand leaves improvement space at every traffic variance
• BCube has the larger improvement potential
Impact of Network Architectures & Traffic Patterns
• There is more performance-improvement potential in a system with varied partition sizes
• Smaller partition sizes have higher improvement potential
Impact of Network Architectures & Traffic Patterns
Greater benefits arise under these conditions:
• Increased traffic variance
• Increased number of partitions
• Multi-layer architectures
Evaluation
Comparing Cluster-and-Cut to two other QAP-solving algorithms:
• Local Optimal Pairwise Interchange (LOPI)
• Simulated Annealing (SA)
Results: Cluster-and-Cut's objective value is about 10% smaller, and its running time is about 50% less.
Summary
1. Used traffic-aware virtual machine placement to improve network
scalability.
2. Formulated VM placement as an NP-hard optimization problem
(TVMPP).
3. Proposed the Cluster-and-Cut algorithm as an efficient solution.
4. Evaluated the potential performance gains under different traffic patterns
and network architectures.
Thank you!