Improving the Scalability of Data Center Networks with Traffic-aware Virtual Machine Placement
Authors: Xiaoqiao Meng, Vasileios Pappas, and Li Zhang
Presented by: Jinpeng Liu

Introduction
• Modern virtualized data centers host a wide spectrum of applications.
• Bandwidth usage between VMs is growing rapidly.
• The scalability of data center networks is becoming a concern.
• Techniques suggested so far:
  • Rich connectivity at the edge of the network
  • Dynamic routing protocols
  • …

Introduction
• This paper tackles the issue from a different perspective.
• The current VM placement strategy has issues:
  • Placement is decided by capacity planning tools based on CPU, memory, and power consumption.
  • These tools ignore the consumption of network resources.
  • VM pairs with heavy mutual traffic can end up on hosts with a large network cost between them.
• How often does this pattern happen in practice?

Background: DC traffic patterns
Data collected:
• A data warehouse (DWH) hosted by IBM Global Services: server resource utilization from hundreds of server farms.
• A server cluster with hundreds of VMs: aggregate traffic of 68 VMs.
• Traces were collected for 10 days.

Background: DC traffic patterns
1. Uneven distribution of traffic volumes across VMs
• While 80% of VMs have an average rate of less than 800 KBytes/min, 4% of them have a rate 10x higher.

Background: DC traffic patterns
2. Stable per-VM traffic at large timescales (82%)

Background: DC traffic patterns
3. Weak correlation between traffic rate and latency

Background: Tree architecture
Cons:
• Topology scaling limitations
  • Scales only by scaling up each individual switch
  • The core tier accommodates only 8 switches
• Address space limitations
• Higher server over-subscription
(Cisco Data Center Infrastructure 2.5 Design Guide)

Background: VL2 architecture (complete bipartite graph)
• Shares many features with the tree architecture.
• The core tier and the aggregation tier form a Clos topology.
• Valiant load balancing: randomly selected core switches serve as intermediate destinations.
• Location-independent senders and receivers.

Background: PortLand architecture (Fat-Tree)
• Requires all switches to be identical (i.e., the same number of ports).
• Built around the concept of pods:
  • A collection of access and aggregation switches forms a Clos topology.
  • Pods and core switches form a second Clos topology.
  • Up-links are evenly distributed.
• Pros:
  • Full bisection bandwidth: 1:1 oversubscription ratio.
  • Low cost: commodity switches, low power and cooling.
• Cons:
  • Scalability: the size of the network is bounded by the number of ports per switch. With 48-port switches, at most 27,648 hosts (see the worked check below).
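As a quick check of that scalability figure, the maximum host count of a fat-tree built from identical k-port switches follows the standard k^3/4 formula. The snippet below is only a small illustrative calculation, not material from the paper:

    def fat_tree_max_hosts(k):
        # k pods, each holding (k/2)^2 hosts, give k^3 / 4 hosts in total.
        return k ** 3 // 4

    print(fat_tree_max_hosts(48))  # 27648, the figure quoted above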
Background: Modular Data Center (MDC)
• Thousands of servers interconnected by switches and packed into a shipping container.
• Offered by Sun, HP, IBM, Dell, …
• Higher degree of mobility.
• Higher system and power density.
• Lower cost (cooling and manufacturing).

Background: BCube architecture
Purpose:
• Data-intensive computing.
• Bandwidth-intensive communication among MDC servers.
• Low-end COTS mini-switches.
• Graceful performance degradation.
BCube:
• Server-centric: servers are part of the network.
• Servers are not directly connected to each other.
• Defined recursively.

Background: BCube construction
• At level 0, a BCube_0 consists of n servers connected by one n-port switch.
• A BCube_k is constructed from n BCube_{k-1}'s and n^k additional n-port switches.
• Example (k = 1, n = 4): a BCube_0 is 4 servers on one 4-port switch; a BCube_1 connects 4 BCube_0's with 4 more 4-port switches.

Background: BCube server labels
• Servers are labeled by their locations in the BCube structure.
• Two servers are connected through a level-i switch if their labels differ only in the level-i digit.
• Example from the slides: server 2.4 sits at the 2nd position of level 1 and the 4th position of level 0. Server 1.3 differs from 2.4 in both digits, so they are not neighbors; server 1.4 differs from 2.4 only in the level-1 digit, so 2.4 and 1.4 are connected through a level-1 switch.

Virtual Machine Placement Problem
How should VMs be placed on a set of physical hosts?
Assumptions:
• Existing CPU/memory-based capacity tools have already decided how many VMs a host can accommodate.
• A slot refers to one CPU/memory allocation on a host; a host can have multiple slots, and a VM can take any unoccupied slot.
• Static, single-path routing.
• All external traffic is routed through a common gateway switch.
Scenario: place n VMs into n slots.

Traffic-aware VM Placement Problem (TVMPP)
• C_ij: fixed communication cost from slot i to slot j.
• D_ij: traffic rate from VM i to VM j.
• e_i: external traffic rate of VM i.
• g_j: communication cost between slot j and the gateway.
• π: {1, …, n} → {1, …, n}: permutation assigning the n VMs to the n slots.
The TVMPP is defined as finding a π that minimizes

  Σ_{i,j} D_ij · C_{π(i)π(j)} + Σ_i e_i · g_{π(i)}        (1)

Traffic-aware VM Placement Problem (TVMPP)
The meaning of objective (1) depends on the definition of C_ij:
• If C_ij is the number of switches on the routing path from slot i to slot j, then (1) is the total traffic rate perceived by all switches.
• If (1) is normalized by the total VM-to-VM bandwidth demand, it equals the average number of switches a data unit traverses.
• Assuming equal delay at every switch, (1) can be read as the average latency a data unit experiences while traversing the network.
Optimizing TVMPP is therefore equivalent to minimizing traffic latency.

Traffic-aware VM Placement Problem (TVMPP)
What if the number of slots exceeds the number of VMs?
• Add dummy VMs that generate no traffic; they do not affect the placement of the real VMs.
TVMPP can be simplified by dropping the external-traffic term Σ_i e_i · g_{π(i)}, which is relatively constant:
• The cost between every host and the gateway is assumed to be the same. (WHY??)
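To make objective (1) concrete, here is a minimal Python sketch (illustrative only, not the paper's code) that evaluates it for a given placement. D, C, e, and g follow the definitions above, pi[v] is the slot assigned to VM v, and all the numbers in the tiny example are made up:

    import numpy as np

    def tvmpp_objective(D, C, e, g, pi):
        # Internal term: traffic between every VM pair, weighted by the
        # cost between the slots they are placed on.
        n = len(pi)
        internal = sum(D[i, j] * C[pi[i], pi[j]] for i in range(n) for j in range(n))
        # External term: each VM's external rate times its slot-to-gateway cost.
        external = sum(e[i] * g[pi[i]] for i in range(n))
        return internal + external

    D = np.array([[0, 5, 1], [5, 0, 0], [1, 0, 0]], dtype=float)  # VM-to-VM rates
    C = np.array([[0, 1, 3], [1, 0, 3], [3, 3, 0]], dtype=float)  # slot-to-slot costs
    e = np.array([2.0, 0.0, 1.0])                                 # external rates
    g = np.array([1.0, 1.0, 1.0])   # equal gateway cost, so the external term is constant
    print(tvmpp_objective(D, C, e, g, [0, 1, 2]))   # heavy pair (0, 1) on nearby slots
    print(tvmpp_objective(D, C, e, g, [0, 2, 1]))   # heavy pair split across costly slots

The second placement scores worse because the heavy-traffic VM pair lands on a high-cost slot pair, which is exactly the pattern the placement problem tries to avoid.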
Traffic-aware VM Placement Problem (TVMPP)
• Offline mode:
  • Data center operators estimate the traffic matrix and collect the network topology.
  • Solve TVMPP to decide on which hosts to create the VMs.
• Online mode:
  • Re-solve TVMPP periodically.
  • Reshuffle the VM placement when needed.

TVMPP Complexity
Matrix notation for (1):
• D is the traffic rate matrix.
• C is the communication cost matrix.
• Π is the set of permutation matrices.
TVMPP is a Quadratic Assignment Problem (QAP), which is NP-hard:
"There are a set of n facilities and a set of n locations. For each pair of locations, a distance is specified, and for each pair of facilities a weight or flow is specified (e.g., the amount of supplies transported between the two facilities). The problem is to assign all facilities to different locations with the goal of minimizing the sum of the distances multiplied by the corresponding flows."

TVMPP Complexity
• The TVMPP problem belongs to the general QAP family.
• No existing exact solution can scale to the size of current data centers.

Algorithms: Cluster-and-Cut
Proposition 1: Suppose 0 ≤ a_1 ≤ a_2 ≤ … ≤ a_n and 0 ≤ b_1 ≤ b_2 ≤ … ≤ b_n. Then for any permutation π: {1, …, n} → {1, …, n},

  Σ_i a_i · b_{n−i+1} ≤ Σ_i a_i · b_{π(i)} ≤ Σ_i a_i · b_i

Design principle 1: Solving TVMPP is equivalent to finding a mapping of VMs to slots such that VM pairs with heavy mutual traffic are assigned to slot pairs with low-cost connections.

Algorithms: Cluster-and-Cut
Design principle 2: divide and conquer (an illustrative sketch in code follows these slides).
1. Partition the VMs into VM-clusters and the slots into slot-clusters.
2. Map each VM-cluster to a slot-cluster by solving a cluster-level TVMPP.
3. Within each matched VM-cluster/slot-cluster pair, map individual VMs to slots by solving a smaller TVMPP.

Clustering VMs:
• Uses a classical min-cut graph algorithm.
• VM pairs with high mutual traffic rates fall into the same VM-cluster.
• Consistent with the earlier finding that traffic generated by a small group of VMs comprises a large fraction of the total traffic.
• Approximation ratio: 2 − 2/k.

Clustering slots:
• Uses a classical clustering algorithm.
• Slot pairs with low-cost connections belong to the same slot-cluster.
• Networks contain many groups of densely connected end hosts.
• Approximation ratio: 2.

Algorithms: Cluster-and-Cut
• Complexity: slot clustering runs in O(nk) and VM clustering in O(n^4), so Cluster-and-Cut is O(n^4) overall.
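The two-phase flow of Cluster-and-Cut can be sketched in Python. This is only an illustrative skeleton of the divide-and-conquer idea above: the clustering here is a naive stand-in (the paper uses min-cut based VM clustering and cost-based slot clustering), and the sub-problems are solved by brute force rather than the paper's procedures, so it only works for tiny instances.

    import itertools
    import numpy as np

    def solve_small(D, C):
        # Brute-force TVMPP on a small instance; returns the best permutation
        # pi with pi[v] = slot index. Stand-in for the paper's sub-solvers.
        n = len(D)
        best, best_pi = float("inf"), None
        for pi in itertools.permutations(range(n)):
            cost = sum(D[i, j] * C[pi[i], pi[j]] for i in range(n) for j in range(n))
            if cost < best:
                best, best_pi = cost, pi
        return list(best_pi)

    def cluster_and_cut_sketch(D, C, k):
        n = len(D)
        size = n // k  # assume k divides n and clusters are equal-sized
        # Naive stand-in clustering: contiguous index blocks.
        vm_clusters = [list(range(i, i + size)) for i in range(0, n, size)]
        slot_clusters = [list(range(i, i + size)) for i in range(0, n, size)]
        # Phase 1: cluster-level traffic and cost, then a k x k assignment.
        Dc = np.array([[D[np.ix_(a, b)].sum() for b in vm_clusters] for a in vm_clusters])
        Cc = np.array([[C[np.ix_(a, b)].mean() for b in slot_clusters] for a in slot_clusters])
        cluster_map = solve_small(Dc, Cc)
        # Phase 2: place VMs inside each matched (VM-cluster, slot-cluster) pair.
        placement = [None] * n
        for vc, sc in enumerate(cluster_map):
            vms, slots = vm_clusters[vc], slot_clusters[sc]
            sub = solve_small(D[np.ix_(vms, vms)], C[np.ix_(slots, slots)])
            for local_vm, local_slot in enumerate(sub):
                placement[vms[local_vm]] = slots[local_slot]
        return placement

    # Tiny demo: 4 VMs / 4 slots split into k = 2 clusters (made-up numbers).
    D = np.array([[0, 8, 0, 0], [8, 0, 0, 0], [0, 0, 0, 8], [0, 0, 8, 0]], dtype=float)
    C = np.array([[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]], dtype=float)
    print(cluster_and_cut_sketch(D, C, k=2))

The point of the two phases is that each sub-problem sees only k clusters or n/k items, so the expensive placement search never runs on the full n x n instance.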
Impact of Network Architectures & Traffic Patterns
Performance gains are affected by:
1. The cost matrix C, which depends on the architecture:
  i. Tree
  ii. VL2
  iii. Fat-Tree
  iv. BCube
2. The traffic matrix D:
  i. Global traffic model
  ii. Partitioned traffic model

Impact of Network Architectures & Traffic Patterns
Tree cost matrix C (n x n):
• n0: the fan-out of the access switches.
• n1: the fan-out of the aggregation switches.
• C_ij is the number of switches on the path between slots i and j; the slides fill the matrix entry by entry with the values 0, 1, 3, and 5, depending on whether the two slots share a host, an access switch, an aggregation switch, or only the core.

Impact of Network Architectures & Traffic Patterns
VL2 cost matrix C (n x n):
• Same fan-out parameters n0 and n1.
• The slide shows an example entry of 5 switches for a path routed through the core.

Impact of Network Architectures & Traffic Patterns
Fat-Tree cost matrix C (n x n):
• k: the total number of ports on each switch.
• Example entries shown: 3 switches for a path within a pod and 5 switches for a path across pods.

Impact of Network Architectures & Traffic Patterns
BCube cost matrix C (n x n):
• The cost is based on the Hamming distance between server addresses; the slides step through examples with distance 1 and distance 2.

Impact of Network Architectures & Traffic Patterns
Global traffic model:
• Each VM communicates with every other VM at a constant rate.
• Complexity: O(n^3), solved by the Hungarian algorithm.
• T_opt: the optimized objective value; T_rand: the objective value under random placement.

Impact of Network Architectures & Traffic Patterns
Global traffic model, results (1024 VMs):
• At zero rate variance, T_opt = T_rand (e.g., map-reduce type workloads).
• Otherwise, T_opt ≤ T_rand (e.g., multi-tiered web applications).
• The gaps show how much room there is to improve on random placement.
• BCube has the largest improvement space, a benefit in terms of scalability.
• VL2 has the smallest improvement space.
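One way to read the "constant rate" model above is that each VM i sends to every other VM at its own rate d_i; under that reading the internal term of (1) collapses to Σ_i d_i · R_{π(i)}, where R_s = Σ_t C_st is the aggregate cost of slot s, which is a linear assignment problem and consistent with the O(n^3) Hungarian-algorithm note above. A minimal sketch under that assumption (not the paper's code; the helper name and data are invented for illustration):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def place_global_model(d, C):
        # d[i]: per-VM constant sending rate; C: slot-to-slot cost matrix.
        # Returns the slot assigned to each VM.
        R = C.sum(axis=1)                   # aggregate cost of each slot to all others
        cost = np.outer(d, R)               # cost[i, s] = d[i] * R[s]
        vm_idx, slot_idx = linear_sum_assignment(cost)  # Hungarian-style solver
        return slot_idx                     # slot_idx[i] is the slot for VM i

    rng = np.random.default_rng(0)
    d = rng.normal(10.0, 3.0, size=8).clip(min=0)    # per-VM rates with some variance
    C = rng.integers(1, 6, size=(8, 8))
    C = (C + C.T) // 2
    np.fill_diagonal(C, 0)
    print(place_global_model(d, C))  # chattier VMs land on low-aggregate-cost slots

With zero variance all d_i are equal, every assignment has the same cost, and T_opt = T_rand, matching the observation on the slide.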
Impact of Network Architectures & Traffic Patterns
Partitioned traffic model:
• VMs form isolated partitions, and only VMs within the same partition communicate with each other.
• Pairwise traffic rates follow a normal distribution.
• GLB: a lower bound on T_opt.
• The gaps indicate the performance improvement potential.
• T_rand leaves improvement space across different traffic variances.
• BCube has the larger improvement potential.

Impact of Network Architectures & Traffic Patterns
• There is more performance improvement potential in a system with varied partition sizes.
• Smaller partition sizes have higher improvement potential.

Impact of Network Architectures & Traffic Patterns
Greater benefits are seen under these conditions:
• Increased traffic variance.
• Increased number of partitions.
• Multi-layer architectures.

Evaluation
Compared Cluster-and-Cut against other QAP-solving algorithms:
• Local Optimal Pairwise Interchange (LOPI)
• Simulated Annealing (SA)
Results:
• Objective value: about 10% smaller than LOPI and SA.
• Running time: about 50% less.

Summary
1. Used traffic-aware virtual machine placement to improve network scalability.
2. Formulated VM placement as an NP-hard optimization problem.
3. Proposed the Cluster-and-Cut algorithm as an efficient solution.
4. Evaluated the potential performance gains under different traffic patterns and network architectures.

Thank you!