Data Center Routing – Traffic Engineering
Yao Lu
Rui Zhang
ECE 260C VLSI Advanced Topics
Outline
• What is routing / traditional routing algorithms
• What is a data center
• Differences between the data center and the Internet
• Some recent work in data center TE
• Open questions/proposals
What is routing
Traditional routing algorithms
• RIP (Routing Information Protocol)
• IGRP (Interior Gateway Routing Protocol)
• EIGRP (Enhanced Interior Gateway Routing Protocol)
• OSPF (Open Shortest Path First)
• IS-IS (Intermediate System-to-Intermediate System)
• BGP (Border Gateway Protocol)
What is a data center
• By one 2013 estimate, about 40% of total Internet traffic goes to Google [1]
• Design Goal
– latency, reliability, throughput, energy, etc.
• Properties
– Well-structured topology
– Movability of the locations of sources and destinations
– Global knowledge of the whole data center network
Recent work
• Equal-Cost Multi-Path (ECMP)[7]
• Valiant Load Balancing (VLB)[6]
• CamCube[5]
• Hedera[8]
• Joint VM Placement and Routing (JVMPR)[4]
ECMP
• Many equal cost paths going up to the core switches
• Only one path down from each core switch
• Paths are allocated to flows pseudo-randomly, using a hash of the flow
(Figure: equal-cost paths between source S and destination D)
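The hash-based path choice above can be sketched as follows. This is a minimal illustration, not a switch implementation: the helper name `ecmp_pick_path`, the use of SHA-256, and the example 5-tuple are assumptions made for the example.

```python
import hashlib

def ecmp_pick_path(flow, paths):
    """Pick one equal-cost path from a hash of the flow's header fields."""
    # All packets of one flow hash to the same value, so they follow one path.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]

# Hypothetical example: four core switches and one flow 5-tuple
# (src IP, dst IP, protocol, src port, dst port).
core_paths = ["core0", "core1", "core2", "core3"]
flow = ("10.0.1.2", "10.2.0.3", 6, 40000, 80)
chosen = ecmp_pick_path(flow, core_paths)
```

Because the choice depends only on the flow's hash and not on its size, two large flows can land on the same path, which is the congestion risk noted later for elephant flows.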
VLB
• Goal
– Guarantee equal-spread load balancing in a mesh network
• Method
– Bounce individual packets from a source switch in the mesh off randomly chosen intermediate “core” switches, which then forward those packets to their destination switch
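A minimal sketch of the VLB method above, assuming a hypothetical helper `vlb_route` and made-up switch names. Each packet gets an independently chosen intermediate switch.

```python
import random
from collections import Counter

def vlb_route(src, dst, core_switches, rng=random):
    """Return a two-hop route via a uniformly random intermediate switch."""
    intermediate = rng.choice(core_switches)
    return [src, intermediate, dst]

# Hypothetical example: send 300 packets and count how often each core is used.
cores = ["c0", "c1", "c2"]
rng = random.Random(0)
usage = Counter(vlb_route("s1", "s2", cores, rng)[1] for _ in range(300))
```

Over many packets the counts in `usage` should be roughly equal, which is the equal-spread property the slide refers to.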
CamCube
• 3D torus topology
• Offers the CamCube API
– Lets each service/application design its own routing protocol
• Core services
– Basic routing algorithm
• Link-state-based protocol
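In a 3D torus each server links directly to six neighbors, one in each direction along each axis, with wrap-around at the edges. A minimal sketch, assuming a cubic torus of side `size` (at least 3) and a hypothetical helper name `torus_neighbors`:

```python
def torus_neighbors(node, size):
    """Return the six neighbors of (x, y, z) in a size x size x size torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo implements the wrap-around links of the torus.
            coord[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(coord))
    return neighbors

corner = torus_neighbors((0, 0, 0), 3)
```

A service-specific routing protocol built on such a topology would choose among these six next hops at every server.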
Hedera
Detect Large Flows → Estimate Flow Demands → Place Flows
• Detect Large Flows
– Flows that need bandwidth but are network-limited
• Estimate Flow Demands
– Use max-min fairness to allocate flows between src-dst pairs
• Place Flows
– Use estimated demands to heuristically find better placement of large flows on the ECMP paths
Hedera
• Large Flow Detection
– Scheduler continually polls edge switches for flow byte counts
– Flows exceeding a B/s threshold are “large”
• > 10% of the host’s link capacity (i.e., > 100 Mbps on a 1 Gbps link)
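The threshold test above can be sketched as follows. The function name, the dictionary-based byte counters, and the polling interval are assumptions for illustration; only the 10% threshold comes from the slide.

```python
def detect_large_flows(prev_bytes, curr_bytes, interval_s,
                       link_bps=1e9, fraction=0.10):
    """Return flows whose rate since the last poll exceeds the threshold."""
    threshold_bps = fraction * link_bps
    large = []
    for flow, count in curr_bytes.items():
        delta = count - prev_bytes.get(flow, 0)
        rate_bps = 8 * delta / interval_s  # bytes -> bits per second
        if rate_bps > threshold_bps:
            large.append(flow)
    return large

# Hypothetical byte counters from two polls taken one second apart.
prev = {"f1": 0, "f2": 0}
curr = {"f1": 50_000_000, "f2": 1_000_000}
large = detect_large_flows(prev, curr, interval_s=1.0)
```

Here `f1` runs at 400 Mbps and `f2` at 8 Mbps, so only `f1` crosses the 100 Mbps threshold.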
Hedera
• Demand Estimation
– Goal
• Estimate available bandwidth to allocate
– Method
• Using max-min fairness, given the traffic matrix of large flows, modify each flow’s size at its source and destination iteratively…
– Senders distribute their bandwidth equally among outgoing flows that are not receiver-limited
– Network-limited receivers reduce the exceeded capacity equally across incoming flows
– Repeat until all flows converge
Hedera
Senders: A, B, C   Receivers: X, Y

Flow     Estimate   Conv.?
A → X
A → Y
B → Y
C → Y

Senders
Sender   Available Unconv. BW   Flows   Share
A        1                      2       1/2
B        1                      1       1
C        1                      1       1
Hedera
Senders: A, B, C   Receivers: X, Y

Flow     Estimate   Conv.?
A → X    1/2
A → Y    1/2
B → Y    1
C → Y    1

Receivers
Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      Yes   3              1/3
Hedera
Senders: A, B, C   Receivers: X, Y

Flow     Estimate   Conv.?
A → X    1/2
A → Y    1/3        Yes
B → Y    1/3        Yes
C → Y    1/3        Yes

Senders
Sender   Available Unconv. BW   Flows   Share
A        2/3                    1       2/3
B        0                      0       0
C        0                      0       0
Hedera
Senders: A, B, C   Receivers: X, Y

Flow     Estimate   Conv.?
A → X    2/3        Yes
A → Y    1/3        Yes
B → Y    1/3        Yes
C → Y    1/3        Yes

Receivers
Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      No    -              -
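The four-step walk-through above can be reproduced with a short script. This is a sketch of the sender/receiver iteration as described on the Demand Estimation slide, not the authors' code; the function name and data layout are assumptions, and link capacity is normalized to 1.

```python
from fractions import Fraction

def estimate_demands(flows):
    """flows: list of (src, dst). Returns {flow: share of one link's capacity}."""
    demand = {f: Fraction(0) for f in flows}
    converged = {f: False for f in flows}
    senders = sorted({s for s, _ in flows})
    receivers = sorted({d for _, d in flows})
    changed = True
    while changed:
        before = dict(demand)
        # Sender step: split the unconverged bandwidth equally among the
        # sender's flows that have not been fixed by a receiver.
        for s in senders:
            out = [f for f in flows if f[0] == s]
            free = [f for f in out if not converged[f]]
            if free:
                used = sum((demand[f] for f in out if converged[f]), Fraction(0))
                share = (1 - used) / len(free)
                for f in free:
                    demand[f] = share
        # Receiver step: an oversubscribed receiver caps its large incoming
        # flows at an equal share; flows already below the share keep theirs.
        for d in receivers:
            incoming = [f for f in flows if f[1] == d]
            if sum(demand[f] for f in incoming) <= 1:
                continue
            limited = list(incoming)
            small_total = Fraction(0)
            share = Fraction(1, len(limited))
            while True:
                small = [f for f in limited if demand[f] < share]
                if not small:
                    break
                small_total += sum((demand[f] for f in small), Fraction(0))
                limited = [f for f in limited if demand[f] >= share]
                share = (1 - small_total) / len(limited)
            for f in limited:
                demand[f] = share
                converged[f] = True
        changed = demand != before
    return demand

example = [("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]
result = estimate_demands(example)
```

For the example flows, `result` matches the final table: A → X gets 2/3 and each flow into Y gets 1/3.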
Hedera
• Flow Placement
– Goal
• Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized
– Method
• Global First Fit:
– Greedily choose path that has sufficient unreserved b/w
• Simulated Annealing:
– Iteratively find a globally better mapping of paths to flows
Hedera
• Global First Fit
– When a new flow is detected, linearly search all possible paths from S to D
– Place flow on first path whose component links can fit that flow
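A minimal sketch of the first-fit search above, assuming links are identified by (node, node) tuples, capacity is normalized to 1.0, and reservations are kept in a plain dictionary. These names are illustrative only.

```python
def global_first_fit(demand, candidate_paths, reserved, capacity=1.0):
    """Reserve the first path whose every link can still fit the flow."""
    for path in candidate_paths:
        if all(reserved.get(link, 0.0) + demand <= capacity for link in path):
            for link in path:
                reserved[link] = reserved.get(link, 0.0) + demand
            return path
    return None  # no path has enough unreserved bandwidth

# Hypothetical example: two S-to-D paths, the first already 80% reserved.
paths = [[("S", "c0"), ("c0", "D")], [("S", "c1"), ("c1", "D")]]
reserved = {("S", "c0"): 0.8}
first = global_first_fit(0.5, paths, reserved)
second = global_first_fit(0.3, paths, reserved)
third = global_first_fit(0.5, paths, reserved)
```

The search is greedy: it never moves flows that were placed earlier, which is why it is cheap but not guaranteed to find the best overall placement.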
Hedera
• Simulated Annealing
– 4 specifications
• State space
• Neighboring states
• Energy
• Temperature
• Simple example: minimizing f(x)
Hedera
• State: All possible mappings of flows to paths
– Constrained to reduce state space size
– Flows to a destination constrained to use same core
• Neighbor State: Swap paths between 2 hosts
– Within same pod
• Function/Energy: Total exceeded b/w capacity
– Using the estimated demand of flows
– Minimize the exceeded capacity
• Temperature: Iterations left
– Fixed number of iterations (1000s)
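The four specifications above can be put together in a toy sketch. It assumes a simplified topology in which every host connects directly to every core, a state that maps each destination to one core, and a standard Metropolis acceptance rule; none of these details are taken from the Hedera implementation.

```python
import math
import random

def energy(assignment, flows, capacity=1.0):
    """Total exceeded capacity over all links for a dst -> core assignment."""
    load = {}
    for src, dst, demand in flows:
        core = assignment[dst]
        for link in ((src, core), (core, dst)):
            load[link] = load.get(link, 0.0) + demand
    return sum(max(0.0, total - capacity) for total in load.values())

def anneal(flows, initial, iterations=1000, seed=0):
    rng = random.Random(seed)
    dsts = sorted(initial)
    current, current_e = dict(initial), energy(initial, flows)
    best, best_e = dict(current), current_e
    # Temperature is the number of iterations left.
    for temperature in range(iterations, 0, -1):
        # Neighbor state: swap the cores assigned to two destinations.
        a, b = rng.sample(dsts, 2)
        candidate = dict(current)
        candidate[a], candidate[b] = current[b], current[a]
        candidate_e = energy(candidate, flows)
        t = temperature / iterations
        # Always accept improvements; accept worse states with a probability
        # that shrinks as the temperature drops (assumed Metropolis rule).
        if (candidate_e <= current_e
                or rng.random() < math.exp((current_e - candidate_e) / t)):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = dict(current), current_e
    return best, best_e

# Hypothetical input: each sender has two 0.6-demand flows.
flows = [("h1", "d1", 0.6), ("h1", "d2", 0.6),
         ("h2", "d3", 0.6), ("h2", "d4", 0.6)]
start = {"d1": "c0", "d2": "c0", "d3": "c1", "d4": "c1"}
best, best_e = anneal(flows, start)
```

In `start`, both flows of each sender share one uplink and exceed its capacity. A zero-energy state exists in which each sender's two flows use different cores, and the swap move can reach it.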
Hedera
JVMPR
• Joint VM Placement and Routing
• Goal: Efficient traffic engineering under dynamic arrivals and departures of jobs
– One method: localize traffic through flexible VM placement (node utilization)
– Another method: avoid congestion through intelligent routing (link utilization)
– Node utilization and link utilization are coupled with each other
JVMPR
(Figure legend: existing VM; VM we need to add)
Figure 1: Left: the existing VMs and traffic. Middle: a good VM placement with high congestion. Right: a worse placement with lower congestion.
JVMPR
• JVMPR considers placement and routing at the same time
• It develops an approximation algorithm that leverages the specific structure of the joint design problem
JVMPR
• Placement and Route Selection
– Placement : The feasible decision space for VM placement is
– Routing : The feasible decision space for routing is
JVMPR
• Optimize Resource Utilization
– cost_net: network cost
• Measures the congestion
– cost_node: node cost
• Operating cost induced by a switch or a machine
– Goal: Minimize the total cost
JVMPR
• Any problem?
• Yes!
– The number of jobs is not fixed
– Jobs enter or depart the system dynamically
• Better way: Online solution
– Extend the static problem setting to a dynamic environment
– Key idea: Perform local re-optimization
JVMPR
• Online solution algorithm
– Upon a new job arrival, assign the new job to one configuration according to the transition probability
– Upon a job departure, pick one job and migrate it to new machines according to the transition probability
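The slides do not give the form of the transition probability. The sketch below assumes, purely for illustration, that a configuration is chosen with probability proportional to exp(-beta × cost), a common choice in Markov-chain-based approximation; `beta`, the function names, and the cost values are hypothetical.

```python
import math
import random

def transition_probabilities(costs, beta=1.0):
    """Assumed rule: probability proportional to exp(-beta * cost)."""
    low = min(costs)  # subtracting the minimum keeps exp() well-behaved
    weights = [math.exp(-beta * (c - low)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def choose_configuration(configs, cost_of, beta=1.0, rng=random):
    """Sample one candidate configuration (placement plus routes) for a job."""
    probs = transition_probabilities([cost_of(c) for c in configs], beta)
    return rng.choices(configs, weights=probs, k=1)[0]

# Hypothetical example: two candidate configurations for an arriving job.
costs = {"rack-local": 1.0, "cross-pod": 3.0}
probs = transition_probabilities(list(costs.values()))
picked = choose_configuration(list(costs), costs.get, rng=random.Random(1))
```

On a job departure, the same sampling step could be applied to one remaining job to decide where to migrate it. Cheaper configurations are favored, but costlier ones keep a nonzero probability.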
JVMPR
• Why is the dynamic JVMPR solution appealing?
– It requires no VM migrations when new jobs arrive and at most one job migration when a job departs
– The computation of migration probability only requires local information
JVMPR
Fig. Performance comparison (x-axis: percentage of elephant flows)
JVMPR
• What is the price we pay for it?
– The approximated Markov chain no longer converges to the exact stationary distribution, but to a neighborhood around it
– Requires a lot of computation
Summary
Algorithm Topology Movability Global knowledge Other idea
ECMP
VLB
CamCube
Hedera
JVMPR
Y
Y
Y
Y
Y Y
Y
Y
Summary
Algorithm: Pros / Cons
• ECMP
– Pros: 1. Simple 2. Works well with mice flows
– Cons: 1. Might cause congestion with elephant flows
• VLB
– Pros: 1. Simple 2. Works well with mice flows
– Cons: 1. Might cause congestion with elephant flows
• CamCube
– Pros: 1. Flexible
– Cons: 1. Optimization is per service/application; no global optimization is considered
• Hedera
– Pros: 1. Can deal with both mice flows and elephant flows
– Cons: 1. Algorithm cannot guarantee a global optimum 2. Assumptions made during demand estimation may not hold
• JVMPR
– Pros: 1. Cost is low 2. Computation only needs local information
– Cons: 1. Needs a lot of computation 2. It is a kind of approximation
Open questions/proposals
• Imperfection of current algorithms
– Hedera
• Large flow detection is too simple
• Demand estimation considers only TCP flows
– JVMPR
• Demands a lot of computation
• It is an approximation
• Current algorithms do not take full advantage of the data center's properties
– Combine topology, movability and VM placement
• Add VM placement consideration into Hedera
References
[1] http://www.forbes.com/sites/timworstall/2013/08/17/fascinating-number-google-is-now-40-of-theinternet/
[2] Moy, John T. OSPF: anatomy of an Internet routing protocol. Addison-Wesley Professional, 1998.
[3] Chen, Kai, Chengchen Hu, Xin Zhang, Kai Zheng, Yan Chen, and Athanasios V. Vasilakos. "Survey on routing in data centers: insights and future directions." Network, IEEE 25, no. 4 (2011): 6-10.
[4] Jiang, Joe Wenjie, Tian Lan, Sangtae Ha, Minghua Chen, and Mung Chiang. "Joint VM placement and routing for data center traffic engineering." In INFOCOM, 2012 Proceedings IEEE, pp. 2876-2880. IEEE, 2012.
[5] Abu-Libdeh, Hussam, Paolo Costa, Antony Rowstron, Greg O'Shea, and Austin Donnelly. "Symbiotic routing in future data centers." ACM SIGCOMM Computer Communication Review 41, no. 4 (2011): 51-62.
[6] Farrington, Nathan, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat. "Helios: a hybrid electrical/optical switch architecture for modular data centers." ACM SIGCOMM Computer Communication Review 41, no. 4 (2011): 339-350.
[7] Hopps, Christian E. "Analysis of an equal-cost multi-path algorithm." RFC 2992 (2000).
[8] Al-Fares, Mohammad, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. "Hedera: Dynamic Flow Scheduling for Data Center Networks." In NSDI, vol. 10, pp. 19-19. 2010.