Data Center Routing – Traffic Engineering


Yao Lu

Rui Zhang

ECE 260C VLSI Advanced Topics

Outline

• What is routing / traditional routing algorithms

• What is a data center

• Difference between data center and the Internet

• Some recent work in data center TE

• Open questions/proposals

What is routing

Traditional routing algorithm

RIP (Routing Information Protocol)

IGRP (Interior Gateway Routing Protocol)

EIGRP (Enhanced Interior Gateway Routing Protocol)

OSPF (Open Shortest Path First)

IS-IS (Intermediate System-to-Intermediate System)

BGP (Border Gateway Protocol)

What is a data center

• Nowadays, 40% of the total Internet traffic goes to Google[1]

Difference between data center and the Internet

• Design Goal

– latency, reliability, throughput, energy, etc.

• Properties

– Well-structured topology

– Movability of the locations of sources and destinations

– Global knowledge of the whole data center network

Recent work

• Equal-Cost Multi-Path (ECMP)[7]

• Valiant Load Balancing (VLB)[6]

• CamCube[5]

• Hedera[8]

• Joint VM Placement and Routing (JVMPR)[4]

ECMP

• Many equal cost paths going up to the core switches

• Only one path down from each core switch

• Randomly allocate paths to flows using hash of the flow
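The hash-based path choice above can be sketched as follows. This is a minimal illustration, not how any particular switch implements it: the 5-tuple layout, the path names, and the use of SHA-256 (real hardware uses much cheaper hash functions) are all assumptions made for the example.

```python
import hashlib

def ecmp_pick_path(flow, paths):
    """Pick one of the equal-cost paths by hashing the flow identifier.

    flow  : hypothetical 5-tuple (src_ip, dst_ip, protocol, src_port, dst_port)
    paths : list of equal-cost next hops (names are made up for the example)
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()          # stand-in for a hardware hash
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]

flow = ("10.0.1.2", "10.2.0.3", 6, 51512, 80)
paths = ["core-0", "core-1", "core-2", "core-3"]
chosen = ecmp_pick_path(flow, paths)
```

Because the choice depends only on the flow identifier, every packet of a flow follows the same path, and two large flows that happen to hash to the same path will collide regardless of load.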


VLB

• Goal

– Guarantee equal-spread load balancing in a mesh network

• Method

– Bounce individual packets from a source switch in the mesh off randomly chosen intermediate “core” switches, which then forward those packets to their destination switch.
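A minimal sketch of the VLB method, assuming a two-hop route through one uniformly random core switch. The function name, switch names, and the `rng` parameter are hypothetical and only serve the illustration.

```python
import random

def vlb_route(src_switch, dst_switch, core_switches, rng=random):
    """Return a route that bounces off a uniformly random intermediate core switch."""
    intermediate = rng.choice(core_switches)   # chosen independently of current load
    return [src_switch, intermediate, dst_switch]

cores = ["core-0", "core-1", "core-2", "core-3"]
route = vlb_route("edge-a", "edge-b", cores, random.Random(0))
```

Since the intermediate switch is picked without looking at the traffic matrix, load spreads evenly on average over many packets, at the cost of a longer path.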

CamCube

• 3D torus topology

• Offers the CamCube API

– Lets each service/application design its own routing protocol

• Core services

– Basic routing algorithm

• link state-based protocol
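To make the 3D torus topology concrete, the sketch below computes the six one-hop neighbors of a server addressed by an (x, y, z) coordinate. The function name and the assumption of a cube with equal side length on every axis are illustrative choices, not taken from the CamCube API.

```python
def torus_neighbors(node, size):
    """Return the six one-hop neighbors of (x, y, z) in a size x size x size 3D torus."""
    x, y, z = node
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = [x, y, z]
            coord[axis] = (coord[axis] + step) % size  # wrap around at the edge
            neighbors.append(tuple(coord))
    return neighbors

corner_neighbors = torus_neighbors((0, 0, 0), 3)
```

The wrap-around links are what distinguish a torus from a plain 3D mesh: a corner node still has six neighbors.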

Hedera

Detect Large Flows → Estimate Flow Demands → Place Flows

• Detect Large Flows

– Flows that need bandwidth but are network-limited

• Estimate Flow Demands

– Use max-min fairness to allocate bandwidth to flows between src-dst pairs

• Place Flows

– Use estimated demands to heuristically find better placement of large flows on the ECMP paths

Hedera

• Large Flow Detection

– Scheduler continually polls edge switches for flow bytecounts

– Flows exceeding a B/s threshold are “large”

• > 10% of a host’s link capacity (i.e. > 100 Mbps on a 1 Gbps link)
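The detection rule above can be sketched as a comparison of two successive byte-count polls. The function name, the dictionary layout of the counters, the 5-second interval in the usage lines, and the 1 Gbps default link speed are assumptions for illustration.

```python
def detect_large_flows(prev_bytes, curr_bytes, interval_s, link_bps=1e9, fraction=0.10):
    """Return flows whose rate since the last poll exceeds fraction * link capacity.

    prev_bytes, curr_bytes : dict flow id -> cumulative byte count at each poll
    interval_s             : seconds between the two polls
    """
    threshold_bps = fraction * link_bps
    large = []
    for flow, count in curr_bytes.items():
        rate_bps = 8 * (count - prev_bytes.get(flow, 0)) / interval_s
        if rate_bps > threshold_bps:
            large.append(flow)
    return large

prev = {"f1": 0, "f2": 0}
curr = {"f1": 100_000_000, "f2": 1_000_000}   # bytes seen after one interval
large = detect_large_flows(prev, curr, interval_s=5)
```

Here `f1` averages 160 Mbps over the interval and is flagged, while `f2` at 1.6 Mbps stays on its default ECMP path.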

Hedera

• Demand Estimation

– Goal

• Estimate available bandwidth to allocate

– Method

• Using max-min fairness, given the traffic matrix of large flows, modify each flow’s size at its source and destination iteratively:

– Each sender distributes its bandwidth equally among outgoing flows that are not receiver-limited

– Receivers whose incoming demand exceeds their capacity (receiver-limited) reduce the excess equally across their incoming flows

– Repeat until all flows converge
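The iteration above can be sketched as follows. This is an illustrative reading of the sender/receiver passes, not the authors' code: link capacities are normalized to 1, each (src, dst) pair is assumed to carry at most one large flow, and all names are hypothetical.

```python
from fractions import Fraction

def estimate_demands(flows):
    """flows: list of unique (src, dst) pairs. Returns dict flow -> demand in [0, 1]."""
    demand = {f: Fraction(0) for f in flows}
    converged = {f: False for f in flows}
    changed = True
    while changed:
        changed = False
        # Sender pass: split what is left after converged flows among the rest.
        for src in sorted({s for s, _ in flows}):
            out = [f for f in flows if f[0] == src]
            fixed = sum((demand[f] for f in out if converged[f]), Fraction(0))
            free = [f for f in out if not converged[f]]
            if free:
                share = (1 - fixed) / len(free)
                for f in free:
                    if demand[f] != share:
                        demand[f], changed = share, True
        # Receiver pass: an overloaded receiver caps its largest flows equally.
        for dst in sorted({d for _, d in flows}):
            limited = [f for f in flows if f[1] == dst]
            if sum((demand[f] for f in limited), Fraction(0)) <= 1:
                continue
            small_total = Fraction(0)
            share = Fraction(1, len(limited))
            while True:
                small = [f for f in limited if demand[f] < share]
                if not small:
                    break
                # Flows below the equal share keep their demand (sender-limited).
                small_total += sum((demand[f] for f in small), Fraction(0))
                limited = [f for f in limited if demand[f] >= share]
                share = (1 - small_total) / len(limited)
            for f in limited:
                if demand[f] != share:
                    demand[f], changed = share, True
                converged[f] = True
    return demand

flows = [("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]
demands = estimate_demands(flows)
```

For the four flows A → X, A → Y, B → Y, C → Y, receiver Y caps its three flows at 1/3 each, which frees sender A to give 2/3 to A → X.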

Hedera

Hosts: A, B, C (senders); X, Y (receivers)

Flow     Estimate   Conv.?
A → X
A → Y
B → Y
C → Y

Senders

Sender   Available unconv. BW   Flows   Share
A        1                      2       1/2
B        1                      1       1
C        1                      1       1

Hedera

Hosts: A, B, C (senders); X, Y (receivers)

Flow     Estimate   Conv.?
A → X    1/2
A → Y    1/2
B → Y    1
C → Y    1

Receivers

Recv   RL?   Non-SL flows   Share
X      No    -              -
Y      Yes   3              1/3

Hedera

Hosts: A, B, C (senders); X, Y (receivers)

Flow     Estimate   Conv.?
A → X    1/2
A → Y    1/3        Yes
B → Y    1/3        Yes
C → Y    1/3        Yes

Senders

Sender   Available unconv. BW   Flows   Share
A        2/3                    1       2/3
B        0                      0       0
C        0                      0       0

Hedera

Hosts: A, B, C (senders); X, Y (receivers)

Flow     Estimate   Conv.?
A → X    2/3        Yes
A → Y    1/3        Yes
B → Y    1/3        Yes
C → Y    1/3        Yes

Receivers

Recv   RL?   Non-SL flows   Share
X      No    -              -
Y      No    -              -

Hedera

• Flow Placement

– Goal

• Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized

– Method

Global First Fit:

– Greedily choose path that has sufficient unreserved b/w

Simulated Annealing:

– Iteratively find a globally better mapping of paths to flows

Hedera

• Global First Fit

– When a new large flow is detected, linearly search all possible paths from S → D

– Place flow on first path whose component links can fit that flow
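A minimal sketch of the first-fit search described above, assuming link capacities normalized to 1 and a shared table of per-link reservations. The function name, link identifiers, and the choice to return `None` when nothing fits are assumptions for the example.

```python
def global_first_fit(demand, candidate_paths, reserved, capacity=1.0):
    """Reserve `demand` on the first path whose links can all still carry it.

    candidate_paths : list of paths, each a list of link ids
    reserved        : dict link id -> bandwidth already reserved (updated in place)
    """
    for path in candidate_paths:
        if all(reserved.get(link, 0.0) + demand <= capacity for link in path):
            for link in path:
                reserved[link] = reserved.get(link, 0.0) + demand
            return path
    return None  # nothing fits; the caller could leave the flow on its hashed path

reserved = {"a-core0": 0.8}
paths = [["a-core0", "core0-b"], ["a-core1", "core1-b"]]
first = global_first_fit(0.5, paths, reserved)    # core0 is too full, uses core1
second = global_first_fit(0.5, paths, reserved)   # core1 still has exactly 0.5 left
third = global_first_fit(0.5, paths, reserved)    # no path has room any more
```

The search is greedy: it never revisits earlier placements, which keeps it fast but can leave capacity stranded that a global search would recover.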

Hedera

• Simulated Annealing

– 4 specifications

• State space

• Neighboring states

• Energy

• Temperature

• Simple example: minimizing f(x)
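The four ingredients can be seen in a generic annealing loop. The sketch below minimizes a toy f(x) over the integers; the acceptance rule exp((E - E_new) / T), the step function, and all names are assumptions chosen for illustration, with the temperature taken to be the number of iterations left.

```python
import math
import random

def anneal(initial, neighbor, energy, iterations=1000, seed=0):
    """Generic simulated annealing; returns the best state seen and its energy."""
    rng = random.Random(seed)
    state, e = initial, energy(initial)
    best, best_e = state, e
    for temperature in range(iterations, 0, -1):      # temperature = iterations left
        candidate = neighbor(state, rng)
        e_new = energy(candidate)
        # Always accept improvements; accept worse states with a probability
        # that shrinks as the temperature drops.
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            state, e = candidate, e_new
        if e < best_e:
            best, best_e = state, e
    return best, best_e

def f(x):
    return (x - 3) ** 2                               # toy energy, minimum at x = 3

def step(x, rng):
    return x + rng.choice((-1, 1))                    # neighboring state: move by one

best_x, best_f = anneal(0, step, f)
```

Early on, almost every move is accepted, so the search can escape local minima; near the end it behaves like a greedy descent.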

Hedera

State: All possible mappings of flows to paths

– Constrained to reduce state space size

– Flows to a destination constrained to use same core

Neighbor State: Swap paths between 2 hosts

– Within same pod

Function/Energy: Total exceeded b/w capacity

– Using the estimated demand of flows

– Minimize the exceeded capacity

Temperature: Iterations left

– Fixed number of iterations (1000s)
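The energy function described above, total exceeded bandwidth capacity, can be sketched as follows. Link capacities are normalized to 1, and the function name and data layout are hypothetical.

```python
def exceeded_capacity(flow_paths, demands, capacity=1.0):
    """Sum, over all links, of the estimated load above the link capacity.

    flow_paths : dict flow id -> list of link ids the flow is placed on
    demands    : dict flow id -> estimated demand of that flow
    """
    load = {}
    for flow, path in flow_paths.items():
        for link in path:
            load[link] = load.get(link, 0.0) + demands[flow]
    return sum(max(0.0, used - capacity) for used in load.values())

demands = {"f1": 0.75, "f2": 0.75}
colliding = {"f1": ["up-core0"], "f2": ["up-core0"]}
separated = {"f1": ["up-core0"], "f2": ["up-core1"]}
e_colliding = exceeded_capacity(colliding, demands)   # 0.5 over on up-core0
e_separated = exceeded_capacity(separated, demands)   # nothing exceeded
```

A state with energy 0 places every large flow without oversubscribing any link, which is what the annealing search is trying to reach.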


JVMPR

• Joint VM Placement and Routing

• Goal: Efficient traffic engineering under dynamic arrivals and departures of jobs

– One method: localize traffic through flexible VM placement (affects node utilization)

– Another method: avoid congestion through intelligent routing (affects link utilization)

– The two are coupled with each other

JVMPR

Figure 1: Left: the existing VMs and traffic. Middle: a good VM placement with high congestion. Right: a worse placement with lower congestion. (Legend: existing VM; VM we need to add.)

JVMPR

• JVMPR considers placement and routing at the same time

• It develops an approximation algorithm that leverages the specific structure of the joint design problem

JVMPR

• Placement and Route Selection

– Placement : The feasible decision space for VM placement is

– Routing : The feasible decision space for routing is

JVMPR

• Optimize Resource Utilization

– cost_net: network cost

• Measures the congestion

– cost_node: node cost

• Operating cost induced by a switch or a machine

– Goal: Minimize the total cost

JVMPR

• Any problem?

• Yes!

– The number of jobs is not fixed

– Jobs enter or depart the system dynamically

• Better way: Online solution

– Extend the static problem setting to a dynamic environment

– Key idea: Perform local re-optimization

JVMPR

• Online solution algorithm

– Upon a new job arrival, assign the new job to one configuration according to the transition probability

– Upon a job departure, pick one job and migrate it to new machines according to the transition probability
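One way to picture "assign according to the transition probability" is a weighted random draw over candidate configurations. The sketch below assumes a Gibbs-style weight exp(-beta * cost); that form, the parameter `beta`, and all names are assumptions for illustration and are not taken from the JVMPR paper.

```python
import math
import random

def pick_configuration(costs, beta=1.0, rng=random):
    """Sample a configuration index with probability proportional to exp(-beta * cost).

    costs : total cost of each candidate configuration (placement plus routes)
    beta  : hypothetical knob; larger values favor low-cost configurations more
    """
    low = min(costs)                                   # shift for numerical stability
    weights = [math.exp(-beta * (c - low)) for c in costs]
    return rng.choices(range(len(costs)), weights=weights, k=1)[0]

costs = [5.0, 1.0, 9.0]
greedy_like = pick_configuration(costs, beta=1000, rng=random.Random(1))
```

With a very large `beta` the draw collapses onto the cheapest configuration; smaller values keep some randomness, which lets the system explore other configurations over time.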

JVMPR

• Why is the dynamic JVMPR solution appealing?

– It requires no VM migrations when new jobs arrive and at most one job migration when a job departs

– The computation of migration probability only requires local information

JVMPR

Fig. Performance comparison (x-axis: percentage of elephant flows)

JVMPR

• What is the price we pay for it?

– The approximated Markov chain no longer converges to the exact stationary distribution, but to a neighborhood around it

– Needs a lot of computation

Summary

Algorithm Topology Movability Global knowledge Other idea

ECMP

VLB

CamCube

Hedera

JVMPR

Y

Y

Y

Y

Y Y

Y

Y

Summary

ECMP

• Pros: (1) simple; (2) works well with mice flows

• Cons: (1) might cause congestion with elephant flows

VLB

• Pros: (1) simple; (2) works well with mice flows

• Cons: (1) might cause congestion with elephant flows

CamCube

• Pros: (1) flexible

• Cons: (1) optimization is per service/application; no global optimization is considered

Hedera

• Pros: (1) can deal with both mice flows and elephant flows

• Cons: (1) the algorithm cannot guarantee a global optimum; (2) the assumptions made during demand estimation may not hold

JVMPR

• Pros: (1) cost is low; (2) computation only needs local information

• Cons: (1) needs a lot of computation; (2) it is a kind of approximation

Open questions/proposals

• Imperfections of current algorithms

– Hedera

• Large flow detection is too simple

• Demand estimation only considers TCP flows

– JVMPR

• Demands a lot of computation

• It is an approximation

• Current work does not take full advantage of the nice features of data centers

– Combine topology, movability and VM placement together

• Add VM placement consideration into Hedera

Reference

[1] http://www.forbes.com/sites/timworstall/2013/08/17/fascinating-number-google-is-now-40-of-theinternet/

[2] Moy, John T. OSPF: anatomy of an Internet routing protocol. Addison-Wesley Professional, 1998.

[3] Chen, Kai, Chengchen Hu, Xin Zhang, Kai Zheng, Yan Chen, and Athanasios V. Vasilakos. "Survey on routing in data centers: insights and future directions." Network, IEEE 25, no. 4 (2011): 6-10.

[4] Jiang, Joe Wenjie, Tian Lan, Sangtae Ha, Minghua Chen, and Mung Chiang. "Joint VM placement and routing for data center traffic engineering." In INFOCOM, 2012 Proceedings IEEE, pp. 2876-2880. IEEE, 2012.

[5] Abu-Libdeh, Hussam, Paolo Costa, Antony Rowstron, Greg O'Shea, and Austin Donnelly. "Symbiotic routing in future data centers." ACM SIGCOMM Computer Communication Review 41, no. 4 (2011): 51-62.

[6] Farrington, Nathan, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat. "Helios: a hybrid electrical/optical switch architecture for modular data centers." ACM SIGCOMM Computer Communication Review 41, no. 4 (2011): 339-350.

[7] Hopps, Christian E. "Analysis of an equal-cost multi-path algorithm." (2000).

[8] Al-Fares, Mohammad, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. "Hedera: Dynamic Flow Scheduling for Data Center Networks." In NSDI, vol. 10, pp. 19-19. 2010.

Thank you!
