PEREGRINE: AN ALL-LAYER-2
CONTAINER COMPUTER NETWORK
Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huang
∗Computer Science Department, Stony Brook University
†Industrial Technology Research Institute, Taiwan
Outline

• Motivation
  – Layer 2 + Layer 3 design
  – Requirements for a cloud-scale data center
  – Problems of classic Ethernet in the cloud
• Solutions
  – Related solutions
  – Peregrine's solution
• Implementation and Evaluation
  – Software architecture
  – Performance evaluation
L2 + L3 Architecture: Problems

• Bandwidth bottleneck
• Configuration problems:
  – Routing tables in the routers
  – IP assignment
  – DHCP coordination
  – VLAN and STP
• Forwarding table size: commodity switches hold only 16-32K entries
• Virtual machine mobility is constrained to a physical location

Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System
Requirements for Cloud-Scale DC

• Any-to-any connectivity with a non-blocking fabric
  – Scale to more than 10,000 physical nodes
• Virtual machine mobility
  – Large layer 2 domain
• Quick failure detection and recovery
  – Fast fail-over
• Support for multi-tenancy
  – Share resources between different customers
• Load-balancing routing
  – Efficiently use all available links
Solution: A Huge L2 Switch!

• A single L2 network: config-free, plug-and-play
• Non-blocking backplane bandwidth
• Linear cost and power scaling
• Scales to 1 million VMs

However, Ethernet does not scale!

[Figure: one huge layer 2 switch connecting all VMs]
Revisit Ethernet: Spanning Tree Topology

[Figure: hosts N1-N8 attached to switches s1-s4 in a meshed topology]
Revisit Ethernet: Spanning Tree Topology

[Figure: the same topology after STP convergence; s4 is the root bridge, ports are marked as designated (D), root (R), or blocked (B), and redundant links are disabled]
Revisit Ethernet: Broadcast and Source Learning

[Figure: a broadcast frame is flooded along the spanning tree while switches learn source MAC addresses from incoming frames]

Benefit: plug-and-play
Ethernet's Scalability Issues

• Limited forwarding table size
  – Commodity switch: 16K to 64K entries
  – A typical layer 2 network consists of only hundreds of hosts
• STP as a solution to loop prevention
  – Not all physical links are used
  – No load-sensitive dynamic routing
  – Slow fail-over
• Fail-over latency is high (> 5 seconds)
• Broadcast overhead
Related Works / Solution Strategies

• Scalability
  – Clos network / fat-tree to scale out
• Alternatives to STP
  – Link aggregation, e.g. LACP, layer 2 trunking
  – Routing protocols applied to the layer 2 network
• Limited forwarding table size
  – Packet header encapsulation or rewriting
• Load balancing
  – Randomness or traffic-engineering approaches
Design of the Peregrine
Peregrine's Solutions

• Not all links are used
  – Disable the Spanning Tree Protocol
  – Calculate routes for all node pairs with a Route Server
• L2 loop prevention
  – Redirect broadcasts and block flooded packets
  – Source learning and forwarding
• Limited switch forwarding table size
  – Mac-in-Mac two-stage forwarding by a Dom0 kernel module
ARP Intercept and Redirect

[Figure: hosts A and B attached to switches sw1-sw4, with a Directory Service (DS) and a Route Algorithm Server (RAS); control flow and data flow are shown]

1. DS-ARP: A's ARP request is redirected to the DS
2. DS-Reply: the DS answers the ARP request
3. A sends data to B
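To make this control flow concrete, here is a minimal Python sketch of the intercept-and-redirect idea. The names (DirectoryService, handle_arp_request) and the reply format are illustrative assumptions, not Peregrine's actual interfaces.

```python
# Hypothetical sketch of the ARP intercept-and-redirect step described above.

class DirectoryService:
    """Maps a VM's IP address to (VM MAC, MAC of the switch it attaches to)."""
    def __init__(self):
        self.table = {}          # ip -> (vm_mac, attach_switch_mac)

    def register(self, ip, vm_mac, switch_mac):
        self.table[ip] = (vm_mac, switch_mac)

    def resolve(self, ip):
        return self.table.get(ip)

def handle_arp_request(ds, target_ip):
    """Dom0-side handling: instead of broadcasting the ARP request,
    redirect it to the directory service and synthesize a unicast reply."""
    entry = ds.resolve(target_ip)
    if entry is None:
        return None              # unknown destination: drop instead of flooding
    vm_mac, switch_mac = entry
    # The reply carries both the destination VM MAC and the MAC of its
    # attachment switch, which the sender later uses for two-stage forwarding.
    return {"ip": target_ip, "vm_mac": vm_mac, "attach_switch_mac": switch_mac}

if __name__ == "__main__":
    ds = DirectoryService()
    ds.register("10.0.0.2", "00:16:3e:00:00:02", "00:aa:bb:cc:00:04")  # B behind sw4
    print(handle_arp_request(ds, "10.0.0.2"))
```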
Peregrine's Solutions

• Not all links are used
  – Disable the Spanning Tree Protocol
  – Calculate routes for all node pairs with a Route Server
  – Fast fail-over: primary and backup routes for each pair
• L2 loop prevention
  – Redirect broadcasts and block flooded packets
  – Source learning and forwarding
• Limited switch forwarding table size
  – Mac-in-Mac two-stage forwarding by a Dom0 kernel module
Mac-in-Mac Encapsulation

[Figure: A on sw1 sends to B on sw4; the DS returns B's location, A's Dom0 module encapsulates, and the far side decapsulates]

1. ARP redirect from A
2. The DS finds that B is located at sw4
3. The DS replies with sw4
4. Encapsulation: sw4 is prepended as the intermediate destination address (IDA) in front of the original DA (B) and SA (A)
5. Decapsulation restores the original frame (DA = B, SA = A)
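A minimal sketch of the two-stage encapsulation and decapsulation shown above, assuming a simplified frame representation; the field names (ida, da, sa) follow the figure, everything else is illustrative.

```python
# Sketch of two-stage Mac-in-Mac forwarding: prepend the intermediate
# destination (the destination's attachment switch) and strip it off again.

def mim_encap(frame, attach_switch_mac):
    """First stage: address the frame to the destination's attachment switch."""
    return {
        "ida": attach_switch_mac,   # intermediate destination address (outer MAC)
        "da": frame["da"],          # original destination VM MAC
        "sa": frame["sa"],          # original source VM MAC
        "payload": frame["payload"],
    }

def mim_decap(encapped):
    """Second stage: restore the original frame before delivering it to the VM."""
    return {"da": encapped["da"], "sa": encapped["sa"], "payload": encapped["payload"]}

if __name__ == "__main__":
    original = {"da": "MAC_B", "sa": "MAC_A", "payload": b"hello"}
    on_wire = mim_encap(original, "MAC_SW4")   # done by A's Dom0 module
    delivered = mim_decap(on_wire)             # done on B's side
    assert delivered == original
    print(on_wire)
```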
Fast Fail-Over

• Goal: fail-over latency < 100 msec
  – Application agnostic
  – TCP timeout: 200 ms
• Strategy: pre-compute a primary and a backup route for each VM (see the sketch below)
  – Each VM has two virtual MACs
  – When a link fails, notify the hosts whose primary routes are affected so that they switch to the corresponding backup routes
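A hedged sketch of the pre-computed primary/backup switching described above; the RoutePair structure and the notification mechanism are assumptions for illustration, not Peregrine's implementation.

```python
# Each (src, dst) pair has a primary and a backup route; when a link fails,
# senders whose primary route uses that link are switched to the backup.

from dataclasses import dataclass

@dataclass
class RoutePair:
    primary: list              # list of links, each a (switch_a, switch_b) tuple
    backup: list
    using_backup: bool = False

def links_of(route):
    return set(frozenset(link) for link in route)

def on_link_failure(route_table, failed_link):
    """Return the (src, dst) pairs that must be told to switch to their backup."""
    failed = frozenset(failed_link)
    affected = []
    for pair, rp in route_table.items():
        if not rp.using_backup and failed in links_of(rp.primary):
            rp.using_backup = True      # host is notified to use the backup virtual MAC
            affected.append(pair)
    return affected

if __name__ == "__main__":
    table = {("A", "B"): RoutePair(primary=[("sw1", "sw2"), ("sw2", "sw4")],
                                   backup=[("sw1", "sw3"), ("sw3", "sw4")])}
    print(on_link_failure(table, ("sw2", "sw4")))   # -> [('A', 'B')]
```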
When a Network Link Fails
Implementation and Evaluation
Software Architecture
Review All Components

[Figure: A and B attached to switches sw1-sw7, with the Dom0 MIM module, the DS, and the RAS; ARP redirects and a backup route are marked]

• ARP request rate: how fast can the DS handle requests?
• How long does the RAS take to process a request?
• What is the performance of the MIM module, the DS, the RAS, and the switches?
Mac-In-Mac Performance

[Figure: CDF of per-packet processing time]

Time spent for decap / encap / total: 1 µs / 5 µs / 7 µs on a 2.66 GHz CPU, i.e. roughly 2.66K / 13.3K / 18.6K cycles.
Aggregate Throughput for Multiple VMs

[Chart: aggregate TCP throughput (y-axis 910-980, presumably Mbps) with and without MIM for 1, 2, and 4 VMs]

1. ARP table size < 1K
2. Measure the TCP throughput of 1, 2, and 4 VMs communicating with each other.
ARP Broadcast Rate in a Data Center

• What is the ARP traffic rate in the real world?
  – CMU CS department, 2,456 hosts: 1,150 ARP/sec at peak, 89 ARP/sec on average.
  – A university network with 3,800 hosts: around 1,000 ARP/sec at peak, < 100 ARP/sec on average.
• Scaling to 1M nodes implies 20K-30K ARP/sec on average.
• The current optimized DS handles 100K ARP/sec.
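As a rough sanity check on the extrapolation above (assuming simple linear scaling of the measured per-host average; the arithmetic is mine, not from the slides):

```latex
\frac{100~\mathrm{ARP/s}}{3800~\mathrm{hosts}} \approx 0.026~\mathrm{ARP/s\ per\ host}
\qquad\Rightarrow\qquad
10^{6}~\mathrm{hosts} \times 0.026 \approx 2.6\times 10^{4}~\mathrm{ARP/s}
```

which is consistent with the 20K-30K ARP/sec figure and well below the 100K ARP/sec the DS can handle.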
Fail-over time and its breakdown

• Average fail-over time: 75 ms
  – Switch: 25~45 ms — sending the trap (soft unplug)
  – RS: 25 ms — receiving the trap, processing, and informing the DS
  – DS: 2 ms — receiving the update from the RS
  – The rest is network delay and Dom0 processing time

[Chart: fail-over time in ms (0-400) for runs 1-11, broken down into total, RS, and DS]
Conclusion

• A unified Layer-2-only network for LAN and SAN
• Centralized control plane and distributed data plane
• Uses only commodity Ethernet switches
  – An army of commodity switches vs. a few high-port-density switches
  – Requirements on switches: run fast and have a programmable routing table
• Centralized load-balancing routing using a real-time traffic matrix
• Fast fail-over using pre-computed primary/backup routes
Questions?
Thank you
Review All Components: Result

[Figure: the same component diagram, annotated with the measured results]

• MIM module: 7 µs per-packet processing
• DS: 100K ARP requests/sec
• RS: 25 ms per request
• Link-down detection: 35 ms
Backup slides
OpenFlow Architecture

• OpenFlow switch: a data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set
• OpenFlow controller: a control plane that sets up the flow rules in the flow tables of OpenFlow switches
• OpenFlow protocol: a secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches
[Figure: an OpenFlow controller communicates with the switch over the OpenFlow protocol (SSL/TCP); the switch's control path runs OpenFlow and its data path is implemented in hardware]
Conclusion and Contribution

• Use commodity switches to build a large-scale layer 2 network
• Provide solutions to Ethernet's scalability issues
  – Suppressing broadcast
  – Load-balancing route calculation
  – Controlling the MAC forwarding table
  – Scaling up to one million VMs with Mac-in-Mac two-stage forwarding
  – Fast fail-over
• Future work
  – High availability of DS and RAS (master-slave model)
  – Inter
Comparisons

• Scalable and available data center fabrics
  – IEEE 802.1aq: Shortest Path Bridging
  – IETF TRILL
  – Competitors: Cisco, Juniper, Brocade
  – Differences: commodity switches, centralized load-balancing routing, and proactive backup route deployment
• Network virtualization
  – OpenStack Quantum API
  – Competitors: Nicira, NEC
  – Generality carries a steep performance price: every virtual network link is a tunnel
  – Differences: simpler and more efficient because it runs on L2 switches directly
Three-Stage Clos Network (m, n, r)

[Figure: r ingress switches of size n×m, m middle switches of size r×r, and r egress switches of size m×n; each ingress and egress switch carries n inputs or outputs]
Clos Network Theory

• Clos(m, n, r) configuration: rn inputs, rn outputs
  – 2r n×m switches plus m r×r switches; fewer crosspoints than a single rn × rn crossbar
  – Each r×r middle switch can in turn be implemented as a 3-stage Clos network
• Clos(m, n, r) is rearrangeably non-blocking iff m >= n
• Clos(m, n, r) is strictly non-blocking iff m >= 2n - 1
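A quick numeric check of the crosspoint count and non-blocking conditions above, with illustrative parameter values (the example numbers are mine, not from the slides):

```python
# Crosspoint count and non-blocking checks for a Clos(m, n, r) network.

def clos_crosspoints(m, n, r):
    """2r ingress/egress switches of size n x m plus m middle switches of size r x r."""
    return 2 * r * (n * m) + m * (r * r)

def crossbar_crosspoints(n, r):
    """A single (rn) x (rn) crossbar, for comparison."""
    return (r * n) ** 2

def rearrangeably_nonblocking(m, n):
    return m >= n

def strictly_nonblocking(m, n):
    return m >= 2 * n - 1

if __name__ == "__main__":
    m, n, r = 7, 4, 8                                 # example: 32 inputs, 32 outputs
    print(clos_crosspoints(m, n, r))                  # 2*8*28 + 7*64 = 896
    print(crossbar_crosspoints(n, r))                 # 32*32 = 1024
    print(rearrangeably_nonblocking(m, n), strictly_nonblocking(m, n))  # True True
```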
Link Aggregation

• Stacking or trunking: logically a single switch; states are synchronized
• Link aggregation: logically a single link to the upper layer, so STP views the topology as loop-free

[Figures: stacked/trunked switch pairs and aggregated links between switches]
ECMP: Equal-Cost Multipath

[Figure: a multi-tier topology where layer 3 ECMP routing hashes flows onto 4 uplinks]

• Layer 3 ECMP routing: flow hashing onto 4 uplinks (see the sketch below)
• Pros: multiple links are used
• Cons: hash collisions; traffic re-converges downstream onto a single link
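A minimal sketch of the flow-hashing step: hash the flow 5-tuple and pick one of four uplinks. The helper name and field choices are illustrative.

```python
# ECMP-style flow hashing: all packets of one flow take the same uplink,
# while different flows spread across the equal-cost uplinks.

import hashlib

UPLINKS = ["uplink0", "uplink1", "uplink2", "uplink3"]

def pick_uplink(src_ip, dst_ip, proto, src_port, dst_port, uplinks=UPLINKS):
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]

if __name__ == "__main__":
    print(pick_uplink("10.0.0.1", "10.0.1.9", "tcp", 5001, 80))
    # Two distinct flows may still collide onto the same uplink (the "con" above).
    print(pick_uplink("10.0.0.2", "10.0.1.9", "tcp", 5002, 80))
```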
Example: Brocade Data Center

[Figure: a Brocade data center design using L3 ECMP together with link aggregation]

Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers
PortLand

• Scale-out: three-layer, multi-root topology
• Hierarchical: encodes location into the MAC address
• Location Discovery Protocol to find shortest paths; routes by MAC
• Fabric Manager maintains the IP-to-MAC mapping
• 60-80 ms fail-over; centralized control and notification
VL2: Virtual Layer 2

• Three-layer Clos network
• Flat addressing: IP-in-IP, with Location Addresses (LA) and Application Addresses (AA)
• Link-state routing to disseminate LAs
• VLB + flow-based ECMP
• Depends on ECMP to detect link failures
• Packet interception at S

[Figure: IP-in-IP encapsulation at the sender and decapsulation near the destination; VLB; VL2 Directory Service]
Monsoon

• Three-layer, multi-root topology
• 802.1ah MAC-in-MAC encapsulation, source routing
• Centralized routing decisions
• VLB + MAC rotation
• Depends on LSA to detect failures
• Packet interception at S

[Figure: Monsoon Directory Service mapping IP <-> (server MAC, ToR MAC)]
TRILL and SPB

TRILL
• Transparent Interconnection of Lots of Links (IETF)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• New TRILL header
• Transit hash to select the next hop

SPB
• Shortest Path Bridging (IEEE)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• 802.1ah MAC-in-MAC
• Computes 16 source-node-based trees
TRILL Packet Forwarding

[Figure: link-state routing with a TRILL header; A-ID = nickname of A, C-ID = nickname of C, HopC = hop count]

Ref: NIL Data Communications
SPB Packet Forwarding

[Figure: link-state routing with 802.1ah Mac-in-Mac encapsulation; I-SID = Backbone Service Instance Identifier, B-VID = backbone VLAN identifier]

Ref: NIL Data Communications
Re-arrangeable Non-blocking Clos Network

Example:
1. Three-stage Clos network
2. Condition: k >= n
3. An unused input at an ingress switch can always be connected to an unused output at an egress switch
4. Existing calls may have to be rearranged

[Figure: N = 6, n = 2, k = 2; three 2×2 (n×k) ingress switches, two 3×3 ((N/n)×(N/n)) middle switches, and three 2×2 (k×n) egress switches]
Features of Peregrine Network

• Utilize all links
  – Load-balancing routing algorithm
• Scale up to 1 million VMs
  – Two-stage dual-mode forwarding
• Fast fail-over
• Load-balancing routing algorithm
Goal

[Figure: a primary and a backup route between source S and destination D]

• Given a mesh network and a traffic profile:
  – Load-balance the network resource utilization
    • Prevent congestion by balancing the network load so as to support as much traffic as possible
  – Provide fast recovery from failure
    • Provide a primary-backup route pair to minimize recovery time
46
Factors
• Only hop count
• Hop count and link residual capacity
• Hop count, link residual capacity, and link
expected load
• Hop count, link residual capacity, link expected
load and additional forwarding table entries
required
How to combine them into one number for a particular candidate
route?
47
Route Selection: Idea

[Figure: two candidate routes from S1 to D1: one through A-B that leaves link C-D free, and one that shares link C-D with the S2-D2 traffic]

• Which route is better from S1 to D1?
• Link C-D is more important, because S2-D2 must share it.
• Idea: use it as sparsely as possible.
Route Selection: Hop Count and Residual Capacity

Traffic matrix: S1 -> D1: 1G, S2 -> D2: 1G

[Figure: the same two candidate routes from S1 to D1; one leaves C-D free, the other shares it with S2-D2]

Using hop count or residual capacity alone makes no difference between the two routes!
Determine Criticality of a Link

Determine the importance of a link:

f_l(s,d) = fraction of all (s, d) routes that pass through link l

Expected load of a link at the initial state:

f_l = \sum_{(s,d)} f_l(s,d)\, B(s,d)

where B(s,d) is the bandwidth demand matrix entry for the pair (s, d).
Criticality Example

[Figure: an example topology between A, B, and C; links are labeled with f_l(B,C): 2/4 on the interior links, 4/4 on the links adjacent to B and C, and 0 on the link toward A]

• From B to C there are four possible routes.
• Case 2: calculate f_l(s,d) for s = B, d = C.
• Case 3: s = A, d = C is similar.
Expected Load

Assumption: load is equally distributed over all possible routes between S and D.

Consider a bandwidth demand of 20 for B-C. Expected load:

f_l = \sum_{(s,d)} f_l(s,d)\, B(s,d)

[Figure: the same topology with expected loads: 10 on each interior link (2/4 × 20), 20 on the links adjacent to B and C, and 0 on the link toward A]
Cost Metrics

The cost metric represents the expected load per unit of available capacity on the link:

cost(l) = f_l / R_l

where f_l is the expected load and R_l is the residual capacity of link l.

[Figure: the same topology with per-link costs of 0.01 and 0.02]

Idea: pick the link with minimum cost.
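To tie the criticality, expected-load, and cost definitions together, here is a small Python sketch on a toy topology; the route enumeration and data layout are illustrative assumptions, not Peregrine's implementation.

```python
# Criticality f_l(s,d), expected load f_l, and cost(l) = f_l / R_l on a toy graph.

from collections import defaultdict

def link_criticality(routes_by_pair):
    """f_l(s,d): fraction of all (s,d) routes that pass through link l."""
    crit = defaultdict(dict)
    for pair, routes in routes_by_pair.items():
        for route in routes:
            for link in route:
                key = frozenset(link)
                crit[key][pair] = crit[key].get(pair, 0) + 1 / len(routes)
    return crit

def expected_load(crit, demand):
    """f_l = sum over (s,d) of f_l(s,d) * B(s,d)."""
    return {link: sum(frac * demand.get(pair, 0) for pair, frac in per_pair.items())
            for link, per_pair in crit.items()}

def link_cost(load, residual):
    """cost(l) = f_l / R_l: expected load per unit of available capacity."""
    return {link: load[link] / residual[link] for link in load}

if __name__ == "__main__":
    # Two parallel two-hop routes from B to C, as in the criticality example.
    routes = {("B", "C"): [[("B", "x1"), ("x1", "C")],
                           [("B", "x2"), ("x2", "C")]]}
    demand = {("B", "C"): 20}                       # bandwidth demand B -> C
    crit = link_criticality(routes)
    load = expected_load(crit, demand)              # 10 on each link here
    residual = {link: 1000 for link in load}        # e.g. 1G of residual capacity
    print(link_cost(load, residual))                # 0.01 per link
```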
Forwarding Table Metric

Consider commodity switches with a 16-32K forwarding table size.

• Fwd(n) = available forwarding table entries at node n
• INC_FWD = extra entries needed to route A-C

[Figure: an example topology between A, B, and C annotated with the available forwarding table entries at each node]

Idea: minimize entry consumption, preventing the forwarding table from being exhausted.
Load Balanced Routing

• Simulated network
  – 52 PMs with 4 NICs, 384 links in total
  – Replay 17 multi-VDC 300-second traces
• Compare
  – Random shortest-path routing (RSPR)
  – Full Link Criticality-based routing (FLCR)
• Metric: congestion count
  – Number of links whose capacity is exceeded
• Low additional traffic induced by FLCR