Software-defined networking: Change is hard

Ratul Mahajan
with
Chi-Yao Hong, Rohan Gandhi, Xin Jin, Harry Liu,
Vijay Gill, Srikanth Kandula, Mohan Nanduri, Roger Wattenhofer, Ming Zhang
Inter-DC WAN: A critical, expensive resource
[Figure: world map of an inter-DC WAN, with data centers in Seattle, Los Angeles, Miami, New York, Dublin, Barcelona, Seoul, and Hong Kong]
But it is highly inefficient
One cause of inefficiency: Lack of coordination
Another cause of inefficiency: Local, greedy resource allocation
[Figure: example topology with nodes A–H, contrasting a local, greedy allocation with the globally optimal allocation]
[Latency inflation with MPLS-based traffic engineering, IMC 2011]
SWAN: Software-driven WAN
Goals
 Highly efficient WAN
 Flexible sharing policies
Key design elements
 Coordinate across services
 Centralize resource allocation
[Achieving high utilization with software-driven WAN, SIGCOMM 2013]
SWAN overview
[Architecture: service brokers send traffic demand to the SWAN controller and enforce its BW allocation by rate limiting service hosts; network agents report topology and traffic to the controller and push its network config. to the WAN switches]
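To make the division of labor concrete, here is a minimal sketch of the control loop this architecture implies, in Python. The broker and agent interfaces, compute_allocation, and every other name here are illustrative assumptions, not the real SWAN APIs.

import time

# Minimal sketch of the SWAN control loop. "Brokers" report per-service
# demand and enforce rate limits on service hosts; "agents" report topology
# and traffic and push config. to WAN switches. All names are hypothetical.
def swan_control_loop(brokers, agents, compute_allocation,
                      interval_s: float = 60.0) -> None:
    while True:
        # Service brokers aggregate per-service traffic demand.
        demands = {b.service: b.current_demand() for b in brokers}
        # Network agents report current topology and traffic.
        topology = [a.topology() for a in agents]
        # Centralized allocation: the controller decides how much each
        # service may send and along which paths (an LP in SWAN).
        rate_limits, configs = compute_allocation(demands, topology)
        # Enforce the decision at both ends: hosts and switches.
        for b in brokers:
            b.set_rate_limit(rate_limits[b.service])
        for a, cfg in zip(agents, configs):
            a.apply_config(cfg)
        time.sleep(interval_s)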
Key design challenges
Scalably computing BW allocations
Working with limited switch memory
Avoiding congestion during network updates
Congestion during network updates
Congestion-free network updates
Computing congestion-free update plan
Leave scratch capacity s on each link
 Ensures a plan with at most ⌈1/s⌉ − 1 steps
Find a plan with the minimal number of steps using an LP (see the sketch below)
 Search for a feasible plan with 1, 2, …, max steps
Use scratch capacity for background traffic
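The feasibility test underlying this search is simple: during an asynchronous transition a flow may still be routed per the old state or already per the new one, so the worst-case load on a link is the sum of per-flow maxima across the two adjacent states. A minimal sketch in Python, assuming per-flow link allocations as nested dicts; SWAN finds the intermediate states themselves with an LP, while this only verifies a given plan.

from itertools import chain

def transition_is_congestion_free(state_a, state_b, capacity):
    # A state maps flow -> {link -> rate}. During the transition a flow may
    # follow either state, so take the per-flow max on every link.
    flows = set(state_a) | set(state_b)
    links = set(chain.from_iterable(
        alloc for state in (state_a, state_b) for alloc in state.values()))
    return all(
        sum(max(state_a.get(f, {}).get(l, 0.0),
                state_b.get(f, {}).get(l, 0.0)) for f in flows)
        <= capacity[l]
        for l in links)

def plan_is_congestion_free(states, capacity):
    # states[0] is the current state, states[-1] the target.
    return all(transition_is_congestion_free(a, b, capacity)
               for a, b in zip(states, states[1:]))

# Two 8-unit flows swapping paths across 10-unit links cannot move in one
# shot (worst case 8 + 8 = 16 > 10) ...
cap = {"AB": 10.0, "AC": 10.0}
cur = {"f1": {"AB": 8.0}, "f2": {"AC": 8.0}}
tgt = {"f1": {"AC": 8.0}, "f2": {"AB": 8.0}}
print(plan_is_congestion_free([cur, tgt], cap))   # False
# ... but moving s = 20% of capacity per step succeeds in
# ceil(1/0.2) - 1 = 4 steps, matching the bound above.
steps = [{"f1": {"AB": 8.0 - 2.0 * i, "AC": 2.0 * i},
          "f2": {"AC": 8.0 - 2.0 * i, "AB": 2.0 * i}} for i in range(5)]
print(plan_is_congestion_free(steps, cap))        # True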
SWAN provides congestion-free updates
[Figure: complementary CDF of the oversubscription ratio and of extra traffic (MB) during updates]
SWAN comes close to optimal
[Figure: throughput relative to optimal for SWAN, SWAN w/o rate control, and MPLS TE]
Deploying SWAN
[Figure: partial deployment vs. full deployment of SWAN across the data centers and the WAN]
The challenge of data plane updates in SDN
Not just about congestion
 Blackholes, loops, packet coherence, …
Real-world is even messier
[Figure: CDFs of update latency (seconds), one panel from our controlled experiments and one from Google's B4]
Many resulting questions of interest
Fundamental
 What consistency properties can be maintained and how?
 Is property strength and ease of maintenance related?
Practical
 How to quickly and safely update the data plane?
 Impacts failure recovery time, network utilization, flow response time
Minimal dependencies for a consistency property
Property               Minimal dependency   Technique
Eventual consistency   None                 Always guaranteed
Blackhole freedom      Self                 Add before remove
Loop freedom           Downstream subset    Rule dependency forest / tree
Packet coherence       Downstream all       Flow version numbers
Congestion freedom     Global               Global version numbers + staged partial moves
(With any weaker dependency structure, the property is impossible to maintain.)
[On consistent updates in software-defined networks, HotNets 2013]
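As one concrete instance from the table above: "add before remove" suffices for blackhole freedom, because a packet always finds some matching rule if the new rule is installed before the old one is removed. A minimal sketch, with an illustrative Switch class rather than any real controller API.

class Switch:
    # Illustrative stand-in for a switch's flow table, keyed by
    # (match, priority); highest priority wins on a real switch.
    def __init__(self):
        self.rules = {}

    def install(self, match, priority, action):
        self.rules[(match, priority)] = action

    def remove(self, match, priority):
        self.rules.pop((match, priority), None)

def replace_rule(switch, match, old_priority, new_priority, new_action):
    # Install the replacement at higher priority first ...
    assert new_priority > old_priority
    switch.install(match, new_priority, new_action)
    # ... then delete the old rule. Between the two steps both rules exist,
    # so the flow is never blackholed; it may briefly take either path,
    # which is why this guarantees less than packet coherence.
    switch.remove(match, old_priority)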
Fast, consistent network updates
[Pipeline: consistency property, routing policy, and current network state feed the Desired state generator (forward fault correction: computes states that are robust to common faults); its target network state feeds the Update planner (Dionysus: dynamically schedules network updates), which emits the update plan]
Overview of forward fault correction
Control and data plane faults cause congestion
 Today, reactive data plane updates are needed to remove congestion
FFC handles faults proactively
 Guarantees absence of congestion for up to k faults
Main challenge: Too many possible faults
 Constraint reduction technique based on sorting networks
[Traffic engineering with forward fault correction, SIGCOMM 2014 (to appear)]
Congestion due to control plane faults
[Figure: current state vs. target state]
FFC for control plane faults
[Figure: current state, a vulnerable target state, and robust target states for k = 1 and k = 2]
Congestion due to data plane faults
[Figure: pre-failure vs. post-failure traffic distribution]
FFC for data plane faults
[Figure: a vulnerable traffic distribution vs. a robust traffic distribution (k = 1)]
FFC guarantee needs too many constraints
∀ f ∈ F:  Σ_{s ∈ f} T_l(s) ≤ C_l
F: {f | f is a set of up to k faulty switches}
C_l: spare capacity of link l in the absence of faults
T_l(s): additional traffic on link l when switch s is faulty
Number of constraints per link: (n choose 1) + … + (n choose k)
Efficient solution using sorting networks
Replace the exponential family  ∀ f ∈ F: Σ_{s ∈ f} T_l(s) ≤ C_l
with the single constraint  Σ_{m=1}^{k} T_l^(m) ≤ C_l
T_l^(m): m-th largest variable in the array T_l(s)
Use a bubble sort network to compute linear expressions for the k largest variables
 O(nk) constraints
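A minimal sketch of that reduction in Python: each comparator (a, b) → (hi, lo) is LP-encodable as hi ≥ a, hi ≥ b, hi + lo = a + b, so hi upper-bounds max(a, b) while the pair's sum is preserved, and k bubble passes leave upper bounds on the k largest values in the first k positions. Constraints are emitted as strings, and all variable names are invented for illustration.

def topk_sum_constraints(xs, k, capacity_name):
    # xs: names of the LP variables T_l(s); returns constraint strings that
    # enforce sum of the k largest of them <= capacity_name.
    xs = list(xs)
    constraints = []
    aux = 0

    def comparator(a, b):
        # LP encoding of one comparator: hi >= max(a, b), lo <= min(a, b),
        # a + b preserved, so prefix sums are only over-estimated.
        nonlocal aux
        hi, lo = f"_hi{aux}", f"_lo{aux}"
        aux += 1
        constraints += [f"{hi} >= {a}", f"{hi} >= {b}",
                        f"{hi} + {lo} == {a} + {b}"]
        return hi, lo

    # Pass p bubbles (a bound on) the next-largest value up to position p.
    for p in range(min(k, len(xs))):
        for i in range(len(xs) - 1, p, -1):
            xs[i - 1], xs[i] = comparator(xs[i - 1], xs[i])
    constraints.append(" + ".join(xs[:k]) + f" <= {capacity_name}")
    return constraints

For n variables this emits roughly nk comparators, hence O(nk) constraints per link, versus the (n choose 1) + … + (n choose k) of the direct encoding.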
FFC performance in practice
Single-priority traffic: (k_c, k_d) = (2, 1)
Multi-priority traffic: (k_c^h, k_d^h) = (3, 3); (k_c^m, k_d^m) = (2, 1); (k_c^l, k_d^l) = (0, 0)
[Figure: FFC performance under these fault-protection levels]
Fast, consistent network updates
[Pipeline recap: consistency property, routing policy, and current network state → Desired state generator (forward fault correction) → target network state → Update planner (Dionysus) → update plan]
Overview of dynamic update scheduling
Current schedulers pre-compute a static update schedule
 Can get unlucky with switch delays
Dynamic scheduling adapts to actual conditions
Main challenge: Tractably exploring “safe” schedules
[Dionysus: Dynamic scheduling of network updates, SIGCOMM 2014 (to appear)]
Downside of static schedules
[Figure: current and target states for flows F1: 5, F2: 5, F3: 10, F4: 5 across switches S1–S5, with update timelines for static plans A and B and for a dynamic plan. Either static plan can get unlucky with slow switches; the dynamic plan achieves low update time regardless of latency variability]
Challenge in dynamic scheduling
Tractably explore valid orderings
 Exponential number of orderings
 Cannot completely avoid planning
[Figure: current and target states for flows F1–F5 across switches S1–S5]
Dionysus pipeline
[Pipeline: consistency property, current network state, and target network state → Dependency graph generator → dependency graph → Update scheduler]
Dionysus dependency graph
Nodes: updates and resources
Edges: dependencies among nodes
[Figure: dependency graph for the F1–F5 example above]
Dionysus scheduling
NP-complete problem with capacity and memory constraints
Approach
 Critical path scheduling (see the sketch below)
 Treat strongly connected components as virtual nodes and favor them
 Rate limit flows to resolve deadlocks
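A minimal sketch of the critical-path idea over a simplified dependency graph: operation nodes only, assumed acyclic, with issue() and finished() as hypothetical callbacks to the switches. Real Dionysus graphs also carry resource nodes for link capacity and switch memory, and handle cycles via the virtual-node and rate-limiting steps listed above.

from functools import lru_cache

def critical_path_lengths(succ, cost):
    # succ: node -> successors (u -> v means u must finish before v starts);
    # cost: node -> estimated update delay.
    @lru_cache(maxsize=None)
    def cpl(u):
        return cost[u] + max((cpl(v) for v in succ.get(u, ())), default=0)
    return {u: cpl(u) for u in cost}

def dynamic_schedule(succ, cost, issue, finished):
    # issue(u): send update u to its switch (non-blocking).
    # finished(): block until some outstanding update completes; return it.
    preds = {u: set() for u in cost}
    for u, vs in succ.items():
        for v in vs:
            preds[v].add(u)
    cpl = critical_path_lengths(succ, cost)
    done, issued = set(), set()
    while len(done) < len(cost):
        # Issue every update whose predecessors are done, longest critical
        # path first; unlike a static plan, the order of later steps adapts
        # to which switches actually finished.
        for u in sorted((u for u in cost
                         if u not in issued and preds[u] <= done),
                        key=lambda v: -cpl[v]):
            issue(u)
            issued.add(u)
        done.add(finished())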
Dionysus leads to faster updates
Median improvement over static scheduling (SWAN): 60-80%
Dionysus reduces congestion due to failures
99th percentile improvement over static scheduling (SWAN): 40%
Fast, consistent network updates
[Pipeline recap: consistency property, routing policy, and current network state → Desired state generator (forward fault correction) → target network state → Update planner (Dionysus) → update plan]
Summary
SDN enables new network operating points, such as high utilization
But it also poses a new challenge: fast, consistent data plane updates