SDN Controller Challenges

The Story Thus Far
• SDN --- centralize the network’s control plane
– The controller is effectively the brain of the network
– The controller determines what to do and tells switches how to do it
• Something happened in the network? Ask the brain!
– The controller thinks about what happened, comes up with a solution, and tells the network what to do
• Controller runs control function
• Control function creates switch state
– F(global network state) → switch state
– Global network state can be a graph of the network
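The control function above can be sketched concretely. This is a minimal illustration (names and representation are my own, not from any specific controller): given the network graph, compute shortest-path next hops toward a destination and emit per-switch forwarding state.

```python
# Sketch of F(global network state) -> switch state: from a network
# graph, derive each switch's next hop toward a destination via BFS.
from collections import deque

def control_function(graph, dst):
    """graph: dict node -> list of neighbors. Returns, per switch,
    the next hop toward dst along a shortest path."""
    next_hop = {dst: None}
    queue = deque([dst])
    while queue:                      # BFS outward from the destination
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in next_hop:
                next_hop[nbr] = node  # forward back toward dst
                queue.append(nbr)
    return {sw: hop for sw, hop in next_hop.items() if hop is not None}

graph = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2"]}
state = control_function(graph, "s3")
assert state == {"s1": "s2", "s2": "s3"}
```

The point of the sketch is the shape of the computation: the controller holds the whole graph, and each switch receives only its own slice of the output.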
Challenges with Centralization
• Single point of failure
– Fault tolerance
• Performance bottleneck
– Scalability
– Efficiency (switch-controller latency)
• Single point for security violations
Motivation for Distributed Controllers
• Wide-area networks
– Wide distribution of switches: from the USA to Australia
– High latency between one controller and all switches
• Application + Network growth
– Higher CPU load for controller
– More memory for storing FIB entries and calculations
• High availability
Class Outline
• Fault Tolerance
– Google’s B4 paper
• Controller Scalability
– Ways to scale the controller
– Distributed controllers: Mesh Versus Hierarchy
– Implications of controller placement
Fault Tolerance
Google’s B4 Network
• Provides connectivity between DC sites
• Uses SDN to control edge switches
• Goal: high utilization of links
• Insight: fine-grained control over the edge and the network can lead to higher utilization
• Distributed controllers
– One set of controllers for each data center (site)
Fault Tolerance in B4
• Each site runs a set of controllers
• Paxos is run between the controllers in a site to determine the master
Quick Overview of Paxos
• Given N controllers
– 1 acts as leader, and N-1 as workers
– All N controllers maintain the same state
• Switches interact with the leader
• A change doesn’t happen until the whole group agrees
• On failure of the leader:
– The remaining N-1 controllers work together to elect a new leader
[Figure: the leader propagates state changes to the workers; switches send network events to the leader]
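The agreement step above can be sketched as single-decree Paxos. This is a minimal in-process illustration of the majority-agreement idea (class and function names are my own; real deployments use a library such as ZooKeeper rather than hand-rolled Paxos):

```python
# Single-decree Paxos sketch: N controllers agree on one value
# (here, who the leader is) once a majority accepts it.

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot promised so far
        self.accepted = None    # (ballot, value) accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """Phase 1: gather promises; Phase 2: ask a majority to accept."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) < majority:
        return None
    # Must adopt the highest-ballot value already accepted, if any.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]   # N = 5 controllers
leader = propose(acceptors, ballot=1, value="controller-2")
assert leader == "controller-2"
```

A later proposal with a stale (lower) ballot is rejected by every acceptor, which is what makes the chosen leader stable.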
Pros-Cons of Paxos
• Pros
– Well understood and studied; gives good fault tolerance
– Many implementations in the wild
• E.g., ZooKeeper
• Cons
– Time to recover after a failure
– Impacts throughput of the entire system
Controller Scalability
What limits a controller’s scalability?
• Number of control messages from switches
– Depends on the application logic
• E.g., MicroTE/Hedera periodically query all switches for stats
• A reactive controller, as evaluated in NOX, requires each switch to send messages for every new flow
– Packet-in (for reactive apps)
– Flow stats, flow time-outs
What limits a controller’s scalability?
• Application processing overhead
• The controller runs a bunch of applications
– Similar to a server running a set of programs
– CPU/memory constraints limit how the apps run
What limits a controller’s scalability?
• Distance between controller and the switches
[Figure: a single controller running Hedera, L3, and FW apps for all switches]
How to Scale the Controller
• Obvious: add more controllers
• BUT: what about the applications?
– Synchronization/concurrency problems
• Who controls which switch?
• Who reacts to which events?
[Figure: controllers 1..N, each running Hedera, L3, and FW; which one collects stats and installs OF entries for which switches?]
Medium Sized Networks
• Assumption:
– A controller can’t store all forwarding table entries in memory
– But it can process all events and run all apps
• Each controller:
– Gets the same network events + runs the same apps → same output
– But stores the output for, and configures, only a fraction of the switches
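The "store only a fraction" step needs every controller to agree on who owns which switch. One simple way, sketched below with hypothetical helper names, is a deterministic hash partition over switch IDs:

```python
# Sketch: every controller computes the full output, but each one
# keeps and installs entries only for the switches it owns.
import zlib

def owner(switch_id: str, num_controllers: int) -> int:
    # Stable hash (not Python's randomized hash()) so every
    # controller computes the same partition independently.
    return zlib.crc32(switch_id.encode()) % num_controllers

def install_my_fraction(my_id, num_controllers, full_output):
    """full_output: dict switch_id -> flow entries from the app.
    Returns only the entries for switches this controller owns."""
    return {sw: entries for sw, entries in full_output.items()
            if owner(sw, num_controllers) == my_id}

full = {"s1": ["rule-a"], "s2": ["rule-b"], "s3": ["rule-c"]}
parts = [install_my_fraction(i, 3, full) for i in range(3)]
# Every switch ends up owned by exactly one controller.
assert sorted(sw for p in parts for sw in p) == ["s1", "s2", "s3"]
```

Because the partition function is deterministic, no coordination messages are needed to divide the work, which is exactly what makes this approach cheap.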
[Figure: controllers 1..N each run Hedera, L3, and FW; each collects stats and installs OF entries for its own fraction of the switches]
Medium Sized Networks: HyperFlow
• Each controller:
– Publishes its state to every other controller via a publish-subscribe system
– Each controller thinks it’s the only one in the network
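The HyperFlow idea can be sketched with an in-process event bus (class names here are illustrative, not HyperFlow's actual implementation, which uses a distributed file system as its publish-subscribe channel): each controller publishes its local events, every controller replays all of them, and so all controllers converge on the same network-wide view.

```python
# Minimal publish-subscribe sketch of state propagation between
# controllers: replaying every published event gives each controller
# an identical network-wide view.

class EventBus:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        for cb in self.subscribers:
            cb(event)

class Controller:
    def __init__(self, name, bus):
        self.name = name
        self.view = []                 # replayed event log
        bus.subscribe(self.view.append)

    def local_event(self, bus, event):
        bus.publish(event)             # share local events with everyone

bus = EventBus()
c1, c2 = Controller("c1", bus), Controller("c2", bus)
c1.local_event(bus, "link-up: s1-s2")
c2.local_event(bus, "flow-in: s3")
assert c1.view == c2.view == ["link-up: s1-s2", "flow-in: s3"]
```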
Large Sized Networks
• Assumptions
– Each controller can’t store all the FIB entries
– Each controller can’t run the entire application or handle all events
• Need to partition the application
– But how?
Application partition 1
• Approach 1: each controller runs a specific application
– How do you resolve conflicts in forwarding entries?
– Apps can conflict in the rules they install
[Figure: Controller 1 runs Hedera, Controller 2 runs L3, Controller N runs FW]
Application partition 2
• Approach 2: all controllers run the same
application but for a subset of devices
– Results in a Distributed Mesh control plane
[Figure: controllers 1, 2, and N each run Hedera, L3, and FW for their own switches and exchange an abstract network view]
Application Partition 2
• Controllers exchange abstract views with each other
– The abstract view reduces the network information used by each controller
[Figure: the real network versus Controller 2’s view, which combines its own switches with abstractions provided by Controller 1 and Controller N]
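One common way to build such an abstract view, sketched below with a representation of my own choosing, is to collapse a controller's whole domain into a single "big switch" node before sharing it: internal links disappear, and border links re-attach to the abstract node.

```python
# Sketch of a topology abstraction: summarize a domain as one node,
# so peer controllers store and process far less of the network.

def abstract_domain(edges, domain, name):
    """edges: set of (u, v) links; domain: switches owned locally.
    Internal links vanish; border links re-attach to `name`."""
    out = set()
    for u, v in edges:
        u2 = name if u in domain else u
        v2 = name if v in domain else v
        if u2 != v2:                 # drop links internal to the domain
            out.add((u2, v2))
    return out

edges = {("s1", "s2"), ("s2", "s3"), ("s3", "x1")}
view = abstract_domain(edges, {"s1", "s2", "s3"}, "D1")
assert view == {("D1", "x1")}      # three switches became one node
```

The trade-off is precision: a peer can route toward "D1" but cannot make decisions about paths inside it.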
How to Deal with State + Concurrency Issues?
• Controllers synchronize through a DB or DHT
– So each app needs synchronization code
– How do you deal with concurrency?
– Each switch has a table/row in a DB
• The table/row reflects the switch’s state
• The programmer interacts directly with the DB
• Onix takes care of synchronization between the DB and the switch
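The division of labor above can be sketched as follows (class and method names are hypothetical, not the Onix API): the programmer only writes desired state into a per-switch row, and a sync layer pushes the diff down to the hardware.

```python
# Sketch of the Onix idea: apps touch only the DB row; a sync step
# reconciles desired state with actual switch state.

class SwitchRow:
    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.desired = {}   # written by applications
        self.actual = {}    # mirrors installed switch state

    def set_rule(self, match, action):
        self.desired[match] = action   # app never talks to the switch

    def sync(self, install_fn):
        """Push the desired-vs-actual diff to the switch."""
        for match, action in self.desired.items():
            if self.actual.get(match) != action:
                install_fn(self.switch_id, match, action)
                self.actual[match] = action

installed = []
row = SwitchRow("s1")
row.set_rule("dst=10.0.0.1", "fwd:port2")
row.sync(lambda sw, m, a: installed.append((sw, m, a)))
assert installed == [("s1", "dst=10.0.0.1", "fwd:port2")]
row.sync(lambda sw, m, a: installed.append((sw, m, a)))
assert len(installed) == 1          # no diff, so nothing re-installed
```

Keeping desired and actual state separate is what lets the platform, rather than each application, own the concurrency problem.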
ONIX to the SDN Programmer
• How to synchronize between domains?
• How many domains? How many controllers?
• How many switches in a domain?
Application partition 3
• Approach 3: divide applications into local and global
– Results in a hierarchical control plane
• Global controller and local controllers
– Applications that do not need network-wide state
• Can be run locally without communicating with other controllers
Are Hierarchical Controllers Feasible?
• Examples of local applications:
– Link discovery, learning switch, local policies
• Examples of local portions of a global algorithm:
– Data center traffic engineering
• Elephant flow detection (Hedera)
• Predictability detection (MicroTE)
• Local apps/controllers have other benefits:
– High parallelism
– Can be run closer to the devices
Kandoo: Hierarchical controllers
• 2 levels of controllers: global and local
– Local applications are embarrassingly parallel
– Local controllers shield the global controller from network events
[Figure: Kandoo hierarchy; a global controller on top, local controllers 1..N below, each managing its own switches]
Kandoo: Hierarchical controllers
• Local controllers: run local apps
– Return an abstract view to the global controller
– Reduces the number of events sent to the global controller and the size of the network it sees
Kandoo: Hierarchical controllers
• Global controller
– Runs global apps, i.e., apps that need network-wide state
Hedera Reminder
• Goal: reduce network contention
• Insight: contention happens when elephant flows share paths
• Solution:
– Detect elephant flows
– Place elephant flows on different paths
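The detection step can be sketched with a simple threshold. Hedera classifies a flow as an elephant once its demand reaches roughly 10% of the host link bandwidth; the exact capacity value and function names below are assumptions for illustration:

```python
# Elephant-detection sketch: flag flows whose estimated demand
# exceeds a fixed fraction of link capacity.

LINK_CAPACITY = 1_000          # Mbps, assumed host link speed
ELEPHANT_FRACTION = 0.1        # ~10% of link bandwidth

def elephants(flow_rates):
    """flow_rates: dict flow_id -> estimated demand in Mbps."""
    threshold = ELEPHANT_FRACTION * LINK_CAPACITY
    return {f for f, rate in flow_rates.items() if rate >= threshold}

flows = {"f1": 5, "f2": 250, "f3": 99, "f4": 400}
assert elephants(flows) == {"f2", "f4"}
```

Only the flows that cross the threshold need global placement decisions; the mice are left to default routing, which is what keeps the controller load manageable.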
Implementing Hedera in Onix
• Each controller runs the full Hedera app (detection + placement) for its own switches
• Controllers exchange traffic matrices + detection results with each other
[Figure: Controllers 1 and 2 each run Hedera detection + placement, exchange TM + detection results, collect stats, and install flow table entries on their own switches]
Implementing Hedera in Kandoo
• Local controllers: get stats from their switches + run elephant detection
• Global controller: decides flow placement + flow installation
[Figure: the global controller runs Hedera’s global placement and installs new flow table entries; local controllers 1..N run elephant detection, collect stats, and inform the global controller of elephant flows]
Implementing B4 in a Kandoo-like Architecture
• Site controllers: get stats from their switches + determine demands
• Global controller: calculates paths for traffic
[Figure: the global controller runs the TE + BW allocator with a TE DB and installs TE ops; site controllers 1..N collect stats, install OF entries, and inform the global controller of flow demands]
Kandoo to the SDN Programmer
• Think about what is local and what is global
– When apps are written, annotate local apps with a flag
• Kandoo will automatically place local apps
– And place global apps
• Kandoo restricts messages between global and local controllers
– You can’t send OF-style messages
– You must send Kandoo-style messages
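The annotate-and-place workflow above can be sketched with a decorator (the decorator and registry here are hypothetical, not Kandoo's real API): each app carries a locality flag, and a deployer reads the flag to decide which controller tier runs it.

```python
# Sketch of the Kandoo programming model: apps declare local vs
# global; placement follows from the annotation automatically.

APPS = {}

def app(name, local):
    """Register an application with a locality flag."""
    def register(fn):
        APPS[name] = {"handler": fn, "local": local}
        return fn
    return register

@app("elephant_detection", local=True)     # no network-wide state
def detect(stats):
    return [f for f, rate in stats.items() if rate >= 100]

@app("flow_placement", local=False)        # needs network-wide state
def place(elephant_flows):
    return {f: "alt-path" for f in elephant_flows}

def placement():
    """Which controller tier runs each registered app."""
    return {name: ("local" if meta["local"] else "global")
            for name, meta in APPS.items()}

assert placement() == {"elephant_detection": "local",
                       "flow_placement": "global"}
```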
Summary
• Centralization provides simplicity at the cost of reliability and scalability
• Replication can improve reliability and scalability
• For reliability, Paxos is an option
• For scalability, divide and conquer:
– Partition the applications
• Kandoo: local apps and global apps
– Partition the network
• Onix: each controller controls a subset of switches (a domain)