Internet Routing (COS 598A) Jennifer Rexford Today: Intradomain Routing Convergence Tuesdays/Thursdays 11:00am-12:20pm

Internet Routing (COS 598A)
Today: Intradomain Routing Convergence
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00am-12:20pm
General Course Stuff
• Tuesday March 1
– Lecture by Larry Peterson on PlanetLab
– No assignment for written reviews of papers
– I’ll be attending a Computing Research Association
(CRA) Board of Directors meeting in D.C.
• Course projects
– Good to start thinking up a project idea
– Make an appointment to chat with me
– Written report due on Dean’s Date, Tue May 10
– Oral presentations during exam period
• No formal “leading a class discussion” item
Splitting Convergence into Two Lectures
• One class is just not enough
– Today on intradomain routing
– Next Thursday on interdomain routing
• Impact on reading you already did
– One paper on intradomain, one on interdomain
– Though class will focus just on intradomain
• Impact on reading for Thursday March 3
– One paper on intradomain, one on interdomain
– Though class will focus just on interdomain
• Push off the next topic to following class
Outline
• Routing-protocol convergence
– Steps in reacting to a link failure
– Effects on the data packets
• Intradomain routing protocols
– Steps in protocol convergence
– Implementation overhead and timers
• Operational practices to reduce convergence delay
– Multiple shortest paths between routers
– Costing out during planned maintenance
Routing Convergence
• The only constant is change
– Equipment failures, or new deployment
– Routing-protocol configuration changes
– Planned maintenance on the network
• Routing protocols adapt
– Detect the change
– Propagate messages
– Compute new routes
– Update the forwarding tables
Converging After a Failure
• Failure detection
– Router recognizes an incident link has failed
• Failure notification
– Router informs other routers about the change
• Path re-computation
– Routers compute new paths avoiding the link
• Forwarding-table update
– Routers update their forwarding tables
– Data traffic starts to flow over the new path
[Slide annotations: “Routing convergence” and “Forwarding convergence” mark the phases above]
Bad Things Happen During Convergence
• Transient inconsistencies
– Routers have different views of the network
– Forwarding decisions may be inconsistent
• Effects on data traffic
– Black-hole: packet loss
– Loops: packets going in circles
– Delay: packets going on very long paths
– Out-of-order: new packets arrive before old ones
• Want to minimize convergence delay
– … and especially the effects on the data traffic
Example: Black-hole Causing Packet Loss
• Router forwarding to dead link
– Doesn’t know (yet) that the link is dead
– Or, hasn’t computed a new forwarding entry
[Figure: packets from source s toward destination d are dropped at the dead link]
Fortunately, IP only promises “best effort” delivery!
Example: Forwarding Loop
• Set of routers disagree
– One router acting on old information
– Another router acting on new information
[Figure: forwarding loop on the path from source s to destination d]
Intradomain Routing Convergence
Interior Gateway Protocols (IGPs)
• Routers running OSPF or IS-IS:
– Flood link-state advertisements (LSAs)
– Compute shortest paths from link weights
– Determine “next hop” to other routers…
[Figure: example topology with link weights]
Knowing a Link is Dead: Heart-Beats
• Periodic “hello” packets (hello_interval, 10 sec)
– Timeout if not received (dead_interval, 40 sec)
– Declare failure and flood the info to others
• Small values lead to faster detection, but also:
– Higher bandwidth consumption for “hellos”
– False detection during periods of congestion
– False detection if router CPU falls a little behind
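The hello/dead-interval mechanism can be sketched in a few lines. This is a minimal illustration, not any router's actual implementation: the `NeighborMonitor` class and its method names are hypothetical, and only the 10 sec and 40 sec values come from the slide.

```python
from dataclasses import dataclass, field

HELLO_INTERVAL = 10.0  # seconds between hellos (value from the slide)
DEAD_INTERVAL = 40.0   # seconds of silence before declaring a failure


@dataclass
class NeighborMonitor:
    """Hypothetical helper: remembers when each neighbor was last heard."""
    dead_interval: float = DEAD_INTERVAL
    last_heard: dict = field(default_factory=dict)

    def on_hello(self, neighbor: str, now: float) -> None:
        # Every hello received resets that neighbor's silence timer.
        self.last_heard[neighbor] = now

    def dead_neighbors(self, now: float) -> list:
        # A neighbor is declared dead once it has been silent
        # for at least dead_interval seconds.
        return sorted(n for n, t in self.last_heard.items()
                      if now - t >= self.dead_interval)


monitor = NeighborMonitor()
monitor.on_hello("B", now=0.0)
monitor.on_hello("C", now=0.0)
# C keeps sending a hello every HELLO_INTERVAL; B goes silent after t=0.
for t in (10.0, 20.0, 30.0):
    monitor.on_hello("C", now=t)

print(monitor.dead_neighbors(now=39.0))  # []
print(monitor.dead_neighbors(now=40.0))  # ['B']
```

Shrinking `dead_interval` detects B's failure sooner, but a few hellos delayed by congestion or a busy CPU would then be enough to trigger a false detection.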
Knowing the Link is Dead: Interface Support
• Smart interface hardware
– Detects loss of connectivity at lower layer
– Interrupts the router CPU about the failure
– Common in Packet Over SONET technology
– E.g. Sprint paper sees delays less than 100 msec
• But…
– Some media don’t support it (e.g., Ethernet, ATM)
– … so, you often need heartbeats anyway
– Also, want heartbeats to detect failures the
hardware cannot detect on its own
Flooding the Link-State Advertisement
• After detecting the failure
– Router sends LSA out each link
– Each router does the same
– … and so on
• Flooding delay
– (CPU delay at each hop) * (diameter of the network)
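As a rough illustration of the flooding-delay estimate, the sketch below floods an LSA hop by hop over a small made-up topology. It assumes every router adds the same fixed processing delay and ignores link propagation delay. The topology, the 10 ms figure, and the function name `flood_times` are all assumptions for the example.

```python
from collections import deque


def flood_times(adjacency, origin, per_hop_delay_ms):
    """Return the time (ms) at which each router first receives the LSA."""
    arrival = {origin: 0}
    queue = deque([origin])
    while queue:
        router = queue.popleft()
        for neighbor in adjacency[router]:
            if neighbor not in arrival:  # later copies of the LSA are ignored
                arrival[neighbor] = arrival[router] + per_hop_delay_ms
                queue.append(neighbor)
    return arrival


# Hypothetical five-router topology; its diameter is 3 hops (A to E).
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
times = flood_times(adjacency, origin="A", per_hop_delay_ms=10)
print(times)                # {'A': 0, 'B': 10, 'C': 10, 'D': 20, 'E': 30}
print(max(times.values()))  # 30
```

Here the last router hears about the change after 3 hops * 10 ms = 30 ms, which matches (CPU delay at each hop) * (diameter of the network) because A sits at one end of a longest shortest path.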
Computing the Shortest Paths
• Each router re-computes
– Shortest-path tree rooted at this router
– Determine next-hop to every other router
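A minimal sketch of this step, using Dijkstra's algorithm to build the shortest-path tree rooted at one router and record the next hop toward every destination. The four-router graph and its weights are made up for illustration and are not the topology from the slide.

```python
import heapq


def shortest_path_next_hops(graph, source):
    """Dijkstra from `source`; returns (distances, next hop per destination)."""
    dist = {source: 0}
    next_hop = {}
    heap = [(0, source, None)]  # (distance, node, first hop on the path)
    done = set()
    while heap:
        d, node, first = heapq.heappop(heap)
        if node in done:
            continue  # stale entry: a shorter path was already finalized
        done.add(node)
        if first is not None:
            next_hop[node] = first
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if neighbor not in done and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                # Neighbors of the source are their own first hop;
                # everyone else inherits the first hop of its parent.
                hop = neighbor if node == source else first
                heapq.heappush(heap, (nd, neighbor, hop))
    return dist, next_hop


# Hypothetical topology: graph[u][v] is the weight of link u-v.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
dist, next_hop = shortest_path_next_hops(graph, "A")
print(dist)      # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
print(next_hop)  # {'B': 'B', 'C': 'B', 'D': 'B'}
```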
[Figure: two diagrams of an example network with routers A through J and link weights]
Reducing the Computational Overhead
• Good system
– Fast processor
– High-speed memory
• Good algorithms
– Traditional approach computes from scratch
– Incremental algorithms compute only the changes
– Especially nice if only one edge changes
• Pre-computation
– Pre-compute effects of certain failure scenarios
– E.g., all single-link or single-router failures
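One way to picture the pre-computation idea: before anything fails, a router runs the shortest-path computation once per link with that link removed, and stores the resulting next hops. The sketch below does this by brute force from scratch. It does not show an incremental algorithm, and the graph, weights, and function names are assumptions for illustration.

```python
import heapq


def next_hops(graph, source, skip_link=None):
    """Dijkstra first hops from `source`, ignoring the undirected `skip_link`."""
    dist, first_hop, heap = {source: 0}, {}, [(0, source, source)]
    while heap:
        d, node, first = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        if node != source:
            first_hop[node] = first
        for nbr, w in graph[node].items():
            if skip_link in ((node, nbr), (nbr, node)):
                continue  # pretend this link has failed
            if d + w < dist.get(nbr, float("inf")):
                dist[nbr] = d + w
                hop = nbr if node == source else first
                heapq.heappush(heap, (d + w, nbr, hop))
    return first_hop


def precompute_single_link_failures(graph, source):
    """Map each link to the next hops `source` would use if that link failed."""
    links = {tuple(sorted((a, b))) for a in graph for b in graph[a]}
    return {link: next_hops(graph, source, skip_link=link)
            for link in sorted(links)}


# Hypothetical topology: graph[u][v] is the weight of link u-v.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
backup = precompute_single_link_failures(graph, "A")
print(next_hops(graph, "A"))  # {'B': 'B', 'C': 'B', 'D': 'B'}
print(backup[("A", "B")])     # next hops once link A-B is gone: all via C
```

When link A-B actually fails, router A can install `backup[("A", "B")]` directly instead of waiting for a fresh computation.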
Updating the Forwarding Table
• Forwarding table
– Map destination prefix to outgoing link(s)
– Copy of table on each interface card
– Highly optimized for fast lookups
• Updating the forwarding table
– Computing the new forwarding table
– Making updates to the copy on each line card
• Important source of delay
– Sprint end-to-end study: around 1 second
– AT&T router-level study: 100 msec – 300 msec
All Together: Looking Inside the Router
[Figure: inside the router. The route processor (CPU) runs the OSPF process: LSA processing, LSA flooding, topology view, SPF calculation, and FIB update. Each interface card holds a copy of the FIB and forwards data packets across the switching fabric; LSAs and LS Acks pass between the interface cards and the route processor.]
Significance of Protocol Timers
• Hello and dead intervals
– Failure-detection delay vs. false diagnosis
• Pacing the link-state advertisements
– Combining LSAs vs. longer convergence delay
– Some routers wait till after re-running Dijkstra!
• Delaying start of shortest-path computation
– Reducing # computations vs. convergence delay
– Especially useful if failure affects multiple links
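The trade-off behind delaying the shortest-path computation can be seen with a toy model. The sketch assumes a simple fixed-delay rule: the first LSA to arrive starts a timer, and one SPF run at expiry covers every LSA received in the meantime. Real routers use more elaborate (often adaptive) timers; the function name and the arrival times are made up.

```python
def schedule_spf(lsa_arrivals_ms, spf_delay_ms):
    """Return the times (ms) at which SPF runs under a fixed-delay model."""
    runs = []
    pending_deadline = None  # time of the SPF run already scheduled, if any
    for t in sorted(lsa_arrivals_ms):
        if pending_deadline is not None and t > pending_deadline:
            # The scheduled run fired before this LSA arrived.
            runs.append(pending_deadline)
            pending_deadline = None
        if pending_deadline is None:
            # First LSA of a new batch starts the delay timer.
            pending_deadline = t + spf_delay_ms
    if pending_deadline is not None:
        runs.append(pending_deadline)
    return runs


# Three LSAs from one multi-link failure, then an unrelated event later.
arrivals = [0, 5, 12, 300]
print(schedule_spf(arrivals, spf_delay_ms=0))   # [0, 5, 12, 300]
print(schedule_spf(arrivals, spf_delay_ms=50))  # [50, 350]
```

With no delay the router runs SPF four times. With a 50 ms delay it runs SPF twice, but the first new routes are available at 50 ms instead of at 0 ms.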
Operational Practices
Reducing the Effects of Convergence
• Long convergence delay is bad
– Transient problems with loss and delay
– Disruptive for VoIP and online gaming
• Solution #1: better equipment
– Interfaces that detect failures automatically
– Cranking down the values of the timers
– Faster CPUs and path-computation algorithms
• Solution #2: network design and operation
– Improve forwarding-plane convergence
– Improve convergence during maintenance
Equal-Cost Multi-Path (ECMP)
• Multiple shortest paths
– Router can compute multiple shortest paths
– Forwarding table has multiple outgoing links
– Router splits traffic evenly over the links
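A sketch of how a router might compute every equal-cost next hop and then spread traffic over them. The shortest-path part is Dijkstra extended to keep ties. The per-flow hashing in `pick_next_hop` is one common way to split traffic and is an assumption here; the slide only says the split is even. The topology and names are hypothetical.

```python
import heapq
import zlib


def ecmp_next_hops(graph, source):
    """For each destination, the set of next hops lying on some shortest path."""
    dist = {source: 0}
    hops = {source: set()}
    heap = [(0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nbr, w in graph[node].items():
            nd = d + w
            first = {nbr} if node == source else hops[node]
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                hops[nbr] = set(first)   # strictly better path: replace
                heapq.heappush(heap, (nd, nbr))
            elif nd == dist[nbr] and nbr not in done:
                hops[nbr] |= first       # tie: keep both sets of next hops
    del hops[source]
    return hops


def pick_next_hop(candidates, flow_id):
    """Assumed policy: hash the flow so its packets stay on one path."""
    ordered = sorted(candidates)
    return ordered[zlib.crc32(flow_id.encode()) % len(ordered)]


# Hypothetical "diamond": two equal-cost paths from S to D.
graph = {
    "S": {"A": 1, "B": 1},
    "A": {"S": 1, "D": 1},
    "B": {"S": 1, "D": 1},
    "D": {"A": 1, "B": 1},
}
table = ecmp_next_hops(graph, "S")
print(sorted(table["D"]))  # ['A', 'B']
print(pick_next_hop(table["D"], "10.0.0.1->10.0.0.2"))
```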
[Figure: example topology with link weights]
ECMP Reduces Forwarding-Plane Convergence
• Suppose one of the outgoing links fails
– Incident router detects the failure
– Quick recomputation of paths without this link
– Local forwarding table updated to use other link
– Other routers have no forwarding-table change!!!
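The local nature of the repair can be sketched as a purely local table edit. This simplifies the slide, which has the incident router recompute paths without the link: here the router just drops the failed link from any entry that still has an alternative and flags the rest for a full recomputation. The table contents and link names are hypothetical.

```python
def local_repair(fib, failed_link):
    """Drop `failed_link` from each entry; report entries left with no link."""
    repaired, needs_recompute = {}, []
    for dest, links in fib.items():
        remaining = links - {failed_link}
        if remaining:
            repaired[dest] = remaining     # still reachable over another link
        else:
            needs_recompute.append(dest)   # only path used the failed link
    return repaired, needs_recompute


# Hypothetical forwarding table: destination -> set of outgoing links.
fib = {
    "D": {"to_A", "to_B"},  # two equal-cost shortest paths
    "A": {"to_A"},
    "B": {"to_B"},
}
repaired, needs_recompute = local_repair(fib, "to_A")
print(repaired)         # {'D': {'to_B'}, 'B': {'to_B'}}
print(needs_recompute)  # ['A']
```

Traffic to D keeps flowing over the surviving link without any other router changing its forwarding table. Only destination A, which had a single path, has to wait for the usual convergence.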
[Figure: example topology with link weights. Only the red router changes its forwarding table!]
Exploiting This Observation in Traffic Engineering
• Traffic engineering
– Given a topology and a traffic matrix
– … set link weights to control the flow of traffic
– … to minimize some objective function
• Bias toward solutions with “ties”
– Penalize solutions with just one shortest path
– Favor solutions that lead to multiple paths
– … even if the link loads are a little less balanced
• Applied in some traffic-engineering tools
– Demand from ISPs buying the tools
– … with customers demanding fast convergence
Examples of Planned Failures
• Upgrades
– Changing link to higher capacity
– Loading new operating system on a router
– Swapping out an old interface card
• Maintenance
– Fixing a flaky optical amplifier
– Configuration changes that require a reboot
• Cable intrusions
– Construction activities near a fiber
Planned Events Happen Often
• Sprint study
– Maintenance window
• From 10pm to 6am EST, covering east to west
• Period of low network traffic, so less congestion
• Not much business-critical traffic
– Responsible for 50% of intradomain failures
• Significance
– Planned events should be easier to handle
– The operator knows the failure(s) will happen
– … but, how to tell the routing protocol?
– … or, how to prepare the network in advance?
“Costing Out” of Equipment
• Increase cost of link to high value
– Triggers immediate flooding of LSAs
• Leads to new shortest paths avoiding the link
– While the link still exists to forward during convergence
• Then, can safely disconnect the link
– New flooding of LSAs, but no influence on forwarding
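A small sketch of why costing out works: once the link weight is raised high enough, the shortest paths already avoid the link, so removing it afterwards changes nothing. The topology, the weight of 1000, and the helper names are assumptions for illustration.

```python
import copy
import heapq


def shortest_path(graph, src, dst):
    """Uniform-cost search; returns (cost, path) or None if unreachable."""
    heap = [(0, [src])]
    seen = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph[node].items():
            if nbr not in seen:
                heapq.heappush(heap, (cost + w, path + [nbr]))
    return None


def with_weight(graph, a, b, weight):
    """Copy of `graph` with link a-b set to `weight` (None removes the link)."""
    g = copy.deepcopy(graph)
    for u, v in ((a, b), (b, a)):
        if weight is None:
            del g[u][v]
        else:
            g[u][v] = weight
    return g


# Hypothetical topology: graph[u][v] is the weight of link u-v.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
costed_out = with_weight(graph, "B", "C", 1000)  # step 1: raise the weight
removed = with_weight(graph, "B", "C", None)     # step 2: disconnect

print(shortest_path(graph, "A", "D"))       # (4, ['A', 'B', 'C', 'D'])
print(shortest_path(costed_out, "A", "D"))  # (5, ['A', 'C', 'D'])
print(shortest_path(removed, "A", "D"))     # (5, ['A', 'C', 'D'])
```

The path changes when the weight is raised, while link B-C can still carry packets during convergence. The later disconnect triggers new LSAs but leaves the path untouched.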
[Figure: example topology with link weights]
Bigger Picture
• Learn about a planned event
– E.g., replace optical amplifier
• Map the event to the IP equipment
– E.g., find link(s) that traverse the amplifier
• Increase the weight on each link
– Slowly, perhaps one at a time to reduce overhead
• Disable the equipment
– Disconnect amplifier and replace with new one
• Reintroduce the links into the network
– Slowly, change one link weight at a time
Even Bigger Picture
• What if maintenance would cause congestion?
– Reducing the capacity of the network
– Link weights not optimized to new topology
• Compute weight changes to make
– Re-optimize the setting of the link weights
– … based on the soon-to-be new topology
• Then, do the maintenance
– Cost out the IP links
– Fix/upgrade the equipment
– Cost in the IP links
• Then, go back to the old weight setting
Project Ideas
• Multi-path routing
– Protocols that allow more multi-path routing
– … not just the equal-cost paths (as in ECMP)
• Maintenance schedules
– Compute a sequence of weight changes
– Avoid link congestion in each step
• Convergence models
– What actually happens during convergence?
– Simulation of forwarding-plane behavior
• Effective pre-computation on routers
– Routers precompute reactions to certain failures
– E.g., all single-link failures or single-router failures
For Next Time, on Tuesday: PlanetLab
• Guest lecture
– Professor Larry Peterson
• Three papers (two short, one regular)
– “A Blueprint for Introducing Disruptive Technology
into the Internet”
– “Overcoming the Internet Impasse through
Virtualization”
– “Operating System Support for Planetary-Scale
Network Services”
• No written reviews
– But, be ready to ask hard questions
• PlanetLab is very useful for course projects
Next Thursday: Interdomain Convergence
• Two papers (intradomain and interdomain)
– “Experience in Black-box OSPF Measurement”
– “Route Flap Damping Exacerbates Internet
Routing Convergence”
• Written reviews
– Summary
– Reasons to accept
– Reasons to reject
– Avenues for future work
• Optional
– NANOG video about the second paper
– Really great essay on “You and Your Research”