Routing problems are easy to cause, and hard to diagnose Jennifer Rexford

Routing problems are easy to cause,
and hard to diagnose
(“Happy operators make happy packets”)
Jennifer Rexford
AT&T Labs—Research
http://www.research.att.com/~jrex
How Can You Sleep at Night When…
• A single typo can bring down your network
– Someone else’s typo can bring you down
– Changing config makes your heart skip a beat
– … ‘cause you can’t tell what might happen
– … and whether it’s a career-limiting move
• The routing system discards your packets
– And you can’t even figure out why
– Or whose fault it is
– Or how to fix it
– Or if it might just go away on its own
Configuring Routing Protocols is Hard
• Primitive configuration languages
– Thousands of assembly-language commands
• Many protocols and tunable options
– Weights, areas, timers, filters, policies, …
• Subtle interactions between protocols
– Hot potato, route injection, routing control traffic, …
• Complex techniques for achieving scalability
– Route reflectors, route aggregation, summarization, …
• Network configured at the element level
– Configuring individual boxes not entire network
• Indirect ways of achieving operations goals
– E.g., TE by tweaking IGP weights and BGP policies
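To make the last bullet concrete, here is a minimal sketch of how traffic engineering by weight tweaking works indirectly: the operator never names a path, only changes a link weight and lets shortest-path routing recompute. The four-router topology, router names, and weights are hypothetical, and plain Dijkstra stands in for a real IGP.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over a dict {node: {neighbor: weight}}; returns (cost, path) or None."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


# Hypothetical four-router topology with symmetric IGP weights.
topology = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}

before = shortest_path(topology, "A", "D")

# "Traffic engineering": raise the weight of link B-D to push traffic off it.
topology["B"]["D"] = 5
topology["D"]["B"] = 5
after = shortest_path(topology, "A", "D")

print(before)  # (2, ['A', 'B', 'D'])
print(after)   # (4, ['A', 'C', 'D'])
```

The goal ("move A-to-D traffic off link B-D") appears nowhere in the configuration; only its side effect on one weight does, which is part of why such changes are hard to reason about.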
Troubleshooting Routing Problems is Hard
• Problems can arise from outside of your network
– E.g., bogus advertisements, weird filtering, etc.
• Route filtering and aggregation are tricky
– … leading to black holes, forwarding loops,…
• We don’t know the Internet topology
– … and perhaps we’ll never, ever know (sigh)
• We don’t have good tools for probing the paths
– E.g., traceroute has many known limitations
• Routing protocols aren’t all that chatty
– … they don’t say why a router changed its mind
• The routing isn’t always the system to blame
– E.g., MTU mismatch, packet filters, congestion
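As an illustration of how aggregation can create a black hole, the sketch below models two routers with longest-prefix-match lookup. The prefixes, router roles, and next-hop names are hypothetical; the point is only that an aggregate can promise reachability that the more specific routes behind it do not deliver.

```python
import ipaddress


def lookup(table, destination):
    """Longest-prefix match: next hop of the most specific covering route, or None."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table.items() if address in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical upstream router: it only hears the aggregate from the border router.
upstream = {ipaddress.ip_network("10.1.0.0/16"): "border"}

# Hypothetical border router: it announces 10.1.0.0/16 but only has
# routes to two of the /24s inside that aggregate.
border = {
    ipaddress.ip_network("10.1.1.0/24"): "site-1",
    ipaddress.ip_network("10.1.2.0/24"): "site-2",
}


def trace(destination):
    """Follow a packet from the upstream router; None means it was dropped."""
    hops = [lookup(upstream, destination)]
    if hops[-1] == "border":
        hops.append(lookup(border, destination))
    return hops


print(trace("10.1.1.7"))  # ['border', 'site-1']
print(trace("10.1.3.7"))  # ['border', None]  (black hole)
```

From the upstream router's point of view both destinations look equally reachable. If the border router also carried a default route pointing back upstream, the second packet would loop between the two routers instead of being dropped.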
Fixing the Problem?
• Better router configuration languages
– Higher level of abstraction, vendor independent
• Joining data together to aid detection
– Multiple vantage points, multiple data types
• Good anomaly-detection algorithms
– Based on good underlying models of routing
• Better router support for routing measurement
– Forwarding path, routing protocol messages, etc.
• Distributed platform for debugging problems
– Partial diagnosis with scalability and information hiding
• Routing protocol extensions (and “do overs”)
– Design for diagnosability and verifiability
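A rough sketch of what "joining data together" from multiple vantage points could look like: group route observations by prefix and flag any prefix that different vantage points see with different origin ASes. The record format, vantage-point names, and the function name `conflicting_origins` are assumptions for illustration; the prefixes and AS numbers come from the ranges reserved for documentation.

```python
from collections import defaultdict

# Hypothetical observations: (vantage_point, prefix, origin_as).
observations = [
    ("vp1", "192.0.2.0/24", 64496),
    ("vp2", "192.0.2.0/24", 64496),
    ("vp3", "192.0.2.0/24", 64511),
    ("vp1", "198.51.100.0/24", 64500),
    ("vp2", "198.51.100.0/24", 64500),
]


def conflicting_origins(records):
    """Return {prefix: {origin_as: [vantage points]}} for prefixes with more than one origin AS."""
    seen = defaultdict(lambda: defaultdict(set))
    for vantage_point, prefix, origin_as in records:
        seen[prefix][origin_as].add(vantage_point)
    return {
        prefix: {asn: sorted(vps) for asn, vps in origins.items()}
        for prefix, origins in seen.items()
        if len(origins) > 1
    }


suspects = conflicting_origins(observations)
print(suspects)  # {'192.0.2.0/24': {64496: ['vp1', 'vp2'], 64511: ['vp3']}}
```

No single vantage point sees the disagreement; it only appears once the views are joined. A conflict is a hint rather than proof of a problem, since some prefixes are legitimately originated by more than one AS.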
My Position: This is Really Pathetic!
• Two problems needing attention
– Configuring the routing protocols
– Debugging the routing problems
• This moves us beyond
– Characterizing lots of measurement data
– Bottom-up solutions to various problems
• … toward the holy grail of
– Greater abstraction of the network design
– Routing protocol design for manageability
– A well-behaved communication infrastructure