Traffic-aware Inter-Domain Routing for Improved Internet Routing Stability
Zhenhai Duan
Florida State University

Outline
• Introduction and Background
• Motivation and Intuition
• Traffic-Aware Inter-Domain Routing (TIDR)
• Performance Studies
• Summary

Introduction and Background
• The Internet consists of a large number of network domains
  – Also called Autonomous Systems (ASes)
  – Currently about 26K
  – Exchange network prefix reachability information using BGP
• In a system this big, things happen all the time
  – Fiber cuts, equipment outages, operator errors
• Direct consequences for the routing system
  – Large number of BGP updates exchanged between ASes
  – Re-computing and propagating best routes
  – Events may propagate through the entire Internet
• Effects on user-perceived network performance
  – Long network delay, packet loss, even loss of network connectivity

Introduction and Background
• Implicit design assumption in BGP
  – Failure events are of the same importance to all users
• No explicit mechanisms in BGP to localize failures
• Internet global reachability == global propagation of failure
  – Is this valid?
  – A user (AS) in the US may not be interested in a failure in an Asian country
• The design of BGP fails to recognize two Internet properties
  – Internet access non-uniformity
  – Prevalence of transient failures

Motivation and Intuition
• Internet access non-uniformity
  – ARPANET (1970, Kleinrock and Naylor)
    • Top 12.6% responsible for 90% of traffic
  – NSFNET (1980, Rekhter and Chinoy)
    • Top 10% responsible for 85% of traffic
  – Fang and Peterson (1999), and Rexford (2002)
    • Non-uniform distribution of Internet traffic
• Model on network value [IEEE Spectrum 2006]
  – Zipf's law

Internet Access Non-Uniformity
• FSU study
  – Examines whether Internet access locality holds from the viewpoint of an edge network
  – Bidirectional data traffic collected at the FSU border router for 16 days

FSU Data Traffic on Other Days

BGP Updates (RouteViews Project)
Most updates are for the rest of the prefixes
Only a few updates relate to the top prefixes at FSU

Motivation and Intuition
• Prevalence of transient failures
  – Sprint backbone measurement (2002)
    • 50% < 1 minute
    • 80% < 10 minutes
    • 90% < 20 minutes
  – BGP misconfigurations
    • 50% of misconfigurations lasted less than 10 minutes
The majority of network failures are transient

Motivation and Intuition
• Internet access non-uniformity
  – Users (networks) normally communicate with a small set of other network domains
• Prevalence of transient failures
  – The majority of network failures on the Internet are transient
• Together, these two properties motivate TIDR

Traffic-aware Inter-Domain Routing (TIDR)
• Each prefix is classified as either significant or insignificant
  – At AS v, with respect to neighbor n
• Propagation of significant and insignificant prefixes is treated differently
  – BGP updates for significant prefixes are propagated with high priority
  – Propagation of BGP updates for insignificant prefixes is aggressively slowed down
• Localizing the effect of transient failures on insignificant prefixes
  – Hold propagation of transient failures if a valid alternative route exists
• BGP withdrawals are always propagated
[Figure labels: v, n, Significant, Insignificant]

TIDR Timers
[Figure label: Recovery]
[Figure labels: AS; MRAI timer, 15/30 sec.; TIDR timer, 10 min.]

TIDR Design
• How to avoid traffic black holes?
  – If the alternative route held by the timer is invalid, the node becomes a black hole that drops all packets it receives
  – Utilize Root Cause Information (RCI)
    • Similar to EPIC and RCN
    • Flushes out all local invalid alternative routes
    • The alternative route chosen is guaranteed to be valid
• How to avoid slow propagation of long-term failures of insignificant prefixes?
  – If not designed carefully, every node would hold propagation of the BGP update
  – Only one node applies the TIDR timer to insignificant prefixes
    • Nodes neighboring the failure
    • First node to have a valid alternative route

TIDR Algorithm

Performance Studies
• Used the simBGP simulator
• With both clique and Waxman random network topologies
• Simulated both link fail-down and fail-over events
  – Only dummy nodes announce prefixes
    • 20% significant, 80% insignificant
  – Link failures
    • 20% long-term, 80% transient
• Settings
  – Link delay: random, from 0.01 to 0.1 seconds
  – Processing delay: random, from 0.001 to 0.01 seconds
  – MRAI timer: 30 seconds
  – TIDR timer: 10 minutes

Fail-down Events

Fail-Over Events

Summary and On-going Work
• TIDR: Traffic-aware Inter-Domain Routing
  – Capitalizes on two important properties
    • Internet access non-uniformity
    • Prevalence of transient failures
  – Differentiated BGP update propagation for significant and insignificant prefixes
    • Updates for significant prefixes are propagated with higher priority
    • Propagation of updates for insignificant prefixes is aggressively slowed down
• Performed simulation studies
  – Outperforms BGP and other existing enhancements
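
Illustrative sketch (not from the slides)
The differentiated propagation rule summarized in the deck can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TIDR algorithm itself. The function names `significant_prefixes` and `hold_seconds`, the 90% traffic-coverage threshold, the zero delay for withdrawals, and the choice to pace non-held updates with the MRAI timer are all hypothetical. The RCI-based validity check is reduced to a boolean flag supplied by the caller, and the rule that only one node applies the TIDR timer is not modeled.

```python
MRAI_SECONDS = 30        # MRAI timer value from the simulation settings
TIDR_SECONDS = 10 * 60   # TIDR timer value from the simulation settings


def significant_prefixes(traffic_bytes, coverage=0.9):
    """Pick the top prefixes that together carry `coverage` of the traffic.

    The 0.9 default is an assumption for illustration, not a value from the slides.
    """
    total = sum(traffic_bytes.values())
    ranked = sorted(traffic_bytes.items(), key=lambda item: item[1], reverse=True)
    chosen, carried = set(), 0
    for prefix, volume in ranked:
        if carried >= coverage * total:
            break
        chosen.add(prefix)
        carried += volume
    return chosen


def hold_seconds(prefix, significant, is_withdrawal, has_valid_alternative):
    """Return how long an update for `prefix` is held before it is propagated."""
    if is_withdrawal:
        return 0              # withdrawals are always propagated (zero delay is assumed)
    if prefix in significant:
        return MRAI_SECONDS   # significant prefix: ordinary BGP pacing only (assumed)
    if has_valid_alternative:
        return TIDR_SECONDS   # insignificant prefix with a valid backup: hold the update
    return MRAI_SECONDS       # no valid alternative route: do not hold (assumed)


# Hypothetical per-prefix traffic volumes observed at an edge network.
traffic = {
    "10.0.0.0/8": 700,
    "192.0.2.0/24": 250,
    "198.51.100.0/24": 30,
    "203.0.113.0/24": 20,
}
sig = significant_prefixes(traffic)
for p in traffic:
    print(p, hold_seconds(p, sig, is_withdrawal=False, has_valid_alternative=True))
```

With these made-up volumes, the two largest prefixes are classified as significant and wait only for the MRAI timer, while updates for the remaining prefixes are held for the TIDR timer as long as a valid alternative route exists.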