Craig Labovitz, G. Robert Malan, Farnam Jahanian, "Internet Routing Instability." IEEE/ACM Transactions on Networking, 6(5):515-528, 1998.
Craig Labovitz, G. Robert Malan, Farnam Jahanian, "Origins of Internet Routing Instability." IEEE INFOCOM 1999.
Craig Labovitz, Abha Ahuja, Farnam Jahanian, "Experimental Study of Internet Stability and Wide-Area Backbone Failures." FTCS 1999.

Internet Routing Instability
Three Papers
Presented by Michael A. Smith
Spring 2006

Background
- Events: the NSFNet backbone ended in April ’95.
- Evident: network degradation, bandwidth shortages, lack of router switching capacity.
- “Death of Internet is Imminent” reported by the popular press.
- Routing instability (“route flaps”) is informally defined as “the rapid change of network reachability and topology information.”

The Internet Backbone
- 12 large tier-one ISPs.
- 4000-6000 tier-two providers.
- Large public exchange points are considered the “core” of the Internet.
- Backbone service providers must maintain a complete map, or default-free routing table.
- The Internet is divided into regions of administrative control called autonomous systems (AS’s).
- Most AS’s exchange routing information through the Border Gateway Protocol (BGP).

Routing Instability
- Origins:
  - Router configuration errors
  - Transient physical and data link problems
  - Software bugs
- Effects:
  - Poorer end-to-end network performance
  - Degradation of the overall efficiency of the Internet infrastructure

Route Flaps
- Result in a large number of routing updates passed to core Internet exchange point routers.
- Network instability spreads from router to router and propagates throughout the network.
- Effects on the Internet infrastructure:
  - Increased packet loss
  - Delays in the time for network convergence
  - Resource overhead (CPU, memory, etc.)

BGP
- An incremental protocol.
- Unlike IGPs such as IGRP and OSPF, it does not periodically flood an intra-domain network with topological information or link-state entries.
- Sends update information only upon changes in topology or policy.
- Uses TCP as its underlying transport mechanism, rather than building its own reliability on top of a datagram service.
- As a path vector routing protocol, it limits the distribution of reachability information.

Routing on the Backbone
- Path: the sequence of intermediate AS’s between source and destination routers that forms a directed route for packets to travel.
- Router configuration files allow the stipulation of routing policies, which may:
  - specify the filtering of specific routes
  - modify path attributes before sharing
- Policy decisions can be made based on:
  - announcement of routes from peers
  - attributes of announced routes (such as MEDs)
- After each router makes a new local decision on the best route to a destination, it sends that route to its peers.
- As the route propagates, each AS adds its unique number to the route’s ASPATH, which, in conjunction with the prefix, provides a specific handle for transit.
- The ASPATH mechanism allows a router to detect and prevent routing loops.

Routing Information in BGP
- Two forms:
  - Announcements: indicate that a router has either learned of a new network attachment or has made a policy decision to prefer a different route to a destination.
  - Withdrawals: sent when a router decides that a network is no longer reachable.
- The paper distinguishes between two kinds of withdrawal:
  - Explicit: associated with an actual withdrawal message
  - Implicit: an existing route is replaced by a new route
- A BGP update may contain multiple announcements and withdrawals.
- Ideally, routers should only generate routing updates for relatively infrequent policy changes and the addition of new physical networks.
- It has been found that BGP’s ASPATH mechanism is not sufficient to ensure network convergence.
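
The ASPATH loop check described above can be sketched in a few lines. This is a minimal illustration, not a BGP implementation: the function names and the AS numbers in the usage note are hypothetical, and the path is modelled as a plain list with the most recent AS first.

```python
from typing import List, Optional


def is_loop_free(local_as: int, as_path: List[int]) -> bool:
    """A route is acceptable only if our own AS number is not already in its path."""
    return local_as not in as_path


def propagate(local_as: int, as_path: List[int]) -> Optional[List[int]]:
    """Add our AS number to the path before re-announcing the route.

    Returns None when the route already contains local_as, i.e. accepting it
    would create a routing loop, so it is dropped instead of re-announced.
    """
    if not is_loop_free(local_as, as_path):
        return None
    # Assumption for this sketch: the newest AS is placed at the front of the path.
    return [local_as] + as_path
```

For example, `propagate(65001, [65002, 65003])` yields a three-element path, while `propagate(65002, [65001, 65002, 65003])` returns `None` because AS 65002 sees itself in the path. The check prevents loops, but as the last bullet notes it does not by itself guarantee convergence.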

Methodology of Studies
- Geographically diverse exchange points.
- Although the route servers do not forward network traffic, they do peer with over 90% of the service providers at each exchange point.

Route Tracker Architecture
- Developed on Sun workstations.
- Uses the MRT and IPMA toolkits to analyze BGP updates.

“Internet Routing Instability”
- Monitored BGP updates generated by five service provider backbone routers at the major U.S. public exchange points over a period of nine months.
- The paper distinguishes three types of updates:
  - Forwarding instability: may reflect legitimate topological changes and affects the paths on which data will be forwarded.
  - Routing policy fluctuation: reflects changes in routing policy information that do not affect forwarding paths.
  - Pathological: redundant BGP updates that reflect neither routing nor forwarding instability.
- Instability is defined as an instance of either forwarding instability or policy fluctuation.
- The data reflects the stability of inter-domain Internet routing, that is, changes in topology or policy among AS’s.
- “Intra-domain routing instability is not explicitly measured and is only indirectly observed through BGP information exchanged with a domain’s peer.”

Results of Study
- The number of BGP updates exchanged per day in the Internet core is one or more orders of magnitude larger than expected.
- Routing information is dominated by pathological, or redundant, updates, which may not reflect changes in routing policy or topology.
- Instability and redundant updates exhibit a specific periodicity of 30 and 60 seconds.
- Instability and redundant updates show a surprising correlation to network usage and exhibit corresponding daily and weekly cyclic trends.
- Instability is not dominated by a small set of autonomous systems or routes.

Results of Study (2)
- Instability and redundant updates exhibit both strong high-frequency and low-frequency components.
- Much of the high-frequency instability is pathological.
- Discounting the contribution of redundant updates, the majority (over 80%) of Internet routes exhibit a high degree of stability.
- This work has led to specific architectural and protocol changes in commercial Internet routers through collaboration with vendors.

Methodology of Study (2)
- 12 Gb of data starting in January ’96.
- Uses several tools from XYZ toolkit.
- Focuses on the largest exchange, Mae-East.
- Data verified against BGP backbone logs from a number of large service providers.

More Background
- Problems of network topology fluctuation (non-convergence):
  - packets get dropped
  - packets are delivered out of order
- Internet routers of the day were based on a route-caching architecture:
  - Each interface card maintains a cache of destination and next-hop lookups.
  - If the destination is found in the cache, the packet is switched on a CPU-independent “fast path.”
- Sustained levels of instability increase the probability of a packet encountering a cache miss, which leads to:
  - increased load on the CPU
  - increased switching latency
  - dropped or lost packets
  - queuing delay, preventing timely routing of Keep-Alive packets
- Newer generations of routers that do not require caching and can hold the full routing table in memory do not exhibit the same pathological loss under heavy routing updates.

Route Flap Storms
- A failed router can instigate a “route flap storm.”
- This pathological oscillation causes overloaded routers to be marked as unreachable because they miss the required interval for Keep-Alive transmissions.
- Peers of the failed router find alternative paths for the destinations it previously served and transmit updates.
- After the failed router recovers, it re-initiates BGP peering sessions with its peers, transmits large state dumps, and causes more routers to fail.
- “Route flap storms” in 1996 caused extended outages for several million network customers.
- Newer generations of routers provide a mechanism for giving BGP and Keep-Alive messages higher priority.

Battling Routing Instability
- Route aggregation (supernetting):
  - combines a number of smaller IP prefixes into a single, less specific route announcement
  - reduces the overall number of networks visible in the core Internet
  - fails under multi-homing (when end sites have redundant connections to the Internet via multiple service providers)
  - In 1996, more than 25% (and growing) of prefixes were multi-homed and therefore non-aggregatable.
- Deployment of route dampening algorithms:
  - “hold down” updates that exceed certain parameters (e.g. a quota of updates per hour)
  - can introduce artificial connectivity problems, as “legitimate” announcements are delayed

Problems
- The Internet continues to exhibit high levels of routing instability despite the increased emphasis on aggregation and route dampening.
- Internet topology is growing less hierarchical with the addition of new exchange points and peering relationships.
- The behavior and dynamics of Internet routing stability had gone mostly without formal study prior to the publication of the paper. Little was known!

Observations
- Disproportionalism:
  - 42,000 Internet prefixes
  - 1300 autonomous systems
  - 1500 unique ASPATHs
  - 3-6 million routing updates per day
  - 125 updates per network per day
- At times, 100 prefix announcements per second.
- The count once exceeded 30 million updates; the monitor crashed!
- This is a problem for all but the most high-end commercial routers, and even they exhibit problems.
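
As a quick sanity check on the Observations figures, the per-prefix update rate can be recomputed from the quoted totals. This is only back-of-the-envelope arithmetic on the slide's numbers; the variable names are illustrative and "network" is assumed here to mean one of the 42,000 prefixes.

```python
PREFIXES = 42_000                       # Internet prefixes quoted on the slide
UPDATES_PER_DAY = (3_000_000, 6_000_000)  # quoted daily range of routing updates
SECONDS_PER_DAY = 24 * 60 * 60

# Average updates per prefix per day at each end of the quoted range.
low, high = (updates / PREFIXES for updates in UPDATES_PER_DAY)

# What the quoted burst rate of 100 announcements/sec would total if sustained all day.
burst_total = 100 * SECONDS_PER_DAY

print(f"{low:.1f} to {high:.1f} updates per prefix per day")
print(f"100/sec sustained = {burst_total:,} updates per day")
```

The quoted figure of 125 updates per network per day falls inside the computed range, and a sustained burst rate of 100 announcements per second would by itself exceed the upper end of the typical daily total.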

Classification of BGP Updates
- WADiff: a route is explicitly withdrawn as it becomes unreachable and is later replaced with an alternative route to the same destination; forwarding instability.
- AADiff: a route is implicitly withdrawn and replaced by an alternative route as the original route becomes unreachable or a preferred alternative path becomes available; forwarding instability.
- WADup: a route is explicitly withdrawn and then reannounced as reachable. This may reflect a transient topological (link or router) failure, or it may represent a pathological oscillation; forwarding instability or pathological behavior (see next slide).
- All are considered to be instability.

Classification of Pathological Behavior (Redundant Updates)
- AADup: a route is implicitly withdrawn and replaced with a duplicate of the original route (a router should only send an update for a change in topology).
- WWDup: the repeated transmission of BGP withdrawals for a prefix that is currently unreachable.
- All are considered to be pathological instability.
- Pathological updates may have a minimal impact on the performance of the Internet.

Expected Instability
- Problems affecting aggregation into supernets:
  - multi-homing
  - initial lack of hierarchical IP address space allocation
  - reluctance to renumber IP addresses
- Result: a large number of globally visible addresses.
- Each globally visible address is reachable by one or more paths.
- You would expect Internet instability to be proportional to the total number of available paths to all globally visible network addresses or aggregates.

Mae-East Routing Updates
- Most WWDup withdrawals are transmitted by routers belonging to AS’s that never previously announced reachability for the withdrawn prefixes.
- On average, 500,000 to 6 million pathological withdrawals per day.

Update Totals per ISP on a Given Day
- Many of the exchange point routers withdraw an order of magnitude more routes than they announce during a given day.
- Provider I shows the disproportionate effect that a single service provider can have on the global routing mesh.

More Observations
- Guess what: there is a strong causal relationship between the manufacturer of the router used by an ISP and the ISP’s exhibited level of pathological BGP behavior.
- Routing updates have a regular, specific periodicity, usually either 30 or 60 seconds.
- The persistence of instability is the duration of time that routing information fluctuates before it stabilizes.

Origins of Routing Pathologies
- Some pathological withdrawals can be attributed to implementation decisions:
  - a time-space trade-off in not maintaining the state of advertisements
  - stateless BGP = O(N*U) updates
- Presentation of the results led a router vendor to update its software to keep partial state.
- Stateless BGP contributes an insignificant number of updates and does not account for the oscillating behavior of WWDup and AADup updates.
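
The time-space trade-off behind stateless BGP can be illustrated with a small sketch. It assumes one reading of O(N*U): N is the number of peers and U the number of withdrawn prefixes, and a stateless router forwards every withdrawal to every peer because it does not remember what it advertised. The function names, peer names, and prefixes are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[str, str]  # (peer, withdrawn prefix)


def stateless_withdrawals(peers: List[str], withdrawn: List[str]) -> List[Message]:
    """No per-peer memory: every withdrawal goes to every peer, N * U messages."""
    return [(peer, prefix) for prefix in withdrawn for peer in peers]


def stateful_withdrawals(advertised: Dict[str, Set[str]],
                         withdrawn: List[str]) -> List[Message]:
    """Per-peer memory: withdraw a prefix only from peers it was announced to."""
    return [(peer, prefix)
            for prefix in withdrawn
            for peer, sent in advertised.items()
            if prefix in sent]


# Hypothetical example: three peers, only peer "A" was ever sent 10.0.0.0/8.
advertised = {"A": {"10.0.0.0/8"}, "B": set(), "C": set()}
withdrawn = ["10.0.0.0/8", "192.0.2.0/24"]

stateless = stateless_withdrawals(list(advertised), withdrawn)
stateful = stateful_withdrawals(advertised, withdrawn)
```

In this example the stateless version emits six withdrawals, five of them for prefixes the receiving peer never heard announced, which matches the WWDup pattern. The stateful version emits one, at the cost of remembering what was advertised to each peer.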

Origins of Routing Pathologies (2)
- Single-homed, stateless peer routers should result in at most O(N) updates, but instead:
  - each legitimate withdrawal seemed to induce some type of short-lived pathological network oscillation
  - the persistence of these updates is between 1 and 5 minutes
- Periodic routing instability may be caused by:
  - inadvertent synchronization of update transmission
  - improper configuration of the interaction between IGP and BGP (the conversion is lossy)
- Internet routing instability still remains poorly understood.

Forwarding Instability
- Instability density.
- Black squares are above a particular threshold (the mean of the detrended data): 345 updates in March, 770 in September.

Forwarding Instability (2)
- A week of raw forwarding instability.
- Little instability over the weekend.

Forwarding Instability (3)
- Time series analyses (FFT and MEM spectral estimation) validate the results.
- Routing instability corresponds closely to trends in Internet bandwidth usage and packet loss (intuitively obvious?).
- Rigorous justification of network usage equating to routing instability is problematic due to the size and heterogeneity of the Internet.

Fine-grained Instability Stats.
- No single AS consistently dominates the instability statistics.
- There is no correlation between the size of an AS (the number of routes it is responsible for in the table) and its proportion of the instability statistics.
- No small set of paths or prefixes dominates the instability statistics; instability is evenly distributed across routes.

Fine-grained Instability Stats. (2), (3)
- Internet routing tables are dominated by 6-8 ISPs.
- Over the course of the month, their share of the default-free routing tables did not change significantly.

Fine-grained Instability Stats. (4)
- 80-100% of the daily instability is contributed by Prefix+AS pairs announced fewer than 50 times.
- (a) ISP A announced seven routes between 630 and 650 times with no withdrawals.

Fine-grained Instability Stats. (5)
- 80-100% of the daily instability is contributed by Prefix+AS pairs announced fewer than 50 times.
- (c) ISP A announced seven routes between 630 and 650 times with no withdrawals.

Fine-grained Instability Stats. (6)
- (a) 20-90% of AADiff events are contributed by routes that changed 10 times or fewer.
- No single route consistently dominates the instability measured.
- On some days, a single Prefix+AS pair contributes substantially (40%); this accounts for the lowest curve in (a) (ISP A).
- WADiff climbs to a plateau of about 95% faster than the other three categories.
- WADiff has the fewest Prefix+AS pairs that dominate their days.
  - Comforting, since this category probably best represents topological instability.
- Investigation on prefix alone provided similar results.

Temporal Properties of Instability Statistics
- Update frequency distributions for instability events at the Prefix+AS level.
- Update frequency is the inverse of the inter-arrival time between routing updates; a higher frequency corresponds to a shorter inter-arrival time.
- Other work has been able to capture the lower frequencies through both routing table snapshots and end-to-end techniques.

Temporal Properties of Instability Statistics (2), (3)
- Histogram distribution captured in 30-second and 1-minute bins.
- You would expect a Poisson distribution reflecting exogenous events, such as power outages, fiber cuts, and other natural and human events.
- The 30-second periodicity suggests a widespread systematic influence in origin.

Conclusions
- Routing instability can have a significant deleterious impact on the Internet infrastructure.
- The majority (99%) of routing information is pathological and may not reflect real network topological changes.
- Instability is well distributed across AS’s and prefix space.
- Instability and redundant routing information exhibit a strong periodicity (of unknown origin).

Conclusions (2)
- Proportion of Internet routes affected by routing updates.

Conclusions (3)
- Current trends in the evolution of the Internet may have a significant impact on routing instability and the future performance of the network.
- 25% of networks are multi-homed, and the growth rate is about linear.
- Proliferation of exchange points is leading to a less hierarchical Internet.
- This research helps characterize the effect of added topological complexity since the end of the NSFNet backbone.

“Origins of Internet Routing Instability”
- 28 months gathering data from more than 40 commercial routers, switches, and Unix-based PC routers.
- Also collected IBGP information at the state of Michigan’s public Internet backbone, MichNet.
- Maintains that routing instability remains well distributed across prefix and AS space, but that instability is not related to prefix length.
- Since the previous paper’s work, the volume of inter-domain routing messages in the Internet core has decreased by an order of magnitude.

Research Pays Off
- Number of BGP updates almost doubled in 28 months.
- The number of announcements per day eventually (finally) surpassed the number of withdrawals at Mae-East.
- On average, across the backbone, exchange point routers generated only half as many withdrawals as announcements.

New Routing Update Categories
- We still have AADiff, AADup, and WWDup, but we add:
  - Tup and Tdown: fluctuation in the reachability of a given prefix. An announced route is withdrawn and transitions down, or a currently unreachable prefix is announced as reachable and transitions up.

Breakdown of BGP Updates
- Tup is roughly equal to Tdown: connection recovery (good!).
- Fluctuation in prefix reachability accounts for over 40% of all non-WWDup BGP traffic.
- After January ’98, AADup comprised the largest category of updates.

Analysis of AADiffs
- 90% of MED oscillations involve only two large ISPs, a product of their specific routing policies.
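
The second paper's update categories (Tup, Tdown, AADiff, AADup, WWDup) can be sketched as a small state machine keyed on (peer, prefix). This is an illustrative simplification, not the papers' tooling: the class and method names are hypothetical, a route is any hashable value such as an ASPATH tuple, and a prefix never seen before is treated as currently unreachable.

```python
from typing import Dict, Hashable, Optional, Tuple

Key = Tuple[str, str]  # (peer, prefix)


class UpdateClassifier:
    """Tracks the last route heard per (peer, prefix) and labels each new update."""

    def __init__(self) -> None:
        # None (or a missing key) means the prefix is currently unreachable.
        self._current: Dict[Key, Optional[Hashable]] = {}

    def classify(self, peer: str, prefix: str, route: Optional[Hashable]) -> str:
        """Label one update; route=None represents a withdrawal."""
        key = (peer, prefix)
        previous = self._current.get(key)
        self._current[key] = route
        if route is None:
            # Withdrawing a reachable prefix is Tdown; repeating it is WWDup.
            return "Tdown" if previous is not None else "WWDup"
        if previous is None:
            return "Tup"      # unreachable prefix announced as reachable
        # Announcement over an existing route: duplicate or different.
        return "AADup" if route == previous else "AADiff"
```

Feeding it an announcement, the same announcement again, a different route, and then two withdrawals for one prefix yields Tup, AADup, AADiff, Tdown, WWDup in that order. A withdrawal from a peer that never announced the prefix is also labelled WWDup here, which corresponds to the behavior reported for Mae-East.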

Dynamically Mapped MED
- AS2 always wants traffic flowing from AS3 to AS1 to take the shortest path through its network. Instead of setting the MED value via static configuration rules, AS2 dynamically maps the IGP distance between R5 and R3, and between R5 and R4, to the MED attribute value associated with route advertisements from routers R3 and R4 to AS1.
- AS2 influences AS1, which wants to reach Network A.
- AS1 will prefer the route via R4.

More Results

Conclusions
- Improvement:
  - Routing update messages reduced by an order of magnitude
  - Pathological withdrawals suppressed
- Instability is still well distributed across AS and prefix space.
- More bugs in router software led to anomalies.

“Experimental Study of Internet Stability and Wide-Area Backbone Failures”
Conclusions
- The Internet has proven remarkably robust.
- A small number of routes contribute to overall unavailability.
- 40% of routes exhibit multiple failures.
- Outages lasting longer than two hours usually represent long-term outages requiring significant engineering effort to repair.
- BGP failures must stem from non-hardware/software sources, probably TCP characteristics.